How bad data keeps us from good AI | Mainak Mazumdar

48,842 views ・ 2021-03-05

TED



00:00
Transcriber: Leslie Gauthier Reviewer: Joanna Pietrulewicz

00:13
AI could add 16 trillion dollars to the global economy in the next 10 years. This economy is not going to be built by billions of people or millions of factories, but by computers and algorithms.

00:29
We have already seen amazing benefits of AI in simplifying tasks, bringing efficiencies and improving our lives. However, when it comes to fair and equitable policy decision-making, AI has not lived up to its promise. AI is becoming a gatekeeper to the economy, deciding who gets a job and who gets an access to a loan. AI is only reinforcing and accelerating our bias at speed and scale with societal implications.

01:07
So, is AI failing us? Are we designing these algorithms to deliver biased and wrong decisions? As a data scientist, I'm here to tell you, it's not the algorithm, but the biased data that's responsible for these decisions. To make AI possible for humanity and society, we need an urgent reset. Instead of algorithms, we need to focus on the data.

01:36
We're spending time and money to scale AI at the expense of designing and collecting high-quality and contextual data. We need to stop the data, or the biased data that we already have, and focus on three things: data infrastructure, data quality and data literacy.

01:57
In June of this year, we saw embarrassing bias in the Duke University AI model called PULSE, which enhanced a blurry image into a recognizable photograph of a person. This algorithm incorrectly enhanced a nonwhite image into a Caucasian image. African-American images were underrepresented in the training set, leading to wrong decisions and predictions.
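
That kind of underrepresentation can often be caught before training with a simple audit that compares each group's share of the training set against a reference population share. The sketch below is illustrative only -- it is not the PULSE team's pipeline, and the group labels and shares are hypothetical:

```python
from collections import Counter

def representation_report(group_labels, reference_shares, tolerance=0.5):
    """Compare each group's share of a training set to a reference share
    and flag groups whose observed share falls below tolerance * expected."""
    counts = Counter(group_labels)
    total = sum(counts.values())
    report = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        report[group] = {
            "observed_share": round(observed, 4),
            "expected_share": expected,
            "underrepresented": observed < tolerance * expected,
        }
    return report

# Hypothetical example: labels and shares are illustrative, not real data.
labels = ["white"] * 800 + ["black"] * 50 + ["asian"] * 100 + ["other"] * 50
print(representation_report(labels, {"white": 0.60, "black": 0.13,
                                     "asian": 0.06, "other": 0.21}))
```

Running a check like this on the training set would flag the imbalance the speaker describes before the model is ever trained.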

02:28
Probably this is not the first time you have seen an AI misidentify a Black person's image. Despite an improved AI methodology, the underrepresentation of racial and ethnic populations still left us with biased results. This research is academic, however, not all data biases are academic. Biases have real consequences.

02:54
Take the 2020 US Census. The census is the foundation for many social and economic policy decisions, therefore the census is required to count 100 percent of the population in the United States. However, with the pandemic and the politics of the citizenship question, undercounting of minorities is a real possibility. I expect significant undercounting of minority groups who are hard to locate, contact, persuade and interview for the census. Undercounting will introduce bias and erode the quality of our data infrastructure.

03:36
Let's look at undercounts in the 2010 census. 16 million people were omitted in the final counts. This is as large as the total population of Arizona, Arkansas, Oklahoma and Iowa put together for that year. We have also seen about a million kids under the age of five undercounted in the 2010 Census.

03:59
Now, undercounting of minorities is common in other national censuses, as minorities can be harder to reach, they're mistrustful towards the government or they live in an area under political unrest. For example, the Australian Census in 2016 undercounted Aboriginals and Torres Strait populations by about 17.5 percent. We estimate undercounting in 2020 to be much higher than 2010, and the implications of this bias can be massive.

04:36
Let's look at the implications of the census data. Census is the most trusted, open and publicly available rich data on population composition and characteristics. While businesses have proprietary information on consumers, the Census Bureau reports definitive, public counts on age, gender, ethnicity, race, employment, family status, as well as geographic distribution, which are the foundation of the population data infrastructure.

05:10
When minorities are undercounted, AI models supporting public transportation, housing, health care, insurance are likely to overlook the communities that require these services the most.

05:23
First step to improving results is to make that database representative of age, gender, ethnicity and race per census data.
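
The talk does not spell out how that is done, but a common way to make a database representative of census demographics is post-stratification weighting: each record gets a weight equal to its group's census share divided by its share of the sample. A minimal sketch, with made-up shares and a single stratification variable:

```python
from collections import Counter

def poststratification_weights(records, key, census_shares):
    """Weight each record so the weighted distribution of `key`
    matches the census shares (one weight per record)."""
    counts = Counter(r[key] for r in records)
    n = len(records)
    sample_share = {cat: counts[cat] / n for cat in counts}
    return [census_shares[r[key]] / sample_share[r[key]] for r in records]

# Made-up example: group B is underrepresented in the sample.
sample = [{"ethnicity": "A"}] * 70 + [{"ethnicity": "B"}] * 30
weights = poststratification_weights(sample, "ethnicity",
                                     {"A": 0.55, "B": 0.45})
# Each group B record gets weight 0.45 / 0.30 = 1.5, restoring its census share.
```

In practice the weighting would be done jointly over age, gender, ethnicity and race (for example by raking), but the principle is the same.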

05:33
Since census is so important, we have to make every effort to count 100 percent. Investing in this data quality and accuracy is essential to making AI possible, not for only few and privileged, but for everyone in the society.

05:51
Most AI systems use the data that's already available or collected for some other purposes because it's convenient and cheap. Yet data quality is a discipline that requires commitment -- real commitment. This attention to the definition, data collection and measurement of the bias is not only underappreciated -- in the world of speed, scale and convenience, it's often ignored.

06:19
As part of Nielsen data science team, I went to field visits to collect data, visiting retail stores outside Shanghai and Bangalore. The goal of that visit was to measure retail sales from those stores. We drove miles outside the city, found these small stores -- informal, hard to reach.

06:40
And you may be wondering -- why are we interested in these specific stores? We could have selected a store in the city where the electronic data could be easily integrated into a data pipeline -- cheap, convenient and easy. Why are we so obsessed with the quality and accuracy of the data from these stores? The answer is simple: because the data from these rural stores matter.

07:07
According to the International Labour Organization, 40 percent Chinese and 65 percent of Indians live in rural areas. Imagine the bias in decision when 65 percent of consumption in India is excluded in models, meaning the decision will favor the urban over the rural.
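
A toy calculation shows the direction of that bias. When a model only ever sees urban data, whatever it estimates -- average spend, demand, price sensitivity -- is pulled entirely toward urban behavior. The figures below are invented purely to illustrate the mechanism:

```python
# Invented figures: average monthly spend for a handful of households.
urban_spend = [120, 150, 90, 200, 160]   # data the model is trained on
rural_spend = [40, 55, 30, 60, 45]       # data excluded from the model

# In this toy population rural households outnumber urban ones two to one.
true_population = urban_spend + 2 * rural_spend

model_estimate = sum(urban_spend) / len(urban_spend)        # 144.0
true_average = sum(true_population) / len(true_population)  # ~78.7
print(model_estimate, true_average)  # the urban-only estimate nearly doubles the truth
```

Any pricing or marketing decision tuned to the first number would miss most of the actual market.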

07:29
Without this rural-urban context and signals on livelihood, lifestyle, economy and values, retail brands will make wrong investments on pricing, advertising and marketing. Or the urban bias will lead to wrong rural policy decisions with regards to health and other investments.

07:52
Wrong decisions are not the problem with the AI algorithm. It's a problem of the data that excludes areas intended to be measured in the first place. The data in the context is a priority, not the algorithms.

08:09
Let's look at another example. I visited these remote, trailer park homes in Oregon state and New York City apartments to invite these homes to participate in Nielsen panels. Panels are statistically representative samples of homes that we invite to participate in the measurement over a period of time.
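
Panels like this are typically recruited by stratified sampling: homes are grouped into strata (for example by reception type or region) and drawn from each stratum in proportion to its population share, so hard-to-reach groups are included by design rather than by chance. The sketch below shows the general technique under that assumption -- it is not Nielsen's actual methodology, and the sampling frame is invented (its 15 percent over-the-air share simply mirrors the figure mentioned next):

```python
import random

def stratified_panel(frame, strata_key, panel_size, seed=0):
    """Draw a panel whose strata proportions match the sampling frame."""
    rng = random.Random(seed)
    strata = {}
    for home in frame:                       # group homes by stratum
        strata.setdefault(home[strata_key], []).append(home)
    panel = []
    for homes in strata.values():
        k = round(panel_size * len(homes) / len(frame))  # proportional allocation
        panel.extend(rng.sample(homes, min(k, len(homes))))
    return panel

# Invented frame: 850 cable homes and 150 over-the-air homes.
frame = ([{"id": i, "reception": "cable"} for i in range(850)] +
         [{"id": i, "reception": "over_the_air"} for i in range(850, 1000)])
panel = stratified_panel(frame, "reception", panel_size=100)
# Roughly 15 of the 100 panel homes are over-the-air, matching their share.
```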

08:30
Our mission to include everybody in the measurement led us to collect data from these Hispanic and African homes who use over-the-air TV reception to an antenna. Per Nielsen data, these homes constitute 15 percent of US households, which is about 45 million people. Commitment and focus on quality means we made every effort to collect information from these 15 percent, hard-to-reach groups.

09:03
Why does it matter? This is a sizeable group that's very, very important to the marketers, brands, as well as the media companies. Without the data, the marketers and brands and their models would not be able to reach these folks, as well as show ads to these very, very important minority populations.

09:24
And without the ad revenue, the broadcasters such as Telemundo or Univision would not be able to deliver free content, including news media, which is so foundational to our democracy. This data is essential for businesses and society.

09:44
Our once-in-a-lifetime opportunity to reduce human bias in AI starts with the data. Instead of racing to build new algorithms, my mission is to build a better data infrastructure that makes ethical AI possible. I hope you will join me in my mission as well.

10:05
Thank you.