How bad data keeps us from good AI | Mainak Mazumdar

48,347 views ・ 2021-03-05

TED



00:00
Transcriber: Leslie Gauthier · Reviewer: Joanna Pietrulewicz
Chinese translation: C Leung · Review: 麗玲 辛
00:13
AI could add 16 trillion dollars to the global economy in the next 10 years. This economy is not going to be built by billions of people or millions of factories, but by computers and algorithms. We have already seen amazing benefits of AI in simplifying tasks, bringing efficiencies and improving our lives.

00:40
However, when it comes to fair and equitable policy decision-making, AI has not lived up to its promise. AI is becoming a gatekeeper to the economy, deciding who gets a job and who gets access to a loan. AI is only reinforcing and accelerating our bias, at speed and scale, with societal implications.

01:07
So, is AI failing us? Are we designing these algorithms to deliver biased and wrong decisions? As a data scientist, I'm here to tell you: it's not the algorithm but the biased data that's responsible for these decisions. To make AI possible for humanity and society, we need an urgent reset. Instead of algorithms, we need to focus on the data. We're spending time and money to scale AI at the expense of designing and collecting high-quality, contextual data. We need to stop using the biased data we already have and focus on three things: data infrastructure, data quality and data literacy.
01:57
In June of this year, we saw embarrassing bias in the Duke University AI model called PULSE, which enhanced a blurry image into a recognizable photograph of a person. This algorithm incorrectly enhanced a nonwhite image into a Caucasian image. African-American images were underrepresented in the training set, leading to wrong decisions and predictions. Probably this is not the first time you have seen an AI misidentify a Black person's image. Despite an improved AI methodology, the underrepresentation of racial and ethnic populations still left us with biased results. This research is academic; however, not all data biases are academic. Biases have real consequences.
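At its root, the PULSE episode was a training-set composition problem. As a rough, hedged sketch (not the PULSE team's actual pipeline), here is the kind of representation audit a data team could run before training; the group labels and benchmark shares are hypothetical:

```python
# Hypothetical representation audit: compare the demographic mix of a
# training set against a population benchmark before training a model.
# Group labels and benchmark shares below are illustrative, not real data.
from collections import Counter

def representation_gaps(train_labels, benchmark_shares):
    """Return each group's share in the training set minus its benchmark share."""
    counts = Counter(train_labels)
    total = sum(counts.values())
    gaps = {}
    for group, expected in benchmark_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        gaps[group] = observed - expected
    return gaps

# Toy example: group labels attached to each training image (assumed metadata).
train_labels = ["white"] * 800 + ["black"] * 60 + ["asian"] * 90 + ["hispanic"] * 50
benchmark = {"white": 0.60, "black": 0.13, "asian": 0.06, "hispanic": 0.19}

for group, gap in representation_gaps(train_labels, benchmark).items():
    flag = "UNDERREPRESENTED" if gap < -0.05 else "ok"
    print(f"{group:9s} gap {gap:+.2f}  {flag}")
```

A gap report like this does not fix bias by itself, but it makes underrepresentation visible before a model is trained on the data.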
02:54
Take the 2020 US Census. The census is the foundation for many social and economic policy decisions; therefore the census is required to count 100 percent of the population in the United States. However, with the pandemic and the politics of the citizenship question, undercounting of minorities is a real possibility. I expect significant undercounting of minority groups who are hard to locate, contact, persuade and interview for the census. Undercounting will introduce bias and erode the quality of our data infrastructure. Let's look at undercounts in the 2010 census. 16 million people were omitted in the final counts. This is as large as the total population of Arizona, Arkansas, Oklahoma and Iowa put together for that year. We have also seen about a million kids under the age of five undercounted in the 2010 Census.

03:59
Now, undercounting of minorities is common in other national censuses, as minorities can be harder to reach, they're mistrustful towards the government, or they live in an area under political unrest. For example, the Australian Census in 2016 undercounted Aboriginal and Torres Strait Islander populations by about 17.5 percent. We estimate undercounting in 2020 to be much higher than in 2010, and the implications of this bias can be massive.

04:36
Let's look at the implications of the census data. The census is the most trusted, open and publicly available rich data on population composition and characteristics. While businesses have proprietary information on consumers, the Census Bureau reports definitive, public counts on age, gender, ethnicity, race, employment, family status, as well as geographic distribution, which are the foundation of the population data infrastructure. When minorities are undercounted, AI models supporting public transportation, housing, health care and insurance are likely to overlook the communities that require these services the most.

05:23
The first step to improving results is to make that database representative of age, gender, ethnicity and race per census data. Since the census is so important, we have to make every effort to count 100 percent. Investing in this data quality and accuracy is essential to making AI possible, not only for the few and privileged, but for everyone in society.
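One way to make a working database "representative of age, gender, ethnicity and race per census data" is to reweight its records toward census benchmarks. A minimal post-stratification sketch, assuming hypothetical group shares and a single dimension (production workflows typically rake across several census dimensions at once):

```python
# Minimal post-stratification sketch: reweight records so group shares in a
# working dataset match census benchmark shares. Groups and shares are
# hypothetical; real census-based weighting usually rakes over several
# dimensions (age x gender x race x geography) together.
from collections import Counter

def poststratification_weights(records, group_key, census_shares):
    """Return one weight per record so weighted group shares match census_shares."""
    counts = Counter(r[group_key] for r in records)
    n = len(records)
    weights = []
    for r in records:
        g = r[group_key]
        sample_share = counts[g] / n
        weights.append(census_shares[g] / sample_share)  # up-weight underrepresented groups
    return weights

# Toy dataset where one group is underrepresented relative to the census.
records = [{"race": "white"}] * 70 + [{"race": "black"}] * 5 + [{"race": "other"}] * 25
census_shares = {"white": 0.60, "black": 0.13, "other": 0.27}

w = poststratification_weights(records, "race", census_shares)
print(round(sum(w), 2))  # total weight stays ~100.0 (the number of records)
print(round(w[70], 2))   # an underrepresented record gets weight 0.13 / 0.05 = 2.6
```

Reweighting only works if the census benchmarks themselves are trustworthy, which is exactly why an undercount erodes everything built on top of it.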
05:51
Most AI systems use the data that's already available or collected for some other purpose, because it's convenient and cheap. Yet data quality is a discipline that requires commitment -- real commitment. This attention to the definition, data collection and measurement of bias is not only underappreciated -- in the world of speed, scale and convenience, it's often ignored.

06:19
As part of the Nielsen data science team, I went on field visits to collect data, visiting retail stores outside Shanghai and Bangalore. The goal of those visits was to measure retail sales from those stores. We drove miles outside the city and found these small stores -- informal, hard to reach. And you may be wondering: why are we interested in these specific stores? We could have selected a store in the city, where the electronic data could be easily integrated into a data pipeline -- cheap, convenient and easy. Why are we so obsessed with the quality and accuracy of the data from these stores? The answer is simple: because the data from these rural stores matters.

07:07
According to the International Labour Organization, 40 percent of Chinese and 65 percent of Indians live in rural areas. Imagine the bias in decisions when 65 percent of consumption in India is excluded from the models, meaning the decisions will favor the urban over the rural. Without this rural-urban context and signals on livelihood, lifestyle, economy and values, retail brands will make wrong investments in pricing, advertising and marketing. Or the urban bias will lead to wrong rural policy decisions with regards to health and other investments. Wrong decisions are not the problem with the AI algorithm. It's a problem of the data that excludes areas intended to be measured in the first place. The data in context is the priority, not the algorithms.

08:09
Let's look at another example. I visited remote trailer-park homes in Oregon state and New York City apartments to invite these homes to participate in Nielsen panels. Panels are statistically representative samples of homes that we invite to participate in the measurement over a period of time. Our mission to include everybody in the measurement led us to collect data from Hispanic and African-American homes who use over-the-air TV reception with an antenna. Per Nielsen data, these homes constitute 15 percent of US households, which is about 45 million people. Commitment and focus on quality means we made every effort to collect information from these 15 percent, hard-to-reach groups.
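Panels of this kind are built by deliberate sampling rather than by taking whoever is easiest to reach. As an illustration only (not Nielsen's actual methodology), a sketch of proportional stratified sampling in which a small, hard-to-reach stratum keeps its population share:

```python
# Illustrative proportional stratified sampling: allocate panel slots to each
# stratum in proportion to its population share, so small or hard-to-reach
# groups are not silently dropped. Strata and counts are hypothetical.
import random

def stratified_sample(population_by_stratum, panel_size, seed=42):
    """Draw a panel with each stratum represented in proportion to its population."""
    rng = random.Random(seed)
    total = sum(len(homes) for homes in population_by_stratum.values())
    panel = {}
    for stratum, homes in population_by_stratum.items():
        k = round(panel_size * len(homes) / total)  # proportional allocation
        panel[stratum] = rng.sample(homes, min(k, len(homes)))
    return panel

# Toy frame: 15% of homes are over-the-air (antenna) households.
frame = {
    "cable_or_streaming": [f"home_{i}" for i in range(850)],
    "over_the_air": [f"ota_home_{i}" for i in range(150)],
}

panel = stratified_sample(frame, panel_size=100)
for stratum, homes in panel.items():
    print(stratum, len(homes))  # about 85 and 15: the 15% group keeps its share
```

The design choice here is the whole point of the story: the hard-to-reach stratum costs more to recruit, but cutting it would bias every model trained on the panel.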
09:03
Why does it matter? This is a sizeable group that's very, very important to the marketers, the brands, as well as the media companies. Without the data, the marketers, the brands and their models would not be able to reach these folks, or show ads to these very, very important minority populations. And without the ad revenue, broadcasters such as Telemundo or Univision would not be able to deliver free content, including the news media that is so foundational to our democracy. This data is essential for businesses and society.

09:44
Our once-in-a-lifetime opportunity to reduce human bias in AI starts with the data. Instead of racing to build new algorithms, my mission is to build a better data infrastructure that makes ethical AI possible. I hope you will join me in my mission as well.

10:05
Thank you.