How bad data keeps us from good AI | Mainak Mazumdar

48,347 views ・ 2021-03-05

TED


00:00
Transcriber: Leslie Gauthier Reviewer: Joanna Pietrulewicz
Translator (Chinese subtitles): Yip Yan Yeung Reviewer: Helen Chang

00:13
AI could add 16 trillion dollars to the global economy in the next 10 years. This economy is not going to be built by billions of people or millions of factories, but by computers and algorithms. We have already seen amazing benefits of AI in simplifying tasks, bringing efficiencies and improving our lives.

00:40
However, when it comes to fair and equitable policy decision-making, AI has not lived up to its promise. AI is becoming a gatekeeper to the economy, deciding who gets a job and who gets access to a loan. AI is only reinforcing and accelerating our bias at speed and scale, with societal implications.

01:07
So, is AI failing us? Are we designing these algorithms to deliver biased and wrong decisions?

01:16
As a data scientist, I'm here to tell you: it's not the algorithm but the biased data that's responsible for these decisions. To make AI possible for humanity and society, we need an urgent reset. Instead of algorithms, we need to focus on the data.

01:36
We're spending time and money to scale AI at the expense of designing and collecting high-quality and contextual data. We need to stop relying on the biased data we already have and focus on three things: data infrastructure, data quality and data literacy.

01:57
In June of this year, we saw embarrassing bias in the Duke University AI model called PULSE, which enhanced a blurry image into a recognizable photograph of a person. This algorithm incorrectly enhanced a nonwhite image into a Caucasian image. African-American images were underrepresented in the training set, leading to wrong decisions and predictions.

02:28
Probably this is not the first time you have seen an AI misidentify a Black person's image. Despite an improved AI methodology, the underrepresentation of racial and ethnic populations still left us with biased results.
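
To see how this kind of skew arises, here is a minimal sketch in Python, using synthetic data and a simple scikit-learn classifier rather than the PULSE model itself. When one class, standing in for an underrepresented group, supplies only about 2 percent of the training examples, the model's errors concentrate on that class; rebalancing or reweighting the training data is one data-side fix.

```python
# Toy illustration (synthetic data, not PULSE): a class that is rare in the
# training set is predicted far less reliably than the common class.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# 20,000 synthetic samples: 98% belong to class 0, only 2% to class 1.
X, y = make_classification(
    n_samples=20_000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Recall for the rare class is typically much lower than for the common class.
print(classification_report(y_test, model.predict(X_test), digits=2))

# One data-side remedy: reweight (or rebalance) so the rare class counts more.
balanced = LogisticRegression(max_iter=1000, class_weight="balanced")
balanced.fit(X_train, y_train)
print(classification_report(y_test, balanced.predict(X_test), digits=2))
```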

02:45
This research is academic; however, not all data biases are academic. Biases have real consequences. Take the 2020 US Census.

02:58
The census is the foundation for many social and economic policy decisions, therefore the census is required to count 100 percent of the population in the United States. However, with the pandemic and the politics of the citizenship question, undercounting of minorities is a real possibility. I expect significant undercounting of minority groups who are hard to locate, contact, persuade and interview for the census.

03:29
Undercounting will introduce bias and erode the quality of our data infrastructure. Let's look at undercounts in the 2010 census. 16 million people were omitted in the final counts. This is as large as the total population of Arizona, Arkansas, Oklahoma and Iowa put together for that year. We have also seen about a million kids under the age of five undercounted in the 2010 Census.

03:59
Now, undercounting of minorities is common in other national censuses, as minorities can be harder to reach, they're mistrustful towards the government or they live in an area under political unrest. For example, the Australian Census in 2016 undercounted Aboriginal and Torres Strait Islander populations by about 17.5 percent. We estimate undercounting in 2020 to be much higher than in 2010, and the implications of this bias can be massive.

04:36
Let's look at the implications of the census data. The census is the most trusted, open and publicly available rich data on population composition and characteristics. While businesses have proprietary information on consumers, the Census Bureau reports definitive, public counts on age, gender, ethnicity, race, employment, family status, as well as geographic distribution, which are the foundation of the population data infrastructure. When minorities are undercounted, AI models supporting public transportation, housing, health care and insurance are likely to overlook the communities that require these services the most.
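
To make the stakes concrete, here is a small worked example with hypothetical numbers (the funding rate, population and undercount are illustrative only), assuming services are funded at a fixed rate per counted resident: the community's shortfall is exactly proportional to its undercount.

```python
# Worked example with hypothetical numbers: funding allocated per *counted*
# resident shortchanges an undercounted community in direct proportion
# to the undercount.
RATE_PER_PERSON = 100.0        # dollars of service funding per counted resident (illustrative)

true_population = 1_000_000
counted_population = 900_000   # a 10 percent undercount

funded = counted_population * RATE_PER_PERSON
needed = true_population * RATE_PER_PERSON
shortfall = needed - funded

print(f"Funded:    ${funded:,.0f}")
print(f"Needed:    ${needed:,.0f}")
print(f"Shortfall: ${shortfall:,.0f} ({shortfall / needed:.0%} of actual need)")
```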

05:23
The first step to improving results is to make that database representative of age, gender, ethnicity and race per census data. Since the census is so important, we have to make every effort to count 100 percent. Investing in this data quality and accuracy is essential to making AI possible, not only for the few and privileged, but for everyone in society.

05:51
Most AI systems use data that's already available or collected for some other purpose, because it's convenient and cheap. Yet data quality is a discipline that requires commitment -- real commitment. This attention to the definition, data collection and measurement of bias is not only underappreciated -- in the world of speed, scale and convenience, it's often ignored.

06:19
As part of the Nielsen data science team, I went on field visits to collect data, visiting retail stores outside Shanghai and Bangalore. The goal of that visit was to measure retail sales from those stores. We drove miles outside the city and found these small stores -- informal, hard to reach. And you may be wondering: why are we interested in these specific stores? We could have selected a store in the city where the electronic data could be easily integrated into a data pipeline -- cheap, convenient and easy. Why are we so obsessed with the quality and accuracy of the data from these stores? The answer is simple: because the data from these rural stores matter.

07:07
According to the International Labour Organization, 40 percent of Chinese and 65 percent of Indians live in rural areas. Imagine the bias in decision-making when 65 percent of consumption in India is excluded from the models, meaning the decisions will favor the urban over the rural. Without this rural-urban context and signals on livelihood, lifestyle, economy and values, retail brands will make wrong investments in pricing, advertising and marketing. Or the urban bias will lead to wrong rural policy decisions with regard to health and other investments.

07:52
Wrong decisions are not the problem with the AI algorithm. It's a problem of the data that excludes areas intended to be measured in the first place. The data in context is the priority, not the algorithms.

08:09
Let's look at another example. I visited remote trailer park homes in Oregon state and New York City apartments to invite these households to participate in Nielsen panels. Panels are statistically representative samples of homes that we invite to participate in the measurement over a period of time.
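
As a rough illustration of what "statistically representative" involves, here is a minimal sketch of post-stratification weighting with made-up numbers; it is a common survey technique, not Nielsen's actual methodology. Panel homes are weighted so that the weighted panel matches group shares known from a census or other benchmark.

```python
# Minimal sketch of post-stratification weighting (made-up numbers, not
# Nielsen's methodology): weight each panel home so the weighted panel
# matches known population shares for each group.
population_share = {           # share of all households in each group
    "over_the_air_tv": 0.15,   # e.g., homes relying on an antenna
    "cable_or_streaming": 0.85,
}

panel_counts = {               # homes actually recruited into the panel
    "over_the_air_tv": 60,     # hard-to-reach group, under-recruited
    "cable_or_streaming": 940,
}

total_panel = sum(panel_counts.values())

# weight = population share / panel share; a weight above 1 means the group
# is underrepresented in the panel, so each such home "counts for more".
weights = {
    group: population_share[group] / (panel_counts[group] / total_panel)
    for group in panel_counts
}

for group, w in weights.items():
    print(f"{group}: panel share {panel_counts[group] / total_panel:.1%}, "
          f"population share {population_share[group]:.0%}, weight {w:.2f}")

# Estimates are then weighted averages over panel homes, so the hard-to-reach
# group is not drowned out by the easier-to-reach majority.
```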

08:30
Our mission to include everybody in the measurement led us to collect data from these Hispanic and African-American homes that use over-the-air TV reception with an antenna. Per Nielsen data, these homes constitute 15 percent of US households, which is about 45 million people. Commitment and focus on quality means we made every effort to collect information from these 15 percent, hard-to-reach groups.

09:03
Why does it matter? This is a sizeable group that's very, very important to the marketers, brands, as well as the media companies. Without the data, the marketers and brands and their models would not be able to reach these folks or show ads to these very, very important minority populations. And without the ad revenue, broadcasters such as Telemundo or Univision would not be able to deliver free content, including news media, which is so foundational to our democracy. This data is essential for businesses and society.

09:44
Our once-in-a-lifetime opportunity to reduce human bias in AI starts with the data. Instead of racing to build new algorithms, my mission is to build a better data infrastructure that makes ethical AI possible. I hope you will join me in my mission as well.

10:05
Thank you.