How bad data keeps us from good AI | Mainak Mazumdar

TED ・ 2021-03-05

Transcriber: Leslie Gauthier
Reviewer: Joanna Pietrulewicz

AI could add 16 trillion dollars to the global economy in the next 10 years. This economy is not going to be built by billions of people or millions of factories, but by computers and algorithms. We have already seen amazing benefits of AI in simplifying tasks, bringing efficiencies and improving our lives. However, when it comes to fair and equitable policy decision-making, AI has not lived up to its promise.

AI is becoming a gatekeeper to the economy, deciding who gets a job and who gets access to a loan. AI is only reinforcing and accelerating our bias at speed and scale, with societal implications. So, is AI failing us? Are we designing these algorithms to deliver biased and wrong decisions?

As a data scientist, I'm here to tell you it's not the algorithm but the biased data that's responsible for these decisions. To make AI possible for humanity and society, we need an urgent reset. Instead of algorithms, we need to focus on the data. We're spending time and money to scale AI at the expense of designing and collecting high-quality and contextual data. We need to stop the data, or the biased data that we already have, and focus on three things: data infrastructure, data quality and data literacy.

In June of this year, we saw embarrassing bias in the Duke University AI model called PULSE, which enhanced a blurry image into a recognizable photograph of a person. This algorithm incorrectly enhanced a nonwhite image into a Caucasian image. African-American images were underrepresented in the training set, leading to wrong decisions and predictions.
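To make the failure mode concrete, here is a minimal sketch (not from the talk) of the kind of representation audit that would have flagged such a training set before any model was built. The group labels, counts and benchmark shares are placeholders, not real figures.

```python
# Minimal representation audit: compare each group's share of the training data
# against a reference population. All numbers below are hypothetical.
from collections import Counter

def representation_gaps(labels, benchmark_shares):
    """Return each group's share in the data minus its benchmark share."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        group: counts.get(group, 0) / total - share
        for group, share in benchmark_shares.items()
    }

if __name__ == "__main__":
    # Hypothetical demographic tags attached to training images.
    training_labels = ["white"] * 800 + ["black"] * 50 + ["asian"] * 100 + ["hispanic"] * 50
    # Hypothetical reference shares (for example, taken from census data).
    benchmark = {"white": 0.60, "black": 0.13, "asian": 0.06, "hispanic": 0.19}

    for group, gap in representation_gaps(training_labels, benchmark).items():
        flag = "UNDERREPRESENTED" if gap < -0.05 else "ok"
        print(f"{group:10s} gap vs. benchmark: {gap:+.2%}  {flag}")
```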
Probably this is not the first time you have seen an AI misidentify a Black person's image. Despite an improved AI methodology, the underrepresentation of racial and ethnic populations still left us with biased results. This research is academic; however, not all data biases are academic. Biases have real consequences.

Take the 2020 US Census. The census is the foundation for many social and economic policy decisions, therefore the census is required to count 100 percent of the population in the United States. However, with the pandemic and the politics of the citizenship question, undercounting of minorities is a real possibility. I expect significant undercounting of minority groups who are hard to locate, contact, persuade and interview for the census. Undercounting will introduce bias and erode the quality of our data infrastructure.

Let's look at undercounts in the 2010 census. 16 million people were omitted in the final counts. This is as large as the total population of Arizona, Arkansas, Oklahoma and Iowa put together for that year. We have also seen about a million kids under the age of five undercounted in the 2010 Census.

Now, undercounting of minorities is common in other national censuses, as minorities can be harder to reach, they're mistrustful towards the government or they live in an area under political unrest. For example, the Australian Census in 2016 undercounted Aboriginal and Torres Strait Islander populations by about 17.5 percent. We estimate undercounting in 2020 to be much higher than in 2010, and the implications of this bias can be massive.
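A rough arithmetic sketch (not from the talk) of why an undercount has real consequences: if public money is allocated in proportion to the official count, an undercounted group is shortchanged on a per-person basis. The 17.5 percent rate is the Australian figure above; the group sizes, the budget and the count-proportional allocation rule are assumptions for illustration only.

```python
# Per-person funding under a hypothetical count-proportional allocation.
# Group sizes and budget are invented; only the 17.5% undercount rate
# comes from the passage above.
def per_person_funding(true_pops, undercount_rates, budget):
    """Allocate a budget proportionally to counted population, then report
    what that works out to per actual person in each group."""
    counted = {g: true_pops[g] * (1 - undercount_rates.get(g, 0.0)) for g in true_pops}
    total_counted = sum(counted.values())
    return {g: budget * counted[g] / total_counted / true_pops[g] for g in true_pops}

if __name__ == "__main__":
    true_pops = {"fully counted group": 9_000_000, "undercounted group": 1_000_000}
    rates = {"fully counted group": 0.0, "undercounted group": 0.175}
    for group, dollars in per_person_funding(true_pops, rates, budget=1_000_000_000).items():
        print(f"{group:20s} gets about ${dollars:,.2f} per actual person")
```

On these made-up numbers, the undercounted group receives 17.5 percent less per actual person than its fully counted neighbors, even though the allocation formula treats every counted person identically.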
Let's look at the implications of the census data. The census is the most trusted, open and publicly available rich data on population composition and characteristics. While businesses have proprietary information on consumers, the Census Bureau reports definitive, public counts on age, gender, ethnicity, race, employment, family status, as well as geographic distribution, which are the foundation of the population data infrastructure. When minorities are undercounted, AI models supporting public transportation, housing, health care and insurance are likely to overlook the communities that require these services the most. The first step to improving results is to make that database representative of age, gender, ethnicity and race per census data. Since the census is so important, we have to make every effort to count 100 percent.

Investing in this data quality and accuracy is essential to making AI possible, not only for the few and privileged, but for everyone in society. Most AI systems use the data that's already available or collected for some other purposes because it's convenient and cheap. Yet data quality is a discipline that requires commitment -- real commitment. This attention to the definition, data collection and measurement of the bias is not only underappreciated -- in the world of speed, scale and convenience, it's often ignored.

As part of the Nielsen data science team, I went on field visits to collect data, visiting retail stores outside Shanghai and Bangalore. The goal of those visits was to measure retail sales from those stores. We drove miles outside the city and found these small stores -- informal, hard to reach. And you may be wondering: why are we interested in these specific stores? We could have selected a store in the city, where the electronic data could be easily integrated into a data pipeline -- cheap, convenient and easy. Why are we so obsessed with the quality and accuracy of the data from these stores? The answer is simple: because the data from these rural stores matter.

According to the International Labour Organization, 40 percent of Chinese and 65 percent of Indians live in rural areas. Imagine the bias in decisions when 65 percent of consumption in India is excluded from models, meaning the decisions will favor the urban over the rural.
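Here is a minimal sketch (not from the talk) of how that exclusion skews an estimate. The 65 percent rural share is the figure cited above; the per-capita spending values are invented placeholders.

```python
# Population-weighted average spend, with and without the rural segment.
# The 65% rural population share is from the passage above; spending
# figures are hypothetical.
def weighted_average_spend(segments):
    """Segments map name -> (population share, spend per person)."""
    total_share = sum(share for share, _ in segments.values())
    return sum(share * spend for share, spend in segments.values()) / total_share

if __name__ == "__main__":
    segments = {
        "urban": (0.35, 120.0),  # hypothetical monthly spend per person
        "rural": (0.65, 40.0),
    }
    urban_only = weighted_average_spend({"urban": segments["urban"]})
    full = weighted_average_spend(segments)
    print(f"Estimate from urban data only: {urban_only:.1f}")
    print(f"Estimate including rural data: {full:.1f}")
```

With these placeholder numbers the urban-only estimate comes out roughly 75 percent higher than the figure that includes rural consumption, so pricing, advertising or policy tuned to it would systematically miss the rural majority.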
Without this rural-urban context and signals on livelihood, lifestyle, economy and values, retail brands will make wrong investments in pricing, advertising and marketing. Or the urban bias will lead to wrong rural policy decisions with regard to health and other investments. Wrong decisions are not the problem with the AI algorithm. It's a problem of the data that excludes areas intended to be measured in the first place. The data in the context is a priority, not the algorithms.

Let's look at another example. I visited remote trailer park homes in Oregon state and New York City apartments to invite these homes to participate in Nielsen panels. Panels are statistically representative samples of homes that we invite to participate in the measurement over a period of time. Our mission to include everybody in the measurement led us to collect data from Hispanic and African American homes who use over-the-air TV reception with an antenna. Per Nielsen data, these homes constitute 15 percent of US households, which is about 45 million people. Commitment and focus on quality means we made every effort to collect information from these 15 percent, hard-to-reach groups.
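A minimal sketch (not Nielsen's actual methodology) of what "statistically representative" can mean in practice: a stratified draw that guarantees the hard-to-reach 15 percent shows up in the panel in its true proportion instead of being left to chance. The sampling frame, strata and panel size here are simulated assumptions; only the 15 percent share comes from the talk.

```python
# Stratified panel draw over a simulated frame of homes tagged by reception type.
import random

def stratified_sample(frame, strata_shares, panel_size, seed=7):
    """Draw a panel whose strata match the target shares."""
    rng = random.Random(seed)
    panel = []
    for stratum, share in strata_shares.items():
        homes = [h for h in frame if h["reception"] == stratum]
        panel += rng.sample(homes, round(panel_size * share))
    return panel

if __name__ == "__main__":
    rng = random.Random(1)
    # Simulated frame: about 15% over-the-air (antenna) homes, 85% cable/satellite.
    frame = [{"id": i, "reception": "over_the_air" if rng.random() < 0.15 else "cable"}
             for i in range(100_000)]
    panel = stratified_sample(frame, {"over_the_air": 0.15, "cable": 0.85}, panel_size=2_000)
    ota = sum(h["reception"] == "over_the_air" for h in panel)
    print(f"Panel of {len(panel)} homes includes {ota} over-the-air homes ({ota/len(panel):.0%})")
```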
Why does it matter? This is a sizeable group that's very, very important to the marketers and brands, as well as the media companies. Without the data, the marketers and brands and their models would not be able to reach these folks, or show ads to these very, very important minority populations. And without the ad revenue, broadcasters such as Telemundo or Univision would not be able to deliver free content, including news media, which is so foundational to our democracy. This data is essential for businesses and society.

Our once-in-a-lifetime opportunity to reduce human bias in AI starts with the data. Instead of racing to build new algorithms, my mission is to build a better data infrastructure that makes ethical AI possible. I hope you will join me in my mission as well.

Thank you.