How bad data keeps us from good AI | Mainak Mazumdar

48,448 views ・ 2021-03-05

TED



00:00
Transcriber: Leslie Gauthier. Reviewer: Joanna Pietrulewicz
00:13
AI could add 16 trillion dollars to the global economy in the next 10 years. This economy is not going to be built by billions of people or millions of factories, but by computers and algorithms. We have already seen amazing benefits of AI in simplifying tasks, bringing efficiencies and improving our lives.
00:40
However, when it comes to fair and equitable policy decision-making, AI has not lived up to its promise. AI is becoming a gatekeeper to the economy, deciding who gets a job and who gets access to a loan. AI is only reinforcing and accelerating our bias at speed and scale, with societal implications.
01:07
So, is AI failing us? Are we designing these algorithms to deliver biased and wrong decisions? As a data scientist, I'm here to tell you, it's not the algorithm, but the biased data that's responsible for these decisions.
01:25
To make AI possible for humanity and society, we need an urgent reset. Instead of algorithms, we need to focus on the data. We're spending time and money to scale AI at the expense of designing and collecting high-quality and contextual data. We need to stop the data, or the biased data that we already have, and focus on three things: data infrastructure, data quality and data literacy.
01:57
In June of this year, we saw embarrassing bias in the Duke University AI model called PULSE, which enhanced a blurry image into a recognizable photograph of a person. This algorithm incorrectly enhanced a nonwhite image into a Caucasian image. African-American images were underrepresented in the training set, leading to wrong decisions and predictions.
02:28
Probably this is not the first time you have seen an AI misidentify a Black person's image. Despite an improved AI methodology, the underrepresentation of racial and ethnic populations still left us with biased results. This research is academic; however, not all data biases are academic. Biases have real consequences.
02:54
Take the 2020 US Census. The census is the foundation for many social and economic policy decisions; therefore the census is required to count 100 percent of the population in the United States. However, with the pandemic and the politics of the citizenship question, undercounting of minorities is a real possibility. I expect significant undercounting of minority groups who are hard to locate, contact, persuade and interview for the census. Undercounting will introduce bias and erode the quality of our data infrastructure.
03:36
Let's look at undercounts in the 2010 census. 16 million people were omitted in the final counts. This is as large as the total population of Arizona, Arkansas, Oklahoma and Iowa put together for that year. We have also seen about a million kids under the age of five undercounted in the 2010 Census.
03:59
Now, undercounting of minorities is common in other national censuses, as minorities can be harder to reach, they're mistrustful towards the government or they live in an area under political unrest. For example, the Australian Census in 2016 undercounted Aboriginal and Torres Strait Islander populations by about 17.5 percent. We estimate undercounting in 2020 to be much higher than in 2010, and the implications of this bias can be massive.
04:36
Let's look at the implications of the census data. The census is the most trusted, open and publicly available rich data source on population composition and characteristics. While businesses have proprietary information on consumers, the Census Bureau reports definitive, public counts on age, gender, ethnicity, race, employment and family status, as well as geographic distribution, which are the foundation of the population data infrastructure.
05:10
When minorities are undercounted, AI models supporting public transportation, housing, health care and insurance are likely to overlook the communities that require these services the most. The first step to improving results is to make that database representative of age, gender, ethnicity and race per census data.
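One common way to make a dataset "representative per census data" is post-stratification weighting: each record gets a weight equal to its group's reference share divided by its share in the sample. The sketch below is only an illustration of that idea, not the speaker's method. The function name, the groups "A" and "B", and all of the shares are hypothetical placeholders, not real census figures.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Return one weight per record so that the weighted group shares
    match the reference shares (for example, shares taken from a census)."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    weights = []
    for group in sample_groups:
        sample_share = counts[group] / n
        weights.append(population_shares[group] / sample_share)
    return weights


# Hypothetical training set in which group "B" is underrepresented.
sample = ["A"] * 90 + ["B"] * 10
# Hypothetical reference shares standing in for census figures.
census_shares = {"A": 0.7, "B": 0.3}

weights = poststratification_weights(sample, census_shares)

# After weighting, group "B" carries its reference share of the total weight.
weighted_share_b = (
    sum(w for g, w in zip(sample, weights) if g == "B") / sum(weights)
)
print(round(weighted_share_b, 3))
```

Weighting of this kind can only rebalance records that were actually collected, and it inherits any error in the reference counts, which is why an undercount in the census itself matters so much.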
05:33
Since the census is so important, we have to make every effort to count 100 percent. Investing in this data quality and accuracy is essential to making AI possible, not only for the few and privileged, but for everyone in society.
05:51
Most AI systems use the data that's already available or collected for some other purposes because it's convenient and cheap. Yet data quality is a discipline that requires commitment -- real commitment. This attention to the definition, data collection and measurement of the bias is not only underappreciated -- in the world of speed, scale and convenience, it's often ignored.
06:19
As part of the Nielsen data science team, I went on field visits to collect data, visiting retail stores outside Shanghai and Bangalore. The goal of that visit was to measure retail sales from those stores. We drove miles outside the city, found these small stores -- informal, hard to reach.
06:40
And you may be wondering -- why are we interested in these specific stores? We could have selected a store in the city where the electronic data could be easily integrated into a data pipeline -- cheap, convenient and easy. Why are we so obsessed with the quality and accuracy of the data from these stores? The answer is simple: because the data from these rural stores matter.
07:07
According to the International Labour Organization, 40 percent of Chinese and 65 percent of Indians live in rural areas. Imagine the bias in decisions when 65 percent of consumption in India is excluded from models, meaning the decisions will favor the urban over the rural.
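The size of this kind of coverage bias can be shown with simple arithmetic. The sketch below is a toy illustration under stated assumptions: the 65 percent rural share is the figure quoted above, while the per-household spending numbers and the helper `weighted_mean` are hypothetical and chosen only to make the effect visible.

```python
def weighted_mean(values, shares):
    """Average of per-segment values, weighted by each segment's share."""
    total_share = sum(shares.values())
    return sum(values[k] * shares[k] for k in shares) / total_share


# Population shares: the rural share is the figure quoted in the talk.
shares = {"urban": 0.35, "rural": 0.65}
# Hypothetical average spend per household in each segment.
avg_spend = {"urban": 100.0, "rural": 40.0}

# Estimate that covers the whole population.
full_estimate = weighted_mean(avg_spend, shares)
# Estimate built only from the convenient urban stores.
urban_only_estimate = weighted_mean(avg_spend, {"urban": shares["urban"]})

# How far the urban-only estimate overstates the population average.
overestimate = urban_only_estimate - full_estimate
print(full_estimate, urban_only_estimate, overestimate)
```

No change to the model downstream can recover this gap, because the rural values never enter the data in the first place.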
07:29
Without this rural-urban context and signals on livelihood, lifestyle, economy and values, retail brands will make wrong investments in pricing, advertising and marketing. Or the urban bias will lead to wrong rural policy decisions with regard to health and other investments. Wrong decisions are not a problem with the AI algorithm. It's a problem of the data that excludes areas intended to be measured in the first place. The data in context is the priority, not the algorithms.
08:09
Let's look at another example. I visited these remote trailer park homes in Oregon state and New York City apartments to invite these homes to participate in Nielsen panels. Panels are statistically representative samples of homes that we invite to participate in the measurement over a period of time. Our mission to include everybody in the measurement led us to collect data from these Hispanic and African homes that use over-the-air TV reception through an antenna.
08:43
Per Nielsen data, these homes constitute 15 percent of US households, which is about 45 million people. Commitment and focus on quality means we made every effort to collect information from these 15 percent, hard-to-reach groups.
09:03
Why does it matter? This is a sizeable group that's very, very important to the marketers and brands, as well as the media companies. Without the data, the marketers and brands and their models would not be able to reach these folks or show ads to these very, very important minority populations. And without the ad revenue, broadcasters such as Telemundo or Univision would not be able to deliver free content, including news media, which is so foundational to our democracy. This data is essential for businesses and society.
09:44
Our once-in-a-lifetime opportunity to reduce human bias in AI starts with the data. Instead of racing to build new algorithms, my mission is to build a better data infrastructure that makes ethical AI possible. I hope you will join me in my mission as well. Thank you.