How AI Models Steal Creative Work — and What to Do About It | Ed Newton-Rex | TED

11,130 views ・ 2025-03-19

TED



00:04
The technology and vision behind generative AI is amazing,
0
4301
5072
00:09
but stealing the work of the world's creators to build it is not.
1
9373
3771
00:14
There are three key things that AI companies need to build their models,
2
14078
4972
00:19
three key resources -- people, compute and data.
3
19083
4271
00:24
That is, engineers to build the models,
4
24121
2937
00:27
GPUs to run the training process
5
27091
2102
00:29
and data to train the models on.
6
29226
2336
00:32
AI companies spend vast sums on the first two,
7
32696
5005
00:37
sometimes a million dollars per engineer
8
37735
2402
00:40
and up to a billion dollars per model.
9
40137
3337
00:43
But they expect to take the third resource, training data, for free.
10
43507
5439
00:48
Right now, many AI companies train on creative work they haven't paid for
11
48979
4772
00:53
or even asked permission to use.
12
53784
1869
00:56
This is unfair and unsustainable.
13
56320
2970
01:00
But if we reset, and license our training data,
14
60224
3270
01:03
we can build a better generative AI ecosystem that works for everyone,
15
63527
3704
01:07
both the AI companies themselves and the creators,
16
67264
3304
01:10
without whose work these models would not exist.
17
70601
2636
01:14
Most AI companies today do not license the majority of their training data.
18
74271
4839
01:19
They use web scrapers to find, download
19
79143
2402
01:21
and train on as much content as they can gather.
20
81579
3036
01:24
They're often pretty secretive about what they do train on,
21
84615
2870
01:27
but what's clear is that training on copyrighted work without a license
22
87485
4037
01:31
is rife.
23
91555
1268
01:32
For instance, when the Mozilla Foundation
24
92857
2569
01:35
looked at 47 large language models released between 2019 and 2023,
25
95459
5406
01:40
they found that 64 percent of them were trained, in part, on Common Crawl,
26
100898
4905
01:45
a dataset that includes copyrighted works,
27
105836
3070
01:48
such as newspaper articles from major publications.
28
108939
2403
01:51
And a further 21 percent didn’t reveal enough information to know either way.
29
111375
5239
01:57
Training on copyrighted work without a license
30
117982
2302
02:00
has rapidly become standard across much of the generative AI industry.
31
120284
4404
02:04
But this training,
32
124722
1601
02:06
this unlicensed training on creative work,
33
126357
2903
02:09
has serious negative consequences for the people behind that work.
34
129260
3603
02:12
And this is for the simple reason
35
132863
1969
02:14
that generative AI competes with its training data.
36
134832
3337
02:18
This is not the narrative that AI companies like to portray.
37
138202
3270
02:21
We like to talk about democratization, about letting more people be creative.
38
141505
3637
02:25
But the fact that AI competes with its training data is inescapable.
39
145176
3870
02:30
A large language model trained on short stories
40
150080
2269
02:32
can create competing short stories.
41
152383
1768
02:34
An AI image model trained on stock images can create competing stock images.
42
154185
3703
02:37
An AI music model trained on music that's licensed to TV shows
43
157888
3771
02:41
can create competing music to license to TV shows.
44
161659
2836
02:44
These models, however imperfect,
45
164528
3837
02:48
are so quick and easy to use that this competition is inevitable.
46
168399
3937
02:53
And this isn't just theoretical.
47
173270
1869
02:55
Generative AI is still pretty new,
48
175172
2269
02:57
but we're already seeing exactly the sort of effects you'd expect
49
177474
3270
03:00
in a world in which generative AI competes with its training data.
50
180778
4237
03:05
For instance, the well-known filmmaker Ram Gopal Varma recently said
51
185049
4804
03:09
that he'll use AI music in all his projects going forward.
52
189853
3704
03:13
Indeed, there are multiple reports of people starting to listen to AI music
53
193591
4137
03:17
in place of human-produced music,
54
197761
1635
03:19
and recently, an AI song hit number 48 in the German charts.
55
199430
5472
03:24
In all these cases, AI music is competing with the songs it was trained on.
56
204935
5205
03:30
Or take Kelly McKernan.
57
210174
2636
03:32
Kelly is an artist from Nashville.
58
212843
2770
03:35
For 10 years, they made enough money selling their work
59
215646
4104
03:39
that art was their full-time income.
60
219783
1836
03:41
But in 2022, a dataset that included their works
61
221652
4104
03:45
was used to train a popular AI image model.
62
225756
3837
03:51
Their name was one of many used by huge numbers of people
63
231362
5038
03:56
to create art in the style of specific human artists.
64
236433
4772
04:01
Kelly's income fell by 33 percent almost overnight.
65
241238
3871
04:05
Illustrators around the world report similar stories,
66
245109
3203
04:08
being outcompeted by AI models
67
248312
2569
04:10
they have reason to believe were trained on their work.
68
250914
3571
04:14
The freelance platform Upwork wrote a white paper
69
254518
3337
04:17
in which they looked at the effects, that they've seen on the job market,
70
257888
4138
04:22
of generative AI.
71
262026
2002
04:24
They looked at how job postings on their platform have changed
72
264028
3136
04:27
since the introduction of ChatGPT,
73
267197
1669
04:28
and sure enough, they found exactly what you'd expect,
74
268866
2903
04:31
that generative AI has reduced the demand for freelance writing tasks by 8 percent,
75
271769
4804
04:36
which increases to 18 percent
76
276607
2469
04:39
if you look at only what they term lower-value tasks.
77
279109
2836
04:41
So the initial data we have, plus the individual stories we hear,
78
281979
6373
04:48
all align with the logical assumption:
79
288385
1835
04:50
"Generative AI competes with the work it's trained on."
80
290254
2936
04:53
It's so quick and easy to use, it's inevitable,
81
293223
2803
04:56
and it competes with the people behind that work.
82
296060
2636
04:58
Now creators argue this training is illegal.
83
298729
3704
05:02
The legal framework of copyright
84
302433
1568
05:04
affords creators the exclusive right to authorize copies of their work,
85
304034
4071
05:08
and AI training involves copying.
86
308138
3470
05:11
Here, in the US, many AI companies argue
87
311642
3170
05:14
that training AI falls under the fair use copyright exception,
88
314845
4204
05:19
which allows unlicensed copying in a limited set of circumstances,
89
319083
4137
05:23
such as creating parodies of a work.
90
323253
2203
05:26
Creators and rights holders strongly disagree,
91
326523
3137
05:29
saying there's no way this narrow exception can be used
92
329693
3037
05:32
to legitimize the mass exploitation of creative work
93
332763
4905
05:37
to create automated competitors to that work.
94
337701
2102
05:39
And for the record, I entirely agree.
95
339837
3436
05:43
Of course, this question is previously untested in the courts,
96
343307
3804
05:47
and there are currently around 30 ongoing lawsuits
97
347111
2535
05:49
brought by rights holders against AI companies,
98
349680
2769
05:52
which will help to address this question.
99
352483
2569
05:55
But this will take time, and creators are suffering
100
355085
2870
05:57
from what they see as unjust competition right now.
101
357988
3971
06:01
So they propose a solution that has been used and worked before --
102
361992
5205
06:07
licensing.
103
367197
1569
06:08
If a commercial entity wants to use copyrighted work,
104
368799
3870
06:12
be it for merchandise manufacturing or building a streaming service,
105
372669
3504
06:16
they license that work.
106
376206
1435
06:17
Now AI companies have a bunch of reasons why this shouldn’t apply to them.
107
377674
5306
06:23
There’s the fair use legal exception that I’ve already mentioned.
108
383013
5439
06:29
There's also the argument
109
389686
1469
06:31
that since humans can train on copyrighted work without a license,
110
391188
4004
06:35
AI should be allowed to, too.
111
395225
1569
06:36
But this is a very hard claim to justify.
112
396827
3270
06:40
Artists have been learning from each other for centuries.
113
400130
2736
06:42
When you create, you expect other people to learn from you.
114
402900
4304
06:47
You learn from a range of sources,
115
407237
1669
06:48
from other art to textbooks to taking lessons.
116
408939
2870
06:51
Much of this you or someone else paid for,
117
411842
2169
06:54
supporting the entire ecosystem.
118
414011
1768
06:55
In generative AI,
119
415813
2035
06:57
commercial entities valued at millions or billions of dollars
120
417881
3137
07:01
scrape as much content as they can,
121
421018
2302
07:03
often against creators' will, without payment,
122
423353
2436
07:05
making multiple copies along the way --
123
425823
3103
07:08
which are subject to copyright law --
124
428959
2369
07:11
to create a highly scalable competitor to what they're copying.
125
431361
3270
07:14
So scalable, in fact, that there are AI image generators
126
434631
3170
07:17
estimated to be making 2.5 million images a day
127
437835
3470
07:21
and AI song generators outputting 10 songs a second.
128
441338
2736
07:24
To argue that human learning and AI training are the same
129
444107
2770
07:26
and should be treated the same
130
446910
1569
07:28
is preposterous.
131
448512
1268
07:31
AI companies also argue
132
451281
2203
07:33
that licensing their training data would be impractical.
133
453517
3103
07:36
They use so much training data, they say,
134
456620
2169
07:38
that individual payments to each creator behind the data would be small.
135
458789
4504
07:43
But this is true of many content-licensing markets.
136
463327
3069
07:46
Creators still want to get paid, even if the payments are small.
137
466430
3036
07:50
AI companies also argue that they simply use too much data
138
470234
3069
07:53
for licensing to even be feasible.
139
473337
2435
07:56
But this is harder and harder to believe
140
476440
2102
07:58
in a world in which there is such a range of datasets
141
478575
3203
08:01
that you can access with permission.
142
481812
3036
08:04
You can license data from media companies.
143
484848
2069
08:06
There have been 27 major deals
144
486950
2436
08:09
between AI companies and rights holders in the last year alone,
145
489419
3204
08:12
and that's to say nothing of the smaller ones that don't get reported.
146
492656
3337
08:16
There are marketplaces of training data where you can get more data.
147
496026
3203
08:19
You can expand this with data that's in the public domain --
148
499229
3203
08:22
that is, in which no copyright exists,
149
502466
2202
08:24
like the 500-billion-word dataset Common Corpus.
150
504668
4338
08:29
You can expand this further with synthetic data,
151
509006
2903
08:31
that is, data that's created itself by an AI model,
152
511942
3270
08:35
in which usually no copyright exists.
153
515212
2636
08:37
So there are multiple options available to you
154
517881
2236
08:40
if you want to build your model without infringing copyright.
155
520150
3003
08:44
But the strongest evidence
156
524321
1401
08:45
that it's possible to license all your data
157
525756
3170
08:48
is that there are multiple companies doing it already.
158
528926
2569
08:51
I know, because I've done it myself.
159
531528
1735
08:53
I've worked in what we now call generative AI for over a decade,
160
533297
3336
08:56
and last September,
161
536667
1334
08:58
my team at Stability AI released an AI music model
162
538001
3938
09:01
that trained on licensed music.
163
541972
2069
09:06
A number of other companies have done the same thing,
164
546310
3103
09:09
and I founded Fairly Trained in order to highlight this fact,
165
549446
3704
09:13
and these companies.
166
553183
1602
09:15
Fairly Trained is a nonprofit that certifies generative AI companies
167
555586
4704
09:20
that don't train on copyrighted work without a license.
168
560290
2770
09:23
We launched in January of this year, and we've already certified 18 companies.
169
563093
4071
09:27
Now these companies take a variety of approaches
170
567164
2436
09:29
to licensing their training data.
171
569633
1601
09:31
We have an AI voice model that's trained on individual voices it's licensed.
172
571234
5039
09:36
We have an AI music model that's licensed more than 40 music catalogs.
173
576306
4138
09:40
We have a large language model
174
580477
1535
09:42
that's trained only on data in the public domain,
175
582045
2403
09:44
mostly from government documents and records.
176
584481
2136
09:46
We have companies who have paid upfront fees for their data.
177
586650
4871
09:52
We have companies who share their revenue with their data providers.
178
592356
3470
09:55
There is no one answer to the exact specifics
179
595826
3570
09:59
of how one of these licensing deals has to work.
180
599396
2836
10:02
The beauty of licensing is that the two parties can come together
181
602265
3270
10:05
and figure out what works for them.
182
605569
1735
10:07
And this is happening more and more now.
183
607337
1935
10:09
You will hear that a requirement to license training data
184
609306
3470
10:12
somehow stifles innovation,
185
612809
2069
10:14
that it's only the big AI companies that can afford
186
614878
2503
10:17
these huge upfront licensing fees.
187
617414
1969
10:19
But in reality, it's the smaller start-ups
188
619383
2902
10:22
who are bothering to license all their data,
189
622319
2803
10:25
and they're doing so, often, without hefty upfront licensing fees,
190
625155
3136
10:28
but using models such as revenue shares.
191
628325
3070
10:32
And there's another major upside to licensing your training data.
192
632996
3070
10:36
All of this training on copyrighted work
193
636099
3971
10:40
is forcing publishers to shut off access to their content.
194
640103
4138
10:44
The Data Provenance Initiative
195
644274
1802
10:46
looked at 14,000 websites commonly used in AI training sets,
196
646109
4238
10:50
and they found that, over the course of a single year,
197
650347
2636
10:53
looking at only the domains of the highest value for AI training,
198
653016
4338
10:57
the number that was restricted via opt-outs or terms of service
199
657387
4004
11:01
increased from three percent to between 20 and 33 percent.
200
661391
4805
11:06
The web is being gradually closed due to unlicensed training.
201
666196
3637
11:09
Now this is bad for new AI models, for new entrants to the market,
202
669866
3170
11:13
but also for everyone --
203
673070
1601
11:14
researchers, consumers and more, who benefit from an open internet.
204
674705
4070
11:20
It should come as no surprise
205
680510
1402
11:21
that the general public do not agree with AI companies
206
681945
3070
11:25
about what they can train their models on.
207
685048
2436
11:27
One poll from the AI Policy Institute, in April,
208
687517
3170
11:30
asked people about the common policy among AI companies
209
690721
3169
11:33
of training on publicly available data.
210
693924
2969
11:36
This is data that is openly available online,
211
696927
3036
11:39
which of course includes a lot of copyrighted work,
212
699996
2436
11:42
like news articles and, often, pirated media.
213
702466
3036
11:45
60 percent of people said this should not be allowed
214
705535
3704
11:49
versus only 19 percent who said it should.
215
709272
3037
11:52
The same poll went on to ask
216
712309
2402
11:54
whether AI companies should compensate data providers.
217
714745
3970
11:58
74 percent said yes, and only nine percent said no.
218
718749
5305
12:04
Time and time again, when we ask the public these questions,
219
724087
3671
12:07
they show support for requirements around permission and payment,
220
727791
4972
12:12
and a rejection of the notion
221
732796
1435
12:14
that something being publicly available somehow makes it fair game.
222
734264
3470
12:19
And the people who make the art that society consumes feel the same way.
223
739336
4237
12:23
Today, we launched a "Statement on AI Training,"
224
743607
3703
12:27
a short, simple open letter, which simply reads:
225
747310
4338
12:31
"The unlicensed use of creative works for training generative AI
226
751681
3571
12:35
is a major, unjust threat to the livelihoods
227
755285
2503
12:37
of the people behind those works,
228
757821
1735
12:39
and must not be permitted."
229
759589
2303
12:42
This has already been signed by 11,000 and counting creators around the world,
230
762225
4638
12:46
including Nobel-winning authors,
231
766897
2235
12:49
Academy Award-winning actors and Oscar-winning composers.
232
769132
2803
12:51
And if you agree with this sentiment,
233
771968
1869
12:53
I encourage you to sign it today at aitrainingstatement.org.
234
773870
2937
12:56
What this statement and previous ones like it make abundantly clear
235
776840
3937
13:00
is that these artists, these creators,
236
780811
2302
13:03
view the unlicensed training on their work by generative AI models
237
783113
3370
13:06
as totally unjust and potentially catastrophic to their professions.
238
786516
4571
13:11
So if you are an advocate for unlicensed AI training,
239
791121
4004
13:15
just remember that the people who wrote the music that you are listening to
240
795158
4838
13:20
and the books you’re reading
241
800030
1768
13:21
probably disagree.
242
801798
1468
13:24
So where does this leave us?
243
804634
1435
13:26
Well, right now, many of the world's artists,
244
806102
2503
13:28
writers, musicians, creators
245
808605
2002
13:30
straight-up hate generative AI.
246
810607
2369
13:32
And we know, from their own words, that one of the reasons for this
247
812976
3403
13:36
is that we're training on their work without asking them.
248
816413
3003
13:39
But it doesn't have to be this way.
249
819449
2302
13:41
The AI industry and the creative industries
250
821785
2069
13:43
can be and should be mutually beneficial.
251
823887
2369
13:46
But for this mutually beneficial relationship to emerge,
252
826256
4204
13:50
we have to start from a position of respect
253
830460
2770
13:53
for the value of the works being trained on
254
833263
2102
13:55
and the rights of the people who made them.
255
835398
2803
13:59
I'm not arguing that all AI development should be halted.
256
839002
3270
14:02
I'm not arguing that AI should not exist.
257
842305
2069
14:04
What I'm arguing is that the resources used to build generative AI
258
844374
4071
14:08
should be paid for.
259
848478
1502
14:10
Licensing is hard work.
260
850413
2136
14:12
It will slow you down in the short term,
261
852549
1968
14:14
but you'll ultimately reach exactly the same point --
262
854551
2569
14:17
models that are just as capable, just as powerful --
263
857153
2436
14:19
and you'll do so without forcing the world's publishers
264
859623
4037
14:23
to batten down the hatches and destroy the commons,
265
863693
3370
14:27
and without pitting the world's creators against you.
266
867063
3571
14:30
So I hope that more AI companies will follow the example
267
870667
4404
14:35
set by those we've certified at Fairly Trained,
268
875105
2369
14:37
and license all their training data.
269
877507
1835
14:39
I hope that employees at these companies will demand this of their employers.
270
879376
4037
14:43
And I hope that everyone who uses generative AI
271
883446
3204
14:46
will ask what their favorite models were trained on.
272
886683
2703
14:49
There is a future in which generative AI and human creativity can coexist,
273
889419
5939
14:55
not just peacefully, but symbiotically.
274
895358
2670
14:59
It's been a rough start,
275
899029
1668
15:00
but it's not too late to change course.
276
900730
1902
15:03
Thank you.
277
903433
1168
15:04
(Applause)
278
904601
2302