What we learned from 5 million books

236,062 views ・ 2011-09-20

TED


Translator: Bjarne Poulsen Reviewer: Jonas Tholstrup Christensen
00:15
Erez Lieberman Aiden: Everyone knows
0
15260
2000
Erez Lieberman Aiden: Alle ved
00:17
that a picture is worth a thousand words.
1
17260
3000
at et billede siger mere end tusind ord
00:22
But we at Harvard
2
22260
2000
Men på Harvard
00:24
were wondering if this was really true.
3
24260
3000
spurgte vi os selv, om det egentlig er sandt.
00:27
(Laughter)
4
27260
2000
(Latter)
00:29
So we assembled a team of experts,
5
29260
4000
Så vi samlede et hold eksperter,
00:33
spanning Harvard, MIT,
6
33260
2000
både fra Harvard, MIT,
00:35
The American Heritage Dictionary, The Encyclopedia Britannica
7
35260
3000
The American Heritage Dictionary, The Encyclopedia Britannica
00:38
and even our proud sponsors,
8
38260
2000
og sågar vores stolte sponsor...
00:40
the Google.
9
40260
3000
The Google.
00:43
And we cogitated about this
10
43260
2000
Og vi har funderet over dette
00:45
for about four years.
11
45260
2000
i cirka fire år.
00:47
And we came to a startling conclusion.
12
47260
5000
Og vores konklusion er overraskende.
00:52
Ladies and gentlemen, a picture is not worth a thousand words.
13
52260
3000
Mine damer og herrer, et billede siger ikke mere end tusind ord.
00:55
In fact, we found some pictures
14
55260
2000
Det viste sig faktisk at nogle billeder
00:57
that are worth 500 billion words.
15
57260
5000
siger mere end 500 milliarder ord.
01:02
Jean-Baptiste Michel: So how did we get to this conclusion?
16
62260
2000
Jean-Baptiste Michel: Hvordan når vi denne konklusion?
01:04
So Erez and I were thinking about ways
17
64260
2000
Erez og jeg tænkte på, hvordan man
01:06
to get a big picture of human culture
18
66260
2000
kunne få overblik over menneskets kultur og historie -
01:08
and human history: change over time.
19
68260
3000
- og ændringen over tid.
01:11
So many books actually have been written over the years.
20
71260
2000
Der er skrevet så mange bøger gennem tiderne.
01:13
So we were thinking, well the best way to learn from them
21
73260
2000
Så vi tænkte at man kan lære mest af alle disse bøger
01:15
is to read all of these millions of books.
22
75260
2000
ved at læse dem alle sammen.
01:17
Now of course, if there's a scale for how awesome that is,
23
77260
3000
Hvis der er en skala for, hvor fantastisk det er
01:20
that has to rank extremely, extremely high.
24
80260
3000
må det selvfølgelig ligge meget, meget højt (Awesome).
01:23
Now the problem is there's an X-axis for that,
25
83260
2000
Problemet er, at der også er en X-akse,
01:25
which is the practical axis.
26
85260
2000
og det er aksen for, om det også er praktisk.
01:27
This is very, very low.
27
87260
2000
Den er meget, meget lav.
01:29
(Applause)
28
89260
3000
(Bifald)
01:32
Now people tend to use an alternative approach,
29
92260
3000
Folk bruger som regel en anden tilgang,
01:35
which is to take a few sources and read them very carefully.
30
95260
2000
Man tager nogle få kilder og læser dem meget omhyggeligt.
01:37
This is extremely practical, but not so awesome.
31
97260
2000
Dette er meget praktisk, men ikke særlig fantastisk.
01:39
What you really want to do
32
99260
3000
Det bedste må være
01:42
is to get to the awesome yet practical part of this space.
33
102260
3000
at nå til dette fantastiske men alligevel praktiske område.
01:45
So it turns out there was a company across the river called Google
34
105260
3000
Et firma på den anden side af floden - Google -
01:48
who had started a digitization project a few years back
35
108260
2000
startede et digitaliseringsprojekt for nogle år siden
01:50
that might just enable this approach.
36
110260
2000
og det kan måske gøre denne tilgang mulig.
01:52
They have digitized millions of books.
37
112260
2000
De har digitaliseret millioner af bøger.
01:54
So what that means is, one could use computational methods
38
114260
3000
Man kan således bruge computerbaserede metoder
01:57
to read all of the books in a click of a button.
39
117260
2000
til at læse alle bøgerne med et enkelt klik.
01:59
That's very practical and extremely awesome.
40
119260
3000
Det er meget praktisk og ekstremt fantastisk.
02:03
ELA: Let me tell you a little bit about where books come from.
41
123260
2000
ELA: Nu skal I høre, hvor bøger stammer fra.
02:05
Since time immemorial, there have been authors.
42
125260
3000
Der har altid eksisteret forfattere.
02:08
These authors have been striving to write books.
43
128260
3000
Disse forfattere har bestræbt sig på at skrive bøger.
02:11
And this became considerably easier
44
131260
2000
Og det blev væsentligt nemmere
02:13
with the development of the printing press some centuries ago.
45
133260
2000
da trykpressen blev opfundet for nogle hundrede år siden.
02:15
Since then, the authors have won
46
135260
3000
Siden da, er det lykkedes forfattere
02:18
on 129 million distinct occasions,
47
138260
2000
at udgive bøger
02:20
publishing books.
48
140260
2000
129 millioner gange.
02:22
Now if those books are not lost to history,
49
142260
2000
Hvis disse bøger ikke er gået tabt for historien,
02:24
then they are somewhere in a library,
50
144260
2000
findes de på et bibliotek et sted,
02:26
and many of those books have been getting retrieved from the libraries
51
146260
3000
og mange bøgerne er blevet taget fra hylderne
02:29
and digitized by Google,
52
149260
2000
og er blevet digitaliseret af Google,
02:31
which has scanned 15 million books to date.
53
151260
2000
som til dato har scannet 15 millioner bøger.
02:33
Now when Google digitizes a book, they put it into a really nice format.
54
153260
3000
Når Google digitaliserer en bog, får den et rigtig fint format.
02:36
Now we've got the data, plus we have metadata.
55
156260
2000
Nu har vi både data og metadata.
02:38
We have information about things like where was it published,
56
158260
3000
Vi har f.eks. oplysninger om, hvor den blev udgivet,
02:41
who was the author, when was it published.
57
161260
2000
hvem forfatteren var, og hvornår den blev udgivet.
02:43
And what we do is go through all of those records
58
163260
3000
Og vi går gennem alle disse arkiver
02:46
and exclude everything that's not the highest quality data.
59
166260
4000
og udelukker alle data, der ikke er af højeste kvalitet.
02:50
What we're left with
60
170260
2000
Det, der er tilbage, er en samling
02:52
is a collection of five million books,
61
172260
3000
på fem millioner bøger,
02:55
500 billion words,
62
175260
3000
500 milliarder ord,
02:58
a string of characters a thousand times longer
63
178260
2000
en tegnstreng, der er tusind gange længere
03:00
than the human genome --
64
180260
3000
end menneskets arvemasse.
03:03
a text which, when written out,
65
183260
2000
Hvis teksten blev skrevet ud,
03:05
would stretch from here to the Moon and back
66
185260
2000
ville den nå herfra til månen og tilbage igen
03:07
10 times over --
67
187260
2000
10 gange!
03:09
a veritable shard of our cultural genome.
68
189260
4000
- Et sandt brudstykke af vores kulturelle arvemasse.
03:13
Of course what we did
69
193260
2000
Det vi gjorde,
03:15
when faced with such outrageous hyperbole ...
70
195260
3000
da vi stod over for så vanvittige sammenligninger...
03:18
(Laughter)
71
198260
2000
(Latter)
03:20
was what any self-respecting researchers
72
200260
3000
var, hvad enhver forsker med respekt for sig selv
03:23
would have done.
73
203260
3000
ville have gjort.
03:26
We took a page out of XKCD,
74
206260
2000
Vi gjorde som i tegneserien XKCD,
03:28
and we said, "Stand back.
75
208260
2000
og sagde "Gør plads!
03:30
We're going to try science."
76
210260
2000
Vi prøver med videnskab".
03:32
(Laughter)
77
212260
2000
(Latter)
03:34
JM: Now of course, we were thinking,
78
214260
2000
JM: Først tænkte vi selvfølgelig,
03:36
well let's just first put the data out there
79
216260
2000
"Vi gør bare data tilgængelige,
03:38
for people to do science to it.
80
218260
2000
så andre kan bruge videnskab på dem."
03:40
Now we're thinking, what data can we release?
81
220260
2000
Nu tænker vi "Hvilke data kan vi lægge ud?"
03:42
Well of course, you want to take the books
82
222260
2000
Egentlig vil vi gerne tage bøgerne
03:44
and release the full text of these five million books.
83
224260
2000
og lægge teksten fra alle fem millioner bøger ud.
03:46
Now Google, and Jon Orwant in particular,
84
226260
2000
Men Google - og særligt Jon Orwant -
03:48
told us a little equation that we should learn.
85
228260
2000
fortalte om en ligning, vi skulle lære.
03:50
So you have five million, that is, five million authors
86
230260
3000
Vi har altså fem millioner forfattere
03:53
and five million plaintiffs is a massive lawsuit.
87
233260
3000
altså fem millioner, der gerne vil sagsøge os.
03:56
So, although that would be really, really awesome,
88
236260
2000
Så selvom det ville være virkelig, virkelig fantastisk,
03:58
again, that's extremely, extremely impractical.
89
238260
3000
ville det også være helt ekstremt upraktisk.
04:01
(Laughter)
90
241260
2000
(Latter)
04:03
Now again, we kind of caved in,
91
243260
2000
Igen lod vi os overtale
04:05
and we did the very practical approach, which was a bit less awesome.
92
245260
3000
og fulgte den praktiske tilgang, der var lidt mindre fantastisk.
04:08
We said, well instead of releasing the full text,
93
248260
2000
I stedet for at lægge den fulde tekst ud ville vi
04:10
we're going to release statistics about the books.
94
250260
2000
gøre statistikker om bøgerne tilgængelige.
04:12
So take for instance "A gleam of happiness."
95
252260
2000
Et eksempel er "A gleam of happiness" - Et glimt af lykke
04:14
It's four words; we call that a four-gram.
96
254260
2000
Det er fire ord - det vi kalder et fire-gram
04:16
We're going to tell you how many times a particular four-gram
97
256260
2000
Vi vil nu fortælle jer, hvor mange gange et bestemt fire-gram
04:18
appeared in books in 1801, 1802, 1803,
98
258260
2000
optrådte i bøger i 1801, 1802, 1803,
04:20
all the way up to 2008.
99
260260
2000
og helt op til 2008
04:22
That gives us a time series
100
262260
2000
Det giver os en tidsserie, der viser hvor hyppigt
04:24
of how frequently this particular sentence was used over time.
101
264260
2000
denne ene sætning er blevet brugt over tid.
04:26
We do that for all the words and phrases that appear in those books,
102
266260
3000
Det gør vi for alle ord og udtryk i disse bøger.
04:29
and that gives us a big table of two billion lines
103
269260
3000
Det giver os en stor tabel med to milliarder linjer
04:32
that tell us about the way culture has been changing.
104
272260
2000
som viser hvordan kulturen har ændret sig.
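The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline the speakers used: the function names, the whitespace tokenizer, and the three-book toy corpus are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def yearly_ngram_counts(books, n):
    """Map each n-gram to a {year: count} time series.

    `books` is an iterable of (year, text) pairs. Tokenization is a naive
    lowercase whitespace split, chosen only to keep the sketch short.
    """
    table = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()
        for gram in ngrams(tokens, n):
            table[gram][year] += 1
    return table


# Toy corpus, invented for illustration (not real data).
toy_books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "he felt a gleam of happiness and a gleam of happiness again"),
    (1803, "no happiness at all"),
]

table = yearly_ngram_counts(toy_books, 4)
series = table[("a", "gleam", "of", "happiness")]
print(sorted(series.items()))
```

Each key of `table` plays the role of one line in the big table mentioned in the talk: one n-gram together with its count per year. A real corpus would also need proper tokenization and normalization by the number of words published each year, which this sketch leaves out.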
04:34
ELA: So those two billion lines,
105
274260
2000
ELA: Disse to milliarder linjer
04:36
we call them two billion n-grams.
106
276260
2000
som vi kalder to milliarder n-grammer...
04:38
What do they tell us?
107
278260
2000
Hvad fortæller de os?
04:40
Well the individual n-grams measure cultural trends.
108
280260
2000
De enkelte n-grammer måler kulturelle tendenser.
04:42
Let me give you an example.
109
282260
2000
Lad mig give et eksempel.
04:44
Let's suppose that I am thriving,
110
284260
2000
Jeg vil sige, at jeg trives,
04:46
then tomorrow I want to tell you about how well I did.
111
286260
2000
i morgen siger jeg så, hvor godt jeg havde det.
04:48
And so I might say, "Yesterday, I throve."
112
288260
3000
Jeg ville sige "I går trivedes (throve) jeg".
04:51
Alternatively, I could say, "Yesterday, I thrived."
113
291260
3000
Man kan også bruge "thrived" i stedet for "throve".
04:54
Well which one should I use?
114
294260
3000
Hvilket af de to ord skal jeg bruge?
04:57
How to know?
115
297260
2000
Hvor skulle jeg vide det fra?
04:59
As of about six months ago,
116
299260
2000
Indtil for seks måneder siden
05:01
the state of the art in this field
117
301260
2000
var den anerkendte metode på dette område
05:03
is that you would, for instance,
118
303260
2000
at du f.eks. kunne få fat i
05:05
go up to the following psychologist with fabulous hair,
119
305260
2000
denne psykolog med lækkert hår
05:07
and you'd say,
120
307260
2000
og spørge ham:
05:09
"Steve, you're an expert on the irregular verbs.
121
309260
3000
"Steve, du er ekspert i uregelmæssige verber.
05:12
What should I do?"
122
312260
2000
Hvad skal jeg gøre?"
05:14
And he'd tell you, "Well most people say thrived,
123
314260
2000
Og han ville sige: "De fleste mennesker bruger "thrived"
05:16
but some people say throve."
124
316260
3000
men nogle siger "throve".
05:19
And you also knew, more or less,
125
319260
2000
Og du vidste også - mere eller mindre -
05:21
that if you were to go back in time 200 years
126
321260
3000
at hvis du gik 200 år tilbage i tiden
05:24
and ask the following statesman with equally fabulous hair,
127
324260
3000
og spurgte denne statsmand med ligeså lækkert hår:
05:27
(Laughter)
128
327260
3000
(Latter)
05:30
"Tom, what should I say?"
129
330260
2000
"Tom, hvad ville du sige?"
05:32
He'd say, "Well, in my day, most people throve,
130
332260
2000
Han ville sige: "På min tid brugte de fleste "throve",
05:34
but some thrived."
131
334260
3000
mens andre brugte "thrived".
05:37
So now what I'm just going to show you is raw data.
132
337260
2000
Så nu vil jeg bare vise jer rå data.
05:39
Two rows from this table of two billion entries.
133
339260
4000
To rækker i denne tabel ud af to milliarder poster.
05:43
What you're seeing is year by year frequency
134
343260
2000
Den viser hyppigheden pr. år
05:45
of "thrived" and "throve" over time.
135
345260
3000
af "thrived" og "throve" over tid.
05:49
Now this is just two
136
349260
2000
Det her er kun to
05:51
out of two billion rows.
137
351260
3000
ud af to milliarder rækker.
05:54
So the entire data set
138
354260
2000
Så hele datasættet
05:56
is a billion times more awesome than this slide.
139
356260
3000
er en milliard gange mere fantastisk end dette slide.
05:59
(Laughter)
140
359260
2000
(Latter)
06:01
(Applause)
141
361260
4000
(Bifald)
06:05
JM: Now there are many other pictures that are worth 500 billion words.
142
365260
2000
JM: Der er jo mange andre billeder, der siger mere end 500 milliarder ord.
06:07
For instance, this one.
143
367260
2000
For eksempel dette.
06:09
If you just take influenza,
144
369260
2000
Hvis vi bare ser på influenza,
06:11
you will see peaks at the time where you knew
145
371260
2000
vil I se høje udslag på de tidspunkter, hvor I vidste
06:13
big flu epidemics were killing people around the globe.
146
373260
3000
at der var store globale influenzaepidemier.
06:16
ELA: If you were not yet convinced,
147
376260
3000
ELA: Hvis du ikke er overbevist,
06:19
sea levels are rising,
148
379260
2000
stiger vandstanden i havene -
06:21
so is atmospheric CO2 and global temperature.
149
381260
3000
det gør CO2-indholdet i atmosfæren og den globale temperatur også.
06:24
JM: You might also want to have a look at this particular n-gram,
150
384260
3000
JM: Prøv også at kaste et blik på dette n-gram,
06:27
and that's to tell Nietzsche that God is not dead,
151
387260
3000
og det fortæller Nietzsche, at Gud ikke er død,
06:30
although you might agree that he might need a better publicist.
152
390260
3000
selvom du måske også synes, han har brug for en bedre presseagent.
06:33
(Laughter)
153
393260
2000
(Latter)
06:35
ELA: You can get at some pretty abstract concepts with this sort of thing.
154
395260
3000
ELA: Man kan få nogle ret abstrakte begreber med disse ting.
06:38
For instance, let me tell you the history
155
398260
2000
Lad mig f.eks. fortælle jer historien
06:40
of the year 1950.
156
400260
2000
om året 1950.
06:42
Pretty much for the vast majority of history,
157
402260
2000
I den største del af vores historie
06:44
no one gave a damn about 1950.
158
404260
2000
har ingen interesseret sig en pind for 1950.
06:46
In 1700, in 1800, in 1900,
159
406260
2000
I 1700 og 1800 og 1900
06:48
no one cared.
160
408260
3000
var ingen interesseret.
06:52
Through the 30s and 40s,
161
412260
2000
Op gennem 30'erne og 40'erne
06:54
no one cared.
162
414260
2000
var ingen interesseret.
06:56
Suddenly, in the mid-40s,
163
416260
2000
Pludselig, midt i 40'erne,
06:58
there started to be a buzz.
164
418260
2000
blev der hvisket i krogene.
07:00
People realized that 1950 was going to happen,
165
420260
2000
Folk indså at 1950 var noget, der ville ske,
07:02
and it could be big.
166
422260
2000
og det kunne være noget stort.
07:04
(Laughter)
167
424260
3000
(Latter)
07:07
But nothing got people interested in 1950
168
427260
3000
Men det der gjorde folk allermest interesseret i 1950
07:10
like the year 1950.
169
430260
3000
var året 1950.
07:13
(Laughter)
170
433260
3000
(Latter)
07:16
People were walking around obsessed.
171
436260
2000
Folk var som besat.
07:18
They couldn't stop talking
172
438260
2000
De kunne ikke lade være med at tale
07:20
about all the things they did in 1950,
173
440260
3000
om alt det, de lavede i 1950,
07:23
all the things they were planning to do in 1950,
174
443260
3000
alt det de planlagde at skulle gøre i 1950,
07:26
all the dreams of what they wanted to accomplish in 1950.
175
446260
5000
og alle drømmene om, hvad de ville opnå i 1950.
07:31
In fact, 1950 was so fascinating
176
451260
2000
Faktisk var 1950 så fascinerende
07:33
that for years thereafter,
177
453260
2000
at folk i flere år efter
07:35
people just kept talking about all the amazing things that happened,
178
455260
3000
bare blev ved med at tale om alle de utrolige ting, der skete -
07:38
in '51, '52, '53.
179
458260
2000
i 1951, 1952 og 1953.
07:40
Finally in 1954,
180
460260
2000
Omsider i 1954
07:42
someone woke up and realized
181
462260
2000
var der en der vågnede op og indså
07:44
that 1950 had gotten somewhat passé.
182
464260
4000
at 1950 var blevet noget passé.
07:48
(Laughter)
183
468260
2000
(Latter)
07:50
And just like that, the bubble burst.
184
470260
2000
Og uden videre sprang boblen.
07:52
(Laughter)
185
472260
2000
(Latter)
07:54
And the story of 1950
186
474260
2000
Og historien om 1950
07:56
is the story of every year that we have on record,
187
476260
2000
er historien om alle de år, vi har registreret,
07:58
with a little twist, because now we've got these nice charts.
188
478260
3000
med et lille tvist, fordi vi nu har disse fine grafer.
08:01
And because we have these nice charts, we can measure things.
189
481260
3000
Og fordi vi har disse fine grafer, kan vi nu måle ting.
08:04
We can say, "Well how fast does the bubble burst?"
190
484260
2000
Vi kan sige "Hvor hurtigt springer boblen?"
08:06
And it turns out that we can measure that very precisely.
191
486260
3000
Og det viser sig, at vi kan måle dette meget præcist.
08:09
Equations were derived, graphs were produced,
192
489260
3000
Der blev udledt ligninger, og der blev opstillet grafer,
08:12
and the net result
193
492260
2000
og nettoresultatet er
08:14
is that we find that the bubble bursts faster and faster
194
494260
3000
at det viser sig, at boblen springer hurtigere og hurtigere
08:17
with each passing year.
195
497260
2000
for hvert år der går.
08:19
We are losing interest in the past more rapidly.
196
499260
5000
Vi mister interessen for fortiden hurtigere.
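The talk does not say how the speed of the "bubble burst" was measured. One plausible way to put a number on it, assumed here purely for illustration, is a half-life: the number of years after the peak until the frequency first falls to half the peak value. The function name and the toy series for "1950" below are invented.

```python
def half_life(series):
    """Years from the peak until frequency first drops to half the peak.

    `series` maps year -> frequency. Returns None if the frequency never
    falls to half of its peak within the recorded years.
    """
    peak_year = max(series, key=series.get)
    peak = series[peak_year]
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= peak / 2:
            return year - peak_year
    return None


# Invented frequencies for the string "1950" (arbitrary units).
mentions_of_1950 = {
    1948: 1.0, 1949: 2.0, 1950: 8.0,
    1951: 6.0, 1952: 5.0, 1953: 4.5, 1954: 3.0,
}

print(half_life(mentions_of_1950))
```

Computing this for every year on record and plotting the result against the year itself would show whether the half-life shrinks over time, which is the pattern the speakers describe.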
08:24
JM: Now a little piece of career advice.
197
504260
2000
JM: Og nu et godt karrieretip:
08:26
So for those of you who seek to be famous,
198
506260
2000
For de af jer, der vil være berømte,
08:28
we can learn from the 25 most famous political figures,
199
508260
2000
kan vi lære af de 25 mest berømte politiske personligheder,
08:30
authors, actors and so on.
200
510260
2000
forfattere, skuespillere osv.
08:32
So if you want to become famous early on, you should be an actor,
201
512260
3000
Så hvis du vil være berømt tidligt, skal du være skuespiller,
08:35
because then fame starts rising by the end of your 20s --
202
515260
2000
fordi berømmelsen så begynder at stige, når du er sidst i 20'erne –
08:37
you're still young, it's really great.
203
517260
2000
Du er stadig ung, og det er virkelig skønt.
08:39
Now if you can wait a little bit, you should be an author,
204
519260
2000
Men hvis du kan vente lidt, skal du blive forfatter,
08:41
because then you rise to very great heights,
205
521260
2000
fordi du så opnår meget stor berømmelse,
08:43
like Mark Twain, for instance: extremely famous.
206
523260
2000
som f.eks. Mark Twain: Ekstremt berømt.
08:45
But if you want to reach the very top,
207
525260
2000
Men hvis du vil helt til toppen,
08:47
you should delay gratification
208
527260
2000
skal du udskyde den tilfredsstillelse, det er
08:49
and, of course, become a politician.
209
529260
2000
at blive berømt - og selvfølgelig blive politiker.
08:51
So here you will become famous by the end of your 50s,
210
531260
2000
Her vil du blive berømt, når du er i slutningen af 50'erne,
08:53
and become very, very famous afterward.
211
533260
2000
og blive meget, meget berømt derefter.
08:55
So scientists also tend to get famous when they're much older.
212
535260
3000
Videnskabsfolk plejer også at blive berømte, når de er meget ældre.
08:58
Like for instance, biologists and physicists
213
538260
2000
For eksempel biologer og fysikere
09:00
tend to be almost as famous as actors.
214
540260
2000
bliver næsten ligeså berømte som skuespillere.
09:02
One mistake you should not do is become a mathematician.
215
542260
3000
En fejl, du ikke skal begå, er at blive matematiker.
09:05
(Laughter)
216
545260
2000
(Latter)
09:07
If you do that,
217
547260
2000
Hvis du gør det,
09:09
you might think, "Oh great. I'm going to do my best work when I'm in my 20s."
218
549260
3000
tænker du måske "Herligt! Jeg leverer mit bedste arbejde, når jeg er i 20'erne"
09:12
But guess what, nobody will really care.
219
552260
2000
Men tænk engang... stort set ingen lægger mærke til det.
09:14
(Laughter)
220
554260
3000
(Latter)
09:17
ELA: There are more sobering notes
221
557260
2000
ELA: Der er mere nøgterne observationer
09:19
among the n-grams.
222
559260
2000
blandt n-grammerne.
09:21
For instance, here's the trajectory of Marc Chagall,
223
561260
2000
Her er f.eks. Marc Chagalls livsforløb,
09:23
an artist born in 1887.
224
563260
2000
som kunstner født i 1887.
09:25
And this looks like the normal trajectory of a famous person.
225
565260
3000
Og dette ligner det normale forløb for en berømt person.
09:28
He gets more and more and more famous,
226
568260
4000
Han bliver mere og mere berømt,
09:32
except if you look in German.
227
572260
2000
bare ikke hvis vi ser på tysk.
09:34
If you look in German, you see something completely bizarre,
228
574260
2000
På tysk ser vi noget ganske bizart,
09:36
something you pretty much never see,
229
576260
2000
noget man stort set aldrig ser,
09:38
which is he becomes extremely famous
230
578260
2000
og det er, at han bliver ekstremt berømt
09:40
and then all of a sudden plummets,
231
580260
2000
hvorefter berømmelsen falder brat
09:42
going through a nadir between 1933 and 1945,
232
582260
3000
og er på nulpunktet mellem 1933 og 1945,
09:45
before rebounding afterward.
233
585260
3000
hvorefter berømmelsen vender tilbage.
09:48
And of course, what we're seeing
234
588260
2000
Og det vi selvfølgelig kan se
09:50
is the fact Marc Chagall was a Jewish artist
235
590260
3000
er at Marc Chagall var jødisk kunstner
09:53
in Nazi Germany.
236
593260
2000
i nazi-Tyskland
09:55
Now these signals
237
595260
2000
Disse signaler
09:57
are actually so strong
238
597260
2000
er faktisk så stærke,
09:59
that we don't need to know that someone was censored.
239
599260
3000
at vi ikke behøver at vide, at en person er blevet censureret.
10:02
We can actually figure it out
240
602260
2000
Vi kan faktisk regne det ud
10:04
using really basic signal processing.
241
604260
2000
ved hjælp af meget grundlæggende behandling af signalerne.
10:06
Here's a simple way to do it.
242
606260
2000
Her er en simpel måde at gøre det på.
10:08
Well, a reasonable expectation
243
608260
2000
Det er rimeligt at forvente
10:10
is that somebody's fame in a given period of time
244
610260
2000
at en persons berømmelse i en given periode
10:12
should be roughly the average of their fame before
245
612260
2000
vil være nogenlunde gennemsnittet af berømmelsen før
10:14
and their fame after.
246
614260
2000
og berømmelsen efter perioden.
10:16
So that's sort of what we expect.
247
616260
2000
Så det er nogenlunde, det vi forventer.
10:18
And we compare that to the fame that we observe.
248
618260
3000
Og vi sammenligner med den berømmelse, vi kan aflæse.
10:21
And we just divide one by the other
249
621260
2000
Og så dividerer vi bare den ene med den anden
10:23
to produce something we call a suppression index.
250
623260
2000
så vi får noget, vi kalder et undertrykkelsesindeks.
10:25
If the suppression index is very, very, very small,
251
625260
3000
Hvis undertrykkelsesindekset er meget, meget, meget lavt,
10:28
then you very well might be being suppressed.
252
628260
2000
er der stor sandsynlighed for at du er undertrykt.
10:30
If it's very large, maybe you're benefiting from propaganda.
253
630260
3000
Hvis det er meget højt, får du måske hjælp af propaganda.
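The suppression index described above can be sketched as follows. The talk only says to "divide one by the other"; dividing observed fame by expected fame is an assumption, chosen because it matches the statement that a very small index suggests suppression. The function name and the toy numbers are likewise invented for illustration.

```python
from statistics import mean


def suppression_index(fame_by_year, start, end):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of the mean fame before the period and
    the mean fame after it. Values well below 1 hint at suppression,
    values well above 1 at a boost such as propaganda.
    """
    before = [f for y, f in fame_by_year.items() if y < start]
    during = [f for y, f in fame_by_year.items() if start <= y <= end]
    after = [f for y, f in fame_by_year.items() if y > end]
    if not (before and during and after):
        raise ValueError("need data before, during, and after the period")
    expected = (mean(before) + mean(after)) / 2
    if expected == 0:
        raise ValueError("expected fame is zero; index undefined")
    return mean(during) / expected


# Invented fame values (arbitrary units) with a dip from 1933 to 1945.
toy_fame = {year: 4.0 for year in range(1930, 1933)}
toy_fame.update({year: 0.5 for year in range(1933, 1946)})
toy_fame.update({year: 6.0 for year in range(1946, 1949)})

print(suppression_index(toy_fame, 1933, 1945))
```

Here the expected fame is (4.0 + 6.0) / 2 = 5.0 and the observed fame is 0.5, so the index comes out at roughly one tenth. A series with no dip would give an index close to one.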
10:34
JM: Now you can actually look at
254
634260
2000
JM: Nu kan man faktisk se på
10:36
the distribution of suppression indexes over whole populations.
255
636260
3000
fordelingen af undertrykkelsesindekser over hele populationer.
10:39
So for instance, here --
256
639260
2000
For eksempel her:
10:41
this suppression index is for 5,000 people
257
641260
2000
Dette undertrykkelsesindeks er for 5.000 personer
10:43
picked in English books where there's no known suppression --
258
643260
2000
taget fra engelske bøger uden nogen kendt undertrykkelse.
10:45
it would be like this, basically tightly centered on one.
259
645260
2000
Det ville være på denne måde, tæt centreret om ét.
10:47
What you expect is basically what you observe.
260
647260
2000
Det man kan aflæse, er grundlæggende som forventet.
10:49
This is distribution as seen in Germany --
261
649260
2000
Dette er fordelingen, som den ses i Tyskland.
10:51
very different, it's shifted to the left.
262
651260
2000
Meget anderledes... den er forskudt til venstre.
10:53
People talked about it twice less as it should have been.
263
653260
3000
Folk talte dobbelt så lidt om det, som de burde.
10:56
But much more importantly, the distribution is much wider.
264
656260
2000
Men vigtigere er, at fordelingen er meget bredere.
10:58
There are many people who end up on the far left on this distribution
265
658260
3000
Der er mange personer, der ender ude til venstre i fordelingen,
11:01
who are talked about 10 times fewer than they should have been.
266
661260
3000
som der bliver talt 10 gange så lidt om, som der burde.
11:04
But then also many people on the far right
267
664260
2000
Men der er også personer ude til højre,
11:06
who seem to benefit from propaganda.
268
666260
2000
som synes at være hjulpet af propaganda.
11:08
This picture is the hallmark of censorship in the book record.
269
668260
3000
Dette er kendetegnende for censur i bogregisteret.
11:11
ELA: So culturomics
270
671260
2000
ELA: Denne metode
11:13
is what we call this method.
271
673260
2000
kalder vi "culturomics".
11:15
It's kind of like genomics.
272
675260
2000
Det er lidt ligesom genforskning
11:17
Except genomics is a lens on biology
273
677260
2000
Genomics - genforskning - er et nærbillede af biologi
11:19
through the window of the sequence of bases in the human genome.
274
679260
3000
hvor man ser på sekvenser af baser i arvemassen.
11:22
Culturomics is similar.
275
682260
2000
Culturomics minder om dette.
11:24
It's the application of massive-scale data collection analysis
276
684260
3000
Det er en analyse af en kæmpe samling data
11:27
to the study of human culture.
277
687260
2000
anvendt på studiet af menneskets kultur.
11:29
Here, instead of through the lens of a genome,
278
689260
2000
I stedet for at bruge arvemassen som perspektiv,
11:31
through the lens of digitized pieces of the historical record.
279
691260
3000
bruges digitaliserede stykker af historisk materiale.
11:34
The great thing about culturomics
280
694260
2000
Det gode ved culturomics er
11:36
is that everyone can do it.
281
696260
2000
at alle kan gøre det.
11:38
Why can everyone do it?
282
698260
2000
Hvorfor kan alle gøre det?
11:40
Everyone can do it because three guys,
283
700260
2000
Alle kan gøre det, fordi disse tre herrer,
11:42
Jon Orwant, Matt Gray and Will Brockman over at Google,
284
702260
3000
Jon Orwant, Matt Gray og Will Brockman hos Google,
11:45
saw the prototype of the Ngram Viewer,
285
705260
2000
så prototypen af Ngram Viewer,
11:47
and they said, "This is so fun.
286
707260
2000
og sagde, "Det er så sjovt,
11:49
We have to make this available for people."
287
709260
3000
at vi må gøre det tilgængeligt for alle."
11:52
So in two weeks flat -- the two weeks before our paper came out --
288
712260
2000
På nøjagtig de to uger inden offentliggørelsen af vores rapport
11:54
they coded up a version of the Ngram Viewer for the general public.
289
714260
3000
kodede de en version af Ngram Viewer til almen brug.
11:57
And so you too can type in any word or phrase that you're interested in
290
717260
3000
Du kan så skrive et vilkårligt ord, du er interesseret i
12:00
and see its n-gram immediately --
291
720260
2000
og straks se det tilhørende n-gram,
12:02
also browse examples of all the various books
292
722260
2000
og du kan gennemse eksempler på alle bøger
12:04
in which your n-gram appears.
293
724260
2000
som dit n-gram optræder i.
12:06
JM: Now this was used over a million times on the first day,
294
726260
2000
Dette blev brugt over en million gang første dag,
12:08
and this is really the best of all the queries.
295
728260
2000
og dette er den bedste af alle søgninger.
12:10
So people want to be their best, put their best foot forward.
296
730260
3000
Så folk ønsker at yde deres bedste.
12:13
But it turns out in the 18th century, people didn't really care about that at all.
297
733260
3000
Men i det 18. årh. var folk ligeglade med alt det.
12:16
They didn't want to be their best, they wanted to be their beft.
298
736260
3000
De ville ikke gøre deres bedste, de ville være "beft".
12:19
So what happened is, of course, this is just a mistake.
299
739260
3000
Dette var selvfølgelig bare en fejl.
12:22
It's not that they strove for mediocrity,
300
742260
2000
Man stræbte ikke efter middelmådighed,
12:24
it's just that the S used to be written differently, kind of like an F.
301
744260
3000
men tidligere skrev man S anderledes, nærmest som et f.
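A scanned "beft" is really "best" printed with a long s (ſ) that was read as an f. One simple way to undo this, sketched here under stated assumptions, is to try replacing each f with s and keep the first spelling found in a word list. The function name and the tiny vocabulary are hypothetical, and this is not the correction Google or the authors applied.

```python
from itertools import product

LONG_S = "\u017f"  # the long s character itself


def normalize_long_s(word, vocabulary):
    """Return a vocabulary spelling of `word`, undoing long-s misreads.

    A literal long s is always mapped to 's'. If the result is not a
    known word, every combination of f -> s substitutions is tried and
    the first one found in `vocabulary` is returned. Unknown words are
    returned unchanged.
    """
    word = word.replace(LONG_S, "s")
    if word in vocabulary or "f" not in word:
        return word
    positions = [i for i, ch in enumerate(word) if ch == "f"]
    for choice in product("fs", repeat=len(positions)):
        chars = list(word)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate in vocabulary:
            return candidate
    return word


# Tiny hypothetical word list, for illustration only.
vocabulary = {"best", "foot", "first", "success"}

for raw in ["beft", "foot", "firft", "fuccefs", "xyzf"]:
    print(raw, "->", normalize_long_s(raw, vocabulary))
```

Words that are already valid, such as "foot", are left alone, so the sketch cannot tell "fat" from a misread "sat". That ambiguity is one reason graphs built from old scans need careful interpretation.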
12:27
Now of course, Google didn't pick this up at the time,
302
747260
3000
Det opdagede Google selvfølgelig ikke dengang,
12:30
so we reported this in the science article that we wrote.
303
750260
3000
så vi skrev det i den videnskabelige artikel.
12:33
But it turns out this is just a reminder
304
753260
2000
Dette minder os om, at
12:35
that, although this is a lot of fun,
305
755260
2000
selvom det er rigtig sjovt,
12:37
when you interpret these graphs, you have to be very careful,
306
757260
2000
at fortolke disse grafer, skal man være forsigtig
12:39
and you have to adopt the base standards in the sciences.
307
759260
3000
og overholde de videnskabelige standarder.
12:42
ELA: People have been using this for all kinds of fun purposes.
308
762260
3000
Folk har brugt dette til mange sjove formål.
12:45
(Laughter)
309
765260
7000
(Latter)
12:52
Actually, we're not going to have to talk,
310
772260
2000
Vi behøver faktisk ikke tale,
12:54
we're just going to show you all the slides and remain silent.
311
774260
3000
vi viser bare alle slides og tier stille.
12:57
This person was interested in the history of frustration.
312
777260
3000
Denne person var interesseret i frustrationens historie.
13:00
There's various types of frustration.
313
780260
3000
Der er forskellige typer frustration.
13:03
If you stub your toe, that's a one A "argh."
314
783260
3000
Hvis du slår tåen, er der ét A i "argh".
13:06
If the planet Earth is annihilated by the Vogons
315
786260
2000
Hvis Jorden udslettes af Vogonerne
13:08
to make room for an interstellar bypass,
316
788260
2000
for at gøre plads til en intergalaktisk ekspresrute,
13:10
that's an eight A "aaaaaaaargh."
317
790260
2000
er det et "aaaaaaaargh" med otte A'er.
13:12
This person studies all the "arghs,"
318
792260
2000
Personen undersøger alle udgaver af "argh"
13:14
from one through eight A's.
319
794260
2000
fra ét til otte A'er.
13:16
And it turns out
320
796260
2000
Og det viser sig
13:18
that the less-frequent "arghs"
321
798260
2000
at de mindst hyppige "argh" vedrører
13:20
are, of course, the ones that correspond to things that are more frustrating --
322
800260
3000
ting, der er mere frustrerende
13:23
except, oddly, in the early 80s.
323
803260
3000
men sjovt nok ikke i de tidlige 80'ere.
13:26
We think that might have something to do with Reagan.
324
806260
2000
Vi tror det kan være noget med Reagan.
13:28
(Laughter)
325
808260
2000
(Latter)
13:30
JM: There are many usages of this data,
326
810260
3000
Disse data kan bruges til mange ting,
13:33
but the bottom line is that the historical record is being digitized.
327
813260
3000
men grundlaget er, at historien bliver digitaliseret.
13:36
Google has started to digitize 15 million books.
328
816260
2000
Google er begyndt at digitalisere 15 millioner bøger.
13:38
That's 12 percent of all the books that have ever been published.
329
818260
2000
Det er 12 % af alle bøger, der er udgivet.
13:40
It's a sizable chunk of human culture.
330
820260
3000
Det er en god klump af menneskets kultur.
13:43
There's much more in culture: there's manuscripts, there's newspapers,
331
823260
3000
Kultur er meget mere: manuskripter, aviser
13:46
there's things that are not text, like art and paintings.
332
826260
2000
noget er ikke tekst, f.eks. kunst og malerier.
13:48
These all happen to be on our computers,
333
828260
2000
Disse vil alle findes på vores computere,
13:50
on computers across the world.
334
830260
2000
på computere i hele verden.
13:52
And when that happens, that will transform the way we have
335
832260
3000
Og når det sker, ændrer det den måde
13:55
to understand our past, our present and human culture.
336
835260
2000
vi forstår vores fortid, vores nutid og menneskets kultur.
13:57
Thank you very much.
337
837260
2000
Mange tak.
13:59
(Applause)
338
839260
3000
(Bifald)