What we learned from 5 million books

236,154 views ・ 2011-09-20

TED



Chinese translation: Joyce Chou. Reviewer: Qi Gu.
00:15
Erez Lieberman Aiden: Everyone knows that a picture is worth a thousand words. But we at Harvard were wondering if this was really true.
00:27
(Laughter)
00:29
So we assembled a team of experts, spanning Harvard, MIT, The American Heritage Dictionary, The Encyclopedia Britannica and even our proud sponsors, the Google.
00:43
And we cogitated about this for about four years. And we came to a startling conclusion. Ladies and gentlemen, a picture is not worth a thousand words. In fact, we found some pictures that are worth 500 billion words.
01:02
Jean-Baptiste Michel: So how did we get to this conclusion? So Erez and I were thinking about ways to get a big picture of human culture and human history: change over time. So many books actually have been written over the years. So we were thinking, well, the best way to learn from them is to read all of these millions of books. Now of course, if there's a scale for how awesome that is, that has to rank extremely, extremely high. Now the problem is there's an X-axis for that, which is the practical axis. This is very, very low.
01:29
(Applause)
01:32
Now people tend to use an alternative approach, which is to take a few sources and read them very carefully. This is extremely practical, but not so awesome. What you really want to do is to get to the awesome yet practical part of this space. So it turns out there was a company across the river called Google who had started a digitization project a few years back that might just enable this approach. They have digitized millions of books. So what that means is, one could use computational methods to read all of the books in a click of a button. That's very practical and extremely awesome.
02:03
ELA: Let me tell you a little bit about where books come from. Since time immemorial, there have been authors. These authors have been striving to write books. And this became considerably easier with the development of the printing press some centuries ago. Since then, the authors have won on 129 million distinct occasions, publishing books. Now if those books are not lost to history, then they are somewhere in a library, and many of those books have been getting retrieved from the libraries and digitized by Google, which has scanned 15 million books to date.
02:33
Now when Google digitizes a book, they put it into a really nice format. Now we've got the data, plus we have metadata. We have information about things like where was it published, who was the author, when was it published. And what we do is go through all of those records and exclude everything that's not the highest quality data. What we're left with is a collection of five million books, 500 billion words, a string of characters a thousand times longer than the human genome -- a text which, when written out, would stretch from here to the Moon and back 10 times over -- a veritable shard of our cultural genome.
03:13
Of course what we did when faced with such outrageous hyperbole ...
03:18
(Laughter)
03:20
was what any self-respecting researchers would have done. We took a page out of XKCD, and we said, "Stand back. We're going to try science."
03:32
(Laughter)
03:34
JM: Now of course, we were thinking, well, let's just first put the data out there for people to do science to it. Now we're thinking, what data can we release? Well of course, you want to take the books and release the full text of these five million books. Now Google, and Jon Orwant in particular, told us a little equation that we should learn. So you have five million, that is, five million authors and five million plaintiffs is a massive lawsuit. So, although that would be really, really awesome, again, that's extremely, extremely impractical.
04:01
(Laughter)
04:03
Now again, we kind of caved in, and we did the very practical approach, which was a bit less awesome. We said, well, instead of releasing the full text, we're going to release statistics about the books. So take for instance "A gleam of happiness." It's four words; we call that a four-gram. We're going to tell you how many times a particular four-gram appeared in books in 1801, 1802, 1803, all the way up to 2008. That gives us a time series of how frequently this particular phrase was used over time. We do that for all the words and phrases that appear in those books, and that gives us a big table of two billion lines that tell us about the way culture has been changing.
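The counting step described here (how many times each n-gram appears in the books of each year) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the whitespace tokenizer, and the function names `ngrams` and `yearly_ngram_counts` are invented for the example and are not the pipeline the speakers used.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def yearly_ngram_counts(books, n):
    """Build {ngram: Counter({year: count})} from (year, text) pairs.

    Tokenization is a plain lowercase whitespace split, which is an
    assumption made to keep the sketch short.
    """
    table = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()
        for gram in ngrams(tokens, n):
            table[gram][year] += 1
    return table


# Toy corpus, invented for illustration only.
toy_books = [
    (1801, "a gleam of happiness lit the room"),
    (1802, "she felt a gleam of happiness and a gleam of happiness again"),
    (1803, "no such phrase here"),
]

table = yearly_ngram_counts(toy_books, 4)
series = table[("a", "gleam", "of", "happiness")]
# One row of the table: the four-gram's count in each year.
print([(year, series[year]) for year in (1801, 1802, 1803)])
```

Each key of `table` corresponds to one row of the "big table" mentioned in the talk, and each row is a time series. Raw counts are used here; dividing by the total number of words printed in each year would turn them into frequencies.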
04:34
ELA: So those two billion lines, we call them two billion n-grams. What do they tell us? Well, the individual n-grams measure cultural trends. Let me give you an example. Let's suppose that I am thriving, then tomorrow I want to tell you about how well I did. And so I might say, "Yesterday, I throve." Alternatively, I could say, "Yesterday, I thrived." Well, which one should I use? How to know?
04:59
As of about six months ago, the state of the art in this field was that you would, for instance, go up to the following psychologist with fabulous hair, and you'd say, "Steve, you're an expert on the irregular verbs. What should I do?" And he'd tell you, "Well, most people say thrived, but some people say throve." And you also knew, more or less, that if you were to go back in time 200 years and ask the following statesman with equally fabulous hair,
05:27
(Laughter)
05:30
"Tom, what should I say?" He'd say, "Well, in my day, most people throve, but some thrived."
05:37
So now what I'm just going to show you is raw data. Two rows from this table of two billion entries. What you're seeing is year by year frequency of "thrived" and "throve" over time. Now this is just two out of two billion rows. So the entire data set is a billion times more awesome than this slide.
05:59
(Laughter)
06:01
(Applause)
06:05
JM: Now there are many other pictures that are worth 500 billion words. For instance, this one. If you just take influenza, you will see peaks at the time where you knew big flu epidemics were killing people around the globe.
06:16
ELA: If you were not yet convinced, sea levels are rising, so is atmospheric CO2 and global temperature.
06:24
JM: You might also want to have a look at this particular n-gram, and that's to tell Nietzsche that God is not dead, although you might agree that he might need a better publicist.
06:33
(Laughter)
06:35
ELA: You can get at some pretty abstract concepts with this sort of thing. For instance, let me tell you the history of the year 1950. Pretty much for the vast majority of history, no one gave a damn about 1950. In 1700, in 1800, in 1900, no one cared. Through the 30s and 40s, no one cared. Suddenly, in the mid-40s, there started to be a buzz. People realized that 1950 was going to happen, and it could be big.
07:04
(Laughter)
07:07
But nothing got people interested in 1950 like the year 1950.
07:13
(Laughter)
07:16
People were walking around obsessed. They couldn't stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams of what they wanted to accomplish in 1950. In fact, 1950 was so fascinating that for years thereafter, people just kept talking about all the amazing things that happened, in '51, '52, '53. Finally in 1954, someone woke up and realized that 1950 had gotten somewhat passé.
07:48
(Laughter)
07:50
And just like that, the bubble burst.
07:52
(Laughter)
07:54
And the story of 1950 is the story of every year that we have on record, with a little twist, because now we've got these nice charts. And because we have these nice charts, we can measure things. We can say, "Well, how fast does the bubble burst?" And it turns out that we can measure that very precisely. Equations were derived, graphs were produced, and the net result is that we find that the bubble bursts faster and faster with each passing year. We are losing interest in the past more rapidly.
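The talk does not spell out the equations behind "how fast does the bubble burst", so the following is only one plausible way to put a number on it. It assumes a half-life style measure: the number of years after the peak until the frequency first drops to half of its peak value or below. The function name `half_life` and the toy series are hypothetical and invented for the example.

```python
def half_life(series):
    """Years from the peak until frequency first falls to <= half the peak.

    series maps year -> frequency. Returns None if the frequency never
    drops that far within the recorded years.
    """
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return year - peak_year
    return None


# Toy frequencies for the string "1950", invented for illustration only.
toy = {
    1948: 1.0, 1949: 3.0, 1950: 10.0, 1951: 8.0,
    1952: 7.0, 1953: 6.0, 1954: 4.5, 1955: 4.0,
}

print(half_life(toy))
```

Computing this value for every year on record and plotting it against the year would show whether the decline gets quicker over time, which is the trend the speakers report.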
08:24
JM: Now a little piece of career advice. So for those of you who seek to be famous, we can learn from the 25 most famous political figures, authors, actors and so on. So if you want to become famous early on, you should be an actor, because then fame starts rising by the end of your 20s -- you're still young, it's really great. Now if you can wait a little bit, you should be an author, because then you rise to very great heights, like Mark Twain, for instance: extremely famous. But if you want to reach the very top, you should delay gratification and, of course, become a politician. So here you will become famous by the end of your 50s, and become very, very famous afterward. So scientists also tend to get famous when they're much older. Like for instance, biologists and physicists tend to be almost as famous as actors. One mistake you should not make is to become a mathematician.
09:05
(Laughter)
09:07
If you do that, you might think, "Oh great. I'm going to do my best work when I'm in my 20s." But guess what, nobody will really care.
09:14
(Laughter)
09:17
ELA: There are more sobering notes among the n-grams. For instance, here's the trajectory of Marc Chagall, an artist born in 1887. And this looks like the normal trajectory of a famous person. He gets more and more and more famous, except if you look in German. If you look in German, you see something completely bizarre, something you pretty much never see, which is he becomes extremely famous and then all of a sudden plummets, going through a nadir between 1933 and 1945, before rebounding afterward. And of course, what we're seeing is the fact Marc Chagall was a Jewish artist in Nazi Germany.
09:55
Now these signals are actually so strong that we don't need to know that someone was censored. We can actually figure it out using really basic signal processing. Here's a simple way to do it. Well, a reasonable expectation is that somebody's fame in a given period of time should be roughly the average of their fame before and their fame after. So that's sort of what we expect. And we compare that to the fame that we observe. And we just divide one by the other to produce something we call a suppression index. If the suppression index is very, very, very small, then you very well might be being suppressed. If it's very large, maybe you're benefiting from propaganda.
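The suppression index described above can be written down directly: expected fame is the average of fame before and fame after the period, and the index is observed fame divided by expected fame. The sketch below follows that description; the five-year comparison windows, the function names, and the toy frequencies are assumptions made for the example and do not come from the talk.

```python
def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def suppression_index(series, start, end, window=5):
    """Observed fame in [start, end] divided by expected fame.

    series maps year -> frequency of a person's name. Expected fame is
    the average of mean fame in the `window` years before `start` and
    the `window` years after `end`. The window length is an assumption.
    """
    before = mean(series[y] for y in range(start - window, start))
    after = mean(series[y] for y in range(end + 1, end + 1 + window))
    expected = (before + after) / 2
    observed = mean(series[y] for y in range(start, end + 1))
    return observed / expected


# Toy name frequencies, invented for illustration only.
toy = {}
toy.update({y: 4.0 for y in range(1928, 1933)})  # before the period
toy.update({y: 0.4 for y in range(1933, 1946)})  # during 1933-1945
toy.update({y: 6.0 for y in range(1946, 1951)})  # after the period

index = suppression_index(toy, 1933, 1945)
print(round(index, 2))
```

An index near 1 means observed fame matches expectation. Values far below 1 point toward suppression, and values far above 1 point toward promotion, matching the reading given in the talk.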
10:34
JM: Now you can actually look at the distribution of suppression indexes over whole populations. So for instance, here -- this suppression index is for 5,000 people picked in English books where there's no known suppression -- it would be like this, basically tightly centered on one. What you expect is basically what you observe. This is the distribution as seen in Germany -- very different, it's shifted to the left. People were talked about half as much as they should have been. But much more importantly, the distribution is much wider. There are many people who end up on the far left on this distribution who are talked about 10 times less than they should have been. But then also many people on the far right who seem to benefit from propaganda. This picture is the hallmark of censorship in the book record.
11:11
ELA: So culturomics is what we call this method. It's kind of like genomics. Except genomics is a lens on biology through the window of the sequence of bases in the human genome. Culturomics is similar. It's the application of massive-scale data collection analysis to the study of human culture. Here, instead of through the lens of a genome, through the lens of digitized pieces of the historical record. The great thing about culturomics is that everyone can do it. Why can everyone do it? Everyone can do it because three guys, Jon Orwant, Matt Gray and Will Brockman over at Google, saw the prototype of the Ngram Viewer, and they said, "This is so fun. We have to make this available for people." So in two weeks flat -- the two weeks before our paper came out -- they coded up a version of the Ngram Viewer for the general public. And so you too can type in any word or phrase that you're interested in and see its n-gram immediately -- also browse examples of all the various books in which your n-gram appears.
12:06
JM: Now this was used over a million times on the first day, and this is really the best of all the queries. So people want to be their best, put their best foot forward. But it turns out in the 18th century, people didn't really care about that at all. They didn't want to be their best, they wanted to be their beft. So what happened is, of course, this is just a mistake. It's not that they strove for mediocrity, it's just that the S used to be written differently, kind of like an F. Now of course, Google didn't pick this up at the time, so we reported this in the Science article that we wrote. But it turns out this is just a reminder that, although this is a lot of fun, when you interpret these graphs, you have to be very careful, and you have to adopt the base standards in the sciences.
12:42
ELA: People have been using this for all kinds of fun purposes.
12:45
(Laughter)
12:52
Actually, we're not going to have to talk, we're just going to show you all the slides and remain silent. This person was interested in the history of frustration. There's various types of frustration. If you stub your toe, that's a one A "argh." If the planet Earth is annihilated by the Vogons to make room for an interstellar bypass, that's an eight A "aaaaaaaargh." This person studies all the "arghs," from one through eight A's. And it turns out that the less-frequent "arghs" are, of course, the ones that correspond to things that are more frustrating -- except, oddly, in the early 80s. We think that might have something to do with Reagan.
13:28
(Laughter)
13:30
JM: There are many usages of this data, but the bottom line is that the historical record is being digitized. Google has started to digitize 15 million books. That's 12 percent of all the books that have ever been published. It's a sizable chunk of human culture. There's much more in culture: there's manuscripts, there are newspapers, there's things that are not text, like art and paintings. These all happen to be on our computers, on computers across the world. And when that happens, that will transform the way we have to understand our past, our present and human culture. Thank you very much.
13:59
(Applause)