What we learned from 5 million books

236,154 views ・ 2011-09-20

TED


Translator: Lili Liang Reviewer: dahong zhang
00:15
Erez Lieberman Aiden: Everyone knows
0
15260
2000
Erez Liberman Aiden:人说
00:17
that a picture is worth a thousand words.
1
17260
3000
一副画面抵过一千个词
00:22
But we at Harvard
2
22260
2000
但是我们在哈佛大学
00:24
were wondering if this was really true.
3
24260
3000
却在思考这是不是一定正确
00:27
(Laughter)
4
27260
2000
(众人笑)
00:29
So we assembled a team of experts,
5
29260
4000
我们召集了各方专家
00:33
spanning Harvard, MIT,
6
33260
2000
他们来自哈佛 麻省理工
00:35
The American Heritage Dictionary, The Encyclopedia Britannica
7
35260
3000
《英国大百科全书》 《美国传统英语字典》
00:38
and even our proud sponsors,
8
38260
2000
还有我们骄傲的赞助商
00:40
the Google.
9
40260
3000
谷歌
00:43
And we cogitated about this
10
43260
2000
我们思考了
00:45
for about four years.
11
45260
2000
大概四年
00:47
And we came to a startling conclusion.
12
47260
5000
最后得出一个惊人的结论
00:52
Ladies and gentlemen, a picture is not worth a thousand words.
13
52260
3000
女士们先生们 一副画面可不止一千个词那么简单
00:55
In fact, we found some pictures
14
55260
2000
事实上 我们发现有时候
00:57
that are worth 500 billion words.
15
57260
5000
一幅画面抵过5千亿个词
01:02
Jean-Baptiste Michel: So how did we get to this conclusion?
16
62260
2000
Jean-Baptiste Michel: 我们是如何得出这个结论的呢
01:04
So Erez and I were thinking about ways
17
64260
2000
是这样的 Erez和我
01:06
to get a big picture of human culture
18
66260
2000
在想怎样找到一幅展现人类文明
01:08
and human history: change over time.
19
68260
3000
和人文历史的画面: 历史的变迁
01:11
So many books actually have been written over the years.
20
71260
2000
人们在漫长岁月中写了很多书
01:13
So we were thinking, well the best way to learn from them
21
73260
2000
所以我们想 向他们学习的最佳方法
01:15
is to read all of these millions of books.
22
75260
2000
就是把那几百万本书全都读完
01:17
Now of course, if there's a scale for how awesome that is,
23
77260
3000
当然 如果用坐标来表示这样做的好处
01:20
that has to rank extremely, extremely high.
24
80260
3000
那Y轴上的值一定是极高的
01:23
Now the problem is there's an X-axis for that,
25
83260
2000
但问题是还有X轴
01:25
which is the practical axis.
26
85260
2000
也就是可行性
01:27
This is very, very low.
27
87260
2000
这是极低的
01:29
(Applause)
28
89260
3000
(众人鼓掌)
01:32
Now people tend to use an alternative approach,
29
92260
3000
现在人们倾向于另一种做法
01:35
which is to take a few sources and read them very carefully.
30
95260
2000
那就是选择几本书进行精读
01:37
This is extremely practical, but not so awesome.
31
97260
2000
可行性极高但还不够好
01:39
What you really want to do
32
99260
3000
人们真正想要的
01:42
is to get to the awesome yet practical part of this space.
33
102260
3000
是一个既好又可行的方法
01:45
So it turns out there was a company across the river called Google
34
105260
3000
结果 在水一方 有一家叫“谷歌”的公司
01:48
who had started a digitization project a few years back
35
108260
2000
他们在此之前的几年前就开始了一个数字化工程
01:50
that might just enable this approach.
36
110260
2000
有可能帮我们找到这个“既好又可行”的方法
01:52
They have digitized millions of books.
37
112260
2000
他们已经将几百万本书进行了数字化
01:54
So what that means is, one could use computational methods
38
114260
3000
这就意味着人们在电脑上点几个键
01:57
to read all of the books in a click of a button.
39
117260
2000
就能阅读所有的书
01:59
That's very practical and extremely awesome.
40
119260
3000
这真的是既可行又好
02:03
ELA: Let me tell you a little bit about where books come from.
41
123260
2000
这些书是哪里来的呢
02:05
Since time immemorial, there have been authors.
42
125260
3000
从古时候开始 人们就开始写作了
02:08
These authors have been striving to write books.
43
128260
3000
这些作家写书都非常卖力
02:11
And this became considerably easier
44
131260
2000
几个世纪前印刷机问世了
02:13
with the development of the printing press some centuries ago.
45
133260
2000
写书的过程变得简单多了
02:15
Since then, the authors have won
46
135260
3000
自那以后
02:18
on 129 million distinct occasions,
47
138260
2000
作家们已经出版了
02:20
publishing books.
48
140260
2000
1.29亿本书
02:22
Now if those books are not lost to history,
49
142260
2000
如果这些书没有随年月而遗失
02:24
then they are somewhere in a library,
50
144260
2000
就都在图书馆里存着
02:26
and many of those books have been getting retrieved from the libraries
51
146260
3000
谷歌已经把许多书从图书馆中调了出来
02:29
and digitized by Google,
52
149260
2000
进行了数字化
02:31
which has scanned 15 million books to date.
53
151260
2000
被扫描的书籍到目前已有1500万册
02:33
Now when Google digitizes a book, they put it into a really nice format.
54
153260
3000
谷歌扫描图书时 把书的格式做得很好
02:36
Now we've got the data, plus we have metadata.
55
156260
2000
现在我们不但有了数据 还有元数据
02:38
We have information about things like where was it published,
56
158260
3000
我们掌握了这些书的出版地
02:41
who was the author, when was it published.
57
161260
2000
作者 出版时间等信息
02:43
And what we do is go through all of those records
58
163260
3000
接下来 我们就要从所有这些记录中
02:46
and exclude everything that's not the highest quality data.
59
166260
4000
筛选出质量最高的数据
02:50
What we're left with
60
170260
2000
最后剩下的
02:52
is a collection of five million books,
61
172260
3000
是5百万本书
02:55
500 billion words,
62
175260
3000
5000亿个词
02:58
a string of characters a thousand times longer
63
178260
2000
这么多词连起来
03:00
than the human genome --
64
180260
3000
长度是人类基因组的1000倍
03:03
a text which, when written out,
65
183260
2000
如果把这些词连续写出来
03:05
would stretch from here to the Moon and back
66
185260
2000
其长度相当于在地月之间
03:07
10 times over --
67
187260
2000
往返10次以上
03:09
a veritable shard of our cultural genome.
68
189260
4000
这还仅是我们文化基因组的小小一段
03:13
Of course what we did
69
193260
2000
当然啦
03:15
when faced with such outrageous hyperbole ...
70
195260
3000
面对如此令人崩溃的结果
03:18
(Laughter)
71
198260
2000
(众人笑)
03:20
was what any self-respecting researchers
72
200260
3000
我们做了一个懂得自重的研究者
03:23
would have done.
73
203260
3000
应该做的事
03:26
We took a page out of XKCD,
74
206260
2000
我们借鉴了XKCD(科学漫画)
03:28
and we said, "Stand back.
75
208260
2000
说:" 往后站。
03:30
We're going to try science."
76
210260
2000
我们要用科学来解决问题。”
03:32
(Laughter)
77
212260
2000
(众人笑)
03:34
JM: Now of course, we were thinking,
78
214260
2000
当然 这时我们在想
03:36
well let's just first put the data out there
79
216260
2000
何不先把数据放上去
03:38
for people to do science to it.
80
218260
2000
让人们通过科学来运用数据
03:40
Now we're thinking, what data can we release?
81
220260
2000
现在我们在思考 哪些数据可以公开
03:42
Well of course, you want to take the books
82
222260
2000
你当然想把这所有5百万本书
03:44
and release the full text of these five million books.
83
224260
2000
全文公开
03:46
Now Google, and Jon Orwant in particular,
84
226260
2000
现在谷歌 具体地说是乔恩. 奥温特
03:48
told us a little equation that we should learn.
85
228260
2000
告诉教给我们一个有用的方程式
03:50
So you have five million, that is, five million authors
86
230260
3000
你有5百万本书 那就有五百万个作者
03:53
and five million plaintiffs is a massive lawsuit.
87
233260
3000
一个有5百万个原告的官司可不小啊
03:56
So, although that would be really, really awesome,
88
236260
2000
所以尽管这是个好想法
03:58
again, that's extremely, extremely impractical.
89
238260
3000
但是也极不现实
04:01
(Laughter)
90
241260
2000
(众人笑)
04:03
Now again, we kind of caved in,
91
243260
2000
现在我们做出些许让步
04:05
and we did the very practical approach, which was a bit less awesome.
92
245260
3000
采用一个非常可行但稍微没那么好的方法
04:08
We said, well instead of releasing the full text,
93
248260
2000
我们不公开全书内容
04:10
we're going to release statistics about the books.
94
250260
2000
而是公开书本的相关统计数据
04:12
So take for instance "A gleam of happiness."
95
252260
2000
拿“A gleam of happiness”这个词组做例子
04:14
It's four words; we call that a four-gram.
96
254260
2000
它有四个单词 我们称它为四字格
04:16
We're going to tell you how many times a particular four-gram
97
256260
2000
我们会告诉你直到2008年出版的书中
04:18
appeared in books in 1801, 1802, 1803,
98
258260
2000
在1801年 1802年 1803年一直到2008年
04:20
all the way up to 2008.
99
260260
2000
某个四字格一共出现了多少次
04:22
That gives us a time series
100
262260
2000
这让我们看到
04:24
of how frequently this particular sentence was used over time.
101
264260
2000
这个词组在这段时期内被使用的频率
04:26
We do that for all the words and phrases that appear in those books,
102
266260
3000
我们对在这些书中的所有单词和词组都这么处理
04:29
and that gives us a big table of two billion lines
103
269260
3000
于是我们得出了一个由20亿曲线
04:32
that tell us about the way culture has been changing.
104
272260
2000
表示出文化变化的情况
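The counting step described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline the speakers used: `ngrams`, `ngram_table`, `time_series` and the `toy_books` input are hypothetical names, words are split on whitespace, and the three "books" are invented.

```python
from collections import Counter

def ngrams(text, n):
    """Yield every run of n consecutive whitespace-separated words."""
    words = text.split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def ngram_table(books, n):
    """Count n-gram occurrences per publication year.

    books: iterable of (year, text) pairs (toy input, assumed shape).
    Returns a Counter keyed by (ngram, year): one row of the big table.
    """
    table = Counter()
    for year, text in books:
        for gram in ngrams(text, n):
            table[(gram, year)] += 1
    return table

def time_series(table, gram, years):
    """Read one n-gram's yearly counts out of the table (0 if absent)."""
    return [table[(gram, year)] for year in years]

# Invented example texts, not real corpus data.
toy_books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "there was a gleam of happiness and a gleam of happiness again"),
    (1803, "no such phrase here"),
]
table = ngram_table(toy_books, 4)
series = time_series(table, "a gleam of happiness", [1801, 1802, 1803])
print(series)
```

Keying the table by (n-gram, year) keeps the full text out of the released data while still giving one time series per phrase, which is the trade-off the speakers describe.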
04:34
ELA: So those two billion lines,
105
274260
2000
这20亿条曲线
04:36
we call them two billion n-grams.
106
276260
2000
我们成作20亿个n字格
04:38
What do they tell us?
107
278260
2000
它们告诉了我们什么
04:40
Well the individual n-grams measure cultural trends.
108
280260
2000
这些n字格衡量的是文化的走势
04:42
Let me give you an example.
109
282260
2000
我来举个例子
04:44
Let's suppose that I am thriving,
110
284260
2000
假设 我正在发财
04:46
then tomorrow I want to tell you about how well I did.
111
286260
2000
明天我告诉你我发财的情况
04:48
And so I might say, "Yesterday, I throve."
112
288260
3000
我会说:“昨天,我发了。”
04:51
Alternatively, I could say, "Yesterday, I thrived."
113
291260
3000
也可以说:“昨天,我发财了。”
04:54
Well which one should I use?
114
294260
3000
我到底应该用哪个说法呢
04:57
How to know?
115
297260
2000
怎么找答案
04:59
As of about six months ago,
116
299260
2000
6个月以前
05:01
the state of the art in this field
117
301260
2000
很流行的做法是
05:03
is that you would, for instance,
118
303260
2000
比如说
05:05
go up to the following psychologist with fabulous hair,
119
305260
2000
你去问这位秀发飘逸的心理学家
05:07
and you'd say,
120
307260
2000
你说
05:09
"Steve, you're an expert on the irregular verbs.
121
309260
3000
“史蒂夫,你是不规则动词的专家。
05:12
What should I do?"
122
312260
2000
我该怎么办啊?”
05:14
And he'd tell you, "Well most people say thrived,
123
314260
2000
他会说:“大多数人说‘发财了’,
05:16
but some people say throve."
124
316260
3000
但有些人说‘发了’。”
05:19
And you also knew, more or less,
125
319260
2000
如果你可以
05:21
that if you were to go back in time 200 years
126
321260
3000
回到200年前
05:24
and ask the following statesman with equally fabulous hair,
127
324260
3000
问问这位秀发同样飘逸的政治家
05:27
(Laughter)
128
327260
3000
(众人笑)
05:30
"Tom, what should I say?"
129
330260
2000
“托马斯,我该怎么说?”
05:32
He'd say, "Well, in my day, most people throve,
130
332260
2000
他会回答:“嗯,在我的时代,大多数人说‘发了’,
05:34
but some thrived."
131
334260
3000
但是少数人说‘发财了’。”
05:37
So now what I'm just going to show you is raw data.
132
337260
2000
现在我给你们看一个原始数据
05:39
Two rows from this table of two billion entries.
133
339260
4000
这是20亿本书中的其中两本书的曲线
05:43
What you're seeing is year by year frequency
134
343260
2000
你们将看到“发了”和“发财了”这两个词
05:45
of "thrived" and "throve" over time.
135
345260
3000
随时间的推移被使用的频率
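A year-by-year comparison like the one on this slide could be computed roughly as follows. The talk does not say how frequency is normalized, so dividing by the total number of words printed that year is an assumption here, and every number below is made up for illustration.

```python
def relative_frequency(counts, totals):
    """Divide each year's raw count by that year's total word count."""
    return {year: counts.get(year, 0) / totals[year] for year in totals}

# Made-up numbers for illustration only, not real corpus data.
totals = {1800: 1000, 1900: 2000, 2000: 4000}
throve = {1800: 4, 1900: 2}
thrived = {1800: 1, 1900: 4, 2000: 12}

throve_freq = relative_frequency(throve, totals)
thrived_freq = relative_frequency(thrived, totals)

# Which past-tense form is more frequent in each year.
dominant = {
    year: "throve" if throve_freq[year] > thrived_freq[year] else "thrived"
    for year in totals
}
print(dominant)
```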
05:49
Now this is just two
136
349260
2000
这还只是
05:51
out of two billion rows.
137
351260
3000
20亿条曲线中的其中两条
05:54
So the entire data set
138
354260
2000
整套数据
05:56
is a billion times more awesome than this slide.
139
356260
3000
比这张幻灯片要宏伟10亿倍
05:59
(Laughter)
140
359260
2000
(众人笑)
06:01
(Applause)
141
361260
4000
(众人鼓掌)
06:05
JM: Now there are many other pictures that are worth 500 billion words.
142
365260
2000
很多画面都相当于5千亿个词
06:07
For instance, this one.
143
367260
2000
比如这一幅
06:09
If you just take influenza,
144
369260
2000
如果你找“流行感冒”这一词
06:11
you will see peaks at the time where you knew
145
371260
2000
你会看到几个全球范围内
06:13
big flu epidemics were killing people around the globe.
146
373260
3000
祸害人命的流感高峰
06:16
ELA: If you were not yet convinced,
147
376260
3000
如果这不足以令人信服
06:19
sea levels are rising,
148
379260
2000
海平面正在上升
06:21
so is atmospheric CO2 and global temperature.
149
381260
3000
大气中二氧化碳含量和全球气温都在升高
06:24
JM: You might also want to have a look at this particular n-gram,
150
384260
3000
你们也可以看看这个n字格
06:27
and that's to tell Nietzsche that God is not dead,
151
387260
3000
告诉尼采上帝没死
06:30
although you might agree that he might need a better publicist.
152
390260
3000
你可能也认为他或许要换一个企宣了
06:33
(Laughter)
153
393260
2000
(众人笑)
06:35
ELA: You can get at some pretty abstract concepts with this sort of thing.
154
395260
3000
你可以通过这个得到非常抽象的概念
06:38
For instance, let me tell you the history
155
398260
2000
我跟你们说说
06:40
of the year 1950.
156
400260
2000
1950年的历史
06:42
Pretty much for the vast majority of history,
157
402260
2000
在漫漫历史长河中
06:44
no one gave a damn about 1950.
158
404260
2000
几乎没人在意1950年
06:46
In 1700, in 1800, in 1900,
159
406260
2000
1700年 1800年 1900年
06:48
no one cared.
160
408260
3000
没有人在意
06:52
Through the 30s and 40s,
161
412260
2000
20世纪三十年代和四十年代
06:54
no one cared.
162
414260
2000
没有人在意
06:56
Suddenly, in the mid-40s,
163
416260
2000
到了四十年代中期 突然间
06:58
there started to be a buzz.
164
418260
2000
关注度飞升
07:00
People realized that 1950 was going to happen,
165
420260
2000
人们意识到1950年快来了
07:02
and it could be big.
166
422260
2000
这一年可能非同小可啊
07:04
(Laughter)
167
424260
3000
(众人笑)
07:07
But nothing got people interested in 1950
168
427260
3000
1950年 正如人们想象的一样
07:10
like the year 1950.
169
430260
3000
没发生任何有意思的事情
07:13
(Laughter)
170
433260
3000
(众人笑)
07:16
People were walking around obsessed.
171
436260
2000
人们都着了魔了
07:18
They couldn't stop talking
172
438260
2000
无时无刻不在谈论
07:20
about all the things they did in 1950,
173
440260
3000
他们1950年做过的事情
07:23
all the things they were planning to do in 1950,
174
443260
3000
他们打算在1950年做的事情
07:26
all the dreams of what they wanted to accomplish in 1950.
175
446260
5000
后者他们1950年想要实现的梦想
07:31
In fact, 1950 was so fascinating
176
451260
2000
事实上 1950年是不同凡响的一年
07:33
that for years thereafter,
177
453260
2000
即使过了好多年
07:35
people just kept talking about all the amazing things that happened,
178
455260
3000
人们还是不停地谈论那年发生的所有美好事情
07:38
in '51, '52, '53.
179
458260
2000
51年 52年 53年
07:40
Finally in 1954,
180
460260
2000
终于到了1954年
07:42
someone woke up and realized
181
462260
2000
人们醒悟过来
07:44
that 1950 had gotten somewhat passé.
182
464260
4000
1950年已成往事了
07:48
(Laughter)
183
468260
2000
(众人笑)
07:50
And just like that, the bubble burst.
184
470260
2000
就这样 泡泡破了
07:52
(Laughter)
185
472260
2000
(众人笑)
07:54
And the story of 1950
186
474260
2000
1950年的情况
07:56
is the story of every year that we have on record,
187
476260
2000
以及每一年的情况 我们都记录了下来
07:58
with a little twist, because now we've got these nice charts.
188
478260
3000
多亏了这些漂亮的图表 我们的工作顺利多了
08:01
And because we have these nice charts, we can measure things.
189
481260
3000
有了这些漂亮的图表 我们就能测量各种事物
08:04
We can say, "Well how fast does the bubble burst?"
190
484260
2000
我们会说:“泡泡破掉的速度有多快?”
08:06
And it turns out that we can measure that very precisely.
191
486260
3000
结果证明 我们可以对此进行精准的测量
08:09
Equations were derived, graphs were produced,
192
489260
3000
等式出来了 图表也做好了
08:12
and the net result
193
492260
2000
最终结果是
08:14
is that we find that the bubble bursts faster and faster
194
494260
3000
泡泡破掉的速度
08:17
with each passing year.
195
497260
2000
每年都在加快
08:19
We are losing interest in the past more rapidly.
196
499260
5000
我们对过去的遗忘不断加快
08:24
JM: Now a little piece of career advice.
197
504260
2000
好 现在给大家一些发展事业的建议
08:26
So for those of you who seek to be famous,
198
506260
2000
如果你想成名
08:28
we can learn from the 25 most famous political figures,
199
508260
2000
我们可以向25位最著名的政治人物
08:30
authors, actors and so on.
200
510260
2000
作家 演员学习
08:32
So if you want to become famous early on, you should be an actor,
201
512260
3000
如果你想早点成名 你就应该做个演员
08:35
because then fame starts rising by the end of your 20s --
202
515260
2000
因为 演员在20来岁的时候成名
08:37
you're still young, it's really great.
203
517260
2000
你还很年轻 这是本钱
08:39
Now if you can wait a little bit, you should be an author,
204
519260
2000
如果你能等一等 那就当个作家
08:41
because then you rise to very great heights,
205
521260
2000
因为你可以像马克.吐温这样
08:43
like Mark Twain, for instance: extremely famous.
206
523260
2000
成为文坛巨星
08:45
But if you want to reach the very top,
207
525260
2000
如果你想到达万人之上
08:47
you should delay gratification
208
527260
2000
你就不能安于现状
08:49
and, of course, become a politician.
209
529260
2000
要成为一个政治家
08:51
So here you will become famous by the end of your 50s,
210
531260
2000
到了快60岁的时候 你就成名了
08:53
and become very, very famous afterward.
211
533260
2000
而且之后名声远扬
08:55
So scientists also tend to get famous when they're much older.
212
535260
3000
科学家通常在年纪一大把的时候才成名
08:58
Like for instance, biologists and physicists
213
538260
2000
生物学家和物理学家的名声
09:00
tend to be almost as famous as actors.
214
540260
2000
通常能跟演员的名声媲美
09:02
One mistake you should not do is become a mathematician.
215
542260
3000
有一个错误你不要犯 那就是成为一个数学家
09:05
(Laughter)
216
545260
2000
(众人笑)
09:07
If you do that,
217
547260
2000
如果你成了数学家
09:09
you might think, "Oh great. I'm going to do my best work when I'm in my 20s."
218
549260
3000
你会想:“太好啦,我20多岁的时候会有最辉煌的成就。”
09:12
But guess what, nobody will really care.
219
552260
2000
谁知道 人们连睬都不睬你
09:14
(Laughter)
220
554260
3000
(众人笑)
09:17
ELA: There are more sobering notes
221
557260
2000
n字格中
09:19
among the n-grams.
222
559260
2000
有些情况更为明了
09:21
For instance, here's the trajectory of Marc Chagall,
223
561260
2000
这是Marc Chagall的名声起落
09:23
an artist born in 1887.
224
563260
2000
他是出生于1887的一位艺术家
09:25
And this looks like the normal trajectory of a famous person.
225
565260
3000
他的名声起落看似乎没有什么异常
09:28
He gets more and more and more famous,
226
568260
4000
他的名声越来越大
09:32
except if you look in German.
227
572260
2000
然而如果你在德语书中搜索 情况就不同了
09:34
If you look in German, you see something completely bizarre,
228
574260
2000
在德语书中 你会看到非常奇怪的现象
09:36
something you pretty much never see,
229
576260
2000
闻所未闻 见所未见
09:38
which is he becomes extremely famous
230
578260
2000
他先是名极一时
09:40
and then all of a sudden plummets,
231
580260
2000
但突然之间 名声直线下落
09:42
going through a nadir between 1933 and 1945,
232
582260
3000
在1933年到1945年间达到了低谷
09:45
before rebounding afterward.
233
585260
3000
后来才回升
09:48
And of course, what we're seeing
234
588260
2000
当然 实际情况是
09:50
is the fact Marc Chagall was a Jewish artist
235
590260
3000
Marc Chagall是一个犹太艺术家
09:53
in Nazi Germany.
236
593260
2000
当时身在纳粹德国
09:55
Now these signals
237
595260
2000
这些信号
09:57
are actually so strong
238
597260
2000
实在太强了
09:59
that we don't need to know that someone was censored.
239
599260
3000
我们无需知道谁被禁了
10:02
We can actually figure it out
240
602260
2000
我们事实上可以
10:04
using really basic signal processing.
241
604260
2000
通过非常基本的信号处理来找出答案
10:06
Here's a simple way to do it.
242
606260
2000
这里有一个简单的方法
10:08
Well, a reasonable expectation
243
608260
2000
一个人在特定时期内
10:10
is that somebody's fame in a given period of time
244
610260
2000
所拥有的知名度
10:12
should be roughly the average of their fame before
245
612260
2000
应当大致为他成名前与成名后知名度的平均值
10:14
and their fame after.
246
614260
2000
这么想是有道理的
10:16
So that's sort of what we expect.
247
616260
2000
我们也是怎么想的
10:18
And we compare that to the fame that we observe.
248
618260
3000
我们把观察到的知名度进行对比
10:21
And we just divide one by the other
249
621260
2000
我们把前者比上后者
10:23
to produce something we call a suppression index.
250
623260
2000
产生的结果叫做抑制指数
10:25
If the suppression index is very, very, very small,
251
625260
3000
如果抑制指数非常非常小
10:28
then you very well might be being suppressed.
252
628260
2000
那么你的知名度正在被抑制
10:30
If it's very large, maybe you're benefiting from propaganda.
253
630260
3000
如果数值非常大 或许就表明你从宣传中获益
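As a rough sketch of the suppression index just described: expected fame is taken as the average of fame before and fame after the period, and the index is observed fame divided by that expectation (the direction is inferred from "very small means suppressed"). The function names, the per-year dictionary input and the sample values are assumptions for illustration, not the speakers' code or data.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    return sum(values) / len(values)

def suppression_index(fame_by_year, start, end):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of mean fame before the period
    and mean fame after it. Assumes data exists on both sides.
    """
    before = [f for y, f in fame_by_year.items() if y < start]
    during = [f for y, f in fame_by_year.items() if start <= y <= end]
    after = [f for y, f in fame_by_year.items() if y > end]
    expected = (mean(before) + mean(after)) / 2
    return mean(during) / expected

# Invented fame values (e.g. mentions per million words), not real data.
fame = {1925: 4.0, 1930: 6.0, 1935: 0.5, 1940: 0.5, 1950: 5.0, 1955: 5.0}
index = suppression_index(fame, 1933, 1945)
print(index)  # well below 1: mentioned far less than expected
```

An index near 1 means observed fame matches expectation; values far below 1 hint at suppression and values far above 1 at promotion.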
10:34
JM: Now you can actually look at
254
634260
2000
你还可以看到
10:36
the distribution of suppression indexes over whole populations.
255
636260
3000
压抑指数在总人数中的分布情况
10:39
So for instance, here --
256
639260
2000
这里有个例子
10:41
this suppression index is for 5,000 people
257
641260
2000
这是从没有明显抑制的英文书籍中
10:43
picked in English books where there's no known suppression --
258
643260
2000
选出的5000个人
10:45
it would be like this, basically tightly centered on one.
259
645260
2000
它是这个样子的 基本上以1为中心
10:47
What you expect is basically what you observe.
260
647260
2000
实际情况与预想差不多
10:49
This is distribution as seen in Germany --
261
649260
2000
而这在是德文书籍中的分布情况
10:51
very different, it's shifted to the left.
262
651260
2000
与前者大为不同 往左偏了
10:53
People talked about it twice less as it should have been.
263
653260
3000
人们对它的关注较预期要少了两倍
10:56
But much more importantly, the distribution is much wider.
264
656260
2000
更重要的是 这个分布的跨度更宽
10:58
There are many people who end up on the far left on this distribution
265
658260
3000
不少人处于左边的部分
11:01
who are talked about 10 times fewer than they should have been.
266
661260
3000
人数比预期中少了10倍
11:04
But then also many people on the far right
267
664260
2000
而也有不少人处于更靠右的部分
11:06
who seem to benefit from propaganda.
268
666260
2000
他们的宣传起了作用
11:08
This picture is the hallmark of censorship in the book record.
269
668260
3000
这幅图反映了书籍记录中的审查情况
11:11
ELA: So culturomics
270
671260
2000
我们把这种方法
11:13
is what we call this method.
271
673260
2000
称作文化组学
11:15
It's kind of like genomics.
272
675260
2000
有点像基因组学
11:17
Except genomics is a lens on biology
273
677260
2000
只不过 基因组学是生物学上
11:19
through the window of the sequence of bases in the human genome.
274
679260
3000
观察人类基因组序列的透镜
11:22
Culturomics is similar.
275
682260
2000
文化组学很类似
11:24
It's the application of massive-scale data collection analysis
276
684260
3000
它指的是对人类文明研究的
11:27
to the study of human culture.
277
687260
2000
大规模数据收集分析的应用
11:29
Here, instead of through the lens of a genome,
278
689260
2000
它使用的不是基因组这个透镜
11:31
through the lens of digitized pieces of the historical record.
279
691260
3000
而是用数字化的历史记录片段作为透镜
11:34
The great thing about culturomics
280
694260
2000
文化组学的优点是
11:36
is that everyone can do it.
281
696260
2000
人人都会用它
11:38
Why can everyone do it?
282
698260
2000
为什么呢
11:40
Everyone can do it because three guys,
283
700260
2000
这是因为这三个人
11:42
Jon Orwant, Matt Gray and Will Brockman over at Google,
284
702260
3000
谷歌的乔恩.奥温特 迈特.格雷和威尔.布洛克曼
11:45
saw the prototype of the Ngram Viewer,
285
705260
2000
看到了n字格后
11:47
and they said, "This is so fun.
286
707260
2000
说:“这太有意思了,
11:49
We have to make this available for people."
287
709260
3000
我们得让所有人都用上它。”
11:52
So in two weeks flat -- the two weeks before our paper came out --
288
712260
2000
于是在我们的论文发表之前的整整两个星期中
11:54
they coded up a version of the Ngram Viewer for the general public.
289
714260
3000
他们编了一个面向公众的Ngram Viewer版本
11:57
And so you too can type in any word or phrase that you're interested in
290
717260
3000
现在你们也可以输入任何你感兴趣的单词或词组
12:00
and see its n-gram immediately --
291
720260
2000
查看它的n字格
12:02
also browse examples of all the various books
292
722260
2000
并阅览所有书籍中
12:04
in which your n-gram appears.
293
724260
2000
出现n字格的例句
12:06
JM: Now this was used over a million times on the first day,
294
726260
2000
这个词在第一天就被使用了超过一百万次
12:08
and this is really the best of all the queries.
295
728260
2000
这真的是最棒的一个搜索词
12:10
So people want to be their best, put their best foot forward.
296
730260
3000
人们总想做到最好 总想展示最好的一面
12:13
But it turns out in the 18th century, people didn't really care about that at all.
297
733260
3000
但是在18世纪 人们对此并不在乎
12:16
They didn't want to be their best, they wanted to be their beft.
298
736260
3000
他们不想做到最好(“best”)而是“beft”
12:19
So what happened is, of course, this is just a mistake.
299
739260
3000
实际上 这是个错别字
12:22
It's not that they strove for mediocrity,
300
742260
2000
这并不是因为人们不识字
12:24
it's just that the S used to be written differently, kind of like an F.
301
744260
3000
而是因为当时英文字母S的写法跟现在不同 看起来像F
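One way to guard against this artifact is to fold the misread spelling back into the modern word before plotting. The sketch below is an assumed cleanup step, not something the talk says was done: it relies on the fact that the long s was not used at the end of a word, and the helper names and counts are hypothetical.

```python
def long_s_variant(word):
    """Spell a word as OCR might read it if every non-final 's'
    was printed as a long s and misread as 'f'."""
    if not word:
        return word
    return word[:-1].replace("s", "f") + word[-1]

def merge_long_s(counts, vocabulary):
    """Add counts of the misread spelling to the intended word.

    Caveat: this would wrongly merge genuine word pairs such as
    'sat' and 'fat', so the vocabulary should be chosen with care.
    """
    merged = dict(counts)
    for word in vocabulary:
        variant = long_s_variant(word)
        if variant != word and variant in merged:
            merged[word] = merged.get(word, 0) + merged.pop(variant)
    return merged

# Invented counts for one 18th-century year.
counts_1750 = {"best": 3, "beft": 12, "foot": 5}
merged = merge_long_s(counts_1750, ["best", "foot"])
print(merged)
```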
12:27
Now of course, Google didn't pick this up at the time,
302
747260
3000
当然 谷歌没有意识到这一点
12:30
so we reported this in the Science article that we wrote.
303
750260
3000
于是我们对此在论文中做了报告
12:33
But it turns out this is just a reminder
304
753260
2000
这实际上只是一个小提示
12:35
that, although this is a lot of fun,
305
755260
2000
尽管这很有趣
12:37
when you interpret these graphs, you have to be very careful,
306
757260
2000
但是你在解读这些图表时 仍须非常谨慎
12:39
and you have to adopt the base standards in the sciences.
307
759260
3000
你必须遵循基本的科学准则
12:42
ELA: People have been using this for all kinds of fun purposes.
308
762260
3000
人们使用它来寻求各种乐趣
12:45
(Laughter)
309
765260
7000
(众人笑)
12:52
Actually, we're not going to have to talk,
310
772260
2000
我们不打算多说
12:54
we're just going to show you all the slides and remain silent.
311
774260
3000
光给你们看这些幻灯片
12:57
This person was interested in the history of frustration.
312
777260
3000
这个用户对人们烦躁的历史很感兴趣
13:00
There's various types of frustration.
313
780260
3000
这里有不同类型的烦躁
13:03
If you stub your toe, that's a one A "argh."
314
783260
3000
如果你的脚趾被碰了 你会说“啊” (“argh”)
13:06
If the planet Earth is annihilated by the Vogons
315
786260
2000
如果地球被外星人毁灭了
13:08
to make room for an interstellar bypass,
316
788260
2000
开了一条星际航道
13:10
that's an eight A "aaaaaaaargh."
317
790260
2000
那就是“啊啊啊啊啊啊啊啊” ("aaaaaaaargh")
13:12
This person studies all the "arghs,"
318
792260
2000
这个人研究了不同长短的“啊” (“argh”)
13:14
from one through eight A's.
319
794260
2000
从1个啊到8个啊
13:16
And it turns out
320
796260
2000
结果
13:18
that the less-frequent "arghs"
321
798260
2000
那些使用频率较低的啊
13:20
are, of course, the ones that correspond to things that are more frustrating --
322
800260
3000
代表程度更高的烦躁
13:23
except, oddly, in the early 80s.
323
803260
3000
八十年代是个例外
13:26
We think that might have something to do with Reagan.
324
806260
2000
我们猜这可能跟里根总统有关
13:28
(Laughter)
325
808260
2000
(众人笑)
13:30
JM: There are many usages of this data,
326
810260
3000
这个数据库的用处很多
13:33
but the bottom line is that the historical record is being digitized.
327
813260
3000
但最重要的是这是一个数字化的历史记录
13:36
Google has started to digitize 15 million books.
328
816260
2000
谷歌已经开始对1500万本书进行数字化处理
13:38
That's 12 percent of all the books that have ever been published.
329
818260
2000
其中12%的书已被出版
13:40
It's a sizable chunk of human culture.
330
820260
3000
这是人类文明相当大的一部分
13:43
There's much more in culture: there's manuscripts, there's newspapers,
331
823260
3000
而文明还包括更多的内容 有手稿 报纸
13:46
there's things that are not text, like art and paintings.
332
826260
2000
非文字的内容 例如艺术与绘画
13:48
These all happen to be on our computers,
333
828260
2000
这些内容都会出现在我们的电脑上
13:50
on computers across the world.
334
830260
2000
在世界各地的电脑上
13:52
And when that happens, that will transform the way we have
335
832260
3000
如果这成真了
13:55
to understand our past, our present and human culture.
336
835260
2000
我们对过去现在以及人类文明的认识就被改变了
13:57
Thank you very much.
337
837260
2000
非常感谢大家
13:59
(Applause)
338
839260
3000
(众人鼓掌)