What we learned from 5 million books

236,771 views ・ 2011-09-20

TED

Translator: Lili Liang  Reviewer: dahong zhang

00:15

Erez Lieberman Aiden: Everyone knows

15260

2000

Erez Lieberman Aiden：人说

00:17

that a picture is worth a thousand words.

17260

3000

一副画面抵过一千个词

00:22

But we at Harvard

22260

2000

但是我们在哈佛大学

00:24

were wondering if this was really true.

24260

3000

却在思考这是不是一定正确

00:27

(Laughter)

27260

2000

（众人笑）

00:29

So we assembled a team of experts,

29260

4000

我们召集了各方专家

00:33

spanning Harvard, MIT,

33260

2000

他们来自哈佛麻省理工

00:35

The American Heritage Dictionary, The Encyclopedia Britannica

35260

3000

《英国大百科全书》《美国传统英语字典》

00:38

and even our proud sponsors,

38260

2000

还有我们骄傲的赞助商

00:40

the Google.

40260

3000

谷歌

00:43

And we cogitated about this

43260

2000

我们思考了

00:45

for about four years.

45260

2000

大概四年

00:47

And we came to a startling conclusion.

47260

5000

最后得出一个惊人的结论

00:52

Ladies and gentlemen, a picture is not worth a thousand words.

52260

3000

女士们先生们一副画面可不止一千个词那么简单

00:55

In fact, we found some pictures

55260

2000

事实上我们发现有时候

00:57

that are worth 500 billion words.

57260

5000

一幅画面抵过5千亿个词

01:02

Jean-Baptiste Michel: So how did we get to this conclusion?

62260

2000

Jean-Baptiste Michel: 我们是如何得出这个结论的呢

01:04

So Erez and I were thinking about ways

64260

2000

是这样的 Erez和我

01:06

to get a big picture of human culture

66260

2000

在想怎样找到一幅展现人类文明

01:08

and human history: change over time.

68260

3000

和人文历史的画面：历史的变迁

01:11

So many books actually have been written over the years.

71260

2000

人们在漫长岁月中写了很多书

01:13

So we were thinking, well the best way to learn from them

73260

2000

所以我们想向他们学习的最佳方法

01:15

is to read all of these millions of books.

75260

2000

就是把那几百万本书全都读完

01:17

Now of course, if there's a scale for how awesome that is,

77260

3000

当然如果用坐标来表示这样做的好处

01:20

that has to rank extremely, extremely high.

80260

3000

那Y轴上的值一定是极高的

01:23

Now the problem is there's an X-axis for that,

83260

2000

但问题是还有X轴

01:25

which is the practical axis.

85260

2000

也就是可行性

01:27

This is very, very low.

87260

2000

这是极低的

01:29

(Applause)

89260

3000

（众人鼓掌）

01:32

Now people tend to use an alternative approach,

92260

3000

现在人们倾向于另一种做法

01:35

which is to take a few sources and read them very carefully.

95260

2000

那就是选择几本书进行精读

01:37

This is extremely practical, but not so awesome.

97260

2000

可行性极高但还不够好

01:39

What you really want to do

99260

3000

人们真正想要的

01:42

is to get to the awesome yet practical part of this space.

102260

3000

是一个既好又可行的方法

01:45

So it turns out there was a company across the river called Google

105260

3000

结果在水一方有一家叫“谷歌”的公司

01:48

who had started a digitization project a few years back

108260

2000

他们在此之前的几年前就开始了一个数字化工程

01:50

that might just enable this approach.

110260

2000

有可能帮我们找到这个“既好又可行”的方法

01:52

They have digitized millions of books.

112260

2000

他们已经将几百万本书进行了数字化

01:54

So what that means is, one could use computational methods

114260

3000

这就意味着人们在电脑上点几个键

01:57

to read all of the books in a click of a button.

117260

2000

就能阅读所有的书

01:59

That's very practical and extremely awesome.

119260

3000

这真的是既可行又好

02:03

ELA: Let me tell you a little bit about where books come from.

123260

2000

这些书是哪里来的呢

02:05

Since time immemorial, there have been authors.

125260

3000

从古时候开始人们就开始写作了

02:08

These authors have been striving to write books.

128260

3000

这些作家写书都非常卖力

02:11

And this became considerably easier

131260

2000

几个世纪前印刷机问世了

02:13

with the development of the printing press some centuries ago.

133260

2000

写书的过程变得简单多了

02:15

Since then, the authors have won

135260

3000

自那以后

02:18

on 129 million distinct occasions,

138260

2000

作家们已经出版了

02:20

publishing books.

140260

2000

1.29亿本书

02:22

Now if those books are not lost to history,

142260

2000

如果这些书没有随年月而遗失

02:24

then they are somewhere in a library,

144260

2000

就都在图书馆里存着

02:26

and many of those books have been getting retrieved from the libraries

146260

3000

谷歌已经把许多书从图书馆中调了出来

02:29

and digitized by Google,

149260

2000

进行了数字化

02:31

which has scanned 15 million books to date.

151260

2000

被扫描的书籍到目前已有1500万册

02:33

Now when Google digitizes a book, they put it into a really nice format.

153260

3000

谷歌扫描图书时把书的格式做得很好

02:36

Now we've got the data, plus we have metadata.

156260

2000

现在我们不但有了数据还有元数据

02:38

We have information about things like where was it published,

158260

3000

我们掌握了这些书的出版地

02:41

who was the author, when was it published.

161260

2000

作者出版时间等信息

02:43

And what we do is go through all of those records

163260

3000

接下来我们就要从所有这些记录中

02:46

and exclude everything that's not the highest quality data.

166260

4000

筛选出质量最高的数据

02:50

What we're left with

170260

2000

最后剩下的

02:52

is a collection of five million books,

172260

3000

是5百万本书

02:55

500 billion words,

175260

3000

5000亿个词

02:58

a string of characters a thousand times longer

178260

2000

这么多词连起来

03:00

than the human genome --

180260

3000

长度是人类基因组的1000倍

03:03

a text which, when written out,

183260

2000

如果把这些词连续写出来

03:05

would stretch from here to the Moon and back

185260

2000

其长度相当于在地月之间

03:07

10 times over --

187260

2000

往返10次以上

03:09

a veritable shard of our cultural genome.

189260

4000

这还仅是我们文化基因组的小小一段

03:13

Of course what we did

193260

2000

当然啦

03:15

when faced with such outrageous hyperbole ...

195260

3000

面对如此令人崩溃的结果

03:18

(Laughter)

198260

2000

（众人笑）

03:20

was what any self-respecting researchers

200260

3000

我们做了一个懂得自重的研究者

03:23

would have done.

203260

3000

应该做的事

03:26

We took a page out of XKCD,

206260

2000

我们借鉴了XKCD（科学漫画）

03:28

and we said, "Stand back.

208260

2000

说：" 往后站。

03:30

We're going to try science."

210260

2000

我们要用科学来解决问题。”

03:32

(Laughter)

212260

2000

（众人笑）

03:34

JM: Now of course, we were thinking,

214260

2000

当然这时我们在想

03:36

well let's just first put the data out there

216260

2000

何不先把数据放上去

03:38

for people to do science to it.

218260

2000

让人们通过科学来运用数据

03:40

Now we're thinking, what data can we release?

220260

2000

现在我们在思考哪些数据可以公开

03:42

Well of course, you want to take the books

222260

2000

你当然想把这所有5百万本书

03:44

and release the full text of these five million books.

224260

2000

全文公开

03:46

Now Google, and Jon Orwant in particular,

226260

2000

现在谷歌具体地说是乔恩. 奥温特

03:48

told us a little equation that we should learn.

228260

2000

教给我们一个有用的方程式

03:50

So you have five million, that is, five million authors

230260

3000

你有5百万本书那就有五百万个作者

03:53

and five million plaintiffs is a massive lawsuit.

233260

3000

一个有5百万个原告的官司可不小啊

03:56

So, although that would be really, really awesome,

236260

2000

所以尽管这是个好想法

03:58

again, that's extremely, extremely impractical.

238260

3000

但是也极不现实

04:01

(Laughter)

241260

2000

（众人笑）

04:03

Now again, we kind of caved in,

243260

2000

现在我们做出些许让步

04:05

and we did the very practical approach, which was a bit less awesome.

245260

3000

采用一个非常可行但稍微没那么好的方法

04:08

We said, well instead of releasing the full text,

248260

2000

我们不公开全书内容

04:10

we're going to release statistics about the books.

250260

2000

而是公开书本的相关统计数据

04:12

So take for instance "A gleam of happiness."

252260

2000

拿“A gleam of happiness”这个词组做例子

04:14

It's four words; we call that a four-gram.

254260

2000

它有四个单词我们称它为四字格

04:16

We're going to tell you how many times a particular four-gram

256260

2000

我们会告诉你直到2008年出版的书中

04:18

appeared in books in 1801, 1802, 1803,

258260

2000

在1801年 1802年 1803年一直到2008年

04:20

all the way up to 2008.

260260

2000

某个四字格一共出现了多少次

04:22

That gives us a time series

100

262260

2000

这让我们看到

04:24

of how frequently this particular sentence was used over time.

101

264260

2000

这个词组在这段时期内被使用的频率

04:26

We do that for all the words and phrases that appear in those books,

102

266260

3000

我们对在这些书中的所有单词和词组都这么处理

04:29

and that gives us a big table of two billion lines

103

269260

3000

于是我们得出了一个由20亿曲线

04:32

that tell us about the way culture has been changing.

104

272260

2000

表示出文化变化的情况
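The table described above can be sketched in a few lines of Python. This is only a toy illustration, not the pipeline the speakers used: the `(year, text)` input format, the whitespace tokenizer, and the function names `ngrams` and `yearly_ngram_counts` are all assumptions made for the example, and the two "books" are invented sentences.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def yearly_ngram_counts(books, n):
    """Map each n-gram to a {year: count} time series.

    `books` is an iterable of (year, text) pairs. This input format and the
    lowercase/whitespace tokenization are assumptions of the sketch.
    """
    table = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()
        for gram in ngrams(tokens, n):
            table[gram][year] += 1
    return table


# Invented sample data, one "book" per year.
books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "there was a gleam of happiness and a gleam of happiness again"),
]

table = yearly_ngram_counts(books, 4)
series = table[("a", "gleam", "of", "happiness")]
print(sorted(series.items()))  # [(1801, 1), (1802, 2)]
```

Each key of `table` corresponds to one row of the big table in the talk: one n-gram, with one count per publication year.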

04:34

ELA: So those two billion lines,

105

274260

2000

这20亿条曲线

04:36

we call them two billion n-grams.

106

276260

2000

我们称作20亿个n字格

04:38

What do they tell us?

107

278260

2000

它们告诉了我们什么

04:40

Well the individual n-grams measure cultural trends.

108

280260

2000

这些n字格衡量的是文化的走势

04:42

Let me give you an example.

109

282260

2000

我来举个例子

04:44

Let's suppose that I am thriving,

110

284260

2000

假设我正在发财

04:46

then tomorrow I want to tell you about how well I did.

111

286260

2000

明天我告诉你我发财的情况

04:48

And so I might say, "Yesterday, I throve."

112

288260

3000

我会说：“昨天，我发了。”

04:51

Alternatively, I could say, "Yesterday, I thrived."

113

291260

3000

也可以说：“昨天，我发财了。”

04:54

Well which one should I use?

114

294260

3000

我到底应该用哪个说法呢

04:57

How to know?

115

297260

2000

怎么找答案

04:59

As of about six months ago,

116

299260

2000

6个月以前

05:01

the state of the art in this field

117

301260

2000

很流行的做法是

05:03

is that you would, for instance,

118

303260

2000

比如说

05:05

go up to the following psychologist with fabulous hair,

119

305260

2000

你去问这位秀发飘逸的心理学家

05:07

and you'd say,

120

307260

2000

你说

05:09

"Steve, you're an expert on the irregular verbs.

121

309260

3000

“史蒂夫，你是不规则动词的专家。

05:12

What should I do?"

122

312260

2000

我该怎么办啊？”

05:14

And he'd tell you, "Well most people say thrived,

123

314260

2000

他会说：“大多数人说‘发财了’，

05:16

but some people say throve."

124

316260

3000

但有些人说‘发了’。”

05:19

And you also knew, more or less,

125

319260

2000

如果你可以

05:21

that if you were to go back in time 200 years

126

321260

3000

回到200年前

05:24

and ask the following statesman with equally fabulous hair,

127

324260

3000

问问这位秀发同样飘逸的政治家

05:27

(Laughter)

128

327260

3000

（众人笑）

05:30

"Tom, what should I say?"

129

330260

2000

“托马斯，我该怎么说？”

05:32

He'd say, "Well, in my day, most people throve,

130

332260

2000

他会回答：“嗯，在我的时代，大多数人说‘发了’，

05:34

but some thrived."

131

334260

3000

但是少数人说‘发财了’。”

05:37

So now what I'm just going to show you is raw data.

132

337260

2000

现在我给你们看一个原始数据

05:39

Two rows from this table of two billion entries.

133

339260

4000

这是这张20亿行的表格中的两行

05:43

What you're seeing is year by year frequency

134

343260

2000

你们将看到“发了”和“发财了”这两个词

05:45

of "thrived" and "throve" over time.

135

345260

3000

随时间的推移被使用的频率

05:49

Now this is just two

136

349260

2000

这还只是

05:51

out of two billion rows.

137

351260

3000

20亿条曲线中的其中两条

05:54

So the entire data set

138

354260

2000

整套数据

05:56

is a billion times more awesome than this slide.

139

356260

3000

比这张幻灯片要宏伟10亿倍

05:59

(Laughter)

140

359260

2000

（众人笑）

06:01

(Applause)

141

361260

4000

（众人鼓掌）

06:05

JM: Now there are many other pictures that are worth 500 billion words.

142

365260

2000

很多画面都相当于5千亿个词

06:07

For instance, this one.

143

367260

2000

比如这一幅

06:09

If you just take influenza,

144

369260

2000

如果你找“流行感冒”这一词

06:11

you will see peaks at the time where you knew

145

371260

2000

你会看到几个全球范围内

06:13

big flu epidemics were killing people around the globe.

146

373260

3000

祸害人命的流感高峰

06:16

ELA: If you were not yet convinced,

147

376260

3000

如果这不足以令人信服

06:19

sea levels are rising,

148

379260

2000

海平面正在上升

06:21

so is atmospheric CO2 and global temperature.

149

381260

3000

大气中二氧化碳含量和全球气温都在升高

06:24

JM: You might also want to have a look at this particular n-gram,

150

384260

3000

你们也可以看看这个n字格

06:27

and that's to tell Nietzsche that God is not dead,

151

387260

3000

告诉尼采上帝没死

06:30

although you might agree that he might need a better publicist.

152

390260

3000

你可能也认为他或许要换一个企宣了

06:33

(Laughter)

153

393260

2000

（众人笑）

06:35

ELA: You can get at some pretty abstract concepts with this sort of thing.

154

395260

3000

你可以通过这个得到非常抽象的概念

06:38

For instance, let me tell you the history

155

398260

2000

我跟你们说说

06:40

of the year 1950.

156

400260

2000

1950年的历史

06:42

Pretty much for the vast majority of history,

157

402260

2000

在漫漫历史长河中

06:44

no one gave a damn about 1950.

158

404260

2000

几乎没人在意1950年

06:46

In 1700, in 1800, in 1900,

159

406260

2000

1700年 1800年 1900年

06:48

no one cared.

160

408260

3000

没有人在意

06:52

Through the 30s and 40s,

161

412260

2000

20世纪三十年代和四十年代

06:54

no one cared.

162

414260

2000

没有人在意

06:56

Suddenly, in the mid-40s,

163

416260

2000

到了四十年代中期突然间

06:58

there started to be a buzz.

164

418260

2000

关注度飞升

07:00

People realized that 1950 was going to happen,

165

420260

2000

人们意识到1950年快来了

07:02

and it could be big.

166

422260

2000

这一年可能非同小可啊

07:04

(Laughter)

167

424260

3000

（众人笑）

07:07

But nothing got people interested in 1950

168

427260

3000

1950年正如人们想象的一样

07:10

like the year 1950.

169

430260

3000

没发生任何有意思的事情

07:13

(Laughter)

170

433260

3000

（众人笑）

07:16

People were walking around obsessed.

171

436260

2000

人们都着了魔了

07:18

They couldn't stop talking

172

438260

2000

无时无刻不在谈论

07:20

about all the things they did in 1950,

173

440260

3000

他们1950年做过的事情

07:23

all the things they were planning to do in 1950,

174

443260

3000

他们打算在1950年做的事情

07:26

all the dreams of what they wanted to accomplish in 1950.

175

446260

5000

或者他们1950年想要实现的梦想

07:31

In fact, 1950 was so fascinating

176

451260

2000

事实上 1950年是不同凡响的一年

07:33

that for years thereafter,

177

453260

2000

即使过了好多年

07:35

people just kept talking about all the amazing things that happened,

178

455260

3000

人们还是不停地谈论那年发生的所有美好事情

07:38

in '51, '52, '53.

179

458260

2000

51年 52年 53年

07:40

Finally in 1954,

180

460260

2000

终于到了1954年

07:42

someone woke up and realized

181

462260

2000

人们醒悟过来

07:44

that 1950 had gotten somewhat passé.

182

464260

4000

1950年已成往事了

07:48

(Laughter)

183

468260

2000

（众人笑）

07:50

And just like that, the bubble burst.

184

470260

2000

就这样泡泡破了

07:52

(Laughter)

185

472260

2000

（众人笑）

07:54

And the story of 1950

186

474260

2000

1950年的情况

07:56

is the story of every year that we have on record,

187

476260

2000

以及每一年的情况我们都记录了下来

07:58

with a little twist, because now we've got these nice charts.

188

478260

3000

多亏了这些漂亮的图表我们的工作顺利多了

08:01

And because we have these nice charts, we can measure things.

189

481260

3000

有了这些漂亮的图表我们就能测量各种事物

08:04

We can say, "Well how fast does the bubble burst?"

190

484260

2000

我们会说：“泡泡破掉的速度有多快？”

08:06

And it turns out that we can measure that very precisely.

191

486260

3000

结果证明我们可以对此进行精准的测量

08:09

Equations were derived, graphs were produced,

192

489260

3000

等式出来了图表也做好了

08:12

and the net result

193

492260

2000

最终结果是

08:14

is that we find that the bubble bursts faster and faster

194

494260

3000

泡泡破掉的速度

08:17

with each passing year.

195

497260

2000

每年都在加快

08:19

We are losing interest in the past more rapidly.

196

499260

5000

我们对过去的遗忘不断加快

08:24

JM: Now a little piece of career advice.

197

504260

2000

好现在给大家一些发展事业的建议

08:26

So for those of you who seek to be famous,

198

506260

2000

如果你想成名

08:28

we can learn from the 25 most famous political figures,

199

508260

2000

我们可以向25位最著名的政治人物

08:30

authors, actors and so on.

200

510260

2000

作家演员学习

08:32

So if you want to become famous early on, you should be an actor,

201

512260

3000

如果你想早点成名你就应该做个演员

08:35

because then fame starts rising by the end of your 20s --

202

515260

2000

因为演员在20来岁的时候成名

08:37

you're still young, it's really great.

203

517260

2000

你还很年轻这是本钱

08:39

Now if you can wait a little bit, you should be an author,

204

519260

2000

如果你能等一等那就当个作家

08:41

because then you rise to very great heights,

205

521260

2000

因为你可以像马克.吐温这样

08:43

like Mark Twain, for instance: extremely famous.

206

523260

2000

成为文坛巨星

08:45

But if you want to reach the very top,

207

525260

2000

如果你想到达万人之上

08:47

you should delay gratification

208

527260

2000

你就不能安于现状

08:49

and, of course, become a politician.

209

529260

2000

要成为一个政治家

08:51

So here you will become famous by the end of your 50s,

210

531260

2000

到了快60岁的时候你就成名了

08:53

and become very, very famous afterward.

211

533260

2000

而且之后名声远扬

08:55

So scientists also tend to get famous when they're much older.

212

535260

3000

科学家通常在年纪一大把的时候才成名

08:58

Like for instance, biologists and physicists

213

538260

2000

生物学家和物理学家的名声

09:00

tend to be almost as famous as actors.

214

540260

2000

通常能跟演员的名声媲美

09:02

One mistake you should not do is become a mathematician.

215

542260

3000

有一个错误你不要犯那就是成为一个数学家

09:05

(Laughter)

216

545260

2000

（众人笑）

09:07

If you do that,

217

547260

2000

如果你成了数学家

09:09

you might think, "Oh great. I'm going to do my best work when I'm in my 20s."

218

549260

3000

你会想：“太好啦，我20多岁的时候会有最辉煌的成就。”

09:12

But guess what, nobody will really care.

219

552260

2000

谁知道人们连睬都不睬你

09:14

(Laughter)

220

554260

3000

（众人笑）

09:17

ELA: There are more sobering notes

221

557260

2000

n字格中

09:19

among the n-grams.

222

559260

2000

有些情况更为明了

09:21

For instance, here's the trajectory of Marc Chagall,

223

561260

2000

这是Marc Chagall的名声起落

09:23

an artist born in 1887.

224

563260

2000

他是出生于1887的一位艺术家

09:25

And this looks like the normal trajectory of a famous person.

225

565260

3000

他的名声起落看似没有什么异常

09:28

He gets more and more and more famous,

226

568260

4000

他的名声越来越大

09:32

except if you look in German.

227

572260

2000

然而如果你在德语书中搜索情况就不同了

09:34

If you look in German, you see something completely bizarre,

228

574260

2000

在德语书中你会看到非常奇怪的现象

09:36

something you pretty much never see,

229

576260

2000

闻所未闻见所未见

09:38

which is he becomes extremely famous

230

578260

2000

他先是名极一时

09:40

and then all of a sudden plummets,

231

580260

2000

但突然之间名声直线下落

09:42

going through a nadir between 1933 and 1945,

232

582260

3000

在1933年到1945年间达到了低谷

09:45

before rebounding afterward.

233

585260

3000

后来才回升

09:48

And of course, what we're seeing

234

588260

2000

当然实际情况是

09:50

is the fact Marc Chagall was a Jewish artist

235

590260

3000

Marc Chagall是一个犹太艺术家

09:53

in Nazi Germany.

236

593260

2000

当时身在纳粹德国

09:55

Now these signals

237

595260

2000

这些信号

09:57

are actually so strong

238

597260

2000

实在太强了

09:59

that we don't need to know that someone was censored.

239

599260

3000

我们无需知道谁被禁了

10:02

We can actually figure it out

240

602260

2000

我们事实上可以

10:04

using really basic signal processing.

241

604260

2000

通过非常基本的信号处理来找出答案

10:06

Here's a simple way to do it.

242

606260

2000

这里有一个简单的方法

10:08

Well, a reasonable expectation

243

608260

2000

一个人在特定时期内

10:10

is that somebody's fame in a given period of time

244

610260

2000

所拥有的知名度

10:12

should be roughly the average of their fame before

245

612260

2000

应当大致为他成名前与成名后知名度的平均值

10:14

and their fame after.

246

614260

2000

这么想是有道理的

10:16

So that's sort of what we expect.

247

616260

2000

我们也是怎么想的

10:18

And we compare that to the fame that we observe.

248

618260

3000

我们把观察到的知名度进行对比

10:21

And we just divide one by the other

249

621260

2000

我们把前者比上后者

10:23

to produce something we call a suppression index.

250

623260

2000

产生的结果叫做抑制指数

10:25

If the suppression index is very, very, very small,

251

625260

3000

如果抑制指数非常非常小

10:28

then you very well might be being suppressed.

252

628260

2000

那么你的知名度正在被抑制

10:30

If it's very large, maybe you're benefiting from propaganda.

253

630260

3000

如果数值非常大或许就表明你从宣传中获益
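A minimal sketch of the suppression index as described above, assuming it is computed as observed fame divided by expected fame (the talk says only that one is divided by the other, but a very small value is said to indicate suppression). The 10-year window, the dictionary input, and the function name are assumptions for illustration, and the frequencies below are invented, not real n-gram data.

```python
from statistics import mean


def suppression_index(freq_by_year, start, end, window=10):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of the mean frequency over the `window`
    years before `start` and the `window` years after `end`.
    The window length and the {year: frequency} input are assumptions.
    """
    def avg(years):
        values = [freq_by_year[y] for y in years if y in freq_by_year]
        if not values:
            raise ValueError("no data for the requested years")
        return mean(values)

    before = avg(range(start - window, start))
    after = avg(range(end + 1, end + 1 + window))
    observed = avg(range(start, end + 1))
    expected = (before + after) / 2
    if expected == 0:
        raise ValueError("expected fame is zero; the index is undefined")
    return observed / expected


# Invented frequencies: steady before, near zero during 1933-1945, higher after.
freq = {}
freq.update({year: 4.0 for year in range(1923, 1933)})
freq.update({year: 0.5 for year in range(1933, 1946)})
freq.update({year: 6.0 for year in range(1946, 1956)})

index = suppression_index(freq, 1933, 1945)
print(index)  # 0.1
```

With these made-up numbers the expected fame is 5.0 and the observed fame is 0.5, so the index is 0.1, the kind of value the talk places on the far left of the distribution.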

10:34

JM: Now you can actually look at

254

634260

2000

你还可以看到

10:36

the distribution of suppression indexes over whole populations.

255

636260

3000

压抑指数在总人数中的分布情况

10:39

So for instance, here --

256

639260

2000

这里有个例子

10:41

this suppression index is for 5,000 people

257

641260

2000

这是从没有明显抑制的英文书籍中

10:43

picked in English books where there's no known suppression --

258

643260

2000

选出的5000个人

10:45

it would be like this, basically tightly centered on one.

259

645260

2000

它是这个样子的基本上以1为中心

10:47

What you expect is basically what you observe.

260

647260

2000

实际情况与预想差不多

10:49

This is distribution as seen in Germany --

261

649260

2000

而这是在德文书籍中的分布情况

10:51

very different, it's shifted to the left.

262

651260

2000

与前者大为不同往左偏了

10:53

People talked about it twice less as it should have been.

263

653260

3000

人们对它的关注较预期要少了两倍

10:56

But much more importantly, the distribution is much wider.

264

656260

2000

更重要的是这个分布的跨度更宽

10:58

There are many people who end up on the far left on this distribution

265

658260

3000

不少人处于左边的部分

11:01

who are talked about 10 times fewer than they should have been.

266

661260

3000

人数比预期中少了10倍

11:04

But then also many people on the far right

267

664260

2000

而也有不少人处于更靠右的部分

11:06

who seem to benefit from propaganda.

268

666260

2000

他们的宣传起了作用

11:08

This picture is the hallmark of censorship in the book record.

269

668260

3000

这幅图反映了书籍记录中的审查情况

11:11

ELA: So culturomics

270

671260

2000

我们把这种方法

11:13

is what we call this method.

271

673260

2000

称作文化组学

11:15

It's kind of like genomics.

272

675260

2000

有点像基因组学

11:17

Except genomics is a lens on biology

273

677260

2000

只不过基因组学是生物学上

11:19

through the window of the sequence of bases in the human genome.

274

679260

3000

观察人类基因组序列的透镜

11:22

Culturomics is similar.

275

682260

2000

文化组学很类似

11:24

It's the application of massive-scale data collection analysis

276

684260

3000

它指的是对人类文明研究的

11:27

to the study of human culture.

277

687260

2000

大规模数据收集分析的应用

11:29

Here, instead of through the lens of a genome,

278

689260

2000

它使用的不是基因组这个透镜

11:31

through the lens of digitized pieces of the historical record.

279

691260

3000

而是用数字化的历史记录片段作为透镜

11:34

The great thing about culturomics

280

694260

2000

文化组学的优点是

11:36

is that everyone can do it.

281

696260

2000

人人都会用它

11:38

Why can everyone do it?

282

698260

2000

为什么呢

11:40

Everyone can do it because three guys,

283

700260

2000

这是因为这三个人

11:42

Jon Orwant, Matt Gray and Will Brockman over at Google,

284

702260

3000

谷歌的乔恩.奥温特迈特.格雷和威尔.布洛克曼

11:45

saw the prototype of the Ngram Viewer,

285

705260

2000

看到了n字格后

11:47

and they said, "This is so fun.

286

707260

2000

说：“这太有意思了，

11:49

We have to make this available for people."

287

709260

3000

我们得让所有人都用上它。”

11:52

So in two weeks flat -- the two weeks before our paper came out --

288

712260

2000

于是在我们的论文发表之前的整整两个星期中

11:54

they coded up a version of the Ngram Viewer for the general public.

289

714260

3000

他们编了一个面向公众的Ngram Viewer版本

11:57

And so you too can type in any word or phrase that you're interested in

290

717260

3000

现在你们也可以输入任何你感兴趣的单词或词组

12:00

and see its n-gram immediately --

291

720260

2000

查看它的n字格

12:02

also browse examples of all the various books

292

722260

2000

并阅览所有书籍中

12:04

in which your n-gram appears.

293

724260

2000

出现n字格的例句

12:06

JM: Now this was used over a million times on the first day,

294

726260

2000

这个工具在第一天就被使用了超过一百万次

12:08

and this is really the best of all the queries.

295

728260

2000

这真的是最棒的一个搜索词

12:10

So people want to be their best, put their best foot forward.

296

730260

3000

人们总想做到最好总想展示最好的一面

12:13

But it turns out in the 18th century, people didn't really care about that at all.

297

733260

3000

但是在18世纪人们对此并不在乎

12:16

They didn't want to be their best, they wanted to be their beft.

298

736260

3000

他们不想做到最好（“best”）而是“beft”

12:19

So what happened is, of course, this is just a mistake.

299

739260

3000

实际上这是个错别字

12:22

It's not that they strove for mediocrity,

300

742260

2000

这并不是因为人们不识字

12:24

it's just that the S used to be written differently, kind of like an F.

301

744260

3000

而是因为当时英文字母S的写法跟现在不同看起来像F

12:27

Now of course, Google didn't pick this up at the time,

302

747260

3000

当然谷歌没有意识到这一点

12:30

so we reported this in the Science article that we wrote.

303

750260

3000

于是我们对此在论文中做了报告

12:33

But it turns out this is just a reminder

304

753260

2000

这实际上只是一个小提示

12:35

that, although this is a lot of fun,

305

755260

2000

尽管这很有趣

12:37

when you interpret these graphs, you have to be very careful,

306

757260

2000

但是你在解读这些图表时仍须非常谨慎

12:39

and you have to adopt the base standards in the sciences.

307

759260

3000

你必须遵循基本的科学准则

12:42

ELA: People have been using this for all kinds of fun purposes.

308

762260

3000

人们使用它来寻求各种乐趣

12:45

(Laughter)

309

765260

7000

（众人笑）

12:52

Actually, we're not going to have to talk,

310

772260

2000

我们不打算多说

12:54

we're just going to show you all the slides and remain silent.

311

774260

3000

光给你们看这些幻灯片

12:57

This person was interested in the history of frustration.

312

777260

3000

这个用户对人们烦躁的历史很感兴趣

13:00

There's various types of frustration.

313

780260

3000

这里有不同类型的烦躁

13:03

If you stub your toe, that's a one A "argh."

314

783260

3000

如果你的脚趾被碰了你会说“啊” （“argh”）

13:06

If the planet Earth is annihilated by the Vogons

315

786260

2000

如果地球被外星人毁灭了

13:08

to make room for an interstellar bypass,

316

788260

2000

开了一条星际航道

13:10

that's an eight A "aaaaaaaargh."

317

790260

2000

那就是“啊啊啊啊啊啊啊啊” （"aaaaaaaargh"）

13:12

This person studies all the "arghs,"

318

792260

2000

这个人研究了不同长短的“啊” （“argh”）

13:14

from one through eight A's.

319

794260

2000

从1个啊到8个啊

13:16

And it turns out

320

796260

2000

结果

13:18

that the less-frequent "arghs"

321

798260

2000

那些使用频率较低的啊

13:20

are, of course, the ones that correspond to things that are more frustrating --

322

800260

3000

代表程度更高的烦躁

13:23

except, oddly, in the early 80s.

323

803260

3000

八十年代是个例外

13:26

We think that might have something to do with Reagan.

324

806260

2000

我们猜这可能跟里根总统有关

13:28

(Laughter)

325

808260

2000

（众人笑）

13:30

JM: There are many usages of this data,

326

810260

3000

这个数据库的用处很多

13:33

but the bottom line is that the historical record is being digitized.

327

813260

3000

但最重要的是这是一个数字化的历史记录

13:36

Google has started to digitize 15 million books.

328

816260

2000

谷歌已经开始对1500万本书进行数字化处理

13:38

That's 12 percent of all the books that have ever been published.

329

818260

2000

这占有史以来出版的全部书籍的12%
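The 12 percent figure follows from two numbers given earlier in the talk: about 15 million books scanned, out of about 129 million books ever published. A quick check, using those rounded figures as the only inputs:

```python
scanned = 15_000_000      # books digitized by Google, as stated in the talk
published = 129_000_000   # books ever published, as stated in the talk

share = scanned / published
print(f"{share:.1%}")  # 11.6%
```

That rounds to the 12 percent quoted by the speaker.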

13:40

It's a sizable chunk of human culture.

330

820260

3000

这是人类文明相当大的一部分

13:43

There's much more in culture: there's manuscripts, there are newspapers,

331

823260

3000

而文明还包括更多的内容有手稿报纸

13:46

there's things that are not text, like art and paintings.

332

826260

2000

非文字的内容例如艺术与绘画

13:48

These all happen to be on our computers,

333

828260

2000

这些内容都会出现在我们的电脑上

13:50

on computers across the world.

334

830260

2000

在世界各地的电脑上

13:52

And when that happens, that will transform the way we have

335

832260

3000

如果这成真了

13:55

to understand our past, our present and human culture.

336

835260

2000

我们对过去现在以及人类文明的认识就被改变了

13:57

Thank you very much.

337

837260

2000

非常感谢大家

13:59

(Applause)

338

839260

3000

（众人鼓掌）

Original video on YouTube.com

What we learned from 5 million books - YouTube
