What we learned from 5 million books

236,771 views ・ 2011-09-20

TED

Translator: Joyce Chou. Reviewer: Qi Gu

00:15

Erez Lieberman Aiden: Everyone knows

15260

2000

Erez Lieberman Aiden：大家都知道

00:17

that a picture is worth a thousand words.

17260

3000

一張圖勝過千言萬語

00:22

But we at Harvard

22260

2000

但我們在哈佛時

00:24

were wondering if this was really true.

24260

3000

卻在思考這道理是否真是如此

00:27

(Laughter)

27260

2000

(笑聲)

00:29

So we assembled a team of experts,

29260

4000

所以我們由來自哈佛大學

00:33

spanning Harvard, MIT,

33260

2000

麻省理工學院

00:35

The American Heritage Dictionary, The Encyclopedia Britannica

35260

3000

美國傳統英語詞典，大英百科全書

00:38

and even our proud sponsors,

38260

2000

甚至我們偉大的贊助商─Google的專家們

00:40

the Google.

40260

3000

組成一個團隊

00:43

And we cogitated about this

43260

2000

我們花了四年的時間

00:45

for about four years.

45260

2000

在思考這個問題

00:47

And we came to a startling conclusion.

47260

5000

然後我們得到了一個驚人的結論

00:52

Ladies and gentlemen, a picture is not worth a thousand words.

52260

3000

女士先生們，一張圖片其實不只勝過千言萬語

00:55

In fact, we found some pictures

55260

2000

事實上，我們發現某些圖片

00:57

that are worth 500 billion words.

57260

5000

更是勝過五千億個字

01:02

Jean-Baptiste Michel: So how did we get to this conclusion?

62260

2000

Jean-Baptiste Michel：我們是如何得出這項結論的呢？

01:04

So Erez and I were thinking about ways

64260

2000

Erez和我思考了不同的方式

01:06

to get a big picture of human culture

66260

2000

想更加了解人類文化

01:08

and human history: change over time.

68260

3000

以及人類歷史從古到今的變化的全景

01:11

So many books actually have been written over the years.

71260

2000

事實上，多年來已經出版了許多書籍。

01:13

So we were thinking, well the best way to learn from them

73260

2000

所以我們認為最好的學習方式

01:15

is to read all of these millions of books.

75260

2000

就是將這上百萬的書全讀過一遍

01:17

Now of course, if there's a scale for how awesome that is,

77260

3000

如果能有一個尺規來說明此舉的驚人程度

01:20

that has to rank extremely, extremely high.

80260

3000

這將會相當驚人

01:23

Now the problem is there's an X-axis for that,

83260

2000

但問題是這裡的X軸

01:25

which is the practical axis.

85260

2000

是表示實用程度

01:27

This is very, very low.

87260

2000

這相當不實用

01:29

(Applause)

89260

3000

(掌聲)

01:32

Now people tend to use an alternative approach,

92260

3000

現在人們希望用別的方式

01:35

which is to take a few sources and read them very carefully.

95260

2000

可以讀少一點書，但讀得非常仔細

01:37

This is extremely practical, but not so awesome.

97260

2000

這會相當實用，但這一點都不吸引人

01:39

What you really want to do

99260

3000

我們真正想做的是

01:42

is to get to the awesome yet practical part of this space.

102260

3000

要用一種吸引人且實用的方法來閱讀這些書

01:45

So it turns out there was a company across the river called Google

105260

3000

所以在河的對岸有間公司叫做Google

01:48

who had started a digitization project a few years back

108260

2000

他們幾年之前開始了一項數字化計畫

01:50

that might just enable this approach.

110260

2000

這項計畫讓我們能實踐剛說的方法

01:52

They have digitized millions of books.

112260

2000

他們已將數百萬本書給數位化

01:54

So what that means is, one could use computational methods

114260

3000

這意味著，我們可以透過電腦

01:57

to read all of the books in a click of a button.

117260

2000

簡單按個按鈕就能閱讀所有的書

01:59

That's very practical and extremely awesome.

119260

3000

這非常實用而且相當棒

02:03

ELA: Let me tell you a little bit about where books come from.

123260

2000

ELA：讓我為各位介紹這些書都來自何方

02:05

Since time immemorial, there have been authors.

125260

3000

自古以來，有非常多作家

02:08

These authors have been striving to write books.

128260

3000

這些作家一直努力寫作

02:11

And this became considerably easier

131260

2000

但現在寫作變得相當容易

02:13

with the development of the printing press some centuries ago.

133260

2000

這歸功於幾世紀前印刷術的革新

02:15

Since then, the authors have won

135260

3000

自那時起作家們

02:18

on 129 million distinct occasions,

138260

2000

能在一億兩千九百萬個不同的地方

02:20

publishing books.

140260

2000

出版書籍

02:22

Now if those books are not lost to history,

142260

2000

如果那些書沒有因為時代交替而遺失

02:24

then they are somewhere in a library,

144260

2000

那麼那些書可能在某個圖書館的一處

02:26

and many of those books have been getting retrieved from the libraries

146260

3000

有相當多書可以從圖書館中被借閱

02:29

and digitized by Google,

149260

2000

由Google將其數位化

02:31

which has scanned 15 million books to date.

151260

2000

迄今Google已經掃描了一千五百萬本書

02:33

Now when Google digitizes a book, they put it into a really nice format.

153260

3000

Google將一本書數位化，並以優良的型式呈現

02:36

Now we've got the data, plus we have metadata.

156260

2000

現在我們有了這些數據，加上這些詮釋資料

02:38

We have information about things like where was it published,

158260

3000

我們有了相關的資訊，比如出版地區，

02:41

who was the author, when was it published.

161260

2000

作者，出版時間

02:43

And what we do is go through all of those records

163260

3000

我們所做的就是透過這些記錄

02:46

and exclude everything that's not the highest quality data.

166260

4000

並剔除不是最精華的資料

02:50

What we're left with

170260

2000

我們後來得到的是

02:52

is a collection of five million books,

172260

3000

五百萬本書

02:55

500 billion words,

175260

3000

五千億個詞

02:58

a string of characters a thousand times longer

178260

2000

這是一串比人類基因組

03:00

than the human genome --

180260

3000

還要長上一千倍的字符

03:03

a text which, when written out,

183260

2000

如果寫成文章

03:05

would stretch from here to the Moon and back

185260

2000

將會是從這裡到月球來回距離

03:07

10 times over --

187260

2000

的十倍以上

03:09

a veritable shard of our cultural genome.

189260

4000

這是我們文化基因名副其實的的一部分

03:13

Of course what we did

193260

2000

當然當我們面臨

03:15

when faced with such outrageous hyperbole ...

195260

3000

如此誇張的情況時

03:18

(Laughter)

198260

2000

(笑聲)

03:20

was what any self-respecting researchers

200260

3000

我們也跟每一位有自尊心的研究人員一樣

03:23

would have done.

203260

3000

會做相同的事

03:26

We took a page out of XKCD,

206260

2000

我們也和四格漫畫一樣

03:28

and we said, "Stand back.

208260

2000

我們決定「等等

03:30

We're going to try science."

210260

2000

我們要用科學的方式來處理。」

03:32

(Laughter)

212260

2000

(笑聲)

03:34

JM: Now of course, we were thinking,

214260

2000

JM：當然，我們在思考

03:36

well let's just first put the data out there

216260

2000

首先我們先把資料提取出來

03:38

for people to do science to it.

218260

2000

讓其他人以科學的方式去分析

03:40

Now we're thinking, what data can we release?

220260

2000

現在我們在思考，我們能發行何種數據？

03:42

Well of course, you want to take the books

222260

2000

當然，我們想拿這些書

03:44

and release the full text of these five million books.

224260

2000

將這五百萬本書的內容全部釋出

03:46

Now Google, and Jon Orwant in particular,

226260

2000

現在Google，特別是Jon Orwant

03:48

told us a little equation that we should learn.

228260

2000

告訴我們一個我們該注意的小方程式

03:50

So you have five million, that is, five million authors

230260

3000

我們有五百萬本書，也就是有五百萬名作者

03:53

and five million plaintiffs is a massive lawsuit.

233260

3000

而五百萬名原告是一場龐大的訴訟

03:56

So, although that would be really, really awesome,

236260

2000

雖然這個過程是相當地驚人

03:58

again, that's extremely, extremely impractical.

238260

3000

但這還是極度的不切實際

04:01

(Laughter)

241260

2000

(笑聲)

04:03

Now again, we kind of caved in,

243260

2000

然後，我們似乎有點妥協

04:05

and we did the very practical approach, which was a bit less awesome.

245260

3000

我們試了比較實際的方式，這方法不怎麼吸引人

04:08

We said, well instead of releasing the full text,

248260

2000

我們認為，與其釋出全部的書籍資料

04:10

we're going to release statistics about the books.

250260

2000

我們選擇將這些書的數據資料給呈現出來

04:12

So take for instance "A gleam of happiness."

252260

2000

舉個例子「幸福的光」

04:14

It's four words; we call that a four-gram.

254260

2000

這是四個字，我們稱做「四字詞」

04:16

We're going to tell you how many times a particular four-gram

256260

2000

我們要告訴各位一個特定的四字詞

04:18

appeared in books in 1801, 1802, 1803,

258260

2000

從1801，1802，1803年開始出現在書本裡

04:20

all the way up to 2008.

260260

2000

直到2008年

04:22

That gives us a time series

100

262260

2000

這給我們一個時間軸來了解

04:24

of how frequently this particular sentence was used over time.

101

264260

2000

這些特定的字句從過去到現在的使用頻率

04:26

We do that for all the words and phrases that appear in those books,

102

266260

3000

我們計算了所有出現在這些書中的字詞

04:29

and that gives us a big table of two billion lines

103

269260

3000

彙整出的資料畫出了二十億條曲線

04:32

that tell us about the way culture has been changing.

104

272260

2000

這告訴了我們文化是如何改變的
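As a rough illustration of the table described above, here is a minimal sketch of counting n-grams per publication year. The function names, the whitespace tokenizer, and the toy corpus are all assumptions made for this example; the talk does not describe how the real table was built.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams_by_year(books, n):
    """Map each n-gram to a {year: count} table.

    `books` is an iterable of (year, text) pairs. Tokenization is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    table = defaultdict(Counter)
    for year, text in books:
        for gram in ngrams(text.lower().split(), n):
            table[gram][year] += 1
    return table


# Hypothetical toy corpus, not real data.
toy_books = [
    (1801, "a gleam of happiness crossed her face"),
    (1802, "there was a gleam of happiness and a gleam of happiness again"),
]
table = count_ngrams_by_year(toy_books, 4)
series = table[("a", "gleam", "of", "happiness")]
```

Each key of `table` plays the role of one row: an n-gram together with its year-by-year counts.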

04:34

ELA: So those two billion lines,

105

274260

2000

ELA：這二十億條曲線

04:36

we call them two billion n-grams.

106

276260

2000

我們稱為二十億組詞

04:38

What do they tell us?

107

278260

2000

這告訴了我們

04:40

Well the individual n-grams measure cultural trends.

108

280260

2000

每一組詞代表了不同的文化趨勢

04:42

Let me give you an example.

109

282260

2000

讓我舉個例子

04:44

Let's suppose that I am thriving,

110

284260

2000

假設我做了件不得了的事

04:46

then tomorrow I want to tell you about how well I did.

111

286260

2000

明天我要告訴你是多不得了

04:48

And so I might say, "Yesterday, I throve."

112

288260

3000

我可能會說「"Yesterday, I throve."」

04:51

Alternatively, I could say, "Yesterday, I thrived."

113

291260

3000

或者，我也可以說「"Yesterday, I thrived."」

04:54

Well which one should I use?

114

294260

3000

但我應該說哪一種呢？

04:57

How to know?

115

297260

2000

要怎麼知道

04:59

As of about six months ago,

116

299260

2000

大概在六個月前

05:01

the state of the art in this field

117

301260

2000

要知道這一領域最尖端的方法

05:03

is that you would, for instance,

118

303260

2000

你可能得要去詢問

05:05

go up to the following psychologist with fabulous hair,

119

305260

2000

一位有著時髦髮型的心理學家

05:07

and you'd say,

120

307260

2000

你可能會問

05:09

"Steve, you're an expert on the irregular verbs.

121

309260

3000

「史蒂夫，你是不規則動詞的專家。

05:12

What should I do?"

122

312260

2000

我該怎麼說呢？」

05:14

And he'd tell you, "Well most people say thrived,

123

314260

2000

而他會告訴你「嗯，大部分的人會說"thrive"

05:16

but some people say throve."

124

316260

3000

但有些人會說"throve"。」

05:19

And you also knew, more or less,

125

319260

2000

而你也或多或少知道

05:21

that if you were to go back in time 200 years

126

321260

3000

如果我們回到兩百年前

05:24

and ask the following statesman with equally fabulous hair,

127

324260

3000

去問一位同樣也有時髦髮型的政治家

05:27

(Laughter)

128

327260

3000

(笑聲)

05:30

"Tom, what should I say?"

129

330260

2000

「湯姆，我應該怎麼說呢？」

05:32

He'd say, "Well, in my day, most people throve,

130

332260

2000

他說「嗯，在我的年代，大部份的人說"throve"，

05:34

but some thrived."

131

334260

3000

但少部分的人說"thrived"」

05:37

So now what I'm just going to show you is raw data.

132

337260

2000

現在我要向各位展示原始數據

05:39

Two rows from this table of two billion entries.

133

339260

4000

這二十億條目資料中的其中兩條數據

05:43

What you're seeing is year by year frequency

134

343260

2000

各位將會看到的是"thrived"和"throve"兩個字

05:45

of "thrived" and "throve" over time.

135

345260

3000

在各年時期的出現頻率
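To make "year by year frequency" concrete, the sketch below divides raw counts by the number of words printed in each year and then compares the two verb forms. Normalizing by a yearly total is an assumption here, and every number is invented for illustration rather than taken from the data set.

```python
def relative_frequency(counts, totals):
    """Divide each year's raw count by that year's total word count."""
    return {year: counts.get(year, 0) / totals[year] for year in sorted(totals)}


# Hypothetical numbers chosen for illustration only.
totals = {1800: 1_000_000, 1900: 2_000_000, 2000: 4_000_000}
throve = {1800: 30, 1900: 20, 2000: 4}
thrived = {1800: 10, 1900: 40, 2000: 120}

throve_freq = relative_frequency(throve, totals)
thrived_freq = relative_frequency(thrived, totals)

# Years in which "thrived" is the more common form.
thrived_wins = [y for y in sorted(totals) if thrived_freq[y] > throve_freq[y]]
```

Comparing frequencies rather than raw counts matters because far more text is printed in later years.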

05:49

Now this is just two

136

349260

2000

這只是二十億筆資料中

05:51

out of two billion rows.

137

351260

3000

其中兩個詞條的資訊

05:54

So the entire data set

138

354260

2000

這全部的數據資料

05:56

is a billion times more awesome than this slide.

139

356260

3000

將會比此張投影片還要驚人億萬倍

05:59

(Laughter)

140

359260

2000

(笑聲)

06:01

(Applause)

141

361260

4000

(掌聲)

06:05

JM: Now there are many other pictures that are worth 500 billion words.

142

365260

2000

JM：還有其他圖片也具有五千億字的價值

06:07

For instance, this one.

143

367260

2000

例如這張

06:09

If you just take influenza,

144

369260

2000

如果談到感冒

06:11

you will see peaks at the time where you knew

145

371260

2000

從這幾個高峰點我們可以知道

06:13

big flu epidemics were killing people around the globe.

146

373260

3000

感冒病毒的大流行在全球造成人類死亡

06:16

ELA: If you were not yet convinced,

147

376260

3000

ELA：如果各位還不太相信

06:19

sea levels are rising,

148

379260

2000

其他像是海平面升高

06:21

so is atmospheric CO2 and global temperature.

149

381260

3000

大氣中的二氧化碳和全球暖化

06:24

JM: You might also want to have a look at this particular n-gram,

150

384260

3000

JM：你也許會想看看這組特別的詞組

06:27

and that's to tell Nietzsche that God is not dead,

151

387260

3000

「告訴尼采，上帝還沒死」

06:30

although you might agree that he might need a better publicist.

152

390260

3000

也許你可能還會認為，他可能需要一個更好的公關

06:33

(Laughter)

153

393260

2000

(笑聲)

06:35

ELA: You can get at some pretty abstract concepts with this sort of thing.

154

395260

3000

ELA：從這當中，各位也能獲得一些相當抽象的概念

06:38

For instance, let me tell you the history

155

398260

2000

例如，讓我跟各位說說

06:40

of the year 1950.

156

400260

2000

有關「1950年」的歷史

06:42

Pretty much for the vast majority of history,

157

402260

2000

幾乎在絕大多數的歷史裡

06:44

no one gave a damn about 1950.

158

404260

2000

沒有特別談論1950這一年

06:46

In 1700, in 1800, in 1900,

159

406260

2000

在1700年，在1800年，1900年

06:48

no one cared.

160

408260

3000

沒有人在乎

06:52

Through the 30s and 40s,

161

412260

2000

甚至到30年代和40年代

06:54

no one cared.

162

414260

2000

也沒有人在談論

06:56

Suddenly, in the mid-40s,

163

416260

2000

突然到了40年代中期

06:58

there started to be a buzz.

164

418260

2000

開始出現了風潮

07:00

People realized that 1950 was going to happen,

165

420260

2000

人們意識到1950年就要來臨

07:02

and it could be big.

166

422260

2000

這是件大事

07:04

(Laughter)

167

424260

3000

(笑聲)

07:07

But nothing got people interested in 1950

168

427260

3000

但也沒有因此讓大眾對該年份產生興趣

07:10

like the year 1950.

169

430260

3000

像是「那1950年」

07:13

(Laughter)

170

433260

3000

(笑聲)

07:16

People were walking around obsessed.

171

436260

2000

人們開始對這一年著迷

07:18

They couldn't stop talking

172

438260

2000

大家無法停止談論

07:20

about all the things they did in 1950,

173

440260

3000

有關他們在1950年所做的一切

07:23

all the things they were planning to do in 1950,

174

443260

3000

所有他們計畫要在1950年所做的事

07:26

all the dreams of what they wanted to accomplish in 1950.

175

446260

5000

所有他們要在1950年完成的夢想

07:31

In fact, 1950 was so fascinating

176

451260

2000

事實上，1950年跟往後幾年相較

07:33

that for years thereafter,

177

453260

2000

是相當迷人的一年

07:35

people just kept talking about all the amazing things that happened,

178

455260

3000

人們不停談論所有發生在

07:38

in '51, '52, '53.

179

458260

2000

'51，'52，'53年的驚奇事件

07:40

Finally in 1954,

180

460260

2000

直到1954年

07:42

someone woke up and realized

181

462260

2000

有人驚覺而且意識到

07:44

that 1950 had gotten somewhat passé.

182

464260

4000

1950年已經變得過時了

07:48

(Laughter)

183

468260

2000

(笑聲)

07:50

And just like that, the bubble burst.

184

470260

2000

這一切就像泡沫破滅一樣

07:52

(Laughter)

185

472260

2000

(笑聲)

07:54

And the story of 1950

186

474260

2000

1950年的情況

07:56

is the story of every year that we have on record,

187

476260

2000

其實就是我們數據上每一個年份的情況一樣

07:58

with a little twist, because now we've got these nice charts.

188

478260

3000

稍微編排一下，我們有這些精美的圖表

08:01

And because we have these nice charts, we can measure things.

189

481260

3000

因為有這些不錯的圖表，我們就能計算

08:04

We can say, "Well how fast does the bubble burst?"

190

484260

2000

我們可以了解「風潮消逝的速度是多快？」

08:06

And it turns out that we can measure that very precisely.

191

486260

3000

結果就是我們能很精確測量出一份數據

08:09

Equations were derived, graphs were produced,

192

489260

3000

有了方程式，也有圖表

08:12

and the net result

193

492260

2000

最終的結果就是

08:14

is that we find that the bubble bursts faster and faster

194

494260

3000

談論年份的風潮一年比一年

08:17

with each passing year.

195

497260

2000

消退的更快

08:19

We are losing interest in the past more rapidly.

196

499260

5000

我們對於過去的興趣日漸消逝
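The talk does not say which equations were used to measure how fast the bubble bursts. One plausible way to put a number on it, assumed here purely for illustration, is a half-life: the number of years after the peak until mentions first fall to half the peak value or lower. The function name and the data are hypothetical.

```python
def half_life(series):
    """Years from the peak until the value first drops to half the peak or less.

    `series` maps year -> frequency. Returns None if that never happens.
    """
    peak_year = max(series, key=series.get)
    half = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half:
            return year - peak_year
    return None


# Hypothetical frequencies for the string "1950", for illustration only.
mentions_1950 = {
    1948: 2.0, 1949: 5.0, 1950: 10.0, 1951: 9.0,
    1952: 8.0, 1953: 7.0, 1954: 4.5, 1955: 3.0,
}
years_to_half = half_life(mentions_1950)
```

Computing this value for every year on record and plotting it against the year would show whether the decay is speeding up.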

08:24

JM: Now a little piece of career advice.

197

504260

2000

JM：這張圖是有關職業建議

08:26

So for those of you who seek to be famous,

198

506260

2000

對於那些想成名的人

08:28

we can learn from the 25 most famous political figures,

199

508260

2000

我們可以知道二十五位最有名的政治人物

08:30

authors, actors and so on.

200

510260

2000

作家、演員等等

08:32

So if you want to become famous early on, you should be an actor,

201

512260

3000

如果各位想在年輕時就成名，那麼各位應該要當演員

08:35

because then fame starts rising by the end of your 20s --

202

515260

2000

因為你的名氣會從二十歲後開始累積

08:37

you're still young, it's really great.

203

517260

2000

那時正值青春年華，會相當不錯

08:39

Now if you can wait a little bit, you should be an author,

204

519260

2000

如果各位有耐心一點，那麼就應該當個作家

08:41

because then you rise to very great heights,

205

521260

2000

因為各位就能攀上高峰

08:43

like Mark Twain, for instance: extremely famous.

206

523260

2000

成為像是馬克吐溫這樣有名望的作家

08:45

But if you want to reach the very top,

207

525260

2000

但如果各位想攀上最頂尖的位置

08:47

you should delay gratification

208

527260

2000

就得延後滿足自己的慾望

08:49

and, of course, become a politician.

209

529260

2000

然後當一位政治家

08:51

So here you will become famous by the end of your 50s,

210

531260

2000

那麼各位會在五十歲過後開始成名

08:53

and become very, very famous afterward.

211

533260

2000

然後你的名氣會在未來持續延續

08:55

So scientists also tend to get famous when they're much older.

212

535260

3000

科學家也往往是在老年時才成名

08:58

Like for instance, biologists and physicists

213

538260

2000

而生物學家和物理學家一樣

09:00

tend to be almost as famous as actors.

214

540260

2000

往往也是和演員一樣著名

09:02

One mistake you should not do is become a mathematician.

215

542260

3000

唯一不要做的職業就是變成數學家

09:05

(Laughter)

216

545260

2000

(笑聲)

09:07

If you do that,

217

547260

2000

如果各位真要做這行

09:09

you might think, "Oh great. I'm going to do my best work when I'm in my 20s."

218

549260

3000

各位可能會想「太好了，當我在二十多歲時，我會盡一切努力。」

09:12

But guess what, nobody will really care.

219

552260

2000

但事實上，沒人會真正去在乎你所做的事

09:14

(Laughter)

220

554260

3000

(笑聲)

09:17

ELA: There are more sobering notes

221

557260

2000

ELA：在我們的資料裡

09:19

among the n-grams.

222

559260

2000

還有其他更發人省思的紀錄

09:21

For instance, here's the trajectory of Marc Chagall,

223

561260

2000

例如馬克‧夏卡爾的名字出現的頻率軌跡

09:23

an artist born in 1887.

224

563260

2000

夏卡爾是位1887年出生的藝術家

09:25

And this looks like the normal trajectory of a famous person.

225

565260

3000

這看起來是一位名人名字正常出現在書中的軌跡

09:28

He gets more and more and more famous,

226

568260

4000

他的名氣日益響亮

09:32

except if you look in German.

227

572260

2000

但如果看德國的數據就不是如此

09:34

If you look in German, you see something completely bizarre,

228

574260

2000

如果看德國的數據，會看到某部份是非常奇怪的

09:36

something you pretty much never see,

229

576260

2000

這是幾乎不太可能看到的

09:38

which is he becomes extremely famous

230

578260

2000

就是他變得非常有名

09:40

and then all of a sudden plummets,

231

580260

2000

卻突然在1933年至1945年間

09:42

going through a nadir between 1933 and 1945,

232

582260

3000

聲勢跌落谷底

09:45

before rebounding afterward.

233

585260

3000

又反彈回升

09:48

And of course, what we're seeing

234

588260

2000

當然我們看的出來

09:50

is the fact Marc Chagall was a Jewish artist

235

590260

3000

這是因為馬克‧夏卡爾是一位猶太裔藝術家

09:53

in Nazi Germany.

236

593260

2000

當時德國是納粹統治

09:55

Now these signals

237

595260

2000

這些指標

09:57

are actually so strong

238

597260

2000

事實上相當明確

09:59

that we don't need to know that someone was censored.

239

599260

3000

我們不需要知道有人在審查書籍

10:02

We can actually figure it out

240

602260

2000

我們能運用基本的信號運算方式

10:04

using really basic signal processing.

241

604260

2000

實際了解當時狀況

10:06

Here's a simple way to do it.

242

606260

2000

我們可以用簡單的方式來做

10:08

Well, a reasonable expectation

243

608260

2000

合理的預期是

10:10

is that somebody's fame in a given period of time

244

610260

2000

在一段特定的時間裡某人的名氣指數

10:12

should be roughly the average of their fame before

245

612260

2000

應該會是他們成名前

10:14

and their fame after.

246

614260

2000

和成名後的指數的平均值

10:16

So that's sort of what we expect.

247

616260

2000

這大概是我們預期的結果

10:18

And we compare that to the fame that we observe.

248

618260

3000

我們比較了我們觀察到的名人

10:21

And we just divide one by the other

249

621260

2000

我們將前後的數值相除

10:23

to produce something we call a suppression index.

250

623260

2000

得到的數值，我們稱作抑制指數

10:25

If the suppression index is very, very, very small,

251

625260

3000

如果抑制指數的值非常的小

10:28

then you very well might be being suppressed.

252

628260

2000

那麼就表示此人也許遭受到打壓

10:30

If it's very large, maybe you're benefiting from propaganda.

253

630260

3000

但如果數值非常大，也許此人獲得大量的推廣
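The suppression index described above can be sketched as follows: expected fame is the average of fame before and fame after the period, and the index is observed fame divided by that expectation. The five-year window, the function name, and the toy series are assumptions for this example, not details given in the talk.

```python
from statistics import mean


def suppression_index(series, start, end, window=5):
    """Observed fame in [start, end] divided by expected fame.

    Expected fame is the average of the mean fame in the `window` years
    before `start` and the mean fame in the `window` years after `end`.
    `series` maps year -> frequency of the person's name.
    """
    observed = mean(series[y] for y in range(start, end + 1))
    before = mean(series[y] for y in range(start - window, start))
    after = mean(series[y] for y in range(end + 1, end + 1 + window))
    expected = (before + after) / 2
    return observed / expected


# Hypothetical frequencies, for illustration only.
series = {}
series.update({y: 8.0 for y in range(1928, 1933)})   # before
series.update({y: 1.0 for y in range(1933, 1946)})   # during
series.update({y: 12.0 for y in range(1946, 1951)})  # after
index = suppression_index(series, 1933, 1945)
```

With these made-up numbers the expected fame is 10 and the observed fame is 1, so the index comes out at about 0.1, the kind of very small value the talk associates with suppression.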

10:34

JM: Now you can actually look at

254

634260

2000

JM：各位現在可以看到

10:36

the distribution of suppression indexes over whole populations.

255

636260

3000

抑制指數在抽樣整體人數中的分佈情況

10:39

So for instance, here --

256

639260

2000

所以，例如這裡 --

10:41

this suppression index is for 5,000 people

257

641260

2000

這個抑制指數的抽樣人數是五千人

10:43

picked in English books where there's no known suppression --

258

643260

2000

選自出版時期沒有打壓限制的英文書籍來做調查

10:45

it would be like this, basically tightly centered on one.

259

645260

2000

曲線基本上會在數值1的地方呈現高峰

10:47

What you expect is basically what you observe.

260

647260

2000

基本上預期的會和觀察到的數值是相同的

10:49

This is distribution as seen in Germany --

261

649260

2000

這份分佈圖則是德國的部分 --

10:51

very different, it's shifted to the left.

262

651260

2000

相當不同，曲線移往左側

10:53

People talked about it twice less as it should have been.

263

653260

3000

人們談論事物的次數比預期的少了兩倍

10:56

But much more importantly, the distribution is much wider.

264

656260

2000

更重要的是，整體分佈的情況更寬廣

10:58

There are many people who end up on the far left on this distribution

265

658260

3000

有相當多人是落在圖表較左側的位置

11:01

who are talked about 10 times fewer than they should have been.

266

661260

3000

因為他們比應該被提及的次數少了十倍

11:04

But then also many people on the far right

267

664260

2000

但也有相當多人是落在較右側的部分

11:06

who seem to benefit from propaganda.

268

666260

2000

似乎是因為被大量宣傳

11:08

This picture is the hallmark of censorship in the book record.

269

668260

3000

這張圖是明顯看出書本中具有審查制度
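A small sketch of how the two distributions might be compared: take the median index and the share of people at either extreme. The cutoffs (0.1 and 10) and both sample lists are assumptions invented for this example; the talk only describes the shapes of the curves.

```python
from statistics import median


def summarize(indexes, low=0.1, high=10.0):
    """Median index plus the share of people at or beyond each cutoff."""
    n = len(indexes)
    return {
        "median": median(indexes),
        "share_suppressed": sum(1 for x in indexes if x <= low) / n,
        "share_boosted": sum(1 for x in indexes if x >= high) / n,
    }


# Hypothetical suppression indexes, for illustration only.
english = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
german = [0.05, 0.1, 0.5, 0.5, 0.6, 12.0]

english_summary = summarize(english)
german_summary = summarize(german)
```

A median near one with empty tails matches the "tightly centered on one" case, while a lower median with populated tails matches the shifted, wider distribution.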

11:11

ELA: So culturomics

270

671260

2000

ELA：文化組學

11:13

is what we call this method.

271

673260

2000

是我們用的方法

11:15

It's kind of like genomics.

272

675260

2000

這和基因組學有些類似

11:17

Except genomics is a lens on biology

273

677260

2000

不過基因組學是透過生物學

11:19

through the window of the sequence of bases in the human genome.

274

679260

3000

基本的序列基礎來檢視人類基因組

11:22

Culturomics is similar.

275

682260

2000

文化組學是類似的

11:24

It's the application of massive-scale data collection analysis

276

684260

3000

這是應用收集分析規模龐大的數據

11:27

to the study of human culture.

277

687260

2000

來研究人類文化

11:29

Here, instead of through the lens of a genome,

278

689260

2000

不透過檢視基因組

11:31

through the lens of digitized pieces of the historical record.

279

691260

3000

而是檢視歷史紀錄的數位資料

11:34

The great thing about culturomics

280

694260

2000

文化組學的好處是

11:36

is that everyone can do it.

281

696260

2000

每個人都能執行

11:38

Why can everyone do it?

282

698260

2000

為何每個人都能做呢？

11:40

Everyone can do it because three guys,

283

700260

2000

因為這三位人士

11:42

Jon Orwant, Matt Gray and Will Brockman over at Google,

284

702260

3000

Google的Jon Orwant，Matt Gray還有Will Brockman

11:45

saw the prototype of the Ngram Viewer,

285

705260

2000

他們看到Ngram瀏覽器的原型

11:47

and they said, "This is so fun.

286

707260

2000

他們說「這太有趣了。」

11:49

We have to make this available for people."

287

709260

3000

我們要讓大家都可以使用這功能

11:52

So in two weeks flat -- the two weeks before our paper came out --

288

712260

2000

所以在兩週的時間 -- 我們的報告出來的兩週前 --

11:54

they coded up a version of the Ngram Viewer for the general public.

289

714260

3000

他們編寫了一個大眾版本的Ngram瀏覽器

11:57

And so you too can type in any word or phrase that you're interested in

290

717260

3000

各位可以打上任何各位有興趣的字或詞組

12:00

and see its n-gram immediately --

291

720260

2000

然後立即看到該字詞的頻率變化 --

12:02

also browse examples of all the various books

292

722260

2000

同時根據你搜尋的字詞

12:04

in which your n-gram appears.

293

724260

2000

瀏覽不同書籍中的各種例子

12:06

JM: Now this was used over a million times on the first day,

294

726260

2000

JM：這功能在首日就被使用了超過一百萬次

12:08

and this is really the best of all the queries.

295

728260

2000

這也是各種查詢工具中最好的一個

12:10

So people want to be their best, put their best foot forward.

296

730260

3000

人們希望做到最好的，以最好的狀態像前進

12:13

But it turns out in the 18th century, people didn't really care about that at all.

297

733260

3000

但事實證明在18世紀，人們一點也不關心這一切

12:16

They didn't want to be their best, they wanted to be their beft.

298

736260

3000

他們不想做到最好，他們想變成"beft"

12:19

So what happened is, of course, this is just a mistake.

299

739260

3000

這是怎麼回事，當然這只是個錯誤

12:22

It's not that they strove for mediocrity,

300

742260

2000

這並不是說他們想要平凡

12:24

it's just that the S used to be written differently, kind of like an F.

301

744260

3000

這只是因為"S"常被寫的不一樣，寫得像"F"

12:27

Now of course, Google didn't pick this up at the time,

302

747260

3000

當然，Google並沒有挑出來

12:30

so we reported this in the Science article that we wrote.

303

750260

3000

所以我們在自己寫科學文章中提到此事

12:33

But it turns out this is just a reminder

304

753260

2000

不過這只是個提醒

12:35

that, although this is a lot of fun,

305

755260

2000

雖然這相當有趣

12:37

when you interpret these graphs, you have to be very careful,

306

757260

2000

當你要解讀這些圖表，你必須非常謹慎

12:39

and you have to adopt the base standards in the sciences.

307

759260

3000

而且必須採納科學的基礎標準

12:42

ELA: People have been using this for all kinds of fun purposes.

308

762260

3000

ELA：大家一直在使用這工具來滿足各種樂趣

12:45

(Laughter)

309

765260

7000

(笑聲)

12:52

Actually, we're not going to have to talk,

310

772260

2000

事實上，我們不需要說明的

12:54

we're just going to show you all the slides and remain silent.

311

774260

3000

我們原本只想播放所有的投影片然後在一旁保持沉默

12:57

This person was interested in the history of frustration.

312

777260

3000

此人對於挫折的歷史感興趣

13:00

There's various types of frustration.

313

780260

3000

挫折有非常多種方式

13:03

If you stub your toe, that's a one A "argh."

314

783260

3000

如果你踢到腳趾，哀叫聲「啊」就是一個"A"的"argh"

13:06

If the planet Earth is annihilated by the Vogons

315

786260

2000

如果地球被外星人毀滅

13:08

to make room for an interstellar bypass,

316

788260

2000

變成星際間的通道

13:10

that's an eight A "aaaaaaaargh."

317

790260

2000

那麼哀叫聲「啊」就是有八個"A"的"aaaaaaaargh"

13:12

This person studies all the "arghs,"

318

792260

2000

此人研究了所有書籍上出現的哀叫聲「啊」

13:14

from one through eight A's.

319

794260

2000

有從一個"A"到八個"A"
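A query like this one can be sketched with a regular expression that matches "argh" spelled with one to eight A's and tallies each spelling by its number of A's. The pattern, the function name, and the sample sentence are assumptions for this example, not the query the person actually ran.

```python
import re
from collections import Counter

# One to eight A's followed by "rgh", as a whole word, ignoring case.
ARGH = re.compile(r"\b(a{1,8})rgh\b", re.IGNORECASE)


def count_arghs(text):
    """Count "argh" spellings, keyed by the number of A's (1 through 8)."""
    return Counter(len(m.group(1)) for m in ARGH.finditer(text))


# Hypothetical sample text, for illustration only.
sample = "Argh, my toe! " + "a" * 8 + "rgh, the Vogons! argh."
argh_counts = count_arghs(sample)
```

Running the same count over the text of each year would give one time series per spelling.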

13:16

And it turns out

320

796260

2000

結果是

13:18

that the less-frequent "arghs"

321

798260

2000

較不頻繁的「啊」“arghs”

13:20

are, of course, the ones that correspond to things that are more frustrating --

322

800260

3000

對應了那些相對較令人沮喪的的事情

13:23

except, oddly, in the early 80s.

323

803260

3000

也有例外，奇怪的是在80年代初

13:26

We think that might have something to do with Reagan.

324

806260

2000

我們認為這也許是受到雷根的影響

13:28

(Laughter)

325

808260

2000

(笑聲)

13:30

JM: There are many usages of this data,

326

810260

3000

JM：這份書據資料有相當多用途

13:33

but the bottom line is that the historical record is being digitized.

327

813260

3000

不過最終就是歷史紀錄都被數位化了

13:36

Google has started to digitize 15 million books.

328

816260

2000

Google已經開始將一千五百萬本書數位化

13:38

That's 12 percent of all the books that have ever been published.

329

818260

2000

其中百分之十二的書是已出版的

13:40

It's a sizable chunk of human culture.

330

820260

3000

這涵蓋了相當大量的人類文化

13:43

There's much more in culture: there's manuscripts, there's newspapers,

331

823260

3000

這當中有非常多的文化資料：裡頭有手稿，報紙

13:46

there's things that are not text, like art and paintings.

332

826260

2000

也有不是文字的資料，像是藝術品和畫作

13:48

These all happen to be on our computers,

333

828260

2000

現在這都存放在我們的電腦裡

13:50

on computers across the world.

334

830260

2000

在世界各處的電腦裡

13:52

And when that happens, that will transform the way we have

335

832260

3000

如果這一切成真，就會改變

13:55

to understand our past, our present and human culture.

336

835260

2000

我們了解過去、現在和人類文化的方式

13:57

Thank you very much.

337

837260

2000

非常謝謝各位

13:59

(Applause)

338

839260

3000

(掌聲)
