How computers learn to recognize objects instantly | Joseph Redmon

1,132,480 views ・ 2017-08-18

TED

Translator: chunhua zhang  Reviewer: 易帆余

00:12

Ten years ago,

00:13

computer vision researchers thought that getting a computer

00:16

to tell the difference between a cat and a dog

00:19

would be almost impossible,

00:21

even with the significant advance in the state of artificial intelligence.

00:25

Now we can do it at a level greater than 99 percent accuracy.

00:29

This is called image classification --

00:31

give it an image, put a label to that image --

00:34

and computers know thousands of other categories as well.


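The classification step described above can be sketched in a few lines. A classifier produces one score per known category, and the label is the highest-scoring one. This is a minimal sketch that assumes the classifier's raw scores (logits) have already been computed. The category names and score values are hypothetical placeholders, not Darknet output.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels, k=3):
    """Return the top-k (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores from a tiny 4-category classifier.
labels = ["malamute", "husky", "tabby cat", "teapot"]
logits = [4.1, 3.2, 0.5, -2.0]
for label, p in classify(logits, labels):
    print(f"{label}: {p:.3f}")
```

Ranking the whole list, rather than keeping only the single best label, lets a classifier report fine-grained alternatives such as several closely related dog breeds.
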
00:38

I'm a graduate student at the University of Washington,

00:41

and I work on a project called Darknet,

00:43

which is a neural network framework

00:45

for training and testing computer vision models.

00:47

So let's just see what Darknet thinks

00:50

of this image that we have.

00:54

When we run our classifier

00:56

on this image,

00:57

we see we don't just get a prediction of dog or cat,

01:00

we actually get specific breed predictions.

01:02

That's the level of granularity we have now.

01:04

And it's correct.

01:06

My dog is in fact a malamute.

01:08

So we've made amazing strides in image classification,

01:13

but what happens when we run our classifier

01:15

on an image that looks like this?

01:18

Well ...

01:24

We see that the classifier comes back with a pretty similar prediction.

01:28

And it's correct, there is a malamute in the image,

01:31

but just given this label, we don't actually know that much

01:35

about what's going on in the image.

01:36

We need something more powerful.

01:39

I work on a problem called object detection,

01:41

where we look at an image and try to find all of the objects,

01:44

put bounding boxes around them

01:46

and say what those objects are.


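A classifier returns one label for the whole image. A detector instead returns a list of objects, each with a box, a label and a confidence. The sketch below shows one plausible way to represent that output. The field names, the (x, y, width, height) pixel convention and the example values are assumptions made for illustration, not Darknet's actual data format.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object: what it is, how confident, and where it is."""
    label: str
    score: float   # confidence in [0, 1]
    x: float       # left edge of the bounding box, in pixels
    y: float       # top edge of the bounding box, in pixels
    width: float
    height: float

    def corners(self):
        """Return the box as (x_min, y_min, x_max, y_max)."""
        return (self.x, self.y, self.x + self.width, self.y + self.height)


# Hypothetical detector output for a photo of a dog, a cat and a book.
detections = [
    Detection("dog", 0.92, 40, 120, 300, 260),
    Detection("cat", 0.88, 380, 150, 180, 200),
    Detection("book", 0.41, 520, 20, 60, 90),
]

for d in detections:
    print(d.label, round(d.score, 2), d.corners())
```
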
01:48

So here's what happens when we run a detector on this image.

01:53

Now, with this kind of result,

01:55

we can do a lot more with our computer vision algorithms.

01:58

We see that it knows that there's a cat and a dog.

02:01

It knows their relative locations,

02:03

their size.


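Once each object has a box, questions such as which animal is further to the right, or which one is bigger, reduce to simple geometry on the box coordinates. This is an illustrative sketch. It assumes boxes are given as (x_min, y_min, x_max, y_max) in pixels, with y increasing downward as in most image formats. The helper names and box values are hypothetical.

```python
def center(box):
    """Center point (cx, cy) of a box (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def area(box):
    """Box area in square pixels (0 if the box is degenerate)."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def describe(name_a, box_a, name_b, box_b):
    """Describe where A sits relative to B and how their sizes compare."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    horizontal = "right of" if ax > bx else "left of"
    vertical = "below" if ay > by else "above"  # image y grows downward
    ratio = area(box_a) / area(box_b)
    return f"{name_a} is {horizontal} and {vertical} {name_b}, {ratio:.2f}x its area"


# Hypothetical boxes for a dog and a cat.
dog = (40, 120, 340, 380)
cat = (380, 200, 560, 380)
print(describe("cat", cat, "dog", dog))
```
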
02:04

It may even know some extra information.

02:06

There's a book sitting in the background.

02:09

And if you want to build a system on top of computer vision,

02:12

say a self-driving vehicle or a robotic system,

02:15

this is the kind of information that you want.

02:18

You want something so that you can interact with the physical world.

02:22

Now, when I started working on object detection,

02:24

it took 20 seconds to process a single image.

02:28

And to get a feel for why speed is so important in this domain,

02:32

here's an example of an object detector

02:35

that takes two seconds to process an image.

02:37

So this is 10 times faster

02:40

than the 20-seconds-per-image detector,

02:44

and you can see that by the time it makes predictions,

02:46

the entire state of the world has changed,

02:49

and this wouldn't be very useful

02:52

for an application.

02:53

If we speed this up by another factor of 10,

02:56

this is a detector running at five frames per second.

02:58

This is a lot better,

03:00

but for example,

03:02

if there's any significant movement,

03:04

I wouldn't want a system like this driving my car.


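The speed argument above comes down to simple arithmetic. A detector that needs t seconds per image can produce at most 1/t results per second, and everything in the scene keeps moving during those t seconds. The sketch below tabulates this for the three detectors described (20 s, 2 s and 0.2 s per image). The 15 m/s object speed (about 54 km/h) is a hypothetical value chosen only to make the staleness concrete.

```python
def frames_per_second(seconds_per_image):
    """Maximum throughput of a detector that processes one image at a time."""
    return 1.0 / seconds_per_image


def distance_moved(seconds_per_image, speed_m_per_s):
    """How far an object moves while a single prediction is being computed."""
    return seconds_per_image * speed_m_per_s


SPEED = 15.0  # hypothetical object speed in metres per second (~54 km/h)

# 20 s per image, then 10x faster (2 s), then 10x faster again (0.2 s).
for latency in (20.0, 2.0, 0.2):
    fps = frames_per_second(latency)
    moved = distance_moved(latency, SPEED)
    print(f"{latency:>5} s/image -> {fps:.2f} fps, object moves {moved:.1f} m")
```

Even at five frames per second, an object at this hypothetical speed moves three metres between consecutive predictions.
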
03:08

This is our detection system running in real time on my laptop.

03:12

So it smoothly tracks me as I move around the frame,

03:15

and it's robust to a wide variety of changes in size,

03:21

pose,

03:23

forward, backward.

03:24

This is great.

03:26

This is what we really need

03:27

if we're going to build systems on top of computer vision.

03:30

(Applause)

03:36

So in just a few years,

03:38

we've gone from 20 seconds per image

03:40

to 20 milliseconds per image, a thousand times faster.


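The thousand-fold figure follows directly from the two per-image times. Converting the new latency into a frame rate, a value derived here rather than stated above, also shows why it counts as real time:

```latex
\frac{20\ \text{s}}{20\ \text{ms}} = \frac{20\ \text{s}}{0.02\ \text{s}} = 1000,
\qquad
\frac{1}{20\ \text{ms}} = \frac{1}{0.02\ \text{s}} = 50\ \text{images per second}.
```
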
03:44

How did we get there?

03:45

Well, in the past, object detection systems

03:49

would take an image like this

03:50

and split it into a bunch of regions

03:53

and then run a classifier on each of these regions,

03:56

and high scores for that classifier

03:59

would be considered detections in the image.

04:02

But this involved running a classifier thousands of times over an image,

04:06

thousands of neural network evaluations to produce detection.


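The older region-by-region approach described above can be sketched as follows. Enumerate windows at several sizes, run a classifier on every crop, and keep the windows that score above a threshold. This is a simplified illustration that assumes square windows on a fixed stride. The window sizes, the stride and the stand-in classifier are hypothetical, and the sketch ignores refinements such as smarter region proposals.

```python
def sliding_windows(width, height, sizes, stride):
    """Yield square windows (x, y, size) covering the image at several scales."""
    for size in sizes:
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                yield (x, y, size)


def detect_by_windows(width, height, classify, threshold,
                      sizes=(64, 128, 256), stride=16):
    """Run `classify` once per window and return (detections, number_of_calls).

    `classify(window)` stands in for a full neural-network evaluation and
    must return a (label, score) pair.
    """
    detections = []
    calls = 0
    for window in sliding_windows(width, height, sizes, stride):
        calls += 1
        label, score = classify(window)
        if score >= threshold:
            detections.append((window, label, score))
    return detections, calls


def fake_classifier(window):
    """Hypothetical classifier that 'sees' a dog in exactly one window."""
    return ("dog", 0.9) if window == (128, 128, 128) else ("background", 0.1)


found, calls = detect_by_windows(448, 448, fake_classifier, threshold=0.5)
print(f"{calls} classifier evaluations for one 448x448 image")
print(found)
```

Even this coarse grid needs over a thousand classifier calls for a single image. Finer strides or more scales push the count into the thousands.
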
04:11

Instead, we trained a single network to do all of detection for us.

04:15

It produces all of the bounding boxes and class probabilities simultaneously.

04:20

With our system, instead of looking at an image thousands of times

04:24

to produce detection,

04:25

you only look once,

04:26

and that's why we call it the YOLO method of object detection.


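In the single-network approach, one forward pass yields a grid of predictions that is decoded into boxes directly. The sketch below loosely follows the output layout described in the original YOLO paper. In that layout, an S×S grid has each cell predict B boxes (x, y, w, h, confidence) plus C class probabilities shared by the cell. The sketch rests on several assumptions:

- x and y are offsets of the box centre within its cell.
- w and h are already-decoded fractions of the whole image.
- The network itself is replaced by a hand-written grid.
- Post-processing of overlapping boxes is omitted.

```python
def decode_grid(grid, num_boxes, class_names, threshold):
    """Turn one S x S grid of predictions into a list of detections.

    grid[row][col] is a flat list: num_boxes * [x, y, w, h, confidence]
    followed by one conditional probability per class (assumed layout).
    Returns (label, score, cx, cy, w, h) tuples with coordinates given as
    fractions of the image size.
    """
    s = len(grid)
    detections = []
    for row in range(s):
        for col in range(s):
            cell = grid[row][col]
            class_probs = cell[num_boxes * 5:]
            for b in range(num_boxes):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                # Class-specific score = box confidence * P(class | object).
                scores = [conf * p for p in class_probs]
                best = max(range(len(scores)), key=scores.__getitem__)
                if scores[best] < threshold:
                    continue
                # Convert the in-cell offset into image-relative centre coordinates.
                cx = (col + x) / s
                cy = (row + y) / s
                detections.append((class_names[best], scores[best], cx, cy, w, h))
    return detections


# Hypothetical 2 x 2 grid with one box per cell and two classes (cat, dog).
empty = [0.5, 0.5, 0.1, 0.1, 0.05, 0.5, 0.5]
grid = [
    [empty, [0.5, 0.5, 0.4, 0.6, 0.9, 0.1, 0.9]],
    [[0.25, 0.75, 0.3, 0.3, 0.8, 0.95, 0.05], empty],
]
for det in decode_grid(grid, num_boxes=1, class_names=["cat", "dog"], threshold=0.3):
    print(det)
```

The network runs once. Decoding is then a cheap pass over S×S×B predictions rather than thousands of separate classifier calls.
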
04:31

So with this speed, we're not just limited to images;

04:35

we can process video in real time.

04:37

And now, instead of just seeing that cat and dog,

04:40

we can see them move around and interact with each other.

04:46

This is a detector that we trained

04:48

on 80 different classes

04:52

in Microsoft's COCO dataset.

04:56

It has all sorts of things like spoon and fork, bowl,

04:59

common objects like that.

05:02

It has a variety of more exotic things:

05:05

animals, cars, zebras, giraffes.

05:08

And now we're going to do something fun.

05:10

We're just going to go out into the audience

05:12

and see what kind of things we can detect.

05:14

Does anyone want a stuffed animal?

05:17

There are some teddy bears out there.

05:21

And we can turn down our threshold for detection a little bit,

05:26

so we can find more of you guys out in the audience.


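Turning the detection threshold down admits lower-confidence boxes, trading more false positives for fewer missed objects. This is a minimal sketch that assumes detections arrive as (label, score) pairs. The labels and scores are made up for illustration.

```python
def filter_detections(detections, threshold):
    """Keep only detections whose confidence score reaches the threshold."""
    return [(label, score) for label, score in detections if score >= threshold]


# Hypothetical raw detector output for one frame of an audience shot.
raw = [
    ("person", 0.91),
    ("person", 0.64),
    ("teddy bear", 0.58),
    ("person", 0.33),
    ("backpack", 0.27),
    ("stop sign", 0.12),
]

for threshold in (0.5, 0.25):
    kept = filter_detections(raw, threshold)
    print(f"threshold {threshold}: {len(kept)} detections -> {[label for label, _ in kept]}")
```
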
05:31

Let's see if we can get these stop signs.

05:33

We find some backpacks.

05:37

Let's just zoom in a little bit.

05:42

And this is great.

05:43

And all of the processing is happening in real time

05:46

on the laptop.

05:48

And it's important to remember

05:50

that this is a general purpose object detection system,

05:53

so we can train this for any image domain.

06:00

The same code that we use

06:02

to find stop signs or pedestrians,

06:05

bicycles in a self-driving vehicle,

06:07

can be used to find cancer cells

06:10

in a tissue biopsy.

06:13

And there are researchers around the globe already using this technology

06:18

for advances in things like medicine, robotics.

06:21

This morning, I read a paper

06:22

where they were taking a census of animals in Nairobi National Park

06:27

with YOLO as part of this detection system.

06:30

And that's because Darknet is open source

06:33

and in the public domain, free for anyone to use.

06:37

(Applause)

06:43

But we wanted to make detection even more accessible and usable,

06:48

so through a combination of model optimization,

06:52

network binarization and approximation,

06:54

we actually have object detection running on a phone.


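The exact binarization scheme behind the phone demo is not specified here. One widely used formulation, from XNOR-Net, approximates a weight vector W by α·sign(W), where α is the mean absolute weight. It computes dot products between ±1 vectors with an XNOR and a bit count. The sketch below illustrates that general idea in plain Python under those assumptions. It is not a description of the code that runs on the phone, and the weight and input values are hypothetical.

```python
def binarize(weights):
    """Approximate real weights as alpha * sign(w), with alpha = mean |w|."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def pack_bits(signs):
    """Pack a list of +1/-1 values into an int (bit i set means +1)."""
    bits = 0
    for i, s in enumerate(signs):
        if s > 0:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two +1/-1 vectors of length n from their packed bits.

    Matching bits contribute +1 and differing bits contribute -1, so the
    result is n - 2 * (number of differing bits).
    """
    differing = bin(a_bits ^ b_bits).count("1")
    return n - 2 * differing


weights = [0.9, -1.1, 1.0]  # hypothetical full-precision weights
inputs = [1, 2, 3]          # hypothetical input activations

alpha, signs = binarize(weights)
exact = sum(w * x for w, x in zip(weights, inputs))
approx = alpha * sum(s * x for s, x in zip(signs, inputs))
print(f"alpha={alpha:.2f} signs={signs} exact={exact:.2f} approx={approx:.2f}")

a, b = [1, -1, 1], [1, 1, 1]
print(xnor_dot(pack_bits(a), pack_bits(b), len(a)))  # equals the plain dot product, 1
```

Because the ±1 weights fit in single bits, a long multiply-accumulate becomes an XOR followed by a bit count. This is typically far cheaper on phone-class hardware than floating-point arithmetic, at the cost of some accuracy.
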
07:04

(Applause)

07:10

And I'm really excited because now we have a pretty powerful solution

07:15

to this low-level computer vision problem,

07:18

and anyone can take it and build something with it.

07:22

So now the rest is up to all of you

07:25

and people around the world with access to this software,

07:28

and I can't wait to see what people will build with this technology.

07:31

Thank you.

07:33

(Applause)
