How computers learn to recognize objects instantly | Joseph Redmon

1,119,896 views ・ 2017-08-18

TED


Chinese translation: chunhua zhang. Reviewed by: 易帆 余.
00:12
Ten years ago,
00:13
computer vision researchers thought that getting a computer
00:16
to tell the difference between a cat and a dog
00:19
would be almost impossible,
00:21
even with the significant advance in the state of artificial intelligence.
00:25
Now we can do it at a level greater than 99 percent accuracy.
00:29
This is called image classification --
00:31
give it an image, put a label to that image --
00:34
and computers know thousands of other categories as well.
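[Editor's note] Image classification as described here (one image in, one label out) can be sketched as picking the highest-probability class from a network's output scores. The scores and label names below are made up for illustration; the talk does not show Darknet's actual outputs or API.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, labels):
    """Return the most likely label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical scores, as if produced by a trained network for one image.
labels = ["cat", "dog", "malamute"]
scores = [0.5, 2.0, 4.0]
label, prob = classify(scores, labels)
print(label, round(prob, 3))
```

A real classifier does the same thing over thousands of categories rather than three.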
00:38
I'm a graduate student at the University of Washington,
00:41
and I work on a project called Darknet,
00:43
which is a neural network framework
00:45
for training and testing computer vision models.
00:47
So let's just see what Darknet thinks
00:50
of this image that we have.
00:54
When we run our classifier
00:56
on this image,
00:57
we see we don't just get a prediction of dog or cat,
01:00
we actually get specific breed predictions.
01:02
That's the level of granularity we have now.
01:04
And it's correct.
01:06
My dog is in fact a malamute.
01:08
So we've made amazing strides in image classification,
01:13
but what happens when we run our classifier
01:15
on an image that looks like this?
01:18
Well ...
01:24
We see that the classifier comes back with a pretty similar prediction.
01:28
And it's correct, there is a malamute in the image,
01:31
but just given this label, we don't actually know that much
01:35
about what's going on in the image.
01:36
We need something more powerful.
01:39
I work on a problem called object detection,
01:41
where we look at an image and try to find all of the objects,
01:44
put bounding boxes around them
01:46
and say what those objects are.
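[Editor's note] A detector's output, as described here, is a list of labeled bounding boxes rather than a single label. A minimal sketch of such a record follows; the `Detection` class, its field names, and the pixel values are assumptions for illustration, not Darknet's data structures.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # what the object is
    score: float  # confidence in [0, 1]
    x: float      # left edge of the box, in pixels
    y: float      # top edge of the box, in pixels
    w: float      # box width, in pixels
    h: float      # box height, in pixels

    def area(self):
        """Box area in square pixels."""
        return self.w * self.h

    def center(self):
        """Box center as an (x, y) pair."""
        return (self.x + self.w / 2, self.y + self.h / 2)


def left_to_right(detections):
    """Labels ordered by the horizontal position of each box center."""
    return [d.label for d in sorted(detections, key=lambda d: d.center()[0])]


# Hypothetical detections for a photo containing a dog and a cat.
detections = [
    Detection("cat", 0.88, 320, 180, 120, 150),
    Detection("dog", 0.92, 40, 60, 200, 300),
]
print(left_to_right(detections))
print(detections[1].area())
```

With boxes available, questions about relative position and size reduce to simple arithmetic on the coordinates.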
01:48
So here's what happens when we run a detector on this image.
01:53
Now, with this kind of result,
01:55
we can do a lot more with our computer vision algorithms.
01:58
We see that it knows that there's a cat and a dog.
02:01
It knows their relative locations,
02:03
their size.
02:04
It may even know some extra information.
02:06
There's a book sitting in the background.
02:09
And if you want to build a system on top of computer vision,
02:12
say a self-driving vehicle or a robotic system,
02:15
this is the kind of information that you want.
02:18
You want something so that you can interact with the physical world.
02:22
Now, when I started working on object detection,
02:24
it took 20 seconds to process a single image.
02:28
And to get a feel for why speed is so important in this domain,
02:32
here's an example of an object detector
02:35
that takes two seconds to process an image.
02:37
So this is 10 times faster
02:40
than the 20-seconds-per-image detector,
02:44
and you can see that by the time it makes predictions,
02:46
the entire state of the world has changed,
02:49
and this wouldn't be very useful
02:52
for an application.
02:53
If we speed this up by another factor of 10,
02:56
this is a detector running at five frames per second.
02:58
This is a lot better,
03:00
but for example,
03:02
if there's any significant movement,
03:04
I wouldn't want a system like this driving my car.
03:08
This is our detection system running in real time on my laptop.
03:12
So it smoothly tracks me as I move around the frame,
03:15
and it's robust to a wide variety of changes in size,
03:21
pose,
03:23
forward, backward.
03:24
This is great.
03:26
This is what we really need
03:27
if we're going to build systems on top of computer vision.
03:30
(Applause)
03:36
So in just a few years,
03:38
we've gone from 20 seconds per image
03:40
to 20 milliseconds per image, a thousand times faster.
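[Editor's note] The speed figures quoted in the talk (20 s, 2 s, 5 frames per second, 20 ms) are related by simple arithmetic: frame rate is the reciprocal of per-image latency. The short sketch below just works through those numbers; the function name is illustrative.

```python
def frames_per_second(seconds_per_image):
    """Frame rate implied by a per-image processing time."""
    return 1.0 / seconds_per_image


# Per-image latencies mentioned in the talk, in seconds.
latencies = [20.0, 2.0, 0.2, 0.02]
for latency in latencies:
    print(f"{latency:>5} s/image -> {frames_per_second(latency):g} frames/s")

# Overall speedup from the slowest to the fastest detector.
speedup = latencies[0] / latencies[-1]
print(f"speedup: {speedup:g}x")
```

At 20 ms per image the detector can keep up with ordinary video, which is what makes the live demonstration possible.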
03:44
How did we get there?
03:45
Well, in the past, object detection systems
03:49
would take an image like this
03:50
and split it into a bunch of regions
03:53
and then run a classifier on each of these regions,
03:56
and high scores for that classifier
03:59
would be considered detections in the image.
04:02
But this involved running a classifier thousands of times over an image,
04:06
thousands of neural network evaluations to produce detection.
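[Editor's note] To see why region-by-region classification is expensive, one can count how many regions a simple sliding-window scheme produces. This is only an illustration: the talk does not say how the older systems chose their regions, and the image size, window sizes, and stride below are assumed values.

```python
def count_windows(image_w, image_h, window, stride):
    """Number of square windows of side `window` that fit at the given stride."""
    if window > image_w or window > image_h:
        return 0
    cols = (image_w - window) // stride + 1
    rows = (image_h - window) // stride + 1
    return cols * rows


def total_evaluations(image_w, image_h, windows, stride):
    """Classifier runs needed to cover the image at every window size."""
    return sum(count_windows(image_w, image_h, w, stride) for w in windows)


# Assumed setup: a 640x480 image, three window sizes, a 16-pixel stride.
n = total_evaluations(640, 480, windows=[64, 128, 256], stride=16)
print(n, "classifier evaluations for one image")
```

Even this coarse setup lands in the thousands of evaluations per image, each of which is a full run of the classifier.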
04:11
Instead, we trained a single network to do all of detection for us.
04:15
It produces all of the bounding boxes and class probabilities simultaneously.
04:20
With our system, instead of looking at an image thousands of times
04:24
to produce detection,
04:25
you only look once,
04:26
and that's why we call it the YOLO method of object detection.
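[Editor's note] One way to picture "all of the bounding boxes and class probabilities simultaneously" is a single output grid in which every cell carries a box and a set of class probabilities, decoded in one pass. The layout below (one box per cell, offsets relative to the cell, sizes relative to the image) is a simplified assumption for illustration, not the network's actual output format.

```python
def decode_grid(grid, labels, image_w, image_h):
    """Turn one network output (a grid of cell predictions) into detections.

    Each cell is assumed to hold (x, y, w, h, confidence, class_probs), with
    x, y as offsets inside the cell and w, h as fractions of the image.
    """
    rows, cols = len(grid), len(grid[0])
    detections = []
    for r, row in enumerate(grid):
        for c, (x, y, w, h, conf, class_probs) in enumerate(row):
            best = max(range(len(class_probs)), key=class_probs.__getitem__)
            score = conf * class_probs[best]  # box confidence times class probability
            center_x = (c + x) / cols * image_w
            center_y = (r + y) / rows * image_h
            detections.append(
                (labels[best], score, center_x, center_y, w * image_w, h * image_h)
            )
    return detections


# Hypothetical 2x2 output: three empty cells and one confident "dog" cell.
empty = (0.5, 0.5, 0.0, 0.0, 0.0, [0.5, 0.5])
grid = [
    [empty, empty],
    [(0.5, 0.5, 0.4, 0.6, 0.9, [0.1, 0.9]), empty],
]
dets = decode_grid(grid, ["cat", "dog"], image_w=400, image_h=400)
best = max(dets, key=lambda d: d[1])
print(best)
```

The point of the sketch is that every candidate box comes out of the same forward pass, so the cost no longer grows with the number of regions examined.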
04:31
So with this speed, we're not just limited to images;
04:35
we can process video in real time.
04:37
And now, instead of just seeing that cat and dog,
04:40
we can see them move around and interact with each other.
04:46
This is a detector that we trained
04:48
on 80 different classes
04:52
in Microsoft's COCO dataset.
04:56
It has all sorts of things like spoon and fork, bowl,
04:59
common objects like that.
05:02
It has a variety of more exotic things:
05:05
animals, cars, zebras, giraffes.
05:08
And now we're going to do something fun.
05:10
We're just going to go out into the audience
05:12
and see what kind of things we can detect.
05:14
Does anyone want a stuffed animal?
05:17
There are some teddy bears out there.
05:21
And we can turn down our threshold for detection a little bit,
05:26
so we can find more of you guys out in the audience.
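[Editor's note] Turning down the detection threshold means keeping boxes with lower confidence scores, which is why more people appear on screen. A minimal sketch, with made-up labels and scores:

```python
def filter_detections(detections, threshold):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d[1] >= threshold]


# Hypothetical (label, confidence) pairs from one video frame.
detections = [
    ("teddy bear", 0.91),
    ("person", 0.62),
    ("person", 0.34),
    ("backpack", 0.27),
    ("stop sign", 0.12),
]

strict = filter_detections(detections, 0.5)
relaxed = filter_detections(detections, 0.25)
print(len(strict), "detections at threshold 0.5")
print(len(relaxed), "detections at threshold 0.25")
```

The trade-off is that a lower threshold also admits more false positives, so the right value depends on the application.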
05:31
Let's see if we can get these stop signs.
05:33
We find some backpacks.
05:37
Let's just zoom in a little bit.
05:42
And this is great.
05:43
And all of the processing is happening in real time
05:46
on the laptop.
05:48
And it's important to remember
05:50
that this is a general purpose object detection system,
05:53
so we can train this for any image domain.
06:00
The same code that we use
06:02
to find stop signs or pedestrians,
06:05
bicycles in a self-driving vehicle,
06:07
can be used to find cancer cells
06:10
in a tissue biopsy.
06:13
And there are researchers around the globe already using this technology
06:18
for advances in things like medicine, robotics.
06:21
This morning, I read a paper
06:22
where they were taking a census of animals in Nairobi National Park
06:27
with YOLO as part of this detection system.
06:30
And that's because Darknet is open source
06:33
and in the public domain, free for anyone to use.
06:37
(Applause)
06:43
But we wanted to make detection even more accessible and usable,
06:48
so through a combination of model optimization,
06:52
network binarization and approximation,
06:54
we actually have object detection running on a phone.
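[Editor's note] The talk does not explain how the network was binarized. One common scheme, shown here purely as an assumption, replaces each real-valued weight with its sign and keeps a single scaling factor (the mean absolute weight), so that a dot product needs only additions, subtractions, and one multiplication.

```python
def binarize(weights):
    """Approximate weights as alpha * sign(w), with alpha = mean(|w|)."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def approx_dot(alpha, signs, inputs):
    """Dot product using the binarized weights and one scaling multiply."""
    return alpha * sum(s * x for s, x in zip(signs, inputs))


# Hypothetical weights and inputs for a single neuron.
weights = [0.8, -0.6, 0.7, -0.9]
inputs = [1.0, 2.0, 3.0, 4.0]

exact = sum(w * x for w, x in zip(weights, inputs))
alpha, signs = binarize(weights)
approx = approx_dot(alpha, signs, inputs)
print("exact:", round(exact, 3), "approximate:", round(approx, 3))
```

The approximation trades some accuracy for much smaller models and cheaper arithmetic, which is the kind of saving that matters on a phone.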
07:04
(Applause)
135
424620
5320
(掌声)
07:10
And I'm really excited because now we have a pretty powerful solution
136
430780
5056
我真的很激动, 因为我们在这个低级的
07:15
to this low-level computer vision problem,
137
435860
2296
计算机视觉问题上 有了一个强大的解决方案,
07:18
and anyone can take it and build something with it.
138
438180
3856
而且任何人都可以 使用它来做些什么。
07:22
So now the rest is up to all of you
139
442060
3176
所以接下来就看所有在座的各位
07:25
and people around the world with access to this software,
140
445260
2936
以及世界上所有 能够使用这个软件的人了,
07:28
and I can't wait to see what people will build with this technology.
141
448220
3656
而我已经等不及想要看看, 人们会用这一技术造出什么来了。
07:31
Thank you.
142
451900
1216
谢谢。
07:33
(Applause)
143
453140
3440
(掌声)