How computers learn to recognize objects instantly | Joseph Redmon

1,121,269 views ・ 2017-08-18

TED



Chinese translation: 易帆 余; reviewed by: Wilde Luo
00:12
Ten years ago,
00:13
computer vision researchers thought that getting a computer
00:16
to tell the difference between a cat and a dog
00:19
would be almost impossible,
00:21
even with the significant advance in the state of artificial intelligence.
00:25
Now we can do it at a level greater than 99 percent accuracy.
00:29
This is called image classification --
00:31
give it an image, put a label to that image --
00:34
and computers know thousands of other categories as well.
00:38
I'm a graduate student at the University of Washington,
00:41
and I work on a project called Darknet,
00:43
which is a neural network framework
00:45
for training and testing computer vision models.
00:47
So let's just see what Darknet thinks
00:50
of this image that we have.
00:54
When we run our classifier
00:56
on this image,
00:57
we see we don't just get a prediction of dog or cat,
01:00
we actually get specific breed predictions.
01:02
That's the level of granularity we have now.
01:04
And it's correct.
01:06
My dog is in fact a malamute.
01:08
So we've made amazing strides in image classification,
01:13
but what happens when we run our classifier
01:15
on an image that looks like this?
01:18
Well ...
01:24
We see that the classifier comes back with a pretty similar prediction.
01:28
And it's correct, there is a malamute in the image,
01:31
but just given this label, we don't actually know that much
01:35
about what's going on in the image.
01:36
We need something more powerful.
01:39
I work on a problem called object detection,
01:41
where we look at an image and try to find all of the objects,
01:44
put bounding boxes around them
01:46
and say what those objects are.
01:48
So here's what happens when we run a detector on this image.
01:53
Now, with this kind of result,
01:55
we can do a lot more with our computer vision algorithms.
01:58
We see that it knows that there's a cat and a dog.
02:01
It knows their relative locations,
02:03
their size.
02:04
It may even know some extra information.
02:06
There's a book sitting in the background.
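As a rough illustration of why a detector's output is richer than a single label, the sketch below models each detection as a label plus a bounding box and derives size and relative position from it. The `Detection` class, the `is_left_of` helper and all pixel coordinates are hypothetical, chosen only for illustration; they are not Darknet's actual data structures or the values from the demo image.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object: a class label plus a box in pixel coordinates."""
    label: str
    x: float  # left edge of the box
    y: float  # top edge of the box
    w: float  # box width
    h: float  # box height

    @property
    def area(self) -> float:
        return self.w * self.h

    @property
    def center(self) -> tuple:
        return (self.x + self.w / 2, self.y + self.h / 2)


def is_left_of(a: Detection, b: Detection) -> bool:
    """True when a's box center lies to the left of b's box center."""
    return a.center[0] < b.center[0]


# Hypothetical detector output for the cat-and-dog image (made-up numbers).
detections = [
    Detection("dog", 40, 120, 300, 260),
    Detection("cat", 380, 200, 180, 150),
    Detection("book", 500, 60, 90, 40),
]

largest = max(detections, key=lambda d: d.area)
print("largest object:", largest.label)
print("dog is left of cat:", is_left_of(detections[0], detections[1]))
```

A classifier would return only one of these labels; the boxes are what allow a downstream system to reason about where things are and how large they appear.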
02:09
And if you want to build a system on top of computer vision,
02:12
say a self-driving vehicle or a robotic system,
02:15
this is the kind of information that you want.
02:18
You want something so that you can interact with the physical world.
02:22
Now, when I started working on object detection,
02:24
it took 20 seconds to process a single image.
02:28
And to get a feel for why speed is so important in this domain,
02:32
here's an example of an object detector
02:35
that takes two seconds to process an image.
02:37
So this is 10 times faster
02:40
than the 20-seconds-per-image detector,
02:44
and you can see that by the time it makes predictions,
02:46
the entire state of the world has changed,
02:49
and this wouldn't be very useful
02:52
for an application.
02:53
If we speed this up by another factor of 10,
02:56
this is a detector running at five frames per second.
02:58
This is a lot better,
03:00
but for example,
03:02
if there's any significant movement,
03:04
I wouldn't want a system like this driving my car.
03:08
This is our detection system running in real time on my laptop.
03:12
So it smoothly tracks me as I move around the frame,
03:15
and it's robust to a wide variety of changes in size,
03:21
pose,
03:23
forward, backward.
03:24
This is great.
03:26
This is what we really need
03:27
if we're going to build systems on top of computer vision.
03:30
(Applause)
03:36
So in just a few years,
03:38
we've gone from 20 seconds per image
03:40
to 20 milliseconds per image, a thousand times faster.
03:44
How did we get there?
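The speed figures quoted so far can be checked with a little arithmetic. The sketch below converts each per-image latency into frames per second, computes the overall speedup, and estimates how far a vehicle would move while one image is being processed. The 200 ms entry is inferred from "five frames per second", and the vehicle speed of 25 m/s (about 90 km/h) is an assumption added for illustration, not a figure from the talk.

```python
# Per-image latencies mentioned in the talk, in milliseconds.
# 200 ms is inferred from the "five frames per second" detector.
LATENCIES_MS = {
    "early detector": 20_000,
    "10x faster": 2_000,
    "another 10x": 200,
    "real time on a laptop": 20,
}

# Hypothetical vehicle speed, roughly 90 km/h (an assumption for illustration).
ASSUMED_SPEED_M_PER_S = 25.0


def frames_per_second(latency_ms: float) -> float:
    """Throughput when images are processed one after another."""
    return 1000 / latency_ms


def speedup(old_ms: float, new_ms: float) -> float:
    """How many times faster the new latency is than the old one."""
    return old_ms / new_ms


def metres_per_frame(speed_m_per_s: float, latency_ms: float) -> float:
    """Distance covered while a single image is being processed."""
    return speed_m_per_s * latency_ms / 1000


for name, ms in LATENCIES_MS.items():
    print(
        f"{name}: {frames_per_second(ms):g} fps, "
        f"{metres_per_frame(ASSUMED_SPEED_M_PER_S, ms):g} m travelled per frame"
    )

print("overall speedup:", speedup(20_000, 20))
```

Under these assumptions, the five-frames-per-second detector sees the road only once every five metres, which is one way to read the remark about not wanting it to drive a car.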
03:45
Well, in the past, object detection systems
03:49
would take an image like this
03:50
and split it into a bunch of regions
03:53
and then run a classifier on each of these regions,
03:56
and high scores for that classifier
03:59
would be considered detections in the image.
04:02
But this involved running a classifier thousands of times over an image,
04:06
thousands of neural network evaluations to produce detection.
04:11
Instead, we trained a single network to do all of detection for us.
04:15
It produces all of the bounding boxes and class probabilities simultaneously.
04:20
With our system, instead of looking at an image thousands of times
04:24
to produce detection,
04:25
you only look once,
04:26
and that's why we call it the YOLO method of object detection.
04:11
Instead, we trained a single network to do all of detection for us.
85
251060
4536
但我們不是這樣做,我們訓練了一個 網路模型來幫我們完成所有的偵測。
04:15
It produces all of the bounding boxes and class probabilities simultaneously.
86
255620
4280
它可以同時產出邊界框 並同時對可能的結果進行評估。
04:20
With our system, instead of looking at an image thousands of times
87
260500
3496
有了我們的系統, 你就不用一張圖片看了好幾千遍
04:24
to produce detection,
88
264020
1456
才能偵測出來。
04:25
you only look once,
89
265500
1256
你只要看一眼 (YOLO),
04:26
and that's why we call it the YOLO method of object detection.
90
266780
2920
所以我們簡稱這個 物件偵測技術為「YOLO」。
04:31
So with this speed, we're not just limited to images;
04:35
we can process video in real time.
04:37
And now, instead of just seeing that cat and dog,
04:40
we can see them move around and interact with each other.
04:46
This is a detector that we trained
04:48
on 80 different classes
04:52
in Microsoft's COCO dataset.
04:56
It has all sorts of things like spoon and fork, bowl,
04:59
common objects like that.
05:02
It has a variety of more exotic things:
05:05
animals, cars, zebras, giraffes.
05:08
And now we're going to do something fun.
05:10
We're just going to go out into the audience
05:12
and see what kind of things we can detect.
05:14
Does anyone want a stuffed animal?
05:17
There are some teddy bears out there.
05:21
And we can turn down our threshold for detection a little bit,
05:26
so we can find more of you guys out in the audience.
05:31
Let's see if we can get these stop signs.
05:33
We find some backpacks.
05:37
Let's just zoom in a little bit.
05:42
And this is great.
05:43
And all of the processing is happening in real time
05:46
on the laptop.
05:48
And it's important to remember
05:50
that this is a general purpose object detection system,
05:53
so we can train this for any image domain.
06:00
The same code that we use
06:02
to find stop signs or pedestrians,
06:05
bicycles in a self-driving vehicle,
06:07
can be used to find cancer cells
06:10
in a tissue biopsy.
06:13
And there are researchers around the globe already using this technology
06:18
for advances in things like medicine, robotics.
06:21
This morning, I read a paper
06:22
where they were taking a census of animals in Nairobi National Park
06:27
with YOLO as part of this detection system.
06:30
And that's because Darknet is open source
06:33
and in the public domain, free for anyone to use.
06:37
(Applause)
06:43
But we wanted to make detection even more accessible and usable,
06:48
so through a combination of model optimization,
06:52
network binarization and approximation,
06:54
we actually have object detection running on a phone.
07:04
(Applause)
07:10
And I'm really excited because now we have a pretty powerful solution
07:15
to this low-level computer vision problem,
07:18
and anyone can take it and build something with it.
07:22
So now the rest is up to all of you
07:25
and people around the world with access to this software,
07:28
and I can't wait to see what people will build with this technology.
07:31
Thank you.
07:33
(Applause)