How computers learn to recognize objects instantly | Joseph Redmon

1,123,328 views ・ 2017-08-18

TED

Translator: Zeeva Livshitz Reviewer: Ido Dekkers
00:12
Ten years ago,
00:13
computer vision researchers thought that getting a computer
00:16
to tell the difference between a cat and a dog
00:19
would be almost impossible,
00:21
even with the significant advance in the state of artificial intelligence.
00:25
Now we can do it at a level greater than 99 percent accuracy.
00:29
This is called image classification --
00:31
give it an image, put a label to that image --
00:34
and computers know thousands of other categories as well.
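As a rough illustration of "give it an image, put a label to that image", the sketch below picks the most likely label from a list of class scores. The labels and scores are invented for the example; in a real classifier they would be produced by a trained network, and nothing here is taken from Darknet itself.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores, labels):
    """Return the most likely label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical raw scores for one image; a real network would produce these.
labels = ["cat", "dog", "malamute"]
scores = [0.5, 2.0, 4.0]
label, prob = classify(scores, labels)
print(label, round(prob, 3))
```

The classifier returns one label for the whole image, which is exactly the limitation the talk turns to next.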
00:38
I'm a graduate student at the University of Washington,
00:41
and I work on a project called Darknet,
00:43
which is a neural network framework
00:45
for training and testing computer vision models.
00:47
So let's just see what Darknet thinks
00:50
of this image that we have.
00:54
When we run our classifier
00:56
on this image,
00:57
we see we don't just get a prediction of dog or cat,
01:00
we actually get specific breed predictions.
01:02
That's the level of granularity we have now.
01:04
And it's correct.
01:06
My dog is in fact a malamute.
01:08
So we've made amazing strides in image classification,
01:13
but what happens when we run our classifier
01:15
on an image that looks like this?
01:18
Well ...
01:24
We see that the classifier comes back with a pretty similar prediction.
01:28
And it's correct, there is a malamute in the image,
01:31
but just given this label, we don't actually know that much
01:35
about what's going on in the image.
01:36
We need something more powerful.
01:39
I work on a problem called object detection,
01:41
where we look at an image and try to find all of the objects,
01:44
put bounding boxes around them
01:46
and say what those objects are.
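One way to picture a detector's output is as a list of records, each holding a label, a confidence, and a bounding box. The sketch below is only an assumed representation for illustration: the `Detection` class, its field names, and the pixel values are hypothetical and not drawn from the talk or from any particular library.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str         # what the object is
    confidence: float  # how sure the detector is, from 0 to 1
    x: float           # left edge of the bounding box, in pixels
    y: float           # top edge of the bounding box, in pixels
    width: float
    height: float

    def area(self) -> float:
        """Size of the box in square pixels."""
        return self.width * self.height

    def center(self) -> tuple:
        """Midpoint of the box as (x, y)."""
        return (self.x + self.width / 2, self.y + self.height / 2)

# Made-up detections for a single image.
detections = [
    Detection("dog", 0.92, 40, 60, 200, 180),
    Detection("cat", 0.88, 300, 120, 120, 100),
]

for d in detections:
    print(d.label, d.center(), d.area())

# Relative location: which object sits further left in the frame?
leftmost = min(detections, key=lambda d: d.center()[0])
print("leftmost:", leftmost.label)
```

With boxes in hand, questions about where objects are and how large they appear reduce to simple arithmetic on the coordinates.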
01:48
So here's what happens when we run a detector on this image.
01:53
Now, with this kind of result,
01:55
we can do a lot more with our computer vision algorithms.
01:58
We see that it knows that there's a cat and a dog.
02:01
It knows their relative locations,
02:03
their size.
02:04
It may even know some extra information.
02:06
There's a book sitting in the background.
02:09
And if you want to build a system on top of computer vision,
02:12
say a self-driving vehicle or a robotic system,
02:15
this is the kind of information that you want.
02:18
You want something so that you can interact with the physical world.
02:22
Now, when I started working on object detection,
02:24
it took 20 seconds to process a single image.
02:28
And to get a feel for why speed is so important in this domain,
02:32
here's an example of an object detector
02:35
that takes two seconds to process an image.
02:37
So this is 10 times faster
02:40
than the 20-seconds-per-image detector,
02:44
and you can see that by the time it makes predictions,
02:46
the entire state of the world has changed,
02:49
and this wouldn't be very useful
02:52
for an application.
02:53
If we speed this up by another factor of 10,
02:56
this is a detector running at five frames per second.
02:58
This is a lot better,
03:00
but for example,
03:02
if there's any significant movement,
03:04
I wouldn't want a system like this driving my car.
03:08
This is our detection system running in real time on my laptop.
03:12
So it smoothly tracks me as I move around the frame,
03:15
and it's robust to a wide variety of changes in size,
03:21
pose,
03:23
forward, backward.
03:24
This is great.
03:26
This is what we really need
03:27
if we're going to build systems on top of computer vision.
03:30
(Applause)
03:36
So in just a few years,
03:38
we've gone from 20 seconds per image
03:40
to 20 milliseconds per image, a thousand times faster.
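The speed figures quoted in the talk can be checked with a few lines of arithmetic. This sketch converts each per-image latency mentioned (20 s, 2 s, 0.2 s, 20 ms) into frames per second and computes the overall speedup; the function name is just a convenience for the example.

```python
def frames_per_second(seconds_per_image):
    """Throughput implied by a given per-image processing time."""
    return 1.0 / seconds_per_image

# Latencies mentioned in the talk, in seconds per image.
for seconds in [20, 2, 0.2, 0.02]:
    print(f"{seconds:>5} s/image -> {frames_per_second(seconds):g} frames/s")

# From 20 seconds down to 20 milliseconds per image.
speedup = 20 / 0.02
print(f"speedup: {speedup:g}x")
```

The 0.2-second case corresponds to the five-frames-per-second detector shown earlier, and 20 milliseconds per image works out to roughly 50 frames per second.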
03:44
How did we get there?
03:45
Well, in the past, object detection systems
03:49
would take an image like this
03:50
and split it into a bunch of regions
03:53
and then run a classifier on each of these regions,
03:56
and high scores for that classifier
03:59
would be considered detections in the image.
04:02
But this involved running a classifier thousands of times over an image,
04:06
thousands of neural network evaluations to produce detection.
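To see where "thousands of times" can come from, here is a small counting sketch. It assumes the regions are square sliding windows at a few sizes with a fixed stride; the talk does not say how regions were chosen, so the image size, window sizes, and stride below are purely illustrative.

```python
def count_windows(image_w, image_h, window, stride):
    """Number of square windows of a given size that fit at a given stride."""
    if window > image_w or window > image_h:
        return 0
    across = (image_w - window) // stride + 1
    down = (image_h - window) // stride + 1
    return across * down

# Assumed values for illustration only.
image_w, image_h = 640, 480
window_sizes = [64, 128, 256]
stride = 16

# Each window would need its own classifier evaluation.
total = sum(count_windows(image_w, image_h, w, stride) for w in window_sizes)
print("classifier evaluations for one image:", total)
```

Even with only three window sizes on a modest image, the count lands in the thousands, and each of those is a separate network evaluation.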
04:11
Instead, we trained a single network to do all of detection for us.
04:15
It produces all of the bounding boxes and class probabilities simultaneously.
04:20
With our system, instead of looking at an image thousands of times
04:24
to produce detection,
04:25
you only look once,
04:26
and that's why we call it the YOLO method of object detection.
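A sketch of what "all of the bounding boxes and class probabilities simultaneously" might look like as data. It assumes the output is laid out as a grid of cells, each predicting a few boxes plus one set of class probabilities; the talk does not describe this layout, so the values of `S`, `B`, and `C` and the `decode_cell` helper are assumptions for illustration, not the speaker's implementation.

```python
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes (assumed)

values_per_cell = B * 5 + C   # each box carries x, y, w, h, confidence
boxes_per_image = S * S * B

print("output shape:", (S, S, values_per_cell))
print("boxes from one pass:", boxes_per_image)

def decode_cell(cell, class_names):
    """Turn one cell's raw values into (label, score, box) tuples."""
    class_probs = cell[B * 5:]
    best = max(range(C), key=class_probs.__getitem__)
    results = []
    for b in range(B):
        x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
        # Class-specific score = box confidence * class probability.
        score = conf * class_probs[best]
        results.append((class_names[best], score, (x, y, w, h)))
    return results

# Hypothetical values for a single cell.
class_names = [f"class_{i}" for i in range(C)]
cell = [0.5, 0.5, 0.2, 0.3, 0.9,   # box 1: x, y, w, h, confidence
        0.4, 0.6, 0.1, 0.1, 0.1]   # box 2: x, y, w, h, confidence
cell += [0.0] * C
cell[B * 5 + 3] = 0.8              # this cell favors class_3

for label, score, box in decode_cell(cell, class_names):
    print(label, round(score, 2), box)
```

Under these assumptions a single network evaluation yields every candidate box at once, instead of one classifier call per region.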
04:31
So with this speed, we're not just limited to images;
04:35
we can process video in real time.
04:37
And now, instead of just seeing that cat and dog,
04:40
we can see them move around and interact with each other.
04:46
This is a detector that we trained
04:48
on 80 different classes
04:52
in Microsoft's COCO dataset.
04:56
It has all sorts of things like spoon and fork, bowl,
04:59
common objects like that.
05:02
It has a variety of more exotic things:
05:05
animals, cars, zebras, giraffes.
05:08
And now we're going to do something fun.
05:10
We're just going to go out into the audience
05:12
and see what kind of things we can detect.
05:14
Does anyone want a stuffed animal?
05:17
There are some teddy bears out there.
05:21
And we can turn down our threshold for detection a little bit,
05:26
so we can find more of you guys out in the audience.
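Turning down the detection threshold can be sketched as a simple filter on confidence scores. The detections and confidence values below are invented for the example, and `keep_confident` is a hypothetical helper rather than part of any real detector.

```python
# Made-up (label, confidence) pairs from one frame.
detections = [
    ("person", 0.91), ("person", 0.47), ("teddy bear", 0.62),
    ("backpack", 0.28), ("stop sign", 0.19),
]

def keep_confident(detections, threshold):
    """Keep only detections whose confidence meets the threshold."""
    return [d for d in detections if d[1] >= threshold]

for threshold in (0.5, 0.25):
    kept = keep_confident(detections, threshold)
    print(threshold, [label for label, _ in kept])
```

A lower threshold surfaces more objects, at the usual cost of letting through more uncertain guesses.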
05:31
Let's see if we can get these stop signs.
05:33
We find some backpacks.
05:37
Let's just zoom in a little bit.
05:42
And this is great.
05:43
And all of the processing is happening in real time
05:46
on the laptop.
05:48
And it's important to remember
05:50
that this is a general purpose object detection system,
05:53
so we can train this for any image domain.
06:00
The same code that we use
06:02
to find stop signs or pedestrians,
06:05
bicycles in a self-driving vehicle,
06:07
can be used to find cancer cells
06:10
in a tissue biopsy.
06:13
And there are researchers around the globe already using this technology
06:18
for advances in things like medicine, robotics.
06:21
This morning, I read a paper
06:22
where they were taking a census of animals in Nairobi National Park
06:27
with YOLO as part of this detection system.
06:30
And that's because Darknet is open source
06:33
and in the public domain, free for anyone to use.
06:37
(Applause)
06:43
But we wanted to make detection even more accessible and usable,
06:48
so through a combination of model optimization,
06:52
network binarization and approximation,
06:54
we actually have object detection running on a phone.
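The talk names "network binarization and approximation" without giving details. One common formulation, assumed here purely for illustration, replaces each real-valued weight with its sign and keeps a single scale factor for the whole set, so that most multiplications become sign flips. The helper names and numbers below are hypothetical.

```python
def binarize(weights):
    """Approximate real-valued weights as alpha * sign(w)."""
    alpha = sum(abs(w) for w in weights) / len(weights)  # one shared scale
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Made-up weights and inputs for one neuron.
weights = [0.8, -0.6, 0.7, -0.9]
inputs = [1.0, 2.0, -1.0, 0.5]

alpha, signs = binarize(weights)
exact = dot(weights, inputs)
approx = alpha * dot(signs, inputs)  # only additions/subtractions plus one multiply

print("alpha:", alpha, "signs:", signs)
print("exact:", exact, "approx:", approx)
```

The approximate result differs from the exact one, which is the trade being made: less memory and cheaper arithmetic in exchange for some loss of precision.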
07:04
(Applause)
07:10
And I'm really excited because now we have a pretty powerful solution
07:15
to this low-level computer vision problem,
07:18
and anyone can take it and build something with it.
07:22
So now the rest is up to all of you
07:25
and people around the world with access to this software,
07:28
and I can't wait to see what people will build with this technology.
07:31
Thank you.
07:33
(Applause)