How computers learn to recognize objects instantly | Joseph Redmon

1,132,480 views ・ 2017-08-18

TED

00:12

Ten years ago,

12645

1151

00:13

computer vision researchers thought that getting a computer

13820

2776

00:16

to tell the difference between a cat and a dog

16620

2696

00:19

would be almost impossible,

19340

1976

00:21

even with the significant advance in the state of artificial intelligence.

21340

3696

00:25

Now we can do it at a level greater than 99 percent accuracy.

25060

3560

00:29

This is called image classification --

29500

1856

00:31

give it an image, put a label to that image --

31380

3096

00:34

and computers know thousands of other categories as well.

34500

3040
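Editor's sketch: the "give it an image, put a label to that image" step can be illustrated with a minimal example. It assumes a classifier has already produced one raw score per category; the label names and score values below are invented for illustration and are not Darknet output.

```python
import math

def softmax(scores):
    """Turn raw classifier scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(labels, scores, k=3):
    """Return the k most probable (label, probability) pairs."""
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical categories and scores for a single image.
labels = ["malamute", "husky", "tabby cat", "book"]
scores = [4.0, 2.5, 0.5, -1.0]

predictions = top_k(labels, scores, k=2)
for label, prob in predictions:
    print(f"{label}: {prob:.2f}")
```

A classifier of this kind returns labels for the image as a whole; it says nothing about where in the image each thing is.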

00:38

I'm a graduate student at the University of Washington,

38500

2896

00:41

and I work on a project called Darknet,

41420

1896

00:43

which is a neural network framework

43340

1696

00:45

for training and testing computer vision models.

45060

2816

00:47

So let's just see what Darknet thinks

47900

2976

00:50

of this image that we have.

50900

1760

00:54

When we run our classifier

54340

2336

00:56

on this image,

56700

1216

00:57

we see we don't just get a prediction of dog or cat,

57940

2456

01:00

we actually get specific breed predictions.

60420

2336

01:02

That's the level of granularity we have now.

62780

2176

01:04

And it's correct.

64980

1616

01:06

My dog is in fact a malamute.

66620

1840

01:08

So we've made amazing strides in image classification,

68860

4336

01:13

but what happens when we run our classifier

73220

2000

01:15

on an image that looks like this?

75244

1960

01:18

Well ...

78900

1200

01:24

We see that the classifier comes back with a pretty similar prediction.

84460

3896

01:28

And it's correct, there is a malamute in the image,

88380

3096

01:31

but just given this label, we don't actually know that much

91500

3696

01:35

about what's going on in the image.

95220

1667

01:36

We need something more powerful.

96911

1560

01:39

I work on a problem called object detection,

99060

2616

01:41

where we look at an image and try to find all of the objects,

101700

2936

01:44

put bounding boxes around them

104660

1456

01:46

and say what those objects are.

106140

1520

01:48

So here's what happens when we run a detector on this image.

108220

3280

01:53

Now, with this kind of result,

113060

2256

01:55

we can do a lot more with our computer vision algorithms.

115340

2696

01:58

We see that it knows that there's a cat and a dog.

118060

2976

02:01

It knows their relative locations,

121060

2256

02:03

their size.

123340

1216

02:04

It may even know some extra information.

124580

1936

02:06

There's a book sitting in the background.

126540

1960

02:09

And if you want to build a system on top of computer vision,

129100

3256

02:12

say a self-driving vehicle or a robotic system,

132380

3456

02:15

this is the kind of information that you want.

135860

2456

02:18

You want something so that you can interact with the physical world.

138340

3239

02:22

Now, when I started working on object detection,

142579

2257

02:24

it took 20 seconds to process a single image.

144860

3296

02:28

And to get a feel for why speed is so important in this domain,

148180

3880

02:32

here's an example of an object detector

152940

2536

02:35

that takes two seconds to process an image.

155500

2416

02:37

So this is 10 times faster

157940

2616

02:40

than the 20-seconds-per-image detector,

160580

3536

02:44

and you can see that by the time it makes predictions,

164140

2656

02:46

the entire state of the world has changed,

166820

2040

02:49

and this wouldn't be very useful

169700

2416

02:52

for an application.

172140

1416

02:53

If we speed this up by another factor of 10,

173580

2496

02:56

this is a detector running at five frames per second.

176100

2816

02:58

This is a lot better,

178940

1536

03:00

but for example,

180500

1976

03:02

if there's any significant movement,

182500

2296

03:04

I wouldn't want a system like this driving my car.

184820

2560

03:08

This is our detection system running in real time on my laptop.

188940

3240

03:12

So it smoothly tracks me as I move around the frame,

192820

3136

03:15

and it's robust to a wide variety of changes in size,

195980

3720

03:21

pose,

201260

1200

03:23

forward, backward.

203100

1856

03:24

This is great.

204980

1216

03:26

This is what we really need

206220

1736

03:27

if we're going to build systems on top of computer vision.

207980

2896

03:30

(Applause)

210900

4000

03:36

So in just a few years,

216100

2176

03:38

we've gone from 20 seconds per image

218300

2656

03:40

to 20 milliseconds per image, a thousand times faster.

220980

3536
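Editor's sketch: the speed figures quoted in the talk can be checked with a few lines of arithmetic. The per-image times come from the talk; the dictionary keys and the helper function are only illustrative.

```python
def frames_per_second(seconds_per_image):
    """Convert a per-image processing time into a frame rate."""
    return 1.0 / seconds_per_image

# Per-image processing times mentioned in the talk, in seconds.
seconds_per_image = {
    "early detector": 20.0,
    "10x faster": 2.0,
    "100x faster": 0.2,
    "real time": 0.020,
}

for name, seconds in seconds_per_image.items():
    print(f"{name}: {frames_per_second(seconds):.2f} frames per second")

# Overall speedup from 20 seconds to 20 milliseconds per image.
speedup = seconds_per_image["early detector"] / seconds_per_image["real time"]
print(f"overall speedup: about {speedup:.0f}x")
```

At 0.2 seconds per image the detector manages five frames per second, and at 20 milliseconds it reaches roughly fifty, which is comfortably above typical video frame rates.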

03:44

How did we get there?

224540

1416

03:45

Well, in the past, object detection systems

225980

3016

03:49

would take an image like this

229020

1936

03:50

and split it into a bunch of regions

230980

2456

03:53

and then run a classifier on each of these regions,

233460

3256

03:56

and high scores for that classifier

236740

2536

03:59

would be considered detections in the image.

239300

3136

04:02

But this involved running a classifier thousands of times over an image,

242460

4056

04:06

thousands of neural network evaluations to produce detection.

246540

2920

04:11

Instead, we trained a single network to do all of detection for us.

251060

4536

04:15

It produces all of the bounding boxes and class probabilities simultaneously.

255620

4280

04:20

With our system, instead of looking at an image thousands of times

260500

3496

04:24

to produce detection,

264020

1456

04:25

you only look once,

265500

1256

04:26

and that's why we call it the YOLO method of object detection.

266780

2920
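Editor's sketch: the difference between the older region-by-region approach and a single-pass detector can be shown by counting model evaluations. Everything here is a toy stand-in: `CountingModel`, the window sizes, the stride and the 448 by 448 image size are assumptions chosen for illustration, and neither function contains a real neural network.

```python
def sliding_windows(width, height, size, stride):
    """Yield (x, y, size, size) square regions that fit inside the image."""
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            yield (x, y, size, size)

class CountingModel:
    """Stand-in for a neural network that only counts how often it is run."""

    def __init__(self):
        self.evaluations = 0

    def classify_region(self, region):
        self.evaluations += 1
        return ("background", 0.0)  # dummy label and score

    def detect_all(self, width, height):
        self.evaluations += 1
        return []  # a real single-pass detector would return every box at once

def region_based_detect(model, width, height, sizes=(64, 128, 256), stride=8,
                        threshold=0.5):
    """Older style: classify every region, keep the high-scoring ones."""
    found = []
    for size in sizes:
        for region in sliding_windows(width, height, size, stride):
            label, score = model.classify_region(region)
            if score >= threshold:
                found.append((region, label, score))
    return found

def single_pass_detect(model, width, height):
    """You-only-look-once style: one evaluation for the whole image."""
    return model.detect_all(width, height)

region_model = CountingModel()
region_based_detect(region_model, 448, 448)

single_model = CountingModel()
single_pass_detect(single_model, 448, 448)

print("region-based evaluations:", region_model.evaluations)
print("single-pass evaluations:", single_model.evaluations)
```

Even with these modest settings the region-based loop runs the model thousands of times per image, while the single-pass version runs it once.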

04:31

So with this speed, we're not just limited to images;

271180

3976

04:35

we can process video in real time.

275180

2416

04:37

And now, instead of just seeing that cat and dog,

277620

3096

04:40

we can see them move around and interact with each other.

280740

2960

04:46

This is a detector that we trained

286380

2056

04:48

on 80 different classes

288460

4376

04:52

in Microsoft's COCO dataset.

292860

3256

04:56

It has all sorts of things like spoon and fork, bowl,

296140

3336

04:59

common objects like that.

299500

1800

05:02

It has a variety of more exotic things:

302180

3096

05:05

animals, cars, zebras, giraffes.

305300

3256

05:08

And now we're going to do something fun.

308580

1936

05:10

We're just going to go out into the audience

310540

2096

05:12

and see what kind of things we can detect.

312660

2016

05:14

Does anyone want a stuffed animal?

314700

1620

05:17

There are some teddy bears out there.

317820

1762

05:21

And we can turn down our threshold for detection a little bit,

321860

4536

05:26

so we can find more of you guys out in the audience.

326420

3400
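Editor's sketch: "turning down the threshold" can be read as keeping detections whose confidence is at or above a lower cutoff. The candidate list and the two threshold values below are invented for illustration; the talk does not state the actual values used on stage.

```python
def filter_detections(candidates, threshold):
    """Keep only (label, confidence) pairs at or above the threshold."""
    return [(label, conf) for label, conf in candidates if conf >= threshold]

# Hypothetical raw candidates from one video frame.
candidates = [
    ("person", 0.92),
    ("teddy bear", 0.71),
    ("person", 0.38),
    ("backpack", 0.27),
    ("stop sign", 0.12),
]

strict = filter_detections(candidates, threshold=0.5)
relaxed = filter_detections(candidates, threshold=0.25)

print("threshold 0.50:", [label for label, _ in strict])
print("threshold 0.25:", [label for label, _ in relaxed])
```

A lower threshold surfaces more objects, at the cost of letting through more low-confidence guesses that may be wrong.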

05:31

Let's see if we can get these stop signs.

331380

2336

05:33

We find some backpacks.

333740

1880

05:37

Let's just zoom in a little bit.

337700

1840

05:42

And this is great.

342140

1256

05:43

And all of the processing is happening in real time

343420

3176

05:46

on the laptop.

346620

1200

05:48

And it's important to remember

348900

1456

05:50

that this is a general purpose object detection system,

350380

3216

05:53

so we can train this for any image domain.

353620

5000

06:00

The same code that we use

360140

2536

06:02

to find stop signs or pedestrians,

362700

2456

06:05

bicycles in a self-driving vehicle,

365180

1976

06:07

can be used to find cancer cells

367180

2856

06:10

in a tissue biopsy.

370060

3016

06:13

And there are researchers around the globe already using this technology

373100

4040

06:18

for advances in things like medicine, robotics.

378060

3416

06:21

This morning, I read a paper

381500

1376

06:22

where they were taking a census of animals in Nairobi National Park

382900

4576

06:27

with YOLO as part of this detection system.

387500

3136

06:30

And that's because Darknet is open source

390660

3096

06:33

and in the public domain, free for anyone to use.

393780

2520

06:37

(Applause)

397420

5696

06:43

But we wanted to make detection even more accessible and usable,

403140

4936

06:48

so through a combination of model optimization,

408100

4056

06:52

network binarization and approximation,

412180

2296

06:54

we actually have object detection running on a phone.

414500

3920

07:04

(Applause)

424620

5320

07:10

And I'm really excited because now we have a pretty powerful solution

430780

5056

07:15

to this low-level computer vision problem,

435860

2296

07:18

and anyone can take it and build something with it.

438180

3856

07:22

So now the rest is up to all of you

442060

3176

07:25

and people around the world with access to this software,

445260

2936

07:28

and I can't wait to see what people will build with this technology.

448220

3656

07:31

Thank you.

451900

1216

07:33

(Applause)

453140

3440

Original video on YouTube.com

How computers learn to recognize objects instantly | Joseph Redmon - YouTube
