Object detection: an overview in the age of Deep Learning

pavlov · on Aug 31, 2017

The progress in this field is very exciting. But the last section of the fine article makes a really important point -- ultimately the training dataset defines the output that the algorithms can provide:

"Unfortunately, there aren’t enough datasets for object detection. Data is harder (and more expensive) to generate, companies probably don’t feel like freely giving away their investment, and universities do not have that many resources."

Even ImageNet has only 200 classes. Imagine a person with a vocabulary of exactly 200 concepts trying to describe the world. As Wittgenstein wrote, "The limits of my language mean the limits of my world" -- and the language we can teach to computers is still very limited.

krasin · on Aug 31, 2017

FWIW, Open Images Dataset has recently been extended with 2M bounding boxes over 600 categories: https://research.googleblog.com/2017/07/an-update-to-open-im...

melling · on Aug 31, 2017

Is this something that can be crowdsourced?

Mozilla is crowd sourcing voice data, for example: https://voice.mozilla.org

alexbeloi · on Aug 31, 2017

Yes, but somebody still has to pay or donate their time and the 'human' computational time require is pretty heavy for an object detection task.

It's also usually ideal to just have one person label one image, people often miss things or mislabel things. If you want low noise data, you need to have people repeatedly labeling the same image until they reach a consensus. The more objects the more opportunity for confusion, requiring more labeling.

You have probably been prompted to use captcha-like thing for selecting portions of an image containing a street sign. That's object detection, for a single class of object.

To answer your question, yes it should be crowdsourced, yes it is being crowdsourced, but the companies that are doing it are keeping the data to themselves because of the value/expense in collecting it.

vikiomega9 · on Aug 31, 2017

From a company's perspective this is essentially their business strategy and perhaps it makes sense to crowdfund crowdsourcing or something along the likes of SETI@Home.

boxy310 · on Aug 31, 2017

Most neural networks are structured around a classification task with pre-tagged data. This is fundamentally a problem that it requires pre-tagged data in order to ferret out what the "true meaning" of a particular classification is.

When the task moves towards determining differences between un-tagged data, you're looking at more of a clustering exercise. This is a murkier machine learning task which is relatively underdeveloped. As a quick comparison, linear regression and correlations were generally worked out by Pearson back in the late 1800's, while k-means clustering was only published openly since the 1980's.

badestrand · on Aug 31, 2017

Looks like great project! It is a pity they don't provide the data for download yet.

The Tatoeba project focuses on crowdsourcing translations of sentences but has also around a few thousand spoken ones in a variety of languages. Everything is available for download for free.

https://tatoeba.org/eng

Iv · on Sept 1, 2017

ImageNet now has 1000 classes.

And I used to think this was a bottleneck before I really got into deep learning.

Actually, adding categories is very easy and does not require a crazy amount of data. It does not even requires a lot of retraining as a lot of the lower layers of a typical ImageNet model can be reused for new categories. I have made classifiers using an old model (VGG16) and less than 100 images in each new category.

If a project like wikipedia for instance started to promote the goal to have 100 different pictures for each of its articles, a classifier could be trained with any arbitrary number of categories.

alexcnwy · on Sept 1, 2017

Also, see things like the DeVISE paper that replace the 1-hot encoding output on the 1000 imagenet paper with word vectors so the model can generalize beyond 1000 classes:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.466...

jamesmcintyre · on Sept 1, 2017

I've trained inception V3 for 20 classes using transfer learning method (google for poets tutorial/script) and I'm curious if there is an upper-limit to adding a lot of classes? For instance if I wanted to classify 30,000 different classes is there any reason one inception V3 model couldn't do it? For some reason I suspect that it wouldn't scale to that so I'm thinking I'd have to break the classification into a hierarchy (i.e. Classification/model 1: "Is it an animal?" If yes send to classification model 2: "what type of animal?" If feline send to model 3 "what type of feline?".)

Is there an upper limit to classes? If so what imposes this limit? Is a process like I described above ideal or is there a better way to scale to 10's of thousands of classes in image recognition?

Iv · on Sept 2, 2017

I mostly used VGG16 but I suppose this more or less translates into inception as well: in VGG16 there are a few CNN layers and two dense layers at the end, the last one having 1000 outputs.

The key is in these dense layers. If you want 30x more categories you probably need at least 30x more parameters in these. And I would naively assume that the relationship is more quadratic.

The idea is that if the lower layers trained to recognize features that are useful in differentiating 1000 categories, they probably are good enough features to recognize other categories. After all we know that there are eyes detectors there, grid detectors, text detectors, that naturally emerge.

ImageNet will continue to increase its number of categories anyway, so unless you have a cluster of TITAN GPU in your basement, you may just as well wait for their results and take their networks.

jamesmcintyre · on Sept 4, 2017

Thanks for the answer!

colmvp · on Sept 1, 2017

200 is small though I found that I could easily fit a pretrained network to a new object because of the feature detectors created from training on ImageNet

bitL · on Aug 31, 2017

GAN can help there.

thedirt0115 · on Aug 31, 2017

You can get pretty far with just 1000 words :) https://xkcd.com/thing-explainer/

GuiA · on Aug 31, 2017

That book is better in concept than in execution. Many of the expressions it comes up with sound outright ridiculous and would be completely nonsensical outside of their context.

It's a fun exercise in style, but does not prove much.

prodtorok · on Sept 1, 2017

How far out are we from Object Detection that can identify at different scales? - i.e. angled, facing towards, rotated, skewed, etc...

We seem to have good OD that can create horizontal bounding boxes, but these bounding boxes seem to be generic estimates.

Even rectangle detection with these models can't identify the angle or skew of a rectangle in a frame (we get the same generic bounding box)

OpenCV seemed to have models awhile ago that could do this just fine.

alexcnwy · on Sept 1, 2017

You can re-pose it as a regression problem to detect the coordinates of the 4 corners in the case of a single object in the image.

yorwba · on Sept 1, 2017

If you need an angled bounding box, you could probably modify any of the current approaches in that regard. You could also add a post-processing step where you take the predicted bounding boxes, rotate them in all possible directions, and predict the most likely one.

But if you're dealing with known geometric shapes, like e.g. rectangles, you'll get better results if you use "classic" detection algorithms that are already mathematically optimal.

For example, I once had to count the number of atoms in an electron-microscope image. I simply ran a circle detector with very sensitive settings, then culling overlapping circles with lower "circleness" score. That missed a few atoms that were stacked on top of others, but still got more accurate results than the previous method, which apparently involved a poor grad student ticking them off on paper.

nshm · on Sept 1, 2017

Does not work since requires much more resources. Also, small objects in high resolution images are painfully slow to detect. Most models are trained for 700x700 at max, 4096x4096 many times slower.