Google can identify and transcribe all the views it has of street numbers in France in less than an hour, thanks to a neural network that’s just as good as human operators.
Now its engineers reveal how they developed it.
Google Street View has become an essential part of the online mapping experience. It allows users to drop down to street level to see the local area in photographic detail.
But it’s also a useful resource for Google. The company uses the images to read house numbers and match them to their geolocation, which pins down the physical position of each building in its database. That’s particularly useful in places where street numbers are otherwise unavailable, or in places such as Japan and South Korea, where buildings are rarely numbered sequentially along a street but in other ways, such as the order in which they were constructed, a system that makes many buildings impossibly hard to find, even for locals.
But the task of spotting and identifying these numbers is hugely time-consuming. Google’s street view cameras have recorded hundreds of millions of panoramic images that together contain tens of millions of house numbers. The task of searching these images manually to spot and identify the numbers is not one anybody could approach with relish.
To start with, Ian Goodfellow and colleagues at Google place some limits on the task at hand to keep it as simple as possible. For example, they assume that the building number has already been spotted and the image cropped so that the number is at least one-third the width of the resulting frame. They also assume that the number is no more than five digits long, a reasonable assumption in most parts of the world.
But the team does not divide the number into single digits, as many other groups have done. Their approach is to localize the entire number within the cropped image and to identify it in one go, all with a single neural network. They train this net on the Street View House Numbers data set, a publicly available collection of some 200,000 numbers photographed by Google’s Street View cameras. The training takes about six days to complete, they say.
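To make the "in one go" idea concrete, here is a minimal Python sketch of how a single network's outputs might be turned into a number. The article does not describe the output layout, so everything here is an assumption for illustration: the network is imagined to emit one probability distribution over the length (0 to 5 digits) and one distribution over the ten digits for each of the five positions, and the hypothetical `decode_number` helper picks the jointly most likely string.

```python
import math

MAX_DIGITS = 5  # the article's stated upper bound on number length


def decode_number(length_probs, digit_probs):
    """Pick the most likely digit string from per-output probabilities.

    Assumed layout (not taken from the article):
      length_probs: MAX_DIGITS + 1 probabilities, for lengths 0..MAX_DIGITS.
      digit_probs:  MAX_DIGITS lists of 10 probabilities, one per position.
    Returns (number_as_string, log_probability).
    """
    best_number, best_logp = "", -math.inf
    prefix, prefix_logp = "", 0.0
    for length in range(MAX_DIGITS + 1):
        if length > 0:
            # Extend the running prefix with the best digit at this position.
            probs = digit_probs[length - 1]
            digit = max(range(10), key=probs.__getitem__)
            prefix += str(digit)
            prefix_logp += math.log(probs[digit])
        if length_probs[length] > 0:
            # Score = log P(length) + sum of log P(best digit) over the prefix.
            total = math.log(length_probs[length]) + prefix_logp
            if total > best_logp:
                best_number, best_logp = prefix, total
    return best_number, best_logp
```

Because the digit outputs are scored independently in this sketch, the best string of a given length is simply the best digit at each position, so only six candidates need to be compared.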
The goal is to match human operators, which the team puts at an accuracy of 98 percent. However, that doesn’t mean spotting 98 percent of the numbers in 100 percent of the images. Instead, Goodfellow and co say it is acceptable to spot 98 percent of the numbers in a certain subset of images, which in this case turns out to cover around 95 percent of the total.
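The trade-off between accuracy and the share of images covered can be illustrated with a short Python sketch. The article does not say how the subset is chosen; the assumption here is that each prediction comes with a confidence score and that the least confident ones are set aside. The function name `coverage_at_accuracy` and the input format are hypothetical.

```python
def coverage_at_accuracy(predictions, target_accuracy=0.98):
    """Find the largest confident subset that meets a target accuracy.

    predictions: list of (confidence, is_correct) pairs (assumed format).
    Returns (coverage, threshold): the fraction of images kept and the
    lowest confidence admitted, or (0.0, None) if the target is never met.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    best_coverage, best_threshold = 0.0, None
    correct = 0
    for i, (confidence, is_correct) in enumerate(ranked):
        correct += bool(is_correct)
        kept = i + 1
        # Only cut between distinct confidence values, so that a threshold
        # admits exactly the predictions counted so far.
        last_of_tie = kept == len(ranked) or ranked[kept][0] != confidence
        if last_of_tie and correct / kept >= target_accuracy:
            best_coverage = kept / len(ranked)
            best_threshold = confidence
    return best_coverage, best_threshold
```

With scores like these, "98 percent accuracy at around 95 percent coverage" would mean that the function returns a coverage of about 0.95 when called with the default target.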
But even this is significantly better than any other team has been able to achieve. “Worldwide, we automatically detected and transcribed close to 100 million physical street numbers at [human] operator level accuracy,” they say, describing this as an “unprecedented success.” And they can do it at considerable speed. “We can transcribe all the views we have of street numbers in France in less than an hour using our Google infrastructure,” they say. Yep, that’s just one hour. One interesting question is whether the same technique might help extract other numbers such as telephone numbers on business signs or even number plates.
However, Goodfellow and co are not optimistic. They say the success of their technique rests heavily on the assumption that street numbers are never more than five digits long. “For large [numbers of digits] our method is unlikely to scale well,” they say.
The big question of course is what’s next. And Goodfellow and co oblige by opening the kimono just a fraction: “This approach of using a single neural network as an entire end-to-end system could be applicable to other problems such as general text transcription or speech recognition.”
Ref: Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks