The Toronto researchers combined several techniques to achieve their breakthrough results. One was the use of convolutional neural networks. I did a deep dive on convolutions last year, so check that piece out if you want the full explanation. In a nutshell, a convolutional network effectively trains small neural networks—ones whose inputs are perhaps seven to eleven pixels on a side—and then "scans" them across a larger image.
"It's like taking a stencil or pattern and matching it against every single spot on the image," AI researcher Jie Tang last year. "You have a stencil outline of a dog, and you basically match the upper-right corner of it against your stencil—is there a dog there? If not, you move the stencil a little bit. You do this over the whole image. It doesn't matter where in the image the dog appears. The stencil will match it. You don't want to have each subsection of the network learn its own separate dog classifier."
Another key to AlexNet's success was the use of graphics cards to accelerate the training process. Graphics cards contain massive amounts of parallel processing power that's well suited for the repetitive calculations required to train a neural network. By offloading computations onto a pair of GPUs—Nvidia GTX 580s, each with 3GB of memory—the researchers were able to design and train an extremely large and complex network. AlexNet had eight trainable layers, 650,000 neurons, and 60 million parameters.
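Where do those 60 million parameters come from? The tally below uses the layer shapes reported in the original 2012 AlexNet paper—they are not stated in this article, so treat the specific numbers as an illustration rather than a quotation.

```python
# Rough weight count for an AlexNet-style network, using layer shapes
# from the original 2012 paper (an assumption here, not from the article).
conv_layers = [
    # (kernel_h, kernel_w, input_channels_per_group, output_channels)
    (11, 11, 3, 96),     # conv1
    (5, 5, 48, 256),     # conv2 (inputs split across the two GPUs)
    (3, 3, 256, 384),    # conv3
    (3, 3, 192, 384),    # conv4 (split)
    (3, 3, 192, 256),    # conv5 (split)
]
fc_layers = [
    (6 * 6 * 256, 4096),  # fc6
    (4096, 4096),         # fc7
    (4096, 1000),         # fc8: one output per ImageNet category
]

total = sum(kh * kw * cin * cout for kh, kw, cin, cout in conv_layers)
total += sum(n_in * n_out for n_in, n_out in fc_layers)
print(f"{total / 1e6:.1f} million weights")  # prints: 61.0 million weights
```

Notably, almost all of those weights sit in the three fully connected layers at the end; the convolutional layers are comparatively cheap precisely because their small stencils are reused across the whole image.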
Finally, AlexNet's success was made possible by the large size of the ImageNet training set: a million images. It takes a lot of images to tune 60 million parameters. It was the combination of a complex network and a large data set that allowed AlexNet to achieve a decisive victory.
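A quick back-of-the-envelope calculation using the figures above shows why so much data was needed: even at ImageNet's scale, each training image has to help pin down dozens of parameters, so a data set a tenth the size would leave the model badly underconstrained.

```python
# Back-of-the-envelope: how thinly is the training signal spread?
parameters = 60_000_000       # AlexNet's trainable parameters (cited above)
training_images = 1_000_000   # approximate size of the ImageNet training set
print(parameters // training_images)  # prints: 60 (parameters per image)
```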
An interesting question to ask here is why an AlexNet-like breakthrough didn't happen years earlier:
The pair of consumer-grade GPUs used by the AlexNet researchers was far from the most powerful computing hardware available in 2012. More powerful computers had existed five and even ten years earlier. Moreover, the technique of using graphics cards to accelerate neural network training had been around since at least 2004.

A million images was an unusually large data set for training machine learning algorithms in 2012, but Internet scraping was hardly a new technology at that point. It wouldn't have been that hard for a well-funded research group to assemble a data set that large five or ten years earlier.

The main algorithms used by AlexNet were not new. The backpropagation algorithm had been around for a quarter-century by 2012, and the key ideas behind convolutional networks were developed in the 1980s and 1990s.
So each element of AlexNet's success existed separately long before the AlexNet breakthrough. But apparently no one had thought to combine them—in large part because no one realized how powerful the combination would be.
Making neural networks deeper didn't do much to improve performance if the training data set wasn't big. Conversely, expanding the size of the training set didn't improve performance very much for small neural networks. You needed both deep networks and large data sets—plus the vast computing power required to complete the training process in a reasonable amount of time—to see big performance gains. The AlexNet team was the first one to put all three elements together in one piece of software.
The deep learning boom