dH #010: Deep Learning for Computer Vision: Building Intelligent Visual Systems

*Understanding the intersection of artificial intelligence and visual perception*

## Introduction

Deep Learning for Computer Vision represents one of the most transformative areas in modern artificial intelligence. This field combines the power of learning algorithms with the complexity of visual data processing, creating systems that can truly “see” and understand the world around them.

## Defining Computer Vision

Computer vision is the study of building artificial systems that can process, perceive, and otherwise reason about visual data. What do process, perceive, and reason actually mean? These terms are defined quite broadly, and their exact scope is somewhat up for interpretation.

Visual data encompasses a vast spectrum of information: images, videos, medical scans, or just about any type of continuously valued signal you can think of.

## Understanding Learning in the Context of AI

So before we get to deep learning, what is learning? Learning is the process of building artificial systems that learn from data and experience. Notice that this is somewhat orthogonal to the goals of computer vision. Computer vision just says, we want to understand visual data. We don’t care how you do it.

Learning is this separate problem of trying to build systems that can adapt to the data that they see and the experiences that they have in the world. And from the outside, it’s not immediately clear why these two go together.

But it turns out that over the last 10 to 20 years, we’ve found that learning-based systems are essential for building many kinds of generalizable systems, both in computer vision and across many areas of artificial intelligence and computer science more broadly.

## Deep Learning: The Next Evolution

So when we think about deep learning, deep learning is yet another subset of machine learning. Deep learning consists of hierarchical learning algorithms with many “layers”, (very) loosely inspired by the brain.

I want to emphasize the word loosely. You’ll often see people in deep learning claim that these hierarchical, many-layered algorithms work the way the brain learns or processes information. I think you should take any such comparison with a grain of salt. There are some very coarse parallels between brains and the neural networks we use today, but you should not take them too seriously.

## The Broader Context: Artificial Intelligence

Stepping back from these specific topics, artificial intelligence is a broad research field that encompasses both computer vision and machine learning. Broadly speaking, AI asks: how can we build computer systems that do the things people normally do?

People will argue about what is and is not artificial intelligence, but ultimately we just want to build smart machines, whatever that means to each of us. There are clearly many different subdisciplines of artificial intelligence, but the most important, in my admittedly biased opinion, are computer vision, teaching machines to see, and machine learning, teaching machines to learn. These are the topics that we’ll study in this class.

So where does deep learning fall in this picture? Deep learning is a subset of machine learning that intersects computer vision and falls within the larger realm of AI. This class focuses on exactly that intersection in the middle.

## Historical Foundations

So today’s agenda is a brief history of computer vision and deep learning, plus a course overview and logistics. Before we dive into that intersection and talk about machine learning, deep learning, and computer vision, all that really good stuff, I think it’s important to get a bit of historical context about how we got here as a field.

This has been a hugely successful research area in the last five to 10 years. But deep learning, machine learning, computer vision, these are areas with decades and decades of research built upon them. And all of the successes we’ve seen in the last few years have been a result of building upon decades of prior research in these areas.

So today I want to give a bit of a brief history and overview of some of the historical context that led up to the successes of today.

## The Origins of Computer Vision

So whenever you talk about a research area, it’s always difficult to pinpoint the start, right? Everything builds on everything else. There’s always prior work; everyone was inspired by something that came before. But with a finite amount of time to talk about a finite number of things, you’ve got to draw the line somewhere.

So one place where I like to draw that line, as maybe the start of computer vision, is actually not with computer scientists at all. It’s the seminal study of Hubel and Wiesel back in 1959, who were not interested in computers at all; they wanted to understand how mammalian brains work. Recording from neurons in the cat visual cortex, they found cells, which they called simple and complex cells, that respond to oriented edges and build up a hierarchy of increasingly complex visual features.

## Early Computer Vision Pioneers

Moving forward a few years to 1963, Larry Roberts graduated from MIT with his PhD, having written perhaps the first PhD thesis on computer vision. Of course, in 1963 doing anything with computers was very cumbersome, and doing anything with digital cameras was very cumbersome.

So large portions of his thesis deal simply with getting image data into the computer at all, because that was not something you could take for granted at the time. But even working within those constraints, he built a system that could take a raw picture, detect some of the edges in it, inspired by Hubel and Wiesel’s discovery that edges are fundamental to visual processing, then detect feature points, and from there start to recover the 3D geometry of objects in images.


Early computer vision processing workflow

Now, what’s really interesting is that if you go and look at Larry Roberts’ Wikipedia page, it doesn’t mention any of this at all. After finishing his PhD, he went on to become one of the founding fathers of the internet and a hugely important player in the networking technologies developed around that time. So writing the first PhD thesis in computer vision was, in a sense, a low point in his career. I think all of us can aspire to that kind of success.

## The Summer Vision Project

So then, moving forward a couple more years, people were getting really excited. In 1966 at MIT, Seymour Papert proposed the famous Summer Vision Project.

Summer Vision Project overview


The proposal reads: “The summer vision project is an attempt to use our summer workers effectively in the construction of a significant part of a visual system. The particular task was chosen partly because it can be segmented into sub-problems which will allow individuals to work independently and yet participate in the construction of a system complex enough to be a real landmark in the development of ‘pattern recognition’.” In other words: we’re going to hire a couple of undergrads to work over the summer, and they’ll build a significant part of a visual system.

These folks were really ambitious back in the day. Clearly, computer vision was not solved that summer; they did not achieve this ambitious goal. And more than 50 years later, we’re still plugging away at what they thought they could do in one summer with undergrads.

## David Marr’s Vision Theory

So moving forward into the 1970s, one hugely influential figure in this era was David Marr, who proposed the idea of stages of visual representation, which again harkens back to Hubel and Wiesel.


David Marr’s visual processing pipeline

So here we start with the input image. From it we compute a more abstract edge image, and from the edge image we extract what Marr called a primal sketch, consisting of objects and their boundaries: zero crossings, blobs, edges, bars, ends, virtual lines, groups, and curves. From there we reason about relative depth and local surface orientation, along with discontinuities in depth and in surface orientation. And eventually we build up full 3D models, hierarchically organized in terms of surface and volumetric primitives.

## Early Object Recognition

Also in the 70s, people started to become interested in recognizing objects: thinking about ways to build computer systems that could recognize not just edges and simple geometric shapes, but more complex objects like people.

There was work on generalized cylinders (Brooks and Binford, 1979) and pictorial structures (Fischler and Elschlager, 1973) that tried to recognize people as deformable configurations of rigid parts with some known topology. This was very influential work at the time.

But in the 1970s, processing power was very limited and digital cameras were very limited, so a lot of this work remained, in a sense, a toy.

## Edge Detection in the 1980s

And as we move into the 80s, people had much more access to better digital cameras, more and more computational power. And people began to work on slightly more realistic images.

1980s edge detection timeline

So one kind of theme in the 1980s was trying to recognize objects and images via edge detection. I told you that edges were going to be super influential throughout the history of computer vision.

There was a very famous paper by John Canny in 1986 that proposed a very robust algorithm for detecting edges in images. The next year, in 1987, David Lowe proposed a mechanism for recognizing objects in images by matching their edges.


Edge detection example with razors

So in this example, imagine we’ve got a cluttered pile of razors, and we detect the edges. Suppose we also have a template picture of a razor that we know about. Then we can detect the edges of the template razor and try to match it into this cluttered image of many razors.
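To make the first step concrete, here is a minimal sketch of Canny edge detection using OpenCV. The file names are placeholders and the two thresholds are values you would tune per image; treat this as an illustration, not Canny’s original implementation.

```python
import cv2

# Load a (hypothetical) image in grayscale and run the Canny edge
# detector. The two numbers are the low/high hysteresis thresholds:
# strong gradients above the high threshold seed edges, and weaker
# gradients above the low threshold may extend them.
img = cv2.imread("razors.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, threshold1=100, threshold2=200)
cv2.imwrite("razors_edges.png", edges)
```

The resulting binary edge map is the kind of representation that edge-matching systems like Lowe’s 1987 work operated on.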

## Recognition via Grouping (1990s)

And now moving on into the 1990s, people, again, wanted to build to more and more complex images, more and more complex scenes. So here, a big theme was trying to recognize objects via grouping.


Object recognition through grouping examples

Here, rather than just matching edges, we want to take the input image and segment it into semantically meaningful chunks. Maybe the person forms one meaningful chunk and the flowers form others, with the idea that if we can first do this sort of grouping, then the downstream problem of recognizing or labeling those groups might become easier.

## Recognition via Matching (2000s)

Then in the 2000s, a big theme was recognition via matching.


SIFT feature matching system

A hugely influential paper here was SIFT, by David Lowe in 1999, which proposed another form of recognition via matching. The idea is to take our input image and detect small, recognizable keypoints at different positions in the image. At each of those keypoints, we represent the local appearance using some kind of feature vector.

That feature vector is a real-valued vector that somehow encodes the appearance of the image at that little point in space. By very careful design of exactly how the feature vector is computed, you can encode different types of invariance into it, such that if we were to take the same image and rotate it a little bit, or brighten or darken the lighting in the scene a little bit, we would hopefully compute the same value for the feature vector even though the underlying image changed slightly.


SIFT stop sign feature matching example

Once we can extract these sets of robust, invariant feature vectors, we can again perform recognition via matching. On the left, given a template image of a stop sign, we detect all these distinctive, invariant keypoints. On the right, given another image of a stop sign, perhaps taken from a different angle under different lighting conditions, the careful, clever design of these invariant, robust features lets us match corresponding points and thereby recognize that the image on the right depicts a stop sign. This illustrates the recognition-via-matching approach of the 2000s.
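As a rough sketch of that pipeline, here is what keypoint detection and matching look like with OpenCV’s SIFT implementation. The file names are placeholders, and the 0.75 ratio is the commonly used value for Lowe’s ratio test; this is an illustration rather than the exact historical system.

```python
import cv2

# Load a (hypothetical) template and scene image in grayscale.
template = cv2.imread("stop_sign_template.jpg", cv2.IMREAD_GRAYSCALE)
scene = cv2.imread("street_scene.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute a 128-dimensional SIFT descriptor
# (the feature vector described above) at each one.
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(template, None)
kp2, des2 = sift.detectAndCompute(scene, None)

# For each template descriptor, find its two nearest neighbors in the
# scene, and keep a match only if the best is clearly better than the
# second best (Lowe's ratio test).
matcher = cv2.BFMatcher()
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(f"{len(good)} reliable correspondences found")
```

Enough surviving correspondences, consistent with a single geometric transformation, is evidence that the template object appears in the scene.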

## The Rise of Large-Scale Datasets

One hugely influential piece of work here was the Pascal Visual Object Challenge (Pascal VOC). We could download images from the internet to build datasets, get graduate students to label those images, and then use machine learning algorithms to mimic the labels that the graduate students wrote down.


Pascal VOC performance graph

If you do that, you can see in the graph on the right that performance on the Pascal VOC recognition challenge increased steadily over time, from about 2006 to 2012.

## The ImageNet Revolution

This brings us to the ImageNet Large Scale Visual Recognition Challenge. Since it is a competition, we can look at how the error rate changed over time.


ImageNet error rate progression

The competition was first run in 2010 and 2011, with error rates sitting around 28.2% and 25.8% in those years. Then something big happened in 2012 with the work of Krizhevsky et al. (AlexNet).

At the 2012 ImageNet competition, the error rate suddenly dropped in a single year from 25.8% to 16.4%. And after 2012, error rates just kept falling, very fast, with works like Zeiler & Fergus (2013), Simonyan & Zisserman (VGG, 2014), Szegedy et al. (GoogLeNet, 2014), He et al. (ResNet, 2015), Shao et al. (2016), and Hu et al. (SENet, 2017).

## The Deep Learning Foundations

But there’s something else. The dramatic success we saw in 2012 wasn’t entirely sudden; it built upon decades of foundational work in neural networks, going back to the Perceptron of 1958. A multi-layer version of that algorithm, the multi-layer perceptron, can learn to represent many, many different types of functions; it is a very flexible representation.


Historical AI timeline and Perceptrons book

But that point got lost in the headlines of the time. Minsky and Papert’s 1969 book Perceptrons showed that a single-layer perceptron cannot learn even the XOR function, which caused a lot of disillusionment in the field. People just heard that perceptrons didn’t work, decided they were dead, and stopped working on them. The timeline shows the key developments in AI from the Perceptron in 1958 to ImageNet in 2009, with the ‘AI Winter’ period highlighted between the late 1970s and 1990s.
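You can see the XOR limitation directly in a few lines. This is a sketch using scikit-learn (my choice of library here); any purely linear classifier behaves the same way.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# The four XOR points. No single line can separate the two 1-labeled
# points from the two 0-labeled points, so a single-layer perceptron
# can never get all four right.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

clf = Perceptron(max_iter=1000, random_state=0).fit(X, y)
print(clf.score(X, y))  # at most 0.75, no matter how long it trains
```

A multi-layer perceptron solves XOR easily, which is exactly the nuance that got lost in the headlines.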

## The Neocognitron: A Prescient Architecture

Skipping ahead to 1980, there was a very influential model called the Neocognitron, developed by Fukushima, a Japanese computer scientist who was directly inspired by Hubel and Wiesel’s idea of hierarchical processing in neurons.

Neocognitron architecture diagram

Remember, Hubel and Wiesel talked about simple cells, complex cells, and hierarchies of neurons that could gradually respond to more and more complex visual stimuli in the image. Fukushima proposed a computational realization of Hubel and Wiesel’s formulation that he called the Neocognitron.

The Neocognitron interleaved two types of operations: simple cells and complex cells. The first was a computational version of simple cells which, in modern terminology, looks very much like convolution. The second was a computational version of complex cells which, in modern terminology, looks very much like the pooling operations we use in modern convolutional networks.

What’s striking is that the Neocognitron, back in 1980, already had an overall architecture and method of processing that look very similar to the famous AlexNet system that swept the competition in 2012. Even the figures in the two papers look quite similar, more than 32 years apart. They arrived at much the same thing.

So the Neocognitron (Fukushima, 1980) defined a computational model of the visual system directly inspired by Hubel and Wiesel’s hierarchy of simple and complex cells. Fukushima had the right idea of interleaving simple cells (convolution) and complex cells (pooling) in a hierarchy, but he did not have any practical training algorithm.
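In modern terms, the Neocognitron’s interleaving of simple and complex cells looks roughly like the following PyTorch sketch. The framework, layer sizes, and the ReLU nonlinearity are my choices for illustration; Fukushima’s actual formulation differed in its details.

```python
import torch
import torch.nn as nn

# Interleaved "simple cells" (convolution: local, shared filters) and
# "complex cells" (pooling: tolerance to small shifts), stacked into
# a hierarchy -- the core pattern shared with modern ConvNets.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5, padding=2),   # simple cells
    nn.ReLU(),
    nn.MaxPool2d(2),                             # complex cells
    nn.Conv2d(8, 16, kernel_size=5, padding=2),  # simple cells again
    nn.ReLU(),
    nn.MaxPool2d(2),                             # complex cells again
)

x = torch.randn(1, 1, 28, 28)  # a dummy 28x28 grayscale image
print(model(x).shape)          # torch.Size([1, 16, 7, 7])
```

The catch is that the weights here start out random: Fukushima had the architecture, but no practical algorithm to train it, which is where the next development comes in.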


Neocognitron vs modern networks

Because remember, there are a lot of learnable weights in this system, a lot of connections between all the neurons inside, and they need to be set somehow. Fukushima did not have an efficient algorithm for learning to properly set all of the free weight parameters in the system from data.

## Backpropagation: The Training Breakthrough

A few years later came another massively influential paper, by Rumelhart, Hinton, and Williams in 1986, which introduced backpropagation for computing gradients in neural networks.


Backpropagation algorithm diagram

Remember that the Perceptrons book described the multi-layer perceptron, which was thought to be very powerful in its ability to represent and learn different types of functions. The 1986 paper introduced the backpropagation algorithm, which for one of the first times let people successfully and efficiently train these deeper models with multiple layers of computation.

And these look very much like the modern neural networks we use today, with recognizable mathematical notation for computing gradients. If you open the paper and read through it, you’ll see gradients, Jacobians, Hessians, all the mathematical terminology that we still think about today when building and training neural networks.

So these look very much like the modern fully connected networks that we still use today, sometimes called multi-layer perceptrons in homage to this long history.
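To make this concrete, here is a minimal NumPy sketch of training a two-layer network with backpropagation, that is, applying the chain rule layer by layer, and, fittingly, on the XOR problem that a single-layer perceptron cannot solve. All sizes and hyperparameters are arbitrary choices for illustration.

```python
import numpy as np

# XOR data: the classic problem a single linear layer cannot solve.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)   # hidden layer
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)   # output layer
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    p = sigmoid(h @ W2 + b2)
    loss = np.mean((p - y) ** 2)

    # Backward pass: the chain rule, applied layer by layer.
    dp = 2 * (p - y) / len(X)      # dLoss/dp
    dz2 = dp * p * (1 - p)         # back through the sigmoid
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dh = dz2 @ W2.T                # back into the hidden layer
    dz1 = dh * (1 - h ** 2)        # back through the tanh
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    # Gradient descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss, p.round(2).ravel())  # loss should end up near zero
```

The same pattern, a forward pass followed by a gradient-propagating backward pass, is what scales up to the deep networks discussed below.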

## The Deep Learning Renaissance


Deep learning research timeline

In the 2000s, people tried to train neural networks that were deeper and deeper, though this was not a mainstream research topic at the time; key works include Hinton and Salakhutdinov (2006), Bengio et al. (2007), Lee et al. (2009), and Glorot and Bengio (2010).

It was around this period, in the 2000s, that the term deep learning first emerged, with deep referring to the multiple layers in these neural-network-style algorithms. This was really not a mainstream research topic at the time; a relatively small number of research groups and a relatively small number of people were studying these ideas.

But I think a lot of the fundamentals whose rewards we’re reaping now were really developed during this period in the 2000s, when people started to figure out the modern tricks for training different types of neural network systems: pretraining, RBM-initialized autoencoders, and fine-tuning with backprop.

## The Perfect Storm of 2012

That finally brings us back to AlexNet. In 2012, we had a great confluence: ImageNet, a computer vision task that people in the field were super excited about, met a set of new techniques, convolutional neural networks and efficient ways to train them, developed by this parallel research community. Everything came together at just the right time in 2012, leading to the breakthrough of AlexNet on the ImageNet task.


Deep learning explosion graph

From 2012 to the present day, we’ve seen an absolute explosion in the use of convolutional and other types of neural networks across computer vision, related areas of AI, and computer science more broadly, as evidenced by the Google Trends graph showing exponential growth starting around 2012.

On the left, the Google Trends curve for deep learning shows this massive, roughly exponential growth that really took off around 2012. Another measure is academic interest: the number of submitted and accepted papers at CVPR, the top computer vision conference, has grown dramatically in recent years.


CVPR paper submission trends

CVPR is one of the premier venues for academic publications in computer vision. A graph shown at the conference keynote plots the year of the conference on the x-axis and the number of submitted and accepted papers on the y-axis.

## Modern Applications Everywhere

And if you look around the field today, convolutional networks and other types of deep neural networks are being used for just about every application of computer vision you can imagine. Since 2012, these convolutional networks really are everywhere.


Computer vision applications overview

They’re being used for a wide diversity of tasks: image classification, where we want to put labels on images such as ‘mite’, ‘container ship’, ‘motor scooter’, or ‘leopard’; image retrieval, where we want to retrieve images from collections, like the purple flowers or elephants shown; and object detection, where we want to recognize the positions of objects in images while simultaneously labeling them.


Object detection and segmentation examples

Object detection labels both the what and the where. Closely related is semantic segmentation, which goes back to the idea of grouping from 1990s computer vision: we want to label a region of pixels as being part of a cohesive whole.


Video processing with ConvNets

Convolutional networks are also being used for video classification, with a spatial stream ConvNet and a temporal stream ConvNet as in the diagram from Simonyan et al., 2014, and for activity recognition. They’re being used for pose estimation, where we want to recover the exact geometric poses of people in images. And they show up even in things that don’t feel like classical computer vision, like playing Atari games: a convolutional neural network processes the visual input of the game and is combined with other learning techniques to jointly learn a visual representation of the game world and how to act in it.

## Beyond Traditional Vision

Convolutional neural networks are also getting used for visual tasks involving visual data that humans aren’t very good at.


Scientific applications of ConvNets

Convolutional networks are being used in medical imaging to diagnose different types of tumors as benign or malignant, as in the medical imaging examples. They’re being used in galaxy classification, as illustrated by the galaxy images. And they’re being used in tons of scientific applications, like recognizing whales or elephants or other types of animals. Scientists go out into the world and collect a lot of data, and visual recognition serves as a kind of universal sensor that helps them make use of all that data and gain insights into their particular field of expertise, as exemplified by the whale recognition example.


And we’ve seen computer vision and convolutional networks branch out into all these other areas of science that just open up and unlock lots of new applications just across the board.


Image captioning examples

Convolutional networks are getting used for all kinds of fun tasks, like image captioning, where we can build systems that write natural language descriptions of images: ‘a white teddy bear sitting in the grass’, ‘a man in a baseball uniform throwing a ball’, ‘a woman holding a cat’, ‘a man riding a wave on a surfboard’, ‘a cat sitting on a suitcase’, ‘a woman standing on a beach holding a surfboard’.

We can also use convolutional networks for generating art, making all these kinds of psychedelic portraits and stylized images (Mordvintsev et al., 2015; Gatys et al., 2016). But we might ask: what was it that happened in 2012 that made all this take off?

## The Three Pillars of Success

My personal interpretation is that it was a combination of three big components that came together all at once.


Three components diagram

One was algorithms: the stream of people working on deep learning, neural networks, and machine learning had developed a powerful set of tools for representing learnable functions and for training them with the backpropagation algorithm.

The second was data: with the rise of digital cameras and the internet, and the development of crowdsourcing, we were able to collect unprecedented amounts of labeled data that could be used to train these systems.

And the third piece, which we haven’t really talked about, was the massive rise in computational resources, especially GPUs, that has been happening continually throughout the history of computer science. One graph that I find particularly striking plots gigaflops of computation per dollar as a function of time.

So algorithms, data, and computation, fueled by advances in GPUs, came together around 2012 and enabled new applications of convolutional networks across different types of computer vision tasks.

## Recognition and Future Challenges


2018 Turing Award winners

In recognition of the impact of computer vision and deep learning across the field, the 2018 Turing Award was given to Yoshua Bengio, Geoffrey Hinton, and Yann LeCun for pioneering many of the deep learning ideas that we’ll learn throughout this class.

For those of you who don’t know, the Turing Award is widely considered the Nobel Prize equivalent in computer science. This recognition came quite recently, acknowledging a massively influential line of research that has been changing all of our lives over the last several years.

But I think it’s important to stay humble and realize that despite all of the successes we’ve seen in convolutional networks, deep learning, and computer vision, we are really still a long way from building systems that can perceive and understand visual data with the same fidelity and power as humans.

Despite our success, computer vision still has a long way to go. Consider a photo of a group of businessmen walking down an office hallway. If we fed it to a convolutional network, it would probably just say there is a group of men in suits walking down a hallway. But when we look at it, we see quite a different story.

You know how office environments and professional attire work, which requires some understanding of corporate culture. You know the men are walking with purpose and are likely colleagues or coworkers. You know that people tend to be conscious of their professional appearance and behavior in such settings.

This kind of deep contextual understanding, social awareness, and background knowledge integration represents the frontier that computer vision systems still need to achieve to match human-level visual intelligence.

## Conclusion

From Hubel and Wiesel’s groundbreaking neuroscience research in 1959 to the deep learning revolution of 2012 and beyond, we’ve traced the remarkable journey of computer vision and deep learning. What started as curiosity about how mammalian brains process visual information has evolved into one of the most transformative technologies of our time.

The convergence of algorithmic innovations, massive datasets, and computational power created the perfect storm that enabled AlexNet’s breakthrough and the subsequent explosion in deep learning applications. Today, convolutional neural networks power everything from medical diagnosis to autonomous vehicles, from scientific discovery to creative applications.

Yet as we’ve seen, despite these remarkable achievements, computer vision systems still lack the nuanced understanding, contextual reasoning, and social intelligence that humans bring to visual perception. The journey from biological inspiration to artificial intelligence continues, with exciting challenges and opportunities ahead.

*The story of deep learning for computer vision reminds us that breakthrough innovations often build upon decades of foundational research, waiting for the right moment when theory, data, and computation align to change the world.*