Geoffrey Hinton is one of the founding fathers of deep learning, the winner of the 2019 Turing Award, and a Google engineer. Last week, during the I/O developer conference, Wired interviewed him about his fascination with the brain and the possibility of modeling computers on its neural structure. For a long time these ideas were considered foolish. It is an interesting and engaging conversation about consciousness, Hinton's future plans, and whether we can teach computers to dream.
What about neural networks?
Let's start with the time when you wrote your first, very influential papers. Everyone said, "The idea is clever, but we won't actually be able to design computers this way." Explain why you persisted, and why you were so sure you had found something important.
It seemed to me that the brain couldn't work any other way. It has to work by learning the strengths of connections. And if you want to make a device do something intelligent, you have two options: you program it, or it learns. And people certainly weren't programmed, so we had to learn. This approach had to be right.
Explain what a neural network is. Explain the original idea.
You take relatively simple processing elements that very loosely resemble neurons. They have incoming connections, each connection has a weight, and that weight can change during learning. What a neuron does is take the activities on its connections multiplied by their weights, sum them up, and then decide whether to send an output. If the sum is big enough, it sends an output. If the sum is negative, it sends nothing. That's all. You just have to wire up a cloud of such neurons with weights and figure out how to change those weights, and then they'll do anything. The only question is how you change the weights.
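Hinton's description maps onto a simple threshold unit. A minimal sketch in Python (the numbers and names here are illustrative, not from the interview):

```python
import numpy as np

def neuron(inputs, weights, bias=0.0):
    """A threshold unit: weighted sum of incoming activities, fire only if positive."""
    total = np.dot(inputs, weights) + bias
    return total if total > 0 else 0.0  # send an output only when the sum is large enough

# Three incoming connections, each with its own learned weight.
x = np.array([1.0, 0.5, -2.0])
w = np.array([0.8, 0.4, 0.3])
print(neuron(x, w))  # 0.8 + 0.2 - 0.6 = 0.4 > 0, so the neuron fires
```

Learning then amounts to adjusting `w` (and the bias) so the unit fires on the right inputs.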
When did you realize this was a rough approximation of how the brain works?
Oh yes, that was how it was conceived from the start. It was designed to resemble how the brain works.
So at some point in your career you started thinking about how the brain works. Maybe you were twelve, maybe twenty-five. When did you decide to try modeling computers on the brain?
Right away. That was the whole point. The whole idea was to create a learning device that learns like the brain, according to people's ideas about how the brain learns, by changing the strengths of connections. And it wasn't my idea; Turing had the same idea. Turing invented a huge part of the foundations of standard computer science, and yet he believed that the brain was an unorganized device with random weights, that it used reinforcement learning to modify the connections, and that it could therefore learn anything. And he believed that was the best route to intelligence.
So you followed Turing's idea that the best way to build a machine is to design it on the model of the human brain. This is how the human brain works, so let's build a machine like it.
Yes, and it wasn't just Turing's view. Many people thought so.
When did the dark times come? When did it happen that the other people who had worked on this and believed Turing's idea was right began to retreat, while you kept pushing the same line?
There was always a handful of people who kept believing no matter what, especially in psychology. But among computer scientists, I suppose, in the 90s it turned out that datasets were quite small and computers were not that fast. And on small datasets other methods, support vector machines in particular, worked a little better. Noise didn't bother them as much. So it was sad, because in the 80s we had developed backpropagation [the error back-propagation algorithm for training neural networks]. We thought it would solve everything. And we were puzzled that it didn't solve anything. The question was really one of scale, but we didn't know that then.
Why did you think it wasn't working?
We thought it wasn't working because we didn't have quite the right algorithms and not quite the right objective functions. For a long time I thought it was because we were doing supervised learning, where you label the data, and we should have been doing unsupervised learning, where you learn without any labels. It turned out the issue was mostly scale.
That's interesting. So the problem was that you had insufficient data. You thought you had the right amount of data, but it wasn't labeled properly. So you had simply misdiagnosed the problem?
I thought the mistake was that we used labels at all. Most of your learning is done without any labels; you're just trying to model the structure in the data. I actually still believe that. I think that as computers get faster, for any dataset of a given size, if your computer is fast enough, you're better off doing unsupervised learning. And once you've done the unsupervised learning, you can learn from fewer labels.
So through the 1990s you continue your research, you're in academia, you're still publishing, but you're not solving big problems. Was there ever a moment when you said, "You know what, enough of this. I'll go try something else"? Or did you just tell yourself you'd keep doing deep learning?
Yes. Something like this has to work. I mean, the connections in the brain learn somehow; we just have to figure out how. And maybe there are many different ways of strengthening connections during learning; the brain uses one of them. There may be other ways. But you certainly need something that can strengthen those connections while learning. I never doubted that.
You never doubted. When did it first look like it was working?
One of the big disappointments of the 80s was that if you made networks with many hidden layers, we couldn't train them. That's not entirely true, because you could train fairly simple tasks like handwriting recognition. But we didn't know how to train most deep neural networks. And around 2005 I came up with a way of training deep networks without supervision. You take your input, say your pixels, and you train a bunch of feature detectors that are simply good at explaining why the pixels look the way they do. Then you treat those feature detectors as the data and train another set of feature detectors, so that you can explain why those feature detectors have the correlations they have. And you keep training layer after layer. But the most interesting thing was that you could decompose it mathematically and prove that each time you trained a new layer, you didn't necessarily have a better model of the data, but you had a bound on how good your model was. And that bound got better with each layer you added.
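Hinton's 2005 method used stacked restricted Boltzmann machines; purely as an illustrative stand-in for the layer-by-layer idea, here is a toy sketch that trains tiny tied-weight autoencoder layers one at a time (all sizes, data and hyperparameters are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_layer(x, n_hidden, lr=0.5, epochs=300):
    """Train one layer of feature detectors to explain (reconstruct) its input."""
    W = rng.normal(0, 0.1, (x.shape[1], n_hidden))
    for _ in range(epochs):
        h = sigmoid(x @ W)                       # feature detector activities
        recon = sigmoid(h @ W.T)                 # explain the input from the features
        d_r = (recon - x) * recon * (1 - recon)  # reconstruction error signal
        d_h = (d_r @ W) * h * (1 - h)
        W -= lr * (x.T @ d_h + d_r.T @ h) / len(x)
    return W

def greedy_pretrain(data, layer_sizes):
    """Train layer after layer: each new layer models the features of the previous one."""
    weights, x = [], data
    for n in layer_sizes:
        W = train_layer(x, n)
        weights.append(W)
        x = sigmoid(x @ W)  # the trained features become the data for the next layer
    return weights

# Toy "pixels": noisy copies of two binary patterns.
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(0, 2, 200)] + rng.normal(0, 0.05, (200, 6))
stack = greedy_pretrain(data, [4, 2])
print([W.shape for W in stack])  # [(6, 4), (4, 2)]
```

Each layer only ever sees the output of the layer below it, which is the "train a new set of detectors on the detectors" step Hinton describes.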
What do you mean by a bound on how good your model is?
Once you have a model, you can ask, "How surprising does this model find the data?" You show it some data and ask, "Do you find this just what you expected, or is it surprising?" And that can be measured. What you want is a model, a good model, that looks at the data and says, "Yeah, yeah, I knew it. It's not surprising." It's always very hard to compute exactly how surprising a model finds the data, but you can compute a bound on that. You can say the model finds this data less surprising than that. And it could be shown that as you add extra layers of feature detectors, you get a model, and with each added layer the bound on how surprising it finds the data gets better.
So around 2005 you made this mathematical breakthrough. When did you start getting the right answers? What data were you working with? Your first breakthrough came with speech data, right?
It was just handwritten digits. Very simple. And around the same time GPUs (graphics processing units) were being developed. And the people doing neural networks started using GPUs around 2007. I had a very good student who started using GPUs to find roads in aerial photographs. He wrote code that was then adopted by other students using GPUs to recognize phonemes in speech. They used this pretraining idea, and once the pretraining was done, they just stuck labels on top and used backpropagation. It turned out you could make a very deep network pretrained this way, and then apply backpropagation, and it actually worked. For speech recognition it worked beautifully. At first, though, it was only a little better.
Better than the best commercially available speech recognition? Better than the best academic work on speech recognition?
On a relatively small dataset called TIMIT it was slightly better than the best academic work. IBM had also done a lot of work on it.
Very quickly people realized that this thing, since it was beating standard models that had taken 30 years to develop, would work really well with a bit more development. My graduate students went off to Microsoft, IBM and Google, and Google very quickly built a working speech recognizer. By 2012 the work first done in 2009 was in Android. Android suddenly became much better at speech recognition.
Tell me about the moment when you had held these ideas for 40 years, had been publishing on them for 20, and suddenly pulled ahead of your colleagues. What did that feel like?
Well, at that point I had only held the ideas for 30 years!
It was a wonderful feeling that it had all finally arrived at real problems.
Do you remember when you first got data showing this?
Okay. So you realized it worked for speech recognition. When did you start applying neural networks to other problems?
At first we started applying them to all sorts of other problems. George Dahl, with whom we had initially worked on speech recognition, used them to predict whether a molecule would bind to something and make a good drug. And there was a competition. He simply applied our standard technology, designed for speech recognition, to predicting drug activity, and won the competition. That was a sign that we were onto something very universal. Then a student said, "You know, Geoff, this thing is going to work for image recognition, and Fei-Fei Li has created a suitable dataset for it. There's a public competition; let's do something."
We got results that went far beyond standard computer vision. That was 2012.
So those are three areas where it excelled: modeling molecules, speech, images. Where did it fail?
You do understand that the setbacks are only temporary?
Well, what distinguishes the areas where it works fastest from the areas that need more time? It seems that visual processing, speech recognition, the sort of basic human things we do with sensory perception, are considered the first barriers to fall, right?
Yes and no, because there are other things we do well, motor skills for example. We are very good at motor control. Our brains are clearly built for it. And only now are neural networks beginning to compete with the best other technologies at that. They will win in the end, but right now they are only just starting to win.
I think reasoning, abstract reasoning, is the last thing we learn to do, and I think it will be among the last things these neural networks learn to do.
And yet you keep saying that neural networks will eventually win at everything.
Well, we are neural networks. Anything we can do, they can do.
True, but the human brain is hardly the most efficient computing machine ever created.
Certainly not my human brain! Couldn't there be a way of modeling machines that would be much more efficient than the human brain?
Philosophically, I have no objection to the idea that there could be some completely different way of doing all this. It might be that if you start with logic and try to automate logic, build some fancy theorem prover and do reasoning, and then decide that reasoning is how you get to visual perception, it might be that that approach wins. But it hasn't yet. I have no philosophical objection to such a victory. It's just that we know the brain can do it.
But there are things our brains don't do well. Does that mean neural networks can't do them well either?
Quite possibly, yes.
And there's a separate problem, which is that we don't fully understand how neural networks work, right?
Yes, we don’t really understand how they work.
We don't really understand how these top-down systems work. That's a basic element of neural networks that we don't understand. Explain it, and then let me ask the obvious follow-up: if we don't know how it works, how does it work at all?
If you look at modern computer vision systems, most of them are basically feed-forward; they don't use feedback connections. And there's something else about modern computer vision systems: they are very susceptible to adversarial errors. You can change a few pixels slightly, and something that was a picture of a panda, and still looks exactly like a panda to you, suddenly becomes an ostrich in the neural network's understanding. Obviously, the way the pixels are shifted is designed to trick the network into thinking it's an ostrich. But the point is that to you it's still a panda.
Initially we thought these things worked great. But then, confronted with the fact that they look at a panda and are confident it's an ostrich, we got worried. And I think part of the problem is that they're not trying to reconstruct from the high-level representations. They're trained discriminatively: you only learn layers of feature detectors, and the whole objective is to change the weights so you get better at finding the right answer. And recently in Toronto we discovered, or Nick Frosst discovered, that if you add reconstruction, you get more resistance to adversarial errors. I think that human vision uses reconstruction for learning. And because we learn by doing reconstruction, we're much more resistant to adversarial attacks.
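The work Hinton mentions involved capsule networks; purely as a generic illustration of the idea of adding a reconstruction term to a discriminative objective, here is a toy sketch (the architecture, data and the weighting `lam` are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: class 0 clusters around (-1, -1, 0, 0), class 1 around (+1, +1, 0, 0).
y = rng.integers(0, 2, 200)
X = rng.normal(0, 0.3, (200, 4))
X[:, :2] += np.where(y[:, None] == 1, 1.0, -1.0)

W = rng.normal(0, 0.1, (4, 8))  # shared feature detectors
v = rng.normal(0, 0.1, 8)       # classifier read-out
lam, lr = 0.1, 0.1              # lam weights the reconstruction term

def losses(W, v):
    h = np.tanh(X @ W)
    p = sigmoid(h @ v)
    cls = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    rec = np.mean((h @ W.T - X) ** 2)  # how well the features explain the input
    return cls, rec

for _ in range(300):
    h = np.tanh(X @ W)
    p = sigmoid(h @ v)
    recon = h @ W.T
    dp = (p - y) / len(X)                       # classification error signal
    dh = np.outer(dp, v)                        # ...fed back to the features
    dh += lam * 2 * ((recon - X) @ W) / len(X)  # reconstruction error signal
    dh_pre = dh * (1 - h ** 2)
    W -= lr * (X.T @ dh_pre + lam * 2 * (recon - X).T @ h / len(X))
    v -= lr * h.T @ dp

cls, rec = losses(W, v)
print(round(cls, 3), round(rec, 3))
```

The point of the sketch is only that the feature detectors `W` are pulled in two directions: toward the right answer and toward explaining the input.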
So you think top-down connections in a neural network let you check how something would be reconstructed. You check, and you make sure it's a panda and not an ostrich.
I think that's important, yes.
But the scientists who study the brain don't agree with that?
Brain scientists don't dispute that if you have two areas of cortex on a perceptual pathway, there will always be backward connections. What they argue about is what those connections are for. They might be needed for attention, for learning, or for reconstruction. Or for all three.
So we don't know what the feedback is for. And you're building your new neural networks on the assumption that... no, not even that: you're building feedback into your neural networks because it's needed for reconstruction, even though you don't really understand how the brain works?
Isn't that cheating? I mean, if you're trying to do something like the brain, but you're not sure the brain actually does it?
Not really. I'm not doing computational neuroscience. I'm not trying to build a model of how the brain works. I look at the brain and say, "This thing works, and if we want to make something else that works, we should look at it for inspiration." So we're neuro-inspired, not building a neural model. The whole model we use, the neurons in it, is inspired by the fact that neurons have many connections and that they change their weights.
That's interesting. If I were a computer scientist working on neural networks and wanted to get around Geoff Hinton, one option would be to build in top-down communication and base it on other models from brain science: basing it on learning rather than on reconstruction.
If those turned out to be better models, you'd win. Yes.
That's very interesting. Let's move to a more general topic. So, neural networks can solve all kinds of problems. Are there mysteries of the human brain that neural networks cannot or will not capture? For example, emotions.
So, could love be reconstructed by a neural network? Could consciousness be reconstructed?
Absolutely. Once you've figured out what those things mean. We are neural networks, right? Consciousness is a particularly interesting topic for me, but... people don't really know what they mean by the word. There are lots of different definitions. And I think it's a rather pre-scientific term. If you'd asked people 100 years ago, "What is life?", they'd have said, "Well, living things have a vital force, and when they die, the vital force leaves them. That's the difference between being alive and being dead: either you have the vital force or you don't." Now we don't have a vital force; we think it's a pre-scientific concept. And once you understand a bit of biochemistry and molecular biology, you no longer need a vital force; you understand how it all really works. And the same thing, I think, is going to happen with consciousness. I think consciousness is an attempt to explain mental phenomena by invoking a special essence. And that special essence is not needed. Once you can really explain it, you'll be able to explain how we do all the things that make people conscious beings, to explain the different meanings of consciousness, without invoking any special essence.
So there's no emotion that couldn't be created? No thought that couldn't be created? There's nothing the human mind can do that couldn't, in theory, be recreated by a fully functioning neural network once we truly understand how the brain works?
It's like something John Lennon sang in one of his songs.
Are you 100% sure?
No, I'm a Bayesian, so I'm 99.9% sure.
Okay, so what's the remaining 0.1%?
Well, we could, for example, all be part of one big simulation.
Fair enough. So, what are we learning about the brain from our work on computers?
Well, I think what we've learned in the last 10 years is that if you take a system with billions of parameters and an objective function, for example filling in the gap in a string of words, it works much better than it has any right to. It works much better than you would expect. You would have thought, and most people in traditional AI research would have thought, that taking a system with a billion parameters, starting it from random values, measuring the gradient of the objective function, and then adjusting the parameters to improve the objective would be a hopeless algorithm that inevitably gets stuck. But no, it turns out to be a really good algorithm. And the bigger the scale, the better it works. And that discovery was essentially empirical. There was some theory behind it, of course, but the discovery was empirical. And now that we've found that out, it seems much more plausible that the brain is computing the gradient of some objective function and updating the weights, the strengths of synapses, to follow that gradient. We just have to figure out what that objective function is and how the brain follows its gradient.
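The recipe Hinton describes, start from random values and follow the gradient of an objective, can be shown at the smallest possible scale (a convex toy problem, so it doesn't capture the surprising part about billions of parameters; all numbers here are invented):

```python
import numpy as np

rng = np.random.default_rng(2)

# A toy objective: predict a value from three inputs, squared error as the loss.
X = rng.normal(0, 1, (500, 3))
true_w = np.array([0.5, -0.3, 0.8])
y = X @ true_w + rng.normal(0, 0.01, 500)

w = rng.normal(0, 1, 3)  # start from random values
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(X)  # gradient of the squared error
    w -= 0.1 * grad                        # follow it downhill
print(np.round(w, 2))
```

The same loop, scaled to billions of parameters and a much harder objective, is the empirically surprising part Hinton is pointing at.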
But we didn't understand that from the brain itself? Didn't understand the updating of the weights?
It was a theory. Long ago people thought it was possible. But in the background there were always computer scientists saying, "Yes, but this idea that it's all random and learning happens by gradient descent won't work with a billion parameters. You'll have to wire in a lot of knowledge." Now we know that isn't so. You can just put in random parameters and learn everything.
Let's take this a bit further. As we run massive tests of models built on our ideas about how the brain functions, we'll presumably keep learning more and more about how it actually works. Once we understand it better, will there come a moment when we can essentially rewire our brains to become much more efficient machines?
If we really understand what's going on, we should be able to improve some things, like education. And I think we will. It would be very strange to finally understand what happens in your brain and how it learns, and not be able to adapt the environment so that you learn better.
How do you think we'll use what we know about the brain and deep learning to change education in a couple of years? How would you change a class?
I'm not sure we'll learn that much in a couple of years. I think changing education will take longer. But if you look at it, [digital] assistants are getting pretty smart. And once assistants can really understand conversations, they'll be able to have conversations with children and educate them.
And in theory, if we understand the brain better, we'll be able to program assistants to have better conversations with children, based on what we've learned about how they learn.
Yes, but I haven't really thought about it much. It's not what I do. But it all seems quite plausible.
Will we be able to understand dreams?
Yes, I'm very interested in dreams. So interested that I have at least four different theories of dreams.
Tell us about them: the first, the second, the third, the fourth.
A long time ago there were things called Hopfield networks, and they stored memories as local attractors. And Hopfield discovered that if you try to put in too many memories, they get confused. The network takes two local attractors and merges them into a single attractor somewhere halfway between them.
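A Hopfield network and its local attractors fit in a few lines; this is a textbook toy, not Hinton's own formulation (the patterns and sizes are invented):

```python
import numpy as np

def store(patterns):
    """Hebbian storage: each memory becomes a local attractor of the network."""
    W = sum(np.outer(p, p) for p in patterns).astype(float)
    np.fill_diagonal(W, 0)  # no self-connections
    return W

def recall(W, x, steps=5):
    """Settle toward the nearest stored memory by repeated thresholding."""
    for _ in range(steps):
        x = np.where(W @ x >= 0, 1, -1)
    return x

memories = [np.array([1, 1, 1, 1, -1, -1, -1, -1]),
            np.array([1, -1, 1, -1, 1, -1, 1, -1])]
W = store(memories)
noisy = np.array([-1, 1, 1, 1, -1, -1, -1, -1])  # first memory, one bit corrupted
print((recall(W, noisy) == memories[0]).all())   # True
```

The merging Hinton mentions happens when too many patterns are stored: nearby attractors collapse into one spurious memory in between.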
Then Francis Crick and Graham Mitchison came along and said we could get rid of these false minima by unlearning (that is, forgetting what was learned). You turn off the input, put the neural network into a random state, let it settle down, say that this state is bad, change the connections so that it doesn't settle into that state, and that way you can make the network store more memories.
Then Terry Sejnowski came along and said, "Look, if we have not only the neurons that store the memories but lots of other neurons too, can we find an algorithm that uses all those other neurons to help recall the memories?" In the end we came up with the Boltzmann machine learning algorithm. And the Boltzmann machine learning algorithm had a very interesting property: you show it data, and it rattles around through the rest of the units until it settles into a very happy state, and then it increases the strengths of all the connections between pairs of units that are active at the same time.
You also have to have a phase in which you disconnect the input, let the algorithm rattle around and settle into a state it's happy with, so that it fantasizes; and once it has produced a fantasy, you say, "Take all the pairs of neurons that are active and decrease the strengths of the connections between them."
I'm describing the algorithm as a procedure. But in reality the algorithm is the product of mathematics, of asking, "How should you change these connection strengths so that this neural network with all its hidden units finds the data unsurprising?" And there has to be this other phase, which we call the negative phase, when the network runs with the input disconnected and unlearns whatever state it settles into.
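The two-phase rule can be sketched with the restricted variant Hinton mentions later, using the crude one-step approximation to the negative phase; this is a hedged toy (the true algorithm samples from the model's equilibrium distribution, and the sizes and data here are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

# A tiny restricted Boltzmann machine: 6 visible units, 4 hidden "helper" units.
W = rng.normal(0, 0.1, (6, 4))

# Toy data: copies of two binary patterns.
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(0, 2, 100)]

lr = 0.1
for _ in range(500):
    h0 = sample(sigmoid(data @ W))  # positive phase: hiddens driven by real data
    v1 = sample(sigmoid(h0 @ W.T))  # negative phase: the network "fantasizes"
    h1 = sigmoid(v1 @ W)
    # Strengthen connections between units that are active together on the data,
    # weaken them on the fantasy (the unlearning phase).
    W += lr * (data.T @ h0 - v1.T @ h1) / len(data)

# After training, reconstructions should resemble the training patterns.
recon = sigmoid(sigmoid(data @ W) @ W.T)
print(np.round(recon[0], 1))
```

The subtraction in the update line is the whole story: increase weights for co-active pairs in the "awake" phase, decrease them for co-active pairs in the "dreaming" phase.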
We dream for many hours every night. And if you're woken suddenly, you can describe your dream, because it's still in short-term memory. We know we dream for many hours, but in the morning, after waking, we can remember only the last dream, and we don't remember the others, which is lucky, because we might mistake them for reality. So why don't we remember our dreams at all? According to Crick, this is the whole point of dreaming: to unlearn those things. You learn in reverse, as it were.
Terry Sejnowski and I showed that this is actually a maximum-likelihood learning procedure for Boltzmann machines. So that's the first theory of dreams.
I want to get to your other theories. But first a question: have you ever trained any of your deep learning algorithms to actually dream?
Some of the first algorithms that could learn with hidden units were Boltzmann machines. They were extremely inefficient. But later I found a way of making approximations to them that turned out to be efficient. And those were actually the trigger for the revival of deep learning. These were the things that learned one layer of feature detectors at a time, and they were an efficient form of restricted Boltzmann machine. So they were doing this kind of unlearning, but instead of going to sleep, they would just fantasize a little after each data point.
Okay, so androids really do dream of electric sheep. Let's move on to theories two, three and four.
Theory two is called the wake-sleep algorithm. You want to learn a generative model. So you have the idea of building a model that can generate data, that has layers of feature detectors and activates the higher and then the lower layers, and so on, right down to activating the pixels, actually creating an image. But you'd also like to train it the other way: you'd like it to recognize data.
So you have an algorithm with two phases. In the wake phase, data comes in, the network tries to recognize it, and instead of learning the connections it uses for recognition, it learns the generative connections. Data comes in, I activate the hidden units, and then I train those hidden units to reconstruct the data. It learns to reconstruct at every layer. But the question is, how do you learn the forward connections? The idea is that if you knew the forward connections, you could learn the backward connections, because you could learn to reconstruct.
And now it also turns out that if you use the backward connections, you can learn the forward connections, because you can start at the top and just generate some data. And since you're generating the data, you know the states of all the hidden layers, so you can train the forward connections to recover those states. And here's what happens: if you start with random connections and try alternating the two phases, it works. To make it work well you have to try all sorts of variations, but it works.
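A minimal wake-sleep sketch with one hidden layer, separate recognition and generative connections, and a simple delta rule in each phase (everything here, sizes, data, learning rate, and the flat prior over hidden states, is an invented toy, not Hinton's code):

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

n_vis, n_hid, lr = 6, 4, 0.1
R = rng.normal(0, 0.1, (n_vis, n_hid))  # recognition connections (bottom-up)
G = rng.normal(0, 0.1, (n_hid, n_vis))  # generative connections (top-down)

patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = patterns[rng.integers(0, 2, 100)]

for _ in range(500):
    # Wake phase: recognize the data, then train the generative
    # connections to reconstruct the data from the hidden states.
    h = sample(sigmoid(data @ R))
    p_v = sigmoid(h @ G)
    G += lr * h.T @ (data - p_v) / len(data)

    # Sleep phase: fantasize from random hidden states (a flat prior, for
    # simplicity), then train the recognition connections to recover
    # those states from the fantasy.
    h_f = sample(np.full((len(data), n_hid), 0.5))
    v_f = sample(sigmoid(h_f @ G))
    p_h = sigmoid(v_f @ R)
    R += lr * v_f.T @ (h_f - p_h) / len(data)

# The generative connections should now turn recognized codes back into data.
recon = sigmoid(sample(sigmoid(data[:1] @ R)) @ G)
print(np.round(recon, 1))
```

Each set of connections is trained using the states produced by the other set, which is exactly the bootstrapping Hinton describes.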
Okay, and what about the other two theories? We only have eight minutes left, so I don't think we'll have time for everything.
Give me another hour and I’ll tell you about the other two.
Let's talk about what comes next. Where is your research headed? What problem are you trying to solve now?
In the end, you always end up working on something that isn't finished. I think I may be working on something I will never finish, called capsules: a theory of how visual perception is done using reconstruction, and of how information gets routed to the right places. The main motivation was that in standard neural networks the information, the activity in a layer, just automatically goes everywhere; you don't make decisions about where to send it. The idea of capsules was to make decisions about where to send information.
Now, since I began working on capsules, some very smart people at Google have invented transformers, which do the same thing. They decide where to route information, and that's a big win.