The Man Who Taught Machines to Think: Geoffrey Hinton on the Past, Present and Future of AI

In a captivating interview, Geoffrey Hinton, often called the “Godfather of Deep Learning”, shared his journey, insights, and predictions about AI. Hinton’s groundbreaking work has been instrumental in shaping the AI landscape as we know it today.

The Early Days: Inspiration and Intuition

Hinton’s foray into AI began with a sense of disappointment in the explanations offered by physiology and philosophy about how the brain works. This led him to the University of Edinburgh to study AI, where he found the ability to simulate theories more fulfilling. Inspired by the works of Donald Hebb and John von Neumann, Hinton developed a strong intuition that the brain’s learning mechanisms were fundamentally different from conventional computers. He viewed the application of logical inference rules, which rely on symbol manipulation and predicate logic, as impractical from the outset. Instead, Hinton believed that the brain learns by modifying connection strengths between neurons in a neural network.

Collaborations and Breakthroughs

Hinton’s most exciting research involved collaborating with Terry Sejnowski on Boltzmann machines, a type of stochastic recurrent neural network that uses an energy-based model for learning. They believed that Boltzmann machines held the key to understanding how the brain works by learning the underlying probability distributions of the input data. Although that belief turned out to be misplaced, the technical results that emerged, including efficient learning algorithms such as contrastive divergence, were fascinating.

Contrastive Divergence (CD) is an approximation to maximum likelihood learning for Boltzmann machines, used in practice with restricted Boltzmann machines. The CD-k algorithm runs k steps of Gibbs sampling starting from a training example and updates the parameters using the difference between statistics measured on the data and statistics measured on the resulting samples.

The weight update rule for CD-k is:

Δw_ij = ε * (⟨v_i * h_j⟩_data - ⟨v_i * h_j⟩_recon)

where ε is the learning rate, ⟨v_i * h_j⟩_data is the expectation of the product of visible unit i and hidden unit j over the data distribution, and ⟨v_i * h_j⟩_recon is the expectation of the same product after k steps of Gibbs sampling starting from the data.
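
The update rule above can be sketched for a small binary restricted Boltzmann machine. This is a minimal NumPy illustration, not a reference implementation: biases are omitted for brevity, and the function name `cd_k_update`, the layer sizes, and the learning rate are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(W, v_data, k=1, lr=0.1, rng=None):
    """Return W after one CD-k step on a batch of binary visible vectors.

    W has shape (n_visible, n_hidden); v_data has shape (batch, n_visible).
    Biases are left out to keep the sketch short (an assumption of this example).
    """
    rng = np.random.default_rng() if rng is None else rng

    # Positive phase: hidden probabilities driven by the data.
    h_prob_data = sigmoid(v_data @ W)
    pos = v_data.T @ h_prob_data          # <v_i h_j>_data, summed over the batch

    # Negative phase: k steps of Gibbs sampling starting from the data.
    v = v_data
    h_prob = h_prob_data
    for _ in range(k):
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_prob = sigmoid(h_sample @ W.T)
        v = (rng.random(v_prob.shape) < v_prob).astype(float)
        h_prob = sigmoid(v @ W)
    neg = v.T @ h_prob                    # <v_i h_j>_recon, summed over the batch

    # Average over the batch and take a step of size lr (the ε in the formula).
    return W + lr * (pos - neg) / v_data.shape[0]

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((6, 3))                  # 6 visible, 3 hidden units
batch = rng.integers(0, 2, size=(4, 6)).astype(float)   # 4 random binary examples
W_new = cd_k_update(W, batch, k=1, rng=rng)
```

Using hidden probabilities rather than sampled states when collecting the statistics is a common variance-reduction choice; sampling them would also be a valid reading of the rule.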

Hinton also credits his collaboration with Peter Brown, a skilled statistician, for teaching him about hidden Markov models, which inspired the term “hidden layers” in neural networks. Hidden layers allow neural networks to learn hierarchical representations of the input data, enabling them to capture complex patterns and abstractions.

The Arrival of Ilya Sutskever

Ilya Sutskever’s arrival at Hinton’s lab marked a turning point. Sutskever’s raw intuition and willingness to challenge established methods, for instance by asking why the gradient isn’t simply handed to a more capable function optimizer, sparked years of contemplation. The question highlights how much optimization algorithms matter when training neural networks. Plain stochastic gradient descent updates the parameters using the gradient of the loss with a single global learning rate. Adaptive methods such as Adam or RMSprop instead scale the step for each parameter using its gradient history, which often speeds up convergence in practice.

Adam Optimizer:

Adam (Adaptive Moment Estimation) is an optimization algorithm that computes adaptive learning rates for each parameter by storing an exponentially decaying average of past gradients (m_t) and past squared gradients (v_t).

The update rule for Adam is:

m_t = β_1 * m_{t-1} + (1 - β_1) * g_t
v_t = β_2 * v_{t-1} + (1 - β_2) * g_t^2
m_hat_t = m_t / (1 - β_1^t)
v_hat_t = v_t / (1 - β_2^t)
θ_t = θ_{t-1} - α * m_hat_t / (sqrt(v_hat_t) + ε)

where g_t is the gradient at time step t, m_t and v_t are the first and second moment estimates, m_hat_t and v_hat_t are the bias-corrected estimates, α is the step size, β_1 and β_2 are the exponential decay rates for the moment estimates, and ε is a small constant for numerical stability.
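
The five update equations translate almost line for line into code. The sketch below applies them to a single scalar parameter in plain Python; the helper name `adam_minimize`, the toy quadratic objective, and the hyperparameter values are assumptions made for illustration.

```python
import math

def adam_minimize(grad_fn, theta, steps=300, alpha=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on one scalar parameter and return its final value."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g        # first moment estimate m_t
        v = beta2 * v + (1 - beta2) * g * g    # second moment estimate v_t
        m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
        theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_final = adam_minimize(lambda th: 2.0 * (th - 3.0), theta=0.0)
```

On this toy problem the parameter moves from 0 toward the minimizer at 3. Real training applies the same arithmetic elementwise to every parameter tensor.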

The Power of Scale and Intuition

Hinton acknowledged that while he initially believed that clever ideas were crucial, Sutskever’s intuition about the power of scale proved to be correct. Increasing the size of models, in terms of the number of layers and parameters, and the scale of data and computation has led to remarkable breakthroughs. For example, character-level prediction in Wikipedia articles using recurrent neural networks demonstrated a level of understanding that seemed incredible at the time. By processing the input data at the character level, the model could capture intricate patterns and dependencies, enabling it to generate coherent and contextually relevant text.

Language Models and Reasoning

Hinton argues that language models like GPT-4 are not merely predicting the next symbol but are actively engaging in reasoning to understand the context. These models are built on the Transformer architecture, whose self-attention mechanism captures long-range dependencies and builds rich representations of the input text. Hinton provides an example of GPT-4’s ability to identify common structures in seemingly unrelated concepts, such as drawing parallels between a compost heap and an atom bomb based on their shared principle of chain reactions. This demonstrates the model’s capacity for abstract reasoning and analogical thinking. As these models scale up, through more parameters and more training data, Hinton believes they will become increasingly capable of reasoning and may even surpass human creativity.

The Importance of Multimodality

Hinton emphasizes the significance of multimodal learning, where models are trained on various data types, including images, video, and sound. Multimodal learning allows the models to learn joint representations that capture the relationships and interactions between different modalities. For example, a model trained on both images and their corresponding textual descriptions can learn to associate visual features with semantic concepts. Hinton believes that incorporating multimodality will greatly enhance the models’ understanding of spatial relationships and objects, leading to more comprehensive reasoning capabilities. Techniques such as cross-modal attention and fusion can be employed to effectively integrate information from multiple modalities.

Cross-modal Attention:

Cross-modal attention allows a model to attend to information from one modality (e.g., text) based on the context from another modality (e.g., images). This enables the model to align and fuse information from different modalities effectively.

A simple implementation of cross-modal attention in PyTorch:

import torch
import torch.nn as nn

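class CrossModalAttention(nn.Module):
    """Additive attention that scores every (text token, image region) pair."""

    def __init__(self, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        # Projects each concatenated [text; image] pair back to hidden_size.
        self.attn = nn.Linear(hidden_size * 2, hidden_size)
        # Learned vector that reduces each projected pair to a scalar score.
        self.v = nn.Parameter(torch.rand(hidden_size))

    def forward(self, text_features, image_features):
        # text_features: (batch, text_len, hidden)
        # image_features: (batch, image_len, hidden)
        batch_size = text_features.size(0)
        text_len = text_features.size(1)
        image_len = image_features.size(1)

        # Broadcast both inputs to (batch, text_len, image_len, hidden) so every pair lines up.
        text_features_expanded = text_features.unsqueeze(2).expand(batch_size, text_len, image_len, self.hidden_size)
        image_features_expanded = image_features.unsqueeze(1).expand(batch_size, text_len, image_len, self.hidden_size)

        # Concatenate along the feature axis: (batch, text_len, image_len, 2 * hidden).
        combined_features = torch.cat((text_features_expanded, image_features_expanded), dim=3)
        energy = torch.tanh(self.attn(combined_features))
        # Unnormalized scores of shape (batch, text_len, image_len).
        attention = torch.sum(self.v * energy, dim=3)

        return attention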
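
The module above returns one unnormalized score per text token and image region. A usual next step, sketched here in NumPy under the assumption that the scores have shape (batch, text_len, image_len), is to normalize them with a softmax over the image regions and use the resulting weights to build an image context vector for each text token. The function name `fuse_image_context` and the random inputs are hypothetical.

```python
import numpy as np

def fuse_image_context(scores, image_features):
    """Turn raw attention scores into weights and a fused image context.

    scores:         (batch, text_len, image_len)
    image_features: (batch, image_len, hidden)
    """
    # Softmax over the image axis; subtracting the max keeps exp() stable.
    shifted = scores - scores.max(axis=2, keepdims=True)
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=2, keepdims=True)

    # Weighted sum of image features for every text token: (batch, text_len, hidden).
    context = weights @ image_features
    return weights, context

rng = np.random.default_rng(0)
scores = rng.standard_normal((2, 4, 3))          # stand-in for the module's output
image_features = rng.standard_normal((2, 3, 8))
weights, context = fuse_image_context(scores, image_features)
```

Each row of `weights` sums to one, so `context` is a convex combination of image region features that can be concatenated with, or added to, the corresponding text features.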

The Brain, Language, and AI

Hinton explores the relationship between language and cognition, proposing that the brain converts symbols into rich embeddings, and understanding emerges from the interactions between these embeddings. Embeddings are dense, continuous vector representations that capture the semantic and syntactic properties of words or symbols. Hinton suggests that this process is similar to how large language models operate, indicating a plausible model of human thought. In these models, each word is mapped to a high-dimensional embedding space, and the interactions between these embeddings, through operations like dot products or cosine similarity, give rise to semantic relationships and understanding.
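
To make the idea of interacting embeddings concrete, here is a small plain-Python sketch of cosine similarity. The three-dimensional vectors are invented for illustration only; real embeddings are learned and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, hand-picked so that related words point the same way.
embeddings = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.90, 0.15],
    "banana": [0.10, 0.05, 0.95],
}

sim_related = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine_similarity(embeddings["king"], embeddings["banana"])
```

With these toy values the related pair scores much higher than the unrelated pair, which is the kind of geometric signal the paragraph above describes.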

Word2Vec Embeddings:

Word2Vec is a popular technique for learning word embeddings. It comes in two flavors: Continuous Bag-of-Words (CBOW) and Skip-gram. The Skip-gram model aims to predict the context words given a target word, while CBOW predicts the target word given the context words.

The objective function for the Skip-gram model is:

J(θ) = (1/T) * Σ_{t=1}^{T} Σ_{-m≤j≤m, j≠0} log p(w_{t+j} | w_t)

where T is the size of the training corpus, m is the size of the context window, w_t is the target word, and w_{t+j} are the context words.

The probability p(w_{t+j} | w_t) is defined using the softmax function:

p(w_O | w_I) = exp(v'_{w_O}^T * v_{w_I}) / Σ_{w=1}^{W} exp(v'_w^T * v_{w_I})

where v_w and v'_w are the “input” and “output” vector representations of word w, and W is the number of words in the vocabulary.
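
The objective and the softmax can be evaluated directly on a tiny vocabulary. The NumPy sketch below uses the full softmax exactly as written above; the function names, the random vectors, and the toy corpus of word ids are assumptions for illustration. Practical implementations typically replace the full softmax with cheaper approximations, because the denominator sums over the whole vocabulary.

```python
import numpy as np

def skipgram_log_prob(V_in, V_out, center, context):
    """log p(w_O | w_I) under the full softmax over the vocabulary."""
    scores = V_out @ V_in[center]          # one dot product per vocabulary word
    scores = scores - scores.max()         # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return log_probs[context]

def skipgram_objective(V_in, V_out, corpus, m):
    """Average log-likelihood J(θ) for a corpus of word ids and window size m."""
    T = len(corpus)
    total = 0.0
    for t, center in enumerate(corpus):
        for j in range(-m, m + 1):
            # Skip the center word itself and positions outside the corpus.
            if j != 0 and 0 <= t + j < T:
                total += skipgram_log_prob(V_in, V_out, center, corpus[t + j])
    return total / T

rng = np.random.default_rng(0)
W, d = 5, 4                                  # vocabulary size, embedding dimension
V_in = 0.1 * rng.standard_normal((W, d))     # "input" vectors v_w
V_out = 0.1 * rng.standard_normal((W, d))    # "output" vectors v'_w
corpus = [0, 1, 2, 3, 4, 1, 2]               # toy sequence of word ids
J = skipgram_objective(V_in, V_out, corpus, m=2)
```

Training would adjust `V_in` and `V_out` to increase `J`; this sketch only evaluates it.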

The Advent of GPU Computing

Hinton played a pivotal role in popularizing the use of GPUs for training neural networks. GPUs, originally designed for graphics rendering, are highly parallel computing devices that can perform matrix multiplications efficiently. Matrix multiplications are the core operations in neural network computations, as they involve multiplying the input data with the weight matrices of each layer. Hinton recognized the potential of GPUs to accelerate these computations, enabling faster training of larger models. His advocacy for GPU computing, such as his recommendation to use NVIDIA GPUs at the NIPS conference in 2009, accelerated the progress of AI research.
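
A short NumPy sketch shows why matrix multiplication dominates the workload. It runs on the CPU and only illustrates the structure of the computation; the layer sizes and the helper name `dense_layer` are assumptions for the example.

```python
import numpy as np

def dense_layer(X, W, b):
    """Affine map followed by ReLU; the product X @ W is where the time goes."""
    return np.maximum(X @ W + b, 0.0)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 100))                        # 32 examples, 100 features
W1, b1 = rng.standard_normal((100, 64)), np.zeros(64)     # first layer
W2, b2 = rng.standard_normal((64, 10)), np.zeros(10)      # second layer

hidden = dense_layer(X, W1, b1)
output = dense_layer(hidden, W2, b2)

# Multiply-adds in the two products. Every output entry is an independent
# dot product, which is what makes the work easy to spread across many cores.
mults = 32 * 100 * 64 + 32 * 64 * 10
```

Even this toy network needs a few hundred thousand multiply-adds per batch, and the count grows with the product of batch size and layer widths.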

Analog Computation and Digital Immortality

Hinton contemplates the future of AI hardware, envisioning the potential of analog computation to achieve brain-like efficiency. Analog computation involves using physical quantities, such as voltages or currents, to represent and process information, similar to how the brain operates. Analog systems can potentially achieve lower power consumption and higher density compared to digital systems. However, Hinton acknowledges the advantages of digital systems, particularly their ability to share weights across different computers, making them “immortal” in a sense. Digital systems allow for precise storage and transfer of model parameters, enabling reproducibility and scalability.

The Missing Pieces: Fast Weights and Multiple Time Scales

Hinton identifies a crucial aspect of neuroscience that AI models have yet to incorporate: the multiple time scales at which weights change in the brain. In biological neural networks, synaptic plasticity occurs at various time scales, ranging from short-term potentiation to long-term potentiation and depression. Hinton introduces the concept of “fast weights,” which allow for temporary memory and rapid adaptation. He believes that incorporating fast weights and multiple time scales is one of the biggest challenges and opportunities for future AI research. This could involve developing new architectures or learning algorithms that capture the temporal dynamics of synaptic plasticity.

Fast Weights:

Fast weights are dynamic, quickly changing weights that can store temporary information and enable rapid adaptation. They operate on a shorter time scale compared to the slow weights used for long-term learning.

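A simplified structural sketch of fast weights in PyTorch:

import torch
import torch.nn as nn

class FastWeights(nn.Module):
    """Adds a fast-weight pathway alongside the ordinary (slow) weights."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        # Slow weights: changed gradually by the usual gradient-based training.
        self.slow_weights = nn.Linear(input_size, hidden_size)
        # Fast weights: intended to change on a much shorter time scale.
        self.fast_weights = nn.Linear(input_size, hidden_size, bias=False)

    def forward(self, x):
        slow_output = self.slow_weights(x)
        fast_output = self.fast_weights(x)
        # Both pathways see the same input and their contributions are summed.
        # Note: this module only defines the two pathways. Unless fast_weights
        # is updated by a separate, faster rule, the sum behaves like a single
        # linear layer.
        output = slow_output + fast_output
        return output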
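
What makes weights "fast" is how they are updated, not how they are wired in. One common formulation, sketched below in NumPy, decays the fast weight matrix at every step and adds a Hebbian outer product of the current hidden activity. The class name `HebbianFastWeights` and the decay and learning-rate values are assumptions chosen for illustration.

```python
import numpy as np

class HebbianFastWeights:
    """Fast weight matrix with decay and an outer-product (Hebbian) write."""

    def __init__(self, hidden_size, decay=0.9, fast_lr=0.5):
        self.A = np.zeros((hidden_size, hidden_size))  # starts with nothing stored
        self.decay = decay
        self.fast_lr = fast_lr

    def write(self, h):
        """Fade the old contents, then store h via the outer product h h^T."""
        self.A = self.decay * self.A + self.fast_lr * np.outer(h, h)

    def read(self, h):
        """Contribution of the fast weights for the current hidden vector."""
        return self.A @ h

mem = HebbianFastWeights(hidden_size=4)
h = np.array([1.0, 0.0, -1.0, 0.0])

mem.write(h)
recalled = mem.read(h)           # recently stored pattern is strongly recalled

for _ in range(50):
    mem.write(np.zeros(4))       # no new activity: the stored trace decays
faded = mem.read(h)
```

The stored pattern is recalled right after it is written and fades when nothing reinforces it, which is the temporary-memory behavior described above. In a full model this contribution would be added to the slow-weight computation at each step.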

Feelings, Consciousness, and AI

Hinton proposes a thought-provoking perspective on feelings and consciousness in AI systems. He suggests that feelings can be understood as actions an agent would perform if not for certain constraints. For example, if an AI system predicts that it would perform a certain action in the absence of inhibitory signals, that prediction could be interpreted as a feeling. Hinton argues that there is no fundamental reason why AI systems cannot have feelings and provides an anecdote about a robot exhibiting frustration to support his view. This perspective challenges the notion of feelings as subjective, inner experiences and instead frames them as dispositions to act in certain ways.

Analogies and Compression

Hinton emphasizes the power of analogies in human and AI cognition. He believes that the ability to identify common structures across seemingly disparate concepts allows for efficient compression of information. Analogical reasoning involves mapping the relational structure from one domain (the source) to another domain (the target), enabling knowledge transfer and generalization. Hinton shares a personal analogy between religious belief and the belief in symbol processing, which influenced his own thinking. In AI, techniques like structure mapping and neural network-based analogical reasoning have been explored to enable machines to make analogies and discover common patterns across different domains.

Structure Mapping Engine (SME):

The Structure Mapping Engine (SME) is a computational model of analogical reasoning based on the structure mapping theory. It finds analogical mappings between a source and a target domain by aligning their relational structures.

The SME algorithm consists of the following steps:

  1. Represent the source and target domains as structured representations (e.g., predicate calculus).
  2. Identify potential mappings between the source and target based on structural similarity.
  3. Evaluate the mappings using a set of constraints and rules (e.g., one-to-one correspondence, parallel connectivity).
  4. Select the best mapping based on the evaluation scores.
  5. Transfer knowledge from the source to the target based on the selected mapping.
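
The five steps can be illustrated with a toy brute-force matcher in plain Python. This is not the actual SME algorithm, which builds mappings incrementally and handles higher-order relations. It only shows the idea of choosing a one-to-one entity mapping that preserves the most relations and then projecting unmatched source relations onto the target. The function names and the solar system and atom facts are assumptions for the example.

```python
from itertools import permutations

def best_mapping(source_relations, target_relations):
    """Find the one-to-one entity mapping that preserves the most relations.

    Each relation is a tuple (predicate, arg1, arg2). Assumes the target has
    at least as many entities as the source.
    """
    source_entities = sorted({a for _, x, y in source_relations for a in (x, y)})
    target_entities = sorted({a for _, x, y in target_relations for a in (x, y)})
    target_set = set(target_relations)

    best, best_score = None, -1
    # Steps 2-4: enumerate candidate mappings, score them, keep the best.
    for perm in permutations(target_entities, len(source_entities)):
        mapping = dict(zip(source_entities, perm))
        score = sum((p, mapping[x], mapping[y]) in target_set
                    for p, x, y in source_relations)
        if score > best_score:
            best, best_score = mapping, score
    return best, best_score

def candidate_inferences(source_relations, target_relations, mapping):
    """Step 5: project source relations that the target does not yet contain."""
    target_set = set(target_relations)
    projected = [(p, mapping[x], mapping[y]) for p, x, y in source_relations]
    return [r for r in projected if r not in target_set]

# Step 1: structured representations of both domains.
solar_system = [("attracts", "sun", "planet"),
                ("revolves_around", "planet", "sun"),
                ("more_massive", "sun", "planet")]
atom = [("attracts", "nucleus", "electron"),
        ("revolves_around", "electron", "nucleus")]

mapping, score = best_mapping(solar_system, atom)
inferences = candidate_inferences(solar_system, atom, mapping)
```

Here the matcher pairs the sun with the nucleus and the planet with the electron, then proposes that the nucleus is more massive than the electron. Exhaustive enumeration is only feasible for a handful of entities.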

The Future of AI: Opportunities and Concerns

Hinton acknowledges the immense potential of AI to benefit society, particularly in areas like healthcare and engineering. AI can assist in medical diagnosis, drug discovery, personalized treatment, and the design of advanced materials and systems. However, he also expresses concerns about the misuse of AI by malicious actors for purposes such as autonomous weapons, public opinion manipulation, and mass surveillance. Hinton believes that while slowing down AI research is unlikely, given the competitive landscape and the potential benefits, it is crucial to raise awareness about these risks and develop appropriate governance frameworks and ethical guidelines to mitigate them.
