Deep Learning for AI

Yoshua Bengio, Yann LeCun, Geoffrey Hinton

Research on artificial neural networks was motivated by the observation that human intelligence emerges from highly parallel networks of relatively simple, non-linear neurons that learn by adjusting the strengths of their connections. This observation leads to a central computational question: How is it possible for networks of this general kind to learn the complicated internal representations that are required for difficult tasks such as recognizing objects or understanding language? Deep learning seeks to answer this question by using many layers of activity vectors as representations and learning the connection strengths that give rise to these vectors by following the stochastic gradient of an objective function that measures how well the network is performing. It is very surprising that such a conceptually simple approach has proved to be so effective when applied to large training sets using huge amounts of computation and it appears that a key ingredient is depth: shallow networks simply do not work as well.

We reviewed the basic concepts and some of the breakthrough achievements of deep learning several years ago. Here we briefly describe the origins of deep learning, describe a few of the more recent advances, and discuss some of the future challenges. These challenges include learning with little or no external supervision, coping with test examples that come from a different distribution than the training examples, and using the deep learning approach for tasks that humans solve by using a deliberate sequence of steps which we attend to consciously — tasks that Kahneman calls system 2 tasks as opposed to system 1 tasks like object recognition or immediate natural language understanding, which generally feel effortless.

1. From Hand-Coded Symbolic Expressions to Learned Distributed Representations

There are two quite different paradigms for AI. Put simply, the logic-inspired paradigm views sequential reasoning as the essence of intelligence and aims to implement reasoning in computers using hand-designed rules of inference that operate on hand-designed symbolic expressions that formalize knowledge. The brain-inspired paradigm views learning representations from data as the essence of intelligence and aims to implement learning by hand-designing or evolving rules for modifying the connection strengths in simulated networks of artificial neurons.

In the logic-inspired paradigm, a symbol has no meaningful internal structure: Its meaning resides in its relationships to other symbols which can be represented by a set of symbolic expressions or by a relational graph. By contrast, in the brain-inspired paradigm the external symbols that are used for communication are converted into internal vectors of neural activity and these vectors have a rich similarity structure. Activity vectors can be used to model the structure inherent in a set of symbol strings by learning appropriate activity vectors for each symbol and learning non-linear transformations that allow the activity vectors that correspond to missing elements of a symbol string to be filled in. This was first demonstrated in Rumelhart et al. on toy data and then by Bengio et al. on real sentences. A very impressive recent demonstration is BERT, which also exploits self-attention to dynamically connect groups of units, as described later.

The main advantage of using vectors of neural activity to represent concepts and weight matrices to capture relationships between concepts is that this leads to automatic generalization. If Tuesday and Thursday are represented by very similar vectors, they will have very similar causal effects on other vectors of neural activity. This facilitates analogical reasoning and suggests that immediate, intuitive analogical reasoning is our primary mode of reasoning, with logical sequential reasoning being a much later development, which we will discuss.
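
The generalization argument above can be illustrated with a minimal sketch. The three activity vectors and the weight matrix `W` below are made-up values chosen only for illustration; they are not learned and do not come from the article. The point is that two inputs with similar vectors produce similar effects downstream after the same transformation.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical activity vectors (illustrative values, not learned).
tuesday = [0.9, 0.1, 0.8]
thursday = [0.8, 0.2, 0.9]
banana = [-0.7, 0.9, -0.2]

# Hypothetical weight matrix capturing a relationship to another group of units.
W = [[0.5, -1.0, 0.3],
     [1.2, 0.4, -0.6]]

out_tue, out_thu, out_ban = (matvec(W, v) for v in (tuesday, thursday, banana))

# Similar inputs stay close after the transformation; a dissimilar input does not.
print(cosine(tuesday, thursday))
print(math.dist(out_tue, out_thu), math.dist(out_tue, out_ban))
```

For a linear map this follows directly from the fact that the distance between outputs is bounded by the size of the weights times the distance between inputs; with smooth non-linearities the same qualitative behavior holds.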

2. The Rise of Deep Learning

Deep learning re-energized neural network research in the early 2000s by introducing a few elements which made it easy to train deeper networks. The emergence of GPUs and the availability of large datasets were key enablers of deep learning and they were greatly enhanced by the development of open source, flexible software platforms with automatic differentiation such as Theano, Torch, Caffe, TensorFlow, and PyTorch. This made it easy to train complicated deep nets and to reuse the latest models and their building blocks. But the composition of more layers is what allowed more complex non-linearities and achieved surprisingly good results in perception tasks, as summarized here.

Why depth? Although the intuition that deeper neural networks could be more powerful pre-dated modern deep learning techniques, it was a series of advances in both architecture and training procedures that ushered in the remarkable advances associated with the rise of deep learning. But why might deeper networks generalize better for the kinds of input-output relationships we are interested in modeling? It is important to realize that it is not simply a question of having more parameters, since deep networks often generalize better than shallow networks with the same number of parameters. Practice confirms this. The most popular class of convolutional net architecture for computer vision is the ResNet family, of which the most common representative, ResNet-50, has 50 layers. Other ingredients not mentioned in this article but which turned out to be very useful include image deformations, dropout, and batch normalization.

We believe that deep networks excel because they exploit a particular form of compositionality in which features in one layer are combined in many different ways to create more abstract features in the next layer.

For tasks like perception, this kind of compositionality works very well and there is strong evidence that it is used by biological perceptual systems.

Unsupervised pre-training. When the number of labeled training examples is small compared with the complexity of the neural network required to perform the task, it makes sense to start by using some other source of information to create layers of feature detectors and then to fine-tune these feature detectors using the limited supply of labels. In transfer learning, the source of information is another supervised learning task that has plentiful labels. But it is also possible to create layers of feature detectors without using any labels at all by stacking auto-encoders.

First, we learn a layer of feature detectors whose activities allow us to reconstruct the input. Then we learn a second layer of feature detectors whose activities allow us to reconstruct the activities of the first layer of feature detectors. After learning several hidden layers in this way, we then try to predict the label from the activities in the last hidden layer and we backpropagate the errors through all of the layers in order to fine-tune the feature detectors that were initially discovered without using the precious information in the labels. The pre-training may well extract all sorts of structure that is irrelevant to the final classification but, in the regime where computation is cheap and labeled data is expensive, this is fine so long as the pre-training transforms the input into a representation that makes classification easier.
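
The greedy, layer-by-layer procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the original method: each layer is a linear auto-encoder with tied weights trained by batch gradient descent, the data are synthetic, and the function names `train_autoencoder` and `pretrain_stack` are hypothetical. The supervised fine-tuning step with backpropagation is omitted.

```python
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def train_autoencoder(data, n_hidden, steps=300, lr=0.05, seed=0):
    """Fit a tied-weight linear auto-encoder (an assumed simplification).

    Encoder: h = W x. Decoder: r = W^T h. Loss: 0.5 * ||r - x||^2.
    Returns the weights W (n_hidden x n_in) and the loss history.
    """
    rng = random.Random(seed)
    n_in = len(data[0])
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    losses = []
    for _ in range(steps):
        grad = [[0.0] * n_in for _ in range(n_hidden)]
        loss = 0.0
        for x in data:
            h = matvec(W, x)  # encode
            r = [sum(W[i][j] * h[i] for i in range(n_hidden)) for j in range(n_in)]  # decode
            e = [rj - xj for rj, xj in zip(r, x)]  # reconstruction error
            We = matvec(W, e)
            loss += 0.5 * sum(v * v for v in e)
            for i in range(n_hidden):
                for j in range(n_in):
                    # Gradient of the loss w.r.t. the tied weights: h e^T + (W e) x^T
                    grad[i][j] += h[i] * e[j] + We[i] * x[j]
        losses.append(loss / len(data))
        for i in range(n_hidden):
            for j in range(n_in):
                W[i][j] -= lr * grad[i][j] / len(data)
    return W, losses

def pretrain_stack(data, layer_sizes):
    """Each new layer learns to reconstruct the activities of the layer below."""
    weights, activities = [], data
    for n_hidden in layer_sizes:
        W, _ = train_autoencoder(activities, n_hidden)
        weights.append(W)
        activities = [matvec(W, a) for a in activities]
    return weights, activities

# Synthetic 4-d inputs that actually lie in a 2-d subspace (no labels are used).
rng = random.Random(1)
data = []
for _ in range(50):
    a, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    data.append([a, a, b, -b])

weights, features = pretrain_stack(data, layer_sizes=[3, 2])
```

After this stage, `features` would be the input to a classifier, and the errors of that classifier would be backpropagated through `weights` to fine-tune all layers.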

In addition to improving generalization, unsupervised pre-training initializes the weights in such a way that it is easy to fine-tune a deep neural network with backpropagation. The effect of pre-training on optimization was historically important for overcoming the accepted wisdom that deep nets were hard to train, but it is much less relevant now that people use rectified linear units (see next section) and residual connections. However, the effect of pre-training on generalization has proved to be very important. It makes it possible to train very large models by leveraging large quantities of unlabeled data, for example, in natural language processing, for which huge corpora are available. The general principle of pre-training and fine-tuning has turned out to be an important tool in the deep learning toolbox, for example, when it comes to transfer learning or even as an ingredient of modern meta-learning.

The mysterious success of rectified linear units. The early successes of deep networks involved unsupervised pre-training of layers of units that used the logistic sigmoid nonlinearity or the closely related hyperbolic tangent. Rectified linear units had long been hypothesized in neuroscience and already used in some variants of RBMs and convolutional neural networks. It was an unexpected and pleasant surprise to discover that rectifying non-linearities (now called ReLUs, with many modern variants) made it easy to train deep networks by backprop and stochastic gradient descent, without the need for layerwise pre-training. This was one of the technical advances that enabled deep learning to outperform previous methods for object recognition, as outlined here.
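
The text calls this success mysterious and does not give a mechanism. One commonly offered intuition, shown here as an assumption rather than as the authors' explanation, is that the derivative of the logistic sigmoid is at most 0.25, so a gradient passed back through many sigmoid layers shrinks quickly, whereas an active rectified unit passes the gradient through unchanged. The sketch below multiplies the local derivatives along one path through 20 layers, ignoring the weights.

```python
import math

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid at pre-activation z (never above 0.25)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of the rectifier: 1 for active units, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def chain_gradient(grad_fn, pre_activations):
    """Product of local derivatives along one path (weights of 1 assumed)."""
    g = 1.0
    for z in pre_activations:
        g *= grad_fn(z)
    return g

pre_activations = [0.5] * 20  # hypothetical pre-activations, one per layer
sigmoid_path = chain_gradient(sigmoid_grad, pre_activations)
relu_path = chain_gradient(relu_grad, pre_activations)
print(sigmoid_path, relu_path)
```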

Breakthroughs in speech and object recognition. An acoustic model converts a representation of the sound wave into a probability distribution over fragments of phonemes. Heroic efforts by Robinson using transputers and by Morgan et al. using DSP chips had already shown that, with sufficient processing power, neural networks were competitive with the state of the art for acoustic modeling. In 2009, two graduate students using Nvidia GPUs showed that pre-trained deep neural nets could slightly outperform the SOTA on the TIMIT dataset. This result reignited the interest of several leading speech groups in neural networks. In 2010, essentially the same deep network was shown to beat the SOTA for large vocabulary speech recognition without requiring speaker-dependent training and by 2012, Google had engineered a production version that significantly improved voice search on Android. This was an early demonstration of the disruptive power of deep learning.

At about the same time, deep learning scored a dramatic victory in the 2012 ImageNet competition, almost halving the error rate for recognizing a thousand different classes of object in natural images. The keys to this victory were the major effort by Fei-Fei Li and her collaborators in collecting more than a million labeled images for the training set and the very efficient use of multiple GPUs by Alex Krizhevsky. Current hardware, including GPUs, encourages the use of large mini-batches in order to amortize the cost of fetching a weight from memory across many uses of that weight. Pure online stochastic gradient descent which uses each weight once converges faster and future hardware may just use weights in place rather than fetching them from memory.

The deep convolutional neural net contained a few novelties such as the use of ReLUs to make learning faster and the use of dropout to prevent overfitting, but it was basically just a feed-forward convolutional neural net of the kind that Yann LeCun and his collaborators had been developing for many years. The response of the computer vision community to this breakthrough was admirable. Given this incontrovertible evidence of the superiority of convolutional neural nets, the community rapidly abandoned previous hand-engineered approaches and switched to deep learning.

3. Recent Advances

Here we selectively touch on some of the more recent advances in deep learning, clearly leaving out many important subjects, such as deep reinforcement learning, graph neural networks and meta-learning.

Soft attention and the transformer architecture. A significant development in deep learning, especially when it comes to sequential processing, is the use of multiplicative interactions, particularly in the form of soft attention. This is a transformative addition to the neural net toolbox, in that it changes neural nets from purely vector transformation machines into architectures which can dynamically choose which inputs they operate on, and can store information in differentiable associative memories. A key property of such architectures is that they can effectively operate on different kinds of data structures including sets and graphs.

Soft attention can be used by modules in a layer to dynamically select which vectors from the previous layer they will combine to compute their outputs. This can serve to make the output independent of the order in which the inputs are presented (treating them as a set) or to use relationships between different inputs (treating them as a graph).

The transformer architecture, which has become the dominant architecture in many applications, stacks many layers of “self-attention” modules. Each module in a layer uses a scalar product to compute the match between its query vector and the key vectors of other modules in that layer. The matches are normalized to sum to 1, and the resulting scalar coefficients are then used to form a convex combination of the value vectors produced by the other modules in the previous layer. The resulting vector forms an input for a module of the next stage of computation. Modules can be made multi-headed so that each module computes several different query, key and value vectors, thus making it possible for each module to have several distinct inputs, each selected from the previous stage modules in a different way. The order and number of modules do not matter in this operation, making it possible to operate on sets of vectors rather than single vectors as in traditional neural networks. For instance, a language translation system, when producing a word in the output sentence, can choose to pay attention to the corresponding group of words in the input sentence, independently of their position in the text. While multiplicative gating is an old idea for such things as coordinate transforms and powerful forms of recurrent networks, its recent forms have made it mainstream. Another way to think about attention mechanisms is that they make it possible to dynamically route information through appropriately selected modules and combine these modules in potentially novel ways for improved out-of-distribution generalization.
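
The query, key and value computation described above can be sketched for a single head as follows. This is a minimal illustration, not a full transformer layer: the learned projections that produce the queries, keys and values are left out, each query attends to every key (including its own module's), and the division by the square root of the key dimension is an assumption taken from common practice rather than from the text. The function name `attend` is hypothetical.

```python
import math

def softmax(xs):
    """Normalize scores into positive coefficients that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(queries, keys, values):
    """Single-head soft attention over a set of modules."""
    scale = math.sqrt(len(keys[0]))  # assumed scaling, see lead-in
    outputs = []
    for q in queries:
        # Scalar-product match between this query and every key.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Convex combination of the value vectors.
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
    return outputs

queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs = attend(queries, keys, values)
# Presenting the same (key, value) pairs in a different order gives the same result.
outputs_reordered = attend(queries, keys[::-1], values[::-1])
```

Because the output for each query is a weighted sum over the whole set of key and value pairs, reordering that set changes nothing, which is what allows the operation to treat its inputs as a set.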

Transformers have produced dramatic performance improvements that have revolutionized natural language processing, and they are now being used routinely in industry. These systems are all pre-trained in a self-supervised manner to predict missing words in a segment of text.

Perhaps more surprisingly, transformers have been used successfully to solve integral and differential equations symbolically. A very promising recent trend uses transformers on top of convolutional nets for object detection and localization in images with state-of-the-art performance. The transformer performs post-processing and object-based reasoning in a differentiable manner, enabling the system to be trained end-to-end.

Unsupervised and self-supervised learning. Supervised learning, while successful in a wide variety of tasks, typically requires a large amount of human-labeled data. Similarly, when reinforcement learning is based only on rewards, it requires a very large number of interactions. These learning methods tend to produce task-specific, specialized systems that are often brittle outside of the narrow domain they have been trained on. Reducing the number of human-labeled samples or interactions with the world that are required to learn a task and increasing the out-of-domain robustness is of crucial importance for applications such as low-resource language translation, medical image analysis, autonomous driving, and content filtering.

Humans and animals seem to be able to learn massive amounts of background knowledge about the world, largely by observation, in a task-independent manner. This knowledge underpins common sense and allows humans to learn complex tasks, such as driving, with just a few hours of practice. A key question for the future of AI is how do humans learn so much from observation alone?

In supervised learning, a label for one of N categories conveys, on average, at most log2(N) bits of information about the world. In model-free reinforcement learning, a reward similarly conveys only a few bits of information. In contrast, audio, images and video are high-bandwidth modalities that implicitly convey large amounts of information about the structure of the world. This motivates a form of prediction or reconstruction called self-supervised learning which is training to “fill in the blanks” by predicting masked or corrupted portions of the data. Self-supervised learning has been very successful for training transformers to extract vectors that capture the context-dependent meaning of a word or word fragment and these vectors work very well for downstream tasks.
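
A quick calculation makes the bandwidth gap concrete. For the thousand-class setting mentioned earlier, a label carries at most about 9.97 bits. The image size below (224 x 224 RGB, 8 bits per channel) is an assumed example, and the figure computed from it is raw storage size, which only bounds the information content from above.

```python
import math

def max_bits_per_label(n_categories):
    """Upper bound on the information conveyed by one of N category labels."""
    return math.log2(n_categories)

label_bits = max_bits_per_label(1000)

# Assumed example image: 224 x 224 pixels, 3 channels, 8 bits per channel.
raw_image_bits = 224 * 224 * 3 * 8

print(label_bits, raw_image_bits)
```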

For text, the transformer is trained to predict missing words from a discrete set of possibilities. But in high-dimensional continuous domains such as video, the set of plausible continuations of a particular video segment is large and complex and representing the distribution of plausible continuations properly is essentially an unsolved problem.

Contrastive learning. One way to approach this problem is through latent variable models that assign an energy (that is, a badness) to examples of a video and a possible continuation.

Given an input video X and a proposed continuation Y, we want a model to indicate whether Y is compatible with X by using an energy function E(X, Y) which takes low values when X and Y are compatible, and higher values otherwise.

E(X,Y) can be computed by a deep neural net which, for a given X, is trained in a contrastive way to give a low energy to values Y that are compatible with X (such as examples of (X, Y) pairs from a training set), and high energy to other values of Y that are incompatible with X. For a given X, inference consists in finding one $\check Y$ that minimizes E(X, Y) or perhaps sampling from the Ys that have low values of E(X, Y). This energy-based approach to representing the way Y depends on X makes it possible to model a diverse, multi-modal set of plausible continuations.
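
Inference with an energy function can be illustrated with a toy example. The energy below is hand-written rather than computed by a trained network, X and Y are scalars instead of videos, and the candidate grid and threshold are arbitrary choices; all names are hypothetical. The energy has two minima for each X, which stands in for a multi-modal set of plausible continuations.

```python
def energy(x, y):
    """Toy energy: low when y is close to +x or to -x (two plausible continuations)."""
    return min((y - x) ** 2, (y + x) ** 2)

def infer(x, candidates):
    """Return one candidate y that minimizes the energy for the given x."""
    return min(candidates, key=lambda y: energy(x, y))

def low_energy_set(x, candidates, threshold):
    """Return every candidate whose energy is below the threshold."""
    return [y for y in candidates if energy(x, y) <= threshold]

candidates = [i / 10 for i in range(-30, 31)]  # grid from -3.0 to 3.0

best = infer(2.0, candidates)
plausible = low_energy_set(2.0, candidates, threshold=0.02)
print(best, plausible)
```

A model that predicts a single Y for each X would have to pick one mode or average the two; the energy function keeps both as equally compatible answers.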

The key difficulty with contrastive learning is to pick good “negative” samples: suitable points Y whose energy will be pushed up. When the set of possible negative examples is not too large, we can just consider them all. This is what a softmax does, so in this case contrastive learning reduces to standard supervised or self-supervised learning over a finite discrete set of symbols. But in a real-valued high-dimensional space, there are far too many ways a vector $\check Y$ could be different from Y and to improve the model we need to focus on those Ys that should have high energy but currently have low energy. Early methods to pick negative samples were based on Monte-Carlo methods, such as contrastive divergence for restricted Boltzmann machines and noise-contrastive estimation.
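
To see why a softmax amounts to using every possible negative, one can write the energies over a finite set of candidates as a normalized distribution. The derivation below assumes a temperature of 1 and writes $\theta$ for the model parameters, a symbol not used elsewhere in the text.

$$
P(Y \mid X) = \frac{\exp(-E(X, Y))}{\sum_{Y'} \exp(-E(X, Y'))}, \qquad
\mathcal{L}(X, Y) = -\log P(Y \mid X) = E(X, Y) + \log \sum_{Y'} \exp(-E(X, Y'))
$$

$$
\frac{\partial \mathcal{L}}{\partial \theta} = \frac{\partial E(X, Y)}{\partial \theta} - \sum_{Y'} P(Y' \mid X) \, \frac{\partial E(X, Y')}{\partial \theta}
$$

A gradient descent step therefore pushes down the energy of the observed $Y$ and pushes up the energy of every candidate $Y'$, weighted by its current probability, so candidates that wrongly have low energy are pushed up the most. The sum over $Y'$ is what becomes infeasible in a real-valued high-dimensional space.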

Generative Adversarial Networks (GANs) train a generative neural net to produce contrastive samples by applying a neural network to latent samples from a known distribution (for example, a Gaussian). The generator trains itself to produce outputs $\check Y$ to which the model gives low energy E($\check Y$). The generator can do so using backpropagation to get the gradient of E($\check Y$) with respect to $\check Y$. The generator and the model are trained simultaneously, with the model attempting to give low energy to training samples, and high energy to generated contrastive samples.

GANs are somewhat tricky to optimize, but adversarial training ideas have proved extremely fertile, producing impressive results in image synthesis, and opening up many new applications in content creation and domain adaptation as well as domain or style transfer.

Making representations agree using contrastive learning. Contrastive learning provides a way to discover good feature vectors without having to reconstruct or generate pixels. The idea is to learn a feed-forward neural network that produces very similar output vectors when given two different crops of the same image or two different views of the same object but dissimilar output vectors for crops from different images or views of different objects. The squared distance between the two output vectors can be treated as an energy, which is pushed down for compatible pairs and pushed up for incompatible pairs.
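
A per-pair loss built on this energy might look like the sketch below. The text only says that the energy is pushed down for compatible pairs and up for incompatible ones; the hinge with a `margin`, which stops pushing once an incompatible pair is far enough apart, is one classic choice and is an assumption here, as are the function names.

```python
def squared_distance(u, v):
    """Energy of a pair of output vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def pair_loss(u, v, compatible, margin=1.0):
    """Loss whose minimization lowers the energy of compatible pairs and
    raises the energy of incompatible pairs until it exceeds the margin."""
    e = squared_distance(u, v)
    return e if compatible else max(0.0, margin - e)

# Two crops of the same image should map to nearby vectors (low loss when close).
print(pair_loss([0.0, 0.0], [0.5, 0.0], compatible=True))
# Crops of different images should map to distant vectors (loss until far apart).
print(pair_loss([0.0, 0.0], [0.5, 0.0], compatible=False))
```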

A series of recent papers that use convolutional nets for extracting representations that agree have produced promising results in visual feature learning. The positive pairs are composed of different versions of the same image that are distorted through cropping, scaling, rotation, color shift, blurring, and so on. The negative pairs are similarly distorted versions of different images which may be cleverly picked from the dataset through a process called hard negative mining or may simply be all of the distorted versions of other images in a minibatch. The hidden activity vector of one of the higher-level layers of the network is subsequently used as input to a linear classifier trained in a supervised manner. This Siamese net approach has yielded excellent results on standard image recognition benchmarks. Very recently, two Siamese net approaches have managed to eschew the need for contrastive samples. The first one, dubbed SwAV, quantizes the output of one network to train the other network; the second one, dubbed BYOL, smooths the weight trajectory of one of the two networks, which is apparently enough to prevent a collapse.
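
Smoothing a weight trajectory, as mentioned for BYOL, can be sketched with an exponential moving average. The sketch assumes that the smoothed network simply tracks a running average of the other network's weights; the decay value, the flat weight lists and the simulated noisy updates are all illustrative, and `ema_update` is a hypothetical name.

```python
import random

def ema_update(target, online, decay=0.99):
    """Move each target weight a small step toward the corresponding online weight."""
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]

rng = random.Random(0)
online = [0.0, 0.0]
target = [0.0, 0.0]
for _ in range(100):
    # Stand-in for a noisy gradient step on the online network.
    online = [w + 0.01 + rng.gauss(0.0, 0.05) for w in online]
    # The target network follows a smoothed version of that trajectory.
    target = ema_update(target, online)
```

With a decay close to 1 the target weights change slowly, so the training signal they provide is much steadier than the online weights themselves.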

Variational auto-encoders. A popular recent self-supervised learning method is the Variational Auto-Encoder (VAE). This consists of an encoder network that maps the image into a latent code space and a decoder network that generates an image from a latent code. The VAE limits the information capacity of the latent code by adding Gaussian noise to the output of the encoder before it is passed to the decoder. This is akin to packing small noisy spheres into a larger sphere of minimum radius. The information capacity is limited by how many noisy spheres fit inside the containing sphere. The noisy spheres repel each other because a good reconstruction error requires a small overlap between codes that correspond to different samples. Mathematically, the system minimizes a free energy obtained through marginalization of the latent code over the noise distribution. However, minimizing this free energy with respect to the parameters is intractable, and one has to rely on variational approximation methods from statistical physics that minimize an upper bound of the free energy.
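
The noise injection and the bound being minimized can be written down in a few lines. The sketch assumes the usual formulation, which the text does not spell out: the encoder outputs a mean `mu` and a log-variance `log_var` per latent dimension, the prior over codes is a standard normal, and the decoder is Gaussian with unit variance so that the reconstruction term is a squared error. The encoder and decoder networks themselves are omitted, and all names are hypothetical.

```python
import math
import random

def sample_code(mu, log_var, rng):
    """Add Gaussian noise to the encoder output before it goes to the decoder."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I).

    This term limits the information capacity of the code: it grows when the
    noisy spheres shrink or move far from the origin.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def free_energy_bound(x, x_reconstructed, mu, log_var):
    """Upper bound on the free energy: reconstruction error plus capacity cost."""
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_reconstructed))
    return recon + kl_to_standard_normal(mu, log_var)

rng = random.Random(0)
code = sample_code(mu=[0.5, -1.0], log_var=[-2.0, -2.0], rng=rng)
```

The two terms pull in opposite directions: good reconstruction wants small, well-separated spheres, while the capacity term wants them large and near the origin.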

4. The Future of Deep Learning

The performance of deep learning systems can often be dramatically improved by simply scaling them up. With a lot more data and a lot more computation, they generally work a lot better. The language model GPT-3 with 175 billion parameters (which is still tiny compared with the number of synapses in the human brain) generates noticeably better text than GPT-2 with only 1.5 billion parameters. The chatbots Meena and BlenderBot also keep improving as they get bigger. Enormous effort is now going into scaling up and it will improve existing systems a lot, but there are fundamental deficiencies of current deep learning that cannot be overcome by scaling alone, as discussed here.

Comparing human learning abilities with current AI suggests several directions for improvement:

Supervised learning requires too much labeled data and model-free reinforcement learning requires far too many trials. Humans seem to be able to generalize well with far less experience.
Current systems are not as robust to changes in distribution as humans, who can quickly adapt to such changes with very few examples.
Current deep learning is most successful at perception tasks and generally what are called system 1 tasks. Using deep learning for system 2 tasks that require a deliberate sequence of steps is an exciting area that is still in its infancy.

What needs to be improved. From the early days, theoreticians of machine learning have focused on the iid assumption, which states that the test cases are expected to come from the same distribution as the training examples. Unfortunately, this is not a realistic assumption in the real world: just consider the non-stationarities due to actions of various agents changing the world, or the gradually expanding mental horizon of a learning agent which always has more to learn and discover. As a practical consequence, the performance of today’s best AI systems tends to take a hit when they go from the lab to the field.

Our desire to achieve greater robustness when confronted with changes in distribution (called out-of-distribution generalization) is a special case of the more general objective of reducing sample complexity (the number of examples needed to generalize well) when faced with a new task (as in transfer learning and lifelong learning) or simply with a change in distribution or in the relationship between states of the world and rewards. Current supervised learning systems require many more examples than humans do when learning a new task, and the situation is even worse for model-free reinforcement learning, since each rewarded trial provides less information about the task than each labeled example. It has already been noted that humans can generalize in a way that is different from, and more powerful than, ordinary iid generalization: we can correctly interpret novel combinations of existing concepts, even if those combinations are extremely unlikely under our training distribution, so long as they respect high-level syntactic and semantic patterns we have already learned. Recent studies help us clarify how different neural net architectures fare in terms of this systematic generalization ability. How can we design future machine learning systems with these abilities to generalize better or adapt faster out-of-distribution?

From homogeneous layers to groups of neurons that represent entities. Evidence from neuroscience suggests that groups of nearby neurons (forming what is called a hyper-column) are tightly connected and might represent a kind of higher-level vector-valued unit able to send not just a scalar quantity but rather a set of coordinated values. This idea is at the heart of the capsule architectures, and it is also inherent in the use of soft-attention mechanisms, where each element in the set is associated with a vector, from which one can read a key vector and a value vector (and sometimes also a query vector). One way to think about these vector-level units is as representing the detection of an object along with its attributes (like pose information, in capsules). Recent papers in computer vision are exploring extensions of convolutional neural networks in which the top level of the hierarchy represents a set of candidate objects detected in the input image, and operations on these candidates are performed with transformer-like architectures. Neural networks that assign intrinsic frames of reference to objects and their parts and recognize objects by using the geometric relationships between parts should be far less vulnerable to directed adversarial attacks, which rely on the large difference between the information used by people and that used by neural nets to recognize objects.
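
The key/value/query reading of soft attention described above can be sketched in a few lines. This is a minimal illustration, not the formulation of any particular paper: it assumes the key and value vectors have already been read out of each element, and it uses the scaled dot-product scoring familiar from transformer-style models. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_attention(query, keys, values):
    """Return a weighted average of the value vectors.

    Each set element contributes one key and one value. The weight of an
    element grows with the match between its key and the query.
    """
    scale = math.sqrt(len(query))  # assumed scaling, as in scaled dot-product attention
    weights = softmax([dot(query, key) / scale for key in keys])
    dim = len(values[0])
    read = [sum(w * value[i] for w, value in zip(weights, values)) for i in range(dim)]
    return read, weights


# Three set elements, each with a 2-d key and a 2-d value (toy numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]
query = [4.0, 0.0]  # points in the direction of the first key

read, weights = soft_attention(query, keys, values)
print("attention weights:", [round(w, 3) for w in weights])
print("read-out vector:  ", [round(r, 3) for r in read])
```

Because the weights form a smooth distribution rather than a hard choice, the read-out is differentiable with respect to the query, keys, and values, which is what allows the selection to be trained by gradient descent.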

Multiple time scales of adaptation. Most neural nets only have two timescales: the weights adapt slowly over many examples, and the activities adapt rapidly, changing with each new input. Adding an overlay of rapidly adapting and rapidly decaying “fast weights” introduces interesting new computational abilities. In particular, it creates a high-capacity, short-term memory, which allows a neural net to perform true recursion in which the same neurons can be reused in a recursive call because their activity vector in the higher-level call can be reconstructed later using the information in the fast weights. Multiple time scales of adaptation also arise in learning to learn, or meta-learning.
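
One way to picture fast weights as a short-term memory is a matrix that decays at every step and receives a Hebbian outer-product update whenever something is stored. The sketch below follows that general idea only; the class name `FastWeightMemory`, the decay and learning-rate values, and the key/value framing are assumptions made for illustration, and the slow weights of the surrounding network are left out entirely.

```python
class FastWeightMemory:
    """Toy associative memory: A <- decay * A + rate * outer(value, key)."""

    def __init__(self, dim, decay=0.9, rate=1.0):
        self.decay = decay  # how quickly old associations fade
        self.rate = rate    # how strongly a new association is written
        self.matrix = [[0.0] * dim for _ in range(dim)]

    def write(self, key, value):
        # Decay every existing entry, then add the outer product of value and key.
        self.matrix = [
            [self.decay * entry + self.rate * value[i] * key[j]
             for j, entry in enumerate(row)]
            for i, row in enumerate(self.matrix)
        ]

    def read(self, key):
        # Matrix-vector product: retrieves the values stored under similar keys.
        return [sum(entry * k for entry, k in zip(row, key)) for row in self.matrix]


memory = FastWeightMemory(dim=3)
first_key, first_value = [1.0, 0.0, 0.0], [1.0, 2.0, 3.0]
second_key, second_value = [0.0, 1.0, 0.0], [4.0, 5.0, 6.0]

memory.write(first_key, first_value)    # older item, will have decayed once
memory.write(second_key, second_value)  # most recent item, stored at full strength

print("recall of older item: ", memory.read(first_key))
print("recall of recent item:", memory.read(second_key))
```

With orthogonal keys the two items do not interfere, and the older one comes back scaled by the decay factor, which is the sense in which such a memory is both high-capacity and short-term.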

Higher-level cognition. When thinking about a new challenge, such as driving in a city with unusual traffic rules, or even imagining driving a vehicle on the moon, we can take advantage of pieces of knowledge and generic skills we have already mastered and recombine them dynamically in new ways. This form of systematic generalization allows humans to generalize fairly well in contexts that are very unlikely under their training distribution. We can then further improve with practice, fine-tuning and compiling these new skills so they do not need conscious attention anymore. How could we endow neural networks with the ability to adapt quickly to new settings by mostly reusing already known pieces of knowledge, thus avoiding interference with known skills? Initial steps in that direction include Transformers and Recurrent Independent Mechanisms.

It seems that our implicit (system 1) processing abilities allow us to guess potentially good or dangerous futures when planning or reasoning. This raises the question of how system 1 networks could guide search and planning at the higher (system 2) level, maybe in the spirit of the value functions which guide Monte-Carlo tree search in AlphaGo.

Machine learning research relies on inductive biases or priors in order to encourage learning in directions which are compatible with some assumptions about the world. The nature of system 2 processing and the associated cognitive neuroscience theories suggest several such inductive biases and architectures, which may be exploited to design novel deep learning systems. How do we design deep learning architectures and training frameworks which incorporate such inductive biases?

The ability of young children to perform causal discovery suggests this may be a basic property of the human brain, and recent work suggests that optimizing out-of-distribution generalization under interventional changes can be used to train neural networks to discover causal dependencies or causal variables. How should we structure and train neural nets so they can capture these underlying causal properties of the world?

How are the directions suggested by these open questions related to the symbolic AI research program from the 20th century? Clearly, this symbolic AI program aimed at achieving system 2 abilities, such as reasoning, being able to factorize knowledge into pieces which can easily be recombined in a sequence of computational steps, and being able to manipulate abstract variables, types, and instances. We would like to design neural networks which can do all these things while working with real-valued vectors so as to preserve the strengths of deep learning, which include efficient large-scale learning using differentiable computation and gradient-based adaptation, grounding of high-level concepts in low-level perception and action, handling uncertain data, and using distributed representations.
