Backpropagation everywhere
Can the brain do backpropagation? - Hinton, 2016
In this talk, Hinton rebuts four arguments which neuroscientists have used to argue that the brain cannot be learning via backpropagation:
- Most human learning is unsupervised, without the explicit loss functions usually used in backpropagation.
- Neurons don't send real numbers, but rather binary spikes.
- Neurons can't send both feature information and training gradients.
- Neurons aren't actually paired up.
Random synaptic feedback weights support error backpropagation for deep learning - Lillicrap, Cownden, Tweed, and Akerman, 2016
Normally, after we calculate the loss function, we calculate the exact gradient of the loss with respect to each weight, which requires the backward pass to have access to the forward weights. This paper is concerned with the case where it doesn't - that is, where the weight transport problem can't be solved. In this situation we can still run something like backpropagation using feedback weights which are chosen independently of the forward weights, but without knowing which forward weights were responsible for the error, we would expect performance to diminish drastically. The key result of this paper is that fixed, random feedback weights are actually sufficient for a neural network to achieve performance comparable to that of standard backpropagation. The only requirement is that the teaching signal from the backward links "pushes the network in roughly the same direction as backprop would". When the backward weights are fixed, this condition can still be fulfilled if the forward weights evolve to better match the backward weights; the authors call this "feedback alignment". It works even when 50% of the forward connections and 50% of the backward connections are removed.
Hinton suggests an intuitive reason why this works: apart from the last layer, the rest of a neural network is mainly focused on creating a good representation of the input. Even though the feedback weights are fixed, different data classes will result in different error signals along those weights, which allows the rest of the network to adapt. The authors claim that we should think of feedback alignment as lying on a spectrum between a global reward function which sends the same signal to every neuron and an error signal exactly tailored to each neuron, as in standard backpropagation.
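To make the mechanism concrete, here is a minimal numpy sketch of feedback alignment on a toy two-layer regression network. The layer sizes, learning rate and random data are invented for illustration; the only substantive point is that the backward pass sends the output error through a fixed random matrix B2 rather than through the forward weights W2, as standard backpropagation would.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer regression network trained with feedback alignment.
# Shapes, learning rate and data are invented for illustration.
n_in, n_hidden, n_out = 20, 64, 5
W1 = rng.normal(0, 0.1, (n_hidden, n_in))    # forward weights, layer 1
W2 = rng.normal(0, 0.1, (n_out, n_hidden))   # forward weights, layer 2
B2 = rng.normal(0, 0.1, (n_out, n_hidden))   # fixed random feedback weights, same shape as W2

X = rng.normal(size=(256, n_in))
Y = rng.normal(size=(256, n_out))

lr = 0.01
for step in range(1000):
    # Forward pass: tanh hidden layer, linear output.
    h = np.tanh(X @ W1.T)
    y_hat = h @ W2.T
    e = y_hat - Y                      # output error (gradient of squared error w.r.t. y_hat)

    # Backward pass: project the error through the fixed random matrix B2
    # instead of through W2, as standard backprop would.
    delta_h = (e @ B2) * (1 - h ** 2)

    # Weight updates, averaged over the mini-batch.
    W2 -= lr * (e.T @ h) / len(X)
    W1 -= lr * (delta_h.T @ X) / len(X)
```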
How important is weight symmetry in backpropagation? - Liao, Leibo, and Poggio, 2016
This paper is similar to the one by Lillicrap et al., but has a slightly different focus. Liao et al. find that when they set each feedback weight to the sign of the corresponding forward weight (which they call sign concordance), they achieve results on par with or better than standard SGD (as long as they also apply batch normalisation and batch manhattan, as described below). Similarly good results are obtained if the feedback magnitudes are varied randomly while keeping the same sign, and even if the last layer is initialised randomly and frozen with those values. Liao et al. claim that the successes of sign concordance are more robust than those of fixed random weights - for example, the latter performs very badly when the last layer is frozen, which suggests that the key to feedback alignment is co-adaptation between the last layer and previous layers. However, it's still unclear how biologically plausible sign concordance is, since it still requires the signs of forward neurons' weights to be communicated to backward neurons. Its performance when feedback connections are sparse is also untested, although presumably it would do at least as well as feedback alignment.
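A rough sketch of how a sign-concordant feedback matrix could be constructed (my own illustrative construction, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Forward weights of some layer (invented shape, for illustration only).
W = rng.normal(size=(5, 8))

# Sign-concordant feedback: same sign as the forward weights, but with
# independently chosen magnitudes (here fixed random positives). The feedback
# matrix has to be refreshed whenever a forward weight changes sign.
magnitudes = rng.uniform(0.5, 1.5, size=W.shape)
B = np.sign(W) * magnitudes

# B would then be used in place of W in the backward pass, exactly as in the
# feedback alignment sketch above.
```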
Liao et al. also emphasise that the success of these asymmetric backpropagation algorithms is enhanced greatly by the use of batch manhattan and batch normalisation, so let's have a look at them. When doing gradient descent, we calculate the gradient based on a group of examples, which we'll call a batch. Full gradient descent uses all the training examples as one batch; stochastic gradient descent uses batches of size one. Mini-batches are batches of intermediate size; they are a particularly useful alternative to SGD on computing platforms which can efficiently implement parallelism. In batch manhattan, after we have used feedback weights to calculate a given forward weight's gradient over a mini-batch, we update the forward weight based only on the sign of that gradient. This is discussed in more detail in (Hifny, 2013).
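Here is a minimal sketch of that update rule; the function name and default learning rate are mine, and real implementations may differ in details such as how gradients are accumulated across steps.

```python
import numpy as np

def batch_manhattan_step(W, grad_W, lr=0.01):
    """Take a gradient-descent-style step using only the sign of the
    mini-batch gradient, discarding its magnitude, so every weight moves
    by exactly lr in one direction or the other."""
    return W - lr * np.sign(grad_W)
```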
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift - Ioffe and Szegedy, 2015
(Note that this more recent paper argues against this interpretation of batch normalisation).
Batch normalisation is a bit more complicated. One problem with gradient descent is that as training occurs, the outputs of each layer shift, so the distribution of inputs to the following layer also shifts, and subsequent layers need to adapt. This is known as internal covariate shift, and may cause nonlinearities to become saturated. While it can be mitigated by using ReLUs and small learning rates, addressing it directly could speed up training. Ioffe and Szegedy propose a variant of "whitening" which they call "batch normalisation". Whitening involves normalising a set of input variables so that they are uncorrelated and each have mean 0 and variance 1. This essentially transforms the covariance matrix into the identity matrix, and can be done in a number of ways; whitening data before using it to train a neural network is standard practice.
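For concreteness, here is one standard way to whiten a dataset (ZCA whitening via an eigendecomposition of the covariance matrix). It's a generic sketch rather than anything specific to Ioffe and Szegedy's paper:

```python
import numpy as np

def whiten(X, eps=1e-5):
    """ZCA whitening: centre each feature, then rotate and rescale so that
    the empirical covariance of the result is (approximately) the identity.
    One of several standard recipes; eps guards against tiny eigenvalues."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)
    W_zca = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W_zca
```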
However, whitening each layer after each gradient descent step is expensive, and so the authors make two changes. Firstly, they normalise each feature independently, without decorrelating them, which avoids the need to calculate the correlation matrix or its inverse. Secondly, they use estimates for the mean and variance based on each mini-batch. Another issue is that normalisation could reduce the expressive power of the network - for example, when using sigmoid nonlinearities the normalised data might end up only in the linear section. They address this by adding learned parameters which rescale and reshift the data (the latter essentially replaces the bias term).
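A minimal sketch of the resulting training-time transform, assuming inputs of shape (batch, features); in a real network gamma and beta would be learned parameters, and separate running statistics would be kept for use at test time:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalisation: normalise each feature
    independently using the mini-batch mean and variance, then rescale
    by gamma and reshift by beta (the shift effectively replaces the bias)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```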
Batch normalisation allows higher learning rates, speeds up convergence, and reduces the need for other forms of regularisation such as dropout and weight regularisation. Its effectiveness as a regulariser is greater when training examples are shuffled so that the batches aren't the same every time they're seen.