You can't download happiness, but...

...you can download machine learning papers, which is almost the same thing, and I have so many papers on my to-read list. I like being able to write summaries of the most interesting ones, but when I hit 200 open tabs of interesting papers, I realised that the backlog was only going to keep piling up. So I've spent the day skimming as many as I could. There are many I'd like to go back and reread in more detail, but for now the summaries below will have to do.

Deep learning theory

Approximation by superpositions of a sigmoidal function. The original paper showing that sufficiently-wide one-hidden-layer neural networks can approximate any continuous function on a compact domain to arbitrary accuracy.

Why does unsupervised pre-training help deep learning? Another classic paper, offering an explanation for why unsupervised pre-training improves performance: apparently it's an "unusual form of regularisation". (How widely is it still used?)

Do deep nets really need to be deep? Ba and Caruana show that shallow neural networks can achieve the same performance as deep neural networks with the same number of parameters if they are trained to mimic the deeper networks - even though they can't reach it when trained independently. When mimicking, shallow nets need to output the same probabilities, not just the same classification.

How does batch normalisation help optimisation? Batch normalisation, a widely-used technique in deep learning, changes the mean and variance of each batch's activations in each layer to be zero and one respectively. The standard explanation of why batch normalisation works is that it reduces internal covariate shift (ICS), in which each layer has to keep adapting to a new distribution of inputs. However, the authors find that batch normalisation may not be reducing ICS, and that batch normalised networks with artificially-increased ICS still show good performance. Instead, they argue that batch normalisation makes the loss landscape significantly smoother, improving its Lipschitzness (so that the loss and its gradients change more gradually).
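
To make the normalisation step concrete, here is a minimal sketch of what happens to one batch of activations. It is my own illustration rather than anything from the paper: the function name and the `eps` constant are assumptions, and it leaves out the learned scale and shift parameters and the running statistics that real implementations use at test time.

```python
import math


def batch_normalise(batch, eps=1e-5):
    """Normalise each feature of a batch to zero mean and unit variance.

    `batch` is a list of equal-length rows (one row per example).
    Hypothetical helper: no learned scale/shift, no running averages.
    """
    n = len(batch)
    n_features = len(batch[0])
    # Per-feature mean and (biased) variance across the batch.
    means = [sum(row[j] for row in batch) / n for j in range(n_features)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(n_features)
    ]
    # eps guards against division by zero when a feature is constant.
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps)
         for j in range(n_features)]
        for row in batch
    ]
```

Features on very different scales (say 1-3 and 10-30) come out on the same scale, which is the property the covariate-shift story and the smoothness story are both trying to explain.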

The reversible residual network. With neural networks becoming deeper and wider, memory use (for storing activations to be used in backpropagation) becomes a bottleneck. However, in reversible networks a layer's activations can be deduced from the next layer's. In particular, reversibility is achieved by splitting inputs into two groups, x1 and x2, and calculating the next layer as follows: y1 = x1 + F(x2), y2 = G(y1) + x2. So given y1, y2, and the current weights (which determine F and G), x1 and x2 can be deduced. This adds a computational overhead of around 33-50%, but does not decrease performance very much, and in some cases increases it. An alternative which makes the same tradeoff is checkpointing, in which some of the activations are discarded and recomputed when necessary for backpropagation.
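
The inversion follows directly from the two equations: first x2 = y2 - G(y1), then x1 = y1 - F(x2). Here is a small sketch of that round trip, with `toy_F` and `toy_G` as made-up stand-ins for the residual functions (in a real network they would be learned sub-networks).

```python
import math


def toy_F(v):
    # Hypothetical residual function; any deterministic function works.
    return [math.tanh(2.0 * t) for t in v]


def toy_G(v):
    # Hypothetical residual function; note it need not be invertible itself.
    return [t * t for t in v]


def rev_forward(x1, x2, F, G):
    """y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = [a + b for a, b in zip(x1, F(x2))]
    y2 = [a + b for a, b in zip(x2, G(y1))]
    return y1, y2


def rev_inverse(y1, y2, F, G):
    """Recover the inputs from the outputs without having stored them."""
    x2 = [a - b for a, b in zip(y2, G(y1))]
    x1 = [a - b for a, b in zip(y1, F(x2))]
    return x1, x2
```

Neither F nor G has to be invertible, since the inverse only ever evaluates them forwards. The extra computation comes from re-evaluating F and G during the backward pass.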

Deep reinforcement learning

Reinforcement learning squared. OpenAI trains a "fast" reinforcement learner (implemented as an RNN) using "slow" standard reinforcement learning.

Prefrontal cortex as a meta-reinforcement learning system. DeepMind argues that something very like the mechanism in the OpenAI paper above is being implemented by the brain, using dopamine to train the prefrontal cortex.

Universal Value Function Approximators. Schaul et al. extend RL techniques using value functions to environments where there are multiple goals, and show that Universal Value Function Approximators can generalise to new goals.

Prioritised Experience Replay. Schaul et al. improve experience replay by prioritising "surprising" transitions (those with high temporal difference error). Since pure prioritisation skews the samples too much, prioritised samples are interpolated with random samples.
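
As I recall the paper, the interpolation is controlled by an exponent alpha: each transition is sampled with probability proportional to its priority raised to alpha, so alpha = 0 gives uniform sampling and alpha = 1 gives sampling directly proportional to the TD error. A rough sketch, with function names and default values that are my own assumptions (and without the importance-sampling correction or the efficient sum-tree data structure):

```python
import random


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) proportional to (|td_error_i| + eps) ** alpha."""
    # eps keeps transitions with zero TD error from never being replayed.
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors, batch_size, alpha=0.6, rng=random):
    """Draw replay indices (with replacement) according to their priority."""
    probs = sampling_probabilities(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)
```

Calling `sample_indices([0.1, 2.0, 0.5], batch_size=32)` should mostly return index 1, while still occasionally replaying the other two transitions.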

Hindsight Experience Replay. OpenAI use HER to speed up learning from sparse rewards in environments with multiple goals. Even if agents don't reach their current goal, if they reach another goal then they can interpret their experience as a success towards that other goal.
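
The relabelling trick is simple enough to sketch. This version assumes that goals live in the same space as states, that "reaching" a goal means exact equality, and that the substitute goal is whatever state the episode ended in; the tuple layout and function names are mine, not the paper's.

```python
def sparse_reward(achieved, goal):
    # 0 on success, -1 otherwise: a typical sparse-reward convention.
    return 0.0 if achieved == goal else -1.0


def relabel_with_final_state(episode):
    """Rewrite an episode as if its final state had been the goal all along.

    `episode` is a list of (state, action, next_state, goal) tuples.
    Returns (state, action, next_state, new_goal, reward) tuples.
    """
    achieved_goal = episode[-1][2]  # where the agent actually ended up
    relabelled = []
    for state, action, next_state, _original_goal in episode:
        reward = sparse_reward(next_state, achieved_goal)
        relabelled.append((state, action, next_state, achieved_goal, reward))
    return relabelled
```

Both the original and the relabelled transitions would go into the replay buffer, so a failed episode still contributes at least one transition with a non-trivial reward.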

Asynchronous methods for deep RL. DeepMind claims that asynchronicity is a robust and effective replacement for experience replay as a way of reducing the instability of deep RL (caused by the fact that sequences of observations are non-stationary, and updates are highly correlated). Unlike experience replay, it also allows on-policy algorithms. Note that last I heard, OpenAI disagreed about the value of asynchronicity.

Deep RL with Double Q-learning. DDQN was introduced in response to the fact that deep Q-learning tends to overestimate the values of the actions chosen (essentially because of the winner's curse). DDQN reuses the slower-updating target network of the original DQN paper: the current value network chooses the best next action, but that action is then evaluated using the target network, rather than one network doing both. This is a very minimal change which nevertheless seems to make a significant difference.
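
A sketch of the two bootstrap targets side by side, as I understand them from the paper. In the Double DQN version the online network picks the next action and the target network scores it, whereas in the standard version the target network does both. The function names and the list-of-Q-values interface are assumptions for illustration.

```python
def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Standard DQN: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target,
                      gamma=0.99, done=False):
    """Double DQN: select with the online network, evaluate with the target.

    `next_q_online` and `next_q_target` are per-action Q-value lists for the
    next state from the online and target networks respectively.
    """
    if done:
        return reward
    best_action = max(range(len(next_q_online)),
                      key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]
```

The Double DQN target can never exceed the standard one for the same inputs, since the value of any single action under the target network is at most the maximum over actions. That is where the reduction in overestimation comes from.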

Dueling Network Architectures. This deep RL architecture separates learning which states are valuable from learning which actions are valuable, combining the two in order to calculate standard Q values. Instead of the value of a state-action pair being updated only when that action is chosen from that state, in the dueling architecture the value of a state-action pair is effectively updated every time the state is visited. In addition, state value gaps tend to be much larger than action value gaps, so updating them separately is more stable. For additional stability, advantages are normalised so that the chosen action is at 0. This leads to improvements over DDQN in most Atari games.
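
Here is a sketch of the combining step, assuming the two streams have already produced a scalar state value and a list of per-action advantages. The "max" mode matches the normalisation described above (the greedy action ends up with advantage 0); as far as I remember, the paper also describes subtracting the mean advantage instead, which I've included as a second mode. Names are my own.

```python
def dueling_q_values(state_value, advantages, mode="max"):
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - baseline."""
    if mode == "max":
        # The greedy action gets advantage 0, so max_a Q(s, a) == V(s).
        baseline = max(advantages)
    elif mode == "mean":
        # Advantages are centred so that they average to 0.
        baseline = sum(advantages) / len(advantages)
    else:
        raise ValueError("mode must be 'max' or 'mean'")
    return [state_value + (a - baseline) for a in advantages]
```

Because every Q value shares the same `state_value` term, any update to the state-value stream shifts the estimates for all actions in that state at once.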

Feudal Networks for Hierarchical Reinforcement Learning. Their hierarchical structure apparently allows FuNs to perform particularly well on tasks involving long-term credit assignment or memorisation.

The Reactor: A fast and sample-efficient actor-critic agent for RL. This seems to be DeepMind's latest and most advanced architecture: the Retrace-Actor framework "takes a principled approach to combining the sample-efficiency of off-policy experience replay with the time-efficiency of asynchronous algorithms".

Interpretability and transparency

Towards a rigorous science of interpretable machine learning. Doshi-Velez and Kim define and discuss interpretability.

Challenges for transparency. Weller carries out a high-level survey of transparency, which can be roughly defined as interpretability + sharing of information. He points out some ways in which transparency can be harmful, e.g. Braess' paradox, or that transparency may make us complacent.

Interpretable and pedagogical examples. Milli et al. improve the interpretability of student-teacher interactions by training the student and teacher iteratively, rather than jointly. Previous work in this area usually led to student-teacher pairs learning an arbitrary code.

NLP and grounded language learning

Multi-agent cooperation and the emergence of (natural) language. Lazaridou et al. introduce multi-agent communication games. In particular, a sender is trained (using reinforcement learning) to send a word to a receiver to signal which of two images to select. The sender has a vocabulary of 10-100 words available, but in this setup only needs to use two words in order to coordinate almost perfectly; surprisingly, that binary choice seems to noisily correspond to some high-level concepts (e.g. living vs non-living). The authors then ground vocabulary choice by also training the sender on an image-recognition task; the results are somewhat interpretable to humans.

Emergence of language with multi-agent games. Havrylov and Titov extend Lazaridou et al. by allowing sequences of symbols and also providing distracting images to the receiver only. Since combinatorial explosion now makes standard reinforcement learning techniques ineffective, they use "straight-through Gumbel-softmax estimators" to allow for end-to-end differentiation despite all messages being discrete. The resulting language seems to be hierarchical: the first words in a phrase specify broad categories, and the later ones narrow these down. The language is weakly grounded by being regularised based on a separately-trained language model (however, this does not require the preservation of words' meanings).

Linear Algebraic Structure of Word Senses. Arora et al. argue, on both theoretical and practical grounds, that the vector embeddings of polysemous words are linear combinations of the vectors which would have been assigned to each sense of the word, were all occurrences disambiguated. Their first experiment induces artificial polysemous pseudowords; the second uses WordNet definitions of different senses of various words. They also link each word sense with one of about 2000 "atoms" of discourse.
