Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients
This is the third part of the Recurrent Neural Network Tutorial.
In the previous part of the tutorial we implemented an RNN from scratch, but didn’t go into detail on how the Backpropagation Through Time (BPTT) algorithm calculates the gradients. In this part we’ll give a brief overview of BPTT and explain how it differs from traditional backpropagation. We will then try to understand the vanishing gradient problem, which has led to the development of LSTMs and GRUs, two of the currently most popular and powerful models used in NLP (and other areas). The vanishing gradient problem was originally discovered by Sepp Hochreiter in 1991 and has been receiving attention again recently due to the increased application of deep architectures.
To fully understand this part of the tutorial I recommend being familiar with how partial differentiation and basic backpropagation work. If you are not, you can find excellent tutorials here and here and here, in order of increasing difficulty.
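As a rough illustration of the vanishing gradient problem, here is a minimal sketch that is not taken from the tutorial itself. It assumes a one-unit RNN, h_t = tanh(w * h_{t-1} + x_t), with a hypothetical weight w = 0.9 and a constant input of 0.5. Backpropagating through time multiplies one factor w * (1 - h_t^2) per step, and since each factor here has magnitude below 0.9, the gradient with respect to the initial state shrinks roughly geometrically with sequence length.

```python
import math

def scalar_rnn_gradient(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the toy recurrence h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: each time step contributes one factor w * tanh'(pre-activation),
        # and tanh'(z) = 1 - tanh(z)^2.
        grad *= w * (1.0 - h * h)
    return grad

# Hypothetical settings, chosen only to make the effect visible.
for steps in (1, 5, 10, 20, 50):
    g = scalar_rnn_gradient(w=0.9, inputs=[0.5] * steps)
    print(f"{steps:3d} steps: gradient w.r.t. h_0 = {g:.3e}")
```

With more than one hidden unit the scalar factor becomes a Jacobian matrix, but the same repeated multiplication is what makes long-range gradients shrink (or, with large weights, blow up).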
via Teaching recurrent Neural Networks about Monet.
Recurrent Neural Networks have boomed in popularity over the past months, thanks to articles like the amazing The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy.
Long story short, Recurrent Neural Networks (RNNs) are a type of neural network that operates over sequences of vectors and whose units keep track of their state history.
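To make the idea of state history concrete, here is a minimal sketch of a vanilla RNN step in plain Python. It is an illustration, not the code from the article. The update rule h_t = tanh(W_xh x_t + W_hh h_{t-1} + b) is the standard textbook form, and the weight values and names (w_xh, w_hh, b) are made up for the example.

```python
import math

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One vanilla RNN update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    from_input = matvec(w_xh, x)
    from_state = matvec(w_hh, h_prev)
    return [math.tanh(i + s + bias) for i, s, bias in zip(from_input, from_state, b)]

def run_rnn(xs, h0, w_xh, w_hh, b):
    """Feed a sequence of input vectors through the RNN and collect every hidden state."""
    h = h0
    states = []
    for x in xs:
        h = rnn_step(x, h, w_xh, w_hh, b)
        states.append(h)
    return states

# Hypothetical weights: 2 inputs, 2 hidden units.
w_xh = [[0.5, -0.3], [0.1, 0.8]]
w_hh = [[0.2, 0.4], [-0.6, 0.1]]
b = [0.0, 0.0]

# The first and third inputs are identical, but the hidden states differ
# because the third step also sees what came before it.
states = run_rnn([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [0.0, 0.0], w_xh, w_hh, b)
for t, h in enumerate(states):
    print(t, [round(v, 4) for v in h])
```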
Neural Networks are increasingly easy to use, especially in the Python ecosystem, with libraries like Caffe, Keras or Lasagne making the assembly of neural networks a trivial task.
I was checking the Keras documentation and found an example that generates text from Nietzsche’s writings via a Long Short Term Memory (LSTM) network.
I ran the example, and after a couple of hours the model started producing pretty convincing, Nietzsche-looking text.
via The Unreasonable Effectiveness of Recurrent Neural Networks.
There’s something magical about Recurrent Neural Networks (RNNs). I still remember when I trained my first recurrent network for Image Captioning. Within a few dozen minutes of training my first baby model (with rather arbitrarily-chosen hyperparameters) started to generate very nice looking descriptions of images that were on the edge of making sense. Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times. What made this result so shocking at the time was that the common wisdom was that RNNs were supposed to be difficult to train (with more experience I’ve in fact reached the opposite conclusion). Fast forward about a year: I’m training RNNs all the time and I’ve witnessed their power and robustness many times, and yet their magical outputs still find ways of amusing me. This post is about sharing some of that magic with you.
We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?”
By the way, together with this post I am also releasing code on Github that allows you to train character-level language models based on multi-layer LSTMs. You give it a large chunk of text and it will learn to generate text like it one character at a time. You can also use it to reproduce my experiments below. But we’re getting ahead of ourselves; what are RNNs anyway?
This paper develops a model that addresses sentence embedding using recurrent neural networks (RNN) with Long Short Term Memory (LSTM) cells. The proposed LSTM-RNN model sequentially takes each word in a sentence, extracts its information, and embeds it into a semantic vector. Due to its ability to capture long term memory, the LSTM-RNN accumulates increasingly richer information as it goes through the sentence, and when it reaches the last word, the hidden layer of the network provides a semantic representation of the whole sentence. In this paper, the LSTM-RNN is trained in a weakly supervised manner on user click-through data logged by a commercial web search engine. Visualization and analysis are performed to understand how the embedding process works. The model automatically attenuates the unimportant words and detects the salient keywords in the sentence. Furthermore, these detected keywords automatically activate different cells of the LSTM-RNN, where words belonging to a similar topic activate the same cell. As a semantic representation of the sentence, the embedding vector can be used in many different applications. These keyword detection and topic allocation tasks enabled by the LSTM-RNN allow the network to perform web document retrieval, where the similarity between the query and documents can be measured by the distance between their corresponding sentence embedding vectors computed by the LSTM-RNN. On a web search task, the LSTM-RNN embedding is shown to significantly outperform all existing state of the art methods.
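The retrieval step described above can be sketched as follows. This is only an illustration: the vectors are invented stand-ins for LSTM-RNN sentence embeddings, and cosine similarity is assumed as the measure, since the abstract only speaks of a distance between embedding vectors. The names cosine_similarity and rank_documents are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Sort documents from most to least similar to the query embedding."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up 3-dimensional embeddings; a trained model would produce much larger vectors.
query = [0.9, 0.1, 0.0]
docs = {
    "doc_a": [0.8, 0.2, 0.1],
    "doc_b": [0.0, 0.1, 0.9],
    "doc_c": [0.5, 0.5, 0.0],
}

ranking = rank_documents(query, docs)
for name, score in ranking:
    print(name, round(score, 3))
```

In this toy setup doc_a points in nearly the same direction as the query and ranks first, while doc_b is almost orthogonal to it and ranks last.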
indico’s Head of Research, Alec Radford, led a workshop on general sequence learning using recurrent neural networks at Next.ML in San Francisco. Here’s his presentation and workshop resources available for free.
Recurrent Neural Networks hold great promise as general sequence learning algorithms. As such, they are a very promising tool for text analysis. However, outside of very specific use cases like handwriting recognition and, more recently, machine translation, they have not seen widespread use. Why has this been the case?
In this workshop, Alec will introduce RNNs as a concept. Then you’ll sketch how to implement them and cover the tricks necessary to make them work well. With the basics covered, we will investigate using RNNs as general text classification and regression models, examining where they succeed and where they fail compared to more traditional text analysis models.
Finally, a simple Python and Theano library for training RNNs with a scikit-learn style interface will be introduced and you’ll see how to use it through several hands-on tutorials on real world text datasets.
Next.ML was created to help you use the latest machine learning techniques the minute you leave the workshop. Learn from industry-leading data scientists at Next.ML on April 27th, 2015 at the Microsoft NERD Center — for more info, visit http://Next.ML
In this work, we propose a novel recurrent neural network (RNN) architecture. The proposed RNN, gated-feedback RNN (GF-RNN), extends the existing approach of stacking multiple recurrent layers by allowing and controlling signals flowing from upper recurrent layers to lower layers using a global gating unit for each pair of layers. The recurrent signals exchanged between layers are gated adaptively based on the previous hidden states and the current input. We evaluated the proposed GF-RNN with different types of recurrent units, such as tanh, long short-term memory and gated recurrent units, on the tasks of character-level language modeling and Python program evaluation. Our empirical evaluation of different RNN units revealed that in both tasks, the GF-RNN outperforms the conventional approaches to build deep stacked RNNs. We suggest that the improvement arises because the GF-RNN can adaptively assign different layers to different timescales and layer-to-layer interactions (including the top-down ones which are not usually present in a stacked RNN) by learning to gate these interactions.
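The gating idea in this abstract can be sketched roughly as below. This is one reading of the description for tanh units, not the authors' implementation, and the exact parameterization is an assumption: every pair of layers (i, j) gets a scalar gate computed from the current input to layer j and all previous hidden states, and that gate scales the recurrent signal from layer i at the previous time step into layer j. All parameter names (W, U, w_g, u_g) and the random initial values are hypothetical.

```python
import math
import random

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rand_vector(rng, n):
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]

def init_params(rng, n_layers, n_in, n_hid):
    """Random toy parameters: feed-forward weights W[j] plus, for every layer
    pair (i, j), recurrent weights U and gate weights w_g, u_g."""
    params = {"W": [], "U": {}, "w_g": {}, "u_g": {}}
    for j in range(n_layers):
        n_below = n_in if j == 0 else n_hid
        params["W"].append([rand_vector(rng, n_below) for _ in range(n_hid)])
        for i in range(n_layers):
            params["U"][(i, j)] = [rand_vector(rng, n_hid) for _ in range(n_hid)]
            params["w_g"][(i, j)] = rand_vector(rng, n_below)
            params["u_g"][(i, j)] = rand_vector(rng, n_layers * n_hid)
    return params

def gf_rnn_step(x, h_prev, params):
    """One time step of a gated-feedback stack with tanh units (sketch).

    h_prev holds one hidden vector per layer from the previous time step.
    Returns the new hidden vectors and the gate value for every layer pair.
    """
    n_layers = len(h_prev)
    h_star = [v for h in h_prev for v in h]  # all previous hidden states, concatenated
    h_new, gates = [], {}
    below = x  # input to layer 0; for higher layers, the layer below at this step
    for j in range(n_layers):
        total = matvec(params["W"][j], below)
        for i in range(n_layers):
            # Scalar gate for the connection from layer i (previous step) to layer j.
            g = sigmoid(dot(params["w_g"][(i, j)], below) + dot(params["u_g"][(i, j)], h_star))
            gates[(i, j)] = g
            recurrent = matvec(params["U"][(i, j)], h_prev[i])
            total = [t + g * r for t, r in zip(total, recurrent)]
        h_j = [math.tanh(t) for t in total]
        h_new.append(h_j)
        below = h_j
    return h_new, gates

rng = random.Random(0)
params = init_params(rng, n_layers=2, n_in=3, n_hid=4)
h = [[0.0] * 4 for _ in range(2)]
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):
    h, gates = gf_rnn_step(x, h, params)
print(sorted(gates))  # includes (1, 0): the top-down connection from layer 1 to layer 0
```

A conventional stacked RNN corresponds to keeping only the same-layer connections (i equal to j) and fixing their gates to 1. The gates are what let the network decide, per time step, how much each layer listens to the others.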