Problems with Recurrent Neural Networks
Two problems arise while training a Recurrent Neural Network:
1- Exploding Gradient – Exploding gradients occur when the gradients grow uncontrollably large during backpropagation, so the algorithm assigns an unreasonably high importance to the weights and training becomes unstable.
How to resolve the exploding gradient problem?
Gradient clipping – a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the gradient at a maximum value (or maximum norm), the phenomenon is controlled in practice.
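A minimal sketch of clipping by global norm, assuming plain NumPy gradients; the function name, the `grads` list, and the `max_norm=5.0` threshold are illustrative choices, not a fixed recipe.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads

# Example: an exploded gradient of norm 500 is rescaled to norm ~5
grads = [np.array([300.0, -400.0])]
print(clip_by_global_norm(grads)[0])   # approximately [ 3. -4.]
```

Deep learning frameworks ship clip-by-norm utilities that do the same thing, so in practice you rarely write this by hand.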
2- Vanishing Gradient – Vanishing gradients occur when the values of the gradient become too small, so the model stops learning or takes far too long to train.
Why is the vanishing gradient a problem?
One of the burning issues with RNNs is capturing long-term dependencies. For example, take a language model that predicts the next word based on the previous ones.
If we are trying to predict the last word in “the sharks live in the ocean,” we don’t need any further context – it’s pretty obvious the next word is going to be “ocean.”
In such cases, where the gap between the relevant information and the place where it is needed is small, RNNs can learn to use the past information. But what happens if the gap between the relevant information and the place where it is needed is large?
For example: I grew up in India and I speak fluent ______.
Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of India, from further back.
As the gap grows, RNNs become unable to connect the relevant pieces of information: during backpropagation through time the gradient is multiplied by one factor per time step, and when those factors are small the signal from distant steps shrinks towards zero.
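A toy illustration of the effect, assuming a single scalar “per-step factor” standing in for the Jacobian of one recurrent step (the numbers are made up purely for illustration):

```python
def gradient_factor(per_step_factor, T):
    """Product of T identical per-step factors, as in backpropagation through time."""
    result = 1.0
    for _ in range(T):
        result *= per_step_factor
    return result

print(gradient_factor(0.9, 10))    # ~0.35    -> usable signal over short gaps
print(gradient_factor(0.9, 100))   # ~2.7e-05 -> vanishing over long gaps
print(gradient_factor(1.1, 100))   # ~13780   -> exploding instead
```

The same mechanism explains both problems: per-step factors below 1 give vanishing gradients, factors above 1 give exploding gradients.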
How to solve this problem?
There are three ways to resolve this problem:
- By using suitable activation functions (see the sketch after this list)
- By careful parameter initialization (see the sketch after this list)
- By using a more complex RNN architecture
For the third option we use Long Short-Term Memory (LSTM).
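A minimal sketch of the first two options, assuming a plain NumPy RNN cell: using ReLU instead of tanh keeps the activation derivative from shrinking, and initializing the recurrent weight matrix to the identity keeps the hidden state (and its gradient) from decaying at the start of training. The sizes and names below are arbitrary choices for illustration.

```python
import numpy as np

hidden_size, input_size = 64, 32

# Option 2: parameter initialization - recurrent weights start as the identity
W_hh = np.eye(hidden_size)
W_xh = np.random.randn(hidden_size, input_size) * 0.01
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # Option 1: activation function - ReLU instead of tanh
    return np.maximum(0.0, W_hh @ h_prev + W_xh @ x_t + b_h)
```

These tricks help, but the third option – a gated architecture such as the LSTM described next – is the most widely used fix.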
Long Short-Term Memory (LSTM)
To handle long-term dependencies we use LSTMs. They are designed to remember information for long periods of time, and were introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997.
The diagram of the LSTM cell and the notation used are given below.
Step-by-step walk-through of the LSTM
Step 1- The first step in our LSTM is to decide what information we’re going to throw away from the cell state.
- This decision is made by a sigmoid layer called the forget gate. It looks at ht−1 and xt, and outputs a number between 0 and 1 for each value in the cell state Ct−1.
Note: 1 means “keep this information” and 0 means “get rid of it.”
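In symbols, with Wf and bf as the forget gate’s weight matrix and bias (names assumed here, since the text does not define them), this step is:

```latex
f_t = \sigma\left( W_f \cdot [h_{t-1}, x_t] + b_f \right)
```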
Step 2- The next step is to decide what new information we’re going to store in the cell state. It has two parts:
- First, a sigmoid layer called the input gate decides which values we’ll update.
- Next, a tanh layer creates a vector of new candidate values, C̃t, that could be added to the state.
Then we combine these two to create the update to the state.
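In symbols, with Wi, bi and WC, bC as the input gate’s and candidate layer’s parameters (again, names assumed here):

```latex
i_t = \sigma\left( W_i \cdot [h_{t-1}, x_t] + b_i \right), \qquad
\tilde{C}_t = \tanh\left( W_C \cdot [h_{t-1}, x_t] + b_C \right)
```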
Step 3- The next step is to update the old cell state, Ct−1, to the new cell state, Ct.
We multiply the old state by ft, forgetting the things we decided to forget earlier. Then we add it ∗ C̃t, the new candidate values scaled by how much we decided to update each state value.
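In symbols, with ∗ denoting element-wise multiplication:

```latex
C_t = f_t \ast C_{t-1} + i_t \ast \tilde{C}_t
```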
Step 4- Finally, we need to decide what we’re going to output.
First, we run a sigmoid layer (the output gate) which decides what parts of the cell state we’re going to output. Then we put the cell state through tanh (to push the values to between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
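In symbols, with Wo and bo as the output gate’s parameters (names assumed here):

```latex
o_t = \sigma\left( W_o \cdot [h_{t-1}, x_t] + b_o \right), \qquad
h_t = o_t \ast \tanh(C_t)
```

Putting the four steps together, here is a minimal NumPy sketch of a single LSTM forward step. The parameter names, shapes, and the `sigmoid` helper are assumptions for illustration, not a production implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM cell step; p holds weight matrices W_* and biases b_*."""
    z = np.concatenate([h_prev, x_t])               # [h_{t-1}, x_t]

    f_t = sigmoid(p["W_f"] @ z + p["b_f"])          # Step 1: forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])          # Step 2: input gate
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])      #         candidate values
    C_t = f_t * C_prev + i_t * C_tilde              # Step 3: new cell state
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])          # Step 4: output gate
    h_t = o_t * np.tanh(C_t)                        #         new hidden state
    return h_t, C_t

# Tiny usage example with random parameters
hidden, inputs = 4, 3
rng = np.random.default_rng(0)
p = {k: rng.standard_normal((hidden, hidden + inputs)) * 0.1
     for k in ("W_f", "W_i", "W_C", "W_o")}
p.update({k: np.zeros(hidden) for k in ("b_f", "b_i", "b_C", "b_o")})
h_t, C_t = lstm_step(rng.standard_normal(inputs), np.zeros(hidden), np.zeros(hidden), p)
```

Because the cell state Ct is updated additively (Step 3) rather than by repeated matrix multiplication, gradients can flow through many time steps without vanishing, which is exactly what the plain RNN could not do.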