
Understanding Long Short-Term Memory (LSTM) Networks


Recurrent Neural Networks (RNNs) are powerful, but they suffer from short-term memory: information from early in a sequence fades as gradients vanish over many time steps. LSTMs were designed to solve this via unique internal mechanisms called gates.

In this article, we'll demystify the mathematics behind LSTMs and take a "first principles" approach to understanding how they process sequences.

The Architecture

Unlike standard feedforward neural networks, LSTMs have feedback connections. They can process not only single data points (such as images), but also entire sequences of data (such as speech or video).

Key Concept: The core idea behind LSTMs is the cell state, usually drawn as a horizontal line running through the top of the standard LSTM diagram. It's like a conveyor belt, running straight down the entire chain, with only some minor linear interactions.
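The conveyor-belt intuition can be sketched in a few lines of NumPy. This is an illustrative fragment, not a full LSTM: the function and argument names (`cell_state_update`, `c_prev`, `f_t`, `i_t`, `c_tilde`) are ours, and the gate values are assumed to have been computed already by the forget and input gates described below.

```python
import numpy as np

def cell_state_update(c_prev, f_t, i_t, c_tilde):
    # The cell state is changed only by elementwise operations:
    # f_t scales down (forgets) parts of the old state,
    # i_t gates how much of the new candidate values gets added.
    return f_t * c_prev + i_t * c_tilde

# Example: keep half of the first entry, all of the second,
# and add candidate values only to the first slot.
c_prev = np.array([1.0, 2.0])
f_t = np.array([0.5, 1.0])
i_t = np.array([1.0, 0.0])
c_tilde = np.array([3.0, 4.0])
c_t = cell_state_update(c_prev, f_t, i_t, c_tilde)  # [3.5, 2.0]
```

Because the update is purely elementwise (no matrix multiply on the cell state itself), gradients can flow along this path with little distortion, which is exactly what lets LSTMs remember over long spans.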

1. The Forget Gate

The first step in our LSTM is to decide what information we're going to throw away from the cell state. This decision is made by a sigmoid layer called the "forget gate layer".

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(x, h_prev, W_f, b_f):
    # Concatenate previous hidden state and current input into one vector
    concat = np.concatenate((h_prev, x))
    # Sigmoid squashes each entry to (0, 1): 0 = "forget", 1 = "keep"
    f_t = sigmoid(np.dot(W_f, concat) + b_f)
    return f_t

2. The Input Gate

The next step is to decide what new information we're going to store in the cell state. This has two parts. First, a sigmoid layer called the "input gate layer" decides which values we'll update. Second, a tanh layer creates a vector of new candidate values that could be added to the state.
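Following the shape of the forget-gate code above, the two parts can be sketched together. The parameter names (`W_i`, `b_i`, `W_c`, `b_c`) are illustrative, chosen to mirror the forget-gate example; any real implementation would learn these weights during training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate(x, h_prev, W_i, b_i, W_c, b_c):
    # Same concatenation of previous hidden state and current input
    concat = np.concatenate((h_prev, x))
    # Part 1: sigmoid layer decides which values to update (0 = ignore, 1 = update)
    i_t = sigmoid(np.dot(W_i, concat) + b_i)
    # Part 2: tanh layer proposes candidate values in (-1, 1)
    c_tilde = np.tanh(np.dot(W_c, concat) + b_c)
    return i_t, c_tilde
```

Note that both layers read the same concatenated vector; they differ only in their weights and activation, with the sigmoid acting as a soft mask over the tanh candidates.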


Conclusion

LSTMs are a significant step forward in what we can achieve with RNNs. While Transformers have taken over NLP, LSTMs remain valuable for time-series analysis and other sequential tasks where data or compute is limited.