🤗 Transformer
Why this blog

I was originally studying LLMs on Hugging Face and noticed that the Transformer occupied a significant portion of their learning materials.

Introduction#

Transformer models are everywhere and are used to solve all kinds of tasks across different modalities, including natural language processing (NLP), computer vision, audio processing, and more. The 🤗 Transformers library provides the functionality to create and use those shared models. The Model Hub contains millions of pretrained models that anyone can download and use. We can also upload our own models to the Hub!


Here are some of the companies and organizations using Hugging Face and Transformer models, who also contribute back to the community by sharing their models:

Before diving into how Transformer models work under the hood, let’s look at a few examples of how they can be used to solve some interesting NLP problems and build some intuition.

The most basic object in the 🤗 Transformers library is the pipeline() function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer. Let’s take named entity recognition as an example.

TIP

Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations.

from transformers import pipeline
ner = pipeline("ner", grouped_entities=True)
ner("My name is Jiaqi and I work at OpenML in China.")
[
{'entity_group': 'PER', 'score': 0.99816, 'word': 'Jiaqi', 'start': 11, 'end': 16},
{'entity_group': 'ORG', 'score': 0.97960, 'word': 'OpenML', 'start': 31, 'end': 37},
{'entity_group': 'LOC', 'score': 0.99321, 'word': 'China', 'start': 41, 'end': 46}
]

Here the model correctly identified that Jiaqi is a person (PER), OpenML an organization (ORG), and China a location (LOC).

There are three main steps involved when we pass some text to the pipeline above:

  1. The text is preprocessed into a format the model can understand.
  2. The preprocessed inputs are passed to the model.
  3. The predictions of the model are post-processed, so we can make sense of them.

By default, this pipeline selects a particular pretrained model that has been fine-tuned for NER in English. The model is downloaded and cached when we create the ner object. If we rerun the command, the cached model will be used instead and there is no need to download the model again.
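
To make these three steps concrete, here is a rough sketch of roughly what pipeline() does internally. The checkpoint name below is commonly used as the default English NER model, but treat it as an assumption; the real pipeline also groups sub-word tokens into entities, which this sketch skips.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint)

# Step 1: preprocess the text into model inputs
inputs = tokenizer("My name is Jiaqi and I work at OpenML in China.", return_tensors="pt")

# Step 2: pass the preprocessed inputs to the model
with torch.no_grad():
    logits = model(**inputs).logits

# Step 3: post-process the predictions into something we can read
predictions = logits.argmax(dim=-1)
labels = [model.config.id2label[p.item()] for p in predictions[0]]
print(list(zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), labels)))
```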

The pipeline() function supports multiple modalities, allowing us to work with not just text, but also images, audio, and even multimodal tasks. In this post we’ll focus on text tasks.

How do Transformers work?#

The Transformer architecture was introduced in June 2017. The focus of the original research was on translation tasks. This was followed by the introduction of several influential models, including:

  • June 2018: GPT, the first pretrained Transformer model, used for fine-tuning on various NLP tasks and obtained state-of-the-art results
  • October 2018: BERT, another large pretrained model, this one designed to produce better summaries of sentences (more on this in the next chapter!)
  • February 2019: GPT-2, an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns
  • October 2019: T5, A multi-task focused implementation of the sequence-to-sequence Transformer architecture.
  • May 2020, GPT-3, an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called zero-shot learning)
  • January 2022: InstructGPT, a version of GPT-3 that was trained to follow instructions better
  • January 2023: Llama, a large language model that is able to generate text in a variety of languages.
  • March 2023: Mistral, a 7-billion-parameter language model that outperforms Llama 2 13B across all evaluated benchmarks, leveraging grouped-query attention for faster inference and sliding window attention to handle sequences of arbitrary length.
  • May 2024: Gemma 2, a family of lightweight, state-of-the-art open models ranging from 2B to 27B parameters that incorporate interleaved local-global attentions and group-query attention, with smaller models trained using knowledge distillation to deliver performance competitive with models 2-3 times larger.
  • November 2024: SmolLM2, a state-of-the-art small language model (135 million to 1.7 billion parameters) that achieves impressive performance despite its compact size, unlocking new possibilities for mobile and edge devices.

This list is far from comprehensive and is just meant to highlight a few of the different kinds of Transformer models. Broadly, they can be grouped into three categories:

  • GPT-like (also called auto-regressive Transformer models)
  • BERT-like (also called auto-encoding Transformer models)
  • T5-like (also called sequence-to-sequence Transformer models)

All the Transformer models mentioned above (GPT, BERT, T5, etc.) have been trained as large language models. This means they have been trained on large amounts of raw text in a self-supervised fashion.

IMPORTANT

Transformers are large language models

Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data. This type of model develops a statistical understanding of the language it has been trained on, but it’s less useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called transfer learning or fine-tuning. During this process, the model is fine-tuned in a supervised way - that is, using human-annotated labels - on a given task.

A transformer model is primarily composed of 2 blocks:

  1. Encoder (left): The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.
  2. Decoder (right): The decoder uses the encoder’s representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

A key feature of Transformer models is that they are built with special layers called attention layers. In fact, the title of the paper introducing the Transformer architecture was “Attention Is All You Need”! Now let’s dive into the real business of Transformers: attention.

Attention is All You Need#

It is important to know the history in order to fully understand the Transformer. Before the Transformer, sequence-to-sequence models were the deep learning models that achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. These models are explained in the two pioneering papers (Sutskever et al., 2014, Cho et al., 2014).

A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items. A trained model would work like this:

In neural machine translation, a sequence is a series of words, processed one after another. The output is, likewise, a series of words:

Under the hood, the model is composed of an encoder and a decoder. The encoder processes each item in the input sequence, it compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.

The same applies in the case of machine translation.

The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks

TIP

The context is a vector of floats. Later in this post we will visualize vectors in color by assigning brighter colors to the cells with higher values.

We can set the size of the context vector when we set up our model. It is basically the number of hidden units in the encoder RNN. These visualizations show a vector of size 4, but in real world applications the context vector would be of a size like 256, 512, or 1024.

By design, an RNN takes two inputs at each time step: an input (in the case of the encoder, one word from the input sentence), and a hidden state. The word, however, needs to be represented by a vector. To transform a word into a vector, we turn to the class of methods called word embedding algorithms. These map words into vector representations that capture a lot of the meaning/semantic information of the words. Here is an example:

Preliminary - Word Embedding#

In this subsection, we will go over the concept of embedding, and the mechanics of generating embeddings with word2vec.

Here is an example of a trained word vector (also called a word embedding):

[
0.50451 , 0.68607 , -0.59517 , -0.022801, 0.60046 , -0.13498 , -0.08813 , 0.47377 , -0.61798 , -0.31012 , -0.076666,
1.493 , -0.034189, -0.98173 , 0.68229 , 0.81722 , -0.51874 , -0.31503 , -0.55809 , 0.66421 , 0.1961 , -0.13495 ,
-0.11476 , -0.30344 , 0.41177 , -2.223 , -1.0756 , -1.0783 , -0.34354 , 0.33505 , 1.9927 , -0.04234 , -0.64319 ,
0.71125 , 0.49159 , 0.16754 , 0.34344 , -0.25663 , -0.8523 , 0.1661 , 0.40102 , 1.1685 , -1.0137 , -0.21585 ,
-0.15155 , 0.78321 , -0.91241 , -1.6106 , -0.64426 , -0.51042
]

It’s a list of 50 numbers. We can’t tell much by looking at the values. But let’s visualize it a bit so that we could compare it with other word vectors. First let’s put all these numbers in one row:

Next let’s color code the cells based on their values (red if they’re close to 2, white if they’re close to 0, blue if they’re close to -2):

Then we’ll proceed by ignoring the numbers and only looking at the colors to indicate the values of the cells. Let’s now contrast “King” against other words:

See how “Man” and “Woman” are much more similar to each other than either of them is to “king”? This tells us something. These vector representations capture quite a bit of the information/meaning/associations of these words.

Here’s another list of examples (compare by vertically scanning the columns looking for columns with similar colors):

A few things to point out:

  • There’s a straight red column through all of these different words. They’re similar along that dimension (and we don’t know what each dimension codes for)
  • We can see how “woman” and “girl” are similar to each other in a lot of places. The same with “man” and “boy”
  • “boy” and “girl” also have places where they are similar to each other, but different from “woman” or “man”. Could these be coding for a vague conception of youth? Possibly.
  • All but the last word are words representing people. I added an object (water) to show the differences between categories. We can, for example, see that blue column going all the way down and stopping before the embedding for “water”.
  • There are clear places where “king” and “queen” are similar to each other and distinct from all the others. Could these be coding for a vague concept of royalty?

Analogies

The famous example that shows an incredible property of embeddings is the concept of analogies. We can add and subtract word embeddings and arrive at interesting results. The most famous example is the formula “king” - “man” + “woman”. Using the Gensim library in Python, we can add and subtract word vectors, and it will find the most similar words to the resulting vector. The image shows a list of the most similar words, each with its cosine similarity.

We can visualize this analogy as we did previously:

The resulting vector from “king-man+woman” doesn’t exactly equal “queen”, but “queen” is the closest word to it from the 400,000 word embeddings we have in this collection.
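
As a concrete illustration, here is a minimal sketch using Gensim and its downloadable 50-dimensional GloVe vectors. The dataset name "glove-wiki-gigaword-50" and the exact neighbors returned are assumptions that depend on the vectors you load:

```python
import gensim.downloader as api

# Load pretrained 50-dimensional GloVe vectors (400,000 words)
word_vectors = api.load("glove-wiki-gigaword-50")

# "king" - "man" + "woman" ≈ ?
result = word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=5)
for word, cosine_similarity in result:
    print(f"{word}: {cosine_similarity:.4f}")
```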

Now that we’ve looked at trained word embeddings, let’s learn more about the training process. But before we get to word2vec, we need to look at a conceptual parent of word embeddings: the neural language model.

Language Modeling#

If one wanted to give an example of an NLP application, one of the best examples would be the next-word prediction feature of a smartphone keyboard. It’s a feature that billions of people use hundreds of times every day.

Next-word prediction is a task that can be addressed by a language model. A language model can take a list of words (let’s say two words), and attempt to predict the word that follows them. In practice, however, the model doesn’t output only one word. It actually outputs a probability score for all the words it knows (the model’s “vocabulary”, which can range from a few thousand to over a million words). The keyboard application then has to find the words with the highest scores, and present those to the user. After being trained, early neural language models would calculate a prediction in three steps:

The first step is the most relevant for us as we discuss embeddings. One of the results of the training process was this matrix that contains an embedding for each word in our vocabulary. During prediction time, we just look up the embeddings of the input word, and use them to calculate the prediction:
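
Here is a toy sketch of these three steps (all sizes and weights below are made up for illustration; a real model learns them during training):

```python
import numpy as np

vocab = ["thou", "shalt", "not", "make", "a", "machine"]
embedding_size = 8
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), embedding_size))       # embedding matrix (one row per word)
W = rng.normal(size=(len(vocab), 2 * embedding_size))   # projection to the vocabulary

def predict_next(w1, w2):
    x = np.concatenate([E[vocab.index(w1)], E[vocab.index(w2)]])  # step 1: look up embeddings
    logits = W @ x                                                # step 2-3: project to vocabulary
    probs = np.exp(logits) / np.exp(logits).sum()                 # turn scores into probabilities
    return dict(zip(vocab, probs.round(3)))

print(predict_next("thou", "shalt"))
```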

Let’s now turn to the training process to learn more about how this embedding matrix was developed.

Language Model Training#

Words get their embeddings by us looking at which other words they tend to appear next to. The mechanics of that are as follows:

  1. We get a lot of text data (say, all Wikipedia articles, for example).
  2. We have a window (say, of three words) that we slide across all of that text.
  3. The sliding window generates training samples for our model.

Here is an example:

The example above tries to predict the target word by looking at the two words before it; we could also look at the two words after it. Another architecture that also tends to show great results does things a little differently, and it is the one we will use in the following discussion: instead of guessing a word based on its context (the words before or maybe even after it), this architecture tries to guess the neighboring words within a certain radius using the current word. It is called skipgram, and its window slides across the text like this:

The word in the green slot would be the input (or current) word, and each pink box would be a possible output within the radius. In this case, the radius is 2 (words).
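
Here is a small sketch of how such training pairs could be generated with a radius-2 window (the sentence and the whitespace tokenization are just a toy example):

```python
text = "thou shalt not make a machine in the likeness of a human mind"
tokens = text.split()
radius = 2

pairs = []  # (input word, neighboring word) training samples
for i, center in enumerate(tokens):
    for j in range(max(0, i - radius), min(len(tokens), i + radius + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

print(pairs[:8])
```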

A single snapshot of the sliding window creates four separate samples in our training dataset:

A couple of positions later, we have a lot more examples:

Now that we have our skipgram training dataset that we extracted from existing running text, let’s glance at how we use it to train a basic neural language model that predicts the neighboring word. We start with the first sample in our dataset. We grab the feature and feed to the untrained model asking it to predict an appropriate neighboring word.

The model conducts the three steps and outputs a prediction vector (with a probability assigned to each word in its vocabulary). Since the model is untrained, its prediction is sure to be wrong at this stage. But that’s okay. We know what word it should have guessed – the label/output cell in the row we’re currently using to train the model:

How far off was the model? We could choose to subtract the two vectors resulting in an error vector:

This error vector can now be used to update the model so that the next time, it’s a little more likely to guess “thou” when it gets “not” as input.

And that concludes the first step of the training. We proceed to do the same process with the next sample in our dataset, and then the next, until we’ve covered all the samples in the dataset. That concludes one epoch of training. We do it over again for a number of epochs, and then we’d have our trained model and we can extract the embedding matrix from it and use it for any other application.

While this extends our understanding of the process, it’s still not how word2vec is actually trained. We’re missing a couple of key ideas.

The 3rd step (Project to output vocabulary) is very expensive from a computational point of view - especially knowing that we will do it once for every training sample in our dataset (easily tens of millions of times). We need to do something to improve performance, which is missing from the basic training strategy introduced above.

One solution for boosting the performance is to split our target into 2 steps:

  1. Generate high-quality word embeddings (Don’t worry about next-word prediction).
  2. Use these high-quality embeddings to train a language model (to do next-word prediction).

We will be focusing on step 1 as we’re focusing on embeddings. To generate high-quality embeddings using a high-performance model, we can switch the model’s task from predicting a neighboring word to taking the input and output word, and outputting a score indicating whether they’re neighbors or not (0 for “not neighbors”, 1 for “neighbors”).

This simple switch changes the model we need from a neural network, to a logistic regression model - thus it becomes much simpler and much faster to calculate.

This switch requires that we change the structure of our dataset – the label is now a new column with values 0 or 1. They will all be 1 at first, since all the words we added are neighbors.

This can now be computed at blazing speed – processing millions of examples in minutes. But there’s one loophole we need to close. If all of our examples are positive (target: 1), we open ourselves to the possibility of a smartass model that always returns 1 - achieving 100% accuracy, but learning nothing and generating garbage embeddings. To address this, we need to introduce negative samples to our dataset - samples of words that are not neighbors. Our model needs to return 0 for those samples. Now that’s a challenge that the model has to work hard to solve - but still at blazing fast speed.

But what do we fill in as output words? We randomly sample words from our vocabulary
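
Here is a toy sketch of how negative samples could be drawn (the vocabulary and the number of negatives k are illustrative; the random draw may occasionally pick an actual neighbor, which word2vec tolerates in practice):

```python
import random

vocabulary = ["thou", "shalt", "not", "make", "a", "machine", "aaron", "taco"]
k = 2  # number of negative samples per positive pair

def with_negatives(input_word, true_neighbor):
    # one positive example (label 1) plus k random negatives (label 0)
    samples = [(input_word, true_neighbor, 1)]
    for _ in range(k):
        samples.append((input_word, random.choice(vocabulary), 0))
    return samples

print(with_negatives("not", "thou"))
```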

This idea is inspired by Noise-contrastive estimation. We are contrasting the actual signal (positive examples of neighboring words) with noise (randomly selected words that are not neighbors). This leads to a great tradeoff of computational and statistical efficiency.

We have now covered two of the central ideas in word2vec: as a pair, they’re called skipgram with negative sampling:

Word2vec Training Process#

Now that we’ve established the two central ideas of skipgram and negative sampling, we can proceed to look closer at the actual word2vec training process.

Before the training process starts, we pre-process the text we’re training the model against. In this step, we determine the size of our vocabulary (we’ll call this vocab_size, think of it as, say, 10,000) and which words belong to it.

At the start of the training phase, we create two matrices – an Embedding matrix and a Context matrix. These two matrices have an embedding for each word in our vocabulary (So vocab_size is one of their dimensions). The second dimension is how long we want each embedding to be (embedding_size – 300 is a common value, but we’ve looked at an example of 50 earlier in our discussion here).

At the start of the training process, we initialize these matrices with random values. Then we start the training process. In each training step, we take one positive example and its associated negative examples. Let’s take our first-step data (highlighted in light blue rows):

Now we have 4 words: the input word not and output/context words: thou (the actual neighbor), aaron, and taco (the negative examples). We proceed to look up their embeddings - for the input word, we look in the Embedding matrix. For the context words, we look in the Context matrix (even though both matrices have an embedding for every word in our vocabulary).

Then, we take the dot product of the input embedding with each of the context embeddings. In each case, that results in a number, and that number indicates the similarity of the input and context embeddings.

Now we need a way to turn these scores into something that looks like probabilities - we need them to all be positive and have values between zero and one. This is a great task for sigmoid, the logistic operation.

And we can now treat the output of the sigmoid operations as the model’s output for these examples. We can see that taco has the highest score and aaron still has the lowest score both before and after the sigmoid operations.

Now that the untrained model has made a prediction, and seeing as though we have an actual target label to compare against, let’s calculate how much error is in the model’s prediction. To do that, we just subtract the sigmoid scores from the target labels (error = target - sigmoid_scores).

Here comes the “learning” part of “machine learning”. We can now use this error score to adjust the embeddings of not, thou, aaron, and taco so that the next time we make this calculation, the result would be closer to the target scores.

This concludes the training step. We emerge from it with slightly better embeddings for the words involved in this step (not, thou, aaron, and taco). We now proceed to our next step (the next positive sample and its associated negative samples) and do the same process again.

The embeddings continue to be improved while we cycle through our entire dataset for a number of times. We can then stop the training process, discard the Context matrix, and use the Embeddings matrix as our pre-trained embeddings for the next task.
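
Putting the pieces together, here is a compact numpy sketch of a single training step with one positive example and its negative samples. All names, sizes, and the learning rate are illustrative and not word2vec’s actual implementation:

```python
import numpy as np

vocab_size, embedding_size, lr = 10_000, 300, 0.01
rng = np.random.default_rng(0)
embedding = rng.normal(scale=0.1, size=(vocab_size, embedding_size))  # Embedding matrix
context = rng.normal(scale=0.1, size=(vocab_size, embedding_size))    # Context matrix

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(input_id, context_ids, labels):
    """One positive example plus its associated negative samples."""
    v_in = embedding[input_id]            # look up the input word in the Embedding matrix
    v_ctx = context[context_ids]          # look up context words in the Context matrix
    scores = sigmoid(v_ctx @ v_in)        # dot products squashed into (0, 1)
    error = labels - scores               # error = target - sigmoid_scores
    # nudge both matrices so the next prediction is closer to the targets
    embedding[input_id] += lr * (error @ v_ctx)
    context[context_ids] += lr * np.outer(error, v_in)

# e.g. input "not" vs. true neighbor "thou" (1) and negatives "aaron", "taco" (0)
train_step(input_id=42, context_ids=np.array([7, 13, 99]), labels=np.array([1.0, 0.0, 0.0]))
```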

Window Size and Number of Negative Samples

Two key hyperparameters in the word2vec training process are the window size and the number of negative samples.

Different tasks are served better by different window sizes. One heuristic is that smaller window sizes (2-15) lead to embeddings where high similarity scores between two embeddings indicate that the words are interchangeable (notice that antonyms are often interchangeable if we’re only looking at their surrounding words – e.g. good and bad often appear in similar contexts). Larger window sizes (15-50, or even more) lead to embeddings where similarity is more indicative of relatedness of the words.

The number of negative samples is another factor of the training process. The original paper prescribes 5-20 as being a good number of negative samples. It also states that 2-5 seems to be enough when you have a large enough dataset.
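
For reference, both hyperparameters map directly to arguments of Gensim’s Word2Vec implementation; the corpus and parameter values below are only illustrative:

```python
from gensim.models import Word2Vec

sentences = [["thou", "shalt", "not", "make", "a", "machine"],
             ["in", "the", "likeness", "of", "a", "human", "mind"]]

# sg=1 selects skipgram; window and negative are the two hyperparameters discussed above
model = Word2Vec(sentences, vector_size=300, window=5, negative=5, sg=1, min_count=1)
print(model.wv.most_similar("machine", topn=3))
```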

Preliminary - Recurrent Neural Networks (RNNs)#

Now that we have introduced our main vectors/tensors, let’s recap the mechanics of an RNN and establish a visual language to describe these models:

The next RNN step takes the second input vector and hidden state #1 to create the output of that time step.

In the following visualization, each pulse for the encoder or decoder is that RNN processing its inputs and generating an output for that time step. Since the encoder and decoder are both RNNs, each time step one of the RNNs does some processing, it updates its hidden state based on its inputs and previous inputs it has seen.

Let’s look at the hidden states for the encoder. Notice how the last hidden state is actually the context we pass along to the decoder.

The decoder also maintains a hidden state that it passes from one time step to the next. We just didn’t visualize it in this graphic because we are concerned with the major parts of the model for now.

Let’s now look at another way to visualize a sequence-to-sequence model. This animation will make it easier to understand the static graphics that describe these models. This is called an “unrolled” view where instead of showing the one decoder, we show a copy of it for each time step. This way we can look at the inputs and outputs of each time step.

Example - Character-Level Language Model#

We have all heard of the buzzword “LLM” (Large Language Model). But let’s put that aside for just a second and look at a much simpler model called a “character-level language model” where, for example, we input a prefix of a word such as “hell” and the model outputs the complete word “hello”. We call inputs like “hell” a sequence.

How do we train such a model? One approach is to have one function invoked 4 times, each time taking a single character as input and calculating an output:

Input for the function is actually a one-hot encoded vector representing a single character

In our “hello” example above, the input sequence would be “h”, “e”, “l”, “l”. For each of these characters, the input to the function is not the character itself, but a vector. This vector has a size equal to the total number of unique characters in our vocabulary, i.e. a vocabulary of four possible letters “helo”. For a specific character, the vector will have a value of 1 at the index corresponding to that character, and 0 everywhere else.

For example, the input for the character “h” would be a vector of length 4. This vector would have a value of 1 at the 1st position (since “h” is the 1st letter in our vocabulary “helo”) and 0s in all other 3 positions. The next input would be the one-hot encoded vector for “e”, and so on. This process allows the function to handle sequential data by processing one character at a time.
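
A small sketch of this one-hot encoding for the vocabulary “helo” (the index order is an assumption for illustration):

```python
vocab = ["h", "e", "l", "o"]

def one_hot(ch):
    # a vector of zeros with a single 1 at the character's index in the vocabulary
    vec = [0] * len(vocab)
    vec[vocab.index(ch)] = 1
    return vec

print(one_hot("h"))  # [1, 0, 0, 0]
print(one_hot("e"))  # [0, 1, 0, 0]
```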

But one might have noticed that if the 3rd invocation produces $f(\text{'l'}) = \text{'l'}$, why would the 4th one, given the same input, output a different character, ‘o’? This suggests that we should take the history into account. Instead of having $f$ depend on 1 parameter, we now have it take 2 parameters:

  1. a character, and
  2. a variable that summarizes the previous calculations:

Now it makes much more sense with:

$$f(\text{'l'}, h_2) = \text{'l'}, \qquad f(\text{'l'}, h_3) = \text{'o'}$$

But what if we want to predict a longer or shorter word? For example, how about predicting “cat” by “ca”? That’s simple, we will have 2 black boxes to do the work.

What if the function $f$ is not smart enough to produce the correct output every time? We will simply collect a lot of examples such as “cat” and “hello”, and feed them into the boxes to train them until they can output correct vocabulary like “cat” and “hello”.

This is the idea behind the RNN. It’s recurrent because the boxed function gets invoked repeatedly for each element of the sequence. In the case of our character-level language model, an element is a character such as “e” and the sequence is a string like “hell”:

CAUTION

The diagram below is not multiple functions chained together, but a single function being repeatedly invoked

Each function $f$ is a network unit containing 2 perceptrons. One perceptron computes the “history”, like $h_1$, $h_2$, $h_3$.

One great thing about RNNs is that they offer a lot of flexibility in how we wire up the neural network architecture. Normally when we are working with neural networks, we are given a fixed-sized input vector (red boxes below), then we process it with some hidden layers (green), and we produce a fixed-sized output vector (blue). The left-most model in the figure below is the Vanilla Neural Network, which receives a single input and produces one output (the green box in between actually represents layers of neurons). The rest of the models on the right are all Recurrent Neural Networks that allow us to operate over sequences of input, output, or both at the same time:

  • An example of a one-to-many model is image captioning, where we are given a fixed-sized image and produce a sequence of words that describe the content of that image through an RNN.
  • An example of a many-to-one task is sentiment classification in NLP, where we are given a sequence of words of a sentence and then classify what sentiment (e.g. positive or negative) that sentence expresses.
  • An example of a many-to-many task is machine translation in NLP, where we can have an RNN that takes a sequence of words of a sentence in English, and then this RNN is asked to produce a sequence of words of a sentence in German.
  • There is also a variation of the many-to-many task, as shown in the last model in the figure below, where the model generates an output at every timestep. An example of this many-to-many task is video classification on a frame level, where the model classifies every single frame of video with some number of classes. We should note that we don’t want this prediction to only be a function of the current timestep (the current frame of the video), but also of all the timesteps (frames) that have come before it.

TIP

A CNN learns to recognize patterns across space. So a CNN will learn to recognize components of an image (e.g., lines, curves, etc.) and then learn to combine these components to recognize larger structures (e.g., faces, objects, etc.)

A RNN will similarly learn to recognize patterns across time. So a RNN that is trained to translate text might learn that “dog” should be translated differently if preceded by the word “hot”.

The mechanism by which the two kinds of NNs represent these patterns is different, however. In the case of a CNN, we are looking for the same patterns on all the different subfields of the image. In the case of a RNN we are (in the simplest case) feeding the hidden layers from the previous step as an additional input into the next step. While the RNN builds up memory in this process, it is not looking for the same patterns over different slices of time in the same way that a CNN is looking for the same patterns over different regions of space.

It should be noted that “time” and “space” here shouldn’t be taken too literally. We could run a RNN on a single image for image captioning, for instance, and the meaning of “time” would simply be the order in which different parts of the image are processed. So objects initially processed will inform the captioning of later objects processed.

The sequence regime of operation is much more powerful compared to fixed networks that are doomed from the get-go by a fixed number of computational steps. Moreover, as we’ll see in a bit, RNNs combine the input vector with their state vector with a fixed (but learned) function to produce a new state vector. This can in programming terms be interpreted as running a fixed program with certain inputs and some internal variables. Viewed this way, RNNs essentially describe programs. In fact, it is known that RNNs are Turing-Complete in the sense that they can simulate arbitrary programs (with proper weights).

Space → Time v.s. Function → Program

If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs.

If our data is not in the form of sequences, we can still formulate and train powerful models that learn to process it sequentially. In effect, we are learning stateful programs that process our fixed-sized data.

At the core, RNNs accept an input vector x and give us an output vector y. This output vector’s contents are influenced not only by the input we just fed in, but also by the entire history of inputs we’ve fed in in the past. The RNN’s API consists of a single step function:

rnn = RNN()
y = rnn.step(x) # x is an input vector, y is the RNN's output vector

This is where RNN starts to model the notion of “memory”: The RNN class has some internal state that is updated every time step() is called. In the simplest case this state consists of a single hidden vector h:

import numpy as np

class RNN:
    # ... (W_hh, W_xh, W_hy and the hidden state h are assumed to be initialized elsewhere)
    def step(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y

The step function above specifies the forward pass of the RNN. There are 3 parameters $W_{hh}$, $W_{xh}$, and $W_{hy}$. The hidden vector, or more generally the hidden state, is defined by

$$h^{(t)} = g_1\left( W_{hh}h^{(t - 1)} + W_{xh}x^{(t)} + b_h \right)$$

where $t$ is the index of the “black boxes” shown earlier. In our example of “hell”, $t \in \{ 1, 2, 3, 4 \}$. The hidden state $h$ is usually initialized with the zero vector (simulating “no memory at all”). There are 2 terms inside $g_1$:

  1. one term based on the previous hidden state $W_{hh}h^{(t - 1)}$, and
  2. the other term based on the current input $W_{xh}x^{(t)}$

In the program above we use numpy np.dot which is a matrix multiplication. The 2 terms interact with addition.

We initialize the matrices $W_{hh}$, $W_{xh}$, and $W_{hy}$ with random numbers, and the bulk of work during training goes into finding the matrices that give rise to the desirable behavior, as measured with some loss function that expresses our preference for what kind of output $y$ we would like to see in response to our input sequence $x$.

The output $o^{(t)}$ is given by

$$o^{(t)} = g_2\left( W_{yh}h^{(t)} + b_o \right)$$
What are $g_1$ and $g_2$?

They are activation functions which are used to change the linear function in a perceptron to a non-linear function. Please refer to Machine Learning by Mitchell, Tom M. (1997), Paperback (page 96) for why we bump it to non-linear

A typical activation function for $g_1$ is $\tanh$:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

which squashes the activations to the range $[-1, 1]$

In practice, $g_2$ is often simply the identity, i.e. $g_2(x) = x$

We get RNNs as neural networks if we stack up as follows:

y1 = rnn1.step(x)
y = rnn2.step(y1)

In other words we have two separate RNNs: One RNN is receiving the input vectors and the second RNN is receiving the output of the first RNN as its input. Except neither of these RNNs know or care - it’s all just vectors coming in and going out, and some gradients flowing through each module during backpropagation.

Forward Propagation Equations for RNN#

We now develop the forward propagation equations for the RNN. We assume the hyperbolic tangent activation function, i.e. $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$, and that the output is discrete, as if the RNN is used to predict words or characters. A natural way to represent discrete variables is to regard the output $\boldsymbol{o}$ as giving the unnormalized log probabilities of each possible value of the discrete variable. We can then apply the softmax (discussed shortly) operation as a post-processing step to obtain a vector $\boldsymbol{\hat{y}}^{(t)}$ of normalized probabilities over the output.

Forward propagation begins with a specification of the initial state $\boldsymbol{h}^{(0)}$. The dimension of the hidden state $\boldsymbol{h}$, in contrast to our previous overview, is independent of the dimensions of the input or output sequences; it is a hyperparameter of the model (the number of hidden units).

Then, for each time step from $t = 1$ to $t = \tau$, we apply the following update equations:

$$\color{green} \boxed{ \begin{gathered} \boldsymbol{h}^{(t)} = \tanh\left( \boldsymbol{W_{hh}}\boldsymbol{h}^{(t - 1)} + \boldsymbol{W_{xh}}\boldsymbol{x}^{(t)} + \boldsymbol{b_h} \right) \\ \boldsymbol{o}^{(t)} = \boldsymbol{W_{yh}}\boldsymbol{h}^{(t)} + \boldsymbol{b_o} \\ \boldsymbol{\hat{y}}^{(t)} = \mathrm{softmax}(\boldsymbol{o}^{(t)}) \end{gathered} }$$

where

  • $\boldsymbol{h}^{(t)}$ is the hidden state vector at step $t$; if there are $n_h$ hidden units, it has size $n_h$
  • $\boldsymbol{o}^{(t)}$ is the output produced by the model at step $t$, where $t \in \{1, 2, \cdots, \tau\}$
  • $\boldsymbol{\hat{y}}^{(t)}$ is the normalized probability vector obtained from $\boldsymbol{o}^{(t)}$ at step $t$
  • $\boldsymbol{b_h}$ is the hidden bias vector (size $n_h$) and $\boldsymbol{b_o}$ is the output bias vector (size $n_y$, the output dimension)
  • $\boldsymbol{W_{xh}}$ (size $n_h \times n_x$, where $n_x$ is the input dimension), $\boldsymbol{W_{hh}}$ (size $n_h \times n_h$), and $\boldsymbol{W_{yh}}$ (size $n_y \times n_h$) are the input-to-hidden, hidden-to-hidden, and hidden-to-output weight matrices, respectively

Note that this recurrent network maps an input sequence to an output sequence of the same length.
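
Here is a minimal numpy sketch of these forward propagation equations (the dimensions and the random initialization are illustrative):

```python
import numpy as np

n_x, n_h, n_y = 4, 8, 4                  # e.g. one-hot "hell" over the vocabulary "helo"
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(n_h, n_x))
W_hh = rng.normal(scale=0.1, size=(n_h, n_h))
W_yh = rng.normal(scale=0.1, size=(n_y, n_h))
b_h, b_o = np.zeros(n_h), np.zeros(n_y)

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward(xs):
    h = np.zeros(n_h)                    # h^(0): "no memory at all"
    y_hats = []
    for x in xs:                         # t = 1 .. tau
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
        o = W_yh @ h + b_o
        y_hats.append(softmax(o))
    return y_hats

xs = [np.eye(n_x)[i] for i in (0, 1, 2, 2)]   # one-hot "h", "e", "l", "l"
print(forward(xs))
```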

Loss Function of RNN#

According to the discussion of Machine Learning by Mitchell, Tom M. (1997), the key for training RNN or any neural network is through “specifying a measure for the training error”. We call this measure a loss function.

In RNN, the total loss for a given sequence of input $\boldsymbol{x}$ paired with a sequence of expected $\boldsymbol{y}$ is the sum of the losses over all the time steps, i.e.

$$\mathcal{L}\left( \{ \boldsymbol{x}^{(1)}, ..., \boldsymbol{x}^{(\tau)} \}, \{ \boldsymbol{y}^{(1)}, ..., \boldsymbol{y}^{(\tau)} \} \right) = \sum_t^{\tau} \mathcal{L}^{(t)}$$

Knowing the exact form of $\mathcal{L}^{(t)}$ requires our intuitive understanding of cross-entropy.

Cross-Entropy#

In information theory, the cross-entropy between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution $q$, rather than the true distribution $p$.

Confused? Let’s put it in the context of machine learning. Machine learning sees the world through probability, and a “probability distribution” characterizes the task to learn. For example, a natural language such as English or Chinese can be seen as a probability distribution: the probability of “name” being followed by “is” is far greater than by “are”, as in “My name is Jack”. We call this language distribution $p$. The task of the RNN (or machine learning in general) is to learn an approximation of $p$; we call this approximation $q$.

“The average number of bits needed” can be seen as the distance between $p$ and $q$ given an event. In the language analogy, this can be the quantitative measure of the deviation between the real phrase “My name is Jack” and “My name are Jack”.

At this point, it is easy to imagine that, in the Machine Learning world, the cross entropy indicates the distance between what the model believes the output distribution should be and what the original distribution really is.

Now that we have an intuitive understanding of cross-entropy, let’s formally define it. The cross-entropy of the discrete probability distribution $q$ relative to a distribution $p$ over a given set is defined as

$$H(p, q) = -\sum_x p(x)\log q(x)$$

Since we assumed the softmax probability distribution earlier, $q$ is given by the softmax output $\sigma(\boldsymbol{o}^{(t)}) = \boldsymbol{\hat{y}}^{(t)}$, so the loss becomes:

$$\mathcal{L} = -\sum_t p(t)\log\sigma(\boldsymbol{o}^{(t)}) = -\sum_t \log\sigma(\boldsymbol{o}^{(t)}) = -\sum_t^{\tau} \log\boldsymbol{\hat{y}}^{(t)}$$

where $\boldsymbol{o}$ is the sequence predicted by the RNN and $o_i$ is the $i$-th element of the predicted sequence

Therefore, the total loss for a given sequence of input $\boldsymbol{x}$ paired with a sequence of expected $\boldsymbol{y}$ is the sum of the losses over all the time steps, i.e.

$$\color{green} \boxed{ \mathcal{L}\left( \{ \boldsymbol{x}^{(1)}, ..., \boldsymbol{x}^{(\tau)} \}, \{ \boldsymbol{y}^{(1)}, ..., \boldsymbol{y}^{(\tau)} \} \right) = \sum_t^{\tau} \mathcal{L}^{(t)} = -\sum_t^{\tau} \log\boldsymbol{\hat{y}}^{(t)} }$$

What is the Mathematical form of $p(i)$ in RNN? Why would it become 1?

By definition, $p(i)$ is the true distribution whose exact functional form is unknown. In the language of Approximation Theory, $p(i)$ is the function that RNN is trying to learn or approximate mathematically.

Although $p(i)$ makes the exact form of $\mathcal{L}$ unknown, computationally $p(i)$ is perfectly defined in each training example. Taking our “hello” example:

The 4 probability distributions of $q(x)$ are “reflected” in the output layer of this example. They only “reflect” the probability distribution of $q(x)$ because they are raw $o$ values and have not been transformed into the $\sigma$ distribution yet. But in this case, we are 100% sure that the true probability distributions $p(i)$ for the 4 outputs are

$$\begin{pmatrix}0\\1\\0\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\\0\end{pmatrix}, \begin{pmatrix}0\\0\\0\\1\end{pmatrix}$$

respectively. That is all we need for calculating $\mathcal{L}$.
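
As a tiny numeric illustration (the softmax outputs q below are made up), the per-step loss is just the cross-entropy between each one-hot target and the corresponding model output:

```python
import numpy as np

p = np.array([[0, 1, 0, 0],    # target "e"
              [0, 0, 1, 0],    # target "l"
              [0, 0, 1, 0],    # target "l"
              [0, 0, 0, 1]])   # target "o"
q = np.array([[0.1, 0.6, 0.2, 0.1],   # illustrative softmax outputs at each step
              [0.1, 0.2, 0.5, 0.2],
              [0.1, 0.1, 0.7, 0.1],
              [0.2, 0.1, 0.2, 0.5]])

loss_per_step = -np.sum(p * np.log(q), axis=1)   # -Σ p(x) log q(x)
print(loss_per_step, loss_per_step.sum())
```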

The softmax function takes as input a vector $z$ of $K$ real numbers, and normalizes it into a probability distribution consisting of $K$ probabilities proportional to the exponentials of the input numbers. That is, prior to applying softmax, some vector components could be negative, or greater than one, and might not sum to 1; but after applying softmax, each component will be in the interval $(0, 1)$ and the components will add up to 1, so that they can be interpreted as probabilities. Furthermore, the larger input components will correspond to larger probabilities.

For a vector $z$ of $K$ real numbers, the standard (unit) softmax function $\sigma: \mathbb{R}^K \mapsto (0, 1)^K$, where $K \ge 1$, is defined by

$$\sigma(\boldsymbol{z})_i = \frac{e^{z_i}}{\sum_{j = 1}^K e^{z_j}}$$

where $i = 1, 2, ..., K$ and $\boldsymbol{z} = (z_1, z_2, ..., z_K) \in \mathbb{R}^K$

In the context of RNN,

$$\sigma(\boldsymbol{o})_i = \frac{e^{o_i}}{\sum_{j = 1}^n e^{o_j}}$$

where

  • $n$ is the length of a sequence fed into the RNN
  • $o_i$ is the output by perceptron unit $i$
  • $i = 1, 2, ..., n$
  • $\boldsymbol{o} = (o_1, o_2, ..., o_n) \in \mathbb{R}^n$

The softmax function takes an N-dimensional vector of arbitrary real values and produces another N-dimensional vector with real values in the range (0, 1) that add up to 1.0. It maps $\mathbb{R}^N \rightarrow \mathbb{R}^N$:

$$\sigma(\boldsymbol{o}): \begin{pmatrix}o_1\\o_2\\\vdots\\o_n\end{pmatrix} \rightarrow \begin{pmatrix}\sigma_1\\\sigma_2\\\vdots\\\sigma_n\end{pmatrix}$$

This property of softmax function that it outputs a probability distribution makes it suitable for probabilistic interpretation in classification tasks. Neural networks, however, are commonly trained under a log loss (or cross-entropy) regime

We are going to compute the derivative of the softmax function because we will be using it for training our RNN model shortly. But before diving in, it is important to keep in mind that softmax is fundamentally a vector function. It takes a vector as input and produces a vector as output; in other words, it has multiple inputs and multiple outputs. Therefore, we cannot just ask for “the derivative of softmax”; we should instead specify:

  1. Which component (output element) of softmax we are seeking to find the derivative of.
  2. Since softmax has multiple inputs, with respect to which input element the partial derivative is computed.

What we are looking for is the partial derivatives of

$$\frac{\partial \sigma_i}{\partial o_k} = \frac{\partial}{\partial o_k} \frac{e^{o_i}}{\sum_{j = 1}^n e^{o_j}}$$

where $\frac{\partial \sigma_i}{\partial o_k}$ is the partial derivative of the $i$-th output with respect to the $k$-th input.

We’ll be using the quotient rule of derivatives. For $h(x) = \frac{f(x)}{g(x)}$, where both $f$ and $g$ are differentiable and $g(x) \ne 0$, the quotient rule states that the derivative of $h(x)$ is

$$h'(x) = \frac{f'(x)g(x) - f(x)g'(x)}{g^2(x)}$$

In our case, we have

$$f'(o_k) = \frac{\partial}{\partial o_k} e^{o_i} = \begin{cases} e^{o_k}, & \text{if}\ i = k \\ 0, & \text{otherwise} \end{cases}$$

$$g'(o_k) = \frac{\partial}{\partial o_k} \sum_{j = 1}^n e^{o_j} = \left( \frac{\partial e^{o_1}}{\partial o_k} + \frac{\partial e^{o_2}}{\partial o_k} + \dots + \frac{\partial e^{o_k}}{\partial o_k} + \dots + \frac{\partial e^{o_n}}{\partial o_k} \right) = \frac{\partial e^{o_k}}{\partial o_k} = e^{o_k}$$

The rest of it becomes trivial then. When $i = k$,

$$\frac{\partial \sigma_i}{\partial o_k} = \frac{e^{o_k} \sum_{j = 1}^n e^{o_j} - e^{o_k} e^{o_i}}{\left( \sum_{j = 1}^n e^{o_j} \right)^2} = \frac{e^{o_i} \sum_{j = 1}^n e^{o_j} - e^{o_i} e^{o_i}}{\left( \sum_{j = 1}^n e^{o_j} \right)^2} = \frac{e^{o_i}}{\sum_{j = 1}^n e^{o_j}} \frac{\sum_{j = 1}^n e^{o_j} - e^{o_i}}{\sum_{j = 1}^n e^{o_j}} = \sigma_i\left( \frac{\sum_{j = 1}^n e^{o_j}}{\sum_{j = 1}^n e^{o_j}} - \frac{e^{o_i}}{\sum_{j = 1}^n e^{o_j}} \right) = \sigma_i \left( 1 - \sigma_i \right)$$

When $i \ne k$:

$$\frac{\partial \sigma_i}{\partial o_k} = \frac{-e^{o_k} e^{o_i}}{\left( \sum_{j = 1}^n e^{o_j} \right)^2} = -\sigma_i\sigma_k$$

This concludes the derivative of the softmax function:

$$\frac{\partial \sigma_i}{\partial o_k} = \begin{cases} \sigma_i \left( 1 - \sigma_i \right), & \text{if}\ i = k \\ -\sigma_i\sigma_k, & \text{otherwise} \end{cases}$$
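
We can sanity-check this result numerically with finite differences (a quick illustrative script, not part of the derivation):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

o = np.array([1.0, 2.0, 0.5, -1.0])
s = softmax(o)
analytic = np.diag(s) - np.outer(s, s)       # σ_i(1 - σ_i) on the diagonal, -σ_iσ_k elsewhere

eps = 1e-6
numeric = np.zeros((4, 4))
for k in range(4):
    o_plus = o.copy()
    o_plus[k] += eps
    numeric[:, k] = (softmax(o_plus) - s) / eps   # column k approximates ∂σ_i/∂o_k

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```
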
Deriving Gradient Descent Weight Update Rule#

Training an RNN model is the same thing as searching for the optimal values of the following parameters of the Forward Propagation Equations:

  1. $W_{xh}$
  2. $W_{hh}$
  3. $W_{yh}$
  4. $b_h$
  5. $b_o$

By the Gradient Descent discussed in [Machine Learning by Mitchell, Tom M. (1997), Paperback], we should derive the weight update rule by taking partial derivatives with respect to all of the variables above. Let’s start with $W_{yh}$.

[Machine Learning by Mitchell, Tom M. (1997), Paperback] has also mentioned gradients and partial derivatives as being important for an optimization algorithm to update, say, the model weights of a neural network to reach an optimal set of weights. The use of partial derivatives permits each weight to be updated independently of the others, by calculating the gradient of the error curve with respect to each weight in turn.

Many of the functions that we usually work with in machine learning are multivariate, vector-valued functions, which means that they map $n$ real inputs to $m$ real outputs:

$$f: \mathbb{R}^n \rightarrow \mathbb{R}^m$$

In training a neural network, the backpropagation algorithm is responsible for sharing back the error calculated at the output layer among the neurons comprising the different hidden layers of the neural network, until it reaches the input.

If our RNN contained only 1 perceptron unit, the error would be propagated back using the Chain Rule $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$:

$$\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial o}\frac{\partial o}{\partial W}$$

Note that in the RNN model, $\mathcal{L}$ is not a direct function of $W$, so its first-order derivative cannot be computed unless we connect $\mathcal{L}$ to $o$ first and then $o$ to $W$; both $\frac{\partial \mathcal{L}}{\partial o}$ and $\frac{\partial o}{\partial W}$ are defined by the model presented earlier above.

It is more often the case that we’d have many connected perceptrons populating the network, each attributed a different weight. Since this is the case for RNN, we can generalize to multiple inputs and multiple outputs using the Generalized Chain Rule:

Generalized Chain Rule

Consider the case where $x \in \mathbb{R}^m$ and $u \in \mathbb{R}^n$; an inner function, $f$, maps $m$ inputs to $n$ outputs, while an outer function, $g$, receives $n$ inputs to produce an output, $h \in \mathbb{R}^k$. For $i = 1, \dots, m$ the generalized chain rule states:

$$\frac{\partial h}{\partial x_i} = \frac{\partial h}{\partial u_1} \frac{\partial u_1}{\partial x_i} + \frac{\partial h}{\partial u_2} \frac{\partial u_2}{\partial x_i} + \dots + \frac{\partial h}{\partial u_n} \frac{\partial u_n}{\partial x_i} = \sum_{j = 1}^n \frac{\partial h}{\partial u_j} \frac{\partial u_j}{\partial x_i}$$

Therefore, the error propagation of Gradient Descent in RNN is

$$\color{green} \boxed{ \begin{aligned} \frac{\partial \mathcal{L}}{\partial W_{yh}} &= \sum_{t = 1}^\tau \sum_{i = 1}^n \frac{\partial \mathcal{L}}{\partial o_i^{(t)}} \frac{\partial o_i^{(t)}}{\partial W_{yh}} \\ \frac{\partial \mathcal{L}}{\partial W_{hh}} &= \sum_{t = 1}^\tau \sum_{i = 1}^n \frac{\partial \mathcal{L}}{\partial h_i^{(t)}} \frac{\partial h_i^{(t)}}{\partial W_{hh}} \\ \frac{\partial \mathcal{L}}{\partial W_{xh}} &= \sum_{t = 1}^\tau \sum_{i = 1}^n \frac{\partial \mathcal{L}}{\partial h_i^{(t)}} \frac{\partial h_i^{(t)}}{\partial W_{xh}} \end{aligned} }$$

where $n$ is the length of an RNN sequence and $t$ is the index of the timestep

On $\sum_{t = 1}^\tau$

We assume the error is the sum of all errors of each timestep, which is why we include the $\sum_{t = 1}^\tau$ term.

Let’s look at $\frac{\partial \mathcal{L}}{\partial W_{yh}}$ first

$$\frac{\partial \mathcal{L}}{\partial W_{yh}} = \sum_{t = 1}^\tau \sum_{i = 1}^n \frac{\partial \mathcal{L}}{\partial o_i^{(t)}} \frac{\partial o_i^{(t)}}{\partial W_{yh}}$$

Since $o_i = W_{yh}h_i + b_o$,

$$\frac{\partial o_i}{\partial W_{yh}} = \frac{\partial}{\partial W_{yh}}\left( W_{yh}h_i + b_o \right) = h_i$$

For $\frac{\partial \mathcal{L}}{\partial o_i}$, we shall recall from the earlier discussion on the softmax derivative that we CANNOT simply have

$$\frac{\partial \mathcal{L}}{\partial o_i} = -\frac{\partial}{\partial o_i}\sum_i^n p(i)\log\sigma_i$$

because we need to

  1. specify which component (output element) we are seeking to find the derivative of
  2. with respect to which input element the partial derivative is computed

Therefore:

$$\frac{\partial \mathcal{L}}{\partial o_i} = -\frac{\partial}{\partial o_i}\sum_j^n p(j)\log\sigma_j = -\sum_j^n \frac{\partial}{\partial o_i} p(j)\log\sigma_j = -\sum_j^n p(j)\frac{\partial \log\sigma_j}{\partial o_i}$$

where $n$ is the number of timesteps (or the length of a sequence such as “hell”)

Applying the chain rule again:

$$-\sum_j^n p(j)\frac{\partial \log\sigma_j}{\partial o_i} = -\sum_j^n p(j)\frac{1}{\sigma_j}\frac{\partial\sigma_j}{\partial o_i}$$

Recall we have already derived that

$$\frac{\partial \sigma_i}{\partial o_j} = \begin{cases} \sigma_i \left( 1 - \sigma_i \right), & \text{if}\ i = j \\ -\sigma_i\sigma_j, & \text{otherwise} \end{cases}$$

$$-\sum_j^n p(j)\frac{1}{\sigma_j}\frac{\partial\sigma_j}{\partial o_i} = -\sum_{i = j}^n p(j)\frac{1}{\sigma_j}\frac{\partial\sigma_j}{\partial o_i} - \sum_{i \ne j}^n p(j)\frac{1}{\sigma_j}\frac{\partial\sigma_j}{\partial o_i} = -p(i)(1 - \sigma_i) + \sum_{i \ne j}^n p(j)\sigma_i$$

Observing that

$$\sum_{j}^n p(j) = 1$$

we get

$$-p(i)(1 - \sigma_i) + \sum_{i \ne j}^n p(j)\sigma_i = -p(i) + p(i)\sigma_i + \sum_{i \ne j}^n p(j)\sigma_i = \sigma_i - p(i)$$

$$\color{green} \boxed{ \frac{\partial \mathcal{L}}{\partial o_i} = \sigma_i - p(i) }$$

Plugging this back in gives

$$\color{green} \boxed{ \frac{\partial \mathcal{L}}{\partial W_{yh}} = \sum_{t = 1}^\tau \sum_i^n \left[ \sigma_i - p(i) \right] h_i = \sum_{t = 1}^\tau \left( \boldsymbol{\sigma} - \boldsymbol{p} \right) \boldsymbol{h}^{(t)} }$$

Similarly, for $b_o$:

$$\frac{\partial \mathcal{L}}{\partial b_o} = \sum_{t = 1}^\tau \sum_i^n \frac{\partial \mathcal{L}}{\partial o_i^{(t)}}\frac{\partial o_i^{(t)}}{\partial b_o^{(t)}} = \sum_{t = 1}^\tau \sum_i^n \left[ \sigma_i - p(i) \right] \times 1$$

$$\color{green} \boxed{ \frac{\partial \mathcal{L}}{\partial b_o} = \sum_{t = 1}^\tau \sum_i^n \left[ \sigma_i - p(i) \right] = \sum_{t = 1}^\tau \left( \boldsymbol{\sigma} - \boldsymbol{p} \right) }$$

We have at this point derived the backpropagation rules for $W_{yh}$ and $b_o$ from our list of parameters:

  1. $W_{xh}$
  2. $W_{hh}$
  3. $W_{yh}$
  4. $b_h$
  5. $b_o$

Now let’s look at $\frac{\partial \mathcal{L}}{\partial W_{hh}}$:

Recall from Deep Learning, section 6.5.2, p. 207 that the vector notation of $\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j}\frac{\partial y_j}{\partial x_i}$ is

$$\nabla_{\boldsymbol{x}}z = \left( \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} \right)^\intercal \nabla_{\boldsymbol{y}}z$$

This gives us a start with:

$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial W_{hh}} &= \sum_{t = 1}^\tau \sum_{i = 1}^n \frac{\partial \mathcal{L}}{\partial h_i^{(t)}} \frac{\partial h_i^{(t)}}{\partial W_{hh}} \\ &= \sum_{t = 1}^\tau \left( \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{W_{hh}}} \right)^\intercal \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \\ &= \sum_{t = 1}^\tau \mathrm{diag}\left[ 1 - \left(\boldsymbol{h}^{(t)}\right)^2 \right] \left( \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \right) {\boldsymbol{h}^{(t - 1)}}^\intercal \end{aligned}$$

where the last step uses the $\tanh$ update equation, whose derivative contributes the factor $\mathrm{diag}\left[ 1 - \left(\boldsymbol{h}^{(t)}\right)^2 \right]$ together with the previous hidden state $\boldsymbol{h}^{(t - 1)}$:

$$\color{green} \boxed{ \frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t = 1}^\tau \mathrm{diag}\left[ 1 - \left(\boldsymbol{h}^{(t)}\right)^2 \right] \left( \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \right) {\boldsymbol{h}^{(t - 1)}}^\intercal }$$

The equation above leaves us with the term $\nabla_{\boldsymbol{h}^{(t)}}\mathcal{L}$, which we calculate next. Note that the backpropagation into $\boldsymbol{h}^{(t)}$ has sources from both $\boldsymbol{o}^{(t)}$ and $\boldsymbol{h}^{(t + 1)}$. Its gradient, therefore, is given by

$$\begin{aligned} \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} &= \left( \frac{\partial \boldsymbol{o}^{(t)}}{\partial \boldsymbol{h}^{(t)}} \right)^\intercal \nabla_{\boldsymbol{o}^{(t)}}\mathcal{L} + \left( \frac{\partial \boldsymbol{h}^{(t + 1)}}{\partial \boldsymbol{h}^{(t)}} \right)^\intercal \nabla_{\boldsymbol{h}^{(t + 1)}}\mathcal{L} \\ &= \left( \boldsymbol{W_{yh}} \right)^\intercal \nabla_{\boldsymbol{o}^{(t)}}\mathcal{L} + \left( \mathrm{diag}\left[ 1 - (\boldsymbol{h}^{(t + 1)})^2 \right] \boldsymbol{W_{hh}} \right)^\intercal \nabla_{\boldsymbol{h}^{(t + 1)}}\mathcal{L} \\ &= \left( \boldsymbol{W_{yh}} \right)^\intercal \nabla_{\boldsymbol{o}^{(t)}}\mathcal{L} + \boldsymbol{W_{hh}}^\intercal \nabla_{\boldsymbol{h}^{(t + 1)}}\mathcal{L} \left( \mathrm{diag}\left[ 1 - (\boldsymbol{h}^{(t + 1)})^2 \right] \right) \end{aligned}$$

$$\color{green} \boxed{ \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} = \left( \boldsymbol{W_{yh}} \right)^\intercal \nabla_{\boldsymbol{o}^{(t)}}\mathcal{L} + \boldsymbol{W_{hh}}^\intercal \nabla_{\boldsymbol{h}^{(t + 1)}}\mathcal{L} \left( \mathrm{diag}\left[ 1 - (\boldsymbol{h}^{(t + 1)})^2 \right] \right) }$$

Note that the second term $\boldsymbol{W_{hh}}^\intercal \, diag\left[ 1 - (\boldsymbol{h}^{(t + 1)})^2 \right] \nabla_{\boldsymbol{h}^{(t + 1)}}\mathcal{L}$ vanishes at the first step of the backward pass, because at the last time step $t = \tau$ of the unrolled RNN there is no later hidden state for gradient to flow back from.

We have now derived the backpropagation rule for $W_{hh}$. Recall the five sets of trainable parameters:

  1. $W_{xh}$
  2. $W_{hh}$
  3. $W_{yh}$
  4. $b_h$
  5. $b_o$

Let’s tackle the remaining $\frac{\partial \mathcal{L}}{\partial W_{xh}}$ and $\frac{\partial \mathcal{L}}{\partial b_h}$:

$$
\begin{align}
\frac{\partial \mathcal{L}}{\partial W_{xh}} &= \sum_{t = 1}^\tau \sum_{i = 1}^n \frac{\partial \mathcal{L}}{\partial h_i^{(t)}} \frac{\partial h_i^{(t)}}{\partial W_{xh}} \\
&= \sum_{t = 1}^\tau \left( \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{W_{xh}}} \right)^\intercal \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \\
&= \sum_{t = 1}^\tau diag\left[ 1 - (\boldsymbol{h}^{(t)})^2 \right] \left( \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \right) {\boldsymbol{x}^{(t)}}^\intercal
\end{align}
$$

$$
\color{green} \boxed{ \frac{\partial \mathcal{L}}{\partial W_{xh}} = \sum_{t = 1}^\tau diag\left[ 1 - (\boldsymbol{h}^{(t)})^2 \right] \left( \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \right) {\boldsymbol{x}^{(t)}}^\intercal }
$$

This rule mirrors the one for $W_{hh}$, because $\boldsymbol{x}^{(t)}$ enters the $\tanh$ in exactly the same way $\boldsymbol{h}^{(t - 1)}$ does.

$$
\begin{align}
\frac{\partial \mathcal{L}}{\partial b_h} &= \sum_{t = 1}^\tau \sum_{i = 1}^n \frac{\partial \mathcal{L}}{\partial h_i^{(t)}} \frac{\partial h_i^{(t)}}{\partial b_h} \\
&= \sum_{t = 1}^\tau \left( \frac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{b_h}} \right)^\intercal \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \\
&= \sum_{t = 1}^\tau diag\left[ 1 - (\boldsymbol{h}^{(t)})^2 \right] \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L}
\end{align}
$$

$$
\color{green} \boxed{ \frac{\partial \mathcal{L}}{\partial b_h} = \sum_{t = 1}^\tau diag\left[ 1 - (\boldsymbol{h}^{(t)})^2 \right] \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} }
$$

This concludes our backpropagation rules for training the RNN:

$$
\color{green} \boxed{ \begin{align*}
& \frac{\partial \mathcal{L}}{\partial W_{xh}} = \sum_{t = 1}^\tau diag\left[ 1 - (\boldsymbol{h}^{(t)})^2 \right] \left( \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \right) {\boldsymbol{x}^{(t)}}^\intercal \\ \\
& \frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_{t = 1}^\tau diag\left[ 1 - \left(\boldsymbol{h}^{(t)}\right)^2 \right] \left( \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \right) {\boldsymbol{h}^{(t - 1)}}^\intercal \\ \\
& \frac{\partial \mathcal{L}}{\partial W_{yh}} = \sum_{t = 1}^\tau \left( \boldsymbol{\sigma} - \boldsymbol{p} \right) {\boldsymbol{h}^{(t)}}^\intercal \\ \\
& \frac{\partial \mathcal{L}}{\partial b_h} = \sum_{t = 1}^\tau diag\left[ 1 - (\boldsymbol{h}^{(t)})^2 \right] \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \\ \\
& \frac{\partial \mathcal{L}}{\partial b_o} = \sum_{t = 1}^\tau \left( \boldsymbol{\sigma} - \boldsymbol{p} \right)
\end{align*} }
$$

According to page 91 of *Machine Learning* by Tom M. Mitchell (1997), the amount of update in direction $i$ is given by

$$
\Delta{w_i} = -\eta\frac{\partial E}{\partial w_i}
$$

The update rules for training the RNN with a learning rate of $\eta$, therefore, are:

$$
\color{green} \boxed{ \begin{align*}
& \Delta W_{xh} = -\eta\sum_{t = 1}^\tau diag\left[ 1 - (\boldsymbol{h}^{(t)})^2 \right] \left( \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \right) {\boldsymbol{x}^{(t)}}^\intercal \\ \\
& \Delta W_{hh} = -\eta\sum_{t = 1}^\tau diag\left[ 1 - \left(\boldsymbol{h}^{(t)}\right)^2 \right] \left( \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \right) {\boldsymbol{h}^{(t - 1)}}^\intercal \\ \\
& \Delta W_{yh} = -\eta\sum_{t = 1}^\tau \left( \boldsymbol{\sigma} - \boldsymbol{p} \right) {\boldsymbol{h}^{(t)}}^\intercal \\ \\
& \Delta b_h = -\eta\sum_{t = 1}^\tau diag\left[ 1 - (\boldsymbol{h}^{(t)})^2 \right] \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} \\ \\
& \Delta b_o = -\eta\sum_{t = 1}^\tau \left( \boldsymbol{\sigma} - \boldsymbol{p} \right)
\end{align*} }
$$

where

$$
\color{green} \boxed{ \nabla_{\boldsymbol{h}^{(t)}}\mathcal{L} = \left( \boldsymbol{W_{yh}} \right)^\intercal \nabla_{\boldsymbol{o}^{(t)}}\mathcal{L} + \boldsymbol{W_{hh}}^\intercal \, diag\left[ 1 - (\boldsymbol{h}^{(t + 1)})^2 \right] \nabla_{\boldsymbol{h}^{(t + 1)}}\mathcal{L} }
$$
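To make these rules concrete, here is a minimal NumPy sketch of backpropagation through time for the tanh RNN with softmax outputs, applying the boxed gradients and update rules above. Everything in it (the variable names, the toy sizes, and the random data) is an assumption made for illustration rather than a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration: input dim d, hidden dim n, output dim k, length tau
d, n, k, tau = 3, 4, 5, 6
eta = 0.1  # learning rate

# Trainable parameters
W_xh = rng.normal(scale=0.1, size=(n, d))
W_hh = rng.normal(scale=0.1, size=(n, n))
W_yh = rng.normal(scale=0.1, size=(k, n))
b_h = np.zeros(n)
b_o = np.zeros(k)

# Toy input sequence and target word indices
xs = rng.normal(size=(tau, d))
targets = rng.integers(0, k, size=tau)

# Forward pass: h^(t) = tanh(W_xh x^(t) + W_hh h^(t-1) + b_h), o^(t) = W_yh h^(t) + b_o
hs = np.zeros((tau + 1, n))     # hs[0] is h^(0) = 0; hs[t] is h^(t)
sigma = np.zeros((tau, k))      # softmax outputs (sigma in the text)
p = np.zeros((tau, k))          # one-hot targets (p in the text)
for t in range(tau):
    hs[t + 1] = np.tanh(W_xh @ xs[t] + W_hh @ hs[t] + b_h)
    o = W_yh @ hs[t + 1] + b_o
    e = np.exp(o - o.max())
    sigma[t] = e / e.sum()
    p[t, targets[t]] = 1.0

# Backward pass: run the boxed recursion for grad_{h^(t)} L from t = tau down to 1
dW_xh, dW_hh, dW_yh = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_yh)
db_h, db_o = np.zeros_like(b_h), np.zeros_like(b_o)
grad_h_next = np.zeros(n)       # grad_{h^(t+1)} L; unused at t = tau (no later hidden state)
for t in reversed(range(tau)):  # code index t corresponds to time step t + 1 in the math
    grad_o = sigma[t] - p[t]    # grad_{o^(t)} L for softmax + cross-entropy loss
    # grad_{h^(t)} L = W_yh^T grad_{o^(t)} L + W_hh^T diag[1 - (h^(t+1))^2] grad_{h^(t+1)} L
    grad_h = W_yh.T @ grad_o
    if t + 2 <= tau:            # the second term exists only if h^(t+1) exists
        grad_h += W_hh.T @ ((1 - hs[t + 2] ** 2) * grad_h_next)
    dtanh = 1 - hs[t + 1] ** 2  # diag[1 - (h^(t))^2], applied elementwise
    dW_yh += np.outer(grad_o, hs[t + 1])          # (sigma - p) h^(t)^T
    db_o += grad_o
    dW_hh += np.outer(dtanh * grad_h, hs[t])      # diag[...] (grad_h L) h^(t-1)^T
    dW_xh += np.outer(dtanh * grad_h, xs[t])      # diag[...] (grad_h L) x^(t)^T
    db_h += dtanh * grad_h
    grad_h_next = grad_h

# Update rules: Delta w = -eta * dL/dw
for param, grad in [(W_xh, dW_xh), (W_hh, dW_hh), (W_yh, dW_yh), (b_h, db_h), (b_o, db_o)]:
    param -= eta * grad
```

Practical implementations usually also clip these gradients, since the repeated multiplication by $\boldsymbol{W_{hh}}^\intercal \, diag[\cdot]$ in the recursion can make them vanish or explode over long sequences.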

Let’s Pay Attention Now#

The context vector turned out to be a bottleneck for these types of models. It made it challenging for the models to deal with long sentences. A solution was proposed in Bahdanau et al., 2014 and Luong et al., 2015. These papers introduced and refined a technique called “Attention”, which greatly improved the quality of machine translation systems. Attention allows the model to focus on the relevant parts of the input sequence as needed.

At time step 7, the attention mechanism enables the decoder to focus on the word “étudiant” (“student” in French) before it generates the English translation. This ability to amplify the signal from the relevant part of the input sequence makes attention models produce better results than models without attention.

Let’s continue looking at attention models at this high level of abstraction. An attention model differs from a classic sequence-to-sequence model in 2 main ways:

  1. The encoder passes a lot more data to the decoder. Instead of passing the last hidden state of the encoding stage, the encoder passes all the hidden states to the decoder:

  2. An attention decoder does an extra step before producing its output. In order to focus on the parts of the input that are relevant to this decoding time step, the decoder does the following:

    1. Look at the set of encoder hidden states it received - each encoder hidden state is most associated with a certain word in the input sentence
    2. Give each hidden state a score (let’s ignore how the scoring is done for now)
    3. Multiply each hidden state by its softmaxed score, thus amplifying hidden states with high scores, and drowning out hidden states with low scores (a minimal sketch of this scoring-and-weighting step follows the note below)

NOTE

Note that the scoring exercise is done at each time step on the decoder side.
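As a concrete illustration of the scoring-and-weighting steps above, here is a minimal NumPy sketch that uses a plain dot-product score; the function name, array shapes, and toy data are assumptions for the example (the papers above explore several scoring functions, including learned ones).

```python
import numpy as np

def attention_context(decoder_hidden, encoder_hiddens):
    """Score each encoder hidden state against the current decoder hidden
    state, softmax the scores, and return the weighted sum (context vector).

    decoder_hidden:  (n,)         current decoder hidden state
    encoder_hiddens: (src_len, n) one hidden state per source word
    """
    # 1. Score each encoder hidden state (dot-product scoring for simplicity)
    scores = encoder_hiddens @ decoder_hidden        # (src_len,)
    # 2. Softmax the scores so they sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # 3. Weighted sum: amplify high-scoring states, drown out low-scoring ones
    context = weights @ encoder_hiddens              # (n,)
    return context, weights

# Toy example: 4 source words, hidden size 8 (numbers are arbitrary)
rng = np.random.default_rng(0)
enc = rng.normal(size=(4, 8))
dec = rng.normal(size=8)
ctx, w = attention_context(dec, enc)
print(w)  # attention weights over the 4 source words; they sum to 1
```

Nothing here depends on the decoding time step except `decoder_hidden`, which is why the scoring exercise has to be repeated at every step on the decoder side.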

Let us now bring the whole thing together in the following visualization and look at how the attention process works:

  1. The attention decoder RNN takes in the embedding of the <END> token, and an initial decoder hidden state.
  2. The RNN processes its inputs, producing an output and a new hidden state vector (h4h_4). The output is discarded.
  3. Attention Step: We use the encoder hidden states and the h4h_4 vector to calculate a context vector (C4C_4) for this time step.
  4. We concatenate h4h_4 and C4C_4 into one vector.
  5. We pass this vector through a feedforward neural network (one trained jointly with the model).
  6. The output of the feedforward neural network indicates the output word of this time step.
  7. Repeat for the next time steps (a minimal sketch of one such decoding step follows).
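Putting the steps above into code, the following sketch runs a single attention-decoder time step. The cell, the weight matrices, and the toy sizes are simplifying assumptions (a plain tanh cell stands in for the RNN, and `W_ff` stands in for the jointly trained feedforward layer); it traces the steps rather than reproducing the exact architecture from the papers.

```python
import numpy as np

rng = np.random.default_rng(1)
n, vocab = 8, 10  # hidden size and output vocabulary size (toy values)

# Stand-in weights; in a real model these are all learned jointly
W_cell = rng.normal(scale=0.1, size=(n, 2 * n))    # simple tanh cell over [embedding; hidden]
W_ff = rng.normal(scale=0.1, size=(vocab, 2 * n))  # feedforward layer over [hidden; context]

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def decode_step(prev_embedding, prev_hidden, encoder_hiddens):
    # 1-2. Run the RNN cell on the previous token embedding and hidden state
    h = np.tanh(W_cell @ np.concatenate([prev_embedding, prev_hidden]))
    # 3. Attention step: score encoder hidden states against h, softmax, weighted sum
    weights = softmax(encoder_hiddens @ h)
    context = weights @ encoder_hiddens
    # 4-6. Concatenate h and the context vector, pass through the feedforward layer
    logits = W_ff @ np.concatenate([h, context])
    return softmax(logits), h  # word distribution for this step, new hidden state

# 7. Repeat for later time steps, feeding back the predicted word's embedding
enc = rng.normal(size=(5, n))  # 5 source positions (toy encoder outputs)
probs, h = decode_step(rng.normal(size=n), np.zeros(n), enc)
print(probs.argmax())          # index of the predicted output word
```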

This is another way to look at which part of the input sentence we’re paying attention to at each decoding step:

Note that the model isn’t just mindlessly aligning the first word of the output with the first word of the input. It actually learned during the training phase how to align words in that language pair (French and English in our example). An example of how precise this mechanism can be comes from the attention papers listed above:

You can see how the model paid attention correctly when outputting “European Economic Area”. In French, the order of these words is reversed (“européenne économique zone”) as compared to English. Every other word in the sentence is in a similar order.

The Transformer#

In the previous section, we looked at attention - a ubiquitous method in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this section, we will look at the Transformer - a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model on specific tasks. The biggest benefit, however, comes from how the Transformer lends itself to parallelization. It is in fact Google Cloud’s recommendation to use the Transformer as a reference model for their Cloud TPU offering. So let’s try to break the model apart and look at how it functions.

TIP

Through its visually educational nature and over 250 custom-made figures, Hands-On Large Language Models expands on this section thoroughly in its Chapter 3. Make sure to check out its supplemental material (code examples, exercises, etc.) as well.

A High-Level Look#

Let’s begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language, and output its translation in another.

Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.

The encoding component is a stack of encoders (in the original paper Attention Is All You Need, six encoders and six decoders - 12 layers in total - were stacked on top of each other; there’s nothing magical about the number six, and one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:

The encoder’s inputs first flow through a self-attention layer - a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in this post.

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.

The decoder has both of those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence.
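To connect the two sub-layers, here is a minimal NumPy sketch of one encoder layer: single-head scaled dot-product self-attention followed by a position-wise feed-forward network. The weight names and toy sizes are assumptions, and the residual connections and layer normalization of the real architecture are omitted to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ff, seq_len = 8, 32, 4  # toy sizes for illustration

# Stand-in learned weights for one encoder layer (a single attention head)
W_q = rng.normal(scale=0.1, size=(d_model, d_model))
W_k = rng.normal(scale=0.1, size=(d_model, d_model))
W_v = rng.normal(scale=0.1, size=(d_model, d_model))
W_1 = rng.normal(scale=0.1, size=(d_model, d_ff))
W_2 = rng.normal(scale=0.1, size=(d_ff, d_model))

def softmax_rows(m):
    e = np.exp(m - m.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_layer(x):
    """x: (seq_len, d_model) embeddings; returns (seq_len, d_model)."""
    # Sub-layer 1: self-attention -- every position attends to every position
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax_rows(q @ k.T / np.sqrt(d_model)) @ v
    # Sub-layer 2: position-wise feed-forward net, applied independently to each position
    return np.maximum(0, attn @ W_1) @ W_2

x = rng.normal(size=(seq_len, d_model))  # toy "word embeddings" for a 4-token sentence
print(encoder_layer(x).shape)            # (4, 8): one output vector per input position
```

Stacking six such layers (each with its own weights) gives the encoding component described above. Because the attention and feed-forward steps are plain matrix products over all positions at once, the whole layer processes the sentence in parallel, which is where the parallelization benefit mentioned earlier comes from.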
