Gradient Checkpointing: Memory Optimization
For a simple feed-forward neural network with n layers, the computation graph for obtaining gradients looks as follows:

The activations of the neural network layers correspond to the nodes marked with an f. During the forward pass all these nodes are evaluated in order. The gradient of the loss with respect to the activations and parameters of these layers is indicated by the nodes marked with b. During the backward pass, all these nodes are evaluated in the reversed order. The results obtained for the f nodes are needed to compute the b nodes, and hence all f nodes are kept in memory after the forward pass. Only when backpropagation has progressed far enough to have computed all dependencies, or children, of an f node, can it be erased from memory. This means that the memory required by simple backprop grows linearly with the number of neural net layers n. We show the order in which these nodes are computed below. The purple shaded circles indicate which of the nodes need to be held in memory at any given time.
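To make the linear growth concrete, here is a minimal sketch (hypothetical layer sizes, not code from the original example): autograd keeps one saved activation per layer alive until backward() consumes it, so peak memory grows with the number of layers.

```python
import torch
import torch.nn as nn

# Toy stack of n identical layers (sizes are illustrative).
n = 8
layers = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(n)])

x = torch.randn(64, 1024, requires_grad=True)

# Plain backprop: autograd stores the input to every Linear layer
# (one activation per layer), so memory grows linearly in n.
h = x
for layer in layers:
    h = torch.relu(layer(h))

loss = h.sum()
loss.backward()  # consumes the n stored activations in reverse order
```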

Simple backpropagation as described above is optimal in terms of computation: it only computes each node once. However, if we are willing to recompute nodes we can potentially save a lot of memory. We might for instance simply recompute every node from the forward pass each time we need it. The order of execution, and the memory used, then look as follows:

With this strategy, the memory required to compute gradients in our graph is constant in the number of neural network layers n, which is optimal in terms of memory. However, note that the number of node evaluations now scales with n², whereas it previously scaled as n: each of the n nodes is recomputed on the order of n times. The computation graph thus becomes much slower to evaluate for deep networks, which makes this method impractical for use in deep learning.
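The quadratic cost of the recompute-everything strategy can be illustrated with a small sketch (a toy cost count only, not a working gradient implementation; all names are hypothetical): to obtain the activation after layer i we rerun the forward pass from the input, so roughly n(n+1)/2 layer evaluations are needed in total.

```python
import torch
import torch.nn as nn

n = 8
layers = nn.ModuleList([nn.Linear(256, 256) for _ in range(n)])
x0 = torch.randn(32, 256)

def activation_at(i):
    """Recompute the activation after layer i from scratch (O(i) work)."""
    h = x0
    for layer in layers[: i + 1]:
        h = torch.relu(layer(h))
    return h

# The backward pass visits the layers in reverse; each step recomputes
# its input activation from the very beginning, so the number of layer
# evaluations grows quadratically while only one activation is held
# in memory at a time.
with torch.no_grad():
    total_evals = 0
    for i in reversed(range(n)):
        _ = activation_at(i)   # would feed the gradient step for layer i
        total_evals += i + 1
print(total_evals)  # n*(n+1)/2, i.e. quadratic in n
```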
To strike a balance between memory and computation we need to come up with a strategy that allows nodes to be recomputed, but not too often. The strategy is to mark a subset of the neural net activations as checkpoint nodes.

For the simple feed-forward network in our example, the optimal choice is to mark every √n-th node as a checkpoint. This way, both the number of checkpoint nodes and the number of nodes in between checkpoints are on the order of √n, which means that the required memory now also scales with the square root of the number of layers in our network. Since every node is recomputed at most once, the additional computation required by this strategy is equivalent to a single forward pass through the network.
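For sequential models, PyTorch ships a helper that implements exactly this segment-based scheme: torch.utils.checkpoint.checkpoint_sequential splits the model into a chosen number of segments and stores only the segment boundaries, recomputing everything inside a segment during the backward pass. A minimal sketch (layer sizes and the √n segment count are illustrative; the use_reentrant flag assumes a recent PyTorch version):

```python
import math
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

n = 16
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(n)])
x = torch.randn(8, 1024, requires_grad=True)

# Keep roughly sqrt(n) checkpoints: only the segment boundaries are
# stored; activations inside each segment are recomputed on backward.
segments = int(math.sqrt(n))
out = checkpoint_sequential(model, segments, x, use_reentrant=False)
out.sum().backward()
```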
This technique fundamentally trades computational cycles for memory capacity.
- Memory Savings: By storing only about √n checkpoints (where n is the number of layers), the memory complexity drops from O(n) to approximately O(√n).
- Compute Penalty: Because the forward pass for the non-checkpointed layers is executed twice (once during the initial forward pass, and once when their activations are recomputed during the backward pass), training becomes slower. Typically, this results in a 20% to 30% increase in training time.
Code Example (PyTorch)
In PyTorch, we do not need to implement the recomputation logic manually. We can wrap specific modules (like Transformer blocks) with torch.utils.checkpoint.checkpoint. For example:
Tip: In the following implementation of a model using checkpointing, we define a custom module and wrap its execution to save memory.
```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class HeavyLayer(nn.Module):
    def __init__(self):
        super().__init__()
        # A large layer that produces heavy activations
        self.linear1 = nn.Linear(4096, 4096)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(4096, 4096)

    def forward(self, x):
        return self.linear2(self.relu(self.linear1(x)))

class CheckpointedModel(nn.Module):
    def __init__(self, num_layers=10):
        super().__init__()
        self.layers = nn.ModuleList([HeavyLayer() for _ in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            # Instead of layer(x), we use checkpoint(layer, x).
            # This prevents intermediate activations inside 'layer'
            # from being saved during the forward pass.
            x = checkpoint(layer, x, use_reentrant=False)
        return x

# Usage
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CheckpointedModel().to(device)
input_data = torch.randn(32, 4096, requires_grad=True).to(device)

# This forward pass uses significantly less VRAM than standard execution
output = model(input_data)
output.sum().backward()
```
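To confirm the savings empirically, one can compare the peak GPU memory of this checkpointed run against a plain forward/backward pass. A rough sketch, reusing model and input_data from above:

```python
# Compare peak activation memory on a GPU (sketch; CUDA only).
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
    output = model(input_data)
    output.sum().backward()
    print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
```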