The Universal Approximation Theorem - The Mathematical Guarantee of Learnability for Neural Networks
2025-06-05
2026-01-30

I was reading what was called the definitive introductory text on RNNs and noticed a very intriguing piece of text:

But similar to universal approximation theorems for neural nets …

I dug into this “Universal Approximation Theorem” (UAT) a little, and it surprisingly turned out to address one of the three fundamental pillars of learning theory:

  1. Representation (what the UAT solves),
  2. Optimization, and
  3. Generalization.

While there are several variations of the UAT (some focusing on width, others on depth, and different activation functions), the most cited and “classical” formal statement comes from George Cybenko’s 1989 paper.

TIP

The statement was formally published in:

  • Author: George Cybenko
  • Title: “Approximation by superpositions of a sigmoidal function”
  • Journal: Mathematics of Control, Signals, and Systems (MCSS)
  • Volume/Issue: Vol. 2, No. 4
  • Pages: 303–314
  • Year: 1989

Here is the formal statement derived from that paper:

Let $I_n$ denote the $n$-dimensional unit hypercube $[0, 1]^n$. Let $C(I_n)$ denote the space of continuous functions on $I_n$, equipped with the uniform norm (meaning we measure the maximum distance between functions). Let $\sigma$ be a sigmoidal function, meaning it is continuous and satisfies:

$$\lim_{t \to -\infty} \sigma(t) = 0 \quad \text{and} \quad \lim_{t \to +\infty} \sigma(t) = 1$$

The Theorem: For any continuous function $f \in C(I_n)$ and any error tolerance $\epsilon > 0$, there exists a finite sum of the form:

$$G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(y_j^T x + \theta_j)$$

such that:

$$|G(x) - f(x)| < \epsilon \quad \text{for all } x \in I_n$$

where

  • $N$ is the number of neurons in the hidden layer (the “width”). The theorem says a finite $N$ exists, but does not say how large it must be.
  • $\alpha_j$ is the weight connecting the $j$-th hidden neuron to the output.
  • $y_j$ is the weight vector connecting the inputs to the $j$-th hidden neuron.
  • $\theta_j$ is the bias term for the $j$-th hidden neuron.
  • $y_j^T x + \theta_j$ is the standard affine transformation (a linear combination plus bias) of the inputs before the activation function is applied.

In plain English, it says a feedforward network with a single hidden layer and a finite number of neurons can approximate any continuous function to any desired degree of accuracy, provided you choose the right weights and have enough neurons.
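
A standard way to build intuition for why such sums are expressive (this construction is only an illustration added here; Cybenko’s actual proof uses functional-analytic arguments, not this recipe) is to note that the difference of two steep, shifted sigmoids approximates an indicator “bump”:

$$\sigma\big(w(x - a)\big) - \sigma\big(w(x - b)\big) \approx \mathbf{1}_{[a,\,b]}(x) \quad \text{for large } w \text{ and } a < b,$$

and a continuous function on a compact set can be approximated by a weighted sum of such bumps.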

This form of the UAT (particularly the bound $|G(x) - f(x)| < \epsilon$) signals a strong connection to approximation theory. In fact, the UAT evolved out of classical approximation theory in mathematics to become a foundational result for neural networks in the late 1980s.

Approximation Theory is the foundation of Machine Learning, and its usefulness is brought to life by the advancement of contemporary computing power. Approximation theory tells us that an approximating function exists (as a mathematical theorem), but it does not indicate how to reach that approximation. An artificial neural network, trained on large amounts of data, is what actually reaches (an estimate of) that function. In that sense, approximation theory is the proof of why AI and Machine Learning work.

K-Armed Bandit Problem: Reinforcement Learning as an Example of Approximation

Consider the following learning problem. We are faced repeatedly with a choice among $k$ different options, or actions. After each choice we receive a numerical reward drawn from a stationary probability distribution that depends on the action we selected. Our objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps. This is the original form of the $k$-armed bandit problem.

Mathematically, each of the $k$ actions has an expected or mean reward given that the action is selected; let us call this the value of that action. We denote the action selected on time step $t$ as $A_t$ and the corresponding reward as $R_t$. The value of an arbitrary action $a$, denoted $q_*(a)$, is the expected reward given that $a$ is selected:

$$q_*(a) = \mathbb{E}[R_t \mid A_t = a]$$

If we knew the value of each action, it would be trivial to solve the $k$-armed bandit problem: we would always select the action with the highest value. In reality, however, we do not know the action values with certainty, although we may have estimates. We denote the estimated value of action $a$ at time step $t$ as $Q_t(a)$. We would like $Q_t(a)$ to be close to $q_*(a)$.

If we maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions. When we select one of these actions, we say that we are exploiting our current knowledge of the values of the actions. If instead we select one of the non-greedy actions, then we say we are exploring, because this enables us to improve our estimate of the non-greedy action’s value. Exploitation is the right thing to do to maximize the expected reward on a single step, but exploration may produce greater total reward in the long run.

Since it is not possible to both explore and exploit with any single action selection, systematic methods are used to balance exploration and exploitation. This is one of the basic ideas behind reinforcement learning.
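
To make the balancing act concrete, here is a minimal sketch of an $\epsilon$-greedy agent with sample-average value estimates $Q_t(a)$, written with numpy to match the rest of the post. The specific setup (10 arms, Gaussian rewards, $\epsilon = 0.1$, 1000 steps) is an illustrative assumption, not something prescribed above.

import numpy as np

rng = np.random.default_rng(0)

k = 10                                # number of arms
true_values = rng.normal(0, 1, k)     # q_*(a), unknown to the agent
epsilon = 0.1                         # probability of exploring
steps = 1000

Q = np.zeros(k)                       # estimated values Q_t(a)
N = np.zeros(k)                       # how many times each action was selected
total_reward = 0.0

for t in range(steps):
    # Exploit (greedy) with probability 1 - epsilon, explore otherwise
    if rng.random() < epsilon:
        a = int(rng.integers(k))      # explore: pick a random action
    else:
        a = int(np.argmax(Q))         # exploit: pick the greedy action
    reward = rng.normal(true_values[a], 1.0)   # sample a reward for action a
    total_reward += reward
    # Incremental sample-average update: Q <- Q + (R - Q) / N
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]

print("Best arm (true):", int(np.argmax(true_values)))
print("Best arm (estimated):", int(np.argmax(Q)))
print("Average reward:", total_reward / steps)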

Approximation Theory and Methods by M.J.D. Powell is a classic textbook that provides a thorough introduction to approximating functions with simpler ones, such as polynomials, focusing on both theory and practical computation; it is suitable for advanced undergraduates and postgraduates in mathematics, science, and engineering. The book covers key topics such as polynomial and rational approximation, splines, interpolation, error analysis, and convergence, with a clear, readable style that emphasizes computational aspects and includes exercises.

To see how this concept plays out in practice, the simple Python script below uses numpy to manually construct a small neural network (a sum of sigmoids) that approximates a sine wave. It uses basic gradient descent to find the weights, demonstrating that a configuration of weights exists that represents the function.

test.py
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)


def target_function(x):
    """The function we want to approximate: a simple sine wave."""
    return np.sin(x)


class UniversalApproximator:
    def __init__(self, num_neurons):
        self.num_neurons = num_neurons
        # Initialize weights randomly
        # Input (1) -> Hidden (N)
        self.weights_input = np.random.uniform(-1, 1, (1, num_neurons))
        # Biases start at zero
        self.bias_hidden = np.zeros((1, num_neurons))
        # Hidden (N) -> Output (1)
        self.weights_output = np.random.uniform(-1, 1, (num_neurons, 1))
        self.bias_output = np.zeros((1, 1))

    def forward(self, x):
        """Compute the network output: a weighted sum of sigmoids."""
        self.z = np.dot(x, self.weights_input) + self.bias_hidden
        self.a = sigmoid(self.z)  # hidden activation (sigmoid)
        output = np.dot(self.a, self.weights_output) + self.bias_output
        return output

    def train(self, x, y, epochs=50000, learning_rate=0.01):
        """Simple gradient descent to find a set of weights that works."""
        for i in range(epochs):
            # Forward pass
            output = self.forward(x)
            # Error term used for the Mean Squared Error loss
            error = output - y
            # Print progress every 5000 epochs
            if i % 5000 == 0:
                mse = np.mean(np.square(error))
                print(f"Epoch {i}: MSE Loss = {mse:.6f}")
            # Backpropagation (calculating gradients)
            # 1. Output layer gradients
            d_output = error  # derivative of the squared error w.r.t. the output (simplified)
            d_weights_output = np.dot(self.a.T, d_output)
            d_bias_output = np.sum(d_output, axis=0, keepdims=True)
            # 2. Hidden layer gradients
            d_hidden_layer = np.dot(d_output, self.weights_output.T) * sigmoid_derivative(self.z)
            d_weights_input = np.dot(x.T, d_hidden_layer)
            d_bias_hidden = np.sum(d_hidden_layer, axis=0, keepdims=True)
            # 3. Update weights and biases
            self.weights_output -= learning_rate * d_weights_output
            self.bias_output -= learning_rate * d_bias_output
            self.weights_input -= learning_rate * d_weights_input
            self.bias_hidden -= learning_rate * d_bias_hidden


# --- Simulation ---
# 1. Prepare data: generate 100 points between 0 and 2*pi
X = np.linspace(0, 2 * np.pi, 100).reshape(-1, 1)
Y = target_function(X)

# 2. Initialize the approximator
# Try changing num_neurons to 5, 10, or 50 to see the difference.
nn = UniversalApproximator(num_neurons=10)

# 3. Optimize (find the weights)
print(f"Training with {nn.num_neurons} neurons...")
nn.train(X, Y)

# 4. Visualize the result
predicted = nn.forward(X)
final_loss = np.mean(np.square(Y - predicted))
print(f"Final MSE Loss: {final_loss:.6f}")

plt.figure(figsize=(10, 6))
plt.plot(X, Y, label='True Function (Sine)', color='blue', linewidth=2)
plt.plot(X, predicted, label='Neural Net Approximation', color='red', linestyle='--', linewidth=2)
plt.title(f'Universal Approximation Theorem Demo\nNeurons: {nn.num_neurons}, Loss: {final_loss:.5f}')
plt.legend()
plt.grid(True)
plt.show()

Neural Networks

The area of Neural Networks was originally inspired primarily by the goal of modeling biological neural systems, but it has since diverged and become a matter of engineering: achieving good results on Machine Learning tasks. In that sense, Neural Networks, and much of AI in general, are essentially part of the discipline of Machine Learning.

Biological Motivation and Connections

Historically, digital computers such as the von Neumann model operate via the execution of explicit instructions with access to memory by a number of processors. Some neural networks, on the other hand, originated from efforts to model information processing in biological systems through the framework of connectionism. Unlike the von Neumann model, connectionist computing does not separate memory and processing.

Basic Principles of Connectionism

The central connectionist principle is that mental phenomena can be described by interconnected networks of simple and often uniform units. The form of the connections and the units can vary from model to model. For example, units in the network could represent neurons and the connections could represent synapses, as in the human brain.

  • Activation function: Internal states of any network change over time as neurons send signals to a succeeding layer of neurons (in a feedforward network) or to a previous layer (in a recurrent network). The activation function defines under what circumstances a neuron will send a signal outward to other neurons. It can be, for example, a probability function whose output describes the probability of the neuron firing.

  • Memory and learning: Neural networks follow two basic principles:

    1. Any mental state can be described as an $n$-dimensional vector of numeric activation values over the neural units in a network.
    2. Memory and learning are created by modifying the ‘weights’ of the connections between neural units, generally represented as an $n \times m$ matrix. The weights are adjusted according to some learning rule or algorithm, as the sketch after this list illustrates.
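
As a toy illustration of these two principles (everything here, including the delta-rule update and the made-up numbers, is an illustrative assumption rather than a model described above): the state of a layer is a vector of activations computed from a weight matrix, and learning nudges that matrix toward producing a desired state.

import numpy as np

rng = np.random.default_rng(1)

n_in, n_out = 4, 3
W = rng.normal(0, 0.1, (n_out, n_in))    # connection weights (an n x m matrix)

x = np.array([1.0, 0.0, 0.5, -1.0])      # input "mental state" as a vector
target = np.array([1.0, 0.0, 0.0])       # desired activation of the next layer
lr = 0.1                                  # learning rate

for _ in range(100):
    activation = W @ x                    # principle 1: state = vector of activations
    error = target - activation
    W += lr * np.outer(error, x)          # principle 2: learning = adjusting the weights (delta rule)

print(np.round(W @ x, 3))                 # close to the target after training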

The basic computational unit of the brain is a neuron. Approximately 86 billion neurons can be found in the human nervous system and they are connected with approximately $10^{14}$–$10^{15}$ synapses. The diagram below shows a cartoon drawing of a biological neuron (left) and a common mathematical model (right).

Each neuron receives input signals from its dendrites and produces output signals along its (single) axon. The axon eventually branches out and connects via synapses to dendrites of other neurons.

In the computational model of a neuron, the signals that travel along the axons (e.g. $x_0$) interact multiplicatively (e.g. $w_0 x_0$) with the dendrites of the other neuron based on the synaptic strength at that synapse (e.g. $w_0$). The idea is that the synaptic strengths (the weights $w$) are learnable and control the strength of influence (and its direction: excitatory (positive weight) or inhibitory (negative weight)) of one neuron on another. In the basic model, the dendrites carry the signal to the cell body where they all get summed. If the final sum is above a certain threshold, the neuron can fire, sending a spike along its axon. In the computational model, we assume that the precise timings of the spikes do not matter, and that only the frequency of the firing communicates information. Based on this rate code interpretation, we model the firing rate of the neuron with an activation function $f$, which represents the frequency of the spikes along the axon. Historically, a common choice of activation function is the sigmoid function $\sigma$, since it takes a real-valued input (the signal strength after the sum) and squashes it to the range between 0 and 1. An example of code for forward-propagating a single neuron might look as follows:

import math
import numpy as np

class Neuron(object):
    # ...
    def forward(self, inputs):
        """ assume inputs and weights are 1-D numpy arrays and bias is a number """
        cell_body_sum = np.sum(inputs * self.weights) + self.bias
        firing_rate = 1.0 / (1.0 + math.exp(-cell_body_sum))  # sigmoid activation function
        return firing_rate
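
For concreteness, here is a hypothetical usage of this class; since the constructor is elided above, the weights and bias (whose values are made up) are assigned by hand:

n = Neuron()
n.weights = np.array([0.5, -0.3, 0.8])         # made-up synaptic strengths
n.bias = 0.1
print(n.forward(np.array([1.0, 2.0, 3.0])))    # a firing rate between 0 and 1 (about 0.92 here)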

In other words, each neuron performs a dot product of the input with its weights, adds the bias, and applies the non-linearity (or activation function), in this case the sigmoid:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Coarse model

It’s important to stress that this model of a biological neuron is very coarse. For example, there are many different types of neurons, each with different properties. The dendrites in biological neurons perform complex nonlinear computations. The synapses are not just a single weight, they are a complex non-linear dynamical system. The exact timing of the output spikes in many systems is known to be important, suggesting that the rate code approximation may not hold. Due to all these and many other simplifications, be prepared to hear groaning sounds from anyone with some neuroscience background if we draw analogies between Neural Networks and real brains. See this review if you are interested.

Neural Networks are modeled as collections of neurons that are connected in an acyclic graph. In other words, the outputs of some neurons can become inputs to other neurons. Cycles are not allowed since that would imply an infinite loop in the forward pass of a network. Instead of amorphous blobs of connected neurons, Neural Network models are often organized into distinct layers of neurons.

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (ConvNets or CNNs) are a category of Neural Networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs apart from powering vision in robots and self-driving cars.

In the figure above, a ConvNet is able to recognize scenes and the system is able to suggest relevant captions (“a soccer player is kicking a soccer ball”), while the figure below shows an example of ConvNets being used for recognizing everyday objects, humans and animals. Lately, ConvNets have been effective in several Natural Language Processing tasks (such as sentence classification) as well.

Essentially, every image can be represented as a matrix of pixel values.

Channel is a conventional term used to refer to a certain component of an image. An image from a standard digital camera will have three channels - red, green and blue - we can imagine those as three 2d-matrices stacked over each other (one for each color), each having pixel values in the range 0 to 255. A grayscale image, on the other hand, has just one channel. We will only consider grayscale images, so we will have a single 2d matrix representing an image. The value of each pixel in the matrix will range from 0 to 255 - zero indicating black and 255 indicating white.

Convolution Step

ConvNets derive their name from the “convolution” operator. The primary purpose of convolution in case of a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. We will not go into the mathematical details of Convolution here, but will try to understand how it works over images.

As we discussed above, every image can be considered as a matrix of pixel values. Consider a 5 x 5 image whose pixel values are only 0 and 1 (note that for a grayscale image, pixel values range from 0 to 255; the green matrix below is a special case where pixel values are only 0 and 1):

Also, consider another 3 x 3 matrix as shown below:

Then, the Convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as shown in the animation below:

We slide the orange matrix over our original image (green) by 1 pixel (also called the stride) and for every position we compute an element-wise multiplication (between the two matrices) and add the products to get a single integer, which forms one element of the output matrix (pink). Note that the 3×3 matrix sees only a part of the input image in each stride. In CNN terminology, the 3×3 matrix is called a filter, kernel, or feature detector, and the matrix formed by sliding the filter over the image and computing the dot product is called the Convolved Feature, Activation Map, or Feature Map. It is important to note that filters act as feature detectors on the original input image.
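
For readers who prefer code to animations, here is a minimal numpy sketch of that same sliding-window computation. The 5 x 5 binary image and the 3 x 3 filter values are illustrative examples, not taken from the animation above:

import numpy as np

image = np.array([            # a 5 x 5 binary "image"
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([           # a 3 x 3 filter / feature detector
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

h = image.shape[0] - kernel.shape[0] + 1   # output height (stride 1, no padding)
w = image.shape[1] - kernel.shape[1] + 1   # output width
feature_map = np.zeros((h, w), dtype=int)

for i in range(h):
    for j in range(w):
        patch = image[i:i + 3, j:j + 3]             # the part of the image the filter "sees"
        feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum

print(feature_map)   # the 3 x 3 Convolved Feature / Feature Map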

In practice, a CNN learns the values of these filters on its own during the training process (although we still need to specify parameters such as number of filters, filter size, architecture of the network etc. before the training process). The more filters we have, the more image features get extracted and the better our network becomes at recognizing patterns in unseen images.

The size of the Feature Map is controlled by 3 parameters that we need to decide before the convolution step is performed:

  1. Depth: Depth corresponds to the number of filters we use for the convolution operation. In the network shown below, we are performing convolution of the original boat image using three distinct filters, thus producing three different feature maps as shown. We can think of these three feature maps as stacked 2d matrices.

  2. Stride: Stride is the number of pixels by which we slide our filter matrix over the input matrix. When the stride is 1 then we move the filters one pixel at a time. When the stride is 2, then the filters jump 2 pixels at a time as we slide them around. Having a larger stride will produce smaller feature maps.

  3. Zero-padding: Sometimes, it is convenient to pad the input matrix with zeros around the border, so that we can apply the filter to the bordering elements of our input image matrix. A nice feature of zero-padding is that it allows us to control the size of the feature maps (a formula relating these parameters to the output size is sketched just after this list). Adding zero-padding is also called wide convolution, and not using zero-padding would be a narrow convolution.
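
As a rule of thumb (added here, not part of the original text), these three parameters determine the feature map size: for input width $W$, filter size $F$, zero-padding $P$ and stride $S$, each spatial dimension of the output is

$$O = \frac{W - F + 2P}{S} + 1$$

For the 5 x 5 image and 3 x 3 filter above, with $P = 0$ and $S = 1$, this gives $O = (5 - 3 + 0)/1 + 1 = 3$, i.e. a 3 x 3 feature map.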

Pooling Step

Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum etc.

In case of Max Pooling, we define a spatial neighborhood (for example, a 2×2 window) and take the largest element from the rectified feature map within that window. Instead of taking the largest element we could also take the average (Average Pooling) or sum of all elements in that window. In practice, Max Pooling has been shown to work better.
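
As a quick numpy sketch (the 4 x 4 feature map values here are invented for illustration), max pooling with a 2×2 window and stride 2 keeps only the largest value in each window:

import numpy as np

feature_map = np.array([      # a rectified 4 x 4 feature map (made-up values)
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 5, 7],
    [1, 1, 3, 2],
])

# 2 x 2 max pooling with stride 2: split into non-overlapping 2 x 2 blocks, take each block's max
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # -> [[6 2]
                #     [2 7]]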

Pooling

  • makes the input representations (feature dimension) smaller and more manageable
  • reduces the number of parameters and computations in the network, therefore, controlling overfitting
  • makes the network invariant to small transformations, distortions and translations in the input image
  • helps us arrive at an almost scale invariant representation of our image. This is very powerful since we can detect objects in an image no matter where they are located

Architecture

A convolutional neural network consists of an input layer, hidden layers and an output layer. In a convolutional neural network, the hidden layers include one or more layers that perform convolutions, followed by other layers such as pooling layers and fully connected layers.

After several convolutional and max pooling layers, the final classification is done via fully connected layers. Neurons in a fully connected layer have connections to neurons in the previous layer. The input to this fully connected layer is a one-dimensional vector, which is the flattened output of the convolutional/pooling layers.

What does the flattening look like?

In a Convolutional Neural Network (CNN), the input to the fully connected network (FCN) is the output of the final convolutional layer or pooling layer. This output is typically a 3-dimensional tensor with height, width, and depth representing the extracted features. Flattening this 3D tensor essentially “unfolds” all the dimensions together, so the resulting one-dimensional vector has a size equal to the product of the original dimensions. For example, a $5 \times 5 \times 2$ tensor is flattened into a vector of size $5 \times 5 \times 2 = 50$, which is the number of neurons in the FCN input layer.
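
In numpy terms, a trivial sketch with an arbitrary tensor:

import numpy as np

features = np.arange(50).reshape(5, 5, 2)   # a 5 x 5 x 2 tensor of extracted features
flattened = features.reshape(-1)            # "unfold" all dimensions into a 1-D vector
print(flattened.shape)                      # (50,) -> 50 input neurons for the fully connected layers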

There are two aspects of this computation worth paying attention to:

  1. Location Invariance: Let’s say we want to classify whether or not there’s an elephant in an image. Because we are sliding our filters over the whole image we don’t really care where the elephant occurs. In practice, pooling also gives us invariance to translation, rotation and scaling.
  2. Compositionality: Each filter composes a local patch of lower-level features into higher-level representation. That’s why CNNs are so powerful in Computer Vision. It makes intuitive sense that you build edges from pixels, shapes from edges, and more complex objects from shapes.

Convolutional Neural Networks for NLP

Instead of image pixels, the input to most NLP tasks consists of sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word, though it could be a character. That is, each row is a vector that represents a word. Typically, these vectors are word embeddings (low-dimensional representations) like word2vec or GloVe, but they could also be one-hot vectors that index the word into a vocabulary. For a 10-word sentence using a 100-dimensional embedding we would have a 10×100 matrix as our input. That’s our “image”.

In vision, our filters slide over local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (words). Thus, the “width” of our filters is usually the same as the width of the input matrix. The height, or region size, may vary, but sliding windows over 2-5 words at a time is typical. Putting all the above together, a Convolutional Neural Network for NLP may look like this:

In the illustration of this Convolutional Neural Network (CNN) architecture for sentence classification above, we depict three filter region sizes: 2, 3 and 4, each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states.
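
To make the shapes concrete, here is a rough numpy sketch of that pipeline for a single sentence. The sentence length, embedding size, random filters and the ReLU non-linearity are all assumptions for illustration; a real implementation would learn the filters and add the softmax layer:

import numpy as np

rng = np.random.default_rng(0)

sent_len, emb_dim = 10, 100
sentence = rng.normal(size=(sent_len, emb_dim))   # the 10 x 100 sentence "image"

region_sizes = [2, 3, 4]       # filter heights (how many words each filter covers)
filters_per_size = 2

features = []
for h in region_sizes:
    for _ in range(filters_per_size):
        filt = rng.normal(size=(h, emb_dim))      # each filter spans the full embedding width
        # Slide over the sentence: one number per position -> a variable-length feature map
        fmap = np.array([np.sum(sentence[i:i + h] * filt) for i in range(sent_len - h + 1)])
        fmap = np.maximum(fmap, 0)                # ReLU (an assumption)
        features.append(fmap.max())               # 1-max pooling: keep only the largest value

feature_vector = np.array(features)               # 6 features for the penultimate layer
print(feature_vector.shape)                       # (6,)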

What about the nice intuitions we had for Computer Vision? Location invariance and local compositionality made intuitive sense for images, but not so much for NLP. We probably do care a lot about where in the sentence a word appears (except, perhaps, for Latin). Pixels close to each other are likely to be semantically related (part of the same object), but the same isn’t always true for words. In many languages, parts of phrases can be separated by several other words. The compositional aspect isn’t obvious either. Clearly, words compose in some ways, such as an adjective modifying a noun, but how exactly this works and what higher-level representations actually “mean” isn’t as obvious as in the Computer Vision case. Given all this, Recurrent Neural Networks make more intuitive sense: they resemble how we process language, or at least how we think we process language, reading sequentially from left to right.

A glaring limitation of Vanilla Neural Networks (and also Convolutional Networks) is that their API is too constrained: they accept a fixed-sized vector as input (e.g. an image) and produce a fixed-sized vector as output (e.g. probabilities of different classes). In addition, these models perform this mapping using a fixed number of computational steps (e.g. the number of layers in the model). The core reason recurrent nets are more exciting is that they allow us to operate over sequences of vectors: sequences in the input, the output, or, in the most general case, both.
