Working with Real Data#

Machine Learning is about inferring values from data. Studying machine learning is therefore best done by experimenting with real-world data, not artificial datasets. Fortunately, there are thousands of open datasets to choose from, ranging across all sorts of domains. Here are a few places we can look:

Approximation Theory#

The purpose of studying Approximation Theory is to better understand the Universal Approximation Theorem, which characterizes the limits (or the unbounded potential) of what Neural Networks can learn in order to solve real-life problems. Approximation Theory is a foundation of Machine Learning, and its usefulness is brought to life by the advancement of contemporary computing power. For example, Approximation Theory tells us that an approximating function exists as a mathematical theorem, but it does not indicate how to reach that approximation. An Artificial Neural Network, trained on big data, reaches that approximation in practice. Approximation Theory is, in this sense, the proof of why AI and Machine Learning work.

K-Armed Bandit Problem: Reinforcement Learning as an Example of Approximation

Consider the following learning problem. We are faced repeatedly with a choice among k different options, or actions. After each choice we receive a numerical reward chosen from a stationary probability distribution that depends on the action we selected. Our objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps. This is the original form of the k-armed bandit problem.

Mathematically, each of the k actions has an expected or mean reward given that the action is selected; let us call this the value of that action. We denote the action selected on time step t as A_t and the corresponding reward as R_t. The value of an arbitrary action a, denoted q_*(a), is the expected reward given that a is selected:

q_*(a) = \mathbb{E}[R_t \mid A_t = a]

If we knew the value of each action, it would be trivial to solve the k-armed bandit problem: we would always select the action with the highest value. In reality, however, we do not know the action values with certainty, although we may have estimates. We denote the estimated value of action a at time step t as Q_t(a). We would like Q_t(a) to be close to q_*(a).

If we maintain estimates of the action values, then at any time step there is at least one action whose estimated value is greatest. We call these the greedy actions. When we select one of these actions, we say that we are exploiting our current knowledge of the values of the actions. If instead we select one of the non-greedy actions, then we say we are exploring, because this enables us to improve our estimate of the non-greedy action’s value. Exploitation is the right thing to do to maximize the expected reward on the one step, but exploration may produce the greater total reward in the long run.

Since it is not possible to both explore and exploit with any single action selection, systematic methods are used to balance exploration and exploitation. This is the basic idea behind reinforcement learning.

Defining Machine Learning#

Machine Learning addresses the question of how to build computer programs that improve their performance at some task through experience.

Definition of Learning

A computer program is said to learn from experience/data E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

The problem of inducing general functions from specific training examples is central to learning.

Machine Learning vs. Data Mining

Machine Learning and Data Mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases)

There are so many different types of machine learning systems that it is useful to classify them in broad categories, based on the following criteria:

Supervised, Unsupervised, Semi-Supervised, Self-Supervised, Reinforcement Learning#

Machine learning systems can be classified according to the amount and type of supervision they get during training. The main categories are

  • Supervised learning: the training set we feed to the algorithm includes the desired solutions, called labels. A typical supervised learning task is classification, such as a spam filter for email.
  • Unsupervised learning: the training data is unlabeled. Clustering and 2D/3D Visualization are examples.
  • Semi-supervised learning: when labeling data is time-consuming, we often end up with some labeled instances and many unlabeled ones; semi-supervised learning can deal with this situation.
  • Self-supervised learning: generate a fully labeled dataset from a fully unlabeled one
  • Reinforcement learning: the learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards or penalties in return. It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.

Online vs. Batch Learning#

Batch learning#

In batch learning, the system is trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning.

Batch learning can be automated fairly easily when a new round of training with new data is needed.

Practical Batch Learning

Automated batch learning is simple and often works fine, but

  • training using the full set of data can take many hours, so we would typically train a new system only every 24 hours or even just weekly. If our system needs to adapt to rapidly changing data (e.g., to predict stock prices), then we need a more reactive solution.
  • training on the full set of data requires a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.). If we have a lot of data and we automate our system to train from scratch every day, it will end up costing us a lot of money. If the amount of data is huge, it may even be impossible to use a batch learning algorithm.
  • if our system needs to be able to learn autonomously and it has limited resources (e.g., a smartphone application or a rover on Mars), then carrying around large amounts of training data and taking up a lot of resources to train for hours every day is a showstopper.

A model’s performance tends to decay slowly over time, simply because the world continues to evolve while the model remains unchanged. This phenomenon is often called model rot or data drift ⚠️. The solution is to regularly retrain the model on up-to-date data. How often we need to do that depends on the use case.

Online Learning#

In online learning, we train the system incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate (not to be confused with learning rate as a hyperparameter). If we set a high learning rate, then our system will rapidly adapt to new data, but it will also tend to quickly forget the old data (for example, a spam filter would then flag only the latest kinds of spam). Conversely, if we set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points (outliers).
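As a minimal sketch of this idea, the snippet below trains a scikit-learn SGDRegressor incrementally on a stream of mini-batches via partial_fit. The post does not prescribe a library; the synthetic data and the eta0 setting, which plays the role of the learning rate discussed above, are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
# eta0 plays the role of the learning rate: higher values adapt faster but forget faster
model = SGDRegressor(learning_rate="constant", eta0=0.01)

# Simulate data instances arriving sequentially in small mini-batches
true_weights = np.array([1.5, -2.0, 0.5])
for step in range(200):
    X_batch = rng.normal(size=(32, 3))
    y_batch = X_batch @ true_weights + rng.normal(scale=0.1, size=32)
    model.partial_fit(X_batch, y_batch)   # each learning step is fast and cheap

print(model.coef_)  # gradually approaches the true weights as new data arrives
```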

Practical Online Learning

A big challenge with online learning is that if bad data is fed to the system, the system’s performance will decline, possibly quickly. To reduce this risk, we need to monitor our system closely and promptly switch learning off (and possibly revert to a previously working state) if we detect a drop in performance. We may also want to monitor the input data and react to abnormal data; for example, using an anomaly detection algorithm

Instance-Based v.s. Model-Based Learning#

One more way to categorize machine learning systems is by how they generalize. Most machine learning tasks are about making predictions. This means that given a number of training examples, the system needs to be able to make good predictions for (generalize to) examples it has never seen before. Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.

There are 2 main approaches to generalization:

  1. instance-based learning, and
  2. model-based learning

Instance-Based Learning#

Possibly the most trivial form of learning is simply to learn by heart. A spam email detector could measure the similarity between two emails. A (very basic) similarity measure between two emails could be to count the number of words they have in common. The system would flag an email as spam if it has many words in common with a known spam email.

This is called instance-based learning: the system learns the examples by heart, then generalizes to new cases by using a similarity measure to compare them to the learned examples (or a subset of them).
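Below is a minimal sketch of the word-overlap similarity just described; the example emails and the threshold are made up for illustration, and a real spam filter would use a far better similarity measure.

```python
def count_common_words(email_a: str, email_b: str) -> int:
    """Very basic similarity measure: number of distinct words the two emails share."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))

# "Learned by heart": a few known spam examples kept around for comparison
known_spam = ["win a free prize now", "claim your free lottery prize today"]

def looks_like_spam(email: str, threshold: int = 3) -> bool:
    # Flag the email if it shares many words with any known spam instance
    return any(count_common_words(email, spam) >= threshold for spam in known_spam)

print(looks_like_spam("win your free prize now"))      # True
print(looks_like_spam("meeting notes for tomorrow"))   # False
```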

Model-Based Learning#

Another way to generalize from a set of examples is to build a model of these examples and then use that model to make predictions. This is called model-based learning. Explore in Kaggle
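As a minimal sketch of model-based learning, the snippet below fits a simple linear model to a handful of hypothetical country statistics and then predicts from the model rather than from stored examples; the numbers are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training examples: GDP per capita (USD) -> life satisfaction score
X = np.array([[20_000], [35_000], [50_000], [65_000]])
y = np.array([5.5, 6.2, 6.9, 7.3])

model = LinearRegression().fit(X, y)          # build a model of the examples
print(model.predict(np.array([[40_000]])))    # use the model to predict a new case
```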

Main Challenges of Machine Learning#

Insufficient Quantity of Training Data#

It takes a lot of data for most machine learning algorithms to work properly. Even for very simple problems we typically need thousands of examples, and for complex problems such as image or speech recognition we may need millions of examples.

The Unreasonable Effectiveness of Data

In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different machine learning algorithms, including fairly simple ones, performed almost identically well on a complex problem of natural language disambiguation once they were given enough data

As the authors put it, “these results suggest that we may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development”.

The idea that data matters more than algorithms for complex problems was further popularized by Peter Norvig et al. in a paper titled “The Unreasonable Effectiveness of Data”, published in 2009.

Nonrepresentative Training Data#

Explore in Kaggle

In order to generalize well, it is crucial that our training data be representative of the new cases we want to generalize to. This is true whether we use instance-based learning or model-based learning and is often harder than it sounds: if the sample is too small, we will have sampling noise (i.e., non-representative data as a result of chance), but even very large samples can be non-representative if the sampling method is flawed. This is called sampling bias.

Overfitting the Training Data#

Explore in Kaggle

Say we are visiting a foreign country and the taxi driver rips us off. We might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In machine learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well. See an accompanying Jupyter notebook example that illustrates this phenomenon and its possible solutions.

Underfitting the Training Data#

Underfitting is the opposite of overfitting: it occurs when our model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples.

Here are the main options for fixing this problem:

  • Select a more powerful model, with more parameters.
  • Feed better features to the learning algorithm (Feature Engineering).
  • Reduce the constraints on the model

Testing and Validating#

The only way to know how well a model will generalize to new cases is to actually try it out on new cases. One way to do that is to split our data into two sets:

  1. the 1️⃣ training set, and
  2. the 2️⃣ test set

As these names imply, we train our model using the training set, and we test it using the test set. The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating the model on the test set, we get an estimate of this error. This value tells us how well the model will perform on instances it has never seen before. If, for example, the training error is low but the generalization error is high, it means that the model is overfitting the training data.

TIP

It is common to use 80% of the data for training and hold out 20% for testing. However, this depends on the size of the dataset: if it contains 10 million instances, then holding out 1% means the test set will contain 100,000 instances, probably more than enough to get a good estimate of the generalization error.

In another situation, suppose we are hesitating between 2 candidate models. How do we decide between them? One option is to train both and compare how well they generalize using the test set. Suppose further that one of them, say a linear model, generalizes better, but we want to apply some regularization to avoid overfitting. The question is, how do we choose the value of the regularization hyperparameter? One option is to train 100 different models using 100 different values for this hyperparameter. Finally we launch the model with the best hyperparameter value, which produces the smallest generalization error of 5%, into production, but unfortunately it does not perform as well as expected and produces 15% errors. What just happened?

The problem is that we measured the generalization error multiple times on the test set, and we adapted the model and hyperparameters to produce the best model for that particular set. This means the model is unlikely to perform as well on new data. A common solution to this problem is called holdout validation: simply hold out part of the training set to evaluate several candidate models and select the best one. The new held-out set is called the 3️⃣ validation set (or the development set, or dev set). The steps are:

  1. train multiple models with various hyperparameters on the reduced training set (i.e., the full training set minus the validation set)
  2. select the model that performs best on the validation set.
  3. train the best model on the full training set (including the validation set), which gives us the final model.
  4. evaluate this final model on the test set to get an estimate of the generalization error
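The following minimal sketch walks through these four steps with scikit-learn, assuming a Ridge model, a synthetic dataset, and a small grid of regularization values; none of these specifics come from the text above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=1_000)

# 1️⃣ training set and 2️⃣ test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# 3️⃣ hold out part of the training set as the validation (dev) set
X_sub, X_val, y_sub, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

# Steps 1-2: train candidates on the reduced training set, pick the best on the validation set
candidates = {alpha: Ridge(alpha=alpha).fit(X_sub, y_sub) for alpha in (0.01, 0.1, 1.0, 10.0)}
best_alpha = min(candidates, key=lambda a: mean_squared_error(y_val, candidates[a].predict(X_val)))

# Step 3: retrain the best model on the full training set (including the validation set)
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)

# Step 4: evaluate once on the test set to estimate the generalization error
print(best_alpha, mean_squared_error(y_test, final_model.predict(X_test)))
```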

Machine Learning Project Checklist#

As well-organized data scientists, the first thing we should do before any serious business project is to pull out our machine learning project checklist, which can guide us through a machine learning project in the following main steps:

TIP

There is a complete end-to-end notebook project that demonstrates this process - Explore in Kaggle

  1. Document everything along the way

  2. Frame the problem and look at the big picture

    1. Define the objective in business terms
    2. How will the solution be used?
    3. What are the current solutions/workarounds (if any)?
    4. How should this problem be framed (i.e. supervised/unsupervised, online/offline, etc.)?
    5. How should performance be measured?
    6. Is the performance measure aligned with the business objective?
    7. What would be the minimum performance needed to reach the business objective?
    8. What are comparable problems? Can we reuse experience or tools?
    9. Is human expertise available?
    10. How would we solve the problem manually?
    11. List the assumptions we or others have made so far
    12. Verify assumptions if possible
  3. Get the data

    TIP

    Automate as much as possible so we can get fresh data easily

  4. Explore the data to gain insights

  5. Prepare the data to better expose the underlying data patterns to machine learning algorithms

  6. Explore many different models and shortlist the best ones

  7. Fine-tune our models and combine them into a solution

  8. Present solution

    • Make sure the big picture is highlighted first
    • Explain why this solution achieves the business objective
    • Describe what worked and what didn’t
    • List assumptions and system’s limitations
    • Ensure key findings are communicated through beautiful visualizations or easy-to-remember statements
  9. Launch, monitor, and maintain the system

Neural Networks#

The area of Neural Networks was originally inspired primarily by the goal of modeling biological neural systems, but it has since diverged and become a matter of engineering and achieving good results in Machine Learning tasks. Neural Networks, and much of AI in general, are therefore essentially part of the discipline of Machine Learning.

Biological Motivation and Connections#

Historically, digital computers such as the von Neumann model operate via the execution of explicit instructions with access to memory by a number of processors. Some neural networks, on the other hand, originated from efforts to model information processing in biological systems through the framework of connectionism. Unlike the von Neumann model, connectionist computing does not separate memory and processing.

Basic Principles of Connectionism

The central connectionist principle is that mental phenomena can be described by interconnected networks of simple and often uniform units. The form of the connections and the units can vary from model to model. For example, units in the network could represent neurons and the connections could represent synapses, as in the human brain.

  • Activation function: Internal states of any network change over time due to neurons sending a signal to a succeeding layer of neurons in the case of a feedforward network, or to a previous layer in the case of a recurrent network. The activation function defines under what circumstances a neuron will send a signal outward to other neurons. This can be, for example, a probability function whose range describes the probability of the neuron firing the signal.

  • Memory and learning: Neural networks follow two basic principles:

    1. Any mental state can be described as an n-dimensional vector of numeric activation values over neural units in a network.
    2. Memory and learning are created by modifying the ‘weights’ of the connections between neural units, generally represented as an n × m matrix. The weights are adjusted according to some learning rule or algorithm.

The basic computational unit of the brain is a neuron. Approximately 86 billion neurons can be found in the human nervous system, and they are connected with approximately 10^{14} – 10^{15} synapses. The diagram below shows a cartoon drawing of a biological neuron (left) and a common mathematical model (right).

Each neuron receives input signals from its dendrites and produces output signals along its (single) axon. The axon eventually branches out and connects via synapses to dendrites of other neurons.

In the computational model of a neuron, the signals that travel along the axons (e.g. x_0) interact multiplicatively (e.g. w_0 x_0) with the dendrites of the other neuron based on the synaptic strength at that synapse (e.g. w_0). The idea is that the synaptic strengths (the weights w) are learnable and control the strength of influence (and its direction: excitatory (positive weight) or inhibitory (negative weight)) of one neuron on another. In the basic model, the dendrites carry the signal to the cell body, where they all get summed. If the final sum is above a certain threshold, the neuron can fire, sending a spike along its axon. In the computational model, we assume that the precise timings of the spikes do not matter, and that only the frequency of the firing communicates information. Based on this rate code interpretation, we model the firing rate of the neuron with an activation function f, which represents the frequency of the spikes along the axon. Historically, a common choice of activation function is the sigmoid function σ, since it takes a real-valued input (the signal strength after the sum) and squashes it to the range between 0 and 1. Example code for forward-propagating a single neuron might look as follows:

```python
import math
import numpy as np

class Neuron(object):
    # ...
    def forward(self, inputs):
        """Assume inputs and weights are 1-D numpy arrays and bias is a number."""
        cell_body_sum = np.sum(inputs * self.weights) + self.bias      # weighted sum of inputs plus bias
        firing_rate = 1.0 / (1.0 + math.exp(-cell_body_sum))           # sigmoid activation function
        return firing_rate
```

In other words, each neuron performs a dot product with the input and its weights, adds the bias and applies the non-linearity (or activation function), in this case the sigmoid

\sigma(x) = \frac{1}{1 + e^{-x}}
Coarse model

It’s important to stress that this model of a biological neuron is very coarse. For example, there are many different types of neurons, each with different properties. The dendrites in biological neurons perform complex nonlinear computations. The synapses are not just a single weight, they are a complex non-linear dynamical system. The exact timing of the output spikes in many systems is known to be important, suggesting that the rate code approximation may not hold. Due to all these and many other simplifications, be prepared to hear groaning sounds from anyone with some neuroscience background if we draw analogies between Neural Networks and real brains. See this review if people are interested.

Neural Networks are modeled as collections of neurons that are connected in an acyclic graph. In other words, the outputs of some neurons can become inputs to other neurons. Cycles are not allowed since that would imply an infinite loop in the forward pass of a network. Instead of an amorphous blob of connected neurons, Neural Network models are often organized into distinct layers of neurons.

Convolutional Neural Networks (CNNs)#

Convolutional Neural Networks (ConvNets or CNNs) are a category of Neural Networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs apart from powering vision in robots and self-driving cars.

In the figure above, a ConvNet is able to recognize scenes and the system is able to suggest relevant captions (“a soccer player is kicking a soccer ball”) while figure below shows an example of ConvNets being used for recognizing everyday objects, humans and animals. Lately, ConvNets have been effective in several Natural Language Processing tasks (such as sentence classification) as well.

Essentially, every image can be represented as a matrix of pixel values.

Channel is a conventional term used to refer to a certain component of an image. An image from a standard digital camera will have three channels - red, green and blue - we can imagine those as three 2d-matrices stacked over each other (one for each color), each having pixel values in the range 0 to 255. A grayscale image, on the other hand, has just one channel. We will only consider grayscale images, so we will have a single 2d matrix representing an image. The value of each pixel in the matrix will range from 0 to 255 - zero indicating black and 255 indicating white.

Convolution Step#

ConvNets derive their name from the “convolution” operator. The primary purpose of convolution in case of a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. We will not go into the mathematical details of Convolution here, but will try to understand how it works over images.

As we discussed above, every image can be considered as a matrix of pixel values. Consider a 5 x 5 image whose pixel values are only 0 and 1 (note that for a grayscale image, pixel values range from 0 to 255, the green matrix below is a special case where pixel values are only 0 and 1):

Also, consider another 3 x 3 matrix as shown below:

Then, the Convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as shown in the animation below:

We slide the orange matrix over our original image (green) by 1 pixel (also called stride) and for every position, we compute element-wise multiplication (between the two matrices) and add the multiplication outputs to get the final integer which forms a single element of the output matrix (pink). Note that the 3×3 matrix sees only a part of the input image in each stride. In CNN terminology, the 3×3 matrix is called a filter or kernel or feature detector, and the matrix formed by sliding the filter over the image and computing the dot product is called the Convolved Feature or Activation Map or the Feature Map. It is important to note that filters act as feature detectors on the original input image.
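To make the sliding-window computation concrete, here is a minimal numpy sketch of the convolution just described (stride 1, no padding). The 5 x 5 binary image and 3 x 3 filter values are commonly used illustrative ones and are assumed here, since the original figures are not reproduced.

```python
import numpy as np

image = np.array([[1, 1, 1, 0, 0],      # 5 x 5 binary "image"
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

kernel = np.array([[1, 0, 1],           # 3 x 3 filter / kernel / feature detector
                   [0, 1, 0],
                   [1, 0, 1]])

def convolve2d(img, k, stride=1):
    kh, kw = k.shape
    out_h = (img.shape[0] - kh) // stride + 1
    out_w = (img.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w), dtype=img.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * k)   # element-wise multiply, then sum
    return out

print(convolve2d(image, kernel))   # 3 x 3 convolved feature (feature map)
```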

In practice, a CNN learns the values of these filters on its own during the training process (although we still need to specify parameters such as number of filters, filter size, architecture of the network etc. before the training process). The more filters we have, the more image features get extracted and the better our network becomes at recognizing patterns in unseen images.

The size of the Feature Map is controlled by 3 parameters that we need to decide before the convolution step is performed:

  1. Depth: Depth corresponds to the number of filters we use for the convolution operation. In the network shown below, we are performing convolution of the original boat image using three distinct filters, thus producing three different feature maps as shown. We can think of these three feature maps as stacked 2d matrices

  2. Stride: Stride is the number of pixels by which we slide our filter matrix over the input matrix. When the stride is 1 then we move the filters one pixel at a time. When the stride is 2, then the filters jump 2 pixels at a time as we slide them around. Having a larger stride will produce smaller feature maps.

  3. Zero-padding: Sometimes, it is convenient to pad the input matrix with zeros around the border, so that we can apply the filter to bordering elements of our input image matrix. A nice feature of zero padding is that it allows us to control the size of the feature maps. Adding zero-padding is also called wide convolution, and not using zero-padding would be a narrow convolution.
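For reference, the resulting feature-map size can be computed with the standard relation (assuming a square input of size W, filter size F, zero-padding P, and stride S; the formula itself is not stated in the text above):

O = \frac{W - F + 2P}{S} + 1

For example, a 5 × 5 input with a 3 × 3 filter, stride 1, and no padding gives (5 − 3 + 0)/1 + 1 = 3, i.e. a 3 × 3 feature map, matching the convolution example earlier.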

Pooling Step#

Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types: Max, Average, Sum etc.

In the case of Max Pooling, we define a spatial neighborhood (for example, a 2×2 window) and take the largest element from the rectified feature map within that window. Instead of taking the largest element we could also take the average (Average Pooling) or sum of all elements in that window. In practice, Max Pooling has been shown to work better.

Pooling

  • makes the input representations (feature dimension) smaller and more manageable
  • reduces the number of parameters and computations in the network, therefore, controlling overfitting
  • makes the network invariant to small transformations, distortions and translations in the input image
  • helps us arrive at an almost scale invariant representation of our image. This is very powerful since we can detect objects in an image no matter where they are located
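A minimal numpy sketch of max pooling with a 2 × 2 window and stride 2 is shown below; the feature-map values are made up for illustration.

```python
import numpy as np

feature_map = np.array([[1, 1, 2, 4],
                        [5, 6, 7, 8],
                        [3, 2, 1, 0],
                        [1, 2, 3, 4]])

def max_pool(fm, size=2, stride=2):
    out_h = (fm.shape[0] - size) // stride + 1
    out_w = (fm.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w), dtype=fm.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = fm[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()        # keep only the largest element in each window
    return out

print(max_pool(feature_map))   # [[6 8]
                               #  [3 4]]
```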

Architecture#

A convolutional neural network consists of an input layer, hidden layers and an output layer. In a convolutional neural network, the hidden layers include one or more layers that perform convolutions followed by other layers such as pooling layers and fully connected layers

After several convolutional and max pooling layers, the final classification is done via fully connected layers. Neurons in a fully connected layer have connections to neurons in the previous layer. The input to this fully connected layer is a one-dimensional vector, which is the flattened output of the convolutional/pooling layers.

What does the flattening look like?

In a Convolutional Neural Network (CNN), the input to the fully connected network (FCN) is the output of the final convolutional layer or pooling layer. This input is typically a 3-dimensional tensor with height, width, and depth representing the extracted features. Flattening this 3D tensor essentially “unfolds” all the dimensions together. The resulting one-dimensional vector has a size equal to the product of the original dimensions. For example, a 5 × 5 × 2 tensor is flattened into a vector of size 5 × 5 × 2 = 50, which is the number of neurons in the FCN input layer.
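A minimal numpy sketch of this flattening step:

```python
import numpy as np

features = np.arange(5 * 5 * 2).reshape(5, 5, 2)   # height x width x depth tensor of extracted features
flattened = features.flatten()                      # "unfold" all dimensions into one vector
print(flattened.shape)                              # (50,) -> size of the FCN input layer
```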

There are two aspects of this computation worth paying attention to:

  1. Location Invariance: Let’s say we want to classify whether or not there’s an elephant in an image. Because we are sliding our filters over the whole image we don’t really care where the elephant occurs. In practice, pooling also gives us invariance to translation, rotation and scaling.
  2. Compositionality: Each filter composes a local patch of lower-level features into higher-level representation. That’s why CNNs are so powerful in Computer Vision. It makes intuitive sense that you build edges from pixels, shapes from edges, and more complex objects from shapes.

Convolutional Neural Networks for NLP#

Instead of image pixels, the input to most NLP tasks are sentences or documents represented as a matrix. Each row of the matrix corresponds to one token, typically a word, but it could be a character. That is, each row is a vector that represents a word. Typically, these vectors are word embeddings (low-dimensional representations) like word2vec or GloVe, but they could also be one-hot vectors that index the word into a vocabulary. For a 10-word sentence using a 100-dimensional embedding we would have a 10×100 matrix as our input. That’s our “image”.

In vision, our filters slide over local patches of an image, but in NLP we typically use filters that slide over full rows of the matrix (words). Thus, the “width” of our filters is usually the same as the width of the input matrix. The height, or region size, may vary, but sliding windows over 2-5 words at a time is typical. Putting all the above together, a Convolutional Neural Network for NLP may look like this:

In the illustration of this Convolutional Neural Network (CNN) architecture for sentence classification above, we depict three filter region sizes: 2, 3 and 4, each of which has 2 filters. Every filter performs convolution on the sentence matrix and generates (variable-length) feature maps. Then 1-max pooling is performed over each map, i.e., the largest number from each feature map is recorded. Thus a univariate feature vector is generated from all six maps, and these 6 features are concatenated to form a feature vector for the penultimate layer. The final softmax layer then receives this feature vector as input and uses it to classify the sentence; here we assume binary classification and hence depict two possible output states.
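As a minimal sketch of the 1-max pooling and concatenation step described above, the snippet below assumes the six variable-length feature maps (region sizes 2, 3, and 4, two filters each, over a 10-word sentence) have already been computed; the values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
# One feature map per filter over a 10-word sentence:
# region size 2 -> length 9, size 3 -> length 8, size 4 -> length 7 (two filters each)
feature_maps = [rng.normal(size=n) for n in (9, 9, 8, 8, 7, 7)]

# 1-max pooling: record only the largest number from each feature map, then concatenate
penultimate = np.array([fm.max() for fm in feature_maps])
print(penultimate.shape)   # (6,) feature vector fed to the final softmax layer
```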

What about the nice intuitions we had for Computer Vision? Location Invariance and local Compositionality made intuitive sense for images, but not so much for NLP. We probably do care a lot where in the sentence a word appears (except for Latin). Pixels close to each other are likely to be semantically related (part of the same object), but the same isn’t always true for words. In many languages, parts of phrases could be separated by several other words. The compositional aspect isn’t obvious either. Clearly, words compose in some ways, like an adjective modifying a noun, but how exactly this works, and what higher-level representations actually “mean”, isn’t as obvious as in the Computer Vision case. Given all this, Recurrent Neural Networks make more intuitive sense. They resemble how we process language, or at least how we think we process language: reading sequentially from left to right.

A glaring limitation of Vanilla Neural Networks (and also Convolutional Networks) is that their API is too constrained. They accept a fixed-sized vector as input (e.g. an image) and produce a fixed-sized vector as output (e.g. probabilities of different classes). In addition, these models perform this mapping using a fixed amount of computational steps (e.g. the number of layers in the model). The core reason that recurrent nets are more exciting is that they allow us to operate over sequences of vectors: Sequences in the input, the output, or in the most general case both. We will illustrate this now

Large Language Model (LLM)#

Natural Language Processing vs. Large Language Models#

What’s the difference between NLP and LLMs?
  • NLP (Natural Language Processing) is the broader field focused on enabling computers to understand, interpret, and generate human language. NLP encompasses many techniques and tasks such as sentiment analysis, named entity recognition, and machine translation.
  • LLMs (Large Language Models) are a powerful subset of NLP models characterized by their massive size, extensive training data, and ability to perform a wide range of language tasks with minimal task-specific training. Models like the Llama, GPT, or Claude series are examples of LLMs that have revolutionized what’s possible in NLP.

What is NLP?#

NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words.

The following is a list of common NLP tasks, with some examples of each:

  • Classifying whole sentences: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not
  • Classifying each word in a sentence: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization)
  • Generating text content: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words
  • Extracting an answer from a text: Given a question and a context, extracting the answer to the question based on the information provided in the context
  • Generating a new sentence from an input text: Translating a text into another language, summarizing a text
NOTE

NLP isn’t limited to written text though. It also tackles complex challenges in speech recognition and computer vision, such as generating a transcript of an audio sample or a description of an image.

The Rise of Large Language Models (LLMs)#

In recent years, the field of NLP has been revolutionized by Large Language Models (LLMs). These models, which include architectures like GPT (Generative Pre-trained Transformer) and Llama, have transformed what’s possible in language processing.

A large language model (LLM) is an AI model trained on massive amounts of text data that can understand and generate human-like text, recognize patterns in language, and perform a wide variety of language tasks without task-specific training. They represent a significant advancement in the field of natural language processing (NLP).

LLMs are characterized by:

  • Scale: They contain millions, billions, or even hundreds of billions of parameters
  • General capabilities: They can perform multiple tasks without task-specific training
  • In-context learning: They can learn from examples provided in the prompt
  • Emergent abilities: As these models grow in size, they demonstrate capabilities that weren’t explicitly programmed or anticipated

The advent of LLMs has shifted the paradigm from building specialized models for specific NLP tasks to using a single, large model that can be prompted or fine-tuned to address a wide range of language tasks. This has made sophisticated language processing more accessible while also introducing new challenges in areas like efficiency, ethics, and deployment.

Why is language processing challenging?#

LLMs, however, also have important limitations (currently):

  • Hallucinations: They can generate incorrect information confidently
  • Lack of true understanding: They lack true understanding of the world and operate purely on statistical patterns
  • Bias: They may reproduce biases present in their training data or inputs.
  • Context windows: They have limited context windows (though this is improving)
  • Computational resources: They require significant computational resources

Computers don’t process information in the same way as humans. For example, when we read the sentence “I am hungry”, we can easily understand its meaning. Similarly, given two sentences such as “I am hungry” and “I am sad,” we’re able to easily determine how similar they are. For machine learning (ML) models, such tasks are more difficult. The text needs to be processed in a way that enables the model to learn from it. And because language is complex, we need to think carefully about how this processing must be done.

Even with the advances in LLMs, many fundamental challenges remain. These include understanding ambiguity, cultural context, sarcasm, and humor. LLMs address these challenges through massive training on diverse datasets, but still often fall short of human-level understanding in many complex scenarios.

RAG & GraphRAG#

What Is Retrieval-Augmented Generation (RAG)?#

Engaging in a conversation with a company’s AI assistant can be frustrating. Chatbots give themselves away by returning generic responses that often don’t answer the question. This is because Large language models (LLMs), like OpenAI’s GPT models, excel at general language tasks but have trouble answering specific questions for several reasons:

  • LLMs have a broad knowledge base but often lack in-depth industry- or organization-specific context.
  • LLMs may generate responses that are incorrect, known as hallucinations.
  • LLMs lack explainability, as they can’t verify, trace, or cite sources.
  • An LLM’s knowledge is based on static training data that doesn’t update with real-time information.

This doesn’t have to be the case. Imagine a different scenario: interacting with a chatbot that provides detailed, precise responses. This chatbot sounds like a human with deep institutional knowledge about the company and its products and policies. This chatbot is actually helpful. The second scenario is possible through a machine learning approach called Retrieval-Augmented Generation (RAG).

RAG is a technique that enhances Large Language Model (LLM) responses by retrieving source information from external data stores to augment generated responses. These data stores, including databases, documents, or websites, may contain domain-specific, proprietary data that enable the LLM to locate and summarize specific, contextual information beyond the data the LLM was trained on.

RAG applications are becoming the industry standard for organizations that want smarter generative AI applications. With RAG, we can reduce hallucination, provide explainability, draw upon the most recent data, and expand the range of what our LLM can answer. As we improve the quality and specificity of its response, we also create a better user experience.

How Does RAG Work?#

At a high level, the RAG architecture involves 3 key processes:

  1. understanding queries: the process begins when a user asks a question. The query goes through the LLM API to the RAG application, which analyzes it to understand the user’s intent and determine what information to look for.
  2. retrieving information: the application uses advanced algorithms to find the most relevant pieces of information in the company’s database. These algorithms match vector embeddings based on semantic similarity to identify the information that can best answer the user’s question.
  3. generating responses: the application combines the retrieved information with the user’s original prompt to create a more detailed and context-rich prompt. It then uses the new prompt to generate a response tailored to the organization’s internal data.
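A minimal, illustrative sketch of these three steps is shown below; embed(), vector_store.search(), and llm.generate() are hypothetical placeholders standing in for an embedding model, a vector database client, and an LLM API, not any specific library.

```python
def answer_with_rag(question, vector_store, llm, top_k=3):
    # 1. Understand the query: embed it so it can be compared semantically
    query_embedding = embed(question)                                  # hypothetical embedding function

    # 2. Retrieve information: find the most semantically similar chunks
    relevant_chunks = vector_store.search(query_embedding, top_k=top_k)

    # 3. Generate the response: augment the original prompt with the retrieved context
    context = "\n".join(relevant_chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + context + "\n\nQuestion: " + question
    )
    return llm.generate(prompt)                                        # hypothetical LLM call
```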

Before implementing an RAG application, it’s important to clean up our data to make it easy for the RAG application to quickly search and retrieve relevant information. This process is called data indexing.

Frameworks like LangChain make it easy to build RAG applications by providing a unified interface to connect LLMs to external databases via APIs. Neo4j vector index on the LangChain library helps simplify the indexing process.

RAG is Not the Final Answer#

For people who have just started building RAG systems, it can feel like magic: retrieve the right documents and let the model generate. No hallucinations or hand-holding, and we get clean, grounded answers. But then the cracks show over time. RAG works fine on simple questions, but when the input is longer and poorly structured, or when multi-step reasoning is involved, it starts to struggle. Tweaking chunk sizes or playing with hybrid search, for example, has been shown to improve the output only slightly.

The core issue is that RAG retrieves, but it does not reason or plan. Because of these limitations, RAG has been widely regarded as a starting point, not a solution. If we are handling real-world queries, we need memory and planning. A better approach, for example, would be to wrap RAG in a task planner instead of endlessly fine-tuning.

For example, suppose we use RAG with some technical manuals and ask “explain this differently, make it easier to digest, and give me a class on this subject as an instructor would”. That is a great use case for RAG when paired with a strong prompt strategy and a clear retrieval scope. If the technical manuals are well-structured and chunked, a RAG system can definitely retrieve relevant sections and reframe them into simplified, instructional content. For more dynamic behavior like teaching styles, adapting explanations to learner feedback, or building a step-by-step curriculum, however, we would likely benefit from layering agentic behavior or an instructional persona agent on top of RAG. That’s where combining memory, reasoning, and planning starts to elevate the experience beyond static retrieval. Google’s NotebookLM would be a good playground to demonstrate this.

This is not RAG’s fault, because saying “RAG cannot plan” is like saying Elasticsearch or Google Search cannot plan. These are information retrieval systems; they are not supposed to plan anything but to retrieve information. If we want planning capabilities, we should add agents, and that is a very different level of complexity. People interested in exploring this direction are therefore recommended to check out Building Business-Ready Generative AI Systems, which goes deep into combining RAG with agentic design, memory, and reasoning flows, basically everything that starts where traditional RAG ends.

At the end of the day, RAG should be a component of the agentic frameworks we use to get our system to reply appropriately beyond information retrieval. Planning would be a true AI feature; LLMs alone are not it.


GraphRAG#

Rather than using documents as a source to vectorize and retrieve from, Knowledge Graphs can be used. One can start with a set of documents, books, or other bodies of text, and convert them to a knowledge graph using one of many methods, including language models. Once the knowledge graph is created, subgraphs can be vectorized, stored in a vector database, and used for retrieval as in plain RAG. The advantage here is that graphs have more recognizable structure than strings of text, and this structure can help retrieve more relevant facts for generation. This approach is called GraphRAG.

Reinforcement Learning#

Recalling the K-Armed Bandit Problem from the beginning of this post, we now look more closely at methods for estimating the values of actions and for using these estimates to make action-selection decisions, which we collectively call action-value methods.

Action-Value Methods

There are 2 parts involved:

  1. An estimation method for the value of an action
  2. How to use the estimates to select actions

Recall that the true value of an action is the mean reward when that action is selected. One natural way to estimate it is by averaging the rewards actually received:

Q_t(a) = \frac{\text{sum of rewards when action } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} = \frac{\sum_{i = 1}^{t - 1} R_i \cdot X_{A_i = a}}{\sum_{i = 1}^{t - 1} X_{A_i = a}}

where X_{\text{predicate}} denotes the random variable that is 1 if the predicate is true and 0 if it is not. If action a has not been taken at all, i.e. the denominator is 0, then we give Q_t(a) some default value, such as 0. In addition, Q_t(a) converges to q_*(a) as the denominator goes to infinity.

IMPORTANT

The equation above assumes that the estimated value is calculated by averaging the rewards actually received. In this sense, we call this the sample-average method for estimating action values. It should be noted that this is just one way to estimate action values.

As for the question of how the estimates might be used to select actions, the simplest selection rule is to select the action with the highest estimated value, i.e. a greedy action:

A_t = \argmax_a Q_t(a)
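Below is a minimal sketch of the sample-average method with ε-greedy action selection (one simple way to balance exploration and exploitation, not derived above). The true action values and reward noise are made up, and the incremental update used is algebraically equivalent to the sample average.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10
q_star = rng.normal(size=k)        # unknown true action values q_*(a)
Q = np.zeros(k)                    # estimated action values Q_t(a), default value 0
N = np.zeros(k)                    # number of times each action has been taken
epsilon = 0.1                      # probability of exploring instead of exploiting

for t in range(1000):
    if rng.random() < epsilon:
        a = int(rng.integers(k))               # explore: pick a random action
    else:
        a = int(np.argmax(Q))                  # exploit: greedy action A_t = argmax_a Q_t(a)
    reward = rng.normal(q_star[a], 1.0)        # reward drawn from a stationary distribution
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]             # incremental form of the sample average

print(int(np.argmax(Q)), int(np.argmax(q_star)))   # estimated best action vs. true best action
```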