Overview of LLM Agents
Before divining in deep theoretically and practically, it’s essential to grasp the handful of concepts that turn a “dumb” LLM (which just completes text) into an “agent” that can reason, plan, and act. An agent is not a single technology; it’s an architecture built around an LLM.
The core of this architecture is a reasoning loop. The most fundamental and important concept to learn here is ReAct (Reasoning and Acting). This idea, first published by Google researchers, is the foundation for almost every modern agent, including those in LlamaIndex and LangChain.
The ReAct loop works as follows:
- Thought: The agent is given a complex task (e.g., “What’s the weather in the capital of France, and who is the president of that country?”). The LLM’s first step is to think and create a plan. It will generate an internal monologue, like: “I need to solve this in two parts. First, find the capital of France. Second, find the president of France. Third, get the weather for that capital city. Fourth, combine the answers.”
- Act: Based on its first thought (“find the capital of France”), the agent decides to act. It chooses a tool
from a list it has been given. For example, it might choose a
searchtool and generate the specific query:search("capital of France"). - Observation: The agent executes the action and gets a result (the observation). For example, the
searchtool returns: “The capital of France is Paris.” - Repeat: This observation is fed back into the loop as new context. The agent thinks again: “OK, the capital is
Paris. My plan said the next step is to find the president. I will act by using the
searchtool with the querysearch("president of France").” - …and so on: This loop continues - Thought, Act, Observation, Thought, Act, Observation - until the agent’s
final thought is: “I have all the information. The president is Emmanuel Macron and the capital is Paris. I will
now act by using the
search_weathertool with the queryweather("Paris").” After that observation, its final thought will be, “I have all the answers and can now formulate the final response.”
This “Thought, Act, Observation” cycle is the essential theory. The agent is simply an orchestration loop that provides the LLM with a system prompt, a set of tools, and a “scratchpad” to write down its thoughts and observations. More advanced concepts like Reflexion simply add a step where the agent critiques its own past actions to improve its plan, and ReWOO optimizes this process.
Agent Libraries
There are many frameworks that make agentic systems easier to implement. One interested in agents will be glad to know that the vast majority of the most popular and well-documented AI agent development is happening in an ecosystem centered around open models. Here is a guide to the most prominent frameworks, models, and learning resources:
-
LlamaIndex: The leading framework for building powerful Retrieval-Augmented Generation (RAG) agents (agents that can query our own data). While its name is inspired by Meta’s Llama model, it is fully model-agnostic. Its documentation and tutorials heavily feature Llama, OpenAI’s GPT, and Anthropic’s Claude.
-
TIP
- LangChain is for building a RAG (Retrieval Augmented Generation) system, a simple chatbot, or a data extraction pipeline where the steps are known in advance
- LangGraph is for building an Agent that needs to use tools, handle ambiguous instructions, collaborate with other agents, or requires a “human-in-the-loop” workflow.
Quick Start Explore in Kaggle
These frameworks make it easy to get started by simplifying standard low-level tasks like calling LLMs, defining and parsing tools, and chaining calls together. However, they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug. They can also make it tempting to add complexity when a simpler setup would suffice.
OpenML suggests that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what’s under the hood are a common source of customer error.
Those who want to pick up of agent basics quickly can check out this Jupyter notebook Explore in Kaggle
Learning Agent Deeply
Although LangChain and LlamaIndex are excellent for building agent, building from first principles gives us a durable understanding that frameworks alone cannot. To do that
- we must learn theory behind Agent: how it works; what are all the relevant concepts; then
- build an agent from scratch, totally free from any frameworks such as LlamaIndex so that we can get our hands dirty with all aspects of agent with deep understanding
The ReAct paper, however, omits lots of details for those who would like to study it from ground-up because it is, unfortunately, a conference paper, not a textbook. Its primary goal was to introduce a new synthesis of ideas and prove (through benchmarks) that this synthesis was effective. It assumes the reader is already an expert in the 3 distinct, advanced fields with their own deep theory it’s combining:
The “Inner Speech” Theory (Cognitive Psychology)
This is the deepest, most foundational layer. The reason the ReAct architecture works so well is that it computationally mimics a core human cognitive function.
- Vygotsky’s Thought and Language: This is the origin of the “private speech” (thinking aloud) to “inner speech” (internal monologue) theory. It explains why language is a tool for self-regulation and planning.
- Baddeley’s “Working Memory” Model: This provides the cognitive architecture, explaining the “Phonological Loop” (the “inner voice”) as the specific mechanism for holding and manipulating verbal information (i.e., the plan) in short-term memory.
The ReAct paper essentially created a Vygotskian agent. It forces the LLM to use “private speech” (the Thought trace) to regulate its own behavior, which is a far more robust method than simple stimulus-response.
The “Agent/Policy” Theory (Reinforcement Learning)
This is the most important piece. ReAct assumes we are fluent in the language of Reinforcement Learning (RL). When it uses words like “policy,” “agent,” “state,” “action,” and “observation,” it’s borrowing the entire formal mathematical framework of Markov Decision Processes (MDPs).

RL is a fascinating intersection of computer science, optimal control, and advanced mathematics. To learn it “deeply” means starting with the mathematical foundations before jumping into the “deep” (neural network) part. Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto is the foundational text for the entire field and formally introduces the core mathematical concepts including:
-
Markov Decision Processes (MDPs): The mathematical framework for all RL problems.
-
Bellman Equations: The fundamental equations that all RL algorithms try to solve.
-
Dynamic Programming: The theoretical (but often impractical) “perfect” solution.
-
Monte Carlo Methods & Temporal-Difference (TD) Learning: The breakthroughs that make RL practical (this includes Q-Learning and SARSA).
TIP
It is not an exaggeration to say that without the Monte Carlo (MC) method, modern AI simply would not exist. The fundamental reason is that AI is essentially the study of high-dimensional probability distributions.
To systematically study Monte Carlo method, here is the recommended resource List:
-
Statistical Mechanics: Algorithms and Computations by Werner Krauth
Understand MC as a method for solving definite integrals (calculating areas/volumes) in high dimensions. If we want to find the area of an irregular shape, we can’t use calculus. Instead, enclose it in a box, throw 10,000 random darts at the box, and count how many land inside the shape. The ratio gives us the approximated area. This is exactly how we calculate “Expected Value” (the core of RL). We don’t solve the integral; we sample rewards and average them.
-
As we read it, we will have a series of “aha!” moments. We will see that the ReAct loop is essentially an implementation of an RL “policy,” where:
- State = The history of all past thoughts, actions, and observations (the “scratchpad”).
- Action = The choice to either generate a Thought or an Act (like calling a tool).
- Policy = The LLM itself, which has been “prompted” to decide the best action given the current state.
OpenML provides the following hand-made study materials to assist the study of Reinforcement Learning: An Introduction:
The “Reasoning” and “Acting” Theory (Other LLM Papers)
Indicating in its title, ReAct didn’t invent “reasoning” or “acting” in LLMs. It was the first to synergize them. The paper was a direct response to two other lines of research that were popular at the time.
- The “Reasoning” (Chain-of-Thought) Track: The ReAct paper is building directly on the Chain-of-Thought Prompting Elicits Reasoning in Large Language Models paper, which showed that we could get an LLM to “reason” by simply prompting it to “think step-by-step.” The ReAct Thought step is a more structured version of CoT.
- The “Acting” (Tool-Use) Track: It also builds on papers like Toolformer and other research showing that LLMs could be prompted to use external tools (like a calculator or a search API).
The ReAct paper’s core argument is that both of these tracks are flawed on their own. “Reasoning-only” agents hallucinate, and “Acting-only” agents can’t plan.
Building Agent From Scratch
Now we are in a position suitable for implementing agentic theory. “Building from scratch” simply means we will be the one to write the Python code that:
- Manages the loop.
- Formats the prompt that coaxes the LLM to “think.”
- Provides and executes the tools.
This approach is 100% model-agnostic. We just need an API from a provider like OpenAI (GPT-4o), Anthropic (Claude 3), or we can run an open-source model like Meta’s Llama 3 locally.
Here are some of the best framework-free guides to get us started:
Building the Core ReAct Agent Loop
A perfect starting point is a tutorial that implements the ReAct loop directly. These guides show us how to build the “brain” of the agent. We will write Python code to manage the “Thought, Act, Observation” cycle using nothing but a standard LLM API.
A great resource for this is “Building an AI agent from scratch in Python” by Leonie Monigatti. It’s a clean, simple implementation that uses an API (Anthropic’s, but it’s trivial to swap for OpenAI’s) to create the agent class, memory, and tool-use logic from the ground up.
Building a Core Capability: RAG from Scratch
Most agents need to access knowledge. The most common way to do this is Retrieval-Augmented Generation (RAG). Before an agent can use a RAG tool, it helps to build that tool ourselves. This involves:
- Loading documents (e.g., text files, PDFs).
- Chunking them into small pieces.
- Using an embedding model (like one from sentence-transformers) to turn those chunks into vectors.
- Storing those vectors in a simple vector database (like ChromaDB or even a basic Python-based FAISS index).
- Writing a
searchfunction that embeds a user’s query and finds the most similar chunks.
The Hugging Face blog has a fantastic, concise article titled “Code a simple RAG from scratch” that does exactly this. It shows us how to build the retrieval and generation parts using standard Python libraries, giving us a fundamental agent-ready tool.
By starting with these resources, we will build the two most important parts of any agent: the reasoning loop (ReAct) and the knowledge retrieval (RAG), all from first principles.
Coding Agent Solutions
Mission Critical Projects - Paid Google Gemini CLI

Google Gemini CLI is one of the best solutions for mission-critical coding projects because
- it is very fast,
- it understands your entire codebase, and
- it offers frictionless user experience
TIPGoogle Gemini web app is also a great alternative for great amount of coding scenarios:
- prototyping
- coding up parts of a large project in isolation
- etc.
We can also upload the entire codebase folder in which case the Geminin web can largely perform the same task as Germini CLI does
In such cases, a great diff tool, such as Diffchecker would come handy when we want to compare local and remote code versions
To make sure we are using the most advanced model, we must enable preview features in the CLI configuration. We only need to do this once.
- Open your terminal and run
gemini - Type
/settingsand press Enter - Locate Preview Features and toggle it to
true. - Restart the CLI (type
/quitand rungeminiagain).
Once preview features are enabled, we can pick the Pro 3 model directly from the command line using the --model flag.
gemini --model gemini-3-pro-previewAlternatively, if we are already inside the tool or want to set it permanently, we can use these methods:
-
Interactive Switch: Inside the CLI, type
/model. We will now see “Auto (Gemini 3)” or “Pro (gemini-3-pro-preview)” as options. Select one of them -
Environment Variable: To make this default without typing the flag every time, set the environment variable in our shell profile (
.bashrcor.zshrc):Terminal window export GEMINI_MODEL="gemini-3-pro-preview"
Experimental Projects - Mixture of Gemini Free Tier and Local Agent
The advanced model of Gemini CLI, however, is paid which is total okay; Keeping a local, cost-effective “sandbox” for experimental coding, however, prevents burning API credits on non-critical tasks. This section discusses an experimental approach to explore the free alternatives.
Using Gemini Free Tier
To configure our experimental projects to use the free tier while keeping our mission-critical projects on the paid “Pro” models, we should use Project-Specific Configuration, which allows us to override our global “Paid” settings only when we are working inside specific directories.
The “Free Tier” is generally tied to specific authentication methods (like a personal Google Account or a free Google AI Studio API Key) and specific models (usually the “Flash” series).
- Mission-Critical (Global): Uses your Paid/Vertex AI credentials + Pro Model.
- Experimental (Local): Uses Personal Auth/Free API Key + Flash Model.
Here is how to set this up:
Option 1: The .gemini Directory (Recommended)
The Gemini CLI looks for a .gemini folder inside the current project directory. Settings found here override the
global system settings.
-
Navigate to the experimental project’s root folder.
-
Create a directory named .gemini.
-
Create a file inside it named settings.json with path of
<project-root>/.gemini/settings.json -
Add the following configuration to force the use of the “Flash” model (which is typically the free/low-cost tier):
settings.json {"model": "gemini-2.0-flash"}NOTE
Check gemini models list in terminal to verify the exact name of the latest Flash model available to us, e.g.,
gemini-1.5-flashorgemini-2.5-flash.
Option 2: Use a Free API Key (Specific to Project)
If our global setup uses a paid Vertex AI credential, simply switching the model might still incur Google Cloud charges. To ensure it is completely free, use an API Key from Google AI Studio (which has a generous free tier for Flash models).
-
Get a Free Key: Go to Google AI Studio and create a new API Key (ensure it is on the “Free of Charge” tier plan).
-
Configure Project: In your experimental project root, create a .env file:
.env GEMINI_API_KEY="<YOUR_FREE_TIER_KEY>"
NOTEThe Gemini CLI typically automatically loads
.envfiles in the current directory, allowing this key to take precedence over global authentication.
Option 3: Quick Command-Line Switch
If we don’t want to create config files, we can simply pass the model flag when running commands for the experimental projects:
gemini chat --model gemini-2.0-flashLocal Agent
WARNING
Expect config overhead and possible slowness in interacting with free agents
List of horrible agent products that must never be used:
- ❌ Aider: hard to config; unexceptionally slow; generates code at wrong path; commits codes without your permission
At this point it is not hard to come to the conclusion that setting up a local coding agent is essentially having
- a local LLM running as the “brain”. A popular one would be Ollama, which can be easily set up by following this link
- a software as an interface that handles prompt, proxies the “Thought, Act, Observation” cycle, context, file editing, and history
The rest of the section primarily discusses the software.
Continue

Continue is a good option if we develop codes in IDE, such as JetBrain, and can be installed as an IDE plugin.
Once the plugin is installed, configure it in the following way:
-
Click the Continue icon in IDE sidebar (right).
-
Pull the following
ollamamodels:Terminal window ollama pull starcoder2:3bollama pull llama3.1 -
Click the gear icon (⚙️) to open
config.yamland make it editable by following IDE’s prompt -
Replace the
config.yamlcontents with the following~/.continue/config.yaml name: Local Agent Configversion: 1.0.0models:# ---------------------------------------------------------# PRIMARY AGENT (Use Llama 3.1 for stability)# ---------------------------------------------------------- name: Llama 3.1 (Agent)provider: ollamamodel: llama3.1:latestroles:- chat- edit- embed# Explicitly tell Continue this model handles tools wellcapabilities:- tool_use# ---------------------------------------------------------# SECONDARY (gpt-oss)# ---------------------------------------------------------- name: GPT-OSSprovider: ollamamodel: gpt-oss:20broles:- chat- edit# ---------------------------------------------------------# AUTOCOMPLETE (Starcoder2)# ---------------------------------------------------------- name: Starcoder2provider: ollamamodel: starcoder2:3broles:- autocompletedata:debounceDelay: 500requestOptions:max_tokens: 1024# ---------------------------------------------------------# CONTEXT PROVIDERS# ---------------------------------------------------------contextProviders:- name: codebaseparams:nRetrieve: 25useReranking: true- name: docs- name: open- name: terminal- name: diffallowAnonymousTelemetry: false
Now we can start using it. By default, Continue plugin does not “knows” the project structure; this is a significant difference with Gemini CLI and is good in some cases where we don’t want agent to take full charge of all works but only offers advice and we are still the person who put down final lines of code
(To be continued…)