
Overview of LLM Agents#

Before diving deep into the theory and practice, it’s essential to grasp the handful of concepts that turn a “dumb” LLM (which just completes text) into an “agent” that can reason, plan, and act. An agent is not a single technology; it’s an architecture built around an LLM.

The core of this architecture is a reasoning loop. The most fundamental and important concept to learn here is ReAct (Reasoning and Acting). This idea, first published by Google researchers, is the foundation for almost every modern agent, including those in LlamaIndex and LangChain.

The ReAct loop works as follows:

  1. Thought: The agent is given a complex task (e.g., “What’s the weather in the capital of France, and who is the president of that country?”). The LLM’s first step is to think and create a plan. It will generate an internal monologue, like: “I need to solve this in two parts. First, find the capital of France. Second, find the president of France. Third, get the weather for that capital city. Fourth, combine the answers.”
  2. Act: Based on its first thought (“find the capital of France”), the agent decides to act. It chooses a tool from a list it has been given. For example, it might choose a search tool and generate the specific query: search("capital of France").
  3. Observation: The agent executes the action and gets a result (the observation). For example, the search tool returns: “The capital of France is Paris.”
  4. Repeat: This observation is fed back into the loop as new context. The agent thinks again: “OK, the capital is Paris. My plan said the next step is to find the president. I will act by using the search tool with the query search("president of France").”
  5. …and so on: This loop continues - Thought, Act, Observation, Thought, Act, Observation - until the agent’s final thought is: “I have all the information. The president is Emmanuel Macron and the capital is Paris. I will now act by using the search_weather tool with the query weather("Paris").” After that observation, its final thought will be, “I have all the answers and can now formulate the final response.”

This “Thought, Act, Observation” cycle is the essential theory. The agent is simply an orchestration loop that provides the LLM with a system prompt, a set of tools, and a “scratchpad” to write down its thoughts and observations. More advanced concepts like Reflexion simply add a step where the agent critiques its own past actions to improve its plan, and ReWOO optimizes the process by drafting the entire plan up front instead of re-invoking the LLM after every observation.
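
To make the orchestration concrete, here is a minimal sketch of such a loop using nothing beyond the standard library. `llm` stands in for any prompt-to-completion function and `tools` for any dict of callables; both are hypothetical placeholders, and the action parsing is deliberately naive:

```python
import re

def react_loop(task: str, llm, tools: dict, max_steps: int = 10) -> str:
    """Minimal ReAct orchestration. `llm` is any function mapping a prompt
    string to a completion string; `tools` maps tool names to callables."""
    scratchpad = f"Task: {task}\n"
    for _ in range(max_steps):
        # Ask the model for its next Thought (and possibly an Action),
        # given everything written on the scratchpad so far.
        completion = llm(scratchpad + "Thought:")
        scratchpad += "Thought:" + completion + "\n"

        # Naive parse of lines like: Action: search("capital of France")
        match = re.search(r'Action:\s*(\w+)\("(.*)"\)', completion)
        if match is None:
            # No action requested: treat the completion as the final answer.
            return completion.strip()

        name, argument = match.group(1), match.group(2)
        observation = tools[name](argument)  # execute the chosen tool

        # Feed the result back into the loop as an Observation.
        scratchpad += f"Observation: {observation}\n"
    return "Stopped after max_steps without a final answer."
```

Frameworks like LangChain and LlamaIndex differ mainly in how robustly they parse the action and how they format the scratchpad; the loop itself stays roughly this simple.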

Those who want to pick up agent basics quickly can check out this Jupyter notebook: Explore in Kaggle

Psychological Root of LLM Agents - Inner Speech1#

That moment of panic when we realize we’ve lost our wallet is a perfect example of inner speech. We immediately stop, and our internal monologue takes over, becoming a tool to direct our actions: “Okay, think. Where did I last have it? I used it at the coffee shop. Did I put it back in my pocket? Let me check. No. Did I put it in my bag? Let me look. Okay, not in the main pocket. What about the side pocket? Yes, there it is.” That entire step-by-step, self-directed verbal process - guiding our search, eliminating possibilities, and managing our rising panic - is our inner speech actively solving a problem.

Inner speech can be defined as the subjective experience of language in the absence of overt and audible articulation. It has long played an important role in psychological theorizing. In the Theaetetus, Plato noted that a dialogic conversation with the self is a familiar aspect of human experience. In this dialogue, Socrates defines thinking as “a talk which the soul has with itself about the objects under its consideration.”

He describes it as a silent, internal process of questioning and answering:

…the soul when it thinks is simply carrying on a discussion in which it asks itself questions and answers them itself, affirms and denies. And when it arrives at something definite… we call this its judgment.

- Plato’s dialogue, Theaetetus, 189e – 190a

This part of the dialogue occurs as Socrates and Theaetetus are trying (and failing) to define “knowledge.” They are in the process of examining the idea of “knowledge as true judgment,” and Socrates introduces this definition of “thinking” to explore how “false judgment” (or “thinking” incorrectly) might be possible. This specific mechanism - the soul’s silent, internal dialogue of questions and answers - is the very definition that the “Inner Speech” paper and modern AI agent theory build upon.

The agent’s “Thought” step is a direct, practical implementation of this Platonic idea - a silent, internal dialogue used to reason, plan, and arrive at a judgment before acting.

There are two influential theoretical perspectives on inner speech, each theorizing about its cognitive function. One relates to the development of verbal mediation of cognition and behavior, and the other relates to rehearsal and working memory.

Vygotsky’s Theory#

Lev Vygotsky’s theory is a cornerstone of developmental psychology and is the perfect psychological root for understanding why modern AI agents are designed the way they are. At the heart of Vygotsky’s theory is a revolutionary idea: complex, abstract thinking is not an innate, individual ability, but rather the internalization of social processes.

He argued that our higher cognitive functions (like planning, reasoning, and self-control) are born from our social interactions with others. The primary “tool” for this transformation is language.

Vygotsky’s most famous work, Thought and Language, proposed that thought and language have separate origins.

  • Pre-linguistic Thought: A baby can “think” in a basic, practical way (e.g., “I am hungry,” “That object is far away”).
  • Pre-intellectual Speech: A baby can “speak” (cry, babble) to express emotion, but not to formulate a logical thought.

The “most significant moment” in cognitive development, he argued, occurs around age two when these two lines converge. Language and thought become intertwined. At this point, language ceases to be just a tool for communication with others and becomes the primary tool for thinking itself. This transformation happens in three distinct, observable stages, and these stages of speech development are the central mechanism that maps directly onto AI agent theory.

  1. Social Speech (or External Speech)

    This is the first stage, from birth to about age 3. Speech at this stage is purely external and social. Its sole purpose is to communicate with others. A child uses words to control the behavior of others (“Want milk!”), express emotion (“Bad!”), or make a request. It is a tool for interacting with the outside world.

    This is like a simple, non-agentic LLM. You give it a prompt (an external stimulus) and it gives you a direct response (an external output). It is a pure call-and-response.

  2. Private Speech (or Egocentric Speech)

    This is the critical transitional stage, peaking around ages 3 to 7. This is the phenomenon we observe when a child is playing alone and talking to themselves out loud. For example, a child doing a puzzle might say, “No, that piece is blue… I need a red one… where is the edge piece? Yes, put it here.”

    Vygotsky’s great insight was that this is not just meaningless chatter. The child is using language as a tool for self-regulation. They have taken the social, back-and-forth dialogue they used to have with a parent (“Where does this piece go?” “Try the red one.”) and internalized it. They are now playing both roles, using their own voice to guide their own thoughts, direct their attention, and plan their next steps.

    This is exactly what the ReAct (Reasoning and Acting) framework does. The agent’s “Thought” is a form of private speech. It writes down its plan (“I need to find the capital of France… my first action will be to search…”) in an external, observable “scratchpad.” It is literally thinking aloud to regulate its own behavior and execute a complex plan.

  3. Inner Speech (or Silent Speech)

    From age 7 onward, this “private speech” doesn’t disappear; it “goes underground.” It becomes silent, internalized thought. This is our mature “inner monologue” or “stream of consciousness.” It is the fast, condensed, abbreviated silent dialogue we have with ourselves (the “dialogic conversation with the self” that Plato described). We use it to plan our day, reason through a problem, and direct our own behavior, but it now happens entirely within our minds.

    This is the goal of a sophisticated agent. A truly advanced agent wouldn’t need to write down every single “Thought.” It would be able to perform many of these reasoning steps internally, only surfacing its plan or actions when necessary.

Vygotsky’s theory provides the psychological justification for why the “Thought” step in an agent is so critical:

  • Self-Regulation: The “Thought” step is the agent’s mechanism for self-regulation. Instead of just reacting to the user’s prompt (stimulus-response), it pauses to think. It formulates a plan, critiques its own ideas, and directs its future actions.
  • Problem-Solving: Vygotsky noted that children’s use of private speech increases when a task is difficult or when they make a mistake. This is the same for an agent. If an agent’s “Act” step fails (e.g., a search query returns an error), its next “Thought” step is its “private speech” kicking in to solve the new problem: “Okay, that tool failed. I will try a different tool,” or “My search query was bad. I will formulate a better one.” (A minimal sketch of this failure handling appears after this list.)
  • Interpretability: The agent’s externalized “Thought” trace is a perfect log of its “private speech.” For us as developers, it allows us to do what a teacher does with a student: “Show me your work.” We can read the agent’s step-by-step reasoning to debug its logic, which is impossible if its “thought” is a black box.

Vygotsky’s theory explains that the ability to “think aloud” (Private Speech) is the crucial developmental bridge that allows a simple social actor to become a complex, independent, and self-regulating thinker. The ReAct framework, by forcing the LLM to “think aloud” with its Thought step, is essentially guiding the AI through this same cognitive leap.
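
The “private speech on failure” behavior falls out of the architecture almost for free: if a tool call fails, we feed the error text back into the loop as the Observation instead of crashing, and the model’s next Thought gets a chance to recover. A minimal sketch, reusing the hypothetical `tools` dict from the loop above:

```python
def run_tool(tools: dict, name: str, argument: str) -> str:
    """Execute a tool, converting failures into observations the agent
    can reason about on its next Thought, instead of crashing the loop."""
    try:
        return str(tools[name](argument))
    except KeyError:
        return f"Error: unknown tool '{name}'. Available tools: {list(tools)}"
    except Exception as exc:  # surface the failure text to the agent
        return f"Error: tool '{name}' failed with: {exc}"
```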

Inner Speech in Working Memory#

Vygotsky’s theory is developmental and functional. It explains why we have inner speech (for self-regulation) and how we get it (by internalizing social speech).

There is another major pillar of “inner speech” research called Working Memory theory, which is cognitive and architectural. It explains what inner speech is and how it operates as a specific mechanism in our brain’s short-term memory.

TIP

Working memory refers to the retention of information “online” during a complex task, such as keeping a set of directions in mind while navigating around a new building, or rehearsing a shopping list.

This theory is a key part of the Baddeley & Hitch model of working memory, which proposes that “working memory” (what we used to call “short-term memory”) is not just a passive storage box, but an active, multi-component “mental workbench.” This workbench has a “boss” and two main “assistants”:

  1. The Central Executive: This is the “boss.” It’s the flexible, high-level attention-control system. It’s the “you” that decides, “I need to remember this phone number” or “I need to solve this math problem.” It doesn’t store information itself; it just directs the other systems.
  2. The Visuo-Spatial Sketchpad: This is the “inner eye.” It’s the assistant that holds and manipulates visual and spatial information (e.g., picturing a map, mentally rotating a 3D shape).
  3. The Phonological Loop: This is the “inner voice”. It is the assistant responsible for holding and manipulating language-based information.

Baddeley’s great insight was to break this “inner voice” down into two sub-parts that work in a continuous loop:

  1. The Phonological Store (The “Inner Ear”): This is a passive storage buffer. It can hold a small amount of sound-based (phonological) information for a very brief time - about 1 to 2 seconds. Any spoken word we hear (e.g., someone tells us a phone number) enters this store directly. Think of this as a tiny, rapidly-fading audio recording.
  2. The Articulatory Rehearsal Process (The “Inner Voice”): This is an active rehearsal process. It is, quite literally, our inner speech. Its job is to “read” the information in the phonological store and then “speak” it again, feeding it back into the store. This act of silent, internal “re-speaking” is what refreshes the memory and prevents it from fading. Crucially, it also works for written information. When we read a word on a page, this process translates that visual text into an internal, sound-based code and speaks it into the phonological store.

Let’s use the everyday experience of remembering a phone number (555-867-5309) when we don’t have a pen. Our Central Executive (“boss”) fires up and says: “This is important. I need to remember this number for the next 30 seconds.” It directs our attention. We hear “555-867-5309.” The sounds immediately enter our Phonological Store (“inner ear”). But after 1-2 seconds, those sounds will decay and be gone forever. So, our Articulatory Rehearsal Process (“inner voice”) immediately kicks in. We start silently saying to ourselves, “five-five-five, eight-six-seven, five-three-oh-nine… five-five-five, eight-six-seven, five-three-oh-nine…”. Each time our “inner voice” rehearses the number, it’s like a fresh “recording” that gets fed back into the “inner ear” (the phonological store), refreshing it for another 2 seconds. This continuous “phonological loop” is the “Inner Speech in Working Memory.”

To create a truly robust AI agent, we need both:

  • a Working Memory (a context window or scratchpad) to hold information (Baddeley’s part), and
  • a Reasoning Loop (like ReAct) that uses “inner speech” to reflect on and act upon that information (Vygotsky’s part).
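
As a loose analogy of our own (not a claim from either paper), a bounded scratchpad plays the role of the phonological store: it has a hard capacity, old entries decay unless re-read, and the reasoning loop is what keeps “rehearsing” them into each new prompt. A minimal sketch:

```python
from collections import deque

class Scratchpad:
    """Bounded working memory: keeps only the most recent entries,
    the way the phonological store holds only 1-2 seconds of sound."""
    def __init__(self, max_entries: int = 20):
        self.entries = deque(maxlen=max_entries)  # oldest entries decay

    def write(self, role: str, text: str) -> None:
        self.entries.append(f"{role}: {text}")

    def render(self) -> str:
        # "Rehearsal": everything still in the store is re-read into
        # the prompt on every step of the reasoning loop.
        return "\n".join(self.entries)
```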

Relating to ReAct Agent#

The ReAct paper cites “Inner Speech”1 because it provides the core psychological blueprint for why an agent’s “Thoughts” are so powerful.

The ReAct paper demonstrates that interleaving reasoning and acting works. The “Inner Speech” paper, by Alderson-Day & Fernyhough, explains the cognitive function of this internal monologue, drawing heavily on the theories of psychologist Lev Vygotsky.

The central finding of the “Inner Speech” paper is that our “inner monologue” isn’t just a passive side effect of thinking. Instead, it is an active cognitive tool that we use to direct and regulate our own minds. It shows that what Google’s researchers built is not just a clever engineering trick, but a functional analogue of a core cognitive tool that humans evolved for self-regulation, planning, and complex problem-solving.

Theory#

Prerequisites

As a conference paper rather than a textbook, the original ReAct paper’s primary goal was to introduce a new synthesis of ideas and prove (through benchmarks) that this synthesis was effective. It assumes the reader is already fluent in the three distinct, advanced fields, each with its own deep theory, that it combines:

  1. The “Agent/Policy” Theory (Reinforcement Learning)

    This is the most important piece. ReAct assumes we are fluent in the language of Reinforcement Learning (RL). When it uses words like “policy,” “agent,” “state,” “action,” and “observation,” it’s borrowing the entire formal mathematical framework of Markov Decision Processes (MDPs).

    RL is a fascinating intersection of computer science, optimal control, and advanced mathematics. To learn it “deeply” means starting with the mathematical foundations before jumping into the “deep” (neural network) part. Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto is the foundational text for the entire field and formally introduces the core mathematical concepts including:

    • Markov Decision Processes (MDPs): The mathematical framework for all RL problems.
    • Bellman Equations: The fundamental equations that all RL algorithms try to solve.
    • Dynamic Programming: The theoretical (but often impractical) “perfect” solution.
    • Monte Carlo Methods & Temporal-Difference (TD) Learning: The breakthroughs that make RL practical (this includes Q-Learning and SARSA).

    As we read it, we will have a series of “aha!” moments. We will see that the ReAct loop is essentially an implementation of an RL “policy” (a typed sketch of this mapping appears right after this prerequisites list), where:

    • State = The history of all past thoughts, actions, and observations (the “scratchpad”).
    • Action = The choice to either generate a Thought or an Act (like calling a tool).
    • Policy = The LLM itself, which has been “prompted” to decide the best action given the current state.

    OpenML provides hand-made study materials to assist the study of Reinforcement Learning: An Introduction.

  2. The “Reasoning” and “Acting” Theory (Other LLM Papers)

    As its title indicates, ReAct didn’t invent “reasoning” or “acting” in LLMs. It was the first to synergize them. The paper was a direct response to two other lines of research that were popular at the time.

    • The “Reasoning” (Chain-of-Thought) Track: The ReAct paper is building directly on the Chain-of-Thought Prompting Elicits Reasoning in Large Language Models paper, which showed that we could get an LLM to “reason” by simply prompting it to “think step-by-step.” The ReAct Thought step is a more structured version of CoT.
    • The “Acting” (Tool-Use) Track: It also builds on papers like Toolformer and other research showing that LLMs could be prompted to use external tools (like a calculator or a search API).

    The ReAct paper’s core argument is that both of these tracks are flawed on their own. “Reasoning-only” agents hallucinate, and “Acting-only” agents can’t plan.

  3. The “Inner Speech” Theory (Cognitive Psychology)

    This is the deepest, most foundational layer. The reason the ReAct architecture works so well is that it computationally mimics a core human cognitive function.

    1. Vygotsky’s Thought and Language: This is the origin of the “private speech” (thinking aloud) to “inner speech” (internal monologue) theory. It explains why language is a tool for self-regulation and planning.
    2. Baddeley’s “Working Memory” Model: This provides the cognitive architecture, explaining the “Phonological Loop” (the “inner voice”) as the specific mechanism for holding and manipulating verbal information (i.e., the plan) in short-term memory.

    The ReAct paper essentially created a Vygotskian agent. It forces the LLM to use “private speech” (the Thought trace) to regulate its own behavior, which is a far more robust method than simple stimulus-response.
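
To make the first “aha!” mapping from prerequisite 1 concrete, here is a hypothetical, typed restatement of the ReAct loop in MDP vocabulary. The names (`Thought`, `ToolCall`, `Observation`, `Context`, `Policy`) are our own illustrative choices, not the paper’s:

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Thought:
    text: str       # free-form reasoning; changes nothing in the environment

@dataclass
class ToolCall:
    name: str       # e.g. "search"
    argument: str   # e.g. "capital of France"

@dataclass
class Observation:
    text: str       # whatever the environment returned

# State c_t: the full history of thoughts, actions, and observations
# (the "scratchpad").
Context = list[Union[Thought, ToolCall, Observation]]

# Action a_t: emit either a Thought or a ToolCall.
Action = Union[Thought, ToolCall]

# Policy pi(a_t | c_t): the prompted LLM, viewed as a function.
Policy = Callable[[Context], Action]
```

Viewed this way, the LLM is not the agent: it is only the policy, and the surrounding orchestration loop is the agent.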

A unique feature of human intelligence is the ability to seamlessly combine task-oriented actions with verbal reasoning (or inner speech, which has been shown to play an important role in human cognition, enabling self-regulation and the maintenance of working memory). The interleaving between “acting” and “reasoning” allows humans to learn new tasks quickly and perform robust decision-making or reasoning.

At the time, LLM-based agents generally fell into two camps:

  1. “Thinkers” (Chain-of-Thought)

    These models were good at reasoning. We would give them a complex problem, and they would “think” step-by-step in natural language to arrive at an answer.

    The problem with Chain-of-Thought models is that they were “disembodied brains” that couldn’t interact with the outside world. If their reasoning required a piece of new information (like today’s weather or a fact from Google), they would either fail or hallucinate (make up) an answer.

  2. “Actors”

    These models were trained to use tools. We would give them a task, and they would immediately try to act by calling a tool, like search("What is the weather?").

    The issue is that these were “dumb” actors, lacking the high-level reasoning to formulate a plan. If a task required multiple steps (e.g., “Find the capital of the country where the Eiffel Tower is, then find the weather there”), they would often fail, get stuck, or use the tools incorrectly.

The “ReAct” finding is that the key to intelligent agents is to interleave these two processes. The model’s output is no longer just the final answer, but a “trajectory” that follows a repeating cycle:

  1. Thought (Reasoning): The LLM first generates an internal monologue. It analyzes the task, reflects on what it knows, and formulates a high-level plan.
  2. Act (Acting): Based on its thought, the model decides to take a specific, external action by calling a tool (e.g., search, lookup_wikipedia, python_code_interpreter).
  3. Observation (External Data): The model receives the output from that tool (e.g., a search result, a calculated number). This new information is fed back into the model’s context for the next step.

This Thought -> Act -> Observation loop repeats until the final Thought is “I have enough information to give the final answer.”
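
In practice, this loop is usually taught to the model through few-shot exemplars embedded in the prompt. The short trajectory below is an illustrative example written for this post (the `search` and `weather` tools and all values are made up), not a transcript from the paper:

```python
EXAMPLE_TRAJECTORY = """\
Question: What is the weather in the capital of France?
Thought: I need the capital of France first, then its weather.
Action: search("capital of France")
Observation: The capital of France is Paris.
Thought: Now I can look up the weather in Paris.
Action: weather("Paris")
Observation: Paris is currently 18°C and sunny.
Thought: I have enough information to give the final answer.
Final Answer: It is 18°C and sunny in Paris, the capital of France.
"""
```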

ReAct: Synergizing Reasoning And Acting#

Consider a general setup of an agent interacting with an environment for task solving. At time step $t$, an agent receives an observation $o_t \in \mathcal{O}$ from the environment and takes an action $a_t \in \mathcal{A}$ following some policy $\pi(a_t \mid c_t)$, where $c_t = (o_1, a_1, \cdots, o_{t-1}, a_{t-1}, o_t)$ is the context to the agent. Learning a policy is challenging when the mapping $c_t \mapsto a_t$ is highly implicit and requires extensive computation.

What is a policy?

The term “policy” in this context is a core concept borrowed directly from the field of Reinforcement Learning (RL). The field’s definitive theoretical introduction, Reinforcement Learning: An Introduction, defines it as follows:

A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus–response rules or associations. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic, specifying probabilities for each action

In short, a policy is the “brain” or “strategy” of an agent. It is the set of rules that dictates what action the agent will take in any given state.

To break that down:

  • Agent: The AI (in this case, the LLM).
  • Environment: The world the agent interacts with (e.g., a website, a code terminal, a Wikipedia API).
  • State: The agent’s current situation (e.g., the webpage it’s on, the code it has written so far, the user’s initial question).
  • Policy: The function that looks at the current state and decides what action to take next. It answers the question, “Given what’s happening right now, what should I do?”
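
Stated as code, a policy is just a function from state to action. Reusing the hypothetical types from the MDP sketch above, the prompted LLM plays exactly that role; the parsing here is deliberately crude:

```python
def llm_policy(context: Context, llm) -> Action:
    """A toy policy pi(a_t | c_t): serialize the state (the scratchpad),
    let the LLM complete it, and read the completion back as an action."""
    prompt = "\n".join(repr(step) for step in context)
    completion = llm(prompt).strip()
    if completion.startswith("Action:"):
        # e.g. 'Action: search("capital of France")'
        call = completion.removeprefix("Action:").strip()
        name, _, rest = call.partition("(")
        return ToolCall(name=name, argument=rest.rstrip(")").strip('"'))
    return Thought(text=completion)
```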

(To be continued…)

Building Agent From Scratch#

Now we are in a position to put agentic theory into practice. “Building from scratch” simply means we will be the ones to write the Python code that:

  • Manages the loop.
  • Formats the prompt that coaxes the LLM to “think.”
  • Provides and executes the tools.

This approach is 100% model-agnostic. We just need an API from a provider like OpenAI (GPT-4o), Anthropic (Claude 3), or we can run an open-source model like Meta’s Llama 3 locally.
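
As a concrete starting point, here is a hedged sketch of the single LLM call at the heart of such a loop, assuming the OpenAI Python SDK (v1) is installed and `OPENAI_API_KEY` is set; the model name, system prompt, and stop sequence are illustrative choices, not requirements:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete(scratchpad: str) -> str:
    """One step of the agent: ask the model for its next Thought/Action,
    stopping before it invents an Observation on its own."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any chat model works
        messages=[
            {"role": "system",
             "content": "Solve the task using Thought/Action/Observation steps."},
            {"role": "user", "content": scratchpad},
        ],
        stop=["Observation:"],  # real observations come from our tools
    )
    return response.choices[0].message.content
```

The `stop` sequence is the key design choice: it forces control back to our loop after each Action, so the environment (not the model) supplies every Observation.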

Building Agent Using Libraries#

  • LlamaIndex

  • OpenAI Agents SDK

  • LangGraph

    TIP
    • LangChain is for building a RAG (Retrieval Augmented Generation) system, a simple chatbot, or a data extraction pipeline where the steps are known in advance.
    • LangGraph is for building an Agent that needs to use tools, handle ambiguous instructions, collaborate with other agents, or requires a “human-in-the-loop” workflow.
  • [(Hugging Face) smolagents](https://huggingface.co/docs/smolagents/en/index)

Footnotes#

  1. Ben Alderson-Day and Charles Fernyhough. Inner speech: development, cognitive functions, phenomenology, and neurobiology. Psychological Bulletin, 141(5):931, 2015.
