
Overview of LLM Agents#

Before diving deep into the theory and practice, it’s essential to grasp the handful of concepts that turn a “dumb” LLM (which just completes text) into an “agent” that can reason, plan, and act. An agent is not a single technology; it’s an architecture built around an LLM.

The core of this architecture is a reasoning loop. The most fundamental and important concept to learn here is ReAct (Reasoning and Acting). This idea, first published by Google researchers, is the foundation for almost every modern agent, including those in LlamaIndex and LangChain.

The ReAct loop works as follows:

  1. Thought: The agent is given a complex task (e.g., “What’s the weather in the capital of France, and who is the president of that country?”). The LLM’s first step is to think and create a plan. It will generate an internal monologue, like: “I need to solve this in two parts. First, find the capital of France. Second, find the president of France. Third, get the weather for that capital city. Fourth, combine the answers.”
  2. Act: Based on its first thought (“find the capital of France”), the agent decides to act. It chooses a tool from a list it has been given. For example, it might choose a search tool and generate the specific query: search("capital of France").
  3. Observation: The agent executes the action and gets a result (the observation). For example, the search tool returns: “The capital of France is Paris.”
  4. Repeat: This observation is fed back into the loop as new context. The agent thinks again: “OK, the capital is Paris. My plan said the next step is to find the president. I will act by using the search tool with the query search("president of France").”
  5. …and so on: This loop continues - Thought, Act, Observation, Thought, Act, Observation - until the agent’s final thought is: “I have all the information. The president is Emmanuel Macron and the capital is Paris. I will now act by using the search_weather tool with the query weather("Paris").” After that observation, its final thought will be, “I have all the answers and can now formulate the final response.”

This “Thought, Act, Observation” cycle is the essential theory. The agent is simply an orchestration loop that provides the LLM with a system prompt, a set of tools, and a “scratchpad” to write down its thoughts and observations. More advanced concepts like Reflexion simply add a step where the agent critiques its own past actions to improve its plan, and ReWOO optimizes the process by drafting the entire plan up front instead of re-invoking the LLM after every observation.
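
To make the orchestration concrete, here is a minimal sketch of such a loop using nothing beyond the standard library. `llm` stands in for any prompt-to-completion function and `tools` for any dict of callables; both are hypothetical placeholders, and the action parsing is deliberately naive:

```python
import re

def react_loop(task: str, llm, tools: dict, max_steps: int = 10) -> str:
    """Minimal ReAct orchestration. `llm` is any function mapping a prompt
    string to a completion string; `tools` maps tool names to callables."""
    scratchpad = f"Task: {task}\n"
    for _ in range(max_steps):
        # Ask the model for its next Thought (and possibly an Action),
        # given everything written on the scratchpad so far.
        completion = llm(scratchpad + "Thought:")
        scratchpad += "Thought:" + completion + "\n"

        # Naive parse of lines like: Action: search("capital of France")
        match = re.search(r'Action:\s*(\w+)\("(.*)"\)', completion)
        if match is None:
            # No action requested: treat the completion as the final answer.
            return completion.strip()

        name, argument = match.group(1), match.group(2)
        observation = tools[name](argument)  # execute the chosen tool

        # Feed the result back into the loop as an Observation.
        scratchpad += f"Observation: {observation}\n"
    return "Stopped after max_steps without a final answer."
```

Frameworks like LangChain and LlamaIndex differ mainly in how robustly they parse the action and how they format the scratchpad; the loop itself stays roughly this simple.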

Those who want to pick up agent basics quickly can check out this Jupyter notebook: Explore in Kaggle

Psychological Root of LLM Agents - Inner Speech1#

That moment of panic when we realize we’ve lost our wallet is a perfect example of inner speech. We immediately stop, and our internal monologue takes over, becoming a tool to direct our actions: “Okay, think. Where did I last have it? I used it at the coffee shop. Did I put it back in my pocket? Let me check. No. Did I put it in my bag? Let me look. Okay, not in the main pocket. What about the side pocket? Yes, there it is.” That entire step-by-step, self-directed verbal process - guiding our search, eliminating possibilities, and managing our rising panic - is our inner speech actively solving a problem.

Inner speech can be defined as the subjective experience of language in the absence of overt and audible articulation. It has long played an important role in psychological theorizing. In the Theaetetus, Plato noted that a dialogic conversation with the self is a familiar aspect of human experience. In this dialogue, Socrates defines thinking as “a talk which the soul has with itself about the objects under its consideration.”

He describes it as a silent, internal process of questioning and answering:

…the soul when it thinks is simply carrying on a discussion in which it asks itself questions and answers them itself, affirms and denies. And when it arrives at something definite… we call this its judgment.

- Plato’s dialogue, Theaetetus, 189e – 190a

This part of the dialogue occurs as Socrates and Theaetetus are trying (and failing) to define “knowledge.” They are in the process of examining the idea of “knowledge as true judgment,” and Socrates introduces this definition of “thinking” to explore how “false judgment” (or “thinking” incorrectly) might be possible. This specific mechanism - the soul’s silent, internal dialogue of questions and answers - is the very definition that the “Inner Speech” paper and modern AI agent theory build upon.

The agent’s “Thought” step is a direct, practical implementation of this Platonic idea - a silent, internal dialogue used to reason, plan, and arrive at a judgment before acting.

There are two influential theoretical perspectives on inner speech, each theorizing about its cognitive function. One relates to the development of verbal mediation of cognition and behavior, and the other relates to rehearsal and working memory.

Vygotsky’s Theory#

Lev Vygotsky’s theory is a cornerstone of developmental psychology and is the perfect psychological root for understanding why modern AI agents are designed the way they are. At the heart of Vygotsky’s theory is a revolutionary idea: complex, abstract thinking is not an innate, individual ability, but rather the internalization of social processes.

He argued that our higher cognitive functions (like planning, reasoning, and self-control) are born from our social interactions with others. The primary “tool” for this transformation is language.

Vygotsky’s most famous work, Thought and Language, proposed that thought and language have separate origins.

  • Pre-linguistic Thought: A baby can “think” in a basic, practical way (e.g., “I am hungry,” “That object is far away”).
  • Pre-intellectual Speech: A baby can “speak” (cry, babble) to express emotion, but not to formulate a logical thought.

The “most significant moment” in cognitive development, he argued, occurs around age two when these two lines converge. Language and thought become intertwined. At this point, language ceases to be just a tool for communication with others and becomes the primary tool for thinking itself. This transformation happens in three distinct, observable stages, and these stages of speech development are the central mechanism that maps directly onto AI agent theory.

  1. Social Speech (or External Speech)

    This is the first stage, from birth to about age 3. Speech at this stage is purely external and social. Its sole purpose is to communicate with others. A child uses words to control the behavior of others (“Want milk!”), express emotion (“Bad!”), or make a request. It is a tool for interacting with the outside world.

    This is like a simple, non-agentic LLM. You give it a prompt (an external stimulus) and it gives you a direct response (an external output). It is a pure call-and-response.

  2. Private Speech (or Egocentric Speech)

    This is the critical transitional stage, peaking around ages 3 to 7. This is the phenomenon we observe when a child is playing alone and talking to themselves out loud. For example, a child doing a puzzle might say, “No, that piece is blue… I need a red one… where is the edge piece? Yes, put it here.”

    Vygotsky’s great insight was that this is not just meaningless chatter. The child is using language as a tool for self-regulation. They have taken the social, back-and-forth dialogue they used to have with a parent (“Where does this piece go?” “Try the red one.”) and internalized it. They are now playing both roles, using their own voice to guide their own thoughts, direct their attention, and plan their next steps.

    This is exactly what the ReAct (Reasoning and Acting) framework does. The agent’s “Thought” is a form of private speech. It writes down its plan (“I need to find the capital of France… my first action will be to search…”) in an external, observable “scratchpad.” It is literally thinking aloud to regulate its own behavior and execute a complex plan.

  3. Inner Speech (or Silent Speech)

    From age 7 onward, this “private speech” doesn’t disappear; it “goes underground.” It becomes silent, internalized thought. This is our mature “inner monologue” or “stream of consciousness.” It is the fast, condensed, abbreviated silent dialogue we have with ourselves (the “dialogic conversation with the self” that Plato described). We use it to plan our day, reason through a problem, and direct our own behavior, but it now happens entirely within our minds.

    This is the goal of a sophisticated agent. A truly advanced agent wouldn’t need to write down every single “Thought.” It would be able to perform many of these reasoning steps internally, only surfacing its plan or actions when necessary.

Vygotsky’s theory provides the psychological justification for why the “Thought” step in an agent is so critical:

  • Self-Regulation: The “Thought” step is the agent’s mechanism for self-regulation. Instead of just reacting to the user’s prompt (stimulus-response), it pauses to think. It formulates a plan, critiques its own ideas, and directs its future actions.
  • Problem-Solving: Vygotsky noted that children’s use of private speech increases when a task is difficult or when they make a mistake. This is the same for an agent. If an agent’s “Act” step fails (e.g., a search query returns an error), its next “Thought” step is its “private speech” kicking in to solve the new problem: “Okay, that tool failed. I will try a different tool,” or “My search query was bad. I will formulate a better one.” (A minimal sketch of this failure handling appears after this list.)
  • Interpretability: The agent’s externalized “Thought” trace is a perfect log of its “private speech.” For us as developers, it allows us to do what a teacher does with a student: “Show me your work.” We can read the agent’s step-by-step reasoning to debug its logic, which is impossible if its “thought” is a black box.

Vygotsky’s theory explains that the ability to “think aloud” (Private Speech) is the crucial developmental bridge that allows a simple social actor to become a complex, independent, and self-regulating thinker. The ReAct framework, by forcing the LLM to “think aloud” with its Thought step, is essentially guiding the AI through this same cognitive leap.
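
The “private speech on failure” behavior falls out of the architecture almost for free: if a tool call fails, we feed the error text back into the loop as the Observation instead of crashing, and the model’s next Thought gets a chance to recover. A minimal sketch, reusing the hypothetical `tools` dict from the loop above:

```python
def run_tool(tools: dict, name: str, argument: str) -> str:
    """Execute a tool, converting failures into observations the agent
    can reason about on its next Thought, instead of crashing the loop."""
    try:
        return str(tools[name](argument))
    except KeyError:
        return f"Error: unknown tool '{name}'. Available tools: {list(tools)}"
    except Exception as exc:  # surface the failure text to the agent
        return f"Error: tool '{name}' failed with: {exc}"
```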

Inner Speech in Working Memory#

Vygotsky’s theory is developmental and functional. It explains why we have inner speech (for self-regulation) and how we get it (by internalizing social speech).

There is another major pillar of “inner speech” research called Working Memory theory, which is cognitive and architectural. It explains what inner speech is and how it operates as a specific mechanism in our brain’s short-term memory.

TIP

Working memory refers to the retention of information “online” during a complex task, such as keeping a set of directions in mind while navigating around a new building, or rehearsing a shopping list.

This theory is a key part of the Baddeley & Hitch model of working memory, which proposes that “working memory” (what we used to call “short-term memory”) is not just a passive storage box, but an active, multi-component “mental workbench.” This workbench has a “boss” and two main “assistants”:

  1. The Central Executive: This is the “boss.” It’s the flexible, high-level attention-control system. It’s the “you” that decides, “I need to remember this phone number” or “I need to solve this math problem.” It doesn’t store information itself; it just directs the other systems.
  2. The Visuo-Spatial Sketchpad: This is the “inner eye.” It’s the assistant that holds and manipulates visual and spatial information (e.g., picturing a map, mentally rotating a 3D shape).
  3. The Phonological Loop: This is the “inner voice”. It is the assistant responsible for holding and manipulating language-based information.

Baddeley’s great insight was to break this “inner voice” down into two sub-parts that work in a continuous loop:

  1. The Phonological Store (The “Inner Ear”): This is a passive storage buffer. It can hold a small amount of sound-based (phonological) information for a very brief time - about 1 to 2 seconds. Any spoken word we hear (e.g., someone tells us a phone number) enters this store directly. Think of this as a tiny, rapidly-fading audio recording.
  2. The Articulatory Rehearsal Process (The “Inner Voice”): This is an active rehearsal process. It is, quite literally, our inner speech. Its job is to “read” the information in the phonological store and then “speak” it again, feeding it back into the store. This act of silent, internal “re-speaking” is what refreshes the memory and prevents it from fading. Crucially, it also works for written information. When we read a word on a page, this process translates that visual text into an internal, sound-based code and speaks it into the phonological store.

Let’s use the everyday experience of remembering a phone number (555-867-5309) when we don’t have a pen. Our Central Executive (“boss”) fires up and says: “This is important. I need to remember this number for the next 30 seconds.” It directs our attention. We hear “555-867-5309.” The sounds immediately enter our Phonological Store (“inner ear”). But after 1-2 seconds, those sounds will decay and be gone forever. So, our Articulatory Rehearsal Process (“inner voice”) immediately kicks in. We start silently saying to ourselves, “five-five-five, eight-six-seven, five-three-oh-nine… five-five-five, eight-six-seven, five-three-oh-nine…”. Each time our “inner voice” rehearses the number, it’s like a fresh “recording” that gets fed back into the “inner ear” (the phonological store), refreshing it for another 2 seconds. This continuous “phonological loop” is the “Inner Speech in Working Memory.”

To create a truly robust AI agent, we need both:

  • a Working Memory (a context window or scratchpad) to hold information (Baddeley’s part), and
  • a Reasoning Loop (like ReAct) that uses “inner speech” to reflect on and act upon that information (Vygotsky’s part).
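
As a loose analogy of our own (not a claim from either paper), a bounded scratchpad plays the role of the phonological store: it has a hard capacity, old entries decay unless re-read, and the reasoning loop is what keeps “rehearsing” them into each new prompt. A minimal sketch:

```python
from collections import deque

class Scratchpad:
    """Bounded working memory: keeps only the most recent entries,
    the way the phonological store holds only 1-2 seconds of sound."""
    def __init__(self, max_entries: int = 20):
        self.entries = deque(maxlen=max_entries)  # oldest entries decay

    def write(self, role: str, text: str) -> None:
        self.entries.append(f"{role}: {text}")

    def render(self) -> str:
        # "Rehearsal": everything still in the store is re-read into
        # the prompt on every step of the reasoning loop.
        return "\n".join(self.entries)
```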

Relating to ReAct Agent#

The ReAct paper cites “Inner Speech”1 because it provides the core psychological blueprint for why an agent’s “Thoughts” are so powerful.

The ReAct paper demonstrates that interleaving reasoning and acting works. The “Inner Speech” paper, by Alderson-Day & Fernyhough, explains the cognitive function of this internal monologue, drawing heavily on the theories of psychologist Lev Vygotsky.

The central finding of the “Inner Speech” paper is that our “inner monologue” isn’t just a passive side effect of thinking. Instead, it is an active cognitive tool that we use to direct and regulate our own minds. It shows that what Google’s researchers built is not just a clever engineering trick, but a functional analogue of a core cognitive tool that humans evolved for self-regulation, planning, and complex problem-solving.

Theory#

Prerequisites

As a conference paper rather than a textbook, the original ReAct paper’s primary goal was to introduce a new synthesis of ideas and prove (through benchmarks) that this synthesis was effective. It assumes the reader is already fluent in the three distinct, advanced fields, each with its own deep theory, that it combines:

  1. The “Agent/Policy” Theory (Reinforcement Learning)

    This is the most important piece. ReAct assumes we are fluent in the language of Reinforcement Learning (RL). When it uses words like “policy,” “agent,” “state,” “action,” and “observation,” it’s borrowing the entire formal mathematical framework of Markov Decision Processes (MDPs).

    RL is a fascinating intersection of computer science, optimal control, and advanced mathematics. To learn it “deeply” means starting with the mathematical foundations before jumping into the “deep” (neural network) part. Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto is the foundational text for the entire field and formally introduces the core mathematical concepts including:

    • Markov Decision Processes (MDPs): The mathematical framework for all RL problems.
    • Bellman Equations: The fundamental equations that all RL algorithms try to solve.
    • Dynamic Programming: The theoretical (but often impractical) “perfect” solution.
    • Monte Carlo Methods & Temporal-Difference (TD) Learning: The breakthroughs that make RL practical (this includes Q-Learning and SARSA).

    As we read it, we will have a series of “aha!” moments. We will see that the ReAct loop is essentially an implementation of an RL “policy” (a typed sketch of this mapping appears right after this prerequisites list), where:

    • State = The history of all past thoughts, actions, and observations (the “scratchpad”).
    • Action = The choice to either generate a Thought or an Act (like calling a tool).
    • Policy = The LLM itself, which has been “prompted” to decide the best action given the current state.

    OpenML provides hand-made study materials to assist the study of Reinforcement Learning: An Introduction.

  2. The “Reasoning” and “Acting” Theory (Other LLM Papers)

    As its title indicates, ReAct didn’t invent “reasoning” or “acting” in LLMs. It was the first to synergize them. The paper was a direct response to two other lines of research that were popular at the time.

    • The “Reasoning” (Chain-of-Thought) Track: The ReAct paper is building directly on the Chain-of-Thought Prompting Elicits Reasoning in Large Language Models paper, which showed that we could get an LLM to “reason” by simply prompting it to “think step-by-step.” The ReAct Thought step is a more structured version of CoT.
    • The “Acting” (Tool-Use) Track: It also builds on papers like Toolformer and other research showing that LLMs could be prompted to use external tools (like a calculator or a search API).

    The ReAct paper’s core argument is that both of these tracks are flawed on their own. “Reasoning-only” agents hallucinate, and “Acting-only” agents can’t plan.

  3. The “Inner Speech” Theory (Cognitive Psychology)

    This is the deepest, most foundational layer. The reason the ReAct architecture works so well is that it computationally mimics a core human cognitive function.

    1. Vygotsky’s Thought and Language: This is the origin of the “private speech” (thinking aloud) to “inner speech” (internal monologue) theory. It explains why language is a tool for self-regulation and planning.
    2. Baddeley’s “Working Memory” Model: This provides the cognitive architecture, explaining the “Phonological Loop” (the “inner voice”) as the specific mechanism for holding and manipulating verbal information (i.e., the plan) in short-term memory.

    The ReAct paper essentially created a Vygotskian agent. It forces the LLM to use “private speech” (the Thought trace) to regulate its own behavior, which is a far more robust method than simple stimulus-response.
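
To make the first “aha!” mapping from prerequisite 1 concrete, here is a hypothetical, typed restatement of the ReAct loop in MDP vocabulary. The names (`Thought`, `ToolCall`, `Observation`, `Context`, `Policy`) are our own illustrative choices, not the paper’s:

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Thought:
    text: str       # free-form reasoning; changes nothing in the environment

@dataclass
class ToolCall:
    name: str       # e.g. "search"
    argument: str   # e.g. "capital of France"

@dataclass
class Observation:
    text: str       # whatever the environment returned

# State c_t: the full history of thoughts, actions, and observations
# (the "scratchpad").
Context = list[Union[Thought, ToolCall, Observation]]

# Action a_t: emit either a Thought or a ToolCall.
Action = Union[Thought, ToolCall]

# Policy pi(a_t | c_t): the prompted LLM, viewed as a function.
Policy = Callable[[Context], Action]
```

Viewed this way, the LLM is not the agent: it is only the policy, and the surrounding orchestration loop is the agent.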

A unique feature of human intelligence is the ability to seamlessly combine task-oriented actions with verbal reasoning (or inner speech, which has been shown to play an important role in human cognition, enabling self-regulation and the maintenance of working memory). The interleaving between “acting” and “reasoning” allows humans to learn new tasks quickly and perform robust decision-making or reasoning.

At the time, LLM-based agents generally fell into two camps:

  1. “Thinkers” (Chain-of-Thought)

    These models were good at reasoning. We would give them a complex problem, and they would “think” step-by-step in natural language to arrive at an answer.

    The problem with Chain-of-Thought models is that they were “disembodied brains” that couldn’t interact with the outside world. If their reasoning required a piece of new information (like today’s weather or a fact from Google), they would either fail or hallucinate (make up) an answer.

  2. “Actors”

    These models were trained to use tools. We would give them a task, and they would immediately try to act by calling a tool, like search("What is the weather?").

    The issue is that these were “dumb” actors, lacking the high-level reasoning to formulate a plan. If a task required multiple steps (e.g., “Find the capital of the country where the Eiffel Tower is, then find the weather there”), they would often fail, get stuck, or use the tools incorrectly.

The “ReAct” finding is that the key to intelligent agents is to interleave these two processes. The model’s output is no longer just the final answer, but a “trajectory” that follows a repeating cycle:

  1. Thought (Reasoning): The LLM first generates an internal monologue. It analyzes the task, reflects on what it knows, and formulates a high-level plan.
  2. Act (Acting): Based on its thought, the model decides to take a specific, external action by calling a tool (e.g., search, lookup_wikipedia, python_code_interpreter).
  3. Observation (External Data): The model receives the output from that tool (e.g., a search result, a calculated number). This new information is fed back into the model’s context for the next step.

This Thought -> Act -> Observation loop repeats until the final Thought is “I have enough information to give the final answer.”
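
In practice, this loop is usually taught to the model through few-shot exemplars embedded in the prompt. The short trajectory below is an illustrative example written for this post (the `search` and `weather` tools and all values are made up), not a transcript from the paper:

```python
EXAMPLE_TRAJECTORY = """\
Question: What is the weather in the capital of France?
Thought: I need the capital of France first, then its weather.
Action: search("capital of France")
Observation: The capital of France is Paris.
Thought: Now I can look up the weather in Paris.
Action: weather("Paris")
Observation: Paris is currently 18°C and sunny.
Thought: I have enough information to give the final answer.
Final Answer: It is 18°C and sunny in Paris, the capital of France.
"""
```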

ReAct: Synergizing Reasoning And Acting#

Consider a general setup of an agent interacting with an environment for task solving. At time step $t$, an agent receives an observation $o_t \in \mathcal{O}$ from the environment and takes an action $a_t \in \mathcal{A}$ following some policy $\pi(a_t \mid c_t)$, where $c_t = (o_1, a_1, \cdots, o_{t-1}, a_{t-1}, o_t)$ is the context to the agent. Learning a policy is challenging when the mapping $c_t \mapsto a_t$ is highly implicit and requires extensive computation.

What is a policy?

The term “policy” in this context is a core concept borrowed directly from the field of Reinforcement Learning (RL). The field’s definitive theoretical introduction, Reinforcement Learning: An Introduction, defines it as follows:

A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus–response rules or associations. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic, specifying probabilities for each action

In short, a policy is the “brain” or “strategy” of an agent. It is the set of rules that dictates what action the agent will take in any given state.

To break that down:

  • Agent: The AI (in this case, the LLM).
  • Environment: The world the agent interacts with (e.g., a website, a code terminal, a Wikipedia API).
  • State: The agent’s current situation (e.g., the webpage it’s on, the code it has written so far, the user’s initial question).
  • Policy: The function that looks at the current state and decides what action to take next. It answers the question, “Given what’s happening right now, what should I do?”
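
Stated as code, a policy is just a function from state to action. Reusing the hypothetical types from the MDP sketch above, the prompted LLM plays exactly that role; the parsing here is deliberately crude:

```python
def llm_policy(context: Context, llm) -> Action:
    """A toy policy pi(a_t | c_t): serialize the state (the scratchpad),
    let the LLM complete it, and read the completion back as an action."""
    prompt = "\n".join(repr(step) for step in context)
    completion = llm(prompt).strip()
    if completion.startswith("Action:"):
        # e.g. 'Action: search("capital of France")'
        call = completion.removeprefix("Action:").strip()
        name, _, rest = call.partition("(")
        return ToolCall(name=name, argument=rest.rstrip(")").strip('"'))
    return Thought(text=completion)
```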

(To be continued…)

Building Agent From Scratch#

Now we are in a position to put agentic theory into practice. “Building from scratch” simply means we will be the ones to write the Python code that:

  • Manages the loop.
  • Formats the prompt that coaxes the LLM to “think.”
  • Provides and executes the tools.

This approach is 100% model-agnostic. We just need an API from a provider like OpenAI (GPT-4o), Anthropic (Claude 3), or we can run an open-source model like Meta’s Llama 3 locally.
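
As a concrete starting point, here is a hedged sketch of the single LLM call at the heart of such a loop, assuming the OpenAI Python SDK (v1) is installed and `OPENAI_API_KEY` is set; the model name, system prompt, and stop sequence are illustrative choices, not requirements:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def complete(scratchpad: str) -> str:
    """One step of the agent: ask the model for its next Thought/Action,
    stopping before it invents an Observation on its own."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative; any chat model works
        messages=[
            {"role": "system",
             "content": "Solve the task using Thought/Action/Observation steps."},
            {"role": "user", "content": scratchpad},
        ],
        stop=["Observation:"],  # real observations come from our tools
    )
    return response.choices[0].message.content
```

The `stop` sequence is the key design choice: it forces control back to our loop after each Action, so the environment (not the model) supplies every Observation.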

Building Agent Using Libraries#

  • LlamaIndex

  • OpenAI Agents SDK

  • LangGraph

    TIP
    • LangChain is for building a RAG (Retrieval Augmented Generation) system, a simple chatbot, or a data extraction pipeline where the steps are known in advance.
    • LangGraph is for building an Agent that needs to use tools, handle ambiguous instructions, collaborate with other agents, or requires a “human-in-the-loop” workflow.
  • [(Hugging Face) smolagents](https://huggingface.co/docs/smolagents/en/index)

Footnotes#

  1. Ben Alderson-Day and Charles Fernyhough. Inner speech: development, cognitive functions, phenomenology, and neurobiology. Psychological Bulletin, 141(5):931, 2015.
