AI Agent - Inner Speech

Psychological Root of LLM Agents - Inner Speech#

That moment of panic when we realize we’ve lost our wallet is a perfect example of inner speech. We immediately stop, and our internal monologue takes over, becoming a tool to direct our actions: “Okay, think. Where did I last have it? I used it at the coffee shop. Did I put it back in my pocket? Let me check. No. Did I put it in my bag? Let me look. Okay, not in the main pocket. What about the side pocket? Yes, there it is.” That entire step-by-step, self-directed verbal process - guiding our search, eliminating possibilities, and managing our rising panic - is our inner speech actively solving a problem.

Inner speech can be defined as the subjective experience of language in the absence of overt, audible articulation. It has long played an important role in psychological theorizing. In the Theaetetus, Plato observed that a dialogic conversation with the self is a familiar aspect of human experience; there, Socrates defines thinking as “a talk which the soul has with itself about the objects under its consideration.”

He describes it as a silent, internal process of questioning and answering:

…the soul when it thinks is simply carrying on a discussion in which it asks itself questions and answers them itself, affirms and denies. And when it arrives at something definite… we call this its judgment.

- Plato’s dialogue, Theaetetus, 189e – 190a

This part of the dialogue occurs as Socrates and Theaetetus are trying (and failing) to define “knowledge.” They are in the process of examining the idea of “knowledge as true judgment,” and Socrates introduces this definition of “thinking” to explore how “false judgment” (or “thinking” incorrectly) might be possible. This specific mechanism - the soul’s silent, internal dialogue of questions and answers - is the very definition that the “Inner Speech” paper and modern AI agent theory build upon.

The agent’s “Thought” step is a direct, practical implementation of this Platonic idea - a silent, internal dialogue used to reason, plan, and arrive at a judgment before acting.

There are two influential theoretical perspectives on the cognitive function of inner speech. One concerns the development of verbal mediation of cognition and behavior; the other concerns rehearsal in working memory.

Vygotsky’s Theory#

Lev Vygotsky’s theory is a cornerstone of developmental psychology and is the perfect psychological root for understanding why modern AI agents are designed the way they are. At the heart of Vygotsky’s theory is a revolutionary idea: complex, abstract thinking is not an innate, individual ability, but rather the internalization of social processes.

He argued that our higher cognitive functions (like planning, reasoning, and self-control) are born from our social interactions with others. The primary “tool” for this transformation is language.

Vygotsky’s most famous work, Thought and Language, proposed that thought and language have separate origins.

  • Pre-linguistic Thought: A baby can “think” in a basic, practical way (e.g., “I am hungry,” “That object is far away”).
  • Pre-intellectual Speech: A baby can “speak” (cry, babble) to express emotion, but not to formulate a logical thought.

The “most significant moment” in cognitive development, he argued, occurs around age two, when these two lines converge and language and thought become intertwined. At that point, language ceases to be merely a tool for communication with others and becomes the primary tool for thinking itself. This transformation happens in three distinct, observable stages, and these three stages of speech development map directly onto AI agent theory.

  1. Social Speech (or External Speech)

    This is the first stage, from birth to about age 3. Speech at this stage is purely external and social. Its sole purpose is to communicate with others. A child uses words to control the behavior of others (“Want milk!”), express emotion (“Bad!”), or make a request. It is a tool for interacting with the outside world.

    This is like a simple, non-agentic LLM. You give it a prompt (an external stimulus) and it gives you a direct response (an external output). It is a pure call-and-response.

  2. Private Speech (or Egocentric Speech)

    This is the critical transitional stage, peaking around ages 3 to 7. This is the phenomenon we observe when a child is playing alone and talking to themselves out loud. For example, a child doing a puzzle might say, “No, that piece is blue… I need a red one… where is the edge piece? Yes, put it here.”

    Vygotsky’s great insight was that this is not just meaningless chatter. The child is using language as a tool for self-regulation. They have taken the social, back-and-forth dialogue they used to have with a parent (“Where does this piece go?” “Try the red one.”) and internalized it. They are now playing both roles, using their own voice to guide their own thoughts, direct their attention, and plan their next steps.

    This is exactly what the ReAct (Reasoning and Acting) framework does. The agent’s “Thought” is a form of private speech. It writes down its plan (“I need to find the capital of France… my first action will be to search…”) in an external, observable “scratchpad.” It is literally thinking aloud to regulate its own behavior and execute a complex plan.

  3. Inner Speech (or Silent Speech)

    From age 7 onward, this “private speech” doesn’t disappear; it “goes underground.” It becomes silent, internalized thought. This is our mature “inner monologue” or “stream of consciousness.” It is the fast, condensed, abbreviated silent dialogue we have with ourselves (the “dialogic conversation with the self” that Plato described). We use it to plan our day, reason through a problem, and direct our own behavior, but it now happens entirely within our minds.

    This is the goal of a sophisticated agent. A truly advanced agent wouldn’t need to write down every single “Thought.” It would be able to perform many of these reasoning steps internally, only surfacing its plan or actions when necessary.

Vygotsky’s theory provides the psychological justification for why the “Thought” step in an agent is so critical:

  • Self-Regulation: The “Thought” step is the agent’s mechanism for self-regulation. Instead of just reacting to the user’s prompt (stimulus-response), it pauses to think. It formulates a plan, critiques its own ideas, and directs its future actions.
  • Problem-Solving: Vygotsky noted that children’s use of private speech increases when a task is difficult or when they make a mistake. This is the same for an agent. If an agent’s “Act” step fails (e.g., a search query returns an error), its next “Thought” step is its “private speech” kicking in to solve the new problem: “Okay, that tool failed. I will try a different tool,” or “My search query was bad. I will formulate a better one.”
  • Interpretability: The agent’s externalized “Thought” trace is a perfect log of its “private speech.” For us as developers, it allows us to do what a teacher does with a student: “Show me your work.” We can read the agent’s step-by-step reasoning to debug its logic, which is impossible if its “thought” is a black box.

Vygotsky’s theory explains that the ability to “think aloud” (Private Speech) is the crucial developmental bridge that allows a simple social actor to become a complex, independent, and self-regulating thinker. The ReAct framework, by forcing the LLM to “think aloud” with its Thought step, is essentially guiding the AI through this same cognitive leap.
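The Thought → Action → Observation cycle described above can be made concrete with a small sketch of a ReAct-style loop. This is purely illustrative, not the paper’s implementation: `llm` stands in for any prompt-to-text model call, and `tools` is a hypothetical mapping from tool names to functions.

```python
# Minimal ReAct-style loop (a sketch, not the paper's implementation).
# The agent alternates Thought / Action / Observation in a text
# "scratchpad" -- its externalized private speech -- until it finishes.
def react_agent(question, tools, llm, max_steps=5):
    """`llm` is any callable prompt -> text; `tools` maps names to callables."""
    scratchpad = f"Question: {question}\n"
    for _ in range(max_steps):
        # "Private speech": the model writes out its reasoning first.
        thought = llm(scratchpad + "Thought:")
        scratchpad += f"Thought: {thought}\n"

        # Then it commits to an action, e.g. "search[capital of France]".
        action = llm(scratchpad + "Action:")
        scratchpad += f"Action: {action}\n"

        if action.startswith("finish["):
            return action[len("finish["):-1]  # final answer

        # Execute the tool; the result becomes the next Observation,
        # which the next Thought can react to (including failures).
        name, arg = action.split("[", 1)
        observation = tools[name](arg.rstrip("]"))
        scratchpad += f"Observation: {observation}\n"
    return None  # gave up within the step budget
```

The scratchpad string here is exactly the “show me your work” trace discussed above: every Thought is logged before any Action is taken.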

Inner Speech in Working Memory#

Vygotsky’s theory is developmental and functional. It explains why we have inner speech (for self-regulation) and how we get it (by internalizing social speech).

There is another major pillar of “inner speech” research called Working Memory theory, which is cognitive and architectural. It explains what inner speech is and how it operates as a specific mechanism in our brain’s short-term memory.

TIP

Working memory refers to the retention of information “online” during a complex task, such as keeping a set of directions in mind while navigating around a new building, or rehearsing a shopping list.

This theory is a key part of the Baddeley & Hitch model of working memory, which proposes that “working memory” (what we used to call “short-term memory”) is not a passive storage box but an active, multi-component “mental workbench.” This workbench has a “boss” and two main “assistants”:

  1. The Central Executive: This is the “boss.” It’s the flexible, high-level attention-control system. It’s the “you” that decides, “I need to remember this phone number” or “I need to solve this math problem.” It doesn’t store information itself; it just directs the other systems.
  2. The Visuo-Spatial Sketchpad: This is the “inner eye.” It’s the assistant that holds and manipulates visual and spatial information (e.g., picturing a map, mentally rotating a 3D shape).
  3. The Phonological Loop: This is the “inner voice.” It is the assistant responsible for holding and manipulating language-based information.

Baddeley’s great insight was to break this “inner voice” down into two sub-parts that work in a continuous loop:

  1. The Phonological Store (The “Inner Ear”): This is a passive storage buffer. It can hold a small amount of sound-based (phonological) information for a very brief time - about 1 to 2 seconds. Any spoken word we hear (e.g., someone tells us a phone number) enters this store directly. Think of this as a tiny, rapidly-fading audio recording.
  2. The Articulatory Rehearsal Process (The “Inner Voice”): This is an active rehearsal process. It is, quite literally, our inner speech. Its job is to “read” the information in the phonological store and then “speak” it again, feeding it back into the store. This act of silent, internal “re-speaking” is what refreshes the memory and prevents it from fading. Crucially, it also works for written information. When we read a word on a page, this process translates that visual text into an internal, sound-based code and speaks it into the phonological store.

Consider the everyday experience of remembering a phone number (555-867-5309) when we don’t have a pen. Our Central Executive (“boss”) fires up and says: “This is important. I need to remember this number for the next 30 seconds.” It directs our attention. We hear “555-867-5309,” and the sounds immediately enter our Phonological Store (“inner ear”). But after 1-2 seconds, those sounds will decay and be gone. So our Articulatory Rehearsal Process (“inner voice”) immediately kicks in: we start silently saying to ourselves, “five-five-five, eight-six-seven, five-three-oh-nine… five-five-five, eight-six-seven, five-three-oh-nine…”. Each time our “inner voice” rehearses the number, it is like a fresh “recording” fed back into the “inner ear” (the phonological store), refreshing it for another couple of seconds. This continuous cycle is the phonological loop - “Inner Speech in Working Memory.”
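As a loose computational analogy (my own illustration, not a model from the psychology literature), the phonological loop behaves like a buffer whose contents decay unless an active rehearsal process keeps rewriting them:

```python
class PhonologicalLoop:
    """Toy analogy: a decaying store refreshed by silent rehearsal."""
    DECAY_TICKS = 2  # contents fade after ~2 "seconds" unrehearsed

    def __init__(self):
        self.store = None  # the "inner ear" buffer
        self.age = 0       # ticks since the last refresh

    def hear(self, sounds):
        """New spoken input enters the store directly."""
        self.store, self.age = sounds, 0

    def rehearse(self):
        """The "inner voice" re-speaks the contents, resetting decay."""
        if self.store is not None:
            self.age = 0

    def tick(self):
        """One unit of time passes; unrehearsed contents decay."""
        self.age += 1
        if self.age > self.DECAY_TICKS:
            self.store = None  # the memory has faded
```

Keep calling `rehearse()` and the number survives indefinitely; stop, and within a couple of ticks the store is empty.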

To create a truly robust AI agent, we need both:

  • a Working Memory (a context window or scratchpad) to hold information (Baddeley’s part), and
  • a Reasoning Loop (like ReAct) that uses “inner speech” to reflect on and act upon that information (Vygotsky’s part).
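One way to picture the “working memory” half of that pairing (a sketch with assumed names, not a standard API): the agent’s scratchpad is a bounded buffer of recent Thought/Action/Observation entries, and, like human working memory, its capacity is limited, so older entries fall out unless the reasoning loop deliberately carries them forward.

```python
from collections import deque

class AgentWorkingMemory:
    """Bounded scratchpad: a rough analogue of limited-capacity
    working memory. Oldest entries are evicted automatically."""
    def __init__(self, capacity=8):
        self.entries = deque(maxlen=capacity)

    def write(self, kind, text):
        """Record one step, e.g. write("Thought", "search first")."""
        self.entries.append(f"{kind}: {text}")

    def render(self):
        """The prompt context the reasoning loop actually sees."""
        return "\n".join(self.entries)
```

A reasoning loop would call `write()` after each Thought, Action, and Observation, and pass `render()` back to the model as its context.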

Relating to ReAct Agent#

The introduction of the ReAct paper cites “Inner Speech”1 because that literature provides the core psychological blueprint for why an agent’s “Thoughts” are so powerful.

The ReAct paper demonstrates that interleaving reasoning and acting works. The “Inner Speech” paper, by Alderson-Day & Fernyhough, explains the cognitive function of this internal monologue, drawing heavily on the theories of psychologist Lev Vygotsky.

The central finding of the “Inner Speech” paper is that our “inner monologue” isn’t just a passive side effect of thinking. Instead, it is an active cognitive tool that we use to direct and regulate our own minds. It shows that what Google’s researchers built is not just a clever engineering trick, but a functional analogue of a core cognitive tool that humans evolved for self-regulation, planning, and complex problem-solving.

Footnotes#

  1. Ben Alderson-Day and Charles Fernyhough. Inner speech: development, cognitive functions, phenomenology, and neurobiology. Psychological Bulletin, 141(5):931, 2015.

Author: OpenML Blogs
Published: 2026-05-05
License: CC BY-NC-SA 4.0
Source: https://blogs.openml.io/posts/agent-inner-speech/