AI Agent - ReAct: Reasoning and Acting


2026-05-05

2026-05-22


Introduction#

A unique feature of human intelligence is the ability to seamlessly combine task-oriented actions with verbal reasoning (or inner speech), which has been theorized to play an important role in human cognition by enabling self-regulation and strategization and by maintaining a working memory.

For example, when cooking a dish in the kitchen, between any two specific actions we may reason in language in order to:

track progress: “now that everything is cut, I should heat up the pot of water”
handle exceptions or adjust the plan according to the situation: “I don’t have salt, so let me use soy sauce and pepper instead”
realize when external information is needed: “how do I prepare dough? Let me search on the Internet”.

We may also act (open a cookbook to read the recipe, open the fridge, check ingredients) to support the reasoning and to answer questions like “What dish can I make right now?”.

This tight synergy between “acting” and “reasoning” allows humans to learn new tasks quickly and perform robust decision-making or reasoning, even under previously unseen circumstances or facing information uncertainties.

Before ReAct#

There have been earlier efforts to combine verbal reasoning with interactive decision-making in autonomous systems.

LLM#

Properly prompted large language models (LLMs) have demonstrated emergent capabilities to carry out several steps of reasoning traces to derive answers from questions in arithmetic, commonsense, and symbolic reasoning tasks.

However, this “chain-of-thought” reasoning is a static black box, in that the model uses its own internal representations to generate thoughts and is not grounded in the external world, which limits its ability to reason reactively or update its knowledge.

NOTE
A retrieval-augmented generation (RAG) system can address the “not grounded in the external world” criticism. When an external database is introduced, the system retrieves relevant documents based on the initial query and injects them into the model’s context window. This grounds the subsequent chain-of-thought in factual, verifiable information rather than relying purely on the model’s frozen internal representations. The model is no longer generating thoughts in a complete vacuum, which significantly reduces hallucinations and provides the required external context for that specific prompt.
RAG falls short, however, of solving the inability to “reason reactively”. Standard RAG operates as a single, upfront retrieval step. Once the retrieved context is provided, the model begins its chain-of-thought reasoning as a continuous, uninterrupted sequence. If the model realizes halfway through its reasoning that it needs an additional, highly specific piece of information to proceed, a standard RAG setup cannot pause and fetch it. The reasoning process remains a static trajectory from start to finish, unable to adapt to intermediate discoveries.
As we shall see in a bit, this gap is exactly what the ReAct framework addresses. By interleaving thoughts with discrete actions, the model can query an external knowledge base, read the resulting observation, and let that new information dictate its next thought. While a database connection successfully anchors the model in external reality at the very beginning of a task, it takes an interactive loop to allow the model to actively update its knowledge mid-thought and reason reactively to newly discovered facts.
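The difference between a single upfront retrieval and an interleaved loop can be sketched with a toy example. Everything here is an assumption made for illustration: the `KNOWLEDGE_BASE` entries, the exact-match `retrieve` function, and the hard-coded two-step lookup that stands in for the model's reasoning. No real retriever or LLM is involved.

```python
# Toy knowledge base; the entries exist only for this illustration.
KNOWLEDGE_BASE = {
    "capital of France": "Paris",
    "river through Paris": "Seine",
}


def retrieve(query: str) -> str:
    """Hypothetical retriever: exact-match lookup in the toy knowledge base."""
    return KNOWLEDGE_BASE.get(query, "no result")


def single_retrieval(question: str) -> list[str]:
    """Standard RAG: one lookup with the original question, then no more."""
    return [retrieve(question)]


def interleaved_retrieval() -> list[str]:
    """ReAct-style: the second query is built from the first observation."""
    capital = retrieve("capital of France")        # act, then observe
    river = retrieve(f"river through {capital}")   # next action depends on it
    return [capital, river]


print(single_retrieval("river through the capital of France"))  # ['no result']
print(interleaved_retrieval())                                   # ['Paris', 'Seine']
```

The second query in `interleaved_retrieval` cannot be written until the first observation arrives, which is the step a one-shot retrieval pipeline has no place for.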

WebGPT#

On the other hand, a separate line of work has explored the use of pre-trained language models for planning and acting in interactive environments, such as OpenAI’s WebGPT, with a focus on predicting actions via language priors. These approaches usually convert multi-modal observations into text, use a language model to generate domain-specific actions or plans, and then use a controller to choose or execute them.

TIP
Instead of physical space, WebGPT operated in a text-based web browsing environment. The system was trained to predict navigation actions like typing a query into a search bar, clicking on a link, or scrolling down a page to answer user questions. It functioned by taking the current state of the web browser as a text observation and directly outputting the next optimal browser command. While highly effective at navigating the web, it did not produce an internal monologue to strategize its search methodology before executing the clicks.

However, these approaches do not employ language models to reason abstractly about high-level goals or to maintain a working memory that supports acting.

ReAct#

ReAct is a general paradigm that combines reasoning and acting with language models for solving diverse language reasoning and decision-making tasks. ReAct prompts LLMs to generate both verbal reasoning traces and actions pertaining to a task in an interleaved manner. This allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting (reason to act), while also interacting with external environments (e.g., Wikipedia) to incorporate additional information into its reasoning (act to reason).

ReAct: Synergizing Reasoning And Acting#

Consider a general setup of an agent interacting with an environment for task solving. At time step $t$ , an agent receives an observation $o_t \in \mathcal{O}$ from the environment and takes an action $a_t \in \mathcal{A}$ following some policy $\pi(a_t | c_t)$ , where $c_t = (o_1, a_1, \cdots, o_{t-1}, a_{t-1}, o_t)$ is the context to the agent. Learning a policy is challenging when the mapping $c_t \mapsto a_t$ is highly implicit and requires extensive computation.
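The interaction loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the counter environment, the `always_inc` policy, and the function names are all hypothetical, and the context is stored as a flat tuple of strings that alternates observations and actions in the same order as $c_t$.

```python
from typing import Callable, Tuple

Context = Tuple[str, ...]  # (o_1, a_1, ..., o_{t-1}, a_{t-1}, o_t)


def run_episode(
    policy: Callable[[Context], str],
    env_step: Callable[[str], str],
    first_observation: str,
    max_steps: int,
) -> Context:
    """Roll out the agent-environment loop and return the final context."""
    context: Context = (first_observation,)      # c_1 = (o_1)
    for _ in range(max_steps):
        action = policy(context)                 # a_t chosen from pi(a_t | c_t)
        observation = env_step(action)           # environment returns o_{t+1}
        context = context + (action, observation)
    return context


def make_counter_env() -> Callable[[str], str]:
    """Hypothetical environment: a counter that reports its value as text."""
    state = {"n": 0}

    def step(action: str) -> str:
        if action == "inc":
            state["n"] += 1
        return str(state["n"])

    return step


def always_inc(context: Context) -> str:
    """Trivial policy that ignores the context and always increments."""
    return "inc"


trajectory = run_episode(always_inc, make_counter_env(), "0", max_steps=2)
print(trajectory)  # ('0', 'inc', '1', 'inc', '2')
```

A real policy would inspect `context` before choosing; the difficulty the text describes is precisely that this mapping from context to action can be highly implicit.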

What is a policy?
The term “policy” in this context is a core concept borrowed directly from the field of Reinforcement Learning (RL), whose definitive theoretical introduction [Reinforcement Learning: An Introduction] defines it as follows:
“A policy defines the learning agent’s way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states. It corresponds to what in psychology would be called a set of stimulus–response rules or associations. In some cases the policy may be a simple function or lookup table, whereas in others it may involve extensive computation such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behavior. In general, policies may be stochastic, specifying probabilities for each action.”
In short, a policy is the “brain” or “strategy” of an agent. It is the set of rules that dictates what action the agent will take in any given state.
To break that down:

Agent: The AI (in this case, the LLM).

Environment: The world the agent interacts with (e.g., a website, a code terminal, a Wikipedia API).

State: The agent’s current situation (e.g., the webpage it’s on, the code it has written so far, the user’s initial question).

Policy: The function that looks at the current state and decides what action to take next. It answers the question, “Given what’s happening right now, what should I do?”
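The two kinds of policy mentioned in the quote, a lookup table and a stochastic policy, can be illustrated as follows. The state names, action names, and probabilities are invented for this sketch and reuse the cooking example from the introduction; an LLM-based policy would replace these tables with a model call.

```python
import random

# Deterministic policy as a lookup table: each state maps to one action.
TABLE_POLICY = {
    "everything_cut": "heat_pot_of_water",
    "water_boiling": "add_ingredients",
}

# Stochastic policy: each state maps to a probability for every action.
STOCHASTIC_POLICY = {
    "no_salt": {"use_soy_sauce": 0.7, "search_the_internet": 0.3},
}


def table_policy(state: str) -> str:
    """Return the single action the table prescribes for this state."""
    return TABLE_POLICY[state]


def sample_action(state: str, rng: random.Random) -> str:
    """Draw one action according to the state's action probabilities."""
    distribution = STOCHASTIC_POLICY[state]
    actions = list(distribution.keys())
    weights = list(distribution.values())
    return rng.choices(actions, weights=weights, k=1)[0]


print(table_policy("everything_cut"))              # heat_pot_of_water
print(sample_action("no_salt", random.Random(0)))  # one of the two actions
```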

The idea of ReAct is simple: we augment the agent’s action space to $\hat{\mathcal{A}} = \mathcal{A} \cup \mathcal{L}$ , where $\mathcal{L}$ is the space of language. An action $\hat{a}_t \in \mathcal{L}$ in the language space, which we will refer to as a thought or a reasoning trace, does not affect the external environment and therefore produces no observation feedback. Instead, a thought $\hat{a}_t$ aims to compose useful information by reasoning over the current context $c_t$ , and to update the context to $c_{t+1} = (c_t, \hat{a}_t)$ to support future reasoning or acting.
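The distinction between the two kinds of action can be made concrete with a small sketch. The `Thought` and `Act` classes, the `step` function, and the `fake_env` stub are hypothetical names chosen for this illustration; the only behavior taken from the text is that a thought is appended to the context without touching the environment, while an environment action also appends the observation it produces.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Thought:
    """An action in the language space L: changes only the context."""
    text: str


@dataclass(frozen=True)
class Act:
    """An action in the original action space A: sent to the environment."""
    command: str


def step(context: Tuple, action, env_step: Callable[[str], str]) -> Tuple:
    """Apply one action from the augmented action space and extend the context."""
    if isinstance(action, Thought):
        return context + (action,)               # c_{t+1} = (c_t, thought)
    observation = env_step(action.command)       # only Act reaches the environment
    return context + (action, observation)


calls: List[str] = []


def fake_env(command: str) -> str:
    """Stub environment that records every command it receives."""
    calls.append(command)
    return f"result of {command}"


context: Tuple = ("question",)
context = step(context, Thought("I should search first"), fake_env)
calls_after_thought = list(calls)                # still empty: no feedback
context = step(context, Act("search[ReAct]"), fake_env)
print(len(context), calls)                       # 4 ['search[ReAct]']
```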

However, because the language space $\mathcal{L}$ is unlimited, learning in this augmented action space is difficult and requires strong language priors. The paper’s solution is to rely on the internal generative capabilities of the large language model. The context consists of:

the initial prompt
the task at hand
the running history of prior thoughts, actions, and external observations

The model composes this information simply by generating the next logical sequence of text based on its pre-trained weights, guided heavily by few-shot examples provided in the prompt.
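One way to picture how these three pieces become a single text input is the sketch below. The function name `build_context`, the `Question:`/`Thought:`/`Action:`/`Observation:` labels, and the example strings are assumptions for illustration rather than the paper's exact prompt; the few-shot example and the history entries are made up.

```python
from typing import List, Tuple


def build_context(
    initial_prompt: str,
    few_shot_examples: List[str],
    task: str,
    history: List[Tuple[str, str]],
) -> str:
    """Concatenate prompt, examples, task, and running history into one string."""
    parts = [initial_prompt]
    parts.extend(few_shot_examples)              # worked trajectories that guide the model
    parts.append(f"Question: {task}")
    for kind, text in history:                   # kind: Thought, Action, or Observation
        parts.append(f"{kind}: {text}")
    return "\n".join(parts)


prompt = build_context(
    initial_prompt="Solve the task by interleaving Thought, Action, and Observation steps.",
    few_shot_examples=[
        "Question: What is 2 + 2?\nThought: This is simple arithmetic.\nAction: finish[4]"
    ],
    task="How do I prepare dough?",
    history=[
        ("Thought", "I need external information, so I should search."),
        ("Action", "search[dough recipe]"),
        ("Observation", "Mix flour, water, yeast, and salt, then knead."),
    ],
)
print(prompt)
```

The model would then be asked to continue this text, and its continuation (a new thought or action) would be appended to `history` before the next call.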