03 - Agents

If the environment is the market translated into an executable system, the agent is the decision-maker translated into an adaptive rule. In economic language, the agent plays the role of the firm. It observes some information about the market, chooses a quantity or a price, receives profit, and then adjusts future behavior in light of that experience. Reinforcement learning gives that adjustment process a precise computational form.

This point matters because the project is not interested in equilibrium as a static object alone. It is interested in how a market gets there, or fails to get there, when firms do not begin with full rational knowledge of the game. The agent is the object that carries that learning dynamic. Without the agent, the model is only a market structure. With the agent, it becomes a theory of adaptive behavior.

What an Agent Is

In reinforcement learning, an agent is a rule that maps observations into actions and then updates itself using rewards. That definition may sound abstract, but in this project it has a direct economic interpretation. The observation is the information available to the firm: profit, market price, rivals’ actions, or some subset of these. The action is the firm’s strategic choice: quantity in Cournot or price in Bertrand. The reward is profit. The update is the firm’s learning process.
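The definition above can be sketched as a small interface. This is only an illustration: the names `Agent`, `act`, `update`, and `RunningAverageAgent` are hypothetical and not taken from the project, and the example agent ignores its observation to keep the sketch short.

```python
import random
from typing import Protocol


class Agent(Protocol):
    """Hypothetical interface; method names are illustrative assumptions."""

    def act(self, observation: int) -> int:
        """Map an observation to an action index (a quantity or price level)."""
        ...

    def update(self, observation: int, action: int, profit: float) -> None:
        """Revise the internal rule using the realized profit."""
        ...


class RunningAverageAgent:
    """Tracks the average profit of each action and usually picks the best so far."""

    def __init__(self, n_actions: int, explore_prob: float = 0.1, seed: int = 0):
        self.n_actions = n_actions
        self.explore_prob = explore_prob
        self.rng = random.Random(seed)
        self.avg_profit = [0.0] * n_actions
        self.counts = [0] * n_actions

    def act(self, observation: int) -> int:
        # Assumption: the observation is ignored in this minimal sketch.
        if self.rng.random() < self.explore_prob:
            return self.rng.randrange(self.n_actions)  # experiment
        return max(range(self.n_actions), key=lambda a: self.avg_profit[a])

    def update(self, observation: int, action: int, profit: float) -> None:
        # Incremental sample mean of the profits earned by this action.
        self.counts[action] += 1
        step = 1.0 / self.counts[action]
        self.avg_profit[action] += step * (profit - self.avg_profit[action])
```

Each of the four objects in the definition appears here: the observation and action are arguments, the reward is `profit`, and the update is the running-average step.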

The agent is best thought of neither as a person nor as a fully rational planner, but as a behavioral mechanism. It does not solve the game analytically before acting. It experiments, collects payoffs, and gradually shifts its behavior toward actions that seem more valuable. In that sense, the agent is a formal model of boundedly rational adaptation.

For an audience in economics, this is the central bridge. Standard game theory typically starts from best responses and fixed points. Reinforcement learning starts from trial, error, and adjustment. The agent is the component that converts one language into the other. It is where strategic optimization is approximated by a repeated learning process rather than imposed as an equilibrium condition from the outset.

The Agent’s Basic Cycle

The internal logic of the agent can be written in a short sequence.

  1. The agent receives an observation from the environment.
  2. It chooses an action according to its current policy.
  3. The market clears and returns a profit.
  4. The agent treats that profit as feedback.
  5. It updates its internal rule before the next period.

```mermaid
flowchart LR
    A[Observation from market] --> B[Agent chooses action]
    B --> C[Environment returns profit]
    C --> D[Agent updates policy or value rule]
    D --> A
```

This loop is simple in structure but rich in economic content. A firm that repeatedly raises output and sees profits fall may learn to contract. A firm that matches a rival’s high price and observes sustained profit may learn to keep doing so. The important point is that the agent does not need to “know” the equilibrium analytically. Patterns of behavior can emerge from the accumulation of rewarded and punished actions.
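The cycle can be sketched in a few lines. The code assumes a linear inverse demand, a constant marginal cost, a rival that always produces the same quantity, and a five-point quantity grid. All of these are illustrative assumptions, as are the names `profit` and `run_cycle`. Because the rival is fixed, the observation carries no information and is omitted.

```python
import random


def profit(q_own, q_rival, a=10.0, b=1.0, c=1.0):
    """Profit under assumed linear inverse demand P = a - b * (q_own + q_rival)."""
    price = max(a - b * (q_own + q_rival), 0.0)
    return (price - c) * q_own


def run_cycle(n_periods=2000, explore_prob=0.1, seed=0):
    """One learner against a fixed rival; returns the quantity it values most."""
    rng = random.Random(seed)
    quantities = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical action grid
    rival_quantity = 3.0                    # assumption: the rival never adapts
    value = [0.0] * len(quantities)         # estimated profit of each action
    counts = [0] * len(quantities)

    for _ in range(n_periods):
        # Steps 1-2: choose an action from the current rule, with some exploration.
        if rng.random() < explore_prob:
            idx = rng.randrange(len(quantities))
        else:
            idx = max(range(len(quantities)), key=lambda i: value[i])
        # Step 3: the market clears and returns a profit.
        pi = profit(quantities[idx], rival_quantity)
        # Steps 4-5: treat the profit as feedback and update the running average.
        counts[idx] += 1
        value[idx] += (pi - value[idx]) / counts[idx]

    return quantities[max(range(len(quantities)), key=lambda i: value[i])]
```

With these parameters the analytic best response to a rival output of 3 is (a - c - b * 3) / (2b) = 3, and the learner settles on that grid point without ever computing the formula.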

Observation, Action, Reward

To understand the agent, it helps to separate the three objects that define its decision problem.

The first is the observation. This is not the full state of the world in a philosophical sense. It is the information that the model allows the agent to use. That distinction is economically important. A firm that sees only its own profit is in a very different strategic position from a firm that also sees rivals’ prices or quantities. The same market can therefore induce different learning dynamics depending on what the agent observes.

The second is the action. In Cournot, the action is a quantity choice. In Bertrand, it is a price choice. This sounds obvious, but it shapes the learning problem deeply. Quantities and prices are not just different labels. They generate different strategic feedback. An agent learning over quantities moves through a game of strategic substitutes: when a rival expands output, the firm’s best response is to contract. An agent learning over prices under differentiated demand moves through a game of strategic complements: when a rival raises its price, the firm’s best response is to raise its own. The agent’s behavior cannot be understood apart from that structure.
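The difference in strategic feedback shows up in the slope of the best-response function. The sketch below assumes linear demand and constant marginal cost in both games. The functional forms and parameter values are illustrative assumptions, not the project's specification.

```python
def cournot_best_response(q_rival, a=10.0, b=1.0, c=1.0):
    """Best-response quantity when inverse demand is P = a - b * (q_own + q_rival).

    Maximizing (P - c) * q_own gives q_own = (a - c - b * q_rival) / (2b),
    which slopes DOWN in the rival's quantity (strategic substitutes).
    """
    return max((a - c - b * q_rival) / (2.0 * b), 0.0)


def bertrand_best_response(p_rival, alpha=10.0, beta=2.0, gamma=1.0, c=1.0):
    """Best-response price when own demand is q = alpha - beta * p_own + gamma * p_rival.

    Maximizing (p_own - c) * q gives p_own = (alpha + beta * c + gamma * p_rival) / (2 * beta),
    which slopes UP in the rival's price when gamma > 0 (strategic complements).
    """
    return (alpha + beta * c + gamma * p_rival) / (2.0 * beta)
```

For example, `cournot_best_response(2.0)` exceeds `cournot_best_response(4.0)`, while `bertrand_best_response(4.0)` exceeds `bertrand_best_response(2.0)`.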

The third is the reward. Here the reward is not artificial. It is firm profit. That is an important modeling commitment. The agent is not being taught by an external planner with a custom objective. It is adapting directly to economic incentives. If collusive patterns emerge, they emerge because the reward structure of the game makes them attractive under the agent’s learning rule.

Policy, Value, and Learning Rule

Every agent has an internal rule that determines how observations become actions. In reinforcement learning, this rule is usually described either as a policy or as a value representation.

A policy is a direct mapping from observations to actions, or to probabilities over actions. In economic terms, it can be read as a behavioral strategy: given this market signal, how likely is the firm to choose each feasible move?
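A common way to turn action preferences into probabilities is a softmax rule. The sketch below is one possible parameterization, not necessarily the one used in this project. The names `softmax_policy` and `temperature` are illustrative.

```python
import math


def softmax_policy(preferences, temperature=1.0):
    """Convert one preference score per action into choice probabilities.

    A lower temperature concentrates probability on the highest-scoring action.
    A higher temperature spreads it more evenly across actions.
    """
    top = max(preferences)  # subtract the maximum for numerical stability
    weights = [math.exp((p - top) / temperature) for p in preferences]
    total = sum(weights)
    return [w / total for w in weights]
```

Read as a behavioral strategy, `softmax_policy([1.0, 2.0, 3.0])` says the firm chooses the third action most often but still plays the others with positive probability.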

A value representation is slightly more indirect. Instead of mapping observations to actions directly, the agent estimates how good each action is in a given context and then chooses among actions using those estimates. In economic language, this is close to an adaptive approximation of continuation value: which move seems to lead to better future payoffs?

The learning rule is what updates that internal object. After observing the consequences of an action, the agent revises its policy or its value estimates. Different algorithms do this in different ways. Some emphasize value estimation, some emphasize direct policy improvement, and some treat action weights as an evolving response to realized payoffs. But at a high level they all solve the same problem: how should a firm change its future behavior after observing profit today?
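As one concrete instance of a value-based learning rule, the sketch below shows the standard tabular Q-learning update. It is offered as an example of the family, not as the specific algorithm used here. The function name `q_update`, the dictionary layout, and the default `alpha` and `gamma` values are assumptions.

```python
from collections import defaultdict


def q_update(q_table, obs, action, profit, next_obs, n_actions,
             alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning step and return the updated estimate.

    q_table maps (observation, action) pairs to estimated long-run value.
    alpha is the learning rate and gamma is the discount factor.
    """
    # Value of the best action available after the market has moved on.
    best_next = max(q_table[(next_obs, a)] for a in range(n_actions))
    # Target: today's profit plus the discounted estimate of what follows.
    target = profit + gamma * best_next
    # Move the old estimate a fraction alpha of the way toward the target.
    q_table[(obs, action)] += alpha * (target - q_table[(obs, action)])
    return q_table[(obs, action)]


# Usage: unseen (observation, action) pairs start at zero.
q_table = defaultdict(float)
q_update(q_table, obs=0, action=1, profit=4.0, next_obs=1, n_actions=3)
```

The update answers the question posed above in the narrowest possible way: after seeing today's profit, the firm nudges its estimate of that action's value toward what it just experienced plus what it expects next.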

Why Exploration Matters

An economic agent in this framework cannot learn by exploiting only what it already believes. It must also explore.

Exploration means trying actions that are not currently estimated to be best. At first sight, this can look irrational. Why would a profit-seeking firm knowingly choose an inferior action? The answer is that without experimentation the firm cannot discover whether its current beliefs are wrong. A strategy that appears unprofitable under limited experience may in fact lead to better long-run outcomes once rivals react.

This is one of the places where reinforcement learning diverges most sharply from static textbook reasoning. In one-shot theory, the firm computes a best response from known primitives. In learning models, the firm often does not know the full payoff landscape in advance. It learns that landscape by acting inside it. Exploration is therefore not noise added to optimization from the outside. It is part of the learning technology itself.

For economics, this matters because some market outcomes may depend on how much experimentation agents conduct, how quickly they stop exploring, and how strongly they respond to rare but informative events. The path to equilibrium is not neutral. The learning rule shapes it.
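How much agents experiment and how quickly they stop can be made explicit with a decaying epsilon-greedy rule. The exponential schedule and its parameters below are illustrative assumptions. Other schedules are equally plausible.

```python
import math
import random


def epsilon_at(t, eps_start=1.0, eps_end=0.01, decay_rate=1e-3):
    """Exploration probability in period t under an assumed exponential decay."""
    return eps_end + (eps_start - eps_end) * math.exp(-decay_rate * t)


def epsilon_greedy(values, epsilon, rng):
    """Pick a random action with probability epsilon, else the best-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))  # explore
    return max(range(len(values)), key=lambda a: values[a])  # exploit


# Usage: early periods are almost pure experimentation, late periods almost none.
rng = random.Random(0)
early_choice = epsilon_greedy([1.0, 5.0, 2.0], epsilon_at(0), rng)
late_choice = epsilon_greedy([1.0, 5.0, 2.0], epsilon_at(50_000), rng)
```

A larger `decay_rate` ends experimentation sooner, which is one way the learning rule can shape the path a market takes.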

The Agent as a Model of the Firm

It is tempting to ask whether these agents are realistic models of actual firms. The right answer is narrower and more useful. They are stylized models of adaptive decision procedures operating under payoff feedback.

Real firms are not tabular Q-functions or neural policies. But they do rely increasingly on algorithmic systems, heuristic optimization, and repeated adjustment based on observed outcomes. The reinforcement learning agent is valuable not because it reproduces every institutional detail of a firm, but because it isolates a mechanism: decentralized adaptation under strategic interdependence.

That mechanism is exactly what matters for questions of tacit coordination and equilibrium selection. If independently learning agents can drift toward supracompetitive outcomes without communication, that is already economically meaningful. The agent is therefore not a metaphorical add-on. It is the core object needed to ask whether strategic learning alone can generate patterns that look collusive.

Interaction With the Environment

The agent never acts in isolation. Its behavior only becomes economically meaningful through interaction with the environment.

The environment defines the market rule: how actions become prices, quantities, profits, and future observations. The agent defines the adaptive rule: how observations and profits become future actions. Put differently, the environment is the structure of the game, while the agent is the process by which firms move through that structure.

This distinction is useful because it keeps two questions separate. One question is structural: what incentives does the market create? The other is behavioral: how do firms learn under those incentives? In this project, neither question is enough alone. A market with collusive potential does not guarantee collusion. A learning rule that can support coordination does not guarantee that coordination will emerge in every market. The observed outcome depends on the interaction between the two.

```mermaid
flowchart TD
    A[Environment provides observation] --> B[Agent]
    B --> C[Action: quantity or price]
    C --> D[Environment clears market]
    D --> E[Profit and next observation]
    E --> F[Agent updates internal rule]
    F --> B
```

This feedback loop is where equilibrium selection becomes an empirical question. A high-profit action today changes future incentives through the responses it induces in rivals. The agent learns from those consequences, not from an external theorem. The environment delivers strategic feedback; the agent absorbs it. Together they generate the market path.
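The interaction can be sketched with two independent learners in a Cournot duopoly. Everything specific here is an assumption made for illustration: the linear demand, the four-point quantity grid, the choice of tabular Q-learning, the linear exploration schedule, and the rule that each firm observes only its rival's previous quantity. The sketch makes no claim about which outcome emerges.

```python
import random

QUANTITIES = [1.0, 2.0, 3.0, 4.0]  # hypothetical action grid


def market_profits(q1, q2, a=10.0, b=1.0, c=1.0):
    """Environment rule: assumed linear inverse demand and constant marginal cost."""
    price = max(a - b * (q1 + q2), 0.0)
    return (price - c) * q1, (price - c) * q2


def simulate(n_periods=5000, alpha=0.1, gamma=0.9, seed=0):
    """Run two independent Q-learners and return the path of chosen quantities."""
    rng = random.Random(seed)
    n = len(QUANTITIES)
    # q_tables[i][obs][action]: firm i's value estimates, where the observation
    # is the index of the rival's previous quantity.
    q_tables = [[[0.0] * n for _ in range(n)] for _ in range(2)]
    last = [rng.randrange(n), rng.randrange(n)]
    history = []

    for t in range(n_periods):
        # Exploration falls linearly, then stays at a small floor.
        eps = max(0.01, 1.0 - t / (0.8 * n_periods))
        obs = [last[1], last[0]]  # each firm sees the rival's last action

        acts = []
        for i in range(2):
            if rng.random() < eps:
                acts.append(rng.randrange(n))
            else:
                row = q_tables[i][obs[i]]
                acts.append(max(range(n), key=lambda k: row[k]))

        # The environment clears the market and returns profits.
        profits = market_profits(QUANTITIES[acts[0]], QUANTITIES[acts[1]])
        next_obs = [acts[1], acts[0]]

        # Each agent updates its own table using only its own profit.
        for i in range(2):
            best_next = max(q_tables[i][next_obs[i]])
            target = profits[i] + gamma * best_next
            old = q_tables[i][obs[i]][acts[i]]
            q_tables[i][obs[i]][acts[i]] = old + alpha * (target - old)

        last = acts
        history.append((QUANTITIES[acts[0]], QUANTITIES[acts[1]]))

    return history
```

Under these parameters the one-shot Nash quantity is (a - c) / (3b) = 3 per firm and the symmetric joint-profit-maximizing quantity is (a - c) / (4b) = 2.25 per firm, so the late-period average of `simulate()` can be compared against both benchmarks. Changing the seed, the exploration schedule, or the observation rule is a direct way to probe how the learning side of the loop affects the market path.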

Why Agents Matter for Economic Interpretation

The purpose of introducing agents is not merely computational. It is interpretive.

When equilibrium is imposed analytically, the economist asks which outcomes are sustainable under rational best responses. When equilibrium is approached through agents, the economist can ask an additional question: which sustainable outcomes are actually learned, and under what informational and algorithmic conditions?

That shift is important. It turns equilibrium selection from a matter of choosing among theoretical results into a dynamic behavioral problem. The agent is the tool that makes that shift possible. It introduces path dependence, experimentation, imperfect information, and adjustment speed into the analysis without abandoning disciplined optimization altogether.

For industrial organization, this is especially useful. Markets with the same primitives may generate different long-run outcomes depending on how firms learn. Two environments with the same demand and cost functions can look very different once one changes what agents observe, how they update, or how much they explore. The agent is therefore not just an implementation detail. It is part of the economic model.
