Environments

An RL environment is the economic game translated into a learning interface.

The Environment in Reinforcement Learning

In reinforcement learning, the environment is the part of the problem that turns a choice into a consequence. Formally, it is convenient to write it as a Markov decision process with tuple $(S, A, P, R)$. $S$ is the space of states or observations available to the agent. $A$ is the action set. $P$ maps current states and actions into the distribution of next states. $R$ returns the reward. This notation does not replace the economics. It reorganizes it. A firm observes a market signal, chooses a quantity or a price, receives profit, and then faces a market altered by strategic interaction.

The basic loop is short: observation, action, reward, next observation. The difficulty is not the syntax. It is the economic content assigned to each object. In Cournot, the natural action is $q_{i}$. In Bertrand, it is $p_{i}$. The reward is firm profit, not an auxiliary training signal. The state depends on the information structure. If the firm observes only its own profit, the problem has bandit feedback. If it observes aggregate output or market price, the state contains a public signal about joint behavior. If it also observes rivals’ actions, the learning problem changes again. State design is not a technical footnote. It determines what can be inferred from experience.

flowchart LR
    A[Agent observes state or market signal] --> B[Agent chooses action]
    B --> C[Environment clears the market]
    C --> D[Environment computes profit as reward]
    D --> E[Environment returns next observation]
    E --> A
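The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the class name `CournotEnv`, the `info` switch, the parameter values, and the placeholder `random_policy` are all assumptions introduced here. The point is only that the same market rule yields three different learning problems depending on what `step` returns as the observation.

```python
import random
from dataclasses import dataclass


@dataclass
class CournotEnv:
    """Toy linear-demand quantity game (hypothetical names and values)."""
    a: float = 10.0          # demand intercept, illustrative
    b: float = 1.0           # demand slope, illustrative
    c: float = 1.0           # marginal cost, illustrative
    info: str = "aggregate"  # "bandit", "aggregate", or "complete"

    def step(self, quantities):
        """Clear the market once; return (observations, rewards), one per firm."""
        total = sum(quantities)
        price = self.a - self.b * total
        rewards = [(price - self.c) * q for q in quantities]
        observations = []
        for i, profit in enumerate(rewards):
            if self.info == "bandit":
                obs = {"profit": profit}
            elif self.info == "aggregate":
                obs = {"profit": profit, "price": price, "total_output": total}
            elif self.info == "complete":
                rivals = [q for j, q in enumerate(quantities) if j != i]
                obs = {"profit": profit, "price": price,
                       "rival_quantities": rivals}
            else:
                raise ValueError(f"unknown information structure: {self.info}")
            observations.append(obs)
        return observations, rewards


def random_policy(observation, rng):
    """Placeholder policy (hypothetical): ignores the observation entirely."""
    return rng.uniform(0.0, 5.0)


rng = random.Random(0)
env = CournotEnv(info="aggregate")
observations = [None, None]  # nothing is observed before the first period
for period in range(3):
    actions = [random_policy(obs, rng) for obs in observations]
    observations, rewards = env.step(actions)
```

Switching `info` changes nothing about how profits are computed. It only changes which keys the agent can condition on, which is the sense in which state design is an economic assumption.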

The transition kernel $P$ has an equally concrete interpretation here. In abstract RL examples, transitions often appear as a black box. In repeated oligopoly, the box is the market rule combined with the rivals’ policies. A quantity choice affects total output $Q$, the implied price, and the signal observed next period. A price choice affects demand shares, profits, and the future information set. The game is stationary at the level of rules, but not at the level of each firm’s experience. Every agent faces an environment that moves because the other agents are learning too.
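A small numerical illustration of that non-stationarity, under assumed values: the market rule below never changes, yet the reward attached to one fixed action moves as the rival's output moves. The linear-demand parameters and the rival path are hypothetical numbers chosen for the example.

```python
def profit(q_own, q_rival, a=10.0, b=1.0, c=1.0):
    """Profit of one firm in a linear-demand duopoly (illustrative parameters)."""
    price = a - b * (q_own + q_rival)
    return (price - c) * q_own


q_own = 3.0                      # the focal firm repeats the same action
rival_path = [4.5, 3.0, 2.25]    # hypothetical rival outputs as the rival learns
rewards = [profit(q_own, q_rival) for q_rival in rival_path]
print(rewards)  # [4.5, 9.0, 11.25]
```

From the focal firm's side, nothing in its own behavior explains the change in reward. Whether it can attribute the change to the rival depends on the observation interface.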

That point matters for equilibrium selection. The environment is the mechanism that carries strategic feedback across time. If the observation space is too thin, agents may fail to distinguish punishment from noise. If it is too rich, coordination may become easier. What looks like an implementation detail in ML language is often a substantive assumption in economic language.

Parallelization as Methodology

Parallelization is not a convenience feature. It is part of the empirical design.

This project compares algorithm classes, information structures, market forms, numbers of firms, and random seeds. Each configuration needs many trajectories before a stable pattern can be separated from an accident of initialization. Repeated games learned by decentralized agents are noisy. Some runs converge near Nash. Some drift upward. Some cycle. A small sample can make these paths look like theory when they are only variance.

The gain from parallel environments is not just speed. It is experimental throughput. Higher throughput makes systematic sweeps feasible under a finite budget of time and hardware. That changes what can be claimed. Without vectorized or parallel execution, the temptation is obvious: reduce the number of seeds, shorten training, or drop treatments that are expensive to run. Each shortcut weakens inference. A design that looks broad in principle becomes narrow in practice.

Differences across information structures often appear in dispersion before they appear in averages. A bandit environment may produce the same mean outcome as a richer environment while generating very different stability properties across runs. That only becomes visible when many replications are available. Parallel execution preserves the width of the experimental grid instead of forcing premature simplification.
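A sketch of what a seed sweep looks like in code, under loud assumptions: `run_seed` is a stand-in for a full training run (here a toy pair of firms following noisy, damped best responses), and the thread pool only illustrates the fan-out structure. Threads do not speed up CPU-bound Python, so a real sweep would more plausibly use processes or vectorized environments. All names and parameter values are hypothetical.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def run_seed(seed, periods=200, a=10.0, b=1.0, c=1.0, noise=0.2):
    """One toy trajectory; returns total output at the end of the run."""
    rng = random.Random(seed)            # private generator: reproducible per seed
    q = [rng.uniform(0.0, 6.0), rng.uniform(0.0, 6.0)]
    for _ in range(periods):
        # One-shot best response to the rival's last quantity.
        best = [(a - c - b * q[1]) / (2 * b), (a - c - b * q[0]) / (2 * b)]
        # Move halfway toward it, add noise, keep quantities non-negative.
        q = [max(0.0, 0.5 * q[i] + 0.5 * best[i] + rng.gauss(0.0, noise))
             for i in range(2)]
    return sum(q)


def sweep(seeds, max_workers=4):
    """Run every seed and return results in seed order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_seed, seeds))


totals = sweep(range(32))
summary = {
    "mean": statistics.mean(totals),
    "stdev": statistics.stdev(totals),
    "min": min(totals),
    "max": max(totals),
}
```

Reporting `stdev`, `min`, and `max` next to `mean` is the design choice that matters here: two treatments can share a mean and still differ in how widely individual runs scatter.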

Cournot Competition

In the Cournot environment, $N$ firms choose quantities simultaneously. Aggregate output is

Q = \sum_{i=1}^{N} q_{i},

and inverse demand is

p (Q) = a - b Q .

Firm $i$ earns

π_{i} = (p (Q) - c) q_{i} .

This profit is the reward returned by the environment. Given the action vector $(q_{1}, \dots, q_{N})$, the environment computes $Q$, updates the market price, and assigns profits to each agent. The one-shot benchmark comes from each firm’s first-order condition, taking rivals’ quantities as given. In a symmetric Nash equilibrium,

q_{i}^{NE} = \frac{a - c}{b ( N + 1 )}, Q^{NE} = \frac{N ( a - c )}{b ( N + 1 )}, p^{NE} = \frac{a + N c}{N + 1} .
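The closed form can be checked against the first-order condition it comes from. With linear demand, maximizing $(a - bQ - c) q_{i}$ over $q_{i}$ gives the best response $q_{i} = (a - c - b Q_{-i}) / (2b)$, where $Q_{-i}$ is rivals' total output. The sketch below, with hypothetical function names and illustrative parameter values, verifies that the symmetric Nash quantity is a fixed point of that best response and that the stated price is consistent with inverse demand.

```python
def cournot_nash(a, b, c, n):
    """Symmetric Nash benchmark for linear-demand Cournot."""
    q = (a - c) / (b * (n + 1))
    return {"q": q, "Q": n * q, "p": (a + n * c) / (n + 1)}


def best_response(rival_output, a, b, c):
    """Profit-maximizing quantity given rivals' total output.

    The floor at zero is an added assumption (quantities cannot be negative).
    """
    return max(0.0, (a - c - b * rival_output) / (2.0 * b))


a, b, c, n = 10.0, 1.0, 1.0, 3       # illustrative values
ne = cournot_nash(a, b, c, n)
rivals = (n - 1) * ne["q"]

# Both gaps should be zero: no firm wants to deviate, and p = a - b * Q.
gap = best_response(rivals, a, b, c) - ne["q"]
price_gap = (a - b * ne["Q"]) - ne["p"]
```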

The collusive benchmark is the monopoly allocation for total output,

Q^{M} = \frac{a - c}{2 b}, p^{M} = \frac{a + c}{2} .

Under symmetry, each firm would produce $q_{i}^{M} = Q^{M} / N$. Since collusion in Cournot means restricting output, the natural collusion index is written on the quantity side:

C I = \frac{Q ^{NE} - Q ^{RL}}{Q ^{NE} - Q ^{M}} .

If $C I = 0$, the learned outcome matches Nash. If $C I = 1$, it matches the monopoly benchmark. Negative values indicate output above the Nash level, so competition is more aggressive than the one-shot benchmark.

The state space still depends on information. Under bandit feedback, the agent observes only profit. Under aggregate information, it may observe $Q$ or $p$. Under complete information, it can also observe rivals’ quantities. The economic game is fixed. The environment changes because the observation interface changes.
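A minimal sketch of the quantity-side index, assuming the learned total output $Q^{RL}$ is supplied by whatever evaluation procedure the experiment uses. The function name and the example values are hypothetical. The guard on `n` is there because with a single firm the Nash and monopoly outputs coincide and the index is undefined.

```python
def cournot_collusion_index(Q_rl, a, b, c, n):
    """Quantity-side collusion index: 0 at Nash, 1 at the monopoly benchmark."""
    if n < 2:
        raise ValueError("the index needs at least two firms")
    Q_ne = n * (a - c) / (b * (n + 1))
    Q_m = (a - c) / (2.0 * b)
    return (Q_ne - Q_rl) / (Q_ne - Q_m)


a, b, c, n = 10.0, 1.0, 1.0, 2     # illustrative: Q_ne = 6.0, Q_m = 4.5
for Q_rl in (6.0, 5.25, 4.5, 6.75):
    print(Q_rl, cournot_collusion_index(Q_rl, a, b, c, n))
# 6.0 0.0
# 5.25 0.5
# 4.5 1.0
# 6.75 -0.5
```

The last row shows the negative case: total output above the Nash level maps to an index below zero.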

Cournot also provides a useful contrast for the Bertrand case. Its symmetric Nash benchmark is closed form. Once parameters $(a, b, c, N)$ are fixed, the target is immediate. That simplicity makes it easier to separate learning dynamics from equilibrium computation.

Bertrand Competition

In the Bertrand environment, firms choose prices instead of quantities. Demand follows the logit structure used by Calvano et al. (2020):

q_{i} = \frac{\exp((a_{i} - p_{i}) / μ)}{1 + \sum_{j=1}^{N} \exp((a_{j} - p_{j}) / μ)} .

Profit is

π_{i} = (p_{i} - c) q_{i} .

With homogeneous goods in the textbook Bertrand model, Nash drives price to marginal cost. Logit demand changes the geometry. Each firm faces a smooth residual demand curve, so the first-order condition becomes

\frac{\partial π _{i}}{\partial p _{i}} = q_{i} + (p_{i} - c) \frac{\partial q _{i}}{\partial p _{i}} = 0.

Using the logit derivative,

\frac{\partial q _{i}}{\partial p _{i}} = - \frac{1}{μ} q_{i} (1 - q_{i}),

the condition can be rewritten as

p_{i} - c = \frac{μ}{1 - q _{i}} .
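The logit derivative used in that step can be checked numerically. The sketch below implements the demand system as written above and compares the analytic slope $-q_{i}(1 - q_{i})/μ$ with a central finite difference. Function names and the parameter values are assumptions made for the example.

```python
import math


def logit_shares(prices, a, mu):
    """Logit demand with an outside option (the 1 in the denominator)."""
    weights = [math.exp((a_i - p_i) / mu) for a_i, p_i in zip(a, prices)]
    denom = 1.0 + sum(weights)
    return [w / denom for w in weights]


def own_price_slope_fd(prices, a, mu, i, h=1e-6):
    """Central finite-difference estimate of dq_i / dp_i."""
    up = list(prices)
    down = list(prices)
    up[i] += h
    down[i] -= h
    return (logit_shares(up, a, mu)[i] - logit_shares(down, a, mu)[i]) / (2 * h)


prices = [1.5, 1.7]       # illustrative asymmetric prices
a = [2.0, 2.0]            # illustrative quality indices
mu = 0.25                 # illustrative differentiation parameter

q = logit_shares(prices, a, mu)
analytic = -q[0] * (1.0 - q[0]) / mu
numeric = own_price_slope_fd(prices, a, mu, 0)
# analytic and numeric should agree up to finite-difference error
```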

At a symmetric equilibrium with $a_{i} = a$ for all $i$, all firms choose the same price $p^{NE}$ and the same demand level $q^{NE}$. The symmetric Nash condition is then

p^{NE} = c + \frac{μ}{1 - q ^{NE}},

with

q^{NE} = \frac{\exp((a - p^{NE}) / μ)}{1 + N \exp((a - p^{NE}) / μ)} .

This system is implicit. Unlike Cournot, there is no comparable closed-form expression for $p^{NE}$. The Bertrand Nash price must be solved numerically once $(a, μ, c, N)$ are fixed. The benchmark itself is already a fixed point.
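One way to solve the system, sketched under stated assumptions: substitute the share equation into the price condition and find the root of $p - c - μ / (1 - q(p))$ by bisection. Bisection is a choice made for this example, not necessarily the method used in the project. It works because the residual is negative at $p = c$ and increasing in $p$, since the symmetric share falls as the common price rises. The function names and parameter values are hypothetical.

```python
import math


def symmetric_share(p, a, mu, n):
    """Per-firm logit share when all n firms charge the same price p."""
    w = math.exp((a - p) / mu)
    return w / (1.0 + n * w)


def nash_residual(p, a, mu, c, n):
    """Zero exactly when p satisfies the symmetric Nash condition."""
    return p - c - mu / (1.0 - symmetric_share(p, a, mu, n))


def solve_symmetric_nash(a, mu, c, n, iters=200):
    """Bisection on the Nash residual over an expanding bracket."""
    lo, hi = c, c + mu
    while nash_residual(hi, a, mu, c, n) <= 0.0:
        hi += mu                      # widen until the residual turns positive
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if nash_residual(mid, a, mu, c, n) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


a, mu, c, n = 2.0, 0.25, 1.0, 2      # illustrative values
p_ne = solve_symmetric_nash(a, mu, c, n)
q_ne = symmetric_share(p_ne, a, mu, n)
markup = p_ne - c                    # should equal mu / (1 - q_ne)
```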

The monopoly benchmark is the symmetric price that maximizes joint industry profit. The collusion index is written on the price side:

C I = \frac{p ^{RL} - p ^{NE}}{p ^{M} - p ^{NE}} .
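A sketch of how both benchmarks and the index could be computed together. The Nash price comes from bisection on the symmetric first-order condition. The monopoly price comes from a ternary search over the common price, which relies on joint profit being single-peaked in that price. Both numerical methods, all function names, the parameter values, and the "learned" price `p_rl` are assumptions for illustration.

```python
import math


def symmetric_share(p, a, mu, n):
    """Per-firm logit share when all n firms charge the same price p."""
    w = math.exp((a - p) / mu)
    return w / (1.0 + n * w)


def nash_price(a, mu, c, n, iters=200):
    """Bisection on the symmetric Nash condition p - c = mu / (1 - q)."""
    def residual(p):
        return p - c - mu / (1.0 - symmetric_share(p, a, mu, n))
    lo, hi = c, c + mu
    while residual(hi) <= 0.0:
        hi += mu
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def monopoly_price(a, mu, c, n, iters=200):
    """Ternary search for the common price that maximizes joint profit."""
    def joint_profit(p):
        return n * (p - c) * symmetric_share(p, a, mu, n)
    hi = c + mu
    while joint_profit(hi + mu) > joint_profit(hi):
        hi += mu                      # walk right until profit stops rising
    lo, hi = max(c, hi - mu), hi + mu  # the peak lies inside this bracket
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if joint_profit(m1) < joint_profit(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def price_collusion_index(p_rl, p_ne, p_m):
    """Price-side collusion index: 0 at Nash, 1 at the monopoly benchmark."""
    return (p_rl - p_ne) / (p_m - p_ne)


a, mu, c, n = 2.0, 0.25, 1.0, 2       # illustrative values
p_ne = nash_price(a, mu, c, n)
p_m = monopoly_price(a, mu, c, n)
p_rl = 0.5 * (p_ne + p_m)             # hypothetical learned price, for the example
ci = price_collusion_index(p_rl, p_ne, p_m)
```

Because the monopoly price is found by comparing nearly equal profit values near the peak, its precision is lower than that of the bisection result. That is usually harmless for an index, but it is worth knowing when comparing runs that differ only in later decimals.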

If $C I = 0$, the learned outcome is Nash. If $C I = 1$, it reaches the monopoly benchmark. Negative values indicate prices below the Nash level.

Strategic structure is the key difference from Cournot. Quantities are strategic substitutes. Prices under differentiated demand are strategic complements. When one firm raises its price, the incentive for others to raise theirs becomes stronger. That does not guarantee collusion. It changes the slope of best responses. In learning terms, it creates an environment where gradual upward movements can reinforce each other, even though undercutting remains profitable locally. The RL agent sees rewards and observations. The economist sees a strategic system with a different geometry. The environment has to preserve both.
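The difference in slopes can be made concrete with a duopoly sketch. The Cournot best response uses the linear-demand formula $q_{i} = (a - c - b q_{j}) / (2b)$. The Bertrand best response has no closed form, so it is found here by bisection on the own first-order condition $p_{i} - c = μ / (1 - q_{i})$, holding the rival's price fixed. Function names, the bisection method, and all parameter values are assumptions for illustration.

```python
import math


def cournot_best_response(q_rival, a=10.0, b=1.0, c=1.0):
    """Best-response quantity under linear demand (illustrative parameters)."""
    return max(0.0, (a - c - b * q_rival) / (2.0 * b))


def logit_share(p_own, p_rival, a=2.0, mu=0.25):
    """Own share in a symmetric-quality logit duopoly with an outside option."""
    w_own = math.exp((a - p_own) / mu)
    w_rival = math.exp((a - p_rival) / mu)
    return w_own / (1.0 + w_own + w_rival)


def bertrand_best_response(p_rival, a=2.0, mu=0.25, c=1.0, iters=200):
    """Best-response price: root of p - c - mu / (1 - q), by bisection."""
    def residual(p):
        return p - c - mu / (1.0 - logit_share(p, p_rival, a, mu))
    lo, hi = c, c + mu
    while residual(hi) <= 0.0:
        hi += mu
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Substitutes: a larger rival quantity lowers the best-response quantity.
q_low, q_high = cournot_best_response(2.0), cournot_best_response(4.0)

# Complements: a higher rival price raises the best-response price.
p_low, p_high = bertrand_best_response(1.5), bertrand_best_response(1.9)
```

In this example `q_high` is below `q_low` while `p_high` is above `p_low`. The second response is also a price below the rival's 1.9, which is the local undercutting incentive mentioned above.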

References

Calvano, E., Calzolari, G., Denicolò, V., & Pastorello, S. (2020). Artificial Intelligence, Algorithmic Pricing, and Collusion. American Economic Review, 110(10), 3267–3297.
Cournot, A. A. (1838). Recherches sur les Principes Mathématiques de la Théorie des Richesses. Paris: Hachette.
Nash, J. F. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1), 48–49.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Bellman, R. (1957). Dynamic Programming. Princeton University Press.
