

NOTE

Discere Agendo Tacitum brings together the two core ideas of the project. Discere agendo refers to “learning by doing”: knowledge that emerges through repeated action rather than explicit instruction. Tacitum points to tacit collusion: coordination that can arise without direct communication. The name therefore captures the central hypothesis of this research: agents learn through interaction, and from that learning there may emerge equilibrium, coordination, and even collusion.


Context

When a small number of firms compete repeatedly, theory alone does not settle which outcome they reach. The Nash equilibrium identifies what is stable in a precise sense: no firm can gain by deviating unilaterally. The Folk Theorem shows that in repeated interaction, provided firms are sufficiently patient, almost any outcome between competition and monopoly can be sustained as an equilibrium. Theory identifies the candidates; it does not select among them.

The gap matters for policy. Firms increasingly rely on algorithmic systems to set prices and quantities. If these systems learn to coordinate on supra-competitive outcomes without ever communicating, standard antitrust tools may fall short. They were designed around explicit agreement, and tacit coordination leaves no paper trail.

Calvano et al. (2020) produced direct evidence of the problem in price competition: Q-learning agents in a repeated Bertrand game converge to supra-competitive prices without being instructed to collude. The mechanism is not programmed; it emerges from independent optimization. Three questions remain largely open: whether analogous dynamics arise under quantity competition, whether they survive a change in algorithm class or information environment, and what structural conditions shift the equilibrium selected.

This project investigates equilibrium selection as an emergent property of decentralized reinforcement learning in both Cournot and Bertrand competition. The two market structures are not sequential chapters. They are parallel environments with distinct strategic geometry, and comparing them directly is part of what makes the investigation tractable.


Research question

Under what conditions do reinforcement learning agents converge to Nash equilibrium, collusive outcomes, or cyclical dynamics in repeated Cournot and Bertrand competition?


Methodology

The experimental setup places symmetric firms in two oligopoly environments. In the Cournot setting, firms choose quantities, and the market price follows from an inverse demand function $P(Q)$ of total output $Q$. In the Bertrand setting, firms choose prices under logit demand, following the environment in Calvano et al. (2020). In both cases, each firm is an independent RL agent with no communication channel and no knowledge of its rivals’ decision rules. Agents observe market signals according to a specified information structure and update their policies through repeated interaction.
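The two environments can be sketched as a pair of profit functions. This is a minimal illustration, not the project's implementation: the linear inverse demand `P(Q) = a - b*Q` is an assumption (the text does not fix a functional form here), the logit demand follows the standard form used by Calvano et al. (2020), and all parameter defaults and function names are illustrative placeholders.

```python
import numpy as np

def cournot_profits(q, a=10.0, b=1.0, c=1.0):
    """Per-firm Cournot profits under an ASSUMED linear inverse demand P(Q) = a - b*Q."""
    q = np.asarray(q, dtype=float)
    price = max(a - b * q.sum(), 0.0)  # the market price cannot be negative
    return (price - c) * q

def logit_demand(p, quality=2.0, outside=0.0, mu=0.25):
    """Logit demand for each firm given the price vector p.

    quality: product quality index (assumed identical across firms)
    outside: quality index of the outside good
    mu:      degree of horizontal differentiation
    """
    p = np.asarray(p, dtype=float)
    attraction = np.exp((quality - p) / mu)
    return attraction / (attraction.sum() + np.exp(outside / mu))

def bertrand_profits(p, c=1.0, **demand_kwargs):
    """Per-firm Bertrand profits: margin times logit demand."""
    p = np.asarray(p, dtype=float)
    return (p - c) * logit_demand(p, **demand_kwargs)

# Two symmetric firms in each environment.
print(cournot_profits([3.0, 3.0]))   # price is 10 - 6 = 4, so each firm earns (4 - 1) * 3 = 9
print(bertrand_profits([1.5, 1.5]))  # equal prices give equal profits
```

Both functions take the full action profile and return one profit per firm, so the same agent loop can drive either market.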

Three dimensions are varied systematically across both market structures:

Algorithm class. Q-Learning, REINFORCE, DQN, and Hedge (EXP3) each represent a different approach to policy representation and the exploration-exploitation tradeoff. Q-Learning and DQN learn action values over discrete action sets and derive their policy from those values; REINFORCE and Hedge maintain explicit probability distributions over actions. The choice of algorithm shapes what kind of learning trajectory is possible.

Information structure. Three feedback regimes, bandit feedback (own profit only), own action, and rival actions, define how much of the market state each agent observes. Information is a first-order determinant of the coordination problem: agents who cannot observe rivals’ actions face a fundamentally different learning task than those who can.

Market size. Markets with $n = 2$ and $n = 3$ firms test whether collusive tendencies are robust to competitive pressure. Three-firm markets are harder to coordinate tacitly, and their Nash equilibrium is closer to the competitive outcome.
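To make the algorithm-class contrast among the dimensions above concrete, the sketch below places a value-based update next to a distribution-based one. It shows textbook tabular Q-learning and textbook EXP3, not the project's agents; function names, learning rates, and the requirement that EXP3 rewards be rescaled to [0, 1] are assumptions of this illustration.

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step: move Q[state, action] toward the bootstrapped target."""
    target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (target - Q[state, action])
    return Q

def epsilon_greedy(Q, state, epsilon, rng):
    """Value-based policy: explore uniformly with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def exp3_probabilities(weights, gamma=0.1):
    """Distribution-based policy: normalized weights mixed with uniform exploration."""
    k = len(weights)
    return (1.0 - gamma) * weights / weights.sum() + gamma / k

def exp3_update(weights, action, reward, gamma=0.1):
    """Importance-weighted EXP3 update; reward is ASSUMED to be rescaled to [0, 1]."""
    probs = exp3_probabilities(weights, gamma)
    estimate = reward / probs[action]  # unbiased estimate of the chosen arm's reward
    new_weights = weights.copy()
    new_weights[action] *= np.exp(gamma * estimate / len(weights))
    return new_weights

rng = np.random.default_rng(0)
Q = np.zeros((2, 3))                   # 2 states, 3 discrete actions
Q = q_learning_update(Q, state=0, action=1, reward=1.0, next_state=1)
action = epsilon_greedy(Q, state=0, epsilon=0.0, rng=rng)

weights = exp3_update(np.ones(3), action=0, reward=1.0)
probs = exp3_probabilities(weights)
```

The Q-learner stores one number per state-action pair and picks from it; the EXP3 learner stores the distribution itself, which is why the two classes can trace different learning trajectories in the same market.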

Each configuration is run across multiple random seeds to distinguish systematic from seed-specific outcomes.
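The experimental grid and the information structures can be sketched as follows. The identifiers, the number of seeds, the two-firm baseline, and the exact contents of each observation are assumptions made for illustration; in particular, this sketch reads "own action" as own profit plus own last action, and "rival actions" as additionally revealing rivals' last actions.

```python
from itertools import product

MARKETS = ["cournot", "bertrand"]
ALGORITHMS = ["q_learning", "reinforce", "dqn", "hedge"]
INFO_STRUCTURES = ["bandit", "own_action", "rival_actions"]
MARKET_SIZES = [2, 3]  # ASSUMED: a two-firm baseline plus the three-firm market

def observe(info, firm, actions, profits):
    """Build what one firm sees after a period, given the information structure."""
    obs = {"profit": profits[firm]}            # always available (bandit feedback)
    if info in ("own_action", "rival_actions"):
        obs["own_action"] = actions[firm]
    if info == "rival_actions":
        obs["rival_actions"] = [a for j, a in enumerate(actions) if j != firm]
    return obs

def experiment_grid(seeds=range(5)):
    """Yield one run specification per configuration and seed (seed count is hypothetical)."""
    for market, algo, info, n in product(MARKETS, ALGORITHMS, INFO_STRUCTURES, MARKET_SIZES):
        for seed in seeds:
            yield {"market": market, "algorithm": algo, "info": info,
                   "n_firms": n, "seed": seed}

runs = list(experiment_grid())
print(len(runs))  # 2 markets * 4 algorithms * 3 info structures * 2 sizes * 5 seeds = 240
```

Keeping the seed as the innermost loop groups all repetitions of one configuration together, which makes it straightforward to compare outcomes across seeds before comparing across configurations.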

Outcome metrics

Three quantities measure where agents land relative to the theoretical benchmarks.

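| Metric | Cournot | Bertrand | Equilibrium Range |
| --- | --- | --- | --- |
| Nash distance | | | 0 = Nash |
| Collusion index | $\dfrac{q^N - \bar{q}}{q^N - q^M}$ | $\dfrac{\bar{p} - p^N}{p^M - p^N}$ | 0 = Nash, 1 = monopoly |
| Dynamics | fixed point / limit cycle / unstable | fixed point / limit cycle / unstable | |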

The collusion index is a normalized distance from Nash toward monopoly. In Cournot, less output means more collusion: the numerator grows as the average quantity $\bar{q}$ falls below the Nash quantity $q^N$. In Bertrand, higher prices mean more collusion: the numerator grows as the average price $\bar{p}$ rises above the Nash price $p^N$. In both cases the denominator is the gap between the Nash and monopoly benchmarks, so both formulas map onto the same scale and outcomes are directly comparable across market structures. Negative values indicate overcompetition relative to Nash in both cases.
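The index described above can be written as two small functions. The sketch assumes the normalization is the Nash-to-monopoly gap in quantities (Cournot) or prices (Bertrand); the benchmark helper additionally assumes linear demand `P(Q) = a - b*Q` with constant marginal cost `c`, which is an illustrative choice and not confirmed by the text.

```python
def collusion_index_cournot(q_mean, q_nash, q_monopoly):
    """0 at the Nash quantity, 1 at the monopoly quantity, negative above Nash output."""
    return (q_nash - q_mean) / (q_nash - q_monopoly)

def collusion_index_bertrand(p_mean, p_nash, p_monopoly):
    """0 at the Nash price, 1 at the monopoly price, negative below the Nash price."""
    return (p_mean - p_nash) / (p_monopoly - p_nash)

def linear_cournot_benchmarks(n, a=10.0, b=1.0, c=1.0):
    """Per-firm Nash and joint-monopoly quantities under ASSUMED linear demand."""
    q_nash = (a - c) / (b * (n + 1))
    q_monopoly = (a - c) / (2 * b * n)
    return q_nash, q_monopoly

for n in (2, 3):
    q_nash, q_mono = linear_cournot_benchmarks(n)
    # Total Nash output as a share of the competitive output (a - c) / b equals n / (n + 1).
    share = n * q_nash / 9.0
    print(n, q_nash, q_mono, round(share, 3))

q_nash, q_mono = linear_cournot_benchmarks(2)
print(collusion_index_cournot(3.5, q_nash, q_mono))  # negative: more output than Nash
```

Under these assumed parameters, total Nash output rises from two thirds to three quarters of the competitive level when moving from two to three firms, which is one way to see why the three-firm Nash equilibrium sits closer to the competitive outcome.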

Dynamics classification matters because the collusion index can conceal important structure. A mean near 0.5 could reflect stable coordination or a cycle oscillating between Nash and monopoly. Classifying the trajectory shape is a necessary complement to the scalar metrics.
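The point about hidden structure can be illustrated with a simple heuristic classifier. This is a sketch under stated assumptions, not the project's classification rule: it inspects only the tail of a trajectory, calls it a fixed point if the tail is nearly constant, a limit cycle if the tail repeats with some short period, and unstable otherwise. The tail length, tolerance, and maximum period are hypothetical parameters.

```python
import numpy as np

def classify_dynamics(series, tail=200, tol=1e-3, max_period=50):
    """Heuristically label the tail of a trajectory (e.g. a collusion index over time)."""
    x = np.asarray(series, dtype=float)[-tail:]
    if x.max() - x.min() < tol:
        return "fixed point"
    for period in range(2, max_period + 1):
        # A cycle of this period means x[t + period] matches x[t] throughout the tail.
        if len(x) >= 2 * period and np.max(np.abs(x[period:] - x[:-period])) < tol:
            return "limit cycle"
    return "unstable"

stable = np.full(400, 0.5)               # settles halfway between Nash and monopoly
cycle = np.tile([0.0, 1.0], 200)         # alternates between Nash and monopoly
noisy = np.random.default_rng(0).random(400)

print(stable.mean(), classify_dynamics(stable))
print(cycle.mean(), classify_dynamics(cycle))
print(classify_dynamics(noisy))
```

The first two trajectories share the same mean index while receiving different labels, which is exactly the case a scalar metric cannot separate. Real learning trajectories are noisy, so a practical version would likely need a looser tolerance or some smoothing before the comparison.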

WARNING

All hyperparameters are specified in YAML configuration files. Nothing is tuned post-hoc.


Project Articles

| File | Date | Status |
| --- | --- | --- |
| 01 - Environments | 2026-04-24 | Done |
| 02 - Environment Implementation | 2026-05-01 | Done |
| 03 - Agents | 2026-05-03 | Done |

References

Note

More specific references, including individual papers on algorithmic collusion, RL convergence, and market design, are cited in the relevant development notes.