NOTE
Discere Agendo Tacitum brings together the two core ideas of the project. Discere agendo refers to “learning by doing”: knowledge that emerges through repeated action rather than explicit instruction. Tacitum points to tacit collusion: coordination that can arise without direct communication. The name therefore captures the central hypothesis of this research: agents learn through interaction, and equilibrium, coordination, and even collusion may emerge from that learning.
Context
When a small number of firms compete repeatedly, theory does not settle which outcome they reach. The Nash equilibrium identifies what is stable in a precise sense: no firm can gain by deviating unilaterally. The Folk Theorem shows that in repeated interaction, provided firms are sufficiently patient, almost any outcome between competition and monopoly can be sustained as an equilibrium. Theory identifies the candidates; it does not select among them.
The gap matters for policy. Firms increasingly rely on algorithmic systems to set prices and quantities. If these systems learn to coordinate on supra-competitive outcomes without ever communicating, standard antitrust tools may fall short. They were designed around explicit agreement, and tacit coordination leaves no paper trail.
Calvano et al. (2020) produced direct evidence of the problem in price competition: Q-learning agents in a repeated Bertrand game converge to collusive prices without instruction. The mechanism is not programmed; it emerges from independent optimization. Three questions remain largely open: whether analogous dynamics arise under quantity competition, whether they survive a change in algorithm class or information environment, and what structural conditions shift the equilibrium selected.
This project investigates equilibrium selection as an emergent property of decentralized reinforcement learning in both Cournot and Bertrand competition. The two market structures are not sequential chapters. They are parallel environments with distinct strategic geometry, and comparing them directly is part of what makes the investigation tractable.
Research question
Under what conditions do reinforcement learning agents converge to Nash equilibrium, collusive outcomes, or cyclical dynamics in repeated Cournot and Bertrand competition?
Methodology
The experimental setup places symmetric firms in two oligopoly environments. In the Cournot setting, firms choose quantities and the market price follows from an inverse demand function. In the Bertrand setting, firms choose prices under logit demand, following the environment in Calvano et al. (2020). In both cases, each firm is an independent RL agent with no communication channel and no knowledge of its rivals’ decision rules. Agents observe market signals according to a specified information structure and update their policies through repeated interaction.
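As a rough illustration of the Bertrand environment, the sketch below computes logit demand and per-period profits for a vector of prices. The functional form is the standard logit specification used in Calvano et al. (2020); the function names and the default parameter values (`quality`, `outside`, `mu`, `cost`) are assumptions chosen for illustration and are not taken from this project's configuration.

```python
import numpy as np

def logit_demand(prices, quality=2.0, outside=0.0, mu=0.25):
    """Logit demand: q_i = exp((a - p_i)/mu) / (sum_j exp((a - p_j)/mu) + exp(a0/mu)).

    quality (a), outside (a0) and mu are illustrative assumptions.
    """
    p = np.asarray(prices, dtype=float)
    u = np.exp((quality - p) / mu)              # attractiveness of each firm's product
    return u / (u.sum() + np.exp(outside / mu))  # outside good absorbs the remaining demand

def bertrand_profits(prices, cost=1.0, **demand_kwargs):
    """Per-firm profit (p_i - c) * q_i under logit demand; cost is an assumed value."""
    p = np.asarray(prices, dtype=float)
    return (p - cost) * logit_demand(p, **demand_kwargs)

if __name__ == "__main__":
    print(logit_demand([1.5, 1.5]))      # symmetric prices give equal demand
    print(bertrand_profits([1.5, 1.7]))  # the cheaper firm sells more
```

Because the outside good always keeps a positive share, total demand stays below one, and a firm that undercuts its rival gains demand smoothly rather than capturing the whole market.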
Three dimensions are varied systematically across both market structures:
Algorithm class. Q-Learning, REINFORCE, DQN, and Hedge (EXP3) each represent a different approach to policy representation and the exploration-exploitation tradeoff. Q-Learning and DQN are value-based: they estimate action values over a discrete action set and derive the policy from those estimates. REINFORCE and Hedge maintain explicit probability distributions over actions and update them directly. The choice of algorithm shapes what kind of learning trajectory is possible.
Information structure. Three feedback regimes define how much of the market state each agent observes: bandit feedback (own profit only), own action, and rival actions. Information is a first-order determinant of the coordination problem: agents who cannot observe rivals’ actions face a fundamentally different learning task than those who can.
Market size. The number of firms is varied, including three-firm markets, to test whether collusive tendencies are robust to competitive pressure. Three-firm markets are harder to coordinate tacitly, and their Nash equilibrium is closer to the competitive outcome.
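The claim that an extra firm pulls the Nash equilibrium toward the competitive outcome can be checked in the textbook case of symmetric Cournot competition with linear inverse demand and constant marginal cost. The linear form and the parameter values `a`, `b`, `c` below are assumptions for illustration only; the project's actual demand specification may differ.

```python
def cournot_benchmarks(n, a=10.0, b=1.0, c=1.0):
    """Per-firm benchmark quantities for symmetric linear Cournot.

    Assumes inverse demand P(Q) = a - b*Q and constant marginal cost c
    (illustrative assumptions, not the project's specification).
    """
    q_nash = (a - c) / (b * (n + 1))      # symmetric Nash quantity per firm
    q_monopoly = (a - c) / (2 * b * n)    # equal share of joint-profit-maximizing output
    q_competitive = (a - c) / (b * n)     # equal share of output where price equals cost
    return q_nash, q_monopoly, q_competitive

if __name__ == "__main__":
    for n in (2, 3):
        q_nash, q_mon, q_comp = cournot_benchmarks(n)
        # Ratio of total Nash output to total competitive output is n / (n + 1).
        print(n, round(q_nash / q_comp, 3))
```

Under these assumptions total Nash output is a fraction n / (n + 1) of competitive output, so moving from two to three firms raises the ratio from 2/3 to 3/4.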
Each configuration is run across multiple random seeds to distinguish systematic from seed-specific outcomes.
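One way to enumerate the resulting experiment grid is a Cartesian product over the varied dimensions and the seeds. The sketch below is hypothetical: the string labels, the market sizes `(2, 3)`, and the seed count are placeholders chosen for illustration, not values read from the project's configuration files.

```python
from itertools import product

MARKETS = ("cournot", "bertrand")
ALGORITHMS = ("q_learning", "reinforce", "dqn", "hedge")
INFORMATION = ("bandit", "own_action", "rival_actions")
MARKET_SIZES = (2, 3)   # assumption: placeholder firm counts
SEEDS = tuple(range(5))  # assumption: placeholder number of seeds

def experiment_grid():
    """Return one configuration dict per (market, algorithm, information, size, seed)."""
    keys = ("market", "algorithm", "information", "n_firms", "seed")
    combos = product(MARKETS, ALGORITHMS, INFORMATION, MARKET_SIZES, SEEDS)
    return [dict(zip(keys, combo)) for combo in combos]

if __name__ == "__main__":
    grid = experiment_grid()
    print(len(grid))  # 2 * 4 * 3 * 2 * 5 configurations
    print(grid[0])
```

Keeping the seed as an explicit grid dimension makes it straightforward to group runs by configuration and compare outcomes across seeds.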
Outcome metrics
Three quantities measure where agents land relative to the theoretical benchmarks.
| Metric | Cournot | Bertrand | Equilibrium Range |
|---|---|---|---|
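| Nash distance | distance of quantities from Nash | distance of prices from Nash | 0 = Nash |
| Collusion index | normalized shortfall of output below Nash | normalized excess of price above Nash | 0 = Nash, 1 = monopoly |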
| Dynamics | fixed point / limit cycle / unstable | fixed point / limit cycle / unstable | — |
The collusion index is a normalized distance from Nash toward monopoly. In Cournot, less output means more collusion: the numerator grows as output falls below its Nash level. In Bertrand, higher prices mean more collusion: the numerator grows as the price rises above its Nash level. Both formulas map onto the same scale, so outcomes are directly comparable across market structures. Negative values indicate overcompetition relative to Nash in both cases.
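The description above is consistent with the following minimal sketch, in which each index divides the move away from Nash by the full Nash-to-monopoly gap. The exact formulas used in the project are not shown here, so the function names, the choice of per-firm quantities and prices, and the benchmark numbers in the example are assumptions.

```python
def collusion_index_cournot(q, q_nash, q_monopoly):
    """0 at Nash, 1 at monopoly; grows as output q falls below q_nash."""
    return (q_nash - q) / (q_nash - q_monopoly)

def collusion_index_bertrand(p, p_nash, p_monopoly):
    """0 at Nash, 1 at monopoly; grows as price p rises above p_nash."""
    return (p - p_nash) / (p_monopoly - p_nash)

if __name__ == "__main__":
    # Hypothetical benchmark values, chosen only to illustrate the scale.
    print(collusion_index_cournot(2.625, q_nash=3.0, q_monopoly=2.25))   # halfway to monopoly
    print(collusion_index_cournot(3.375, q_nash=3.0, q_monopoly=2.25))   # overcompetition
    print(collusion_index_bertrand(1.75, p_nash=1.5, p_monopoly=2.0))    # halfway to monopoly
```

The sign flip between the two numerators is what puts both market structures on the same scale: producing above the Nash quantity or pricing below the Nash price both yield a negative index.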
Dynamics classification matters because the collusion index can conceal important structure. A mean near 0.5 could reflect stable coordination or a cycle oscillating between Nash and monopoly. Classifying the trajectory shape is a necessary complement to the scalar metrics.
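A simple heuristic for this classification is sketched below. It is not the project's classifier: the function name, the tail length, the tolerance, and the maximum period are all assumptions. The idea is to look only at the tail of a trajectory, call it a fixed point if it barely moves, call it a limit cycle if it repeats with some short period, and otherwise label it unstable.

```python
import numpy as np

def classify_dynamics(series, tail=1000, tol=1e-3, max_period=50):
    """Label the tail of a scalar trajectory (heuristic; all thresholds are assumptions)."""
    x = np.asarray(series, dtype=float)[-tail:]
    # Fixed point: the whole tail stays within a band of width tol.
    if np.ptp(x) <= tol:
        return "fixed point"
    # Limit cycle: for some period k, samples at the same phase (t mod k) all agree.
    for k in range(2, min(max_period, len(x) // 2) + 1):
        if all(np.ptp(x[r::k]) <= tol for r in range(k)):
            return "limit cycle"
    return "unstable"

if __name__ == "__main__":
    stable = np.full(2000, 0.5)           # constant at 0.5
    cycle = np.tile([0.0, 1.0], 1000)     # alternates between Nash and monopoly
    print(stable.mean(), classify_dynamics(stable))
    print(cycle.mean(), classify_dynamics(cycle))
```

Both example trajectories have a mean index of 0.5, yet they receive different labels, which is the distinction the scalar metrics cannot make. Trajectories with persistent exploration noise would need a looser tolerance or smoothing before this check is meaningful.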
WARNING
All hyperparameters are specified in YAML configuration files. Nothing is tuned post-hoc.
Project Articles
| File | Date | Status |
|---|---|---|
| 01 - Environments | 2026-04-24 | Done |
| 02 - Environment Implementation | 2026-05-01 | Done |
| 03 - Agents | 2026-05-03 | Done |
References
- Calvano, E., Calzolari, G., Denicolò, V., & Pastorello, S. (2020). Artificial Intelligence, Algorithmic Pricing, and Collusion. American Economic Review, 110(10), 3267–3297.
- Cournot, A. A. (1838). Recherches sur les Principes Mathématiques de la Théorie des Richesses. Paris: Hachette.
- Nash, J. F. (1950). Equilibrium points in n-person games. Proceedings of the National Academy of Sciences, 36(1), 48–49.
- Arrow, K. J. (1962). The economic implications of learning by doing. The Review of Economic Studies, 29(3), 155–173.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
- Fudenberg, D., & Levine, D. K. (1998). The Theory of Learning in Games. MIT Press.
NOTE
More specific references, including individual papers on algorithmic collusion, RL convergence, and market design, are cited in the relevant development notes.