The field map

A navigable map of multi-agent autonomous-systems research, 2015–2026 — 70 algorithm / system / framework / benchmark entries and 6 surveys, organized so the absences between classical MARL and the agentic-AI / LLM-agent wave become visible at a glance. Transcribed from the project's generated taxonomy; the full generative YAML source is preserved under assets/sources/marl_taxonomy/.

0 · The thesis: interaction mode as the primary axis
1 · The map at a glance (figures)
2 · The 70 entries (mode → paradigm → era)
3 · The 6 surveys
4 · Gaps & opportunities
5 · Cross-disciplinary connectors
6 · Methodology
0 · the thesis

Interaction mode is the primary axis

Most MARL maps sort work by method (value-decomposition, policy-gradient, communication…). This map sorts by interaction mode instead — the relationship between the agents' interests — because that is the axis along which the classical-MARL literature and the 2023–2026 LLM-agent wave fail to meet. Five modes, each with a single distinguishing question:

solo: Is there only one agent?
cooperative: Are interests persistently aligned across the whole horizon?
collaborative: Are interests temporarily aligned around a sub-goal?
competitive: Are interests strictly opposed (zero-sum / N-player rank)?
mixed-motive: Are interests partially aligned (general-sum, social dilemma)?
The cooperative–vs–collaborative split is the map's load-bearing original contribution. Most MARL literature collapses the two. But "cooperative-MARL methods" (MAPPO, QMIX, VDN…) assume persistent reward identity — a single team reward held for the entire horizon. That assumption breaks the moment interests are only aligned around a sub-task and dissolve when it ends. Naming the two modes distinctly is what makes the central absence legible.
COLLABORATIVE mode is, in this corpus, entirely a 2023–2026 LLM-agent phenomenon — zero classical-MARL entries. Every collaborative entry (AutoGen, MetaGPT, ChatDev, CrewAI, LangGraph, AgentVerse, CAMEL, Multi-Agent Debate) is an LLM-agent system from 2023 onward, coordinating in natural language with role-asymmetric, prompt-implicit rewards — no value function, no gradient update, no team reward. That absence is the thesis: the regime where most real agentic-AI coordination now lives is exactly the regime classical cooperative MARL cannot natively express. The Zymera research program (covert-misbehavior resilience at the mission level) sits in this seam — see Literature and Paper.
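The five distinguishing questions can be read as a small decision procedure. The following is a minimal sketch only: the `classify_mode` helper and its coarse `alignment` labels are hypothetical and do not come from the taxonomy source.

```python
from enum import Enum


class Mode(Enum):
    SOLO = "solo"
    COOPERATIVE = "cooperative"
    COLLABORATIVE = "collaborative"
    COMPETITIVE = "competitive"
    MIXED_MOTIVE = "mixed-motive"


def classify_mode(n_agents: int, alignment: str, persistent: bool = True) -> Mode:
    """Map the five distinguishing questions onto a mode.

    alignment: "aligned", "opposed", or "partial" (hypothetical labels).
    persistent: whether the alignment holds across the whole horizon.
    """
    if n_agents == 1:
        return Mode.SOLO
    if alignment == "opposed":
        return Mode.COMPETITIVE
    if alignment == "partial":
        return Mode.MIXED_MOTIVE
    if alignment == "aligned":
        # The cooperative/collaborative split: whole-horizon vs. sub-goal-scoped.
        return Mode.COOPERATIVE if persistent else Mode.COLLABORATIVE
    raise ValueError(f"unknown alignment label: {alignment!r}")
```

Under these assumptions, a team with one shared reward for the whole horizon classifies as cooperative, while the same team aligned only around a sub-task classifies as collaborative.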

Secondary axes

Each entry carries one primary mode (used for grouping and diagram coloring) plus honest secondary tags on three further axes:

Axis | Values | What it captures
Paradigm | classical-MARL · llm-agent · hybrid · classical-RL-extended | Era + research community. classical-MARL is the 2018–2022 deep-MARL wave; llm-agent the 2023–2026 agentic-AI wave; hybrid bridges them; classical-RL-extended is single-agent work with multi-agent applicability.
Era | 2015–2018 · 2018–2022 · 2023–2026 | When the line of work landed. The timeline figure below shows the paradigm hand-off.
Method family | multi-tag (value-decomposition, policy-gradient, communication, GNN, role-based, language-model, …) | Multi-membership by default: MAPPO is "cooperative + CTDE + policy-gradient + parameter-sharing" all at once.
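One way to picture "one primary mode plus multi-membership tags" is as a record type. This is a sketch under assumptions: the field names are hypothetical and are not the schema of the project's YAML source. The tag values are copied from the MAPPO card in this map.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    key: str
    primary_mode: str      # exactly one; drives grouping and diagram coloring
    paradigm: str
    era: str
    method_tags: frozenset = field(default_factory=frozenset)      # multi-membership
    secondary_modes: frozenset = field(default_factory=frozenset)  # "also: ..." tags


mappo = Entry(
    key="mappo",
    primary_mode="cooperative",
    paradigm="classical-MARL",
    era="2018–2022",
    method_tags=frozenset(
        {"policy-gradient", "actor-critic", "centralized-critic", "CTDE"}
    ),
)


def group_key(entry: Entry) -> tuple:
    """Sort/group key for the mode -> paradigm -> era ordering."""
    return (entry.primary_mode, entry.paradigm, entry.era)
```

Keeping the primary mode as a single field and everything else as sets means an entry is grouped in exactly one place but can still be found by any of its tags.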
1 · at a glance

The map at a glance

Figure (mode tree): the primary clustering. Each interaction mode branches into its paradigms and then its entries.
Figure (timeline, 2015–2026): the classical-MARL wave (2018–2022) hands off to the LLM-agent wave (2023–2026); collaborative mode appears only on the right edge.
Figure (mode × era heatmap): collaborative is dense only in 2023–2026; competitive peaks in 2018–2022.
Figure (mode × method-family heatmap): the empty cells in this matrix are what the gap report ranks.
Figure (paradigm × venue heatmap): classical-MARL clusters at ICML / NeurIPS / ICLR / AAMAS; the llm-agent wave scatters across arXiv, frameworks, ACL, and UIST.
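Finding the empty cells of a mode × method-family matrix is a set-difference computation. The sketch below uses a hypothetical `empty_cells` helper and a two-entry toy subset. It only detects emptiness and does not reproduce how the gap report ranks the cells.

```python
from itertools import product


def empty_cells(entries, modes, families):
    """Return (mode, family) pairs with no entry, in row-major order.

    entries: iterable of (primary_mode, set_of_method_tags) pairs.
    """
    filled = {(mode, tag) for mode, tags in entries for tag in tags}
    return [cell for cell in product(modes, families) if cell not in filled]


# Toy subset (illustrative only, not the full corpus).
toy = [
    ("cooperative", {"value-decomposition", "q-learning"}),
    ("collaborative", {"language-model", "role-based"}),
]
gaps = empty_cells(
    toy,
    modes=["cooperative", "collaborative"],
    families=["value-decomposition", "language-model"],
)
# gaps == [("cooperative", "language-model"),
#          ("collaborative", "value-decomposition")]
```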
2 · the corpus

The 70 entries

Grouped mode → paradigm → era. Each card carries the plain-language description, authors, venue + tier, method / coordination tags, key relationship edges (Extends / Contrasts), and, where notable, the project-relevance note. Venue tiers are abbreviated: T1 = tier-1, T2 = tier-2, pre = preprint, jrnl / bk / fw = journal / textbook / framework.

Solo — one agent in an environment; baselines that anchor extends edges to multi-agent variants
Solo · hybrid · 2018–2022
inner-monologue · Inner Monologue · 2022

Embodied LLM-agent that consumes natural-language environment feedback (success detection, scene description, human input) to drive an "inner monologue" that revises plans on the fly; closing the perception–language loop sharply improves long-horizon manipulation/navigation vs. open-loop planners.

Huang, Xia, Xiao, Chan, … Ichter

T2 CoRL

language-model · planning · memory · reflection
Extends SayCan · Contrasts Voyager
saycan · SayCan · 2022

Grounds an LLM planner in a robot's affordances by combining "Say" (LLM likelihood a skill helps the goal) with "Can" (a learned value function for executability). Foundational hybrid LLM-plus-RL system for embodied long-horizon instruction following.

Ahn, Brohan, Brown, Chebotar, … Zeng

T2 CoRL

language-model · planning · q-learning · memory
Contrasts Voyager
Solo · llm-agent · 2023–2026
react · ReAct · 2023

Interleaves Reasoning and Acting in LLM agents: the model alternates verbal reasoning traces with external tool actions. The foundational pattern of the agentic-AI wave — nearly every later framework (AutoGen, LangGraph, MetaGPT) descends from a ReAct variant.

Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao

T1 ICLR

language-model · planning · memory
reflexion · Reflexion · 2023

Self-reflective agent: after each trial the LLM verbalizes what went wrong and stores the reflection in episodic memory for the next attempt — "RL via verbal feedback" with an in-context rather than gradient-based policy update.

Shinn, Cassano, Berman, Gopinath, Narasimhan, Yao

T1 NeurIPS

language-model · planning · memory · reflection
Extends ReAct
tot · Tree of Thoughts · 2023

Frames LLM problem-solving as deliberate search over a tree of partial "thoughts": propose candidate steps, score them, explore by BFS/DFS. Generalizes Chain-of-Thought and ReAct from a linear trace into a search procedure.

Yao, Yu, Zhao, Shafran, Griffiths, Cao, Narasimhan

T1 NeurIPS

language-model · planning · reflection
Extends ReAct · Contrasts Reflexion
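The propose / score / explore loop in the Tree of Thoughts card can be sketched as a breadth-first beam search. This is an illustration under assumptions: `propose`, `score`, and `is_solution` stand in for LLM calls, and the arithmetic toy task is not from the paper.

```python
def tree_of_thoughts_bfs(root, propose, score, is_solution,
                         beam_width=3, max_depth=4):
    """Breadth-first search over partial thoughts, keeping a top-scoring beam."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for state in candidates:
            if is_solution(state):
                return state
        # Keep only the most promising partial thoughts for the next level.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
        if not frontier:
            break
    return None


# Toy task (assumed): pick numbers from {1, 3, 4} whose sum is exactly 10.
TARGET = 10
result = tree_of_thoughts_bfs(
    root=(),
    propose=lambda s: [s + (n,) for n in (1, 3, 4)],
    score=lambda s: -abs(TARGET - sum(s)),
    is_solution=lambda s: sum(s) == TARGET,
)
```

A linear trace corresponds to `beam_width=1`; widening the beam is what turns the trace into a search.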
voyager · Voyager · 2023

Lifelong-learning LLM agent in Minecraft that builds an ever-growing skill library via iterative prompting (propose, code, execute, reflect) with no gradient updates — open-ended exploration from in-context learning and external memory alone.

Wang, Xie, Jiang, Mandlekar, … Anandkumar

pre arXiv

language-model · planning · memory
Single-agent baseline for the agentic wave; contrast with AutoGen/MetaGPT showing multi-agent is not just "Voyager × N".
agentbench · AgentBench · 2024

First systematic evaluation suite for LLM-as-agent across eight environments (OS, databases, knowledge graphs, digital card game, lateral-thinking puzzles, household, web shopping, web browsing). The standard yardstick for solo-LLM-agent capability; exposed a large gap between commercial API models and open-source models.

Liu, Yu, Zhang, Xu, … Tang

T1 ICLR

language-model
Contrasts Melting Pot
Solo · classical-RL-extended · 2015–2018
options-framework · Options Framework · 1999

Foundational formalism for temporally extended actions: an option is (initiation set, intra-option policy, termination); the resulting process is a semi-MDP. Underpins essentially every later hierarchical-RL method.

Sutton, Precup, Singh

jrnl AIJ

hierarchical
Contrasts FeUdal Networks
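The option triple in the card above (initiation set, intra-option policy, termination) maps directly onto a small data structure. A minimal sketch, assuming a hypothetical one-dimensional corridor environment; running an option to termination yields one semi-MDP transition (next state, discounted reward, duration).

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Option:
    initiation_set: set                  # states where the option may start
    policy: Callable[[int], int]         # intra-option policy: state -> action
    termination: Callable[[int], float]  # beta(s): probability of stopping in s


def run_option(option, state, step, gamma=0.9, seed=0):
    """Execute one option to termination.

    Returns (final state, discounted reward, number of primitive steps).
    """
    if state not in option.initiation_set:
        raise ValueError("option not available in this state")
    rng = random.Random(seed)
    total, discount, duration = 0.0, 1.0, 0
    while True:
        state, reward = step(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        duration += 1
        if rng.random() < option.termination(state):
            return state, total, duration


# Hypothetical corridor: action +1 moves right; reward 1.0 on reaching state 3.
def corridor_step(state, action):
    nxt = state + action
    return nxt, (1.0 if nxt == 3 else 0.0)


go_right = Option(
    initiation_set={0, 1, 2},
    policy=lambda s: 1,
    termination=lambda s: 1.0 if s == 3 else 0.0,
)
final_state, ret, duration = run_option(go_right, 0, corridor_step)
```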
map-elites · MAP-Elites · 2015

Quality-diversity algorithm maintaining a grid archive indexed by hand-designed behavior descriptors, keeping the best solution per cell — a diverse repertoire of local optima rather than one global optimum. Foundational for QD and behavior-space curriculum design.

Mouret, Clune

pre arXiv

evolutionary · population
Contrasts OpenAI ES
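The "best solution per cell" archive can be sketched in a few lines. This is a simplified variant under assumptions: solutions are scalars in [0, 1], the descriptor is the solution itself, and a 10% random-restart rate replaces the separate random initialization phase of the original algorithm.

```python
import random


def map_elites(fitness, descriptor, n_cells, iterations=2000, seed=0):
    """Minimal MAP-Elites on scalars in [0, 1]: keep the best solution per cell."""
    rng = random.Random(seed)
    archive = {}  # cell index -> (fitness, solution)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Mutate a randomly chosen elite (Gaussian perturbation, clipped).
            _, parent = rng.choice(list(archive.values()))
            x = min(1.0, max(0.0, parent + rng.gauss(0.0, 0.1)))
        else:
            x = rng.random()  # occasional random restart (a simplification)
        cell = min(n_cells - 1, int(descriptor(x) * n_cells))
        f = fitness(x)
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, x)
    return archive


# Toy problem (assumed): fitness peaks at x = 0.5, descriptor is x itself.
archive = map_elites(lambda x: -(x - 0.5) ** 2, lambda x: x, n_cells=10)
```

The archive ends up holding one elite per occupied cell, including cells far from the global optimum, which is the repertoire-over-optimum behavior the card describes.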
feudal-net · FeUdal Networks (FuN) · 2017

Hierarchical RL where a high-level Manager emits abstract goals in a learned latent space and a low-level Worker is rewarded for moving along them. Decouples temporal abstraction from action selection; reference for hierarchical cooperative-MARL extensions.

Vezhnevets, Osindero, Schaul, Heess, … Kavukcuoglu

T1 ICML

hierarchical · actor-critic
Contrasts Options Framework
openai-es · OpenAI ES · 2017

Black-box policy optimization via natural-evolution-strategy gradient estimates from antithetic Gaussian perturbations. Trades sample efficiency for trivial parallelism (one perturbed rollout per worker, a tiny noise-seed broadcast). Reignited evolutionary methods for deep RL.

Salimans, Ho, Chen, Sidor, Sutskever

pre arXiv

evolutionary · population
Contrasts MAP-Elites
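The antithetic-perturbation gradient estimate can be illustrated on a toy objective. This sketch assumes a fixed noise scale and a plain gradient-ascent update; it omits the rank shaping, weight decay, and seed-broadcast machinery of the actual system.

```python
import random


def es_gradient(f, theta, sigma=0.1, n_pairs=50, seed=0):
    """Antithetic ES estimate of the gradient of E[f(theta + sigma * eps)]."""
    rng = random.Random(seed)
    grad = [0.0] * len(theta)
    for _ in range(n_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = f([t + sigma * e for t, e in zip(theta, eps)])
        minus = f([t - sigma * e for t, e in zip(theta, eps)])
        for i, e in enumerate(eps):
            grad[i] += (plus - minus) * e / (2.0 * sigma * n_pairs)
    return grad


def es_ascent(f, theta, lr=0.1, steps=200):
    """Plain gradient ascent using a fresh ES estimate at every step."""
    for step in range(steps):
        g = es_gradient(f, theta, seed=step)
        theta = [t + lr * gi for t, gi in zip(theta, g)]
    return theta


# Toy objective (assumed): maximize -(x - 3)^2 - (y + 1)^2, optimum at (3, -1).
theta = es_ascent(lambda p: -(p[0] - 3.0) ** 2 - (p[1] + 1.0) ** 2, [0.0, 0.0])
```

Each perturbation pair needs only two scalar returns, which is why a worker has to communicate little beyond a noise seed and its results.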
rlhf · RLHF (Deep RL from Human Preferences) · 2017

Trains a reward model from pairwise human preference comparisons over trajectory snippets, then optimizes a policy against it with standard deep RL. The upstream root of the entire human-feedback alignment line (InstructGPT, ChatGPT, Constitutional AI, DPO).

Christiano, Leike, Brown, Martic, Legg, Amodei

T1 NeurIPS

policy-gradient · imitation · language-model
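The reward-model training signal in the RLHF card can be written as a pairwise logistic (Bradley-Terry) loss over summed snippet rewards. A minimal sketch; `snippet_return` and the `reward_fn` argument are hypothetical names for illustration.

```python
import math


def snippet_return(reward_fn, snippet):
    """Sum of predicted rewards over a trajectory snippet of (obs, action) pairs."""
    return sum(reward_fn(obs, action) for obs, action in snippet)


def preference_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood that the preferred snippet wins.

    Uses P(preferred beats rejected) = sigmoid(r_preferred - r_rejected).
    """
    margin = r_preferred - r_rejected
    return math.log1p(math.exp(-margin))
```

Minimizing this loss over many labeled pairs pushes the reward model to score preferred snippets higher; the policy is then optimized against that learned reward with ordinary deep RL.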
Solo · classical-RL-extended · 2018–2022
world-models · World Models · 2018

Three-part architecture: a VAE compresses observations, a recurrent mixture-density net predicts latent dynamics, and a small linear controller is evolved by CMA-ES inside the learned world. First popular demonstration of training a policy almost entirely "in imagination".

Ha, Schmidhuber

T1 NeurIPS

model-based
Contrasts MuZero
muzero · MuZero · 2020

Combines a learned value-equivalent dynamics model with MCTS, removing the need for a known simulator at planning time. Matches AlphaZero on Go/chess/shogi and is strong on Atari with one architecture.

Schrittwieser, Antonoglou, Hubert, Simonyan, … Silver

T1 Nature

model-based · planning · self-play · also: competitive
Contrasts DreamerV3, AlphaStar
paired · PAIRED · 2020

Unsupervised environment design: a protagonist and an antagonist play environments produced by an adversary trained to maximize the regret between them, yielding a curriculum at the frontier of the protagonist's ability with minimax-regret guarantees.

Dennis, Jaques, Vinitsky, Bayen, … Levine

T1 NeurIPS

environment-design · curriculum · policy-gradient · also: mixed-motive
Contrasts ACCEL
accel · ACCEL · 2022

Unsupervised environment design that drops the adversary network for an evolutionary edit operator: a buffer of high-regret levels is mutated and re-scored, compounding complexity along the agent's frontier. Often matches/beats PAIRED while being simpler.

Parker-Holder, Jiang, Dennis, Samvelyan, … Rocktäschel

T1 ICML

environment-design · curriculum · evolutionary
Extends PAIRED · Contrasts PAIRED, MAESTRO
Solo · classical-RL-extended · 2023–2026
constitutional-ai · Constitutional AI (CAI / RLAIF) · 2022

Aligns an LLM with a written "constitution" using AI feedback instead of human labels: the model critiques and revises its own outputs against the principles, producing preference data for an RLAIF stage. Showed AI-generated preferences can largely replace human ones.

Bai, Kadavath, Kundu, Askell, … Kaplan

pre arXiv

language-model · imitation · policy-gradient · reflection
Extends RLHF · Contrasts DPO
dpo · Direct Preference Optimization · 2023

Reformulates RLHF as a closed-form classification objective on pairwise preferences, eliminating the explicit reward model and the PPO loop. Now the dominant alignment recipe for open-source LLMs.

Rafailov, Sharma, Mitchell, Ermon, Manning, Finn

T1 NeurIPS

language-model · imitation
Extends RLHF · Contrasts Constitutional AI
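The DPO objective for a single preference pair needs only four sequence log-probabilities, with no explicit reward model. A sketch under assumptions: the argument names are illustrative, and the log-probabilities would in practice come from the trained policy and a frozen reference model.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """DPO loss for one (chosen, rejected) response pair."""
    # Implicit reward of a response: beta * (log pi(y|x) - log pi_ref(y|x)).
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
```

When the policy equals the reference the loss is log 2; it falls as the policy raises the chosen response's likelihood relative to the rejected one.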
dreamerv3 · DreamerV3 · 2025

Latent-space world model that learns from pixels and trains an actor-critic entirely on imagined trajectories. Masters Minecraft diamonds from scratch with a single fixed hyperparameter set across 150+ tasks — model-based RL that is both general and robust.

Hafner, Pasukonis, Ba, Lillicrap

T1 Nature

model-based · actor-critic
Extends World Models · Contrasts MuZero
Cooperative — persistently aligned interests; one team reward across the whole horizon. SMAC, MPE-coop, Overcooked live here
Cooperative · classical-MARL · 2015–2018
iql · IQL (Independent Q-Learning) · 1993

Foundational decentralized MARL: each agent runs independent Q-learning, treating others as environment dynamics. Trivial to scale; pays in non-stationarity and inability to coordinate beyond shared rewards. Still a competitive baseline today.

Ming Tan

T1 ICML

q-learning · DTDE
Lower-bound baseline — the "DTDE" row in the RedWithinBlue paradigm-comparison table.
dec-pomdp · Dec-POMDP (complexity) · 2002

Formal complexity anchor for cooperative MARL: defines the decentralized POMDP and proves finite-horizon optimal policies are NEXP-complete — ruling out tractable exact algorithms and explaining why cooperative MARL needs approximate, learning-based methods.

Bernstein, Givan, Immerman, Zilberstein

jrnl Math. Oper. Res.

formal model
Contrasts Markov Games
The NEXP-hardness result is the standard justification for learning-based CTDE over exact DP; cited in formalism.tex.
commnet · CommNet · 2016

Foundational learned-communication architecture: each agent's hidden state is mean-pooled across all agents and broadcast back as next-layer input — a differentiable broadcast-to-all channel trained end-to-end with policy gradients. Anchors the communication family.

Sukhbaatar, Szlam, Fergus

T1 NeurIPS

communication · policy-gradient · parameter-sharing · CTDE
Contrasts DIAL, BiCNet
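The mean-pool broadcast can be sketched as one layer acting on scalar hidden states. This is a toy simplification: real CommNet layers use vector states and learned weight matrices, and here the communication input averages the other agents' states. The weights `w_self` and `w_comm` are shared by all agents.

```python
import math


def commnet_step(hidden, w_self, w_comm):
    """One CommNet-style layer on scalar hidden states (toy simplification)."""
    n = len(hidden)
    total = sum(hidden)
    out = []
    for h in hidden:
        # Communication input: mean of the other agents' hidden states.
        c = (total - h) / (n - 1) if n > 1 else 0.0
        out.append(math.tanh(w_self * h + w_comm * c))
    return out
```

Because the pooling is a mean and the weights are shared, permuting the agents simply permutes the outputs, which is the property BiCNet's imposed ordering gives up.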
dial · DIAL · 2016

First gradient-based learned-communication method: during centralized training agents pass continuous messages with gradients flowing through the channel; at execution messages are discretized (via the DRU). Pioneered differentiable-train / discrete-deploy.

Foerster, Assael, de Freitas, Whiteson

T1 NeurIPS

communication · q-learning · parameter-sharing · CTDE
Contrasts CommNet, IC3Net
bicnet · BiCNet · 2017

Models the team policy and Q-function as a bidirectional recurrent network whose hidden states act as an inter-agent channel (StarCraft 1 micro). Shows recurrent message passing captures team coordination, but the imposed ordering breaks permutation invariance.

Peng, Wen, Yang, Yuan, Tang, Long, Wang

pre arXiv

communication · actor-critic · parameter-sharing · CTDE
Contrasts CommNet, DIAL
maddpg · MADDPG · 2017

Multi-Agent DDPG: per-agent centralized critics (see all obs+actions at training) with decentralized actors. Established the CTDE paradigm in deep MARL, with cooperative, competitive, and mixed-motive variants on the particle environment.

Lowe, Wu, Tamar, Harb, Abbeel, Mordatch

T1 NeurIPS

policy-gradient · actor-critic · centralized-critic · CTDE · also: comp/mixed
Contrasts IQL, QMIX
CTDE anchor in formalism.tex; cited throughout the RedWithinBlue RL-taxonomy notes.
mpe · MPE (Particle Environment) · 2017

Lightweight 2D continuous-state particle environment introduced with MADDPG — cooperative navigation, predator-prey, speaker-listener, physical deception. The default sandbox for MADDPG, MAAC, M3DDPG, and most early CTDE policy-gradient methods.

Lowe, Wu, Tamar, Harb, Abbeel, Mordatch

T1 NeurIPS

benchmark · also: comp/mixed
Contrasts SMAC, Melting Pot
coma · COMA · 2018

Solves multi-agent credit assignment with a counterfactual baseline: the centralized critic compares each agent's chosen action against a baseline that marginalizes that agent's action under its own policy while holding the other agents' actions fixed, giving a stronger cooperative gradient than a naive joint-policy gradient.

Foerster, Farquhar, Afouras, Nardelli, Whiteson

T1 AAAI

policy-gradient · actor-critic · centralized-critic · CTDE
Extends MADDPG · Contrasts QMIX
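The counterfactual advantage can be computed directly from a tabular centralized critic. A minimal sketch, assuming a dictionary-based Q-table and a two-agent toy game that are illustrative only: the baseline averages the critic over one agent's alternative actions, weighted by that agent's policy, while the other agents' actions stay fixed.

```python
def coma_advantage(q_joint, joint_action, agent, policy):
    """Counterfactual advantage for one agent.

    q_joint: dict mapping joint-action tuples to centralized Q-values.
    policy: dict mapping this agent's actions to probabilities.
    """
    baseline = 0.0
    for alt_action, prob in policy.items():
        counterfactual = list(joint_action)
        counterfactual[agent] = alt_action  # vary only this agent's action
        baseline += prob * q_joint[tuple(counterfactual)]
    return q_joint[tuple(joint_action)] - baseline


# Toy two-agent game (assumed values).
q_table = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}
advantage = coma_advantage(q_table, (1, 1), agent=0, policy={0: 0.5, 1: 0.5})
# baseline = 0.5 * Q(0, 1) + 0.5 * Q(1, 1) = 2.5, so advantage = 4.0 - 2.5 = 1.5
```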
vdn · VDN · 2018

Earliest deep value-decomposition method: the joint Q-function is the simple sum of per-agent utilities, trivially satisfying Individual-Global-Max. Limited expressiveness vs. QMIX's hypernetwork mixing; the foundational value-decomposition anchor.

Sunehag, Lever, Gruslys, Czarnecki, … Graepel

T1 AAMAS

value-decomposition · q-learning · parameter-sharing · CTDE
Extends IQL · Contrasts MADDPG
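Why a plain sum satisfies Individual-Global-Max can be checked by brute force on a toy example. The utility values below are made up for illustration.

```python
from itertools import product


def vdn_joint_q(utilities, joint_action):
    """VDN: the joint Q-value is the plain sum of per-agent utilities."""
    return sum(q[a] for q, a in zip(utilities, joint_action))


def greedy_decentralized(utilities):
    """Each agent independently takes the argmax of its own utility."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in utilities)


def greedy_joint(utilities):
    """Brute-force argmax over the full joint action space."""
    spaces = [range(len(q)) for q in utilities]
    return max(product(*spaces), key=lambda u: vdn_joint_q(utilities, u))


# Three agents with 3, 2, and 2 actions (assumed values).
utilities = [[0.2, 1.0, -0.5], [0.7, 0.1], [0.0, 0.3]]
```

Since each term of the sum depends on only one agent's action, maximizing the terms separately maximizes the total, so decentralized greedy execution recovers the joint greedy action.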
Cooperative · classical-MARL · 2018–2022
mean-field-marl · Mean Field MARL · 2018

Approximates the joint action of a large population by the empirical distribution (mean field) of neighbor actions, reducing the Q-function to a pairwise interaction with a population summary. Scales to hundreds–thousands of homogeneous agents.

Yang, Luo, Li, Zhou, Zhang, Wang

T1 ICML

mean-field · q-learning · actor-critic · N: 20–1000 · also: comp/mixed
Contrasts MADDPG, QMIX, MAPPO
qmix · QMIX · 2018

Factorizes the joint Q as a monotonic mixture of per-agent utilities, enforced by a non-negative-weight hypernetwork, satisfying IGM so per-agent argmax = joint argmax. Strong on discrete-action SMAC; the monotonicity constraint is its key limitation for sacrifice actions.

Rashid, Samvelyan, Schroeder de Witt, Farquhar, Foerster, Whiteson

T1 ICML

value-decomposition · q-learning · centralized-critic · CTDE
Extends VDN · Contrasts MADDPG, MAPPO
Monotonicity is exactly the issue flagged in RedWithinBlue: connectivity maintenance often needs one agent to "sacrifice" so others can explore.
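The monotonicity constraint comes from forcing every mixing weight to be non-negative. The sketch below is a two-layer mixer with fixed example weights. That is an assumption for illustration: in QMIX the weights and biases are produced by hypernetworks conditioned on the global state.

```python
import math


def elu(x):
    """ELU activation, monotonically increasing in x."""
    return x if x > 0 else math.exp(x) - 1.0


def qmix_mix(agent_qs, w1, b1, w2, b2):
    """Two-layer monotonic mixer; abs() keeps every mixing weight non-negative."""
    hidden = [
        elu(sum(abs(w) * q for w, q in zip(row, agent_qs)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2


# Example parameters (assumed constants; the signs are deliberately mixed).
W1 = [[0.5, -1.2], [-0.3, 0.8]]
B1 = [0.1, -0.2]
W2 = [-0.7, 0.4]
B2 = 0.05
```

Raising any single agent's utility can never lower the mixed value, which gives IGM but also rules out value functions where one agent should accept a worse local utility for the team's benefit.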
bad · Bayesian Action Decoder · 2019

Recasts a cooperative Dec-POMDP with hidden information as a public-belief MDP: agents condition on a Bayesian posterior over private states given public history, sharing an implicit language. State of the art on 2-player Hanabi at publication.

Foerster, Song, Hughes, Burch, … Bowling

T1 ICML

policy-gradient · actor-critic · communication · CTDE
Contrasts MADDPG, QMIX
The public-belief construction is the canonical reference for grounded implicit communication.
ic3net · IC3Net · 2019

Adds a learned binary gate to CommNet so each agent decides at each step whether to broadcast at all — learning when to communicate matters as much as what. Extends to mixed settings where unconditional broadcast leaks info to opponents.

Singh, Jain, Sukhbaatar

T1 ICLR

communication · policy-gradient · parameter-sharing · CTDE · also: mixed-motive
Extends CommNet · Contrasts CommNet, TarMAC
Direct precedent for bandwidth-aware messaging: gating is the simplest "speak only when necessary" prior for connectivity-constrained missions.
maven · MAVEN · 2019

Augments QMIX with a hierarchical latent variable conditioning the joint policy, trained via mutual-information maximization to encourage diverse coordinated exploration — addressing QMIX's exploration weakness on sparse-reward tasks.

Mahajan, Rashid, Samvelyan, Whiteson

T1 NeurIPS

value-decomposition · q-learning · centralized-critic · CTDE
Extends QMIX · Contrasts QPLEX
openai-five · OpenAI Five · 2019

Defeated the world-champion Dota 2 team with PPO at unprecedented scale: five parameter-sharing LSTM agents on a cooperative team reward. Scale + self-play reached pro-level play in a 5v5 partially-observable game with no explicit multi-agent algorithm.

OpenAI: Berner, Brockman, Chan, … Zhang

pre arXiv

policy-gradient · self-play · parameter-sharing · CTCE · also: competitive
Contrasts AlphaStar
Major coop↔comp crossover: within-team cooperation from shared reward, cross-team competition from self-play — the mode-shifting framing applies directly.
qtran · QTRAN · 2019

Replaces QMIX's monotonic mixing with a soft regularization of the IGM condition, making the function class strictly broader. Strongest theory in the value-decomposition family but often empirically loses to QMIX/QPLEX because the regularizer is harder to tune.

Son, Kim, Kang, Hostallero, Yi

T1 ICML

value-decomposition · q-learning · centralized-critic · CTDE
Extends QMIX · Contrasts QPLEX
smac · SMAC · 2019

Benchmark of cooperative StarCraft II micromanagement (2–27 unit agents, decentralized partial obs, scripted enemy). The de-facto cooperative-MARL benchmark on which QMIX, MAPPO, MAVEN, QPLEX were validated.

Samvelyan, Rashid, Schroeder de Witt, Farquhar, … Whiteson

T1 AAMAS

benchmark · N: 2–27
Contrasts Melting Pot, MPE
tarmac · TarMAC · 2019

Replaces CommNet's mean-pool broadcast with signature-based soft attention: agents emit (key, value) pairs, listeners query to weight them — learning who to address and what to send. Established attention as the default learned-communication architecture.

Das, Gervet, Romoff, Batra, … Pineau

T1 ICML

communication · transformer · attention · CTDE
Extends CommNet · Contrasts CommNet, IC3Net
Closest classical-MARL precedent for connectivity-aware messaging policies.
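The signature-based soft attention can be sketched for a single listener. This is an illustration under assumptions: keys, values, and the query are plain Python lists here, whereas in TarMAC they are produced by learned projections of each agent's hidden state.

```python
import math


def tarmac_aggregate(query, keys, values):
    """Soft attention over incoming messages for one listener.

    keys[j] is sender j's signature; values[j] is its message content.
    Returns (aggregated message, attention weights).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    weights = [w / norm for w in weights]
    dim = len(values[0])
    message = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return message, weights


message, weights = tarmac_aggregate(
    query=[1.0, 0.0],
    keys=[[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]],
    values=[[1.0], [2.0], [3.0]],
)
```

The sender whose signature aligns with the listener's query dominates the aggregated message, which is how addressing emerges without a hard-wired topology.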
dgn · DGN (Graph Conv RL) · 2020

Treats agents as nodes in a spatial-neighborhood graph and applies multi-head graph-attention to fuse local observations before Q-learning. GNN message passing scales cooperatively to ~100 agents. Anchor of the GNN family in cooperative MARL.

Jiang, Dun, Huang, Lu

T1 ICLR

GNN · q-learning · attention · CTDE · N: 10–100
Contrasts CommNet, TarMAC
ippo · IPPO · 2020

Independent PPO with parameter sharing: each agent runs PPO on its own observations and treats others as environment. The key result — IPPO (DTDE) often matches or beats QMIX (CTDE) on SMAC — challenged the field's CTDE-by-default assumption.

Schroeder de Witt, Gupta, Makoviichuk, … Whiteson

pre arXiv

policy-gradient · actor-critic · parameter-sharing · DTDE
Extends IQL · Contrasts QMIX
DTDE baseline in the RedWithinBlue paradigm-comparison table.
ndq · NDQ · 2020

Combines QMIX value-decomposition with an information-theoretic regularizer minimizing mutual information between sent messages and the sender's full observation while preserving task content — minimal-bandwidth communication only when independent decomposition is insufficient.

Wang, Wang, Zheng, Zhang

T1 ICLR

communication · value-decomposition · q-learning · CTDE
Extends QMIX · Contrasts TarMAC, CommNet
The information-bottleneck framing maps cleanly onto link-capacity constraints in connectivity-aware missions.
roma · ROMA (Emergent Roles) · 2020

Augments value-decomposition MARL with a stochastic role embedding per agent, regularized to be mutually informative with trajectories yet compact enough to drive specialization. Yields emergent roles without hand-designed priors; pairs with QMIX-style mixing.

Wang, Dong, Lesser, Zhang

T1 ICML

role-based · value-decomposition · q-learning · CTDE
Extends QMIX · Contrasts QMIX, QPLEX, MAVEN
Classical-MARL precedent for asymmetric team coordination — relevant when scout-vs-follower roles must emerge without hand-specification.
dicg · DICG (Implicit Coordination Graphs) · 2021

Learns an implicit coordination graph end-to-end: an attention module produces a soft adjacency feeding a GNN reasoning layer over joint values/actions. Mitigates relative overgeneralization in predator-prey and competes with QMIX on SMAC without hand-specified structure.

Li, Gupta, Morales, Allen, Kochenderfer

T1 AAMAS

GNN · actor-critic · attention · CTDE
Extends DGN · Contrasts MAGIC, QMIX
Implicit graph learning is the alternative to spatial-proximity graphs when "who-coordinates-with-whom" is task-dependent.
magic · MAGIC · 2021

Combines a Scheduler (when/to-whom to communicate) with a graph-attention Message Processor over a dynamically-learned communication graph — unifying the when/who/what axes that CommNet, IC3Net, TarMAC each addressed separately. Validated on a physical robot-soccer testbed.

Niu, Paleja, Gombolay

T1 AAMAS

communication · GNN · attention · CTDE
Extends TarMAC, DGN · Contrasts CommNet, IC3Net, DICG
Tightest existing match for "learned dynamic comm-graph + attention message processing" — direct precedent for connectivity-aware multi-robot policies.
qplex · QPLEX · 2021

Duplex dueling decomposition with multi-head attention over agents, generalizing QMIX's monotonic mixing to non-monotonic interactions while still satisfying IGM. Outperforms QMIX and QTRAN on hard SMAC scenarios needing sacrifice actions.

Wang, Ren, Liu, Yu, Zhang

T1 ICLR

value-decomposition · q-learning · attention · CTDE
Extends QMIX · Contrasts QTRAN
Addresses the QMIX monotonicity limitation that is load-bearing in connectivity-maintenance tasks.
updet · UPDeT · 2021

Replaces the per-agent recurrent encoder with an entity-based transformer: each agent attends over a variable-length set of observed entities, and the policy head decouples into action groups. A single policy transfers across team sizes — universal in N.

Hu, Zhu, Chang, Liang

T1 ICLR

transformer · q-learning · attention · CTDE
Extends QMIX · Contrasts QMIX, MAPPO
Entity-set attention is exactly what permutation-invariant multi-robot policies need.
mamba · MAMBA · 2022

Multi-agent latent world-model (Dreamer family): agents jointly maintain a shared latent state via learned communication and train policies on imagined rollouts. Model-based imagination matches/exceeds model-free CTDE at lower sample budgets.

Egorov, Shpilman

pre arXiv

model-based · actor-critic · communication · CTDE
Extends DreamerV3, World Models · Contrasts MAPPO, QMIX
mappo · MAPPO · 2022

CTDE variant of PPO: a centralized value uses global state, decentralized actors use per-agent obs, agents share parameters. Empirically the strongest cooperative-MARL baseline on SMAC/MPE/Hanabi as of 2022, despite the field's earlier off-policy preference.

Yu, Velu, Vinitsky, Wang, Bayen, Wu

T1 NeurIPS

policy-gradient · actor-critic · centralized-critic · CTDE
Extends COMA, IPPO · Contrasts QMIX, MADDPG
Phase 1–3 baseline in the RedWithinBlue curriculum.
mat · MAT (Multi-Agent Transformer) · 2022

Recasts cooperative MARL as sequence modeling: an encoder-decoder transformer encodes the joint observation and decodes agent actions one at a time, using the multi-agent advantage decomposition theorem for monotonic improvement. Beats MAPPO, QMIX, HAPPO on SMAC/MA-MuJoCo.

Wen, Kuba, Lin, Zhang, Wen, Wang, Yang

T1 NeurIPS

transformer · policy-gradient · actor-critic · CTDE
Extends UPDeT, MAPPO · Contrasts MAPPO, QMIX, QPLEX
Cooperative · classical-MARL · 2023–2026
haven · HAVEN · 2023

Hierarchical cooperative MARL with dual coordination at both the high level (across subgoal selections) and low level (across primitive actions), combined with QMIX-style value decomposition. Improves on flat CTDE on long-horizon SMAC tasks.

Xu, Bai, Zhang, Li, Fan

T1 AAAI

hierarchical · value-decomposition · actor-critic · CTDE
Extends QMIX, FeUdal Networks · Contrasts MAPPO, QMIX
maestro · MAESTRO · 2023

Multi-agent extension of PAIRED: jointly co-evolves a population of co-players and a curriculum of environments to maximize regret in cooperative MARL. Targets the failure where environment design ignores the partner-policy distribution; partner-aware curricula generalize better.

Samvelyan, Khan, Dennis, Jiang, … Rocktäschel

T1 ICLR

environment-design · curriculum · policy-gradient · population
Extends PAIRED · Contrasts PAIRED, ACCEL
Collaborative — temporarily aligned around a sub-goal; different individual rewards/roles; alignment dissolves when the sub-task ends
Every entry below is an LLM-agent system, 2023–2026. There are zero classical-MARL collaborative entries in the corpus. Coordination is in natural language, rewards are prompt-implicit, roles are asymmetric, and there is no value function or gradient update. This empty classical column is the map's central finding.
Collaborative · llm-agent · 2023–2026
agentverse · AgentVerse · 2023

Structures problem-solving into four phases — expert recruitment, collaborative decision-making, action execution, evaluation — dynamically assembling a team of role-tagged LLM agents with reflection. Explicit phase decomposition can beat free-form group chat.

Chen, Su, Zuo, Yang, … Zhou

pre arXiv

language-model · role-based · natural-language · debate · reflection
Extends AutoGen · Contrasts MetaGPT, ChatDev
autogen · AutoGen · 2023

Microsoft's framework for multi-agent LLM apps: agents with roles (UserProxyAgent, AssistantAgent, GroupChatManager) exchange natural-language messages to solve a user task, collaborating temporarily and dissolving afterward. Quintessentially collaborative, with each role's reward implicit in its prompt.

Wu, Bansal, Zhang, Wu, … Wang

pre arXiv

language-model · role-based · natural-language
Extends ReAct · Contrasts MADDPG, MAPPO
Anchor for the collaborative paradigm — no explicit reward, no value function, no gradient updates, yet solves coordination via prompt design alone.
camel · CAMEL · 2023

Two-agent role-playing framework: an "AI user" and an "AI assistant" get complementary roles via "inception prompting" and exchange messages to solve a task. One of the earliest multi-agent LLM frameworks (March 2023); seeded AutoGen and MetaGPT.

Li, Hammoud, Itani, Khizbullin, Ghanem

T1 NeurIPS

language-model · role-based · natural-language
multi-agent-debate · Multi-Agent Debate · 2023

Multiple LLM agents propose, critique, and refine each other's answers across structured debate rounds, improving factual accuracy and reasoning beyond single-agent baselines. Competitive-form mechanism producing a collaborative outcome — borderline mixed-motive in form.

Liang, He, Jiao, Wang, … Shi

pre arXiv

language-model · opponent-modeling · debate · also: mixed-motive
Contrasts AutoGen
Cross-mode example: the same agents act competitively (debate) to reach a collaborative goal — the "regiments shifting modes" framing.
chatdev · ChatDev · 2024

LLM-agent virtual software company with role-based agents communicating via structured chat. Distinguishes itself from MetaGPT through explicit "double-agent" debate phases at each stage (designer ↔ reviewer). Strong collaborative role-based coordination.

Qian, Liu, Liu, Chen, … Sun

T2 ACL

language-model · role-based · natural-language · debate
Extends AutoGen · Contrasts MetaGPT
crewai · CrewAI · 2024

Open-source Python framework orchestrating role-playing LLM agents as a "crew" — each with role, goal, backstory, tools — distributing tasks sequentially or hierarchically. A lightweight production-oriented alternative to AutoGen with explicit role abstractions.

João Moura

fw Framework

language-model · role-based · natural-language · hierarchical
Extends AutoGen · Contrasts LangGraph, MetaGPT
Representative of how practitioners actually deploy multi-agent role coordination outside research labs.
langgraph · LangGraph · 2024

LangChain library for stateful multi-agent LLM workflows as explicit directed graphs with shared state: nodes are LLM agents or tool calls, edges are conditional transitions, persistent state enables long-horizon coordination and human-in-the-loop checkpoints.

LangChain Inc

fw Framework

language-model · role-based · planning · hierarchical
Extends AutoGen · Contrasts CrewAI, MetaGPT
The closest production-grade analogue to formal coordination diagrams in classical MARL.
metagpt · MetaGPT · 2024

Software-development LLM-agent system encoding human SOPs: a Product Manager writes requirements, an Architect designs, Engineers implement, QA tests. Hand-designed workflows ("meta-programming via prompts") beat free-form multi-agent conversation on structured tasks.

Hong, Zhuge, Chen, Zheng, … Schmidhuber

T1 ICLR

language-model · role-based · hierarchical · natural-language
Extends AutoGen
Quintessential collaborative system — the role-asymmetry it relies on is exactly what classical cooperative MARL cannot natively express.
Competitive — strictly opposed interests; two-player zero-sum, N-player rank contests, security games, adversarial MARL
Competitive · classical-MARL · 2015–2018
markov-games · Markov Games (Littman) · 1994

Introduces Markov (stochastic) games as the formalism for competitive MARL and proposes minimax-Q, a Q-learning variant (the max in the backup becomes a minimax over the two players' actions) that converges in two-player zero-sum games. The standard ancestor citation for self-play and minimax-RL.

Michael L. Littman

T1 ICML

q-learning · opponent-modeling
Contrasts Dec-POMDP
Foundational competitive-MARL formalism cited in formalism.tex; minimax-Q is the conceptual ancestor of M3DDPG.
Competitive · classical-MARL · 2018–2022
alphastar · AlphaStar · 2019

Grandmaster-level StarCraft II from DeepMind via population-based self-play with a "league" of main agents, main exploiters, and league exploiters to prevent strategy cycles. Influential for the population-of-policies family.

Vinyals, Babuschkin, Czarnecki, Mathieu, … Silver

T1 Nature

policy-gradient · self-play · population · imitation
Contrasts OpenAI Five
m3ddpg · M3DDPG · 2019

Robust adversarial extension of MADDPG: trains each agent's policy against an adversarially perturbed approximation of the others as a minimax objective on the centralized critic (one-step linearized inner gradient). Improves robustness to adversarial co-players.

Li, Wu, Cui, Dong, Fang, Russell

T1 AAAI

policy-gradient · actor-critic · opponent-modeling · CTDE · also: cooperative
Extends MADDPG · Contrasts MADDPG, OpenAI Five
Worst-case-robust CTDE baseline; relevant when blue-team policies must be hardened against red-team perturbations.
pr2 · PR2 (Recursive Reasoning) · 2019

Models multi-agent decisions as recursive level-K reasoning where each agent assumes opponents reason one level shallower; uses variational inference to approximate the joint policy. Anchor for the Bayesian / cognitive-hierarchy branch of opponent modeling.

Wen, Yang, Luo, Wang, Pan

T1 ICLR

opponent-modeling · actor-critic · CTDE · also: mixed-motive
Contrasts ROMMEO, LOLA, MADDPG
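The level-K recursion this card refers to can be sketched on a small matrix game. This is only the recursion itself (level-0 plays uniformly, level-k best-responds to a level-(k-1) opponent). It is not PR2's variational-inference machinery, and the toy "request game" and all function names are hypothetical illustrations.

```python
def best_response(payoff, opponent_policy):
    """Pure best response to a mixed opponent policy.
    payoff[a][o] is this agent's payoff for own action a vs opponent action o."""
    expected = [
        sum(p * row[o] for o, p in enumerate(opponent_policy)) for row in payoff
    ]
    best = max(range(len(expected)), key=expected.__getitem__)
    return [1.0 if a == best else 0.0 for a in range(len(payoff))]

def level_k_policy(payoff_self, payoff_other, k):
    """Level-0 is uniform; level-k assumes the opponent reasons at level k-1."""
    if k == 0:
        n = len(payoff_self)
        return [1.0 / n] * n
    opponent = level_k_policy(payoff_other, payoff_self, k - 1)
    return best_response(payoff_self, opponent)

def request_game(n_actions=4, bonus=3):
    """Toy symmetric game: action a requests a+1 units and always receives them,
    plus a bonus for requesting exactly one unit less than the opponent."""
    return [
        [(a + 1) + (bonus if a + 1 == o else 0) for o in range(n_actions)]
        for a in range(n_actions)
    ]

G = request_game()
level1 = level_k_policy(G, G, 1)  # best response to a uniform opponent
level2 = level_k_policy(G, G, 2)  # undercuts the level-1 request
level3 = level_k_policy(G, G, 3)  # undercuts the level-2 request
```

Each extra level of reasoning shifts the request down by one unit, which is the characteristic "one step deeper than my opponent" pattern of cognitive-hierarchy models.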
rommeo · ROMMEO · 2019

Builds an opponent model jointly with each agent's policy under a maximum-entropy objective, then regularizes the policy update against the inferred opponent posterior — a Bayes-optimal best-response under co-player uncertainty in general-sum and competitive games.

Tian, Wen, Gong, Punakkath, Zou, Wang

T1 IJCAI

opponent-modeling · policy-gradient · actor-critic · CTDE · also: mixed-motive
Extends MADDPG · Contrasts PR2, LOLA, MADDPG
Competitive · hybrid · 2018–2022
cicero · CICERO · 2022

Meta AI's human-level Diplomacy player: a planning module trained via no-press self-play plus a dialogue model fine-tuned on human games and conditioned on intended actions. Human-level competitive multi-agent natural-language negotiation.

Meta FAIR Diplomacy Team

T1 Science

language-model · planning · opponent-modeling · self-play · also: mixed-motive
Contrasts AutoGen, MetaGPT
Combines RL self-play with LLM dialogue — a hybrid that classical MARL surveys mostly miss.
Mixed-motive — general-sum games; social dilemmas, negotiation, mechanism design, emergent-society simulation
Mixed-motive · classical-MARL · 2015–2018
lola · LOLA · 2018

Differentiates each agent's update through a one-step lookahead on the opponent's learning step, so policies are shaped by how they influence the opponent's future gradient. Yields tit-for-tat cooperation in iterated prisoner's dilemma where naive learners defect.

Foerster, Chen, Al-Shedivat, Whiteson, Abbeel, Mordatch

T1 AAMAS

opponent-modeling · policy-gradient · CTDE · also: competitive
Contrasts MADDPG, ROMMEO, PR2
Direct conceptual ancestor for any opponent-shaping mediator design in mixed-motive scenarios.
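The one-step lookahead in this card can be written out in symbols. The notation is chosen here for illustration and is not taken from the source: V^1 and V^2 are the two agents' expected returns, θ^1 and θ^2 their policy parameters, and η, δ are step sizes.

```latex
\begin{aligned}
\Delta\theta^{2} &= \eta\,\nabla_{\theta^{2}} V^{2}(\theta^{1},\theta^{2})
  && \text{(anticipated naive step of the opponent)}\\
V^{1}(\theta^{1},\theta^{2}+\Delta\theta^{2}) &\approx
  V^{1}(\theta^{1},\theta^{2})
  + (\Delta\theta^{2})^{\top}\nabla_{\theta^{2}} V^{1}(\theta^{1},\theta^{2})
  && \text{(first-order expansion)}\\
\theta^{1} &\leftarrow \theta^{1} + \delta\,\nabla_{\theta^{1}}\!\left[
  V^{1}(\theta^{1},\theta^{2})
  + (\Delta\theta^{2})^{\top}\nabla_{\theta^{2}} V^{1}(\theta^{1},\theta^{2})
  \right]
\end{aligned}
```

Because Δθ^2 itself depends on θ^1, the gradient of the second term with respect to θ^1 contains a mixed second-derivative of V^2. That term is what lets agent 1 shape the opponent's learning step rather than merely respond to its current policy.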
Mixed-motive · classical-MARL · 2018–2022
social-influence · Social Influence · 2019

Adds an intrinsic reward proportional to one agent's causal influence on another's policy (KL between conditional and marginal action distributions) to encourage emergent communication and prosocial behavior in Cleanup/Harvest social dilemmas without explicit channels.

Jaques, Lazaridou, Hughes, Gulcehre, … de Freitas

T1 ICML

policy-gradient · communication · opponent-modeling · CTDE
Contrasts Bayesian Action Decoder, LOLA
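The influence reward in this card (KL between conditional and marginal action distributions) can be sketched for discrete actions as follows. The function names and the example distributions are hypothetical, and the marginal is assumed to be obtained by averaging agent j's conditional policy over agent i's counterfactual actions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def influence_reward(action_prior, conditional, taken_action):
    """KL( p(a_j | a_i = taken) || p(a_j) ) for a fixed state.

    action_prior[a_i]      probability that agent i picks action a_i
    conditional[a_i][a_j]  agent j's action distribution given agent i's a_i
    The marginal p(a_j) averages the conditional over agent i's
    counterfactual actions.
    """
    n_j = len(conditional[0])
    marginal = [
        sum(action_prior[ai] * conditional[ai][aj] for ai in range(len(action_prior)))
        for aj in range(n_j)
    ]
    return kl_divergence(conditional[taken_action], marginal)

prior = [0.5, 0.5]
# Agent j strongly follows agent i's action: high influence.
follows = [[0.9, 0.1], [0.1, 0.9]]
# Agent j ignores agent i entirely: zero influence.
ignores = [[0.7, 0.3], [0.7, 0.3]]

r_follow = influence_reward(prior, follows, taken_action=0)
r_ignore = influence_reward(prior, ignores, taken_action=0)
```

The reward is zero exactly when agent j's behavior does not depend on what agent i did, so adding it to agent i's return pushes agent i toward actions that other agents actually react to.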
ai-economist · AI Economist · 2020

Two-level RL where a "social planner" agent designs tax policy while heterogeneous worker agents simultaneously learn to respond (PPO at both levels), discovering tax schedules that improve equality-vs-productivity trade-offs. Anchor for RL-based mechanism design.

Zheng, Trott, Srinivasa, Parkes, Socher

pre arXiv

mechanism-design · policy-gradient · actor-critic · hierarchical
Contrasts Melting Pot, CICERO
The hierarchical planner/worker decomposition mirrors the mediator role in RedWithinBlue.
melting-pot · Melting Pot · 2021

DeepMind benchmark of MARL environments around social dilemmas, free-rider problems, and common-pool resources. The most prominent infrastructure for studying generalization across cooperative ↔ competitive ↔ mixed-motive within one framework.

Leibo, Duéñez-Guzmán, Vezhnevets, Agapiou, … Graepel

T1 ICML

benchmark · N: 2–16 · also: coop/comp
Closest existing benchmark to the framing of agents shifting between coop / collab / compete modes.
Mixed-motive · llm-agent · 2023–2026
generative-agents · Generative Agents · 2023

25 LLM-driven agents in a sandbox town (Smallville) show believable individual and social behavior via a memory stream, reflection, and planning — emergent information spread, relationship formation, event coordination. Anchor for emergent-society simulation.

Park, O'Brien, Cai, Morris, Liang, Bernstein

T2 UIST

language-model · memory · reflection · N: 25 · also: collaborative
Mixed-motive and collaborative mode-shifting in one system — alignment shifts with context, exactly the dynamic-regiments framing.
ai-town · AI Town · 2024

Open-source deployable virtual town inspired by Generative Agents: LLM-driven characters live, plan, gossip, and form relationships in a Convex-backed real-time world. A practical template for persistent multi-agent LLM societies, widely forked.

a16z Infra

fw Framework

language-model · memory · reflection · also: collaborative
Extends Generative Agents

70 entries rendered by primary mode — solo 19 (hybrid 2, llm-agent 5, classical-RL-extended 12), cooperative 31, collaborative 8, competitive 6, mixed-motive 6. Secondary modes (the "also:" tags) place several entries in more than one mode at once.

3 · the field's self-image

The 6 surveys

Surveys are treated as first-class nodes: they define the field's self-image at a moment in time. Each card lists what the survey covers, what it explicitly omits, and how many pool entries it cites. The omissions, cross-checked against the entry pool, drive the gap report below.

albrecht-christianos-schaefer-2024
MARL: Foundations & Modern Approaches
2024

Most recent comprehensive textbook treatment of cooperative and competitive MARL. Strong on theory (Markov games, POSG, Nash, Bellman equations) and modern deep-MARL benchmarks; mostly excludes the LLM-agent wave and applied / mission-level work.

Albrecht, Christianos, Schäfer · MIT Press (textbook)

Covers: cooperative, competitive, mixed-motive · classical-MARL
Omits: llm-agent, emergent-society-simulation, real-world-deployment, mission-level-success, role-asymmetry, foundation-model-agents
Cites: 10 entries (mappo, qmix, vdn, maddpg, coma, …)
oroojlooyjadid-hajinezhad-2023
A review of cooperative MA deep RL
2023

Cooperative-MARL-only survey organized by communication, coordination, training paradigm, and applications. Useful for the cooperative-only literature; a conspicuous gap on competitive and mixed-motive.

OroojlooyJadid, Hajinezhad · Applied Intelligence (survey)

Covers: cooperative · classical-MARL
Omits: llm-agent, competitive, mixed-motive, mission-level-success, foundation-model-agents
Cites: 5 entries (mappo, qmix, vdn, maddpg, coma)
du-ding-2023
A survey on MARL with communication
2023

Survey focused specifically on learned communication in MARL: protocols (broadcast, targeted, attention), representations (continuous, discrete, symbolic), and learning algorithms. Useful for the communication-learning cluster.

Zhai, Ding · arXiv (survey)

Covers: cooperative · classical-MARL
Omits: llm-agent, mixed-motive, role-asymmetry, mission-level-success
gronauer-diepold-2022
Multi-agent deep RL: a survey
2022

Broad survey of deep MARL up to 2022, organized by training paradigm (centralized vs. decentralized) and observability (full vs. partial). Predates the LLM-agent wave; useful as a "what classical MARL covered" reference for triangulating gaps.

Gronauer, Diepold · AI Review (survey)

Covers: cooperative, competitive, mixed-motive · classical-MARL
Omits: llm-agent, role-asymmetry, foundation-model-agents, real-world-deployment, mission-level-success
Cites: 6 entries (mappo, qmix, vdn, maddpg, coma, …)
zhang-yang-basar-2021
MARL: a selective overview
2021

Theory-leaning selective survey emphasizing convergence guarantees, Markov games, and game-theoretic foundations. Contrasts cooperative, competitive, and mixed settings via formal analysis; light on emerging deep-MARL empirical work.

Zhang, Yang, Başar · Handbook of RL & Control (survey)

Covers: cooperative, competitive, mixed-motive · classical-MARL
Omits: llm-agent, foundation-model-agents, mission-level-success, real-world-deployment, role-asymmetry
Cites: 5 entries (maddpg, qmix, vdn, coma, iql)
hernandez-leal-2019
A survey and critique of MA deep RL
2019

Influential 2019 critique that sorts work into four categories: analysis of emergent behaviors, learning communication, learning cooperation, and agents modeling agents. Its critical stance helps identify gaps, in particular how thin coverage of communication and opponent modeling was at the time.

Hernandez-Leal, Kartal, Taylor · JAAMAS (survey)

Covers: cooperative, competitive, mixed-motive · classical-MARL
Omits: llm-agent, foundation-model-agents, role-asymmetry, mission-level-success, real-world-deployment
Cites: 5 entries (maddpg, coma, vdn, qmix, iql)
4 · gaps

Gaps & opportunities

The build pipeline auto-ranks every (mode, method-family) cell by an interest score: a cell is interesting if it is sparse (few entries) AND its neighbors are dense. Absences neighboring populated cells are more likely structural — opportunities for new work — than absences in already-empty regions.
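The source does not state the scoring formula. As a hedged sketch, the rows of the ranking table (all with zero entries in the cell) are consistent with score = (densest neighbor + 1) / (own entries + 1), so the code below assumes that form. Other formulas fit the same rows equally well, the neighbor definition is left to the caller, and `interest_score` and `rank_cells` are hypothetical names.

```python
def interest_score(cell_count, neighbor_counts):
    """Assumed form: (densest neighbor + 1) / (own entries + 1).
    High when the cell is sparse and at least one neighbor is dense."""
    return (max(neighbor_counts, default=0) + 1) / (cell_count + 1)

def rank_cells(cells):
    """cells maps (mode, method_family) -> (cell_count, neighbor_counts).
    Returns (score, key) pairs sorted from most to least interesting."""
    scored = [
        (interest_score(count, neighbors), key)
        for key, (count, neighbors) in cells.items()
    ]
    return sorted(scored, key=lambda item: -item[0])

# Three rows taken from the ranking table (only the maximum neighbor count is
# known there, so each neighbor list has a single element), plus one
# hypothetical populated cell for contrast.
cells = {
    ("collaborative", "curriculum"): (0, [15]),
    ("collaborative", "q-learning"): (0, [12]),
    ("cooperative", "planning"): (0, [10]),
    ("cooperative", "value-decomposition"): (15, [12]),  # hypothetical counts
}
ranked = rank_cells(cells)
```

Under this assumed form an empty cell next to a 15-entry neighbor scores 16.00, matching the top rows of the table, while an already dense cell scores below 1 and drops to the bottom of the ranking.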

Top-ranked sparse cells

# | Mode | Method family | Cell | Max nbr | Score
1 | collaborative | curriculum | 0 | 15 | 16.00
2 | collaborative | parameter-sharing | 0 | 15 | 16.00
3 | collaborative | mechanism-design | 0 | 15 | 16.00
4 | cooperative | mechanism-design | 0 | 15 | 16.00
5 | solo | parameter-sharing | 0 | 15 | 16.00
6 | solo | mechanism-design | 0 | 15 | 16.00
7 | collaborative | model-based | 0 | 12 | 13.00
8 | collaborative | q-learning | 0 | 12 | 13.00
9 | collaborative | actor-critic | 0 | 12 | 13.00
10 | collaborative | value-decomposition | 0 | 12 | 13.00
11 | collaborative | policy-gradient | 0 | 12 | 13.00
13 | cooperative | planning | 0 | 10 | 11.00
14 | cooperative | language-model | 0 | 10 | 11.00
15–16 | collaborative | GNN / communication | 0 | 9 | 10.00

Full top-30 ranking lives in assets/sources/marl_taxonomy/ (the original gaps.md). The collaborative row dominates: nearly every classical method-family that is dense for cooperative is completely empty for collaborative — the same finding as the empty classical column above, now quantified.

Mode × era absences

collaborative: dense in 2023–2026 (8 entries), empty in 2015–2018 and 2018–2022. The mode simply did not exist as a studied object before the LLM-agent wave.
competitive: dense in 2018–2022 (5 entries), sparse in 2015–2018 and 2023–2026. The self-play / opponent-modeling peak was the deep-MARL middle era.

The survey-omission finding

Cross-checking each survey's omits field against the entry pool surfaces concepts that are present in the literature but absent from the canonical surveys:

llm-agent and mission-level-success are omitted by all 6 surveys. role-asymmetry is omitted by nearly all (5 of 6). foundation-model-agents by 5; real-world-deployment by 4. The field's own self-image has no place for the agentic-AI wave, for whether a mission actually succeeds, or for role-differentiated teams.
This is precisely the project's governing research question. Zymera asks how covert, within-bounds micro-level agent misbehavior propagates to mission-level failure in a spatial connectivity mission governed by role / graph-position — the exact triple (mission-level-success + role-asymmetry + the post-classical llm-agent/agentic regime) that every canonical survey leaves out. The gap the map measures and the gap the program targets are the same gap. See Literature for the five-flank review and Paper for the formal statement.
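The omission counts quoted above can be reproduced directly from the Omits lines of the six survey cards. The sketch below copies those lists verbatim; the dictionary layout and variable names are illustrative, not the project's actual pipeline.

```python
from collections import Counter

# Omits lists transcribed from the six survey cards on this page.
SURVEY_OMITS = {
    "albrecht-christianos-schaefer-2024": [
        "llm-agent", "emergent-society-simulation", "real-world-deployment",
        "mission-level-success", "role-asymmetry", "foundation-model-agents",
    ],
    "oroojlooyjadid-hajinezhad-2023": [
        "llm-agent", "competitive", "mixed-motive",
        "mission-level-success", "foundation-model-agents",
    ],
    "du-ding-2023": [
        "llm-agent", "mixed-motive", "role-asymmetry", "mission-level-success",
    ],
    "gronauer-diepold-2022": [
        "llm-agent", "role-asymmetry", "foundation-model-agents",
        "real-world-deployment", "mission-level-success",
    ],
    "zhang-yang-basar-2021": [
        "llm-agent", "foundation-model-agents", "mission-level-success",
        "real-world-deployment", "role-asymmetry",
    ],
    "hernandez-leal-2019": [
        "llm-agent", "foundation-model-agents", "role-asymmetry",
        "mission-level-success", "real-world-deployment",
    ],
}

# Count, for each concept, how many surveys omit it.
omitted_by = Counter(
    tag for tags in SURVEY_OMITS.values() for tag in set(tags)
)

# Concepts omitted by every survey in the pool.
universal = sorted(
    tag for tag, n in omitted_by.items() if n == len(SURVEY_OMITS)
)
```

Running the tally gives `llm-agent` and `mission-level-success` as the two universally omitted concepts, with `role-asymmetry` and `foundation-model-agents` each omitted by five surveys and `real-world-deployment` by four.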

25 pool entries are cited by no survey at all — nearly the entire llm-agent and classical-RL-extended frontier (Voyager, ReAct, AutoGen, MetaGPT, DPO, DreamerV3, …). The surveys' citation pool is almost exactly the classical cooperative core (MAPPO, QMIX, VDN, MADDPG, COMA, IQL).

5 · connectors

Cross-disciplinary connectors

When the gap report flags an empty cell, it surfaces the adjacent disciplines tagged on the neighboring entries — out-of-the-box ideas that may apply. The connectors catalogue defines, for each field, the canonical methods that have crossed (or could cross) into MARL:

game-theory

No-regret learning → self-play convergence; mechanism design → mediator-induced cooperation; Stackelberg → leader-follower MARL.

control

Robust H∞ for adversarial perturbations; consensus algorithms as networked-MARL precursors; MPC as a model-based planning baseline.

information-theory

Minimum-information communication (NDQ); information-bottleneck regularization for compositional policies.

distributed-systems

Byzantine-robust federated MARL; distributed Kalman filtering as a sensor-fusion precursor.

graph-theory

Spectral analysis of communication topologies; random-graph models for ad-hoc team formation.

adversarial-ml

Robust MARL under action / observation / communication / reward attacks; certified MARL policies.

safe-rl

Shielded MARL for mission-level safety guarantees; constrained CTDE with cost critics.

mech-design

Mediator-induced cooperation in social dilemmas; shared-reward design for collaborative LLM-agent teams.

nlp

Natural language as the coordination channel (AutoGen, CrewAI); RLHF as single-agent RL with a human partner.

evolutionary-computation

OpenAI ES as MARL hyperparameter search; MAP-Elites for diverse behavior repertoires; PBT for self-play.

statistical-physics

Mean-field MARL for very large N; phase transitions in cooperation under partial information.

cognitive-science

Theory-of-mind-augmented opponent modeling; level-K reasoning for cognitive-hierarchy MARL.

swarm-robotics

Behavior taxonomies (Brambilla, Schranz) as a macro-objective vocabulary; kilobot-style local rules as DTDE baselines.

economics / social-choice

Auction-based task allocation; negotiation as MARL in LLM settings; voting/aggregation in multi-agent debate; fair credit assignment.

Further connectors in the source catalogue: network-coding, operations-research, formal-methods, neuroscience, probability, optimization, behavioral.

6 · methodology

Methodology

"Build a navigable map of multi-agent autonomous-systems research that makes gaps and adjacencies visible, rather than an encyclopedic listing."
Source preserved. The full generative YAML source — all entry/survey definitions, the venue and adjacency catalogues, the schema, and the build pipeline — is archived under assets/sources/marl_taxonomy/, alongside the rendered gaps.md and venues.md reports this page draws from.