The field map

A navigable map of multi-agent autonomous-systems research, 2015–2026 — 70 algorithm / system / framework / benchmark entries and 6 surveys, organized so the absences between classical MARL and the agentic-AI / LLM-agent wave become visible at a glance. Transcribed from the project's generated taxonomy; the full generative YAML source is preserved under assets/sources/marl_taxonomy/.

0 · The thesis: interaction mode as the primary axis
1 · The map at a glance (figures)
2 · The 70 entries (mode → paradigm → era)
3 · The 6 surveys
4 · Gaps & opportunities
5 · Cross-disciplinary connectors
6 · Methodology

0 · the thesis

Interaction mode is the primary axis

Most MARL maps sort work by method (value-decomposition, policy-gradient, communication…). This map sorts by interaction mode instead — the relationship between the agents' interests — because that is the axis along which the classical-MARL literature and the 2023–2026 LLM-agent wave fail to meet. Five modes, each with a single distinguishing question:

solo

Is there only one agent?

cooperative

Are interests persistently aligned across the whole horizon?

collaborative

Are interests temporarily aligned around a sub-goal?

competitive

Are interests strictly opposed (zero-sum / N-player rank)?

mixed-motive

Are interests partially aligned (general-sum, social dilemma)?
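The five distinguishing questions above can be read as a small decision procedure. The sketch below is one possible encoding, not the project's own code: the `Mode` enum, the `classify_mode` function, and the `"aligned"` / `"opposed"` / `"partial"` labels are hypothetical names introduced for illustration.

```python
from enum import Enum


class Mode(Enum):
    SOLO = "solo"
    COOPERATIVE = "cooperative"
    COLLABORATIVE = "collaborative"
    COMPETITIVE = "competitive"
    MIXED_MOTIVE = "mixed-motive"


def classify_mode(n_agents, interests, persistent=False):
    """Map the five distinguishing questions onto a Mode.

    interests: "aligned", "opposed", or "partial" (hypothetical labels).
    persistent: for aligned interests, True if alignment holds over the
    whole horizon (cooperative), False if only around a sub-goal
    (collaborative).
    """
    if n_agents < 1:
        raise ValueError("need at least one agent")
    if n_agents == 1:
        return Mode.SOLO
    if interests == "opposed":
        return Mode.COMPETITIVE
    if interests == "partial":
        return Mode.MIXED_MOTIVE
    if interests == "aligned":
        return Mode.COOPERATIVE if persistent else Mode.COLLABORATIVE
    raise ValueError(f"unknown interests label: {interests!r}")


print(classify_mode(5, "aligned", persistent=True).value)
print(classify_mode(3, "aligned", persistent=False).value)
```

The only difference between the cooperative and collaborative branches is the `persistent` flag, which is the split the map treats as primary.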

The cooperative–vs–collaborative split is the map's load-bearing original contribution. Most MARL literature collapses the two. But "cooperative-MARL methods" (MAPPO, QMIX, VDN…) assume persistent reward identity — a single team reward held for the entire horizon. That assumption breaks the moment interests are only aligned around a sub-task and dissolve when it ends. Naming the two modes distinctly is what makes the central absence legible.

Collaborative mode is, in this corpus, entirely a 2023–2026 LLM-agent phenomenon: it has zero classical-MARL entries. Every collaborative entry (AutoGen, MetaGPT, ChatDev, CrewAI, LangGraph, AgentVerse, CAMEL, Multi-Agent Debate) is an LLM-agent system from 2023 onward, coordinating in natural language with role-asymmetric, prompt-implicit rewards and with no value function, no gradient update, and no team reward. That absence is the thesis: the regime where most real agentic-AI coordination now lives is exactly the regime classical cooperative MARL cannot natively express. The Zymera research program (covert-misbehavior resilience at the mission level) sits in this seam; see Literature and Paper.

Secondary axes

Each entry carries one primary mode (used for grouping and diagram coloring) plus honest secondary tags on three further axes:

Axis	Values	What it captures
Paradigm	classical-MARL · llm-agent · hybrid · classical-RL-extended	Era + research community. `classical-MARL` is the 2018–2022 deep-MARL wave; `llm-agent` the 2023–2026 agentic-AI wave; `hybrid` bridges them; `classical-RL-extended` is single-agent work with multi-agent applicability.
Era	2015–2018 · 2018–2022 · 2023–2026	When the line of work landed. The timeline figure below shows the paradigm hand-off.
Method family	multi-tag (value-decomposition, policy-gradient, communication, GNN, role-based, language-model, …)	Multi-membership by default — MAPPO is "cooperative + CTDE + policy-gradient + parameter-sharing" all at once.

1 · at a glance

The map at a glance

Figure: Mode tree. The primary clustering: each interaction mode branches into its paradigms and then its entries.

Figure: Timeline of entries, 2015–2026. The classical-MARL wave (2018–2022) hands off to the LLM-agent wave (2023–2026); collaborative mode appears only on the right edge.

Figure: Mode × era heatmap. Collaborative is dense only in 2023–2026; competitive peaks in 2018–2022.

Figure: Mode × method-family heatmap. The empty cells in this matrix are what the gap report ranks.

Figure: Paradigm × venue heatmap. Classical-MARL clusters at ICML / NeurIPS / ICLR / AAMAS; the llm-agent wave scatters across arXiv, frameworks, ACL, and UIST.

2 · the corpus

The 70 entries

Grouped mode → paradigm → era. Each card carries the plain-language description, authors, venue and tier, method / coordination tags, key relationship edges (Extends / Contrasts), and, where notable, the project-relevance note. Venue tiers are abbreviated: T1 = tier-1, T2 = tier-2, pre = preprint, jrnl / bk / fw = journal / textbook / framework.

Solo — one agent in an environment; baselines that anchor extends edges to multi-agent variants

Solo · hybrid · 2018–2022

inner-monologue · Inner Monologue · 2022

Embodied LLM-agent that consumes natural-language environment feedback (success detection, scene description, human input) to drive an "inner monologue" that revises plans on the fly; closing the perception–language loop sharply improves long-horizon manipulation/navigation vs. open-loop planners.

Huang, Xia, Xiao, Chan, … Ichter

T2 CoRL

language-model · planning · memory · reflection

Extends SayCan · Contrasts Voyager

saycan · SayCan · 2022

Grounds an LLM planner in a robot's affordances by combining "Say" (LLM likelihood a skill helps the goal) with "Can" (a learned value function for executability). Foundational hybrid LLM-plus-RL system for embodied long-horizon instruction following.

Ahn, Brohan, Brown, Chebotar, … Zeng

T2 CoRL

language-model · planning · q-learning · memory

Contrasts Voyager
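The "Say" times "Can" combination in the SayCan card can be sketched as a product of two scores per candidate skill. This is a minimal illustration under stated assumptions: the function name `saycan_select`, the skill strings, and all numbers are made up, and in the real system the two scores come from an LLM and a learned value function rather than fixed dictionaries.

```python
def saycan_select(say_probs, can_values):
    """Pick the skill with the highest Say x Can score.

    say_probs: skill -> LLM likelihood that the skill helps the goal.
    can_values: skill -> value-function estimate that the skill can
    be executed from the current state.
    """
    scores = {skill: say_probs[skill] * can_values[skill] for skill in say_probs}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical numbers: the LLM prefers grasping, but the robot is
# not near the sponge yet, so the executable skill wins.
say = {"pick up sponge": 0.6, "go to table": 0.3, "pick up apple": 0.1}
can = {"pick up sponge": 0.2, "go to table": 0.9, "pick up apple": 0.8}
best, scores = saycan_select(say, can)
print(best)
```

Multiplying the two terms means a skill must be both useful and feasible to be chosen, which is the grounding the card describes.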

Solo · llm-agent · 2023–2026

react · ReAct · 2023

Interleaves Reasoning and Acting in LLM agents: the model alternates verbal reasoning traces with external tool actions. The foundational pattern of the agentic-AI wave — nearly every later framework (AutoGen, LangGraph, MetaGPT) descends from a ReAct variant.

Yao, Zhao, Yu, Du, Shafran, Narasimhan, Cao

T1 ICLR

language-model · planning · memory

reflexion · Reflexion · 2023

Self-reflective agent: after each trial the LLM verbalizes what went wrong and stores the reflection in episodic memory for the next attempt — "RL via verbal feedback" with an in-context rather than gradient-based policy update.

Shinn, Cassano, Berman, Gopinath, Narasimhan, Yao

T1 NeurIPS

language-model · planning · memory · reflection

Extends ReAct

tot · Tree of Thoughts · 2023

Frames LLM problem-solving as deliberate search over a tree of partial "thoughts": propose candidate steps, score them, explore by BFS/DFS. Generalizes Chain-of-Thought and ReAct from a linear trace into a search procedure.

Yao, Yu, Zhao, Shafran, Griffiths, Cao, Narasimhan

T1 NeurIPS

language-model · planning · reflection

Extends ReAct · Contrasts Reflexion

voyager · Voyager · 2023

Lifelong-learning LLM agent in Minecraft that builds an ever-growing skill library via iterative prompting (propose, code, execute, reflect) with no gradient updates — open-ended exploration from in-context learning and external memory alone.

Wang, Xie, Jiang, Mandlekar, … Anandkumar

pre arXiv

language-model · planning · memory

Single-agent baseline for the agentic wave; contrast with AutoGen/MetaGPT to show that multi-agent is not just "Voyager × N".

agentbench · AgentBench · 2024

First systematic evaluation suite for LLM-as-agent across eight environments (OS, databases, knowledge graphs, card games, household, web browse/shop). The standard yardstick for solo-LLM-agent capability; exposed large API vs. open-source gaps.

Liu, Yu, Zhang, Xu, … Tang

T1 ICLR

language-model

Contrasts Melting Pot

Solo · classical-RL-extended · 2015–2018

options-framework · Options Framework · 1999

Foundational formalism for temporally extended actions: an option is (initiation set, intra-option policy, termination); the resulting process is a semi-MDP. Underpins essentially every later hierarchical-RL method.

Sutton, Precup, Singh

jrnl AIJ

hierarchical

Contrasts FeUdal Networks

map-elites · MAP-Elites · 2015

Quality-diversity algorithm maintaining a grid archive indexed by hand-designed behavior descriptors, keeping the best solution per cell — a diverse repertoire of local optima rather than one global optimum. Foundational for QD and behavior-space curriculum design.

Mouret, Clune

pre arXiv

evolutionary · population

Contrasts OpenAI ES
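The grid archive described in the MAP-Elites card can be sketched in a few lines. This is a toy version under stated assumptions: the two-dimensional solutions in [0, 1], the one-dimensional descriptor, the 100 random bootstrap samples, and the Gaussian mutation are illustrative choices, not taken from the paper.

```python
import random


def map_elites(fitness, descriptor, n_bins=10, n_iters=2000, sigma=0.1, seed=0):
    """Keep the best solution found in each behavior-descriptor cell."""
    rng = random.Random(seed)
    archive = {}  # cell index -> (fitness, solution)

    def try_insert(x):
        # descriptor(x) is assumed to lie in [0, 1]
        cell = min(int(descriptor(x) * n_bins), n_bins - 1)
        f = fitness(x)
        if cell not in archive or f > archive[cell][0]:
            archive[cell] = (f, x)

    for i in range(n_iters):
        if i < 100 or not archive:
            # bootstrap the archive with random solutions
            x = [rng.random(), rng.random()]
        else:
            # mutate a uniformly chosen elite, clipped to [0, 1]
            _, parent = archive[rng.choice(sorted(archive))]
            x = [min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in parent]
        try_insert(x)
    return archive


archive = map_elites(
    fitness=lambda x: -(x[1] - 0.5) ** 2,  # toy objective
    descriptor=lambda x: x[0],             # toy behavior descriptor
)
print(len(archive), "cells filled")
```

The result is one elite per occupied cell rather than a single optimum, which is the "diverse repertoire" the card refers to.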

feudal-net · FeUdal Networks (FuN) · 2017

Hierarchical RL where a high-level Manager emits abstract goals in a learned latent space and a low-level Worker is rewarded for moving along them. Decouples temporal abstraction from action selection; reference for hierarchical cooperative-MARL extensions.

Vezhnevets, Osindero, Schaul, Heess, … Kavukcuoglu

T1 ICML

hierarchical · actor-critic

Contrasts Options Framework

openai-es · OpenAI ES · 2017

Black-box policy optimization via natural-evolution-strategy gradient estimates from antithetic Gaussian perturbations. Trades sample efficiency for trivial parallelism (one perturbed rollout per worker, a tiny noise-seed broadcast). Reignited evolutionary methods for deep RL.

Salimans, Ho, Chen, Sidor, Sutskever

pre arXiv

evolutionary · population

Contrasts MAP-Elites
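The antithetic gradient estimate named in the OpenAI ES card can be sketched as follows. This is a single-process illustration, assuming a plain list of parameters and a toy quadratic objective; the function name `es_gradient` and the hyperparameters are hypothetical, and the distributed seed-broadcast machinery is omitted.

```python
import random


def es_gradient(f, theta, sigma=0.1, n_pairs=50, rng=None):
    """Estimate the gradient of E[f(theta + sigma * eps)] from mirrored samples."""
    rng = rng or random.Random(0)
    grad = [0.0] * len(theta)
    for _ in range(n_pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        # antithetic pair: evaluate at theta + sigma*eps and theta - sigma*eps
        plus = f([t + sigma * e for t, e in zip(theta, eps)])
        minus = f([t - sigma * e for t, e in zip(theta, eps)])
        for i, e in enumerate(eps):
            grad[i] += (plus - minus) * e
    scale = 1.0 / (2.0 * sigma * n_pairs)
    return [g * scale for g in grad]


def objective(x):
    # toy objective with its maximum at (3, 3)
    return -sum((v - 3.0) ** 2 for v in x)


rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(200):
    g = es_gradient(objective, theta, rng=rng)
    theta = [t + 0.05 * gi for t, gi in zip(theta, g)]  # gradient ascent
print([round(t, 3) for t in theta])
```

Each pair needs only two black-box evaluations of `f`, which is why the method parallelizes so easily: in a distributed setting, every worker could evaluate its own pairs and share only returns and noise seeds.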

rlhf · RLHF (Deep RL from Human Preferences) · 2017

Trains a reward model from pairwise human preference comparisons over trajectory snippets, then optimizes a policy against it with standard deep RL. The upstream root of the entire human-feedback alignment line (InstructGPT, ChatGPT, Constitutional AI, DPO).

Christiano, Leike, Brown, Martic, Legg, Amodei

T1 NeurIPS

policy-gradient · imitation · language-model
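The reward-model step in the RLHF card can be sketched as a Bradley-Terry style cross-entropy loss on one pairwise comparison. This sketch assumes the reward model has already produced per-step rewards for two trajectory snippets; `preference_loss` and its arguments are illustrative names, not an API from the paper.

```python
import math


def preference_loss(rewards_a, rewards_b, a_preferred=True):
    """Loss for one human comparison between two trajectory snippets.

    The predicted probability that A is preferred is
    sigmoid(sum(rewards_a) - sum(rewards_b)).
    """
    diff = sum(rewards_a) - sum(rewards_b)
    if not a_preferred:
        diff = -diff
    # -log(sigmoid(diff)), written as a numerically stable softplus(-diff)
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# If the reward model already ranks the preferred snippet higher,
# the loss is small; if it ranks it lower, the loss is large.
print(preference_loss([1.0, 1.0], [0.0, 0.0]))
print(preference_loss([0.0, 0.0], [1.0, 1.0]))
```

Minimizing this loss over many comparisons fits the reward model; the policy is then optimized against that model in a separate step.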

Solo · classical-RL-extended · 2018–2022

world-models · World Models · 2018

Three-part architecture: a VAE compresses observations, a recurrent mixture-density net predicts latent dynamics, and a small linear controller is evolved by CMA-ES inside the learned world. First popular demonstration of training a policy almost entirely "in imagination".

Ha, Schmidhuber

T1 NeurIPS

model-based

Contrasts MuZero

muzero · MuZero · 2020

Combines a learned value-equivalent dynamics model with MCTS, removing the need for a known simulator at planning time. Matches AlphaZero on Go/chess/shogi and is strong on Atari with one architecture.

Schrittwieser, Antonoglou, Hubert, Simonyan, … Silver

T1 Nature

model-based · planning · self-play · also: competitive

Contrasts DreamerV3, AlphaStar

paired · PAIRED · 2020

Unsupervised environment design: a protagonist and an antagonist play environments produced by an adversary trained to maximize the regret between them, yielding a curriculum at the frontier of the protagonist's ability with minimax-regret guarantees.

Dennis, Jaques, Vinitsky, Bayen, … Levine

T1 NeurIPS

environment-design · curriculum · policy-gradient · also: mixed-motive

Contrasts ACCEL
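The regret signal in the PAIRED card can be illustrated with a small calculation. The estimator below (best antagonist return minus mean protagonist return over a few episodes) is an assumption of this sketch, and `paired_regret` is a hypothetical name.

```python
def paired_regret(protagonist_returns, antagonist_returns):
    """Regret of the protagonist relative to the antagonist on one environment.

    Assumed estimator: max antagonist return minus mean protagonist return.
    """
    if not protagonist_returns or not antagonist_returns:
        raise ValueError("need at least one episode per agent")
    mean_protagonist = sum(protagonist_returns) / len(protagonist_returns)
    return max(antagonist_returns) - mean_protagonist


# Hypothetical returns on three generated environments.
too_easy = paired_regret([1.0, 1.0], [1.0, 1.0])   # both solve it
too_hard = paired_regret([0.0, 0.0], [0.0, 0.0])   # neither solves it
frontier = paired_regret([0.0, 0.0], [1.0, 0.5])   # only the antagonist solves it
print(too_easy, too_hard, frontier)
```

An adversary rewarded with this quantity gains nothing from trivial or impossible environments, which is why the curriculum settles on solvable tasks the protagonist cannot yet solve.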

accel · ACCEL · 2022

Unsupervised environment design that drops the adversary network for an evolutionary edit operator: a buffer of high-regret levels is mutated and re-scored, compounding complexity along the agent's frontier. Often matches/beats PAIRED while being simpler.

Parker-Holder, Jiang, Dennis, Samvelyan, … Rocktäschel

T1 ICML

environment-design · curriculum · evolutionary

Extends PAIRED · Contrasts PAIRED, MAESTRO

Solo · classical-RL-extended · 2023–2026

constitutional-ai · Constitutional AI (CAI / RLAIF) · 2022

Aligns an LLM with a written "constitution" using AI feedback instead of human labels: the model critiques and revises its own outputs against the principles, producing preference data for an RLAIF stage. Showed AI-generated preferences can largely replace human ones.

Bai, Kadavath, Kundu, Askell, … Kaplan

pre arXiv

language-model · imitation · policy-gradient · reflection

Extends RLHF · Contrasts DPO

dpo · Direct Preference Optimization · 2023

Reformulates RLHF as a closed-form classification objective on pairwise preferences, eliminating the explicit reward model and the PPO loop. Now the dominant alignment recipe for open-source LLMs.

Rafailov, Sharma, Mitchell, Ermon, Manning, Finn

T1 NeurIPS

language-model · imitation

Extends RLHF · Contrasts Constitutional AI
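The classification objective in the DPO card can be written for a single preference pair as below. This is a scalar sketch: it assumes the summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model are already available, and `dpo_loss` with its argument names is illustrative.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses."""
    # implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# A policy identical to the reference sits at log(2); shifting
# probability mass toward the chosen response lowers the loss.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

No reward model appears anywhere in the computation: the log-probability ratios play that role, which is what removes the separate reward-fitting and PPO stages.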

dreamerv3 · DreamerV3 · 2025

Latent-space world model that learns from pixels and trains an actor-critic entirely on imagined trajectories. Collects diamonds in Minecraft from scratch and uses a single fixed hyperparameter set across 150+ tasks: model-based RL that is both general and robust.

Hafner, Pasukonis, Ba, Lillicrap

T1 Nature

model-based · actor-critic

Extends World Models · Contrasts MuZero

Cooperative — persistently aligned interests; one team reward across the whole horizon. SMAC, MPE-coop, Overcooked live here

Cooperative · classical-MARL · 2015–2018

iql · IQL (Independent Q-Learning) · 1993

Foundational decentralized MARL: each agent runs independent Q-learning, treating others as environment dynamics. Trivial to scale; pays in non-stationarity and inability to coordinate beyond shared rewards. Still a competitive baseline today.

Ming Tan

T1 ICML

q-learning · DTDE

Lower-bound baseline — the "DTDE" row in the RedWithinBlue paradigm-comparison table.

dec-pomdp · Dec-POMDP (complexity) · 2002

Formal complexity anchor for cooperative MARL: defines the decentralized POMDP and proves finite-horizon optimal policies are NEXP-complete — ruling out tractable exact algorithms and explaining why cooperative MARL needs approximate, learning-based methods.

Bernstein, Givan, Immerman, Zilberstein

jrnl Math. Oper. Res.

formal model

Contrasts Markov Games

The NEXP-hardness result is the standard justification for learning-based CTDE over exact DP; cited in formalism.tex.

commnet · CommNet · 2016

Foundational learned-communication architecture: each agent's hidden state is mean-pooled across all agents and broadcast back as next-layer input — a differentiable broadcast-to-all channel trained end-to-end with policy gradients. Anchors the communication family.

Sukhbaatar, Szlam, Fergus

T1 NeurIPS

communication · policy-gradient · parameter-sharing · CTDE

Contrasts DIAL, BiCNet
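The mean-pool broadcast in the CommNet card can be sketched with plain lists. One assumption to flag: this version averages over the other agents only (excluding the receiver's own hidden state), and `commnet_messages` is a hypothetical helper name. The learned layers that consume the messages are omitted.

```python
def commnet_messages(hidden):
    """For each agent, average the hidden states of all other agents."""
    n = len(hidden)
    dim = len(hidden[0])
    if n < 2:
        # a lone agent receives an all-zero message (assumption)
        return [[0.0] * dim for _ in hidden]
    totals = [sum(h[k] for h in hidden) for k in range(dim)]
    # subtract the agent's own state from the total, then average
    return [[(totals[k] - h[k]) / (n - 1) for k in range(dim)] for h in hidden]


hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
print(commnet_messages(hidden))
```

Because the pooling is a plain average, it is differentiable and indifferent to agent ordering, so the channel can be trained end-to-end with the policy.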

dial · DIAL · 2016

First gradient-based learned-communication method: during centralized training agents pass continuous messages with gradients flowing through the channel; at execution messages are discretized (via the DRU). Pioneered differentiable-train / discrete-deploy.

Foerster, Assael, de Freitas, Whiteson

T1 NeurIPS

communication · q-learning · parameter-sharing · CTDE

Contrasts CommNet, IC3Net

bicnet · BiCNet · 2017

Models the team policy and Q-function as a bidirectional recurrent network whose hidden states act as an inter-agent channel (StarCraft 1 micro). Shows recurrent message passing captures team coordination, but the imposed ordering breaks permutation invariance.

Peng, Wen, Yang, Yuan, Tang, Long, Wang

pre arXiv

communication · actor-critic · parameter-sharing · CTDE

Contrasts CommNet, DIAL

maddpg · MADDPG · 2017

Multi-Agent DDPG: per-agent centralized critics (see all obs+actions at training) with decentralized actors. Established the CTDE paradigm in deep MARL, with cooperative, competitive, and mixed-motive variants on the particle environment.

Lowe, Wu, Tamar, Harb, Abbeel, Mordatch

T1 NeurIPS

policy-gradient · actor-critic · centralized-critic · CTDE · also: comp/mixed

Contrasts IQL, QMIX

CTDE anchor in formalism.tex; cited throughout the RedWithinBlue RL-taxonomy notes.

mpe · MPE (Particle Environment) · 2017

Lightweight 2D continuous-state particle environment introduced with MADDPG — cooperative navigation, predator-prey, speaker-listener, physical deception. The default sandbox for MADDPG, MAAC, M3DDPG, and most early CTDE policy-gradient methods.

Lowe, Wu, Tamar, Harb, Abbeel, Mordatch

T1 NeurIPS

benchmark · also: comp/mixed

Contrasts SMAC, Melting Pot

coma · COMA · 2018

Solves multi-agent credit assignment with a counterfactual baseline: the centralized critic estimates how much each agent's specific action contributed beyond a default, giving a stronger cooperative gradient than naive joint-policy gradient.

Foerster, Farquhar, Afouras, Nardelli, Whiteson

T1 AAAI

policy-gradient · actor-critic · centralized-critic · CTDE

Extends MADDPG · Contrasts QMIX
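The counterfactual baseline in the COMA card can be shown for one agent at one step. The sketch assumes the centralized critic has already produced a Q-value for each of this agent's alternative actions with the other agents' actions held fixed; `counterfactual_advantage` is an illustrative name.

```python
def counterfactual_advantage(q_values, policy, action):
    """Advantage of `action` over the agent's own policy-weighted default.

    q_values[a]: critic estimate if this agent took action a while all
    other agents keep the actions they actually took.
    policy[a]: this agent's probability of taking action a.
    """
    baseline = sum(p * q for p, q in zip(policy, q_values))
    return q_values[action] - baseline


q = [1.0, 3.0]        # hypothetical critic outputs for two actions
pi = [0.5, 0.5]       # the agent's current policy
print(counterfactual_advantage(q, pi, action=1))
print(counterfactual_advantage(q, pi, action=0))
```

Since the baseline marginalizes only this agent's action, the advantage isolates the agent's own contribution, and its expectation under the agent's policy is zero.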

vdn · VDN · 2018

Earliest deep value-decomposition method: the joint Q-function is the simple sum of per-agent utilities, trivially satisfying Individual-Global-Max. Limited expressiveness vs. QMIX's hypernetwork mixing; the foundational value-decomposition anchor.

Sunehag, Lever, Gruslys, Czarnecki, … Graepel

T1 AAMAS

value-decomposition · q-learning · parameter-sharing · CTDE

Extends IQL · Contrasts MADDPG
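The additive decomposition in the VDN card, and why it satisfies Individual-Global-Max, can be checked by brute force on a tiny example. The helper names and the utility numbers below are made up for illustration.

```python
from itertools import product


def vdn_joint_q(per_agent_qs, joint_action):
    """Joint Q-value as the plain sum of per-agent utilities."""
    return sum(q[a] for q, a in zip(per_agent_qs, joint_action))


def greedy_decentralized(per_agent_qs):
    """Each agent picks the argmax of its own utility."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in per_agent_qs)


def greedy_joint(per_agent_qs):
    """Brute-force argmax over all joint actions."""
    joint_actions = product(*(range(len(q)) for q in per_agent_qs))
    return max(joint_actions, key=lambda ja: vdn_joint_q(per_agent_qs, ja))


qs = [[0.1, 0.9, 0.3], [1.2, -0.4]]  # two agents, 3 and 2 actions
print(greedy_decentralized(qs), greedy_joint(qs))
```

With a sum, maximizing each term separately maximizes the total, so decentralized greedy execution recovers the joint greedy action without enumerating the joint action space.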

Cooperative · classical-MARL · 2018–2022

mean-field-marl · Mean Field MARL · 2018

Approximates the joint action of a large population by the empirical distribution (mean field) of neighbor actions, reducing the Q-function to a pairwise interaction with a population summary. Scales to hundreds–thousands of homogeneous agents.

Yang, Luo, Li, Zhou, Zhang, Wang

T1 ICML

mean-field · q-learning · actor-critic · N: 20–1000 · also: comp/mixed

Contrasts MADDPG, QMIX, MAPPO
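The population summary in the Mean Field MARL card can be sketched as the empirical distribution of neighbor actions. `mean_action` is a hypothetical helper, and the uniform fallback for an agent with no neighbors is an assumption of this sketch.

```python
def mean_action(neighbor_actions, n_actions):
    """Empirical distribution of discrete actions taken by an agent's neighbors."""
    if not neighbor_actions:
        # no neighbors: fall back to a uniform summary (assumption)
        return [1.0 / n_actions] * n_actions
    counts = [0] * n_actions
    for a in neighbor_actions:
        counts[a] += 1
    return [c / len(neighbor_actions) for c in counts]


# Four neighbors choosing among three actions.
print(mean_action([0, 1, 1, 2], n_actions=3))
```

The agent's Q-function then takes its own action and this fixed-size vector as input, so its input size does not grow with the number of neighbors.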

qmix · QMIX · 2018

Factorizes the joint Q as a monotonic mixture of per-agent utilities, enforced by a non-negative-weight hypernetwork, satisfying IGM so per-agent argmax = joint argmax. Strong on discrete-action SMAC; the monotonicity constraint is its key limitation for sacrifice actions.

Rashid, Samvelyan, Schroeder de Witt, Farquhar, Foerster, Whiteson

T1 ICML

value-decomposition · q-learning · centralized-critic · CTDE

Extends VDN · Contrasts MADDPG, MAPPO

Monotonicity is exactly the issue flagged in RedWithinBlue: connectivity maintenance often needs one agent to "sacrifice" so others can explore.
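The monotonic mixing in the QMIX card can be sketched with a two-layer mixer whose weights are forced non-negative by taking absolute values. In this sketch the weights are fixed numbers passed in by hand, whereas the card describes a hypernetwork producing them; `monotonic_mix` and all values are illustrative.

```python
import math


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Mix per-agent utilities into Q_tot with non-negative weights.

    w1[i][j]: raw weight from agent i to hidden unit j.
    w2[j]: raw weight from hidden unit j to the output.
    abs() keeps dQ_tot/dQ_i >= 0 for every agent i.
    """
    hidden = [
        elu(sum(abs(w1[i][j]) * q for i, q in enumerate(agent_qs)) + b1[j])
        for j in range(len(b1))
    ]
    return sum(abs(w2[j]) * h for j, h in enumerate(hidden)) + b2


w1 = [[0.5, -1.0], [-0.3, 0.8]]   # raw weights may be negative
b1 = [0.1, -0.2]
w2 = [-0.7, 0.4]
base = monotonic_mix([1.0, 2.0], w1, b1, w2, 0.0)
bumped = monotonic_mix([1.5, 2.0], w1, b1, w2, 0.0)
print(base, bumped)
```

Raising any single agent's utility can never lower `Q_tot`, which gives IGM but also rules out value functions in which one agent's locally worse action helps the team.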

bad · Bayesian Action Decoder · 2019

Recasts a cooperative Dec-POMDP with hidden information as a public-belief MDP: agents condition on a Bayesian posterior over private states given public history, sharing an implicit language. Super-human on 2-player Hanabi.

Foerster, Song, Hughes, Burch, … Bowling

T1 ICML

policy-gradient · actor-critic · communication · CTDE

Contrasts MADDPG, QMIX

The public-belief construction is the canonical reference for grounded implicit communication.

ic3net · IC3Net · 2019

Adds a learned binary gate to CommNet so each agent decides at each step whether to broadcast at all — learning when to communicate matters as much as what. Extends to mixed settings where unconditional broadcast leaks info to opponents.

Singh, Jain, Sukhbaatar

T1 ICLR

communication · policy-gradient · parameter-sharing · CTDE · also: mixed-motive

Extends CommNet · Contrasts CommNet, TarMAC

Direct precedent for bandwidth-aware messaging: gating is the simplest "speak only when necessary" prior for connectivity-constrained missions.

maven · MAVEN · 2019

Augments QMIX with a hierarchical latent variable conditioning the joint policy, trained via mutual-information maximization to encourage diverse coordinated exploration — addressing QMIX's exploration weakness on sparse-reward tasks.

Mahajan, Rashid, Samvelyan, Whiteson

T1 NeurIPS

value-decomposition · q-learning · centralized-critic · CTDE

Extends QMIX · Contrasts QPLEX

openai-five · OpenAI Five · 2019

Defeated the world-champion Dota 2 team with PPO at unprecedented scale: five parameter-sharing LSTM agents on a cooperative team reward. Scale + self-play reached pro-level play in a 5v5 partially-observable game with no explicit multi-agent algorithm.

OpenAI: Berner, Brockman, Chan, … Zhang

pre arXiv

policy-gradient · self-play · parameter-sharing · CTCE · also: competitive

Contrasts AlphaStar

Major coop↔comp crossover: within-team cooperation from shared reward, cross-team competition from self-play — the mode-shifting framing applies directly.

qtran · QTRAN · 2019

Replaces QMIX's monotonic mixing with a soft regularization of the IGM condition, making the function class strictly broader. Strongest theory in the value-decomposition family but often empirically loses to QMIX/QPLEX because the regularizer is harder to tune.

Son, Kim, Kang, Hostallero, Yi

T1 ICML

value-decomposition · q-learning · centralized-critic · CTDE

Extends QMIX · Contrasts QPLEX

smac · SMAC · 2019

Benchmark of cooperative StarCraft II micromanagement (2–27 unit agents, decentralized partial obs, scripted enemy). The de-facto cooperative-MARL benchmark on which QMIX, MAPPO, MAVEN, QPLEX were validated.

Samvelyan, Rashid, Schroeder de Witt, Farquhar, … Whiteson

T1 AAMAS

benchmark · N: 2–27

Contrasts Melting Pot, MPE

tarmac · TarMAC · 2019

Replaces CommNet's mean-pool broadcast with signature-based soft attention: agents emit (key, value) pairs, listeners query to weight them — learning who to address and what to send. Established attention as the default learned-communication architecture.

Das, Gervet, Romoff, Batra, … Pineau

T1 ICML

communication · transformer · attention · CTDE

Extends CommNet · Contrasts CommNet, IC3Net

Closest classical-MARL precedent for connectivity-aware messaging policies.
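The soft attention described in the TarMAC card can be sketched from one listener's point of view. The scaled dot-product form and the helper name `attend` are assumptions of this sketch, and the vectors are made-up numbers; learned projections are omitted.

```python
import math


def attend(query, keys, values):
    """Weight senders' value vectors by softmax(query . key / sqrt(d))."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(logits)                      # subtract the max for stability
    exps = [math.exp(l - top) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    message = [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]
    return weights, message


query = [1.0, 0.0]                               # the listener's query
keys = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]     # senders' signatures
values = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # senders' message contents
weights, message = attend(query, keys, values)
print(weights, message)
```

Senders whose signature matches the listener's query dominate the aggregated message, which is how the architecture learns whom to address instead of pooling everyone equally.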

dgn · DGN (Graph Conv RL) · 2020

Treats agents as nodes in a spatial-neighborhood graph and applies multi-head graph-attention to fuse local observations before Q-learning. GNN message passing scales cooperatively to ~100 agents. Anchor of the GNN family in cooperative MARL.

Jiang, Dun, Huang, Lu

T1 ICLR

GNN · q-learning · attention · CTDE · N: 10–100

Contrasts CommNet, TarMAC

ippo · IPPO · 2020

Independent PPO with parameter sharing: each agent runs PPO on its own observations and treats others as environment. The key result — IPPO (DTDE) often matches or beats QMIX (CTDE) on SMAC — challenged the field's CTDE-by-default assumption.

Schroeder de Witt, Gupta, Makoviichuk, … Whiteson

pre arXiv

policy-gradient · actor-critic · parameter-sharing · DTDE

Extends IQL · Contrasts QMIX

DTDE baseline in the RedWithinBlue paradigm-comparison table.

ndq · NDQ · 2020

Combines QMIX value-decomposition with an information-theoretic regularizer minimizing mutual information between sent messages and the sender's full observation while preserving task content — minimal-bandwidth communication only when independent decomposition is insufficient.

Wang, Wang, Zheng, Zhang

T1 ICLR

communication · value-decomposition · q-learning · CTDE

Extends QMIX · Contrasts TarMAC, CommNet

The information-bottleneck framing maps cleanly onto link-capacity constraints in connectivity-aware missions.

roma · ROMA (Emergent Roles) · 2020

Augments value-decomposition MARL with a stochastic role embedding per agent, regularized to be mutually informative with trajectories yet compact enough to drive specialization. Yields emergent roles without hand-designed priors; pairs with QMIX-style mixing.

Wang, Dong, Lesser, Zhang

T1 ICML

role-based · value-decomposition · q-learning · CTDE

Extends QMIX · Contrasts QMIX, QPLEX, MAVEN

Classical-MARL precedent for asymmetric team coordination — relevant when scout-vs-follower roles must emerge without hand-specification.

dicg · DICG (Implicit Coordination Graphs) · 2021

Learns an implicit coordination graph end-to-end: an attention module produces a soft adjacency feeding a GNN reasoning layer over joint values/actions. Mitigates relative overgeneralization in predator-prey and competes with QMIX on SMAC without hand-specified structure.

Li, Gupta, Morales, Allen, Kochenderfer

T1 AAMAS

GNN · actor-critic · attention · CTDE

Extends DGN · Contrasts MAGIC, QMIX

Implicit graph learning is the alternative to spatial-proximity graphs when "who-coordinates-with-whom" is task-dependent.

magic · MAGIC · 2021

Combines a Scheduler (when/to-whom to communicate) with a graph-attention Message Processor over a dynamically-learned communication graph — unifying the when/who/what axes that CommNet, IC3Net, TarMAC each addressed separately. Validated on a physical robot-soccer testbed.

Niu, Paleja, Gombolay

T1 AAMAS

communication · GNN · attention · CTDE

Extends TarMAC, DGN · Contrasts CommNet, IC3Net, DICG

Tightest existing match for "learned dynamic comm-graph + attention message processing" — direct precedent for connectivity-aware multi-robot policies.

qplex · QPLEX · 2021

Duplex dueling decomposition with multi-head attention over agents, generalizing QMIX's monotonic mixing to non-monotonic interactions while still satisfying IGM. Outperforms QMIX and QTRAN on hard SMAC scenarios needing sacrifice actions.

Wang, Ren, Liu, Yu, Zhang

T1 ICLR

value-decomposition · q-learning · attention · CTDE

Extends QMIX · Contrasts QTRAN

Addresses the QMIX monotonicity limitation that is load-bearing in connectivity-maintenance tasks.

updet · UPDeT · 2021

Replaces the per-agent recurrent encoder with an entity-based transformer: each agent attends over a variable-length set of observed entities, and the policy head decouples into action groups. A single policy transfers across team sizes — universal in N.

Hu, Zhu, Chang, Liang

T1 ICLR

transformer · q-learning · attention · CTDE

Extends QMIX · Contrasts QMIX, MAPPO

Entity-set attention is exactly what permutation-invariant multi-robot policies need.

mamba · MAMBA · 2022

Multi-agent latent world-model (Dreamer family): agents jointly maintain a shared latent state via learned communication and train policies on imagined rollouts. Model-based imagination matches/exceeds model-free CTDE at lower sample budgets.

Egorov, Shpilman

pre arXiv

model-based · actor-critic · communication · CTDE

Extends DreamerV3, World Models · Contrasts MAPPO, QMIX

mappo · MAPPO · 2022

CTDE variant of PPO: a centralized value uses global state, decentralized actors use per-agent obs, agents share parameters. Empirically the strongest cooperative-MARL baseline on SMAC/MPE/Hanabi as of 2022, despite the field's earlier off-policy preference.

Yu, Velu, Vinitsky, Wang, Bayen, Wu

T1 NeurIPS

policy-gradient · actor-critic · centralized-critic · CTDE

Extends COMA, IPPO · Contrasts QMIX, MADDPG

Phase 1–3 baseline in the RedWithinBlue curriculum.

mat · MAT (Multi-Agent Transformer) · 2022

Recasts cooperative MARL as sequence modeling: an encoder-decoder transformer encodes the joint observation and decodes agent actions one at a time, using the multi-agent advantage decomposition theorem for monotonic improvement. Beats MAPPO, QMIX, HAPPO on SMAC/MA-MuJoCo.

Wen, Kuba, Lin, Zhang, Wen, Wang, Yang

T1 NeurIPS

transformer · policy-gradient · actor-critic · CTDE

Extends UPDeT, MAPPO · Contrasts MAPPO, QMIX, QPLEX

Cooperative · classical-MARL · 2023–2026

haven · HAVEN · 2023

Hierarchical cooperative MARL with dual coordination at both the high level (across subgoal selections) and low level (across primitive actions), combined with QMIX-style value decomposition. Improves on flat CTDE on long-horizon SMAC tasks.

Xu, Bai, Zhang, Li, Fan

T1 AAAI

hierarchical · value-decomposition · actor-critic · CTDE

Extends QMIX, FeUdal Networks · Contrasts MAPPO, QMIX

maestro · MAESTRO · 2023

Multi-agent extension of PAIRED: jointly co-evolves a population of co-players and a curriculum of environments to maximize regret in cooperative MARL. Targets the failure where environment design ignores the partner-policy distribution; partner-aware curricula generalize better.

Samvelyan, Khan, Dennis, Jiang, … Rocktäschel

T1 ICLR

environment-design · curriculum · policy-gradient · population

Extends PAIRED · Contrasts PAIRED, ACCEL

Collaborative — temporarily aligned around a sub-goal; different individual rewards/roles; alignment dissolves when the sub-task ends

Every entry below is an LLM-agent system, 2023–2026. There are zero classical-MARL collaborative entries in the corpus. Coordination is in natural language, rewards are prompt-implicit, roles are asymmetric, and there is no value function or gradient update. This empty classical column is the map's central finding.

Collaborative · llm-agent · 2023–2026

agentverse · AgentVerse · 2023

Structures problem-solving into four phases — expert recruitment, collaborative decision-making, action execution, evaluation — dynamically assembling a team of role-tagged LLM agents with reflection. Explicit phase decomposition can beat free-form group chat.

Chen, Su, Zuo, Yang, … Zhou

pre arXiv

language-model · role-based · natural-language · debate · reflection

Extends AutoGen · Contrasts MetaGPT, ChatDev

autogen · AutoGen · 2023

Microsoft's framework for multi-agent LLM apps: agents with roles (UserProxyAgent, AssistantAgent, GroupChatManager) exchange natural-language messages to solve a user task, collaborating temporarily and dissolving afterward. Quintessentially collaborative, with each role's reward implicit in its prompt.

Wu, Bansal, Zhang, Wu, … Wang

pre arXiv

language-model · role-based · natural-language

Extends ReAct · Contrasts MADDPG, MAPPO

Anchor for the collaborative paradigm — no explicit reward, no value function, no gradient updates, yet solves coordination via prompt design alone.

camel · CAMEL · 2023

Two-agent role-playing framework: an "AI user" and an "AI assistant" get complementary roles via "inception prompting" and exchange messages to solve a task. One of the earliest multi-agent LLM frameworks (March 2023); seeded AutoGen and MetaGPT.

Li, Hammoud, Itani, Khizbullin, Ghanem

T1 NeurIPS

language-model · role-based · natural-language

multi-agent-debate · Multi-Agent Debate · 2023

Multiple LLM agents propose, critique, and refine each other's answers across structured debate rounds, improving factual accuracy and reasoning beyond single-agent baselines. Competitive-form mechanism producing a collaborative outcome — borderline mixed-motive in form.

Liang, He, Jiao, Wang, … Shi

pre arXiv

language-model · opponent-modeling · debate · also: mixed-motive

Contrasts AutoGen

Cross-mode example: the same agents act competitively (debate) to reach a collaborative goal — the "regiments shifting modes" framing.

chatdev · ChatDev · 2024

LLM-agent virtual software company with role-based agents communicating via structured chat. Distinguishes itself from MetaGPT through explicit "double-agent" debate phases at each stage (designer ↔ reviewer). Strong collaborative role-based coordination.

Qian, Liu, Liu, Chen, … Sun

T2 ACL

language-model · role-based · natural-language · debate

Extends AutoGen · Contrasts MetaGPT

crewai · CrewAI · 2024

Open-source Python framework orchestrating role-playing LLM agents as a "crew" — each with role, goal, backstory, tools — distributing tasks sequentially or hierarchically. A lightweight production-oriented alternative to AutoGen with explicit role abstractions.

João Moura

fw Framework

language-model · role-based · natural-language · hierarchical

Extends AutoGen · Contrasts LangGraph, MetaGPT

Representative of how practitioners actually deploy multi-agent role coordination outside research labs.

langgraph · LangGraph · 2024

LangChain library for stateful multi-agent LLM workflows as explicit directed graphs with shared state: nodes are LLM agents or tool calls, edges are conditional transitions, persistent state enables long-horizon coordination and human-in-the-loop checkpoints.

LangChain Inc

fw Framework

language-model · role-based · planning · hierarchical

Extends AutoGen · Contrasts CrewAI, MetaGPT

The closest production-grade analogue to formal coordination diagrams in classical MARL.

metagpt · MetaGPT · 2024

Software-development LLM-agent system encoding human SOPs: a Product Manager writes requirements, an Architect designs, Engineers implement, QA tests. Hand-designed workflows ("meta-programming via prompts") beat free-form multi-agent conversation on structured tasks.

Hong, Zhuge, Chen, Zheng, … Schmidhuber

T1 ICLR

language-model · role-based · hierarchical · natural-language

Extends AutoGen

Quintessential collaborative system — the role-asymmetry it relies on is exactly what classical cooperative MARL cannot natively express.

Competitive — strictly opposed interests; two-player zero-sum, N-player rank contests, security games, adversarial MARL

Competitive · classical-MARL · 2015–2018

markov-games · Markov Games (Littman) · 1994

Introduces Markov (stochastic) games as the formalism for competitive MARL and proposes minimax-Q, a Q-learning variant that backs up the minimax value of each stage game instead of a plain max, and converges in two-player zero-sum games. The standard ancestor citation for self-play and minimax-RL.

Michael L. Littman

T1 ICML

q-learning · opponent-modeling

Contrasts Dec-POMDP

Foundational competitive-MARL formalism cited in formalism.tex; minimax-Q is the conceptual ancestor of M3DDPG.
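The minimax value that minimax-Q backs up at each state can be illustrated on a single 2×2 zero-sum stage game. The sketch uses the textbook closed form for 2×2 games rather than a linear program, so it does not generalize to larger action sets; `solve_2x2_zero_sum` is a hypothetical helper name.

```python
def solve_2x2_zero_sum(payoff):
    """Return (probability the row player plays row 0, game value).

    payoff[i][j] is the row player's payoff; the column player gets the negative.
    """
    (a, b), (c, d) = payoff
    maximin = max(min(a, b), min(c, d))   # best guaranteed pure-strategy payoff
    minimax = min(max(a, c), max(b, d))   # best pure-strategy cap for the column player
    if maximin == minimax:
        # saddle point: a pure strategy is optimal
        row = 0 if min(a, b) >= min(c, d) else 1
        return (1.0 if row == 0 else 0.0), float(maximin)
    # no saddle point: mix so the column player is indifferent
    denom = a - b - c + d
    p_top = (d - c) / denom
    value = (a * d - b * c) / denom
    return p_top, value


print(solve_2x2_zero_sum([[1, -1], [-1, 1]]))   # matching pennies
print(solve_2x2_zero_sum([[3, 1], [2, 0]]))     # game with a saddle point
```

In the mixed case the returned strategy guarantees the same expected payoff against either column.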

Competitive · classical-MARL · 2018–2022

alphastar · AlphaStar · 2019

Grandmaster-level StarCraft II from DeepMind via population-based self-play with a "league" of main agents, main exploiters, and league exploiters to prevent strategy cycles. Influential for the population-of-policies family.

Vinyals, Babuschkin, Czarnecki, Mathieu, … Silver

T1 Nature

policy-gradient · self-play · population · imitation

Contrasts OpenAI Five

m3ddpg · M3DDPG · 2019

Robust adversarial extension of MADDPG: each agent's policy is trained under a minimax objective on the centralized critic, in which the other agents' actions are adversarially perturbed. The inner minimization is approximated by a one-step linearized gradient. Improves robustness to adversarial co-players.

Li, Wu, Cui, Dong, Fang, Russell

T1 AAAI

policy-gradient · actor-critic · opponent-modeling · CTDE · also: cooperative

Extends MADDPG · Contrasts MADDPG, OpenAI Five

Worst-case-robust CTDE baseline; relevant when blue-team policies must be hardened against red-team perturbations.

pr2 · PR2 (Recursive Reasoning) · 2019

Models multi-agent decisions as recursive level-K reasoning where each agent assumes opponents reason one level shallower; uses variational inference to approximate the joint policy. Anchor for the Bayesian / cognitive-hierarchy branch of opponent modeling.

Wen, Yang, Luo, Wang, Pan

T1 ICLR

opponent-modeling · actor-critic · CTDE · also: mixed-motive

Contrasts ROMMEO, LOLA, MADDPG

rommeo · ROMMEO · 2019

Builds an opponent model jointly with each agent's policy under a maximum-entropy objective, then regularizes the policy update against the inferred opponent posterior — a Bayes-optimal best-response under co-player uncertainty in general-sum and competitive games.

Tian, Wen, Gong, Punakkath, Zou, Wang

T1 IJCAI

opponent-modeling · policy-gradient · actor-critic · CTDE · also: mixed-motive

Extends MADDPG · Contrasts PR2, LOLA, MADDPG

Competitive · hybrid · 2018–2022

cicero · CICERO · 2022

Meta AI's human-level Diplomacy player: a planning module trained via no-press self-play plus a dialogue model fine-tuned on human games and conditioned on intended actions. Human-level competitive multi-agent natural-language negotiation.

Meta FAIR Diplomacy Team

T1 Science

language-model · planning · opponent-modeling · self-play · also: mixed-motive

Contrasts AutoGen, MetaGPT

Combines RL self-play with LLM dialogue — a hybrid that classical MARL surveys mostly miss.

Mixed-motive — general-sum games; social dilemmas, negotiation, mechanism design, emergent-society simulation

Mixed-motive · classical-MARL · 2015–2018

lola · LOLA · 2018

Differentiates each agent's update through a one-step lookahead on the opponent's learning step, so policies are shaped by how they influence the opponent's future gradient. Yields tit-for-tat cooperation in iterated prisoner's dilemma where naive learners defect.

Foerster, Chen, Al-Shedivat, Whiteson, Abbeel, Mordatch

T1 AAMAS

opponent-modeling · policy-gradient · CTDE · also: competitive

Contrasts MADDPG, ROMMEO, PR2

Direct conceptual ancestor for any opponent-shaping mediator design in mixed-motive scenarios.
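The one-step lookahead in the LOLA card can be illustrated on a toy differentiable game. The sketch below is not the paper's policy-gradient estimator: it assumes scalar parameters, uses central finite differences in place of automatic differentiation, and the bilinear payoffs `v1` and `v2` are hypothetical, chosen only because the result has a closed form. The shaped learner evaluates its own payoff after the opponent's anticipated gradient step and differentiates through that step.

```python
from typing import Callable

Payoff = Callable[[float, float], float]

def finite_diff(f: Payoff, x: float, y: float, wrt: str, h: float = 1e-5) -> float:
    # Central finite difference of f with respect to x or y.
    if wrt == "x":
        return (f(x + h, y) - f(x - h, y)) / (2 * h)
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

def naive_gradient(v1: Payoff, x: float, y: float) -> float:
    # Naive learner: treats the opponent's parameter y as fixed.
    return finite_diff(v1, x, y, "x")

def lola_gradient(v1: Payoff, v2: Payoff, x: float, y: float, eta: float) -> float:
    # Opponent-shaping learner: scores x by the payoff it gets after the
    # opponent takes one gradient-ascent step of size eta on its own payoff,
    # then differentiates that looked-ahead payoff with respect to x.
    def v1_after_opponent_step(x_: float, y_: float) -> float:
        y_next = y_ + eta * finite_diff(v2, x_, y_, "y")
        return v1(x_, y_next)
    return finite_diff(v1_after_opponent_step, x, y, "x")

# Hypothetical zero-sum bilinear game (illustration only).
v1 = lambda x, y: x * y
v2 = lambda x, y: -x * y

g_naive = naive_gradient(v1, 1.0, 2.0)
g_lola = lola_gradient(v1, v2, 1.0, 2.0, eta=0.1)
```

For this game the looked-ahead payoff is x·y − η·x², so the shaped gradient is y − 2ηx rather than the naive y. The extra term is the learner accounting for how its own parameter moves the opponent's next step.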

Mixed-motive · classical-MARL · 2018–2022

social-influence · Social Influence · 2019

Adds an intrinsic reward proportional to one agent's causal influence on another's policy (KL between conditional and marginal action distributions) to encourage emergent communication and prosocial behavior in Cleanup/Harvest social dilemmas without explicit channels.

Jaques, Lazaridou, Hughes, Gulcehre, … de Freitas

T1 ICML

policy-gradient · communication · opponent-modeling · CTDE

Contrasts Bayesian Action Decoder, LOLA
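The influence reward described in the Social Influence card (KL between conditional and marginal action distributions) can be sketched for discrete actions as follows. This is a minimal illustration, not the authors' implementation: the function names, the action labels, and the probability values in the example are hypothetical. The marginal is obtained by averaging the influenced agent's conditional policy over the influencer's counterfactual actions.

```python
import math
from typing import Dict, List

def kl_divergence(p: List[float], q: List[float]) -> float:
    # KL(p || q) in nats; terms with p_i == 0 contribute zero.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def influence_reward(
    taken_action: str,
    influencer_policy: Dict[str, float],
    conditional_policies: Dict[str, List[float]],
) -> float:
    # conditional_policies[a] is the other agent's action distribution given
    # that the influencer took action a. The marginal averages these over the
    # influencer's own action probabilities (its counterfactual actions).
    n_actions = len(next(iter(conditional_policies.values())))
    marginal = [
        sum(influencer_policy[a] * conditional_policies[a][i] for a in influencer_policy)
        for i in range(n_actions)
    ]
    return kl_divergence(conditional_policies[taken_action], marginal)

# Hypothetical numbers: the other agent reacts strongly to "signal".
policy = {"signal": 0.5, "stay": 0.5}
reactive = {"signal": [0.9, 0.1], "stay": [0.5, 0.5]}
indifferent = {"signal": [0.5, 0.5], "stay": [0.5, 0.5]}

reward_reactive = influence_reward("signal", policy, reactive)
reward_indifferent = influence_reward("signal", policy, indifferent)
```

When the other agent's distribution does not depend on the influencer's action, the conditional equals the marginal and the reward is zero. The reward grows as the taken action shifts the other agent's behavior away from its average.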

ai-economist · AI Economist · 2020

Two-level RL where a "social planner" agent designs tax policy while heterogeneous worker agents simultaneously learn to respond (PPO at both levels), discovering tax schedules that improve equality-vs-productivity trade-offs. Anchor for RL-based mechanism design.

Zheng, Trott, Srinivasa, Naik, … Socher

pre arXiv

mechanism-design · policy-gradient · actor-critic · hierarchical

Contrasts Melting Pot, CICERO

The hierarchical planner/worker decomposition mirrors the mediator role in RedWithinBlue.

melting-pot · Melting Pot · 2021

DeepMind benchmark of MARL environments around social dilemmas, free-rider problems, and common-pool resources. The most prominent infrastructure for studying generalization across cooperative ↔ competitive ↔ mixed-motive within one framework.

Leibo, Duéñez-Guzmán, Vezhnevets, Agapiou, … Graepel

T1 ICML

benchmark · N: 2–16 · also: coop/comp

Closest existing benchmark to the framing of agents shifting between coop / collab / compete modes.

Mixed-motive · llm-agent · 2023–2026

generative-agents · Generative Agents · 2023

25 LLM-driven agents in a sandbox town (Smallville) show believable individual and social behavior via a memory stream, reflection, and planning — emergent information spread, relationship formation, event coordination. Anchor for emergent-society simulation.

Park, O'Brien, Cai, Morris, Liang, Bernstein

T2 UIST

language-model · memory · reflection · N: 25 · also: collaborative

Mixed-motive and collaborative mode-shifting in one system — alignment shifts with context, exactly the dynamic-regiments framing.

ai-town · AI Town · 2024

Open-source deployable virtual town inspired by Generative Agents: LLM-driven characters live, plan, gossip, and form relationships in a Convex-backed real-time world. A practical template for persistent multi-agent LLM societies, widely forked.

a16z Infra

fw Framework

language-model · memory · reflection · also: collaborative

Extends Generative Agents

70 entries rendered by primary mode — solo 19 (hybrid 2, llm-agent 5, classical-RL-extended 12), cooperative 31, collaborative 8, competitive 6, mixed-motive 6. Secondary modes (the "also:" tags) place several entries in more than one mode at once.

3 · the field's self-image

The 6 surveys

Surveys are treated as first-class nodes: they define the field's self-image at a moment in time. Each card lists what the survey covers, what it explicitly omits, and how many pool entries it cites. The omissions, cross-checked against the entry pool, drive the gap report below.

albrecht-christianos-schaefer-2024
MARL: Foundations & Modern Approaches · 2024

Most recent comprehensive textbook treatment of cooperative and competitive MARL. Strong on theory (Markov games, POSG, Nash, Bellman equations) and modern deep-MARL benchmarks; mostly excludes the LLM-agent wave and applied / mission-level work.

Albrecht, Christianos, Schäfer · MIT Press (textbook)

Covers: cooperative, competitive, mixed-motive · classical-MARL

Omits: llm-agent, emergent-society-simulation, real-world-deployment, mission-level-success, role-asymmetry, foundation-model-agents

Cites: 10 entries (mappo, qmix, vdn, maddpg, coma, …)

oroojlooyjadid-hajinezhad-2023
A review of cooperative MA deep RL · 2023

Cooperative-MARL-only survey organized by communication, coordination, training paradigm, and applications. Useful for the cooperative-only literature; a conspicuous gap on competitive and mixed-motive.

OroojlooyJadid, Hajinezhad · Applied Intelligence (survey)

Covers: cooperative · classical-MARL

Omits: llm-agent, competitive, mixed-motive, mission-level-success, foundation-model-agents

Cites: 5 entries (mappo, qmix, vdn, maddpg, coma)

du-ding-2023
A survey on MARL with communication · 2023

Survey focused specifically on learned communication in MARL: protocols (broadcast, targeted, attention), representations (continuous, discrete, symbolic), and learning algorithms. Useful for the communication-learning cluster.

Du, Ding · arXiv (survey)

Covers: cooperative · classical-MARL

Omits: llm-agent, mixed-motive, role-asymmetry, mission-level-success

gronauer-diepold-2022
Multi-agent deep RL: a survey · 2022

Broad survey of deep MARL up to 2022, organized by training paradigm (centralized / decentralized) and observability (fully / partially observable). Predates the LLM-agent wave; useful as a "what classical MARL covered" reference for triangulating gaps.

Gronauer, Diepold · AI Review (survey)

Covers: cooperative, competitive, mixed-motive · classical-MARL

Omits: llm-agent, role-asymmetry, foundation-model-agents, real-world-deployment, mission-level-success

Cites: 6 entries (mappo, qmix, vdn, maddpg, coma, …)

zhang-yang-basar-2021
MARL: a selective overview · 2021

Theory-leaning selective survey emphasizing convergence guarantees, Markov games, and game-theoretic foundations. Contrasts cooperative, competitive, and mixed settings via formal analysis; light on emerging deep-MARL empirical work.

Zhang, Yang, Başar · Handbook of RL & Control (survey)

Covers: cooperative, competitive, mixed-motive · classical-MARL

Omits: llm-agent, foundation-model-agents, mission-level-success, real-world-deployment, role-asymmetry

Cites: 5 entries (maddpg, qmix, vdn, coma, iql)

hernandez-leal-2019
A survey and critique of MA deep RL · 2019

Influential 2019 critique that groups recent work into four categories: analysis of emergent behaviors, learning communication, learning cooperation, and agents modeling agents. Its critical view helps identify gaps, highlighting how thin the coverage of communication and opponent modeling was at the time.

Hernandez-Leal, Kartal, Taylor · JAAMAS (survey)

Covers: cooperative, competitive, mixed-motive · classical-MARL

Omits: llm-agent, foundation-model-agents, role-asymmetry, mission-level-success, real-world-deployment

Cites: 5 entries (maddpg, coma, vdn, qmix, iql)

4 · gaps

Gaps & opportunities

The build pipeline auto-ranks every (mode, method-family) cell by an interest score: a cell is interesting if it is sparse (few entries) AND its neighbors are dense. Absences neighboring populated cells are more likely structural — opportunities for new work — than absences in already-empty regions.

Top-ranked sparse cells

#	Mode	Method family	Max nbr	Score
1	collaborative	curriculum	15	16.00
2	collaborative	parameter-sharing	15	16.00
3	collaborative	mechanism-design	15	16.00
4	cooperative	mechanism-design	15	16.00
5	solo	parameter-sharing	15	16.00
6	solo	mechanism-design	15	16.00
7	collaborative	model-based	12	13.00
8	collaborative	q-learning	12	13.00
9	collaborative	actor-critic	12	13.00
10	collaborative	value-decomposition	12	13.00
11	collaborative	policy-gradient	12	13.00
13	cooperative	planning	10	11.00
14	cooperative	language-model	10	11.00
15–16	collaborative	GNN / communication	9	10.00

Full top-30 ranking lives in assets/sources/marl_taxonomy/ (the original gaps.md). The collaborative row dominates: nearly every classical method-family that is dense for cooperative is completely empty for collaborative — the same finding as the empty classical column above, now quantified.
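The ranking rule can be sketched as below. The exact formula is not stated on this page, so the one used here is an assumption: score = (densest neighbor + 1) / (own count + 1). For an empty cell it reduces to max neighbor + 1, which matches the Score column above (15 → 16.00, 12 → 13.00, 9 → 10.00), but the archived pipeline may compute it differently. How neighbors are chosen is also left open, so the function simply takes their counts as input.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str]  # (mode, method-family)

def interest_score(cell_count: int, neighbor_counts: List[int]) -> float:
    # ASSUMPTION: sparse cells with dense neighbors score high.
    # An empty cell (count 0) scores max(neighbor_counts) + 1.
    return (max(neighbor_counts) + 1) / (cell_count + 1)

def rank_cells(grid: Dict[Cell, Tuple[int, List[int]]]) -> List[Tuple[Cell, float]]:
    # Highest score first; sorted() is stable, so ties keep insertion order.
    scored = [(cell, interest_score(count, nbrs)) for cell, (count, nbrs) in grid.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Own counts of 0 follow the "completely empty" reading of the collaborative
# row, and the neighbor maxima (15, 12, 9) follow the table. The remaining
# neighbor values are made up for illustration.
grid = {
    ("collaborative", "model-based"): (0, [12, 4, 1]),
    ("collaborative", "curriculum"): (0, [15, 2, 0]),
    ("collaborative", "communication"): (0, [9, 3, 2]),
}
ranking = rank_cells(grid)
```

Under this assumed formula a populated cell is penalized by its own count, so a cell with 3 entries next to a neighbor with 15 would score 4.0 rather than 16.0.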

Mode × era absences

collaborative: dense in 2023–2026 (8 entries), empty in 2015–2018 and 2018–2022. The mode simply did not exist as a studied object before the LLM-agent wave.

competitive: dense in 2018–2022 (5 entries), sparse in 2015–2018 (1 entry) and empty in 2023–2026. The self-play / opponent-modeling peak was the deep-MARL middle era.

The survey-omission finding

Cross-checking each survey's omits field against the entry pool surfaces concepts that are present in the literature but absent from the canonical surveys:

llm-agent and mission-level-success are omitted by all 6 surveys. role-asymmetry is omitted by nearly all (5 of 6). foundation-model-agents by 5; real-world-deployment by 4. The field's own self-image has no place for the agentic-AI wave, for whether a mission actually succeeds, or for role-differentiated teams.
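The counts above can be reproduced directly from the Omits lines on the six survey cards. The sketch below only tallies those lists; the variable names are illustrative and not taken from the build pipeline.

```python
from collections import Counter

# Omits fields transcribed from the six survey cards in section 3.
SURVEY_OMITS = {
    "albrecht-christianos-schaefer-2024": [
        "llm-agent", "emergent-society-simulation", "real-world-deployment",
        "mission-level-success", "role-asymmetry", "foundation-model-agents",
    ],
    "oroojlooyjadid-hajinezhad-2023": [
        "llm-agent", "competitive", "mixed-motive",
        "mission-level-success", "foundation-model-agents",
    ],
    "du-ding-2023": [
        "llm-agent", "mixed-motive", "role-asymmetry", "mission-level-success",
    ],
    "gronauer-diepold-2022": [
        "llm-agent", "role-asymmetry", "foundation-model-agents",
        "real-world-deployment", "mission-level-success",
    ],
    "zhang-yang-basar-2021": [
        "llm-agent", "foundation-model-agents", "mission-level-success",
        "real-world-deployment", "role-asymmetry",
    ],
    "hernandez-leal-2019": [
        "llm-agent", "foundation-model-agents", "role-asymmetry",
        "mission-level-success", "real-world-deployment",
    ],
}

# How many surveys omit each concept.
omission_counts = Counter(tag for omits in SURVEY_OMITS.values() for tag in omits)

# Concepts omitted by every survey.
omitted_by_all = sorted(
    tag for tag, n in omission_counts.items() if n == len(SURVEY_OMITS)
)
```

Here `omitted_by_all` contains exactly `llm-agent` and `mission-level-success`, and the tally gives 5 for `role-asymmetry` and `foundation-model-agents` and 4 for `real-world-deployment`, in agreement with the paragraph above.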

This is precisely the project's governing research question. Zymera asks how covert, within-bounds micro-level agent misbehavior propagates to mission-level failure in a spatial connectivity mission governed by role / graph-position. That is the exact triple (mission-level-success + role-asymmetry + the post-classical llm-agent/agentic regime) that the canonical surveys leave out: llm-agent and mission-level-success are missing from all 6, role-asymmetry from 5. The gap the map measures and the gap the program targets are the same gap. See Literature for the five-flank review and Paper for the formal statement.

25 pool entries are cited by no survey at all — nearly the entire llm-agent and classical-RL-extended frontier (Voyager, ReAct, AutoGen, MetaGPT, DPO, DreamerV3, …). The surveys' citation pool is almost exactly the classical cooperative core (MAPPO, QMIX, VDN, MADDPG, COMA, IQL).

5 · connectors

Cross-disciplinary connectors

When the gap report flags an empty cell, it surfaces the adjacent disciplines tagged on the neighboring entries — out-of-the-box ideas that may apply. The connectors catalogue defines, for each field, the canonical methods that have crossed (or could cross) into MARL:

game-theory

No-regret learning → self-play convergence; mechanism design → mediator-induced cooperation; Stackelberg → leader-follower MARL.

control

Robust H∞ for adversarial perturbations; consensus algorithms as networked-MARL precursors; MPC as a model-based planning baseline.

information-theory

Minimum-information communication (NDQ); information-bottleneck regularization for compositional policies.

distributed-systems

Byzantine-robust federated MARL; distributed Kalman filtering as a sensor-fusion precursor.

graph-theory

Spectral analysis of communication topologies; random-graph models for ad-hoc team formation.

adversarial-ml

Robust MARL under action / observation / communication / reward attacks; certified MARL policies.

safe-rl

Shielded MARL for mission-level safety guarantees; constrained CTDE with cost critics.

mech-design

Mediator-induced cooperation in social dilemmas; shared-reward design for collaborative LLM-agent teams.

nlp

Natural language as the coordination channel (AutoGen, CrewAI); RLHF as single-agent RL with a human partner.

evolutionary-computation

OpenAI ES as MARL hyperparameter search; MAP-Elites for diverse behavior repertoires; PBT for self-play.

statistical-physics

Mean-field MARL for very large N; phase transitions in cooperation under partial information.

cognitive-science

Theory-of-mind-augmented opponent modeling; level-K reasoning for cognitive-hierarchy MARL.

swarm-robotics

Behavior taxonomies (Brambilla, Schranz) as a macro-objective vocabulary; Kilobot-style local rules as DTDE baselines.

economics / social-choice

Auction-based task allocation; negotiation as MARL in LLM settings; voting/aggregation in multi-agent debate; fair credit assignment.

Further connectors in the source catalogue: network-coding, operations-research, formal-methods, neuroscience, probability, optimization, behavioral.

6 · methodology

Methodology

"Build a navigable map of multi-agent autonomous-systems research that makes gaps and adjacencies visible, rather than an encyclopedic listing."

Why interaction-mode is the primary axis. A pure method-tree forces fuzzy methods into rigid boxes (MAPPO is coop + CTDE + policy-gradient + parameter-sharing at once). Sorting by interaction mode instead exposes the coop/collab seam that is otherwise collapsed — and that seam is where the classical and agentic literatures fail to meet.
Multi-tagging by default. Each entry has one primary mode (for clustering / coloring) plus honest secondary modes, method-families, coordination tags, and adjacent disciplines. All tags surface in the matrices.
Surveys as first-class nodes. The gap report cross-checks each survey's omits against the pool to flag concepts present in the literature but absent from the field's self-image.
The gap interest-score. A (mode, method-family) cell scores high when it is sparse and surrounded by dense neighbors — the intuition being that an absence next to populated cells is structural (an opportunity), not peripheral.
Single source of truth. The whole map is generated from YAML (data/entries/*.yaml, data/surveys/*.yaml, venues.yaml, adjacencies.yaml), pydantic-validated, and rendered to README + Graphviz diagrams + matplotlib matrices + the gap and venue reports. The README is never hand-edited.
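The multi-tagging rule (one primary mode plus secondary modes and method-family tags) could be enforced by a validated record type along these lines. This is a hedged sketch, not the project's schema: it uses standard-library dataclasses rather than pydantic so that it runs without dependencies, and the class name, field names, and checks are assumptions. The actual schema lives in the archived source. The example values are taken from the CICERO card.

```python
from dataclasses import dataclass, field
from typing import List

# The five interaction modes named in section 0.
MODES = {"solo", "cooperative", "collaborative", "competitive", "mixed-motive"}

@dataclass
class Entry:
    # Hypothetical field names; the real YAML keys may differ.
    id: str
    title: str
    year: int
    primary_mode: str
    secondary_modes: List[str] = field(default_factory=list)
    method_families: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject any mode outside the map's five-mode vocabulary.
        unknown = {self.primary_mode, *self.secondary_modes} - MODES
        if unknown:
            raise ValueError(f"unknown interaction mode(s): {sorted(unknown)}")
        # A mode is either primary or secondary for an entry, never both.
        if self.primary_mode in self.secondary_modes:
            raise ValueError("primary mode must not repeat as a secondary mode")

cicero = Entry(
    id="cicero",
    title="CICERO",
    year=2022,
    primary_mode="competitive",
    secondary_modes=["mixed-motive"],
    method_families=["language-model", "planning", "opponent-modeling", "self-play"],
)
```

Validating at load time means a mistyped mode fails the build instead of silently creating a new row in the matrices.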

Source preserved. The full generative YAML source — all entry/survey definitions, the venue and adjacency catalogues, the schema, and the build pipeline — is archived under assets/sources/marl_taxonomy/, alongside the rendered gaps.md and venues.md reports this page draws from.