The journey — a layered research program

Zymera is not one experiment. It is a stack of layers unified by a single question: how covert misbehaviour propagates from one agent (micro) to mission failure (macro), and how resilient the mission is. This page breaks the program into its parts: the theory, the empirical journey, the networks, the paper, the field maps, and the outputs.

① Project map: the layers
② Theory: the resiliency formalism
③ The empirical journey (5 eras)
④ Networks & learning stack
⑤ The paper: Stealth Attacks on Swarms
⑥ Field maps & references
⑦ How the layers connect

① Project map

The layers

Theory

formalism.tex — a mission-centered formal foundation: micro/macro/bridge, stealth & propagation, resiliency metrics.

Engine + experiments

swarm_explore/ + RedWithinBlue/ — the JAX testbed and the 5-era empirical journey.

The paper

report/ (Stealth Attacks on Swarms): the threat-model + resiliency study with the compromise sweep.

Field maps

marl_taxonomy/ + docs/mission-taxonomy/ — a 70-entry field map and a mission taxonomy.

Dissemination

presentations/ + references/ — the Red-within-Blue / Swarm-Resiliency decks and reference drafts.

② Theory

The resiliency formalism

formalism.tex — the formal backbone of the whole question; everything informal elsewhere on this site is a special case of it.

A mission-centered formal foundation for the resiliency of decentralized teams under partial observability and adversarial pressure — built around the mission, not the agent, on a Dec-POMDP backbone augmented with time-varying interaction graphs.

Everything we called informal earlier has a formal home here: covert = the KL stealth budget; the mission-safety K-budget = the break/compromise budget k; micro→macro amplification = the propagation formalism + amplification hypotheses; the stealth–damage frontier = the stealth–degradation trade-off; belief-as-signal = mission health on the team posterior.
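As a rough illustration of what a KL stealth budget could look like in practice, the sketch below checks whether a compromised agent's action distribution stays within a divergence budget of the nominal policy. The direction of the divergence, the per-step form, and the names `within_stealth_budget` and `epsilon` are assumptions made for this example; the authoritative definition is the one in formalism.tex.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same action set.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def within_stealth_budget(compromised, nominal, epsilon):
    """True if the compromised policy stays within the (assumed) KL budget."""
    return kl_divergence(compromised, nominal) <= epsilon

# Hypothetical action distributions over four moves.
nominal = [0.25, 0.25, 0.25, 0.25]
subtle = [0.30, 0.25, 0.25, 0.20]   # small, hard-to-notice bias
blatant = [0.97, 0.01, 0.01, 0.01]  # obvious deviation

subtle_ok = within_stealth_budget(subtle, nominal, epsilon=0.05)
blatant_ok = within_stealth_budget(blatant, nominal, epsilon=0.05)
```

Under this reading, the budget separates deviations that are small at every step from ones that would be easy to flag, which is the trade-off the stealth–degradation frontier describes.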

Full formal treatment → theory.html — a dedicated Theory page now carries the complete formalism (all definitions, budgets, propagation, and resiliency metrics).

③ The empirical journey

Five eras of experiments

The engine work that turns the theory into measurements. Entries are tagged by outcome: win · dead-end/pivot · insight.

Era 0 · from one agent to a swarm
Starting point: the single-agent problem [predecessor]

The program began with one agent, not a swarm: train a single RL policy that maximizes coverage of an arbitrary grid world within a horizon T, and have the same policy generalize across world size and obstacle layout. The agent spawns at an unknown location with no prior knowledge of world size, boundaries, or its own position; it sees only a k×k sensor window (radius 3 / 5 / 7), builds its whole map from observations anchored at spawn (relative, not absolute coordinates), and discovers boundaries by reaching them. Deliberately asymmetric information: the policy runs on partial state at inference, while the critic and training-time modules get full ground truth — a free supervision signal.
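The spawn-anchored map described above can be sketched as follows. This is a minimal illustration, not the project's code: it treats k as the side length of the sensor window, and `sense`, `RelativeMap`, and the toy 5×5 world are hypothetical names and values introduced for the example.

```python
def sense(world, true_pos, k):
    """Return the k x k window around true_pos as {(d_row, d_col): cell}.

    Cells outside the world are simply absent, so a boundary only shows up
    once the agent is close enough for the window to be clipped.
    """
    radius = k // 2
    rows, cols = len(world), len(world[0])
    window = {}
    for d_row in range(-radius, radius + 1):
        for d_col in range(-radius, radius + 1):
            row, col = true_pos[0] + d_row, true_pos[1] + d_col
            if 0 <= row < rows and 0 <= col < cols:
                window[(d_row, d_col)] = world[row][col]
    return window

class RelativeMap:
    """Map keyed by displacement from spawn; the agent never sees absolute coordinates."""

    def __init__(self):
        self.offset = (0, 0)  # displacement from the spawn cell
        self.cells = {}

    def move(self, d_row, d_col):
        self.offset = (self.offset[0] + d_row, self.offset[1] + d_col)

    def integrate(self, window):
        for (d_row, d_col), value in window.items():
            self.cells[(self.offset[0] + d_row, self.offset[1] + d_col)] = value

# The simulator knows true_pos; the agent only tracks its own offset.
world = [[0] * 5 for _ in range(5)]
true_pos = (2, 2)
agent_map = RelativeMap()
agent_map.integrate(sense(world, true_pos, k=3))

true_pos = (2, 3)        # one step to the right
agent_map.move(0, 1)
agent_map.integrate(sense(world, true_pos, k=3))
```

Because every key is a displacement from spawn, the same structure works for any world size, which is what makes the accumulated map size-invariant.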

The baseline ladder & win condition

Three rungs bracket the problem: random walk (lower bound), frontier-based exploration (the fair partial-info comparator), and a Hamiltonian path with full information (the oracle ceiling). The quantitative win condition: the trained policy must beat frontier-based on coverage at horizon T and approach the Hamiltonian ceiling.
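For readers unfamiliar with the middle rung, the sketch below shows the usual notion of a frontier: a known free cell with at least one unknown 4-neighbour. It is a generic illustration of frontier-based exploration under that textbook definition, not the comparator used in the experiments; the distance rule and tie-breaking are simplifications chosen for the example.

```python
NEIGHBOURS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def frontier_cells(known):
    """known maps (row, col) -> True for free, False for obstacle; unknown cells are absent."""
    frontier = set()
    for (row, col), is_free in known.items():
        if is_free and any((row + dr, col + dc) not in known for dr, dc in NEIGHBOURS):
            frontier.add((row, col))
    return frontier

def nearest_frontier(known, agent):
    """Closest frontier cell by Manhattan distance (ignores obstacles, a simplification)."""
    cells = frontier_cells(known)
    if not cells:
        return None  # nothing left to explore
    return min(cells, key=lambda cell: (abs(cell[0] - agent[0]) + abs(cell[1] - agent[1]), cell))

# A 3 x 3 patch of explored cells with one obstacle.
known = {(row, col): True for row in range(3) for col in range(3)}
known[(0, 1)] = False
targets = frontier_cells(known)
goal = nearest_frontier(known, agent=(1, 1))
```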

Five solution strategies considered

① Pure RL from scratch: end-to-end, no priors; reward as sparse-terminal / dense-per-step / hybrid; gradient-based A2C·SAC·DQN·Rainbow vs evolutionary ES·CMA-ES·novelty-search.
② Bootstrapped RL: behaviour-clone frontier or full-info Hamiltonian experts, then RL fine-tune.
③ Learning-augmented heuristic: frontier spine with learned corrections (utility scores, override flags, terrain-aware adjustments) under full-state supervision.
④ Planning with a learned value: search over the agent's belief (map + relative position + remaining horizon), value trained against ground-truth coverage and queried on partial belief; closer to POMCP than AlphaZero.
⑤ Model-based with map completion: a learned model maps the partial map to a distribution over plausible full worlds, then plans against samples or expected utility.
Shared scaffolding: an egocentric k×k encoder + accumulated-map encoder (size-invariant by construction), small nets (<1M params), distillation, auxiliary "predict the unseen map" objectives, curriculum, and domain randomization.

Single-agent coverage saturated [pivot]

The single-agent target was hit and the metric saturated — no headroom left. The open problems live in teams, which motivated abandoning the solo problem and pivoting to the multi-agent / adversarial program below.

Era 1 · MARL coverage (PPO) — coverage solved, the connectivity wall
Honest re-baseline → visit-coverage [honesty]

3×3 sensor-coverage was too generous → re-baselined to 1×1 visit-coverage; old checkpoints invalidated.
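The gap between the two metrics is easy to see on a toy trajectory. The sketch below is illustrative only: the function names and the 3×6 grid are assumptions, and the real metrics are computed inside the testbed.

```python
def visit_coverage(path, rows, cols):
    """Fraction of cells the agent has physically stood on (1x1 footprint)."""
    return len(set(path)) / (rows * cols)

def sensor_coverage(path, rows, cols, k=3):
    """Fraction of cells that ever fell inside the k x k sensor window."""
    radius = k // 2
    seen = set()
    for row, col in path:
        for d_row in range(-radius, radius + 1):
            for d_col in range(-radius, radius + 1):
                r, c = row + d_row, col + d_col
                if 0 <= r < rows and 0 <= c < cols:
                    seen.add((r, c))
    return len(seen) / (rows * cols)

# Walk along the middle row of a 3 x 6 grid.
path = [(1, col) for col in range(1, 5)]
by_sensor = sensor_coverage(path, rows=3, cols=6)
by_visit = visit_coverage(path, rows=3, cols=6)
```

Four steps down a corridor already count as full coverage under the sensor metric while fewer than a quarter of the cells were visited, which is the sense in which the 3×3 metric was too generous.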

Decentralization wins [win]

Independent learners (86.4%) ≈ CTDE ≫ joint (70%/4%, fails).

Clustered start (87%) · anti-overlap (91.4%) · return-norm (93.4%)

Start distribution is a lever; a connectivity bonus clumps; return-norm accelerates.

Frontier attention: coverage cracked [win]

97.7%, size-agnostic (zero-shot ~96% @20²/24²).

…connectivity collapsed to 32% [the wall]

A pure coverage optimizer disperses the swarm. Connectivity is the problem.

Attention ≠ incentive [insight]

Capability (a head) and behaviour (the reward) are separate knobs.

Threshold ≻ maximize → capped giant-component [breakthrough]

"3 connected + 1 roamer" scores like a clump → no clump incentive: 89.5%/87% giant/0 collisions.

Collision hard, connectivity soft + curriculum [design]

Action-masking beats env-override; masking-from-scratch collapses → curriculum.
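Action masking here means removing illegal moves from the policy's distribution before sampling, rather than letting the environment override a bad choice afterwards. A minimal sketch, assuming a five-move action set and treating both obstacles and teammates as occupied cells; the names `legal_mask` and `masked_policy` are hypothetical, and the curriculum that introduces the mask gradually is not shown.

```python
import math

MOVES = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]  # stay, up, down, left, right

def legal_mask(pos, occupied, rows, cols):
    """True for each move that stays in bounds and lands on an unoccupied cell."""
    mask = []
    for d_row, d_col in MOVES:
        target = (pos[0] + d_row, pos[1] + d_col)
        in_bounds = 0 <= target[0] < rows and 0 <= target[1] < cols
        # Staying put is always allowed, so the mask is never empty.
        mask.append((d_row, d_col) == (0, 0) or (in_bounds and target not in occupied))
    return mask

def masked_policy(logits, mask):
    """Softmax over legal moves only; illegal moves get probability exactly 0."""
    top = max(logit for logit, ok in zip(logits, mask) if ok)
    weights = [math.exp(logit - top) if ok else 0.0 for logit, ok in zip(logits, mask)]
    total = sum(weights)
    return [w / total for w in weights]

# Agent in the top-left corner of a 4 x 4 grid with a teammate to its right.
mask = legal_mask(pos=(0, 0), occupied={(0, 1)}, rows=4, cols=4)
probs = masked_policy([0.1, 2.0, 0.3, 0.0, 5.0], mask)
```

Even though the raw logits strongly prefer moving right, that move carries zero probability after masking, so the collision constraint holds by construction.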

Local degree-floor ≻ global giant-component [win]

Local "keep ≥1 neighbour" dominates a global Fiedler value under partial info.
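The two connectivity signals discussed in this era, the capped giant-component score and the local degree floor, can be sketched side by side. This is an illustration under stated assumptions: links are defined by Chebyshev distance within `comm_range`, the cap of 0.75 is chosen so that three of four agents saturate the score, and all function names are hypothetical.

```python
from collections import deque

def neighbours(i, positions, comm_range):
    """Indices of agents within communication range of agent i."""
    row, col = positions[i]
    return [j for j, (r2, c2) in enumerate(positions)
            if j != i and max(abs(row - r2), abs(col - c2)) <= comm_range]

def giant_component_fraction(positions, comm_range):
    """Size of the largest connected group divided by team size (needs the global graph)."""
    unvisited = set(range(len(positions)))
    best = 0
    while unvisited:
        queue = deque([unvisited.pop()])
        size = 1
        while queue:
            i = queue.popleft()
            for j in neighbours(i, positions, comm_range):
                if j in unvisited:
                    unvisited.remove(j)
                    queue.append(j)
                    size += 1
        best = max(best, size)
    return best / len(positions)

def capped_connectivity_score(positions, comm_range, cap=0.75):
    """Threshold-style score: no extra credit once the giant component reaches the cap."""
    return min(giant_component_fraction(positions, comm_range), cap) / cap

def degree_floor_met(i, positions, comm_range, floor=1):
    """Local check: agent i only needs its own neighbour list."""
    return len(neighbours(i, positions, comm_range)) >= floor

clump = [(0, 0), (0, 1), (1, 0), (1, 1)]
with_roamer = [(0, 0), (0, 1), (1, 0), (9, 9)]
clump_score = capped_connectivity_score(clump, comm_range=2)
roamer_score = capped_connectivity_score(with_roamer, comm_range=2)
roamer_has_neighbour = degree_floor_met(3, with_roamer, comm_range=2)
```

The capped score treats the clump and the three-plus-roamer team identically, so it gives no incentive to clump. The degree floor is a different kind of signal: each agent can evaluate it from its own observations, without the global graph.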

The action-representation bottleneck [pivot]

The ~69% re-tread ceiling comes from the 1-step move head, not from perception, critic, or memory, which motivates the move to goal/candidate heads.

Era 2 · the graph-belief swarm — design, decentralization, the role question
3-module design + sub-goal action head [design]

Compass + SLAM controller + Emergence; the learned action is a goal over graph regions, the fix for the bottleneck.

Emergence-chain corrections [insight]

Structured symmetry-break (NOT entropy — entropy preserves symmetry); central critic NOT required; anti-crowding general; shared belief neither sufficient nor necessary.

Hard guardrail gridlocks → soft tether [fix]

Hard shield + anti-crowding = ~5% gridlock; a soft tether makes it cohesive and mobile.

Centralized prototype rejected → decentralized rebuild [pivot]

swarm_explore: certainty field (operate +1, diminishing reward, max-merge gossip), connectivity by choice.
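One way to read the certainty-field mechanics is sketched below. The +1 increment and the max-merge rule come from the description above; the specific diminishing-reward form `1 / (1 + certainty)` and the function names are assumptions made for the example, not the swarm_explore implementation.

```python
def operate(field, cell):
    """Operate on a cell: certainty goes up by 1, reward shrinks with prior certainty."""
    before = field.get(cell, 0)
    field[cell] = before + 1
    return 1.0 / (1 + before)  # assumed diminishing-returns shape

def max_merge(mine, theirs):
    """Gossip merge: keep the higher certainty per cell."""
    return {cell: max(mine.get(cell, 0), theirs.get(cell, 0))
            for cell in mine.keys() | theirs.keys()}

# Two agents with partly overlapping local fields.
agent_a = {}
first_reward = operate(agent_a, (0, 0))
second_reward = operate(agent_a, (0, 0))
agent_b = {(0, 0): 1, (3, 3): 1}

merged_ab = max_merge(agent_a, agent_b)
merged_ba = max_merge(agent_b, agent_a)
```

A max-merge is commutative and idempotent, so agents can exchange fields in any order and any number of times without double counting, which suits an unreliable, decentralized gossip channel.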

λ Pareto saturates; you can't out-range it [honest Pareto]

λ=0.5…8 identical (penalty never beats an in-range target); cr-bump helps @16 but not @32. Connectivity is size-blind.

Hand-coded cut-vertex role failed (72→42) [dead-end]

The role policy had to be learned.

Learned ES role policy works (77→85) [win]

129-param MLP, OpenAI-ES, fitness 0.6cov+0.4conn.
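A minimal sketch of the kind of update OpenAI-ES performs, using mirrored perturbations and the 0.6/0.4 fitness weighting stated above. The `toy_rollout` function is a hypothetical two-parameter stand-in for a real episode, not the 129-parameter role policy, and `sigma`, `lr`, and `pairs` are illustrative values.

```python
import random

def mission_fitness(coverage, connectivity):
    """Fitness weighting from the text: 0.6 * coverage + 0.4 * connectivity."""
    return 0.6 * coverage + 0.4 * connectivity

def es_step(theta, evaluate, rng, sigma=0.1, lr=0.2, pairs=20):
    """One evolution-strategies update with antithetic (mirrored) noise."""
    grad = [0.0] * len(theta)
    for _ in range(pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = evaluate([t + sigma * e for t, e in zip(theta, eps)])
        minus = evaluate([t - sigma * e for t, e in zip(theta, eps)])
        for i, e in enumerate(eps):
            grad[i] += (plus - minus) * e
    # Gradient estimate: sum((F+ - F-) * eps) / (2 * pairs * sigma).
    scale = lr / (2.0 * pairs * sigma)
    return [t + scale * g for t, g in zip(theta, grad)]

def toy_rollout(theta):
    """Hypothetical stand-in for an episode: smooth bumps instead of a simulator."""
    coverage = max(0.0, 1.0 - (theta[0] - 0.5) ** 2)
    connectivity = max(0.0, 1.0 - (theta[1] + 0.3) ** 2)
    return mission_fitness(coverage, connectivity)

rng = random.Random(0)
theta = [0.0, 0.0]
start_fitness = toy_rollout(theta)
for _ in range(100):
    theta = es_step(theta, toy_rollout, rng)
end_fitness = toy_rollout(theta)
```

The update only needs scalar fitness values per rollout, so it does not have to attribute credit to individual agents or roles.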

Topology-gossip + active relay: the action is the live knob [win]

Relay action reaches cohesion 78→92 where λ couldn't; gossip a ~5-pt refinement; a no-op-flag bug was caught by an adversarial-audit workflow.

GNN actor-critic → structure-blind roles [decision]

Shared team advantage can't attribute roles → confirmed ES over AC.

MARL-at-32 with a degree signal → CPU-blocked [pivot]

32×32 compile too slow → commit to the host-side ES path.

Era 3 · the three stacks consolidated
Connectivity as mission-safety; λ dropped

Degree-budget (lose ≤K); hard guardrail beats soft by ~20 pts; re-calibrated at N=10.

Stuck-relay fixed

Relay→herd centroid + hysteresis → balanced switching.
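Hysteresis in a role switch means using two thresholds, one to enter the relay role and a lower one to leave it, so a score hovering near a single cut-off does not cause constant flipping. The sketch below illustrates only that part; the thresholds 0.6 and 0.4 are hypothetical, and the herd-centroid relay target is not modelled.

```python
def update_role(is_relay, p_relay, enter=0.6, leave=0.4):
    """Two-threshold switch: harder to enter the relay role than to stay in it."""
    if is_relay:
        return p_relay > leave
    return p_relay >= enter

def single_threshold(p_relay, threshold=0.5):
    """Naive switch for comparison."""
    return p_relay >= threshold

def count_switches(roles, initial=False):
    switches, previous = 0, initial
    for role in roles:
        if role != previous:
            switches += 1
        previous = role
    return switches

# A relay score that hovers around 0.5 before committing.
scores = [0.45, 0.55, 0.48, 0.52, 0.47, 0.65, 0.55, 0.45, 0.35]

with_hysteresis, role = [], False
for score in scores:
    role = update_role(role, score)
    with_hysteresis.append(role)

naive = [single_threshold(score) for score in scores]
hysteresis_switches = count_switches(with_hysteresis)
naive_switches = count_switches(naive)
```

On this trace the two-threshold rule changes role twice while the single threshold changes six times.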

Selective structure-aware relaying [win]

Cut-vertex relay 94% (vs 3%); transfers to 32×32 (78%).
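A cut vertex (articulation point) is an agent whose removal disconnects the communication graph, which is why it is the natural candidate for a relay role. The brute-force check below is a sketch for small teams and assumes the graph is connected to begin with; it is not the feature extractor used in the experiments.

```python
def is_connected(nodes, adjacency):
    """Depth-first search restricted to the given node set."""
    nodes = set(nodes)
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for other in adjacency[current]:
            if other in nodes and other not in seen:
                seen.add(other)
                stack.append(other)
    return seen == nodes

def cut_vertices(adjacency):
    """Agents whose removal disconnects an initially connected graph."""
    nodes = set(adjacency)
    return {v for v in nodes if not is_connected(nodes - {v}, adjacency)}

# A chain of five agents: every interior agent holds the team together.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
# A ring of five agents: any single agent can leave without a split.
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}

chain_cuts = cut_vertices(chain)
ring_cuts = cut_vertices(ring)
```

Each check removes one vertex and re-runs a graph search, which is fine for teams of a handful of agents; a linear-time articulation-point algorithm would be the usual choice at larger scale.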

Size-invariant graph belief [win]

Snapshot estimator (91%) collapsed OOD → recurrent message-passing belief: 95.9%@16, 72.5±1.2% zero-shot@32, one-shot→80%.

Era 4 · the wall, and the reframe to the research question
The geometric wall [hard limit]

Belief-wiring failed (clump/freeze/suppress); local switcher (60/36) best; relaying can't buy global connectivity at scale (compass breaks 97% of links).

The reframe: to covert resilience [the RQ]

Failure modes are micro→macro amplifications → the governing question: covert single-agent misbehaviour → mission failure + resilience.

The literature verdict [gap]

Both flanks narrowed; the spatial + role-position + belief + stealth-frontier conjunction is the open seam (the gap).

Open questions / future-work seeds [seeds]

Loose threads logged for the next push:
Energy-aware agents: adding a notion of energy / budget to each agent so movement and sensing have a cost.
A safety & detectability study spanning control-barrier functions (CBF), Koopman operator views of the dynamics, and formal stealthy-detection metrics.
KKT conditions & distributed optimization as the analytic lens on the constrained team objective.
The reward-design distinction that keeps recurring: zero-sum vs. general-sum framing, and agent-vs-agent vs. team-vs-team contests, i.e. how the reward function differs for blue-agent-vs-red-agent versus team-blue-vs-team-red.

④ Networks & learning stack

Every model, part by part

| module · era | network | learning | inputs → outputs | size | result |
| --- | --- | --- | --- | --- | --- |
| Frontier-attention AC · E1 | AC + spatial frontier cross-attention | PPO (IPPO/CTDE; return-norm) | grid+agent obs → move + value | ~122k | 97.7% cov; conn 32% |
| Compass · Stack 1 | deterministic heuristic | | certainty field → goal cell | 0 | ~99%@16 / 76%@32 |
| ES role-switcher · Stack 2 | MLP 7→16→1 + hysteresis | OpenAI-ES; mission-safety fitness | graph-criticality feats → P(relay) | ~145 | cut-vertex relay 94%; transfers |
| Relay-gate · E2 | MLP, 8 feats | ES | gate feats → relay gate | small | @16 97/85/95 |
| GNN actor-critic · E2 (dropped) | GCN actor + central GCN critic | A2C, MC returns | graph → role + value | | structure-blind |
| Snapshot estimator · E3 (dropped) | GCN + bilinear decoder | supervised BCE | masked feats → adjacency | | 91%@16 → trivial@32 |
| GCRN belief · Stack 3 | per-node GRU + message-pass + bilinear | privileged distillation, BPTT; k-fold | F_IN=4 → full adjacency | HID 16 | 95.9%@16; 72.5±1.2%@32 |
| Belief-wired switcher · E4 | 11-feat ES MLP + belief relay action | ES | local+belief feats → P(relay)+target | ~209 | 64/13 (the wall) |
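The two MLP sizes listed above are consistent with plain fully connected layers with biases. A small check under that assumption; the hidden width of 16 for the 11-feature switcher is assumed here, since only its input width is stated.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

role_switcher = mlp_param_count([7, 16, 1])     # ES role-switcher, 7→16→1
belief_switcher = mlp_param_count([11, 16, 1])  # belief-wired switcher, hidden width assumed
```

Under these assumptions the counts come out to 145 and 209, matching the "~145" and "~209" entries.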

[Figure: E1, PPO training curves.]
[Figure: E2, the λ Pareto saturation.]
[Figure: E2, GNN actor-critic (structure-blind → dropped).]
[Figure: E2/E4, the 32×32 swarm fragments: the geometric wall.]
[Figure: Stack 1+2, relay (amber) vs frontier (blue).]
[Figure: Stack 3, the size-invariant belief, zero-shot @32.]

⑤ The paper

Stealth Attacks on Swarms

report/ — a NeurIPS-style results report (the empirical instantiation of the theory).

Frames the contest as a two-player zero-sum POSG with a centralized adversary on k of n agents, instantiated in RedWithinBlue (an open JAX testbed for cooperative grid exploration). A compromise sweep on a 16×16, 5-agent fixture shows:

Sections: intro · related work · problem formulation · environment · threat model · method · experiments · results · discussion · conclusion. This is the direct precursor to the connectivity-aware / belief-driven extensions on this site.

⑥ Field maps & references

Situating the work

⑦ How the layers connect

The through-line

The formalism defines the objects (health, viability, stealth/break budgets, propagation, the resiliency metrics); the engine turns them into measurements; the paper reports the compromise sweep and the posterior-as-resilience-signal finding; the field maps situate the gap. The empirical through-line: coverage is local and size-invariant; connectivity is global and size-blind — almost every dead-end came from conflating the two, and the arc is the slow untangling (local action over global leashes, threshold over maximize, structured symmetry-breaking, learned roles, the relay action over the connectivity penalty, a size-invariant belief) until the geometric wall reframed the work around covert resilience — exactly the question the formalism was built to make precise.