LUNAR LANDER — THE MANUAL

← BACK TO THE GAME

The 1979 arcade classic, rebuilt as a reinforcement-learning arena. One pure-Python physics core (stdlib only, ~30 KB wheel) does everything: it trains your agents through Gymnasium, and it runs live in the browser via Pyodide (CPython on WebAssembly). JS never simulates — it only draws what Python computes, 60 times a second. Same seed → same episode, byte-for-byte, in CI, in training, and in your browser. The title screen is a live demo: a built-in autopilot flying three real landers, collisions welcome.

←/A →/D rotate · ↑/W/Space thrust · R new episode · O agent view · P AI pilot · 1/2/3 difficulty
Touch: arcade buttons — ←/→ and THRUST; tap the difficulty line on the title screen · ?seed=123 in the URL pins the terrain

Landing: both feet on a pad, upright, slow. |vx| ≤ 12 and |vy| ≤ 18 is a perfect landing (50 × multiplier); a hard landing, up to |vx| ≤ 25 and |vy| ≤ 35, still pays 15 × multiplier; anything else (fast, tilted, off-pad, off-world) is a crash. Narrower pads pay more (2X/3X/5X). Landers are solid: meet one in flight and you both crash; a landed lander blocks its pad, legally.
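
The landing rule above can be sketched as a small classifier. This is a minimal illustration, not the core's actual code: the function name, the `Outcome` type, and the choice to score a crash as 0 are assumptions, and `on_pad` / `upright` are passed in as booleans because the manual does not state the exact tilt tolerance or foot geometry.

```python
from dataclasses import dataclass

# Speed thresholds quoted in the manual (game physics units).
PERFECT_VX, PERFECT_VY = 12.0, 18.0
HARD_VX, HARD_VY = 25.0, 35.0


@dataclass(frozen=True)
class Outcome:
    kind: str   # "perfect", "hard", or "crash"
    score: int  # points for this touchdown (0 on a crash is an assumption)


def classify_touchdown(vx: float, vy: float, on_pad: bool,
                       upright: bool, mult: int) -> Outcome:
    """Classify a touchdown from speed, pad contact, and attitude."""
    if not (on_pad and upright):
        return Outcome("crash", 0)
    if abs(vx) <= PERFECT_VX and abs(vy) <= PERFECT_VY:
        return Outcome("perfect", 50 * mult)
    if abs(vx) <= HARD_VX and abs(vy) <= HARD_VY:
        return Outcome("hard", 15 * mult)
    return Outcome("crash", 0)
```

For example, touching down at vx = 10, vy = -15 on a 5X pad would classify as perfect under these assumptions, while the same speeds off-pad would be a crash.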

Three presets — terrain ruggedness, pad widths, fuel budget and spawn drift change; the physics never does. The selection persists and applies to both your episodes and the attract mode.

preset	displacement	decay	y_max	pad widths 2X/3X/5X	fuel	spawn vx
TRAINEE	140	0.52	380	130 / 95 / 60	1200	±15
CADET	210	0.62	480	110 / 75 / 45	1000	±25
COMMANDER	260	0.68	560	90 / 60 / 36	850	±40

The ladder doubles as an RL curriculum: train trainee → cadet → commander with the same reward and physics throughout.
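
One way to drive that curriculum is to keep the preset table as data and hand each stage to a training routine. The sketch below is illustrative only: the dictionary keys, `run_curriculum`, and the `train_stage` callback are hypothetical names, and the numbers are copied from the preset table.

```python
from typing import Any, Callable, Optional

# Values copied from the preset table; key names are illustrative, not the package's.
PRESETS = {
    "trainee":   {"displacement": 140, "decay": 0.52, "y_max": 380,
                  "pad_widths": {2: 130, 3: 95, 5: 60}, "fuel": 1200, "spawn_vx": 15},
    "cadet":     {"displacement": 210, "decay": 0.62, "y_max": 480,
                  "pad_widths": {2: 110, 3: 75, 5: 45}, "fuel": 1000, "spawn_vx": 25},
    "commander": {"displacement": 260, "decay": 0.68, "y_max": 560,
                  "pad_widths": {2: 90, 3: 60, 5: 36}, "fuel": 850, "spawn_vx": 40},
}

CURRICULUM = ("trainee", "cadet", "commander")


def run_curriculum(train_stage: Callable[[str, Optional[Any]], Any],
                   policy: Optional[Any] = None) -> Any:
    """Train on each preset in order, carrying the policy from stage to stage.

    `train_stage(preset, policy)` is supplied by the caller; it would typically
    build an env with that preset, continue training, and return the new policy.
    """
    for preset in CURRICULUM:
        policy = train_stage(preset, policy)
    return policy
```

Because reward and physics are the same on every rung, the policy can be passed along unchanged between stages; only the environment preset differs.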

import gymnasium, moonlander

env = gymnasium.make("MoonLander-v0")            # classic rotate+thrust
# or: MoonLanderEnv(mode="gym")                  # LunarLander-style engines
# or: MoonLanderEnv(obs_mode="radar")            # partial observability
# or: MoonLanderEnv(preset="trainee")            # the difficulty curriculum
# or: MoonLanderEnv(frame_skip=4)                # 4 physics ticks per step

obs, info = env.reset(seed=42)                   # info["terrain"] = terrain dict
obs, r, term, trunc, info = env.step(env.action_space.sample())
if term: print(info["outcome"])                  # {"kind": "perfect", "mult": 5, ...}

action_space = Discrete(4) · observation_space = Box(-10, 10, (14,), float32). In classic mode the four actions are noop / rotate-left / rotate-right / thrust; in gym mode they fire engines — noop / left / main / right. Episodes truncate at max_steps physics ticks; stepping a finished episode raises RuntimeError. On the terminal step info carries outcome, is_success and score.

The observation is 14 floats, world-size-relative. The target pad is the one whose center is nearest to the lander by Euclidean distance.

idx	value	formula
0	x (normalized)	`x / world_w * 2 - 1`
1	y (normalized)	`y / world_h * 2 - 1`
2	vx	`vx / 60`
3	vy	`vy / 60`
4	sin(angle)	`sin(angle)`
5	cos(angle)	`cos(angle)`
6	angular velocity	`ang_vel / 3`
7	fuel fraction	`fuel / fuel_init`
8	dx to target pad	`(pad_cx - x) / world_w`
9	dy to target pad	`(pad_y - y) / world_h`
10	pad half-width	`(x1 - x0) / 2 / 100`
11	terrain clearance	`(y - lander_bottom - ground_y(x)) / world_h`
12	pad multiplier	`mult / 5`
13	pad visible	`1.0` or `0.0`
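
The table can be read as a recipe. The following sketch assembles the 14 floats from the formulas above; the `Lander` and `Pad` containers and the function signature are assumptions made for illustration, not the package's own types.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Lander:  # hypothetical container for the lander state
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    ang_vel: float
    fuel: float
    fuel_init: float


@dataclass
class Pad:  # hypothetical container for the target pad
    x0: float
    x1: float
    y: float
    mult: int


def build_observation(s: Lander, pad: Pad, world_w: float, world_h: float,
                      lander_bottom: float, ground_y: float) -> List[float]:
    """Return the 14-float observation; `ground_y` is the terrain height under x."""
    pad_cx = (pad.x0 + pad.x1) / 2
    return [
        s.x / world_w * 2 - 1,                       # 0  x (normalized)
        s.y / world_h * 2 - 1,                       # 1  y (normalized)
        s.vx / 60,                                   # 2  vx
        s.vy / 60,                                   # 3  vy
        math.sin(s.angle),                           # 4  sin(angle)
        math.cos(s.angle),                           # 5  cos(angle)
        s.ang_vel / 3,                               # 6  angular velocity
        s.fuel / s.fuel_init,                        # 7  fuel fraction
        (pad_cx - s.x) / world_w,                    # 8  dx to target pad
        (pad.y - s.y) / world_h,                     # 9  dy to target pad
        (pad.x1 - pad.x0) / 2 / 100,                 # 10 pad half-width
        (s.y - lander_bottom - ground_y) / world_h,  # 11 terrain clearance
        pad.mult / 5,                                # 12 pad multiplier
        1.0,                                         # 13 pad visible
    ]
```

Encoding the angle as a sin/cos pair avoids the discontinuity a raw angle would have at the wrap-around point.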

Truth lives in the core; perception is a filter. The frame JSON always carries the truth — obs is what a policy sees, shaped by obs_mode:

"full" — indices 8–12 always populated, index 13 always 1.0
"radar" — beyond radar_range of the nearest pad, indices 8, 9, 10, 12 read 0.0 and index 13 reads 0.0: the agent must explore to find a pad
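
The radar filter can be pictured as a mask applied to the full observation. This is a sketch under assumptions: the function name is hypothetical, and the distance to the nearest pad is passed in by the caller because the manual does not say how it is measured.

```python
from typing import List, Sequence

PAD_INDICES = (8, 9, 10, 12)  # dx, dy, half-width, multiplier
VISIBLE_INDEX = 13            # pad-visible flag


def apply_radar(full_obs: Sequence[float], dist_to_nearest_pad: float,
                radar_range: float) -> List[float]:
    """Return a copy of the full observation with pad fields hidden when out of range."""
    obs = list(full_obs)
    if dist_to_nearest_pad > radar_range:
        for i in PAD_INDICES:
            obs[i] = 0.0
        obs[VISIBLE_INDEX] = 0.0
    return obs
```

Index 11 (terrain clearance) is left untouched in both modes, so the agent can always sense the ground directly below it.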

Documented but not yet implemented: a lidar mode (terrain rays, no pad oracle), seeded sensor noise, and other-lander slots for the multi-agent env.

Press P (or the AI button) to hand the lander to a neural-network pilot: a small MLP whose forward pass runs in pure Python, right here in your browser. Touch any flight control and it hands the stick back. Press M (or tap the AI button) to switch method: EVOLUTION (the CMA-ES policy, cyan) and HEURISTIC (a pure-Python MPC/iLQR optimal controller, amber) both fly live in your browser, each with its own lander shape and color. The game ships the EVOLUTION policy, which lands roughly 40% of the time on the trainee preset; you can also bring your own.

Bring your own brain: drag a policy .json onto the game (or use the LOAD AI footer link) and your network flies, labeled CUSTOM AI. Train and export one with examples/train_template.py — an annotated starting point that tours the whole world API; the forward pass it exports is the exact one this page runs.
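
To make "a forward pass in pure Python" concrete, here is a minimal stdlib-only sketch. The JSON schema used here (a `layers` list of `W` / `b` entries), the tanh hidden activation, and the argmax action choice are all assumptions for illustration; the real export format is whatever examples/train_template.py writes.

```python
import json
import math
from typing import Dict, List, Sequence


def forward(layers: Sequence[Dict[str, list]], obs: Sequence[float]) -> List[float]:
    """Run an MLP: tanh on hidden layers, raw scores on the last layer."""
    h = list(obs)
    last = len(layers) - 1
    for i, layer in enumerate(layers):
        # One dot product per output unit, plus its bias.
        z = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(layer["W"], layer["b"])]
        h = z if i == last else [math.tanh(v) for v in z]
    return h


def act(policy_json: str, obs: Sequence[float]) -> int:
    """Pick the action with the highest score (hypothetical policy schema)."""
    layers = json.loads(policy_json)["layers"]
    scores = forward(layers, obs)
    return max(range(len(scores)), key=scores.__getitem__)
```

With only list comprehensions and `math.tanh`, a network of this size needs no NumPy, which is what lets it run unchanged under Pyodide.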

φ(s)  = -1.0 * dist - 0.5 * speed - 0.5 * |sin angle|
        dist  = (min over pads of euclidean distance to pad center) / world_w
        speed = hypot(vx, vy) / 60
r_t   = 10 * (φ(s') - φ(s)) - 0.06 * (1 if main engine actually fired else 0)
terminal: perfect → +100 + 10*mult;  hard → +30;  crash → -100

The shaping is potential-based, so it is policy-invariant, and the min-over-pads potential stays continuous when the nearest pad switches — no fake reward at the boundary.
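
The reward formulas above translate almost line for line into code. The sketch below follows them term by term; the function names and argument layout are illustrative, and how the terminal bonus is combined with the final step's shaping term is left to the caller because the manual does not spell it out.

```python
import math
from typing import Iterable, Tuple


def potential(x: float, y: float, vx: float, vy: float, angle: float,
              pad_centers: Iterable[Tuple[float, float]], world_w: float) -> float:
    """phi(s) = -1.0*dist - 0.5*speed - 0.5*|sin angle|."""
    dist = min(math.hypot(px - x, py - y) for px, py in pad_centers) / world_w
    speed = math.hypot(vx, vy) / 60
    return -1.0 * dist - 0.5 * speed - 0.5 * abs(math.sin(angle))


def step_reward(phi_prev: float, phi_next: float, main_engine_fired: bool) -> float:
    """r_t = 10*(phi(s') - phi(s)) - 0.06 if the main engine actually fired."""
    return 10 * (phi_next - phi_prev) - (0.06 if main_engine_fired else 0.0)


def terminal_reward(kind: str, mult: int) -> float:
    """Terminal bonus by outcome kind: perfect, hard, or crash."""
    if kind == "perfect":
        return 100 + 10 * mult
    if kind == "hard":
        return 30
    if kind == "crash":
        return -100
    raise ValueError(f"unknown outcome kind: {kind!r}")
```

Taking the minimum over all pads inside `potential` is what keeps the value continuous when the nearest pad changes: at the switch point both pads are equally far away, so the minimum does not jump.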

Training note (audit-verified): at frame_skip=1 a good landing takes ~1300+ decisions, so with γ = 0.99 a terminal reward of order 100 is discounted to ~0.0002 (100 × 0.99^1300) and hovering beats landing in discounted return. Train with frame_skip=4 and γ ≥ 0.997 (0.999 recommended). frame_skip up to 8 preserves the autopilot landing rate: on cadet seeds 0–29 the number of landings stays within one seed of the frame_skip=1 result.
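
The arithmetic behind that note is easy to reproduce. The sketch below assumes a round terminal reward of 100 and an episode of 1300 physics ticks, both taken as illustrative figures from the note rather than exact values.

```python
def discounted_terminal(reward: float, gamma: float, decisions: int) -> float:
    """Value of a terminal reward seen from the first decision of the episode."""
    return reward * gamma ** decisions


PHYSICS_TICKS = 1300      # length of a good landing at frame_skip=1 (from the note)
TERMINAL_REWARD = 100.0   # round figure, assumed for illustration

for gamma, frame_skip in [(0.99, 1), (0.997, 4), (0.999, 4)]:
    decisions = PHYSICS_TICKS // frame_skip
    value = discounted_terminal(TERMINAL_REWARD, gamma, decisions)
    print(f"gamma={gamma} frame_skip={frame_skip} decisions={decisions} value={value:.4g}")
```

Under these assumptions the first setting leaves roughly 0.0002 of the reward, while frame_skip=4 leaves roughly 38 at γ = 0.997 and roughly 72 at γ = 0.999, which is why the landing becomes worth pursuing.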

The single-agent env wraps Game(n_landers=1), but the core is multi-lander: one world, shared terrain, solid collisions, each lander with its own fuel, score, and fate. (A PettingZoo wrapper is a later phase.)

from moonlander.core.game import Game
g = Game(n_landers=3)                              # one world, three landers
g.reset(seed=7)                                    # shared terrain, spread spawns
g.step_all('[[1, true], [0, false], [-1, true]]')  # solid: collisions crash both

All randomness flows through one random.Random(seed) per episode: same preset + same seed = byte-identical terrain, pads, stars, and spawns (different presets differ on the same seed, by design).
env.reset(seed=k) follows the gymnasium np_random chain — the Game seed is derived, so the terrain's seed field ≠ k.
env.reset(options={"game_seed": k}) seeds the Game directly — byte-identical to the web game's ?seed=k. That is the bridge between a training run and what you watch in the browser.
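
The single-RNG rule is what makes episodes reproducible. The toy generator below is a stand-in, not the project's terrain code: `sample_episode` and the quantities it draws are hypothetical, and it only demonstrates the property the text describes, using the standard library's `random.Random`.

```python
import random
from typing import List, Tuple


def sample_episode(seed: int, n_points: int = 8) -> Tuple[List[float], float]:
    """Draw toy terrain heights and a spawn drift from one seeded RNG."""
    rng = random.Random(seed)  # the only source of randomness for this episode
    heights = [rng.uniform(0.0, 380.0) for _ in range(n_points)]
    spawn_vx = rng.uniform(-15.0, 15.0)
    return heights, spawn_vx


# Same seed gives identical draws; a different seed gives a different episode.
assert sample_episode(123) == sample_episode(123)
assert sample_episode(123) != sample_episode(124)
```

Because every draw comes from the same generator in a fixed order, adding or reordering a draw changes everything after it, which is one reason a derived seed (as with env.reset(seed=k)) and a direct game seed produce different worlds.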

Roadmap: algorithm arena (train different algorithms, watch them fly side by side on identical seeds) → competition (multi-agent, pad-blocking strategy, collision risk, comm channels) → human + AI co-op (you fly one lander, the agent flies the other).

MultiLander is built and maintained by Bijan Mehralizadeh as an open-source playground for teaching and researching reinforcement learning — an homage to Atari's 1979 vector-monitor original, rebuilt so the same physics that trains agents flies in your browser. Python + Gymnasium on the inside; Pyodide/WebAssembly and a hand-drawn vector stroke font on the outside. Source, tests, and the full Python⇆JS contract live at github.com/bijanmehr/MultiLander.
