Use the live tab to inspect one real decision trace from a seeded game.
This site is a research artifact, not a product mockup. Hanabi serves as the controlled teamwork environment. The claim under test is whether an AI teammate can time coordination cues well enough to help without adding unnecessary communication overhead.
Use the benchmark tab to rerun all policy conditions over repeated deterministic seeds.
Use the final tab to inspect the formulas, seed list, and exportable evidence snapshot behind the displayed result.
Selective awareness is only interesting if the benchmark keeps three checks visible at the same time.
The selective policy should score close to, or better than, the performance-only baseline.
The system should reduce hints that add little or no new coordination value.
Lower redundancy is not enough if one partner ends up carrying most of the signaling work.
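The three checks above could be computed roughly as follows. This is a hedged sketch: the metric fields, tolerances, and function names here are illustrative assumptions, not the app's actual schema.

```typescript
// Illustrative per-run metrics; field names are assumptions, not the app's schema.
interface RunMetrics {
  score: number;                    // final game score
  redundantHints: number;           // hints that added little or no new coordination value
  totalHints: number;               // all hints given
  hintsByPlayer: [number, number];  // hint counts for the two partners
}

// Check 1: selective policy stays close to (or beats) the performance-only baseline.
function performanceCheck(selectiveMean: number, baselineMean: number, tolerance = 0.5): boolean {
  return selectiveMean >= baselineMean - tolerance;
}

// Check 2: redundancy rate, the fraction of hints that added little or no value.
function redundancyRate(m: RunMetrics): number {
  return m.totalHints === 0 ? 0 : m.redundantHints / m.totalHints;
}

// Check 3: signaling balance, where 0.5 means both partners share the hinting work evenly.
function signalingBalance(m: RunMetrics): number {
  const total = m.hintsByPlayer[0] + m.hintsByPlayer[1];
  return total === 0 ? 0.5 : Math.min(...m.hintsByPlayer) / total;
}
```

Keeping all three values visible together is the point: a policy could pass any one check in isolation while failing the others.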
The simulator runs from deterministic seeds, the benchmark averages are computed from repeated runs, and the final screen exposes the formulas and exportable evidence instead of relying on presentation-only text.
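A minimal sketch of the seeded-run-then-average pattern described above, assuming a small deterministic PRNG (mulberry32 here; the app's actual generator may differ) and a placeholder `runGame` callback standing in for the simulator:

```typescript
// Small deterministic PRNG (mulberry32-style); the real app's generator may differ.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Average one policy condition over a fixed seed list.
// `runGame` is a hypothetical stand-in for a full simulated game returning a score.
function benchmarkCondition(seeds: number[], runGame: (rng: () => number) => number): number {
  const scores = seeds.map((s) => runGame(mulberry32(s)));
  return scores.reduce((sum, x) => sum + x, 0) / scores.length;
}
```

Because every run is reconstructed from its seed, rerunning the same seed list reproduces the same average, which is what makes the displayed numbers checkable rather than presentation-only.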
Press Step or Autoplay to run the current game.
Switch between the current decision, the partner estimate, the metrics, and the turn log instead of reading everything at once.
The app will show the live action rationale here, including the partner-state estimate and whether a cue was redundant.
Run the benchmark to populate the summary, the full table, and the cross-play matrix.
Run a game or benchmark first to populate the computed evidence below.
Saved entries preserve the computed summary so the result can be re-checked later from the same metrics and seed configuration.
This application runs a full two-player Hanabi simulation in the browser using standard rules, deterministic seeded decks, live agent policies, and turn-by-turn state updates. It is not a static dashboard. The agents actually act inside the simulator and can be benchmarked under repeated seeds.
The storage layer is only used to persist evidence snapshots. The core game logic, policy actions, and benchmark metrics are computed from the simulation itself.
The single-run metrics come directly from one seeded trace. The benchmark rows are arithmetic means over repeated seeded runs for each agent-partner condition.
The reproducibility export includes the selected summary and, for benchmarks, the per-seed samples that generated the displayed averages so the result can be checked independently.
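An independent check of such an export could look like the sketch below. The snapshot shape is an assumption for illustration; the real export's field names may differ.

```typescript
// Illustrative snapshot shape; the actual export's fields may differ.
interface EvidenceSnapshot {
  condition: string;
  seeds: number[];
  samples: number[];     // per-seed scores that produced the displayed average
  displayedMean: number;
}

// Recompute the mean from the per-seed samples and compare it to the displayed value.
function verifySnapshot(snap: EvidenceSnapshot, epsilon = 1e-9): boolean {
  if (snap.samples.length !== snap.seeds.length) return false;
  const mean = snap.samples.reduce((sum, x) => sum + x, 0) / snap.samples.length;
  return Math.abs(mean - snap.displayedMean) < epsilon;
}
```

Any third party holding the export can run this check without access to the app itself, which is the sense in which the result is independently checkable.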
This demo focuses on the bounded Hanabi benchmark promised in the progress report: a unified pipeline, four-policy comparison, and pilot testing in a controlled setting. It should be framed as a simulation-first benchmark rather than a replacement for a hosted human-agent evaluation protocol.