Project SYNAPSE

Selective Coordination in Human-AI Teams

Team SYNAPSE: Ishan Shivansh Bangroo and Abolfazl

A reproducible Hanabi benchmark for studying when an AI teammate should speak up, stay quiet, and coordinate without creating redundant signaling overhead.

Problem AI teammates can over-talk, under-talk, or mistime coordination cues.
Method Run one live game, then compare all four policies over many seeds.
Goal Keep score strong while reducing redundant coordination overhead.
Research question Does selective awareness improve coordination without shifting burden elsewhere?
How to read this site Move screen by screen: inspect one live run, compute the repeated benchmark, then verify the exact formulas and seeds behind the result.
Simulation, Storage, and Interpretation Checking whether saved evidence can persist beyond the browser preview.
Step 1
Understand the claim before running anything

This site is a research artifact, not a product mockup. Hanabi is the controlled teamwork environment. The actual claim is about whether an AI teammate can time coordination cues well enough to help without creating unnecessary communication overhead.

A
Watch one live run

Use the live tab to inspect one real decision trace from a seeded game.

B
Compare all policies

Use the benchmark tab to rerun all policy conditions over repeated deterministic seeds.
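The benchmark idea can be sketched in a few lines. This is an illustrative stub, not the app's actual code: the policy names, seed count, and `run_game` body are assumptions standing in for the real four conditions and full Hanabi simulation. The key point it demonstrates is pairing: every policy is evaluated on exactly the same deterministic seed list.

```python
import random
from statistics import mean

# Illustrative policy names; the real app defines its own four conditions.
POLICIES = ["performance_only", "frequent_cue", "silent", "selective"]
SEEDS = list(range(10))  # fixed, deterministic seed list

def run_game(policy: str, seed: int) -> int:
    """Stand-in for one seeded Hanabi game returning a final score (0-25)."""
    rng = random.Random(f"{policy}:{seed}")  # same inputs -> same trace
    return rng.randint(10, 25)

# Identical seeds across all policies make the comparison paired:
# every condition sees exactly the same decks.
averages = {p: mean(run_game(p, s) for s in SEEDS) for p in POLICIES}
```

Because each `(policy, seed)` pair fully determines the run, rerunning the benchmark reproduces the same averages.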

C
Verify reproducibility

Use the final tab to inspect the formulas, seed list, and exportable evidence snapshot behind the displayed result.

Success Rule
What counts as a credible result

Selective awareness is only interesting if the benchmark keeps three checks visible at the same time.

Score stays competitive

The selective policy should remain close to or better than the performance-only baseline.

Redundant cueing drops

The system should reduce hints that add little or no new coordination value.

Burden does not just shift

Lower redundancy is not enough if one partner ends up carrying most of the signaling work.
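The third check can be made concrete with a simple burden statistic. This is a minimal sketch under one assumption: that the app logs a hint count per player (the function name and dict shape here are illustrative, not the app's API).

```python
def hint_share(hints_by_player: dict[str, int]) -> float:
    """Fraction of all hints given by the most active partner.

    Near 0.5 means signaling work is balanced in a two-player team;
    near 1.0 means one partner carries almost all of it.
    """
    total = sum(hints_by_player.values())
    if total == 0:
        return 0.0  # no hints given at all
    return max(hints_by_player.values()) / total
```

Lower redundancy with `hint_share` staying near 0.5 is evidence that the signaling burden did not simply shift to one partner.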

Trust check

The simulator runs from deterministic seeds, the benchmark averages are computed from repeated runs, and the final screen exposes the formulas and exportable evidence instead of relying on presentation-only text.

Step 2B
Watch one decision unfold

Press Step or Autoplay to run the current game.

Score 0
Info tokens 8
Fuse tokens 3
Deck remaining 40
Turn 0
One thing to focus on at a time

Switch between the current decision, the partner estimate, the metrics, and the turn log instead of reading everything at once.

Awaiting run No action yet

The app will show the live action rationale here, including the partner-state estimate and whether a cue was redundant.
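One way to picture the redundancy flag mentioned above: a cue adds nothing when every card it touches is already identified along the hinted dimension in the partner-state estimate. This is a simplified sketch; the names and the set-based representation are assumptions, not the app's internal model.

```python
def is_redundant_hint(hinted_indices: set[int], already_known: set[int]) -> bool:
    """True when a hint reveals no new information: all cards it touches
    were already known along that dimension (color or rank) according to
    the partner-state estimate."""
    return hinted_indices <= already_known  # subset test
```

For example, hinting cards {0, 2} when the partner already knows {0, 1, 2} along that dimension is redundant; hinting {0, 3} is not.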

Step 3B
Inspect one benchmark view at a time

Run the benchmark to populate the summary, the full table, and the cross-play matrix.

Run the benchmark to generate the computed result summary.
Step 4B
Reproducibility check and interpretation

Run a game or benchmark first to populate the computed evidence below.

No evidence selected yet.
Computed note
The reproducibility note will summarize the seed configuration and how the displayed metric values were computed.
Optional interpretation
Ask for an interpretation after running a game or benchmark. Any answer here is derived from the computed evidence shown above.
Saved snapshots

Saved entries preserve the computed summary so the result can be re-checked later from the same metrics and seed configuration.
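A snapshot only needs to carry the computed summary and the seed configuration to be re-checkable. The shape below is illustrative (the app defines its own schema); it shows why JSON serialization is enough for the evidence to survive beyond the browser preview.

```python
import json

# Illustrative snapshot shape, not the app's actual schema.
snapshot = {
    "kind": "benchmark",
    "seed_list": [0, 1, 2],
    "metrics": {"mean_score": 21.0, "redundant_hint_rate": 0.12},
}

# Round-tripping through JSON preserves the metrics and seed configuration,
# so the same result can be re-checked later from the saved entry.
payload = json.dumps(snapshot, sort_keys=True)
restored = json.loads(payload)
```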

What is actually implemented

This application runs a full two-player Hanabi simulation in the browser using standard rules, deterministic seeded decks, live agent policies, and turn-by-turn state updates. It is not a static dashboard. The agents actually act and can be benchmarked under repeated seeds.

The storage layer is only used to persist evidence snapshots. The core game logic, policy choices, and benchmark metrics are computed from the simulation itself.
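Deterministic seeded decks are straightforward to illustrate. The sketch below assumes the standard 50-card Hanabi deck (five colors, rank multiset 1,1,1,2,2,3,3,4,4,5) and uses Python's seeded RNG as a stand-in for the app's own shuffler:

```python
import random

COLORS = ["red", "yellow", "green", "blue", "white"]
RANKS = [1, 1, 1, 2, 2, 3, 3, 4, 4, 5]  # standard Hanabi rank multiset

def seeded_deck(seed: int) -> list[tuple[str, int]]:
    """Shuffle the 50-card deck with a fixed seed: same seed, same deck."""
    deck = [(color, rank) for color in COLORS for rank in RANKS]
    random.Random(seed).shuffle(deck)  # seed fully determines the order
    return deck
```

Because the seed fully determines the deck order, every run from the same seed replays the identical game setup.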

How results are produced

The single-run metrics come directly from one seeded trace. The benchmark rows are arithmetic means over repeated seeded runs for each agent-partner condition.

The reproducibility export includes the selected summary and, for benchmarks, the per-seed samples that generated the displayed averages so the result can be checked independently.
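The export described above can be sketched as follows. The row shape and function name are illustrative; the point is that keeping the per-seed samples next to the displayed mean lets anyone re-derive the average independently.

```python
from statistics import mean

def benchmark_row(samples: dict[int, float]) -> dict:
    """Summarize one agent-partner condition: the displayed average plus
    the per-seed samples that produced it, so the mean can be re-derived."""
    return {
        "seeds": sorted(samples),
        "samples": samples,
        "mean_score": mean(samples.values()),
    }

# Example: three seeded runs for one condition.
row = benchmark_row({0: 18.0, 1: 21.0, 2: 24.0})
```

A reader checking the export only needs to recompute `mean(row["samples"].values())` and compare it with `row["mean_score"]`.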

Scope boundary

This demo focuses on the bounded Hanabi benchmark promised in the progress report: a unified pipeline, four-policy comparison, and pilot testing in a controlled setting. It should be framed as a simulation-first benchmark rather than a replacement for a hosted human-agent evaluation protocol.