Use the live tab to inspect one real decision trace from a seeded game.
This site is a research artifact, not a product mockup. Hanabi serves as the controlled teamwork environment. The claim under test is whether an AI teammate can time coordination cues well enough to help without adding unnecessary communication overhead.
Use the benchmark tab to rerun all policy conditions over repeated deterministic seeds.
Use the final tab to inspect the formulas, seed list, and exportable evidence snapshot behind the displayed result.
Selective awareness is only interesting if the benchmark keeps three checks visible at the same time.
The selective policy should score close to, or better than, the performance-only baseline.
The system should reduce hints that add little or no new coordination value.
Lower redundancy is not enough if one partner ends up carrying most of the signaling work.
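The three checks above could be computed roughly as follows. This is a hedged sketch: the metric fields, tolerances, and function names here are illustrative assumptions, not the app's actual schema.

```typescript
// Illustrative per-run metrics; field names are assumptions, not the app's schema.
interface RunMetrics {
  score: number;                    // final game score
  redundantHints: number;           // hints that added little or no new coordination value
  totalHints: number;               // all hints given
  hintsByPlayer: [number, number];  // hint counts for the two partners
}

// Check 1: selective policy stays close to (or beats) the performance-only baseline.
function performanceCheck(selectiveMean: number, baselineMean: number, tolerance = 0.5): boolean {
  return selectiveMean >= baselineMean - tolerance;
}

// Check 2: redundancy rate, the fraction of hints that added little or no value.
function redundancyRate(m: RunMetrics): number {
  return m.totalHints === 0 ? 0 : m.redundantHints / m.totalHints;
}

// Check 3: signaling balance, where 0.5 means both partners share the hinting work evenly.
function signalingBalance(m: RunMetrics): number {
  const total = m.hintsByPlayer[0] + m.hintsByPlayer[1];
  return total === 0 ? 0.5 : Math.min(...m.hintsByPlayer) / total;
}
```

Keeping all three values visible together is the point: a policy could pass any one check in isolation while failing the others.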
The simulator runs from deterministic seeds, the benchmark averages are computed from repeated runs, and the final screen exposes the formulas and exportable evidence instead of relying on presentation-only text.
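A minimal sketch of the seeded-run-then-average pattern described above, assuming a small deterministic PRNG (mulberry32 here; the app's actual generator may differ) and a placeholder `runGame` callback standing in for the simulator:

```typescript
// Small deterministic PRNG (mulberry32-style); the real app's generator may differ.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Average one policy condition over a fixed seed list.
// `runGame` is a hypothetical stand-in for a full simulated game returning a score.
function benchmarkCondition(seeds: number[], runGame: (rng: () => number) => number): number {
  const scores = seeds.map((s) => runGame(mulberry32(s)));
  return scores.reduce((sum, x) => sum + x, 0) / scores.length;
}
```

Because every run is reconstructed from its seed, rerunning the same seed list reproduces the same average, which is what makes the displayed numbers checkable rather than presentation-only.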
Press Step or Autoplay to run the current game.
Switch between the current decision, the partner estimate, the metrics, and the turn log instead of reading everything at once.
The app will show the live action rationale here, including the partner-state estimate and whether a cue was redundant.
Run the benchmark to populate the summary, the full table, and the cross-play matrix.
Run a game or benchmark first to populate the computed evidence below.
Saved entries preserve the computed summary so the result can be re-checked later from the same metrics and seed configuration.
This application runs a full two-player Hanabi simulation in the browser using standard rules, deterministic seeded decks, live agent policies, and turn-by-turn state updates. It is not a static dashboard. The agents actually act inside the simulator and can be benchmarked under repeated seeds.
The storage layer is only used to persist evidence snapshots. The core game logic, policy actions, and benchmark metrics are computed from the simulation itself.
The single-run metrics come directly from one seeded trace. The benchmark rows are arithmetic means over repeated seeded runs for each agent-partner condition.
The reproducibility export includes the selected summary and, for benchmarks, the per-seed samples that generated the displayed averages so the result can be checked independently.
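An independent check of such an export could look like the sketch below. The snapshot shape is an assumption for illustration; the real export's field names may differ.

```typescript
// Illustrative snapshot shape; the actual export's fields may differ.
interface EvidenceSnapshot {
  condition: string;
  seeds: number[];
  samples: number[];     // per-seed scores that produced the displayed average
  displayedMean: number;
}

// Recompute the mean from the per-seed samples and compare it to the displayed value.
function verifySnapshot(snap: EvidenceSnapshot, epsilon = 1e-9): boolean {
  if (snap.samples.length !== snap.seeds.length) return false;
  const mean = snap.samples.reduce((sum, x) => sum + x, 0) / snap.samples.length;
  return Math.abs(mean - snap.displayedMean) < epsilon;
}
```

Any third party holding the export can run this check without access to the app itself, which is the sense in which the result is independently checkable.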
This demo focuses on the bounded Hanabi benchmark promised in the progress report: a unified pipeline, four-policy comparison, and pilot testing in a controlled setting. It should be framed as a simulation-first benchmark rather than a replacement for a hosted human-agent evaluation protocol.