The Starlight Proving Ground

Built on SIP — Starlight Intelligence Protocol. Status: living artifact (v0.1). Board verdict 2026-06-10: PROCEED-WITH-REVISE. Public mirror: the starlight-evals repo. This page is the source-of-truth surface.

Most AI shops publish a model benchmark. Almost none publish a whole-system eval — their memory recall, their harness integrity, their dataset provenance — with the weaknesses named and dated. The Starlight Proving Ground does. It is the standing discipline that measures the entire Starlight Intelligence System across seven lanes, run by evaluator agents that hold the Luminor kernel mindset — Precision, Wisdom, Transcendence — and renders its verdict through the Starlight Board. It exists because a system you can't measure is a system you're only hoping works, and because the patterns that separate a real intelligence substrate from a demo are only visible when you put numbers next to them and refuse to round up.

Why a "Proving Ground" and not an "arena"

An arena ranks models — that lane exists and stays separate: the Starlight Model Arena runs head-to-head model rounds with its own receipts. The Proving Ground evaluates the system the models run inside: the memory tiers, the retrieval path, the trust-contract harness, the substrate symmetry, the datasets themselves. The model arena is one lane of seven.

The seven lanes

Lane	Measures	Composes
Model	capability + instruction compliance + behavioral safety	the model arena (R1 + R2)
Memory	recall@k, precision@10, latency	the memory bencher
Retrieval	BM25/FTS5 ranking quality	the retrieval eval
Harness	trust-contract + privacy + provenance (7 risk dims)	v01-evals
Substrate	symmetry invariants (docs↔code, registry coverage)	the v-series symmetry suite
Datasets	provenance + labeling honesty (no synthetic benchmarks)	dataset audit
System	the unifying scorecard + Overseer synthesis	the Proving Ground itself
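To make the Retrieval row concrete, here is a minimal sketch of a BM25 ranking check over SQLite FTS5, scored as recall@k. It is an illustration under stated assumptions, not the retrieval eval itself: the table name `notes`, the helpers `build_index`, `search`, and `recall_at_k`, and the toy documents are all hypothetical, and the sketch assumes a Python `sqlite3` build with FTS5 enabled.

```python
import sqlite3


def build_index(docs):
    """Load {doc_id: body} into an in-memory FTS5 table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE notes USING fts5(doc_id UNINDEXED, body)")
    conn.executemany(
        "INSERT INTO notes (doc_id, body) VALUES (?, ?)", list(docs.items())
    )
    return conn


def search(conn, query, k=5):
    """Return the top-k doc ids for a query, best match first.

    FTS5's bm25() returns lower values for better matches, so an
    ascending ORDER BY puts the strongest hit at the top.
    """
    rows = conn.execute(
        "SELECT doc_id FROM notes WHERE notes MATCH ? "
        "ORDER BY bm25(notes) LIMIT ?",
        (query, k),
    ).fetchall()
    return [row[0] for row in rows]


def recall_at_k(conn, labelled_queries, k=5):
    """Mean fraction of each query's relevant docs found in the top k."""
    scores = []
    for query, relevant in labelled_queries.items():
        found = set(search(conn, query, k)) & set(relevant)
        scores.append(len(found) / len(relevant))
    return sum(scores) / len(scores)


# Toy corpus and labels, invented for the example.
docs = {
    "a": "memory recall latency benchmark",
    "b": "trust contract harness privacy",
    "c": "agent registry symmetry check",
}
labels = {"harness": ["b"], "registry": ["c"]}

conn = build_index(docs)
print(recall_at_k(conn, labels, k=5))
```

A real query set would need far more than two labelled queries before the resulting number says much about ranking quality.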

First run — 2026-06-10 (v0.1)

Six lanes measured live; the seventh, System, is the scorecard that synthesizes them. System verdict: PROCEED-WITH-REVISE.

Lane	Verdict	Headline number	Named weakness
Model	PROCEED	parity + Fable 3/Opus 2 on stress	no cross-family judge; no agentic task yet
Memory	REVISE	precision@10 = 0.20	the system's weakest number; no cross-session recall
Retrieval	PROCEED	recall@5 = 100% (n=10)	ceiling unearned on a 10-query set
Harness	PROCEED	34 pass / 0 fail / 7 todo	7 unmeasured risk dimensions
Substrate	PROCEED	green after catching a real orphan	symmetry suite is load-bearing — never skip it
Datasets	PROCEED	0 synthetic benchmarks	token-overlap ground truth is soft
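The "ceiling unearned" caveat on the Retrieval row can be made concrete with a confidence bound. The sketch below computes the lower end of a Wilson score interval for a pass rate. The choice of the Wilson interval and of a 95% level (z = 1.96) are assumptions made for illustration, not the method used by the retrieval eval, and `wilson_lower_bound` is a hypothetical helper name.

```python
import math


def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


# 10 hits out of 10 queries, as in the Retrieval row.
print(round(wilson_lower_bound(10, 10), 3))  # 0.722
```

Under these assumptions, a perfect score on ten queries is still compatible with a true recall@5 of roughly 72%, which is why the 100% figure is flagged rather than celebrated.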

The honest headline: on its very first run, the substrate lane caught this release's own new agent (starlight-evaluator) sitting unregistered in the agent registry. The lane flagged it, we fixed it in the same run, and the re-run came back green. The system policed its own author. That's the whole point: a Proving Ground that can't catch its own builder is theater.
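The orphan check described above can be sketched as a set comparison between the agents that are defined and the agents the registry lists. This is a minimal illustration, not the v-series symmetry suite: the function name `find_orphans`, the result keys, and the agent names other than starlight-evaluator are hypothetical.

```python
def find_orphans(defined_agents, registered_agents):
    """Compare defined agents against the registry in both directions.

    "unregistered" are agents that exist but are missing from the registry.
    "dangling" are registry entries that point at no defined agent.
    """
    defined = set(defined_agents)
    registered = set(registered_agents)
    return {
        "unregistered": sorted(defined - registered),
        "dangling": sorted(registered - defined),
    }


# Invented example mirroring the first-run catch.
defined = ["overseer", "starlight-evaluator"]
registered = ["overseer"]

report = find_orphans(defined, registered)
print(report["unregistered"])  # ['starlight-evaluator']

# A symmetry check passes only when both lists are empty.
is_green = not report["unregistered"] and not report["dangling"]
```

Checking both directions matters: a registry entry with no agent behind it is as much a symmetry break as an agent with no registry entry.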

The load-bearing weakness is memory: precision@10 = 0.20. Roughly two of every ten memories returned for a query are relevant. For a persistent-context layer, that is the gap between what it aspires to and what it delivers. It's published here, weakest-number-first, because a leaderboard you can't audit is just marketing with decimals.
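For readers unfamiliar with the metric, here is a minimal sketch of how precision@k and recall@k are conventionally computed for a single query. It uses the standard textbook definitions. Whether the memory bencher divides by k or by the number of results actually returned is not stated here, so treat that detail, the function names, and the memory ids as assumptions.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k returned items that are relevant."""
    relevant = set(relevant_ids)
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of all relevant items that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / len(relevant)


# Invented example: ten memories returned, two of them relevant.
returned = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9", "m10"]
relevant = ["m3", "m8", "m42"]

print(precision_at_k(returned, relevant, k=10))  # 0.2
```

A lane-level number such as 0.20 would typically be the mean of this per-query value over the whole query set.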

What gets tested next (the falsifiers)

Fire real embeddings (PARKED-007) and re-measure memory precision@10 — the falsifier for the entire memory-substrate thesis.
Add a cross-family judge (GPT-5 via OpenRouter) and re-run the model stress card — removes the only standing bias caveat.
Author a cross-session recall eval — the memory gap no current lane covers.

Anti-Goodhart

These numbers describe the system. They are not targets. The moment a metric becomes something to optimize for, it stops measuring and we retire it. Read the scorecard to understand the system's real shape, then go build your own and measure it honestly.

Queen: visual heart of the Proving Ground & Model Arena

The Queen (Starlight Orchestrator v0.2) is the continuous visual intelligence driving the entire evaluation surface. Every tick — whether in the seven-lane Proving Ground or the Model Arena stress cards — now produces first-class visual artifacts: routing heatmaps, synthesis panels, swarm fields, and attested ledger receipts. These are not illustrations. They are the living memory of measurement.

The public face of this discipline is the scroll experience at /queen: a production-grade React narrative using the same tall vertical ROUTE→MEASURE panels, the full v0.2 loop diagram, swarm-field backgrounds, and LEARN synthesis cards that the Queen herself generates and ledgers. The motion HTML in docs/queen-motion/ is the reference choreography; the site version is the canonical, always-live surface.

Visual composition is now load-bearing substrate. The weakest lane (memory precision@10) is visualized alongside the strongest. The Queen makes the invisible mechanics of self-advancement felt. Every Board verdict, every falsifier, every visual receipt closes the loop through her.

See the full visual loop at starlightintelligence.org/queen. The Proving Ground and Model Arena are no longer numbers alone; they are now witnessed.

Reproduce / fork

The discipline is a usage pattern plus a scorecard contract, not a black box. Spec, lane registry, evaluator disposition, and every scorecard live in the repo under tools/proving-ground/. The public mirror (starlight-evals) packages it for forking.
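As one illustration of what a scorecard contract might look like in a fork, the sketch below models a lane result and a rollup to a system verdict. It is not the spec under tools/proving-ground/: the `LaneResult` fields, the `system_verdict` function, and especially the rollup rule (all PROCEED gives PROCEED, a mix of PROCEED and REVISE gives PROCEED-WITH-REVISE) are assumptions inferred from the first-run table. Lane data is copied from that table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LaneResult:
    lane: str
    verdict: str       # "PROCEED" or "REVISE"
    headline: str      # the headline number, as published
    weakness: str      # every lane must name one


def system_verdict(results):
    """Roll lane verdicts up to one system verdict (assumed rule)."""
    if not results:
        raise ValueError("a scorecard needs at least one lane result")
    verdicts = {r.verdict for r in results}
    if verdicts == {"PROCEED"}:
        return "PROCEED"
    if verdicts == {"PROCEED", "REVISE"}:
        return "PROCEED-WITH-REVISE"
    return "REVISE"


FIRST_RUN = [
    LaneResult("Model", "PROCEED", "parity + Fable 3/Opus 2 on stress",
               "no cross-family judge; no agentic task yet"),
    LaneResult("Memory", "REVISE", "precision@10 = 0.20",
               "the system's weakest number; no cross-session recall"),
    LaneResult("Retrieval", "PROCEED", "recall@5 = 100% (n=10)",
               "ceiling unearned on a 10-query set"),
    LaneResult("Harness", "PROCEED", "34 pass / 0 fail / 7 todo",
               "7 unmeasured risk dimensions"),
    LaneResult("Substrate", "PROCEED", "green after catching a real orphan",
               "symmetry suite is load-bearing"),
    LaneResult("Datasets", "PROCEED", "0 synthetic benchmarks",
               "token-overlap ground truth is soft"),
]

print(system_verdict(FIRST_RUN))  # PROCEED-WITH-REVISE
```

Making `weakness` a required field is the design point: a lane result that cannot name its own weakness does not fit the contract.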

