Build Observatory — Applied AI campaign

A terminal for the systems built for this campaign — what each is, how it works, and where it stands. First system: the synthetic Data Factory.

Data Factory

design · no data yet

A general engine that manufactures high-quality synthetic data points, one methodically designed point per "lever pull", and routes each into a human-review queue. Each data type gets its own synthetic flow, but all flows share one pipeline shape. First product: an industry benchmark (type TBD).

Status: Design

Recipes: 2 defined

Next phase: P0 · GTFA lever

Generator: Claude Code /loop

Pipeline — one shape, per-type middle

Input Spec: governs one lever pull; defines the variety axes

→ Lever: type-specific gated loop (the only part that differs per type)

→ Candidate + receipts: the artifact plus its full trace

→ Review Queue: human verdicts feed reflection

→ Approved Dataset: export

Everything outside the Lever is shared and built once. Variety is enforced structurally: variety axes → a fresh coordinate per pull → a novelty gate against the corpus (so the LLM can't repeat itself).
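
The structural variety mechanism described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the axis names and values, the `fresh_coordinate` and `passes_novelty_gate` helpers, the token-overlap (Jaccard) similarity, and the 0.8 threshold are all hypothetical stand-ins, not the Data Factory's actual design.

```python
import itertools
import random

# Hypothetical variety axes; real ones would be defined by the Input Spec.
AXES = {
    "domain": ["finance", "logistics", "healthcare"],
    "difficulty": ["easy", "medium", "hard"],
    "format": ["numeric", "short-text"],
}


def fresh_coordinate(used, rng):
    """Pick one not-yet-used point in the axes' Cartesian product.

    Returns a dict mapping axis name to value, or None once every
    coordinate has been consumed.
    """
    names = sorted(AXES)
    remaining = [
        combo
        for combo in itertools.product(*(AXES[name] for name in names))
        if combo not in used
    ]
    if not remaining:
        return None
    choice = rng.choice(remaining)
    used.add(choice)
    return dict(zip(names, choice))


def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def passes_novelty_gate(candidate, corpus, threshold=0.8):
    """Reject a candidate that is too similar to anything already in the corpus."""
    return all(jaccard(candidate, prior) < threshold for prior in corpus)
```

Because coordinates are drawn without replacement, two pulls can never target the same cell of the axis grid; the novelty gate then catches near-duplicates that land in different cells but read alike. A production gate would more likely compare embeddings than raw tokens.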

Recipes (per-type flows)

prompt → ground-truth answer · P0

Verifiable answer, graded programmatically. Hard problem: is the reference correct?

sample → author → hermetic check → novelty → independent oracle → difficulty probe → final sanity → emit + receipts
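
One way to picture the gated loop in the stage chain above is a list of named gates run in order, each appending a receipt and stopping the pull at the first failure. The sketch below is a minimal illustration, not the actual lever: `Candidate`, `run_lever`, and both gate bodies are hypothetical, the prompts are assumed to be toy arithmetic of the form `a op b`, and only two of the listed stages are stubbed in.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    prompt: str
    answer: str
    receipts: list = field(default_factory=list)


# A gate inspects a candidate and returns (passed, note).
Gate = Callable[[Candidate], Tuple[bool, str]]

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def hermetic_check(c: Candidate) -> Tuple[bool, str]:
    # Stand-in check: the reference answer must parse as an integer.
    ok = c.answer.strip().lstrip("-").isdigit()
    return ok, ("answer parses as integer" if ok else "answer is not an integer")


def independent_oracle(c: Candidate) -> Tuple[bool, str]:
    # Stand-in oracle: recompute the answer without looking at the reference.
    left, op, right = c.prompt.split()
    expected = OPS[op](int(left), int(right))
    return str(expected) == c.answer.strip(), f"oracle computed {expected}"


def run_lever(c: Candidate, gates: List[Tuple[str, Gate]]) -> bool:
    """Run gates in order; record a receipt per gate; stop at the first failure."""
    for name, gate in gates:
        ok, note = gate(c)
        c.receipts.append({"gate": name, "passed": ok, "note": note})
        if not ok:
            return False
    return True


GATES = [("hermetic check", hermetic_check), ("independent oracle", independent_oracle)]
```

A rejected candidate still carries its receipts, so a reviewer can see which gate stopped it and why.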

prompt → output → rubric · P1

Per-task atomic rubric (binary + gating; types: instruction-following · outcome · process · grounding). Hard problem: is the rubric atomic, MECE, discriminating?

sample → author rubric → novelty → atomicity → MECE → inter-judge → gold/bad calibration → difficulty → final QC → emit + receipts
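
As a rough illustration of a "binary + gating" rubric, the sketch below models each criterion as a pass/fail item with one of the four listed types. The `Criterion` class, the `score` function, and in particular the scoring rule (any failed gating criterion zeroes the score, otherwise the score is the fraction passed) are assumptions made for this example; the source does not specify how gating affects the result.

```python
from dataclasses import dataclass
from typing import List

# The four criterion types named in the recipe.
KINDS = {"instruction-following", "outcome", "process", "grounding"}


@dataclass(frozen=True)
class Criterion:
    text: str
    kind: str
    gating: bool = False  # assumed meaning: failing this fails the whole task


def score(rubric: List[Criterion], verdicts: List[bool]) -> float:
    """Score one output against a rubric of binary criteria.

    Assumed rule (not from the source): a failed gating criterion gives 0.0;
    otherwise the score is the fraction of criteria passed.
    """
    if not rubric or len(rubric) != len(verdicts):
        raise ValueError("rubric and verdicts must be non-empty and the same length")
    for criterion in rubric:
        if criterion.kind not in KINDS:
            raise ValueError(f"unknown criterion type: {criterion.kind}")
    if any(c.gating and not passed for c, passed in zip(rubric, verdicts)):
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hypothetical rubric for a single task.
RUBRIC = [
    Criterion("Cites only the provided document", "grounding", gating=True),
    Criterion("Answer is under 100 words", "instruction-following"),
    Criterion("States the correct total", "outcome"),
    Criterion("Shows the intermediate sum", "process"),
]
```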

Roadmap

P0 · One lever, flat files: GTFA → JSONL (next)
P1 · Second type (rubric): tests the abstraction
P2 · Persistence: DB + backend API
P3 · Static review UI: visualize the queue
P4 · Interactive review: approve / reject / edit
P5 · Feedback loop: verdicts feed reflection
P6 · Headless productization (later)
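
For the P0 flat-file step, one plausible shape is an append-only JSONL file with one candidate and its receipts per line. The sketch below assumes that layout; the helper names and every field name (`type`, `prompt`, `answer`, `receipts`, `review_status`) are hypothetical, since the record schema is not defined here.

```python
import json
from pathlib import Path


def append_jsonl(path, record):
    """Append one record as a single JSON line."""
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")


def read_jsonl(path):
    """Load every non-blank line of a JSONL file as a dict."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def make_record(prompt, answer, receipts):
    # Hypothetical record layout; new candidates start out awaiting review.
    return {
        "type": "GTFA",
        "prompt": prompt,
        "answer": answer,
        "receipts": receipts,
        "review_status": "pending",
    }
```

Appending one line per lever pull keeps the file easy to diff and stream, and the same records could later be bulk-loaded into the database planned for P2.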

Live queue metrics surface here once P2 lands.

Grounded in prior art

Arena-Hard: separability ≈ difficulty gate
EvalGen: "criteria drift" ≈ review→reflection
Constitutional AI: principle-guided critique→revise