Applied AI Campaign · Observatory

Build Observatory

Observability into what's being built — beyond the prep guide.

A terminal for the systems built for this campaign — what each is, how it works, and where it stands. First system: the synthetic Data Factory.

Data Factory

design · no data yet

A general engine that manufactures high-quality synthetic data points — one methodically-designed point per "lever pull" — and routes each into a human-review queue. A different synthetic flow per data type shares one common pipeline shape. First product: an industry benchmark (type TBD).

Status: Design
Recipes: 2 defined
Next phase: P0 · GTFA lever
Generator: Claude Code /loop
Pipeline — one shape, per-type middle
Input Spec: governs one lever pull; defines the variety axes
Lever: type-specific gated loop, the only part that differs per type
Candidate + receipts: the artifact plus its full trace
Review Queue: human verdicts feed reflection
Approved Dataset: export
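
The pipeline shape above can be sketched as a few small types with the lever passed in as a plug-in. This is only an illustration under assumptions: the class names, field names, and the "approve" verdict string are hypothetical and do not come from the Data Factory design itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class InputSpec:
    """Governs one lever pull (field names are hypothetical)."""
    data_type: str
    variety_axes: dict[str, list[str]]


@dataclass
class Candidate:
    """The artifact plus its full trace (the "receipts")."""
    artifact: dict
    receipts: list[str] = field(default_factory=list)


# The lever is the only type-specific part: it turns a spec into a candidate,
# or returns None when one of its internal gates rejects the attempt.
Lever = Callable[[InputSpec], Optional[Candidate]]


def pull(spec: InputSpec, lever: Lever, review_queue: list[Candidate]) -> bool:
    """Shared step: run the lever once and enqueue any result for human review."""
    candidate = lever(spec)
    if candidate is None:
        return False
    review_queue.append(candidate)
    return True


def export_approved(review_queue: list[Candidate], verdicts: list[str]) -> list[Candidate]:
    """Shared step: keep only candidates whose (assumed) verdict is "approve"."""
    return [c for c, v in zip(review_queue, verdicts) if v == "approve"]
```

Adding a second data type would then mean writing a second lever function, while `pull` and `export_approved` stay unchanged.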

Everything outside the Lever is shared and built once. Variety is enforced structurally: variety axes → a fresh coordinate per pull → a novelty gate against the corpus (so the LLM can't repeat itself).
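
One way the structural variety chain could look in code is sketched below. It makes two assumptions not stated in the design: a "fresh" coordinate means a combination of axis values not yet used, and novelty is measured by token-set Jaccard similarity against the corpus with a hypothetical threshold of 0.8. A real gate might use embeddings or another metric.

```python
import itertools
import random


def fresh_coordinate(axes, used, rng):
    """Pick an unused combination of axis values, or None if all are used."""
    names = sorted(axes)
    unused = [
        combo
        for combo in itertools.product(*(axes[name] for name in names))
        if combo not in used
    ]
    if not unused:
        return None
    combo = rng.choice(unused)
    used.add(combo)  # remember it so the next pull lands elsewhere
    return dict(zip(names, combo))


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def passes_novelty_gate(candidate, corpus, max_similarity=0.8):
    """Accept a candidate only if it is not too similar to any corpus entry."""
    tokens = set(candidate.lower().split())
    return all(
        jaccard(tokens, set(prior.lower().split())) < max_similarity
        for prior in corpus
    )
```

The coordinate step spreads pulls across the axes before any text is generated; the novelty gate then catches near-duplicates that still slip through.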

Recipes (per-type flows)
prompt → ground-truth answer · P0
Verifiable answer, graded programmatically. Hard problem: is the reference correct?
sample → author → hermetic check → novelty → independent oracle → difficulty probe → final sanity → emit + receipts
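
A gated loop like the one above can be modelled as an ordered list of named checks that stops at the first failure and records a receipt for every step. The sketch below is a toy under assumptions: the gate bodies, the candidate fields (`a`, `b`, `reference`), and the receipt format are hypothetical, and only two of the listed gates are shown.

```python
from typing import Callable

Gate = Callable[[dict], bool]


def run_gates(candidate: dict, gates: list[tuple[str, Gate]]) -> tuple[bool, list[str]]:
    """Run gates in order; stop at the first failure and return the receipts."""
    receipts = []
    for name, gate in gates:
        ok = gate(candidate)
        receipts.append(f"{name}: {'pass' if ok else 'fail'}")
        if not ok:
            return False, receipts
    return True, receipts


def hermetic_check(candidate: dict) -> bool:
    """Toy check: the prompt and reference must both be non-empty."""
    return bool(candidate["prompt"].strip()) and bool(candidate["reference"].strip())


def independent_oracle(candidate: dict) -> bool:
    """Toy oracle: recompute the answer without looking at the authored reference."""
    return str(candidate["a"] * candidate["b"]) == candidate["reference"]


GATES = [
    ("hermetic check", hermetic_check),
    ("independent oracle", independent_oracle),
    # novelty, difficulty probe, and final sanity would be further entries here
]

candidate = {"prompt": "What is 17 * 3?", "a": 17, "b": 3, "reference": "51"}
accepted, receipts = run_gates(candidate, GATES)
```

Keeping the receipts for rejected candidates as well as accepted ones gives reviewers a trace of where each attempt stopped.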
prompt → output → rubric · P1
Per-task atomic rubric (binary + gating; types: instruction-following · outcome · process · grounding). Hard problem: is the rubric atomic, MECE (mutually exclusive, collectively exhaustive), and discriminating?
sample → author rubric → novelty → atomicity → MECE → inter-judge → gold/bad calibration → difficulty → final QC → emit + receipts
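
To make "binary + gating" concrete, here is a minimal scoring sketch. It assumes one particular reading that the source does not spell out: every criterion gets a pass/fail verdict, any failed gating criterion forces the score to zero, and otherwise the score is the fraction of non-gating criteria passed. The `Criterion` type and `score` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    text: str
    kind: str  # "instruction-following", "outcome", "process", or "grounding"
    gating: bool = False


def score(rubric: list[Criterion], verdicts: dict[str, bool]) -> float:
    """Score one output against a rubric of binary criteria.

    Assumed semantics: a failed gating criterion zeroes the score; otherwise
    the score is the share of non-gating criteria that passed.
    """
    if any(c.gating and not verdicts[c.text] for c in rubric):
        return 0.0
    scored = [c for c in rubric if not c.gating]
    if not scored:
        return 1.0
    return sum(verdicts[c.text] for c in scored) / len(scored)


rubric = [
    Criterion("Cites only the provided document", "grounding", gating=True),
    Criterion("Answers in exactly three bullet points", "instruction-following"),
    Criterion("States the correct release year", "outcome"),
]
```

Under this reading, the gold/bad calibration step could be approximated by checking that a known-good output scores high and a known-bad output scores low on the same rubric.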
Roadmap
  1. P0 · One lever, flat files: GTFA → JSONL · next
  2. P1 · Second type (rubric): tests the abstraction
  3. P2 · Persistence: DB + backend API
  4. P3 · Static review UI: visualize the queue
  5. P4 · Interactive review: approve / reject / edit
  6. P5 · Feedback loop: verdicts feed reflection
  7. P6 · Headless productization (later)
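
For the P0 flat-file stage, emitting candidates as JSONL could be as small as the sketch below, which writes one JSON object per line. The record layout (`artifact` and `receipts` keys) is an assumption for illustration, not a schema from the design.

```python
import io
import json


def write_jsonl(records, stream):
    """Write each record as one JSON object per line; return the count written."""
    count = 0
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_jsonl(stream):
    """Read records back, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


# Hypothetical record layout: the artifact plus its receipts.
records = [
    {
        "artifact": {"prompt": "What is 17 * 3?", "reference": "51"},
        "receipts": ["hermetic check: pass", "independent oracle: pass"],
    }
]

buffer = io.StringIO()
written = write_jsonl(records, buffer)
buffer.seek(0)
round_tripped = read_jsonl(buffer)
```

Because each line is an independent record, the file can be appended to after every pull, which suits a flat-file stage that has no database yet.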

Live queue metrics surface here once P2 lands.

Grounded in prior art
Arena-Hard: separability ≈ difficulty gate
EvalGen: "criteria drift" ≈ review→reflection
Constitutional AI: principle-guided critique→revise