Starling Strategy × NCTQ

Building an AI System
for District Policy Analysis

Lessons learned processing 11,700+ documents across 148 districts — and what it taught us about making AI reliable at scale.

15-Prediction Consensus · Pydantic-First Architecture · LLM-Agnostic Design
STARLING STRATEGY
Setting the Frame

What AI Can and Cannot Do Today

Where AI Excels

Extracting and reasoning over information when provided with well-structured workflows, analyst guidance, and clear examples.
Questions with pre-specified response sets — checkbox questions reach 98.5% accuracy.
Consistent, tireless processing: 11,700+ documents across 148 districts, 24/7 without fatigue.

Where AI Still Struggles

Ambiguous cases that take experienced analysts time — contract nuances, implicit policy, cross-reference logic.
Salary schedules (9–14% accuracy), complex tables, and multi-document synthesis across 960K-character CBAs.
The remaining ~30% gap is solvable with retrieval + reasoning improvements — not a fundamental AI limitation.
Building on a Moving Landscape

What Changed, What We Built, and Why It Mattered

What the world shipped · What we built · Why it mattered

Shipped: Gemini 1.5 Pro — 1M context. Full documents in a single prompt.
Built: Full-document prediction pipeline. Tested Llama, Claude, and Gemini on contracts; Gemini won on cost and quality.
Why: Proved the approach before investing in retrieval infrastructure. Momentum over perfection.

Shipped: PydanticAI + Docling (IBM). Type-safe LLM output; universal document extraction with OCR.
Built: Document pipeline + schema-as-prompt. 11,700 docs; format errors fell from 15–20% to under 1%. Replaced Marker.
Why: The data model became the instruction set. LLM-agnostic — swap models without rewriting.

Shipped: Vespa Cloud hybrid search. Perplexity's engine; keyword + semantic in one query.
Built: K-diversity voting (15 predictions). Retrieve once, slice by depth. Nathan's PiedPiper research.
Why: Consensus replaced confidence. Agreement among runs tracks real accuracy.

Shipped: Gemini 2.5 Flash. Faster, cheaper — "Don't need a Ferrari to take the kids to school."
Built: Production: 148 districts × 112 questions. 807 analyst reviews, Katherine's rule, citations, an improvement loop.
Why: 15 predictions per question at scale. The architecture, not the model, is the investment.
System Architecture

Three Pipelines, One Database,
One Dashboard

Document Pipeline

Downloads district policy files, extracts text via Docling, enriches with AI metadata. Handles PDFs, Word, HTML, and scanned images via OCR.
11,700+ Documents

Prediction Pipeline

Retrieves passages via hybrid search, generates 15 AI predictions per question, synthesizes consensus through voting, evaluates against golden answers.
15-Run Consensus

NCTQ.ai Dashboard

Unified analyst interface for reviewing AI suggestions, exploring documents, managing workflows, and conversational policy Q&A.
5 Active Reviewers
Shared infrastructure: PostgreSQL Bronze/Silver · Vespa Cloud hybrid search · LLM-agnostic · dry-run by default
Document Pipeline

Turning 11,700 Messy Files
into Clean, Searchable Text

Document ingestion is non-trivial infrastructure work, not just LLM prompting. We convert PDFs, Word docs, HTML pages, and scanned images into clean, structured text.

"Analysts cannot review documents the AI has never read."
1. Docling (IBM) — one tool for PDFs, Word, HTML, OCR, and table extraction
2. Parallel downloads across 4 workers; serial extraction (ML models too large to duplicate)
3. Smart failure classification — reprocess only fixable failures
4. AI enrichment labels document type, years, readability, and confidence
Tool Evolution
Marker → Docling: Finding the right tool replaced weeks of custom work.
Processing Stats
99.7% of documents extracted
8 canonical document types
1–3s enrichment, 8-way concurrency
Design Philosophy

The Data Model
Is the Prompt

The AI can't get in the door unless it satisfies the requirements — type hints and validation replace prose instructions.

class PredictionOutput(BaseModel):
    """The AI must fill every field — the schema IS the instruction."""

    predicted_answer: str = Field(
        description="The answer. Use 'INA' if the evidence"
                    " does not address the question."
                    " Silence is NEVER 'No'."
    )
    confidence: float = Field(
        ge=0.0, le=1.0,
        description="Confidence in the answer"
    )
    reasoning: str = Field(
        min_length=50,
        description="Step-by-step reasoning citing evidence"
    )
    document_index: Literal[1, 2, ..., k] = Field(
        description="Which passage contains the answer"
    )
    evidence_agreement: Literal[
        "strong", "partial", "no_evidence"
    ]

# Pydantic validates EVERY response automatically.

Schema Over Prose

Field descriptions become AI instructions. ge=0.0, le=1.0 constrains outputs. Format errors dropped from 15–20% to under 1%.
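The boundary check is easy to see in miniature. A sketch with plain Pydantic — field names mirror the PredictionOutput schema, but this is an illustration of the validation gate, not the production model:

```python
# Minimal sketch of schema-as-prompt validation: constraints on the
# fields, not prose instructions, decide what gets into the database.
from pydantic import BaseModel, Field, ValidationError

class Prediction(BaseModel):
    predicted_answer: str = Field(description="Use 'INA' if evidence is silent.")
    confidence: float = Field(ge=0.0, le=1.0)   # numeric range enforced
    reasoning: str = Field(min_length=50)        # no one-word rationales

good = Prediction(
    predicted_answer="INA",
    confidence=0.8,
    reasoning="The retrieved passages discuss calendars only and never "
              "mention the stipend in question, so the answer is INA.",
)

failed_fields: set = set()
try:
    # Out-of-range confidence AND too-short reasoning: both rejected
    # before this response can enter the database.
    Prediction(predicted_answer="Yes", confidence=1.7, reasoning="too short")
except ValidationError as e:
    failed_fields = {err["loc"][0] for err in e.errors()}
```

Every malformed response fails loudly at the boundary, which is why format errors dropped from 15–20% to under 1%.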

LLM-Agnostic by Design

Built on PydanticAI — MIT-licensed, open source, works with any LLM. Swap models without rewriting the system.

NCTQ Expertise Baked In

8 document types, answer options, coding guidance, and domain terminology from NCTQ's classification system. Human expertise constrains AI output.

Prediction Pipeline

15 Predictions, One Answer

We ask 15 AI agents to answer each question — each with a different slice of evidence. Then we take a vote. This mirrors NCTQ's own process of multiple analysts and fact-checkers.

15
Prediction runs per question, each at a different evidence depth
15×
Search cost reduction — retrieve passages once, slice by depth
9
INA override threshold — 60% consensus forces INA
<1×
Net cost vs single expensive model — small models in parallel
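The 15× search saving comes from one retrieval call feeding all 15 runs. A sketch of the slicing, with illustrative depth values (the production depth schedule may differ):

```python
# "Retrieve once, slice by depth": one hybrid-search call returns the
# ranked passages; each prediction run sees a different prefix of them.

def evidence_slices(ranked_passages: list[str], n_runs: int = 15) -> list[list[str]]:
    """Run i gets the top i passages — 15 evidence contexts for one search."""
    max_depth = min(n_runs, len(ranked_passages))
    return [ranked_passages[:depth] for depth in range(1, max_depth + 1)]

passages = [f"passage_{i}" for i in range(1, 21)]  # one retrieval call
slices = evidence_slices(passages)                 # 15 evidence contexts
```

Varying the evidence depth is what gives the 15 runs genuinely different views of the contract, so their agreement carries signal.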

Why Consensus, Not Single-Shot?

LLMs hallucinate and tell people what they want to hear. Running 15 predictions with variance overcomes these failure modes through iteration — the same principle behind having multiple analysts review the same answer. When all 15 agree, accuracy is 58.7%. When they disagree, it drops to 18.7%.

The Vote Is Pure Math

No AI involved in synthesis. Plurality wins, unless 9 or more runs say INA — then INA overrides everything. Agreement among 15 runs correlates with real accuracy far better than any model's self-reported confidence score.
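The whole synthesis step fits in a few lines. A sketch of the vote — plurality with the hard INA override, no model in the loop (the vote values are illustrative):

```python
# Pure-math consensus over 15 runs: plurality wins, unless 9 or more
# runs say INA — then INA is final. No LLM involved in synthesis.
from collections import Counter

INA_THRESHOLD = 9  # 60% of 15 runs

def synthesize(predictions: list[str]) -> str:
    counts = Counter(predictions)
    if counts.get("INA", 0) >= INA_THRESHOLD:
        return "INA"                      # hard override: silence wins
    return counts.most_common(1)[0][0]    # otherwise plurality wins

split_vote = ["$52,000"] * 7 + ["$54,000"] * 2 + ["INA"] * 6
ina_heavy = ["Yes"] * 6 + ["INA"] * 9
```

Because the synthesis is deterministic, the same 15 predictions always produce the same answer — auditable in a way a second LLM call never is.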

The Most Important Rule

"If It's Silent, It's INA. Always."

AI models infer "No" from silence. We fight that at every layer. Absence of evidence is not evidence of absence.

Layer 1 — System Prompt: INA rule stated before any question
Layer 2 — Field Descriptions: Every output field includes the rule
Layer 3 — Request Footer: Placed last, where models weigh most
Layer 4 — INA Verification: Dedicated check when prediction is INA
Layer 5 — Hard Override: 9/15 say INA → INA is final. No exceptions.

INA Is Valuable Information

Knowing which districts don't address topics in contracts is meaningful information, not a failure.

Katherine Violations = #1 Metric

Predicting a value when the truth is INA is the primary failure metric — not overall accuracy.

Audit, Never Auto-Correct

When AI says INA but last year had a real answer, we log an audit note. We never auto-correct.

Trust & Provenance

Citations as a Separate Primitive

Citations are facts — objective truths that can be deterministically audited. We separate them from predictions to build trust.

1. Evidence Identified First

Passages retrieved before the LLM answers. Can only cite pre-identified evidence.

2. Structured Output

Citations in JSON — document ID, passage index, quoted text. Each validated against source.

3. Separate Validation

Batch validation independent of predictions. Missing citations held for review.

4. Dashboard Visibility

Citations inline. Click through to source. Missing-citation warnings.

The "Mise en Place" Approach

All evidence identified, structured, and validated before AI reasons over it.

Result: Source → passage → citation → prediction → review.

Anti-Hallucination
  • ✓ Constrained indices
  • ✓ Quote verification
  • ✓ Missing-citation holds
  • ✓ Quality tracking
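Because citations are facts, the checks are deterministic. A sketch of the two core validations — field names here are illustrative, not the production schema:

```python
# Deterministic citation checks: the quoted text must appear verbatim in
# the cited passage, and the index must be one the retriever supplied.

def validate_citation(citation: dict, passages: list[str]) -> list[str]:
    problems = []
    idx = citation.get("passage_index")
    if not isinstance(idx, int) or not (0 <= idx < len(passages)):
        problems.append("index_out_of_range")   # cited evidence never retrieved
    elif citation.get("quoted_text", "") not in passages[idx]:
        problems.append("quote_not_found")      # quote doesn't match the source
    return problems  # empty = citation passes; otherwise hold for review

passages = ["Teachers receive a $1,500 stipend for mentoring."]
ok = validate_citation({"passage_index": 0, "quoted_text": "$1,500 stipend"}, passages)
bad = validate_citation({"passage_index": 0, "quoted_text": "$2,000 bonus"}, passages)
```

No judgment call is needed anywhere in this path: either the quote exists in the retrieved passage or the citation is held.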
Retrieval Infrastructure

Hybrid Search — Finding
Meaning, Not Just Keywords

Why Hybrid?

Lexical search finds exact terms. Semantic search finds meaning. Neither alone is sufficient for policy documents.

Lexical: Search "salary schedule" — finds exact title matches, misses "Compensation Guide" or "Pay Scale"
Semantic: Same query also finds "pay schedule," "compensation table," "teacher pay rates"
Hybrid: Both in one query — precise when terms match, flexible when they don't
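Vespa performs this fusion inside a single query; as a standalone illustration of why merging the two rankings beats either alone, here is reciprocal rank fusion (RRF), a common fusion technique — not necessarily the exact ranking Vespa runs in production:

```python
# Schematic hybrid retrieval via reciprocal rank fusion: each document
# scores the sum of 1/(k + rank) across the lexical and semantic lists,
# so a document found by BOTH searches rises to the top.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["salary_schedule.pdf", "handbook.pdf"]             # exact-term hits
semantic = ["compensation_guide.pdf", "salary_schedule.pdf"]  # meaning hits
fused = rrf([lexical, semantic])
```

The document both searches agree on outranks documents only one search found — precise when terms match, flexible when they don't.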

Why Vespa Cloud?

The same search engine Perplexity uses to search the entire internet. Handles hybrid search, text embeddings, and district filtering in a single query.

Enriched Query Construction

Query Full question + parent context
Focus NCTQ domain terms & target sections
Boost Salary/calendar docs promoted for relevant Qs
Scope District + academic year filtering

Offline Fallback

When Vespa is unavailable, the system falls back to PostgreSQL — returning documents ranked by AI confidence. Pragmatic, not ideal.

Data Architecture

Shared Database, Clean Lineage

All data flows through PostgreSQL in two layers — a clean separation between what humans curate and what AI produces.

Bronze Layer — Human-Curated
Ground truth AI never overwrites: district info, document links, academic years, policy definitions, golden answers.
Silver Layer — AI-Produced
Enriched documents, predictions, suggested answers, evidence citations, evaluation results. Tagged by pipeline source.

Idempotent Writes

Upserts with deterministic keys — safe to rerun at any time.
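A sketch of the pattern using SQLite's `ON CONFLICT` upsert, which mirrors PostgreSQL's syntax (production runs on PostgreSQL; table and column names here are illustrative):

```python
# Idempotent writes: a deterministic primary key plus an upsert means
# rerunning the pipeline overwrites the same row instead of duplicating it.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE predictions (
        district_id TEXT, question_id TEXT, run_version INTEGER,
        answer TEXT,
        PRIMARY KEY (district_id, question_id, run_version)
    )
""")

def upsert(district: str, question: str, version: int, answer: str) -> None:
    db.execute(
        """INSERT INTO predictions VALUES (?, ?, ?, ?)
           ON CONFLICT(district_id, question_id, run_version)
           DO UPDATE SET answer = excluded.answer""",
        (district, question, version, answer),
    )

upsert("d148", "q7", 1, "INA")
upsert("d148", "q7", 1, "Yes")  # safe rerun: still exactly one row
```

The deterministic key is what makes "safe to rerun at any time" true: the second write lands on the first row rather than beside it.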

Schema Isolation

Dashboard auth in its own schema. Analytical data and user management never collide.

Failure Classification

Migrations reclassify failures into actionable buckets — dead links, corrupted files, blocked access.

Append-Only Predictions

Version-tracked. Analyst-approved answers "stick" through re-runs via sticky view ordering.

Measuring What Matters

Evaluation Framework & Current Performance

6-way classification treating INA as a first-class outcome:
Exact Match: AI matches ground truth
INA Correct: Correctly identified as not addressed
INA False Pos: Said INA but an answer exists
Katherine Violation: Gave an answer when the truth is INA
Value Different: Right idea, wrong value
No Golden: No ground truth available

Lesson: Overfitting

Should have integrated golden answers from additional districts earlier. Initial smaller sets caused overfitting.

56%
Overall accuracy
up from 45.3%
98.5%
Checkbox accuracy
93.8%
INA detection
~40%
Analyst acceptance

Multi-Stage Evaluation

Retrieval recall, per-depth correctness, synthesis impact, citation quality. Regression snapshots diff accuracy per question across changes.
NCTQ.ai Dashboard

Where Analysts and AI Meet

One interface for reviewing AI suggestions, exploring documents, asking policy questions, and managing the review workflow.

Metric Calculator

Core review: districts → policies → questions with AI recommendations, citations, and one-click approve/reject.

Documents

Search and explore 11,700+ processed policy documents with status badges and extraction details.

Policy Advisor

Conversational AI for open-ended policy questions with cited answers.

Admin & Auth

Passwordless email auth. Role-based access: viewer, analyst, power_user, admin.

Technology

FastHTML + MonsterUI + HTMX. Fast server-rendered pages, no heavy frontend framework.

Review Stats

807 reviews
5 reviewers
341 approved / 466 rejected
Reflections

What We Learned

Build Structures Invariant to LLM Progress

Frameworks that accept better models, not systems that compete with state-of-the-art. Swapping Gemini 1.5 to 2.5 Flash was a config change, not a rewrite.

Data Structures as Constraints

Type-safe structures validate outputs at the boundary. The AI can't get in the door unless it satisfies requirements. More reliable than any prompt alone.

Consensus Through Iteration

15 cheap runs beat one expensive run. Same principle as NCTQ's human process: multiple analysts reviewing the same question.

The Harness Is the Hard Part

The real challenge isn't AI — it's infrastructure connecting LLMs to tools and data safely at scale. The harness is the moat.

Robust Internal Benchmarking

Automated research loops: identify failure modes, test fixes, measure impact — all before touching production.

From Extraction to Reasoning

AI evolved from extracting to reasoning — when given the right workflow and constraints. Next-gen reasoning models + agentic processing will narrow the remaining gap.

Looking Ahead

What Comes Next

Immediate: Launch

District Policy Pathfinder (2024–2025 data), website content, and Compass chatbot.

June: Scale to 500 Districts

Expand from 148 to 500 districts. LLM-agnostic architecture means adopting newer models without structural changes.

Ongoing: Close the Loop

Write approved answers to the golden table. 94% of reviews have no ground truth — every approval expands evaluation coverage.

Priority Improvement Areas

1. Fix retrieval, not prompts — 75% of errors are retrieval gaps
2. Salary schedule parsing — 9–14% accuracy, the biggest opportunity
3. CBA section extraction — 960K-character contracts hold most answers
4. Add an N/A category — fixes ~19% of rejections
5. Inject analyst corrections — 466 notes as signal
"The remaining 30% gap is solvable with retrieval, reasoning, and agentic improvements — not a fundamental AI limitation."
Starling Strategy × NCTQ

Thank You

148 districts. 112 questions. 11,700 documents.
One system built to improve with every model generation.

Nathan Roll — Stanford NLP
Macon Phillips — Starling Strategy
56%
Current Accuracy
98.5%
Checkbox Accuracy
93.8%
INA Detection