Starling Strategy × NCTQ

Building an AI System
for District Policy Analysis

Lessons learned processing 11,700+ documents across 148 districts — and what it taught us about making AI reliable at scale.

15-Prediction Consensus · Pydantic-First Architecture · LLM-Agnostic Design
STARLING STRATEGY
Setting the Frame

What AI Can and Cannot Do Today

Where AI Excels

Extracting and reasoning over information when provided with well-structured workflows, analyst guidance, and clear examples.
Questions with pre-specified response sets — checkbox questions reach 98.5% accuracy.
Consistent, tireless processing: 11,700+ documents across 148 districts, 24/7 without fatigue.

Where AI Still Struggles

Ambiguous cases that take experienced analysts time — contract nuances, implicit policy, cross-reference logic.
Salary schedules (9–14% accuracy), complex tables, and multi-document synthesis across 960K-character CBAs.
The remaining ~30% gap is solvable with retrieval + reasoning improvements — not a fundamental AI limitation.
Building on a Moving Landscape

What Changed, What We Built, and Why It Mattered

What the world shipped · What we built · Why it mattered

Shipped: Gemini 1.5 Pro — 1M context. Full documents in a single prompt.
Built: Full-document prediction pipeline. Tested Llama, Claude, and Gemini on contracts; Gemini won on cost and quality.
Why: Proved the approach before investing in retrieval infrastructure. Momentum over perfection.

Shipped: PydanticAI + Docling (IBM). Type-safe LLM output; universal document extraction with OCR.
Built: Document pipeline + schema-as-prompt. 11,700 docs; format errors fell from 15–20% to under 1%. Replaced Marker.
Why: The data model became the instruction set. LLM-agnostic — swap models without rewriting.

Shipped: Vespa Cloud hybrid search. Perplexity's engine; keyword + semantic in one query.
Built: K-diversity voting (15 predictions). Retrieve once, slice by depth. Nathan's PiedPiper research.
Why: Consensus replaced confidence. Agreement among runs tracks real accuracy.

Shipped: Gemini 2.5 Flash. Faster, cheaper — "Don't need a Ferrari to take the kids to school."
Built: Production: 148 districts × 112 questions. 807 analyst reviews, Katherine's rule, citations, an improvement loop.
Why: 15 predictions per question at scale. The architecture, not the model, is the investment.
System Architecture

Three Pipelines, One Database,
One Dashboard

Document Pipeline

Downloads district policy files, extracts text via Docling, enriches with AI metadata. Handles PDFs, Word, HTML, and scanned images via OCR.
11,700+ Documents

Prediction Pipeline

Retrieves passages via hybrid search, generates 15 AI predictions per question, synthesizes consensus through voting, evaluates against golden answers.
15-Run Consensus

NCTQ.ai Dashboard

Unified analyst interface for reviewing AI suggestions, exploring documents, managing workflows, and conversational policy Q&A.
5 Active Reviewers
Shared infrastructure: PostgreSQL Bronze/Silver · Vespa Cloud hybrid search · LLM-agnostic · dry-run by default
Document Pipeline

Turning 11,700 Messy Files
into Clean, Searchable Text

Document ingestion is non-trivial infrastructure work, not just LLM prompting. We convert PDFs, Word docs, HTML pages, and scanned images into clean, structured text.

"Analysts cannot review documents the AI has never read."
1. Docling (IBM) — one tool for PDFs, Word, HTML, OCR, and table extraction
2. Parallel downloads across 4 workers; serial extraction (ML models too large to duplicate)
3. Smart failure classification — reprocess only fixable failures
4. AI enrichment labels document type, years, readability, and confidence
Tool Evolution
Marker → Docling: Finding the right tool replaced weeks of custom work.
Processing Stats
99.7% of documents extracted
8 canonical document types
1–3s enrichment, 8-way concurrency
Design Philosophy

The Data Model
Is the Prompt

The AI can't get in the door unless it satisfies the requirements — type hints and validation replace prose instructions.

class PredictionOutput(BaseModel):
    """The AI must fill every field — the schema IS the instruction."""

    predicted_answer: str = Field(
        description="The answer. Use 'INA' if the evidence"
                    " does not address the question."
                    " Silence is NEVER 'No'."
    )
    confidence: float = Field(
        ge=0.0, le=1.0,
        description="Confidence in the answer"
    )
    reasoning: str = Field(
        min_length=50,
        description="Step-by-step reasoning citing evidence"
    )
    document_index: Literal[1, 2, ..., k] = Field(
        description="Which passage contains the answer"
    )
    evidence_agreement: Literal[
        "strong", "partial", "no_evidence"
    ]

# Pydantic validates EVERY response automatically.

Schema Over Prose

Field descriptions become AI instructions. ge=0.0, le=1.0 constrains outputs. Format errors dropped from 15–20% to under 1%.
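The boundary check is easy to see in miniature. A sketch with plain Pydantic — field names mirror the PredictionOutput schema, but this is an illustration of the validation gate, not the production model:

```python
# Minimal sketch of schema-as-prompt validation: constraints on the
# fields, not prose instructions, decide what gets into the database.
from pydantic import BaseModel, Field, ValidationError

class Prediction(BaseModel):
    predicted_answer: str = Field(description="Use 'INA' if evidence is silent.")
    confidence: float = Field(ge=0.0, le=1.0)   # numeric range enforced
    reasoning: str = Field(min_length=50)        # no one-word rationales

good = Prediction(
    predicted_answer="INA",
    confidence=0.8,
    reasoning="The retrieved passages discuss calendars only and never "
              "mention the stipend in question, so the answer is INA.",
)

failed_fields: set = set()
try:
    # Out-of-range confidence AND too-short reasoning: both rejected
    # before this response can enter the database.
    Prediction(predicted_answer="Yes", confidence=1.7, reasoning="too short")
except ValidationError as e:
    failed_fields = {err["loc"][0] for err in e.errors()}
```

Every malformed response fails loudly at the boundary, which is why format errors dropped from 15–20% to under 1%.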

LLM-Agnostic by Design

Built on PydanticAI — MIT-licensed, open source, works with any LLM. Swap models without rewriting the system.

NCTQ Expertise Baked In

8 document types, answer options, coding guidance, and domain terminology from NCTQ's classification system. Human expertise constrains AI output.

Prediction Pipeline

15 Predictions, One Answer

We ask 15 AI agents to answer each question — each with a different slice of evidence. Then we take a vote. This mirrors NCTQ's own process of multiple analysts and fact-checkers.

15
Prediction runs per question, each at a different evidence depth
15×
Search cost reduction — retrieve passages once, slice by depth
9
INA override threshold — 60% consensus forces INA
<1×
Net cost vs single expensive model — small models in parallel
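The 15× search saving comes from one retrieval call feeding all 15 runs. A sketch of the slicing, with illustrative depth values (the production depth schedule may differ):

```python
# "Retrieve once, slice by depth": one hybrid-search call returns the
# ranked passages; each prediction run sees a different prefix of them.

def evidence_slices(ranked_passages: list[str], n_runs: int = 15) -> list[list[str]]:
    """Run i gets the top i passages — 15 evidence contexts for one search."""
    max_depth = min(n_runs, len(ranked_passages))
    return [ranked_passages[:depth] for depth in range(1, max_depth + 1)]

passages = [f"passage_{i}" for i in range(1, 21)]  # one retrieval call
slices = evidence_slices(passages)                 # 15 evidence contexts
```

Varying the evidence depth is what gives the 15 runs genuinely different views of the contract, so their agreement carries signal.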

Why Consensus, Not Single-Shot?

LLMs hallucinate and tell people what they want to hear. Running 15 predictions with variance overcomes these failure modes through iteration — the same principle behind having multiple analysts review the same answer. When all 15 agree, accuracy is 58.7%. When they disagree, it drops to 18.7%.

The Vote Is Pure Math

No AI involved in synthesis. Plurality wins, unless 9 or more runs say INA — then INA overrides everything. Agreement among 15 runs correlates with real accuracy far better than any model's self-reported confidence score.
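The whole synthesis step fits in a few lines. A sketch of the vote — plurality with the hard INA override, no model in the loop (the vote values are illustrative):

```python
# Pure-math consensus over 15 runs: plurality wins, unless 9 or more
# runs say INA — then INA is final. No LLM involved in synthesis.
from collections import Counter

INA_THRESHOLD = 9  # 60% of 15 runs

def synthesize(predictions: list[str]) -> str:
    counts = Counter(predictions)
    if counts.get("INA", 0) >= INA_THRESHOLD:
        return "INA"                      # hard override: silence wins
    return counts.most_common(1)[0][0]    # otherwise plurality wins

split_vote = ["$52,000"] * 7 + ["$54,000"] * 2 + ["INA"] * 6
ina_heavy = ["Yes"] * 6 + ["INA"] * 9
```

Because the synthesis is deterministic, the same 15 predictions always produce the same answer — auditable in a way a second LLM call never is.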

The Most Important Rule

"If It's Silent, It's INA. Always."

AI models infer "No" from silence. We fight that at every layer. Absence of evidence is not evidence of absence.

Layer 1 — System Prompt: INA rule stated before any question
Layer 2 — Field Descriptions: Every output field includes the rule
Layer 3 — Request Footer: Placed last, where models weigh most
Layer 4 — INA Verification: Dedicated check when prediction is INA
Layer 5 — Hard Override: 9/15 say INA → INA is final. No exceptions.

INA Is Valuable Information

Knowing which districts don't address topics in contracts is meaningful information, not a failure.

Katherine Violations = #1 Metric

Predicting a value when the truth is INA is the primary failure metric — not overall accuracy.

Audit, Never Auto-Correct

When AI says INA but last year had a real answer, we log an audit note. We never auto-correct.

Trust & Provenance

Citations as a Separate Primitive

Citations are facts — objective truths that can be deterministically audited. We separate them from predictions to build trust.

1. Evidence Identified First

Passages retrieved before the LLM answers. Can only cite pre-identified evidence.

2. Structured Output

Citations in JSON — document ID, passage index, quoted text. Each validated against source.

3. Separate Validation

Batch validation independent of predictions. Missing citations held for review.

4. Dashboard Visibility

Citations inline. Click through to source. Missing-citation warnings.

The "Mise en Place" Approach

All evidence identified, structured, and validated before AI reasons over it.

Result: Source → passage → citation → prediction → review.

Anti-Hallucination
  • ✓ Constrained indices
  • ✓ Quote verification
  • ✓ Missing-citation holds
  • ✓ Quality tracking
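Because citations are facts, the checks are deterministic. A sketch of the two core validations — field names here are illustrative, not the production schema:

```python
# Deterministic citation checks: the quoted text must appear verbatim in
# the cited passage, and the index must be one the retriever supplied.

def validate_citation(citation: dict, passages: list[str]) -> list[str]:
    problems = []
    idx = citation.get("passage_index")
    if not isinstance(idx, int) or not (0 <= idx < len(passages)):
        problems.append("index_out_of_range")   # cited evidence never retrieved
    elif citation.get("quoted_text", "") not in passages[idx]:
        problems.append("quote_not_found")      # quote doesn't match the source
    return problems  # empty = citation passes; otherwise hold for review

passages = ["Teachers receive a $1,500 stipend for mentoring."]
ok = validate_citation({"passage_index": 0, "quoted_text": "$1,500 stipend"}, passages)
bad = validate_citation({"passage_index": 0, "quoted_text": "$2,000 bonus"}, passages)
```

No judgment call is needed anywhere in this path: either the quote exists in the retrieved passage or the citation is held.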
Retrieval Infrastructure

Hybrid Search — Finding
Meaning, Not Just Keywords

Why Hybrid?

Lexical search finds exact terms. Semantic search finds meaning. Neither alone is sufficient for policy documents.

Lexical: Search "salary schedule" — finds exact title matches, misses "Compensation Guide" or "Pay Scale"
Semantic: Same query also finds "pay schedule," "compensation table," "teacher pay rates"
Hybrid: Both in one query — precise when terms match, flexible when they don't
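Vespa performs this fusion inside a single query; as a standalone illustration of why merging the two rankings beats either alone, here is reciprocal rank fusion (RRF), a common fusion technique — not necessarily the exact ranking Vespa runs in production:

```python
# Schematic hybrid retrieval via reciprocal rank fusion: each document
# scores the sum of 1/(k + rank) across the lexical and semantic lists,
# so a document found by BOTH searches rises to the top.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["salary_schedule.pdf", "handbook.pdf"]             # exact-term hits
semantic = ["compensation_guide.pdf", "salary_schedule.pdf"]  # meaning hits
fused = rrf([lexical, semantic])
```

The document both searches agree on outranks documents only one search found — precise when terms match, flexible when they don't.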

Why Vespa Cloud?

The same search engine Perplexity uses to search the entire internet. Handles hybrid search, text embeddings, and district filtering in a single query.

Enriched Query Construction

Query Full question + parent context
Focus NCTQ domain terms & target sections
Boost Salary/calendar docs promoted for relevant Qs
Scope District + academic year filtering

Offline Fallback

When Vespa is unavailable, the system falls back to PostgreSQL — returning documents ranked by AI confidence. Pragmatic, not ideal.

Data Architecture

Shared Database, Clean Lineage

All data flows through PostgreSQL in two layers — a clean separation between what humans curate and what AI produces.

Bronze Layer — Human-Curated
Ground truth AI never overwrites: district info, document links, academic years, policy definitions, golden answers.
Silver Layer — AI-Produced
Enriched documents, predictions, suggested answers, evidence citations, evaluation results. Tagged by pipeline source.

Idempotent Writes

Upserts with deterministic keys — safe to rerun at any time.
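A sketch of the pattern using SQLite's `ON CONFLICT` upsert, which mirrors PostgreSQL's syntax (production runs on PostgreSQL; table and column names here are illustrative):

```python
# Idempotent writes: a deterministic primary key plus an upsert means
# rerunning the pipeline overwrites the same row instead of duplicating it.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE predictions (
        district_id TEXT, question_id TEXT, run_version INTEGER,
        answer TEXT,
        PRIMARY KEY (district_id, question_id, run_version)
    )
""")

def upsert(district: str, question: str, version: int, answer: str) -> None:
    db.execute(
        """INSERT INTO predictions VALUES (?, ?, ?, ?)
           ON CONFLICT(district_id, question_id, run_version)
           DO UPDATE SET answer = excluded.answer""",
        (district, question, version, answer),
    )

upsert("d148", "q7", 1, "INA")
upsert("d148", "q7", 1, "Yes")  # safe rerun: still exactly one row
```

The deterministic key is what makes "safe to rerun at any time" true: the second write lands on the first row rather than beside it.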

Schema Isolation

Dashboard auth in its own schema. Analytical data and user management never collide.

Failure Classification

Migrations reclassify failures into actionable buckets — dead links, corrupted files, blocked access.

Append-Only Predictions

Version-tracked. Analyst-approved answers "stick" through re-runs via sticky view ordering.

Measuring What Matters

Evaluation Framework & Current Performance

6-way classification treating INA as a first-class outcome:
Exact Match: AI matches ground truth
INA Correct: Correctly identified as not addressed
INA False Pos: Said INA but an answer exists
Katherine Violation: Gave an answer when the truth is INA
Value Different: Right idea, wrong value
No Golden: No ground truth available

Lesson: Overfitting

Should have integrated golden answers from additional districts earlier. Initial smaller sets caused overfitting.

56%
Overall accuracy
up from 45.3%
98.5%
Checkbox accuracy
93.8%
INA detection
~40%
Analyst acceptance

Multi-Stage Evaluation

Retrieval recall, per-depth correctness, synthesis impact, citation quality. Regression snapshots diff accuracy per question across changes.
NCTQ.ai Dashboard

Where Analysts and AI Meet

One interface for reviewing AI suggestions, exploring documents, asking policy questions, and managing the review workflow.

Metric Calculator

Core review: districts → policies → questions with AI recommendations, citations, and one-click approve/reject.

Documents

Search and explore 11,700+ processed policy documents with status badges and extraction details.

Policy Advisor

Conversational AI for open-ended policy questions with cited answers.

Admin & Auth

Passwordless email auth. Role-based access: viewer, analyst, power_user, admin.

Technology

FastHTML + MonsterUI + HTMX. Fast server-rendered pages, no heavy frontend framework.

Review Stats

807 reviews
5 reviewers
341 approved / 466 rejected
Reflections

What We Learned

Build Structures Invariant to LLM Progress

Frameworks that accept better models, not systems that compete with state-of-the-art. Swapping Gemini 1.5 to 2.5 Flash was a config change, not a rewrite.

Data Structures as Constraints

Type-safe structures validate outputs at the boundary. The AI can't get in the door unless it satisfies requirements. More reliable than any prompt alone.

Consensus Through Iteration

15 cheap runs beat one expensive run. Same principle as NCTQ's human process: multiple analysts reviewing the same question.

The Harness Is the Hard Part

The real challenge isn't AI — it's infrastructure connecting LLMs to tools and data safely at scale. The harness is the moat.

Robust Internal Benchmarking

Automated research loops: identify failure modes, test fixes, measure impact — all before touching production.

From Extraction to Reasoning

AI evolved from extracting to reasoning — when given the right workflow and constraints. Next-gen reasoning models + agentic processing will narrow the remaining gap.

Looking Ahead

What Comes Next

Immediate: Launch

District Policy Pathfinder (2024–2025 data), website content, and Compass chatbot.

June: Scale to 500 Districts

Expand from 148 to 500 districts. LLM-agnostic architecture means adopting newer models without structural changes.

Ongoing: Close the Loop

Write approved answers to the golden table. 94% of reviews have no ground truth — every approval expands evaluation coverage.

Priority Improvement Areas

1. Fix retrieval, not prompts — 75% of errors are retrieval gaps
2. Salary schedule parsing — 9–14% accuracy, the biggest opportunity
3. CBA section extraction — 960K-character contracts hold most answers
4. Add an N/A category — fixes ~19% of rejections
5. Inject analyst corrections — 466 notes as signal
"The remaining 30% gap is solvable with retrieval, reasoning, and agentic improvements — not a fundamental AI limitation."
Starling Strategy × NCTQ

Thank You

148 districts. 112 questions. 11,700 documents.
One system built to improve with every model generation.

Nathan Roll — Stanford NLP
Macon Phillips — Starling Strategy
56%
Current Accuracy
98.5%
Checkbox Accuracy
93.8%
INA Detection