Lessons learned processing 11,700+ documents across 148 districts — and what it taught us about making AI reliable at scale.
Document ingestion is non-trivial infrastructure work, not just LLM prompting. We convert PDFs, Word docs, HTML pages, and scanned images into clean, structured text.
The AI can't get in the door unless it satisfies the requirements — type hints and validation replace prose instructions.
Field descriptions become AI instructions. ge=0.0, le=1.0 constrains outputs.
Format errors dropped from 15–20% to under 1%.
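A minimal sketch of this pattern with Pydantic (the field names here are hypothetical, not the production schema); the `ge`/`le` bounds reject out-of-range confidence values at the boundary instead of relying on prompt instructions:

```python
from pydantic import BaseModel, Field, ValidationError

class PolicyAnswer(BaseModel):
    """Structured output the model must satisfy. Hypothetical fields for illustration."""
    answer: str = Field(description="One of the allowed answer options, or INA.")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence between 0 and 1.")

# A well-formed response passes; a malformed one is rejected before it reaches the pipeline.
print(PolicyAnswer(answer="Yes", confidence=0.92).answer)  # → Yes
try:
    PolicyAnswer(answer="Yes", confidence=1.7)
    print("accepted")
except ValidationError:
    print("rejected")  # → rejected
```

Validation errors surface as exceptions the harness can retry on, which is what drives the format-error rate down.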
Built on PydanticAI — MIT-licensed, open source, works with any LLM. Swap models without rewriting the system.
8 document types, answer options, coding guidance, and domain terminology from NCTQ's classification system. Human expertise constrains AI output.
We ask 15 AI agents to answer each question — each with a different slice of evidence. Then we take a vote. This mirrors NCTQ's own process of multiple analysts and fact-checkers.
LLMs hallucinate and tell people what they want to hear. Running 15 independent predictions and comparing them exposes these failure modes: the same principle as having multiple analysts review the same answer. When all 15 agree, accuracy is 58.7%; when they disagree, it drops to 18.7%.
No AI is involved in synthesis. Plurality wins, unless 9 or more of the 15 runs say INA (information not available), in which case INA overrides everything. Agreement among the 15 runs correlates with real accuracy far better than any model's self-reported confidence score.
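The synthesis rule is deterministic and fits in a few lines. The threshold of 9 and the INA override come from the text above; the function names and tie-breaking behavior are ours:

```python
from collections import Counter

INA = "INA"  # sentinel answer: no information available

def synthesize(runs: list[str], ina_threshold: int = 9) -> str:
    """Deterministic vote: INA overrides when it reaches the threshold,
    otherwise the plurality answer wins (ties break on first-seen order)."""
    if runs.count(INA) >= ina_threshold:
        return INA
    return Counter(runs).most_common(1)[0][0]

def agreement(runs: list[str]) -> float:
    """Fraction of runs matching the winner — the reliability signal
    that outperforms model-reported confidence."""
    return runs.count(synthesize(runs)) / len(runs)

runs = ["Yes"] * 7 + ["No"] * 4 + [INA] * 4
print(synthesize(runs), agreement(runs))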
AI models infer "No" from silence. We fight that at every layer. Absence of evidence is not evidence of absence.
Knowing which districts don't address topics in contracts is meaningful information, not a failure.
Predicting a value when the truth is INA is the primary failure metric — not overall accuracy.
When AI says INA but last year had a real answer, we log an audit note. We never auto-correct.
Citations are facts — objective truths that can be deterministically audited. We separate them from predictions to build trust.
Passages retrieved before the LLM answers. Can only cite pre-identified evidence.
Citations in JSON — document ID, passage index, quoted text. Each validated against source.
Batch validation independent of predictions. Missing citations held for review.
Citations inline. Click through to source. Missing-citation warnings.
All evidence identified, structured, and validated before AI reasons over it.
Result: Source → passage → citation → prediction → review.
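Because each citation carries a document ID, passage index, and quoted text, validation reduces to a deterministic membership check against the pre-retrieved passages. A sketch, with an assumed (not production) schema:

```python
from dataclasses import dataclass

@dataclass
class Citation:
    doc_id: str
    passage_index: int
    quote: str

def validate(citation: Citation, passages: dict[str, list[str]]) -> bool:
    """Valid only if the quote appears verbatim in the exact passage it points to."""
    doc = passages.get(citation.doc_id)
    if doc is None or not (0 <= citation.passage_index < len(doc)):
        return False
    return citation.quote in doc[citation.passage_index]

passages = {"doc-17": ["Teachers receive 10 days of paid leave annually."]}
good = Citation("doc-17", 0, "10 days of paid leave")
bad = Citation("doc-17", 0, "12 days of paid leave")
print(validate(good, passages), validate(bad, passages))  # → True False
```

No LLM judgment is involved, which is what makes citations auditable facts rather than predictions.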
Lexical search finds exact terms. Semantic search finds meaning. Neither alone is sufficient for policy documents.
Vespa, the same search engine Perplexity uses to search the entire internet, handles hybrid search, text embeddings, and district filtering in a single query.
When Vespa is unavailable, the system falls back to PostgreSQL — returning documents ranked by AI confidence. Pragmatic, not ideal.
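The fallback is a plain try/except at the query boundary. A sketch with stand-in callables (the real Vespa and PostgreSQL clients are not shown):

```python
import logging

logger = logging.getLogger("search")

def search(query: str, vespa_search, postgres_rank) -> list[dict]:
    """Try hybrid search first; if Vespa is unreachable, degrade to
    PostgreSQL results ordered by stored AI confidence."""
    try:
        return vespa_search(query)
    except ConnectionError:
        logger.warning("Vespa unavailable; falling back to PostgreSQL ranking")
        return postgres_rank(query)

def flaky_vespa(query):
    raise ConnectionError("simulated outage")

def pg_rank(query):
    return [{"doc_id": "doc-17", "ai_confidence": 0.91}]

print(search("paid leave", flaky_vespa, pg_rank))
```

Degraded results are still ranked, just by a weaker signal — the "pragmatic, not ideal" trade-off the text describes.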
All data flows through PostgreSQL in two layers — a clean separation between what humans curate and what AI produces.
Upserts with deterministic keys — safe to rerun at any time.
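Rerunnable writes rest on PostgreSQL's `INSERT ... ON CONFLICT` keyed by a deterministic identifier, so the same input always lands on the same row. A sketch; the table, columns, and key derivation are assumptions, not the production schema:

```python
import hashlib

def deterministic_key(district_id: str, doc_url: str) -> str:
    """Same inputs always yield the same key, so reruns update rather than duplicate."""
    return hashlib.sha256(f"{district_id}:{doc_url}".encode()).hexdigest()

# Parameterized upsert statement (psycopg-style placeholders).
UPSERT_SQL = """
INSERT INTO documents (doc_key, district_id, url, text)
VALUES (%(doc_key)s, %(district_id)s, %(url)s, %(text)s)
ON CONFLICT (doc_key) DO UPDATE SET text = EXCLUDED.text;
"""

key = deterministic_key("0148", "https://example.org/contract.pdf")
print(key == deterministic_key("0148", "https://example.org/contract.pdf"))  # → True
```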
Dashboard auth in its own schema. Analytical data and user management never collide.
Migrations reclassify failures into actionable buckets — dead links, corrupted files, blocked access.
Version-tracked. Analyst-approved answers "stick" through re-runs via sticky view ordering.
We should have integrated golden answers from additional districts earlier; the initial, smaller sets caused overfitting.
One interface for reviewing AI suggestions, exploring documents, asking policy questions, and managing the review workflow.
Core review: districts → policies → questions with AI recommendations, citations, and one-click approve/reject.
Search and explore 11,700+ processed policy documents with status badges and extraction details.
Conversational AI for open-ended policy questions with cited answers.
Passwordless email auth. Role-based access: viewer, analyst, power_user, admin.
FastHTML + MonsterUI + HTMX. Fast server-rendered pages, no heavy frontend framework.
Build frameworks that absorb better models, not systems that compete with the state of the art. Swapping from Gemini 1.5 to 2.5 Flash was a config change, not a rewrite.
Type-safe structures validate outputs at the boundary. The AI can't get in the door unless it satisfies requirements. More reliable than any prompt alone.
15 cheap runs beat one expensive run. Same principle as NCTQ's human process: multiple analysts reviewing the same question.
The real challenge isn't AI — it's infrastructure connecting LLMs to tools and data safely at scale. The harness is the moat.
Automated research loops: identify failure modes, test fixes, measure impact — all before touching production.
AI evolved from extracting to reasoning — when given the right workflow and constraints. Next-gen reasoning models + agentic processing will narrow the remaining gap.
District Policy Pathfinder (2024–2025 data), website content, and Compass chatbot.
Expand from 148 to 500 districts. LLM-agnostic architecture means adopting newer models without structural changes.
Write approved answers to the golden table. 94% of reviews have no ground truth — every approval expands evaluation coverage.
148 districts. 112 questions. 11,700 documents.
One system built to improve with every model generation.