Documentation Index
Fetch the complete documentation index at: https://hc.pillargtm.com/llms.txt
Use this file to discover all available pages before exploring further.
Every behavior PILLAR commits to is a named, enforced, append-only invariant, and we can show you the exact test that proves each one. Most AI-native tools ask you to take the math on faith. PILLAR doesn’t. We publish our correctness spec, enforce it in CI, and show you the tests. This article walks through the framework so you can verify what we claim — and what we don’t.
TL;DR for non-technical readers. PILLAR makes a structural promise: every behavior we commit to is enforced by an automated test that runs on every code change. If any test fails, the build fails — no exceptions, no warning-downgrades, no “we’ll fix it later.” As of today: 105+ named correctness invariants across 18 categories, 2,800+ tests, all green on the latest commit. What this means for you in plain English:
- Procurement sign-off in days, not quarters. Hand this page to your CTO or CISO; they read enforced behaviors, not marketing copy. Every claim on this page has a citable test.
- Integration breakage caught automatically. When PILLAR ships a new connector or refactors a scoring formula, the existing invariants either keep holding or the build turns red — no silent regressions slip into production.
- Falsifiable correctness claim. Most “trust us, our AI works” claims aren’t testable. Every “PILLAR is correct because…” statement on this page is backed by a specific named test that runs on every release. We name them, we cite them, and we publish what they enforce.
Why this exists
In early 2026, a customer’s renewal risk score rendered a number on the dashboard that was obviously wrong to any human reading it. The root cause was two bugs in how scoring functions passed values between each other — the kind of orchestration-layer failure that unit tests don’t catch because each formula looks correct in isolation. We built The Guarantee to close that class of bug. Every invariant you’ll see below was added because a past failure slipped through a test boundary it should have hit. The framework is append-only: once a behavior is promised, it stays promised.

The chain: Spec → Guarantee → Test
1. The Spec — what PILLAR commits to
Every behavior PILLAR commits to is a numbered entry in the public spec. Seven domain files cover scoring correctness, signal intelligence, multi-tenant isolation, integration fidelity, per-org configuration, UI journeys, and API contracts. Examples of concrete commitments:
- SPEC-SCORING-04 — A pipeline run that changes renewal_risk flows the fresh value into account_priority via the risk_urgency weight in the same invocation. No stale reads between scoring models.
- SPEC-TENANCY-03 — Every public API route handler that queries an org-scoped table filters by org_id, or is explicitly allowlisted as cross-org (admin, cron, webhook).
- SPEC-VALIDATION-01 — Every API route that reads a JSON body validates input via Zod with parseBody — no exceptions, enforced across all 92 body-accepting routes.
- SPEC-INGESTION-07 — When a non-terminated contracts row exists for an account, the scoring pipeline uses contracts.end_date as the canonical renewal anchor, falling back to renewals.renewal_date for tenants without contracts enumerated.
Each entry carries a status: required (must be enforced), aspirational (intent tracked for future enforcement), or retired (explicitly deprecated, with a pointer to what replaced it). IDs are append-only — spec numbers never get reused.
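A minimal sketch of how an append-only registry like this could be checked. The entry shape, field names, and isAppendOnly helper are illustrative assumptions, not PILLAR’s actual code:

```typescript
// Hypothetical shape of a spec registry entry (illustrative only).
type SpecStatus = "required" | "aspirational" | "retired";

interface SpecEntry {
  id: string;          // e.g. "SPEC-SCORING-04" — never reused once assigned
  status: SpecStatus;
  summary: string;
  replacedBy?: string; // expected when status === "retired"
}

// Append-only check: a new registry must still contain every previously
// published ID — entries may be retired, but never silently dropped.
function isAppendOnly(previous: SpecEntry[], next: SpecEntry[]): boolean {
  const nextIds = new Set(next.map((e) => e.id));
  return previous.every((e) => nextIds.has(e.id));
}
```

A check like this can run in CI against the previous release's registry, making "spec numbers never get reused" a build-time property rather than a convention.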
Current count: 108 entries spanning ten domains (scoring, vertical intelligence, configuration, ingestion, UI, tenancy, contract, signals, validation, ops). The Vertical Intelligence domain (SPEC-VERTICAL-*) is the fastest-growing surface at 37 entries, reflecting the multi-layer reconciliation work that closes the gap between “PILLAR canonicalizes all 50 states + DC + 26 federal datasets” and “PILLAR’s per-district numbers reconcile to the state DOE’s own published values” — and the new Round 8 federal-data canonical-shape layer that closes “is the federal-dataset row right?”.
2. The Guarantee — how each commitment is enforced
Every required spec entry has at least one named Guarantee that enforces it in code. IDs follow G-<category>-<NN>, where the category letter maps to a class of correctness failure:
| Code | Category | What it enforces |
|---|---|---|
| F | Freshness | Cross-model staleness — when score A feeds score B, B sees the current run’s A |
| M | Monotonicity | Directional correctness — worse input raises risk, better raises priority |
| B | Bounds | Scores ∈ [0, 100]. No NaN. No Infinity. |
| D | Decomposition | Weighted contributions sum to the final score within ±1 |
| S | Signals | Every signal traces back to a triggered rule; no ghost signals |
| R | Rules | Rule catalog well-formedness (unique IDs, valid score_models) |
| C | Calibration | Weights sum to 1.0 per formula |
| T | Tenancy | Multi-tenant isolation at every layer (routes, helpers, POST bodies) |
| I | Ingestion | CRM connector + field-mapping fidelity |
| P | Performance | Per-call latency budgets under load |
| W | Forecast weights | Per-org forecast category weight resolution |
| O | Org-config | Per-org business configuration resolution |
| A | Audit-shape | scoring_audit persisted-row contract |
| H | Hermeticity | CI independence from external state |
| V | Validation | API-boundary input validation via Zod |
| U | UI | User-facing journey correctness (Playwright) |
| E | Endpoint | API input → output contract invariants |
| X | Vertical Intelligence | State DOE canonicalization, federal Title program flows, NAEP cross-validation, accreditation cycles, state procurement calendars — the external-knowledge surfaces that horizontal Revenue AI platforms structurally cannot answer |
3. The Tests — evidence the invariants hold
Every Guarantee has at least one automated test whose description starts with the Guarantee ID, so coverage is directly attributable. Four strategies:
- Fixed-example tests — hand-picked inputs exercising known-tricky cases.
- Property-based tests — fast-check generates 100s to 1000s of random inputs per assertion; any failure shrinks to a minimal counterexample.
- Endpoint-contract tests — lock the shape of every API response plus invariants about the output data (e.g. decomposition sums to score within ±1).
- UI tests — Playwright exercises user-facing pages end-to-end.
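As a flavor of the property-based strategy, here is a dependency-free sketch of a bounds-style (B) check. The real suite uses fast-check; the clampScore function, its zero-on-non-finite policy, and the iteration count are hypothetical stand-ins:

```typescript
// Illustrative bounds invariant (category B): a score must be a finite
// number in [0, 100] — no NaN, no Infinity.
function clampScore(raw: number): number {
  if (!Number.isFinite(raw)) return 0; // hypothetical policy: non-finite → 0
  return Math.min(100, Math.max(0, raw));
}

// Hand-rolled property loop sketching what fast-check automates
// (random generation plus a fixed set of known-tricky inputs).
function checkBoundsProperty(iterations: number): boolean {
  for (let i = 0; i < iterations; i++) {
    const raw = (Math.random() - 0.5) * 1e6; // wide random range
    const score = clampScore(raw);
    if (!Number.isFinite(score) || score < 0 || score > 100) return false;
  }
  // Fixed known-tricky cases, mirroring the fixed-example strategy above.
  return [NaN, Infinity, -Infinity, -1, 100.5].every((x) => {
    const s = clampScore(x);
    return Number.isFinite(s) && s >= 0 && s <= 100;
  });
}
```

fast-check adds what this sketch lacks: reproducible seeds and automatic shrinking of any failing input to a minimal counterexample.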
What makes the claim falsifiable
Anyone can write tests. The structural claim that makes “every behavior is enforced” provable is the build-level check of the chain itself:
- Add a spec entry without a Guarantee → build fails (spec-check.test.ts cross-references spec id ↔ Guarantee spec_ref).
- Add a Guarantee without a matching test → build fails (index.test.ts cross-references registry entries ↔ it("<ID>: ...") descriptions).
- Orphan test referencing a Guarantee that doesn’t exist → build fails (same cross-check, in the other direction).
- Retire a Guarantee without naming its replacement → build fails.
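The build-level cross-check can be sketched roughly as follows. Only the file names and the it("<ID>: ...") naming convention come from the article; the Guarantee shape and findChainBreaks helper are assumptions:

```typescript
// Sketch of a spec ↔ Guarantee ↔ test cross-reference check.
interface Guarantee {
  id: string;       // e.g. "G-F-01"
  spec_ref: string; // e.g. "SPEC-SCORING-04"
}

function findChainBreaks(
  specIds: string[],
  guarantees: Guarantee[],
  testDescriptions: string[],
): string[] {
  const errors: string[] = [];

  // Direction 1: every spec entry must be covered by a Guarantee.
  const covered = new Set(guarantees.map((g) => g.spec_ref));
  for (const id of specIds) {
    if (!covered.has(id)) errors.push(`spec ${id} has no Guarantee`);
  }

  // Direction 2: every Guarantee must have a test named after it.
  const guaranteeIds = new Set(guarantees.map((g) => g.id));
  for (const g of guarantees) {
    if (!testDescriptions.some((d) => d.startsWith(`${g.id}:`))) {
      errors.push(`Guarantee ${g.id} has no test`);
    }
  }

  // Direction 3: no test may reference a Guarantee that doesn't exist.
  for (const d of testDescriptions) {
    const ref = d.split(":")[0];
    if (!guaranteeIds.has(ref)) errors.push(`orphan test for ${ref}`);
  }
  return errors;
}
```

Running a check like this as a test makes the chain itself a build-failing invariant: any non-empty error list turns the build red.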
What this gets you, the customer
- Procurement sign-off in days, not quarters. Hand your CTO or CISO the spec. They read enforced behaviors with citable tests, not pages of marketing.
- Confidence at the seams. Every time PILLAR ships an integration (Salesforce, HubSpot, Gong, Pendo, Slack), a new Ingestion Guarantee (G-I-*) locks the field-mapping + override fidelity. Breaking it in a future refactor fails the build.
- Zero “vibe-coded” surprises. Input at every API boundary goes through Zod schema validation, enforced by the G-V-01/G-V-02 invariants across 92 route files.
- A changelog that doesn’t lie. Every scoring-model update bumps MODEL_VERSION and updates the golden-fixture snapshot, tracked by G-D-02. The Changelog references affected SPEC / Guarantee IDs for each release.
- Silent breakage caught automatically. The integration-health canary (G-I-10/G-I-11) runs every 15 minutes and pages the on-call when a tenant’s OAuth token expires, a sync stalls, or a connection flips to error. No more “the dashboard is wrong and nobody noticed until the customer asked.”
Categories in depth
Freshness (F)
When the scoring pipeline runs, one model’s output often feeds another — renewal_risk flows into account_priority via the risk_urgency weight, and pipeline_hygiene flows into forecast_confidence. The freshness Guarantees (G-F-01 through G-F-05) prove the downstream model sees the current run’s upstream value, not a stale cache. This closes the class of bug where a customer saw the right renewal risk on the account page but the wrong priority ranking in triage — because the priority calculation had read last night’s risk score, not this morning’s.
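A minimal sketch of that discipline, with hypothetical function names, formulas, and weights: the downstream formula receives the upstream value threaded through the same invocation, never read from a cache or global:

```typescript
// Hypothetical per-run context: upstream outputs travel inside it.
interface RunContext {
  renewalRisk: number;
}

// Toy upstream formula (illustrative, not PILLAR's real scoring math).
function computeRenewalRisk(usageDecline: number): number {
  return Math.min(100, Math.max(0, usageDecline * 100));
}

// Downstream formula reads only from this run's context —
// there is no cache or global lookup it could go stale through.
function computeAccountPriority(ctx: RunContext, riskUrgencyWeight: number): number {
  return ctx.renewalRisk * riskUrgencyWeight;
}

function runPipeline(usageDecline: number): { risk: number; priority: number } {
  const risk = computeRenewalRisk(usageDecline);
  const ctx: RunContext = { renewalRisk: risk }; // threaded, not cached
  return { risk, priority: computeAccountPriority(ctx, 0.5) };
}
```

Structured this way, a freshness test only has to assert that the priority seen in a run's output was derived from the risk computed in that same run.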
Tenancy (T)
Six Guarantees (G-T-01 through G-T-06) enforce multi-tenant isolation at three layers: pure scoring functions (the compute is scoped to a context object, never a global pool), route handlers (every query filters by org_id), and downstream workflow helpers (every helper re-applies the org filter, because a task’s source_id could reference a resource in another org). Expanded after a 2026-04-08 pen-test-style review found 11 gaps in task-completion helpers and 4 gaps in the plays POST body handler.
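A sketch of the third layer — workflow helpers re-applying the org filter. In-memory rows stand in for database queries here; the row shapes and resolveTaskResource helper are hypothetical:

```typescript
interface TaskRow {
  id: string;
  org_id: string;
  source_id: string; // may point at a resource in ANY org — never trusted alone
}

interface ResourceRow {
  id: string;
  org_id: string;
}

// The helper re-applies the org filter instead of trusting source_id,
// because a task's source_id could reference a resource in another org.
function resolveTaskResource(
  task: TaskRow,
  resources: ResourceRow[],
): ResourceRow | null {
  return (
    resources.find((r) => r.id === task.source_id && r.org_id === task.org_id) ??
    null
  );
}
```

The tenancy tests then only need to assert that a cross-org source_id resolves to nothing, at every helper, not just at the route layer.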
Ingestion (I)
Eleven Guarantees (G-I-01 through G-I-11) lock the fidelity of data flowing from your CRM + tooling into PILLAR’s scoring pipeline. Direct field mappings copy verbatim. Picklist translations fall back gracefully when a value is missing. Contracts-object renewal anchors override legacy renewal-date fields. Usage snapshots drive renewal_risk via a NEUTRAL baseline on insufficient data (missing usage never downgrades; only observed decline does). The integration-health canary detects expired OAuth tokens and stalled scoring pipelines 15 minutes after they start, not days later when a customer notices.
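The NEUTRAL-baseline rule might look roughly like this; the baseline value of 50, the two-snapshot threshold, and the function name are all assumptions for illustration:

```typescript
// Hypothetical neutral baseline: insufficient usage data must never
// read as "worse" — only an observed decline raises the component.
const NEUTRAL_RISK = 50;

function usageRiskComponent(snapshots: number[]): number {
  if (snapshots.length < 2) return NEUTRAL_RISK; // insufficient data → neutral
  const first = snapshots[0];
  const last = snapshots[snapshots.length - 1];
  if (last >= first) return 0;                   // flat or growing usage → no usage-driven risk
  const declineRatio = (first - last) / first;   // only observed decline drives risk
  return Math.min(100, Math.round(declineRatio * 100));
}
```

The design choice this encodes: a tenant that has not connected usage data yet lands at the neutral baseline, rather than being penalized for missing telemetry.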
Validation (V)
Every API route that reads a JSON body or searchParams runs input through a Zod schema before touching it. The G-V-01 and G-V-02 Guarantees enforce this across 92 body-accepting routes and all query-parsing routes, with a shrinking exemption allowlist tracked in the test file — adding a new unvalidated route requires explicit review.
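To illustrate the parseBody pattern: the real implementation uses Zod, but this dependency-free stand-in mimics the safeParse-style success/error result for a hypothetical route body, so the handler never touches unvalidated fields:

```typescript
// Discriminated-union result, shaped like Zod's safeParse return value.
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: string };

// Hypothetical body for a plays POST route (field names are assumptions).
function parsePlayBody(body: unknown): ParseResult<{ name: string; orgId: string }> {
  if (typeof body !== "object" || body === null) {
    return { success: false, error: "body must be an object" };
  }
  const { name, orgId } = body as Record<string, unknown>;
  if (typeof name !== "string" || name.length === 0) {
    return { success: false, error: "name must be a non-empty string" };
  }
  if (typeof orgId !== "string" || orgId.length === 0) {
    return { success: false, error: "orgId must be a non-empty string" };
  }
  return { success: true, data: { name, orgId } };
}
```

A route handler then branches on `result.success` and returns a 400 on failure — the validated, typed `data` is the only thing business logic ever sees.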
Hermeticity (H)
G-H-01 ensures the integration-tests CI job runs against a local Supabase CLI stack, not a cloud Supabase credential. This eliminates an entire class of failures where a rotated credential would silently break every PR check. CI must be runnable from a clean clone without access to production secrets.
UI (U)
Three Playwright-backed Guarantees (G-U-01 through G-U-03) verify the public-surface integrity layer: the /login page responds non-5xx and renders an interactive email input, and /api/architects/unsubscribe?t=<invalid> returns a branded 200 HTML page rather than leaking a stack trace. Data-aware UI invariants (account detail scoring, triage ordering, signal feed tenancy) are tracked as aspirational SPEC entries and will land with the hermetic Playwright fixture seed.
Vertical Intelligence (X) — the canonicalization layer
The largest category at 40 Guarantees (G-X-01 through G-X-40), covering external-knowledge surfaces that horizontal Revenue AI platforms structurally cannot answer: state DOE assessment + accountability + graduation data across all 50 states + DC, 26 federal datasets (8 IPEDS components + 8 Higher Ed sources + 10 K-12 sources), federal Title program allocations, NAEP cross-validation, accreditation review cycles, state procurement calendars, and cooperative-contract eligibility.

Runtime-truth status (May 2026): 51 jurisdictions covered (50 states + DC + federal); 26 federal datasets ingested with 890,000+ canonical rows; 47 state-funding adapters live; 129 MCP tools across 14 categories — 63 in vertical_intelligence (all live and queryable). Per-district coverage is at 51/51 jurisdictions for assessment proficiency (5.03M cells across ~19,700 LEAs), cohort graduation (391k cells), accountability status (24k cells), and engagement/chronic absenteeism (104k cells). K-12 state funding allocations: 46 of 51 jurisdictions (90.2%), 114,699 per-LEA rows, 7.94B captured (IPEDS SFA + 11 state-specific programs). Per-state DOE deep ingest for 27+ states at recent-year grain plus federal EDFacts SY 2020-21 backfill closes the long tail.

550 Guarantee tests pass on every commit; the schema, ingest pipeline, MCP wrappers, and canonical-shape validators are all enforced — every commit blocks merge unless every row landing in the 26 federal-data + 47 state-funding tables passes its G-X-31 through G-X-40 validators.
The headline claim:
PILLAR canonicalizes 51 jurisdictions (50 state DOEs + DC + federal) with a documented policy footprint, a structural honesty layer, and sixteen independent layers of accuracy verification, split across two groups.

Why this matters for buyers: state DOEs each express proficiency on a different scale, suppression with different sentinels, and accountability in 4-tier vs 5-tier vs A-F schemes, with subgroup labels that vary across all 50 states. Each state DOE essentially publishes data that’s only legible inside its own bureaucracy. PILLAR’s canonicalization now spans 51 jurisdictions — taking Tennessee’s “Approached/Met/Exceeded”, Louisiana’s “Mastery and above”, and Wisconsin’s “Advanced+Meeting” and forcing them all into one comparable pct_proficient_or_above column with a documented policy footprint.

Round 1-5 reconciliation layer (closes “is the state-DOE proficiency number right?”):
- Macro-level reconciliation against state-published statewide aggregates with 24-state coverage (G-X-25)
- Micro-level spot-checks against 17 hand-validated district fixtures across 13 states, including the load-bearing LDOE R36→036 alias (G-X-26)
- External NAEP trend-direction cross-validation for the 11 Tier-1 states, with a live MCP route at /api/vertical/state-naep-comparison (G-X-27)
- Silent-corruption canary on every ingested cell, with queue-backed weekly review via the value_unknown_alarms table (G-X-28)
- Federal Title pass-through reconciliation between EDFacts allocations and SEA-published disbursements (G-X-29)
- Per-district Title allocation spot-checks closing the loop on “we know proficiency AND federal allocation are right for the same district” (G-X-30)

Round 8 federal-data canonical-shape layer (closes “is the federal-dataset row right?” — 10 new Guarantees):
- IPEDS-extension shape discipline for Human Resources / Admissions / Academic Year Tuition / Academic Libraries / Enrollment by CIP — including biennial-even-year discipline on EF-CIP and pre-2014 collection_status discipline on Academic Libraries (G-X-31)
- OPEID padding integrity on the institution_crosswalk join — the only authorized path between UNITID-keyed (IPEDS, Scorecard, Carnegie) and OPEID-keyed (FSA CDR/GE/NSLDS/HCM, NC-SARA) datasets (G-X-32)
- College Scorecard shape validators on institution-level + field-of-study tables (G-X-33)
- Carnegie 2025 four-dimension derivation discipline — is_r1/is_r2 MUST be derivable from research_activity_designation (G-X-34)
- FSA regulatory-status discipline preventing accidental publish-rate inference during the 2019-2023 GE rescission gap; CDR + HCM enums locked (G-X-35)
- SHEEO SHEF + NC-SARA state-level shape with USPS-keyed JSONB integrity (G-X-36)
- CRDC biennial discipline — collection year MUST be even; suspensions ≤ 2× total enrollment sanity check (G-X-37)
- CCD School Universe + EDGE locale enum — title_i_status, charter_status, magnet_status, virtual_indicator, locale_code all locked (G-X-38)
- OSEP IDEA Part B + K-12 federal program state-aggregate shape (G-X-39)
- NCES EDGE entity-type-conditional ID-length checks + NIEER 0-10 quality benchmark hard cap (G-X-40)

The sixteen verification layers above mean the resulting unified surface isn’t just “structurally faithful” — it has been independently checked against state-published aggregates, hand-validated district fixtures, federal NAEP, federal EDFacts allocation tables, and SEA pass-through reports, and every Round 8 federal-dataset row passes a canonical-shape validator before it can land in its table.
What this is NOT: numerical perfection at the per-district per-cell level for every single one of the ~13,000 US LEAs. The Round 1 backfill seed includes 24 states with reconciliation aggregates and 17 districts with locked spot-check cells (out of ~91 districts × 27 states = ~2,500 possible spot-checks); coverage expands per the published backfill runbooks (docs/RECONCILIATION_BACKFILL_RUNBOOK.md, docs/SPOT_CHECK_BACKFILL_RUNBOOK.md). Build-time Guarantees enforce the lookup-table shape + helper behavior; runtime production crons (separate from the build-time gate) compare freshly-ingested cells against the locked fixtures and alert on drift.
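The label-to-column canonicalization can be sketched as a per-state band mapping. The state labels come from this article; the band table, counts shape, and function are illustrative:

```typescript
// Which state-native performance bands count as "proficient or above".
// Band memberships here are illustrative, not PILLAR's actual policy table.
const PROFICIENT_BANDS: Record<string, Set<string>> = {
  TN: new Set(["Met", "Exceeded"]),     // Tennessee: Approached/Met/Exceeded
  LA: new Set(["Mastery", "Advanced"]), // Louisiana: "Mastery and above"
  WI: new Set(["Advanced", "Meeting"]), // Wisconsin: Advanced+Meeting
};

// Collapse state-native band counts into one comparable percentage.
function pctProficientOrAbove(
  state: string,
  counts: Record<string, number>, // students per state-native band
): number | null {
  const bands = PROFICIENT_BANDS[state];
  if (!bands) return null; // unknown state → no silent guess
  const total = Object.values(counts).reduce((a, b) => a + b, 0);
  if (total === 0) return null;
  let proficient = 0;
  for (const [band, n] of Object.entries(counts)) {
    if (bands.has(band)) proficient += n;
  }
  return (proficient / total) * 100;
}
```

Returning null for an unmapped state rather than a guessed number is the structural-honesty posture the article describes: gaps surface as gaps, not fabricated values.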
Honesty contract baked into the response shape:
- STANDARDS_CROSS_FAMILY_NOTE (G-X-15) — every cross-family comparison carries a “cut-scores differ; not directly comparable” caveat
- continuity_break: true (G-X-14) — flagged on year-over-year transitions where the state assessment family changed (PARCC→MCAP, FSA→FAST, AIR→Cambium)
- CCMR_COMPOSITE_NOTE (G-X-19) — warns against cross-state composite comparison
- GROWTH_VS_LEVEL_NOTE (G-X-20) — prevents conflating growth with absolute proficiency
- naep_disagrees: true (G-X-27) — surfaces when state cut-score recalibrations diverge from NAEP
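As one example, the continuity_break flag (G-X-14) amounts to detecting an assessment-family change between adjacent years. The row shape and helper below are assumptions sketching that idea:

```typescript
// One year of a district's results under some assessment family.
interface YearCell {
  year: number;
  family: string; // e.g. "PARCC", "MCAP", "FSA", "FAST"
  value: number;  // canonicalized pct_proficient_or_above
}

// Flag every year whose previous year used a different assessment family,
// so year-over-year deltas across a family change are never read as trends.
function flagContinuityBreaks(
  series: YearCell[],
): (YearCell & { continuity_break: boolean })[] {
  const sorted = [...series].sort((a, b) => a.year - b.year);
  return sorted.map((cell, i) => ({
    ...cell,
    continuity_break: i > 0 && sorted[i - 1].family !== cell.family,
  }));
}
```

Baking the flag into the response shape means a consumer cannot accidentally chart a PARCC→MCAP transition as a proficiency drop without the caveat attached.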
Related reading
- Scoring Overview — how the five scoring formulas connect to the Decomposition (D) + Calibration (C) Guarantees.
- Signal Overview — how the eight signal families connect to the Signals (S) + Rules (R) Guarantees.
- Data Readiness — the enforcement layer that checks CRM data quality before scoring runs, covered by the Ingestion (I) Guarantees.
- Changelog — every release references the SPEC and Guarantee IDs it affected.