
The 9-step Verifiable Reasoning Protocol, explained step by step

2026-05-23 · Tablize Team

In an earlier post we introduced Verifiable Reasoning — the protocol Tablize runs in Deep Analysis mode to make agent-generated numbers trustworthy. This post is the full walkthrough: every step, what it does, an example of what it catches that naive agents miss, and the design rationale.

If you’re evaluating Tablize for serious analytical work (finance, research, anywhere a wrong number costs you), this is the post that explains why you can trust the output. We’ve open-sourced the thinking behind the protocol; the implementation lives in the Tablize runtime.

Why 9 steps and not fewer

A common reaction to “9-step protocol” is “isn’t that overkill?” We wrestled with this. The first internal version had 4 steps. We added 5 more as we caught real failure modes that the original 4 didn’t prevent.

Each step exists because of an incident — a real wrong answer we shipped to a user (or to an internal test), traced back to a specific gap. We’ll cite the failure mode under each step.

The protocol is adaptive: not every step fires on every question. The agent skips steps when the question is simple enough. The full sequence fires on questions that involve aggregation, joins across tables, or any arithmetic.

Step 1: Clarify the question (if needed)

What it does: Detects ambiguity in the user’s question and either asks a clarifying question or picks a sensible default and discloses the assumption.

Failure mode it catches: “Top 10 customers by revenue.” Revenue when? Lifetime, last year, last 90 days? Revenue gross or net of refunds? Customers including or excluding deleted accounts? Each interpretation gives a different answer. The naive agent picks one silently. The user thinks they got the answer they asked for.

Heuristic: Ask when a wrong assumption would change the answer by more than 10%. Pick a default, and disclose it, when the difference is purely stylistic.

Example interaction:

User: “Top 10 customers by revenue.”

Agent: “Two ways I could interpret this. (a) Lifetime revenue per customer, (b) Last 30 days. I’ll default to (a) since you didn’t specify a window. Say the word if you want (b).”

The default is shown. The alternative is shown. The user can correct in one word. No silent assumption.
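The ask-or-default heuristic above can be sketched in a few lines. This is a minimal illustration, not the Tablize implementation: the `Interpretation` class, the function names, and the idea of attaching a rough estimate to each reading are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Interpretation:
    label: str       # e.g. "lifetime revenue"
    estimate: float  # rough answer under this reading (hypothetical input)

def relative_spread(interpretations):
    """Largest relative gap between estimates, measured against the smallest one."""
    values = [abs(i.estimate) for i in interpretations]
    low, high = min(values), max(values)
    if low == 0:
        return float("inf") if high > 0 else 0.0
    return (high - low) / low

def clarify_or_default(interpretations, threshold=0.10):
    """Return ("ask", labels) when readings diverge by more than the threshold,
    otherwise ("default", [first label]) so the default can be disclosed."""
    if relative_spread(interpretations) > threshold:
        return ("ask", [i.label for i in interpretations])
    return ("default", [interpretations[0].label])

decision = clarify_or_default([
    Interpretation("lifetime revenue", 120_000.0),
    Interpretation("last 30 days", 9_500.0),
])
print(decision)
```

Either branch keeps the alternatives visible: the labels come back with the decision, so the agent can list them in its reply.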

Step 2: Make a data plan

What it does: Before running any query, the agent writes a plan: which tables, which columns, which joins, which filters, which aggregations.

Failure mode it catches: The agent thinks orders.amount is in cents (because that’s how Stripe’s API returns amounts), but it’s actually in dollars (because the ETL pipeline already divided by 100). The naive agent divides by 100 again and reports MRR as $47.20 (it’s actually $4,720). The data plan exposes the assumption before the query runs, so you can spot the mistake and correct it.

What the plan looks like:

“Plan: query stripe_subscriptions filtered to status=‘active’. Sum the amount_dollars column (which I’m assuming is already in dollars based on the column name; sample 5 rows to confirm before running the full query). Group by month using started_at. Note: not including failed subscription attempts; if you want those, say so.”

Design rationale: Plans are cheap to generate (a few hundred tokens), they catch errors before query cost is incurred, and they create a paper trail of the agent’s logic that you can audit.
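One way to picture a plan is as a small structured record that renders to text before anything runs. The sketch below is illustrative only; the `DataPlan` class and its fields are hypothetical, and the table and column names are taken from the example plan above.

```python
from dataclasses import dataclass, field

@dataclass
class DataPlan:
    table: str
    filters: dict          # column -> required value
    aggregate: str         # e.g. "SUM(amount_dollars)"
    group_by: str
    assumptions: list = field(default_factory=list)

    def describe(self) -> str:
        """Render the plan as plain text so a human can review it before the query runs."""
        where = " AND ".join(f"{col} = '{val}'" for col, val in self.filters.items())
        lines = [
            f"Plan: query {self.table} where {where}.",
            f"Aggregate: {self.aggregate}, grouped by {self.group_by}.",
        ]
        lines += [f"Assumption: {a}" for a in self.assumptions]
        return "\n".join(lines)

plan = DataPlan(
    table="stripe_subscriptions",
    filters={"status": "active"},
    aggregate="SUM(amount_dollars)",
    group_by="month of started_at",
    assumptions=["amount_dollars is already in dollars; sample 5 rows to confirm"],
)
print(plan.describe())
```

Keeping assumptions as their own field means they are available to list again in Step 8 without re-deriving them.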

Step 3: Sample before aggregating

What it does: Pulls 10-100 rows from each relevant table to verify column semantics before running aggregations.

Failure mode it catches: A column called status has values paid, refunded, pending. The agent’s plan says WHERE status = 'completed'. The query returns zero rows. The naive agent reports “no completed orders this month” — wrong, because completed is the wrong status value. The sample step catches this immediately: the agent sees the actual values and corrects the filter.

Also catches: Negative numbers in amount columns (often refund rows). Timezones embedded in date strings. Boolean columns stored as 0/1 vs true/false vs Y/N. Unicode in supposedly-ASCII columns. All real bugs in real datasets.

Cost: A few hundred milliseconds per table. Worth every millisecond.
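Here is a minimal sketch of the status-value check, using an in-memory SQLite table as a stand-in for a real data source. The helper names and the `orders` schema are assumptions for illustration; a production version would also need to handle identifier quoting.

```python
import sqlite3

def distinct_values(conn, table, column, limit=100):
    """Return the distinct values found in the first `limit` rows of a column."""
    # Table and column names come from the agent's own plan in this sketch,
    # not from user input, so plain string formatting is tolerated here.
    rows = conn.execute(
        f"SELECT DISTINCT {column} FROM (SELECT {column} FROM {table} LIMIT ?)",
        (limit,),
    ).fetchall()
    return {row[0] for row in rows}

def check_filter_value(conn, table, column, planned_value):
    """Return (ok, seen): whether the planned filter value appears in the sample."""
    seen = distinct_values(conn, table, column)
    return planned_value in seen, seen

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "paid", 20.0), (2, "refunded", -20.0), (3, "pending", 35.0)],
)

ok, seen = check_filter_value(conn, "orders", "status", "completed")
if not ok:
    print(f"'completed' not found in sample; actual values: {sorted(seen)}")
```

A missing value in a small sample is not proof that it never occurs, so the sketch treats a miss as a prompt to look closer rather than as a hard error.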

Step 4: Run the actual query

What it does: Runs the query.

Failure mode it catches: None, by itself. This step is the easy one. The interesting steps are around it.

Why it’s a separate step in the protocol: Because the protocol distinguishes “made a plan” from “ran the query.” If the plan is wrong, the query is wasted. If the plan is right but the query failed (timeout, permission error, syntax error), the agent retries with a smaller variant. Keeping these as distinct steps means the agent can recover from failures at the right level.

Step 5: Sanity check the result

What it does: Compares the result to expected magnitudes. If your monthly revenue was around $50K and the current month comes back as $4.8M, the agent flags it as suspicious.

Failure mode it catches: The unit error from Step 2 (we caught it in the plan; if we hadn’t, we catch it here). Forgotten filters that cause Cartesian product blowup. Wrong join keys that multiply rows. Date filter bugs that include 10 years of data when you wanted last month.

Heuristic: Flag results more than 10× above or below the recent baseline. Flag results with more than 100M rows. Flag percentage changes above 200%. Flag results where every group has the same value (often a sign the GROUP BY is wrong).

Example output:

“MRR last month: $48,200. MRR this month: $4,820,000. This is 100× the prior month, which suggests a unit error — let me check if I aggregated in cents instead of dollars.”

The agent doesn’t just report the number. It notices the number is weird and investigates before serving it.
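The heuristic translates fairly directly into code. The function below is a sketch under stated assumptions: the name `sanity_flags` and its parameters are hypothetical, and "10× outside the baseline" is read as 10× above or 10× below.

```python
def sanity_flags(value, baseline, row_count=None, group_values=None):
    """Return a list of human-readable warnings; an empty list means nothing looked odd."""
    flags = []
    if baseline:
        ratio = abs(value / baseline)
        # Assumption: "10x outside the baseline" covers both directions.
        if ratio >= 10 or ratio <= 0.1:
            flags.append(f"value is {ratio:g}x the recent baseline")
        pct_change = (value - baseline) / abs(baseline) * 100
        if abs(pct_change) > 200:
            flags.append(f"change of {pct_change:+.0f}% exceeds 200%")
    if row_count is not None and row_count > 100_000_000:
        flags.append(f"result has {row_count:,} rows (more than 100M)")
    if group_values is not None and len(group_values) > 1 and len(set(group_values)) == 1:
        flags.append("every group has the same value; check the GROUP BY")
    return flags

# The MRR example from the text: $48,200 last month, $4,820,000 this month.
flags = sanity_flags(value=4_820_000, baseline=48_200)
for flag in flags:
    print("suspicious:", flag)
```

Returning flags instead of raising keeps the decision with the agent: a flag triggers an investigation, not an automatic rejection.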

Step 6: Cross-verify with an independent query

What it does: For high-stakes questions, computes the answer two different ways using different data paths.

Failure mode it catches: Two systems both have order data: Shopify and Stripe. The Shopify orders table says revenue was $X. The Stripe charges table says revenue was $Y. If X and Y differ by more than rounding, one of them is wrong. The naive agent picks the one its plan referenced first; the cross-verify step surfaces the discrepancy and lets the user decide which to trust.

Heuristic: Cross-verify when the question matters (financial, decision-driving) and when two independent data paths exist. Skip when only one data source is available.

Example output:

“Cross-check: net revenue last quarter computed via Shopify orders = $487,200. Same computation via Stripe captures = $471,810. Discrepancy of $15,390 (3.2%), likely because Shopify orders include tax while Stripe captures exclude it. Recommend using Stripe number for revenue recognition; Shopify number for gross sales reporting.”

The agent doesn’t pick one silently. It explains the discrepancy.
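The comparison itself is simple arithmetic. The sketch below reproduces the figures from the example output; the function name, the 0.5% tolerance, and the choice to measure the gap against the larger value are assumptions for illustration.

```python
def cross_check(value_a, value_b, tolerance=0.005):
    """Compare two independently computed answers.

    Returns the absolute difference, the difference relative to the larger
    value, and whether the two agree within the (assumed) tolerance.
    """
    difference = abs(value_a - value_b)
    base = max(abs(value_a), abs(value_b))
    relative = difference / base if base else 0.0
    return {
        "difference": difference,
        "relative": relative,
        "agree": relative <= tolerance,
    }

# Figures from the example: Shopify orders vs. Stripe captures.
result = cross_check(487_200, 471_810)
print(f"Discrepancy of ${result['difference']:,} ({result['relative']:.1%})")
if not result["agree"]:
    print("Sources disagree; surface both numbers instead of picking one.")
```

The tolerance exists so that ordinary rounding noise does not trigger a report; anything larger is shown to the user with both values.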

Step 7: Inline math

What it does: When the answer involves arithmetic, the agent shows every step of the calculation.

Failure mode it catches: Wrong-base percentage errors. “MRR grew 12% from last month”: 12% from what base? A $5,200 increase on a $50,000 base is 10.4%, not 12%. The agent shows the math:

“MRR last month: $50,000. MRR this month: $55,200. Delta: +$5,200. Percentage: $5,200 / $50,000 = 10.4%.”

This is pedantic. It is also the single best defense against a class of LLM errors that comes from the model confusing “of” and “from” in percentage language. Always show the math.

Also catches: Wrong-time-period comparisons (year-over-year compared to month-over-month). Inconsistent rounding. Lossy floating-point comparisons.
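As a rough sketch, the shown-math line can be generated from the two inputs so the base is always explicit. The function names are hypothetical; `Decimal` is used here so the displayed arithmetic is exact and avoids binary floating-point noise.

```python
from decimal import Decimal

def percent_change(previous, current):
    """Return (delta, percentage), always using the previous value as the base."""
    previous, current = Decimal(previous), Decimal(current)
    if previous == 0:
        raise ValueError("percentage change is undefined for a zero base")
    delta = current - previous
    return delta, delta / previous * 100

def show_math(previous, current):
    """Spell out every step of the calculation in one line."""
    delta, pct = percent_change(previous, current)
    base = Decimal(previous)
    return (
        f"Previous: ${base:,}. Current: ${Decimal(current):,}. "
        f"Delta: {delta:+,}. Percentage: {delta:,} / {base:,} = {pct:.1f}%."
    )

print(show_math(50_000, 55_200))
```

Because the base appears in the output, a reader can tell at a glance whether the percentage was taken "of" the new value or "from" the old one.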

Step 8: Declare assumptions

What it does: Lists every assumption the agent made — about ambiguous question parts, about which dataset to use, about timezone, about aggregation function.

Failure mode it catches: Implicit assumptions that turn out to be wrong. The agent assumed timestamps were in UTC; they were actually in America/Los_Angeles. The agent assumed revenue excluded shipping; it included shipping. The agent assumed null values meant zero; they actually meant unknown.

Format:

“Assumptions:

‘This quarter’ = calendar Q1 2026 (Jan 1 - Mar 31, UTC)

Revenue is gross sales, including tax and shipping

Excluded test orders (orders with email matching @testdata.com)

14 orders had null shipping_cost; treated as 0

Currency conversion at last-of-month rate”

The user reads the assumptions. If any are wrong, they correct in plain English. The agent reruns with the corrected assumptions.
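One way to keep the assumption list honest is to record each assumption at the point where it is applied. The sketch below does this for the null-handling case from the example; the helper name and its return shape are assumptions, not part of the protocol.

```python
def sum_with_null_policy(values, column):
    """Sum a column, treating None as 0, and report how many values were filled.

    Returns (total, assumption), where assumption is a sentence suitable for
    the assumptions list, or None if no nulls were encountered.
    """
    null_count = sum(1 for v in values if v is None)
    total = sum(v for v in values if v is not None)
    assumption = (
        f"{null_count} rows had null {column}; treated as 0" if null_count else None
    )
    return total, assumption

assumptions = []
total, note = sum_with_null_policy([4.5, None, 3.0, None], "shipping_cost")
if note:
    assumptions.append(note)

print(total)
print(assumptions)
```

Generating the sentence from the same code path that applies the policy means the declared assumption cannot drift from what the computation actually did.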

Step 9: State the conclusion clearly

What it does: Produces a one-sentence final answer at the bottom, after all the verbose verification above.

Failure mode it catches: Confused readers. The protocol’s first 8 steps produce a wall of text that is necessary for trust but hard to skim. Step 9 distills the answer into a single declarative sentence so the reader who already trusts the agent can get the bottom line in 2 seconds.

Format:

“Net revenue Q1 2026: $487,300, up 18% from Q4 2025.”

That’s the line. Above it: all the work. Below it: the next question.

How this changes the user experience

In Standard mode, you ask a question and get an answer in 3 seconds. In Deep Analysis mode, the same question takes 30-60 seconds because the protocol runs. You watch the steps fire in real time — each tool call is labeled with which step it’s serving.

The result is slower but boring in a good way: the number is the number, and you don’t have to second-guess it. For decisions that matter, a 30-60 second wait is the right price.

What we’re still working on

A few protocol enhancements on the roadmap:

Confidence intervals on results. When the underlying data is sparse or noisy, the agent should say “MRR last month was $48,200 ± $2,100 (95% CI)” rather than asserting a point estimate. This is harder than it sounds — most analytical queries don’t have natural error bars — but for forecasted or sampled values, it’s tractable.

Counterfactual checking. “If you change Assumption 3 from X to Y, the answer becomes Z.” Lets the user explore sensitivity to specific assumptions without re-asking the whole question.

Multi-agent cross-verify. For the highest-stakes questions, run the same question through two independent agent instances with different models (e.g., Claude + GLM) and surface any disagreement. Expensive, but the right shape for board-meeting-grade numbers.

Reasoning replay. Save the protocol trace and let the user re-run with the same trace 6 months later. Verifies that the analysis is reproducible against the same data, and shows what changed if rerun against current data.

The honest summary

Verifiable Reasoning isn’t a silver bullet. It catches a lot of common errors and surfaces the agent’s assumptions to the user — but it can’t guarantee correctness, because no protocol can. What it can do is move agent-generated numbers from “trust at your peril” to “trust the same way you’d trust a careful junior analyst.”

For most analytical work, that’s the bar that matters. A careful junior shows their work, declares their assumptions, sanity-checks the obvious errors, and produces answers you can defend in a meeting. The Verifiable Reasoning protocol is our attempt to make the agent that careful by default.

If you want to see the protocol fire on your own data, switch any Tablize chat to Deep Analysis mode (the toggle is above the message input) and ask a question that involves a non-trivial aggregation.

