ENGINEERING

Verifiable Reasoning: how to make AI trust its own numbers

2026-05-23 · Tablize Team

LLMs hallucinate. This is news to no one. But there’s a specific category of hallucination that breaks data tools harder than any other: confidently producing the wrong number.

A language model will tell you “MRR is $47,200, up 12% from last month” with exactly the same tone whether the number is right or whether the model misread one column and silently aggregated the wrong field. The output is plausible. The output is wrong. You won’t catch it unless you check the SQL — and the whole point of using a Data Agent is to not have to check the SQL every time.

This post is about how we built Deep Analysis (internally called Rigorous mode) into Tablize to make the agent slow down on hard questions and verify its own work. It’s a 9-step protocol called Verifiable Reasoning, and the design decisions behind it are surprisingly interesting.

The problem in one example

Suppose you ask a Data Agent: “What was my net revenue last quarter, accounting for refunds and chargebacks?”

A naive agent will:

See “net revenue” in the question.
Find a revenue column in your orders table.
Sum it for the quarter.
Return a confident number.

The number is wrong in at least three ways. First, “revenue” in orders is gross order value, not net of returns. Second, chargebacks live in a different table (disputes). Third, “last quarter” is ambiguous — calendar quarter? Fiscal quarter? Trailing 90 days?

The agent doesn’t notice any of this because the model is producing the most probable answer given a question, not the most correct answer given the data. The two diverge constantly.
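The third ambiguity can be made concrete with a few lines of date arithmetic. The sketch below is illustrative only: the function names are hypothetical, and a fiscal quarter is left out because it depends on a fiscal-year start date that the question does not supply.

```python
from datetime import date, timedelta


def previous_calendar_quarter(today: date) -> tuple[date, date]:
    """First and last day of the calendar quarter before the one containing `today`."""
    current_q_start = date(today.year, 3 * ((today.month - 1) // 3) + 1, 1)
    last_day = current_q_start - timedelta(days=1)
    first_day = date(last_day.year, 3 * ((last_day.month - 1) // 3) + 1, 1)
    return first_day, last_day


def trailing_90_days(today: date) -> tuple[date, date]:
    """The 90 full days ending yesterday."""
    return today - timedelta(days=90), today - timedelta(days=1)


today = date(2026, 5, 23)
print(previous_calendar_quarter(today))  # 2026-01-01 through 2026-03-31
print(trailing_90_days(today))           # 2026-02-22 through 2026-05-22
```

The two windows overlap by only about five weeks, so a sum over one can differ substantially from a sum over the other.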

Why “just add more thinking tokens” doesn’t fix it

Extended thinking helps, but not nearly enough. We tested it: giving the model 4-16K tokens of scratch space before answering improves accuracy on multi-step questions by maybe 15-25%. That’s nice, but it’s not the difference between “ship to production” and “shouldn’t ship.”

The issue is that thinking is still confined to the model’s head. It doesn’t actually look at the data. It guesses what the data probably contains, reasons about that guess, and produces an answer that’s self-consistent with the guess — but not necessarily consistent with reality.

What we needed wasn’t more thinking. It was more checking.

The 9-step Verifiable Reasoning Protocol

Deep Analysis mode in Tablize runs every complex analytical question through 9 steps. Not all 9 fire on every question — the agent picks which to run based on the question shape — but the protocol is the upper bound on rigor.

1. Clarify the question (if needed)

If the question is ambiguous in a way that materially affects the answer, the agent asks. “Last quarter” is ambiguous. “Top 10 customers” is ambiguous (by lifetime value? by last 30 days? by gross or net?). The agent decides whether to ask or pick a sensible default and disclose the assumption.

The heuristic: ask if a wrong assumption would change the answer by more than 10%. Pick a default if the difference is stylistic (e.g., “top 10” vs “top 20”).
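One way to read the 10% heuristic is as a spread test over candidate interpretations. The sketch below assumes the agent can cheaply estimate the answer under each reading; `should_ask` and the choice to measure the spread against the smallest estimate are assumptions for illustration, not a description of Tablize internals.

```python
def should_ask(candidate_answers: dict[str, float], threshold: float = 0.10) -> bool:
    """Return True when interpretations disagree by more than `threshold`.

    `candidate_answers` maps an interpretation label to a rough estimate
    of the answer under that interpretation (hypothetical input).
    """
    values = list(candidate_answers.values())
    if len(values) < 2:
        return False  # nothing to disambiguate
    spread = max(values) - min(values)
    smallest = min(abs(v) for v in values)
    if smallest == 0:
        return spread != 0  # any disagreement with a zero estimate is material
    return spread / smallest > threshold


# Calendar quarter vs trailing 90 days, with made-up estimates:
print(should_ask({"calendar_quarter": 120_000, "trailing_90_days": 140_000}))  # True
print(should_ask({"calendar_quarter": 120_000, "trailing_90_days": 125_000}))  # False
```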

2. Make a data plan

Before writing any query, the agent writes a plan: which tables to use, which columns from each, which joins, which filters, which aggregations. This is held in the agent’s context and shown to you in the response.

The data plan catches a lot of errors before they cost compute. If the agent’s plan says “join orders to refunds on order_id” and your refunds table actually uses payment_id as the link, you’ll see this before the query runs.
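A plan like this can also be checked mechanically against the schema before anything runs. The following is a minimal sketch under assumed names (`DataPlan`, `Join`, `check_plan`); the real plan format is not described in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Join:
    left_table: str
    right_table: str
    key: str  # column expected to exist on both sides


@dataclass
class DataPlan:
    tables: dict[str, list[str]]  # table name -> columns the plan reads
    joins: list[Join] = field(default_factory=list)


def check_plan(plan: DataPlan, schema: dict[str, set[str]]) -> list[str]:
    """Compare a plan with the known schema and return human-readable problems."""
    problems = []
    for table, columns in plan.tables.items():
        if table not in schema:
            problems.append(f"unknown table: {table}")
            continue
        for column in columns:
            if column not in schema[table]:
                problems.append(f"unknown column: {table}.{column}")
    for join in plan.joins:
        for table in (join.left_table, join.right_table):
            if table in schema and join.key not in schema[table]:
                problems.append(f"join key {join.key} missing from {table}")
    return problems


schema = {"orders": {"order_id", "amount"}, "refunds": {"payment_id", "amount"}}
plan = DataPlan(
    tables={"orders": ["order_id", "amount"], "refunds": ["amount"]},
    joins=[Join("orders", "refunds", "order_id")],
)
print(check_plan(plan, schema))  # ['join key order_id missing from refunds']
```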

3. Sample before aggregating

This is the big one. Before computing any aggregate, the agent samples the relevant tables — 10-100 rows depending on table size — and looks at the actual data.

Sampling catches schema misunderstandings the data plan can’t. If the agent thinks order_status has values paid / refunded / pending but it actually has PAID / REFUND / PENDING_PAYMENT, the filter WHERE order_status = 'paid' returns zero rows. The sample reveals this immediately.

It also catches data weirdness. A column named revenue that contains negative numbers for refund rows is a common pattern, and “sum of revenue” without filtering for non-refund rows gives a number that’s neither gross nor net.
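Both checks are easy to sketch. The example below uses an in-memory SQLite database as a stand-in for the real data source; `sample_rows` and `inspect_sample` are hypothetical helper names, and `ORDER BY RANDOM()` is used only because it is simple, not because it is how sampling should be done on a large table.

```python
import sqlite3


def sample_rows(conn: sqlite3.Connection, table: str, n: int = 50) -> list[dict]:
    """Fetch up to n random rows as dicts. `table` must come from a trusted schema listing."""
    cur = conn.execute(f'SELECT * FROM "{table}" ORDER BY RANDOM() LIMIT ?', (n,))
    names = [d[0] for d in cur.description]
    return [dict(zip(names, row)) for row in cur.fetchall()]


def inspect_sample(rows: list[dict], filter_column: str, filter_literal: str,
                   amount_column: str) -> list[str]:
    """Flag a filter literal that never appears and amounts that go negative."""
    findings = []
    seen = {row[filter_column] for row in rows}
    if filter_literal not in seen:
        values = sorted(str(v) for v in seen)
        findings.append(
            f"{filter_column} = {filter_literal!r} matches no sampled row; values seen: {values}"
        )
    if any(row[amount_column] is not None and row[amount_column] < 0 for row in rows):
        findings.append(f"{amount_column} has negative values; refunds may be stored as negative rows")
    return findings


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_status TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("PAID", 120.0), ("REFUND", -120.0), ("PENDING_PAYMENT", 80.0)],
)
for finding in inspect_sample(sample_rows(conn, "orders"), "order_status", "paid", "revenue"):
    print(finding)
```

With this toy table, both findings fire: the lowercase literal matches nothing, and the revenue column contains a negative refund row.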

4. Run the actual query

After the plan and the sample, run the real query. By this point, the agent has high confidence in the structure. The query is the small, low-risk step.

5. Sanity check the result

The agent compares the result to expected magnitudes. If your prior monthly revenue was around $50K and this month’s number comes back as $4.8M, the agent flags it as suspicious — probably a unit error (cents vs dollars), a forgotten filter, or a join multiplication.

Sanity checks are simple, mechanical, and catch a huge class of errors. The rule of thumb: if the answer is more than 10× above or below the recent baseline, treat it as suspect and dig deeper before reporting.
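The rule of thumb fits in a few lines. This is a sketch with a hypothetical function name; where the baseline comes from (last month, a trailing average) is left open here.

```python
def is_suspect(value: float, baseline: float, factor: float = 10.0) -> bool:
    """True when `value` is more than `factor` times above or below `baseline`."""
    if baseline == 0:
        return value != 0  # no meaningful ratio; any nonzero value deserves a look
    ratio = abs(value) / abs(baseline)
    return ratio > factor or ratio < 1 / factor


print(is_suspect(4_800_000, 50_000))  # True: 96x the baseline, roughly a cents-vs-dollars gap
print(is_suspect(47_200, 50_000))     # False: within the normal range
```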

6. Cross-verify with an independent query

For the highest-stakes questions, the agent computes the answer two different ways. If “net revenue this quarter” is computed via SELECT SUM(amount) FROM orders WHERE status = 'completed' GROUP BY quarter and also via SELECT SUM(amount) FROM stripe_charges WHERE captured = true AND created BETWEEN ..., the two should match within rounding. If they don’t, at least one of them is wrong, and the agent surfaces the discrepancy rather than silently picking a winner.

Cross-verification is expensive (two queries instead of one), so the agent only runs it when the question warrants it. The rule: any answer that will drive a decision worth more than the cost of the extra query gets cross-verified.
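The comparison itself is small once both queries exist. The sketch below assumes each query returns a single numeric value and uses a one-cent absolute tolerance as a stand-in for “within rounding”; `scalar` and `cross_verify` are hypothetical names, and SQLite is used only so the example runs anywhere.

```python
import math
import sqlite3


def scalar(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> float:
    """Run a query expected to return one numeric value; treat NULL as 0."""
    row = conn.execute(sql, params).fetchone()
    return float(row[0]) if row and row[0] is not None else 0.0


def cross_verify(conn: sqlite3.Connection, sql_a: str, sql_b: str,
                 abs_tol: float = 0.01) -> dict:
    """Compute the same quantity two ways and report both values, not a winner."""
    a = scalar(conn, sql_a)
    b = scalar(conn, sql_b)
    return {
        "a": a,
        "b": b,
        "difference": a - b,
        "match": math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol),
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL, status TEXT)")
conn.execute("CREATE TABLE stripe_charges (amount REAL, captured INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(100.0, "completed"), (50.0, "completed"), (25.0, "pending")])
conn.executemany("INSERT INTO stripe_charges VALUES (?, ?)",
                 [(100.0, 1), (50.0, 1), (10.0, 1)])

result = cross_verify(
    conn,
    "SELECT SUM(amount) FROM orders WHERE status = 'completed'",
    "SELECT SUM(amount) FROM stripe_charges WHERE captured = 1",
)
print(result["match"], result["difference"])  # False -10.0
```

Returning both values and the difference, rather than a single number, is what lets the discrepancy be surfaced instead of hidden.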

7. Inline math

If the answer involves any arithmetic (percentages, deltas, ratios), the agent shows the math inline. Not just “MRR grew 12%” but “MRR grew from $42K to $47K, a delta of $5K, which is $5K / $42K = 11.9%.”

This sounds pedantic. It is the single best defense against the worst kind of LLM error, which is computing a percentage from the wrong base. A 12% growth from $42K is $5,040. A 12% drop to $42K means you started at $47,727. Different number, often confused. Showing the math forces the model to be explicit about which formula it used.
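The arithmetic above can be reproduced directly. This is a small sketch with hypothetical helper names; it assumes a nonzero starting value and does nothing beyond making the base of each percentage explicit.

```python
def explain_change(old: float, new: float) -> str:
    """Spell out a percentage change with its base. Assumes old != 0."""
    delta = new - old
    pct = delta / old * 100
    return (f"from ${old:,.0f} to ${new:,.0f}, a delta of ${delta:,.0f}, "
            f"which is ${delta:,.0f} / ${old:,.0f} = {pct:.1f}%")


def start_before_drop(end_value: float, drop_pct: float) -> float:
    """Starting value that lands on `end_value` after a drop of `drop_pct` percent."""
    return end_value / (1 - drop_pct / 100)


print(explain_change(42_000, 47_000))
# from $42,000 to $47,000, a delta of $5,000, which is $5,000 / $42,000 = 11.9%

print(round(42_000 * 0.12))                  # 5040: 12% growth measured from $42K
print(round(start_before_drop(42_000, 12)))  # 47727: start value for a 12% drop to $42K
```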

8. Declare assumptions

Every assumption the agent made — about ambiguous question parts, about which dataset to use, about which timezone, about which aggregation function — is listed at the end of the answer in an “Assumptions” section.

This serves two purposes. First, it lets you spot wrong assumptions. Second, it forces the agent to be aware of its assumptions, which in our experiments reduces the rate at which it makes implicit ones.
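One simple way to keep assumptions from going implicit is to record each one at the moment it is made and render the list at the end. The class below is a hypothetical sketch of that idea, not the format Tablize actually emits.

```python
from dataclasses import dataclass, field


@dataclass
class AssumptionLog:
    """Collects (topic, choice) pairs as the analysis proceeds."""
    items: list[tuple[str, str]] = field(default_factory=list)

    def note(self, topic: str, choice: str) -> None:
        self.items.append((topic, choice))

    def render(self) -> str:
        if not self.items:
            return "Assumptions: none"
        lines = ["Assumptions"]
        lines += [f"- {topic}: {choice}" for topic, choice in self.items]
        return "\n".join(lines)


log = AssumptionLog()
log.note("last quarter", "previous calendar quarter")
log.note("timezone", "UTC")
print(log.render())
```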

9. State the conclusion clearly

The final answer is a single short, declarative sentence, backed by the layers above. It doesn’t say “the data suggests” or “it appears that”; the final line says what the number is and what it means.

The verbose-but-honest version of the answer goes above. The clean version goes at the bottom. You can read either or both.

What the agent looks like in Deep Analysis mode

In the Tablize UI, when Deep Analysis is enabled, you see all 9 steps as the agent works through them. Tool calls are tagged with verify and have a gold left border, distinguishing them from regular db_query calls. The agent’s reasoning text shows the protocol — “Step 3: sampling 50 rows from orders to confirm column semantics…” — so you can follow along.

This is slower than Standard mode. Token cost is 2-3× higher, and latency goes up 50-100%. That’s the tradeoff. For 80% of questions (“how many users signed up yesterday”), the cost isn’t worth it and you stay in Standard. For the 20% that matter (“what’s our real net revenue last quarter accounting for refunds”), the cost is the price of getting the right answer.

The toggle is right above the chat input. You can flip it at any time, even mid-conversation. The mode applies to the next message, not the whole session.

Why this matters for the Data Agent category

Trust in AI numbers is the single biggest barrier to LLMs being used for analytical work. Most other categories of LLM application (code completion, copywriting, customer support) have a human in the loop who can spot a wrong output. With data analysis, the wrong number can drive a business decision before anyone notices.

The Data Agent shape only works if the answers are right. “Mostly right” doesn’t cut it — a 5% error rate on financial numbers is a deal-killer. Verifiable Reasoning is our bet on how to get from “mostly right” to “right enough to ship.”

We’re not done. The next thing we’re working on: surfacing confidence intervals on the agent’s numbers, so you can see at a glance which answers are well-grounded in the data and which involve modeling assumptions. The honest answer for some questions is “I’m 70% sure” and we’d rather say that than give you a confident wrong number.

How to try it

Deep Analysis is on the Plus tier and up. Open any chat in Tablize, find the mode toggle above the message input, switch to Deep Analysis, and ask a hard question. You’ll see the protocol fire.

For the most interesting output, ask the kind of question you’d give a smart analyst on their first day with your data: something that requires understanding the schema, joining across tables, and being careful about edge cases. The 9 steps light up.

Try Deep Analysis on your own data →
