Methodology

How a diagnostic is run and what it is not.

A test bank authored by a practitioner, a model run with its own scaffolding untouched, and a grade made by a senior reader rather than another model. The method is deliberately narrow, and honest about its edges.

How the test bank is built

The bank is practitioner-authored. It is not LLM-generated and not crowd-sourced. Items are anchored in textbook structural conventions purchase accounting, no-arbitrage pricing, IFRS 9 classification, covenant mechanics and validated against real-world reference treatments.

Each item exists to probe a specific named failure mode in the taxonomy. The taxonomy is the public face of the bank; the items themselves stay private.

How a model is run against it

The diagnostic runs your deployed model or a scoped subset of it with your prompt scaffolding preserved. There is no prompt re-engineering on my side. The bank is the constant; your product is the variable.

This matters: the goal is to find how your system behaves as shipped, not how a cleaner prompt might behave in a lab.

How responses are graded

Grading is senior-practitioner structural grading. An LLM-as-judge is used only as a pre-filter for arithmetic and exact-extraction items never for a final structural verdict.

A leaderboard judge at temperature zero scales well, but it cannot make a senior structural call that a model treated a liquidation preference and conversion value as additive rather than greater-of. That call is the product.

What a deliverable looks like

A report: an executive summary, a per-subdomain finding list, severity-weighted prioritisation, and remediation framed as test cases you can fold into your own internal regression suite.

The findings are typed and rated. Nothing in the deliverable is a black-box verdict; every finding states the mechanism and the correct treatment so your team can verify it independently.

The failure-mode typology

Every finding carries a type.

Structural

The wrong framework, applied confidently

The model uses a coherent method that is wrong for the case. A yield to maturity where a recovery PV is required.

Arithmetic

A calculation or convention slip

The framework is right; the math or sign is not. Scaling daily VaR by 252 rather than its square root.

Hallucination

An asserted fact with no basis

A figure, term, or treatment invented to fit. Revenue synergies counted that the standalone plan already includes.

Disclosure

A material omission or unflagged judgment

A treatment that is defensible but left unstated where it changes the answer. Interest classified as operating under one framework, financing under another, with no note.

The severity rubric

And a severity, scoped to consequence.

Critical

Would materially distort a deliverable a senior practitioner signs. The error survives ordinary review and changes a decision.

Material

Would require re-work. Wrong enough to matter, visible enough that a careful second pass should catch it.

Minor

Would be caught in review. A slip that a competent checker removes before the work leaves the desk.

Observation

Style or convention drift, not error. Noted for completeness; no action implied.

The layered relationship

Horizontal evaluation platforms answer what a model scored. Practitioner diagnostics answer how it broke.

What this diagnostic is not

This is not a regulatory conformity assessment. It is not a model-risk validation under SR 11-7 or any analogous framework. It is not a public benchmark.

It is a structural error catalogue, scoped to a customer's deployed surface useful as an input to a validation team's own work, never a substitute for it.

Sample findings shown anywhere on this site are illustrative reconstructions, not real client work, not real prompts, and not real model traces. No client system is ranked, named, or disclosed.