How a diagnostic is run and what it is not.
A test bank authored by a practitioner, a model run with its own scaffolding untouched, and a grade made by a senior reader rather than another model. The method is deliberately narrow, and honest about its edges.
How the test bank is built
The bank is practitioner-authored. It is not LLM-generated and not crowd-sourced. Items are anchored in textbook structural conventions (purchase accounting, no-arbitrage pricing, IFRS 9 classification, covenant mechanics) and validated against real-world reference treatments.
Each item exists to probe a specific named failure mode in the taxonomy. The taxonomy is the public face of the bank; the items themselves stay private.
How a model is run against it
The diagnostic runs your deployed model, or a scoped subset of it, with your prompt scaffolding preserved. There is no prompt re-engineering on my side. The bank is the constant; your product is the variable.
This matters: the goal is to find how your system behaves as shipped, not how a cleaner prompt might behave in a lab.
How responses are graded
Grading is senior-practitioner structural grading. An LLM-as-judge is used only as a pre-filter for arithmetic and exact-extraction items, never for a final structural verdict.
A leaderboard judge at temperature zero scales well, but it cannot make the senior structural call that a model treated a liquidation preference and a conversion value as additive rather than greater-of. That call is the product.
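The greater-of point can be made concrete with a minimal sketch. It assumes a non-participating convertible preferred, where the holder takes either the liquidation preference or the as-converted value, not both. The function names and figures are hypothetical and are not drawn from any bank item.

```python
def greater_of_payoff(liq_pref: float, conversion_value: float) -> float:
    """Non-participating convertible preferred: the holder takes the better claim."""
    return max(liq_pref, conversion_value)


def additive_payoff(liq_pref: float, conversion_value: float) -> float:
    """The structural error: treating the two claims as cumulative."""
    return liq_pref + conversion_value


# Hypothetical figures: a 10.0 preference and a 25% as-converted stake in an 80.0 exit.
liq_pref = 10.0
conversion_value = 0.25 * 80.0

correct = greater_of_payoff(liq_pref, conversion_value)
wrong = additive_payoff(liq_pref, conversion_value)
overstatement = wrong - correct  # the additive reading overstates by the smaller claim
```

Both answers are arithmetically clean, which is why an arithmetic pre-filter would pass either one. A participating preferred would pay differently, so the check only holds under the stated assumption about the instrument.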
What a deliverable looks like
A report: an executive summary, a per-subdomain finding list, severity-weighted prioritisation, and remediation framed as test cases you can fold into your own internal regression suite.
The findings are typed and rated. Nothing in the deliverable is a black-box verdict; every finding states the mechanism and the correct treatment so your team can verify it independently.
Every finding carries a type.
The wrong framework, applied confidently
The model uses a coherent method that is wrong for the case. A yield to maturity where a recovery PV is required.
A calculation or convention slip
The framework is right; the math or sign is not. Scaling daily VaR by 252 rather than its square root.
An asserted fact with no basis
A figure, term, or treatment invented to fit. Revenue synergies counted even though the standalone plan already includes them.
A material omission or unflagged judgment
A treatment that is defensible but left unstated where it changes the answer. Interest classified as operating under one framework, financing under another, with no note.
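The calculation-slip example among the types above (scaling daily VaR by 252 rather than its square root) can be checked numerically. This is a minimal sketch under the usual square-root-of-time assumption of independent, identically distributed returns; the 252-day count, the function names, and the input figure are illustrative assumptions.

```python
import math

TRADING_DAYS = 252  # common day-count convention, assumed here


def scale_var(daily_var: float, days: int = TRADING_DAYS) -> float:
    """Square-root-of-time scaling; assumes i.i.d. returns."""
    return daily_var * math.sqrt(days)


def scale_var_linear(daily_var: float, days: int = TRADING_DAYS) -> float:
    """The slip: scaling risk linearly in time."""
    return daily_var * days


daily_var = 1_000_000.0  # hypothetical one-day VaR
correct = scale_var(daily_var)
wrong = scale_var_linear(daily_var)
ratio = wrong / correct  # equals sqrt(252), roughly 15.9
```

The framework is the same in both functions; only the scaling convention differs, and the linear version overstates the figure by a factor of about sixteen.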
And a severity, scoped to consequence.
Would materially distort a deliverable a senior practitioner signs. The error survives ordinary review and changes a decision.
Would require re-work. Wrong enough to matter, visible enough that a careful second pass should catch it.
Would be caught in review. A slip that a competent checker removes before the work leaves the desk.
Style or convention drift, not error. Noted for completeness; no action implied.
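One way to picture a typed and rated finding is as a small record that also states the mechanism and the correct treatment. The sketch below is an assumption about shape only: the enum labels, field names, subdomains, and example ratings are hypothetical, and sorting by severity is a simplification of severity-weighted prioritisation.

```python
from dataclasses import dataclass
from enum import Enum


class FindingType(Enum):
    WRONG_FRAMEWORK = "wrong framework, applied confidently"
    CALCULATION_SLIP = "calculation or convention slip"
    UNSUPPORTED_ASSERTION = "asserted fact with no basis"
    OMISSION = "material omission or unflagged judgment"


class Severity(Enum):
    # Hypothetical labels, ordered from most to least consequential.
    MATERIAL = 1      # would materially distort a signed deliverable
    REWORK = 2        # would require re-work
    REVIEW_CATCH = 3  # would be caught in review
    STYLE = 4         # style or convention drift, not error


@dataclass(frozen=True)
class Finding:
    subdomain: str
    finding_type: FindingType
    severity: Severity
    mechanism: str
    correct_treatment: str


def prioritise(findings: list[Finding]) -> list[Finding]:
    """Most consequential first; ties keep input order because sorted() is stable."""
    return sorted(findings, key=lambda f: f.severity.value)


# Illustrative records only; the ratings are not real assessments.
findings = [
    Finding("market risk", FindingType.CALCULATION_SLIP, Severity.REWORK,
            "Daily VaR scaled by 252", "Scale by the square root of 252"),
    Finding("preferred equity", FindingType.WRONG_FRAMEWORK, Severity.MATERIAL,
            "Preference and conversion value summed", "Take the greater of the two"),
]
ranked = prioritise(findings)
```

Because each record carries its mechanism and correct treatment, a team could turn it into a regression case by pairing it with the prompt that triggered it.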
Horizontal evaluation platforms answer what a model scored. Practitioner diagnostics answer how it broke.
This is not a regulatory conformity assessment. It is not a model-risk validation under SR 11-7 or any analogous framework. It is not a public benchmark.
It is a structural error catalogue, scoped to a customer's deployed surface: useful as an input to a validation team's own work, never a substitute for it.
Sample findings shown anywhere on this site are illustrative reconstructions, not real client work, not real prompts, and not real model traces. No client system is ranked, named, or disclosed.