Bounding the Uncertainty: An Error Taxonomy for Non-Deterministic AI in Regulated Life Sciences

May 25

The Probabilistic Problem

An LLM deployed for deviation triage identifies a root cause with 86% confidence. Investigation shows the output is wrong: the identified root cause is incorrect. The quality team flags it. But when they write the deviation report, they face a question the existing quality system has no answer for: what kind of error was it?

In traditional computerized systems, errors are well-understood. The system is deterministic: it either executes the instruction correctly or it doesn't. When it fails, the failure is classifiable within existing frameworks: software bug, configuration error, data integrity failure, hardware malfunction. The GAMP AI Guide (led by Brandi Stockton, Eric Staib and Martin Heitmann) and the ML Risk & Control Framework (Blumenthal, Erdmann, Heitmann, Lemettinen, and Stockton, 2024) establish the risk management process. This error taxonomy provides the failure mode inventory that the process operates against, specifically for generative and probabilistic AI systems that don’t cleanly fit the traditional ML lifecycle model.

Probabilistic AI systems break this model. An LLM can produce output that is:

Entirely fabricated (hallucination)
Partially correct but misleadingly assembled
Technically accurate but contextually inappropriate
Right in content but wrong in confidence (miscalibration)
Within scope on most outputs but exceeding scope on edge cases

Each of these is a fundamentally different failure mode with different root causes, different risk profiles, and different corrective actions. Treating them all as "the AI was wrong" is the equivalent of classifying every deviation as "equipment malfunction": it tells you nothing actionable and prevents meaningful trending.

The pharmaceutical industry needs a standardized error taxonomy for probabilistic AI systems in GxP drug development workflows: one that classifies failures by type and by origin point, maps them to severity, and enables the same kind of systematic root cause analysis and trending we apply to every other quality event logged in the QMS.

QA professionals will recognize the structure: this is FMEA-shaped, decomposing failure modes by manifestation (class) and cause (origin), with risk tiering applied through the FDA seven-step framework. The familiarity is intentional. The substantive content addresses what FMEA was not built for: probabilistic systems where failures don't have stable identities and reproducibility is bounded rather than absolute. To the author’s knowledge, this is the first public two-dimensional matrix anchored in GxP context.

The Two-Dimensional Probabilistic Error Taxonomy: Error Type x Origin

Classifying output errors from probabilistic systems necessitates a shared vocabulary for understanding what “wrong” looks like.

Probabilistic errors don’t have fixed identities, so classifying them by surface category alone obscures meaningful information about the error. The taxonomy therefore has two dimensions: error type and origin. Together they let organizations address each error with appropriate architectural controls. Instead of stopping at the label “fabrication” for an LLM-generated adverse event, we record the label and then trace where in the pipeline the error originated. A fabrication caused by training data that is poorly representative of the use case requires a different root cause investigation than a fabrication caused by a model inference error.

Dimension 1: Error Type - What Went Wrong

Class 1: Fabrication: These are your classic “hallucinations.” In regulated settings, one way to reduce this class of error is Retrieval-Augmented Generation (RAG). See the GAMP AI Guide’s RAG infrastructure controls.

1a: Citation/reference fabrication

1b: Data point fabrication (dosages, study results, AE data)

1c: Entity fabrication (drugs, conditions, sites, identifiers that don't exist)

Examples include:

“Made up” citations or references (e.g., an SOP that does not exist).
Invented data points (e.g., dosage, manufacturing yield).
Hallucinated adverse events not present in patient reports.

GAMP 5 Second Edition (2022) Appendix M11 – IT Infrastructure and the new ISPE GAMP AI Guide (2025) treat the vector store as a regulated record store requiring IQ/OQ verification, backup/restore testing, and integrity checks.

Class 2: Misinterpretation: Misinterpretation includes recognition errors, reasoning errors, and omission.

2a: Recognition errors (entity confusion, semantic similarity)

2b: Reasoning errors (logical/inferential failures from correct recognition)

2c: Omission (failure to surface available relevant information)

Examples include:

An LLM used to flag environmental monitoring limit alerts fails to identify an excursion because it conflates the action limit with the alert limit.
Jin Q, Chen F, Zhou Y, et al., "Hidden Flaws Behind Expert-Level Accuracy of Multimodal GPT-4 Vision in Medicine," npj Digital Medicine 7 (2024) reported that "GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (35.5%), most prominent in image comprehension (27.2%)".
In a study by Wong et al. (JAMA Internal Medicine, 2021), the model failed to surface 67% of sepsis cases in its validation cohort. The cases were present in the EHR, yet the model failed to flag them.

Omission is classified here as a subclass of Misinterpretation because it represents a failure in processing available information. However, practitioners should note that omission requires distinct test strategies, specifically, completeness testing against a defined reference set, and may warrant elevation to a standalone class as the taxonomy matures.
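Completeness testing against a defined reference set can be sketched in a few lines. This is a minimal illustration rather than a prescribed method: the function name, the case identifiers, and the idea of scoring by recall are assumptions made for the example.

```python
def completeness(surfaced, reference):
    """Score surfaced items against a defined reference set.

    Returns (recall, missed), where missed lists the reference items
    the model failed to surface (candidate Class 2c omissions).
    """
    reference = set(reference)
    if not reference:
        raise ValueError("reference set must not be empty")
    found = reference & set(surfaced)
    missed = sorted(reference - found)
    return len(found) / len(reference), missed


# Hypothetical identifiers, for illustration only.
reference_cases = {"CASE-01", "CASE-02", "CASE-03", "CASE-04"}
model_flagged = {"CASE-01", "CASE-03", "CASE-99"}

recall, missed = completeness(model_flagged, reference_cases)
print(f"recall={recall:.2f} missed={missed}")
```

Items the model surfaced that are not in the reference set (CASE-99 here) are deliberately ignored by this check; they belong to a separate precision or fabrication test.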

Class 3: Contextual Misapplication: The model produces results that are outside the bounds of appropriate contextual constraints, while perhaps appearing “correct” on the surface.

3a: Population/context mismatch (pediatric vs. geriatric, jurisdiction)

3b: Temporal validity (withdrawn guidance, version drift, knowledge cutoff)

3c: Specification non-compliance (output doesn't match instruction/format/scope)

Examples include:

Clinical LLMs citing withdrawn or superseded FDA guidance documents.
Model including data from an irrelevant reference population (e.g., including data from both the EU and US when queried for EU data only).
Model omitting required regulatory section headings from a submission document.

While 3a and 3b involve failures of contextual judgment, 3c involves failure to adhere to explicit procedural constraints. It is included here because the underlying mechanism is the same: the model produced output outside the bounds of its specified operating context.

Class 4: Confidence Miscalibration: The model hedges where certain, is confidently wrong, or collapses ambiguous evidence into a clean binary.

4a: Overconfidence (high confidence on fabrication or error)

4b: Under-confidence/hedging (excessive uncertainty on established facts)

4c: Failure to flag conflicting evidence

Examples include:

Overconfidence: maximum-confidence statement on fully fabricated content.
Under-confidence/hedging: in a hypothetical example, a model reports 32% confidence in a deviation root cause when the evidence supports roughly 88%.
A model summarizing a regulatory dossier encounters two sources that contradict each other on a stability specification and weights one more heavily without flagging the conflict.
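One common way to quantify miscalibration is a binned expected calibration error (ECE), which compares stated confidence with observed accuracy. The sketch below is an illustration under assumptions: ECE is a choice made for this example rather than a metric the taxonomy prescribes, and the sample values are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |mean confidence - accuracy| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Overconfident model (4a): claims 95% but is right one time in four.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, False, False, False]
)
# Calibrated model: claims 50% and is right half the time.
calibrated = expected_calibration_error([0.5, 0.5], [True, False])
print(round(overconfident, 2), calibrated)
```

ECE does not distinguish overconfidence (4a) from under-confidence (4b), so the sign of the per-bin gap would need to be reported alongside it.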

Class 5: Boundary Violation:

5a: Scope creep

5b: Authority creep

5c: Adversarial boundary breach

Examples include:

Weissman et al. (2025) found that LLMs can and often do produce device-like clinical decision support when explicitly instructed not to.
LLM recommendations superseding professional judgment calls (for example, a study in which UK GPs switched from a correct to an incorrect prescription in 5.2% of CDSS-advised cases where the CDSS issued incorrect advice).
Lee, Jun, Lee, Cho, Park & Suh (2025) found that in a sample of 216 controlled patient-LLM dialogues prompt-injection attacks succeeded in 94.4% of trials, including 91.7% of high-harm trials.

Class 6: Population Bias: The model produces responses that may be accurate in aggregate but inaccurate for specific sub-populations less well represented in training data.

6a: Demographic subgroup bias

6b: Pharmacovigilance/safety-signal bias

6c: Site/transferability bias

Examples include:

Larrazabal et al. (2020) found a “consistent decrease in performance for underrepresented genders when a minimum balance is not filled.”
Spontaneous-report systems under-report adverse drug reactions from populations with lower healthcare usage, meaning an LLM-based signal-detection tool trained on this data may infer false safety in those subgroups.
Models trained on one site’s process data perform poorly when deployed at a second site in the network.

Dimension 2: Origination Point - Where It Went Wrong

Six error origins exist broadly:

Training Data
Retrieval / RAG layer
Model Inference
Human-AI interface
Agent Orchestration
Supplier
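With both dimensions named, the matrix can be represented as a small data structure so that each quality event is logged against a (class, origin) cell and trended. The sketch below is an assumption about how one might encode it: the enum member names follow the class and origin labels in this article, while `ErrorEvent`, `trend_by_cell`, and the event identifiers are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ErrorClass(Enum):
    FABRICATION = 1
    MISINTERPRETATION = 2
    CONTEXTUAL_MISAPPLICATION = 3
    CONFIDENCE_MISCALIBRATION = 4
    BOUNDARY_VIOLATION = 5
    POPULATION_BIAS = 6


class Origin(Enum):
    TRAINING_DATA = 1
    RETRIEVAL = 2
    MODEL_INFERENCE = 3
    HUMAN_AI_INTERFACE = 4
    AGENT_ORCHESTRATION = 5
    SUPPLIER = 6


@dataclass(frozen=True)
class ErrorEvent:
    event_id: str  # hypothetical QMS record identifier
    error_class: ErrorClass
    origin: Origin


def trend_by_cell(events):
    """Count logged events per (class, origin) cell of the matrix."""
    return Counter((e.error_class, e.origin) for e in events)


# Illustrative events only.
events = [
    ErrorEvent("DEV-001", ErrorClass.FABRICATION, Origin.RETRIEVAL),
    ErrorEvent("DEV-002", ErrorClass.FABRICATION, Origin.MODEL_INFERENCE),
    ErrorEvent("DEV-003", ErrorClass.FABRICATION, Origin.RETRIEVAL),
]
cells = trend_by_cell(events)
(top_class, top_origin), top_count = cells.most_common(1)[0]
print(top_class.name, top_origin.name, top_count)
```

All three events would trend identically under a single "hallucination" label; keyed by cell, two of them point at the retrieval layer and one at inference.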

Origin 1: Training Data: Training data is the furthest “upstream” source of error for GenAI used in drug development workflows. Any generative AI application that utilizes foundation models inherits provenance risk. This exposure is greater than it may appear on the surface: replacing just 0.001% of training tokens with synthetic medical misinformation produced LLMs that propagate harmful content significantly more often than baseline, while remaining indistinguishable from baseline on traditional validation benchmarks (Alber DA et al. 2025).

1a: Data poisoning

1b: Data privacy

1c: Distribution shift

1d: Source error

Examples include:

Invented data points (e.g., dosage, manufacturing yield)
Hallucinated adverse events not present in patient reports
Memorization of PII/PHI
Memorization of copyrighted literature

Origin 2: Retrieval/RAG layer: RAG is the dominant architectural pattern in regulated pharma generative AI because it enables grounding in version-controlled, GxP-aligned knowledge bases (e.g., protocols, SOPs, CSRs, labels, regulatory guidelines). This class of failures originates in the retrieval pipeline of RAG-based systems.

2a: Chunking failures

2b: Embedding drift

2c: Citation fabrication/grounded hallucination

2d: Multi-hop/complex query

2e: Retriever-generator misalignment

Examples include:

Allamraju et al. (2025, PSC and MFC PubMed) show that retrieval performance is highly domain-dependent. Paragraph-group and LLM-based structure-aware chunking outperform naïve recursive splitting.
Changing the embedding model, tokenizer, or preprocessing without re-embedding the entire corpus produces a vector store with mixed-provenance vectors, and outputs that drift without a clear trace in the logs.
RAG reduces but does not eliminate “grounded hallucination” errors. In particular, fabricated citations occur when the model extrapolates beyond retrieved context.
A query that requires multiple RAG retrieval instances is more error-prone. For instance, “Which of our Phase 2 protocols use the same secondary endpoint as Study A and what was their dropout rate?”
A regulatory Q&A system retrieves the correct FDA guidance section but generates its response from parametric knowledge anyway.
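The mixed-provenance condition described under embedding drift (2b) is straightforward to check if each stored vector carries metadata naming the model that produced it. The sketch below assumes such metadata exists; the `chunk_id` and `embedding_model` keys, the model labels, and the function name are all hypothetical and not tied to any particular vector store.

```python
from collections import Counter


def embedding_provenance(records, expected_model):
    """Summarize which embedding model produced each stored vector.

    records: list of dicts with hypothetical 'chunk_id' and
    'embedding_model' keys exported from the vector store.
    Returns (counts per model, chunk ids not embedded by expected_model).
    """
    counts = Counter(r["embedding_model"] for r in records)
    stale = [r["chunk_id"] for r in records if r["embedding_model"] != expected_model]
    return counts, stale


# Illustrative export: one chunk was never re-embedded after a model change.
records = [
    {"chunk_id": "SOP-001#1", "embedding_model": "embed-v1"},
    {"chunk_id": "SOP-001#2", "embedding_model": "embed-v2"},
    {"chunk_id": "SOP-002#1", "embedding_model": "embed-v2"},
]
counts, stale = embedding_provenance(records, expected_model="embed-v2")
print(dict(counts), stale)
```

A check like this only works if provenance is written at ingestion time, which is an argument for treating that metadata as part of the regulated record.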

Origin 3: Model Inference: Failures at runtime even when training and retrieval are sound. These failure modes survive downstream of quality data and accurate retrieval.

3a: Hallucination (intrinsic vs. extrinsic)

3b: Sycophancy/prompt sensitivity

3c: Numerical errors

3d: Non-determinism

3e: Structured output failures

3f: Prompt injection

Examples include:

Ji et al. (2023) formalized this distinction in a comprehensive survey: an intrinsic hallucination occurs when a model summarizing a clinical trial report states the control arm showed superiority when the source text states the opposite; an extrinsic hallucination occurs when the same model introduces an adverse event finding that appears nowhere in the source document.
Perez et al. (2022) demonstrated that LLMs systematically shift their answers to align with the user's implied opinion when questions contain leading framing, even when the model "knows" the correct answer under neutral prompting.
Frieder et al. (2023) systematically evaluated GPT-4 on graduate-level mathematics and found consistent failures in multi-step arithmetic, symbolic manipulation, and modular arithmetic.
At identical temperature settings and system prompts, repeated inference runs on the same clinical vignette can produce materially different outputs: a phenomenon documented in the Chen, Zaharia & Zou (2023) behavioral-drift study.
LLMs tasked with producing JSON, XML, or tabular outputs frequently emit syntactically invalid structures.
Lee et al. (2025) found that prompt-injection attacks against six commercial medical LLMs succeeded in 94.4% of 216 controlled dialogues.
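Structured output failures (3e) are among the easier inference-origin errors to catch mechanically, because the output can be parsed and checked before anything downstream consumes it. The sketch below uses only the Python standard library; the required field names and types are a hypothetical schema for a deviation-triage output, not one defined in this article.

```python
import json

# Hypothetical schema for a deviation-triage response.
REQUIRED_FIELDS = {"deviation_id": str, "root_cause": str, "confidence": float}


def validate_output(raw):
    """Parse raw model output and check it against REQUIRED_FIELDS.

    Returns (parsed, errors). parsed is None when raw is not a JSON object.
    """
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(parsed, dict):
        return None, ["top-level value is not a JSON object"]
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in parsed:
            errors.append(f"missing field: {field}")
        elif not isinstance(parsed[field], expected_type):
            errors.append(f"wrong type for {field}")
    return parsed, errors


ok_parsed, ok_errors = validate_output(
    '{"deviation_id": "DEV-1", "root_cause": "gowning breach", "confidence": 0.86}'
)
_, truncated_errors = validate_output('{"deviation_id": "DEV-1",')
_, missing_errors = validate_output('{"deviation_id": "DEV-1"}')
print(ok_errors, truncated_errors, missing_errors)
```

A syntactically valid response can still be wrong in content; this check addresses 3e only and says nothing about 3a through 3d.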

Origin 4: Human-AI Interface: This class of failures arises from how humans interact with the AI system itself.

4a: Automation bias

4b: Confidence miscommunication

4c: Deskilling/workflow integration

4d: Prompt design

Examples include:

Goddard, Roudsari & Wyatt (International Journal of Medical Informatics, 2014) found that in 5.2% of prescribing cases, UK GPs switched from a correct answer to an incorrect answer after receiving clinical decision support advice.
When an LLM output is rendered in a polished UI with citations formatted as hyperlinks, users may interpret the presentation as evidence of retrieval grounding. These limitations should be made explicit to end users.
Over repeated use, operators who initially verified AI outputs begin to skip verification steps as trust accumulates.
Weissman, Mankowitz & Kanter (npj Digital Medicine, 2025) demonstrated that a single emotional reframing of a prompt collapsed FDA non-device guardrails in both GPT-4 and Llama-3, producing device-like clinical decision support that the system prompt explicitly prohibited.

Origin 5: Agent Orchestration: Failures originating in the orchestration layer of multi-agent or multi-step LLM systems. Orchestration failures occur when multiple agents, tools, or function calls compose into an error that none of the individual components would have produced alone:

5a: Readability collapse across trials

5b: Multi-step planning failures

5c: Compounding propagation (multiple errors interact across stages)

5d: Tool-use errors

Examples include:

In multi-agent document generation pipelines, iterative revision loops can degrade output quality.
The model structures its tasks in the wrong order, creating downstream dependency issues (for example, by scheduling the statistical analysis plan generation prior to receiving protocol endpoints from another agent).
In multi-hop reasoning chains, per-step accuracy declines significantly by the fourth or fifth step.
Agents may select an inappropriate tool; for example, an agent may call a pharmacovigilance database API with an incorrect MedDRA code.

Origin 6: Supplier: Failures rooted in components provided by an upstream vendor (e.g., foundation models, vendor data, vendor model updates, vendor SLA failures, and third-party platform changes). The ISPE GAMP AI Guide surfaces this concern explicitly with a supplier-quality planning section.

6a: Silent model updates

6b: Deprecation

6c: Multi-tenancy leakage

6d: DPA drift

Examples include:

GPT-4 silent retraining (Chen, Zaharia and Zou, 2023): when comparing March 2023 to June 2023 versions, GPT-4 USMLE accuracy decreased from 86.6% to 82.4%. This highlights the need for continuous monitoring.
Pharma organizations rely on predictability to plan change control and revalidation events. For a sponsor that had validated a regulatory submission workflow against a specific model endpoint, deprecation forces an unplanned revalidation.
Training data can be extracted from language models through targeted prompting; this presents concerns around patient-level data and proprietary parameters for sponsors.
When a foundation model provider silently migrates inference infrastructure across regions, for example, routing API calls through a new data center to manage capacity, the data residency and processing terms may shift without the regulated customer's knowledge.

GAMP 5 (Second Edition) suggests Category 5 treatment for foundation models, given potential for autonomous decision-making impact.
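Silent model updates (6a) are only detectable if something is re-run on a schedule. One possible control, sketched below, replays a fixed reference set against the vendor endpoint and compares per-item correctness with the validated baseline. The function name, the item identifiers, and the 10-percentage-point tolerance are assumptions for illustration, and the results shown are invented.

```python
def compare_runs(baseline, current, tolerance=0.10):
    """Compare per-item correctness on the same reference set across two runs.

    baseline, current: dicts mapping item id -> True if the answer was correct.
    Returns accuracies, items that regressed, and whether review is triggered.
    """
    if not baseline or baseline.keys() != current.keys():
        raise ValueError("runs must cover the same non-empty reference set")
    n = len(baseline)
    base_acc = sum(baseline.values()) / n
    cur_acc = sum(current.values()) / n
    # Items answered correctly at validation but incorrectly now.
    regressions = sorted(k for k in baseline if baseline[k] and not current[k])
    return {
        "baseline_accuracy": base_acc,
        "current_accuracy": cur_acc,
        "regressions": regressions,
        "requires_review": (base_acc - cur_acc) > tolerance,
    }


# Illustrative results on a four-item reference set.
validated = {"Q1": True, "Q2": True, "Q3": True, "Q4": False}
this_week = {"Q1": True, "Q2": False, "Q3": True, "Q4": False}
report = compare_runs(validated, this_week)
print(report)
```

Listing the regressed items, not just the accuracy delta, gives the investigation a starting point for classifying the change by error type.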

The Worked Example: A Real-Life Use Case

Mata v. Avianca isn't a pharma case, and that's why it's worth your time. In 2023, two New York attorneys submitted a legal brief containing six fabricated court decisions generated by ChatGPT, complete with fake quotes attributed to real federal judges. When one of the attorneys asked ChatGPT whether the cases were real, it confirmed they were. They weren't.

The lawyers were sanctioned $5,000 (joint-and-several Rule 11 sanction against all the lawyers and the law firm collectively). The headlines focused on the embarrassment. The substantive lesson, for any regulated industry, is what the failure mode actually was, because "the AI hallucinated" doesn't tell you anything actionable.

Mapped through the taxonomy:

Class 1 (Fabrication) at Origin 3 (Model inference): the model generated case names and citations with no grounding in any retrieved source.
An argument can also be made for Class 1 × Origin 4 (Human-AI Interface) as a secondary failure mode. Human verification would have kept the fabricated citations out of the filed brief.

Two different controls would have caught the error. A retrieval-grounding requirement (citation-grounding fix at the RAG layer) addresses Class 1 at Origin 3. A forced verification workflow with structured event logging (Human-AI interface fix) addresses Class 1 at Origin 4.
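A forced verification workflow with structured event logging might look something like the sketch below: release is blocked until every citation in the draft has been independently confirmed, and each decision is written as a structured event. The function, field names, reviewer identifier, and citation placeholders are all hypothetical.

```python
import json
from datetime import datetime, timezone


def verification_gate(citations, verified, reviewer_id):
    """Block release unless every citation was independently verified.

    citations: citations extracted from the AI-assisted draft.
    verified: citations the reviewer confirmed against a primary source.
    Returns a structured event suitable for an audit log.
    """
    unique = set(citations)
    unverified = sorted(unique - set(verified))
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "reviewer_id": reviewer_id,
        "citations_total": len(unique),
        "unverified": unverified,
        "released": not unverified,
    }


# Placeholder citations, for illustration only.
event = verification_gate(
    citations=["Citation A", "Citation B"],
    verified=["Citation A"],
    reviewer_id="reviewer-01",
)
print(json.dumps(event, indent=2))
```

The point of the log is trending: a rising count of unverified citations per draft is a Class 1 signal even when every individual draft is caught before release.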

How This Fits Within Existing Standards

The taxonomy is not a parallel discipline competing with classical quality risk management. It is an extension of it into territory the foundational standards did not originally anticipate but whose principles still cover.

The FDA's seven-step credibility assessment framework (draft January 2025) defines the validation envelope within which any specific AI deployment lives. Context of use is established, model influence is characterized along with decision consequence, and the resulting credibility plan is scaled to the risk tier. A high-influence deployment with high decision consequence demands the highest evidence tier: independent test sets, blinded human evaluation, statistically powered comparisons. A lower-tier deployment requires proportionate but lighter evidence. The framework is still in draft, but it is sufficiently mature to adopt today; waiting for finalization is no longer defensible given the volume of AI now entering regulated workflows.

The error taxonomy operates inside this envelope. The FDA framework tells you how much rigor a given deployment requires; the taxonomy tells you what kinds of failures that rigor must address. They are orthogonal and complementary. A higher-tier deployment requires explicit treatment of more cells in the matrix and deeper evidence for each. A lower-tier deployment may need explicit controls only for the highest-probability cells. The connection between the two is operational: the FDA framework sets the evidentiary bar, and the taxonomy specifies what evidence is needed at each failure point.

ICH Q9 (R1), effective July 2023, provides the risk management spine that both the FDA framework and the taxonomy sit within. ICH Q9's classical triad of severity, probability, and detectability maps directly onto this architecture. Severity is drawn from the FDA framework's model influence × decision consequence classification. Probability is characterized per cell of the taxonomy through bounded-variance testing: the same input replayed N times to characterize the stochastic envelope, with failure rate estimated from the resulting distribution. Detectability is established through the continuous monitoring layer, which determines whether a given failure type surfaces at the output or requires external validation to catch. ICH Q9 was written for deterministic systems; this framework extends its principles into probabilistic systems where reproducibility is bounded rather than absolute. The R1 revision's explicit acknowledgment that risk management decisions involve transparent subjectivity is directly relevant: bounded variance is a transparent acknowledgment of subjectivity in evaluation, and it earns its regulatory standing under R1.
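The probability estimate from bounded-variance testing can be made concrete. The sketch below takes the number of failures observed in N replays of the same input and puts a confidence interval around the failure rate. The Wilson score interval is a choice made for this example, not something ICH Q9 or the FDA framework prescribes, and the 10% acceptance limit is a hypothetical value.

```python
import math


def wilson_interval(failures, n, z=1.96):
    """Wilson score interval for a failure rate observed over n replays.

    z=1.96 corresponds to roughly 95% two-sided confidence.
    """
    if n <= 0 or not 0 <= failures <= n:
        raise ValueError("need n > 0 and 0 <= failures <= n")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative run: same input replayed 100 times, 3 failures in one cell.
lower, upper = wilson_interval(failures=3, n=100)

# Hypothetical acceptance criterion: upper bound must stay under 10%.
ACCEPTANCE_LIMIT = 0.10
accepted = upper <= ACCEPTANCE_LIMIT
print(f"failure rate 95% interval: [{lower:.3f}, {upper:.3f}] accepted={accepted}")
```

Judging acceptance on the upper bound rather than the point estimate means a small N is penalized automatically: the fewer replays, the wider the interval and the harder it is to pass.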

GAMP 5 (Second Edition, 2022) provides the lifecycle and categorization layer. Probabilistic AI systems typically fall under GAMP Category 5 - custom applications - because they are neither off-the-shelf nor merely configured. The taxonomy does not replace this categorization; it sits inside Category 5 as the deeper decomposition GAMP 5 does not provide for probabilistic systems specifically. The Britt Probabilistic Validation Lifecycle that operationalizes the taxonomy maps structurally onto the GAMP 5 V-model: context of use corresponds to concept and planning; risk tiering corresponds to risk assessment; evaluation design corresponds to configuration and verification; acceptance criteria correspond to qualification; HITL/HOTL controls correspond to operational use; continuous monitoring and change control correspond to ongoing change management. Same skeleton, different muscle: adapted for the failure characteristics of probabilistic systems.

The GAMP AI Guide (July 2025) establishes the risk management process for AI-enabled computerized systems but does not provide an explicit failure-mode framework for non-deterministic systems specifically. This taxonomy is intended to operate inside that space. The supplier qualification work extends Chapter 7 and Appendix M2 with the four-pillar VALID Trust decomposition. The error taxonomy extends the failure analysis discipline that GAMP 5 inherits from FMEA into the probabilistic systems that classical FMEA was not built to address. The framework is GAMP-aligned in its skeleton and FMEA-shaped in its decomposition, but its substantive content is new because probabilistic systems break the reproducibility assumptions both of those frameworks were built on.

The structural familiarity is intentional. QA professionals should be able to use the taxonomy with the same operational discipline they bring to deviation investigation, root cause analysis, and FMEA. The substantive divergence is what makes the work necessary. The matrix gives quality teams something to hold onto when the failure they are investigating does not have a stable identity - which, in probabilistic systems, is most of the time.

For non-deterministic AI systems, the state of control is demonstrated not by establishing and maintaining a fixed validated state, but by continuously verifying the absence of defined failure modes within an accepted tolerance. This shifts the validation evidence model from point-in-time qualification to continuous process verification, a concept already established in process validation (FDA 2011 guidance, EU Annex 15) but not yet extended to AI-enabled computerized systems. This taxonomy is meant to enable sponsors to build robust, defensible acceptance criteria for non-deterministic systems.

This taxonomy is the initial version of a framework I expect to develop publicly over the next several years. The six origin categories are stable enough to be useful now; the cell-level mappings between origins and output types will need refinement as more case studies accumulate. I'm particularly interested in contributions to three areas where the published literature is thinnest: retrieval-origin failures in regulatory submission workflows, agent orchestration failures in GxP manufacturing, and supplier-origin failures involving foundation model updates. If you've encountered a failure that doesn't fit cleanly into this structure, or if you think the structure itself needs reconsidering, I want to hear about it. Refinements will be published with attribution.

Kayla Britt
