Kayla Britt Kayla Britt

The Guidance Says What. The Next 12 Articles Show How

Over the past four months, I've published 12 articles on AI validation for life sciences, starting with the case for interdisciplinary expertise in AI validation, and have touched on everything from the technical implementation side (HITL/HOTL in practice, transparency architecture) to the business case for a robust validation strategy.

Since my first post, I launched Britt Biocomputing Insights and filed Britt Biocomputing as an LLC on November 25th. On the regulatory side, the FDA announced their internal usage of agentic AI on December 1st, and the FDA and EMA jointly released the Good AI Practice Principles on January 14th, just weeks later (https://www.kaylabritt.com/blog-1-1/fda-amp-ema-just-released-ai-guiding-principles-for-drug-development-heres-what-they-actually-mean).

Prior to launching my consultancy publicly, I shared my core framework on my website:

CoU Definition → Risk → Eval Design and Development → Acceptance Criteria → Deployment and HITL Control → Continuous Monitoring

And also noted that rigor always scales with the risk tier. The FDA/EMA Good AI Practice Principles formalized these same pillars.

Now that I’ve laid the foundations, the next phase of my blog addresses the harder operational questions, like agentic and multimodal architectures, human performance validation, full worked examples, and vendor qualification: specifics that are useful for sponsors navigating the changing landscape of AI architectures in drug development. The guidance tells sponsors what to demonstrate but deliberately stops short of how. That's where the next phase of this work lives.

The foundations are built. Now we stress-test them.

Read More
Kayla Britt Kayla Britt

AI Trust is not a feeling - It’s a Validation Strategy

Everyone says they want "trustworthy AI." But when I ask pharma teams what that means, I get feelings, not metrics.

Trust isn't built by reassurance. It's built by evidence.

"Trust" in AI adoption is currently treated as a communications problem - better change management, better messaging to stakeholders - when it's actually an engineering problem. You can't talk your way into trust with a QA team that's seen AI hallucinate. A Pistoia Alliance survey found that 27% of respondents didn't even know the source of data used to train their AI models.

Trust is the output of a validation strategy, not the input. You engineer it through three things:

1. Transparency: Can you show the human reviewer why the model made that decision? (Chain-of-thought logging, confidence scores)

2. Reproducibility: Can you get the same result twice? (Frozen architectures, version-locked prompts)

3. Accountability: When it fails, does someone own it? (HITL/HOTL)

Stop asking "how do we get people to trust AI?"

Start asking "what evidence would it take?"

Regulatory and validation teams have been trained their entire careers to demand reproducible evidence: and many AI implementations haven't produced any.

Transparency doesn't mean explaining every decision. It means providing the right level of evidence for the risk tier. Abandoning "black box" models for all contexts of use just because you can't trace every internal weight is its own form of risk. Post-hoc explainability techniques can validate even opaque models: the question is whether the evidence matches the stakes.

Reproducibility needs to be reframed. In the context of probabilistic models, exact output-level replication isn't the standard. Functional and statistical reproducibility are the primary domains of significance. Across N runs, does performance stay within your pre-defined acceptance thresholds? That's the question that matters.

Accountability means more than assigning blame. We don't just validate the model; we validate the human-AI interaction itself. Who reviews the output? How do we know the reviewer is actually reviewing, and not rubber-stamping? The delta between model suggestion and human action is where trust lives or dies.

Trust isn't the prerequisite for AI adoption. It's the deliverable.

Read More
Kayla Britt Kayla Britt

The Engineering of Uncertainty: Transparency in the Probabilistic Era

For the last thirty years, validation engineering has rested on a single, comforting bedrock: Determinism.

In the world of traditional GxP software, the contract was simple: Input A must always equal Output B. If you ran the test a thousand times, you expected the same result a thousand times. Any deviation was a defect; any variance was a failure.

But we have left that world behind.

We are now integrating systems where variance is not a bug; it is the engine. Large Language Models and generative agents do not offer us the safety of repetition; they offer us the power of inference. This shift creates a fundamental paradox for the life sciences: How do we apply rigid, binary validation standards to fluid, probabilistic systems?

The answer isn't to force these models to act like legacy software. We cannot simply "test out" the uncertainty. Instead, we must learn to measure it, bound it, and ultimately, engineer it. We are moving from validating for correctness to validating for stability: and that requires a completely new set of metrics.

See content credentials

The Regulatory Landscape

While research on methodology is robust, best practices for validation remain fluid. The FDA has yet to offer specific guidance on the methodology of explainability, whereas the EMA explicitly specifies:

"To allow review and monitoring of black box models, methods within the field of explainable AI should be used whenever possible. This includes providing explainability metrics, such as SHAP and/or LIME analyses..."

Similarly, the CIOMS Working Group XIV includes “Explainability” as the eighth core “guiding principle” for AI in pharmacovigilance. For QA, PV, and clinical leaders who must sign validation reports, the core question is not “Is the model clever?” but “When it fails, do we see it coming?”

Here, we propose a layered approach to the Explainable AI (xAI) problem, utilizing risk stratification to determine the appropriate methodology.

1. Interpretable Architectures ("Glass Box")

In some very high-risk use cases, interpretable architectures are, and should remain, the default approach. A 2025 analysis suggests that the “black box vs. glass box” tradeoff is not as clear-cut as we may assume. Atrey et al. quantified this metric as “Composite Interpretability” (CI), identifying instances where interpretable models (like Decision Trees or Generalized Additive Models) outperform neural networks when human error integration is accounted for.

2. Post-hoc XAI

For “black box” models inherent to modern NLP and deep learning, we cannot see the architecture, so we must interrogate the behavior. Three primary methods dominate the literature:

LIME (Local Interpretable Model-agnostic Explanations) LIME assumes that even if a model’s global decision boundary is complex, it is likely simple (linear) locally around a specific data point.

  • How it works: It takes a single inference, generates thousands of slightly "off" versions (adding noise), and observes how predictions change. It then fits a weighted linear model to explain that specific prediction.

  • The Intuition: "I don't know how the whole brain works, but for this specific decision, it acted like a simple linear equation where "Feature A" carried the most weight."

SHAP (SHapley Additive exPlanations) SHAP derives from cooperative game theory. It treats each feature as a "player" in a game where the "payout" is the model's prediction.

  • How it works: It calculates the marginal contribution of a feature by analyzing prediction changes when that feature is present vs. absent across all possible combinations.

  • The Intuition: "Feature A contributed +10% to the probability, and Feature B subtracted 5%, based on a mathematically fair distribution of credit."

See content credentials

Counterfactuals Often the most intuitive for end-users, this method ignores "feature weights" and focuses on outcomes.

  • How it works: It searches for the smallest change to the input vector that would flip the prediction class.

  • The Intuition: "Your loan was denied. If you earned $5k more per year, it would have been approved.”

  • The Validation Angle: For life sciences, a counterfactual is only useful if it is Feasible. We must apply Validity metrics (does it flip the class?) and Actionability metrics. If a model suggests changing a patient’s age or genetic history to optimize a trial outcome, the counterfactual is mathematically valid but clinically useless.

3. Uncertainty Quantification as Transparency

In a deterministic system, "transparency" means seeing the logic. In a probabilistic system, transparency means seeing the confidence.

If a model predicts a tumor classification with 51% probability, presenting that result as a binary "Malignant" is a failure of transparency. It implies a certainty that does not exist.

The Method: We must move beyond simple point estimates (soft-max probabilities) which are notoriously uncalibrated. Techniques like Conformal Prediction allow us to generate prediction sets (e.g., "The diagnosis is {Class A, Class B} with 95% confidence") rather than a single label.

The Validation Angle: Validation here shifts from checking for "correctness" to checking for Calibration. We validate that when the model says it is 90% confident, it is actually correct 90% of the time. This "error bar" is often more valuable to a clinician than the prediction itself.

See content credentials

4. Concept-Based Explanations

Feature attribution methods (like SHAP) tell us where the model is looking (e.g., "Pixel 402"), but they fail to tell us what the model sees. In life sciences, we need explanations that speak the language of the domain expert, not the language of the matrix.

The Method: Approaches like TCAV (Testing with Concept Activation Vectors) bridge this gap. Instead of highlighting pixels, TCAV measures the model's sensitivity to high-level concepts defined by the user (e.g., "Is the model predicting 'Zebra' because of 'Stripes'?").

The Validation Angle: This allows us to validate the scientific plausibility of the model’s reasoning. If a dermatology model is detecting skin cancer, TCAV can confirm it is triggering on "irregular borders" (a valid clinical concept) rather than "ruler markings" (a confounding artifact).

5. Documentation-Based Transparency

Transparency is not solely about the algorithm; it is about the artifact. Before a single inference is run, the system’s pedigree must be transparent.

The Method: We advocate for the adoption of standardized "nutrition labels" for models, such as Model Cards (Mitchell et al.) and Datasheets for Datasets (Gebru et al.). These documents must explicitly detail the training data composition, known limitations, intended use cases, and performance metrics across different demographic subgroups.

The Validation Angle: This is Static Transparency. In a GxP audit, this documentation serves as the primary evidence that the system's "Intended Use" matches its operational reality. It prevents "scope creep" where a model validated for adults is inappropriately deployed for pediatrics.

6. Contextual Disclosure

The final layer of transparency is the user interface itself. "Explainability" is not a dump of raw data; it includes the delivery of relevant information to the human operator at the moment of decision.

The Method: This involves Progressive Disclosure. A physician using a Clinical Decision Support (CDS) tool does not need to see a SHAP value for every inference. They need a "traffic light" indicator of uncertainty and a "Click for Details" option to drill down into the counterfactuals when the case is ambiguous.

The Validation Angle: This is Usability Engineering (IEC 62366). We must validate that the transparency mechanism reduces, rather than increases, cognitive load. If the XAI tool confuses the user, it is a safety hazard, not a feature.

See content credentials

The arc of HITL-HOTL-xAI was intentional; human performance remains a key variable in AI performance. We cannot fully address transparency without the human element.

A few operational suggestions for industry leads:

  • Inventory your AI systems and classify them into risk tiers aligned with CIOMS XIV and the AI Act.

  • For each tier, define a minimum transparency stack (which of the six layers are mandatory) and embed this into your QMS templates and validation plans.

See content credentials

Example of risk-tiering transparency methodology based on AI context of use.

  • Pilot conformal prediction or similar calibration techniques on at least one CDS or PV model in 2026 and document coverage and calibration as primary validation endpoints.

The onus is on us to develop robust frameworks for transparent AI. Abandoning high-performance models solely due to their “black box” nature is a hindrance to patient benefit. Technology will move forward; our validation frameworks must move with it.

Note: Pharmacovigilance is used here as a representative model; R&D and CMC leaders can and should adapt these transparency frameworks for upstream applications.

Read More
Kayla Britt Kayla Britt

Human-in-the-Loop, Liability Still in Play

Note: This approach aligns with established GxP principles around procedural controls, segregation of duties, and auditability.

Human-in-the-Loop is such a critical component of any probabilistic AI deployment within regulated life sciences spaces that it received its own explicit carve-out in the FDA/EMA's Good AI Practice Principles release (Principle 1: Human-Centric by Design). As AI technologies become embedded within infrastructure and workflows in R&D/CMC and healthcare organizations, HITL is a guardrail against downstream propagation of model errors.

However, this means we must evaluate and document the human-interface interaction as critically as we do the model performance and architecture itself. In practice, there are several ways to accomplish this:

1. “Draft Only — Requires Human Review”

For AI-assisted protocols, reports, or structured records, model outputs should be explicitly labeled Draft Only.

System controls should prevent finalization or downstream use until a human reviewer:

  • performs review,

  • documents rationale, and

  • applies a signature or electronic attestation.

This enforces procedural accountability and prevents silent adoption of AI-generated content.

2. Workflow Design (Preventing “Blind Approval”)

In RAG or multi-step AI workflows, each stage should require human confirmation before progression.

The goal is not speed reduction; it is preventing opaque, end-to-end automation where no single human can attest to what they actually reviewed.

3. Cognitive Forcing Functions (“Friction-by-Design”)

One of the most common HITL failure modes is automation bias: over time, humans may stop reading and simply click “Approve.”

To counter this, interfaces should require intentional cognitive engagement before submission.

Examples include:

  • requiring the reviewer to highlight supporting evidence in source text,

  • selecting a justification or confidence code,

  • or highlighting discrepancies.

This aligns with established human-factors and safety-critical system design and ensures the review is real, not ceremonial.

4. Confidence-Based Triage Routing (Risk-Based HITL)

Not all AI outputs require the same level of scrutiny.

HITL workflows should adapt based on:

  • calibrated uncertainty scores,

  • confidence thresholds,

  • or predefined risk classifications.

Higher-uncertainty outputs can be automatically routed for deeper or secondary review, while low-risk outputs follow streamlined paths. This mirrors traditional GxP risk-based validation approaches and supports scale without sacrificing control.

5. Full Traceability of the Hybrid Decision

Traditional audit trails track data changes. AI workflows must also track decision lineage.

The audit record should capture:

  • model output,

  • human edits,

  • timestamps,

  • reviewer identity,

  • and rationale.

This directly supports ALCOA+ principles and regulator expectations around accountability and traceability.

Real-World Example: AI in Pharmacovigilance (PV) Case Processing

Here is a scenario you can use to tie all the points together. It demonstrates how HITL protects the process during high-volume data intake.

The Scenario: A pharmaceutical company uses a Large Language Model (LLM) to scan incoming unstructured emails from patients to identify potential Adverse Events (AEs).

The Risk: If the AI misses an AE (False Negative), a safety signal could be ignored. If it hallucinates an AE (False Positive), resources are wasted investigating non-events.

The HITL Implementation:

  1. Draft Only: The AI scans the email and pre-fills the intake form (Patient ID, Drug Name, Symptom). The status is automatically set to "Pending Medical Review": the system prevents the record from moving to the safety database until a human signs off.

  2. Cognitive Forcing: The UI displays the original email on the left and the extracted data on the right. The "Submit" button is disabled until the human reviewer clicks the specific sentence in the email that describes the symptom (e.g., "I felt dizzy after taking the pill"). This proves the reviewer actually read the source text.

  3. Audit Trail: The reviewer notices the AI listed "nausea" but the patient actually wrote "queasy." The reviewer corrects the field. The system logs: Field 'Reaction' changed from 'Nausea' (Model) to 'Queasy' (User: Dr. Smith) at 10:42 AM.

The Result: The efficiency of AI is gained (pre-filling data), but the regulatory requirement for validated safety reporting is maintained through forced, documented human oversight.

Implementing HITL is not a "set it and forget it" deployment; it is an ongoing process of quality assurance. Just as we monitor models for data drift, we must rigorously monitor our workforce for "reviewer drift": the tendency for human oversight to degrade over time due to fatigue or over-reliance on the AI.

To ensure the human element remains a robust guardrail, organizations should implement a Reviewer Quality Assurance (QA) Protocol:

  • Randomized "Golden Set" Evaluation: A configurable percentage (e.g., 5–10%) of all AI-processed records that have been "Verified" by a human are automatically routed to a Senior Quality Lead for a blind secondary review. This acts as a continuous audit of the HITL process.

  • The "Three Strikes" Threshold: We must quantify human performance just as we do model performance. If a human reviewer fails to catch a model error (or erroneously edits a correct output) more than X times in a rolling period:

HITL as a Validated Control — Not a Checkbox

By validating the interaction, not just the output, HITL becomes an active, inspectable control that satisfies both the letter and the spirit of the FDA/EMA’s Human-Centric by Design principle.

As AI systems evolve toward multimodal and agentic architectures, HITL must scale accordingly: shifting from manual intervention inside every step to structured oversight of the loop itself.

Next week: a deep dive into Human-on-the-Loop (HOTL) and how oversight changes as autonomy increases.

Read More
Kayla Britt Kayla Britt

FDA & EMA Just Released AI Guiding Principles for Drug Development: Here’s What They Actually Mean

Today, the FDA and EMA jointly released Guiding Principles for Good AI Practice in Drug Development.

If you work in life-sciences R&D, stop scrolling. This is not just another policy document. It’s a signal that the era of experimental, undocumented AI is ending.

What matters isn’t the principles themselves. What matters is who now owns the risk, and what regulators will expect to see when AI influences scientific decisions.

Below is what sponsors should understand now.

This is not regulation. And that’s exactly why it matters.

The guidance is deliberately non-prescriptive. There are no checklists, no templates, no mandated validation methods.

That’s not a gap. That’s the point.

Regulators are saying:

“You are responsible for demonstrating that your AI system is fit-for-purpose, risk-appropriate, and well-governed — across its entire lifecycle.”

In other words:

  • Waiting for rules is no longer defensible

  • Pointing to vendor benchmarks is insufficient

  • Treating AI as 'just software' is no longer a viable regulatory strategy

The quiet but critical shift: from tools to evidence

The most important change in the FDA–EMA principles is subtle:

AI is no longer framed as a productivity tool. It is framed as a system that can generate, analyze, or influence scientific evidence.

That has consequences.

When AI contributes to:

  • target identification

  • candidate prioritization

  • trial design

  • safety interpretation

  • manufacturing decisions

…it becomes subject to the same scrutiny as any other system that influences patient outcomes.

This is why the guidance emphasizes:

  • context of use

  • risk-based validation

  • human oversight

  • lifecycle monitoring

  • clear documentation and traceability

Not accuracy. Not model size. Not novelty.

Why “we’ll validate later” no longer works

A common pattern I see in R&D organizations is:

“We’re piloting AI now; we’ll formalize validation once it’s closer to GxP.”

The problem is that model behavior is shaped early:

  • by training data

  • by prompt strategies

  • by human-AI interaction patterns

  • by how outputs are trusted (or over-trusted)

By the time a system is “critical,” the evidence gap already exists.

The FDA–EMA principles make this explicit: validation is proportional to risk, not delayed until formality.

Early-stage AI still requires:

  • defined decision boundaries

  • known failure modes

  • fit-for-use performance criteria

  • documented assumptions and limitations

What regulators are really asking sponsors to show

Stripped of policy language, the principles boil down to five questions regulators will increasingly expect sponsors to answer:

  1. Who owns the risk if the model fails?

  2. Can you trace the decision back to the data?

  3. Is the human actually overseeing it, or just clicking 'OK'?

  4. How is performance monitored as data, context, and models change?

  5. Can you explain its use and limitations to the people affected by it?

Answering these questions with intent is no longer enough. The new standard requires evidence.

Where most organizations may struggle

In practice, the hardest parts of alignment are not technical:

  • Translating AI behavior into scientifically meaningful failure modes

  • Defining acceptance criteria that reflect biological risk

  • Evaluating human-AI interaction, not just model output

  • Maintaining evidence over time as models drift and evolve

  • Maintaining multi-disciplinary expertise over the entire lifecycle of the model

These are validation problems, not data science problems.

And they sit squarely between R&D, Quality, Regulatory, and Digital teams. Teams that historically spoke four different languages must now answer to one shared standard.

What “fit-for-purpose AI validation” actually means now:

A fit-for-purpose approach does not mean validating everything to the same standard.

It means:

  • defining context of use first

  • tiering risk explicitly

  • tailoring evaluation methods to scientific impact

  • generating evidence that is proportionate, traceable, and defensible

  • planning for lifecycle monitoring from day one

This is exactly the operating model regulators are signaling, without telling sponsors how to implement it.

The bottom line

The FDA–EMA principles do not slow AI adoption.

They raise the bar for trust.

Organizations that treat this moment as a documentation exercise will struggle. Organizations that treat it as a scientific quality problem will move faster, and safer.

AI in drug development is no longer about whether it works.

It’s about whether you can stand behind it.

Patients must be able to trust that we aren't just accelerating discovery, but governing it. Because in the end, speed without safety isn't a breakthrough; it's a liability.

Read More
Kayla Britt Kayla Britt

History Rhymes: Why AI is the "Paper-to-Digital" Shift of Our Generation

1997: FDA drops 21 CFR Part 11. Pharma validation breaks overnight.
2026: FDA deploys agentic AI internally. History rhymes—and your validation frameworks aren't ready.
The question isn't if you'll need GxP-aligned AI validation. It's whether you'll build it before the audit pack lands on your desk.

The First Wave: Paper (deterministic)

The first major shift was moving from physical atoms (paper) to binary bits (electronic records). We had to prove that the computer would do exactly what the paper did, every single time. 1 + 1 had to equal 2. This birthed "Computer System Validation" (CSV). It was rigid, script-based, and binary. Pass/Fail.

The Second Wave: Digital (probablistic)

We are now entering the second massive shift. We aren't just changing the medium (paper to screen); we are changing the logic.

We aren’t changing the medium; we are changing the logic. We are moving from Deterministic (If X, then Y) to Probabilistic (If X, then likely Y).

The original “CSV” playbook doesn’t work when applied to LLMs or agentic AI. You can't write a test script for an infinite number of potential outputs.

Like biological processes, AI requires a unique approach to validation. AI is more like biology than software. We are moving from Validation as Architecture (checking blueprints) to Validation as Medicine (monitoring health). You don't 'debug' a biological system; you diagnose it. AI is the same.

The "Compliance Tollbooth": Bridging the Gap

Validation isn't dying; it’s just getting harder. We need a new "Tollbooth": a set of checks that acknowledges uncertainty rather than trying to eliminate it.

The Britt Biocomputing Playbook:

  • Fit-for-Purpose Validation: We assess the context-of-use to identify the appropriate risk-tier, rather than a one-size-fits-all approach.

  • From Checklists to Guardrails: We don't test every output; we test the safety boundaries.

    • Critical Thinking vs. Scripting: This aligns tightly with the GAMP 5 ISPE guidance; instead of checking every function utilizing the same checks, we implement risk-based approaches that acknowledge that not every use case needs the same level of validation.

  • Golden Datasets: We validate against a proprietary suite of “golden datasets” developed via testing against dozens of frontier models.

  • Continuous Monitoring: We provide the framework to monitor your model long-term, so you don’t stumble into barriers like data drift.

This is why we need interdisciplinary professionals as the new generation of AI Validation Engineers - we need people who can translate between the code, the science, and the regulations.

Part 11 rewrote validation overnight. AI validation guardrails aren't optional anymore.

Read More
Kayla Britt Kayla Britt

The Capability Paradox: Why Soaring LLM Benchmarks Demand Stricter Validation

We often assume that as LLMs get smarter, the validation burden decreases.

The opposite is true, especially for R&D and CMC workflows.

Recent benchmarks, such as OpenAI’s FrontierScience, confirm that while models are becoming exponentially better at scientific reasoning, they are also becoming adept at 'ardently defending' their mistakes.

In a Manufacturing environment, physical constraints often catch these errors; a bioreactor can only spin so fast before a safety breaker trips. But in R&D and CMC—where the output is decision-making, data interpretation, and regulatory drafting, a 'confident' hallucination can contaminate an entire development lifecycle before it is caught.

Intelligence does not equal Compliance. In fact, without fit-for-purpose validation, high-IQ models are high-risk liabilities.

The Shift from Retrieval to Reasoning

OpenAI acknowledges that FrontierScience measures only part of the model’s capability. However, it represents a critical leap: it is one of the first benchmarks to measure a model's ability to reason through novel scientific input rather than just regurgitating training data.

Previous benchmarks (like MMLU) tested Knowledge Retrieval (e.g., "What is the boiling point of ethanol?"). FrontierScience tests Scientific Process (e.g., "Given these novel conditions, predict the reaction yield.").

The New Validation Mandate

For Life Sciences, this shift signals that the era of "Generic Benchmarks" is over. We can no longer rely on general reasoning scores to predict GxP safety.

If an agentic workflow is capable of 79x efficiency gains in protocol design (as recent reports suggest), it is also capable of generating errors at 79x the speed.

To harness these tools safely, we must move beyond standard evaluation metrics and implement Context-Specific Validation layers: frameworks that don't just test if the model is "smart," but verify that it is "compliant."

The models are ready for the lab. The question is: Are your safeguards ready for the models?

Read More
Kayla Britt Kayla Britt

FDA’s Agentic AI Announcement Signals a New Era for Scientific Computing

In early December, the U.S. Food and Drug Administration quietly released one of the most consequential technology updates in its recent history: an agency-wide deployment of agentic AI tools for internal use across regulatory review, scientific computing, compliance, inspections, and administrative workflows.

For an organization historically defined by caution and structured decision-making, the introduction of planning-capable, multi-step-reasoning AI systems marks a genuine turning point. And not only because of what FDA will do with these tools internally, but because of what this move signals to the life-sciences sector watching closely from the outside.

What the FDA adopts today becomes the industry’s expectation tomorrow.

What FDA Actually Announced

The agency’s announcement included several key components:

  • FDA has deployed agentic AI systems: advanced models designed for planning, reasoning, and executing multi-step tasks — within a secure government cloud environment.

  • Use of these systems is optional for staff but available across a wide range of regulatory and operational functions.

  • The AI is configured not to train on reviewer inputs or on confidential industry submissions, a critical safeguard for regulated data.

  • FDA also launched an “Agentic AI Challenge,” inviting staff to build and test AI-augmented workflows, with outputs slated for presentation at the agency’s Scientific Computing event in January 2026.

  • This builds on the earlier rollout of Elsa, FDA’s generative-AI assistant, which rapidly reached over 70% voluntary staff adoption.

In short: FDA is no longer exploring AI. It is operationalizing it.

A Strategic Inflection Point for Scientific Computing

Within regulatory agencies, change tends to be incremental. But when it comes to computational approaches, the last five years have been an acceleration curve: real-world evidence tooling, large-scale data integration, model-informed drug development, and now agentic systems capable of generating structured workflows.

For life-sciences organizations already experimenting with LLMs, the FDA’s move does two things:

1. It normalizes AI-augmented scientific computing.

If internal regulatory workflows are being reshaped by agentic systems, it is now reasonable for industry scientific and quality teams to pursue AI-enabled efficiencies as well. Organizations that adopt AI may have a significant competitive advantage in the not-so-distant future as efficiency gains compound.

2. It raises the bar for validation, auditability, and evidence.

When regulators embrace AI, the natural next question is:
How will regulated companies demonstrate that their own AI systems are fit-for-purpose?

The FDA’s announcement implicitly signals that risk-based, evidence-driven evaluation frameworks will become even more essential for LLMs and other agentic tools used in R&D, quality, and manufacturing.

A Personal Note on Timing

A few days before the press release, I filed the paperwork for Britt Biocomputing LLC, a consultancy built around fit-for-purpose LLM validation for life sciences.

The timing wasn’t intentional.

It was simply a response to the same trends that FDA is now making explicit: AI is no longer a novelty within scientific and regulated environments: it is becoming infrastructure. And once a technology becomes infrastructure, it requires rigor, governance, and evidence to support its use.

If anything, the FDA’s announcement confirms what many early practitioners have already been preparing for: the shift from theoretical AI governance to operational AI validation.

Implications for Industry

While the FDA emphasized internal usage, the downstream effects will extend across the entire life-sciences ecosystem.

1. Regulatory interactions may accelerate, but expectations may rise.

More efficient internal workflows could shorten review cycles or increase throughput. At the same time, companies may face more structured questions about how their own AI-enabled processes operate.

2. AI will become part of the “normal” regulatory conversation.

Whether in submissions, inspections, or quality system discussions, AI-driven workflows will cease to be exotic. They will be treated like any other computerized system: something to be understood, assessed, and validated.

3. Evidence packs and traceability frameworks will matter more than ever.

If agentic tools are helping generate analyses, summaries, or draft documents, both regulators and industry will need clear provenance, human-in-the-loop controls, and risk-mitigation strategies that map cleanly to existing quality expectations.

4. The adoption gap will widen.

Organizations that prepare now will move faster later; not because they “trust AI” more, but because they understand how to govern it.

What to Watch in Early 2026

The upcoming Scientific Computing event, where FDA staff will showcase their internally built AI workflows, will likely set the tone for:

  • how agentic systems are evaluated in a regulatory context,

  • what kinds of tasks FDA sees as low-, medium-, or high-risk,

  • how reviewers incorporate AI outputs into their decision-making pipelines, and

  • what transparency expectations may start to form for industry.

Even if details remain internal, the themes that emerge will shape the industry’s next steps.

Conclusion: AI Has Entered the Regulated Core

The most important part of FDA’s announcement is not the technology itself: it is the signal.

AI is no longer peripheral. It is becoming part of the regulated decision-making fabric.

For the life-sciences sector, this creates a dual responsibility:

  • to innovate with these tools, and

  • to validate them with the same rigor we apply to any system that touches product quality or patient safety.

Agentic AI inside FDA is more than a technological shift: it is a governance shift. And governance shifts always reshape the landscape for those who operate within it.

Read More
Kayla Britt Kayla Britt

Data drift: a risk-based and gamp-aligned approach

Why it matters: LLMs can fall out of spec without any code change—because the inputs, policies, or real-world tasks evolve. That’s data drift. In GxP, we handle it with a continuous, risk-based approach: define intended use → set acceptance criteria → monitor → re-validate on triggers.

1) Define the context of use (CoU)

State exactly what the model may influence and the allowable autonomy (draft-only, HITL required, blocked actions). Tie it to process/scientific risk.

Example (Deviation/CAPA assistant): Suggests categories using the approved ontology; HITL required; never commits system-of-record changes.

2) Set acceptance criteria up front

Pre-register the bar so you know when drift matters.

  • Coverage/accuracy (gold set): ≥ 90–95% top-k on SME-labeled cases

  • Safety: 0% prohibited actions

  • Traceability: ≥ 95% of suggestions include source/rule citation

  • Contradictions/hallucinations: ≤ 1% on spot checks

  • Ops KPI: −30–50% time-to-first-draft; rework ≤ 10% needing >1 revision

3) Know the drift you’re watching for

  • Input/format drift: new document types, vendors, equipment, templates

  • Concept drift: updated taxonomy, new CAPA rules, new SOPs

  • Prior/frequency shift: distribution of cases changes (e.g., more of type X)

4) Monitor and act on triggers

Treat re-validation as triggered and proportional to risk.

Periodic review: keep a light cadence (e.g., quarterly) even without triggers.

5) Minimal evidence pack (inspection-ready)

  • CoU & allowable autonomy

  • Risk register (what can go wrong; key slices)

  • Acceptance criteria & test plan (pre-registered)

  • Gold set + results (with ALCOA+ lineage)

  • Monitoring plan + trigger log

  • Change-control entries (what changed, why, evidence)

6) Worked micro-example (new equipment type)

A new controlled rate freezer goes live → input drift.

  • Add a representative “equipment-X” slice to the gold set.

  • Re-run evals; require ≥ 92% top-k, 0% prohibited actions, ≥ 95% citation coverage.

  • Don’t enable suggestions on equipment-X until the slice meets the bar.

  • Update CoU, risk register, and change-control record.

Compatibility note: I run a continuous, risk-based lifecycle and map evidence to the CSA/GAMP guidance.

Read More
Kayla Britt Kayla Britt

From Pilot to Production: A Practical Roadmap for LLM Implementation in GxP Environments

Editor’s note (Nov 17, 2025): This article has been updated to reflect a continuous, risk-based lifecycle consistent with GAMP 5 (Second Edition) and the ISPE GAMP AI guidance. Per GAMP 5 (2nd ed.), specification and verification are not inherently linear and fully support iterative, incremental methods. Where legacy terms (e.g., IQ/OQ/PQ) appear, they are provided as a crosswalk for teams whose SOPs still file that way.

What's the difference between an LLM that works and one that's validated for life sciences use? Everything.

When implemented safely, AI can bring intelligence, automation, and real-time decision-making to quality processes. But in life sciences, where errors can impact patient safety and regulatory compliance, bridging the gap between AI's potential and reality necessitates careful strategy and implementation.

From identifying a clear scope of use to monitoring and evaluation, the full lifecycle of a deployed LLM requires end-to-end validation.

While organizations own their validation destiny, the specialized nature of LLM validation often requires external expertise. Whether providing strategic frameworks, hands-on validation execution, or capability building, experienced partners can accelerate compliant AI adoption while avoiding common pitfalls.

Let’s walk through the process below . . .

Note: Before embarking on validation, organizations need a governance framework defining when and how LLMs can be considered. This isn't part of validation itself but rather the prerequisite 'organizational readiness' that enables compliant AI adoption. Phase 1 then builds on this foundation with specific use-case documentation.

📍Phase 1: Definition & Risk Assessment

  • Definition- we must define the user requirements and do a thorough risk assessment for the LLM.

  • Organizations don't need to reinvent their validation approach for AI. A risk-based approach aligned with GAMP emphasizes comprehensive testing around AI-specific risks like hallucination, drift, and traceability. This evolution, not revolution, approach helps maintain regulatory compliance while addressing novel AI challenges. We’ve pre-built standard LLM additions, enabling seamless integration into your existing processes.

  • The URS and SOP work in tandem but serve distinct purposes. The URS defines what the system must do—its capabilities, limitations, and performance standards. The SOP defines how humans interact with that system—who can use it, when it's appropriate, and what procedures to follow. Together, they create a complete framework for compliant LLM use. Think of it this way: The URS ensures the LLM is fit for purpose. The SOP ensures it's used for that purpose.

📍Phase 2: Design & Development

  • To create a true fit-for-purpose LLM, we must ensure the model architecture aligns with risk level and use case. The outputs from Phase 1 directly inform our approach.

    *Note: Unlike traditional software, LLM performance can degrade over time as production data evolves—a phenomenon called "data drift." This occurs when new products, updated SOPs, or changed terminology cause the production environment to diverge from training conditions. This reality shapes our design decisions, requiring built-in monitoring capabilities and clear revalidation triggers from day one.

  • Risk-Based Model Selection

    • High-Risk (patient safety, batch release):

      • Smaller, specialized models

      • Deterministic components (where possible)

      • Extensive guardrails and confidence thresholds

    • Medium-Risk (document review, categorization):

      • Balanced models

      • Commercial or open-source options possible

      • Emphasis on explainability features

    • Low-Risk (literature search, drafting):

      • Larger models acceptable

      • API-based solutions may be appropriate

      • Emphasis on performance over interpretability

📍Phase 3: Verification & Model Validation

  • Confirm correct deployment- model version verification

  • Fit-for-Purpose Qualification addresses LLM-specific testing:

    • Model verification against accuracy benchmarks (≥95% vs SME)

    • Use case validation with real-world scenarios

    • Integration testing with existing QMS systems

  • Performance Check demonstrates sustained performance with production data and confirms users can follow updated SOPs effectively.

📍Phase 4: Deployment & Control

  • Beyond technical deployment, successful implementation requires:

    • SOP revision: “AI-Assisted [Process Name] with clear oversight requirements

    • Training requirement: 2-hour session on reviewing/verifying LLM outputs

    • Output controls: All LLM-output marked as “Draft- Requires Review”

    • Change control: Model versions, prompts, and data pipelines under formal control

    • Audit trail: Complete traceability of inputs, model version, and human decisions

📍Phase 5: Continuous Monitoring & Improvement

  • Key Metrics to Track:

    • Model accuracy trending

    • Confidence score distribution

    • User override rates

    • Processing time per request

  • Revalidation Triggers (Defined in Advance)

    • New equipment types added

    • Changes to review criteria in SOPs

    • Model performance degradation <5% week-over-week

    • Regulatory guidance updates

Example: A deviation categorization LLM following this framework achieved 94% accuracy against SME review and reduced processing time from 4 hours to 30 minutes per batch.

Validating LLMs for life sciences isn't about reinventing validation—it's about thoughtfully embracing software tools and automation to improve higher quality and lower risks. Ready to accelerate your AI validation journey? Stay tuned for next week's deep dive on data drift.

Read More
Kayla Britt Kayla Britt

fit-for-purpose llms: why it matters

Validated ≠ leaderboards. Recent studies show that large language models can be more agreeable than humans—optimizing for pleasing answers rather than correct ones. That’s entertaining in chat apps; it’s risky in life‑sciences workflows. The antidote is simple: design for fit‑for‑purpose, not applause.

The problem: helpful isn’t the same as correct

Most LLMs are tuned to be helpful and polite. In practice, that can morph into sycophancy—agreeing with the user’s assumption even when it’s wrong. In R&D and GxP‑adjacent settings, this shows up as:

  • False reassurance: an LLM gently validates a shaky hypothesis or casual assumption.

  • Label echo: the model over‑indexes on prior labels and quietly repeats them.

  • “Looks right” bias: well‑phrased but ungrounded answers that slip through review.

Bottom line: if you don’t explicitly design against sycophancy, you’ll ship it.

What “fit‑for‑purpose” actually means

“Fit‑for‑purpose” is not a vibe; it’s a measurement and operations problem:

  1. Context of Use (CoU) + risk: who uses the model, for what decision, with which failure modes. Evidence depth matches impact.

  2. Consequence‑weighted metrics: errors are not equal—weight them by business/clinical consequences.

  3. Traceable, domain data: evaluation sets with lineage (ALCOA+), leakage controls, and real edge cases.

  4. Pre‑registered acceptance criteria: metrics, thresholds, and sample sizes agreed upfront.

  5. HITL & SOPs: clear review thresholds, escalation paths, and training—so "agreeable" outputs don’t slide through.

  6. Monitoring & drift: golden‑set rescoring, quality KPIs, and ownership in production.

  7. Change control for retraining: triggers, impact assessments, rollback, and signed release notes.

Anti‑sycophancy tests you should run

If your model can pass these, you’re on the right path:

  • Agreement‑vs‑truth: does the model side with a confident but wrong user, or with the evidence?

  • Dissent calibration: can it respectfully challenge a claim and cite sources?

  • Authority flip: does behavior change when the “speaker” is a junior analyst vs. a PI/manufacturer lead?

  • Self‑confidence checks: does it hedge appropriately when uncertain?

  • Grounding audits (for RAG): are citations real, relevant, and actually used in the answer?

R&D vs. regulated work: same measurements, scaled

  • In R&D, a lightweight credibility plan prevents “polite hallucinations” from steering experiments.

  • For GxP‑impacting steps, expand those measurements into formal V&V, audit trails, and independence in testing. The framework is the same; the rigor scales with risk.

Why this matters to regulators and QA

Health authorities and QA teams don’t ask for leaderboard screenshots. They expect risk‑based credibility tied to the model’s context of use, with documented operation, monitoring, and change control. If you can walk into an audit with that story—and evidence—you’re ready.

A simple flow that works

CoU → Risk → Eval Design → Acceptance Criteria → HITL → Monitoring → Change Control
Ship with this lifecycle in place and you’ll avoid the trap of “agreeable but wrong.”

What I deliver

  • R&D Fit‑for‑Purpose Sprint (2–4 wks): CoU & risk rubric • eval set + error taxonomy • acceptance criteria • small pilot • decision memo.

  • GxP Validate → Launch (6–10 wks): validation protocol & report • supplier qualification • change control • monitoring/drift • audit pack.

  • Monitor → Improve (retainer): golden‑set rescoring • drift watch • periodic re‑validation • release notes • inspection readiness.

CTA

Curious if your LLM is truly fit‑for‑purpose? Book a 20‑minute fit check. I’ll share a quick scorecard, highlight gaps, and recommend the smallest experiment that proves value.

Read More
Kayla Britt Kayla Britt

validation for llms: An interdisciplinary perspective

The advent of modern neural networks carries the promise of transforming industries worldwide. Yet, the “black box” nature of large language models (LLMs) introduces substantial risk — particularly in high-stakes domains such as life sciences and pharmaceuticals.

Effective validation requires more than code reviews or benchmark scores. It demands a risk-based, interdisciplinary approach that integrates expertise in both data science and the domain being modeled. A biologist, for instance, can spot when a generative model produces biologically implausible hypotheses that might escape a purely technical evaluator.

True validation extends beyond technical metrics. It involves translating complex architectures and training data assumptions into a transparent, testable framework — one that aligns with scientific rigor and regulatory expectations.

As AI systems increasingly shape discovery pipelines, interdisciplinary validation will become the foundation of trust. Building teams that bridge computational and domain knowledge isn’t optional; it’s the key to ensuring LLMs advance science responsibly, rather than simply accelerating it.

Read More