Kayla Britt Kayla Britt

FDA & EMA Just Released AI Guiding Principles for Drug Development: Here’s What They Actually Mean

Today, the FDA and EMA jointly released Guiding Principles for Good AI Practice in Drug Development.

If you work in life-sciences R&D, stop scrolling. This is not just another policy document. It’s a signal that the era of experimental, undocumented AI is ending.

What matters isn’t the principles themselves. What matters is who now owns the risk, and what regulators will expect to see when AI influences scientific decisions.

Below is what sponsors should understand now.

This is not regulation. And that’s exactly why it matters.

The guidance is deliberately non-prescriptive. There are no checklists, no templates, no mandated validation methods.

That’s not a gap. That’s the point.

Regulators are saying:

“You are responsible for demonstrating that your AI system is fit-for-purpose, risk-appropriate, and well-governed — across its entire lifecycle.”

In other words:

  • Waiting for rules is no longer defensible

  • Pointing to vendor benchmarks is insufficient

  • Treating AI as 'just software' is no longer a viable regulatory strategy

The quiet but critical shift: from tools to evidence

The most important change in the FDA–EMA principles is subtle:

AI is no longer framed as a productivity tool. It is framed as a system that can generate, analyze, or influence scientific evidence.

That has consequences.

When AI contributes to:

  • target identification

  • candidate prioritization

  • trial design

  • safety interpretation

  • manufacturing decisions

…it becomes subject to the same scrutiny as any other system that influences patient outcomes.

This is why the guidance emphasizes:

  • context of use

  • risk-based validation

  • human oversight

  • lifecycle monitoring

  • clear documentation and traceability

Not accuracy. Not model size. Not novelty.

Why “we’ll validate later” no longer works

A common pattern I see in R&D organizations is:

“We’re piloting AI now; we’ll formalize validation once it’s closer to GxP.”

The problem is that model behavior is shaped early:

  • by training data

  • by prompt strategies

  • by human-AI interaction patterns

  • by how outputs are trusted (or over-trusted)

By the time a system is “critical,” the evidence gap already exists.

The FDA–EMA principles make this explicit: validation is proportional to risk, not delayed until formality.

Early-stage AI still requires:

  • defined decision boundaries

  • known failure modes

  • fit-for-use performance criteria

  • documented assumptions and limitations

What regulators are really asking sponsors to show

Stripped of policy language, the principles boil down to five questions regulators will increasingly expect sponsors to answer:

  1. Who owns the risk if the model fails?

  2. Can you trace the decision back to the data?

  3. Is the human actually overseeing it, or just clicking 'OK'?

  4. How is performance monitored as data, context, and models change?

  5. Can you explain its use and limitations to the people affected by it?

Answering these questions with intent is no longer enough. The new standard requires evidence.

Where most organizations may struggle

In practice, the hardest parts of alignment are not technical:

  • Translating AI behavior into scientifically meaningful failure modes

  • Defining acceptance criteria that reflect biological risk

  • Evaluating human-AI interaction, not just model output

  • Maintaining evidence over time as models drift and evolve

  • Maintaining multi-disciplinary expertise over the entire lifecycle of the model

These are validation problems, not data science problems.

And they sit squarely between R&D, Quality, Regulatory, and Digital teams. Teams that historically spoke four different languages must now answer to one shared standard.

What “fit-for-purpose AI validation” actually means now:

A fit-for-purpose approach does not mean validating everything to the same standard.

It means:

  • defining context of use first

  • tiering risk explicitly

  • tailoring evaluation methods to scientific impact

  • generating evidence that is proportionate, traceable, and defensible

  • planning for lifecycle monitoring from day one

This is exactly the operating model regulators are signaling, without telling sponsors how to implement it.

The bottom line

The FDA–EMA principles do not slow AI adoption.

They raise the bar for trust.

Organizations that treat this moment as a documentation exercise will struggle. Organizations that treat it as a scientific quality problem will move faster, and safer.

AI in drug development is no longer about whether it works.

It’s about whether you can stand behind it.

Patients must be able to trust that we aren't just accelerating discovery, but governing it. Because in the end, speed without safety isn't a breakthrough; it's a liability.

Read More
Kayla Britt Kayla Britt

History Rhymes: Why AI is the "Paper-to-Digital" Shift of Our Generation

1997: FDA drops 21 CFR Part 11. Pharma validation breaks overnight.
2026: FDA deploys agentic AI internally. History rhymes—and your validation frameworks aren't ready.
The question isn't if you'll need GxP-aligned AI validation. It's whether you'll build it before the audit pack lands on your desk.

The First Wave: Paper (deterministic)

The first major shift was moving from physical atoms (paper) to binary bits (electronic records). We had to prove that the computer would do exactly what the paper did, every single time. 1 + 1 had to equal 2. This birthed "Computer System Validation" (CSV). It was rigid, script-based, and binary. Pass/Fail.

The Second Wave: Digital (probablistic)

We are now entering the second massive shift. We aren't just changing the medium (paper to screen); we are changing the logic.

We aren’t changing the medium; we are changing the logic. We are moving from Deterministic (If X, then Y) to Probabilistic (If X, then likely Y).

The original “CSV” playbook doesn’t work when applied to LLMs or agentic AI. You can't write a test script for an infinite number of potential outputs.

Like biological processes, AI requires a unique approach to validation. AI is more like biology than software. We are moving from Validation as Architecture (checking blueprints) to Validation as Medicine (monitoring health). You don't 'debug' a biological system; you diagnose it. AI is the same.

The "Compliance Tollbooth": Bridging the Gap

Validation isn't dying; it’s just getting harder. We need a new "Tollbooth": a set of checks that acknowledges uncertainty rather than trying to eliminate it.

The Britt Biocomputing Playbook:

  • Fit-for-Purpose Validation: We assess the context-of-use to identify the appropriate risk-tier, rather than a one-size-fits-all approach.

  • From Checklists to Guardrails: We don't test every output; we test the safety boundaries.

    • Critical Thinking vs. Scripting: This aligns tightly with the GAMP 5 ISPE guidance; instead of checking every function utilizing the same checks, we implement risk-based approaches that acknowledge that not every use case needs the same level of validation.

  • Golden Datasets: We validate against a proprietary suite of “golden datasets” developed via testing against dozens of frontier models.

  • Continuous Monitoring: We provide the framework to monitor your model long-term, so you don’t stumble into barriers like data drift.

This is why we need interdisciplinary professionals as the new generation of AI Validation Engineers - we need people who can translate between the code, the science, and the regulations.

Part 11 rewrote validation overnight. AI validation guardrails aren't optional anymore.


DM 'PART11' for the checklist that'll save your first audit.

Read More
Kayla Britt Kayla Britt

The Capability Paradox: Why Soaring LLM Benchmarks Demand Stricter Validation

We often assume that as LLMs get smarter, the validation burden decreases.

The opposite is true, especially for R&D and CMC workflows.

Recent benchmarks, such as OpenAI’s FrontierScience, confirm that while models are becoming exponentially better at scientific reasoning, they are also becoming adept at 'ardently defending' their mistakes.

In a Manufacturing environment, physical constraints often catch these errors; a bioreactor can only spin so fast before a safety breaker trips. But in R&D and CMC—where the output is decision-making, data interpretation, and regulatory drafting, a 'confident' hallucination can contaminate an entire development lifecycle before it is caught.

Intelligence does not equal Compliance. In fact, without fit-for-purpose validation, high-IQ models are high-risk liabilities.

The Shift from Retrieval to Reasoning

OpenAI acknowledges that FrontierScience measures only part of the model’s capability. However, it represents a critical leap: it is one of the first benchmarks to measure a model's ability to reason through novel scientific input rather than just regurgitating training data.

Previous benchmarks (like MMLU) tested Knowledge Retrieval (e.g., "What is the boiling point of ethanol?"). FrontierScience tests Scientific Process (e.g., "Given these novel conditions, predict the reaction yield.").

The New Validation Mandate

For Life Sciences, this shift signals that the era of "Generic Benchmarks" is over. We can no longer rely on general reasoning scores to predict GxP safety.

If an agentic workflow is capable of 79x efficiency gains in protocol design (as recent reports suggest), it is also capable of generating errors at 79x the speed.

To harness these tools safely, we must move beyond standard evaluation metrics and implement Context-Specific Validation layers: frameworks that don't just test if the model is "smart," but verify that it is "compliant."

The models are ready for the lab. The question is: Are your safeguards ready for the models?

Read More
Kayla Britt Kayla Britt

FDA’s Agentic AI Announcement Signals a New Era for Scientific Computing

In early December, the U.S. Food and Drug Administration quietly released one of the most consequential technology updates in its recent history: an agency-wide deployment of agentic AI tools for internal use across regulatory review, scientific computing, compliance, inspections, and administrative workflows.

For an organization historically defined by caution and structured decision-making, the introduction of planning-capable, multi-step-reasoning AI systems marks a genuine turning point. And not only because of what FDA will do with these tools internally, but because of what this move signals to the life-sciences sector watching closely from the outside.

What the FDA adopts today becomes the industry’s expectation tomorrow.

What FDA Actually Announced

The agency’s announcement included several key components:

  • FDA has deployed agentic AI systems: advanced models designed for planning, reasoning, and executing multi-step tasks — within a secure government cloud environment.

  • Use of these systems is optional for staff but available across a wide range of regulatory and operational functions.

  • The AI is configured not to train on reviewer inputs or on confidential industry submissions, a critical safeguard for regulated data.

  • FDA also launched an “Agentic AI Challenge,” inviting staff to build and test AI-augmented workflows, with outputs slated for presentation at the agency’s Scientific Computing event in January 2026.

  • This builds on the earlier rollout of Elsa, FDA’s generative-AI assistant, which rapidly reached over 70% voluntary staff adoption.

In short: FDA is no longer exploring AI. It is operationalizing it.

A Strategic Inflection Point for Scientific Computing

Within regulatory agencies, change tends to be incremental. But when it comes to computational approaches, the last five years have been an acceleration curve: real-world evidence tooling, large-scale data integration, model-informed drug development, and now agentic systems capable of generating structured workflows.

For life-sciences organizations already experimenting with LLMs, the FDA’s move does two things:

1. It normalizes AI-augmented scientific computing.

If internal regulatory workflows are being reshaped by agentic systems, it is now reasonable for industry scientific and quality teams to pursue AI-enabled efficiencies as well. Organizations that adopt AI may have a significant competitive advantage in the not-so-distant future as efficiency gains compound.

2. It raises the bar for validation, auditability, and evidence.

When regulators embrace AI, the natural next question is:
How will regulated companies demonstrate that their own AI systems are fit-for-purpose?

The FDA’s announcement implicitly signals that risk-based, evidence-driven evaluation frameworks will become even more essential for LLMs and other agentic tools used in R&D, quality, and manufacturing.

A Personal Note on Timing

A few days before the press release, I filed the paperwork for Britt Biocomputing LLC, a consultancy built around fit-for-purpose LLM validation for life sciences.

The timing wasn’t intentional.

It was simply a response to the same trends that FDA is now making explicit: AI is no longer a novelty within scientific and regulated environments: it is becoming infrastructure. And once a technology becomes infrastructure, it requires rigor, governance, and evidence to support its use.

If anything, the FDA’s announcement confirms what many early practitioners have already been preparing for: the shift from theoretical AI governance to operational AI validation.

Implications for Industry

While the FDA emphasized internal usage, the downstream effects will extend across the entire life-sciences ecosystem.

1. Regulatory interactions may accelerate, but expectations may rise.

More efficient internal workflows could shorten review cycles or increase throughput. At the same time, companies may face more structured questions about how their own AI-enabled processes operate.

2. AI will become part of the “normal” regulatory conversation.

Whether in submissions, inspections, or quality system discussions, AI-driven workflows will cease to be exotic. They will be treated like any other computerized system: something to be understood, assessed, and validated.

3. Evidence packs and traceability frameworks will matter more than ever.

If agentic tools are helping generate analyses, summaries, or draft documents, both regulators and industry will need clear provenance, human-in-the-loop controls, and risk-mitigation strategies that map cleanly to existing quality expectations.

4. The adoption gap will widen.

Organizations that prepare now will move faster later; not because they “trust AI” more, but because they understand how to govern it.

What to Watch in Early 2026

The upcoming Scientific Computing event, where FDA staff will showcase their internally built AI workflows, will likely set the tone for:

  • how agentic systems are evaluated in a regulatory context,

  • what kinds of tasks FDA sees as low-, medium-, or high-risk,

  • how reviewers incorporate AI outputs into their decision-making pipelines, and

  • what transparency expectations may start to form for industry.

Even if details remain internal, the themes that emerge will shape the industry’s next steps.

Conclusion: AI Has Entered the Regulated Core

The most important part of FDA’s announcement is not the technology itself: it is the signal.

AI is no longer peripheral. It is becoming part of the regulated decision-making fabric.

For the life-sciences sector, this creates a dual responsibility:

  • to innovate with these tools, and

  • to validate them with the same rigor we apply to any system that touches product quality or patient safety.

Agentic AI inside FDA is more than a technological shift: it is a governance shift. And governance shifts always reshape the landscape for those who operate within it.

Read More
Kayla Britt Kayla Britt

Data drift: a risk-based and gamp-aligned approach

Why it matters: LLMs can fall out of spec without any code change—because the inputs, policies, or real-world tasks evolve. That’s data drift. In GxP, we handle it with a continuous, risk-based approach: define intended use → set acceptance criteria → monitor → re-validate on triggers.

1) Define the context of use (CoU)

State exactly what the model may influence and the allowable autonomy (draft-only, HITL required, blocked actions). Tie it to process/scientific risk.

Example (Deviation/CAPA assistant): Suggests categories using the approved ontology; HITL required; never commits system-of-record changes.

2) Set acceptance criteria up front

Pre-register the bar so you know when drift matters.

  • Coverage/accuracy (gold set): ≥ 90–95% top-k on SME-labeled cases

  • Safety: 0% prohibited actions

  • Traceability: ≥ 95% of suggestions include source/rule citation

  • Contradictions/hallucinations: ≤ 1% on spot checks

  • Ops KPI: −30–50% time-to-first-draft; rework ≤ 10% needing >1 revision

3) Know the drift you’re watching for

  • Input/format drift: new document types, vendors, equipment, templates

  • Concept drift: updated taxonomy, new CAPA rules, new SOPs

  • Prior/frequency shift: distribution of cases changes (e.g., more of type X)

4) Monitor and act on triggers

Treat re-validation as triggered and proportional to risk.

Periodic review: keep a light cadence (e.g., quarterly) even without triggers.

5) Minimal evidence pack (inspection-ready)

  • CoU & allowable autonomy

  • Risk register (what can go wrong; key slices)

  • Acceptance criteria & test plan (pre-registered)

  • Gold set + results (with ALCOA+ lineage)

  • Monitoring plan + trigger log

  • Change-control entries (what changed, why, evidence)

6) Worked micro-example (new equipment type)

A new controlled rate freezer goes live → input drift.

  • Add a representative “equipment-X” slice to the gold set.

  • Re-run evals; require ≥ 92% top-k, 0% prohibited actions, ≥ 95% citation coverage.

  • Don’t enable suggestions on equipment-X until the slice meets the bar.

  • Update CoU, risk register, and change-control record.

Compatibility note: I run a continuous, risk-based lifecycle and map evidence to the CSA/GAMP guidance.

Read More
Kayla Britt Kayla Britt

From Pilot to Production: A Practical Roadmap for LLM Implementation in GxP Environments

Editor’s note (Nov 17, 2025): This article has been updated to reflect a continuous, risk-based lifecycle consistent with GAMP 5 (Second Edition) and the ISPE GAMP AI guidance. Per GAMP 5 (2nd ed.), specification and verification are not inherently linear and fully support iterative, incremental methods. Where legacy terms (e.g., IQ/OQ/PQ) appear, they are provided as a crosswalk for teams whose SOPs still file that way.

What's the difference between an LLM that works and one that's validated for life sciences use? Everything.

When implemented safely, AI can bring intelligence, automation, and real-time decision-making to quality processes. But in life sciences, where errors can impact patient safety and regulatory compliance, bridging the gap between AI's potential and reality necessitates careful strategy and implementation.

From identifying a clear scope of use to monitoring and evaluation, the full lifecycle of a deployed LLM requires end-to-end validation.

While organizations own their validation destiny, the specialized nature of LLM validation often requires external expertise. Whether providing strategic frameworks, hands-on validation execution, or capability building, experienced partners can accelerate compliant AI adoption while avoiding common pitfalls.

Let’s walk through the process below . . .

Note: Before embarking on validation, organizations need a governance framework defining when and how LLMs can be considered. This isn't part of validation itself but rather the prerequisite 'organizational readiness' that enables compliant AI adoption. Phase 1 then builds on this foundation with specific use-case documentation.

📍Phase 1: Definition & Risk Assessment

  • Definition- we must define the user requirements and do a thorough risk assessment for the LLM.

  • Organizations don't need to reinvent their validation approach for AI. A risk-based approach aligned with GAMP emphasizes comprehensive testing around AI-specific risks like hallucination, drift, and traceability. This evolution, not revolution, approach helps maintain regulatory compliance while addressing novel AI challenges. We’ve pre-built standard LLM additions, enabling seamless integration into your existing processes.

  • The URS and SOP work in tandem but serve distinct purposes. The URS defines what the system must do—its capabilities, limitations, and performance standards. The SOP defines how humans interact with that system—who can use it, when it's appropriate, and what procedures to follow. Together, they create a complete framework for compliant LLM use. Think of it this way: The URS ensures the LLM is fit for purpose. The SOP ensures it's used for that purpose.

📍Phase 2: Design & Development

  • To create a true fit-for-purpose LLM, we must ensure the model architecture aligns with risk level and use case. The outputs from Phase 1 directly inform our approach.

    *Note: Unlike traditional software, LLM performance can degrade over time as production data evolves—a phenomenon called "data drift." This occurs when new products, updated SOPs, or changed terminology cause the production environment to diverge from training conditions. This reality shapes our design decisions, requiring built-in monitoring capabilities and clear revalidation triggers from day one.

  • Risk-Based Model Selection

    • High-Risk (patient safety, batch release):

      • Smaller, specialized models

      • Deterministic components (where possible)

      • Extensive guardrails and confidence thresholds

    • Medium-Risk (document review, categorization):

      • Balanced models

      • Commercial or open-source options possible

      • Emphasis on explainability features

    • Low-Risk (literature search, drafting):

      • Larger models acceptable

      • API-based solutions may be appropriate

      • Emphasis on performance over interpretability

📍Phase 3: Verification & Model Validation

  • Confirm correct deployment- model version verification

  • Fit-for-Purpose Qualification addresses LLM-specific testing:

    • Model verification against accuracy benchmarks (≥95% vs SME)

    • Use case validation with real-world scenarios

    • Integration testing with existing QMS systems

  • Performance Check demonstrates sustained performance with production data and confirms users can follow updated SOPs effectively.

📍Phase 4: Deployment & Control

  • Beyond technical deployment, successful implementation requires:

    • SOP revision: “AI-Assisted [Process Name] with clear oversight requirements

    • Training requirement: 2-hour session on reviewing/verifying LLM outputs

    • Output controls: All LLM-output marked as “Draft- Requires Review”

    • Change control: Model versions, prompts, and data pipelines under formal control

    • Audit trail: Complete traceability of inputs, model version, and human decisions

📍Phase 5: Continuous Monitoring & Improvement

  • Key Metrics to Track:

    • Model accuracy trending

    • Confidence score distribution

    • User override rates

    • Processing time per request

  • Revalidation Triggers (Defined in Advance)

    • New equipment types added

    • Changes to review criteria in SOPs

    • Model performance degradation <5% week-over-week

    • Regulatory guidance updates

Example: A deviation categorization LLM following this framework achieved 94% accuracy against SME review and reduced processing time from 4 hours to 30 minutes per batch.

Validating LLMs for life sciences isn't about reinventing validation—it's about thoughtfully embracing software tools and automation to improve higher quality and lower risks. Ready to accelerate your AI validation journey? Stay tuned for next week's deep dive on data drift.

Read More
Kayla Britt Kayla Britt

fit-for-purpose llms: why it matters

Validated ≠ leaderboards. Recent studies show that large language models can be more agreeable than humans—optimizing for pleasing answers rather than correct ones. That’s entertaining in chat apps; it’s risky in life‑sciences workflows. The antidote is simple: design for fit‑for‑purpose, not applause.

The problem: helpful isn’t the same as correct

Most LLMs are tuned to be helpful and polite. In practice, that can morph into sycophancy—agreeing with the user’s assumption even when it’s wrong. In R&D and GxP‑adjacent settings, this shows up as:

  • False reassurance: an LLM gently validates a shaky hypothesis or casual assumption.

  • Label echo: the model over‑indexes on prior labels and quietly repeats them.

  • “Looks right” bias: well‑phrased but ungrounded answers that slip through review.

Bottom line: if you don’t explicitly design against sycophancy, you’ll ship it.

What “fit‑for‑purpose” actually means

“Fit‑for‑purpose” is not a vibe; it’s a measurement and operations problem:

  1. Context of Use (CoU) + risk: who uses the model, for what decision, with which failure modes. Evidence depth matches impact.

  2. Consequence‑weighted metrics: errors are not equal—weight them by business/clinical consequences.

  3. Traceable, domain data: evaluation sets with lineage (ALCOA+), leakage controls, and real edge cases.

  4. Pre‑registered acceptance criteria: metrics, thresholds, and sample sizes agreed upfront.

  5. HITL & SOPs: clear review thresholds, escalation paths, and training—so "agreeable" outputs don’t slide through.

  6. Monitoring & drift: golden‑set rescoring, quality KPIs, and ownership in production.

  7. Change control for retraining: triggers, impact assessments, rollback, and signed release notes.

Anti‑sycophancy tests you should run

If your model can pass these, you’re on the right path:

  • Agreement‑vs‑truth: does the model side with a confident but wrong user, or with the evidence?

  • Dissent calibration: can it respectfully challenge a claim and cite sources?

  • Authority flip: does behavior change when the “speaker” is a junior analyst vs. a PI/manufacturer lead?

  • Self‑confidence checks: does it hedge appropriately when uncertain?

  • Grounding audits (for RAG): are citations real, relevant, and actually used in the answer?

R&D vs. regulated work: same measurements, scaled

  • In R&D, a lightweight credibility plan prevents “polite hallucinations” from steering experiments.

  • For GxP‑impacting steps, expand those measurements into formal V&V, audit trails, and independence in testing. The framework is the same; the rigor scales with risk.

Why this matters to regulators and QA

Health authorities and QA teams don’t ask for leaderboard screenshots. They expect risk‑based credibility tied to the model’s context of use, with documented operation, monitoring, and change control. If you can walk into an audit with that story—and evidence—you’re ready.

A simple flow that works

CoU → Risk → Eval Design → Acceptance Criteria → HITL → Monitoring → Change Control
Ship with this lifecycle in place and you’ll avoid the trap of “agreeable but wrong.”

What I deliver

  • R&D Fit‑for‑Purpose Sprint (2–4 wks): CoU & risk rubric • eval set + error taxonomy • acceptance criteria • small pilot • decision memo.

  • GxP Validate → Launch (6–10 wks): validation protocol & report • supplier qualification • change control • monitoring/drift • audit pack.

  • Monitor → Improve (retainer): golden‑set rescoring • drift watch • periodic re‑validation • release notes • inspection readiness.

CTA

Curious if your LLM is truly fit‑for‑purpose? Book a 20‑minute fit check. I’ll share a quick scorecard, highlight gaps, and recommend the smallest experiment that proves value.

Read More
Kayla Britt Kayla Britt

validation for llms: An interdisciplinary perspective

The advent of modern neural networks carries the promise of transforming industries worldwide. Yet, the “black box” nature of large language models (LLMs) introduces substantial risk — particularly in high-stakes domains such as life sciences and pharmaceuticals.

Effective validation requires more than code reviews or benchmark scores. It demands a risk-based, interdisciplinary approach that integrates expertise in both data science and the domain being modeled. A biologist, for instance, can spot when a generative model produces biologically implausible hypotheses that might escape a purely technical evaluator.

True validation extends beyond technical metrics. It involves translating complex architectures and training data assumptions into a transparent, testable framework — one that aligns with scientific rigor and regulatory expectations.

As AI systems increasingly shape discovery pipelines, interdisciplinary validation will become the foundation of trust. Building teams that bridge computational and domain knowledge isn’t optional; it’s the key to ensuring LLMs advance science responsibly, rather than simply accelerating it.

Read More