When Your AI Vendor Passes SOC 2 — and Still Fails GxP
On April 27, 2026, Valkit.ai (originally the developer of a digital validation lifecycle management platform, since expanded into a broader product line) shipped Valkit MSAT: a purpose-built product for GxP tech transfer and CDMO engagements.
What's worth pausing on is what was in the launch announcement. Alongside the feature list, Stephen Ferrell, Valkit's CPO and Co-Founder and Co-Chair of GAMP Americas, and his team disclosed an unusually specific set of supplier-qualification artifacts: ISO 27001:2022 certification, a CSA STAR Valid-AI-ted Level 1 self-assessment for AI-specific security controls, a written commitment that no customer data is used to train underlying models, and mandatory human-in-the-loop review with electronic signature on every AI-generated record before it can be finalized.
Most readers will skim past that paragraph. Pharma quality teams shouldn't.
Because the question that paragraph is quietly answering is the one most AI vendor relationships in our industry haven't answered at all:
What does AI-specific supplier qualification actually look like, beyond the documents your existing QMS already requires?
Beyond practical considerations like ease of use, sponsors increasingly value alignment with regulatory direction and industry supplier best practices. On April 2, 2026, the FDA issued a warning letter to Purolea Cosmetics Lab. Notably, the FDA cited an existing regulation, 21 CFR 211.22(c), sending a clear message to industry: improper AI use can violate existing regulations; no new rulemaking is required.
What sponsors should note:
Mandatory HITL on AI-generated GxP content is now an FDA enforcement expectation, not a best practice
AI oversight cannot be delegated to the vendor or the AI platform itself
Quality agreements and quality systems should be updated to address AI tool governance in cGMP operations
This isn't just a regulatory point — it reflects how vendors who've thought about the AI-specific layer are framing the problem. As Stephen Ferrell of Valkit.ai put it, “There's a knowledge deficit between probabilistic RAG-infused AI and HITL-centric, deterministic AI. HITL gives you the opportunity to augment instead of replace.”
Regulatory and industry-standard authorities relevant to AI supplier qualification have proliferated over the past eighteen months. Several now address parts of the supplier qualification problem; none addresses it completely.
1. GAMP AI Guide (Ch. 7, Appendix M2, July 2025) + GAMP 5 Second Edition (2022)
The GAMP AI Guide addresses General Supplier Good Practices, Data Usage, and AI-Specific Security Measures. Appendix M2, “Supplier Management”, covers Concept, Project, Operation, and Retirement phases. The guide references ISO/IEC 42001:2023 and builds on GAMP 5 Second Edition (2022).
2. FDA CSA Final Guidance (September 2025, February 2026)
The guidance explicitly endorses leveraging vendor evidence, including but not limited to SOC reports, ISO certifications, SDLC artifacts, and cybersecurity documentation. The underlying principle is clear: a risk-based approach is expected.
3. ISO/IEC 42001:2023 (December 2023)
This is the world's first international standard for AI management systems, published jointly by ISO and IEC in December 2023. It specifies requirements for establishing, implementing, maintaining, and continually improving an AI management system within an organization. Notably, ISO 42001 certification is the prerequisite for CSA STAR for AI Level 2, meaning a vendor holding the Level 2 designation is, by definition, already ISO 42001 certified.
4. Annex 22 (Draft: July 2025)
The first comprehensive EU regulatory framework specifically for AI in GMP. Critically, it excludes LLMs and probabilistic models from GMP-critical applications, permitting them only in non-critical applications under direct human oversight. For supplier qualification, it makes clear that vendor accountability cannot be outsourced; the regulated user retains responsibility regardless of vendor pre-qualification.
The framework most teams are still using
Walk into a typical AI vendor qualification today and the artifacts on the table are familiar. SOC 2 Type II report. ISO 27001 certificate. A GAMP supplier audit, often built from a Category-4 or Category-5 questionnaire designed in 2008. A quality agreement borrowed from the last LIMS implementation. Maybe a penetration test summary.
All of it is necessary; none of it is sufficient.
Here, I propose an operational framework for AI supplier qualification in non-critical regulated pharma applications, applicable to probabilistic modalities like LLMs and agentic systems, built on a four-pillar architecture and a five-step decision logic.
The five operational risks a traditional QMS misses
1. Training data provenance
Traditional supplier qualification asks where customer data is stored and who can access it. AI supplier qualification has to ask questions that have no equivalent in classical software:
What data was the model trained on?
Who decided it was fit for that purpose?
This is now an explicit regulatory expectation. The draft EU GMP Annex 22, published for consultation in July 2025, requires that test data used to validate AI systems be kept entirely separate from the data used to train the models, to guard against bias and ensure true predictive power. The ISPE GAMP Guide: Artificial Intelligence, published in July 2025 and now the foundational reference for the industry, devotes entire sections to data and model governance, data usage, and AI-specific security measures within its supplier-activities chapter.
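The separation requirement itself is mechanically simple to enforce, which makes it a reasonable thing to ask a vendor to demonstrate. A minimal sketch, assuming training and test records carry stable identifiers (the function and field names are illustrative, not drawn from any standard):

```python
def assert_train_test_separation(train_ids: set[str], test_ids: set[str]) -> None:
    """Fail loudly if any record used to train the model also appears in the
    data used to demonstrate its predictive power (per the Annex 22 draft)."""
    overlap = train_ids & test_ids
    if overlap:
        raise ValueError(
            f"{len(overlap)} record(s) appear in both training and test sets, "
            f"e.g. {sorted(overlap)[:5]}"
        )

# Illustrative usage with hypothetical record identifiers:
assert_train_test_separation(
    train_ids={"BR-0001", "BR-0002", "BR-0003"},
    test_ids={"BR-0104", "BR-0105"},  # disjoint, so the check passes silently
)
```

The hard part is not the check; it is whether the vendor can produce the identifier lists at all. A vendor who cannot enumerate what went into training cannot demonstrate the separation Annex 22 asks for.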
2. Model governance and update cadence
This is the one that catches sponsors off guard most often. In classical computerised systems, change control is the sponsor's lever: you decide when an update lands and you validate it on your timeline. In AI systems built on foundation models, that lever may not exist at all.
A recent analysis of AI wrapper vendors put it plainly: wrapper vendors typically have no control over model governance. When the foundation model provider updates its model, all customers receive the change simultaneously. There is no version pinning, no controlled rollout, no ability to validate before deployment.
That's a GxP problem hiding inside a procurement decision. If your supplier is reselling capability they don't control, your validated state is at the mercy of an upstream provider's release schedule. Annex 22 anticipates this by restricting GMP-critical applications to static models and deterministic output models, with dynamic models, generative AI/LLM, and probabilistic output models prohibited in critical use. Whatever you think of that line, the regulatory direction is clear: model update behaviour is now a supplier qualification question.
What you should be asking (a code sketch follows this list):
Does the vendor pin model versions for regulated customers?
What is the change control process when the underlying model is updated?
Will the vendor commit, in writing, to a controlled rollout window that allows you to revalidate?
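To make the first question concrete: the difference between a pinned model and a floating one is often a single string in an API call. A minimal sketch using the Anthropic Python SDK; the model identifiers are examples current at the time of writing, so verify against the provider's documentation:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Pinned: a dated snapshot. Behaviour changes only when you change this string,
# which becomes an auditable change-control event on your side.
PINNED_MODEL = "claude-sonnet-4-20250514"

# Floating: an alias the provider can repoint to a newer model at any time,
# outside your change control.
FLOATING_MODEL = "claude-sonnet-4-0"

response = client.messages.create(
    model=PINNED_MODEL,  # never the floating alias in a validated workflow
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize deviation DEV-042."}],
)
```

Even a pinned snapshot is eventually deprecated by the provider, which is why the third question, the written rollout-window commitment, matters: pinning buys you time to revalidate, not permanence.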
3. Inference variance and reproducibility
Classical software gives you the same output for the same input. AI systems often don't, and that isn't a defect; it's the architecture. Temperature settings, sampling, system prompts, retrieval context, even the order of input tokens can produce different outputs from identical-looking inputs.
For GxP purposes this is the heart of the problem. Audit trails assume reproducibility. Deviation investigations assume you can re-run the conditions. Forensic review of an AI output six months after the fact assumes the vendor can tell you what the system actually did at the moment in question.
Annex 22's approach to this is to confine GMP-critical use to deterministic outputs, but even within that envelope, supplier qualification has to verify what controls the vendor has placed on the elements that could introduce variance. System prompts. Temperature. Retrieval indices. Tool-use scaffolding. Most quality agreements don't mention any of these because the standard templates pre-date them.
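One way to make "what the system actually did" answerable six months later is to capture an inference manifest alongside every output. A minimal sketch; the fields are my own illustration, not any vendor's schema:

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a GxP documentation assistant."  # illustrative

@dataclass(frozen=True)
class InferenceManifest:
    """The conditions under which a single AI output was produced."""
    model_version: str            # pinned snapshot, never a floating alias
    temperature: float            # 0.0 narrows variance but does not eliminate it
    system_prompt_sha256: str     # a hash makes silent prompt edits detectable
    retrieval_index_version: str  # which knowledge base the context came from
    timestamp_utc: str

manifest = InferenceManifest(
    model_version="claude-sonnet-4-20250514",
    temperature=0.0,
    system_prompt_sha256=hashlib.sha256(SYSTEM_PROMPT.encode()).hexdigest(),
    retrieval_index_version="kb-2026-04-15",
    timestamp_utc=datetime.now(timezone.utc).isoformat(),
)
audit_record = json.dumps(asdict(manifest))  # filed alongside the output itself
```

A vendor that can produce something like this for any historical output has answered the forensic-review question. One that cannot has left your deviation investigations resting on memory.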
4. AI-specific security, not just IT security
ISO 27001 covers information security management. SOC 2 covers trust services criteria. Neither was built for prompt injection, model exfiltration, training-data poisoning, or adversarial inputs that exploit how models actually process information.
This is why the Cloud Security Alliance's STAR for AI program, launched in October 2025 with Level 2 added in November, exists in the first place. It builds on ISO/IEC 42001 (the AI management system standard) and adds the AI Controls Matrix precisely because a SOC 2 report can tell a customer that controls over security, confidentiality, or privacy exist and are described. It does not automatically tell them how the vendor handles prompt injection, unsafe tool invocation, agent memory poisoning, or model governance.
Pharma quality teams haven't widely adopted ISO 42001 or CSA STAR for AI in their supplier qualification questionnaires yet. That gap is closing fast. Vendors who already hold these certifications are signalling readiness for the supplier qualification questions the next two years will bring.
5. Validation-inheritance boundary
The four risks above all live within a single supplier relationship. The fifth lives in the spaces between them. When a pharma sponsor uses a vendor that uses Anthropic on AWS Bedrock, four parties shape the AI's behavior (the sponsor, the wrapper vendor, the foundation model provider, and the hyperscaler), and current supplier qualification frameworks address none of them as a system.
Where does the foundation model provider's responsibility end and the wrapper vendor's begin? Where does the wrapper vendor's responsibility end and the sponsor's begin? When the sponsor's quality team writes their validation summary, which validation evidence are they entitled to inherit, and which do they have to generate themselves?
The GAMP AI Guide gestures at this in Chapter 7 by treating each supplier relationship as discrete. The FDA CSA final guidance endorses leveraging vendor evidence but doesn't address the layered case. Annex 22 punts: the regulated user retains accountability regardless of how many parties contributed to the AI's behavior.
That sounds clean. In practice it means sponsors are validating systems whose behavior is partially determined by parties they have no contractual relationship with. The IP indemnity from the wrapper vendor doesn't cover the foundation model's training data. The wrapper vendor's SOC 2 doesn't cover the hyperscaler's BAA. The sponsor's validation summary has to either claim coverage of layers they don't control, or carve out exclusions that may or may not satisfy an inspector.
This is the supplier qualification question nobody has fully answered. The 2026 best practice is partial: BAA-eligible hyperscaler tiers (AWS Bedrock, Azure OpenAI, Vertex AI) compress the legal layer, contractual indemnification compresses the IP risk layer, and sponsor-side validation of context-of-use compresses the technical layer. None of those compressions are complete. All of them are negotiated rather than standardized.
This is where the analytical layer becomes operational. The five risks above don't sit in parallel: risks 1 through 4 each surface within a single supplier relationship, and risk 5 emerges only when you map them across the supply chain. That mapping is what VALID TRUST does.
VALID TRUST: A Framework for Supplier Qualification
VALID TRUST is the name I use for this supplier-qualification control architecture. It decomposes any AI vendor into four pillars based on what the AI actually does in the workflow:
Generative AI Validation — applies whenever the system produces novel content (text, structured records, draft documents). The qualification questions concentrate on training data provenance, output reproducibility, and grounding controls.
Agentic AI Governance — applies whenever the system takes autonomous action across multiple steps or tools. The qualification questions concentrate on action scope, tool-use scaffolding, autonomous decision boundaries, and HITL/HOTL design.
Probabilistic Acceptance — applies whenever the system's outputs vary for identical-looking inputs. The qualification questions concentrate on calibration, acceptance criteria, uncertainty quantification, and inference variance controls.
Continuous Monitoring — applies whenever the model, its data, or its operating context can change post-deployment. The qualification questions concentrate on drift detection, model versioning, change control, and revalidation triggers.
Most AI vendors trigger more than one pillar. The framework's job is to make explicit which ones, and to scope the qualification work accordingly.
The decision logic
Step 1 — Map the vendor to the pillars. For the AI vendor in front of you, determine which of the four pillars apply based on what the system actually does. A static classifier triggers Probabilistic Acceptance and Continuous Monitoring but not Generative AI Validation. An LLM-based content authoring tool triggers three pillars. An autonomous trial design agent triggers all four.
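A minimal sketch of Step 1 as it might be encoded in an intake tool; the capability flags are my simplification of what a real questionnaire would capture:

```python
from enum import Enum, auto

class Pillar(Enum):
    GENERATIVE_AI_VALIDATION = auto()
    AGENTIC_AI_GOVERNANCE = auto()
    PROBABILISTIC_ACCEPTANCE = auto()
    CONTINUOUS_MONITORING = auto()

def map_pillars(*, generates_content: bool, takes_autonomous_action: bool,
                outputs_vary: bool, changes_post_deployment: bool) -> set[Pillar]:
    """Step 1: pillars follow from what the AI actually does, not what it is called."""
    pillars: set[Pillar] = set()
    if generates_content:
        pillars.add(Pillar.GENERATIVE_AI_VALIDATION)
    if takes_autonomous_action:
        pillars.add(Pillar.AGENTIC_AI_GOVERNANCE)
    if outputs_vary:
        pillars.add(Pillar.PROBABILISTIC_ACCEPTANCE)
    if changes_post_deployment:
        pillars.add(Pillar.CONTINUOUS_MONITORING)
    return pillars
```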
Step 2 — Run the pillar-specific questions. For each applicable pillar, work through the qualification questions specific to that AI modality. The questions for Generative AI Validation are not the same as those for Agentic AI Governance — and a single SOC 2 report does not answer either set.
Step 3 — Apply the cross-cutting threads. Across all applicable pillars, run the four cross-cutting threads: security (including AI-specific controls beyond ISO 27001), explainability, stakeholder communications, and supplier qualification meta-questions (certifications, disclosure posture, contractual commitments). These apply uniformly regardless of which pillars are in scope.
Step 4 — Locate the validation-inheritance boundary. Map the supply chain (foundation model → wrapper vendor → hyperscaler → sponsor). For each pillar, identify which evidence the sponsor can inherit from upstream parties and which the sponsor must generate themselves. This is where most vendor qualification efforts in 2026 break down: the inheritance boundary is rarely explicit, and inspectors will ask.
Step 5 — Synthesize the qualification decision. The output is a single supplier qualification record that names the applicable pillars, summarizes the evidence collected against each, identifies the inheritance boundaries, and documents the residual risks the sponsor is accepting. This goes into the quality agreement and the supplier file.
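Step 5's output can likewise be sketched as a structure, extending the Step 1 sketch above; the field names are illustrative rather than a proposed standard:

```python
from dataclasses import dataclass, field

@dataclass
class SupplierQualificationRecord:
    """One record per vendor, filed with the quality agreement (Step 5)."""
    vendor: str
    applicable_pillars: set[Pillar]            # from Step 1
    evidence: dict[Pillar, list[str]]          # artifacts collected per pillar (Step 2)
    cross_cutting_findings: list[str]          # security, explainability, etc. (Step 3)
    inherited_evidence: dict[str, list[str]]   # upstream party -> evidence relied on (Step 4)
    sponsor_generated_evidence: list[str]      # what the sponsor had to produce itself
    residual_risks: list[str] = field(default_factory=list)  # accepted and named
```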
Worked examples to make this concrete:
Vendor A: Anthropic Claude API (raw foundation model access)
Pillar mapping:
Generative AI Validation: yes
Agentic AI Governance: only if the sponsor is using it agentically; the API itself is not agentic
Probabilistic Acceptance: yes (probabilistic outputs by definition)
Continuous Monitoring: yes (drift, model updates)
So three pillars apply. The supplier qualification questions concentrate on training data provenance, model versioning, inference variance, and drift monitoring, with the agentic governance pillar deferred unless and until the sponsor builds agentic systems on top.
Vendor B: Valkit MSAT
Pillar mapping:
Generative AI Validation: yes
Agentic AI Governance: minimal (mandatory HITL on every output prevents agentic autonomy)
Probabilistic Acceptance: yes (LLM outputs are probabilistic)
Continuous Monitoring: yes (platform-level monitoring)
Three pillars apply, but the agentic pillar is structurally suppressed by the product design. That's a strength; the product architecture eliminates a class of risk that other vendors leave open.
Vendor C: An agentic AI vendor doing autonomous trial design
Pillar mapping:
Generative AI Validation: yes
Agentic AI Governance: yes (this is the central concern)
Probabilistic Acceptance: yes
Continuous Monitoring: yes
All four pillars apply, and the agentic pillar carries the highest weight. The qualification questions concentrate on tool-use scaffolding, action scope, autonomous decision boundaries, and HITL/HOTL design: questions that wouldn't be load-bearing for Vendor A or B.
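Running the three vendors through the Step 1 sketch reproduces the mappings above. The flags restate my assessments in code form, not anything the vendors have published:

```python
vendor_a = map_pillars(generates_content=True, takes_autonomous_action=False,
                       outputs_vary=True, changes_post_deployment=True)
vendor_b = map_pillars(generates_content=True, takes_autonomous_action=False,
                       outputs_vary=True, changes_post_deployment=True)
vendor_c = map_pillars(generates_content=True, takes_autonomous_action=True,
                       outputs_vary=True, changes_post_deployment=True)

assert vendor_a == vendor_b                      # same three pillars apply...
assert Pillar.AGENTIC_AI_GOVERNANCE in vendor_c  # ...only Vendor C adds the fourth
```

That Vendors A and B map identically is itself informative: the pillar mapping scopes the questions, but the answers (raw API access versus mandatory HITL with e-signature) are what differentiate the vendors.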
BioPhorum's A Practical Guide to Technical Assurance for AI (April 7, 2026) and the companion SCOPE paper propose a four-dimensional assurance framework (Structural, Process, Technical, and Cultural) addressing the types of evidence to collect. VALID TRUST's pillars are orthogonal; they decompose by AI capability rather than by evidence type. The two frameworks complement each other: BioPhorum tells you what evidence to collect; VALID TRUST tells you which pillar that evidence applies to for any given vendor modality.
What “good” looks like
What made the Valkit MSAT launch worth pausing on wasn't the product itself; it was that the disclosure pattern is finally becoming legible. Two vendors in the category are now publicly disclosing AI-specific supplier-qualification artifacts that go beyond SOC 2 and ISO 27001:
MasterControl achieved ISO 42001 certification (the AI Management System standard) in July 2025. Notably, ISO 42001 is the prerequisite for CSA STAR for AI Level 2, meaning a vendor holding the Level 2 designation is, by definition, already ISO 42001 certified.
Valkit MSAT discloses a CSA STAR Valid-AI-ted Level 1 self-assessment, ISO 27001:2022 certification, a written commitment that customer data is not used to train underlying models, and mandatory human review with electronic signature on every AI-generated record.
Different artifacts, different emphases. MasterControl leads on the certification stack; Valkit leads on the bundled disclosure of architectural commitments (no-training, mandatory HITL) that complement the cert. Most AI vendors selling into life sciences haven't disclosed either. When you ask, they hand you a SOC 2 report and a marketing deck.
Veeva, by contrast, illustrates a third pattern. Their public 'Third-Party Access and AI Models' policy requires that customer data not be used to train any third-party LLM — but the obligation is contractual, and it's pushed onto the customer rather than absorbed by Veeva. That's a meaningfully different posture than MasterControl's certification stack or Valkit's bundled architectural commitments. All three are 'disclosed,' but they place the locus of accountability in three different places: the auditor (MasterControl), the product architecture (Valkit), and the customer's procurement contract (Veeva). For a sponsor running supplier qualification, where the accountability lives is exactly the question.
What pharma quality teams should be asking
If you're qualifying an AI supplier in 2026, the standard pack is the floor, not the ceiling. The questions worth adding:
What data was the model trained on, and how was its provenance documented?
Is customer data ever used for training, fine-tuning, or evaluation, and if so, under what consent?
How are model versions controlled, and what is the change control process when the underlying model updates?
Where reproducibility matters, what is locked down: temperature, system prompts, retrieval context, tool scaffolding?
What validation does the supplier perform, what evidence will they provide, and where does sponsor responsibility begin?
What AI-specific security controls are in place beyond ISO 27001 — ISO 42001, CSA STAR for AI, equivalent frameworks?
None of these questions appear in the typical GAMP supplier audit template. All of them are now, as of the GAMP AI Guide's July 2025 publication and the EU's Annex 22 consultation, defensibly within the scope of regulatory expectation.
The vendors who can answer these cleanly are differentiating themselves; the sponsors who are asking them are getting ahead of the next inspection cycle. The ones who aren't are about to have an uncomfortable year.
Note: I have no commercial relationship with Valkit.ai, Veeva, or MasterControl, and was not compensated by any of them in connection with this piece.