ISO/IEC 42001 for Agentic AI: The Certification Evidence Gap That Policies Can't Close

ISO/IEC 42001:2023 is not a compliance checklist — it is a certification standard. The difference matters. A compliance checklist asks: do your policies say the right things? A certification audit asks: show me the evidence that your controls ran. For organisations deploying agentic AI, the most demanding ISO 42001 controls — Annex A.6.1.6 (operational logging enabling reconstruction of AI system behaviour) and the human oversight controls — both require the same thing: documented proof that the AI system operated as governed, at the moment it operated.


What ISO/IEC 42001:2023 is — and why certification changes the evidence requirement

ISO/IEC 42001:2023 is the international standard for AI management systems, published in December 2023. It provides a structured framework — modelled on ISO 9001 and ISO 27001 — for organisations to establish, implement, maintain, and continually improve how they develop and deploy AI systems. Organisations already certified to ISO 9001 (quality management) or ISO 27001 (information security) will find the clause structure familiar: policy, risk assessment, operational controls, performance evaluation, and improvement.

The critical distinction between ISO 42001 and advisory frameworks like NIST AI RMF is what certification requires. ISO auditors do not accept policy documents as evidence of control operation. They require documented records — logs, receipts, decision records — that demonstrate controls ran during the period under audit. For agentic AI systems, this creates an acute problem: most AI governance tooling produces policies, dashboards, and reports. Very little produces per-action evidence that a control executed before the agent did.

ISO/IEC 42001 is not legally mandated in most jurisdictions, but it is increasingly referenced in procurement requirements and enterprise partner questionnaires, and it is increasingly cited as a conformity path for EU AI Act Article 9 risk management obligations. Organisations pursuing certification are typically those already operating under ISO management systems: financial services, healthcare, manufacturing, critical infrastructure.

Where agentic AI creates control gaps

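ISO/IEC 42001 Annex A defines the normative controls; the numbered clauses define the management system requirements. Agentic AI systems most commonly create evidence gaps during certification audits in four areas, two from Annex A and two from the clauses: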

A.6: AI System Lifecycle (evidence gap)
Operational controls for AI system deployment. A.6.1.6 requires logging sufficient to reconstruct AI system behaviour. Most agentic AI produces logs of what the model reported, not records of what the control layer enforced.

A.7: Human Oversight (evidence gap)
Controls requiring human review and intervention mechanisms. For agentic AI, the gap is structural: if the only way to stop the system is to ask the model to stop, human oversight is nominal, not a control.

Clause 9: Performance Evaluation (measurement gap)
Monitoring, measurement, and analysis of AI system behaviour. Clause 9.1 requires methods appropriate to the risk. For agentic AI, per-session metrics are insufficient; evidence must be per-action.

Clause 8: Operation (operational gap)
Risk treatment embedded in AI system operation. Clause 8 requires that risk controls are part of the operational process, not a parallel review exercise running separately from the AI system.

The audit failure pattern is consistent: organisations demonstrate strong policy controls (Clause 5, Clause 6 planning) but cannot produce per-action evidence for operational controls (A.6, A.7) and performance evaluation (Clause 9). The gap is not in governance intent — it is in the mechanism that would produce the evidence.

Annex A.6.1.6 — Operational logging and reconstruction of AI system behaviour

A.6.1.6: Operational logging enabling reconstruction of AI system behaviour

ISO/IEC 42001 Annex A.6.1.6 requires that organisations implement operational logging for AI systems at a level that enables reconstruction of AI system behaviour. The word "reconstruction" is precise. It is not sufficient to log that a sequence ran — the log must enable a complete picture of what the system did, in what order, at what time, and under what conditions.

For agentic AI, the failure mode that this control targets is probabilistic behaviour: a model that skips a step, infers a condition was met, or proceeds without explicit authorisation. If the only log is a model-generated output report, reconstruction is impossible — the model's report of what it did may not reflect what actually ran.

How sequence enforcement addresses it: Every gate decision — ALLOW or DENY — is written to immutable storage before the agent step executes. The receipt records: sequence ID, step identifier, function, action type, gate decision, timestamp (UTC), nonce, and a cryptographic pack ID. Receipts are chained via prev_receipt_id — each receipt links to the previous, creating a tamper-evident sequence record. To reconstruct exactly what ran, in what order, at what time: read the chain. The chain is the reconstruction. It is not derived from model output. It is written by the enforcement layer before the model acts.

Audit evidence produced: The compliance report endpoint generates a full chain proof — per-step enforcement log, HMAC signature verification for every receipt, chain linkage from step 0 → N, and an AI-generated compliance narrative. This constitutes the documented evidence Annex A.6.1.6 requires for certification audit.
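The write-before-execute ordering and prev_receipt_id chaining described above can be sketched as follows. This is a minimal illustration, not AgenticRail's implementation: the ReceiptLog class, the gated_step and reconstruct functions, the "demo-pack" value, and the choice of a SHA-256 content hash as the receipt ID are all assumptions made for the example, and an in-memory list stands in for immutable storage.

```python
import hashlib
import json
import secrets
from datetime import datetime, timezone


class ReceiptLog:
    """Append-only in-memory stand-in for immutable storage (hypothetical)."""

    def __init__(self):
        self._receipts = []

    def append(self, receipt):
        self._receipts.append(dict(receipt))

    def chain(self):
        return [dict(r) for r in self._receipts]


def compute_receipt_id(receipt):
    # Assumption: the receipt ID is a SHA-256 hash of the receipt's other fields.
    body = {k: v for k, v in receipt.items() if k != "receipt_id"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def gated_step(log, sequence_id, step, function, action_type, policy, action):
    """Record the gate decision first, then run the action only on ALLOW."""
    chain = log.chain()
    decision = "ALLOW" if policy(step, function, action_type) else "DENY"
    receipt = {
        "sequence_id": sequence_id,
        "step": step,
        "function": function,
        "action_type": action_type,
        "decision": decision,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "nonce": secrets.token_hex(16),
        "pack_id": "demo-pack",  # hypothetical placeholder value
        "prev_receipt_id": chain[-1]["receipt_id"] if chain else None,
    }
    receipt["receipt_id"] = compute_receipt_id(receipt)
    log.append(receipt)  # the receipt exists before the agent step runs
    return action() if decision == "ALLOW" else None


def reconstruct(log):
    """Walk the chain, check every link, and return what the gate decided."""
    prev_id, history = None, []
    for r in log.chain():
        if r["prev_receipt_id"] != prev_id or compute_receipt_id(r) != r["receipt_id"]:
            raise ValueError(f"chain broken at step {r['step']}")
        history.append((r["step"], r["function"], r["decision"]))
        prev_id = r["receipt_id"]
    return history
```

The point of the ordering is that reconstruct reads only what the enforcement layer wrote; nothing in the history depends on what the model later reports about itself.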

Clause 9.1 — Monitoring, measurement, analysis and evaluation

Clause 9.1: Monitoring and measurement of AI system performance at appropriate intervals

ISO/IEC 42001 Clause 9.1 requires that organisations determine what needs to be monitored and measured, what methods are appropriate, and when analysis and evaluation should occur. For agentic AI systems making consequential decisions, "appropriate intervals" means per-action: not per-session, not daily dashboards, not weekly reports.

The common implementation gap is treating Clause 9.1 as a reporting requirement. Aggregate metrics satisfy the letter of the clause for low-risk systems. For agentic AI in regulated contexts — underwriting, hiring, clinical triage, access control — an auditor will ask: how do you know the system behaved correctly on this specific decision, at this specific time? Aggregate metrics cannot answer that question.

How sequence enforcement addresses it: Gate statistics are recorded for every decision, covering total evaluations, ALLOW/DENY/HALT ratios, step distribution, sequence completion rates, and denial reason codes. These are written to a key-value (KV) store on every gate call and surfaced in the dashboard in real time. For Clause 9.1, this provides monitoring that is contemporaneous with execution at the required granularity: every action, not every session. The receipt chain also enables full retrospective analysis: which steps were denied, at what timestamps, for what reason, and across how many sequences.
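As a rough sketch of how those per-decision statistics could be derived from stored receipts: the gate_statistics function, the reason field name, and the final_step parameter used to define "completed" are assumptions for illustration, not the product's actual schema.

```python
from collections import Counter


def gate_statistics(receipts, final_step):
    """Summarise gate decisions; a sequence counts as complete once its
    final step (an assumed parameter) has an ALLOW receipt."""
    total = len(receipts)
    decisions = Counter(r["decision"] for r in receipts)
    sequences = {r["sequence_id"] for r in receipts}
    completed = {
        r["sequence_id"]
        for r in receipts
        if r["step"] == final_step and r["decision"] == "ALLOW"
    }
    return {
        "total_evaluations": total,
        "decision_ratios": {
            d: (decisions[d] / total if total else 0.0)
            for d in ("ALLOW", "DENY", "HALT")
        },
        "step_distribution": dict(Counter(r["step"] for r in receipts)),
        "completion_rate": len(completed) / len(sequences) if sequences else 0.0,
        "denial_reasons": dict(
            Counter(
                r.get("reason", "unspecified")
                for r in receipts
                if r["decision"] == "DENY"
            )
        ),
    }
```

Because the input is the receipt list itself, every aggregate can be traced back to the individual decisions that produced it, which is the property an auditor asking about one specific decision needs.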

Human oversight controls — structural authority to intervene

A.7 controls: Human review and intervention mechanisms for AI system operation

ISO/IEC 42001 Annex A.7 defines human oversight controls — the mechanisms by which humans can review AI system behaviour and intervene when needed. For agentic AI, the critical test is not whether a human could intervene, but whether the system architecture requires human-controlled authorisation to proceed.

The audit failure here is structural: if the AI system can run a complete sequence without any human-controlled checkpoint, then human oversight is an option, not a control. A certification auditor will ask: show me the mechanism. A policy that says "humans review outputs" is not a mechanism — it is a procedure. A mechanism is something the system cannot bypass.

How sequence enforcement addresses it: The gate key is the human-controlled mechanism. Gate access is issued to designated personnel via API keys. Those personnel control which sequences are permitted to run, which step orders are enforced, and when a sequence should be halted. No agent step can execute without clearing the gate — the model cannot self-authorise, self-report around the gate, or replay a previously issued ALLOW. When a sequence is sealed, no further steps are accepted. Human authority is structural: it is embedded in the enforcement path, not described in a governance document.

Audit evidence produced: key issuance records; sequence halt events (HALT decisions with timestamps and reason codes); and denial logs showing the gate blocked non-compliant steps before execution. Every HALT or DENY is proof the human oversight mechanism fired.
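The structural properties described above (enforced step order, no replay of a previously used nonce, no further steps after sealing, and a human-triggered halt) can be illustrated with a small state machine. This is a simplified sketch under stated assumptions: the SequenceGate class, its method names, and the reason codes are hypothetical, and API-key authentication of the operator is left out.

```python
class SequenceGate:
    """Toy gate: steps must arrive in the configured order, each nonce is
    single-use, and a sealed or halted sequence accepts nothing further."""

    def __init__(self, expected_steps):
        self.expected_steps = list(expected_steps)
        self.position = 0
        self.sealed = False
        self.seen_nonces = set()
        self.events = []  # decision log: the evidence an auditor would read

    def _record(self, decision, step, reason):
        self.events.append({"decision": decision, "step": step, "reason": reason})
        return decision

    def evaluate(self, step, nonce):
        if self.sealed:
            return self._record("DENY", step, "sequence_sealed")
        if nonce in self.seen_nonces:
            return self._record("DENY", step, "nonce_replayed")
        self.seen_nonces.add(nonce)
        out_of_steps = self.position >= len(self.expected_steps)
        if out_of_steps or step != self.expected_steps[self.position]:
            return self._record("DENY", step, "out_of_order")
        self.position += 1
        if self.position == len(self.expected_steps):
            self.sealed = True  # final step cleared: seal the sequence
        return self._record("ALLOW", step, "ok")

    def halt(self, operator, reason):
        """Human-triggered stop; takes effect regardless of what the agent does."""
        self.sealed = True
        return self._record("HALT", None, f"{operator}: {reason}")
```

The agent never holds a reference to anything that can flip sealed back or skip evaluate, which is the difference between a mechanism and a procedure in the sense used above.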

The receipt chain as ISO 42001 audit evidence

ISO 42001 certification audits require documented evidence. The receipt chain AgenticRail produces is designed to be that documentation — not a report generated after the fact, but a record written at the moment of each enforcement decision.

Each receipt is HMAC-signed using a versioned key. The signature covers a canonical JSON serialisation with keys sorted alphabetically, so the same receipt always produces the same bytes. Every receipt records: sequence ID, step identifier, function, action type, gate decision, timestamp, nonce, and pack ID. Receipts are linked via prev_receipt_id. The chain is self-evidencing: if it verifies, the record is intact and shows the sequence as the enforcement layer wrote it; if it does not verify, the record has been altered or corrupted, and the break is detectable. Optional attestation fields (KYC results, risk scores, approval IDs, document hashes) are signed into the receipt at the step they apply to.
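A minimal sketch of signing and verifying such a chain with Python's standard hmac module is shown below. The KEYS store, the key_version and signature field names, the use of SHA-256, and the exact canonicalisation (sorted keys, compact separators) are assumptions for illustration; the real receipt format may differ.

```python
import hashlib
import hmac
import json

# Hypothetical versioned key store; real keys would live in a secrets manager.
KEYS = {"v1": b"demo-secret-not-for-production"}


def canonical_bytes(receipt):
    """Deterministic serialisation: sorted keys, no whitespace, signature excluded."""
    body = {k: v for k, v in receipt.items() if k != "signature"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(receipt, key_version="v1"):
    signed = dict(receipt, key_version=key_version)
    signed["signature"] = hmac.new(
        KEYS[key_version], canonical_bytes(signed), hashlib.sha256
    ).hexdigest()
    return signed


def verify_chain(receipts):
    """True only if every signature checks out and every link points at the
    receipt before it (the first receipt must have prev_receipt_id None)."""
    prev_id = None
    for r in receipts:
        key = KEYS.get(r.get("key_version"))
        if key is None:
            return False
        expected = hmac.new(key, canonical_bytes(r), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, str(r.get("signature", ""))):
            return False
        if r.get("prev_receipt_id") != prev_id:
            return False
        prev_id = r.get("receipt_id")
    return True
```

Attestation fields need no special handling in this sketch: any extra key placed in the receipt before sign is called is covered by the signature, so changing a risk score afterwards invalidates that receipt.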

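For an ISO/IEC 42001 certification audit, the receipt chain provides documented evidence against the three control areas discussed above: operational logging (Annex A.6.1.6), monitoring and measurement (Clause 9.1), and human oversight (Annex A.7).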

To generate a compliance report for any sequence:

POST https://report.agenticrail.nz

{"sequence_id": "your-seq-id", "format": "html"}

The report includes: per-step enforcement log, chain linkage proof, cryptographic verification results, and a compliance narrative. It is suitable for inclusion in ISO/IEC 42001 management system documentation, certification audit evidence packages, and management review records.
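For readers scripting this call, the request above could be built with Python's standard library roughly as follows. Only the URL, the POST method, and the two body fields come from the example above; the Content-Type header, the absence of authentication, and the helper names build_report_request and fetch_report are assumptions, so check the service's own documentation before relying on them.

```python
import json
import urllib.request

REPORT_URL = "https://report.agenticrail.nz"


def build_report_request(sequence_id, fmt="html"):
    """Build (but do not send) the POST request for a compliance report."""
    payload = json.dumps({"sequence_id": sequence_id, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        REPORT_URL,
        data=payload,
        headers={"Content-Type": "application/json"},  # assumed, not documented here
        method="POST",
    )


def fetch_report(sequence_id, fmt="html", timeout=30):
    """Send the request and return the response body as text."""
    request = build_report_request(sequence_id, fmt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Separating request construction from sending keeps the payload inspectable, which is useful when the request itself needs to be retained alongside the report in an evidence package.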

Cross-framework: one chain, three standards

ISO/IEC 42001, EU AI Act, and NIST AI RMF all converge on the same operational evidence requirement: proof that oversight mechanisms functioned during deployment. The clause numbering differs; the underlying gap is the same.

You build the enforcement layer once. The receipt chain it produces maps to all three frameworks simultaneously — because all three are asking the same question: did your controls actually run?

See it working

One enforcement layer. ISO 42001, EU AI Act, and NIST AI RMF satisfied.

Run a demo sequence and generate a compliance report in 60 seconds — no signup required. The report shows the receipt chain, chain proof, and per-step enforcement log suitable for ISO 42001 audit evidence.
