Original Analysis

The Agentic AI Governance Gap: Why Logging Is Not Compliance

Abstract

The industry default response to agentic AI governance is logging: record what the agent did, when, and surface anomalies after the fact. This analysis identifies five structural failure modes in which logging cannot satisfy the evidence requirements of EU AI Act Articles 12 and 14 — and argues that gate-first enforcement, which evaluates each action before it executes and produces a cryptographic receipt as the authorisation record, is the only architecture that produces the evidence the regulation requires. With high-risk AI obligations applying from December 2027 under the EU AI Act and Annex III classifications covering agentic AI in finance, healthcare, law enforcement, and critical infrastructure, the window for building compliant-by-design systems is narrowing.

Regulatory deadline

EU AI Act Regulation 2024/1689 high-risk AI obligations — including conformity assessment, technical documentation, CE marking, and EU database registration — apply from 2 December 2027 (extended from 2 August 2026, Digital Omnibus May 2026) for Annex III systems. Agentic AI pipelines being built or deployed now will need to demonstrate compliance at that date, not rebuild to achieve it.

The governance default: logging

When organisations deploy agentic AI systems — autonomous pipelines where language models plan and execute sequences of actions across tools, APIs, and databases — the governance pattern that emerges is almost always the same: log what the agent did. Append records to a database. Surface anomalies to a monitoring dashboard. Alert a human if something looks wrong.

This pattern is borrowed from traditional software observability, where it is adequate. In traditional systems, the code is deterministic: given the same inputs, it produces the same outputs, takes the same path, calls the same services. A log is a faithful record of what happened because there is no ambiguity between the system's intent and its execution.

Agentic AI systems break this assumption. The model is probabilistic. It infers. It hallucinates completions. It skips steps it assigns low weight to. The log records what the model reported — not what the system executed. These are different things, and the difference is the governance gap.

Five failure modes logging cannot catch

The following five failure modes are structural properties of probabilistic agentic systems. In each case, a log-after-the-fact approach produces a plausible audit trail that does not reflect what actually happened.

F1
Hallucinated step completion

A language model may infer from context that a prior step was implicitly completed, and log it as such. The log shows the step ran. The step did not run. After the fact, there is no way to distinguish this from a genuine execution, because the log was written by the same model that hallucinated the completion. The OWASP Top 10 for LLM Applications lists "excessive agency" among its risk categories for deployed LLM systems; a hallucinated completion is the adjacent problem, in which the agent's own account of its actions is wrong.

F2
Out-of-order execution

An agent may execute steps in a different order than the defined policy requires — because the model infers an alternative path is more efficient or more likely to succeed. The log shows all steps completed. The order in the log may be post-hoc rationalisation. No log-based system can prove that step 3 ran before step 4 in real time — only that both appeared in the log, in whatever order the model chose to record them.

F3
Replay with fresh identity

A captured valid action — particularly one with high-value consequences like a financial transfer authorisation or a data access grant — can be replayed later with a fresh timestamp and session identifier. A log cannot distinguish a legitimate action from its replay. The log shows two identical actions at different times and has no mechanism to determine whether the second was authorised independently or duplicated from the first.

F4
Post-hoc log manipulation

Logs that live in the application layer — databases, cloud storage, centralised logging services — are mutable. An agent with write access to its own log store can modify the record of what it did. A mutable log is not an audit trail — it is a story the system tells about itself. ISO/IEC 42001:2023 Annex A, Control A.6.1.6 requires AI system operational logging to be tamper-evident, but does not specify how to achieve this — leaving organisations to implement it or not.

F5
Human oversight that cannot intervene

EU AI Act Article 14(4)(e) requires that high-risk AI systems enable natural persons "to intervene in the operation of the high-risk AI system or interrupt the system through a 'stop' button or a similar procedure." A monitoring dashboard connected to a logging system satisfies the letter of this requirement: a human can see the alert and push a stop button. It does not satisfy the spirit, because the action has already executed by the time the alert fires. An oversight mechanism that activates after the fact is not an oversight mechanism for the action in question.

The gate-first architecture

The alternative to logging is enforcement. Where logging records what the agent did, enforcement determines what the agent is permitted to do — before it does it. This is the distinction between observability and control.

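A gate-first architecture interposes an evaluation step between the model's intent and the action's execution. Before any step runs, the gate evaluates it against the defined policy: whether the step is permitted at all, whether its prerequisite steps already hold valid receipts, and whether the request is a replay of an earlier authorisation.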

The gate returns ALLOW or DENY. On ALLOW, a cryptographic receipt is written to immutable storage — an HMAC-signed record that this specific step, in this specific sequence, was authorised at this specific time. The receipt is written by the enforcement layer, not the application. It cannot be modified by the agent.
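The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AgenticRail's actual interface: the POLICY table, the step names, the Gate class, and the receipt fields are all hypothetical, and the in-memory list only stands in for immutable storage.

```python
import hashlib
import hmac
import json
import time

# Hypothetical policy: each step lists the steps that must already hold receipts.
POLICY = {
    "validate_request": [],
    "check_limits": ["validate_request"],
    "execute_transfer": ["validate_request", "check_limits"],
}


class Gate:
    """Evaluates a step before it runs and signs a receipt on ALLOW."""

    def __init__(self, key: bytes):
        self._key = key      # held by the enforcement layer, never by the agent
        self.receipts = []   # stand-in for append-only, immutable storage

    def evaluate(self, step: str) -> dict:
        authorised = {r["step"] for r in self.receipts}
        required = POLICY.get(step)
        # Deny unknown steps and steps whose prerequisites lack receipts.
        if required is None or not set(required) <= authorised:
            return {"decision": "DENY", "step": step}
        body = {"step": step, "seq": len(self.receipts), "ts": time.time()}
        payload = json.dumps(body, sort_keys=True).encode()
        body["mac"] = hmac.new(self._key, payload, hashlib.sha256).hexdigest()
        self.receipts.append(body)
        return {"decision": "ALLOW", "receipt": body}
```

In this sketch the agent would call evaluate() and proceed only on ALLOW; asking for execute_transfer before the two earlier steps hold receipts returns DENY and writes nothing. A real deployment would also need the gate to run in a separate trust domain from the agent, which an in-process class cannot provide.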

The gate receipt is not a log entry. A log entry records what the system reports happened. A gate receipt is the authorisation that precedes the action — the proof that the gate evaluated the step, found it compliant, and permitted it to execute. These are different documents with different evidentiary weight. Under EU AI Act Article 12, the requirement is for logs that enable post-market monitoring of the AI system's operation. A gate receipt chain — where each receipt's authenticity can be cryptographically verified and each receipt's position in the chain can be proven — satisfies this requirement in a way that a mutable application log does not.
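One way to make both properties checkable (authenticity of each receipt and its position in the chain) is to have every receipt's MAC cover the previous receipt's MAC. The Python sketch below assumes that design; the function names, the GENESIS placeholder, and the receipt fields are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first receipt


def sign_receipt(key: bytes, step: str, seq: int, prev_mac: str) -> dict:
    """Create a receipt whose MAC covers the step, its position, and the previous MAC."""
    body = {"step": step, "seq": seq, "prev": prev_mac}
    payload = json.dumps(body, sort_keys=True).encode()
    return {**body, "mac": hmac.new(key, payload, hashlib.sha256).hexdigest()}


def verify_chain(key: bytes, chain: list) -> bool:
    """Return True only if every receipt is authentic and correctly linked."""
    prev = GENESIS
    for seq, receipt in enumerate(chain):
        body = {"step": receipt["step"], "seq": receipt["seq"], "prev": receipt["prev"]}
        payload = json.dumps(body, sort_keys=True).encode()
        expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, receipt["mac"]):
            return False  # receipt content or MAC was altered
        if receipt["seq"] != seq or receipt["prev"] != prev:
            return False  # receipt removed from the middle, reordered, or inserted
        prev = receipt["mac"]
    return True
```

Two limits of this sketch are worth stating. Dropping receipts from the end of the chain is not detected unless the latest MAC is anchored somewhere else. And because HMAC is symmetric, any verifier must hold the same secret key that signs receipts; where verifiers should not be able to create receipts, an asymmetric signature scheme would be the usual substitute.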

Comparing log-after versus gate-first across compliance dimensions

Compliance dimension | Log-after approach | Gate-first approach
EU AI Act Art. 12 (logging) | Mutable; records what model reports | Immutable HMAC-signed receipts; pre-action
EU AI Act Art. 14 (human oversight) | Alert fires after action executes | Gate blocks action until authorised; halt is immediate
EU AI Act Art. 9 (risk management) | Risk identified post-execution | Risk evaluated before execution; deny on any failure
ISO/IEC 42001 (AI system logging) | Satisfies letter; tamper-evidence unspecified | Tamper-evident by construction; chain linkage verifiable
NIST AI RMF (govern/measure) | Measurement is post-hoc only | Enforcement metrics captured at gate; real-time govern
Replay attack prevention | Detectable in log analysis; not preventable | Blocked at gate via nonce ledger; never executes
Step-skipping prevention | Detectable if model logs the skip | Impossible; gate enforces prerequisite receipt chain
Evidence for conformity assessment | Reconstructed from logs; model may have altered record | Receipt chain with cryptographic proof; independently verifiable
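The replay row of the comparison can be illustrated with a single-use nonce ledger. The sketch below is a hypothetical Python illustration, not a description of any particular product: the NonceLedger class and its method names are assumptions, and a production ledger would need durable storage and nonce expiry.

```python
import secrets


class NonceLedger:
    """Single-use nonces: issued by the gate, consumed exactly once."""

    def __init__(self):
        self._issued = set()
        self._used = set()

    def issue(self) -> str:
        # 16 random bytes, hex-encoded; unpredictable to the agent.
        nonce = secrets.token_hex(16)
        self._issued.add(nonce)
        return nonce

    def consume(self, nonce: str) -> str:
        # Deny nonces the gate never issued and nonces already spent.
        if nonce not in self._issued or nonce in self._used:
            return "DENY"
        self._used.add(nonce)
        return "ALLOW"
```

A captured action that is resubmitted carries a nonce that has already been consumed, so the second attempt is denied before it executes, whatever timestamp or session identifier accompanies it.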

What the December 2027 deadline means in practice

The EU AI Act's Annex III classifies AI systems, agentic ones included, as high-risk when they make consequential decisions in eight domains: biometric identification, critical infrastructure, education, employment, essential private and public services, law enforcement, migration, and justice. For these systems, full conformity obligations apply from 2 December 2027 (extended from 2 August 2026, Digital Omnibus May 2026): technical documentation, conformity assessment, CE marking, and EU database registration.

Conformity assessment for high-risk AI requires demonstrating that the system's technical documentation reflects its actual operation. A log-after approach produces documentation that reflects what the model reported — a weaker standard than what notified bodies conducting conformity assessment will require. A gate receipt chain produces documentation that reflects what the enforcement layer authorised — verifiable by any party with access to the receipt chain and the signing key.

In April 2026, Microsoft released the Agent Governance Toolkit — open-source runtime security for AI agents — acknowledging that the gap between model intent and executed action is the central governance problem for agentic AI. The Toolkit focuses on tool access control and policy enforcement at the Microsoft Azure level. AgenticRail addresses the same problem at the sequence level: not just what tools the agent can access, but whether each step in the sequence was authorised in the correct order, with a tamper-evident receipt as proof.

Conclusion

Logging is necessary. It is not sufficient. The five failure modes in this analysis — hallucinated completion, out-of-order execution, replay, log manipulation, and post-hoc human oversight — are not edge cases. They are structural properties of probabilistic systems operating in environments where the audit trail is produced by the same system being audited.

Gate-first enforcement addresses these failure modes by moving the compliance record to a layer that the agent cannot influence. The gate receipt is written before the action executes, by the enforcement layer, and stored in immutable infrastructure. The agent cannot skip it, manipulate it, or replay past it.

For organisations building agentic AI systems that will face EU AI Act conformity assessment from December 2027, the question is not whether to have an audit trail but whether the audit trail will satisfy a notified body's review. A log-after approach produces a story the system tells about itself. A gate receipt chain is a structural guarantee: cryptographic proof that the sequence cannot have run differently from what the receipts record.