AI Agent Audit Log Best Practices: Deterministic Replay, Cryptographic Receipts, Fail-Closed

The fundamental problem with AI agent audit logs: the model writes them. An LLM-based agent records what it believes it did, not what it provably executed. Best practice is to move the record upstream: an independent gate writes a cryptographic receipt before each action executes. The result is an audit trail the model cannot influence, that supports deterministic replay for any audit, and that satisfies ISO 42001 A.6.1.6, EU AI Act Article 12, and NIST AI RMF Measure 2.5 simultaneously.

The six requirements for a production AI agent audit log

Most production AI deployments use application-layer logging — the agent writes a record after each step completes. This is a good start and usually enough for internal observability. It is not enough for a compliance audit. An auditor reviewing an AI agent deployment needs to answer a different question: did these steps provably execute, in this order, at these timestamps, with these inputs? Application-layer logs cannot answer that question reliably.

A production audit log that can withstand regulatory scrutiny requires six properties:

Why AI agents cannot reliably log their own actions

LLMs are probabilistic systems. They do not execute a deterministic program — they infer the most statistically likely next action given their current context. This creates a structural problem for self-reported audit logs.

The self-reporting failure mode

A model processing a loan application decides that identity verification implicitly ran — based on context suggesting it should have — and moves to the credit check step. It logs "identity_verified: true". The verification never ran. The log is accurate from the model's perspective. It is wrong. An auditor reviewing the log has no way to know.

This is not a hallucination in the traditional sense — the model is not confabulating a wrong answer. It is doing what LLMs do: making a statistically reasonable inference from context. The problem is that inference is not execution, and a log that records inference as execution is not an audit trail.

The OWASP Top 10 for LLM Applications identifies this pattern — excessive agency — as a primary attack surface for agentic systems. A model that proceeds without executing required steps is operating with excessive agency, and a self-reported log provides no evidence that it did not.

What deterministic replay requires

Deterministic replay means that for any completed AI agent sequence, you can reconstruct exactly what steps ran, in what order, at what timestamps, with what inputs — from the audit records alone, without re-running the model. (See: Deterministic vs Probabilistic AI Agents — why the distinction decides compliance.)

This is only possible if:

Requirement | Application-layer log | Infrastructure-layer receipt
Record written before execution | No: written after, or not at all if step fails | Yes: gate decision is the record
Record independent of the model | No: model decides what to log | Yes: gate is a separate system
Tamper-evident after write | No: database records can be updated | Yes: HMAC signature breaks on modification
Replay attacks blocked | No: same step can be re-logged | Yes: nonce ledger rejects repeats
Sequence provably complete | No: gaps are invisible | Yes: sealed chain with ordered receipts

Without these properties, a replay of an audit log is a replay of what the model said it did. With them, a replay is a reconstruction of what provably executed — verifiable without trusting the model's account.
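The record-before-execution ordering, combined with fail-closed behaviour, can be sketched as follows. This is a minimal illustration under stated assumptions, not the gate's actual implementation: run_step, decide, write_receipt, and GateError are hypothetical names, and a real gate would run as a separate system rather than as an in-process function.

```python
from typing import Callable


class GateError(Exception):
    """Raised when the gate denies a step or cannot record its decision."""


def run_step(step: str,
             decide: Callable[[str], str],
             write_receipt: Callable[[dict], None],
             action: Callable[[], object]) -> object:
    """Run `action` only after an ALLOW receipt has been written."""
    try:
        decision = decide(step)
        # The receipt is written first; the action has not run yet.
        write_receipt({"step": step, "decision": decision})
    except Exception as exc:
        # Fail closed: no decision or no receipt means no execution.
        raise GateError(f"gate unavailable for step {step!r}") from exc
    if decision != "ALLOW":
        raise GateError(f"step {step!r} blocked: {decision}")
    return action()


# Usage: the receipt exists before the action's result does.
receipts: list = []
result = run_step("identity_verification",
                  decide=lambda s: "ALLOW",
                  write_receipt=receipts.append,
                  action=lambda: "verified")
```

The design choice worth noting is that a failure to write the receipt is treated the same as a denial: the action never runs without a record.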

Replay protection: how nonces work in practice

Every gate request carries a nonce — a UUID generated by the caller that is used exactly once. The gate checks the nonce against a ledger maintained per sequence. If the nonce has appeared before, the gate returns REPLAY_NONCE and blocks the step regardless of all other conditions.

This matters in three scenarios:

Network retry loops. An agent that receives a timeout may retry the same request. Without replay protection, the step executes twice — the second execution is real and the audit trail shows two receipts for the same logical action. With a nonce, the retry is blocked.

Adversarial replay. An attacker captures a valid gate request and re-submits it later — possibly with a fresh timestamp — to trigger an action a second time. Nonce-based protection blocks this even if the timestamp is within the freshness window.

Malfunctioning agents. An agent in a loop may re-submit a step it has already completed. The gate blocks the repeat and returns REPLAY_NONCE. The agent gets a clear error rather than a silent second execution.
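The three scenarios above reduce to the same check. The sketch below shows a per-sequence nonce ledger in its simplest form. NonceLedger and check_and_record are hypothetical names, and the in-memory set stands in for whatever durable store a real gate would use; a production ledger would also need the check and the write to be atomic.

```python
import uuid
from collections import defaultdict


class NonceLedger:
    """Tracks nonces already seen, per sequence (in-memory, for illustration)."""

    def __init__(self) -> None:
        self._seen = defaultdict(set)  # sequence_id -> set of used nonces

    def check_and_record(self, sequence_id: str, nonce: str) -> str:
        """Return "OK" on first use and "REPLAY_NONCE" on any repeat."""
        if nonce in self._seen[sequence_id]:
            return "REPLAY_NONCE"
        self._seen[sequence_id].add(nonce)
        return "OK"


ledger = NonceLedger()
nonce = str(uuid.uuid4())  # generated by the caller, used exactly once
first = ledger.check_and_record("loan-app-2847f3", nonce)  # original request
retry = ledger.check_and_record("loan-app-2847f3", nonce)  # retry or replay
```

Here first is accepted and retry is rejected with REPLAY_NONCE, whether the repeat came from a network retry, an attacker, or a looping agent.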

What a production audit receipt looks like

Each gate decision produces one receipt — one per step, per sequence. The receipt is written before the action executes and stored in immutable object storage with an HMAC signature over all fields.

Gate receipt (sequence: loan-app-2847f3 / step: identity_verification / decision: ALLOW)
  sequence_id: loan-app-2847f3
  step: identity_verification
  step_order: 3 of 8
  decision: ALLOW
  nonce: f7a2c891-3e4d-4b1a-9c02-a8f1e6d3b745 (first use confirmed)
  ts_ms: 1746748812041 (within freshness window)
  hmac: sha256:8f3a… (signed over all fields, key_id: k1_2026-02-22_01)
  Recorded before action executed.

The HMAC is computed over a canonical JSON serialisation of the full receipt: keys sorted alphabetically, no whitespace variation. Any modification to any field after write produces a different HMAC, so verification against the stored value fails. The receipt cannot be silently updated to show a different decision, step, or timestamp.
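A minimal sketch of signing and verifying a receipt this way, using only the Python standard library. The function names and the demo key are hypothetical, and the sketch assumes the hmac field itself is excluded from the signed body, which the description above does not state explicitly.

```python
import hashlib
import hmac
import json


def canonical_bytes(receipt: dict) -> bytes:
    """Serialise with sorted keys and fixed separators, excluding the hmac field."""
    body = {k: v for k, v in receipt.items() if k != "hmac"}
    text = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def sign_receipt(receipt: dict, key: bytes) -> dict:
    """Return a copy of the receipt with an HMAC-SHA256 tag attached."""
    tag = hmac.new(key, canonical_bytes(receipt), hashlib.sha256).hexdigest()
    return {**receipt, "hmac": f"sha256:{tag}"}


def verify_receipt(receipt: dict, key: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = sign_receipt(receipt, key)["hmac"]
    return hmac.compare_digest(expected, receipt.get("hmac", ""))


key = b"demo-key-not-for-production"  # illustrative only
signed = sign_receipt({
    "sequence_id": "loan-app-2847f3",
    "step": "identity_verification",
    "decision": "ALLOW",
    "ts_ms": 1746748812041,
}, key)
tampered = {**signed, "decision": "DENY"}  # edited after write
```

verify_receipt(signed, key) succeeds, while verify_receipt(tampered, key) fails because the recomputed tag no longer matches the stored one.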

At audit time, a compliance report reads the receipt chain for a sequence from the KV index, verifies each HMAC, confirms step order, and confirms no gaps. The report is generated from the receipts — not from application logs, not from model-reported state.

Timestamp freshness and the replay window

Replay protection has two layers. The nonce blocks exact replay of a previous request. Timestamp freshness limits how long a captured request remains usable, including one resubmitted with a new nonce.

Each gate request carries a ts_ms field: milliseconds since epoch, set by the caller at request time. The gate enforces a freshness window: if the timestamp is more than 300 seconds in the past or future, the request is rejected with STALE_TIMESTAMP. A valid nonce does not help if the timestamp is stale.

This means an attacker who captures a valid gate request cannot hold it and submit it later with a fresh nonce: once the captured timestamp is more than 5 minutes old, it falls outside the freshness window. A request is accepted only if it arrives within 5 minutes of its timestamp and carries a nonce the ledger has not seen. Both conditions must be met simultaneously.
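The two layers can be combined in one check, sketched below. check_request and its parameters are hypothetical, the order of the two checks is an assumption (the nonce is tested first, since a repeated nonce is described as blocking regardless of other conditions), and the sketch assumes a nonce is recorded only when the request passes both checks.

```python
import time
from typing import Optional, Set

FRESHNESS_WINDOW_MS = 300 * 1000  # 300 seconds, in milliseconds


def check_request(ts_ms: int, nonce: str, seen_nonces: Set[str],
                  now_ms: Optional[int] = None) -> str:
    """Return "OK", "REPLAY_NONCE", or "STALE_TIMESTAMP" for one gate request."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Layer 1: a nonce that has been seen before is always rejected.
    if nonce in seen_nonces:
        return "REPLAY_NONCE"
    # Layer 2: reject timestamps more than 300 s in the past or the future.
    if abs(now_ms - ts_ms) > FRESHNESS_WINDOW_MS:
        return "STALE_TIMESTAMP"
    seen_nonces.add(nonce)  # assumption: record only accepted requests
    return "OK"


now = 1746748812041
seen: Set[str] = set()
fresh = check_request(now, "nonce-a", seen, now_ms=now)
replayed = check_request(now, "nonce-a", seen, now_ms=now)
stale = check_request(now - 301_000, "nonce-b", seen, now_ms=now)
```

In this run fresh passes, replayed is rejected for its repeated nonce, and stale is rejected even though its nonce is new.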

Framework alignment: one receipt chain, three frameworks

The same infrastructure-layer receipt chain satisfies the audit log requirements across all three major AI governance frameworks:

ISO 42001 · A.6.1.6 (Operational logging)
Requires logging enabling reconstruction of AI system behaviour for certification audits. The receipt chain provides step-by-step reconstruction with cryptographic integrity.

EU AI Act · Article 12 (Logging obligations)
Requires logs enabling post-market monitoring and incident investigation for high-risk AI systems. Pre-execution receipts independent of the model satisfy this directly.

NIST AI RMF · Measure 2.5 (Runtime monitoring)
Requires monitoring mechanisms that detect performance degradation and unexpected behaviour. Gate decisions surface policy violations, out-of-order steps, and replay attempts in real time.

The same gate receipt that satisfies ISO 42001 A.6.1.6 satisfies EU AI Act Article 12 and NIST AI RMF Measure 2.5. A single enforcement layer produces evidence for all three frameworks simultaneously, with no additional logging infrastructure required per framework.

The compliance report

For any sequence, a compliance report can be generated on demand. The report reads the receipt chain from the KV index, verifies each HMAC, confirms step order is intact, confirms no replays occurred, and surfaces any DENY or HALT decisions with the reason code. The report is formatted for auditor review — it answers the question "what did this AI agent actually execute?" with cryptographic evidence rather than application-reported state.

Try the enforcement gate with the public demo key. Run a sequence, see the receipts written in real time, and generate the compliance report.