AI Agent Audit Log Best Practices: Deterministic Replay, Cryptographic Receipts, Fail-Closed
The fundamental problem with AI agent audit logs: the model writes them. An LLM-based agent records what it believes it did — not what it provably executed. Best practice is to move the record upstream: an independent gate writes a cryptographic receipt before each action executes. The result is an audit trail the model cannot influence, that supports deterministic replay for any audit, and that satisfies ISO 42001 A.6.1.6, EU AI Act Article 12, and NIST Measure 2.5 simultaneously.
The six requirements for a production AI agent audit log
Most production AI deployments use application-layer logging — the agent writes a record after each step completes. This is a good start and usually enough for internal observability. It is not enough for a compliance audit. An auditor reviewing an AI agent deployment needs to answer a different question: did these steps provably execute, in this order, at these timestamps, with these inputs? Application-layer logs cannot answer that question reliably.
A production audit log that can withstand regulatory scrutiny requires six properties:
01. Pre-execution record. The gate decision is written before the action executes. A log written after execution can be fabricated, lost on failure, or overwritten. The record must precede the action, not follow it.
02. Nonce-based replay protection. Each step carries a unique nonce. The gate rejects any repeated nonce with REPLAY_NONCE. Without this, the same step can execute multiple times against the same sequence, duplicating real-world effects and corrupting the audit trail.
03. Cryptographic integrity. Each receipt is HMAC-signed over all fields. Any modification to the record after write (payload, decision, timestamp, step name) breaks the signature. The record cannot be silently altered to show a different outcome.
04. Sequence sealing. When the final step runs, the sequence is sealed. No further steps can be appended to a closed chain. This prevents retroactive insertion of steps that didn't happen, a tactic that would otherwise allow a manipulated agent to make a skipped validation appear to have run.
05. Infrastructure-layer independence. The logging system must be independent of the model. Application-layer logs, the records the AI system writes about itself, can be bypassed if the model infers a step as complete without executing it. The gate must sit between the model's decision and the action execution.
06. Fail-closed design. Any ambiguity, missing precondition, network error, or policy gap returns DENY or HALT, never a silent pass. A log that records "ALLOW" because no gate was consulted is indistinguishable from one that records "ALLOW" because the step legitimately passed. Fail-closed makes the distinction provable.
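As a rough illustration of how pre-execution recording, sequence sealing, and fail-closed behaviour can fit together, here is a minimal in-memory sketch. It is not the gate described in this article: the `Sequence` class, the `gate` function, and the reason codes `NO_POLICY`, `OUT_OF_ORDER`, `PRECONDITION_FAILED`, `CHECK_ERROR`, and `SEQUENCE_SEALED` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Sequence:
    """Hypothetical per-sequence state, held by the gate rather than the model."""
    required_steps: list[str]
    receipts: list[dict] = field(default_factory=list)
    sealed: bool = False


def gate(seq: Sequence, step: str,
         precondition: Optional[Callable[[], bool]]) -> str:
    """Decide on a step. ALLOW requires a positive check; everything else is blocked."""
    allowed = [r for r in seq.receipts if r["decision"] == "ALLOW"]

    # Fail-closed default: with no policy to consult, the answer is DENY.
    decision, reason = "DENY", "NO_POLICY"

    if seq.sealed or len(allowed) >= len(seq.required_steps):
        reason = "SEQUENCE_SEALED"
    elif step != seq.required_steps[len(allowed)]:
        reason = "OUT_OF_ORDER"
    elif precondition is not None:
        try:
            if precondition() is True:
                decision, reason = "ALLOW", "OK"
            else:
                reason = "PRECONDITION_FAILED"
        except Exception:
            # Errors while checking (for example a network failure) halt the step.
            decision, reason = "HALT", "CHECK_ERROR"

    # The receipt is appended before the decision is returned, so the record
    # exists before the caller has any chance to execute the action.
    seq.receipts.append({"step": step, "decision": decision, "reason": reason})

    # Seal the chain once the final required step has been allowed.
    if decision == "ALLOW" and len(allowed) + 1 == len(seq.required_steps):
        seq.sealed = True
    return decision
```

In this sketch the caller executes an action only when `gate` returns `"ALLOW"`. Every other path, including an exception inside the precondition check, still leaves a receipt behind.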
Why AI agents cannot reliably log their own actions
LLMs are probabilistic systems. They do not execute a deterministic program — they infer the most statistically likely next action given their current context. This creates a structural problem for self-reported audit logs.
Consider a model processing a loan application. It decides that identity verification implicitly ran, because the context suggests it should have, and moves on to the credit check step. It logs "identity_verified: true". The verification never ran. From the model's perspective the log is accurate, but it is wrong, and an auditor reviewing the log has no way to know.
This is not a hallucination in the traditional sense — the model is not confabulating a wrong answer. It is doing what LLMs do: making a statistically reasonable inference from context. The problem is that inference is not execution, and a log that records inference as execution is not an audit trail.
The OWASP Top 10 for LLM Applications identifies this pattern — excessive agency — as a primary attack surface for agentic systems. A model that proceeds without executing required steps is operating with excessive agency, and a self-reported log provides no evidence that it did not.
What deterministic replay requires
Deterministic replay means that for any completed AI agent sequence, you can reconstruct exactly what steps ran, in what order, at what timestamps, with what inputs — from the audit records alone, without re-running the model. (See: Deterministic vs Probabilistic AI Agents — why the distinction decides compliance.)
This is only possible if the audit records have the following properties, which application-layer logs lack:
| Requirement | Application-layer log | Infrastructure-layer receipt |
|---|---|---|
| Record written before execution | No — written after, or not at all if step fails | Yes — gate decision is the record |
| Record independent of the model | No — model decides what to log | Yes — gate is a separate system |
| Tamper-evident after write | No — database records can be updated | Yes — HMAC signature breaks on modification |
| Replay attacks blocked | No — same step can be re-logged | Yes — nonce ledger rejects repeats |
| Sequence provably complete | No — gaps are invisible | Yes — sealed chain with ordered receipts |
Without these properties, a replay of an audit log is a replay of what the model said it did. With them, a replay is a reconstruction of what provably executed — verifiable without trusting the model's account.
Replay protection: how nonces work in practice
Every gate request carries a nonce — a UUID generated by the caller that is used exactly once. The gate checks the nonce against a ledger maintained per sequence. If the nonce has appeared before, the gate returns REPLAY_NONCE and blocks the step regardless of all other conditions.
This matters in three scenarios:
Network retry loops. An agent that receives a timeout may retry the same request. Without replay protection, the step executes twice — the second execution is real and the audit trail shows two receipts for the same logical action. With a nonce, the retry is blocked.
Adversarial replay. An attacker captures a valid gate request and re-submits it later — possibly with a fresh timestamp — to trigger an action a second time. Nonce-based protection blocks this even if the timestamp is within the freshness window.
Malfunctioning agents. An agent in a loop may re-submit a step it has already completed. The gate blocks the repeat and returns REPLAY_NONCE. The agent gets a clear error rather than a silent second execution.
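The per-sequence nonce check can be sketched as follows. This is a minimal in-memory illustration, not the gate's actual storage: `NonceLedger` and `check_and_record` are hypothetical names, and the `"OK"` return value is an assumption. Only `REPLAY_NONCE` comes from the description above.

```python
import uuid


class NonceLedger:
    """Hypothetical in-memory ledger: one set of seen nonces per sequence."""

    def __init__(self) -> None:
        self._seen: dict[str, set[str]] = {}

    def check_and_record(self, sequence_id: str, nonce: str) -> str:
        seen = self._seen.setdefault(sequence_id, set())
        if nonce in seen:
            # A repeated nonce blocks the step regardless of other conditions.
            return "REPLAY_NONCE"
        seen.add(nonce)
        return "OK"


ledger = NonceLedger()
nonce = str(uuid.uuid4())                          # generated by the caller
first = ledger.check_and_record("seq-1", nonce)    # accepted
retry = ledger.check_and_record("seq-1", nonce)    # blocked as a replay
```

A real deployment would need the check and the insert to happen atomically in shared storage, so that two concurrent requests carrying the same nonce cannot both pass.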
What a production audit receipt looks like
Each gate decision produces one receipt — one per step, per sequence. The receipt is written before the action executes and stored in immutable object storage with an HMAC signature over all fields.
The HMAC is computed over a canonical JSON serialisation of the full receipt — keys sorted alphabetically, no whitespace variation. Any modification to any field after write produces a different hash. The receipt cannot be silently updated to show a different decision, step, or timestamp.
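A minimal sketch of signing and verifying over a canonical serialisation, using only the Python standard library. The receipt field names (other than `ts_ms`), the choice of SHA-256, and the `hmac` field that holds the tag are assumptions made for illustration. The article does not specify them.

```python
import hashlib
import hmac
import json


def canonical(body: dict) -> bytes:
    # Sorted keys and fixed separators: equal receipts serialise to equal bytes.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(receipt: dict, key: bytes) -> dict:
    body = {k: v for k, v in receipt.items() if k != "hmac"}
    tag = hmac.new(key, canonical(body), hashlib.sha256).hexdigest()
    return {**body, "hmac": tag}


def verify(signed: dict, key: bytes) -> bool:
    body = {k: v for k, v in signed.items() if k != "hmac"}
    expected = hmac.new(key, canonical(body), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), str(signed.get("hmac", "")).encode())


key = b"demo-signing-key"  # assumption: a secret held by the gate, never by the model
receipt = sign(
    {
        "sequence_id": "seq-1",       # hypothetical field names
        "step": "verify_identity",
        "decision": "ALLOW",
        "nonce": "3f2b9c1e",
        "ts_ms": 1700000000000,
    },
    key,
)
tampered = {**receipt, "decision": "DENY"}
```

Here `verify(receipt, key)` succeeds and `verify(tampered, key)` fails, because changing any signed field changes the canonical bytes and therefore the tag.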
At audit time, a compliance report reads the receipt chain for a sequence from the KV index, verifies each HMAC, confirms step order, and confirms no gaps. The report is generated from the receipts — not from application logs, not from model-reported state.
Timestamp freshness and the replay window
Replay protection has two layers. The nonce blocks exact replay of a previous request. Timestamp freshness closes the window for replay-with-new-nonce attacks.
Each gate request carries a ts_ms field — milliseconds since epoch, set by the caller at request time. The gate enforces a freshness window: if the timestamp is more than 300 seconds in the past or future, the request is rejected with STALE_TIMESTAMP. A valid nonce does not help if the timestamp is stale.
This means an attacker who captures a valid gate request cannot submit it later with a fresh nonce: once more than 300 seconds have passed, the captured timestamp is outside the freshness window. A request is accepted only if it arrives within 5 minutes of its timestamp and carries a unique nonce. Both conditions must be met simultaneously.
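The freshness rule can be sketched as a single comparison. The function name `check_freshness`, the `"OK"` return value, and the choice to treat a missing or malformed timestamp as stale are assumptions. The 300-second window, the `ts_ms` field, and `STALE_TIMESTAMP` come from the description above.

```python
import time
from typing import Optional

FRESHNESS_WINDOW_MS = 300 * 1000  # 300 seconds, in milliseconds


def check_freshness(ts_ms: object, now_ms: Optional[int] = None) -> str:
    """Reject timestamps more than 300 seconds in the past or the future."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Assumption: a missing or non-integer timestamp is rejected (fail closed).
    if not isinstance(ts_ms, int) or isinstance(ts_ms, bool):
        return "STALE_TIMESTAMP"
    if abs(now_ms - ts_ms) > FRESHNESS_WINDOW_MS:
        return "STALE_TIMESTAMP"
    return "OK"
```

Passing `now_ms` explicitly keeps the check deterministic in tests. In this sketch the gate would run the freshness check alongside the nonce check and allow the step only if both pass.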
Framework alignment: one receipt chain, three frameworks
The same infrastructure-layer receipt chain addresses the audit log requirements of all three major AI governance frameworks. A gate receipt that satisfies ISO 42001 A.6.1.6 also satisfies EU AI Act Article 12 and NIST Measure 2.5. A single enforcement layer produces evidence for all three frameworks simultaneously, with no additional logging infrastructure required per framework.
The compliance report
For any sequence, a compliance report can be generated on demand. The report reads the receipt chain from the KV index, verifies each HMAC, confirms step order is intact, confirms no replays occurred, and surfaces any DENY or HALT decisions with the reason code. The report is formatted for auditor review — it answers the question "what did this AI agent actually execute?" with cryptographic evidence rather than application-reported state.