AI compliance & regulations - engineering notes

EU AI Act - The Annex IV Trap: Why Your AI Documentation Is Already Out of Date

March 21, 2026

~2,426 words · 11 min read


There is a particular kind of compliance failure that doesn't announce itself with a crashed service or a failed audit finding. It accumulates silently in the gap between what your technical documentation says your AI system does and what it actually did at 14:32:07 UTC on a Tuesday three months ago.

I have been prototyping an infrastructure layer for EU AI Act compliance, a governance middleware I call Leksly, and in doing so I have run directly into what I think is the most underestimated operational problem in the current high-risk AI compliance landscape: the Annex IV documentation debt spiral.

This note is an attempt to explain the engineering shape of that problem, and what a code-first solution looks like in practice.

The Mandate: What Annex IV Actually Requires

Most practitioners have absorbed the high-level obligation: Article 11 of the EU AI Act requires providers of high-risk AI systems to prepare and maintain technical documentation before placing a system on the market, and to keep it updated throughout the system's lifecycle. The full scope of what that documentation must contain is specified in Annex IV of Regulation (EU) 2024/1689.

Reading Annex IV carefully, not as a lawyer but as an engineer, reveals something uncomfortable. The dossier must include:

  • A general description of the AI system and its intended purpose
  • A detailed description of the system's components and development process
  • The design specifications, including the general logic of the algorithm, key design choices, assumptions made, and classification decisions
  • A description of the monitoring, functioning and control of the system
  • The technical capabilities and limitations including risks to health, safety, and fundamental rights
  • Changes made to the system over its lifecycle

That last item is the trap. "Changes over its lifecycle" is not defined with precision, but guidance from the AI Act Service Desk on Article 11 makes clear that "substantial modifications," which include changes that affect the system's performance or safety characteristics, require documentation updates and, in some cases, a new conformity assessment.

Now consider the operational reality for an enterprise deploying a high-risk AI system through an API: OpenAI's model versioning alone has produced a sequence of production-deployed variants of "gpt-4o" within a single calendar year. Each carries different training data cutoffs, different benchmark scores, and different system behaviors. Under a risk-based reading of the regulation, deploying a materially different model version can be a change to the AI system that should be reflected in the technical dossier and assessed for potential substantial modification impacts.

At current model release velocity, your Annex IV documentation will be out of date before the ink is dry.


The Enterprise Problem: Three Failure Modes Nobody Is Talking About

When I started prototyping Leksly, I expected the hard part to be the cryptography: implementing a signed, tamper-evident audit ledger is not trivial engineering. What I did not expect is how many of the compliance failures happen above the cryptographic layer, in the metadata that teams assume is being captured but isn't.

Failure Mode 1: The Model Alias Gap

This one surprised me enough that I built an explicit check for it.

When a client application sends a chat completion request, the model field typically contains an alias: "gpt-4o", "claude-3-5-sonnet", "gpt-4-turbo". The model that actually serves the response is frequently a specific versioned checkpoint: "gpt-4o-2024-11-20", "claude-3-5-sonnet-20241022".

Annex IV requires documentation of the actual AI system — the specific model used. Your logs almost certainly record the alias from the request. The actual model identity comes back in the response, but most logging pipelines discard it or never capture it at all.

This creates a systematic gap between your Annex IV documentation ("we use model X") and your operational evidence trail ("model X was sometimes served as version X.1 and sometimes as version X.2 and we cannot tell which was which"). And critically, the gap covers every AI component in your system — not just the final generation step. Embedding models, retrieval calls, and tool invocations each carry the same ambiguity. Leksly treats model identity resolution as a first-class compliance event across all AI operation types, not an afterthought.
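A minimal sketch of what closing the alias gap looks like, assuming a chat-completion API that echoes the served model back in the response body (as the major providers currently do). The names `ModelIdentityEvent` and `resolve_identity` are illustrative, not Leksly's actual API:

```python
from dataclasses import dataclass

@dataclass
class ModelIdentityEvent:
    """Evidence record pairing the requested alias with the served checkpoint."""
    requested_alias: str   # what the client asked for, e.g. "gpt-4o"
    served_model: str      # what the provider reports back, e.g. "gpt-4o-2024-11-20"

    @property
    def alias_gap(self) -> bool:
        # True when the served checkpoint differs from the requested alias.
        return self.requested_alias != self.served_model

def resolve_identity(request: dict, response: dict) -> ModelIdentityEvent:
    """Capture both sides before the logging pipeline discards the response metadata."""
    return ModelIdentityEvent(
        requested_alias=request["model"],
        served_model=response.get("model", request["model"]),
    )

# Example with a mocked response body:
req = {"model": "gpt-4o", "messages": [{"role": "user", "content": "hi"}]}
resp = {"model": "gpt-4o-2024-11-20", "choices": []}
event = resolve_identity(req, resp)
print(event.alias_gap)  # True: the served checkpoint is more specific than the alias
```

The point of recording both fields, rather than only the resolved one, is that the alias is what your Annex IV narrative refers to, while the checkpoint is what your evidence trail must prove.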

Failure Mode 2: Parameter Drift Is a Compliance Event

Article 12 of the EU AI Act establishes a traceability requirement: logging capabilities "shall ensure a level of traceability of the AI system's functioning throughout its lifecycle that is appropriate to the intended purpose of the system." Read alongside Annex IV's requirement to document design specifications, including "the general logic of the algorithm" and "key design choices," this creates a practical obligation to record the inference parameters that materially affect output behavior.

Two identical prompts sent with different inference parameters, temperature 0.0 versus temperature 0.9 for example, produce materially different output distributions. For a high-risk system making consequential decisions, the temperature setting is part of the system design specification that Annex IV requires you to document and maintain.

In practice, what happens is this: an ML team tweaks the temperature during a "performance optimization sprint." They do not consider this a change to the AI system in any compliance sense; they are just tuning. No one updates the technical documentation. Three months later, under audit, the system's documented behavior no longer matches its actual behavior.

Leksly addresses this by making inference configuration a mandatory component of every evidence record. Parameter drift surfaces automatically as a deviation in the forensic record, even if no one remembered to update the documentation — because the documentation is derived from the evidence, not maintained separately from it.
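One way to make parameter drift surface mechanically is to fingerprint the full inference configuration and compare it against the fingerprint recorded at documentation time. A sketch under that assumption (the function name `config_fingerprint` is hypothetical):

```python
import hashlib
import json

def config_fingerprint(params: dict) -> str:
    """Canonical, key-order-independent fingerprint of the inference configuration."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Configuration as documented in the Annex IV dossier:
documented = {"model": "gpt-4o", "temperature": 0.0, "top_p": 1.0}
baseline = config_fingerprint(documented)

# Three months later, someone "just tunes" the temperature in production:
runtime = {"model": "gpt-4o", "temperature": 0.9, "top_p": 1.0}

drifted = config_fingerprint(runtime) != baseline
print(drifted)  # True: the deviation surfaces without anyone remembering to report it
```

Because the fingerprint is derived from the evidence record itself, the comparison needs no human in the loop: any runtime event whose fingerprint differs from the documented baseline is, by construction, a deviation worth reviewing.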

Failure Mode 3: Schema Drift Breaks Retroactive Auditability

This is the most subtle failure mode, and it is a pure engineering problem that compliance frameworks have not yet articulated clearly.

If you are building a tamper-evident audit log (which you should be for forensic integrity and incident investigation readiness), your records are integrity-protected via cryptographic means. This creates a quiet bomb: when your event schema evolves — even for innocent reasons, like adding a useful observability field — your audit verification tool may no longer be able to verify old records against the new schema.

The practical consequence is that a system which was forensically auditable last quarter may not be auditable this quarter, because an engineer improved the event structure. Your "tamper-evident" log quietly becomes tamper-unverifiable.

This is a solved problem in Leksly, but it required deliberate design work that I have not seen discussed in the compliance tooling space. The core principle is keeping schema evolution and integrity proofs on separate tracks, so that improving observability never silently invalidates your forensic chain.
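The separation can be illustrated with a toy hash chain: integrity covers the exact stored bytes of each record, while the schema version travels inside the record as data. Verification then replays the stored bytes and never re-serializes under the current schema, so adding a field later cannot invalidate earlier links. This is a simplified sketch of the principle, not Leksly's ledger:

```python
import hashlib
import json

def append(chain: list, payload: dict, schema_version: int) -> None:
    """Append a record; the hash covers the exact stored bytes plus the previous hash."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = json.dumps({"schema": schema_version, "payload": payload}, sort_keys=True)
    h = hashlib.sha256((prev_hash + body).encode()).hexdigest()
    chain.append({"body": body, "hash": h})

def verify(chain: list) -> bool:
    """Replay hashes over stored bytes; no knowledge of the current schema is needed."""
    prev = "0" * 64
    for rec in chain:
        if hashlib.sha256((prev + rec["body"]).encode()).hexdigest() != rec["hash"]:
            return False
        prev = rec["hash"]
    return True

chain = []
append(chain, {"model": "gpt-4o-2024-11-20"}, schema_version=1)
# Schema evolves: a new observability field is added in v2.
append(chain, {"model": "gpt-4o-2024-11-20", "latency_ms": 412}, schema_version=2)
print(verify(chain))  # True: the v1 record still verifies after the schema change
```

The failure mode described above corresponds to a verifier that re-serializes old payloads through the current schema before hashing; the moment the schema gains a field, every historical hash stops matching.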

Without this kind of discipline, your "living" documentation and your "immutable" audit trail are in direct conflict: every improvement to one risks breaking the other.


The Architecture: Evidence Generation as Infrastructure

The pattern I am working toward in Leksly is a reframe of the compliance problem: instead of maintaining an Annex IV document that describes what the system does, build the system so that it continuously generates machine-readable evidence of what it actually did, and derive the documentation from the evidence.

The three failure modes above — model aliasing, parameter drift, schema drift — all stem from the same root cause: compliance evidence is treated as a byproduct of logging infrastructure that was designed for observability, not for regulatory defensibility. These are different requirements. Observability logging optimizes for debugging speed. Compliance evidence needs to be tamper-evident, complete with respect to specific regulatory fields, and verifiable after the fact by parties who were not present when the event occurred.

What Leksly produces is not a 40-page Word document. It is a machine-verifiable proof that a specific model version, operating with specific parameters, processed a specific request, and produced a specific output, at a specific point in time, in a continuous chain of evidence that has not been modified since it was written. That is the actual substance of what Annex IV requires: the documentation is the evidence trail, not a description of the evidence trail.

Importantly, no raw prompt or completion text is stored in the evidence layer — only derived artifacts. This preserves the audit story while supporting GDPR data minimization objectives. The compliance record can prove what happened without becoming a liability in its own right.
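A sketch of the derived-artifact idea: the evidence layer keeps a non-reversible digest and coarse features of the text rather than the text itself. A later investigator holding a candidate text can re-derive and compare, proving "this exact content was processed" without the evidence store ever retaining it (the function name `derive_artifact` is illustrative):

```python
import hashlib

def derive_artifact(text: str) -> dict:
    """Store only derived, non-reversible features of the prompt or completion."""
    return {
        "sha256": hashlib.sha256(text.encode()).hexdigest(),
        "char_len": len(text),
    }

prompt = "Applicant income: 54,000 EUR; requested amount: 12,000 EUR"
record = derive_artifact(prompt)

# Later, an investigator with a candidate text can confirm the match by re-deriving:
matches = derive_artifact(prompt) == record
print(matches)  # True, yet the record itself contains no personal data
```

In practice the derived set would be richer (token counts, redacted entity classes, policy-check outcomes), but the design constraint is the same: nothing stored in the evidence layer should be reversible into the raw text.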


The Open Problem: Annex IV's Narrative Layer

I want to be honest about where this approach runs out. Annex IV requires more than a machine-readable evidence trail. It requires a narrative: a description of the intended purpose, the design choices and their justifications, the training data and its preprocessing, the human oversight measures, the known limitations and residual risks.

None of this can be auto-generated from an evidence trail. A human must write it, and a human must update it when the system changes.

The gap I am trying to close with the infrastructure approach is specifically the gap between "we have written down what the system does" and "we can prove what the system actually did." The narrative documentation describes the former; the forensic evidence proves the latter. For Article 73(6) incident investigations, regulators want the latter. For the Annex IV conformity assessment, you need both.

For example, imagine a high-risk loan-screening workflow documented in January as using model family A with conservative generation settings and human escalation for low-confidence outputs. In March, the operations team switches to a newer model variant and relaxes one decoding control to reduce latency. The evidence layer can still prove exactly what ran on each decision event, but unless someone updates the narrative dossier, your Annex IV file still describes the January system. The result is a compliance mismatch: evidence and documentation are both internally correct, but no longer aligned with each other.

Closing this loop — binding narrative dossier versions to specific evidence states so that drift between them surfaces automatically as a governance signal rather than a late audit finding — is the open problem I am currently working on. The direction involves treating documentation updates with the same provenance discipline as runtime events, so that "which dossier version was active at execution time" becomes an answerable question rather than an educated guess.
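The direction sketched above can be made concrete with a content hash over the dossier: every runtime event pins the dossier version that was active when it executed, and a mismatch between the event's configuration and the pinned dossier becomes a mechanical governance signal. This is an illustrative sketch, not a working design; `dossier_fingerprint` and the record shapes are hypothetical:

```python
import hashlib
import json

def dossier_fingerprint(dossier: dict) -> str:
    """Content hash identifying a specific narrative dossier version."""
    canonical = json.dumps(dossier, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# January dossier, as in the loan-screening example above:
dossier_v1 = {"model_family": "A", "temperature": 0.0, "oversight": "human escalation"}
active_dossier = dossier_fingerprint(dossier_v1)

def runtime_event(model: str, params: dict) -> dict:
    # Every runtime event records which dossier version was active when it ran.
    return {"model": model, "params": params, "dossier": active_dossier}

# March: the ops team relaxes a decoding control without touching the dossier.
evt = runtime_event("model-A-2026-03", {"temperature": 0.9})

drift = evt["params"]["temperature"] != dossier_v1["temperature"]
print(drift)  # True: evidence and narrative have diverged, and the system can say so
```

The answerable question becomes: for any decision event, which dossier hash was pinned at execution time, and does the event's evidence still conform to that dossier's stated design?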

Why this matters for Article 73(6) and Annex IV in practice: it strengthens pre-incident evidence of design intent, reduces ambiguity during investigations, and replaces ad hoc file collection with verifiable linked records.



What I Am Looking For

I am currently mapping how enterprise GRC and internal audit teams are preparing their infrastructure for the August 2026 high-risk AI deadlines, specifically how they are handling the practical gap between Annex IV narrative documentation and the operational evidence generated by AI systems in production.

The problems I have described above — model aliasing, parameter drift, schema drift — are problems I found while building Leksly. I suspect practitioners on the compliance and audit side are encountering different shapes of the same problem.

If you are an internal auditor, GRC officer, or compliance engineer working through these challenges firsthand, your insight would be valuable. I am currently looking for a few expert design partners to evaluate the architecture against real-world use cases.

Feel free to connect on my LinkedIn (see below) or drop your email via the early access form.


Anas Siddiqui
Software Engineer & Researcher, Leksly

Sources cited in this article:

  • Regulation (EU) 2024/1689 - EU Artificial Intelligence Act, full text including Annex IV
  • Article 11: Technical Documentation - AI Act Service Desk (European Commission)
  • Article 12: Record-Keeping - AI Act Service Desk (European Commission)
  • Article 73: Reporting of Serious Incidents - EU Artificial Intelligence Act
  • OpenTelemetry GenAI Semantic Conventions - gen_ai.* attribute specification
