Microsoft's DELEGATE-52 benchmark shows LLMs still can't be trusted with document editing
At a glance:
- A benchmark called DELEGATE-52, introduced in a Microsoft research preprint, tested 19 LLMs across 52 professional domains and found widespread document corruption during multi-step editing tasks, with errors that compound silently over repeated interactions.
- Frontier models — Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 — lost an average of 25% of document content over 20 delegated interactions, while average degradation across all 19 models reached roughly 50%.
- Python is the only domain where most tested models meet a readiness threshold, and even the strongest single model achieves that threshold in just 11 out of 52 domains.
What the DELEGATE-52 benchmark tested
The preprint paper, "LLMs Corrupt Your Documents When You Delegate," was authored by Microsoft researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville. The benchmark simulates workflows that might realistically appear in a knowledge worker's daily routine, spanning 310 work environments across 52 professional domains, including coding, crystallography, genealogy, and sheet music notation. Each environment consists of real documents totaling around 15,000 tokens, paired with five to ten complex editing tasks that a user might plausibly ask an LLM to perform.
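For illustration, an environment along the lines the paper describes might be shaped like the sketch below. The field names are guesses based on the paper's description, not the benchmark's actual schema.

```python
# Illustrative sketch only: field names are guesses based on the paper's
# description, not the actual DELEGATE-52 schema.
from dataclasses import dataclass

@dataclass
class Environment:
    domain: str                   # one of the 52 professional domains
    documents: list[str]          # real files totaling roughly 15,000 tokens
    distractor_files: list[str]   # irrelevant files that worsen error rates
    tasks: list[str]              # five to ten complex editing instructions
```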
The researchers employed a round-trip evaluation method, checking whether a document returns intact after repeated AI-driven edits. Their core finding, stated directly in the paper's abstract, is blunt: "Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction." The paper is currently under peer review.
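In rough terms, a round-trip check delegates a reversible edit and its inverse, then measures how much of the original document survives. The Python sketch below illustrates the idea; the function names and the retention metric are illustrative, not the paper's actual implementation.

```python
# Hedged sketch of a round-trip integrity check in the spirit of
# DELEGATE-52; the metric and names are illustrative, not the paper's.
import difflib

def content_retention(original: str, returned: str) -> float:
    """Fraction of the original document recovered after a round trip."""
    matcher = difflib.SequenceMatcher(None, original, returned)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(original), 1)

def round_trip_score(document: str, apply_edit, revert_edit) -> float:
    """Delegate a reversible edit and its inverse, then score what survives.

    apply_edit and revert_edit stand in for LLM calls performing a paired
    task, e.g. "redact all client names" / "restore the redacted names".
    """
    edited = apply_edit(document)
    restored = revert_edit(edited)
    return content_retention(document, restored)
```

A score of 1.0 means the document came back intact; the paper's core finding is that, over repeated delegations, it increasingly does not.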
How bad the damage actually is
The headline numbers are striking. Frontier models — Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 — lost an average of 25% of document content over 20 delegated interactions. Across all 19 models tested, average degradation reached approximately 50%, painting a picture of systemic unreliability rather than isolated edge cases.
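To put the compounding in perspective: if a 25% loss over 20 interactions were spread evenly and multiplicatively (an assumption the paper does not state), each interaction would shave off only about 1.4% of the document, a slip small enough to escape casual review at every single step.

```python
# Back-of-the-envelope arithmetic only; assumes loss compounds evenly and
# multiplicatively per interaction, which the paper does not claim.
per_step_retention = 0.75 ** (1 / 20)        # 25% lost over 20 interactions
print(f"{per_step_retention:.4f}")           # ~0.9857 retained per step
print(f"{1 - per_step_retention:.2%} lost")  # ~1.43% lost per step
```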
But raw percentages only tell part of the story. The benchmark also revealed that performance varies sharply by domain. Python is the only area where most models meet a defined readiness threshold, and even the single strongest model achieves that threshold in just 11 out of 52 domains tested. The research further showed that errors worsen with larger documents, longer interaction chains, and the presence of distractor files, meaning short tests tend to flatter these models while longer, messier workflows expose their weaknesses.
Industry experts weigh in
Brian Jackson, principal research director at Info-Tech Research Group, called the findings valuable but cautioned against overgeneralizing. "Putting a list of LLMs to the test across different work domains yields a lot of useful insights," he said. "I think this type of benchmark exercise could be helpful to enterprise developers who are looking to leverage agentic AI to automate specific workflows and understand the limits of what can be achieved." He emphasized that the results do not mean LLMs cannot automate work — they mean the models cannot currently handle all of it unsupervised.
Sanchit Vir Gogia, chief analyst at Greyhound Research, struck a similar but more pointed tone. "The Microsoft paper should be read as a serious warning about delegated AI, not as a claim that enterprise AI has failed. That distinction matters," he said. Gogia praised the study's methodology, arguing it surpasses what he called "the usual AI benchmark theatre" by testing actual work products rather than one-off clever answers. The study uses reversible editing tasks, domain-specific evaluators, and round-trip integrity checks — and in too many cases, the document does not return intact.
Why the errors are so dangerous
Gogia highlighted a distinction that matters enormously for enterprises: stronger models do not merely delete content — they corrupt it. "Weaker models are easier to catch when they visibly drop material. Frontier models are more awkward because the content remains present but becomes wrong, distorted, or subtly altered. That requires knowledgeable review, not casual inspection."
This makes the problem particularly insidious in high-stakes environments. When AI edits a contract, a ledger, a policy document, a codebase, or a compliance record, the enterprise still owns the damage — even if the output looks superficially correct at first glance. The silent nature of the corruption is what sets this apart from simple hallucination errors.
What enterprises can do now
Both analysts pointed toward mitigation strategies rather than outright abandonment of AI-assisted workflows. Jackson recommended fine-tuning foundation models on enterprise-specific data to improve performance in targeted domains. He also noted that multi-agent setups — one agent making edits and another checking for errors — can help, though the Microsoft paper itself found that one such configuration actually increased degradation, meaning the verification layer must be carefully architected to be effective.
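As a rough illustration of what such a verification layer could look like, the sketch below gates a delegated edit behind a deterministic retention check before a second model reviews it. The function names, threshold, and control flow are assumptions, not a configuration from the paper.

```python
# Hypothetical editor/verifier loop; llm_edit, llm_review, and the 0.95
# threshold are illustrative assumptions, not the paper's configuration.
def supervised_edit(document: str, instruction: str, llm_edit, llm_review,
                    min_retention: float = 0.95) -> str:
    candidate = llm_edit(document, instruction)
    # Deterministic gate first: reject any edit that visibly drops content
    # (reusing content_retention from the round-trip sketch above).
    if content_retention(document, candidate) < min_retention:
        return document  # fall back to the untouched original
    # Then a second model reviews for the subtler failure mode: content
    # that is still present but has been silently altered.
    verdict = llm_review(document, candidate, instruction)
    return candidate if verdict == "approve" else document
```

The ordering is deliberate: the retention gate catches the visible failure mode (dropped content) deterministically, while the reviewing model targets the subtler one the benchmark highlights, content that remains present but becomes wrong.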
Mathematical verification of outputs is another avenue some enterprise platforms are already exploring, providing deterministic accuracy checks rather than relying on the model's own confidence. Jackson advised that developers use the benchmark results to select the LLM best suited to their specific domain and then invest in additional training and verification steps tailored to that particular workflow.
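One simple deterministic check of that kind, sketched below under the assumption that a document splits into sections on blank lines and that an edit is scoped to known sections, is to hash everything the instruction was not supposed to touch and verify those sections come back byte-identical.

```python
# Illustrative deterministic check, assuming the document splits into
# sections on blank lines and the edit is scoped to known section indices.
import hashlib

def section_hashes(document: str) -> list[str]:
    return [hashlib.sha256(s.encode()).hexdigest()
            for s in document.split("\n\n")]

def untouched_sections_intact(before: str, after: str,
                              edited: set[int]) -> bool:
    old, new = section_hashes(before), section_hashes(after)
    if len(old) != len(new):
        return False  # sections were added or removed outside the edit
    return all(a == b for i, (a, b) in enumerate(zip(old, new))
               if i not in edited)
```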
The human role is changing, not disappearing
Perhaps the most important takeaway is what the findings imply about the evolving relationship between AI and human workers. Gogia put it directly: "The paper shows that AI changes the human layer from production to supervision, validation, and accountability. That is a rather different operating model from the one being sold in many boardroom conversations."
He warned that removing domain expertise from workflows also removes the people most capable of detecting when AI has quietly damaged work. "The people best placed to catch AI errors are often the same people organizations are hoping to replace, reduce, or redeploy," he said. "Remove too much domain expertise from the workflow, and the enterprise also removes the people who know when the AI has quietly damaged the work." Expertise, in other words, becomes more valuable — not less. The honest conclusion is not that AI should be kept out of enterprise workflows, but that delegated AI is not yet trustworthy enough to be left alone with consequential artefacts.