Microsoft's DELEGATE-52 benchmark shows LLMs still can't be trusted with document editing
At a glance:
- A benchmark called DELEGATE-52, introduced in a Microsoft research preprint, tested 19 LLMs across 52 professional domains and found widespread document corruption during multi-step editing tasks, with errors that compound silently over repeated interactions.
- Frontier models — Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 — lost an average of 25% of document content over 20 delegated interactions, while average degradation across all 19 models reached roughly 50%.
- Python is the only domain where most tested models meet a readiness threshold, and even the strongest single model achieves that threshold in just 11 out of 52 domains.
What the DELEGATE-52 benchmark tested
The preprint paper, "LLMs Corrupt Your Documents When You Delegate," was authored by Microsoft researchers Philippe Laban, Tobias Schnabel, and Jennifer Neville. The benchmark simulates workflows that might realistically appear in a knowledge worker's daily routine, spanning 310 work environments across 52 professional domains, including coding, crystallography, genealogy, and sheet music notation. Each environment consists of real documents totaling around 15,000 tokens, paired with five to ten complex editing tasks that a user might plausibly ask an LLM to perform.
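For illustration, an environment along the lines the paper describes might be shaped like the sketch below. The field names are guesses based on the paper's description, not the benchmark's actual schema.

```python
# Illustrative sketch only: field names are guesses based on the paper's
# description, not the actual DELEGATE-52 schema.
from dataclasses import dataclass

@dataclass
class Environment:
    domain: str                   # one of the 52 professional domains
    documents: list[str]          # real files totaling roughly 15,000 tokens
    distractor_files: list[str]   # irrelevant files that worsen error rates
    tasks: list[str]              # five to ten complex editing instructions
```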
The researchers employed a round-trip evaluation method, checking whether a document returns intact after repeated AI-driven edits. Their core finding, stated directly in the paper's abstract, is blunt: "Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction." The paper is currently under peer review.
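In rough terms, a round-trip check delegates a reversible edit and its inverse, then measures how much of the original document survives. The Python sketch below illustrates the idea; the function names and the retention metric are illustrative, not the paper's actual implementation.

```python
# Hedged sketch of a round-trip integrity check in the spirit of
# DELEGATE-52; the metric and names are illustrative, not the paper's.
import difflib

def content_retention(original: str, returned: str) -> float:
    """Fraction of the original document recovered after a round trip."""
    matcher = difflib.SequenceMatcher(None, original, returned)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(original), 1)

def round_trip_score(document: str, apply_edit, revert_edit) -> float:
    """Delegate a reversible edit and its inverse, then score what survives.

    apply_edit and revert_edit stand in for LLM calls performing a paired
    task, e.g. "redact all client names" / "restore the redacted names".
    """
    edited = apply_edit(document)
    restored = revert_edit(edited)
    return content_retention(document, restored)
```

A score of 1.0 means the document came back intact; the paper's core finding is that, over repeated delegations, it increasingly does not.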
How bad the damage actually is
The headline numbers are striking. Frontier models — Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 — lost an average of 25% of document content over 20 delegated interactions. Across all 19 models tested, average degradation reached approximately 50%, painting a picture of systemic unreliability rather than isolated edge cases.
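To put the compounding in perspective: if a 25% loss over 20 interactions were spread evenly and multiplicatively (an assumption the paper does not state), each interaction would shave off only about 1.4% of the document, a slip small enough to escape casual review at every single step.

```python
# Back-of-the-envelope arithmetic only; assumes loss compounds evenly and
# multiplicatively per interaction, which the paper does not claim.
per_step_retention = 0.75 ** (1 / 20)        # 25% lost over 20 interactions
print(f"{per_step_retention:.4f}")           # ~0.9857 retained per step
print(f"{1 - per_step_retention:.2%} lost")  # ~1.43% lost per step
```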
But raw percentages only tell part of the story. The benchmark also revealed that performance varies sharply by domain. Python is the only area where most models meet a defined readiness threshold, and even the single strongest model achieves that threshold in just 11 out of 52 domains tested. The research further showed that errors worsen with larger documents, longer interaction chains, and the presence of distractor files, meaning short tests tend to flatter these models while longer, messier workflows expose their weaknesses.
Industry experts weigh in
Brian Jackson, principal research director at Info-Tech Research Group, called the findings valuable but cautioned against overgeneralizing. "Putting a list of LLMs to the test across different work domains yields a lot of useful insights," he said. "I think this type of benchmark exercise could be helpful to enterprise developers who are looking to leverage agentic AI to automate specific workflows and understand the limits of what can be achieved." He emphasized that the results do not mean LLMs cannot automate work — they mean the models cannot currently handle all of it unsupervised.
Sanchit Vir Gogia, chief analyst at Greyhound Research, struck a similar but more pointed tone. "The Microsoft paper should be read as a serious warning about delegated AI, not as a claim that enterprise AI has failed. That distinction matters," he said. Gogia praised the study's methodology, arguing it surpasses what he called "the usual AI benchmark theatre" by testing actual work products rather than one-off clever answers. The study uses reversible editing tasks, domain-specific evaluators, and round-trip integrity checks — and in too many cases, the document does not return intact.
Why the errors are so dangerous
Gogia highlighted a distinction that matters enormously for enterprises: stronger models do not merely delete content — they corrupt it. "Weaker models are easier to catch when they visibly drop material. Frontier models are more awkward because the content remains present but becomes wrong, distorted, or subtly altered. That requires knowledgeable review, not casual inspection."
This makes the problem particularly insidious in high-stakes environments. When AI edits a contract, a ledger, a policy document, a codebase, or a compliance record, the enterprise still owns the damage — even if the output looks superficially correct at first glance. The silent nature of the corruption is what sets this apart from simple hallucination errors.
What enterprises can do now
Both analysts pointed toward mitigation strategies rather than outright abandonment of AI-assisted workflows. Jackson recommended fine-tuning foundation models on enterprise-specific data to improve performance in targeted domains. He also noted that multi-agent setups — one agent making edits and another checking for errors — can help, though the Microsoft paper itself found that one such configuration actually increased degradation, meaning the verification layer must be carefully architected to be effective.
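As a rough illustration of what such a verification layer could look like, the sketch below gates a delegated edit behind a deterministic retention check before a second model reviews it. The function names, threshold, and control flow are assumptions, not a configuration from the paper.

```python
# Hypothetical editor/verifier loop; llm_edit, llm_review, and the 0.95
# threshold are illustrative assumptions, not the paper's configuration.
def supervised_edit(document: str, instruction: str, llm_edit, llm_review,
                    min_retention: float = 0.95) -> str:
    candidate = llm_edit(document, instruction)
    # Deterministic gate first: reject any edit that visibly drops content
    # (reusing content_retention from the round-trip sketch above).
    if content_retention(document, candidate) < min_retention:
        return document  # fall back to the untouched original
    # Then a second model reviews for the subtler failure mode: content
    # that is still present but has been silently altered.
    verdict = llm_review(document, candidate, instruction)
    return candidate if verdict == "approve" else document
```

The ordering is deliberate: the retention gate catches the visible failure mode (dropped content) deterministically, while the reviewing model targets the subtler one the benchmark highlights, content that remains present but becomes wrong.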
Mathematical verification of outputs is another avenue some enterprise platforms are already exploring, providing deterministic accuracy checks rather than relying on the model's own confidence. Jackson advised that developers use the benchmark results to select the LLM best suited to their specific domain and then invest in additional training and verification steps tailored to that particular workflow.
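One simple deterministic check of that kind, sketched below under the assumption that a document splits into sections on blank lines and that an edit is scoped to known sections, is to hash everything the instruction was not supposed to touch and verify those sections come back byte-identical.

```python
# Illustrative deterministic check, assuming the document splits into
# sections on blank lines and the edit is scoped to known section indices.
import hashlib

def section_hashes(document: str) -> list[str]:
    return [hashlib.sha256(s.encode()).hexdigest()
            for s in document.split("\n\n")]

def untouched_sections_intact(before: str, after: str,
                              edited: set[int]) -> bool:
    old, new = section_hashes(before), section_hashes(after)
    if len(old) != len(new):
        return False  # sections were added or removed outside the edit
    return all(a == b for i, (a, b) in enumerate(zip(old, new))
               if i not in edited)
```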
The human role is changing, not disappearing
Perhaps the most important takeaway is what the findings imply about the evolving relationship between AI and human workers. Gogia put it directly: "The paper shows that AI changes the human layer from production to supervision, validation, and accountability. That is a rather different operating model from the one being sold in many boardroom conversations."
He warned that removing domain expertise from workflows also removes the people most capable of detecting when AI has quietly damaged work. "The people best placed to catch AI errors are often the same people organizations are hoping to replace, reduce, or redeploy," he said. "Remove too much domain expertise from the workflow, and the enterprise also removes the people who know when the AI has quietly damaged the work." Expertise, in other words, becomes more valuable — not less. The honest conclusion is not that AI should be kept out of enterprise workflows, but that delegated AI is not yet trustworthy enough to be left alone with consequential artefacts.