Microsoft's DELEGATE-52 benchmark shows LLMs still can't be trusted with document editing
A Microsoft-led benchmark of 19 LLMs across 52 professional domains finds that even frontier models lose a quarter of a document's content after 20 delegated edits, and that Python is the only domain they can reliably automate.