EMO: AllenAI pretrains mixture of experts so modularity emerges from data
At a glance:
- AllenAI releases EMO, a 1B-active / 14B-total-parameter MoE trained on 1 trillion tokens where experts self-organize into coherent, domain-level modules without any human-defined priors.
- Keeping just 25% of the experts (32 of 128) costs only about 1% absolute performance across benchmarks, and even at 12.5% (16 of 128) the model stays close to full-model performance.
- EMO's expert clusters map to semantic domains like Health, Medical & Wellness, News Reporting, US Politics & Elections, and Film & Music — unlike standard MoEs whose experts specialize in surface features such as prepositions and definite articles.
The problem with today's mixture-of-experts models
Large language models are typically trained and deployed as monolithic systems: a single model is initialized, pretrained, fine-tuned, and served as one unified entity. But applications often need only a subset of capabilities — code generation, mathematical reasoning, domain-specific knowledge — and as frontier models routinely reach trillions of parameters, loading and adapting the full model becomes impractical for most users. Hosting parameters that may never be used wastes both compute and memory.
Mixture-of-experts architectures seem like a natural way to relax this constraint. Instead of using one large feedforward network at each layer, MoEs contain many smaller networks called experts and activate only a small subset for each input token. In principle, a task that only needs one capability could load only the relevant experts. In practice, however, existing MoEs still need the full model to work well. Different tokens within a single input often activate different experts, so a task can end up using all the experts during generation. As the EMO paper shows, experts in standard MoEs often specialize in low-level lexical patterns like prepositions or punctuation rather than higher-level domains or capabilities, meaning small subsets of experts are not reliably usable on their own.
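For contrast, here is a minimal sketch of standard per-token top-k routing (a toy PyTorch illustration, not AllenAI's implementation): every token independently picks its own top-k experts, so a single document can easily end up touching every expert in a layer.

```python
import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weights, k=2):
    """hidden: [num_tokens, d_model]; router_weights: [d_model, num_experts]."""
    logits = hidden @ router_weights                 # [num_tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    gate_values, expert_ids = probs.topk(k, dim=-1)  # each token picks its own top-k experts
    return gate_values, expert_ids

# Toy example: 6 tokens, 16 hidden dims, 8 experts, 2 active per token.
hidden = torch.randn(6, 16)
router = torch.randn(16, 8)
gates, experts = route_tokens(hidden, router, k=2)
print(experts)  # different tokens typically land on different experts
```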
How EMO trains modularity to emerge
EMO is an MoE trained with modularity as a first-class objective. The key insight is that tokens from the same document usually come from the same domain. The team uses document boundaries as a weak supervisory signal: during training, all tokens in a document are restricted to choose their active experts from a shared expert pool.
Concretely, in an MoE with 10 total experts and 2 active experts per token, all tokens in a document are restricted to route within the same pool of 4 experts. That pool is chosen by the router itself — the team averages the router's expert preferences across all tokens in the document and selects the most-used experts as the document's shared pool. Different documents can use different pools, allowing recurring expert groups to emerge directly from the training data without any predefined semantic categories.
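Here is a minimal sketch of that document-level constraint, under the assumption that the shared pool is simply the `pool_size` experts with the highest average router probability over the document's tokens (toy PyTorch; the released implementation may differ in detail).

```python
import torch
import torch.nn.functional as F

def route_document(hidden, router_weights, k=2, pool_size=4):
    """hidden: [doc_len, d_model] for one document; router_weights: [d_model, num_experts]."""
    logits = hidden @ router_weights                  # [doc_len, num_experts]
    probs = F.softmax(logits, dim=-1)

    # Average the router's preferences over all tokens in the document and
    # keep the pool_size most-preferred experts as the document's shared pool.
    pool = probs.mean(dim=0).topk(pool_size).indices  # [pool_size]

    # Mask out every expert outside the pool, then do ordinary top-k routing.
    mask = torch.full_like(logits, float("-inf"))
    mask[:, pool] = 0.0
    pooled_probs = F.softmax(logits + mask, dim=-1)
    gate_values, expert_ids = pooled_probs.topk(k, dim=-1)
    return pool, gate_values, expert_ids

hidden = torch.randn(12, 16)     # one 12-token toy document
router = torch.randn(16, 8)      # 8 experts in this toy layer
pool, gates, experts = route_document(hidden, router, k=2, pool_size=4)
print(pool)      # the document's shared 4-expert pool
print(experts)   # every token's active experts are drawn from that pool
```

In this toy run, every token's active experts come from the same four-expert pool, which is exactly the property that later lets whole expert groups be kept or dropped together.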
One technical challenge is load balancing. In standard MoE training, the load-balancing objective prevents the model from collapsing onto only a small number of experts, but applied locally within a micro-batch it pushes tokens within the same document to spread across many experts — directly opposing EMO's objective. The team resolves this by applying load balancing globally across many documents, where the two objectives become complementary: EMO encourages coherent expert usage within a document while global load balancing encourages different documents to collectively cover all experts.
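A sketch of that global variant is shown below, using the standard auxiliary loss of the form num_experts × Σᵢ fᵢ·Pᵢ with the routing statistics pooled over many documents; how EMO actually aggregates these statistics across micro-batches and devices is an assumption here, and the point is only that balance is enforced globally rather than within a single document.

```python
import torch

def global_load_balance_loss(all_probs, all_expert_ids, num_experts):
    """all_probs: [total_tokens, num_experts] router probabilities pooled from
    many documents; all_expert_ids: [total_tokens, k] experts chosen per token."""
    # f_i: fraction of (token, slot) assignments that landed on expert i.
    counts = torch.bincount(all_expert_ids.flatten(), minlength=num_experts).float()
    f = counts / counts.sum()
    # P_i: mean router probability assigned to expert i across all pooled tokens.
    p = all_probs.mean(dim=0)
    return num_experts * torch.sum(f * p)
```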
The document pool size controls how restrictive the modularity constraint is. Rather than fixing one pool size, the team randomly samples it during training, preventing overfitting to a single subset size and letting the model support different expert subset sizes at inference time.
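As a toy illustration of that sampling (the candidate pool sizes below are illustrative assumptions, not the paper's values; the snippet reuses the `hidden` and `router` tensors from the routing sketch above):

```python
import random

# Draw a different pool size per document so the model never overfits
# to a single expert-subset size.
pool_size = random.choice([2, 4, 8])
pool, gates, experts = route_document(hidden, router, k=2, pool_size=pool_size)
```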
Benchmark results and expert specialization
On general-purpose benchmarks, EMO matches the performance of a standard MoE model, showing that the modularity objective does not come at the cost of full-model performance. The critical test is whether the model still works when only a subset of experts is kept.
The team constructs task-specific expert subsets by ranking experts according to their routing usage on a small amount of task validation data, keeping the most-used experts and discarding the rest. The results are striking:
- With 25% of the experts kept (a 32-expert subset), EMO loses only about 1% absolute performance across all benchmarks.
- With only 12.5% of the experts kept (a 16-expert subset), the overall drop is about 3%.
- This holds both before and after fine-tuning.
- In contrast, a matched standard MoE degrades sharply as the expert subset shrinks, often falling to or below random performance at the smallest subset sizes.
Selecting the right experts for a task is also surprisingly cheap: a single example with few-shot demonstrations is enough to identify a module that performs on par with one selected using a full validation set. EMO works well with existing expert-pruning approaches like Easy-EP, and the two complement each other. In a smaller 130B-token setting, EMO expert subsets push out the Pareto frontier of the memory-accuracy trade-off, outperforming standard MoEs and even fixed-budget models trained from scratch.
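To make the subset-construction step concrete, here is a minimal sketch that ranks experts by how often the router selects them on a handful of task validation documents and keeps the top ones; whether the paper counts selections or accumulates gate weight is an assumption.

```python
import torch
import torch.nn.functional as F

def select_task_experts(validation_docs, router_weights, keep=16, k=8):
    """validation_docs: list of [doc_len, d_model] hidden-state tensors.
    Returns the indices of the `keep` most-used experts; the rest can be dropped."""
    num_experts = router_weights.shape[1]
    usage = torch.zeros(num_experts)
    for hidden in validation_docs:
        probs = F.softmax(hidden @ router_weights, dim=-1)  # [doc_len, num_experts]
        expert_ids = probs.topk(k, dim=-1).indices          # per-token top-k choices
        usage += torch.bincount(expert_ids.flatten(), minlength=num_experts).float()
    return usage.topk(keep).indices
```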
When the team clustered router activations of the first 100 tokens across 12K pretraining documents, the difference from a standard MoE was stark. EMO's token clusters correspond to semantically meaningful domains:
- Health, Medical & Wellness
- News Reporting
- US Politics & Elections
- Film & Music
A standard MoE produces clusters like:
- Prepositions
- Proper Names
- Copula Verbs
- Definite Articles
In EMO, tokens from a given document mostly land in the same cluster; in a standard MoE, they end up scattered across many. The interactive visualization is available at emovisualization.netlify.app.
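For readers who want to approximate this analysis, here is a sketch that represents each token by its router probability vector and groups tokens with k-means; the exact features, cluster count, and clustering algorithm used by the team are assumptions here.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cluster_router_activations(docs_hidden, router_weights, n_clusters=20):
    """docs_hidden: list of [doc_len, d_model] tensors (e.g. the first ~100 tokens per doc)."""
    feats, doc_ids = [], []
    for i, hidden in enumerate(docs_hidden):
        probs = F.softmax(hidden @ router_weights, dim=-1)  # [doc_len, num_experts]
        feats.append(probs)
        doc_ids.extend([i] * hidden.shape[0])
    feats = torch.cat(feats).numpy()
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(feats)
    # In a modular model, tokens sharing a doc_id should mostly share a cluster label.
    return labels, doc_ids
```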
What AllenAI is releasing
AllenAI is releasing the full EMO-trained model, a matched standard-MoE baseline trained on the same data, and the training code. The artifacts are available at:
- Models: https://huggingface.co/collections/allenai/emo
- Tech report: https://allenai.org/papers/emo
- Code: https://github.com/allenai/EMO
- Visualization: https://emovisualization.netlify.app/
EMO is a 1B-active, 14B-total-parameter MoE (8 experts active out of 128 total) trained on 1 trillion tokens. The team frames it as an early step toward making large sparse models more modular; the open questions it leaves are discussed below.
Why this matters for deployment and composition
The payoff of emergent modularity is a composable architecture. Because EMO's expert groups map to real semantic capabilities rather than surface features, a user can pick a small expert subset and still have a functioning model tailored to a specific domain or task. A single pretrained model effectively becomes a family of smaller, task-tailored deployments with a better memory-accuracy trade-off for large, sparse MoEs, a meaningful advantage as models scale and inference cost becomes a binding constraint for most organizations.
The research also represents a philosophical shift in how MoEs are designed. Prior work like BTX and the FlexOlmo project tried to route tokens based on predefined semantic domains such as math, biology, or code, but those approaches require domain labels across the pretraining corpus and can inject too much human bias. More importantly, fixing domains upfront fixes the model's modular structure, so new capabilities that emerge at inference time have no clear home. EMO sidesteps all of that by letting the router discover coherent expert groups directly from the data.
What to watch next
The team acknowledges that EMO is an early step and that many questions remain. Key open problems include better methods for selecting and composing expert subsets at inference time, updating individual modules without disrupting the full model, and leveraging modular structure for interpretability and controllability. The release of models, baselines, and code should help the wider community study these questions and build toward modular language models that are easier to deploy, adapt, inspect, and compose.
Tags: mixture-of-experts, emergent modularity, allenai, MoE, pretraining, expert routing