Overview
Sparse activation is a technique in which a neural network activates only a fraction of its total parameters for each input. Because compute cost then scales with the number of active parameters rather than the total, models with trillions of parameters can run efficiently.
Dense vs Sparse Models
| Aspect | Dense Model | Sparse Model |
|---|---|---|
| Active Parameters | 100% per input | 5-20% per input |
| Compute Cost | Scales with total size | Scales with active size |
| Memory | All weights used | All weights loaded, few used |
| Scaling | Diminishing returns | Near-linear capability gains |
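The compute column of the table can be made concrete with a rough rule of thumb: a transformer spends on the order of 2 FLOPs per active parameter per token. The sketch below uses that heuristic (the constant and the parameter counts are illustrative, roughly matching the Mixtral row of the table further down):

```python
# Rough heuristic: per-token compute scales with *active* parameters,
# at about 2 FLOPs per active parameter per token.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense = flops_per_token(47e9)    # dense 47B model: every weight is active
sparse = flops_per_token(13e9)   # sparse 47B model with only 13B active per token

print(f"sparse/dense compute ratio: {sparse / dense:.2f}")
```

Both models hold 47B weights in memory, but the sparse one pays under a third of the per-token compute.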
Implementation in MoE
In Mixture of Experts architectures:
- Router Selection: A lightweight gating network scores all experts and picks the top k (typically 1-8) for each token
- Sparse Computation: Only the selected experts' feed-forward blocks process the token; the rest are skipped
- Efficient FLOPs: Activating 8 of 64 experts uses roughly 12.5% of the expert-layer compute
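The three steps above can be sketched in a few lines of NumPy. This is a minimal, illustrative top-k router for a single token, not any particular model's implementation; all names and shapes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, d_model = 64, 8, 16

x = rng.standard_normal(d_model)                       # one token's hidden state
router_w = rng.standard_normal((d_model, n_experts))   # gating network weights
experts = rng.standard_normal((n_experts, d_model, d_model))  # toy expert MLPs

# 1. Router selection: score all experts, keep the top k.
logits = x @ router_w
top_k = np.argsort(logits)[-k:]

# Softmax over only the selected experts' logits -> combination weights.
weights = np.exp(logits[top_k] - logits[top_k].max())
weights /= weights.sum()

# 2. Sparse computation: only the k chosen experts run; the other 56 are skipped.
out = sum(w * (x @ experts[i]) for w, i in zip(weights, top_k))

# 3. Efficient FLOPs: 8 of 64 experts = 12.5% of the expert-layer compute.
print(f"active: {k}/{n_experts} = {k / n_experts:.1%}")
```

In a real model the experts are full feed-forward blocks and routing happens per token per layer, but the select-then-combine pattern is the same.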
Benefits
- Scale Without Cost: Train a 10T-parameter model while paying per-token compute for only ~100B active parameters
- Specialization: Different experts learn to handle different domains or token types
- Inference Speed: Lower latency than a dense model of the same total parameter count
- Training Efficiency: Expert parallelism can distribute experts across devices
Challenges
- Memory Bandwidth: All experts must be in memory even if few are used
- Communication: Distributed training requires cross-device routing
- Load Imbalance: Popular experts can become bottlenecks
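Load imbalance is usually countered with an auxiliary loss that penalizes the router when traffic concentrates on a few experts. The sketch below follows the common form popularized by Switch Transformer-style MoE training; the function name and the top-1 routing simplification are assumptions for illustration:

```python
import numpy as np

def load_balance_loss(router_probs, expert_assignment, n_experts):
    # f: fraction of tokens actually dispatched to each expert
    f = np.bincount(expert_assignment, minlength=n_experts) / len(expert_assignment)
    # p: mean router probability mass given to each expert
    p = router_probs.mean(axis=0)
    # Scaled dot product; minimized (== 1.0) when routing is perfectly uniform.
    return n_experts * float(np.dot(f, p))

rng = np.random.default_rng(1)
n_tokens, n_experts = 1024, 8
logits = rng.standard_normal((n_tokens, n_experts))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
assignment = probs.argmax(axis=1)   # top-1 routing for simplicity

print(load_balance_loss(probs, assignment, n_experts))
```

Adding this term to the training loss nudges the router toward spreading tokens evenly, so no single expert becomes a bottleneck.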
Real-World Scale
Frontier models achieve massive scale through sparsity:
| Model | Total Parameters | Active per Token |
|---|---|---|
| GPT-5.2 | ~2T+ (estimated) | ~200B |
| Gemini 3 | ~2T+ (estimated) | ~200B |
| Mixtral 8x7B | 47B | 13B |
Sparse activation is what makes trillion-parameter models practical for real-time inference.