Project Aya
Compression, Equity, and the Architecture of Linguistic Inclusion
The Problem
Today's most efficient AI models are English-centric. The architecture of language technology is leaving billions behind.
Can transformer-based multilingual models be distilled into Mamba architectures while preserving multilingual capability, structured tool use, and cross-lingual reasoning — and does the compression degrade uniformly across language families?
Aetheris: Hybrid Mamba-MoE
A novel student architecture combining selective state spaces with sparse mixture-of-experts, distilled from Cohere's Tiny Aya.
3-Stage MambaInLlama Pipeline
Layer Alignment
Map transformer attention layers to Mamba SSM blocks with structural correspondence
KL Distillation
Soft-target training with KL divergence to transfer knowledge from teacher to student
Supervised Fine-Tuning
Restore multilingual capability and structured tool calling via targeted SFT
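The KL distillation stage can be sketched as a standard soft-target loss. A minimal numpy version, assuming Hinton-style temperature softening with the usual T² gradient scaling (the function name and temperature value are illustrative, not the project's actual training code):

```python
import numpy as np

def softmax(logits, T):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Scaled by T**2 so gradient magnitudes stay comparable as T varies.
    """
    p = softmax(teacher_logits, T)                     # teacher soft targets
    log_q = np.log(softmax(student_logits, T) + 1e-12)
    log_p = np.log(p + 1e-12)
    return (T ** 2) * np.mean(np.sum(p * (log_p - log_q), axis=-1))

logits = np.array([[2.0, 0.5, -1.0]])
assert kl_distill_loss(logits, logits) < 1e-9          # identical → zero loss
assert kl_distill_loss(logits, np.zeros((1, 3))) > 0   # mismatch → positive loss
```

The student minimizes this divergence against the frozen teacher's logits, which transfers the teacher's full output distribution rather than just its argmax labels.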
SSM Blocks
O(n) selective scan with constant memory. The Mamba backbone enables linear-time sequence processing.
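The linear-time recurrence behind the scan is simple to state. A single-channel sketch (real Mamba blocks operate over multi-dimensional states with learned, input-dependent discretization; the scalar coefficients here are a simplification):

```python
import numpy as np

def selective_scan(x, A, B, C):
    """Minimal selective SSM scan: h_t = A_t*h_{t-1} + B_t*x_t, y_t = C_t*h_t.

    A, B, C vary per step (input-dependent), which is what makes the scan
    "selective". One pass over the sequence: O(n) time, constant-size state.
    """
    h = 0.0
    y = np.empty_like(x)
    for t in range(len(x)):
        h = A[t] * h + B[t] * x[t]   # recurrent state never grows with n
        y[t] = C[t] * h
    return y

x = np.ones(4)
y = selective_scan(x, A=np.full(4, 0.5), B=np.ones(4), C=np.ones(4))
# → [1.0, 1.5, 1.75, 1.875]
```

Contrast with attention, where each new token attends to all previous ones, giving O(n²) time and a cache that grows with sequence length.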
Sparse MoE
4 expert FFNs with top-1 routing and load balancing. Only one expert fires per token — efficient and specialized.
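Top-1 routing with a load-balancing term can be sketched in a few lines. This is a Switch-style illustration, assuming simple per-expert functions and a dense gating matrix (names and shapes are placeholders, not the Aetheris implementation):

```python
import numpy as np

def top1_moe(x, W_gate, experts):
    """Top-1 MoE: each token is routed to exactly one expert FFN, so
    per-token compute stays constant as the expert count grows.

    Returns expert outputs plus a load-balancing loss that pushes the
    router toward spreading tokens evenly across experts.
    """
    logits = x @ W_gate                               # (tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    choice = probs.argmax(axis=-1)                    # hard top-1 routing
    out = np.stack([experts[e](x[i]) * probs[i, e]    # gate-weighted output
                    for i, e in enumerate(choice)])
    n_e = len(experts)
    frac = np.bincount(choice, minlength=n_e) / len(choice)  # tokens per expert
    lb_loss = n_e * np.sum(frac * probs.mean(axis=0))
    return out, lb_loss

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))
W_gate = rng.normal(size=(4, 4))                      # 4 experts, as in the text
experts = [lambda v, s=s: v * s for s in (1.0, 2.0, 3.0, 4.0)]
out, lb = top1_moe(x, W_gate, experts)
```

The load-balancing term penalizes routers that collapse onto a single expert, keeping all four FFNs trained and specialized.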
Hybrid Design
SSM blocks on odd layers, MoE on even layers. 24 total layers with weight-tied embeddings and gradient checkpointing.
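The interleaving pattern is easy to express directly. A sketch of the layer layout only (block names are placeholders; the real blocks carry parameters):

```python
def build_backbone(n_layers=24):
    """Alternate SSM and MoE blocks: SSM on odd layers, MoE on even
    layers (1-indexed), per the hybrid design described above."""
    return ["ssm" if i % 2 == 1 else "moe" for i in range(1, n_layers + 1)]

layers = build_backbone()
# 24 layers: 12 SSM, 12 MoE, starting with SSM
```

Alternating the two block types lets every MoE layer mix token representations that the preceding SSM layer has already propagated along the sequence.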
Tool Use & Reasoning
Can a distilled Mamba model generate structured outputs and reason across languages? These are the capabilities compression must preserve.
Multilingual Tool Calling
Structured JSON generation and function selection across 70+ languages. Nearly unstudied in multilingual settings.
"function": "get_weather",
"arguments": {
"location": "القاهرة",
"unit": "celsius"
}
}
Cross-Lingual Reasoning
XCOPA benchmark — causal inference across 11 languages. Tests whether commonsense reasoning survives compression.
mGSM
Multilingual grade school math — 250 problems per language, evaluated across all supported scripts and families.
70+ Languages, 23+ Families
Aetheris must preserve Tiny Aya's full multilingual coverage. Distillation targets every language the teacher supports — across scripts, typologies, and resource levels.
Measuring Equity
Benchmarks designed to reveal whether compression harms some languages more than others.
Degradation Equity Score
The variance of accuracy drop across language families after distillation. Measures whether compression picks winners and losers.
Lower is better: zero means every language family degrades by the same amount.
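The score can be computed directly from per-family accuracies. A sketch, assuming the metric is the population variance of per-family accuracy drops (the text names the metric; this estimator and the example numbers are illustrative):

```python
import numpy as np

def degradation_equity_score(teacher_acc, student_acc):
    """Variance of per-family accuracy drop (teacher minus student).

    0 means every language family lost the same amount; larger values
    mean compression picked winners and losers.
    """
    drops = [teacher_acc[f] - student_acc[f] for f in teacher_acc]
    return float(np.var(drops))

teacher = {"Indo-European": 0.80, "Afro-Asiatic": 0.72, "Niger-Congo": 0.65}
uniform = {f: a - 0.05 for f, a in teacher.items()}   # every family drops 5 pts
skewed = {"Indo-European": 0.79, "Afro-Asiatic": 0.62, "Niger-Congo": 0.48}

assert degradation_equity_score(teacher, uniform) < 1e-12  # equitable → ~0
assert degradation_equity_score(teacher, skewed) > 0       # uneven → positive
```

A model that sheds 5 points everywhere scores near zero; one that sheds 1 point on high-resource families and 17 on low-resource ones scores high, even if their average drops match.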