🌿
Cohere Labs Wayy Research

Project Aya

Compression, Equity, and the Architecture of Linguistic Inclusion

Multilingual · Mamba SSM · Distillation · 70+ Languages · MoE · Open Source · Edge AI
Multilingual · Mamba · Distillation · 70+ Languages · Open Science · Est. 2026 · Buffalo, NY · Hybrid Mamba-MoE · Linguistic Equity · Edge Deployment

The Problem

Most efficient AI models are English-centric. The architecture of language technology is leaving billions behind.

70%+
of internet users are non-English speakers, yet most edge-deployed models serve English only
O(n²)
Transformer attention cost makes multilingual models impractical on consumer hardware and edge devices
0
Multilingual Mamba large language models exist today — this gap is the opportunity
"

Can transformer-based multilingual models be distilled into Mamba architectures while preserving multilingual capability, structured tool use, and cross-lingual reasoning — and does the compression degrade uniformly across language families?

Aetheris: Hybrid Mamba-MoE

A novel student architecture combining selective state spaces with sparse mixture-of-experts, distilled from Cohere's Tiny Aya.

Teacher: Tiny Aya (3.35B params · 70+ langs)
    ↓ distill
Student: Aetheris (~500–800M params · Mamba-MoE)

3-Stage MambaInLlama Pipeline

1

Layer Alignment

Map transformer attention layers to Mamba SSM blocks, initializing each SSM block's projections from the corresponding attention weights so the teacher's structure carries over

2

KL Distillation

Soft-target training with KL divergence to transfer knowledge from teacher to student; a minimal loss sketch follows this list

3

Supervised Fine-Tuning

Restore multilingual capability and structured tool calling via targeted SFT
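
As a concrete reference for stage 2, here is a minimal sketch of the soft-target KL loss in PyTorch. The temperature value and the (tokens, vocab) logit shape are illustrative assumptions, not the project's training recipe.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-target KL loss; logits are (tokens, vocab).
    # Soften both distributions with the same temperature.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student), averaged per token by "batchmean"; the T^2
    # factor keeps gradient magnitudes comparable across temperatures
    # (standard distillation convention).
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * temperature ** 2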

ψ

SSM Blocks

O(n) selective scan with constant memory. The Mamba backbone enables linear-time sequence processing.
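
For intuition, the recurrence is sketched below as a plain sequential loop; shapes and parameter names are illustrative, and real Mamba kernels fuse this loop into a hardware-aware scan rather than running it in Python.

import torch

def selective_scan(u, delta, A, B, C):
    # Shapes (illustrative): u, delta: (batch, length, d);
    # A: (d, n); B, C: (batch, length, n).
    batch, length, d = u.shape
    n = A.shape[-1]
    x = torch.zeros(batch, d, n, device=u.device)  # constant-size state
    ys = []
    for t in range(length):
        # Input-dependent discretization: dA and dB change per token,
        # which is what makes the scan "selective".
        dA = torch.exp(delta[:, t, :, None] * A)      # (batch, d, n)
        dB = delta[:, t, :, None] * B[:, t, None, :]  # (batch, d, n)
        x = dA * x + dB * u[:, t, :, None]            # state update
        ys.append((x * C[:, t, None, :]).sum(-1))     # readout: (batch, d)
    return torch.stack(ys, dim=1)                     # (batch, length, d)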

Σ

Sparse MoE

4 expert FFNs with top-1 routing and load balancing. Only one expert fires per token — efficient and specialized.
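
A rough sketch of the 4-expert, top-1 layout described above, assuming a Switch-Transformer-style auxiliary load-balancing loss; names and dimensions are illustrative, not the Aetheris implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, d_model, d_ff, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_i = probs.max(dim=-1)  # top-1: one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_i == e
            if mask.any():
                # Scale by the router probability so the router gets gradient.
                out[mask] = top_p[mask, None] * expert(x[mask])
        # Switch-style load-balancing loss: pushes token counts and router
        # mass toward a uniform split across the experts.
        frac_tokens = F.one_hot(top_i, len(self.experts)).float().mean(0)
        mean_probs = probs.mean(0)
        aux_loss = len(self.experts) * (frac_tokens * mean_probs).sum()
        return out, aux_loss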

Δ

Hybrid Design

SSM blocks on odd layers, MoE on even layers. 24 total layers with weight-tied embeddings and gradient checkpointing.
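
The odd/even interleave is simple to express; the sketch below uses placeholder factories standing in for the real Mamba block and the MoE FFN. Weight-tied embeddings and gradient checkpointing would live in the surrounding model wrapper, not in the layer stack itself.

import torch.nn as nn

def build_hybrid_stack(d_model, n_layers=24,
                       ssm_factory=None, moe_factory=None):
    # Odd layers (1, 3, ...) get SSM blocks; even layers (2, 4, ...) get MoE.
    # Identity defaults keep the sketch runnable without the real blocks.
    ssm_factory = ssm_factory or (lambda d: nn.Identity())
    moe_factory = moe_factory or (lambda d: nn.Identity())
    return nn.ModuleList(
        ssm_factory(d_model) if i % 2 == 1 else moe_factory(d_model)
        for i in range(1, n_layers + 1)
    )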

Tool Use & Reasoning

Can a distilled Mamba model generate structured outputs and reason across languages? These are the capabilities compression must preserve.

Structured Output

Multilingual Tool Calling

Structured JSON generation and function selection across 70+ languages. Nearly unstudied in multilingual settings.

{
  "function": "get_weather",
  "arguments": {
    "location": "القاهرة",
    "unit": "celsius"
  }
}
Evaluates valid JSON tool calls across scripts: Latin, Arabic, Devanagari, CJK, Telugu, and more
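
A plausible grading helper, assuming Pydantic v2 (Pydantic appears under Built With): a call counts as valid only if the output parses as JSON and matches the schema, regardless of the script inside the argument values. The ToolCall schema here is a hypothetical stand-in for the benchmark's real per-function schemas.

import json
from pydantic import BaseModel, ValidationError

class ToolCall(BaseModel):
    # Hypothetical schema; real benchmarks vary the schema per function.
    function: str
    arguments: dict

def is_valid_tool_call(text: str) -> bool:
    # Valid only if the text parses as JSON and matches the schema.
    # Script-agnostic: Arabic or Devanagari argument values pass exactly
    # like Latin ones.
    try:
        ToolCall.model_validate(json.loads(text))
        return True
    except (json.JSONDecodeError, ValidationError):
        return False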
Reasoning

Cross-Lingual Reasoning

XCOPA benchmark — causal inference across 11 languages. Tests whether commonsense reasoning survives compression.

Does distillation degrade reasoning uniformly, or do some language families lose more?
mGSM

Multilingual grade school math — 250 problems per language, evaluated across all supported scripts and families.

70+ Languages, 23+ Families

Aetheris must preserve Tiny Aya's full multilingual coverage. Distillation targets every language the teacher supports — across scripts, typologies, and resource levels.

Europe & Americas
English · French · Spanish · Portuguese · Italian · German · Dutch · Romanian · Catalan · Galician · Danish · Swedish · Norwegian · Finnish · Estonian · Hungarian · Czech · Polish · Slovak · Slovenian · Croatian · Serbian · Bulgarian · Ukrainian · Russian · Latvian · Lithuanian · Greek · Welsh · Irish · Basque · Maltese
South & Central Asia
Hindi · Bengali · Marathi · Gujarati · Punjabi · Tamil · Telugu · Nepali · Urdu
East & Southeast Asia
Chinese · Japanese · Korean · Vietnamese · Thai · Lao · Burmese · Khmer · Indonesian · Malay · Tagalog · Javanese
West Asia & North Africa
Arabic · Persian · Turkish · Hebrew
Sub-Saharan Africa
Swahili · Amharic · Hausa · Igbo · Yoruba · Wolof · Xhosa · Zulu · Shona · Malagasy

Measuring Equity

Benchmarks designed to reveal whether compression harms some languages more than others.

mGSM
Multilingual grade school math — 250 problems per language. Tests arithmetic reasoning across scripts and language families.
XCOPA
Cross-lingual commonsense reasoning — 11 languages. Causal inference beyond English-centric assumptions.
Tool Calling
Structured JSON generation and function selection across 70+ languages. Nearly unstudied in multilingual settings.
CPU Throughput
Tokens/sec, peak memory, time-to-first-token. The whole point is running on consumer hardware.
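
A sketch of how those CPU numbers might be collected, assuming a HuggingFace-style generate API; greedy decoding keeps runs repeatable, and peak memory would be tracked separately (for example with psutil).

import time
import torch

@torch.inference_mode()
def cpu_metrics(model, tokenizer, prompt, max_new_tokens=128):
    # Time-to-first-token: prefill plus one decoded token.
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    t0 = time.perf_counter()
    model.generate(ids, max_new_tokens=1, do_sample=False)
    ttft = time.perf_counter() - t0
    # End-to-end generation rate over a longer completion (includes prefill).
    t1 = time.perf_counter()
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    tokens_per_sec = (out.shape[1] - ids.shape[1]) / (time.perf_counter() - t1)
    return ttft, tokens_per_sec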

Degradation Equity Score

The variance of the accuracy drop across language families after distillation: near zero when every family degrades alike, large when compression picks winners and losers.

Lower is better. Compression should not pick winners.
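
A minimal sketch of the score as stated, assuming the drop is teacher accuracy minus student accuracy per family and that "variance" means population variance; both conventions are assumptions, not a published definition.

import statistics

def degradation_equity_score(teacher_acc, student_acc):
    # Inputs map language family -> accuracy for teacher and student.
    drops = [teacher_acc[fam] - student_acc[fam] for fam in teacher_acc]
    # Population variance of the drops; 0.0 means perfectly even degradation.
    return statistics.pvariance(drops)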

Built With

PyTorch · HuggingFace Transformers · HuggingFace Datasets · FastAPI · Mamba SSM · Docker · GitHub Actions · CUDA · llama.cpp · Pydantic · SSE Streaming · RunPod A100

Three Phases

Phase 01
Baseline
Load Tiny Aya and establish baseline CPU tokens/sec across language families. Validate tokenizer compatibility for all 70+ languages.
Phase 02
Distillation
Execute 3-stage MambaInLlama pipeline on A100. Layer alignment, KL distillation, then multilingual SFT with tool calling recovery.
Phase 03
Evaluation
Full benchmark suite across 70+ languages. Equity analysis, ablation studies, and paper preparation targeting EMNLP / ACL / NeurIPS.