Multilingual Tokenization Showdown

Analyzing 12 LLM Tokenizers Across 204 Languages

Tokenizer Winners

🥇 GPT-OSS: 95 languages (46.6%)
🥈 MiniMax-M2: 41 languages (20.1%)
🥉 Llama-4: 39 languages (19.1%)

Average Performance (WPT)

Words per token (WPT): higher is better.
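A minimal sketch of how WPT can be computed, assuming Hugging Face tokenizers and simple whitespace word splitting (scripts written without spaces, e.g. Japanese or Burmese, would need a proper word segmenter); the sample sentence and the gpt2 checkpoint are illustrative stand-ins, not the exact setup behind the numbers above.

```python
# Sketch of the words-per-token (WPT) metric, assuming Hugging Face tokenizers.
from transformers import AutoTokenizer

def words_per_token(tokenizer, text: str) -> float:
    """WPT = whitespace-separated words / tokens produced (higher is better)."""
    n_words = len(text.split())
    n_tokens = len(tokenizer.encode(text, add_special_tokens=False))
    return n_words / n_tokens if n_tokens else 0.0

sample = "Die Katze sitzt auf der Matte."      # illustrative German sentence
tok = AutoTokenizer.from_pretrained("gpt2")    # any tokenizer under comparison
print(f"gpt2 WPT: {words_per_token(tok, sample):.3f}")
```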

Worst Performers

GPT-2: The Legacy Problem

Worst in 190 out of 204 languages

Trained on English-only data, so it fragments non-English text into many tokens (see the sketch below)

Russian: WPT 0.134
German: WPT 0.309

Second worst: Granite-4 (Avg WPT: 0.282)
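A quick illustration of that fragmentation, assuming the public gpt2 checkpoint on Hugging Face as a stand-in for the legacy tokenizer; the two words are arbitrary examples, and exact token counts depend on the text.

```python
# Sketch: how an English-only byte-level BPE fragments non-Latin scripts.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for text in ["tokenization", "токенизация"]:   # English vs. Russian word
    ids = tok.encode(text, add_special_tokens=False)
    # Cyrillic characters fall back to byte-level pieces, so expect far more
    # tokens per word for Russian than for English.
    print(f"{text!r}: {len(ids)} tokens -> {tok.convert_ids_to_tokens(ids)}")
```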

Performance by Language Family

Best tokenizer per family (share of that family's languages it wins):

Slavic (10 languages): Llama-4 70%
Germanic (8 languages): GPT-OSS 75%
Romance (8 languages): MiniMax-M2 62.5%
Indo-Aryan (12 languages): Gemma-3 50%
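A sketch of how such per-family win rates can be derived, assuming a per-language table of WPT scores (as computed earlier); the language codes, family labels, and WPT values in the toy example are placeholders, not the study's data.

```python
# Sketch: per-family win rates from per-language WPT scores.
from collections import Counter, defaultdict

def family_win_rates(wpt_by_lang: dict, family_of: dict) -> dict:
    """For each family, the share of its languages each tokenizer wins (highest WPT)."""
    wins = defaultdict(Counter)
    sizes = Counter()
    for lang, scores in wpt_by_lang.items():
        fam = family_of[lang]
        sizes[fam] += 1
        wins[fam][max(scores, key=scores.get)] += 1   # tokenizer with best WPT wins
    return {fam: {t: n / sizes[fam] for t, n in c.items()} for fam, c in wins.items()}

# Toy example with made-up WPT values:
wpt = {"ru": {"Llama-4": 0.31, "GPT-OSS": 0.28}, "pl": {"Llama-4": 0.27, "GPT-OSS": 0.29}}
fams = {"ru": "Slavic", "pl": "Slavic"}
print(family_win_rates(wpt, fams))  # e.g. {'Slavic': {'Llama-4': 0.5, 'GPT-OSS': 0.5}}
```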

100% Dominance

MiniMax-M2

✓ Semitic (Arabic): 100%
✓ Japanese: 100%

GPT-OSS

✓ Niger-Congo: 100%
✓ Semitic (Hebrew): 100%

Llama-4

✓ Korean: 100%
✓ Austroasiatic: 100%

Gemma-3

✓ Sino-Tibetan (Burmese): 100%

Key Takeaways

🏆 Overall Winner: GPT-OSS (wins 46.6% of languages; highest average WPT: 0.345)
🌍 Specialization Matters: Different tokenizers excel at different language families
⚠️ Legacy Issues: GPT-2 struggles dramatically outside English (worst in 93% of languages)
💡 Recommendation: Choose GPT-OSS for multilingual applications