Analyzing 12 LLM Tokenizers Across 204 Languages
WikiCat Multilingual Analysis
GitHub Repository
Words Per Token (WPT): Higher Is Better
Worst in 190 out of 204 languages
Trained on English only, so it struggles with multilingual text
Second worst: Granite-4 (Avg WPT: 0.282)
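For readers who want to see how a words-per-token score could be computed, here is a minimal sketch. It assumes WPT means whitespace-delimited words divided by the number of tokens produced; the page does not state the exact word definition used, and whitespace splitting does not apply well to languages written without spaces. The function name words_per_token and the two toy tokenizers are hypothetical stand-ins for illustration, not the tokenizers analyzed here.

```python
from typing import Callable, List, Sequence


def words_per_token(text: str, tokenize: Callable[[str], Sequence]) -> float:
    """Return whitespace-delimited words divided by tokens (assumed WPT definition)."""
    n_words = len(text.split())
    n_tokens = len(tokenize(text))
    if n_tokens == 0:
        # Avoid division by zero on empty input.
        return 0.0
    return n_words / n_tokens


def whitespace_tokenizer(text: str) -> List[str]:
    """Toy tokenizer (hypothetical): one token per whitespace-delimited word."""
    return text.split()


def byte_tokenizer(text: str) -> List[int]:
    """Toy tokenizer (hypothetical): one token per UTF-8 byte."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    sample = "hello world"
    # One token per word gives the best possible score under this definition.
    print(words_per_token(sample, whitespace_tokenizer))
    # One token per byte fragments each word into many tokens and scores low.
    print(words_per_token(sample, byte_tokenizer))
```

Under this definition, an average WPT of 0.282 corresponds to roughly 3.5 tokens per word (1 / 0.282), which is why a higher score indicates a more efficient tokenizer.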