Our paper on evaluating LLMs across 100 culturally diverse languages was accepted at CVPR 2025. The work forced us to confront uncomfortable questions about what "multilingual" really means in AI evaluation — and how far we still have to go.
The starting point
Most multilingual benchmarks test a handful of high-resource languages — English, Chinese, Spanish, French, German — and call it a day. When researchers claim "multilingual support," they often mean "works in 10-15 languages that already have abundant training data."
We wanted to know: what happens when you push that to 100 languages, including low-resource languages with limited digital presence, different scripts, and fundamentally different grammatical structures?
Key findings
The results were sobering but instructive:
- Performance cliff: Most LLMs showed a sharp performance drop-off beyond the top 20-30 languages. The degradation wasn't gradual — it was a cliff. Models that performed at 90%+ accuracy in English would drop to 40-50% in languages like Yoruba, Khmer, or Amharic.
- Script matters more than expected: Languages using non-Latin scripts suffered disproportionately, even when the language itself had reasonable training data. Tokenization strategies optimized for Latin scripts created systematic disadvantages.
- Cultural context is invisible to benchmarks: Many evaluation tasks assume Western cultural frameworks. A sentiment analysis task about "going to prom" doesn't translate meaningfully to cultures where that concept doesn't exist.
- Translation is not localization: Machine-translated benchmarks introduced subtle but systematic biases. Idioms, formality registers, and pragmatic meaning were lost in translation, making the evaluation unreliable.
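The tokenization point above can be made concrete with a toy proxy. This sketch is not the paper's methodology: it simply uses UTF-8 bytes per character as a stand-in for token cost, since byte-level and byte-fallback tokenizers spend roughly three times the budget per character on scripts like Ge'ez or Khmer as they do on Latin text. The sample strings are illustrative, not drawn from the benchmark.

```python
# Illustration (not from the paper): why byte-oriented tokenization
# disadvantages non-Latin scripts. Each ASCII Latin letter costs 1 UTF-8
# byte; Ethiopic and Khmer characters cost 3 bytes each, so a byte-level
# tokenizer burns ~3x the sequence budget per character.

def bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character -- a crude proxy for token cost."""
    return len(text.encode("utf-8")) / len(text)

samples = {
    "English": "hello world",
    "Amharic": "ሰላም ለዓለም",  # Ge'ez script, 3 bytes per character
    "Khmer": "សួស្តី",        # Khmer script, 3 bytes per character
}

for lang, text in samples.items():
    print(f"{lang}: {bytes_per_char(text):.2f} bytes/char")
```

Real subword tokenizers add another layer of bias on top of this: their merge rules are learned mostly from Latin-script data, so even when the byte cost is equal, non-Latin text tends to fragment into more tokens.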
What we proposed
Rather than just documenting the problems, we introduced a framework for more equitable multilingual evaluation:
- Culturally grounded tasks: Evaluation tasks should be created by native speakers in each language, not translated from English.
- Script-aware metrics: Separate performance reporting by script family to make systematic biases visible.
- Resource-stratified benchmarks: Explicitly categorize languages by resource availability and report performance within each tier.
- Community-sourced validation: Partner with language communities to validate that benchmarks are meaningful in cultural context.
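To show what script-aware reporting might look like in practice, here is a minimal sketch. Everything in it is a simplifying assumption: `script_of` is a hypothetical helper that infers a script family from the Unicode name of the first letter (a production system would use a proper library such as ICU), and the accuracy numbers are made up for illustration, not results from the paper.

```python
import unicodedata
from collections import defaultdict

def script_of(text: str) -> str:
    """Crude script detection: take the first word of the Unicode name
    of the first alphabetic character, e.g. 'LATIN' or 'CYRILLIC'.
    Hypothetical helper for illustration only."""
    for ch in text:
        if ch.isalpha():
            return unicodedata.name(ch).split()[0]
    return "UNKNOWN"

# Hypothetical (language, sample text, accuracy) results.
results = [
    ("english", "hello", 0.92),
    ("yoruba", "bawo", 0.47),    # Yoruba uses a Latin-based alphabet
    ("amharic", "ሰላም", 0.44),
    ("russian", "привет", 0.78),
]

# Group accuracies by script family and report each tier separately,
# so cross-script gaps are visible instead of averaged away.
by_script = defaultdict(list)
for lang, sample, acc in results:
    by_script[script_of(sample)].append(acc)

for script, accs in sorted(by_script.items()):
    mean = sum(accs) / len(accs)
    print(f"{script}: mean accuracy {mean:.2f} over {len(accs)} language(s)")
```

The same grouping pattern extends directly to the resource-stratified idea: replace the script key with a resource tier (high/medium/low) and report per-tier means rather than a single aggregate.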
The bigger picture
What struck me most about this project was how much our field's notion of "general intelligence" is shaped by English-centric assumptions. A model that works brilliantly in English and fails in Swahili isn't "generally intelligent" — it's specifically intelligent in one linguistic context.
If we're serious about building AI that serves the full diversity of human experience, we need evaluation frameworks that reflect that diversity. That starts with acknowledging that 100 languages isn't an edge case — it's the minimum bar for global relevance.
The true test of a multilingual model isn't how well it handles English — it's how gracefully it handles the languages it was least prepared for.