Evaluating large language models on Northern Thai ↔ Standard Thai translation — with multi-reference scoring, error typology, and human ratings.
Why Northern Thai — and why now?
This page is the public companion site for LannaBench — a research benchmark evaluating how well frontier and open-weight LLMs handle translation between Northern Thai and Standard Thai in both directions. It pairs automated scoring (a multi-reference Triple-ChrF metric) with human ratings.
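As a rough illustration of multi-reference character n-gram scoring, the sketch below implements a simplified chrF in plain Python and averages it over several references. It assumes that "Triple-ChrF" means scoring each hypothesis against three references and averaging the results (the `chrf_avg` name used in the tables suggests this, but the page does not define it). The function names, the whitespace handling, and the defaults (n-grams up to order 6, β = 2) are assumptions, not the benchmark's actual implementation.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # string too short for this n-gram order
        overlap = sum((h & r).values())  # clipped n-gram matches
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return 100 * (1 + b2) * p * r / (b2 * p + r)


def chrf_avg(hyp: str, refs: list[str]) -> float:
    """Average the per-reference chrF scores (assumed aggregation)."""
    return sum(chrf(hyp, ref) for ref in refs) / len(refs)
```

Averaging rewards hypotheses that sit close to all references; taking the maximum over references is another common choice and would reward matching any single one.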
Northern Thai (Lanna / ᨣᩴᩤᨾᩮᩬᩥᨦ, ISO 639-3 nod, Glottolog nort2740) is spoken across eight provinces of Northern Thailand by millions of speakers. Despite this, it is severely under-represented in modern NLP benchmarks: models that handle Standard Thai with ease routinely stumble on Northern Thai vocabulary, particles, and tonal patterns.
The study uses a 3-prompt robustness protocol on a held-out slice of naturalistic conversational text, evaluating 9 models including Typhoon2, SeaLLM, LLaMA-3.1-8B, Qwen2.5, and a three-provider API cohort (OpenAI / Anthropic / DeepSeek). LoRA fine-tuning is scoped to Typhoon2 using QLoRA NF4 4-bit via a custom Anglo finetuner.
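One way to summarise a 3-prompt robustness protocol is to report, per model, the mean score across the three prompt templates together with a measure of spread. The sketch below is an assumption about how such a summary could be computed; the function name, the prompt labels, and the example scores are placeholders, not values or code from the study.

```python
from statistics import mean, pstdev


def prompt_robustness(scores_by_prompt: dict[str, float]) -> dict[str, float]:
    """Summarise one model's scores across prompt templates."""
    values = list(scores_by_prompt.values())
    return {
        "mean": mean(values),                 # average quality across prompts
        "std": pstdev(values),                # population standard deviation
        "spread": max(values) - min(values),  # worst-case prompt sensitivity
    }


# Placeholder scores for illustration only (not results from the benchmark).
example = {"prompt_a": 60.0, "prompt_b": 62.0, "prompt_c": 64.0}
summary = prompt_robustness(example)
print(summary)
```

A small spread indicates that a model's score depends little on prompt wording; a large one suggests that a single-prompt comparison could be misleading.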
NTD → STD · held-out conversational text · Δ ChrF over base model
v1 → v2 prompt template comparison · all scores are chrf_avg · v2 redesigned with clearer NTD boundary marking and more specific imperative instruction
| Model | v1 | v2 | Δ |
|---|---|---|---|
| gpt-4o | 70.94 | 70.18 | −0.75 |
| claude-sonnet | 67.34 | 68.22 | +0.88 |
| deepseek (v4-flash, alias) | 68.94 | 72.39 | +3.45 (highest v2 score) |
| typhoon-s-thaillm | 47.34 | 51.34 | +4.01 |
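The Δ column can be recomputed from the v1 and v2 columns as a sanity check. In the sketch below the scores are copied from the table; the variable names are illustrative. Recomputing from the two-decimal values gives −0.76 for gpt-4o and +4.00 for typhoon-s-thaillm, each 0.01 away from the listed figure, which would be consistent with the published deltas having been computed from unrounded scores (an assumption; the page does not say).

```python
# (v1, v2) chrf_avg scores copied from the table above.
scores = {
    "gpt-4o": (70.94, 70.18),
    "claude-sonnet": (67.34, 68.22),
    "deepseek (v4-flash, alias)": (68.94, 72.39),
    "typhoon-s-thaillm": (47.34, 51.34),
}

# Delta = v2 - v1, rounded to two decimals like the table.
deltas = {model: round(v2 - v1, 2) for model, (v1, v2) in scores.items()}

best_v2 = max(scores, key=lambda m: scores[m][1])  # highest absolute v2 score
largest_gain = max(deltas, key=deltas.get)         # largest improvement

for model, delta in deltas.items():
    print(f"{model}: {delta:+.2f}")
```

Note that the highest v2 score and the largest gain belong to different models, so the two rankings should be read separately.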
2026-05-13 → 2026-05-16 · 3 providers · 7 inference runs · ~3,990 API calls total
Per-day, per-model breakdown
| Date | Provider | Model | Req | In tok | Out tok | Cost |
|---|---|---|---|---|---|---|
| 05-13 | OpenAI | gpt-4o-2024-11-20 | 397 | 74,639 | 10,359 | $0.2902 |
| 05-13 | Anthropic | claude-sonnet-4-5 | 397 | 159,354 | 23,807 | $0.8400 |
| 05-13 | DeepSeek | deepseek-v4-flash | 397 | 65,337 | 11,221 | $0.0105 |
| 05-15 | OpenAI | gpt-4o-2024-11-20 | 397 | 79,270 | 10,018 | $0.3076 |
| 05-15 | OpenAI | gpt-5.5-2026-04-23 (smoke ×2) | 2 | 364 | 248 | ~$0.00 |
| 05-15 | Anthropic | claude-sonnet-4-5 | 397 | 168,550 | 22,501 | $0.8500 |
| 05-15 | Anthropic | claude-sonnet-4-6 (smoke ×1) | 1 | 396 | 10 | ~$0.00 |
| 05-15 | DeepSeek | deepseek-v4-flash | 396 | 68,482 | 9,423 | $0.0107 |
| 05-16 | OpenAI | gpt-5.5-2026-04-23 🔺 | 401 | 79,768 | 105,593 | $3.5666 |
| 05-16 | Anthropic | claude-sonnet-4-6 | 397 | 168,946 | 23,115 | $0.8600 |
| 05-16 | DeepSeek | deepseek-v4-flash (explicit pin) 🔺 | 398 | 68,812 | 255,540 | $0.0743 |
| Grand total | | | | | | $6.8100 |
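The totals can be cross-checked by summing the per-row figures. The sketch below copies the request counts and costs from the table and treats the two "~$0.00" smoke rows as 0.0 (an assumption). The costs sum to $6.8099, matching the grand total after rounding. The listed requests sum to 3,580, which is lower than the ~3,990 API calls quoted in the summary line; the table alone does not account for that difference.

```python
# (date, provider, model, requests, cost_usd), copied from the table above.
# Smoke-test rows listed as "~$0.00" are treated as 0.0 (assumption).
ROWS = [
    ("05-13", "OpenAI", "gpt-4o-2024-11-20", 397, 0.2902),
    ("05-13", "Anthropic", "claude-sonnet-4-5", 397, 0.8400),
    ("05-13", "DeepSeek", "deepseek-v4-flash", 397, 0.0105),
    ("05-15", "OpenAI", "gpt-4o-2024-11-20", 397, 0.3076),
    ("05-15", "OpenAI", "gpt-5.5-2026-04-23 (smoke)", 2, 0.0),
    ("05-15", "Anthropic", "claude-sonnet-4-5", 397, 0.8500),
    ("05-15", "Anthropic", "claude-sonnet-4-6 (smoke)", 1, 0.0),
    ("05-15", "DeepSeek", "deepseek-v4-flash", 396, 0.0107),
    ("05-16", "OpenAI", "gpt-5.5-2026-04-23", 401, 3.5666),
    ("05-16", "Anthropic", "claude-sonnet-4-6", 397, 0.8600),
    ("05-16", "DeepSeek", "deepseek-v4-flash (explicit pin)", 398, 0.0743),
]

total_requests = sum(row[3] for row in ROWS)
total_cost = sum(row[4] for row in ROWS)

print(f"requests: {total_requests}, cost: ${total_cost:.4f}")
```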
Switching from the deepseek-chat alias to the explicitly pinned deepseek-v4-flash produced roughly 23 to 27 times more output tokens per request (255,540 / 398 ≈ 642 tok/req, vs ≈ 28 on the 05-13 alias run and ≈ 24 on the 05-15 alias run). The prompt template and dataset were the same. The alias may have been routing to a terser snapshot, or the explicit pin may enable thinking/reasoning output by default.
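The per-request figures for the DeepSeek runs can be recomputed directly from the table rows. The sketch below uses the output-token and request counts from the table; the dictionary keys are illustrative labels, not identifiers from the project.

```python
# (output_tokens, requests) for the three DeepSeek runs in the table.
deepseek_runs = {
    "05-13 alias": (11_221, 397),
    "05-15 alias": (9_423, 396),
    "05-16 pinned": (255_540, 398),
}

# Average output tokens per request for each run.
per_request = {run: out / req for run, (out, req) in deepseek_runs.items()}

# Ratio of the pinned run to each alias run.
ratios = {
    run: per_request["05-16 pinned"] / value
    for run, value in per_request.items()
    if run != "05-16 pinned"
}

for run, value in per_request.items():
    print(f"{run}: {value:.1f} output tokens per request")
```

Output tokens per request is a useful quantity to track across runs, since a jump of this size changes both cost and latency even when the prompt is unchanged.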
Northern Thai cultural terms as software identifiers
Internally, the codebase uses Northern Thai cultural terms as codenames for its components. The 28-item system maps one-to-one to software modules. The codenames are both a stylistic choice and a way of carrying the culture into the engineering layer of the project.
The monorepo is structured as four uv-workspace Python packages: lanna_khuang (data), lanna_kuafai (adaptation), lanna_jorfa (evaluation), and oob_meta (shared utilities).
→ Lanna Codenames (v2) — each entry shows the literal object, its cultural role, and how the codename maps to a module or concept in the codebase.
Gamified NTD → STD translation evaluation form
BaiLan is the human-rating instrument for LannaBench. Raters evaluate model translations on six axes (1–5 Likert scale each) in a card-by-card quiz flow with streak counters, bilingual Thai/English content, and Lanna temple aesthetics (gold/cream/teak palette).
The form shows one prompt output per card, with cards split into natural context-dependent (NCD) and natural context-independent (NCI) groups. It is a self-contained HTML file that submits via Google Apps Script, with a downloadable JSONL backup as a fallback.
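To make the JSONL backup format concrete, the sketch below validates one card's ratings (six axes, each an integer from 1 to 5) and serialises them as a single JSONL line. The page does not name the six axes or describe the record schema, so the axis names (`axis_1` to `axis_6`), the `card_id` field, and the function name are all hypothetical. The actual form is an HTML file, so this Python version only illustrates the shape of the data.

```python
import json

# Hypothetical axis names; the real six axes are not listed on this page.
AXES = [f"axis_{i}" for i in range(1, 7)]


def rating_to_jsonl(card_id: str, ratings: dict[str, int]) -> str:
    """Validate six 1-5 Likert ratings and return one JSONL line."""
    missing = [axis for axis in AXES if axis not in ratings]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    for axis in AXES:
        value = ratings[axis]
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{axis} must be an integer from 1 to 5")
    record = {"card_id": card_id, "ratings": {a: ratings[a] for a in AXES}}
    # ensure_ascii=False keeps Thai text readable if it appears in a field.
    return json.dumps(record, ensure_ascii=False)


line = rating_to_jsonl("card-001", {axis: 3 for axis in AXES})
print(line)
```

Writing one validated record per line means a partially downloaded backup can still be read up to the last complete line.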