LannaBench · Public README
Measured 2026-05-18 · Northern Thai NLP
Research Benchmark

LannaBench

Evaluating large language models on Northern Thai ↔ Standard Thai translation — with multi-reference scoring, error typology, and human ratings.

MIT License | In Progress | NTD ↔ STD | LoRA v2 · +20% ChrF | 9 Models
Best adapter · Δ ChrF: +20% (over off-the-shelf base · 2026-05-18)
Best benchmark score: 72.39 (DeepSeek v4-flash · chrf_avg · v2 prompts)
4-day API campaign: $6.81 (3 providers · ~3,990 API calls)
Background

Overview

Why Northern Thai — and why now?

This page is the public companion site for LannaBench — a research benchmark evaluating how well frontier and open-weight LLMs handle translation between Northern Thai and Standard Thai in both directions. It pairs automated scoring (a multi-reference Triple-ChrF metric) with human ratings.
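As an illustration of the multi-reference scoring mentioned above, here is a minimal, dependency-free sketch of a chrF-style score computed against several references. It is a simplified character n-gram F-score, not the benchmark's actual Triple-ChrF implementation. The function names, the whitespace stripping, and the reading of chrf_avg as a plain mean over references are all assumptions that this page does not confirm.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale.

    Averages n-gram precision and recall over orders 1..max_n,
    then combines them with an F-beta score (beta=2 favours recall).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is too short for this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


def multi_ref_chrf(hypothesis: str, references: list[str]) -> dict[str, float]:
    """Score one hypothesis against every reference (assumed aggregation)."""
    if not references:
        raise ValueError("at least one reference is required")
    scores = [chrf(hypothesis, ref) for ref in references]
    return {"chrf_avg": sum(scores) / len(scores), "chrf_max": max(scores)}
```

Averaging over references rewards a translation that is close to all accepted renderings, while the maximum rewards matching any one of them. Which of these (or some other combination) the benchmark reports is not stated here.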

Northern Thai (Lanna / ᨣᩴᩤᨾᩮᩬᩥᨦ; Glottolog nort2740, ISO 639-3 nod) is spoken across eight provinces of Northern Thailand by millions of speakers. Despite this, it is severely under-represented in modern NLP benchmarks: models that handle Standard Thai with ease routinely stumble on Northern Thai vocabulary, particles, and tonal patterns.

The study uses a 3-prompt robustness protocol on a held-out slice of naturalistic conversational text, evaluating 9 models including Typhoon2, SeaLLM, LLaMA-3.1-8B, Qwen2.5, and a three-provider API cohort (OpenAI / Anthropic / DeepSeek). LoRA fine-tuning is scoped to Typhoon2 using QLoRA NF4 4-bit via a custom Anglo finetuner.
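One way to read a 3-prompt robustness protocol is to score each model under every prompt template and report both the average and the spread. The sketch below assumes exactly that. The function name, the prompt labels, and the choice of mean, range, and standard deviation are illustrative and are not taken from the study.

```python
from statistics import mean, pstdev


def prompt_robustness(scores_by_prompt: dict[str, float]) -> dict[str, float]:
    """Summarise one model's scores across prompt templates.

    A small spread suggests the model is insensitive to prompt wording;
    a large spread suggests the headline score depends on the template.
    """
    values = list(scores_by_prompt.values())
    if not values:
        raise ValueError("need at least one prompt score")
    return {
        "mean": mean(values),
        "spread": max(values) - min(values),
        "stdev": pstdev(values),
    }


# Placeholder numbers for illustration only; these are NOT benchmark results.
summary = prompt_robustness({"prompt_a": 60.0, "prompt_b": 63.0, "prompt_c": 66.0})
print(summary)
```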

Adapter Iterations

LoRA Adapter Improvements

NTD → STD · held-out conversational text · Δ ChrF over base model

Base model · off-the-shelf · 0%
A capable general-purpose Thai LLM, used as-is. Handles Standard Thai well; stumbles on Northern Thai vocabulary, particles, and tone.
Adapter v1 · 2026-05-08 · +13%
Lightweight adaptation on the natural-conversation slice of the corpus; the model begins to hear Northern Thai as Northern Thai.
Adapter v2 · 2026-05-18 · Current · +20%
Retrained on a refined dataset with a tuned recipe, which recovers earlier failure cases such as repetition loops and dropped dialect particles. Available on Hugging Face at aminomewza/typhoon2-lora-v2.
Benchmark

Multimodel Benchmark — API Cohort

v1 → v2 prompt template comparison · all scores are chrf_avg · v2 was redesigned with clearer NTD boundary marking and a more specific imperative instruction

Model | v1 | v2 | Δ
gpt-4o | 70.94 | 70.18 | −0.75
claude-sonnet | 67.34 | 68.22 | +0.88
deepseek (v4-flash, alias) | 68.94 | 72.39 ⭐ (leader) | +3.45
typhoon-s-thaillm | 47.34 | 51.34 | +4.01
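The comparison above can be re-derived with a few lines of Python. This sketch recomputes each Δ from the rounded v1 and v2 scores shown in the table, so a recomputed value can differ from the published Δ by 0.01 where the original was presumably computed from unrounded scores. The variable names are illustrative.

```python
# (v1, v2) chrf_avg per model, transcribed from the table above.
scores = {
    "gpt-4o": (70.94, 70.18),
    "claude-sonnet": (67.34, 68.22),
    "deepseek (v4-flash, alias)": (68.94, 72.39),
    "typhoon-s-thaillm": (47.34, 51.34),
}

# Delta recomputed from rounded scores; may be off by 0.01 from the table.
deltas = {model: round(v2 - v1, 2) for model, (v1, v2) in scores.items()}

leader = max(scores, key=lambda model: scores[model][1])  # best v2 score
most_improved = max(deltas, key=deltas.get)               # largest v1 -> v2 gain

for model, delta in deltas.items():
    print(f"{model:<28} {delta:+.2f}")
print("leader:", leader)
print("most improved:", most_improved)
```

Separating "leader" from "most improved" matters here: the highest v2 score and the largest gain from the prompt redesign belong to different models.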
Cost Analysis

API Cost Record

2026-05-13 → 2026-05-16 · 3 providers · 7 inference runs · ~3,990 API calls total

Grand total: $6.81 over 4 days
OpenAI: $4.16 (61% · gpt-4o + gpt-5.5 spike)
Anthropic: $2.55 (37% · Sonnet 4.5/4.6 · ~$0.85/run flat)
DeepSeek: $0.10 (1.4% · alias + explicit pin)

Per-day, per-model breakdown

Date | Provider | Model | Req | In tok | Out tok | Cost
05-13 | OpenAI | gpt-4o-2024-11-20 | 397 | 74,639 | 10,359 | $0.2902
05-13 | Anthropic | claude-sonnet-4-5 | 397 | 159,354 | 23,807 | $0.8400
05-13 | DeepSeek | deepseek-v4-flash | 397 | 65,337 | 11,221 | $0.0105
05-15 | OpenAI | gpt-4o-2024-11-20 | 397 | 79,270 | 10,018 | $0.3076
05-15 | OpenAI | gpt-5.5-2026-04-23 (smoke ×2) | 2 | 364 | 248 | ~$0.00
05-15 | Anthropic | claude-sonnet-4-5 | 397 | 168,550 | 22,501 | $0.8500
05-15 | Anthropic | claude-sonnet-4-6 (smoke ×1) | 1 | 396 | 10 | ~$0.00
05-15 | DeepSeek | deepseek-v4-flash | 396 | 68,482 | 9,423 | $0.0107
05-16 | OpenAI | gpt-5.5-2026-04-23 🔺 | 401 | 79,768 | 105,593 | $3.5666
05-16 | Anthropic | claude-sonnet-4-6 | 397 | 168,946 | 23,115 | $0.8600
05-16 | DeepSeek | deepseek-v4-flash (explicit pin) 🔺 | 398 | 68,812 | 255,540 | $0.0743
Grand total | | | | | | $6.8100
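The provider subtotals and the output-length anomalies can be checked directly against the rows above. The following sketch transcribes the table into Python and recomputes them. The tuple layout and variable names are illustrative, and the two smoke-test rows listed as "~$0.00" are entered as 0.0, which is an approximation.

```python
from collections import defaultdict

# Rows transcribed from the per-day table:
# (date, provider, model, requests, input_tokens, output_tokens, cost_usd)
# Smoke-test rows are listed as "~$0.00" and are entered here as 0.0.
ROWS = [
    ("05-13", "OpenAI", "gpt-4o-2024-11-20", 397, 74_639, 10_359, 0.2902),
    ("05-13", "Anthropic", "claude-sonnet-4-5", 397, 159_354, 23_807, 0.8400),
    ("05-13", "DeepSeek", "deepseek-v4-flash", 397, 65_337, 11_221, 0.0105),
    ("05-15", "OpenAI", "gpt-4o-2024-11-20", 397, 79_270, 10_018, 0.3076),
    ("05-15", "OpenAI", "gpt-5.5-2026-04-23 (smoke)", 2, 364, 248, 0.0),
    ("05-15", "Anthropic", "claude-sonnet-4-5", 397, 168_550, 22_501, 0.8500),
    ("05-15", "Anthropic", "claude-sonnet-4-6 (smoke)", 1, 396, 10, 0.0),
    ("05-15", "DeepSeek", "deepseek-v4-flash", 396, 68_482, 9_423, 0.0107),
    ("05-16", "OpenAI", "gpt-5.5-2026-04-23", 401, 79_768, 105_593, 3.5666),
    ("05-16", "Anthropic", "claude-sonnet-4-6", 397, 168_946, 23_115, 0.8600),
    ("05-16", "DeepSeek", "deepseek-v4-flash (explicit pin)", 398, 68_812, 255_540, 0.0743),
]

# Sum cost per provider and overall.
by_provider = defaultdict(float)
for _date, provider, _model, _req, _in_tok, _out_tok, cost in ROWS:
    by_provider[provider] += cost
total = sum(by_provider.values())

for provider, cost in by_provider.items():
    print(f"{provider:<10} ${cost:.2f}  {100 * cost / total:.1f}%")
print(f"{'Total':<10} ${total:.2f}")

# Output tokens per request: the quantity behind the two 05-16 anomalies.
out_per_req = {
    (date, model): out_tok / req
    for date, _provider, model, req, _in_tok, out_tok, _cost in ROWS
}
```

Keeping the rows in one structure makes it cheap to re-run the check whenever a new billing export arrives.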
🔺 GPT-5.5 hidden-reasoning tax (05-16). Output tokens spiked to 105,593 on gpt-5.5 versus ~10k for gpt-4o on the same task: ~263 tok/req versus ~26, for an 11.7× lift in cost per call. The billing API counts reasoning tokens as output tokens. Budgets built on GPT-4o-era estimates underestimate GPT-5 cost by ~10×; as a working estimate, pre-budget ~$0.009 per call for any GPT-5 family member.
🔺 DeepSeek alias vs explicit pin (05-16). Switching from the floating deepseek-chat alias to the pinned deepseek-v4-flash produced roughly 25× more output tokens (255,540 / 398 ≈ 642 tok/req vs ~24–28 on the alias runs), with the same prompt template and the same dataset. The alias may have been routing to a terser snapshot, or the explicit pin may enable thinking/reasoning by default.
DeepSeek cache efficiency jumped 4× under the explicit pin (05-16). The cache hit rate improved from ~17–20% (alias runs) to 73% (50,432 cache-hit / 18,380 cache-miss input tokens). With cache hits priced 50× lower than misses, this partially offsets the output-bloat cost: even at 642 tok/req the per-call cost was $0.000187, still about 4× cheaper than gpt-4o and about 47× cheaper than gpt-5.5.
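The cache arithmetic in this note can be worked through explicitly. The sketch below assumes that the "50× hit/miss price ratio" means a cache-miss input token costs 50 times a cache-hit token. The function name and this reading are assumptions, and no actual per-token prices are used.

```python
def effective_input_cost_fraction(
    hit_tokens: int, miss_tokens: int, miss_to_hit_price_ratio: float
) -> float:
    """Input cost relative to a fully uncached run.

    Misses are charged at full price (1.0); hits are charged at
    1 / miss_to_hit_price_ratio of that price.
    """
    hit_rate = hit_tokens / (hit_tokens + miss_tokens)
    return (1 - hit_rate) + hit_rate / miss_to_hit_price_ratio


# Token counts and cost from the 05-16 explicit-pin run.
hit_rate = 50_432 / (50_432 + 18_380)
fraction = effective_input_cost_fraction(50_432, 18_380, 50.0)
per_call_cost = 0.0743 / 398

print(f"cache hit rate:        {hit_rate:.1%}")
print(f"input cost vs no cache: {fraction:.1%}")
print(f"cost per call:          ${per_call_cost:.6f}")
```

Under that assumption the input side of the bill drops to a bit under a third of its uncached price, which is why the run stays cheap despite the much longer outputs.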
Cost rule of thumb · per API call: one GPT-5 flagship call costs roughly 11× a GPT-4o call, 4× a Sonnet call, 50× a DeepSeek-flash call (explicit pin), and 330× a DeepSeek-flash call (alias). Anthropic pricing is the most predictable: all three full Sonnet runs (4.5 ×2, 4.6 ×1) landed within $0.84–$0.86, with no reasoning surprise and no caching discount. For providers with floating aliases, the billing CSV is the ground truth: the alias name does not reliably predict price or output-length behaviour.
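The rule of thumb can be reproduced from the per-day table by dividing each model family's cost by its request count and comparing against the gpt-5.5 run. In this sketch the grouping of Sonnet 4.5 and 4.6 into one "sonnet" family, the exclusion of smoke runs, and the dictionary keys are choices made for illustration.

```python
# (cost_usd, requests) for the full runs of each model family.
RUNS = {
    "gpt-5.5": [(3.5666, 401)],
    "gpt-4o": [(0.2902, 397), (0.3076, 397)],
    "sonnet": [(0.8400, 397), (0.8500, 397), (0.8600, 397)],
    "deepseek-flash (explicit pin)": [(0.0743, 398)],
    "deepseek-flash (alias)": [(0.0105, 397), (0.0107, 396)],
}


def cost_per_call(runs: list[tuple[float, int]]) -> float:
    """Total cost divided by total requests across a family's runs."""
    return sum(cost for cost, _ in runs) / sum(req for _, req in runs)


per_call = {family: cost_per_call(runs) for family, runs in RUNS.items()}

# How many calls of each family one gpt-5.5 call would pay for.
ratios = {family: per_call["gpt-5.5"] / cost for family, cost in per_call.items()}

for family, ratio in ratios.items():
    print(f"{family:<32} ${per_call[family]:.6f}/call  gpt-5.5 is {ratio:.1f}x")
```

Pooling runs before dividing weights each request equally, so a family with more runs is not over-counted.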
Codebase

Codename Reference

Northern Thai cultural terms as software identifiers

Internally, the codebase uses Northern Thai cultural terms as codenames for its components. The 28-item system maps one-to-one to software modules. The codenames are both a stylistic choice and a way of carrying the culture into the engineering layer of the project.

The monorepo is structured as four uv-workspace Python packages: lanna_khuang (data), lanna_kuafai (adaptation), lanna_jorfa (evaluation), and oob_meta (shared utilities).

Lanna Codenames (v2) — each entry shows the literal object, its cultural role, and how the codename maps to a module or concept in the codebase.

Human Evaluation

BaiLan — Human Rating Instrument

Gamified NTD → STD translation evaluation form

BaiLan is the human-rating instrument for LannaBench. Raters evaluate model translations on six axes (1–5 Likert scale each) in a card-by-card quiz flow with streak counters, bilingual Thai/English content, and Lanna temple aesthetics (gold/cream/teak palette).

The form shows one prompt output per card, with cards split into natural-context-dependent (NCD) and natural-context-independent (NCI) groups. It is a self-contained HTML file that submits via Google Apps Script, with a JSONL backup download as a fallback.
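For anyone processing the JSONL backup downstream, a validation pass along the following lines may be useful. The actual record schema is not published here, so every field name in this sketch ("group", "ratings", "card_id") and the axis names are hypothetical placeholders. Only the six-axis, 1–5 Likert structure and the NCD/NCI split come from the description above.

```python
import json

# Placeholder axis names: the form has six axes, but they are not named here.
AXES = tuple(f"axis_{i}" for i in range(1, 7))
GROUPS = {"NCD", "NCI"}


def validate_rating(record: dict) -> dict:
    """Check one rating record against the assumed schema, or raise ValueError."""
    if record.get("group") not in GROUPS:
        raise ValueError(f"unknown group: {record.get('group')!r}")
    ratings = record.get("ratings")
    if not isinstance(ratings, dict) or set(ratings) != set(AXES):
        raise ValueError("ratings must cover exactly the six axes")
    for axis, value in ratings.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{axis} must be an integer from 1 to 5, got {value!r}")
    return record


def parse_backup(jsonl_text: str) -> list[dict]:
    """Parse a JSONL backup: one JSON object per non-empty line."""
    return [
        validate_rating(json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


# Hypothetical example record, serialised the way a backup line might look.
example = {"card_id": "card-001", "group": "NCD", "ratings": {axis: 4 for axis in AXES}}
records = parse_backup(json.dumps(example, ensure_ascii=False) + "\n")
```

Validating on load keeps malformed or partially submitted cards out of the aggregate statistics.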

The full code repository will be opened once the project is ready to publish.
An ongoing benchmark and adaptation study for Kham Mueang (Northern Thai, the Lanna dialect). Glottolog nort2740 · ISO 639-3 nod