Research Benchmark

Lanna
Bench

Evaluating large language models on Northern Thai ↔ Standard Thai translation — with multi-reference scoring, error typology, and human ratings.

MIT License · In Progress · NTD ↔ STD · LoRA v2 · +20% ChrF · 9 Models

Best adapter · Δ ChrF

+20%

over off-the-shelf base · 2026-05-18

Best benchmark score

72.39

DeepSeek v4-flash · chrf_avg · v2 prompts

4-day API campaign

$6.81

3 providers · ~3,990 API calls

Background

Overview

Why Northern Thai — and why now?

This page is the public companion site for LannaBench — a research benchmark evaluating how well frontier and open-weight LLMs handle translation between Northern Thai and Standard Thai in both directions. It pairs automated scoring (a multi-reference Triple-ChrF metric) with human ratings.
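How a multi-reference character n-gram score can be computed is sketched below. This is a simplified sentence-level chrF (character n-grams up to order 6, β = 2) averaged over several references. It is an illustration only: the exact definition of Triple-ChrF is not given on this page, averaging over references is an assumption suggested by the chrf_avg label, and the function names are hypothetical. Scores from a reference implementation such as sacreBLEU will differ in detail.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level chrF on a 0-100 scale (not the sacreBLEU variant)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


def chrf_avg(hypothesis: str, references: list[str]) -> float:
    """Assumed aggregation: mean of per-reference chrF scores."""
    return sum(chrf(hypothesis, ref) for ref in references) / len(references)


if __name__ == "__main__":
    # Made-up strings for illustration, not benchmark data.
    print(chrf_avg("กินข้าวแล้วหรือยัง", ["กินข้าวแล้วหรือยัง", "กินข้าวหรือยัง", "ทานข้าวแล้วหรือยัง"]))
```

With β = 2 the score weights recall more heavily than precision, so a hypothesis that drops reference material is penalised more than one that adds extra characters.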

Northern Thai (Lanna / ᨣᩴᩤᨾᩮᩬᩥᨦ, ISO 639-3 nod, Glottolog nort2740) is spoken across eight provinces of Northern Thailand by millions of speakers. Despite this, it is severely under-represented in modern NLP benchmarks: the models that handle Standard Thai with ease routinely stumble on Northern Thai vocabulary, particles, and tonal patterns.

The study uses a 3-prompt robustness protocol on a held-out slice of naturalistic conversational text, evaluating 9 models including Typhoon2, SeaLLM, LLaMA-3.1-8B, Qwen2.5, and a three-provider API cohort (OpenAI / Anthropic / DeepSeek). LoRA fine-tuning is scoped to Typhoon2 using QLoRA NF4 4-bit via a custom Anglo finetuner.

Adapter Iterations

LoRA Adapter Improvements

NTD → STD · held-out conversational text · Δ ChrF over base model

Base model · off-the-shelf

A capable general-purpose Thai LLM, used as-is. Handles Standard Thai well; stumbles on Northern Thai vocabulary, particles, and tone.

Adapter v1 · 2026-05-08

+13%

Lightweight adaptation on the natural-conversation slice of the corpus — the model begins to hear Northern Thai as Northern Thai.

Adapter v2 · 2026-05-18 Current

+20%

Retrained on a refined dataset with a tuned recipe. This version recovers earlier failure cases such as repetition loops and dropped dialect particles. Available on Hugging Face at aminomewza/typhoon2-lora-v2.

Benchmark

Multimodel Benchmark — API Cohort

v1 → v2 prompt template comparison · all scores are chrf_avg · v2 redesigned with clearer NTD boundary marking and more specific imperative instruction

Model	v1	v2	Δ
gpt-4o	70.94	70.18	−0.75
claude-sonnet	67.34	68.22	+0.88
deepseek (v4-flash, alias)	68.94	72.39	⭐ +3.45 ← leader
typhoon-s-thaillm	47.34	51.34	+4.01
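The Δ column can be re-derived from the two score columns, as in the short sketch below. Recomputing from the rounded scores as printed gives −0.76 for gpt-4o and +4.00 for typhoon-s-thaillm, against the listed −0.75 and +4.01. A plausible explanation, which is an assumption and not confirmed by the source, is that the published Δ values were computed from unrounded scores.

```python
# chrf_avg scores copied from the table above: (v1 prompts, v2 prompts)
scores = {
    "gpt-4o": (70.94, 70.18),
    "claude-sonnet": (67.34, 68.22),
    "deepseek (v4-flash, alias)": (68.94, 72.39),
    "typhoon-s-thaillm": (47.34, 51.34),
}

# Delta recomputed from the rounded, printed scores
deltas = {model: round(v2 - v1, 2) for model, (v1, v2) in scores.items()}

# Model with the highest v2 score
leader = max(scores, key=lambda model: scores[model][1])

if __name__ == "__main__":
    for model, delta in deltas.items():
        print(f"{model:28s} {delta:+.2f}")
    print("leader on v2 prompts:", leader)
```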

Cost Analysis

API Cost Record

2026-05-13 → 2026-05-16 · 3 providers · 7 inference runs · ~3,990 API calls total

Grand total

$6.81

4 days

OpenAI

$4.16

61% · gpt-4o + gpt-5.5 spike

Anthropic

$2.55

37% · Sonnet 4.5/4.6 · ~$0.85/run flat

DeepSeek

$0.10

1.4% · alias + explicit pin

Per-day, per-model breakdown

Date	Provider	Model	Req	In tok	Out tok	Cost
05-13	OpenAI	gpt-4o-2024-11-20	397	74,639	10,359	$0.2902
	Anthropic	claude-sonnet-4-5	397	159,354	23,807	$0.8400
	DeepSeek	deepseek-v4-flash	397	65,337	11,221	$0.0105
05-15	OpenAI	gpt-4o-2024-11-20	397	79,270	10,018	$0.3076
	OpenAI	gpt-5.5-2026-04-23 (smoke ×2)	2	364	248	~$0.00
	Anthropic	claude-sonnet-4-5	397	168,550	22,501	$0.8500
	Anthropic	claude-sonnet-4-6 (smoke ×1)	1	396	10	~$0.00
	DeepSeek	deepseek-v4-flash	396	68,482	9,423	$0.0107
05-16	OpenAI	gpt-5.5-2026-04-23 🔺	401	79,768	105,593	$3.5666
	Anthropic	claude-sonnet-4-6	397	168,946	23,115	$0.8600
	DeepSeek	deepseek-v4-flash (explicit pin) 🔺	398	68,812	255,540	$0.0743
Grand total						$6.8100
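The per-provider totals and shares quoted above can be cross-checked against the table rows. The sketch below sums the Cost column by provider; treating the "~$0.00" smoke runs as exactly zero is an assumption.

```python
from collections import defaultdict

# (provider, cost in USD) for each table row; smoke runs listed as ~$0.00 are taken as 0.0
rows = [
    ("OpenAI", 0.2902), ("Anthropic", 0.8400), ("DeepSeek", 0.0105),   # 05-13
    ("OpenAI", 0.3076), ("OpenAI", 0.0), ("Anthropic", 0.8500),        # 05-15
    ("Anthropic", 0.0), ("DeepSeek", 0.0107),                          # 05-15 (cont.)
    ("OpenAI", 3.5666), ("Anthropic", 0.8600), ("DeepSeek", 0.0743),   # 05-16
]

totals = defaultdict(float)
for provider, cost in rows:
    totals[provider] += cost

grand_total = sum(totals.values())
shares = {provider: 100 * cost / grand_total for provider, cost in totals.items()}

if __name__ == "__main__":
    for provider in totals:
        print(f"{provider:10s} ${totals[provider]:.2f}  {shares[provider]:.1f}%")
    print(f"Grand total ${grand_total:.2f}")
```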

🔺 GPT-5.5 hidden-reasoning tax (05-16). Output tokens spiked to 105,593 on gpt-5.5 vs ~10k for gpt-4o on the same task: ~263 output tok/req vs ~26, which translates into an 11.7× lift in per-call cost. The billing API counts reasoning tokens as output. Budget planning based on GPT-4o-era estimates underestimates GPT-5 cost by ~10×; pre-budget at ~$0.009 per call for any GPT-5 family member.

🔺 DeepSeek alias vs explicit pin (05-16). Switching from the floating deepseek-chat alias to the pinned deepseek-v4-flash produced 26× more output tokens (255,540 / 398 = 642 tok/req vs ~24 on alias runs). Same prompt template, same dataset. The alias may have been routing to a terser snapshot, or the explicit pin enables thinking/reasoning by default.
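The tokens-per-request figures in the two notes above follow directly from the Req and Out tok columns. A minimal sketch, using four rows from the table (the dictionary keys are labels chosen here for readability):

```python
# (requests, output tokens) copied from the per-day table
runs = {
    "gpt-4o (05-13)": (397, 10_359),
    "gpt-5.5 (05-16)": (401, 105_593),
    "deepseek alias (05-15)": (396, 9_423),
    "deepseek pinned (05-16)": (398, 255_540),
}

# Average output tokens per request for each run
out_per_req = {name: out_tok / req for name, (req, out_tok) in runs.items()}

# How much longer the pinned DeepSeek outputs are than the alias outputs
pin_vs_alias = out_per_req["deepseek pinned (05-16)"] / out_per_req["deepseek alias (05-15)"]

if __name__ == "__main__":
    for name, value in out_per_req.items():
        print(f"{name:26s} {value:7.1f} out tok/req")
    print(f"pinned vs alias: {pin_vs_alias:.1f}x")
```

The exact multiple depends on which alias run is taken as the baseline, since the 05-13 alias run produced somewhat longer outputs than the 05-15 one.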

DeepSeek cache efficiency jumped 4× under the explicit pin (05-16). Cache hit rate improved from ~17–20% (alias runs) to 73% (50,432 cache-hit / 18,380 cache-miss input tokens). At a 50× hit/miss price ratio, this partially offsets the output-bloat cost: even at 642 tok/req the per-call cost was $0.000187, still 47× cheaper than gpt-5.5 and about 4× cheaper than gpt-4o.
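The cache figures can be reproduced from the hit and miss token counts. The sketch below also estimates how much the cache reduces input cost, taking the 50× hit/miss price ratio from the text at face value; actual provider pricing is not restated here.

```python
cache_hit_tokens = 50_432
cache_miss_tokens = 18_380
hit_miss_price_ratio = 50  # a cache-hit token is assumed to cost 1/50 of a cache-miss token

input_tokens = cache_hit_tokens + cache_miss_tokens
hit_rate = cache_hit_tokens / input_tokens

# Input cost relative to a run with no cache hits at all (1.0 = no saving)
relative_input_cost = (
    cache_miss_tokens + cache_hit_tokens / hit_miss_price_ratio
) / input_tokens

if __name__ == "__main__":
    print(f"input tokens:        {input_tokens}")
    print(f"cache hit rate:      {hit_rate:.1%}")
    print(f"relative input cost: {relative_input_cost:.1%} of uncached")
```

The hit and miss counts add up to the 68,812 input tokens reported for the pinned run, which is a useful sanity check on the billing export.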

Cost rule of thumb · per API call: one GPT-5 flagship call costs about as much as 11 GPT-4o calls, 4 Sonnet calls, 50 DeepSeek-flash (explicit pin) calls, or 330 DeepSeek-flash (alias) calls. Anthropic pricing is the most predictable: all three full Sonnet runs (4.5 ×2, 4.6 ×1) landed within $0.84–$0.86, with no reasoning surprise and no caching discount. For provider lines that use a floating alias, the billing CSV is the ground truth; the alias name does not reliably predict price or output-length behaviour.
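The rule-of-thumb multiples can be derived from per-call costs in the breakdown table. The sketch below pools the full runs of each model line (smoke runs excluded) and expresses the gpt-5.5 per-call cost as a multiple of each of the others; pooling runs this way is a choice made here, so the results match the rounded multiples above only approximately.

```python
# (total cost in USD, total requests) pooled over the full runs of each model line
pooled = {
    "gpt-5.5": (3.5666, 401),
    "gpt-4o": (0.2902 + 0.3076, 397 + 397),
    "sonnet": (0.8400 + 0.8500 + 0.8600, 397 * 3),
    "deepseek-flash (explicit pin)": (0.0743, 398),
    "deepseek-flash (alias)": (0.0105 + 0.0107, 397 + 396),
}

per_call = {name: cost / requests for name, (cost, requests) in pooled.items()}

# How many calls of each model cost the same as one gpt-5.5 call
multiples = {
    name: per_call["gpt-5.5"] / cost
    for name, cost in per_call.items()
    if name != "gpt-5.5"
}

if __name__ == "__main__":
    print(f"gpt-5.5 per call: ${per_call['gpt-5.5']:.4f}")
    for name, multiple in multiples.items():
        print(f"1 gpt-5.5 call ~ {multiple:6.1f} x {name}")
```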

Codebase

Codename Reference

Northern Thai cultural terms as software identifiers

Internally, the codebase uses Northern Thai cultural terms as codenames for its components. The 28-item system maps one-to-one to software modules. The codenames are both a stylistic choice and a way of carrying the culture into the engineering layer of the project.

The monorepo is structured as four uv-workspace Python packages: lanna_khuang (data), lanna_kuafai (adaptation), lanna_jorfa (evaluation), and oob_meta (shared utilities).

→ Lanna Codenames (v2) — each entry shows the literal object, its cultural role, and how the codename maps to a module or concept in the codebase.

Human Evaluation

BaiLan — Human Rating Instrument

Gamified NTD → STD translation evaluation form

BaiLan is the human-rating instrument for LannaBench. Raters evaluate model translations on six axes (1–5 Likert scale each) in a card-by-card quiz flow with streak counters, bilingual Thai/English content, and Lanna temple aesthetics (gold/cream/teak palette).

The form shows one prompt output per card, with cards split into natural context-dependent (NCD) and natural context-independent (NCI) items. It is a self-contained HTML file that submits via Google Apps Script, with a JSONL backup download as a fallback.
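As an illustration of how the JSONL backup could be consumed downstream, the sketch below parses one record per line and checks that each record carries six ratings in the 1–5 range. The field names (card_id, split, ratings) and the axis names axis_1 to axis_6 are hypothetical placeholders; the actual BaiLan schema is not published on this page.

```python
import json

# Hypothetical axis names; the real six axes are not listed on this page.
AXES = [f"axis_{i}" for i in range(1, 7)]
SPLITS = {"NCD", "NCI"}


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def is_valid(record: dict) -> bool:
    """Check the split label and that all six axes hold an integer from 1 to 5."""
    ratings = record.get("ratings", {})
    return (
        record.get("split") in SPLITS
        and set(ratings) == set(AXES)
        and all(type(v) is int and 1 <= v <= 5 for v in ratings.values())
    )


if __name__ == "__main__":
    # Made-up records for illustration only.
    good = {"card_id": 1, "split": "NCD", "ratings": {axis: 4 for axis in AXES}}
    bad = {"card_id": 2, "split": "NCI", "ratings": {axis: 6 for axis in AXES}}
    backup = "\n".join(json.dumps(r) for r in (good, bad))
    for record in parse_jsonl(backup):
        print(record["card_id"], "valid" if is_valid(record) else "invalid")
```

Validating on read keeps a partially filled or hand-edited backup file from silently entering the rating analysis.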

The full code repository will be opened once the project is ready to publish.