Best LLM for Armenian OCR: a small investigation

April 15, 2026 · 4 min read

If you're extracting Armenian text from images with an LLM, use gemini-3-flash-preview with temperature: 0. Every other model I tested (claude-haiku-4-5, claude-sonnet-4-6, gpt-5-mini, gpt-5.4-mini) has a categorical weakness that makes it unusable for stylized fonts or specific glyph pairs. And it turns out gemini-3-flash-preview at its default temperature (1.0) silently garbles Armenian ~22% of the time on exactly the same image it handles perfectly at temperature 0.

The setup

Our event pipeline ingests Instagram posts from Armenian event organizers. One post (a concert poster for "Hayk Petrosyan") got stored with a garbled media description, ՀԱՅԱ ՁԵՏՐՈՄՅԱՆ instead of ՀԱՅԿ ՊԵՏՐՈՍՅԱՆ, and I wondered if it was a one-off hallucination or a systematic problem. Answer: both. To check, I ran the same poster through each model repeatedly and scored every output with case-insensitive, whitespace-normalized token matching against 6 tokens on the poster (date, stylized title, performer name, subtitle, tickets-for label, venue name).
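The scoring described above could look roughly like the sketch below. It is a minimal illustration, not the pipeline's actual code: the function names are made up, the expected tokens are copied from the results table header, and treating a "match" as a substring hit after normalization is an assumption.

```python
import unicodedata

# The six expected tokens, copied from the results table header.
EXPECTED = {
    "date": "Ապրիլ",
    "title": "Մի ձայն",
    "name": "ՀԱՅԿ ՊԵՏՐՈՍՅԱՆ",
    "subtitle": "հեղինակային երգերի երեկո",
    "tickets": "Տոմսերի համար",
    "venue": "Ակումբ",
}

def normalize(text: str) -> str:
    """Case-fold and collapse every run of whitespace to a single space."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.casefold().split())

def score(transcription: str) -> dict[str, bool]:
    """Per token: does it appear in the model output after normalization?"""
    haystack = normalize(transcription)
    return {key: normalize(token) in haystack for key, token in EXPECTED.items()}

def all_six(transcription: str) -> bool:
    """True only when every expected token was found (the "All 6" column)."""
    return all(score(transcription).values())
```

Under this scheme the garbled description from the stored post, ՀԱՅԱ ՁԵՏՐՈՄՅԱՆ, fails the name token, while a correct read passes regardless of letter case or extra spaces and line breaks.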

Concert poster for Hayk Petrosyan — notice the stylized Armenian text (Մի ձայն) that every model I tested struggled with except for gemini. I was surprised gemini caught it — it would be hard to read this even for some humans.

The surprising finding: temperature fixes everything

First observation: the same poster image produces different garbled versions across runs — ԱՂՐԷԱ, ԱՃՐԷԱ, ԱՄՐԲԴ instead of ԱՊՐԻԼ. Each failure is different, but not random — they're visually-similar-glyph confusions (Պ↔Ղ, Ի↔Է, Ս↔Մ, Տ↔Ճ), the same mistakes a person would make on stylized fonts.
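One way to make the glyph confusions visible is a position-wise diff between the expected word and each garbled read. The sketch below is an illustration only (the helper name is hypothetical) and it assumes the garbled string has the same length as the expected one, which holds for the three examples above.

```python
from collections import Counter

def substitutions(expected: str, observed: str) -> list[tuple[str, str]]:
    """Return (expected_glyph, observed_glyph) pairs at positions that differ.

    Only meaningful when both strings have the same length.
    """
    if len(expected) != len(observed):
        raise ValueError("position-wise diff needs equal-length strings")
    return [(e, o) for e, o in zip(expected, observed) if e != o]

# The three garbled reads of ԱՊՐԻԼ quoted above.
garbled = ["ԱՂՐԷԱ", "ԱՃՐԷԱ", "ԱՄՐԲԴ"]

# Count how often each expected glyph was swapped for each wrong glyph.
tally = Counter(
    pair for read in garbled for pair in substitutions("ԱՊՐԻԼ", read)
)
```

For these three reads the tally shows Պ replaced by three different glyphs (Ղ, Ճ, Մ), while Ի→Է and Լ→Ա each occur twice: the errors vary from run to run but stay on the same three positions.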

The tell: every failure is different, but every success is identical and perfect. That's not how "the model can't read Armenian" looks — that would be the same wrong answer every time. It looked like sampling noise.

Which sent me back to a setting I'd honestly forgotten existed: temperature. gemini-3-flash-preview's default is 1.0 — high enough that on tokens where the model is uncertain (small stylized Armenian glyphs, in this case), it rolls the dice between visually-similar candidates instead of committing to its most likely read. Setting temperature: 0 collapses the decoder to its greedy answer.
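To see why a default of 1.0 matters, here is a toy model of temperature sampling over a single uncertain glyph. The logits are invented for illustration and were not measured from any model; real decoders sample over subword tokens, not individual glyphs.

```python
import math
import random

def sample(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick one candidate; temperature 0 is treated as greedy (argmax)."""
    if temperature == 0:
        return max(logits, key=logits.get)
    # Softmax with temperature: scale logits by 1/T before exponentiating.
    top = max(logits.values())  # subtract the max for numerical stability
    weights = [math.exp((value - top) / temperature) for value in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]

# Hypothetical logits: the model prefers Պ but is not certain.
glyph_logits = {"Պ": 2.0, "Ղ": 0.5, "Ճ": 0.2, "Մ": 0.0}

rng = random.Random(0)
greedy_reads = {sample(glyph_logits, 0, rng) for _ in range(1000)}
sampled_reads = [sample(glyph_logits, 1.0, rng) for _ in range(1000)]
error_rate = sum(read != "Պ" for read in sampled_reads) / len(sampled_reads)
```

With these made-up logits, softmax at temperature 1 puts about 66% of the probability on Պ, so roughly one draw in three lands on a lookalike, and the wrong answer differs from draw to draw. At temperature 0 every draw returns Պ.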

50/50 perfect runs, 100% on all 6 tokens, no garble.

One line of config. Months of "LLMs are just flaky" chalked up to a default we never reviewed.
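For reference, here is a minimal sketch of where that one line sits in a request body. It builds the JSON payload only and sends nothing. The field names follow the public Gemini REST `generateContent` format as I understand it, and the prompt is hypothetical since the original prompt is not shown here, so treat both as assumptions to check against the current API docs.

```python
import base64
import json

MODEL = "gemini-3-flash-preview"

# Hypothetical prompt, written for this example.
PROMPT = "Transcribe all Armenian text in this image exactly as written."

def build_request(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build a generateContent-style payload with greedy decoding."""
    return {
        "contents": [
            {
                "parts": [
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        }
                    },
                    {"text": PROMPT},
                ]
            }
        ],
        # The one line that matters: override the default temperature of 1.0.
        "generationConfig": {"temperature": 0},
    }

# Placeholder bytes stand in for the real image file contents.
body = json.dumps(build_request(b"\xff\xd8 placeholder image bytes"))
```

If you use an SDK instead of raw JSON, the same setting is normally exposed as a `temperature` field on the generation config object.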

The other surprise: newer ≠ better

Everything I'd read online said gpt-5.4-mini should blow gpt-5-mini out of the water — it's the newer model, and the benchmarks back that up. On this task, nope:

gpt-5-mini with reasoning: minimal — 3.4s latency, $0.81/1k calls, 90% critical pass
gpt-5.4-mini with reasoning: low — 8.8s latency, 5% critical pass

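At low reasoning, gpt-5.4-mini "hedges": it ignores the transcription instruction 70% of the time and returns just a visual description. You have to crank it up to medium to force it to commit, which costs about $17/1k calls (vs about $5/1k for gpt-5-mini at medium). At medium the newer model does score somewhat higher on the subtitle and venue tokens (70% and 50% vs 40% and 40%), but neither model ever gets all 6, so the older, cheaper model is the better deal here.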

Other notable findings

Neither OpenAI mini can read stylized Armenian cursive — 0/40 runs combined got the Մի ձայն handwritten title. They nail block text but go blind on decorative fonts. gemini-3-flash-preview gets it 100% at temp=0.
claude-haiku-4-5 ignored the transcription task entirely — returned only visual descriptions ("a man with a guitar"), 0/10.
claude-sonnet-4-6 tries hard but has systematic glyph confusions that gemini-3-flash-preview doesn't share: ՀԱՅԿ → ՀԱԿՈԲ (Hayk → Hakob), Մ → Ս.
gemini-3-flash-preview with mediaResolution: high (1120 tokens/image vs default) did not help — slightly hurt accuracy at n=20.
gemini-3-flash-preview with thinking_level: HIGH also did not help (76% all-6 vs 78% at LOW) and costs about 2.9× as much ($3.58 vs $1.23 per 1k calls).
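A quick way to judge which of these differences are real at these sample sizes is a confidence interval on each all-6 pass rate. The sketch below uses the Wilson score interval at 95%. This check is an addition for illustration, not part of the original analysis, and the success counts are back-computed from the reported percentages and n (78% of 50 = 39, 76% of 50 = 38, 70% of 20 = 14).

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

temp0_low = wilson_interval(50, 50)   # LOW thinking, temperature 0
temp1_low = wilson_interval(39, 50)   # LOW thinking, temperature 1
temp1_high = wilson_interval(38, 50)  # HIGH thinking, temperature 1
temp1_hires = wilson_interval(14, 20) # LOW thinking, hi-res, temperature 1
```

The intervals for LOW and HIGH thinking at temperature 1 overlap almost entirely, and the wide hi-res interval (n=20) overlaps both, so those settings are indistinguishable on this data. The temperature 0 interval sits entirely above the temperature 1 intervals, which is consistent with temperature being the setting that matters.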

Full results

| Model | Reasoning | Temp | n | Date (Ապրիլ) | Title (Մի ձայն) | Name (ՀԱՅԿ ՊԵՏՐՈՍՅԱՆ) | Subtitle (հեղինակային երգերի երեկո) | Tickets (Տոմսերի համար) | Venue (Ակումբ) | All 6 | Latency | Cost/1k calls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `gemini-3-flash-preview` | LOW | 0 | 50 | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 6.0s | $1.05 |
| `gemini-3-flash-preview` | LOW | 1 | 50 | 80% | 94% | 80% | 88% | 80% | 96% | 78% | 6.2s | $1.23 |
| `gemini-3-flash-preview` | HIGH | 1 | 50 | 90% | 90% | 90% | 76% | 90% | 90% | 76% | 9.1s | $3.58 |
| `gemini-3-flash-preview` | LOW · hi-res | 1 | 20 | 70% | 90% | 70% | 80% | 70% | 95% | 70% | 6.5s | $1.33 |
| `gpt-5-mini` | medium | — | 10 | 100% | 0% | 100% | 40% | 100% | 40% | 0% | 27.6s | $4.96 |
| `gpt-5-mini` | minimal | — | 20 | 90% | 0% | 100% | 10% | 85% | 15% | 0% | 3.4s | $0.81 |
| `gpt-5.4-mini` | medium | — | 10 | 100% | 0% | 100% | 70% | 100% | 50% | 0% | 28.6s | $17.21 |
| `gpt-5.4-mini` | low | — | 20 | 15% | 0% | 10% | 0% | 15% | 0% | 0% | 8.8s | $4.44 |
| `gpt-5.4-mini` | none | — | 10 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 2.3s | $1.74 |
| `claude-haiku-4-5` | — | — | 10 | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 1.4s | $2.84 |
| `claude-sonnet-4-6` | — | — | 10 | 100% | 0% | 0% | 90% | 80% | 50% | 0% | 8.9s | $11.07 |
