Single-Shot ASR Model Evaluation

Experiment

A single-take benchmark of every transcription model available in Handy (v0.8.1), a local speech-to-text tool for Linux. Each model transcribed the same test script in one continuous recording, with a deliberate 5-second silence in the middle to test voice activity detection (VAD) handling and hallucination resistance.

Test Script

I had scrambled eggs and toast for breakfast this morning. The coffee was a bit too strong but I drank it anyway. [5 second pause] The capital of France is Paris. It sits on the River Seine and has a population of about two million people in the city itself.

The two halves are deliberately unrelated (personal anecdote vs. factual statement). Any text bridging them during the pause would indicate hallucination.
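The bridging-text check can be automated. The following is a minimal sketch, not the method used for this benchmark: it aligns a transcript against the script at word level and reports any words inserted exactly where the pause falls. The function names and the normalisation (lowercasing, dropping punctuation) are assumptions made for illustration.

```python
import difflib
import re

FIRST_HALF = (
    "I had scrambled eggs and toast for breakfast this morning. "
    "The coffee was a bit too strong but I drank it anyway."
)
SECOND_HALF = (
    "The capital of France is Paris. It sits on the River Seine and has "
    "a population of about two million people in the city itself."
)


def words(text):
    # Lowercase and keep only word characters, so case and punctuation
    # differences are not mistaken for hallucinated content.
    return re.findall(r"[a-z0-9']+", text.lower())


def bridging_words(hypothesis):
    """Return words the transcript inserted at the pause position."""
    first, second = words(FIRST_HALF), words(SECOND_HALF)
    ref = first + second
    hyp = words(hypothesis)
    boundary = len(first)  # index in ref where the pause sits
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    bridged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only pure insertions located exactly at the boundary are counted.
        if tag == "insert" and i1 == boundary:
            bridged.extend(hyp[j1:j2])
    return bridged
```

An empty list means nothing was inserted at the pause. This sketch only catches pure insertions; a hallucination that also overwrites script words next to the pause would show up as a replacement and needs a separate check.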

Test Environment

Application	Handy 0.8.1
Inference backends	ONNX Runtime (auto) + Whisper.cpp (auto)
GPU	AMD Radeon RX 7800 XT (Navi 32, 12 GB VRAM)
CPU	12th Gen Intel Core i7-12700F
OS	Ubuntu 25.10, kernel 6.17.0-19-generic
Date	2026-03-29

Results

Rankings

#	Model	Inference time	RTF	Errors	Hallucination
1	Whisper Small	976 ms	0.07x	0	No
2	Parakeet V2	1,354 ms	0.09x	0	No
3	Canary 180M Flash	2,223 ms	0.17x	0	No
4	Moonshine Base	2,301 ms	0.15x	0	No
5	Parakeet V3 (INT8)	1,378 ms	0.10x	1	No
6	Whisper Turbo	1,112 ms	0.09x	2	No
7	Canary 1B v2	2,473 ms	0.17x	1	No
8	Moonshine Small Streaming	4,140 ms	0.33x	1	No
9	Moonshine Tiny Streaming	3,414 ms	0.25x	2	No
10	Whisper Medium	1,694 ms	0.13x	3	No
11	Whisper Large	2,780 ms	0.22x	3	No
12	Breeze ASR	2,626 ms	0.20x	3	No
13	SenseVoice (INT8)	145 ms	0.01x	3	No
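The RTF column is not defined above. Assuming it follows the usual definition of real-time factor (processing time divided by audio duration, so lower is faster), it can be sketched as below. The 14-second clip length in the usage note is a hypothetical value, since the recording length of each take is not stated here.

```python
def real_time_factor(inference_ms, audio_seconds):
    """Processing time divided by audio duration (assumed RTF definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return (inference_ms / 1000.0) / audio_seconds


def implied_audio_seconds(inference_ms, rtf):
    """Back-calculate the clip length from a reported time and RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return (inference_ms / 1000.0) / rtf
```

For example, `real_time_factor(976, 14.0)` rounds to 0.07, and `implied_audio_seconds(976, 0.07)` gives roughly 13.9 seconds. Because the reported RTF values are rounded to two decimals, the back-calculated length is only approximate.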

Charts

Figure: Speed vs accuracy scatter plot. Ideal models are in the bottom-left.

Key Findings

VAD & Hallucination

All 13 models handled the 5-second silence cleanly. No model hallucinated words, repeated phrases, or invented bridging text during the pause.

Bigger is not always better

Whisper Small (976 ms, 0 errors) outperformed both Whisper Medium (1,694 ms, 3 errors) and Whisper Large (2,780 ms, 3 errors) on this GPU. In this single take, the larger Whisper models were both slower and less accurate.

Common Error Patterns

Capitalisation: 7 models wrote "river Seine" instead of "River Seine"
Numerals: 5 models output "2 million" instead of "two million"
Punctuation: SenseVoice and Breeze ASR replaced sentence-ending periods with commas
Mishearing: Moonshine Tiny Streaming turned "drank it anyway" into "don't get any way"; SenseVoice heard "Seine" as "sand"
Dropped words: Canary 1B v2 lost "people in" from the final sentence
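Differences like these can be surfaced with a word-level diff. The sketch below is one possible approach, not necessarily how the errors above were counted: it keeps case and punctuation intact so that capitalisation and numeral changes appear as differences. The function name and the sample transcripts in the usage note are made up for illustration.

```python
import difflib


def word_diffs(reference, hypothesis):
    """List (tag, reference_words, hypothesis_words) for each difference."""
    ref, hyp = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # tag is "replace", "delete", or "insert"
            diffs.append((tag, " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return diffs


REFERENCE = (
    "It sits on the River Seine and has a population of about "
    "two million people in the city itself."
)
```

Calling `word_diffs(REFERENCE, "It sits on the river Seine and has a population of about 2 million people in the city itself.")` reports two replacements, `River` to `river` and `two` to `2`. A transcript that omits "people in" is reported as a single deletion.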

Recommendation

Whisper Small is the best overall choice for this hardware: it produced the fastest perfect transcription, in under 1 second. Parakeet V2 is a strong runner-up. For users who want correct proper noun capitalisation and spelled-out numbers, Canary 180M Flash is the most pedantically accurate, though slower.

Data

Raw benchmark data is available as transcription-benchmarks.json.

Source repository: danielrosehill/Handy-Ubuntu-Setup