Single-Shot ASR Model Evaluation

13 local speech-to-text models tested on AMD GPU with Handy

Experiment

A single-take benchmark of every transcription model available in Handy (v0.8.1), a local speech-to-text tool for Linux. Each model transcribed the same test script in one continuous recording, with a deliberate 5-second silence in the middle to test VAD handling and hallucination resistance.

Test Script

I had scrambled eggs and toast for breakfast this morning. The coffee was a bit too strong but I drank it anyway. [5 second pause] The capital of France is Paris. It sits on the River Seine and has a population of about two million people in the city itself.

The two halves are deliberately unrelated (personal anecdote vs. factual statement). Any text bridging them during the pause would indicate hallucination.
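One way to check for bridging text automatically is to align a transcript against the script at word level and look for words inserted exactly at the boundary between the two halves. The sketch below is illustrative only: the function names are hypothetical, and nothing here is taken from Handy or from the benchmark's own tooling.

```python
import re
from difflib import SequenceMatcher

FIRST_HALF = (
    "I had scrambled eggs and toast for breakfast this morning. "
    "The coffee was a bit too strong but I drank it anyway."
)
SECOND_HALF = (
    "The capital of France is Paris. It sits on the River Seine and has a "
    "population of about two million people in the city itself."
)


def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bridging_words(transcript):
    """Return words the transcript inserts between the two script halves."""
    first, second = words(FIRST_HALF), words(SECOND_HALF)
    reference = first + second
    hypothesis = words(transcript)
    boundary = len(first)  # index in `reference` where the pause falls

    matcher = SequenceMatcher(a=reference, b=hypothesis, autojunk=False)
    extra = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A pure insertion located exactly at the pause position.
        if tag == "insert" and i1 == boundary:
            extra.extend(hypothesis[j1:j2])
    return extra
```

A clean transcript returns an empty list. This simple version only catches pure insertions at the boundary; if a model also misrecognises the word just before or after the pause, the alignment reports a substitution instead and a more tolerant check would be needed.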

Test Environment

Application  Handy 0.8.1
Inference    ONNX Runtime (auto) + Whisper.cpp (auto)
GPU          AMD Radeon RX 7800 XT (Navi 32, 12 GB VRAM)
CPU          12th Gen Intel Core i7-12700F
OS           Ubuntu 25.10, kernel 6.17.0-19-generic
Date         2026-03-29

Results

Rankings

#   Model                      Inference  RTF    Errors  Hallucination
1   Whisper Small              976 ms     0.07x  0       No
2   Parakeet V2                1,354 ms   0.09x  0       No
3   Canary 180M Flash          2,223 ms   0.17x  0       No
4   Moonshine Base             2,301 ms   0.15x  0       No
5   Parakeet V3 (INT8)         1,378 ms   0.10x  1       No
6   Whisper Turbo              1,112 ms   0.09x  2       No
7   Canary 1B v2               2,473 ms   0.17x  1       No
8   Moonshine Small Streaming  4,140 ms   0.33x  1       No
9   Moonshine Tiny Streaming   3,414 ms   0.25x  2       No
10  Whisper Medium             1,694 ms   0.13x  3       No
11  Whisper Large              2,780 ms   0.22x  3       No
12  Breeze ASR                 2,626 ms   0.20x  3       No
13  SenseVoice (INT8)          145 ms     0.01x  3       No
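The RTF column is a real-time factor. The report does not say how Handy computes it, so the sketch below assumes the conventional definition: processing time divided by audio duration, where values below 1.0 mean faster than real time. The 14-second clip length in the usage note is a hypothetical figure, since the actual recording length is not stated.

```python
def real_time_factor(inference_ms, audio_seconds):
    """Return processing time divided by audio duration (assumed definition)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return (inference_ms / 1000.0) / audio_seconds


def implied_audio_seconds(inference_ms, rtf):
    """Back out the audio duration implied by a reported time and RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return (inference_ms / 1000.0) / rtf
```

For example, `real_time_factor(976, 14.0)` rounds to 0.07, in line with the Whisper Small row under the assumed clip length. Because the table rounds RTF to two decimals, durations recovered with `implied_audio_seconds` are only approximate.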

Charts

[Figure: bar charts of inference speed and transcription accuracy across all 13 models]
[Figure: scatter plot of the speed vs accuracy tradeoff; ideal models are in the bottom-left]

Key Findings

VAD & Hallucination

All 13 models handled the 5-second silence cleanly. No model hallucinated words, repeated phrases, or invented bridging text during the pause.

Bigger is not always better

Whisper Small (976 ms, 0 errors) outperformed both Whisper Medium (1,694 ms, 3 errors) and Whisper Large (2,780 ms, 3 errors) on this GPU. The larger Whisper models were slower and less accurate.

Common Error Patterns

Recommendation

Whisper Small is the best overall choice for this hardware: it produced the fastest error-free transcription, at under 1 second (976 ms). Parakeet V2 (1,354 ms, 0 errors) is a strong runner-up. For users who want correct proper noun capitalisation and spelled-out numbers, Canary 180M Flash is the most pedantically accurate, though slower at 2,223 ms.
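The "fastest error-free model" selection can be reproduced from the rankings table with a short filter. This is a minimal sketch using the inference times and error counts as reported above; the tuple layout and function name are illustrative and do not reflect the structure of transcription-benchmarks.json.

```python
# (model, inference time in ms, error count), copied from the rankings table.
RESULTS = [
    ("Whisper Small", 976, 0),
    ("Parakeet V2", 1354, 0),
    ("Canary 180M Flash", 2223, 0),
    ("Moonshine Base", 2301, 0),
    ("Parakeet V3 (INT8)", 1378, 1),
    ("Whisper Turbo", 1112, 2),
    ("Canary 1B v2", 2473, 1),
    ("Moonshine Small Streaming", 4140, 1),
    ("Moonshine Tiny Streaming", 3414, 2),
    ("Whisper Medium", 1694, 3),
    ("Whisper Large", 2780, 3),
    ("Breeze ASR", 2626, 3),
    ("SenseVoice (INT8)", 145, 3),
]


def fastest_perfect(results):
    """Return the quickest row with zero errors, or None if there is none."""
    perfect = [row for row in results if row[2] == 0]
    return min(perfect, key=lambda row: row[1]) if perfect else None


def fastest_overall(results):
    """Return the quickest row regardless of error count."""
    return min(results, key=lambda row: row[1])
```

Here `fastest_perfect(RESULTS)` selects Whisper Small, while `fastest_overall(RESULTS)` selects SenseVoice (INT8), which is much quicker but made 3 errors. Filtering on accuracy before comparing speed is what separates the two.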

Data

Raw benchmark data is available as transcription-benchmarks.json.

Source repository: danielrosehill/Handy-Ubuntu-Setup