Experiment
A single-take benchmark of every transcription model available in Handy (v0.8.1), a local speech-to-text tool for Linux. Each model transcribed the same test script in one continuous recording, with a deliberate 5-second silence in the middle to test VAD handling and hallucination resistance.
Test Script
I had scrambled eggs and toast for breakfast this morning. The coffee was a bit too strong but I drank it anyway. [5 second pause] The capital of France is Paris. It sits on the River Seine and has a population of about two million people in the city itself.
The two halves are deliberately unrelated (personal anecdote vs. factual statement). Any text bridging them during the pause would indicate hallucination.
Test Environment
| Application | Handy 0.8.1 |
| Inference | ONNX Runtime (auto) + Whisper.cpp (auto) |
| GPU | AMD Radeon RX 7800 XT (Navi 32, 12 GB VRAM) |
| CPU | 12th Gen Intel Core i7-12700F |
| OS | Ubuntu 25.10, kernel 6.17.0-19-generic |
| Date | 2026-03-29 |
Results
Rankings
| # | Model | Inference | RTF | Errors | Hallucination |
|---|---|---|---|---|---|
| 1 | Whisper Small | 976 ms | 0.07x | 0 | No |
| 2 | Parakeet V2 | 1,354 ms | 0.09x | 0 | No |
| 3 | Canary 180M Flash | 2,223 ms | 0.17x | 0 | No |
| 4 | Moonshine Base | 2,301 ms | 0.15x | 0 | No |
| 5 | Parakeet V3 (INT8) | 1,378 ms | 0.10x | 1 | No |
| 6 | Whisper Turbo | 1,112 ms | 0.09x | 2 | No |
| 7 | Canary 1B v2 | 2,473 ms | 0.17x | 1 | No |
| 8 | Moonshine Small Streaming | 4,140 ms | 0.33x | 1 | No |
| 9 | Moonshine Tiny Streaming | 3,414 ms | 0.25x | 2 | No |
| 10 | Whisper Medium | 1,694 ms | 0.13x | 3 | No |
| 11 | Whisper Large | 2,780 ms | 0.22x | 3 | No |
| 12 | Breeze ASR | 2,626 ms | 0.20x | 3 | No |
| 13 | SenseVoice (INT8) | 145 ms | 0.01x | 3 | No |
Charts
Key Findings
VAD & Hallucination
All 13 models handled the 5-second silence cleanly. No model hallucinated words, repeated phrases, or invented bridging text during the pause.
Bigger is not always better
Whisper Small (976 ms, 0 errors) outperformed both Whisper Medium (1,694 ms, 3 errors) and Whisper Large (2,780 ms, 3 errors) on this GPU. The larger Whisper models were slower and less accurate.
Common Error Patterns
- Capitalisation: 7 models wrote "river Seine" instead of "River Seine"
- Numerals: 5 models output "2 million" instead of "two million"
- Punctuation: SenseVoice and Breeze ASR replaced sentence-ending periods with commas
- Mishearing: Moonshine Tiny turned "drank it anyway" into "don't get any way"; SenseVoice heard "Seine" as "sand"
- Dropped words: Canary 1B v2 lost "people in" from the final sentence
Recommendation
Whisper Small is the best overall choice for this hardware — fastest perfect transcription at under 1 second. Parakeet V2 is a strong runner-up. For users who want correct proper noun capitalisation and spelled-out numbers, Canary 180M Flash is the most pedantically accurate, though slower.
Data
Raw benchmark data is available as transcription-benchmarks.json.
Source repository: danielrosehill/Handy-Ubuntu-Setup