I thought I was making dictation faster.
That was true, but it was not the interesting part. After two weeks, I inspected the folder and realized I had quietly built a private corpus of how I think out loud: audio on disk, raw transcripts beside it, and cleaned versions whenever post-processing ran.
Most dictation tools give you text and keep the rest of the value somewhere else. My setup gives me text and leaves the source material on my machine.
That changes the shape of the workflow.
The pipeline, plainly
The setup has three jobs.
I press a hotkey, talk, and release. Handy transcribes the audio locally. A local Qwen model cleans or restructures the transcript. The final text lands wherever my cursor is.
In the background, a watcher copies the audio and transcript into a folder I own:
```
mic -> Handy (VAD + Whisper) -> local Qwen cleanup -> paste into focused app
         \
          launchd watcher -> mirror.py -> /voice-corpus/raw/{fr,en,unknown}/
```
VAD means voice activity detection. It detects when I am speaking and skips silence. Whisper is the speech-to-text model. launchd is the macOS scheduler, basically the native way to run a small background job.
Each part is boring and swappable. That is why the system works.
Handy is the fast part
Handy is an open-source desktop app that turns a global shortcut into transcription.
In my setup, it uses Silero VAD to trim silence, Whisper large to transcribe, and the system paste action to put the result into the focused app. None of that needs a cloud API.
The immediate win is that long prompts get easier to write. When I type, I compress too early. When I talk, I can explain the context, the constraint, and the shape of the ask before I start editing myself.
That alone would be enough to keep using it.
The cleanup pass is local too
Raw transcripts are messy. They include false starts, fillers, repeated words, and the normal noise of thinking while speaking.
So I run a cleanup pass through mlx-community/Qwen2.5-14B-Instruct-4bit, served locally with mlx-lm on localhost:8080. Handy points its post-processing step at that local endpoint with provider set to custom.
The cleanup prompt has one load-bearing rule: never add content.
That sounds obvious, but it took tightening. Many cleanup prompts quietly rewrite for clarity, and "clarity" can become hallucination. If I say a messy sentence, the model can remove filler and fix obvious dictation artifacts, but it should not add an argument I did not make.
The other constraint is French/English switching. I often move between the two in the same thought. A cleanup model that translates or normalizes too aggressively makes the output less mine. It has to preserve the mixed language when the mixed language is what I said.
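For concreteness, the request Handy sends to that endpoint is a plain OpenAI-style chat completion. The sketch below is illustrative, not my production prompt: the model name and localhost:8080 match the setup above, but the prompt wording and the clean_transcript helper are stand-ins.

```python
# Illustrative sketch: Handy makes this call itself; this just shows its shape.
# Assumes mlx-lm is serving an OpenAI-compatible API on localhost:8080, e.g.
#   python -m mlx_lm.server --model mlx-community/Qwen2.5-14B-Instruct-4bit --port 8080
import json
import urllib.request

CLEANUP_SYSTEM_PROMPT = (
    "Clean up this dictated transcript. Remove filler words, false starts, and "
    "repeated words. Fix obvious dictation artifacts. Never add content or "
    "arguments the speaker did not make. Never translate: if the text mixes "
    "French and English, keep the mix exactly as spoken."
)

def clean_transcript(raw: str) -> str:
    payload = {
        "model": "mlx-community/Qwen2.5-14B-Instruct-4bit",
        "messages": [
            {"role": "system", "content": CLEANUP_SYSTEM_PROMPT},
            {"role": "user", "content": raw},
        ],
        "temperature": 0.0,
    }
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"].strip()
```

In the sketch, temperature is pinned to zero for the same reason the prompt is conservative: cleanup should be predictable, not creative.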
Two hotkeys changed the behavior
The first version had one hotkey and one job: clean the transcript.
After a few weeks, I noticed that most of my dictation was not going to humans. It was going into Claude, ChatGPT, Codex, or another model. A clean paragraph is fine for a human. A structured prompt is often better for a model.
So I split the workflow.
One caveat: vanilla Handy supports one global post-processing prompt. To make two hotkeys route to two different prompts, I use a small fork that accepts a --prompt-id flag and a --model flag, then route the hotkeys through Hammerspoon. This is not stock Handy behavior.
Mode 1 is Cleanup, bound to Option+Space. It removes filler, fixes some recurring dictionary mistakes, and keeps the output close to what I said. It runs on Qwen 14B.
Mode 2 is Promptify, bound to Option+Shift+Space. Same input, different job. It turns the dictation into a structured prompt: the ask first, then constraints, context, examples, and output format. It runs on Qwen 32B because the transform needs more room.
The split matters because it changes how I speak. If I hit the Promptify hotkey, I know I am talking to a model next, so I include the ask, constraints, and format while dictating. Half the prompt engineering happens before the frontier model sees anything.
There is a sharp failure mode. If I dictate a vague thought, Promptify must not invent a task. "I am wondering whether to use Postgres or SQLite" should not become a polished assignment if I have not asked for one. A lot of the prompt work went into teaching the model to pass through thinking when there is no real ask.
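To make the split concrete, here is a sketch of the two modes as data plus one call. The prompt text is compressed, not verbatim, the 32B model id simply follows the mlx-community naming of the 14B one, and complete() is the same kind of local chat call as in the cleanup sketch.

```python
# Illustrative sketch of the two-mode split. In practice the hotkeys go through
# Hammerspoon into a forked Handy with --prompt-id / --model; this only shows
# the routing logic, not stock Handy behavior.
import json
import urllib.request

MODES = {
    "cleanup": {
        "model": "mlx-community/Qwen2.5-14B-Instruct-4bit",
        "system": (
            "Remove filler and dictation artifacts. Stay close to what was said. "
            "Never add content. Preserve mixed French/English."
        ),
    },
    "promptify": {
        "model": "mlx-community/Qwen2.5-32B-Instruct-4bit",
        "system": (
            "Turn this dictation into a structured prompt: the ask first, then "
            "constraints, context, examples, and output format. If there is no "
            "actual ask, do not invent a task: return the cleaned thought as-is."
        ),
    },
}

def complete(model: str, system: str, user: str) -> str:
    # Same local endpoint as the cleanup sketch. Whether one server can serve
    # both model sizes depends on your setup; the fork picks the model per hotkey.
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "temperature": 0.0,
    }
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"].strip()

def route(mode: str, raw_transcript: str) -> str:
    cfg = MODES[mode]
    return complete(cfg["model"], cfg["system"], raw_transcript)
```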
The corpus was the surprise
Handy stores clips locally in its own app folder and keeps transcripts in a small SQLite database called history.db. SQLite is a database that lives in a single local file. The practical benefit is that the data is inspectable, but it is still app-shaped, not corpus-shaped.
So I wrote mirror.py. It copies new recordings into:
/Users/clement/projects/voice-corpus/raw/{fr,en,unknown}/
For every .wav, it writes a JSON sidecar with the same filename. A sidecar is just a metadata file that sits next to the source file. Mine stores the raw transcript, processed transcript if it exists, language, duration, timestamp, and Handy recording ID.
launchd watches the Handy recordings folder and triggers the mirror script within a few seconds of a new clip. The script tracks the last mirrored row in state/last_seen_id.txt, so it can be re-run without duplicating everything.
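Below is a condensed sketch of what mirror.py does. The history.db table and column names are stand-ins rather than Handy's actual schema, but the mechanics match the description above: read rows past the last-seen id, copy the wav, write the sidecar, update the state file.

```python
# Condensed sketch of mirror.py. Table/column names in history.db are assumed;
# adjust them to whatever Handy actually uses.
import json
import shutil
import sqlite3
from pathlib import Path

HANDY_DIR = Path.home() / "Library/Application Support/com.pais.handy"  # assumed location
CORPUS = Path("/Users/clement/projects/voice-corpus")
STATE = CORPUS / "state/last_seen_id.txt"

def last_seen() -> int:
    return int(STATE.read_text()) if STATE.exists() else 0

def mirror() -> None:
    conn = sqlite3.connect(HANDY_DIR / "history.db")
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT id, audio_path, transcript, post_processed, language, duration, created_at "
        "FROM recordings WHERE id > ? ORDER BY id",
        (last_seen(),),
    ).fetchall()

    newest = last_seen()
    for row in rows:
        lang = row["language"] if row["language"] in ("fr", "en") else "unknown"
        dest_dir = CORPUS / "raw" / lang
        dest_dir.mkdir(parents=True, exist_ok=True)

        # Copy the audio, then write a JSON sidecar with the same filename.
        wav_dest = dest_dir / f"{row['id']}.wav"
        shutil.copy2(row["audio_path"], wav_dest)
        sidecar = {
            "handy_id": row["id"],
            "raw_transcript": row["transcript"],
            "processed_transcript": row["post_processed"],
            "language": lang,
            "duration_s": row["duration"],
            "timestamp": row["created_at"],
        }
        wav_dest.with_suffix(".json").write_text(
            json.dumps(sidecar, ensure_ascii=False, indent=2)
        )
        newest = row["id"]

    STATE.parent.mkdir(parents=True, exist_ok=True)
    STATE.write_text(str(newest))

if __name__ == "__main__":
    mirror()
```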
I also have curate.py and upload.py sitting on top of the raw folder for future voice work. They filter clips into a cleaner set and batch them for ElevenLabs when I decide to train. I have not pushed that button yet. Today, the asset is the local corpus itself.
What is actually in there
Inspected on 2026-05-11, the corpus looked like this:
- 375 raw .wav clips.
- 375 matching .json sidecars.
- 192 minutes of audio.
- 355 MB on disk.
- 291 French clips, about 141 minutes.
- 79 English clips, about 52 minutes.
- 5 unknown-language clips.
- Date range: 2026-04-27 to 2026-05-11.
The number that matters depends on the future task.
For voice cloning or voice fine-tuning, the useful asset is audio plus transcript. I have that for almost every clip: 369 out of 375 have a raw transcript on disk.
For training my own cleanup or prompt style, the useful asset is a raw-to-cleaned pair. That slice is much smaller: 151 clips have a processed transcript, and only 97 are meaningfully different from the raw transcript.
That was a useful correction. "I have 192 minutes of audio" is true. "I have 192 minutes of training-ready pairs for cleanup style" is not. The corpus is real, but each downstream use has its own usable subset.
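Counting those subsets from the sidecars is a short script; the field names follow the mirror sketch above, so treat them as stand-ins for the real schema.

```python
# Count the per-task usable subsets from the sidecars.
import json
from pathlib import Path

CORPUS = Path("/Users/clement/projects/voice-corpus/raw")

sidecars = [json.loads(p.read_text()) for p in CORPUS.rglob("*.json")]

has_raw = [s for s in sidecars if s.get("raw_transcript")]
has_pair = [s for s in sidecars if s.get("processed_transcript")]
changed = [
    s for s in has_pair
    if s["processed_transcript"].strip() != (s.get("raw_transcript") or "").strip()
]

print(f"clips with a raw transcript:        {len(has_raw)}")
print(f"clips with a processed transcript:  {len(has_pair)}")
print(f"pairs where cleanup changed things: {len(changed)}")
```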
Latency, honestly
I re-checked this on 2026-05-12 against Handy's history.db, the mirrored sidecars, and the app debug log.
The database and sidecars store audio, transcript, processed text, and duration. They do not store latency. The useful source for latency was ~/Library/Logs/com.pais.handy/handy.log, which had 37 complete post-processed runs between 2026-05-07 and 2026-05-11.
The number that matters is release-to-paste: the time from releasing the hotkey to text appearing in the focused field.
| Mode | Samples | Median release-to-paste | Median Whisper | Median local LLM pass | P90 release-to-paste |
|---|---|---|---|---|---|
| Cleanup, Qwen 14B | 23 | 3.9s | 1.0s | ~2.0s | 5.8s |
| Promptify, Qwen 32B | 14 | 10.7s | 2.4s | ~8.0s | 16.8s |
The local LLM pass is approximate because Handy logs those timestamps at second-level precision. Whisper and paste timings are exact enough for this purpose.
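For reference, the aggregation behind those columns is just a per-mode median and P90 over the release-to-paste deltas. The per-run durations come from pairing timestamps in handy.log; that parsing is specific to Handy's log format, so the sketch below starts from already-extracted numbers.

```python
# Summarize release-to-paste latency per mode from per-run durations (seconds),
# e.g. runs = [("cleanup", 3.9), ("promptify", 10.7), ...] extracted from handy.log.
from collections import defaultdict
from statistics import median, quantiles

def summarize(runs: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    by_mode: dict[str, list[float]] = defaultdict(list)
    for mode, seconds in runs:
        by_mode[mode].append(seconds)
    return {
        mode: {
            "median": median(vals),
            "p90": quantiles(vals, n=10)[-1],  # needs at least two runs per mode
        }
        for mode, vals in by_mode.items()
    }
```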
The tradeoff is clear. Cleanup feels like normal dictation. Promptify does not feel like typing. It is a heavier prompt-building action, and that is fine because the output is more valuable.
What does not work yet
No streaming. Handy is press-and-release, and continuous dictation is a different design.
No diarization. Diarization means separating speakers. That makes this useless for meetings today, which is fine because the workflow is solo dictation.
No cross-device corpus. Everything is local by design. My phone is a separate problem.
The corpus is not training-ready. It needs curation, deduplication, quality checks, and probably manual rejection before I should trust it for any serious fine-tune. Having audio is not the same as having a dataset.
Proper nouns are still fragile. If Whisper hears a name wrong, the cleanup model should not invent the correct one unless the dictionary rule is explicit. I prefer a faithful wrong transcript to a confident invented correction.
What I would copy
For the productivity half, install Handy, use a strong Whisper setting, and add local post-processing only after plain transcription is working. The cleanup prompt should be conservative. The load-bearing rule is: never add content.
For the corpus half, mirror recordings out of the app folder into a project you own. Put one metadata sidecar next to each audio file. Track language, timestamp, raw transcript, processed transcript, duration, and source ID. Start earlier than you think you need to, because the corpus only becomes useful after it quietly grows.
I would not start with two hotkeys or a custom fork. I would start with one reliable cleanup mode and a mirror script. Add Promptify only when you notice that most of your dictation is going into other models.
The bigger point is the same as my local-first setup: rent the model when it helps, but keep the substrate. The productivity gain is the headline. The corpus is the part that compounds.