Fine-Tuning Whisper for My Voice: A Hobbyist's Journey
You know that person who can't watch Netflix without closed captions on? Not because they're hard of hearing - they just like it that way. They'll pause the movie to read a mumbled line, get annoyed when captions are out of sync, and insist subtitles make everything better.
I'm that person. And I'm attempting to build my own speech recognition system.
Two audiences, one problem
I teach physics to ESL students in China. Every day I speak technical terms: Newton, momentum, joule, Coulomb. I watch my students struggle to parse my accent in real-time. Closed captions during lectures would help them enormously. But existing solutions either don't support custom models, or mangle physics terminology so badly that "joule" becomes "jewel" and "Coulomb" becomes "cool um."
This past weekend I posted in the Spokenly Discord about my interest in fine-tuning. Someone responded: a person with a health condition that affects their speech clarity. Standard voice recognition struggles with their voice because of physical differences in how they speak. They wanted to know more about data collection, what models I was targeting, and whether I'd document the process.
That conversation stuck with me. For my students, better captions mean better comprehension. For people with accessibility needs, voice recognition isn't a convenience; it's infrastructure. The current state of "one model fits all" doesn't seem to serve either group well.
The goal
Fine-tune Whisper tiny (39M parameters) on my own voice:
- Improve accuracy on physics/CS terminology
- Handle my accent and speech patterns
- Deploy to FUTO Keyboard (Android) via GGUF format
- Build real-time captions for ESL students using WhisperKit on Mac
The tiny model matters because I want this running locally, in real-time, on my phone and during live lectures. The theory I'm testing: a smaller model fine-tuned on personal data might be both faster and more accurate than a larger generic model - at least for a specific voice. Whether that holds up remains to be seen.
Data collection: three apps, three formats
I'm collecting voice data from three sources, all with different output formats.
Spokenly (Mac)
Spokenly can use NVIDIA's Parakeet model (specifically parakeetTDT06V2); I intend to swap in my Whisper fine-tune later. The output is a nested JSON structure with timestamped segments:
{
  "transcriptionData": {
    "modelId": "parakeetTDT06V2",
    "segments": [
      {
        "start": 0,
        "end": 25.21,
        "text": "Wait, so then for the ChalkTalk classroom captioning..."
      }
    ]
  },
  "creationDate": 789902568.510609
}
Each dictation saves a paired WAV file. I wrote an archive script that runs via launchd 3x daily (4PM, 6PM, 9PM) to capture teaching sessions, pulling from ~/Library/Containers/app.spokenly/.
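The archive step itself doesn't need to be fancy. Here's a minimal sketch of the idea: copy each transcript JSON and its paired WAV into an archive folder, skipping anything already copied. The destination path and pairing-by-filename logic are my assumptions for illustration, not the exact script I run.

```python
import shutil
from pathlib import Path

# Assumed locations; adjust to your machine.
SRC = Path.home() / "Library/Containers/app.spokenly"
DEST = Path.home() / "VoiceArchive/spokenly"

def archive_pairs(src: Path, dest: Path) -> int:
    """Copy each transcript JSON and its paired WAV into the archive.

    Returns the number of pairs copied; files already archived on a
    previous run are skipped, so the script is safe to run 3x daily.
    """
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for json_file in src.rglob("*.json"):
        wav_file = json_file.with_suffix(".wav")
        if not wav_file.exists():
            continue  # unpaired transcript; leave it for the next run
        target = dest / json_file.name
        if target.exists():
            continue  # already archived
        shutil.copy2(json_file, target)
        shutil.copy2(wav_file, dest / wav_file.name)
        copied += 1
    return copied
```

launchd just calls `archive_pairs(SRC, DEST)` on the schedule; the idempotence means overlapping runs do no harm.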
Wispr Flow (Mac/Windows)
While I don't use it as often as Spokenly due to the free-tier limits, Wispr Flow captures more context - which app was active, the URL if it was a browser, VSCode telemetry, etc.:
{
  "asrText": "Why is this a black background?",
  "formattedText": "Why is this a black background?",
  "app": "com.microsoft.VSCode",
  "url": "unknown",
  "duration": 10.76,
  "numWords": 18
}
The archive script runs on both Mac (launchd) and Windows (Task Scheduler at 6PM and 8:30PM to catch my teaching window).
FUTO Keyboard (Android)
This one required actual work. FUTO Keyboard with voice input is open source, but it doesn't save transcripts and voice data by default. I forked it and added transcript logging.
The key file I created was TranscriptLogger.kt - a logging utility with a WAV writer. I modified AudioRecognizer.kt to pass audio samples through the callback, and added a settings toggle: "Save transcripts locally."
Now every voice dictation on my Samsung Galaxy S10 saves paired files:
{
  "transcriptId": "664ef41f-3382-40cc-8744-3e71bc3b863b",
  "asrText": "Testing, testing, one, two, three.",
  "formattedText": "Testing, testing, one, two, three.",
  "timestamp": "2026-01-12 08:34:48.542 +0800",
  "audioFile": "664ef41f-3382-40cc-8744-3e71bc3b863b.wav"
}
Files land at /sdcard/Android/data/org.futo.inputmethod.latin.unstable/files/FutoVoiceArchive/ and get pulled to Mac when I plug in via USB.
Classroom recordings (DJI Mic Mini)
My DJI Mic Mini clips to my shirt during lectures. Important settings:
- Noise cancellation: OFF (preserves acoustic details the model needs)
- Sample rate: 48 kHz (downsample later)
- Format: WAV lossless
The classroom recordings should be the most valuable: background noise from students, physics terminology in context, natural teaching cadence. This is what I want the model to learn. We'll see if that plays out.
The processing pipeline
Raw recordings need work before training. My pipeline has four stages:
1. Resample
Whisper expects 16 kHz, 16-bit, mono audio. The DJI records at 48 kHz.
ffmpeg -i input.wav -ar 16000 -ac 1 -sample_fmt s16 output.wav
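For batch conversion, a small helper that builds the same ffmpeg invocation per file keeps the arguments in one place. This is a sketch; the `_16k` output-naming convention is my assumption.

```python
from pathlib import Path

def resample_cmd(src: Path, dest_dir: Path) -> list[str]:
    """Build the ffmpeg command converting one file to 16 kHz/16-bit/mono."""
    out = dest_dir / (src.stem + "_16k.wav")
    return [
        "ffmpeg", "-y",          # overwrite any stale output
        "-i", str(src),
        "-ar", "16000",          # resample to 16 kHz
        "-ac", "1",              # downmix to mono
        "-sample_fmt", "s16",    # 16-bit PCM
        str(out),
    ]
```

Then the batch loop is just `subprocess.run(resample_cmd(wav, out_dir), check=True)` over every recording.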
2. LLM transcript correction
Original transcripts have errors - that's the whole point. But I need correct ground truth for training.
I built a Claude Code skill (/zu-asr-correct) that takes pending transcripts and fixes them:
- Fix physics terms: Newton (not newton), joule (not jewel), Coulomb
- Fix CS terms: algorithm, recursion, Python, GitHub
- Keep filler words (uh, um) - the model should learn these
- Remove stutters: "I-I-I" becomes "I"
Using an LLM for ground truth correction feels circular. The hope is that the LLM is better at this than the original ASR - at least for domain-specific terms. I haven't validated this at scale yet.
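The deterministic subset of these rules doesn't actually need an LLM: known term substitutions and stutter collapse can run as a regex pass before (or instead of) the LLM for the easy cases. A sketch, with an illustrative term list rather than my actual one:

```python
import re

# Known domain-term fixes; illustrative, extend as errors show up.
TERM_FIXES = {
    r"\bjewel(s?)\b": r"joule\1",
    r"\bcool um\b": "Coulomb",
    r"\bnewton\b": "Newton",
}

def precorrect(text: str) -> str:
    """Apply deterministic fixes: domain terms and stutter collapse.

    Filler words (uh, um) are deliberately left alone so the model
    still learns them.
    """
    for pattern, repl in TERM_FIXES.items():
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
    # Collapse stutters like "I-I-I" or "the-the" to a single token.
    text = re.sub(r"\b(\w+)(?:-\1)+\b", r"\1", text, flags=re.IGNORECASE)
    return text
```

The ambiguous cases - context-dependent homophones, sentence boundaries - still go to the LLM.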
3. Segment
Long recordings get split at sentence boundaries. Each clip needs to be under 30 seconds for Whisper training.
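Given timestamped sentences (like the Spokenly segments above), splitting can be a greedy pass that packs consecutive sentences into a clip until the next one would cross the 30-second limit. A sketch, assuming segments arrive as start/end/text dicts:

```python
MAX_CLIP_SECONDS = 30.0

def pack_segments(segments: list[dict]) -> list[dict]:
    """Greedily merge consecutive sentence segments into sub-30s clips.

    Each input segment is {"start": float, "end": float, "text": str}.
    A single sentence longer than the limit is emitted as its own clip.
    """
    clips: list[dict] = []
    current: dict | None = None
    for seg in segments:
        if current and seg["end"] - current["start"] <= MAX_CLIP_SECONDS:
            current["end"] = seg["end"]
            current["text"] += " " + seg["text"]
        else:
            current = dict(seg)  # start a new clip
            clips.append(current)
    return clips
```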
4. Manifest
Final step generates a HuggingFace-compatible dataset. Current stats at time of writing: 450 segments - about 87 minutes total (29% toward 5-hour goal).
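The manifest itself can be as simple as a JSON Lines file with one audio/text pair per row, which the HuggingFace `datasets` library can load via its JSON loader. A sketch; the `audio`/`text` field names are the common convention, not a requirement:

```python
import json
from pathlib import Path

def write_manifest(pairs: list[tuple[str, str]], out_path: Path) -> int:
    """Write (wav_path, transcript) pairs as JSON Lines.

    One {"audio": ..., "text": ...} object per line; returns line count.
    """
    with out_path.open("w", encoding="utf-8") as f:
        for wav_path, text in pairs:
            f.write(json.dumps({"audio": wav_path, "text": text},
                               ensure_ascii=False) + "\n")
    return len(pairs)
```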
I've since automated this entire pipeline with a launchd job that runs 30 minutes after each archive. I also added Groq's Whisper large-v3 for transcription - seems faster than LLM-based correction for the initial pass, though I'm still comparing quality and testing a hybrid approach. A simple dashboard (scripts/dashboard.py) shows progress toward training goals.
Problems encountered
The overnight clone saga
Building FUTO Keyboard requires cloning their GitLab repo with submodules. One submodule - res-large at 43MB - kept failing. I wrote a monitoring script that retried automatically. It took 15 attempts overnight to complete.
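The monitoring script boiled down to a retry loop around the clone command. A generic sketch of the idea (my actual script was specific to the submodule fetch):

```python
import subprocess
import time

def retry(cmd: list[str], max_attempts: int = 20, delay: float = 30.0) -> int:
    """Run a command until it exits 0, sleeping between attempts.

    Returns the number of attempts it took; raises if all fail.
    """
    for attempt in range(1, max_attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return attempt
        time.sleep(delay)
    raise RuntimeError(f"{cmd[0]} failed after {max_attempts} attempts")
```

Something like `retry(["git", "submodule", "update", "--init", "--recursive"])` left running overnight is what eventually got res-large through.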
The VPN conflict
I have an RTX 4080 Super at home running Folding@Home. I wanted to SSH in for training, but both Tailscale and my work VPN want the WinTun driver. They fight.
Solution: USB sneakernet. Process data on Mac, copy to drive, train overnight, copy model back. Low-tech but it works.
Spokenly AI proofreading
Spokenly supports AI post-processing via OpenRouter. I started with Gemini 2.0 Flash and even built a Context Daemon menu bar app to provide window context. Turns out Spokenly already had native OCR. I just didn't notice it because the docs are incomplete. Had to fiddle around with settings to discover the "Include active window contents" toggle was already there. One less moving part, and a lesson about exploring before building.
The main reason I'm interested in AI post-processing is formatting: expanding file paths, creating lists, converting spoken language to markdown. Useful for dictating into VSCode. Not useful for classroom captions; too much latency.
For the AI backend, I'm currently trying GPT-OSS 120B on Groq: free and fast. Still experimenting though. The main issue is prompt leakage: models keep answering the transcription directly instead of just cleaning up the text. Finding a model that follows instructions correctly is harder than expected.
Spokenly has native pre-AI and post-AI processing hooks (bash scripts) plus a separate field for the prompt itself. I'm using XML-structured prompts with a minimal intervention philosophy - fix obvious errors, preserve my voice. The setup is reproducible once I find a model that behaves. I've published my Spokenly configuration - prompts, scripts, and word replacements - on GitHub.
Training setup
Two machines split the work: my M1 MacBook Pro handles data collection, processing, and inference. The Windows PC with the RTX 4080 Super is for training only.
For the base model, I'm starting with FUTO English-39 ACFT rather than vanilla Whisper tiny - it's already been fine-tuned for English dictation, which should give me a better starting point.
What's working
Automated data collection: Once the archive scripts were set up, data accumulates without effort. One concern: I've noticed audio sometimes gets cut off at the beginning or end - could be FUTO Keyboard, Spokenly, or Wispr Flow. Not sure yet if that's fixable in settings or if it'll affect training quality. Might be fine, might not.
Cross-platform pipeline: Mac (Spokenly, Wispr), Windows (Wispr), Android (FUTO fork) - all feeding into one dataset.
LLM-based correction: The /zu-asr-correct skill handles most corrections automatically - though I haven't measured the actual percentage.
What's not working (yet)
Model quality: Haven't run a single training iteration yet. Still in the data collection phase - waiting until I have enough to feel comfortable running a first cycle.
FUTO deployment: Haven't gotten this far yet - need a trained model first. The plan is to export to GGUF format, but I expect some debugging.
Live captions in production: I forked OpenSuperWhisper to build ChalkTalkCaptions - a SwiftUI floating window with WhisperKit backend. Tried it with a small class today. Mixed results: some students found it distracting, others showed interest. The main issues were latency (too high) and accuracy (too low). Students preferred I turn it off.
The animation might be part of the problem - each word pops up and bubbles into the caption area, dynamically changing the shape. Too much happening at once. I need to experiment with different approaches: maybe chunking by sentence instead of word-by-word, or different load animations. Will look at how other captioning systems handle this.
What I'm betting on
Betting on natural use data. My hypothesis is that the Spokenly archives (just me dictating normally) will be more valuable than any script I could write, and more valuable than relying on AI post-processing. AI post-processing is expensive, adds latency, requires an online connection, and reduces privacy. A fine-tuned local model avoids all of that. Haven't tested this yet.
Local-first is possible. Whisper tiny runs fine on a phone. Fine-tuning on consumer hardware should work - I'm still in the process of testing this.
Forking is worth it. Adding transcript logging to FUTO Keyboard was intimidating; it was my first time forking an Android app. Now I get training data from every voice dictation on my phone.
What's next
- Submit PR to futo-org/android-keyboard for transcript logging feature - could help others collect their own datasets for personal fine-tuning, including people with speech differences who need voice recognition trained on their voice
- Accumulate more classroom recordings (target: 10+ hours)
- Complete first real training run with proper evaluation
- Finish ChalkTalkCaptions for live classroom use
Are you working on something similar? If you have accessibility needs for better voice recognition, or you're fine-tuning ASR for a specific domain, I'd be curious to hear about your approach.
Why I thought I could do this
I have a limited undergraduate degree in computer science. I've never trained a machine learning model before. By most reasonable measures, I probably shouldn't be attempting this.
But here I am.
The honest answer is tooling. Around Christmas, I bought myself the Claude Max plan at $140 CAD/month. It felt excessive at the time. Now it feels like the best purchase I've made in years. Claude Code with Opus 4.5 has been my pair programmer through this entire project: writing the transcript logging code for FUTO Keyboard, debugging Python pipelines, explaining HuggingFace dataset formats, building the archive scripts. Every time I hit a wall, I could work through it with an AI that actually understood what I was trying to do.
I'm not saying this to sell you on anything. I'm saying it because it's relevant context for why a hobbyist would even attempt fine-tuning an ASR model. The barrier to entry has dropped dramatically.
What I'm looking forward to: token costs coming down, and open source models catching up enough to plug into a harness like Claude Code. Right now this kind of workflow requires proprietary frontier models. When that changes, projects like this become accessible to people who can't afford $140/month subscriptions. That matters, especially for the accessibility use cases I care about.
For now, I'm using what's available and trying to document the process well enough that others can follow when the tools become cheaper.
Resources
HuggingFace Whisper Fine-Tuning Guide
FUTO Keyboard - open source Android voice keyboard
WhisperKit - Apple Silicon Whisper inference
OpenSuperWhisper - open source macOS transcription
Spokenly - Mac transcription with AI post-processing (uses Parakeet)
OpenRouter - unified API for multiple LLMs
Spokenly Dictation Config - my prompts, scripts, and word replacements