Speech to Text (Whisper)
Transcribe spoken audio or video to text or subtitles, fully local.
Whisper is installed. The first run with a given model size downloads that model's weights from OpenAI's servers (one-time, ~75 MB to ~3 GB depending on size); after that, transcription runs entirely on the local machine.
Model size guide (smaller models are faster but less accurate):
- tiny — ~75 MB, very fast, ok for clear English
- base — ~150 MB, recommended starting point
- small — ~500 MB, good multilingual quality
- medium — ~1.5 GB, near-best quality, slow on CPU
- large — ~3 GB, best quality, very slow on CPU
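As a rough illustration of how one of these sizes might be used to produce subtitles, the sketch below assumes the installed Whisper is the openai-whisper Python package (imported as whisper). The helper names srt_timestamp, segments_to_srt and transcribe_to_srt are hypothetical and exist only for this example, as does the file name in the usage note.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Turn Whisper-style segments (dicts with start, end, text) into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_size: str = "base") -> str:
    """Transcribe a file and return its subtitles as an SRT string."""
    # Assumption: the openai-whisper package is what is installed.
    # Imported lazily so the formatting helpers work without it.
    import whisper

    model = whisper.load_model(model_size)  # first use of a size downloads its weights
    result = model.transcribe(audio_path, fp16=False)  # fp16=False suits CPU-only runs
    return segments_to_srt(result["segments"])
```

Calling transcribe_to_srt("talk.mp3", "small") would return the subtitle text, which can then be written to a .srt file; result["text"] holds the plain transcript if subtitles are not needed.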
Without a GPU, expect processing to take roughly 0.5×–2× the audio's duration for tiny/base/small, and 5×–20× the audio's duration for medium/large. For example, a 10-minute audio file at medium on CPU can take 50 minutes or more.
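To make the arithmetic concrete, here is a minimal sketch that turns the CPU ranges quoted above into a time estimate. The table simply copies those rough figures; the names CPU_SLOWDOWN and estimate_cpu_minutes are hypothetical, and real speeds will vary with hardware.

```python
# Rough CPU slowdown ranges (processing time / audio duration),
# copied from the guide above. These are estimates, not measurements.
CPU_SLOWDOWN = {
    "tiny": (0.5, 2.0),
    "base": (0.5, 2.0),
    "small": (0.5, 2.0),
    "medium": (5.0, 20.0),
    "large": (5.0, 20.0),
}


def estimate_cpu_minutes(audio_minutes: float, model_size: str) -> tuple[float, float]:
    """Return the (best case, worst case) processing time in minutes on CPU."""
    low, high = CPU_SLOWDOWN[model_size]
    return audio_minutes * low, audio_minutes * high


best, worst = estimate_cpu_minutes(10, "medium")
print(f"10 min of audio at medium: about {best:.0f} to {worst:.0f} min on CPU")
```

An estimate like this can help decide whether to drop to a smaller model before starting a long file.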