Speech to Text (Whisper)

Transcribe spoken audio or video to text or subtitles, fully local.

Whisper is installed. The first run with a given model size downloads that model's weights from OpenAI's servers (one-time, ~75 MB to ~3 GB depending on size); later runs reuse the downloaded weights and work offline.

Model size guide (smaller = faster + lower quality):

tiny — ~75 MB, very fast, ok for clear English
base — ~150 MB, recommended starting point
small — ~500 MB, good multilingual quality
medium — ~1.5 GB, near-best quality, slow on CPU
large — ~3 GB, best quality, very slow on CPU

Without a GPU, expect processing to take roughly 0.5×–2× the duration of the audio for tiny/base/small, and 5×–20× the duration for medium/large. For example, a 10-minute audio file at medium on CPU can take 50 minutes or more.
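The arithmetic behind these estimates can be sketched in a few lines. This is only an illustration: the `CPU_SLOWDOWN` table and the `estimate_cpu_minutes` function are hypothetical names, and the multipliers are the rough ranges quoted above, not measurements.

```python
# Rough CPU slowdown ranges taken from the guide above, expressed as
# (low, high) multiples of the audio duration. Illustrative values only.
CPU_SLOWDOWN = {
    "tiny": (0.5, 2.0),
    "base": (0.5, 2.0),
    "small": (0.5, 2.0),
    "medium": (5.0, 20.0),
    "large": (5.0, 20.0),
}


def estimate_cpu_minutes(audio_minutes, model_size):
    """Return a (best case, worst case) processing time in minutes on CPU."""
    low, high = CPU_SLOWDOWN[model_size]
    return audio_minutes * low, audio_minutes * high


if __name__ == "__main__":
    best, worst = estimate_cpu_minutes(10, "medium")
    print(f"10 min of audio at medium: about {best:.0f} to {worst:.0f} min on CPU")
```

Actual speed depends heavily on the CPU and the audio, so treat the result as an order-of-magnitude guide when picking a model size.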

Drag & drop a file here, or click to browse.

Accepted: .mp3, .wav, .ogg, .flac, .aac, .m4a, .opus, .mp4, .webm, .mkv, .mov
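If you prepare files with a script before uploading, a quick extension check against the accepted list can save a failed upload. The sketch below is an assumption about how such a check might look; `ACCEPTED_EXTENSIONS` and `is_accepted` are hypothetical names, and the page itself may validate files differently (for example by content rather than by name).

```python
from pathlib import Path

# Extensions copied from the "Accepted" list above.
ACCEPTED_EXTENSIONS = {
    ".mp3", ".wav", ".ogg", ".flac", ".aac", ".m4a",
    ".opus", ".mp4", ".webm", ".mkv", ".mov",
}


def is_accepted(filename):
    """Return True if the file's final extension is in the accepted list."""
    # Compare case-insensitively so "TALK.MP3" is treated like "talk.mp3".
    return Path(filename).suffix.lower() in ACCEPTED_EXTENSIONS


if __name__ == "__main__":
    for name in ["interview.m4a", "TALK.MP3", "notes.txt"]:
        print(name, is_accepted(name))
```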

Model size

Language hint (blank = auto-detect)

Output format
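For readers who want to reproduce the same steps in a script, the three options above map roughly onto a model name, an optional language code, and a choice of output writer. The following is a minimal sketch, assuming the open-source `openai-whisper` Python package and SRT as the subtitle format; neither is confirmed by this page. The helper names `srt_timestamp`, `segments_to_srt`, and `transcribe_to_srt` are hypothetical.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Turn a list of dicts with 'start', 'end', 'text' keys into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Blocks are separated by a blank line, as SRT expects.
    return "\n".join(blocks)


def transcribe_to_srt(path, model_size="base", language=None):
    """Transcribe a file and return SRT text.

    Assumption: the openai-whisper package is installed. It is imported
    here so the formatting helpers above work without it.
    """
    import whisper

    model = whisper.load_model(model_size)  # downloads weights on first use
    # language=None lets Whisper auto-detect, like leaving the hint blank.
    result = model.transcribe(path, language=language)
    return segments_to_srt(result["segments"])
```

Calling `transcribe_to_srt("talk.mp3", model_size="base")` would return the subtitle text as a string; for plain text output, the `"text"` field of the same result holds the full transcript.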