Whisper is OpenAI’s open-source automatic speech recognition (ASR) system, trained on 680,000 hours of multilingual and multitask data collected from the web. It approaches human-level robustness and accuracy on English speech recognition and copes well with noisy audio, accents, and technical terminology. Because it runs locally, Whisper provides powerful transcription without sending audio to external services.
Key Features
- Multilingual – Supports 99+ languages
- Robust – Handles accents, background noise, technical language
- Translation – Translate audio directly to English
- Timestamps – Word-level and segment timestamps
- Multiple Sizes – From tiny (39M) to large (1.5B parameters)
- Local Processing – Complete privacy, no data leaves your machine
- Open Source – MIT licensed, fully customizable
Model Sizes
Choose a checkpoint based on your accuracy needs and hardware; a short selection sketch follows the list:
- tiny – 39M parameters, ~1GB VRAM, fastest
- base – 74M parameters, ~1GB VRAM
- small – 244M parameters, ~2GB VRAM
- medium – 769M parameters, ~5GB VRAM
- large-v3 – 1.5B parameters, ~10GB VRAM, most accurate
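If you are unsure which checkpoint to load, a small helper can pick one from the GPU memory it detects. This is only a sketch: the thresholds roughly follow the VRAM figures above, and pick_model is a hypothetical helper, not part of the whisper package.
import torch
import whisper

def pick_model() -> str:
    # No GPU available: fall back to the smallest checkpoint.
    if not torch.cuda.is_available():
        return "tiny"
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    # Thresholds loosely follow the VRAM figures listed above.
    if vram_gb >= 10:
        return "large-v3"
    if vram_gb >= 5:
        return "medium"
    if vram_gb >= 2:
        return "small"
    return "base"

model = whisper.load_model(pick_model())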
Installation
# Install with pip
pip install openai-whisper
# Or install from source
pip install git+https://github.com/openai/whisper.git
# Install FFmpeg (required for audio decoding; the command below is for Debian/Ubuntu)
sudo apt install ffmpeg
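To confirm the installation, check that FFmpeg is on your PATH and that the package imports; whisper.available_models() lists the checkpoint names that load_model() accepts.
import shutil
import whisper

# Whisper shells out to ffmpeg to decode audio, so it must be on the PATH.
print("ffmpeg found:", shutil.which("ffmpeg") is not None)

# Checkpoint names accepted by whisper.load_model(), e.g. "tiny", "base", "large-v3".
print(whisper.available_models())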
Command Line Usage
# Basic transcription
whisper audio.mp3
# Specify model and language
whisper audio.mp3 --model medium --language English
# Translate to English
whisper audio.mp3 --task translate
# Output formats
whisper audio.mp3 --output_format txt
whisper audio.mp3 --output_format srt # Subtitles
whisper audio.mp3 --output_format vtt # Web subtitles
whisper audio.mp3 --output_format json # Detailed JSON
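When --language is omitted, Whisper detects the language from the first 30 seconds of audio. The Python API exposes the same step through its lower-level helpers; the sketch below follows the pattern in the upstream README, with audio.mp3 as a placeholder file.
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window the model expects.
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram and move it to the model's device.
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect_language() returns per-language probabilities.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")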
Python Usage
import whisper
# Load model
model = whisper.load_model("base")
# Transcribe audio
result = model.transcribe("audio.mp3")
print(result["text"])
# With options
result = model.transcribe(
    "audio.mp3",
    language="en",
    task="transcribe",  # or "translate"
    fp16=True,          # half precision; needs a GPU, falls back to FP32 on CPU
    verbose=True
)
# Access segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")
Faster Alternatives
- faster-whisper – reimplementation on CTranslate2, up to ~4x faster with lower memory use (see the sketch below)
- whisper.cpp – plain C/C++ implementation, well suited to CPU-only machines and Apple Silicon
- whisperX – adds word-level timestamps via forced alignment and speaker diarization
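For reference, faster-whisper’s interface differs slightly from openai-whisper: transcribe() returns a lazy iterator of segments plus an info object. The sketch below follows the example in the faster-whisper README; check that project for current options.
# pip install faster-whisper
from faster_whisper import WhisperModel

# int8 keeps memory low on CPU; use device="cuda", compute_type="float16" on a GPU.
model = WhisperModel("base", device="cpu", compute_type="int8")

# Segments are generated lazily; info carries the language-detection result.
segments, info = model.transcribe("audio.mp3")
print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")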
Use Cases
- Podcast Transcription – Convert episodes to searchable text
- Subtitle Generation – Create SRT/VTT files for videos (see the sketch after this list)
- Meeting Notes – Transcribe recordings automatically
- Voice Commands – Build voice-controlled applications
- Content Accessibility – Make audio content accessible
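The CLI’s --output_format srt already produces subtitles; purely as an illustration of what that output contains, here is a minimal sketch that writes an SRT file by hand from the segment list (episode.mp3 and episode.srt are placeholder names).
import whisper

def srt_time(seconds: float) -> str:
    # SRT timestamps use the form HH:MM:SS,mmm
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

model = whisper.load_model("small")
result = model.transcribe("episode.mp3")

with open("episode.srt", "w", encoding="utf-8") as f:
    for index, segment in enumerate(result["segments"], start=1):
        f.write(f"{index}\n")
        f.write(f"{srt_time(segment['start'])} --> {srt_time(segment['end'])}\n")
        f.write(f"{segment['text'].strip()}\n\n")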