OpenAI Whisper – Automatic Speech Recognition

Whisper is OpenAI’s open-source automatic speech recognition (ASR) system, trained on 680,000 hours of multilingual audio. It approaches human-level accuracy on English speech and remains robust to noisy audio, accents, and technical terminology across many languages. Because it runs locally, Whisper provides powerful transcription without sending any audio to external services.

Key Features

  • Multilingual – Supports 99+ languages
  • Robust – Handles accents, background noise, technical language
  • Translation – Translate audio directly to English
  • Timestamps – Word-level and segment timestamps
  • Multiple Sizes – From tiny (39M) to large (1.5B parameters)
  • Local Processing – Complete privacy, no data leaves your machine
  • Open Source – MIT licensed, fully customizable

Model Sizes

Choose a size based on your accuracy needs and available hardware (a selection sketch follows the list):

  • tiny – 39M parameters, ~1GB VRAM, fastest
  • base – 74M parameters, ~1GB VRAM
  • small – 244M parameters, ~2GB VRAM
  • medium – 769M parameters, ~5GB VRAM
  • large-v3 – 1.5B parameters, ~10GB VRAM, most accurate
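
If you script the choice, you can map the table above onto available GPU memory at load time. The pick_model helper below is a hypothetical sketch, not part of the Whisper API; the thresholds mirror the VRAM figures listed above, and torch is installed as a Whisper dependency.

import torch
import whisper

def pick_model() -> str:
    """Hypothetical helper: map available GPU memory to a model size."""
    if not torch.cuda.is_available():
        return "base"  # CPU fallback; larger models are very slow without a GPU
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 10:
        return "large-v3"
    if vram_gb >= 5:
        return "medium"
    if vram_gb >= 2:
        return "small"
    return "base"

model = whisper.load_model(pick_model())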

Installation

# Install with pip
pip install openai-whisper

# Or install from source
pip install git+https://github.com/openai/whisper.git

# Install FFmpeg (required for audio decoding)
sudo apt install ffmpeg
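
Once installed, a quick sanity check confirms that both the CLI and the Python package work; whisper.available_models() lists the model names the package can download:

# Check the CLI is on your PATH
whisper --help

# Check the Python package imports and list downloadable model names
python -c "import whisper; print(whisper.available_models())"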

Command Line Usage

# Basic transcription
whisper audio.mp3

# Specify model and language
whisper audio.mp3 --model medium --language English

# Translate to English
whisper audio.mp3 --task translate

# Output formats
whisper audio.mp3 --output_format txt
whisper audio.mp3 --output_format srt  # Subtitles
whisper audio.mp3 --output_format vtt  # Web subtitles
whisper audio.mp3 --output_format json # Detailed JSON
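
The CLI also accepts several audio files in one invocation, which makes batch jobs straightforward; the directory names below are illustrative:

# Transcribe every MP3 in a directory and write SRT files to subtitles/
whisper recordings/*.mp3 --model small --output_format srt --output_dir subtitles

# Or loop when each file needs different settings
for f in recordings/*.mp3; do
    whisper "$f" --model medium --language English
done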

Python Usage

import whisper

# Load model
model = whisper.load_model("base")

# Transcribe audio
result = model.transcribe("audio.mp3")
print(result["text"])

# With options
result = model.transcribe(
    "audio.mp3",
    language="en",
    task="transcribe",  # or "translate"
    fp16=True,  # Half precision (GPU); Whisper falls back to FP32 on CPU
    verbose=True
)

# Access segments with timestamps
for segment in result["segments"]:
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")

Faster Alternatives

  • faster-whisper – CTranslate2 backend, up to 4x faster than the reference implementation (see the sketch after this list)
  • whisper.cpp – plain C/C++ implementation, runs efficiently on CPU
  • whisperX – word-level timestamps and speaker diarization
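
As an example of what switching involves, here is a minimal sketch of the faster-whisper API, assuming the package is installed (pip install faster-whisper):

from faster_whisper import WhisperModel

# int8 quantization keeps memory low on CPU; use device="cuda" with a GPU
model = WhisperModel("base", device="cpu", compute_type="int8")

# transcribe() returns a lazy generator of segments plus metadata
segments, info = model.transcribe("audio.mp3")
print(f"Detected language: {info.language}")
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")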

Use Cases

  • Podcast Transcription – Convert episodes to searchable text
  • Subtitle Generation – Create SRT/VTT files for videos (see the sketch after this list)
  • Meeting Notes – Transcribe recordings automatically
  • Voice Commands – Build voice-controlled applications
  • Content Accessibility – Make audio content accessible
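
For subtitle generation from Python rather than the CLI, recent openai-whisper releases include writer helpers in whisper.utils; a minimal sketch, assuming such a release (file names are illustrative):

import whisper
from whisper.utils import get_writer

model = whisper.load_model("small")
result = model.transcribe("lecture.mp3")

# get_writer takes an output format and directory; "vtt", "txt", and "json" also work
writer = get_writer("srt", ".")
writer(result, "lecture.mp3")  # writes lecture.srt to the output directory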

Download Whisper

Whisper is available on GitHub: https://github.com/openai/whisper
