OpenAI Whisper is a free, open-source speech-to-text model that converts spoken audio into written text across 99 languages. To run it, you need Python installed on your computer, a separate media tool called ffmpeg, and somewhere between 75 MB and 3 GB of free disk space depending on the quality level you want. It does not transcribe in real time. These are the facts that the breathless newsletter coverage tends to skip.
Priya manages partnerships at a fintech company in Singapore. In early 2026, she read that Whisper could match "human-level transcription accuracy" and was completely free. She found the GitHub page, skimmed the instructions, and felt the optimism of someone who has not yet encountered the phrase "pip install ffmpeg." Three hours later she had a cryptic CUDA compatibility error, no transcript, and had taken the rest of the meeting notes by hand. The tool is genuinely excellent. It was simply built for a different person than Priya.
Whisper was designed for developers and researchers. That does not make it a bad tool — it makes it the wrong tool for people who just want to transcribe Thursday's standup call in Mandarin without writing a single line of code.
This article explains how OpenAI Whisper actually works in plain English, what it does well, what it fundamentally cannot do, and which options make more sense if you need live meeting transcription today.
- OpenAI Whisper is a free, open-source speech-to-text model released in September 2022, trained on 680,000 hours of audio from the web.
- It supports 99 languages and reaches near-human accuracy on English — roughly 2–3% word error rate on clean recordings.
- Whisper does not work in real time. It processes audio in 30-second chunks after a recording is complete, not while someone is speaking.
- Running it locally requires Python 3.9+, ffmpeg, and a model file between 75 MB and 3 GB. Bigger models are more accurate but slower.
- For live meeting transcription without coding, you need streaming speech-to-text — a different architecture that Whisper was not designed to provide.
What Is OpenAI Whisper?
OpenAI Whisper is a speech recognition model released as open-source in September 2022. OpenAI trained it on 680,000 hours of audio collected from the internet — lectures, podcasts, interviews, YouTube videos, audiobooks — across dozens of languages. The scale of that training data is a big part of why its accuracy is so good.
It can do two things: transcription, which converts audio to text in the same language, and translation, which converts audio in a foreign language to English text. Note that it only translates into English, not between arbitrary language pairs.
You can access Whisper two ways. First, you can download the model weights for free from GitHub and run it on your own hardware — no API costs, no rate limits, but you do the setup. Second, you can call the OpenAI Whisper API at $0.006 per minute of audio, which removes most of the setup burden but still processes audio as a file upload rather than a live stream.
If you need something that works without a command line, skip ahead to the no-code options section. If you want to understand why Whisper works the way it does, read on — it matters for knowing what it can and cannot do.
How OpenAI Whisper Works — A Plain-English Walk-Through
You do not need to understand the math to use Whisper effectively. But understanding the four steps it takes helps explain why it has the limitations it does.
Step 1: Audio goes in as a file
You give Whisper a recorded audio file — MP3, WAV, M4A, or most other common formats. It cannot read a live microphone stream by default. The audio sits on your disk waiting to be processed.
Step 2: Whisper converts sound into a visual fingerprint
Whisper transforms the audio waveform into a mel spectrogram — think of it as a heat map of the sound, where the horizontal axis is time and the vertical axis shows which frequencies are present at each moment. Speech, music, and background noise each leave visibly different patterns. This visual representation is what the AI actually reads.
Step 3: An AI model reads the fingerprint and predicts words
A transformer model — the same type of architecture underlying GPT — reads the spectrogram and predicts the most likely sequence of words. One part of the model encodes the sound pattern; another part decodes it into text, one token at a time. The decoder uses context from earlier in the audio to make better predictions as it goes.
Step 4: Text comes out, punctuated and capitalized
Whisper outputs formatted text with sentence-appropriate punctuation and capitalization already applied. You get a usable transcript, not a wall of lowercase words.
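The four steps above collapse into a few lines of Python with the open-source `openai-whisper` package. This is a minimal sketch, assuming the package and ffmpeg are already installed:

```python
def transcribe_file(path, model_size="small"):
    """Run the full Whisper pipeline (steps 1-4) on one audio file."""
    import whisper  # the open-source package: pip install openai-whisper

    # Step 1: a file path is all Whisper needs; ffmpeg decodes the format.
    model = whisper.load_model(model_size)  # downloads weights on first use

    # Steps 2-4 (spectrogram, transformer decoding, formatted text)
    # all happen inside this single call.
    result = model.transcribe(path)
    return result["text"]
```

Calling `transcribe_file("standup.m4a")` returns punctuated, capitalized text once the whole file has been processed. Nothing comes back early.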
The 30-second window — and why it matters. Whisper divides your audio into 30-second segments and processes them sequentially. This chunked approach is the core reason why Whisper cannot stream live captions. There is no partial result after each word. There is only a completed chunk after each 30-second block finishes processing. For a 60-minute meeting, that means the first text appears only once the recording is complete and the first chunk has been processed, and the full transcript arrives only when all 120 chunks are done.
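The arithmetic behind that wait is trivial, and worth seeing once:

```python
import math

CHUNK_SECONDS = 30  # Whisper's fixed processing window

def chunk_count(duration_minutes):
    """Number of 30-second windows a recording is split into."""
    return math.ceil(duration_minutes * 60 / CHUNK_SECONDS)

# A 60-minute meeting is 120 sequential chunks, and none of them can
# be processed until the recording exists as a complete file.
```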
What Whisper Does Well
Within its design constraints, Whisper is genuinely impressive.
- Near-human accuracy on English. The large-v3 model achieves roughly 2–3% word error rate on standard benchmarks — comparable to professional human transcriptionists on clean audio. For reference, older consumer speech recognition averaged 10–15% error rates.
- 99 languages. Mandarin, Cantonese, Japanese, Korean, Arabic, Hindi, Russian, Portuguese, Spanish, German, French, and dozens more. The Whisper GitHub README lists the full language set with per-language accuracy benchmarks.
- Strong accent tolerance. Because it trained on real-world web audio rather than studio-quality speech, Whisper handles non-native accents better than many older ASR systems tuned on narrow datasets.
- Auto-punctuation. Commas, periods, and capitalization are included. Most competing batch transcription tools require a separate post-processing step for this.
- Technical vocabulary. Whisper handles domain-specific terminology — medical, legal, programming terms — better than general-purpose consumer speech recognition.
- Completely free to use. The model weights are released under the MIT license, which permits commercial use. You can process as many recordings as your hardware allows at zero marginal cost.
If post-recording accuracy on a saved audio file is your priority, Whisper is hard to beat. It is the right tool for transcribing recorded interviews, podcast episodes, lectures, or any audio you have already captured.
What Whisper Cannot Do — The Part Nobody Explains
Most articles about Whisper are written by developers for developers. They mention limitations in passing. Here they get the attention they deserve.
It does not transcribe in real time
If you start a Zoom call and point Whisper at it, you will receive a transcript when the call is over — not while it is happening. The delay between speaking and seeing text ranges from a few seconds for short clips to several minutes for a long meeting, depending on your hardware and model size.
This is not a bug. It is a design choice. Whisper's accuracy comes partly from processing each audio chunk with full context. Live transcription requires sending partial results immediately, before context is available. The two approaches involve a fundamental trade-off, and Whisper was built to maximize accuracy rather than minimize latency.
It cannot tell who is speaking
By default, Whisper produces a flat, unlabeled transcript. Every sentence appears in a continuous block with no indication of which participant said what. In a two-person sales call, you will not know which lines were yours and which were your prospect's. In a ten-person standup, the output is completely unattributed.
There are open-source add-ons (pyannote.audio is the most common) that layer speaker diarization on top of Whisper. They work reasonably well but require additional Python packages, model downloads, and configuration. The setup time roughly doubles.
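The usual pattern is to run the diarizer and Whisper separately, then match their outputs by timestamp. The sketch below shows only that matching step, using hypothetical segment dictionaries shaped like Whisper's output (`start`, `end`, `text`) and a diarizer's output (`start`, `end`, `speaker`):

```python
def label_segments(whisper_segments, speaker_turns):
    """Attach a speaker label to each transcript segment by time overlap."""
    labeled = []
    for seg in whisper_segments:
        best, best_overlap = "unknown", 0.0
        for turn in speaker_turns:
            # Overlap between [seg.start, seg.end] and [turn.start, turn.end];
            # negative when the two intervals do not intersect.
            overlap = min(seg["end"], turn["end"]) - max(seg["start"], turn["start"])
            if overlap > best_overlap:
                best, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best})
    return labeled
```

When two speaker turns overlap one transcript segment, the turn with the larger time overlap wins; segments no turn covers come back labeled "unknown".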
Running it locally requires technical setup
To use Whisper on your own computer, you need:
- Python 3.9 or higher installed correctly
- The ffmpeg media tool (a separate install on most operating systems)
- The model weights file: 75 MB for "tiny," 1.5 GB for "medium," 3 GB for "large-v3"
- A modern GPU if you want reasonable speed — the large model takes 20–40 minutes to process one hour of audio on a typical laptop CPU
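Before committing a weekend, you can verify the two hard prerequisites from any Python prompt. Note that ffmpeg must be on your system PATH; pip-installing a package named "ffmpeg" is not the same thing:

```python
import shutil
import sys

def check_whisper_prereqs():
    """Report whether the two things Whisper cannot run without are present."""
    return {
        "python_ok": sys.version_info >= (3, 9),              # Python 3.9+
        "ffmpeg_found": shutil.which("ffmpeg") is not None,   # ffmpeg on PATH
    }

print(check_whisper_prereqs())
```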
Miguel leads a 12-person customer success team at a Barcelona startup. His team handles calls in Spanish, Catalan, and English. In January 2026, he asked his lead developer to "set up Whisper for the team." The developer spent a full weekend installing dependencies, hit a CUDA version conflict that took four hours to resolve, then built a small upload interface so teammates could submit recordings without touching the terminal. Total setup time: about 14 hours of engineering work. The tool now works well. Miguel is grateful. He also acknowledges that most teams do not have a developer with a free weekend to spend on it.
The OpenAI API is easier — but still not live
The OpenAI Whisper API removes the local install problem. You send an audio file to OpenAI's servers via a simple HTTP request and receive the transcript back, typically within seconds for short clips. The cost is $0.006 per minute — a 60-minute meeting transcript costs about $0.36.
This lowers the technical barrier substantially. But the API is still a file-upload model, not a live stream. You send the finished recording after the call ends. The transcript arrives shortly afterward. If your goal is to read captions while someone is still talking, the API does not change the underlying constraint.
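In code, the API path is one call with the official `openai` Python client. The call itself needs an `OPENAI_API_KEY` environment variable and a network connection; the cost arithmetic below needs neither:

```python
def whisper_api_cost_usd(minutes, rate_per_minute=0.006):
    """Cost of a transcript at OpenAI's published Whisper API rate."""
    return round(minutes * rate_per_minute, 2)

def transcribe_via_api(path):
    """Upload a finished recording and get the transcript back."""
    from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY
    client = OpenAI()
    with open(path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

# whisper_api_cost_usd(60) comes to 0.36: a full hour for about 36 cents.
```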
Whisper Model Sizes at a Glance
Whisper comes in five quality tiers. Bigger models are more accurate but slower and heavier. On a typical consumer laptop without a GPU, the "small" model is usually the practical ceiling for speed.
| Model | File size | CPU speed (vs audio) | Best for |
|---|---|---|---|
| tiny | 75 MB | ~10× faster | Quick tests, demos |
| base | 150 MB | ~7× faster | Casual use, fast iteration |
| small ★ | 490 MB | ~4× faster | Good quality/speed balance on laptops |
| medium | 1.5 GB | ~2× faster | Higher accuracy, GPU recommended |
| large-v3 | 3 GB | ~1× (real time on GPU) | Maximum accuracy, GPU required for practical use |
Start with "small" if you are testing on a laptop. Move to "large-v3" if you have a compatible NVIDIA GPU and need the best accuracy on non-English audio. The accuracy jump from small to large-v3 is noticeable. The jump in processing time on CPU is severe.
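For scripting, the table reduces to a small lookup plus the rule of thumb above. The numbers are the approximate figures from the table, not authoritative specs:

```python
MODEL_SPECS = {
    # name: (approx. download size in MB, rough CPU speed vs. real time)
    "tiny":     (75,   10),
    "base":     (150,   7),
    "small":    (490,   4),
    "medium":   (1500,  2),
    "large-v3": (3000,  1),
}

def recommended_model(has_nvidia_gpu):
    """The article's rule of thumb: small on laptops, large-v3 on a GPU."""
    return "large-v3" if has_nvidia_gpu else "small"
```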
How to Use Whisper Without Writing Code
Three practical options exist for non-developers, each making a different trade-off between effort, cost, and timing.
Option 1: The OpenAI Whisper API
Upload your audio file through OpenAI's interface or via a no-code HTTP client like Postman. You get a clean transcript back in seconds to minutes depending on length. Cost: $0.006/minute. This is the lowest-friction path if you have occasional recordings and do not want to install anything. The downside: you are still processing recordings after the fact, not capturing speech live.
Option 2: Desktop applications built on Whisper
Several developers have wrapped Whisper in a clickable interface. MacWhisper (Mac only) and Buzz (cross-platform, free) let you drag in an audio file and get a transcript without opening a terminal. These are genuinely useful for post-call transcription. They share the same architectural constraint — no live captions, no speaker labels without additional configuration.
Option 3: Browser-based streaming tools for live meetings
If your goal is to read captions while a conversation is happening — not retrieve a transcript after it ends — you need a different approach entirely. Browser-based tools that use streaming speech-to-text capture audio from your microphone or browser tab and send partial results word-by-word as people speak. No install, no Python, no post-processing wait.
This category includes Whisper alternatives built for non-technical users, which trade some of Whisper's post-hoc accuracy for the immediacy that live conversations require. The choice between them is not about which is "better" — it is about whether you need a transcript of a meeting or captions during one.
Whisper vs. Live Meeting Transcription — Two Different Architectures
Understanding why Whisper cannot stream live captions requires understanding the difference between batch and streaming speech-to-text.
Whisper is a batch model. It waits for a complete audio chunk, processes it with full context, and returns a result. The accuracy advantage comes from that full context: the model can see the end of a sentence before confirming what the beginning said. It is like reading a paragraph twice before summarizing it.
Streaming speech-to-text works differently. It sends partial results the moment each word arrives, then auto-corrects as context accumulates. Tools built on streaming STT engines — including Soniox, which MirrorCaption uses — can deliver the first word of a caption within 300–500 milliseconds of someone speaking it. The trade-off is some accuracy loss on ambiguous words that batch processing would catch with hindsight.
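The difference is easiest to see in a deliberately silly toy model that ignores audio entirely and only shows when results become visible:

```python
def batch_style(words):
    """Batch: one result, and only after all input has arrived."""
    return " ".join(words)

def streaming_style(words):
    """Streaming: a partial hypothesis after every word, refined as it goes."""
    partial = []
    for word in words:
        partial.append(word)
        yield " ".join(partial)  # visible to the reader immediately

speech = ["the", "quarterly", "numbers", "look", "good"]
# batch_style(speech) produces nothing until the end: one final string.
# streaming_style(speech) produces five increasingly complete captions.
```

Both end at the same final string. The batch version simply shows you nothing along the way, which is exactly Whisper's behavior scaled down to five words.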
This is not a quality comparison. Whisper is arguably more accurate on recorded audio precisely because it processes more context. Streaming STT accepts a small accuracy penalty in exchange for immediacy. For live meetings, immediacy is the entire product.
Kenji works in Tokyo for a manufacturer that sells to European clients. His Thursday calls with a Munich team used to rely on a bilingual colleague to interpret key phrases. When that colleague left, Kenji started using a browser-based streaming transcription tool. He reads the German captions in real time during the call. No downloads, no Python, no waiting for a transcript to appear after the meeting ends. The difference from Whisper is not accuracy. It is the ability to hear something, understand it, and respond — all within the same 60-minute call.
Need live captions, not post-call transcripts? MirrorCaption streams transcription and translation in any browser, during your meeting. No install required.
Try Free →

Frequently Asked Questions
Is OpenAI Whisper free?
Yes. The Whisper model weights are free to download and use under the MIT license, which permits commercial applications. Running Whisper locally costs nothing beyond your own hardware and electricity. The OpenAI Whisper API charges $0.006 per minute of audio — a 60-minute meeting transcript costs roughly $0.36.
Can Whisper transcribe a Zoom call in real time?
No. Whisper processes audio in 30-second chunks after the audio is captured. It cannot deliver word-by-word captions while someone is speaking. If you record a Zoom call and then run Whisper on the saved file, you will get a clean transcript — but only after the meeting has ended. For live Zoom captions, you need a streaming speech-to-text tool, not Whisper. Our speech-to-text software roundup compares real-time and post-meeting options across common workflows.
How accurate is OpenAI Whisper?
Whisper large-v3 achieves roughly 2–3% word error rate on the standard LibriSpeech benchmark for English, which is comparable to professional human transcription on clean audio. Accuracy drops on heavy background noise, overlapping speakers, very fast speech, or low-quality microphones. Non-English languages average higher error rates than English, though they still outperform many older region-specific models. For a broader look at transcription accuracy tradeoffs, see our real-time translation accuracy benchmarks.
Does Whisper support Chinese and Japanese?
Yes. Whisper covers 99 languages including Mandarin Chinese, Cantonese, Japanese, Korean, Arabic, Hindi, and all major European languages. For Mandarin and Cantonese, Whisper's large model performs well on clearly spoken audio, though it struggles with heavy regional accents and code-switching between Chinese and English in the same sentence. For a broader comparison of multilingual tools available today, see our speech-to-text software roundup.
Is there a browser-based alternative to Whisper that works for live meetings?
Yes. Browser-based tools like MirrorCaption use streaming speech-to-text to transcribe and translate in real time during your meeting — no Python, no install, no waiting for the call to end. They work in Chrome, Safari, or Edge on any device. The trade-off versus Whisper is that post-hoc accuracy on a saved recording may be slightly lower, but for live conversations the immediacy is the point. Start with 2 free hours per month at mirrorcaption.com/app.
The Bottom Line
OpenAI Whisper is one of the most accurate speech-to-text systems ever made publicly available. It is also one of the most inaccessible to the people who would benefit from it most.
If you have a saved audio file and the patience for some setup, Whisper — especially via the OpenAI API — delivers near-human transcription accuracy across 99 languages for almost no cost. That is a remarkable engineering achievement.
If you need to read what someone is saying while they are saying it — during a meeting, not after — Whisper's architecture is the wrong fit. Streaming speech-to-text tools exist for exactly this use case. They work in a browser tab, they start within seconds, and they do not require a command line.
The question is not which tool is better. The question is which tool matches your timing requirement. For the best speech-to-text tools in 2026 across all use cases, our full roundup covers the landscape.
Live meeting transcription, no setup required
MirrorCaption streams transcription and translation word-by-word during your call. Works in any browser on any video call platform. 2 hours free every month, no credit card.
Try MirrorCaption Free