Automated meeting transcription is a solved problem for single-speaker audio. Multi-speaker scenarios are not. The moment two or more participants speak across a call, generic ASR tools collapse into an undifferentiated wall of text, with no indication of who said what.
This tutorial solves that by combining two APIs: pyannoteAI for speaker diarization and OpenAI Whisper for speech-to-text. The result is a production-ready pipeline that outputs a speaker-labeled transcript from any meeting recording.
Problem framing
A meeting transcription system must answer two distinct questions:
What was said? Handled by ASR (Whisper).
Who said it, and when? Handled by speaker diarization (pyannoteAI).
Neither model alone produces a usable output. Whisper transcribes speech accurately but has no speaker awareness. pyannoteAI segments audio by speaker identity with precise timestamps but produces no text. The core engineering challenge is aligning these two outputs at the segment level using overlapping timestamps.
Architecture overview
The pipeline runs both APIs independently and merges their outputs in a post-processing step. This decoupled approach means each component can be swapped or upgraded without touching the other.
Implementation walkthrough
1. Environment setup
You need two API credentials: a pyannoteAI API key and an OpenAI API key.
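One way to keep both keys out of source code is to read them from environment variables and fail early if either is missing. The sketch below is an assumption about setup rather than a required layout: PYANNOTE_API_KEY is a hypothetical variable name chosen for this example, while OPENAI_API_KEY is the name the official OpenAI Python client reads by default.

```python
import os


def load_credentials(env=os.environ):
    """Return both API keys, failing early if either one is missing."""
    # PYANNOTE_API_KEY is a hypothetical name picked for this sketch.
    # OPENAI_API_KEY is what the official OpenAI Python client looks for.
    names = ("PYANNOTE_API_KEY", "OPENAI_API_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}
```

Calling load_credentials() once at startup surfaces a missing key before any audio is uploaded.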
2. Audio preprocessing (optional)
Both pyannoteAI and the OpenAI transcription API accept common audio formats (WAV, MP3, MP4, M4A, OGG, FLAC, WEBM) directly and handle normalization on their end, so no client-side preprocessing is required.
If your source file is a video (e.g., an MP4 screen recording or a Zoom export) or in an unsupported container, extract the audio track first using pydub:
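A minimal extraction sketch with pydub might look like the following. It assumes ffmpeg is installed (pydub relies on it to decode anything other than WAV), and the choice of mono 16 kHz WAV output is an assumption of this example, not a requirement stated by either API. The function names are illustrative.

```python
from pathlib import Path


def wav_path_for(source):
    """Derive the output path: same folder and stem, .wav extension."""
    return Path(source).with_suffix(".wav")


def extract_audio(source, target=None):
    """Decode the audio track of `source` and write it out as a WAV file."""
    # Imported lazily so the path helper above works without pydub installed.
    from pydub import AudioSegment

    target = Path(target) if target else wav_path_for(source)
    audio = AudioSegment.from_file(str(source))  # decoding is delegated to ffmpeg
    # Assumption of this sketch: downmix to mono and resample to 16 kHz.
    audio = audio.set_channels(1).set_frame_rate(16000)
    audio.export(str(target), format="wav")
    return target
```

For example, extract_audio("standup.mp4") would write standup.wav next to the source file.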
For audio files already in a supported format, skip this step and pass the file directly to the pipeline.
3. Speaker diarization with pyannoteAI
Submit the audio file (preprocessed or original) to the pyannoteAI diarization endpoint. The API is asynchronous: you POST the audio, receive a job ID, then poll until the result is ready.
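The polling half of that flow can be sketched independently of any HTTP client. In the example below, fetch_job is a hypothetical callable that performs the status request and returns the decoded JSON body as a dict; the status strings ("succeeded", "failed", "canceled") are assumptions of this sketch and should be checked against the pyannoteAI API reference.

```python
import time


def poll_job(fetch_job, interval=5.0, timeout=900.0,
             sleep=time.sleep, clock=time.monotonic):
    """Call `fetch_job` repeatedly until the job finishes, fails, or times out."""
    deadline = clock() + timeout
    while True:
        job = fetch_job()
        status = job.get("status")
        if status == "succeeded":  # assumed terminal success status
            return job
        if status in ("failed", "canceled"):  # assumed terminal failure statuses
            raise RuntimeError(f"Diarization job ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"Job still {status!r} after {timeout} seconds")
        sleep(interval)
```

Injecting sleep and clock keeps the loop testable without real waiting; in production the defaults are enough.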
The returned diarization payload looks like this:
Each entry represents a continuous speech segment attributed to one speaker.
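To work with those entries in code, it can help to load them into a small typed structure. The sketch below assumes each entry is a dict with a speaker label plus start and end times in seconds; those field names are an assumption of this example, so adjust them to match the payload you actually receive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker: str
    start: float  # seconds from the beginning of the recording
    end: float

    @property
    def duration(self):
        return self.end - self.start


def parse_segments(entries):
    """Convert raw diarization entries into Segment objects sorted by time."""
    segments = [
        Segment(str(e["speaker"]), float(e["start"]), float(e["end"]))
        for e in entries
    ]
    return sorted(segments, key=lambda s: (s.start, s.end))


def speaking_time(segments):
    """Total seconds of speech per speaker label."""
    totals = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + seg.duration
    return totals
```

Sorting once up front means later steps can rely on chronological order.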
4. Transcription with Whisper
Use the OpenAI Whisper API with the verbose_json response format and the timestamp granularity set to word to retrieve word-level timestamps. These granular timestamps are essential for accurate alignment.
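A sketch of that request with the official openai Python package is shown below. It assumes the package is installed and OPENAI_API_KEY is set; the helper names transcribe_words and normalize_words are illustrative, not part of any library.

```python
def _field(item, name):
    """Read a field from either a plain dict or an SDK response object."""
    return item[name] if isinstance(item, dict) else getattr(item, name)


def normalize_words(raw_words):
    """Convert word entries into plain dicts with word, start, and end."""
    return [
        {
            "word": str(_field(item, "word")).strip(),
            "start": float(_field(item, "start")),
            "end": float(_field(item, "end")),
        }
        for item in raw_words
    ]


def transcribe_words(audio_path, model="whisper-1"):
    """Transcribe one file and return its word-level timestamps."""
    # Imported lazily so normalize_words can be used without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment
    with open(audio_path, "rb") as audio_file:
        response = client.audio.transcriptions.create(
            model=model,
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )
    return normalize_words(response.words or [])
```

Normalizing to plain dicts keeps the alignment step independent of the SDK's response classes.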
Each word object contains the word text (word) and its start and end times in seconds (start, end).
5. Timestamp alignment
This is the critical merge step. For each diarization segment, collect all Whisper words whose midpoint timestamp falls within the segment's [start, end] window. Using the midpoint rather than the word's start time reduces alignment errors at segment boundaries.
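The midpoint rule can be sketched in a few lines. This example assumes segments are dicts with speaker, start, and end, and words are dicts with word, start, and end (all times in seconds); the function name align_words is illustrative. Words that fall outside every segment are returned separately rather than silently dropped, which is a design choice of this sketch.

```python
def align_words(segments, words):
    """Attach each word to the first segment whose window contains its midpoint."""
    slots = [
        {"speaker": s["speaker"], "start": s["start"], "end": s["end"], "words": []}
        for s in segments
    ]
    unassigned = []
    for word in words:
        midpoint = (word["start"] + word["end"]) / 2.0
        for slot in slots:
            if slot["start"] <= midpoint <= slot["end"]:
                slot["words"].append(word["word"])
                break  # first matching segment wins
        else:
            unassigned.append(word)  # midpoint fell in a gap between segments
    turns = [
        {
            "speaker": slot["speaker"],
            "start": slot["start"],
            "end": slot["end"],
            "text": " ".join(slot["words"]),
        }
        for slot in slots
        if slot["words"]
    ]
    return turns, unassigned
```

A word spanning 1.75 to 2.5 seconds has its midpoint at 2.125, so with a segment boundary at 2.0 it goes to the later speaker, whereas matching on the start time would have assigned it to the earlier one.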
6. Putting it together
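Once words are aligned to segments, the final step is turning the aligned turns into a readable transcript. The sketch below assumes each turn is a dict with speaker, start, end, and text; it merges consecutive turns from the same speaker and renders either plain text or JSON. The function names and the [MM:SS] line format are choices made for this example.

```python
import json


def merge_turns(turns):
    """Collapse consecutive turns by the same speaker into one entry."""
    merged = []
    for turn in turns:
        if merged and merged[-1]["speaker"] == turn["speaker"]:
            merged[-1]["end"] = turn["end"]
            merged[-1]["text"] += " " + turn["text"]
        else:
            merged.append(dict(turn))  # copy so the input is not mutated
    return merged


def format_timestamp(seconds):
    """Render seconds as MM:SS (minutes keep growing past 59 for long calls)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_transcript(turns):
    """One line per merged turn: [MM:SS] SPEAKER: text."""
    return "\n".join(
        f"[{format_timestamp(t['start'])}] {t['speaker']}: {t['text']}"
        for t in merge_turns(turns)
    )


def to_json(turns):
    """Structured form of the same transcript for downstream tools."""
    return json.dumps(merge_turns(turns), indent=2)
```

Merging adjacent same-speaker turns keeps the transcript from fragmenting when diarization splits one utterance into several short segments.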
Sample output:
Key technical challenges
Timestamp boundary drift. Whisper's word timestamps can shift by 100–300ms relative to the actual acoustic onset. pyannoteAI segment boundaries are more precise because diarization operates directly on acoustic features. Using the word midpoint for alignment (rather than onset) absorbs most of this drift without additional correction.
Overlapping speech. pyannoteAI detects overlapping speech regions and may assign them to multiple speakers. Whisper cannot separate overlapping voices and will produce a single transcription stream. When overlap regions appear in diarization output, the current pipeline assigns words to the first matching segment. For high-overlap recordings, consider filtering diarization segments below a confidence threshold or flagging overlap windows in the output.
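Flagging overlap windows can be done directly from the diarization segments. The following sketch assumes segments are dicts with speaker, start, and end; it finds every window where two different speakers are active at once and lets you check whether a word's midpoint falls inside one. The function names are illustrative.

```python
def overlap_windows(segments):
    """Return (start, end, speakers) for every window where two speakers overlap."""
    ordered = sorted(segments, key=lambda s: s["start"])
    windows = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] >= first["end"]:
                break  # sorted by start, so nothing later can overlap `first`
            if second["speaker"] != first["speaker"]:
                windows.append((
                    second["start"],
                    min(first["end"], second["end"]),
                    (first["speaker"], second["speaker"]),
                ))
    return windows


def in_overlap(word, windows):
    """True if the word's midpoint lies inside any overlap window."""
    midpoint = (word["start"] + word["end"]) / 2.0
    return any(start <= midpoint <= end for start, end, _ in windows)
```

Words for which in_overlap returns True can be marked in the output so reviewers know the speaker attribution there is less reliable.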
Speaker label consistency. pyannoteAI returns anonymized labels (SPEAKER_00, SPEAKER_01, etc.) that are consistent within a session but not across sessions. For multi-session workflows, implement a speaker embedding comparison step using a model like pyannote/wespeaker to maintain identity across recordings.
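The comparison step itself is simple once embeddings exist. The sketch below assumes you already have one embedding vector per session label and one per known participant (how they are extracted depends on the embedding model); it matches each label to the most similar known speaker by cosine similarity. The 0.7 threshold is an arbitrary illustrative value, not a recommended setting.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speakers(session_embeddings, known_embeddings, threshold=0.7):
    """Map each session label to the best-matching known name, or None."""
    mapping = {}
    for label, embedding in session_embeddings.items():
        best_name, best_score = None, threshold
        for name, reference in known_embeddings.items():
            score = cosine_similarity(embedding, reference)
            if score >= best_score:
                best_name, best_score = name, score
        mapping[label] = best_name  # None means "no confident match"
    return mapping
```

This greedy version can map two session labels to the same known name; a stricter variant would enforce one-to-one assignment.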
Audio quality. Both models degrade on recordings with significant background noise, codec artifacts, or simultaneous crosstalk. Applying a noise reduction pass with a library such as noisereduce before submission can improve both diarization accuracy and Whisper word error rate.
API latency. For a one-hour meeting, expect pyannoteAI processing time of roughly 2–5 minutes, depending on load. Whisper large-v2 via the OpenAI API processes the same file in 3–8 minutes. Run both calls concurrently using asyncio or concurrent.futures to reduce total wall time.
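Because both calls spend most of their time waiting on the network, a two-worker thread pool is enough to overlap them. In the sketch below, diarize and transcribe are placeholders for whatever functions wrap the two API calls; each is assumed to take the audio path and return its result.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(diarize, transcribe, audio_path):
    """Run both API calls in parallel threads and return (segments, words)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        diarization_future = pool.submit(diarize, audio_path)
        transcription_future = pool.submit(transcribe, audio_path)
        # .result() re-raises any exception from the worker thread here.
        return diarization_future.result(), transcription_future.result()
```

With this arrangement, total wall time is roughly that of the slower call rather than the sum of both.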
Conclusion
This pipeline demonstrates that accurate multi-speaker transcription does not require a single monolithic model. By keeping diarization and ASR as independent components and merging their outputs through timestamp alignment, you can upgrade each part independently: swap Whisper for a fine-tuned domain model without changing diarization, or adjust diarization sensitivity without touching the transcription layer.
The structured JSON output is directly usable as input to downstream tasks: meeting summary generation, action item extraction, or speaker-specific analytics. From here, the natural extensions are speaker name resolution (mapping SPEAKER_00 to a known participant list), chunked processing for very long recordings, and a lightweight web interface for file upload and transcript review.
The full source for this tutorial is available in the pyannoteAI examples repository.
