Tutorial: Building a multi-speaker meeting transcription app with pyannoteAI and Whisper

Automated meeting transcription is a solved problem for single-speaker audio. Multi-speaker scenarios are not. The moment two or more participants speak across a call, generic ASR tools collapse into an undifferentiated wall of text, with no indication of who said what.

This tutorial solves that by combining two APIs: pyannoteAI for speaker diarization and OpenAI Whisper for speech-to-text. The result is a production-ready pipeline that outputs a speaker-labeled transcript from any meeting recording.

Problem framing

A meeting transcription system must answer two distinct questions:

  1. What was said? This is handled by ASR (Whisper).

  2. Who said it, and when? This is handled by speaker diarization (pyannoteAI).

Neither model alone produces a usable output. Whisper transcribes speech accurately but has no speaker awareness. pyannoteAI segments audio by speaker identity with precise timestamps but produces no text. The core engineering challenge is aligning these two outputs at the segment level using overlapping timestamps.
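
The alignment idea reduces to measuring how much two time intervals overlap. A minimal sketch of that measurement follows; the helper name overlap_seconds is illustrative and not part of either API.

```python
def overlap_seconds(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Return the length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


# A transcript segment spanning 3.8-6.0 s compared against two diarization segments
first = overlap_seconds(3.8, 6.0, 0.0, 4.2)   # small overlap with the first speaker
second = overlap_seconds(3.8, 6.0, 4.5, 9.1)  # larger overlap with the second speaker
print(first < second)
```

The segment would be attributed to the second speaker, because that speaker covers more of it.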

Architecture overview

Audio file (WAV/MP3)
        │
        ├──► pyannoteAI API ──► Diarization segments
        │                        [{speaker, start, end}, ...]
        │
        └──► Whisper API ──────► Word-level transcription
                                  [{text, start, end}, ...]
                                         │
                                         ▼
                              Timestamp alignment engine
                                         │
                                         ▼
                              Speaker-labeled transcript
                        [{"speaker": "SPEAKER_00", "text": "...", ...}]

The pipeline runs both APIs independently and merges their outputs in a post-processing step. This decoupled approach means each component can be swapped or upgraded without touching the other.

Implementation walkthrough

1. Environment setup

pip install requests==2.34.0 openai==2.36.0

# Optional: only needed if you have video files or unsupported audio formats
pip install pydub  # pydub also needs ffmpeg available on the PATH for non-WAV input

You need two API credentials:

PYANNOTE_API_KEY = "your_pyannoteai_key"
OPENAI_API_KEY   = "your_openai_key"
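
Hardcoded keys are easy to leak through version control. One common alternative is to read them from the environment. The sketch below assumes the keys were exported as environment variables named PYANNOTE_API_KEY and OPENAI_API_KEY; the helper require_env is hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (assumes both variables were exported in the shell beforehand):
# PYANNOTE_API_KEY = require_env("PYANNOTE_API_KEY")
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```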

2. Audio preprocessing (optional)

Both pyannoteAI and the OpenAI transcription API accept common audio formats (WAV, MP3, MP4, M4A, OGG, FLAC, WEBM) directly and handle normalization on their end, so no client-side preprocessing is required.

If your source file is a video (e.g., an MP4 screen recording or a Zoom export) or in an unsupported container, extract the audio track first using pydub:

from pydub import AudioSegment

def extract_audio(input_path: str, output_path: str) -> str:
    """Extract audio from a video file or convert to a supported format."""
    audio = AudioSegment.from_file(input_path)
    audio.export(output_path, format="mp3")
    return output_path

For audio files already in a supported format, skip this step and pass the file directly to the pipeline.

3. Speaker diarization with pyannoteAI

Submit the audio file to the pyannoteAI diarization endpoint. The API is asynchronous: you upload the audio to pyannoteAI temporary storage, submit a job that references it, receive a job ID, then poll until the result is ready.

import os
import requests
import time

def upload_to_pyannote(audio_path: str, api_key: str, object_key: str) -> str:
    """Upload a local file to pyannoteAI temporary storage. Returns the media:// URL."""
    headers = {"Authorization": f"Bearer {api_key}"}
    media_url = f"media://{object_key}"

    # Step 1: declare the media location and get a pre-signed PUT URL
    resp = requests.post(
        "https://api.pyannote.ai/v1/media/input",
        headers=headers,
        json={"url": media_url},
    )
    resp.raise_for_status()
    presigned_url = resp.json()["url"]

    # Step 2: PUT the file to the pre-signed URL
    with open(audio_path, "rb") as f:
        put_resp = requests.put(
            presigned_url,
            data=f,
            headers={"Content-Type": "application/octet-stream"},
        )
    put_resp.raise_for_status()

    return media_url

def run_diarization(audio_path: str, api_key: str) -> list[dict]:
    headers = {"Authorization": f"Bearer {api_key}"}

    # Upload local file to pyannoteAI temporary storage
    object_key = os.path.basename(audio_path)
    media_url = upload_to_pyannote(audio_path, api_key, object_key)

    # Submit the diarization job
    response = requests.post(
        "https://api.pyannote.ai/v1/diarize",
        headers=headers,
        json={"url": media_url, "exclusive": True},
    )
    response.raise_for_status()
    job_id = response.json()["jobId"]

    # Poll for completion
    while True:
        status_resp = requests.get(
            f"https://api.pyannote.ai/v1/jobs/{job_id}",
            headers=headers,
        )
        status_resp.raise_for_status()
        job = status_resp.json()

        if job["status"] == "succeeded":
            return job["output"]["exclusiveDiarization"]  # list of {speaker, start, end}, overlapping speech removed
        elif job["status"] == "failed":
            raise RuntimeError(f"Diarization failed: {job.get('error')}")

        time.sleep(3)

The returned diarization payload looks like this:

[
  {"speaker": "SPEAKER_00", "start": 0.0,  "end": 4.2},
  {"speaker": "SPEAKER_01", "start": 4.5,  "end": 9.1},
  {"speaker": "SPEAKER_00", "start": 9.4,  "end": 13.7}
]

Each entry represents a continuous speech segment attributed to one speaker.
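
The polling loop in run_diarization runs until the job succeeds or fails, so a stalled job would block forever. A bounded poll is one way to guard against that. In this sketch, poll_until_done, its parameters, and the 15-minute default are illustrative choices, not part of the pyannoteAI API. It accepts any zero-argument function that returns the job dictionary, so the HTTP call itself stays unchanged.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_job: Callable[[], dict],
    interval: float = 3.0,
    timeout: float = 900.0,
) -> dict:
    """Call fetch_job() repeatedly until the job succeeds, fails, or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_job()
        status = job.get("status")
        if status == "succeeded":
            return job
        if status == "failed":
            raise RuntimeError(f"Job failed: {job.get('error')}")
        # Give up if the next sleep would carry us past the deadline
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout} seconds")
        time.sleep(interval)
```

Inside run_diarization, fetch_job could be a small lambda wrapping the existing GET request and its .json() call.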

4. Transcription with Whisper

Use the OpenAI Whisper API with the verbose_json response format to retrieve timestamps at word and segment level. These timestamps are essential for accurate alignment.

from openai import OpenAI

def run_transcription(audio_path: str, api_key: str) -> list[dict]:
    client = OpenAI(api_key=api_key)

    with open(audio_path, "rb") as f:
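        response = client.audio.transcriptions.create(
            model="whisper-1",  # verbose_json and timestamp_granularities are supported by whisper-1
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["word", "segment"],
        )

    # The SDK returns typed objects; convert segments to plain dicts so the
    # alignment step can read them with .get() and attach a "speaker" key.
    return [seg.model_dump() for seg in (response.segments or [])]  # list of {text, start, end, ...}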

When word-level timestamps are requested, each word object in response.words contains:

{"word": "Hello", "start": 0.12, "end": 0.48}

5. Timestamp alignment

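This is the critical merge step. For each transcript segment, sum its time overlap with every diarization segment, grouped by speaker, and assign the speaker with the largest total overlap. A segment that overlaps no diarization segment is labeled UNKNOWN, or takes the speaker of the nearest diarization segment (by midpoint distance) when fill_nearest is set. Using total overlap rather than the segment's start time reduces alignment errors at segment boundaries.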

def align_transcript(
    diarization: list[dict],
    transcript_segments: list[dict],
    fill_nearest: bool = False,
) -> list[dict]:
    diarization = sorted(diarization, key=lambda x: x["start"])

    for seg in transcript_segments:
        seg_start = seg.get("start", 0.0)
        seg_end = seg.get("end", 0.0)
        speaker_overlap: dict[str, float] = {}

        for dia in diarization:
            intersection = min(dia["end"], seg_end) - max(dia["start"], seg_start)
            if intersection <= 0:
                continue
            speaker = dia["speaker"]
            speaker_overlap[speaker] = speaker_overlap.get(speaker, 0.0) + intersection

        if speaker_overlap:
            seg["speaker"] = max(speaker_overlap.items(), key=lambda x: x[1])[0]
            continue

        if fill_nearest and diarization:
            midpoint = (seg_start + seg_end) / 2
            nearest = min(
                diarization,
                key=lambda x: abs(((x["start"] + x["end"]) / 2) - midpoint),
            )
            seg["speaker"] = nearest["speaker"]
            continue

        seg["speaker"] = "UNKNOWN"

    return transcript_segments
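
The aligned output has one entry per Whisper segment, so a long turn by one speaker shows up as several consecutive entries. A possible post-processing step joins adjacent entries that share a speaker. The helper merge_turns below is hypothetical and assumes each entry is a dict with speaker, start, end, and text keys, as produced by the function above.

```python
def merge_turns(segments: list[dict]) -> list[dict]:
    """Merge consecutive segments with the same speaker into one turn."""
    turns: list[dict] = []
    for seg in segments:
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker as the previous entry: extend the current turn
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] = f"{turns[-1]['text']} {text}".strip()
        else:
            # Speaker changed (or first entry): start a new turn
            turns.append({
                "speaker": seg["speaker"],
                "start": seg["start"],
                "end": seg["end"],
                "text": text,
            })
    return turns


example = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.0, "text": " Thanks everyone"},
    {"speaker": "SPEAKER_00", "start": 2.0, "end": 4.2, "text": " for joining today."},
    {"speaker": "SPEAKER_01", "start": 4.5, "end": 9.1, "text": " Happy to be here."},
]
for turn in merge_turns(example):
    print(turn)
```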

6. Putting it together

import json

def transcribe_meeting(input_audio: str) -> list[dict]:
    processed = input_audio  # skip if already a supported audio format
    # processed = extract_audio(input_audio, "processed.mp3")  # uncomment for video input
    diarization = run_diarization(processed, PYANNOTE_API_KEY)
    segments = run_transcription(processed, OPENAI_API_KEY)
    return align_transcript(diarization, segments)

if __name__ == "__main__":
    result = transcribe_meeting("meeting.mp3")

    # Print readable transcript
    for entry in result:
        print(f"[{entry['start']}s → {entry['end']}s] {entry['speaker']}: {entry['text']}")

    # Save structured output
    with open("transcript.json", "w") as f:
        json.dump(result, f, indent=2)

Sample output:

[0.0s → 4.2s]  SPEAKER_00: Thanks everyone for joining today.
[4.5s → 9.1s]  SPEAKER_01: Happy to be here. Should we start with the Q3 review?
[9.4s → 13.7s]

Key technical challenges

Timestamp boundary drift. Whisper's timestamps can shift by 100–300ms relative to the actual acoustic onset. pyannoteAI segment boundaries are more precise because diarization operates directly on acoustic features. Assigning each transcript segment to the speaker with the largest total overlap (rather than matching on onset alone) absorbs most of this drift without additional correction.
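
Where word-level attribution is needed, a midpoint rule gives similar tolerance to drift: each word goes to the diarization segment that contains the word's midpoint. The sketch below is illustrative only. assign_words_by_midpoint is a hypothetical helper, and it assumes word dicts shaped like the Whisper word objects shown earlier.

```python
def assign_words_by_midpoint(diarization: list[dict], words: list[dict]) -> list[dict]:
    """Label each word with the speaker whose segment contains the word's midpoint."""
    labeled = []
    for word in words:
        midpoint = (word["start"] + word["end"]) / 2
        speaker = "UNKNOWN"  # fallback when the midpoint falls in a gap between segments
        for seg in diarization:
            if seg["start"] <= midpoint <= seg["end"]:
                speaker = seg["speaker"]
                break
        labeled.append({**word, "speaker": speaker})
    return labeled


diarization = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.2},
    {"speaker": "SPEAKER_01", "start": 4.5, "end": 9.1},
]
words = [
    {"word": "today", "start": 3.9, "end": 4.3},  # ends just past the first segment
    {"word": "Happy", "start": 4.4, "end": 4.8},  # starts just before the second segment
]
for w in assign_words_by_midpoint(diarization, words):
    print(w["word"], w["speaker"])
```

Both words spill slightly over a segment boundary, yet their midpoints still land inside the intended speaker's segment.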

Overlapping speech. pyannoteAI detects overlapping speech regions and may assign them to multiple speakers. Whisper cannot separate overlapping voices and produces a single transcription stream. The pipeline above sidesteps this by requesting exclusive diarization ("exclusive": True), which removes overlapping speech so that at most one speaker is active at any time, and by assigning each transcript segment to the speaker with the largest overlap. For high-overlap recordings, consider filtering diarization segments below a confidence threshold or flagging overlap windows in the output.
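
Flagging overlap windows can be done directly on a non-exclusive diarization list. A minimal sketch, assuming segments shaped like the payload shown earlier; find_overlap_windows is a hypothetical helper, not a pyannoteAI function.

```python
def find_overlap_windows(diarization: list[dict]) -> list[dict]:
    """Return the time windows where segments from two different speakers intersect."""
    segments = sorted(diarization, key=lambda s: s["start"])
    windows = []
    for i, first in enumerate(segments):
        for second in segments[i + 1:]:
            if second["start"] >= first["end"]:
                break  # sorted by start, so no later segment can intersect `first`
            if second["speaker"] != first["speaker"]:
                windows.append({
                    "start": second["start"],
                    "end": min(first["end"], second["end"]),
                    "speakers": sorted([first["speaker"], second["speaker"]]),
                })
    return windows


diarization = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 5.0},
    {"speaker": "SPEAKER_01", "start": 4.0, "end": 9.0},
    {"speaker": "SPEAKER_00", "start": 9.5, "end": 12.0},
]
print(find_overlap_windows(diarization))
```

Transcript segments that intersect a returned window could then carry an extra flag in the output so reviewers know the attribution is less reliable there.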

Speaker label consistency. pyannoteAI returns anonymized labels (SPEAKER_00, SPEAKER_01, etc.) that are consistent within a session but not across sessions. For multi-session workflows, implement a speaker embedding comparison step using a model like pyannote/wespeaker to maintain identity across recordings.
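
The comparison step can be sketched as a nearest-neighbor match on cosine similarity. How the embeddings are extracted is outside this sketch. The toy vectors, the match_speaker helper, and the 0.7 threshold are placeholders chosen for illustration, not values recommended by pyannoteAI.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speaker(
    embedding: list[float],
    known: dict[str, list[float]],
    threshold: float = 0.7,
) -> Optional[str]:
    """Return the known identity most similar to `embedding`, or None if nothing reaches the threshold."""
    best_name, best_score = None, threshold
    for name, reference in known.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy 3-dimensional vectors; real speaker embeddings are much higher-dimensional
known = {"alice": [0.9, 0.1, 0.0], "bob": [0.0, 0.2, 0.9]}
print(match_speaker([0.8, 0.2, 0.1], known))  # close to the "alice" reference
print(match_speaker([0.5, 0.5, 0.5], known))  # below the threshold for both references
```

A session label such as SPEAKER_00 that matches a stored identity can be renamed; an unmatched label can be enrolled as a new identity.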

Audio quality. Both models degrade on recordings with significant background noise, codec artifacts, or simultaneous crosstalk. Applying a noise reduction pass with a library such as noisereduce before submission can improve both diarization accuracy and Whisper word error rate; verify the effect on a sample of your own recordings before enabling it for every file.

API latency. For a one-hour meeting, expect pyannoteAI processing time of roughly 2–5 minutes, depending on load. Whisper large-v2 via the OpenAI API processes the same file in 3–8 minutes. Run both calls concurrently using asyncio or concurrent.futures to reduce total wall time.

from concurrent.futures import ThreadPoolExecutor

def transcribe_meeting_parallel(input_audio: str) -> list[dict]:
    processed = input_audio  # skip if already a supported audio format
    # processed = extract_audio(input_audio, "processed.mp3")  # uncomment for video input

    with ThreadPoolExecutor(max_workers=2) as executor:
        f_diarization   = executor.submit(run_diarization, processed, PYANNOTE_API_KEY)
        f_transcription = executor.submit(run_transcription, processed, OPENAI_API_KEY)

    return align_transcript(f_diarization.result(), f_transcription.result())

Conclusion

This pipeline demonstrates that accurate multi-speaker transcription does not require a single monolithic model. By keeping diarization and ASR as independent components and merging their outputs through timestamp alignment, you retain the ability to upgrade each part independently, swap Whisper for a fine-tuned domain model, or adjust diarization sensitivity without touching the transcription layer.

The structured JSON output is directly usable as input to downstream tasks: meeting summary generation, action item extraction, or speaker-specific analytics. From here, the natural extensions are speaker name resolution (mapping SPEAKER_00 to a known participant list), chunked processing for very long recordings, and a lightweight web interface for file upload and transcript review.
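
As one example of speaker-specific analytics, total talk time per speaker can be computed from the structured output. The sketch below assumes entries with speaker, start, and end keys. talk_time_by_speaker and the names mapping are hypothetical, with the mapping standing in for speaker name resolution.

```python
from collections import defaultdict
from typing import Optional


def talk_time_by_speaker(
    transcript: list[dict],
    names: Optional[dict[str, str]] = None,
) -> dict[str, float]:
    """Sum segment durations per speaker, optionally replacing labels with known names."""
    names = names or {}
    totals: dict[str, float] = defaultdict(float)
    for entry in transcript:
        # Fall back to the anonymized label when no name is known
        label = names.get(entry["speaker"], entry["speaker"])
        totals[label] += entry["end"] - entry["start"]
    return dict(totals)


transcript = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.0, "text": "..."},
    {"speaker": "SPEAKER_01", "start": 4.5, "end": 9.0, "text": "..."},
    {"speaker": "SPEAKER_00", "start": 9.5, "end": 13.5, "text": "..."},
]
print(talk_time_by_speaker(transcript, names={"SPEAKER_00": "Host"}))
```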

The full source for this tutorial is available in the pyannoteAI examples repository.

Speaker Intelligence Platform for developers

Detect, segment, label and separate speakers in any language.