Tutorial: Create subtitles with speaker labels - pyannoteAI Speaker Intelligence and Diarization

Subtitle files such as SRT and VTT are the de facto standard for distributing accessible, navigable transcripts alongside video and audio. When multiple people speak in a recording (interviews, podcasts, meetings, panel discussions), plain transcripts quickly lose context. Adding speaker labels to subtitles restores that context and is increasingly expected for compliance, indexing, and downstream analysis.

This tutorial walks through a production-ready pipeline that combines pyannoteAI speaker diarization with OpenAI's gpt-4o-transcribe to produce subtitle files with speaker attribution. The same approach generalizes to any ASR that returns word-level timestamps.

Problem framing

A speaker-labeled subtitle file requires three pieces of information per cue:

1. A start and end timestamp.
2. The transcribed text.
3. The identity of the speaker who is active during that interval.

ASR systems produce (1) and (2) but do not reliably attribute speech to speakers. Diarization systems produce speaker turns ((start, end, speaker_id) segments) but no text. The core engineering task is temporal alignment: mapping each ASR token onto the diarization timeline, then re-segmenting into subtitle cues that respect both speaker boundaries and readability constraints (line length, duration, characters-per-second).

A naive alignment that snaps entire ASR segments to a single speaker fails on overlapping speech, fast turn-taking, and ASR segments that straddle a speaker change. Word-level alignment is more reliable because each word is short enough to fall almost entirely within one speaker turn.
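
To make the failure mode concrete, the following minimal sketch compares segment-level and word-level attribution on hand-written toy data. The timestamps, the speaker labels, and the helper name `best_speaker` are illustrative assumptions, not output from either API.

```python
# Toy data: one ASR segment that straddles a speaker change at t = 2.0 s.
turns = [
    {"start": 0.0, "end": 2.0, "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 4.0, "speaker": "SPEAKER_01"},
]
words = [
    {"word": "Any", "start": 0.2, "end": 0.5},
    {"word": "questions?", "start": 0.6, "end": 1.4},
    {"word": "Yes,", "start": 2.1, "end": 2.4},
    {"word": "one.", "start": 2.5, "end": 2.9},
]

def best_speaker(start: float, end: float) -> str:
    """Return the speaker whose turns overlap [start, end] the most."""
    totals: dict[str, float] = {}
    for t in turns:
        overlap = min(t["end"], end) - max(t["start"], start)
        if overlap > 0:
            totals[t["speaker"]] = totals.get(t["speaker"], 0.0) + overlap
    return max(totals, key=totals.get)

# Segment-level: the whole segment (0.2 s to 2.9 s) receives a single label.
segment_label = best_speaker(words[0]["start"], words[-1]["end"])
# Word-level: each word is labeled on its own interval.
word_labels = [best_speaker(w["start"], w["end"]) for w in words]

print(segment_label)  # SPEAKER_00, so "Yes, one." is misattributed
print(word_labels)    # the last two words go to SPEAKER_01
```

In this toy case the segment-level label assigns the reply to the person who asked the question, while the word-level labels split at the turn boundary.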

Architecture overview

The pipeline has five stages:

Audio extraction: optional; only needed for video files or unsupported containers.
Diarization: upload the audio to pyannoteAI and obtain exclusive speaker turns.
Transcription: run gpt-4o-transcribe via the OpenAI API to get word-level timestamps.
Alignment: assign each word to a speaker by overlap, then group words into subtitle cues bounded by speaker changes.
Formatting: emit SRT or VTT with speaker labels and enforced readability rules.

Diarization and transcription are independent and can run in parallel. Alignment is the only stage that requires both outputs.

Stage 1: Audio extraction (optional)

Both pyannoteAI and the OpenAI transcription API accept common formats (WAV, MP3, MP4, M4A, OGG, FLAC, WEBM) and handle normalization on their end. No client-side preprocessing is required for audio files.

If your source is a video file (e.g., an MP4 screen recording or a Zoom export), extract the audio track first:

ffmpeg -i input.mp4 -vn -c:a libmp3lame audio.mp3

Or programmatically with pydub (pip install pydub==0.25.1):

from pydub import AudioSegment

def extract_audio(input_path: str, output_path: str = "audio.mp3") -> str:
    AudioSegment.from_file(input_path).export(output_path, format="mp3")
    return output_path

For audio files already in a supported format, skip this stage entirely.

Stage 2: Diarization with pyannoteAI

For local files, first upload to pyannoteAI's temporary storage, then submit the diarization job. Pass exclusive=True to remove overlapping speech regions, which ensures each segment contains exactly one speaker and makes alignment cleaner.

import os
import time
import requests

API_KEY = os.environ["PYANNOTEAI_API_KEY"]
BASE = "https://api.pyannote.ai/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def upload_to_pyannote(audio_path: str, object_key: str) -> str:
    """Upload a local file to pyannoteAI temporary storage. Returns the media:// URL."""
    media_url = f"media://{object_key}"
    resp = requests.post(
        f"{BASE}/media/input",
        headers=HEADERS,
        json={"url": media_url},
    )
    resp.raise_for_status()
    presigned_url = resp.json()["url"]
    with open(audio_path, "rb") as f:
        requests.put(presigned_url, data=f,
            headers={"Content-Type": "application/octet-stream"}).raise_for_status()
    return media_url

def diarize(audio_path: str) -> list[dict]:
    media_url = upload_to_pyannote(audio_path, os.path.basename(audio_path))
    resp = requests.post(
        f"{BASE}/diarize",
        headers=HEADERS,
        json={"url": media_url, "exclusive": True},
    )
    resp.raise_for_status()  # surface auth or validation errors immediately
    job_id = resp.json()["jobId"]

    # Poll until the job reaches a terminal state.
    while True:
        poll = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS)
        poll.raise_for_status()
        status = poll.json()
        if status["status"] == "succeeded":
            return status["output"]["exclusiveDiarization"]
        if status["status"] == "failed":
            raise RuntimeError(status.get("error"))
        time.sleep(2)

The returned structure is a list of non-overlapping segments:

[
    {"start": 0.52, "end": 4.31, "speaker": "SPEAKER_00"},
    {"start": 4.40, "end": 7.88, "speaker": "SPEAKER_01"},
    ...
]

Speaker labels are stable within a job. If you need persistent identities across recordings, use pyannoteAI's voiceprint identification endpoint with enrolled reference samples.
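
As a quick sanity check on the diarization output, it can help to total the speaking time per speaker before moving on. The sketch below assumes the segment structure shown above; the helper name `speaking_time` is hypothetical and not part of the pyannoteAI API, and the third example segment is made up for illustration.

```python
from collections import defaultdict

def speaking_time(segments: list[dict]) -> dict[str, float]:
    """Sum segment durations (in seconds) per speaker label."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)

example = [
    {"start": 0.52, "end": 4.31, "speaker": "SPEAKER_00"},
    {"start": 4.40, "end": 7.88, "speaker": "SPEAKER_01"},
    {"start": 8.00, "end": 9.50, "speaker": "SPEAKER_00"},  # illustrative extra turn
]
for speaker, seconds in sorted(speaking_time(example).items()):
    print(f"{speaker}: {seconds:.2f} s")
```

A speaker with only a fraction of a second of total speech is often a sign of a spurious label worth inspecting before alignment.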

Stage 3: Transcription with word-level timestamps

Use the OpenAI gpt-4o-transcribe model via the API (pip install openai==2.36.0). Request timestamp_granularities=["word"] to get per-word start and end times. For subtitles, word-level precision is important: ASR segments can span several seconds and straddle speaker boundaries, causing misattribution. Words give finer alignment and reduce that risk at turn transitions.

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def transcribe(audio_path: str) -> list[dict]:
    with open(audio_path, "rb") as f:
        response = client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )
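    # The SDK returns word objects, not dicts; convert them so later stages can index by key.
    return [
        {"word": w.word, "start": w.start, "end": w.end}
        for w in (response.words or [])
    ]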

Each word carries start, end, and word fields, which is sufficient for fine-grained overlap-based alignment.

Stage 4: Alignment

Assign each word to a speaker by overlap: compute the intersection between the word's interval and every diarization segment, sum it per speaker, then pick the speaker with the most overlap. If a word overlaps no segment, fall back to the nearest speaker:

def assign_speakers(
    words: list[dict],
    diarization: list[dict],
    fill_nearest: bool = True,
) -> list[dict]:
    diarization = sorted(diarization, key=lambda x: x["start"])
    for w in words:
        w_start, w_end = w["start"], w["end"]
        speaker_overlap: dict[str, float] = {}
        for dia in diarization:
            intersection = min(dia["end"], w_end) - max(dia["start"], w_start)
            if intersection <= 0:
                continue
            speaker_overlap[dia["speaker"]] = (
                speaker_overlap.get(dia["speaker"], 0.0) + intersection
            )
        if speaker_overlap:
            w["speaker"] = max(speaker_overlap.items(), key=lambda x: x[1])[0]
            continue
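        if fill_nearest and diarization:
            # No overlap: choose the segment whose closest edge is nearest to the word.
            # Measuring to segment midpoints would penalize long adjacent segments.
            nearest = min(
                diarization,
                key=lambda x: max(x["start"] - w_end, w_start - x["end"]),
            )
            w["speaker"] = nearest["speaker"]
            continue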
        w["speaker"] = "UNKNOWN"
    return words

Next, group words into cues. A new cue starts whenever the speaker changes, the gap between words exceeds a silence threshold, or the cue exceeds readability limits:

MAX_CHARS = 84      # ~42 chars per line, 2 lines
MAX_DURATION = 6.0  # seconds
SILENCE_GAP = 0.7   # seconds of silence that force a new cue

def build_cues(words: list[dict]) -> list[dict]:
    cues, current = [], None
    for w in words:
        new_cue = (
            current is None
            or w["speaker"] != current["speaker"]
            or w["start"] - current["end"] > SILENCE_GAP
            or len(current["text"]) + len(w["word"]) + 1 > MAX_CHARS
            or w["end"] - current["start"] > MAX_DURATION
        )
        if new_cue:
            if current:
                cues.append(current)
            current = {
                "start": w["start"], "end": w["end"],
                "speaker": w["speaker"], "text": w["word"],
            }
        else:
            current["end"] = w["end"]
            current["text"] += " " + w["word"]
    if current:
        cues.append(current)
    return cues

Stage 5: Subtitle formatting

SRT and VTT both use HH:MM:SS timestamps with a three-digit millisecond field; SRT separates the milliseconds with a comma (HH:MM:SS,mmm), while VTT uses a period (HH:MM:SS.mmm). Speaker labels are conventionally placed at the start of the cue text.

def fmt_ts(seconds: float, vtt: bool = False) -> str:
    # Work in whole milliseconds so rounding can never produce a 1000 ms field.
    total_ms = int(round(seconds * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    sep = "." if vtt else ","
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{ms:03d}"

def to_srt(cues: list[dict]) -> str:
    lines = []
    for i, c in enumerate(cues, 1):
        lines.append(str(i))
        lines.append(f"{fmt_ts(c['start'])} --> {fmt_ts(c['end'])}")
        lines.append(f"[{c['speaker']}] {c['text']}")
        lines.append("")
    return "\n".join(lines)

def to_vtt(cues: list[dict]) -> str:
    out = ["WEBVTT", ""]
    for c in cues:
        out.append(f"{fmt_ts(c['start'], True)} --> {fmt_ts(c['end'], True)}")
        out.append(f"<v {c['speaker']}>{c['text']}</v>")
        out.append("")
    return "\n".join(out)

VTT's <v Speaker> tag is the standard speaker annotation and is honored by most players, including HTML5 <track> and ffplay.

Key technical challenges

Overlapping speech: This pipeline uses exclusive diarization (exclusive=True), which removes overlapping speech so each segment contains exactly one speaker. If your use case requires representing simultaneous speakers (e.g., two people talking at once), disable exclusive mode, read from the regular diarization output instead, and emit two simultaneous cues with distinct VTT regions.
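
If you do disable exclusive mode, a first step is finding where turns from different speakers overlap. The following is a minimal sketch that assumes the regular diarization output uses the same (start, end, speaker) segment structure as the exclusive output; the function name `find_overlaps` and the sample segments are hypothetical.

```python
def find_overlaps(segments: list[dict]) -> list[dict]:
    """Return regions where two different speakers are active at once."""
    segments = sorted(segments, key=lambda s: s["start"])
    overlaps = []
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            if b["start"] >= a["end"]:
                break  # sorted by start, so no later segment can overlap a
            if a["speaker"] != b["speaker"]:
                overlaps.append({
                    "start": b["start"],
                    "end": min(a["end"], b["end"]),
                    "speakers": sorted([a["speaker"], b["speaker"]]),
                })
    return overlaps

segments = [
    {"start": 0.0, "end": 5.0, "speaker": "SPEAKER_00"},
    {"start": 4.2, "end": 8.0, "speaker": "SPEAKER_01"},
    {"start": 9.0, "end": 10.0, "speaker": "SPEAKER_00"},
]
print(find_overlaps(segments))
```

Words that fall inside such a region could then be emitted as two cues sharing the same time range, one per speaker.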

ASR-diarization clock drift: Both systems must run on the same audio file. Even small offsets accumulate over long recordings.
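
One cheap safeguard is to compare where the two timelines end; a large gap suggests that the two services processed different files or differently trimmed copies. This is a heuristic sketch, not a drift detector: the function name `timeline_gap` and the 5-second threshold are assumptions, and trailing silence or music can legitimately produce a gap.

```python
def timeline_gap(words: list[dict], diarization: list[dict]) -> float:
    """Absolute difference (seconds) between the last word end and the last turn end."""
    last_word_end = max(w["end"] for w in words)
    last_turn_end = max(d["end"] for d in diarization)
    return abs(last_word_end - last_turn_end)

MAX_GAP = 5.0  # seconds; assumed threshold, adjust for your recordings

words = [
    {"word": "Hello", "start": 0.6, "end": 0.9},
    {"word": "there.", "start": 1.0, "end": 1.4},
]
diarization = [{"start": 0.5, "end": 1.5, "speaker": "SPEAKER_00"}]

gap = timeline_gap(words, diarization)
if gap > MAX_GAP:
    print(f"Warning: timelines end {gap:.1f} s apart; check the inputs")
else:
    print("Timelines end close together")
```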

Short turns and backchannels: Filter diarization segments below ~250 ms before alignment to suppress brief interjections that fragment cues without adding informational value.
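
A minimal sketch of that filter, assuming the segment structure returned in Stage 2. The constant `MIN_TURN` and the function name `drop_short_turns` are illustrative; 0.25 s mirrors the ~250 ms figure above and may need tuning for your material.

```python
MIN_TURN = 0.25  # seconds; assumed threshold, tune per recording type

def drop_short_turns(segments: list[dict], min_dur: float = MIN_TURN) -> list[dict]:
    """Keep only diarization segments lasting at least min_dur seconds."""
    return [s for s in segments if s["end"] - s["start"] >= min_dur]

segments = [
    {"start": 0.0, "end": 3.2, "speaker": "SPEAKER_00"},
    {"start": 3.3, "end": 3.45, "speaker": "SPEAKER_01"},  # 150 ms interjection
    {"start": 3.5, "end": 6.0, "speaker": "SPEAKER_00"},
]
kept = drop_short_turns(segments)
print(len(kept))  # 2
```

Words spoken during a dropped turn then overlap no segment, so the nearest-speaker fallback in the alignment step decides their label.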

Readability: Streaming-platform conventions cap characters-per-second at roughly 17–20 and limit lines to 42 characters. Enforce both during cue construction, not as a post-processing pass.
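
As one way to express these limits in code, the sketch below computes characters-per-second for a cue and wraps its text to 42-character lines with the standard-library `textwrap` module. The constant names, the 20 cps ceiling, and both helper names are assumptions chosen for illustration; the cue structure matches what `build_cues` produces.

```python
import textwrap

MAX_CPS = 20          # assumed upper bound taken from the 17-20 range
MAX_LINE_CHARS = 42

def chars_per_second(cue: dict) -> float:
    """Reading speed of a cue; returns inf for zero-length cues."""
    duration = cue["end"] - cue["start"]
    return len(cue["text"]) / duration if duration > 0 else float("inf")

def wrap_cue_text(text: str, width: int = MAX_LINE_CHARS) -> list[str]:
    """Split cue text into lines of at most `width` characters."""
    return textwrap.wrap(text, width=width)

cue = {
    "start": 12.0, "end": 16.0, "speaker": "SPEAKER_00",
    "text": "Thanks everyone for joining, let's start with the roadmap update.",
}
lines = wrap_cue_text(cue["text"])
print(lines)
print(chars_per_second(cue) <= MAX_CPS)
```

A check like `chars_per_second(cue) > MAX_CPS` could be added as one more condition for starting a new cue, alongside the character and duration limits.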

Speaker naming: Generic labels like SPEAKER_00 are unhelpful in published subtitles. Map them to real names using pyannoteAI voiceprints or a manual labeling step before export.
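
For the manual labeling route, a plain dictionary lookup applied to the cues before export is often enough. In this sketch the mapping contents, the names, and the helper `rename_speakers` are all hypothetical; labels missing from the mapping are left unchanged.

```python
def rename_speakers(cues: list[dict], names: dict[str, str]) -> list[dict]:
    """Return copies of the cues with speaker labels replaced via `names`."""
    return [
        {**cue, "speaker": names.get(cue["speaker"], cue["speaker"])}
        for cue in cues
    ]

# Hypothetical mapping, e.g. filled in by hand after listening to the recording.
names = {"SPEAKER_00": "Alice", "SPEAKER_01": "Bob"}
cues = [
    {"start": 0.5, "end": 2.0, "speaker": "SPEAKER_00", "text": "Shall we begin?"},
    {"start": 2.2, "end": 3.0, "speaker": "SPEAKER_02", "text": "Sure."},
]
renamed = rename_speakers(cues, names)
print([c["speaker"] for c in renamed])  # ['Alice', 'SPEAKER_02']
```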

Conclusion

Speaker-labeled subtitles are a straightforward composition of two well-defined components (diarization and ASR) joined by word-level alignment. The pipeline above produces standards-compliant SRT and VTT output suitable for video platforms, accessibility workflows, and searchable archives. Once the alignment logic is in place, swapping ASR backends or extending to identity resolution with voiceprints requires only localized changes.

Full example

Putting all stages together:

from concurrent.futures import ThreadPoolExecutor

def generate_subtitles(audio_path: str, fmt: str = "srt") -> str:
    """Generate speaker-labeled subtitles from an audio file.

    Args:
        audio_path: Path to a local audio file in a supported format.
        fmt: Output format, either 'srt' or 'vtt'.
    """
    with ThreadPoolExecutor(max_workers=2) as executor:
        f_diarization = executor.submit(diarize, audio_path)
        f_words = executor.submit(transcribe, audio_path)

    diarization = f_diarization.result()
    words = f_words.result()

    labeled_words = assign_speakers(words, diarization)
    cues = build_cues(labeled_words)

    if fmt == "vtt":
        return to_vtt(cues)
    return to_srt(cues)

if __name__ == "__main__":
    srt_output = generate_subtitles("meeting.mp3")
    with open("meeting.srt", "w", encoding="utf-8") as f:
        f.write(srt_output)
    print("Subtitles written to meeting.srt")

Diarization and transcription run concurrently since they are independent. The entire pipeline takes roughly the duration of the slower API call plus negligible local processing for alignment and formatting.