Subtitle files such as SRT and VTT are the de facto standard for distributing accessible, navigable transcripts alongside video and audio. When multiple people speak in a recording (interviews, podcasts, meetings, panel discussions), plain transcripts quickly lose context. Adding speaker labels to subtitles restores that context and is increasingly expected for compliance, indexing, and downstream analysis.
This tutorial walks through a production-ready pipeline that combines pyannoteAI speaker diarization with OpenAI's gpt-4o-transcribe to produce subtitle files with speaker attribution. The same approach generalizes to any ASR that returns word-level timestamps.
Problem framing
A speaker-labeled subtitle file requires three pieces of information per cue:
1. A start and end timestamp.
2. The transcribed text.
3. The identity of the speaker who is active during that interval.
ASR systems produce (1) and (2) but do not reliably attribute speech to speakers. Diarization systems produce speaker turns ((start, end, speaker_id) segments) but no text. The core engineering task is temporal alignment: mapping each ASR token onto the diarization timeline, then re-segmenting into subtitle cues that respect both speaker boundaries and readability constraints (line length, duration, characters-per-second).
A naive alignment that snaps entire ASR segments to a single speaker fails on overlapping speech, fast turn-taking, and ASR segments that straddle a speaker change. Word-level alignment is more reliable because each word is attributed independently of the ASR segment it came from.
Architecture overview
The pipeline has five stages:
Audio extraction: optional; only needed for video files or unsupported containers.
Diarization: upload the audio to pyannoteAI and obtain exclusive speaker turns.
Transcription: run gpt-4o-transcribe via the OpenAI API to get word-level timestamps.
Alignment: assign each word to a speaker by overlap, then group words into subtitle cues bounded by speaker changes.
Formatting: emit SRT or VTT with speaker labels and enforced readability rules.
Diarization and transcription are independent and can run in parallel. Alignment is the only stage that requires both outputs.
Stage 1: Audio extraction (optional)
Both pyannoteAI and the OpenAI transcription API accept common formats (WAV, MP3, MP4, M4A, OGG, FLAC, WEBM) and handle normalization on their end. No client-side preprocessing is required for audio files.
If your source is a video file (e.g., an MP4 screen recording or a Zoom export), extract the audio track first:
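A minimal sketch of that extraction step, wrapping the ffmpeg command line in Python. It assumes ffmpeg is installed and on the PATH; the mono, 16 kHz, 16-bit PCM settings are an illustrative choice rather than a requirement of either API, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path: str, audio_path: str) -> list[str]:
    """Build an ffmpeg command that drops the video stream and writes WAV audio."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it already exists
        "-i", str(video_path),
        "-vn",                # discard the video stream
        "-ac", "1",           # downmix to mono (assumed choice)
        "-ar", "16000",       # resample to 16 kHz (assumed choice)
        "-c:a", "pcm_s16le",  # 16-bit PCM, the usual WAV encoding
        str(audio_path),
    ]


def extract_audio(video_path: str, audio_path: str) -> Path:
    """Run ffmpeg and return the path of the extracted audio file."""
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
    return Path(audio_path)
```

Calling `extract_audio("meeting.mp4", "meeting.wav")` would then produce the file that the later stages consume.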
Or programmatically with pydub (pip install pydub==0.25.1):
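A short sketch of the pydub route. pydub delegates decoding to ffmpeg, so ffmpeg still needs to be installed; the function name and the mono, 16 kHz output settings are assumptions made for illustration.

```python
def extract_audio_with_pydub(video_path: str, audio_path: str) -> str:
    """Decode the source with pydub and export it as a WAV file."""
    # Imported lazily so the module loads even where pydub is not installed.
    from pydub import AudioSegment

    audio = AudioSegment.from_file(video_path)           # container is auto-detected
    audio = audio.set_channels(1).set_frame_rate(16000)  # assumed target format
    audio.export(audio_path, format="wav")
    return audio_path
```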
For audio files already in a supported format, skip this stage entirely.
Stage 2: Diarization with pyannoteAI
For local files, first upload to pyannoteAI's temporary storage, then submit the diarization job. Pass exclusive=True to remove overlapping speech regions, which ensures each segment contains exactly one speaker and makes alignment cleaner.
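The upload-then-diarize flow might look like the sketch below. It rests on several assumptions that should be checked against the current pyannoteAI documentation: that the `pyannoteai-sdk` package exposes a `Client` with `upload`, `diarize`, and `retrieve` methods, that `retrieve` waits for the job to finish, and that the exclusive turns are returned under an `exclusiveDiarization` key. The helper names `turns_from_job` and `diarize` are hypothetical.

```python
def turns_from_job(job: dict) -> list[dict]:
    """Normalize a finished job payload into speaker turns sorted by start time.

    Assumed payload shape:
    {"output": {"exclusiveDiarization": [{"start", "end", "speaker"}, ...]}}
    """
    output = job["output"]
    # Fall back to the regular diarization if no exclusive output is present.
    segments = output.get("exclusiveDiarization") or output["diarization"]
    turns = [
        {"start": float(s["start"]), "end": float(s["end"]), "speaker": s["speaker"]}
        for s in segments
    ]
    return sorted(turns, key=lambda t: t["start"])


def diarize(audio_path: str, api_key: str) -> list[dict]:
    """Upload a local file, run diarization, and return exclusive speaker turns."""
    # Assumption: this import path and these method names match the installed SDK.
    from pyannoteai.sdk import Client

    client = Client(api_key)
    media_url = client.upload(audio_path)               # temporary storage
    job_id = client.diarize(media_url, exclusive=True)  # submit the job
    job = client.retrieve(job_id)                       # assumed to wait for completion
    return turns_from_job(job)
```

Keeping the payload parsing in its own function means the network call can be swapped out, or tested with a saved response, without touching the rest of the pipeline.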
The returned structure is a list of non-overlapping segments:
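For illustration, such a list might look like the following. The timestamps are made up, and the `start`, `end`, and `speaker` key names are an assumption; the small helper shows one way to confirm that the turns really are non-overlapping before alignment.

```python
# Illustrative values only; real output depends on the recording.
turns = [
    {"start": 0.00, "end": 4.20, "speaker": "SPEAKER_00"},
    {"start": 4.20, "end": 9.85, "speaker": "SPEAKER_01"},
    {"start": 10.10, "end": 12.40, "speaker": "SPEAKER_00"},
]


def is_exclusive(turns: list[dict]) -> bool:
    """Return True when no turn starts before the previous one has ended."""
    ordered = sorted(turns, key=lambda t: t["start"])
    return all(prev["end"] <= cur["start"] for prev, cur in zip(ordered, ordered[1:]))
```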
Speaker labels are stable within a job. If you need persistent identities across recordings, use pyannoteAI's voiceprint identification endpoint with enrolled reference samples.
Stage 3: Transcription with word-level timestamps
Use the OpenAI gpt-4o-transcribe model via the API (pip install openai==2.36.0). Request timestamp_granularities=["word"] to get per-word start and end times. For subtitles, word-level precision is important: ASR segments can span several seconds and straddle speaker boundaries, causing misattribution. Words give finer alignment and reduce that risk at turn transitions.
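A sketch of the transcription call using the OpenAI Python client. One caveat worth verifying: OpenAI's API reference has documented `timestamp_granularities` as requiring `response_format="verbose_json"` and, at least historically, as supported only by `whisper-1`. The model name is therefore left as a parameter here, so it can be set to whichever model the current documentation lists as returning word timestamps. The helper names are hypothetical.

```python
def words_to_dicts(words) -> list[dict]:
    """Convert word objects (attributes word, start, end) into plain dicts."""
    return [
        {"word": w.word, "start": float(w.start), "end": float(w.end)}
        for w in words
    ]


def transcribe_words(audio_path: str, model: str) -> list[dict]:
    """Transcribe a file and return per-word timestamps."""
    # Imported lazily; the client reads OPENAI_API_KEY from the environment.
    from openai import OpenAI

    client = OpenAI()
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model=model,
            file=audio_file,
            response_format="verbose_json",    # needed for timestamp output
            timestamp_granularities=["word"],  # request per-word timing
        )
    # `words` may be missing if the model did not return word timestamps.
    return words_to_dicts(result.words or [])
```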
Each word carries start, end, and word fields, which is sufficient for fine-grained overlap-based alignment.
Stage 4: Alignment
Assign each word to a speaker using overlap: compute the intersection time between the word's interval and every diarization segment, then pick the speaker with the most overlap. Fall back to the nearest speaker if there is no overlap:
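One way to write that assignment, as a sketch that assumes words and turns are plain dicts with `start` and `end` in seconds (plus `word` or `speaker`). Overlap is summed per speaker, and "nearest" is measured from the word's midpoint to the closest edge of each turn; both are design choices made here, not the only reasonable ones.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: list[dict], turns: list[dict]) -> list[dict]:
    """Return copies of the words with a 'speaker' key added."""
    labeled = []
    for w in words:
        # Accumulate overlap per speaker across all turns the word touches.
        totals: dict[str, float] = {}
        for t in turns:
            ov = overlap(w["start"], w["end"], t["start"], t["end"])
            if ov > 0:
                totals[t["speaker"]] = totals.get(t["speaker"], 0.0) + ov

        if totals:
            speaker = max(totals, key=totals.get)
        elif turns:
            # No overlap: fall back to the turn closest to the word's midpoint.
            mid = (w["start"] + w["end"]) / 2
            nearest = min(
                turns, key=lambda t: max(t["start"] - mid, mid - t["end"], 0.0)
            )
            speaker = nearest["speaker"]
        else:
            speaker = None  # no diarization available at all
        labeled.append({**w, "speaker": speaker})
    return labeled
```

This scans every turn for every word, which is fine for typical recordings; for very long files, a two-pointer sweep over both sorted lists would avoid the quadratic cost.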
Next, group words into cues. A new cue starts whenever the speaker changes, the gap between words exceeds a silence threshold, or the cue exceeds readability limits:
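A sketch of that grouping step, assuming each word dict already has a `speaker` key. The thresholds are illustrative assumptions: a 0.6 s silence gap, a 6 s maximum cue duration, and 84 characters (two lines of 42).

```python
def group_into_cues(
    words: list[dict],
    max_gap: float = 0.6,       # assumed silence threshold, seconds
    max_chars: int = 84,        # assumed limit: two lines of 42 characters
    max_duration: float = 6.0,  # assumed maximum cue length, seconds
) -> list[dict]:
    """Group speaker-labeled words into cues with start, end, speaker, text."""
    cues: list[dict] = []
    current = None
    for w in words:
        text = w["word"].strip()
        if not text:
            continue  # skip empty tokens

        if current is not None:
            candidate = current["text"] + " " + text
            must_split = (
                w["speaker"] != current["speaker"]            # speaker change
                or w["start"] - current["end"] > max_gap      # long silence
                or len(candidate) > max_chars                 # too much text
                or w["end"] - current["start"] > max_duration # too long on screen
            )
            if must_split:
                cues.append(current)
                current = None

        if current is None:
            current = {
                "start": w["start"],
                "end": w["end"],
                "speaker": w["speaker"],
                "text": text,
            }
        else:
            current["text"] = candidate
            current["end"] = w["end"]

    if current is not None:
        cues.append(current)
    return cues
```

Because the limits are checked while the cue is being built, no cue ever needs to be split after the fact.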
Stage 5: Subtitle formatting
SRT separates milliseconds with a comma (HH:MM:SS,mmm); VTT uses a period (HH:MM:SS.mmm). Speaker labels are conventionally placed at the start of the cue text.
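A minimal SRT writer as a sketch. It assumes cues are dicts with `start`, `end`, `speaker`, and `text`, and it prefixes the text with the speaker label followed by a colon, which is one common convention rather than part of the SRT format.

```python
def format_timestamp(seconds: float, separator: str) -> str:
    """Format seconds as HH:MM:SS<separator>mmm (',' for SRT, '.' for VTT)."""
    total_ms = max(0, round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def to_srt(cues: list[dict]) -> str:
    """Render cues as SRT: index, time range, text, then a blank line."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_timestamp(cue["start"], ",")
        end = format_timestamp(cue["end"], ",")
        blocks.append(f"{index}\n{start} --> {end}\n{cue['speaker']}: {cue['text']}\n")
    return "\n".join(blocks)
```

Converting to integer milliseconds before splitting into fields avoids rounding artifacts such as a milliseconds field of 1000.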
VTT's <v Speaker> tag is the standard speaker annotation and is honored by most players, including HTML5 <track> and ffplay.
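A matching VTT writer, sketched under the same assumption about the cue dicts. The text is escaped because `&` and `<` are reserved characters in WebVTT cue text.

```python
from html import escape


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm, the WebVTT timestamp form."""
    total_ms = max(0, round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_vtt(cues: list[dict]) -> str:
    """Render cues as WebVTT with each cue wrapped in a <v Speaker> voice span."""
    lines = ["WEBVTT", ""]  # mandatory header followed by a blank line
    for cue in cues:
        lines.append(f"{vtt_timestamp(cue['start'])} --> {vtt_timestamp(cue['end'])}")
        # quote=False escapes &, < and > but leaves quotation marks alone.
        lines.append(f"<v {cue['speaker']}>{escape(cue['text'], quote=False)}</v>")
        lines.append("")
    return "\n".join(lines)
```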
Key technical challenges
Overlapping speech: This pipeline uses exclusive diarization (exclusive=True), which removes overlapping speech so each segment contains exactly one speaker. If your use case requires representing simultaneous speakers (e.g., two people talking at once), disable exclusive mode, read from the regular diarization output instead, and emit two simultaneous cues with distinct VTT regions.
ASR-diarization clock drift: Both systems must run on the same audio file. Even small offsets accumulate over long recordings.
Short turns and backchannels: Filter diarization segments below ~250 ms before alignment to suppress brief interjections that fragment cues without adding informational value.
Readability: Streaming-platform conventions cap characters-per-second at roughly 17–20 and limit lines to 42 characters. Enforce both during cue construction, not as a post-processing pass.
Speaker naming: Generic labels like SPEAKER_00 are unhelpful in published subtitles. Map them to real names using pyannoteAI voiceprints or a manual labeling step before export.
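Three of the points above (short-turn filtering, the characters-per-second limit, and speaker naming) reduce to small helpers. The sketch below assumes the same dict shapes used for turns and cues elsewhere; the function names and the example name mapping are hypothetical.

```python
def drop_short_turns(turns: list[dict], min_duration: float = 0.25) -> list[dict]:
    """Remove diarization turns shorter than min_duration seconds."""
    return [t for t in turns if t["end"] - t["start"] >= min_duration]


def chars_per_second(cue: dict) -> float:
    """Reading speed of a cue; the duration is floored to avoid division by zero."""
    duration = max(cue["end"] - cue["start"], 1e-3)
    return len(cue["text"]) / duration


def rename_speakers(cues: list[dict], names: dict[str, str]) -> list[dict]:
    """Replace generic labels with display names, keeping unknown labels as-is."""
    return [{**c, "speaker": names.get(c["speaker"], c["speaker"])} for c in cues]
```

For example, `rename_speakers(cues, {"SPEAKER_00": "Host"})` relabels the first speaker and leaves any unmapped label untouched, so a partially completed manual mapping still yields valid output.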
Conclusion
Speaker-labeled subtitles are a straightforward composition of two well-defined components (diarization and ASR) joined by word-level alignment. The pipeline above produces standards-compliant SRT and VTT output suitable for video platforms, accessibility workflows, and searchable archives. Once the alignment logic is in place, swapping ASR backends or extending to identity resolution with voiceprints requires only localized changes.
Full example
Putting all stages together:
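A sketch of the orchestration. To stay self-contained, the stage implementations are passed in as callables; the parameter names (`diarize`, `transcribe`, `align`, `group`, `render`) are hypothetical placeholders for whatever functions implement each stage.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_pipeline(
    audio_path: str,
    diarize: Callable[[str], list],
    transcribe: Callable[[str], list],
    align: Callable[[list, list], list],
    group: Callable[[list], list],
    render: Callable[[list], str],
) -> str:
    """Run diarization and transcription concurrently, then align and format."""
    # Both calls are network-bound, so two threads are enough to overlap them.
    with ThreadPoolExecutor(max_workers=2) as pool:
        turns_future = pool.submit(diarize, audio_path)
        words_future = pool.submit(transcribe, audio_path)
        turns = turns_future.result()  # re-raises any exception from the worker
        words = words_future.result()

    labeled = align(words, turns)  # attach a speaker to every word
    cues = group(labeled)          # split into readable, single-speaker cues
    return render(cues)            # SRT or VTT text
```

Passing the stages as arguments also makes it easy to swap the ASR backend or the output format without changing the orchestration.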
Diarization and transcription run concurrently since they are independent. The entire pipeline takes roughly the duration of the slower API call plus negligible local processing for alignment and formatting.
