Benchmark

Industry-leading speaker intelligence benchmark results

Compare pyannoteAI performance against major speech-AI
providers and open-source models.

Accuracy

Delivers top diarization performance, measured by diarization error rate (DER) and speaker-attribution accuracy.

Robustness

Stays consistent across noisy, overlapping, multi-speaker conversations.

Speed

Processes full diarization and speaker-segmentation output in seconds.

Diarization Error Rate

State-of-the-art

Speaker Diarization Benchmark

Discover the power of speaker diarization

What can go wrong?

When benchmarking speaker diarization systems, one should focus on the following three types of errors.

A speaker confusion happens when a system assigns a speech turn to the wrong speaker, when it merges two speakers into one, or when it splits one speaker into multiple ones.

A missed detection happens when a speech turn (or part of it) is missed, or when two speakers talk on top of each other, and only one of them is detected.

A false alarm happens when the system detects a speech turn when there is none.
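
To make these definitions concrete, here is a minimal sketch using the open-source pyannote.metrics toolkit; all timestamps and speaker labels below are invented for illustration:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Ground truth: who spoke when (made-up timestamps)
reference = Annotation()
reference[Segment(0, 10)] = "alice"
reference[Segment(12, 20)] = "bob"
reference[Segment(20, 25)] = "alice"

# System output: the turn at [20, 22] is attributed to the wrong
# speaker (confusion), [22, 25] is not detected at all (missed
# detection), and [26, 28] contains no speech (false alarm).
hypothesis = Annotation()
hypothesis[Segment(0, 10)] = "spk1"
hypothesis[Segment(12, 22)] = "spk2"
hypothesis[Segment(26, 28)] = "spk1"

metric = DiarizationErrorRate()
components = metric(reference, hypothesis, detailed=True)
# -> 2 s of confusion, 3 s of missed detection, 2 s of false alarm,
#    over 23 s of speech: DER = 7 / 23 ≈ 30.4%
print(components)
```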

Diarization error rate

In this benchmark, we report the diarization error rate (or DER), which is defined as the aggregated duration of all three types of errors divided by the total duration of speech in the recording:
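
$$\mathrm{DER} = \frac{\text{speaker confusion} + \text{missed detection} + \text{false alarm}}{\text{total duration of speech}}$$

For instance, a recording containing 100 seconds of speech on which a system produces 5 seconds of speaker confusion, 3 seconds of missed detection, and 2 seconds of false alarm has a DER of (5 + 3 + 2) / 100 = 10%.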

Within a single processing pipeline, pyannoteAI STT Orchestration produces speaker-attributed transcription in one unified workflow.

Benchmark Results (lower is better)

Broadcast Interview - Radio interview speech.

Clinical - Clinical child assessment interviews.

Courtroom - Formal multi-speaker legal speech.

Conversational telephone speech - Two-speaker telephone conversations.

Map task - Task-oriented dyadic dialogue.

Meeting - Spontaneous multi-speaker meetings.

Restaurant - Noisy informal group conversations.

Sociolinguistic (field) - Field sociolinguistic interviews.

Sociolinguistic (lab) - Controlled sociolinguistic interviews.

Web video - Diverse online video speech.

Evaluation Datasets

DIHARD Broadcast

DIHARD Clinical

DIHARD Court

DIHARD CTS

DIHARD Maptask

DIHARD Meeting

DIHARD Restaurant

DIHARD Socio Field

DIHARD Socio Lab

DIHARD Webvideo

Models

pyannoteAI - Precision-2

pyannoteAI - OSS Community-1

AssemblyAI - Universal

Deepgram - Nova-3

ElevenLabs - Scribe-v1

Soniox - STT-async-preview-v1

Speechmatics - Enhanced

OpenAI - GPT-4o-transcribe-diarize

AWS - Transcribe, word-level

NVIDIA - OSS NeMo streaming sortformer (very high latency)

Benchmark Report Methodology

10 distinct areas of use

259 recordings

9.3% overlapped speech

≈67 hours of challenging, multi-domain audio

Methodology

We compare speaker diarization pipelines on a wide range of benchmarks, covering various acoustic conditions and conversation setups.

We rely on the pyannote.metrics open-source evaluation toolkit, which has become the de facto reference implementation in both academia and industry.
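
pyannote.metrics accumulates error components across recordings, so a single corpus-level DER can be reported: total errors divided by total speech, rather than an average of per-file scores. A minimal sketch, assuming `corpus` is a placeholder that yields one (reference, hypothesis) pair per recording:

```python
from pyannote.metrics.diarization import DiarizationErrorRate

metric = DiarizationErrorRate()

# `corpus` is an assumption: it yields one (reference, hypothesis)
# pair of pyannote.core.Annotation objects per recording,
# e.g. parsed from RTTM files or API responses.
for reference, hypothesis in corpus:
    metric(reference, hypothesis)  # accumulates error components

# abs() returns the corpus-level aggregate, not a per-file average.
print(f"DER = {abs(metric):.1%}")
```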

Commercial APIs were accessed via provider endpoints. Open-source models were evaluated using self-hosted instances. We did not provide the number of speakers to any of the systems.
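
For the open-source baseline, self-hosting might look like the sketch below; the checkpoint identifier and the Hugging Face token are assumptions, so check the pyannote documentation for the current model name:

```python
from pyannote.audio import Pipeline

# Checkpoint name is an assumption -- verify the current
# community release on Hugging Face before using it.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder
)

# The number of speakers is deliberately NOT passed,
# matching the benchmark protocol described above.
diarization = pipeline("audio.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")
```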

Results

For pyannoteAI open-source and commercial models, we made sure not to leak any test data into our training sets. For other providers, there is no such guarantee, since they do not disclose this information.