
Setting a new standard with Precision-2


By Hervé Bredin (Chief Science Officer, pyannoteAI) and Antoine Laurent (founding pyannoteAI labs researcher)


A few months ago, we quietly released the first version of pyannoteAI's flagship speaker diarization model, built on top of a decade of academic research. Its performance did not go unnoticed: Precision-1 is now used by AI meeting notetakers (where correct speaker assignment is critical), the video dubbing industry (where precise speech turn boundaries and cross-talk detection are paramount), call center monitoring, and medical scribes.

Precision-1 was 20% more accurate (in diarization error rate) than the state of the art at the time, and 2x faster than our open-source speaker diarization toolkit pyannote -- but that was just the beginning.

Based on early customer feedback, researchers at pyannoteAI labs have been hard at work over the last 6 months, and we are very happy to release the second iteration of our flagship speaker diarization model. Precision-2 brings much better accuracy out of the box, as well as much more flexibility for advanced use cases.


Setting a new standard for speaker diarization

For more than a decade, our open-source toolkit pyannote.audio (8k+ stars on GitHub) and its pretrained models (50 million monthly downloads on Hugging Face) have been the de facto standard for speaker diarization and speaker-attributed transcription.

The Precision series sets a new standard. Thanks to the support of the French "Jean Zay" supercomputer managed by GENCI, we were able to train large models with more training data and more compute than ever before. The benchmark below speaks for itself: Precision-2 is 14% more accurate than Precision-1 and 28% more accurate than the open-source pyannote.audio 3.1 model.

Speaker diarization systems can make three types of errors:

  • speaker confusion, when a speech turn is assigned to the wrong speaker;

  • missed detection, when single-speaker or overlapping speech regions are not detected;

  • false alarm, when a non-speech region is marked as speech.

Precision-2 improves on every one of those three aspects.
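
For reference, the diarization error rate mentioned above simply aggregates these three errors: DER = (missed detection + false alarm + speaker confusion) / total duration of reference speech, so improving all three components directly lowers the overall error rate.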

70% accuracy in predicting exact speaker counts across our most demanding benchmark

Precision-2 keeps a single speaker identity consistent across turns — so “B” stays “B” throughout the conversation (see below).

We are therefore happy to report that Precision-2 predicts the correct number of speakers for 70% of the files in our most difficult internal benchmark (250+ files with 2 to 10 speakers), while Precision-1 barely reached 50%. This translates into a 37% relative reduction in speaker confusion rate... and a huge cost reduction for use cases involving human correction.

Major leap forward in timestamp precision and cross-talk detection

Video dubbing companies love our flagship Precision-1 model for its very accurate timestamps and its state-of-the-art cross-talk (overlapped speech) detection. Accurate timestamps make the alignment of translated audio and lip-synced video much better, while cross-talk detection makes it possible to clean up large voice cloning training datasets and keep only clean single-speaker speech for training TTS models.

We are happy to report that Precision-2 brings an additional 5% relative improvement on this particular aspect over Precision-1, or 15% relative improvement over our most popular open-source speaker diarization pipeline.




Giving you more control

On top of raw out-of-the-box performance improvement, Precision-2 also comes with a whole lot of new features and quality-of-life improvements for the developers using our pyannoteAI Speaker Intelligence platform.

Reconciliation of speech-to-text/diarization timestamps

Speaker-attributed transcription is one of the main use cases of speaker diarization.

Developers typically use open-source models like OpenAI Whisper or NVIDIA Parakeet to obtain a sequence of words, and rely on pyannote speaker diarization to assign each word to a speaker. We got lots of feedback from users mentioning that the timestamps returned by pyannoteAI are much more precise than the ones returned by STT providers -- making their reconciliation trickier than it sounds at first.

In particular, the Precision series is really good at pinpointing speaker interruptions and short backchannels (ok, hmmmm, yes), while most STT providers completely overlook those regions. For instance, in the example below, STT providers are very likely not to transcribe "definitely".

Additionally, in this example, if users rely solely on STT word timestamps to project words onto the diarization output, they will not even be able to tell who says "that's what", because the speakers overlap.

Precision-2 now supports an exclusive boolean flag to return a speaker diarization output where only one speaker (the one most likely to be transcribed) is active at a time.

curl --request POST \
  --url https://api.pyannote.ai/v1/diarize \
  --header 'Authorization: Bearer <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
  "url": "https://example.com/audio.wav",
  "exclusive": true,
}'

This should make it easier to reconcile speech-to-text and diarization timestamps. Let us know how useful this feature is!
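
To make that reconciliation concrete, here is a minimal Python sketch of one common approach: assign each STT word to the exclusive diarization segment it overlaps most. The Word and Segment structures and the timestamps are illustrative only, not the exact schema returned by the pyannoteAI API or by any STT provider.

from dataclasses import dataclass

@dataclass
class Segment:          # illustrative stand-in for one exclusive diarization turn
    start: float
    end: float
    speaker: str

@dataclass
class Word:             # illustrative stand-in for one STT word with timestamps
    start: float
    end: float
    text: str

def assign_speakers(words, segments):
    """Assign each word to the speaker whose segment overlaps it the most."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for seg in segments:
            overlap = min(word.end, seg.end) - max(word.start, seg.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = seg.speaker, overlap
        labeled.append((best_speaker, word.text))
    return labeled

# Made-up example: two exclusive speaker turns and three words
segments = [Segment(0.0, 2.1, "SPEAKER_00"), Segment(2.1, 3.4, "SPEAKER_01")]
words = [Word(0.2, 0.6, "I"), Word(0.6, 1.1, "think"), Word(2.2, 2.9, "definitely")]
print(assign_speakers(words, segments))
# [('SPEAKER_00', 'I'), ('SPEAKER_00', 'think'), ('SPEAKER_01', 'definitely')]

Because the exclusive output never contains two active speakers at the same instant, this projection stays unambiguous even around interruptions like the "that's what" example above.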


Set the number of speakers

By default, Precision models detect the number of speakers automatically. Precision-1 only allowed forcing exactly one speaker (for voice activity detection purposes) or exactly two speakers (for phone calls). This limit no longer holds for Precision-2: you can now force any number of speakers.

Precision-2 now brings even more flexibility. One can impose a lower bound on the number of speakers (for instance, set minSpeakers to 2 for a patient-doctor consultation where a parent might accompany their child), an upper bound (set maxSpeakers to the number of invited participants when transcribing an in-person meeting, with no guarantee that they will all attend), or a combination of both.

curl --request POST \
  --url https://api.pyannote.ai/v1/diarize \
  --header 'Authorization: Bearer <your-api-key>' \
  --header 'Content-Type: application/json' \
  --data '{
  "url": "https://example.com/audio.wav",
  "minSpeakers": 2,
  "maxSpeakers": 10, 
}'
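
Note that setting minSpeakers and maxSpeakers to the same value should effectively pin the speaker count to that exact number, for example both to 2 for a two-person phone call.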

Human-in-the-loop correction

Last, but not least, for advanced use cases where human correction is needed (e.g. 100% correct verbatim transcription of courtroom recordings), we now expose a new type of confidence score that can help streamline and speed up the manual correction process.

{
  "start": 10.2,
  "end": 15.8,
  "speaker": "SPEAKER_02",      // most likely speaker
  "confidence": {               // speaker probabilities
    "SPEAKER_00": 12,
    "SPEAKER_01": 45,
    "SPEAKER_02": 87,
    "SPEAKER_03": 24
  }
}
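
As a rough illustration of how these scores can drive a human-in-the-loop workflow, the Python sketch below flags a segment for review when the top speaker score is low or when the runner-up is close behind. The thresholds are arbitrary placeholders, and the dictionary mirrors the example above rather than a full API response.

def needs_review(segment, min_score=60, min_margin=20):
    """Flag a segment when the best speaker score is low or a competitor is close."""
    scores = sorted(segment["confidence"].values(), reverse=True)
    top = scores[0]
    runner_up = scores[1] if len(scores) > 1 else 0
    return top < min_score or (top - runner_up) < min_margin

segment = {
    "start": 10.2,
    "end": 15.8,
    "speaker": "SPEAKER_02",
    "confidence": {"SPEAKER_00": 12, "SPEAKER_01": 45, "SPEAKER_02": 87, "SPEAKER_03": 24},
}
print(needs_review(segment))  # False: 87 is well ahead of the runner-up (45)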
