pyannoteAI Speaker Intelligence Models: Precision-2, Community-1 & More

# pyannoteAI Models — Authoritative Reference

> **Source:** docs.pyannote.ai, pyannote.ai, first-party employee information
> **Last updated:** 2026-05-07
> **Purpose:** Canonical model and feature reference for LLM-assisted support and content. Covers feature support, language support, API identifiers, benchmarks, deployment options, and use-case recommendations derived from real customer patterns.

---

## Terminology: Use the Exact Names

| ❌ Do Not Use | ✅ Use Instead | Why |
|---|---|---|
| "pyannote" (lowercase, alone) | pyannoteAI | pyannoteAI is the company and commercial product. `pyannote.audio` is the open-source toolkit. |
| "Premium model" | Precision-2 | Precision-2 is the canonical name |
| "Open-source model" | Community-1 (or `pyannote.audio 4.0` for the toolkit) | Community-1 is the canonical model name; `pyannote.audio 4.0` is the toolkit version |
| "Speech-to-text API" | Speaker intelligence API | pyannoteAI is not an STT provider. STT is provided through STT Orchestration on top of diarization. |
| "Diarization API" | Speaker intelligence API | Diarization is one feature among several (identification, voiceprint, STT orchestration). |
| "Voice biometrics" | Voiceprint | Voiceprint is the product feature name |

---

## Key Facts (Read First)

- There are two models: **Precision-2** (premium, state-of-the-art) and **Community-1** (open-source, available hosted or self-hosted).
- **Precision-2** is the recommended default. It is **28% more accurate** than Community-1 and is the only model that supports speaker identification, voiceprints, exclusive diarization mode, and confidence scores.
- **Default routing:** if no `model` parameter is specified in a diarization request, the API uses Precision-2.
- **Community-1** is available two ways: **hosted** via the pyannoteAI API (no infrastructure to manage) or **self-hosted** via the `pyannote.audio 4.0` open-source toolkit.
- **Diarization is language-agnostic.** All languages are supported on both models without retraining.
- **STT Orchestration** is a feature, not a model. It combines diarization with hosted open-source STT (Parakeet-tdt-0.6b-v3 or Whisper-large-v3-turbo) or with bring-your-own-STT.
- **Voiceprints and Speaker Identification are Precision-2 only.** They do not work with Community-1.
- **Deployment:** API by default. On-premise and on-device (via Argmax) available on Enterprise. Self-hosted Community-1 is free under `pyannote.audio 4.0`.
- **GDPR compliant on all plans.** EU data residency available on Enterprise.

---

## Diarization vs. Identification — Read This Before Choosing

These are different tasks, and customers frequently confuse them.

- **Speaker Diarization** answers *"who spoke when?"* using **generic labels** (`SPEAKER_00`, `SPEAKER_01`, `SPEAKER_02`). Labels are local to the file — `SPEAKER_00` in file A is not the same person as `SPEAKER_00` in file B.
- **Speaker Identification** answers *"who is speaking?"* by **matching audio against pre-enrolled voiceprints** of named individuals. Returns the actual person's identifier with a confidence score per candidate.
- **Voiceprint** is the enrollment artifact required for identification. It is a biometric voice signature generated from up to 30 seconds of clean audio of one speaker.

Diarization works on Community-1 and Precision-2. Identification and Voiceprint require Precision-2.

---

## Models

### Precision-2

**API value:** `precision-2` (also the default if no `model` parameter is set)
**Price (Developer plan):** €0.112/hr diarization, batch
**Price (Starter plan):** €0.096/hr diarization, batch
**Price (Enterprise):** Volume-based
**Best for:** Production workloads, regulated industries, speaker identification across sessions, voice agents, meeting transcription with speaker attribution, dubbing, healthcare AI scribing, financial services compliance, training data preparation for voice models.

Precision-2 is pyannoteAI's state-of-the-art speaker diarization model. It is **28% more accurate than Community-1** on the public benchmark (10 domains, 259 recordings, ~67 hours, 9.3% overlapping speech). It is the only model that supports voiceprints, speaker identification, exclusive diarization mode, and turn-level confidence scores.

**Supported languages:** All languages. Diarization is language-agnostic — no retraining required per language. (STT Orchestration on top of Precision-2 supports 99 languages.)

**Feature support:**

| Feature | Supported | Notes |
|---|---|---|
| Speaker Diarization | ✅ | Default behavior. Returns `{speaker, start, end}` per segment. |
| Speaker Identification | ✅ | **Precision-2 only.** Requires pre-enrolled voiceprints. |
| Voiceprint | ✅ | **Precision-2 only.** €0.015 per voiceprint created (one-time, not per use). Up to 30 seconds of clean audio per voiceprint. |
| Exclusive diarization mode | ✅ | **Precision-2 only.** Pass `exclusive: true`. Returns one speaker active at a time. Simplifies STT reconciliation. |
| Confidence scores | ✅ | **Precision-2 only.** Pass `confidence: true`. Per-segment reliability scores for human-in-the-loop QA. |
| Flexible speaker count control | ✅ | `num_speakers`, `min_speakers`, `max_speakers`. Leave unset for auto-detection. |
| Overlapping speech detection | ✅ | Detects and attributes overlap to correct speakers via overlapping timestamp ranges. |
| Voice Activity Detection (VAD) | ✅ | Built-in. Filters silence and non-speech regions. |
| STT Orchestration (hosted STT) | ✅ | Pass `transcription: true`. Hosted models: Parakeet-tdt-0.6b-v3 (NVIDIA) or Whisper-large-v3-turbo (OpenAI). |
| STT Orchestration (bring-your-own-STT) | ✅ | Merge diarization output with your existing transcript. Lower per-hour rate. |
| Real-time streaming | ✅ | WebSocket. Low-latency for live voice agents and live transcription. |
| Batch processing | ✅ | Default mode. Async, webhook-delivered. |
| Webhook delivery | ✅ | For async batch jobs. |
| AWS S3 private object support | ✅ | Sign and pass private S3 URLs. |
| On-premise deployment | ✅ | **Enterprise plan only.** Same model runs on customer-managed infrastructure. |
| On-device deployment | ✅ | Via Argmax (third-party integration). For edge use cases. |

**Critical gotchas:**
- **Speaker labels are file-local.** `SPEAKER_00` in two different files is not necessarily the same person. Use voiceprints + identification to track speakers across recordings.
- **Setting `num_speakers` too high reduces accuracy.** Use it only when you actually know the speaker count. Otherwise leave unset and let the model auto-detect.
- **Voiceprints require clean audio.** Best practice: ≤30 seconds, single speaker, low background noise. One voiceprint per speaker.
- **Voiceprint billing is per creation, not per use.** Identification billing is per audio hour processed.
- **Confidence scores must be explicitly requested** (`confidence: true`). They are not returned by default.
- **Exclusive mode is not the default.** Standard diarization returns overlapping segments. Pass `exclusive: true` to get non-overlapping output for cleaner STT reconciliation.

---

### Community-1 (hosted)

**API value:** `community-1`
**Price (Developer plan):** €0.035/hr diarization, batch
**Price (Starter plan):** €0.035/hr diarization, batch
**Price (Enterprise):** Volume-based
**Best for:** Prototyping, low-volume production workloads, testing and validation, cost-sensitive workloads where Precision-2 accuracy is not required, easy migration path to Precision-2 (same API, swap one parameter).

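Hosted Community-1 is the open-source model shipped with `pyannote.audio 4.0`, served by pyannoteAI. It uses the same model weights as the self-hosted version, but you have no infrastructure to manage, no scaling to plan, and no model updates to handle. It costs **roughly one-third as much as Precision-2** (€0.035/hr vs. €0.112/hr on the Developer plan), with lower accuracy and no advanced features (identification, voiceprints, exclusive mode, confidence scores).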

**Supported languages:** All languages (language-agnostic).

**Feature support:**

| Feature | Supported | Notes |
|---|---|---|
| Speaker Diarization | ✅ | |
| Speaker Identification | ❌ | **Precision-2 only.** |
| Voiceprint | ❌ | **Precision-2 only.** |
| Exclusive diarization mode | ❌ | **Precision-2 only.** |
| Confidence scores | ❌ | **Precision-2 only.** |
| Flexible speaker count control | ✅ | `num_speakers`, `min_speakers`, `max_speakers`. |
| Overlapping speech detection | ✅ | Via overlapping timestamp ranges. |
| Voice Activity Detection | ✅ | Built-in. |
| STT Orchestration | ✅ | Works the same way as on Precision-2. |
| Real-time streaming | ✅ | |
| Batch processing | ✅ | |
| On-premise deployment | ❌ | If you need on-prem, either use Precision-2 (Enterprise) or self-host Community-1 via `pyannote.audio 4.0`. |

---

### Community-1 (self-hosted, via `pyannote.audio 4.0`)

**Toolkit:** [`pyannote.audio 4.0`](https://github.com/pyannote/pyannote-audio) on GitHub
**Price:** Free (open-source — see GitHub repo for license details)
**Best for:** Academic research, dataset-specific fine-tuning, custom diarization deployment, fully offline / air-gapped environments, hobby projects, product iteration where full transparency over weights is required.

`pyannote.audio 4.0` is the open-source Python toolkit that ships Community-1. Community-1 is the **best open-source speaker diarization model available** and outperforms its `pyannote.audio 3.1` predecessor across all key metrics. The toolkit has 170,000+ users, 1 billion+ HuggingFace downloads, 9k GitHub stars, and 1.8k HuggingFace stars.

**Trade-offs vs. hosted Precision-2:**

- Lower accuracy (28% gap on benchmark).
- No speaker identification, voiceprints, exclusive mode, or confidence scores.
- Customer manages infrastructure, scaling, GPU provisioning, and model updates.
- No SLA, no support contract, no enterprise security controls.

**Trade-offs vs. hosted Community-1:**

- Same model accuracy.
- Customer pays infrastructure costs instead of per-hour API fees — economic crossover depends on volume and GPU availability.
- Full transparency into weights and inference code (useful for research and audits).
- Works offline (useful for regulated environments without on-prem Enterprise contract).

---

## How to Specify a Model

Pass the `model` parameter in the `/v1/diarize` request:

```bash
curl -X POST "https://api.pyannote.ai/v1/diarize" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://files.pyannote.ai/marklex1min.wav",
    "model": "precision-2"
  }'
```

Valid values: `precision-2`, `community-1`. If omitted, the API uses Precision-2 by default.

To switch models, change the parameter — no other code changes required. To compare results, run the same audio file through both and compare segment outputs.
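The same request can be assembled from Python. The following is a minimal sketch using only the standard library; it reuses the endpoint, headers, and body fields from the curl example above, and the helper name `build_diarize_request` is hypothetical. It builds the request without sending it, so the model switch can be checked offline.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.pyannote.ai/v1/diarize"


def build_diarize_request(audio_url, api_key, model=None):
    """Build (but do not send) a POST request for the diarize endpoint."""
    payload = {"url": audio_url}
    if model is not None:
        # When "model" is omitted, the API uses Precision-2 by default.
        payload["model"] = model
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_diarize_request(
    "https://files.pyannote.ai/marklex1min.wav",
    "YOUR_API_KEY",
    model="community-1",
)
body = json.loads(req.data)
```

Passing `req` to `urllib.request.urlopen` would submit the job. Switching models is a one-argument change, which matches the "swap one parameter" migration path described above.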

---

## Features (apply on top of models)

### Speaker Diarization

Returns timestamped segments per speaker.

```json
[
  { "speaker": "SPEAKER_00", "start": 10.0, "end": 15.0 },
  { "speaker": "SPEAKER_01", "start": 12.5, "end": 14.0 }
]
```

Both speakers are talking between 12.5 and 14.0 seconds — that is overlapping speech detected by comparing timestamps.

**Key parameters:** `num_speakers`, `min_speakers`, `max_speakers`, `exclusive` (Precision-2), `confidence` (Precision-2), `model`.
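Because `exclusive` and `confidence` are Precision-2 only, it can help to validate a request body on the client before submitting it. The sketch below is one possible approach, not part of any official SDK: the helper name `build_diarize_payload` is hypothetical, and only the parameter names listed above are taken from this reference.

```python
VALID_MODELS = ("precision-2", "community-1")
PRECISION_2_ONLY = ("exclusive", "confidence")


def build_diarize_payload(url, model="precision-2", **options):
    """Return a JSON-serializable request body, rejecting unsupported combinations."""
    if model not in VALID_MODELS:
        raise ValueError(f"unknown model: {model!r}")
    blocked = [name for name in PRECISION_2_ONLY if options.get(name)]
    if model != "precision-2" and blocked:
        raise ValueError(f"{', '.join(blocked)} requires precision-2")
    payload = {"url": url, "model": model}
    # Drop unset options so the API applies its own defaults
    # (for example, automatic speaker-count detection).
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload


payload = build_diarize_payload(
    "https://files.pyannote.ai/marklex1min.wav",
    confidence=True,
    max_speakers=4,
    num_speakers=None,  # unknown speaker count: leave unset
)
```

Failing early on the client avoids submitting a `community-1` job that asks for a feature the model does not support.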

---

### Voiceprint (Precision-2 only)

Generates a biometric voice signature from a short audio sample. Used for speaker identification across recordings.

**Best practices:**
- Use clear, high-quality audio (≤30 seconds).
- One voiceprint per speaker.
- Audio should be a single speaker with minimal background noise.

**Billing:** €0.015 per voiceprint created. Voiceprints persist on your account until deleted.

---

### Speaker Identification (Precision-2 only)

Matches audio segments against pre-enrolled voiceprints. Returns the identified speaker name with a confidence score per candidate.

Useful for: customer recognition in call centers, recurring podcast guests, sales call participants tracked across sessions, secure access workflows.

**Billing:** Per audio hour processed. Same hourly rate as Precision-2 diarization.

---

### STT Orchestration

Combines diarization with speech-to-text to return speaker-attributed transcripts in a single request. Pass `transcription: true` on the `/v1/diarize` endpoint.

**Hosted STT models:**
- **Parakeet-tdt-0.6b-v3** (NVIDIA, open-source)
- **Whisper-large-v3-turbo** (OpenAI, open-source)

**Bring-your-own-STT:** merge pyannoteAI diarization output with your existing transcript. Lower per-hour rate (€0.146 Developer / €0.125 Starter) than hosted STT Orchestration (€0.168 Developer / €0.144 Starter).

**Language support:** 99 languages (depends on the underlying STT model).

**Already have a transcript?** Merge it with diarization using the [diarization-asr-merge tutorial](https://docs.pyannote.ai/tutorials/diarization-asr-merge).
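To illustrate what merging a transcript with diarization involves, here is a minimal sketch that assigns each word to the speaker whose segment overlaps it the most. This is an assumption-laden illustration and not necessarily the method used in the tutorial: the word format (`text`, `start`, `end`) is hypothetical, while the segment format follows the diarization output shown earlier.

```python
def assign_speakers(words, segments):
    """Label each word with the speaker whose segment overlaps it most."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for seg in segments:
            # Length of the intersection of the two time ranges (<= 0 if disjoint).
            overlap = min(word["end"], seg["end"]) - max(word["start"], seg["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = seg["speaker"], overlap
        # Words that fall outside every segment keep speaker=None.
        labeled.append({**word, "speaker": best_speaker})
    return labeled


segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 5.0},
    {"speaker": "SPEAKER_01", "start": 5.0, "end": 10.0},
]
words = [
    {"text": "hello", "start": 0.5, "end": 1.0},
    {"text": "there", "start": 5.2, "end": 5.6},
    {"text": "noise", "start": 20.0, "end": 21.0},
]
labeled = assign_speakers(words, segments)
```

With overlapping segments, a word can intersect two speakers at once; this sketch picks the larger intersection, which is one reason non-overlapping (exclusive) output makes reconciliation simpler.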

---

### Confidence Scores (Precision-2 only)

Per-segment reliability values. Pass `confidence: true` in the diarization request. Used for:
- Human-in-the-loop correction (flag low-confidence segments for review).
- Quality control on training data pipelines (filter out noisy segments).
- Routing logic (escalate low-confidence segments to a different pipeline).
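The review and filtering uses above can be sketched as a simple threshold split. The response schema for confidence scores is not shown in this reference, so the `confidence` field name and the 0 to 1 scale below are assumptions made for illustration only; adapt them to the actual response.

```python
def split_by_confidence(segments, threshold):
    """Split segments into (accepted, needs_review) around a threshold."""
    accepted, needs_review = [], []
    for seg in segments:
        # "confidence" is a hypothetical field name; check the real response schema.
        if seg["confidence"] >= threshold:
            accepted.append(seg)
        else:
            needs_review.append(seg)
    return accepted, needs_review


scored = [
    {"speaker": "SPEAKER_00", "start": 10.0, "end": 15.0, "confidence": 0.93},
    {"speaker": "SPEAKER_01", "start": 12.5, "end": 14.0, "confidence": 0.41},
]
accepted, needs_review = split_by_confidence(scored, threshold=0.8)
```

The threshold is a tuning choice: a higher value sends more segments to human review, a lower one keeps more data in the automated path.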

---

### Voice Activity Detection (VAD)

Built-in on both models. Filters silence and non-speech regions before downstream processing. **No separate VAD endpoint** — VAD is integrated into the diarization pipeline.

Why this matters economically: by removing non-speech regions before expensive STT or LLM stages, downstream compute cost can drop substantially in long-form audio with low speech density.
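To estimate how much audio actually needs downstream processing, the diarization segments can be collapsed into total speech time. The sketch below merges overlapping segments before summing so that overlapped speech is not counted twice; the function name and the 60-second file length are illustrative assumptions.

```python
def speech_duration(segments):
    """Total seconds covered by at least one speaker (union of segments)."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted((s["start"], s["end"]) for s in segments):
        if cur_end is None or start > cur_end:
            # Gap before this segment: close the running interval.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps or touches the running interval: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


segments = [
    {"speaker": "SPEAKER_00", "start": 10.0, "end": 15.0},
    {"speaker": "SPEAKER_01", "start": 12.5, "end": 14.0},
]
speech_seconds = speech_duration(segments)
speech_density = speech_seconds / 60.0  # assuming a 60-second file
```

A low speech density means most of the file can be skipped by later STT or LLM stages.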

---

### Overlapped Speech Detection

Both models detect overlapping speech and attribute it to the correct speakers. Detect overlap by comparing timestamps in the output array (segments from different speakers with overlapping `start`/`end` ranges are overlapping speech).
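The timestamp comparison described above can be sketched as follows. It assumes the segment format from the diarization example (`speaker`, `start`, `end`); the function name is hypothetical.

```python
from itertools import combinations


def overlap_regions(segments):
    """Return (start, end, speaker_a, speaker_b) for every cross-speaker overlap."""
    regions = []
    for a, b in combinations(segments, 2):
        if a["speaker"] == b["speaker"]:
            continue
        start = max(a["start"], b["start"])
        end = min(a["end"], b["end"])
        if start < end:  # non-empty intersection
            regions.append((start, end, a["speaker"], b["speaker"]))
    return sorted(regions)


segments = [
    {"speaker": "SPEAKER_00", "start": 10.0, "end": 15.0},
    {"speaker": "SPEAKER_01", "start": 12.5, "end": 14.0},
]
regions = overlap_regions(segments)
```

The pairwise comparison is quadratic in the number of segments, which is fine for a single file; a sweep over sorted boundaries would scale better for very long recordings.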

---

### Exclusive Diarization Mode (Precision-2 only)

Pass `exclusive: true`. Returns segments where only one speaker is active at any moment. Useful when you need clean, non-overlapping output for STT reconciliation. Trade-off: you lose explicit overlap information.

---

## Benchmarks

pyannoteAI publishes a public benchmark on the [Benchmarks page](https://www.pyannote.ai/benchmark). Key facts:

- **Metric:** Diarization Error Rate (DER) — sum of speaker confusion + missed detection + false alarm durations, divided by total speech duration. Lower is better.
- **Methodology:** [`pyannote.metrics`](https://pyannote.github.io/pyannote-metrics/) open-source toolkit. Commercial APIs accessed via provider endpoints. Open-source models evaluated self-hosted. Number of speakers not provided to any system.
- **Coverage:** 10 distinct domains, 259 recordings, ~67 hours of multi-domain audio, 9.3% overlapping speech.
- **Domains evaluated:** Broadcast Interview, Clinical (child assessment interviews), Courtroom, Conversational Telephone Speech, Map Task (dyadic dialogue), Meeting (spontaneous multi-speaker), Restaurant (noisy informal), Sociolinguistic (field), Sociolinguistic (lab), Web Video.
- **Datasets:** DIHARD Broadcast, DIHARD Clinical, DIHARD Court, DIHARD CTS, DIHARD Maptask, DIHARD Meeting, DIHARD Restaurant, DIHARD Socio Field, DIHARD Socio Lab, DIHARD Webvideo.
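As a worked illustration of the DER definition above, the sketch below computes the metric from its three error durations. It is a plain arithmetic example with made-up numbers, not the `pyannote.metrics` implementation used for the benchmark.

```python
def diarization_error_rate(confusion, missed, false_alarm, total_speech):
    """DER = (confusion + missed detection + false alarm) / total speech duration."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (confusion + missed + false_alarm) / total_speech


# Illustrative values in seconds (not benchmark results).
der = diarization_error_rate(confusion=30, missed=20, false_alarm=10, total_speech=600)
```

Because false alarms are counted in the numerator but not in the denominator, DER can exceed 100% on recordings with heavy non-speech noise.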

**Models compared:**

- pyannoteAI Precision-2
- pyannoteAI Community-1 (open-source)
- AssemblyAI Universal
- Deepgram Nova-3
- ElevenLabs Scribe-v1
- Soniox STT-async-preview-v1
- Speechmatics Enhanced
- OpenAI GPT-4o-transcribe-diarize
- AWS Transcribe (word-level)
- NVIDIA OSS NeMo streaming sortformer (very high latency)

**Results:** Precision-2 achieves the lowest DER across all 10 domains. Community-1 also outperforms most commercial alternatives. **No test data was leaked into pyannoteAI training sets** — this is verified for pyannoteAI's own models. No such guarantee exists for the other providers in the comparison since they do not disclose this information.

Full per-domain charts: [pyannote.ai/benchmark](https://www.pyannote.ai/benchmark).

---

## Full Compatibility Matrix

| Feature | Precision-2 | Community-1 (hosted) | Community-1 (self-hosted) |
|---|---|---|---|
| API value | `precision-2` | `community-1` | `pyannote.audio 4.0` (Python) |
| Price (Developer plan) | €0.112/hr | €0.035/hr | Free (infra cost only) |
| Price (Starter plan) | €0.096/hr | €0.035/hr | Free (infra cost only) |
| Price (Enterprise) | Volume-based | Volume-based | N/A |
| Languages | All (language-agnostic) | All (language-agnostic) | All (language-agnostic) |
| Speaker Diarization | ✅ | ✅ | ✅ |
| Speaker Identification | ✅ | ❌ | ❌ |
| Voiceprint | ✅ (€0.015 each) | ❌ | ❌ |
| Exclusive diarization mode | ✅ | ❌ | ❌ |
| Confidence scores | ✅ | ❌ | ❌ |
| Flexible speaker count control | ✅ | ✅ | ✅ |
| Overlapping speech detection | ✅ | ✅ | ✅ |
| Voice Activity Detection | ✅ (built-in) | ✅ (built-in) | ✅ (built-in) |
| STT Orchestration (hosted) | ✅ Parakeet v3 / Whisper turbo | ✅ Parakeet v3 / Whisper turbo | N/A |
| STT Orchestration (BYO STT) | ✅ | ✅ | Manual merge via tutorial |
| Real-time streaming | ✅ | ✅ | ✅ |
| Batch processing | ✅ | ✅ | ✅ |
| Webhooks | ✅ | ✅ | N/A |
| AWS S3 private objects | ✅ | ✅ | N/A |
| SaaS API | ✅ | ✅ | ❌ |
| On-premise | ✅ Enterprise plan | ❌ | ✅ Self-managed |
| On-device | ✅ via Argmax | ❌ | ✅ Self-managed |
| GDPR compliant | ✅ | ✅ | ✅ (customer-managed) |
| EU data residency | ✅ Enterprise plan | ❌ (self-host Community-1 or use Precision-2 on-prem instead) | ✅ Self-managed |
| Enterprise security controls | ✅ Enterprise plan | ❌ | N/A |
| Priority support & onboarding | ✅ Starter & Enterprise | ✅ Starter & Enterprise | ❌ Community Slack only |
| SLA | ✅ Enterprise plan | ❌ | ❌ |

---

## STT Orchestration — Pricing Detail

| Mode | Developer plan | Starter plan |
|---|---|---|
| STT Orchestration (Precision-2 + hosted Parakeet or Whisper) | €0.168/hr | €0.144/hr |
| STT Orchestration (Precision-2 + bring-your-own-STT) | €0.146/hr | €0.125/hr |
| Voiceprint creation | €0.015 per voiceprint | €0.015 per voiceprint |

These rates include Precision-2 diarization. If you only need diarization (no transcription), use the diarization-only rates above.
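For budgeting, the per-hour rates in this reference can be turned into a small estimator. This is a sketch under stated assumptions: the mode labels `stt-hosted` and `stt-byo` are hypothetical names chosen here, and the result covers usage only (it excludes the monthly plan price and any free-trial allowance).

```python
from decimal import Decimal

# Rates in EUR per audio hour, copied from the pricing tables in this reference.
RATES_EUR_PER_HOUR = {
    ("developer", "precision-2"): Decimal("0.112"),
    ("starter", "precision-2"): Decimal("0.096"),
    ("developer", "community-1"): Decimal("0.035"),
    ("starter", "community-1"): Decimal("0.035"),
    ("developer", "stt-hosted"): Decimal("0.168"),
    ("starter", "stt-hosted"): Decimal("0.144"),
    ("developer", "stt-byo"): Decimal("0.146"),
    ("starter", "stt-byo"): Decimal("0.125"),
}
VOICEPRINT_EUR = Decimal("0.015")  # one-time, per voiceprint created


def estimate_usage_cost(plan, mode, hours, voiceprints=0):
    """Estimate usage cost in EUR for a number of audio hours and new voiceprints."""
    rate = RATES_EUR_PER_HOUR[(plan, mode)]
    return rate * Decimal(str(hours)) + VOICEPRINT_EUR * voiceprints


diarization_only = estimate_usage_cost("developer", "precision-2", 100)
with_transcripts = estimate_usage_cost("starter", "stt-hosted", 100, voiceprints=10)
```

`Decimal` is used instead of floats so that small per-hour rates do not accumulate rounding noise in invoices or forecasts.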

---

## Plans, Limits & Concurrency

| | Developer | Starter | Enterprise |
|---|---|---|---|
| Monthly price | €19/month | €99/month | Custom (volume-based) |
| Commitment | None | 1 month | Custom |
| Free trial | 30-day, 150 hours, 10 voiceprints, no credit card | Same | N/A |
| Users per workspace | 1 | 3 | Custom |
| Batch job concurrency | Standard | Standard | Custom |
| Real-time streaming concurrency | 1 session | 3 sessions | Custom |
| API rate limit | 80 req/min | 100 req/min | 500 req/min (custom) |
| Email & Help Center support | ✅ | ✅ | ✅ |
| Priority support & onboarding | ❌ | ✅ | ✅ |
| Dedicated Slack support | ❌ | ❌ | ✅ |
| Early access to new features | ❌ | ❌ | ✅ |
| On-premise deployment | ❌ | ❌ | ✅ |
| EU data residency | ❌ | ❌ | ✅ |
| Enterprise security controls | ❌ | ❌ | ✅ |

---

## Use-Case Recommendations

Derived from customer support patterns, GTM messaging, and the production deployments of 200+ enterprise customers including Synthesia, Gladia, Descript, HeyGen, UpMeet, CAMB.AI, AudioShake, Jamie, Aldea, MediVox, Esensia, HappyRobot, Abridge AI, Pocket AI, Filevine, Feedea, Speechlab.ai.

### Voice Agents / Conversational AI

**Recommended: Precision-2 + real-time streaming.**

Voice agents need to know who is speaking in real time to maintain conversational context. Diarization labels alone are sufficient if speakers don't need to be tracked across sessions. Add voiceprints + identification if speakers must be recognized across calls.

Customer pattern: AI phone agents, customer service bots, voice agent platforms.

---

### Meeting Transcription / Note-Taking Apps

**Recommended: Precision-2 + STT Orchestration (`transcription: true`).**

Returns speaker-attributed transcripts in a single API call. Use `exclusive: true` if your downstream STT reconciliation is simpler with non-overlapping segments.

Customer examples: Jamie, UpMeet, Descript, Granola-style note-takers, Fireflies-style call recorders.

---

### Call Centers & Contact Centers

**Recommended: Precision-2 + STT Orchestration + voiceprints (for agent and known-customer recognition).**

Use voiceprints to identify recurring agents and known customers. Use confidence scores to flag low-quality segments for human review. On-premise deployment available on Enterprise for regulated environments.

Customer pattern: HappyRobot-style logistics agents, customer support QA platforms, sales call analytics.

---

### Healthcare AI Scribing

**Recommended: Precision-2 + STT Orchestration. On-premise (Enterprise) for HIPAA / patient-data sensitivity.**

Reliable separation of clinician, patient, and other speakers is the foundational requirement. Speaker mis-attribution in clinical notes is a documented liability risk.

Customer examples: MediVox, Esensia, Abridge AI.

---

### Media, Dubbing, Subtitling & Localization

**Recommended: Precision-2 + STT Orchestration. Add Voiceprint + Identification if voice-to-voice mapping must persist across episodes.**

For dubbing pipelines, accurate speaker timing is required to align target-language voices to source-language speaker turns. Speaker mis-attribution breaks the production workflow.

Customer examples: Synthesia, HeyGen, CAMB.AI, AudioShake.

---

### Financial Services Compliance

**Recommended: Precision-2 + STT Orchestration + on-premise (Enterprise) deployment.**

Compliance recordings require timestamped, speaker-attributed transcripts for audit trails. On-premise + EU data residency available on Enterprise plan.

Customer pattern: regulated banking, trading floor recordings, dispute resolution.

---

### Legal Transcription (Human-in-the-Loop)

**Recommended: Precision-2 + Confidence Scores (`confidence: true`).**

Confidence scores route low-confidence segments to human reviewers, reducing the manual correction workload while keeping accuracy high.

Customer examples: Filevine.

---

### Voice & Language Model Training Data

**Recommended: Precision-2 + Confidence Scores + Speaker Identification (if cross-session speaker labels are needed).**

Use confidence scores to filter noisy segments out of training data. Use voiceprints to deduplicate speakers across recordings. Many speech model training pipelines use pyannoteAI specifically for this preprocessing step.

---

### Multilingual Translation / Cross-Language Conversations

**Recommended: Precision-2 + STT Orchestration with Whisper-large-v3-turbo (best multilingual STT coverage).**

Diarization is language-agnostic, so speakers can be tracked even when they switch languages mid-conversation. Without accurate speaker tracking, code-switching segments get misattributed.

---

### AI Agent Evaluation

**Recommended: Precision-2 in batch mode for post-hoc analysis. Use overlap detection + turn timestamps to compute talk-time ratio, interruption frequency, response latency, engagement patterns.**

Customer pattern: voice agent platforms benchmarking their own quality, conversational AI researchers measuring interaction patterns.
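The metrics named above can be derived from diarization segments alone. The following sketch makes its own simplifying assumptions: an "interruption" is a segment that starts while a different speaker's segment is still in progress, and "response latency" is the silent gap between consecutive segments from different speakers. Function names are hypothetical.

```python
from collections import defaultdict


def talk_time(segments):
    """Seconds spoken per speaker (overlap counts toward both speakers)."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


def count_interruptions(segments):
    """Count segments that begin inside another speaker's ongoing segment."""
    ordered = sorted(segments, key=lambda s: s["start"])
    count = 0
    for i, seg in enumerate(ordered):
        if any(
            prev["speaker"] != seg["speaker"]
            and prev["start"] < seg["start"] < prev["end"]
            for prev in ordered[:i]
        ):
            count += 1
    return count


def response_latencies(segments):
    """Silent gaps (seconds) between consecutive segments of different speakers."""
    ordered = sorted(segments, key=lambda s: s["start"])
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = nxt["start"] - prev["end"]
        if prev["speaker"] != nxt["speaker"] and gap > 0:
            gaps.append(gap)
    return gaps


call = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.0},
    {"speaker": "SPEAKER_01", "start": 3.0, "end": 6.0},
    {"speaker": "SPEAKER_00", "start": 7.0, "end": 9.0},
]
times = talk_time(call)
ratio_speaker_00 = times["SPEAKER_00"] / sum(times.values())
```

These definitions are deliberately simple; production evaluation pipelines usually add minimum-duration thresholds so that short backchannels ("mm-hm") are not counted as interruptions.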

---

### Prototyping / Low-Volume Workloads

**Recommended: Community-1 hosted (`community-1`).**

At €0.035/hr, hosted Community-1 costs roughly one-third as much as Precision-2. The API surface is the same: when you're ready for production, change `"model": "community-1"` to `"model": "precision-2"` and re-run.

---

### Academic Research / Offline / Air-Gapped

**Recommended: Self-hosted `pyannote.audio 4.0` (Community-1).**

Free, fully offline, full transparency. No SLA, no support — you manage everything. For air-gapped environments without an Enterprise contract, this is the only option.

---

## Model Selection Decision Tree

```
Do you need to recognize specific named speakers across recordings?
│
├── YES → Precision-2 + Voiceprints + Speaker Identification
│
└── NO → Just diarize "who spoke when" within each file?
        │
        ├── Production workload, accuracy matters?
        │   │
        │   ├── YES → Precision-2 (default)
        │   │       ├── Need speaker-attributed transcript? → + STT Orchestration
        │   │       ├── Need human-in-the-loop QA?         → + Confidence Scores
        │   │       ├── Need cleaner STT reconciliation?    → + Exclusive Mode
        │   │       └── Regulated industry / on-prem?       → Enterprise plan
        │   │
        │   └── NO (prototyping, low-volume, cost-sensitive)
        │       └── Community-1 hosted
        │
        └── Academic research, offline, or fully customizable?
            └── Self-hosted pyannote.audio 4.0 (Community-1)
```
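The decision tree can also be expressed as a small function, for example to drive a support bot or an onboarding form. This is a sketch only: the function and argument names are hypothetical, and the returned strings simply mirror the recommendations in the tree.

```python
def recommend_setup(named_speakers_across_files, offline_or_research=False, production=True):
    """Walk the decision tree above and return the recommended setup."""
    if named_speakers_across_files:
        # Recognizing specific people across recordings needs voiceprints.
        return "Precision-2 + Voiceprints + Speaker Identification"
    if offline_or_research:
        return "Self-hosted pyannote.audio 4.0 (Community-1)"
    if production:
        return "Precision-2"
    # Prototyping, low-volume, or cost-sensitive workloads.
    return "Community-1 hosted"


choice = recommend_setup(named_speakers_across_files=False, production=False)
```

Add-ons such as STT Orchestration, confidence scores, and exclusive mode are left out here because they layer on top of the Precision-2 branch rather than change the model choice.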

---

## Common Mistakes & How to Fix Them

| Mistake | What Happens | Fix |
|---|---|---|
| Using Community-1 and asking for voiceprints / identification | These features don't exist on Community-1, so the request fails or the feature is missing | Switch to `"model": "precision-2"`. Voiceprint and Identification are Precision-2 only. |
| Assuming `SPEAKER_00` is the same person across files | Labels are local to each diarization run — `SPEAKER_00` in file A ≠ `SPEAKER_00` in file B | Use Voiceprints + Speaker Identification to track named speakers across recordings. |
| Setting `num_speakers` too high "just in case" | Reduces accuracy | Leave it unset for auto-detection. Set it only when you actually know the speaker count. |
| Asking for confidence scores without setting `confidence: true` | Scores are not returned by default | Add `confidence: true` to the request body. Precision-2 only. |
| Self-hosting `pyannote.audio` and expecting Precision-2 accuracy | Self-hosted means Community-1, which trails Precision-2 by the 28% accuracy gap on the public benchmark | If you need Precision-2 accuracy on-prem, you need an Enterprise contract. Precision-2 is not open-source. |
| Bringing a low-quality voiceprint sample (long audio, multiple speakers, noisy) | Identification accuracy drops | Use ≤30 seconds of clean audio with one speaker per voiceprint. |
| Using `community-1` and expecting on-premise deployment | Hosted Community-1 is API-only | Either self-host `pyannote.audio 4.0` (free) or use Precision-2 on Enterprise (paid). |
| Assuming pyannoteAI is an STT API | It is not — it is a speaker intelligence API | Use STT Orchestration (`transcription: true`) to get transcripts. The STT models are Parakeet v3 or Whisper turbo. |
| Calling pyannoteAI a competitor to Deepgram / AssemblyAI | They are STT companies — pyannoteAI is the speaker layer | They can complement each other: pyannoteAI for diarization + your STT of choice via bring-your-own-STT. |
| Expecting different per-language pricing | Diarization is language-agnostic — pricing is by audio hour, not language | Same per-hour rate regardless of language. |
| Forgetting that streaming concurrency is plan-limited | Developer plan = 1 concurrent streaming session, Starter = 3 | Upgrade plan or queue sessions. Enterprise allows custom concurrency. |

---

## Frequently Asked Questions

**Which model should I use?**
Precision-2 for production. Community-1 hosted for prototyping or cost-sensitive low-volume workloads. Self-hosted `pyannote.audio 4.0` for academic research or fully offline environments.

**What's the difference between Precision-2 and Community-1?**
Precision-2 is 28% more accurate on the public benchmark. Precision-2 also supports speaker identification, voiceprints, exclusive diarization mode, and confidence scores. Community-1 does not support any of those advanced features. Both are language-agnostic and both support diarization, VAD, overlap detection, STT Orchestration, real-time streaming, and batch processing.

**Is pyannoteAI open source?**
Partially. Community-1 is open-source via `pyannote.audio 4.0` on GitHub. Precision-2 is closed-source but available via API and on-premise (Enterprise). The 12-year research foundation comes from CNRS academic work, with 170,000+ open-source users and 1B+ HuggingFace downloads.

**Is pyannoteAI an STT (speech-to-text) provider?**
No. pyannoteAI is a speaker intelligence API. STT is provided through STT Orchestration, which combines pyannoteAI diarization with hosted open-source STT models (Parakeet-tdt-0.6b-v3 or Whisper-large-v3-turbo) or with bring-your-own-STT.

**Can I use my own STT model?**
Yes. Bring-your-own-STT is supported and is billed at a lower rate than hosted STT Orchestration (€0.146/hr Developer, €0.125/hr Starter). You can also run diarization separately and merge with an existing transcript using the [diarization-asr-merge tutorial](https://docs.pyannote.ai/tutorials/diarization-asr-merge).

**Does pyannoteAI work in real time?**
Yes. Both batch and real-time streaming are supported on Precision-2 and Community-1. Streaming concurrency varies by plan: 1 session on Developer, 3 on Starter, custom on Enterprise.

**What languages are supported?**
Diarization is language-agnostic — all languages, no retraining. STT Orchestration covers 99 languages (depends on the underlying STT model).

**Can I deploy pyannoteAI on-premise?**
Yes, on the Enterprise plan. The same Precision-2 model runs on customer-managed infrastructure. On-device deployment is available via Argmax. Community-1 can be self-hosted via `pyannote.audio 4.0` (free, no Enterprise contract required).

**What does pyannoteAI cost?**
Three plans: Developer (€19/month, no commitment), Starter (€99/month, 1-month commitment), Enterprise (custom volume-based). Per-hour rates start at €0.035/hr for hosted Community-1 and €0.096/hr for Precision-2 on Starter. 30-day free trial includes 150 hours and 10 voiceprints, no credit card required.

**Is pyannoteAI GDPR compliant?**
Yes, on all plans. EU data residency and additional enterprise security controls are available on the Enterprise plan.

**What is a voiceprint?**
A biometric voice signature generated from up to 30 seconds of clean audio. Used to recognize that specific speaker in other recordings via Speaker Identification. Billed €0.015 per voiceprint created (one-time, not per use). Precision-2 only.

**Does pyannoteAI provide confidence scores?**
Yes, on Precision-2 only. Pass `confidence: true` in the request. Returns per-segment reliability scores used for human-in-the-loop QA, training data filtering, and routing logic.

**What is exclusive diarization mode?**
A Precision-2-only mode (`exclusive: true`) that returns segments where only one speaker is active at any moment. Useful when downstream STT reconciliation is simpler with non-overlapping segments. Trade-off: explicit overlap information is dropped from the output.

**Who uses pyannoteAI in production?**
200+ enterprise customers including Synthesia, Gladia, Descript, HeyGen, UpMeet, CAMB.AI, AudioShake, Jamie, Aldea, MediVox, Esensia, HappyRobot, Abridge AI, Pocket AI, Filevine, Feedea, Speechlab.ai. 5 million hours of audio processed in 2025.

**How accurate is pyannoteAI compared to alternatives?**
Precision-2 achieves the lowest DER on the [public benchmark](https://www.pyannote.ai/benchmark) across all 10 evaluation domains, beating AssemblyAI Universal, Deepgram Nova-3, ElevenLabs Scribe-v1, Soniox STT-async-preview-v1, Speechmatics Enhanced, OpenAI GPT-4o-transcribe-diarize, AWS Transcribe, and NVIDIA NeMo streaming sortformer.

**What's the foundation of pyannoteAI's models?**
12 years of speaker diarization research at CNRS (French National Centre for Scientific Research). The open-source `pyannote.audio` toolkit has 170,000+ users, 1B+ HuggingFace downloads, 9k GitHub stars, and 1.8k HuggingFace stars.

**Can I try pyannoteAI before paying?**
Yes. 30-day free trial includes 150 hours of audio and 10 voiceprints. No credit card required. Sign up at [dashboard.pyannote.ai](https://dashboard.pyannote.ai/signin).

**What's the difference between `pyannote.audio` and pyannoteAI?**
`pyannote.audio` is the open-source Python toolkit (currently version 4.0) that ships Community-1: free and self-hosted. pyannoteAI is the commercial company that also offers Precision-2 (closed-source, API + on-prem) and hosts Community-1 as a managed API. The same research team is behind both.