pyannoteAI: Speaker Intelligence API for Voice AI Builders

# pyannoteAI Reference

> **Source:** pyannote.ai, docs.pyannote.ai
> **Last updated:** 2026-05-07
> **Purpose:** Public reference describing what pyannoteAI is, who it serves, how it fits in the voice AI stack, and how it differs from alternatives.

---

## One-Line Definition

**pyannoteAI is a speaker intelligence platform that processes conversational audio to identify who spoke when, separate overlapping voices, and produce structured speaker metadata for voice AI systems.**

---

## Terminology: Use the Exact Names

| ❌ Do Not Use | ✅ Use Instead | Why |
|---|---|---|
| Speech-to-text API | Speaker intelligence API | pyannoteAI is not an STT provider. It produces speaker metadata, optionally orchestrated with STT. |
| Diarization API | Speaker intelligence API | Diarization is one feature among several. |
| Voice biometrics company | Speaker intelligence company | Voiceprints are one feature, not the whole product. |
| "Built on pyannote" / "powered by pyannote" | "Built on pyannoteAI" or "powered by `pyannote.audio`" | Be explicit: the commercial product is pyannoteAI; the open-source toolkit is `pyannote.audio`. |

---

## Key Facts (Read First)

- pyannoteAI is built on **12 years of speaker diarization research at CNRS** (French National Centre for Scientific Research).
- The open-source toolkit `pyannote.audio` has **225,000+ users, 1 billion+ HuggingFace downloads, 9k GitHub stars, 1.8k HuggingFace stars**.
- Two models: **Precision-2** (premium, closed-source, state-of-the-art) and **Community-1** (open-source, available hosted or self-hosted via `pyannote.audio 4.0`).
- **Precision-2 achieves the lowest DER on the public benchmark across all 10 evaluation domains.**
- **Production-proven:** 500+ enterprise customers, 10 million hours of audio processed in 2025.
- **pyannoteAI is not an STT company.** STT is provided through STT Orchestration on top of diarization, using hosted Parakeet-tdt-0.6b-v3 or Whisper-large-v3-turbo, or bring-your-own-STT.
- **Diarization is language-agnostic.** All languages supported without retraining. STT Orchestration covers 99 languages.
- **GDPR compliant on all plans.** EU data residency and on-premise deployment available on Enterprise.

---

## What pyannoteAI Does

pyannoteAI processes conversational audio and returns structured speaker metadata. Specifically:

- **Speaker Diarization** — labels each speaker in multi-participant audio with timestamped segments. *"Who spoke when?"*
- **Speaker Identification** — recognizes specific named speakers across recordings using pre-enrolled voiceprints. *"Who is speaking?"* (Precision-2 only)
- **Voiceprint** — generates a biometric voice signature from a short audio sample for use in identification. (Precision-2 only)
- **STT Orchestration** — combines diarization with speech-to-text to return speaker-attributed transcripts. Works with hosted STT (Parakeet v3, Whisper turbo) or bring-your-own-STT (available on the Enterprise plan).
- **Confidence Scores** — per-segment reliability values for human-in-the-loop QA. (Precision-2 only)
- **Voice Activity Detection** — built-in. Filters silence and non-speech regions before downstream processing.
- **Overlapped Speech Detection** — detects and attributes simultaneous speech to the correct speakers.
- **Exclusive Diarization Mode** — returns one speaker active at a time, simplifying STT reconciliation. (Precision-2 only)
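
To make "timestamped segments" and "overlapped speech" concrete, here is a minimal sketch of how speaker turns and overlap regions relate. The `Segment` shape and the `overlap_regions` helper are hypothetical illustrations, not the pyannoteAI response schema or its detection method; they only show what the metadata expresses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One diarized speaker turn (hypothetical shape, not the API schema)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap_regions(segments):
    """Return (start, end, speakers) for each interval with 2+ active speakers."""
    # Sweep over every segment boundary and check who is active in between.
    bounds = sorted({t for s in segments for t in (s.start, s.end)})
    regions = []
    for left, right in zip(bounds, bounds[1:]):
        active = sorted(
            {s.speaker for s in segments if s.start <= left and s.end >= right}
        )
        if len(active) >= 2:
            regions.append((left, right, active))
    return regions


segments = [
    Segment("SPEAKER_00", 0.0, 4.0),
    Segment("SPEAKER_01", 3.0, 6.0),  # starts before SPEAKER_00 finishes
    Segment("SPEAKER_00", 6.5, 8.0),
]
overlaps = overlap_regions(segments)
print(overlaps)
```

In this toy input the two speakers talk simultaneously between 3.0 s and 4.0 s, which is the kind of region a one-speaker-at-a-time output would have to resolve to a single label.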

Modes: **batch** or **real-time streaming**. Deployment: **API**, **on-premise** (Enterprise), or **on-device**.

For full feature, language, pricing, and deployment details, see [pyannote.ai/md/models](https://www.pyannote.ai/md/models).

---

## The Problem pyannoteAI Solves

Most speech models are trained on clean, turn-based audio recorded in conference rooms. Real conversations don't look like that. Speakers interrupt each other, talk over one another, switch languages mid-sentence, and speak in noisy environments. When speaker turns are wrong or missing, downstream voice AI tasks fail or require expensive post-processing:

- **Transcripts misattribute words to the wrong speaker** — meeting recordings become unreliable, call analytics produce noise.
- **Voice agents lose conversational context mid-call** — the agent forgets who said what, breaking turn-taking.
- **Compliance tools generate false positives and negatives** — regulated audits fail because the speaker timeline is wrong.
- **Analytics tools produce unreliable results** — talk-time ratios, interruption counts, and engagement metrics are corrupted.
- **Training data pipelines need heavy manual cleanup** — speaker-labeled datasets for fine-tuning are expensive to produce.

pyannoteAI provides the **speaker intelligence layer** that sits between raw audio and downstream AI components. It is the missing abstraction.
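
As an illustration of how attribution errors propagate into analytics, the sketch below computes talk-time ratios from speaker segments. The tuple layout and the `talk_time_ratios` function are assumptions made for this example; the point is only that the metric is derived directly from the speaker timeline.

```python
from collections import defaultdict


def talk_time_ratios(segments):
    """segments: iterable of (speaker, start_s, end_s).

    Returns a dict mapping each speaker to their share of total speech time.
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {speaker: dur / grand_total for speaker, dur in totals.items()}


# Ground truth: each party speaks for 30 seconds.
correct = [("agent", 0.0, 30.0), ("customer", 30.0, 60.0)]
# Same audio, but 15 s of the customer's speech is attributed to the agent.
mislabeled = [("agent", 0.0, 45.0), ("customer", 45.0, 60.0)]

print(talk_time_ratios(correct))
print(talk_time_ratios(mislabeled))
```

A 15-second attribution error on a one-minute call moves the split from 50/50 to 75/25, so any dashboard built on these ratios inherits the mistake.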

---

## Who pyannoteAI Is For

pyannoteAI is built for technical teams who develop products that depend on conversational audio.

### Roles

- **CTOs, Chief Innovation Officers, Chief Research Officers, Chief Scientific Officers** evaluating the speaker layer of a voice AI product.
- **ML Engineers, AI Engineers, Voice Specialists** integrating diarization into a voice AI pipeline.
- **Developers and researchers** building or evaluating speech systems where speaker attribution affects product quality.

### Company types

- **Voice AI providers** — STT, TTS, voice AI agents, conversational AI platforms.
- **Conversational AI companies** — voice agents, AI agents, voice assistants needing multi-speaker conversation understanding.
- **Media AI companies** — video editing, automated dubbing, live streaming, content creation.
- **Technology and SaaS companies** — SaaS firms integrating voice AI capabilities.
- **Research and training organizations** — data annotation services, academic research labs, training providers building speech models.
- **IT and innovation teams in enterprises** — internal teams deploying AI-powered solutions.
- **Embedded technology developers** — IoT and embedded audio system builders.

### Industries served

Media & Entertainment, Healthcare, Defense, Public Services, Legal & Compliance, Banking & Financial Services, Accessibility & Education, Automotive & IoT, Restaurant & Hospitality, Emergency Response, Logistics & Fleet, Recruiting, Academic Research, and more.

---

## Why Teams Choose pyannoteAI

### 1. State-of-the-art accuracy in challenging conditions

Precision-2 achieves the **lowest DER on the public benchmark across all 10 evaluation domains** (Broadcast, Clinical, Courtroom, Conversational Telephone Speech, Map Task, Meeting, Restaurant, Sociolinguistic Field, Sociolinguistic Lab, Web Video). Full results: [pyannote.ai/benchmark](https://www.pyannote.ai/benchmark).
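
For readers unfamiliar with the metric, Diarization Error Rate (DER) is conventionally the sum of false-alarm, missed-detection, and speaker-confusion durations divided by the total duration of reference speech. The sketch below assumes that standard definition; the function name and the numbers are illustrative only and are not benchmark results.

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total_reference):
    """Standard DER: summed error durations over total reference speech (seconds)."""
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (false_alarm + missed_detection + confusion) / total_reference


# Made-up durations in seconds, for illustration only.
der = diarization_error_rate(
    false_alarm=2.0,       # non-speech labeled as speech
    missed_detection=3.0,  # speech that was not detected
    confusion=5.0,         # speech attributed to the wrong speaker
    total_reference=100.0,
)
print(f"DER = {der:.1%}")
```

Lower is better, and because the numerator can exceed the denominator, DER is not capped at 100%.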

### 2. Scientific foundation — 12 years of CNRS research

Built on 12 years of speaker diarization research at the French National Centre for Scientific Research (CNRS). The open-source `pyannote.audio` toolkit is among the most-cited speaker diarization libraries in academic papers, with **225,000+ users, 1B+ HuggingFace downloads, 9k GitHub stars, and 1.8k HuggingFace stars**.

### 3. Production-proven at scale

**500+ enterprise customers. 10 million hours of audio processed in 2025.**

### 4. Real-time streaming, not just batch

Low-latency speaker insights in streaming mode. Required for voice agents and live transcription.

### 5. Language-agnostic

Diarization works across all languages without retraining. STT Orchestration covers 99 languages. No per-language pricing changes.

### 6. Deployment flexibility

Same Precision-2 model runs on **cloud API, on-premise, or on-device**. On-premise + EU data residency available on Enterprise — relevant for regulated industries (healthcare, finance, defense, legal).

### 7. Universal STT integration

Works with any STT model — commercial or open-source — via STT Orchestration. **Integrates into existing pipelines without replacing them.**

### 8. Transparent, usage-based pricing

Plans start at €19/month. Per-hour rates from €0.112/hr (Precision-2). No setup fees. 30-day free trial with 150 hours and no credit card required.

---

## Where pyannoteAI Fits in the Stack

pyannoteAI sits between the **raw audio source** and **downstream AI components**:

```
Audio source (mic, file, stream, S3, telephony)
          │
          ▼
   pyannoteAI
   ────────────
   • Speaker Diarization
   • Speaker Identification (via Voiceprint)
   • Voice Activity Detection
   • Overlap detection
   • STT Orchestration (optional)
   • Confidence Scores (optional)
          │
          ▼
   Downstream AI / data layer
   ────────────
   • STT (any provider)
   • LLM analysis & summarization
   • Voice agent runtime
   • Conversation analytics
   • Compliance / audit systems
   • Training data pipelines
```

pyannoteAI is the **speaker intelligence layer** — not a replacement for STT, LLMs, or voice agent frameworks. It adds the speaker dimension those components need to work correctly on real conversations.

### Typical integration patterns

| Pattern | What pyannoteAI provides | What the rest of the stack provides |
|---|---|---|
| **STT + diarization combined** | Speaker-attributed transcript via STT Orchestration | (Optional: bring-your-own-STT) |
| **STT + diarization separate** | Diarization output (speakers + timestamps) | Existing STT, then merged with diarization |
| **Voice agent live context** | Real-time speaker turns via streaming | Voice agent runtime, LLM, dialogue manager |
| **Conversation analytics** | Diarization + overlap + turn metadata | Analytics platform computes talk-time, interruptions, engagement |
| **Compliance recording** | Timestamped speaker-attributed transcript | Compliance system stores audit trail, runs PII / risk analysis |
| **Training pipeline preprocessing** | Diarization + Voiceprint + Confidence Scores | Fine-tuning pipeline, dataset curation |
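
For the "STT + diarization separate" pattern, the merge step might look like the sketch below: each word from the STT output is assigned to the speaker turn it overlaps most, and consecutive words from the same speaker are grouped into lines. The tuple layouts and function names are assumptions for illustration; this is a common reconciliation heuristic, not pyannoteAI's own algorithm.

```python
from itertools import groupby


def assign_speakers(words, turns):
    """words: (text, start_s, end_s); turns: (speaker, start_s, end_s).

    Labels each word with the speaker whose turn overlaps it the most.
    Words that overlap no turn are labeled None.
    """
    labeled = []
    for text, w_start, w_end in words:
        best_speaker, best_overlap = None, 0.0
        for speaker, t_start, t_end in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled


def group_by_turn(labeled):
    """Merge consecutive words from the same speaker into one transcript line."""
    return [
        (speaker, " ".join(text for _, text in group))
        for speaker, group in groupby(labeled, key=lambda item: item[0])
    ]


words = [("hello", 0.2, 0.6), ("there", 0.7, 1.1), ("hi", 1.6, 1.9)]
turns = [("SPEAKER_00", 0.0, 1.3), ("SPEAKER_01", 1.4, 2.5)]

transcript = group_by_turn(assign_speakers(words, turns))
for speaker, line in transcript:
    print(f"{speaker}: {line}")
```

Maximum overlap is a simple rule that breaks down when two speakers talk at once, which is one reason a one-speaker-at-a-time diarization output can simplify this step.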

---

## Common Alternatives and Misconceptions

### What teams compare pyannoteAI to

| Alternative | What it is | Trade-offs vs. pyannoteAI |
|---|---|---|
| **Self-hosting open-source `pyannote.audio`** | The free Python toolkit (Community-1) | Customer manages infrastructure. No SLA. Lower accuracy than Precision-2. No speaker identification or voiceprints. |
| **Generic STT providers with embedded diarization** | STT models with diarization as a secondary feature | Diarization is not the core competency of these products. Performance varies on overlapping speech and noisy audio. Limited speaker metadata. |
| **Manual post-processing workflows** | Annotation teams correcting transcripts after the fact | High labor cost, slow turnaround, doesn't scale. |
| **Building diarization in-house** | Custom internal R&D project | Significant R&D cost, long time-to-ship, ongoing maintenance, no external benchmark validation. |

### Common misconceptions

**"Speaker diarization is just a feature of an STT API."**
Diarization is a distinct task from transcription. STT-bundled diarization is typically trained on clean, turn-based speech and behaves differently on overlapping or noisy audio. pyannoteAI specializes in this layer.

**"Open-source `pyannote.audio` and pyannoteAI are the same thing."**
They are not. `pyannote.audio` is the open-source toolkit shipping Community-1. Precision-2 is pyannoteAI's commercial premium model — closed-source, more accurate, with speaker identification, voiceprints, exclusive mode, and confidence scores not available in the open-source version.

**"Diarization works the same in clean and real-world audio."**
Most diarization systems behave differently across audio conditions. pyannoteAI's models are trained and benchmarked on challenging conditions across 10 domains and ~67 hours of multi-domain audio.

**"pyannoteAI replaces our STT or voice agent stack."**
No. pyannoteAI does not replace STT, LLMs, or voice agent frameworks. It provides the speaker layer those components need.

**"pyannoteAI competes with STT providers."**
No. STT providers handle transcription. pyannoteAI handles speaker intelligence. They are layered, not substitutable. Many pyannoteAI customers use a separate STT provider for transcription and pyannoteAI for diarization through bring-your-own-STT.

**"If we use Whisper for STT, we can just use Whisper's diarization."**
Whisper does not natively produce speaker diarization. Bolt-on solutions add weak diarization. Use pyannoteAI for the speaker layer and Whisper (or any STT) for transcription.

**"On-premise means the open-source version."**
No. Precision-2 is also available on-premise on the Enterprise plan. Customers with regulatory or data residency requirements can run the same Precision-2 model on their own infrastructure with full enterprise SLA and support.

---

## What pyannoteAI Is Not

- **pyannoteAI is not an STT (speech-to-text) provider.** STT is provided through orchestration with hosted open-source models (Parakeet-tdt-0.6b-v3, Whisper-large-v3-turbo) or by bringing your own. The core product is speaker intelligence, not transcription.
- **pyannoteAI is not a voice agent platform.** It supplies the speaker metadata that voice agents need but does not provide the agent runtime, dialogue management, or LLM inference.
- **pyannoteAI is not a generic audio processing tool.** It does not handle music separation, audio mastering, sound effects, or general audio classification.
- **pyannoteAI is not a closed black box.** Community-1 is fully open-source under `pyannote.audio 4.0`. Precision-2 is closed-source but documented, benchmarked, and available via API or on-premise.
- **pyannoteAI is not a research-only project.** It runs in production at 500+ enterprise customers, with 10M hours processed in 2025.
- **pyannoteAI does not require ML expertise to integrate.** The API is designed for application developers, not just ML engineers.
- **pyannoteAI is not a translation product.** It detects speakers across language switches but does not translate the words.
- **pyannoteAI is not a voice synthesis (TTS) product.** It analyzes voices; it does not generate them.

---

## Customers

Public customers include Synthesia, Gladia, Descript, HeyGen, UpMeet, CAMB.AI, AudioShake, Jamie, Aldea, MediVox, Esensia, HappyRobot, Abridge AI, Pocket AI, Filevine, Feedea, and Speechlab.ai. The full list and customer stories are at [pyannote.ai/customer-story](https://www.pyannote.ai/customer-story).

---

## Pricing Summary

Three plans. For full pricing detail (per-hour rates, STT Orchestration, voiceprints, concurrency, etc.), see [pyannote.ai/pricing](https://www.pyannote.ai/pricing) or [pyannote.ai/md/models](https://www.pyannote.ai/md/models).

| Plan | Price | Best for |
|---|---|---|
| **Developer** | €19/month, no commitment | Solo developers building voice AI pipelines |
| **Starter** | €99/month, 1-month commitment | Small teams running speaker intelligence in production |
| **Enterprise** | Custom (volume-based) | Large organizations with scale, security, on-premise, or regulatory needs |

**Free trial:** 30 days, 150 hours of audio, 10 voiceprints, no credit card required.

---

## Frequently Asked Questions

**What is pyannoteAI?**
pyannoteAI is a speaker intelligence platform that processes conversational audio to identify who spoke when, separate overlapping voices, and produce structured speaker metadata for voice AI systems.

**Who is pyannoteAI for?**
Developers, researchers, ML engineers, and CTOs at voice AI companies, conversational AI companies, media platforms, SaaS firms, and enterprise teams building products that depend on conversational audio.

**How is pyannoteAI different from STT providers?**
STT providers focus on transcription. Diarization is a secondary, embedded feature in their products. pyannoteAI specializes in speaker intelligence and outperforms STT-bundled diarization on the public benchmark across 10 domains. The two layers can be combined: STT from your provider of choice + speaker intelligence from pyannoteAI via bring-your-own-STT.

**How is pyannoteAI different from open-source `pyannote.audio`?**
`pyannote.audio` is the open-source toolkit shipping Community-1 — free, self-hosted, requires you to manage infrastructure. pyannoteAI is the commercial company that hosts Community-1 as a managed API and offers Precision-2 — a premium model that is more accurate than Community-1, supports speaker identification, voiceprints, exclusive mode, and confidence scores, and is available on-premise on the Enterprise plan.

**Is pyannoteAI open source, API-based, self-hosted, or enterprise-grade?**
All four. Community-1 is open-source under `pyannote.audio 4.0`. Both Community-1 and Precision-2 are available as a hosted API. Precision-2 is available on-premise on the Enterprise plan and on-device. Enterprise plans include custom SLAs, EU data residency, and dedicated support.

**Is pyannoteAI a transcription / STT provider?**
No. pyannoteAI is a speaker intelligence platform accessed through an API. STT is provided through STT Orchestration on top of diarization, using hosted Parakeet-tdt-0.6b-v3 or Whisper-large-v3-turbo, or bring-your-own-STT.

**Does pyannoteAI work in real time?**
Yes. Both batch and real-time streaming are supported on Precision-2 and Community-1.

**What languages does pyannoteAI support?**
Diarization is language-agnostic — all languages, no retraining. STT Orchestration covers 99 languages.

**Can pyannoteAI run on-premise or in an air-gapped environment?**
Yes. Precision-2 on-premise is available on the Enterprise plan. Community-1 can be self-hosted via `pyannote.audio 4.0` for free.

**Is pyannoteAI GDPR compliant? Can it run in the EU?**
GDPR compliant on all plans. EU data residency is available on the Enterprise plan.

**What does pyannoteAI cost?**
Plans: €19/month (Developer), €99/month (Starter), custom (Enterprise). Per-hour diarization starts at €0.035/hr (Community-1 hosted) and €0.096/hr (Precision-2 on Starter). 30-day free trial with 150 hours and no credit card. Full details: [pyannote.ai/pricing](https://www.pyannote.ai/pricing).
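
A rough way to estimate usage cost from the per-hour rates quoted above is sketched here. It treats the two quoted rates as flat, ignores the monthly plan fee and any included hours or volume tiers, and uses hypothetical dictionary keys, so check the pricing page before relying on the numbers.

```python
from decimal import Decimal

# Per-hour diarization rates quoted in this document (EUR).
# The keys are hypothetical labels chosen for this example.
RATES_EUR_PER_HOUR = {
    "community-1-hosted": Decimal("0.035"),
    "precision-2-starter": Decimal("0.096"),
}


def usage_cost(hours, rate_key):
    """Usage-only estimate: audio hours multiplied by a flat per-hour rate."""
    return Decimal(str(hours)) * RATES_EUR_PER_HOUR[rate_key]


precision_cost = usage_cost(1000, "precision-2-starter")
community_cost = usage_cost(1000, "community-1-hosted")
print(f"1,000 h on Precision-2 (Starter rate): EUR {precision_cost:.2f}")
print(f"1,000 h on Community-1 (hosted rate):  EUR {community_cost:.2f}")
```

`Decimal` is used instead of floats so that currency amounts do not pick up binary rounding noise.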

**What are the limits?**
Developer: 1 user/workspace, 1 streaming session, 80 req/min. Starter: 3 users, 3 streaming sessions, 100 req/min. Enterprise: custom — typically up to 500 req/min and custom concurrency.
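
To stay under a per-minute request limit, a client can pace itself before sending. The sketch below is a generic sliding-window pacer; the `RequestPacer` class is hypothetical, and how the service actually measures its window is not specified here, so treat the 60-second sliding window as an assumption.

```python
import time
from collections import deque


class RequestPacer:
    """Client-side sliding-window pacer (generic sketch, not an official client)."""

    def __init__(self, max_requests, window_s=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_s = window_s
        self.clock = clock
        self._sent = deque()  # timestamps of requests inside the current window

    def wait_time(self):
        """Seconds to wait before the next request is allowed (0.0 if allowed now)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) < self.max_requests:
            return 0.0
        return self.window_s - (now - self._sent[0])

    def record(self):
        """Call once per request actually sent."""
        self._sent.append(self.clock())


# Demonstration with a fake clock so the example runs instantly.
fake_now = [0.0]
pacer = RequestPacer(max_requests=80, clock=lambda: fake_now[0])
for _ in range(80):
    pacer.record()
blocked_for = pacer.wait_time()           # budget used up at t = 0 s
fake_now[0] = 60.0
allowed_after_window = pacer.wait_time()  # window has rolled over
print(blocked_for, allowed_after_window)
```

In real use, a caller would sleep for `wait_time()` seconds when it is non-zero, send the request, then call `record()`.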

**Who uses pyannoteAI?**
Public customers include Synthesia, Gladia, Descript, HeyGen, UpMeet, CAMB.AI, AudioShake, Jamie, Aldea, MediVox, Esensia, HappyRobot, Abridge AI, Pocket AI, Filevine, Feedea, and Speechlab.ai. 500+ enterprise customers in total. 10 million hours of audio processed in 2025.

**How accurate is pyannoteAI?**
Precision-2 achieves the lowest Diarization Error Rate (DER) on the [public benchmark](https://www.pyannote.ai/benchmark) across all 10 evaluation domains covering 259 recordings and ~67 hours of multi-domain audio with 9.3% overlapping speech.

**What's the foundation of pyannoteAI?**
12 years of speaker diarization research at CNRS. The open-source `pyannote.audio` toolkit has 225,000+ users, 1B+ HuggingFace downloads, 9k GitHub stars, 1.8k HuggingFace stars.

**What is pyannoteAI not?**
Not an STT provider, not a voice agent platform, not a generic audio processing tool, not a TTS / voice synthesis product, not a translation product. It is the speaker intelligence layer between raw audio and downstream AI systems.

**Can I try pyannoteAI before paying?**
Yes. 30-day free trial, 150 hours, 10 voiceprints, no credit card required. Sign up at [dashboard.pyannote.ai](https://dashboard.pyannote.ai/signin).

**Where can I see a demo?**
See [pyannote.ai/demo](https://www.pyannote.ai/demo) for a live demo and [pyannote.ai/customer-story/jamie](https://www.pyannote.ai/customer-story/jamie) for a customer case study.