Tutorial: How to build a speaker identification system for recurring meetings - pyannoteAI Speaker Intelligence and Diarization

Diarization answers the question "who spoke when?" but assigns anonymous labels (SPEAKER_00, SPEAKER_01). For recurring meetings — weekly standups, client calls, interview series — anonymous labels are insufficient. Teams need consistent identities across sessions: Alice, Bob, Carol. This requires speaker identification: matching speakers in a recording against a set of known voiceprints.

This tutorial walks through building such a system using pyannoteAI's voiceprint and identification endpoints. By the end, you will have a pipeline that enrolls your team once, then automatically labels every future recording with real names.

How it works

Speaker identification with pyannoteAI is a two-phase process:

Enrollment (one-time per speaker): create a voiceprint from a short reference clip. A voiceprint is a compact digital representation of someone's voice characteristics, similar to a fingerprint but for voice. Store the voiceprint for reuse.
Identification (per recording): submit a meeting recording together with the stored voiceprints. pyannoteAI diarizes the audio, matches each speaker against the voiceprints, and returns segments labeled with the enrolled names. Speakers that match no voiceprint keep their anonymous diarization label.

The two phases map to two API endpoints: /v1/voiceprint for enrollment and /v1/identify for identification.

Step 1 - Enroll your team

Enrollment is a one-time step per speaker. You need a short audio clip (up to 30 seconds) containing only that person's voice — no overlapping speakers, no music, just normal speech. A segment pulled from a previous meeting where they spoke uninterrupted works well.

Voiceprint requirements

One voiceprint per speaker.
The audio must contain only the target speaker's voice.
Maximum 30 seconds.
Capture normal speaking style; the models are language-agnostic.
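
The length limit can be checked locally before anything is uploaded. The following is a minimal sketch, assuming the reference clip is an uncompressed WAV file that Python's standard wave module can read. The helper names and the MAX_CLIP_SECONDS constant are illustrative and not part of the pyannoteAI API; the check covers duration only, not whether other voices are present.

```python
import wave

MAX_CLIP_SECONDS = 30.0  # limit stated in the requirements above


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_enrollment_clip(path: str) -> float:
    """Raise ValueError if the clip is too long; otherwise return its duration."""
    duration = clip_duration_seconds(path)
    if duration > MAX_CLIP_SECONDS:
        raise ValueError(
            f"{path} is {duration:.1f}s long; trim it to "
            f"{MAX_CLIP_SECONDS:.0f}s or less"
        )
    return duration
```

Calling check_enrollment_clip("samples/alice.wav") before enrollment would catch an over-long clip without spending an API call.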

Upload and create the voiceprint

For local files, upload to pyannoteAI's temporary storage first, then submit the voiceprint job:

import os
import time
import json
import requests

API_KEY = os.environ["PYANNOTEAI_API_KEY"]
BASE = "https://api.pyannote.ai/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def upload_to_pyannote(audio_path: str, object_key: str) -> str:
    """Upload a local file to pyannoteAI temporary storage. Returns the media:// URL."""
    media_url = f"media://{object_key}"
    resp = requests.post(
        f"{BASE}/media/input",
        headers=HEADERS,
        json={"url": media_url},
    )
    resp.raise_for_status()
    presigned_url = resp.json()["url"]
    with open(audio_path, "rb") as f:
        requests.put(presigned_url, data=f,
            headers={"Content-Type": "application/octet-stream"}).raise_for_status()
    return media_url

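def poll_until_done(job_id: str) -> dict:
    """Poll a job every 2 seconds until it finishes. Returns the job output."""
    while True:
        resp = requests.get(f"{BASE}/jobs/{job_id}", headers=HEADERS)
        resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
        job = resp.json()
        if job["status"] == "succeeded":
            return job["output"]
        # Treat a canceled job like a failed one so the loop cannot spin forever.
        if job["status"] in ("failed", "canceled"):
            raise RuntimeError(f"Job {job['status']}: {job.get('error')}")
        time.sleep(2)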

def create_voiceprint(audio_path: str) -> str:
    """Create a voiceprint from a local audio file. Returns the voiceprint string."""
    media_url = upload_to_pyannote(audio_path, f"vp-{os.path.basename(audio_path)}")
    resp = requests.post(
        f"{BASE}/voiceprint",
        headers=HEADERS,
        json={"url": media_url},
    )
    resp.raise_for_status()
    output = poll_until_done(resp.json()["jobId"])
    return output["voiceprint"]
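
The polling helper above waits for as long as the job takes. For unattended pipelines, you may prefer a bounded wait. The following is a minimal sketch, not the pyannoteAI client's own behavior: fetch_job is a hypothetical callable that returns the job dictionary (in this tutorial it would wrap the GET request to the jobs endpoint), and the timeout and backoff values are arbitrary assumptions.

```python
import time
from typing import Callable


def poll_with_timeout(
    fetch_job: Callable[[], dict],
    timeout_s: float = 600.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_job until the job succeeds, fails, or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        job = fetch_job()
        status = job.get("status")
        if status == "succeeded":
            return job["output"]
        if status == "failed":
            raise RuntimeError(f"Job failed: {job.get('error')}")
        # Give up if the next wait would run past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"Job still {status!r} after {timeout_s:.0f}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
```

Passing the fetch step in as a callable keeps the retry logic testable without network access; the sleep parameter exists for the same reason.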

Build an enrollment store

Persist the voiceprints in a simple JSON file keyed by speaker name. You only need to do this once per person:

def enroll(name: str, audio_path: str, db_path: str = "speakers.json"):
    """Enroll a speaker by creating their voiceprint and saving it."""
    # Load the existing store if there is one; otherwise start empty.
    if os.path.exists(db_path):
        with open(db_path) as f:
            db = json.load(f)
    else:
        db = {}
    db[name] = create_voiceprint(audio_path)
    with open(db_path, "w") as f:  # closes the file even if json.dump raises
        json.dump(db, f)
    print(f"Enrolled {name}")

Enroll your team before their next meeting:

enroll("Alice", "samples/alice.wav")
enroll("Bob", "samples/bob.wav")
enroll("Carol", "samples/carol.wav")

Voiceprint job outputs are deleted from pyannoteAI after 24 hours, so always save them to your own storage. The voiceprints themselves are reusable indefinitely.
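
Because the local store is the only long-term copy of each voiceprint, it is worth protecting against a crash in the middle of a write. The following is a minimal sketch of one way to do that: write to a temporary file and rename it over the target. The function name save_store_atomically is hypothetical, and the sketch assumes the store is the same name-to-voiceprint dictionary used by enroll.

```python
import json
import os
import tempfile


def save_store_atomically(db: dict, db_path: str = "speakers.json") -> None:
    """Write the voiceprint store to a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(db_path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(db, f, indent=2)
        os.replace(tmp_path, db_path)  # replaces the old file in a single step
    except BaseException:
        # Clean up the partial temp file and leave the old store untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this approach, an interrupted write leaves the previous speakers.json in place rather than a truncated file.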

Step 2 - Identify speakers in a meeting

Once your team is enrolled, identifying speakers in a new recording is a single API call. Submit the audio along with your stored voiceprints to the /v1/identify endpoint. pyannoteAI will diarize the recording and match each speaker against the voiceprints:

def identify(audio_path: str, db_path: str = "speakers.json") -> dict:
    """Identify speakers in a recording using stored voiceprints."""
    with open(db_path) as f:
        db = json.load(f)
    voiceprints = [{"label": name, "voiceprint": vp} for name, vp in db.items()]

    media_url = upload_to_pyannote(audio_path, f"meeting-{os.path.basename(audio_path)}")

    resp = requests.post(
        f"{BASE}/identify",
        headers=HEADERS,
        json={
            "url": media_url,
            "voiceprints": voiceprints,
        },
    )
    resp.raise_for_status()
    return poll_until_done(resp.json()["jobId"])

Understanding the output

The response contains three sections:

{
  "diarization": [
    {"speaker": "SPEAKER_00", "start": 0.5, "end": 4.2},
    {"speaker": "SPEAKER_01", "start": 4.5, "end": 9.1}
  ],
  "identification": [
    {"speaker": "Alice", "start": 0.5, "end": 4.2, "diarizationSpeaker": "SPEAKER_00", "match": "Alice"},
    {"speaker": "SPEAKER_01", "start": 4.5, "end": 9.1, "diarizationSpeaker": "SPEAKER_01", "match": null}
  ],
  "voiceprints": [
    {"speaker": "SPEAKER_00", "match": "Alice", "confidence": {"Alice": 86, "Bob": 22, "Carol": 14}},
    {"speaker": "SPEAKER_01", "match": null, "confidence": {"Alice": 12, "Bob": 18, "Carol": 10}}
  ]
}

diarization — the raw speaker segments with anonymous labels.
identification — the same segments with matched names. Unmatched speakers keep their anonymous label and have "match": null.
voiceprints — confidence scores showing how well each enrolled voiceprint matched each speaker. Use these to audit results and fine-tune your threshold.
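
To audit a result, it can help to look at how far the best-scoring voiceprint is ahead of the runner-up for each diarized speaker. The following is a minimal sketch that works on the voiceprints section shown above. The function name summarize_matches and the margin field are illustrative, not part of the API response.

```python
def summarize_matches(voiceprints: list[dict]) -> list[dict]:
    """For each diarized speaker, report the top candidate and its lead over the runner-up."""
    rows = []
    for entry in voiceprints:
        # Rank the enrolled labels from highest to lowest confidence.
        ranked = sorted(
            entry["confidence"].items(), key=lambda kv: kv[1], reverse=True
        )
        best_label, best_score = ranked[0] if ranked else (None, 0)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        rows.append({
            "speaker": entry["speaker"],
            "match": entry["match"],
            "best_candidate": best_label,
            "best_score": best_score,
            "margin": best_score - runner_up,
        })
    return rows
```

For the sample response, SPEAKER_00 has Alice as the best candidate with a score of 86 and a margin of 64 over Bob, while SPEAKER_01 has no match and a margin of only 6 between its two closest candidates.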

Matching options

Two parameters control how matching works:

matching.threshold (0–100, default 0) - minimum confidence required for a match. Set to 50–70 for stricter matching. A threshold that is too low produces false matches; one that is too high causes enrolled speakers to appear as unknown.
matching.exclusive (default true) - prevents multiple diarized speakers from matching the same voiceprint. Keep this on for meetings where you know each person is distinct.
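
The identify function in Step 2 sends no matching options, so the defaults apply. The following is a minimal sketch of a request body that sets them. It assumes the dotted names above correspond to a nested matching object in the JSON payload; check the API reference before relying on that shape. The helper name build_identify_payload is hypothetical.

```python
def build_identify_payload(
    media_url: str,
    voiceprints: list[dict],
    threshold: int = 50,
    exclusive: bool = True,
) -> dict:
    """Build the JSON body for an identification request with matching options."""
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    return {
        "url": media_url,
        "voiceprints": voiceprints,
        # Assumption: matching.threshold / matching.exclusive are nested keys.
        "matching": {"threshold": threshold, "exclusive": exclusive},
    }
```

The returned dictionary could be passed as the json argument of the POST request in identify in place of the inline payload.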

Step 3 - Align with a transcript

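Speaker labels become most useful when attached to a transcript. Run ASR separately with any provider that returns word-level timestamps, then merge by overlap. The example uses OpenAI's whisper-1 because the OpenAI transcription API offers word-level timestamps (verbose_json with timestamp_granularities) for that model rather than for gpt-4o-transcribe:

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def transcribe(audio_path: str) -> list[dict]:
    """Transcribe audio and return word-level timestamps as plain dicts."""
    with open(audio_path, "rb") as f:
        response = client.audio.transcriptions.create(
            model="whisper-1",  # word timestamps require whisper-1
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )
    # The SDK returns word objects; convert them so align() can index by key.
    return [{"word": w.word, "start": w.start, "end": w.end} for w in response.words]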

def align(words: list[dict], identification: list[dict]) -> list[dict]:
    identification = sorted(identification, key=lambda x: x["start"])
    for w in words:
        w_start, w_end = w["start"], w["end"]
        speaker_overlap: dict[str, float] = {}
        for seg in identification:
            intersection = min(seg["end"], w_end) - max(seg["start"], w_start)
            if intersection <= 0:
                continue
            speaker_overlap[seg["speaker"]] = (
                speaker_overlap.get(seg["speaker"], 0.0) + intersection
            )
        if speaker_overlap:
            w["speaker"] = max(speaker_overlap.items(), key=lambda x: x[1])[0]
        else:
            w["speaker"] = "UNKNOWN"
    return words
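
Word-level labels are often easier to work with once consecutive words from the same speaker are merged into turns. The following is a minimal sketch, assuming each word is a dictionary with word, start, end, and speaker keys as produced by align. The function name group_into_turns is illustrative.

```python
def group_into_turns(words: list[dict]) -> list[dict]:
    """Merge consecutive words from the same speaker into a single turn."""
    turns: list[dict] = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["end"] = w["end"]
            turns[-1]["text"] += " " + w["word"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({
                "speaker": w["speaker"],
                "start": w["start"],
                "end": w["end"],
                "text": w["word"],
            })
    return turns
```

Each turn carries its own start and end time, which is convenient for meeting notes or for jumping to a point in the recording.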

Putting it all together

from concurrent.futures import ThreadPoolExecutor

def process_meeting(audio_path: str, db_path: str = "speakers.json"):
    """Identify speakers and produce a labeled transcript."""
    with ThreadPoolExecutor(max_workers=2) as executor:
        f_id = executor.submit(identify, audio_path, db_path)
        f_words = executor.submit(transcribe, audio_path)

    output = f_id.result()
    words = f_words.result()

    labeled = align(words, output["identification"])

    # Print readable output
    current_speaker = None
    for w in labeled:
        if w["speaker"] != current_speaker:
            current_speaker = w["speaker"]
            print(f"\n[{current_speaker}]", end=" ")
        print(w["word"], end=" ")

if __name__ == "__main__":
    process_meeting("standup-2026-05-12.mp3")

Sample output:

[Alice] Thanks everyone let's go around quickly

[Bob] Sure I finished the API refactor yesterday and opened the PR

[SPEAKER_02] I'm wrapping up the design review should be done by end of day

[Alice]

SPEAKER_02 remained anonymous because they were not enrolled. To fix that, enroll them with a reference clip and re-run.

Practical considerations for recurring meetings

Enrollment quality matters. Use a clean clip where the speaker talks uninterrupted for at least 10 seconds. Avoid clips with background noise, music, or other voices. A segment from a previous meeting where they gave an update works well.

Be selective with voiceprints. Only include voiceprints of people you expect in the recording. Including voiceprints of absent speakers can produce false matches, because the system may assign a low-confidence match rather than labeling someone as unknown. Review the confidence scores in the output to catch this.
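
One way to apply this is to filter the enrollment store by the expected attendee list before each request. The following is a minimal sketch, assuming the store is the name-to-voiceprint dictionary from Step 1. The function name select_voiceprints and the idea of an attendee list (for example, taken from a calendar invite) are illustrative assumptions.

```python
def select_voiceprints(db: dict, attendees: list[str]) -> tuple[list[dict], list[str]]:
    """Return voiceprints for expected attendees, plus the names not yet enrolled."""
    selected = [
        {"label": name, "voiceprint": db[name]}
        for name in attendees
        if name in db
    ]
    missing = [name for name in attendees if name not in db]
    return selected, missing
```

The first return value has the same shape as the voiceprints list built in identify; the second lists attendees who still need enrollment.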

Threshold tuning. Start with a threshold of 50 and adjust based on your results. If enrolled speakers are showing up as unknown, lower it. If you're seeing wrong names assigned, raise it. The right value depends on audio quality and how distinctive each speaker's voice is.
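
Candidate thresholds can be compared offline by replaying them against the confidence scores from past responses. The following is a rough sketch under two assumptions: a speaker matches when their best score is greater than or equal to the threshold, and exclusive matching is ignored. The service's own matching logic may differ, so treat the result as an approximation. The function name replay_threshold is hypothetical.

```python
def replay_threshold(voiceprints: list[dict], threshold: float) -> dict:
    """Approximate which label each diarized speaker would get at a given threshold."""
    result = {}
    for entry in voiceprints:
        scores = entry["confidence"]
        # Pick the enrolled label with the highest confidence, if any.
        best = max(scores, key=scores.get) if scores else None
        if best is not None and scores[best] >= threshold:
            result[entry["speaker"]] = best
        else:
            result[entry["speaker"]] = None  # would stay anonymous
    return result
```

Running this over several recordings at thresholds such as 40, 50, and 60 shows where known speakers start dropping out and where wrong names start to appear.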

Unknown speakers are expected. In most meeting contexts, guests, new hires, or external participants will appear. The system labels them with anonymous diarization IDs. You can enroll them after the fact by pulling a clean segment from the recording and adding them to your store.
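
Finding a candidate segment for after-the-fact enrollment can be automated from the identification output. The following is a minimal sketch that picks the longest segment of an unmatched speaker and caps it at the 30-second limit. The function name best_enrollment_segment is illustrative, and the sketch does not check for overlapping speech, so listen to the clip before enrolling it.

```python
from typing import Optional


def best_enrollment_segment(
    identification: list[dict],
    speaker: str,
    max_seconds: float = 30.0,
) -> Optional[tuple[float, float]]:
    """Return (start, end) of the longest unmatched segment for a speaker, capped in length."""
    candidates = [
        seg for seg in identification
        if seg["speaker"] == speaker and seg.get("match") is None
    ]
    if not candidates:
        return None
    longest = max(candidates, key=lambda seg: seg["end"] - seg["start"])
    start = longest["start"]
    end = min(longest["end"], start + max_seconds)  # respect the clip length limit
    return start, end
```

The returned time range can then be cut from the recording with any audio tool and passed to enroll under the new speaker's name.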

Voiceprint stability. Voiceprints are durable across sessions: once created, they work for future recordings without re-enrollment. However, if someone's recording setup changes significantly (e.g. switching from a laptop mic to a headset), you may want to update their voiceprint for best results.

Microphone and environment consistency. Identification works best when enrollment audio and meeting audio are captured under similar conditions. If your team always uses the same conferencing setup, results will be more consistent than if conditions vary widely between sessions.

Conclusion

Speaker identification turns anonymous diarization labels into persistent, named identities. The workflow is simple: enroll each team member once with a short voice sample, then pass their voiceprints alongside every new recording. pyannoteAI handles the matching and returns segments labeled with real names. Combined with ASR, you get a transcript where every utterance is attributed to a known speaker, ready for meeting notes, searchable archives, or downstream analytics.