
Diarization answers the question "who spoke when?" but assigns anonymous labels (SPEAKER_00, SPEAKER_01). For recurring meetings — weekly standups, client calls, interview series — anonymous labels are insufficient. Teams need consistent identities across sessions: Alice, Bob, Carol. This requires speaker identification: matching speakers in a recording against a set of known voiceprints.
This tutorial walks through building such a system using pyannoteAI's voiceprint and identification endpoints. By the end, you will have a pipeline that enrolls your team once, then automatically labels every future recording with real names.
How it works
Speaker identification with pyannoteAI is a two-phase process:
Enrollment (one-time per speaker): create a voiceprint from a short reference clip. A voiceprint is a compact digital representation of someone's voice characteristics, similar to a fingerprint but for voice. Store the voiceprint for reuse.
Identification (per recording): submit a meeting recording together with the stored voiceprints. pyannoteAI diarizes the audio, matches each speaker against the voiceprints, and returns segments labeled with the enrolled names or unknown for unrecognized speakers.
The two phases map to two API endpoints: /v1/voiceprint for enrollment and /v1/identify for identification.
Step 1 - Enroll your team
Enrollment is a one-time step per speaker. You need a short audio clip (up to 30 seconds) containing only that person's voice — no overlapping speakers, no music, just normal speech. A segment pulled from a previous meeting where they spoke uninterrupted works well.
Voiceprint requirements
One voiceprint per speaker.
The audio must contain only the target speaker's voice.
Maximum 30 seconds.
Capture normal speaking style; the models are language-agnostic.
Upload and create the voiceprint
For local files, upload to pyannoteAI's temporary storage first, then submit the voiceprint job:
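A minimal sketch of that flow using only the standard library. Only the /v1/voiceprint path comes from this article; the base URL, the /v1/media/input and /v1/jobs/{jobId} paths, the media:// scheme, the response field names (jobId, status, output.voiceprint) and the status strings are all assumptions to verify against the API reference.

```python
import json
import time
import urllib.request

API_BASE = "https://api.pyannote.ai"  # assumed base URL


def http_json(method, url, api_key, body=None):
    """Send a JSON request with a bearer token and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {api_key}")
    req.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode())


def upload_local_file(path, object_key, api_key, send=http_json):
    """Upload a local file to temporary storage (assumed presigned-URL flow)."""
    media_url = f"media://{object_key}"  # assumed URL scheme
    presigned = send("POST", f"{API_BASE}/v1/media/input", api_key, {"url": media_url})
    with open(path, "rb") as fh:
        put = urllib.request.Request(presigned["url"], data=fh.read(), method="PUT")
    put.add_header("Content-Type", "application/octet-stream")
    with urllib.request.urlopen(put):
        pass
    return media_url


def create_voiceprint(audio_url, api_key, send=http_json, poll_seconds=5, max_polls=60):
    """Submit a voiceprint job, then poll until it succeeds, fails, or times out."""
    job = send("POST", f"{API_BASE}/v1/voiceprint", api_key, {"url": audio_url})
    for _ in range(max_polls):
        result = send("GET", f"{API_BASE}/v1/jobs/{job['jobId']}", api_key)
        if result["status"] == "succeeded":
            return result["output"]["voiceprint"]
        if result["status"] in ("failed", "canceled"):
            raise RuntimeError(f"voiceprint job ended with status {result['status']}")
        time.sleep(poll_seconds)
    raise TimeoutError("voiceprint job did not finish in time")
```

The `send` parameter is there so the HTTP layer can be swapped out, for example with a stub in tests or with a client that adds retries.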
Build an enrollment store
Persist the voiceprints in a simple JSON file keyed by speaker name. You only need to do this once per person:
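One way to sketch such a store, assuming each voiceprint is a string that can be written to JSON as-is. The function names and the file layout are illustrative, not part of pyannoteAI.

```python
import json
from pathlib import Path


def load_store(path):
    """Return the {name: voiceprint} mapping, or an empty dict if no file exists yet."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text(encoding="utf-8"))


def save_voiceprint(path, name, voiceprint):
    """Add or replace one speaker's voiceprint and write the store back to disk."""
    store = load_store(path)
    store[name] = voiceprint
    Path(path).write_text(json.dumps(store, indent=2), encoding="utf-8")
    return store
```

Saving under an existing name overwrites the old entry, which matches the one-voiceprint-per-speaker rule above.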
Enroll your team before their next meeting:
Voiceprint job outputs are deleted from pyannoteAI after 24 hours, so always save them to your own storage. The voiceprints themselves are reusable indefinitely.
Step 2 - Identify speakers in a meeting
Once your team is enrolled, identifying speakers in a new recording is a single API call. Submit the audio along with your stored voiceprints to the /v1/identify endpoint. pyannoteAI will diarize the recording and match each speaker against the voiceprints:
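A hedged sketch of the request body for that call. The matching.threshold and matching.exclusive options and their defaults come from this article; the top-level field names (url, voiceprints, label, voiceprint) are assumptions to check against the API reference.

```python
def build_identify_payload(audio_url, store, threshold=0, exclusive=True):
    """Build the JSON body for an identification request.

    store maps speaker name -> voiceprint string. Field names other than
    the matching options are assumed, not confirmed.
    """
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must be between 0 and 100")
    return {
        "url": audio_url,
        "voiceprints": [
            {"label": name, "voiceprint": voiceprint}
            for name, voiceprint in store.items()
        ],
        "matching": {"threshold": threshold, "exclusive": exclusive},
    }
```

You would POST this body to /v1/identify and then wait for the job to finish, the same way as for the voiceprint job.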
Understanding the output
The response contains three sections:
diarization - the raw speaker segments with anonymous labels.
identification - the same segments with matched names. Unmatched speakers keep their anonymous label and have "match": null.
voiceprints - confidence scores showing how well each enrolled voiceprint matched each speaker. Use these to audit results and fine-tune your threshold.
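To work with the identification section, a small helper can separate named segments from anonymous ones. This sketch assumes each entry is a dict with speaker, start, end and match keys; only the "match": null convention is stated above, so the other key names are guesses.

```python
def split_by_match(output):
    """Separate identified segments from those left anonymous ("match" is None)."""
    matched, unmatched = [], []
    for seg in output["identification"]:
        if seg.get("match") is not None:
            matched.append(seg)
        else:
            unmatched.append(seg)
    return matched, unmatched


def unknown_labels(output):
    """Return the sorted anonymous labels that matched no enrolled voiceprint."""
    _, unmatched = split_by_match(output)
    return sorted({seg["speaker"] for seg in unmatched})
```

The labels returned by `unknown_labels` are the speakers you may want to enroll before the next session.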
Matching options
Two parameters control how matching works:
matching.threshold (0–100, default 0) - minimum confidence required for a match. Set to 50–70 for stricter matching. Too low produces false matches; too high causes legitimate speakers to appear as unknown.
matching.exclusive (default true) - prevents multiple diarized speakers from matching the same voiceprint. Keep this on for meetings where you know each person is distinct.
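To build intuition for how the two options interact, here is a local illustration that applies a threshold and an exclusivity rule to a table of confidence scores. It is a simple greedy assignment written for this article, not pyannoteAI's actual matching algorithm, and the input layout ({diarized speaker: {enrolled name: score}}) is hypothetical.

```python
def assign_names(confidence, threshold=0, exclusive=True):
    """Greedily pair diarized speakers with enrolled names, best score first.

    A pair is accepted only if its score reaches the threshold, the speaker
    has no name yet, and (when exclusive) the name is not already taken.
    """
    pairs = sorted(
        (
            (score, speaker, name)
            for speaker, scores in confidence.items()
            for name, score in scores.items()
        ),
        key=lambda pair: -pair[0],
    )
    result = {speaker: None for speaker in confidence}
    used = set()
    for score, speaker, name in pairs:
        if score < threshold or result[speaker] is not None:
            continue
        if exclusive and name in used:
            continue
        result[speaker] = name
        used.add(name)
    return result
```

Replaying saved confidence scores through a helper like this at a few candidate thresholds shows which assignments would flip before you change the setting on real jobs.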
Step 3 - Align with a transcript
Speaker labels become most useful when attached to a transcript. Run ASR separately (using gpt-4o-transcribe or any provider returning word-level timestamps) and merge by overlap:
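A minimal sketch of the merge, assuming the ASR output is a list of words with start and end times in seconds and the speaker segments are dicts with speaker, start and end keys. Both layouts are assumptions; adapt the key names to whatever your ASR provider and the identification output return.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, segments):
    """Attach to each word the speaker whose segment overlaps it the most."""
    labeled = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for seg in segments:
            shared = overlap(word["start"], word["end"], seg["start"], seg["end"])
            if shared > best_overlap:
                best_speaker, best_overlap = seg["speaker"], shared
        labeled.append({**word, "speaker": best_speaker})
    return labeled


def to_utterances(labeled):
    """Group consecutive words from the same speaker into utterances."""
    utterances = []
    for word in labeled:
        if utterances and utterances[-1]["speaker"] == word["speaker"]:
            utterances[-1]["text"] += " " + word["word"]
            utterances[-1]["end"] = word["end"]
        else:
            utterances.append({
                "speaker": word["speaker"],
                "start": word["start"],
                "end": word["end"],
                "text": word["word"],
            })
    return utterances
```

Words that overlap no segment get a speaker of None, so you can decide separately whether to drop them or attach them to the nearest utterance.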
Putting it all together
Sample output:
SPEAKER_02 remained anonymous because they were not enrolled. To fix that, enroll them with a reference clip and re-run.
Practical considerations for recurring meetings
Enrollment quality matters. Use a clean clip where the speaker talks uninterrupted for at least 10 seconds. Avoid clips with background noise, music, or other voices. A segment from a previous meeting where they gave an update works well.
Be selective with voiceprints. Only include voiceprints of people you expect in the recording. Including voiceprints of absent speakers can produce false matches: the system may assign a low-confidence match rather than labeling someone as unknown. Review the confidence scores in the output to catch this.
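One way to apply this is to filter the enrollment store by the expected attendee list before each request. The sketch assumes the store is a {name: voiceprint} mapping; the function name is illustrative.

```python
def select_voiceprints(store, expected):
    """Keep only the expected attendees and report anyone not yet enrolled."""
    selected = {name: store[name] for name in expected if name in store}
    missing = [name for name in expected if name not in store]
    return selected, missing
```

The `missing` list doubles as a reminder of who still needs a reference clip.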
Threshold tuning. Start with a threshold of 50 and adjust based on your results. If enrolled speakers are showing up as unknown, lower it. If you're seeing wrong names assigned, raise it. The right value depends on audio quality and how distinctive each speaker's voice is.
Unknown speakers are expected. In most meeting contexts, guests, new hires, or external participants will appear. The system labels them with anonymous diarization IDs. You can enroll them after the fact by pulling a clean segment from the recording and adding them to your store.
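A rough sketch of how you might pick that clean segment automatically: take the longest segment for the unknown label that no other speaker overlaps, and trim it to the 30-second limit. The segment layout (speaker, start, end keys) is an assumption, and the 10-second minimum mirrors the enrollment advice above.

```python
def best_enrollment_clip(segments, speaker, min_seconds=10.0, max_seconds=30.0):
    """Return (start, end) of the longest uninterrupted segment, or None."""
    others = [seg for seg in segments if seg["speaker"] != speaker]
    best = None
    for seg in segments:
        if seg["speaker"] != speaker:
            continue
        # Skip segments during which anyone else is also speaking.
        if any(o["start"] < seg["end"] and seg["start"] < o["end"] for o in others):
            continue
        duration = seg["end"] - seg["start"]
        if duration >= min_seconds and (best is None or duration > best[1] - best[0]):
            best = (seg["start"], seg["end"])
    if best is None:
        return None
    # Trim to the maximum clip length allowed for enrollment.
    return (best[0], min(best[1], best[0] + max_seconds))
```

Cut the returned time range out of the recording with your audio tool of choice, listen to it once to confirm it is clean, and enroll it like any other reference clip.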
Voiceprint stability. Voiceprints are durable across sessions: once created, they work for future recordings without re-enrollment. However, if someone's recording setup changes significantly (e.g. switching from a laptop mic to a headset), you may want to update their voiceprint for best results.
Microphone and environment consistency. Identification works best when enrollment audio and meeting audio are captured under similar conditions. If your team always uses the same conferencing setup, results will be more consistent than if conditions vary widely between sessions.
Conclusion
Speaker identification turns anonymous diarization labels into persistent, named identities. The workflow is simple: enroll each team member once with a short voice sample, then pass their voiceprints alongside every new recording. pyannoteAI handles the matching and returns segments labeled with real names. Combined with ASR, you get a transcript where every utterance is attributed to a known speaker, ready for meeting notes, searchable archives, or downstream analytics.
