
Speaker diarization on Pro: who said what, automatically

Real-time labels on AssemblyAI streaming so you never lose the thread

A transcript without speakers is a wall of text. On a busy call (four people in a discovery call, a standup with overlapping voices) you need to know who pushed back on pricing and who offered the workaround. Scriba Pro and Max plans include real-time speaker diarization through AssemblyAI's EU streaming stack, so labels appear as the conversation unfolds instead of after a post-processing delay.

How diarization works in Scriba

When you hit record, audio flows through voice activity detection into five-to-ten-second chunks. On managed Pro/Max, those chunks stream to AssemblyAI's u3-rt-pro relay in eu-central-1. The API returns timestamped segments with speaker labels — Speaker A, Speaker B, and so on — that render live in the meeting view as people talk.

  • Streaming diarization is a Pro/Max managed feature — not available in BYOK mode.
  • Batch fallback uses universal-3-pro or universal-2 when streaming isn't an option.
  • Segments land in SQLite alongside the transcript for search and export.
  • Audio stays at 16 kHz mono from capture through playback, so audio and timestamps stay aligned.
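
As a rough illustration of the storage step, here is a minimal sketch of how timestamped, speaker-labeled segments might be written to SQLite and searched. The table layout, the `Segment` fields, and the function names are assumptions made for this example; Scriba's actual schema is not documented here.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One diarized span of speech (hypothetical shape, not Scriba's schema)."""
    start_ms: int
    end_ms: int
    speaker: str  # e.g. "A", "B", as returned by the diarization service
    text: str


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a local SQLite store with a segments table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS segments (
               meeting_id TEXT    NOT NULL,
               start_ms   INTEGER NOT NULL,
               end_ms     INTEGER NOT NULL,
               speaker    TEXT    NOT NULL,
               text       TEXT    NOT NULL
           )"""
    )
    return conn


def save_segments(conn: sqlite3.Connection, meeting_id: str, segments: list[Segment]) -> None:
    """Persist a batch of segments for one meeting."""
    conn.executemany(
        "INSERT INTO segments VALUES (?, ?, ?, ?, ?)",
        [(meeting_id, s.start_ms, s.end_ms, s.speaker, s.text) for s in segments],
    )
    conn.commit()


def search(conn: sqlite3.Connection, meeting_id: str, term: str) -> list[Segment]:
    """Return segments whose text contains `term`, in time order."""
    rows = conn.execute(
        "SELECT start_ms, end_ms, speaker, text FROM segments "
        "WHERE meeting_id = ? AND text LIKE ? ORDER BY start_ms",
        (meeting_id, f"%{term}%"),
    )
    return [Segment(*row) for row in rows]
```

Keeping the speaker label on every row means a search hit already carries who said it and when, with no second lookup.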

Why labels matter after the call

Speaker tags survive into chat context. Ask the assistant to summarize only what the customer said, or pull every objection from Speaker C. Link contacts to meetings, and names resolve in the UI over time. For teams running bulk import on Pro or Max, batch transcription applies the same diarization model to hundreds of archived recordings in the local job queue, with crash recovery and retries included.

Diarization turns a transcript from a document into a conversation — and conversations are what you actually need to act on.

Pro vs BYOK: know the tradeoff

BYOK mode sends chunked audio to OpenAI Whisper. It's excellent for transcription on your key, but streaming diarization and bulk import are managed-only capabilities. If who-said-what is non-negotiable for your workflow, Pro or Max on managed mode is the path — AssemblyAI handles labels in real time without you operating a separate diarization pipeline.
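
The tradeoff can be summarized as a small capability table. This is a sketch based only on what this article states; the `Mode` enum, the feature names, and `supports` are illustrative, not part of any Scriba API.

```python
from enum import Enum


class Mode(Enum):
    MANAGED = "managed"  # Pro or Max on managed mode
    BYOK = "byok"        # bring your own key (OpenAI Whisper)


# Feature sets as described in this article; the names are hypothetical.
CAPABILITIES: dict[Mode, frozenset[str]] = {
    Mode.MANAGED: frozenset({"transcription", "streaming_diarization", "bulk_import"}),
    Mode.BYOK: frozenset({"transcription"}),
}


def supports(mode: Mode, feature: str) -> bool:
    """Return True if the given mode offers the feature."""
    return feature in CAPABILITIES[mode]
```

A check like `supports(Mode.BYOK, "streaming_diarization")` evaluates to `False`, which is the case where moving to managed Pro or Max is the relevant option.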

Tips for cleaner labels

Use a decent mic, avoid two people on one laptop when you can, and pause before jumping in — overlapping speech is the hardest case for any diarization model. Scriba still captures the audio locally, so if labels drift you can replay the segment and correct the record in notes or Brain memory before the story spreads. On long calls, skim by speaker in the transcript panel to jump straight to the buyer's objections.
