31 changes: 21 additions & 10 deletions docs/speech-to-text/batch/batch_diarization.mdx
@@ -25,11 +25,11 @@ To learn more about diarization as a feature, check out the [diarization](../fea

Batch diarization offers the following ways to separate speakers in audio:

- [**Speaker diarization**](#speaker-diarization) — Identifies each speaker by their voice.
Useful when there are multiple speakers in the same audio stream.
- [**Speaker diarization**](#speaker-diarization) — Identifies each speaker by their voice.
Useful when there are multiple speakers in the same audio stream.

- [**Channel diarization**](#channel-diarization) — Transcribes each audio channel separately.
Useful when each speaker is recorded on their own channel.
- [**Channel diarization**](#channel-diarization) — Transcribes each audio channel separately.
Useful when each speaker is recorded on their own channel.

## Speaker diarization

@@ -170,23 +170,34 @@ You can reduce the likelihood of incorrectly switching between similar sounding
}
}
```
By default this flag is `false`. When this flag is set to `true`, the system will stay with the speaker of the previous word, if they closely match the speaker of the new word.
By default, this is `false`. When set to `true`, the system stays with the speaker of the previous word if that speaker closely matches the speaker of the new word.

This can reduce instances where the system inadvertently alternates between different speaker labels within a single-speaker audio segment.

However, it may also result in some shorter speaker turn changes between similar speakers being missed.

This may result in some shorter speaker turn changes between similar speakers being missed.

### Speaker diarization and punctuation

Speaker diarization uses punctuation to improve accuracy. Small corrections are applied to speaker labels based on sentence boundaries.
Speaker diarization uses punctuation to improve the accuracy of speaker change points. Small adjustments to speaker labels may be applied based on sentence boundaries.

For example, consider a case where the diarization marks a speaker change one word after a full stop:

> <span style={{ color: "red" }}>Hello my name is John. And</span> <span style={{ color: "blue" }}> my name is Alice.</span>

In this case, the speaker change point would be moved to match the end of the sentence:

> <span style={{ color: "red" }}>Hello my name is John.</span> <span style={{ color: "blue" }}> And my name is Alice.</span>
Author

Not sure if this is the best way of showing an example, so feedback welcome!


For example, if the system initially assigns 9 words in a sentence to S1 and 1 word to S2, the lone S2 word may be corrected to S1.
Speaker diarization may also insert punctuation when a speaker change occurs without a corresponding sentence-ending punctuation mark in the transcription result.

This adjustment only works when punctuation is enabled. Disabling punctuation via the `permitted_marks` setting in `punctuation_overrides` can reduce diarization accuracy.
These adjustments are only applied when punctuation is enabled. Disabling punctuation via the `permitted_marks` setting in `punctuation_overrides` can reduce diarization accuracy.

Adjusting punctuation sensitivity can also affect how accurately speakers are identified.

### Speaker change (legacy)

The speaker change detection feature was removed in July 2024. The `speaker_change` and `channel_and_speaker_change` parameters are no longer supported. Use the [speaker diarization](#speaker-diarization) feature for speaker labeling.
The speaker change detection feature was removed in July 2024. The `speaker_change` and `channel_and_speaker_change` parameters are no longer supported. Use the [speaker diarization](#speaker-diarization) feature for speaker labeling.

For API-related questions, contact [Support](https://support.speechmatics.com).

57 changes: 35 additions & 22 deletions docs/speech-to-text/realtime/realtime_diarization.mdx
@@ -28,18 +28,18 @@ To learn more about diarization as a feature, check out the [diarization](../fea

Real-time diarization offers the following ways to separate speakers in audio:

- [**Speaker diarization**](#speaker-diarization) — Identifies each speaker by their voice.
Useful when there are multiple speakers in the same audio stream.
- [**Speaker diarization**](#speaker-diarization) — Identifies each speaker by their voice.
Useful when there are multiple speakers in the same audio stream.

- [**Channel diarization**](#channel-diarization) — Transcribes each audio channel separately.
Useful when each speaker is recorded on their own channel.
- [**Channel diarization**](#channel-diarization) — Transcribes each audio channel separately.
Useful when each speaker is recorded on their own channel.

- [**Channel & speaker diarization**](#channel-and-speaker-diarization) — Combines both methods.
Each channel is transcribed separately, with unique speakers identified within each channel.
Useful when multiple speakers are present across multiple channels.
- [**Channel & speaker diarization**](#channel-and-speaker-diarization) — Combines both methods.
Each channel is transcribed separately, with unique speakers identified within each channel.
Useful when multiple speakers are present across multiple channels.

## Speaker diarization


Speaker diarization picks out different speakers from the audio stream based on acoustic matching.

@@ -169,7 +169,7 @@ Transcripts are returned independently for each channel, with the `channel` prop
```

:::warning
The `channel` property will be returned for `AddTranscript` and `AddPartialTranscript` messages only.
The `channel` property will be returned for `AddTranscript` and `AddPartialTranscript` messages only.
Features such as [audio events](/speech-to-text/features/audio-events), [translation](/speech-to-text/features/translation) and [end of turn detection](/speech-to-text/realtime/end-of-turn) do not currently include this property. To request this feature, please contact [support](https://support.speechmatics.com).
:::

@@ -179,7 +179,7 @@ Channel and speaker diarization combines speaker diarization and channel diariza

To enable this mode, follow the steps in [speaker diarization](#speaker-diarization) and set the `diarization` mode to `channel_and_speaker`.

To send audio to a channel, follow the instructions in [send audio to a channel](#send-audio-to-a-channel).
To send audio to a channel, follow the instructions in [send audio to a channel](#send-audio-to-a-channel).

Transcripts are returned in the same way as channel diarization, but with individual speakers identified:

@@ -221,15 +221,14 @@ For SaaS customers, the maximum number of channels is 2.

For On-prem Container customers, the maximum number of channels depends on your [Multi-session container's](../../deployments/container/cpu-speech-to-text.mdx#multi-session-containers) maximum number of connections.

The Speechmatics Python client CLI is currently limited to transcribing multi-channel audio in via files and not streaming/raw audio.
The Speechmatics Python client CLI is currently limited to transcribing multi-channel audio from files, not from streaming or raw audio.

## Configuration

You can customize diarization to match your use case by adjusting settings for sensitivity, limiting the maximum number of speakers, preferring the current speaker to reduce false switches, and controlling how punctuation influences accuracy.

### Speaker sensitivity


You can configure the sensitivity of speaker detection by using the `speaker_sensitivity` setting in the `speaker_diarization_config` section of the job config object as shown below:

```json
@@ -250,7 +249,7 @@ You can configure the sensitivity of speaker detection by using the `speaker_sen
This takes a value between 0 and 1 (the default is 0.5). A higher sensitivity
increases the likelihood of more unique speakers being returned.

### Prefer Current Speaker
### Prefer current speaker

You can reduce the likelihood of incorrectly switching between similar sounding speakers by setting the `prefer_current_speaker` flag in the `speaker_diarization_config`:

@@ -270,9 +269,11 @@ You can reduce the likelihood of incorrectly switching between similar sounding
```
By default, this is `false`. When set to `true`, the system stays with the speaker of the previous word if that speaker closely matches the speaker of the new word.

This may result in some shorter speaker turn changes between similar speakers being missed.
This can reduce instances where the system inadvertently alternates between different speaker labels within a single-speaker audio segment.

However, it may also result in some shorter speaker turn changes between similar speakers being missed.

### Max. Speakers
### Max. speakers

You can prevent too many speakers from being detected by using the `max_speakers` setting in the `StartRecognition` message as shown below:

@@ -299,27 +300,39 @@ You can prevent too many speakers from being detected by using the `max_speakers

The default value is 50, but it can take any integer value between 2 and 100 inclusive.

### Punctuation
This restricts the number of unique speaker labels that may be output by the system.

Note that accuracy may decline once this limit is reached. It is advisable to set the value to at least the expected number of speakers, and preferably slightly higher.
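
The ranges and defaults described in this Configuration section can be checked on the client side before a session starts. The following is a minimal sketch, not part of any Speechmatics client: the helper name `build_speaker_diarization_config` is hypothetical, and the surrounding `StartRecognition` layout is an assumption that should be verified against the API reference. Only the three setting names and their documented ranges come from the text above.

```python
from typing import Any, Dict, Optional


def build_speaker_diarization_config(
    speaker_sensitivity: Optional[float] = None,
    prefer_current_speaker: Optional[bool] = None,
    max_speakers: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: validate settings and return a config dict.

    Settings left as None are omitted so the service defaults apply
    (0.5 sensitivity, prefer_current_speaker false, 50 max speakers).
    """
    config: Dict[str, Any] = {}
    if speaker_sensitivity is not None:
        # Documented range: a value between 0 and 1.
        if not 0 <= speaker_sensitivity <= 1:
            raise ValueError("speaker_sensitivity must be between 0 and 1")
        config["speaker_sensitivity"] = speaker_sensitivity
    if prefer_current_speaker is not None:
        config["prefer_current_speaker"] = bool(prefer_current_speaker)
    if max_speakers is not None:
        # Documented range: any integer between 2 and 100 inclusive.
        if not 2 <= max_speakers <= 100:
            raise ValueError("max_speakers must be between 2 and 100 inclusive")
        config["max_speakers"] = max_speakers
    return config


# Assumed message layout; other fields a real session needs are omitted here.
start_recognition = {
    "message": "StartRecognition",
    "transcription_config": {
        "language": "en",
        "diarization": "speaker",
        "speaker_diarization_config": build_speaker_diarization_config(
            prefer_current_speaker=True,
            max_speakers=4,  # expected 3 speakers, set slightly higher
        ),
    },
}
```

Validating locally only surfaces out-of-range values earlier; the service remains the authority on what it accepts.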

### Speaker diarization and punctuation

Speaker diarization uses punctuation to improve the accuracy of speaker change points. Small adjustments to speaker labels may be applied based on sentence boundaries.

For example, consider a case where the diarization marks a speaker change one word after a full stop:

> <span style={{ color: "red" }}>Hello my name is John. And</span> <span style={{ color: "blue" }}> my name is Alice.</span>

In this case, the speaker change point would be moved to match the end of the sentence:

Speaker diarization uses punctuation to improve accuracy. Small corrections are applied to speaker labels based on sentence boundaries.
> <span style={{ color: "red" }}>Hello my name is John.</span> <span style={{ color: "blue" }}> And my name is Alice.</span>

For example, if the system initially assigns 9 words in a sentence to S1 and 1 word to S2, the lone S2 word may be corrected to S1.
Speaker diarization may also insert punctuation when a speaker change occurs without a corresponding sentence-ending punctuation mark in the transcription result.

This adjustment only works when punctuation is enabled. Disabling punctuation via the `permitted_marks` setting in `punctuation_overrides` can reduce diarization accuracy.
These adjustments are only applied when punctuation is enabled. Disabling punctuation via the `permitted_marks` setting in `punctuation_overrides` can reduce diarization accuracy.

Adjusting punctuation sensitivity can also affect how accurately speakers are identified.
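
To make the sentence-boundary adjustment concrete, here is a toy sketch of the "John / Alice" case described in this section. It is purely illustrative and is not the algorithm Speechmatics uses: the function name, the `(text, speaker)` word representation, and the one-word rule are all assumptions made for the example.

```python
from typing import List, Tuple

Word = Tuple[str, str]  # (text including trailing punctuation, speaker label)

SENTENCE_END = (".", "?", "!")


def snap_change_to_sentence_end(words: List[Word]) -> List[Word]:
    """Toy rule: if a speaker change lands one word after a sentence end,
    relabel that one word so the change coincides with the boundary."""
    fixed = list(words)
    for i in range(1, len(fixed) - 1):
        prev_text, prev_speaker = fixed[i - 1]
        text, speaker = fixed[i]
        _, next_speaker = fixed[i + 1]
        starts_sentence = prev_text.endswith(SENTENCE_END)
        # The word opens a new sentence but still carries the previous
        # speaker's label, while the following word belongs to someone else.
        if starts_sentence and speaker == prev_speaker and next_speaker != speaker:
            fixed[i] = (text, next_speaker)
    return fixed


words = [
    ("Hello", "S1"), ("my", "S1"), ("name", "S1"), ("is", "S1"), ("John.", "S1"),
    ("And", "S1"), ("my", "S2"), ("name", "S2"), ("is", "S2"), ("Alice.", "S2"),
]
adjusted = snap_change_to_sentence_end(words)
```

With this input, only the label on "And" changes, from `S1` to `S2`. Because the rule depends on sentence-ending marks in the word text, it has nothing to work with when punctuation is disabled, which mirrors the note above.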

### Speaker change (legacy)

The Speaker Change Detection feature was removed in July 2024. The `speaker_change` and `channel_and_speaker_change` parameters are no longer supported. Use the [Speaker diarization](#speaker-diarization) feature for speaker labeling.
The Speaker Change Detection feature was removed in July 2024. The `speaker_change` and `channel_and_speaker_change` parameters are no longer supported. Use the [Speaker diarization](#speaker-diarization) feature for speaker labeling.

For API-related questions, contact [support](https://support.speechmatics.com).

## On-prem

To run `channel` or `channel_and_speaker` diarization with an on-prem deployment, configure your environment as follows:

- Use a [GPU Speech-to-Text container](../../deployments/container/gpu-speech-to-text.mdx). Handling multiple audio streams is computationally intensive and benefits from GPU acceleration.
- Set the `SM_MAX_CONCURRENT_CONNECTIONS` environment variable to match the number of channels you want to process.
- Use a [GPU Speech-to-Text container](../../deployments/container/gpu-speech-to-text.mdx). Handling multiple audio streams is computationally intensive and benefits from GPU acceleration.
- Set the `SM_MAX_CONCURRENT_CONNECTIONS` environment variable to match the number of channels you want to process.

For more details on container setup, see the [on-prem deployment docs](../../deployments/index.md).