
Optimizing Speaker's Audio Settings

What are the new speaker settings?

When processing videos with complex audio—such as those with background noise or specific vocal requirements—fine-tuning your speaker settings is essential. Use this guide to determine which slider will best solve your specific audio challenges.

  • Go to Speaker Settings
    Open your project and navigate to Speaker Settings.

  • Select a Speaker
    Choose the speaker you want to configure (for example, Speaker 4).

  • Set Basic Details

    • Enter or edit the Speaker Name

    • Select the Gender


Quick Reference Table

| Feature | When to Use | Recommended Value |
|---|---|---|
| Clean Reference | Public settings, background cheering, or poor mic quality. | Off if the audio has already been processed and cleaned; On if the mic quality is poor. |
| Acoustic Boost | If the output voice sounds "hazy" or muffled. | Off if the output is already clear; On if the output is hazy. |
| Speaker Stability | To make the speech more "expressive" or dynamic, or to reduce hallucinations. | Decrease the value towards 0 to make speech more expressive; increase it towards 1 to reduce hallucinations. |
| Speaker Similarity | When cloning permission is not available. | 0.7-1 (Standard) for exact cloning; 0-0.3 for no cloning. |
| Accent Boost | Non-English to English voiceover only. | 0 for an accent influenced by the source language; 1 for a target English regional accent. |

⚠️ Note:
Noise or distortion in the reference audio may reduce the naturalness of the output.

  • The same speaker settings apply to all target languages, but you do not need to use the same settings for every speaker in a video. The best values depend largely on how much voice data each speaker has in the video.

  • You can use intermediate values to achieve varying levels of trade-off across the settings. Note that changing a setting requires regeneration, which consumes credits.
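As a way to keep track of the values chosen for each speaker, the settings above can be sketched as a small data structure. This is only an illustration: `SpeakerSettings` and its field names are hypothetical and are not part of any product API. The defaults for Clean Reference, Acoustic Boost, and Speaker Stability are assumptions, since this guide only states defaults for Speaker Similarity (0.7) and Accent Boost (0).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakerSettings:
    """Hypothetical record of one speaker's settings (not a product API)."""

    clean_reference: bool = False  # assumed default; the guide states none
    acoustic_boost: bool = False   # assumed default; the guide states none
    stability: float = 0.5         # assumed default; the guide states none
    similarity: float = 0.7        # default given in this guide
    accent_boost: float = 0.0      # default given in this guide

    def __post_init__(self):
        # The three sliders are described on a 0-to-1 scale.
        for name in ("stability", "similarity", "accent_boost"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")


# Settings are chosen per speaker; they do not have to match across a video.
project = {
    "Speaker 1": SpeakerSettings(),
    "Speaker 4": SpeakerSettings(clean_reference=True, stability=0.8),
}
```

Keeping one record per speaker makes it easy to see which speakers differ from the defaults before you spend credits on a regeneration.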

Deep Dive: Setting Descriptions

1. Clean Reference

Use this setting when your source audio is "dirty." If the video was filmed in a public space with singing, cheering, or low-quality microphones, this will help isolate the primary speaker from the environment.

2. Acoustic Boost

This is your primary tool for clarity. If the output sounds hazy or lacks definition, toggle this on to sharpen the vocal profile.

3. Speaker Stability

This controls how "robotic" or "human" the voice sounds.

  • For more expressive speech: Decrease the value towards 0. This allows for more natural inflection and emotion.

  • To reduce hallucinations: Increase the value towards 1.

4. Speaker Similarity

This adjusts how closely the AI mimics the original voice.

  • 0.0 – 0.4: Use this when you want the speaker to sound like a different person.

  • 0.7 (Default): The "sweet spot" for most standard speakers.

  • 1.0: Reserved for special cases where the output doesn't sound enough like the original.

  • Note: Use lower settings if you do not have explicit permission to clone a specific voice.
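The permission guidance above can be expressed as a simple pre-flight check. This is a minimal sketch, not a product feature: the function name and the `has_cloning_permission` flag are hypothetical, and the 0.4 cutoff is an assumption taken from the upper end of the "different person" range in this section. The Quick Reference Table lists 0-0.3 for no cloning, so a stricter cutoff may be preferable.

```python
NO_CLONE_MAX = 0.4  # assumed cutoff: top of the "different person" range above


def similarity_ok(similarity: float, has_cloning_permission: bool) -> bool:
    """Return True if the similarity value fits the permission guidance."""
    if not 0.0 <= similarity <= 1.0:
        raise ValueError("similarity must be between 0 and 1")
    if has_cloning_permission:
        # With permission, any value on the slider is acceptable.
        return True
    # Without permission, stay in the range that sounds like a different person.
    return similarity <= NO_CLONE_MAX
```

For example, `similarity_ok(0.7, has_cloning_permission=False)` returns `False`, signalling that the default value should be lowered before generating.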

5. Accent Boost (Non-English to English only)

Controls how much of the speaker's native accent carries over into the English output.

  • 0 (Default): Retains the native tongue's accent in the English output. Even a value of 0.3 can remove the source-language accent in some cases.

  • 0.7 – 1.0: Use this to achieve unaccented American / British English.

6. Speaker Rating

Start with a baseline of 1.0. If the speech feels rushed, increase the rate above 1.0; if the pauses between words are too long, decrease it below 1.0.
For example, use 0.8 for audiobooks and simple narrations.
