When processing videos with complex audio—such as those with background noise or specific vocal requirements—fine-tuning your speaker settings is essential. Use this guide to determine which slider will best solve your specific audio challenges.
Go to Speaker Settings
Open your project and navigate to Speaker Settings.
Select a Speaker
Choose the speaker you want to configure (for example, Speaker 4).
Set Basic Details
Enter or edit the Speaker Name.
Select the Gender.
Quick Reference Table
Feature | When to Use | Recommended Value |
Clean Reference | Source recorded in a public setting, with background cheering, or with poor mic quality. | On for noisy sources; Off if the audio has already been processed and cleaned |
Acoustic Boost | If the output voice sounds "hazy" or muffled. | On to sharpen the voice; Off if the output is already clear |
Speaker Stability | To make the speech more "expressive" or dynamic. | Decrease the value toward 0 for more expressive speech |
Speaker Similarity | To control how closely the output matches the original voice, for example when cloning permission is not available. | 0.7-1 for close cloning (0.7 is the default); 0.0-0.4 when cloning permission is not available |
Accent Boost | Non-English to English voiceover only. | 0 to keep the source-language accent; 1 for a regional English accent (American / British) |
⚠️ Note:
Noise or distortion in the reference audio may reduce the naturalness of the output.
The same speaker settings apply to all target languages, but you do not need to use the same settings for every speaker in a video. The best values depend on how much voice data each speaker has in the video.
You can use intermediate values to trade off between the effects of a setting. Note that changing a setting requires re-generation, which consumes credits.
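The point that settings belong to a speaker rather than to a target language can be pictured as a small per-speaker record. The Python sketch below is illustrative only: the `SpeakerSettings` class, its field names, and the 0-to-1 range check are assumptions made for this example and do not correspond to any documented API or export format. Only the 0.7 and 0 defaults are taken from this guide.

```python
from dataclasses import dataclass


@dataclass
class SpeakerSettings:
    """Hypothetical per-speaker record; names are illustrative, not a product API."""
    clean_reference: bool      # on for noisy sources, off for already-cleaned audio
    acoustic_boost: bool       # on when the output sounds hazy or muffled
    speaker_stability: float   # lower values give more expressive speech
    speaker_similarity: float = 0.7  # default named in this guide
    accent_boost: float = 0.0        # default named in this guide

    def __post_init__(self) -> None:
        # Assumption: every slider runs from 0 to 1.
        for name in ("speaker_stability", "speaker_similarity", "accent_boost"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")


# One record per speaker; the same record is reused for every target language.
settings_by_speaker = {
    "Speaker 1": SpeakerSettings(clean_reference=True, acoustic_boost=False,
                                 speaker_stability=0.5),
    "Speaker 4": SpeakerSettings(clean_reference=False, acoustic_boost=True,
                                 speaker_stability=0.3, speaker_similarity=1.0),
}
```

Keeping one record per speaker makes it easy to note which values were tried before each re-generation.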
Deep Dive: Setting Descriptions
1. Clean Reference
Use this setting when your source audio is "dirty." If the video was filmed in a public space with singing, cheering, or low-quality microphones, this will help isolate the primary speaker from the environment.
2. Acoustic Boost
This is your primary tool for clarity. If the output sounds hazy or lacks definition, toggle this on to sharpen the vocal profile.
3. Speaker Stability
This controls where the voice falls between "robotic" (steady) and "human" (expressive).
For more expressive speech: Decrease the value. This allows for more natural inflection and emotion.
4. Speaker Similarity
This adjusts how closely the AI mimics the original voice.
0.0 – 0.4: Use this when you want the speaker to sound like a different person.
0.7 (Default): The "sweet spot" for most standard speakers.
1.0: Reserved for special cases where the output doesn't sound enough like the original.
Note: Use lower settings if you do not have explicit permission to clone a specific voice.
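The ranges above can be read as a simple lookup. The following is a minimal sketch, assuming the slider runs from 0 to 1; the function name `similarity_band` and the "intermediate" label for values between 0.4 and 0.7 (which this guide does not describe) are hypothetical.

```python
def similarity_band(value: float) -> str:
    """Describe a Speaker Similarity value using the ranges listed in this guide."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("Speaker Similarity is expected to be between 0 and 1")
    if value <= 0.4:
        return "different voice"          # 0.0 - 0.4 in the guide
    if value < 0.7:
        return "intermediate"             # assumption: not described in the guide
    if value == 0.7:
        return "default"                  # the guide's "sweet spot"
    return "closer to the original"       # up to 1.0, for special cases


for v in (0.2, 0.5, 0.7, 1.0):
    print(v, similarity_band(v))
```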
5. Accent Boost (Non-English to English only)
Controls how much of the speaker's native accent carries over into the English output.
0 (Default): Retains the native tongue's accent in the English output. Even a 0.3 accent boost can remove the source-language accent in some cases.
0.7 – 1.0: Use this to achieve a standard American / British English accent.
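Because Accent Boost only applies when dubbing from a non-English source into English, it can help to check the language pair before adjusting it. The sketch below assumes languages are identified by tags such as "hi" or "en-US"; the function name `accent_boost_applies` and the tag format are hypothetical and not part of the product.

```python
def accent_boost_applies(source_lang: str, target_lang: str) -> bool:
    """Return True only for a non-English source dubbed into English."""
    def primary(tag: str) -> str:
        # Reduce tags like "en-US" or "en_GB" to their primary part, "en".
        return tag.replace("_", "-").split("-")[0].lower()

    return primary(source_lang) != "en" and primary(target_lang) == "en"


print(accent_boost_applies("hi", "en"))        # Hindi to English
print(accent_boost_applies("en-US", "en-GB"))  # English to English
print(accent_boost_applies("es", "fr"))        # Spanish to French
```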
6. Speaker Rating
Start with a baseline of 1.0. If the speech feels rushed, increase the rate to > 1.0; if the pauses between words are too long, decrease it to < 1.0.
For example, use 0.8 for audiobooks and simple narrations.
