🗣️ What is Speaker Similarity?
Speaker Similarity controls how closely the generated voice matches the original speaker’s voice from the reference audio. It helps maintain the same voice identity when audio is generated, translated, or dubbed into another language.
⭐ Why Is Speaker Similarity Important?
Speaker Similarity ensures:
The same speaker identity across languages
Natural and believable AI-generated speech
Consistency in tone, pitch, and vocal character
A better listener experience without sudden voice changes
This is especially useful for:
🎬 AI Dubbing
🧑‍🤝‍🧑 Voice Cloning
🌐 Multilingual Narration
📦 Content Localization
⚖️ How to Use Speaker Similarity
Go to Speaker Settings
Open your project and navigate to Speaker Settings.
Select a Speaker
Choose the speaker you want to configure (for example, Speaker 4).
Set Basic Details
Enter or edit the Speaker Name
Select the Gender
(Optional) Improve Reference Audio
Enable Clean Reference to remove noise from the reference audio
Enable Maintain Source Accent if you want to keep the original accent
Adjust the Speaker Similarity Slider
Move the Speaker Similarity slider to control how closely the generated voice matches the original speaker
Recommended: The default Speaker Similarity value is 0.7, which provides a good balance between voice accuracy and natural sound.
Higher values sound more like the original voice; lower values allow more flexibility in pronunciation and delivery.
Select Voice Models
Choose the language
Select the voice model (for example, Voice From Original Media)
Save the Speaker
Click Confirm.
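The steps above can be summarized as a small settings record. This is only an illustrative sketch: the `SpeakerSettings` class and its field names are hypothetical and do not correspond to a documented API. The 0.0 to 1.0 slider range, the default-off toggles, and the placeholder gender and language values are also assumptions. Only the option names, the 0.7 default, and the "Speaker 4" and "Voice From Original Media" examples come from this guide.

```python
from dataclasses import dataclass

# Default slider value stated in this guide.
DEFAULT_SPEAKER_SIMILARITY = 0.7


@dataclass
class SpeakerSettings:
    """Hypothetical record mirroring the fields on the Speaker Settings screen."""

    name: str
    gender: str
    language: str
    voice_model: str = "Voice From Original Media"
    clean_reference: bool = False          # assumption: off by default
    maintain_source_accent: bool = False   # assumption: off by default
    speaker_similarity: float = DEFAULT_SPEAKER_SIMILARITY

    def __post_init__(self) -> None:
        # Assumption: the slider runs from 0.0 to 1.0.
        if not 0.0 <= self.speaker_similarity <= 1.0:
            raise ValueError("speaker_similarity must be between 0.0 and 1.0")


# Placeholder values for illustration only.
settings = SpeakerSettings(
    name="Speaker 4",
    gender="Female",
    language="Spanish",
    clean_reference=True,
)
print(settings.speaker_similarity)
```

Leaving `speaker_similarity` unset keeps the recommended default, which matches the advice to change the slider only when the default balance does not suit the project.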
🎚️ How Does the Speaker Similarity Slider Work?
The slider allows you to adjust the level of similarity between the original voice and the generated voice.
🔹 Lower Values (e.g., 0.3–0.5):
Voice is less similar to the original
More flexibility in pronunciation and delivery
Useful when:
Reference audio quality is low
Exact voice matching is not required
🔹 Medium Values (Recommended: ~0.6–0.75):
Balanced similarity and clarity
Voice sounds like the original speaker but remains natural
Useful for:
Most dubbing and narration use cases
🔹 Higher Values (e.g., 0.8–1.0):
Voice sounds very close to the original speaker
Preserves vocal identity strongly
Useful when:
High-quality reference audio is available
Voice consistency is critical
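The ranges above can be sketched as a simple lookup. The function name `similarity_band` is hypothetical. Because the example ranges in this guide leave small gaps (0.5 to 0.6 and 0.75 to 0.8), the cut-offs of 0.6 and 0.8 used here are assumptions, as is the 0.0 to 1.0 slider range.

```python
def similarity_band(value: float) -> str:
    """Classify a Speaker Similarity value as 'lower', 'medium', or 'higher'.

    Cut-offs at 0.6 and 0.8 are assumptions that close the gaps between
    the example ranges given in the guide.
    """
    # Assumption: the slider runs from 0.0 to 1.0.
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be between 0.0 and 1.0")
    if value < 0.6:
        return "lower"    # less similar, more flexible delivery
    if value < 0.8:
        return "medium"   # balanced similarity and clarity
    return "higher"       # strongly preserves vocal identity


for v in (0.4, 0.7, 0.9):
    print(v, similarity_band(v))
```

With these assumed cut-offs, the default value of 0.7 falls in the medium band.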
⚠️ Note:
If the reference audio is noisy or distorted, the generated voice may sound less natural. Consider enabling Clean Reference or using a lower Speaker Similarity value in that case.


