Speaker Similarity: What It Is & How to Use It 🎙️

Speaker Similarity lets you decide how much of the original speaker’s voice identity is preserved in the output.

Updated over a week ago

🗣️ What is Speaker Similarity?

Speaker Similarity controls how closely the generated voice matches the original speaker’s voice from the reference audio. It helps maintain the same voice identity when audio is generated, translated, or dubbed into another language.

⭐ Why Is Speaker Similarity Important?

Speaker Similarity ensures:

  • The same speaker identity across languages.

  • Natural and believable AI-generated speech.

  • Consistency in tone, pitch, and vocal character.

  • Better listener experience without sudden voice changes.

This is especially useful for:

  • 🎬 AI Dubbing

  • 🧑‍🤝‍🧑 Voice Cloning

  • 🌐 Multilingual Narration

  • 📦 Content Localization

⚖️ How to Use Speaker Similarity

  • Go to Speaker Settings
    Open your project and navigate to Speaker Settings.

  • Select a Speaker
    Choose the speaker you want to configure (for example, Speaker 4).

  • Set Basic Details

    • Enter or edit the Speaker Name

    • Select the Gender

  • (Optional) Improve Reference Audio

    • Enable Clean Reference to remove noise from the reference audio

    • Enable Maintain Source Accent if you want to keep the original accent

  • Adjust the Speaker Similarity Slider

    • Move the Speaker Similarity slider to control how closely the generated voice matches the original speaker

    • Recommended: The default Speaker Similarity value is 0.7, which provides a good balance between voice accuracy and natural sound.

    • Higher values sound more like the original voice; lower values give more flexibility

  • Select Voice Models

    • Choose the language

    • Select the voice model (for example, Voice From Original Media)

  • Click Confirm and save the speaker.
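
The settings chosen in the steps above can be written down as a small record, which can help when you want to note or compare the values used for each speaker. This is only an illustrative sketch: the article describes a user interface, not an API, so the class name `SpeakerSettings`, its field names, the off-by-default toggles, and the 0.0 to 1.0 slider range are all assumptions. Only the default value of 0.7 and the example names come from this article.

```python
from dataclasses import dataclass


@dataclass
class SpeakerSettings:
    """Hypothetical record of one speaker's settings (not a product API)."""

    name: str                              # Speaker Name, e.g. "Speaker 4"
    gender: str                            # Gender chosen under Basic Details
    language: str                          # Language chosen under Voice Models
    voice_model: str = "Voice From Original Media"
    clean_reference: bool = False          # assumed off until you enable it
    maintain_source_accent: bool = False   # assumed off until you enable it
    speaker_similarity: float = 0.7        # default value stated in this article

    def __post_init__(self) -> None:
        # Assumption: the slider spans 0.0 to 1.0.
        if not 0.0 <= self.speaker_similarity <= 1.0:
            raise ValueError("speaker_similarity must be between 0.0 and 1.0")


settings = SpeakerSettings(name="Speaker 4", gender="Female", language="Spanish")
print(settings.speaker_similarity)  # 0.7
```

Keeping one such record per speaker mirrors the fact that each speaker is configured individually.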


What Are the Voice Fine-Tuning Sliders?

These sliders help balance realism and clarity. They are enabled by default and can be adjusted individually for each speaker. Move a slider to the value you want, then click Confirm to apply it.

Stability

  • Controls how steady the voice sounds.

  • Higher values make the voice more consistent and less expressive.

Speaker Similarity

  • Adjusts how closely the AI voice matches the original speaker.

  • Higher values mean the voice sounds more like the original person.

Accent Boost

  • Strengthens the accent in the generated voice.

  • This is helpful when the accent feels too neutral or weak.

The Speaker Similarity value ranges behave as follows:

🔹 Lower Values (e.g., 0.3 – 0.5):

  • Voice is less similar to the original

  • More flexibility in pronunciation and delivery

  • Useful when:

    • Reference audio quality is low

    • Exact voice matching is not required

🔹 Medium Values (Recommended: ~0.6 – 0.75):

  • Balanced similarity and clarity

  • Voice sounds like the original speaker but remains natural

  • Best for most dubbing and narration use cases

🔹 Higher Values (e.g., 0.8 – 1.0):

  • Voice sounds very close to the original speaker

  • Preserves vocal identity strongly

  • Useful when:

    • High-quality reference audio is available

    • Voice consistency is critical
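
As a quick reference, the ranges above can be expressed as a small lookup. This is a sketch only: the function name `similarity_band` is hypothetical, the 0.0 to 1.0 bounds are an assumption, and values that fall between the listed ranges (for example 0.55) are reported as "unlisted" because this article does not say how they behave.

```python
def similarity_band(value: float) -> str:
    """Return the band from this article that a Speaker Similarity value falls in."""
    # Assumption: the slider spans 0.0 to 1.0.
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be between 0.0 and 1.0")
    if 0.3 <= value <= 0.5:
        return "lower"      # less similar, more flexible delivery
    if 0.6 <= value <= 0.75:
        return "medium"     # recommended balance of similarity and clarity
    if 0.8 <= value <= 1.0:
        return "higher"     # very close to the original speaker
    return "unlisted"       # between or below the ranges listed in the article


print(similarity_band(0.7))  # medium
```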


⚠️ Note:
Noise or distortion in the reference audio may reduce the naturalness of the generated voice.
