
Speaker Similarity: What It Is & How to Use It 🎙️

Speaker Similarity lets you decide how much of the original speaker’s voice identity is preserved in the output.

Updated over a week ago

🗣️ What is Speaker Similarity?

Speaker Similarity controls how closely the generated voice matches the original speaker’s voice from the reference audio. It helps maintain the same voice identity when audio is generated, translated, or dubbed into another language.

⭐ Why Is Speaker Similarity Important?

Speaker Similarity ensures:

  • The same speaker identity across languages

  • Natural and believable AI-generated speech

  • Consistency in tone, pitch, and vocal character

  • Better listener experience without sudden voice changes.

This is especially useful for:

  • 🎬 AI Dubbing

  • 🧑‍🤝‍🧑 Voice Cloning

  • 🌐 Multilingual Narration

  • 📦 Content Localization

⚖️ How to Use Speaker Similarity

  1. Go to Speaker Settings
    Open your project and navigate to Speaker Settings.

  2. Select a Speaker
    Choose the speaker you want to configure (for example, Speaker 4).

  3. Set Basic Details

    • Enter or edit the Speaker Name

    • Select the Gender

  4. (Optional) Improve Reference Audio

    • Enable Clean Reference to remove noise from the reference audio

    • Enable Maintain Source Accent if you want to keep the original accent

  5. Adjust the Speaker Similarity Slider

    • Move the Speaker Similarity slider to control how closely the generated voice matches the original speaker

    • Recommended: The default Speaker Similarity value is 0.7, which provides a good balance between voice accuracy and natural sound.

    • Higher values sound more like the original voice; lower values give more flexibility in pronunciation and delivery

  6. Select Voice Models

    • Choose the language

    • Select the voice model (for example, Voice From Original Media)

  7. Save the Speaker
    Click Confirm.
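
If you keep notes on how each speaker was configured (for example, to reproduce the same setup in another project), the fields from the steps above can be written down as a simple record. The Python sketch below is purely illustrative: `SpeakerSettings` and its field names are hypothetical and are not a product API. It assumes the slider spans 0.0 to 1.0 and that the two optional toggles are off unless enabled; only the 0.7 default comes from this article.

```python
from dataclasses import dataclass


@dataclass
class SpeakerSettings:
    """Hypothetical record of the Speaker Settings fields (not a product API)."""

    name: str                             # step 3: Speaker Name
    gender: str                           # step 3: Gender
    language: str                         # step 6: language
    voice_model: str                      # step 6: voice model
    clean_reference: bool = False         # step 4 (assumed off unless enabled)
    maintain_source_accent: bool = False  # step 4 (assumed off unless enabled)
    similarity: float = 0.7               # step 5: documented default

    def __post_init__(self) -> None:
        # Assumption: the slider accepts values from 0.0 to 1.0.
        if not 0.0 <= self.similarity <= 1.0:
            raise ValueError(
                f"similarity must be between 0.0 and 1.0, got {self.similarity}"
            )


# Illustrative values only; the gender and language here are made up.
speaker = SpeakerSettings(
    name="Speaker 4",
    gender="Female",
    language="Spanish",
    voice_model="Voice From Original Media",
    clean_reference=True,
)
print(speaker)
```

Validating the similarity value when the record is created catches out-of-range entries early, before the settings are copied into the Speaker Settings screen.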

🎚️ How the Speaker Similarity Slider Works

The slider allows you to adjust the level of similarity between the original voice and the generated voice.

🔹 Lower Values (e.g., 0.3 – 0.5):

  • Voice is less similar to the original

  • More flexibility in pronunciation and delivery

  • Useful when:

    • Reference audio quality is low

    • Exact voice matching is not required

🔹 Medium Values (Recommended: ~0.6 – 0.75):

  • Balanced similarity and clarity

  • Voice sounds like the original speaker but remains natural

  • Useful for:

    • Most dubbing and narration use cases

🔹 Higher Values (e.g., 0.8 – 1.0):

  • Voice sounds very close to the original speaker

  • Preserves vocal identity strongly

  • Useful when:

    • High-quality reference audio is available

    • Voice consistency is critical
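
The three bands above can be summarized as a small lookup. This Python sketch is only a way to reason about a chosen value; `similarity_band` is a hypothetical helper, not part of the product. The article gives example ranges with gaps between them, so the cut-offs at 0.6 and 0.75 and the overall 0.0 to 1.0 range are assumptions.

```python
def similarity_band(value: float) -> str:
    """Classify a Speaker Similarity value as "lower", "medium", or "higher".

    Assumptions (not stated in the article): the slider spans 0.0 to 1.0,
    values below 0.6 count as "lower", and values above 0.75 count as "higher".
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"similarity must be between 0.0 and 1.0, got {value}")
    if value < 0.6:
        return "lower"    # more flexibility, less similar to the original
    if value <= 0.75:
        return "medium"   # balanced; includes the default of 0.7
    return "higher"       # closest match; best with clean reference audio


for v in (0.4, 0.7, 0.9):
    print(v, similarity_band(v))
```

With these assumed cut-offs, the default value of 0.7 falls in the medium band, matching the recommendation for most dubbing and narration work.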

⚠️ Note:
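Noise or distortion in the reference audio may reduce the naturalness of the generated voice. If your reference audio is noisy, consider enabling Clean Reference or using a lower Speaker Similarity value.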
