Chatterbox uses zero-shot voice cloning to synthesize speech in any voice from just a short reference audio clip. You don't need to train or fine-tune the model: simply provide a reference audio file and the model will clone the voice characteristics.
## How Zero-Shot Cloning Works
Chatterbox extracts voice characteristics from your reference audio and uses them to condition speech generation. The model analyzes:

- Speaker identity, using a voice encoder that creates speaker embeddings
- Speech patterns, by tokenizing the reference audio into speech tokens
- Prosody and style, through the conditioning mechanism
## Prepare your reference audio
Choose a clean audio clip of the voice you want to clone. The clip should:
- Be at least 5 seconds long (Turbo requires 5+ seconds, standard models work with shorter clips)
- Ideally be 6-15 seconds for best results
- Have clear speech without background noise
- Contain natural speaking patterns representative of the target voice
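Before handing a clip to the model, it can help to verify its duration programmatically. A minimal sketch using only Python's standard-library `wave` module (WAV files only; other formats would need a library such as librosa):

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_valid_reference(path: str, minimum: float = 5.0) -> bool:
    """Check a clip meets the Turbo model's 5-second minimum."""
    return wav_duration_seconds(path) >= minimum
```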
## Load the model
Initialize the Chatterbox model on your preferred device. For the standard English or multilingual models:
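A minimal sketch, assuming the `ChatterboxTTS.from_pretrained` and `ChatterboxMultilingualTTS` entry points published in the project's README (verify the class paths against your installed version):

```python
import torch
from chatterbox.tts import ChatterboxTTS

# Pick the best available device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Standard English model
model = ChatterboxTTS.from_pretrained(device=device)

# Multilingual model lives in a separate module of the same package:
# from chatterbox.mtl_tts import ChatterboxMultilingualTTS
# model = ChatterboxMultilingualTTS.from_pretrained(device=device)
```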
## Using the `audio_prompt_path` Parameter
The `audio_prompt_path` parameter accepts a path to your reference audio file:
When you pass `audio_prompt_path`, the model automatically calls `prepare_conditionals()` internally to extract voice characteristics from the reference audio.
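A sketch of how this looks in practice, assuming the `generate` signature from the project's README (file names are placeholders):

```python
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Zero-shot cloning needs only a short reference clip."

# The reference clip conditions the generated voice
wav = model.generate(text, audio_prompt_path="reference_voice.wav")

# model.sr is the model's output sample rate
ta.save("cloned_output.wav", wav, model.sr)
```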
## Best Practices for Reference Audio
### Duration Guidelines
- Minimum: 5 seconds (required for Turbo model)
- Recommended: 6-15 seconds
- Maximum: The model uses only the first 6-10 seconds depending on the model variant
### Content Recommendations
Choose reference audio that matches the speaking style you want to generate. If you want energetic speech, use an energetic reference. For calm narration, use calm reference audio.
- Clear articulation: Avoid mumbled or unclear speech
- Natural pacing: Not too fast or too slow
- Single speaker: Ensure only one person is speaking
- Complete sentences: Avoid fragments or cutoffs
- Representative style: Match the intended output style
### Format Support
The model uses `librosa` for loading audio, which supports:
- WAV
- MP3
- FLAC
- OGG
- and other common formats
## Pre-computing Voice Conditionals
For repeated use of the same voice, you can pre-compute conditionals to save time:

## Adjusting Exaggeration
The `exaggeration` parameter controls how expressive the cloned voice sounds:
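The two ideas above can be combined in one sketch, assuming the `prepare_conditionals(wav_fpath, exaggeration=...)` and `generate(..., exaggeration=...)` signatures found in the project source (paths are placeholders):

```python
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# Pre-compute voice conditionals once for a reference clip;
# subsequent generate() calls reuse them without reprocessing the audio.
model.prepare_conditionals("reference_voice.wav", exaggeration=0.5)

# Neutral delivery using the cached conditionals
calm = model.generate("Welcome to the evening news.")

# More expressive delivery of the same voice
excited = model.generate(
    "We have incredible breaking news tonight!",
    exaggeration=0.8,  # higher values produce more expressive speech
)
```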
## Built-in Voice
All models come with a built-in default voice that's used when you don't provide an `audio_prompt_path`:
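For instance (a sketch; the bundled default voice is used whenever no reference clip is passed):

```python
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

# No audio_prompt_path: output uses the built-in default voice
wav = model.generate("This sentence uses the built-in voice.")
```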
## Troubleshooting
### "Audio prompt must be longer than 5 seconds"
The Turbo model requires at least 5 seconds of reference audio. Ensure your audio file meets this minimum duration.

### Poor Voice Similarity

If the generated voice doesn't match your reference well:

- Use higher-quality reference audio: Reduce background noise
- Try a longer reference clip: Use 10-15 seconds instead of 5-6
- Adjust `cfg_weight` (standard models only): Try lower values like `cfg_weight=0.3`
- Match speaking style: Ensure the reference style matches your target text
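As an illustration of the `cfg_weight` adjustment, a sketch assuming the standard model's `generate` accepts `cfg_weight` as documented in the project README (file names are placeholders):

```python
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
text = "A line that should closely match the reference voice."

# First attempt with default settings
wav = model.generate(text, audio_prompt_path="reference_voice.wav")

# If similarity is poor, retry with a lower cfg_weight
wav = model.generate(
    text,
    audio_prompt_path="reference_voice.wav",
    cfg_weight=0.3,  # lower values can improve speaker similarity
)
```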