Audio Studio
Generate voice, instrumental music, music with vocals and audio effects. Post-production included.
What you can do
Audio Studio creates and edits audio in every form:
- Synthesized voice — read texts aloud (TTS)
- Instrumental music — genre and mood (jazz, lo-fi, cinematic, etc.)
- Music with voice — description + sung lyrics
- Post-production — stem separation, noise reduction, mastering
Part 1: Voice and music generation
Choose the type
At the top left, three buttons:
- Voice — voice synthesis from text
- Instrumental music — genre and atmosphere
- Music with voice — full track with lyrics
Synthesized voice
- Choose the model:
  - Chatterbox TTS: fast, good quality, free
  - Chatterbox HD: runs locally, HD quality, more natural voice, free
  - IndexTTS2 (Premium HD): ultra-realistic, emotional voice; requires a paid plan
  - Cloud models (MiniMax, F5-TTS, Qwen3-TTS): excellent quality, rich voice options
- Write the text you want read aloud
- Optional: select the voice
  - Cloud models have preset voices (Maria, Luca, etc.)
  - Some models accept a custom voice: describe the voice you want (*"deep, calm male voice"*)
- Generate: the voice is created in a few seconds
Instrumental music
- Choose the model:
  - Stable Audio: free, good-quality instrumentals
  - Lyria 2, Eleven Music: cloud, premium quality
- Describe the genre and mood
  - Example: *"lo-fi hip hop, relaxing, with piano and soft drums"*
  - Example: *"epic cinematic, strings, brass, dramatic"*
- Set the duration: 10–120 seconds (depends on the model)
- Generate: the music is created
Music with voice
- Choose the model:
  - ACE-Step: free, generates full tracks with lyrics
  - Yue, Lyria 2 (cloud): premium, studio quality
- Write the style/genre
  - Example: *"melodic pop, energetic, female vocals"*
- Paste lyrics (optional)
  - Use tags like [verse], [chorus] to structure the song
  - If empty, the model generates instrumental music
- Set the duration: 15–300 seconds
- Generate: the track is created with a synthesized voice
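If you draft lyrics outside the app, the [verse] and [chorus] tags can be added with a small script before pasting. The following is only a sketch: `build_lyrics` is a hypothetical helper, not part of Audio Studio, and placing each tag on its own line with a blank line between sections is an assumed layout rather than a documented requirement.

```python
def build_lyrics(sections):
    """Join (tag, lines) pairs into one tagged lyrics string.

    `sections` is a list of (tag, list_of_lines) tuples, for example
    ("verse", ["first line", "second line"]).
    """
    blocks = []
    for tag, lines in sections:
        # Assumed layout: the tag in square brackets sits on its own line,
        # followed by the lines of that section.
        blocks.append("\n".join([f"[{tag}]"] + list(lines)))
    # A blank line between sections keeps the pasted text readable.
    return "\n\n".join(blocks)


lyrics = build_lyrics([
    ("verse", ["City lights are fading slow", "I keep walking, nowhere to go"]),
    ("chorus", ["Hold on, hold on", "The night is almost gone"]),
])
print(lyrics)
```

Paste the printed text into the lyrics field. Leaving the field empty still produces instrumental music, as noted above.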
Part 2: Audio post-production
After generating audio or uploading your own, you can apply effects:
From Gallery → "Use in"
- Generate or upload audio
- In Gallery, click Use in → Audio post-production
- Choose the effect:
  - Stem split (Demucs): separates voice, drums, bass, and other instruments
  - Denoise voice: removes background noise from a voice recording
  - Master to reference: automatic mastering that matches a reference track (provide a reference audio file and its sound profile is applied to your track)
Available effects
| Effect | What it does | When to use |
|---|---|---|
| Demucs | Separates a track into voice, drums, bass, and other instruments | You want only the voice, or only the music, from a song |
| Denoise | Removes noise and hum | Poorly recorded voice, for example from a cheap microphone |
| Matchering | Matches the mastering of a reference track | You want your track to sound like another artist's |
Session management
On the right, the Audio Gallery shows:
- All generated audio in the current session
- Duration and model used
- Buttons: Listen, Download, Use in, Delete
Press New session to clear the grid and start over.
Free vs Premium models
| Model | Cost | Quality | When to use |
|---|---|---|---|
| Chatterbox TTS | Free | Good | Fast TTS, narration |
| Chatterbox HD | Free | Good HD | More natural TTS, local |
| Stable Audio | Free | Good | Fast instrumental music |
| ACE-Step | Free | Good | Tracks with synthesized vocals |
| Cloud (MiniMax, F5, Qwen, Lyria, Eleven) | Credits | Excellent | Studio quality, very natural voice |
Real timings
- Voice TTS: 5–20 seconds
- Instrumental music: 20–60 seconds
- Music with voice: 1–3 minutes
- Post-production (stem/denoise/master): 30–120 seconds
Common issues
"I can't hear the voice"
→ Try a different model. Some models require credits and are marked with a lock (🔒).
"The audio is too fast/slow"
→ Cloud models offer a Speed parameter: adjust it from 0.5× to 2.0×.
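To get a feel for the Speed range, here is a rough sketch of how a multiplier changes clip length. It assumes Speed acts as a simple playback-rate multiplier (length divided by speed), which this guide does not confirm. `clamp_speed` and `estimated_duration` are hypothetical helper names used only for illustration.

```python
MIN_SPEED = 0.5  # lower bound of the Speed parameter in this guide
MAX_SPEED = 2.0  # upper bound of the Speed parameter in this guide


def clamp_speed(speed):
    """Keep a requested speed inside the 0.5x to 2.0x range."""
    return max(MIN_SPEED, min(MAX_SPEED, speed))


def estimated_duration(base_seconds, speed):
    """Estimate clip length, assuming duration scales as 1 / speed."""
    return base_seconds / clamp_speed(speed)


# A 20-second narration at 1.25x would last about 16 seconds under this assumption.
print(estimated_duration(20.0, 1.25))
```

Under the same assumption, the slowest setting (0.5×) doubles the length of a clip and the fastest (2.0×) halves it.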
"I want to clone MY voice"
→ Record a 10–20 second audio sample and upload it as the Voice sample. F5-TTS and Qwen3-TTS support zero-shot cloning.
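Before uploading, you can check that a recording falls inside the 10–20 second range. The sketch below uses Python's standard `wave` module and assumes the sample is an uncompressed PCM WAV file. The helper names are hypothetical, and Audio Studio may accept other formats that this check does not cover.

```python
import wave

MIN_SECONDS = 10.0  # lower bound suggested in this guide
MAX_SECONDS = 20.0  # upper bound suggested in this guide


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        # Duration = number of audio frames / frames per second.
        return wav_file.getnframes() / wav_file.getframerate()


def is_good_sample_length(path):
    """True if the recording is between 10 and 20 seconds long."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

For example, `is_good_sample_length("my_voice.wav")` returns `False` for a 5-second clip, which is a cue to record a longer sample.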
"The mastering sounds "weird""
→ Use a high-quality, well-mastered reference track (for example, from Spotify or YouTube). The result follows the reference closely, so a poor reference gives a poor master.
Pro tip
Generate the narration voice in Audio Studio, then send it to Cinema Studio to sync it with a video using lip sync. Or generate a complete track with music and vocals and use it as the soundtrack for your videos.