Understanding
Multi-turn conversation with the model. Each turn can combine text, audio, and video inputs in any mix. Text history is preserved across turns for context.
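As a rough illustration of the turn structure described above, the sketch below keeps a running text history and attaches it to each new request. The `Turn` and `Conversation` classes and the request dictionary layout are hypothetical names chosen for this example, not the demo's actual API. It also assumes that only user text is carried forward and that audio and video apply to the current turn only; the notes above do not say how media or model replies are handled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """One user turn; any mix of the three inputs may be set (hypothetical type)."""
    text: Optional[str] = None
    audio_path: Optional[str] = None
    video_path: Optional[str] = None


@dataclass
class Conversation:
    """Keeps text history across turns (assumption: media is not retained)."""
    text_history: List[str] = field(default_factory=list)

    def add_turn(self, turn: Turn) -> dict:
        # A turn must carry at least one input.
        if not (turn.text or turn.audio_path or turn.video_path):
            raise ValueError("A turn needs at least one of text, audio, or video.")

        # The request sees the text of all earlier turns plus the current inputs.
        request = {
            "history": list(self.text_history),
            "text": turn.text,
            "audio": turn.audio_path,
            "video": turn.video_path,
        }

        # Only text is appended to the history for later turns.
        if turn.text:
            self.text_history.append(turn.text)
        return request
```

In this sketch, a second turn that supplies only a video still receives the first turn's text in `history`, while the first turn's audio is not re-sent.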
Generation
Select a task type to show the corresponding input components.
- T2A / T2M (text-to-audio / text-to-music): Text prompt only
- V2A / V2M (video-to-audio / video-to-music): Video input only (no text prompt)
- TTS (text-to-speech): Transcript, plus an optional voice prompt for timbre cloning (voice conversion runs automatically when a voice prompt is provided)
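The task-to-input mapping above can be written down as a small lookup table. This is a minimal sketch only: the field names (`text_prompt`, `video`, `transcript`, `voice_prompt`) and the helper functions are assumptions made for illustration and do not come from the demo's code.

```python
from typing import Dict, List

# Which inputs each generation task shows (field names are hypothetical).
TASK_INPUTS: Dict[str, Dict[str, List[str]]] = {
    "T2A": {"required": ["text_prompt"], "optional": []},
    "T2M": {"required": ["text_prompt"], "optional": []},
    "V2A": {"required": ["video"], "optional": []},
    "V2M": {"required": ["video"], "optional": []},
    "TTS": {"required": ["transcript"], "optional": ["voice_prompt"]},
}


def visible_inputs(task: str) -> List[str]:
    """Return every input component that should be shown for a task."""
    spec = TASK_INPUTS[task]
    return spec["required"] + spec["optional"]


def missing_inputs(task: str, provided: Dict[str, object]) -> List[str]:
    """Return the required inputs that were not supplied (or are empty)."""
    return [name for name in TASK_INPUTS[task]["required"] if not provided.get(name)]
```

For example, `visible_inputs("TTS")` lists the transcript and the optional voice prompt, while `missing_inputs("T2M", {})` reports that the text prompt is absent.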
Editing
Edit audio by selecting a task type and describing the target sound.
- Add: Add a new sound to the input audio
- Extract: Extract a specific sound from the input audio
- Remove: Remove a specific sound from the input audio
- Style Transfer: Transform one sound into another