Gradio

Understanding

Hold a multi-turn conversation with the model. Each turn can freely combine text, audio, and video inputs. The text history is preserved across turns for context.

Conversation

Text Input

Audio Input (optional, upload takes priority)

Or Enter Audio Path

Video Input (optional, upload takes priority)

Or Enter Video Path

Use Audio Track in Video

Start Time (s) — for audio / video

Duration (s) — 0 = use full clip
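
The input rules above (an upload takes priority over a typed path, and a duration of 0 means the full clip) can be sketched as follows. This is only an illustration, not the app's actual code: `pick_source` and `clip_window` are hypothetical helper names, and both the clamping to the media length and the reading of "full clip" as "from the start time to the end" are assumptions.

```python
from typing import Optional, Tuple


def pick_source(upload: Optional[str], path: Optional[str]) -> Optional[str]:
    """Return the uploaded file if there is one, else the typed path, else None."""
    if upload:
        return upload
    path = (path or "").strip()
    return path or None


def clip_window(start_s: float, duration_s: float, total_s: float) -> Tuple[float, float]:
    """Return (start, end) in seconds for the segment to use.

    A duration of 0 (or less) selects everything from the start time to the
    end of the media. Values are clamped to [0, total_s] (an assumption).
    """
    start = min(max(start_s, 0.0), total_s)
    if duration_s <= 0:
        return start, total_s
    return start, min(start + duration_s, total_s)
```

For example, `clip_window(2.0, 3.0, 10.0)` selects seconds 2 to 5 of a 10-second clip, while `clip_window(0, 0, 10.0)` selects the whole clip.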

Generation

Select a task type to show the corresponding input components.

T2A / T2M: Text prompt only
V2A / V2M: Video input (no text prompt)
TTS: Transcript plus an optional voice prompt for timbre cloning (voice conversion is applied automatically when a voice prompt is provided)
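
One way to picture the task-dependent inputs listed above is a lookup table from task type to the components that should be shown. The sketch below is illustrative only: the task labels come from the list above, but the component names (`text_prompt`, `video_input`, `transcript`, `voice_prompt`) and the `visible_inputs` helper are hypothetical.

```python
from typing import Dict, List

# Task labels are taken from the list above; component names are hypothetical.
TASK_INPUTS: Dict[str, List[str]] = {
    "T2A": ["text_prompt"],
    "T2M": ["text_prompt"],
    "V2A": ["video_input"],
    "V2M": ["video_input"],
    "TTS": ["transcript", "voice_prompt"],
}

ALL_INPUTS: List[str] = ["text_prompt", "video_input", "transcript", "voice_prompt"]


def visible_inputs(task: str) -> Dict[str, bool]:
    """Map every input component to whether it is shown for the given task."""
    if task not in TASK_INPUTS:
        raise ValueError(f"unknown task type: {task!r}")
    shown = TASK_INPUTS[task]
    return {name: name in shown for name in ALL_INPUTS}
```

In a Gradio app, a mapping like this would typically feed a change handler on the task selector that returns one `gr.update(visible=...)` per component; how the actual app wires this up is not shown here.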

Text Prompt

Output Video

Output Audio

Editing

Edit audio by selecting a task type and describing the target sound.

Add: Add a new sound to the input audio
Extract: Extract a specific sound from the input audio
Remove: Remove a specific sound from the input audio
Style Transfer: Transform one sound into another

Sound to Add

Audio Input (upload audio file)

Or Enter Audio Input Path

Input Audio (Preview)

Output Audio

🎧 Audio-Omni
