Configure Voice-to-Voice Action
Preview
End-to-end voice-to-voice mode is a preview feature and is only available after a positive review by sipgate support. See Access Gate below.
Switch a session into end-to-end voice-to-voice mode. From the moment this action is processed the assistant no longer goes through the standard STT → text → TTS pipeline — caller audio is forwarded directly to a speech-to-speech model and the model's spoken response is sent back to the caller in real time.
The transcribed user text is still surfaced as user_speak events for logging and call traces, but you don't need (and shouldn't send) speak actions in response to them — the model speaks autonomously.
Action Structure
```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "You are a friendly assistant for the Acme dental practice. Be concise.",
  "greeting": "Hello, this is Acme Dental — how can I help you?",
  "temperature": 0.8,
  "language": "en"
}
```

Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `type` | string | Yes | — | Always `"configure_voice_to_voice"` |
| `session_id` | string (UUID) | Yes | — | Session identifier from the event |
| `system_prompt` | string | Yes | — | Persona / behaviour instructions for the model. Sent once at the start of the session. |
| `greeting` | string | No | — | Opening line the model should speak after connecting. Delivered as an inference trigger so the model phrases it naturally. |
| `temperature` | number | No | 0.8 | Sampling temperature (0–2). Lower values make replies more deterministic. |
| `language` | string | No | — | Preferred response-language hint (e.g. `"de"`, `"en"`). The model ultimately decides. |
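The required/optional split above can be captured in a small builder before the action is sent. This is only a sketch: `buildVoiceToVoiceAction` is a hypothetical helper, not part of the protocol; it applies the documented default temperature and checks the 0–2 range.

```javascript
// Hypothetical helper (not part of the protocol) that assembles a
// configure_voice_to_voice action and applies the documented defaults.
function buildVoiceToVoiceAction({ sessionId, systemPrompt, greeting, temperature = 0.8, language }) {
  if (!sessionId || !systemPrompt) {
    throw new Error('session_id and system_prompt are required');
  }
  if (temperature < 0 || temperature > 2) {
    throw new Error('temperature must be between 0 and 2');
  }
  const action = {
    type: 'configure_voice_to_voice',
    session_id: sessionId,
    system_prompt: systemPrompt,
    temperature,
  };
  if (greeting) action.greeting = greeting; // optional opening line
  if (language) action.language = language; // optional language hint
  return action;
}
```

Optional fields are omitted entirely rather than sent as `null`, matching the minimal example later in this page.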
Behavioral Details
STT and TTS are inactive
Once voice-to-voice is active for a session:
- `user_speak` events still arrive, but they reflect the model's own transcription of the caller's turns — not your configured STT provider.
- `speak` actions are honoured by forwarding the text to the model as a speaking instruction. The model will speak the text in its own voice — it may rephrase slightly (the protocol has no verbatim-TTS path).
- The `tts`, `ssml`, `barge_in`, `vad` and `user_input_timeout_seconds` fields on the `speak` action are ignored.
- Barge-in is handled inside the model — the configured barge-in strategy has no effect for the rest of the session.
- VAD parameters set via `configure_transcription.vad` or `speak.vad` are ignored.
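One practical consequence of these rules is that a webhook should stop answering `user_speak` events once a session is in voice-to-voice mode. A minimal sketch, assuming the webhook tracks mode per session itself and that `user_speak` events carry a `session_id` field (both are illustration-only assumptions, as is the reply text):

```javascript
// Per-session bookkeeping (assumed, not part of the protocol): remember
// which sessions have been switched to voice-to-voice mode.
const voiceToVoiceSessions = new Set();

function activateVoiceToVoice(sessionId) {
  voiceToVoiceSessions.add(sessionId);
}

function handleUserSpeak(event) {
  if (voiceToVoiceSessions.has(event.session_id)) {
    // Log only; the model answers on its own in voice-to-voice mode.
    return null;
  }
  // Standard STT/TTS pipeline: reply with a speak action (hypothetical text).
  return { type: 'speak', session_id: event.session_id, text: 'One moment, please.' };
}
```

The set would be cleared again for a session after sending `configure_transcription`, as described in the next section.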
Reverting to the normal pipeline
Send a configure_transcription action to switch the session back to the standard STT/TTS pipeline. After that, you can send speak actions again.
```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "AZURE",
  "languages": ["de-DE"]
}
```

Greeting
When greeting is provided, the model speaks an opening line as soon as the session is ready (typically within 1–2 seconds). The text is given to the model as guidance — the exact wording may differ slightly.
If you want full silence at the start (e.g. you announce yourself first via a speak action before sending configure_voice_to_voice), simply omit greeting.
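The silent-start pattern above can be sketched as a single branch: include `greeting` only when the bot has not already announced itself. The function name, prompt and greeting text here are illustrative assumptions, not protocol requirements.

```javascript
// Sketch of the silent-start pattern: omit `greeting` when an announcement
// was already played via a speak action earlier in the call.
function voiceToVoiceConfig(sessionId, alreadyAnnounced) {
  const action = {
    type: 'configure_voice_to_voice',
    session_id: sessionId,
    system_prompt: 'You are a helpful assistant.',
  };
  if (!alreadyAnnounced) {
    action.greeting = 'Hi! How can I help today?';
  }
  return action;
}
```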
Latency
End-to-end speech-to-speech models respond noticeably faster than the standard STT → LLM → TTS pipeline because there are no per-stage decode/encode steps. First-byte latency for the spoken response is typically in the 200–600 ms range from the end of the caller's turn.
Examples
Minimal: persona-only, no greeting
```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "You are a friendly assistant for the Acme dental practice. Be concise."
}
```

Persona + greeting in German
```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "Du bist ein freundlicher Assistent für die Zahnarztpraxis Acme.",
  "greeting": "Guten Tag, hier ist die Praxis Acme. Wie kann ich Ihnen helfen?",
  "language": "de"
}
```

Logging caller turns while the model handles the conversation
Your code receives user_speak events for the call trace but does not need to send (and should not send) any further actions:
```javascript
app.post('/webhook', (req, res) => {
  const event = req.body;
  if (event.type === 'session_start') {
    return res.json({
      type: 'configure_voice_to_voice',
      session_id: event.session.id,
      system_prompt: 'You are a helpful assistant.',
      greeting: 'Hi! How can I help today?',
    });
  }
  if (event.type === 'user_speak') {
    // Log only — the model is already responding.
    console.log(`Caller said: ${event.text}`);
    return res.status(200).send();
  }
  return res.status(200).send();
});
```

Access Gate
Voice-to-voice mode is only available upon request and after a positive review by sipgate support. Mention configure_voice_to_voice when you reach out so we can enable it for your account.
Next Steps
- Actions Overview - Complete action reference
- Configure Transcription - Switch back to the STT/TTS pipeline
- Event Types - What events carry transcribed text