
Configure Voice-to-Voice Action

Preview

End-to-end voice-to-voice mode is a preview feature, available only after a positive review by sipgate support. See Access Gate below.

Switch a session into end-to-end voice-to-voice mode. From the moment this action is processed the assistant no longer goes through the standard STT → text → TTS pipeline — caller audio is forwarded directly to a speech-to-speech model and the model's spoken response is sent back to the caller in real time.

The transcribed user text is still surfaced as user_speak events for logging and call traces, but you don't need to answer them with speak actions — the model speaks autonomously.

Action Structure

```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "You are a friendly assistant for the Acme dental practice. Be concise.",
  "greeting": "Hello, this is Acme Dental — how can I help you?",
  "temperature": 0.8,
  "language": "en"
}
```

Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `type` | string | Yes | — | Always `"configure_voice_to_voice"` |
| `session_id` | string (UUID) | Yes | — | Session identifier from the event |
| `system_prompt` | string | Yes | — | Persona / behaviour instructions for the model. Sent once at the start of the session. |
| `greeting` | string | No | — | Opening line the model should speak after connecting. Delivered as an inference trigger so the model phrases it naturally. |
| `temperature` | number | No | 0.8 | Sampling temperature (0–2). Lower values make replies more deterministic. |
| `language` | string | No | — | Preferred response language hint (e.g. `"de"`, `"en"`). The model ultimately decides. |

Behavioral Details

STT and TTS are inactive

Once voice-to-voice is active for a session:

  • user_speak events still arrive, but they reflect the model's own transcription of the caller's turns — not your configured STT provider.
  • speak actions are honoured by forwarding the text to the model as a speaking instruction. The model will speak the text in its own voice — it may rephrase slightly (the protocol has no verbatim-TTS path). tts, ssml, barge_in, vad and user_input_timeout_seconds fields on the speak action are ignored.
  • Barge-in is handled inside the model — the configured barge-in strategy has no effect for the rest of the session.
  • VAD parameters set via configure_transcription.vad or speak.vad are ignored.
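The speak-forwarding rule above can be sketched as a small filter. This is illustrative only: the real filtering happens on sipgate's side, and `effectiveSpeakAction` is a hypothetical helper, not part of the protocol; the `tts` field's shape below is likewise just a placeholder.

```javascript
// Illustrative only: shows which fields of a `speak` action still matter
// once voice-to-voice is active. The real filtering is server-side.
function effectiveSpeakAction(action) {
  // The text is forwarded to the model as a speaking instruction;
  // the model speaks it in its own voice and may rephrase slightly.
  const { type, session_id, text } = action;
  // tts, ssml, barge_in, vad and user_input_timeout_seconds are dropped.
  return { type, session_id, text };
}

const sent = {
  type: 'speak',
  session_id: '550e8400-e29b-41d4-a716-446655440000',
  text: 'One moment, please.',
  tts: { provider: 'EXAMPLE' }, // shape illustrative; ignored in voice-to-voice mode
  barge_in: false,              // ignored; the model handles barge-in itself
};

const forwarded = effectiveSpeakAction(sent);
```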

Reverting to the normal pipeline

Send a configure_transcription action to switch the session back to the standard STT/TTS pipeline. After that, you can send speak actions again.

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "AZURE",
  "languages": ["de-DE"]
}
```
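A minimal sketch of the revert-then-speak sequence. The helper names (`buildRevert`, `buildSpeak`) are our own, not part of the protocol; only the action payloads come from this page, and the provider and language values are examples.

```javascript
// Returns the action that switches a session back to the standard
// STT/TTS pipeline. Provider and language are example values.
function buildRevert(sessionId) {
  return {
    type: 'configure_transcription',
    session_id: sessionId,
    provider: 'AZURE',
    languages: ['de-DE'],
  };
}

// Valid again once the session has left voice-to-voice mode.
function buildSpeak(sessionId, text) {
  return { type: 'speak', session_id: sessionId, text };
}

// Order matters: revert first, then resume normal speak actions.
const sessionId = '550e8400-e29b-41d4-a716-446655440000';
const sequence = [
  buildRevert(sessionId),
  buildSpeak(sessionId, 'Let me read that back to you.'),
];
```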

Greeting

When greeting is provided, the model speaks an opening line as soon as the session is ready (typically within 1–2 seconds). The text is given to the model as guidance — the exact wording may differ slightly.

If you want full silence at the start (e.g. you announce yourself first via a speak action before sending configure_voice_to_voice), simply omit greeting.
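The announce-first pattern can be sketched as the pair of actions below. The helper name `announceThenHandOver` and the two-step sequencing (when exactly your application sends the second action) are assumptions; only the action shapes come from this page.

```javascript
// Sketch of the "announce first" pattern: speak a fixed line via the
// normal TTS pipeline, then switch to voice-to-voice without a greeting
// so the model stays silent until the caller talks.
function announceThenHandOver(sessionId) {
  return [
    {
      type: 'speak',
      session_id: sessionId,
      text: 'You are connected to Acme Dental.',
    },
    {
      type: 'configure_voice_to_voice',
      session_id: sessionId,
      system_prompt: 'You are a friendly assistant for the Acme dental practice.',
      // no `greeting` field: the model waits for the caller to speak
    },
  ];
}

const [announce, handover] = announceThenHandOver(
  '550e8400-e29b-41d4-a716-446655440000'
);
```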

Latency

End-to-end speech-to-speech models respond noticeably faster than the standard STT → LLM → TTS pipeline because there are no per-stage decode/encode steps. First-byte latency for the spoken response is typically in the 200–600 ms range from the end of the caller's turn.

Examples

Minimal: persona-only, no greeting

```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "You are a friendly assistant for the Acme dental practice. Be concise."
}
```

Persona + greeting in German

```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "Du bist ein freundlicher Assistent für die Zahnarztpraxis Acme.",
  "greeting": "Guten Tag, hier ist die Praxis Acme. Wie kann ich Ihnen helfen?",
  "language": "de"
}
```

Logging caller turns while the model handles the conversation

Your code receives user_speak events for the call trace but does not need (and should not send) any further actions:

```javascript
const express = require('express');

const app = express();
app.use(express.json()); // webhook events arrive as JSON bodies

app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    return res.json({
      type: 'configure_voice_to_voice',
      session_id: event.session.id,
      system_prompt: 'You are a helpful assistant.',
      greeting: 'Hi! How can I help today?',
    });
  }

  if (event.type === 'user_speak') {
    // Log only — the model is already responding.
    console.log(`Caller said: ${event.text}`);
    return res.status(200).send();
  }

  return res.status(200).send();
});

app.listen(3000);
```

Access Gate

Voice-to-voice mode is only available upon request and after a positive review by sipgate support. Mention configure_voice_to_voice when you reach out so we can enable it for your account.
