
Configure Voice-to-Voice Action

Preview

End-to-end voice-to-voice mode is a preview feature, available only after a positive review by sipgate support. See Access Gate below.

Switch a session into end-to-end voice-to-voice mode. From the moment this action is processed the assistant no longer goes through the standard STT → text → TTS pipeline — caller audio is forwarded directly to a speech-to-speech model and the model's spoken response is sent back to the caller in real time.

The transcribed user text is still surfaced as user_speak events for logging and call traces, but you don't need to answer them with speak actions — the model speaks autonomously.

Action Structure

```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "You are a friendly assistant for the Acme dental practice. Be concise.",
  "greeting": "Hello, this is Acme Dental — how can I help you?",
  "temperature": 0.8,
  "language": "en"
}
```

Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `type` | string | Yes | — | Always `"configure_voice_to_voice"` |
| `session_id` | string (UUID) | Yes | — | Session identifier from the event |
| `system_prompt` | string | Yes | — | Persona / behaviour instructions for the model. Sent once at the start of the session. |
| `greeting` | string | No | — | Opening line the model should speak after connecting. Delivered as an inference trigger so the model phrases it naturally. |
| `temperature` | number | No | 0.8 | Sampling temperature (0–2). Lower values make replies more deterministic. |
| `language` | string | No | — | Preferred response language hint (e.g. `"de"`, `"en"`). The model ultimately decides. |

Behavioral Details

STT and TTS are inactive

Once voice-to-voice is active for a session:

  • user_speak events still arrive, but they reflect the model's own transcription of the caller's turns — not your configured STT provider.
  • speak actions are honoured by forwarding the text to the model as a speaking instruction. The model will speak the text in its own voice — it may rephrase slightly (the protocol has no verbatim-TTS path). tts, ssml, barge_in, vad and user_input_timeout_seconds fields on the speak action are ignored.
  • Barge-in is handled inside the model — the configured barge-in strategy has no effect for the rest of the session.
  • VAD parameters set via configure_transcription.vad or speak.vad are ignored.
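The speak-forwarding rule above can be sketched as a small filter. This is illustrative only: the real filtering happens on sipgate's side, and `effectiveSpeakAction` is a hypothetical helper, not part of the protocol; the `tts` field's shape below is likewise just a placeholder.

```javascript
// Illustrative only: shows which fields of a `speak` action still matter
// once voice-to-voice is active. The real filtering is server-side.
function effectiveSpeakAction(action) {
  // The text is forwarded to the model as a speaking instruction;
  // the model speaks it in its own voice and may rephrase slightly.
  const { type, session_id, text } = action;
  // tts, ssml, barge_in, vad and user_input_timeout_seconds are dropped.
  return { type, session_id, text };
}

const sent = {
  type: 'speak',
  session_id: '550e8400-e29b-41d4-a716-446655440000',
  text: 'One moment, please.',
  tts: { provider: 'EXAMPLE' }, // shape illustrative; ignored in voice-to-voice mode
  barge_in: false,              // ignored; the model handles barge-in itself
};

const forwarded = effectiveSpeakAction(sent);
```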

Reverting to the normal pipeline

Send a configure_transcription action to switch the session back to the standard STT/TTS pipeline. After that, you can send speak actions again.

```json
{
  "type": "configure_transcription",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "provider": "AZURE",
  "languages": ["de-DE"]
}
```
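A minimal sketch of the revert-then-speak sequence. The helper names (`buildRevert`, `buildSpeak`) are our own, not part of the protocol; only the action payloads come from this page, and the provider and language values are examples.

```javascript
// Returns the action that switches a session back to the standard
// STT/TTS pipeline. Provider and language are example values.
function buildRevert(sessionId) {
  return {
    type: 'configure_transcription',
    session_id: sessionId,
    provider: 'AZURE',
    languages: ['de-DE'],
  };
}

// Valid again once the session has left voice-to-voice mode.
function buildSpeak(sessionId, text) {
  return { type: 'speak', session_id: sessionId, text };
}

// Order matters: revert first, then resume normal speak actions.
const sessionId = '550e8400-e29b-41d4-a716-446655440000';
const sequence = [
  buildRevert(sessionId),
  buildSpeak(sessionId, 'Let me read that back to you.'),
];
```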

Greeting

When greeting is provided, the model speaks an opening line as soon as the session is ready (typically within 1–2 seconds). The text is given to the model as guidance — the exact wording may differ slightly.

If you want full silence at the start (e.g. you announce yourself first via a speak action before sending configure_voice_to_voice), simply omit greeting.
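The announce-first pattern can be sketched as the pair of actions below. The helper name `announceThenHandOver` and the two-step sequencing (when exactly your application sends the second action) are assumptions; only the action shapes come from this page.

```javascript
// Sketch of the "announce first" pattern: speak a fixed line via the
// normal TTS pipeline, then switch to voice-to-voice without a greeting
// so the model stays silent until the caller talks.
function announceThenHandOver(sessionId) {
  return [
    {
      type: 'speak',
      session_id: sessionId,
      text: 'You are connected to Acme Dental.',
    },
    {
      type: 'configure_voice_to_voice',
      session_id: sessionId,
      system_prompt: 'You are a friendly assistant for the Acme dental practice.',
      // no `greeting` field: the model waits for the caller to speak
    },
  ];
}

const [announce, handover] = announceThenHandOver(
  '550e8400-e29b-41d4-a716-446655440000'
);
```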

Latency

End-to-end speech-to-speech models respond noticeably faster than the standard STT → LLM → TTS pipeline because there are no per-stage decode/encode steps. First-byte latency for the spoken response is typically in the 200–600 ms range from the end of the caller's turn.

Examples

Minimal: persona-only, no greeting

```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "You are a friendly assistant for the Acme dental practice. Be concise."
}
```

Persona + greeting in German

```json
{
  "type": "configure_voice_to_voice",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "system_prompt": "Du bist ein freundlicher Assistent für die Zahnarztpraxis Acme.",
  "greeting": "Guten Tag, hier ist die Praxis Acme. Wie kann ich Ihnen helfen?",
  "language": "de"
}
```

Logging caller turns while the model handles the conversation

Your code receives user_speak events for the call trace but does not need (and should not send) any further actions:

```javascript
const express = require('express');

const app = express();
app.use(express.json()); // webhook events arrive as JSON bodies

app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    return res.json({
      type: 'configure_voice_to_voice',
      session_id: event.session.id,
      system_prompt: 'You are a helpful assistant.',
      greeting: 'Hi! How can I help today?',
    });
  }

  if (event.type === 'user_speak') {
    // Log only — the model is already responding.
    console.log(`Caller said: ${event.text}`);
    return res.status(200).send();
  }

  return res.status(200).send();
});

app.listen(3000);
```

Access Gate

Voice-to-voice mode is only available upon request and after a positive review by sipgate support. Mention configure_voice_to_voice when you reach out so we can enable it for your account.
