Speak Action

Speak text or SSML to the user using text-to-speech.

Action Structure

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?",
  "tts": {
    "provider": "azure",
    "language": "en-US",
    "voice": "en-US-JennyNeural"
  },
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 3
  }
}

Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Always "speak" |
| session_id | string (UUID) | Yes | Session identifier from the event |
| text | string | No* | Plain text to speak |
| ssml | string | No* | SSML markup for advanced control |
| tts | object | No | TTS provider configuration |
| barge_in | object | No | Barge-in behavior configuration |
| user_input_timeout_seconds | number | No | Seconds to wait for user input after speech ends; if no speech is detected within this time, a user_input_timeout event is sent |
| vad | object | No | Voice-activity detection tuning for the caller's reply. See VAD Configuration |

* Either text OR ssml is required (not both)
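The exactly-one-of rule can be enforced before sending. A minimal sketch, assuming a hypothetical client-side helper (`build_speak_action` is illustrative, not part of the API):

```python
def build_speak_action(session_id, text=None, ssml=None, **extra):
    """Build a speak action dict, enforcing that exactly one of
    text or ssml is set (hypothetical client-side helper)."""
    if (text is None) == (ssml is None):
        raise ValueError("Provide exactly one of 'text' or 'ssml'")
    action = {"type": "speak", "session_id": session_id, **extra}
    if text is not None:
        action["text"] = text
    else:
        action["ssml"] = ssml
    return action

action = build_speak_action(
    "550e8400-e29b-41d4-a716-446655440000",
    text="Hello! How can I help you?",
)
```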

Simple Text

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?"
}

SSML (Advanced)

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "ssml": "<speak version=\"1.0\" xml:lang=\"en-US\"><voice name=\"en-US-JennyNeural\"><prosody rate=\"slow\">Please listen carefully.</prosody><break time=\"500ms\"/>Your account balance is <say-as interpret-as=\"currency\">$42.50</say-as></voice></speak>"
}
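When building SSML from dynamic data, the text must be XML-escaped first or characters like `&` and `<` will produce invalid markup. A minimal sketch (`wrap_ssml` is a hypothetical helper; the voice and language defaults simply mirror the example above):

```python
from xml.sax.saxutils import escape

def wrap_ssml(text, voice="en-US-JennyNeural", rate=None, lang="en-US"):
    """Wrap plain text in SSML, XML-escaping it first
    (hypothetical helper, not part of the API)."""
    body = escape(text)
    if rate:
        body = f'<prosody rate="{rate}">{body}</prosody>'
    return (f'<speak version="1.0" xml:lang="{lang}">'
            f'<voice name="{voice}">{body}</voice></speak>')

ssml = wrap_ssml("Fish & chips cost <$10", rate="slow")
```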

TTS Provider Configuration

Azure

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello in a different voice",
  "tts": {
    "provider": "azure",
    "language": "en-US",
    "voice": "en-US-JennyNeural"
  }
}

ElevenLabs

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello from ElevenLabs",
  "tts": {
    "provider": "eleven_labs",
    "voice": "zrHiDhphv9ZnVXBqCLjz"
  }
}

Voice IDs

The voice field accepts the ElevenLabs voice ID (e.g., "zrHiDhphv9ZnVXBqCLjz" for "Mimi"). If omitted, the first available voice will be used. See the TTS Providers documentation for a list of available voices.

Minimal Configuration (uses default voice):

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello from ElevenLabs",
  "tts": {
    "provider": "eleven_labs"
  }
}

Barge-In Configuration

Control how users can interrupt:

Immediate Response (Most Responsive) ⚡

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "I can help you with billing, support, or sales. What would you like?",
  "barge_in": {
    "strategy": "immediate",
    "allow_after_ms": 500
  }
}

Result: Assistant stops instantly when user starts speaking (20-100ms latency).

Character-Based Interruption

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Your account number is 1234567890. Please write this down.",
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 10,
    "allow_after_ms": 2000
  }
}

Result: Assistant stops after user speaks 10+ characters.

See Barge-In Configuration for all strategies and details.
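One way an application might choose between the two strategies shown above: use immediate interruption for conversational prompts, and a character threshold when the assistant is reading out information the caller should hear in full. This is an illustrative heuristic, not an API rule, and `barge_in_config` is a hypothetical helper:

```python
def barge_in_config(sensitive=False):
    """Pick a barge-in config (illustrative heuristic)."""
    if sensitive:
        # Caller must say 10+ transcribed characters to interrupt
        return {
            "strategy": "minimum_characters",
            "minimum_characters": 10,
            "allow_after_ms": 2000,
        }
    # Stop speaking as soon as the caller starts talking
    return {"strategy": "immediate", "allow_after_ms": 500}
```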

VAD (Voice Activity Detection) Tuning

Optional advanced setting that lets the caller pause longer (or shorter) before their turn is considered finished. When omitted, the system default applies.

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Please tell me your address.",
  "vad": {
    "end_of_turn_silence_ms": 1500
  }
}

| Field | Type | Description |
| --- | --- | --- |
| end_of_turn_silence_ms | number | Milliseconds of silence after the caller stops speaking before their turn ends. Recommended range 150–2000. |

Out-of-range or invalid values are silently ignored — the speak action still runs as if vad were not set. See VAD Configuration for details.
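Because invalid values are silently ignored rather than rejected, it can help to validate client-side so a typo doesn't silently fall back to the default. A sketch (`vad_config` is a hypothetical helper; the 150–2000 range is the recommended range from the table above):

```python
def vad_config(end_of_turn_silence_ms):
    """Return a vad dict only when the value is inside the documented
    range; otherwise return None so the field is omitted entirely,
    mirroring the server's silent-ignore behavior."""
    if isinstance(end_of_turn_silence_ms, (int, float)) and \
            150 <= end_of_turn_silence_ms <= 2000:
        return {"end_of_turn_silence_ms": end_of_turn_silence_ms}
    return None

action = {
    "type": "speak",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "text": "Please tell me your address.",
}
vad = vad_config(1500)
if vad:
    action["vad"] = vad
```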

User Input Timeout

Set a timeout to wait for user input after the assistant finishes speaking. If the user doesn't speak within the specified time, a user_input_timeout event is sent to your application:

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "What is your account number?",
  "user_input_timeout_seconds": 5
}

Behavior:

  • Timer starts when the assistant finishes speaking (assistant_speech_ended event)
  • Timer is cleared when the user starts speaking (any STT event)
  • If timeout is reached, a user_input_timeout event is sent
  • Your application can respond with any action (e.g., repeat question, hangup)

Example with timeout handling:

javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'What is your account number?',
      user_input_timeout_seconds: 5
    });
  }

  if (event.type === 'user_input_timeout') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'I didn\'t hear anything. Let me try again. What is your account number?',
      user_input_timeout_seconds: 5
    });
  }

  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `Your account number is ${event.text}`
    });
  }

  // Acknowledge events we don't handle so the request doesn't hang
  res.sendStatus(204);
});

Examples

Python

python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json

    if event['type'] == 'user_speak':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': f"You said: {event['text']}"
        })
    # Acknowledge events we don't handle
    return '', 204

Node.js

javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`
    });
  }

  // Acknowledge events we don't handle so the request doesn't hang
  res.sendStatus(204);
});

Go

go
func webhook(w http.ResponseWriter, r *http.Request) {
    var event map[string]interface{}
    json.NewDecoder(r.Body).Decode(&event)
    session := event["session"].(map[string]interface{})

    action := map[string]interface{}{
        "type":       "speak",
        "session_id": session["id"],
        "text":       "Hello! How can I help you?",
    }
    json.NewEncoder(w).Encode(action)
}

Use Cases

  • Respond to user - Answer questions
  • Provide information - Share details
  • Guide conversation - Direct the flow
  • Confirm actions - Acknowledge user input

Best Practices

  1. Keep it concise - Short responses work better
  2. Use SSML sparingly - Only when needed for emphasis
  3. Configure barge-in - Allow natural interruptions
  4. Choose appropriate voice - Match language and tone
