Speak Action

Speak text or SSML to the user using text-to-speech.

Action Structure

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?",
  "tts": {
    "provider": "azure",
    "language": "en-US",
    "voice": "en-US-JennyNeural"
  },
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 3
  }
}

Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Always "speak" |
| session_id | string (UUID) | Yes | Session identifier from the event |
| text | string | No* | Plain text to speak |
| ssml | string | No* | SSML markup for advanced control |
| tts | object | No | TTS provider configuration |
| barge_in | object | No | Barge-in behavior configuration |
| user_input_timeout_seconds | number | No | Seconds to wait for user input after speech ends; if no speech is detected within this time, a user_input_timeout event is sent |
| vad | object | No | Voice-activity detection tuning for the caller's reply. See VAD Configuration |

* Either text OR ssml is required (not both)
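The exactly-one-of rule can be enforced before sending. A minimal sketch, assuming a hypothetical client-side helper (`build_speak_action` is illustrative, not part of the API):

```python
def build_speak_action(session_id, text=None, ssml=None, **extra):
    """Build a speak action dict, enforcing that exactly one of
    text or ssml is set (hypothetical client-side helper)."""
    if (text is None) == (ssml is None):
        raise ValueError("Provide exactly one of 'text' or 'ssml'")
    action = {"type": "speak", "session_id": session_id, **extra}
    if text is not None:
        action["text"] = text
    else:
        action["ssml"] = ssml
    return action

action = build_speak_action(
    "550e8400-e29b-41d4-a716-446655440000",
    text="Hello! How can I help you?",
)
```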

Simple Text

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?"
}

SSML (Advanced)

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "ssml": "<speak version=\"1.0\" xml:lang=\"en-US\"><voice name=\"en-US-JennyNeural\"><prosody rate=\"slow\">Please listen carefully.</prosody><break time=\"500ms\"/>Your account balance is <say-as interpret-as=\"currency\">$42.50</say-as></voice></speak>"
}
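When building SSML from dynamic data, the text must be XML-escaped first or characters like `&` and `<` will produce invalid markup. A minimal sketch (`wrap_ssml` is a hypothetical helper; the voice and language defaults simply mirror the example above):

```python
from xml.sax.saxutils import escape

def wrap_ssml(text, voice="en-US-JennyNeural", rate=None, lang="en-US"):
    """Wrap plain text in SSML, XML-escaping it first
    (hypothetical helper, not part of the API)."""
    body = escape(text)
    if rate:
        body = f'<prosody rate="{rate}">{body}</prosody>'
    return (f'<speak version="1.0" xml:lang="{lang}">'
            f'<voice name="{voice}">{body}</voice></speak>')

ssml = wrap_ssml("Fish & chips cost <$10", rate="slow")
```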

TTS Provider Configuration

Azure

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello in a different voice",
  "tts": {
    "provider": "azure",
    "language": "en-US",
    "voice": "en-US-JennyNeural"
  }
}

ElevenLabs

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello from ElevenLabs",
  "tts": {
    "provider": "eleven_labs",
    "voice": "zrHiDhphv9ZnVXBqCLjz"
  }
}

Voice IDs

The voice field accepts the ElevenLabs voice ID (e.g., "zrHiDhphv9ZnVXBqCLjz" for "Mimi"). If omitted, the first available voice will be used. See the TTS Providers documentation for a list of available voices.

Minimal Configuration (uses default voice):

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello from ElevenLabs",
  "tts": {
    "provider": "eleven_labs"
  }
}

Barge-In Configuration

Control how users can interrupt:

Immediate Response (Most Responsive) ⚡

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "I can help you with billing, support, or sales. What would you like?",
  "barge_in": {
    "strategy": "immediate",
    "allow_after_ms": 500
  }
}

Result: Assistant stops instantly when user starts speaking (20-100ms latency).

Character-Based Interruption

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Your account number is 1234567890. Please write this down.",
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 10,
    "allow_after_ms": 2000
  }
}

Result: Assistant stops after user speaks 10+ characters.

See Barge-In Configuration for all strategies and details.
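One way an application might choose between the two strategies shown above: use immediate interruption for conversational prompts, and a character threshold when the assistant is reading out information the caller should hear in full. This is an illustrative heuristic, not an API rule, and `barge_in_config` is a hypothetical helper:

```python
def barge_in_config(sensitive=False):
    """Pick a barge-in config (illustrative heuristic)."""
    if sensitive:
        # Caller must say 10+ transcribed characters to interrupt
        return {
            "strategy": "minimum_characters",
            "minimum_characters": 10,
            "allow_after_ms": 2000,
        }
    # Stop speaking as soon as the caller starts talking
    return {"strategy": "immediate", "allow_after_ms": 500}
```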

VAD (Voice Activity Detection) Tuning

Optional advanced setting that lets the caller pause longer (or shorter) before their turn is considered finished. When omitted, the system default applies.

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Please tell me your address.",
  "vad": {
    "end_of_turn_silence_ms": 1500
  }
}

| Field | Type | Description |
| --- | --- | --- |
| end_of_turn_silence_ms | number | Milliseconds of silence after the caller stops speaking before their turn ends. Recommended range 150–2000. |

Out-of-range or invalid values are silently ignored — the speak action still runs as if vad were not set. See VAD Configuration for details.
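Because invalid values are silently ignored rather than rejected, it can help to validate client-side so a typo doesn't silently fall back to the default. A sketch (`vad_config` is a hypothetical helper; the 150–2000 range is the recommended range from the table above):

```python
def vad_config(end_of_turn_silence_ms):
    """Return a vad dict only when the value is inside the documented
    range; otherwise return None so the field is omitted entirely,
    mirroring the server's silent-ignore behavior."""
    if isinstance(end_of_turn_silence_ms, (int, float)) and \
            150 <= end_of_turn_silence_ms <= 2000:
        return {"end_of_turn_silence_ms": end_of_turn_silence_ms}
    return None

action = {
    "type": "speak",
    "session_id": "550e8400-e29b-41d4-a716-446655440000",
    "text": "Please tell me your address.",
}
vad = vad_config(1500)
if vad:
    action["vad"] = vad
```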

User Input Timeout

Set a timeout to wait for user input after the assistant finishes speaking. If the user doesn't speak within the specified time, a user_input_timeout event is sent to your application:

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "What is your account number?",
  "user_input_timeout_seconds": 5
}

Behavior:

  • Timer starts when the assistant finishes speaking (assistant_speech_ended event)
  • Timer is cleared when the user starts speaking (any STT event)
  • If timeout is reached, a user_input_timeout event is sent
  • Your application can respond with any action (e.g., repeat question, hangup)

Example with timeout handling:

javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'What is your account number?',
      user_input_timeout_seconds: 5
    });
  }

  if (event.type === 'user_input_timeout') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'I didn\'t hear anything. Let me try again. What is your account number?',
      user_input_timeout_seconds: 5
    });
  }

  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `Your account number is ${event.text}`
    });
  }

  // Acknowledge events we don't handle so the request doesn't hang
  res.sendStatus(204);
});

Examples

Python

python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json

    if event['type'] == 'user_speak':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': f"You said: {event['text']}"
        })
    # Acknowledge events we don't handle
    return '', 204

Node.js

javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`
    });
  }

  // Acknowledge events we don't handle so the request doesn't hang
  res.sendStatus(204);
});

Go

go
func webhook(w http.ResponseWriter, r *http.Request) {
    var event map[string]interface{}
    json.NewDecoder(r.Body).Decode(&event)
    session := event["session"].(map[string]interface{})

    action := map[string]interface{}{
        "type":       "speak",
        "session_id": session["id"],
        "text":       "Hello! How can I help you?",
    }
    json.NewEncoder(w).Encode(action)
}

Use Cases

  • Respond to user - Answer questions
  • Provide information - Share details
  • Guide conversation - Direct the flow
  • Confirm actions - Acknowledge user input

Best Practices

  1. Keep it concise - Short responses work better
  2. Use SSML sparingly - Only when needed for emphasis
  3. Configure barge-in - Allow natural interruptions
  4. Choose appropriate voice - Match language and tone
