
Speak Action

Speak text or SSML to the user using text-to-speech.

Action Structure

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?",
  "tts": {
    "provider": "azure",
    "language": "en-US",
    "voice": "en-US-JennyNeural"
  },
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 3
  }
}

Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `type` | string | Yes | Always `"speak"` |
| `session_id` | string (UUID) | Yes | Session identifier from the event |
| `text` | string | No* | Plain text to speak |
| `ssml` | string | No* | SSML markup for advanced control |
| `tts` | object | No | TTS provider configuration |
| `barge_in` | object | No | Barge-in behavior configuration |
| `user_input_timeout_seconds` | number | No | Timeout in seconds to wait for user input after speech ends. If no speech is detected within this time, a `user_input_timeout` event is sent |

* Either text OR ssml is required (not both)
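The text/ssml constraint can be enforced before an action is sent. A minimal sketch; the `buildSpeakAction` helper is hypothetical, not part of the API:

```javascript
// Hypothetical helper: builds a speak action and enforces the
// "exactly one of text or ssml" rule from the field table above.
function buildSpeakAction(sessionId, { text, ssml, tts, bargeIn, timeoutSeconds } = {}) {
  if (Boolean(text) === Boolean(ssml)) {
    throw new Error('Provide exactly one of "text" or "ssml"');
  }
  const action = { type: 'speak', session_id: sessionId };
  if (text) action.text = text;
  if (ssml) action.ssml = ssml;
  if (tts) action.tts = tts;
  if (bargeIn) action.barge_in = bargeIn;
  if (timeoutSeconds !== undefined) action.user_input_timeout_seconds = timeoutSeconds;
  return action;
}
```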

Simple Text

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?"
}

SSML (Advanced)

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "ssml": "<speak version=\"1.0\" xml:lang=\"en-US\"><voice name=\"en-US-JennyNeural\"><prosody rate=\"slow\">Please listen carefully.</prosody><break time=\"500ms\"/>Your account balance is <say-as interpret-as=\"currency\">$42.50</say-as></voice></speak>"
}
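When interpolating dynamic values (names, amounts, user input) into an SSML string, escape XML special characters first, or the payload can become invalid markup. A sketch; the `escapeSsml` helper is hypothetical:

```javascript
// Hypothetical helper: escapes XML special characters before
// interpolating dynamic values into an SSML string.
function escapeSsml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

const company = 'Smith & Sons <Ltd>';
const ssml =
  `<speak version="1.0" xml:lang="en-US">` +
  `Thanks for calling ${escapeSsml(company)}.` +
  `</speak>`;
```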

TTS Provider Configuration

Azure

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello in a different voice",
  "tts": {
    "provider": "azure",
    "language": "en-US",
    "voice": "en-US-JennyNeural"
  }
}

ElevenLabs

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello from ElevenLabs",
  "tts": {
    "provider": "eleven_labs",
    "voice": "zrHiDhphv9ZnVXBqCLjz"
  }
}

Voice IDs

The voice field accepts the ElevenLabs voice ID (e.g., "zrHiDhphv9ZnVXBqCLjz" for "Mimi"). If omitted, the first available voice will be used. See the TTS Providers documentation for a list of available voices.

Minimal Configuration (uses default voice):

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello from ElevenLabs",
  "tts": {
    "provider": "eleven_labs"
  }
}
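If your application switches between providers, it can help to centralize per-provider defaults. A sketch using the example voices shown above as assumed defaults; the helper itself is hypothetical:

```javascript
// Hypothetical helper: returns a tts block per provider, using the
// example voices from this page as assumed defaults.
const TTS_DEFAULTS = {
  azure: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
  eleven_labs: { provider: 'eleven_labs', voice: 'zrHiDhphv9ZnVXBqCLjz' },
};

function ttsConfig(provider, overrides = {}) {
  const base = TTS_DEFAULTS[provider];
  if (!base) throw new Error(`Unknown TTS provider: ${provider}`);
  return { ...base, ...overrides };
}
```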

Barge-In Configuration

Control how users can interrupt:

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Your account number is 1234567890. Please write this down.",
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 10,
    "allow_after_ms": 2000
  }
}

See Barge-In Configuration for details.
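One common pattern is to require more speech before interrupting prompts the caller should hear in full (such as the account number above). A sketch; the field names follow the example above, while the thresholds and the helper are assumptions:

```javascript
// Hypothetical helper: tighter barge-in for prompts the caller should
// hear in full, looser barge-in otherwise. Thresholds are assumptions.
function bargeInFor(kind) {
  if (kind === 'critical') {
    return { strategy: 'minimum_characters', minimum_characters: 10, allow_after_ms: 2000 };
  }
  return { strategy: 'minimum_characters', minimum_characters: 3 };
}
```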

User Input Timeout

Set a timeout to wait for user input after the assistant finishes speaking. If the user doesn't speak within the specified time, a user_input_timeout event is sent to your application:

json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "What is your account number?",
  "user_input_timeout_seconds": 5
}

Behavior:

  • Timer starts when the assistant finishes speaking (assistant_speech_ended event)
  • Timer is cleared when the user starts speaking (any STT event)
  • If timeout is reached, a user_input_timeout event is sent
  • Your application can respond with any action (e.g., repeat question, hangup)

Example with timeout handling:

javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'What is your account number?',
      user_input_timeout_seconds: 5
    });
  }

  if (event.type === 'user_input_timeout') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'I didn\'t hear anything. Let me try again. What is your account number?',
      user_input_timeout_seconds: 5
    });
  }

  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `Your account number is ${event.text}`
    });
  }
});
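The handler above re-prompts indefinitely. A sketch that caps re-prompts per session and then ends the call; the in-memory counter only works in a single process, and the hangup action's exact shape is an assumption (check the hangup action documentation for your API):

```javascript
// Sketch: cap timeout re-prompts per session, then end the call.
// Per-process state only; use shared storage if you run multiple workers.
const timeoutCounts = new Map();
const MAX_REPROMPTS = 2;

function handleTimeout(sessionId) {
  const count = (timeoutCounts.get(sessionId) || 0) + 1;
  timeoutCounts.set(sessionId, count);
  if (count > MAX_REPROMPTS) {
    timeoutCounts.delete(sessionId);
    return { type: 'hangup', session_id: sessionId }; // assumed action shape
  }
  return {
    type: 'speak',
    session_id: sessionId,
    text: "I didn't hear anything. What is your account number?",
    user_input_timeout_seconds: 5,
  };
}
```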

Examples

Python

python
@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json

    if event['type'] == 'user_speak':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': f"You said: {event['text']}"
        })

Node.js

javascript
app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`
    });
  }
});

Go

go
// Inside your webhook handler (uses encoding/json):
session := event["session"].(map[string]interface{})
action := map[string]interface{}{
    "type":       "speak",
    "session_id": session["id"],
    "text":       "Hello! How can I help you?",
}
json.NewEncoder(w).Encode(action)

Use Cases

  • Respond to user - Answer questions
  • Provide information - Share details
  • Guide conversation - Direct the flow
  • Confirm actions - Acknowledge user input

Best Practices

  1. Keep it concise - Short responses work better
  2. Use SSML sparingly - Only when needed for emphasis
  3. Configure barge-in - Allow natural interruptions
  4. Choose appropriate voice - Match language and tone

Next Steps