Speak Action
Speak text or SSML to the user using text-to-speech.
Action Structure
json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?",
  "tts": {
    "provider": "azure",
    "language": "en-US",
    "voice": "en-US-JennyNeural"
  },
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 3
  }
}
Fields
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Always "speak" |
| session_id | string (UUID) | Yes | Session identifier from the event |
| text | string | No* | Plain text to speak |
| ssml | string | No* | SSML markup for advanced control |
| tts | object | No | TTS provider configuration |
| barge_in | object | No | Barge-in behavior configuration |
| user_input_timeout_seconds | number | No | Time in seconds to wait for user input after speech ends. If no speech is detected within this time, a user_input_timeout event is sent |
* Provide either text or ssml, but not both.
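The field rules above can be checked before an action is sent. The following is a minimal sketch, not part of the platform's SDK: `build_speak_action` is a hypothetical helper name, and rejecting unknown keys on the client side is an assumption made for illustration.

```python
import json
from typing import Any, Optional


def build_speak_action(
    session_id: str,
    text: Optional[str] = None,
    ssml: Optional[str] = None,
    **optional_fields: Any,
) -> dict:
    """Return a speak action dict with exactly one of text/ssml set."""
    # Reject both "neither given" and "both given".
    if (text is None) == (ssml is None):
        raise ValueError("Provide exactly one of 'text' or 'ssml'")
    action = {"type": "speak", "session_id": session_id}
    if text is not None:
        action["text"] = text
    else:
        action["ssml"] = ssml
    # Optional fields listed in the table; anything else is refused
    # (a client-side assumption, not documented server behavior).
    allowed = {"tts", "barge_in", "user_input_timeout_seconds"}
    for key, value in optional_fields.items():
        if key not in allowed:
            raise ValueError(f"Unknown speak field: {key}")
        if value is not None:
            action[key] = value
    return action


if __name__ == "__main__":
    action = build_speak_action(
        "550e8400-e29b-41d4-a716-446655440000",
        text="Hello! How can I help you?",
        user_input_timeout_seconds=5,
    )
    print(json.dumps(action, indent=2))
```

Failing early in the webhook keeps a malformed action from reaching the platform, where the resulting error may be harder to trace.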
Simple Text
json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello! How can I help you?"
}
SSML (Advanced)
json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "ssml": "<speak version=\"1.0\" xml:lang=\"en-US\"><voice name=\"en-US-JennyNeural\"><prosody rate=\"slow\">Please listen carefully.</prosody><break time=\"500ms\"/>Your account balance is <say-as interpret-as=\"currency\">$42.50</say-as></voice></speak>"
}
TTS Provider Configuration
Azure
json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello in a different voice",
  "tts": {
    "provider": "azure",
    "language": "en-US",
    "voice": "en-US-JennyNeural"
  }
}
ElevenLabs
json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello from ElevenLabs",
  "tts": {
    "provider": "eleven_labs",
    "voice": "zrHiDhphv9ZnVXBqCLjz"
  }
}
Voice IDs
The voice field accepts the ElevenLabs voice ID (e.g., "zrHiDhphv9ZnVXBqCLjz" for "Mimi"). If omitted, the first available voice will be used. See the TTS Providers documentation for a list of available voices.
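As an illustration of the optional voice field, the sketch below builds the tts object and leaves out any key that was not supplied, so the provider default described above applies. `tts_config` is a hypothetical helper name; whether language and voice may also be omitted for Azure is not stated here, so treat that as an assumption.

```python
from typing import Optional


def tts_config(
    provider: str,
    voice: Optional[str] = None,
    language: Optional[str] = None,
) -> dict:
    """Build a tts object, omitting keys that were not supplied."""
    config = {"provider": provider}
    if language is not None:
        config["language"] = language
    if voice is not None:
        # For ElevenLabs this is the voice ID, not the display name.
        config["voice"] = voice
    return config


if __name__ == "__main__":
    print(tts_config("eleven_labs", voice="zrHiDhphv9ZnVXBqCLjz"))
    print(tts_config("eleven_labs"))  # no voice: provider default is used
```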
Minimal Configuration (uses default voice):
json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Hello from ElevenLabs",
  "tts": {
    "provider": "eleven_labs"
  }
}
Barge-In Configuration
Control when and how users can interrupt the assistant while it is speaking:
json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Your account number is 1234567890. Please write this down.",
  "barge_in": {
    "strategy": "minimum_characters",
    "minimum_characters": 10,
    "allow_after_ms": 2000
  }
}
See Barge-In Configuration for details.
User Input Timeout
Set a timeout to wait for user input after the assistant finishes speaking. If the user doesn't speak within the specified time, a user_input_timeout event is sent to your application:
json
{
  "type": "speak",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "What is your account number?",
  "user_input_timeout_seconds": 5
}
Behavior:
- Timer starts when the assistant finishes speaking (assistant_speech_ended event)
- Timer is cleared when the user starts speaking (any STT event)
- If the timeout is reached, a user_input_timeout event is sent
- Your application can respond with any action (e.g., repeat the question, hang up)
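The timer rules above can be modelled as a small pure function. This is only a sketch of the described behavior, not the platform's implementation: `timeout_fires_at` is a hypothetical name, and "user_speech" is a stand-in label for any STT event.

```python
from typing import Iterable, Optional, Tuple


def timeout_fires_at(
    events: Iterable[Tuple[float, str]],
    timeout_seconds: float,
) -> Optional[float]:
    """Return when user_input_timeout would be sent, or None if it never is.

    events: (timestamp_seconds, event_name) pairs in chronological order.
    """
    deadline: Optional[float] = None
    for timestamp, name in events:
        # A pending timer that expired before this event has already fired.
        if deadline is not None and timestamp >= deadline:
            return deadline
        if name == "assistant_speech_ended":
            deadline = timestamp + timeout_seconds  # timer starts
        elif name == "user_speech":
            deadline = None  # any STT event clears the timer
    # A timer still pending after the last event fires at its deadline.
    return deadline


if __name__ == "__main__":
    silent = [(0.0, "assistant_speech_ended")]
    answered = [(0.0, "assistant_speech_ended"), (2.0, "user_speech")]
    print(timeout_fires_at(silent, 5))
    print(timeout_fires_at(answered, 5))
```

In this model, speech that begins after the deadline does not cancel a timeout that has already been sent.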
Example with timeout handling:
javascript
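const express = require('express');

const app = express();
app.use(express.json()); // parse JSON event bodies

app.post('/webhook', (req, res) => {
  const event = req.body;

  // Ask the question and wait up to 5 seconds for an answer.
  if (event.type === 'session_start') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'What is your account number?',
      user_input_timeout_seconds: 5
    });
  }

  // No speech detected in time: repeat the question with the same timeout.
  if (event.type === 'user_input_timeout') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: "I didn't hear anything. Let me try again. What is your account number?",
      user_input_timeout_seconds: 5
    });
  }

  // The user answered: read the transcript back.
  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `Your account number is ${event.text}`
    });
  }

  // Assumption: other event types need no action; reply with an empty 204
  // so the request does not hang.
  return res.sendStatus(204);
});
Examples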
Python
python
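from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json
    # Echo the user's transcript back as speech.
    if event['type'] == 'user_speak':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': f"You said: {event['text']}"
        })
    # Assumption: other event types need no action; reply with an empty 204.
    return '', 204
Node.js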
javascript
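const express = require('express');

const app = express();
app.use(express.json()); // parse JSON event bodies

app.post('/webhook', (req, res) => {
  const event = req.body;
  // Echo the user's transcript back as speech.
  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`
    });
  }
  // Assumption: other event types need no action; reply with an empty 204.
  return res.sendStatus(204);
});
Go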
go
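// Fragment from inside an HTTP handler: w is the http.ResponseWriter and
// session is the "session" object decoded from the incoming event.
action := map[string]interface{}{
	"type":       "speak",
	"session_id": session["id"],
	"text":       "Hello! How can I help you?",
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(action)
Use Cases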
- Respond to user - Answer questions
- Provide information - Share details
- Guide conversation - Direct the flow
- Confirm actions - Acknowledge user input
Best Practices
- Keep it concise - Short responses work better
- Use SSML sparingly - Only when you need control over pacing, pauses, or pronunciation
- Configure barge-in - Allow natural interruptions
- Choose appropriate voice - Match language and tone
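When SSML is warranted, any dynamic text placed inside the markup has to be XML-escaped, or characters such as `&` and `<` will produce invalid SSML. The sketch below mirrors the speak/voice/prosody structure of the SSML example earlier on this page; `build_ssml` is a hypothetical helper name, not part of any SDK.

```python
import json
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str,
    voice: str,
    language: str = "en-US",
    rate: Optional[str] = None,
) -> str:
    """Wrap plain text in speak/voice (and optional prosody) elements."""
    body = escape(text)  # escapes &, < and > in the spoken text
    if rate is not None:
        body = f"<prosody rate={quoteattr(rate)}>{body}</prosody>"
    # quoteattr returns the value already wrapped in quotes.
    return (
        f'<speak version="1.0" xml:lang={quoteattr(language)}>'
        f"<voice name={quoteattr(voice)}>{body}</voice></speak>"
    )


if __name__ == "__main__":
    action = {
        "type": "speak",
        "session_id": "550e8400-e29b-41d4-a716-446655440000",
        "ssml": build_ssml("Terms & conditions apply.", "en-US-JennyNeural", rate="slow"),
    }
    # json.dumps takes care of escaping the double quotes inside the SSML.
    print(json.dumps(action, indent=2))
```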
Next Steps
- TTS Providers - Configure voices
- Barge-In Configuration - Control interruptions
- Other Actions - Complete action reference