Speak Action
Speak text or SSML to the user using text-to-speech.
Action Structure
json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Hello! How can I help you?",
"tts": {
"provider": "azure",
"language": "en-US",
"voice": "en-US-JennyNeural"
},
"barge_in": {
"strategy": "minimum_characters",
"minimum_characters": 3
}
}
Fields
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Always "speak" |
| session_id | string (UUID) | Yes | Session identifier from event |
| text | string | No* | Plain text to speak |
| ssml | string | No* | SSML markup for advanced control |
| tts | object | No | TTS provider configuration |
| barge_in | object | No | Barge-in behavior configuration |
| user_input_timeout_seconds | number | No | Timeout in seconds to wait for user input after speech ends. If no speech is detected within this time, a user_input_timeout event is sent. |
| vad | object | No | Voice-activity detection tuning for the caller's reply. See VAD Configuration. |
* Exactly one of text or ssml is required; do not send both.
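The either/or rule for text and ssml can be checked before an action leaves your application. The following is a minimal sketch in Python; build_speak_action is a hypothetical helper, not part of any SDK, and it only knows the fields listed in the table above.

```python
import json


def build_speak_action(session_id, text=None, ssml=None, **options):
    """Build a speak action dict (hypothetical helper, not an official API).

    Exactly one of `text` or `ssml` must be given. Extra keyword arguments
    (tts, barge_in, vad, user_input_timeout_seconds) are copied through
    unchanged when they are not None.
    """
    if (text is None) == (ssml is None):
        raise ValueError("provide exactly one of 'text' or 'ssml'")

    action = {"type": "speak", "session_id": session_id}
    if text is not None:
        action["text"] = text
    else:
        action["ssml"] = ssml

    # Optional fields are left out entirely rather than sent as null.
    action.update({k: v for k, v in options.items() if v is not None})
    return action


if __name__ == "__main__":
    action = build_speak_action(
        "550e8400-e29b-41d4-a716-446655440000",
        text="Hello! How can I help you?",
        user_input_timeout_seconds=5,
    )
    print(json.dumps(action, indent=2))
```

Failing early in your own code gives a clearer error than whatever the platform returns for an action that carries both fields or neither.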
Simple Text
json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Hello! How can I help you?"
}
SSML (Advanced)
json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"ssml": "<speak version=\"1.0\" xml:lang=\"en-US\"><voice name=\"en-US-JennyNeural\"><prosody rate=\"slow\">Please listen carefully.</prosody><break time=\"500ms\"/>Your account balance is <say-as interpret-as=\"currency\">$42.50</say-as></voice></speak>"
}
TTS Provider Configuration
Azure
json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Hello in a different voice",
"tts": {
"provider": "azure",
"language": "en-US",
"voice": "en-US-JennyNeural"
}
}
ElevenLabs
json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Hello from ElevenLabs",
"tts": {
"provider": "eleven_labs",
"voice": "zrHiDhphv9ZnVXBqCLjz"
}
}
Voice IDs
The voice field accepts the ElevenLabs voice ID (e.g., "zrHiDhphv9ZnVXBqCLjz" for "Mimi"). If omitted, the first available voice will be used. See the TTS Providers documentation for a list of available voices.
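One way to handle the optional voice in application code is to leave the key out when no voice has been chosen, so the documented default applies. This is a small sketch; tts_config is a hypothetical helper name and the key names are taken from the examples on this page.

```python
def tts_config(provider, voice=None, language=None):
    """Return a tts object, omitting keys that were not supplied."""
    config = {"provider": provider}
    if language is not None:
        config["language"] = language
    if voice is not None:
        config["voice"] = voice
    return config


if __name__ == "__main__":
    # Explicit ElevenLabs voice ID (value copied from the example above).
    print(tts_config("eleven_labs", voice="zrHiDhphv9ZnVXBqCLjz"))
    # No voice key at all: the platform then chooses the voice.
    print(tts_config("eleven_labs"))
```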
Minimal Configuration (uses default voice):
json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Hello from ElevenLabs",
"tts": {
"provider": "eleven_labs"
}
}
Barge-In Configuration
Control how users can interrupt the assistant while it is speaking:
Immediate Response (Most Responsive) ⚡
json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "I can help you with billing, support, or sales. What would you like?",
"barge_in": {
"strategy": "immediate",
"allow_after_ms": 500
}
}
Result: The assistant stops as soon as the user starts speaking (20-100 ms latency).
Character-Based Interruption
json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Your account number is 1234567890. Please write this down.",
"barge_in": {
"strategy": "minimum_characters",
"minimum_characters": 10,
"allow_after_ms": 2000
}
}
Result: The assistant stops once the user has spoken 10 or more characters.
See Barge-In Configuration for all strategies and details.
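The two configurations above can be produced by one small helper. This is a sketch under stated assumptions: barge_in_config is a hypothetical name, it knows only the two strategies shown on this page (the Barge-In Configuration reference may list more), and treating minimum_characters as mandatory for that strategy is a choice made here, not a documented rule.

```python
def barge_in_config(strategy, minimum_characters=None, allow_after_ms=None):
    """Build a barge_in object for the two strategies shown on this page."""
    known = {"immediate", "minimum_characters"}
    if strategy not in known:
        raise ValueError(f"unknown strategy {strategy!r}; this sketch knows {sorted(known)}")

    config = {"strategy": strategy}
    if strategy == "minimum_characters":
        # ASSUMPTION: require an explicit positive threshold for this strategy.
        if not isinstance(minimum_characters, int) or minimum_characters < 1:
            raise ValueError("minimum_characters must be a positive integer")
        config["minimum_characters"] = minimum_characters
    if allow_after_ms is not None:
        config["allow_after_ms"] = allow_after_ms
    return config


if __name__ == "__main__":
    print(barge_in_config("immediate", allow_after_ms=500))
    print(barge_in_config("minimum_characters", minimum_characters=10, allow_after_ms=2000))
```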
VAD (Voice Activity Detection) Tuning
Optional advanced setting that lets the caller pause longer (or shorter) before their turn is considered finished. When omitted, the system default applies.
json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "Please tell me your address.",
"vad": {
"end_of_turn_silence_ms": 1500
}
}
| Field | Type | Description |
|---|---|---|
| end_of_turn_silence_ms | number | Milliseconds of silence after the caller stops speaking before their turn ends. Recommended range 150–2000. |
Out-of-range or invalid values are silently ignored: the speak action still runs as if vad were not set. See VAD Configuration for details.
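Because a bad value produces no error, a client-side check is one way to notice the mistake. The sketch below assumes the recommended range (150 to 2000 ms) is also the accepted range, which this page does not state outright; vad_config is a hypothetical helper name.

```python
import logging

logger = logging.getLogger(__name__)

# ASSUMPTION: the recommended range is treated as the valid range.
MIN_SILENCE_MS = 150
MAX_SILENCE_MS = 2000


def vad_config(end_of_turn_silence_ms):
    """Return a vad object, or None when the value would be ignored anyway."""
    value = end_of_turn_silence_ms
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if not is_number or not MIN_SILENCE_MS <= value <= MAX_SILENCE_MS:
        # Log locally, since the platform itself will not report the problem.
        logger.warning(
            "end_of_turn_silence_ms=%r is outside %d-%d; omitting vad",
            value, MIN_SILENCE_MS, MAX_SILENCE_MS,
        )
        return None
    return {"end_of_turn_silence_ms": value}


if __name__ == "__main__":
    print(vad_config(1500))  # in range: returned as a vad object
    print(vad_config(50))    # out of range: None, with a warning logged
```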
User Input Timeout
Set a timeout to wait for user input after the assistant finishes speaking. If the user doesn't speak within the specified time, a user_input_timeout event is sent to your application:
json
{
"type": "speak",
"session_id": "550e8400-e29b-41d4-a716-446655440000",
"text": "What is your account number?",
"user_input_timeout_seconds": 5
}
Behavior:
- Timer starts when the assistant finishes speaking (assistant_speech_ended event)
- Timer is cleared when the user starts speaking (any STT event)
- If timeout is reached, a user_input_timeout event is sent
- Your application can respond with any action (e.g., repeat question, hangup)
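Since the application chooses the response to each timeout, it can also limit how often it repeats itself. The sketch below reprompts a fixed number of times and then ends the call. Everything named here is illustrative: handle_event, speak, and MAX_REPROMPTS are hypothetical, the event shape follows the webhook examples on this page, and the exact JSON for a hangup action is an assumption.

```python
MAX_REPROMPTS = 2  # illustrative limit, not a platform default
QUESTION = "What is your account number?"

# In-memory reprompt counters keyed by session ID. A multi-process
# deployment would need shared storage instead.
reprompts = {}


def speak(session_id, text, timeout=None):
    """Build a speak action, adding the input timeout only when requested."""
    action = {"type": "speak", "session_id": session_id, "text": text}
    if timeout is not None:
        action["user_input_timeout_seconds"] = timeout
    return action


def handle_event(event):
    """Return the action dict for one webhook event, or None for no action."""
    session_id = event["session"]["id"]

    if event["type"] == "session_start":
        reprompts[session_id] = 0
        return speak(session_id, QUESTION, timeout=5)

    if event["type"] == "user_input_timeout":
        count = reprompts.get(session_id, 0)
        if count >= MAX_REPROMPTS:
            reprompts.pop(session_id, None)
            # ASSUMPTION: the hangup action is spelled {"type": "hangup"}.
            return {"type": "hangup", "session_id": session_id}
        reprompts[session_id] = count + 1
        return speak(session_id, "I didn't hear anything. " + QUESTION, timeout=5)

    if event["type"] == "user_speak":
        reprompts.pop(session_id, None)
        return speak(session_id, f"Your account number is {event['text']}")

    return None
```

Keeping handle_event free of any web framework makes the retry logic easy to unit-test; a Flask or Express route would only need to serialize its return value.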
Example with timeout handling:
javascript
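const express = require('express');
const app = express();
app.use(express.json()); // parse JSON webhook bodies

app.post('/webhook', (req, res) => {
  const event = req.body;
  // Ask the question and wait up to 5 seconds for a reply.
  if (event.type === 'session_start') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'What is your account number?',
      user_input_timeout_seconds: 5
    });
  }
  // No reply in time: repeat the question with the same timeout.
  if (event.type === 'user_input_timeout') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: 'I didn\'t hear anything. Let me try again. What is your account number?',
      user_input_timeout_seconds: 5
    });
  }
  // The caller answered: read the answer back.
  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `Your account number is ${event.text}`
    });
  }
  // Other events need no action here; end the request so it does not hang.
  // The empty 204 reply is an assumption about what the platform accepts.
  return res.sendStatus(204);
});
Examples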
Python
python
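from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json
    # Echo the caller's words back with a speak action.
    if event['type'] == 'user_speak':
        return jsonify({
            'type': 'speak',
            'session_id': event['session']['id'],
            'text': f"You said: {event['text']}"
        })
    # Other events: no action. The empty 204 reply is an assumption.
    return '', 204
Node.js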
javascript
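const express = require('express');
const app = express();
app.use(express.json()); // parse JSON webhook bodies

app.post('/webhook', (req, res) => {
  const event = req.body;
  // Echo the caller's words back with a speak action.
  if (event.type === 'user_speak') {
    return res.json({
      type: 'speak',
      session_id: event.session.id,
      text: `You said: ${event.text}`
    });
  }
  // Other events: no action. The empty 204 reply is an assumption.
  return res.sendStatus(204);
});
Go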
go
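// Fragment from inside an http.HandlerFunc: w is the http.ResponseWriter and
// session is the decoded "session" object from the incoming event.
action := map[string]interface{}{
	"type":       "speak",
	"session_id": session["id"],
	"text":       "Hello! How can I help you?",
}
w.Header().Set("Content-Type", "application/json")
if err := json.NewEncoder(w).Encode(action); err != nil {
	log.Printf("encode speak action: %v", err)
}
Use Cases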
- Respond to user - Answer questions
- Provide information - Share details
- Guide conversation - Direct the flow
- Confirm actions - Acknowledge user input
Best Practices
- Keep it concise - Short responses work better
- Use SSML sparingly - Only when needed for emphasis
- Configure barge-in - Allow natural interruptions
- Choose appropriate voice - Match language and tone
Next Steps
- TTS Providers - Configure voices
- Barge-In Configuration - Control interruptions
- Other Actions - Complete action reference