Mix Audio Action
Play a looping background sound (e.g. train station, café, office ambience) under the call. The loop plays continuously for the lifetime of the session — also during the assistant's TTS turns and during silences between turns.
Sending mix_audio again replaces the active loop. Sending it with stop: true removes the loop. The active loop is dropped automatically when the session ends.
Action Structure
Start a background loop
json
{
  "type": "mix_audio",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=",
  "volume": 0.3
}
Stop an active background loop
json
{
  "type": "mix_audio",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "stop": true
}
Fields
| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Always "mix_audio". |
| session_id | string (UUID) | Yes | Session identifier from the incoming event. |
| audio | string | Conditional | Base64-encoded WAV (16 kHz, 16-bit, mono PCM). Required unless stop is true. |
| volume | number | No | Background loop volume, 0.0–1.0. Defaults to 0.5. |
| stop | boolean | No | When true, removes the active loop. |
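The field rules above can be captured in a small builder. This is a minimal sketch: `build_mix_audio` is a hypothetical helper name, and the client-side validation only mirrors the table. It is not a description of what the service itself checks.

```python
import base64
from typing import Any, Optional


def build_mix_audio(session_id: str,
                    wav_bytes: Optional[bytes] = None,
                    volume: Optional[float] = None,
                    stop: bool = False) -> dict[str, Any]:
    """Build a mix_audio action dict from the documented fields."""
    action: dict[str, Any] = {"type": "mix_audio", "session_id": session_id}
    if stop:
        # A stop action carries no audio payload.
        action["stop"] = True
        return action
    if not wav_bytes:
        raise ValueError("audio is required unless stop is true")
    action["audio"] = base64.b64encode(wav_bytes).decode("ascii")
    if volume is not None:
        if not 0.0 <= volume <= 1.0:
            raise ValueError("volume must be between 0.0 and 1.0")
        action["volume"] = volume
    return action
```

Leaving `volume` out of the dict when it is not supplied lets the documented default of 0.5 apply.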
Audio Format Requirements
Identical to the audio action:
- Format: WAV
- Sample Rate: 16 kHz
- Channels: Mono (single channel)
- Bit Depth: 16-bit PCM
- Encoding: Base64
A 30-second loop at this format is approximately 940 KB raw and ~1.25 MB as a base64 string in the JSON action payload.
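To check a file against these requirements before sending it, something like the following sketch may help. It uses only Python's standard `wave` and `base64` modules; `describe_loop` is a hypothetical helper name, and the check is a local sanity test rather than the service's own validation.

```python
import base64
import io
import wave


def describe_loop(wav_bytes: bytes) -> dict:
    """Check a WAV against the required format and estimate payload size."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        format_ok = (wav.getframerate() == 16000       # 16 kHz
                     and wav.getnchannels() == 1       # mono
                     and wav.getsampwidth() == 2       # 16-bit samples
                     and wav.getcomptype() == "NONE")  # uncompressed PCM
        seconds = wav.getnframes() / wav.getframerate()
    return {
        "format_ok": format_ok,
        "seconds": seconds,
        "raw_bytes": len(wav_bytes),
        # Base64 emits 4 characters for every 3 input bytes (padded).
        "base64_chars": len(base64.b64encode(wav_bytes)),
    }
```

The size figures follow from the format: 16,000 samples/s × 2 bytes × 30 s = 960,000 bytes of audio data, and base64 expands that by 4/3 to roughly 1.28 million characters.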
Behavior Notes
- Continuous playback. Once started, ambient plays for the rest of the call — under the assistant's TTS during turns and on its own during silences.
- Replace semantics. A second mix_audio (without stop) replaces the buffer and volume of the running loop.
- Restart-safe. If the service restarts during an active call, the loop continues automatically.
- Auto-cleanup. The loop is dropped when the session ends.
Use Cases
- Setting the scene. Add café or train-station ambience to make a virtual receptionist feel located somewhere specific.
- Wait-state cues. Light office hum during long lookups so the line doesn't feel dead.
- Accessibility / signaling. Subtle sounds that indicate the agent is "in" a particular context.
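As an illustration of the wait-state use case, the two action lists might look like the sketch below. The function names and the spoken text are hypothetical, and when each list gets returned depends on how your integration handles slow lookups; only the mix_audio and speak action shapes come from this page.

```python
def begin_lookup(session_id: str, hum_b64: str) -> list[dict]:
    """Actions to return before a slow lookup: start the hum, then speak."""
    return [
        {"type": "mix_audio", "session_id": session_id,
         "audio": hum_b64, "volume": 0.2},
        {"type": "speak", "session_id": session_id,
         "text": "One moment while I look that up."},
    ]


def end_lookup(session_id: str, answer: str) -> list[dict]:
    """Actions to return once the result is ready: stop the hum, then speak."""
    return [
        {"type": "mix_audio", "session_id": session_id, "stop": True},
        {"type": "speak", "session_id": session_id, "text": answer},
    ]
```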
Examples
Python (Flask)
python
import base64

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load and base64-encode the loop once at startup
with open('cafe.wav', 'rb') as f:
    AMBIENT_AUDIO = base64.b64encode(f.read()).decode('utf-8')


@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json
    if event['type'] == 'session_start':
        # Start the ambient loop AND speak the greeting in one response
        return jsonify([
            {
                'type': 'mix_audio',
                'session_id': event['session']['id'],
                'audio': AMBIENT_AUDIO,
                'volume': 0.3,
            },
            {
                'type': 'speak',
                'session_id': event['session']['id'],
                'text': 'Welcome, how can I help you?',
            },
        ])
    if event['type'] == 'user_speak' and 'goodbye' in event['text'].lower():
        # Stop the ambient before saying goodbye, then hang up
        return jsonify([
            {
                'type': 'mix_audio',
                'session_id': event['session']['id'],
                'stop': True,
            },
            {
                'type': 'speak',
                'session_id': event['session']['id'],
                'text': 'Goodbye!',
            },
            {'type': 'hangup', 'session_id': event['session']['id']},
        ])
    # Other events: return no actions (assumes an empty list is accepted)
    return jsonify([])
Node.js
javascript
import express from "express";
import { readFileSync } from "node:fs";

const app = express();
app.use(express.json()); // parse JSON webhook bodies

// Load and base64-encode the loop once at startup
const AMBIENT_AUDIO = readFileSync("./cafe.wav").toString("base64");

app.post('/webhook', (req, res) => {
  const event = req.body;
  if (event.type === 'session_start') {
    return res.json([
      {
        type: 'mix_audio',
        session_id: event.session.id,
        audio: AMBIENT_AUDIO,
        volume: 0.3,
      },
      {
        type: 'speak',
        session_id: event.session.id,
        text: 'Welcome, how can I help you?',
      },
    ]);
  }
  // Other events: return no actions (assumes an empty list is accepted)
  return res.json([]);
});
Go
go
package main

import (
	"encoding/base64"
	"encoding/json"
	"log"
	"net/http"
	"os"
)

func main() {
	// Load and base64-encode the loop once at startup
	audioBytes, err := os.ReadFile("cafe.wav")
	if err != nil {
		log.Fatalf("read ambient loop: %v", err)
	}
	ambientAudio := base64.StdEncoding.EncodeToString(audioBytes)

	http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
		var event map[string]interface{}
		if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
			http.Error(w, "invalid JSON", http.StatusBadRequest)
			return
		}
		if event["type"] == "session_start" {
			session, ok := event["session"].(map[string]interface{})
			if !ok {
				http.Error(w, "missing session", http.StatusBadRequest)
				return
			}
			actions := []map[string]interface{}{
				{
					"type":       "mix_audio",
					"session_id": session["id"],
					"audio":      ambientAudio,
					"volume":     0.3,
				},
				{
					"type":       "speak",
					"session_id": session["id"],
					"text":       "Welcome, how can I help you?",
				},
			}
			w.Header().Set("Content-Type", "application/json")
			json.NewEncoder(w).Encode(actions)
		}
	})

	// Port 8080 is only an example; use whatever your deployment expects
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Converting Audio Files
Convert any audio file to the required format with FFmpeg:
bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav
For ambient sound, normalizing loudness across presets keeps the perceived level consistent at a given volume value. A target of -30 LUFS sits well below typical TTS speech (~-16 LUFS), so the volume setting stays useful around 0.2–0.5:
bash
ffmpeg -i input.mp3 -t 30 -af "loudnorm=I=-30:LRA=11:TP=-2" \
  -ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav
Best Practices
- Load once, encode once. Encode each ambient WAV to base64 at startup and reuse the string — don't read+encode per call.
- Start the loop with the greeting. Return [mix_audio, speak] together on session_start so the ambient is in place from the first word.
- Keep the volume low. Ambient sound should sit under the agent. Start around 0.3 and lower from there.
- Trim long files. A 30-second loop is plenty for ambience; longer files just mean larger one-time payloads at session start.
- Stop explicitly when ending the call. Sending mix_audio { stop: true } before a farewell is optional (the loop is dropped at session_end anyway), but it makes the goodbye land cleanly without ambient bleed.
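With several ambient presets, the "load once, encode once" advice could be applied with a small startup cache, sketched below. The directory layout and the names `load_presets` and `start_ambient` are assumptions made for illustration.

```python
import base64
from pathlib import Path


def load_presets(directory: str) -> dict[str, str]:
    """Read every .wav in `directory` once and return {name: base64 string}."""
    presets = {}
    for path in sorted(Path(directory).glob("*.wav")):
        presets[path.stem] = base64.b64encode(path.read_bytes()).decode("ascii")
    return presets


def start_ambient(presets: dict[str, str], name: str,
                  session_id: str, volume: float = 0.3) -> dict:
    """Build a mix_audio action from an already-encoded preset."""
    return {"type": "mix_audio", "session_id": session_id,
            "audio": presets[name], "volume": volume}
```

Calling `load_presets` once at process start and passing the resulting dict to each handler avoids re-reading and re-encoding the files on every call.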
Mix Audio vs. Audio Action
| Aspect | audio | mix_audio |
|---|---|---|
| Plays | Once, then stops | Loops continuously for the rest of the call |
| Audible during silence | No | Yes |
| Plays under TTS | No | Yes |
| Use case | Hold music, announcements, sound effects | Scene/atmosphere under the agent |
| Restart-safe | No (one-shot) | Yes (loop continues automatically) |
Troubleshooting
Ambient is too loud / drowns out speech
- Lower the volume (try 0.2).
- Re-normalize the source file to a quieter target LUFS (e.g. -30 LUFS instead of -23).
Loop pops at the boundary
For material with strong transients, fade the source file in/out by 50 ms in your editor before encoding so the loop point is silent.
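If you prefer to script the fade rather than use an audio editor, the following sketch applies a linear fade-in and fade-out with Python's standard library. It assumes a 16-bit mono PCM WAV as input; `fade_loop_edges` is a hypothetical helper, and a linear ramp is just one reasonable choice of fade curve.

```python
import array
import sys
import wave


def fade_loop_edges(src: str, dst: str, fade_ms: int = 50) -> None:
    """Apply a linear fade-in and fade-out to a 16-bit mono WAV file."""
    with wave.open(src, "rb") as wav:
        params = wav.getparams()
        if params.nchannels != 1 or params.sampwidth != 2:
            raise ValueError("expected 16-bit mono PCM")
        samples = array.array("h")
        samples.frombytes(wav.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    # Number of samples to ramp at each end, capped at half the file.
    n = min(len(samples) // 2, params.framerate * fade_ms // 1000)
    for i in range(n):
        gain = i / n  # ramps from 0.0 up to just under 1.0
        samples[i] = int(samples[i] * gain)
        samples[-1 - i] = int(samples[-1 - i] * gain)
    if sys.byteorder == "big":
        samples.byteswap()
    with wave.open(dst, "wb") as out:
        out.setparams(params)
        out.writeframes(samples.tobytes())
```

Because both the first and last samples are scaled to zero, the loop boundary joins silence to silence.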
Next Steps
- Audio Action - Play a single pre-recorded clip
- Speak Action - Text-to-speech under the loop
- Action Types - Complete action reference