Mix Audio Action

Play a looping background sound (e.g. train station, café, office ambience) under the call. The loop plays continuously for the lifetime of the session — also during the assistant's TTS turns and during silences between turns.

Sending mix_audio again replaces the active loop. Sending it with stop: true removes the loop. The active loop is dropped automatically when the session ends.

Action Structure

Start a background loop

json

{
  "type": "mix_audio",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAQB8AAEAfAAABAAgAZGF0YQAAAAA=",
  "volume": 0.3
}

Stop an active background loop

json

{
  "type": "mix_audio",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "stop": true
}

Fields

Field	Type	Required	Description
`type`	string	Yes	Always `"mix_audio"`
`session_id`	string (UUID)	Yes	Session identifier taken from the incoming event
`audio`	string	Conditional	Base64-encoded WAV (16 kHz, 16-bit, mono PCM). Required when `stop` is not `true`.
`volume`	number	No	Background loop volume, `0.0`–`1.0`. Defaults to `0.5`.
`stop`	boolean	No	When `true`, removes the active loop.
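The rules in the table can be enforced before an action leaves your webhook. The following is a minimal sketch, not part of any official SDK: `build_mix_audio` is a hypothetical helper name, and the checks only mirror what the Fields table states.

```python
import base64
from typing import Any, Dict, Optional


def build_mix_audio(session_id: str,
                    audio_b64: Optional[str] = None,
                    volume: Optional[float] = None,
                    stop: bool = False) -> Dict[str, Any]:
    """Build a mix_audio action dict (hypothetical helper, not an SDK call)."""
    action: Dict[str, Any] = {"type": "mix_audio", "session_id": session_id}
    if stop:
        # A stop action needs no audio or volume
        action["stop"] = True
        return action
    if not audio_b64:
        raise ValueError("audio is required when stop is not true")
    # Raises binascii.Error (a ValueError subclass) if the string is not valid base64
    base64.b64decode(audio_b64, validate=True)
    action["audio"] = audio_b64
    if volume is not None:
        if not 0.0 <= volume <= 1.0:
            raise ValueError("volume must be between 0.0 and 1.0")
        action["volume"] = volume
    return action
```

Omitting `volume` leaves the field out of the payload, so the documented default of `0.5` applies.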

Audio Format Requirements

Identical to the audio action:

Format: WAV
Sample Rate: 16 kHz
Channels: Mono (single channel)
Bit Depth: 16-bit PCM
Encoding: Base64
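
A file can be checked against these requirements before it is encoded. The sketch below uses Python's standard `wave` module; `check_wav` is a hypothetical helper name, and the function only inspects the WAV header values listed above.

```python
import io
import wave
from typing import List


def check_wav(data: bytes) -> List[str]:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected 1 (mono)")
        if wav.getsampwidth() != 2:
            problems.append(f"{wav.getsampwidth() * 8}-bit samples, expected 16-bit")
    return problems
```

`wave.open` raises `wave.Error` for data that is not a readable WAV file, so callers may want to catch that as well.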

A 30-second loop in this format is about 960 KB of raw audio (16,000 samples/s × 2 bytes × 30 s) and about 1.28 MB once base64-encoded in the JSON action payload, since base64 adds roughly one third.
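
The estimate follows directly from the format parameters. A small sketch of the arithmetic, assuming a canonical 44-byte PCM WAV header (files with extra metadata chunks are slightly larger); `loop_sizes` is a hypothetical name:

```python
from typing import Tuple

SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2   # 16-bit PCM
CHANNELS = 1           # mono
WAV_HEADER_BYTES = 44  # assumption: canonical header with no extra chunks


def loop_sizes(seconds: float) -> Tuple[int, int]:
    """Return (WAV size in bytes, base64 string length) for a loop of this length."""
    raw = int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS) + WAV_HEADER_BYTES
    # base64 emits 4 characters for every 3 input bytes, padding the last group
    b64 = 4 * ((raw + 2) // 3)
    return raw, b64


print(loop_sizes(30))  # (960044, 1280060)
```

Halving the loop length halves both numbers, which is the main lever for keeping the session-start payload small.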

Behavior Notes

Continuous playback. Once started, the ambient loop plays for the rest of the call, both under the assistant's TTS during turns and on its own during silences.
Replace semantics. A second mix_audio (without stop) replaces the buffer and volume of the running loop.
Restart-safe. If the service restarts during an active call, the loop continues automatically.
Auto-cleanup. The loop is dropped when the session ends.
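
The replace, stop, and cleanup rules can be pictured as a tiny state machine. This is a toy model for reasoning about the semantics, not the service's implementation; `LoopState` is a hypothetical name, and the sketch assumes that a replacement without `volume` falls back to the documented default rather than keeping the previous value.

```python
from typing import Any, Dict, Optional


class LoopState:
    """Toy model of one session's background loop."""

    DEFAULT_VOLUME = 0.5  # documented default when volume is omitted

    def __init__(self) -> None:
        self.active: Optional[Dict[str, Any]] = None

    def apply(self, action: Dict[str, Any]) -> None:
        """Apply one mix_audio action to the session's loop state."""
        if action.get("stop") is True:
            self.active = None  # stop removes the loop
            return
        # Any other mix_audio replaces both buffer and volume
        self.active = {
            "audio": action["audio"],
            "volume": action.get("volume", self.DEFAULT_VOLUME),
        }

    def end_session(self) -> None:
        """Auto-cleanup: the loop is dropped when the session ends."""
        self.active = None
```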

Use Cases

Setting the scene. Add café or train-station ambience to make a virtual receptionist feel located somewhere specific.
Wait-state cues. Light office hum during long lookups so the line doesn't feel dead.
Accessibility / signaling. Subtle sounds that indicate the agent is "in" a particular context.

Examples

Python (Flask)

python

import base64

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load and base64-encode the loop once at startup
with open('cafe.wav', 'rb') as f:
    AMBIENT_AUDIO = base64.b64encode(f.read()).decode('utf-8')

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json

    if event['type'] == 'session_start':
        # Start the ambient loop AND speak the greeting in one response
        return jsonify([
            {
                'type': 'mix_audio',
                'session_id': event['session']['id'],
                'audio': AMBIENT_AUDIO,
                'volume': 0.3,
            },
            {
                'type': 'speak',
                'session_id': event['session']['id'],
                'text': 'Welcome, how can I help you?',
            },
        ])

    if event['type'] == 'user_speak' and 'goodbye' in event['text'].lower():
        # Stop the ambient before saying goodbye, then hang up
        return jsonify([
            {
                'type': 'mix_audio',
                'session_id': event['session']['id'],
                'stop': True,
            },
            {
                'type': 'speak',
                'session_id': event['session']['id'],
                'text': 'Goodbye!',
            },
            { 'type': 'hangup', 'session_id': event['session']['id'] },
        ])

    # No actions for any other event (assumes an empty action list is accepted)
    return jsonify([])

Node.js

javascript

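import express from "express";
import { readFileSync } from "node:fs";

const app = express();
app.use(express.json()); // parse JSON webhook bodies into req.body

// Load and base64-encode the loop once at startup
const AMBIENT_AUDIO = readFileSync("./cafe.wav").toString("base64");

app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    // Start the ambient loop and speak the greeting in one response
    return res.json([
      {
        type: 'mix_audio',
        session_id: event.session.id,
        audio: AMBIENT_AUDIO,
        volume: 0.3,
      },
      {
        type: 'speak',
        session_id: event.session.id,
        text: 'Welcome, how can I help you?',
      },
    ]);
  }

  // No actions for any other event (assumes an empty action list is accepted)
  return res.json([]);
});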

Go

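go

package main

import (
    "encoding/base64"
    "encoding/json"
    "log"
    "net/http"
    "os"
)

func main() {
    // Load and base64-encode the loop once at startup
    audioBytes, err := os.ReadFile("cafe.wav")
    if err != nil {
        log.Fatalf("reading cafe.wav: %v", err)
    }
    ambientAudio := base64.StdEncoding.EncodeToString(audioBytes)

    http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
        var event map[string]interface{}
        if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
            http.Error(w, "invalid JSON", http.StatusBadRequest)
            return
        }

        // Default to no actions (assumes an empty action list is accepted)
        actions := []map[string]interface{}{}

        if event["type"] == "session_start" {
            // Checked assertion: a missing session must not panic the handler
            session, ok := event["session"].(map[string]interface{})
            if !ok {
                http.Error(w, "missing session", http.StatusBadRequest)
                return
            }
            actions = []map[string]interface{}{
                {
                    "type":       "mix_audio",
                    "session_id": session["id"],
                    "audio":      ambientAudio,
                    "volume":     0.3,
                },
                {
                    "type":       "speak",
                    "session_id": session["id"],
                    "text":       "Welcome, how can I help you?",
                },
            }
        }
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(actions)
    })

    // Port 8080 is an arbitrary choice for this example
    log.Fatal(http.ListenAndServe(":8080", nil))
}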

Converting Audio Files

Convert any audio file to the required format with FFmpeg:

bash

ffmpeg -i input.mp3 -ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav

For ambient sound, normalizing loudness across presets keeps the perceived level consistent for a given volume value. A target of -30 LUFS sits well below typical TTS speech (around -16 LUFS), so volume values in the 0.2–0.5 range remain useful. The command below also trims the input to 30 seconds (-t 30):

bash

ffmpeg -i input.mp3 -t 30 -af "loudnorm=I=-30:LRA=11:TP=-2" \
  -ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav
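
When several presets need converting, the same commands can be driven from a script. This sketch builds the argument list shown above and hands it to `subprocess.run`; `ffmpeg_args` and `convert` are hypothetical helper names, the added `-y` flag (overwrite the output file) is a choice made here, and it assumes `ffmpeg` is on the PATH.

```python
import subprocess
from typing import List, Optional


def ffmpeg_args(src: str, dst: str,
                loudness_lufs: Optional[float] = None,
                max_seconds: Optional[int] = None) -> List[str]:
    """Build the FFmpeg command for 16 kHz, 16-bit, mono WAV output."""
    args = ["ffmpeg", "-y", "-i", src]
    if max_seconds is not None:
        args += ["-t", str(max_seconds)]  # trim to the first N seconds
    if loudness_lufs is not None:
        args += ["-af", f"loudnorm=I={loudness_lufs}:LRA=11:TP=-2"]
    args += ["-ar", "16000", "-ac", "1", "-sample_fmt", "s16", "-f", "wav", dst]
    return args


def convert(src: str, dst: str, **options) -> None:
    """Run FFmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_args(src, dst, **options), check=True)
```

For example, `convert("input.mp3", "output.wav", loudness_lufs=-30, max_seconds=30)` runs the normalizing command above.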

Best Practices

Load once, encode once. Encode each ambient WAV to base64 at startup and reuse the string, rather than reading and encoding the file on every call.
Start the loop with the greeting. Return [mix_audio, speak] together on session_start so the ambient is in place from the first word.
Keep the volume low. Ambient sound should sit under the agent. Start around 0.3 and lower from there.
Trim long files. A 30-second loop is plenty for ambience; longer files just mean larger one-time payloads at session start.
Stop explicitly when ending the call. Sending mix_audio { stop: true } before a farewell is optional (the loop is dropped at session_end anyway), but it makes the goodbye land cleanly without ambient bleed.
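
For services with more than one ambience, the "load once, encode once" advice extends naturally to a dictionary of presets built at startup. A minimal sketch, assuming the presets live as `.wav` files in a single directory; `load_presets` and the directory layout are hypothetical:

```python
import base64
from pathlib import Path
from typing import Dict


def load_presets(directory: str) -> Dict[str, str]:
    """Map each preset name (file stem) to its base64-encoded WAV contents."""
    presets: Dict[str, str] = {}
    for path in sorted(Path(directory).glob("*.wav")):
        presets[path.stem] = base64.b64encode(path.read_bytes()).decode("ascii")
    return presets
```

Calling this once at import time (for example `PRESETS = load_presets("ambience")`) lets each webhook response reuse `PRESETS["cafe"]` without touching the disk.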

Mix Audio vs. Audio Action

Aspect	`audio`	`mix_audio`
Plays	Once, then stops	Loops continuously for the rest of the call
Audible during silence	No	Yes
Plays under TTS	No	Yes
Use case	Hold music, announcements, sound effects	Scene/atmosphere under the agent
Restart-safe	No (one-shot)	Yes (loop continues automatically)

Troubleshooting

Ambient is too loud / drowns out speech

Lower the volume (try 0.2).
Re-normalize the source file to a quieter target LUFS (e.g. -30 LUFS instead of -23).

Loop pops at the boundary

For material with strong transients, fade the source file in/out by 50 ms in your editor before encoding so the loop point is silent.
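
If no audio editor is at hand, the same fade can be applied in code. The sketch below applies a linear fade-in and fade-out using only the standard library; it assumes the input is already 16-bit mono PCM as required above, and `fade_edges` and `fade_wav_file` are hypothetical helper names.

```python
import array
import sys
import wave


def fade_edges(pcm: bytes, sample_rate: int = 16000, fade_ms: int = 50) -> bytes:
    """Linearly fade the first and last fade_ms of 16-bit little-endian mono PCM."""
    samples = array.array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    # Never fade more than half the clip from each side
    n = min(sample_rate * fade_ms // 1000, len(samples) // 2)
    for i in range(n):
        gain = i / n  # 0.0 at the very edge, rising towards 1.0
        samples[i] = int(samples[i] * gain)
        samples[-1 - i] = int(samples[-1 - i] * gain)
    if sys.byteorder == "big":
        samples.byteswap()
    return samples.tobytes()


def fade_wav_file(src: str, dst: str, fade_ms: int = 50) -> None:
    """Read a WAV file, fade both ends, and write the result to dst."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(reader.getnframes())
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        writer.writeframes(fade_edges(frames, params.framerate, fade_ms))
```

Because both ends reach zero amplitude, the last sample of one pass and the first sample of the next meet at silence.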

Next Steps

Audio Action - Play a single pre-recorded clip
Speak Action - Text-to-speech under the loop
Action Types - Complete action reference
