
Mix Audio Action

Play a looping background sound (e.g. train station, café, office ambience) under the call. The loop plays continuously for the lifetime of the session, including during the assistant's TTS turns and during silences between turns.

Sending mix_audio again replaces the active loop. Sending it with stop: true removes the loop. The active loop is dropped automatically when the session ends.

Action Structure

Start a background loop

json
{
  "type": "mix_audio",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "audio": "UklGRiQAAABXQVZFZm10IBAAAAABAAEAgD4AAAB9AAACABAAZGF0YQAAAAA=",
  "volume": 0.3
}

Stop an active background loop

json
{
  "type": "mix_audio",
  "session_id": "550e8400-e29b-41d4-a716-446655440000",
  "stop": true
}

Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | string | Yes | Always "mix_audio" |
| session_id | string (UUID) | Yes | Session identifier from event |
| audio | string | Conditional | Base64-encoded WAV (16 kHz, 16-bit, mono PCM). Required when stop is not true. |
| volume | number | No | Background loop volume, 0.0 to 1.0. Defaults to 0.5. |
| stop | boolean | No | When true, removes the active loop. |
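
The field rules above can be enforced before an action is sent. The following is a minimal sketch, not part of any official SDK: the helper name `mix_audio_action` and its parameters are hypothetical, and it only encodes the constraints listed in the table (audio required unless stop is true, volume within 0.0 to 1.0).

```python
from typing import Optional


def mix_audio_action(session_id: str, audio_b64: Optional[str] = None,
                     volume: Optional[float] = None, stop: bool = False) -> dict:
    """Build a mix_audio action dict from the documented fields (hypothetical helper)."""
    action = {"type": "mix_audio", "session_id": session_id}
    if stop:
        # Stop form: only type, session_id, and stop are sent.
        action["stop"] = True
        return action
    if not audio_b64:
        raise ValueError("audio is required when stop is not true")
    action["audio"] = audio_b64
    if volume is not None:
        if not 0.0 <= volume <= 1.0:
            raise ValueError("volume must be between 0.0 and 1.0")
        action["volume"] = volume
    # When volume is omitted, the documented default (0.5) applies server-side.
    return action
```

Rejecting an out-of-range volume locally gives a clearer error than waiting for the platform to respond to a malformed action.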

Audio Format Requirements

Identical to the audio action:

  • Format: WAV
  • Sample Rate: 16 kHz
  • Channels: Mono (single channel)
  • Bit Depth: 16-bit PCM
  • Encoding: Base64

A 30-second loop in this format is about 960,000 bytes raw (roughly 940 KiB) and about 1.28 million characters (roughly 1.2 MiB) as a base64 string in the JSON action payload.
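
To check a file against these requirements and reproduce the size arithmetic, a sketch using only the Python standard library might look like the following. The function names `check_wav` and `payload_sizes` are illustrative, and the estimate assumes a canonical 44-byte PCM WAV header; files carrying extra metadata chunks will be slightly larger.

```python
import base64
import io
import wave


def check_wav(data: bytes) -> float:
    """Validate 16 kHz / 16-bit / mono PCM and return the duration in seconds."""
    with wave.open(io.BytesIO(data), "rb") as w:
        if w.getframerate() != 16000:
            raise ValueError(f"sample rate is {w.getframerate()} Hz, expected 16000")
        if w.getnchannels() != 1:
            raise ValueError(f"{w.getnchannels()} channels, expected mono")
        if w.getsampwidth() != 2:
            raise ValueError(f"{8 * w.getsampwidth()}-bit samples, expected 16-bit")
        return w.getnframes() / w.getframerate()


def payload_sizes(seconds: float) -> tuple:
    """Return (raw_bytes, base64_chars), assuming a 44-byte WAV header."""
    raw = 44 + int(seconds * 16000) * 2   # 16,000 samples/s, 2 bytes per sample
    b64 = 4 * ((raw + 2) // 3)            # base64 emits 4 chars per 3 input bytes
    return raw, b64


raw_bytes, b64_chars = payload_sizes(30)
print(raw_bytes, b64_chars)
```

Running the check at startup, before the file is encoded and cached, catches a wrongly converted file before it reaches a live call.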

Behavior Notes

  • Continuous playback. Once started, ambient plays for the rest of the call — under the assistant's TTS during turns and on its own during silences.
  • Replace semantics. A second mix_audio (without stop) replaces the buffer and volume of the running loop.
  • Restart-safe. If the service restarts during an active call, the loop continues automatically.
  • Auto-cleanup. The loop is dropped when the session ends.
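
The lifecycle described in these notes can be summarized as a small state model. This is a toy sketch for reasoning about webhook logic, not the service's implementation; the class names are hypothetical, and it assumes that a replacement sent without a volume falls back to the documented default of 0.5 rather than keeping the previous volume.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Loop:
    audio: str
    volume: float


class SessionLoopModel:
    """Toy model of the documented mix_audio lifecycle for one session."""

    def __init__(self) -> None:
        self.active: Optional[Loop] = None

    def apply(self, action: dict) -> None:
        if action.get("stop") is True:
            # stop: true removes the active loop
            self.active = None
            return
        # Replace semantics: a new mix_audio swaps both buffer and volume.
        # Assumption: an omitted volume means the default 0.5.
        self.active = Loop(action["audio"], action.get("volume", 0.5))

    def end_session(self) -> None:
        # Auto-cleanup: the loop is dropped when the session ends
        self.active = None
```

For example, applying a café loop at volume 0.3 and then an office loop with no volume leaves the model holding the office loop at 0.5.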

Use Cases

  • Setting the scene. Add café or train-station ambience to make a virtual receptionist feel located somewhere specific.
  • Wait-state cues. Light office hum during long lookups so the line doesn't feel dead.
  • Accessibility / signaling. Subtle sounds that indicate the agent is "in" a particular context.

Examples

Python (Flask)

python
import base64

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load and base64-encode the loop once at startup
with open('cafe.wav', 'rb') as f:
    AMBIENT_AUDIO = base64.b64encode(f.read()).decode('utf-8')

@app.route('/webhook', methods=['POST'])
def webhook():
    event = request.json

    if event['type'] == 'session_start':
        # Start the ambient loop AND speak the greeting in one response
        return jsonify([
            {
                'type': 'mix_audio',
                'session_id': event['session']['id'],
                'audio': AMBIENT_AUDIO,
                'volume': 0.3,
            },
            {
                'type': 'speak',
                'session_id': event['session']['id'],
                'text': 'Welcome, how can I help you?',
            },
        ])

    if event['type'] == 'user_speak' and 'goodbye' in event['text'].lower():
        # Stop the ambient before saying goodbye, then hang up
        return jsonify([
            {
                'type': 'mix_audio',
                'session_id': event['session']['id'],
                'stop': True,
            },
            {
                'type': 'speak',
                'session_id': event['session']['id'],
                'text': 'Goodbye!',
            },
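            {'type': 'hangup', 'session_id': event['session']['id']},
        ])

    # Assumption: an empty action list is an acceptable "no action" reply;
    # check the webhook reference for the expected default response.
    return jsonify([])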

Node.js

javascript
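import express from "express";
import { readFileSync } from "node:fs";

const app = express();
app.use(express.json()); // parse JSON webhook bodies into req.body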

// Load and base64-encode the loop once at startup
const AMBIENT_AUDIO = readFileSync("./cafe.wav").toString("base64");

app.post('/webhook', (req, res) => {
  const event = req.body;

  if (event.type === 'session_start') {
    return res.json([
      {
        type: 'mix_audio',
        session_id: event.session.id,
        audio: AMBIENT_AUDIO,
        volume: 0.3,
      },
      {
        type: 'speak',
        session_id: event.session.id,
        text: 'Welcome, how can I help you?',
      },
    ]);
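  }

  // Assumption: an empty action list is an acceptable "no action" reply;
  // check the webhook reference for the expected default response.
  return res.json([]);
});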

Go

go
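package main

import (
    "encoding/base64"
    "encoding/json"
    "log"
    "net/http"
    "os"
)

func main() {
    // Load and base64-encode the loop once at startup
    audioBytes, err := os.ReadFile("cafe.wav")
    if err != nil {
        log.Fatalf("read cafe.wav: %v", err)
    }
    ambientAudio := base64.StdEncoding.EncodeToString(audioBytes)

    http.HandleFunc("/webhook", func(w http.ResponseWriter, r *http.Request) {
        var event map[string]interface{}
        if err := json.NewDecoder(r.Body).Decode(&event); err != nil {
            http.Error(w, "invalid JSON", http.StatusBadRequest)
            return
        }

        if event["type"] == "session_start" {
            // Checked assertion so a malformed event cannot panic the handler
            session, ok := event["session"].(map[string]interface{})
            if !ok {
                http.Error(w, "missing session", http.StatusBadRequest)
                return
            }
            actions := []map[string]interface{}{
                {
                    "type":       "mix_audio",
                    "session_id": session["id"],
                    "audio":      ambientAudio,
                    "volume":     0.3,
                },
                {
                    "type":       "speak",
                    "session_id": session["id"],
                    "text":       "Welcome, how can I help you?",
                },
            }
            w.Header().Set("Content-Type", "application/json")
            json.NewEncoder(w).Encode(actions)
        }
    })

    // The port is an example value
    log.Fatal(http.ListenAndServe(":8080", nil))
}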

Converting Audio Files

Convert any audio file to the required format with FFmpeg:

bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav

For ambient sound, normalizing loudness across presets keeps the perceived level consistent at a given volume value. A target of -30 LUFS sits well below typical TTS speech (about -16 LUFS), so volume values around 0.2 to 0.5 stay useful:

bash
ffmpeg -i input.mp3 -t 30 -af "loudnorm=I=-30:LRA=11:TP=-2" \
  -ar 16000 -ac 1 -sample_fmt s16 -f wav output.wav
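
When several presets need converting, the same FFmpeg invocation can be driven from a script. The sketch below only assembles the flags shown above; the function names are hypothetical, `-y` (overwrite the output file) is an addition not present in the commands above, and `ffmpeg` is assumed to be on the PATH.

```python
import subprocess
from typing import List, Optional


def ffmpeg_args(src: str, dst: str, seconds: Optional[int] = 30,
                lufs: Optional[float] = -30.0) -> List[str]:
    """Build the FFmpeg argument list for a 16 kHz, 16-bit, mono WAV."""
    args = ["ffmpeg", "-y", "-i", src]
    if seconds is not None:
        args += ["-t", str(seconds)]          # trim to the first N seconds
    if lufs is not None:
        args += ["-af", f"loudnorm=I={lufs:g}:LRA=11:TP=-2"]
    args += ["-ar", "16000", "-ac", "1", "-sample_fmt", "s16", "-f", "wav", dst]
    return args


def convert(src: str, dst: str, **options) -> None:
    """Run FFmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_args(src, dst, **options), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.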

Best Practices

  1. Load once, encode once. Encode each ambient WAV to base64 at startup and reuse the string — don't read+encode per call.
  2. Start the loop with the greeting. Return [mix_audio, speak] together on session_start so the ambient is in place from the first word.
  3. Keep the volume low. Ambient sound should sit under the agent. Start around 0.3 and lower from there.
  4. Trim long files. A 30-second loop is plenty for ambience; longer files just mean larger one-time payloads at session start.
  5. Stop explicitly when ending the call. Sending mix_audio { stop: true } before a farewell is optional (the loop is dropped at session_end anyway), but it makes the goodbye land cleanly without ambient bleed.
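
The first practice extends naturally to more than one preset. As a sketch, assuming the ambient WAV files live together in one directory (the directory name and the `load_presets` helper are hypothetical), each file can be encoded once at startup and looked up by name per session:

```python
import base64
from pathlib import Path
from typing import Dict


def load_presets(directory: str) -> Dict[str, str]:
    """Map each *.wav file stem in `directory` to its base64-encoded contents."""
    return {
        path.stem: base64.b64encode(path.read_bytes()).decode("ascii")
        for path in sorted(Path(directory).glob("*.wav"))
    }


# At startup (hypothetical directory name):
# PRESETS = load_presets("ambience")
# Later, per session: audio = PRESETS["cafe"]
```

Holding the encoded strings in memory trades a few megabytes per preset for no file I/O or encoding work on the request path.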

Mix Audio vs. Audio Action

| Aspect | audio | mix_audio |
| --- | --- | --- |
| Plays | Once, then stops | Loops continuously for the rest of the call |
| Audible during silence | No | Yes |
| Plays under TTS | No | Yes |
| Use case | Hold music, announcements, sound effects | Scene/atmosphere under the agent |
| Restart-safe | No (one-shot) | Yes (loop continues automatically) |

Troubleshooting

Ambient is too loud / drowns out speech

  • Lower the volume (try 0.2).
  • Re-normalize the source file to a quieter target LUFS (e.g. -30 LUFS instead of -23).

Loop pops at the boundary

For material with strong transients, fade the source file in/out by 50 ms in your editor before encoding so the loop point is silent.
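
If an audio editor is not at hand, the fade can also be applied programmatically. The following is a sketch using only the Python standard library, assuming the input is already 16-bit mono PCM; the function name `fade_edges` is illustrative and the fade is a simple linear ramp.

```python
import array
import io
import sys
import wave


def fade_edges(wav_bytes: bytes, fade_ms: int = 50) -> bytes:
    """Apply a linear fade-in and fade-out to a 16-bit mono WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2 or params.nchannels != 1:
            raise ValueError("expected 16-bit mono PCM")
        samples = array.array("h", src.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Number of samples in each fade, capped at half the clip length
    n = min(len(samples) // 2, params.framerate * fade_ms // 1000)
    for i in range(n):
        gain = i / n  # 0.0 at the very edge, approaching 1.0 inward
        samples[i] = int(samples[i] * gain)
        samples[-1 - i] = int(samples[-1 - i] * gain)

    if sys.byteorder == "big":
        samples.byteswap()
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(samples.tobytes())
    return out.getvalue()
```

Because the first and last samples are scaled to zero, the end of the loop and its start meet at silence.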
