Streaming LLM Responses: Sentence-by-Sentence Best Practices
When integrating Large Language Models (LLMs) like OpenAI's GPT, Anthropic's Claude, or similar services with sipgate AI Flow, how you stream responses significantly impacts the naturalness of synthesized speech. This guide shows you how to achieve smooth, natural-sounding voice output by sending complete sentences rather than individual tokens.
The Problem: Token-by-Token Streaming
LLMs stream responses token-by-token (small text fragments). Sending each token directly to the TTS provider creates choppy, unnatural speech:
```typescript
// ❌ BAD: Sends every token immediately
for await (const chunk of llmStream) {
  await sendAction({
    type: 'speak',
    session_id: sessionId,
    text: chunk.content, // Individual tokens: "Hello", ", ", "how", " ", "can", " ", "I"...
    tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
  })
}
```

Why this sounds bad:
- Each TTS call treats the text as a complete utterance with sentence-ending prosody (falling intonation, longer pauses)
- Results in robotic, choppy speech: "Hello↘️ [pause] how↘️ [pause] can↘️ [pause] I↘️ [pause] help↘️ [pause]"
- TTS providers optimize for complete sentences, not fragments
The Solution: Sentence Segmentation
✅ Best Practice: Buffer LLM tokens and send complete sentences to the TTS provider.
Benefits:
- Natural prosody and intonation
- Appropriate pauses between sentences
- Better voice quality from TTS providers
- Maintains low latency (sentences typically complete within 1-2 seconds)
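The buffering idea itself fits in a few lines. The sketch below is a deliberately naive illustration (a hypothetical makeSentenceBuffer helper using a punctuation heuristic, with speak standing in for your TTS call) - the rest of this guide builds the robust version with Intl.Segmenter.

```typescript
// Naive sketch: buffer tokens, emit once a sentence terminator is followed by whitespace.
// `speak` stands in for the TTS call; production code should use Intl.Segmenter instead.
function makeSentenceBuffer(speak: (sentence: string) => void) {
  let buffer = ''
  return {
    push(token: string) {
      buffer += token
      // Everything up to the last '.', '!' or '?' followed by whitespace is "complete"
      const match = buffer.match(/^(.*[.!?])\s+(.*)$/s)
      if (match) {
        speak(match[1].trim())
        buffer = match[2]
      }
    },
    flush() {
      // Stream ended: emit whatever is left
      if (buffer.trim()) speak(buffer.trim())
      buffer = ''
    },
  }
}
```

Pushing the tokens "Hello", ". ", "How are you?" and then flushing produces two speak calls, "Hello." and "How are you?", instead of five choppy fragments.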
Prompting LLMs for Voice Output
Critical: Instruct your LLM to avoid abbreviations that interfere with speech synthesis and sentence detection.
The Problem with Abbreviations
Abbreviations like "Dr.", "bzw.", "z.B.", "etc." cause two issues:
1. Incorrect sentence segmentation - Intl.Segmenter detects periods as sentence boundaries:

```typescript
// "Dr. Smith will help you."
// Incorrectly splits into:
// Sentence 1: "Dr."
// Sentence 2: "Smith will help you."
```

2. Poor TTS pronunciation - Text-to-speech may mispronounce abbreviations:
- "Dr." → "D R" or "Doctor point" instead of "Doctor"
- "bzw." → "B Z W" instead of "beziehungsweise"
- "z.B." → "Z B" instead of "zum Beispiel"
System Prompt Guidelines
Add these instructions to your LLM system prompt:
```typescript
const systemPrompt = `You are a voice assistant. Follow these rules strictly:

VOICE OUTPUT RULES:
- Write out all abbreviations fully (e.g., "Doctor" not "Dr.", "for example" not "e.g.")
- Use complete words instead of shortened forms
- Avoid punctuation-based abbreviations that end with periods
- Use natural, spoken language as if talking to someone in person

Examples:
❌ "Dr. Smith can help you with that."
✅ "Doctor Smith can help you with that."
❌ "You can use method A, B, or C, e.g., the first one."
✅ "You can use method A, B, or C, for example the first one."
❌ "This is available Mon.-Fri."
✅ "This is available Monday through Friday."

Your responses will be converted to speech, so write exactly how you would say it out loud.`
```

Language-Specific Examples
English:
```typescript
const englishVoiceRules = `
- "Dr." → "Doctor"
- "Mr." → "Mister"
- "Mrs." → "Missus"
- "e.g." → "for example"
- "i.e." → "that is"
- "etc." → "and so on" or "etcetera"
- "vs." → "versus"
- "approx." → "approximately"
`
```

German:
```typescript
const germanVoiceRules = `
- "Dr." → "Doktor"
- "bzw." → "beziehungsweise"
- "z.B." → "zum Beispiel"
- "usw." → "und so weiter"
- "ca." → "circa"
- "etc." → "et cetera" or "und so weiter"
- "inkl." → "inklusive"
- "ggf." → "gegebenenfalls"
- "evtl." → "eventuell"
`
```

Complete OpenAI Example
```typescript
const stream = await openai.chat.completions.create({
  model: 'gpt-4',
  messages: [
    {
      role: 'system',
      content: `You are a voice assistant for customer service.

CRITICAL VOICE RULES:
- Never use abbreviations with periods (Dr., e.g., etc.)
- Write everything as you would speak it out loud
- Use complete words: "Doctor" not "Dr.", "for example" not "e.g."
- Your responses will be synthesized to speech

Be helpful, concise, and conversational.`
    },
    {
      role: 'user',
      content: userMessage
    }
  ],
  stream: true,
})
```

Complete Anthropic Example
```typescript
const stream = await anthropic.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  system: `You are a voice assistant. Your responses will be converted to speech.

VOICE OUTPUT REQUIREMENTS:
- Write all abbreviations in full (Doctor, not Dr.)
- Avoid period-based abbreviations (e.g., i.e., etc.)
- Use natural spoken language
- Write numbers as words when it sounds more natural

Examples:
Wrong: "Dr. Schmidt can help you, e.g., with billing."
Right: "Doctor Schmidt can help you, for example with billing."
Wrong: "Available Mon.-Fri., 9 a.m.-5 p.m."
Right: "Available Monday through Friday, 9 AM to 5 PM."`,
  messages: [
    {
      role: 'user',
      content: userMessage
    }
  ],
  stream: true,
})
```

Testing Your Prompt
Verify your LLM follows voice rules by testing with edge cases:
```typescript
const testCases = [
  "Tell me about Dr. Smith",
  "What are the benefits, e.g., cost savings?",
  "This applies to companies like IBM, Microsoft, etc.",
  "Available Mon.-Fri.",
]

// Expected responses should have NO abbreviations with periods
```

Common Mistake
Don't rely on post-processing to fix abbreviations. LLMs are excellent at following voice guidelines when properly instructed. Post-processing is fragile and language-dependent.
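A simple detection helper keeps testing honest without drifting into post-processing - it only flags violations, it never rewrites text. The abbreviation list below is illustrative (drawn from the language examples above); extend it for your languages:

```typescript
// Illustrative spot-check for voice-rule violations in LLM output.
// Detection only - fix the system prompt, don't post-process the text.
const KNOWN_ABBREVIATIONS = [
  'Dr.', 'Mr.', 'Mrs.', 'e.g.', 'i.e.', 'etc.', 'vs.', 'approx.', // English
  'bzw.', 'z.B.', 'usw.', 'ca.', 'inkl.', 'ggf.', 'evtl.',        // German
]

function findAbbreviations(text: string): string[] {
  return KNOWN_ABBREVIATIONS.filter((abbr) => text.includes(abbr))
}
```

Running this over the test-case responses makes violations visible at a glance: a response like "Dr. Smith can help, e.g., with billing." is flagged with ['Dr.', 'e.g.'], while the fully written-out version passes clean.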
Using JavaScript's Built-in Sentence Segmenter
JavaScript provides Intl.Segmenter - a native API for text segmentation, including sentence detection. It's available in Node.js ≥16.
Basic Example
```typescript
// Create a sentence segmenter (do this once, reuse for performance)
const sentenceSegmenter = new Intl.Segmenter('en', { granularity: 'sentence' })

function* extractSentences(text: string): Generator<string> {
  const segments = sentenceSegmenter.segment(text)
  for (const segment of segments) {
    yield segment.segment.trim()
  }
}

// Usage
const text = "Hello, how can I help? I'm here to assist you today."
for (const sentence of extractSentences(text)) {
  console.log(sentence)
  // Output:
  // "Hello, how can I help?"
  // "I'm here to assist you today."
}
```

Streaming with OpenAI
```typescript
import OpenAI from 'openai'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' })

async function streamOpenAIResponse(
  sessionId: string,
  userMessage: string,
  sendAction: (action: any) => Promise<void>
) {
  let buffer = ''
  let lastSentenceEnd = 0

  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  })

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content
    if (!content) continue

    // Add token to buffer
    buffer += content

    // Check for complete sentences
    const segments = Array.from(segmenter.segment(buffer))

    // Find complete sentences (all but possibly the last incomplete one)
    for (let i = lastSentenceEnd; i < segments.length - 1; i++) {
      const sentence = segments[i].segment.trim()
      if (sentence) {
        // Send complete sentence to TTS
        await sendAction({
          type: 'speak',
          session_id: sessionId,
          text: sentence,
          tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
        })
      }
      lastSentenceEnd = i + 1
    }
  }

  // Send any remaining text as final sentence
  const remainingSegments = Array.from(segmenter.segment(buffer))
  for (let i = lastSentenceEnd; i < remainingSegments.length; i++) {
    const sentence = remainingSegments[i].segment.trim()
    if (sentence) {
      await sendAction({
        type: 'speak',
        session_id: sessionId,
        text: sentence,
        tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
      })
    }
  }
}
```

Streaming with Anthropic Claude
```typescript
import Anthropic from '@anthropic-ai/sdk'

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })
const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' })

async function streamClaudeResponse(
  sessionId: string,
  userMessage: string,
  sendAction: (action: any) => Promise<void>
) {
  let buffer = ''
  let lastSentenceEnd = 0

  const stream = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [{ role: 'user', content: userMessage }],
    stream: true,
  })

  for await (const event of stream) {
    if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
      const content = event.delta.text

      // Add token to buffer
      buffer += content

      // Check for complete sentences
      const segments = Array.from(segmenter.segment(buffer))

      // Send complete sentences
      for (let i = lastSentenceEnd; i < segments.length - 1; i++) {
        const sentence = segments[i].segment.trim()
        if (sentence) {
          await sendAction({
            type: 'speak',
            session_id: sessionId,
            text: sentence,
            tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
          })
        }
        lastSentenceEnd = i + 1
      }
    }
  }

  // Send remaining text
  const remainingSegments = Array.from(segmenter.segment(buffer))
  for (let i = lastSentenceEnd; i < remainingSegments.length; i++) {
    const sentence = remainingSegments[i].segment.trim()
    if (sentence) {
      await sendAction({
        type: 'speak',
        session_id: sessionId,
        text: sentence,
        tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
      })
    }
  }
}
```

Reusable Helper Class
For production use, extract this logic into a reusable helper:
```typescript
export class SentenceStreamBuffer {
  private buffer = ''
  private lastSentenceEnd = 0
  private segmenter: Intl.Segmenter

  constructor(locale: string = 'en') {
    this.segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' })
  }

  /**
   * Add a token/chunk to the buffer and return any complete sentences.
   * @returns Array of complete sentences ready to be sent to TTS
   */
  push(chunk: string): string[] {
    this.buffer += chunk
    const segments = Array.from(this.segmenter.segment(this.buffer))
    const completeSentences: string[] = []

    // Extract complete sentences (all but possibly the last incomplete one)
    for (let i = this.lastSentenceEnd; i < segments.length - 1; i++) {
      const sentence = segments[i].segment.trim()
      if (sentence) {
        completeSentences.push(sentence)
      }
      this.lastSentenceEnd = i + 1
    }

    return completeSentences
  }

  /**
   * Flush remaining buffer as final sentence(s).
   * Call this when the stream ends.
   */
  flush(): string[] {
    const segments = Array.from(this.segmenter.segment(this.buffer))
    const remainingSentences: string[] = []

    for (let i = this.lastSentenceEnd; i < segments.length; i++) {
      const sentence = segments[i].segment.trim()
      if (sentence) {
        remainingSentences.push(sentence)
      }
    }

    // Reset state
    this.buffer = ''
    this.lastSentenceEnd = 0

    return remainingSentences
  }

  /**
   * Reset the buffer (useful for error handling or conversation resets)
   */
  reset(): void {
    this.buffer = ''
    this.lastSentenceEnd = 0
  }
}
```

Using the Helper
```typescript
async function streamLLMToVoice(
  sessionId: string,
  llmStream: AsyncIterable<string>,
  sendAction: (action: any) => Promise<void>,
  locale: string = 'en'
) {
  const buffer = new SentenceStreamBuffer(locale)

  try {
    // Process streaming tokens
    for await (const token of llmStream) {
      const sentences = buffer.push(token)

      // Send each complete sentence to TTS
      for (const sentence of sentences) {
        await sendAction({
          type: 'speak',
          session_id: sessionId,
          text: sentence,
          tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
        })
      }
    }

    // Send any remaining text when stream completes
    const finalSentences = buffer.flush()
    for (const sentence of finalSentences) {
      await sendAction({
        type: 'speak',
        session_id: sessionId,
        text: sentence,
        tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
      })
    }
  } catch (error) {
    buffer.reset() // Clean up on error
    throw error
  }
}
```

Multi-Language Support
Intl.Segmenter supports multiple languages out of the box:
```typescript
// German
const germanSegmenter = new Intl.Segmenter('de', { granularity: 'sentence' })

// Spanish
const spanishSegmenter = new Intl.Segmenter('es', { granularity: 'sentence' })

// French
const frenchSegmenter = new Intl.Segmenter('fr', { granularity: 'sentence' })

// Reusable buffer with language detection
function createStreamBuffer(languageCode: string): SentenceStreamBuffer {
  return new SentenceStreamBuffer(languageCode)
}
```

Handling Edge Cases
Short Responses
For very short responses (single sentence or fragment), the buffer approach still works:
```typescript
// LLM response: "Hello!"
buffer.push("Hello!") // Returns: []
buffer.flush()        // Returns: ["Hello!"]
```

Incomplete Sentences During Interruption
If the user interrupts (barge-in), you may have incomplete sentences in the buffer:
```typescript
// Handle barge-in event
function handleBargeIn(sessionId: string) {
  const buffer = sessionBuffers.get(sessionId)
  if (buffer) {
    // Option 1: Discard incomplete sentence
    buffer.reset()

    // Option 2: Send incomplete sentence as-is (for context)
    const remaining = buffer.flush()
    // Log or store for context but don't send to TTS
  }
}
```

Very Long Sentences
Sometimes LLMs generate very long sentences. Consider adding a character limit:
```typescript
class SentenceStreamBuffer {
  private maxSentenceLength = 500 // characters

  push(chunk: string): string[] {
    this.buffer += chunk

    // Force break on very long buffers
    if (this.buffer.length > this.maxSentenceLength && this.buffer.includes(' ')) {
      const lastSpace = this.buffer.lastIndexOf(' ', this.maxSentenceLength)
      const forcedSentence = this.buffer.substring(0, lastSpace).trim()
      this.buffer = this.buffer.substring(lastSpace).trim()
      this.lastSentenceEnd = 0
      return [forcedSentence]
    }

    // Normal sentence detection...
    const segments = Array.from(this.segmenter.segment(this.buffer))
    const completeSentences: string[] = []
    for (let i = this.lastSentenceEnd; i < segments.length - 1; i++) {
      const sentence = segments[i].segment.trim()
      if (sentence) {
        completeSentences.push(sentence)
      }
      this.lastSentenceEnd = i + 1
    }
    return completeSentences
  }

  // ... rest of class
}
```

Performance Considerations
Buffer Management
For production deployments with many concurrent sessions:
```typescript
// Store buffers per session
const sessionBuffers = new Map<string, SentenceStreamBuffer>()

function getOrCreateBuffer(sessionId: string, locale: string): SentenceStreamBuffer {
  if (!sessionBuffers.has(sessionId)) {
    sessionBuffers.set(sessionId, new SentenceStreamBuffer(locale))
  }
  return sessionBuffers.get(sessionId)!
}

// Clean up on session end
function handleSessionEnd(sessionId: string) {
  sessionBuffers.delete(sessionId)
}
```

Timeout Protection
Add a timeout so text is never buffered indefinitely:
```typescript
class SentenceStreamBufferWithTimeout extends SentenceStreamBuffer {
  private lastPushTime = Date.now()
  private timeout = 5000 // 5 seconds

  push(chunk: string): string[] {
    this.lastPushTime = Date.now()
    return super.push(chunk)
  }

  hasTimedOut(): boolean {
    return Date.now() - this.lastPushTime > this.timeout
  }

  flushIfTimedOut(): string[] {
    if (this.hasTimedOut()) {
      return this.flush()
    }
    return []
  }
}
```

Complete Example: Express.js Integration
```typescript
import express from 'express'
import OpenAI from 'openai'

const app = express()
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const sessionBuffers = new Map<string, SentenceStreamBuffer>()

app.use(express.json())

app.post('/webhook', async (req, res) => {
  const event = req.body

  switch (event.type) {
    case 'user_speak':
      // Don't await - respond immediately to avoid timeout
      handleUserSpeak(event).catch(console.error)
      return res.status(204).send()

    case 'session_end':
      sessionBuffers.delete(event.session.id)
      return res.status(204).send()

    default:
      return res.status(204).send()
  }
})

async function handleUserSpeak(event: any) {
  const sessionId = event.session.id
  const userText = event.text

  // Get or create buffer for this session
  const buffer = sessionBuffers.get(sessionId) || new SentenceStreamBuffer('en')
  sessionBuffers.set(sessionId, buffer)

  // Stream LLM response
  const stream = await openai.chat.completions.create({
    model: 'gpt-4',
    messages: [{ role: 'user', content: userText }],
    stream: true,
  })

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content
    if (!content) continue

    const sentences = buffer.push(content)
    for (const sentence of sentences) {
      await sendAction({
        type: 'speak',
        session_id: sessionId,
        text: sentence,
        tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
      })
    }
  }

  // Flush remaining text
  const finalSentences = buffer.flush()
  for (const sentence of finalSentences) {
    await sendAction({
      type: 'speak',
      session_id: sessionId,
      text: sentence,
      tts: { provider: 'azure', language: 'en-US', voice: 'en-US-JennyNeural' },
    })
  }
}

async function sendAction(action: any) {
  // Send action via WebSocket or HTTP to sipgate AI Flow
  // Implementation depends on your integration method
  await fetch('https://your-aiflow-endpoint/actions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(action),
  })
}

app.listen(3000, () => console.log('Server running on port 3000'))
```

Fallback for Older Node.js Versions
If you're using Node.js <16, you can use a simple regex-based fallback:
```typescript
// Simple fallback for environments without Intl.Segmenter
function splitSentencesSimple(text: string): string[] {
  // Basic sentence splitting (not as robust as Intl.Segmenter)
  // Matches sentence endings followed by whitespace
  return text
    .split(/(?<=[.!?])\s+/)
    .map(s => s.trim())
    .filter(s => s.length > 0)
}

// Use in SentenceStreamBuffer as fallback
class SentenceStreamBufferLegacy {
  private buffer = ''

  push(chunk: string): string[] {
    this.buffer += chunk
    const sentences = splitSentencesSimple(this.buffer)
    if (sentences.length > 1) {
      // Keep last sentence in buffer (might be incomplete)
      const complete = sentences.slice(0, -1)
      this.buffer = sentences[sentences.length - 1]
      return complete
    }
    return []
  }

  flush(): string[] {
    const sentence = this.buffer.trim()
    this.buffer = ''
    return sentence ? [sentence] : []
  }
}
```

Regex Limitations
The regex fallback is less robust than Intl.Segmenter and may incorrectly split on abbreviations (Dr., e.g., etc.). If using the fallback, it's even more critical to follow the LLM prompting guidelines to avoid abbreviations.
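For example, the fallback splits on the period in "Dr." (splitSentencesSimple repeated here so the snippet stands alone):

```typescript
// Same regex fallback as above, repeated for a self-contained demo
function splitSentencesSimple(text: string): string[] {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
}

const result = splitSentencesSimple('Dr. Smith is here. He can help.')
// The period in "Dr." is treated as a sentence ending:
// ["Dr.", "Smith is here.", "He can help."]
```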
Best Practices Summary
- Prompt LLMs to avoid abbreviations - Instruct your LLM to write out "Doctor" not "Dr.", "for example" not "e.g." to prevent incorrect segmentation and poor pronunciation
- Always segment sentences - Never send individual tokens to TTS, always buffer and send complete sentences
- Use Intl.Segmenter - Native, robust, multi-language support (Node.js ≥16)
- Buffer per session - Keep separate buffers for concurrent conversations
- Clean up on session end - Delete buffers to prevent memory leaks
- Handle timeouts - Flush buffer if no new tokens arrive within 5 seconds
- Support multiple languages - Pass correct locale to Intl.Segmenter
- Handle barge-in - Reset or discard incomplete sentences on interruption
- Limit sentence length - Force breaks for very long sentences (500+ characters)
Token Accumulation Speed
In practice, sentences complete quickly (typically 1-2 seconds with modern LLMs). Users won't notice the buffering delay, but they will notice the dramatic improvement in speech quality.
Related Documentation
- Speak Action - Complete reference for the speak action
- TTS Providers - Azure and ElevenLabs configuration
- Barge-In Best Practices - Handling interruptions during speech
- Async Hold Pattern - Managing long-running LLM requests