# Getting call ended without transcript, even though using the right base64 encoding and audio format

tawdry radishBOT
#
AI-generated Answer for andy_94855

Here are the most common, doc-backed causes of “call ended” with no transcript when audio chunks appear in history, plus how to verify each one:

  • You’re mixing up APIs. If you’re using the Agents WebSocket, you must send user audio as {"user_audio_chunk": "<base64>"} to wss://api.elevenlabs.io/v1/convai/conversation and you should receive type: "user_transcript" events. If you instead use the realtime STT WebSocket, you must send {"message_type":"input_audio_chunk","audio_base_64":"…","sample_rate":…,"commit":…} to wss://api.elevenlabs.io/v1/speech-to-text/realtime and you’ll receive partial/committed transcript events. Using the STT schema on the Agents socket (or vice‑versa) results in no transcript. (elevenlabs.io)

  • Input audio format mismatch. The agent expects a specific ASR input format (conversation_config.asr.user_input_audio_format, e.g., pcm_16000 or ulaw_8000). If you send a different encoding/rate, audio may show up in history but no user_transcript is produced. Fetch your agent and confirm the value, or set it appropriately in Agent settings. Telephony integrations (e.g., Genesys) must use μ-law 8000 Hz for both input and output. (elevenlabs.io)

  • No commit in STT realtime. If you’re on the STT realtime socket and only send input_audio_chunk without committing (or without VAD enabled), you’ll see audio but never receive committed transcripts. Either send commit:true with the final chunk or set commit_strategy=vad and ensure sufficient silence. Also include sample_rate in each chunk. (elevenlabs.io)

  • Agent ended the call early. The default End Call tool can terminate the session; you’ll see the tool call in history. If it’s being invoked prematurely, adjust your prompt/tool conditions or remove the tool from the system tools. Also check turn settings like silence_end_call_timeout, which can end calls on brief silence. (elevenlabs.io)

  • Configuration/authorization issues that prevent transcripts from being generated. Check Call History for a red error explaining early termination (e.g., missing required dynamic variables, allowlist blocks, daily call limit). These commonly result in no transcript. (help.elevenlabs.io)

  • Privacy/retention settings. Audio saving can be disabled while transcripts remain visible; if transcripts are being purged by retention, adjust retention settings. Verify your privacy configuration if history items look inconsistent. (elevenlabs.io)
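
To make the format-mismatch check concrete, here is a minimal Python sketch that splits a format label such as pcm_16000 into codec and sample rate and compares it with the recorder settings. The helper names are hypothetical, and the "codec_rate" label convention is only inferred from the examples quoted in this thread (pcm_16000, ulaw_8000).

```python
def parse_audio_format(label: str) -> tuple[str, int]:
    """Split a label such as 'pcm_16000' into (codec, sample_rate_hz)."""
    codec, _, rate = label.rpartition("_")
    if not codec or not rate.isdigit():
        raise ValueError(f"unrecognized audio format label: {label!r}")
    return codec, int(rate)


def matches_capture(label: str, capture_codec: str, capture_rate_hz: int) -> bool:
    """True when the recorder settings match the format the agent reports."""
    return parse_audio_format(label) == (capture_codec, capture_rate_hz)


if __name__ == "__main__":
    # Hypothetical case: the agent reports ulaw_8000 but the app records 16 kHz PCM.
    print(matches_capture("ulaw_8000", "pcm", 16000))
```

Running the comparison once, right after the session starts, catches a mismatch before any audio is streamed.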

What I’d check next (minimal repro):

  1. On connect, log the conversation_initiation_metadata event and confirm user_input_audio_format; stream audio exactly in that format. (elevenlabs.io)
  2. If using Agents WS, send only {"user_audio_chunk": "<base64 of raw PCM/μ-law>"} — no extra fields — and confirm you receive user_transcript events. (elevenlabs.io)
  3. If using STT WS, add commit:true on the final chunk (or set commit_strategy=vad) and include sample_rate on each chunk. (elevenlabs.io)
  4. Review Call History for a
proper berry
#

I am implementing Conversational AI in an Android mobile app. The main issue is that when I send audio, it reaches ElevenLabs but is not transcribed, so I never get a response.

tawdry radishBOT
#
AI-generated Answer for andy_94855

Thanks for the detail. When audio “reaches” the conversation but you never get user_transcript events, it’s almost always one of these issues:

  1. Audio format mismatch
  • The Agents WebSocket returns conversation_initiation_metadata with user_input_audio_format (default is pcm_16000). Your user_audio_chunk payloads must be raw 16‑bit PCM, mono, little‑endian at that sample rate, base64‑encoded, and sent as:
    { "user_audio_chunk": "<base64 of raw PCM>" }
    If you send WAV/MP3/Opus containers or the wrong rate/endianness, the ASR won’t trigger and no transcript is produced. Verify the format you receive in conversation_initiation_metadata and match it exactly. (elevenlabs.io)
  • Supported realtime input formats (for PCM/μ‑law) are documented here; pcm_16000 is recommended. (elevenlabs.io)
  2. Event order and keep‑alive
  • After opening wss://api.elevenlabs.io/v1/convai/conversation?agent_id=..., first send conversation_initiation_client_data (optional overrides), then wait for conversation_initiation_metadata before streaming audio. (elevenlabs.io)
  • You must reply to ping with a pong carrying the same event_id. If you don’t, the server can close the socket before any transcription completes. (elevenlabs.io)
  3. Not enough audio to start transcription
  • Realtime STT begins processing after the first ~2 seconds of audio are sent. If you send only very short bursts or stop immediately, you may see no transcript. Stream a continuous couple of seconds before pausing. (elevenlabs.io)
  4. Agent configured for text‑only or missing required variables
  • If “Text‑only mode” was enabled (via agent settings or an override), audio will be ignored. Disable text‑only mode or remove the override. (elevenlabs.io)
  • If your agent requires Dynamic Variables and they aren’t supplied, calls can end with no transcript; check Call History for an explicit red error message. (help.elevenlabs.io)
  5. Quick Android path (avoids manual encoding)
  • Use the official Kotlin/Android SDK, which opens a WebRTC session and handles mic capture, encoding, and transport for you via startSession. This removes format/ping/pacing pitfalls. (elevenlabs.io)
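
Since a WAV container is one of the failure cases listed above, the following Python sketch shows one way to strip the header and send only the raw samples. It uses the standard-library wave module; the function name is hypothetical, and the 16 kHz default is an assumption that should be replaced by whatever conversation_initiation_metadata reports.

```python
import base64
import io
import json
import wave


def wav_to_user_audio_chunk(wav_bytes: bytes, expected_rate: int = 16000) -> str:
    """Return a user_audio_chunk JSON message holding only the raw PCM frames."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        if wav.getframerate() != expected_rate:
            raise ValueError(f"expected {expected_rate} Hz, got {wav.getframerate()} Hz")
        # readframes() returns the sample data only; the RIFF/WAVE header is skipped.
        pcm = wav.readframes(wav.getnframes())
    return json.dumps({"user_audio_chunk": base64.b64encode(pcm).decode("ascii")})
```

WAV files store PCM samples in little-endian order, so the extracted bytes are already S16LE and need no byte swapping.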

What to check right now

  • Log every inbound event: do you see conversation_initiation_metadata? vad_score? user_transcript? If you never see vad_score, the server likely isn’t detecting speech (usually format). (elevenlabs.io)
  • Confirm your user_audio_chunk is raw PCM S16LE, mono, 16 kHz, base64 (no WAV headers). Match the rate reported in conversation_initiation_metadata. (elevenlabs.io)
  • Ensure you respond to ping with pong. (elevenlabs.io)
  • In the dashboard Call History, check the failure reason (e.g., missing dynamic variables). (help.elevenlabs.io)
  • As a control, send a text user_message to the same socket; if you
proper berry
#

Text messages do get an audio response.

tawdry radishBOT
#
AI-generated Answer for andy_94855

Got it—your agent responds to text, so TTS is fine; the missing transcript points to how audio is being sent/encoded on the ConvAI WebSocket.

Do these checks (each item links to the relevant doc):

  1. Send the correct frame
  • The ConvAI WebSocket expects raw audio frames as JSON objects like: {"user_audio_chunk":"<base64>"} (no extra wrapper/type). You should then receive user_transcript events. (elevenlabs.io)
  2. Match the input codec/rate the agent expects
  • When the socket opens, read conversation_initiation_metadata.user_input_audio_format (e.g., "pcm_16000"). Encode your chunks to exactly that format. If you need to change it, set conversation_config.asr.user_input_audio_format on the agent (e.g., pcm_8000, ulaw_8000) and republish. (elevenlabs.io)
  3. Encode audio correctly
  • Use mono, 16‑bit little‑endian PCM at the agreed sample rate (8–48 kHz) or μ‑law 8 kHz; send raw frames (no WAV/OGG headers) base64‑encoded. (elevenlabs.io)
  4. Chunking
  • Stream small chunks regularly (about 0.1–1.0 s per message). Very large or very tiny frames can hurt recognition and latency. (elevenlabs.io)
  5. Keep the socket alive
  • Reply to server ping with {"type":"pong","event_id":...} or the connection can close mid‑utterance. (elevenlabs.io)
  6. Verify from the API
  • After a call, GET /v1/convai/conversations/:conversation_id and check has_user_audio and the transcript array. If has_user_audio=false, your frames weren’t accepted/decoded. (elevenlabs.io)
  7. Format for telephony (if applicable)
  • If your capture path is μ‑law 8 kHz (common on phone audio), configure the agent’s user_input_audio_format to ulaw_8000 (and TTS to μ‑law 8 kHz) to match. (elevenlabs.io)
  8. Isolate the encoder vs. ConvAI
  • As a sanity check, send the same audio to the Realtime Speech‑to‑Text (Scribe) WebSocket. If it transcribes there, your encoding is good and the issue is ConvAI config; if not, fix the encoding. (elevenlabs.io)
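
The chunking advice above can be sketched in Python as follows. This is an illustration under assumptions, not the official client: the function name and the 250 ms default are made up, and the 0.1–1.0 s bounds are simply the range quoted in this answer.

```python
import base64
import json

BYTES_PER_SAMPLE = 2  # 16-bit mono PCM


def iter_user_audio_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 250):
    """Yield user_audio_chunk JSON messages, each covering chunk_ms of audio."""
    if not 100 <= chunk_ms <= 1000:
        raise ValueError("chunk_ms should stay within 100-1000 ms")
    if len(pcm) % BYTES_PER_SAMPLE:
        raise ValueError("byte length must be a multiple of 2 for 16-bit samples")
    chunk_bytes = sample_rate * chunk_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), chunk_bytes):
        piece = pcm[start:start + chunk_bytes]
        yield json.dumps({"user_audio_chunk": base64.b64encode(piece).decode("ascii")})
```

When replaying a recorded file rather than a live microphone, it is probably safest to pause roughly chunk_ms between sends so the audio arrives at about real-time pace; that pacing is an assumption, not something the thread confirms.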

If you share which user_input_audio_format your session reports and a sample frame (base64 length and sample rate), I can point to

proper berry
#

I use user_audio_chunk for the JSON schema, with these settings:

Sample Rate: 16000 Hz (exactly)

Channels: 1 (mono only)

Bit Depth: 16-bit signed integers

Encoding: Base64 string

Format: PCM (Pulse Code Modulation)

tawdry radishBOT
#
AI-generated Answer for andy_94855

Short answer: your settings are almost right, but make sure the PCM is little‑endian and that you stream enough audio (or explicitly commit). Here’s the checklist that matches the official requirements:

  • Encoding and format

    • Use raw 16‑bit PCM little‑endian at 16 kHz. ElevenLabs’ realtime STT lists pcm_16000 as “16‑bit PCM, little‑endian (recommended).” (elevenlabs.io)
    • When using the Conversational AI WebSocket, send chunks as {"user_audio_chunk": "<base64>"} to wss://api.elevenlabs.io/v1/convai/conversation. You’ll receive conversation_initiation_metadata that echoes user_input_audio_format (e.g., "pcm_16000"). Verify this matches what you’re sending. (elevenlabs.io)
  • Why you might see “no transcript”

    • Transcripts begin only after enough audio arrives. In our realtime STT, “Transcript processing starts after the first 2 seconds of audio are sent,” and transcripts are finalized either by manual commit or VAD. If your app sends very short chunks and never commits, you may not see user_transcript events. (elevenlabs.io)
    • If the audio format doesn’t match expectations (e.g., not little‑endian or wrong sample rate), the transcriber won’t produce text. You can quickly validate your exact base64 audio by sending it to the Scribe v2 Realtime STT WebSocket ("input_audio_chunk" with sample_rate=16000). If it transcribes there, the encoding is correct. (elevenlabs.io)
  • Android integration tips

    • If you’d rather avoid manual chunking, the Android (Kotlin) Agents SDK streams mic audio for you via a managed session (WebRTC) and surfaces user/agent messages and VAD scores; this removes format mismatches as a source of error. (elevenlabs.io)
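
To check whether a buffer clears the 2-second mark quoted above, the duration of raw PCM can be computed from its byte length. A small Python sketch follows; the function name is hypothetical.

```python
def pcm_duration_seconds(num_bytes: int, sample_rate: int = 16000,
                         channels: int = 1, bytes_per_sample: int = 2) -> float:
    """Duration of a raw PCM buffer: bytes / (rate * channels * sample width)."""
    return num_bytes / (sample_rate * channels * bytes_per_sample)


if __name__ == "__main__":
    # 64,000 bytes of 16 kHz mono 16-bit audio is exactly two seconds.
    print(pcm_duration_seconds(64000))
```

At 16 kHz mono 16-bit, one second is 32,000 bytes, so anything under 64,000 bytes in total is below the 2-second threshold mentioned for realtime STT.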

References

  • Agent WebSockets (message types, user_audio_chunk, metadata with user_input_audio_format). (elevenlabs.io)
  • Realtime Speech‑to‑Text streaming (supported formats, commit/VAD behavior, 2‑second start). (elevenlabs.io)
  • Scribe v2 Realtime WebSocket (input_audio_chunk schema to test your audio). (elevenlabs.io)
  • Kotlin (Android) Agents SDK. (elevenlabs.io)

Sources:
Realtime Speech to Text | ElevenLabs Documentation
Agent WebSockets | ElevenLabs Documentation
Realtime | ElevenLabs Documentation
Kotlin SDK | ElevenLabs Documentation

proper berry
#

What is little-endian?

tawdry radishBOT
#
AI-generated Answer for andy_94855

ElevenLabs doesn’t define “little‑endian” conceptually, but in our docs it’s the required byte order for raw PCM inputs. If you send raw PCM, use:

  • file_format=pcm_s16le_16: 16‑bit PCM, 16 kHz, mono, little‑endian byte order. (elevenlabs.io)
  • For realtime streaming, audio_format like pcm_16000: “16‑bit PCM, little‑endian.” (elevenlabs.io)

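In short: when using raw PCM with our APIs, ensure the data is little‑endian as specified above. Our docs do not further define the term; in general computing usage, little‑endian means the least significant byte of each multi‑byte value comes first, so the 16‑bit sample 0x1234 is sent as the byte 0x34 followed by 0x12. (elevenlabs.io)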
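
As an illustration of byte order, this Python sketch packs the same 16-bit samples both ways with the standard struct module. The helper name is hypothetical.

```python
import struct


def samples_to_pcm_s16le(samples: list[int]) -> bytes:
    """Pack signed 16-bit samples as little-endian bytes (low byte first)."""
    return struct.pack(f"<{len(samples)}h", *samples)


if __name__ == "__main__":
    print(samples_to_pcm_s16le([0x1234]).hex())  # low byte 0x34 comes first
    print(struct.pack(">h", 0x1234).hex())       # big-endian, for comparison
```

On Android, note that java.nio.ByteBuffer defaults to big-endian, so samples written with putShort need order(ByteOrder.LITTLE_ENDIAN) set first. Whether that applies here depends on how the app builds its byte array.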

Sources:
Create transcript | ElevenLabs Documentation
Realtime Speech to Text | ElevenLabs Documentation

proper berry
#

What is VAD?

tawdry radishBOT
#
AI-generated Answer for andy_94855

VAD stands for Voice Activity Detection. In ElevenLabs Realtime Speech-to-Text, VAD is a commit strategy that automatically detects speech vs. silence and, when a silence threshold is reached, commits the current transcript segment for you. (elevenlabs.io)

How to use and tune it (query parameters when connecting):

  • commit_strategy: "vad" to enable VAD (default is "manual").
  • vad_silence_threshold_secs: seconds of silence before auto‑commit (default 1.5; range 0.3–3.0).
  • vad_threshold: VAD sensitivity (default 0.4; range 0.1–0.9; lower = more sensitive).
  • min_speech_duration_ms: minimum speech duration to count as speech (default 100; 50–2000 ms).
  • min_silence_duration_ms: minimum silence duration to count as silence (default 100; 50–2000 ms).
    These parameters apply only when commit_strategy is set to "vad". (elevenlabs.io)
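
The query parameters above could be assembled like this in Python. It is a sketch: the function name is hypothetical, the endpoint and model_id are the ones quoted elsewhere in this thread, and the range checks only mirror the limits listed in this answer.

```python
from urllib.parse import urlencode

STT_REALTIME_URL = "wss://api.elevenlabs.io/v1/speech-to-text/realtime"


def build_stt_vad_url(model_id: str = "scribe_v2_realtime",
                      audio_format: str = "pcm_16000",
                      vad_silence_threshold_secs: float = 1.5,
                      vad_threshold: float = 0.4) -> str:
    """Build a realtime STT URL with VAD-based commits enabled."""
    if not 0.3 <= vad_silence_threshold_secs <= 3.0:
        raise ValueError("vad_silence_threshold_secs must be within 0.3-3.0")
    if not 0.1 <= vad_threshold <= 0.9:
        raise ValueError("vad_threshold must be within 0.1-0.9")
    params = {
        "model_id": model_id,
        "audio_format": audio_format,
        "commit_strategy": "vad",
        "vad_silence_threshold_secs": vad_silence_threshold_secs,
        "vad_threshold": vad_threshold,
    }
    return f"{STT_REALTIME_URL}?{urlencode(params)}"
```

Authentication (the xi-api-key header or a token query parameter) is deliberately left out of the sketch.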

Sources:
Realtime Speech to Text | ElevenLabs Documentation

proper berry
#

In my conversation_initiation_metadata event I do not receive anything like VAD.

tawdry radishBOT
#
AI-generated Answer for andy_94855

Short answer: that’s expected.

  • conversation_initiation_metadata only carries identifiers and audio format info (e.g., conversation_id, agent_output_audio_format, user_input_audio_format). It does not include VAD fields. (elevenlabs.io)
  • VAD is surfaced as its own stream of events. Listen for type:"vad_score" messages while you are sending user_audio_chunk frames; that’s when you’ll see VAD scores. (elevenlabs.io)

If you’re not seeing any vad_score events:

  • Make sure you actually start streaming audio with user_audio_chunk after the initiation; VAD scores are emitted only once audio arrives. (elevenlabs.io)
  • Confirm the audio you send matches the user_input_audio_format returned in the initiation (e.g., pcm_16000 = 16‑bit PCM, little‑endian, mono, base64). (elevenlabs.io)
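
One way to apply these checks is to log every inbound frame and see which event types ever arrive. The Python sketch below does that. The type names come from this thread, while the function name and the diagnostic wording are only heuristics of this example.

```python
import json


def diagnose(raw_events: list[str]) -> str:
    """Summarize which expected Agent WebSocket event types were seen."""
    seen = {json.loads(raw).get("type") for raw in raw_events}
    if "conversation_initiation_metadata" not in seen:
        return "no initiation metadata: the session did not start"
    if "vad_score" not in seen:
        return "no vad_score: audio is probably not arriving or not decodable"
    if "user_transcript" not in seen:
        return "vad_score seen but no transcript: check audio length and turn end"
    return "ok"
```

Feeding it the raw text frames collected during one test call gives a quick first guess at where the pipeline stops.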

Note: If you’re using the standalone Realtime Speech‑to‑Text API (not Agents), VAD is a commit strategy you enable via query params (commit_strategy=vad with tunables like vad_silence_threshold_secs and vad_threshold). That configuration is not part of the Agents conversation_initiation_metadata. (elevenlabs.io)

Sources:
Agent WebSockets | ElevenLabs Documentation
Realtime | ElevenLabs Documentation

proper berry
#

Okay, the format is little-endian, but when I sent the data to STT I got a timeout.

tawdry radishBOT
#
AI-generated Answer for andy_94855

A timeout on STT usually means the realtime STT WebSocket never reached a valid session or never received enough valid audio to start transcribing. Please check these items (all are required by the ElevenLabs STT docs):

  • Use the STT realtime WebSocket endpoint and include a model_id. Example: wss://api.elevenlabs.io/v1/speech-to-text/realtime?model_id=scribe_v2_realtime. If model_id is missing, the handshake won’t succeed. (elevenlabs.io)
  • Send the correct message type for STT. The server expects message_type: "input_audio_chunk" with fields audio_base_64 and sample_rate. Using "user_audio_chunk" (which belongs to the Conversational AI conversation socket) will not be accepted by the STT socket. (elevenlabs.io)
  • Audio format and rate must match your query params. Default audio_format is pcm_16000 (little‑endian 16‑bit), and you must include sample_rate (e.g., 16000) in every chunk. (elevenlabs.io)
  • Send enough audio before expecting output. Transcript processing starts after the first ~2 seconds of audio are sent; if you send less and then pause, you may see no partials and hit client‑side timeouts. (elevenlabs.io)
  • Chunking: send 0.1–1.0 s chunks continuously; mono only is supported. (elevenlabs.io)
  • If you prefer automatic segmentation, set commit_strategy=vad and tune vad_* params; otherwise, with manual commit you can call commit yourself (auto‑commit occurs every 90 seconds if you don’t). Neither mode prevents partial transcripts. (elevenlabs.io)
  • If no text arrives, log inbound events and check for scribeAuthError, scribeQuotaExceededError, or input_error (invalid audio/fields) before the socket closes. The cookbook’s troubleshooting list also calls out: verify audio format, base64 correctness, presence of sample_rate, and auth/quota. (elevenlabs.io)
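
For the manual-commit path, the message sequence described above might be built as in this Python sketch. The field names are the ones quoted in this thread; the function name is hypothetical, and committing only on the last chunk is just one of the options mentioned.

```python
import base64
import json


def stt_messages(chunks: list[bytes], sample_rate: int = 16000) -> list[str]:
    """Build input_audio_chunk messages, setting commit on the final chunk only."""
    messages = []
    for index, chunk in enumerate(chunks):
        messages.append(json.dumps({
            "message_type": "input_audio_chunk",
            "audio_base_64": base64.b64encode(chunk).decode("ascii"),
            "sample_rate": sample_rate,  # included on every chunk
            "commit": index == len(chunks) - 1,
        }))
    return messages
```

These messages belong on the STT realtime socket only; the Agents socket uses the user_audio_chunk shape instead.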

Given your note about sending “user audio chunk”: if you’re connecting to the STT realtime endpoint, switch to input_audio_chunk as above. If you are connecting to the Conversational AI conversation endpoint instead, then user_audio_chunk is correct for that socket, but it won’t work on the STT realtime socket. (elevenlabs.io)

If you try the above and still hit timeouts, please share the exact STT WebSocket URL (without secrets)

proper berry
#

Can you give me a raw message format that I can test in Postman?

tawdry radishBOT
#
AI-generated Answer for andy_94855

Yes — you can exercise the Conversational AI WebSocket from Postman. Here’s a minimal, copy/paste “raw” setup that works with Postman’s WebSocket client.

  1. Get a signed URL (only required if your agent is private)
  2. Open the WebSocket in Postman
  • If public agent: wss://api.elevenlabs.io/v1/convai/conversation?agent_id=YOUR_AGENT_ID
  • If private agent: use the signed_url you retrieved in step 1.
  • Connect, then you’ll start receiving events (e.g., conversation_initiation_metadata). (elevenlabs.io)
  3. (Optional) Send conversation initiation data
    Paste this JSON into Postman’s “Raw message” and Send:
    {
      "type": "conversation_initiation_client_data",
      "conversation_config_override": {
        "agent": {
          "language": "en"
        }
      }
    }
    You should receive conversation_initiation_metadata after connect; it reports the agent_output_audio_format and user_input_audio_format. (elevenlabs.io)
  4. Send an audio chunk (PCM16 LE, 16 kHz, mono, base64)
  • Prepare raw 16‑bit PCM little‑endian mono at 16,000 Hz; base64‑encode the bytes.
  • In Postman, send:
    {
      "user_audio_chunk": "BASE64_OF_PCM16LE_MONO_16KHZ"
    }
    The agent will emit user_transcript events when audio is processed. Note: ElevenLabs realtime STT supports multiple PCM rates; 16 kHz little‑endian mono is recommended. (elevenlabs.io)
  5. Respond to ping (to keep the socket healthy)
    If you see:
    { "type": "ping", "ping_event": { "event_id": 12345 } }
    reply with:
    { "type": "pong", "event_id": 12345 }
    This is supported by the Agent WebSockets protocol. (elevenlabs.io)
  6. What you should see back
  • user_transcript with the recognized text
  • audio events containing base64 TTS output if your agent speaks
  • vad_score events (sent separately; they are not part of the initiation metadata) (elevenlabs.io)

Audio format notes (to avoid timeouts/input errors)

  • Use 16‑bit PCM, little‑endian, mono; 16 kHz is recommended. Other supported PCM rates (8–48 kHz) are accepted by the realtime STT, but 16 kHz is a good default. Send chunks in short intervals for smooth streaming. (elevenlabs.io)

References:

  • Agent WebSockets API (endpoint, events, message shapes): wss://api.elevenlabs.io/v1/convai/conversation. (elevenlabs.io)
  • Agents Platform WebSocket guide (signed URL flow and cURL example): GET /v1/convai/conversation/get-signed-url.
proper berry
#

I am actually not getting a response in Postman for this API.

tawdry radishBOT
#
AI-generated Answer for andy_94855

If you’re hitting the Conversational AI endpoint from Postman and “not getting a response,” the most common issue is using HTTP instead of a WebSocket. The ConvAI API is only available over WSS and returns events after a 101 Switching Protocols upgrade. You must use Postman’s WebSocket client, not a REST tab. (elevenlabs.io)

Quick Postman checklist (Conversational AI, WebSocket)

  • Connect to: wss://api.elevenlabs.io/v1/convai/conversation?agent_id=YOUR_AGENT_ID (public agents), or use a signed URL for private agents. To get a signed URL: GET /v1/convai/conversation/get-signed-url?agent_id=... with header xi-api-key: YOUR_KEY. Paste the returned wss URL into Postman’s WebSocket URL field. (elevenlabs.io)
  • Immediately send this frame after connect to start the session:
    { "type": "conversation_initiation_client_data" }
    You should receive conversation_initiation_metadata (includes user_input_audio_format and agent_output_audio_format). (elevenlabs.io)
  • Send audio as base64 PCM chunks:
    { "user_audio_chunk": "BASE64_PCM_16KHZ_MONO_16BIT" }
    You should then receive events like vad_score, user_transcript, agent_response, and audio. (elevenlabs.io)
  • Keep the connection alive: if you get a ping event, reply with:
    { "type": "pong", "event_id": <from ping_event> } (elevenlabs.io)

Audio format to use

  • PCM little‑endian, mono, 16‑bit, 16 kHz (PCM_16000). The realtime STT docs specify PCM formats explicitly (all PCM variants are 16‑bit little‑endian), and ConvAI’s initiation metadata will state pcm_16000 if that’s what it expects. (elevenlabs.io)

If you’re testing Speech‑to‑Text Realtime (Scribe) in Postman instead of ConvAI

  • Connect via WebSocket to: wss://api.elevenlabs.io/v1/speech-to-text/realtime
    Authenticate with xi-api-key header or a one-time token in the token query param. (elevenlabs.io)
  • After you receive session_started, send audio frames like:
    { "message_type": "input_audio_chunk", "audio_base_64": "BASE64_PCM", "sample_rate": 16000, "commit": true }
    You’ll receive partial/committed transcript events; you can also enable VAD-based commits by connecting with commit_strategy=vad and control silence threshold via vad_silence_threshold_secs. Supported audio_format values include pcm_16000 (recommended). (elevenlabs.io)

Why Postman might show “no response”

  • Using HTTP tab instead of WebSocket (ConvAI and STT realtime are WSS-only). (elevenlabs.io)
  • Didn’t send conversation_initiation_client_data after connecting (ConvAI won’t start). (elevenlabs.io)
  • Audio not in expected format (use PCM_16000, 16‑bit LE, mono; verify user_input_audio_format in the initiation metadata). (elevenlabs.io)
  • Not replying to ping with pong causing the server to close the socket.
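
The ping/pong reply can be sketched in Python as below, using the two message shapes quoted in this thread. The function name is hypothetical, and real clients would call it from their WebSocket message handler.

```python
import json
from typing import Optional


def pong_for(raw_event: str) -> Optional[str]:
    """Return the pong reply for a ping event, or None for any other event."""
    event = json.loads(raw_event)
    if event.get("type") != "ping":
        return None
    # Echo back the same event_id that the server sent inside ping_event.
    return json.dumps({"type": "pong", "event_id": event["ping_event"]["event_id"]})
```

In Postman the same reply has to be typed and sent by hand each time a ping arrives.
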
proper berry
#

Can you give me a sample BASE64_PCM?

tawdry radishBOT
#
AI-generated Answer for andy_94855

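Here’s a minimal valid Base64-encoded PCM S16LE (mono, little‑endian) chunk you can use for testing. It decodes to 4 zero bytes, i.e. two 16‑bit samples of silence (0.125 ms at 16 kHz), which is enough to validate the payload shape but far too short, and silent, to produce a transcript: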

AAAAAA==

To send this over the WebSocket, the message body should look like:
{ "user_audio_chunk": "AAAAAA==" }

Notes:

  • The Conversational AI WebSocket expects user_audio_chunk to be Base64 of raw 16‑bit PCM; the initiation metadata will indicate user_input_audio_format like pcm_16000. (elevenlabs.io)
  • Supported PCM inputs are 16‑bit little‑endian; pcm_16000 (16 kHz) is recommended. (elevenlabs.io)
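
For a longer, non-silent test payload, a tone can be generated locally instead of copying a blob. The Python sketch below is only an illustration: the function name, the 440 Hz tone and the amplitude are arbitrary choices, and a pure tone is not speech, so it exercises decoding and VAD rather than transcription.

```python
import base64
import math
import struct


def sine_pcm_base64(freq_hz: float = 440.0, seconds: float = 0.25,
                    sample_rate: int = 16000, amplitude: float = 0.3) -> str:
    """Return base64 of a mono S16LE sine tone, with no WAV header."""
    count = int(seconds * sample_rate)
    samples = [
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(count)
    ]
    return base64.b64encode(struct.pack(f"<{count}h", *samples)).decode("ascii")


if __name__ == "__main__":
    # Decoding the sample string from this answer shows its four zero bytes.
    print(base64.b64decode("AAAAAA==").hex())
```

To test transcription itself, the same packing can be applied to samples from a real speech recording.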

Sources:
Agent WebSockets | ElevenLabs Documentation
Realtime Speech to Text | ElevenLabs Documentation

proper berry
#

I need it for STT transcription.

tawdry radishBOT
#
AI-generated Answer for andy_94855

Here’s how to transcribe speech with ElevenLabs STT, depending on whether you want batch (REST) or realtime (WebSocket).

Batch STT (best accuracy; easy to test in Postman)

  • Endpoint: POST https://api.elevenlabs.io/v1/speech-to-text
  • Auth header: xi-api-key: YOUR_API_KEY
  • Body (form-data):
    • model_id: scribe_v1
    • file: attach your audio/video file (all common formats supported; up to 3 GB)
    • Optional: file_format=pcm_s16le_16 if you upload raw 16‑bit PCM at 16 kHz, mono, little‑endian (lower latency than encoded formats)
    • Optional: language_code, use_multi_channel, webhook, webhook_id, additional_formats, etc.
  • Response: JSON transcript (or async via webhook if webhook=true). (elevenlabs.io)

Realtime STT (low latency; for live Android mic streams)

  • Connect via WebSocket: wss://api.elevenlabs.io/v1/speech-to-text/realtime
    • Query params:
      • model_id=scribe_v2_realtime
      • audio_format=pcm_16000 (or one of the supported formats below)
      • include_timestamps=true (if you want word-level timestamps)
      • Optional commit_strategy=vad (auto‑commit on silence) or leave default manual
      • Optional VAD tuning: vad_silence_threshold_secs, vad_threshold, min_speech_duration_ms, min_silence_duration_ms
    • Auth: set xi-api-key header, or pass a single‑use token in token=… (create it via POST /v1/single-use-token/realtime_scribe). (elevenlabs.io)
  • Send audio chunks:
    • Message type: { "message_type": "input_audio_chunk", "audio_base_64": "<BASE64_BYTES>", "sample_rate": 16000 }
    • With manual commit, you must commit segments (SDK provides a commit call; auto‑commit is available via VAD). (elevenlabs.io)
  • Supported audio formats (mono only):
    • pcm_8000, pcm_16000 (recommended), pcm_22050, pcm_24000, pcm_44100, pcm_48000, ulaw_8000. (elevenlabs.io)
  • Best practices and troubleshooting:
    • Use 16 kHz mono for best balance of quality/bandwidth.
    • Chunk size: 0.1–1.0 s.
    • If you get no transcripts: ensure audio_format matches what you send, base64 is valid, sample_rate is included, auth/quotas are OK. (elevenlabs.io)

About a “sample BASE64_PCM”

  • The docs don’t publish a literal Base64 blob. For realtime, encode the raw 16‑bit little‑endian PCM frames (no WAV header) to Base64 and place them in audio_base_64 with sample_rate set appropriately (e.g., 16000). If you prefer to avoid Base64 testing, use the batch REST convert endpoint in Postman and upload a file directly. (elevenlabs.io)

Reference overviews

  • Product/overview pages for STT and Realtime STT (background, models, languages, pricing): (elevenlabs.io)

If you tell me whether you want batch (Postman)

proper berry
#

I prepared a PCM string and got a response back from STT in Postman, but using the same string I did not get any response from my agent.

tawdry radishBOT
#
AI-generated Answer for andy_94855

Likely cause: you’re sending the STT Realtime payload to the Agent WebSocket. The two protocols are different.

What STT expects

  • Endpoint: wss://api.elevenlabs.io/v1/speech-to-text/realtime
  • Message: { "message_type": "input_audio_chunk", "audio_base_64": "...", "sample_rate": 16000, "commit": true|false }
  • You also choose audio_format via query param (defaults to pcm_16000). (elevenlabs.io)

What Agents expect

  • Endpoint: wss://api.elevenlabs.io/v1/convai/conversation?agent_id=YOUR_AGENT_ID
  • First server event: conversation_initiation_metadata with user_input_audio_format and agent_output_audio_format.
  • Audio upload message: { "user_audio_chunk": "..." } (no message_type, no sample_rate field). (elevenlabs.io)

If you send the STT-style message (input_audio_chunk/audio_base_64) to the Agent socket, it will be ignored; the agent only consumes user_audio_chunk. (elevenlabs.io)
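
A quick way to catch this mix-up before sending is to classify each outgoing message by its keys. The Python sketch below does that. The function name is hypothetical and it only checks the fields named in this thread.

```python
import json


def message_schema(raw_message: str) -> str:
    """Classify an outgoing audio message as 'agent', 'stt' or 'unknown'."""
    message = json.loads(raw_message)
    if "user_audio_chunk" in message and "message_type" not in message:
        return "agent"  # shape for /v1/convai/conversation
    if message.get("message_type") == "input_audio_chunk" and "audio_base_64" in message:
        return "stt"  # shape for /v1/speech-to-text/realtime
    return "unknown"
```

Asserting the expected schema in the send path of each socket makes a swapped payload fail loudly instead of being silently ignored.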

Format must match the agent’s expected input

  • Use the value returned in conversation_initiation_metadata.user_input_audio_format (e.g., pcm_16000 or ulaw_8000). (elevenlabs.io)
  • You can also inspect your agent’s configured ASR format via GET /v1/convai/agents/:agent_id (field: conversation_config.asr.user_input_audio_format). (elevenlabs.io)
  • For PCM, provide mono 16‑bit little‑endian at the stated sample rate (e.g., pcm_16000 = 16 kHz S16LE). (elevenlabs.io)

Quick fix checklist

  • Connect to the Agents WebSocket with a valid agent_id and xi-api-key. (elevenlabs.io)
  • Wait for conversation_initiation_metadata; note user_input_audio_format. (elevenlabs.io)
  • Send audio as { "user_audio_chunk": "<base64 of raw S16LE mono at that rate>" }. Do not include message_type or sample_rate. (elevenlabs.io)
  • Let the turn end (the agent replies after it detects end-of-speech/silence based on turn settings). You can confirm turn behavior via the agent config (turn.mode, turn_timeout). (elevenlabs.io)
  • To sanity‑check the pipeline, send a text user message: { "type": "user_message", "text": "hello" } — it triggers the same response flow as speech. (elevenlabs.io)

Reference examples

  • STT Realtime example payload and parameters. (elevenlabs.io)
  • Agents WebSocket message examples (user_audio_chunk, vad_score, user_transcript). (elevenlabs.io)
  • Supported audio format definitions (PCM little‑endian, ulaw) for realtime streaming. (elevenlabs.io)

If you still get no response, please log the first few server events (conversation_initiation_metadata, vad_score, user_transcript) from the Agent socket and share the conversation_id so we can pinpoint

proper berry
#

I used the same PCM file as a sanity check, with the correct message format for Convo AI and for STT separately. What I mean is that the PCM works for STT but not for Convo AI.