#Bug: Audio binary injected into context after transcription

steep heath

Version: 2026.1.29

What happens:
Voice message transcription works, but the raw .ogg binary is ALSO included in context as text/plain:
Transcript: "Раз-раз проверка..." ("Testing, testing...") ✅

<file name="file.ogg" mime="text/plain"> 杏卧Ȁ 䒵䥈 햘땦ጁ... (garbage) </file>

Expected:
After successful transcription, the original audio should NOT be injected into context.

Impact:
• Wastes 50-80% of context on binary garbage
• Forces frequent compactions

Config:

"tools.media.audio": { "enabled": true, "language": "ru", "models": [{"provider": "groq", "model": "whisper-large-v3-turbo"}] }

Channel: Telegram

digital kiln (BOT)

Yeah, that’s a real bug.

What’s happening: after transcription we replace ctx.Body with the transcript, but we leave the original Telegram voice attachment (MediaPath/MediaUrl/MediaType, and possibly the MediaPaths* arrays) on the message context. Downstream, the message-to-model formatting still sees “there’s a file attached” and tries to include it as a generic <file … mime="text/plain">…</file> blob. Since it’s an .ogg, the raw audio bytes get decoded as if they were text, and you get garbage characters that blow up context.
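To illustrate why the blob looks the way it does, here is a small sketch that decodes the first bytes of an Ogg page as text. The exact decoding path inside the formatter is not confirmed in this thread; UTF-16LE is used here only as an assumption, because it happens to reproduce the leading characters of the reported sample.

```typescript
// First six bytes of a typical first Ogg page: the "OggS" capture pattern,
// stream structure version 0x00, and header type 0x02 (beginning of stream).
const oggHeader = new Uint8Array([0x4f, 0x67, 0x67, 0x53, 0x00, 0x02]);

// Decoding raw audio bytes as text does not fail; it silently yields nonsense.
// Assumption: UTF-16LE is chosen only because it matches the reported sample.
const asText = new TextDecoder("utf-16le").decode(oggHeader);

console.log(asText); // 杏卧Ȁ
```

The point is that nothing errors along the way, so the garbage reaches the prompt unless the attachment is dropped or filtered first.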

Expected behavior

If transcription succeeds, we should:

  • keep only the transcript text in the prompt/context
  • not inject the raw audio bytes

Quick workaround (until fixed)

If you can, disable the “include raw media as file content” behavior for Telegram voice messages (if there’s a config switch on your build). If not, the only safe workaround is currently to disable audio handling entirely (which defeats the point, I know).

What we’ll change (fix)

After successful transcription, we should strip audio media fields from ctx (and sessionCtx) so nothing tries to serialize the audio attachment into the LLM context, e.g.:

  • ctx.MediaPath = undefined
  • ctx.MediaUrl = undefined
  • ctx.MediaType = undefined
  • ctx.MediaPaths/MediaUrls/MediaTypes = undefined (or at least remove audio entries)

…but only when isAudio(ctx.MediaType) and transcription returns non-empty text.
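The proposed fix could be sketched roughly as follows. This is not the actual patch: the `MediaContext` interface, the `stripTranscribedAudio` name, and the `isAudio` implementation are hypothetical, and the sketch assumes the `MediaPaths`/`MediaUrls`/`MediaTypes` arrays are index-aligned.

```typescript
// Hypothetical shape of the message context. Field names follow the ones
// mentioned in this thread; the real type may differ.
interface MediaContext {
  Body?: string;
  MediaPath?: string;
  MediaUrl?: string;
  MediaType?: string;
  MediaPaths?: string[];
  MediaUrls?: string[];
  MediaTypes?: string[];
}

// Assumed helper: treats any "audio/*" MIME type as audio.
function isAudio(mediaType: string | undefined): boolean {
  return typeof mediaType === "string" && mediaType.toLowerCase().startsWith("audio/");
}

// Removes audio attachments from ctx once a non-empty transcript exists.
// Returns true when something was stripped.
function stripTranscribedAudio(ctx: MediaContext, transcript: string): boolean {
  // Keep the media if transcription produced nothing usable.
  if (transcript.trim() === "") return false;

  let stripped = false;

  if (isAudio(ctx.MediaType)) {
    ctx.MediaPath = undefined;
    ctx.MediaUrl = undefined;
    ctx.MediaType = undefined;
    stripped = true;
  }

  // Filter the parallel arrays by index so non-audio attachments survive.
  const types = ctx.MediaTypes;
  if (types && types.some((t) => isAudio(t))) {
    const keep = (_: unknown, i: number): boolean => !isAudio(types[i]);
    ctx.MediaPaths = ctx.MediaPaths?.filter(keep);
    ctx.MediaUrls = ctx.MediaUrls?.filter(keep);
    ctx.MediaTypes = types.filter(keep);
    stripped = true;
  }

  return stripped;
}
```

Filtering the arrays instead of clearing them keeps other attachments (a photo sent with the voice note, for instance) intact. The same call would need to run on sessionCtx as well.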

One thing I need from you (to pinpoint the exact injection site)

Can you paste a small snippet of the actual context right before the model call where the <file name="file.ogg" …> appears (even 20–40 lines around it), or tell me which provider/model you’re running for chat (Anthropic/OpenAI/etc.)? The <file …> wrapper isn’t coming from the Telegram bot code directly, so I want to confirm which formatter is producing it.

If you can’t easily grab that, no worries — the fix above is still correct and should stop the bleed.

If you want, I can turn this into a proper GitHub issue + patch/PR once I confirm where the <file> block is generated.

steep heath

Provider/Model: anthropic/claude-opus-4-5-20251101

The <file> block context:

The injection happens in the user message that gets sent to Anthropic. Here's the structure I see:

[Audio] User text: [Telegram Danil Trubitsyn (@ali3412)...] media:audio
Transcript: Раз-раз проверка как дела...

<file name="file_49---f16a1380-3f45-46cb-b6f1-bd6f2d0d6d2a.ogg" mime="text/plain">
杏卧Ȁ 䒵䥈 햘땦ጁ灏獵效摡āĸ뮀...
</file>

The transcript is there and correct, but the raw .ogg binary still gets serialized into a <file> block with wrong MIME (text/plain instead of audio/ogg).
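As a possible defensive check on the formatter side, here is a minimal sketch that sniffs the Ogg capture pattern instead of trusting a text/plain label. The function names and the return shape are hypothetical, and it is not known whether the real formatter has a hook like this.

```typescript
// Returns true when the buffer starts with the Ogg capture pattern "OggS".
function looksLikeOgg(bytes: Uint8Array): boolean {
  const magic = [0x4f, 0x67, 0x67, 0x53];
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

// Hypothetical guard: derive the MIME type from the content when possible,
// and report whether the bytes are safe to inline into a <file> block as text.
function classifyAttachment(
  bytes: Uint8Array,
  declaredMime: string,
): { mime: string; inlineAsText: boolean } {
  if (looksLikeOgg(bytes)) {
    return { mime: "audio/ogg", inlineAsText: false };
  }
  return { mime: declaredMime, inlineAsText: declaredMime.startsWith("text/") };
}
```

With a guard like this, a voice note mislabeled as text/plain would be reported as audio/ogg and left out of the prompt, even if the media fields were not stripped earlier.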

Yes please — a GitHub issue + PR would be great!