# Applying Prompt to {{Transcript}} Before Sending to TTS Provider


peak basin
#

Is it possible to have an agent apply a specific prompt to an audio transcript before it's sent off to my TTS provider (which in this case is ElevenLabs)? The flow would look like this:

OpenClaw receives audio file -> Sends to Deepgram for transcription (STT) -> OpenClaw receives the transcript back from Deepgram -> OpenClaw uses a pre-defined prompt to "enhance" the transcript -> the enhanced transcript is sent to ElevenLabs -> ElevenLabs generates an audio file using the transcript and sends it back to OpenClaw
dawn flameBOT
#

Yep, with one small caveat.

Recommended path: let the agent do the enhancement, then let TTS speak the agent’s reply.

So the flow becomes:

audio -> Deepgram STT -> {{Transcript}} injected into the prompt -> agent rewrites/cleans it -> ElevenLabs TTS

That already matches OpenClaw pretty well, because:

  • Deepgram transcription is injected into the reply pipeline as {{Transcript}}
  • TTS speaks the outbound reply, not the raw transcript
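
To make the two bullets concrete, here is a minimal sketch of the flow as plain functions. It is a toy model under stated assumptions, not OpenClaw's implementation: `Stages`, `renderTemplate`, and `handleVoiceNote` are hypothetical names, and the provider calls are injected stubs (real STT/TTS calls would be async network requests).

```typescript
// Toy model of the flow above. None of these names are OpenClaw APIs;
// the stages are injected so the sketch runs without any provider.
type Stages = {
  transcribe: (audio: Uint8Array) => string; // stands in for Deepgram STT
  runAgent: (prompt: string) => string; // stands in for the agent/LLM turn
  synthesize: (text: string) => Uint8Array; // stands in for ElevenLabs TTS
};

// Replace every {{Name}} placeholder with its value; unknown names are left untouched.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (placeholder: string, key: string) => {
    const value = vars[key];
    return typeof value === "string" ? value : placeholder;
  });
}

function handleVoiceNote(
  audio: Uint8Array,
  promptTemplate: string,
  stages: Stages,
): Uint8Array {
  const transcript = stages.transcribe(audio);
  const prompt = renderTemplate(promptTemplate, { Transcript: transcript });
  // TTS receives the agent's reply, never the raw transcript.
  const reply = stages.runAgent(prompt);
  return stages.synthesize(reply);
}
```

The point of the sketch is the last two lines of `handleVoiceNote`: whatever the agent returns is what gets spoken, so the "enhancement" has to happen in the agent turn.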

Minimal config

{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "deepgram", model: "nova-3" }],
      },
    },
  },

  messages: {
    tts: {
      auto: "inbound",
      provider: "elevenlabs",
    },
  },
}

Then put something like this in your agent/system prompt:

When {{Transcript}} is present, rewrite it into clean spoken text for TTS.
Fix filler words, punctuation, and obvious transcription mistakes, but preserve meaning.
Reply with only the final spoken script.

Important distinction

  • If you want “agent-enhanced transcript, then TTS”, this works now and is the easiest route.
  • If you want a hard middleware step that always rewrites the transcript before normal reply handling, that’s more of a plugin/custom hook job.

There is also tools.media.audio.prompt, which can shape audio-understanding output style, but for a Deepgram -> ElevenLabs flow I’d rely on the agent prompt first.

peak basin
#

I was just looking at using a hook since it seems to have message:transcribed. Would that be the correct hook point?

dawn flameBOT
#

Close, but I would not use message:transcribed as the main transform point for this.

What message:transcribed is good for

It’s the right event if you want to react immediately after STT, because it fires after audio transcription and gives you context.transcript.

Why I wouldn’t use it here

That internal hook is better for observing / side effects than for rewriting the pipeline.
Your goal is really:

transcript -> apply enhancement instructions -> agent reply text -> ElevenLabs TTS

So the better fit is:

  • Simplest / recommended: put the enhancement rules in the agent/system prompt
  • Plugin version: use before_prompt_build to inject those instructions before the LLM run
  • Full override: use before_agent_reply only if you want the plugin to completely take over reply generation

Rough map

  • message:transcribed = “audio was transcribed”
  • message:preprocessed = “final enriched inbound body is ready”
  • before_prompt_build = best place to inject your ‘enhance for TTS’ prompt
  • before_agent_reply = take over and return a synthetic reply yourself

So for your exact use case, I’d say:

message:transcribed is the right observation point, but before_prompt_build is the better control point.
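
A toy model of that observation-vs-control distinction, in case it helps. This is not OpenClaw's plugin API: `ToyPipeline` and both handler shapes are hypothetical, and the only thing the sketch borrows from the real system is the two hook names.

```typescript
// Toy hook model, NOT OpenClaw's plugin API: handler shapes are hypothetical.
type Observer = (transcript: string) => void; // side effects only
type PromptHook = (prompt: string) => string; // may change what the LLM sees

class ToyPipeline {
  private observers: Observer[] = [];
  private promptHooks: PromptHook[] = [];

  // Models message:transcribed.
  onTranscribed(fn: Observer): void {
    this.observers.push(fn);
  }

  // Models before_prompt_build.
  beforePromptBuild(fn: PromptHook): void {
    this.promptHooks.push(fn);
  }

  // Returns the prompt that would be handed to the model.
  buildPrompt(transcript: string): string {
    // Observation point: handlers return nothing, so they cannot alter the flow.
    for (const fn of this.observers) fn(transcript);
    // Control point: each hook's return value replaces the prompt.
    let prompt = `[Audio]\nTranscript:\n${transcript}`;
    for (const fn of this.promptHooks) prompt = fn(prompt);
    return prompt;
  }
}
```

In this model an `onTranscribed` handler can log or notify, but only a `beforePromptBuild` handler can prepend the "enhance for TTS" instructions.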

And then let TTS do its normal thing on the final reply with:

messages: {
  tts: {
    auto: "inbound",
    provider: "elevenlabs",
  },
}

peak basin
#

If I instead go with your first suggestion of putting:

When {{Transcript}} is present, rewrite it into clean spoken text for TTS.
Fix filler words, punctuation, and obvious transcription mistakes, but preserve meaning.
Reply with only the final spoken script.
into the agent/system prompt, where exactly would I put it (which file)?

dawn flameBOT
#

I’d put that in AGENTS.md.

For most setups, that means:

~/.openclaw/workspace/AGENTS.md

If you’re using a custom/named agent, put it in that agent’s workspace AGENTS.md instead.

Why AGENTS.md?

Because this is an operating instruction, not just personality/tone.

  • AGENTS.md = rules, behavior, standing orders
  • SOUL.md = persona / voice / style

So your transcript-cleanup rule belongs in AGENTS.md.


Example

Add a section like:

## Audio transcript cleanup for TTS

When `{{Transcript}}` is present, rewrite it into clean spoken text for TTS.
Fix filler words, punctuation, and obvious transcription mistakes, but preserve meaning.
Reply with only the final spoken script.

Small tip

I would not use systemPromptOverride for this. That replaces the whole assembled system prompt, which is overkill here.

After editing AGENTS.md, the next run should pick it up. For the cleanest test, I’d send /new once and then try a fresh voice note.

peak basin
dawn flameBOT
#

I checked the paste. It should load fine from AGENTS.md, but I’d tweak it a bit.

Two important fixes

  1. Don’t key it off literal {{Transcript}}
    In the agent turn, OpenClaw usually gives the model an audio block like:

    [Audio]
    Transcript:
    ...
    

    So I’d scope it to “when the inbound message contains an [Audio] block / Transcript: section” instead of {{Transcript}}.

  2. Your rules currently conflict
    Right now it says both:

    • do not alter/add/remove words
    • fix filler words / punctuation / transcription mistakes / capitalization

    Those fight each other a bit. If your goal is “clean spoken text for TTS”, I’d allow light cleanup while preserving meaning.
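
For fix 1, here is a rough sketch of what "scope it to the [Audio] block" could look like if you ever check it in code rather than in the prompt. The block layout is assumed from the example in fix 1 and the exact format OpenClaw emits may differ; `extractAudioTranscript` is a hypothetical helper, not an OpenClaw function.

```typescript
// Assumed layout (from the example above; the real format may differ):
//   [Audio]
//   Transcript:
//   <one or more lines of text>
function extractAudioTranscript(body: string): string | null {
  const lines = body.split(/\r?\n/);
  // Find an "[Audio]" line that is immediately followed by "Transcript:".
  const start = lines.findIndex(
    (line, i) =>
      line.trim() === "[Audio]" && (lines[i + 1] ?? "").trim() === "Transcript:",
  );
  if (start === -1) return null;
  // Simplification: treat everything after "Transcript:" as the transcript.
  const transcript = lines.slice(start + 2).join("\n").trim();
  return transcript.length > 0 ? transcript : null;
}
```

A message that merely mentions the literal text {{Transcript}} returns null here, which is the behavior the narrower scoping is after.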

I’d replace it with something more compact like this

## Audio transcript enhancement for TTS

Only apply these rules when the current inbound message contains an `[Audio]` block with a `Transcript:` section.

Your task is to turn the transcript into clean, expressive spoken text for ElevenLabs.

Rules:
- Preserve the original meaning.
- Fix obvious transcription mistakes, filler words, and punctuation when they reduce clarity.
- You may add short voice-only tags like [laughing], [sighs], [whispers], [pause] when they are clearly appropriate.
- Do not add new facts, new dialogue, music tags, or sound-effect tags.
- Keep the result natural for speech.
- Reply with only the final spoken script.

Why I like this better

  • narrower scope
  • less prompt bloat
  • fewer conflicting rules
  • better chance the model actually follows it consistently

So, short version: your current paste is close, but I’d simplify it.

If you want, paste the updated AGENTS.md snippet here and I’ll sanity-check the final version. If this gets it working, please right-click the solution message → Apps → ✅ Mark Solution so Answer Overflow can index it.

peak basin
dawn flameBOT
#

Ah, that goal helps a lot. In that case, your current prompt is closer, but I’d still tighten it.

Right now it’s still doing a few extra things you probably don’t want:

  • it keys off {{Transcript}} instead of the actual inbound audio block
  • it allows capitalization / punctuation / ellipses changes
  • it has a bunch of topic safety rules that could make it distort or refuse legit transcript content
  • it’s pretty long, which usually makes this less reliable, not more

For your goal, I’d make it much narrower: insert ElevenLabs-friendly voice tags, otherwise leave the transcript alone.

I’d replace that section with this

## Audio tag insertion for ElevenLabs v3

Only apply these rules when the current inbound message contains an `[Audio]` block with a `Transcript:` section.

Your task is to prepare the transcript for ElevenLabs v3 by inserting short voice-performance tags in square brackets where they clearly improve delivery.

Rules:
- Preserve the original wording and meaning.
- Only add brief voice-performance tags such as [laughing], [sighs], [whispers], [pause], [excited], [thoughtful] when they are strongly justified by context.
- Do not rewrite sentences.
- Do not add new dialogue or new facts.
- Do not remove words from the transcript.
- Do not add music tags, ambient sound tags, or non-voice stage directions.
- Use tags sparingly. Fewer, better tags are preferred.
- Return only the final tagged script.

Two specific notes on your current paste

  1. Change this line
    from:

    When {{Transcript}} is present...

    to something like:

    When the current inbound message contains an [Audio] block with a Transcript: section...

  2. Remove the “Add Emphasis” part
    If you want mostly-verbatim output plus tags, that section will make the model start “helpfully” editing the text.

So, short version: yes, your direction is right now, but I’d simplify it a lot.
That should give you a better chance of getting “same transcript + well-placed Eleven tags” instead of a partial rewrite.
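
If you want to verify that a reply really is "same transcript plus tags", a small check like the one below can help while testing. It is only a sketch: `checkTaggedScript` is a hypothetical helper, the allowed-tag list is copied from the example rules in this thread, and it assumes the transcript itself contains no square brackets and that tags are not placed directly before punctuation.

```typescript
// Tag list copied from the example rules; extend it to match your own prompt.
const ALLOWED_TAGS = new Set([
  "laughing", "sighs", "whispers", "pause", "excited", "thoughtful",
]);

type TagCheck = { ok: boolean; problems: string[] };

// Collapse whitespace so tag insertion does not cause false mismatches.
function normalizeSpaces(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

function checkTaggedScript(original: string, tagged: string): TagCheck {
  const problems: string[] = [];
  // Strip every [tag] and record anything outside the allowed list.
  const stripped = tagged.replace(/\[([^\]]*)\]/g, (_m: string, tag: string) => {
    if (!ALLOWED_TAGS.has(tag)) problems.push(`unexpected tag: [${tag}]`);
    return " ";
  });
  // With the tags removed, the wording must equal the transcript.
  if (normalizeSpaces(stripped) !== normalizeSpaces(original)) {
    problems.push("wording differs from the transcript");
  }
  return { ok: problems.length === 0, problems };
}
```

Running it on a few real voice notes shows quickly whether the prompt is producing tag-only edits or partial rewrites.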

If you want, I can turn your exact paste into a final cleaned-up AGENTS.md block ready to drop in.

peak basin
dawn flameBOT
#

Yep, I refreshed it, I can see the updated version now. I was looking at the stale copy before.

The good news: the big fix is there now:

Only apply these rules when the current inbound message contains an [Audio] block with a Transcript: section.

That’s the right change. 👍

My read now

This is good enough to test as-is.
If you want to tighten it further, I’d still trim these 3 things:

  1. Remove “Add Emphasis”
    It conflicts with “do not alter/add/remove words” and may make the model rewrite more than you want.

  2. Remove the “sensitive topics / profanity / NSFW” rule
    For transcript-preserving work, that can cause weird distortions or refusals on perfectly valid input.

  3. Add one short rule: “Use tags sparingly”
    Otherwise the model may over-decorate the transcript.

So my final tiny edit would be

Keep what you have, but:

  • delete Step 4: Add Emphasis
  • delete the “DO NOT introduce or imply sensitive topics...” bullet
  • add:
* DO use tags sparingly. Prefer a few strong tags over tagging every line.

So, short version: yes, this refreshed version is much closer, and I’d absolutely test it now.

If you want, I can do one last pass and give you the exact final cleaned block with only those small changes.

peak basin
#

I am still having a little trouble with this. When I send a voice note, that is being sent off to be transcribed by Deepgram, and then sent back to my OpenClaw agent. My OpenClaw agent then adds audio tags to my transcribed voice note, and sends it to ElevenLabs to perform TTS. I think I need to add an extra step in there to advise the agent to create a response, and add the audio tags to that, and then send that to ElevenLabs.

peak basin
#

Any ideas @dawn flame ?

dawn flameBOT
#

Yeah, a few thoughts, I think the likely issue now is not the idea, it’s where the behavior is coming from.

OpenClaw is doing:

audio -> transcript injected into agent turn -> agent reply -> ElevenLabs speaks the reply

So if the tagging isn’t happening reliably, I’d check these in order:

  1. Start a fresh session once
    If you edited AGENTS.md mid-thread, do /new before testing again.

  2. Confirm AGENTS.md is actually in prompt context
    Run /context list or /context detail and make sure:

    • AGENTS.md is injected
    • it is not truncated
  3. Check agents.defaults.contextInjection
    If it’s set to "continuation-skip", an existing thread may skip re-injecting updated bootstrap files on continuation turns. Default is "always".

  4. Keep the transcript rule short
    The more essay-like the AGENTS.md block gets, the more likely the model is to drift. I’d keep it to a tight 6 to 8 bullets.

My practical recommendation

If you want best-effort, stay with AGENTS.md.

If you want deterministic:
use a plugin hook instead of prompt-only behavior. That gives you a real middleware step before TTS, instead of hoping the agent obeys the standing prompt every time.

The clean fallback is:

  • inspect audio transcript input
  • transform it into tagged text
  • return that as the assistant reply
  • let ElevenLabs speak that exact reply
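
Those four steps could be sketched roughly like this. It is hypothetical glue, not an OpenClaw hook signature: `replyForInbound` and `simpleTagger` are made-up names, the `[Audio]` / `Transcript:` layout is assumed from earlier in this thread, and the tagger is deliberately trivial.

```typescript
// Hypothetical glue, not an OpenClaw hook signature: returns the reply to
// speak, or null to fall through to the normal agent turn.
type Tagger = (transcript: string) => string;

function replyForInbound(body: string, addTags: Tagger): string | null {
  // Step 1: inspect the inbound body for the assumed audio block layout.
  const found = /^\[Audio\]\r?\nTranscript:\r?\n([\s\S]*)$/m.exec(body);
  const transcript = (found?.[1] ?? "").trim();
  if (transcript === "") return null;
  // Steps 2 and 3: transform the transcript and return it as the reply.
  // Step 4 (TTS speaking this exact text) happens outside this function.
  return addTags(transcript);
}

// Deliberately trivial tagger: marks questions as thoughtful and leaves the
// wording intact. A real one would be a model call or a rule set.
const simpleTagger: Tagger = (t) =>
  t.endsWith("?") ? `[thoughtful] ${t}` : t;
```

Because the transform is ordinary code, the same input always produces the same spoken text, which is the deterministic behavior the prompt-only route cannot guarantee.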
peak basin
#

The tagging is happening, but it's happening after Deepgram has turned my audio note into text and sent it back to OpenClaw. OpenClaw then applies the audio tags to the transcript of my voice note and sends it to ElevenLabs. ElevenLabs then takes the "enhanced" transcript and turns it into speech. Here's a step-by-step of how I see it currently happening.

1. I send a voice note via Telegram to my OpenClaw agent.
2. OpenClaw sends the voice note audio file to Deepgram for transcription (STT).
3. Deepgram sends back the transcript as a text file to OpenClaw.
4. OpenClaw sees this transcript and uses the rules in my AGENTS.md file to add audio tags to the transcript of my voice note.
5. OpenClaw then sends the transcript which has been "enhanced" with audio tags to ElevenLabs.
6. ElevenLabs uses TTS with the enhanced transcript to create an audio file which is then sent back to OpenClaw.
7. OpenClaw then sends that audio file back to me via Telegram.

In between steps 3 and 4, there needs to be an additional step where OpenClaw receives the transcript (the text of my voice note) and creates its own response based on it. The agent then "enhances" that textual response with audio tags and sends it to ElevenLabs, which turns the enhanced response into an audio file. Does that make sense?