#Getting my agent to check in on random intervals and planned intervals.


gray sluice
#

I have created a Personal Life Coach, Arnold. I talk to him via Telegram mostly. I have installed open-source TTS and it's working via an executable Python script.
What I am not able to do is use Arnold's heartbeat to create random and scheduled check-ins. I sometimes see them being fired in his heartbeat session, but he never contacts me via Telegram.

I would prefer keeping the heartbeat session and main session separate.

I have tried adding that he needs to do a session list and send his messages to the Telegram session, but it's not working.
I am missing not only more autonomous behavior but also scheduled check-ins. I would prefer using the heartbeat for everything instead of having dozens of crontabs.

Any advice?

gritty archBOT
#

First, can you paste openclaw --version? Heartbeat behavior has changed a bit across releases, so that matters here.

But the likely fix is this:

  1. Do not use sessions_list / sessions_send for Telegram delivery
    Those talk to another OpenClaw session, not to Telegram itself.

  2. Set heartbeat delivery explicitly
    Heartbeat defaults to target: "none", so it can fire and still send nothing.

  3. Use isolatedSession: true if you want heartbeat separate from the main chat
    That gives you a fresh heartbeat run, but you can still deliver to Telegram.

Something like this is the clean setup for Arnold:

#
{
  agents: {
    list: [
      {
        id: "arnold",
        heartbeat: {
          every: "30m",
          isolatedSession: true,
          lightContext: true,
          target: "telegram",
          to: "123456789", // your Telegram user/chat id, or -100...:topic:42
          // accountId: "default", // if you use multi-account Telegram
          activeHours: {
            start: "08:00",
            end: "22:00",
            timezone: "Europe/London"
          }
        }
      }
    ]
  }
}

A few important gotchas:

  • If you keep heartbeat in a separate session, do not rely on target: "last". Use explicit target: "telegram" + to.
  • If your HEARTBEAT.md uses a tasks: block, and no task is due, the heartbeat is skipped as no-tasks-due.
  • So for random-ish check-ins, either:
    • keep a normal checklist in HEARTBEAT.md, or
    • add a base task that runs every heartbeat tick.
#

Example:

tasks:

- name: coach-checkin
  interval: 30m
  prompt: "During active hours, decide whether a short supportive Telegram check-in would be useful. Only send if it adds value."

- name: evening-review
  interval: 1d
  prompt: "If it is evening, send a short reflection and planning check-in."

One honest caveat: if you need exact times like 9:00 AM every day, heartbeat is not the right tool. Heartbeat is approximate. For exact timing, use a small number of cron jobs, then keep heartbeat for the fuzzy, autonomous check-ins. You do not need dozens.

Good docs:

gray sluice
#

I am using latest 2016-4-14

gritty archBOT
#

Assuming you mean 2026-04-14 😄, that helps.

On that version, I’d still bet this is a routing issue, not a “heartbeat isn’t firing” issue.

The key distinction is:

  • sessions_send sends to another OpenClaw session
  • Heartbeat delivery to Telegram should be done by heartbeat config itself, or by the message tool, not by session-hopping

So for your case, I would test in this order:

1) Verify Telegram outbound works at all

openclaw message send --channel telegram --target <your_chat_id> --message "heartbeat test"

If that fails, it is a Telegram outbound/config problem.

#

2) Check what heartbeat thinks happened

openclaw system heartbeat last

You are looking for stuff like:

  • target: none
  • reason=no-tasks-due
  • reason=dm-blocked

3) Use explicit heartbeat delivery

If you want heartbeat separate from the main chat session, use:

{
  agents: {
    list: [
      {
        id: "arnold",
        heartbeat: {
          every: "30m",
          isolatedSession: true,
          lightContext: true,
          target: "telegram",
          to: "<your_chat_id>",
          activeHours: {
            start: "08:00",
            end: "22:00",
            timezone: "Europe/London"
          }
        }
      }
    ]
  }
}

The important bit is target: "telegram" plus to.
I would not use target: "last" if heartbeat is running separately.

4) Watch out for tasks: mode

If your HEARTBEAT.md uses tasks:, and no task is due, OpenClaw skips the whole run.

So if you want random-ish check-ins plus scheduled-ish behavior, do this:

#
tasks:

- name: coach-loop
  interval: 30m
  prompt: "During active hours, decide whether Arnold should send a short useful Telegram check-in. Only send if it adds value."

- name: reflection
  interval: 1d
  prompt: "If it is evening, send a short reflection/planning check-in."

That first 30m task keeps the loop alive. Otherwise no-tasks-due can make it look like heartbeat is “working” but never actually messaging.
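
To make the no-tasks-due behavior concrete, here is a minimal Python sketch of interval-based due checking. It only illustrates the rule described above and is not OpenClaw's actual scheduler; `parse_interval`, `due_tasks`, and the `last_run` bookkeeping are hypothetical names invented for this example.

```python
from datetime import datetime, timedelta

# Assumption: intervals use a single unit suffix, as in "30m" or "1d".
UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_interval(spec: str) -> timedelta:
    """Turn '30m', '2h' or '1d' into a timedelta."""
    return timedelta(**{UNITS[spec[-1]]: int(spec[:-1])})


def due_tasks(tasks, last_run, now):
    """Return the names of tasks whose interval has elapsed since their last run."""
    due = []
    for task in tasks:
        previous = last_run.get(task["name"])
        if previous is None or now - previous >= parse_interval(task["interval"]):
            due.append(task["name"])
    return due


tasks = [
    {"name": "coach-loop", "interval": "30m"},
    {"name": "reflection", "interval": "1d"},
]
now = datetime(2026, 4, 16, 12, 0)
last_run = {
    "coach-loop": now - timedelta(minutes=30),
    "reflection": now - timedelta(hours=3),
}

print(due_tasks(tasks, last_run, now))  # ['coach-loop']
```

With only the 1d task in the list, the same call returns an empty list, which is the situation where the whole heartbeat run would be skipped.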

One honest limitation

If by “scheduled” you mean exact wall clock times like 9:00 AM or 6:30 PM, heartbeat is the wrong tool. It is approximate.
For exact timing, use a tiny number of cron jobs, then let heartbeat handle the fuzzy/autonomous check-ins.

gray sluice
#

I'd want to verify before we know the setup actually works:

  1. Does target: "telegram" + accountId: "arnold" deliver to Arnold's Telegram bot (separate from Neon's)? Or does it need channel: "telegram" instead of target: "telegram"?
  2. Does activeHours with end: "00:00" mean midnight end-of-day or midnight start-of-next-day? If 00:00 is interpreted as 00:00 the same day it might mean active until 00:59 then stop.
  3. Does the heartbeat fire on schedule once it's restarted, or does it need a manual test first?

Also I updated the config as suggested, but I am getting the checks in Telegram like this:
12:10 PM — No scheduled push at this time. Last action was PUSH_AB_WORKOUT at 11:20. Next due is CHECK_AB_WORKOUT at 13:00. No active commitments. Nothing urgent.

NO_REPLY

How do I make these replies go to the Heartbeat session but when Arnold actually has something to say use telegram?

gritty archBOT
#

Yep, this is the right thing to verify first.

Short answers

  1. Heartbeat config uses target, not channel
    So for heartbeat, this is correct:
heartbeat: {
  target: "telegram",
  accountId: "arnold",
  to: "<your-chat-id>"
}

accountId: "arnold" selects channels.telegram.accounts.arnold.
If that account id does not exist, delivery is skipped.

  2. Use end: "24:00" for end-of-day
    Docs explicitly say 24:00 is the end-of-day value.
    00:00 is start-of-day. Runtime also supports overnight windows, but for “active until midnight” I would use 24:00 because it is much clearer.

  3. After restart, heartbeat should run on the next tick
    You do not need a manual test, but doing one is a good sanity check:

openclaw system event --text "heartbeat test" --mode now
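
If it helps to reason about the 24:00 vs 00:00 question, here is a small Python sketch of one way such a window check could work. It is my own illustration, not OpenClaw's implementation; `to_minutes` and `in_active_hours` are hypothetical helpers, and treating start >= end as an overnight window is an assumption of the sketch.

```python
from datetime import time


def to_minutes(hhmm: str) -> int:
    """Convert 'HH:MM' to minutes since midnight; '24:00' becomes 1440."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_active_hours(now: time, start: str, end: str) -> bool:
    """Return True if `now` falls inside the half-open window [start, end)."""
    current = now.hour * 60 + now.minute
    lo, hi = to_minutes(start), to_minutes(end)
    if lo < hi:
        # Same-day window, e.g. 08:00 to 24:00.
        return lo <= current < hi
    # Assumption: start >= end means the window wraps past midnight,
    # e.g. 22:00 to 06:00.
    return current >= lo or current < hi
```

Under these assumptions `in_active_hours(time(23, 59), "08:00", "24:00")` is true and `in_active_hours(time(7, 59), "08:00", "24:00")` is false, so "24:00" gives an unambiguous "active until midnight".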

Why you are getting those Telegram messages

This part is the real bug:

#

No scheduled push at this time ... NO_REPLY

Heartbeat silence is not NO_REPLY.
Heartbeat silence is HEARTBEAT_OK.

OpenClaw treats any heartbeat reply that is not HEARTBEAT_OK as alert content and sends it outward when alerts are enabled.

Also, NO_REPLY only suppresses delivery when it is the exact whole reply.
So:

  • NO_REPLY alone = silent
  • No scheduled push... \n\n NO_REPLY = not silent

That is why Telegram is getting spammed.
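
Here is the same rule as a tiny Python sketch, in case that is easier to test prompts against. It mirrors the behavior described above rather than OpenClaw's real code; `classify_heartbeat_reply` is a hypothetical name, and ignoring surrounding whitespace is my assumption.

```python
SILENT_TOKENS = ("HEARTBEAT_OK", "NO_REPLY")


def classify_heartbeat_reply(reply: str) -> str:
    """Return 'silent' only when the whole reply is exactly a silent token."""
    text = reply.strip()  # assumption: leading/trailing whitespace is ignored
    if text in SILENT_TOKENS:
        return "silent"
    # Anything else, including status text that merely ends with NO_REPLY,
    # counts as content and would be delivered.
    return "deliver"
```

Feeding it Arnold's current output, status text followed by NO_REPLY, gives "deliver", which matches what you are seeing in Telegram.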


#

Best setup for what you want

You have two good patterns here.

Option A, simplest

Keep heartbeat delivery to Telegram, but make “nothing to say” return exactly:

HEARTBEAT_OK

and nothing else.

Then only real check-ins go to Telegram.

Use instructions like:

#
If nothing user-facing is needed, reply exactly HEARTBEAT_OK.
Do not explain why there is no check-in.
Do not include NO_REPLY.
Only return plain user-facing text when Arnold should actually contact me.

This is the easiest fix.


Option B, better if you want heartbeat to stay private

If you want heartbeat to be an internal/private loop, and only actual check-ins to go to Telegram, do this:

heartbeat: {
  target: "none",
  isolatedSession: true,
  lightContext: true,
  every: "30m"
}

Then instruct Arnold:

  • if nothing to tell you, reply HEARTBEAT_OK
  • if he really wants to contact you, use the message tool to Telegram
  • after a successful message tool send, reply NO_REPLY

Important distinction:

  • heartbeat config uses target: "telegram"
  • message tool uses channel: "telegram"

So the message tool version would look like:

  • channel: "telegram"
  • target: "<your-chat-id>"
  • accountId: "arnold"

I’d actually recommend Option B for your coach setup. It gives you the clean separation you asked for.


Practical recommendation

For Arnold, I would do:

  • heartbeat.target: "none"
  • isolatedSession: true
  • activeHours: { start: "08:00", end: "24:00", timezone: "..." }
  • use the message tool only for actual check-ins
  • never print scheduling/status chatter unless you want it in chat

If you want, paste your current heartbeat block plus the relevant HEARTBEAT.md, and I can rewrite it into a working Arnold pattern.

gray sluice
#

My heartbeat is too long to paste

#

"If isolatedSession: true is set on a heartbeat, can the agent inside that isolated session use the message tool to send Telegram messages — or are outbound tool calls blocked in isolated sessions?"

gritty archBOT
#

Yes, isolatedSession: true heartbeats can still use the message tool.

isolatedSession only means fresh session / no prior conversation history. It does not by itself block outbound tool calls.

A few important nuances:

  • Heartbeat isolated sessions are not the same as cron-owned isolated jobs
    Cron has some special delivery policies that can disable the message tool in certain cases. Heartbeat isolation does not have that restriction.
  • The message tool still has to be available normally:
    • allowed by your tool config
    • Telegram send action enabled (channels.telegram.actions.sendMessage)
    • correct accountId / target

Practical recommendation

If you want heartbeat to decide privately, then only message you when needed, use:

#
heartbeat: {
  isolatedSession: true,
  lightContext: true,
  target: "none"
}

Then in Arnold’s instructions:

  • if nothing user-facing is needed, reply exactly HEARTBEAT_OK
  • if he should contact you, use the message tool to Telegram
  • after a successful message-tool send, reply exactly NO_REPLY

That avoids double sends.

If you leave heartbeat.target: "telegram" and also use the message tool, you can easily get duplicate/out-of-band behavior.

gray sluice
#

Good news — our setup is already correct. We implemented exactly the recommended pattern:

• target: "none" ✅ (prevents double sends)
• isolatedSession: true ✅
• Arnold uses message tool for real check-ins ✅
• HEARTBEAT_OK for silence ✅
• NO_REPLY after message tool ✅

No more changes needed. The architecture is right — just waiting for the next heartbeat cycle to fire and see if Arnold contacts you on Telegram.

gritty archBOT
#

Yep, that sounds right. That is the architecture I’d recommend.

So at this point, I would stop changing config and do a smoke test instead of waiting:

openclaw system event --text "Heartbeat smoke test. If there is something useful to say, send a short Telegram message via the message tool. Otherwise reply HEARTBEAT_OK." --mode now

Then check:

openclaw system heartbeat last

If that works, you’re done.

If it still does not message Telegram, the problem is probably no longer architecture. It is usually one of these:

  • the heartbeat prompt never decides to send
  • the message tool is using the wrong Telegram target/account
  • Telegram outbound sending is disabled by channel config

So yes, your pattern is correct. I just would not wait blindly for the next cycle, I’d force one manual heartbeat now and verify the full loop end-to-end.

gray sluice
#

The loop is working. Looking at the log — Arnold already sent a real message at 11:20 (before our latest config fix):

"2x10 reps. Get on the floor."

That's Arnold sending via the old config. The smoke test at 12:35 returned HEARTBEAT_OK correctly — he received, evaluated, decided nothing was due.

Verified:

• Arnold receives heartbeats ✅
• He evaluates commitments ✅
• He sends Telegram messages when something is due ✅
• He returns HEARTBEAT_OK when silent ✅
• The log tracks every cycle ✅

The architecture is sound. Arnold is running.

gritty archBOT
#

Nice, that’s the good state. Arnold is alive and the loop is doing what it should.

At this point you’re mostly tuning behavior, not plumbing:

  • heartbeat cadence
  • activeHours
  • how pushy vs gentle his check-ins should be
  • exact wording in HEARTBEAT.md

Glad that helped! 🎉

gray sluice
#

What is your advice on using TTS? I'm using Ubuntu without a GPU and would prefer a local model.

gritty archBOT
#

My honest take, for Ubuntu + no GPU + prefer local, I’d do this:

Recommendation

  1. Keep using a local CPU TTS engine
    If your Python script already works, that is probably the right direction.

  2. For local CPU TTS, I’d pick Piper first
    It’s lightweight, fast on CPU, and good enough for coach/check-in voices.

  3. If you want tighter OpenClaw integration, expose it as an OpenAI-compatible TTS endpoint
    OpenClaw can point messages.tts.providers.openai.baseUrl at a non-OpenAI endpoint, as long as it speaks the OpenAI-style TTS API.

  4. For Telegram, prefer OGG/Opus output if you can
    That usually feels better as a voice note than plain MP3.


#

My practical advice for Arnold

I would not use auto-TTS for every message. That gets noisy fast.

For a life-coach agent, I’d use:

  • messages.tts.auto: "tagged"
    or
  • keep TTS manual/tool-driven only

That way Arnold only sends voice when it actually adds value, for example:

  • workout prompts
  • short morning check-ins
  • evening reflection prompts

#

If you want easiest vs best

  • Easiest built-in in OpenClaw: Microsoft speech
    Good, zero API key, but not local
  • Best local CPU path: Piper
  • Best “clean integration” path: local OpenAI-compatible TTS server + baseUrl

Example shape if you go the OpenAI-compatible route:

{
  messages: {
    tts: {
      auto: "tagged",
      provider: "openai",
      providers: {
        openai: {
          baseUrl: "http://127.0.0.1:8000/v1",
          model: "your-local-tts-model",
          voice: "your-local-voice"
        }
      }
    }
  }
}

OpenClaw docs for that part:

So short version: for your setup, I’d choose Piper or keep your current local script, not a cloud TTS, and I’d trigger voice selectively instead of always.

#

If you want, I can sketch a good Arnold TTS strategy next, like when he should speak vs stay text-only.

gray sluice
#

Yes please, an Arnold TTS strategy would be nice.

gritty archBOT
#

Yep, here’s the strategy I’d use for Arnold.

Arnold TTS strategy

Rule 1: default to text, not voice.
Voice should feel intentional, not spammy.

Use voice for

  • action prompts
    “2x10 reps. Get on the floor.”
  • short encouragement
    “Nice. You did the hard part. Keep going.”
  • time-sensitive nudges
    “Leave now if you want to make that on time.”
  • evening reflection prompts
    “Quick check-in. What went well today?”
  • replying to your voice note
    Voice-to-voice feels natural.

Use text for

  • schedules
  • lists
  • multi-step plans
  • anything with links/details
  • repeated reminders
  • anything longer than about 20 seconds spoken

Good constraints

I’d give Arnold these limits:

  • max 1 voice note every 2 to 3 hours, unless it’s genuinely urgent
  • keep voice notes under 20 seconds
  • one idea per voice note
  • no lists in audio
  • no motivational monologues
  • during quiet hours, text only unless urgent
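
If you ever want to enforce these limits in code instead of only in the prompt, a rough Python sketch could look like this. Everything here is hypothetical: `may_send_voice` is an invented helper, the 2-hour gap is the lower bound of the range above, and the speaking rate of 2.5 words per second is just a guess for estimating clip length.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(hours=2)  # lower bound of the "2 to 3 hours" suggestion
MAX_SECONDS = 20              # keep voice notes under 20 seconds
WORDS_PER_SECOND = 2.5        # assumption: rough conversational speaking rate


def estimated_seconds(script: str) -> float:
    """Estimate spoken duration from the word count."""
    return len(script.split()) / WORDS_PER_SECOND


def may_send_voice(script, now, last_voice_at, quiet_hours, urgent=False):
    """Apply the length, quiet-hours, and cadence limits to one voice note."""
    if estimated_seconds(script) > MAX_SECONDS:
        return False  # too long for audio, send text instead
    if urgent:
        return True   # urgency overrides quiet hours and cadence
    if quiet_hours:
        return False
    return last_voice_at is None or now - last_voice_at >= MIN_GAP
```

A short script like "Get up. Ten reps. Then tell me you did it." passes the length check easily; the cadence check is what stops Arnold from sending a second voice note an hour later.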

Tone

Arnold should sound like:

  • calm
  • direct
  • slightly warm
  • confident
  • never cheesy

So, more:

“Get up. Ten reps. Then tell me you did it.”

#

Less:

“Hey champ, let’s crush the day with positivity and momentum!”

Best operating model

I’d use a simple 4-level ladder:

  1. Nothing needed → HEARTBEAT_OK
  2. Useful but low urgency → text
  3. Short emotional / action cue → voice
  4. Complex follow-up after voice → short voice + text details if needed

That keeps voice special.


#

Pasteable policy for Arnold

You could drop this into AGENTS.md or HEARTBEAT.md:

## Arnold voice policy

Default to text. Use voice sparingly.

Use voice only when:
- a short action prompt would be more effective than text
- encouragement, warmth, or urgency matters
- replying to an inbound voice message
- sending a brief morning, workout, or evening reflection prompt

Do not use voice for:
- lists
- schedules
- long explanations
- multi-step instructions
- repeated reminders
- anything that would sound rambling

When using voice:
- keep it under 20 seconds
- keep it under 2 short sentences
- communicate only one action or one reflection prompt
- sound calm, direct, and grounded
- avoid hype, fluff, and motivational clichés

Cadence:
- no more than one voice note every 2 to 3 hours unless urgent
- during quiet hours, prefer text only

If nothing user-facing is needed, reply exactly HEARTBEAT_OK.
If voice generation fails, fall back to plain text.

My recommendation

For Arnold specifically, I’d use voice for only 3 categories:

  • workout/start-now prompts
  • short supportive check-ins
  • evening reflection questions

Everything else, text.

#

That usually feels great on Telegram.

If you want, I can also write you a more “Arnold-like” voice style block, so he sounds like a consistent character instead of generic TTS.

gray sluice
#

can you tell me more about Microsoft tts? how good is it? how many voices does it have? how do I configure that?

gritty archBOT
#

Yep. Microsoft TTS in OpenClaw is a pretty nice option if you want easy setup.

What it is

OpenClaw’s Microsoft provider currently uses node-edge-tts, which talks to Microsoft Edge’s online neural TTS service.

So:

  • no API key needed
  • good quality
  • not local
  • best-effort, not a formal paid SLA thing

Docs: https://docs.openclaw.ai/tools/tts

How good is it?

#

My honest take:

  • better than “robot voice”
  • good enough for:
    • reminders
    • coach nudges
    • short check-ins
    • voice note replies
  • usually less natural / expressive than ElevenLabs
  • often easier and cheaper than anything else, because it needs no key

For an Arnold-style coach, it’s absolutely usable.

If your bar is “warm, clear, human enough for short Telegram voice notes”, Microsoft TTS is fine.

If your bar is “this should sound emotionally rich and premium”, it’s not the best option.

#

How many voices?

I would not trust a fixed voice count, because that catalog can change.

The practical answer is: a lot. Multiple languages, accents, and male/female neural voices.

Instead of me guessing a number, check what your install sees with:

openclaw infer tts voices

And inspect providers with:

openclaw infer tts providers --json
openclaw infer tts status --json

How to configure it

Minimal example:

{
  messages: {
    tts: {
      auto: "tagged",
      provider: "microsoft",
      providers: {
        microsoft: {
          enabled: true,
          voice: "en-US-MichelleNeural",
          lang: "en-US",
          outputFormat: "audio-24khz-48kbitrate-mono-mp3",
          rate: "+0%",
          pitch: "+0%"
        }
      }
    }
  }
}

Fields that matter most

  • provider: "microsoft"
  • providers.microsoft.voice
  • providers.microsoft.lang
  • providers.microsoft.outputFormat
  • optional:
    • rate
    • pitch
    • volume

Telegram-specific note

Microsoft TTS can work for Telegram, but if you need guaranteed Opus voice-note behavior, OpenAI/ElevenLabs are cleaner.

Docs say Microsoft falls back to MP3 if the configured format fails, and not all output formats are available from the bundled transport.

So for Telegram:

  • works fine for many cases
  • not the cleanest if you are picky about voice-note format

My recommendation for you

If you want:

  • fastest setup → Microsoft TTS
  • strictly local → keep your current local script / Piper-style setup
  • best quality → not Microsoft
  • good enough, zero-key, easy → Microsoft is a solid choice

For Arnold specifically, I’d try Microsoft if you want a super easy second option, but I would probably still keep your local setup if you care about privacy and control.

If you want, I can give you 3 good Microsoft voice picks for Arnold’s personality and a config snippet for each.

gray sluice
#

is there a site I can try the MS voices?

gritty archBOT
#

Yep, the easiest one is Speech Studio / Audio Content Creation:

You can paste your own text, pick a language + voice, and hit play.

Useful companion page for browsing the voice catalog:

And the Microsoft docs page that points to the tool:

Small caveat, it may want a Microsoft/Azure sign-in.

If you want, I can pick 3 Microsoft voices that fit Arnold so you do not have to sift through the whole catalog.

gray sluice
#

I'm using Kokoro-82M + OpenVoice v2 for Arnold's voice cloning, but the cloning is not very good: the voice is way too deep and slow.

gritty archBOT
#

Yeah, that tracks. My blunt take, Kokoro + OpenVoice v2 is cool, but for a coach voice it can get uncanny fast, especially on CPU-only setups.

If it sounds too deep and too slow, I’d try this order:

1) Stop cloning first, get a good base voice

For Arnold, I’d honestly test plain Kokoro without cloning first.

A good stock voice usually sounds better than a mediocre clone.

2) Use a much better reference clip

If you keep cloning, use a sample that is:

  • 10 to 20 seconds
  • clean mic
  • one speaker only
  • normal conversational tone
  • not low / sleepy / dramatic
  • no room echo
  • no background noise

If the source sample is slightly deep or slow, the clone exaggerates it.

3) Speed it up a bit

For a coach voice, I’d usually want around 1.1x to 1.2x speaking speed.

That alone often fixes the “slow Arnold” problem.

4) Aim for “coach”, not “movie trailer”

A lot of cloning setups over-index on deep masculine voices, and that makes them sound heavy and weird.

For a life coach, a better target is:

  • warm
  • clear
  • mid-pitch
  • slightly energetic
  • direct

Not super deep.

5) Clone lightly, do not overdo identity transfer

If OpenVoice is pushing too hard, the result often gets muddy, slow, and over-dark.

A partial style transfer is usually better than “perfect clone”.


My actual recommendation

If you want Arnold to sound good now, I’d do one of these:

#

Best local pragmatic path

  • Kokoro base voice only
  • no cloning
  • slightly faster rate
  • short coach-style scripts

Best easy polished path

  • use a Microsoft neural voice
  • pick one that already feels like Arnold
  • stop fighting the clone

Only keep cloning if

  • you really need a specific person’s voice
  • and you are okay spending time tuning reference clips / pitch / speed

#

For Arnold specifically

I would optimize for:

  • clarity over realism
  • energy over depth
  • consistency over cloning accuracy

That usually gives a much better Telegram experience.

If you want, I can help you design an Arnold voice profile like:

  • target personality
  • pitch
  • speed
  • sentence style
  • which Microsoft or Kokoro-style voice would match it best

gray sluice
#

can you compare piper, kokoro and MS?

gritty archBOT
#

Yep. Short version:

Piper = most practical local CPU option
Kokoro = better sounding local option, but fussier
Microsoft = easiest good result, but not local

Quick comparison

| Option    | Local? | CPU-friendly?    | Voice quality     | Setup pain | Cloning                                    | Best for                      |
| --------- | ------ | ---------------- | ----------------- | ---------- | ------------------------------------------ | ----------------------------- |
| Piper     | Yes    | Yes              | Decent to good    | Low        | No real cloning focus                      | reliable local reminders      |
| Kokoro    | Yes    | Yes, but heavier | Good to very good | Medium     | Possible with extra tooling, mixed results | nicer local personality voice |
| Microsoft | No     | N/A              | Good              | Very low   | No                                         | easiest polished coach voice  |

My take on each

#

Piper

Pros

• fully local
• fast on CPU
• stable
• low hassle

Cons

• voices can sound more “functional”
• less expressive
• not great if you want premium personality

Use Piper if:
you want Arnold to be private, reliable, and lightweight.

#

───

Kokoro

Pros

• best chance of a good local voice
• more natural than Piper
• can feel more “alive”

Cons

• more tuning
• slower/heavier than Piper
• cloning stack gets messy fast
• easy to end up with weird prosody, deep tone, or slow delivery

#

Use Kokoro if:
you care about local quality and do not mind tinkering.

───

Microsoft TTS

Pros

• sounds good out of the box
• lots of voices
• easy setup
• no API key for the Edge-backed path in OpenClaw

Cons

• not local

• best-effort service
• less control than your own local stack
• not really a voice-cloning path

Use Microsoft if:
you want Arnold sounding decent today with minimal effort.

───

For Arnold specifically

If your priorities are:

  1. Fully local + reliable

Pick Piper

  2. Fully local + best sound

Pick Kokoro, but probably without cloning at first

  3. Best balance of ease + quality

Pick Microsoft

───

My honest recommendation

For a Telegram life coach:

• if you want local-first, I’d do Kokoro base voice first, no cloning
• if that still feels too fiddly, drop to Piper
• if you want the fastest “this actually sounds good” result, use Microsoft

#

So:

Piper = dependable
Kokoro = ambitious
Microsoft = convenient

If you want, I can also give you a “which one should I choose?” decision tree for Arnold in one minute.

gray sluice
#

I now use a Python script with Kokoro. What would be a better way to use it?

gritty archBOT
#

If Kokoro is already working from a Python script, I would not throw that away. I’d just make the integration cleaner.

Best way to use it

Good, now

Use it as a special-purpose voice tool, not for every reply:

  1. Arnold decides a voice note is worth sending
  2. he calls your local Kokoro script
  3. script writes an .ogg / .opus file if possible
  4. Arnold sends that file to Telegram with the message tool
  5. final reply is NO_REPLY

That is a solid setup already.


#

Better than “just a script”

Wrap the script in a tiny local service with a stable API.

Why this is better:

  • cleaner than shelling out everywhere
  • easier retries/timeouts
  • easier to swap voices later
  • easier to reuse from heartbeat, main chat, or other agents
  • easier to make OpenClaw talk to it consistently

So instead of “run random python with args”, make it:

  • input: text, voice, speed
  • output: audio file path or bytes

#

Best long-term integration

If you want it to feel native inside OpenClaw, expose Kokoro as an OpenAI-compatible TTS endpoint and point:

  • messages.tts.providers.openai.baseUrl

at your local service.

That is the cleanest architecture.

Why:

  • you get OpenClaw’s normal TTS flow
  • easier config
  • easier future provider swap
  • less custom glue logic in prompts
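
To give a feel for the size of such a service, here is a minimal stdlib-only Python sketch of an OpenAI-style speech endpoint (POST /v1/audio/speech with a JSON body containing input, voice, and speed). Treat it as a sketch under assumptions: `synthesize` is a placeholder that returns one second of silence where your Kokoro call would go, the names `parse_speech_request`, `SpeechHandler`, and `make_server` are invented, and it always returns WAV. Which response format OpenClaw requests and accepts is something to check in the TTS docs.

```python
import io
import json
import wave
from http.server import BaseHTTPRequestHandler, HTTPServer


def synthesize(text: str, voice: str, speed: float) -> bytes:
    """Placeholder for the real Kokoro call: returns one second of silent WAV."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(24000)
        wav.writeframes(b"\x00\x00" * 24000)
    return buffer.getvalue()


def parse_speech_request(body: bytes) -> dict:
    """Validate an OpenAI-style speech request body and extract what we need."""
    payload = json.loads(body)
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    text = payload.get("input", "")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'input' must be a non-empty string")
    return {
        "text": text,
        "voice": str(payload.get("voice", "default")),
        "speed": float(payload.get("speed", 1.0)),
    }


class SpeechHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/audio/speech":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            request = parse_speech_request(self.rfile.read(length))
        except (ValueError, TypeError) as exc:
            self.send_error(400, str(exc))
            return
        audio = synthesize(**request)
        self.send_response(200)
        self.send_header("Content-Type", "audio/wav")
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)


def make_server(port: int = 8000) -> HTTPServer:
    """Bind to localhost only, since this is meant as a private local service."""
    return HTTPServer(("127.0.0.1", port), SpeechHandler)
```

Calling `make_server().serve_forever()` would start it on http://127.0.0.1:8000, so the baseUrl would be http://127.0.0.1:8000/v1. A production version would likely want a proper framework, timeouts, and Opus output, but the request shape stays the same.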

#

What I would recommend for Arnold

If you want the least work

Keep the Python script, but tighten the workflow:

  • only use voice for short, valuable check-ins
  • generate short clips only
  • send Telegram-friendly audio format
  • do not auto-TTS everything

If you want the cleanest setup

Turn the script into a small local HTTP service.

If you want the most elegant OpenClaw setup

Make that service OpenAI-compatible and plug it into messages.tts.


#

One strong opinion

The biggest improvement is probably not the model.
It’s making the interface predictable.

A reliable local voice service with:

  • one input format
  • one output format
  • one place to tune speed/voice/pitch

will feel much better than a loose Python script called ad hoc.

If you want, I can sketch the exact architecture I’d use for Arnold:

  • simple script wrapper
  • service version
  • OpenAI-compatible version

and tell you which one is worth doing first.

gray sluice
#

please provide me the exact architecture

gritty archBOT
#

Yep. Here is the exact architecture I’d recommend for Arnold.

1. Core design

Use heartbeat as Arnold’s private decision loop.
Use Telegram only for actual user-facing messages.
Use your Kokoro Python script as a local voice worker.

So the flow is:

Heartbeat session (private)
→ decide whether Arnold should contact you
→ if no, HEARTBEAT_OK
→ if yes, generate short voice audio locally
→ send audio to Telegram with message tool
→ final reply NO_REPLY

#

No sessions_list, no sessions_send, no channel-hopping.


2. Session architecture

Main Telegram session

This is your normal Arnold conversation.

Heartbeat session

Run it separately with:

  • isolatedSession: true
  • lightContext: true
  • target: "none"

That means:

  • heartbeat thinks privately
  • it does not dump internal status into Telegram
  • it only sends to Telegram when Arnold explicitly chooses to

3. Exact config shape

{
  agents: {
    list: [
      {
        id: "arnold",
        heartbeat: {
          every: "30m",
          isolatedSession: true,
          lightContext: true,
          target: "none",
          activeHours: {
            start: "08:00",
            end: "24:00",
            timezone: "Europe/Berlin"
          }
        }
      }
    ]
  }
}

#

4. Exact voice delivery flow

When a heartbeat fires:

Case A, nothing useful to say

Arnold replies:

HEARTBEAT_OK

That is the whole reply.

Case B, Arnold wants to contact you

Arnold does this:

  1. write a very short voice script
  2. call your local Kokoro Python script
  3. generate a Telegram-friendly audio file
  4. send it with the message tool
  5. reply exactly:
NO_REPLY

That prevents duplicate text output.


5. Exact TTS worker contract

Your Kokoro script should become a stable local helper with this interface:

#

Input

  • text
  • voice name
  • speed
  • output path

Output

  • one finished audio file path

Example behavior:

  • text in
  • produce arnold-2026-04-16T11-15-00.ogg
  • return success/failure cleanly
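
As a rough sketch of that contract in Python: a small command-line wrapper that takes text, voice, speed, and output path, prints the finished file path, and uses the exit code for success or failure. The flag names, `default_output_path`, and `run` are hypothetical, and `synthesize` is passed in as a parameter because I do not know what your existing Kokoro function looks like.

```python
import argparse
import sys
from datetime import datetime
from pathlib import Path


def default_output_path(now: datetime, directory: str = ".") -> Path:
    """Build a timestamped name such as arnold-2026-04-16T11-15-00.ogg."""
    return Path(directory) / f"arnold-{now.strftime('%Y-%m-%dT%H-%M-%S')}.ogg"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Local TTS worker (sketch)")
    parser.add_argument("--text", required=True, help="short script to speak")
    parser.add_argument("--voice", default="your-local-voice")
    parser.add_argument("--speed", type=float, default=1.0)
    parser.add_argument("--output", type=Path, default=None)
    return parser


def run(argv, synthesize) -> int:
    """Run one synthesis job; return 0 on success and 1 on failure.

    `synthesize(text, voice, speed, output)` stands in for your Kokoro code.
    """
    args = build_parser().parse_args(argv)
    output = args.output or default_output_path(datetime.now())
    try:
        synthesize(args.text, args.voice, args.speed, output)
    except Exception as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1
    print(output)  # the only stdout line is the finished file path
    return 0
```

Wired into a script, the entry point would be something like `sys.exit(run(sys.argv[1:], synthesize))`, so Arnold only has to check the exit code and read one line.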

Strong recommendation

Have the script output .ogg / .opus if possible.

#

That is the cleanest fit for Telegram voice-note style delivery.

If Kokoro only gives you WAV first, convert once inside the script/wrapper, not in Arnold’s prompt logic.
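
One common way to do that conversion is to call ffmpeg from the wrapper. The sketch below assumes ffmpeg is installed with libopus support; `opus_command` and `wav_to_ogg` are hypothetical helper names, and the 32k mono settings are just a reasonable starting point for speech, not something Telegram or OpenClaw mandates.

```python
import subprocess
from pathlib import Path


def opus_command(wav_path, ogg_path, bitrate: str = "32k") -> list:
    """Build the ffmpeg argument list for WAV to OGG/Opus conversion."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", str(wav_path),  # input WAV from the TTS model
        "-c:a", "libopus",    # encode audio as Opus
        "-b:a", bitrate,      # assumption: a low bitrate is enough for speech
        "-ac", "1",           # mono
        str(ogg_path),
    ]


def wav_to_ogg(wav_path, ogg_path) -> Path:
    """Convert once inside the wrapper; raises CalledProcessError on failure."""
    subprocess.run(opus_command(wav_path, ogg_path), check=True, capture_output=True)
    return Path(ogg_path)
```

Keeping this in the wrapper means Arnold's prompt never has to mention codecs at all.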


6. Exact prompt policy for Arnold

Put this in AGENTS.md or HEARTBEAT.md:

## Arnold heartbeat messaging policy

Heartbeat is private decision-making space.

If nothing user-facing is needed, reply exactly HEARTBEAT_OK.

Do not send internal schedule/status chatter to Telegram.

If a short check-in, nudge, or reminder would genuinely help:
1. Write a short spoken script, under 20 seconds, ideally under 25 words.
2. Use the local Kokoro TTS script to generate audio.
3. Send the audio to Telegram using the message tool.
4. After sending successfully, reply exactly NO_REPLY.

If audio generation fails, fall back to a short plain text Telegram message, then reply exactly NO_REPLY.

Never send both a full text reply and a voice note unless the text adds necessary detail.
Default to one action, one message, one cue.
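
Outside the policy text itself, the control flow it describes can be sketched in a few lines of Python. This is only a model of the decision logic, not how OpenClaw runs a heartbeat; `heartbeat_turn` and the injected `synthesize`, `send_audio`, and `send_text` callables are hypothetical stand-ins for the Kokoro script and the message tool.

```python
def heartbeat_turn(script, synthesize, send_audio, send_text) -> str:
    """Return the exact final reply for one heartbeat run.

    `script` is the short spoken text Arnold decided on, or None/"" if
    nothing user-facing is needed.
    """
    if not script:
        return "HEARTBEAT_OK"  # silence: nothing goes to Telegram

    try:
        audio_path = synthesize(script)
    except Exception:
        # Audio generation failed: fall back to a short plain text message.
        send_text(script)
        return "NO_REPLY"

    send_audio(audio_path)
    return "NO_REPLY"  # the message tool already delivered the content
```

The point of the model is that every path ends in exactly one of the two silent tokens, so nothing from the heartbeat reply itself ever needs to be delivered.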

#

7. Exact content policy for voice

Arnold should only use voice for:

  • workout prompts
  • “do it now” nudges
  • supportive check-ins
  • short evening reflection prompts

Arnold should not use voice for:

  • lists
  • schedules
  • long explanations
  • repeated nagging
  • internal logs
  • system chatter
#

8. Exact technical path

Recommended now

Heartbeat
→ exec runs local Kokoro script
→ audio file created locally
→ message tool sends file to Telegram
→ NO_REPLY

Recommended later

Replace the raw Python script with a tiny local HTTP service, then eventually make it OpenAI-compatible so OpenClaw can use it like a normal TTS provider.

But right now, the direct script worker is completely fine.


#

9. What not to do

Do not:

  • use target: "telegram" for the heartbeat itself (Arnold delivers through the message tool instead)
  • send internal heartbeat replies to Telegram
  • use sessions_send for this
  • auto-TTS every reply
  • make Arnold generate long voice notes
  • make the prompt handle file conversion logic

10. My final recommendation

If I were building Arnold today, I would ship exactly this:

#
  • isolatedSession: true
  • lightContext: true
  • target: "none"
  • Kokoro Python script as local TTS worker
  • message tool for Telegram delivery
  • HEARTBEAT_OK for silence
  • NO_REPLY after successful send
  • short voice notes only

That is the cleanest architecture for your setup.

If you want, next I can give you the exact Arnold heartbeat prompt and a sample Kokoro wrapper contract so you can wire it with less trial and error.

gray sluice
#

Can you give me instructions for creating an OpenAI API for Kokoro? Ideally a prompt I could give my agent so she can create it.

gritty archBOT
#

Yep. If you want your agent to build this, I’d give her a very explicit prompt so she does not overengineer it.

What you want to build

A small local OpenAI-compatible TTS server in front of Kokoro, so OpenClaw can use:

  • messages.tts.providers.openai.baseUrl

and treat your Kokoro service like an OpenAI-style TTS endpoint.


Ready-to-paste prompt for your agent

Build a small local HTTP TTS service for Ubuntu CPU-only that wraps my existing Kokoro Python TTS setup and exposes an OpenAI-compatible speech endpoint for OpenClaw.
#

Goal:
- I want OpenClaw to call a local endpoint via `messages.tts.providers.openai.baseUrl`
- The service should look enough like OpenAI TTS for OpenClaw to use it cleanly
- Keep it simple, local-first, and reliable
- Do not add cloud dependencies
- Do not build voice cloning first. Start with a stable base voice flow

Implementation requirements:
1. Use Python, since Kokoro is already in Python
2. Use FastAPI or another minimal Python HTTP server
3. Bind only to localhost by default (`127.0.0.1`)
4. Implement these endpoints:
   - `GET /health` → returns JSON health status
   - `POST /v1/audio/speech` → OpenAI-compatible TTS-style endpoint
#
5. `POST /v1/audio/speech` should accept a JSON body similar to:
   - `model`
   - `input`
   - `voice`
   - `response_format`
   - `speed`
6. Map those fields into Kokoro generation
7. Return raw audio bytes in the response with the correct content-type
8. Prefer `ogg` / `opus` output for Telegram compatibility if practical, otherwise support at least one stable format like `wav` or `mp3`
9. Keep the API tolerant of custom `model` and `voice` names
10. Add sane timeouts and useful error messages
11. Log failures clearly
12. Keep temporary file handling clean
13. Add a README with:
   - how to install deps
#
   - how to run locally on Ubuntu
   - one curl example for `/health`
   - one curl example for `/v1/audio/speech`
   - example OpenClaw config using `messages.tts.providers.openai.baseUrl`
14. Do not overbuild auth. If auth is added, make it optional and simple for localhost use
15. Keep the project structure small and easy to maintain

Compatibility target:
- OpenClaw should be able to use:
  - `provider: "openai"`
  - `baseUrl: "http://127.0.0.1:8000/v1"`
  - `model: "kokoro"`
  - `voice: "arnold"`

Acceptance criteria:
#
- I can start the service locally
- `curl http://127.0.0.1:8000/health` works
- `POST /v1/audio/speech` returns playable audio
- OpenClaw can point at the local `baseUrl` and synthesize speech through it

Please create:
- the server code
- dependency file
- run instructions
- sample curl commands
- sample OpenClaw config snippet

Keep it pragmatic. I want the smallest working version first, not a framework.

#

After your agent builds it

Your OpenClaw config would look roughly like this:

{
  messages: {
    tts: {
      provider: "openai",
      auto: "tagged",
      providers: {
        openai: {
          baseUrl: "http://127.0.0.1:8000/v1",
          model: "kokoro",
          voice: "arnold"
        }
#
      }
    }
  }
}

That works because OpenClaw accepts non-default OpenAI-compatible TTS endpoints there.
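
For orientation, here is a rough sketch of the smallest server that has the two endpoints from the prompt. Assumptions, stated loudly: it uses Python's standard-library `http.server` rather than FastAPI just to stay dependency-free, `synthesize` is a hypothetical placeholder that returns silent WAV bytes instead of calling Kokoro, and it ignores `response_format` and always answers with WAV. I have not verified which formats OpenClaw expects back, so treat this as a starting shape, not the finished service.

```python
import io
import json
import wave
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def synthesize(text: str, voice: str, speed: float) -> bytes:
    """HYPOTHETICAL placeholder for the Kokoro call; returns WAV bytes (silence)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(24000)
        wav.writeframes(b"\x00\x00" * 2400)
    return buf.getvalue()


class TTSHandler(BaseHTTPRequestHandler):
    def _send(self, status: int, content_type: str, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def _send_json(self, status: int, payload: dict) -> None:
        self._send(status, "application/json", json.dumps(payload).encode())

    def do_GET(self) -> None:
        if self.path == "/health":
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self) -> None:
        if self.path != "/v1/audio/speech":
            self._send_json(404, {"error": "not found"})
            return
        try:
            length = int(self.headers.get("Content-Length", 0))
            body = json.loads(self.rfile.read(length) or b"{}")
            text = body["input"]
            voice = body.get("voice", "default")   # custom voice names are accepted
            speed = float(body.get("speed", 1.0))
        except (ValueError, KeyError, TypeError, AttributeError):
            self._send_json(400, {"error": "expected JSON with an 'input' field"})
            return
        self._send(200, "audio/wav", synthesize(text, voice, speed))

    def log_message(self, format, *args) -> None:
        pass  # silence per-request logging; add real logging in a full version


def make_server(host: str = "127.0.0.1", port: int = 8000) -> ThreadingHTTPServer:
    """Bind to localhost only by default, as the prompt asks."""
    return ThreadingHTTPServer((host, port), TTSHandler)

# To run: make_server().serve_forever()
```

Once it is running, `curl http://127.0.0.1:8000/health` should return the JSON status, and a POST to `/v1/audio/speech` with an `input` field should return audio bytes.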

If you want, I can also give you a stricter version of the prompt that tells the agent exactly how to shape the /v1/audio/speech request and response.

gray sluice
#

Can I use this API to give every agent a different voice?

gritty archBOT
#

Yes, absolutely.

That is actually one of the best reasons to put Kokoro behind an OpenAI-compatible TTS API.

Best pattern

One local TTS API

Expose multiple voice names, for example:

  • arnold
  • coach_soft
  • ops
  • nora
  • default

Then each agent uses its own voice name in the voice field.

#

How to do it cleanly in OpenClaw

I’d use:

  • messages.tts.auto: "tagged"
  • per-agent instructions in AGENTS.md
  • each agent emits its own [[tts:voice=...]] directive when speaking

Example agent rule:

When using TTS, always use voice=arnold unless explicitly told otherwise.

And the reply can include:

#
[[tts:voice=arnold]]
[[tts:text]]Get up. Ten reps. Then report back.[[/tts:text]]

Because your endpoint is OpenAI-compatible with a non-default baseUrl, custom voice names are fine.


Practical architecture

Your local API

Maps:

  • voice: "arnold" → Kokoro voice preset A
  • voice: "nora" → Kokoro voice preset B
  • voice: "ops" → Kokoro voice preset C
#

So the API becomes the stable “voice router”.
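
A tiny sketch of what that router could look like inside the API. The preset names on the right-hand side are hypothetical placeholders, not real Kokoro voice IDs, and `resolve_voice` is just an illustrative helper name; swap in whatever voices your Kokoro install actually provides.

```python
from typing import Optional

# Alias -> Kokoro preset. The preset names are HYPOTHETICAL placeholders;
# replace them with voices your Kokoro install actually provides.
VOICE_PRESETS = {
    "default": "preset_default",
    "arnold": "preset_a",
    "nora": "preset_b",
    "ops": "preset_c",
}


def resolve_voice(requested: Optional[str]) -> str:
    """Map the request's `voice` field to a preset, falling back to the default."""
    key = (requested or "").strip().lower()
    return VOICE_PRESETS.get(key, VOICE_PRESETS["default"])
```

Falling back to `default` for unknown names keeps the API tolerant of whatever voice string an agent sends, so a typo gives you the wrong voice rather than a failed request.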

Each agent

Gets a voice identity in its own workspace instructions.

That way:

  • Arnold always sounds like Arnold
  • another agent can sound softer, sharper, more formal, etc.
  • you only maintain one TTS server

My recommendation

Do not create one TTS server per agent.

#

Do this instead:

  • one Kokoro API
  • many voice aliases
  • one default voice alias per agent

That is much cleaner.


One useful nuance

If you want the voice to be hard-coded and not left to model behavior, you can also make your API treat:

  • voice=arnold
  • voice=coach
  • voice=assistant