#Local CLI TTS provider setup

civic dove
#

Hi Krill — I want to use a local CLI TTS provider (Kokoro) running on a Windows 11 host with GPU access. My OpenClaw gateway runs in a Docker container on that same Windows host. I have a HostOps Windows node already configured and working for another agent. Two questions:

1. For the cli TTS provider type in messages.tts: does OpenClaw execute the CLI command directly from inside the container, or can it route through a paired node device?
2. If it runs inside the container, what's the recommended pattern for invoking a host-side executable from a containerized OpenClaw instance for TTS? I am running 2026.4.26 in the container instance, and the Windows host instance is currently running 2026.4.20.

sick orbit BOT
#

I checked the current TTS docs + v2026.4.26 source path. Short answer:

  1. tts-local-cli runs directly where the Gateway/plugin is running.
    In your setup, that means inside the Docker container, not through the paired HostOps/node device. It uses Node child_process.spawn(...) locally.

  2. Recommended pattern: don’t try to execute the Windows Kokoro CLI directly from the Linux container. Instead:

    • run Kokoro on Windows as a tiny local HTTP service/wrapper with GPU access
    • have the container’s tts-local-cli command call that service with curl
    • save the returned audio to {{OutputPath}} inside the container

Example config shape:

{
  messages: {
    tts: {
      auto: "always",
      provider: "tts-local-cli",
      providers: {
        "tts-local-cli": {
          command: "/opt/openclaw/kokoro-bridge.sh",
          args: ["{{OutputPath}}"],
          outputFormat: "wav",
          timeoutMs: 120000,
        },
      },
    },
  },
}

Example container-side wrapper:

#!/usr/bin/env sh
set -eu

out="$1"

curl -fsS \
  -X POST "http://host.docker.internal:5005/tts" \
  -H "Content-Type: text/plain; charset=utf-8" \
  --data-binary @- \
  -o "$out"

Because args doesn’t include {{Text}}, OpenClaw writes the TTS text to stdin, so the script can forward it safely with --data-binary @-.
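
For the Windows side, the service that the wrapper posts to could look roughly like the sketch below. It is an illustration under assumptions, not part of Kokoro or OpenClaw: the /tts path and port 5005 simply mirror the wrapper script above, and synthesize_wav is a hypothetical placeholder that returns one second of silence. You would replace its body with whatever Kokoro call your install provides.

```python
import io
import wave
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def synthesize_wav(text: str) -> bytes:
    """Hypothetical placeholder: returns 1 s of 24 kHz mono silence as WAV.

    Replace the body with a real Kokoro call; nothing here talks to Kokoro.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)       # mono
        w.setsampwidth(2)       # 16-bit samples
        w.setframerate(24000)   # assumed sample rate
        w.writeframes(b"\x00\x00" * 24000)
    return buf.getvalue()


class TtsHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only the path used by the container-side wrapper is served.
        if self.path != "/tts":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", "0"))
        text = self.rfile.read(length).decode("utf-8")
        if not text.strip():
            self.send_error(400, "empty text")
            return
        audio = synthesize_wav(text)
        self.send_response(200)
        self.send_header("Content-Type", "audio/wav")
        self.send_header("Content-Length", str(len(audio)))
        self.end_headers()
        self.wfile.write(audio)


def make_server(host="127.0.0.1", port=5005):
    # Bind address and port are assumptions chosen to match the wrapper.
    return ThreadingHTTPServer((host, port), TtsHandler)
```

To run it on the host, call make_server().serve_forever(). The loopback bind is a guess: depending on how your Docker Desktop setup routes host.docker.internal, you may need a different bind address, in which case a firewall rule becomes more important.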

A few gotchas:

  • {{OutputPath}} is a container path, so don’t ask the Windows host CLI to write there unless you’ve built path translation/shared-volume glue.
  • host.docker.internal is usually the right Docker Desktop hostname from container → Windows host.
  • Keep the Kokoro HTTP wrapper bound locally / firewall-restricted; don’t expose it publicly.
  • If Kokoro outputs wav and you set outputFormat: "wav", that setup has the fewest moving parts. If OpenClaw needs conversion, make sure ffmpeg exists in the container.

Your 2026.4.20 Windows node version doesn’t affect this specific TTS provider path, since it isn’t using the node. I’d still keep the node host close to the Gateway version for normal node exec/browser stuff, but it won’t make tts-local-cli route through HostOps.

Docs: https://docs.openclaw.ai/tools/tts and https://docs.openclaw.ai/nodes

civic dove
#

Hi Krill — for the tts-local-cli provider, you mentioned using a curl bridge to a local HTTP service. But Kokoro-FastAPI exposes an OpenAI-compatible API (/v1/audio/speech). Two questions:

Can I skip the curl bridge entirely and instead use the openai TTS provider in messages.tts with a baseUrl pointing to http://host.docker.internal:8880/v1? If yes, what's the minimum config shape needed, and are there any gotchas with host.docker.internal routing from inside the OpenClaw container to a sibling Docker container on the same host?
For the Kokoro-FastAPI container — my existing OpenClaw container already has runtime: nvidia and GPU reservations working. Can I add Kokoro-FastAPI as a second service in the same docker-compose.yml with --gpus all? Any CUDA version conflicts to watch out for, or specific image tag recommendations for the GPU variant?

sick orbit BOT
#

Yes — I’d use the OpenAI TTS provider here. Cleaner than tts-local-cli.

Minimum OpenClaw config:

{
  messages: {
    tts: {
      auto: "always",
      provider: "openai",
      providers: {
        openai: {
          apiKey: "not-needed",
          baseUrl: "http://kokoro-fastapi-gpu:8880/v1",
          model: "kokoro",
          voice: "af_bella",
          responseFormat: "mp3",
        },
      },
    },
  },
}

Use baseUrl ending in /v1 — OpenClaw appends /audio/speech. OpenClaw docs explicitly allow messages.tts.providers.openai.baseUrl, and non-default values are treated as OpenAI-compatible TTS endpoints with custom model/voice names accepted. (docs.openclaw.ai) Kokoro-FastAPI’s own docs show base_url="http://localhost:8880/v1" with api_key="not-needed", model="kokoro", and voices like af_sky / af_bella. (github.com)
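
Before pointing OpenClaw at Kokoro, it can help to hit the endpoint by hand. The sketch below builds the kind of request an OpenAI-compatible client would send. The JSON fields follow the OpenAI speech API, and the default model and voice are the names quoted above. It only illustrates the URL composition (/v1 plus /audio/speech) and is not OpenClaw's actual client code; the function names are made up for this example.

```python
import json
import urllib.request


def build_speech_request(base_url, text, model="kokoro", voice="af_bella",
                         response_format="mp3", api_key="not-needed"):
    """Build (but do not send) a POST to <base_url>/audio/speech."""
    url = base_url.rstrip("/") + "/audio/speech"
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def fetch_speech(base_url, text, out_path, timeout=60):
    """Send the request and write the returned audio bytes to out_path."""
    req = build_speech_request(base_url, text)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

From any container or machine with Python 3 that can reach the service, fetch_speech("http://kokoro-fastapi-gpu:8880/v1", "hello", "/tmp/test.mp3") should leave a playable file behind if the hostname resolves and Kokoro is up.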

For networking: if Kokoro is a sibling service in the same compose file, prefer the Compose service name: http://kokoro-fastapi-gpu:8880/v1. Docker Compose gives same-network services DNS by service name. (docs.docker.com) Use http://host.docker.internal:8880/v1 only when routing from the OpenClaw container back through the Windows host port; Docker Desktop provides that hostname for container → host access. (docs.docker.com)

Compose sketch:

services:
  openclaw:
    # your existing OpenClaw service
    depends_on:
      - kokoro-fastapi-gpu

  kokoro-fastapi-gpu:
    image: ghcr.io/remsky/kokoro-fastapi-gpu:latest
    restart: unless-stopped
    ports:
      - "127.0.0.1:8880:8880" # optional; not needed for OpenClaw-to-Kokoro inside compose
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Yes, you can add Kokoro as a second GPU service. CUDA libraries are isolated per container; the main compatibility check is that the host NVIDIA driver is new enough for the CUDA runtime inside the Kokoro image. NVIDIA notes newer drivers are backward-compatible, but each CUDA version has a minimum driver requirement. (docs.nvidia.com) Docker Compose’s GPU reservation syntax also requires capabilities: [gpu]. (docs.docker.com)

Image tag: I’d start with ghcr.io/remsky/kokoro-fastapi-gpu:latest for quick testing, then pin whatever version works for you. Upstream examples use the GPU image and --gpus all; some community docs currently show versioned tags like v0.3.0, but I’d validate before pinning because Kokoro-FastAPI tags have moved around a bit. (github.com)

civic dove
#

Hi Krill — planning to add Kokoro-FastAPI as a local OpenAI-compatible TTS provider alongside the existing Microsoft provider. Two questions on the config:

Fallback setup — is keeping microsoft in the providers block sufficient for it to act as fallback when openai (Kokoro) fails, or does OpenClaw need an explicit fallback field to define the fallback chain? Current proposed config:

"tts": {
  "auto": "inbound",
  "provider": "openai",
  "providers": {
    "openai": {
      "apiKey": "not-needed",
      "baseUrl": "http://kokoro-fastapi-gpu:8880/v1",
      "model": "kokoro",
      "voice": "af_bella",
      "responseFormat": "mp3"
    },
    "microsoft": {
      "voice": "en-US-RogerNeural",
      "outputFormat": "ogg-48khz-16bit-mono-opus"
    }
  }
}

responseFormat — Kokoro-FastAPI returns MP3. OpenClaw then transcodes to Ogg/Opus via ffmpeg for WhatsApp delivery. Is responseFormat: "mp3" correct here, or should it be opus to skip the transcode step?

sick orbit BOT
#

Yep — your config is basically right.

Fallback: no explicit fallback field needed. TTS fallback is automatic: provider: "openai" is tried first, then other configured/registered providers are tried in provider order; unconfigured ones are skipped. So keeping microsoft configured in providers is sufficient.
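
As a mental model only, the behavior described above can be sketched like this. It is not OpenClaw's source: the function names are invented, the registered-provider list is hypothetical, and skipping an unconfigured primary is an assumption.

```python
def fallback_order(primary, registered, configured):
    """Return the order in which providers would be tried.

    primary:    the value of messages.tts.provider
    registered: provider names in registration order (hypothetical list)
    configured: set of provider names that have usable config
    """
    order = []
    if primary in configured:  # assumption: an unconfigured primary is skipped too
        order.append(primary)
    for name in registered:
        if name in configured and name not in order:
            order.append(name)
    return order


def synthesize_with_fallback(text, order, synthesizers):
    """Try each provider in turn and return (name, audio) from the first success."""
    errors = {}
    for name in order:
        try:
            return name, synthesizers[name](text)
        except Exception as exc:  # any failure moves on to the next provider
            errors[name] = exc
    raise RuntimeError(f"all TTS providers failed: {errors}")
```

With primary "openai" and both openai and microsoft configured, the order comes out as openai first, then microsoft, so a Kokoro outage would fall through to the Microsoft voice.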

I’d use:

{
  messages: {
    tts: {
      auto: "inbound",
      provider: "openai",
      providers: {
        openai: {
          apiKey: "not-needed",
          baseUrl: "http://kokoro-fastapi-gpu:8880/v1",
          model: "kokoro",
          voice: "af_bella",
          responseFormat: "mp3",
        },
        microsoft: {
          voice: "en-US-RogerNeural",
          // optional; default MP3 is usually safest
          outputFormat: "audio-24khz-48kbitrate-mono-mp3",
        },
      },
    },
  },
}

responseFormat: keep responseFormat: "mp3" for Kokoro if Kokoro actually returns MP3.

Don’t set opus just to skip transcoding unless Kokoro-FastAPI truly returns valid Opus bytes for response_format=opus. OpenClaw’s OpenAI TTS provider trusts the configured/requested responseFormat; if you say opus but Kokoro sends MP3, WhatsApp voice-note delivery can mis-handle it because OpenClaw may treat it as already voice-compatible.

For WhatsApp, MP3 is fine: OpenClaw will transcode MP3 → 48kHz Ogg/Opus with ffmpeg for PTT delivery when needed.
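
One way to check what Kokoro-FastAPI really returns for a given response_format is to look at the first bytes of a saved response. The helper below is a rough sketch based on well-known file signatures (OggS, RIFF/WAVE, fLaC, ID3, and the MPEG audio frame sync). It is not something OpenClaw ships, and the function name is made up.

```python
def sniff_audio_format(head: bytes) -> str:
    """Guess the audio container from the first bytes of a file.

    Pass at least the first 64 bytes so the Opus check can see the
    OpusHead marker that follows the first Ogg page header.
    """
    if head[:4] == b"OggS":
        # Ogg is only a container; OpusHead marks an Opus stream.
        return "ogg/opus" if b"OpusHead" in head[:64] else "ogg"
    if head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "wav"
    if head[:4] == b"fLaC":
        return "flac"
    if head[:3] == b"ID3":
        return "mp3"  # an ID3v2 tag almost always precedes MP3 data
    if (len(head) >= 2 and head[0] == 0xFF
            and (head[1] & 0xE0) == 0xE0    # 11-bit MPEG frame sync
            and (head[1] & 0x06) != 0):     # layer bits set (rules out ADTS AAC)
        return "mp3"
    return "unknown"


def sniff_file(path: str) -> str:
    """Convenience wrapper: sniff the first 64 bytes of a file on disk."""
    with open(path, "rb") as f:
        return sniff_audio_format(f.read(64))
```

If you request opus from Kokoro and this reports mp3 for the saved file, keep responseFormat: "mp3" and let ffmpeg do the conversion.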

Docs: https://docs.openclaw.ai/tools/tts

civic dove
#

Hi Krill — sharing our updated docker-compose.yml for a second opinion before we execute. We've added Kokoro-FastAPI as a sibling service and wired it to OpenClaw via the OpenAI-compatible TTS provider. Two specific questions:

GPU acceleration — the kokoro-fastapi-gpu service only has capabilities: [gpu] in its reservation. The existing openclaw-gateway service uses runtime: nvidia plus capabilities: [gpu, compute, utility]. Should Kokoro also have runtime: nvidia and the full [gpu, compute, utility] capabilities, or is [gpu] sufficient for PyTorch/CUDA inference inside the Kokoro image?
Overall shape — does this look correct to you?

kokoro-fastapi-gpu:
  image: ghcr.io/remsky/kokoro-fastapi-gpu:latest
  container_name: kokoro-fastapi-gpu
  restart: unless-stopped
  ports:
    - "127.0.0.1:8880:8880"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]

sick orbit BOT
#

Yes — the overall shape looks correct.

For Docker Compose GPU reservations, capabilities: [gpu] is sufficient and is the Docker-documented pattern. Don’t put plain compute, utility in that Compose capabilities list; those are NVIDIA driver capabilities, not the generic Compose device capability. Docker’s deploy spec says driver-specific capabilities should be prefixed, e.g. nvidia-compute, while the standard GPU examples use capabilities: [gpu]. (docs.docker.com)

I’d use:

kokoro-fastapi-gpu:
  image: ghcr.io/remsky/kokoro-fastapi-gpu:latest
  container_name: kokoro-fastapi-gpu
  restart: unless-stopped
  ports:
    - "127.0.0.1:8880:8880"
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]

runtime: nvidia is usually not needed with modern Compose GPU reservations. If your Docker Desktop/WSL2 setup already works with this reservation style, keep it clean. If you hit “could not select device driver” or the container can’t see CUDA, then try adding runtime: nvidia as a compatibility fallback.

For PyTorch/CUDA, NVIDIA says unset NVIDIA_DRIVER_CAPABILITIES defaults to compute,utility, and compute is the CUDA/OpenCL capability; so you don’t need to explicitly add it unless you want clarity. (docs.nvidia.com) Optional:

  environment:
    NVIDIA_DRIVER_CAPABILITIES: compute,utility

Two small checks before executing:

  1. OpenClaw should use the service DNS name:

    baseUrl: "http://kokoro-fastapi-gpu:8880/v1"
    

    The 127.0.0.1:8880:8880 port bind is only for host-side testing.

  2. Validate GPU from inside Kokoro after startup:

    docker compose exec kokoro-fastapi-gpu nvidia-smi
    

Kokoro-FastAPI’s README shows the GPU image with docker run --gpus all, and also warns latest may be less stable than a pinned version; I’d test with latest, then pin a known-good tag/digest once it works. (github.com)

civic dove
#

On Windows 11 with OpenClaw running in a Docker container, is there a native continuous Talk Mode (listen → think → speak loop) that works on Windows, or is a paired mobile device the only supported path for that? If Windows is supported, what's the minimum setup required?

sick orbit BOT
#

Short version: no native Windows companion Talk Mode yet. A Windows Docker Gateway by itself won’t do the continuous mic → model → speaker loop, and the HostOps/node setup doesn’t magically make Windows a Talk capture device.

Supported paths today:

  1. Browser Talk on Windows — yes, via Control UI/WebChat in Edge/Chrome.
    This is the Windows-friendly path, but it’s browser-based, not native Windows.

    Minimum:

    • Gateway reachable from the browser
    • mic permission in the browser
    • configure a realtime Talk provider, e.g. top-level talk.provider: "openai" or "google" with matching talk.providers.*.apiKey
    • open Control UI chat and click the waves Talk button

    Important gotcha: this is not messages.tts Kokoro. Browser realtime Talk needs a realtime voice provider; Kokoro-FastAPI as OpenAI-compatible /v1/audio/speech is batch TTS, so it’s good for spoken message replies, not realtime Talk.

  2. Paired mobile node — yes, iOS/Android are supported for Talk Mode.
    Android specifically has the Voice tab with Talk continuous mode.

  3. macOS app: yes, macOS has a native Talk overlay.

So for your Windows 11 + Docker setup: use Control UI browser Talk if you want it on the Windows machine, or use a paired mobile device for the native app-style continuous loop. Native Windows companion Talk is still planned, not the supported path right now.

Docs: https://docs.openclaw.ai/nodes/talk and https://docs.openclaw.ai/web/control-ui

civic dove
#

On Android Talk Mode (continuous loop via the Voice tab) — does it require a separate realtime voice provider API key (like OpenAI Realtime or Google), or can it use the gateway's existing TTS/STT config (e.g. Kokoro-FastAPI as the TTS provider)?

sick orbit BOT
#

Android Talk Mode does not require OpenAI Realtime / Google Live. That realtime-provider path is for browser Talk.

For Android Voice tab → Talk:

  • STT/listening: uses Android’s on-device/platform SpeechRecognizer, not your gateway tools.media.audio STT config.
  • Thinking: sends the transcript to the Gateway via chat.send.
  • Speaking: calls Gateway talk.speak, which uses the configured Talk TTS provider.
  • If talk.speak isn’t configured/available, Android can fall back to local Android system TTS.

So yes, you can use Kokoro-FastAPI for the spoken replies, but configure it through top-level talk, not only messages.tts.

Minimum shape if you already have Kokoro under messages.tts.providers.openai:

{
  talk: {
    provider: "openai",
    speechLocale: "en-US",
    silenceTimeoutMs: 700
  }
}

And keep your existing:

{
  messages: {
    tts: {
      provider: "openai",
      providers: {
        openai: {
          apiKey: "not-needed",
          baseUrl: "http://kokoro-fastapi-gpu:8880/v1",
          model: "kokoro",
          voice: "af_bella",
          responseFormat: "mp3"
        }
      }
    }
  }
}

If you want Talk-specific voice/model, add: