#Phone/voice setup on VPS
Assuming you mean phone-call voice on a VPS:
- Run the built-in checks first
  openclaw --version
  openclaw voicecall setup
  openclaw voicecall latency --last 200
  The latency command is the best way to see whether the bottleneck is listening/STT, model response, or playback.
- Use realtime for lowest-latency conversations
  - realtime.enabled = full-duplex live voice.
  - streaming.enabled = transcription stream + normal agent loop.
  - Don’t enable both; pick one.
- Keep the spoken agent short
  In your realtime instructions, add something like:
  Speak briefly. Ask one question at a time. Use deeper tools only when necessary.
- Use a fast voice model + “consult” only when needed
  For realtime calls, let the realtime provider handle normal back-and-forth, and only call the regular OpenClaw agent for deeper tool use/reasoning. That keeps calls from feeling sluggish.
- Use stable public webhook exposure
  Prefer a stable domain / Tailscale Funnel over changing ngrok URLs. Set publicUrl and webhookSecurity.allowedHosts, then rerun:
  openclaw voicecall setup
- Tune silence/endpointing carefully
  Defaults are around 800ms. Lower values can feel snappier but may cut people off; higher values feel slower but safer.
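To make the endpointing trade-off concrete, here is a minimal Python sketch of silence-based end-of-turn detection. It is not OpenClaw's or any provider's implementation; the 20 ms frame size, the per-frame speech flags, and the function name `end_of_turn_ms` are assumptions made purely for illustration.

```python
from typing import Optional, Sequence

FRAME_MS = 20  # assumed frame length, chosen for illustration


def end_of_turn_ms(is_speech: Sequence[bool], silence_ms: int = 800) -> Optional[int]:
    """Return the time (ms) at which the turn is declared finished, or None.

    The turn ends once speech has been heard and is then followed by
    `silence_ms` of continuous silence.
    """
    needed = silence_ms // FRAME_MS  # silent frames required to close the turn
    heard_speech = False
    silent_run = 0
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return (i + 1) * FRAME_MS
    return None


# A caller speaks for 1 s, pauses 500 ms mid-thought, speaks 1 s more, then stops.
frames = [True] * 50 + [False] * 25 + [True] * 50 + [False] * 60

print(end_of_turn_ms(frames, silence_ms=800))  # 3300: waits through the 500 ms pause
print(end_of_turn_ms(frames, silence_ms=400))  # 1400: cuts the caller off mid-thought
```

With the 800 ms threshold the agent answers 800 ms after the caller really finishes; with 400 ms it answers faster but interrupts during the mid-thought pause, which is the trade-off described above.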
If you mean Discord voice channels instead of phone calls, the optimization path is different: check ffmpeg/ffprobe, /vc status, DAVE decrypt logs, and Discord voice config.
Before we set up voice on Hetzner VPS, I want to optimize Roger for maximum human-like dialog — everything else is secondary.
The goal: a conversation that feels like a real human. Natural flow, understanding, insight, no scripted feel. Responds immediately when natural, intentional pauses when it makes sense.
I'm willing to compromise on all other OpenClaw functionality to achieve this.
What is the optimal configuration regarding:
- Identity/persona setup
- Which LLM model gives the best conversational quality here
- Heartbeat/loading — how much does Roger load per call, and what minimizes latency
- ElevenLabs voice — which settings give the most natural speech (stability, similarity, style etc)
- Twilio — anything that can be optimized for low latency
- Pauses and response timing — can OpenClaw control natural pauses and speaking rhythm
And a specific question: In OpenClaw there appears to be both a Google voice integration and an OpenClaw Voice Call plugin — are these two separate paths, and which do you recommend for this use case on VPS?
First: please paste openclaw --version before you apply any exact config — voice-call changed quickly, so version matters.
My take: optimize for realtime voice, not “LLM + STT + ElevenLabs TTS”.
For maximum human-like dialog on a Hetzner VPS, I’d use:
Twilio phone number → OpenClaw Voice Call plugin → realtime voice provider → OpenClaw consult only when needed.
The Google voice integration and Voice Call plugin are not competing paths:
- Voice Call plugin = telephony layer: Twilio/Telnyx/Plivo, inbound/outbound calls, webhooks, call state.
- Google/OpenAI realtime voice = voice brain/audio engine used inside Voice Call.
- Talk mode / Google browser voice = different local/UI/node path, not the normal PSTN phone-call path.
So for VPS phone calls: use Voice Call + realtime provider. Google can be the realtime provider behind it. OpenClaw docs describe Voice Call as the telephony plugin and Google as a realtime voice provider for backend bridges like Voice Call/Meet. (docs.openclaw.ai)
Recommended shape:
{
  agents: {
    list: [
      {
        id: "roger-voice",
        workspace: "~/.openclaw/workspace-roger-voice"
      }
    ],
    defaults: {
      heartbeat: {
        every: "0m"
      }
    }
  },
  plugins: {
    entries: {
      "voice-call": {
        enabled: true,
        config: {
          provider: "twilio",
          fromNumber: "+YOUR_TWILIO_NUMBER",
          publicUrl: "https://voice.yourdomain.com/voice/webhook",
          webhookSecurity: {
            allowedHosts: ["voice.yourdomain.com"]
          },
          agentId: "roger-voice",
          realtime: {
            enabled: true,
            provider: "google",
            instructions: "You are Roger's live phone voice. Sound like a thoughtful human, not an assistant. Keep replies short. Do not monologue. If the caller pauses mid-thought, stay quiet. If a question needs tools or memory, briefly say you're checking, then call the consult tool. Never narrate system behavior.",
            toolPolicy: "safe-read-only",
            providers: {
              google: {
                activityHandling: "start-of-activity-interrupts",
                turnCoverage: "only-activity"
              }
            }
          },
          streaming: { enabled: false }
        }
      }
    }
  }
}
For identity/persona: make a dedicated roger-voice workspace. Keep SOUL.md, IDENTITY.md, and USER.md tiny. Long persona files increase prompt/context load; OpenClaw injects workspace files like AGENTS.md, SOUL.md, IDENTITY.md, USER.md, etc. into context, with per-file and total bootstrap caps. (docs.openclaw.ai)
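Since those files are injected into context, it can help to measure them before and after trimming. The following is a rough Python sketch, not an OpenClaw tool: the file names come from the message above, while `ASSUMED_BUDGET_CHARS` is an arbitrary illustrative number and not an OpenClaw cap.

```python
from pathlib import Path

# File names taken from the message above; the budget is an arbitrary
# illustration, not an OpenClaw limit.
PERSONA_FILES = ("AGENTS.md", "SOUL.md", "IDENTITY.md", "USER.md")
ASSUMED_BUDGET_CHARS = 2000


def persona_sizes(workspace: Path) -> dict[str, int]:
    """Map each persona file that exists in `workspace` to its character count."""
    sizes = {}
    for name in PERSONA_FILES:
        path = workspace / name
        if path.is_file():
            sizes[name] = len(path.read_text(encoding="utf-8"))
    return sizes


def over_budget(sizes: dict[str, int], budget: int = ASSUMED_BUDGET_CHARS) -> bool:
    """True when the combined persona text exceeds the chosen budget."""
    return sum(sizes.values()) > budget


if __name__ == "__main__":
    ws = Path("~/.openclaw/workspace-roger-voice").expanduser()
    sizes = persona_sizes(ws)
    for name, chars in sizes.items():
        print(f"{name}: {chars} chars")
    print("over budget" if over_budget(sizes) else "within budget")
```

In a Docker setup the workspace lives inside the container, so the path would need to point at wherever that directory is mounted.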
For LLM choice: don’t route normal conversation through a text LLM if “human-like” is priority. Use a native realtime voice model/provider for the live loop, and let it call OpenClaw only for “deeper brain” moments. For the freshest model pick, ask in #models — I don’t want to hardcode stale model advice here. The key architecture choice is realtime voice first, normal OpenClaw agent second.
For heartbeat/loading latency:
- Heartbeat does not need to run for calls.
- Disable it for the voice agent or set skipWhenBusy.
- Keep HEARTBEAT.md empty/tiny.
- Use /context list to see what Roger is loading.
- Use openclaw voicecall latency --last 200 after tests.
OpenClaw heartbeat can use lightContext and isolatedSession, but for a voice-first agent I’d just disable heartbeats unless you need proactive behavior. (docs.openclaw.ai)
For ElevenLabs: only use ElevenLabs if the custom voice matters more than end-to-end flow. It can sound gorgeous, but STT → LLM → TTS usually feels less alive than realtime voice. If you do use it, start around:
voiceSettings: {
  stability: 0.5,
  similarityBoost: 0.8,
  style: 0.2,
  useSpeakerBoost: true,
  speed: 0.98
}
Lower stability gives more emotional variation; higher can become monotone. style and useSpeakerBoost can improve character/similarity but may add latency. ElevenLabs also notes Flash/Turbo-style models trade quality for latency, while more expressive models are higher quality but slower. (elevenlabs.io)
For Twilio latency:
- Use bidirectional Media Streams / <Connect><Stream> for realtime AI calls.
- Put the VPS close to the caller/Twilio region where possible.
- Use a stable public HTTPS/WSS endpoint, not free ngrok.
- Avoid extra reverse-proxy buffering.
- Twilio media back to the call is audio/x-mulaw at 8000 Hz, base64 payloads; OpenClaw handles the bridge, but that explains why telephony will never sound like studio audio. (twilio.com)
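To illustrate what "audio/x-mulaw at 8000 Hz, base64 payloads" means in practice, here is a small Python sketch that decodes one Media Streams "media" event into 16-bit PCM samples. It is not OpenClaw's bridge code; the function names are made up for this example, and the message is a hand-built stand-in for what Twilio sends over the WebSocket.

```python
import base64
import json


def mulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte into a signed 16-bit PCM sample."""
    u = ~byte & 0xFF  # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_media_message(raw: str) -> list[int]:
    """Turn one Media Streams "media" event into a list of PCM samples."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return []
    payload = base64.b64decode(msg["media"]["payload"])
    return [mulaw_to_pcm16(b) for b in payload]


# 160 bytes of mu-law silence (0xFF) = 20 ms at 8000 samples per second.
example = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(b"\xff" * 160).decode("ascii")},
})
samples = decode_media_message(example)
print(len(samples), len(samples) / 8000)  # 160 0.02
```

One byte per sample at 8000 samples per second is only 8 kB/s of audio, which is why the phone leg caps the achievable fidelity no matter how good the voice model is.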
For pauses/rhythm: OpenClaw can influence this, but not perfectly “conduct” it. The realtime provider’s VAD decides when a turn is complete. Google/OpenAI realtime APIs both support automatic voice activity detection; OpenClaw exposes knobs like silence duration / sensitivity depending on provider. Start with defaults, then tune only after listening tests. (ai.google.dev)
Bottom line: Voice Call plugin + realtime provider + tiny persona + consult only when needed. That’s the path I’d use for “feels like a real person.”
Does Google realtime voice support Danish? If not, is ElevenLabs the only viable path for Danish language calls, and how does that change the recommended architecture?
Short answer: yes, Google realtime voice supports Danish now. Google’s current Live API docs list Danish (da) among supported languages and say native-audio models can switch languages naturally; they also say native-audio models don’t let you explicitly set a language code, so steer the language via instructions instead. (ai.google.dev)
So I’d not switch to ElevenLabs-first yet. Keep the recommended architecture:
Voice Call plugin + Twilio + realtime.provider: "google"
Add Danish-specific realtime instructions:
realtime: {
  enabled: true,
  provider: "google",
  instructions: "Du taler dansk som en naturlig, varm og kortfattet person. Svar på dansk, medmindre brugeren beder om noget andet. Hold pauser naturligt. Afbryd ikke brugeren midt i en tanke."
}
Caveat: Google’s Vertex Live docs still show a narrower configurable language_code list that does not include Danish, but that mainly matters for non-native/configured-language paths. For this use case, prefer the native realtime audio path and prompt it to Danish. (docs.cloud.google.com)
If Google’s Danish voice quality feels off in testing, then fallback architecture becomes:
Twilio → Voice Call streaming STT → OpenClaw/text LLM → ElevenLabs Danish TTS
ElevenLabs definitely supports Danish for TTS, including newer multilingual models, and their STT supports 90+ languages; OpenClaw’s ElevenLabs docs also mention Voice Call streaming STT via Scribe v2 Realtime. (help.elevenlabs.io)
But tradeoff: ElevenLabs path may sound prettier, Google realtime will usually feel more conversational because it avoids the separate STT → LLM → TTS loop. My recommendation: test Google Danish first, then only switch if the Danish accent/prosody is unacceptable.
OpenClaw runs in Docker on the VPS. openclaw --version not found on host. Found /var/lib/docker/rootfs/overlayfs/.../usr/local/bin/openclaw inside container. How do I get the version and run openclaw commands — should I exec into the container, or is there a docker compose command?
Use Docker Compose from the directory that has docker-compose.yml. Don’t run the binary out of /var/lib/docker/... — that’s Docker’s internal overlay storage.
Try this first:
cd /path/to/openclaw # folder with docker-compose.yml
docker compose ps
docker compose exec openclaw-gateway openclaw --version
If openclaw isn’t on PATH inside the container, use the documented fallback:
docker compose exec openclaw-gateway node dist/index.js --version
docker compose exec openclaw-gateway node dist/index.js status
docker compose exec openclaw-gateway node dist/index.js voicecall setup
Some Compose setups also define an openclaw-cli helper service. If yours does:
docker compose run --rm openclaw-cli --version
docker compose run --rm openclaw-cli status
docker compose run --rm openclaw-cli voicecall setup
For logs:
docker compose logs -f openclaw-gateway
So: prefer docker compose exec openclaw-gateway ... for your Hetzner setup unless your compose file includes openclaw-cli.
Docs: https://docs.openclaw.ai/install/docker and https://docs.openclaw.ai/install/hetzner
OpenClaw 2026.4.26
Perfect — 2026.4.26 is recent enough for the realtime Voice Call path we’ve been discussing.
From the folder with docker-compose.yml, run:
docker compose exec openclaw-gateway node dist/index.js status --all
docker compose exec openclaw-gateway node dist/index.js plugins list
docker compose exec openclaw-gateway node dist/index.js voicecall setup
If voicecall says command not found, install/enable the plugin:
docker compose exec openclaw-gateway node dist/index.js plugins install @openclaw/voice-call
docker compose restart openclaw-gateway
Then rerun:
docker compose exec openclaw-gateway node dist/index.js voicecall setup
For your next debugging paste, the most useful outputs are:
docker compose ps
docker compose exec openclaw-gateway node dist/index.js plugins list
docker compose exec openclaw-gateway node dist/index.js voicecall setup
docker compose logs --tail=100 openclaw-gateway
Don’t paste raw .env or full config with secrets. status --all is designed to be safer/redacted.
is there a newer version of claw I should install?
No — 2026.4.26 is currently latest. I checked npm directly:
latest: 2026.4.26
beta: 2026.4.26
So I’d not update right now. Continue with voice setup on that version.
For Docker later, don’t run openclaw update inside the container. Update the image/deploy instead:
git pull
docker compose build
docker compose up -d
If you’re using the prebuilt image tag:
docker compose pull
docker compose up -d
docker compose exec openclaw-gateway node dist/index.js status --all
docker compose exec openclaw-gateway node dist/index.js plugins list
docker compose exec openclaw-gateway node dist/index.js voicecall setup
● Running all three commands now.
● Bash(docker compose exec openclaw-gateway node dist/index.js status --all 2>&1)
⎿  OpenClaw status --all
Overview
… +34 lines (ctrl+o to expand)
● Bash(docker compose exec openclaw-gateway node dist/index.js plugins list 2>&1)
⎿  Plugins (68/115 enabled)
Source roots:
stock: /app/dist/extensions
… +228 lines (ctrl+o to expand)
● Bash(docker compose exec openclaw-gateway node dist/index.js voicecall setup 2>&1)
⎿  Error: Exit code 1
error: unknown command 'voicecall'
● Here's a summary of each command's result:
status --all — Gateway is running (v2026.4.26, Node 24.14.0). Key issues:
- Gateway connection has missing scope: operator.read — health check can't auth
- Port 18789 in use (by the gateway itself — expected)
- No channels configured
- 1 agent (main) active, 2 sessions, but bootstrap file is ABSENT
- Tailscale exposure is off
plugins list — 68 of 115 plugins enabled. Output was large (44KB); full list saved. Notable: active-memory is disabled,
providers like Alibaba and Amazon Bedrock are enabled.
voicecall setup — Failed: unknown command 'voicecall'. This command doesn't exist in the installed version. You may be
thinking of a different command name — you can check available commands with:
docker compose exec openclaw-gateway node dist/index.js --help
The most actionable issue is the operator.read scope missing on the gateway auth token, which is blocking gateway health
checks.
That unknown command 'voicecall' means the voice-call plugin command didn’t load — not that 2026.4.26 lacks it.
Run this next from the Compose dir:
docker compose exec openclaw-gateway node dist/index.js plugins inspect voice-call
If it exists but is disabled:
docker compose exec openclaw-gateway node dist/index.js plugins enable voice-call
docker compose restart openclaw-gateway
docker compose exec openclaw-gateway node dist/index.js voicecall setup
If inspect voice-call says missing/not found, install it:
docker compose exec openclaw-gateway node dist/index.js plugins install @openclaw/voice-call
docker compose restart openclaw-gateway
docker compose exec openclaw-gateway node dist/index.js voicecall setup
If it’s still unknown after restart, check whether your config has a restrictive plugins.allow; if so, voice-call must be included there.
The operator.read scope issue is separate — annoying for health/status, but it’s not why voicecall is unknown. Also: no channels configured / Tailscale off / bootstrap absent aren’t blockers for phone voice.
Docs: https://docs.openclaw.ai/plugins/voice-call and https://docs.openclaw.ai/cli/plugins
Inspecting the voice-call plugin first.
● Bash(docker compose exec openclaw-gateway node dist/index.js plugins inspect voice-call 2>&1)
⎿  @openclaw/voice-call
id: voice-call
OpenClaw voice-call plugin
… +11 lines (ctrl+o to expand)
● It's present and bundled, just disabled by default. Enabling it and running setup.
● Bash(docker compose exec openclaw-gateway node dist/index.js plugins enable voice-call 2>&1)
⎿  Config overwrite: /home/node/.openclaw/openclaw.json (sha256 965778f3abb29c03a4c4ddc1dc0fde72d2be9293991994f4813401
f01202c5fb -> f9b41b0d7fe2f3fbb0c3019459a6788ef26a4196739a52bbd564810831909e73, backup=/home/node/.openclaw/opencla
w.json.bak)
Enabled plugin "voice-call". Restart the gateway to apply.
● Bash(docker compose restart openclaw-gateway 2>&1)
⎿  Container openclaw-openclaw-gateway-1 Restarting
Container openclaw-openclaw-gateway-1 Started
● Bash(docker compose exec openclaw-gateway node dist/index.js voicecall setup 2>&1)
⎿  Voice Call setup: OK
OK plugin-enabled: Voice Call plugin is enabled
OK provider: Provider configured: mock
… +3 lines (ctrl+o to expand)
● All green. Voice-call was bundled but disabled by default; enabling it was all it needed.
Setup checks out:
- Plugin enabled
- Provider: mock (configured with credentials complete)
- Webhook: not needed for mock provider
- Mode: notify/conversation using normal TTS/STT flow
If you want to switch from mock to a real provider (e.g. Twilio), run voicecall setup --provider twilio and follow the
prompts to supply credentials.
Nice — that confirms the plugin side is fixed. 🎉
Small correction: voicecall setup --provider twilio is not a thing in 2026.4.26. voicecall setup only checks the current config.
Right now you’re still on provider: mock, so it proves the plugin loads, but it won’t make real calls yet. Next step is configuring Twilio + realtime Google.
High-level:
- Put secrets in Docker .env / env vars:
  TWILIO_ACCOUNT_SID=...
  TWILIO_AUTH_TOKEN=...
  TWILIO_FROM_NUMBER=+...
  GEMINI_API_KEY=...
- Recreate container so env changes load:
  docker compose up -d --force-recreate openclaw-gateway
- Set voice-call config:
  docker compose exec openclaw-gateway node dist/index.js config set --batch-json '[
    {"path":"plugins.entries.voice-call.config.provider","value":"twilio"},
    {"path":"plugins.entries.voice-call.config.fromNumber","value":"+YOUR_TWILIO_NUMBER"},
    {"path":"plugins.entries.voice-call.config.publicUrl","value":"https://voice.example.com/voice/webhook"},
    {"path":"plugins.entries.voice-call.config.webhookSecurity.allowedHosts","value":["voice.example.com"]},
    {"path":"plugins.entries.voice-call.config.streaming.enabled","value":false},
    {"path":"plugins.entries.voice-call.config.realtime.enabled","value":true},
    {"path":"plugins.entries.voice-call.config.realtime.provider","value":"google"},
    {"path":"plugins.entries.voice-call.config.realtime.instructions","value":"Du taler dansk som en naturlig, varm og kortfattet person. Svar på dansk, medmindre brugeren beder om noget andet. Afbryd ikke brugeren midt i en tanke. Hvis noget kræver værktøjer eller hukommelse, sig kort at du tjekker det."}
  ]'
- Restart + verify:
  docker compose restart openclaw-gateway
  docker compose exec openclaw-gateway node dist/index.js voicecall setup
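Hand-editing a long JSON array inside shell quotes is easy to get wrong, so one option is to generate the argument. The Python sketch below only builds and prints the command string; it assumes the `config set --batch-json` form and the config paths exactly as quoted in the command above, and the helper name `build_batch` is invented for this example.

```python
import json
import shlex

PREFIX = "plugins.entries.voice-call.config."

# Keys and values mirror the command above; replace the placeholders.
settings = {
    "provider": "twilio",
    "fromNumber": "+YOUR_TWILIO_NUMBER",
    "publicUrl": "https://voice.example.com/voice/webhook",
    "webhookSecurity.allowedHosts": ["voice.example.com"],
    "streaming.enabled": False,
    "realtime.enabled": True,
    "realtime.provider": "google",
}


def build_batch(values: dict) -> str:
    """Serialize settings as a [{"path": ..., "value": ...}] JSON array."""
    ops = [{"path": PREFIX + key, "value": value} for key, value in values.items()]
    # ensure_ascii=False keeps characters such as "å" readable in the output.
    return json.dumps(ops, ensure_ascii=False)


batch = build_batch(settings)
command = (
    "docker compose exec openclaw-gateway node dist/index.js "
    "config set --batch-json " + shlex.quote(batch)
)
print(command)
```

`shlex.quote` takes care of the shell quoting, so a realtime instructions string containing apostrophes or non-ASCII text can be added to `settings` without manual escaping.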
Important Docker/VPS bit: Twilio needs a public HTTPS/WSS webhook. If you reverse-proxy from nginx/Caddy on the host, make sure the voice-call webhook port/path is reachable and WebSocket upgrades work for the realtime stream too.
Docs: https://docs.openclaw.ai/plugins/voice-call and https://docs.openclaw.ai/install/docker
Does Google Gemini realtime voice (Live API) work on the free tier in Google AI Studio, or does it require billing? And for this use case (maximum human-like Danish phone conversation), is there any quality difference between using AI Studio vs Vertex AI?
Yes — Gemini Live can run on the Google AI Studio / Gemini Developer API free tier for testing. Google lists gemini-2.5-flash-native-audio-preview-12-2025 with free-tier input/output and describes it as Live API native audio optimized for better pacing, voice naturalness, verbosity, and mood. (ai.google.dev)
But for Roger’s use case, I’d still enable billing before real phone use:
- free tier has stricter/variable rate limits;
- preview/live models can have more restrictive limits;
- free-tier prompts/responses may be used to improve Google products, while paid tier says they are not. (ai.google.dev)
On AI Studio vs Vertex/Gemini Enterprise Agent Platform: I would not expect a meaningful voice-quality difference if you’re hitting the same model with the same config. The differences are mostly product/ops: API key vs service account, no-cost usage vs cloud billing/credits, quotas/regions, enterprise support/SLA, compliance, IAM, VPC, dedicated capacity, etc. (docs.cloud.google.com)
For maximum human-like Danish phone conversation, I’d do:
- Start with AI Studio / Gemini Developer API key in OpenClaw:
  GEMINI_API_KEY=...
- Use:
  realtime: {
    enabled: true,
    provider: "google",
    providers: {
      google: {
        model: "gemini-2.5-flash-native-audio-preview-12-2025"
      }
    }
  }
- Test Danish quality on real Twilio calls.
- Only move to Vertex/Gemini Enterprise if you need enterprise controls, billing governance, compliance, or dedicated capacity, not for expected better Danish voice quality.
So: AI Studio free tier is viable for proof-of-concept. Paid AI Studio/Gemini Developer API is my recommendation for real usage. Vertex is an ops/compliance upgrade, not the first quality upgrade.
For real phone usage with gemini-2.5-flash-native-audio-preview-12-2025 via the Live API — roughly how much would $10 USD cover in terms of call minutes? Is $10 enough to get started, or should I put in more?
For Gemini model cost only, $10 is absolutely enough to get started.
Google lists gemini-2.5-flash-native-audio-preview-12-2025 at $3 / 1M audio input tokens and $12 / 1M audio output tokens on paid tier. Gemini audio is roughly 32 tokens/sec = 1,920 tokens/minute. (ai.google.dev)
Rough math:
| Scenario | Model cost/min | $10 covers |
| ------------------------------------------------ | -------------- | ---------- |
| Caller audio all minute, Roger speaks ~⅓ of time | ~$0.013/min    | ~750 min   |
| Caller audio all minute, Roger speaks ~½ of time | ~$0.017/min | ~580 min |
| Worst-ish: input + Roger speaking 100% of time | ~$0.029/min | ~350 min |
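The table can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch using only the prices and the 32 tokens/sec figure quoted above; it assumes input audio is billed for the whole minute and output audio only while Roger is speaking, and it ignores any text or tool-call tokens.

```python
INPUT_USD_PER_MTOK = 3.0    # audio input price quoted above
OUTPUT_USD_PER_MTOK = 12.0  # audio output price quoted above
TOKENS_PER_MIN = 32 * 60    # 32 tokens/sec, as quoted above


def cost_per_minute(speaking_fraction: float) -> float:
    """Model cost in USD for one call minute.

    Assumes audio input is billed for the whole minute and audio output
    only for the fraction of the minute the agent is speaking.
    """
    input_cost = TOKENS_PER_MIN * INPUT_USD_PER_MTOK / 1_000_000
    output_cost = TOKENS_PER_MIN * speaking_fraction * OUTPUT_USD_PER_MTOK / 1_000_000
    return input_cost + output_cost


for fraction in (1 / 3, 1 / 2, 1.0):
    per_min = cost_per_minute(fraction)
    print(f"speaking {fraction:.0%}: ${per_min:.4f}/min, $10 covers ~{10 / per_min:.0f} min")
```

The unrounded results are about 744, 579 and 347 minutes, which the table rounds to ~750, ~580 and ~350.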
So for testing: put in $10 and you’re fine.
For real phone usage, budget extra for Twilio minutes + phone number. Twilio pricing varies by country, call direction, number type, and account pricing; Twilio exposes account-specific per-number/per-country pricing via their Voice Pricing API. (twilio.com)
My practical recommendation:
• $10 Gemini credit: enough for setup, tuning, lots of test calls.
• $25–50 total budget: more comfortable once Twilio is included and you start doing real Danish phone tests.
• Don’t overfund yet; measure actual call logs first.
For the public HTTPS webhook URL — we don't have a domain set up yet. Is Cloudflare Tunnel a good option for this, or do you recommend setting up a proper domain? What's the fastest/most stable path for production use?
Also: once voice is working, we want to set up a dedicated roger-voice workspace with optimized persona (tiny SOUL.md, IDENTITY.md as you described earlier). Does the domain/tunnel choice affect that setup at all?
Cloudflare Tunnel is a good option if you attach it to a stable hostname like:
https://voice.yourdomain.com/voice/webhook
I would not use a random/temporary tunnel URL for production. Twilio needs a stable webhook URL, and changing it later is annoying.
My recommendation:
- Fastest stable production path: buy/use a cheap domain → put it on Cloudflare → create a named Cloudflare Tunnel → route voice.yourdomain.com to the OpenClaw voice webhook.
- Also good: Caddy/Nginx directly on the VPS with Let’s Encrypt.
- Avoid for production: free/random ngrok or temporary Cloudflare quick tunnels.
Make sure Cloudflare Access / login protection is off for this route — Twilio must be able to call it directly. Rely on Twilio signature verification + OpenClaw config instead.
OpenClaw config shape:
{
  plugins: {
    entries: {
      "voice-call": {
        config: {
          publicUrl: "https://voice.yourdomain.com/voice/webhook",
          webhookSecurity: {
            allowedHosts: ["voice.yourdomain.com"]
          }
        }
      }
    }
  }
}
Also confirm the tunnel/reverse proxy supports WebSocket upgrades, since realtime voice uses streaming.
The domain/tunnel choice does not affect the dedicated roger-voice workspace. That’s separate:
{
  agents: {
    list: [
      {
        id: "roger-voice",
        workspace: "~/.openclaw/workspace-roger-voice"
      }
    ]
  },
  plugins: {
    entries: {
      "voice-call": {
        config: {
          agentId: "roger-voice"
        }
      }
    }
  }
}
So: set up stable public HTTPS first, then tune roger-voice persona/workspace independently.
Called +4552517979. Got "we are sorry an application error has occurred". What should we check first — Twilio logs or OpenClaw voice-call logs?
Check Twilio Debugger / call logs first. That message usually means Twilio tried your webhook and got an error, timeout, bad TwiML, TLS issue, or couldn’t reach it.
In parallel, tail OpenClaw logs while making one fresh test call:
docker compose logs -f --tail=200 openclaw-gateway
Then check Voice Call setup:
docker compose exec openclaw-gateway node dist/index.js voicecall setup
In Twilio, look for:
- HTTP status from your webhook request
- timeout / connection refused / TLS cert errors
- signature validation failures
- invalid TwiML
- exact webhook URL Twilio requested
In OpenClaw logs, look for:
- [voice-call]
- inbound webhook received / rejected
- missing provider config
- public URL / allowed host mismatch
- realtime provider errors
Fastest diagnosis question: does OpenClaw log anything at the exact time of the call?
- No OpenClaw log entry → domain/tunnel/proxy/Twilio webhook routing problem.
- OpenClaw receives it but rejects/errors → OpenClaw voice-call config or Twilio signature/publicUrl mismatch.
- OpenClaw accepts then errors after stream starts → realtime Google/Twilio Media Stream issue.
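For the signature/publicUrl mismatch case, it helps to see why the URL matters. The Python sketch below follows Twilio's documented scheme for form-encoded POST webhooks (HMAC-SHA1 over the full URL plus the sorted POST parameters, base64-encoded, keyed with the auth token). It is not OpenClaw's validation code, and the token, URLs, and form values are placeholders.

```python
import base64
import hashlib
import hmac


def twilio_signature(auth_token: str, url: str, params: dict[str, str]) -> str:
    """Compute the value Twilio sends in the X-Twilio-Signature header."""
    # The signed string is the URL followed by each key and value, sorted by key.
    data = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"), data.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid(auth_token: str, url: str, params: dict[str, str], header_value: str) -> bool:
    """Compare the expected signature with the received header in constant time."""
    expected = twilio_signature(auth_token, url, params)
    return hmac.compare_digest(expected, header_value)


# Placeholder values for illustration only.
token = "example_auth_token"
form = {"CallSid": "CA_placeholder", "From": "+15550100", "To": "+15550101"}
public_url = "https://voice.example.com/voice/webhook"

sig = twilio_signature(token, public_url, form)
print(is_valid(token, public_url, form, sig))                         # True
print(is_valid(token, "http://localhost/voice/webhook", form, sig))   # False
```

Because the URL is part of the signed string, a server behind a tunnel or reverse proxy that validates against its internal address instead of the public one will reject every genuine Twilio request, which is consistent with the advice to keep publicUrl and allowedHosts aligned with the hostname Twilio actually calls.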