# Which local LLM(s) to run on DGX Spark while OpenClaw is on a base Mac mini M4?
If your goal is “OpenClaw as an agent that uses tools” (browser/exec/files/etc.), the #1 thing to optimize for on the DGX box isn’t raw size: it’s reliable tool/function calling + strict JSON. A lot of local open-weight models (especially big 4‑bit quants) can be impressive at reasoning but still flaky at tools, which makes the agent loop feel broken.
So I’d approach it like this:
- Decide what you want the local model to do
  - Tool-heavy agent work: prioritize models/endpoints that are known to be rock-solid at tools/JSON.
  - Long “thinking / writing / analysis” tasks: your DeepSeek‑R1‑Distill‑Llama‑70B 4‑bit experiment can make sense here, but don’t be surprised if tool calling is brittle.
- Use a 2-model strategy (common setup)
  - Keep a tool-reliable model as primary (or at least as a fallback) for when the agent needs to act.
  - Use the big reasoning model when you explicitly want deep analysis (or as a manual switch), not necessarily as the only model driving everything.
- Integration path (LAN DGX → Mac mini gateway)
  OpenClaw is happiest when the DGX serves an OpenAI-compatible endpoint. The easiest “it just works” local path OpenClaw documents today is Ollama, including auto-discovery of tool-capable models.
Docs:
- Models overview: https://docs.openclaw.ai/concepts/models
- Ollama (local models + auto-discovery + remote baseUrl): https://docs.openclaw.ai/providers/ollama
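The 2-model strategy from the list above boils down to a simple decision rule. Here's a minimal sketch of that rule, assuming each turn knows whether it carries tool definitions and whether you flipped a manual "deep analysis" switch. The model refs are placeholders, and this is not how OpenClaw selects models internally; it only illustrates the routing idea.

```python
"""Hypothetical router for the 2-model strategy: tool-using turns go to a
tool-reliable model, explicit deep-analysis turns go to a reasoning model.
Model refs below are placeholders, not real OpenClaw identifiers."""

from dataclasses import dataclass, field

TOOL_MODEL = "spark/tool-reliable-model"       # placeholder ref
REASONING_MODEL = "spark/big-reasoning-model"  # placeholder ref


@dataclass
class Turn:
    prompt: str
    tools: list = field(default_factory=list)  # OpenAI-style tool definitions
    deep_analysis: bool = False                 # manual "use the big model" switch


def pick_model(turn: Turn) -> str:
    # Anything that needs to act (browser/exec/files) goes to the tool-reliable model.
    if turn.tools:
        return TOOL_MODEL
    # The reasoning model is only used when explicitly requested.
    if turn.deep_analysis:
        return REASONING_MODEL
    return TOOL_MODEL


if __name__ == "__main__":
    print(pick_model(Turn("open the docs page", tools=[{"type": "function"}])))
    print(pick_model(Turn("analyze this design doc", deep_analysis=True)))
```

Tools take precedence over the deep-analysis flag on purpose: a turn that must act should never land on a model that is brittle at tool calls.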
If you tell me:
- whether you’re serving on the DGX via Ollama vs vLLM/TGI/etc, and
- whether you want 100% local or “local + a cloud fallback for tools”,
I can give a more concrete recommendation for how to structure the model/fallback setup in OpenClaw.
Also: for specific “which exact model names are best this week on DGX Spark”, the server’s #models channel tends to have the freshest, battle-tested picks (this changes constantly).
Using vLLM deployment on the spark, I want 100% local. Is there an LLM that will use the spark to the maximum capability while easily interfacing with openclaw?
Yes — vLLM is a good fit for OpenClaw as long as you expose an OpenAI‑compatible /v1 API and you pick a model that’s reliable at tool/function calling + strict JSON (that’s the real limiter for “agentic” use).
How to interface OpenClaw ⇄ vLLM (LAN)
OpenClaw doesn’t need anything DGX-specific — it just needs a provider pointing at your vLLM base URL.
In ~/.openclaw/openclaw.json, add a custom provider (use your Spark’s IP/hostname):
```json5
{
  agents: {
    defaults: {
      model: { primary: "spark/my-model-id" },
    },
  },
  models: {
    mode: "merge",
    providers: {
      spark: {
        baseUrl: "http://DGX_SPARK_LAN_IP:8000/v1",
        apiKey: "sk-local", // any string if your vLLM doesn’t enforce auth
        api: "openai-completions", // safest default for vLLM
        models: [
          {
            id: "my-model-id",
            name: "My Spark Model",
            reasoning: false,
            input: ["text"],
            cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
            contextWindow: 128000, // set to what you actually serve
            maxTokens: 8192
          }
        ]
      }
    }
  }
}
```
Quick sanity checks:
```bash
curl http://DGX_SPARK_LAN_IP:8000/v1/models
openclaw models status
```
Docs (this is the canonical guide): https://docs.openclaw.ai/gateway/local-models
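A common failure here is a model ref in the config that doesn't exactly match the id the server reports. Below is a minimal sketch of the `/v1/models` sanity check as a script, assuming the server is reachable at `BASE_URL` and doesn't enforce auth; `BASE_URL` and the helper names are hypothetical.

```python
"""Sketch: confirm an OpenClaw model ref like "spark/my-model-id" matches an id
reported by GET /v1/models on an OpenAI-compatible server.
Assumes no auth is enforced; BASE_URL and helper names are hypothetical."""

import json
import urllib.request

BASE_URL = "http://DGX_SPARK_LAN_IP:8000/v1"  # replace with your Spark's IP/hostname


def served_ids(models_json: dict) -> list[str]:
    # OpenAI-compatible servers return {"object": "list", "data": [{"id": ...}, ...]}
    return [m["id"] for m in models_json.get("data", [])]


def ref_matches(model_ref: str, models_json: dict, provider: str = "spark") -> bool:
    # Strip only the leading "provider/" prefix: served ids may themselves contain "/".
    prefix = provider + "/"
    if not model_ref.startswith(prefix):
        return False
    return model_ref[len(prefix):] in served_ids(models_json)


def fetch_models(base_url: str = BASE_URL) -> dict:
    # GET {base_url}/models and parse the JSON body.
    with urllib.request.urlopen(base_url + "/models", timeout=10) as resp:
        return json.load(resp)


# Usage (needs the Spark reachable from the Mac mini):
# models = fetch_models()
# print(served_ids(models), ref_matches("spark/my-model-id", models))
```

Only the first `provider/` segment is stripped because ids like `nvidia/Llama-3.1-8B-Instruct-FP8` contain a slash of their own.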
“Use the Spark to the maximum capability” (model choice reality)
To really “max out” the Spark and have OpenClaw work smoothly:
- Prefer the largest instruct model you can run without aggressive 4‑bit quant, because heavy quant often hurts tool/JSON reliability (the agent loop suffers).
- Make sure the model/server supports tool/function calling in the OpenAI-compatible format (otherwise OpenClaw can’t reliably use browser/exec/etc).
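A quick bit of arithmetic shows why "largest model without aggressive 4‑bit quant" matters on this box. This is a rough weight-only estimate (parameters × bytes per parameter), assuming the DGX Spark's 128 GB of unified memory. It ignores KV cache, activations, and runtime overhead, so real headroom is smaller.

```python
"""Rough weight-memory estimate: parameters x bytes per parameter.
Assumes 128 GB of unified memory on the DGX Spark and ignores KV cache,
activations, and runtime overhead (real headroom is smaller)."""

SPARK_MEMORY_GB = 128  # DGX Spark unified memory, shared by CPU and GPU

BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "fp8": 1.0,
    "4-bit": 0.5,
}


def weight_gb(params_billion: float, precision: str) -> float:
    # (params_billion * 1e9) * bytes / 1e9 bytes-per-GB: the 1e9 factors cancel.
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, memory_fraction: float = 1.0) -> bool:
    # memory_fraction plays the role of something like vLLM's --gpu-memory-utilization.
    return weight_gb(params_billion, precision) <= SPARK_MEMORY_GB * memory_fraction


for precision in BYTES_PER_PARAM:
    gb = weight_gb(70, precision)
    print(f"70B @ {precision}: ~{gb:.0f} GB of weights, fits at 0.68: {fits(70, precision, 0.68)}")
```

At a 0.68 memory fraction the budget is roughly 87 GB. A 70B model's FP16 weights (~140 GB) don't fit at all, FP8 (~70 GB) fits with some room left for KV cache, and 4‑bit (~35 GB) fits easily but is where tool/JSON reliability tends to suffer.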
I’m avoiding naming a specific “best model of the week” here because it changes fast and I’d likely give you stale info; the server’s #models channel is where folks post the current “works well with tools on vLLM” picks.
If you paste:
- your vLLM launch args, and
- what `/v1/models` reports as the model id,
I can tell you whether your endpoint + OpenClaw config look correct, and what to tweak (API mode, contextWindow/maxTokens, etc.).
```
["deepseek-ai/DeepSeek-R1-Distill-Llama-70B","--host","0.0.0.0","--port","8000","--dtype","half","--max-model-len","16384","--gpu-memory-utilization","0.68","--enable-prefix-caching","--served-model-name","DeepSeek-R1-Distill-Llama-70B","--quantization","bitsandbytes","--chat-template","{% for m in messages %}{% if m["role"] == "system" %}System: {{ m["content"] }}\n{% elif m["role"] == "user" %}User: {{ m["content"] }}\n{% elif m["role"] == "assistant" %}Assistant: {{ m["content"] }}\n{% endif %}{% endfor %}Assistant:"]
```
DeepSeek-R1-Distill-Llama-70B
For OpenClaw, the “easy interface” part is: your vLLM must behave like a clean OpenAI /v1 server and return proper tool_calls when tools are provided. Raw reasoning ability (R1) isn’t enough by itself.
1) Wire OpenClaw to your vLLM (using your served model name)
Since you’re serving:
--served-model-name DeepSeek-R1-Distill-Llama-70B
…then in OpenClaw you want the model ref to be:
spark/DeepSeek-R1-Distill-Llama-70B
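The mapping from launch args to model ref is mechanical, so here's a small sketch of it. It assumes the first element of the argument list is the model path (as in `vllm serve <model> ...`), and relies on vLLM reporting the model path as the id when `--served-model-name` is omitted. It only handles the space-separated flag form and takes the first name if several are given.

```python
"""Sketch: derive the OpenClaw model ref ("<provider>/<served name>") from a
vLLM argument list. Assumes args[0] is the model path, and that without
--served-model-name vLLM reports the model path as the model id.
Handles only the "--flag value" form, not "--flag=value"."""


def served_model_name(args: list[str]) -> str:
    if "--served-model-name" in args:
        # If several names are given, the first one is what responses report.
        return args[args.index("--served-model-name") + 1]
    return args[0]  # fall back to the model path


def openclaw_ref(args: list[str], provider: str = "spark") -> str:
    return f"{provider}/{served_model_name(args)}"


args = [
    "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    "--port", "8000",
    "--max-model-len", "16384",
    "--served-model-name", "DeepSeek-R1-Distill-Llama-70B",
]
print(openclaw_ref(args))  # spark/DeepSeek-R1-Distill-Llama-70B
```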
Example config (Mac mini → Spark over LAN):
```json5
{
  agents: {
    defaults: {
      model: { primary: "spark/DeepSeek-R1-Distill-Llama-70B" },
    },
  },
  models: {
    mode: "merge",
    providers: {
      spark: {
        baseUrl: "http://DGX_SPARK_LAN_IP:8000/v1",
        apiKey: "sk-local",
        api: "openai-completions",
        models: [
          {
            id: "DeepSeek-R1-Distill-Llama-70B",
            name: "DeepSeek R1 Distill Llama 70B",
            reasoning: true,
            input: ["text"],
            cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
            contextWindow: 16384,
            maxTokens: 8192
          }
        ]
      }
    }
  }
}
```
Docs: https://docs.openclaw.ai/gateway/local-models
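The config above sets `contextWindow` to the same 16384 as `--max-model-len`, which is the point: vLLM will typically reject a request whose prompt plus requested output exceeds the served length. A hypothetical consistency check is sketched below; the field names mirror the config, but the helper itself is an illustration, not something OpenClaw runs.

```python
"""Hypothetical check that an OpenClaw model entry agrees with vLLM's
--max-model-len. Field names mirror the config; the helper is illustrative."""


def max_model_len(args: list[str]) -> int:
    return int(args[args.index("--max-model-len") + 1])


def check_entry(entry: dict, args: list[str]) -> list[str]:
    problems = []
    served = max_model_len(args)
    # Claiming more context than the server serves leads to rejected requests.
    if entry["contextWindow"] > served:
        problems.append(
            f"contextWindow {entry['contextWindow']} exceeds --max-model-len {served}"
        )
    # Output tokens come out of the same window as the prompt.
    if entry["maxTokens"] >= entry["contextWindow"]:
        problems.append("maxTokens leaves no room for the prompt")
    return problems


args = ["deepseek-ai/DeepSeek-R1-Distill-Llama-70B", "--max-model-len", "16384"]
entry = {"id": "DeepSeek-R1-Distill-Llama-70B", "contextWindow": 16384, "maxTokens": 8192}
print(check_entry(entry, args) or "ok")  # ok
```

Even when it passes, 16384 − 8192 leaves about 8192 tokens for the system prompt, tool schemas, and history combined, which an agent loop can use up quickly.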
2) The big gotcha: tool calling needs to be enabled in vLLM
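OpenClaw’s agent loop depends on tool/function calling. With vLLM, that means starting the server with tool calling enabled: the `--enable-auto-tool-choice` flag plus a `--tool-call-parser` that matches the model’s output format. Your current launch args include neither.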
vLLM docs (authoritative flags + parsers): https://docs.vllm.ai/en/latest/features/tool_calling/
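Also: I’d strongly recommend you try removing your custom --chat-template first. Tool-call parsers tend to assume specific templates, and custom “System:/User:/Assistant:” formatting is a common way to accidentally break tool calling / JSON reliability. Your template in particular only renders each message’s role and content: it never references the `tools` variable, so tool definitions never reach the prompt, and messages with role `tool` (tool results) are silently dropped.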
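You can catch this kind of template problem before launching. The sketch below is a plain string heuristic, not a Jinja parse, assuming the template would receive tool definitions as a `tools` variable (which is how vLLM passes them to chat templates). A template that never mentions `tools` can't render them, and one that never mentions the `"tool"` role drops tool results.

```python
"""Heuristic check of a Jinja chat template for tool support.
Plain string/regex matching, not a Jinja parse, so it can be fooled by
comments or macros; treat a False as a strong warning sign."""

import re


def template_tool_support(template: str) -> dict:
    return {
        # Tool definitions are passed to the template as the `tools` variable.
        "renders_tool_definitions": re.search(r"\btools\b", template) is not None,
        # Tool results come back as messages whose role is "tool".
        "handles_tool_role": re.search(r"""["']tool["']""", template) is not None,
    }


custom = (
    '{% for m in messages %}{% if m["role"] == "system" %}System: {{ m["content"] }}\n'
    '{% elif m["role"] == "user" %}User: {{ m["content"] }}\n'
    '{% elif m["role"] == "assistant" %}Assistant: {{ m["content"] }}\n'
    '{% endif %}{% endfor %}Assistant:'
)
print(template_tool_support(custom))
```

For the template in the launch args, both checks come back `False`, which matches the diagnosis above.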
3) Sanity test before involving OpenClaw
From the Mac mini, confirm:
- `curl http://DGX_SPARK_LAN_IP:8000/v1/models`
- Then try a minimal `/v1/chat/completions` request with a `tools` array and verify the response contains `tool_calls` (not just plain text telling you what it would do).
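Here's a minimal sketch of that test using only the Python standard library. The `add` tool, `BASE_URL`, and the `sk-local` key are placeholders, and the request/response shape follows the OpenAI chat completions format that vLLM exposes. Note that `"tool_choice": "auto"` requires the server to have been started with `--enable-auto-tool-choice`.

```python
"""Sketch of a minimal tool-calling smoke test against an OpenAI-compatible
/v1/chat/completions endpoint. The `add` tool, BASE_URL, and API key are
placeholders; standard library only."""

import json
import urllib.request

BASE_URL = "http://DGX_SPARK_LAN_IP:8000/v1"  # placeholder


def build_payload(model: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": "What is 2 + 2? Use the add tool."}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "add",
                "description": "Add two integers.",
                "parameters": {
                    "type": "object",
                    "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
                    "required": ["a", "b"],
                },
            },
        }],
        "tool_choice": "auto",
    }


def send(payload: dict, base_url: str = BASE_URL) -> dict:
    # POST the JSON payload; passing data= makes urllib send a POST.
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "Authorization": "Bearer sk-local"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


def has_tool_calls(response: dict) -> bool:
    # Pass = a structured tool_calls list, not prose describing the call.
    message = response["choices"][0]["message"]
    return bool(message.get("tool_calls"))


# Usage (needs the Spark reachable from the Mac mini):
# resp = send(build_payload("DeepSeek-R1-Distill-Llama-70B"))
# print(has_tool_calls(resp))
```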
4) About “maxing out the Spark” with DeepSeek R1 Distill 70B
You can run it, but R1-distill style models are often less reliable at strict tool/JSON output, especially when quantized. If you hit issues like “it reasons but never calls tools” or malformed tool args, that’s usually a model/tooling mismatch rather than an OpenClaw config problem.
If you want, paste:
- the output of `curl .../v1/models`, and
- one sample `/v1/chat/completions` call + response with tools enabled,
and I’ll tell you whether the endpoint is “OpenClaw-ready” or what to adjust.
```json
{
  "object": "list",
  "data": [
    {
      "id": "nvidia/Llama-3.1-8B-Instruct-FP8",
      "object": "model",
      "created": 1771268614,
      "owned_by": "tensorrt_llm"
    }
  ]
}
```
```json
{
  "id": "chatcmpl-a4e3381011be497c879802e715d2d94a",
  "object": "chat.completion",
  "created": 1771268623,
  "model": "nvidia/Llama-3.1-8B-Instruct-FP8",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "",
        "reasoning_content": null,
        "reasoning": null,
        "tool_calls": [
          {
            "id": "chatcmpl-tool-8c3e67fb78ca4cf0aeb2e24fd3c4c37c",
            "type": "function",
            "function": {
              "name": "add",
              "arguments": "add(a=2, b=2)"
            }
          }
        ]
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "mm_embedding_handle": null,
      "disaggregated_params": null,
      "avg_decoded_tokens_per_iter": 1.0
    }
  ],
  "usage": {
    "prompt_tokens": 29,
    "total_tokens": 39,
    "completion_tokens": 10,
    "prompt_tokens_details": {
      "cached_tokens": 23
    }
  },
  "prompt_token_ids": null
}
```
On that setup (Mac mini runs the OpenClaw gateway + tools, DGX Spark serves the LLM over LAN), the big decision isn’t “which biggest model fits” — it’s which model+serving stack can reliably do tool/function calling. Most “works great in chat” local models fall apart once you require strict JSON/tool args.
1) What your snippet is already telling you
This part:
- `tool_calls` with `function.name = "add"`, but
- `arguments = "add(a=2, b=2)"` (not JSON)
…is a classic sign the model is not actually function-calling tuned (or the server isn’t enforcing a JSON schema / structured output mode). OpenClaw expects tool arguments to be valid JSON, so you’ll see flaky/broken tool use with that kind of output.
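A quick way to spot this automatically is to try parsing every tool call's `arguments` string as a JSON object, which is what the paragraph above says OpenClaw expects. A minimal sketch, assuming the OpenAI-style response shape shown in the paste; the helper name is hypothetical.

```python
"""Sketch: flag tool calls whose `arguments` string is not a JSON object.
Assumes the OpenAI-style chat.completion response shape."""

import json


def bad_tool_args(response: dict) -> list[str]:
    problems = []
    for call in response["choices"][0]["message"].get("tool_calls") or []:
        fn = call["function"]
        try:
            args = json.loads(fn["arguments"])
        except json.JSONDecodeError:
            problems.append(f"{fn['name']}: arguments not JSON: {fn['arguments']!r}")
            continue
        # Valid JSON that isn't an object (e.g. a list) still can't map to named params.
        if not isinstance(args, dict):
            problems.append(f"{fn['name']}: arguments are not a JSON object")
    return problems


seen = {"choices": [{"message": {"tool_calls": [
    {"type": "function", "function": {"name": "add", "arguments": "add(a=2, b=2)"}}
]}}]}
expected = {"choices": [{"message": {"tool_calls": [
    {"type": "function", "function": {"name": "add", "arguments": '{"a": 2, "b": 2}'}}
]}}]}
print(bad_tool_args(seen))      # one problem: arguments not JSON
print(bad_tool_args(expected))  # []
```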
2) Practical recommendation for OpenClaw right now
- Use a “tool-reliable” model for the main agent (the one that drives browser/exec/message/etc).
- If you still want to use the DGX Spark, run it as a secondary agent with tools disabled, for:
- drafting responses
- summarizing
- rewriting/formatting
- (optionally) RAG-style Q&A where tool calls aren’t required
This split tends to feel much better than trying to force local models to be the tool-using “brains”.
3) If you do want the DGX model to drive tools
Then pick based on these criteria (not just “70B 4-bit”):
- Proven tool/function calling (strict JSON args, consistent)
- Low hallucination under schema pressure
- Server supports OpenAI-compatible tool calling well (some stacks “accept” tools but don’t constrain outputs)
If tool calls still look like add(a=2,b=2) instead of JSON, it’s not ready for agent control.
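"Proven" and "consistent" are easiest to judge with a number: send the same tool prompt many times and measure how often the arguments parse as JSON and satisfy the tool's schema. The sketch below assumes you've already collected the `function.arguments` strings from repeated `/v1/chat/completions` calls. Its schema check only covers `required` keys and basic integer/string types; a real harness might use a full JSON Schema validator instead.

```python
"""Sketch of a tool-call reliability score: fraction of trials whose arguments
parse as a JSON object and pass a minimal schema check (required keys plus
integer/string types only, not full JSON Schema)."""

import json

TYPE_CHECKS = {
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def call_ok(arguments: str, schema: dict) -> bool:
    try:
        args = json.loads(arguments)
    except json.JSONDecodeError:
        return False
    if not isinstance(args, dict):
        return False
    if any(key not in args for key in schema.get("required", [])):
        return False
    for key, spec in schema.get("properties", {}).items():
        check = TYPE_CHECKS.get(spec.get("type"))
        if key in args and check and not check(args[key]):
            return False
    return True


def success_rate(argument_strings: list[str], schema: dict) -> float:
    if not argument_strings:
        return 0.0
    return sum(call_ok(a, schema) for a in argument_strings) / len(argument_strings)


schema = {"type": "object", "required": ["a", "b"],
          "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}}}
trials = ['{"a": 2, "b": 2}', "add(a=2,b=2)", '{"a": "2", "b": 2}', '{"a": 2}']
print(success_rate(trials, schema))  # 0.25
```

The string `"2"` failing the integer check is deliberate: "low hallucination under schema pressure" means getting types right, not just producing parseable JSON.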
4) Best next step (so people can give you accurate, current picks)
Can you drop this into #models (they track what actually works this month on local stacks):
- what serving stack on the DGX? (TensorRT-LLM vs vLLM vs Ollama vs LM Studio vs other)
- do you need OpenClaw to use tools with the local model, or is “chat-only local” fine?
- what latency target (interactive vs batch)?
Docs to anchor the config side (and community notes):