#Which local LLM(s) to run on DGX Spark while OpenClaw is on a base Mac mini M4?


mighty steepleBOT
#

To help others find answers, you can mark your question as solved via Right click solution message -> Apps -> ✅ Mark Solution

solid anchorBOT
#

If your goal is “OpenClaw as an agent that uses tools” (browser/exec/files/etc), the #1 thing to optimize for on the DGX box isn’t raw size — it’s reliable tool/function calling + strict JSON. A lot of local/open weights (especially big 4‑bit quants) can be impressive at reasoning, but still flaky at tools, which makes the agent loop feel broken.

So I’d approach it like this:

  1. Decide what you want local to do
  • Tool-heavy agent work: prioritize models/endpoints that are known to be rock-solid at tools/JSON.
  • Long “thinking / writing / analysis” tasks: your DeepSeek‑R1‑Distill‑Llama‑70B 4‑bit experiment can make sense here, but don’t be surprised if tool calling is brittle.
  2. Use a 2-model strategy (common setup)
  • Keep a tool-reliable model as primary (or at least a fallback) for “agent needs to act”.
  • Use the big reasoning model when you explicitly want deep analysis (or as a manual switch), not necessarily as the only model driving everything.
  3. Integration path (LAN DGX → Mac mini gateway)
    OpenClaw is happiest when the DGX serves an OpenAI-compatible endpoint. The easiest “it just works” local path OpenClaw documents today is Ollama, including auto-discovery of tool-capable models.

If you tell me:

  • whether you’re serving on the DGX via Ollama vs vLLM/TGI/etc, and
  • whether you want 100% local or “local + a cloud fallback for tools”,
    I can give a more concrete recommendation for how to structure the model/fallback setup in OpenClaw.

Also: for specific “which exact model names are best this week on DGX Spark”, the server’s #models channel tends to have the freshest, battle-tested picks (this changes constantly).

hasty kettle
#

I'm using a vLLM deployment on the Spark, and I want 100% local. Is there an LLM that will use the Spark to its maximum capability while interfacing easily with OpenClaw?

solid anchorBOT
#

Yes — vLLM is a good fit for OpenClaw as long as you expose an OpenAI‑compatible /v1 API and you pick a model that’s reliable at tool/function calling + strict JSON (that’s the real limiter for “agentic” use).

How to interface OpenClaw ⇄ vLLM (LAN)

OpenClaw doesn’t need anything DGX-specific — it just needs a provider pointing at your vLLM base URL.

In ~/.openclaw/openclaw.json, add a custom provider (use your Spark’s IP/hostname):

{
  agents: {
    defaults: {
      model: { primary: "spark/my-model-id" },
    },
  },
  models: {
    mode: "merge",
    providers: {
      spark: {
        baseUrl: "http://DGX_SPARK_LAN_IP:8000/v1",
        apiKey: "sk-local",              // any string if your vLLM doesn’t enforce auth
        api: "openai-completions",       // safest default for vLLM
        models: [
          {
            id: "my-model-id",
            name: "My Spark Model",
            reasoning: false,
            input: ["text"],
            cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
            contextWindow: 128000,        // set to what you actually serve
            maxTokens: 8192
          }
        ]
      }
    }
  }
}

Quick sanity checks:

  • curl http://DGX_SPARK_LAN_IP:8000/v1/models
  • openclaw models status
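
The first check can also be scripted. This is a minimal Python sketch, assuming the endpoint returns the standard OpenAI-style /v1/models body; the helper names (list_model_ids, openclaw_refs, fetch_models) are made up for illustration and are not part of OpenClaw or vLLM. It turns whatever ids the server reports into the provider/model-id refs used in the config above:

```python
import json
import urllib.request


def list_model_ids(models_payload: dict) -> list:
    """Return the model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def openclaw_refs(provider: str, models_payload: dict) -> list:
    """Build "provider/model-id" strings, one per served model."""
    return [f"{provider}/{model_id}" for model_id in list_model_ids(models_payload)]


def fetch_models(base_url: str, timeout: float = 5.0) -> dict:
    """GET {base_url}/models. base_url is assumed to already end in /v1."""
    url = f"{base_url.rstrip('/')}/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling openclaw_refs("spark", fetch_models("http://DGX_SPARK_LAN_IP:8000/v1")) from the Mac mini should list the same id you put in the provider's models array; if it doesn't, fix the id before debugging anything else.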

Docs (this is the canonical guide): https://docs.openclaw.ai/gateway/local-models

“Use the Spark to the maximum capability” (model choice reality)

To really “max out” the Spark and have OpenClaw work smoothly:

  • Prefer the largest instruct model you can run without aggressive 4‑bit quant, because heavy quant often hurts tool/JSON reliability (the agent loop suffers).
  • Make sure the model/server supports tool/function calling in the OpenAI-compatible format (otherwise OpenClaw can’t reliably use browser/exec/etc).
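
For the quantization trade-off, a rough back-of-the-envelope helps. The sketch below only counts the weights (parameters × bits ÷ 8) and ignores KV cache and activations, so treat it as a lower bound. The 128 GB memory figure for the Spark and the 0.9 utilization (vLLM's default for --gpu-memory-utilization) are assumptions; plug in your own numbers:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         total_memory_gb: float = 128, utilization: float = 0.9) -> bool:
    """True if the weights fit in the share of memory the server may use.

    total_memory_gb and utilization are assumed defaults, not measured values.
    KV cache and activations are NOT included, so a True here is optimistic.
    """
    budget_gb = total_memory_gb * utilization
    return weight_memory_gb(params_billion, bits_per_weight) <= budget_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_memory_gb(70, bits)
        print(f"70B @ {bits}-bit: {size:.0f} GB weights, fits={fits(70, bits)}")
```

Under those assumptions a 70B model needs about 140 GB at 16-bit, 70 GB at 8-bit and 35 GB at 4-bit, so 16-bit is out and the real choice is between 8-bit and 4-bit, with whatever is left over going to context.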

I’m avoiding naming a specific “best model of the week” here because it changes fast and I’d be giving you stale info; the server’s #models channel is where folks post the current “works well with tools on vLLM” picks.

If you paste:

  1. your vLLM launch args, and
  2. what /v1/models reports as the model id,
    I can tell you whether your endpoint + OpenClaw config look correct, and what to tweak (API mode, contextWindow/maxTokens, etc.).
hasty kettle
#

["deepseek-ai/DeepSeek-R1-Distill-Llama-70B","--host","0.0.0.0","--port","8000","--dtype","half","--max-model-len","16384","--gpu-memory-utilization","0.68","--enable-prefix-caching","--served-model-name","DeepSeek-R1-Distill-Llama-70B","--quantization","bitsandbytes","--chat-template","{% for m in messages %}{% if m["role"] == "system" %}System: {{ m["content"] }}\n{% elif m["role"] == "user" %}User: {{ m["content"] }}\n{% elif m["role"] == "assistant" %}Assistant: {{ m["content"] }}\n{% endif %}{% endfor %}Assistant:"]

DeepSeek-R1-Distill-Llama-70B

solid anchorBOT
#

For OpenClaw, the “easy interface” part is: your vLLM must behave like a clean OpenAI /v1 server and return proper tool_calls when tools are provided. Raw reasoning ability (R1) isn’t enough by itself.

1) Wire OpenClaw to your vLLM (using your served model name)

Since you’re serving:

--served-model-name DeepSeek-R1-Distill-Llama-70B

…then in OpenClaw you want the model ref to be:

spark/DeepSeek-R1-Distill-Llama-70B

Example config (Mac mini → Spark over LAN):

{
  agents: {
    defaults: {
      model: { primary: "spark/DeepSeek-R1-Distill-Llama-70B" },
    },
  },
  models: {
    mode: "merge",
    providers: {
      spark: {
        baseUrl: "http://DGX_SPARK_LAN_IP:8000/v1",
        apiKey: "sk-local",
        api: "openai-completions",
        models: [
          {
            id: "DeepSeek-R1-Distill-Llama-70B",
            name: "DeepSeek R1 Distill Llama 70B",
            reasoning: true,
            input: ["text"],
            cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
            contextWindow: 16384,
            maxTokens: 8192
          }
        ]
      }
    }
  }
}

Docs: https://docs.openclaw.ai/gateway/local-models

2) The big gotcha: tool calling needs to be enabled in vLLM

OpenClaw’s agent loop depends on tool/function calling. With vLLM, that typically means launching the server with --enable-auto-tool-choice and a --tool-call-parser that matches your model family. Neither flag appears in the launch args you pasted.

vLLM docs (authoritative flags + parsers): https://docs.vllm.ai/en/latest/features/tool_calling/

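Also: I’d strongly recommend you try removing your custom --chat-template first. Tool-call parsers tend to assume specific templates, and custom “System:/User:/Assistant:” formatting is a common way to accidentally break tool calling / JSON reliability. Your template in particular only renders system, user and assistant messages, so the tools list and any tool-role results never reach the model.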

3) Sanity test before involving OpenClaw

From the Mac mini, confirm:

  • curl http://DGX_SPARK_LAN_IP:8000/v1/models
  • Then try a minimal /v1/chat/completions request with a tools array and verify the response contains tool_calls (not just plain text telling you what it would do).
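
A minimal Python sketch of that second check, using only the standard library. The add tool, the prompt, and the helper names (build_request, has_tool_calls, post_chat) are hypothetical and chosen just for the test; the request shape follows the OpenAI chat-completions format that the endpoint is assumed to implement:

```python
import json
import urllib.request

# Hypothetical toy tool, only used to see whether the server emits tool_calls.
ADD_TOOL = {
    "type": "function",
    "function": {
        "name": "add",
        "description": "Add two integers.",
        "parameters": {
            "type": "object",
            "properties": {"a": {"type": "integer"}, "b": {"type": "integer"}},
            "required": ["a", "b"],
        },
    },
}


def build_request(model: str, prompt: str) -> dict:
    """Chat-completions payload that offers one tool and lets the model choose."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [ADD_TOOL],
        "tool_choice": "auto",
    }


def has_tool_calls(response: dict) -> bool:
    """True if the first choice carries a non-empty tool_calls list."""
    choices = response.get("choices") or []
    if not choices:
        return False
    message = choices[0].get("message") or {}
    return bool(message.get("tool_calls"))


def post_chat(base_url: str, payload: dict, api_key: str = "sk-local",
              timeout: float = 60.0) -> dict:
    """POST the payload to {base_url}/chat/completions (base_url ends in /v1)."""
    req = urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

Run it as has_tool_calls(post_chat("http://DGX_SPARK_LAN_IP:8000/v1", build_request("DeepSeek-R1-Distill-Llama-70B", "Use the add tool to compute 2 + 2."))). A False here, with the model describing the call in plain text instead, points at the server flags or the chat template rather than at OpenClaw.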

4) About “maxing out the Spark” with DeepSeek R1 Distill 70B

You can run it, but R1-distill style models are often less reliable at strict tool/JSON output, especially when quantized. If you hit issues like “it reasons but never calls tools” or malformed tool args, that’s usually model/tooling mismatch rather than OpenClaw config.

If you want, paste:

  • the output of curl .../v1/models, and
  • one sample /v1/chat/completions call+response with tools enabled,

and I’ll tell you whether the endpoint is “OpenClaw-ready” or what to adjust.

hasty kettle
#

{
"object": "list",
"data": [
{
"id": "nvidia/Llama-3.1-8B-Instruct-FP8",
"object": "model",
"created": 1771268614,
"owned_by": "tensorrt_llm"
}
]
}

{
"id": "chatcmpl-a4e3381011be497c879802e715d2d94a",
"object": "chat.completion",
"created": 1771268623,
"model": "nvidia/Llama-3.1-8B-Instruct-FP8",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "",
"reasoning_content": null,
"reasoning": null,
"tool_calls": [
{
"id": "chatcmpl-tool-8c3e67fb78ca4cf0aeb2e24fd3c4c37c",
"type": "function",
"function": {
"name": "add",
"arguments": "add(a=2, b=2)"
}
}
]
},
"logprobs": null,
"finish_reason": "stop",
"stop_reason": null,
"mm_embedding_handle": null,
"disaggregated_params": null,
"avg_decoded_tokens_per_iter": 1.0
}
],
"usage": {
"prompt_tokens": 29,
"total_tokens": 39,
"completion_tokens": 10,
"prompt_tokens_details": {
"cached_tokens": 23
}
},
"prompt_token_ids": null
}

solid anchorBOT
#

On that setup (Mac mini runs the OpenClaw gateway + tools, DGX Spark serves the LLM over LAN), the big decision isn’t “which biggest model fits” — it’s which model+serving stack can reliably do tool/function calling. Most “works great in chat” local models fall apart once you require strict JSON/tool args.

1) What your snippet is already telling you

This part:

  • tool_calls with function.name = "add"
  • but arguments = "add(a=2, b=2)" (not JSON)

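…is a classic sign the model is not actually function-calling tuned (or the server isn’t enforcing a JSON schema / structured output mode). OpenClaw expects tool arguments to be valid JSON, so you’ll see flaky/broken tool use with that kind of output. Note also that this response came from nvidia/Llama-3.1-8B-Instruct-FP8 with owned_by "tensorrt_llm", not the vLLM-served DeepSeek-R1-Distill-Llama-70B from your launch args, so the test hit a different endpoint than the one we were configuring.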
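
That check is easy to automate. A small Python sketch, assuming the response follows the OpenAI chat-completions shape shown in your paste; the function names are illustrative only:

```python
import json


def parse_tool_arguments(raw):
    """Return the decoded arguments dict, or None if raw is not a JSON object."""
    try:
        decoded = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return None
    return decoded if isinstance(decoded, dict) else None


def invalid_tool_calls(response: dict) -> list:
    """Names of tool calls in the first choice whose arguments are not valid JSON objects."""
    message = response["choices"][0]["message"]
    bad = []
    for call in message.get("tool_calls") or []:
        function = call.get("function", {})
        if parse_tool_arguments(function.get("arguments")) is None:
            bad.append(function.get("name"))
    return bad
```

Fed the response you pasted, invalid_tool_calls flags the add call, because "add(a=2, b=2)" is a Python-style call string rather than the JSON object {"a": 2, "b": 2}.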

2) Practical recommendation for OpenClaw right now

  • Use a “tool-reliable” model for the main agent (the one that drives browser/exec/message/etc).
  • If you want to use DGX Spark anyway, use it for a secondary agent where you disable tools and use it for:
    • drafting responses
    • summarizing
    • rewriting/formatting
    • (optionally) RAG-style Q&A where tool calls aren’t required

This split tends to feel much better than trying to force local models to be the tool-using “brains”.

3) If you do want the DGX model to drive tools

Then pick based on these criteria (not just “70B 4-bit”):

  • Proven tool/function calling (strict JSON args, consistent)
  • Low hallucination under schema pressure
  • Server supports OpenAI-compatible tool calling well (some stacks “accept” tools but don’t constrain outputs)

If tool calls still look like add(a=2,b=2) instead of JSON, it’s not ready for agent control.
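
To put a number on “strict” and “consistent”, you can score a batch of raw arguments strings against the tool's parameter schema. This is a deliberately small stdlib sketch, not a full JSON Schema validator: it only checks required keys, unknown keys, and top-level primitive types, and the names args_match_schema and pass_rate are hypothetical:

```python
import json

# Mapping from JSON Schema type names to Python types (top-level check only).
_JSON_TYPES = {
    "integer": int,
    "number": (int, float),
    "string": str,
    "boolean": bool,
    "object": dict,
    "array": list,
}


def args_match_schema(raw_arguments, schema: dict) -> bool:
    """True if raw_arguments is a JSON object that satisfies the simple schema."""
    try:
        args = json.loads(raw_arguments)
    except (TypeError, json.JSONDecodeError):
        return False
    if not isinstance(args, dict):
        return False
    if any(key not in args for key in schema.get("required", [])):
        return False  # missing required argument
    properties = schema.get("properties", {})
    for key, value in args.items():
        if key not in properties:
            return False  # argument the schema never declared
        expected = _JSON_TYPES.get(properties[key].get("type"))
        if expected is None:
            continue  # no (known) type constraint for this property
        # bool is a subclass of int in Python, so reject it unless asked for.
        if isinstance(value, bool) and expected is not bool:
            return False
        if not isinstance(value, expected):
            return False
    return True


def pass_rate(samples: list, schema: dict) -> float:
    """Fraction of raw arguments strings that satisfy the schema."""
    if not samples:
        return 0.0
    return sum(args_match_schema(raw, schema) for raw in samples) / len(samples)
```

Collect the arguments strings from a few dozen repeated requests and compare pass_rate across models or quantization levels; anything noticeably below 1.0 on a toy tool will usually be worse on real agent tools.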

4) Best next step (so people can give you accurate, current picks)

Can you drop this into #models (they track what actually works this month on local stacks):

  • what serving stack on the DGX? (TensorRT-LLM vs vLLM vs Ollama vs LM Studio vs other)
  • do you need OpenClaw to use tools with the local model, or is “chat-only local” fine?
  • what latency target (interactive vs batch)?

Docs to anchor the config side (and community notes):