I am setting up a distributed LLM inference architecture where the model serving and the openclaw client application run on same machine, and I want them to communicate seamlessly.
At a high level, my goal is to use OpenClaw (running on a this server machine) as the frontend/orchestrator, while leveraging a vLLM-based model server running on same machine hosted at port 8000 (remote or target) device as the backend inference engine.
System Components
- vLLM Server (Backend)
Running on a target device (likely Linux-based with QAIC acceleration).
Hosts a large language model such as:
meta-llama/Llama-3.3-70B-Instruct
Exposes an OpenAI-compatible REST API:
http://0.0.0.0:8000/v1
Handles:
Prompt processing
Token generation
Tool/function calling (enabled via flags)
OpenClaw (Frontend / Client)
Running on the same machine.
Acts as:
A user-facing interface or agent framework
A tool-calling orchestrator
Sends requests in OpenAI API format:
/v1/chat/completions
Expects:
Fast responses
Streaming or structured outputs
Tool-calling compatibility
Help me with the setup, and get the response from the model. I already tried connecting the model, but got no response from the openclaw output.
In vllm server run, it shows like this,
(APIServer pid=75263) INFO: 127.0.0.1:48594 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request