yeah, this is totally doable. the setup you're describing is a custom OpenAI-compatible endpoint — Hermes Air points at an LLM server running on Macbook Pro, using the custom endpoint provider.
here's exactly how to wire it up:
on the Macbook Pro (48GB, the server):
run an OpenAI-compatible local server that loads Qwen3 27B. the two main options:
- Ollama — easiest install, exposes
http://0.0.0.0:11434/v1 out of the box
- LM Studio — GUI-first, local server mode with OpenAI-compatible API
- llama.cpp server — more control, same API shape
Qwen3 27B at Q4 quantization needs ~16-18GB, so 48GB unified handles it comfortably.
make sure the server is listening on your LAN IP (not just localhost), otherwise the Air can't reach it. in Ollama that's:
OLLAMA_HOST=0.0.0.0 ollama serve
then note your Pro's Tailscale or LAN IP.
on the Macbook Air (running Hermes):
configure a custom provider pointing at the Pro:
hermes config set model.provider custom
hermes config set model.base_url http://<MACBOOK_PRO_IP>:11434/v1
hermes config set model.api_key ollama
hermes config set model.default qwen3:27b
or if you prefer editing directly:
hermes config edit
add to config.yaml:
model:
provider: custom
base_url: http://<MACBOOK_PRO_IP>:11434/v1
api_key: "ollama"
default: "qwen3:27b"
the api_key can be anything — Ollama doesn't actually check it, but the field is required.
LMlink specifically — I haven't personally validated it, but if it exposes an OpenAI-compatible /v1/chat/completions endpoint the exact same config works, just swap the base_url. check their docs for the local server URL format.
verify it works:
hermes chat -q "hello" --provider custom
if you get a response, you're good. if you get a connection error, it's usually the IP or the server not binding to 0.0.0.0.