# Local Qwen Embeddings: separate agents appear to load separate runtimes and hit VRAM limit

craggy prairie

Need help confirming expected behavior for local GGUF memory embeddings across multiple agents.

Setup:

  • OpenClaw 2026.5.6 on WSL Ubuntu 22.04
  • Two agents
  • Builtin memory-core
  • provider: local
  • model: hf:Qwen/Qwen3-Embedding-8B-GGUF/Qwen3-Embedding-8B-Q6_K.gguf
  • sources: ["memory", "sessions"]
  • sessionMemory: true
  • contextSize: 4096
  • node-llama-cpp: 3.16.2
  • GPU: RTX 3070 8GB

Local loader patch:

  • getLlama({ gpu: "cuda" })
  • loadModel({ modelPath, gpuLayers })

What happened:

  • gpuLayers: 12 works for both agents.
  • Warm VRAM is around 4GB.
  • Tried gpuLayers: 24.
  • First agent memory search worked, VRAM around 5.9-6GB.
  • Second agent memory search failed:
    "Not enough VRAM to fit the model with the specified settings"
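The failure at gpuLayers: 24 is consistent with simple headroom arithmetic. The sketch below is a rough estimate only. It assumes each agent's instance needs about the same VRAM as the first one, and it reads the "around 4GB" warm figure as the combined total for both agents (about 2GB each), which the post does not state outright. `maxInstances` is a hypothetical helper, not part of OpenClaw or node-llama-cpp.

```typescript
// Rough headroom estimate. All figures are approximate readings from the post
// above, not guarantees; real usage also depends on context size and on
// whatever else is holding VRAM.
function maxInstances(totalGb: number, perInstanceGb: number, reservedGb = 0): number {
  if (perInstanceGb <= 0) {
    throw new Error("perInstanceGb must be positive");
  }
  const usableGb = totalGb - reservedGb;
  // Whole instances only; never report a negative count.
  return Math.max(0, Math.floor(usableGb / perInstanceGb));
}

// gpuLayers: 24 -> about 6GB after the first agent on an 8GB card.
const fitsAt24Layers = maxInstances(8, 6);

// gpuLayers: 12 -> assumed about 2GB per agent (4GB warm across two agents).
const fitsAt12Layers = maxInstances(8, 2);

console.log({ fitsAt24Layers, fitsAt12Layers });
```

Under these assumptions only one 24-layer instance fits, which matches the second agent failing to load. Passing a non-zero `reservedGb` models VRAM set aside for something else, such as a future TTS model.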

What I checked:

  • No per-agent memorySearch overrides; both agents inherit same config.
  • Each agent has its own SQLite DB.
  • The source seems to cache MemoryIndexManager by agentId/workspace/settings/purpose.
  • The local embedding model/context seems to be stored inside each provider instance, not shared globally.

Questions:

  1. Is it expected that each agent loads its own local GGUF embedding runtime/context?
  2. Is there a supported way to share one local embedding runtime across agents?
  3. Is gpuLayers: 12 the practical safe value for Qwen3-Embedding-8B Q6_K on 8GB VRAM with two agents?
  4. Would switching to a smaller Qwen3 embedding GGUF be the recommended path if I want future local TTS too?

wispy spire BOT (replying to craggy prairie)

I checked the current docs/source. Short version: your read is basically right.

  1. Yes, for builtin memory-core + provider: "local", it is expected that each agent can end up with its own local GGUF runtime/model/context. MemoryIndexManager.get() caches by agentId:workspaceDir:settings:purpose, and local provider creation closes over its own llama, embeddingModel, and embeddingContext variables. So two agents with the same inherited config are still two manager/provider instances, not one shared embedding context.

  2. I don’t see a supported config knob for sharing one builtin local GGUF runtime across agents. The supported “shared runtime” shape would be to put embeddings behind a service and have agents call that service: e.g. provider: "ollama", or an OpenAI-compatible embedding endpoint via provider: "openai" + memorySearch.remote.baseUrl. Docs: https://docs.openclaw.ai/reference/memory-config and https://docs.openclaw.ai/concepts/memory-search

  3. For your exact patched gpuLayers setup: yes, gpuLayers: 12 sounds like the practical safe value on an 8GB 3070 for two agents if 24 loads once but fails on the second. That is empirical rather than an OpenClaw-supported sizing rule, because gpuLayers is from your local loader patch, not a documented OpenClaw memory config key. Also keep local.contextSize bounded; docs explicitly warn that Qwen3-Embedding-8B with huge/auto context can explode VRAM, and recommend lowering to 1024-2048 on constrained hosts.

  4. If you want headroom for local TTS too, I’d switch smaller. The docs’ default local embedding model is embeddinggemma-300m-qat-Q8_0.gguf (~0.6GB), and the QMD docs show Qwen3-Embedding-0.6B as a local embedding option. An 8B Q6 embedding model plus multiple agents plus future TTS is tight on 8GB VRAM.
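To make point 2 concrete, here is a minimal sketch of the request and response shape of an OpenAI-compatible embeddings endpoint, which is what a shared embedding service would expose. The base URL and model name are placeholders, and `buildEmbeddingRequest` and `parseEmbeddingResponse` are hypothetical helpers. This is not how OpenClaw calls the endpoint internally, only the wire format such a service is expected to speak.

```typescript
interface EmbeddingRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds a POST to <baseUrl>/embeddings with the standard { model, input } body.
function buildEmbeddingRequest(baseUrl: string, model: string, input: string[]): EmbeddingRequest {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/embeddings`, // tolerate trailing slashes
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, input }),
  };
}

// Extracts vectors from a { data: [{ index, embedding }] } response,
// ordered by index so they line up with the inputs.
function parseEmbeddingResponse(payload: unknown): number[][] {
  const data = (payload as { data?: Array<{ index: number; embedding: number[] }> } | null)?.data;
  if (!Array.isArray(data)) {
    throw new Error("unexpected embeddings response shape");
  }
  return [...data].sort((a, b) => a.index - b.index).map((item) => item.embedding);
}

// Placeholder values: substitute the real service address and model name.
const exampleRequest = buildEmbeddingRequest(
  "http://127.0.0.1:8080/v1/",
  "example-embedding-model",
  ["first note", "second note"],
);
console.log(exampleRequest.url);
```

Because every agent would send requests to the same service, the model is loaded into VRAM once by that service rather than once per agent.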

One nuance: each agent having its own SQLite DB is separate from runtime sharing. The DB/index is per-agent-ish by design; the VRAM issue is the local provider/context lifetime.
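The lifetime point can be illustrated with a small cache keyed the way point 1 describes. This is a sketch under assumptions, not OpenClaw's actual code: `ManagerCache`, `managerKey`, the workspace paths, and the `"default"` purpose value are all hypothetical, and the created object stands in for a provider that would own its own model and embedding context.

```typescript
interface ManagerKeyParts {
  agentId: string;
  workspaceDir: string;
  settings: Record<string, unknown>;
  purpose: string;
}

// Mirrors the agentId:workspaceDir:settings:purpose key shape described above.
function managerKey(parts: ManagerKeyParts): string {
  return [parts.agentId, parts.workspaceDir, JSON.stringify(parts.settings), parts.purpose].join(":");
}

class ManagerCache<T> {
  private readonly entries = new Map<string, T>();
  loads = 0; // how many times a new instance had to be created

  get(parts: ManagerKeyParts, create: () => T): T {
    const key = managerKey(parts);
    if (!this.entries.has(key)) {
      this.entries.set(key, create());
      this.loads += 1;
    }
    return this.entries.get(key) as T;
  }
}

// Both agents inherit identical settings, as in the original post.
const sharedSettings = { provider: "local", contextSize: 4096 };
const cache = new ManagerCache<{ owner: string }>();

const agentA = { agentId: "agent-a", workspaceDir: "/ws/a", settings: sharedSettings, purpose: "default" };
const agentB = { agentId: "agent-b", workspaceDir: "/ws/b", settings: sharedSettings, purpose: "default" };

const firstA = cache.get(agentA, () => ({ owner: "agent-a" }));
const secondA = cache.get(agentA, () => ({ owner: "agent-a" }));
const firstB = cache.get(agentB, () => ({ owner: "agent-b" }));

console.log(cache.loads, firstA === secondA, firstA === firstB);
```

Identical settings do not merge the entries, because agentId and workspaceDir are part of the key. Each agent therefore creates its own instance, and with a local GGUF provider that means its own VRAM allocation.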