Need help confirming expected behavior for local GGUF memory embeddings across multiple agents.
Setup:
- OpenClaw 2026.5.6 on WSL Ubuntu 22.04
- Two agents
- Builtin memory-core
- provider: local
- model: hf:Qwen/Qwen3-Embedding-8B-GGUF/Qwen3-Embedding-8B-Q6_K.gguf
- sources: ["memory", "sessions"]
- sessionMemory: true
- contextSize: 4096
- node-llama-cpp: 3.16.2
- GPU: RTX 3070 8GB
Local loader patch:
- getLlama({ gpu: "cuda" })
- loadModel({ modelPath, gpuLayers })
What happened:
- gpuLayers: 12 works for both agents.
- Warm VRAM is around 4GB.
- Tried gpuLayers: 24.
- First agent memory search worked, VRAM around 5.9-6GB.
- Second agent memory search failed:
"Not enough VRAM to fit the model with the specified settings"
What I checked:
- No per-agent memorySearch overrides; both agents inherit same config.
- Each agent has its own SQLite DB.
- The source seems to cache MemoryIndexManager keyed by agentId/workspace/settings/purpose.
- The local embedding model/context seems to be stored inside each provider instance, not shared globally.
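If that reading is right and each agent loads its own copy, the gpuLayers: 24 failure looks like plain arithmetic. A rough sketch of the budget check, using only the approximate readings from this post. The `fits` helper and its field names are my own, and the 5.9 GB figure probably includes whatever else was already on the GPU, so treat it as an upper bound per instance:

```typescript
// Rough VRAM budget check. Numbers are the approximate readings above,
// not measurements of what a single agent allocates in isolation.
interface Budget {
  totalGB: number;       // card capacity (RTX 3070: 8 GB)
  perInstanceGB: number; // assumed footprint of one loaded model + context
  instances: number;     // one per agent if nothing is shared
  reservedGB?: number;   // hypothetical headroom, e.g. for a future TTS model
}

function fits(b: Budget): { neededGB: number; ok: boolean } {
  const neededGB = b.perInstanceGB * b.instances + (b.reservedGB ?? 0);
  return { neededGB, ok: neededGB <= b.totalGB };
}

// gpuLayers: 24, one private copy per agent (assumption).
const twoAt24 = fits({ totalGB: 8, perInstanceGB: 5.9, instances: 2 });
// gpuLayers: 24, a single copy (what the first search effectively had).
const oneAt24 = fits({ totalGB: 8, perInstanceGB: 5.9, instances: 1 });

console.log(twoAt24.ok, oneAt24.ok);
```

Under these assumptions the second load needs more than the card has, which would match the error I got on the second agent.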
Questions:
- Is it expected that each agent loads its own local GGUF embedding runtime/context?
- Is there a supported way to share one local embedding runtime across agents?
- Is gpuLayers: 12 the practical safe value for Qwen3-Embedding-8B Q6_K on 8GB VRAM with two agents?
- Would switching to a smaller Qwen3 embedding GGUF be the recommended path if I want future local TTS too?
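To make the sharing question concrete, this is roughly what I have in mind: a process-wide cache keyed on the model settings instead of agentId, so concurrent agents get the same loaded runtime. It is only a sketch. `SharedRuntimeCache`, `keyFor`, and `fakeLoad` are hypothetical names, not OpenClaw or node-llama-cpp APIs, and the real loader (getLlama + loadModel) is stubbed out:

```typescript
// Hypothetical sketch: none of these names exist in OpenClaw or node-llama-cpp.
type Loader<T> = () => Promise<T>;

class SharedRuntimeCache<T> {
  private readonly entries = new Map<string, Promise<T>>();

  // Key on what determines the loaded weights/context, not on the agent.
  static keyFor(modelPath: string, gpuLayers: number, contextSize: number): string {
    return JSON.stringify([modelPath, gpuLayers, contextSize]);
  }

  // Returns the in-flight or finished load for this key, starting it at most once.
  get(key: string, load: Loader<T>): Promise<T> {
    let entry = this.entries.get(key);
    if (!entry) {
      entry = load().catch((err) => {
        // Drop failed loads so a later call can retry.
        this.entries.delete(key);
        throw err;
      });
      this.entries.set(key, entry);
    }
    return entry;
  }
}

// Stand-in for the real loader; counts how many times a model is "loaded".
let loads = 0;
const fakeLoad = async () => ({ id: ++loads });

const cache = new SharedRuntimeCache<{ id: number }>();
const key = SharedRuntimeCache.keyFor(
  "hf:Qwen/Qwen3-Embedding-8B-GGUF/Qwen3-Embedding-8B-Q6_K.gguf", 24, 4096);

// Two "agents" asking at the same time should share one load.
async function demo(): Promise<{ same: boolean; loads: number }> {
  const [a, b] = await Promise.all([cache.get(key, fakeLoad), cache.get(key, fakeLoad)]);
  return { same: a === b, loads };
}
```

Each agent would still keep its own SQLite DB and index; only the embedding runtime would be shared. I assume embedding requests from the two agents would then queue behind each other on the one context.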