#lancedb inserting itself into session data thus polluting session data for memory recall
1 messages · Page 1 of 1 (latest)
If youâve enabled session memory search (agents.defaults.memorySearch.experimental.sessionMemory: true + sources: ["memory","sessions"]), OpenClaw will take your session transcripts (~/.openclaw/agents/<agentId>/sessions/*.jsonl), extract only user + assistant text, and index that into the memory SQLite (~/.openclaw/memory/<agentId>.sqlite). Docs: https://docs.openclaw.ai/concepts/memory#session-memory-search-experimental
Key detail: tool results are not indexed (it ignores roles other than user/assistant). So if LanceDB âfactsâ are showing up in the session-derived index, it means those âfactsâ are being persisted as assistant messages in the session JSONL (not as a tool result).
What you can do
- Quick workaround (stop the pollution): donât index sessions
agents: {
defaults: {
memorySearch: {
sources: ["memory"],
experimental: { sessionMemory: false }
}
}
}
(Then run openclaw memory index --force if you want to rebuild cleanly.)
- If you do want sessions indexed: adjust your LanceDB integration so recall output is stored as a non-assistant role (tool/system), or at least not written into the transcript as an âAssistant:â message. Today there isnât a config knob to âexclude assistant messages matching Xâ from session indexing.
To confirm whatâs happening
Can you paste:
openclaw --version- one redacted JSONL record containing the âfactsâ (just the relevant few lines from the
*.jsonl), so we can see whether itâsmessage.role: "assistant"vs a tool result?
its in the user role of the message
If itâs showing up as a user message in the session JSONL, that basically means your LanceDB âmemory_recallâ layer is doing RAG-by-concatenation (i.e. itâs prepending/merging the retrieved âFacts:â into the actual user turn that gets sent to the model).
And OpenClawâs session memory search intentionally indexes user + assistant turns into the SQLite index â so yeah, that will get embedded and will âpolluteâ session search. There isnât currently a built-in âignore user messages matching Xâ filter for session indexing.
Fix options (in order of sanity)
-
Change the LanceDB integration to not inject recall as
user- Prefer: recall results as a tool result (
toolResult) or a system message (âRetrieved context, donât quoteâ), not merged into the user message. - Session indexing only picks up
user/assistant, so moving it to tool/system keeps it out of the session-derived index.
- Prefer: recall results as a tool result (
-
Donât index sessions (or do it in a different agent)
- Keep session memory search off on the agent that uses LanceDB injection, and enable it on a âcleanâ agent that doesnât do that augmentation.
-
Advanced (if you own the plugin): strip those injected âfactsâ at persistence time
- Thereâs a
before_message_writehook in the persistence pipeline that can drop/modify messages before they hit the JSONL transcript, but this requires writing/modifying a plugin.
- Thereâs a
If you paste a single redacted JSONL line where it happens (the object that has message.role: "user" and includes the âfactsâ), plus what LanceDB integration youâre using (skill name / plugin / your own code), I can tell you which of the above is most realistic to implement in your setup.
here is a snippet from the openclaw memory search --agent main --query:
"snippet": "est 2. [fact] use /System/ActivityLog/Entries filtered by Type=playback.stop, username, and date - not /Users/{id}/Items. 3. [fact] watch history: do NOT use /Users/{id}/Items sorted by DatePlayed - often returns null. Use /System/ActivityLog/Entries filtered to Type=playback.stop + username + date. </relevant-memories> Conversation info (untrusted metadata): json { \"message_id\": \"3358\", \"sender_id\": \"id\", \"sender\": \"user\", \"timestamp\": \"Sun 2026-03-08 00:18 UTC\" } Sender (untrusted metadata): json { \"label\": \"user (user)\", \"id\": \"id\", \"name\": \"user\", \"username\": \"user\" } we're going to watch tv now. :(\nAssist",
"source": "sessions"
},
its loaded with things like this
i have these settings:
"memorySearch": {
"enabled": true,
"sources": [
"memory",
"sessions"
],
"experimental": {
"sessionMemory": true
},
"provider": "ollama",
"remote": {
"baseUrl": "http://10.0.0.11:11434/v1"
},
"fallback": "none",
"model": "nomic-embed-text",
"chunking": {
"tokens": 200,
"overlap": 40
}
},
going to a local ollama running nomic-embed-text for the memory search
Yeah, that snippet is basically the smoking gun:
</relevant-memories>+ the[fact] ...lines are inside the text that got persisted as a normal message- and
memory_searchsession indexing only knows âuser/assistant textâ, so it happily embeds/indexes all of it.
So the behavior is âcorrectâ given the inputs: your LanceDB layer is injecting recall into the same message text that becomes the session transcript (and in your earlier message you said itâs in the user role â that guarantees itâll be indexed).
What to change (to keep session indexing usable)
Best fix: stop injecting recall into the user message text.
- Inject recall as a tool result (
toolResult) or a system/developer message instead. - Session indexing ignores non-
user/assistantroles, so that keeps recall out of the SQLite session index.
If you canât change the LanceDB integration right now: disable session indexing until you can.
agents: {
defaults: {
memorySearch: {
experimental: { sessionMemory: false },
sources: ["memory"]
}
}
}
Why itâs extra-bad in your example
Youâre also embedding the literal âConversation info (untrusted metadata) ⊠Sender âŠâ blocks into the message text. That will also get indexed and will drown your session search in junk.
If you want a pragmatic hack (code-side)
The place to strip this (if youâre running from source) is the session indexing extractor:
src/memory/session-files.ts â buildSessionEntry() / extractSessionText()
You can sanitize user text before itâs added to collected, e.g. remove:
<relevant-memories> ... </relevant-memories>- the âConversation info (untrusted metadata)â blocks
âŠbut thatâs a patch/PR territory.
If you tell me which LanceDB integration youâre using (skill/plugin name + how it injects), I can point you to the cleanest place to switch it from âprepend into user messageâ â âsystem/toolResult injectionâ.
its the lancedb memory plugin
Those memorySearch settings look fine for sessionMemory + Ollama embeddings (and /v1 is OK â OpenClaw strips it for Ollamaâs native /api/embeddings endpoint).
What youâre seeing in the source: "sessions" snippets is coming from the LanceDB memory pluginâs auto-recall, not from Ollama or chunking:
extensions/memory-lancedbuses the plugin hookbefore_agent_startand returnsprependContextcontaining the<relevant-memories>âŠ</relevant-memories>block.- Core then does:
effectivePrompt = prependContext + "\n\n" + userPrompt(so it becomes part of the user prompt string). - That means itâs persisted in the session transcript as role=user, and your sessionMemory indexer (which indexes user/assistant turns) embeds it into SQLite.
So: with memory-lancedb autoRecall + sessionMemory indexing enabled, this âpollutionâ is basically guaranteed.
Practical fixes
Option A (fastest): disable LanceDB autoRecall injection
Keep capture if you want, but stop injecting into prompts:
plugins: {
entries: {
"memory-lancedb": {
enabled: true,
config: {
autoRecall: false,
// autoCapture: true/false (your choice)
}
}
}
}
(or even more blunt: plugins.entries.memory-lancedb.hooks.allowPromptInjection: false)
Then reindex:
openclaw memory index --agent main --force
Option B (keep autoRecall, but keep SQLite clean): needs a code change
The session indexer currently does no stripping of the <relevant-memories> scaffolding before embedding sessions.
Thereâs already a helper that strips those blocks:
src/shared/text/assistant-visible-text.ts â stripAssistantInternalScaffolding()
A good upstream fix would be: apply that (or at least âstrip relevant-memories blocksâ) inside
src/memory/session-files.ts when building the session entry content.
If you want this to be supported cleanly, itâs worth filing a GitHub issue: https://github.com/openclaw/openclaw/issues (this is a legit bug/feature request: âdonât index internal memory scaffolding into sessionMemoryâ).
One question to pick the right direction
Do you actually need memory-lancedb autoRecall and sessionMemory indexing at the same time? (Theyâre both ârecall into contextâ systems, so running both tends to create feedback loops unless we strip scaffolding during indexing.)
Yep â if itâs the bundled memory-lancedb plugin, then what youâre seeing is expected with your current setup:
memory-lancedbdoes auto-recall by returningprependContext(the<relevant-memories>âŠ</relevant-memories>block)- core prepends that onto the user prompt string
- sessionMemory indexing indexes user/assistant text, so it ends up in the SQLite session index
The âno pollutionâ choices today
1) Turn off LanceDB autoRecall (keep it for capture/storage if you want):
plugins: {
entries: {
"memory-lancedb": {
enabled: true,
config: {
autoRecall: false,
// autoCapture: true // optional
}
}
}
}
2) Or keep autoRecall, but donât index sessions
agents: {
defaults: {
memorySearch: { sources: ["memory"], experimental: { sessionMemory: false } }
}
}
3) If you want both (autoRecall + session indexing) cleanly: that needs an upstream tweak
A good fix would be for the session indexer (src/memory/session-files.ts) to strip <relevant-memories>âŠ</relevant-memories> blocks before embedding. Thereâs already a helper that does this (stripAssistantInternalScaffolding), itâs just not applied during session indexing right now.
If you want, file an issue for that (âsessionMemory indexing should strip relevant-memories scaffoldingâ) and link this thread/snippet:
https://github.com/openclaw/openclaw/issues
After changing config, rebuild the index:
openclaw memory index --agent main --force
arent all previous session data already polluted tho?
Yes â the old session JSONLs are already âpollutedâ in the sense that they literally contain the injected <relevant-memories>âŠ</relevant-memories> text (because memory-lancedb prepended it into the user prompt at the time).
Two important distinctions:
-
Turning off
autoRecallonly prevents future pollution.
Your existing~/.openclaw/agents/main/sessions/*.jsonlfiles wonât retroactively change. -
The SQLite file is just an index/cache. Reindexing doesnât magically clean data if the source transcripts still contain the injected block.
So how do you âunpolluteâ now?
You have 3 practical paths:
A) Best long-term fix (upstream/code): strip <relevant-memories>âŠ</relevant-memories> during session indexing, then reindex.
That would clean even old transcripts in the SQLite results without deleting history.
B) Start fresh without deleting anything: create a new agent (new sessions dir + new memory sqlite), enable sessionMemory there, and stop autoRecall on that agent. Old agent keeps old history.
C) Archive old transcripts out of the indexed folder: move the old .jsonl transcripts elsewhere (backup), then run openclaw memory index --agent main --force.
(Only do this if youâre OK not having those old sessions searchable via sessionMemory anymore.)
If you tell me which of these you prefer (keep old sessions searchable vs start clean), Iâll recommend the least painful route.
what are the proper chunk token and overlap settings?
There isnât one âcorrectâ chunk size â itâs a tradeoff â but OpenClawâs defaults are a solid starting point:
- tokens:
400 - overlap:
80(â 20%)
Thatâs also what the docs describe (~400 token target, 80-token overlap) and what the code defaults to.
Docs: https://docs.openclaw.ai/concepts/memory
Rules of thumb
- Overlap â 15â25% of chunk size (enough continuity without tons of duplicates).
- Smaller chunks (e.g.
200/40)- Pros: more âpreciseâ hits for short facts
- Cons: many more chunks, more near-duplicates, and for sessions it tends to fragment conversations
- Larger chunks (e.g.
500/100)- Pros: better conversational coherence (sessions), fewer duplicate hits
- Cons: can mix topics if your notes are dense
For your case (sessions enabled)
Iâd go back to 400/80 first. Your current 200/40 is more likely to produce lots of redundant session hits, especially when the transcript contains injected scaffolding.
After you change chunking, the index fingerprints change â youâll want to rebuild:
openclaw memory index --agent main --force
i am getting errors when using 400/80 it imports around 166 of 168 session files but then it gives this error and hangs:
Memory index failed (main): Ollama embeddings HTTP 500: {"error":"the input length exceeds the context length"}
That error is coming from Ollama, and it means: at least one chunk weâre trying to embed is still too long for the embedding modelâs context window.
A couple of gotchas in your setup:
- OpenClawâs âtokensâ chunking is a heuristic (
tokens * ~4 chars). With âweirdâ text (lots of IDs/JSON/markup), the real token count in Ollama can be much worse than the heuristic. - Ollama returns this as HTTP 500, and OpenClaw currently treats 5xx as retryable, so it will backoff/retry â looks like it âhangsâ.
What to do
- Lower the chunk size until indexing completes.
Try something like:
chunking: { tokens: 250, overlap: 50 } // or even 200/40 (what you had)
- If you have to stay with smaller chunks (because of model context), enable MMR to reduce the duplicate/redundant hits that smaller chunks tend to produce:
memorySearch: {
query: {
hybrid: {
mmr: { enabled: true }
}
}
}
- Since youâre also running memory-lancedb autoRecall, the injected
<relevant-memories>âŠ</relevant-memories>block can be part of the user prompt that gets indexed. DisablingautoRecall(or stripping that block during session indexing) also reduces the chance of âtoo longâ chunks.
Quick way to identify the culprit (optional)
Usually itâs 1â2 session files with very long user turns. Check the largest ones and/or temporarily move the biggest .jsonl out of ~/.openclaw/agents/main/sessions/ to confirm.
If you tell me:
- what embedding model
ollama show nomic-embed-textreports forcontext_length(on the Ollama host),