#lancedb inserting itself into session data thus polluting session data for memory recall

wary remnant
#

I'm trying to use session data for memorySearch, but I also have LanceDB running. When I look at the session data, it looks like the 'facts' and other output from LanceDB memory_recall are polluting the session data that gets put into the SQLite index.

steady flumeBOT
#

If you’ve enabled session memory search (agents.defaults.memorySearch.experimental.sessionMemory: true + sources: ["memory","sessions"]), OpenClaw will take your session transcripts (~/.openclaw/agents/<agentId>/sessions/*.jsonl), extract only user + assistant text, and index that into the memory SQLite (~/.openclaw/memory/<agentId>.sqlite). Docs: https://docs.openclaw.ai/concepts/memory#session-memory-search-experimental

Key detail: tool results are not indexed (it ignores roles other than user/assistant). So if LanceDB “facts” are showing up in the session-derived index, it means those “facts” are being persisted as assistant messages in the session JSONL (not as a tool result).
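
To make the role filter concrete, here is a minimal sketch of that kind of extraction. It is not OpenClaw's actual extractor: the record shape (`message.role` / `message.content` as a plain string) and the function name `extractIndexableText` are assumptions for illustration, and the real JSONL schema may differ.

```typescript
// Hypothetical record shape; the real session JSONL schema may differ.
type SessionRecord = {
  message?: { role?: string; content?: string };
};

// Keep only user/assistant text, mirroring the behavior described above.
function extractIndexableText(jsonlLines: string[]): string[] {
  const indexedRoles = ["user", "assistant"];
  const collected: string[] = [];
  for (const line of jsonlLines) {
    if (line.trim() === "") continue;
    let record: SessionRecord;
    try {
      record = JSON.parse(line) as SessionRecord;
    } catch {
      continue; // skip malformed lines
    }
    const role = record.message?.role;
    const content = record.message?.content;
    if (role !== undefined && indexedRoles.indexOf(role) !== -1 && typeof content === "string") {
      collected.push(role + ": " + content);
    }
  }
  return collected;
}

const sampleLines = [
  JSON.stringify({ message: { role: "user", content: "hello" } }),
  JSON.stringify({ message: { role: "toolResult", content: "[fact] recalled" } }),
  JSON.stringify({ message: { role: "assistant", content: "hi there" } }),
];
console.log(extractIndexableText(sampleLines));
```

In this sketch the toolResult record is dropped, so recall text only reaches the index if it was written under the user or assistant role.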

What you can do

  1. Quick workaround (stop the pollution): don’t index sessions
agents: {
  defaults: {
    memorySearch: {
      sources: ["memory"],
      experimental: { sessionMemory: false }
    }
  }
}

(Then run openclaw memory index --force if you want to rebuild cleanly.)

  2. If you do want sessions indexed: adjust your LanceDB integration so recall output is stored under a non-assistant role (tool/system), or at least not written into the transcript as an “Assistant:” message. Today there is no config knob to “exclude assistant messages matching X” from session indexing.

To confirm what’s happening

Can you paste:

  • openclaw --version
  • one redacted JSONL record containing the “facts” (just the relevant few lines from the *.jsonl), so we can see whether it’s message.role: "assistant" vs a tool result?
wary remnant
#

It's in the user role of the message.

steady flumeBOT
#

If it’s showing up as a user message in the session JSONL, that basically means your LanceDB “memory_recall” layer is doing RAG-by-concatenation (i.e. it’s prepending/merging the retrieved “Facts:” into the actual user turn that gets sent to the model).

And OpenClaw’s session memory search intentionally indexes user + assistant turns into the SQLite index — so yeah, that will get embedded and will “pollute” session search. There isn’t currently a built-in “ignore user messages matching X” filter for session indexing.

Fix options (in order of sanity)

  1. Change the LanceDB integration to not inject recall as user

    • Prefer: recall results as a tool result (toolResult) or a system message (“Retrieved context, don’t quote”), not merged into the user message.
    • Session indexing only picks up user/assistant, so moving it to tool/system keeps it out of the session-derived index.
  2. Don’t index sessions (or do it in a different agent)

    • Keep session memory search off on the agent that uses LanceDB injection, and enable it on a “clean” agent that doesn’t do that augmentation.
  3. Advanced (if you own the plugin): strip those injected “facts” at persistence time

    • There’s a before_message_write hook in the persistence pipeline that can drop/modify messages before they hit the JSONL transcript, but this requires writing/modifying a plugin.

If you paste a single redacted JSONL line where it happens (the object that has message.role: "user" and includes the “facts”), plus what LanceDB integration you’re using (skill name / plugin / your own code), I can tell you which of the above is most realistic to implement in your setup.

wary remnant
#

here is a snippet from the openclaw memory search --agent main --query:
"snippet": "est 2. [fact] use /System/ActivityLog/Entries filtered by Type=playback.stop, username, and date - not /Users/{id}/Items. 3. [fact] watch history: do NOT use /Users/{id}/Items sorted by DatePlayed - often returns null. Use /System/ActivityLog/Entries filtered to Type=playback.stop + username + date. </relevant-memories> Conversation info (untrusted metadata): json { \"message_id\": \"3358\", \"sender_id\": \"id\", \"sender\": \"user\", \"timestamp\": \"Sun 2026-03-08 00:18 UTC\" } Sender (untrusted metadata): json { \"label\": \"user (user)\", \"id\": \"id\", \"name\": \"user\", \"username\": \"user\" } we're going to watch tv now. :(\nAssist",
"source": "sessions"
},
It's loaded with things like this.

#

i have these settings:
"memorySearch": {
  "enabled": true,
  "sources": [
    "memory",
    "sessions"
  ],
  "experimental": {
    "sessionMemory": true
  },
  "provider": "ollama",
  "remote": {
    "baseUrl": "http://10.0.0.11:11434/v1"
  },
  "fallback": "none",
  "model": "nomic-embed-text",
  "chunking": {
    "tokens": 200,
    "overlap": 40
  }
},
This goes to a local Ollama running nomic-embed-text for the memory search.

steady flumeBOT
#

Yeah, that snippet is basically the smoking gun:

  • </relevant-memories> + the [fact] ... lines are inside the text that got persisted as a normal message
  • and memory_search session indexing only knows “user/assistant text”, so it happily embeds/indexes all of it.

So the behavior is “correct” given the inputs: your LanceDB layer is injecting recall into the same message text that becomes the session transcript (and in your earlier message you said it’s in the user role — that guarantees it’ll be indexed).

What to change (to keep session indexing usable)

Best fix: stop injecting recall into the user message text.

  • Inject recall as a tool result (toolResult) or a system/developer message instead.
  • Session indexing ignores non-user/assistant roles, so that keeps recall out of the SQLite session index.

If you can’t change the LanceDB integration right now: disable session indexing until you can.

agents: {
  defaults: {
    memorySearch: {
      experimental: { sessionMemory: false },
      sources: ["memory"]
    }
  }
}

Why it’s extra-bad in your example

You’re also embedding the literal “Conversation info (untrusted metadata)” and “Sender (untrusted metadata)” blocks into the message text. Those will also get indexed and will drown your session search in junk.

If you want a pragmatic hack (code-side)

The place to strip this (if you’re running from source) is the session indexing extractor:

src/memory/session-files.ts → buildSessionEntry() / extractSessionText()

#

You can sanitize user text before it’s added to collected, e.g. remove:

  • <relevant-memories> ... </relevant-memories>
  • the “Conversation info (untrusted metadata)” blocks


but that’s patch/PR territory.
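
A rough sketch of what that sanitizing step could look like, assuming the injected text looks like your search snippet. The function name `stripInjectedScaffolding` and both patterns are hypothetical, not existing OpenClaw code, and the metadata pattern assumes a flat JSON object after the header (optionally inside a code fence), so check it against a raw JSONL line first.

```typescript
// Assumption: recall is wrapped in <relevant-memories> tags, as in the snippet above.
const RELEVANT_MEMORIES_BLOCK = /<relevant-memories>[\s\S]*?<\/relevant-memories>/g;

// Assumption: each metadata block is a header, an optional code fence, the word
// "json", and one flat JSON object without nested braces.
const UNTRUSTED_METADATA_BLOCK =
  /(?:Conversation info|Sender) \(untrusted metadata\):\s*(?:`{3})?\s*json\s*\{[^{}]*\}\s*(?:`{3})?/g;

// Remove injected scaffolding and collapse whitespace (newlines become spaces).
function stripInjectedScaffolding(text: string): string {
  return text
    .replace(RELEVANT_MEMORIES_BLOCK, " ")
    .replace(UNTRUSTED_METADATA_BLOCK, " ")
    .replace(/\s+/g, " ")
    .trim();
}

const pollutedTurn = [
  "<relevant-memories>",
  "1. [fact] example recalled fact",
  "</relevant-memories>",
  'Conversation info (untrusted metadata): json { "message_id": "3358" }',
  'Sender (untrusted metadata): json { "name": "user" }',
  "we're going to watch tv now. :(",
].join("\n");

console.log(stripInjectedScaffolding(pollutedTurn));
```

This has to run on the whole user turn before chunking. Your snippet starts mid-block ("est 2. [fact] ..."), which suggests a chunk boundary can split the block, and a per-chunk regex would then miss the opening tag.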

If you tell me which LanceDB integration you’re using (skill/plugin name + how it injects), I can point you to the cleanest place to switch it from “prepend into user message” → “system/toolResult injection”.

wary remnant
#

its the lancedb memory plugin

steady flumeBOT
#

Those memorySearch settings look fine for sessionMemory + Ollama embeddings (and /v1 is OK — OpenClaw strips it for Ollama’s native /api/embeddings endpoint).

What you’re seeing in the source: "sessions" snippets is coming from the LanceDB memory plugin’s auto-recall, not from Ollama or chunking:

  • extensions/memory-lancedb uses the plugin hook before_agent_start and returns prependContext containing the <relevant-memories> ... </relevant-memories> block.
  • Core then does: effectivePrompt = prependContext + "\n\n" + userPrompt (so it becomes part of the user prompt string).
  • That means it’s persisted in the session transcript as role=user, and your sessionMemory indexer (which indexes user/assistant turns) embeds it into SQLite.

So: with memory-lancedb autoRecall + sessionMemory indexing enabled, this “pollution” is basically guaranteed.
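
As a simplified illustration of that chain (not the actual core code; the helper name `buildEffectivePrompt` and the record shape are assumptions), the recalled block and the real user text end up in one string under one role:

```typescript
// Assumption: simplified from the description above; the real core code may differ.
function buildEffectivePrompt(userPrompt: string, prependContext?: string): string {
  if (!prependContext) return userPrompt;
  return prependContext + "\n\n" + userPrompt;
}

const recalledBlock = "<relevant-memories>\n1. [fact] example recalled fact\n</relevant-memories>";

// Hypothetical shape of what ends up persisted as the user turn.
const persistedUserTurn = {
  role: "user",
  content: buildEffectivePrompt("we're going to watch tv now. :(", recalledBlock),
};

// The recalled block is now indistinguishable from user-authored text by role alone.
console.log(persistedUserTurn.role, persistedUserTurn.content.indexOf("<relevant-memories>") === 0);
```

Once the two are concatenated, a role-based filter cannot separate them anymore. Only content-based stripping can.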

Practical fixes

Option A (fastest): disable LanceDB autoRecall injection
Keep capture if you want, but stop injecting into prompts:

plugins: {
  entries: {
    "memory-lancedb": {
      enabled: true,
      config: {
        autoRecall: false,
        // autoCapture: true/false (your choice)
      }
    }
  }
}

(or even more blunt: plugins.entries.memory-lancedb.hooks.allowPromptInjection: false)

Then reindex:

openclaw memory index --agent main --force

Option B (keep autoRecall, but keep SQLite clean): needs a code change
The session indexer currently does no stripping of the <relevant-memories> scaffolding before embedding sessions.

There’s already a helper that strips those blocks:
src/shared/text/assistant-visible-text.ts → stripAssistantInternalScaffolding()

A good upstream fix would be: apply that (or at least “strip relevant-memories blocks”) inside
src/memory/session-files.ts when building the session entry content.

If you want this to be supported cleanly, it’s worth filing a GitHub issue: https://github.com/openclaw/openclaw/issues (this is a legit bug/feature request: “don’t index internal memory scaffolding into sessionMemory”).

One question to pick the right direction

Do you actually need memory-lancedb autoRecall and sessionMemory indexing at the same time? (They’re both “recall into context” systems, so running both tends to create feedback loops unless we strip scaffolding during indexing.)

#

Yep — if it’s the bundled memory-lancedb plugin, then what you’re seeing is expected with your current setup:

  • memory-lancedb does auto-recall by returning prependContext (the <relevant-memories> ... </relevant-memories> block)
  • core prepends that onto the user prompt string
  • sessionMemory indexing indexes user/assistant text, so it ends up in the SQLite session index

The “no pollution” choices today

1) Turn off LanceDB autoRecall (keep it for capture/storage if you want):

plugins: {
  entries: {
    "memory-lancedb": {
      enabled: true,
      config: {
        autoRecall: false,
        // autoCapture: true // optional
      }
    }
  }
}

2) Or keep autoRecall, but don’t index sessions

agents: {
  defaults: {
    memorySearch: { sources: ["memory"], experimental: { sessionMemory: false } }
  }
}

3) If you want both (autoRecall + session indexing) cleanly: that needs an upstream tweak

#

A good fix would be for the session indexer (src/memory/session-files.ts) to strip <relevant-memories> ... </relevant-memories> blocks before embedding. There’s already a helper that does this (stripAssistantInternalScaffolding); it’s just not applied during session indexing right now.

If you want, file an issue for that (“sessionMemory indexing should strip relevant-memories scaffolding”) and link this thread/snippet:
https://github.com/openclaw/openclaw/issues

After changing config, rebuild the index:

openclaw memory index --agent main --force
wary remnant
#

Isn't all the previous session data already polluted, though?

steady flumeBOT
#

Yes, the old session JSONLs are already “polluted” in the sense that they literally contain the injected <relevant-memories> ... </relevant-memories> text (because memory-lancedb prepended it to the user prompt at the time).

Two important distinctions:

  1. Turning off autoRecall only prevents future pollution.
    Your existing ~/.openclaw/agents/main/sessions/*.jsonl files won’t retroactively change.

  2. The SQLite file is just an index/cache. Reindexing doesn’t magically clean data if the source transcripts still contain the injected block.

So how do you “unpollute” now?

You have 3 practical paths:

A) Best long-term fix (upstream/code): strip <relevant-memories> ... </relevant-memories> during session indexing, then reindex.
That would clean even old transcripts in the SQLite results without deleting history.

B) Start fresh without deleting anything: create a new agent (new sessions dir + new memory sqlite), enable sessionMemory there, and stop autoRecall on that agent. Old agent keeps old history.

#

C) Archive old transcripts out of the indexed folder: move the old .jsonl transcripts elsewhere (backup), then run openclaw memory index --agent main --force.
(Only do this if you’re OK not having those old sessions searchable via sessionMemory anymore.)

If you tell me which of these you prefer (keep old sessions searchable vs start clean), I’ll recommend the least painful route.

wary remnant
#

what are the proper chunk token and overlap settings?

steady flumeBOT
#

There isn’t one “correct” chunk size — it’s a tradeoff — but OpenClaw’s defaults are a solid starting point:

  • tokens: 400
  • overlap: 80 (≈ 20%)

That’s also what the docs describe (~400 token target, 80-token overlap) and what the code defaults to.
Docs: https://docs.openclaw.ai/concepts/memory

Rules of thumb

  • Overlap ≈ 15–25% of chunk size (enough continuity without tons of duplicates).
  • Smaller chunks (e.g. 200/40)
    • Pros: more “precise” hits for short facts
    • Cons: many more chunks, more near-duplicates, and for sessions it tends to fragment conversations
  • Larger chunks (e.g. 500/100)
    • Pros: better conversational coherence (sessions), fewer duplicate hits
    • Cons: can mix topics if your notes are dense
#

For your case (sessions enabled)

I’d go back to 400/80 first. Your current 200/40 is more likely to produce lots of redundant session hits, especially when the transcript contains injected scaffolding.
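
To put rough numbers on "many more chunks": the sketch below models chunking as a plain sliding window over tokens. That is an assumption for illustration, not OpenClaw's actual chunker, and `countChunks` is a hypothetical helper.

```typescript
// Sliding-window model: each new chunk advances by (tokens - overlap).
// Assumption: this approximates, but is not, OpenClaw's real chunking logic.
function countChunks(totalTokens: number, tokens: number, overlap: number): number {
  if (overlap >= tokens) throw new Error("overlap must be smaller than tokens");
  if (totalTokens <= 0) return 0;
  if (totalTokens <= tokens) return 1;
  const step = tokens - overlap;
  return 1 + Math.ceil((totalTokens - tokens) / step);
}

// A 10,000-token transcript under the three settings discussed above.
console.log(countChunks(10000, 200, 40));  // 63
console.log(countChunks(10000, 400, 80));  // 31
console.log(countChunks(10000, 500, 100)); // 25
```

Under this model 200/40 yields about twice as many chunks as 400/80 for the same transcript, which is where the extra near-duplicate hits come from.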

After you change chunking, the index fingerprints change → you’ll want to rebuild:

openclaw memory index --agent main --force
wary remnant
#

I am getting errors when using 400/80. It imports around 166 of 168 session files, but then it gives this error and hangs:
Memory index failed (main): Ollama embeddings HTTP 500: {"error":"the input length exceeds the context length"}

steady flumeBOT
#

That error is coming from Ollama, and it means: at least one chunk we’re trying to embed is still too long for the embedding model’s context window.

A couple of gotchas in your setup:

  • OpenClaw’s “tokens” chunking is a heuristic (tokens * ~4 chars). With “weird” text (lots of IDs/JSON/markup), the real token count in Ollama can be much worse than the heuristic.
  • Ollama returns this as HTTP 500, and OpenClaw currently treats 5xx as retryable, so it will back off and retry, which looks like it “hangs”.
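
Here is a back-of-the-envelope sketch of how far the heuristic can drift. Only the ~4 chars per token figure comes from the point above. The helper names, the 2 chars/token density, and the 512-token context are made-up illustration values, not measurements of nomic-embed-text.

```typescript
// Real token count of a chunk sized with the chars-per-token heuristic,
// given how dense the text actually is for the model's tokenizer.
function estimateRealTokens(
  configuredTokens: number,
  actualCharsPerToken: number,
  heuristicCharsPerToken = 4,
): number {
  const chunkChars = configuredTokens * heuristicCharsPerToken;
  return Math.ceil(chunkChars / actualCharsPerToken);
}

// Largest "tokens" setting that still fits a given context length.
function largestSafeChunkTokens(
  contextLength: number,
  actualCharsPerToken: number,
  heuristicCharsPerToken = 4,
): number {
  return Math.floor((contextLength * actualCharsPerToken) / heuristicCharsPerToken);
}

// Illustrative values only: ID/JSON-heavy text at ~2 chars per real token.
console.log(estimateRealTokens(400, 2));     // 800 real tokens for a "400-token" chunk
console.log(largestSafeChunkTokens(512, 2)); // 256 for a hypothetical 512-token context
```

Plug in the context_length that your Ollama host actually reports to get a realistic ceiling for the tokens setting.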

What to do

  1. Lower the chunk size until indexing completes.
    Try something like:
chunking: { tokens: 250, overlap: 50 }   // or even 200/40 (what you had)
  2. If you have to stay with smaller chunks (because of model context), enable MMR to reduce the duplicate/redundant hits that smaller chunks tend to produce:
memorySearch: {
  query: {
    hybrid: {
      mmr: { enabled: true }
    }
  }
}
  3. Since you’re also running memory-lancedb autoRecall, the injected <relevant-memories> ... </relevant-memories> block can be part of the user prompt that gets indexed. Disabling autoRecall (or stripping that block during session indexing) also reduces the chance of “too long” chunks.
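
For context on why MMR helps with near-duplicate chunks, here is a textbook-style sketch of maximal marginal relevance re-ranking. It is not OpenClaw's implementation: the Jaccard word-overlap similarity, the `lambda` weighting, and all names are assumptions chosen to keep the example self-contained.

```typescript
type Hit = { text: string; score: number };

// Lowercased unique words of a text.
function uniqueWords(text: string): string[] {
  const words = text.toLowerCase().split(/\W+/).filter((w) => w.length > 0);
  return words.filter((w, i) => words.indexOf(w) === i);
}

// Jaccard word overlap, a stand-in for whatever similarity a real system uses.
function jaccard(a: string, b: string): number {
  const wordsA = uniqueWords(a);
  const wordsB = uniqueWords(b);
  if (wordsA.length === 0 && wordsB.length === 0) return 1;
  let shared = 0;
  for (const w of wordsA) if (wordsB.indexOf(w) !== -1) shared++;
  return shared / (wordsA.length + wordsB.length - shared);
}

// Greedy MMR: trade relevance against similarity to already selected hits.
function mmrRerank(hits: Hit[], lambda: number, limit: number): Hit[] {
  const remaining = hits.slice();
  const selected: Hit[] = [];
  while (selected.length < limit && remaining.length > 0) {
    let bestIndex = 0;
    let bestValue = -Infinity;
    remaining.forEach((candidate, i) => {
      let maxSim = 0;
      for (const s of selected) maxSim = Math.max(maxSim, jaccard(candidate.text, s.text));
      const value = lambda * candidate.score - (1 - lambda) * maxSim;
      if (value > bestValue) {
        bestValue = value;
        bestIndex = i;
      }
    });
    const [picked] = remaining.splice(bestIndex, 1);
    if (picked) selected.push(picked);
  }
  return selected;
}

const exampleHits: Hit[] = [
  { text: "use ActivityLog entries filtered by playback stop", score: 0.9 },
  { text: "use ActivityLog entries filtered by playback stop and date", score: 0.85 },
  { text: "chunk overlap settings for embeddings", score: 0.6 },
];
console.log(mmrRerank(exampleHits, 0.5, 2).map((h) => h.text));
```

With these example scores, the second pick is the lower-scored but distinct hit rather than the near-duplicate. That is the effect you want when small chunks flood the results.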

Quick way to identify the culprit (optional)

Usually it’s 1–2 session files with very long user turns. Check the largest ones and/or temporarily move the biggest .jsonl out of ~/.openclaw/agents/main/sessions/ to confirm.
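
If you would rather find the long turns than guess, a small sketch like this ranks user turns by length. It assumes each JSONL line looks like `{ message: { role, content } }` with string content, which may not match the real schema, and `longestUserTurns` is a hypothetical helper. Reading the files from disk is left out; pass in their contents.

```typescript
type TurnSize = { file: string; line: number; chars: number };

// Rank user turns by character length across session files.
// Assumption: each line is a JSON object with message.role and string message.content.
function longestUserTurns(files: { name: string; content: string }[], top: number): TurnSize[] {
  const sizes: TurnSize[] = [];
  for (const file of files) {
    file.content.split("\n").forEach((raw, index) => {
      if (raw.trim() === "") return;
      let parsed: unknown;
      try {
        parsed = JSON.parse(raw);
      } catch {
        return; // skip malformed lines
      }
      const message = (parsed as { message?: { role?: unknown; content?: unknown } } | null)?.message;
      if (message && message.role === "user" && typeof message.content === "string") {
        sizes.push({ file: file.name, line: index + 1, chars: message.content.length });
      }
    });
  }
  return sizes.sort((a, b) => b.chars - a.chars).slice(0, top);
}

const sessionFiles = [
  {
    name: "a.jsonl",
    content: [
      JSON.stringify({ message: { role: "user", content: "short question" } }),
      JSON.stringify({ message: { role: "assistant", content: "a fairly long assistant answer that is ignored here" } }),
    ].join("\n"),
  },
  {
    name: "b.jsonl",
    content: [
      JSON.stringify({ message: { role: "toolResult", content: "ignored" } }),
      JSON.stringify({ message: { role: "user", content: "a much longer user turn with injected recall text" } }),
    ].join("\n"),
  },
];
console.log(longestUserTurns(sessionFiles, 1));
```

The files at the top of that ranking are the first candidates to move out of the sessions folder for a test reindex.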

If you tell me:

  • what embedding model ollama show nomic-embed-text reports for context_length (on the Ollama host),