#Dynamically loading context for Neuro's memories


fathom rivet
#

Right now, Neuro's memory is limited, most likely because it all has to be stored in the context alongside the conversation's history, greatly limiting both. But even if that's not how it works, this would be near-unlimited (which her current implementation isn't, considering she forgot fire/water).

Humans don't have all of their memories instantly accessible at once, otherwise we'd be overwhelmed. Instead, we only remember things if we see, hear, etc, or think of something related. A simple way to do this for Neuro would be to store all but her most essential memories and only load them into the context when they're relevant.

Each memory would be stored as a string of text, along with keywords for it. E.g. "Anny is my mother." could be an essential memory kept at all times, but the full memory could be like:
"Anny is my mother, she's a fox girl vtuber who's really good at art and drew my model." (Anny, mother, mom, art, model)

This is then loaded into a hash table (for performance) where each keyword has its own copy of (or pointer to) the memory string. You might have to handle a keyword corresponding to multiple memories, too (concatenate them?). For every word of the input, and every word Neuro says, check if that word is a keyword, and then add that string into the context with other background information.
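A minimal sketch of the keyword table described above, assuming a plain Python dict stands in for the hash table and that a keyword with several memories simply returns all of them. The `KeywordMemory` class and its method names are hypothetical, not anything known about Neuro's actual implementation.

```python
import re
from collections import defaultdict


class KeywordMemory:
    """Hypothetical keyword -> memories index (a dict is a hash table)."""

    def __init__(self):
        # Each keyword maps to a list, so one keyword can hold several memories.
        self._index = defaultdict(list)

    def add(self, text, keywords):
        # Every keyword points at the same memory string (no copies made).
        for kw in keywords:
            self._index[kw.lower()].append(text)

    def recall(self, utterance):
        # Check each word of the utterance against the keyword table.
        found = []
        for word in re.findall(r"[a-z0-9']+", utterance.lower()):
            for memory in self._index.get(word, ()):
                if memory not in found:  # avoid loading the same memory twice
                    found.append(memory)
        return found


store = KeywordMemory()
anny = "Anny is my mother, she's a fox girl vtuber who's really good at art and drew my model."
store.add(anny, ["Anny", "mother", "mom", "art", "model"])

# Anything said by chat or by Neuro herself can be passed through recall().
print(store.recall("Who drew your model?"))
```

The same `recall` call would run on both the incoming message and Neuro's own output, and whatever it returns is what gets added to the context.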

Then you have much more room for Neuro to remember things she currently doesn't (like her birthday), and more room for context from the current conversation! There are probably ways to extend this to let Neuro store new memories herself, but that would be much more difficult because she'd need to figure out the keywords too. Either way, I'm a big fan of Neuro and hope she continues to grow as the best AI vtuber!

crystal laurel
#

I'm pretty sure Neuro uses vector storage for memory rather than conversation history

fathom rivet
#

That's just an encoded form of the conversation history.

crystal laurel
#

Not really?

#

To be fair I'm speaking from a fuzzy understanding, but vector storage is word relation storage, and is housed in a database that is external to the actual language model. So utilizing this, Neuro's memory is just limited by drive space

west saddle
#

Vector storage is fuzzy, searching has the advantage of being able to find similar things but doesn't always find everything.

#

Lorebooks like this are keyword based, but have their own challenges. Mostly you can still run into context limitations, and they don't handle synonyms or relations very well. It's all key -> value pairs.

warm sparrow
#

Eh, Vector storage is semantic, not necessarily fuzzy. I.e. words and sentences with similar meaning are 'closer' to one another in terms of points on a graph.
Regular text search can also be fuzzy, but it is around spelling correction of keywords rather than semantic meaning

#

Your ideal situation would be both

#

Because often, keyword search relevancy can be very very good, in places where the semantic meaning between documents can be far apart

west saddle
#

My understanding of vector storage is admittedly limited, but the primary challenge is that it can miss things, right?

warm sparrow
#

yeah, what documents are similar to input query is down to how the model generates the embedding, but normally, it is around words with similar meaning and sentences with similar features, etc...

#

The downside is when you take longer pieces of text, your relevancy drops, because it is like a hash, there is only so much info you can store in your embedding of a certain number of dimensions

#

and often embeddings don't take into account the length or popularity of a word in a document

fathom rivet
#

So is the problem that too much context dilutes the information as a whole, or that you can only store so many vectors of context?

warm sparrow
#

you can realistically store an unlimited number of vectors for context, naturally there are only so many you can give back to the AI in one go though, but often the big limitation is as text gets bigger -> Semantic meaning of sentence is lost

fathom rivet
#

So the amount of context you can give is limited. That's the limitation I'm trying to work around.

#

You don't give all of the context at once. If something is relevant, add it into the context for a while.

crystal laurel
#

How do you know it's relevant?

fathom rivet
#

Based on keywords.

warm sparrow
#

yeah, that is normally what vector storage engines do

#

well, actually this is false, because most vector engines are a bit shit

crystal laurel
#

How deep of relevance do you go?

warm sparrow
#

but, the idea is they should, you should be able to combine something like the cosine similarity of the text with the BM25 score to get a good relevancy when looking at sentence recall
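A rough sketch of what combining cosine similarity with a BM25 score could look like. The embedding vectors here are hand-written two-dimensional stand-ins (a real system would get them from an embedding model), and the `alpha` weight, the max-normalisation of BM25, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each doc against the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many docs each term appears.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5):
    """Blend semantic (cosine) and keyword (BM25) relevance; alpha is assumed."""
    bm25 = bm25_scores(query, docs)
    top = max(bm25) or 1.0  # scale BM25 into 0..1 so the two scores are comparable
    scored = [
        (alpha * cosine(query_vec, doc_vecs[i]) + (1 - alpha) * bm25[i] / top, doc)
        for i, doc in enumerate(docs)
    ]
    return [doc for _, doc in sorted(scored, key=lambda p: p[0], reverse=True)]


docs = ["anny drew my model", "vedal wrote my code", "whales live in the ocean"]
doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # made-up embeddings
ranking = hybrid_rank("who drew your model", [1.0, 0.05], docs, doc_vecs)
print(ranking)
```

In this toy data the first two documents are almost tied on cosine similarity alone, and the keyword score is what separates them.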

#

Most (all?) LLMs allow you to give a prompt, it gives back an embedding, then you can do similarity search based on that embedding

fathom rivet
#

Does something like that already happen for AIs like Neuro? They choose which parts of the context to feed into the language model based on the inputs?

warm sparrow
#

then you can always add the BM25 to the top

warm sparrow
#

the actual LLM is largely just what is forming the sentences

crystal laurel
#

LLM AIs are, as I understand them, large text prediction engines

fathom rivet
#

Then why is their memory limited?

crystal laurel
#

Drive space and ram to store it in for quick processing as well as model input size

warm sparrow
#

Depends how much storage you're willing to give them 😅 and generally speaking, tuning your search is quite a difficult task and hard to do a one case fits all type solution

#

As far as I'm aware, most implementations of this only use semantic meaning currently, which limits you when you want to take into account things like time, keywords, etc...

#

Basically at some point, you need keywords to prune results more often than not when things get big, semantic search is great until you have lots of similar things (like conversations) and then it becomes a battle of "How do I choose which contexts to inject out of these thousands of docs?"

fathom rivet
#

If RAM is a factor, that means that the vectors must be extremely large. Input size doesn't seem like it should be a limit to the total available information.

west saddle
#

So how would getting an embedding database actually work in practice? Take your prompt, send it out to get the embedding, search your vector database for results using that embedding, and then change your prompt with what you found?

#

Or inject what you found into the context maybe?
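The flow asked about above (embed the prompt, search the stored vectors, inject the hits into the context) can be sketched roughly as follows. The `embed` function is a deliberately crude word-count stand-in for a real embedding model, and `VectorStore`, `build_prompt`, and the prompt layout are all hypothetical.

```python
import math


def embed(text, vocab):
    # Stand-in for a real embedding model: one word-count dimension per vocab term.
    words = text.lower().split()
    return [words.count(term) for term in vocab]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class VectorStore:
    """Hypothetical in-memory vector database."""

    def __init__(self, memories):
        self.memories = memories
        self.vocab = sorted({w for m in memories for w in m.lower().split()})
        self.vectors = [embed(m, self.vocab) for m in memories]

    def search(self, query, top_k=1):
        # Embed the query, then rank stored memories by cosine similarity.
        q = embed(query, self.vocab)
        ranked = sorted(
            zip(self.vectors, self.memories),
            key=lambda pair: cosine(q, pair[0]),
            reverse=True,
        )
        return [memory for _, memory in ranked[:top_k]]


def build_prompt(user_message, store, top_k=1):
    # Inject the retrieved memories ahead of the message sent to the LLM.
    recalled = store.search(user_message, top_k)
    background = "\n".join(f"- {m}" for m in recalled)
    return f"Relevant memories:\n{background}\n\nUser: {user_message}\nNeuro:"


memories = [
    "anny is my mother and drew my model",
    "vedal is my creator and wrote my code",
]
store = VectorStore(memories)
prompt = build_prompt("who wrote your code", store)
print(prompt)
```

Whether the hits are spliced into the prompt text or kept in a separate context section is a design choice; the sketch just prepends them.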

warm sparrow
#

The vectors are normally between 300 and 900 dimensions, so each vector normally takes up a few KB unless you quantise them (math be like)
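As a quick check of that arithmetic, assuming each dimension is stored as a 4-byte float32 (an assumption, the storage format is not stated) and that quantising means dropping to 1 byte per dimension:

```python
def vector_bytes(dims, bytes_per_value=4):
    """Raw size of one embedding; 4 bytes per value assumes float32."""
    return dims * bytes_per_value


for dims in (300, 900):
    full = vector_bytes(dims)        # float32
    quantised = vector_bytes(dims, 1)  # int8, assumed quantisation target
    print(f"{dims} dims: {full} bytes as float32, {quantised} bytes as int8")

# One million 900-dimension float32 vectors, ignoring index overhead.
million = 1_000_000 * vector_bytes(900)
print(f"{million / 1e9:.1f} GB")
```

So 300 to 900 dimensions works out to roughly 1.2 to 3.6 KB per vector before quantisation.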

fathom rivet
#

Yeah, inject it into the context, and continue to update it when the AI model outputs words too.

#

Like in my example, if Neuro decides to talk about Anny, it would load the rest of the information on her.

crystal laurel
#

What do you mean by "rest of information" though? And how would it actually relate that information to the input being processed?

warm sparrow
#

The only thing I am not sure about is if you actually feed the AI's responses back into itself? I think normally just feeding the prompts back in allows it to infer, since it should return identical responses if the seed is the same

fathom rivet
#

I may have made some assumptions about how the AI generates text, that you can go word-by-word.

fathom rivet
#

So in between that, if that word was a keyword, it puts some new information into the context (possibly pushing out older information)

#

(Of course, you need to keep track to avoid duplicating information and bump up the existing information instead)

west saddle
#

GPT4-turbo has 128k context, and I wonder how much smarter that's going to make most applications.

#

3.5 turbo only had 16k tops

fathom rivet
#

It's not processing it as if it was input, but adding it to the context.

crystal laurel
#

What are you meaning by context?

fathom rivet
#

As well as background information from before the conversation.

#

Basically, however that information is currently handled, dynamically add it only when relevant (based on keywords or theoretically another metric if one is better) and remove the oldest if you need extra space.
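A small sketch of that add-when-relevant, evict-oldest behaviour, including bumping an already-loaded memory instead of duplicating it. It assumes the budget is counted in entries rather than tokens to keep things short, and the `DynamicContext` name is made up.

```python
from collections import OrderedDict


class DynamicContext:
    """Hypothetical context slot list: oldest entries first, newest last."""

    def __init__(self, max_entries):
        self.max_entries = max_entries  # assumed budget in entries, not tokens
        self._entries = OrderedDict()

    def touch(self, memory):
        if memory in self._entries:
            # Already loaded: bump it to "most recent" instead of duplicating.
            self._entries.move_to_end(memory)
            return
        self._entries[memory] = None
        while len(self._entries) > self.max_entries:
            # Out of room: push out the least recently relevant memory.
            self._entries.popitem(last=False)

    def render(self):
        return list(self._entries)


ctx = DynamicContext(max_entries=2)
ctx.touch("Anny is my mother.")
ctx.touch("Vedal is my creator.")
ctx.touch("Anny is my mother.")    # bumped, not duplicated
ctx.touch("ShyLily is a whale.")   # evicts the Vedal entry, now the oldest
print(ctx.render())
```

A real version would budget by token count, but the bump-and-evict logic would stay the same.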

crystal laurel
#

It feels like you're describing how SillyTavern and lorebooks work

warm sparrow
#

one thing I don't get is how SillyTavern gets the keywords

#

do they do keyword extraction? Or do you need to manually pass the keyword?

fathom rivet
#

I might be, if it's a good idea, it makes sense that other people have thought of it and are using it.

warm sparrow
#

I think it can probably be improved tbh

west saddle
#

keyword extraction from the prompt. But remember for ST the prompt is continuously built and sent as a giant chunk, so you'd have your base prompt which contains your ongoing conversation, then they'd inject all the lorebook keywords that are contained in your prompt

#

They do some magic with playing with context lengths and where in the prompt your lorebook entries are injected

#

It works fairly well but I'm sure there's room for improvement

fathom rivet
#

In this case, Vedal would be supplying the keywords for things. In most cases, it shouldn't matter because the keywords are just when the thing in question is explicitly mentioned, and maybe a few related things just in case you want Neuro to be able to bring them up. E.g. "Whale" could be a keyword for ShyLily and Bao.

#

So then if there's a conversation about whales Neuro could say "I love whales like ShyLily" or something.

#

But for the most part, it only really matters if something is directly referenced.

west saddle
#

Neuro is also a little different than ST, because it seems to me like Neuro doesn't have much overall memory from earlier chat questions and whatnot. Like, she wouldn't remember what a donation 5 minutes ago said.

crystal laurel
#

I'm not sure of Vedal's exact implementation of Neuro, but I'd have to think he isn't just dumping a year's worth of conversation history into her each time she speaks. He's using some sort of vector storage implementation to serve as Neuro's memory

warm sparrow
#

Eh, Take prompt -> Strip stopwords -> Stem text -> Text search -> Combine with cosine
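The first steps of that pipeline might look something like this. The stopword list and the suffix-stripping stemmer are crude placeholders (a real setup would use a proper stemmer such as Porter), and the final "combine with cosine" step is left out because it needs an embedding model. All names here are hypothetical.

```python
import re

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "my", "your", "of", "to",
             "and", "who", "what", "do", "you", "about", "in"}


def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # Take prompt -> strip stopwords -> stem text.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOPWORDS]


def text_search(query, docs):
    # Text search: rank docs by how many stemmed query terms they share.
    terms = set(preprocess(query))
    scored = [(len(terms & set(preprocess(d))), d) for d in docs]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked if score > 0]


docs = ["ShyLily is a whale vtuber", "Anny drew my model"]
print(preprocess("What do you think about whales?"))
print(text_search("What do you think about whales?", docs))
```

The overlap count from `text_search` is the keyword-side score that would then be blended with a cosine similarity.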

west saddle
#

Vedal is clearly priming Neuro with some context based on the stream theme of the day though - like the other day he clearly primed the model up with the fact that it was a family stream and Anny was her mom.

fathom rivet
#

That's what I'm hoping to try to fix.

crystal laurel
#

I don't think she has a memory problem at the moment (other than a leak from some third party library)

fathom rivet
#

She can forget things within seconds of saying them.

crystal laurel
#

That's just her dementia

warm sparrow
#

lol

west saddle
#

It'd be fairly easy to test this just by donating twice in a row and asking her about what you just said

warm sparrow
#

That is something that probably isn't currently done and would be cool to see

#

to remember users and their previous chats (or parts of it)

#

that being said, that makes a pretty big DB pretty quickly

west saddle
#

I'd bet that either each interaction is independent (but there's a per-stream setup with the system prompt or permanent context), or Vedal is storing each user's interaction individually in a vector db so she remembers things about them.

#

But yeah, that would grow very quickly

#

and considering he was running Neuro on a desktop and a single 4090 I'm not sure that'd even be possible

warm sparrow
#

it isn't that big 😅 You can store a lot of it on disk, but I think the existing solutions currently aren't the best for this sorta thing

west saddle
#

More like vram concerns

#

but then again he's offloading the llm I'm sure

#

so maybe there's room

warm sparrow
#

The vector search indexes don't normally sit in VRAM

#

The LLM is largely the only thing actually eating the gpu memory

west saddle
#

I just started reading into how this stuff works maybe 9 months ago myself, and man there's some crazy smart people on some of these discord servers for it.

warm sparrow
#

And then eventually you learn that AI is 99% just data and most of your time then gets spent trying to deal with data, cleaning it and making the most of it

fathom rivet
#

Most likely, only Vedal knows enough about how Neuro works to determine if using a method like this would be helpful.

light summit
#

I have an inquiry about Neuro picking standard responses, like when she regurgitates very pipelined opinions that she didn't reach as her own conclusion at all