#How to add 'thinking' for Neuro's LLM


lethal storm
#

Just a simple idea on how to integrate a thinking model for Neuro/Evil. Parallel processing is nothing new, but anyway, here's my input on this:

The idea here is to separate Neuro's logic into two layers: one is responsible for her direct answers (as she's been doing so far), the other is for thinking (handling tougher questions that need more time and compute).
I've attempted to display this process on a simple diagram (see pic.):
VC or Twitch chat queries go into the Q blocks, which make the decision: does the question require thinking? In other words, Q is a submodel that decides whether the given prompt should be processed as a simple or a complex one. If the Q block's output is no, the question is passed to the main-layer model in the R blocks (whichever way Neuro answers at the moment), so it is processed instantly. If the Q block's output is yes, the question is forwarded to the thinking model P, which acts as an inner monologue. Meanwhile the query can be set aside (if it's a regular Twitch chat query) and Neuro moves on to the next one. Once the next question has passed through its own Q block, regardless of the outcome, the previous P block delivers the result for the previous query, and that earlier question can now be answered in the R' block. (Note: if the Q2 block outputs 'no', meaning the question can be processed instantly, it will be answered before the answer to the previous query Q1 comes out of the thinking layer P1.)
If there are no parallel queries going (e.g. Neuro is talking to Vedal without Chat, Priority Messages, etc., so a 1-on-1 dialogue), thinking will have to be processed in Serial mode instead of Parallel (Q1 -> {no: R1, yes: P1 -> R1'}), which will obviously increase latency slightly in a regular dialogue.

Short designations:
Q - thinking-decision blocks;
P - thinking processing models/units;
R - simple response;
R' - response after thinking.
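
To make the Q/P/R flow above concrete, here is a minimal Python sketch of the parallel mode. Everything in it is an assumption for illustration: the function names (`needs_thinking`, `quick_reply`, `think`, `handle_queries`), the keyword heuristic standing in for the Q submodel, and the short sleep standing in for the thinking model. It only demonstrates the order in which answers come out, not how Neuro is actually built.

```python
import asyncio
from typing import List, Optional

# All names below are hypothetical stand-ins, not parts of any real system.

def needs_thinking(query: str) -> bool:
    """Q block: toy heuristic standing in for a trained decision submodel."""
    return len(query.split()) > 6 or "why" in query.lower()

async def quick_reply(query: str) -> str:
    """R block: the instant answer path."""
    return f"R: {query}"

async def think(query: str) -> str:
    """P block: slow 'inner monologue', simulated here with a short sleep."""
    await asyncio.sleep(0.05)
    return f"R': {query}"

async def handle_queries(queries: List[str]) -> List[str]:
    """Parallel mode: a deferred answer surfaces after the next query's Q block."""
    spoken: List[str] = []
    pending: Optional[asyncio.Task] = None
    for query in queries:
        hard = needs_thinking(query)                 # Q block for this query
        if not hard:
            spoken.append(await quick_reply(query))  # simple query: R answers first
        if pending is not None:
            spoken.append(await pending)             # previous P result becomes R'
            pending = None
        if hard:
            pending = asyncio.create_task(think(query))  # start P and move on
    if pending is not None:
        spoken.append(await pending)                 # flush the last thought
    return spoken

if __name__ == "__main__":
    print(asyncio.run(handle_queries(["why is the sky blue", "hi"])))
```

With the two queries in the example, the quick answer to "hi" is spoken first and the thought-out answer to the first question follows, which matches the note about Q2 and P1 above. Serial mode would simply await `think(query)` in place instead of deferring it.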

lethal storm
# lethal storm Just a simple idea on how to integrate a thinking model for Neuro/Evil. Parallel...
Pros:
+ More intelligent responses when needed;
+ Saving response time for simple queries (depending on Q-block efficiency);
Cons:
- Increased latency due to addition of decision blocks;
- More response delay in direct conversation;
- Complexity of the architecture (need to work out how to implement the Q and P blocks respectively);
- Heavy increase in compute requirements (yes, thinking layers evidently require a lot of compute).
inland steppe
#

Hmmm I see, so it's a step above hidden thought tokens because Neuro can answer a new question even though she hasn't answered the last question yet

reef yacht
#

Certainly a very interesting approach.

The Q block is going to be quite challenging though, because it is most likely going to be a neural network. The training data for this isn't going to be easy to come by. And responses from the P blocks may be noticeably different from the R model's, causing some cohesion problems.

One interesting thought I have is that, instead of having a Q block make decisions, the P process is always running, and its output is fed back into the R block as context. R essentially serves as the Q block.

Then the R block can be trained to produce "buying time" behavior when it recognizes that the context from the P block hasn't reached a conclusion. It could also rephrase the P block's output in a more classic "Neuro" way.

It would also be interesting to see if the P block can assist in producing "models" for the social context and participants/subjects in a conversation. "We are discussing collabs in an offline call, my friend and partner Layna seems annoyed" would be something P produces as "context" for R.
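
The always-running P idea and the "buying time" behavior described above could be sketched roughly like this in Python. The `Thoughts` scratchpad, `p_loop`, `r_reply` and the stalling phrase are all made up for the example, and the sleep is only a placeholder for slow generation; a trained R model would of course not be a simple `if`.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List

@dataclass
class Thoughts:
    """Shared scratchpad written by P and read by R (hypothetical structure)."""
    notes: List[str] = field(default_factory=list)
    concluded: bool = False

async def p_loop(thoughts: Thoughts, steps: List[str]) -> None:
    """P block: appends notes one by one, then marks a conclusion."""
    for step in steps:
        await asyncio.sleep(0.01)      # stand-in for slow generation
        thoughts.notes.append(step)
    thoughts.concluded = True

def r_reply(query: str, thoughts: Thoughts) -> str:
    """R block: stalls while P is unfinished, otherwise uses P's last note."""
    if not thoughts.concluded or not thoughts.notes:
        return "Hmm, let me think about that..."
    return f"{query} -> {thoughts.notes[-1]}"
```

Here R doubles as the decision maker: it checks whether P has concluded and either stalls or answers, so no separate Q block is needed in this variant.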

lethal storm
# reef yacht Certainly a very interesting approach. The Q block is going to be quite chall...

Interesting, but there are a few issues here:

  • it would be hard to train the P block to output the "end-of-stream" token, after which the output is sent into the R block as context, and to decide when to send it so that it still fits the conversation;
  • if too much context stacks up in the P block, it can overload Neuro's model to the point where there will be context leaks (models tend to 'forget' things if the context is too long);
  • continuously outputting tokens puts a much bigger strain on hardware and won't be energy efficient either, plus you get a higher chance of crashes (at least until Tutel gets himself a B100 or a similar high-end GPU).

Producing context for Neuro as time goes by does seem to be a nice idea though.
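
On the second point above (context piling up in the P block), one common mitigation is to cap how much of P's output is kept and drop the oldest notes first. The sketch below is a hypothetical `ContextBuffer` that uses a crude word count as the budget; a real setup would count tokens with the model's own tokenizer, and nothing here reflects how Neuro actually manages context.

```python
from collections import deque

class ContextBuffer:
    """Keeps only the newest notes that fit in a rough word budget."""

    def __init__(self, max_words: int) -> None:
        self.max_words = max_words
        self.notes = deque()   # oldest note on the left, newest on the right
        self.words = 0

    def add(self, note: str) -> None:
        self.notes.append(note)
        self.words += len(note.split())
        # Drop the oldest notes until the budget is met; always keep the newest.
        while self.words > self.max_words and len(self.notes) > 1:
            self.words -= len(self.notes.popleft().split())

    def render(self) -> str:
        """Text that would be handed to the R block as context."""
        return "\n".join(self.notes)
```

The trade-off is that anything pushed out of the buffer is gone for good, so this only addresses overload, not long-term memory.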

reef yacht
# lethal storm Interesting, but there are a few issues here: - it would be hard to train P bloc...
  • I was assuming that part is done through a normal program and Neuro was just receiving a timer for it. But if it's controlled by Neuro, I would think R block is still responsible for it. P block is strictly a thinking module in my mind.
  • Yeah, context size is a problem. I was assuming it's already a problem for the R block too, as evidenced by the earlier Minecraft streams. Turtle has some mechanisms for context management/continuation, and it seems to be working fine after the first few streams. I don't know if it was just reduced spam from the Minecraft module or actual context continuation. But either way, the P block doesn't really need all the info the R block is being presented with.
  • Good point. I actually wasn't thinking about the Python/C boundaries for IOs. Maybe the idea then is just to let the P block run as long as it takes, with the R block inferring the lack of input from the P block. Well, variable timing is going to be so hard to train. Aww. This might be prohibitive for the idea.
crystal surge
#

This is interesting, but I think Neuro needs hidden responses more than long thinking time for questions.
Either way, I think whatever triggers thinking can also make her give a response at the same time ("I'm thinking", "I have to think about it").
I don't think she needs time or multiple lines to process thoughts. Also I'm wondering in this case, when can she read her thoughts? Can she read everything any time? Or just specific responses at specific moments? Cause that could feel either too confusing or too random.

#

I honestly think just a thought command in the same process line is fine, they are fast enough to hardly notice because she doesn't have to say them and if she thinks a lot she'll look like she's thinking.
Having parallel thought is intriguing but feels needlessly complicated imo, she's kinda already able to respond to multiple questions at once.
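
A rough sketch of the "thought command in the same process line" idea: the model writes its thoughts inline, and a small filter strips them out before anything is spoken. The `<thought>...</thought>` tag and the `split_thoughts` helper are assumptions made up for this example, not a known part of Neuro's setup.

```python
import re
from typing import List, Tuple

# The <thought>...</thought> tag is an assumed convention for this sketch.
THOUGHT = re.compile(r"<thought>(.*?)</thought>", re.DOTALL)

def split_thoughts(raw: str) -> Tuple[str, List[str]]:
    """Return (text to speak, hidden thoughts) from one generated reply."""
    thoughts = [t.strip() for t in THOUGHT.findall(raw)]
    # Remove the thought spans, then collapse leftover whitespace.
    spoken = " ".join(THOUGHT.sub(" ", raw).split())
    return spoken, thoughts
```

Only the first element would go to text-to-speech; the second could be kept in context so she can refer back to her own thoughts later.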

gloomy kiln
#

why not use MoE (Mixture of Experts)?

crystal surge
#

Also, parallel LLMs are probably more useful for playing video games.

midnight pelican
#

What about having several tiny programs designed to pick up context that Neuro's program could then read, instead of relying on Neuro's program to work out the context itself? Think tiny, energy-efficient neural networks.

#

So Neuro could be presented context from smaller A.I.s whose function is just to present context

#

Perhaps Neuro could be programmed to read a variety of contexts from these smaller A.I.s. If general contexts could be compressed into a short code like Context A or Context A-3, then you could string a bunch of these into the view of Neuro's neural network, which would be programmed to interpret the contextual readouts of the smaller A.I.s

#

As an example:

With friend? A
With friend Numi? A-3
With friend Numi in Minecraft? A-3c
With friend Numi in Minecraft, need coal? A-3c-C
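
The coding scheme in this example could be sketched in Python as a few lookup tables plus a function that chains them. The table contents and the `encode_context` name are hypothetical; they just reproduce the four codes listed above.

```python
from typing import Optional

# Hypothetical code tables; the letters and numbers mirror the example above.
FRIENDS = {"Numi": "3"}
PLACES = {"Minecraft": "c"}
NEEDS = {"coal": "C"}

def encode_context(friend: Optional[str] = None,
                   place: Optional[str] = None,
                   need: Optional[str] = None) -> str:
    """Build a compact code such as 'A-3c-C' for a 'with friend' context."""
    code = "A"                              # 'A' = with a friend
    if friend is not None:
        code += "-" + FRIENDS[friend]       # which friend
        if place is not None:
            code += PLACES[place]           # where
            if need is not None:
                code += "-" + NEEDS[need]   # current need
    return code

if __name__ == "__main__":
    print(encode_context("Numi", "Minecraft", "coal"))
```

Each smaller A.I. would only have to fill in its own slot, and the main model would see one short string instead of a paragraph of description.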

crystal surge
midnight pelican
#

The human nervous system is completely full of such efficiencies; the brain is not one entity that controls everything in your body. Your nervous system is made up of dozens of individual processes that allow you to do things more efficiently and with less active focus.
I would like Neuro to not have to "Focus" directly on something for her to understand that it is happening. For her to not have to "think" about it to comprehend it.

crystal surge
#

Idk, I feel like she has enough awareness of the current context, she just sometimes forgets what she's doing. I don't think she needs to be reminded of what she's playing and with whom, because she knows that most of the time.

rugged locust
#

i wonder if you could also have stuff like toggles and her movements in games happen during the thinking process in this case.
if the latency is low enough, she might be able to play in real time without having to tell a separate bot what to do