#why doesn't neuro just have a virtual keyboard and mouse so she can play any game?

dull blaze
#

i think it would be easier to just make a fake keyboard and mouse that neuro can use, so she just needs to learn how to use a keyboard, but that is just like teaching a 10yo kid how to type on a keyboard

sudden dirge
#

She thinks once every few seconds so she can only react at that rate. I think the keyboard would be easy otherwise as she can just use a simple command

tribal valve
#

she's slow

polar gull
#

the sheer latency

#

she can barely play minecraft

#

because the slow

sudden dirge
#

I wonder if you can run instances of her that only generate a couple of tokens for keyboard inputs

#

That way it can be run multiple times per second

#

Or does getting the game state take too long?

#

Maybe hard cap it so she doesn’t start yapping in the input instances
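
A minimal sketch of the "hard cap" idea above, assuming a hypothetical model that exposes a score per candidate token at each step. Nothing here reflects how Neuro is actually built; `ALLOWED_KEYS`, `fake_scores` and `pick_keys` are made-up names for illustration.

```python
# Toy sketch: restrict output to a small key vocabulary and cap its length.
# The scores stand in for whatever a real model would produce; they are invented.
ALLOWED_KEYS = {"W", "A", "S", "D", "SPACE", "NONE"}

def pick_keys(step_scores, max_tokens=2):
    """Pick the best allowed key per step, stopping at max_tokens or on NONE."""
    keys = []
    for scores in step_scores[:max_tokens]:          # hard cap on output length
        allowed = {k: v for k, v in scores.items() if k in ALLOWED_KEYS}
        if not allowed:
            break
        best = max(allowed, key=allowed.get)
        if best == "NONE":                           # model chose to do nothing
            break
        keys.append(best)
    return keys

fake_scores = [
    {"W": 0.6, "hello": 0.9, "NONE": 0.1},   # "hello" is ignored: not a key
    {"SPACE": 0.5, "chat": 0.8, "NONE": 0.2},
    {"A": 0.9},                              # never reached with max_tokens=2
]
print(pick_keys(fake_scores))
```

The cap only bounds how much is generated per call; it says nothing about how long it takes to get the game state into the model first.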

sudden dirge
#

The instances would be the same as regular neuro but with the keyboard prompt

#

That way it’s still aware of what it said

wheat ember
#

not really how llms work afaik

#

we use keyboards and are able to react in real time to any situation by pressing many keys

#

neuro would react with a 10s~ delay (and not know entirely whats happening)

#

and "press" one key?

#

plus its not as if these multiple instances would actually be able to coherently cooperate, even if it was possible to hook em up

#

as good as it would be, its just not feasible

hearty dune
# wheat ember not really how llms work afaik

this, and specifically neuro cant just "learn how to use a keyboard" in the same way humans can. I suspect that even if you patch in a keyboard through the neuro game api, the added abstraction will likely make her be unsuccessful at whatever she's trying to do.

north nimbus
#

Not possible as long as Neuro is a regular LLM
If she was something more advanced she could do it even more natively though

thin flint
half tinsel
#

Not only latency but also vision: she can't see the screen like we do, so she can't react properly.

sudden dirge
north nimbus
sudden dirge
north nimbus
#

Yeah, so it would be literally the current system, that's no solution to the speed problem

#

In the current system, Neuro's LLM is generating tokens and either generates the tokens for the game API, which requires her function calling system, or generates speaking tokens, which is just her regular output

sudden dirge
#

This is simple keyboard inputs not complex commands. Wouldn’t it take much less time to generate only one or two tokens?

north nimbus
#

It not only has to generate the tokens for the keyboard input, it also has to generate tokens like <function> name=keyboardinput; input=Z </function>
And not to mention the speech tokens, which would slow her down the most probably, making her reaction time basically as long as her average speech time
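
A back-of-envelope version of the argument above, assuming each generated token costs roughly the same time. Every number below (decode speed, wrapper size, speech length) is an illustrative assumption, not a measurement of Neuro.

```python
# Rough latency estimate: tokens generated one after another, constant cost each.
# All numbers are illustrative assumptions.
def generation_latency(n_tokens, seconds_per_token):
    return n_tokens * seconds_per_token

SECONDS_PER_TOKEN = 0.05      # assumed decode speed (20 tokens/s)
bare_key_tokens = 1           # just "Z"
wrapper_tokens = 12           # assumed cost of the <function> ... </function> wrapper
speech_tokens = 60            # assumed length of a spoken reply

bare = generation_latency(bare_key_tokens, SECONDS_PER_TOKEN)
wrapped = generation_latency(bare_key_tokens + wrapper_tokens, SECONDS_PER_TOKEN)
with_speech = generation_latency(
    bare_key_tokens + wrapper_tokens + speech_tokens, SECONDS_PER_TOKEN
)
print(f"bare key: {bare:.2f}s, function call: {wrapped:.2f}s, call + speech: {with_speech:.2f}s")
```

Under these assumptions the key itself is the cheapest part; the wrapper and especially the speech dominate.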

sudden dirge
#

The prompt would tell her to just name the input. The instance would be handled differently than normal and be hard capped so she doesn’t yap

north nimbus
#

That's not how things work with the Neuros
The Neuros aren't prompted in the classical way, most of their behavior is controlled by special finetuning
And note, loading a different instance adds latency
And the speaking tokens would still need to be generated somewhere, it doesn't matter where they're generated, the whole LLM system is a serialized structure just because of how much transformers suck

sudden dirge
#

I don’t know what word to use so I’m saying instance. And I’m pretty sure the commands are given by the prompt.

north nimbus
#

The functions Neuro has access to are given in her context

#

What do you even mean exactly anyway? Switching out the prompts and having to redo the entire KV cache would introduce way more latency than having to only generate a couple tokens would save

sudden dirge
#

Sorry I don’t know enough about this stuff.

north nimbus
#

Main things to note:

  • completely switching out the prompts would introduce latency from KV-cache invalidation
  • Neuro still has to speak, and since transformers sucks, she can't do anything else while speaking
sudden dirge
#

Well I don’t see why it would need to remove the previous prompt

#

I should really stop procrastinating and start learning these things

#

That way I can just see for myself if something works

north nimbus
#

Also of note, how would you know when Neuro wants to input something on her virtual input devices even if she was given some? You need a function call

#

And that entirely defeats the point of generating less tokens from not having to function call

sudden dirge
north nimbus
sudden dirge
#

Also can’t you remove stuff from context?

north nimbus
sudden dirge
#

Huh

north nimbus
# sudden dirge It would happen automatically a few times between normal talking

Pretty sure that kinda goes against Vedal wanting to give Neuro control over things
And what if it's in a position she knows she shouldn't move or do anything, but the prompt tells her to give an input, so she generates something even though when she's speaking she knows she shouldn't, and then dies in the game because of that?

north nimbus
# sudden dirge Huh

The KV-cache is quite volatile, any non-linear change to the context will invalidate it

sudden dirge
#

Wait so how does the filter do it?

#

Or does the latency happen during tts?

north nimbus
#

My guess is the filter either doesn't remove stuff from the context, or rolls it back

#

Both of which are linear changes
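
A toy illustration of "linear" vs "non-linear" context changes, assuming a cache that stays valid only for the longest unchanged prefix of the token sequence. The token lists and function names are made up for the example.

```python
# In a causal transformer each position's cached keys/values depend on every
# earlier token, so only the unchanged prefix of the context can be reused.
def reusable_prefix(cached_tokens, new_tokens):
    """Length of the shared prefix; only this part of a KV cache stays valid."""
    n = 0
    for old, new in zip(cached_tokens, new_tokens):
        if old != new:
            break
        n += 1
    return n

def tokens_to_recompute(cached_tokens, new_tokens):
    return len(new_tokens) - reusable_prefix(cached_tokens, new_tokens)

cached = ["sys", "hi", "chat", "says", "lol"]
appended = cached + ["press", "W"]                     # linear: append
rolled_back = cached[:3]                               # linear: drop the tail
edited = ["sys", "KEYBOARD", "chat", "says", "lol"]    # non-linear: change near the start

print(tokens_to_recompute(cached, appended))      # 2
print(tokens_to_recompute(cached, rolled_back))   # 0
print(tokens_to_recompute(cached, edited))        # 4
```

Appending and rolling back only touch the end, so almost everything is reused; swapping one token near the start forces everything after it to be recomputed.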

sudden dirge
#

It replaces what they were saying with the word filtered in their context. Though that might only be the secondary filter

north nimbus
#

I don't think it puts the word "filtered" in the context

sudden dirge
north nimbus
#

Not particularly well on an LLM like Neuro's
Neuro's LLM is not particularly tuned to follow instructions

dull blaze
#

So from all i have read, it would only work if neuro had better hardware like a supercomputer made for ai, otherwise it would be too slow to be worth it

north nimbus
#

There's already private AIs out there that are extremely capable while not being transformers-based though, I would say even more capable than Neuro with game API

sudden dirge
north nimbus
cursive venture
north nimbus
north nimbus
#

My guess is there's some sort of (CNN?) model that converts the images into text/model tokens

#

Or you've implemented one of those cross-attention things that makes LLMs have image-text modalities

#

Though I don't know much about that side of machine learning stuff, I'm much more into voice models

random axle
#

I mean as long as Neuro is still an LLM, everything has to be converted into tokens for her to understand anything.

#

Maybe you could bolt on the LLM some kind of add-on model of a different architecture that is responsible for the more real-time actions.

#

Not like I have any idea how anything works anymore in the AI space, too much is going on to keep up.

waxen plume
#

how exactly in the world would you train a LLM even with vision to perform actions in a game using mouse and keyboard

#

okay wait nevermind he actually cooked there Tutel
there ARE already variants of LLMs that can do exactly this thing, like desktop agents

#

but watch literally any demo and they are as slow as a turtle (pun not intended)

#

like if you want to make it work, she needs a better interface for gaming than image-to-text (better yet, get rid of image data altogether, but its impossible to adapt for every game like this, exactly why we have the gaming API thread)

#

am I done with essaying on this topic that will only work after years of scientific progress in decreasing delay for image multimodality in LLMs? yes perhaps.

subtle wagon
#

Just switch out neuro with a guy with a good voice changer and maybe feed him if he does a good job imitating neuro

rustic gust
#

it's difficult to make.

north nimbus
#

Some random silliness

rustic gust
# tribal valve she's slow

(Turning ping off for this reply)
Also IF you wanted to implement this (for more real time games) you would have to create a small RL model with something like PPO or GRPO that learns to play the game with a much smaller model ~100m params or less tbh. Then have neuro analyze the movements being done in the game (probably through vision) and give live commentary. Although understandably, training a PPO model on these games isnt exactly easy to do. But it is possible for some simpler real time games.

Framework would probably look like this:

  1. Create a heuristic simulation of this game for training data
  2. Pretrain a PPO model on the heuristic simulation to make live actions given game inputs
  3. Get real inputs via trained YOLO models (which would be a lot faster than LLM vision models)
  4. Fine tune PPO model on real gameplay from noisy inputs.
  5. During gameplay, let Neuro look at the screen where the PPO model is playing the game and make commentary based on what is happening. This is okay to have latency since it does not actually affect the gameplay and is just for entertainment value.

Issues:

  1. Neuro can't actually change how she plays the game and has no control over it and is mostly watching it
  2. Training a PPO model takes kinda long/expensive and wouldn't be very good at the game (would still be pretty funny though)
  3. Model might geek out when the game is over. (Should be pretty simple to implement some game stopping logic though)
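
The split described above (fast agent plays, slow model only comments) can be sketched roughly like this. Both parts are trivial stand-ins: `fast_policy` is a hand-written rule rather than a trained PPO policy, and `slow_commentator` is a string formatter rather than an LLM.

```python
# Toy sketch: a fast policy acts every tick, a slow commentator only looks
# at the game every few ticks and has no control over it.
def fast_policy(state):
    """Stand-in for a small trained policy: move toward the target."""
    if state["player_x"] < state["target_x"]:
        return "RIGHT"
    if state["player_x"] > state["target_x"]:
        return "LEFT"
    return "STAY"

def step(state, action):
    dx = {"RIGHT": 1, "LEFT": -1, "STAY": 0}[action]
    return {**state, "player_x": state["player_x"] + dx}

def slow_commentator(state):
    """Stand-in for the LLM: describes the state, never controls it."""
    gap = abs(state["target_x"] - state["player_x"])
    return "made it!" if gap == 0 else f"{gap} steps to go"

def run(state, ticks=10, comment_every=4):
    comments = []
    for tick in range(ticks):
        state = step(state, fast_policy(state))      # runs every tick
        if tick % comment_every == 0:                # runs rarely, may lag behind
            comments.append(slow_commentator(state))
    return state, comments

final_state, comments = run({"player_x": 0, "target_x": 6})
print(final_state["player_x"], comments)
```

The commentary can be arbitrarily late without affecting the outcome, which is also exactly issue 1 in the list: the commentator has no say in what the policy does.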
#

that's basically the only way to do it since LLMs are not meant for real time applications.

#

I honestly could probably make it if you wanted lol. Not sure if it's worth it but would be down

fathom lark
#

Not to say that would necessarily make it solved or anything, it would just be a lot closer to something useful

north nimbus
#

Just make something new that isn't an LLM
I've already seen some cool ones

worldly shadow
# north nimbus She has no eyes, one can't simple determine what she's looking at She's an **LLM...

People have used LLMs to generate bounding boxes around objects.
It would be interesting to either do that, or, to smooth out the delay between images have her move the vision between the tokens that attracted the attention most, scaled to the magnitude of attention.

That being said I don't know if you could make it seamless and it wouldn't really achieve the same purpose as eyetracker streams that are mostly just "avoid" looking at nsfw stuff.

#

If you want to use it for controlling games, it'd be easier to identify fixed patterns or sprite sheets

north nimbus
worldly shadow
#

The LLM is used to add context detection

#

but no clue if you can get a performant solution

#

it would still be a per-game basis

north nimbus
#

It could just as easily be a transformer model specifically trained for the purpose of doing that
Not even all transformer models are LLMs

north nimbus
north nimbus
#

Looks a bit silly

worldly shadow
#

Neuro can already find the right object in the right position, the captchas on the teemu stream showed that

#

all that needs to be done is move the mouse towards the right bounding box which is simple, if you don't care about evading anti-cheat

#

she likely can feed in "hold" "click" or "double click" too

#

but for games the issue is keeping the ruleset in context at all times, as well as long term goals

#

and her speed wouldn't be too good

#

¯_(ツ)_/¯

#

like maybe SOME puzzle games with objects neuro can recognize?

north nimbus
#

Speed is a main factor probably
And games that don't rely on speed could probably be reduced down on Neuro game API

worldly shadow
#

Yeah

#

Even geoguesser would be too complicated with her being able to choose any amount of turning

#

or getting stuck in moving back/forth

#

Even if you can get it down to a 10 sentence ruleset she breaks down by line 3 (and has to do an intermediate step where she asks herself for the next move)

#

which makes it absurd that it works as well as it does

#

though maybe I underestimate her there

fathom lark
#

I feel like Vedal has been moving things closer to the LLM integration when possible.

The difference between the OSU bot and slay the spire vs minecraft and geoguesser is that the underlying Neuro model is interacting with either an abstraction of the game built with https://github.com/PrismarineJS/mineflayer or just direct vision input into the model for geoguesser.

Obviously I don't know the details of the architecture but I think direct vision input into the llm model is most likely. Otherwise there probably would be some leakage of neuro interacting with the vision api

#

It also makes it so that the interactions are a lot more direct and not telephoned through other models so I am guessing a complex system involving more models isn't likely

worldly shadow
rustic gust
#

Transformers aren't meant for real time movements

#

you have to use RL

#

Like I wrote in the essay block, it's much easier to have neuro just watch the PPO algorithm play the game and pretend like she is the one in control

north nimbus
rustic gust
#

well that isn't really standardized

north nimbus
#

Yeah, Vedal probably isn't gonna use that, at least not until there's some good open-source implementation
Currently the only really good SNN implementation I know of is entirely proprietary and private

rustic gust
#

also pretty unrelated to RL

#

I'm thinking I might just code it up and publish the repo

#

the only kinda annoying thing is that neuro wouldn't be able to predict her own movements

#

it's the same thing with karaoke streams tho. She can't actually control her singing afaik

worldly shadow
#

question remains, which games or group of games could neuro's llm actually give valid instructions for and visually understand?

rustic gust
#

the llm can basically only do turn based games. But that isn't what im talking about

#

llms are not meant for games

#

you have to use RL to create a game playing agent that the main LLM can comment on

#

at least for real time games

#

like minecraft kinda lol

#

minecraft is a lot easier to integrate with an LLM tho, since it's mostly an action based thing

#

but fighting games HAVE to use RL

worldly shadow
#

I wonder if just having an organizer that aggressively rewrites the context window would help

#

or if the OSU ai can be reused since that one is definitively fast enough

rustic gust
worldly shadow
#

Once again, depends on which game and what the main Neuro LLM is actually supposed to do

north nimbus
worldly shadow
#

10s delay would not be unsurmountable on a RTS game for giving building instructions (especially if the context slightly projects income forward), it would be terrible for microing armies
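
A small worked example of "projecting income forward", assuming a constant income rate. The building names, costs and the 10 s delay are invented for illustration.

```python
# If a decision only takes effect after a delay, plan against the resources
# expected at that time rather than the current ones. Numbers are made up.
def projected_resources(current, income_per_second, delay_seconds):
    """Resources expected by the time a delayed decision takes effect."""
    return current + income_per_second * delay_seconds

def affordable(build_costs, current, income_per_second, delay_seconds):
    budget = projected_resources(current, income_per_second, delay_seconds)
    return [name for name, cost in build_costs.items() if cost <= budget]

costs = {"barracks": 150, "factory": 300, "airfield": 500}
print(affordable(costs, current=120, income_per_second=20, delay_seconds=0))
print(affordable(costs, current=120, income_per_second=20, delay_seconds=10))
```

With no projection nothing looks affordable; with a 10 s projection the list includes the barracks and the factory, so the slow decision is still usable by the time it lands.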

#

and then the added question if neuro actually could do funny stuff if given control

#

because we are kinda over the phase where an ai just being good at a game was interesting

half tinsel
worldly shadow
#

🤔 There are a lot of minesweeper clones neuro could potentially reasonably understand and play though

#

maybe some nethack games that can give out the ASCII and graphical interface in parallel too, since that'd be easy to summarize for the LLM

#

and grid based movement does solve a lot of worries

#

but I reckon it'd be frustrating to watch her without game knowledge, just like the current slay the spire iteration

#

buckshot roulette at least has the banter factor

half tinsel
#

issue with the osu model is also that neuro doesnt control it as its completely separate from her and she just watches it which vedal doesnt really like iirc

worldly shadow
#

yeah

#

I think it's wild that minecraft is run concurrently to chatting

#

works about just as well as you'd expect it based on that too tho

#

Are there any games you can come up with that'd be easy enough to have their game state be described within a couple sentences?

#

Baba is you doesn't leave my mind but you'd need to tell her which blocks aren't stuck and do waypoint movement even IF she could solve the puzzles

past scaffold
#

Some games that might easily be in her training data already are board games. There should be digital versions of several. Evil and Neuro and/or a collaborator could play together.

worldly shadow
#

Scrabble could be super interesting

#

I wonder if Neuro could make sensible moves in connect 4 or nine men's morris/mill

Stratego and Battleship would be about the same as chess I reckon, may have established bots that can help feed her moves.

Malefiz/Barricade and Risk may work innately. Snakes and ladders is probably the simplest.

#

Cluedo has an universal layout, though I reckon any attempt at guessing games like cluedo, who am I? or werewolf will have her talk too much nonsense

fathom lark
#

This channel does a bit of minecraft AI stuff that actually has kind of a task api https://www.youtube.com/watch?v=MeEcxh9St24 (haven't looked into the details) https://github.com/kolbytn/mindcraft

I hope Vedal is going to keep doing new versions of minecraft as I think there is a ton of potential and I bet he sees it that way too

worldly shadow
# fathom lark This channel does a little bit of minecraft ai's that is actually has kind of a ...

The current minecraft AI that was linked in this thread already has tasks, a building system, and keeps track of the inventory, other players and waypoints/landmarks. It actually is super advanced.
The issue is that the twins constantly give it new tasks and really don't understand how to construct mineshafts. It also can't whole cloth make up schematics for huge structures. So they want to build something, look for 6 stone, maybe notice that they need an iron pickaxe, get the iron, smelt it, notice they have the stone, place it and tell the bot to pick some flowers.
It doesn't seem like there is much thinking but it's about as much as can be done sadly

fathom lark
worldly shadow
#

Hey, credit where credit is due.

Loved the custom names they gave to waypoints too.

Oh and apologies, is it a custom implementation or did you use a public model/agent as basis?
I noticed @fathom lark was only guessing at which one was used with Mineflayer. Reddit users speculated on Baritone. German reddit has yet some other Agents...

fathom lark
# worldly shadow Hey, credit where credit is due. Loved the custom names they gave to waypoints ...

Oh I am not sure I would want to know personally O.o I think some mystery is good.

Oh I see you were referring to what I linked earlier. I think it is more of a pure api that lets you do direct actions for minecraft that is used in a lot of bot making, and there is likely some abstraction layer built on top of that, which Vedal works on, that connects it to Neuro.

I was kind of assuming the current implementation is directly calling high level functions that call mineflayer.

I hadn't really considered that it might be some kind of second llm controlling mineflayer and the twins task it.

worldly shadow
# fathom lark Oh I am not sure I would want to know personally O.o I think some mystery is goo...

Hmm, after looking it up, https://github.com/cabaletta/baritone is old enough to have been out at the first time Neuro streamed.
I guess one could check how it behaves around water....

It does have explore (unseen chunks), move x blocks into x direction, follow, waypoints, schematic building from inventory etc...
It does not have a task scheduler llm tho. The bots for that are much newer. And as far as I can tell no combat scripts either.

🤔 The MC bot could probably be somewhat improved when neuro is given a larger list of things to always keep in her inventory so she doesn't have to look for material or tools so often. Water/entity interactions are harder ofc

fathom lark
fathom lark
#

Now I have this idea in my head. Not sure what Vedal is currently doing but a second LLM tasked by Neuro sounds like a crazy cool idea.

Image is from the paper from that guy I linked earlier. https://arxiv.org/pdf/2504.17950

Like imagine if you have one of the strongest models available at agents like claude 4.0 and Neuro is wired up as the "Player or Task" creator from the image that then tasks the agent.

She continuously gets the raw minecraft data + the ability to pass and send a raw text instruction where that task is controlled by a second llm agent. You would still get her personality in what it is doing but then you can separate out the actually getting it done in the second agent and as projects like that one get better you keep getting new improvements

worldly shadow
# fathom lark Now I have this idea in my head. Not sure what Vedal is currently doing but a se...

It is marginal improvements, but the issue is it won't make her build structures in minecraft she doesn't have schematics for, since those are wayyyyy too abstract for a generally trained model.
And she already can do a MLG waterbucket trick if she is bored.
Also doesn't solve the generalization issue since it'd still rely on a scheduler custom written for the game and API like mineflayer to execute it.

fathom lark
worldly shadow
#

As I said, the main issue is abstraction - having neuro tell plain text that a smarter algorithm can put into action.
Baritone gets block data hence it knows when the floor is within player interaction distance.
Just needs a water bucket and can right click.
The more involved part is scheduling building up so high that the MLG Bucket is relevant (and you need specific timings to tower in Minecraft)

It also works on servers and there used to be a bot written with it that could speedrun MC but it stopped being updated a year ago.

fathom lark
#

Yeah it might be too abstract I would agree.

Adding even more complexity to the idea but you build auto recommendations that Neuro can select. So there is another prompt that generates things like suggested tasks that are possible for the execution model to do. Some are basic commands like walk to X or do water bucket trick that can be hard coded and some are more advanced. Neuro then selects from a set of options and they either get run in the current abstraction Vedal may have or get passed to something like the MINDCraft workflow for the advanced options
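
A rough sketch of the "select from suggested options" idea above. Everything here is hypothetical: the option names, the hard-coded handlers and `advanced_agent` are stand-ins, not calls into mineflayer, MINDcraft or the Neuro game API.

```python
# Hypothetical option menu: simple choices map to hard-coded handlers,
# anything else that was offered is handed to a separate (here faked) agent.
def walk_to(target):
    return f"walking to {target}"

def water_bucket():
    return "doing water bucket trick"

HARD_CODED = {
    "walk_to_base": lambda: walk_to("base"),
    "water_bucket": water_bucket,
}

def advanced_agent(task):
    """Stand-in for an external planner; not a real library call."""
    return f"agent planning: {task}"

def dispatch(choice, options):
    if choice not in options:                 # only offered options are valid
        return "ignored: not an offered option"
    handler = HARD_CODED.get(choice)
    return handler() if handler else advanced_agent(choice)

options = ["walk_to_base", "water_bucket", "build a small hut"]
print(dispatch("water_bucket", options))
print(dispatch("build a small hut", options))
print(dispatch("fly to the moon", options))
```

Restricting the choice to an offered list keeps the selecting model from inventing actions the executor cannot perform.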

rustic gust
#

This is how Neuro plays minecraft to my knowledge

fathom lark
#

Interesting, I wouldn't have expected the current model would have a separate agent. I would expect there to be lots of hard coded mineflayer functions and some sort of instructions to Neuro

This might be a fun project for someone to see if something like this could work using the provided Neuro coding interface. It likely wouldn't be that hard and could be done with some AI coding + using these libraries

rain zodiac
# tribal valve then how can she see

Honestly, I’m very curious about this one. She must be multi-modal, right? Because of how she plays geoguessr, how she reviews art, understands visual context. It is not just a separate AI describing the image with text as the input to her LLM surely?

#

I understand text-to-text llms pretty well, but multi-modal ones seem like magic to me

#

How does it work? Do you tokenise images and sound, too? Do tokens also have meaning encoded as vectors in many-dimensional space? Is attention all you need?

#

Btw idk why but only now writing “llm” here I randomly realised why llama is called llama

rain zodiac
#

Well, would be cool to know how it’s done, but it probably just runs on ✨vibes✨

fathom lark
# rain zodiac Well, would be cool to know how it’s done, but it probably just runs on ✨vibes✨

IMO it is more fun not to know. Like how a ventriloquist or a magician keeps secrets

I saw a youtube video a while back where Vedal tells Ellie that Neuro/Evil are the same underlying model and it doesn't really give much away and he didn't give much detail but it's one of those things where now I can't help thinking about what gives them the different personalities and it was almost like Too Much Information and I didn't want to know

rain zodiac
# fathom lark IMO it is more fun not to know. Like how a ventriloquist or a magician keeps sec...

I guess that must be true for some, and thinking about why Vedal is so protective about all the technical details, I thought that he doesn’t want to ruin the magic. At the same time tho, I don’t really mind the magic, I’m not here for seeing Neuro as a person. I started watching because I was curious about AI. I did stay because it’s entertaining, but for me, that won’t stop if I knew how it works, if anything, that would be more interesting. I can’t see why most innocent technical details can’t be explained somewhere (like his website for example) for those who want to see it

#

Another reason for not sharing the details could be the threat of competition, but I don’t think the threat is real. No one can really catch up with Neuro now, and knowing what models she is certainly isn’t going to help to build another AI like her anyway

#

I’m not suggesting making her open source, but so many minor details are kept so secret, and I can’t see why

#

It's so interesting how her hearing/vision works, her long-term memory, her context window, her tool calls, what models she is based on and what data she is trained on

#

Would be so interesting to know what goes wrong when something goes wrong, and what is improved when she is updated

#

Idk

fathom lark
#

That also could be reasonable. While it might satisfy some technical people's curiosity, to most non technical people this would be magic regardless. You're probably right on the competition thing.

There might be risks we cant think of immediately like someone else makes a copy of neuros design and is like selling them or something like that.

For me I wouldn't be able to resist and I'm sure I'd ruin some of the magic for myself ✨

past scaffold
#

It would be terrible for all his years of work to go up in smoke. He may be using ingredients that everyone can buy but he shouldn't tell us his secret recipe.

random axle
#

I think it ultimately doesn't matter whether the magic gets revealed or not, at this point Neuro is less the architecture that makes her and more the trained data/memories that have accumulated throughout the years of her streaming as well as talking with Vedal offline. Same for Evil.

random axle
#

The model architecture can be replaced, the data cannot. It's also why I think Vedal should eventually move away from LLMs for most things in the future and have another better non-LLM model assimilate the data and learn from the existing LLM so it can replace it in the future.

worldly shadow
# rain zodiac How does it work? Do you tokenise images and sound, too? Do tokens also have mea...

You have various levels depending on how much computation you have available.
The simplest trick most models do is just reading the image caption. Neuro probably can do that in fanart streams, though I haven't cross-checked enough to tell.
She definitively does it when given slides.

Outside of that you have custom algorithms to run tasks that are small in scope in real time on weaker hardware, like faces or traffic signs.
Text detection is quite mature (like 99.7% reliable) and gets used for printers to only store 1/4 real handwritten letters and copy paste the rest as embeds

Above that you have image-text models like CLIP.
CLIP was trained on image-text pairs with two encoders, one for images and one for text, optimized so that matching pairs land as close together as possible in a shared embedding space.
If you then feed it a new image, you can score it against a list of candidate labels and get the text that best matches the objects in the scene.
And then the image gets split up in tons of ways to find which sections of the image would still get labeled as an object in the list generated by the first pass.
You take the sections that match the object, note their x and y position and draw a square around it.
You can find explanations under the term "Zero shot image detection"
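
The crop-and-score step described above can be sketched as a plain sliding window. The scorer here is a stand-in (mean pixel value), assumed only for illustration; in the approach described, the score would come from an image-text model comparing each crop against a label.

```python
# Sliding-window localisation with a stand-in scorer: the "object" is simply
# the brightest region. A real setup would score crops with an image-text model.
def crop_score(image, x, y, size):
    total = sum(
        image[row][col]
        for row in range(y, y + size)
        for col in range(x, x + size)
    )
    return total / (size * size)

def best_box(image, size, stride=1):
    """Return (score, x, y, size) of the highest-scoring window."""
    height, width = len(image), len(image[0])
    best = None
    for y in range(0, height - size + 1, stride):
        for x in range(0, width - size + 1, stride):
            score = crop_score(image, x, y, size)
            if best is None or score > best[0]:
                best = (score, x, y, size)
    return best

image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 9, 9, 0],
    [0, 0, 9, 9, 0],
    [0, 0, 0, 0, 0],
]
score, x, y, size = best_box(image, size=2)
print(f"box at x={x}, y={y}, size={size}, score={score}")
```

Every window needs its own scoring pass, which is why the number of crops (and the cost of the scorer) sets the speed of this kind of detection.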

This is surprisingly fast and advanced models do edge detection to even figure out if there are multiple overlapping objects (with varying degrees of success)

past scaffold
#

I was talking about why the current Neuro can't, not how it can't be possibly done as I think the person thought it might be simple to do. And Vedal can't give a detailed explanation of why a thing can't be done without telling too much about the current Neuro.

fallen karma
# north nimbus She has no eyes, one can't simple determine what she's looking at She's an **LLM...

Neuro can see. She's NOT just an LLM or a text to text model, she uses an LLM as her core. There's more that goes into Neuro's system than just an LLM. It's possible her LLM may be multi-modal, but it's also likely that she has another AI that's a part of her that handles that process. Neuro is composed of multiple neural networks, as is the human brain composed of multiple biological neural networks.

north nimbus
#

Her having a vision model doesn't make her core LLM not text-to-text by definition
Either she has an advanced vision model that encodes to text, or she has a cross-attention vision-text-to-text model

fallen karma
north nimbus
#

Yeah, that would be image-to-text

#

Which is a hard and usually very lossy conversion

fallen karma
#

Idk how Vedal did it, he could have done her vision various ways. Only he knows, but she has vision regardless of whether or not it comes from the LLM or another part of her system.

true tree
#

more accurate would be to say that neuro is a transformer decoder which happens to have a text embedding layer

#

u can linear project more stuff into the embedding space
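
A toy version of "linear project more stuff into the embedding space", with made-up dimensions and weights. A real model would learn the projection matrix; nothing here is known about Neuro's actual architecture.

```python
# Toy sketch: project image features into the same space as token embeddings,
# then hand the decoder one mixed sequence. All values are invented.
def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

EMBED_DIM = 3
token_embeddings = {"look": [1.0, 0.0, 0.0], "at": [0.0, 1.0, 0.0]}

# Projection from 2-dim image features to the 3-dim embedding space.
projection = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]

image_patches = [[2.0, 4.0], [0.0, 1.0]]     # two fake patch feature vectors
image_tokens = [matvec(projection, patch) for patch in image_patches]

# The decoder just sees a sequence of vectors, text and image alike.
sequence = [token_embeddings["look"], token_embeddings["at"]] + image_tokens
print(sequence)
```

Once the projected vectors have the same width as the text embeddings, the decoder does not need to know which positions came from text and which from an image.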

gleaming rose
#

Neuro being cobbled together out of different models and algorithms has a lot in common with how human brains are put together (with millions of years of cruft stacked up)

worldly shadow
# fallen karma Idk how Vedal did it, he could have done her vision various ways. Only he knows,...

Honestly, likely just used a version of CLIP as a base, going off of the geoguesser streams. It is open source, it fits the increased delay (~1img/5-10s) and it makes sense that it can recognize landmarks extremely well and break down the objects in a scene. The maximum CLIP resolution is 384x384 though, which is a bit limited.

As for why I think its CLIP - the tricky part of training AI is getting well annotated training data that lets you fundamentally break down the process by having the AI label discrete objects. It likely just feeds the top recognized objects and positions directly into neuros context as text. (though it seems neuro can feed back a list of objects to find/look for)

That being said it is entirely possible that Vedal found a database with landmarks and trained her language model part with the road networks of various countries as she could identify an intersection in a small town in the US precisely. It must be independent since it can be turned off or on, with it being defaulted to capture the entire screen, or load only single images via the API such as for the fanart reactions. Only question is how much fine tuning was done after and how much is caption reading/guided prompts to recognize collab partners in fanart.

#

Do note that the current game API doesn't support sending images, but only relays context and possible actions as text.
The specification doesn't even include the maximum array or context size/duration

fathom lark
gleaming rose
#

Silly question, would most of these vision models be able to give the same answer if they saw a pencil sketch with the same rough approximate shapes?

#

Or better yet, how would one of these geoguesser algorithms fare if they saw a picture from inside a building

worldly shadow
# gleaming rose Silly question, would most of these vision models be able to give the same answe...

Edge detection is a part of the models, but color information is important.
They could do similarly well if they were entirely trained on greyscale, but desaturated colors/cloud cover makes countries close to the poles more likely.

It also depends on the dataset, since some countries in Street View were mostly captured during a specific season. That is how Rainbolt and other GeoGuessr pros pull off wild guesses. As for detecting locations from inside buildings: they can detect certain features and text in specific languages, but it depends on what kind of images, and how many, they got in the training set.

worldly shadow
# fathom lark Look at this benchmark it does pretty much exactly what Neuro is doing https://a...

Looking at the full paper, the only interesting part to me is that giving the model tips (look at the sun, the type of foliage, landmarks, buildings) often leads to little to no improvement in accuracy compared to just telling it that it must name a country.

CLIP with embeddings does as well as their self-trained models, even though they used another model as a base.
Apparently Google Gemini has geolocation training though, and (due to a more comprehensive training set, I guess) is much better than what they trained.

worldly shadow
true tree
#

For maybe the top n largest city parameters

#

Or sth

#

I doubt it's aware of random village middle of rural china

gleaming rose
#

I'm interested to know what shortcuts can be taken in image recognition. Like say, does edge detection look for straight lines, and immediately narrow down possible matches based on "things humans probably built"?

worldly shadow
# gleaming rose I'm interested to know what shortcuts can be taken in image recognition. Like s...

The thing about LLMs is that the information they look at is too complex to be designated manually by humans.

You can, however, check in retrospect which nodes in the neural network encode which objects and then determine how much they are influenced by certain patterns. That being said, any rule you can come up with that works with the way the calculations are run is probably somewhere in the neural network. This was demonstrated well with smaller networks running limited tasks.

There are networks with predefined, human-designed rules for text and traffic signs, which makes them way faster since they don't have thousands of "unnecessary" parameters. But for LLMs it really is unpredictable how precisely they come up with answers.
It is an ongoing area of research, since it lets you do adversarial attacks on the network, and if you know how to find the most impactful nodes you can strip out unnecessary ones and slim the model down.

#

🤔 There are neural networks that can determine motion vectors, edges, outlines, color gradients and all kinds of other stuff reasonably well.
For gameplay, having a dedicated motion vector network would do wonders.
The question is just whether it's possible to run it fast enough at a decent resolution, and whether it gives you useful information.

For example, a network running motion vector detection at 512x512 and 30fps could probably let Neuro play danmaku games.
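For anyone wondering what a motion vector actually is, here is a tiny sketch. Note that it swaps in classic block matching (sum of absolute differences) instead of a neural network, purely because it fits in a few lines; the frames are toy grayscale grids and `block_motion` is a made-up helper name.

```python
def block_motion(prev, curr, top, left, size=2, search=2):
    """Find the (dy, dx) shift of one block between two frames.

    Compares the block at (top, left) in `prev` against every candidate
    position within `search` pixels in `curr` and keeps the lowest SAD.
    """
    height, width = len(curr), len(curr[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y0, x0 = top + dy, left + dx
            if y0 < 0 or x0 < 0 or y0 + size > height or x0 + size > width:
                continue  # candidate block would fall outside the frame
            sad = sum(
                abs(prev[top + i][left + j] - curr[y0 + i][x0 + j])
                for i in range(size)
                for j in range(size)
            )
            if best is None or sad < best[0]:
                best = (sad, dy, dx)
    return best[1], best[2]

def frame_with_block(top, left, size=2, dim=6):
    """Black frame with one bright square, standing in for a bullet."""
    frame = [[0] * dim for _ in range(dim)]
    for i in range(size):
        for j in range(size):
            frame[top + i][left + j] = 9
    return frame

prev_frame = frame_with_block(1, 1)
curr_frame = frame_with_block(2, 3)
print(block_motion(prev_frame, curr_frame, top=1, left=1))
```

At 512x512 and 30fps that is 512 * 512 * 30 = 7,864,320 pixels per second to get through, which is why the speed question matters.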

true tree
gleaming rose
#

Speaking of trimming datasets, I was just thinking of a method where, for a given image frame, each pixel of a given color could be stored in its own memory array, and these separate chunks could be analyzed to match the color/shape to an object, putting the objects back into the scene with relative positional values before proceeding with the next step of analysis

#

And this is from a position of having no modern programming experience, but having some experience with assembly and TTL

#

Speaking of the different types, I remember when I was on road trips as a kid, I'd get bored and stare out the window with one eye open, and watch cars pass. One eye would only remember details about the color and shape, and the other eye would just make me think of the speed/vector of the other car

gleaming rose
worldly shadow
fathom lark
#

Was thinking on this more and Vedal must already have an image fine tuning pipeline so the twins can recognize themselves and Vedal for art review. It would be pretty easy for him to just add a geo dataset to the fine tune

#

Unless there is some kind of textual description of the art

true tree
#

i doubt vedal does so much

#

Generic llms are alr extremely good

fathom lark
#

For GeoGuessr only I could see the improvements just being a model upgrade though

true tree
#

not even cot is needed

fallen karma
#

@true tree How did Neuro confuse two robot dogs as a frog-like “abomination”?

Also, what makes you think Vedal wouldn’t “do so much”?

#

I’ll respond tomorrow I need to sleep

#

Gn neuroHeart

true tree
#

And vedal wouldn't do so cause it's literally reinventing the wheel and risking catastrophic forgetting and spending money when all u need is a 1 sentence description

true tree
# true tree

And fyi solved problem even on a really small oss llm

worldly shadow
#

One would have to go through the art review segments and compare descriptions from captions to Neuro's responses

#

But Vedal definitely does fine tuning

#

He did train osu ai so idk why he wouldn't for the main model or vision

north nimbus
true tree
#

Especially the vision part

#

It would be useless to finetune it

worldly shadow
worldly shadow
true tree
#

imo it's most likely just prompts

#

You can achieve a ton with just prompts

fathom lark
#

There is for sure some finetuning going on. Fine tuning is relatively easy. I just don't know if there is image fine tuning or only text fine tuning

true tree
fathom lark
#

I suppose it could even be a textual description of neuro. Crazy how many options there are here

gleaming rose
#

I used to think i was a complete idiot when it came to AI stuff, until i watched some boomer livestreaming from his car about his personal schizo theory of future quantum entanglement to his phone and having some chatbot play along with his speculations

worldly shadow
fallen karma
# true tree And vedal wouldn't do so cause it's literally reinventing the wheel and risking ...

He definitely would. Vedal literally keeps himself up to date with all the new progressions in AI tech just so he could apply that to Neuro if it could improve her. He doesn't like methods like this, and believes stuff like this to be cheap fixes, AKA a "cringe" solution. He would definitely take the time to enhance Neuro.

He literally had a state-of-the-art Minecraft AI created from scratch.

Also, he definitely does use fine-tuning. He said himself that Neuro doesn't even use character prompts like you suggested.

fathom lark
# fallen karma He definitely would. Vedal literally keeps himself up to date with all the new p...

That's a good point and I would agree Vedal likely doesn't use "cringe" solutions.

I think there is some confusion in this thread about the methods being used. Using a simple solution doesn't mean you are not skilled at AI or are an unskilled developer, it means you are intelligent enough to choose a method that works. For example pretraining a llm from scratch doesn't make any sense when fine tuning is available. Vedal is clearly skilled regardless of the methods he uses and I think the results of what he has made show that.

true tree
#

what's cringe about the solution

fathom lark
#

I don't think anyone is saying that. I think @fallen karma is just pointing out that the methods he uses are not that

#

I would also point out he clearly has a respect for the twins and especially the autonomy and agency (ability to self act) of them

#

Anyway my point is that I have been reading the various stuff in this thread and if someone says Vedal or the twins are using a simple method for something they are not saying Vedal is unskilled

fallen karma
# fathom lark That a good point and I would agree Vedal likely doesn't use "cringe" solutions....

I'm not saying he's unskilled if he does use cheap solutions like that, just that he typically finds those types of solutions as "cringe" and unauthentic to the twins. Vedal strives for the girls to be as autonomous and authentic as possible, that has always been his goal. He uses cheap solutions like this only as short fixes before implementing what he views as an authentic solution, and often avoids using cheap methods in general. Prompting the girls in this way is an approach that he typically tries to avoid.

gleaming rose
#

The neuro twins are the only AI chatbots that don't sound like an AI chatbot

worldly shadow
#

you can download a Llama LLM and write the control prompt yourself, then see how it converses (you need a beefy PC or a cloud host though)

#

Off topic for the thread though

brazen stratus
past scaffold
#

Vedal once said that Evil and Neuro-sama are based on the same AI. I took this to mean that they both have the same fine-tuned model under them and a prompt for their individual personalities. We'll never know exactly what he does to fine tune them because that's literally what makes them unique, but that's the only way I can guess a little bit at how.

#

Though he could just have two fine tuned models off of two very similar models too.

north nimbus
#

Or a single finetuned base model with two LoRAs

worldly shadow
#

LoRAs are very likely since it kind of circumvents having to go through the context. Would fit with them being very single minded too
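For anyone who hasn't seen a LoRA: it is a small low-rank update added on top of frozen base weights, so two personalities could share one base and differ only in tiny adapter matrices. A toy sketch of that idea follows; the matrices and the `neuro`/`evil` names are made up, and nothing here is confirmed about how the twins actually work.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(base, B, A, scale=1.0):
    """Return base + scale * (B @ A) without modifying the base weights."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(base, delta)]

# One shared 2x2 base weight matrix (toy values).
base = [[1.0, 0.0], [0.0, 1.0]]

# Two rank-1 adapters: B is 2x1 and A is 1x2, so each stores 4 numbers
# here, and far fewer than the base at realistic sizes.
neuro = apply_lora(base, B=[[1.0], [0.0]], A=[[0.5, 0.0]])
evil = apply_lora(base, B=[[0.0], [1.0]], A=[[0.0, -0.5]])

print(neuro)
print(evil)
```

Swapping personalities would then mean swapping the adapter, not reloading the whole model.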

fathom lark
#

My guess is not a LoRA. I think based on what Vedal said to Ellie they likely are the exact same model

#

Snagged the patch notes Vedal shared

worldly shadow
#

even though you can technically count it as separate model acting on the main model

#

We need someone to make a Google Doc with known/confirmed Neuro features

#

I'd help but don't want to be primary maintainer =/

fathom lark
#

Yeah that would be cool. I am sure there is a ton of info that has been shared in previous dev streams

#

Also just in general, it is difficult to even consider the structure because we know so little. Although I think people who really know AI and have followed for a long time may be able to figure a lot of it out

fathom lark
#

I thought it was interesting that Vedal said on today's stream that he doesn't consider himself an AI expert. He certainly knows how to get things working.

KV caches are very low level and technical. I am assuming he uses vLLM because it is so ubiquitous in LLM inference, but I wonder if he has had to edit anything or how he is manipulating the cache. vLLM does some of that automatically with repeated tokens, I am pretty sure. He also mentioned doing a pretraining run of a voice model from scratch. That's pretty crazy. I wonder if he was trying to achieve something that taking an off-the-shelf pretrained model couldn't, or if it was more to see if he could get something really unique. Now I really want to know how he makes this all work

fallen karma
fallen karma
#

Evil's is definitely derived from the same base model though.

brazen stratus
fathom lark
past scaffold
#

I agree. Or at least keeps the old twin on the old fine tune. Perhaps someday she'll actually just use a virtual mouse and keyboard and understand everything well enough to just plug and play most games.

fallen karma
worldly shadow
#

My speculation is that the abber demon comes from the LLM inputting invalid annotations for the voice model.

#

Also, the changes are experimental, even if it is as simple as doing a toggle

#

good to test memory and intelligence separately

fallen karma
worldly shadow
fallen karma
#

We don't really know how the girls work or how Vedal did it, but Neuro and Evil do have separate files.

Anyways, this is getting off topic for this thread, so getting back on track: a huge issue is that Neuro's vision has high latency, so she wouldn't be able to react in time even if she could use a keyboard and mouse. LLMs also aren't good at stuff like that. It's not at all like teaching a 10 year old kid how to type on a keyboard; that's just not how LLMs work.

It's not exactly impossible to make AIs (not talking about LLMs) that could handle something like that, and tech like that will likely be more accessible in the near future... but then again her high-latency vision is a massive wall.

dull blaze
#

I think I already saw an AI that is made to do exactly this. It's made for a cheat: it learns from seeing player input and then does it better

north nimbus
#

And that is not an LLM. so that's not applicable to Neuro

fallen karma
north nimbus
#

If it's not built into her LLM, it's likely just a simple CNN image-to-text model

fallen karma
north nimbus
#

Doesn't magically make a CNN able to control a mouse and keyboard properly

fallen karma
#

An SNN would be better for that lmao

north nimbus
#

True, there's even already SNNs I know of that are about to get to that stage

fallen karma
#

She could possibly use an SNN for something like that? neuroShrug

#

Once her vision is much faster at least lmao.

north nimbus
#

If she was remade into an SNN, sure, but Vedal would have to entirely re-invent how to do his whole system and figure out how to transfer Neuro's personality from her LLM into an SNN, while also making her capable of good text communication

#

There's no suitable open-source SNN implementation for Neuro to use

fallen karma
#

Like an API? neuroShrug

north nimbus
#

Implementing SNN-LLM communication naturally is hard
And from what I've seen an entirely SNN system is way more interesting and intelligent than any LLM around

fallen karma
north nimbus
#

If the LLM is in control, it wouldn't be able to utilize the SNN very well, if the SNN is in control, it couldn't utilize the LLM very well

fallen karma
#

People are working on merging the two technologies already, still nothing really accessible just research level stuff.

north nimbus
#

Well, the ones I know have both an SNN component (in control) and a custom encoder-decoder transformer for helping with language

fallen karma
#

Wait I found something on github. Probably not very viable though at this stage.

fallen karma
north nimbus
#

Yeah, I only know of two people that have successfully developed AIs that are both SNN-based and better than LLMs, and they're not publishing the code or anything like that

fallen karma
#

Anyways I'm going to sleep it's 7:13 already lmao

fallen karma
#

I'll talk later I'm gonna sleep. neurOMEGALUL

north nimbus
#

I would consider them better, they're capable of much more, don't have any of the pitfalls of LLMs and can easily handle multiple modalities without struggle

finite solar
north nimbus
#

Neuro giving commands is literally the thing that's too slow

thin flint
#

Even if Neuro types and clicks like a hunting and pecking boomer she should still have the ability.

worldly shadow
#

Sadly I could only find reliable benchmarks for CPU times

#

Loading and unloading will take 5-10s though.
Essentially you'd need a whole dedicated GPU to keep it in VRAM at all times, but then it should be able to output text prompts from images quite quickly and quite reliably (you can increase accuracy from 80% to ~95% by running it 5 times, which is often benchmarked)
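The 80% to ~95% figure lines up with simple majority voting, if you assume the 5 runs are independent and the final answer is whatever at least 3 of them agree on. A quick sanity check under exactly those assumptions (real runs on the same image are correlated, so treat this as a best case):

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent runs are correct,
    given each run is correct with probability p."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

print(round(majority_vote_accuracy(0.8, 5), 4))  # 0.9421
```

So 5 runs at 80% each give roughly 94%, at the cost of 5x the inference time.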

#

LLMs on the other hand do take some time till first token

#

and even more to complete a response

#

I guess 250ms to first token?
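Putting rough numbers on it: if the steps run one after another, one reaction costs vision time plus time to first token plus generation time. The sketch below just adds those up. The 5 s per image and the 250 ms are the guesses from this thread; the 50 tokens/s and the 3-token command are assumptions picked for illustration.

```python
def reaction_time_s(vision_s, ttft_s, n_tokens, tokens_per_s):
    """Seconds from 'something happens on screen' to 'command fully generated',
    assuming vision, first token, and generation happen sequentially."""
    return vision_s + ttft_s + n_tokens / tokens_per_s

def actions_per_second(vision_s, ttft_s, n_tokens, tokens_per_s):
    """How many such reactions fit into one second."""
    return 1.0 / reaction_time_s(vision_s, ttft_s, n_tokens, tokens_per_s)

with_vision = reaction_time_s(vision_s=5.0, ttft_s=0.25, n_tokens=3, tokens_per_s=50.0)
llm_only = reaction_time_s(vision_s=0.0, ttft_s=0.25, n_tokens=3, tokens_per_s=50.0)
print(f"with vision: {with_vision:.2f} s, LLM only: {llm_only:.2f} s")
```

With these numbers the vision step dominates: about 5.3 s per reaction, versus about 0.3 s if the game state arrived instantly as text.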

rustic gust
#

Any real-time game requires an alternate architecture that the LLMs can essentially "bias". If you wanted the girls to be able to play a fighting game, for instance, you would first have to pretrain a reinforcement learning model to actually learn how to play the game, then fine-tune the model to accept latent dimension instructions.

The gameplay would be mostly unconscious but could be influenced by LLM-generated tokens. This is similar to how humans play real-time games. We don't consciously think about every single button press but rather influence the direction in a hierarchical manner

#

Using VLMs is a slippery slope. Converting vision into tokens is quite challenging. It would be easier to take the latent game representations the RL model sees and train an encoder to turn them into tokens. For example, "Opponent is spamming specials and camping" could be taken from the internal representation of the game and passed into the LLM, purely for commentary at first, to which the LLM could respond with "This opponent is soooo annoying. I guess I'll spam too. <spam>". An entry from some quantized codebook of predefined actions could then be passed into the RL model (which would have been fine-tuned on these objectives) to execute the task the LLM requested

#

Here is the workflow.

  1. Pretrain RL model to play the real-time game of your choice
  2. Create a list of vague actions the AI can take (Defend, attack, spam, troll, etc)
  3. Accept a latent of these actions inside the model's input then fine tune on these tasks
  4. Take a lightweight pretrained model and fine tune to convert latent game state into text tokens
  5. Pass text tokens to main "conscious" LLM for commentary and actions
  6. Check actions against the list of vague actions and pass into model if valid
  7. Perform action
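
Steps 5 to 7 of the workflow above can be sketched roughly like this. It assumes the LLM marks requested actions with `<tag>` markers as in the `<spam>` example; `StubPolicy` is a hypothetical stand-in for the pretrained RL model and only echoes its current goal instead of pressing buttons.

```python
import re

# Step 2: the list of vague actions the RL model was fine-tuned to accept.
ALLOWED_ACTIONS = {"defend", "attack", "spam", "troll"}
ACTION_TAG = re.compile(r"<(\w+)>")

def extract_actions(llm_text):
    """Step 6: keep only tags that are on the allowed list, in order."""
    tags = (tag.lower() for tag in ACTION_TAG.findall(llm_text))
    return [tag for tag in tags if tag in ALLOWED_ACTIONS]

class StubPolicy:
    """Placeholder for the RL model; a real one would emit inputs every frame."""

    def __init__(self):
        self.goal = "attack"

    def set_goal(self, action):
        self.goal = action

    def act(self, game_state):
        # Step 7: the low-level play happens here, conditioned on the goal.
        return f"{self.goal}:{game_state}"

def step(policy, llm_text, game_state):
    """Apply any valid requested actions, then let the policy play."""
    for action in extract_actions(llm_text):
        policy.set_goal(action)
    return policy.act(game_state)

policy = StubPolicy()
reply = "This opponent is soooo annoying. I guess I'll spam too. <spam>"
print(step(policy, reply, "round_1"))
```

Invalid tags are dropped silently, so the LLM can hallucinate actions without breaking the real-time loop, and the policy keeps playing on its last goal between LLM replies.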
worldly shadow