#why doesn't neuro just have a virtual keyboard and mouse so she can play any game?
She thinks once every few seconds so she can only react at that rate. I think the keyboard would be easy otherwise as she can just use a simple command
she's slow
I wonder if you can run instances of her that only generate a couple of tokens for keyboard inputs
That way it can be run multiple times per second
Or does getting the game state take too long?
Maybe hard cap it so she doesn’t start yapping in the input instances
The instances would be the same as regular neuro but with the keyboard prompt
That way it’s still aware of what it said
not really how llms work afaik
we use keyboards and are able to react in real time to any situation by pressing many keys
neuro would react with a ~10s delay (and not know entirely what's happening)
and "press" one key?
plus it's not as if these multiple instances would actually be able to coherently cooperate, even if it was possible to hook em up
as good as it would be, it's just not feasible
this, and specifically neuro can't just "learn how to use a keyboard" in the same way humans can. I suspect that even if you patch in a keyboard through the neuro game api, the added abstraction will likely make her unsuccessful at whatever she's trying to do.
Not possible as long as Neuro is a regular LLM
If she was something more advanced she could do it even more natively though
So the average online gamer then, she would fit right in
Not only latency but also vision: she can't see the screen like we do, so she can't react properly.
By instances I meant thinking instances (not sure what it’s called) it would just be one at a time so the context would be consistent
There can only ever be one instance of Neuro's LLM running at a time
That LLM is occupied by either playing a game or talking, not able to generate for both at the same time
I literally just said it would be one at a time
Yeah, so it would be literally the current system, that's no solution to the speed problem
In the current system, Neuro's LLM is generating tokens and either generates the tokens for the game API, which requires her function calling system, or generates speaking tokens, which is just her regular output
This is simple keyboard inputs not complex commands. Wouldn’t it take much less time to generate only one or two tokens?
It not only has to generate the tokens for the keyboard input, it also has to generate tokens like <function> name=keyboardinput; input=Z </function>
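A rough back-of-the-envelope for that point, as a sketch only. The generation speed is an assumed number and the whitespace split is a crude stand-in for a tokenizer (a real one would split the wrapper into even more pieces). Nothing here is measured from Neuro.

```python
# Hypothetical numbers: nothing here is measured from Neuro.
SECONDS_PER_TOKEN = 0.05  # assumed generation speed (20 tokens per second)


def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated pieces."""
    return len(text.split())


def generation_time(tokens: int) -> float:
    """Time to generate `tokens` tokens one after another."""
    return tokens * SECONDS_PER_TOKEN


bare_input = "Z"
wrapped_input = "<function> name=keyboardinput; input=Z </function>"

bare_tokens = rough_token_count(bare_input)
wrapped_tokens = rough_token_count(wrapped_input)

print(bare_tokens, wrapped_tokens)  # 1 4
print(generation_time(wrapped_tokens) > generation_time(bare_tokens))  # True
```

Even with this generous counting the wrapper costs several times more than the key itself, and that is before any speech tokens.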
And not to mention the speech tokens, which would slow her down the most probably, making her reaction time basically as long as her average speech time
The prompt would tell her to just name the input. The instance would be handled differently than normal and be hard capped so she doesn’t yap
That's not how things work with the Neuros
The Neuros aren't prompted in the classical way, most of their behavior is controlled by special finetuning
And note, loading a different instance adds latency
And the speaking tokens would still need to be generated somewhere, it doesn't matter where they're generated, the whole LLM system is a serialized structure just because of how much transformers suck
I don’t know what word to use so I’m saying instance. And I’m pretty sure the commands are given by the prompt.
The functions Neuro has access to are given in her context
What do you even mean exactly anyway? Switching out the prompts and having to redo the entire KV cache would introduce way more latency than having to only generate a couple tokens would save
Sorry I don’t know enough about this stuff.
Main things to note:
- completely switching out the prompts would introduce latency from KV-cache invalidation
- Neuro still has to speak, and since transformers suck, she can't do anything else while speaking
Well I don’t see why it would need to remove the previous prompt
I should really stop procrastinating and start learning these things
That way I can just see for myself if something works
How else would you make the LLM generate specifically what you want? You need to modify the prompt in some way so that whatever it gets isn't just a random speaking token
Also of note, how would you know when Neuro wants to input something on her virtual input devices even if she was given some? You need a function call
And that entirely defeats the point of generating fewer tokens by not having to function call
Could you put something in context where she would be talking? Like putting words in her mouth except it tells her that her next output is a command?
How would you know when to do that?
And that would stay in the context, clogging it up for later
It would happen automatically a few times between normal talking
Also can’t you remove stuff from context?
-> KV cache invalidation
Huh
Pretty sure that kinda goes against Vedal wanting to give Neuro control over things
And what if it's in a position she knows she shouldn't move or do anything, but the prompt tells her to give an input, so she generates something even though when she's speaking she knows she shouldn't, and then dies in the game because of that?
The KV-cache is quite volatile, any non-linear change to the context will invalidate it
Oh
Wait so how does the filter do it?
Or does the latency happen during TTS?
My guess is the filter either doesn't remove stuff from the context, or rolls it back
Both of which are linear changes
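To make "linear change" concrete, here is a minimal sketch of prefix reuse, assuming the usual setup where cached keys/values can only be reused for the leading tokens that are identical between the old and new context. The token names are made up, and whether Neuro's stack works exactly like this is a guess.

```python
from typing import List


def reusable_prefix(cached: List[str], new: List[str]) -> int:
    """Number of leading tokens whose cached keys/values can be reused."""
    n = 0
    for old_tok, new_tok in zip(cached, new):
        if old_tok != new_tok:
            break
        n += 1
    return n


cached = ["sys", "chat1", "neuro1", "chat2", "neuro2"]

appended = cached + ["chat3"]   # append: the whole cache is reused
rolled_back = cached[:3]        # rollback: everything that is kept is reused
edited = ["sys", "chat1", "FILTERED", "chat2", "neuro2"]  # edit in the middle

print(reusable_prefix(cached, appended))     # 5
print(reusable_prefix(cached, rolled_back))  # 3
print(reusable_prefix(cached, edited))       # 2
```

In the edited case only 2 of 5 tokens survive, so everything after the edit has to be recomputed even though those later tokens did not change.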
It replaces what they were saying with the word filtered in their context. Though that might only be the secondary filter
I believe that is false
I don't think it puts the word "filtered" in the context
New idea: always have the keyboard prompt there but have it tell neuro that after a specific token, the next token is the keyboard input she wants. Then each input will only add 2 tokens to context.
Could that work?
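For what it's worth, the reading side of that idea is easy to sketch. The marker string and the key set below are made up for illustration, and this says nothing about whether the model would reliably produce the marker in the first place.

```python
from typing import List, Optional

INPUT_MARKER = "<key>"                        # hypothetical marker token
VALID_KEYS = {"W", "A", "S", "D", "Z", "X"}   # hypothetical key set


def extract_key(tokens: List[str]) -> Optional[str]:
    """Return the key right after the last marker, or None if there is no valid one."""
    # Walk backwards so the most recent marker wins.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == INPUT_MARKER:
            candidate = tokens[i + 1]
            return candidate if candidate in VALID_KEYS else None
    return None


print(extract_key(["hello", "chat", "<key>", "Z"]))  # Z
print(extract_key(["hello", "chat"]))                # None
print(extract_key(["<key>", "banana"]))              # None
```

Rejecting anything outside the key set means a stray word after the marker is ignored instead of being pressed.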
Not particularly well on an LLM like Neuro's
Neuro's LLM is not particularly tuned to follow instructions
So from all I have read, it would only work if neuro had better hardware, like a supercomputer made for AI, otherwise it would be too slow to be worth it
Optionally Neuro could move away from the transformers architecture, though I don't know if Vedal can do stuff that isn't already open-source
There's already private AIs out there that are extremely capable while not being transformers-based though, I would say even more capable than Neuro with game API
Is there a way to keep the original personality or would he have to start over with new weights?
Neuro's current personality has changed many times actually
A new system would just have to be coerced to adopt a similar personality
Maybe if we start simple, like only a mouse and no keyboard, with the mouse being controlled by what she is looking at, maybe linked to her own version of an eye tracker. The only buttons she can control are left click and right click. I think if we had a cute UI of a mouse it would give more expression
She has no eyes, one can't simply determine what she's looking at
She's an LLM, a text to text model
then how can she see
My guess is there's some sort of (CNN?) model that converts the images into text/model tokens
Or you've implemented one of those cross-attention things that makes LLMs have image-text modalities
Though I don't know much about that side of machine learning stuff, I'm much more into voice models
I mean as long as Neuro is still an LLM, everything has to be converted into tokens for her to understand anything.
Maybe you could bolt on the LLM some kind of add-on model of a different architecture that is responsible for the more real-time actions.
Not like I have any idea how anything works anymore in the AI space, too much is going on to keep up.
i think thats the smallest problem you have here
how exactly in the world would you train an LLM even with vision to perform actions in a game using mouse and keyboard
okay wait nevermind he actually cooked there 
there ARE already variants of LLMs that can do exactly this thing, like desktop agents
but watch literally any demo and they are as slow as a turtle (pun not intended)
like if you want to make it work, she needs a better interface for gaming than image-to-text (better yet, get rid of image data altogether, but it's impossible to adapt for every game like this, exactly why we have the gaming API thread)
am I done with essaying on this topic that will only work after years of scientific progress in decreasing delay for image multimodality in LLMs? yes perhaps.
Just switch out neuro with a guy with a good voice changer and maybe feed him if he does a good job imitating neuro
Yeah a lot of models like 4o have a ViT that creates a latent that can be passed into the LLMs attention
it's difficult to make.
Also IF you wanted to implement this (for more real-time games) you would have to create a small RL model with something like PPO or GRPO that learns to play the game, using a much smaller model of ~100m params or less tbh. Then have neuro analyze the movements being done in the game (probably through vision) and give live commentary. Although understandably, training a PPO model on these games isn't exactly easy to do. But it is possible for some simpler real-time games.
Framework would probably look like this:
- Create a heuristic simulation of this game for training data
- Pretrain a PPO model on the heuristic simulation to make live actions given game inputs
- Get real inputs via trained YOLO models (which would be a lot faster than LLM vision models)
- Fine tune PPO model on real gameplay from noisy inputs.
- During gameplay, let Neuro look at the screen where the PPO model is playing the game and make commentary based on what is happening. This is okay to have latency since it does not actually affect the gameplay and is just for entertainment value.
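The last step of that list is the part that is easy to sketch: a fast policy acting every frame and a slow commentator that only runs now and then. Everything below is a toy stand-in (no actual PPO, YOLO, or LLM); the function names and the obstacle flag are made up for illustration.

```python
import random
from typing import Dict, List, Tuple


def fast_policy(observation: Dict[str, bool]) -> str:
    """Stand-in for the small RL policy: picks an action every single frame."""
    return "jump" if observation["obstacle_near"] else "run"


def slow_commentator(recent_actions: List[str]) -> str:
    """Stand-in for the LLM: comments on what it saw since its last turn."""
    jumps = recent_actions.count("jump")
    return f"I jumped {jumps} times, all according to plan."


def run_episode(frames: int, comment_every: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    rng = random.Random(seed)
    actions: List[str] = []
    comments: List[str] = []
    window: List[str] = []
    for frame in range(1, frames + 1):
        obs = {"obstacle_near": rng.random() < 0.2}  # fake game state
        action = fast_policy(obs)                    # runs every frame
        actions.append(action)
        window.append(action)
        if frame % comment_every == 0:               # runs rarely, so latency is harmless
            comments.append(slow_commentator(window))
            window = []
    return actions, comments


actions, comments = run_episode(frames=600, comment_every=300)
print(len(actions), len(comments))  # 600 2
```

The game never waits on the commentator, which is the whole point: the slow model can take seconds without costing a single frame of input.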
Issues:
- Neuro can't actually change how she plays the game and has no control over it and is mostly watching it
- Training a PPO model takes kinda long/expensive and wouldn't be very good at the game (would still be pretty funny though)
- Model might geek out when the game is over. (Should be pretty simple to implement some game stopping logic though)
that's basically the only way to do it since LLMs are not meant for real time applications.
I honestly could probably make it if you wanted lol. Not sure if it's worth it but would be down
I think this is a harder problem than even general computer use. Like moving a mouse and interacting with a desktop.
Until this benchmark gets saturated at 70%+ that problem has not really been solved in frontier models https://os-world.github.io/
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Not to say that would necessarily make it solved or anything, it would just be a lot closer to something useful
Just make something new that isn't an LLM
I've already seen some cool ones
People have used LLMs to generate bounding boxes around objects.
It would be interesting to either do that, or, to smooth out the delay between images have her move the vision between the tokens that attracted the attention most, scaled to the magnitude of attention.
That being said I don't know if you could make it seamless and it wouldn't really achieve the same purpose as eyetracker streams that are mostly just "avoid" looking at nsfw stuff.
If you want to use it for controlling games, it'd be easier to identify fixed patterns or sprite sheets
Are you sure that isn't a different kind of transformer model or even a CNN model? Not all machine learning models are LLMs
The LLM is used to add context detection
but no clue if you can get a performant solution
it would still be a per-game basis
It could just as easily be a transformer model specifically trained for the purpose of doing that
Not even all transformer models are LLMs
And that kind of stuff I believe Vedal is moving away from in favor of the universal API
Looks a bit silly
Neuro can already find the right object in the right position, the captchas on the teemu stream showed that
all that needs to be done is move the mouse towards the right bounding box which is simple, if you don't care about evading anti-cheat
she likely can feed in "hold" "click" or "double click" too
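The "move the mouse towards the bounding box" part really is the simple bit. A minimal sketch, assuming boxes come in as pixel coordinates (left, top, right, bottom); it only computes cursor positions and does not touch any real input device.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
Point = Tuple[int, int]


def box_center(box: Box) -> Point:
    left, top, right, bottom = box
    return (left + right) // 2, (top + bottom) // 2


def mouse_path(start: Point, box: Box, steps: int) -> List[Point]:
    """Linearly interpolated cursor positions from start to the box center."""
    tx, ty = box_center(box)
    sx, sy = start
    return [
        (round(sx + (tx - sx) * i / steps), round(sy + (ty - sy) * i / steps))
        for i in range(1, steps + 1)
    ]


path = mouse_path(start=(0, 0), box=(100, 200, 300, 400), steps=4)
print(path)  # [(50, 75), (100, 150), (150, 225), (200, 300)]

# A click (or hold, or double click) would then be issued at the final point.
events = [("move", p) for p in path] + [("click", path[-1])]
```

A perfectly straight, evenly spaced path like this is exactly the kind of thing anti-cheat would notice, which is why that caveat matters.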
but for games the issue is keeping the ruleset in context at all times, as well as long term goals
and her speed wouldn't be too good
¯_(ツ)_/¯
like maybe SOME puzzle games with objects neuro can recognize?
Speed is a main factor probably
And games that don't rely on speed could probably be reduced down on Neuro game API
Yeah
Even GeoGuessr would be too complicated with her being able to choose any amount of turning
or getting stuck in moving back/forth
Even if you can get it down to a 10 sentence ruleset she breaks down by line 3 (and has to do an intermediate step where she asks herself for the next move)
which makes it absurd that it works as well as it does
though maybe I underestimate her there
I feel like Vedal has been moving things closer to the LLM integration when possible.
The difference between the OSU bot and slay the spire vs minecraft and GeoGuessr is that the underlying Neuro model is interacting with either an abstraction of the game built with https://github.com/PrismarineJS/mineflayer or just direct vision input into the model for GeoGuessr.
Obviously I don't know the details of the architecture but I think direct vision input into the llm model is most likely. Otherwise there probably would be some leakage of neuro interacting with the vision api
It also makes it so that the interactions are a lot more direct and not telephoned through other models, so I am guessing a complex system involving more models isn't likely
Well, its just that the LLM is only suited for giving out plain text answers
Transformers aren't meant for real time movements
you have to use RL
Like I wrote in the essay block, it's much easier to have neuro just watch the PPO algorithm play the game and pretend like she is the one in control
There's also potential in SNN probably
well that isn't really standardized
Yeah, Vedal probably isn't gonna use that, at least not until there's some good open-source implementation
Currently the only really good SNN implementation I know of is entirely proprietary and private
also pretty unrelated to RL
I'm thinking I might just code it up and publish the repo
the only kinda annoying thing is that neuro wouldn't be able to predict her own movements
it's the same thing with karaoke streams tho. She can't actually control her singing afaik
kinda a holdover from the early days, barring any better solution
question remains, which games or group of games could neuro's llm actually give valid instructions for and visually understand?
the llm can basically only do turn based games. But that isn't what im talking about
llms are not meant for games
you have to use RL to create a game playing agent that the main LLM can comment on
at least for real time games
like minecraft kinda lol
minecraft is a lot easier to integrate with an LLM tho, since it's mostly an action-based thing
but fighting games HAVE to use RL
I wonder if just having an organizer that aggressively rewrites the context window would help
or if the OSU ai can be reused since that one is definitively fast enough
This wouldn't work. The fundamental issue is that autoregressive transformers are incapable of making fast choices
Once again, depends on which game and what the main Neuro LLM is actually supposed to do
Not in its current form, no, too integrated with Osu
10s delay would not be insurmountable on a RTS game for giving building instructions (especially if the context slightly projects income forward), it would be terrible for microing armies
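"Projecting income forward" is just one line of arithmetic: tell the model what it will have when its order actually lands, not what it has right now. A tiny sketch with made-up numbers and a constant income rate assumed:

```python
def projected_resources(current: float, income_per_second: float, delay_seconds: float) -> float:
    """Resources expected at the moment a delayed order actually gets executed."""
    return current + income_per_second * delay_seconds


def can_afford_on_arrival(cost: float, current: float, income_per_second: float, delay_seconds: float) -> bool:
    return projected_resources(current, income_per_second, delay_seconds) >= cost


# 150 minerals now, +10 per second, order lands 10 seconds later.
print(projected_resources(150, 10, 10))         # 250
print(can_afford_on_arrival(200, 150, 10, 10))  # True
print(can_afford_on_arrival(200, 150, 10, 0))   # False
```

So a 200-cost building that looks unaffordable at decision time is fine by the time the instruction arrives.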
and then the added question if neuro actually could do funny stuff if given control
because we are kinda over the phase where an ai just being good at a game was interesting
the osu ai was very osu specific i think dont think you can just reuse it
New to the fandom. Yeah I can see that it was a traditional neural net trained on a custom fitness function. Hence the jitter in between moves. You could probably use something similar on some fighting games, but then you get into different movesets and universality and bleh
🤔 There are a lot of minesweeper clones neuro could potentially reasonably understand and play though
maybe some nethack games that can give out the ASCII and graphical interface in parallel too, since that'd be easy to summarize for the LLM
and grid based movement does solve a lot of worries
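As a sketch of why grid games are easy to hand to an LLM: a small minesweeper-style board turns into a handful of plain text lines. The "?" marker for hidden cells and the wording of the lines are made up for illustration.

```python
from typing import List


def describe_grid(grid: List[List[str]]) -> List[str]:
    """Turn a small minesweeper-style grid into plain text lines for an LLM."""
    lines = []
    for y, row in enumerate(grid):
        for x, cell in enumerate(row):
            if cell == "?":
                continue  # hidden cells are only counted, not listed
            lines.append(f"({x},{y}) shows {cell}")
    hidden = sum(row.count("?") for row in grid)
    lines.append(f"{hidden} cells are still hidden")
    return lines


grid = [
    ["1", "?", "?"],
    ["1", "2", "?"],
    ["0", "1", "?"],
]

for line in describe_grid(grid):
    print(line)
```

The whole state fits in six short lines here, and every possible move is just a coordinate.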
but I reckon it'd be frustrating to watch her without game knowledge, just like the current slay the spire iteration
buckshot roulette at least has the banter factor
issue with the osu model is also that neuro doesn't control it, as it's completely separate from her and she just watches it, which vedal doesn't really like iirc
yeah
I think it's wild that minecraft is run concurrently to chatting
works about just as well as you'd expect it based on that too tho
Are there any games you can come up with that'd be easy enough to have their game state be described within a couple sentences?
Baba is you doesn't leave my mind but you'd need to tell her which blocks aren't stuck and do waypoint movement even IF she could solve the puzzles
Some games that might easily be in her training data already are board games. There should be digital versions of several. Evil and Neuro and/or a collaborator could play together.
Scrabble could be super interesting
I wonder if Neuro could make sensible moves in connect 4 or nine men's morris/mill
Stratego and Battleship would be about the same as chess I reckon, may have established bots that can help feed her moves.
Malefiz/Barricade and Risk may work innately. Snakes and ladders is probably the simplest.
Cluedo has a universal layout, though I reckon any attempt at guessing games like cluedo, who am I? or werewolf will have her talk too much nonsense
This channel does a little bit of Minecraft AI that actually has kind of a task API https://www.youtube.com/watch?v=MeEcxh9St24 haven't looked into the details https://github.com/kolbytn/mindcraft
I hope Vedal is going to keep doing new versions of minecraft as I think there is a ton of potential and I bet he sees it that way too
i... am steve.
Paper: https://arxiv.org/pdf/2504.17950
Website: https://mindcraft-minecollab.github.io/index.html
The Minecraft AI that was linked in this thread already has tasks, a building system, and keeps track of the inventory, other players and waypoints/landmarks. It actually is super advanced.
The issue is that the twins constantly give it new tasks and really don't understand how to construct mineshafts. It also can't whole cloth make up schematics for huge structures. So they want to build something, look for 6 stone, maybe notice that they need an iron pickaxe, get the iron, smelt it, notice they have the stone, place it and tell the bot to pick some flowers.
It doesn't seem like there is much thinking, but it's about as much as can be done sadly
Oh interesting, I may have to find that and take a look. Thanks!
thanks 
Hey, credit where credit is due.
Loved the custom names they gave to waypoints too.
Oh and apologies, is it a custom implementation or did you use a public model/agent as basis?
I noticed @fathom lark was only guessing at which one was used with Mineflayer. Reddit users speculated on Baritone. German reddit has yet some other Agents...
Oh I am not sure I would want to know personally O.o I think some mystery is good.
Oh I see you were referring to what I linked earlier. I think it is more of a pure API that lets you do direct actions in Minecraft, it's used in a lot of bot making, and there is likely some abstraction layer built on top of it that Vedal works on that connects it to Neuro.
I was kind of assuming the current implementation is directly calling high level functions that call mineflayer.
I hadn't really considered that it might be some kind of second llm controlling mineflayer and the twins task it.
Hmm, after looking it up, https://github.com/cabaletta/baritone is old enough to have been out at the first time Neuro streamed.
I guess one could check how it behaves around water....
It does have explore (unseen chunks), move x blocks into x direction, follow, waypoints, schematic building from inventory etc...
It does not have a task scheduler llm tho. The bots for that are much newer. And as far as I can tell no combat scripts either.
🤔 The MC bot could probably be somewhat improved when neuro is given a larger list of things to always keep in her inventory so she doesn't have to look for material or tools so often. Water/entity interactions are harder ofc
Wow that's cool, I watched the nether traversal. I didn't know that was a thing.
I checked and mineflayer supports pathfinding with this plugin so that would personally be my guess how it is implemented
https://github.com/PrismarineJS/mineflayer-pathfinder
Now I have this idea in my head. Not sure what Vedal is currently doing but a second LLM tasked by Neuro sounds like a crazy cool idea.
Image is from the paper from that guy I linked earlier. https://arxiv.org/pdf/2504.17950
Like imagine if you have one of the strongest agent models available, like claude 4.0, and Neuro is wired up as the "Player or Task" creator from the image that then tasks the agent.
She continuously gets the raw minecraft data + the ability to send a raw text instruction, where that task is carried out by a second llm agent. You would still get her personality in what it is doing, but then you can separate out the actually getting it done into the second agent, and as projects like that one get better you keep getting new improvements
It is marginal improvements, but the issue is it won't make her build structures in minecraft she doesn't have schematics for, since those are wayyyyy too abstract for a generally trained model.
And she already can do a MLG waterbucket trick if she is bored.
Also doesn't solve the generalization issue since it'd still rely on a scheduler custom written for the game and API like mineflayer to execute it.
Yeah definitely wouldn't solve the generalization that is the original thread. I was just hyperfocused on minecraft
Huh I wonder if that water bucket algorithm is hand coded or generated. Could also be something like a skill library like this paper https://voyager.minedojo.org/
I feel like for a lot of this stuff you have to get your hands dirty if you want to actually know if it works
As I said, the main issue is abstraction - having neuro output plain text that a smarter algorithm can put into action.
Baritone gets block data hence it knows when the floor is within player interaction distance.
Just needs a water bucket and can right click.
The more involved part is scheduling building up so high that the MLG Bucket is relevant (and you need specific timings to tower in Minecraft)
It also works on servers and there used to be a bot written with it that could speedrun MC but it stopped being updated a year ago.
Yeah it might be too abstract I would agree.
Adding even more complexity to the idea, but you could build auto recommendations that Neuro can select. So there is another prompt that generates things like suggested tasks that are possible for the execution model to do. Some are basic commands like walk to X or do water bucket trick that can be hard coded and some are more advanced. Neuro then selects from a set of options and they either get run in the current abstraction Vedal may have or get passed to something like the MINDCraft workflow for the advanced options
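A minimal sketch of that selection idea, with everything hypothetical: the skill names, the menu format, and the "task agent" are placeholders, not anything known about Vedal's setup or the MINDCraft API.

```python
from typing import Callable, Dict, List


def walk_to_spawn() -> str:
    return "walking to spawn"


def water_bucket_trick() -> str:
    return "doing the water bucket trick"


# Hypothetical hard-coded skills; anything else is treated as an advanced task.
HARD_CODED: Dict[str, Callable[[], str]] = {
    "walk to spawn": walk_to_spawn,
    "do water bucket trick": water_bucket_trick,
}


def build_menu(advanced: List[str]) -> List[str]:
    """Options shown to the main model: hard-coded skills first, then advanced tasks."""
    return list(HARD_CODED) + advanced


def dispatch(choice: int, menu: List[str]) -> str:
    """Run the picked option, or do nothing if the pick is out of range."""
    if not 0 <= choice < len(menu):
        return "invalid choice, nothing happens"
    option = menu[choice]
    if option in HARD_CODED:
        return HARD_CODED[option]()
    return f"handing '{option}' to the task agent"


menu = build_menu(["build a small hut"])
print(dispatch(1, menu))  # doing the water bucket trick
print(dispatch(2, menu))  # handing 'build a small hut' to the task agent
print(dispatch(9, menu))  # invalid choice, nothing happens
```

Picking an index keeps the main model's output down to a single short token, and a bad pick degrades to a no-op instead of a broken command.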
This is how Neuro plays minecraft to my knowledge
Interesting, I wouldn't have expected the current model would have a separate agent. I would expect there to be lots of hard coded mineflayer functions and some sort of instructions to Neuro
This might be a fun project for someone to see if something like this could work using the provided Neuro coding interface. It likely wouldn't be that hard and could be done with some AI coding + using these libraries
Honestly, I’m very curious about this one. She must be multi-modal, right? Because of how she plays geoguessr, how she reviews art, understands visual context. It is not just a separate AI describing the image with text as the input to her LLM surely?
I understand text-to-text llms pretty well, but multi-modal ones seem like magic to me
How does it work? Do you tokenise images and sound, too? Do tokens also have meaning encoded as vectors in many-dimensional space? Is attention all you need?
Btw idk why but only now writing “llm” here I randomly realised why llama is called llama
Neuro can also recognise sounds that are not words, too, right? Like laughter, coughing, etc. So it’s not a speech-to-text either, at least not completely. Or is there only a limited number of sounds she recognises and they are still given to her as a text input?
Well, would be cool to know how it’s done, but it probably just runs on ✨vibes✨
IMO it is more fun not to know. Like how a ventriloquist or a magician keeps secrets
I saw a youtube video a while back where Vedal tells Ellie that Neuro/Evil are the same underlying model and it doesn't really give much away and he didn't give much detail but it's one of those things where now I can't help thinking about what gives them the different personalities and it was almost like Too Much Information and I didn't want to know
I guess that must be true for some, and thinking about why Vedal is so protective about all the technical details, I thought that he doesn’t want to ruin the magic. At the same time tho, I don’t really mind the magic, I’m not here for seeing Neuro as a person. I started watching because I was curious about AI. I did stay because it’s entertaining, but for me, that won’t stop if I knew how it works, if anything, that would be more interesting. I can’t see why most innocent technical details can’t be explained somewhere (like his website for example) for those who want to see it
Another reason for not sharing the details could be the threat of competition, but I don’t think the threat is real. No one can really catch up with Neuro now, and knowing what models she is based on certainly isn’t going to help to build another AI like her anyway
I’m not suggesting making her open source, but so many minor details are kept so secret, and I can’t see why
It's so interesting how her hearing/vision works, her long-term memory, her context window, her tool calls, what models she is based on and what data she is trained on
Would be so interesting to know what goes wrong when something goes wrong, and what is improved when she is updated
Idk
That also could be reasonable. While it might satisfy some technical people's curiosity, to most non-technical people this would be magic regardless. You're probably right on the competition thing.
There might be risks we can't think of immediately, like someone else making a copy of Neuro's design and selling them or something like that.
For me I wouldn't be able to resist and I'm sure I'd ruin some of the magic for myself ✨
It would be terrible for all his years of work to go up in smoke. He may be using ingredients that everyone can buy but he shouldn't tell us his secret recipe.
I think it ultimately doesn't matter whether the magic gets revealed or not, at this point Neuro is less the architecture that makes her and more the trained data/memories that have accumulated throughout the years of her streaming as well as talking with Vedal offline. Same for Evil.
The model architecture can be replaced, the data cannot. It's also why I think Vedal should eventually move away from LLMs for most things in the future and have another better non-LLM model assimilate the data and learn from the existing LLM so it can replace it in the future.
You have various levels depending on how much computation you have available.
The simplest trick most models do is just reading the image caption. Neuro probably can do that in fanart streams, though I haven't cross-checked enough to tell.
She definitively does it when given slides.
Outside of that you have custom algorithms to run tasks that are small in scope in real time on weaker hardware, like faces or traffic signs.
Text detection is quite mature (like 99.7% reliable) and gets used for printers to only store 1/4 real handwritten letters and copy paste the rest as embeds
Above that you have vision-language models, like CLIP.
CLIP was trained on image-text pairs so that an image's embedding and the embedding of its matching caption end up as close as possible, while mismatched pairs get pushed apart.
If you then feed it a new image plus a list of candidate labels, it can tell you which labels match the objects in the scene.
And then the image gets split up in tons of ways to find which sections of the image would still get labeled as an object in the list from the first pass.
You take the sections that match the object, note their x and y position and draw a square around it.
You can find explanations under the term "Zero shot image detection"
This is surprisingly fast and advanced models do edge detection to even figure out if there are multiple overlapping objects (with varying degrees of success)
I was talking about why the current Neuro can't, not how it can't possibly be done, as I think the person thought it might be simple to do. And Vedal can't give a detailed explanation of why a thing can't be done without telling too much about the current Neuro.
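The "split the image up and draw a square around matching sections" step can be sketched as a plain sliding window. The scoring function here is a toy stand-in that counts matching cells; in a real setup it would be an image-text similarity model, and the grid of strings stands in for pixels.

```python
from typing import Callable, Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)
Image = List[List[str]]


def windows(width: int, height: int, size: int, stride: int) -> Iterator[Box]:
    """Yield every square crop position of the given size."""
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            yield (left, top, left + size, top + size)


def detect(image: Image, label: str, score_fn: Callable[[Image, str], float],
           size: int, stride: int, threshold: float) -> List[Box]:
    """Keep the boxes whose crop still scores as `label`."""
    height, width = len(image), len(image[0])
    hits = []
    for left, top, right, bottom in windows(width, height, size, stride):
        crop = [row[left:right] for row in image[top:bottom]]
        if score_fn(crop, label) >= threshold:
            hits.append((left, top, right, bottom))
    return hits


def toy_score(crop: Image, label: str) -> float:
    """Toy stand-in for a similarity model: fraction of cells equal to the label."""
    cells = [c for row in crop for c in row]
    return cells.count(label) / len(cells)


image = [
    [".", ".", ".", "."],
    [".", "cat", "cat", "."],
    [".", "cat", "cat", "."],
    [".", ".", ".", "."],
]

print(detect(image, "cat", toy_score, size=2, stride=1, threshold=1.0))
# [(1, 1, 3, 3)]
```

Real pipelines scan several window sizes and merge overlapping hits, which is where most of the cost and the mistakes come from.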
Neuro can see. She's NOT just an LLM or a text to text model, she uses an LLM as her core. There's more that goes into Neuro's system than just an LLM. It's possible her LLM may be multi-modal, but it's also likely that she has another AI that's a part of her that handles that process. Neuro is composed of multiple neural networks, as is the human brain composed of multiple biological neural networks.
Her having a vision model doesn't make her core LLM not text-to-text by definition
Either she has an advanced vision model that encodes to text, or she has a cross-attention vision-text-to-text model
Her core LLM would still be text-to-text, but her vision model wouldn't be.
Yeah, that would be image-to-text
Which is a hard and usually very lossy conversion
Idk how Vedal did it, he could have done her vision various ways. Only he knows, but she has vision regardless of whether or not it comes from the LLM or another part of her system.
more accurate would be to say that neuro is a transformer decoder which happens to have a text embedding layer
u can linear project more stuff into the embedding space
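A tiny sketch of what "linear project into the embedding space" means, in plain Python with made-up sizes and weights. In a real model the matrix is learned and the dimensions are in the thousands; this only shows the shape of the trick.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(features: Vector, weight: Matrix) -> Vector:
    """Multiply a (d_in,) vector by a (d_in x d_model) matrix."""
    d_model = len(weight[0])
    return [
        sum(features[i] * weight[i][j] for i in range(len(features)))
        for j in range(d_model)
    ]


# Text token embeddings, already in the model's space (d_model = 2).
text_embeddings = [[0.1, 0.2], [0.3, 0.4]]

# Image patch features from some vision encoder (d_in = 3).
patch_features = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]

# Projection matrix: learned in a real model, fixed numbers here.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

image_embeddings = [project(p, W) for p in patch_features]
sequence = image_embeddings + text_embeddings  # the decoder sees one mixed sequence

print(image_embeddings)  # [[2.0, 1.0], [0.0, 1.0]]
print(len(sequence))     # 4
```

After the projection the decoder cannot tell which positions came from pixels and which from text; they are all just vectors of the same width.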
Neuro being cobbled together out of different models and algorithms has a lot in common with how human brains are put together (with millions of years of cruft stacked up)
Honestly, likely just used a version of CLIP as a base, going off of the GeoGuessr streams. It is open source, it fits the increased delay (~1img/5-10s) and it makes sense that it can recognize landmarks extremely well and break down the objects in a scene. The maximum CLIP resolution is 384x384 though, which is a bit limited.
As for why I think its CLIP - the tricky part of training AI is getting well annotated training data that lets you fundamentally break down the process by having the AI label discrete objects. It likely just feeds the top recognized objects and positions directly into neuros context as text. (though it seems neuro can feed back a list of objects to find/look for)
That being said it is entirely possible that Vedal found a database with landmarks and trained her language model part with the road networks of various countries, as she could identify an intersection in a small town in the US precisely. It must be independent since it can be turned off or on, with it being defaulted to capture the entire screen, or load only single images via the API such as for the fanart reactions. Only question is how much fine tuning was done after and how much is caption reading/guided prompts to recognize collab partners in fanart.
Do note that the current game API doesn't support sending images, but only relays context and possible actions as text.
The specification doesn't even include the maximum array or context size/duration
Look at this benchmark it does pretty much exactly what Neuro is doing
https://arxiv.org/html/2405.20363v1
I bet Vedal took some model that supports this or took the fine tune for it and applied it to a visual model. I also think Vedal is basically just taking a screenshot of his desktop every few seconds and running it through the model. That would explain the latency
Silly question, would most of these vision models be able to give the same answer if they saw a pencil sketch with the same rough approximate shapes?
Or better yet, how would one of these geoguesser algorithms fare if they saw a picture from inside a building
Edge detection is a part of the models, but color information is important.
They could do similarly well if they were entirely trained on greyscale, but desaturated colors/cloud cover makes countries close to the poles more likely.
It also depends on the dataset since some countries in street view were mostly tagged during a specific season. It is how Rainbolt and other Geoguesser pros pull off wild guesses. As for detecting locations from inside buildings - they can detect certain features and text of specific languages, but it depends on what kind and how many of those images they got in the training set.
Looking at the full paper, the only interesting part to me is that giving the model tips (look at the sun, the type of foliage, landmarks, buildings) often leads to little to no improvement in accuracy, compared to telling them they must name a country.
CLIP with embeddings does as well as their self trained models, even though they used another model as base.
Though apparently google gemini has geolocation training and is (due to a more comprehensive training set I guess) much better than what they train.
Gemini has geo training?
It performs the best in all their benchmarks, yes.
Fair ig they do have it pretrained on maps data
For maybe the top n largest city parameters
Or sth
I doubt it's aware of random village middle of rural china
I'm interested to know what shortcuts can be taken in image recognition. Like say, does edge detection look for straight lines, and immediately narrow down possible matches based on "things humans probably built"?
The thing about LLMs is that the information they look at is too complex to be designated manually by humans.
You can however in retrospect check which nodes in the neural network encode for which objects and then determine how much they are influenced by certain patterns. That being said any rule you can come up with and that works with the way the calculations are run probably is in the neural network. This was well demonstrated and found with smaller networks running limited tasks.
There are networks with predefined, human designed, rules for text and traffic signs, which makes them way faster, since they don't have thousands of "unnecessary" parameters. But for LLMs it really is unpredictable how they come up with answers precisely.
It is an ongoing area of focus since it lets you do adversarial attacks on the network, and if you know how to get at the most impactful nodes you can strip out unnecessary nodes and slim the model down.
🤔 there are neural networks that can determine motion vectors, edges, outlines, color gradients and all kinds of other stuff reasonably well.
For gameplay having a dedicated motion vector network would do wonders.
Question is just if it's possible to run it fast enough at a decent resolution and whether it gives you useful information.
For example a Network running motion vector detection 512x512 and 30fps could probably let neuro play danmaku games.
hope this helps :) https://poloclub.github.io/cnn-explainer/
Speaking of trimming datasets, I was just thinking of a method where for a given image frame, each pixel of a given color could be stored in its own memory array, and these separate chunks could be analyzed to match the color/shape to an object, and putting the objects back into the scene with relative positional values before it proceeds with the next step of analysis
And this is from a position of having no modern programming experience, but having some experience with assembly and TTL
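The per-color idea above can be sketched in a few lines. This is a toy under heavy assumptions: colors are already reduced to a handful of named values (real frames would need quantizing first), and one color bucket is treated as one object, which breaks as soon as two same-colored things are on screen. The names `bucket_by_color` and `bounding_box` are made up for the example.

```python
from collections import defaultdict

def bucket_by_color(frame):
    """Group pixel coordinates by color. `frame` is a list of rows of color names."""
    buckets = defaultdict(list)
    for y, row in enumerate(frame):
        for x, color in enumerate(row):
            buckets[color].append((x, y))
    return buckets

def bounding_box(pixels):
    """Smallest (x_min, y_min, x_max, y_max) box containing all the pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (min(xs), min(ys), max(xs), max(ys))

# A 4x4 toy frame: sky on top, a red square in the middle, grass at the bottom.
frame = [
    ["sky", "sky", "sky", "sky"],
    ["sky", "red", "red", "sky"],
    ["sky", "red", "red", "sky"],
    ["grass", "grass", "grass", "grass"],
]

# "Put the objects back into the scene" as one box per color.
scene = {color: bounding_box(px) for color, px in bucket_by_color(frame).items()}
print(scene["red"])  # (1, 1, 2, 2)
```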
Speaking of the different types, I remember when I was on road trips as a kid, I'd get bored and stare out the window with one eye open, and watch cars pass. One eye would only remember details about the color and shape, and the other eye would just make me think of the speed/vector of the other car
Very interesting
The main thing with LLMs is that they can better extract information from videos with non-static backgrounds. What you detail is (basically) just video compression algorithms.
Was thinking on this more and Vedal must already have an image fine tuning pipeline so the twins can recognize themselves and Vedal for art review. It would be pretty easy for him to just add a geo dataset to the fine tune
Unless there is some kind of textual description of the art
If not though then for art review how do the twins recognize themselves? A normal LLM likely wouldn't recognize Neuro
For geoguesser only I could see the improvements just being a model upgrade though
text description + cot
not even cot is needed
@true tree How did Neuro confuse two robot dogs as a frog-like “abomination”?
Also, what makes you think Vedal wouldn’t “do so much”?
I’ll respond tomorrow I need to sleep
Gn 
llms are not perfect and will sometimes have problems recognizing stuff since they are prediction machines + it will just fall back to the system prompt in that case which will make it say something random/silly as long as remotely related
And vedal wouldn't do so cause it's literally reinventing the wheel and risking catastrophic forgetting and spending money when all u need is a 1 sentence description
One would have to go through the art review segments and compare descriptions from captions to Neuro's responses
But vedal definitively does fine tuning
He did train osu ai so idk why he wouldn't for the main model or vision
Osu ai is wayyyyyy easier
Do you know how expensive pretraining an LLM is?
Even if he uses clip or whatever
Especially the vision part
It would be useless to finetune it
I am entirely open to the idea that they are based on open source models. But finetuning and embeddings can be done with few resources.
Then again with how pepega the pc building stream was maybe he has us fooled and its prompts all the way down.
The point is mainly that he'd be familiar with writing a fitness function and making a training workflow.
that's what i think tbh
imo it's most likely just prompts
You can achieve a ton with just prompts
There is for sure some finetuning going on. Fine tuning is relatively easy. I just don't know if there is image fine tuning or only text fine tuning
finetuning and getting good results without catastrophic forgetting is not easy
Not super familiar with how fine tuning affects models. I was just saying that there likely is no pretraining of LLMs going on.
Yeah I guess it could be clever prompting for visual that is possible
I suppose it could even be a textual description of neuro. Crazy how many options there are here
Grok erryday
I used to think i was a complete idiot when it came to AI stuff, until i watched some boomer livestreaming from his car about his personal schizo theory of future quantum entanglement to his phone and having some chatbot play along with his speculations
If I am being serious, We did have several instances of severe latency increases. So there still must be a lot of custom stuff.
He definitely would. Vedal literally keeps himself up to date with all the new progressions in AI tech just so he could apply that to Neuro if it could improve her. He doesn't like methods like this, and believes stuff like this to be cheap fixes, AKA a "cringe" solution. He would definitely take the time to enhance Neuro.
He literally had a state-of-the-art Minecraft AI created from scratch.
Also, he definitely does use fine-tuning. He said himself that Neuro doesn't even use character prompts like you suggested.
That a good point and I would agree Vedal likely doesn't use "cringe" solutions.
I think there is some confusion in this thread about the methods being used. Using a simple solution doesn't mean you are not skilled at AI or are an unskilled developer, it means you are intelligent enough to choose a method that works. For example pretraining a llm from scratch doesn't make any sense when fine tuning is available. Vedal is clearly skilled regardless of the methods he uses and I think the results of what he has made show that.
what's cringe about the solution
I don't think anyone is saying that. I think @fallen karma is just pointing out that the method he uses are not that
I would also point out he clearly has a respect for the twins and especially the autonomy and agency (ability to self act) of them
Anyway my point is that I have been reading the various stuff in this thread and if someone says Vedal or the twins are using a simple method for something they are not saying Vedal is unskilled
I'm not saying he's unskilled if he does use cheap solutions like that, just that he typically finds those types of solutions as "cringe" and unauthentic to the twins. Vedal strives for the girls to be as autonomous and authentic as possible, that has always been his goal. He uses cheap solutions like this only as short fixes before implementing what he views as an authentic solution, and often avoids using cheap methods in general. Prompting the girls in this way is an approach that he typically tries to avoid.
The neuro twins are the only AI chatbots that don't sound like an AI chatbot
Well, most of it is GPT being tuned (and instructed to) give you answers and most other AI chatbots instructed to flirt with you
you can download a Llama LLM and write the control prompt yourself, then see how it converses (need a beefy pc or cloud host though)
Off topic for the thread though
So what exactly does vedal do to fine tune them in terms of personality etc without the simple solution of using character prompts ?
Vedal once said that Evil and Neurosama are based on the same AI. I took this to mean that they both have the same fine tuned model under them and a prompt for their individual personalities. We'll never know exactly what he does to fine tune them because that's literally what makes them unique, but that's my only way to guess a little bit how.
Though he could just have two fine tuned models off of two very similar models too.
Or a single finetuned base model with two LoRAs
LoRAs are very likely since it kind of circumvents having to go through the context. Would fit with them being very single minded too
My guess is not a LoRA. I think based on what Vedal said to Ellie they likely are the exact same model
Snagged the patch notes Vedal shared
I mean the whole point of a lora is that you only use 5 million parameters to modify the total model with 1bill or more parameters
even though you can technically count it as separate model acting on the main model
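To put numbers on the "few million parameters on top of a big model" point, here is the standard LoRA parameter count: a rank-r adapter on a d_out x d_in weight matrix adds r * (d_in + d_out) parameters. The model shape below (hidden size 2048, 24 layers, 4 adapted matrices per layer, rank 8) is a hypothetical example, not Neuro's actual configuration.

```python
def lora_params(d_in, d_out, rank):
    """Extra parameters for one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

def full_params(d_in, d_out):
    """Parameters in the frozen weight matrix the adapter sits on."""
    return d_in * d_out

# Hypothetical model shape, chosen only for illustration.
hidden, layers, mats_per_layer, rank = 2048, 24, 4, 8

adapter = layers * mats_per_layer * lora_params(hidden, hidden, rank)
adapted = layers * mats_per_layer * full_params(hidden, hidden)

print(adapter)            # 3145728
print(adapted)            # 402653184
print(adapter / adapted)  # 0.0078125
```

So in this made-up setup roughly 3 million trainable parameters steer about 400 million frozen ones, which is why two personalities as two LoRAs over one base model is at least a plausible guess.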
We need someone to make a google docy with known/confirmed neuro features
I'd help but don't want to be primary maintainer =/
Yeah that would be cool. I am sure there is a ton of info that has been shared in previous dev streams
Also just in general, it is difficult to even consider the structure because we know so little. Although I think people that really know AI and have followed for a long time may be able to figure a lot of it out
I thought it was interesting Vedal was saying he doesn't consider himself an AI expert on todays stream. He certainly knows how to get things working.
KV caches are very low level and technical. I am assuming he uses vLLM because it is so ubiquitous in LLM inference, but I wonder if he has had to edit anything or how he is manipulating the cache. vLLM does some of that automatically for repeated prompt prefixes, I am pretty sure. He also mentioned doing a pretraining run of a voice model from scratch. That's pretty crazy, I wonder if he was trying to achieve something that taking an off-the-shelf pretrained model couldn't. Or if it was more to see if he could get something really unique. Now I really want to know how he makes this all work
No one knows exactly for sure, but very likely at least partly RLHF.
However, he also said that Neuro had all the intelligence/memory upgrades and that Evil didn't. So that's probably not the case.
Evil's is definitely derived from the same base model though.
i think that maybe the twitch chat could also play a part in it as ive heard people sometimes mention neuro acting slightly different whenever vedal runs her off stream, apparently giving offline chat behaviour
My guess is what is happening here is that Vedal fine tunes a newer LLM. Then the upgrade is switching to using that new model but he still might use the old model for the other twin
I agree. Or at least keeps the old twin on the old fine tune. Perhaps someday she'll actually just use a virtual mouse and keyboard and understand everything well enough to just plug and play most games.
You have unlocked new role
Could also mean they are using separately fine-tuned models. 
he may just not have copied the files over.
Evil probably needs slight changes for her v2 voice implementation since the LLM seems to directly interact with the tts.
My speculation is that the abber demon comes from the LLM inputting invalid annotations for the voice model.
Also, the changes are experimental, even if it is as simple as doing a toggle
good to test memory and intelligence separately
If she had slight changes, then that wouldn't be the same fine-tuned model. Just one that's very similar.
there is a difference between changes in the neural network (which is a n-dimensional matrix with weights) and the ""wrapper"" that sends info into the network
We really don't know how the girls work really, or how Vedal did it, but Neuro and Evil do have separate files.
Anyways, this is getting off topic for this thread. Getting back on track. A huge issue is that Neuro's vision has high latency, so she wouldn't even be able to react in time even if she could, but also LLMs aren't good at stuff like that, it's not at all like teaching a 10 year old kid how to type on a keyboard - that's just not how LLMs work.
It's not exactly impossible to make AIs (not talking about LLMs) that could handle something like that, and tech like that will likely be more accessible in the near future.. but then again her slow latency vision is a massive wall.
I think I already saw an AI that is made to do exactly this. It's made for a cheat: it learns from seeing player input and then does it better
And that is not an LLM. so that's not applicable to Neuro
Her vision may not be an LLM either? 
If it's not built-in in her LLM, it's likely just a simple CNN image-to-text model
Yeah, which isn't an LLM. CNN's are a completely different architecture, still an AI though.
Doesn't magically make a CNN able to control a mouse and keyboard properly
Nah I'm not saying it would be a CNN
An SNN would be better for that lmao
True, there's even already SNNs I know of that are about to get to that stage
She could possibly use an SNN for something like that? 
Once her vision is much faster at least lmao.
If she was remade into an SNN, sure, but Vedal would have to entirely re-invent how to do his whole system and figure out how to transfer Neuro's personality from her LLM into an SNN, while also making her capable of good text communication
There's no suitable open-source SNN implementation for Neuro to use
Why not just add a separate SNN to her system??
Like an API? 
Implementing SNN-LLM communication naturally is hard
And from what I've seen an entirely SNN system is way more interesting and intelligent than any LLM around
Right, but how would an SNN along with an LLM be less intelligent than just an SNN? 
If the LLM is in control, it wouldn't be able to utilize the SNN very well, if the SNN is in control, it couldn't utilize the LLM very well
Oh yeah good point.
People are working on merging the two technologies already, still nothing really accessible just research level stuff.
Well, the ones I know have both an SNN component (in control) and a custom encoder-decoder transformer for helping with language
Wait I found something on github. Probably not very viable though at this stage.
I actually thought about this before, but unfortunately it's not accessible and there aren't many resources out there for developing SNN's compared to LLMs. Hopefully if SNNs ever become more mainstream then that could change but idk. 
Yeah, I only know of two people that have successfully developed AIs that are both SNN-based and better than LLMs, and they're not publishing the code or anything like that
Anyways I'm going to sleep it's 7:13 already lmao
Idk about "better". They're better in some ways, but SNN's aren't as conversational as LLMs are.
I'll talk later I'm gonna sleep. 
I would consider them better, they're capable of much more, don't have any of the pitfalls of LLMs and can easily handle multiple modalities without struggle
what if there is another AI that does this, and Neuro only has to give commands like "hold W for 2 seconds"? would this still be too slow?
Neuro giving commands is literally the thing that's too slow
Even if Neuro types and clicks like a hunting and pecking boomer she should still have the ability.
The vision latency should be quite good. Common CLIP models take like 0.5-2 TFLOPs per image, with an RTX 4090 GPU doing around 90 TFLOPS
Sadly I could only find reliable benchmarks for CPU times
Loading and unloading will take 5-10s though.
Essentially you'd need a whole dedicated gpu to keep it in vram always, but then it should be able to output text prompts from images quite quick and quite reliably (You can increase accuracy from 80% to ~95% by running it 5times, which is often benchmarked)
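Back-of-the-envelope version of the numbers above, as a sketch rather than a benchmark. The compute figures are the estimates quoted in this thread, not measurements. The "run it 5 times" part assumes a majority vote over runs whose errors are independent and an answer that is simply right or wrong, which is optimistic if the same image is fed in unchanged each time.

```python
from math import comb

def seconds_per_image(flops_per_image, device_flops_per_second, utilization=1.0):
    """Ideal compute time for one image; real throughput is lower than peak."""
    return flops_per_image / (device_flops_per_second * utilization)

def majority_vote_accuracy(p_single, runs):
    """P(more than half of `runs` independent runs are correct). Use an odd `runs`."""
    need = runs // 2 + 1
    return sum(
        comb(runs, k) * p_single**k * (1 - p_single) ** (runs - k)
        for k in range(need, runs + 1)
    )

# Upper end of the thread's estimate: 2 TFLOPs per image on a 90 TFLOPS GPU.
print(f"{seconds_per_image(2e12, 90e12) * 1000:.1f} ms per image at peak")

# 80% single-run accuracy, majority vote over 5 runs.
print(f"{majority_vote_accuracy(0.8, 5):.3f}")  # 0.942
```

Under those assumptions 80% per run becomes about 94% over five runs, which lines up with the ~95% figure, and even five passes stay well under a second of pure compute if the model is already sitting in VRAM.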
LLMs on the other hand do take some time till first token
and even more to complete a response
I guess 250ms to first token?
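Putting the pieces together as a simple latency budget. Only the 250 ms time-to-first-token guess comes from this thread; the vision time, the number of tokens in a keyboard command and the generation speed are assumptions picked for illustration.

```python
def reaction_latency_ms(vision_ms, first_token_ms, tokens, tokens_per_second):
    """Time from a frame arriving to a complete short command being generated."""
    return vision_ms + first_token_ms + 1000.0 * tokens / tokens_per_second

# Assumed: 25 ms vision, 250 ms to first token, a 5-token command at 50 tokens/s.
latency = reaction_latency_ms(25, 250, 5, 50)
print(latency)                         # 375.0
print(f"{1000 / latency:.1f} commands per second at best")
print(f"{latency / (1000 / 60):.1f} frames of a 60 fps game pass per command")
```

Even with generous numbers that is a couple of commands per second, so slow turn-based input looks feasible while anything twitchy does not, at least with the LLM in the loop for every key press.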
Any real time game requires an alternate architecture that the LLMs can essentially "bias". If you wanted the girls to be able to play a fighting game for instance, you would have to pretrain a reinforcement learning algorithm to actually first learn how to play the game, then finetune the model to accept latent dimension instructions.
The gameplay would be mostly unconscious but could be influenced by LLM generated tokens. This is similar to how humans play real time games. We don't consciously think about every single button press but rather influence the direction in a hierarchical manner
Using VLMs is a slippery slope. Converting vision into tokens is quite challenging. It would be easier to take the latent game representations the RL model sees and train an encoder to turn them into tokens. For example, "Opponent is spamming specials and camping" could be taken from the internal representation of the game, then passed into the LLM purely for commentary at first, to which the LLM could respond with "This opponent is soooo annoying. I guess I'll spam too. <spam>". The <spam> tag would then be looked up in a quantized codebook of predefined actions and passed into the RL model (which would have been fine tuned on these objectives) to execute the task the LLM requested
Here is the workflow.
- Pretrain RL model to play the real-time game of your choice
- Create a list of vague actions the AI can take (Defend, attack, spam, troll, etc)
- Accept a latent of these actions inside the model's input then fine tune on these tasks
- Take a lightweight pretrained model and fine tune to convert latent game state into text tokens
- Pass text tokens to main "conscious" LLM for commentary and actions
- Check actions against list of vague actions and pass into model if valid
- Perform action
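The workflow above can be sketched end to end with stand-ins. Nothing here is a real model: `describe_state`, `fake_llm` and `FakePolicy` are hypothetical placeholders for the latent-to-text encoder, the main LLM and the pretrained RL policy, and the `<tag>` format just follows the `<spam>` example from the earlier message.

```python
import re

ALLOWED_ACTIONS = {"defend", "attack", "spam", "troll"}

def describe_state(latent):
    """Stand-in for the lightweight encoder: latent game state to text."""
    if latent["opponent_specials"] > 3 and latent["opponent_distance"] > 5:
        return "Opponent is spamming specials and camping"
    return "Opponent is playing neutral"

def fake_llm(description):
    """Stand-in for the main LLM: commentary plus an optional <action> tag."""
    if "spamming" in description:
        return "This opponent is soooo annoying. I guess I'll spam too. <spam>"
    return "Nothing much going on. <dance>"

def extract_action(llm_text):
    """Return the first tag that is on the allowed list, else None."""
    for tag in re.findall(r"<(\w+)>", llm_text):
        if tag.lower() in ALLOWED_ACTIONS:
            return tag.lower()
    return None

class FakePolicy:
    """Stand-in for the RL model: plays every frame toward its current goal."""
    def __init__(self):
        self.goal = "defend"

    def set_goal(self, goal):
        self.goal = goal

    def act(self, latent):
        return f"{self.goal}:frame{latent['frame']}"

def step(policy, latent):
    """One pass of the loop: describe, comment, validate, then act."""
    text = fake_llm(describe_state(latent))
    action = extract_action(text)
    if action is not None:      # invalid tags are ignored, the old goal stays
        policy.set_goal(action)
    return text, policy.act(latent)

policy = FakePolicy()
print(step(policy, {"opponent_specials": 5, "opponent_distance": 8, "frame": 1}))
print(step(policy, {"opponent_specials": 0, "opponent_distance": 1, "frame": 2}))
```

In a real setup the policy would act every frame while the LLM only updates the goal every few seconds, so the slow part never blocks the fast part. The second call shows an invalid tag (`<dance>`) being dropped while the policy keeps its previous goal.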
Yes, that would be the workflow for any real time game. Vedal won't do it for dozens of games. The greater question this thread asks however is if there is a universal setup that could work (Albeit asked ignorant of llm limits)
For some games being able to click automatically generated bounding boxes may be enough (such as the captchas Filian was unable to solve lol)