# [WIP] Neuro's Desktop (An integration for letting neuro use a computer)
coool
here are the latest updates from me, didn't do a pull request yet, want to add things first
https://github.com/vituha230/neuro-relay/tree/main/src/dev/nakurity

Btw, from a quick look, the server sends intermediary-only messages (aka it only works with the intermediary), so either the intermediary or the client has to translate them back into neuro messages
Along with the server sending non-neuro messages to integrations (which they might not understand)
Actually uhmmm, neuro relay tests should probably be updated
busy week
cant do much yet
Its okay!
I was procrastinating on this, actually...
Sad
CAUGHT
WAIT REALLY?
yea!
Would it send back information to Neuro to let her know what she did at least?
NeurOS would be sick.
Yea
I think Nakurity implemented a vision system for neuro, using ocr
Though it is a bit worse than the system without ocr
Since ocr is kinda bad rn
With ocr, it depends on the machine running the integration being beefy
I'll get back to working on this now
I kinda have an idea for a new integration design
I made a pull request for windows API a while ago and still haven't heard anything back yet
https://github.com/nakurity/windows-api/pull/1
Oh, I just didnt see it. I'll review it today.
I should note, for anything that needs our attention, ping me or nakurity here. Ty!
@timber basin
Its your repo
Not the main one, why'd u move it
Sigh
srry
Is giving the twins access to the entirety of windows a good idea? Like even if there are safeguards, I feel like Neuro or Evil would find a way to bypass them.
They would be able to do things, that a regular human can on windows, thru keyboard and mouse.
You mean digitally right?
Yea
Evil or Neuro crashing Microsoft Defender or an anti-virus is a scary thought.
may I add, that they cannot do that?
they use windows in the way we do, using keyboard and mouse. They have access to these two basic inputs. And I assume Naku plans to add integrations to this integration that will allow neuro / evil to have higher-level, more abstract control of a specific windows application.
They do not have access to:
- the windows underlying processes
- the windows codebase
- the ability to crash an application, unless us normal users can do it (thru a bug)
- the ability to solve captchas (but this depends on the algorithm we end up using for mouse movement)
though depending on what defender thinks (hopefully our software does not get flagged (probably won't))
but they won't crash it
don't worry, there still needs to be an explanation for some people about the vscode extension that they cannot just access git(/github for that matter) or execute arbitrary commands at will
it's kinda a thing with these kinds of integrations, almost like clickbait, but we aren't trying
true
I honestly have no idea what to do with this as of rn
I got some free time, to work on this
but the state of this project is quite a mess
should I just rewrite it? wdyt, KTrain
@steel dagger
I mean, if it helps you read easier, you should rewrite it, but remember to PR first so that you don't have 20 merge conflicts
kay
@timber basin what do you think? (not the full screenshot, but should give u an idea of the refactor)
there ain't no way there is gonna be 4 langs in one repo
what would be wrong with that, other than just trolling with the language usages :D
mmpm?
it's just that
most tooling is not very well designed for that
I wish it was
but it really is not
I guess not "well-designed" more like "well optimized" in a way
idk, kinda hard to say
eehhhh, I already went with it. I'm not gonna rewrite it a second time
@steel dagger would you know how to embed python correctly?
https://prod.liveshare.vsengsaas.visualstudio.com/join?B6715E893AB4A43AD42664B95FBE3A1A89BC
or anyone else, could- uhm. maybe help?
sorry, am not skilled enough to compile code across languages (or whatever it is they call it)
would you know how to correctly bundle embedded python tho-? that was kinda my issue
to run the python packages
otherwise you'd need python installed on the host system to run the application
well idk
there's something called python-build-standalone that I think helps for your case
winpython- ooohhh, I remember that now
somewhere on sourceforge
I forgot about winpython
but my problem was site-packages
hmmm
@timber basin you know python right? (you coded the whole app in python last time)
that's probably enough for today
I'ma commit this to my fork and go sleep. Naku, wherever u are, take over the code please
mmphm, okay.
this looks fine, I guess. But I don't know most of the languages you're using.
I'll try to figure it out tho
goodluck I guess, I still can't get python to run
thanks
no, not really. I did use Claude Code last time
completely rewrote a lot of stuff: https://github.com/cassitly/neuro-desktop
I really want this to be a rainbow D:
reason the code is all in the desktop folder is because I'm planning to make a CLI version too
two modes: a desktop user interface for the swarm to see what neuro is doing (content), and a CLI mode for neuro to use, like far in the future (if she even gets access, or becomes sentient enough to continue development on herself)
probably should make a dev branch for all of this, instead of pushing to master
uhmmmmmmmmm- wait- @timber basin I didn't know I was using your codespaces... uhmmm sorry...
Why was your account even logged into my tablet, I'm confused.

oh, I forgot to log out, when I last used it.
Your tablet's just faster than my phone, plus- bigger screen yk?
I didn't notice because I was committing to my own repository
but whatever, next time please log out
why don't you use guest mode or wtv
there's incognito.
I didn't know...
there was?!
yes. there is
would you like to review my PR? https://github.com/Nakashireyumi/neuro-desktop/pull/25
code rabbit is reviewing it
who added that AI anyway, I forgot
that's enough for me today. I'll go do smth else
code rabbit is having a lot of complaints on this one
oh btw can I ask how independent neuro-relay is from neuro-desktop
very, I think
I'll rewrite that too
cause that's also a mess
Neuro Relay should be rewritten tbh
Hi 
Im a bit lost but ok
Do you know rust?
no 
wait did you get some new idea
something to make?
m-, no- I kinda wanted someone to teach me Rust, so I can help @wheat reef work on Neuro Desktop. Because she rewrote it in Rust
pushed out a change today. PR #25
I will need some help as well
You could code it in python, and I'll implement the code you made in python btw
it's a multi-language codebase for a reason
I guess so...
yea
I'm going around adding changes suggested by coderabbitai rn
Idk what I did today
Should I have implemented the integration code in python?
idk, but rust is so confusing for me.
first time using rust btw
did you guys end up doing any neuro api communication code in go btw?
tried
I'm thinking if I wanna try again, I probably will just make the go code its own binary
and we communicate using text files
:D
did you like, want the sdk code I made for go or smth?
cause I don't think there's a neuro sdk for go.
(couldn't find any)
yeah would be great
might be something I was trying to resolve
nah just point me to it
1 moment
since I didn't manage to implement the integration in Go, the code is not tested btw
fair enough
use v2 for better functionality. v1 was the version I had in neuro-desktop
v2 is way better
thanks
np
May I ask what did you need it for?
as mentioned above I was considering/wanting to publish a Go SDK for uses and was having trouble, but also I'm considering switching neurontainer's backend to use Go instead because the Docker Go SDK is more complete than the Docker TypeScript SDK
it's kind of weird because normally no games use go for their lang so for a while there isn't a go sdk
ooh
yeah
like pretty much any tool integration either uses python (most) or go (none) and I'm not sure if neurontainer should join that crowd yet
I finally wrote the neuro integration
It is semi functional
pressing a key doesn't work rn, it's something with the action handling
but moving a mouse works!
but for some reason it's also executing all previous actions, in the action history, sigh
recursive execution?
No-, the monitor (the system I made for providing neuro with information about what's happening on the computer) keeps track of an action history, and when I implemented the system in Rust to interact with the pc, I found through testing that previous actions that happened before the current action seem to also be executed.
It might be a bug with me not clearing the action queue. But I'm also not sure if it's related to DesktopMonitor's action history
Wait- I think it might be a problem with the Action Queue persisting, since it needs to be manually cleared after an execution. This behavior is there so that macros can be used, e.g. saving a macro for doing something like opening notepad
I was thinking of implementing a default macro set. And ones that neuro can add / a UI for vedal to add ones specific to how the PC is configured
Is that for the funnsies?
Or should it have an actual use case?
I remember seeing you and others implementing that for neuropilot
the use case is bribery
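On the repeated-execution bug above: one plausible cause is executing from a list that never gets emptied. A minimal Python sketch of the idea (the real code is in Rust; `ActionQueue`, `execute_pending`, `save_macro` and `load_macro` are hypothetical names, not the repo's API). Pending actions are consumed as they run, and macros are kept as separate snapshots that only replay when asked:

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class ActionQueue:
    """Hypothetical queue: pending actions are consumed when executed."""

    def __init__(self) -> None:
        self._pending: Deque[str] = deque()
        self._macros: Dict[str, List[str]] = {}

    def push(self, action: str) -> None:
        self._pending.append(action)

    def save_macro(self, name: str) -> None:
        # Snapshot the pending actions so they can be replayed on purpose later.
        self._macros[name] = list(self._pending)

    def load_macro(self, name: str) -> None:
        # Queue a copy of the saved actions; the stored macro stays untouched.
        self._pending.extend(self._macros[name])

    def execute_pending(self, run: Callable[[str], None]) -> int:
        # popleft() removes each action as it runs, so nothing is replayed
        # by accident on the next call.
        count = 0
        while self._pending:
            run(self._pending.popleft())
            count += 1
        return count
```

With this shape, calling `execute_pending` twice in a row runs nothing the second time, while `load_macro("open_notepad")` followed by `execute_pending` replays a saved sequence deliberately.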
@timber basin could you help me in fixing the powershell scripts? #programming
yeah
I'll go checkout your repository
did.... you end up checking it out?
aside from the powershell scripts, neuro desktop works-! And it bundles correctly for a release ver.
Also added macros
will be fixing the issues where the mouse won't click, and the key press doesn't work later
Removed the broken actions
now it works perfectly
Everything now works correctly
I just need to implement something like vision now
oops, sorry @timber basin I accidentally merged into the org repo. I'll revert that
This integration obviously needs its own PC for neuro to use.
Sooo, mmphm
neuro-desktop .iso build 
this is the built file
it works now :D
you can go build it yourself at my repo ig. I need someone to test rn https://github.com/cassitly/neuro-desktop/tree/dev
I don't have an openai key to test it
cd desktop
make run
to build and run it
wait it doesn't work
lemme fix that
why do you need an openai key to test
jippity
I need an actual AI to test it
and possibly a second pc
but I'll take my chances that I won't lose control with AI spam
it works
but I need to test the practicality
with jippity
to see if the AI would actually understand what to do
ah
so... I was thinking about this message... hmmm, maybe an .iso file would be best for that. Wdyt?
though that would mean I'd have to rewrite the entire project again.
idk how you bundle apps into windows isos, but you could build an iso with all apps bundled, idt you need to rewrite the entire project for that?
oh- yea, you're right. I just had my idea on hooking neuro-desktop right into the windows operating system layer
I mean... you could also do that, if you want to offer a permanent neuro-windows
but I would take the bundled inside route
and make it so that the integration starts on startup by default 
that's a great idea!
but microsoft.. sigh
where is that evil neuro emoji where her head was bonked into a ravine shape?
I'll add the vision stuff tmr
Nakurity is slacking on this project, whyyy?
Btw @steel dagger do you think the python library pyautogui's screen coordinates depend on how big the screen is?
could be neuro's excuse to get vedal to get her a good monitor xD
you'll get a better answer asking chatgpt than myself I'm afraid
oh, okay
I mean also when is that never the case 
this is funni though
yea. that was my first thought about that question lol
NEURO HAS AN EXCUSE NOW
:PagMan:
Full GPT5.2's response:
Sadly, I'll have to make neuro-desktop windows only for now
@steel dagger what do you think about my movement algorithm?
also @timber basin
it could be better with fluidness ig
feels kinda slow but otherwise ig it's fine
yea ik
would you know how to smooth out a number range so that the algorithm would smoothly move to the next pos
how are you doing it now? it looks like you're doing some kind of "accelerate in the direction of the target" thing?
algorithmic path. I'll give you my v3 code.
its okay! you need some sleep then :3
yea it kinda starts off with a slow-motion drift, and then starts accelerating
i think modifying your fade curve so it does a snappier curve, spending more time at the ends and quickly moving at some point would make it feel less slow motion. or just speeding it up generally. but pretty cool.
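A rough sketch of what a "snappier" fade curve could look like, in Python. This is not the v3 code from the repo; `ease_in_out`, `eased_positions` and the `sharpness` parameter are made-up names for illustration. Higher `sharpness` keeps the cursor near the ends longer and makes the middle of the move faster:

```python
from typing import List


def ease_in_out(t: float, sharpness: float = 3.0) -> float:
    """Map progress t in [0, 1] to eased progress in [0, 1].

    sharpness = 1 is linear; larger values linger at both ends and
    cross the middle quickly.
    """
    t = min(1.0, max(0.0, t))
    if t < 0.5:
        return 0.5 * (2.0 * t) ** sharpness
    return 1.0 - 0.5 * (2.0 * (1.0 - t)) ** sharpness


def eased_positions(start: float, end: float, steps: int,
                    sharpness: float = 3.0) -> List[float]:
    """Return steps + 1 positions from start to end along the eased curve."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [start + (end - start) * ease_in_out(i / steps, sharpness)
            for i in range(steps + 1)]
```

Applying the same function to x and y separately gives a smoothed number range for both axes; lowering `steps` or the per-step delay is the simpler "just speed it up" option.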
Btw @steel dagger I'm updating the Go SDK. Have you used it / there might be some game breaking changes.
Asking If I would need to archive the v2 / v1 files.
haven't used it yet, but I thought you were using rust for impl now?
nope switched back to go. Waaayyy better

am using IPC for communication between the two binaries now (linking is too hard, exec bash cmd better)
how about we both just work on a general-purpose go sdk instead?
I have a private repo for it because I was trying to figure out both the server and client portions of the api
lemme just invite you rq
done
I figured out the client, though I already have a repo tho
oops -_-
this was the refractor pr to my main branch from dev
wdyt?
idk what a refractor is but uhh sure that seems good
rewrite / restructure
like anything, that changes a bunch of core features
that's a refactor
oh, I misspelled it then oopsies
I'll leave the rewrite for Neuro Desktop to use the new SDK to @timber basin ... I made a mess.. and I am just gonna leave it. hehe :3
claude is great. GPT5.2 will burn your code and sanity to the ground.
I feel like GPT5.2 doesn't know anything at all. And just knows enough to seem intelligent, but it fails at literally everything
weird. gpt5.2 on gh copilot seems to work fine
it works fine. But it literally is so dumb for me, idk why.
I ask it a simple task (refactor smth), it does the bare minimum, and most of the time it doesn't work. I like Gpt-4o better.
it will 87% of the time, mess up your code
and forget functions from the original code
btw you could tell it was claude code from the codebase?
aside from the README
I just have an intuition for gpt vs claude
ooh
it's sometimes incorrect like everything so I always take it with a grain of salt
okay
Gemini 3 seems ok so far. admittedly that's only been to summarize where in the implementation plan the current code is and describe the next steps and to track down a memory use error in code written by a different agent.
gpt5.2 can only plan as well. it knows it can only plan, and if it tries to do the impl. It knows it will fail, so it's decided to plan instead
openai seems to have distilled GPT5.2 to a point where it's like worse than GPT3
but at least it can do something, it just fails at coding (from my experience with it so far)
anyways! what brings you here?
making an AI similar to Neuro and Evil. Not as ambitious but similar.
looking at what other people are doing.
oooh
I have finished adding a natural movement pathfinder for mouse control
wait- I have a suggestion to your idea
have you considered implementing the neuro backend api as the way to add integrations to your AI?
if so, then you could have your AI use a lot of the community-made integrations for neuro / evil, since you've implemented the same API that they use
yeah, i've thought about it. it would be a "maybe, assuming i can get it working well" thing.
what language did you use?
I'm still rewriting the client side speech recognition code. this version is going to be more streamlined and actually planned better but the old code was a mess.
the first version was all written in python for simplicity but ran into issues with real time audio processing and async code. the AI was running in a llama-server instance.
this version loads the AI code into a process which then starts listening for requests, adding the info to a queue. another thread passes data on the queue into the LLM, in a structured format. the LLM output goes to the action handler which takes speech actions and converts them to TTS (tbd... there's a lot of things the TTS needs that I'm still researching. previous TTS didn't give timing info on output. kind of want that.).
the new version has the main code written in C++ because that's where the LLM libraries want to be.
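The queue-plus-worker layout described above can be sketched in a few lines. This is only a Python illustration of the shape (the actual project is C++); `run_pipeline` is a hypothetical name, the model is stubbed out as a plain function, and the "action handler" here just collects text instead of doing TTS:

```python
import queue
import threading
from typing import Callable, List, Optional


def run_pipeline(requests: List[str], llm: Callable[[str], str]) -> List[str]:
    """Feed requests through a queue to a worker thread that calls the model."""
    inbox: "queue.Queue[Optional[str]]" = queue.Queue()
    spoken: List[str] = []

    def worker() -> None:
        while True:
            item = inbox.get()
            if item is None:  # sentinel: no more requests
                break
            reply = llm(item)
            # Stand-in for the action handler; a real one would hand this to TTS.
            spoken.append(reply)

    thread = threading.Thread(target=worker)
    thread.start()
    for request in requests:
        inbox.put(request)
    inbox.put(None)
    thread.join()
    return spoken
```

Because a single worker drains a FIFO queue, replies come out in the order the requests went in.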
oooh
okay
llm libs in c++ sounds odd
ig it does make some sense though given cpython is written in c/c++
the python code could use the native libraries which were written in C++ to access the GPU but I wanted to have the event handler and the LLM in the same process so they had a coherent view of the currently "in process" state and it would be possible to interrupt them.
also wanted to be able to update the context dynamically and have multiple models loaded at once so i could keep a "thinking" model and a "speaking" model. that would let them think while they are speaking as well as being able to receive new information and choose whether or not they should stop speaking.
debugging why actions still auto execute, when specified not to rn
I wonder why this is
fixed it
:D
does it work for clicking the "I'm not a robot" button without being detected as a robot?
not sure. the captcha checks mouse movement for humanness
so, when filian solved the captcha for neuro: Neuro can obv solve it, but the problem would be that filian's mouse movement is obv in a human-like path
while I would have to algorithmically mimic that path, using code
captchas are smarter than you think, yk?
I think I finished 0.0.3b-dev now, it allows for Low Level control of the windows OS without issues
@timber basin go review the PR!!!
I will be waiting :D
meant to reply to this btw
also unit tests are needed
well... I think I have an idea for allowing the twins to do that... and it involves a neural net controlling the mouse in a random (human-like) path, while moving towards the mouse position that the twins selected
0.0.3c-dev will be the version I implement the system to give neuro context of the desktop env
Okaay, I found the problem on why captchas are still failing. Python's Pyautogui library is not the best for this kind of task. So I'll have to rewrite the pathfinding library in C. And use it in python.
Also, how would we get vedal to even see this?
I have an idea for other AIs to use this integration and basically maintain it thru using this integration. Seems possible / useful :D
@jolly dove may I ask do you plan to open-source your AI, and if it supports RT Learning?
Completely unrelated btw :D (I totally don't plan to borrow your code)
Didn't know you could unregister an action before sending an action result back, because my thought was: how could one receive a result for an action that no longer exists?
The more u know
the action result is tied to the execution ID, not the action name.
I will rewrite neuro-relay to be an integration that provides more utilities than the neuro api would
Didn't know that, ty
I mean it is kinda obvious if you look at the action result spec, but tbh it took a good bit of f'ing around and finding out before I got ahold of the ability to make basic integrations with the api
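To make the "tied to the execution ID" point concrete, here is a small Python sketch of building a result from an incoming action. The field names (`command`, `game`, `data.id`, `data.success`, `data.message`) follow the Neuro API spec as I understand it, so double-check them against the spec; `build_action_result` itself is a hypothetical helper:

```python
import json
from typing import Any, Dict


def build_action_result(game: str, action_message: Dict[str, Any],
                        success: bool, message: str = "") -> str:
    """Build an action/result reply keyed by the incoming action's id."""
    # Only the id from the incoming action is echoed back; the action name
    # is not part of the result, so the action can already be unregistered.
    execution_id = action_message["data"]["id"]
    result = {
        "command": "action/result",
        "game": game,
        "data": {"id": execution_id, "success": success, "message": message},
    }
    return json.dumps(result)
```

Since the reply never mentions the action name, unregistering a one-shot action first and answering afterwards is consistent with the message shape.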
Fixing the vibecoded mess nakurity made
It was quite easy with claude. And I just read its code and went about following what it did. And it worked for refactoring neuro-desktop (my actual first attempt at a neuro integration).
I do also have to fix the neuro integration code mess, written in Go
And refactor to actually use the neuro-sdk I made
Refactoring neuro-relay will hopefully be easy
oh did you publish your go-neuro-sdk yet?
Yea
Its on github, I tried
And it worked with go get
Just havent published a proper gh release yet
sorry I meant is it listed on the go module thing yet
I have no idea what that is
I never used Go before the day I started working on refactoring neuro-desktop
Sure! Which one would you like?
If you dont have one in mind, I'll go with MIT
see above
and then you can PR it into the vedalai/neuro-sdk repo
Oh, i was typing that message before i saw that
Okay
Added
Did you want to do smth with the SDK / is that why you wanted me to add a License?
uhm
Forgot that
yes, neurontainer backend potential rewrite
Shouldnt have used claude for the README, but I was lazy :p
@timber basin it would be nice if you stop ignoring me, and work on neuro-desktop with me?!
Might need to slowly migrate this into a full toolchain for neuro to use (and might as well make it its own OS atp), once we finish the proof of concept
it still fails the basic captcha test. hmmm
will work on this integration later
I can help
So I do not own a windows computer actually at all
That being said I have quite a few ideas on how to improve some things and also fix some bugs you are having
I strongly suggest using Deepseek OCR if not already
The model is a bit heavy for CPU users but 100% feasible even with lower end GPUs
The model is only ~3b iirc
so literally any GPU can run it and any one from the past like 8 years should be able to handle it in real time
it's perfect because it can take a screenshot and literally make a markdown file separating things into groups of text and separate images (never word by word)
ty!
Comes with Image summarization and also block based OCR
I was procrastinating on this, because I have ADHD (and I forgot my medicine)
Tbf I am procrastinating on a lot of important things I probably should be doing too
but instead I am working on neuro things
that's a great idea.... I honestly forgot that Deepseek released their OCR
that seems very healthy ngl
our goddess neuro would be proud
it's basically separated into multiple binaries
I was highkey getting ragebaited by the readme cuz some directories don't exist
with each function being its own separate binary
oh, I was lazy and asked claude to make it
that was so long ago
real
but unless you mean nakurity's readme on the main repo
this the current latest branch
with my impl
and on PR #28
because honestly, nakurity's impl sucks
crazy question but why are u the OP but not repo owner
I am part of the github organization
but I forked the repo and worked on my own instead
its just better to PR stuff
instead of working on the main repository
which nakurity made a mess on, by doing so
the structure is this
new functions are separate binaries
we wire into the rust app
(desktop/apps/neuro-desktop)
we forgot about linking in this repo
it doesn't exist in my mind
/ I couldnt get it to work
It's been a really long time since I used Rust
IPC is better
but I surely still got it
oh dw
you can still code in any other language
and I will wire it up to the rust app for u
bro what
except that
shucks!
I am not gonna deal with brainfuck
Can't imagine why
not even AI can help me with brainfuck
lmao
its gonna fuck up both our brains
Indeed it will
go to the dev branch of my repo
and you'll see the latest changes there
I was gonna PR v0.0.3b-dev without context stuff
and just working actions
Oh I'm fucking stupid the org is "Nakashireyumi"
so I'll do that
and v0.0.3c-dev will have the stuff that gives neuro / evil context about the desktop
na-ka-shi-re-yu-mi is how you pronounce it
very hard, nakurity made that up
Lol ik. I am Japanese 
I imagine so
should probably move desktop/native/neuro-integration to desktop/apps/neuro-integration
since it compiles to its own binary
I'll do that
Are you still having issues with the port being left open?
nope
that was an issue with the main repo
not mine
That's nakurity's issue
not mine
I fixed that by rewriting the whole mess she made
holy ur repo is a lot easier to read
thank you!
I worked hard on ensuring that :D
doing some cleaning up work on the repo rn
May I ask what are you doing now, @empty trout ?
I was thinking of making an NN for mouse movement stuff
is there already one?
I have no idea
I am reading the code
oooh, okay
Okay so for moving mouse with integration the best thing you can do is to not introduce any additional NNs
why tho?
this is the only NN I planned to add
twins are capable of doing it themselves given good context engineering
cause I wanted mouse movement to pass captchas
uhmm, direct mouse controls are hard yk?
Okay let me try to explain a bit more
The twins are 100% capable of moving to a button
and navigating websites
You need to give it more signal and less noise though
and the NN just moves it to the selected place like a human
no pixel values
No, the NN doesn't do that for u
the twins give the NN a position
and the NN moves to it in a human-like way
as much as possible
pyautogui already has things like that
since I want them to pass captchas
pyautogui moves in a straight line
to the place
captchas actually don't use mouse data as much as you think
they usually just read previous cookie caches and stuff iirc
well, I tried captchas with neuro-desktop as of rn
it failed
the captcha
I tried using tony
and you were successful as a human?
In that case add a gaussian sample
oookay
essentially brown noise
I'll follow your lead then
I can code it in a couple min one sec I'll write u a code segment
NN's are fun
implementation in python is okay
I am an AI researcher afterall
yea, it is fun!
btw apollo, this file, if you haven't already seen, implements the current mouse pathfinder for neuro / evil to use. When they enter a coordinate to move to
I should put a showcase of the application working with Jippity, in the main README ngl
discord is...?
yeah the discord server
I used the bezier curve formula and brownian noise
so it will take a curved path with jittering to make it realistic
my impl already used those I think
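For anyone following along, a minimal Python sketch of the "bezier curve plus brownian noise" idea. This is not the pastebin code or the repo's pathfinder; `bezier_path` and `add_brownian_jitter` are illustrative names, the curve is a single quadratic Bezier, and pinning the noise to zero at both ends (so the cursor still lands exactly on the target) is my own assumption:

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def bezier_path(start: Point, end: Point, control: Point, steps: int) -> List[Point]:
    """Sample a quadratic Bezier curve at steps + 1 evenly spaced t values (steps >= 1)."""
    path = []
    for i in range(steps + 1):
        t = i / steps
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        path.append((a * start[0] + b * control[0] + c * end[0],
                     a * start[1] + b * control[1] + c * end[1]))
    return path


def add_brownian_jitter(path: List[Point], sigma: float,
                        rng: random.Random) -> List[Point]:
    """Offset points by a Gaussian random walk that is zero at both endpoints."""
    n = len(path) - 1
    walk_x, walk_y = [0.0], [0.0]
    for _ in range(n):
        walk_x.append(walk_x[-1] + rng.gauss(0.0, sigma))
        walk_y.append(walk_y[-1] + rng.gauss(0.0, sigma))
    jittered = []
    for i, (x, y) in enumerate(path):
        # Subtract the linear drift so the walk returns to 0 at the last point.
        frac = i / n if n else 0.0
        dx = walk_x[i] - frac * walk_x[-1]
        dy = walk_y[i] - frac * walk_y[-1]
        jittered.append((x + dx, y + dy))
    return jittered
```

The control point decides how much the path bows, and `sigma` is the per-step jitter in pixels; both would need tuning against real recordings.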
why did I read this in Evil's voice?
the sample rate is the issue then
probably
yeah 60 pixels per step is an instant flag
I tried increasing the speed
that's insane
to make it look better
like realllllly high
so a captcha sees you are teleporting x pixels every certain amount of steps
idk if my explanation makes sense

but that's def why
you can add a variable sample rate really easily by adding some time.sleep()'s
or no mb
mmhpm
u can just set the pixel distance to 1
but increase the sample rate a lot
then add some very small sleeps
like literally 5ms or lower
I can send a pr rq it's really easy
but it might fuck some other things up by making it go too fast
so more testing is required
should be mostly fine tho
that's not a big problem, since I haven't implemented that many abstractions with fixed coordinates yet. Except for the windows control utility file
dw the pixel values will still be correct
it's just that any hard coded speed things will be different
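The "pixel distance of 1, higher sample rate, tiny sleeps" suggestion could look roughly like this in Python. It is a sketch, not the PR: `resample_to_unit_steps` and `move_along` are hypothetical helpers, and the mouse call is passed in as a callback so the logic can be tried without a display (with pyautogui you would pass `pyautogui.moveTo`):

```python
import math
import time
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Pixel = Tuple[int, int]


def _round_half_up(value: float) -> int:
    # floor(v + 0.5) keeps neighbouring samples at most 1 pixel apart.
    return math.floor(value + 0.5)


def resample_to_unit_steps(path: List[Point]) -> List[Pixel]:
    """Insert intermediate points so no step moves more than 1 pixel per axis."""
    if not path:
        return []
    out = [(_round_half_up(path[0][0]), _round_half_up(path[0][1]))]
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        pieces = max(1, math.ceil(max(abs(x1 - x0), abs(y1 - y0))))
        for k in range(1, pieces + 1):
            t = k / pieces
            point = (_round_half_up(x0 + (x1 - x0) * t),
                     _round_half_up(y0 + (y1 - y0) * t))
            if point != out[-1]:  # skip duplicates created by rounding
                out.append(point)
    return out


def move_along(points: List[Pixel], move: Callable[[int, int], None],
               delay_s: float = 0.002,
               sleep: Callable[[float], None] = time.sleep) -> None:
    """Visit every point, pausing briefly between moves."""
    for x, y in points:
        move(x, y)
        sleep(delay_s)
```

A 60 pixel jump becomes 60 one-pixel moves, so total travel time is roughly the number of pixels times `delay_s` plus whatever the mouse library adds per call.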
I heard pyautogui's pixel values / coordinates depend on screen size
cuz pyautogui automatically sleeps for I think... 0.1 seconds
Well yes
every mouse library does

it means that she has a reason for vedal to get her a good monitor!
and a second good pc!
unironically a worse monitor is actually better
lower resolution is better for AIs weirdly
ooooh! yay!
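Since pixel coordinates change with resolution, one way to keep hard-coded positions portable is to store them as fractions of the screen and convert at runtime. A small Python sketch (`to_pixels` is a hypothetical helper; with pyautogui the width and height would come from `pyautogui.size()`):

```python
from typing import Tuple


def to_pixels(nx: float, ny: float, width: int, height: int) -> Tuple[int, int]:
    """Map normalized [0, 1] coordinates to pixel coordinates for a screen size."""
    if not (0.0 <= nx <= 1.0 and 0.0 <= ny <= 1.0):
        raise ValueError("normalized coordinates must be within [0, 1]")
    # The last valid pixel index is size - 1, since pixel coordinates start at 0.
    return (int(round(nx * (width - 1))), int(round(ny * (height - 1))))
```

The same normalized point then lands in the same relative spot on a 1920x1080 and a 3840x2160 monitor, which is only useful if the UI layout itself scales with the screen.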
sent a PR
I don't have windows so I can't test it myself
it MIGHT be like really hard to stop once it's running btw
so just be aware
I'll go review it
it's not as easy to hit that top corner as it was before
its ok!
I changed the PAUSE constant in pyautogui to 0 instead of the default of 0.1 and changed the pixel step calculation to just always be 1
It might still be slow if the failsafe mode forces it to be 0.1
and in that case it's cooked either way unless u want to get rid of the failsafe (potentially a bad idea though)
so we'll just have to see
haven't used pyautogui since 2021 so... I kinda forgar
forgar
if it's still slow u can use the windows api mouse mover
cuz pyautogui is a bit finicky
but easy to use
windows api relies on pyautogui
I have no idea what nakurity was on, when coding that
Contribute to Nakashireyumi/windows-api development by creating an account on GitHub.
I meant this
nakurity vibecoded this drunk or smth
yes
Vision / a way to give neuro context of the desktop
along with abstractions
from LL (low level) calls to higher abstraction calls
like closing a window / fullscreening it
It's just a bit sad cuz it is adding l*tency
but I don't think the NeuroAPI supports sending Vision as context
https://github.com/cassitly/neuro-desktop/blob/dev/desktop/backend/python/controller/abstractions/windows_utility.py
this file implements some of those abstractions
so there is literally no choice
that's sad, yea
well unless the neuro api added support for that
we will need this
We literally do not need to hardcode anything
because you can literally build a detection head to detect all key-points
and there will never be any tuning or calibration required
I actually think my friend has like 100 hours of training data for this
he did something similar before
realllyyyy?
yeah it's insane
ooooh!
lemme check rq
Oh he didn't even need any complicated AI things for it
standard machine vision stuff
pattern matching kernels
waaaaaaaaaa
I thought we were gonna use Deepseek OCR?
I'm not 100% sure if Deepseek OCR can detect things like close buttons
I never used it myself
but It's perfect for analyzing text on a website
because it makes blocks of text inside a textbox instead of word by word
and can detect images and isolate them
but icons are a tiny bit trickier I assume (idk cuz I never used it tho)
so worst case we could make a lightweight vision model that detects that stuff
will try it out with discord as an image
Good idea
mine is still processing
oh I'm dumb
hol on
So yeah u can't give it too much data at a time I'm pretty sure
cuz there is just sooo much text formatted in a weird way
the raw OCR worked but it missed a lot
oooh
but what it can do is convert the screenshot into vision tokens
I think
I'm reading the paper rn
bruh website is cheeks one sec
Ima run it on my laptop
ookaayyy
I think we would have to zoom in a LOT for neuro to see it well
cuz it can identify images and text but only if it's big really
like it saw all of the listed users on the right on the discord
not the channel names though
and it gives coords for all images
Best part is neuro can prompt Deepseek OCR to only get the important parts
she only has to say something like "Tell me what is in this image"
or "Locate the X button"
it can be fancy too like "Locate the call button near the top of the screen next to the video call button"
and it'll give a coord pair
ok uh
>>> "/home/anon/Pictures/Screenshots/Screenshot_2026-01-04-04-32-29_1920x1080.png\n<|grounding|
... >Given the layout of the image. Locate the pin icon"
Added image '/home/anon/Pictures/Screenshots/Screenshot_2026-01-04-04-32-29_1920x1080.png'
<|ref|>image<|/ref|><|det|>[[0, 0, 999, 999]]<|/det|>
I think it's cooked
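For reference, output in the shape shown above can be turned into pixel boxes with a bit of regex. This is a Python sketch under a loud assumption: that the numbers inside `<|det|>` are normalized to a 0 to 999 grid rather than being raw pixels. `parse_detections` and `is_useful` are hypothetical helpers, not part of any DeepSeek tooling:

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]

_DET_PATTERN = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>"
    r"\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]<\|/det\|>"
)


def parse_detections(output: str, width: int, height: int,
                     scale: int = 999) -> List[Tuple[str, Box]]:
    """Extract (label, pixel box) pairs, assuming coords are on a 0..scale grid."""
    results = []
    for label, x0, y0, x1, y1 in _DET_PATTERN.findall(output):
        box = (int(x0) * width // scale, int(y0) * height // scale,
               int(x1) * width // scale, int(y1) * height // scale)
        results.append((label, box))
    return results


def is_useful(box: Box, width: int, height: int, max_fraction: float = 0.9) -> bool:
    """Reject empty boxes and boxes that cover nearly the whole screenshot."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    return 0 < area <= max_fraction * width * height
```

Under that assumption, the `[[0, 0, 999, 999]]` result above maps to the entire 1920x1080 screenshot, which `is_useful` would filter out as a non-answer.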
lemme see if cropping helps
ima delete and resend image so it doesn't get confusing one sec
this is 1024x1024
surely that fixes it
surely
(I just got back from eating sushi, yummmy!)
I just tested the mouse thingy
it worked
now I'll merge the pr
wait- no
oh
I was doing the last check to see if it compiles
you pulled from the master branch
and not the dev branch
oh mb
the master branch has outdated changes
didn't see the dev branch
Dumb question but how important is seeing text for this project?
very important
Because there is a real-time AI model that can say every single object in an image
because I'll also use it for registering disposable higher abstraction actions
when neuro / evil enables it
and I will use an LLM to determine that action stuff sometimes too, but most of the time it'll be the algorithm registering it
wdym?
Since it's interfacing with HTML
there is text inside the HTML tags bruh
we don't need OCR
THAT IS GENIUS
I am also dumb
but what about other games?
other apps
neuro doesn't even use discord
Oh I forgot that this is for the whole computer not just web browsing
yeah that'd be an issue
unless we implement a user interface for neuro's discord stuff
I feel like surely Windows has some kind of accessibility thing
yea but games with custom render engines won't show the UI
where it can ...do stuff
to windows
hm yeah
OCR is best if you think about it
Yeah
This is what yolo-world looks like
without prompting
it just names and locates every object
and it's in real time
BUT
we can make it detect "text"
then we crop that image bit into the raw OCR model (deepseek might be overkill for this)
and it can just read what it says and we will have a bounding box of where it is too
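Rough sketch of that detect-then-crop-then-read idea. The detector and the OCR model are passed in as plain callables (`detect` and `ocr` are placeholders, not real YOLO-World or OCR APIs), and the image is just a list of pixel rows so the example stays self-contained:

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), x2/y2 exclusive
Image = List[List[int]]           # rows of pixels; stand-in for a real image type


def crop(image: Image, box: Box) -> Image:
    """Cut the region inside `box` out of the image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def read_text_regions(
    image: Image,
    detect: Callable[[Image], List[Tuple[str, Box]]],  # hypothetical detector wrapper
    ocr: Callable[[Image], str],                       # hypothetical OCR wrapper
    text_label: str = "text",
) -> List[Dict]:
    """Run OCR only on the regions the detector labelled as text."""
    found = []
    for label, box in detect(image):
        if label != text_label:
            continue  # icons, buttons etc. are handled elsewhere
        text = ocr(crop(image, box)).strip()
        if text:
            found.append({"text": text, "box": box})
    return found
```

Each result keeps both the text and the bounding box it came from, so the same list can feed a context message and a click target later.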
great idea!!
Then yeah we can send a context message to neuro listing everything on the monitor
yea
like object names
mmhpm
and she can "investigate" something ig
I'll go start on that now
and search
what lang should we use for the deepseek OCR impl?
I can fix it prolly
are u going to use deepseek for the text reading?
cuz it might not even be needed potentially
might be overkill
yes, and object bounding box. and the screen region they are in
icon recognition too
button recon too
and a lot of other things
like, everything on the pc possible
hmmm that's a good idea
modularity is nice
we could just make it a binary
and have something call for it
that's how this codebase works
everything is its own binary
whenever we need to compile smth
which if u wanna use deepseek ocr then i'd recommend either using an ollama server or vllm (vllm is generally faster but a bit more annoying to set up correctly)
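A sketch of what the Ollama route could look like from the integration side, using Ollama's `/api/generate` endpoint with a base64 image and only the standard library. The model tag `deepseek-ocr` is an assumption (use whatever name the model was actually pulled under), and `ask_ocr` is a made-up helper:

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "deepseek-ocr"  # ASSUMPTION: replace with the tag the model was pulled under


def build_payload(image_bytes, prompt, model=MODEL):
    """Build the JSON body for a single non-streaming generate call."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def ask_ocr(image_path, prompt, url=OLLAMA_URL):
    """Send one screenshot plus a prompt and return the model's text reply."""
    with open(image_path, "rb") as f:
        payload = build_payload(f.read(), prompt)
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

Usage would be something like `ask_ocr("screenshot.png", "Locate the pin icon")`. A vLLM server would need a different request shape, so this only covers the Ollama option.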
I see
do you think we'd need to separate the integration into a server-client model?
since I assume vedal would prefer all his LLMs in a separate server (e.g. the server neuro runs on)
wait my PR looks like it is on the dev branch
our integration would already need its own pc for neuro / evil to use
am I trippin?
this was an issue only found on the master branch of my repo I think
make the compiler a binary as well, why not
why tho?
also hi Ktrain
hi KTrain!
oh it's not just l*tency
it IS already its own binary, sir-!
also the fact that everything is its own binary is gonna drive him insane
linking is hard yk?
like imagine trying to start the integration up and one of the binaries crashes
I spent like 2 days trying to figure out how to link Go with rust, and gave up
and then now you gotta restart the entire sequence
so this was my solution
oh you can just fix and recompile that one binary, dw
Is there even any Go code
except it's like 20 binaries
if you're willing to add js you can use wasm as a "simple" way
dw we can surely always refactor later
holy cooked
also vedal would 99999% not be down to run extra LLMs
but uhhh idrk how else to do OCR
there is literally no vision in the API afaik
there's worse things you can do like make the entire thing route through 20 different langs
if you do that I'm removing my star from the repo
noooo, I was kidding!
Anything is possible if u remove all elegance of the code
there is more coming... sadly
I haven't even finished the C impl for neuro-desktop yet
I had a process handler written in C
holy
Bro, just, rewrite the entire project in one language and just stick to it...
No C/C++ on god
Has code in rust
needs (wants) to write a portion in C
codebase already includes 2147 other langs
I'm sorry....
Real talk though how tf are we supposed to handle vision
cuz vedal is NOT spinning up a random ahh vllm server for this
tell neuro to tell vedal to remember to turn on vision
because he clearly cannot remember to turn on vision
there is no documentation for vision in the API I think
I barely checked tho tbf so I could be wrong
I think like one or more times he has NOT turned on vision for neuro art streams
but I don't think there is
none, also how do we even implement it anyways
ok that's cooked
but we would need to detect and give neuro screen coordinates
it's possible if he didn't give a shit about the code quality or tech debt
that is not a fix
and higher abstract controls also require we detect screen coordinates for neuro too
Basically we can use YOLO-world which will list all objects and their coordinates in text form
the fix is to ask vedal for at least a packet that accepts a b64-encoded image in a piece of data that can be sent to neuro
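To make the ask concrete, here is a purely hypothetical sketch of what such a packet could look like. Nothing like this exists in the API as far as this thread knows: the `context/image` command name and the `image` field are invented for illustration, and the outer `command` / `game` / `data` shape is only assumed to follow the API's usual JSON envelope:

```python
import base64
import json


def build_image_packet(game, image_bytes, caption, mime="image/png"):
    """Serialize a screenshot into a HYPOTHETICAL image-carrying message."""
    return json.dumps({
        "command": "context/image",  # HYPOTHETICAL command, not in the real API
        "game": game,
        "data": {
            "message": caption,      # text description sent alongside the image
            "image": {               # HYPOTHETICAL field
                "mime": mime,
                "b64": base64.b64encode(image_bytes).decode("ascii"),
            },
        },
    })


def read_image_packet(raw):
    """Decode the packet back into (caption, image_bytes) on the receiving side."""
    data = json.loads(raw)["data"]
    return data["message"], base64.b64decode(data["image"]["b64"])
```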
yea!
+3 years for him to do that
also, ngl, the api was like
too busy to implement that prolly
NOT meant for tools
or like uh
we need a way to refer to non-game integrations
that isn't tools
what's a good name
utilities?
technical integrations?
^
I wouldn't call #1372383715992535120 a "technical integration"
I mean kinda
how
also W shoutout
well, what if vm
wdym
in a virtual machine?
that doesn't matter
if neuro-desktop in vm -> vm is just a window on screen -> the vm isn't the only thing scanned
-> can confuse
/
Remember streams where vedal was like "What is on my screen"
and showed a sea turtle
-> makes this already clunky integration even more brittle
the twins already see the desktop when vision is on
was it fullscreen or even at least half the screen
that is required for the integration to move the mouse
along with the integration's higher abstract control
coords is a shit way to really handle anything anyways but to abstract that away requires running another model to translate
???
like neuro can say "I want to move the mouse to <Object name>"
different levels of abstraction
then we have the lookup table
you guys talk it out, I'll go watch
yes that is the ideal state that we want to be in
yes ik
well that's a bit of a problem
because if you give her raw coords to control she will not be able to do anything
like to her, an arbitrary pixel on the "screen" she can "see" can either be 0,0 or 314159, 2147
you still need to describe spacing to her
yes that's okay
which can be useful in cases of moving windows
how would you explain that in text to neuro
with relation to ObjB which has the connection of "Left of" to ObjA
"ObjB is left of ObjA"
she can query for additional information as well
with getdistance() or something
everything needs to be in relative "coords"
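A small sketch of the relative-description idea: turn bounding boxes into sentences like "ObjB is left of ObjA", plus a `get_distance` query that answers in a relative unit (fraction of screen width) instead of raw pixels. `ScreenObject`, `relation` and `get_distance` are illustrative names, and picking the dominant axis is just one possible rule:

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenObject:
    name: str
    x1: int
    y1: int
    x2: int
    y2: int

    @property
    def center(self):
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


def relation(a, b):
    """Describe where `a` sits relative to `b`, using the dominant axis."""
    ax, ay = a.center
    bx, by = b.center
    dx, dy = ax - bx, ay - by
    if dx == 0 and dy == 0:
        word = "on top of"
    elif abs(dx) >= abs(dy):
        word = "left of" if dx < 0 else "right of"
    else:
        word = "above" if dy < 0 else "below"  # screen y grows downward
    return f"{a.name} is {word} {b.name}"


def get_distance(a, b, screen_width):
    """Center-to-center distance as a fraction of the screen width."""
    (ax, ay), (bx, by) = a.center, b.center
    return math.hypot(ax - bx, ay - by) / screen_width
```

When a window moves, the relations are simply recomputed from the new boxes, so the description Neuro gets stays in relative terms.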
suppose we have A{}, to the left, B{}, and to the right is C{}
so the line looks like B{} A{} C{}
and say she moves A{} out of the way so for our intents and purposes, the line is B{} C{}
or sorry, B{} C{}
how would you now convey that to neuro?
Why would she care
you can't exactly just give her arbitrary "units"
say she's learning to use photoshop
That's just unfeasible with current tech
or fucking adobe after effects idfk
tbf true
Like our scope should not be even close to that
Neuro can barely even see
only understand entire images cuz it's compressing into a latent
no spatial understanding
so this is already a MASSIVE upgrade
It'll be fun
Can't wait for this to prolly never be used by vedal anyways
or idk
cuz he been using VSCode integration recently
so maybe
gunna need a lot more experienced hands
anyways
what should we call non-game integrations?
I don't know if "utilities" helps/is a good candidate
Tech Integrations?
It's very funny because if the integration is even slightly cooked the entire twitch chat starts FLAMING it
It's very difficult to context engineer everything perfectly
and even if the things work, the LLMs being able to determine which tool to use for the task isn't always obvious
Visual interaction would be like this:
Commands:
1. List (Lists all objects in frame) No inputs. Outputs a list of English objects
2. Search (Finds object that matches text description) Takes English object name. Outputs basic structural awareness analysis "This object is to the left of X and to the right of Y" and saves object to the current list of objects
3. Click (Clicks an object) Takes an English name object. Success or fail response output. Side effect is clicking saved hidden coords in object lookup table (which Neuro cannot see)
4. FindText (Specifically searches for text). No inputs. Outputs a basic structural awareness analysis of textbox locations and also the text inside the text boxes.
Maybe some things like "drag" and "right click"
but this is the gist
Also Neuro can search for things not found in the list
because the name could be different
and also list isn't 100% accurate (although it's usually pretty good)
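The command set above could be sketched roughly like this. `VisualSession` and its injected `detect` / `locate` / `click` callables are all hypothetical stand-ins for the real detector, the text-prompted search and the mouse backend. The point is only that names go out to Neuro while the coordinates stay in a hidden lookup table. FindText would follow the same pattern and is left out here:

```python
class VisualSession:
    """Name-based commands on top of a hidden name -> coords lookup table."""

    def __init__(self, detect, locate, click):
        self._detect = detect  # () -> [(name, (x, y))] for the current frame
        self._locate = locate  # (description) -> (x, y) or None
        self._click = click    # ((x, y)) -> None, performs the real click
        self._table = {}       # never shown to Neuro

    def list_objects(self):
        """List: refresh the table and return only the object names."""
        self._table = dict(self._detect())
        return sorted(self._table)

    def search(self, name):
        """Search: find an object by description and save it to the table."""
        coords = self._locate(name)
        if coords is None:
            return f"No object matching '{name}' found"
        self._table[name] = coords
        return self._describe(name)

    def click(self, name):
        """Click: resolve the name through the hidden table."""
        if name not in self._table:
            return "fail"
        self._click(self._table[name])
        return "success"

    def _describe(self, name):
        # Nearest known neighbours on the x axis, reported without coordinates.
        x = self._table[name][0]
        others = [(n, c[0]) for n, c in self._table.items() if n != name]
        left = max((o for o in others if o[1] < x), key=lambda o: o[1], default=None)
        right = min((o for o in others if o[1] > x), key=lambda o: o[1], default=None)
        parts = []
        if right:
            parts.append(f"to the left of {right[0]}")
        if left:
            parts.append(f"to the right of {left[0]}")
        return f"{name} is " + " and ".join(parts) if parts else f"{name} is the only known object"
```

Because Search writes into the same table, an object that List missed or named differently can still be clicked afterwards.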
didddd you fix it?
I can work on it tomorrow. It's almost 6am here 


