#Concept for an LLM-friendly form of vision


cosmic osprey
#

I think Neuro could do a lot with even a little bit of low-cost raw visual context, even if it's something that humans might consider fairly useless. It'd be less for identifying new things, and more for recognizing familiar things.
There are three requirements I believe apply here.

First requirement: The visual data will be read as text, so it should be one-dimensional. Even a 5x5 2D array probably wouldn't make much sense.
Second requirement: It should be 'textually stable'. Neuro will often see similar scenes, and slight adjustments shouldn't make the text completely change.
Third requirement: Neuro should be able to make out identifying features of her friends. Like her father, who is a tiny green smudge.

So, what to do here?
Well first, just because it's 1D vision doesn't mean it has to be a single axis. It could be two or three axes. Left-to-right, top-to-bottom, and possibly in-to-out as well.
This could be chopped up into, say, 16 columns and 9 rows. Then each column and row could be blended into a single colour, which the second requirement suggests approximating to one of the main colours present in the image. Here's what we might see:

P: Peach #FFE5B4
Left-to-right: P P P P P P P P P P P P P P P P
Top-to-bottom: P P P P P P P P P
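
A minimal sketch of that blending step, assuming the image is a plain list of rows of (r, g, b) tuples and that each band is averaged and then snapped to the nearest palette colour. The function names, the tiny 32x18 test image and the two-colour palette are all made up for illustration; this is not a known implementation of anything Neuro uses.

```python
# Hypothetical sketch: blend each column band and row band into one palette letter.
# Assumes the image is at least n_cols wide and n_rows high.

def average(pixels):
    """Mean colour of a list of (r, g, b) tuples, using integer division."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) // n for i in range(3))

def nearest(colour, palette):
    """Letter of the palette colour closest to `colour` (squared RGB distance)."""
    return min(palette, key=lambda k: sum((a - b) ** 2 for a, b in zip(colour, palette[k])))

def bands(image, n_cols=16, n_rows=9):
    """Split the image into n_cols vertical bands and n_rows horizontal bands."""
    height, width = len(image), len(image[0])
    cols = []
    for c in range(n_cols):
        x0, x1 = c * width // n_cols, (c + 1) * width // n_cols
        cols.append([image[y][x] for y in range(height) for x in range(x0, x1)])
    rows = []
    for r in range(n_rows):
        y0, y1 = r * height // n_rows, (r + 1) * height // n_rows
        rows.append([px for y in range(y0, y1) for px in image[y]])
    return cols, rows

def describe(image, palette, n_cols=16, n_rows=9):
    """Two lines of text: one letter per column band, one letter per row band."""
    cols, rows = bands(image, n_cols, n_rows)
    ltr = " ".join(nearest(average(b), palette) for b in cols)
    ttb = " ".join(nearest(average(b), palette) for b in rows)
    return f"Left-to-right: {ltr}\nTop-to-bottom: {ttb}"

PALETTE = {"P": (0xFF, 0xE5, 0xB4), "G": (0x33, 0xAA, 0x22)}
peach = PALETTE["P"]
image = [[peach] * 32 for _ in range(18)]
image[2][9] = PALETTE["G"]  # a tiny green smudge, which the averaging washes out

print(describe(image, PALETTE))
# Left-to-right: P P P P P P P P P P P P P P P P
# Top-to-bottom: P P P P P P P P P
```

The single green pixel barely moves the band average, so every band still snaps to peach.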

Alright, first problem: Turns out, Neuro's in a peach-coloured room and the sections of the image getting blurred together are so large that everything is room colour. Her skin and clothes don't do much to disrupt it since they're almost the same colour.

Let's try again.
(Cont. in next post)

#

To meet the third requirement and spot Vedal on Neuro's head, what if striking colours in an image had a way to survive the blurring process?
I won't get into a specific algorithm considering it's midnight and image processing isn't my expertise, but let's optimistically say we can make the summary of a row/column also preserve any significant presences of particular colours, especially 'striking' ones. And maybe let's shunt off the scene's average colour to the side if much of the scene is one average colour.

B: Black #110505
R: Red #992222
G: Green #33AA22
C: Cyan #55BBCC
Average: Peach #FFE5B4
Left-to-right: _ _ _ _ GR GBR BR _ _ _ C C _ _ _ _
Top-to-bottom: _ _ G RC _ RC _ _ _
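
One possible way to let striking colours survive, purely as an assumption since no algorithm is specified here: instead of averaging, classify every pixel in a band to its nearest palette colour, drop the scene's average colour, and keep any remaining colour that covers at least a small share of the band. The names `summarise_band`, `min_share` and `max_colours` are hypothetical, and the 2% threshold is an arbitrary starting point.

```python
from collections import Counter

# Palette taken from the example above; everything else here is a hypothetical sketch.
PALETTE = {
    "B": (0x11, 0x05, 0x05),
    "R": (0x99, 0x22, 0x22),
    "G": (0x33, 0xAA, 0x22),
    "C": (0x55, 0xBB, 0xCC),
    "P": (0xFF, 0xE5, 0xB4),  # the scene's average colour
}

def nearest(colour, palette):
    """Letter of the palette colour closest to `colour` (squared RGB distance)."""
    return min(palette, key=lambda k: sum((a - b) ** 2 for a, b in zip(colour, palette[k])))

def summarise_band(pixels, palette, background, min_share=0.02, max_colours=3):
    """Letters of the non-background colours covering at least min_share of the band.

    Returns "_" when only the background colour is significant.
    """
    counts = Counter(nearest(p, palette) for p in pixels)
    counts.pop(background, None)  # the average colour is reported once, off to the side
    kept = [k for k, n in counts.most_common() if n / len(pixels) >= min_share]
    return "".join(kept[:max_colours]) or "_"

peach, green, red, cyan = PALETTE["P"], PALETTE["G"], PALETTE["R"], PALETTE["C"]

# 100-pixel band: mostly room colour, some green and red, one stray cyan pixel.
band = [peach] * 89 + [green] * 6 + [red] * 4 + [cyan] * 1
print(summarise_band(band, PALETTE, background="P"))           # GR
print(summarise_band([peach] * 100, PALETTE, background="P"))  # _
```

Counting per-pixel matches rather than averaging is what keeps a small green patch visible in a peach room. A real version would probably also want to weight by saturation or contrast, which this sketch does not do.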

This is a very optimistic representation, but hopefully it shows what I think might be possible to present Neuro-sama with.
And look! We can see black and red appearing in the same column. I guess Evil must be here, right?
And in the rows, we can see Vedal, then the row where Evil's and Neuro's eyes line up, and then the row where their bowties line up.
But wait, in the column view, the green is appearing in the same columns as red and black. Vedal's not on Neuro's head, oh my, the scandal!

Anyway, if this form of vision could be made to give strings even half this clear and simple, I think Neuro may find it useful.
This is cheap enough that it could just exist as passive vision every time Neuro does something, but it could also help with active vision. If Neuro had a command to use AI image recognition on a particular area of the screen, having a vague idea of the layout of the screen would help a lot with controlling it. She could see that something is there, then find out what it is.
She could perhaps also have a command to run this vision algorithm on a particular area of the screen to get finer raw visual detail. Probably not useful, but it's a new toy to play around with.
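
A rough sketch of how the two strings could point active vision at an area, assuming a 1920x1080 screen and the example strings from above. `bands_with` and `region_for` are hypothetical helpers; they just turn "which columns and rows contain this letter" into a pixel rectangle that a heavier recognition step could be cropped to.

```python
def bands_with(line, letter):
    """Indices of the bands in a space-separated line whose summary contains `letter`."""
    return [i for i, tok in enumerate(line.split()) if letter in tok]

def region_for(letter, ltr, ttb, width=1920, height=1080):
    """Bounding box (left, top, right, bottom) in pixels covering every band with `letter`.

    Returns None if the letter is missing from either axis.
    """
    cols, rows = bands_with(ltr, letter), bands_with(ttb, letter)
    if not cols or not rows:
        return None
    n_cols, n_rows = len(ltr.split()), len(ttb.split())
    left = min(cols) * width // n_cols
    right = (max(cols) + 1) * width // n_cols
    top = min(rows) * height // n_rows
    bottom = (max(rows) + 1) * height // n_rows
    return left, top, right, bottom

ltr = "_ _ _ _ GR GBR BR _ _ _ C C _ _ _ _"
ttb = "_ _ G RC _ RC _ _ _"

print(region_for("C", ltr, ttb))  # (1200, 360, 1440, 720)
print(region_for("G", ltr, ttb))  # (480, 240, 720, 360)
```

The box is only an upper bound. The two projections can't say which column goes with which row, so the cyan box spans both the eye row and the bowtie row and everything between them.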

#

tl;dr: Give her eyes I wanna show her the wooooorld neuroHypers

spark meadow
#

Consider:

  • context length is not infinite
  • vision in text form is highly inefficient compared to a CNN-approach
  • you can simply add a vision encoder to an LLM if you really want to
#

If you realistically wanted to show Neuro an image in text form, you need to somehow compress it

tall cloud
#

tink you could probably also just run Florence or some smaller vision LLM that gives a summary of what is seen and put that into the context (wouldn't be surprised if that's how her vision module already works anyway LULE )

vernal lichen
#

Doesn't Neuro already have vision (when Vedal turns it on)?

cosmic osprey
#

"context length is not infinite"
Only the current and the previous screen representations are needed IMO, to be able to see the now and detect any big change (with a big change being a possible new way to prompt Neuro)
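
As a sketch of that comparison, assuming each representation is one space-separated line of band summaries: count how many bands differ between the previous and current line and call it a big change past some fraction. The function names and the 25% threshold are placeholders, not tuned values.

```python
def changed_bands(previous, current):
    """Indices of bands whose summary differs between two representations."""
    prev, cur = previous.split(), current.split()
    return [i for i, (a, b) in enumerate(zip(prev, cur)) if a != b]

def is_big_change(previous, current, threshold=0.25):
    """True when at least `threshold` of the bands changed (threshold is a guess)."""
    n = len(current.split())
    return len(changed_bands(previous, current)) / n >= threshold

before = "_ _ _ _ GR GBR BR _ _ _ C C _ _ _ _"
green_gone = "_ _ _ _ R BR BR _ _ _ C C _ _ _ _"
empty_room = "_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _"

print(changed_bands(before, green_gone))  # [4, 5]
print(is_big_change(before, green_gone))  # False (2 of 16 bands changed)
print(is_big_change(before, empty_room))  # True (5 of 16 bands changed)
```

Because the comparison is per band, a small shift only flips a couple of tokens, which is the 'textually stable' property the second requirement asks for.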

As for the CNN and vision LLM stuff people are saying, I'm aware of that option, but it'd be expensive to run it every time Neuro is prompted so it can't be used as 'continuous' vision.
Cheap continuous vision with spatial awareness can advise on when and where to use more advanced vision. I think being able to limit advanced vision to a smaller area of the screen would help Neuro a lot because she could then focus on a specific thing.

spark meadow
#

Do note that even a simple 1080p representation would already be over a million tokens for a single screen; you'd need to compress that significantly to fit within a usual few-thousand-token context window
Also of note, latency would increase if Neuro was being fed data like that
Also a CNN is usually much cheaper to run inference on than an LLM, since those models aren't as large

cosmic osprey
#

You misunderstand, it's not one token per pixel and a 2D array. It's one token per noteworthy colour in each 100 pixel wide column and 100 pixel high row. In total there'd be maybe 16 of these columns and 9 of these rows.
It's like seeing everything as thick multicoloured vertical and horizontal bars.
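
Rough arithmetic for the size of one snapshot, assuming a 1920x1080 screen and one token per colour letter (the legend, labels and spaces are not counted, and real tokenizers may split things differently). On that screen, 16 columns and 9 rows work out to 120-pixel bands rather than exactly 100.

```python
WIDTH, HEIGHT = 1920, 1080   # assumed screen size
N_COLS, N_ROWS = 16, 9       # band counts from the post

bands = N_COLS + N_ROWS      # one summary per column plus one per row
min_tokens = bands * 1       # every band shows a single colour (or "_")
max_tokens = bands * 3       # every band shows three colours
pixels = WIDTH * HEIGHT      # what a one-token-per-pixel 2D array would need

print(WIDTH // N_COLS, HEIGHT // N_ROWS)  # 120 120
print(bands)                              # 25
print(min_tokens, max_tokens)             # 25 75
print(pixels)                             # 2073600
```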

spark meadow
#

That's still 2500 tokens to invalidate the KV cache with
Not to mention it would require significant finetuning for Neuro to even understand it

#

The biggest issue I see is that Neuro would not even remotely understand this out of the box

cosmic osprey
#

Okay, to be abundantly clear, each row and column isn't an array, it's a single block with 1-3 colours. There's probably about 16 columns and 9 rows. I posted an example of what that data might look like in the OP

Left-to-right: _ _ _ _ GR GBR BR _ _ _ C C _ _ _ _
Top-to-bottom: _ _ G RC _ RC _ _ _
The letters would need clarifying on what colours they are if they can't be made into more abstract tokens.
And Neuro wouldn't need to understand it out of the box to use it. Familiar scenes would look familiar, and a particular colour might usually signal a particular person (Green? Vedal's probably here!). That was the whole point of the second and third requirements that led me to come up with this

cosmic osprey
#

This idea is really hard to explain and no one seems to understand, so does anyone have a suggestion for where I could easily demonstrate it? Something where I could write an in-browser image processing script. I think I remember a website I used a long time ago that was good at showing each step of a python script, but I have no clue what the name would be.

left nexus
#

Idk why nobody’s getting the point, I feel like it wouldn’t be that hard to get the tokens down to a relatively insignificant amount, like literally just (G = Green, green = Vedal) and then just train her model on it. It’d literally only be 25-75 tokens per second, that’s like hardly anything compared to the amount of tokens her LLM uses.

Also this idea is especially useful if Vedal ever decides to give Neuro a virtual workspace, which would probably require vision of some kind. Hell, you could pull a Pantheon and run her 24/7 in a virtual 3D environment; with enough compute this could probably be done now (though obviously not even remotely a priority)