I did some testing with ChatGPT using ascii art instead of images and it seemed to work a lot better than I was expecting. It’s able to tell where things are in the text image and what’s being depicted, even if I made it up entirely myself (I was using a poor ascii generator for this so it could only recognize simple shapes but it also worked fine when I typed it out) (also I didn’t have access to the image processing so unless I found a loophole, I’m pretty sure it isn’t just using that)
This was just the website so it can certainly perform better with fine-tuning
feeding small images (maybe 10x10 or 10x7 idk how many tokens is too much) directly into the llm (replacing the old ones every time so it doesn’t fill up context) instead of or alongside a text description could give spacial information that would be lost otherwise. different things could be assigned different characters by the raycast or even the image recognition though that would require more processing time. allowing neuro to swap between different modes like depth, material, and color would help for different scenarios.