I'm using the Vision API described in this article (https://platform.openai.com/docs/guides/vision), I ask to tell me the coordiantes where the Google Chrome icon is located using the grid I layed on the screenshot, the problem I get is that the results are wrong every single time. Is this done in purpose? I also tried using ChatGPT 4 and gets it wrong all the time, is there any way to increase accuracy?
#Vision API not very accurate.
1 messages · Page 1 of 1 (latest)
The docs state:
Spatial reasoning: The model struggles with tasks requiring precise spatial localization, such as identifying chess positions.
that’s good point, I thought it would be better at spatial reasoning
Yeah, spatial reasoning isn't great with GPT-V but it is generally still somewhat decent, but not with such fine granularity. Try making your grid larger (each individual box larger) and it should get more and more accurate, and you should be able to find a threshold where it is confidently accurate
your suggestion gave me the idea of putting the reference inside each quadrant and GPT-V is now much more accurate, it gets it most of the time! (see screenshot). I'll keep tweaking and see what I get. Thanks for your help
Also you can use some n-shot prompting, put in some examples how to manage this task, and maybe it will learn in it's context how to answer correctly!
It should make the impossible task to make it possible apparently
For example 5 shot will be great
with some diverse examples
of inputs and outputs
And it will get even better
what is n-shot prompting? any documentation where I can read about it?
Great! Glad to see it helped a bit
N-shot prompting refers to providing an AI model with multiple examples (n examples) to guide its responses. It's a technique used in machine learning to help models understand the context or the type of task they are supposed to perform
Just inject some user and assistant messages that will be the examples.
thanks, I googled and found more info about n-shot option although it didn't make much difference. what it did work is to make two attempts (a bit expensive though). First attempt it gets close enough but the second attempt is gets it almost all the time, see the video
is there a particular reason you are using vision to interact with desktop or is it just for learning?
I would suggest using power automate flows in conjuction with gpt if you wannt interact with stuff using AI
probs for learning and seeing vision granularity
aah
It can still help a lot with answering.
You can add more examples, and make it more diverse
and maybe high quality
Also you can use seed and lower temperature for this task
you don't need creativity for referencing a grid
Make high quality, diverse, useful examples but not too much or your gpt-4 bill will increase
Idk if the search criteria str is generated or is just hardcoded str but if it is generated by model you can bump on creativity if you want
but referencing grid can be more deterministic
yes purely experimental but what do you mean with power automate flow?
thanks! I'll play with the parameters and see the results
And other sampling parameters check maybe useful with some research
It is maybe that tool from microsoft to make automation, but this isn't related to python env
Also you can use selenium maybe with some function calling to get better results maybe
And combine with vision to render website for AI so it sees the browser state
But if you want some stuff outside browser then idk use grid still
So in summary, if you want some gpt-4 powered browser script that uses GPT-4 with vision you can combine vision strength and use API with function calling to make actions and send the rendered page that browser uses.
With Microsoft power automate, you can create desktop flows to interact with your computer. Combine that with openAI API. You can pretty much do anything!
You can also create flows in the cloud too!
Pretty cool stuff
ah yes, I thought you mean some other product, yes I’m familiar with Power Automate. That would allow you to predefine specific tasks but won’t let for instance a GPT to define the tasks and execute it. The closer I found that can do it is open interpreter https://openinterpreter.com/
There was actually a research paper published last year that is somewhat adjacent to this,
arXiv.org
We present Set-of-Mark (SoM), a new visual prompting method, to unleash the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. As illustrated in Fig. 1 (right), we employ off-the-shelf interactive segmentation models, such as SEEM/SAM, to partition an image into regions at different levels of granularity, and overlay t...
Thanks for sharing! I didn't know about that one. I also found this today, https://arxiv.org/abs/2401.01614, they have some demos in python
GPT-4V(ision) is a Generalist Web Agent, if Grounded
I saw that one too, but not the demos. Thanks for the share back.
Last year when I first saw the Set-of-Mark prompting paper I toyed—very briefly with building a pipeline to pass an image through a SegmentAnything model to draw different coloured outlines around objects and appending a numeric label to each, before passing the augmented image on to GPT-4V, but then the end of the term, holidays, life, etc conspired to derail that.
Maybe I should revisit it sometime...