#Vision API not very accurate.

1 messages · Page 1 of 1 (latest)

sharp oasis
#

I'm using the Vision API described in this article (https://platform.openai.com/docs/guides/vision), I ask to tell me the coordiantes where the Google Chrome icon is located using the grid I layed on the screenshot, the problem I get is that the results are wrong every single time. Is this done in purpose? I also tried using ChatGPT 4 and gets it wrong all the time, is there any way to increase accuracy?

lucid salmon
sharp oasis
#

that’s good point, I thought it would be better at spatial reasoning

spiral fern
#

Yeah, spatial reasoning isn't great with GPT-V but it is generally still somewhat decent, but not with such fine granularity. Try making your grid larger (each individual box larger) and it should get more and more accurate, and you should be able to find a threshold where it is confidently accurate

sharp oasis
#

your suggestion gave me the idea of putting the reference inside each quadrant and GPT-V is now much more accurate, it gets it most of the time! (see screenshot). I'll keep tweaking and see what I get. Thanks for your help

pine root
#

It should make the impossible task to make it possible apparently

#

For example 5 shot will be great

#

with some diverse examples

#

of inputs and outputs

#

And it will get even better

sharp oasis
#

what is n-shot prompting? any documentation where I can read about it?

spiral fern
pine root
#

Just inject some user and assistant messages that will be the examples.

sharp oasis
#

thanks, I googled and found more info about n-shot option although it didn't make much difference. what it did work is to make two attempts (a bit expensive though). First attempt it gets close enough but the second attempt is gets it almost all the time, see the video

novel bear
#

is there a particular reason you are using vision to interact with desktop or is it just for learning?

I would suggest using power automate flows in conjuction with gpt if you wannt interact with stuff using AI

spiral fern
pine root
#

You can add more examples, and make it more diverse

#

and maybe high quality

pine root
#

you don't need creativity for referencing a grid

#

Make high quality, diverse, useful examples but not too much or your gpt-4 bill will increase

#

Idk if the search criteria str is generated or is just hardcoded str but if it is generated by model you can bump on creativity if you want

#

but referencing grid can be more deterministic

sharp oasis
sharp oasis
pine root
#

And other sampling parameters check maybe useful with some research

pine root
#

Also you can use selenium maybe with some function calling to get better results maybe

#

And combine with vision to render website for AI so it sees the browser state

#

But if you want some stuff outside browser then idk use grid still

#

So in summary, if you want some gpt-4 powered browser script that uses GPT-4 with vision you can combine vision strength and use API with function calling to make actions and send the rendered page that browser uses.

novel bear
#

Pretty cool stuff

sharp oasis
#

ah yes, I thought you mean some other product, yes I’m familiar with Power Automate. That would allow you to predefine specific tasks but won’t let for instance a GPT to define the tasks and execute it. The closer I found that can do it is open interpreter https://openinterpreter.com/

Open Interpreter is a free, open-source code interpreter.

shy fox
# sharp oasis your suggestion gave me the idea of putting the reference inside each quadrant a...

There was actually a research paper published last year that is somewhat adjacent to this,

https://arxiv.org/abs/2310.11441

sharp oasis
shy fox
#

I saw that one too, but not the demos. Thanks for the share back.

Last year when I first saw the Set-of-Mark prompting paper I toyed—very briefly with building a pipeline to pass an image through a SegmentAnything model to draw different coloured outlines around objects and appending a numeric label to each, before passing the augmented image on to GPT-4V, but then the end of the term, holidays, life, etc conspired to derail that.

Maybe I should revisit it sometime...