How can i optimize (vision+chat) captioning model for my needs? or GPT4o budget? | AI Programming And Chat | Page 1

Hey i am trying this without diving into the technics for a while.

I guess i wont be able to manage it because i will use those models for privately-commercial, so i need a lot of usage therefore gradio api and models doesnt allow me for that kinda use like they are rate-limiting it.

I am trying to caption the images then giving to LLM LLama which has a inference (serverless) api. Then promot engineer it But as you can guess it doesnt work well. I need a model for my special case.

I need to have a text input and image input, determine whats in the image or does this image fullfill a given task like is this a cat? But it is not that simple i need common-sense too. Is this image relevant to a given class for example animal. Like this.

I have mostly low-quality images and kinda similar images to each other. Can i train my own ai or fine-tune it for this case like i will give it 1000 images and 1000 captions, 1000 relevant image pairs and captions of them. Also i need to optimize it, i dont have a massive gpu but for an instance i kinda will send request like 9 or 10 in a minute or maybe if i fine-tune it well, 9 images in 1 request idk (gpt-4o can do it)?

What i ask you is how can i find the right models for those needs and how can i fine-tune them accordingly. Of course i dont want a spoon-feeding answer so dont angry at me. I will be pleased if you just guide me the steps and sources like a map. First you need to learn these then you need to do these.

By the way i want to ask that if anyone knows that can i optimize the budget or how can i optimize the budget in gpt4o just please let me know. Thanks all of you guys.

#How can i optimize (vision+chat) captioning model for my needs? or GPT4o budget?