I want to build a robust image retrieval system for my dataset of images. I'm using a Qdrant cluster to store my image embeddings (from a fine-tuned DINOv2). My dataset has a lot of intricate artwork, abstract and cubist. Using my embeddings I can run a similarity search for a query image and retrieve the top 5 similar images. The retrieved images aren't good enough, so they still need reranking, and I'm finding it difficult to rerank them. Since text is easier to rerank, I generated captions for my dataset using the Google Gemini model. I'm not really happy with the captions, even after refining the prompt: since these are abstract drawings and colours, it got a lot of the detail wrong. Can someone give me some direction on how to proceed with reranking the images, using either text or images?
#Image reranking
@tropic dawn
Maybe create a small dataset of ranked images and fine-tune the model on it
But images need to be ranked based on a query image, right? When the query image is dynamic, how can we make a dataset?
My task is to do image retrieval: I have 500 index images (these are HD DSLR pics) and there are query images which have instances of the index images in them. My goal is to pick the right index image given a query image. That's why I fine-tuned DINO to generate embeddings. Now I need to focus on reranking the retrieved images
you make a dataset by taking reference images at random and manually ranking other images chosen at random against it I would say
You don't necessarily have to rerank based on content. If your application has a natural signal, such as clicks, you can just rerank based on the probability of being clicked on. Basically, you can concatenate user embedding + image embedding and pass that to a small model that predicts the probability of being clicked on. See more on Learning to Rank: https://towardsdatascience.com/learning-to-rank-a-complete-guide-to-ranking-using-machine-learning-4c9688d370d4
If the ranking is subjective, I would just annotate, for a given query, what was clicked on, and train a model that learns, given a query and a list of images, which one is most likely to be clicked on
Hey Ian!! Thank you so much for replying! So mine isn't really a text-image search like Google Images. Mine is an image-image search. I'm given a query image and based on its content I need to pick the correct index image. The query is basically an instance of one of the index images. The last time we spoke you gave me the idea of captioning the images and working with the text embeddings instead, and you also told me how important reranking is. But I'm having trouble with that. The captions generated after prompting aren't working well. I tried the Gemini model for captioning (it takes 3-4 secs to generate a caption for an image), and the captions aren't that accurate. The worst part is when it returns no result, because Gemini is not able to describe the image. I'm stuck with this issue.
So there is no click in my application. Rather just a query image taken by a user which must contain an instance of any index image.
Hey, this is a really nice idea. And I was actually trying that out last night. BUT.. later on I will have to register more images. Right now my dataset is small, but when I add more images to the mix, it makes no sense to keep ranking them manually, right?
Just pondered over your message again. No image has a higher chance of being clicked on. This application doesn't involve clicking. It's more like scanning an image and querying it. The query image is random and could be an instance of ANY index image or of NONE.
The more you have ground truth data about the ranking preference you want to have, the more the model will be able to generalize this ranking to other images. In deep learning, more is better.
see that's the thing. I don't have a lot of ground truth data right now. And I'm sure while registering there will be max only 100 different images.
And since my images are artwork, even I'm not sure how similar or dissimilar some works are.
If you don't know if the images are supposed to be similar or not, you should probably not expect the AI to do a better job
I mean, I have a vague idea. But the retriever did a good job finding similar images and that's all AI as well. Why can't reranking be done the same way?
what is your ground truth reference exactly?
the HD DSLR Artwork images
and it gives you good rankings?
I am talking about the rankings, not the images
rankings? I haven't done reranking yet. The retriever works fine
I'm not able to do the ranking
so if I understand correctly, you have a model for ranking, but you are not happy with it, and you have no reference ranking model or ground truth ranking data?
Noo, I don't have a model for ranking. I want a model for image ranking. The ones on Papers with Code are either not under an MIT License or are not clear in their instructions, so I'm not able to rerank the images. Since reranking text is easier, I was looking into captioning the images. And yes, I have no ground truth ranking data.
then maybe you can try other embedding similarity functions, L1 distance, cosine similarity, etc.
I'm using cosine similarity to retrieve the embeddings. L1 didn't work well with retrieval
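For reference, the difference between the two similarity functions mentioned here can be sketched in a few lines. This is only a toy illustration: the `top_k` helper, the dictionary index and the 2-D vectors are made up for the example and have nothing to do with the Qdrant setup or the real DINOv2 embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l1_distance(a, b):
    """Sum of absolute coordinate differences (lower means more similar)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def top_k(query, index, k, metric="cosine"):
    """Return the names of the k best matches in `index` (name -> vector)."""
    if metric == "cosine":
        ranked = sorted(index, key=lambda n: cosine_similarity(query, index[n]), reverse=True)
    elif metric == "l1":
        ranked = sorted(index, key=lambda n: l1_distance(query, index[n]))
    else:
        raise ValueError(f"unknown metric: {metric}")
    return ranked[:k]


# Toy vectors: "b" points almost the same way as the query but is much longer.
index = {"a": [1.0, 0.0], "b": [10.0, 1.0], "c": [0.0, 1.0]}
query = [2.0, 0.0]
print(top_k(query, index, 2, metric="cosine"))  # ['a', 'b']
print(top_k(query, index, 2, metric="l1"))      # ['a', 'c']
```

Cosine similarity only looks at direction, so the long vector "b" still ranks high, while L1 distance penalises the difference in magnitude. That may be one reason the two behave differently on unnormalised embeddings.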
then maybe you can try something like this:
do not simply embed the entire image, but also embed crops of the image (ex: top left, top right, bottom left, bottom right), concat the embeddings and compute the similarity on the concatenated vectors
of course, with 4 additional crops you would have to perform 5 forward passes to compute all embeddings and you would get combined embedding vectors that are 5 times bigger
I'm sorry I don't understand you, how will this help in reranking?
it might give you a better ranking in the first place
But again that's just for candidate selection right. I wasted a lot of time trying to finetune unsupervised models to give me "GOOD" embeddings, but what about the subjective context. That is reranking based on the query image. Do you really think this will help rerank images based on the query image
Well, depending on how you crop your images it might improve the re-ranking. For instance with embeddings of the 4 quadrants, it might help ranking the images based on the similarity of the composition since corresponding quadrants should have similar embeddings
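A rough sketch of the crop idea, under some assumptions: `embed_crop` is a hypothetical stand-in for one forward pass of the embedding model on a cropped region, and `toy_embed` below only exists so the example runs without a model. The boxes use the `(left, top, right, bottom)` convention, so they could be passed to something like PIL's `Image.crop`.

```python
import math


def quadrant_boxes(width, height):
    """Boxes for the full image, then top-left, top-right, bottom-left, bottom-right."""
    mx, my = width // 2, height // 2
    return [
        (0, 0, width, height),
        (0, 0, mx, my),
        (mx, 0, width, my),
        (0, my, mx, height),
        (mx, my, width, height),
    ]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combined_embedding(image, width, height, embed_crop):
    """Concatenate the normalized embeddings of the five crops (5 forward passes)."""
    parts = []
    for box in quadrant_boxes(width, height):
        parts.extend(l2_normalize(embed_crop(image, box)))
    return parts


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def toy_embed(image, box):
    """Hypothetical stand-in for the real model: mean pixel value plus a constant."""
    left, top, right, bottom = box
    pixels = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
    return [sum(pixels) / len(pixels), 1.0]


# Toy 2x2 "images" stored as nested lists of grey values.
query_image = [[1.0, 2.0], [3.0, 4.0]]
index_image = [[4.0, 3.0], [2.0, 1.0]]
q = combined_embedding(query_image, 2, 2, toy_embed)
c = combined_embedding(index_image, 2, 2, toy_embed)
print(len(q), cosine_similarity(q, c))
```

Normalizing each crop embedding before concatenating is a design choice made for this sketch, not something stated in the thread. With it, the cosine similarity of the concatenated vectors works out to the average of the five per-crop cosine similarities, so every quadrant contributes equally to the ranking.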
so this is the transform I used initially:
import torchvision.transforms as T

transform = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
I'll try what you're suggesting as well
In any case, I think you have to identify first principles to guide the similarity first. Do you want the images to be ranked depending on the composition? the colors? the style? the artistic current? the content? the quality? something else?
This will guide you towards one method of ranking or another.
For example, given a query image, sometimes the retriever returns another artwork instead of the preferred index artwork. This is because the two are very, very similar: they're from the same artist, the same colours are used, and the sketches are similar
@final dome You have a design flaw in your application then. On one hand, you're saying there's a "correct index image"; on the other hand, you're just randomly selecting from candidates and stating there's no higher chance of one being better than the other. Is there a correct index image or not? If there is, annotate it yourself. If not, redesign the application to show the top N (like top 10) candidates. I would do both annotation and redesign
The candidates are being selected by vector search, and there is a higher chance of one being better than the other SUBJECTIVELY. As in, given a query image, YES, there is a candidate better than the rest. But all this comes after retrieval. Right now I'm retrieving the top 10 best candidates like you said, using cosine similarity search. I'm using the DINO embeddings to represent my images. So among the retrieved images, YES, there is a candidate better than the rest. But how do I choose the best candidate out of the top 10 retrieved ones? I didn't understand you the first time; I thought you meant one index image being the better candidate overall.
No, reranking is done after retrieval of candidates. To make it simpler, this is step by step:
- Create a bunch of sample queries yourself, so collect say 500 images
- Upload, retrieve, and select the image you would choose yourself. It's fine if it's subjective for now. This way, you basically have 2 inputs and 1 output. Say you retrieve 10 candidates, you basically have
# For the correct candidate
{"input_1": query_image_embedding, "input_2": candidate_image_embedding, "target": 1}
# For the other 9 incorrect candidates
{"input_1": query_image_embedding, "input_2": candidate_image_embedding, "target": 0}
- After going through 500 images, design and train a small model that takes both embeddings and outputs that binary classification. It should assign a probability of being selected given query_image_embedding and candidate_image_embedding.
- Add this model to your application, run it after retrieving the 10 candidates, and rerank the 10 candidates based on prediction score
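The steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the "small model" here is a plain logistic regression trained with SGD, the pair features (element-wise product and absolute difference of the two embeddings) are one common choice rather than anything prescribed in the thread, and all function names (`build_pairs`, `pair_features`, `train_reranker`, `rerank`) are made up for the example. The toy one-hot embeddings stand in for real DINOv2 vectors.

```python
import math
import random


def build_pairs(query_emb, candidate_embs, chosen_index):
    """One labelled record per retrieved candidate: target 1 for the chosen one, else 0."""
    return [
        {"input_1": query_emb, "input_2": cand, "target": 1 if i == chosen_index else 0}
        for i, cand in enumerate(candidate_embs)
    ]


def pair_features(query_emb, cand_emb):
    """Assumed feature map: element-wise product followed by absolute difference."""
    prod = [q * c for q, c in zip(query_emb, cand_emb)]
    diff = [abs(q - c) for q, c in zip(query_emb, cand_emb)]
    return prod + diff


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_reranker(pairs, epochs=20, lr=0.1, seed=0):
    """Fit logistic regression weights on the pair features with plain SGD."""
    rng = random.Random(seed)
    dim = len(pair_features(pairs[0]["input_1"], pairs[0]["input_2"]))
    weights, bias = [0.0] * dim, 0.0
    order = list(pairs)
    for _ in range(epochs):
        rng.shuffle(order)
        for pair in order:
            x = pair_features(pair["input_1"], pair["input_2"])
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            grad = p - pair["target"]  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def score(model, query_emb, cand_emb):
    """Predicted probability that this candidate is the one a human would select."""
    weights, bias = model
    x = pair_features(query_emb, cand_emb)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def rerank(model, query_emb, candidate_embs):
    """Candidate indices sorted by predicted score, best first."""
    scores = [score(model, query_emb, c) for c in candidate_embs]
    return sorted(range(len(candidate_embs)), key=lambda i: scores[i], reverse=True)


# Toy annotation round: 4 one-hot "embeddings"; the annotator always picks the exact match.
index_embs = [[1.0 if i == k else 0.0 for i in range(4)] for k in range(4)]
pairs = []
for k, query in enumerate(index_embs):
    pairs.extend(build_pairs(query, index_embs, chosen_index=k))

model = train_reranker(pairs)
print(rerank(model, index_embs[2], index_embs))
```

In a real setup the logistic regression would probably be replaced by a small neural network in PyTorch, and the records would come from the manual annotation pass described above, but the data layout (two inputs, one binary target) and the final sort by prediction score stay the same.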
Wouldn't it be much more efficient to just better train the embedding model so that the top k retrieval already returns correctly ranked images?
Maybe it makes it harder to train the model using user interactions though. I would probably create an interface just for collecting ranking data in any case.
Yes, I understand that. You actually told me how important reranking is when I was blindly trying to figure out the best model. I wasted a lot of time training those. So I looked into the Papers with Code links you shared earlier, and for image reranking the code and proof aren't legit; that's why I reached back out to see if you had worked on this before. This logic is fantastic!!! Thanks a lot, man!! Here I was, breaking my head trying to caption my images and fine-tune CLIP. This sounds much easier. Thanks a lot for everything