Hi everyone,
I’m trying to automate the selection of images to illustrate a voice over. Putting cost aside for now, my goal is to get the most relevant image selection possible.
Despite many iterations, the results are poor: in the best case, only about 50% of the selected images are relevant.
HERE IS THE PROCESS I CREATED SO FAR:
Step 1: Retrieve images from a database using 2 processes:
-Specific keywords: Generate very precise keywords for each paragraph of my voice over, hoping to retrieve images that are particularly relevant to that specific part of the voice over.
-Broad keywords: Generate general keywords related to the voice over as a whole, hoping to retrieve additional images that can serve as a backfill to illustrate my voice over when the specific images cannot be used.
→ Approximately 600 images retrieved
Step 2: Apply a broad filter to remove bad images (blur, duplicates, images with text, etc.).
→ Approximately 150 images remaining.
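To illustrate the duplicate part of Step 2, here is a minimal sketch of near-duplicate removal with an average hash. It is only an illustration under assumptions: each image is assumed to be already reduced to a small grayscale grid (for example by an image library), and the names `average_hash`, `drop_near_duplicates` and the `max_distance` threshold are hypothetical, not the exact filter used in my pipeline.

```python
from typing import Dict, List, Sequence

Grid = Sequence[Sequence[int]]  # assumed input: small grayscale grid, values 0-255


def average_hash(pixels: Grid) -> int:
    """Build a bit string: 1 where a pixel is brighter than the grid mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(images: Dict[str, Grid], max_distance: int = 2) -> List[str]:
    """Keep the first image of each group of near-identical hashes."""
    kept = []  # list of (image_id, hash) pairs already accepted
    for image_id, pixels in images.items():
        h = average_hash(pixels)
        if all(hamming(h, kept_hash) > max_distance for _, kept_hash in kept):
            kept.append((image_id, h))
    return [image_id for image_id, _ in kept]
```

A larger `max_distance` removes more look-alike images but also risks dropping images that are genuinely different.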
Step 3: Send each image to GPT and retrieve a 2-3 sentence description of the image.
Step 4: Using a combination of the following elements:
-Image description (retrieved in step 3)
-Voice over script (2 pages)
-Contextual information related to the Voice over (short document, <20 pages, containing general info about the voice over)
I send batches of 10 images to GPT, asking it to exclude the most irrelevant / off-topic images.
→ Approximately 80 images remaining
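To make Step 4 concrete, here is a minimal sketch of the batching logic. Everything specific in it is an assumption for illustration: `ask_model` is a hypothetical placeholder for the GPT call (it takes a prompt string and returns the reply text), the image dict fields `id` and `description` are invented, and the prompt wording is not my exact prompt.

```python
from typing import Callable, Dict, Iterator, List

Image = Dict[str, str]  # assumed shape: {"id": ..., "description": ...}


def batched(items: List[Image], size: int = 10) -> Iterator[List[Image]]:
    """Yield consecutive batches of at most `size` images."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_prompt(batch: List[Image], script: str, context: str) -> str:
    """Combine context, script and image descriptions into one prompt."""
    lines = [f"{img['id']}: {img['description']}" for img in batch]
    return (
        "Context:\n" + context + "\n\n"
        "Voice over script:\n" + script + "\n\n"
        "Image descriptions:\n" + "\n".join(lines) + "\n\n"
        "Return the ids of the irrelevant or off-topic images, one per line."
    )


def filter_images(
    images: List[Image],
    script: str,
    context: str,
    ask_model: Callable[[str], str],  # hypothetical wrapper around the GPT call
    batch_size: int = 10,
) -> List[Image]:
    """Drop the images the model flags as irrelevant, batch by batch."""
    kept: List[Image] = []
    for batch in batched(images, batch_size):
        reply = ask_model(build_prompt(batch, script, context))
        rejected = {line.strip() for line in reply.splitlines() if line.strip()}
        kept.extend(img for img in batch if img["id"] not in rejected)
    return kept
```

Note that in this form each batch is judged in isolation, so an image is only compared against the 9 others in its batch, not against the full set.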
Step 5: Final image selection:
For each paragraph of the voice over:
-First, focusing on the images obtained with specific keywords (using the image descriptions), I ask GPT to select the top 3 most relevant images to illustrate the paragraph.
-Then, focusing on the images obtained with broad keywords (using the image descriptions), I ask GPT to review the selected images and determine whether one of the broad images would be more relevant and should replace one of the specific images.
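The selection logic of Step 5 can be sketched roughly as follows. This is a simplification under assumptions: the hypothetical `score` callable stands in for the GPT judgment (in my process GPT picks the images directly rather than returning a number), and the image dict fields are invented for the example.

```python
from typing import Callable, Dict, List

Image = Dict[str, str]  # assumed shape: {"id": ..., "description": ...}


def select_for_paragraph(
    paragraph: str,
    specific: List[Image],
    broad: List[Image],
    score: Callable[[str, str], float],  # hypothetical stand-in for the GPT judgment
    top_k: int = 3,
) -> List[Image]:
    """Pick the top specific images, then let one broad image replace the weakest."""
    def rate(img: Image) -> float:
        return score(paragraph, img["description"])

    # Pass 1: best `top_k` images among those retrieved with specific keywords.
    chosen = sorted(specific, key=rate, reverse=True)[:top_k]

    # Pass 2: the best broad image replaces the weakest pick if it scores higher.
    if chosen and broad:
        best_broad = max(broad, key=rate)
        weakest = min(chosen, key=rate)
        if rate(best_broad) > rate(weakest):
            chosen[chosen.index(weakest)] = best_broad
    return chosen
```

Written this way, each paragraph is handled independently, so the same image can be selected for several paragraphs unless that is prevented separately.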
What would you recommend to improve the relevancy?
Thanks for the help