I'm working on a search feature where I want similar words to be matched based on their embeddings. For example, searching for "Employee" should also bring up results for "Worker", because their embeddings are similar and should therefore produce a high similarity score.
However, there's a problem. When I compare the embedding of an unrelated term like "Boruto" against phrases such as "What is the name of this Employee" or "the name of this employee is Y", the similarity scores are consistently high. What I want is for a search for "Boruto", which has no matches, to return a very low score. Currently, all searches score high, typically around 0.7 to 0.8.
I have this function to calculate similarity:
```js
// Cosine similarity between two embedding vectors of equal length
const cosineSimilarity = (embedding1, embedding2) => {
  // Variable to store the dot product of the two embeddings
  let dotProduct = 0;
  // Variables to store the magnitude (Euclidean norm) of embeddings
  let norm1 = 0,
    norm2 = 0;
  for (let i = 0; i < embedding1.length; i++) {
    dotProduct += embedding1[i] * embedding2[i]; // Accumulate the dot product of corresponding elements
    norm1 += embedding1[i] * embedding1[i]; // Accumulate the sum of squares of elements in embedding1
    norm2 += embedding2[i] * embedding2[i]; // Accumulate the sum of squares of elements in embedding2
  }
  // Calculate the magnitudes (Euclidean norms) of the embeddings
  norm1 = Math.sqrt(norm1);
  norm2 = Math.sqrt(norm2);
  // Calculate the cosine similarity
  const similarity = dotProduct / (norm1 * norm2);
  return similarity;
};
```
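As a sanity check on the math, here is a minimal sketch using made-up toy vectors (these are not real embeddings, and `cosine` is just a compact copy of the function above so the snippet runs on its own). Vectors pointing the same way should score close to 1, orthogonal ones 0, and opposite ones close to -1:

```javascript
// Compact copy of the similarity function, repeated so this snippet is self-contained.
const cosine = (a, b) => {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
};

// Toy 3-dimensional vectors, invented purely for illustration.
const same = cosine([1, 2, 3], [2, 4, 6]);         // same direction, different length
const orthogonal = cosine([1, 0, 0], [0, 1, 0]);   // no shared direction
const opposite = cosine([1, 2, 3], [-1, -2, -3]);  // opposite direction

console.log(same, orthogonal, opposite);
```

If that behaves as expected, the formula itself is probably not where the consistently high scores come from, which is why I suspect I'm misreading what the scores mean.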
Am I looking at it wrong? Maybe I haven't got the concept of these embeddings right.