# How can you calculate similarity between embeddings the right way in JavaScript?

thorn abyss Feb 1, 2024, 8:59 AM

I'm working on a search feature where I want similar words to be considered based on their embedding values. For example, searching for "Employee" should also bring up results for "Worker" because they have a high embedding score due to their similarity.

However, there's a problem. When comparing the embedding value of an unrelated term like "Boruto" to phrases such as "What is the name of this Employee" or "the name of this employee is Y," the results are consistently high. What I want is for a search for "Boruto," which has no matches, to return a very low score. Currently, all searches have high scores, typically around 0.7 - 0.8.

I have this function to calculate similarity:

```javascript
// Name assumed: the original declaration line is missing from the post.
const cosineSimilarity = (embedding1, embedding2) => {
    // Dot product of the two embeddings
    let dotProduct = 0;

    // Sums of squares, later turned into magnitudes (Euclidean norms)
    let norm1 = 0,
        norm2 = 0;

    for (let i = 0; i < embedding1.length; i++) {
        dotProduct += embedding1[i] * embedding2[i]; // Accumulate the product of corresponding elements
        norm1 += embedding1[i] * embedding1[i]; // Accumulate the squares of elements in embedding1
        norm2 += embedding2[i] * embedding2[i]; // Accumulate the squares of elements in embedding2
    }

    // Calculate the magnitudes (Euclidean norms) of the embeddings
    norm1 = Math.sqrt(norm1);
    norm2 = Math.sqrt(norm2);

    // Calculate the cosine similarity
    const similarity = dotProduct / (norm1 * norm2);

    return similarity;
};
```

Am I looking at it wrong? Maybe I haven't got the concept of these embeddings right.

sinful tide BOT Feb 1, 2024, 8:59 AM

Hi @thorn abyss. I'm kapa.ai, your friendly support bot. I will try to answer your question by searching through the knowledge base. :wave:

Based on the information provided in the extracts, it seems like your approach to calculating the similarity between two embeddings is correct. You're using the cosine similarity, which is a common method for comparing vectors in high-dimensional spaces, such as word embeddings.

However, the issue you're facing is probably due to the nature of the embeddings themselves rather than your code. Embeddings are high-dimensional representations that capture semantic meaning, and they don't always align with our intuitive understanding of similarity. With some embedding models, cosine scores for unrelated texts still fall in a narrow, fairly high band, so a score of 0.7 to 0.8 for "Boruto" against a sentence about an employee does not by itself indicate a match. In that situation, how results rank relative to each other tends to be more informative than the absolute score.
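If scores for unrelated queries sit in a high band, one option is to measure that band and filter relative to it instead of treating the raw score as a match. The following is a minimal sketch of that idea, not a recommended configuration: `rankAboveBaseline`, `baseline`, and `margin` are hypothetical names, and the baseline is assumed to be a score you have measured yourself on queries known to be unrelated to your data.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: score every document against the query, keep only
// results that beat the measured baseline by `margin`, and sort best-first.
// `baseline` is an assumption: a typical score for unrelated text with
// your model and data, determined empirically.
function rankAboveBaseline(queryEmbedding, docs, baseline, margin) {
  return docs
    .map((doc) => ({
      id: doc.id,
      score: cosineSimilarity(queryEmbedding, doc.embedding),
    }))
    .filter((result) => result.score >= baseline + margin)
    .sort((x, y) => y.score - x.score);
}
```

A query with no real match would then return an empty list, provided the baseline and margin have been calibrated against your own data.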

One possible solution to improve the relevance of your search results is to augment your embeddings with additional information. For example, in the forum post (https://community.openai.com/t/328330) from the OpenAI community, a user suggests a method for improving the relevance of chatbot responses. This involves breaking the documentation into chunks, running embeddings on each chunk, and then comparing the user's question to these chunks. The user also suggests creating typical questions about each chunk and including these in the embeddings vector, which might help to improve the matching.
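One way to read that suggestion is to store several vectors per chunk, one for the chunk text and one for each typical question, and to score a chunk by its best-matching vector. The sketch below assumes the embeddings have already been computed elsewhere; `bestChunks` and the `vectors` field are hypothetical names, and the forum post may have intended a different layout, such as embedding the chunk and its questions together.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: each chunk has an `id` and a `vectors` array holding
// the embedding of its text plus the embeddings of typical questions about it.
// A chunk's score is the highest similarity across all of its vectors.
function bestChunks(queryEmbedding, chunks, topK) {
  return chunks
    .map((chunk) => ({
      id: chunk.id,
      score: Math.max(
        ...chunk.vectors.map((v) => cosineSimilarity(queryEmbedding, v))
      ),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```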

Another suggestion from a different forum post (https://community.openai.com/t/318312) is to overlap your chunked data, which can improve the quality of retrieval by including more relevant chunks with a single search.
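Overlapping chunks can be produced with a sliding window. This is a minimal sketch that splits on whitespace and counts words; `chunkWithOverlap` and its parameters are hypothetical, and a real pipeline would more likely count tokens or split on sentence boundaries.

```javascript
// Split text into chunks of `size` words where each chunk repeats the last
// `overlap` words of the previous one.
function chunkWithOverlap(text, size, overlap) {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new RangeError('require size > 0 and 0 <= overlap < size');
  }
  const words = text.split(/\s+/).filter(Boolean);
  const step = size - overlap; // how far the window advances each time
  const chunks = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(' '));
    if (start + size >= words.length) break; // this chunk reached the end
  }
  return chunks;
}
```

For example, `chunkWithOverlap('a b c d e f g', 4, 2)` yields three chunks, each sharing two words with its neighbour.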

In terms of decoding base64 embeddings in NodeJS, a forum post (https://community.openai.com/t/599308) provides a solution using the Buffer object in NodeJS:

```javascript
new Float32Array(
      Buffer.from(cacheItem.embedding, 'base64').buffer
);
```
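One caveat with reading `.buffer` directly: for small inputs, Node.js may allocate a `Buffer` as a view into a larger shared pool, so the underlying `ArrayBuffer` can contain unrelated bytes. A more defensive sketch copies exactly the decoded bytes first. `encodeEmbedding` and `decodeEmbedding` are hypothetical helper names, and the stored base64 is assumed to hold raw 32-bit floats in the platform's byte order.

```javascript
// Encode a Float32Array as a base64 string of its raw bytes.
function encodeEmbedding(floats) {
  return Buffer.from(
    floats.buffer,
    floats.byteOffset,
    floats.byteLength
  ).toString('base64');
}

// Decode a base64 string back into a Float32Array.
function decodeEmbedding(base64) {
  const bytes = Buffer.from(base64, 'base64');
  // Copy only this Buffer's bytes into a fresh ArrayBuffer, because
  // `bytes.buffer` may be a shared pool that is larger than the data.
  const copy = bytes.buffer.slice(
    bytes.byteOffset,
    bytes.byteOffset + bytes.byteLength
  );
  return new Float32Array(copy);
}
```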

Remember, these are just suggestions and might not directly solve your issue. The effectiveness of these methods will depend on the specifics of your use case and data.
