#Embeddings API returning cut information


eternal kite
#

The Embeddings API is only returning ~10 decimal places, while the OpenAI Python module returns ~18.

Tried in Java and Postman.

eternal kite
#

Bump

graceful badge
#

The embeddings are float32 (single precision). They only have 7 or 8 significant figures, so you probably aren't benefiting from the extra digits, whose presence or absence mostly reflects different ways of converting between data types. But you can get them if you want, even when you're not using the OpenAI Python library. More specifically, you can get the embeddings as flat sequences of binary float32 values--which is what the Python library is doing--and use them however you like.

#

If I understood your message in #text-embedding-ada-002 correctly, you're using openai.embeddings_utils.get_embedding. That calls openai.Embedding.create, and does not pass an encoding_format argument. When openai.Embedding.create is called without an encoding_format argument, it requests the float32 coordinates as a base64 string. base64 encodes binary data losslessly in a text format suitable for transmission over a network. (The "64" in base64 is not conceptually important here.)

#

openai.Embedding.create decodes the base64 string to get the original binary representations of the coordinates, then uses NumPy to recognize that as a flat sequence of float32 values, making a NumPy array of them. Then it converts that NumPy array to a Python list, which converts the float32 values to Python float. Python's float type is in practice always float64 (double precision). Python itself has no built-in float32 type.

#
# If encoding format was not explicitly specified, we opaquely use base64 for performance
if not user_provided_encoding_format:
    kwargs["encoding_format"] = "base64"

while True:
    try:
        response = super().create(*args, **kwargs)

        # If a user specifies base64, we'll just return the encoded string.
        # This is only for the default case.
        if not user_provided_encoding_format:
            for data in response.data:

                # If an engine isn't using this optimization, don't do anything
                if type(data["embedding"]) == str:
                    assert_has_numpy()
                    data["embedding"] = np.frombuffer(
                        base64.b64decode(data["embedding"]), dtype="float32"
                    ).tolist()

        return response
#

(I've omitted the surrounding code, including the except clause.)
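
The decoding step in that excerpt can be tried in isolation. The following is only a sketch: it swaps NumPy's `np.frombuffer` for the standard-library `struct` module so that it has no dependencies, and it assumes the bytes are little-endian (the excerpt uses the machine's native byte order, which is little-endian on common hardware). The name `decode_embedding` is made up for illustration.

```python
import base64
import struct
from typing import List


def decode_embedding(b64_text: str) -> List[float]:
    """Decode a Base64 string of packed float32 values into Python floats."""
    raw = base64.b64decode(b64_text)
    if len(raw) % 4 != 0:
        raise ValueError("payload is not a whole number of 4-byte float32 values")
    count = len(raw) // 4
    # "<" means little-endian (an assumption); "f" means IEEE 754 binary32.
    # struct widens each float32 to a Python float (float64) exactly.
    return list(struct.unpack(f"<{count}f", raw))


# Round trip with three values that float32 represents exactly.
sample = base64.b64encode(struct.pack("<3f", 1.0, -0.5, 0.25)).decode("ascii")
print(sample)
print(decode_embedding(sample))
```

As in the library code, the result is a plain list of Python floats, so every coordinate ends up as a float64 that holds a float32 value.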

#

Since float32 and float64 are binary floating point (their IEEE 754 names are binary32 and binary64), there is error associated with the base conversion between decimal and binary. For example, in a REPL, np.float32('0.033652876') and float('0.033652876') both show the string representation 0.033652876, while float(np.float32('0.033652876')) shows the string representation 0.03365287557244301.
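
The same effect can be reproduced without NumPy. This is a minimal sketch that uses `struct` to round a Python float to float32 and widen it back; the helper name `to_float32` is invented for the example.

```python
import struct


def to_float32(x: float) -> float:
    """Round x to the nearest float32, then widen it back to a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


as_double = float("0.033652876")   # nearest float64 to the decimal string
as_single = to_float32(as_double)  # nearest float32, held in a float64

print(repr(as_double))             # short: 0.033652876 already identifies this float64
print(repr(as_single))             # longer: more digits are needed to identify this one
print(as_single == as_double)      # False: the two types round the decimal differently
print(format(as_single, ".8g"))    # the same value shown with 8 significant digits
```

The longer string is not extra information from the model. It is the decimal expansion Python needs to distinguish that float32 value from its float64 neighbors.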

#

The float64 values you get from the OpenAI Python library are being produced because Python has no built-in float32 type, not because the extra precision would usually be helpful. But if you like, you can use the same technique openai.Embedding.create uses: when accessing the API endpoint directly, specify 'encoding_format': 'base64' and decode the results yourself.

#

The only thing that makes me a little uneasy is that I haven't found where encoding_format is officially documented. The code I showed above seems to indicate that not all models are guaranteed to support base64 encoding. If the model doesn't, then presumably the API endpoint would fall back to giving the embedding value as a JSON array and, due to the parsing logic in the EngineAPIResource base class, type(data["embedding"]) would be list rather than str. Using requests, I tested with text-embedding-ada-002, as well as the five first-generation Ada models, and 'encoding_format': 'base64' always worked to get base64-encoded embeddings.
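
A client that wants to tolerate both cases could branch on the type, in the same spirit as the library code quoted earlier. The sketch below is an illustration only: `embedding_values` is a hypothetical helper, the byte order is assumed to be little-endian, and the two response bodies are hand-made stand-ins that only mimic the `data[0].embedding` shape, not real API output.

```python
import base64
import json
import struct
from typing import List, Union


def embedding_values(embedding: Union[str, List[float]]) -> List[float]:
    """Return coordinates whether the field is Base64 text or a JSON array."""
    if isinstance(embedding, str):
        raw = base64.b64decode(embedding)
        # Assumes little-endian float32 values packed back to back.
        return list(struct.unpack(f"<{len(raw) // 4}f", raw))
    # Fallback: the endpoint sent a JSON list of decimal numbers.
    return [float(x) for x in embedding]


# Hand-made stand-ins for the two response shapes.
packed = base64.b64encode(struct.pack("<2f", 0.5, -2.0)).decode("ascii")
base64_body = json.dumps({"data": [{"embedding": packed}]})
list_body = json.dumps({"data": [{"embedding": [0.5, -2.0]}]})

for body in (base64_body, list_body):
    item = json.loads(body)["data"][0]
    print(embedding_values(item["embedding"]))
```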

eternal kite
#

That was a great read, thank you. I've read it over a couple times and will need to again. Sounds like a lot of extra code for something that doesn't make a difference. Your insights really helped bring clarity. Thanks again

graceful badge
#

No problem! I do agree that it shouldn't make a difference and you probably won't want to manually receive and decode base64.

#

(How much extra code it would be probably varies depending on the language/framework and your use case. If you're interested, I can show the shell-script example I made when looking into this. It's just toy/demo code though--it would have to be heavily modified to do anything useful beyond showing how "encoding_format": "base64" works. In any case, the only reason I can think of to actually do this is curiosity.)

eternal kite
#

@graceful badge I would love to see the code. I write in Kotlin. Although I completely agree with everything you said I'm also curious in seeing the differences and simply knowing how to accomplish it.

graceful badge
#

Edit: Fixed serious quoting bug by changing "$*" to "'"$*"'".

My original shell script was:

#!/usr/bin/env bash

make_request() {
    curl https://api.openai.com/v1/embeddings \
        -X POST \
        -H "Authorization: Bearer $(<.api_key)" \
        -H "Content-Type: application/json" \
        -d '{"input": "'"$*"'",
            "model": "text-embedding-ada-002",
            "encoding_format": "base64"}'
}

get_embedding() {
    make_request "$@" |
        jq '.data[0].embedding' |
        sed 's/"//g' |
        base64 -d |
        od -f
}

get_embedding 'The food was delicious and the waiter...'

eternal kite
#

@graceful badge This is very cool. Thank you! The comments really helped. I'm assuming there's no way to accomplish this inside of Java - not that it matters. It'd be interesting to run tests and see the difference in results, if there are any

graceful badge
#

I found a serious bug in the shell script I showed above: due to a mistake I made with quoting, I was always embedding the literal text $* instead of the text that was supposed to be embedded! I've fixed the bug, both in the code shown above and in both versions of the script in the linked gist. Sorry about that!

graceful badge
#

You cannot usually do it through the libraries people have made for accessing the OpenAI API, as far as I know. For example, Theo Kanning's openai-java library won't do it, though it would be feasible to make a fork of it that does. I'm not sure about Mouaad Aallam's openai-kotlin library, which I haven't used.

#

But you can drop down lower and make the HTTP POST request to the API endpoint yourself. When experimenting, I did this in Python with the Requests library. Out of curiosity, I've since tried it in Java with OkHttp (and Jackson to map between JSON and Java objects), and that also works. I used those third-party libraries, but this is still "inside of Java" in the sense that those two Java libraries are written in JVM languages. It would also be possible--though, I think, inconvenient--to do it in Python with just the Python standard library, or in Java with just the facilities in the Java Class Library.
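
For the standard-library-only route in Python, a rough sketch might look like the following. It is not the code from the experiments described here. The URL, headers, and JSON fields are copied from the shell script above, while the function names and the placeholder key are made up. Only `build_request` runs offline; `fetch_embedding_base64` needs network access and a real API key.

```python
import json
import urllib.request

API_URL = "https://api.openai.com/v1/embeddings"  # endpoint used by the shell script


def build_request(text: str, api_key: str) -> urllib.request.Request:
    # json.dumps escapes quotes and backslashes in `text`, which avoids the
    # quoting trouble that hand-assembled JSON invites.
    payload = json.dumps({
        "input": text,
        "model": "text-embedding-ada-002",
        "encoding_format": "base64",
    }).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_embedding_base64(text: str, api_key: str) -> str:
    """Send the request and return the Base64 string (needs network and a key)."""
    with urllib.request.urlopen(build_request(text, api_key)) as response:
        body = json.load(response)
    return body["data"][0]["embedding"]


# Offline check of what would be sent; "sk-placeholder" is not a real key.
request = build_request('He said "hi"', api_key="sk-placeholder")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```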

#

I plan to put my code for all this up on GitHub soon (hopefully I'll manage to do it today).

#

Btw, if one were to decide to do this in production code, I think the reason would always be the optimization in speed and network usage when computing embeddings for large quantities of text (which I presume is why the OpenAI Python library does it), rather than the possibility of slightly less rounding error in rare cases.

eternal kite
#

@graceful badge that's weird. I use retrofit (which uses okhttp) and both only returned the cut decimals. I used Gson instead of Jackson to convert

#

It must be the way I'm asking to display the data then? I was positive that I was reading the raw response

#

@graceful badge Ah, yes. I definitely noticed that... 😅

#

The bug

#

Would you mind showing your java snippet of the okhttp request? Along with any configurations?

graceful badge
#

No problem! Actually, even though it's imperfect, I'm going to put the repo on GitHub now (or as soon as I come back from afk). That will make things easier, since I can link to any part of any source code file. Also, you'll be able to see all the stuff. 🙂

eternal kite
#

Great. Will check it out. This has all been very informative

graceful badge
#

For me as well! I was unaware that the API endpoint ever sent embeddings as Base64 data, before your question caught my eye and I decided to look into it.

graceful badge
#

(That's what I'm doing.)

#

If so, are you keeping the values as floats? If you keep them as floats, then they're the same values as the doubles that show more precision, but the extra digits wouldn't typically be displayed, since they're not typically necessary to specify the floats precisely.

#

Most libraries, including JSON-parsing libraries that aren't specific to OpenAI, have the default behavior of parsing JSON number data as float64. If you were to forgo Base64 encoding but read the data as a List<Float> instead of a List<Double>, and then convert that List<Float> to List<Double>, then you should also get the extra digits, and they should usually be the same as if you request Base64 data, decode it to List<Float>, and convert that to List<Double>.

#

However, if there is ever a case where the decimal representation of a coordinate is not precise enough, then you would lose precision. (A minority of float32 values require 9 decimal figures to represent precisely, and I haven't looked into whether the API sends decimal numbers with 9-digit significands in this situation.) Although I don't think it has anything to do with why the API supplies Base64 on request, using Base64 should always avoid that (largely unimportant) edge-case rounding error.
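
The 9-digit point can be checked with a small search. This sketch looks just above 1000.0, where 8-digit decimals are spaced 1e-4 apart but float32 values are only 2^-14 (about 6.1e-5) apart, so some neighboring float32 values must share an 8-digit string. That range is chosen because the collision is easy to find there, not because embedding coordinates look like that. All helper names are invented for the example.

```python
import struct


def to_float32(x: float) -> float:
    """Round x to the nearest float32, then widen it back to a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def next_float32(x: float) -> float:
    """Return the float32 just above a positive float32 x."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


def survives(x: float, digits: int) -> bool:
    """True if x is unchanged after a trip through `digits` significant decimal digits."""
    text = format(x, f".{digits}g")
    return to_float32(float(text)) == x


def first_failure(start: float, digits: int, limit: int = 1000):
    """Walk upward through consecutive float32 values until one fails to round-trip."""
    x = to_float32(start)
    for _ in range(limit):
        if not survives(x, digits):
            return x
        x = next_float32(x)
    return None


x = first_failure(1000.0, digits=8)
print(format(x, ".8g"), survives(x, 8))  # 8 digits lose this value
print(format(x, ".9g"), survives(x, 9))  # 9 digits recover it
```

Binary data sent as Base64 never goes through this decimal step, which is why it cannot hit this edge case.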

eternal kite
#

Oh wow. That read with the questions was perfect. I didn't realize it was as simple as an argument in the request. @graceful badge

#

If you don't mind me asking, how is it "optimized" to send base64, which is essentially just added noise in this context? Seems counterintuitive to me. Or are you saying the opposite: that the cut version is optimized? If that's the case, I'm equally confused about why it would still even be there as an option 😅

graceful badge
#

The decimal numbers received when Base64 isn't used aren't really cut, or not usually. In either most or all cases, they do not lose any information. The ways that Base64 is an optimization relate to what is being encoded.

#

The embedding that the model produces is a sequence of float32 coordinates, each of which takes up 4 bytes. When Base64 is used, that sequence of float32 values is encoded as Base64. (The individual coordinates are not encoded separately; rather, the entire embedding, consisting of d consecutive 4-byte values, where d is the dimension of the vector space, is encoded.)

#

When Base64 is not used, that same sequence of 4-byte float32 coordinates is converted to a JSON list of numbers, where each coordinate is represented in decimal (so it has to be converted to decimal for transmission, then converted back by the client).

#

Using Base64 to avoid that is an optimization because Base64 is more compact than that JSON list representation, and because the computations to encode and decode Base64 are less involved than those to serialize numbers as their decimal representations and then parse those representations back to binary.

#

To be more concrete about the space issue: Base64 represents every 3 bytes as 4 characters, so its output is 4/3 the size of the raw binary data. A float32 value is 4 bytes, so a d-dimensional embedding is 4d bytes raw and (16/3)d bytes in Base64. That is about 5.33 bytes per coordinate, which is less than 6.

#

In contrast, representing float32 values in a JSON list requires an average of over 10 bytes per value. In part this is due to the inefficiency of using an entire byte for each digit even though there are only 10 possibilities for a digit. But the presence of nonsignificant 0s, decimal points, and the commas used to separate the values from each other, also contributes.

#

For example, the decimal string 0.033652876 is 11 bytes, which becomes 12 when the , separator that would come after it (unless it happens to be the last coordinate) is counted.
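
The arithmetic can be checked with made-up data. In this sketch the coordinates are random numbers rather than a real embedding, the dimension 1536 is the output size of text-embedding-ada-002, and the 8-significant-digit formatting is an assumption about how the decimal form might look, not a statement about what the API actually sends.

```python
import base64
import random
import struct

DIMENSIONS = 1536  # output size of text-embedding-ada-002; adjust for other models

rng = random.Random(0)  # fixed seed so runs are repeatable
values = [rng.uniform(-0.1, 0.1) for _ in range(DIMENSIONS)]  # made-up coordinates

raw = struct.pack(f"<{DIMENSIONS}f", *values)      # 4 bytes per coordinate
as_base64 = base64.b64encode(raw)                  # 4 output bytes per 3 input bytes
floats32 = struct.unpack(f"<{DIMENSIONS}f", raw)   # the float32 values, widened
as_json = "[" + ",".join(format(v, ".8g") for v in floats32) + "]"

print("raw bytes:   ", len(raw))        # 6144 = 4 * 1536
print("Base64 bytes:", len(as_base64))  # 8192 = 6144 * 4 / 3
print("JSON bytes:  ", len(as_json.encode("ascii")))
```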

#

I should say that this analysis may be misleading because I think the HTTP connection could be using end-to-end compression, which would narrow the difference in bandwidth (but maybe trade bandwidth for processing).

graceful badge
#

I'm not actually sure which thing you mean by "That read with the questions". Is that something in why.md in the repo, or something I said here? (If it's something I've only said here, then I might want to add it or something like it to the documentation in the repo, too.)

eternal kite
#

I'm sorry, I'm pretty sleep deprived today 😅 I meant to say "That was a very good read, regarding the why.md"

graceful badge
#

oh, cool! 🙂

eternal kite
#

@graceful badge Thank you. That was surprisingly easy to digest. Assuming the parameter remains and I really wanted to get max efficiency I'd actually prefer to receive my response as base64, decode it, and cut the "noise" myself.

#

Is that a correct train of thought?

#

I wouldn't, but I'd just like to know that I'm somewhat on the same page

graceful badge
#

It seems reasonable to me.

#

You could probably use the approach used in the OpenAI Python library, and check the type of the embedding so that, if support for encoding_format is ever removed, it will fall back to reading a list of numbers.

eternal kite
#

I see. Now to enter the deep hole of base64 and binary. See how far I go. I may just end up like an ostrich but, this has sparked my interest

#

Thanks one last time

graceful badge
# eternal kite Thanks one last time

No problem! Btw, idk if anybody else will end up looking at or using the stuff in that GitHub repository, but either way, I'd like to mention you in the acknowledgements section of the readme, if that is okay with you. Assuming that is okay with you, is RonaldGRuckus the name I should use, or should I use something else? Also, is there a URL (like a GitHub profile or website) I should make that a hyperlink to? The wording I'm thinking of using is something like this:

These materials arose out of conversations with RonaldGRuckus on the OpenAI Discord server. If not for Ronald's observations about embeddings from the Python library, and the conversations that followed, this repository and its contents would not exist.

eternal kite
#

@graceful badge I'd be honored. My discord would be plenty. I've been looking at your stuff quite a bit. It's all a lot for me to learn, but thank you for making it public and asking me

graceful badge
#

Thanks! I've edited the readme.

#

Also, I've just realized I may have misunderstood what you meant. I took what you were saying to mean that my sample text was fine as is. If that's not so, then I'm sorry, and please let me know (I can fix it with another commit).

#

The other thing is that I'm wondering if you meant I should link to your Discord profile. I don't actually know how to do that, if there is a way.

eternal kite
#

@graceful badge I have no idea. In that case I'm okay with no links. No, honestly, all that you have said has been very insightful and helped me understand a lot more than I initially expected from this question

graceful badge
#

I figured out how to do it! Your name in the acknowledgements section of the readme is now linkified.