# A working GPT-3 / 4 Encoder & Tokenizer API

rotund minnow
#

For those who want to know the number of tokens and their values before sending a request to the OpenAI API (by default you only get the token count back in the OpenAI response):
I've created a cross-platform API that lets anyone convert text to tokens, get the token count, and convert tokens back to the original text.
Feel free to check it out 🙂
https://rapidapi.com/um3rit/api/gpt-3-encoder-tokenizer
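
For anyone curious what those three operations look like, here is a toy sketch. It is not the API linked above and it does not use the real GPT vocabulary: it assumes a tokenizer with no merges at all, where every UTF-8 byte is one token ID. Real GPT tokenizers are byte-level BPE and merge frequent byte sequences, so their counts are usually lower. The names `encode`, `count_tokens`, and `decode` are made up for illustration.

```python
def encode(text: str) -> list[int]:
    """Toy encoder: one token ID per UTF-8 byte (no BPE merges)."""
    return list(text.encode("utf-8"))


def count_tokens(text: str) -> int:
    """Number of toy tokens the text occupies."""
    return len(encode(text))


def decode(tokens: list[int]) -> str:
    """Toy decoder: turn byte-valued token IDs back into text."""
    return bytes(tokens).decode("utf-8")


sample = "Hello"
ids = encode(sample)           # text -> tokens
print(ids, count_tokens(sample))
print(decode(ids) == sample)   # tokens -> original text
```

The round trip is lossless because the token IDs are just the bytes of the text; a real tokenizer keeps that property but maps byte sequences to vocabulary IDs instead.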

earnest raft
#

Hello Mike

I have a problem with my AI copywriter, which is built on the OpenAI API, and I need your advice.
I made a tool for copywriting, and all the functions work fine.
The problem is with Arabic: when I use it as prompt input, the API counts every letter as one token instead of counting roughly by words, as it is supposed to (1k tokens = 750 words).
I tried BERT and TKSEEM in case those models would help, but I am lost and don't know which approach is correct.
I need to send a prompt in Arabic and get the response back in Arabic as well, but within this ratio (1k tokens = 750 words).
Any information that helps us survive would be appreciated.
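
To put numbers on that ratio: "1k tokens = 750 words" is a rule of thumb for English, so about 0.75 words per token. A rough way to check how far a given prompt is from it, assuming you already have its token count (for example from the usage figures in the API response), might look like the sketch below. `words_per_1k_tokens` is a hypothetical helper, and splitting on whitespace is only a crude word count.

```python
def words_per_1k_tokens(text: str, token_count: int) -> float:
    """Words (split on whitespace) per 1000 tokens for this text."""
    if token_count <= 0:
        raise ValueError("token_count must be positive")
    return len(text.split()) * 1000 / token_count


# A 3-word text that cost 4 tokens matches the English rule of thumb.
print(words_per_1k_tokens("one two three", 4))
```

A result well below 750 for the same amount of text means each word costs more tokens than the English rule of thumb assumes.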

rotund minnow
earnest raft
#

yes, it counts as 1 token for each letter

rotund minnow
#

for Chinese, Arabic, Russian, ... you will be using Unicode

earnest raft
#

I have seen that, but what programmatic solution would make the token count match the count for the same text in English? This matters because it eats up the balance quickly.

rotund minnow
#

all the letters that are not ASCII (German u with umlaut, Italian accented vowels, some Turkish letters, all Arabic letters, all Chinese characters, ...) are 1 token each

#

some symbols are ASCII as well
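
A way to see the ASCII vs. non-ASCII difference without any tokenizer: GPT-2/GPT-3 style tokenizers are byte-level BPE over UTF-8, so the number of bytes a character needs is an upper bound on how many tokens it can take by itself. The sketch below only measures bytes with the standard library. It does not run a real tokenizer, and the actual count for a given model can be lower when frequent byte sequences are merged. `utf8_len` and `char_costs` are hypothetical helper names.

```python
def utf8_len(ch: str) -> int:
    """Bytes needed to store one character in UTF-8."""
    return len(ch.encode("utf-8"))


def char_costs(text: str) -> dict[str, int]:
    """Map each distinct character in text to its UTF-8 byte length."""
    return {ch: utf8_len(ch) for ch in text}


# a (ASCII), u with umlaut, Turkish s with cedilla, Arabic ain, Chinese "middle"
for ch in ["a", "\u00fc", "\u015f", "\u0639", "\u4e2d"]:
    print(repr(ch), utf8_len(ch))
```

ASCII characters take 1 byte, the accented Latin and Arabic letters here take 2, and the Chinese character takes 3, which is why the same amount of text gets more expensive outside ASCII.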

earnest raft
#

I believe there is a solution that encodes Arabic letters so that the API counts them by the same standard as English. The evidence is that some sites operate commercially on the OpenAI API and specialize in Arabic; without such a solution, these sites would already have lost money.