I’m using --log_tokenization, and it has me wondering what the rhyme or reason is behind some of the tokenizations. For example, the word “zoomed” gets split into two differently colored tokens, “zo” and “omed”. I have trouble seeing how splitting the word this way is helpful, although the CLI switch is definitely useful, if only for revealing oddities like this. Is there a place to go to learn more about whether “zo” + “omed” is giving me what I want, and how to understand what is going on?
#Making sense of tokenization?
you can see all the tokens it knows in its vocab.json file: https://huggingface.co/CompVis/stable-diffusion-v1-4/tree/main/tokenizer
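As for why a word ends up split like “zo” + “omed”: as far as I know, that tokenizer is a byte-pair-encoding (BPE) tokenizer, which starts from single characters and repeatedly glues together the adjacent pair that ranks highest in a learned merge list. Here is a rough sketch of that loop. The merge list below is made up purely for illustration (it is not taken from the real tokenizer files), and `bpe` and `TOY_MERGES` are hypothetical names, so treat it as a toy rather than the actual implementation.

```python
# Toy BPE sketch. TOY_MERGES is a hypothetical merge table, listed from
# highest to lowest priority. It is NOT the real Stable Diffusion merge list.
TOY_MERGES = [
    ("e", "d</w>"),
    ("z", "o"),
    ("m", "ed</w>"),
    ("o", "med</w>"),
]


def bpe(word, merges):
    """Split one word into subword tokens using a ranked merge list."""
    if not word:
        return []
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    # Start from single characters; "</w>" marks the end of the word.
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) rank.
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(bpe("zoomed", TOY_MERGES))
```

The point is that the split follows from which character pairs were frequent in the tokenizer's training text, not from anything about the word's meaning. With this toy table, “z”+“o” and the “…omed” ending each get merged, but no rule joins the two halves, so the word stays in two pieces.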
Thanks!