#Making sense of tokenization?

lusty tiger

I’m using --log_tokenization and it has me wondering what the rhyme or reason is behind some of the tokenizations. For example, the word “zoomed” gets split into two tokens, shown in different colors: “zo” and “omed”. I have trouble seeing how splitting the word that way is helpful, although the CLI switch is definitely useful, if only for revealing oddities like this. Is there somewhere I can read more about whether “zo” + “omed” is giving me what I want, and how to understand what is going on?

scenic trellis
lusty tiger

Thanks!