i have been running studies of toki pona's word frequency! i am not gonna publish any hard numbers until i have a more complete demonstration ready to go, but in the meantime i plan to share what i discover along the way.
half an hour ago i ran a particularly interesting analysis, where i cross-compare my frequency data from sentences that are "in" toki pona with data from all sentences whatsoever! i still do my same cleaning/normalizing, but i count all sentences regardless of their contents after that.
my goal was to test something that i'd only held as an assumption before: that the effort to distinguish toki pona sentences from non-toki pona sentences is worthwhile for measuring the frequency of words as they appear in toki pona, as opposed to how often they appear at all. because if i could just pull the frequency of literally every word and then look for toki pona words in that, that would be much easier! there would still be problems like homographs being over-represented, especially "a", but the entire process would be much easier and faster.
the reason why i held this assumption is that i observe anecdotally that newer words are discussed much more in english than in toki pona. so, going into this analysis, i expected every word to receive some frequency increase, but that the frequency increase would be less pronounced for words which were more frequent in the first place (most core words), and more pronounced for words which are less frequent (most obscure and sandbox words).
and for the most part, i observe exactly that! there are some fascinating details in this cross-analysis. but first: a majority of core words (about 90 of them) see less than 15% of their use outside tp; half of them see 8% or less of their use outside tp. there are outliers such as "toki", "pona", and "sitelen" which are all near 38% use outside tp; i suspect this is due to "toki pona" and "sitelen pona" appearing as phrases in english. "jan", "soweli", and "ma" are also outliers at 25% outside tp. "tonsi" sees 45% of its use outside tp, which i would guess is for the same reason as "jan" and "soweli": they are common ways to identify yourself, which crosses into english!
here's where it gets interesting: 69.4% of pu's uses are in non-tp sentences. 68.7% of ku's uses are in non-tp sentences. this holds even though they rank 80th and 126th in frequency within tp sentences, with 14k and 6k uses respectively. put another way, these words are used significantly more often in english than in toki pona! roughly speaking, for every 10 uses of these words, you'd see ~3 in toki pona and ~7 in anything besides toki pona.
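to make the "percentage outside tp" number concrete, here is a minimal sketch of how it could be computed from two sets of counts. this is not the actual pipeline behind the numbers above: the function name `outside_share` and the counts are made up for illustration, and it assumes the "all sentences" counts include the tp sentences.

```python
from collections import Counter

def outside_share(tp_counts, all_counts, word):
    """Fraction of a word's total uses that fall outside toki pona sentences.

    Assumes all_counts covers every sentence (tp and non-tp alike),
    so all_counts[word] >= tp_counts[word].
    """
    total = all_counts[word]
    if total == 0:
        return 0.0  # word never appears; treat its outside share as zero
    return (total - tp_counts[word]) / total

# made-up counts, purely illustrative (not the real data)
tp_counts = Counter({"pu": 3, "mi": 95})
all_counts = Counter({"pu": 10, "mi": 100})

for word in ("pu", "mi"):
    print(word, f"{outside_share(tp_counts, all_counts, word):.0%}")
```

with these toy counts, "pu" has 3 of its 10 uses in tp sentences, which is the same "~3 in toki pona and ~7 elsewhere" shape described above.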
of course, broadly speaking, words become more represented in english as they become less frequent over-all, but comparing to reported uses in Linku doesn't always help. for example, "kin" is in the "common" usage group in Linku, but has 35k uses in tp, which puts it at rank 53, between "jo" and "pana". it and "n" are the only common words to break the top 100 in frequency. but we'll continue with examining words by usage group as a convenient and recognizable way to identify words which are "less frequently used". it's accurate enough to make the point!
the important part of this analysis is the question: on average, do words outside core/common become more highly represented compared to words in core/common when you ignore whether the studied sentences are "in toki pona"? not to say they eclipse them in frequency entirely, but that they are used more often in english than core words are, as a percentage of their over-all frequency.
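one way to pose that question in code, as a rough sketch: take each word's outside share and average it within each usage group. the function name, the placeholder words, and the numbers here are all hypothetical, and this is an unweighted mean (every word counts equally, no matter how frequent it is), which is an assumption about what "on average" means.

```python
from collections import defaultdict

def mean_outside_share_by_group(shares, groups):
    """Average the per-word outside share within each usage group.

    shares: word -> fraction of its uses outside tp sentences
    groups: word -> usage group label (e.g. "core", "common", "obscure")
    """
    buckets = defaultdict(list)
    for word, share in shares.items():
        buckets[groups[word]].append(share)
    # unweighted mean: each word contributes equally within its group
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}

# placeholder words and made-up shares, not real results
example_shares = {"word_a": 0.10, "word_b": 0.30, "word_c": 0.60}
example_groups = {"word_a": "core", "word_b": "core", "word_c": "obscure"}

print(mean_outside_share_by_group(example_shares, example_groups))
```

if the less frequent groups come out with a clearly higher average than core/common, that supports the assumption that filtering for tp sentences matters.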
NOTE: this post has been edited since it was published, changing from "percentage increase" to "percentage in english". the insights are the same, it's just easier to reason about the outcome!
it's good to run into this confusion now so i can communicate the final result more clearly
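
for anyone who saw the earlier version: the two framings carry the same information, and you can convert between them. a small sketch, assuming "percentage increase" meant (all uses - tp uses) / tp uses and "percentage in english" means (all uses - tp uses) / all uses; both function names are made up here.

```python
def increase_to_outside_share(increase):
    """Convert a relative increase over tp-only counts into the share of uses outside tp.

    increase is a fraction, e.g. 1.0 means the count doubles when all sentences are included.
    """
    return increase / (1 + increase)

def outside_share_to_increase(share):
    """Inverse conversion; share must be below 1 (the word needs at least one tp use)."""
    return share / (1 - share)

# a word whose count doubles (100% increase) has half of its uses outside tp
print(increase_to_outside_share(1.0))
# pu's reported 69.4% outside share, expressed as a relative increase
print(round(outside_share_to_increase(0.694), 2))
```

the "percentage in english" framing is bounded between 0% and 100%, which is part of why it's easier to reason about than an increase that can grow without limit.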
