#Word frequency in Toki Pona


amber galleon
#

i have been running studies of toki pona's word frequency! i am not gonna publish any hard numbers until i have a more complete demonstration ready to go, but in the meantime i plan to share what i discover along the way.
half an hour ago i made this particularly interesting analysis, where i cross compare my frequency data from sentences that are "in" toki pona with data from all sentences whatsoever! i still do my same cleaning/normalizing, but i count all sentences regardless of their contents after that.

my goal was to attest something that i'd only held as an assumption before: that the effort to distinguish toki pona sentences from non-toki pona sentences is worthwhile for measuring the frequency of words as they appear in toki pona, as opposed to how often they appear at all. because if i could just pull the frequency of literally every word and then look for toki pona words in that, that would be much easier! there would still be problems like homographs being over-represented, especially "a", but the entire process would be much easier and faster.

the reason why i held this assumption is that i observe anecdotally that newer words are discussed much more in english than in toki pona. so, going into this analysis, i expected every word to receive some frequency increase, but that the frequency increase would be less pronounced for words which were more frequent in the first place (most core words), and more pronounced for words which are less frequent (most obscure and sandbox words).

and for the most part, i observe exactly that! there are some fascinating details in this cross-analysis. but first: a majority of core words (about 90 of them) see less than 15% of their use outside tp; half of them see 8% or less of their use outside tp. there are outliers such as "toki", "pona", and "sitelen" which are all near 38% use outside tp; i suspect this is due to "toki pona" and "sitelen pona" as phrases in English. "jan," "soweli," and "ma" are also outliers at 25% outside tp. "tonsi" sees 45% of its use outside tp, which i would guess is for the same reason as "jan" and "soweli": common way to personally identify yourself, which crosses into english!

here's where it gets interesting: 69.4% of pu's uses are in non-tp sentences. 68.7% of ku's uses are in non-tp sentences. this is true even against their 80th and 126th place frequency rankings, 14k uses and 6k uses respectively in tp sentences. put another way, these words are used significantly more often in english than in toki pona! roughly speaking, for every 10 uses of these words, you'd see ~3 in toki pona and ~7 in anything besides toki pona.

of course, broadly speaking, words become more represented in english as they become less frequent over-all, but comparing to reported uses in Linku doesn't always help. for example, "kin" is in the "common" usage group in Linku, but has 35k uses in tp- this puts it at rank 53, between "jo" and "pana." It and "n" are the only common words to break the top 100 in frequency. But, we'll continue with examining words by usage group as a convenient and recognizable way to identify words which are "less frequently used". it's accurate enough to make the point!

the important part of this analysis is the question: on average, do words outside core/common become more highly represented compared to words in core/common when you ignore whether the studied sentences are "in toki pona"? not to say they eclipse them in frequency entirely, but that they are used more often in english than core words are, as a percentage of their over-all frequency.

NOTE: this post has been edited since it was published, changing from "percentage increase" to "percentage in english". the insights are the same, it's just easier to reason about the outcome!

#

the answer is, strongly, yes!
for each of these words, the percentages are what portion of their uses are in languages other than toki pona

  • lanpan: 33.0%
  • oko: 37.6%
  • epiku: 33.8%
  • meso: 32.1%
  • jasima: 44.6%
  • nimisin: 81.4%
  • linluwi: 31.8% (also linuwi is at 24.0% but has ~100 uses)
  • majuna: 37.9%
  • ali: 20.7% (trend breaker, but a known special case!)
  • kokosila: 32.4%
  • su: 49.6% (note: it doesn't have usage data in linku yet; it was pushed into uncommon)
continuing to obscure:
  • kiki: 51.2%
  • oke: 14.4% (trend breaker, maybe because of its english etymology?)
  • apeja: 42.3%
  • powe: 35.9%
  • usawi: 48.7%
  • wuwojiti: 46.9%
  • yupekosi: 61.3%
  • omekapo: 31.1%
  • isipin: 51.3%
  • kapesi: 52.2%
  • puwa: 49.6%
i'm not gonna continue for text limit reasons, but i think my point is made: almost every word outside of core+common sees 30% or more of its usage in english, not in toki pona- and for most of them, it's much more than this. this trend continues into the sandbox, and even becomes more pronounced.
#

also, if you're interested in the methodology i use to determine if a sentence is "in toki pona", i describe my algorithm here:
#toki-ale message

and i'll copy it in in a sec!

#
  1. remove known irrelevant objects from a statement (codeblocks, emotes, urls, references [double brackets])
  2. split an entire statement into "words" and "non-words" called "tokens" (words are contiguous writing characters, or individual characters if they're known to be words such as UCSUR; non-words are contiguous punctuation/numbers/etc that aren't "for writing")
  3. clean tokens into a more easily parsed format (consecutive duplicates become a single letter, allowing "pooona" or "RrRrRrobert"; this makes it easier to validate via a regex)
  4. remove known irrelevant tokens from a statement (numbers, punctuation; i used to remove a known list of english words but i am not doing so currently)
  5. score all tokens by their highest matching filter (the scores are not hand picked, but these are the result, just shown for convenience)
     • almost all* words known to Linku or UCSUR, 1.0 (*: I exclude we, i, to, u, and ten)
     • words which match syllabically and are 3+ chars [not the slightly more strict phonotactic match], 0.8
     • words which are marked as names by their first letter being capitalized and are 2+ chars, 0.6
     • words which contain only letters in toki pona's alphabet (eg acronyms like tptpt) and are 3+ chars, 0.4
     • anything else, 0
  6. average all token scores and scale the score a bit if the input string is particularly short (with the sigmoid function)
  7. check if the result is greater than 0.8 (as an example, this allows up to 5 syllabically valid but unknown words, in a sentence made only of those)
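
a minimal python sketch of the steps above, not the actual implementation. the word list is a tiny hypothetical stand-in for Linku, the regexes and function names are assumptions, UCSUR handling is omitted, and the short-input sigmoid scaling is left out because its exact shape isn't described here:

```python
import re

# hypothetical stand-in for the Linku/UCSUR word list (the real one is much larger)
KNOWN_WORDS = {"toki", "pona", "mi", "sina", "li", "e", "o", "a",
               "jan", "lukin", "musi", "sona"}
# words the scoring step excludes from the known-word filter
EXCLUDED = {"we", "i", "to", "u", "ten"}

# assumed patterns for "known irrelevant objects"
CODEBLOCK = re.compile(r"```.*?```|`[^`]*`", re.DOTALL)
URL = re.compile(r"https?://\S+")
EMOTE = re.compile(r"<a?:\w+:\d+>|:\w+:")
REFERENCE = re.compile(r"\[\[.*?\]\]")

WORD = re.compile(r"[^\W\d_]+")  # runs of letters; numbers and punctuation fall away
SYLLABIC = re.compile(r"(?:[jklmnpstw]?[aeiou]n?)+")  # loose (C)V(n) syllables
ALPHABETIC = re.compile(r"[aeioujklmnpstw]+")  # only letters of toki pona's alphabet


def dedupe(token):
    """Collapse consecutive duplicate letters, ignoring case: 'pooona' -> 'pona'."""
    return re.sub(r"(.)\1+", r"\1", token, flags=re.IGNORECASE)


def score_token(token):
    """Return the score of the highest filter the token matches."""
    lowered = token.lower()
    if lowered in KNOWN_WORDS and lowered not in EXCLUDED:
        return 1.0
    if len(lowered) >= 3 and SYLLABIC.fullmatch(lowered):
        return 0.8
    if len(token) >= 2 and token[0].isupper():
        return 0.6
    if len(lowered) >= 3 and ALPHABETIC.fullmatch(lowered):
        return 0.4
    return 0.0


def is_toki_pona(message, threshold=0.8):
    """Average the token scores and compare against the threshold."""
    for pattern in (CODEBLOCK, URL, EMOTE, REFERENCE):
        message = pattern.sub(" ", message)
    tokens = [dedupe(t) for t in WORD.findall(message)]
    if not tokens:
        return False
    scores = [score_token(t) for t in tokens]
    # note: the sigmoid scaling for short inputs is deliberately not modelled here
    return sum(scores) / len(scores) > threshold


print(is_toki_pona("mi lukin e jan pona"))      # True
print(is_toki_pona("i think this is english"))  # False
```

because the filters are checked from highest score to lowest, the first match is also the highest one, which is what "highest matching filter" asks for.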
autumn spruce
#

oooo

#

this is really cool

thin creek
#

mu

autumn spruce
#

will you have a word frequency list for toki pona, similar to what was wanted by lipamanka for its course?

amber galleon
sharp lion
#

so, if i'm right, this is the frequency of nimisins coming up in convos not in toki pona?

hoary radish
#

how do you get more than 100% im confused

amber galleon
clear bear
#

i think the frequency isn't a total percentage of words in a sentence

#

but each word's increase in frequency compared to how frequent it normally is

amber galleon
# hoary radish how do you get more than 100% im confused

it's in comparison to how much that word showed up in toki pona sentences versus in english ones.
for example, if the word "foo" appeared 10 times in toki pona, but an additional 90 times across all sentences, it would have +900% frequency.
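
to make the arithmetic concrete, here's a small python sketch relating this "percentage increase" to the share of a word's uses that fall outside toki pona. the function names are made up for illustration, and "foo" is the same toy example as above:

```python
def percent_increase(tp_uses, other_uses):
    """Extra uses outside toki pona, relative to the uses inside toki pona."""
    return 100 * other_uses / tp_uses


def percent_outside_tp(tp_uses, other_uses):
    """Share of all of a word's uses that are outside toki pona (never above 100)."""
    return 100 * other_uses / (tp_uses + other_uses)


def increase_to_share(increase_pct):
    """Convert a 'percentage increase' into a 'percentage outside toki pona'."""
    return 100 * increase_pct / (100 + increase_pct)


# "foo": 10 uses in toki pona sentences, 90 more everywhere else
print(percent_increase(10, 90))    # 900.0
print(percent_outside_tp(10, 90))  # 90.0
print(increase_to_share(900))      # 90.0
```

the increase can go past 100% because it's measured against the toki pona count alone, while the share is measured against all uses and so tops out at 100%.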

undone plank
hoary radish
clear bear
#

so if a word is 10% overall, 150% frequency increase wouldnt mean 160% it'd mean 25% (10% + 150% of 10%)

vale dune
rotund wasp
#

very interesting! pu, ku and kinda su being more common in english and not very common in tp makes a lot of sense and was an assumption i had made, it's cool to see that it shows itself in the data!

amber galleon
#

thonk it's good to run into this confusion now so i can communicate the final result more clearly

rotund wasp
vale dune
#

is "nasin" used somewhat frequently in english?

rotund wasp
#

it is

amber galleon
#

36% increase
so, it's among the more moved core words

rotund wasp
#

generally used as a shorthand for "the way someone personally speaks toki pona"

#

oh you meant stats oops

vale dune
#

musi

amber galleon
#

another way to read this is "36% of this word's usage is in sentences which are not toki pona"

thin creek
#

@atomic ravine @mild valve o lukin

vale dune
rotund wasp
amber galleon
rotund wasp
#

real

vale dune
#

158/(158 + 100) of this word's usage is not in toki pona

ashen delta
#

I'd like to see the other synonyms, kin and namako

amber galleon
undone plank
amber galleon
ashen delta
#

musi

rotund wasp
#

oko is interesting

ashen delta
#

no, namako is interesting

rotund wasp
#

didnt expect it to be that high

thin creek
#

is namako up there

#

what do you even say with namako in English

amber galleon
rotund wasp
amber galleon
#

augh i hope i did not make that mistake in the original paragraphs lmao

rotund wasp
vale dune
#

i thought the "yes"s were a joke LOL

amber galleon
#

that would totally have been a bit if i had not just been misreading the question lmao

rotund wasp
#

fdsjhfsd

vale dune
#

are there any nimisin that were used more than lupa

amber galleon
rotund wasp
#

every time you say nimisin in english, someone says nimisin backwards in toki pona, creating negative usage which causes more than 100% of the use to be in english

amber galleon
# vale dune are there any nimisin that were used more than lupa

ehehhehehe
i am not going to reveal this, yet
let's say this:
all but one of the words which are more frequent than lupa are either nimi pu, nimi ku suli, or specific very common names.

and, based on my data so far, monsi is actually less used than lupa- although i expect both to go up in frequency as i start to include more communities in the study.

covert socket
#

yikes thats a lot of words
ill maybe look at this over the weekend when im not stressed from school

rotund wasp
amber galleon
#

i was shocked to see it initially, but in comparing it to three words which are its closest semantic neighbors, i see why it's there

#

actually, I'm still shocked to see it, cause i only personally know two people who use it

undone plank
#

o toki len e nimi len tawa mi 👀 [secretly tell me the secret word]

stoic gazelle
#

can we try and guess it

rotund wasp
#

im so curious now
eliki? we? alu?

amber galleon
stoic gazelle
#

oh what are the stats on te and to

amber galleon
#

they're fucked up in their own special way because to is a homograph of "to" and "too" due to my normalization!

rotund wasp
#

im trying to think of words that are really niche, used by a few people, but more common than lupa, and that have 3 semantically similar-ish neighbours
is it
penpo???

amber galleon
#

nope, but penpo is actually very interesting because it really breaks the trend
an additional 11% of its uses are in English, that's it!

rotund wasp
#

huh

#

makes sense since it's a toki pona joke but that is interesting

stoic gazelle
#

maybe apeja?

rotund wasp
#

there's no way apeja is used more than lupa

#

i think i've heard/seen it used once outside of talking about it on a meta level

stoic gazelle
#

hmmmm

#

i personally use powe but idk what other people think about it

rotund wasp
#

kkkkkkkkken???

#

probably the best guess

stoic gazelle
#

isipin 💀

rotund wasp
#

horrifying possibility

warm rock
amber galleon
#

i am very confident nobody will guess it
i also suspect that when i get to epoch analysis, i will understand what the deal is with this word

thin gate
#

Let me see if I get this- roughly, the percents you're talking about are (# of times a word is used in both tp and all other languages)/(# of uses in only toki pona conversations)

Or is it (Uses in everything that ISN'T tp)/(uses in only tp)

warm rock
#

maybe something like "this word is mostly found in tp sentences" vs "this word is mostly found in english sentences" could be displayed on the site

vale dune
amber galleon
#

nope aha

vale dune
#

is it even a linku word

amber galleon
#

yes!

vale dune
#

how high is epiku

warm rock
#

@amber galleon how many misspellings of aanusememailmahjong does your data include

rotund wasp
# amber galleon yes!

let's guess every linku word in order and the one he refuses to elaborate on is the nimi len

thin gate
amber galleon
warm rock
vale dune
amber galleon
#

right now my data is cut off at 100 uses, so vim actually loads the json file before next week

vale dune
#

LOL

warm rock
#

@amber galleon oh another fun one. given a dictionary of english words, what are the most common english words in toki pona sentences? (for a less trivial answer exclude homographs)

vale dune
#

there are 8.4 million messages in the server that's a lot of words but somehow it isn't as many as i was expecting

vale dune
#

actually real

warm rock
#

im thinking its gonna be names when theyre not tokiponised

#

like "ilo GitHub"

#

almost said "ilo 🧹", but thats an emote. my brain is fried

vale dune
#

i think emotes are far enough from toki pona that they would fit in this category

amber galleon
warm rock
#

okay chart idea

autumn spruce
warm rock
#

x = log num of times a word appears
y = share of times its in a tp sentence

#

should look somewhat diagonal but idk how curvy itd be
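
a rough python sketch of how those chart coordinates could be computed. `chart_points` is a hypothetical helper and the counts are invented for illustration, not real data:

```python
import math


def chart_points(tp_counts, all_counts):
    """Map each word to (log10 of its total uses, share of uses in tp sentences)."""
    points = {}
    for word, total in all_counts.items():
        if total <= 0:
            continue  # log is undefined for words that never appear
        tp = tp_counts.get(word, 0)
        points[word] = (math.log10(total), tp / total)
    return points


# made-up counts: "foo" appears 100 times overall, 10 of them in tp sentences
tp_counts = {"foo": 10, "bar": 900}
all_counts = {"foo": 100, "bar": 1000}

for word, (x, y) in chart_points(tp_counts, all_counts).items():
    print(word, x, y)
```

each pair can then go straight into whatever scatter plot tool you like, with x as the horizontal axis and y as the vertical one.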

amber galleon
#

my intent with the data visualization i will eventually go on to make is to allow you to select:

  • whether to include only toki pona or all messages
  • epoch (years, but starting from the creation of toki pona)
  • minimum* sentence length
  • what group(s) of words you want to include
vale dune
autumn spruce
#

sonaa

warm rock
#

you know i just realised

amber galleon
warm rock
#

every time you talk at length about maybe using a word, instead of just doing it, there is a star somewhere lowering its toki-pona-ness statistic

amber galleon
#

HAHAHAHHAHAHA

autumn spruce
#

true!!!!!

rotund wasp
#

Except time ig

amber galleon
warm rock
#

btw are you gonna put your "is this string toki pona or not" algorithm out as a python package or something? i feel like itd be useful

vale dune
amber galleon
#

although, actually, interesting observation:
for the data i'm actively hanging onto, which is tokens that have >100 uses, toki pona only sentences discover 733 unique tokens
obviously if i include all sentences, i'm gonna get more than that, but the difference is surprising: there are 9883 unique tokens which have more than 100 uses among all sentences!

i haven't done any post-processing of the discovered tokens, so like, there's plenty of weird stuff like single letters and the string "os", that is, a command prompt that used to be used to summon a bot that wrote sitelen pona. similarly, all versions of one word are represented in english, like "run" and "running", since i'm not normalizing those.

amber galleon
warm rock
#

cool cool

amber galleon
#

it's designed in such a way that you can design your own algorithm and plug it into the ilo class, or take one of the ones i've already set up

#

i should go through and document the why/why not use of more config items

warm rock
#

the pfp + description on this are so funny together, i cant stop giggling at this

amber galleon
#

HEHHEHEHEE

runic basin
#

pona? in my toki?
it's more likely than you think.

amber galleon
#

actually, it allows more than those groups, but you get the point lol

vale dune
#

that's so silly

warm rock
#

nice lol

vale dune
#

discord really needs to add a way to jump to the original post in a forum thread

warm rock
#

UX? in this economy?

amber galleon
#

if i find the time today, i am gonna rewrite the percentages with a less confusing percentage to reason about- "what percentage of this word's uses are in English"

livid sequoia
#

anecdotally, searching any sandbox word in mpptp finds you mostly people discussing the word in english, rather than them using it. i guess it's nice to have data that confirms that

amber galleon
# amber galleon i have been running studies of toki pona's word frequency! i am not gonna publis...

hi, i have re-written this analysis to use the much more easily understandable percentage, "what portion of this word's uses are in languages other than toki pona?" or, put a simpler way, "when this word is used 100 times, how many of those uses are in english?"
the conclusions do not change, but the numbers are much easier to interpret now. and, i am available for more questions for a little while!

vale dune
#

yippee

rotund wasp
#

wawa!

#

curious about meli and mije
i assume they arent trend breakers like tonsi? or at least not to the same extent

vale dune
#

oke is interesting, i assume the reason for it is that people are inclined just to use ok or okay in english so the very tiny few occurrences of it are typically in toki pona

amber galleon
rotund wasp
#

i see!

#

that's still interesting, since meli and mije are some of the rarer words in toki pona

amber galleon
rotund wasp
#

can't wait for the 2 hour video "How many toki pona words are there now?"

amber galleon
#

LMAO

rotund wasp
#

around 135-ish

warm rock
amber galleon
#

i could determine the answer to this
i have author data

warm rock
#

here tho theres no obvious cutoff between used and not used ig

amber galleon
#

yeah, that's true

warm rock
#

theres a "cutoff" between mostly in toki pona and mostly in english

#

which is def also cool

vale dune
#

what are the ku suli words used more than monsi

amber galleon
#

some fun facts

  • in the "frequency of every word at all" list, "li" is the 8th most used word in the server, followed by "mi", which is within 1000 uses of "li", then "pona", which is ~40k below that. this discounts "a", though, because it's impossible to distinguish from english "a"!
  • the most used word in the server whatsoever is "i" with 1.17 million uses
  • almost nobody speaks isipin epiku! even my most generous analyses find a total of 1,295 sentences in the server which are "isipin epiku", and to get there i already had to discount ~40k single word sentences and ~15k two word sentences, and increase the minimum score to omit pu words that it was now reading as phonomatches. and like, looking at them, most of them are not even isipin epiku's style of nonsense, they're just lists of words, or short messages talking about or referencing these words. i did find this (again) though!

melome Kekan San....... 😳

amber galleon
# vale dune what are the ku suli words used more than monsi

from the bottom up, and among toki pona sentences:
kipisi, leko, lanpan, namako, epiku, oko, ku, tonsi, monsuta, kijetesantakalu, n, kin
also, soko misses the mark by ~30 uses.

from the bottom up, among all sentences:
soko, kipisi, leko, lanpan, epiku, namako, oko, monsuta, tonsi, ku, kijetesantakalu, n, kin.

this is a great question because it also demonstrates that the frequency rankings change in significant ways- soko is not only more frequent than monsi among this list, it's actually way further apart than before! and it isn't the only change

warm rock
round jewel
#

hmm i wonder why there is a similar amount of kijetesantakalu, tonsi, lanpan and soko osopon

amber galleon
#

HAHAHAHAHHA

#

THAT WOULD BE WILD but unfortunately there is a disparity of ~7k uses from kijetesantakalu to soko

hoary radish
#

i wonder how much linluwi is used

autumn spruce
#

34% it appears

amber galleon
#

linluwi has 1596 uses in toki pona

autumn spruce
amber galleon
#

a, sona

vale dune
#

i wonder if it would be meaningful in any way to plot this data with one axis being tp use and the other being non-tp use

livid sequoia
twilit flare BOT
#
sona pona

isipin epiku is a humorous style of Toki Pona in which no pu words may be used. The term was coined in 2022 by jan Kita. The name comes from a replacement of the words in "toki pona" with their closest non-pu equivalents, isipin (here "thought") and epiku ("epic").

vale dune
#

tenpo pini pi weka seme la sina lukin [how far back in time are you looking?]

#

tenpo ale pi ma ni anu seme [all of this place's history, or what?]

amber galleon