#sona toki - a toki pona parser that allows for multiple grammatical translations of one sentence.


mental obsidian
#

Heyo! I've been working on a toki pona parser that allows for many different legal translations of one toki pona sentence!

It doesn't translate (yet) but it can determine parts of speech! As of now, it's relatively bug-free, and can handle numbers, prepositions, la-phrases, and other fun things.

I started working on the first version of this about two to three months ago, but then got hit by severe burnout, and wasn't able to continue working on it until a couple of days ago. (It was actually @raven bluff who gave me the inspiration to keep working on this, so thank you!)

Hopefully it's something you can all find useful / interesting!
I've included a picture of the parser in action, running the main.py script included in the repository. (The number under each parse is the bot's personal ranking of each parse - the lower the number, the better!)

If any of you have any ideas or ways I can improve it, please let me know!

Planned features include:
. Translation tools
. Spell-check
. A better punctuation lexer
. A better way to interface with the parser
https://github.com/dayofni/sona-toki


raven bluff
#

wawa a

fast canyonBOT
#

pona tawa sina a!

lipamanka Koko ↩️

[Reply to:](#1055739224487370802 message) wawa a

grim phoenix
#

Wow, I love this! Better than the rest available, for sure

fast canyonBOT
#

Thank you! There’s definitely a lot that can be done to improve it, but after looking through comments left by others on the other translation projects, I think it might be a good idea to start work on a sona toki bot sooner rather than later. I’ll keep posting updates here, if I can!

jan Kasikusa ↩️

[Reply to:](#1055739224487370802 message) Wow, I love this! Better than the rest available, for sure

void shale
#

interesting! jan Mika was working on something similar a while ago in a different server.
how does yours determine the score for each interpretation?
does it handle "ki" based relative clauses?

fast canyonBOT
#

For each interpretation, it iterates over each token and gives it a score based on a couple of factors. My code would probably do a better job of explaining than I can, so you may want to check in tok_translator.py, if you can read it.
It’s only pu (and a tiny bit of ku) at the moment, so it doesn’t handle “ki” yet.
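
A rough sketch of what per-token scoring along these lines could look like. It is not taken from tok_translator.py: the class names follow the parser output posted later in this thread, but the penalty weights and the function names `score_parse` and `rank_parses` are assumptions made up for illustration. As in the opening post, a lower score means a better parse.

```python
# Hypothetical penalty weights: NOT the values used in tok_translator.py.
PENALTIES = {
    "ContentToken": 1,
    "Preverb": 2,
    "NumberToken": 3,
}
DEFAULT_PENALTY = 0  # particles and anything unlisted cost nothing here


def score_parse(token_classes):
    """Sum a per-token penalty; a lower total means a better parse."""
    return sum(PENALTIES.get(name, DEFAULT_PENALTY) for name in token_classes)


def rank_parses(parses):
    """Return the parses sorted best-first (lowest score first)."""
    return sorted(parses, key=score_parse)


parses = [
    ["NumberToken", "PredicateParticle", "ContentToken"],
    ["ContentToken", "PredicateParticle", "ContentToken"],
]
for parse in rank_parses(parses):
    print(score_parse(parse), parse)
```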

jan Asiku (Puwepu) ↩️

[Reply to:](#1055739224487370802 message) interesting! jan Mika was working on something similar a while ago in a different server.
how does …

void shale
fast canyonBOT
#

Thank you! I’ll try to keep everyone here updated

fast canyonBOT
#

Gloss function might be done, I'd like some testing done on it before I push it to the repo.
Laptop's going flat, so no screenshots yet, sadly.

fast canyonBOT
#

It's working!

fast canyonBOT
#

Gloss has been pushed to GitHub

#

Time for a spellchecker/feedback parser, then onward to making a discord bot

upper pagoda
#

I'll probably port it to javascript (because fuck python i hate python) but may i lanpan this

fast canyonBOT
#

Hah! Fair, it should be fairly easy to translate as it's mainly just classes.
What would you do with it?

jan Mika ↩️

[Reply to:](#1055739224487370802 message) I'll probably port it to javascript (because fuck python i hate python) but may i lanpan this

fast canyonBOT
#

Got it.

#

Sure, go ahead, I can't really stop you! Just make sure you give appropriate credit, please!

upper pagoda
#

out of curiosity how do you deal with prepositions?

fast canyonBOT
#

with great ease >:D

jan Mika ↩️

[Reply to:](#1055739224487370802 message) out of curiosity how do you deal with prepositions?

upper pagoda
#

thats a lie

fast canyonBOT
#

Jokes aside, it's actually not difficult

#

I generate grammatical permutations for each sentence, and then grammar check each one individually

#

Return the ones that work
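
A minimal sketch of the generate-then-check idea, assuming a tiny hand-written lexicon and a single toy grammar rule. `WORD_CLASSES`, `is_grammatical`, and the rule itself are hypothetical stand-ins, not sona toki's actual lexicon or grammar checker.

```python
from itertools import product

# Hypothetical mini-lexicon: every class each word may take.
WORD_CLASSES = {
    "mi": ["ContentToken"],
    "ken": ["ContentToken", "Preverb"],
    "pona": ["ContentToken"],
    "e": ["DirectObjectParticle"],
    "ni": ["ContentToken"],
}


def is_grammatical(classes):
    """Toy rule (an assumption): a preverb must be followed by a content word."""
    for i, name in enumerate(classes):
        if name == "Preverb":
            following = classes[i + 1] if i + 1 < len(classes) else None
            if following != "ContentToken":
                return False
    return True


def candidate_parses(words):
    """Every combination of one class per word."""
    options = [WORD_CLASSES[word] for word in words]
    return [list(combo) for combo in product(*options)]


def grammatical_parses(words):
    """Keep only the candidates that pass the grammar check."""
    return [p for p in candidate_parses(words) if is_grammatical(p)]


print(grammatical_parses("mi ken".split()))
print(grammatical_parses("mi ken pona e ni".split()))
```

With this toy rule, "mi ken" keeps only the reading where "ken" is a content word, while "mi ken pona e ni" keeps both readings of "ken".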

upper pagoda
#

I've been wanting to make an ilo Kukole discord for a while now but I never got around to it

fast canyonBOT
#

I've been planning on making a toki pona computer translation server for a minute now

jan Mika ↩️

[Reply to:](#1055739224487370802 message) I've been wanting to make an ilo Kukole discord for a while now but I never got around to it

raven bluff
#

I'm going to steal your method

fast canyonBOT
#

Again, I can't stop anyone, just give proper credit!

upper pagoda
#

bet, I'll actually finish the one i created for ilo Kukole and release it

fast canyonBOT
#

👍

jan Mika ↩️

[Reply to:](#1055739224487370802 message) bet, I'll actually finish the one i created for ilo Kukole and release it

upper pagoda
#

this is a thing but I've been too lazy to actually setup all the channels and shit

fast canyonBOT
#

But yeah, my process simplifies to "find grammatical permutations, grammar check and rank"

jan Le | dayofni ✨ ↩️

[(click to see attachment)](#1055739224487370802 message)

#

It's surprisingly fast as I use a system of translating from base-10 to a number system with a base variable for each digit to find grammatical permutations
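
One way to read this is as mixed-radix counting: each word position is a digit whose base is the number of classes that word can take, so every combination corresponds to exactly one integer. The sketch below assumes that reading; `mixed_radix_digits` and `enumerate_choices` are illustrative names, not functions from the repository.

```python
def mixed_radix_digits(index, bases):
    """Convert an integer to digits where position i has its own base bases[i]."""
    digits = []
    for base in reversed(bases):  # least significant position first
        index, digit = divmod(index, base)
        digits.append(digit)
    return digits[::-1]


def enumerate_choices(options):
    """Yield every way of picking one item from each list in `options`."""
    bases = [len(opts) for opts in options]
    total = 1
    for base in bases:
        total *= base
    for index in range(total):
        digits = mixed_radix_digits(index, bases)
        yield [opts[d] for opts, d in zip(options, digits)]


options = [["ContentToken"], ["ContentToken", "Preverb"], ["ContentToken"]]
for choice in enumerate_choices(options):
    print(choice)
```

Counting this way needs no recursion, and any single combination can be rebuilt from its index alone.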

upper pagoda
#

how accurate is it? does it give options that just couldn't be the possible one? does it rank them by likelihood? if so how do you determine that?

#

It's surprisingly fast as I use a system of translating from base-10 to a number system with a base variable for each digit to find grammatical permutations
Oh I can improve that, lemmie start a rust project /s

fast canyonBOT
#

Decently accurate - the correct parse should always be in the output.

jan Mika ↩️

[Reply to:](#1055739224487370802 message) how accurate is it? does it give options that just couldn't be the possible one? does it rank them…

#

It does give grammatical options that you probably shouldn't consider, but they're usually outranked by other options

#

There are edge-cases when it comes to interjections, but otherwise, it doesn't return anything absolutely incorrect.

upper pagoda
#

I'm known as "The girl who made the toki pona machine translator bot" but I really know almost nothing about NLP and am just a web developer that made it out of pure spite and now I'm too deep to go back

fast canyonBOT
#

Ranking parses is something that's just simple enough that it's worth just looking at the code

#

Oh, same

jan Mika ↩️

[Reply to:](#1055739224487370802 message) I'm known as "The girl who made the toki pona machine translator bot" but I really know almost nothi…

#

I don't know what's possible and what's not, and as such, I break them rules

upper pagoda
#

a lot of my making of ilo Kukole was finding an NLP discord and bombarding them with questions

fast canyonBOT
#

Hah, fair

#

Most of my process here was getting over burnout due to school and finishing it

#
jan Le | dayofni ✨ ↩️

[Reply to:](#1055739224487370802 message) Ranking parses is something that's just simple enough that it's worth just looking at the code

upper pagoda
#

but yeah... ilo Kukole's primary purpose wasn't even for this server. I made ilo Kukole because the mods of an r/place website I'm a developer for banned toki pona because they didn't understand it, which is fair, but they also allow russian speaking people to use russian leetspeak on the site because russian is a "real language" and toki pona isn't on google translate and that crossed a line for me so i literally made... "Google" translate... ilo Kukole...

fast canyonBOT
#

Hah! That explains the name

upper pagoda
#

i thought it was funny

fast canyonBOT
#

It is!

upper pagoda
#

the tokiponization is based more on the look of "google" than the pronunciation

fast canyonBOT
#

But yeah, I've been trying to make a solely rules-based set of tools for toki pona parsing and translation

upper pagoda
#

I think the perfect translator lies in between those two

fast canyonBOT
#

Hence tools - not translators

upper pagoda
fast canyonBOT
#

Mhm

#

My method is basically "condense the tokens down, don't go recursive without end"

upper pagoda
#

so my idea is to have databases of words. since english and toki pona share SVO structure, getting the right words in the right order programmatically well enough can be grammar corrected with AI for a decent translation. So you'd have a database of content words, a database of verbs, and databases of content words and verbs with adjectives translated as one english word, then you'd search the database starting with the largest groups of words and moving your way down if that makes sense

#

basically putting ku in a database and then searching it, but not actually using ku and making your own database because you can't legally use ku like that
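
The "largest groups first" search can be sketched as a greedy longest-match lookup. Everything here is an assumption for illustration: a plain dict stands in for the database, and the handful of glosses in `PHRASES` are examples, not entries from ku or from ilo Kukole.

```python
# Illustrative phrase table (an assumption); a real one would live in a database.
PHRASES = {
    ("jan", "pona"): "friend",
    ("tomo", "tawa"): "car",
    ("jan",): "person",
    ("pona",): "good",
    ("tomo",): "house",
    ("tawa",): "moving",
}
MAX_LEN = max(len(key) for key in PHRASES)


def longest_match_gloss(words):
    """Gloss a word list, always trying the longest known phrase first."""
    glosses = []
    i = 0
    while i < len(words):
        for size in range(min(MAX_LEN, len(words) - i), 0, -1):
            chunk = tuple(words[i:i + size])
            if chunk in PHRASES:
                glosses.append(PHRASES[chunk])
                i += size
                break
        else:
            glosses.append(words[i])  # unknown word: pass it through unchanged
            i += 1
    return glosses


print(longest_match_gloss("tomo tawa pona".split()))
```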

fast canyonBOT
#

Mhm.

#

I think I see what you're getting at

upper pagoda
#

mongodb would be the best choice imo

#

SQL databases aren't good for this purpose

fast canyonBOT
#

Mhm

#

Another thing that you could consider is something Markov Chain-esque

upper pagoda
#

and i also want to have a database of "verified translations", so every time someone sends a command to ilo Kukole, that toki pona sentence (if it's unique) gets sent to a channel where a team of trusted people can verify and/or correct it before verifying it

fast canyonBOT
#

I see

#

Yeah, something along those lines could be good!

upper pagoda
#

then... we suddenly have hundreds more sentences to retrain the machine learning model ilo Kukole runs on... therefore making it better

#

this process then forms an endless loop of improvement

#

(and I can let people submit their own corrections to the bot too)

fast canyonBOT
#

I know of some datasets that might be of use - Tatoeba is one, if you haven't seen it

jan Mika ↩️

[Reply to:](#1055739224487370802 message) then... we suddenly have hundreds more sentences to retrain the machine learning model ilo Kukole ru…

upper pagoda
#

Tatoeba is what ilo Kukole is trained on

fast canyonBOT
#

However, sanitize that data

#

Ah, nice!

jan Mika ↩️

[Reply to:](#1055739224487370802 message) Tatoeba is what ilo Kukole is trained on

upper pagoda
#

it is very lightly sanitized and I am very very aware of its badness

fast canyonBOT
#

👍

upper pagoda
#

hence why I would like to make my own dataset that I can trust to be good using the bad dataset and correcting it

fast canyonBOT
#

Yep, mi sona a

upper pagoda
#

here lemmie make an invite to the ilo Kukole server and you can help set up the channels and stuff

#

I'm gonna need some extra people helping out with it anyways

fast canyonBOT
#

Sounds good!

fast canyonBOT
#

Updated the README to be more than like one or two lines of text

fast canyonBOT
#

Added vocative word class - I'd forgotten to do that!

fast canyonBOT
#

Hija everyone!

#

Do you have any features you'd like in sona toki for your own projects?

#

If so, let me know!

#

what features do you have thus far?

fast canyonBOT
#

Currently, a parser, a grammar-checker, and a horribly-formatted glosser

#

@shadow perch

fast canyonBOT
#

New pfp for the bot!

quasi idol
#

super cool!

fast canyonBOT
#

o pona tawa sina!

soweli Deni ↩️

[Reply to:](#1055739224487370802 message) super cool!

#

I’m pretty proud of how it’s coming along

#

I’d like a decent testing set of sentences/short paragraphs to work with, though

quasi idol
#

Take the transcripts from kalama sin

void shale
#

you know i just realized is this not the first tp parser?

upper pagoda
raven bluff
#

is this the first parser that outputs multiple results?

upper pagoda
#

to my knowledge that is true

raven bluff
#

ooh

upper pagoda
#

feel free to prove me wrong on that but i think it is

fast canyonBOT
#

It most certainly is not!

jan Asiku ↩️

[Reply to:](#1055739224487370802 message) you know i just realized is this not the first tp parser?

quasi idol
#

🤔 is the bot on the kama sona server

fast canyonBOT
#

Which one?

soweli Deni ↩️

[Reply to:](#1055739224487370802 message) 🤔 is the bot on the kama sona server

#

Pluralkit or sona toki?

#

(in the latter case - it shouldn’t be)

quasi idol
#

sona toki

#

Oh I was thinking it might be useful

fast canyonBOT
#

I need to find a way to host it - as of now, I'm only able to host the bot locally

kon Deni ↩️

[Reply to:](#1055739224487370802 message) Oh I was thinking it might be useful

raven bluff
#

does this parse weird things like "jan tu wan pona tu wan"?

fast canyonBOT
#

It should… let me just check

raven bluff
#

I'm planning on banning it to my translator

fast canyonBOT
#

Yup, it handled it fine

tonsi Koko ᜃᜓᜓᜃᜓᜓ ↩️

[Reply to:](#1055739224487370802 message) I'm planning on banning it to my translator

#
```
----------------------------------------------------------

3-SUB-person-two-unique-good/simple
27

SUB-person-two-unique-good/simple-two-unique
32

1-SUB-person-two-unique-good/simple-two
32
```
#

Three simple, unique, two-people is apparently how it thinks it should be read as

quasi idol
#

that makes sense

fast canyonBOT
#

okay so it's been a minute

#

```
sina en mi li ken ala sona la ale li pona
-> ['ContentToken', 'AddSubjectParticle', 'ContentToken', 'PredicateParticle', 'ContentToken', 'ContentToken', 'ContentToken', 'ContextParticle', 'ContentToken', 'PredicateParticle', 'ContentToken']
-> ['ContentToken', 'AddSubjectParticle', 'ContentToken', 'PredicateParticle', 'Preverb', 'ContentToken', 'ContentToken', 'ContextParticle', 'ContentToken', 'PredicateParticle', 'ContentToken']
-> ['ContentToken', 'AddSubjectParticle', 'ContentToken', 'PredicateParticle', 'ContentToken', 'ContentToken', 'ContentToken', 'ContextParticle', 'NumberToken', 'PredicateParticle', 'ContentToken']
-> ['ContentToken', 'AddSubjectParticle', 'ContentToken', 'PredicateParticle', 'Preverb', 'ContentToken', 'ContentToken', 'ContextParticle', 'NumberToken', 'PredicateParticle', 'ContentToken']

o ni ala!
-> ['ImperativeParticle', 'ContentToken', 'ContentToken']

jan Len o!
-> ['ContentToken', 'ProperAdjective', 'VocativeParticle']

jan Len o ni ala!
-> ['ContentToken', 'ProperAdjective', 'ImperativeParticle', 'ContentToken', 'ContentToken']

toki a!
-> ['ContentToken', 'ContentToken']
-> ['ContentToken', 'Interjection']

mi ken pona e ni!
-> ['IgnoreLiSubject', 'Preverb', 'ContentToken', 'DirectObjectParticle', 'ContentToken']
```
#

we're plugging away on sona toki v5 (yes, we may. have gone. through a couple of new prototypes)

#

this version should. hopefully. not have abysmal programming style, and should be much faster overall than sona toki v3

#

oh and you guys should be able to make your own nasin so that it can support non-pu / non-ku parsing
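
How v5 exposes this isn't shown in the thread. Purely as an illustration of the idea, a nasin could be modelled as a word-class table that users layer their own entries over. The data model, `extend_nasin`, and the class name `RelativeClauseParticle` are all hypothetical and not the sona toki v5 API.

```python
# Hypothetical data model: NOT the sona toki v5 API.
PU_NASIN = {
    "li": {"PredicateParticle"},
    "e": {"DirectObjectParticle"},
    "la": {"ContextParticle"},
}


def extend_nasin(base, extra):
    """Return a new word-class table with `extra` layered over `base`."""
    merged = {word: set(classes) for word, classes in base.items()}
    for word, classes in extra.items():
        merged.setdefault(word, set()).update(classes)
    return merged


# A user-defined nasin that adds the non-pu word "ki" (class name is made up).
my_nasin = extend_nasin(PU_NASIN, {"ki": {"RelativeClauseParticle"}})
print(sorted(my_nasin))
```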

raven bluff
#

oh so like it is user extensible?

#

if so, that's super cool!