#Snowball Word Stemming in Gleam

hushed glen
#

https://hexdocs.pm/snowball_stemmer/index.html

After being nerd sniped into doing this by Louis last week, I implemented the Porter2/Snowball English word stemming algorithm in Gleam! It works on all ~40k test words from the Snowball Project and is ready to use should you need word stemming for text search, etc.

This is my first time publishing a Gleam package and my second proper project with Gleam, so I'd greatly appreciate any feedback, especially on performance (but not limited to it). My stemmer is decidedly slower than Louis' porter_stemmer, even though I've already gone through the low-hanging fruit of using pattern matching and splitter wherever possible instead of pop_grapheme, etc. Not entirely sure what can be done from here :lucydunno:

Thanks in advance!
-# Happy word stemming! (lol)

#

I suppose the initial hope was for this to provide better word stemming for the packages site, but I'm not sure if the perf drop is worth the nicer search. (It's not ridiculously slow or anything, it can stem all ~40k test words in under 0.2 seconds on my laptop)

uncut comet
#

That's plenty fast enough I think

lime robin
#

If yoshie says it, it must be true

sage spade
#

tell that to @sweet nacelle

uncut comet
#

Literally making it faster right now

sweet nacelle
#

Oh my god I look amazing

tame coral
#

Humility is for losers

sage spade
#

why did you get the most goblin photo of me

#

ffs 💀

odd geyser
#

ooh you're using splitter to find patterns without actually splitting on them

#

We could have a function for that, avoiding the work of actually splitting the strings
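For context, here is a minimal sketch of what "finding a pattern with splitter" can look like today. It assumes the splitter package's `new` and `split` functions, with `split` returning a `#(before, matched, after)` tuple whose middle element is empty when nothing matched. `contains_any` is a hypothetical helper name, not something from snowball_stemmer or splitter.

```gleam
import splitter

// Hypothetical helper: True if any of the patterns occurs somewhere in word.
// Assumption: splitter.split returns #(before, matched, after) and the
// matched element is "" when none of the patterns was found.
pub fn contains_any(word: String, patterns: splitter.Splitter) -> Bool {
  // The before and after parts are built but thrown away here, which is the
  // wasted work a dedicated "is this pattern present" function could avoid.
  let #(_, matched, _) = splitter.split(patterns, word)
  matched != ""
}

pub fn main() {
  // Build the splitter once and reuse it for every word.
  let vowels = splitter.new(["a", "e", "i", "o", "u", "y"])
  contains_any("rhythm", vowels)
}
```

The splitter is built once and reused, so the per-word cost is the search plus the unused string slices.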

odd geyser
#

Could expand all the bool.guards to case expressions to make it a bit faster
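A rough sketch of that rewrite, using a made-up check rather than code from snowball_stemmer (both function names are hypothetical). With `use`, the rest of the function body is passed to `bool.guard` as an anonymous function, while a `case` expression branches directly. How much that matters likely depends on the compiler version and target, so it is worth measuring.

```gleam
import gleam/bool
import gleam/string

// Hypothetical check written with bool.guard: words shorter than three
// graphemes are rejected early, otherwise we test for a trailing "s".
pub fn ends_in_s_guard(word: String) -> Bool {
  use <- bool.guard(when: string.length(word) < 3, return: False)
  string.ends_with(word, "s")
}

// The same logic expanded into a case expression. No callback is involved,
// the two branches are evaluated directly.
pub fn ends_in_s_case(word: String) -> Bool {
  case string.length(word) < 3 {
    True -> False
    False -> string.ends_with(word, "s")
  }
}
```

Both functions return the same result for every input, so the swap can be done one guard at a time.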

odd geyser
#

aye

#

Are you also using it to check if it starts or ends with patterns?

#

Could you open an issue for checking if a pattern is in a string please

odd geyser
#

For starts-with, pattern matching will be faster
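As a small illustration of the starts-with case: Gleam can match a literal string prefix directly with `<>` in a pattern, so no splitter is needed. The function name and the choice of prefixes are just for illustration and are not taken from snowball_stemmer's source.

```gleam
// Hypothetical helper: if the word starts with one of the listed prefixes,
// return the remainder after that prefix, otherwise return the word as is.
pub fn drop_known_prefix(word: String) -> String {
  case word {
    // "prefix" <> rest matches a literal prefix and binds what follows it.
    "gener" <> rest -> rest
    "commun" <> rest -> rest
    "arsen" <> rest -> rest
    _ -> word
  }
}
```

Only literal prefixes can be matched this way. Checking the end of a string, or a pattern in the middle, still needs something like `string.ends_with` or splitter.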

#

Splitter won't ever have regular expressions as it is fast, and regular expressions are slow on Erlang

hushed glen
#

oh, I thought splitter used regexp under the hood, don't know where I got that from 😅

hushed glen
#

ah, that must have been it

hushed glen
#

oh. right, yeah, that's smart

hushed glen
#

hey @odd geyser, I was wondering if I could try making the packages site work with snowball_stemmer and maybe make a PR?

odd geyser
#

Sounds good! Interesting to see how it compares