#📊📉 Gelman: purely functional statistics, in pure Gleam ⭐

1 messages · Page 1 of 1 (latest)

tall token
#

I've written out the scipy.stats implementation of descriptive, summary and frequency statistics in Gleam.

This implementation strives to be:

  1. 🥚 Simple to use, with a consistent interface. Every dataset is a List(Float), so the user needs to convert Ints accordingly!
  2. ⭐ Gleam-y. Functions can be chained together in pipelines, or their results used in other computations.
  3. 🌟 Purely functional in pure Gleam. The fold combinator is used extensively. map(g) followed by fold(f) is rewritten as fold(f(acc, g(a)) to avoid the memory overhead of intermediate lists.
  4. 🦦 Efficient. Gelman endeavors to perform one-pass over the data, even for higher order moments like skewness and kurtosis. If sorting is required, Gelman sorts only once, unless absolutely necessary.
  5. 🧪 Extensively tested. Tests are borrowed from scipy/stats, and so the results are guaranteed to be at least as accurate.

Please have a look here: https://hex.pm/packages/gelman

Gelman is named after Andrew Gelman, whose contributions to Bayesian statistics are fundamental, and whose name is a happy near-anagram of Gleam.

If you'd like me to implement other statistical functions, let me know! I plan on implementing the remaining descriptive, summary and frequency in scipy/stats and then I'd like to start on writing some statistical tests. I may reach out if I reach a roadblock, as many mathematical functions (such as C's special-functions) aren't available in Gleam yet. Please let me know if you are interested in helping out. Your comments are welcome. Have a lovely holiday season! 🎅🎅🎅

flint wolf
#

ooh

#

I'm not a maths-y person. When might I want to reach for this package?

tall token
#

whenever you

  • have a list of values that have been generated by some statistical process (everything is!)
  • and you might want to interrogate that process for its true nature, including questions like:
  • 😍 where along the real number line are you?
  • 😱 how spread along the real number line are you?
  • 👿 how frequently do you behave badly?
  • and so on!

Eventually, the plan is to integrate some random process generators into this library, during which time it might be of interest to game developers, who might need, say,

  • 🐟 a Poisson random variable to generate the number of gifts to give way on a daily basis
  • 💹 an exponential random variable to generate waiting times between encountering animals in the wild
flint wolf
#

Thank you

tall token
#

Furthermore, special care has been taken:

  • to test these functions extensively, against best-in-class implementations such as those in R or Python's scipy library.
  • to write one-pass algorithms using Gleam's combinators, so these functions will be guaranteed to be as good as gleam/list.fold! I
proven tiger
#

i may be incorrect, but isnt it usually the case that writing your own recursive functions usually performs better than list.fold?

lime igloo
#

You are correct

#

In terms of the API I would suggest a couple of things:

  • Move all the functions into one module. It doesn't really make sense to have multiple modules with one function each; usually just better to unify the API.
  • Don't have one error type for everything. I'm not going to get a NegativeValue error from median, but the fact that every function uses the same error type means I have to handle that possibility even if it never occurs. For median, I would probably just make it return Result(Float, Nil) as there's only really one way it can fail.
flint wolf
#

The one-error-per-package pattern is pretty common

#

Though I agree, a lot of them don't seem to need an error type at all

#

Having everything in one module would be a very nice UX improvement

tall token
# lime igloo In terms of the API I would suggest a couple of things: - Move all the functions...

the plan is to put about 2 more functions in summarize, 5 more functions in transform, and 5 more in discretize. Each module is separated by its purpose (and therefore its return type), where summarize returns a single value, transform returns the list with some function applied to the entire list, while discretize returns a bucket of values. But if it's not very gleamy, of course they can be consolidated.

flint wolf
#

It's best to look at API design from the user's point of view

tall token
flint wolf
#

Does having to import multiple modules make it easier to use, or harder? Typically it's harder as they have to look in more places, and have more names to manage when importing

flint wolf
#

Otherwise can design the error type for what would be useful for the user who wants to handle the error

proven tiger
#

also gelman is a module name that's unlikely to conflict with anything else the user might be using, whereas transform is much more likely to conflict (only the last part of the module name is used as the name of the module when importing it)

lime igloo
#

error even more so

flint wolf
#

It doesn't look like any of the functions in the modules would have their names collide, so could be merged easily. I don't think from the user's perspective they'd want to split it by "kind" of function

tall token
#

I see. Walk me through this:

  1. Let's say I want to call median on a list. This function only fails when the input is empty.
  2. I also want to know what the geometric_mean is. This function can fail when the input is empty, or when the values are nonzero.

If (1) has Result(Float, Nil) and (2) has Result(Float, StatsError, then I would need to pattern match using different constructors, so I would need to handle empty input in two different ways.

If (1) has Result(Float, StatsError) then I can at least catch the error in a similar way. But of course, median can only fail in one way, so I could go either way here.

flint wolf
#

I think both approaches would be fine

#

Perhaps the Nil one is slightly clearer, but you could resolve this with documentation stating which error variants are possible in each function's documentation comment

gilded ivy
dull hill
#

This is a neat project! Are you using an existing implementation as a guide?

tall token
dull hill
#

Nice. I'm pretty interested in this. I'm a research scientist, so I'm curious how numerical computing type stuff could be in gleam

tall token
#

i am interested too, which is why I began down this path of experimentation.

dull hill
#

I was skimming through your repository. Are you planning to implement some of the fancier algorithms for the basic stuff like variance?

#

It has been many years since I've gone through the literature on accurately computing this stuff

tall token
tall token
#

because you have to fold once across the entire dataset.

#

wait sorry - whcih fancier algo?

#

(this is equivalent to welford)

dull hill
proven tiger
# tall token which is why statements like these worry me a little.

not sure why that would be worrying? you can take a look at the source code that implements list.fold here
https://github.com/gleam-lang/stdlib/blob/v0.67.1/src/gleam/list.gleam#L748-L757
so fold itself is a recursive function, so writing your own recursive function just avoids the extra function call to your fold callback on each iteration

GitHub

🎁 Gleam's standard library. Contribute to gleam-lang/stdlib development by creating an account on GitHub.

#

the difference of a single function call per element usually isnt important though

#

so the better metric for whether you should write a recursive function is usually whether it would be semantically clearer

#

like having multiple accumulated arguments instead of just one

tall token
#

context: it strikes me as

(a) unlikely to be true on all targets, because even erlang's folds are considered to be very performant
(b) if not true, then that means functions have significant cost overhead?
(c) then why bother offering fold ("gleam is obvious, there should only be one way to do anything")
(d) if gleam is functional then shouldn't everyone that's very involved in this language drop everything and make fold performant, given how fundamental folding is to functional programming?

proven tiger
#

the cost is definitely not significant

tall token
proven tiger
#

my original comment was just to point out that list.fold can never be faster than the equivalent recursive function, because fold itself is implemented as a recursive function

#

the degree to which it is slower is almost always unimportant

tall token
#

Thank you for clarifying. I understand now with more context. But to have this validated by a moderator with no additional context was worrisome. I hope you can see why.

Coming from other functional programming language, almost no one says "write your own recursive function as it is faster". I was deeply surprised to hear.

proven tiger
#

ah yeah I can see how that would be confusing. the main reason to pick recursion over fold isnt performance, but clarity

dull hill
tall token
#

i think a secondary reason is that, from what i've gathered, unless someone is familiar with tail recursion, better performance might not even be guaranteed, because the compiler doesn't and won't detect when something is not being tail recursive.

However the implementation of fold is definitely tail-recursive.

https://github.com/gleam-lang/gleam/discussions/2721

GitHub

This would enforce any function with the attribute to be tail recursive, throwing a compiler error otherwise. This can make ensuring a function is tail recursive much easier, as it removes the need...

tall token
#

shoot i thought i tested it already with the examples in scipy.stats let me check again 🙁

proven tiger
#

for example your mean function I would probably use recursion for due to it accumulating two values, which would avoid the tuple wrapping/unwrapping and give each value clearer names:

pub fn mean(dataset: List(Float)) -> Float {
  do_mean(dataset, 0.0, 0)
}

fn do_mean(dataset: List(Float), sum: Float, num: Int) -> Float {
  case dataset {
    [] -> sum /. int.to_float(num)
    [first, ..dataset] -> do_mean(dataset, sum +. first, num + 1)
  }
}
dull hill
dull hill
#

I'm trying to go back to the welford paper, but I have to log into my institutional email to get access to the journal

tall token
#

hold on

tall token
dull hill
#

I think you should do the division each iteration of the fold, no?

#

to be Welford's

#

I think the idea of that is to avoid the large number problem

tall token
#

ok!

#

i'll implement that, thanks for spotting.

dull hill
#

btw, is your background like numerical computation or more classical stats?

tall token
#

both. although a lot of the classical stats has been returned to sender.

#

😢

#

like i had to revise how to do rejection sampling for normal distribution

#

how about you?

dull hill
#

Neither...I was a pure math undergrad, then bioinformatics/ecology phd

tall token
#

nice!

#

do you do a lot of modeling?

dull hill
#

btw, are you planning to use this package for work?

dull hill
tall token
#

not at all. i no longer work in software!

#

ah 🙁 do you mostly use R? if you do, which packages do you usually use? 🙂

dull hill
#

maybe this chat thread should be in off-topic, haha

tall token
#

haha okay! 🙂 in the meantime please report bugs to me, i will fix! thank you.

dull hill
tall token
#

if you also have ideas on what kind of packages you want to implement in gleam, please let me know how i can support you. in particular i've written quite a lot of modeling, hypothesis testing. we can experiment together!

daring hearth
#

I've been considering reimplementing a lot of stats functions for a while either into the gleam_community_maths lib or another standalone lib, but now I wonder if it'd be more suitable to maybe contribute to this one?

tall token
#

if people are interested, i think there could be some benefit to reorganzing the maths functions vs the stats implementations because those have different kinds of tests.

#

like eventually, the stats functions will need to have probabilistic tests, which guarantee things only probabilistically, which means that they can fail like, say, once every 10000 times.

daring hearth
#

Yeah, I did some work on that 3 years ago. Testing different moments when sampling from probability distributions.

tall token
#
  • some of the stats functions.
#

how did you think about ints vs floats?

#

thank you so much for responding. I appreciate it.

daring hearth
tall token
#

let's do it

#

i'll support you

#

tell me how i can be helpful 🙂

#

although i think if you use iterators for some reason people get very angry; I'm a bit scared.

daring hearth
#

Using iterators/yielders is the correct way to do it I'd say, since there's theoretically no end to the stream of numbers.

#

I.e. the sequence is unbounded and is thus modeled appropriately by an iterator/yielder.

tall token
#

I pretty much got shot down immediately.

#

One person even said that each call to the function should receive a new seed (infeasible, since that would violate the guarantees of pseudorandomness).

daring hearth
#

I think the most reasonable and straight-forward solution is to simply return (a number of requested) random values and the last state, which will then need to be fed to the next function call requesting the next couple of random values.

#

If you use a prng you should care about the state and be aware of it.

#

So make it explicit.

#

The main reason I've put off re-implementing these things is due to lack of time to actually design the API so it's not just a copy of the PRNG library stapled to a loosely coupled set of functions.

#

Being able to pick the between different underlying prng could also be really nice, so a user is not just stuck with a single generator.

#

But then there'll have to be some way to ensure consistency based on which underlying generator is being used (ensuring different generators, when transforming to continuous uniform random numbers, are all able to cover the same interval [0, 1]).

#

As it'll then otherwise affect downstream number generation from various distributions.

tall token
#

I definitely share your concerns. Which is why it was hard for me to design the API. Very few RNGs have an internal state of one Int only. I don’t know how to correctly model RNGs with more complex internal state.

#

if we work together, would you have more time?

#

i can do grunt work since it's the end of the year.

daring hearth
#

Joining forces would definitely be an advantage I think :)

tall token
#

maybe i can take a first stab and you can review? but honestly, i feel quite hesitant if gleam is even the right language to do this with. 🙁

daring hearth
#

Irrespective of whether gleam is the right language to do stats in I still think it'd be nice to have the opportunity to do so and with a nice and easy to use library.

#

And I think there's definitely a gap :)

tall token
#

and it seems so painful to submit my numbers to DIEHARDER 😢

daring hearth
#

Dieharder?

#

Nvm. I've actually not bothered going that far myself haha

#

Another challenge is to make the implementation consistent across JavaScript and Erlang targets.

tall token
#

But this can only be done if the underlying math functions return the same values. Which can be assumed on average, but not at the extremes.

daring hearth
#

As well as finding prngs that are able to balance efficiency with quality.

#

The main problem I encountered has to do with different number types and bit operations.

tall token
#

Because at the extremes the functions are not pure and depend on implementation and machine (eg small values of floats)

daring hearth
#

Haha 😂

tall token
#

Help

#

But is consistent implementation across all values that important?

#

Like if we can’t choose any implantation to prioritize should the result be to prioritize no implementation

daring hearth
#

I think it'd be nice to be able to assure a user that there's consistency between targets.

#

The easiest way to do this is to stick with a PRNG that only uses 32-bit state, since JavaScript’s bitwise operations operate on 32-bit integers.

#

Otherwise the state needs to be split into several integers.

proven tiger
#

Or a bitarray

daring hearth
#

Ah, I see. I actually have not worked a lot with those.

#

The point is also that having to simply operate on just a single integer state is typically quite a bit cheaper than to operate on two or more to compute the next state.

#

I'm not sure how that concern applies to bitarrays though.

proven tiger
#

Yeah there aren't many bitwise operation functions for bit arrays so they might be harder to work with in that respect and require more work to implement the functions you need to work with them.

#

Ints are definitely easier to work with if they suffice for the use case

#

Another option if you want it to work across targets is to use the bigi library, which uses regular int operations on the erlang target, but uses bigints on the js target (which are slower than numbers ofc, but have the same unlimited size as erlang ints)

#

It depends on how much of a concern performance is I guess

flint wolf
#

Big ints are very slow though. I suspect performance matters here

#

(Though big ints are already used on the Erlang target)

daring hearth
#

Yeah, in my experience, for most practical applications speed matters more.

proven tiger
flint wolf
#

I would honestly really like the BEAM to use 64bit ints and ieee 764 floats 😅

daring hearth
#

To generate random numbers from some distributions you need to generate multiple uniform random numbers so performance costs adds up in the end.

flint wolf
#

I’m curious about that xor performance now. I wonder how fast we could get it

#

Or if there’s some tricks

#

I suspect if we can get bit arrays to be fast on Erlang we will fail to do the same on JS

daring hearth
tall token
#

What I meant was I would stab at an outline of what to implement and how to organize things.

daring hearth
#

Ah, awesome :)

tall token
#

For basic rng, it doesn’t matter, we can choose to work with only 32 bit integers, and in fact it is really easier to work with 32 bits in some sense. But if we had access to bit arrays, there are even more tricks that can be done to make things faster. I would have to reread some material to revise those tricks though.

proven tiger
#

did some int vs bitarray XOR benchmarks with gleamy_bench for anyone interested:

Input               Function                       IPS           Min           P99
32-bit numbers      bitwise_exclusive_or    10731.9813        0.0858        0.1072
32-bit numbers      xor_bit_arrays_1           71.4306       13.2475       15.3983
32-bit numbers      xor_bit_arrays_32         692.9644        1.3047        1.8673
64-bit numbers      bitwise_exclusive_or     4130.4059        0.2286        0.3006
64-bit numbers      xor_bit_arrays_1           39.3333       24.5132       31.7436
64-bit numbers      xor_bit_arrays_32         615.3671        1.5058        2.0025
128-bit numbers     bitwise_exclusive_or     4074.4599        0.2303        0.3055
128-bit numbers     xor_bit_arrays_1           20.8853       46.9694       50.9691
128-bit numbers     xor_bit_arrays_32         497.1598        1.8486        2.6258
256-bit numbers     bitwise_exclusive_or     3939.7725        0.2227        0.3202
256-bit numbers     xor_bit_arrays_1           10.4040       95.5089       97.1056
256-bit numbers     xor_bit_arrays_32         348.6144        2.7326        3.5070
512-bit numbers     bitwise_exclusive_or     3367.1450        0.2421        0.3876
512-bit numbers     xor_bit_arrays_1            4.0476      245.3371      248.2894
512-bit numbers     xor_bit_arrays_32         188.8585        4.9415        7.3087
#

bitwise_exclusive_or is just the stdlib int.bitwise_exclusive_or function.

xor_bit_arrays_1 is this function that iterates through the bit arrays bit by bit:

fn xor_bit_arrays_1(a: BitArray, b: BitArray) -> BitArray {
  do_xor_bit_arrays_1(a, b, <<>>)
}

fn do_xor_bit_arrays_1(a: BitArray, b: BitArray, acc: BitArray) -> BitArray {
  case a, b {
    <<0:size(1), a:bits>>, <<0:size(1), b:bits>>
    | <<1:size(1), a:bits>>, <<1:size(1), b:bits>>
    -> do_xor_bit_arrays_1(a, b, <<acc:bits, 0:size(1)>>)
    <<0:size(1), a:bits>>, <<1:size(1), b:bits>>
    | <<1:size(1), a:bits>>, <<0:size(1), b:bits>>
    -> do_xor_bit_arrays_1(a, b, <<acc:bits, 1:size(1)>>)
    _, _ -> acc
  }
}

and xor_bit_arrays_32 is this function that grabs 32 bits from each bitarray at a time and passes them to int.bitwise_exclusive_or (so it's basically just a JS-safe version of the stdlib function that adds overhead on the erlang target):

fn xor_bit_arrays_32(a: BitArray, b: BitArray) -> BitArray {
  do_xor_bit_arrays_32(a, b, <<>>)
}

fn do_xor_bit_arrays_32(a: BitArray, b: BitArray, acc: BitArray) -> BitArray {
  case a, b {
    <<i:size(32), a:bits>>, <<j:size(32), b:bits>> ->
      do_xor_bit_arrays_32(a, b, <<
        acc:bits,
        int.bitwise_exclusive_or(i, j):size(32),
      >>)
    _, _ -> acc
  }
}
flint wolf
#

Wow big difference

proven tiger
#

yeah

#

I made sure the int -> bitarray conversion wasnt included in the timing of functions by including both the int and bitarray versions of the number in the input given to all functions:

fn input(size: Int) -> bench.Input(List(#(#(Int, BitArray), #(Int, BitArray)))) {
  bench.Input(
    int.to_string(size) <> "-bit numbers",
    list.repeat(Nil, 10_000)
      |> list.map(fn(_) {
        let i = int.random(int.bitwise_shift_left(1, size))
        #(i, <<i:size(size)>>)
      })
      |> list.window_by_2,
  )
}
tall token
#

Thank you for helping to benchmark these. I appreciate that you went out of your way to do this.

tall token
#

@daring hearth , i've written out some initial thoughts here. Please chime in! Unfortunately, my bias is towards statistics.

#

If anyone else sees this, please also feel free to chime in!

daring hearth
#

I'll have a look in the evening :)!

tall token
#

thanks @dull hill for your comments!

rain cove
#

But I don't want to hijack your thread, please feel free to ping me in the original one or elsewhere

tall token
#

i will message you!

hallow lichen
#

I'm not anywhere near smart enough to have anything of value to add here but I'm just amazed by how cool this is haha. amazing stuff, especially in a language not designed around this particular usecase

tall token
#

hey @hallow lichen thanks for commenting! if you'd like to learn some stats and do some mathematical programming, can i convince you to write some code?

gilded ivy
tall token
#

math

#

and the community seems actively hostile towards it, might i add

#

😢

gilded ivy
#

@tall token where? You are free to create whatever library you want?

#

Do you mean number crunching?

tall token
#

well, i brought up two issues:

  1. How to hide mutable state from the end user (very important use case for random number generation)
  2. How to deal with discrete (Ints) vs continuous (floats) distributions in a language that does not have polymorphism

In both cases people seemed more interested in what I call "drive-by" helping. They're more interested in telling me about what Gleam isn't, rather than trying to fundamentally understand why a person that does mathematical programming might care about these two questions.

gilded ivy
#

AFAIU the 2nd thing will simply not happen. You can use Erlang and JS to implement whatever you want and write gleam wrappers around that, where isn't that good enough?

tall token
#

is this drive by helping?

#

i will engage if you are truly interested

gilded ivy
#

No, it is just a language design choice

#

I see you care but at the same time feel hurt, and I feel some mockery can hurt but also people (you? me?) need to accept when the other side says "no"

#

There is a difference between hostility and people not caring or not dealing with something ... It is their choice

#

You are free to still create libraries.

#

I don't like there to be literal division of variables by zero in gleam so I wrote a checker... no one else seems to care, so it is fine for them. Time to move on.
I missed if/elseif/else (or cond) in gleam, so I wrote a library to do it.
I didn't like that the bool.guard is not lazy by default, and that guards are so spread among type-modules in the stdlib, so I wrote a whole guard library (given)

tall token
#

sure.

#

back to your original question

#

this language was not designed for mathematics.

#

so much so that people are saying things like this

lime igloo
#

Silby does not represent the Gleam core team or any language design decisions that have been made

#

Most languages aren't designed for mathematics primarily

tall token
lime igloo
#

Yes

tall token
#

well, i can only speakl from my own perspective. i don't feel like

gilded ivy
#

What's the problem with +? + vs +.?

tall token
#

a) this language was designed for mathematics at all
b) the community was particularly helpful when I asked some questions that a mathematical programmer might be interested in.

#

as i said, "drive-by" helping is not helping

gilded ivy
#

It is certainly not Julia or Matlab, but what do you mean designed for

tall token
#

theres' no problem with +. + /.. But if you have to write mean_int, mean_float, var_int, var_float, skew_int, skew_float, kurtosis_int, kurtosis_float, moment_int, moment_float

gilded ivy
#

since there is no polymorphism, any function must have a specialized float and int version.
Yes this allows the type inference to work which is really great, because you can but you needen't write out types in private functions which makes it easier to change them, my 2 cents

tall token
#

if you are serious, i can walk you through the benefits and drawbacks

#

as you can see, i've done a reasonable amount of work to lay out the groundwork

#

would you like to help?

gilded ivy
#

Help with what, that decision will not change, IMHO it is just a question of:

  • can you work with it, do you want to work with it, do you understand the reasons behind if it helps with acceptance?
  • how can you deal with it if you need to / are there hacks around it? are these hacks any good?
tall token
#

for example, there are some functions that need to be written out, and then to write out the distributions.

#

so as you can see... drive by helping

gilded ivy
#

Well open source helping isn't mandatory. And people have other things to do.
They might also not share your enthusiasm for a specific topic. There is still 0 hostility.

#

Without shallowing everything in the google docs:
You mean the permutations of types you can input in say a and b? and you want to do certain operations on a and b and both can be either int or float?

#

You can wrap them in a Number type that can have a constructor MathInt/MathFloat or something

#

in sql I had to deal with int, float, string, bool, null, it still worked out

#

It is common for things to be wrapped in Gleam, then it can be used as one type and the functions deal with the different variants

#

There are probably better use cases / implementations, I am not speaking with any authority

tall token
# gilded ivy in sql I had to deal with int, float, string, bool, null, it still worked out

This is a strong counter example, since SQL is designed with the ergonomics of the querier in mind, therefore the following in SQL works:

SELECT
  *
FROM
  parents
WHERE
  num_children > (SELECT AVG(num_children) FROM parents)

In the above example,

  • num_children is a BIGINT
  • AVG is an overloaded function designed to work with any numeric type, including BIGINT and FLOAT
  • AVG is a function that returns a DOUBLE or FLOAT (depending on particular implementation)
  • > is a function that is overloaded to work with any numeric type on the left-hand side, and any numeric type on the right-hand side.

Clearly, in SQL, the developer never cares about having to do something like the following:

SELECT
  *
FROM
  parents
WHERE
  TO_FLOAT(num_children) >. (SELECT AVG(TO_FLOAT(num_children) FROM parents)

And that, indeed, is the issue with the comment that you made:

gilded ivy
#

oh clearly they do in postgres sometimes in my experience 😄

#

but the point is not if you like any of this but just that this is one way to do it in gleam

#

still no hostility, just pragmatism

#

polymorphism AFAIU adds a lot of complexity to the language, and IMHO the language tries to be simple and reserve the brain-power of the developer to deal with domain logic, concurrency design etc.

#

It is a bit more verbose but it works well enough

tall token
#

enough for what?

gilded ivy
#

enough to design libraries

#

You will not become happy if you try to change the circumstances here... that ship sailed years ago

#

This is a stable language now

tall token
#

as i've said multiple times

#

i am not looking to change the language

#

my issue is with your statement that you think that this language is suitable for maths

#

yes, i can create whatever i want

gilded ivy
#

Where would it not be?

#

Wrapping things is not a problem, you can create a builder API and have a nice internal structure

#

The whole paradigm is not about mutating memory and neither is the BEAM about number crunching. That is IMHO a meaningful statement.

tall token
#

and yet Elixir Nx is a wildly successful numerical crunching library written for BEAM

gilded ivy
#

... so, use it 🙂

tall token
#

So that IMHO is not a meaningful statement

#

anyway, as i said, drive by shooting

#

i have realized that a lot of people here just want to have opinions

stone elbow
#

Nx is such a cool project! I think there was a gleam project providing some wrapped around it? 🤔

gilded ivy
#

it is neither shooting nor drive by, I ll try to get your point later

stone elbow
#

The name is eluding me though

tall token
#

and they don't like to hear that it is not particularly ergonomic writing anmd designing APIs around. mathematics

#

you're not the one designing it, and you've expressed that you don't have the time or desire to. so why is it important for you to be invalidating my lived experience?

#

the point of not FFI-ing into these external languages is to experiment with numerical purity in JS and erlang

#

because FFI at some point is impure and depends on the target runtime

gilded ivy
#

the gleam world also depends on the runtime, a Float in one or the other is not the same, nor an Int.

tall token
#

yes, but if most of the functions are implemented in gleam, then it is possible to isolate where the source of numerical instability is

#

for example, for small values close to zero, certain implementations of JS have bad stability for some inverse trig functions.

gilded ivy
#

What's the benefit of trying to force 2 different VMs to behave the same?

#

that's like forcing postgres and sqlite to behave the same.

stone elbow
tall token
#

there's literally no point

#

it's an experiment and at the very least hopefully therre will be the random number generators implemented

#

so then that completes the statistical piece and then there will be a working implementation of most base stats in gleam

gilded ivy
#

and this can be done, I still don't see where gleam blocks you, it may be more verbose, but more explicit also

tall token
#

ok, thank you for your feedback.

gilded ivy
#

But there is no hostility here, I showed you how to wrap things, you create a builder API to make it less verbose and nicer

tall token
#

thank you for giving me a sketch of your potential solution

gilded ivy
#

The real question to me is: Can it be really fast, and there I have my personal doubts if it is in a hot path because a lot of wrapping happens at runtime also. Where with macros you can remove all that boilerplate, if any was required, at compile time (like Elixir does, say it it does in nimble_parsec AFAIU). But this is just my perspective.

It can be made very ergonomic I believe in Gleam.

#

Better to wrap elixir nx or something - my personal take

tall token
#

thank you for your perspective. to make your assertions reality, why not try your hand at submitting a pull request?

gilded ivy
#

because it is not my domain of expertise (or interest, I care about SQL as you might have noticed :D, and boolean logic / control-flow / guards :D)

#

It is open source after all

lime igloo
#

I understand your frustration with the limitations of Gleam; ad-hoc polymorphism is something that a lot of people ask about when trying out Gleam. The limitation will mean your API will look different than it does in many other languages, and that's something that you will have to figure out a solution to.
I am trying not to give what you have referred to as drive-by help, and I am happy to answer your questions about the language and API design but I can't commit to writing any code for the project.
I think this conversation would best advance by talking about specific problems/solutions, rather than speculating about language design as that discussion seems mostly to be going in circles. I am sorry if the community has made you feel unwelcome, I'm sure everyone is doing their best to help, but perhaps coming off a bit harsh.

tall token
#

okay... thank you for your feedback. it is noted that while Nx has extremely good tensor abilities, Elixir as a whole lacks any standardized statistical software, so your personal take of "wrapping elixir" is simply just that - your personal take.

gilded ivy
#

I know to little, but it certainly isn't python-world (neither with its benefits, not with its drawbacks)

tall token
gilded ivy
tall token
#

then in the spirit of openness, and i mean this sincerely

#

why did you offer the suggestion?

#

because to me it feels very abrasive

stone elbow
#

Please let’s end this here and keep the discussion focused on the library and technical questions.

tall token
#

^ THIS is the design document.

tall token
#

hi @stone elbow , based on the discussion here, I implemented the PCG32 algorithm and found that this implementation that uses atomic arrays is slightly faster. (Y axis is miliseconds, blue is PRNG, red is the reimplementation).

The implementation is here: https://github.com/daikonradish/sudorando, with the benchmarking in _dev.gleam.

This addresses some of the concerns that @dull hill had previously.

As a next step, I would like to try to implement the KISS algorithm and compare that against PCG.

Could I check -- was consistency across JS and erlang of any concern or interest to you at all when you were writing PRNG? How do you approach testing across different implementations? These questions are of interest as I am exploring a more general approach of ensuring that functions are "pure" (i.e. do not depend on machine implementation) across JS / Erlang.

Thank you for your attention.

stone elbow
#

Yeah for prng it was a goal for it to behave the same on both targets

#

I run the tests across both targets against some known reference values to make sure the implementation is ok

#

Could probably make it faster but I’ve never stressed about perf. I want to rework the api slightly though

tall token
#

thank you!

#

can you walk me through how to run tests across both targest?

#

i've also tested manual recursion vs yielder API

stone elbow
#

Sure! gleam test —target=js and gleam test —target=erl

tall token
#

for your library

#

the yielder api is slightly slower, about 6.5

stone elbow
#

Yeah I’ll remove the yielder api for the next major version I think

tall token
#

if you would, could you think about perf?\

#

because it would be less inconvenient for everyone if there's just a single entrypoiint into random number gen

#

in the sense that it can be used both for statistcs (i need random uniform to create more complicated distributions)

#

and also for non statistics purposes.

stone elbow
tall token
#

not necessarily a "deal breaker", but here are my concerns (perf + related to perf)

  1. typically statistical sampling requires multiple calls of uniform random. so to draw once from the gaussian distribution you may need up to 5 draws from the uniform. So in 10000 normals could be 50000 uniforms.
  2. passing around the seed is, in general, not very common. I understand, since I am coming from another functional programming language, that there are very strong concerns about not hiding mutability from the end user. But the user can't unwrap the seed anyway, so why pass it on... :/
  3. Let's say eventually PRNG implments cryptographic grade RNG, then passing the seed around means the end user is one step closer to abusing the seed for nefarious purposes.
  4. in general PRNG tends to be "closest to the bone", e.g. in C it's often not really even a function but just a macro that chanages some block of memory.

It would be ideal for sampling/stats to be able to have tested PRNG that's fast so that it doesn't have to worry about implementing its own.

Here are some requirements:

  1. Consistency testing on erlang and JS
  2. Performance benchmarks to monitor degradations
#
  1. Multiple imlementations of PRNG (e.g. KISS, CONG, MWC maybe even consider cryptographic applciations)
  2. possible submission of output to DIEHARDER to test randomness
#

these considerations are why i think it would be nice if there were 1 single "source of randomness" in the gleam ecosystem

#

because then any developer can look at the landscape and say, "hey this is what i need, and it will cover 95% of my use cases"

dull hill
#

I have thought about implementing something like this before as well

#

I’m not positive though that you can get it to be fast enough on both targets without doing specific things in FFI

stone elbow
#

Mhm I see, I can answer a couple right away:
2. That won’t ever change, as that’s the idiomatic thing to do in Gleam and immutable languages
3. Becoming cryptographic grade is not a goal, so that’s not a concern

dull hill
#

Like target specific things I mean

stone elbow
#

Having benchmarks would be really nice though, that’s something I can have a look at

#

And maybe there’s other things that could be improved performance wise, I’ll see if something can be done in that regard

#

I haven’t had a look at prng’s implementation in the longest time 😂

tall token
#

Can I please make a suggestion on the way this community communicates

tall token
#

And to my knowledge gleam is not immutable

#

There are multiple ways to encapsulate state in immutable languages. Or at least make the state somewhat inaccessible or inscrutable.

stone elbow
#

Yeah but the idiomatic thing to do would be to pass that seed around instead of hiding the mutability

tall token
#

In gleam you mean? Or in all immutable languages?

stone elbow
#

I’m thinking about gleam and elm

tall token
#

Okay. Anyway, as usual, I am struggling to communicate my concerns, and I don’t know why.

#

I think the issue here is that I am trying to do something and the language has a small surface area

#

Let’s just zoom in on the statistical issues for a little moment.

#

What matters to any person writing a sampler is

  • the ability to create a generator from a given seed
  • the ability to draw from that generator, in sequence.

The appearance of randomness is obtained by sequential draws from the same generator.

Being able to swap seeds in the middle is not really very ideal as it could lead to subtle statistical bugs, which are hard to debug.

Having the ability to appear random while being completely deterministic within this context is why other immutable languages like Haskell use a monad. The monadic <- clearly says this is not equality, it’s sequentiality.

#

To be absolutely clear

#

The Haskell monad used is not the state monad but the primitive state monad

#

So it’s deterministic but there’s absolutely no way to unwrap the underlying state

#

So you can’t just ask for the state. There are no accessors. And it cannot be passed around as an intermediate value to another value

lime igloo
#

In Gleam, that would be achieved with an opaque type. Opaque types can only be access by values in the module that defines them, so you prevent consumers of the API from accessing library internals

#

You can use monads in Gleam too, there's just no specific language features that relate directly to monads

#

So if you wanted to, you could re-implement the State monad in Gleam

tall token
#

Anyway this seems like an old question. Apparently it is not handy simply to statisticians but also to developers.

#

As a note, this is why I am claiming vociferously that gleam is not suited to mathematical programming.

tall token
#

I can agree that most languages aren’t designed for mathematics.

#

But I have been in the archives and when Travis Oliphant was trying to make numpy, Guido van Rossum was in general quite supportive and his attitude quite open.

lime igloo
tall token
#

Noted.

#

I don’t particularly want to use gleam.

tall token
#

About 50% of projects on that page have not gotten any commits in 5.49 months.

#

I’m not sure where this idea of ‘if you want to do something you will have to meet us all the way here’ came from.

stone elbow
lime igloo
tall token
#

The primary reason why I’m here is because I grew tired of contributing to Haskell, where the community is predominated by white men, and there was as lot of hype about this being an accepting community.

#

I would much rather be talked down to by white men than to be talked in this way by whatever community this is

stone elbow
#

So the sampling would be written like this:

pub fn sample_prng_manual(s: Int) {
 random.sample(
    random.fixed_size_list(the_float_generator, 10_000),
    seed.new(s),
  )
}
#

Does that make any difference in your benchmarks?

#

(Oops small typo, typing from the phone is really hard)

tall token
#

Again, I draw your attention to this pertinent prove of evidence.

#

I don’t particularly want to do anything.

#

and after this experience, I particularly don’t.

charred hamlet
#

well then

#

I think the core team handled that as well as you could've, I don't think anything short of "rethink the language's fundamental design" was going to work for them in the end

stone elbow
#

Yeah no need to drag this any further

stone elbow
#

Which solves both the problem of it being slow and having to pass the seed around, as you need it only once at the beginning

exotic solstice
#

i'm so confused by this convo, but this is an interesting lib. i actually kinda liked that the functions were split into categories. as a not-stats-y person, i feel like that actually helps with api discoverability: i wouldn't know the names of all the statistical methods and what they are for, but i would have a rough idea of a category of things i want to achieve. maybe module api is not exactly the perfect place to do it and it could be achieved in the docs instead, but otoh i don't think i've seen any docs yet where functions in a single module were grouped (usually it's just an alphabetical list with no rhyme or reason)