#FSRS Megathread


polar maple
#

mb the young cards have high RMSE and the right side of the graph corresponds to more young cards

unique salmon
#

Also, how do you calculate RMSE? It's super high

#

80% RMSE can't possibly be correct

cosmic hedge
#

I made it ages ago idk 😂

#

Maybe you can see wherever i screwed up

unique salmon
#

df["loss"] = df.apply(lambda df: (df["y"] - df["r"]) ** 2, axis=1)
Ah, I see. You're calculating it in a very different way, the "normal" way
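
For reference, a minimal sketch of the "normal" per-review RMSE being discussed: square the difference between each outcome `y` (1 = recalled, 0 = forgotten) and the predicted retrievability `r`, average, then take the square root. The sample values and the helper names are made up for illustration. Even a perfectly calibrated prediction of 0.9 has an expected per-review RMSE of sqrt(0.9 * 0.1) = 0.3, so this number naturally runs much higher than a binned RMSE.

```python
import math

def per_review_rmse(y, r):
    """Root mean squared error between outcomes y (0 or 1) and predictions r."""
    squared_errors = [(yi - ri) ** 2 for yi, ri in zip(y, r)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def calibrated_rmse(p):
    """Expected per-review RMSE when the true recall probability equals the prediction p."""
    return math.sqrt(p * (1 - p))

# Made-up example data, not taken from any real collection.
y = [1, 1, 0, 1]
r = [0.9, 0.8, 0.3, 0.95]
print(per_review_rmse(y, r))
print(calibrated_rmse(0.9))
```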

cosmic hedge
#
```python
loss = df_filtered["loss"].mean() ** 0.5
```
i find the mean here
cosmic hedge
unique salmon
unique salmon
cosmic hedge
#

I might look at it later.

cosmic hedge
#

i did it the "simple" way in SSE as well btw so idk where the discrepancy from that might emerge

unique salmon
cosmic hedge
#

no ones ever gonna see that XD

#

well if they do they've been warned

#

I'll do this before i do any weird fatigue stuff if i even do any weird fatigue stuff

unique salmon
#

For example, in Python it looks kinda like this
sum_of_avg_r_over_a_year[today] = average_f_power_forgetting_curve(card_table[col["delta_t"]], card_table[col["delta_t"]] + 365, card_table[col["stability"]], DECAY).sum()

cosmic hedge
# unique salmon For example, in Python it looks kinda like this ` sum_of_avg_r_over_a_yea...
GitHub

FSRS for Rust, including Optimizer and Scheduler. Contribute to open-spaced-repetition/fsrs-rs development by creating an account on GitHub.

#

if you pseudocode it or chatgpt it it might save me a job 😂

#

as in save me the entire job

unique salmon
# cosmic hedge https://github.com/open-spaced-repetition/fsrs-rs/blob/092c20bac7d9239a991ae5b56...

Oh lord, Rust... 😭

```rust
            for i in 0..delta_t {
                memorized_cnt_per_day[last_date_index + i] +=
                    power_forgetting_curve(w, (pre_sim_days + i) as f32, last_stability);
            }
        }
```
This is the part that needs to be changed...I think.
Instead of using "instant" R from the forgetting curve, we need to use average R over some period of time, aka the integral thingy.
I'm guessing `pre_sim_days` is delta_t?
unique salmon
#
```python
import numpy as np

def average_f_power_forgetting_curve(t1, t2, s, decay):

    def integral_power_forgetting_curve(t, s, decay):
        factor = 0.9 ** (1 / decay) - 1
        return (s / (factor * (decay + 1))) * np.power((1 + factor * t / s), (decay + 1))

    # Calculate F(t2) - F(t1) where F is the antiderivative
    integral = integral_power_forgetting_curve(t2, s, decay) - integral_power_forgetting_curve(t1, s, decay)

    # Divide it by the difference in time to get the average
    return integral / (t2 - t1)
```
Port that to Rust (and add an assertion that t2 > t1).
Then do this:
```rust
            for i in 0..delta_t {
                memorized_cnt_per_day[last_date_index + i] +=
                    average_f_power_forgetting_curve(w, (pre_sim_days + i) as f32, (pre_sim_days + i + time_offset) as f32, last_stability);
            }
        }
```
`time_offset` is 1/2/3/5/10/50 years, except in days
cosmic hedge
#

you can calculate R using the stabilities of the cards

unique salmon
cosmic hedge
#

you're going for retention in the future from the end of the simulation right?

unique salmon
cosmic hedge
unique salmon
cosmic hedge
unique salmon
#

The math is worked out, just not ported to Rust

cosmic hedge
polar maple
cosmic hedge
#

Idk if i like it but it might be ok XD

unique salmon
cosmic hedge
unique salmon
polar maple
#

hmm a problem with using average retention is that user A might have a bunch of cards at 25% and another bunch at 75% so it averages to 50%, and user B just has 50%, but user B's log loss will be expected to be higher just because that's the way it is
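
A small sketch of why this happens, assuming perfectly calibrated predictions: the expected log loss of a card with true recall probability p is the binary entropy of p, which peaks at p = 0.5. The function name is illustrative only.

```python
import math

def expected_log_loss(p):
    """Expected log loss of a perfectly calibrated prediction p (binary entropy, in nats)."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# User A: half the cards at 25%, half at 75%. User B: every card at 50%.
user_a = 0.5 * expected_log_loss(0.25) + 0.5 * expected_log_loss(0.75)
user_b = expected_log_loss(0.5)

# user_a comes out lower even though both users average 50% retention.
print(user_a, user_b)
```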

cosmic hedge
polar maple
#

i think a health check should be to check the difference between train/test scores based on something like the 5-way split

#

so if FSRS trained on the training set does not generalize well to the test set then this would be a problem that we can indicate to the user
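
One way to sketch that check, assuming "5-way split" means five chronological folds where each fold trains on earlier reviews and tests on the next chunk. The fold layout, the function names, and the idea of reporting test loss minus train loss are assumptions for illustration, not a description of what Anki or the benchmark actually does.

```python
def expanding_window_splits(n_reviews, n_splits=5):
    """Yield (train, test) index ranges over reviews sorted by time.

    Fold k trains on the first k chunks and tests on chunk k + 1.
    """
    chunk = n_reviews // (n_splits + 1)
    if chunk == 0:
        raise ValueError("not enough reviews for this many splits")
    for k in range(1, n_splits + 1):
        train_end = chunk * k
        # The last fold absorbs any leftover reviews.
        test_end = chunk * (k + 1) if k < n_splits else n_reviews
        yield range(0, train_end), range(train_end, test_end)

def generalization_gap(train_loss, test_loss):
    """Positive values mean the model does worse on unseen reviews."""
    return test_loss - train_loss

splits = list(expanding_window_splits(13))
for train, test in splits:
    print(len(train), len(test))
```

A health check along these lines would average the gap over the folds and flag collections where it is unusually large.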

unique salmon
polar maple
unique salmon
#

Nah

#

Screw it

polar maple
#

the metrics are already meaningless without a train/test set

polar maple
#

i can write an algorithm that just memorizes the training data to get nearly perfect on the current metrics, yet this algorithm would be useless
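
To illustrate the point, here is a toy "model" that stores every training outcome in a lookup table. The class and the fallback prediction of 0.5 are made up for this sketch; it only shows that a near-zero loss on the data a model was fit on says nothing about unseen reviews.

```python
import math

def log_loss(outcomes, predictions, eps=1e-6):
    """Mean binary log loss, with predictions clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(outcomes, predictions):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(outcomes)

class Memorizer:
    """Toy model: remembers the outcome of every review it was trained on."""

    def fit(self, review_ids, outcomes):
        self.table = dict(zip(review_ids, outcomes))

    def predict(self, review_ids):
        # Unseen reviews get an uninformative 0.5.
        return [float(self.table.get(rid, 0.5)) for rid in review_ids]

# Made-up review ids and outcomes (1 = recalled, 0 = forgotten).
train_ids, train_y = [1, 2, 3, 4], [1, 0, 1, 1]
test_ids, test_y = [5, 6, 7], [1, 1, 0]

model = Memorizer()
model.fit(train_ids, train_y)
train_loss = log_loss(train_y, model.predict(train_ids))
test_loss = log_loss(test_y, model.predict(test_ids))
print(train_loss, test_loss)
```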

unique salmon
cosmic hedge
unique salmon
#

nope

#

Actually, we should remove that

#

It serves zero purpose

cosmic hedge
unique salmon
#

RMSE(bins)

cosmic hedge
#
fsrs_6 = load_jsonl("../srs-benchmark/result/FSRS-6.jsonl")
button_usage = load_jsonl("button_usage.jsonl")

users = list(zip(fsrs_6, button_usage))
cosmic hedge
unique salmon
#

Well fuck me upside down and sideways then

polar maple
#

@quasi shadow why no 5-way split in anki? Evaluate means nothing without a proper train/test split. If we have a train/test split then an idea for a health check would be to compare the metrics between the train set and the test set to directly evaluate for generalization

unique salmon
#

Of Evaluate

polar maple
#

we can make this tradeoff to make Evaluate actually mean something

unique salmon
polar maple
#

evaluate could include the train/test values, it doesn't have to remain as it is

unique salmon
#

Oh yeah, let's include four values
Logloss (train)
RMSE bins (train)
Logloss (test)
RMSE bins (test)
Surely that will be less confusing and less information overload

#

Come on man, we're not trying to make it good for data scientists, we're trying to make it good for the kind of person who thinks that "log-loss" means "lost reviews"

polar maple
#

we can show only the test version

#

this makes the information actually accurate for once

unique salmon
#

sigh
@quasi shadow do you want to implement the 5-way split in Anki?

#

Before the release of Anki 25.05

#

I have a feeling the answer is "no"

unique salmon
polar maple
#

imo it should be either remove Evaluate or add a 5-way split because right now the numbers shown on Evaluate are unreliable

cosmic hedge
#

well you can build on it if you want

unique salmon
#

back to this junk
    Ok(SimulationResult {
        memorized_cnt_per_day,
        review_cnt_per_day,
        learn_cnt_per_day,
        cost_per_day,
        correct_cnt_per_day,
        cards,
    })
}
I have no idea what cards look like, so I can't help here

cosmic hedge
unique salmon
#

This is extremely awkward

  • I know how to implement the integral in Python

  • I don't know Rust

  • I don't know the simulator code very well

  • You don't know how to implement the integral in Python

  • You know Rust

  • You know the simulator code

unique salmon
cosmic hedge
cosmic hedge
unique salmon
#

from statsmodels.nonparametric.smoothers_lowess import lowess
Take this

#

` lowess_smooth = lowess(RMSE, sizes, it=3, frac=0.1, return_sorted=False)
lowess_smooth = np.asarray(lowess_smooth)

new_sizes = new_sizes[sorter]
RMSE = RMSE[sorter]
lowess_smooth = lowess_smooth[sorter]

plt.figure(figsize=(16, 8))
plt.scatter(sizes, RMSE, s=30, color="#1f77b4")
plt.plot(new_sizes, lowess_smooth, linewidth=5, label="LOWESS", color="darkorange")`

Something like this

#

sizes is n(reviews)

cosmic hedge
#

i did this

vals = lowess(x, y)
ax.plot([x[0] for x in vals], [x[1] for x in vals])
unique salmon
#

Also, you got your axes wrong, lol

#

Axises

#

Erm, whatever

cosmic hedge
#

i'm just plotting what the function spits out though

#

and the shape looks right enough?

cosmic hedge
#

flipped colours sorry

unique salmon
unique salmon
#

Make them dots

cosmic hedge
#

im feeling like luc gpt rn XD

unique salmon
#

Good bot

cosmic hedge
#

dots too big XD

unique salmon
#

And add these settings to lowess: it=3, frac=0.1

#

And just to make it look a little better
plt.ylim([0, max(RMSE) * 1.025])
plt.xlim([0, max(sizes) * 1.025])

cosmic hedge
unique salmon
cosmic hedge
cosmic hedge
unique salmon
#

Ah, ok

#

For the integral

#
def average_f_power_forgetting_curve(t1, t2, s, decay):

    def integral_power_forgetting_curve(t, s, decay):
        factor = 0.9 ** (1 / decay) - 1
        return (s / (factor * (decay + 1))) * np.power((1 + factor * t / s), (decay + 1))

    # Calculate F(t2) - F(t1) where F is the antiderivative
    integral = integral_power_forgetting_curve(t2, s, decay) - integral_power_forgetting_curve(t1, s, decay)

    # Divide it by the difference in time to get the average
    return integral / (t2 - t1)

This is like 6 lines of code

#

God damn Gemini

cosmic hedge
unique salmon
#
```rust
use ndarray::Array1;

pub fn average_f_power_forgetting_curve(
    t1: &Array1<f64>,
    t2: &Array1<f64>,
    s: &Array1<f64>,
    decay: f64,
) -> Array1<f64> {
    let factor = 0.9_f64.powf(1.0 / decay) - 1.0;
    let exp = decay + 1.0;
    let den_factor = factor * exp;

    // Closure equivalent to the inner integral function
    let integral_calc = |t: &Array1<f64>| -> Array1<f64> {
        // Performs element-wise: (s / den_factor) * (1.0 + factor * t / s).powf(exp)
        (&s / den_factor) * (1.0 + factor * t / s).mapv(|base| base.powf(exp))
    };

    // Calculate integral difference and divide by time difference element-wise
    (integral_calc(t2) - integral_calc(t1)) / (t2 - t1)
}
```
cosmic hedge
#

not rn but provided it works

unique salmon
#

You can verify it by trying t2 that is only slightly larger than t1, like 100.0001 and 100.0. It should give you a value close to the original forgetting curve

#

If you plug t1 into the forgetting curve function
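
That sanity check can be sketched in a few lines of plain Python. The forgetting curve here is assumed to be the one the antiderivative above was derived from, `R(t) = (1 + factor * t / s) ** decay`, and `s = 50`, `decay = -0.2` are example values only.

```python
def power_forgetting_curve(t, s, decay):
    """Retrievability after t days for stability s (R = 0.9 when t == s)."""
    factor = 0.9 ** (1 / decay) - 1
    return (1 + factor * t / s) ** decay

def average_r(t1, t2, s, decay):
    """Mean retrievability over [t1, t2], via the closed-form antiderivative."""
    assert t2 > t1
    factor = 0.9 ** (1 / decay) - 1

    def antiderivative(t):
        return (s / (factor * (decay + 1))) * (1 + factor * t / s) ** (decay + 1)

    return (antiderivative(t2) - antiderivative(t1)) / (t2 - t1)

s, decay = 50.0, -0.2  # example values only
instant = power_forgetting_curve(100.0, s, decay)
narrow_average = average_r(100.0, 100.0001, s, decay)

# The two numbers should agree to several decimal places.
print(instant, narrow_average)
```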

#

I did this in Python

```python
integral_avg = average_f_power_forgetting_curve(t1, t2, s, decay)
print(f'Average R within the [t1, t2] range: {integral_avg:5f}')

# Brute force check that the integral version is correct
n_values = 500_000  # number of data points between t1 and t2 to be used for averaging
t_range = np.linspace(t1, t2, n_values)
r_values = power_forgetting_curve(t_range, s, decay)
brute_force_avg = np.mean(r_values)
print(f'Brute force calculation of average R within the [t1, t2] range: {brute_force_avg:5f}')
print(f'Brute force calculation agrees with integral calculation: {abs(brute_force_avg - integral_avg) < 1e-7}')
```
#

Just brute-force calculated the average of 500k points between t1 and t2

cosmic hedge
#

I'll try just forgetting curve it into the future as well

#

seems like a proxy for maximising stability though

cosmic hedge
unique salmon
#

But that just gives you 70%

cosmic hedge
#

maximise the cards for memorised as if you stopped reviewing on the last day of the simulator, for memorised in a years time

#

like that

unique salmon
#

Ah

#

Nah, just use the integral

#

I specifically made it to calculate average R over time without brute-force calculating the average using a loop and a ton of data points

cosmic hedge
#

as in the days in the future to measure?

unique salmon
#

t2 > t1

#

Btw, I have no idea what the hell is going on here
let integral_calc = |t: &Array1<f64>| -> Array1<f64> {
    // Performs element-wise: (s / den_factor) * (1.0 + factor * t / s).powf(exp)
    (&s / den_factor) * (1.0 + factor * t / s).mapv(|base| base.powf(exp))
};

#

So just by looking at it, I can't tell if Gemini messed up

cosmic hedge
#

surely then it could be simplified like this

```rust
(integral_calc(365)) / 365
```
?
unique salmon
bold terrace
#

btw since the reverse power curve gets super flat quite quickly, the integral looks like a linear function

cosmic hedge
unique salmon
#

Yep

#

You can rewrite it as t1 and t1+offset, if you want

#

Instead of t2 and t1

cosmic hedge
#

to just offset?

unique salmon
#

If t1=0, then it's as if the card has been reviewed just now, but that's not necessarily the case

cosmic hedge
#

ahh right yeah

bold terrace
#

I mean this is the integral function for S=5 from t=1 to 365 with decay -.2

#

Maybe using f(x)=x as approx of integral is good enough

unique salmon
#

At this point it's genuinely simpler to implement the integral than to try to approximate it for no reason

#

It's not even slow or anything

#

Like, it shouldn't make CMRR much slower

bold terrace
#

It had complexity for a glorified f(x)=x

unique salmon
bold terrace
#

Maybe I'm a bit mean

#

it's not necessarily f(x)=x

#

quite close to f(x)=x/2

unique salmon
#

Man, leave the integral, honestly

#

There is no reason to try to approximate it

#

Even if it makes it 10 milliseconds faster, the simulations themselves take x1,000,000 more time

#

The bottleneck is the simulation, not the final calculation

#

Like, it's genuinely one minute vs 10 milliseconds or something

#

Hold on, let me time the integral

bold terrace
#

It's not about making it faster but simpler

#

I think that was your main focus

unique salmon
#

7 microseconds
And this is with shit-ass Python

#

Well, tbf, this is for one card

#

But still

unique salmon
bold terrace
#

Mine is 1, 10 char

#

f(stability)=x/2+stability

#

Close enough 😆

#

Sorry

#

don't want to ruin your fun

#

but the double standard is excellent

#

"People don't care about simplicity"

#

"Let's introduce an average integral ... to approximate f(s)=s"

unique salmon
#

Simplicity of the UI and simplicity of the math are very different things

#

Come on

bold terrace
#

Well at least UI you can move it, the math you need to maintain it

unique salmon
#

As long as the user sees simple UI, it doesn't matter what kind of horrors beyond human comprehension are happening in the backend

bold terrace
#

😆

#

One day you'll understand IT

unique salmon
#

And we already have the monstrosity that is the simulator

#

So I really don't see your point

#

Like, I could see advocating for simplicity before the simulator was implemented in Anki, but it's a bit too late to worry about simplicity now

bold terrace
#

And since the function is even more gentle than a f(s)=s, I'm really curious how it will help with the CMRR

#

We saw yesterday sqrt(S) was too gentle

bold terrace
#

f(S) was only strong enough when decay was high enough

unique salmon
#

If it doesn't help, then screw CMRR, I guess

bold terrace
#

With the new decay, I think the weight should take in account the decay in some way

#

Or not

unique salmon
#

Btw @cosmic hedge enable loss aversion for CMRR, the time(Again)*2.5 thingy
NOT for the simulator, ONLY for CMRR

bold terrace
#

But then it will be normal that the returned DR is the lowest bound, since in practice users seem to never go below a certain R

#

But we'll see !

#

Have to sleep, I'll dream about f(x)=x/2

cosmic hedge
unique salmon
#

Idk, find time(Again)

cosmic hedge
#

on that note i'm done for the night i still have cards to do XD

unique salmon
unique salmon
#

I highly recommend verifying that the integral works as intended via brute-force averaging, like here #1282005522513530952 message

cosmic hedge
# unique salmon I highly recommend verifying that the integral works as intended via brute-force...
```rust
pub struct Card {
    // "id" ignored by "simulate", used purely for hook functions (can be all be 0 with no consequence).
    // new cards created by the simulation have negative id's so use positive ones.
    pub stability: f32,
    pub last_date: f32,
}

pub fn average_f_power_forgetting_curve(
    learn_span: usize,
    cards: &[Card],
    decay: f32,
) -> f32 {
    let factor = 0.9_f32.powf(1.0 / decay) - 1.0;
    let exp = decay + 1.0;
    let den_factor = factor * exp;

    // Closure equivalent to the inner integral function
    let integral_calc = |card: &Card| -> f32 {
        // Performs element-wise: (s / den_factor) * (1.0 + factor * t / s).powf(exp)
        let t1 = card.last_date - learn_span as f32;
        let t2 = t1 + 365.;
        (card.stability / den_factor) * (1.0 + factor * t2 / card.stability).powf(exp) -
        (card.stability / den_factor) * (1.0 + factor * t1 / card.stability).powf(exp)  
    };

    // Calculate integral difference and divide by time difference element-wise
    cards.iter().map(integral_calc).sum::<f32>()
}

fn main() {
    let val = average_f_power_forgetting_curve(10, &vec![Card {stability: 5., last_date: 5.}], -0.2);
    assert_eq!(val, 10.);
}
```
This... explains it
unique salmon
#

The output of average_f_power_forgetting_curve should be between 0 and 1 btw

#

I don't see division by (t2-t1)

cosmic hedge
#

t2-t1 will always be 365 or whatever the offset is
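
A sketch of how the per-card results could be combined, assuming each card is described by its stability and the days elapsed since its last review at the end of the simulation (taken to be non-negative). Dividing each integral by the offset and then averaging over cards keeps the output between 0 and 1; with a fixed offset, the division can equally be done once at the end. The tuple layout and function names are hypothetical.

```python
def average_r(t1, t2, s, decay):
    """Mean retrievability over [t1, t2], via the closed-form antiderivative."""
    assert t2 > t1
    factor = 0.9 ** (1 / decay) - 1

    def antiderivative(t):
        return (s / (factor * (decay + 1))) * (1 + factor * t / s) ** (decay + 1)

    return (antiderivative(t2) - antiderivative(t1)) / (t2 - t1)

def mean_future_retention(cards, offset_days, decay):
    """cards: iterable of (stability, elapsed_days) pairs."""
    cards = list(cards)
    total = sum(
        average_r(elapsed, elapsed + offset_days, stability, decay)
        for stability, elapsed in cards
    )
    return total / len(cards)

cards = [(5.0, 5.0), (120.0, 30.0)]  # made-up cards
result = mean_future_retention(cards, 365.0, -0.2)
print(result)
```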

rotund summit
#

in the absence of a functional CMRR are there any other tools we can use to find safe minimum DRs without manually gauging how our daily load/time spent changes/increases as I try incrementally decreasing my DR?

quasi shadow
#

The current implementation only evaluates FSRS itself.

#

The train/test split could tell us the generalization capability among different models or ablation variants.

#

But when we only evaluate one model, it's not very helpful.

#

If we implement a 5-way split, we will have five sets of parameters optimized on different training sets.

#

And they are all different from the parameters which the user actually uses.

#

What can we derive from the evaluation result with 5-way split?

#

And we have recency weighting. Should the 5-way split consider it?

cursive badge
#

Wouldn't we need to have a consistent train/test split if we want an untainted Evaluate?
e.g. card.id mod 10 == 0 are never trained on, only used for evaluation.
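
A minimal sketch of that kind of fixed split, assuming reviews are available as records whose first field is the card id. Because membership depends only on the card id, a card stays in the same set as the collection grows. The record layout and the helper name are made up for illustration.

```python
def split_by_card_id(reviews, modulus=10):
    """Split review records into (train, test); cards with id % modulus == 0 are held out."""
    train, test = [], []
    for review in reviews:
        card_id = review[0]
        (test if card_id % modulus == 0 else train).append(review)
    return train, test

# Made-up (card_id, rating) records.
reviews = [
    (1700000000010, "Good"),
    (1700000000011, "Again"),
    (1700000000020, "Good"),
    (1700000000011, "Good"),
]
train, test = split_by_card_id(reviews)
print(len(train), len(test))
```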

quasi shadow
#

Then FSRS cannot learn from these cards, and its accuracy would decrease on these cards.

cursive badge
#

It might end up with worse actual results, but I cannot see how else you would have comparable numbers for a "health check".

polar maple
quasi shadow
#

IMO, we can use the current metrics to predict the future metrics.

quasi shadow
polar maple
#

hmm not sure what to say to that, if you purposefully want shoddy data science then be my guest

quasi shadow
#

please answer my questions above

#

my professor didn't teach me about that

polar maple
cursive badge
#

It feels like one of the biggest reoccurring problems with SRS is we are so starved for data 😅. We really need the mega AI to come along that is trained on such a stupid amount of data that it works well without much per-user data.

quasi shadow
#

But the first set of parameters cannot stand for the final one. The health check only represents the health of the first set of parameters.

#

If it could stand for the final one, the final one could also stand for the future one.

cursive badge
quasi shadow
#

If so, why not just evaluate the final one?

polar maple
#

this is only a compromise so that all the data is used; the alternatives are to just use the parameters from the first 4/5 as the final set, or to remove evaluate altogether because we cannot actually say anything about the performance on unseen data

#

e.g. even those big LLMs that you see going around do not train on the final test set before deployment, probably

cursive badge
polar maple
cursive badge
#

Now I'm confused. How would that "leak information" if they were not used for training?

#

N.B. I'm not a big data science person 😅

#

(also part of my reasoning for that split is it would stay mostly consistent over time as you add cards etc.)

#

I suppose you could also just add data marking cards as a "test" card so they stay consistent 🤔

polar maple
#

this is s&p 500

#

so if you have a model that cheats this way, it would be useless for future prediction

#

but FSRS's goal is to predict the future, not fit the past

#

so its similar

cursive badge
#

I might just be dumb but I still cannot see it. If you isolate the revlogs of entire cards I cannot see how the model can cheat other than overall trends (e.g. you performed badly one day because you were tired).

polar maple
#

but it might not be significant enough such that the metrics are still reliable enough if we are only limited to parameters that are trained on a subset of the full revlog

#

parameters trained on card.id mod 10 != 0 would likely perform better than on first 9/10 of the revlog

#

since it would include the most recent information as well

quasi shadow
#

OK, just implement the same method used by the SRS Benchmark.

#

Then we can compare the evaluation result with SRS Benchmark's result.

#

Everything is solved.

polar maple
#

another pro: now recency is expected to improve the metrics

quasi shadow
quasi shadow
#

btw, it doesn't evaluate any parameters.

#

It only evaluates the "optimization" with given dataset.

#

And it's slower than training.

quasi shadow
polar maple
polar maple
quasi shadow
#

wait...

#

I forgot we have use the evaluate here.

#

Is it worth keeping the previous evaluate?

cosmic hedge
#

Didn't that check not exist at the start anyway so it would be fine anyway but just saying.

quasi shadow
#

😂 Fine. I don't want to modify too much code, so I keep it.

bold terrace
#

Something else to keep in mind: compared to a case like predicting the S&P 500, here we have way more constraints on how the model can adapt its predictions:

  1. There's no up and down. The question is "by how much do we increase S/D when getting a review right", or decrease it when getting it wrong
  2. By nature, we could expect memory to not be super volatile, like being "5 times as potent on certain days" and "only half the perf the other day". So while splitting train/test makes sense, it might not change much here
quasi shadow
#

OK, the prep work has been done for the health check idea.

#

😎 I leave the rest of work to @unique salmon

lapis hearth
#

If Anki gets a short term memory model, I believe I can rest in peace finally...

unique salmon
unique salmon
#

Oh, yeah, I get it

#

Right now Evaluate evaluates a set of parameters, but that's not the case with the 5-way split

#

Wait, now people will complain that changing parameters doesn't change the numbers in Evaluate 😭

#

God, this is such a pain

#

It would be so much easier and less confusing to just remove Evaluate

severe storm
#

@polar maple may I ask what was the architecture of the neural network you trained that had better loss than fsrs

unique salmon
unique salmon
severe storm
#

thank you

quasi shadow
quasi shadow
quasi shadow
unique salmon
#

So we're making Evaluate even more confusing to the average user...

quasi shadow
#

For the sake of data science!

unique salmon
cosmic hedge
#

if you have any other ideas lmk

unique salmon
#

Nope, just the integral with various values of offset (or whatever you wanna call it)

unique salmon
#

Still trying to figure out how to correct for both retention and n(reviews), though. Can't use LOWESS on 2D data FeelsBadAnki

unique salmon
#

Man, I'm just giving you tons of work 😅

unique salmon
#

Here's how RMSE (bins) depends on n(reviews)

#

And here's how it depends on retention

#

Now the question is - how the hell do I make a correction based on both...

#

Here's a 3D plot just because why not

bold terrace
#

Hey, what's the rationale behind the second learning step? I see it seems to match the stability of "Again then Good", but a second learning step would be a "Good then Good", right? For intervals like 82m, it might also mean the second learning step will be done the next day, which I thought was not that ideal with FSRS if you want to let FSRS control your intervals, right?

quasi shadow
bold terrace
#

Sure, I know the meaning of having 8h as a 2nd step, I'm not entirely sure why FSRS figured out it should be 8h. It feels like it's the "Again then Good" stability, but I don't really see the logic behind it

#

I mean, getting to that 2nd learning step might be any of those case. Could be a Again or a Good first right

polar maple
bold terrace
#

Would make more sense to take : {Again} {Good Then Again} no ? To see if the card is not a difficult one to learn, instead of {Again} {Again then Good}

unique salmon
quasi shadow
quasi shadow
#

You will use the first step for this case.

bold terrace
#

But I guess it's subjective

#

On your side I think your point is: "If on average, the guy that does Again->Good has an 8h stab, then the second step should have the 8h stab". Whereas in my case I'm more like "If the guy that presses Good has a stab of 2.46 days, and when things go wrong (Good -> Again), they only have a stab of 11min, then let's put the second learning step at 11min to see if the card survives that stability"

#

My point being : If I take your logic, then "Good then Again" would also put him in the first learning step, so let's use that "11.25m" for a first learning step

#

Ideally those steps should be more like "What kind of succession of good reviews, with how much space between them, would give the user a DR %-age chance to have a 1d interval?" (so he has the chance to succeed at it the next day with the desired DR)

#

That's why the {Good Then Again} makes more sense to me: it's the stability of those cards that might look good on paper (first step was a success), but guess what, he got the 2nd step wrong... Well, next time he gets it right in the first step, let's wait for that interval to see if he remembers it now

#

Good (8h) -> Good : Make sense, it's less optimal then using {Good Stability}, but at least we're sure he's not in a "Good then Again" situation.
Again (2m) -> Good (8h) : Make sense, we use {Again Stability} as interval for the first one, and then the {Again Then Good} stability.
Good (8h) -> Again (2m) -> Good (8h): the **8h doesn't make much sense here**, the "Good then Again (12m)" should be used, because the guy just did exactly that, finishing with an "Again then Good".

But yeah, the whole learning steps, if hardcoded in the deck options, lack flexibility to really use those values. And leaving it empty means most of the time not even having learning step (If good/hard have ~ >.5 stability)

clever cargo
#

why is it "Good (2m)"? doesnt it immediately go over to the second step

bold terrace
#

But then yeah even the Good -> Again doesn't make sense. If we know Good -> Again stability is 12min, having to make the user wait 8h (The Again then Good) feels off

cosmic hedge
#

decay = -0.5? i'm guessing thats on purpose

bold terrace
#

TBH the most logical would maybe even :
First Step : {Again}
Second Step : math.min(AtG, GtA)

unique salmon
polar maple
#

i think there is reason to believe that the current FSRS forgetting curve shape doesn't work well for short stabilities

cosmic hedge
polar maple
#

you need some level of near-instant forgetting built into the curve

unique salmon
#

I experimented a bit with FSRS for short-term reviews and what helped was subtracting an "instability term" from S that decays with time, so it would be like S=S-I, where I=f(time)
The metrics were still shit though

#

But it helped make them less shit

#

It would look like this

polar maple
#

👍

bold terrace
bold terrace
#

There's also coin-flip knowledge: knowledge that might get a very high stability as long as you get lucky enough with your wacky knowledge

#

Example : What year X was born ? Let's say you hesitate between 1990 and 1980

unique salmon
#

But then again, this isn't useful for FSRS since we're not trying to model what's going on during same-day reviews
It's just experimental evidence that a "proper" forgetting curve looks different from what most (~all) people imagine

bold terrace
#

Let's say your model is built on 90% healthy cards, with good knowledge of them and not much ambiguity.
So now, for that coin-flip question, you might get it right 4 times in a row (1 chance in 16) and get a stability of 4 months because you did "Good-Good-Good-Good"...

#

But guess what, you never really knew it

#

So sometimes "very low stability" might mean :

  • You never even memorized it a single time
  • You never had anything truly substantial to remember, you just remembered it was A or B and most of the time you can get it right
#

And then you complain about Anki minimum review interval to be 10min

#

(For case 1)

#

But for case 2 however (the flip coin question), I have no idea how you could detect that

#

maybe with 2.7 million parameters?

#

😄

#

I have some ideas though

#

That ratio of performance drop

#

For example that word I always hesitate between "shuusei" or "shousei"

#

How does it translate ? Lapses with no-increasing performance

#

I think those kind of engineered features, fed to a NN like yours @polar maple , could really detect those

robust hill
#

when is it best to do the first few optimizations of a deck with new deck options

#

after 5 days or after passing the daily reviews size 5x over?

unique salmon
#

There's no "too early"

robust hill
#

well im worried that

#

if i do it too early

#

it will cook me

bold terrace
lapis hearth
#

i want to have an in review leech detector

unique salmon
#

@cosmic hedge @quasi shadow Alright, this was a massive pain, but I made a function that approximates RMSE as f(n_reviews, retention)
def func(a, b, c, d, e, f, g, x, y): return a / (np.power(x, b) + g) + c / (np.power(y, d) + f) + e
x is n_reviews/1000, y is retention
List with parameters (a, b, c, d, e, f, g): [0.16398, 0.73318, 0.018426, 10.0, 0.0, 0.35881, 1.6193]
(e ended up being 0, so we can remove it)
Two notes:

  1. I divided n_reviews by 1000, just because
  2. retention is calculated including same-day reviews

So now we can calculate normalized RMSE - take the user's real RMSE and divide it by the output of this function. If the ratio is >1, then the user's RMSE is greater than what would be expected for his retention and n(reviews). If the ratio is <1, then it's lower than what would be expected. Then we can calculate the percentiles of this normalized RMSE, and finally use that as cutoff values for the health check.
1st percentile of norm. RMSE=0.29912
50th percentile of norm. RMSE=0.9972
90th percentile of norm. RMSE=1.8269
99th percentile of norm. RMSE=3.3006
How to use this:

  1. Take the user's normalized RMSE (which, again, is a ratio) and clamp it so that it's not lower than 0.29912 (1st percentile) and not higher than 3.3006 (99th percentile)
  2. If it's <=50th percentile, display "Good" (green zone)
  3. If it's between the 50th and 90th percentile, display "Acceptable" (yellow zone)
  4. If it's >=90th percentile, display "Poor" (red zone)
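
The steps above can be sketched as follows. The fitted constants and percentile cutoffs are copied from this message; the function names are hypothetical, and the example RMSE values are made up.

```python
# Fitted parameters (a, b, c, d, e, f, g) and percentile cutoffs from the message above.
A, B, C, D, E, F, G = 0.16398, 0.73318, 0.018426, 10.0, 0.0, 0.35881, 1.6193
P1, P50, P90, P99 = 0.29912, 0.9972, 1.8269, 3.3006

def expected_rmse(n_reviews, retention):
    """RMSE expected for a collection with this many reviews and this retention."""
    x = n_reviews / 1000  # n_reviews is divided by 1000 before fitting
    y = retention         # retention includes same-day reviews
    return A / (x ** B + G) + C / (y ** D + F) + E

def normalized_rmse(rmse, n_reviews, retention):
    """Ratio of actual to expected RMSE, clamped to the 1st..99th percentile range."""
    ratio = rmse / expected_rmse(n_reviews, retention)
    return min(max(ratio, P1), P99)

def health_label(rmse, n_reviews, retention):
    ratio = normalized_rmse(rmse, n_reviews, retention)
    if ratio <= P50:
        return "Good"        # green zone
    if ratio < P90:
        return "Acceptable"  # yellow zone
    return "Poor"            # red zone

for rmse in (0.05, 0.12, 0.20):  # made-up RMSE values
    print(rmse, health_label(rmse, n_reviews=1000, retention=0.9))
```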

One last issue - what do we do if it's in the red? The user will panic and be like "nyooo my fsrs is dying...". But I don't know what to display. If we display a list of possible reasons for poor performance, I doubt people will read it. The kind of person who misuses buttons or uses Anki in some dumb way is exactly the kind of person who won't read a list of possible explanations.

polar maple
unique salmon
#

In this case I don't think either of the two has any advantages, so we might as well flip a coin

polar maple
#

ik, i thought originally you wanted to use RMSE because it didn't require a correction but we discovered yesterday that it also needs one

#

then in this case we might as well use log loss since it is a more accurate metric

cosmic hedge
unique salmon
cosmic hedge
#

yeah

#

well of all the users - the RMSE normalising value(?) which is what i assume were going to be using?

#

how far is the average user from the "middle value" is what trying to ask

#

ahh yeah you get the percentiles so it should probably be good anyway?

unique salmon
unique salmon
unique salmon
#

@cosmic hedge where integral

#

I need to know if we're axing CMRR or no

#

And Dae has been asking about it too

polar maple
#

how about return 0.9 for CMRR

unique salmon
#

Also, I have some things to say regarding the wording in Evaluate, but I guess it's a bit too early for that since Luc hasn't made a PR yet

#
  1. I think we should rename it to "Evaluate FSRS" to make it clear that we are evaluating the algorithm, not the parameters
  2. The text should say "Lower values of RMSE and log-loss indicate a better fit to your review history. The results do not depend on your current FSRS parameters". This makes it more clear what is going on
#

Still unsure what to do with people who will get "Poor"

#

I mentioned it on Github

bold terrace
#

"The results do not depend on your current FSRS parameters" ?

#

I change my param, I press evaluate, the values change

unique salmon
#

Not anymore 🙂

#

Alex and Jarrett want to follow The Path of The Data Scientist

tepid spoke
#

That makes it kinda useless then, doesn't it? Like, it's to check how well your current parameters fit your collection.

bold terrace
#

And on the path of the data scientist, wouldn't Evaluate evaluate the parameters against the test sets?

polar maple
#

so we will basically test FSRS on a train/test split, but the final parameters will still be from training on all of the available data

tepid spoke
#

that's SUPER unintuitive, and better to just remove then

bold terrace
#

Ok but in any case, you change the param, you potentially change the log loss no ?

#

If not, then the Evaluate use what params ?

polar maple
unique salmon
#

Yeah, the health check will be for FSRS as an algorithm, not for a specific set of params

cursive badge
#

If Evaluate becomes a sort of self-test and doesn't look at your params I kind of want an "Automatic/Manual" toggle.

  • Automatic: you don't show (editable?) params but have an evaluate button.
  • Manual: you show params but don't have an evaluate button.
bold terrace
#

Ok so if I understand correctly :
You press Evaluate -> it trains -> it gets params -> it evaluates on the test set?

polar maple
bold terrace
#

Why not, if params are present, skip the "train" step and evaluate on the test set?

bold terrace
polar maple
unique salmon
#

Me waiting for Alex's course on machine learning

tepid spoke
#

I just don't see what information I would gain from Evaluating FSRS as an algorithm against my deck.

tepid spoke
#

While with evaluating the params, I at least have SOME kind of idea if one set is better than the other, or one set is a complete miss somehow

bold terrace
#

Shouldn't the params be hidden in some way?

#

It would be extremely strange from a UX point of view that you could see and edit params, and that they would affect scheduling, but not the Evaluate function

polar maple
bold terrace
#

Or if shown, at least put in read-only

#

Having something that would alter scheduling but not evaluation feels extremely off

tepid spoke
#

Why? That has zero merit. People who don't know what they do won't touch them anyway

#

And if you really WANT to modify them, you can anyway. You just made it more annoying.

bold terrace
#

I'm thinking more about the people who know what the params are but would not expect them to be used in one function (scheduling) and not in another 😆

unique salmon
#

Just saying, all of this mess could have been avoided if people voted for removing Evaluate ¯\_(ツ)_/¯
But we got 73% first-preference votes for the "health check"

tepid spoke
#

Well, the vote never said that it would stop evaluating your parameters...

unique salmon
#

I asked Jarrett "Do you want to implement train set = test set in the benchmark or the 5-way split in Anki?" and he chose the latter

bold terrace
#

The choice of not linking Evaluate to the parameters in the UI is just an arbitrary choice, not a real limitation

#

I just feel the whole topic is some kind of ego war more than anything

#

"I need to do the split I don't want to do? Then I won't let you evaluate it"

tepid spoke
#

Evaluate seems entirely pointless if it does not give you a metric related to the current parameters.

cursive badge
#

Well train=test in the benchmark is bad science.

bold terrace
#

"You didn't want to hide Evaluate? I'll make it useless"

bold terrace
cursive badge
#

(I meant Expertium's comment)

polar maple
#

i think rossgb's card_id mod 10 == 0 might be a decent compromise

cursive badge
#

I'm just slow 😅

unique salmon
bold terrace
#

I mean, let's just mark the cards used for training, and when the user clicks Optimize, it runs on the test set + all the new data

#

Re-optimized? Let's mark the new training set, and now Evaluate will work on All \ Training Set (everything except the training set)

unique salmon
#

We need either train = test in the benchmark or the 5-way split in Anki. Either one will do

#

Because right now the numbers from Evaluate cannot be compared to the benchmark numbers

#

So either the benchmark has to be ankified or Anki has to be...benchmarkified

polar maple
unique salmon
cursive badge
unique salmon
bold terrace
#

Most people using anki for a year have way more than 10-20k

#

And youngsters should use default params for longer

#

Before there was a threshold to reach I think before being able to optimize

unique salmon
#

There's nothing wrong with optimizing much earlier

bold terrace
#

Right now is there any threshold yet ?

#

Or is it you can optimize with 10 reviews ?

unique salmon
#

Not really. Like 8 reviews for pretrain, 64 for full optimization

bold terrace
#

IMO this is not great

unique salmon
#

Except it has some wacky filters and whatnot

bold terrace
#

when you see @tepid spoke's case with 100d stability when he filters out cards with Hard/Easy as the first review

unique salmon
#

So you'll never figure out the exact number of reviews used for training

unique salmon
#

Better than the defaults

polar maple
#

do we have any performance metrics for low # of review collections?

bold terrace
#

Ok but maybe some rules like "At least a few Easy, Hard ..."?

tepid spoke
#

hm? The 100.000 stability value appears when I filter out enough cards via "Ignore cards reviewed before"

bold terrace
#

Because getting 100d stability because you lacked certain cases is a bit meh

bold terrace
#

Oh yes, I'm 99% using only again/good, but still thinking about the others

#

But I guess it's not that much a big issue

#

Except if Anki uses FSRS by default and auto-optimizes with 30 reviews

#

But SM2 has the benefit of building a training set for FSRS lol

unique salmon
#

Anyway, if anyone wants "Evaluate parameters" instead of "Evaluate FSRS", tell Jarrett to implement train = test in the benchmark

#

Because right now he's doing the exact opposite - implementing the 5-way split in Anki

#

Again, don't point fingers at me. I asked, he chose

polar maple
#

if we implement card_id % 10 == 0 we can keep the current Evaluate and it wouldn't be so incorrect

bold terrace
#

Why not just mark the cards that were used for Training and keep Evaluate work on Tests set ?

polar maple
#

@unique salmon btw to train on the test set its pretty much just 1 line of code in other.py

unique salmon
polar maple
#

is 10% too much? too little?

unique salmon
#

Too little IMO

#

20-25% is good

polar maple
#

ok

unique salmon
#

But the point is that we need some method Y that is used both in Anki and in the benchmark so that we gather data for the health check

#

If Anki uses method Y to calculate metrics and benchmark uses Z to calculate metrics, we can't do shit

unique salmon
cursive badge
unique salmon
#

Whatever we do in Anki, we must also do in the benchmark so we can collect data that will be used to decide values for the health check

unique salmon
cursive badge
#

Something is going very weird with my keyboard. It is hard to type 😕

unique salmon
#

Mod would choose cards randomly, whereas in the 5-way split they are not chosen randomly

cursive badge
#

It might not be perfect. I was assuming that train=test could have very different values, but different methods that do not mix train and test would have similar RMSE/log loss.

#

My laptop keyboard is very unhappy.

polar maple
#

for example if a new user does a bunch of new cards at day 1 and then reviews them at day 100, passing all of them

#

then a mod 2 split would easily get a zero loss

#

but splitting by time in half would get a high loss since it would basically be fsrs default params

unique salmon
#

Man, I'm just dreaming of a nice world where everyone voted to remove Evaluate...

polar maple
#

my view is evaluate parameters is fine if we're not evaluating on the training set

unique salmon
#

Well, I guess tomorrow we'll see Sound and Oromit debating with Jarrett

tepid spoke
#

I do not have any kind of strong attachment to the Evaluate button

#

It just seems pointless to me if it doesn't evaluate the parameters

unique salmon
#

Just Sound then 🤣

tepid spoke
#

I did vote to keep it, since it's "nice enough to have"

#

but if it causes trouble like that, meh

unique salmon
tepid spoke
#

The few times I used it was to check how the values looked for my manually tuned parameters

#

to make sure I didn't make a horrible mistake

cursive badge
#

Maybe I'll dig out my Health Check PoC and see if I can create something I don't hate 😅

tepid spoke
#

But that's such a niche case, it hardly matters

unique salmon
#

Fair, in your case old Evaluate is more useful

tepid spoke
#

With the Simulator being a thing now, it's a better cross-check anyway

quasi shadow
#

You cannot know the performance of any sets of parameters on unseen data (the future reviews) which is actually important in practice.

polar maple
#

but this is exactly why a train/test split is important, likely 99.9% of users would have a better training loss with adaptable params but the actual benefit is not necessarily that high

quasi shadow
quasi shadow
#

Btw, what could we do if the health check's result is bad?

quasi shadow
#

Even if the log loss of adaptable params is better than default params.

bold terrace
# unique salmon This answers your question, I believe

Hmmm not really. Anki and the benchmark both using the same method (train/test) doesn't imply being unable to run Evaluate on the parameters.
It's like saying that evaluate(optimize(data), test_data) implies that **evaluate(user_defined, test_data)** is not possible

quasi shadow
#

If I understand it correctly, you want to know how well the user defined parameters perform on current data.

#

But the benchmarking method evaluates how well the optimization performs in the future.

#

Let's reframe this question: how well do the user-defined parameters perform? -> how well does the user's manual optimization perform?

#

If we want to compare the built-in optimization with the user's optimization fairly, the user also shouldn't optimize the parameters based on the metrics from the test data.

#

But it's tricky.

#

It's tempting to optimize the parameters based on test data.

bold terrace
# quasi shadow Let's reframe this question: how well the user defined parameters perform? -> ho...

Ah OK, I now see a very good point that justifies your thinking. Here's what I think about it:

  • To be fully consistent, the user-defined parameters should also be evaluated against the same test set as the optimized ones. Otherwise the user might feel he gets better results with his own parameters when, in fact, his params are simply being evaluated on the training data, which makes them look better than they are. That's not great.

But if we mark the data that was used for training, and only run Evaluate on the test set (excluding the training set), then both the user-defined and the optimized params will compete on the same "unseen" data.

#

If such an exclusion is difficult for now or not feasible within Anki, then it would probably be better to slightly rework the menu to make it clear that user-defined params are the responsibility of the user. For example:

  • Show a warning if the params have been changed by the user
  • Instead of having an "Evaluate" button, just display the log loss/RMSE of the optimization, to make it clear that it's computed only when optimizing and represents only the optimized parameters, not user-defined ones
  • If we keep an Evaluate for user-defined params and do not exclude the training set, show a warning that the user might get a better evaluation, but that it's "cheating" since there is no train/test split
unique salmon
#

If we keep the current Evaluate behavior, then @quasi shadow needs to implement train = test in the benchmark. Alex said it would be easy

quasi shadow
#

I'm training another model now.

unique salmon
#

@ashen light challenge accepted 😎

#

We'll see if Jarrett merges this PR or finds any issues

clever cargo
quasi shadow
#

My device is busy.

unique salmon
quasi shadow
#

yep, please include it in the PR

unique salmon
#

Ok

#

🀣

#

Hopefully that's easy to fix

#

Not this error again 😭

#

How many times have I gotten it...

#

If I could see dreams, I would see "not enough values to unpack" in my nightmares

#

@ashen light if AI wrote 60 lines of code and I wrote 1, does that count? 🤣

#

Alright, so now I have to run FSRS-6 on 10k users
See you guys in ~~30~~ 70 hours, lol

unique salmon
clever cargo
#

damn

#

is that what they call aphantasia or smth

unique salmon
#

Like, imagine a spinning apple or something

#

I can do that

#

@wind palm @hasty fractal @cursive badge @cosmic hedge I hate to say this, but it's time to debate Evaluate. Again.
In order to implement the health check, the way the metrics are calculated has to be consistent between the benchmark and Anki, otherwise we can't collect the necessary data. Currently, that's not the case. There are 2 ways to fix this, both are technically doable:

  1. Implement a training data/testing data split in Anki and instead of evaluating parameters, evaluate FSRS. What this means in practice is that the Evaluate numbers won't depend on your current FSRS parameters. This will also make Evaluate as slow as Optimize, since it has to do an optimization.
  2. Implement train set = test set in the benchmark. Then I'll run FSRS-6 this way (I'm doing it right now, actually), and then we can keep the current behavior of Evaluate

So either Evaluate evaluates a specific set of parameters, like now, or it evaluates FSRS's ability to perform well on unseen data, like Jarrett and Alex want

#

Currently in favor of 1: Jarrett, Alex, Luc
Currently in favor of 2: Sound (kind of), Oromit
Currently in favor of "please god let's just get this over with just choose whichever": me

bold terrace
#
  3. Implement a train/test data split, but find a way to evaluate user-defined parameters in Anki (Evaluate) that would run only on the test set (all cards, excluding those marked as "trained_set" during Optimize), just like the result of Optimize
#

I'm for neither 1 nor 2

unique salmon
bold terrace
#

Basically :
evaluate(optimize(train_set).parameters, test_set)
evaluate(user_defined.parameters, test_set)
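
To make the two calls above concrete, here is a minimal sketch with a toy one-parameter model. It is not FSRS: `predict`, `optimize` and `evaluate` are hypothetical stand-ins, and the reviews are made-up data. The only point is that both parameter sets get scored by the same log loss on the same held-out reviews.

```python
import math
from typing import List, Tuple

# One review: (days since the previous review, 1 if recalled else 0).
Review = Tuple[float, int]


def predict(stability: float, delta_t: float) -> float:
    """Toy forgetting curve (NOT FSRS): recall is 0.9 when delta_t == stability."""
    return 0.9 ** (delta_t / stability)


def evaluate(stability: float, reviews: List[Review]) -> float:
    """Mean log loss of the toy model on the given reviews."""
    eps = 1e-7  # keeps log() away from 0
    total = 0.0
    for delta_t, recalled in reviews:
        p = min(max(predict(stability, delta_t), eps), 1 - eps)
        total -= recalled * math.log(p) + (1 - recalled) * math.log(1 - p)
    return total / len(reviews)


def optimize(reviews: List[Review]) -> float:
    """Grid search for the stability with the lowest log loss on `reviews`."""
    candidates = [0.5 * k for k in range(1, 201)]  # 0.5 .. 100 days
    return min(candidates, key=lambda s: evaluate(s, reviews))


# Made-up data, for illustration only.
train_set = [(1, 1), (3, 1), (7, 1), (14, 0), (30, 0), (2, 1), (10, 1)]
test_set = [(5, 1), (20, 0), (1, 1), (12, 1)]
user_defined = 25.0  # a hand-picked parameter

print("optimized:   ", evaluate(optimize(train_set), test_set))
print("user-defined:", evaluate(user_defined, test_set))
```

If the user keeps adjusting `user_defined` until the second number drops, the test set has effectively become a training set, which is the objection raised in this thread.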

unique salmon
unique salmon
bold terrace
#

Why though

unique salmon
#

Test set is for evaluating how well the algorithm performs on data that it was not trained on

bold terrace
#

Yep that's how it's used in

evaluate(optimize(train_set).parameters, test_set)
evaluate(user_defined.parameters, test_set)

unique salmon
#

If you tweak the parameters to get lower logloss/RMSE, you "train" the algorithm on the test set

bold terrace
#

Only to evaluate the output of the algorithm

bold terrace
#

Basically, if a user goes that far, he would also be able to just train his parameters on the full set 🤷, so he would be hacking his way around it anyway

#

At least with

evaluate(optimize(train_set).parameters, test_set)
evaluate(user_defined.parameters, test_set)

You allow him to slightly tweak the optimized version and see whether it gets much worse when he changes, for example, his initial stabilities

#

Of course, if he uses that door to train parameters on the whole set, shame on him

#

But by default it would not be the case

#

So the complexity comes down to: how do we make sure Evaluate doesn't cheat, while still letting the parameters be modified for whatever reason? The answer is then: hide the training set from it

#

I just imagine Dae's face with all those flags/parameters in the custom_data of the cards 😆

cosmic hedge
unique salmon
#

I guess we could add a stern warning to not tweak parameters manually. But then we'll have to implement a third method, like the mod card ID proposed by Alex, in both Anki AND the benchmark 😭

#

The third method of splitting data into train/test, I mean

bold terrace
#

Shouldn't be that hard right

#

I mean there is certainly a part in the code where you divide the data into X/1-X

#

you make it take a custom function "f_partition(card, card_index)", one would be based on "first 80%", and the other "mod N"
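
A rough sketch of that idea, assuming cards are represented by their IDs in review-history order. The extra `total` argument and the names `first_fraction`, `mod_n` and `split` are inventions for this example, not anything from Anki or fsrs-rs.

```python
from typing import Callable, List, Sequence, Tuple

# A partition function takes (card_id, card_index, total_cards) and
# returns True when the card belongs to the TEST set.
Partition = Callable[[int, int, int], bool]


def first_fraction(train_fraction: float) -> Partition:
    """'First 80%' style split: the earliest cards train, the rest test."""
    def is_test(card_id: int, card_index: int, total: int) -> bool:
        return card_index >= int(total * train_fraction)
    return is_test


def mod_n(n: int) -> Partition:
    """'mod N' style split: cards whose ID is divisible by n are the test set."""
    def is_test(card_id: int, card_index: int, total: int) -> bool:
        return card_id % n == 0
    return is_test


def split(card_ids: Sequence[int],
          f_partition: Partition) -> Tuple[List[int], List[int]]:
    """Split card IDs into (train, test) using the given partition function."""
    train: List[int] = []
    test: List[int] = []
    total = len(card_ids)
    for index, card_id in enumerate(card_ids):
        (test if f_partition(card_id, index, total) else train).append(card_id)
    return train, test


ids = list(range(100, 120))
print(split(ids, first_fraction(0.8)))
print(split(ids, mod_n(5)))
```

With this shape, Anki and the benchmark could share one `split` and differ only in which partition function they pass in.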

unique salmon
#

Jarrett already made his PR to Anki and FSRS-rs and I already made mine to the benchmark 🤣
God this is a mess

bold terrace
#

You're a bit too intense, calm down and wait. No wonder Dae doesn't involve himself that much in those discussions

#

FSRS is already good enough, a few more days wont hurt

cosmic hedge
#

2 buttons
"evaluate fsrs"
"evaluate current parameters" (buried somewhere)
god save us all

unique salmon
bold terrace
#

I mean

#

Maybe some people want to use the full training set

unique salmon
#

That's confusing as hell

bold terrace
#

Losing 20% of your reviews when you only have 100 reviews is a lot

unique salmon
#

Think of the average user trying to understand it

#

Serious question: why are we trying to expose more of FSRS rather than hiding it? Ideally, the only setting should be desired retention, that's it

bold terrace
#

Or a switch "Use Train/Test split or just use Whole set for training ? /!\ This means you're cheating the ability of predicting unseen cards"

bold terrace
#

Between the ideal world of FSRS in the benchmark and what people observe, I don't blame people not willing to lose control

#

But I agree ideally it should be hidden

#

It's just not mature enough really to be

#

But anyway

#

Let the people actually coding decide 😆

#

And if I really want my

evaluate(optimize(train_set).parameters, test_set)
evaluate(user_defined.parameters, test_set)

I'll PR it in a few months 😆

ashen light
unique salmon
ashen light
#

¯\_(ツ)_/¯

#

I'm out of date on the latest fsrs meta, there's just too much talking

unique salmon
ashen light
#

I'm glad I'm no longer on the evaluate mailing list

#

but literally just stop talking about it till dae says it's reasonable, whatever it is yall are doing

unique salmon
ashen light
#

oh cool

unique salmon
# ashen light oh cool

But health check requires data, and data requires using the same method both in Anki and in the benchmark for calculating log loss and RMSE, so...here we are

ashen light
#

well, have fun

polar maple
#

@unique salmon
i hooked up FSRS-6 to optimize on the entire revlog for logloss & rmse (bins) separately but only on the reviews that are evaluated on in srs-benchmark,
logloss: https://pastebin.com/c9c1WniH
rmse (bins): https://pastebin.com/JCkZZZtA

unique salmon
polar maple
#

it was to simulate how much choosing the best rmse (bins) params within anki might affect the result

#

but recently jarrett changed it to log loss

unique salmon
#

I want you to try this on a combined revlog to see if the weirdness with parameters persists

polar maple
#

how are the params weird?

unique salmon
#

I mean w[16] becoming 1.0
I want to see if it persists with both loss functions

#

On a big combined revlog

polar maple
#

we already know the answer for log loss

unique salmon
polar maple
#

for rmse (bins) even if its not 1.0 we still wouldn't use it

unique salmon
polar maple
#

better to just fix w[16] to 1.5 and optimize logloss around it or something

polar maple
#

do i have to repeat my rants against rmse (bins)?

polar maple
#

rmse (bins) does not push the model to predict as well as it can

polar maple
unique salmon
polar maple
#

also new parameters from an optimization are kept only if they improve on the training loss, it could still get worse on the test loss and evaluate would show a worse result

#

but if we insist that the new parameters should also do better on the test set then this is just training on the test set

unique salmon
#

I feel like we're screwed either way

  1. Evaluate evaluates parameters - people start tweaking parameters on the test set and then complaining that the optimizer is garbage because they have gotten "better" parameters manually
  2. Evaluate evaluates FSRS - people start asking why changing their current parameters doesn't affect Evaluate
bold terrace
#

Another option : Only allow to change the initial stability parameters

unique salmon
#

That's weird

bold terrace
#

Would be interesting to see what people tweak

#

Personally I never tweaked anything but I would imagine people only tweak initial stabilty ?

#

Making the params read only would solve most issues I guess

#

If Evaluate is already a bit difficult to interpret, imagine those parameters !

#

Could be present in some "Get Troubleshooting parameters / logs" hidden somewhere

#

My whole point with evaluate(user_defined.parameters, test_set) is only there because I was trying to find ways to keep parameter tweaking available for all users

#

Buuuut I don't personally think anyone should tweak them

#

and if they tweak initial stability, it's not that big a deal in terms of log loss/RMSE

quasi shadow
robust hill
#

compute optimal retention be like: always 70% 🔥

unique salmon
#

#1282005522513530952 message

wind palm
robust hill
#

the exam is in 50 days

#

i was thinking about it now

#

if i should make it a priority to have like

#

75% desired retention and try to crank out

#

1750 cards in the span of 10 days

#

because it takes around 10 seconds per card and i see them around 3.4 times to turn into young

#

almost like 16 hours of new cards

#

also when exactly does cmrr give me more than 70%

#

😭

#

btw how is this graph linear?

#

shouldnt it be exponential

#

ah nvm

unique salmon
robust hill
#

can i make my own graph with my own parameters

#

i know there was a github link somewhere

unique salmon
#

Nope

robust hill
#

wasnt there one for this

#

the desired retention x workload

unique salmon
#

Well, I can, but I never shared the code

unique salmon
robust hill
#

is there a

#

average retrievability to desired retention workload

#

like

#

i want to see average retrievability to workload time

robust hill
#

like this but for average retrievability

#

or not possible

unique salmon
#

Yes, you can get a graph like that using the Google Colab optimizer, but again, for now it uses FSRS-5

#

And I don't see why you would want average R instead of DR on the x axis

robust hill
#

just need it to explain something to a friend

#

do i follow the steps from the very beginning of the link

unique salmon
#

Explain it using this graph, lol

robust hill
unique salmon
robust hill
#

thank u

#

i am one of those special people who need step by step instructions for each step 🔥

polar maple
robust hill
#

my question is

#

how do i choose the exact parameters i want it to use

#

do i just have to upload the deck

#

instead of collection

robust hill
#

cause i sent in my collection

unique salmon
#

You can upload a single deck, if that's what you're asking

robust hill
#

well cause like my collection has a lot of deck options yk with different parameters

#

and i want to choose a specific deck option parameter

#

idk im trying to see

unique salmon
robust hill
#

i have a collection with 5 main deck options:
physiology
biochem
etc etc

#

i wanna see the graph for the physiology deck options only

#

not combination of everything

unique salmon
#

Put all decks that have the "Physiology" preset into one big deck and export that

robust hill
#

alright sounds good

#

my personal parameters 💀

wind palm
# polar maple the problem is that the way we currently implement Evaluate is a big no-no in da...

Hasn't it always been understood that this is bad science to a certain extent? We're asking FSRS to tell us how good of a job it is doing -- like a self-reflection grade. FSRS answers, "when I use this memory model [which I came up with by looking at your review history] on your review history, my predictions are wrong X% of the time." It's not good data science, but it is a good test of whether the model is matching the user (or at least whether FSRS thinks its model is matching the user).

Testing the user's parameters against another user's data seems less helpful. The answer back from FSRS would be, "when I use this memory model on someone else's review history, my predictions are wrong X% of the time." That seems like a measure of whether the model matches someone else. Why would a user care about that?

You might be interested in Sound's 3) option
Is that this?

  3. Implement a train/test data split, but find a way to evaluate user-defined parameters in Anki (Evaluate) that would run only on the test set (all cards, excluding those marked as "trained_set" during Optimize), just like the result of Optimize #1282005522513530952 message
    Unfortunately, I have no idea what any of that means. 😅
bold terrace
#

It means that whether or not we split the whole set into training/test sets, that is no reason to stop evaluating user-defined parameters.

If the Evaluate button is doing: evaluate=evaluate(optimize(training_set).parameters, test_set), you can also do evaluate=evaluate(user_defined.parameters, test_set).

Of course, it means:

  • Having to store what was in the training set, to exclude it from Evaluate when it is run later.
  • Warning the user that he could get better log loss and/or RMSE by tweaking his parameters, but only because he would break the whole "you train on the train set, you evaluate on the test set" rule
bold terrace
# wind palm Hasn't it always been understood that this is bad science to a certain extent? W...

It still makes sense to split, because without a split the optimization might overfit what it sees and then fail miserably to generalize to new data. Maybe that's not much of a problem with FSRS in the first place, since the forgetting curve has well-defined properties, but for any scheduling algorithm using things like neural networks, with enough parameters you could end up with very over-specific rules like "if the card has been reviewed 4 times, and the last review was on a Saturday, the stability will be 6d", just because it saw one or two cards in that situation.

But if you split train/test for one, you need to split train/test for all, or you're not comparing models under the same set of rules.

#

For example, right now I trained and tested my parameters on one of my decks, and Evaluate gives:
Log loss: 0.4024, RMSE(bins): 3.15%. Smaller numbers indicate a better fit to your review history.

Now I train on a subset, my training set:
Log loss: 0.3512, RMSE(bins): 2.94%. Smaller numbers indicate a better fit to your review history.

I use those parameters on my testing set:
Log loss: 0.4700, RMSE(bins): 12.68%. Smaller numbers indicate a better fit to your review history.
It performs much worse than the first result, which is a sign that the first Evaluate was good only because the model trained on a non-representative class of cards

If I had optimized directly on my testing set, I would have gotten:
Log loss: 0.4138, RMSE(bins): 3.80%. Smaller numbers indicate a better fit to your review history.

So the difference between those last 2 results shows that by optimizing and testing on the same set, I was able to get way better precision, but only by cheating, since I know what I'll be tested on

unique salmon
# wind palm Hasn't it always been understood that this is bad science to a certain extent? W...

It's helpful though. I've made a function to predict RMSE given the number of reviews and average retention, and we can compare that approximate value to the real RMSE of the user to find out whether he is doing well for his "weight class"
It's like "For your height and weight, your blood pressure is pretty good", if that analogy helps
The real problem is: what do we tell users with "poor" "health"? If someone's RMSE is way higher than what is expected given their retention and n(reviews), what should Evaluate display?
Just "Poor"? Users will complain
A list of possible explanations and advice? Users won't read it and then will complain anyway

bold terrace
#

(In this case I did the partitioning based on High/Low D so of course the diff is enormous, but if the partitioning is done smartly, like card.id % 10 or something, it should be hopefully less)

unique salmon
#

Basically, Sound is saying "Let's use all cards whose ID ends with a zero for testing, the rest for training"

bold terrace
#

Having such a rule would also make it super easy to know what is part of the training set and what's not, no need to flag 🤔

#

But I have no idea how the card fields are populated and whether card.id mod N is enough to get well-randomized partitions ...

unique salmon
#

It's Unix timestamps

cursive badge
#

I think card ID is the epoch ms it was created

unique salmon
#

Yeah, it's milliseconds elapsed since 01.01.1970 or something

bold terrace
#

maybe some chaos_function(card.id) mod N would be better

unique salmon
#

The last digit is as good as random

bold terrace
#

I guess yes

cursive badge
#

It doesn't guarantee you get exactly x% in each set, I just gave it as an example of one way to get a "stable" test set.

bold terrace
#

I know they say not to build random functions on epoch time, because if you generate values in a loop you'll get very obvious patterns based on CPU cycles. But here we're talking about cards created by humans

unique salmon
cursive badge
#

Or from bulk importing notes/cards

bold terrace
#

At least to me it looks pretty good

#
SELECT id % 20 AS mod_result, COUNT(*)
FROM cards GROUP BY mod_result

In SQL querier

tepid spoke
#

hm, I wonder what the card IDs for my deck look like

#

Cause it initially came to life as import from a CSV file

bold terrace
#

You could test it yes

#

Would be interesting to see

tepid spoke
#

that import took less than a second

bold terrace
#

50% of my cards are also one-shot imported though

#

But it's in milliseconds, so even an import should normally be spread evenly

cursive badge
#

I think you would have to be unlucky for the import loop to match up with the n you choose, but it could be possible.

bold terrace
#

Yeah, with the 10k dataset we could check that no collection diverges too much

tepid spoke
#

1675618557059
1675618557215
1675618559833
1675618567127
1675618567137

#

are some example card IDs

#

so did it just count up when collisions happened?

#

The card IDs are in perfectly ascending order with the WaniKani sort ID

bold terrace
#

sqlite3 ~/Library/Application\ Support/Anki2/User\ 1/collection.anki2 "SELECT id % 20 AS mod_result, COUNT(*) FROM cards GROUP BY mod_result;"

#

(The 'User\ 1' part might need to be adapted obviously, or the path altogether)
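
The same check can be done from Python with the standard `sqlite3` module. This is a sketch only: `mod_distribution` is a made-up helper, and the demo runs on an in-memory table with synthetic IDs so that it works without a real collection. For a real run you would pass a connection to your own `collection.anki2` (preferably a copy, with Anki closed).

```python
import sqlite3
from typing import Dict


def mod_distribution(conn: sqlite3.Connection, n: int = 20) -> Dict[int, float]:
    """Return {id % n: share of cards in percent} for the `cards` table."""
    rows = conn.execute(
        "SELECT id % ? AS mod_result, COUNT(*) FROM cards GROUP BY mod_result",
        (n,),
    ).fetchall()
    total = sum(count for _, count in rows)
    return {mod: 100.0 * count / total for mod, count in rows}


# Demo on an in-memory table with synthetic, consecutive IDs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cards (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO cards (id) VALUES (?)",
    [(1675618557000 + i,) for i in range(1000)],
)
for mod, percent in sorted(mod_distribution(conn, 20).items()):
    print(mod, round(percent, 2))
```

With an even spread, every bucket should sit near 100/n percent; a bucket far from that would suggest the mod-N split is biased for that collection.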

unique salmon
#

I think you want an absolute standard, not a relative one

#

You want a standard that does not depend on data from other users

#

But we can't do that. Well, we can, but it would be arbitrary
We could just say "RMSE above 10% is bad", without looking at RMSE from lots of users, but that would be kinda dumb

bold terrace
#

LOL

#

I asked GPT

#

For a chaos function

#

he gave me

#

SELECT abs((id * 2654435761) % 4294967296) % 20 AS chaos_mod, COUNT(*) FROM cards GROUP BY chaos_mod;

#

Result ?

#

What the hell went wrong there 😆
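
The query's output is not shown in the log, so the cause is a guess: SQLite integers are 64-bit signed, and a millisecond-timestamp ID multiplied by 2654435761 is far outside that range, so the product can no longer be stored as an exact integer and the `%` that follows works on a damaged value. Below is a sketch of the same multiplicative hash in Python, where integers do not overflow. The `bucket` helper is hypothetical, and the sample IDs are the ones posted earlier in the thread.

```python
from collections import Counter

KNUTH = 2654435761   # multiplier used in the query above
MASK32 = 0xFFFFFFFF  # keeps the low 32 bits, same as "% 4294967296"


def bucket(card_id: int, n: int = 20) -> int:
    """Multiplicative hash of card_id reduced to a bucket in [0, n)."""
    # Masking the ID first does not change the low 32 bits of the product,
    # it only keeps the intermediate value small.
    h = ((card_id & MASK32) * KNUTH) & MASK32
    return h % n


sample_ids = [
    1675618557059,
    1675618557215,
    1675618559833,
    1675618567127,
    1675618567137,
]
print(Counter(bucket(card_id) for card_id in sample_ids))
```

Whether this spreads real card IDs any better than a plain `id % 20` is something the distribution query would have to show; the sketch only demonstrates the hash computed without overflow.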

unique salmon
#

Google "hash function"

#

That's what you're looking for

bold terrace
#

Are hashes, strictly speaking, chaos functions?

#

but it's true that here I merely want those to be evenly distributed

cursive badge
#

Not intrinsically. But there must be some good, fast, uniform ones used for hash tables.

#

Cryptographic hash functions would probably be bad because they are deliberately slow.

bold terrace
#

Boah

#

Might not be necessary

#

I see there's none builtin in sqlite

#

SELECT id % 20 AS mod_result, COUNT(*) AS count, ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM cards), 2) AS percent, ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM cards) - 5.0, 2) AS percent_deviation FROM cards GROUP BY mod_result;

#

For SQL, GPT is quite useful 😆

#

Well I'm not sure it was worth it to do a fancy 5-percent for percent_deviation LOL

#

I also see he hardcoded it

#

super clean code 😆

#

But yeah, seems mod N is more than good enough if we don't want to do flagging 🤷
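
A rough Python version of the same bucket check without the hardcoded 5 percent, so it works for any N. `bucket_report` is a hypothetical helper, and the IDs are just the five example card IDs posted earlier, not a real collection.

```python
from collections import Counter


def bucket_report(ids, n):
    """Return (bucket, count, percent, deviation from 100/n) for each bucket."""
    counts = Counter(i % n for i in ids)
    total = len(ids)
    expected = 100.0 / n  # uniform share per bucket, in percent
    rows = []
    for b in range(n):
        pct = 100.0 * counts[b] / total
        rows.append((b, counts[b], round(pct, 2), round(pct - expected, 2)))
    return rows


sample_ids = [
    1675618557059,
    1675618557215,
    1675618559833,
    1675618567127,
    1675618567137,
]

# Print only the non-empty buckets to keep the output short.
for row in bucket_report(sample_ids, 20):
    if row[1]:
        print(row)
```

With only five IDs the deviations are meaningless, of course. The point is that the expected share is derived from `n` instead of being typed in.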

unique salmon
#

The numbers will still be displayed, just for reference

#

I am repeating myself, but the only real problem is what to do with people who fall in the red zone.

#

FSRS doesn't have any kind of "emergency mode" or whatever

#

Like, there is no secret button to fix your shit

#

Well, I guess "Remedy Hard Misuse" is a bit like that

#

My point is that it's inevitable that some people will have crappy numbers. What's the course of action then?

cursive badge
#

Someone nice writes a "Reasons why your FSRS evaluation might be bad and what you can do about it" page to put in the manual

#

🤷‍♂️

cosmic hedge
#

what was wrong with Jarrett's solution for this with the cost from retention btw?

wind palm
# bold terrace It still make sense to split because without split, the optimization might overf...

I can tell you did a great job of explaining it, but unfortunately I still don't get it. It's my deficiency, not yours.
I can't even ask clarifying questions, because I just have no idea what you're explaining to me.

Let's see if we can get there without me understanding it. -- You've seen my reasons for wanting #2.

  • Is your #3 a better measure than the current method for the user of how well FSRS is working for them, is matching their memory curve, is predicting the appropriate time to study their cards?
  • Can we still describe it that same way in general terms -- how well it's working, matching, predicting?
  • Will your #3 run nearly as fast as Evaluate does now?
wind palm
# unique salmon It's helpful though. I've made a function to predict RMSE given the number of re...

> we can compare that approximate value to the real RMSE of the user to find out whether he is doing well for his "weight class"

Is that better than a simple lower is better? It feels like comparing it to a set scale is going to cause more trouble than number-goes-down=good, number-goes-up=bad.

> We could just say "RMSE above 10% is bad", without looking at RMSE from lots of users, but that would be kinda dumb

Are we still using the same "working definition" (not entirely mathematically accurate, blah, blah) of RMSE? So isn't "FSRS makes mistakes scheduling 10% of your cards (or 10% of the time)" objectively bad? I don't need to compare to anyone else's results to figure that out.
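
For reference, a simplified Python sketch of what the two numbers measure. The big assumption: this bins reviews by predicted R only, which is not the binning Anki's RMSE(bins) uses, so the values will not match Evaluate. `log_loss` and `rmse_bins` are hypothetical helper names. The toy data shows that an RMSE of 10% here means "predicted 90%, actual 80%", an average calibration gap, and not literally "10% of cards scheduled wrong".

```python
import math
from collections import defaultdict


def log_loss(preds, outcomes, eps=1e-7):
    """Mean binary cross-entropy between predicted R and 0/1 outcomes."""
    total = 0.0
    for p, y in zip(preds, outcomes):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(preds)


def rmse_bins(preds, outcomes, n_bins=20):
    """Group reviews by predicted R, then compare mean prediction with
    mean outcome in each group, weighted by group size."""
    bins = defaultdict(list)
    for p, y in zip(preds, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    weighted_sq = 0.0
    for items in bins.values():
        mean_p = sum(p for p, _ in items) / len(items)
        mean_y = sum(y for _, y in items) / len(items)
        weighted_sq += len(items) * (mean_p - mean_y) ** 2
    return math.sqrt(weighted_sq / len(preds))


# Well calibrated: predicted 90% and 70%, observed 9/10 and 7/10.
good_preds = [0.9] * 10 + [0.7] * 10
good_outcomes = [1] * 9 + [0] + [1] * 7 + [0] * 3

# Miscalibrated: predicted 90%, observed 8/10.
bad_preds = [0.9] * 10
bad_outcomes = [1] * 8 + [0] * 2

print(rmse_bins(good_preds, good_outcomes), rmse_bins(bad_preds, bad_outcomes))
```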

cursive badge
cursive badge
wind palm
cursive badge
polar maple
wind palm
cursive badge
# wind palm The way folks talk about this, it sounds like it's impossible to tell the optimi...

Maybe a slightly different framing will help:

Imagine I want to teach you how to do addition, but I can only do it by showing you lots of examples e.g. "45 + 22 = 67"
I give you the big book full of examples and let you try to figure out the rules yourself.

Now I want to test how well you learned by asking you questions.
I ask you questions from the book and you do really well so I think my job here is done!
Unfortunately you cheated, you just memorised the examples from the big book, you didn't actually understand addition.
If you later encounter addition problems that were not in the book you do really badly.

This is the overfitting problem. I've taught you to be very good at repeating what you have seen before, but not the general rules that will let you solve novel problems in the future.

Imagine instead I only gave you 4/5 of the book to learn from but kept the last 1/5 of it for myself.
If I later test you using only questions from my part of the book that you have never seen I can get a better idea of if you really understand addition because you cannot have memorised the answers.

The downside to splitting the book is that you will have fewer examples to learn from, so may find it more difficult to learn the rules of addition in the first place. I'll be better at evaluating your performance but your performance might actually be worse (than if you did not cheat with the full book).

This problem of splitting data into train/test possibly reducing performance is why some (Jarrett?) like the idea of the "5 way split" Evaluate as seen in the benchmark:

You keep optimising with all the data and just hope that there is not too much overfitting.

You can get an idea of how well FSRS works in general on your data (but not your specific parameters) by splitting your data into 5 parts then training and testing 5 times choosing a different part as the "test" data each time and average the results.

#

(N.B. I have not checked if this last part is exactly how the benchmark does it)
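
A toy Python sketch of the 5-way split described above, under stated assumptions: folds are assigned by `card_id % k` (the "mod N" idea from this thread), and the "optimizer" is a stand-in that predicts one constant recall probability. None of this is how the benchmark or Anki actually does it; `k_fold_by_id`, `fit` and `rmse` are hypothetical names.

```python
import math
import random


def k_fold_by_id(reviews, k=5):
    """Yield (train, test) pairs; a review belongs to fold `card_id % k`."""
    for fold in range(k):
        train = [r for r in reviews if r[0] % k != fold]
        test = [r for r in reviews if r[0] % k == fold]
        yield train, test


def fit(train):
    """Stand-in optimizer: learn a single constant recall probability."""
    return sum(y for _, y in train) / len(train)


def rmse(p, test):
    """Root mean squared error of the constant prediction p on unseen data."""
    return math.sqrt(sum((p - y) ** 2 for _, y in test) / len(test))


# Hypothetical review log: (card_id, recalled) with roughly 80% recall.
rng = random.Random(0)
reviews = [(i, 1 if rng.random() < 0.8 else 0) for i in range(1000)]

# Train on 4/5, test on the held-out 1/5, five times, then average.
scores = [rmse(fit(train), test) for train, test in k_fold_by_id(reviews)]
average_score = sum(scores) / len(scores)
print(f"{len(scores)} folds, average test RMSE = {average_score:.3f}")
```

Every review is used for testing exactly once and never by a model that trained on it, which is what makes the averaged score a fairer estimate than testing on the training data.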

polar maple
cursive badge
quasi shadow
#

What’s the difference between a private and public leaderboard?
The Kaggle leaderboard has a public and private component to prevent participants from “overfitting” to the leaderboard. If your model is “overfit” to a dataset then it is not generalizable outside of the dataset you trained it on. This means that your model would have low accuracy on another sample of data taken from a similar dataset.

Public Leaderboard

For all participants, the same 50% of predictions from the test set are assigned to the public leaderboard. The score you see on the public leaderboard reflects your model’s accuracy on this portion of the test set.

Private Leaderboard

The other 50% of predictions from the test set are assigned to the private leaderboard. The private leaderboard is not visible to participants until the competition has concluded. At the end of a competition, we will reveal the private leaderboard so you can see your score on the other 50% of the test data. The scores on the private leaderboard are used to determine the competition winners. Getting Started competitions are run on a rolling timeline so the private leaderboard is never revealed.

#

😂 You can overfit to the public leaderboard by tuning your model based on the test score and get a bad rank in the private leaderboard.

#

In the case of FSRS and @bold terrace's method #3, you can tune the parameters on the test set even if it isn't used for training, and you may get worse results in the future.

cursive badge
#

You can also get into train-test-validation splits because you "taint" any data that you use to twiddle optimisation.
If you are doing good science™ the final evaluation must be on data that has never been used previously.

#

At least these are my memories of an undergraduate long ago 😅

quasi shadow
#

So you could only evaluate the parameters in the next month with new data.

#

🀣

cursive badge
#

This is why I got into nice deterministic simulations for my research. The evaluation was much simpler! 😅

quasi shadow
#

In my view, the evaluation only makes sense when we search for a reproducible optimization method. Tuning the parameters by hand is unlikely to be reproducible.

#

😅 Feel unsatisfied about your parameters? Please challenge the SRS Benchmark!

cursive badge
#

To be frank the moment you manually edit your params you are fully in "here be dragons" territory and should not expect any built-in help.

bold terrace
#

Yeah agree and also wonder what people actually tweak. I'd make a bet that it's mostly the initial stability, but I have no proof

#

And I think most of the time, just to reduce the good/easy initial ones

0.1079, 0.8219, 3.3692, 31.2728, 7.2741, 0.4920, 2.0791, 0.0727, 1.3029, 0.2688, 0.8197, 1.8849, 0.0873, 0.3245, 2.3331, 0.0939, 3.2766, 0.7575, 0.3003, 0.0905, 0.1176
Log loss: 0.3512, RMSE(bins): 2.94%. Smaller numbers indicate a better fit to your review history.

If I tweak them because I fear long first intervals :

1.1079, 1.8219, 1.3692, 1.2728, 7.2741, 0.4920, 2.0791, 0.0727, 1.3029, 0.2688, 0.8197, 1.8849, 0.0873, 0.3245, 2.3331, 0.0939, 3.2766, 0.7575, 0.3003, 0.0905, 0.1176
Log loss: 0.3591, RMSE(bins): 4.21%. Smaller numbers indicate a better fit to your review history.

Soooo ... I'm not against making them read-only, and maybe, for people actually tweaking them, still allowing them to set the 4 initial stabilities? I don't know. I think Evaluate is useful to see how well the model is able to predict your stuff, and it's cool to copy your parameters into a visualizer and simulate some revlog, but I don't know how useful it is to tweak the parameters

#

But IMO the mod N way to partition Test/Training set seems so nice it would be cool to be able to test it 🙂

quasi shadow
bold terrace
#

But yeah, I'm completely curious to see how some kind of "clustering" can definitely help 🙂

#

I'm also wondering if the decay wouldn't be different between those 2 groups

bold terrace
#

Since a low decay like .1 translates into "It takes a very looong time to get down to 60-70% DR", it might be that the group of users with high DR have a different way of approaching Anki (lots of exposure outside Anki) vs people that use mainly Anki (and not a lot of external exposure)

quasi shadow
#

Finished!

bold terrace
#

It's funny how alienation can have different results on people 😅 . Some will shut themselves, and other like me included, are almost getting motivated by it 😆

unique salmon
# cosmic hedge left is years, all 0.94

Huh
Have you tried dividing by (t2-t1) just in case? Again, originally the average_forgetting_curve function is supposed to return a number between 0 and 1
This is really strange, I feel like the implementation is wrong somehow
Try dividing and if that still doesn't produce sensible results, show me the Rust code of the integral and I'll try my best to find the problem

unique salmon
# quasi shadow Finished!

I'll re-write some stuff and send you a .docx file later
I think you should write it in very simple layman terms and collapse everything technical, like how you collapsed "Background"

unique salmon
quasi shadow
unique salmon
quasi shadow
#

Btw, what's the reason you created the FSRS Megathread?

#

I have forgotten it.

unique salmon
quasi shadow
#

makes sense

clever cargo
#

there probably should be an fsrs channel, to make searching easier

quasi shadow
#

But it's still very hard to dig messages from discord.😂

unique salmon
quasi shadow
#

We will have 30k messages soon.

clever cargo
#

we cant make threads in a thread and we cant search in a specific thread either

hasty fractal
lapis hearth
#

Is there actually a learning program like Anki that uses Neural Nets (or AI) as its scheduling algorithm

#
#

I have found this but it seems sketchy

unique salmon
lapis hearth
#

has anyone had a good experience with it❓

lapis hearth
#

Because the whole idea of let the program do the work for you has sold me

unique salmon
lapis hearth
#

So what is holding Anki from using a neural-net as well

#

It seems the Dekki guy sees Anki as a competitor and does not want to reveal the works behind his neural net

robust hill
#

the master

#

what can a neural net even do

#

how much more is there that you can optimize

lapis hearth
#

Which would theoretically make it have a pseudo-short term memory model

#

But I dont know what I am talking about here

#

All I know is that it notices patterns which would otherwise not be easy to model by mathematical formulae

#

So I feel quite tempted by it

#

And then I asked if there are learning programs like it

#

with neural nets above all

robust hill
#

somehow

#

im doing so well in this deck

unique salmon
#

Notice how big of a jump it is compared to everything else in that table

clever cargo
#

ye 2.7 million

unique salmon
#

I meant log-loss, RMSE and AUC

#

Other models cannot get below 0.31 log-loss, this gets 0.27
Other models cannot get below 3.5% RMSE, this gets 1.4%
Other models cannot get above 0.73 AUC, this gets 0.82

lapis hearth
#

So what is the hold up❓ The sync problem I get it but why when other programs like Dekki are doing it🥲

unique salmon
#

@polar maple

#

Well, one of the holdups is that it doesn't have a forgetting curve 😅
I mean, it does, but not as a nice, simple formula. So you can get all kinds of weirdness, like the probability of recall increasing over time and whatnot
And it would be very difficult to calculate an interval that corresponds to a specific probability of recall, for scheduling purposes
And it would be difficult to ensure things like Again <= Hard <= Good <= Easy

#

The nice thing about FSRS is that predicting the probability of recall and scheduling the next interval are equally easy, but not with this
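
To make the "equally easy" point concrete, here is a small Python sketch of the power forgetting curve and its inverse. It assumes a fixed `DECAY = -0.5` and the same factor expression as the Rust snippet in this thread, `0.9^(1/decay) - 1`. `retrievability` and `next_interval` are hypothetical helper names, not the real FSRS API.

```python
DECAY = -0.5                      # assumed fixed decay, for illustration only
FACTOR = 0.9 ** (1 / DECAY) - 1   # chosen so that R = 0.9 when t == S


def retrievability(t: float, s: float) -> float:
    """Predicted probability of recall t days after a review, stability s."""
    return (1 + FACTOR * t / s) ** DECAY


def next_interval(s: float, desired_retention: float) -> float:
    """Invert the curve: days until R falls to desired_retention."""
    return s / FACTOR * (desired_retention ** (1 / DECAY) - 1)


stability = 10.0
print(retrievability(10.0, stability))   # R after exactly `stability` days
print(next_interval(stability, 0.8))     # days until R drops to 80%
```

Because the curve is one closed-form, monotonically decreasing expression, scheduling is just solving it for t. A neural net that outputs R directly has no such inverse, which is the difficulty described above.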

lapis hearth
#

So weird memory = weird intervals = weird curves

#

And Dekki seems to be fine

#

I was just asking for examples of programs with neural nets and it does not seem to be a major con

robust hill
#

how are there 2.7 million parameters

#

😭

#

so thisi means that neural net is going to make me a super genius

unique salmon
#

I'll message the Dekki guy again, maybe he will participate in the benchmark

lapis hearth
#

I really REALLY hope for Anki to have a Neural Net

#

The only way to come close to match the weirdness of the human memory

robust hill
#

start coding

unique salmon
#

...or not, Reddit just doesn't load chat

lapis hearth
#

F***** me

robust hill
#

is there a way to see average retrievability for a specific day

#

now i have this, only for today, but is there a way i can see what it would be like in 5 days or 6 days

#

if i didnt review the deck

unique salmon
robust hill
#

because my plan is to only do filtered decks for cards under 90% average retrievability the day before the exam

cosmic hedge
robust hill
#

yes

#

so like

#

today its 93% but i want to see

#

if i dont do reviews, what would it be tomorrow, or in 5 days
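
A rough Python sketch of that "what if I skip reviews" projection, assuming each card is described by days since its last review and its stability, and assuming the same fixed-decay forgetting curve (`DECAY = -0.5`). The card values are made up, and `average_retrievability` is a hypothetical helper, not an Anki feature.

```python
DECAY = -0.5                      # assumed fixed decay, for illustration only
FACTOR = 0.9 ** (1 / DECAY) - 1


def retrievability(t: float, s: float) -> float:
    """Predicted probability of recall t days after the last review."""
    return (1 + FACTOR * t / s) ** DECAY


def average_retrievability(cards, days_ahead: int = 0) -> float:
    """Mean R over all cards if no reviews happen for days_ahead more days."""
    total = sum(retrievability(elapsed + days_ahead, s) for elapsed, s in cards)
    return total / len(cards)


# Made-up deck: (days since last review, stability in days)
cards = [(1, 30.0), (3, 10.0), (10, 100.0), (0, 2.0)]

for days in (0, 1, 5):
    print(days, round(average_retrievability(cards, days), 4))
```

The same loop could list the cards whose projected R falls under 90% on a given day, which is the filtered-deck plan mentioned just below.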

cosmic hedge
cosmic hedge
robust hill
#

does it use the parameters of the deck im in

#

or of the cards of whatever deck they are in

cosmic hedge
unique salmon
cosmic hedge
robust hill
#

lose 2 cards everyday somehow

#

doesnt seem right

#

shouldnt it be a lot more

cosmic hedge
unique salmon
robust hill
cosmic hedge
#

could you screenshot them?

cosmic hedge
cosmic hedge
# robust hill

what does the reviews graph look like bc that is weird

robust hill
#

0

cosmic hedge
robust hill
unique salmon
#

I tried the Python simulator with the integral over the next FIVE THOUSAND YEARS and still got 70%

cosmic hedge
robust hill
#

to 518 at the far right

cosmic hedge
# unique salmon I tried the Python simulator with the integral over the next FIVE THOUSAND YEARS...
```rs
pub fn average_f_power_forgetting_curve(
    learn_span: usize,
    cards: &[Card],
    decay: f32,
) -> f32 {
    let factor = 0.9_f32.powf(1.0 / decay) - 1.0;
    let exp = decay + 1.0;
    let den_factor = factor * exp;

    // Closure equivalent to the inner integral function
    let integral_calc = |card: &Card| -> f32 {
        // Performs element-wise: (s / den_factor) * (1.0 + factor * t / s).powf(exp)
        let t1 = card.last_date - learn_span as f32;
        let t2 = t1 + 365.;
        (card.stability / den_factor) * (1.0 + factor * t2 / card.stability).powf(exp) - 
        (card.stability / den_factor) * (1.0 + factor * t1 / card.stability).powf(exp)  
    };

    // Calculate integral difference and divide by time difference element-wise
    cards.iter().map(integral_calc).sum::<f32>()
}
``` if you want to check it
cosmic hedge
robust hill
#

i see

#

i guess i am doubting myself

#

can we bring back decimal desired retention 🙏

unique salmon
cosmic hedge
robust hill
#

well

#

my desired retention is 80%

#

haha

cosmic hedge
cosmic hedge
#

wait no

#

hold on

unique salmon
#

I just want you to give me the output for some input S, t1, t2, decay so that I can verify the math

cosmic hedge
unique salmon
#

?

#

Is your t1 negative?

#

Or what is going on there?

#

I'm trying to figure out what this could mean, and I can't

#

t1 is just time since the last review of this card

#

And I can't reproduce your number, btw

#

Ok, yeah, so your t1 is negative

#

Though I doubt that's the reason why you're getting 94% every time

#

I have no idea how you're getting 94%
Let's try to do this as properly as possible:

  1. No negative t1, it's the number of days since the last review

```python
def average_f_power_forgetting_curve(t1, t2, s, decay):
    if not t2 > t1:
        raise ValueError("t2 must be greater than t1")

    # Calculate F(t2) - F(t1) where F is the antiderivative
    integral = integral_power_forgetting_curve(t2, s, decay) - integral_power_forgetting_curve(t1, s, decay)
    print(f'Raw integral={integral:.5f}')

    # Divide it by the difference in time to get the average
    return integral / (t2 - t1)
```

Divide by t2-t1. If the integral is over the next 365 days, divide by 365. If it's over the next 1825 days, divide by 1825, etc. Aka ensure that the output is between 0 and 1
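
For anyone who wants to check the math end to end, here is a self-contained Python sketch. The antiderivative is reconstructed from the Rust closure posted above, `(s / (factor * exp)) * (1 + factor * t / s)^exp` with `exp = decay + 1`, so it assumes `decay != -1`. `numeric_average` is a hypothetical helper added only to cross-check the closed form with a midpoint sum.

```python
def power_forgetting_curve(t, s, decay):
    """Instantaneous R at t days since the last review."""
    factor = 0.9 ** (1 / decay) - 1
    return (1 + factor * t / s) ** decay


def integral_power_forgetting_curve(t, s, decay):
    """Antiderivative F(t) of the forgetting curve (requires decay != -1)."""
    factor = 0.9 ** (1 / decay) - 1
    exp = decay + 1
    return (s / (factor * exp)) * (1 + factor * t / s) ** exp


def average_f_power_forgetting_curve(t1, t2, s, decay):
    """Average R between t1 and t2 days since the last review."""
    if t1 < 0:
        raise ValueError("t1 is days since the last review, it cannot be negative")
    if not t2 > t1:
        raise ValueError("t2 must be greater than t1")
    integral = (integral_power_forgetting_curve(t2, s, decay)
                - integral_power_forgetting_curve(t1, s, decay))
    return integral / (t2 - t1)  # dividing keeps the result between 0 and 1


def numeric_average(t1, t2, s, decay, steps=10_000):
    """Midpoint-rule estimate of the same average, as a sanity check."""
    h = (t2 - t1) / steps
    total = sum(power_forgetting_curve(t1 + (i + 0.5) * h, s, decay)
                for i in range(steps))
    return total / steps


# Example inputs S, t1, t2, decay (made up) to compare against another implementation.
s, t1, t2, decay = 100.0, 5.0, 370.0, -0.5
print(average_f_power_forgetting_curve(t1, t2, s, decay))
print(numeric_average(t1, t2, s, decay))
```

If the two printed numbers agree, the closed form is fine, and any remaining oddity is in the inputs (for example a negative t1, or summing over cards without dividing by the number of cards).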

#

I just want to confirm that you get 94% even if everything is exactly as intended, no cutting corners

unique salmon
#

But I'm like 90% sure they won't participate in Jarrett's benchmark

cosmic hedge
# unique salmon Ok, yeah, so your t1 is negative

now its 0.7 again 🎉
```rs
pub fn average_f_power_forgetting_curve(
    learn_span: usize,
    cards: &[Card],
    decay: f32,
) -> f32 {
    let factor = 0.9_f32.powf(1.0 / decay) - 1.0;
    let exp = decay + 1.0;
    let den_factor = factor * exp;

    let offset = 365. * 10.;
    // Closure equivalent to the inner integral function
    let integral_calc = |card: &Card| -> f32 {
        // Performs element-wise: (s / den_factor) * (1.0 + factor * t / s).powf(exp)
        let t1 = learn_span as f32 - card.last_date;
        let t2 = t1 + offset;
        (card.stability / den_factor) * (1.0 + factor * t2 / card.stability).powf(exp) -
        (card.stability / den_factor) * (1.0 + factor * t1 / card.stability).powf(exp)
    };

    // Calculate integral difference and divide by time difference element-wise
    cards.iter().map(integral_calc).sum::<f32>() / offset
}
```

#

so was the problem you had with Jarrett's cost by retention that the numbers were too arbitrary or something?

unique salmon
#

We can do it properly though

unique salmon
cosmic hedge
#

i think basing anything off of what happens in 10 years time might be slightly insane already though 😂