#FSRS Megathread


clever cargo
#

you can give him write access to your fork

unique salmon
#

how

cosmic hedge
#

I'd rather just pr it myself and deal with it that way 😭

unique salmon
clever cargo
#

svelte not being easily modified by addons is why this needs to be discussed more than just one person being fine with it

sick moth
#

Learn to git patch

cosmic hedge
#

Tbf i do see a fair number of people ask about it.

clever cargo
#

a note in the help modal then?

#

my gripe is "the first review ever made" is too broad

cosmic hedge
sick moth
#

Can't you have it show "N/10000 cards included"

unique salmon
clever cargo
sick moth
#

"-is:suspended" caught me out once

cosmic hedge
#

it would be more complicated though

bold terrace
#

"Ignore cards reviewed before"
... And all the subsequent reviews, their offspring, all their heritage, burn them to the ground as they never existed in the first place

#

🔥

cosmic hedge
bold terrace
#

"Ignore cards (and their future reviews) reviewed before" could also fit, but maybe too long

unique salmon
#

A lot of people don't even realize that you can click on settings to see a help text thingy

clever cargo
#

in that case, the naming itself is misleading

unique salmon
clever cargo
#

oh its currently alr "ignore cards reviewed before"

unique salmon
#

Ok, maybe not that many

bold terrace
clever cargo
#

maybe that would be better addressed than going for the nuclear neon option all the time

clever cargo
bold terrace
#

Yeah but the Help menu should be there to go "deeper" in knowledge, not to get the knowledge right

unique salmon
#

The warning being displayed only if the selected date > date of the first review, I mean

clever cargo
cosmic hedge
#

i can see the red pr closed symbol now 😔

bold terrace
#

"Ignore Cards introduced before" ?

cosmic hedge
#

i mean i guess you could cache the first review of every card every time you open the window

unique salmon
bold terrace
#

With "Ignore Cards introduced before", you would not risk shifting the focus from the "card" to the "review"

clever cargo
cosmic hedge
#

i mean its an improvement

unique salmon
#

I imagine some people will think it means "Created before that date" and some people will think "u wot m8"

bold terrace
#

"Ignore Cards learned before" ?

cosmic hedge
clever cargo
#

so having anything in that field would differ from the default

bold terrace
#

"Learning" is a "core" word, should do the trick

cosmic hedge
#

yeah but if they dont change it from the default it wont match any cards right?

clever cargo
#

every review ever was done after 1970

cosmic hedge
#

yeah so if its not changed from 1970 its not going to match any cards?

#

so no one needs warning

bold terrace
#

@clever cargo you start to reach the point of : "Finding problems for the sake of finding ones" 😛

bold terrace
#

I mean, the fact the default is 1970 doesn't really cause an issue for showing the highlight when the option changes, no 🙂 ?

robust hill
#

what if we just execute the user whenever theres an error on their behalf

clever cargo
#

if its filled with any date past the first review, then its going to show a warning

bold terrace
clever cargo
#

which imo is too broad, given how many decks a user can have

cosmic hedge
#

@unique salmon you need to fill it out with text first but yeah it does appear

#

above where it should but it's still there

cursive badge
#

I'm still spooked by the Anki build system. I saw that there was a custom rust program that generated ninja files and went: "Ok, I'm not touching that Gordian knot until I really have to."

clever cargo
quiet saddle
cursive badge
quasi shadow
#

just get used to it 😅

#

I spent nearly one month to understand the framework of Anki codebase.

clever cargo
#

still have no idea how the scheduler works

quasi shadow
#

which part?

clever cargo
#

arthur's fork has a good explanation on queues

clever cargo
#

will read up on it more

quasi shadow
#

it is the rabbit hole.

#

you will be surprised how deep it is

quiet saddle
polar maple
#

@quasi shadow how about modifying FSRS so that the first rating determines a resulting fixed decay (1 -> 0.5, 3 -> 0.2)? A problem would be that the parameters would be mixed up having to support multiple forgetting curves, so maybe an alternative evaluation could be: you train FSRS with decay 0.2 and evaluate only on cards with first rating = 3, and then do the same thing with decay 0.5 and evaluate only on cards with first rating = 1

#

also good news, weighted exponential curves perform similarly to power curves. I'll work on some plots to see what the curves look like later

unique salmon
#

And considering that people have all kinds of weird habits and make entire threads about "What button do you press for the first review?", this really doesn't seem like a good idea

#

Making decay depend on D would be interesting, but that didn't work

polar maple
tepid spoke
#

this graph is quite concerning. Why does it just go up and up oO

#

if I make it 1000 days, it does this. Which makes no sense to me

cursive badge
#

You run out of new cards around Nov/Dec 2026?

tepid spoke
#

I run out of new cards in two weeks

#

The graph indeed looks like all cards are new

unique salmon
#

How many cards do you have in that preset?

tepid spoke
#

~18000

unique salmon
#

That's 720 days until you run out of new cards, so around 2 years

#

@quasi shadow please investigate this, it seems that the simulator treats all cards as new

tepid spoke
#

It's also just plain wrong even for tomorrow, I'll have ~350 reviews, not 150-200

#

This must be some artifact from me splitting my one big deck into a huge number of sub and sub-sub decks

#

but all decks use the same preset, so it shouldn't matter for the Simulator. Or so I thought.

cosmic hedge
tepid spoke
#

I can export that deck with scheduling info.

#

The whole collection is rather big

cosmic hedge
#

yeah ok

ashen light
tepid spoke
cosmic hedge
#

idk i guess im guarded with my own decks for some reason XD

ashen light
#

also re: leeches and tagging, I was fully thinking leech was gonna be its own card prop rather than a tag or whatev, is:leech type stuff

ashen light
tepid spoke
#

I hope that exported fine. Didn't test exporting since the Subdeck-Inflation

unique salmon
ashen light
#

I just think its better in general, not even about problematicness or not

#

(it also allows both leech types to exist at the same time, assuming addons or whatev care about leech tags)

unique salmon
#

I'd rather both the new and the old detector do the same thing, for consistency

ashen light
#

I figured old method would be removed

#

"leech after N fails" is a shitty metric by every metric

tepid spoke
#

Do leeches un-leech after a while at the moment?

ashen light
#

nope!

#

theres currently no unleeching mechanic

tepid spoke
#

Then I have a surprisingly low amount of them

ashen light
#

I had leeches but then I turned up the leech count to like 1000 so of course I never have any

#

it just felt not-useful

#

¯\_(ツ)_/¯

cosmic hedge
tepid spoke
#

But why? The simulator worked fine not too long ago

cosmic hedge
#

🤷‍♂️

tepid spoke
#

And what does "missing memory states" actually mean?

cursive badge
#

It is very annoying how they only apply to notes. I have considered duplicating my notes so there is only 1 card per note just so the leech tag is useful.

ashen light
#

I knew there was a reason I just forgot

#

yeah note-level tagging of leeches is actually useless

cosmic hedge
#

a lot of your cards are missing that save

tepid spoke
#

That's so odd

#

how would that happen

#

And nudging the parameters causes a global re-calculation?

cosmic hedge
#

yeah thats what the progress bar that appears after you hit "save" is showing you the progress of

tepid spoke
#

never saw that, guess my PC is too fast or something :D

cosmic hedge
#

suffering from success 😔

cursive badge
#

Could an addon have clobbered the custom card data maybe? If I remember correctly all the FSRS state is stored in there.

cosmic hedge
cursive badge
#

Maybe I'm thinking of revlogs 😕

clever cargo
#

its got its own memory state field now

tepid spoke
#

I wrote a helper-addon to split the deck into subdecks

#

but all that does is find the cards, and call mw.col.set_deck on them

cosmic hedge
#

every time you move a card it gets its memory state erased

clever cargo
tepid spoke
#

How do I re-generate it from within the addon? :D

clever cargo
#

or just holdover from the old days?

cosmic hedge
unique salmon
ashen light
#

eventually, obviously

#

but maybe maybe this method is so much better dae would just be ok with it being removed

#

๐Ÿƒ

unique salmon
#

Btw jake, read the comments below this: https://forums.ankiweb.net/t/automated-leech-detection/56887/16?u=expertium
Another user also proposed not using tags/flags, but I don't think it will work

ashen light
#

is it bad if a handful of on-the-edge cards bounce back and forth between being a leech and not?

cursive badge
ashen light
#

but is it bad?

#

why is it a problem

tepid spoke
#

it seems unavoidable

#

I have a bunch of cards that are part-time leeches like that

cursive badge
#

🤷‍♂️

tepid spoke
#

I'm honestly not sure how I should rate some cards. Like, how much "off-ness" I should tolerate

cursive badge
tepid spoke
#

this looks indeed much more reasonable

cursive badge
#

Interestingly DR is stored per-card which suggests you could get weird and do different DRs per-card instead of per-preset if you wanted.

cosmic hedge
unique salmon
#

People will be like "Why is my card going from 'leech' to 'not a leech' so often?"

cursive badge
ashen light
#

"because you keep passing then failing it" ez

unique salmon
#

So we have to do the bullshittery with two thresholds or with updating the status only after every N reviews

ashen light
#

"my leech-p is above threshold_1 but it's still marked as a leech, what gives?" - equally nonsensical complaint the other direction

cosmic hedge
unique salmon
#

The detector will be a black box

ashen light
#

that is super lame

unique salmon
#

The only thing we will show is p(leech)

#

Well, and the leech status as a binary variable

ashen light
#

why is it p(leech) if its a bool

unique salmon
#

I mean that we will show both the probability and the binary leech/not a leech label

ashen light
#

this feature is boring now man

unique salmon
#

Power users can search for p_leech in the browse window

ashen light
#

if we show the probability then my complaint will be a thing

#

"my leech-p is above threshold_1 but it's still marked as a leech, what gives?" - equally nonsensical complaint the other direction
exists in any situation p(leech) is shown

lapis hearth
#

And this

ashen light
#

trends can be an addon, 0% chance it'll be in anki proper

cursive badge
cosmic hedge
sick moth
cosmic hedge
lapis hearth
unique salmon
#

Right now we can't even decide on the specifics of the leech detector itself 😅

#

Oh, btw, I feel like I should clarify precisely what p(leech) means, statistically. I don't think I've explained this clearly before
With this detector, p(normal) aka 1-p(leech) can be interpreted as "Probability of observing this many or fewer successful reviews, assuming that probabilities given by FSRS are the true probabilities of recall", in other words, assuming that FSRS can predict the probability of recall perfectly accurately

#

It's a p-value for a one-sided statistical significance test where the null hypothesis is "The true probabilities are [whatever numbers FSRS predicted]"

#

So if the p-value is low, it means that it's very unlikely that we would see these outcomes if the probabilities predicted by FSRS were the true probabilities of recall (for this card)
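A minimal sketch of that p-value, assuming each review is an independent success/failure with the recall probability FSRS predicted for it. The function names (`poisson_binomial_pmf`, `p_normal`, `p_leech`) are made up for illustration and are not taken from any existing implementation:

```python
def poisson_binomial_pmf(probs):
    """Return a list where pmf[k] = P(exactly k successes).

    `probs` holds one predicted probability of recall per review.
    """
    pmf = [1.0]  # with zero reviews, zero successes is certain
    for p in probs:
        # Build a fresh list each step so old values are never overwritten
        # before they are read.
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # this review was failed
            nxt[k + 1] += mass * p      # this review was passed
        pmf = nxt
    return pmf


def p_normal(probs, successes):
    """One-sided p-value: P(observing `successes` or fewer)."""
    return sum(poisson_binomial_pmf(probs)[: successes + 1])


def p_leech(probs, successes):
    return 1.0 - p_normal(probs, successes)
```

For two reviews predicted at 90% each, `p_normal([0.9, 0.9], 0)` is the chance of failing both, i.e. 0.1 * 0.1.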

lapis hearth
#

You have made it more confusing for us simpletons

#

Okay couple of brain strokes later, I am beginning to understand it

unique salmon
#

Basically, low p(normal) aka high p(leech) means that FSRS sucks at predicting probabilities of recall

lapis hearth
#

So high p(leech) means card is so difficult that FSRS scheduling is useless and you need to find something else to help you recall it

#

So basically in other words, a leech

#

No amount of scheduling would help make this leech unleech

#

That makes sense

unique salmon
#

Alright, with that out of the way, we need to decide on 2 things:

  1. Tags/flags/custom data in card info?
  2. Do we do it the simple way with only one threshold and checks after every review, and if the card keeps bouncing back and forth between being a leech and not being a leech - we say "it's not a bug, it's a feature"; or, alternatively, do we do it the complicated way with two thresholds or checks only every 2/3/4 reviews so that cards don't change their status too often?
    @ashen light @cursive badge @cosmic hedge
ashen light
#

tags are note-level (as opposed to card-level), making them a non-option. cards can only have 1 flag at a time, also making it not an option

#

a custom leech attr on card is the only reasonable thing

#

very much against checking every N reviews, means we need to keep track of an extra attr

unique salmon
#

We can do two thresholds, though that creates another problem: if we show p(leech) or p(normal), some cards can end up counted as leeches and some not, despite at present having the same p(leech)

#

For example, if the first threshold is 5% and the second one is 25%, and p(normal) is 10%, whether it's a leech or not depends on whether it has crossed the first threshold before or not. If it has crossed it before, it's a leech, otherwise it's not a leech.
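The two-threshold idea could look roughly like this. It is only a sketch: the function name and the `enter`/`leave` parameters are hypothetical, and the 5%/25% values are just the ones from the example above:

```python
def update_leech_status(is_leech, p_normal, enter=0.05, leave=0.25):
    """Two-threshold (hysteresis) update of the leech label.

    A card becomes a leech once p(normal) drops below `enter` and only
    stops being one after p(normal) rises above `leave`. In between,
    the previous status is kept.
    """
    if p_normal < enter:
        return True
    if p_normal > leave:
        return False
    return is_leech
```

With p(normal) = 10%, `update_leech_status(True, 0.10)` stays a leech while `update_leech_status(False, 0.10)` stays normal, which is exactly the "same p, different label" situation. The simple method is the special case `enter == leave`.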

ashen light
#

people are gonna invent problems to complain about no matter which option is done

#

the bounce back and forth strat has an easier implementation

unique salmon
#

Well, guess it's new data in card info + simple method then

#

Now the real question - who's gonna implement it?

ashen light
#

I'm just annoyed I gotta do the math thing

#

is there an easy off the shelf equation I can grab from a standard stats library

unique salmon
#

Oh come on, math is the easy part
#1282005522513530952 message

ashen light
#

why you gotta do this fuckin tryhard poisson binomial thing literally nothing implements

#

ok port it to rust for me

#

protip: that ai version wasn't going to work

unique salmon
#

Like, "it bugs out and spits nonsense" doesn't work?

ashen light
#

because it assumed rust vec's behave like numpy dataframes

#

I didn't run it but at a glance it wasn't gonna do what you wanted

#

for example pmf[j] = pmf[j] * (1.0 - prob) + pmf[j-1] * prob; the way its implemented in pmf[j] * (1.0 - prob) pmf[j] will always be zero so the first half of that equation is always 0

#

and so....is not gonna do what you want

#

unless we want to do that calculation for fun for some reason

#

I looked over it enough to see an obvious problem then just didn't give it any more thought

unique salmon
ashen light
#

you didn't actually link anything

unique salmon
#

Unless my Python implementation is also somehow bugged and I didn't realize it

#

Just copy-paste it

#

Works fine for 90%, 90%. This is indeed what you get if you do the math by hand

ashen light
#

hm

#

guess I'll just use it

#

¯\_(ツ)_/¯

#

any problems I can just blame on you

unique salmon
#

Apparently not

ashen light
#

my point still stands though, why you gotta use some tryhard stats thing

unique salmon
#

Because I don't see any other way

#

We can't just assume that FSRS always predicts the same probability of recall for obvious reasons

#

If FSRS always predicted the same probability of recall, we could use the good ol' binomial distribution
(except that in that case there would be no reason to use FSRS in the first place cause it would be fucking useless)

#

Poisson binomial is a generalization of the binomial distribution for when the probabilities of success aren't always the same (unlike in a coin toss, where they are)

#

I mean, I guess we could try to come up with something COMPLETELY different that isn't based on fancy probability distributions, but nah

ashen light
#

my point more is I just wanted to pull in a library that I could hand an array and have it do the math for me

unique salmon
#

Think of it as Claude making a library for you

#

And now you're happy!

ashen light
#

nah

unique salmon
#

Save complaining for later, for the actually annoying parts, such as:

  1. Unleeching
  2. Pop-ups for both leeching and unleeching
  3. Recalculating leechiness every time FSRS parameters change
ashen light
#

unleeching isn't even hard

#

3 is the only actually annoying thing here

#

(but it already does other stuff anyway, just gotta hijack that process)

tepid spoke
#

I just realized why I have so few leeches lol

#

every time I sync the deck with WaniKani, it overwrites the tags

#

Not like it matters. Nothing I can do with the info of it being a leech anyway.

cosmic hedge
polar maple
quasi shadow
#

How about that?

quasi shadow
#

It's the notebook optimizer.

#

It has a detailed evaluation which groups the reviews based on the last rating.

#

I guess the forgetting curve is sharper when the last rating is again.

#

Maybe it's better to save the raw data for further analyses.

polar maple
#

@quasi shadow @unique salmon this version of RWKV uses a weighted sum of 128 exponential forgetting curves. Maybe we should make FSRS decay scale with S?

#

also the 1-day stability plot might be a bit inaccurate since RWKV uses elapsed seconds

quasi shadow
#

Seems like the forgetting curve is flat when S is small and becomes sharper as the S increases?

polar maple
#

perhaps even decay = 0.1 could be beneficial for small S

#

S = 1 -> 0.1
S = 30 -> 0.2
S > 100 -> 0.5
and maybe interpolate this in log space
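One way to read "interpolate this in log space" is piecewise-linear interpolation over log(S) between those three anchor points, clamped outside them. That reading is an assumption, and `decay_for_stability` is a hypothetical name:

```python
import math

# (stability in days, decay) anchor points from the message above
ANCHORS = [(1.0, 0.1), (30.0, 0.2), (100.0, 0.5)]


def decay_for_stability(s):
    """Interpolate decay linearly in log(S), clamping outside the anchors."""
    if s <= ANCHORS[0][0]:
        return ANCHORS[0][1]
    if s >= ANCHORS[-1][0]:
        return ANCHORS[-1][1]
    for (s0, d0), (s1, d1) in zip(ANCHORS, ANCHORS[1:]):
        if s0 <= s <= s1:
            w = (math.log(s) - math.log(s0)) / (math.log(s1) - math.log(s0))
            return d0 + w * (d1 - d0)
    raise ValueError("stability must be a positive number")
```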

#

this could also be what we are seeing with first rating = 1 since it tends to result in lower stability after all

quasi shadow
#

🤔 wait

#

my observation is the forgetting curve is sharper with first rating=1.

polar maple
#

whoops i mixed it up

#

there might be some weird behavior when changing decay, let R(t, S, decay) be the function that gets the retention given a certain time, stability, and decay.
Then we would ideally want S1 > S2 => R(t, S1, decay1) > R(t, S2, decay2) but if decay1 != decay2 then this can be broken

quasi shadow
#

Yeah, I know.

#

It means forgetting curves with different S will intersect in certain T (T > 0).
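A small numeric illustration of the crossing, assuming an FSRS-style power forgetting curve normalised so that R = 90% at t = S for any decay. The stabilities and decays below are arbitrary example values:

```python
def retrievability(t, s, decay):
    """Power forgetting curve with R(t = s) = 0.9 for any positive decay."""
    factor = 0.9 ** (-1.0 / decay) - 1.0
    return (1.0 + factor * t / s) ** (-decay)


# Card 1 has the higher stability but the steeper decay.
s1, decay1 = 10.0, 0.5
s2, decay2 = 8.0, 0.2

early = (retrievability(10, s1, decay1), retrievability(10, s2, decay2))
late = (retrievability(1000, s1, decay1), retrievability(1000, s2, decay2))
print(early)  # card 1 is ahead at t = 10
print(late)   # card 2 is ahead at t = 1000, so the curves crossed in between
```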

#

be like

quasi shadow
#

It's the distribution of trainable decay.

cursive badge
# unique salmon Alright, with that out of the way, we need to decide on 2 things: 1) Tags/flags/...
  1. I think if a leech revamp does happen it would ideally involve new prop(s). Tagging notes has always been a half-baked solution and we cannot use flags in native Anki because it could clash with user flags. EDIT: I guess another solution would be letting cards have tags, but that is another feature in itself.
  2. I don't know. I haven't touched it since my prototype a few weeks ago but I don't think that we are ready for a "black box" with no knobs for the user to twiddle. I know it would be terrible UX but I never felt "I would be happy to just run this on any dataset" when I was playing with my prototype.

I would take some more convincing before trying to implement it natively in Anki but if @ashen light is interested enough to call off his strike I'm not going to be too negative and interfere 😂 .

lapis hearth
#

I think knowing whether the card is a leech or not is very helpful, but even more helpful is knowing whether what you are doing is helping you learn the card or not (whether you are on the right track)

quasi shadow
#

OK, now I know how to improve the optimal retention feature.

#

we need to increase the cost per review when the desired retention decreases.

polar maple
#

there seems to be a lot of 60k+ as well which could represent way more than 60 seconds in reality

#

time where the user either gives up for a while or has to purposefully spend re-encoding the card into memory

cosmic hedge
# quasi shadow

This is weird to me because I thought the problem with CMRR was it went to 0.7 too often. Increasing the costs with higher DRs to offset this would make it even more likely to result in 0.7 right?

quasi shadow
#

it means the cost is larger when the DR is lower.

#

so it will increase the CMRR

cosmic hedge
cosmic hedge
#

I know in the simulator they both end up being pretty much the same thing

quasi shadow
quasi shadow
#

optimal retention
before: 0.7143667819857166
after: 0.8377484029026208

#

😎

#

Just add this line

unique salmon
quasi shadow
#

with this, you will spend 20% more time per review when your desired retention is 70% instead of 90%.

unique salmon
quasi shadow
unique salmon
#

I'll take that as a "yes"

quasi shadow
#

😂 Nope

unique salmon
bold terrace
#

IMO with the leech detection stuff, in practice I see a few elements that make it not as useful as I'd hope initially :

  • A lot of the cards flagged by it are cards with a few "bad streaks". While such streaks are indeed very improbable according to the FSRS model, in practice they're not that uncommon, especially for recently introduced cards.
  • Once the number of repetitions is higher, it starts to make more sense, but sometimes you still get cards with a moderate number of reviews being flagged because they had a very, very bad start.
  • For cards with a high number of repetitions, it doesn't really bring much more information than checking the number of lapses, since contrary to what I would have expected a few months ago, in my case at least, the more reps a card has, the less stability it has on average compared to cards with fewer reps. So discriminating "harder cards" based on # reps or # lapses is still .... very valid for FSRS
#

I don't know if someone has practical experience with it and sees different cases ?

#

This is a typical example. Got flagged for a bad start, but will only get considered "unleeched" when the history count will be big enough ... while the bad streak was 1 year ago, but since easy cards doesn't grow in terms of reps that quickly, it might still be one of the most leechy card of my deck even though it's quite an easy one (1 failed rep in 1 year, and the last failed was ~11 month ago)

#

When I tried the leechkit with my --last-review N, it felt better, but mostly because now it would flag only the one with a recent bad streak. The number of result would of course be way lower, something like 2-5 cards over 4000 active one

unique salmon
# quasi shadow

So what's the plan?
If we aren't going to estimate it for every user, will you estimate one average value (or two, like a - b*R) and just hard-code those?

#

It would be better if it was estimated for each user individually

#

But I guess a - b*R with a and b estimated from the 10k dataset would still be ok on average

bold terrace
#

Comparison of SUM(R) and SUM(R*f(S)) when new card/day change from 8 to 40 to 8 again.
You can see that for SUM(R), the more you add, the better.

For SUM(R*f(S))

  • The more "active card" you have, the better since S can grow for all of those
  • New/card that stay at low S are discounted (I mean, does a .9R on a 1h stability should be the same Memorized Value than a 365d stability one ?)
  • Since R is included in [DR,100] if you're a good boy, for people with high DR, R is a proxy to measure SUM(active cards) 🤷‍♂️
#

For f, sqrt, ln, or more fancy like 1 - Math.exp(-((8 / 365) * s)) (early rise, converge to 1, and at 365d is already close to 1) doesn't really change much the trend, considering S is already good enough
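To make the difference concrete, here is a small sketch of both sums using the exponential f quoted above. Representing a card as an `(R, S)` pair with S in days is an assumption made for the example:

```python
import math


def f(s):
    """Stability weight: rises early, converges to 1, near 1 at 365 days."""
    return 1.0 - math.exp(-(8.0 / 365.0) * s)


def sum_r(cards):
    """Plain SUM(R) over (R, S) pairs."""
    return sum(r for r, _ in cards)


def sum_r_fs(cards):
    """SUM(R * f(S)) over (R, S) pairs, discounting low-stability cards."""
    return sum(r * f(s) for r, s in cards)


# Same R, very different stability: 1 hour versus 365 days.
cards = [(0.9, 1.0 / 24.0), (0.9, 365.0)]
print(sum_r(cards))     # both cards count equally
print(sum_r_fs(cards))  # the 1-hour card contributes almost nothing
```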

cosmic hedge
cosmic hedge
unique salmon
#

Aka how well you know the card

unique salmon
#

Well, later, once I'm done with neural D

unique salmon
#

Try it, everyone and anyone
Try telling whether probability of recall is positively or negatively correlated with answer time based on this graph

#

But in case @quasi shadow still wants to do it, I recommend estimating a - b*R for all 4 grades, 8 parameters in total
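Estimating a and b for each grade could be as simple as an ordinary least-squares line per grade. This is a sketch only: the `(grade, R, answer_time)` tuple layout and the function names are assumptions, and each grade needs at least two reviews with different R values:

```python
from collections import defaultdict


def fit_a_b(points):
    """Least-squares fit of answer_time = a - b * R.

    `points` is a list of (R, answer_time) pairs.
    """
    n = len(points)
    mean_r = sum(r for r, _ in points) / n
    mean_t = sum(t for _, t in points) / n
    var_r = sum((r - mean_r) ** 2 for r, _ in points)
    cov = sum((r - mean_r) * (t - mean_t) for r, t in points)
    slope = cov / var_r           # slope of time versus R
    a = mean_t - slope * mean_r   # intercept
    b = -slope                    # sign flipped to match a - b * R
    return a, b


def fit_per_grade(reviews):
    """Return {grade: (a, b)}: with 4 grades that is 8 parameters in total."""
    by_grade = defaultdict(list)
    for grade, r, answer_time in reviews:
        by_grade[grade].append((r, answer_time))
    return {grade: fit_a_b(pts) for grade, pts in by_grade.items()}
```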

unique salmon
# quasi shadow

If these graphs are to be believed, a and b can be different for different grades

quasi shadow
#

It's more clear if I show you the box graph.

unique salmon
quasi shadow
#

Just hard code it

#

let's all

unique salmon
#

I really think we need a benchmark where the goal is to accurately predict costs

#

Rather than R

#

We could just take FSRS-5 and use all this stuff we use for CMRR and the simulator, and run it on the 10k dataset, and compare predicted costs to real answer times

#

Then we can finally have a way to tell if we're making the CMRR/simulator better or worse with our changes

#

Please make a repo for benchmarking the accuracy of predicting costs

#

You already have FSRS parameters per user, so it's not like you will need to estimate them again

#

Just copy them from the 10k repo

#

And in the new repo you will run FSRS with parameters for each user and with cost estimations for each review exactly as in CMRR/simulator

#

No optimization

#

And at the end we will get average(|predicted cost - actual cost|) and sqrt(average((predicted cost - actual cost)^2))

#

MAE and RMSE
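Those two metrics in plain Python, as a sketch with hypothetical function names. Both take equal-length lists of predicted and actual answer times:

```python
import math


def mae(predicted, actual):
    """average(|predicted cost - actual cost|)"""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """sqrt(average((predicted cost - actual cost)^2))"""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )
```

RMSE punishes large misses more than MAE does, so reporting both shows whether a change helps typical reviews or only the outliers.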

#

And then we can enter the new era of tweaking our cost prediction stuff 🤣

#

Just think about ALL THE TWEAKS
I'M TWEAKING SO HARD

quasi shadow
#

My solution

cosmic hedge
sick moth
lapis hearth
#

or would it have a similar shape

#

.

#

Whatever @cursive badge did here

unique salmon
unique salmon
# quasi shadow My solution

I'm serious. At this level of complexity we NEED a proper benchmark
Please do it. Just reuse parameters for each user that you already have in the srs-benchmark repo, run FSRS on each user, make it predict answer time using the same formulas as in CMRR/simulator
And then calculate the mean absolute error and RMSE of predicted answer times and real answer times

#

We are past the point where we can just say "Oh, but this change obviously improves how accurately costs are calculated", we need proper tools to assess further changes

unique salmon
bold terrace
# cosmic hedge isn't "how quickly R decays" "stability"?

Hmmm the thing is that the workload is more dependent on the interval than really the stability. What I mean, is that the interval can be reduced by reducing the DR. But in fact, the intrinsic quality of your memory is Stability, not really Interval

#

Stability is agnostic of DR

quasi shadow
unique salmon
#

This is the kind of change that isn't obviously an improvement, and needs benchmarking

quasi shadow
#

Without it, CMRR will give you an extreme low retention.

unique salmon
# quasi shadow It obviously makes the optimal retention with FSRS-6 similar to FSRS-5.

Man, I don't want to argue. I'll just be brief: from this point on any extra complexity added to the simulations, specifically to the part related to estimating answer times, has to be justified via benchmarking as described here (#1282005522513530952 message), otherwise I will not be happy

I mean, you are obviously free to disregard my opinion, but I genuinely hope you will understand that past a certain point of complexity you need proper tools and not just "It works. Source: it was revealed to me in a dream"

bold terrace
#

@cosmic hedge : For example, let say your workload right now is 100 reviews/day for DR=90% with a total sized deck of 1000. Your score would be 900/100 = 9.
If now you set the DR=70%, and let say it divide by 2 the workload. You get now a score of 700/50=14.

So basically, the optimizer will just make the most gain by making you drop the workload as much as possible -> dropping the DR

If you include S in the numerator, now you have something like f(S)/workload(I) that compensate that, and also, it pushes the goal function to try to also not sacrifice S just for the sake of reducing a bit workload
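The arithmetic of that example as a tiny sketch. The `score` helper is hypothetical and only mirrors the "cards remembered divided by reviews per day" ratio used above:

```python
def score(deck_size, retention, reviews_per_day):
    """Cards remembered per daily review: (deck_size * R) / workload."""
    return deck_size * retention / reviews_per_day


high_dr = score(1000, 0.90, 100)  # 900 / 100
low_dr = score(1000, 0.70, 50)    # 700 / 50
print(high_dr, low_dr)            # the lower DR wins on this metric
```

Because halving the workload outweighs the drop in remembered cards, an optimizer on this ratio keeps pushing DR down.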

#

Right now when you look at the graph that CMRR is trying to optimize, it's not even a U curve it's almost a purely increasing curves .... so basically yeah, you always get the minimum threshold of 70%, it's worthless

quasi shadow
unique salmon
quasi shadow
#

It's more complex.

unique salmon
#

Though, on second thought, I want to see that one benchmarked as well

#

We can benchmark

  1. Current implementation
  2. Current implementation + Markov chain for learning steps
  3. Current implementation + Markov chain for learning steps + your R correction
bold terrace
unique salmon
quasi shadow
unique salmon
#

If your correction makes the predicted answer times less accurate, it will affect everyone who uses CMRR/simulator

unique salmon
#

Clearly not the python version, since that's not what is used in Anki

#

And the Rust version doesn't seem to have the new learning step simulation

quasi shadow
unique salmon
#

Or to learning_step_transitions and relearning_step_transitions

#

All of them need to be smoothed

#

Btw, why are there so many 0.25?

lapis hearth
unique salmon
#

Oh wait, I replied to the wrong comment

#

That was meant to be a reply to Jarrett

#

To this comment

unique salmon
#

Man, I want a benchmark for predicting answer time FeelsBadAnki

cursive badge
#

I wish Anki recorded time-to-answer as well as total study time. I feel the time spent looking at the back kind of poisons the data for other uses.

unique salmon
bold terrace
#

Also look how sum(R*f(S)) represent a better representation of the "gainz" you did, either if it's due to more new cards/day or by just reviewing them more

#

On the opposite side, sum(R) feels like reviewing a lot of cards per day was less useful than just introducing a shit ton of card I was able to recall 2d later

unique salmon
# quasi shadow My solution

If you want to calculate answer time as a function of R, sure. But not like this. Here - assuming I understand your code correctly - you just apply a correction to the already existing average (or median, whatever, that's not the point) time. That's not the same as answer time = a - b*R, that's "I took answer time that is not related to R at all and added some sort of correction to it"

#

I have no objections to answer time = a - b*R if and only if a and b are estimated for each user and for each grade separately. Otherwise we will lose accuracy instead of gaining it. Right now the median answer times are estimated for each user individually. If we use answer time = a - bR with fixed a and b, I'm 100% sure it will be worse than our current approach, since this function will be the same for each user instead of being based on user-specific data
As for the approach in your screenshot, where you just add a correction based on R to answer time that is not related to R - no, absolutely not, please don't

#

Man, I'm telling you, past the current level of complexity WE REALLY NEED A BENCHMARK

unique salmon
sick moth
#

You should, you're asking for a nightmare

First job is to get a shovel and start digging 🙂

unique salmon
#

Read the user review data from the .parquet file, get FSRS params corresponding to that user, use the code that estimates answer times to estimate answer times, record the difference between "predicted" and real answer time after every review, average it to get the average error
Repeat for ten thousand users
Average the average errors to get the average average error

#

Oh joy...

#

I'll have to stitch together the simulator code and the code that reads data from .parquet files

#

😭😭😭

#

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

#

Actually, no, more like I'll have to repurpose the already existing benchmarking code, but load the FSRS parameters instead of calculating them
And somehow make it output answer times

#

The more I think about how I'm going to do it, the more I'm making noises of a dying seal

#

It's not even that bad if you are Jarrett since he's the only one who actually understands the monstrosity that is the benchmarking code

#

But to everyone else the benchmarking code is barely comprehensible

unique salmon
quasi shadow
unique salmon
unique salmon
# quasi shadow I think Alex also understands it.

@polar maple want to make a benchmark of answer time predictions?

  1. For each of the 10k users, get their FSRS params from the srs-benchmark repo
  2. For every review predict answer time, which currently is just a weighted average of user-specific median answer time and default answer time. It's currently "static" - we just estimate a bunch of numbers from the user's review history, no FSRS needed. So even the word "predict" isn't really correct here. But we could estimate answer time as a function of R or something, then we would need to actually run FSRS
  3. Calculate the difference between real answer time and predicted answer time
  4. Calculate the final error across all users and all of their reviews
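
The four steps above can be sketched roughly as follows. Everything here is an assumption made for illustration: `default_time`, `weight` and the sample data are invented, the error is taken to be mean absolute error, and the median is computed from the same reviews it is scored on (a real benchmark would only use reviews that came before the one being predicted).

```python
import numpy as np

def predict_answer_time(user_median, default_time, weight):
    """Static 'prediction': weighted average of the user's median and a default."""
    return weight * user_median + (1.0 - weight) * default_time

def user_error(real_times, default_time=8.0, weight=0.9):
    """Mean absolute error for one user. default_time and weight are made up."""
    real = np.asarray(real_times, dtype=float)
    predicted = predict_answer_time(np.median(real), default_time, weight)
    return float(np.mean(np.abs(real - predicted)))

# Hypothetical answer times in seconds, keyed by user
users = {"u1": [4.0, 6.0, 8.0], "u2": [10.0, 10.0, 10.0]}
per_user = [user_error(times) for times in users.values()]
final_error = float(np.mean(per_user))  # the "average average error"
print(per_user, final_error)
```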
#

(I assume the answer is "no")

#

Neither do I 😅

cursive badge
#

I admit I looked at the benchmark code once and then decided it would be easier to just do my own thing because I found it hard to follow which columns were where in the code.

unique salmon
#

Josh (joshuahamilton on Discord) also said it's hard to understand

unique salmon
cursive badge
#

To be fair to Jarrett I think it's inevitable that this kind of thing would become a bit confusing. He's just doing it as his own side project and is under no obligation to spend extra time trying to make it easier for others to digest.

unique salmon
quasi shadow
#

so we don't know the probs of the next rating after hard

unique salmon
#

The thing is, there isn't really any code that I can just steal and repurpose with minimal effort
The benchmarking code - I will break it if I try to disable optimization. Plus, I have no idea how to add the estimation of median review time to it in a way that is compatible with the rest of the code
The simulator code - I won't be able to read .parquet data and pass it into the simulator instead of randomly generated data

#

Plus a whole lot of little details that only Jarrett knows how to get right

quasi shadow
unique salmon
#

And considering that historically I have never, NOT EVEN ONCE managed to run the benchmarking code on the first try and always had to consult Jarrett, EVERY SINGLE TIME HE CHANGED ANYTHING about the benchmarking, I feel like it would be easier for me to get a job and pay Jarrett

cursive badge
quasi shadow
#

If you spent 3~4 hours per day over a year, you would understand it.

unique salmon
#

God, this is so over

quasi shadow
#

😂 That's what I did to contribute to Anki

unique salmon
#

Unless Jarrett decides that he nobly wishes to never make any changes to calculating answer times without benchmarking them first, for the sake of Anki users

unique salmon
#

figures

unique salmon
polar maple
quasi shadow
cursive badge
#

Expertium really needs AGI to happen so they can command an army of AI agents to go off and program all their ideas 😂

quasi shadow
#

It's better than doing nothing.

unique salmon
hasty fractal
#

"Create an Anki better than the one Expertium created". Checkmate.

clever cargo
quasi shadow
#

I know it's bad without it.

#

If you really run the unit test, you will know how bad.

unique salmon
#

Your formula works only if cost is defined as "answer time at R=90%", but it's not

quasi shadow
unique salmon
#

So the simulator will do it differently compared to CMRR? Please no...

#

I'd rather ditch CMRR entirely

quasi shadow
#

Fine. Remove CMRR.

#

Forget it.

cursive badge
hasty fractal
#

remove CMRR, improve simulator 👍

quasi shadow
#

It's the worst feature I made.

unique salmon
#

Maybe not if it always outputs 70% 🤣

clever cargo
quasi shadow
#

For example, the loss aversion.

#

It's introduced to increase the output.

unique salmon
quasi shadow
#

Actually, the current simulator is incorrect.

unique salmon
#

Instead of the current "spherical in vacuum" implementation

quasi shadow
#

because of loss aversion

unique salmon
quasi shadow
unique salmon
#

Like, I thought it's used only for CMRR

quasi shadow
#

🤣

unique salmon
#

God damn it man, please disable it

quasi shadow
#

Nobody complained

unique salmon
#

People want accurate workloads

quasi shadow
#

You're the first one.

unique salmon
#

For CMRR it's ok because people don't see the workload graph

#

"Time"

#

Because we show this graph, it'd better be accurate

#

CMRR doesn't show anything related to how much time is spent on reviews, so it's fine to cheat a little bit, users won't be able to see it

quasi shadow
#

Forget CMRR

unique salmon
#

Alright

#

I hope you will remove loss aversion from the simulator

quasi shadow
#

I will

#

after merging the FSRS-6 PR

#

There are several benchmarks I need to complete

#

so maybe the next week

unique salmon
#

Maybe make CMRR 2 with accurate deck sizes and card states? ankieyes

#

Aka just run the simulator with all of its configs

#

Easy Days, sort order, blah blah

quasi shadow
#

CMRR is designed for average users

#

but it's too hard

#

so I give up

unique salmon
#

Literally just

for R in range(70, 100):
    workload, knowledge = simulator(R, all_other_shit)

#

And there you go, CMRR 2.0!
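
A slightly fuller sketch of that loop. `toy_simulator` is a made-up stand-in with invented formulas, not the real Anki simulator, and "lowest workload per unit of knowledge" is assumed as the selection rule.

```python
def toy_simulator(desired_retention, deck_size=1000):
    """Hypothetical stand-in for the real simulator; the formulas are invented."""
    # Higher DR means shorter intervals, so more review work per card.
    workload = deck_size / (1.0 - desired_retention) ** 0.5
    knowledge = deck_size * desired_retention
    return workload, knowledge

def cmrr_sweep(simulator, lo=70, hi=99):
    """Return the DR (in percent) with the lowest workload per unit of knowledge."""
    best_dr, best_cost = None, float("inf")
    for dr_percent in range(lo, hi + 1):
        workload, knowledge = simulator(dr_percent / 100)
        cost = workload / knowledge
        if cost < best_cost:
            best_dr, best_cost = dr_percent, cost
    return best_dr

print(cmrr_sweep(toy_simulator))
```

With these invented formulas the sweep lands on the lower bound of the range, which says nothing about the real simulator; the point is only that the selection is a plain loop over DR values.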

quasi shadow
#

ask A bloke or someone else

#

😅 I need to focus on FSRS-6

unique salmon
#

Instead of having CMRR as a separate entity, just make it a part of the simulator

polar maple
#

i guess that especially without loss_aversion, CMRR would output 0.7 in most cases?

unique salmon
#

According to Jarrett, with FSRS-6 - yes

#

Maybe we need to listen to Sound after all and use sum(R*f(S)) instead of sum(R)

#

But then the choice of f(S) is completely arbitrary

polar maple
#

i vote for one of these that i described here

#

has a better meaning than something like R*sqrt(S)

cursive badge
ashen light
#

oh, so you can't do it

cursive badge
#

You're a tricksy one Jake ;p

#

I mean: I'm not sure it is fully baked yet, and don't want to put a lot of effort into something that might be thrown away if it doesn't work well for most users.

unique salmon
unique salmon
#

Which implicitly takes into account S, since with higher S average over [t1, t2] will be greater

#

I wouldn't use it for graphs, but we can use it for CMRR 2, if a bloke implements it

#

Instead of workload/knowledge(at the end of the simulation), it will be workload/knowledge(average over some time)

bold terrace
#

Maybe I'm wrong but won't it be somewhat linear proportion based on S ?

#

Since S is already kinda describing how R declines with time

unique salmon
#

Not really

polar maple
bold terrace
unique salmon
bold terrace
#

I see

unique salmon
#

Alright, I'm gonna be away from my PC for an hour or two, so feel free to use the file I provided

bold terrace
#

To be fair I think it would reward a bit too much very low Stability compared to very high one (Since let's be honest, S=1/S=2, the knowledge is not at all acquired), but I don't mind testing how it would look like. I just don't know how performant it will be to do that loop from t1 to t2 for every revlog entry (but I guess a few dozen should not hurt)

polar maple
#

in Expertium's code rn it is t2 = 10 which is far too low imo

#

we don't need to iterate between t1 to t2 actually, with the integral the computation is quick

bold terrace
# polar maple we don't need to iterate between t1 to t2 actually, with the integral the comput...
Jonathans-Laptop:tmp jschoreels$ python3 fs.py 
R at t1=1: 0.900000
R at t2=360: 0.331270
Average R within the [t1, t2] range: 0.409252
Brute force calculation of average R within the [t1, t2] range: 0.409252
Brute force calculation agrees with integral calculation: False
Jonathans-Laptop:tmp jschoreels$ python3 fs.py 
R at t1=1: 0.999615
R at t2=360: 0.900000
Average R within the [t1, t2] range: 0.944604
Brute force calculation of average R within the [t1, t2] range: 0.944604
Brute force calculation agrees with integral calculation: True
#

I see

#
def power_forgetting_curve(t, s, decay):
    factor = 0.9 ** (1 / decay) - 1
    return np.power((1 + factor * t / s), decay)

This function doesn't depend on anything else ? D ? FSRS parameters ?

polar maple
#

nah it doesn't. if you wanted something that depends on everything then you could take the result of SSP-MMC as a score in itself, the "average cost to reach target stability given S,D,R, and FSRS params"

bold terrace
#
S=1
Integral at 10 : 9.451472
Integral at 360 : 149.668518

S=360
Integral at 10 : 658.855147
Integral at 360 : 988.986838
#

I removed the avg

#

Maybe I shouldn't have lol

polar maple
bold terrace
#

t2=10

#
stabilities = [i for i in range(1,360)]
print(stabilities)
integrals = [integral_power_forgetting_curve(t2, s, decay) for s in stabilities]
print(integrals)
plt.plot(stabilities, integrals)
plt.show()
#

I just call this

#
def integral_power_forgetting_curve(t, s, decay):
    factor = 0.9 ** (1 / decay) - 1

    # Check that parameters are in valid ranges
    if not (0 > decay >= -1):
        raise ValueError("Decay must be in the range [-1, 0)")
    if t <= 0 or s <= 0:
        raise ValueError("t and s must be positive")

    # Special case for decay ≈ -1
    if abs(decay + 1) < 1e-10:  # Using a small threshold to check if decay ≈ -1
        return (s / factor) * np.log1p(factor * t / s)

    # General case for decay ≠ -1
    return (s / (factor * (decay + 1))) * ((1 + factor * t / s) ** (decay + 1))
#

But yeah S=1 integral at 9.4 seems off

#
t_array = [i for i in range(1,100)]
integrals = [integral_power_forgetting_curve(t, 1, decay) for t in t_array]
plt.plot(t_array, integrals)
plt.show()

Feels also a bit off for stability=1

#
t_array = [i for i in range(1,100)]
integrals = [integral_power_forgetting_curve(t, 1, decay) for t in t_array]
plt.plot(t_array, integrals)
plt.show()
unique salmon
#

If you want to get something that can be interpreted as average R over time, you need this

def average_f_power_forgetting_curve(t1, t2, s, decay):
    if not t2 > t1:
        raise ValueError("t2 must be greater than t1")

    # Calculate F(t2) - F(t1) where F is the antiderivative
    integral = integral_power_forgetting_curve(t2, s, decay) - integral_power_forgetting_curve(t1, s, decay)

    # Divide it by the difference in time to get the average
    return integral / (t2 - t1)

The integrals themselves cannot be interpreted as average R, and their difference cannot be interpreted as average R, but rather, as area under the curve
#

You need the difference between integrals divided by the difference between times

#

If you want the area under the forgetting curve, remove division by (t2 - t1)
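
A self-contained sketch of the sanity check discussed here, using S=1 and decay=-0.2 (the decay is inferred from the printed R values, so treat it as an assumption). The names `antiderivative`, `average_r` and `average_r_brute_force` are made up for this example. Note that the antiderivative alone is not the area from 0 to t, because F(0) = s / (factor * (decay + 1)) is not zero, which would explain why a single "integral at 10" for S=1 looks too large. Comparing the two results with a tolerance (`np.isclose`) instead of exact equality also avoids spurious mismatches from float rounding.

```python
import numpy as np

def power_forgetting_curve(t, s, decay):
    factor = 0.9 ** (1 / decay) - 1
    return (1 + factor * t / s) ** decay

def antiderivative(t, s, decay):
    """F(t) with F'(t) = R(t); this form is valid for -1 < decay < 0."""
    factor = 0.9 ** (1 / decay) - 1
    return (s / (factor * (decay + 1))) * (1 + factor * t / s) ** (decay + 1)

def average_r(t1, t2, s, decay):
    """Closed-form average of R over [t1, t2]."""
    return (antiderivative(t2, s, decay) - antiderivative(t1, s, decay)) / (t2 - t1)

def average_r_brute_force(t1, t2, s, decay, n=200_000):
    """Midpoint-rule approximation of the same average."""
    edges = np.linspace(t1, t2, n + 1)
    mids = (edges[:-1] + edges[1:]) / 2
    return float(np.mean(power_forgetting_curve(mids, s, decay)))

closed = average_r(1, 360, 1.0, -0.2)
brute = average_r_brute_force(1, 360, 1.0, -0.2)
print(closed, brute, np.isclose(closed, brute, rtol=1e-6))
```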

unique salmon
#

My idea is to use average R over the next year instead of average R at the end of the simulation for CMRR (again, if a bloke wants to make new CMRR)

bold terrace
#
stabilities = [i for i in range(2,360)]
print(stabilities)
for t2 in range(10, 100, 10):
    integrals = [average_f_power_forgetting_curve(t1, t2, s, decay) for s in stabilities]   
    plt.plot(stabilities, integrals, label=f't2:{t2}')
unique salmon
# bold terrace

This is the case where you really should name your axes...axises...you know 😅

#

Anyway, with default FSRS-5 params I get MRR=84%, which is weird because in Anki I get 87%, but oh well, I ain't gonna look for discrepancies. Maybe default params changed, or maybe the default answer times changed
I'll see what I get if I use sum(avg_R(0, 365)) instead of sum(R). MRR should become higher, I think. Maybe. Actually, idk, we'll see

bold terrace
#

I prefer mine 😄

polar maple
bold terrace
tepid spoke
#

I thought about maybe trying out what normal optimized parameters with 95% retention would lead to.
Result: no, I'd rather not

bold terrace
#

I prefer my 1-exp(...) 😛

#

Could also be more easily customizable : "What's your goal stability ?", it's just a factor to make it reach "1" sooner or later

unique salmon
bold terrace
#

#Team90DR Intensify lol

#

And with that R, you still get plenty points for very low S

#

imagine with a nice and smooth 1-exp

#

💦

hasty fractal
#

isn't the water emoji lewd?

#

or it's like "this is hot"

unique salmon
#

With R averaged over the next 100 years, I get 0.92

bold terrace
#

Future FSRS parameter : What is your expected remaining lifespan

unique salmon
#

Lol

bold terrace
#

I like the idea that S=365 represents "the max"

hasty fractal
#

let's be real: FSRS doesn't work at really high intervals, saying from experience. at that point, there just isn't enough data.

bold terrace
#

Do you mean the intervals are too long or too short?

#

What would be for you the max S

#

that is relevant

#

It's interesting to remember that if all your cards had a 365d stability, you could maintain 50K words by doing 137 reviews/day at 90% DR

#

~5min at 2s/review

#

so clearly, 365d might already be "too high" for realistic "anki endgoal"

#

If we accept 30min of daily anki to maintain 50K words, a stability of 60d would already be enough
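
The arithmetic behind these numbers, as a rough sketch. It assumes a steady state where every card has the same stability, the interval at 90% DR equals the stability, and lapses are ignored; the function names are made up.

```python
def daily_reviews(cards, stability_days):
    """Rough steady state: each card comes up about once every stability_days."""
    return cards / stability_days

def daily_minutes(cards, stability_days, seconds_per_review=2.0):
    """Daily review time in minutes at a fixed number of seconds per review."""
    return daily_reviews(cards, stability_days) * seconds_per_review / 60

for s in (365, 60):
    print(s, round(daily_reviews(50_000, s)), round(daily_minutes(50_000, s), 1))
```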

hasty fractal
bold terrace
#

DR ?

unique salmon
#

Brute force is just a sanity check

#

To make sure the integral math is mathing

bold terrace
#

yeah I misunderstood

hasty fractal
#

I would say around a year, but depends on content. If it's the general knowledge deck I internalise the cards very quickly and at that point, Anki feels quite unnecessary.

#

If it's a JP word it'll take longer.

#

btw, folks, do we have a roadmap for what's coming next in algorithmic/fsrs improvement?

#

ya'll put hundreds of messages everyday in this channel, a mere human can't possibly read all that

unique salmon
# hasty fractal btw, folks, do we have a roadmap for what's coming next in algorithmic/fsrs impr...
  1. FSRS-6 with a new parameter for same-day reviews and with a flatter curve
  2. Simulator now takes load balancing and Easy Days into account
  3. Simulator now simulates same-day reviews way better
  4. Load balancing is tweaked, so hopefully maybe potentially possibly Sound will finally stop complaining about LB decreasing retention, but I wouldn't bet on that
  5. Maybe remove CMRR as it's kinda shit according to Jarrett
  6. Maybe make CMRR+ Mega Ultra Giga Chad Sigma Edition if Luc (A bloke) wants to. Instead of CMRR being separate from the simulator, it will use all of the simulator settings, including sort order, Easy Days, etc.
bold terrace
#
  1. Expertium continuing to think he's the boss of everyone doing something 😄
unique salmon
#

More like I'm the guy whose job is to remind everyone about that one really cool feature that I suggested a year ago and everyone forgot about

hasty fractal
#

im reminded of a character who loves packing but not really: what he actually likes is lolling on the sofa and telling others how to properly pack

unique salmon
#

Also, I tried CMRR with FSRS-6 parameters and decay, and yeah, it's just forever 0.7 😅

ashen light
unique salmon
#

Then again, current CMRR isn't realistic in the first place since it doesn't take into account LB, Easy Days, sort order, real new cards/day limit, real review/day limit, real deck size, real card states, etc.

#

And I really hope Luc will just use the simulator code with all of its settings for the next-gen CMRR

#

...or maybe he won't, and then users will forever keep asking "What's the best value of desired retention?" until the end of the universe (or Anki), everyone will be coming up with their own rule of thumb, twenty bloggers will write twenty articles on the best value of desired retention, and then 10 years later somebody will ask "Why not just run the simulator for every allowed value of DR and check the workload?", and I will answer "Because nobody wanted to implement it 10 years ago"

#

And after that we will be back to asking "What's the best value of desired retention?" until the end of the universe

bold terrace
#

But some otherthinker came up with some "next gen optimizer"

#

And screw naive people

#

(I'm looking at you)

#

"Crunching Numbers" lol

polar maple
#

to be fair aren't we just panic finding new methods/metrics in order to purposefully increase MRR

#

and the moment we get a high number we declare victory

bold terrace
#

To be fair, I don't think there is a huge huge hurry, mine is blocked to 0.70 since it has been introduced

#

I made the mistake of changing my DR to that .70, once

#

Then my effective R was around 50-55%

#

and it took me 2-3 weeks to recover from that week of pain

#

But CMRR was right ! to increase my knowledge, I had to drop DR very low, add a lot of words ...
... And be in a state with a shitton of card with stability <1d that would all contribute to my marvelous "total knowledge" that was that sum(R)

#

(I exaggerate, since yes, the interval/stability is somewhat accounted for in the workload, so it's not like it was completely ignored, but still)

#

Problem is that CMRR estimated that with a DR set to 70% I would fail 30% of the cards, when in fact I failed 45% of my cards 🥲

unique salmon
#

More accurate FSRS-6 + sum(avg_r(delta_t, delta_t+1095)) instead of sum(R) + using the actual deck size and the actual card states could alleviate a lot of that

#

IMO, the biggest problem with CMRR is not the choice of the function to minimize/maximize, but the fact that the settings have barely anything to do with reality

bold terrace
#

IMO a small warning : "If you plan to change your DR, please do it incrementally"

#

Would save many lives

#

What about ditching the forgetting curve, and just training different sets of parameters for different DRxD ranges?

#

We might even have more params than Alex doing so

#

😄

robust hill
#

execute the complainers

cursive badge
bold terrace
#

Could make the recency weight a bit more aggressive for phases where DR change, to let it adapt quicker

unique salmon
#

@cosmic hedge sorry for frequent pings guys, but I want to ask - do you want to implement next gen CMRR? And by that I mean just use the simulator with all of its settings to make CMRR as realistic as possible.
Currently, CMRR assumes fixed deck size, no learned cards, doesn't take into account sort order, Easy Days, etc. All of that can be fixed by reusing the simulator.
@quasi shadow wants to remove CMRR because with FSRS-6 it outputs 70% too frequently, and also because it's kinda crap overall, and while that's understandable, I think we should instead improve CMRR and make it more realistic by using real deck sizes, real card states, real new and review limits, etc.

#

Removing CMRR completely would be a net loss of functionality, and since there are obvious ways to make it more realistic, I think we should do that instead

#

Though, there is also the problem that Alex pointed out - we are in a situation where we want CMRR to output higher numbers, so we will declare any tweaks that make the output bigger good

bold terrace
#

Perfection, 90% 😄

#

Look how the daily outcome becomes so much more predictable 🙂 And it's even with a 5-day average there.

#

Without 5-day average, it would give this for Anki Scheduling daily R

#

Compare to this with Filtered Decks

quasi shadow
quasi shadow
unique salmon
#

I said it yesterday, the biggest problem with CMRR is the unrealistic settings it uses

#

It should just use the same settings and the same deck and card info as the simulator, for maximum realism

#

So fixing (or at least improving) CMRR is just a matter of reusing the simulator config

quasi shadow
#

Or, it's harmful.

unique salmon
cosmic hedge
#

#1282005522513530952 message does this not help enough btw?

unique salmon
#

Next gen CMRR would certainly be more realistic, though that doesn't automatically guarantee that it won't always output 70%

#

But it's definitely more realistic than assuming that deck_size = 10*days_to_simulate and an infinite number of new cards that can be learned per day

cosmic hedge
cosmic hedge
unique salmon
#

It's adding apples to oranges

cosmic hedge
#

ahh I suppose so

unique salmon
#

So, are you up to the task?

cosmic hedge
#

what the next gen CMRR

#

sure why not

#

i hope XD

unique salmon
#

90% of the work is just reusing the simulator code, literally

cosmic hedge
#

yep

unique salmon
#

I'll write you a detailed spec later

cosmic hedge
#

i hope XD

#

you dont really need to

unique salmon
#

Btw, why did you close the PR with the "smooth" button?

cosmic hedge
#

dae said he didnt want it

unique salmon
#

Will it remain like this?

cosmic hedge
#

yep

unique salmon
#

God damn it man

#

This is so ass

cosmic hedge
#

its fine its not a huge issue XD

unique salmon
#

All these settings except for Smooth Graph affect scheduling and are real settings that you can find in deck options. So grouping Smooth Graph - which only affects the plotting - with real settings seems like a bad UI to me.

#

Users shouldn't have to play "pick the odd one out"

cosmic hedge
#

I think the hint would be it affecting the graphs which already exist

#

if someone did assume the button affected the actual results what would be the problem?
i suppose it belongs in advanced settings because it's a setting that very few people will need to touch.

unique salmon
cosmic hedge
unique salmon
#

Lol, alright

#

I mean, even if it creates confusion that doesn't last long, it still creates >0 confusion

cosmic hedge
#

I don't think people expect 0 confusion when they open the "advanced settings"

unique salmon
#

Another matter: https://forums.ankiweb.net/t/desired-retention-ui-overhaul/57678/33?u=expertium
Will you add this to your ever-growing list of "suggestions that Expertium pings me about every day"? 🤣

cosmic hedge
unique salmon
#

Like, graphs short-circuit their brains

cosmic hedge
#

the horror 😔

unique salmon
#

And IMO, the idea with answer buttons is just very neat and clear

#

We show users what they have already seen before - answer buttons with interval lengths

#

Instead of something completely unfamiliar

#

Idk how hard it would be to implement

cosmic hedge
unique salmon
#

Should we display fuzz, though? That's the issue

cosmic hedge
quasi shadow
lapis hearth
bold terrace
#

Eeeeereh with decay -0.2, the power_forgetting_curve still has a value of 20% around 1000d for a stability of 1d ...

#

I'm sorry but the integral stuff sounds fishy

#

Writing it in a Word document doesn't make it less fishy

#

And randomly choosing the avg retention over 5y to compensate for a bad function also feels like you're denying everything else that you came up with yourself @unique salmon

unique salmon
#

Otherwise users will be like "WHERE ARE MAH LEARNIN' SHTEPSH?!?!?!"

unique salmon
#

It definitely makes way more sense than arbitrary f(S)

bold terrace
#

Well, it makes no sense if the forgetting curve can't be trusted for extreme values, which it can't when I see that after 1000 days, a S=1d card still translates into a 20% probability of recall
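
For reference, a small sketch that evaluates the `power_forgetting_curve` quoted earlier for S=1d at a few intervals. Decay -0.2 is the value named above; -0.5 is included for comparison on the assumption that it is the older fixed decay. By construction R is exactly 0.9 at t = S for any decay.

```python
def power_forgetting_curve(t, s, decay):
    factor = 0.9 ** (1 / decay) - 1
    return (1 + factor * t / s) ** decay

# Probability of recall for a card with S = 1 day, at increasing intervals
for decay in (-0.2, -0.5):
    row = [round(power_forgetting_curve(days, 1.0, decay), 3) for days in (1, 30, 365, 1000)]
    print(decay, row)
```

With decay -0.2 the value at 1000 days stays above 20%, while with -0.5 it falls below 10%, so the complaint is about the flat tail of the curve rather than about the integral itself.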

#

So f(S) makes more sense if it goes from 0 to 1 in a lapse of time that we can interpret as "acquired"

unique salmon
#

It forbids low R

#

Anyway, with the new decay we could either scrap CMRR (as Jarrett wants) and leave users forever wondering what is the best value of DR, or we could try to save CMRR somehow (as I want) to give users some answer

bold terrace
unique salmon
bold terrace
#

Also I think decoupling the CMRR evaluation function by using an f(S), rather than just reusing the same forgetting curve, would give us more of a discriminant that is not poisoned by artifacts from FSRS (the fact that R basically never drops below 1%, for example)

cosmic hedge
unique salmon
#

Like, I immediately think "that's hella confusing"

bold terrace
#

What I mean is that if the forgetting curve was more realistic (meaning a S=1d card would have its R really really low after a few weeks), then there would be no chance a card with S=1d would already have a 0.40 score.

#

I still think FSRS good prediction is based on the R it was trained on, and not based on the quality of its forgetting curve

#

So using the curve as a way to find the evaluation function of "how good a S is" feels wrong

cosmic hedge
#

i say just give up and stick a link to the visualiser there

unique salmon
#

Just display what the user would normally see with these parameters and learning steps above his real buttons

#

As long as what the user sees above the fake buttons is the same as what he sees above the real buttons, we're good

cosmic hedge
unique salmon
#

Thank Dae for making learning steps a nightmare

#

And making the entire scheduling system janky

unique salmon
#

And if we agree that some change that makes metrics worse is in some sense "better", then...what's the point of the benchmark?

#

I guess we could solve it by asking Dae to make another 10k dataset, but this time make it have a uniform distribution of retentions by cherry-picking users with all kinds of retentions
Jarrett, I hope you agree that this is a good idea

#

So that there is more or less the same amount of data for all retentions

#

Like my uniform dataset with 100 users, but 100 times larger

#

Otherwise I can't think of any way to prevent overfitting to higher retentions

lapis hearth
#

Though I still do (regarding having FSRS-sec)

unique salmon
unique salmon
#

But that's a different problem, not exactly what I'm talking about above

lapis hearth
#

You want to benchmark the benchmark

unique salmon
#

What I'm talking about is that regardless of which metrics we use, we will end up overfitting to high retentions because most of the data is in the >50% retention range, with very few users with <50% retention
So a change that makes the forgetting curve less realistic, like the whole "A card will never reach probability of recall of 10%" thing, might look good on paper (uh, on the monitor)

lapis hearth
#

But Jarrett surely knows his benchmark

lapis hearth
unique salmon
#

Analogy: think of it as making an artificial city where the number of millionaires is the same as the number of poor people

lapis hearth
#

Use me

unique salmon
#

Nah, it would be anonymous

#

Dae would open his secret vault where he keeps user data 🤣

lapis hearth
#

Who would care. Make me an honorary specimen

unique salmon
#

There is another way, which is a lot more arbitrary and dumb but doesn't require getting a new dataset.
After optimizing FSRS on all 10k users, we calculate the final metric as a weighted average where weights are proportional to retention in a certain way. Specifically, we put all users into 40 categories: retention between 100% and 97.5%, retention between 97.5% and 95%, retention between 95% and 92.5%, etc.
Then we count how many users fall into each category. Then, when calculating the final log-loss and RMSE over the entire dataset, the user is weighted inversely proportional to the number of users in his "retention class".
What this means is that if someone has a retention of 90%, his weight will be lower because that's common. If someone has a retention of 10%, his weight will be huge because that's uncommon. So we assign more weight to people with uncommon values of retention and less weight to people with common retentions.

#

This doesn't even require re-running the algorithms in the benchmark, just re-calculating the final average across all collections

#

So we could do it like...today

#

@polar maple it's inspired by the approach of giving more weight to rare classes in classification problems on imbalanced datasets

#

Except that we're just making the classes up 🤣

#

I'm fully expecting that with this change FSRS-6 will look worse than FSRS-5

#

Because FSRS-6 has a curve that doesn't fit people at low retentions

#

Getting an actual uniform dataset would be a lot better, though

quasi shadow
#

The only thing I need to do is calculating the retention and saving it in the result with the metric, right?

#

Then we can compare algorithms in each retention level.

unique salmon
# quasi shadow The only thing I need to do is calculating the retention and saving it in the re...
  1. Split users based on their retention (exclude same-day reviews and the first review) into sufficiently many groups, like 20 or 40
  2. Calculate 1/n(users in the group) for each group
  3. Calculate weighted average log-loss, RMSE(bins) and AUC across all 10k users, where each user has a weight of 1/n(users in the group), depending on which group he belongs to

Example: suppose there are 1000 users in the 90%-92.5% group. So if a user's retention is 91%, his weight is 1/1000
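
A minimal sketch of that weighting, assuming each user already has a retention value and a per-user metric. `retention_weighted_mean` and the sample numbers are invented; this is not code from srs-benchmark.

```python
import numpy as np

def retention_weighted_mean(retention, metric, n_bins=40):
    """Average a per-user metric with weights 1 / (users in the same retention bin)."""
    retention = np.asarray(retention, dtype=float)
    metric = np.asarray(metric, dtype=float)
    # Bin index 0..n_bins-1 (2.5% wide for 40 bins); retention == 1.0 goes in the top bin.
    bins = np.minimum((retention * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(bins, minlength=n_bins)
    weights = 1.0 / counts[bins]
    return float(np.average(metric, weights=weights))

# Three users in the 90%-92.5% bin, one rare user at 31% (hypothetical data)
retention = [0.91, 0.905, 0.92, 0.31]
log_loss = [0.30, 0.30, 0.30, 0.60]
plain = float(np.mean(log_loss))
weighted = retention_weighted_mean(retention, log_loss)
print(plain, weighted)
```

The three common users share a total weight of 1 and the rare user gets a weight of 1 alone, so the rare retention range counts as much as the common one.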

severe storm
#

Is there a way to (correctly) guess how long it takes for True retention to move closer to desired retention?
My case: I have used anything between 72%-85% desired retention (mostly on the lower end) for at least 6 months (with change all cards on schedule), but like last week I have turned it up to 90% desired retention (without change all cards on schedule).

#

not really a problem just curious

bold terrace
#

A compromise is with FSRS Helper Addon, only reschedule the further away cards

severe storm
#

I got 1000 cards 🥴 "due" if I take the quick route haha. But I think I can "endure" true retention < desired retention for a while

bold terrace
#

You can always do the compromise if you want to speed up a bit things 🙂

#

You can do a tiny batch per day

#

taking only the ones in the far, far future

severe storm
#

That will mess with me mentally I think haha

#

But I think I have figured it out.
As you said, it will take as long as the amount of days that most of my cards have with old DR. I think looking at the review intervals tab on "stats" might help.
If I look at where the cumulative 50% & 80% (randomly taken) is, it might give me some sort of idea.
running total of 50% is @ 56-60 days review interval, and running total 80% is @ 152-155 days review interval. So I would guess it's somewhere @ 120, because during this time I am also adding new cards etc

polar maple
polar maple
unique salmon
#

So I think we REALLY should ask Dae to make a uniform dataset

#

Or do the thing I described above, which is worse

polar maple
#

also was just showing that RWKV's flatter forgetting curves still achieve good calibration on low R so it's not necessarily a big issue to have flat forgetting curves

unique salmon
#

I'd like to see the calibration graph for FSRS-5 and 6

polar maple
#

FSRS-5-recency, first 500 users, 0.5 decay

#

i don't have one for FSRS-6

unique salmon
polar maple
#

K

unique salmon
#

Oh, and I mean "run the optimizer", not "run it with the same parameters as for the old curve"

#

So you'll have to optimize parameters for the new curve

unique salmon
polar maple
#

yeah i still have the old code

#

at least that plot was generated way back then

unique salmon
#

ok

polar maple
unique salmon
# quasi shadow yep

Maybe we should do that instead of the fixed decay then?
The problem is estimation of S0. You need to know decay in advance to accurately estimate S0. Maybe do what you did, and then do a second optimization with fixed decay that was found during the first optimization?

#

It's 2x slower, but should work better

polar maple
unique salmon
# unique salmon It's 2x slower, but should work better

The more I think about it, the more I think that's the best course of action
If we make decay trainable, that alleviates the problem that different values of decay are better at different retentions, which is what we have been arguing about all the time. And it should be more accurate than any fixed value of decay. The problem is S0. Actually, even other parameters may (and likely will) still have different values depending on the choice of decay. I don't think FSRS params are "decay-agnostic", though I don't have a solid proof of that.
So the solution is to run optimization twice: once with variable decay, to find which value of decay is good, and the second time with fixed decay from the first run, to fine-tune the parameters

#

We could only run it once if parameters are "decay-agnostic", but again, I doubt that they are
By "decay-agnostic" I mean "parameters will converge to the same values regardless of the choice of decay"
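
A toy sketch of the two-pass idea, under loud assumptions: the curve is the FSRS-style power curve scaled so that R = 90% at t = S, the review log is made up, and a plain grid search stands in for the real gradient-based optimizer just to keep it short. None of these names come from fsrs-optimizer.

```python
import math

def forgetting_curve(t, s, decay):
    """Power forgetting curve scaled so that R = 0.9 when t == s (decay < 0)."""
    factor = 0.9 ** (1.0 / decay) - 1.0
    return (1.0 + factor * t / s) ** decay

def log_loss(reviews, s, decay):
    """Mean binary cross-entropy over (elapsed_days, recalled) pairs."""
    total = 0.0
    for t, recalled in reviews:
        p = min(max(forgetting_curve(t, s, decay), 1e-9), 1.0 - 1e-9)
        total -= math.log(p) if recalled else math.log(1.0 - p)
    return total / len(reviews)

# Made-up review log: (days since last review, 1 = recalled, 0 = forgot).
reviews = [(1, 1), (2, 1), (4, 1), (7, 1), (10, 0), (14, 1),
           (21, 1), (30, 0), (45, 1), (60, 0), (90, 0), (120, 1)]

# Pass 1: decay is free and searched jointly with stability on a coarse grid.
decays = [-0.1, -0.2, -0.3, -0.4, -0.5, -0.6, -0.7]
stabilities = [1, 2, 4, 8, 16, 32, 64]
s1, d1 = min(((s, d) for s in stabilities for d in decays),
             key=lambda sd: log_loss(reviews, sd[0], sd[1]))

# Pass 2: decay is frozen at d1; stability is refined on a finer grid around s1.
fine = [s1 * 2 ** (k / 8) for k in range(-8, 9)]  # k = 0 keeps s1 itself
s2 = min(fine, key=lambda s: log_loss(reviews, s, d1))

print(f"pass 1: S={s1}, decay={d1}, loss={log_loss(reviews, s1, d1):.4f}")
print(f"pass 2: S={s2:.2f}, decay={d1}, loss={log_loss(reviews, s2, d1):.4f}")
```

Because the fine grid contains the pass-1 stability, pass 2 can only match or improve the pass-1 loss; whether the improvement is worth doubling the run time is exactly the open question.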

polar maple
#

@unique salmon same 500 users

unique salmon
#

It actually looks good below 50%

#

Somehow

polar maple
#

not too surprising given the RWKV curves

#

but where are my confidence intervals?

#

i updated fsrs-optimizer

#

i thought you added confidence intervals or something

unique salmon
#

I thought so too 😅

polar maple
#

the update must've failed or something, you mentioned you removed some lines but i think i can still see them all

polar maple
#

@unique salmon the swap at p=0.45 is interesting

#

i think ill get another 500 users to see if this repeats

unique salmon
# polar maple @unique salmon the swap at p=0.45 is interesting

https://github.com/open-spaced-repetition/fsrs-optimizer/pull/169#issuecomment-2794715383
Mind voicing your thoughts? Or giving me a thumbs up, that works too 🤣

GitHub

candidate for FSRS-6
Log Loss: 0.3273 -> 0.3257 (-0.0016)
RMSE(bins): 0.0518 -> 0.0510 (-1.5%)
Model: FSRS-5-dev
Total number of users: 9999
Total number of reviews: 349923850
Weighte...

#

I don't see any problems with my idea, aside from making optimization two times slower

polar maple
unique salmon
polar maple
#

i don't see why we need to fix decay for a second optimization and why this would necessarily benefit over just a joint optimization of all parameters at once

unique salmon
#

I mean, I guess we could just double the number of epochs?

polar maple
#

you could do that, idk, or just leave it as-is

#

i did check that increasing epochs does improve performance a bit but i think this is already a tradeoff that jarrett has decided on

polar maple
#

@unique salmon on users 501-1000, looks like theres an actual pattern

unique salmon
# polar maple was this done with the 5-way split on the 10k dataset? if so, does it improve te...

Model: FSRS-5
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3273±0.1525
FSRS-5 RMSE(bins) (mean±std): 0.0518±0.0332

Model: FSRS-5-recency
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-recency LogLoss (mean±std): 0.3256±0.1519
FSRS-5-recency RMSE(bins) (mean±std): 0.0493±0.0321

Model: FSRS-5-dev (optimizable decay)
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-dev LogLoss (mean±std): 0.3220±0.1488
FSRS-5-dev RMSE(bins) (mean±std): 0.0466±0.0290
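
For a quick sense of scale, the relative changes against plain FSRS-5 can be worked out from the weighted averages above. This is only arithmetic on the posted numbers; `pct_change` is a helper name made up for this sketch.

```python
def pct_change(baseline, candidate):
    """Relative change of candidate vs. baseline, in percent (negative = lower)."""
    return (candidate - baseline) / baseline * 100.0

# Weighted averages copied from the benchmark output above.
fsrs5 = {"logloss": 0.3273, "rmse": 0.0518}
variants = {
    "FSRS-5-recency": {"logloss": 0.3256, "rmse": 0.0493},
    "FSRS-5-dev (optimizable decay)": {"logloss": 0.3220, "rmse": 0.0466},
}

for name, m in variants.items():
    print(name,
          f"LogLoss {pct_change(fsrs5['logloss'], m['logloss']):+.1f}%",
          f"RMSE(bins) {pct_change(fsrs5['rmse'], m['rmse']):+.1f}%")
```

So optimizable decay lands at roughly -10% RMSE(bins) and -1.6% log loss relative to FSRS-5 on this benchmark.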

bold terrace
#

If it's part of the cost function, can't it be optimized at the same time ?

polar maple
polar maple
#

just for completeness here is the combined chart for users 1-1000

unique salmon
#

I mean, I can, but it will take a ton of time

polar maple
#

and i can't be sure that the optimizable-decay run has all other parameters equal to FSRS-6

#

so doesn't hurt to just ask directly

unique salmon
polar maple
#

unless it overfits too much

bold terrace
# unique salmon What do you mean?

The fact that the decay needs to be trained before training the params. If the decay is part of the forgetting curve, can't it be optimized at the same time?

#

gradient descent, taking the derivative of the forgetting curve with respect to the decay

unique salmon
# bold terrace The fact the decay need to be trained before training params, if the decay is pa...

Uh, it's complicated.
It needs to be fixed for the first 4 parameters, since they are estimated separately. For other parameters, as I said, they likely depend on the value of decay, so if you change decay, optimal parameters will no longer be optimal
BUT
In FSRS-5 the first 4 parameters are also optimized via gradient descent after they are estimated initially. So now the only problem is that parameters that are optimal at one value of decay are not optimal at the other. But running the optimizer for more epochs will likely solve it. After each epoch the change of the decay parameter will be smaller and smaller

#

If you want to ask "If we can optimize decay, why are we even bothering with fixed decay?" - I have no idea 🤣

#

Jarrett just decided to use fixed decay for...reasons that I don't know

bold terrace
#

Sure, if the decay get optimized it might/will change the value of those 4, but if I remember correctly the few lessons I did with gradient descent, you do the derivative of the cost function for every parameter, and you "glide the slope" of all those dimensions until you reach a minimum

#

so you would glide that bias and those 4 parameters, leading you to the point they would balance themselves out ?
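
A minimal sketch of that "glide down every slope at once" picture, with assumptions stated up front: one stability S stands in for all the non-decay parameters, the review log is made up, gradients are numerical, the step size is halved whenever a step fails to lower the loss, and decay is clamped to [-0.7, -0.1]. The real optimizer uses PyTorch autograd and many more parameters.

```python
import math

def forgetting_curve(t, s, decay):
    """Power forgetting curve scaled so that R = 0.9 when t == s (decay < 0)."""
    factor = 0.9 ** (1.0 / decay) - 1.0
    return (1.0 + factor * t / s) ** decay

# Made-up review log: (days since last review, 1 = recalled, 0 = forgot).
reviews = [(1, 1), (2, 1), (4, 1), (7, 1), (10, 0), (14, 1),
           (21, 1), (30, 0), (45, 1), (60, 0), (90, 0), (120, 1)]

def loss(x):
    """Mean log loss for x = [log(S), decay]; decay is clamped to [-0.7, -0.1]."""
    s = math.exp(x[0])
    decay = min(-0.1, max(-0.7, x[1]))
    total = 0.0
    for t, recalled in reviews:
        p = min(max(forgetting_curve(t, s, decay), 1e-9), 1.0 - 1e-9)
        total -= math.log(p) if recalled else math.log(1.0 - p)
    return total / len(reviews)

def gradient(f, x, h=1e-6):
    """Central-difference gradient of f at x (a list of floats)."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def descend(f, x, lr=0.5, epochs=200):
    """Update all coordinates together; halve lr when a step does not help."""
    for _ in range(epochs):
        g = gradient(f, x)
        trial = [xi - lr * gi for xi, gi in zip(x, g)]
        if f(trial) < f(x):
            x = trial
        else:
            lr *= 0.5
    return x

start = [math.log(5.0), -0.5]  # S = 5 days, decay = -0.5
fitted = descend(loss, start)
print(f"loss {loss(start):.4f} -> {loss(fitted):.4f}, "
      f"S={math.exp(fitted[0]):.2f}, decay={min(-0.1, max(-0.7, fitted[1])):.3f}")
```

Stability and decay move together here, so neither has to be fixed first. The catch raised above is only the initial stabilities, which are estimated separately before gradient descent starts.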

unique salmon
#

And now I'm like "Wait, why are we doing fixed decay again?"

#

Idk, maybe we all collectively had a brain fart

bold terrace
#

But maybe there are reasons I don't see why it is better fixed

unique salmon
#

I don't either

#

It improves metrics and is more adaptive than fixed (well, duh, obviously)

unique salmon
#

It wouldn't improve log-loss and RMSE if it overfitted a lot

#

Though I still think we should choose a reasonable range for it, not just (0.01, 1)

#

According to this graph, I'd say 0.1-0.7 is reasonable
Green is just me trying to eyeball the best fit
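
One simple way to enforce such a range is to clamp the parameter after every optimizer step. A sketch, assuming the trainable weight is stored as a positive number and negated inside the curve; the bounds are just the eyeballed 0.1-0.7 and the names are made up.

```python
DECAY_MIN, DECAY_MAX = 0.1, 0.7  # eyeballed range, not a tuned constant

def clamp_decay(w_decay):
    """Project the trainable decay weight back into [DECAY_MIN, DECAY_MAX]."""
    return min(max(w_decay, DECAY_MIN), DECAY_MAX)

for raw in (0.01, 0.2, 1.0):
    print(raw, "->", clamp_decay(raw))
```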

polar maple
small crow
#

Why did this card go from 48% to 100% after the manual reschedule? It's a card that I got right, so I feel like the difficulty actually is closer to the initial 50%, so I just reset the card as a knee-jerk reaction and rated it again.

polar maple
bold terrace
#

One thing that always bugs me a bit: sure, we get a good decay that should fit most people with this, but just like default params won't be ideal for an individual, I guess the same should hold for decay, shouldn't it?

unique salmon
unique salmon
#

Though, 100% D with no lapses is strange either way

unique salmon
unique salmon
small crow
# unique salmon Have you optimized parameters inbetween these two reviews?

I had, but I've not been using the "reschedule cards on change" option in the deck options for that card's deck (that automagically puts those there I think), instead using FSRS helper to do so and then catching up on lapses. Is there a way to look for other cards that were rescheduled that day with that manual reschedule to see if they've also had that happen to them?

bold terrace
#

But my gut feeling is that you're very new to Anki and I guess the optimizer didn't really bother assigning different difficulties, and just put everything in one big basket

small crow
bold terrace
#

for the deck I mean

#

here the info is pretty simple, the card never failed, was ~50% D before, it's 100% now

#

but then I would expect all your cards to be at 100%

small crow
#

FSRS5 parameters with DR@92%:
0.9842, 8.0109, 41.2131, 100.0000, 7.3324, 0.5695, 1.7045, 0.0010, 1.3330, 0.3374, 0.8130, 1.9629, 0.1152, 0.3734, 2.2973, 0.1129, 3.0047, 0.4220, 0.7896

with 747 cards in the deck

bold terrace
#

Hmm

small crow
#

yeah, i don't get it

bold terrace
#

Yeah no having 100% D doesn't make sense for that card

small crow
bold terrace
#

Did you try to reschedule it with the FSRS plugin ?

small crow
#

yeah, it didn't budge the difficulty, interval, or due date when trying to reschedule with the right-click context menu or the reschedule all cards option

bold terrace
#

With your param, the 27d interval / 48% D seems to be the correct values

#

You can't right click -> Forget -> set Due Date 0 and re-review it ?

small crow
#

which is what made me super curious about why it jumped to 100% difficulty. you don't know of a way to specifically search for manually rescheduled cards on that day to see if it happened to some others, do you?

bold terrace
#

hmmmmm

#

no but I would maybe just search for all cards with high D with no review failed

#

-rated:180:1 prop:d>0.99

#

something like that

small crow
#

ah, not the only card

bold terrace
#

did you try the right click -> recompute memory state ?

#

or did you just do the deck -> reschedule

small crow
#

that's the same as "update memory state and reschedule" right?

bold terrace
#

I'm not entirely sure

small crow
#

yeah, no dice.

bold terrace
#

I know in the past when some bugs happened, the memory state had to be refreshed

unique salmon
#

Out of curiosity, try changing the last digit of any parameter, like from 1.2345 to 1.2346, just to recalculate memory states, and check that card again

small crow
#

I'm like 99% sure it's because these cards had a review history before 3/15, then got reset through the Cards->Reset function in the card browser as the cards i found that have this issue are like that, lol.

small crow
#

also no dice :c

unique salmon
small crow
#

I did use the "reschedule on change" option, lemme try something else.

bold terrace
#

Maybe resetting repetitions/lapses could help, but not sure

#

really sounds like a very tricky/specific issue

small crow
small crow
#

oh LOL it won't let me include links in my post ㅠㅠ

unique salmon
small crow
#

Also, are the intervals on those cards becoming lower even while passing reviews... supposed to happen? I just noticed after making the post, lol.

unique salmon
unique salmon
#

@polar maple I'm benchmarking decay=-0.5 vs decay=-0.2 vs optimizable decay within the (0.1, 0.7) range, and the optimizable one is like baaarely better. There is a clear difference between decay=-0.5 vs decay=-0.2, but not much difference between decay=-0.2 vs opt. decay.

Model: FSRS-6 (opt decay)
Total number of users: 102
Total number of reviews: 2123285
Weighted average by reviews:
FSRS-6 LogLoss (mean±std): 0.3731±0.1780
FSRS-6 RMSE(bins) (mean±std): 0.0665±0.0355

Model: FSRS-5 (decay=-0.2)
Total number of users: 102
Total number of reviews: 2123285
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3733±0.1780
FSRS-5 RMSE(bins) (mean±std): 0.0666±0.0355

Model: FSRS-5 (decay=-0.5)
Total number of users: 102
Total number of reviews: 2123285
Weighted average by reviews:
FSRS-5 d=-0.5 LogLoss (mean±std): 0.3780±0.1802
FSRS-5 d=-0.5 RMSE(bins) (mean±std): 0.0691±0.0346

I've only done 100 users so far, so I will report back tomorrow. But weirdly enough, it seems like the fixed one is just too good for some reason. Then again, maybe the optimizable one needs more epochs.

polar maple
#

it could be worth having a modified version of the script that also saves the training loss for each user to see if opt decay fits the training data much better than decay=0.2 or not

small crow
# unique salmon If you have changed parameters inbetween reviews, yes, it could happen

while true, i really think something is weird with the cards and how the math is mathing on them, because using the parameters from 2025-03-14 that I have, a 333 should have an interval of 36 days, while my most current set of parameters from 2025-04-09 says it should be 62 days, and not the 8 days shown on the first card in this message here:
#1282005522513530952 message

I kinda actually wanna see about setting my PC in the future to see how the next rating affects the difficulties and intervals. let's time travel~

but first: BACKUP TIME

edit: i was wrong, after two reviews into May, the intervals started decreasing. so it really just is that the cards now have a difficulty of 100% lol, I think it'd be faster if I were to just reset them and go from there

oh, I found all the cards with resched:27 -resched<27 lol, and a lot of them are exhibiting this behavior. Is there a way to remove the manual review/scheduling data that gives it a 1 to see if that's messing with it? thanks

quasi shadow
quasi shadow
polar maple
#

nice, i think it needs more samples but i dont think there is an obvious trend

quasi shadow
#

FSRS-5 LogLoss: 0.4939, FSRS-6 LogLoss: 0.5114

#

😅

#

@polar maple

quasi shadow
quasi shadow
#

I'm benchmarking optimizable decay.

polar maple
quasi shadow
#

Btw, the optimizable decay reduces RMSE(bins) ~5% compared with decay = -0.2

#

And it also performs well in low R region.

polar maple
#

looks promising!

bold terrace
#

Nice! Also, if I'm not wrong, since the fixed decay was computed on the same training set as the optimizable one, it's normal that there is not much difference in those numbers; the big benefit would be for users not matching the training set, right 🙂?

bold terrace
quasi shadow
bold terrace
#

So to evaluate it, you should probably compare a fixed decay trained on the 10k users but applied to different users and see how much performance decreases, vs the score with their own optimized decay

quasi shadow
#

FSRS-6-dev is FSRS-6 with optimizable decay.

#

😂 Fine. More work for me to implement it in Rust.

polar maple
# bold terrace So to evaluate it, you should probably compare a fixed decay trained on the 10K ...

if we only have this 10k dataset then what you want is something like, we find a fixed decay by only looking at the first 5k users, then evaluate this fixed decay on the remaining 5k users and also try learnable decay on the remaining 5k users as well? yeah it's true that we may be overfitting the dataset with certain hyperparameter and algorithm choices, but i think this is swept under the rug as being arguably not influential

bold terrace
#

Sorry 😦 But yeah, having a training/test set is yet another step @polar maple, but here I think it's even more unfair for the "Optimized vs Fixed" decay comparison, since here Fixed=Optimized(10k)... So optimizing the decay on the same set will of course just get you the same results

#

But I agree ideally even the optimizer for 1 user should be done on a training set, and then evaluated on a test set

unique salmon
bold terrace
unique salmon
#

Speaking of which
Model: FSRS-6
Total number of users: 774
Total number of reviews: 26313110
Weighted average by reviews:
FSRS-6 LogLoss (mean±std): 0.3383±0.1601
FSRS-6 RMSE(bins) (mean±std): 0.0511±0.0342

Model: FSRS-5
Total number of users: 774
Total number of reviews: 26313110
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3384±0.1599
FSRS-5 RMSE(bins) (mean±std): 0.0512±0.0341

According to my tests opt. decay is pretty much the same as decay=-0.2 🤔

bold terrace
#

So in your test, you optimized a different decay for every user?

#

Or maybe I just misunderstood what you said 🙂

unique salmon
unique salmon
# quasi shadow

Let's combine the 5 leftmost bins into one, so that it has a larger sample size
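
Roughly what that merge would look like, as a sketch: the bin layout (review count, sum of predicted R, number recalled) and the counts are invented for illustration and are not the optimizer's actual data structure.

```python
def merge_leftmost(bins, k):
    """Merge the first k calibration bins into one.

    Each bin is (n_reviews, sum_predicted_r, n_recalled), sorted by predicted R.
    """
    head, tail = bins[:k], bins[k:]
    merged = tuple(sum(column) for column in zip(*head))
    return [merged] + tail

# Made-up bins: the sparse low-R bins come first.
bins = [(3, 0.2, 1), (5, 0.7, 1), (8, 1.9, 3), (12, 4.1, 4), (20, 8.9, 10),
        (400, 221.0, 230), (5000, 4490.0, 4510)]

for n, pred_sum, recalled in merge_leftmost(bins, 5):
    print(f"n={n:5d}  predicted={pred_sum / n:.3f}  actual={recalled / n:.3f}")
```

Summing the raw counts rather than averaging the per-bin means keeps the merged bin weighted by review count.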

unique salmon
#

Idk how you get better results

#

Oh wait, maybe my implementation is bugged

#

How do I do this properly? 😭

#

class FSRS(nn.Module)

unique salmon
#

I guess it doesn't matter since you will benchmark opt. decay on your own anyway, but still

#

-self.model.w[19] doesn't work, 'FSRS5' object has no attribute 'model'

quasi shadow
#

self is the model.

#

So you don't need to add .model again, I guess.
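
A plain-Python sketch of the difference; both classes are stand-ins written for this example (the real model subclasses `torch.nn.Module`), and only the index 19 comes from the error message above.

```python
class FSRS5:
    """Stand-in for the model class; it owns the weight list `w` itself."""

    def __init__(self, w):
        self.w = list(w)

    def decay(self):
        # Inside a method, `self` already is the model, so index w directly.
        return -self.w[19]


class Trainer:
    """Hypothetical wrapper: only an object like this has a `.model` attribute."""

    def __init__(self, model):
        self.model = model

    def decay(self):
        return -self.model.w[19]


w = [0.0] * 19 + [0.2]  # 19 placeholder weights plus a trainable decay at index 19
model = FSRS5(w)
print(model.decay(), Trainer(model).decay())
```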

unique salmon
#

The thing is that optimization goes fine and different users have different decay, but then the results are pretty much identical to decay=-0.2. So I'm guessing that I screwed up outside of the optimization

quasi shadow
#

I cannot review your code if you don't use GitHub and git (

unique salmon
#

Yes

#

Interesting. So FSRS-6 performs better across all retentions, except for super low ones

#

The graph is pretty awkward though

#

There are 3 graphs, but xticks are on only one. And without a horizontal line it's kinda hard to tell where 0 difference is

#

Having a line like this would be good, but the line is unrelated to the curve, it's related to the differences

#

Ah man, this is awkward 😅

#

@bold terrace ok, so this stuff is kinda hard to read, so TLDR: FSRS-6 with its "A card with S=1 day can never reach R=10%" super-duper-flat curve... somehow performs better than FSRS-5 even for people with retentions like 40% and 50%
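
To put a number on "never": a small sketch, assuming the FSRS-style power curve R = (1 + factor * t / S) ** decay with factor chosen so that R = 90% at t = S, and using decay = -0.2 as the flat curve. The function name is made up.

```python
def days_to_reach(r_target, stability, decay):
    """Days until retrievability falls to r_target on the power curve (decay < 0)."""
    factor = 0.9 ** (1.0 / decay) - 1.0
    return stability * (r_target ** (1.0 / decay) - 1.0) / factor

for decay in (-0.5, -0.2):
    days = days_to_reach(0.10, stability=1.0, decay=decay)
    print(f"decay={decay}: R=10% after about {days:,.0f} days "
          f"({days / 365.25:.1f} years)")
```

With S = 1 day this gives roughly 422 days for decay = -0.5 and on the order of 144,000 days (close to 400 years) for decay = -0.2, so "never" is an exaggeration on paper but not in practice.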

#

It performs better almost universally, except for people with retention around 20%

#

So...flat curves are just better

bold terrace
#

Probably some external influence on how people use Anki

unique salmon
unique salmon
#

That's the only explanation that I see

bold terrace
unique salmon
#

Yep

bold terrace
#

and people with lower R might be people not really doing anything else outside anki

#

That would be interesting (but difficult I guess ?) to do some clustering on Users, Reviews .... to try to see if we can't also "profile" things

#

For example, by splitting my deck into "Normal D" and "High D", I discovered the default FSRS parameters are actually quite good for my Normal D!

#

And only my High D collection has a worse logloss/RMSE if not optimized

quasi shadow