#FSRS Megathread
how
I'd rather just PR it myself and deal with it that way 😭
I'll close mine then
svelte not being easily modified by addons is why this needs to be discussed more than just one person being fine with it
Learn to git patch
Tbf i do see a fair number of people ask about it.
a note in the help modal then?
my gripe is "the first review ever made" is too broad
Its there already i think
Can't you have it show "N/10000 cards included"
Not "BRIGHT NEON COLORS AND BIG TEXT FLYING STRAIGHT INTO THE USER'S FACE" enough
that would be nice
"-is:suspended" caught me out once
it would be more complicated though
"Ignore cards reviewed before"
... And all the subsequent reviews, their offspring, all their heritage, burn them to the ground as they never existed in the first place
🔥
that's one way to get the point across
"Ignore cards (and their future reviews) reviewed before" could also fit, but maybe too long
A lot of people don't even realize that you can click on settings to see a help text thingy
in that case, the naming itself is misleading
oh its currently alr "ignore cards reviewed before"
Ok, maybe not that many
yeah technically it's correct, but you have a false sense of "it will only ignore the reviews"
they're not very discoverable at the moment
maybe that would be better addressed than going for the nuclear neon option all the time
i say that, but as I'm looking at deck options, there's a help icon in the corner of each section, and hovering changes the cursor
Yeah but the Help menu should be there to go "deeper" in knowledge, not to get the knowledge right
Would that be easier to implement than "first review ever" or "first review of a card in this preset"?
The warning being displayed only if the selected date > date of the first review, I mean
depends on how easily understandable and accurate the setting's title is
everything involving reviews would mean that you would have to go through every review in a preset everytime you changed a setting just for a tooltip
i can see the red pr closed symbol now
"Ignore Cards introduced before" ?
i mean i guess you could cache the first review of every card every time you open the window
Then just make the warning appear if the date is non-default
With "Ignore Cards introduced before", you would not risk shifting the focus from the "card" to the "review"
"Introduced" isn't obvious
again, the default is 1970.
i mean its an improvement
I imagine some people will think it means "Created before that date" and some people will think "u wot m8"
"Ignore Cards learned before" ?
why is that a problem?
so having anything in that field would differ from the default
"Learning" is a "core" word, should do the trick
yeah but if they dont change it from the default it wont match any cards right?
every review ever was done after 1970
yeah so if its not changed from 1970 its not going to match any cards?
so no one needs warning
@clever cargo you're starting to reach the point of "finding problems for the sake of finding them"
what?
I mean, the fact that the default is 1970 doesn't really cause an issue for showing the highlight when the option changes, no?
what if we just execute the user whenever theres an error on their behalf
if its filled with any date past the first review, then its going to show a warning
or maybe "Learnt" for UK-blokes
which imo is too broad, given how many decks a user can have
@unique salmon you need to fill it out with text first but yeah it does appear
above where it should but it's still there
I'm still spooked by the Anki build system. I saw that there was a custom rust program that generated ninja files and went: "Ok, I'm not touching that Gordian knot until I really have to."
well, i can always open a pr in that case
for me it's the ping-pong game between the python code and the rust backend that throws me off. I may have misunderstood, but it seems the rust code instantiates the rust backend through the python bindings, which sounds weird. I plan on digging a little bit more.
I just start at the .proto files then work my way to python/rust from that.
just get used to it
I spent nearly one month to understand the framework of Anki codebase.
still have no idea how the scheduler works
which part?
arthur's fork has a good explanation on queues
it is the rabbit hole.
you will be surprised how deep it is
I'm not surprised, from my first impression I'll have to spend at least that much too before coding anything there.
@quasi shadow how about modifying FSRS so that the first rating determines a resulting fixed decay (1 -> 0.5, 3 -> 0.2)? A problem would be that the parameters would be mixed up having to support multiple forgetting curves so maybe an alternative evaluation could be: you train FSRS with a decay 0.2 and evaluate only on cards with first rating = 1, and then do the same thing with decay 0.5 and evaluate on only cards with first rating = 1
also good news, weighted exponential curves perform similarly to power curves. I'll work on some plots to see what the curves look like later
I think that's too harsh if decay can then never change. It places too much weight on the first review
And considering that people have all kinds of weird habits and make entire threads about "What button do you press for the first review?", this really doesn't seem like a good idea
Making decay depend on D would be interesting, but that didn't work
this is more to test jarrett's observation that maybe first rating = 1 would benefit from a higher decay
this graph is quite concerning. Why does it just go up and up oO
if I make it 1000 days, it does this. Which makes no sense to me
You run out of new cards around Nov/Dec 2026?
How many cards do you have in that preset?
~18000
That's 720 days until you run out of new cards, so around 2 years
@quasi shadow please investigate this, it seems that the simulator treats all cards as new
It's also just plain wrong even for tomorrow, I'll have ~350 reviews, not 150-200
This must be some artifact from me splitting my one big deck into a huge number of sub and sub-sub decks
but all decks use the same preset, so it shouldn't matter for the Simulator. Or so I thought.
would you mind sending me your collection?
yeah ok
the joys of multi-language programs
@cosmic hedge
i was expecting you to DM me but so long as you're comfortable thats fine XD
idk i guess im guarded with my own decks for some reason XD
also re: leeches and tagging, I was fully thinking leech was gonna be its own card prop rather than a tag or whatev, is:leech type stuff
we will judge you as a person based on how you do anki
I hope that exported fine. Didn't test exporting since the Subdeck-Inflation
If it's less problematic to implement that way, sure
I just think its better in general, not even about problematicness or not
(it also allows both leech types to exist at the same time, assuming addons or whatev care about leech tags)
Do you plan to make the old leech detector keep using tags?
I'd rather both the new and the old detector do the same thing, for consistency
I figured old method would be removed
"leech after N fails" is a shitty metric by every metric
Do leeches un-leech after a while at the moment?
Then I have a surprisingly low amount of them
I had leeches but then I turned up the leech count to like 1000 so of course I never have any
it just felt not-useful
¯\_(ツ)_/¯
A lot of your cards are missing memory states. This is what the simulator looks like after slightly changing one of your parameters, saving and trying it again.
But why? The simulator worked fine not too long ago
🤷‍♂️
And what does "missing memory states" actually mean?
It is very annoying how they only apply to notes. I have considered duplicating my notes so there is only 1 card per note just so the leech tag is useful.
oh this is probably why I basically turned the feature off
I knew there was a reason I just forgot
yeah note-level tagging of leeches is actually useless
when you rate a card for the first time it calculates its memory state (stability, difficulty etc.) from the card's history. It then saves that and uses it for future reviews.
a lot of your cards are missing that save
That's so odd
how would that happen
And nudging the parameters causes a global re-calculation?
yeah thats what the progress bar that appears after you hit "save" is showing you the progress of
never saw that, guess my PC is too fast or something :D
suffering from success
Could an addon have clobbered the custom card data maybe? If I remember correctly all the FSRS state is stored in there.
used to be back in the js scheduler days
Maybe I'm thinking of revlogs
its got its own memory state field now
I wrote a helper-addon to split the deck into subdecks
but all that does is find the cards, and call mw.col.set_deck on them
thats it
every time you move a card it gets its memory state erased
but is there any point to the addon's "v=reschedule"?
How do I re-generate it from within the addon? :D
or just holdover from the old days?
no idea XD
Not immediately. For now it will be a toggle (read my word document)
eventually, obviously
but maybe this method is so much better dae would just be ok with it being removed
Btw jake, read the comments below this: https://forums.ankiweb.net/t/automated-leech-detection/56887/16?u=expertium
Another user also proposed not using tags/flags, but I don't think it will work
If a card is a leech, it will be failed more often than FSRS predicts. That's how we define leeches with the new detector. So yes, the probability of recall will be higher at higher DR, but since leeches have a lower p(recall) than FSRS predicts, they will be failed more often. So depending on how much lower it is exactly, it's possible that...
I explained the issues here
https://forums.ankiweb.net/t/automated-leech-detection/56887/26?u=expertium
So the UI would display a pop-up based on the p-value stored in card info, and when searching prop:is_leech, Anki would convert it to prop:p<0.01, correct? Actually, no, that still won't work. In order to reduce the amount of time the leech status changes, in my specification of the detector I wrote that we should use two thresholds. If pthre...
is it bad if a handful of on-the-edge cards bounce back and forth between being a leech and not?
I think that actually happened quite often in my prototype.
🤷‍♂️
I'm honestly not sure how I should rate some cards. Like, how much "off-ness" I should tolerate
Ok, I see why I was confused. It's all packed in the data column, but custom data is its own thing (presumably nested in the data column).
this looks indeed much more reasonable
Interestingly DR is stored per-card which suggests you could get weird and do different DRs per-card instead of per-preset if you wanted.
if i had to guess i'd say its probably so it doesn't have to check the preset every review?
User experience, man
People will be like "Why is my card going from 'leech' to 'not a leech' so often?"
I assume so too, but it opens the door to tomfoolery ;p
"because you keep passing then failing it" ez
So we have to do the bullshittery with two thresholds or with updating the status only after every N reviews
"my leech-p is above threshold_1 but it's still marked as a leech, what gives?" - equally nonsensical complaint in the other direction
I always thought an "expected retention" row to the true retention table might be handy. You know like the average DR for every card in the search.
We won't expose thresholds or any other stuff to users
The detector will be a black box
that is super lame
The only thing we will show is p(leech)
Well, and the leech status as a binary variable
why is it p(leech) if its a bool
I mean that we will show both the probability and the binary leech/not a leech label
this feature is boring now man
Power users can search for p_leech in the browse window
if we show the probability then my complaint will be a thing
"my leech-p is above threshold_1 but it's still marked as a leech, what gives?" - equally nonsensical complaint in the other direction
exists in any situation p(leech) is shown
And this
trends can be an addon, 0% chance it'll be in anki proper
It's a cool idea, but it has problems:
- We are very limited on space. It could be different for Young & Mature which means 1-3 extra columns
- We do not have a history of DR so it could be wildly wrong for "Last Week" etc. if you changed DR at any point.
ahh i didnt do it myself because of the space issue but i think the history of dr thing kinda ruins it
A search is pretty much instant, unless I'm forgetting something
@cosmic hedge
It's not as simple as "introduced:x". It also counts cards which were forgotten and then re-introduced after the date.
I think we can write some simple sql for this?
```sql
-- ignore_before: the cutoff, as epoch milliseconds (revlog ids are timestamps)
SELECT count(DISTINCT cid) FROM revlog
WHERE id > ignore_before AND type == 0
```
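A minimal sketch of how that query could be tried out from Python, assuming a stripped-down `revlog` table with only the `id`, `cid` and `type` columns (the real table has more) and a hypothetical helper name. It only counts cards with a learning-type entry after the cutoff, which is simpler than what the optimizer actually does.

```python
import sqlite3

def count_cards_learned_after(conn, ignore_before_ms):
    """Count distinct cards with a learning entry (type 0) after the cutoff."""
    row = conn.execute(
        "SELECT count(DISTINCT cid) FROM revlog WHERE id > ? AND type = 0",
        (ignore_before_ms,),
    ).fetchone()
    return row[0]

# Toy in-memory table standing in for the collection (assumption, not real data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE revlog (id INTEGER PRIMARY KEY, cid INTEGER, type INTEGER)")
conn.executemany(
    "INSERT INTO revlog VALUES (?, ?, ?)",
    [
        (1000, 1, 0),  # card 1 learned before the cutoff
        (2000, 1, 1),  # card 1 reviewed after the cutoff (not a learning entry)
        (3000, 2, 0),  # card 2 learned after the cutoff
        (4000, 2, 0),  # second learning step for card 2
        (5000, 3, 1),  # card 3 only has a review entry
    ],
)
print(count_cards_learned_after(conn, 1500))
```

With a cutoff of 1500 only card 2 is counted, because `DISTINCT cid` collapses its two learning entries into one.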
https://github.com/ankitects/anki/blob/ccab18b7ba624d888f3d881e14f04c830e3eaa44/rslib/src/scheduler/fsrs/params.rs#L323 it's if first_of_last_learn_entries is before the cutoff then it's ignored
No way you thought all this would be in anki proper 😭 😭 😭 dae would set himself on fire before he would let that happen
Right now we can't even decide on the specifics of the leech detector itself
Oh, btw, I feel like I should clarify precisely what p(leech) means, statistically. I don't think I've explained this clearly before
With this detector, p(normal) aka 1-p(leech) can be interpreted as "Probability of observing this many or fewer successful reviews, assuming that probabilities given by FSRS are the true probabilities of recall", in other words, assuming that FSRS can predict the probability of recall perfectly accurately
It's a p-value for a one-sided statistical significance test where the null hypothesis is "The true probabilities are [whatever numbers FSRS predicted]"
So if the p-value is low, it means that it's very unlikely that we would see these outcomes if the probabilities predicted by FSRS were the true probabilities of recall (for this card)
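The definition above can be written out as follows. The notation is an assumption made for illustration: p_i is the recall probability FSRS predicted for review i, B_i is the outcome of that review, n is the number of reviews and k is the observed number of successful reviews.

```latex
\begin{align*}
X &= \sum_{i=1}^{n} B_i, \qquad B_i \sim \mathrm{Bernoulli}(p_i) \ \text{(independent)} \\
p(\text{normal}) &= \Pr(X \le k)
  = \sum_{j=0}^{k} \ \sum_{\substack{A \subseteq \{1,\dots,n\} \\ |A| = j}}
    \ \prod_{i \in A} p_i \prod_{i \notin A} (1 - p_i) \\
p(\text{leech}) &= 1 - p(\text{normal})
\end{align*}
```

The null hypothesis is that the p_i are the true probabilities, so a small p(normal) says the observed number of successes would be unusually low if FSRS were right about this card.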
You have made it more confusing for us simpletons
Okay couple of brain strokes later, I am beginning to understand it
Basically, low p(normal) aka high p(leech) means that FSRS sucks at predicting probabilities of recall
So high p(leech) means card is so difficult that FSRS scheduling is useless and you need to find something else to help you recall it
So basically in other words, a leech
No amount of scheduling would help make this leech unleech
That makes sense
Mmm, not exactly. More like "No amount of reviews will help FSRS accurately predict probabilities for this card"
Alright, with that out of the way, we need to decide on 2 things:
- Tags/flags/custom data in card info?
- Do we do it the simple way with only one threshold and checks after every review, and if the card keeps bouncing back and forth between being a leech and not being a leech - we say "it's not a bug, it's a feature"; or, alternatively, do we do it the complicated way with two thresholds or checks only every 2/3/4 reviews so that cards don't change their status too often?
@ashen light @cursive badge @cosmic hedge
tags are note-level (as opposed to card-level), making them a non-option. cards can only have 1 flag at a time, also making it not an option
a custom leech attr on card is the only reasonable thing
very much against checking every N reviews, means we need to keep track of an extra attr
We can do two thresholds, though that creates another problem: if we show p(leech) or p(normal), some cards can end up counted as leeches and some not, despite at present having the same p(leech)
For example, if the first threshold is 5% and the second one is 25%, and p(normal) is 10%, whether it's a leech or not depends on whether it has crossed the first threshold before or not. If it has crossed it before, it's a leech, otherwise it's not a leech.
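The two-threshold behaviour described in that example could be sketched like this. The function name and the default values (5% and 25%, taken from the example) are illustrative assumptions, not the detector's actual specification.

```python
def update_leech_status(is_leech, p_normal, enter_threshold=0.05, exit_threshold=0.25):
    """Two-threshold (hysteresis) update of the binary leech label.

    A card becomes a leech only when p(normal) drops below enter_threshold,
    and stops being one only when p(normal) rises above exit_threshold.
    In between, the previous status is kept.
    """
    if not is_leech and p_normal < enter_threshold:
        return True
    if is_leech and p_normal > exit_threshold:
        return False
    return is_leech

# Same p(normal) = 10%, different outcome depending on history:
print(update_leech_status(False, 0.10))  # never crossed 5%: stays "not a leech"
print(update_leech_status(True, 0.10))   # crossed 5% earlier: stays a leech
```

This is exactly the situation raised above: two cards with the same p(normal) can carry different labels, which is the price of fewer status flips.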
people are gonna invent problems to complain about no matter which option is done
the bounce back and forth strat has an easier implementation
Well, guess it's new data in card info + simple method then
Now the real question - who's gonna implement it?
I'm just annoyed I gotta do the math thing
is there an easy off the shelf equation I can grab from a standard stats library
Oh come on, math is the easy part
#1282005522513530952 message
why you gotta do this fuckin tryhard poisson binomial thing literally nothing implements
ok port it to rust for me
protip: that ai version wasn't going to work
Why?
Like, "it bugs out and spits nonsense" doesn't work?
because it assumed rust Vecs behave like numpy arrays
I didn't run it but at a glance it wasn't gonna do what you wanted
for example pmf[j] = pmf[j] * (1.0 - prob) + pmf[j-1] * prob; the way it's implemented, in pmf[j] * (1.0 - prob), pmf[j] will always be zero, so the first half of that equation is always 0
and so....is not gonna do what you want
unless we want to do that calculation for fun for some reason
I looked over it enough to see an obvious problem then just didn't give it any more thought
https://play.rust-lang.org/?version=stable&mode=debug&edition=2024
Looks good to me
you didn't actually link anything
Unless my Python implementation is also somehow bugged and I didn't realize it
Just copy-paste it
Works fine for 90%, 90%. This is indeed what you get if you do the math by hand
Apparently not
my point still stands though, why you gotta use some tryhard stats thing
Because I don't see any other way
We can't just assume that FSRS always predicts the same probability of recall for obvious reasons
If FSRS always predicted the same probability of recall, we could use the good ol' binomial distribution
(except that in that case there would be no reason to use FSRS in the first place cause it would be fucking useless)
Poisson binomial is a generalization of the binomial distribution for when probabilities of success aren't always the same, like in a coin toss
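For reference, a minimal Python sketch of the Poisson binomial PMF using the in-place update quoted earlier. This is not the code that was shared in the thread; the function names are hypothetical. The inner loop runs from high j to low j so that `pmf[j - 1]` still holds the value from the previous round when it is read.

```python
def poisson_binomial_pmf(probs):
    """Return pmf where pmf[j] = P(exactly j successes) for independent trials."""
    pmf = [1.0] + [0.0] * len(probs)
    for i, p in enumerate(probs):
        # Update from the top down so pmf[j - 1] is not overwritten before use.
        for j in range(i + 1, 0, -1):
            pmf[j] = pmf[j] * (1.0 - p) + pmf[j - 1] * p
        pmf[0] *= 1.0 - p
    return pmf

def p_normal(probs, successes):
    """P(observing this many or fewer successes) given predicted probabilities."""
    return sum(poisson_binomial_pmf(probs)[: successes + 1])

print(poisson_binomial_pmf([0.9, 0.9]))
```

For two reviews predicted at 90% each this gives approximately 0.01, 0.18 and 0.81 for zero, one and two successes, which matches the by-hand check mentioned above.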
I mean, I guess we could try to come up with something COMPLETELY different that isn't based on fancy probability distributions, but nah
my point more is I just wanted to pull in a library that I could hand an array and have it do the math for me
nah
Save complaining for later, for the actually annoying parts, such as:
- Unleeching
- Pop-ups for both leeching and unleeching
- Recalculating leechiness every time FSRS parameters change
unleeching isn't even hard
3 is the only actually annoying thing here
(but it already does other stuff anyway, just gotta hijack that process)
I just realized why I have so few leeches lol
every time I sync the deck with WaniKani, it overwrites the tags
Not like it matters. Nothing I can do with the info of it being a leech anyway.
https://github.com/ankitects/anki/pull/3910 you're right XD
i disagree with this interpretation, we cannot distinguish if we were just unlucky or if FSRS was wrong
How about that?
It's the notebook optimizer.
It has a detailed evaluation which groups the reviews based on the last rating.
I guess the forgetting curve is sharper when the last rating is Again.
Maybe it's better to save the raw data for further analyses.
@quasi shadow @unique salmon this version of RWKV uses a weighted sum of 128 exponential forgetting curves. Maybe we should make FSRS decay scale with S?
also the 1-day stability plot might be a bit inaccurate since RWKV uses elapsed seconds
Seems like the forgetting curve is flat when S is small and becomes sharper as S increases?
perhaps even decay = 0.1 could be beneficial for small S
S = 1 -> 0.1
S = 30 -> 0.2
S > 100 -> 0.5
and maybe interpolate this in log space
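A rough sketch of that idea, using the three anchor points as written above and linear interpolation in log(S). The function name is hypothetical, and clamping outside the 1 to 100 day range is an assumption.

```python
import math

# (stability in days, decay) anchor points taken from the messages above.
ANCHORS = [(1.0, 0.1), (30.0, 0.2), (100.0, 0.5)]

def decay_for_stability(s):
    """Piecewise-linear interpolation of decay in log(S), clamped at both ends."""
    if s <= ANCHORS[0][0]:
        return ANCHORS[0][1]
    if s >= ANCHORS[-1][0]:
        return ANCHORS[-1][1]
    for (s0, d0), (s1, d1) in zip(ANCHORS, ANCHORS[1:]):
        if s0 <= s <= s1:
            t = (math.log(s) - math.log(s0)) / (math.log(s1) - math.log(s0))
            return d0 + t * (d1 - d0)

for s in (0.5, 1, 5, 30, 60, 100, 365):
    print(s, round(decay_for_stability(s), 3))
```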
this could also be what we are seeing with first rating = 1 since it tends to result in lower stability after all
whoops i mixed it up
there might be some weird behavior when changing decay, let R(t, S, decay) be the function that gets the retention given a certain time, stability, and decay.
Then we would ideally want S1 > S2 => R(t, S1, decay1) > R(t, S2, decay2), but if decay1 != decay2 then this can be broken
Yeah, I know.
It means forgetting curves with different S will intersect at some T (T > 0).
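A small numeric illustration of the crossing, assuming the power forgetting curve form used by recent FSRS versions, scaled so that R = 0.9 at t = S. The stabilities and decays below are made-up values chosen only to show the effect.

```python
def retention(t, s, decay):
    """Power forgetting curve with retention(s, s, decay) == 0.9."""
    factor = 0.9 ** (-1.0 / decay) - 1.0
    return (1.0 + factor * t / s) ** (-decay)

s1, decay1 = 10.0, 0.5  # higher stability, steeper decay
s2, decay2 = 8.0, 0.1   # lower stability, flatter decay

for t in (5, 10, 100, 1000):
    print(t, round(retention(t, s1, decay1), 3), round(retention(t, s2, decay2), 3))
```

At t = 5 the card with S1 = 10 has the higher retention, but at t = 1000 the card with S2 = 8 does, so the ordering by S no longer implies an ordering by R.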
be like
It's the distribution of trainable decay.
- I think if a leech revamp does happen it would ideally involve new prop(s). Tagging notes has always been a half-baked solution and we cannot use flags in native Anki because it could clash with user flags. EDIT: I guess another solution would be letting cards have tags, but that is another feature in itself.
- I don't know. I haven't touched it since my prototype a few weeks ago but I don't think that we are ready for a "black box" with no knobs for the user to twiddle. I know it would be terrible UX but I never felt "I would be happy to just run this on any dataset" when I was playing with my prototype.
I would take some more convincing before trying to implement it natively in Anki but if @ashen light is interested enough to call off his strike I'm not going to be too negative and interfere.
I am still sticking to my idea of trends
I think knowing whether the card is a leech or not is very helpful, but even more helpful is knowing whether what you are doing is helping you learn the card or not (whether you are on the right track)
OK, now I know how to improve the optimal retention feature.
we need to increase the cost per review when the desired retention decreases.
there seems to be a lot of 60k+ as well which could represent way more than 60 seconds in reality
time where the user either gives up for a while or has to purposefully spend re-encoding the card into memory
This is weird to me because I thought the problem with CMRR was it went to 0.7 too often. Increasing the costs with higher DRs to offset this would make it even more likely to result in 0.7 right?
increase the cost per review when the desired retention decreases
it means the cost is larger when the DR is lower.
so it will increase the CMRR
Oh right
I thought you were looking to counteract the effect in the simulator but forgot it doesn't exist in the simulator.
Would this effect be worth doing with retrievability instead of DR?
I know in the simulator they both end up being pretty much the same thing
it's more accurate if there is a backlog.
optimal retention
before: 0.7143667819857166
after: 0.8377484029026208
Just add this line
I was happy with a double threshold + dividing thresholds by 1.4
with this, you will spend 20% more time per review when your desired retention is 70% instead of 90%.
Can we estimate this from the user's history?
yes, but it needs to calculate R for the history
I'll take that as a "yes"
Nope

IMO with the leech detection stuff, in practice I see a few elements that make it not as useful as I'd initially hoped:
- A lot of cards flagged by it are cards with a few "bad streaks". While that is indeed very low probability according to the FSRS model, in practice it's not that uncommon, especially for recently introduced cards.
- Once the repetitions are higher, it starts to make more sense, but sometimes you still get cards with a moderate number of reviews being flagged because they had a very very bad start.
- For cards with a high number of repetitions, it doesn't really bring much more information than checking the number of lapses, since contrary to what I would have expected a few months ago, in my case at least, the more reps a card has, the less stability it has on average compared to lower-rep cards. So discriminating "harder cards" based on # reps or # lapses is still .... very valid for FSRS
I don't know if someone has practical experience with it and sees different cases?
This is a typical example. It got flagged for a bad start, but will only be considered "unleeched" once the history count is big enough ... even though the bad streak was 1 year ago. Since easy cards don't accumulate reps that quickly, it might still be one of the most leechy cards of my deck even though it's quite an easy one (1 failed rep in 1 year, and the last fail was ~11 months ago)
When I tried the leechkit with my --last-review N, it felt better, but mostly because it would now flag only the ones with a recent bad streak. The number of results would of course be way lower, something like 2-5 cards out of 4000 active ones
So what's the plan?
If we aren't going to estimate it for every user, will you estimate one average value (or two, like a - b*R) and just hard-code those?
It would be better if it was estimated for each user individually
But I guess a - b*R with a and b estimated from the 10k dataset would still be ok on average
hard code
And now imagine if instead of sum(R) we had a sum(R*f(S)) with f a function that would converge to 1 when S is big enough (360d)
#Team90DR
Comparison of SUM(R) and SUM(R*f(S)) when new cards/day changes from 8 to 40 to 8 again.
You can see that for SUM(R), the more you add, the better.
For SUM(R*f(S))
- The more "active card" you have, the better since S can grow for all of those
- New cards / cards that stay at low S are discounted (I mean, should a 0.9 R on a 1h-stability card have the same Memorized Value as on a 365d-stability one?)
- Since R lies in [DR, 100%] if you're a good boy, for people with high DR, R is a proxy to measure SUM(active cards) 🤷‍♂️
For f, using sqrt, ln, or something fancier like 1 - Math.exp(-((8 / 365) * s)) (rises early, converges to 1, and is already close to 1 at 365d) doesn't really change the trend much, considering S is already good enough
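A minimal Python sketch of the SUM(R*f(S)) idea with the exponential f quoted above. The function names and the sample cards are illustrative assumptions.

```python
import math

def stability_weight(s_days, rate=8 / 365):
    """f(S) = 1 - exp(-rate * S): near 0 for tiny S, close to 1 by about 365 days."""
    return 1.0 - math.exp(-rate * s_days)

def sum_r(cards):
    """Plain SUM(R) over (retrievability, stability_in_days) pairs."""
    return sum(r for r, _ in cards)

def sum_r_weighted(cards):
    """SUM(R * f(S)): the same sum, discounted for cards with low stability."""
    return sum(r * stability_weight(s) for r, s in cards)

# Two cards with the same R = 0.9: one with 1 hour of stability, one with 365 days.
cards = [(0.9, 1 / 24), (0.9, 365.0)]
print(sum_r(cards), sum_r_weighted(cards))
```

Under SUM(R) both cards count the same, while under SUM(R*f(S)) the 1-hour card contributes almost nothing.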
CMRR is sum(R) / cost so the stability of the cards will already be factored into the cost right?
Cost is time
So no
time spent on cards, cards which are scheduled according to stability?
Yeah, but (I think) Sound's point is that CMRR doesn't take into account how quickly R decays
Aka how well you know the card
I can test making the decay depend on log(S)
Well, later, once I'm done with neural D
Just from eyeballing it, if I pretend that the red line doesn't exist, I can't even tell if there is any correlation at all 🤣
Try it, everyone and anyone
Try telling whether probability of recall is positively or negatively correlated with answer time based on this graph
But in case @quasi shadow still wants to do it, I recommend estimating a - b*R for all 4 grades, 8 parameters in total
If these graphs are to be believed, a and b can be different for different grades
It's more clear if I show you the box graph.
Actually, wait
We calculate costs separately for learning and reviewing, so 16 new parameters 🤣
I really think we need a benchmark where the goal is to accurately predict costs
Rather than R
We could just take FSRS-5 and use all this stuff we use for CMRR and the simulator, and run it on the 10k dataset, and compare predicted costs to real answer times
Then we can finally have a way to tell if we're making the CMRR/simulator better or worse with our changes
Please make a repo for benchmarking the accuracy of predicting costs
You already have FSRS parameters per user, so it's not like you will need to estimate them again
Just copy them from the 10k repo
And in the new repo you will run FSRS with parameters for each user and with cost estimations for each review exactly as in CMRR/simulator
No optimization
And at the end we will get average(|predicted cost - actual cost|) and sqrt(average((predicted cost - actual cost)^2))
MAE and RMSE
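The two metrics named above, as a short Python sketch. The function names and the sample numbers are made up for illustration; in the proposed benchmark the inputs would be predicted and real answer times per review.

```python
import math

def mae(predicted, actual):
    """Mean absolute error: average(|predicted cost - actual cost|)."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    """Root mean squared error: sqrt(average((predicted cost - actual cost)^2))."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Hypothetical answer times in seconds.
predicted = [8.0, 12.0, 20.0]
actual = [8.0, 12.0, 26.0]
print(mae(predicted, actual), rmse(predicted, actual))
```

RMSE penalizes the single large miss more heavily than MAE, which is why reporting both is useful.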
And then we can enter the new era of tweaking our cost prediction stuff 🤣
Just think about ALL THE TWEAKS
I'M TWEAKING SO HARD
My solution
isn't "how quickly R decays" "stability"?
Fantastic stuff!
but this is not p(leech) against time is it
or would it have a similar shape
.
Whatever @cursive badge did here
Yes
I'm serious. At this level of complexity we NEED a proper benchmark
Please do it. Just reuse parameters for each user that you already have in the srs-benchmark repo, run FSRS on each user, make it predict answer time using the same formulas as in CMRR/simulator
And then calculate the mean absolute error and RMSE of predicted answer times and real answer times
We are past the point where we can just say "Oh, but this change obviously improves how accurately costs are calculated", we need proper tools to assess further changes
This is just probability of recall against answer time
Hmmm the thing is that the workload is more dependent on the interval than really the stability. What I mean, is that the interval can be reduced by reducing the DR. But in fact, the intrinsic quality of your memory is Stability, not really Interval
Stability is agnostic of DR
If I have a full-time job in Ankitects, I will consider it.

Fine. But then don't implement this change yet
This is the kind of change that isn't obviously an improvement, and needs benchmarking
It obviously makes the optimal retention with FSRS-6 similar to FSRS-5.
Without it, CMRR will give you an extremely low retention.
Man, I don't want to argue. I'll just be brief: from this point on any extra complexity added to the simulations, specifically to the part related to estimating answer times, has to be justified via benchmarking as described here (#1282005522513530952 message), otherwise I will not be happy
I mean, you are obviously free to disregard my opinion, but I genuinely hope you will understand that past a certain point of complexity you need proper tools and not just "It works. Source: it was revealed to me in a dream"
@cosmic hedge : For example, let's say your workload right now is 100 reviews/day for DR=90% with a total deck size of 1000. Your score would be 900/100 = 9.
If you now set DR=70%, and let's say that halves the workload, you get a score of 700/50 = 14.
So basically, the optimizer will make the most gain by making you drop the workload as much as possible -> dropping the DR
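The arithmetic in that example, as a tiny Python sketch. It assumes, as the example does, that the sum of R is roughly deck size times DR; the function name is hypothetical.

```python
def knowledge_per_review(deck_size, desired_retention, reviews_per_day):
    """Score = SUM(R) / workload, with SUM(R) approximated as deck_size * DR."""
    return deck_size * desired_retention / reviews_per_day

score_90 = knowledge_per_review(1000, 0.90, 100)  # 900 / 100
score_70 = knowledge_per_review(1000, 0.70, 50)   # 700 / 50
print(score_90, score_70)
```

The lower DR wins (about 14 against 9) simply because the workload falls faster than the summed retention, which is the behaviour being criticized.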
If you include S in the numerator, you now have something like f(S)/workload(I) that compensates for that, and it also pushes the goal function to not sacrifice S just for the sake of reducing the workload a bit
Right now when you look at the graph that CMRR is trying to optimize, it's not even a U curve, it's almost a purely increasing curve .... so basically yeah, you always get the minimum threshold of 70%, it's worthless
I don't know why you didn't say it when I and @cosmic hedge implemented the learning steps in the simulator.
If you mean the one in the manual, it's workload, not workload/total knowledge, so the shape is different
It's more complex.
You mean the Markov chain thingy, with 12 costs? That one is "obvious enough" IMO 🤣
The chances of that one somehow making estimations of answer times less accurate are very low
Though, on second thought, I want to see that one benchmarked as well
We can benchmark
- Current implementation
- Current implementation + Markov chain for learning steps
- Current implementation + Markov chain for learning steps + your R correction
Ah yeah my bad https://github.com/open-spaced-repetition/fsrs4anki/blob/main/fsrs4anki_optimizer.ipynb, I checked this one.
But yeah, to me it feels like CMRR is always "return 0.70" right now. And I think it's because the goal function (being only SUM(R) or SUM(R)/workload) is not taxing errors enough. Again, a question of setting the right tariff
Slightly unrelated, but you apply the same smoothing to those 12 costs, right? So that the final value is a weighted average of the default cost and user-specific cost, weighted by n reviews
It's tedious for me to benchmark it when there are only a few people who are concerned with it.
Just check the code.
I'm not asking you to benchmark it because I am concerned about it, though I certainly am. I'm asking you to benchmark it for the sake of making a better algorithm and making things better for anyone who will be using CMRR/simulator
If your correction makes the predicted answer times less accurate, it will affect everyone who uses CMRR/simulator
Idk where to look
Clearly not the python version, since that's not what is used in Anki
And the Rust version doesn't seem to have the new learning step simulation
https://github.com/orgs/open-spaced-repetition/discussions/36
I mentioned it here
I don't think smoothing is applied to new costs?
Or to learning_step_transitions and relearning_step_transitions
All of them need to be smoothed
Btw, why are there so many 0.25?
Yes, but you were talking about how we could use p(leech) to determine whether a card is a leech or not.
So what I'm saying is: plot p(leech) against time and see if there are any trends.
Oh wait, I replied to the wrong comment
That was meant to be a reply to Jarrett
To this comment
I wish Anki recorded time-to-answer as well as total study time. I feel the time spent looking at the back kind of poisons the data for other uses.
How much would Dae have to pay you? 🤣
Also look how sum(R*f(S)) is a better representation of the "gainz" you made, whether they're due to more new cards/day or to just reviewing them more
On the other hand, sum(R) makes it look like reviewing a lot of cards per day was less useful than just introducing a shit ton of cards I was able to recall 2d later
If you want to calculate answer time as a function of R, sure. But not like this. Here - assuming I understand your code correctly - you just apply a correction to the already existing average (or median, whatever, that's not the point) time. That's not the same as answer time = a - b*R, that's "I took answer time that is not related to R at all and added some sort of correction to it"
I have no objections to answer time = a - b*R if and only if a and b are estimated for each user and for each grade separately. Otherwise we will lose accuracy instead of gaining it. Right now the median answer times are estimated for each user individually. If we use answer time = a - bR with fixed a and b, I'm 100% sure it will be worse than our current approach, since this function will be the same for each user instead of being based on user-specific data
As for the approach in your screenshot, where you just add a correction based on R to answer time that is not related to R - no, absolutely not, please don't
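For reference, a sketch of what "a and b estimated for each user and for each grade separately" could look like: an ordinary least-squares fit of answer time against R, run once per (user, grade) pair. The function name and the synthetic data are assumptions for illustration; nothing like this exists in the simulator as far as this thread shows.

```python
import numpy as np


def fit_answer_time(retrievability, answer_time):
    """Least-squares fit of answer_time = a - b * R for one user and one grade.

    Returns (a, b). Both inputs are 1-D sequences of equal length.
    """
    r = np.asarray(retrievability, dtype=float)
    t = np.asarray(answer_time, dtype=float)
    if r.size < 2 or r.max() == r.min():
        raise ValueError("need at least two reviews with different R")
    slope, intercept = np.polyfit(r, t, deg=1)
    return float(intercept), float(-slope)


# Synthetic reviews for a single (user, grade) pair: time = 20 - 10 * R.
r = np.array([0.6, 0.7, 0.8, 0.9, 0.95])
a, b = fit_answer_time(r, 20.0 - 10.0 * r)
print(a, b)
```

Whether such a fit beats the per-user median is exactly the question a benchmark would have to answer.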
Man, I'm telling you, past the current level of complexity WE REALLY NEED A BENCHMARK
Pull requests welcome?
sigh
I could maaaaaybe try to do it myself and then make a repo, but god it would be a nightmare
You should, you're asking for a nightmare
First job is to get a shovel and start digging
Read the user review data from the .parquet file, get FSRS params corresponding to that user, use the code that estimates answer times to estimate answer times, record the difference between "predicted" and real answer time after every review, average it to get the average error
Repeat for ten thousand users
Average the average errors to get the average average error
Oh joy...
I'll have to stitch together the simulator code and the code that reads data from .parquet files
😭😭😭
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Actually, no, more like I'll have to repurpose the already existing benchmarking code, but load the FSRS parameters instead of calculating them
And somehow make it output answer times
The more I think about how I'm going to do it, the more I'm making noises of a dying seal
It's not even that bad if you are Jarrett, since he's the only one who actually understands the monstrosity that is the benchmarking code
But to everyone else the benchmarking code is barely comprehensible
Uhhhh...ok, I'm just going to trust you
I think Alex also understands it.
@polar maple want to make a benchmark of answer time predictions?
- For each of the 10k users, get their FSRS params from the srs-benchmark repo
- For every review predict answer time, which currently is just a weighted average of user-specific median answer time and default answer time. It's currently "static" - we just estimate a bunch of numbers from the user's review history, no FSRS needed. So even the word "predict" isn't really correct here. But we could estimate answer time as a function of R or something, then we would need to actually run FSRS
- Calculate the difference between real answer time and predicted answer time
- Calculate the final error across all users and all of their reviews
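The steps above can be sketched roughly as follows, assuming mean absolute error as the "difference" and an unweighted average over users (both are choices made for this example, not something decided in the thread). The toy dictionary stands in for data that would really come from the .parquet files.

```python
from statistics import mean


def user_error(actual_times, predicted_times):
    """Mean absolute error between real and predicted answer times for one user."""
    if len(actual_times) != len(predicted_times) or not actual_times:
        raise ValueError("need two non-empty sequences of equal length")
    return mean(abs(a - p) for a, p in zip(actual_times, predicted_times))


def benchmark(users):
    """Average the per-user errors, so every user counts equally.

    `users` maps a user id to (actual_times, predicted_times).
    """
    return mean(user_error(actual, predicted) for actual, predicted in users.values())


# Toy data standing in for the 10k collections (seconds per review).
users = {
    "user_1": ([5.0, 7.0, 9.0], [6.0, 6.0, 6.0]),  # absolute errors 1, 1, 3
    "user_2": ([10.0, 10.0], [8.0, 12.0]),         # absolute errors 2, 2
}
print(benchmark(users))
```

Swapping in a different predictor (static medians, or answer time as a function of R) only changes how `predicted_times` is produced; the error calculation stays the same.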
(I assume the answer is "no")
Neither do I
I admit I looked at the benchmark code once and then decided it would be easier to just do my own thing because I found it hard to follow which columns were where in the code.
Josh (joshuahamilton on Discord) also said it's hard to understand
Honestly, I think I could do this if it's "static". But answer time as a function of R...no
To be fair to Jarrett I think it's inevitable that this kind of thing would become a bit confusing. He's just doing it as his own side project and is under no obligation to spend extra time trying to make it easier for others to digest.
@quasi shadow
because the user doesn't use Hard
so we don't know the probs of the next rating after hard
The thing is, there isn't really any code that I can just steal and repurpose with minimal effort
The benchmarking code - I will break it if I try to disable optimization. Plus, I have no idea how to add the estimation of median review time to it in a way that is compatible with the rest of the code
The simulator code - I won't be able to read .parquet data and pass it into the simulator instead of randomly generated data
Plus a whole lot of little details that only Jarrett knows how to get right
And considering that historically I have never, NOT EVEN ONCE managed to run the benchmarking code on the first try and always had to consult Jarrett, EVERY SINGLE TIME HE CHANGED ANYTHING about the benchmarking, I feel like it would be easier for me to get a job and pay Jarrett
Reading parquet was pretty easy when I tried. I've never tried touching the simulator code so I cannot comment on that.
If you spent 3~4 hours per day over a year, you would understand it.
Reading is easy, passing it into the simulator code (or whatever relevant parts of it, minus random number generation) is hard
God, this is so over
That's what I did to contribute to Anki
Unless Jarrett decides that he nobly wishes to never make any changes to calculating answer times without benchmarking them first, for the sake of Anki users
no
figures
At least promise me to not do this
i tried scaling decay with this scaling, it does not work well
I do it because I introduce the flat forgetting curve.
Expertium really needs AGI to happen so they can command an army of AI agents to go off and program all their ideas
It's better than doing nothing.
This but unironically
No, that's the point - you don't know that it's better
"Create an Anki better than the one Expertium created". Checkmate.
at that point we'd just ask the agi to emulate a perfect version of anki
I know it's bad without it.
If you really run the unit test, you will know how bad.
Your formula works only if cost is defined as "answer time at R=90%", but it's not
OK, I will modify the code to enable it only for CMRR.
So the simulator will do it differently compared to CMRR? Please no...
I'd rather ditch CMRR entirely
Maybe the AI superintelligence will humour us and at least tell us our ideas are good.
remove CMRR, improve simulator
It's the worst feature I made.
CMRR? I think it's fine, at least right now
Maybe not if it always outputs 70% 🤣
fwiw the code was very cool, learnt a lot!
Nope. It has so many problems.
For example, the loss aversion.
It's introduced to increase the output.
Can't we just run the simulations, with real deck sizes and real cards states and all that, and get CMRR 2.0?
Actually, the current simulator is incorrect.
Instead of the current "spherical in vacuum" implementation
I thought it's disabled for simulations?
It's always 2.5.
Like, I thought it's used only for CMRR
🤣
God damn it man, please disable it
Nobody complained
People want accurate workloads
You're the first one.
For CMRR it's ok because people don't see the workload graph
"Time"
Because we show this graph, it'd better be accurate
CMRR doesn't show anything related to how much time is spent on reviews, so it's fine to cheat a little bit, users won't be able to see it
Forget CMRR
I will
after merging the FSRS-6 PR
There are several benchmarks I need to complete
so maybe the next week
Maybe make CMRR 2 with accurate deck sizes and card states? 
Aka just run the simulator with all of its configs
Easy Days, sort order, blah blah
Literally just
for R in range(70, 100):
    workload, knowledge = simulator(R, all_other_shit)
And there you go, CMRR 2.0!
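Fleshed out a bit, the loop could look like the sketch below. `toy_simulator` is a made-up stand-in so that the example runs; its formulas have nothing to do with FSRS, and the real simulator would take the user's actual deck, card states and settings instead. Minimizing workload/knowledge is assumed as the objective.

```python
def toy_simulator(desired_retention: float, deck_size: int = 1000):
    """Hypothetical stand-in for the real simulator.

    Returns (workload, knowledge). The formulas are invented so that the
    example has a minimum somewhere; they are not how FSRS computes anything.
    """
    # Higher retention -> shorter intervals -> more reviews.
    workload = deck_size / (1.0 - desired_retention) ** 0.5
    # Lower retention -> more forgotten cards -> less knowledge.
    knowledge = deck_size * desired_retention ** 2
    return workload, knowledge


def best_retention(simulator, candidates, **settings):
    """Return the candidate with the lowest workload per unit of knowledge."""
    def cost(r):
        workload, knowledge = simulator(r, **settings)
        return workload / knowledge
    return min(candidates, key=cost)


candidates = [r / 100 for r in range(70, 100)]  # 0.70 ... 0.99
print(best_retention(toy_simulator, candidates, deck_size=1000))
```

With this toy cost the minimum lands at 0.80; with a real simulator the answer depends entirely on the settings passed through `**settings`.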
@cosmic hedge wanna replace the current CMRR that assumes a specific deck size, a specific number of new cards/day, no already learned cards, etc. with CMRR Turbo Plus Ultra?
Instead of having CMRR as a separate entity, just make it a part of the simulator
i guess that especially without loss_aversion, CMRR would output 0.7 in most cases?
According to Jarrett, with FSRS-6 - yes
Maybe we need to listen to Sound after all and use sum(R*f(S)) instead of sum(R)
But then the choice of f(S) is completely arbitrary
i vote for one of these that i described here
has a better meaning than something like R*sqrt(S)
you can do it
I believe in you
Not that kind of convincing
oh, so you can't do it
You're a tricksy one Jake ;p
I mean: I'm not sure it is fully baked yet, and don't want to put a lot of effort into something that might be thrown away if it doesn't work well for most users.
Yeah, I really like the integral idea
For anyone who wants to mess around with it and implement it in the advanced stats add-on:
The only issue is that the choice of the range of time (t1, t2) is arbitrary
@cosmic hedge @bold terrace
Hmmm f(S) = ?
There is no f(S). Instead, we use R averaged over some period of time
Which implicitly takes into account S, since with higher S average over [t1, t2] will be greater
I wouldn't use it for graphs, but we can use it for CMRR 2, if a bloke implements it
Instead of workload/knowledge(at the end of the simulation), it will be workload/knowledge(average over some time)
Maybe I'm wrong, but won't it be somewhat linearly proportional to S?
Since S already kind of describes how R declines with time
it's the average R over a certain time period so it will be divided out by the amount of time
Average R
For S=180 and S=360 you would have how much ?
I see
Alright, I'm gonna be away from my PC for an hour or two, so feel free to use the file I provided
To be fair I think it would reward a bit too much very low Stability compared to very high one (Since let's be honest, S=1/S=2, the knowledge is not at all acquired), but I don't mind testing how it would look like. I just don't know how performant it will be to do that loop from t1 to t2 for every revlog entry (but I guess a few dozen should not hurt)
in this case probably t1 = 1 is fixed and t2 will be something we experiment with
in Expertium's code rn it is t2 = 10 which is far too low imo
we don't need to iterate between t1 to t2 actually, with the integral the computation is quick
Jonathans-Laptop:tmp jschoreels$ python3 fs.py
R at t1=1: 0.900000
R at t2=360: 0.331270
Average R within the [t1, t2] range: 0.409252
Brute force calculation of average R within the [t1, t2] range: 0.409252
Brute force calculation agrees with integral calculation: False
Jonathans-Laptop:tmp jschoreels$ python3 fs.py
R at t1=1: 0.999615
R at t2=360: 0.900000
Average R within the [t1, t2] range: 0.944604
Brute force calculation of average R within the [t1, t2] range: 0.944604
Brute force calculation agrees with integral calculation: True
I see
import numpy as np
def power_forgetting_curve(t, s, decay):
    factor = 0.9 ** (1 / decay) - 1
    return np.power((1 + factor * t / s), decay)
This function doesn't depend on anything else ? D ? FSRS parameters ?
nah it doesn't. if you wanted something that depends on everything then you could take the result of SSP-MMC as a score in itself, the "average cost to reach target stability given S,D,R, and FSRS params"
S=1
Integral at 10 : 9.451472
Integral at 360 : 149.668518
S=360
Integral at 10 : 658.855147
Integral at 360 : 988.986838
I removed the avg
Maybe I shouldn't have lol
Integral at 10 : 658.855147 Does this mean t1=1 to t2=10? why is it a number larger than 10? if R = 1 for the whole duration theres no way it can sum up to such a large number
t2=10
stabilities = [i for i in range(1,360)]
print(stabilities)
integrals = [integral_power_forgetting_curve(t2, s, decay) for s in stabilities]
print(integrals)
plt.plot(stabilities, integrals)
plt.show()
I just call this
def integral_power_forgetting_curve(t, s, decay):
    # Returns the antiderivative F(t); only differences F(t2) - F(t1) are areas
    factor = 0.9 ** (1 / decay) - 1
    # Check that parameters are in valid ranges
    if not (0 > decay >= -1):
        raise ValueError("Decay must be in the range [-1, 0)")
    if t <= 0 or s <= 0:
        raise ValueError("t and s must be positive")
    # Special case for decay ≈ -1
    if abs(decay + 1) < 1e-10:  # Using a small threshold to check if decay ≈ -1
        return (s / factor) * np.log1p(factor * t / s)
    # General case for decay ≠ -1
    return (s / (factor * (decay + 1))) * ((1 + factor * t / s) ** (decay + 1))
But yeah S=1 integral at 9.4 seems off
t_array = [i for i in range(1,100)]
integrals = [integral_power_forgetting_curve(t, 1, decay) for t in t_array]
plt.plot(t_array, integrals)
plt.show()
Feels also a bit off for stability=1
If you want to get something that can be interpreted as average R over time, you need this
def average_f_power_forgetting_curve(t1, t2, s, decay):
    if not t2 > t1:
        raise ValueError("t2 must be greater than t1")
    # Calculate F(t2) - F(t1) where F is the antiderivative
    integral = integral_power_forgetting_curve(t2, s, decay) - integral_power_forgetting_curve(t1, s, decay)
    # Divide it by the difference in time to get the average
    return integral / (t2 - t1)
The integrals themselves cannot be interpreted as average R, and their difference cannot be interpreted as average R, but rather, as area under the curve
You need the difference between integrals divided by the difference between times
If you want the area under the forgetting curve, remove division by (t2 - t1)
This is meaningless, or at least I can't think of any useful interpretation
My idea is to use average R over the next year instead of average R at the end of the simulation for CMRR (again, if a bloke wants to make new CMRR)
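A self-contained version of the sanity check discussed above, for anyone who wants to reproduce the numbers. It assumes decay = -0.2, which is the value that reproduces the pasted outputs (R = 0.331270 at t = 360 for S = 1); the function names are shortened for the example and the brute-force check uses a simple midpoint rule.

```python
import numpy as np

DECAY = -0.2  # assumed; this value reproduces the outputs pasted above


def forgetting_curve(t, s, decay=DECAY):
    factor = 0.9 ** (1 / decay) - 1
    return (1 + factor * t / s) ** decay


def antiderivative(t, s, decay=DECAY):
    """F(t) such that F'(t) = R(t). Valid for -1 < decay < 0."""
    factor = 0.9 ** (1 / decay) - 1
    return (s / (factor * (decay + 1))) * (1 + factor * t / s) ** (decay + 1)


def average_r(t1, t2, s, decay=DECAY):
    return (antiderivative(t2, s, decay) - antiderivative(t1, s, decay)) / (t2 - t1)


def average_r_brute_force(t1, t2, s, decay=DECAY, n=200_000):
    # Midpoint rule: average R over n equally wide slices of [t1, t2].
    edges = np.linspace(t1, t2, n + 1)
    midpoints = (edges[:-1] + edges[1:]) / 2
    return float(np.mean(forgetting_curve(midpoints, s, decay)))


print(average_r(1, 360, s=1))              # about 0.4093
print(average_r_brute_force(1, 360, s=1))  # should agree closely

# F(t) alone includes the constant F(0), which is why it can exceed t.
print(antiderivative(10, 360))                           # about 658.86
print(antiderivative(10, 360) - antiderivative(0, 360))  # area from 0 to 10, below 10
```

This also explains the "658.855147" puzzle: the antiderivative at t = 10 is not the area from 0 to 10, because F(0) is far from zero when S is large.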
stabilities = [i for i in range(2,360)]
print(stabilities)
for t2 in range(10, 100, 10):
    integrals = [average_f_power_forgetting_curve(t1, t2, s, decay) for s in stabilities]
    plt.plot(stabilities, integrals, label=f't2:{t2}')
This is the case where you really should name your axes...axises...you know
Anyway, with default FSRS-5 params I get MRR=84%, which is weird because in Anki I get 87%, but oh well, I ain't gonna look for discrepancies. Maybe default params changed, or maybe the default answer times changed
I'll see what I get if I use sum(avg_R(0, 365)) instead of sum(R). MRR should become higher, I think. Maybe. Actually, idk, we'll see
add t2 = 365 and higher
I thought about maybe trying out what normal optimized parameters but 95% retention would lead to.
Result: no, I'd rather not
I prefer my 1-exp(...)
Could also be more easily customizable : "What's your goal stability ?", it's just a factor to make it reach "1" sooner or later
With R averaged over one year, I get slightly higher MRR (specifically, averaged over delta_t, delta_t+365, where delta_t = today - last_review_date, so it's R over "from today and one year into the future")
0.84 -> 0.86
With R averaged over 5 years, I get
0.84 -> 0.89
#Team90DR Intensify lol
And with that R, you still get plenty points for very low S
imagine with a nice and smooth 1-exp
With R averaged over the next 100 years, I get 0.92
Future FSRS parameter : What is your expected remaining lifespan
Lol
I like the idea that S=365 represent "the max"
let's be real: FSRS doesn't work at really high intervals, saying from experience. at that point, there just isn't enough data.
Do you mean the interval are too big or too low ?
What would be for you the max S
that is relevant
It's interesting to remember that if all your card had a 365d stability, you could maintain 50K words by doing 137 reviews/day at 90% DR
~5min at 2s/review
so clearly, 365d might already be "too high" for realistic "anki endgoal"
If we accept 30min of daily anki to maintain 50K words, a stability of 60d would already be enough
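The arithmetic behind these estimates, as a quick sketch. It assumes a steady state where every card comes back once per interval, that the interval equals stability at 90% DR, and it ignores lapses and new cards, so the numbers are rough lower bounds. Function names are made up for the example.

```python
def reviews_per_day(n_cards: int, interval_days: float) -> float:
    """Steady-state reviews per day if every card comes back every interval_days."""
    return n_cards / interval_days


def interval_for_budget(n_cards: int, minutes_per_day: float, seconds_per_review: float) -> float:
    """Interval (in days) needed to fit all reviews into a daily time budget."""
    reviews_budget = minutes_per_day * 60 / seconds_per_review
    return n_cards / reviews_budget


print(round(reviews_per_day(50_000, 365)))              # 137
print(round(reviews_per_day(50_000, 365) * 2 / 60, 1))  # 4.6 (minutes)
print(round(interval_for_budget(50_000, 30, 2), 1))     # 55.6 (days)
```

So a 30-minute budget at 2s/review corresponds to an interval of about 56 days, which matches the "60d would already be enough" estimate.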
I can't say that tbh. It's just that the retention is usually very different for these cards compared to others.
Ok but at how much interval would you say you start to feel they have "long" interval ?
DR ?
Which loop? The brute force approach will not be used
Brute force is just a sanity check
To make sure the integral math is mathing
yeah I misunderstood
I would say around a year, but depends on content. If it's the general knowledge deck I internalise the cards very quickly and at that point, Anki feels quite unnecessary.
If its JP word it'll take longer.
btw, folks, do we have a roadmap for what's coming next in algorithmic/fsrs improvement?
y'all put hundreds of messages every day in this channel, a mere human can't possibly read all that
- FSRS-6 with a new parameter for same-day reviews and with a flatter curve
- Simulator now takes load balancing and Easy Days into account
- Simulator now simulates same-day reviews way better
- Load balancing is tweaked, so hopefully maybe potentially possibly Sound will finally stop complaining about LB decreasing retention, but I wouldn't bet on that
- Maybe remove CMRR as it's kinda shit according to Jarrett
- Maybe make CMRR+ Mega Ultra Giga Chad Sigma Edition if Luc (A bloke) wants to. Instead of CMRR being separate from the simulator, it will use all of the simulator settings, including sort order, Easy Days, etc.
- This
- Expertium continuing to think he's the boss of everyone doing something
More like I'm the guy whose job is to remind everyone about that one really cool feature that I suggested a year ago and everyone forgot about
lol this is true
im reminded of a character who loves packing but not really: what he actually likes is lolling on the sofa and telling others how to properly pack
literally me
Also, I tried CMRR with FSRS-6 parameters and decay, and yeah, it's just forever 0.7
one day he'll boss himself around
So CMRR will probably be removed because it's lobotomized now
Then again, current CMRR isn't realistic in the first place since it doesn't take into account LB, Easy Days, sort order, real new cards/day limit, real review/day limit, real deck size, real card states, etc.
And I really hope Luc will just use the simulator code with all of its settings for the next-gen CMRR
...or maybe he won't, and then users will forever keep asking "What's the best value of desired retention?" until the end of the universe (or Anki), everyone will be coming up with their own rule of thumb, twenty bloggers will write twenty articles on the best value of desired retention, and then 10 years later somebody will ask "Why not just run the simulator for every allowed value of DR and check the workload?", and I will answer "Because nobody wanted to implement it 10 years ago"
And after that we will be back to asking "What's the best value of desired retention?" until the end of the universe
It was the default value all along
But some otherthinker came up with some "next gen optimizer"
And screw naive people
(I'm looking at you)
"Crunching Numbers" lol
to be fair, aren't we just panic-finding new methods/metrics in order to purposefully increase MRR
and the moment we get a high number we declare victory
kind of
To be fair, I don't think there is a huge huge hurry, mine is blocked to 0.70 since it has been introduced
I made the mistake of changing my DR to that .70, once
Then my effective R was around 50-55%
and it took me 2-3 weeks to recover from that week 
But CMRR was right ! to increase my knowledge, I had to drop DR very low, add a lot of words ...
... And be in a state with a shitton of cards with stability <1d that would all contribute to my marvelous "total knowledge", that is, sum(R)
(I'm exaggerating, since yes, the interval/stability is somewhat accounted for in the workload, so it's not like it was completely ignored, but still)
Problem is that CMRR estimated that with DR set to 70% I would fail 30% of my cards, when in fact I failed 45% of them 🥲
More accurate FSRS-6 + sum(avg_r(delta_t, delta_t+1095)) instead of sum(R) + using the actual deck size and the actual card states could alleviate a lot of that
IMO, the biggest problem with CMRR is not the choice of the function to minimize/maximize, but the fact that the settings have barely anything to do with reality
IMO a small warning : "If you plan to change your DR, please do it incrementally"
Would save many lives
What about ditching forgetting curve, and just train different set of parameters for different DRxD range ?
We might even have more params than Alex doing so
execute the complainers
Or be sneaky and make it act like a PID controller. Don't just immediately schedule based on the user DR, slowly adjust things internally over time based on the difference between DR and true retention.
Yeah I thought about that and I was also thinking it could be nice coupled with the fact FSRS has some recency weight
Could make the recency weight a bit more aggressive for phases where DR change, to let it adapt quicker
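One possible reading of the controller idea, as a toy sketch: a proportional-integral loop that nudges the retention used for scheduling based on the gap between DR and measured retention. This is purely hypothetical; neither Anki nor FSRS does anything like this, and the gains `kp` and `ki` and the clamp range are invented for the example.

```python
class RetentionController:
    """Toy PI controller that nudges the retention used for scheduling.

    Hypothetical: kp and ki are made-up gains, and the 0.70-0.99 clamp is
    just a plausible range for the output.
    """

    def __init__(self, desired_retention: float, kp: float = 0.5, ki: float = 0.1):
        self.desired = desired_retention
        self.kp = kp
        self.ki = ki
        self.integral = 0.0

    def update(self, true_retention: float) -> float:
        """Return the retention to schedule with, given the measured retention."""
        error = self.desired - true_retention
        self.integral += error
        adjusted = self.desired + self.kp * error + self.ki * self.integral
        # Keep the result inside the range a scheduler would accept.
        return min(0.99, max(0.70, adjusted))


controller = RetentionController(desired_retention=0.90)
print(controller.update(true_retention=0.85))  # about 0.93
```

A derivative term is left out on purpose, since daily retention is noisy and differentiating it would mostly amplify that noise.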
@cosmic hedge sorry for frequent pings guys, but I want to ask - do you want to implement next gen CMRR? And by that I mean just use the simulator with all of its settings to make CMRR as realistic as possible.
Currently, CMRR assumes fixed deck size, no learned cards, doesn't take into account sort order, Easy Days, etc. All of that can be fixed by reusing the simulator.
@quasi shadow wants to remove CMRR because with FSRS-6 it outputs 70% too frequently, and also because it's kinda crap overall, and while that's understandable, I think we should instead improve CMRR and make it more realistic by using real deck sizes, real card states, real new and review limits, etc.
Removing CMRR completely would be a net loss of functionality, and since there are obvious ways to make it more realistic, I think we should do that instead
Though, there is also the problem that Alex pointed out - we are in a situation where we want CMRR to output higher numbers, so we will declare any tweaks that make the output bigger good
Perfection, 90%
Look how the daily outcome becomes so much more predictable. And that's even with a 5-day average there.
Without 5-day average, it would give this for Anki Scheduling daily R
Compare to this with Filtered Decks
It's better to remove it than keep it as is. I agree that there may be a good design. But I'm not the one who will implement it.
It is a net gain because it won't cause more confusion.
People will ask "What's the best value of DR?", so it makes sense to have a tool to answer that question
As I said yesterday, the biggest problem with CMRR is the unrealistic settings it uses
It should just use the same settings and the same deck and card info as the simulator, for maximum realism
So fixing (or at least improving) CMRR is just a matter of reusing the simulator config
A calculator which always outputs zero is useless.
Or, it's harmful.
- Using more realistic settings will likely change it
- I also have my idea with using the average R over the next year or two instead of R at the end of the simulation, to bump up the output (#1282005522513530952 message)
Would the next gen CMRR fix it at all?
#1282005522513530952 message does this not help enough btw?
It's not that it doesn't help, it's that it doesn't make sense
Next gen CMRR would certainly be more realistic, though that doesn't automatically guarantee that it won't always output 70%
But it's definitely more realistic than assuming that deck_size = 10*days_to_simulate and an infinite number of new cards that can be learned per day
yeah always thought that was odd but figured it was just magic or something XD
why doesn't it make sense? catch me up pls
It only works if cost is defined as "time per review at R=90%", but it's not
It's adding apples to oranges
ahh I suppose so
So, are you up to the task?
yep
I'll write you a detailed spec later
Btw, why did you close the PR with the "smooth" button?
dae said he didnt want it
yep
its fine its not a huge issue XD
All these settings except for Smooth Graph affect scheduling and are real settings that you can find in deck options. So grouping Smooth Graph - which only affects the plotting - with real settings seems like a bad UI to me.
Users shouldn't have to play "pick the odd one out"
I think the hint would be it affecting the graphs which already exist
if someone did assume the button affected the actual results what would be the problem?
i suppose it belongs in advanced settings because it's a setting that very few people will need to touch.
if someone did assume the button affected the actual results what would be the problem?
That they would look for it in deck options and never find it
feel free to @ me when someone's looking for the "smooth graph" deck option 🤣
Lol, alright
I mean, even if it creates confusion that doesn't last long, it still creates >0 confusion
I don't think people expect 0 confusion when they open the "advanced settings"
Another matter: https://forums.ankiweb.net/t/desired-retention-ui-overhaul/57678/33?u=expertium
Will you add this to your ever-growing list of "suggestions that Expertium pings me about every day?" 🤣
Ok, how about an idea suggested by Brayan: answer buttons that show interval lengths. The interval lengths above answer buttons would change instantly when desired retention is changed. More from Brayan: put the FSRS parameters at the bottom of the FSRS section and add some title to the "query input" (idk what is called the form below...
I had an idea with that but realised it was just your original one flipped sideways XD
According to David, some people don't understand graphs
Like, graphs short-circuit their brains
the horror
And IMO, the idea with answer buttons is just very neat and clear
We show users what they have already seen before - answer buttons with interval lengths
Instead of something completely unfamiliar
Idk how hard it would be to implement
it would just be weights 1-4 run through the forgetting curve right?
Yes, just the first 4 params multiplied by a coefficient that depends on DR. Plus learning steps
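A sketch of that coefficient: the interval is the time at which the forgetting curve drops to the desired retention, so it scales the initial stability by a factor that depends only on DR and decay. The initial stabilities below are illustrative placeholders, not real default parameters, and learning steps, fuzz and rounding to whole days are all ignored.

```python
def next_interval(stability: float, desired_retention: float, decay: float = -0.5) -> float:
    """Interval (days) at which R drops to desired_retention.

    Inverts R(t) = (1 + factor * t / S) ** decay. The default decay of -0.5
    is the FSRS-4.5/FSRS-5 value; FSRS-6 makes decay a trainable parameter.
    """
    factor = 0.9 ** (1 / decay) - 1
    return stability / factor * (desired_retention ** (1 / decay) - 1)


# Illustrative initial stabilities for Again/Hard/Good/Easy, NOT real defaults.
initial_stability = {"Again": 0.4, "Hard": 1.2, "Good": 3.2, "Easy": 15.7}

for dr in (0.80, 0.90):
    intervals = {btn: round(next_interval(s, dr), 1) for btn, s in initial_stability.items()}
    print(dr, intervals)
```

At DR = 90% the interval equals the stability by construction, which makes a handy check that the coefficient is implemented correctly.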
Should we display fuzz, though? That's the issue
Since i'd be the one implementing it apparently and it would be easier not to then no XD
15m for again? So it also considers the learning steps?
I agree this is a good idea
https://discordapp.com/channels/368267295601983490/1282005522513530952/1359853548669636788 He thinks so, but I don't think it should?
Eeeeereh with decay -0.2, the power_forgetting_curve still has a value of 20% around 1000d for a stability of 1d ...
I'm sorry but the integral stuff sound fishy
Writing it in a Word document doesn't make it less fishy
And randomly choosing the avg retention over 5y to compensate for a bad function also feels like you're just denying everything else that you came up with yourself @unique salmon
Yes
No, it should display learning steps
Otherwise users will be like "WHERE ARE MAH LEARNIN' SHTEPSH?!?!?!"
It makes sense though. As I wrote, and as you yourself said many times, we care not only about how much we know at a specific point in time, but also about how slowly that knowledge is forgotten
It definitely makes way more sense than arbitrary f(S)
That's not what I'm arguing, I'm arguing about the integral usage
Well, it makes no sense if the forgetting curve can't be trusted for extreme values, which it can't when I see that after 1000 days, an S=1d card still translates into a 20% probability
So f(S) make more sense if it goes from 0 to 1 in a lapse of time that we can interpret as "acquired"
Btw @polar maple that was also another reason why I was against the new decay
It forbids low R
Anyway, with the new decay we could either scrap CMRR (as Jarrett wants) and leave users forever wondering what is the best value of DR, or we could try to save CMRR somehow (as I want) to give users some answer
Sure but it doesn't have to be "all your way" or "nothing
Anyone is free to suggest changes to CMRR
Also I think decoupling CMRR evaluation function with a f(S) than just reusing the same forgetting curve, would allow to have more like a discriminant that is not poisoned by artifacts from FSRS (the fact that extremely R won't drop below 1% for example)
I'm not sure what you mean
yeah but the user knows their learning steps so its not as helpful
If the user sees the learning steps above the real buttons but not above the fake buttons, don't you think that's confusing?
Like, I immediately think "that's hella confusing"
What I mean is that if the forgetting curve was more realistic (meaning an S=1d card would have its R really, really low after a few weeks), then there would be no chance a card with S=1d would already have a 0.40 score.
I still think FSRS's good predictions are based on the R it was trained on, and not on the quality of its forgetting curve
So using the curve as a way to find the evaluation function of "how good an S is" feels wrong
aww and it doesnt count intra-day reviews changing the stability either
i say just give up and stick a link to the visualiser there
This one?
https://open-spaced-repetition.github.io/anki_fsrs_visualizer/
Please tell me you're joking
Screw it, I think it's fine
Just display what the user would normally see with these parameters and learning steps above his real buttons
As long as what the user sees above the fake buttons is the same as what he sees above the real buttons, we're good
yeah im joking XD
so then if they have more than 1 learning step the only button thats going to show anything fsrs related is easy?
Sadly, yes
Thank Dae for making learning steps a nightmare
And making the entire scheduling system janky
Btw, I'm starting to wonder if maybe the benchmark is fundamentally flawed, in the sense that because there is less data at lower retentions than at higher retentions, FSRS just adapts to higher retentions, and there is no way to make it stop doing that without making the metrics worse
@quasi shadow @polar maple
And if we agree that some change that makes metrics worse is in some sense "better", then...what's the point of the benchmark?
I guess we could solve it by asking Dae to make another 10k dataset, but this time make it have a uniform distribution of retentions by cherry-picking users with all kinds of retentions
Jarrett, I hope you agree that this is a good idea
So that there is more or less the same amount of data for all retentions
Like my uniform dataset with 100 users, but 100 times larger
Otherwise I can't think of any way to prevent overfitting to higher retentions
I was questioning the integrity of the benchmark not so long ago. You told me RMSE correlates heavily with Retention, if I am not mistaken
Though I still do (regarding having FSRS-sec)
Log-loss correlates with retention, RMSE with the number of reviews
Yes that
But that's a different problem, not exactly what I'm talking about above
You want to benchmark the benchmark
What I'm talking about is that regardless of which metrics we use, we will end up overfitting to high retentions because most of the data is in the >50% retention range, with very few users with <50% retention
So a change that makes the forgetting curve less realistic, like the whole "A card will never reach probability of recall of 10%" thing, might look good on paper (uh, on the monitor)
I know. I am just saying the benchmark might not be 100% perfect
But Jarrett surely knows his benchmark
Do you want to change the benchmark for something else
Analogy: think of it as making an artificial city where the number of millionaires is the same as the number of poor people
Use me
Nah, it would be anonymous
Dae would open his secret vault where he keeps user data 🤣
Who would care. Make me an honorary specimen
There is another way, which is a lot more arbitrary and dumb but doesn't require getting a new dataset.
After optimizing FSRS on all 10k users, we calculate the final metric as a weighted average where weights are proportional to retention in a certain way. Specifically, we put all users into 40 categories: retention between 100% and 97.5%, retention between 97.5% and 95%, retention between 95% and 92.5%, etc.
Then we count how many users fall into each category. Then, when calculating the final log-loss and RMSE over the entire dataset, the user is weighted inversely proportional to the number of users in his "retention class".
What this means is that if someone has a retention of 90%, his weight will be lower because that's common. If someone has a retention of 10%, his weight will be huge because that's uncommon. So we assign more weight to people with uncommon values of retention and less weight to people with common retentions.
This doesn't even require re-running the algorithms in the benchmark, just re-calculating the final average across all collections
So we could do it like...today
@polar maple it's inspired by the approach of giving more weight to rare classes in classification problems on imbalanced datasets
Except that we're just making the classes up 🤣
I'm fully expecting that with this change FSRS-6 will look worse than FSRS-5
Because FSRS-6 has a curve that doesn't fit people at low retentions
Getting an actual uniform dataset would be a lot better, though
I will evaluate FSRS-6 in this way tomorrow.
The only thing I need to do is calculate the retention and save it in the result alongside the metrics, right?
Then we can compare algorithms in each retention level.
- Split users based on their retention (exclude same-day reviews and the first review) into sufficiently many groups, like 20 or 40
- Calculate 1/n(users in the group) for each group
- Calculate weighted average log-loss, RMSE(bins) and AUC across all 10k users, where each user has a weight of 1/n(users in the group), depending on which group he belongs to
Example: suppose there are 1000 users in the 90%-92.5% group. So if a user's retention is 91%, his weight is 1/1000
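The three steps above can be sketched in a few lines. This is only a toy illustration, not the benchmark's code: the helper names and the four made-up users are assumptions, and the bin width (2.5%, i.e. 40 bins) is taken from the description earlier in the thread.

```python
from collections import Counter


def retention_bin(retention, n_bins=40):
    """Map a retention in [0, 1] to one of n_bins equal-width bins."""
    return min(int(retention * n_bins), n_bins - 1)


def inverse_frequency_weights(retentions, n_bins=40):
    """Weight each user by 1 / (number of users in the same retention bin)."""
    bins = [retention_bin(r, n_bins) for r in retentions]
    counts = Counter(bins)
    return [1.0 / counts[b] for b in bins]


def weighted_mean(values, weights):
    """Weighted average of per-user metric values."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical users: (retention, log-loss). Three share the 90%-92.5% bin,
# one sits alone at low retention.
users = [(0.91, 0.30), (0.905, 0.32), (0.92, 0.28), (0.15, 0.60)]
retentions = [r for r, _ in users]
losses = [loss for _, loss in users]

weights = inverse_frequency_weights(retentions)
plain = sum(losses) / len(losses)
reweighted = weighted_mean(losses, weights)
print(f"unweighted: {plain:.3f}, inverse-frequency weighted: {reweighted:.3f}")
```

In this toy case the lone low-retention user counts as much as the three common users combined, which is the intended effect, and also why a bin with n=1 can dominate the result.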
Is there a way to (correctly) guess how long it takes for true retention to move closer to desired retention?
My case: I have used anything between 72%-85% desired retention (mostly on the lower end) for at least 6 months (with change all cards on schedule), but like last week I have turned it up to 90% desired retention (without change all cards on schedule).
not really a problem just curious
It can be as quick as "Reschedule all your cards and do your backlog right now" or as long as "You'll have to gradually review cards scheduled with your old DR and they will be rescheduled with the new one"
A compromise is with FSRS Helper Addon, only reschedule the further away cards
I got 1000 cards 🥴 "due" if I take the quick route haha. But I think I can "endure" true retention < desired retention for a while
You can always do the compromise if you want to speed things up a bit
You can do a tiny batch per day
taking only the ones in the far, far future
That will mess with me mentally I think haha
But I think I have figured it out.
As you said, it will take as long as the amount of days that most of my cards have with old DR. I think looking at the review intervals tab on "stats" might help.
If I look at where the cumulative 50% & 80% (randomly taken) is, it might give me some sort of idea.
The running total hits 50% at a 56-60 day review interval and 80% at a 152-155 day review interval. So I would guess it's somewhere around 120 days, because during this time I am also adding new cards etc
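The same estimate can be read off an interval histogram programmatically. A rough sketch, assuming a hypothetical histogram (interval in days mapped to card count) that is not exported from Anki; the function name is made up for illustration.

```python
def interval_at_cumulative_share(interval_counts, share):
    """Smallest interval (days) at which the running total of cards reaches `share`."""
    total = sum(interval_counts.values())
    if total == 0:
        raise ValueError("histogram is empty")
    running = 0
    for interval in sorted(interval_counts):
        running += interval_counts[interval]
        if running / total >= share:
            return interval
    return max(interval_counts)


# Hypothetical histogram: interval in days -> number of cards.
histogram = {10: 200, 30: 200, 58: 150, 100: 150, 153: 150, 300: 150}

half = interval_at_cumulative_share(histogram, 0.5)
most = interval_at_cumulative_share(histogram, 0.8)
print(f"50% of cards have intervals <= {half} days, 80% <= {most} days")
```

Cards with intervals up to that value will have come due, and been rescheduled with the new desired retention, after roughly that many days.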
yeah a problem is that in the data that we have, we cannot distinguish users who study purely using anki and users who just happen to have some of their knowledge in anki and end up studying elsewhere like with language immersion or school. So, it could be true that the forgetting curve for the average user just looks something like that, never going to 0
at least for evaluation theres still a decent amount of data at low DR, e.g. this calibration chart for the first 500 users
I mean like, if most users have retention around 90%, it means that if we try to find decay that provides the best metrics, it will be whatever decay is best for 90% retention
So I think we REALLY should ask Dae to make a uniform dataset
Or do the thing I described above, which is worse
also was just showing that RWKV's flatter forgetting curves still achieve good calibration at low R, so it's not necessarily a big issue to have flat forgetting curves
I'd like to see the calibration graph for FSRS-5 and 6
Just change decay to -0.2 and run it
There is a new parameter for same-day reviews, but screw it
K
Oh, and I mean "run the optimizer", not "run it with the same parameters as for the old curve"
So you'll have to optimize parameters for the new curve
Also, this looks like a mess. Didn't we remove some of the curves?
ok
was this done with the 5-way split on the 10k dataset? if so, does it improve test performance?
yep
Maybe we should do that instead of the fixed decay then?
The problem is estimation of S0. You need to know decay in advance to accurately estimate S0. Maybe do what you did, and then do a second optimization with fixed decay that was found during the first optimization?
It's 2x slower, but should work better
how much better is it? idk which file to compare it to
The more I think about it, the more I think that's the best course of action
If we make decay trainable, that alleviates the problem that different values of decay are better at different retentions, which is what we have been arguing about all the time. And it should be more accurate than any fixed value of decay. The problem is S0. Actually, even other parameters may (and likely will) still have different values depending on the choice of decay. I don't think FSRS params are "decay-agnostic", though I don't have a solid proof of that.
So the solution is to run optimization twice: once with variable decay, to find which value of decay is good, and the second time with fixed decay from the first run, to fine-tune the parameters
We could only run it once if parameters are "decay-agnostic", but again, I doubt that they are
By "decay-agnostic" I mean "parameters will converge to the same values regardless of the choice of decay"
@unique salmon same 500 users
Well, color me green and call me a pickle
It actually looks good below 50%
Somehow
not too surprising given the RWKV curves
but where are my confidence intervals?
i updated fsrs-optimizer
i thought you added confidence intervals or something
I thought so too
the update must've failed or something, you mentioned you removed some lines but i think i can still see them all
@unique salmon the swap at p=0.45 is interesting
i think ill get another 500 users to see if this repeats
https://github.com/open-spaced-repetition/fsrs-optimizer/pull/169#issuecomment-2794715383
Mind voicing your thoughts? Or giving me a thumbs up, that works too 🤣
I don't see any problems with my idea, aside from making optimization two times slower
is S0 even a big issue anymore? isn't S0 already a learnable parameter after the initial estimation? so it doesn't really matter if we initially estimate it with a different decay, as long as the optimization process will still move S0 to a good value
The problem is that if we change decay, we have to re-estimate parameters
i don't see why we need to fix decay for a second optimization and why this would necessarily benefit over just a joint optimization of all parameters at once
I mean, I guess we could just double the number of epochs?
you could do that, idk, or just leave it as-is
i did check that increasing epochs does improve performance a bit but i think this is already a tradeoff that jarrett has decided on
@unique salmon on users 501-1000, looks like theres an actual pattern
Model: FSRS-5
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3273±0.1525
FSRS-5 RMSE(bins) (mean±std): 0.0518±0.0332
Model: FSRS-5-recency
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-recency LogLoss (mean±std): 0.3256±0.1519
FSRS-5-recency RMSE(bins) (mean±std): 0.0493±0.0321
Model: FSRS-5-dev (optimizable decay)
Total number of users: 9999
Total number of reviews: 349923850
Weighted average by reviews:
FSRS-5-dev LogLoss (mean±std): 0.3220±0.1488
FSRS-5-dev RMSE(bins) (mean±std): 0.0466±0.0290
If it's part of the cost function, can't it be optimized at the same time ?
i think FSRS-5 and FSRS-5-recency here uses decay=0.5, what i want is a comparison between optimizable decay and decay=0.2
Well, can't do that
just for completeness here is the combined chart for users 1-1000
I mean, I can, but it will take a ton of time
What do you mean?
yeah, but jarrett probably has the results somewhere since he did benchmarking for FSRS-6
and i cant be sure that optimizable decay is with all other parameters equal with FSRS-6
so doesn't hurt to just ask directly
Optimizable is pretty much guaranteed to be better than fixed
unless it overfits too much
This is all Jarrett posted
About the decay needing to be trained before the other params: if the decay is part of the forgetting curve, can't it be optimized at the same time?
With gradient descent, by taking the derivative of the forgetting curve with respect to the decay
Uh, it's complicated.
It needs to be fixed for the first 4 parameters, since they are estimated separately. For other parameters, as I said, they likely depend on the value of decay, so if you change decay, optimal parameters will no longer be optimal
BUT
In FSRS-5 the first 4 parameters are also optimized via gradient descent after they are estimated initially. So now the only problem is that parameters that are optimal at one value of decay are not optimal at the other. But running the optimizer for more epochs will likely solve it. After each epoch the change of the decay parameter will be smaller and smaller
If you want to ask "If we can optimize decay, why are we even bothering with fixed decay?" - I have no idea 🤣
Jarrett just decided to use fixed decay for...reasons that I don't know
Sure, if the decay gets optimized it might/will change the value of those 4, but if I remember correctly the few lessons I did with gradient descent, you take the derivative of the cost function with respect to every parameter, and you "glide down the slope" in all those dimensions until you reach a minimum
so you would glide that bias and those 4 parameters, leading you to the point they would balance themselves out ?
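Here is a toy version of that "glide everything at once" idea: stability and decay are fitted jointly on synthetic data. Several things are assumptions rather than anything confirmed in the chat: the curve is written as R = (1 + factor * t / S) ^ decay with factor chosen so that R(S) = 0.9, gradients are finite differences instead of autograd, the loss is a plain squared error, and decay is clamped to the (0.1, 0.7) magnitude range suggested here. It is not how fsrs-optimizer does it.

```python
import math


def recall(t, stability, decay):
    """Power forgetting curve (decay < 0), scaled so recall(stability) == 0.9."""
    factor = 0.9 ** (1.0 / decay) - 1.0
    return (1.0 + factor * t / stability) ** decay


def loss(params, data):
    """Mean squared error between predicted and target recall."""
    log_s, decay = params
    s = math.exp(log_s)  # optimize log(S) so that S stays positive
    return sum((recall(t, s, decay) - r) ** 2 for t, r in data) / len(data)


def numeric_grad(params, data, eps=1e-6):
    """Central finite-difference gradient with respect to every parameter."""
    grad = []
    for i in range(len(params)):
        hi = list(params)
        lo = list(params)
        hi[i] += eps
        lo[i] -= eps
        grad.append((loss(hi, data) - loss(lo, data)) / (2 * eps))
    return grad


def clamp_decay(decay):
    """Keep decay inside [-0.7, -0.1]."""
    return min(-0.1, max(-0.7, decay))


def fit(data, params, steps=500, lr=1.0):
    """Joint gradient descent; a step is kept only if it lowers the loss."""
    current = loss(params, data)
    for _ in range(steps):
        grad = numeric_grad(params, data)
        candidate = [params[0] - lr * grad[0],
                     clamp_decay(params[1] - lr * grad[1])]
        candidate_loss = loss(candidate, data)
        if candidate_loss < current:
            params, current = candidate, candidate_loss
        else:
            lr *= 0.5  # step was too large, shrink it
    return params, current


# Synthetic targets generated with S=10 days and decay=-0.2.
data = [(t, recall(t, 10.0, -0.2)) for t in (1, 3, 10, 30, 100, 300)]
start = [math.log(5.0), -0.5]  # deliberately wrong starting point
initial_loss = loss(start, data)
fitted, final_loss = fit(data, start)
print(f"S={math.exp(fitted[0]):.2f}, decay={fitted[1]:.3f}, "
      f"loss {initial_loss:.5f} -> {final_loss:.8f}")
```

Both parameters move in the same loop, so a change in decay is compensated by stability on the next step, which is the "balance themselves out" behaviour being asked about.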
And I somehow forgot about his experiments with optimizable decay
And now I'm like "Wait, why are we doing fixed decay again?"
Idk, maybe we all collectively had a brain fart
Pretty much
But maybe there are reasons I don't see why it is better fixed
I don't either
It improves metrics and is more adaptive than fixed (well, duh, obviously)
See #1282005522513530952 message
It doesn't, or at least not enough to become a problem
It wouldn't improve log-loss and RMSE if it overfitted a lot
Though I still think we should choose a reasonable range for it, not just (0.01, 1)
According to this graph, I'd say 0.1-0.7 is reasonable
Green is just me trying to eyeball the best fit
but we cannot compare learnable decay to decay = 0.5 when decay = 0.2 does way better on the metrics already
Why did this card go from 48% to 100% after the manual reschedule? It's a card that I got right, so I feel like the difficulty actually is closer to the beginning 50%, so I just reset the card as a knee-jerk reaction and rated it again.
jarrett already posted a distribution of learnt decay values, isn't that one not bounded to (0.01, 2)?
Thing that always bugs me a bit is, sure we get a good decay that should fit most people with this, but just like default params won't be ideal for an individual, I guess decay should also behave the same way isn't it ?
Yeah, my bad, it's (0.01, 1) in the code. I think we should change it to (0.1, 0.7)
Have you optimized parameters in between these two reviews?
Though, 100% D with no lapses is strange either way
Ah. Yeah, we're just waiting for Jarrett to post full benchmark results
Are you saying that optimizable decay is better than fixed? If so, then yes, I think so too. Though we'll have to see benchmark results to be sure
I had, but I've not been using the "reschedule cards on change" option in the deck options for that card's deck (that automagically puts those there I think), instead using FSRS helper to do so and then catching up on lapses. Is there a way to look for other cards that were rescheduled that day with that manual reschedule to see if they've also had that happen to them?
You can share your parameters, and also screenshot the graph of "Card Difficulty" in the stats view
But my gut feeling is that you're very new to Anki and I guess the optimizer didn't really bother to differentiate difficulty much, it just put everything in one big basket
how do you get to the card difficulty graph of a single card? I know how to look at it for decks, but didn't know it was possible to do so for a single card.
for the deck I mean
here the info is pretty simple, the card never failed, was ~50% D before, it's 100% now
but then I would expect all your cards to be at 100%
FSRS5 parameters with DR@92%:
0.9842, 8.0109, 41.2131, 100.0000, 7.3324, 0.5695, 1.7045, 0.0010, 1.3330, 0.3374, 0.8130, 1.9629, 0.1152, 0.3734, 2.2973, 0.1129, 3.0047, 0.4220, 0.7896
with 747 cards in the deck
Hmm
yeah, i don't get it
Yeah no having 100% D doesn't make sense for that card
I think it has to do with some shenanigans I did described in this thread
https://discord.com/channels/368267295601983490/1350593441654116463
Did you try to reschedule it with the FSRS plugin ?
yeah, it didn't budge the difficulty, interval, or due date when trying to reschedule with the right-click context menu or the reschedule all cards option
With your param, the 27d interval / 48% D seems to be the correct values
You can't right click -> Forget -> set Due Date 0 and re-review it ?
which is what made me super curious about why it jumped to 100% difficulty. you don't know of a way to specifically search for manually rescheduled cards on that day to see if it happened to some others, do you?
hmmmmm
no but I would maybe just search for all cards with high D with no review failed
-rated:180:1 prop:d>0.99
something like that
this was my quick solution, yeah. But then i got worried that there's 50 other cards with a higher than intended difficulty because of whatever caused this to be 100% difficulty
ah, not the only card
did you try the right click -> recompute memory state ?
or did you just do the deck -> reschedule
that's the same as "update memory state and reschedule" right?
I'm not entirely sure
yeah, no dice.
I know in the past when some bugs happened, the memory state had to be refreshed
Out of curiosity, try changing the last digit of any parameter, like from 1.2345 to 1.2346, just to recalculate memory states, and check that card again
I'm like 99% sure it's because these cards had a review history before 3/15, then got reset through the Cards->Reset function in the card browser as the cards i found that have this issue are like that, lol.
i'm not sure if this'll change anything because i've been optimizing parameters at least once a month since then, but lemme try it
also no dice :c
https://forums.ankiweb.net/c/anki/fsrs/19
Make an issue on the forum
I did use the "reschedule on change" option, lemme try something else.
Resetting repetitions/lapses could maybe help, but not sure
really sounds like a very tricky/specific issue
should i export the deck and attach it to the post? What other stuff should i put in there, other than some screenshots, the parameters, maybe a link to the shenanigans I did that day
oh LOL it won't let me include links in my post
Screenshots, parameters (in text) and a short description
Also, are the intervals on those cards becoming lower even while passing reviews...supposed to happen? I just noticed after making the post, lol.
If you have changed parameters in between reviews, yes, it could happen
@polar maple I'm benchmarking decay=-0.5 vs decay=-0.2 vs optimizable decay within the (0.1, 0.7) range, and the optimizable one is like baaarely better. There is a clear difference between decay=-0.5 vs decay=-0.2, but not much difference between decay=-0.2 vs opt. decay.
Model: FSRS-6 (opt decay)
Total number of users: 102
Total number of reviews: 2123285
Weighted average by reviews:
FSRS-6 LogLoss (mean±std): 0.3731±0.1780
FSRS-6 RMSE(bins) (mean±std): 0.0665±0.0355
Model: FSRS-5 (decay=-0.2)
Total number of users: 102
Total number of reviews: 2123285
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3733±0.1780
FSRS-5 RMSE(bins) (mean±std): 0.0666±0.0355
Model: FSRS-5 (decay=-0.5)
Total number of users: 102
Total number of reviews: 2123285
Weighted average by reviews:
FSRS-5 d=-0.5 LogLoss (mean±std): 0.3780±0.1802
FSRS-5 d=-0.5 RMSE(bins) (mean±std): 0.0691±0.0346
I've only done 100 users so far, so I will report back tomorrow. But weirdly enough, it seems like the fixed one is just too good for some reason. Then again, maybe the optimizable one needs more epochs.
i wonder if optimizable decay just needs some heavy regularization to reduce overfitting
it could be worth having a modified version of the script that also saves the training loss for each user to see if opt decay fits the training data much better than decay=0.2 or not
while true, i really think something is weird with the cards and how the math is mathing on them because using the parameters from 2025-03-14 that I have, a 333 should have an interval of 36 days, while using my most current set of parameters from 2025-04-09 says they should be 62 days, and not the 8 days as shown in the first card in this message here:
#1282005522513530952 message
I kinda actually wanna see about setting my PC in the future to see how the next rating affects the difficulties and intervals. let's time travel~
but first: BACKUP TIME
edit: i was wrong, after two reviews into May, the intervals started decreasing. so it really just is that the cards now have a difficulty of 100% lol, I think it'd be faster if I were to just reset them and go from there
oh, I found all the cards with resched:27 -resched<27 lol, and a lot of them are exhibiting this behavior. Is there a way to remove the manual review/scheduling data that gives it a 1 to see if that's messing with it? thanks
I forgot to release it.
nice, i think it needs more samples but i dont think there is an obvious trend
The retention data has been added into https://github.com/open-spaced-repetition/Anki-button-usage/blob/main/button_usage.jsonl
is this with the 1/n scaling that expertium described? i guess the outlier affects it too much, the n=1 bucket has a huge effect on the outcome
yep
Btw, the optimizable decay reduces RMSE(bins) ~5% compared with decay = -0.2
And it also performs well in low R region.
looks promising!
Nice ! Also, if I'm not wrong, since the fixed decay was computed on the same training set as the optimizable one, on those numbers it's normal there is not much difference, the big benefit would be for users not matching the training set right ?
If I take those numbers for example, it's normal the optimized doesn't outperform the fixed on the same data the fixed was optimized, since the fixed is an optimized one in the first place (but that won't change anymore).
So to evaluate it, you should probably compare a fixed decay trained on the 10K user but applied to different user and see how much performance decrease, vs the score with their own optimized decay
FSRS-6-dev is FSRS-6 with optimizable decay.
Fine. More work for me to implement it in Rust.
if we only have this 10k dataset then what you want is something like, we find a fixed decay by only looking at the first 5k users, then evaluate this fixed decay on the remaining 5k users and also try learnable decay on the remaining 5k users as well? yeah it's true that we may be overfitting the dataset with certain hyperparameter and algorithm choices, but i think this is swept under the rug as being arguably not influential
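The split described above could look roughly like this. Everything here is a stand-in: `toy_loss` is a hypothetical placeholder for "optimize FSRS for this user with this decay and return the test loss", and each toy user is represented only by a made-up ideal decay value.

```python
def mean(xs):
    return sum(xs) / len(xs)


def best_fixed_decay(users, loss_fn, grid):
    """Decay value from `grid` with the lowest average loss over `users`."""
    return min(grid, key=lambda d: mean([loss_fn(u, d) for u in users]))


def compare_fixed_vs_learnable(users, loss_fn, grid):
    """Pick a fixed decay on the first half, then score both options on the second half."""
    half = len(users) // 2
    train_users, test_users = users[:half], users[half:]
    fixed = best_fixed_decay(train_users, loss_fn, grid)
    fixed_loss = mean([loss_fn(u, fixed) for u in test_users])
    # Caveat: picking the best decay per user on the same data it is scored on
    # is optimistic; a real run would tune on each user's earlier reviews only.
    learnable_loss = mean([min(loss_fn(u, d) for d in grid) for u in test_users])
    return fixed, fixed_loss, learnable_loss


def toy_loss(user_ideal_decay, decay):
    """Hypothetical loss: squared distance from the user's ideal decay."""
    return (user_ideal_decay - decay) ** 2


users = [0.15, 0.2, 0.25, 0.2, 0.1, 0.3, 0.5, 0.2]  # made-up ideal decays
grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]

fixed, fixed_loss, learnable_loss = compare_fixed_vs_learnable(users, toy_loss, grid)
print(f"fixed decay {fixed}: {fixed_loss:.4f}, per-user decay: {learnable_loss:.4f}")
```

The point of the split is that the fixed value never sees the users it is scored on, so the comparison no longer favours it for having been tuned on the same 10k users.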
Sorry 😦 But yeah, having training/test set is even another step @polar maple, but here I think it's even more unfair for the "Optimized vs Fixed" decay, since here Fixed=Optimized(10k)... So optimizing the decay on the same set will of course just get you the same results
But I agree ideally even the optimizer for 1 user should be done on a training set, and then evaluated on a test set
An optimizable decay should still perform better than fixed decay because it can adapt to each user individually rather than only being good on average
But is it how it's evaluated in Jarrett's graph ? I mean, isn't it like a "I optimize it once, and then I evaluate the full set on it ?"
Speaking of which
Model: FSRS-6
Total number of users: 774
Total number of reviews: 26313110
Weighted average by reviews:
FSRS-6 LogLoss (mean±std): 0.3383±0.1601
FSRS-6 RMSE(bins) (mean±std): 0.0511±0.0342
Model: FSRS-5
Total number of users: 774
Total number of reviews: 26313110
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3384±0.1599
FSRS-5 RMSE(bins) (mean±std): 0.0512±0.0341
According to my tests opt. decay is pretty much the same as decay=-0.2 🤔
So in your test, you optimized different decay for every users ?
Or maybe I just misunderstood what you said
For every user individually, yes
Let's combine the 5 leftmost bins into one, so that it has a larger sample size
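Merging bins is just a count-weighted average. A small sketch, assuming each bin is stored as a (count, mean value) pair ordered from lowest retention upward; the numbers are invented.

```python
def merge_leftmost(bins, k=5):
    """Merge the first k (count, mean) bins into one count-weighted bin."""
    head, tail = bins[:k], bins[k:]
    total = sum(count for count, _ in head)
    if total == 0:
        return [(0, 0.0)] + tail
    merged_mean = sum(count * value for count, value in head) / total
    return [(total, merged_mean)] + tail


# Hypothetical bins: (number of users, mean metric), lowest retention first.
bins = [(1, 0.9), (2, 0.6), (3, 0.5), (4, 0.45), (10, 0.40), (500, 0.35)]
merged = merge_leftmost(bins, k=5)
print(merged)
```

Weighting by count keeps a single outlier user in a near-empty bin from setting the value of the merged bin.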
@quasi shadow show me your code
Idk how you get better results
Oh wait, maybe my implementation is bugged
How do I do this properly? 😭
class FSRS(nn.Module)
How do I properly do this part?
In other.py
I guess it doesn't matter since you will benchmark opt. decay on your own anyway, but still
-self.model.w[19] doesn't work, 'FSRS5' object has no attribute 'model'
The thing is that optimization goes fine and different users have different decay, but then the results are pretty much identical to decay=-0.2. So I'm guessing that I screwed up outside of the optimization
I cannot review your code if you don't use GitHub and git (
Like this?
Yes
Interesting. So FSRS-6 performs better across all retentions, except for super low ones
The graph is pretty awkward though
There are 3 graphs, but xticks are on only one. And without a horizontal line it's kinda hard to tell where 0 difference is
Having a line like this would be good, but the line is unrelated to the curve, it's related to the differences
Ah man, this is awkward
@bold terrace ok, so this stuff is kinda hard to read, so TLDR: FSRS-6 with its "A card with S=1 day can never reach R=10%" super-duper-flat curve...somehow performs better than FSRS-5 even for people with retentions like 40% and 50%
It performs better almost universally, except for people with retention around 20%
So...flat curves are just better
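To put a number on "never reach R=10%": assuming the curve has the form R = (1 + factor * t / S) ^ decay with factor = 0.9^(1/decay) - 1 (so that R = 90% at t = S), it can be inverted to get the time needed to reach a given R. The formula is an assumption about how the curve is defined, not something stated in this chat, and decay=-0.2 is just the fixed value discussed above.

```python
def recall(t, stability, decay):
    """Power forgetting curve (decay < 0), scaled so recall(stability) == 0.9."""
    factor = 0.9 ** (1.0 / decay) - 1.0
    return (1.0 + factor * t / stability) ** decay


def days_until(target_r, stability, decay):
    """Invert the curve: elapsed days at which recall drops to target_r."""
    factor = 0.9 ** (1.0 / decay) - 1.0
    return stability * (target_r ** (1.0 / decay) - 1.0) / factor


for decay in (-0.5, -0.2):
    days = days_until(0.10, 1.0, decay)
    print(f"decay={decay}: S=1 day reaches R=10% after about {days:,.0f} days")
```

Under these assumptions decay=-0.5 gets a card with S=1 day down to 10% in a bit over a year, while decay=-0.2 needs on the order of 144,000 days, close to 400 years, which is "never" for practical purposes.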
Interesting
Probably some external influence on how people use Anki
What if this is FSRS's way of compensating for external reviews?
That's the only explanation that I see
Yes that was also what I was thinking. Probably things never drop under 20% because the people still have a "baseline" knowledge coming from outside anki
Yep
and people with lower R might be people not really doing anything else outside anki
It would be interesting (but difficult I guess ?) to do some clustering on users, reviews .... to try to see if we can't also "profile" things
For example, by splitting my deck into "Normal D" and "High D", I discovered the default FSRS parameters are actually quite good for my Normal D !
And only my High D collection has a worse logloss/RMSE if not optimized