#FSRS Megathread
Also, how do you calculate RMSE? It's super high
80% RMSE can't possibly be correct
I would if I could find the code that calculates RMSE in your repo, but I can't
```python
df["loss"] = df.apply(lambda df: (df["y"] - df["r"]) ** 2, axis=1)
```
Ah, I see. You're calculating it in a very different way, the "normal" way
I suggest you do this instead:
- https://github.com/open-spaced-repetition/srs-benchmark/blob/main/result/FSRS-6-recency.jsonl
- Fetch RMSE(bins) based on user id
- Use that instead
square the error, find the mean, take the root. that's all i figured it was
```python
loss = df_filtered["loss"].mean() ** 0.5
```
i find the mean here
how would i get the fatigue values from that?
Well, that's how normal people do it 🤣
https://github.com/open-spaced-repetition/fsrs4anki/wiki/The-Metric
You don't. I'm just telling you where to get RMSE
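For reference, a minimal sketch of why the two numbers differ so much, assuming `y` is the binary review outcome and `r` is the predicted recall probability. `rmse_binned` is a hypothetical, simplified stand-in that bins by predicted R only; it is not the benchmark's actual RMSE(bins), which uses its own binning scheme (see the wiki link above). Per-review RMSE stays large even for a perfectly calibrated model because every outcome is 0 or 1, while a binned metric compares average prediction with average outcome.

```python
import numpy as np


def rmse_per_review(y, r):
    """The "normal" RMSE: root of the mean squared error over single reviews."""
    y = np.asarray(y, dtype=float)
    r = np.asarray(r, dtype=float)
    return float(np.sqrt(np.mean((y - r) ** 2)))


def rmse_binned(y, r, n_bins=20):
    """Simplified calibration-style RMSE (hypothetical helper, bins by predicted R)."""
    y = np.asarray(y, dtype=float)
    r = np.asarray(r, dtype=float)
    bins = np.minimum((r * n_bins).astype(int), n_bins - 1)
    weighted_sq_err = 0.0
    for b in np.unique(bins):
        mask = bins == b
        # Compare the mean prediction with the mean outcome inside the bin
        weighted_sq_err += mask.sum() * (y[mask].mean() - r[mask].mean()) ** 2
    return float(np.sqrt(weighted_sq_err / len(y)))


# Synthetic, perfectly calibrated predictions (illustrative data only)
rng = np.random.default_rng(0)
r = rng.uniform(0.7, 0.99, size=50_000)
y = (rng.random(r.size) < r).astype(float)

per_review = rmse_per_review(y, r)
binned = rmse_binned(y, r)
print(f"per-review RMSE: {per_review:.3f}, binned RMSE: {binned:.3f}")
```

On this synthetic data the per-review value lands far above the binned one, which is the kind of gap being discussed here.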
I might look at it later.
i did it the "simple" way in SSE as well btw so idk where the discrepancy from that might emerge
Well, then you definitely should remove it from the add-on
"bad graph" is good enough for me
no ones ever gonna see that XD
well if they do they've been warned
I'll do this before i do any weird fatigue stuff if i even do any weird fatigue stuff
Btw, please show the code so that I can verify that the integral is calculated as intended
For example, in Python it looks kinda like this
```python
sum_of_avg_r_over_a_year[today] = average_f_power_forgetting_curve(
    card_table[col["delta_t"]],
    card_table[col["delta_t"]] + 365,
    card_table[col["stability"]],
    DECAY,
).sum()
```
https://github.com/open-spaced-repetition/fsrs-rs/blob/092c20bac7d9239a991ae5b561556ad34c706c16/src/optimal_retention.rs#L577
https://github.com/open-spaced-repetition/fsrs-rs/blob/092c20bac7d9239a991ae5b561556ad34c706c16/src/optimal_retention.rs#L26
You can change these formulas where "cards" is an array of cards at the end of the simulation
if you pseudocode it or chatgpt it it might save me a job
as in save me the entire job
Oh lord, Rust...
```rust
for i in 0..delta_t {
    memorized_cnt_per_day[last_date_index + i] +=
        power_forgetting_curve(w, (pre_sim_days + i) as f32, last_stability);
}
} // closes the enclosing block (not shown)
```
This is the part that needs to be changed...I think.
Instead of using "instant" R from the forgetting curve, we need to use average R over some period of time, aka the integral thingy.
I'm guessing `pre_sim_days` is delta_t?
nope you need to change it here
https://github.com/open-spaced-repetition/fsrs-rs/blob/092c20bac7d9239a991ae5b561556ad34c706c16/src/optimal_retention.rs#L577
```python
import numpy as np


def average_f_power_forgetting_curve(t1, t2, s, decay):
    def integral_power_forgetting_curve(t, s, decay):
        factor = 0.9 ** (1 / decay) - 1
        return (s / (factor * (decay + 1))) * np.power((1 + factor * t / s), (decay + 1))

    # Calculate F(t2) - F(t1) where F is the antiderivative
    integral = integral_power_forgetting_curve(t2, s, decay) - integral_power_forgetting_curve(t1, s, decay)
    # Divide it by the difference in time to get the average
    return integral / (t2 - t1)
```
Port that to Rust (and add an assertion that t2 > t1).
Then do this:
```rust
for i in 0..delta_t {
    memorized_cnt_per_day[last_date_index + i] +=
        average_f_power_forgetting_curve(
            w,
            (pre_sim_days + i) as f32,
            (pre_sim_days + i + time_offset) as f32,
            last_stability,
        );
}
} // closes the enclosing block (not shown)
```
`time_offset` is 1/2/3/5/10/50 years, except in days
you can calculate R using the stabilities of the cards
But that calls simulate, so I'm changing simulate
you don't need to change simulate it gives you a list of the states of the cards at the end
you're going for retention in the future from the end of the simulation right?
Yep
In cards?
yep
also we already have the forgetting curve
https://github.com/open-spaced-repetition/fsrs-rs/blob/092c20bac7d9239a991ae5b561556ad34c706c16/src/optimal_retention.rs#L234-L238
Yeah, but we need the integral
ah that's true
https://github.com/ankitects/anki/issues/3926#issuecomment-2833639346
Btw, Dae approved "health check", but said that if it has a diagram, it should be embedded in deck options itself, rather than in a pop up
Dae, despite what the screenshot above shows, I think we should disregard that poll and remove "Evaluate" anyway. David agrees, btw. "Evaluate" gives the user a bunch of numbers...
maybe do it with that function as a placeholder just until you get the maths worked out
The math is worked out, just not ported to Rust
well then the placeholder will be easy to implement
any initial ideas for where we get the health check values from?
#1282005522513530952 message any thoughts on "loss vs trend line"
Idk if i like it but it might be ok XD
From the 10k benchmark, of course. Just have to bother Jarrett to implement the train set = test set version
#1282005522513530952 message I'm assuming we would have to solve this problem
We'll use RMSE, I'll make a sample size correction
hmm a problem with using average retention is that user A might have a bunch of cards at 25% and another bunch at 75% so it averages to 50%, and user B just has 50%, but user B's log loss will be expected to be higher just because that's the way it is
any ideas?
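A small worked example of the problem described above, assuming a perfectly calibrated model (so the expected log loss of a prediction `p` is just its binary entropy). The two users are the hypothetical ones from the message: same average retention, different expected log loss.

```python
import math


def expected_log_loss(p):
    """Expected log loss of a perfectly calibrated prediction p (binary entropy, in nats)."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


# User A: half the cards at 25% retention, half at 75% (average 50%)
user_a = 0.5 * expected_log_loss(0.25) + 0.5 * expected_log_loss(0.75)
# User B: all cards at 50% retention
user_b = expected_log_loss(0.5)

print(f"user A: {user_a:.4f}, user B: {user_b:.4f}")
# prints: user A: 0.5623, user B: 0.6931
```

So a correction based only on average retention would treat both users the same even though user B's log loss is expected to be higher.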
let me check if RMSE has this problem
i think a health check should be to check the difference between train/test scores based on something like the 5-way split
so if FSRS trained on the training set does not generalize well to the test set then this would be a problem that we can indicate to the user
Log-loss is correlated with retention, RMSE (the FSRS kind) is correlated with n(reviews)
No 5-way split in Anki
we should add it
the metrics are already meaningless without a train/test set
it does
i can write an algorithm that just memorizes the training data to get nearly perfect on the current metrics, yet this algorithm would be useless
Are you sure you are using the right RMSE? From the .jsonl file?
x = [user[1]["true_retention"] for user in users]
y = [user[0]["metrics"]["RMSE"] for user in users]
what should i use?
RMSE(bins)
fsrs_6 = load_jsonl("../srs-benchmark/result/FSRS-6.jsonl")
button_usage = load_jsonl("button_usage.jsonl")
users = list(zip(fsrs_6, button_usage))
still a problem
Well fuck me upside down and sideways then
@quasi shadow why no 5-way split in anki? Evaluate means nothing without a proper train/test split. If we have a train/test split then an idea for a health check would be to compare the metrics between the train set and the test set to directly evaluate for generalization
Probably just for speed
Of Evaluate
we can make this tradeoff to make Evaluate actually mean something
Also, if the health check is not based on the values that Evaluate actually displays, what's the point?
evaluate could include the train/test values, it doesn't have to remain as it is
Oh yeah, let's include four values
Logloss (train)
RMSE bins (train)
Logloss (test)
RMSE bins (test)
Surely that will be less confusing and less information overload
Come on man, we're not trying to make it good for data scientists, we're trying to make it good for the kind of person who thinks that "log-loss" means "lost reviews"
we can show only the test version
this makes the information actually accurate for once
sigh
@quasi shadow do you want to implement the 5-way split in Anki?
Before the release of Anki 25.05
I have a feeling the answer is "no"
I'll make a correction that depends on both retention and n(reviews)
imo it should be either remove Evaluate or add a 5-way split because right now the numbers shown on Evaluate are unreliable
have you looked at the notebook? that's exactly what i'm trying to do
well you can build on it if you want
back to this junk
```rust
    Ok(SimulationResult {
        memorized_cnt_per_day,
        review_cnt_per_day,
        learn_cnt_per_day,
        cost_per_day,
        correct_cnt_per_day,
        cards,
    })
}
```
I have no idea what cards look like, so I can't help here
here's the n(reviews) graph if you're interested
This is extremely awkward
- I know how to implement the integral in Python
- I don't know Rust
- I don't know the simulator code very well
- You don't know how to implement the integral in Python
- You know Rust
- You know the simulator code
Is there some sort of smoothing?
yeah
w/o smoothing
Yep, that's more like what I've seen when plotting it
from statsmodels.nonparametric.smoothers_lowess import lowess
Take this
```python
lowess_smooth = lowess(RMSE, sizes, it=3, frac=0.1, return_sorted=False)
lowess_smooth = np.asarray(lowess_smooth)
new_sizes = new_sizes[sorter]
RMSE = RMSE[sorter]
lowess_smooth = lowess_smooth[sorter]
plt.figure(figsize=(16, 8))
plt.scatter(sizes, RMSE, s=30, color="#1f77b4")
plt.plot(new_sizes, lowess_smooth, linewidth=5, label="LOWESS", color="darkorange")
```
Something like this
sizes is n(reviews)
there you go
i did this
```python
vals = lowess(x, y)
ax.plot([x[0] for x in vals], [x[1] for x in vals])
```
Plot both unsmoothed and smoothed data
Also, you got your axes wrong, lol
Axises
Erm, whatever
yeah i did
i'm just plotting what the function spits out though
and the shape looks right enough?
oh yeah my axes were flipped XD
flipped colours sorry
For unsmoothed data remove lines, leaving only dots
And bring the smoothed version to the front
Yeah, just remove the lines for unsmoothed
Make them dots
im feeling like luc gpt rn XD
Good bot
Make the LOWESS line thick and orange
And add these settings to lowess: it=3, frac=0.1
And just to make it look a little better
```python
plt.ylim([0, max(RMSE) * 1.025])
plt.xlim([0, max(sizes) * 1.025])
```
i think matplotlib pads it automatically
Try these settings
i wonder if this is what chatgpt feels like when i'm done with it
that is with those settings
Ah, ok
Anyway, here's what Gemini spat out
For the integral
```python
def average_f_power_forgetting_curve(t1, t2, s, decay):
    def integral_power_forgetting_curve(t, s, decay):
        factor = 0.9 ** (1 / decay) - 1
        return (s / (factor * (decay + 1))) * np.power((1 + factor * t / s), (decay + 1))

    # Calculate F(t2) - F(t1) where F is the antiderivative
    integral = integral_power_forgetting_curve(t2, s, decay) - integral_power_forgetting_curve(t1, s, decay)
    # Divide it by the difference in time to get the average
    return integral / (t2 - t1)
```
This is like 6 lines of code
God damn Gemini
minus the comments this would also be like 6 lines of code XD
```rust
use ndarray::Array1;

pub fn average_f_power_forgetting_curve(
    t1: &Array1<f64>,
    t2: &Array1<f64>,
    s: &Array1<f64>,
    decay: f64,
) -> Array1<f64> {
    let factor = 0.9_f64.powf(1.0 / decay) - 1.0;
    let exp = decay + 1.0;
    let den_factor = factor * exp;

    // Closure equivalent to the inner integral function
    let integral_calc = |t: &Array1<f64>| -> Array1<f64> {
        // Performs element-wise: (s / den_factor) * (1.0 + factor * t / s).powf(exp)
        (&s / den_factor) * (1.0 + factor * t / s).mapv(|base| base.powf(exp))
    };

    // Calculate integral difference and divide by time difference element-wise
    (integral_calc(t2) - integral_calc(t1)) / (t2 - t1)
}
```
yeah i can probably use this
not rn but provided it works
You can verify it by trying t2 that is only slightly larger than t1, like 100.0001 and 100.0. It should give you a value close to the original forgetting curve
If you plug t1 into the forgetting curve function
I did this in Python
```python
integral_avg = average_f_power_forgetting_curve(t1, t2, s, decay)
print(f'Average R within the [t1, t2] range: {integral_avg:5f}')

# Brute force check that the integral version is correct
n_values = 500_000  # number of data points between t1 and t2 to be used for averaging
t_range = np.linspace(t1, t2, n_values)
r_values = power_forgetting_curve(t_range, s, decay)
brute_force_avg = np.mean(r_values)
print(f'Brute force calculation of average R within the [t1, t2] range: {brute_force_avg:5f}')
print(f'Brute force calculation agrees with integral calculation: {abs(brute_force_avg - integral_avg) < 1e-7}')
```
Just brute-force calculated the average of 500k points between t1 and t2
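A self-contained version of both checks mentioned above (brute-force averaging and the "t2 only slightly larger than t1" limit), as a sketch. The forgetting curve form, `s = 5`, `decay = -0.2` and the tolerances are assumptions chosen for illustration, not values taken from the simulator.

```python
import numpy as np


def power_forgetting_curve(t, s, decay):
    """"Instant" R at time t, using the same factor as the integral version."""
    factor = 0.9 ** (1 / decay) - 1
    return (1 + factor * t / s) ** decay


def average_f_power_forgetting_curve(t1, t2, s, decay):
    """Average R over [t1, t2] via the antiderivative."""
    assert np.all(t2 > t1), "t2 must be greater than t1"
    factor = 0.9 ** (1 / decay) - 1

    def antiderivative(t):
        return (s / (factor * (decay + 1))) * (1 + factor * t / s) ** (decay + 1)

    return (antiderivative(t2) - antiderivative(t1)) / (t2 - t1)


# Illustrative values only
s, decay = 5.0, -0.2
t1, t2 = 10.0, 10.0 + 365.0

integral_avg = average_f_power_forgetting_curve(t1, t2, s, decay)
brute_force_avg = float(
    np.mean(power_forgetting_curve(np.linspace(t1, t2, 500_000), s, decay))
)

# Limit check: a tiny window should reproduce the "instant" forgetting curve
tiny_window_avg = average_f_power_forgetting_curve(100.0, 100.0001, s, decay)
instant_r = power_forgetting_curve(100.0, s, decay)

print(f"integral: {integral_avg:.6f}, brute force: {brute_force_avg:.6f}")
print(f"tiny window: {tiny_window_avg:.6f}, instant R: {instant_r:.6f}")
```

The same two checks could be ported as unit tests next to a Rust version of the function.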
I'll try just forgetting curve it into the future as well
seems like a proxy for maximising stability though
just the non integral version
But that just gives you 70%
maximise the cards for memorised as if you stopped reviewing on the last day of the simulator, for memorised in a years time
like that
Ah
Nah, just use the integral
I specifically made it to calculate average R over time without brute-force calculating the average using a loop and a ton of data points
so t1 is an array of the last day that the cards were reviewed and t2 is an array of say [365]?
as in the days in the future to measure?
t1 = how many days have passed since this card's last review by the time the simulation ended
t2 = how many days have passed since this card's last review by the time the simulation ended + 365
Or + whatever number
t2 > t1
Btw, I have no idea what the hell is going on here
```rust
let integral_calc = |t: &Array1<f64>| -> Array1<f64> {
    // Performs element-wise: (s / den_factor) * (1.0 + factor * t / s).powf(exp)
    (&s / den_factor) * (1.0 + factor * t / s).mapv(|base| base.powf(exp))
};
```
So just by looking at it, I can't tell if Gemini messed up
so no fixed end date, just + 365 from when the cards were reviewed?
surely then it could be simplified like this
```
(integral_calc(365)) / 365
```
?
Not quite
To calculate R using the forgetting curve, you need t1: time since the last review
To calculate average R using the integral, you need t1: time since the last review, and t2: time since the last review + offset
btw since the reverse power curve gets super flat quite quickly, the integral looks like a linear function
yeah but offsets constant right?
what im trying to say is you could then simplify away the t1?
to just offset?
If t1=0, then it's as if the card has been reviewed just now, but that's not necessarily the case
ahh right yeah
I mean this is the integral function for S=5 from t=1 to 365 with decay -.2
Maybe using f(x)=x as approx of integral is good enough
At this point it's genuinely simpler to implement the integral than to try to approximate it for no reason
It's not even slow or anything
Like, it shouldn't make CMRR much slower
For example, if the card was reviewed 10 days ago, then t1=10 and t2=10+365
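To make the t1/t2 convention concrete, here is a sketch that applies it to a few hypothetical end-of-simulation card states at once. `DECAY`, `TIME_OFFSET` and the card values are assumptions for illustration; the point is that t1 differs per card, so it cannot be simplified away even though the offset is constant.

```python
import numpy as np

DECAY = -0.2         # assumed value for illustration
TIME_OFFSET = 365.0  # days after the end of the simulation


def instant_r(t, s, decay):
    """R at a single point in time."""
    factor = 0.9 ** (1 / decay) - 1
    return np.power(1 + factor * t / s, decay)


def average_r(t1, t2, s, decay):
    """Average R over [t1, t2], element-wise per card."""
    assert np.all(t2 > t1), "t2 must be greater than t1"
    factor = 0.9 ** (1 / decay) - 1

    def antiderivative(t):
        return (s / (factor * (decay + 1))) * np.power(1 + factor * t / s, decay + 1)

    return (antiderivative(t2) - antiderivative(t1)) / (t2 - t1)


# Hypothetical card states at the end of the simulation
stability = np.array([5.0, 40.0, 300.0])
days_since_last_review = np.array([10.0, 3.0, 0.0])

t1 = days_since_last_review  # e.g. reviewed 10 days ago -> t1 = 10
t2 = t1 + TIME_OFFSET        # -> t2 = 10 + 365

avg_r_per_card = average_r(t1, t2, stability, DECAY)
expected_memorized = avg_r_per_card.sum()
print(avg_r_per_card, expected_memorized)
```

Each per-card value sits between the instant R at t1 and the instant R at t2, and the sum plays the same role as the `.sum()` in the earlier Python one-liner.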
Man, leave the integral, honestly
There is no reason to try to approximate it
Even if it makes it 10 milliseconds faster, the simulations themselves take x1,000,000 more time
The bottleneck is the simulation, not the final calculation
Like, it's genuinely one minute vs 10 milliseconds or something
Hold on, let me time the integral
7 microseconds
And this is with shit-ass Python
Well, tbf, this is for one card
But still
It's 6 lines of code, brah
Mine is 1, 10 char
f(stability)=x/2+stability
Close enough
Sorry
don't want to ruin your fun
but the double standard is excellent
"People don't care about simplicity"
"Let's introduce an average integral ... to approximate f(s)=s"
Well, at least the UI you can move around; the math you have to maintain
As long as the user sees simple UI, it doesn't matter what kind of horrors beyond human comprehension are happening in the backend
And we already have the monstrosity that is the simulator
So I really don't see your point
Like, I could see advocating for simplicity before the simulator was implemented in Anki, but it's a bit too late to worry about simplicity now
And since the function is even more gentle than a f(s)=s, I'm really curious how it will help with the CMRR
We saw yesterday sqrt(S) was too gentle
We'll have to wait for Luc
f(S) was only strong enough when decay was high enough
If it doesn't help, then screw CMRR, I guess
With the new decay, I think the weight should take in account the decay in some way
Or not
Btw @cosmic hedge enable loss aversion for CMRR, the time(Again)*2.5 thingy
NOT for the simulator, ONLY for CMRR
But then it will be normal that the returned DR is the lowest bound, since in practice users seem to never go below a certain R
But we'll see !
Have to sleep, I'll dream about f(x)=x/2
loss aversion is just gone now
Just jam *2.5 somewhere in there 🤣
Idk, find time(Again)
i do not need to XD
You're lucky I apparently fail to value my own time...
its 0.94 for every deck I've tried it on btw
on that note i'm done for the night i still have cards to do XD
So what you're saying is that we've got the opposite problem now 🤣
Thank you for your work
I highly recommend verifying that the integral works as intended via brute-force averaging, like here #1282005522513530952 message
```rust
pub struct Card {
    // "id" ignored by "simulate", used purely for hook functions (can be all be 0 with no consequence).
    // new cards created by the simulation have negative id's so use positive ones.
    pub stability: f32,
    pub last_date: f32,
}

pub fn average_f_power_forgetting_curve(
    learn_span: usize,
    cards: &[Card],
    decay: f32,
) -> f32 {
    let factor = 0.9_f32.powf(1.0 / decay) - 1.0;
    let exp = decay + 1.0;
    let den_factor = factor * exp;

    // Closure equivalent to the inner integral function
    let integral_calc = |card: &Card| -> f32 {
        // Performs element-wise: (s / den_factor) * (1.0 + factor * t / s).powf(exp)
        let t1 = card.last_date - learn_span as f32;
        let t2 = t1 + 365.;
        (card.stability / den_factor) * (1.0 + factor * t2 / card.stability).powf(exp) -
        (card.stability / den_factor) * (1.0 + factor * t1 / card.stability).powf(exp)
    };

    // Calculate integral difference and divide by time difference element-wise
    cards.iter().map(integral_calc).sum::<f32>()
}

fn main() {
    let val = average_f_power_forgetting_curve(10, &vec![Card {stability: 5., last_date: 5.}], -0.2);
    assert_eq!(val, 10.);
}
```
This... explains it
if you have time to burn paste that code here https://play.rust-lang.org/?version=stable&mode=debug&edition=2024
edit: I forgot to add the - on the decay XD
edit2: t1 and t2 are the wrong way around
The output of average_f_power_forgetting_curve should be between 0 and 1 btw
I don't see division by (t2-t1)
doesn't affect the minima
t2-t1 will always be 365 or whatever the offset is
in the absence of a functional CMRR are there any other tools we can use to find safe minimum DRs without manually gauging how our daily load/time spent changes/increases as I try incrementally decreasing my DR?
A train/test split only makes sense when evaluating different models.
The current implementation only evaluates FSRS itself.
The train/test split could tell us the generalization capability among different models or ablation variants.
But when we only evaluate one model, it's not very helpful.
If we implement a 5-way split, we will have five sets of parameters, each optimized on a different training set.
And they are all different from the parameters which the user actually uses.
What can we derive from the evaluation result with 5-way split?
And we have recency weighting. Should the 5-way split consider it?
Wouldn't we need to have a consistent train/test split if we want an untainted Evaluate?
e.g. card.id mod 10 == 0 are never trained on, only used for evaluation.
Then FSRS cannot learn from these cards, and its accuracy would decrease on these cards.
It might end up with worse actual results, but I cannot see how else you would have comparable numbers for a "health check".
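A sketch of the card-id holdout idea described above. The revlog rows and the `split_by_card_id` helper are hypothetical; the only thing taken from the discussion is the rule `card.id mod 10 == 0`.

```python
def split_by_card_id(revlogs, modulus=10):
    """Hold out every review of a card whose id is divisible by `modulus`.

    Because the rule depends only on the card id, a card stays in the same
    set as the collection grows.
    """
    train, test = [], []
    for entry in revlogs:
        (test if entry["card_id"] % modulus == 0 else train).append(entry)
    return train, test


# Hypothetical revlog rows: 40 cards with 2 reviews each
revlogs = [{"card_id": cid, "rating": 3} for cid in range(1, 41) for _ in range(2)]

train, test = split_by_card_id(revlogs)
print(len(train), len(test))  # prints: 72 8
```

The optimizer would only ever see `train`, and the health check metrics would be computed on `test`.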
not true, train/test shows generalization performance, this is standard in data science
IMO, we can use the current metrics to predict the future metrics.
Of course I know it's standard. But I only care about it when I wrote papers.
hmm not sure what to say to that, if you purposefully want shoddy data science then be my guest
the health check could be an FSRS training-in-progress check, for example we can train on the first 4/5 of the revlogs and evaluate on the last 1/5 as a health check, and then just before finalizing the parameters we run another epoch where we include the last 1/5 so that we have full coverage
It feels like one of the biggest reoccurring problems with SRS is we are so starved for data. We really need the mega AI to come along that is trained on such a stupid amount of data that it works well without much per-user data.
But the first set of parameters cannot stand for the final one. The health check only represents the health of the first set of parameters.
If it could stand for the final one, the final one could also stand for the future one.
If the test split is the last 1/5, couldn't that also really harm the training? I thought there was a very noticeable improvement with recency weighting.
If so, why not just evaluate the final one?
this is only a compromise so that all the data is used, the alternative is to just use the first 4/5 parameters as the final set, or remove evaluate altogether because we cannot actually say anything about the performance on unseen data
e.g. even those big LLMs that you see going around do not train on the final test set before deployment, probably
That's why I was suggesting something like card.id mod 10 == 0
this technically would still leak information but its better than nothing
Now I'm confused. How would that "leak information" if they were not used for training?
N.B. I'm not a big data science person
(also part of my reasoning for that split is it would stay mostly consistent over time as you add cards etc.)
I suppose you could also just add data marking cards as a "test" card so they stay consistent
there is leakage through time, say you're trying to predict the y value on this graph, you can just look at the training set to find nearby x values to predict a reasonable y value
this is s&p 500
so if you have a model that cheats this way, it would be useless for future prediction
but FSRS's goal is to predict the future, not fit the past
so its similar
I might just be dumb but I still cannot see it. If you isolate the revlogs of entire cards I cannot see how the model can cheat other than overall trends (e.g. you performed badly one day because you were tired).
that's why i say that leakage is still possible, that's one of the ways it could be abused
but it might not be significant enough such that the metrics are still reliable enough if we are only limited to parameters that are trained on a subset of the full revlog
parameters trained on card.id mod 10 != 0 would likely perform better than on first 9/10 of the revlog
since it would include the most recent information as well
OK, just implement the same method used by the SRS Benchmark.
Then we can compare the evaluation result with SRS Benchmark's result.
Everything is solved.
another pro: now recency is expected to improve the metrics
would you mind reviewing it?
btw, it doesn't evaluate any parameters.
It only evaluates the "optimization" with given dataset.
And it's slower than training.
It means changing the parameters will not affect the evaluation result.
i'm not comfortable enough with rust to give a good review
i think this is fine, change parameters at your own risk and use tools outside of Anki to really tinker around with your parameters
wait...
I forgot we use the evaluate here.
Is it worth keeping the previous evaluate?
Is it possible that the train/test split Evaluate loss goes up when the current Evaluate's loss goes down? Because if that happens it could appear as if the parameters got worse after optimising, right?
Didn't that check not exist at the start anyway so it would be fine anyway but just saying.
Fine. I don't want to modify too much code, so I keep it.
Something else to keep in mind: compared to a case like predicting the S&P 500, here we have way more constraints on how the model can adapt its predictions:
- There's no free up and down. The question is only "by how much do we increase (or decrease) S/D after a review?"
- By nature, we could expect memory to not be super volatile, like being "5 times more potent on certain days" and "only half the performance the other day". So splitting train/test makes sense, but it might not change much here
related:
Feat/evaluate FSRS via time series split open-spaced-repetition/fsrs-rs#326
Improve feedback from "Evaluate" and potentially hide/move it (FSRS section) #3926 (comment)
OK, the prep work has been done for the health check idea.
I leave the rest of the work to @unique salmon
If Anki gets a short term memory model, I believe I can rest in peace finally...
@polar maple release the thing
Wut
Oh, yeah, I get it
Right now Evaluate evaluates a set of parameters, but that's not the case with the 5-way split
Wait, now people will complain that changing parameters doesn't change the numbers in Evaluate
God, this is such a pain
It would be so much easier and less confusing to just remove Evaluate
@polar maple may I ask what was the architecture of the neural network you trained that had better loss than fsrs
Go kick Alex's butt so he releases his neural net, which is exactly what you described
thank you
It actually splits the dataset into 6 sets.
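A sketch of how a 5-way time series split ends up with 6 chunks. This mirrors the expanding-window layout of scikit-learn's `TimeSeriesSplit`; whether the fsrs-rs implementation matches it exactly is an assumption, and `time_series_splits` is a hypothetical helper.

```python
def time_series_splits(n_reviews, n_splits=5):
    """Yield (train, test) index ranges for an expanding-window split.

    The chronologically sorted reviews are cut into n_splits + 1 chunks;
    split k trains on everything before the test chunk and tests on it.
    """
    test_size = n_reviews // (n_splits + 1)
    if test_size == 0:
        raise ValueError("not enough reviews for this many splits")
    first_test_start = n_reviews - n_splits * test_size
    for test_start in range(first_test_start, n_reviews, test_size):
        yield range(0, test_start), range(test_start, test_start + test_size)


splits = list(time_series_splits(n_reviews=600))
for train, test in splits:
    print(f"train: 0..{train.stop - 1}, test: {test.start}..{test.stop - 1}")
```

With 600 reviews this gives five train/test pairs over six chunks of 100 reviews; the first chunk is only ever used for training.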
What do we do about this?
FSRS's versioning will catch up with RWKV's 🤣
As Alex said, we cannot evaluate the parameters with unseen data.
So we're making Evaluate even more confusing to the average user...
Fragment from "The Sopranos" TV series, season 5, episode 7.
Uncle June cries on wake. Shit!
For the sake of data science!
Fair
So how is it going?
haven't touched it today yet.
if you have any other ideas lmk
Nope, just the integral with various values of offset (or whatever you wanna call it)
Oh, yeah, and make a draft PR for the new diagram for Evaluate
Still trying to figure out how to correct for both retention and n(reviews), though. Can't use LOWESS on 2D data 
I recommend updating the graphs here: https://github.com/Luc-Mcgrady/anki-10k-notebooks
Set frac=0.4 for LOWESS, to make it smoother
Man, I'm just giving you tons of work
Here's how RMSE (bins) depends on n(reviews)
And here's how it depends on retention
Now the question is - how the hell do I make a correction based on both...
Here's a 3D plot just because why not
Hey, what's the rationale behind the second learning step? I see it seems to match the stability of "Again then Good", but a second learning step would be "Good then Good", right? For intervals like 82m, it might also mean the second learning step gets done the next day, which I thought was not ideal if you want to let FSRS control your intervals, right?
@quasi shadow
If your first rating is good, anki will use the 2nd step.
Sure, I know the meaning of having 8h as a 2nd step, I'm not entirely sure why FSRS figured out it should be 8h. It feels like it's the "Again then Good" stability, but I don't really see the logic behind it
I mean, getting to that 2nd learning step might be any of those cases. Could be an Again or a Good first, right?
to be clear there are currently no plans to release this nn in anki
Would make more sense to take : {Again} {Good Then Again} no ? To see if the card is not a difficult one to learn, instead of {Again} {Again then Good}
Yeah, but I just want to see funny benchmark numbers
Because if your first rating is again, anki will use the 1st step. Then you grade good, and anki will use the 2nd step.
Good Then Again?
You will use the first step for this case.
Yeah but if you press "Again", then I agree, the stability to use is the one for "Again", here 119s.
But if you press "Good", and want to know what could be a second learning step, you'd like to take the stability of "the cards that were Good, but unfortunately didn't make it at the second step", right?
But I guess it's subjective
On your side I think your point is: "If on average the guy that does Again->Good has an 8h stability, the second step should be 8h". Whereas in my case I'm more like: "If the guy that presses Good has a stability of 2.46 days, and when things go wrong (Good -> Again) they only have a stability of 11min, then let's put the second learning step at 11min to see if the card survives that stability"
My point being : If I take your logic, then "Good then Again" would also put him in the first learning step, so let's use that "11.25m" for a first learning step
Ideally those steps should answer: "What succession of good reviews, with how much space between them, would give the user a DR% chance of reaching a 1d interval?" (so he has a chance to pass it the next day with the desired DR)
That's why {Good Then Again} makes more sense to me: it's the stability of those cards that might look good on paper (first step was a success), but guess what, he got the 2nd step wrong... Well, next time he gets it right in the first step, let's wait for that interval to see if he remembers it now
Good (8h) -> Good: Makes sense, it's less optimal than using {Good Stability}, but at least we're sure he's not in a "Good then Again" situation.
Again (2m) -> Good (8h): Makes sense, we use {Again Stability} as the interval for the first one, and then the {Again Then Good} stability.
Good (8h) -> Again (2m) -> Good (8h): **the 8h doesn't make much sense here**, the "Good then Again (12m)" should be used, because the guy just did exactly that, finishing with an "Again then Good".
But yeah, the learning steps, if hardcoded in the deck options, lack the flexibility to really use those values. And leaving them empty means, most of the time, not even having a learning step (if Good/Hard have ~>0.5 stability)
why is it "Good (2m)"? doesnt it immediately go over to the second step
Oops
Thanks
But then yeah even the Good -> Again doesn't make sense. If we know Good -> Again stability is 12min, having to make the user wait 8h (The Again then Good) feels off
decay = -0.5? i'm guessing thats on purpose
TBH the most logical would maybe even :
First Step : {Again}
Second Step : math.min(AtG, GtA)
Oops
@quasi shadow this line needs to be updated for FSRS-6: https://github.com/open-spaced-repetition/fsrs4anki-helper/blob/fc3f44ba93a7fdf586150e615efcf0a7ece5e625/stats.py#L255
i think there is reason to believe that the current FSRS forgetting curve shape doesn't work well for short stabilities
```
ratings = {
    1: "again",
    2: "hard",
    3: "good",
    4: "again-then-good",
    5: "good-then-again",
    0: "lapse",
}
Math.min(stability[2] * 2 - stability[1], stability[3], stability[4])
```
It's the min of good and again-then-good
you need some level of near-instant forgetting built into the curve
I experimented a bit with FSRS for short-term reviews and what helped was subtracting an "instability term" from S that decays with time, so it would be like S=S-I, where I=f(time)
The metrics were still shit though
But it helped make them less shit
It would look like this
Nice find ! And of hard * 2 - Again, for whatever it means
I agree, most people just don't realise that "having read something" doesn't even mean having it memorized in any form, so as long as you treat those spammed Agains like you would treat any other review, with the current model, you'll get screwed
There's also the coin-flip knowledge: knowledge that might get a very high stability as long as you get lucky enough with your wacky knowledge
Example : What year X was born ? Let's say you hesitate between 1990 and 1980
Here are forgetting curves from Alex's 2.7 million parameters neural net
Notice how they fall off very sharply initially
But then again, this isn't useful for FSRS since we're not trying to model what's going on during same-day reviews
It's just experimental evidence that a "proper" forgetting curve looks different from what most (~all) people imagine
Let's say your model is built on 90% healthy cards, with good knowledge of them, not much ambiguity.
So now, on that coin-flip question, you might get it right 4 times in a row (1 chance in 16), and get a stability of 4 months because you did "Good-Good-Good-Good"...
But guess what, you never really knew it
So sometimes "very low stability" might mean :
- You never even memorized it a single time
- You never had anything truly substantial to remember, you just remembered it was A or B and most of the time you can get it right
And then you complain about Anki minimum review interval to be 10min
(For case 1)
But for case 2 however (the flip coin question), I have no idea how you could detect that
maybe with 2.7 million parameters?
I have some ideas though
That ratio of performance drop
For example that word I always hesitate between "shuusei" or "shousei"
How does it translate? Lapses with non-increasing performance
I think those kinds of engineered features, fed to a NN like yours @polar maple , could really detect those
when is it best to do the first few optimizations of a deck with new deck options
after 5 days or after passing the daily reviews size 5x over?
Whenever you want
There's no "too early"
@polar maple : If you want to use those engineered features, most of the logic is here : https://github.com/JSchoreels/anki-addon-leechdetector/blob/main/leechdetector/leech_detector.py#L42-L62
i want to have an in review leech detector
@cosmic hedge @quasi shadow Alright, this was a massive pain, but I made a function that approximates RMSE as f(n_reviews, retention)
def func(a, b, c, d, e, f, g, x, y): return a / (np.power(x, b) + g) + c / (np.power(y, d) + f) + e
x is n_reviews/1000, y is retention
List with parameters (a, b, c, d, e, f, g): [0.16398, 0.73318, 0.018426, 10.0, 0.0, 0.35881, 1.6193]
(e ended up being 0, so we can remove it)
Two notes:
- I divided n_reviews by 1000, just because
- retention is calculated including same-day reviews
So now we can calculate normalized RMSE - take the user's real RMSE and divide it by the output of this function. If the ratio is >1, then the user's RMSE is greater than what would be expected for his retention and n(reviews). If the ratio is <1, then it's lower than what would be expected. Then we can calculate the percentiles of this normalized RMSE, and finally use that as cutoff values for the health check.
- 1st percentile of norm. RMSE = 0.29912
- 50th percentile of norm. RMSE = 0.9972
- 90th percentile of norm. RMSE = 1.8269
- 99th percentile of norm. RMSE = 3.3006
How to use this:
- Take the user's normalized RMSE (which, again, is a ratio) and clamp it so that it's not lower than 0.29912 (1st percentile) and not higher than 3.3006 (99th percentile)
- If it's <=50th percentile, display "Good" (green zone)
- If it's between the 50th and 90th percentile, display "Acceptable" (yellow zone)
- If it's >=90th percentile, display "Poor" (red zone)
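The procedure above can be sketched roughly like this, using the fitted parameters and percentiles quoted in this message. The names `expected_rmse` and `health_zone` are made up for the sketch, and it assumes the user's RMSE is expressed in the same units as the function's output.

```python
import numpy as np

# Fitted parameters quoted above (e ended up being 0, so it is dropped)
A, B, C, D, F, G = 0.16398, 0.73318, 0.018426, 10.0, 0.35881, 1.6193
# Percentiles of normalized RMSE quoted above
P1, P50, P90, P99 = 0.29912, 0.9972, 1.8269, 3.3006


def expected_rmse(n_reviews, retention):
    """Approximate RMSE expected for this review count and retention."""
    x = n_reviews / 1000  # the fit uses n_reviews divided by 1000
    y = retention         # retention including same-day reviews
    return A / (np.power(x, B) + G) + C / (np.power(y, D) + F)


def health_zone(user_rmse, n_reviews, retention):
    """Return (clamped normalized RMSE, zone label)."""
    ratio = user_rmse / expected_rmse(n_reviews, retention)
    ratio = float(np.clip(ratio, P1, P99))  # clamp to 1st..99th percentile
    if ratio <= P50:
        return ratio, "Good"
    if ratio < P90:
        return ratio, "Acceptable"
    return ratio, "Poor"


print(health_zone(0.03, 10_000, 0.9))
```

A ratio above 1 means the user's RMSE is higher than expected for his retention and number of reviews; below 1 means it is lower.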
One last issue - what do we do if it's in the red? The user will panic and be like "nyooo my fsrs is dying...". But I don't know what to display. If we display a list of possible reasons for poor performance, I doubt people will read it. The kind of person who misuses buttons or uses Anki in some dumb way is exactly the kind of person who won't read a list of possible explanations.
i get that we are correcting for the correlation between RMSE and retention here but if we are doing this correction, why not just use log loss?
It will require a correction too
In this case I don't think either of the two has any advantages, so we might as well flip a coin
ik, i thought originally you wanted to use RMSE because it didn't require a correction but we discovered yesterday that it also needs one
then in this case we might as well use log loss since it is a more accurate metric
what does the distribution look like? (this graph)
Of normalized RMSE?
yeah
well of all the users - the RMSE normalising value(?) which is what i assume we're going to be using?
how far is the average user from the "middle value" is what i'm trying to ask
ahh yeah you get the percentiles so it should probably be good anyway?
Black line is the median
Yep, I described the procedure in the second half of this message: #1282005522513530952 message
@cosmic hedge where integral
I need to know if we're axing CMRR or no
And Dae has been asking about it too
how about return 0.9 for CMRR
Also, I have some things to say regarding the wording in Evaluate, but I guess it's a bit too early for that since Luc hasn't made a PR yet
- I think we should rename it to "Evaluate FSRS" to make it clear that we are evaluating the algorithm, not the parameters
- The text should say "Lower values of RMSE and log-loss indicate a better fit to your review history. The results do not depend on your current FSRS parameters". This makes it more clear what is going on
Still unsure what to do with people who will get "Poor"
I mentioned it on Github
"The results do not depend on your current FSRS parameters" ?
I change my param, I press evaluate, the values change
That makes it kinda useless then, doesn't it? Like, it's to check how well your current parameters fit your collection.
And if we go the data scientist route, wouldn't Evaluate evaluate the parameters against the test set?
that's exactly what the new evaluate will do, however, i believe jarrett wants the parameters to still be trained on all data which leaves us with no Test set
so we will basically test FSRS on a train/test split, but the final parameters will still be from training on all of the available data
that's SUPER unintuitive, and better to just remove then
Ok but in any case, you change the param, you potentially change the log loss no ?
If not, then what params does Evaluate use?
this is what Expertium is getting at, Evaluate will not evaluate on current parameters, rather it will evaluate how well the FSRS algorithm does as a whole
no change
Yeah, the health check will be for FSRS as an algorithm, not for a specific set of params
If Evaluate becomes a sort of self-test and doesn't look at your params I kind of want an "Automatic/Manual" toggle.
- Automatic: you don't show (editable?) params but have an evaluate button.
- Manual: you show params but don't have an evaluate button.
Ok so if I understand correctly :
You press Evaluate -> it trains -> it gets params -> it evaluates on the test set?
this has to do with a common problem in data science, just because a model fits training data well doesn't mean it will generalize well onto unseen data
Why not, if params are present, skip the "train" step and evaluate on the test set?
I understand that very well, I just don't understand why you can't evaluate params, whether they come from an optimization or from your own hands
i think that if you want to tinker with params in this way you should do it outside of anki
Me waiting for Alex's course on machine learning
I just don't see what information I would gain from Evaluating FSRS as an algorithm against my deck.
Oh OK, I don't mind that
While with evaluating the params, I at least have SOME kind of idea if one set is better than the other, or if one set is a complete miss somehow
Shouldn't the params be hidden in some way?
It would be extremely strange from a UX point of view if you could see and edit params, and they had an effect on scheduling, but not on the Evaluate function
i think this shouldn't be a problem, when you optimize parameters you only get new parameters if they are an improvement over the previous values in terms of training loss
Or if shown, at least put in read-only
Having something that would alter scheduling but not evaluation feels extremely off
Why? That has zero merit. People who don't know what they do won't touch them anyway
And if you really WANT to modify them, you can anyway. You just made it more annoying.
I'm thinking more about the people who know what they are but wouldn't expect them to be used in one function (scheduling) but not in another
Just saying, all of this mess could have been avoided if people voted for removing Evaluate ¯\_(ツ)_/¯
But we got 73% first-preference votes for the "health check"
Well, nowhere did the vote say that it would stop evaluating your parameters...
I asked Jarrett "Do you want to implement train set = test set in the benchmark or the 5-way split in Anki?" and he chose the latter
The choice of not linking Evaluate to the Parameters in the UI is just an arbitrary choice, not a real limitation
I just feel the whole topic is some kind of ego war more than anything
"I need to do the split I don't want to do? Then I won't let you evaluate it"
Evaluate seems entirely pointless if it does not give you a metric related to the current parameters.
Well train=test in the benchmark is bad science.
"You didn't want to hide the Evaluate? I'll make it useless"
Yeah I'm not arguing that
(I meant Expertium's comment)
i think rossgb's card_id mod 10 == 0 might be a decent compromise
I'm just slow
Feel free to tell Jarrett to implement train = test in the benchmark as a new command or whatever
I mean, let's just mark the cards used for training, and when the guy clicks on optimize, it runs on the test set + all the new data
Re-Optimized ? Let's mark the new training set and now Evaluate will work on All\Training Set
We need either train = test in the benchmark or the 5-way split in Anki. Either one will do
Because right now the numbers from Evaluate cannot be compared to the benchmark numbers
So either the benchmark has to be ankified or anki has to be...benchmarkified
the benchmark itself should never be changed in this way
It could just be a separate command
I can see Jarrett's argument for the "self-test" version. A test split in normal use risks significantly worse performance when we have such a small dataset for an individual user.
Then we can keep the current behavior of Evaluate
IMO this is not necessarily super true, from my own anecdotal experience, when you reach ~10-20k reviews, the predictions don't change much anyway
Most people using anki for a year have way more than 10-20k
And youngsters should use default params for longer
Before there was a threshold to reach I think before being able to optimize
Not THAT long. Not for 10k reviews
There's nothing wrong with optimizing much earlier
Not really. Like 8 reviews for pretrain, 64 for full optimization
IMO this is not great
Except it has some wacky filters and whatnot
when you see @tepid spoke's case with 100d stability when he filters out cards with Hard/Easy as first review
So you'll never figure out the exact number of reviews used for training
A long time ago another user helped me and Jarrett with it. We found that 64 reviews for full optimization is alright
Better than the defaults
do we have any performance metrics for low # of review collections?
Ok but maybe some rules like "At least a few Easy, Hard ..."?
hm? The 100.000 stability value appears when I filter out enough cards via "Ignore cards reviewed before"
Because getting 100d stability because you lacked certain cases is a bit meh
Mmmm, delicious two-button users' tears, yummers
Oh yes, I'm 99% using only again/good, but still thinking about the others
But I guess it's not that much a big issue
Except if Anki uses FSRS by default and auto-optimizes with 30 reviews
But SM2 has the benefit of building a training set for FSRS lol
Anyway, if anyone wants "Evaluate parameters" instead of "Evaluate FSRS", tell Jarrett to implement train = test in the benchmark
Because right now he's doing the exact opposite - implementing the 5-way split in Anki
Again, don't point fingers at me. I asked, he chose
if we implement card_id % 10 == 0 we can keep the current Evaluate and it wouldn't be so incorrect
Train=Test is not related to Evaluate Parameters=Evaluate FSRS, I don't know why this restriction would be there @quasi shadow ?
Why not just mark the cards that were used for training and have Evaluate work on the test set?
@unique salmon btw to train on the test set its pretty much just 1 line of code in other.py
That evaluates on ~10% of all cards then
this is just a starting suggestion
is 10% too much? too little?
ok
But the point is that we need some method Y that is used both in Anki and in the benchmark so that we gather data for the health check
If Anki uses method Y to calculate metrics and benchmark uses Z to calculate metrics, we can't do shit
This answers your question, I believe
Wouldn't the 5-way split benchmark values still be comparable even if we used the "mod card.id" in Anki? The issue is just that train=test ones are not.
Whatever we do in Anki, we must also do in the benchmark so we can collect data that will be used to decide values for the health check
Not quite
@polar maple I don't think it would?
Something is going very weird with my keyboard. It is hard to type
Mod would choose cards randomly, whereas in the 5-way split they are not chosen randomly
It might not be perfect. I was assuming that train=test could have very different values, but different methods that do not mix train and test would have similar RMSE/log loss.
My laptop keyboard is very unhappy.
I think the mod 10 version would be expected to get a lower loss than a time split for a similar reason as the s&p500 example
for example if a new user does a bunch of new cards at day 1 and then reviews them at day 100, passing all of them
then a mod 2 split would easily get a zero loss
but splitting by time in half would get a high loss since it would basically be fsrs default params
my view is evaluate parameters is fine if we're not evaluating on the training set
Well, I guess tomorrow we'll see Sound and Oromit debating with Jarrett
I do not have any kind of strong attachment to the Evaluate button
It just seems pointless to me if it doesn't evaluate the parameters
Just Sound then 🤣
I did vote to keep it, since it's "nice enough to have"
but if it causes trouble like that, meh
I think it's more useful to evaluate the algorithm itself, it's just that it's hard to do this in a way that isn't confusing as hell to the average user
The few times I used it was to check how the values looked for my manually tuned parameters
to make sure I didn't make a horrible mistake
Maybe I'll dig out my Health Check PoC and see if I can create something I don't hate
But that's such a niche case, it hardly matters
Fair, in your case old Evaluate is more useful
With the Simulator being a thing now, it's a better cross-check anyway
I don't want to debate. I'm convinced by Alex.
You cannot know the performance of any sets of parameters on unseen data (the future reviews) which is actually important in practice.
for a health check could we just compare FSRS-6 with adaptable params vs FSRS-6 with default params using the 5-way split? it seems that the proportion of users is significant (15.7% do better with default params)
https://github.com/open-spaced-repetition/srs-benchmark/blob/main/plots/Superiority-9999.png
@quasi shadow is FSRS-6 def params with the joint optimization params? if not then this value might be even higher than 15.7%
but this is exactly why a train/test split is important, likely 99.9% of users would have a better training loss with adaptable params but the actual benefit is not necessarily that high
The FSRS-6 def params are the median params because we haven't found a method to generate reasonable default parameters.
I think it still doesn't solve the complaint.
Btw, what could we do if the health check's result is bad?
If I understand it correctly, we should use the default params if this kind of health check shows bad result.
Even if the log loss of adaptable params is better than default params.
Hmmm not really. Anki and the benchmark both using the same method (training/test) doesn't imply being unable to run Evaluate on parameters.
It's like saying that evaluate(optimize(data), test_data) implies that **evaluate(user_defined, test_data)** is not possible
But is it beneficial to evaluate parameters?
If I understand it correctly, you want to know how well the user defined parameters perform on current data.
But the benchmarking method evaluates how well the optimization performs in the future.
Let's reframe this question: how well do the user-defined parameters perform? -> how well does the user's manual optimization perform?
If we want to compare the built-in optimization with the user's optimization fairly, the user also shouldn't optimize the parameters based on the metrics from the test data.
But it's tricky.
It's tempting to optimize the parameters based on test data.
Ah OK, I now see a very good point that justifies your thoughts. Here's what I think about it:
- To be fully consistent, it's true that the user-defined parameters should also be evaluated against the same test set as the optimized ones. Otherwise the user might feel he gets better results with his own parameters when in fact it's just that, when he defines his params, Evaluate runs on the training data and so he might get better results, which is not great.
But if we mark the data that was used for training, and only perform Evaluate on the test set (excluding the training set), then both user-defined and optimized params will compete on the same "unseen" data.
If doing such an exclusion is difficult for now / not feasible within Anki, then it would probably be better to slightly rework the menu to make clear that user-defined params are the responsibility of the user... for example:
- Put a warning if the params have been changed by the user
- Instead of having an "Evaluate" button, just show the log loss/RMSE of the optimization, to make clear that it's computed only when optimizing and represents only the optimized parameters, not user-defined ones
- If we keep an Evaluate for user-defined parameters and do not exclude the training set, show a warning that the user might get a better evaluation but it's "cheating", since it's not splitting into training/test sets
Evaluate button has to stay though
If we keep the current Evaluate behavior, then @quasi shadow needs to implement train = test in the benchmark. Alex said it would be easy
PR is welcome.
I'm training another model now.
@ashen light challenge accepted
We'll see if Jarrett merges this PR or finds any issues
extension to the challenge: merge it on gemini's review alone
I mean the result file should be included…
My device is busy.
Do you mean me running this code and giving you a .jsonl file?
yep, please include it in the PR
Ok
🤣
welp
Hopefully that's easy to fix
Not this error again
How many times have I gotten it...
If I could see dreams, I would see "not enough values to unpack" in my nightmares
@ashen light if AI wrote 60 lines of code and I wrote 1, does that count? 🤣
Alright, so now I have to run FSRS-6 on 10k users
See you guys in 30 70 hours, lol
you dont visually dream?
Nope
No, that's inability to imagine things in your head
Like, imagine a spinning apple or something
I can do that
@wind palm @hasty fractal @cursive badge @cosmic hedge I hate to say this, but it's time to debate Evaluate. Again.
In order to implement the health check, the way the metrics are calculated has to be consistent between the benchmark and Anki, otherwise we can't collect the necessary data. Currently, that's not the case. There are 2 ways to fix this, both are technically doable:
- Implement a training data/testing data split in Anki and instead of evaluating parameters, evaluate FSRS. What this means in practice is that the Evaluate numbers won't depend on your current FSRS parameters. This will also make Evaluate as slow as Optimize, since it has to do an optimization.
- Implement train set = test set in the benchmark. Then I'll run FSRS-6 this way (I'm doing it right now, actually), and then we can keep the current behavior of Evaluate
So either Evaluate evaluates a specific set of parameters, like now, or it evaluates FSRS's ability to perform well on unseen data, like Jarrett and Alex want
Currently in favor of 1: Jarrett, Alex, Luc
Currently in favor of 2: Sound (kind of), Oromit
Currently in favor of "please god let's just get this over with just choose whichever": me
- Implement data/test data split, but find a way to evaluate user-defined parameters in Anki (Evaluate) that would run only on test set (All, excluding the cards marked as "trained_set" during the optimize), just like the result of optimize
I'm for neither 1 nor 2
You can't. The whole point of the test set is that you do NOT optimize parameters on it. The moment you allow people to tweak parameters to see how it affects the metrics on the test set, it ceases to be a test set
I'm not talking about optimizing the parameters on it, I'm talking about evaluating the cost function on it
Basically :
evaluate(optimize(train_set).parameters, test_set)
evaluate(user_defined.parameters, test_set)
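A toy sketch of those two calls, assuming a one-parameter model (a single stability value plugged into the FSRS-4.5 curve) and a grid search standing in for the optimizer. The review data, `optimize`, and the user-defined value are all made up for illustration and are not how FSRS or Anki implement it.

```python
import math


def predict(stability, t_days):
    """FSRS-4.5 style power forgetting curve."""
    return (1 + 19 / 81 * t_days / stability) ** -0.5


def log_loss(stability, reviews):
    """Mean log loss over (elapsed_days, recalled) pairs."""
    eps = 1e-9
    total = 0.0
    for t_days, recalled in reviews:
        p = min(max(predict(stability, t_days), eps), 1 - eps)
        total -= math.log(p) if recalled else math.log(1 - p)
    return total / len(reviews)


def optimize(train_set):
    """Stand-in optimizer: grid search over stability on the training set."""
    candidates = [0.5 * k for k in range(1, 201)]
    return min(candidates, key=lambda s: log_loss(s, train_set))


# Made-up review history: (elapsed_days, recalled)
reviews = [(1, 1), (3, 1), (7, 1), (14, 0), (2, 1),
           (5, 1), (10, 1), (30, 0), (4, 1), (20, 0)]
train_set, test_set = reviews[:8], reviews[8:]

optimized = optimize(train_set)  # never sees test_set
user_defined = 3.0               # hand-picked, hypothetical

print("optimized   :", optimized, log_loss(optimized, test_set))
print("user-defined:", user_defined, log_loss(user_defined, test_set))
```

Both parameter sets are scored on the same held-out reviews. The comparison stays fair only as long as the hand-picked value is not adjusted after looking at the test-set loss.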
If manual tweaking is allowed, it still defeats the point
Yes, that defeats the point of the test set
Why though
Test set is for evaluating how well the algorithm performs on data that it was not trained on
Yep that's how it's used in
evaluate(optimize(train_set).parameters, test_set)
evaluate(user_defined.parameters, test_set)
If you tweak the parameters to get lower logloss/RMSE, you "train" the algorithm on the test set
Only to evaluate the output of the algorithm
I addressed that point in the third bullet
Basically if a user goes that far, he would also be able to just train his parameters on the full set 🤷 , basically he would be hacking his way
At least with
evaluate(optimize(train_set).parameters, test_set)
evaluate(user_defined.parameters, test_set)
You allow him to slightly tweak the optimized version and see if it doesn't get too bad when changing, for example, his initial stabilities
Of course, if he uses that door to train parameters on the whole set, shame on him
But by default it would not be the case
So the complexity comes down to: how to make sure Evaluate doesn't cheat, while still allowing the parameters to be modified for whatever reason? The answer is then: hide the training set from it
I just imagine Dae's face with all those flags/parameters in the custom_data of the cards
I'll """vote""" 1 #1282005522513530952 message because he's already decided.
I guess we could add a stern warning to not tweak parameters manually. But then we'll have to implement a third method, like the mod card ID proposed by Alex, in both Anki AND the benchmark
The third method of splitting data into train/test, I mean
Shouldn't be that hard right
I mean there is certainly a part in the code where you divide the data into X/1-X
you make it take a custom function "f_partition(card, card_index)", one would be based on "first 80%", and the other "mod N"
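A rough sketch of that pluggable partition idea. Only the `f_partition(card, card_index)` signature comes from the message above; the helper names, the card dictionaries, and the assumption that cards are already sorted in the order the split should follow are made up.

```python
def first_fraction(n_cards, fraction=0.8):
    """Train on the first `fraction` of cards (by position), test on the rest."""
    cutoff = int(n_cards * fraction)
    return lambda card, card_index: card_index < cutoff


def mod_n(n=10):
    """Hold out every card whose id is divisible by n, train on the others."""
    return lambda card, card_index: card["id"] % n != 0


def split(cards, f_partition):
    """f_partition returns True for training cards, False for test cards."""
    train, test = [], []
    for index, card in enumerate(cards):
        (train if f_partition(card, index) else test).append(card)
    return train, test


cards = [{"id": 1000 + i} for i in range(100)]  # made-up card ids

train_a, test_a = split(cards, first_fraction(len(cards)))
train_b, test_b = split(cards, mod_n(10))
print(len(train_a), len(test_a), len(train_b), len(test_b))  # 80 20 90 10
```

The same `split` could then be called from both the benchmark and Anki, so that whichever partition is chosen gets applied identically in both places.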
Jarrett already made his PR to Anki and FSRS-rs and I already made mine to the benchmark 🤣
God this is a mess
You're a bit too intense, chill out and wait. No wonder Dae doesn't involve himself that much in those discussions
FSRS is already good enough, a few more days wont hurt
2 buttons
"evaluate fsrs"
"evaluate current parameters" (buried somewhere)
god save us all
TBH not that bad of an idea
I mean
Maybe some people want to use the full training set
That's confusing as hell
Losing 20% of your reviews when you have 100 reviews is a lot
Think of the average user trying to understand it
Serious question: why are we trying to expose more of FSRS rather than hiding it? Ideally, the only setting should be desired retention, that's it
Or a switch "Use Train/Test split or just use Whole set for training ? /!\ This means you're cheating the ability of predicting unseen cards"
I mean, I'm not completely against removing parameters, evaluate from the Deck Options screen, but FSRS should be a bit more trusted for people first
Between the ideal world of FSRS in the benchmark and what people observe, I don't blame people for not being willing to lose control
But I agree ideally it should be hidden
It's just not mature enough really to be
But anyway
Let the people actually coding decide
And if I really want my
evaluate(optimize(train_set).parameters, test_set)
evaluate(user_defined.parameters, test_set)
I'll PR it in a few months
jarrett isn't you though
I guess
Yeah, but he would just approve my PR, without making changes
#1282005522513530952 message
I'm glad I'm no longer on the evaluate mailing list
but literally just stop talking about it till dae says it's reasonable, whatever it is yall are doing
He already greenlit the "health check"
oh cool
But health check requires data, and data requires using the same method both in Anki and in the benchmark for calculating log loss and RMSE, so...here we are
well, have fun
@unique salmon
i hooked up FSRS-6 to optimize on the entire revlog for logloss & rmse (bins) separately but only on the reviews that are evaluated on in srs-benchmark,
logloss: https://pastebin.com/c9c1WniH
rmse (bins): https://pastebin.com/JCkZZZtA
Is this on each individual user or on their combined revlogs?
individual user and with no 5-way splits
it was to simulate how much choosing the best rmse (bins) params within anki might affect the result
but recently jarrett changed it to log loss
I want you to try this on a combined revlog to see if the weirdness with parameters persists
how are the params weird?
I mean w[16] becoming 1.0
I want to see if it persists with both loss functions
On a big combined revlog
we already know the answer for log loss
Time to try RMSE
for rmse (bins) even if its not 1.0 we still wouldn't use it
Why not? If those parameters outperform the current median parameters as default parameters, then why no?
because rmse (bins) is a metric that shouldn't be directly optimized for
better to just fix w[16] to 1.5 and optimize logloss around it or something
why
do i have to repeat my rants against rmse (bins)?
rmse (bins) does not push the model to predict as best as it can
i would be fine with this but you might get users who tweak their parameters a bit, see a huge drop in test loss, and then complain about the efficacy of FSRS
We could add a "Do not tweak your FSRS parameters manually to try to bring the values of log loss and RMSE down" warning or something
also new parameters from an optimization are kept only if they improve on the training loss, it could still get worse on the test loss and evaluate would show a worse result
but if we insist that the new parameters should also do better on the test set then this is just training on the test set
I feel like we're screwed either way
- Evaluate evaluates parameters - people start tweaking parameters on the test set and then complaining that the optimizer is garbage because they have gotten "better" parameters manually
- Evaluate evaluates FSRS - people start asking why changing their current parameters doesn't affect Evaluate
But at least they could see that it's their tweaking that had the adverse effect
Another option : Only allow to change the initial stability parameters
That's weird
Would be interesting to see what people tweak
Personally I never tweaked anything but I would imagine people only tweak initial stability?
Making the params read only would solve most issues I guess
If Evaluate is already a bit difficult to interpret, imagine those parameters !
Could be present in some "Get Troubleshooting parameters / logs" hidden somewhere
My whole point with evaluate(user_defined.parameters, test_set) was only to find a way to keep parameter tweaking available for all users
Buuuut I don't personally think anyone should tweak them
and if they tweak initial stability, It's not that much of a big deal in terms of logloss/rmse loss
Because you can decrease the RMSE(bins) but increase the log loss at the same time.
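A small made-up example of how the two metrics can disagree. The `rmse_bins` here is a simplified stand-in that compares mean prediction with mean outcome per bin, weighted by bin size; the bin keys are arbitrary labels, not the actual binning used by the benchmark.

```python
import math
from collections import defaultdict


def log_loss(preds, outcomes):
    """Mean binary cross-entropy."""
    return -sum(math.log(p) if y else math.log(1 - p)
                for p, y in zip(preds, outcomes)) / len(preds)


def rmse_bins(preds, outcomes, bin_keys):
    """Simplified RMSE(bins): per-bin calibration error, weighted by bin size."""
    bins = defaultdict(list)
    for p, y, key in zip(preds, outcomes, bin_keys):
        bins[key].append((p, y))
    total = 0.0
    for items in bins.values():
        mean_p = sum(p for p, _ in items) / len(items)
        mean_y = sum(y for _, y in items) / len(items)
        total += len(items) * (mean_p - mean_y) ** 2
    return math.sqrt(total / len(preds))


outcomes = [1] * 8 + [0] * 2         # 8 successes, 2 lapses
bin_keys = ["same_bin"] * 10         # all reviews land in one bin

flat = [0.8] * 10                    # always predicts the bin average
sharp = [0.95] * 8 + [0.35] * 2      # separates successes from lapses

for name, preds in (("flat", flat), ("sharp", sharp)):
    print(name, round(rmse_bins(preds, outcomes, bin_keys), 4),
          round(log_loss(preds, outcomes), 4))
```

Going from `sharp` to `flat` lowers this RMSE(bins) to about zero while the log loss gets worse, since a prediction that only matches the bin average is enough to score well on the binned metric.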
compute optimal retention be like: always 70%
Sorry for being annoying, but I really want to see the results of CMRR with the integral, with different values of offset or whatever you want to call it
We need to decide whether we're keeping CMRR or not
#1282005522513530952 message
I don't understand how #1 would have any value for a user. If you're not evaluating their parameters, what's the point?
So I vote for #2.
[See also: I think the health-check seems silly, and I am dreading it being launched with insufficient testing, because it will be a support nightmare. So I won't be in favor of anything that would reduce the utility of Evaluate to make health-check work.]
@polar maple
the exam is in 50 days
i was thinking about it now
if i should make a priority to have like
75% desired retention and try to crank out
1750 cards in the span of 10 days
because it takes around 10 seconds per card and i see them around 3.4 times to turn into young
almost like 16 hours of new cards
also when exactly does cmrr give me more than 70%
btw how is this graph linear?
shouldnt it be exponential
ah nvm
It's not 100% linear
Also, it's different for different users with FSRS-6
can i make my own graph with my own parameters
i know there was a github link somewhere
Nope
Well, I can, but I never shared the code
That's a different one
is there a
average retrievability to desired retention workload
like
i want to see average retrievability to workload time
https://colab.research.google.com/github/open-spaced-repetition/fsrs4anki/blob/v5.3.3/fsrs4anki_optimizer.ipynb
But it doesn't support FSRS-6 yet
Yes, you can get a graph like that using the Google Colab optimizer, but again, for now it uses FSRS-5
And I don't see why you would want average R instead of DR on the x axis
just need it to explain something to a friend
do i follow the steps from the very beginning of the link
Explain it using this graph, lol
Yeah, it spells everything out, just follow instructions
thank u
i am one of those special people who need step by step instructions for each step
the problem is that the way we currently implement Evaluate is a big no-no in data science. Currently the parameters are evaluated on the same data that they were trained on, but the proper way would be to evaluate the parameters on unseen data. You might be interested in Sound's 3) option
my question is
how do i choose the exact parameter i want it to make it
do i just have to upload the deck
instead of collection
cause i sent in my collection
You can upload a single deck, if that's what you're asking
well cause like my collection has a lot of deck options yk with different parameters
and i want to choose a specific deck option parameter
idk im trying to see
I still don't know what you mean
i have a collection with 5 main deck options:
physiology
biochem
etc etc
i wanna see the graph for the physiology deck options only
not combination of everything
Put all decks that have the "Physiology" preset into one big deck and export that
Hasn't it always been understood that this is bad science to a certain extent? We're asking FSRS to tell us how good of a job it is doing -- like a self-reflection grade. FSRS answers, "when I use this memory model [which I came up with by looking at your review history] on your review history, my predictions are wrong X% of the time." It's not good data science, but it is a good test of whether the model is matching the user (or at least whether FSRS thinks its model is matching the user).
Testing the user's parameters against another user's data seems less helpful. The answer back from FSRS would be, "when I use this memory model on someone's else's review history, my predictions are wrong X% of the time." That seems like a measure of whether the model matches someone else. Why would a user care about that?
You might be interested in Sound's 3) option
Is that this?
- Implement data/test data split, but find a way to evaluate user-defined parameters in Anki (Evaluate) that would run only on test set (All, excluding the cards marked as "trained_set" during the optimize), just like the result of optimize #1282005522513530952 message
Unfortunately, I have no idea what any of that means. π
It means that whether or not the whole set is split into a training/test set is not a reason not to evaluate user-defined parameters.
If the Evaluate button is doing : evaluate=evaluate(optimize(training_set).parameters, test_set), you can also do evaluate=evaluate(user_defined.parameters, test_set).
Of course, it means :
- Having to store what was in the training set, to exclude it from Evaluate when done later.
- Warning the user that he could get a better log loss and/or RMSE by tweaking his parameters, but only because he would be breaking the whole "you train on the train set, you evaluate on the test set" rule
It still makes sense to split because without a split, the optimization might overfit what it sees but fail miserably to generalize to new data. Maybe it's not that much of a problem with FSRS in the first place since the forgetting curve has well-defined properties, but for any scheduling algorithm using things like neural networks, if you have enough parameters, you could end up with very over-specific rules like "if the card has been reviewed 4 times, and the last one was on Saturday, the stability will be 6d", just because it saw one or two cards in that setup.
But if you split train/test for one, you need to split train/test for all, or you're not comparing models with the same set of rules.
For example, right now I trained and tested my parameters on one of my decks, and Evaluate gives:
Log loss: 0.4024, RMSE(bins): 3.15%. Smaller numbers indicate a better fit to your review history.
Now, I train it on a subset, my training set:
Log loss: 0.3512, RMSE(bins): 2.94%. Smaller numbers indicate a better fit to your review history.
I use it on my testing set :
Log loss: 0.4700, RMSE(bins): 12.68%. Smaller numbers indicate a better fit to your review history.
It performs much worse than the first result, which is a sign the first Evaluate was good only because the model was trained on a non-representative class of cards
If my testing set had an optimization made on it directly, I could have gotten:
Log loss: 0.4138, RMSE(bins): 3.80%. Smaller numbers indicate a better fit to your review history.
So the difference between those 2 results shows that by optimizing and testing on the same set I was able to get way better precision, but by cheating, since I know what I'll be tested on
It's helpful though. I've made a function to predict RMSE given the number of reviews and average retention, and we can compare that approximate value to the real RMSE of the user to find out whether he is doing well for his "weight class"
It's like "For your height and weight, your blood pressure is pretty good", if that analogy helps
The real problem is: what do we tell users with "poor" "health"? If someone's RMSE is way higher than what is expected given their retention and n(reviews), what should Evaluate display?
Just "Poor"? Users will complain
A list of possible explanations and advice? Users won't read it and then will complain anyway
(In this case I did the partitioning based on High/Low D so of course the diff is enormous, but if the partitioning is done smartly, like card.id % 10 or something, it should be hopefully less)
Basically, Sound is saying "Let's use all cards whose ID ends with a zero for testing, the rest for training"
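A minimal sketch of that rule, assuming card IDs are plain integers and that "ends with a zero" means `card_id % 10 == 0`. The function names and the example IDs are made up for illustration; nothing here is Anki's or FSRS's actual code.

```python
def is_test_card(card_id: int, n: int = 10) -> bool:
    """Cards whose ID is divisible by n go to the test set (roughly 1/n of them)."""
    return card_id % n == 0


def split_by_card_id(card_ids, n=10):
    """Deterministically split IDs into (train, test) without storing any flag."""
    train, test = [], []
    for card_id in card_ids:
        (test if is_test_card(card_id, n) else train).append(card_id)
    return train, test


# Made-up IDs, for illustration only (real IDs are creation timestamps in ms).
example_ids = [1_700_000_000_000 + 137 * i for i in range(1000)]
train_ids, test_ids = split_by_card_id(example_ids)
```

Because the rule depends only on the ID, running Optimize and Evaluate at different times would still agree on which cards belong to the test set.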
Having such a rule would also make it super easy to know what is part of the training set and what's not, no need to flag
But I have no idea how the card fields are populated and whether card.id mod N is enough to get well-randomized partitions ...
It's Unix timestamps
I think card ID is the epoch ms it was created
Yeah, it's milliseconds elapsed since 01.01.1970 or something
maybe some chaos_function(card.id) mod N would be better
The last digit is as good as random
I guess yes
It doesn't guarantee you have exactly x% sets, I just gave it an example of one way to get a "stable" test set.
I know they say to not create random function based on epoch because if you loop when generating those, you'll get very obvious patterns based on CPU cycles. But here we're talking human creation
Nah, I don't think there are any patterns at the millisecond scale
There may be a pattern because of multiple cards being generated at the same time from a single note.
Or from bulk importing notes/cards
At least to me it looks pretty good
SELECT id % 20 AS mod_result, COUNT(*)
FROM cards GROUP BY mod_result
In SQL querier
hm, I wonder how the card IDs for my deck look like
Cause it initially came to life as import from a CSV file
that import took less than a second
50% of my cards are also one-shot imported though
But it's in millis, so even an import would normally be spread evenly
I think you would have to be unlucky for the import loop to match up with the n you choose, but it could be possible.
Yeah, with the 10k training set we could check whether any collection really diverges too much
1675618557059
1675618557215
1675618559833
1675618567127
1675618567137
are some example card IDs
so did it just count up when collisions happened?
The card IDs are in perfectly ascending order with the WaniKani sort ID
sqlite3 ~/Library/Application\ Support/Anki2/User\ 1/collection.anki2 "SELECT id % 20 AS mod_result, COUNT(*) FROM cards GROUP BY mod_result;"
(The "User\ 1" part might need to be adapted obviously, or the path altogether)
For the health check we have to compare the user's metrics to the values of other users, one way or another, to determine whether this user is doing relatively well or not. There is no absolute standard. Like, you can't say whether 5% RMSE with FSRS-6 is good or not without knowing the values for a ton of users
I think you want an absolute standard, not a relative one
You want a standard that does not depend on data from other users
But we can't do that. Well, we can, but it would be arbitrary
We could just say "RMSE above 10% is bad", without looking at RMSE from lots of users, but that would be kinda dumb
LOL
I asked GPT
For a chaos function
he gave me
SELECT abs((id * 2654435761) % 4294967296) % 20 AS chaos_mod, COUNT(*) FROM cards GROUP BY chaos_mod;
Result ?
What the hell went wrong there
Are hashes, strictly speaking, chaos functions?
but it's true here I merely want those to be distributed
Not intrinsically. But there must be some good, fast, uniform ones used for hash tables.
Cryptographic hash functions would probably be bad because they are deliberately slow.
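For comparison, here is a minimal Python sketch of the same multiplicative hash, using the constant from the SQL query above. One possible explanation for the odd SQL result (an assumption, not checked against the output that was posted) is that `id * 2654435761` exceeds SQLite's 64-bit integer range for millisecond-timestamp IDs, at which point SQLite falls back to floating point. Python integers do not overflow, so the sketch sidesteps that. The name `hashed_bucket` is made up.

```python
KNUTH_MULTIPLIER = 2654435761  # same constant as in the SQL query above
MODULUS_32_BIT = 2 ** 32


def hashed_bucket(card_id: int, n: int = 20) -> int:
    """Scramble the ID with a multiplicative hash, then reduce it to a bucket 0..n-1."""
    return (card_id * KNUTH_MULTIPLIER) % MODULUS_32_BIT % n


# Example card IDs of the kind discussed in this thread.
sample_ids = [1675618557059, 1675618557215, 1675618559833]
sample_buckets = [hashed_bucket(card_id) for card_id in sample_ids]
```

Whether this spreads cards any better than a plain `id % N` would have to be checked on real collections.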
Boah
Might not be necessary
I see there's none builtin in sqlite
SELECT id % 20 AS mod_result, COUNT(*) AS count, ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM cards), 2) AS percent, ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM cards) - 5.0, 2) AS percent_deviation FROM cards GROUP BY mod_result;
For SQL, GPT is quite useful
Well I'm not sure it was worth it to do a fancy 5-percent for percent_deviation LOL
I also see he hardcoded it
super clean code
But yeah, seems mod N is more than good enough if we don't want to do flagging 🤷
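A small sketch of the same distribution check in Python, with the expected share derived from `n` instead of the hardcoded 5.0. It assumes the card IDs have already been read into a list; `bucket_report` is a made-up name.

```python
from collections import Counter


def bucket_report(card_ids, n=20):
    """Return {bucket: (count, percent, deviation from the expected 100/n percent)}."""
    counts = Counter(card_id % n for card_id in card_ids)
    total = len(card_ids)
    expected_percent = 100.0 / n
    report = {}
    for bucket in range(n):
        count = counts.get(bucket, 0)
        percent = 100.0 * count / total
        report[bucket] = (count, round(percent, 2), round(percent - expected_percent, 2))
    return report


# Perfectly even toy input: every bucket should show a deviation of 0.0.
even_report = bucket_report(list(range(200)), n=20)
```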
Btw, while I would prefer removing Evaluate, I think the health check is a step in the right direction: instead of having to wrestle with completely abstract numbers, users will see a nice colorful scale that tells them in plain English whether their numbers are good or not
The numbers will still be displayed, just for reference
I am repeating myself, but the only real problem is what to do with people who fall in the red zone.
FSRS doesn't have any kind of "emergency mode" or whatever
Like, there is no secret button to fix your shit
Well, I guess "Remedy Hard Misuse" is a bit like that
My point is that it's inevitable that some people will have crappy numbers. What's the course of action then?
Someone nice writes a "Reasons why your FSRS evaluation might be bad and what you can do about it" page to put in the manual
🤷‍♂️
left is years, all 0.94
what was wrong with jarrets solution for this with the cost from retention btw?
I can tell you did a great job of explaining it, but unfortunately I still don't get it. It's my deficiency, not yours.
I can't even ask clarifying questions, because I just have no idea what you're explaining to me.
Let's see if we can get there without me understanding it. -- You've seen my reasons for wanting #2.
- Is your #3 a better measure than the current method for the user of how well FSRS is working for them, is matching their memory curve, is predicting the appropriate time to study their cards?
- Can we still describe it that same way in general terms -- how well it's working, matching, predicting?
- Will your #3 run nearly as fast as Evaluate does now?
we can compare that approximate value to the real RMSE of the user to find out whether he is doing well for his "weight class"
Is that better than a simple lower is better? It feels like comparing it to a set scale is going to cause more trouble than number-goes-down=good, number-goes-up=bad.
We could just say "RMSE above 10% is bad", without looking at RMSE from lots of users, but that would be kinda dumb
Are we still using the same "working definition" (not entirely mathematically accurate, blah, blah) of RMSE? So isn't "FSRS makes mistakes scheduling 10% of your cards (or 10% of the time)" objectively bad? I don't need to compare to anyone else's results to figure that out.
Part of the problem is RMSE going down doesn't necessarily = good if the optimiser is cheating (overfitting)
I thought the RMSE-cheating problem got solved ages ago. https://github.com/open-spaced-repetition/fsrs4anki/wiki/The-Metric
I haven't looked at that (before my time active here) but I assume that is cheating in another way because Evaluate still has the problem I'm talking about.
Is there a way to stop the optimizer from cheating-overfitting?
How can a user tell if their optimization has the cheating-overfitting problem? Better question -- Can a user do anything to avoid falling prey to cheating-overfitting parameters?
The user cannot do it easily. You can do something like Sound did where you manually split things for training, but that's not something we should expect a normal user to do.
this is outdated, the new version is still cheatable
The way folks talk about this, it sounds like it's impossible to tell the optimizer not to do this -- don't use this cheat, don't overfit. Is that really not possible?
Maybe a slightly different framing will help:
Imagine I want to teach you how to do addition, but I can only do it by showing you lots of examples e.g. "45 + 22 = 67"
I give you the big book full of examples and let you try to figure out the rules yourself.
Now I want to test how well you learned by asking you questions.
I ask you questions from the book and you do really well so I think my job here is done!
Unfortunately you cheated, you just memorised the examples from the big book, you didn't actually understand addition.
If you later encounter addition problems that were not in the book you do really badly.
This is the overfitting problem. I've taught you to be very good at repeating what you have seen before, but not the general rules that will let you solve novel problems in the future.
Imagine instead I only gave you 4/5 of the book to learn from but kept the last 1/5 of it for myself.
If I later test you using only questions from my part of the book that you have never seen I can get a better idea of if you really understand addition because you cannot have memorised the answers.
The downside to splitting the book is that you will have fewer examples to learn from, so may find it more difficult to learn the rules of addition in the first place. I'll be better at evaluating your performance but your performance might actually be worse (than if you did not cheat with the full book).
This problem of splitting data into train/test possibly reducing performance is why some (Jarrett?) like the idea of the "5 way split" Evaluate as seen in the benchmark:
You keep optimising with all the data and just hope that there is not too much overfitting.
You can get an idea of how well FSRS works in general on your data (but not your specific parameters) by splitting your data into 5 parts then training and testing 5 times choosing a different part as the "test" data each time and average the results.
(N.B. I have not checked if this last part is exactly how the benchmark does it)
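A toy sketch of the 5-way idea, with the same caveat that it may not match how the benchmark actually splits the data. The "model" is a stand-in that just predicts the average pass rate of the training part, and every name is hypothetical; a real version would run the FSRS optimizer where the comment indicates.

```python
import math


def log_loss(outcomes, predictions):
    """Average binary log loss; outcomes are 0/1, predictions are probabilities."""
    eps = 1e-15
    total = 0.0
    for y, p in zip(outcomes, predictions):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(outcomes)


def k_fold_log_loss(outcomes, k=5):
    """Train on k-1 parts, test on the held-out part, repeat k times, average."""
    scores = []
    for fold in range(k):
        test = [y for i, y in enumerate(outcomes) if i % k == fold]
        train = [y for i, y in enumerate(outcomes) if i % k != fold]
        # Stand-in for "optimize on the training part": one constant prediction.
        predicted_pass_rate = sum(train) / len(train)
        scores.append(log_loss(test, [predicted_pass_rate] * len(test)))
    return sum(scores) / k, scores


# Toy review history: 1 = recalled, 0 = forgotten.
toy_outcomes = [1, 1, 1, 0] * 25
mean_score, fold_scores = k_fold_log_loss(toy_outcomes)
```

The averaged score describes how well the method generalizes on this data, not how good one particular set of parameters is.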
rossgb's explanation is great. But just regarding RMSE (bins), it is cheatable in a different way than for what we mean when we are talking about Evaluate so RMSE isn't relevant in this context
The infamous RMSE-BINS-EXPLOIT. Best algorithm of them all 🤣
About the train/test split, here is a common practice: https://www.kaggle.com/c/home-data-for-ml-course/overview/frequently-asked-questions
What’s the difference between a private and public leaderboard?
The Kaggle leaderboard has a public and private component to prevent participants from “overfitting” to the leaderboard. If your model is “overfit” to a dataset then it is not generalizable outside of the dataset you trained it on. This means that your model would have low accuracy on another sample of data taken from a similar dataset.
Public Leaderboard
For all participants, the same 50% of predictions from the test set are assigned to the public leaderboard. The score you see on the public leaderboard reflects your model’s accuracy on this portion of the test set.
Private Leaderboard
The other 50% of predictions from the test set are assigned to the private leaderboard. The private leaderboard is not visible to participants until the competition has concluded. At the end of a competition, we will reveal the private leaderboard so you can see your score on the other 50% of the test data. The scores on the private leaderboard are used to determine the competition winners. Getting Started competitions are run on a rolling timeline so the private leaderboard is never revealed.
You can overfit to the public leaderboard by tuning your model based on the test score and get a bad rank in private leaderboard.
In the case of FSRS and @bold terrace's method #3, you can tune the parameters on test set even if it isn't used for training, and may get worse result in the future.
You can also get into train-test-validation splits because you "taint" any data that you use to twiddle optimisation.
If you are doing good science™ the final evaluation must be on data that has never been used previously.
At least these are my memories of an undergraduate long ago
This is why I got into nice deterministic simulations for my research. The evaluation was much simpler!
In my view, the evaluation only makes sense when we search for a reproducible optimization method. Tuning the parameters by hand is unlikely to be reproducible.
Feel unsatisfied about your parameters? Please challenge the SRS Benchmark!
To be frank the moment you manually edit your params you are fully in "here be dragons" territory and should not expect any built-in help.
Yeah agree and also wonder what people actually tweak. I'd make a bet that it's mostly the initial stability, but I have no proof
And I think most of the time, just to reduce the good/easy initial ones
0.1079, 0.8219, 3.3692, 31.2728, 7.2741, 0.4920, 2.0791, 0.0727, 1.3029, 0.2688, 0.8197, 1.8849, 0.0873, 0.3245, 2.3331, 0.0939, 3.2766, 0.7575, 0.3003, 0.0905, 0.1176
Log loss: 0.3512, RMSE(bins): 2.94%. Smaller numbers indicate a better fit to your review history.
If I tweak them because I fear long first intervals :
1.1079, 1.8219, 1.3692, 1.2728, 7.2741, 0.4920, 2.0791, 0.0727, 1.3029, 0.2688, 0.8197, 1.8849, 0.0873, 0.3245, 2.3331, 0.0939, 3.2766, 0.7575, 0.3003, 0.0905, 0.1176
Log loss: 0.3591, RMSE(bins): 4.21%. Smaller numbers indicate a better fit to your review history.
Soooo ... I'm not against making them read-only, and maybe, for people who actually tweak them, still allowing them to set the 4 initial stabilities? I don't know. I think Evaluate is useful to see how well the model is able to predict your stuff, and it's cool to copy your parameters into a visualizer and simulate some revlog, but I don't know how useful it is to tweak the parameters
But IMO the mod N way to partition the test/training set seems so nice it would be cool to be able to test it
Maybe a better method is to provide different initial parameters. I can calculate the median parameters from collections with high retention and low retention. The latter would have small initial stability.
Do you know if it's because it converges to a different local minimum or is it just a matter of "optimization budget" as it was referenced earlier
But yeah, I'm completely curious to see how some kind of "clustering" can definitely help
I'm also wondering if the decay wouldn't be different between those 2 groups
Since a low decay like 0.1 translates into "It takes a very looong time to get down to 60-70% DR", it might be that the group of users with high DR have a different way of approaching Anki (lots of exposure outside Anki) vs people who mainly use Anki (and don't get a lot of external exposure)
Finished!
lol ! The origin of being a "comment that rubbed you the wrong way"
It's funny how alienation can have different effects on people. Some will shut themselves off, and others, me included, almost get motivated by it
Huh
Have you tried dividing by (t2-t1) just in case? Again, originally the average_forgetting_curve function is supposed to return a number between 0 and 1
This is really strange, I feel like the implementation is wrong somehow
Try dividing and if that still doesn't produce sensible results, show me the Rust code of the integral and I'll try my best to find the problem
I'll re-write some stuff and send you a .docx file later
I think you should write it in very simple layman terms and collapse everything technical, like how you collapsed "Background"
Yep, that was me 🤣
Due to the curse of knowledge, I don't know which terms are technical
Me and Gemini will handle that 🤣
Just so that if people want to talk about FSRS, their messages won't be scattered across different channels and won't be drowned in a sea of other messages
make sense
there probably should be an fsrs channel, to make searching easier
But it's still very hard to dig messages from discord.
I proposed that some time ago, but the mods were like "nah"
we cant make threads in a thread and we cant search in a specific thread either
this is the 1 millionth time someone said this, mods seem not to care though.
Is there actually a learning program like Anki that uses Neural Nets (or AI) as its scheduling algorithm
AI-driven spaced repetition and flashcard generation. Create, study, and master material faster than ever.
I have found this but it seems sketchy
A long time ago I contacted the Dekki guy and suggested that he submit his neural net for our benchmark. Well, he never did
has anyone had a good experience with it?
So it does indeed have a neural net?
Because the whole idea of let the program do the work for you has sold me
Yes
So what is holding Anki from using a neural-net as well
It seems the Dekki guy sees Anki as a competitor and does not want to reveal the works behind his neural net
the master
what can a neural net even do
how much more is there that you can optimize
It can notice weird patterns in your memory
Which would theoretically make it have a pseudo-short term memory model
But I dont know what I am talking about here
All I know is that it notices patterns which would otherwise not be easy to model by mathematical formulae
So I feel quite tempted by it
And then I asked if there are learning programs like it
with neural nets above all
Let's pretend that Alex released his net
Notice how big of a jump it is compared to everything else in that table
ye 2.7 million
I meant log-loss, RMSE and AUC
Other models cannot get below 0.31 log-loss, this gets 0.27
Other models cannot get below 3.5% RMSE, this gets 1.4%
Other models cannot get above 0.73 AUC, this gets 0.82
So what is the hold up? The sync problem I get it but why when other programs like Dekki are doing it 🥲
@polar maple
Well, one of the holdups is that it doesn't have a forgetting curve
I mean, it does, but not as a nice, simple formula. So you can get all kinds of weirdness, like the probability of recall increasing over time and whatnot
And it would be very difficult to calculate an interval that corresponds to a specific probability of recall, for scheduling purposes
And it would be difficult to ensure things like Again <= Hard <= Good <= Easy
The nice thing about FSRS is that predicting the probability of recall and scheduling the next interval are equally easy, but not with this
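To illustrate why both directions are easy with a closed-form curve, here is a sketch that assumes the power forgetting curve used in the Rust snippets in this thread, R(t) = (1 + factor * t / S) ^ decay with factor = 0.9 ^ (1 / decay) - 1 and a negative decay. Solving for t gives the interval directly. The function names and the default decay of -0.2 are assumptions for illustration.

```python
def forgetting_factor(decay: float) -> float:
    """Chosen so that retrievability is exactly 0.9 when t equals the stability."""
    return 0.9 ** (1.0 / decay) - 1.0


def retrievability(t: float, stability: float, decay: float = -0.2) -> float:
    """Predicted probability of recall t days after the last review."""
    return (1.0 + forgetting_factor(decay) * t / stability) ** decay


def interval_for_retention(target_r: float, stability: float, decay: float = -0.2) -> float:
    """Invert the curve: how many days until retrievability drops to target_r."""
    return stability / forgetting_factor(decay) * (target_r ** (1.0 / decay) - 1.0)


# Round trip: schedule for 80% retention, then predict recall at that interval.
interval_80 = interval_for_retention(0.8, stability=10.0)
recall_at_interval = retrievability(interval_80, stability=10.0)
```

A neural net without such a formula would need a numerical search to find the interval, and nothing would guarantee that the search has a single answer.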
Well the thing is, memory is really weird
So weird memory = weird intervals = weird curves
And Dekki seems to be fine
I was just asking for examples of programs with neural nets and it does not seem to be a major con
how are there 2.7 million parameters
so this means that neural net is going to make me a super genius
I'll message the Dekki guy again, maybe he will participate in the benchmark
I really REALLY hope for Anki to have a Neural Net
The only way to come close to match the weirdness of the human memory
start coding
...or not, Reddit just doesn't load chat
F***** me
is there a way to see average retrievability for a specific day
now i have this, only for today, but is there a way i can see what it would be like in 5 days or 6 days
if i didnt review the deck
@cosmic hedge sounds like someone wants a "Memorized Over Time" graph natively in Anki
(and that someone is not just me)
because my plan is to only do filtered decks for cards under 90% average retrievability the day before the exam
i can do that so retrievability/cards right
yes
so like
today its 93% but i want to see
if i dont do reviews, what would it be tomorrow, or in 5 days
Also
try the simulator with a review limit of 0
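Outside the simulator, the "what if I skip reviews for N days" number can be roughed out by hand. This is only a sketch: it assumes the power forgetting curve from the Rust snippets in this thread, a single decay of -0.2 for every card, and that each card's stability and days since last review are known. All names are made up.

```python
def retrievability(t: float, stability: float, decay: float = -0.2) -> float:
    """Power forgetting curve: equals 0.9 when t equals the stability."""
    factor = 0.9 ** (1.0 / decay) - 1.0
    return (1.0 + factor * t / stability) ** decay


def projected_average_r(cards, days_ahead: float, decay: float = -0.2) -> float:
    """Average retrievability days_ahead from now if nothing gets reviewed.

    cards is a list of (days_since_last_review, stability) pairs.
    """
    total = sum(
        retrievability(elapsed + days_ahead, stability, decay)
        for elapsed, stability in cards
    )
    return total / len(cards)


# Made-up cards: (days since last review, stability in days).
toy_cards = [(1.0, 5.0), (3.0, 30.0), (10.0, 100.0)]
average_today = projected_average_r(toy_cards, 0)
average_in_5_days = projected_average_r(toy_cards, 5)
```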
i'll get to it but last time i tried it with 365 days it didnt do anything
does it use the parameters of the deck im in
or of the cards of whatever deck they are in
the simulator runs on presets
Alright, but there is no way 0.94 for everything is correct. I ran it with the Python simulator (with a simplified config) and it sure as heck wasn't maxing out
i can try plot it at some point if you want?
what settings did you use to simulate it?
Plot what?
could you screenshot them?
the integral/workload
what does the reviews graph look like bc that is weird
what does your card stability graph look like?
I tried the Python simulator with the integral over the next FIVE THOUSAND YEARS and still got 70%
try simulate more than 30 days
```rs
pub fn average_f_power_forgetting_curve(
    learn_span: usize,
    cards: &[Card],
    decay: f32,
) -> f32 {
    let factor = 0.9_f32.powf(1.0 / decay) - 1.0;
    let exp = decay + 1.0;
    let den_factor = factor * exp;
    // Closure equivalent to the inner integral function
    let integral_calc = |card: &Card| -> f32 {
        // Performs element-wise: (s / den_factor) * (1.0 + factor * t / s).powf(exp)
        let t1 = card.last_date - learn_span as f32;
        let t2 = t1 + 365.;
        (card.stability / den_factor) * (1.0 + factor * t2 / card.stability).powf(exp) -
        (card.stability / den_factor) * (1.0 + factor * t1 / card.stability).powf(exp)
    };
    // Calculate integral difference and divide by time difference element-wise
    cards.iter().map(integral_calc).sum::<f32>()
}
```
if you want to check it
given your stabilities that seems accurate to me
i see
i guess i am doubting myself
can we bring back decimal desired retention
Give me an example output using some S and some t1 and t2 and decay =-0.2
only slightly jealous
paste the code above this into here #1282005522513530952 message
Already did
wait i forgot to fix it if you copied it quickly copy it again
wait no
hold on
I just want you to give me the output for some input S, t1, t2, decay so that I can verify the math
i changed it again it should work now
wut
?
Is your t1 negative?
Or what is going on there?
I'm trying to figure out what this could mean, and I can't
t1 is just time since the last review of this card
And I can't reproduce your number, btw
Ok, yeah, so your t1 is negative
Though I doubt that's the reason why you're getting 94% every time
I have no idea how you're getting 94%
Let's try to do this as properly as possible:
- No negative t1, it's the number of days since the last review
-
```python
def average_f_power_forgetting_curve(t1, t2, s, decay):
    if not t2 > t1:
        raise ValueError("t2 must be greater than t1")
    # Calculate F(t2) - F(t1) where F is the antiderivative
    integral = integral_power_forgetting_curve(t2, s, decay) - integral_power_forgetting_curve(t1, s, decay)
    print(f'Raw integral={integral:.5f}')
    # Divide it by the difference in time to get the average
    return integral / (t2 - t1)
```
Divide by t2-t1. If the integral is over the next 365 days, divide by 365. If it's over the next 1825 days, divide by 1825, etc. Aka ensure that the output is between 0 and 1
I just want to confirm that you get 94% even if everything is exactly as intended, no cutting corners
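The Python snippet above calls `integral_power_forgetting_curve` without showing it. Here is a self-contained sketch that assumes the same antiderivative as the Rust code, F(t) = S / (factor * (decay + 1)) * (1 + factor * t / S) ^ (decay + 1). The name `average_retrievability` and the default decay of -0.2 are illustrative assumptions, and the function handles a single card.

```python
def integral_power_forgetting_curve(t: float, s: float, decay: float) -> float:
    """Antiderivative F(t) of R(t) = (1 + factor * t / s) ** decay."""
    factor = 0.9 ** (1.0 / decay) - 1.0
    exp = decay + 1.0
    return (s / (factor * exp)) * (1.0 + factor * t / s) ** exp


def average_retrievability(t1: float, t2: float, s: float, decay: float = -0.2) -> float:
    """Mean retrievability of one card between t1 and t2 days after its last review."""
    if t1 < 0:
        raise ValueError("t1 is days since the last review and cannot be negative")
    if not t2 > t1:
        raise ValueError("t2 must be greater than t1")
    integral = (
        integral_power_forgetting_curve(t2, s, decay)
        - integral_power_forgetting_curve(t1, s, decay)
    )
    # Dividing by the window length keeps the result between 0 and 1.
    return integral / (t2 - t1)


# One card with stability 100 days, averaged over the next 365 days.
one_year_average = average_retrievability(0.0, 365.0, 100.0)
```

For many cards, the per-card averages would be summed, as in the Rust version, or averaged, depending on whether a count of remembered cards or a fraction is wanted.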
I sent them an email
But I'm like 90% sure they won't participate in Jarrett's benchmark
now its 0.7 again
```rs
pub fn average_f_power_forgetting_curve(
    learn_span: usize,
    cards: &[Card],
    decay: f32,
) -> f32 {
    let factor = 0.9_f32.powf(1.0 / decay) - 1.0;
    let exp = decay + 1.0;
    let den_factor = factor * exp;
    let offset = 365. * 10.;
    // Closure equivalent to the inner integral function
    let integral_calc = |card: &Card| -> f32 {
        // Performs element-wise: (s / den_factor) * (1.0 + factor * t / s).powf(exp)
        let t1 = learn_span as f32 - card.last_date;
        let t2 = t1 + offset;
        (card.stability / den_factor) * (1.0 + factor * t2 / card.stability).powf(exp) -
        (card.stability / den_factor) * (1.0 + factor * t1 / card.stability).powf(exp)
    };
    // Calculate integral difference and divide by time difference element-wise
    cards.iter().map(integral_calc).sum::<f32>() / offset
}
```
so was the problem you had with Jarrett's cost by retention was that the numbers were too arbitrary or something?
Yeah. Time per answer as a function of R would be nice, but his solution wasn't really that
We can do it properly though
But first I'd like you to try 1/5/10/50 years again with this code and report the values
i just tried another deck with the 10 years one
i think basing anything off of what happens in 10 years time might be slightly insane already though