#FSRS Megathread


cosmic hedge
#

It won't be that bad, right? 👀

quasi shadow
#

It's tricky, I think.

cosmic hedge
#

Sounds fun 😂 😭

quasi shadow
#
    def stability_short_term(self, state: Tensor, rating: Tensor) -> Tensor:
        new_s = (
            state[:, 0]
            * torch.exp(self.w[17] * (rating - 3 + self.w[18]))
            * torch.pow(state[:, 0], -self.w[19])
        )
#

It depends on Stability.

#

So, this function needs a refactor.

#

But, the previous simplification doesn't work for the new formula.

cosmic hedge
#

Why doesn't it?

quasi shadow
#

For example, with the old formula, the S0(G=1) = w[0], right?

#

Then, you grade good in the same day, S0' = S0 * e^(w[17]*w[18]).

#

You grade good twice, S0'' = S0' * e^(w[17]*w[18]).

#

So, S0'' = w[0] * e^(w[17]*w[18]) * e^(w[17]*w[18]) = w[0] * e^(2*w[17]*w[18]).

#

The two reviews have the same multiple on previous stability.

#

Even if you give other grades, they can be composed easily.

#

So, we can simplify the simulation of short-term reviews.
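
A minimal sketch of the composition argument above, assuming the old same-day update is S' = S * e^(w[17] * (G - 3 + w[18])) as written in this thread. The parameter values and function names are illustrative and not taken from the optimizer.

```python
import math

# Illustrative values only; real w[17] and w[18] come from optimization.
W17, W18 = 0.5, 0.1


def short_term_old(s: float, rating: int) -> float:
    """Old same-day update: multiply S by a factor that depends only on the grade."""
    return s * math.exp(W17 * (rating - 3 + W18))


def apply_reviews(s: float, ratings: list[int]) -> float:
    """Apply the update once per same-day review, in order."""
    for rating in ratings:
        s = short_term_old(s, rating)
    return s


def closed_form(s: float, n_reviews: int, mean_rating: float) -> float:
    """Same result computed from just the review count and the mean grade."""
    return s * math.exp(n_reviews * W17 * (mean_rating - 3 + W18))


ratings = [3, 1, 4]
step_by_step = apply_reviews(2.0, ratings)
shortcut = closed_form(2.0, len(ratings), sum(ratings) / len(ratings))
```

Because the exponents simply add up, `step_by_step` and `shortcut` agree, which is why the simulator only needs the number of same-day reviews and their mean grade.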

cosmic hedge
#

Oh I think I see, so we would need a loop in the short term function?

quasi shadow
#

The simplification doesn't need the probabilities of each grade of the short-term reviews.

quasi shadow
#

It's easy to get these stats from the revlog.

cosmic hedge
#

wait no

quasi shadow
#

The Markov chain is used to describe the distribution of the next grade based on the last grade.

cosmic hedge
#

Like in your paper?

quasi shadow
#

Nope. The state in my paper is (S, D).

#

But we don't have a short-term memory model.

#

So... We can use the last grade as the state.

#

OK, it has... 16 parameters!

#

Last Grades: 1, 2, 3, 4
Next Grades: 1, 2, 3, 4

#

4 * 4 = 16 possibilities.

#

😂 And we need to implement (re)learning steps in the simulator.

cosmic hedge
#

Oh right XD

#

I spent too long trying to see why only the last grade affected it XD

quasi shadow
#

We only need P(G_{i}|G_{i−1}).

#

It's complex to calculate P(G_{i}|G_{i−1},G_{i−2}).
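
A rough sketch of how the first-order counts could be collected, assuming the revlog has already been split into per-card, per-day lists of grades. The `sessions` data and the function name are hypothetical.

```python
from collections import Counter


def count_transitions(sessions: list[list[int]]) -> list[list[int]]:
    """Return a 4x4 table where counts[last - 1][nxt - 1] is how often
    grade `nxt` directly followed grade `last` within the same session."""
    pair_counts = Counter()
    for grades in sessions:
        for last, nxt in zip(grades, grades[1:]):
            pair_counts[(last, nxt)] += 1
    return [
        [pair_counts[(last, nxt)] for nxt in range(1, 5)]
        for last in range(1, 5)
    ]


# Hypothetical same-day sessions: one list of grades per card per day.
sessions = [[1, 3, 3], [3, 4], [1, 1, 3]]
counts = count_transitions(sessions)
```

Conditioning on the two previous grades as well would turn the 4 x 4 table into 4 x 4 x 4 = 64 cells, so each cell would see far less data.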

cosmic hedge
#

Right!

lapis hearth
# quasi shadow We only need P(G_{i}|G_{i−1}).

Hey Jarrett. I know this may not be on your radar at the moment, but when would you consider letting FSRS schedule intervals <13m? I believe the reason you gave for the current behavior was the desired retention.

Ideally the interval should get ever smaller the more I press Again, and then I would start to work my way back up.

unique salmon
#

Yeah, that's also why I don't want to mess around with it
The current short-term formula has a very convenient property: if you want to know how multiple same-day reviews will affect S, you can just plug the average number of same-day reviews and the average grade of those reviews into it. You don't need to account for every review individually.

#

Making that formula more complex could easily break that property. Then simulations will become a huge pain.
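
A small sketch of why the extra factor breaks that property, assuming the new update is S' = S * e^(w[17] * (G - 3 + w[18])) * S^(-w[19]) as in the snippet posted earlier in the thread. The parameter values are illustrative, not fitted.

```python
import math

W17, W18, W19 = 0.5, 0.1, 0.2  # illustrative values, not fitted parameters


def short_term_new(s: float, rating: int, w19: float = W19) -> float:
    """Same-day update with the additional S^(-w19) factor."""
    return s * math.exp(W17 * (rating - 3 + W18)) * s ** (-w19)


def apply_reviews(s: float, ratings: list[int], w19: float = W19) -> float:
    """Apply the update once per same-day review, in order."""
    for rating in ratings:
        s = short_term_new(s, rating, w19)
    return s


# Same grades, different order: the results differ once w19 is non-zero.
again_then_easy = apply_reviews(2.0, [1, 4])
easy_then_again = apply_reviews(2.0, [4, 1])

# With w19 = 0 the extra factor disappears and the order stops mattering.
no_decay_a = apply_reviews(2.0, [1, 4], w19=0.0)
no_decay_b = apply_reviews(2.0, [4, 1], w19=0.0)
```

Since each step now depends on the current S, the count and mean grade alone no longer determine the result, so every same-day review has to be simulated individually.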

quasi shadow
lapis hearth
quasi shadow
#

But I need to benchmark it.

lapis hearth
cursive badge
#

Nothing interesting to report. I have not spent any time on it this week.
I have a quick play with leeches every now and again between other projects. I wasn't planning to work on it continuously for now.
I just shared my work this time to make it easier for others who were interested in playing with the Poisson Binomial idea too.

unique salmon
#

Would be cool if Jarrett implemented it in the Helper add-on, for convenience

cursive badge
#

It's interesting for helping pick out potential leeches but it still doesn't feel like something I would trust to automatically suspend cards without me checking.

unique salmon
#

Btw, a reminder: #1282005522513530952 message

unique salmon
cursive badge
#

Maybe I didn't push it? I'll check.

unique salmon
#

IIRC you only tested it on your own

cursive badge
#

I think the only bit we discussed that I didn't push was the separate "set" vs "unset" thresholds because I was just hard-coding it to experiment instead of implementing it nicely.

unique salmon
#

Ah, ok

#

Hm. I'll see what I can do about it later

#

Maybe I'll use two thresholds, that's a good idea

#

Actually, can you share your code with two thresholds? 😅

cursive badge
# unique salmon Actually, can you share your code with two thresholds? 😅

This was what I was last playing with (N.B. reset_t):

def _classify_incrementally(
    trials_data: TrialsData,
    initial_threshold: float,
    threshold_fn: ThresholdFn,
) -> tuple[bool, dict]:
    leech_data = _calculate_incremental_leech_probabilities(
        trials_data=trials_data,
        initial_threshold=initial_threshold,
        threshold_fn=threshold_fn,
    )

    triggered_at_least_once = False
    last_triggered = False
    curr_triggered = False
    crossover_count = 0

    crossover_idxs = []
    # Mark as leech as soon as we see it drop below the threshold
    for i in range(leech_data.n_trials):
        p = leech_data.probabilities[i]
        t = leech_data.thresholds[i]

        reset_t = 0.2

        if p < t:
            triggered_at_least_once = True
            curr_triggered = True
        elif p > reset_t:
            curr_triggered = False

        if curr_triggered != last_triggered:
            crossover_count += 1
            crossover_idxs.append(i)

        last_triggered = curr_triggered

    metadata = {
        "crossover_count": crossover_count,
        "crossover_idxs": crossover_idxs,
    }

    return triggered_at_least_once, metadata
unique salmon
cursive badge
# unique salmon I want to print the average n crossovers at the end, can you help with that?
diff --git a/src/leechkit/__main__.py b/src/leechkit/__main__.py
index ea0d1bf..b31e047 100644
--- a/src/leechkit/__main__.py
+++ b/src/leechkit/__main__.py
@@ -46,6 +46,7 @@ def main(
 
     selected_card_ids: Sequence[CardId] = col.find_cards(query=query)
     leech_count = 0
+    total_crossover_count = 0
 
     print("")
     print("[bold]Searching for leeches[/bold]")
@@ -86,6 +87,9 @@ def main(
                     highlight=False,
                 )
 
+                if incremental_check:
+                    total_crossover_count += metadata["crossover_count"]
+
                 leech_count += 1
 
             progress.update(task, advance=1)
@@ -94,5 +98,8 @@ def main(
     print(f"Processed {len(selected_card_ids)} cards")
     print(f"Found {leech_count} leeches")
 
+    if incremental_check and leech_count > 0:
+        print(f"Mean crossover count {total_crossover_count / leech_count:.2f}")
+
 
 typer.run(main)
wind palm
#

The most likely reason is different parameters (have you reoptimized?) since those cards were last scheduled.

How long ago is considered "recently reviewed?"
I expect that is the same as whatever you have configured at the top level (perhaps just using abbreviated text for that per-deck menu?). The default is 7 days.

rotund summit
unique salmon
hasty fractal
#

@quasi shadow are we able to look at historical R of cards now? I think there was a PR from u.

#

can we do this now maybe?

unique salmon
# cursive badge ```diff diff --git a/src/leechkit/__main__.py b/src/leechkit/__main__.py index e...

@polar maple @quasi shadow if Mean cross-over count is exactly 1.00, that means that once a card is a leech, it never recovers. If our detector was perfect AND difficult cards never became easy, that would be good, but it's not perfect, so we need to allow cards to go back to not being leeches.

I set the threshold to 3.5% and the second threshold to 40%. I think 1.20 is a reasonable amount of cross-overs, maybe we could even allow more.

On the second image the first threshold is also 3.5%, but the second threshold is 25%. This means that a card needs to fall below 3.5% to be considered a leech, and rise above 25% to stop being considered a leech. This results in the average of 1.39, which is good IMO. But again, it depends on how often difficult cards become easy. If it almost never happens, then this amount of cross-overs is too high.
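
A stripped-down sketch of the two-threshold idea described above, assuming one probability per review. It is not the leechkit code: it tracks only the current state and the number of state changes, and the `history` values are made up.

```python
def count_crossovers(probabilities, set_threshold, unset_threshold):
    """Track leech status with hysteresis and count how often it flips."""
    is_leech = False
    crossovers = 0
    for p in probabilities:
        if p < set_threshold:
            new_state = True       # fell below the "set" threshold
        elif p > unset_threshold:
            new_state = False      # recovered above the "unset" threshold
        else:
            new_state = is_leech   # in between: keep the previous state
        if new_state != is_leech:
            crossovers += 1
        is_leech = new_state
    return is_leech, crossovers


# Made-up probabilities for one card, one value per review.
history = [0.50, 0.03, 0.10, 0.30, 0.02]
final_state, crossovers = count_crossovers(
    history, set_threshold=0.035, unset_threshold=0.25
)
```

Raising the second threshold makes it harder for a card to stop being a leech, which pushes the mean crossover count towards 1.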

polar maple
#

especially for the calibration part

unique salmon
#

Hopefully Jarrett will do it 😅

bold terrace
#

I added a --max-reviews for the leech detector, because it really doesn't make that much sense for me to have past leech detection being reported.

I tried a lot of different setups, and something that I was OK with was something like

uv run -m leechkit '/Users/jschoreels/Library/Application Support/Anki2/User 1/collection.anki2' --query "Yomitan" --flag --write --leech-threshold 0.01 --max-reviews 5 --incremental-check --tag leech-max-5-threshold-0.01

I got only 2 results out of thousands of cards, but to me the goal is not really to detect many, just the most problematic ones right now that might need extra care.

https://github.com/rbrownwsws/leechkit/pull/7

unique salmon
#

https://forums.ankiweb.net/t/desired-retention-ui-overhaul/57678
Meanwhile I wrote this. I really hope everyone in this channel will read it

unique salmon
#

The laziest issue I've ever made 🤣

cursive badge
# unique salmon Alright, I made one https://github.com/open-spaced-repetition/srs-benchmark/issu...

You might want to precompute a dataset based on Anki 10k where you annotate all reviews with retrievability rather than trying to shoehorn leech detecting into the benchmark code.
Last time I tried playing with the 10k dataset it took my computer a few seconds per preset just to optimise FSRS params.
It would probably be painfully slow to try out lots of different leech detection settings / algorithms.

unique salmon
unique salmon
# quasi shadow 4 * 4 = 16 possibilities.

This is probably the best we can do without a short-term memory model 👍
This would be a really good change, since then we could try more complex formulas for short-term SInc

#

So to sum it up for other people who might be lost:
Right now we just take the average number of same-day reviews and the average grade of those reviews to do simulations.

Jarrett proposed a more complex approach: while we don't have a short-term memory model for predicting the probability of recall, we can instead use these 16 probabilities, all for same-day reviews:

  1. Probability of Again if the previous grade was also Again
  2. Probability of Again if the previous grade was Hard
  3. Probability of Again if the previous grade was Good
    ...
  15. Probability of Easy if the previous grade was Good
  16. Probability of Easy if the previous grade was also Easy
unique salmon
#

Idk if that would be beneficial though

#

Also, I guess we don't actually need the probability of G given that last G=Easy, because if last G=Easy, the next review cannot be a same-day review in Anki. If you press Easy, the next interval is never <1d
So it's actually 12 probabilities. Well, 24 if you want to do learning and re-learning separately

#

To be clear, this is still an oversimplified model and not a proper way to predict the probability of recall for same-day reviews. It assumes that p(Grade_1 | Grade_2) is a constant that doesn't depend on interval lengths or anything else.

But it's way better than the current approach

quasi shadow
#

Because the sum of the probabilities of the four grades is always 100%.

quasi shadow
#

I will add two new fields into button_usage.jsonl.

#
{"user": 9, "size": 4729, "first_rating_prob": [0.2364, 0.0993, 0.6349, 0.0294], "review_rating_prob": [0.0234, 0.973, 0.0036], "learn_costs": [52.86, 56.85, 12.46, 8.06], "review_costs": [39.19, 16.6, 6.44, 49.01], "first_rating_offset": [-1.62, -0.3, -0.13, 0.0], "first_session_len": [2.73, 1.64, 0.83, 0.0], "forget_rating_offset": -0.74, "forget_session_len": 1.22, "short_term_recall": [0.7424, 0.9816, 0.953, 0.7244], "learning_step_transition": [[661, 7, 1068, 83], [12, 81, 337, 3], [292, 53, 2677, 34]], "relearning_step_transition": [[64, 0, 148, 0], [0, 0, 0, 0], [0, 0, 0, 0]]}
#

learning_step_transition and relearning_step_transition.

#

[[661, 7, 1068, 83], [12, 81, 337, 3], [292, 53, 2677, 34]] means that during learning, when the last grade is again, the next same-day review is again 661 times and good 1068 times.

#

There are 2677 good in the next same-day review when the last grade is good.

#

Then we can calculate the Transition probability matrix from it.

#

It's convenient to apply any kind of smoothing method to the raw data.
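
A minimal sketch of turning those counts into a transition probability matrix. Additive (Laplace-style) smoothing is used here only as one example of a smoothing method; the thread does not say which one is intended, and the function name is made up.

```python
def transition_probabilities(counts, alpha=1.0):
    """Turn rows of next-grade counts into probabilities with additive smoothing."""
    matrix = []
    for row in counts:
        total = sum(row) + alpha * len(row)
        if total == 0:
            # No data and no smoothing: fall back to a uniform row.
            matrix.append([1.0 / len(row)] * len(row))
        else:
            matrix.append([(c + alpha) / total for c in row])
    return matrix


# Rows: last grade Again, Hard, Good. Columns: next grade Again, Hard, Good, Easy.
learning_counts = [[661, 7, 1068, 83], [12, 81, 337, 3], [292, 53, 2677, 34]]
relearning_counts = [[64, 0, 148, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

raw = transition_probabilities(learning_counts, alpha=0.0)
smoothed = transition_probabilities(relearning_counts, alpha=1.0)
```

Smoothing matters most for rows like the all-zero relearning rows above, where the raw counts alone give no usable distribution.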

quasi shadow
quasi shadow
pulsar crow
#

If I set desired retention to, for example, 88%, which stat should I look at to see if I'm reaching that goal or not? Currently I'm looking at this graph on mature cards:
Is that correct?

quasi shadow
pulsar crow
#

Hmm mine looks different

#

But okay thank you

quasi shadow
#

OK, I did it.

#

Claude helped me replace all Numpy stuff with standard lib!

pulsar crow
#

In the manual for FSRS they recommend setting the relearning steps blank because seeing the same card several times a day is not that efficient. Which setting is that?

Is it this one?
Because earlier the manual said to put it to 1m 10m

cosmic hedge
cosmic hedge
quasi shadow
#

You can play with it.

#

The bug won't occur if you don't build Anki from the latest source code.

pulsar crow
cosmic hedge
quasi shadow
cosmic hedge
#

well this gives me confidence that my implementation wasn't wrong when I tried it at least.

quasi shadow
#

OK, I submitted the PR to fix the bug.

#

Btw, the detector will cause lag in Anki 25.02.

#

OK, now we have the transition matrix calculated from 10k collections.

#

😎 @cosmic hedge would you like to work on the refactor of the simulator?

cosmic hedge
#

bear in mind you may have to swoop in and save me when I can't manage it 😂

quasi shadow
#

Some traps appear in my mind:

#
  1. ensure the short-term simulation could end in finite epochs.
#
  2. the cost of short-term reviews should be refactored, too.
cosmic hedge
quasi shadow
#

Right now we use the sum of the durations of the (re)learning reviews as the cost.

#
    cost_dict = (
        df.groupby(by=["first_state", "first_rating"])["sum_duration"]
        .median()
        .to_dict()
    )
#

For example, the forget_cost is the sum of duration of the first review entry with rating again and the remaining relearning entries.

#

In the refactored framework, each relearning entry has its own duration.

#

This duration may depend on the rating.

cosmic hedge
#

does it mean you have to re-run the anki buttons script again though? 😂

quasi shadow
#

Yep. I have to...

#

But you can use mock values for development.

cosmic hedge
#

so just to confirm we will have relearning-costs-median?

quasi shadow
#

😅 State: Learning, Review, Relearning

#

3 * 4 = 12 costs!
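
A sketch of what a per-entry cost table could look like, assuming each revlog entry carries its own state, rating and duration. The rows below are made up, and the median is used only because the existing `cost_dict` snippet in this thread uses it.

```python
from collections import defaultdict
from statistics import median


def median_costs(reviews):
    """Group review durations by (state, rating) and take the median of each group."""
    groups = defaultdict(list)
    for state, rating, duration in reviews:
        groups[(state, rating)].append(duration)
    return {key: median(values) for key, values in groups.items()}


# Hypothetical revlog rows: (state, rating, duration in seconds).
reviews = [
    ("learning", 1, 20.0), ("learning", 1, 30.0), ("learning", 3, 8.0),
    ("review", 3, 6.0), ("review", 3, 7.0), ("review", 3, 12.0),
    ("relearning", 1, 25.0),
]
costs = median_costs(reviews)
```

With real data every combination of the 3 states and 4 ratings would get its own entry, giving the 12 costs mentioned above.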

cosmic hedge
# quasi shadow

wow the state affects the duration more than I thought 😳

quasi shadow
cosmic hedge
unique salmon
cosmic hedge
#
    relearn_costs = np.array([1, 2, 3, 4])
    relearn_chances = np.array(
        [
            [0.3, 0.05, 0.5, 0.15],
            [0.3, 0.05, 0.5, 0.15],
            [0.3, 0.05, 0.5, 0.15],
            [0.3, 0.05, 0.5, 0.15],
        ]
    )
    MAX_RELEARN_STEPS = 5

    # learn_state: 1: Learning, 2: Review, 3: Relearning
    def stability_short_term(s: np.array, init_rating=None):
        def step(s, next_weights):
            rating = np.random.choice(relearn_costs, p=next_weights)
            new_s = (
                s
                * (math.e ** (w[17] * (rating - 3 + w[18])))
                * (s ** -w[19])
            )

            return (new_s, rating)

        def loop(s, init_rating):
            i = 0
            consecutive = 0
            rating = init_rating or 1
            while i < MAX_RELEARN_STEPS and consecutive < 2 and rating < 4:
                (s, rating) = step(s, relearn_chances[rating - 1])
                i += 1
                if rating > 2:
                    consecutive += 1
                else:
                    consecutive = 0

            return s

        if len(s) != 0:
            new_s = np.vectorize(loop)(s, init_rating)
        else: 
            new_s = np.array([])

        return new_s
I may be completely wrong. Is this at least close to what you had in mind?
not sure how we'd get the costs implemented with this though
unique salmon
#

@quasi shadow after beating Claude with a stick multiple times I just barely managed to make it write a fully "torched" version of my Poisson binomial approximation
It still has issues though: unused variables and one last "if" instead of "torch.where". No matter how many times I try, it always either has unused variables or one "if", I just can't get it to finish this properly.

I really hope you can implement this directly in FSRS itself. In other words, if Claude did 80% of the work, I hope that you can do the remaining 20%

quasi shadow
#

I have implemented it in the helper add-on with standard lib.

unique salmon
#

Because I want to use it in FSRS itself, as a new variable

quasi shadow
#

Do you mean add a new state variable?

unique salmon
#

Yes
I'm making a github issue right now

#

Btw, I have not checked whether this code even produces the correct results 🤣
I mean, whether the results match the non-torch implementation

#

I guess I should give you the numpy implementation, for comparison

#

I added the numpy code in the github issue

#

They work quite differently (calculating the PMF vs CDF, returning p vs log(p)), but hopefully it's still useful

quasi shadow
#

I still don't understand.

#

Is it used to predict stability?

unique salmon
unique salmon
#

So now we just need to implement it inside FSRS 😅

cursive badge
# unique salmon Ok, yeah, I checked it - it works fine. The outputs match the numpy version

fyi hypothesis can be nice for checking things like this: https://hypothesis.works/

I used it to double check that the PMF implementation you gave me matched what I had been getting from SciPy:

https://github.com/rbrownwsws/leechkit/blob/master/tests/test_fast_pbd.py

unique salmon
#

Damn kids and their fancy frameworks! I just manually type in the inputs and then eyeball the outputs!

unique salmon
#

Actually, that's not correct either, since Easy always graduates the card, so you need to stop counting "consecutive" after Easy

#

Oh man, this will be a pain. Thankfully, not for me 🤣

#

Btw @quasi shadow will the real number of re-learning steps be used?

quasi shadow
polar maple
# quasi shadow Is it used to predict stability?

the idea is that if this leech detector signal (I'll call it historical likelihood) actually provides more predictive power than what is provided by raw DSR from FSRS then there should be some formulas that incorporate historical retention to improve FSRS predictions, otherwise what is the point of using historical likelihood for leech detection if it has no predictive power? ie why not just use D and S directly instead?

unique salmon
ashen light
#

out of curiosity, what prompted this poll

robust hill
#

death

unique salmon
ashen light
#

oh just another "expertium hates features" poll

robust hill
#

real

#

where do we see how long it takes to mature a card

#

if its possible

ashen light
#

why not just leave the answer graph but remove the %s from the labels

#

people who want to use it as a count of presses still get it and the ability to confuse it for any sort of % metric goes away

#

(no ones gonna be crunching those numbers themselves when theres readily available %s next to it)

#

perhaps I should post that in the thread instead of here no one reads this channel

ashen light
#

wow confidentiality breach

ashen light
#

gg

night saffron
#

that's a gigachad option

pulsar crow
#

How does FSRS handle difficulty in cards, I might have one deck with difficult words and another with easy words, will the algorithm adjust to that?

unique salmon
pulsar crow
#

Does it require that I've had a similarly difficult deck before?

unique salmon
hasty fractal
unique salmon
#

@ashen light remind me, how exactly load balancer deals with siblings?

  1. Do you need to enable "Bury review siblings"?
  2. Do you need to enable "Bury interday learning siblings"?
  3. Does it just always try to avoid siblings regardless of these settings?
#

It just says "if a note_id is provided, it attempts to avoid placing a card on a day that already has that note_id (aka avoid siblings)", but it's not clear how that is related to these settings

#

Perhaps I should PR just to add a comment about which of these three toggles affects load balancer. Except I need Jake to tell me that 😭

ashen light
#

bury review siblings turns on the lb's sibling splitter

quasi shadow
#

Does anyone want to develop a standalone add-on for leech detector?

#

The helper add-on is too complex now.

#

😂 And the leech detector code is not written by me.

#

I made this PR just to verify the feasibility.

lapis hearth
# quasi shadow 😂 And the leech detector code is not written by me.

If there is going to be an automatic leech detector, I would really REALLY hope for an in-reviewer indicator that shows whether the card's memory state is moving for the worse or the better as I keep on reviewing, like a green up arrow, a red down arrow or a grey dash (whether average stability over time is generally decreasing or increasing, or something like that). This would make it much easier to decide what I should do with some of these godforsaken cards that eat up 80% of my time.

lapis hearth
#

Just my thought

pulsar crow
#

To get the best fit for the data I guess it would be best to have a separate preset for each deck?

bold terrace
# quasi shadow 😂 And the leech detector code is not written by me.

No worries and sorry for the confusion 🙂

Through my tests with leechkit, I'm starting to see @polar maple's point though. It's maybe a nice reflective tool to mark old cards that had a "highly improbable failure rate at some point", but I'm not sure it's really that useful if it doesn't bring much predictive power.

When I started playing with --max-reviews and very very low threshold (.01), I started to get something I was a bit more expecting and that could be translated as : "Cards that just got very improbable failure streak". This might have a bit more prediction value (even than D/S) since it might also be an indicator that "You're now outside predicted model in a negative way", which could also predict unexpected failure afterwards.

unique salmon
bold terrace
#

And to anticipate @polar maple about using only D/S : Very high D, low S, can be perfectly non-leech in a sense that they still fit perfectly FSRS prediction (Let say R=50% every day, you fail them every other day, it's within the realm of FSRS prediction). But if you fail it at 200d stability with 90% R, 30d with 90% R, 10d with 90%, then something is fishy
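
A small sketch of the arithmetic behind that example: how likely a given number of passes is if every predicted R was right. It uses a textbook dynamic-programming PMF for the Poisson binomial distribution and is not necessarily how leechkit computes it; the function names are made up.

```python
def poisson_binomial_pmf(probabilities):
    """pmf[k] = probability of exactly k successes among independent trials
    with the given per-trial success probabilities (dynamic programming)."""
    pmf = [1.0]
    for p in probabilities:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)   # this trial fails
            nxt[k + 1] += mass * p       # this trial succeeds
        pmf = nxt
    return pmf


def prob_at_most(probabilities, successes):
    """Chance of seeing `successes` or fewer passes if the predictions were right."""
    return sum(poisson_binomial_pmf(probabilities)[: successes + 1])


# Three reviews, each predicted at R = 90%, all failed.
unlucky = prob_at_most([0.9, 0.9, 0.9], successes=0)

# R = 50% on every review with half of them failed is unremarkable.
expected = prob_at_most([0.5] * 4, successes=2)
```

The first case comes out at 0.1%, while the second is about 69%, which is the difference between "something is fishy" and "within FSRS predictions".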

#

(And countermeasures, like resetting stability to 1d after manual confirmation, could be taken)

unique salmon
#

If we're talking about how to actually implement it in Anki, it should be just one toggle that enables automatic leech detection, that's it

bold terrace
#

Yes I mean "The user could then decide what to do with it, for example reset it if he doesn't want to fail for the next ~30d before the card get a more manageable stability"

#

But nothing to be implemented like this

#

He could also just let it fail and so FSRS would be trained with those inputs

#

I'm just explaining why a leech detection can be useful

#

Because that's also a good point : Why do we even want to flag them ? What would be the possible action on those ? If it's just for the sake of stats, it shouldn't be really in the main public release of Anki

#

But that's a different topic, the one I'm more focused on is : Why the algo in leechkit, might not be really that useful

unique salmon
#

Why can't power users understand that the other 99% of users want a streamlined experience...

#

With as little tweaking as possible

#

Unironically, this is a huge problem with Anki's development

bold terrace
#

I don't see why it won't confuse them even more

unique salmon
#

They don't need to understand the inner workings of the detector. They just click one toggle

bold terrace
#

Yeah but if that detector tells them "This card is a leech" but they see the last 10 reviews were answered "Good", they'll be confused

#

If you want to do a poll, maybe one interesting would be
"When you hear leech, what do you think of ?"
"If you could know what are your leech (based on that definition), do you have anything you want to do with them (in anki) ?"
"If you have something you would like to do with them, what would it be ? Reset, Cram, Time outside Anki, ..."

#

But then personally, to those questions, I would answer :

  • Cards I fail too much and can't grow stability just by reviewing them
  • Yes
  • I'd put more time and reviews inside and outside Anki.
    But based on those, I might indeed look for something more related to D/S/# of repetitions than historical statistics on reviews compared to R 🤔
lapis hearth
cursive badge
cursive badge
# quasi shadow Does anyone want to develop a standalone add-on for leech detector?

I'm not planning to. I deliberately didn't make an addon so it was easier to use external libs like scipy, scikit-learn, etc. while trying things out.
The Poisson Binomial stuff is interesting, but I still don't think it is a complete solution.
I wouldn't personally want to make an addon until I'm sure it is something more useful for a wider audience.

grizzled cedar
#

I have cards that are very similar, made in the same day (therefore they come up for review at pretty similar dates) but aren't siblings. how can i disperse them? afaik fsrs helper only disperses siblings

unique salmon
#

Anki doesn't allow making arbitrary connections between cards. So...yeah, I mean it

cursive badge
#

You could use a filtered deck to study some of them early and force them to be more offset, but there is nothing that would automatically keep them dispersed in Anki natively.

grizzled cedar
quasi shadow
#

@cosmic hedge I updated the analysis.

wind palm
hasty fractal
#

lmao relatable actually

#

(expertium actually removed embed wtf)

unique salmon
wind palm
hasty fractal
wind palm
hasty fractal
#

so expertium didn't lol

hasty fractal
#

I sometimes think what if I pressed a button wrong and so I have to check card info again and again

hasty fractal
#

u obsess over stuff too! this thread is the proof.

#

and the rants in forums where u ping dae five thousand times 🤣

unique salmon
#

Trust me man, I have things much more worrying than Anki 🤣

pulsar crow
wind palm
bold terrace
#

But once you have a good amount of reviews in both ? Could help

#

But again, depends what's the RMSE/logloss and the delta between your DR and what you have

#

If your DR is 80% and your R is around 75-80%, the trouble won't be super worth it

#

If your DR is 80%, one deck is 90% and the other is 70%, yeah

unique salmon
#

@quasi shadow remind me, did someone create a solution to prevent scheduling on different devices from breaking every time there is a new FSRS version with a different number of parameters? Or are we destined to keep telling people "please wait until the newest FSRS version is supported in Anki on all devices" forever?
I mean like how FSRS-5 has 2 more parameters than FSRS-4.5, and if someone has FSRS-4.5 in desktop Anki and FSRS-5 in AnkiDroid, they are not compatible

#

IIRC it's "fixed" in the sense that both devices switch to the older version
Well, first the device that doesn't support the new version switches to the old one, then you sync, then the other device switches to the old one as well

#

Instead of just throwing an error and dying 🤣

cursive badge
unique salmon
cursive badge
unique salmon
cursive badge
#
  1. You stop using AnkiDroid (never sync with new app that supports FSRS-5)
    Stuck on FSRS-4.5 forever?
unique salmon
#

Idk

#

I guess

#

Hopefully it's more clever than that 😅

cursive badge
tepid spoke
#

Yeah, my understanding is that there is nothing akin to an "active FSRS version"

#

an old App would just schedule using old algo and the preserved old parameters, or old default parameters if there are none

quasi shadow
#

Or you can get the cost based on the rating returned by step().

cosmic hedge
cosmic hedge
# quasi shadow Or you can get the cost based on the rating returned by `step()`.
    learn_step_costs = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
    MAX_RELEARN_STEPS = 5

    # learn_state: 1: Learning, 2: Review, 3: Relearning
    def stability_short_term(s: np.ndarray, init_rating: np.ndarray=None):
        if init_rating is not None:
            costs = learn_step_costs[0]
        else:
            costs = learn_step_costs[1]

        cost = 0

        def step(s, next_weights):

            rating = np.random.choice([1, 2, 3, 4], p=next_weights) # Somethings wrong here
            new_s = s * (math.e ** (w[17] * (rating - 3 + w[18]))) * (s ** -w[19])

            return (new_s, rating)

        def loop(s, init_rating):
            nonlocal cost
            i = 0
            consecutive = 0
            rating = init_rating or 1
            while i < MAX_RELEARN_STEPS and consecutive < 2 and rating < 4:
                (s, rating) = step(s, relearn_chances[rating - 1])
                cost_per_day[today] += costs[rating - 1]
                i += 1
                if rating > 2:
                    consecutive += 1
                else:
                    consecutive = 0    

            return s

I should probably just make a pull request 😂. Idk what to call it though.
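
A self-contained sketch of the loop above, with the cost taken from each drawn rating. Several things are assumptions for illustration: the card graduates on Easy or after two consecutive Good answers, a step cap guarantees the simulation ends, the stability update is left out, and the matrix and costs are mock values.

```python
import random


def simulate_session(transitions, costs, first_rating, max_steps=5, rng=None):
    """Walk the last-grade Markov chain for one card's same-day reviews.

    transitions[last - 1] holds the next-grade probabilities after Again,
    Hard and Good (Easy graduates the card, so it needs no row).
    Returns the ratings drawn and their total cost.
    """
    rng = rng or random.Random()
    rating = first_rating
    ratings, total_cost = [], 0.0
    good_streak = 0
    # Stop on Easy, after two consecutive Good answers, or at the step cap.
    while rating < 4 and good_streak < 2 and len(ratings) < max_steps:
        rating = rng.choices([1, 2, 3, 4], weights=transitions[rating - 1])[0]
        ratings.append(rating)
        total_cost += costs[rating - 1]
        good_streak = good_streak + 1 if rating == 3 else 0
    return ratings, total_cost


# Mock values for development, as in the draft above.
transitions = [
    [0.3, 0.05, 0.5, 0.15],
    [0.3, 0.05, 0.5, 0.15],
    [0.3, 0.05, 0.5, 0.15],
]
costs = [1.0, 2.0, 3.0, 4.0]
ratings, total_cost = simulate_session(
    transitions, costs, first_rating=1, rng=random.Random(0)
)
```

Keeping the random generator as a parameter makes runs reproducible, and returning the drawn ratings lets the caller apply the stability update and any per-state cost table separately.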

bold terrace
#

For the Memorized graph, I tried to plot R*f(S) with f(S) defined as the red function. The idea is to have a Memorized curve that is more intuitive than R*sqrt(S), which has a similar shape but gives numbers without much meaning. Here, the meaning would still be "the number of words you know", with "know" based on a word you know perfectly at R=1 and S>360d.

#

Compared to R*sqrt(S)

#

Any opinion @unique salmon ? Could be injected in CMRR

#

(For comparison, my SUM(R))

unique salmon
#

Speaking of plotting memorized cards...
@cosmic hedge https://forums.ankiweb.net/t/estimated-total-knowledge-graph-over-time/57390/5?u=expertium
Thoughts?

bold terrace
#

It's not really a forgetting curve though

#

So basically something with 0 stability shouldn't even be considered as known (thus 0) and s=365d is considered as known (thus 1)

#

(And never more)

unique salmon
#

The thing is, I can't think of a way to test which one would be better for CMRR: S or sqrt(S) or ln(S) or your formula or something

bold terrace
#

I think we can't 😅

#

It's super subjective I think

#

At least R vs R*f(S)

#

But in the end for f I think the result would be similar 🤔

#

The curve is really similar for both

#

The benefit of the exp is to have some human friendly number

#

And capping at 1

polar maple
#

we can make something that sort of looks like the red curve but with more meaning. two separate but similar ideas:

  1. for a hyperparameter t, the score is the average retention for the next t days, so (f(0) + f(1) + ... + f(t)) / (t + 1) where f maps time to retention
  2. for a discount factor r < 1, the score is f(0) + rf(1) + r^2f(2) + ....

both of these can be converted into an integral if you want exact area over the forgetting curve
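
A sketch of the two scores above. It assumes that f is the power forgetting curve with decay -0.5 and factor 19/81 (the FSRS-4.5/FSRS-5 form, where retention is 90% at t = S), and it truncates the discounted sum at a finite horizon. Both choices are assumptions made for illustration.

```python
DECAY = -0.5
FACTOR = 19 / 81  # chosen so that retention is 90% when t equals S


def retention(t: float, stability: float) -> float:
    """Power forgetting curve; assumed here to be the FSRS-4.5/FSRS-5 form."""
    return (1 + FACTOR * t / stability) ** DECAY


def average_retention(stability: float, horizon_days: int) -> float:
    """Idea 1: mean of f(0), f(1), ..., f(t)."""
    values = [retention(day, stability) for day in range(horizon_days + 1)]
    return sum(values) / len(values)


def discounted_retention(
    stability: float, discount: float, horizon_days: int = 3650
) -> float:
    """Idea 2: f(0) + r*f(1) + r^2*f(2) + ..., truncated at a finite horizon."""
    return sum(
        discount ** day * retention(day, stability)
        for day in range(horizon_days + 1)
    )


short_lived = average_retention(stability=10, horizon_days=30)
long_lived = average_retention(stability=100, horizon_days=30)
```

Both scores grow with S and stay bounded, so they keep the "looks like the red curve" shape while having a direct interpretation as expected retention over a window.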

unique salmon
#

I'd rather not make a change to CMRR if I can't test whether it's beneficial or not

pulsar crow
#

Who here follows the computed minimum recommended retention? Mine says 0.72 while my selected is 0.88 and my total true retention is 86.2%

polar maple
#

mine are all exactly the same at 85%

unique salmon
unique salmon
#

When it comes to this stuff, it's hard to say how much deviation is ok and how much is a sign of a problem

#

Since your minimum recommended retention is lower, you can lower your desired retention to get a better ratio of stuff memorized to time spent

bold terrace
#

In general it will push you to add more cards

#

10 cards at 50% count as 5

#

6 at 80% with high stability will count as 4.8

#

Even if the 10 have very low stability
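
To make the counting above concrete, here is a small sketch comparing plain sum(R) with sum(R*f(S)). The weight f below is only a guess at the general shape described earlier (0 at S=0, reaching 1 at S=365d, never more); the actual red function is not given in this thread, and the stability values of 2 and 200 days are invented for the example.

```python
import math


def f(s_days: float, k: float = 3.0, cap_days: float = 365.0) -> float:
    """Hypothetical stability weight: 0 at S=0, 1 at S=cap_days, capped at 1."""
    raw = (1 - math.exp(-k * s_days / cap_days)) / (1 - math.exp(-k))
    return min(1.0, raw)


def memorized(cards, weight=lambda s: 1.0) -> float:
    """Sum of R * weight(S) over (R, S) pairs; the default weight gives plain sum(R)."""
    return sum(r * weight(s) for r, s in cards)


shaky = [(0.5, 2.0)] * 10    # 10 cards at R=50% with very low stability
solid = [(0.8, 200.0)] * 6   # 6 cards at R=80% with high stability

print(memorized(shaky), memorized(solid))        # plain sum(R) for each deck
print(memorized(shaky, f), memorized(solid, f))  # stability-weighted sums
```

Under plain sum(R) the shaky deck counts for slightly more; with a stability weight of this kind the ordering flips.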

pulsar crow
unique salmon
# cosmic hedge sounds good to me

To compensate for the lack of "last year", "last month", etc., would it be possible to make it so that the user can zoom in on a specific part of the graph?

I'm talking about like what matplotlib allows (I just quickly made some random crap just to illustrate zooming):

bold terrace
#

Failing is also super frustrating

#

You have to recheck things etc

unique salmon
bold terrace
#

Yeah and the optimizer doesn't care about it

#

So don't use it

#

It's bad

#

So don't drop DR because of it

#

To me low DR is only a good strategy if you're too late with too much stuff to learn and you want to maximise your chance of getting the grades

#

So you can increase the number of new items per day

#

But when I did my median stability went from 15d to 4-5

#

I knew a lot of words for a very short amount of time and at a low retrievability

#

Making me stutter at every word I wanted to recall

cosmic hedge
unique salmon
# cosmic hedge yeah I kinda wanted that even for SSE

Well, then both problems are solved. The memorization data is calculated and the graph is rendered only when the user presses a button, that + Rust solves performance issues. Zooming in solves not having "Last year", "Last month" and all that

#

If anything, zooming is even better

cosmic hedge
#

yeah im pretty sure it was just the partial history thing

unique salmon
unique salmon
unique salmon
#

@cosmic hedge regarding this: https://github.com/ankitects/anki/pull/3829
Can you add a hint that says that "Smooth" only affects the visual representation and not scheduling? I'm afraid some users might start asking "What is this Smooth setting that I don't see anywhere else? How do I enable it?"

GitHub

Key Changes

Updated FSRS dependency to use a version that supports post-scheduling hooks
Added post_scheduling_fn to simulator that applies load balance and easy days
Added easy days percentages t...

cosmic hedge
#

it is already called "Smooth graph"

#

i feel like the "graph" would be a hint but you never know 😂

unique salmon
#

Like a little text box under the cursor

unique salmon
#

Actually, I have a better idea - move "Smooth graph" under Reviews - Time - Memorized and make it a check box

sonic forge
#

Advanced settings is the perfect place for the Smooth graph toggle and it is already enabled by default. If you are afraid that anybody might misunderstand it, then just add a hint on hover/click.

unique salmon
#

I guess let's leave it up to Bloke to decide

west whale
#

Heya fav FSRS peeps

#

wish yall a good day ❤️

sonic forge
#

Best solution - don't change anything. Graph in the Smooth graph already implies that this setting affects only the graph below.

cosmic hedge
#

i guess i could reuse the tooltip from the graph though 🤔

unique salmon
#

I feel like it's better either at the top or on the right

sonic forge
#

Please, don't move it anywhere. Even just renaming it to Smooth the simulated graph is better.

unique salmon
#

But then the simulation number thingy is on the right as well. So I guess on the right is not good either

unique salmon
#

That seems like the easiest way

sonic forge
#

Again, I don't think it is worth it. Smooth graph is already obvious. Changing it can cause problems on narrow screens (string length)

unique salmon
#

It's just that right now it's placed together with other settings that aren't purely cosmetic, which can make people think that this one is also not purely cosmetic

#

All of these except for Smooth Graph are real Anki settings that affect scheduling, so placing a purely cosmetic setting among them is kinda eeeehhhh

cosmic hedge
#

i mean i could do it and then test it but idk if i wanna get stuck justifying this 😂

cosmic hedge
unique salmon
#

In between Simulate and the graph itself

cosmic hedge
unique salmon
#

Like this

cosmic hedge
unique salmon
cosmic hedge
#

not bad tbf

unique salmon
cosmic hedge
#

id probably move the flicky bit over to the left if i could honestly

unique salmon
#

Oh, yeah, the text and the toggle are on the opposite sides of the screen 🤣

#

Whatever, it's fine

cosmic hedge
#

im trying ok 😭

#

"move button left" is always harder than it seems 😂

unique salmon
#

I never touch front end stuff 🤣

#

Just let me tweak formulas and shit

#

For FSRS

cosmic hedge
#

oh wait i think dae said there were problems with mac not being able to scroll

#

i think this might cut off the bottom axis for mac users

#

idk im just gonna make the pr @ you in it and call it a day 😂

unique salmon
sonic forge
#

Bloke, the Simulate, Clear last simulation, Save to Preset options (and with this PR Smooth graph) area is overloaded with buttons in one place - it is easy to misclick.

cosmic hedge
unique salmon
#

Ah, ok

cosmic hedge
cosmic hedge
# unique salmon Just let me tweak formulas and shit

the algorithm is not my forte
I had an idea though and i was just wondering if it held any water
I figured if the user uses hard as again, hard_penalty (w[15]) would be 1 right
so at risk of wasting your time, what if we let hard_penalty go below 0 and in that circumstance apply it as a bonus to stability_on_fail

python evaluate.py --fast
Model: FSRS-5-dev
Total number of users: 577
Total number of reviews: 17758320
Weighted average by reviews:
FSRS-5-dev LogLoss (mean±std): 0.3394±0.1659
FSRS-5-dev RMSE(bins) (mean±std): 0.0568±0.0379
FSRS-5-dev AUC (mean±std): 0.7064±0.0823

Weighted average by log(reviews):
FSRS-5-dev LogLoss (mean±std): 0.3700±0.1728
FSRS-5-dev RMSE(bins) (mean±std): 0.0747±0.0487
FSRS-5-dev AUC (mean±std): 0.7002±0.0878

Weighted average by users:
FSRS-5-dev LogLoss (mean±std): 0.3722±0.1743
FSRS-5-dev RMSE(bins) (mean±std): 0.0769±0.0499
FSRS-5-dev AUC (mean±std): 0.6995±0.0900

Model: FSRS-5
Total number of users: 577
Total number of reviews: 17758320
Weighted average by reviews:
FSRS-5 LogLoss (mean±std): 0.3396±0.1660
FSRS-5 RMSE(bins) (mean±std): 0.0570±0.0380
FSRS-5 AUC (mean±std): 0.7052±0.0798

Weighted average by log(reviews):
FSRS-5 LogLoss (mean±std): 0.3702±0.1730
FSRS-5 RMSE(bins) (mean±std): 0.0749±0.0489
FSRS-5 AUC (mean±std): 0.6997±0.0873

Weighted average by users:
FSRS-5 LogLoss (mean±std): 0.3724±0.1745
FSRS-5 RMSE(bins) (mean±std): 0.0771±0.0500
FSRS-5 AUC (mean±std): 0.6991±0.0896

I think this means it works? idk 😂

unique salmon
#

So S decreasing after the user pressed Hard? That would mean that intervals can shrink after Hard, idk about that

#

wait no

#

I'm trying to understand what you're doing

#

So success depends not only on the grade, but also on another parameter

unique salmon
#

If we allow it to be negative and just use it in the same way, it would mean that SInc<1 when grade=Hard, so intervals could shrink after Hard

#

But you are doing something more involved

cosmic hedge
#

yeah its aimed to combat hard abuse

unique salmon
#

The thing is, if we allow intervals to shrink after Hard and the user uses Hard a lot, the intervals would just keep shrinking

#

That seems bad

cosmic hedge
#

not if the user is using hard as again

cosmic hedge
unique salmon
#

If w[15]=1, then pressing Hard and Good increases S by the same amount

#

if w[15]=0, then pressing Hard doesn't increase S at all
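
The two cases above can be checked with a small sketch of the stability update after a successful review. It assumes the FSRS-5 formula as published in the FSRS algorithm notes; the function name and the example S, D, R values are invented here, and the weights are the FSRS-5 parameters for user 1 posted later in this thread.

```python
import math


def stability_after_success(s, d, r, grade, w):
    """Assumed FSRS-5 update: S' = S * SInc, with w[15] applied only to Hard (grade 2)."""
    hard_penalty = w[15] if grade == 2 else 1.0
    easy_bonus = w[16] if grade == 4 else 1.0
    s_inc = 1 + (
        math.exp(w[8])
        * (11 - d)
        * s ** -w[9]
        * (math.exp((1 - r) * w[10]) - 1)
        * hard_penalty
        * easy_bonus
    )
    return s * s_inc


W = [0.3527, 0.6238, 0.7818, 3.9533, 7.4838, 0.3735, 1.7316, 0.0229, 1.2548, 0.2462,
     0.6704, 2.1209, 0.0661, 0.585, 2.3421, 0.5017, 2.2118, 0.31, 0.7001]

for penalty in (1.0, 0.0, -0.5):
    w = list(W)
    w[15] = penalty  # override the hard penalty to see its effect
    hard = stability_after_success(10.0, 5.0, 0.9, 2, w)
    good = stability_after_success(10.0, 5.0, 0.9, 3, w)
    print(f"w[15]={penalty}: S after Hard={hard:.2f}, S after Good={good:.2f}")
```

With w[15]=1 Hard and Good give the same new S, with w[15]=0 Hard leaves S unchanged, and a negative value makes SInc<1, so S shrinks after Hard.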

cosmic hedge
#

ahh i meant to say hard as good 😅

#

is the FSRS-5.jsonl up to date?

unique salmon
#

And in Anki too

quasi shadow
#

What's wrong?

cosmic hedge
# quasi shadow What's wrong?

-dev:

{"metrics": {"RMSE": 0.400651, "LogLoss": 0.49325, "RMSE(bins)": 0.112599, "ICI": 0.033374, "AUC": 0.784574}, "user": 1, "size": 10620, "parameters": {"0": [0.3387, 0.6238, 0.7773, 3.9799, 7.4677, 0.3263, 1.7317, 0.0256, 1.2388, 0.2723, 0.684, 2.1162, 0.062, 0.5889, 2.4239, 0.5144, 2.3553, 0.3103, 0.7159]}}
{"metrics": {"RMSE": 0.265203, "LogLoss": 0.269664, "RMSE(bins)": 0.039755, "ICI": 0.016234, "AUC": 0.647728}, "user": 3, "size": 4255, "parameters": {"0": [4.0653, 10.1024, 10.2528, 10.0513, 7.2597, 0.692, 1.7276, 0.001, 1.5279, 0.1049, 1.0048, 1.9723, 0.1203, 0.4359, 2.3265, 0.1556, 2.9898, 0.6101, 0.643]}}

FSRS-5:

{"metrics": {"RMSE": 0.401261, "LogLoss": 0.494693, "RMSE(bins)": 0.114298, "ICI": 0.032853, "AUC": 0.782499}, "user": 1, "size": 10620, "parameters": {"0": [0.3527, 0.6238, 0.7818, 3.9533, 7.4838, 0.3735, 1.7316, 0.0229, 1.2548, 0.2462, 0.6704, 2.1209, 0.0661, 0.585, 2.3421, 0.5017, 2.2118, 0.31, 0.7001]}}
{"metrics": {"RMSE": 0.395004, "LogLoss": 0.494957, "RMSE(bins)": 0.13308, "ICI": 0.091856, "AUC": 0.675933}, "user": 2, "size": 35900, "parameters": {"0": [0.2922, 1.8544, 12.6882, 28.0778, 7.212, 0.2171, 1.9473, 0.001, 1.3647, 0.5441, 0.9483, 1.787, 0.1504, 0.3284, 2.5254, 0.2847, 2.8317, 0.4484, 0.9641]}}

if i run script with --dev then the values are different

#

is that because its -recency now?

quasi shadow
#

Weird

#

Did you modify your local fsrs-optimizer?

cosmic hedge
#

hold on

quasi shadow
#

Fine. Forget it.

cosmic hedge
#
{"metrics": {"RMSE": 0.400651, "LogLoss": 0.49325, "RMSE(bins)": 0.112599, "ICI": 0.033374, "AUC": 0.784574}, "user": 1, "size": 10620, "parameters": {"0": [0.3387, 0.6238, 0.7773, 3.9799, 7.4677, 0.3263, 1.7317, 0.0256, 1.2388, 0.2723, 0.684, 2.1162, 0.062, 0.5889, 2.4239, 0.5144, 2.3553, 0.3103, 0.7159]}}
{"metrics": {"RMSE": 0.265203, "LogLoss": 0.269664, "RMSE(bins)": 0.039755, "ICI": 0.016234, "AUC": 0.647728}, "user": 3, "size": 4255, "parameters": {"0": [4.0653, 10.1024, 10.2528, 10.0513, 7.2597, 0.692, 1.7276, 0.001, 1.5279, 0.1049, 1.0048, 1.9723, 0.1203, 0.4359, 2.3265, 0.1556, 2.9898, 0.6101, 0.643]}}

yeah nothing changed

quasi shadow
#

Could you try this command?

cosmic hedge
# quasi shadow python other.py --algo FSRS-5
```
{"metrics": {"RMSE": 0.265203, "LogLoss": 0.269664, "RMSE(bins)": 0.039755, "ICI": 0.016234, "AUC": 0.647729}, "user": 3, "size": 4255, "parameters": {"0": [4.0653, 10.1024, 10.2528, 10.0513, 7.2597, 0.692, 1.7276, 0.001, 1.5279, 0.1049, 1.0048, 1.9723, 0.1203, 0.4359, 2.3265, 0.1556, 2.9898, 0.6101, 0.643]}}
```
same result
quasi shadow
#

Do you have CUDA?

cosmic hedge
#

i do

#

im not sure its on though

quasi shadow
#

Could you disable it and optimize the parameters via CPU?

#

Oh, wait

#

Maybe it's the difference between Windows and Mac?

#

I will re-run the benchmark tomorrow.

cosmic hedge
cosmic hedge
quasi shadow
#

--dev

{"metrics": {"RMSE": 0.401261, "LogLoss": 0.494693, "RMSE(bins)": 0.114298, "ICI": 0.032853, "AUC": 0.782499}, "user": 1, "size": 10620, "parameters": {"0": [0.3527, 0.6238, 0.7818, 3.9533, 7.4838, 0.3735, 1.7316, 0.0229, 1.2548, 0.2462, 0.6704, 2.1209, 0.0661, 0.585, 2.3421, 0.5017, 2.2118, 0.31, 0.7001]}}
#

FSRS-5:

{"metrics": {"RMSE": 0.401261, "LogLoss": 0.494693, "RMSE(bins)": 0.114298, "ICI": 0.032853, "AUC": 0.782499}, "user": 1, "size": 10620, "parameters": {"0": [0.3527, 0.6238, 0.7818, 3.9533, 7.4838, 0.3735, 1.7316, 0.0229, 1.2548, 0.2462, 0.6704, 2.1209, 0.0661, 0.585, 2.3421, 0.5017, 2.2118, 0.31, 0.7001]}}
quasi shadow
#

@unique salmon would you mind testing it?

#

Maybe you will get the same result as @cosmic hedge .

lapis hearth
quasi shadow
#

The result becomes worse.

unique salmon
#

That's not surprising considering that we don't have a good model of how S behaves at values <<1

unique salmon
cosmic hedge
#
```
pip freeze | grep torch
torch==2.6.0
torchcache==0.5.2

[project]
name = "FSRS-Optimizer"
dependencies = [
    "torch>=1.13.1",
]
```
I think it might be that my torch is 1 major version above what it should be

Edit: I can't test it easily.

pip install torch==1.13.1
ERROR: Could not find a version that satisfies the requirement torch==1.13.1 (from versions: 2.5.0, 2.5.1, 2.6.0)

bold terrace
#

I was expecting that plotting mean/median stability against the number of repetitions would show a correlation like "the more you have reviewed, the higher your stability"

#

It's more like "the more you review, the lower the stability it tends to translate to"

#

Which is logical

#

but also depressing

unique salmon
#

Is it because of leeches?

bold terrace
#

No, I think it's just that cards you learn quickly get to high stability quickly, and for those you struggle with more, Anki itself won't really compensate enough to bring them to high stability

#

I have a local build if you want to see it with yours

#

It's the graph after "Card Stability over Time"

#

it's still a draft so no legend no title etc etc

#

but wanted to see how it would look

#

But still... it kind of means that just reviewing a card again and again and again is not really what makes it stick better

#

It's true that my high-S cards all have fewer than 10 reviews

#

While my low-stability cards generally have at least 30-40 reps

unique salmon
#

Well, FSRS is also interpretable, just not for most people, it seems

#

The average person seems to want something where it's just "Multiply the last interval length by a fixed modifier like 0.2 or 1.5"

quasi shadow
unique salmon
#

Honestly, custom scheduling in Anki is a solution looking for a problem
Someone who wants an advanced algorithm can just use FSRS without having to reinvent their own algo
Someone who wants a simple algorithm won't touch custom scheduling at all

#

So the only people who use custom scheduling at this point are super advanced users who know how to code yet don't want to just use FSRS for some reason

#

Which is like 0.00000000000000000000000001% of all users

quasi shadow
unique salmon
#

Lol, yeah

quasi shadow
#

Btw, how is your FSRS-H going on?

unique salmon
#

Slow. It makes benchmarking 1.7-2x slower. And the results don't look all that impressive. I'll report back in that github issue once I have tried all my ideas

#

But even if it's mildly better, it won't justify the increase of optimization time. To justify that, the improvement would have to be MASSIVE

quasi shadow
#

Maybe we would have FSRS-5.5 with an extra parameter in 2025H1.

#

The ~2% improvement is my best finding in the last two months.

wet plume
#

quick question

#

current fsrs version how much does it lower review load compared to sm2

#

I know I can run the optimizer in the notebook

#

just asking if somebody already knows

unique salmon
bold terrace
#

The only real way to know is to try. I would first check your observed retention under SM2, switch to FSRS, use that observed retention as the desired one, do a reschedule, and then make an educated guess based on the future review counts, mentally spreading out the potential backlog

#

Also, a lot of people using SM2 try to keep a very high retention for mature cards, which doesn't really carry over to FSRS: young or mature, FSRS just aims to achieve your DR

#

So if right now half your cards are young with 70% retention and half are mature with 90% retention, it might be more or less equivalent to an FSRS workload at 80% retention (not an exact computation, just to give you a sense of what could influence the workload)

unique salmon
#

We could estimate "worst case" improvement like this:

  1. For each user from the 10k dataset, record their FSRS parameters and the optimal Anki parameters using the modification of SM-2 that predicts probabilities
  2. For each user, run simulations using both FSRS and SM-2 (predictive) at the same DR, with SM-2 probabilities considered "faulty" and FSRS probabilities considered "true", meaning that whether the review ends up being a lapse or not is based on FSRS probabilities
  3. Calculate two values of workload, one using SM-2 and one using FSRS, and compare how much better or worse FSRS is in terms of reviews/day

@quasi shadow wanna try it?
Though running 10k simulations will take forever
#

This would tell us how much FSRS outperforms fine-tuned SM-2

#

Or, alternatively, we could do that but with default SM-2 parameters, not fine-tuned for each user

#

To get a "best case" improvement - what if FSRS is competing against unoptimized SM-2?
Or we could do both

tepid spoke
#

There's also always a slight chance FSRS will increase the review load. Though unlikely, if it concludes that your memory sucks, it might show you stuff quite often :D

bold terrace
tepid spoke
#

With SM2 by default if you hit Again, a card would be fully reset as if it just graduated

bold terrace
#

Oh

tepid spoke
#

But it's a multiplier you can configure, it just defaults to 0

bold terrace
#

That's actually quite aggressive lol

#

At the same time .... I think for learning it might make sense

tepid spoke
#

Yeah, I had that multiplier set to 0.5

#

It definitely is the best to ensure retention

bold terrace
#

Yeaaah

tepid spoke
#

but if it's a hard deck, it also creates an intense review load

bold terrace
#

With FSRS it always feels like knowledge is kept "at warm temperature"

#

never too mature, never too young

tepid spoke
#

Eh, if I let it do as the optimizer wants, it goes too hard imo

#

Like, 9-12d after graduating, and then multiple months

bold terrace
#

For example this is my graph of "Stability / Repetitions done" ...

#

The more I review them with FSRS, the lower the stability lol

unique salmon
bold terrace
#

In the same amount of time (not in terms of review), more reviews seems to lead to faster increase of S

unique salmon
#

Unrelated, but maybe we should just add a "I use Hard as fail" toggle in Anki and adapt FSRS formulas so that they work well with both

#

Otherwise we will never fully solve the problem

bold terrace
#

Over the past 4 months, I increased my DR from 80% (but more effectively 75% average R) to a "true" 86, and each time I step it up, I see my stability plateau rising a bit faster

unique salmon
#

I guess making a 2-button mode the default and adding a warning when enabling the 4-button mode would help, but...

bold terrace
#

Optimize params with Hard=Fail, Hard=Good, and even with ignoring them, and then take the minimum? 😄

#

At this rate a neural network will be easier to implement lol

unique salmon
#

It would be better to just ask the user himself

#

Then again, maybe after 2 buttons become the default + enabling 4 buttons gives a warning about Hard misuse, this won't be a problem, at least for new users

bold terrace
#

If RMSE/log loss is more or less the same, is it even a problem 🙂 ?

#

I think initially I did sometimes "Hard" as a fake "Again", but I never really stress about it too much

#

I was < 500 reviews back then, I'm at 60K right now

unique salmon
#

I just feel like it would create a lot more problems than simply asking the user

bold terrace
#

Gotcha

#

I just think some users will be like "That night, at 3AM, I DID USE IT !! Should I stop using FSRS until then ?"

#

If at least the RMSE/log loss is similar

#

it's pretty much guaranteed that it doesn't really change anything

cursive badge
bold terrace
#

I think ideally we all would like to see that "with time and reviews, we know better our stuff"

#

and the retention is something in which we search those answers

#

more recently I shifted a bit my mindset to be able to see it also in my stability

#

Even if I still do 80%, I feel confident seeing my stability rising

#

But linking those things like you did is smart, because it's a way to say "For those young cards, I'm fine with a few more errors if it means reducing my workload" (and god knows young cards are often the majority of our workload), but "once you become mature, I want to lock you in"

#

Said differently, it's also a way to express "As long as I'm not really able to encode it for long enough, I don't want to expect too much from that card, but once the signs of having it well encoded start to show up (stability > 21), I'll have a higher expectation of not failing it (prop:r <0.9)"

cosmic hedge
unique salmon
unique salmon
cosmic hedge
unique salmon
#

@quasi shadow do you have a "X reviews take Y seconds to optimize" kind of test for fsrs-rs?
I'm asking because David is interested

#

Like, a standardized test for how long the optimization takes

#

Actually, I'm surprised that we don't have a speed benchmark. We could benchmark tweaks to the optimizer code to see how we can make it faster. This could improve user experience and would be one more reason to have automatic optimization when/if the optimizer is fast enough

polar maple
unique salmon
#

Ok, idk, honestly

cursive badge
#

I think the idea of that kind of benchmark is you just do a before and after on the same machine to ensure there is not a regression when you make changes rather than having an "objective" measure.

unique salmon
unique salmon
#

Also, to get the most accurate comparison to Anki's FSRS, what command should I use? B=1 N=100 MEM=1 python performance.py --dev?

#

And one more thing - I want time per X reviews (say, 100 000) rather than total time or time per user

cosmic hedge
# unique salmon Also, to get the most accurate comparison to Anki's FSRS, what command should I ...

B=1 is if you want 2 changes on the same graph which you get by modifying these functions (runs it twice for each user)

def process_wrapper_a(uid: int):
    torch.set_num_threads(2)
    return process_wrapper(uid)


def process_wrapper_b(uid: int):
    torch.set_num_threads(3)  # Num threads example
    return process_wrapper(uid)

MEM=1 shows memory but really throws the speed benchmark outa whack
i think you can add --rust as an argument and that should be pretty close but i never tested it

unique salmon
#

So B=0 N=100 MEM=0 python performance.py --dev --rust?

cosmic hedge
cosmic hedge
unique salmon
unique salmon
#

Maybe sum of row_counts?

#

It's the number of reviews...I think...maybe

cosmic hedge
#
    plt.xlabel(f"Revlogs (total={sum(row_counts)})")

yeah you were bang on

unique salmon
#

Seems like you can't break the "benchmarking code never works on the first try" curse

cosmic hedge
#

I'll try it 😭

cursive badge
cosmic hedge
cursive badge
#

You either need to use set VAR="value" before calling python in cmd.exe or use $env:VAR="value" in powershell
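
One shell-independent way around this is to set the variables from Python and launch the script as a subprocess. This is only a sketch: the variable names B, N and MEM and the script name performance.py are taken from the messages above, while the helper names are made up. Note also that in cmd.exe, `set VAR="value"` stores the quotes as part of the value, so `set VAR=value` is usually what you want there.

```python
import os
import subprocess
import sys


def benchmark_env(b: int = 0, n: int = 100, mem: int = 0) -> dict:
    """Copy the current environment and add the benchmark switches as strings."""
    env = dict(os.environ)
    env.update({"B": str(b), "N": str(n), "MEM": str(mem)})
    return env


def run_benchmark(args=("--dev",), **settings):
    """Run performance.py with the given switches; raises if the script fails."""
    cmd = [sys.executable, "performance.py", *args]
    return subprocess.run(cmd, env=benchmark_env(**settings), check=True)
```

Calling `run_benchmark(b=0, n=100, mem=0)` from the benchmark directory would then behave the same in cmd.exe, PowerShell, or a Unix shell.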

cosmic hedge
#
> set var="val"
> python ...
unique salmon
cosmic hedge
unique salmon
cosmic hedge
#

oh

#

ehh doesn't seem that fsrs-rs is imported by other.py

#

might be wrong though

unique salmon
cosmic hedge
#

where do you have anki 10k downloaded?

unique salmon
#

Oh, right, it's because right now I'm only using a subset of the dataset

#

Screw it then, it's faster for you or Jarrett to test it

#

Than for me to do the copy-pasting 🤣

#

I'll do it once I benchmark my ideas to improve FSRS, but that's DEFINITELY slower than having you guys test it

cosmic hedge
#

you can use script-path or whatever that arguments called if you have it downloaded somewhere

unique salmon
#

I do

cosmic hedge
#

i'll try run it anyway

unique salmon
#

But I'm benchmarking stuff on a 1k subset

cosmic hedge
#
```
# Don't change
USER_COUNT = 10000
```
you can try changing this conspicuous variable to 1000 if you want 🤣
#

should say "dont change unless you only have 1000 users for some reason"

unique salmon
#

Yay, it works!

#

So on my Ryzen 5 3600 (using only one thread) FSRS-rs takes around 57.6 seconds per 100,000 reviews. What's strange is that in Anki itself I have a preset with around 170k reviews, yet the optimization is WAY faster. And based on CPU utilization, it doesn't seem like Anki is using more than one thread 🤔

cursive badge
#

There could be a load of small presets biasing the result with that overhead.

polar maple
#

performance.py calls script.process, which has to read and process the dataset, and this step is very slow

unique salmon
#

So the only accurate way to estimate time is...to add the timer inside Anki 😅

cosmic hedge
#

idk if it worked very well but it might be better

#

wait i'll run the original one to compare

#

yeah it helps

night saffron
# cursive badge I wanted to reduce my workload recently, but didn't want to risk forgetting too ...

Is it not a million dollar idea? All the brightest minds are here, I wanna know your opinion.

Low DR (80%) for young cards:

  1. intervals grow super fast (at 80% DR it's 2.4x faster than at 90% DR)
  2. low workload (e.g. 2/3 of my everyday cards are young) + consider that @rossgb claims his workload dropped by more than 50% (!)

High DR for mature cards:

  1. You have confidence that you actually remember stuff
  2. Intervals don't grow as fast but it matters less since stability is high (= intervals still grow really fast)

What am I missing? Custom DR for young cards and mature cards seems like best of both worlds (low workload + good long-term memory).
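
The 2.4x figure can be reproduced from the forgetting curve. The sketch below assumes the FSRS-5 curve (DECAY = -0.5, FACTOR = 19/81) and solves it for the interval at which R drops to the desired retention; the function name is made up for this example.

```python
DECAY = -0.5
FACTOR = 19 / 81


def interval_for(stability: float, desired_retention: float) -> float:
    """Days until predicted R falls to desired_retention, assuming the FSRS-5 curve."""
    return stability / FACTOR * (desired_retention ** (1 / DECAY) - 1)


ratio = interval_for(1.0, 0.8) / interval_for(1.0, 0.9)
print(f"interval at 80% DR / interval at 90% DR = {ratio:.2f}")
```

This ratio applies to a single interval at a fixed stability. It does not say how stability itself evolves under each DR, which is what the simulations discussed here try to capture.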

unique salmon
#

There might be a psychological benefit in that it gives you a sense of progress towards mastery, but no real benefit in terms of time spent on reviews, or a minor benefit at best

cursive badge
#

I'm not claiming it's significantly "better". It just makes me feel sad and like I've wasted lots of time if I forget mature cards.

night saffron
#

I see. You guys thought of everything.

polar maple
#

Currently it is 0.35% better than 88% fixed dr

bold terrace
#

Knowing 1000 words for 1 day vs knowing 900 for 100 days.

#

Sure, since lower stability would translate into higher workload, it is "somewhat" taking it into account for the "per minute" part.

#

All those simulations and stats might sound convincing, but they are just as biased as the hypotheses on which they are built.

unique salmon
#

It's based on how long people answer cards

#

It's not like we pulled the numbers out of our asses

bold terrace
#

Which is only a subset of the total time lost

unique salmon
#

Oh, nvm

#

Not that one

bold terrace
#

It only accounts for the front part

unique salmon
#

Ok, idk where those values are from. Maybe from Jarrett's ass 🤣

bold terrace
#

Data can be wrongly interpreted, it's not a bulletproof argument

#

Caveat 3: if a card was difficult to encode, failing it might mean having to repeat a lot of small cycles to rebuild a good encoding, which disrupts both short- and long-term acquisition

#

That is very far from the neat theoretical assumptions about what happens if you just space your repetitions a bit further apart each time.

#

Note that one key difference between SM2 and FSRS might be that SM2 treats forgetting as "start a new cycle"

#

Whereas for FSRS, a forgotten card is not the end of the world

polar maple
unique salmon
#

I was like "huh, I could've sworn it was based on real data"

bold terrace
#

Whatever the weight, I guess it was gathered from what Anki recorded : The Front timer

polar maple
#

cost spent finding a better encoding is impossible to model given the current dataset but we need this data

bold terrace
#

How much time the user spent on the back, trying to understand the card, is not accounted for

unique salmon
bold terrace
bold terrace
#

Also, it's capped

unique salmon
polar maple
#

Caveat 3 seems more of a criticism of FSRS than it is of SSP-MMC, the claim is that FSRS fails to model those cards correctly

bold terrace
#

I answered this one in 1 sec. Right now, I'm still in the back view.

#

Now, let me go to the card info view

#

Wait 12s is strange I spent more than that

#

Hmmm seems to take the back in account, now the question is for how much time, I'll wait a bit here

#

Seems to be capped at 12s somehow

#

Which is my Addon limit though

#

Default is 60

#

sum(R) vs sum(R*f(S)), for a deck of 102 kana. From the very first days, sum(R) was already at 99-101, even though I had only known them for a few days. With S taken into account, you get a better goal function IMO

quasi shadow
quasi shadow
#

Oops, it's harder than I thought.

quasi shadow
#

Anyway, the benchmark of 10k collections costs me ~7 hours. So the average optimization time is 7*60*60/10000 ≈ 2.5s per collection (about 0.4 collections per second)

night saffron
# bold terrace sum(R) vs sum(R*f(S)), for a deck of 102 kana. From the very first days, the SUM...

I feel like S should definitely be taken into account. R&S graph makes much more sense imo. But isn't it weird that only 80% of those cards are considered "memorised" after a year period (assuming you do your reviews daily)? Shouldn't it be closer to 95%+? I just think that if a card is Mature and has R > DR, then it should be considered memorized in 100% of the cases.

PS. And pure R graph (if you do your reviews daily) kinda shows "Introduced cards over time" rather than "Memorised over time".

cosmic hedge
unique salmon
#

It would be more useful

cosmic hedge
quasi shadow
#

And the benchmark is slower than Anki because of the time series split strategy.

quasi shadow
#

k is the number of cards

#

n is the number of reviews per card

bold terrace
# night saffron I feel like S should definitely be taken into account. R&S graph makes much more...

Gonna be honest, in my case at least, words I only practice through anki can stay with an extremely low stability even after 100 reviews, I plotted avg and median Stability over Repetitions and the result is a bit sad.

But I think it's a good thing to realize that anki-alone, bruteforcing words just by doing SRS, might not be a viable strategy. Most words that I have with high Stability are the ones with less than ~ 10 reviews.

Right now I'm playing with the idea of raising DR. I think with a DR that is too "warm" (60-80%), you fail too much, and my only guess is that failing can just "destroy" your learning cycle. I think if DR is higher, and workload is higher, there's a chance that if I recall a word a lot over a long period (instead of trying to do the bare minimum), I might be able to encode it more effectively this time.

It's also why with only sum(R), all the CMRR and smart-computation of some "optimal DR" is pointless. I remember when @unique salmon tried R*sqrt(S), he got higher recommended DR from that optimizer for example.

#

FYI I started using Anki and FSRS ~15 months ago, and I have around 80-90k reviews with it

#

Doesn't mean I'm necessarily "right", but at least I'm starting to have a grasp of how it translates to actual learning in my daily usage of Japanese. I'd say for now 1h of active immersion (lookups, analyzing sentences, going over some sentences again and again, finding nuances) really really really pushes me further than 1h of Anki

#

I think that outside the algo-side of things, the problem with anki is how you can do it a bit in "bruteforce mode", not taking enough time, trying to go through all your reviews as fast as possible, with less chance to build a "network" of knowledge than when you analyze real life sentences

unique salmon
#

@quasi shadow I want to resurrect an old idea: use a neural net to calculate D as a function of (last D, grade, R). Then plot that output and try to come up with a simple(-ish) formula that approximates it. Basically, let a neural net figure out how to update D based on its last value, grade and R, and then try to achieve the same output using simpler functions.

#

I wanted Alex to do it, but he said "beg Jarrett" 🤣

#

So you would have a neural net that takes D (a value between 1 and 10), grade and R as input and outputs D (again, between 1 and 10), and then you use that D in FSRS formulas as usual

unique salmon
#

Actually, maybe Claude 3.7 will help me

unique salmon
#

..nevermind

#

I just get errors in other, completely unrelated parts of the benchmark code

#

Actually no, it works, I just have to solve the least intuitive problem ever in the least intuitive way ever

#

Alright. We'll see how well it works

quasi shadow
unique salmon
#

    self.difficulty_nn = nn.Sequential(
        nn.Linear(3, 16),  # Input: retrievability, difficulty, rating
        Mish(),
        nn.Linear(16, 16),
        Mish(),
        nn.Linear(16, 1),  # Output: new difficulty
    )

#

    class Mish(nn.Module):
        def forward(self, x):
            # Numerically stable version
            delta = torch.log1p(torch.exp(x))
            # For large x values, use an approximation
            mask = x > 20  # Threshold where exp(x) would cause an overflow
            delta = torch.where(mask, x, delta)
            return x * torch.tanh(delta)
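
For anyone who wants to sanity-check the `Mish` activation above without PyTorch, here is a minimal scalar sketch of the same idea. Mish is x * tanh(softplus(x)), and softplus(x) = log(1 + exp(x)) is replaced by x for large inputs so that exp(x) never overflows. The threshold of 20 is copied from the snippet; the function names `softplus` and `mish` are only illustrative.

```python
import math


def softplus(x: float, threshold: float = 20.0) -> float:
    """log(1 + exp(x)), falling back to x where exp(x) would blow up."""
    if x > threshold:
        return x
    return math.log1p(math.exp(x))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for x in (-5.0, 0.0, 1.0, 50.0, 1000.0):
    print(f"mish({x}) = {mish(x):.6f}")
```

For large positive inputs the function behaves like the identity, and for large negative inputs it goes to zero, which is why the overflow guard only needs to cover the positive side.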

polar maple
#

I tried this transformation on the forgetting curve and it looks pretty good so far. The values in the transformation were not fine tuned or anything, I only just eyeballed reasonable values and verified that they help on the first 10 users. I don't have enough cpu resources to test all 10k users

```
Model: FSRS-5-dev
Total number of users: 557
Total number of reviews: 18550740
Weighted average by reviews:
FSRS-5-dev LogLoss (mean±std): 0.3376±0.1709
FSRS-5-dev RMSE(bins) (mean±std): 0.0516±0.0336
FSRS-5-dev AUC (mean±std): 0.7061±0.0840

Model: FSRS-5-recency
Total number of users: 557
Total number of reviews: 18550740
Weighted average by reviews:
FSRS-5-recency LogLoss (mean±std): 0.3406±0.1735
FSRS-5-recency RMSE(bins) (mean±std): 0.0542±0.0354
FSRS-5-recency AUC (mean±std): 0.7112±0.0780
```
unique salmon
polar maple
#
```python
    def forgetting_curve(self, t, s):
        base = (1 + FACTOR * t / s) ** DECAY
        l = 0.1
        r = 0.995
        p = 1.07
        return l + (r - l) * base ** p
```
unique salmon
#

Ah. That's interesting, but it has issues. Your version cannot predict anything less than 0.1 and more than 0.995

#

I'd like some sort of transformation such that R is still between 0 and 1
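
The range issue raised here can be checked numerically. The sketch below assumes the FSRS-5 constants `DECAY = -0.5` and `FACTOR = 19/81` (they are not shown in the snippet), and the names `base_curve` and `transformed_curve` are mine; `l`, `r` and `p` are the eyeballed values from the message above.

```python
DECAY = -0.5
FACTOR = 0.9 ** (1 / DECAY) - 1  # = 19/81, so the base curve gives R = 0.9 at t = s (assumed)


def base_curve(t, s):
    """Untransformed power forgetting curve."""
    return (1 + FACTOR * t / s) ** DECAY


def transformed_curve(t, s, l=0.1, r=0.995, p=1.07):
    """Affine transformation of base ** p, as in the snippet above."""
    return l + (r - l) * base_curve(t, s) ** p


s = 10.0
for t in (0.0, 10.0, 100.0, 1e9):
    print(f"t = {t:>12}: base = {base_curve(t, s):.4f}, transformed = {transformed_curve(t, s):.4f}")
```

Because the base curve only takes values in (0, 1], the transformed prediction is confined to (l, r], that is, it never exceeds 0.995 and never drops to 0.1 or below.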

polar maple
#

not necessarily a big issue since it helps the average case a ton and we can fix it like how RWKV would fix it

#

add a term that decays immediately, and another term that decays very slowly

#

this would make it so that R(0) = 1 and R(infinity) = 0 still

#

but you get the same effect
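
One possible reading of this fix, purely as a sketch: give the gap between `r` and 1 to a term that dies out almost immediately, and let the floor `l` itself decay very slowly. The exponential shapes and the time constants `tau_fast` and `tau_slow` are my assumptions, not something specified in the chat, and `DECAY`/`FACTOR` are again assumed to be the FSRS-5 constants.

```python
import math

DECAY = -0.5
FACTOR = 0.9 ** (1 / DECAY) - 1  # assumed FSRS-5 constant (19/81)


def base_curve(t, s):
    return (1 + FACTOR * t / s) ** DECAY


def pinned_curve(t, s, l=0.1, r=0.995, p=1.07, tau_fast=0.01, tau_slow=1e6):
    """Transformed curve with two extra terms so that R(0) = 1 and R(inf) = 0.

    tau_fast and tau_slow (in days) are hypothetical illustration values.
    """
    fast = (1 - r) * math.exp(-t / tau_fast)  # vanishes almost immediately
    middle = (r - l) * base_curve(t, s) ** p  # the original transformed part
    slow = l * math.exp(-t / tau_slow)        # floor that fades very slowly
    return fast + middle + slow


s = 10.0
for t in (0.0, 1.0, 10.0, 1000.0, 1e12):
    print(f"t = {t:>8}: R = {pinned_curve(t, s):.6f}")
```

Over any realistic review interval the extra terms are close to constants, so the predictions stay almost identical to the original transformation while the endpoints are pinned to 1 and 0.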

unique salmon
#

Also, remember that different values of decay are optimal at different retention rates

#

I think that's what we should be working on

#

Using the same curve for everyone isn't optimal - flatter curves are better at higher retentions and steeper curves are better at lower retentions

polar maple
#

maybe you're right but this is already a huge improvement

#

and the values weren't finetuned yet

unique salmon
#

https://www.desmos.com/calculator/lzzgd5gz0s
I tried this before. My goal was to make a curve that falls off very quickly initially, but after that is more or less the same as the original
It helped a little bit, but not that much. And I don't want to make the curve flatter - it's better for higher retentions, so it will get better metrics on average, but screw up people at lower retentions

#

The next "big thing" for FSRS should be better D or a curve with an adaptive shape. Ideally, the former would lead to the latter

polar maple
unique salmon
polar maple
#

how much worse is lower decay for the lower retention people?

unique salmon
# polar maple how much worse is lower decay for the lower retention people?

I haven't tested that, and neither did Jarrett, I believe. He only provided this graph that shows which decay fits which retention the best: https://github.com/open-spaced-repetition/srs-benchmark/issues/166#issuecomment-2652751232
You can try to determine how much worse yourself, the dataset only has 100 users, so it shouldn't take long

GitHub

Some time ago I plotted this for FSRS-4.5: https://i.imgur.com/DSjaW5e.png https://i.imgur.com/1JsW2Jy.png (can't upload images directly for some reason) While the exact numbers will vary depen...

polar maple
#

in that picture it looks like 0.5 decay doesn't even target the average case

unique salmon
#

Yeah

polar maple
#

well couldn't it be said that it's way worse for the 0.8-1.0 retention judging by that graph

#

so just taking the average would balance it out a bit?

unique salmon
#

"Optimal" decay is around 0.15-0.2 or something like that

#

Not on the uniform dataset

#

I mean, decay that results in the best metrics on the whole dataset

polar maple
#

then we need to investigate how much worse is 0.2 decay actually for the lower retention users

#

sure 0.6 might be optimal for them but maybe the logloss difference is negligible

unique salmon
#

Well, feel free to test it on the uniform dataset

#

I'm running NN D now

unique salmon
polar maple
#

i dont have cpu resources rn

unique salmon
#

The more users I add, the harder it is to keep retentions uniform

#

So I had to only use 100

#

Retentions aren't uniformly distributed in the big dataset, so I have to "cherry pick" users

polar maple
#

i still have doubts about the validity of the uniform dataset

#

people with retention < 0.3 are probably not using anki like the rest of us

unique salmon
#

It's just to have something where we don't have to worry about the results being screwed by data imbalances

unique salmon
# polar maple i dont have cpu resources rn

I'll test how much worse different values of decay are for users with different retentions once I'm done with NN D. But that will probably take a while, especially since I will have to give you the code so that you can actually do something like:
    for R in r_range:
        for D in d_range:
            for grade in [1, 2, 3, 4]:
                new_d = neural_stuff(R, D, grade)

#

And analyze it

#

Realistically, you will have to run this on the entire dataset and somehow find the median parameters...or something

#

idk

polar maple
#

u mean the parameters of the nn?

unique salmon
#

Yes

polar maple
#

i don't think you can average the params of an nn like that

#

you should pool the data from 100 users into one set and do a joint optimization over all of that data

unique salmon
#

And you should probably pretrain it, which I don't know how to do. I'm just praying that it will work well enough without pretraining

#

I mean, if you plan to run it on individual users

polar maple
#

i think it'll be bad without pretraining

unique salmon
#

Well, then we're back to me asking you or Jarrett to do it

#

Which historically had a 50/50 chance of being realized 🤣

polar maple
#

i don't think the nn should be optimized per-user since that would be akin to changing the formulas themselves on the spot

#

instead train it on 100+ users, freeze the params, then analyze and extract formulas from them

unique salmon
#

The floor is yours

#

Look at class FSRS5(FSRS) and this:
self.optimizer = torch.optim.Adam([{'params': self.model.w}, {'params': self.model.difficulty_nn.parameters(), 'weight_decay': 0.0003}], lr=lr)

#

Alternatively, you could tell me how the hell to combine revlogs of multiple users into one

#

Idk if my RAM can handle that though 🤣

polar maple
#

keep the revlogs separate, then on each iteration randomly sample a user's revlog and do a training step on that revlog
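
A toy sketch of that training scheme, with a single shared parameter standing in for the network weights: the revlogs stay separate, and each step samples one user's revlog and takes one gradient step on it. Everything here (the linear model, `make_revlog`, `mse_gradient`, the learning rate) is a hypothetical stand-in, not the benchmark code.

```python
import random


def make_revlog(rng, n_reviews, w_true=2.0, noise=0.1):
    """Toy 'revlog': (x, y) pairs with y = w_true * x + noise."""
    revlog = []
    for _ in range(n_reviews):
        x = rng.uniform(-1.0, 1.0)
        revlog.append((x, w_true * x + rng.gauss(0.0, noise)))
    return revlog


def mse_gradient(w, revlog):
    """Gradient of the mean squared error of y ~ w * x on one revlog."""
    return sum(2.0 * (w * x - y) * x for x, y in revlog) / len(revlog)


rng = random.Random(0)
revlogs = [make_revlog(rng, n) for n in (50, 200, 1000)]  # three users, kept separate

w = 0.0  # shared parameter, trained jointly across users
learning_rate = 0.1
for step in range(500):
    revlog = rng.choice(revlogs)  # randomly pick one user's revlog
    w -= learning_rate * mse_gradient(w, revlog)  # one training step on it

print(f"shared parameter after training: {w:.3f}")
```

Note that sampling users uniformly gives every user the same weight regardless of review count; sampling in proportion to revlog size would instead weight by reviews.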

polar maple
unique salmon
polar maple
#

nah

#

just on the same 557 users from earlier

unique salmon
#

Then it has the same problem - most users have high retentions

polar maple
#

why do we not care about the average case?

unique salmon
#

We do, we just want to make sure that the worst case isn't getting even worse

polar maple
#

almost everything you do will result in some user getting worse

#

with this metric we would reject most FSRS changes

unique salmon
#

And based on how often people complain about interval lengths and low true retention, making the curve flatter does not sound good

unique salmon
#

Most changes improve FSRS's flexibility, this one doesn't

#

So this one definitely throws some users under the bus

polar maple
#

i've shown in the stats that you improve the average case way more than the worst case gets worse

polar maple
#

but you would never hear it framed in this way

unique salmon
#

If we flatten the curve, it will increase the number of complaints about long intervals and would likely make things even worse for people who go from, say, 90% DR to 70% DR
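
To put rough numbers on the interval argument, here is a small sketch. It assumes the FSRS-style power curve R(t) = (1 + factor * t / S) ** decay, with factor chosen so that R(S) = 0.9; the helper names are hypothetical. Decay is written with a negative sign, so "0.2 decay" in the chat corresponds to `decay = -0.2`.

```python
def factor(decay):
    """Scale factor that keeps R = 0.9 when t equals S."""
    return 0.9 ** (1 / decay) - 1


def interval(s, dr, decay):
    """Days until predicted retrievability drops to the desired retention dr."""
    return s / factor(decay) * (dr ** (1 / decay) - 1)


for decay in (-0.5, -0.2):
    for dr in (0.9, 0.8, 0.7):
        print(f"decay = {decay}, DR = {dr}: interval = {interval(1.0, dr, decay):.2f} x S")
```

Under these assumptions, going from 90% to 70% DR stretches the interval to about 4.4 times S with decay -0.5, but to about 7.1 times S with decay -0.2, which is the "longer intervals" effect being debated.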

#

Let's just work on D for now

#

Except I need you or Jarrett for that

polar maple
#

also i maintain that FSRS memory should try to be as pure of a predictor as possible, but the scheduler should be allowed to fudge values in order to improve user experience

unique salmon
#

Claude isn't going to help here since I would have to feed it the entire repo and not just a code snippet, for it to figure out how to do pretraining

polar maple
#

we just don't have much data on this type of user rn

unique salmon
#

And we won't, since Anki doesn't keep track of DR

#

So we won't get data annotated with DR values

polar maple
#

well, i'm guessing that most data is from SM-2 anyways

unique salmon
# unique salmon There you go

Anyway, I'd really like to focus on NN D, which means running this code on one mega-revlog, which means you and Jarrett are the bottleneck

polar maple
#

i have other things to do

unique salmon
# unique salmon There you go

@quasi shadow would you please run this on a combined dataset of 100 users and save the resulting parameters of the neural net?
(look at the FSRS5 class and at self.optimizer = torch.optim.Adam(...), that's where the changes are)

polar maple
unique salmon
#

Then S will be inaccurate

#

Or, rather, S will be accurate but for the wrong thing

polar maple
#

it is already inaccurate since you are using an incorrect decay

unique salmon
#

Just, no

#

That's a terrible idea

#

You'll get a model optimized for one curve working with a different curve

#

There is no way it will be good

polar maple
#

a decay of -0.2 models memory better

#

but it causes longer intervals

#

so when scheduling, use a decay of -0.5 to get shorter intervals

#

but when you do DS updates, go back to -0.2 decay

#

since that more accurately models memory

unique salmon
#

Please no

#

Just work on NN D man...

#

😭

polar maple
#

in some sense it is pretty much doing the following: take a user's DR, add some value like 0.05 to it, and pretend it's that instead
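
The "pretend the DR is higher" reading can be sketched as follows: schedule with one decay, then ask what retrievability the other decay predicts at that interval. This assumes both curves are FSRS-style power curves calibrated so that R(S) = 0.9; the function names and the default decays of -0.2 (memory) and -0.5 (scheduling) are taken from the proposal, not from any implementation.

```python
def factor(decay):
    """Scale factor that keeps R = 0.9 when t equals S."""
    return 0.9 ** (1 / decay) - 1


def retrievability(t, s, decay):
    return (1 + factor(decay) * t / s) ** decay


def interval(s, dr, decay):
    return s / factor(decay) * (dr ** (1 / decay) - 1)


def effective_dr(dr, memory_decay=-0.2, schedule_decay=-0.5):
    """Retention the memory model predicts when the scheduler uses another decay."""
    t = interval(1.0, dr, schedule_decay)  # interval in units of S
    return retrievability(t, 1.0, memory_decay)


for dr in (0.9, 0.8, 0.7):
    print(f"requested DR = {dr:.2f} -> effective DR = {effective_dr(dr):.3f}")
```

Under these assumptions the shift is not a constant: it is zero at DR = 0.9 (both curves agree there by construction), about +0.02 at DR = 0.8 and about +0.055 at DR = 0.7.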

unique salmon
#

I first proposed NN D, like, at the end of 2023 or something

polar maple
#

you are lying with a wrong decay just to get shorter intervals

#

and with my 0.2 -> 0.5 convert for scheduling idea, i'm trying to get back this weird anki behaviour

#

since you want it so much

#

but we can have a proper underlying memory model that remains as true as possible

unique salmon
#

@polar maple alright, since I don't know how to combine revlogs, I'll just use the revlogs of The Big Dude - the user with the most reviews out of the entire 10k dataset, which is this one: https://drive.google.com/file/d/12CyIVPjtctpj2p6H6PFruccLY9NBbGyQ/view?usp=sharing. He has 4 million reviews. Sheesh. Bro, I have 600k reviews and I thought I'm a power user. Whoever this guy is, he is the king of power users. Either he has been using Anki since 2006 or he is doing some kind of crazy number of reviews per day.

Anyway, I'll run FSRS-5 on him and then I'll experiment with NN D with regularization and see if I can make it perform better on just this one guy. Then I'll just draw conclusions about D'=f(R, D, grade) from that

#

Though I still have no idea how to extract parameters for the neural net after training

small crow
#

I've changed the way I rate cards (previously 4-button, now 2-button) and created a new preset for the cards I've been learning with the 2-button method, and I will continue to optimize FSRS for that preset every few weeks. This preset is currently at around 700 reviews and has about 7-10 cards/day added to it. (RMSE is like 10% if that matters, but I think it's higher bc it's a younger deck, so I'm not too worried)

Should I back-port those values to the older decks/presets? An older deck and preset started back in FSRS4 (and maybe before) has something like 73k reviews and an RMSE of like 3.47%, with values separate (and different) from the younger deck, but with no new cards being introduced. Or should I just optimize the values within that preset?

I don't want to combine the two decks into the same preset as I wouldn't want the older easy/hard values to get accounted for within the newer preset's algos, but would instead just hand-copy them if this idea isn't a crazy one.

unique salmon
#

.rar

unique salmon
#

It won't mess up FSRS

small crow
#

Do you think it'd be worth it if I switched the order I introduce the cards in? Like I always found card1 harder, but card2 would show up more often and remind me of card1, even though card1 was the thing I needed to work on irl (production vs recognition in a foreign language).

now that i'm typing this out i'm not sure it'd make a difference because fsrs isn't tied to which side of the card you're looking at, but the difficulty at which they're assigned and how you rate them at that assigned difficulty, kinda

thanks!

unique salmon
unique salmon
#

Ok, yeah, I tried saving parameters but I just keep getting errors.
@polar maple so what this means is that I will try different levels of regularization, find something that works, and then have to give the code to you, and then you will have to run it on your own again (on user 6810 from the 10k dataset), and THEN you can do
    for R in r_range:
        for D in d_range:
            for grade in [1, 2, 3, 4]:
                new_d = neural_stuff(R, D, grade)
And THEN we can finally analyze it
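
As a sketch of what that analysis step could look like once a frozen network is available: evaluate it on a grid of (R, D, grade) and tabulate the outputs. The `neural_stuff` below is a made-up stand-in with an arbitrary rule, only there so the loop runs; the grid ranges are assumptions too.

```python
from itertools import product


def neural_stuff(r, d, grade):
    """Hypothetical stand-in for the trained network, NOT the real model."""
    step = {1: 1.5, 2: 0.5, 3: 0.0, 4: -0.5}[grade]
    new_d = d + step * (0.5 + r)
    return min(10.0, max(1.0, new_d))  # keep D in [1, 10]


r_range = [i / 10 for i in range(1, 10)]    # R = 0.1 ... 0.9
d_range = [float(d) for d in range(1, 11)]  # D = 1 ... 10
grades = [1, 2, 3, 4]

rows = [(r, d, g, neural_stuff(r, d, g)) for r, d, g in product(r_range, d_range, grades)]

# Average change in D per grade: the kind of summary one would then try
# to approximate with a simple formula.
for g in grades:
    deltas = [new_d - d for _, d, grade, new_d in rows if grade == g]
    print(f"grade {g}: mean delta D = {sum(deltas) / len(deltas):+.3f}")
```

The `rows` table can then be plotted or fitted per grade to look for a simple closed-form approximation.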

bold terrace
#

I played around with the visualizer again and came to some observations/interpretations/conclusions:

  • For people in "difficulty hell" (w[7] < 0.01), you basically get a "difficulty class" based on how many times you failed a card. With each lapse a card moves to a worse class, and it won't realistically come back.
  • To get a better idea of how your intervals will grow, it's better to start in the visualizer with enough fails to represent the number of lapses a card can have over its normal life (~10)

Now it gets interesting:

  • Assuming you do the "perfect sequence" of 3s and 1s based on your DR (e.g., DR=90%: nine '3' for one '1'), you might or might not get an increasing Stability/Interval after each lapse, depending on your DR.
  • In my case, with perfect sequences at 80% DR, after all those initial fails, I won't get an increasing Stability. I only get it when I start to be north of ~84% DR.
  • If I do perfect sequences at 90% DR, the Stability at the end of each lapse cycle increases ... until it doesn't. When you look into it, you realize those cycles ended with a higher final stability because Difficulty got a bit lower, but once Difficulty gets lower, a mistake becomes more impactful.

Based on that, my current interpretation is :

  • While increasing DR will theoretically increase stability, if you keep repeating those lapses, it will just plateau higher, at a higher workload.
  • While your stability increases through repetitions, if you're not able to outperform FSRS's prediction on your way up again, you won't really get a better result (stability increasing past the plateau).
  • To get that never-ending increase in stability, you'd have to naturally outperform FSRS's prediction a bit more each time. It sounds obvious, but in my own experience, doing Anki reviews alone doesn't really achieve that. There are a lot of potential discussions/theories to have on this, but it's a bit out of scope for now. Still, it might show why Anki is not a bulletproof solution for long-term learning. It might just be a "short/mid-term optimizer of memorization", without necessarily being strong in the long run.
unique salmon
bold terrace
# unique salmon Yeah, I've said before that currently D is just a fancier lapse counter

You've probably tested this already, but what about giving D exponents rather than just multipliers? What I mean is that even if a card has now become "one that would never lapse again", its D won't really come down since the steps are very, very small; with an exponent, it could (if necessary) reward long sequences of good answers and "move back to lower D"
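
A sketch of why the steps are so small, assuming the FSRS-5 difficulty update as I understand it (a grade-dependent shift with linear damping, followed by mean reversion toward a target with weight w[7]). The parameter values `w6 = 1.5`, `w7 = 0.005` and `d_target = 3.0` are illustrative, not anyone's real parameters, and the function names are hypothetical.

```python
def next_difficulty(d, grade, w6, w7, d_target):
    """One difficulty update (assumed FSRS-5 form); D is kept in [1, 10]."""
    delta = -w6 * (grade - 3)               # Good (3) gives delta = 0
    damped = d + delta * (10.0 - d) / 9.0   # linear damping near D = 10
    reverted = w7 * d_target + (1.0 - w7) * damped  # mean reversion
    return min(10.0, max(1.0, reverted))


def after_good_streak(d, n, w6=1.5, w7=0.005, d_target=3.0):
    """Difficulty after n consecutive Good reviews."""
    for _ in range(n):
        d = next_difficulty(d, 3, w6, w7, d_target)
    return d


for w7 in (0.005, 0.1):
    d = after_good_streak(10.0, 20, w7=w7)
    print(f"w7 = {w7}: D after 20 Good reviews starting from 10 = {d:.2f}")
```

With Good contributing no shift of its own, the only thing pulling D down is the mean reversion, so with a tiny w[7] even a streak of 20 Good reviews leaves D above 9 in this example, while w[7] = 0.1 brings it down to roughly 3.9.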

unique salmon
#

Well, hopefully soon we'll figure out a good way to change D by imitating the outputs of a neural net

#

Assuming "neural D" can outperform normal D. And assuming I can somehow convince Alex/Jarrett to help me

cosmic hedge
bold terrace
#

A sad but also possible outcome is that even with a new formula or a NN, we might not see gains because, well, most users in the dataset are people who have rarely transformed a "High-D" card into a low one.

bold terrace
#

Thing is, how often does it happen

#

If it doesn't happen that often, it's the kind of thing that RMSE/logloss won't measure that much

cosmic hedge
#

idk 😂

tepid spoke
#

it feels like D is pegged at 100% the moment I fail a card once

#

It's also purely a highly subjective observation, but I feel like cards that never lapsed get too long intervals, while cards that lapsed even once get too short ones.

bold terrace
#

For example :
All my -is:new cards : 3277 cards
-is:new prop:lapses>8 prop:s>21 : 5 cards
-is:new prop:lapses>7 prop:s>21 : 10 cards
-is:new prop:lapses>6 prop:s>21 : 13 cards
-is:new prop:lapses>5 prop:s>21 : 17 cards
-is:new prop:lapses>4 prop:s>21 : 27 cards

If I take those 27 cards, I see they probably have too large a D, since the one with the lowest D is at 95%, and I failed it once in the past year and succeeded ~20 times.

So that card could benefit from it.

But even if all 27 benefited, it's <1% of the total number of cards. logloss/RMSE wouldn't flinch at that

bold terrace
#

I only take the mature (in a fsrs way)

#

How many mature cards do I have that had at least 5 lapses

#

I have 1474 mature cards, 1447 of them never lapsed more than 4 times

#

Only 27, lapsed and were able to become mature

#

1.8%

cosmic hedge
#

but I'd venture to guess that an exponential difficulty would probably benefit the young cards more? because more leeches are probably young cards?

bold terrace
#

996 mature cards never ever lapsed once

cosmic hedge
#

I mean you're right anyway because we know it doesn't work 😂 #1282005522513530952 message

bold terrace
#

It's true that I checked

#

with 99% difficulty, and starting from 0.1 stability, I need 25 good reviews to go to 21 stability

#

but still

#

You'd expect after >100 reviews you'd find a way to be able to recall it ~98% of the time

#

But yeah

#

I'm not sure how large and impactful on a collection level it would be

#

Probably because most people don't have enough actual cases of those

#

We talk about it because we wish we had a lighter backlog because of that

#

Truth is, we just lapse, and FSRS got it right, while we think we've surely improved when in fact we have not 😂

cosmic hedge
tepid spoke
#

What worries me a bit is how the simulator basically goes into an equilibrium of ~200 reviews every day forever.

cosmic hedge
tepid spoke
#

Just not learning it seems rather counterproductive

bold terrace
#

And the simulator is simulating you sticking to your DR

#

Normally, you should outperform your prediction at some point

#

if you're improving

cosmic hedge
tepid spoke
#

I haven't hit the set DR in ages

bold terrace
#

Also, LB/Fuzz will lead you to a lower Daily Retention than your actual DR

tepid spoke
#

DR is 90%, on good days I have 88% at best. Realistically more between 80% and 85%

bold terrace
#

Yeah 90 for 80-85% seems normal if you don't hijack the LB in my own experience

tepid spoke
#

I'd expect the actual DR to be higher than 90% though, if all was well

#

given each note spawns two cards with huge cross-bleed between them

bold terrace
#

So normally you'd be sticking to 90% with increasing Stability if you optimize often

#

If you never optimize, then you'd outperform the DR

tepid spoke
#

I think FSRS got poisoned by the early stages of the deck

#

Cause those are a) easy and b) appear every day everywhere in content I consume

#

And the first ~year or so of the deck was mostly that

bold terrace
#

How many reviews total do you have ?

tepid spoke
#

like, just across all cards you mean?

bold terrace
#

Yep

tepid spoke
#

Where do I even see that?

bold terrace
#

Here

#

For example I have 98K reviews

tepid spoke
bold terrace
#

with 217K reviews I think your early phase is not an issue 😛

#

also FSRS uses some recency weights now

tepid spoke
#

If I'd let FSRS do as it pleases, I think it'd ruin my retention

#

I'm still using the parameters that Jarrett hand-tuned for me, and specifically disabled short-term stuff, since it was highly detrimental

bold terrace
#

With the anki Search Stats Extended plugin, a nice one is this one
You can see the average R of your next day like that

#

For example, even if my DR is 86, my avg(R) is 82.39

#

so with RMSE 3, having a 79% would be perfectly normal for my DR=86%

#

Fuzz/LB is one part culprit, but the low stability cards are something we can't do much about

tepid spoke
#

Two weeks ago, I had a span of 5 days where I averaged at 75% actual retention

#

and now it's back to ~85%. No idea what happened there.

polar maple
#

unless of course that FSRS is wrong about the prediction

#

but if there is a perfect memory model you wouldn't 'outperform' it in the same way you cannot outperform a coin flip

polar maple
quasi shadow
#

According to my previous simulation, the optimizer tends to flatten the forgetting curve even when the real decay is -1.

#

If we could find a mechanism to explain it, I would consider using -0.2 as the decay.

#

Otherwise, the flat forgetting curve may be just a cheat.

#

This is the decay used to generate the simulated data.

#

However, the decay optimized on the simulated data deviates from the real value.

polar maple
#

if only values near 0.8 were possible then i can see decay not really being represented

quasi shadow
polar maple
#

@quasi shadow I wonder if there is some sort of bug somewhere. I think you can investigate by trying something like: freeze all parameters except for one and optimize, see if the new value is similar to the old value

#

i think the improvement in log loss is unreasonable given the sample size

polar maple
quasi shadow
#

You need to switch to the branch Expt/trainable-forgetting-decay

polar maple
#

also you can try to uniformly sample a DR whenever a card gets scheduled to get full coverage

quasi shadow
#

🤣 Now it converges to 1.0.

polar maple
#

nice..

#

now what's going on

quasi shadow
#

I think the decay parameter tends to compensate other parameters' error.

#

in a cheating way...

polar maple
#

OK but maybe the new set of params will miraculously predict new data better than the old parameters?

#

this would be another bug test i guess

#

so maybe you can try to generate another set of data with the old parameters, and try the new parameters on them

#

so this way we can see if it is just overfitting very well or if it is actually generalizing

#

But if it is generalizing then i expect there to be an underlying bug, it just wouldn't make sense at all

quasi shadow
#

Evaluate the new parameters on the simulated data?

#

Oops, some weird things happen

polar maple
#

old params generates two sets of simulated data, s1, s2
new params are trained on s1
new params are then tested on s2
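
The s1/s2 protocol can be sketched on a toy problem. Here "old params" are a true stability and decay, reviews are simulated from a power curve in a narrow retention band, "new params" come from a grid search on s1, and both are then scored on s2. The single-stability setup, the grids and all names are my simplifications, not the actual simulator.

```python
import math
import random


def power_curve(t, s, decay):
    factor = 0.9 ** (1.0 / decay) - 1.0  # keeps R = 0.9 at t = s
    return (1.0 + factor * t / s) ** decay


def simulate(n, s, decay, rng):
    """Simulated reviews (t, recalled) with intervals close to s."""
    data = []
    for _ in range(n):
        t = rng.uniform(0.5 * s, 1.5 * s)  # narrow retention range around 0.9
        data.append((t, 1 if rng.random() < power_curve(t, s, decay) else 0))
    return data


def log_loss(data, s, decay):
    total = 0.0
    for t, y in data:
        p = min(max(power_curve(t, s, decay), 1e-6), 1.0 - 1e-6)
        total -= math.log(p) if y else math.log(1.0 - p)
    return total / len(data)


def fit(data, s_grid, decay_grid):
    """Grid search for the (s, decay) pair with the lowest log loss."""
    candidates = [(s, d) for s in s_grid for d in decay_grid]
    return min(candidates, key=lambda sd: log_loss(data, sd[0], sd[1]))


rng = random.Random(0)
TRUE_S, TRUE_DECAY = 10.0, -1.0                # "old params"
s1 = simulate(1000, TRUE_S, TRUE_DECAY, rng)   # training set
s2 = simulate(1000, TRUE_S, TRUE_DECAY, rng)   # held-out set

s_grid = [6.0 + 0.5 * i for i in range(17)]    # 6.0 ... 14.0
decay_grid = [-0.1 * i for i in range(1, 16)]  # -0.1 ... -1.5
fit_s, fit_decay = fit(s1, s_grid, decay_grid)  # "new params"

print(f"fitted s = {fit_s}, fitted decay = {fit_decay:.1f}")
print(f"s1 log loss: old = {log_loss(s1, TRUE_S, TRUE_DECAY):.4f}, new = {log_loss(s1, fit_s, fit_decay):.4f}")
print(f"s2 log loss: old = {log_loss(s2, TRUE_S, TRUE_DECAY):.4f}, new = {log_loss(s2, fit_s, fit_decay):.4f}")
```

On s1 the fitted parameters can never score worse than the true ones, since the grid contains them; the s2 comparison is what indicates whether a deviating decay is overfitting or generalizing.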

quasi shadow
#

It's really hard to distinguish them if the retention range is narrow.

polar maple
# quasi shadow

i think this sort of thing wouldn't appear frequently in the 10k set given how much of a logloss improvement 0.2 decay seems to give

#

if it is hard to distinguish like in that plot, it wouldn't give such a log loss improvement

#

so maybe this is just a problem with this specific simulation, idk

quasi shadow
#

The new parameters are slightly worse in testset.

polar maple
#

cool, then there is probably no bug and it really is just the case that decay doesn't seem to matter in this optimization

#

but why do you say that decay is cheating here?

quasi shadow
#

But it's still a problem.

#

The optimized decay should be 1.0 instead of 0.35.

#

I will combine simulated data with DR = 0.9, 0.8 and 0.7 to optimize the parameters.

polar maple
#

do you have any weight decay or anything that would make the value smaller?

#

L2 reg

polar maple
#

ok, what if you only freeze decay to 1.0? how much worse does it overfit than if you allow it to freely learn?

#

it could just be that in this specific simulation there isn't enough spread for decay to truly matter

quasi shadow
#

Now we have.

#

The decay still decreases...

polar maple
#

what about the values on s1, the data it was trained on?

#

not surprising that it generalized better; in practice for the 10k set we don't need a trainable decay either, we would just lower the decay from the constant 0.5 to 0.2

#

which would be kind of similar to what you are doing here actually, imagine if you instead froze decay to 0.5 in this simulation, you would expect it to generalize worse than if you froze it to 1.0

#

we can still use trainable decay if it does better on the metrics though since we don't cheat with the 5-way split 🙂

quasi shadow
#

My concern here is, as @bold terrace mentioned before, when the DR is decreased from 90% to 80%, the true retention is decreased from 85%+ to 60%+. (I forget the specific figures, sorry)

#

If we lower the decay, this problem would be worse.

quasi shadow
#

@polar maple another problem: if the model is perfect, the retention range will converge to a point: the desired retention.

polar maple
polar maple
polar maple
#

so I think this assumption is just wrong

#

luckily we have tons of data now to prove it

quasi shadow
#

Fine, I will do another experiment later to verify my assumption.

polar maple
#

but either way isn't this off topic? changing decay from 0.5 to 0.2 is still going from a power curve to a power curve

quasi shadow
#

Even when I extend the retention range.

polar maple
#

perhaps in the range [0.1, 1.0] the decay value is easily compensated by the other parameters, so the gradient descent just makes it end up near the middle

#

you could also make it train for more than 5 epochs to see if more interesting stuff happens

quasi shadow
quasi shadow
#

The sample data is generated by an exponential forgetting curve.

#

However, the power forgetting curve fits it better.

polar maple