#Learning to backtrack


elder ridge
#

In RL for reasoning, a recurring pattern is backtracking: trying something, realizing it's not working, and rolling back. LLMs already learn to do this in an unstructured way during RLVR (e.g., "wait, let me reconsider..."). But what if we could teach models to backtrack in a structured way, as an action within the MDP itself?

The idea: augment the action space with a backtrack action that undoes the last transition and its reward, while encoding where the agent backtracked from in the observation. This lets the agent learn from mistakes without paying their full cost.
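A minimal sketch of what such an action could look like, written as a wrapper around a toy environment. Everything named here (`BACKTRACK`, `LineWorld`, `BacktrackWrapper`) is hypothetical and only for illustration. It also assumes that "undoing the reward" means returning the negated reward on the backtrack step, and that backtracking with nothing to undo is a free no-op. Other choices are possible.

```python
BACKTRACK = "backtrack"  # hypothetical name for the extra action


class LineWorld:
    """Toy deterministic environment (an assumption for illustration):
    cells 0..4 on a line, a trap at cell 0, the goal at cell 4."""

    def __init__(self, pos=2):
        self.pos = pos

    def step(self, action):
        # Any action other than "right" is treated as a move to the left.
        self.pos = max(0, min(4, self.pos + (1 if action == "right" else -1)))
        if self.pos == 4:
            return self.pos, 10.0, True    # goal ends the episode
        if self.pos == 0:
            return self.pos, -10.0, False  # trap: penalty, episode continues
        return self.pos, -1.0, False       # ordinary step cost


class BacktrackWrapper:
    """Adds a backtrack action that undoes the last transition and its reward."""

    def __init__(self, env):
        self.env = env
        self.history = []  # stack of (state before the step, reward received)

    def step(self, action):
        if action != BACKTRACK:
            before = self.env.pos
            pos, reward, done = self.env.step(action)
            self.history.append((before, reward))
            return (pos, None), reward, done
        if not self.history:
            # Nothing to undo: treated here as a free no-op (an assumption).
            return (self.env.pos, None), 0.0, False
        came_from = self.env.pos
        before, reward = self.history.pop()
        self.env.pos = before  # restore the earlier state
        # Returning the negated reward cancels it in the episode return.
        return (before, came_from), -reward, False


env = BacktrackWrapper(LineWorld(pos=1))
print(env.step("left"))     # ((0, None), -10.0, False): stepped into the trap
print(env.step(BACKTRACK))  # ((1, 0), 10.0, False): back at cell 1, penalty cancelled
```

The second element of the observation is where the agent backtracked from (`None` after an ordinary move), so a policy can learn not to walk straight back into the state it just left.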

Why this could matter for alignment:

  • AI agents increasingly delegate to sub-agents (e.g., Claude Code spawning child agents for subtasks). Standard RLVR doesn't account for these hierarchical actions, which can cause problems: child agents may not pursue the parent's intended goals. Learning to backtrack in a principled way could help with these sorts of problems.
  • Alignment proposals like Iterated Amplification involve an aligned but less capable agent creating copies of itself to boost capabilities while preserving alignment. But we don't have good methods for an agent to safely copy itself and ensure that its copies stay in-distribution and thus do not suffer from goal misgeneralization.

Why this could matter for capabilities:

  • In standard RL, a single bad action can ruin an entire episode. A humanoid that's about to fall has to crash, reset, and start over. With structured backtracking, it could roll back a few steps and try a different movement, dramatically improving exploration efficiency.
  • More generally, backtracking lets an agent treat dead ends as cheap information rather than expensive failures.

A toy example:

I vibecoded a proof of concept: github.com/emparu/RLBackTrack. It implements Q-learning with a backtrack action in a gridworld with traps.
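For readers who want the general shape without opening the repo, here is a minimal sketch of tabular Q-learning with a backtrack action. It is not the code from the repo: the grid size, trap positions, reward values, hyperparameters, and all names are assumptions chosen for illustration. Traps are assumed to give a penalty without ending the episode, and a backtrack with nothing to undo is treated as a wasted step.

```python
import random
from collections import defaultdict

SIZE, START, GOAL = 4, (0, 0), (3, 3)
TRAPS = {(1, 1), (2, 3)}                      # assumed trap cells
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]    # right, left, down, up
BACKTRACK = 4                                 # index of the extra action


def move(pos, a):
    """Apply a move, clipping at the grid border. Returns (next, reward, done)."""
    r, c = pos[0] + MOVES[a][0], pos[1] + MOVES[a][1]
    nxt = (min(max(r, 0), SIZE - 1), min(max(c, 0), SIZE - 1))
    if nxt == GOAL:
        return nxt, 10.0, True
    return nxt, (-10.0 if nxt in TRAPS else -1.0), False


def train(episodes=500, alpha=0.1, gamma=0.95, eps=0.1, max_steps=100, seed=0):
    rng = random.Random(seed)
    # Q-values are indexed by the observation (position, backtracked_from).
    q = defaultdict(lambda: [0.0] * 5)
    returns = []
    for _ in range(episodes):
        pos, came_from, history, total = START, None, [], 0.0
        for _ in range(max_steps):
            obs = (pos, came_from)
            if rng.random() < eps:
                a = rng.randrange(5)                        # explore
            else:
                a = max(range(5), key=lambda i: q[obs][i])  # exploit
            if a == BACKTRACK and history:
                # Undo the last transition and cancel its reward.
                prev, last_r = history.pop()
                nxt, reward, done, nxt_from = prev, -last_r, False, pos
            elif a == BACKTRACK:
                # Nothing to undo: a wasted step (an assumption).
                nxt, reward, done, nxt_from = pos, -1.0, False, None
            else:
                nxt, reward, done = move(pos, a)
                history.append((pos, reward))
                nxt_from = None
            # Standard Q-learning update on the augmented observation.
            target = reward + (0.0 if done else gamma * max(q[(nxt, nxt_from)]))
            q[obs][a] += alpha * (target - q[obs][a])
            pos, came_from, total = nxt, nxt_from, total + reward
            if done:
                break
        returns.append(total)
    return q, returns


q, returns = train()
print(sum(returns[-50:]) / 50)  # average return over the last 50 episodes
```

Comparing the `returns` curve against the same loop with the backtrack action disabled would be the natural way to check whether backtracking speeds up learning in this setting.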

#

The backtrack agent learns faster during exploration (presumably because it can undo trap penalties immediately) and converges to the same optimal policy as standard Q-learning. The key insight is that backtracking helps during learning, even if the final policy doesn't use it.

It would be interesting to see whether structured backtracking gives real benefits in more challenging domains such as robotics, but I currently don't have much time to pursue this idea: RL for robotics is very slow to experiment with and test because it's hard to parallelize, even more so with backtracking.
