In RL for reasoning, a recurring pattern is backtracking: trying something, realizing it's not working, and rolling back. LLMs already learn to do this in an unstructured way during RLVR (e.g., "wait, let me reconsider..."). But what if we could teach models to backtrack in a structured way, as an action within the MDP itself?
The idea: augment the action space with a backtrack action that undoes the last transition and its reward, while encoding where the agent backtracked from in the observation. This lets the agent learn from mistakes without paying their full cost.
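To make the mechanics concrete, here is a minimal sketch of such a backtrack action as a wrapper around a deterministic transition function. Everything in it is an assumption made for illustration: the names `BacktrackEnv`, `BACKTRACK`, and `chain`, the choice to "undo the reward" by paying back its negation, and the choice to encode "where the agent backtracked from" as a second field in the observation.

```python
from typing import Callable, List, Optional, Tuple

State = int
BACKTRACK = -1  # hypothetical id for the extra action


class BacktrackEnv:
    """Wraps a deterministic transition function with an undo action (illustrative only)."""

    def __init__(self, transition: Callable[[State, int], Tuple[State, float]], start: State):
        self.transition = transition
        self.state = start
        self.total_reward = 0.0
        # Stack of (state before the move, reward received for the move).
        self.history: List[Tuple[State, float]] = []

    def step(self, action: int) -> Tuple[Tuple[State, Optional[State]], float]:
        """Returns ((state, backtracked_from), reward)."""
        if action == BACKTRACK:
            if not self.history:
                return (self.state, None), 0.0  # nothing to undo
            came_from = self.state
            prev_state, reward = self.history.pop()
            self.state = prev_state
            self.total_reward -= reward  # assumption: undoing a reward means refunding it
            return (self.state, came_from), -reward
        next_state, reward = self.transition(self.state, action)
        self.history.append((self.state, reward))
        self.state = next_state
        self.total_reward += reward
        return (self.state, None), reward


def chain(state: State, action: int) -> Tuple[State, float]:
    """Toy 1-D chain: action 1 moves right, anything else moves left; cell 3 is a trap."""
    nxt = state + (1 if action == 1 else -1)
    return nxt, (-10.0 if nxt == 3 else 0.0)


env = BacktrackEnv(chain, start=2)
print(env.step(1))          # ((3, None), -10.0): walked into the trap
print(env.step(BACKTRACK))  # ((2, 3), 10.0): back at 2, observation records the trap at 3
print(env.total_reward)     # 0.0
```

After backtracking, the agent is back in state 2 with its return restored, but its observation now says "I just came back from 3", which is the information a policy would need to avoid repeating the same move.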
Why this could matter for alignment:
- AI agents increasingly delegate to sub-agents (e.g., Claude Code spawning child agents for subtasks). Standard RLVR doesn't account for these hierarchical actions, which can cause problems: child agents may not pursue the parent's intended goals. Learning to backtrack in a principled way could help with these sorts of problems.
- Alignment proposals like Iterated Amplification involve an aligned but less capable agent creating copies of itself to boost capabilities while preserving alignment. But we don't have good methods for an agent to safely copy itself and ensure that its copies stay in-distribution and thus do not suffer from goal misgeneralization.
Why this could matter for capabilities:
- In standard RL, a single bad action can ruin an entire episode. A humanoid that's about to fall has to crash, reset, and start over. With structured backtracking, it could roll back a few steps and try a different movement, dramatically improving exploration efficiency.
- More generally, backtracking lets an agent treat dead ends as cheap information rather than expensive failures.
A toy example:
I vibecoded a proof of concept: github.com/emparu/RLBackTrack. It implements Q-learning with a backtrack action in a gridworld with traps.
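The repository's code is not reproduced here. As a rough, independent sketch of what tabular Q-learning with a backtrack action might look like, the block below uses a 4x4 grid with two traps. The grid layout, reward values, hyperparameters, and all function names are assumptions made for illustration, as is the choice to refund the undone reward and to key the Q-table on `(position, backtracked_from)`.

```python
import random
from collections import defaultdict

# All names and numbers below are illustrative assumptions.
SIZE = 4
START, GOAL = (0, 0), (3, 3)
TRAPS = {(1, 1), (2, 3)}
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # right, left, down, up
BACKTRACK = 4                                # index of the extra action
N_ACTIONS = 5


def move(pos, action):
    """Deterministic grid move; walls keep the agent in place."""
    r = min(max(pos[0] + MOVES[action][0], 0), SIZE - 1)
    c = min(max(pos[1] + MOVES[action][1], 0), SIZE - 1)
    nxt = (r, c)
    if nxt == GOAL:
        return nxt, 1.0, True
    if nxt in TRAPS:
        return nxt, -1.0, False  # traps are costly but not terminal here
    return nxt, -0.01, False


def run_episode(q, rng, eps=0.1, alpha=0.5, gamma=0.95, max_steps=100):
    """One epsilon-greedy episode with in-place Q-learning updates; returns the episode return."""
    pos, came_from, history, total = START, None, [], 0.0
    for _ in range(max_steps):
        obs = (pos, came_from)
        if rng.random() < eps:
            a = rng.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda i: q[obs][i])
        if a == BACKTRACK and history:
            # Undo the last move: return to the previous cell and refund its reward.
            prev, reward = history.pop()
            nxt, r, done, nxt_from = prev, -reward, False, pos
        elif a == BACKTRACK:
            nxt, r, done, nxt_from = pos, 0.0, False, None  # nothing to undo
        else:
            nxt, r, done = move(pos, a)
            history.append((pos, r))
            nxt_from = None
        nxt_obs = (nxt, nxt_from)
        target = r + (0.0 if done else gamma * max(q[nxt_obs]))
        q[obs][a] += alpha * (target - q[obs][a])
        pos, came_from, total = nxt, nxt_from, total + r
        if done:
            break
    return total


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = defaultdict(lambda: [0.0] * N_ACTIONS)
    returns = [run_episode(q, rng) for _ in range(episodes)]
    return q, returns


if __name__ == "__main__":
    q, returns = train()
    print("mean return over last 50 episodes:", sum(returns[-50:]) / 50)
```

One design caveat: because the refund depends on a hidden history stack, the observation is not fully Markov, so this is best read as a toy for building intuition rather than a principled formulation.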