Sutton & Barto: Reinforcement Learning Notes
Personal derivations, edge-case deep dives, and mental shortcuts for the classic RL textbook.
The Incremental Update Rule (The "Multiply by 1" Trick)
The goal is to express the new average, \(Q_{n+1}\), entirely in terms of the old average, \(Q_n\), and the newest reward, \(R_n\). The previous average \(Q_n\) is fundamentally defined as:
Step-by-Step Derivation
Starting with the isolated sum of all previous rewards:
The Trick: We multiply that isolated sum by a cleverly disguised "1", written as \((n-1)\frac{1}{n-1}\). This introduces the exact fraction we need to swap the sum out for a \(Q_n\) substitution without breaking the equation's value.
Group the fraction with the summation to substitute \(Q_n\) cleanly into place:
Distribute the outer fraction across the elements inside the brackets:
Rearrange the terms and factor out the common global fraction \(\frac{1}{n}\) to get the final line:
Non-Stationary Tracking (Exponential Recency-Weighted Average)
In environments where target data shifts over time, true historical averages fail. Replacing the dynamic step size \(\frac{1}{n}\) with a constant step size parameter, \(\alpha \in (0, 1]\), forces the agent to prioritize recent information and decay historical data.
Step-by-Step "Unrolling"
1. Rearranging the Base Equation: Group the elements into a fraction of the newest reward and a fraction of the historical average.
2. The Recursion: Substitute \(Q_k\) with its exact mathematical definition from the step prior:
3. Unrolling to the Core Origin: Keep replacing historical terms sequentially all the way back to the initial reward (\(R_1\)) and your starting guess (\(Q_1\)):
4. The Final Closed-Form Representation: Consolidating that massive unrolled string into clean summation notation reveals the underlying mechanism: