Sutton & Barto: Reinforcement Learning Notes

Personal derivations, edge-case deep dives, and mental shortcuts for the classic RL textbook.


Page 31

The Incremental Update Rule (The "Multiply by 1" Trick)

The goal is to express the new average, \(Q_{n+1}\), entirely in terms of the old average, \(Q_n\), and the newest reward, \(R_n\). The previous average \(Q_n\) is fundamentally defined as:

$$Q_n = \frac{1}{n-1}\sum_{i=1}^{n-1}R_i$$

Step-by-Step Derivation

Starting with the isolated sum of all previous rewards:

$$= \frac{1}{n}\left(R_n + \sum_{i=1}^{n-1}R_i\right)$$

The Trick: We multiply that isolated sum by a cleverly disguised "1", written as \((n-1)\frac{1}{n-1}\). This introduces the exact fraction we need to swap the sum out for a \(Q_n\) substitution without breaking the equation's value.

$$= \frac{1}{n}\left(R_n + (n-1)\frac{1}{n-1}\sum_{i=1}^{n-1}R_i\right)$$

Group the fraction with the summation to substitute \(Q_n\) cleanly into place:

$$= \frac{1}{n}\left(R_n + (n-1)Q_n\right)$$

Distribute the outer fraction across the elements inside the brackets:

$$= \frac{1}{n}R_n + \frac{1}{n}(nQ_n) - \frac{1}{n}Q_n$$
$$= \frac{1}{n}R_n + Q_n - \frac{1}{n}Q_n$$

Rearrange the terms and factor out the common global fraction \(\frac{1}{n}\) to get the final line:

$$= Q_n + \frac{1}{n}[R_n - Q_n]$$

Page 32

Non-Stationary Tracking (Exponential Recency-Weighted Average)

In environments where target data shifts over time, true historical averages fail. Replacing the dynamic step size \(\frac{1}{n}\) with a constant step size parameter, \(\alpha \in (0, 1]\), forces the agent to prioritize recent information and decay historical data.

Step-by-Step "Unrolling"

1. Rearranging the Base Equation: Group the elements into a fraction of the newest reward and a fraction of the historical average.

$$Q_{k+1} = Q_k + \alpha(R_k - Q_k)$$
$$= \alpha R_k + (1 - \alpha)Q_k$$

2. The Recursion: Substitute \(Q_k\) with its exact mathematical definition from the step prior:

$$= \alpha R_k + (1 - \alpha)[\alpha R_{k-1} + (1 - \alpha)Q_{k-1}]$$
$$= \alpha R_k + (1 - \alpha)\alpha R_{k-1} + (1 - \alpha)^2 Q_{k-1}$$

3. Unrolling to the Core Origin: Keep replacing historical terms sequentially all the way back to the initial reward (\(R_1\)) and your starting guess (\(Q_1\)):

$$= \alpha R_k + (1 - \alpha)\alpha R_{k-1} + (1 - \alpha)^2\alpha R_{k-2} + \dots + (1 - \alpha)^{k-1}\alpha R_1 + (1 - \alpha)^k Q_1$$

4. The Final Closed-Form Representation: Consolidating that massive unrolled string into clean summation notation reveals the underlying mechanism:

$$= (1 - \alpha)^k Q_1 + \sum_{i=1}^k \alpha(1 - \alpha)^{k-i}R_i$$
The Takeaway: Because \(\alpha\) is a fraction less than 1, the multiplier \((1-\alpha)\) shrinks exponentially the older a reward is (\(R_{k-2}, R_{k-3}\), etc.), proving mathematically that older data is systematically forgotten.