TD(λ): Unifying TD(0) and MC
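- 1-step bootstrap (TD(0)): $r_t + \gamma V(s_{t+1})$
- 2-step bootstrap: $r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$
- 3-step bootstrap: $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V(s_{t+3})$
- …
- ∞-step bootstrap: $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots$ uses no bootstrapping at all. This is the Monte Carlo return.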
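A minimal sketch of the $n$-step target, assuming a finished episode is stored as Python lists: `rewards[k]` is $r_k$ and `values[k]` is the current estimate $V(s_k)$ (0-based), and the state after the last reward is terminal with value 0. The function name `n_step_return` and this storage layout are illustrative assumptions, not part of the notes. Choosing `n` at least as large as the remaining episode length yields the Monte Carlo return.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step bootstrapped target G_t^(n) for a finished episode (hypothetical helper).

    rewards[k] is r_k (received on leaving s_k); values[k] is the current V(s_k).
    The state after the last reward is terminal and has value 0.
    """
    T = len(rewards)
    g = 0.0
    # Discounted sum of up to n rewards, cut off at the end of the episode.
    for k in range(min(n, T - t)):
        g += gamma ** k * rewards[t + k]
    # Bootstrap from V(s_{t+n}) only if s_{t+n} is not terminal.
    if t + n < T:
        g += gamma ** n * values[t + n]
    return g


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.5, 0.5]
for n in (1, 2, 3):
    print(n, n_step_return(rewards, values, 0, n, gamma=0.9))
```

With `gamma=0.9` the three targets are approximately 1.45, 1.405 and 2.62. Since `n=3` already reaches the end of this 3-step episode, it is the Monte Carlo return, and any larger `n` gives the same value.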
Proof of TD(λ)'s correctness
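Take the 2-step bootstrap as an example, and write $T^\pi$ for the Bellman expectation operator, $(T^\pi V)(s) = \mathbb{E}[r_t + \gamma V(s_{t+1}) \mid s_t = s]$. By the law of total expectation (conditioning on $s_{t+1}$) and the Markov property (given $s_{t+1}$, the inner expectation no longer depends on $s_t$),

$$
\begin{aligned}
\mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) \mid s_t]
&= \mathbb{E}[r_t + \gamma (r_{t+1} + \gamma V(s_{t+2})) \mid s_t] \\
&= \mathbb{E}[r_t \mid s_t] + \gamma\, \mathbb{E}_{s_{t+1} \mid s_t}\big[\mathbb{E}[r_{t+1} + \gamma V(s_{t+2}) \mid s_t, s_{t+1}]\big] \\
&= \mathbb{E}[r_t + \gamma (T^\pi V)(s_{t+1}) \mid s_t] \\
&= ((T^\pi)^2 V)(s_t).
\end{aligned}
$$

The same argument shows that the expected $n$-step target is $((T^\pi)^n V)(s_t)$. Since $T^\pi$ is a $\gamma$-contraction with fixed point $V^\pi$, $(T^\pi)^n$ is a $\gamma^n$-contraction with the same fixed point, and so is any weighted average of these operators. Every $n$-step target, and every mixture of them, is therefore a valid target for learning $V^\pi$.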
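The identity can be checked numerically on a small example. The sketch below assumes a made-up 3-state Markov reward process (the chain induced by some fixed policy π) whose reward depends only on the current state. It computes the expected 2-step target by enumerating every path $s_t \to s_{t+1} \to s_{t+2}$ and compares it with applying the Bellman operator twice. The numbers in `P`, `R`, `V` and the helper names are hypothetical.

```python
import itertools

# Hypothetical 3-state Markov reward process induced by a fixed policy pi.
# P[s][s2] = Pr(s_{t+1} = s2 | s_t = s); reward r_t = R[s_t] (state-based, an assumption).
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.3, 0.0, 0.7]]
R = [1.0, -1.0, 2.0]
GAMMA = 0.9
V = [0.2, 0.4, -0.3]  # arbitrary current value estimate
N = len(V)


def bellman(v):
    """Bellman expectation operator: (T^pi v)(s) = E[r_t + gamma * v(s_{t+1}) | s_t = s]."""
    return [R[s] + GAMMA * sum(P[s][s2] * v[s2] for s2 in range(N)) for s in range(N)]


def expected_two_step_target(s):
    """E[r_t + gamma r_{t+1} + gamma^2 V(s_{t+2}) | s_t = s], summing over all 2-step paths."""
    total = 0.0
    for s1, s2 in itertools.product(range(N), repeat=2):
        prob = P[s][s1] * P[s1][s2]
        total += prob * (R[s] + GAMMA * R[s1] + GAMMA ** 2 * V[s2])
    return total


two_step = bellman(bellman(V))  # ((T^pi)^2 V)(s) for every s
for s in range(N):
    print(s, expected_two_step_target(s), two_step[s])
```

The two columns agree up to floating-point rounding for every starting state, as the derivation predicts.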
TD(λ)
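Give the $n$-step bootstrap target $G_t^{(n)}$ the weight $(1-\lambda)\lambda^{n-1}$ for $n = 1, 2, \dots$. These weights sum to 1, and the mixture $G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$ is called the λ-return.
- λ = 0: only n = 1 gets nonzero weight (the full weight 1), which is TD(0).
- λ → 1: the weight shifts to ever-longer returns, approaching Monte Carlo. In an episodic task, λ = 1 puts all the weight on the full return, which is exactly Monte Carlo.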
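A small sketch of these weights. `episodic_lambda_weights` is a hypothetical helper that assumes the episode ends `horizon` steps after $s_t$: every $n$-step return with $n \ge$ `horizon` is then the Monte Carlo return, so that return absorbs the leftover weight $\lambda^{\text{horizon}-1}$.

```python
def lambda_weights(lam, n_max):
    """Weights (1 - lam) * lam**(n - 1) on the n-step returns, n = 1..n_max."""
    return [(1 - lam) * lam ** (n - 1) for n in range(1, n_max + 1)]


def episodic_lambda_weights(lam, horizon):
    """Same weights when the episode ends `horizon` steps ahead (hypothetical helper).

    All n-step returns with n >= horizon coincide with the Monte Carlo
    return, so their combined weight lam**(horizon - 1) goes to n = horizon.
    """
    weights = lambda_weights(lam, horizon - 1)
    weights.append(lam ** (horizon - 1))
    return weights


print(lambda_weights(0.5, 4))           # geometric decay: 0.5, 0.25, 0.125, 0.0625
print(episodic_lambda_weights(0.0, 3))  # TD(0): all weight on n = 1
print(episodic_lambda_weights(1.0, 3))  # lambda = 1: all weight on the MC return
```

The λ = 0 case relies on Python evaluating `0.0 ** 0` as `1.0`, matching the convention $\lambda^0 = 1$ in the weight formula.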
Forward view and backward view
Forward view
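Starting from $s_1$, the forward view looks ahead along the trajectory and weights each $n$-step error:

$$
\begin{aligned}
&(1-\lambda)\,\big(r_1 + \gamma V(s_2) - V(s_1)\big) \\
+\;&(1-\lambda)\lambda\,\big(r_1 + \gamma r_2 + \gamma^2 V(s_3) - V(s_1)\big) \\
+\;&(1-\lambda)\lambda^2\,\big(r_1 + \gamma r_2 + \gamma^2 r_3 + \gamma^3 V(s_4) - V(s_1)\big) \\
+\;&\cdots
\end{aligned}
$$

and so on. The total is the λ-return error $G_1^\lambda - V(s_1)$.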
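A sketch of the forward view for a finished episode, assuming it is stored as lists with `rewards[k]` $= r_k$ and `values[k]` $= V(s_k)$ (0-based, terminal value 0). It returns the weighted sum above, $G_t^\lambda - V(s_t)$; scaling it by a step size would give an offline update. Handling the end of the episode by giving the Monte Carlo return the leftover weight is an assumption about how the infinite sum is truncated, and the function names are hypothetical.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n): up to n discounted rewards, then bootstrap from V(s_{t+n}) if non-terminal."""
    T = len(rewards)
    g = sum(gamma ** k * rewards[t + k] for k in range(min(n, T - t)))
    if t + n < T:
        g += gamma ** n * values[t + n]
    return g


def forward_view_error(rewards, values, t, gamma, lam):
    """Forward view: sum_n w_n * (G_t^(n) - V(s_t)), i.e. G_t^lambda - V(s_t)."""
    horizon = len(rewards) - t  # steps left until the episode ends
    total = 0.0
    for n in range(1, horizon):
        weight = (1 - lam) * lam ** (n - 1)
        total += weight * (n_step_return(rewards, values, t, n, gamma) - values[t])
    # Every n >= horizon gives the Monte Carlo return; it gets the leftover weight.
    mc_error = n_step_return(rewards, values, t, horizon, gamma) - values[t]
    total += lam ** (horizon - 1) * mc_error
    return total


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.5, 0.5]
for lam in (0.0, 0.5, 1.0):
    print(lam, forward_view_error(rewards, values, 0, gamma=0.9, lam=lam))
```

Here λ = 0 recovers the TD(0) error (about 0.95) and λ = 1 the Monte Carlo error (about 2.12), with λ = 0.5 landing in between (about 1.23125).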
Backward view
1⋅(r1+γV(s2)−V(s1))λγ⋅(r2+γV(s3)−V(s2))λ2γ2⋅(r3+γV(s4)−V(s3))⋯
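A sketch of the backward view using one eligibility trace per time step, the standard mechanism for this view (used here as an assumption about implementation). At step $k$ the trace of every earlier step $j$ has decayed to $(\gamma\lambda)^{k-j}$, so $\delta_k$ is sent back to $V(s_j)$ with exactly the weights above. $V$ is held fixed over the episode, and a compact forward-view computation is included to check the equivalence numerically. The list layout (`rewards[k]` $= r_k$, `values[k]` $= V(s_k)$, terminal value 0) and all function names are hypothetical.

```python
import random


def td_errors(rewards, values, gamma):
    """delta_k = r_k + gamma * V(s_{k+1}) - V(s_k), with V(terminal) = 0."""
    T = len(rewards)
    return [rewards[k] + gamma * (values[k + 1] if k + 1 < T else 0.0) - values[k]
            for k in range(T)]


def backward_view_errors(rewards, values, gamma, lam):
    """Offline backward view: accumulated error for every time step j.

    traces[j] is the eligibility of step j; it is set to 1 when s_j is visited
    and decays by gamma * lam on every later step.
    """
    T = len(rewards)
    deltas = td_errors(rewards, values, gamma)
    traces = [0.0] * T
    acc = [0.0] * T
    for k in range(T):
        traces = [gamma * lam * e for e in traces]  # decay all existing traces
        traces[k] += 1.0                            # step k is now fully eligible
        for j in range(k + 1):
            acc[j] += traces[j] * deltas[k]         # send delta_k backward
    return acc


def forward_view_error(rewards, values, t, gamma, lam):
    """Forward view G_t^lambda - V(s_t), for comparison."""
    T = len(rewards)
    total, g = 0.0, 0.0
    for n in range(1, T - t + 1):
        g += gamma ** (n - 1) * rewards[t + n - 1]  # rewards r_t .. r_{t+n-1}
        bootstrapped = t + n < T
        target = g + (gamma ** n * values[t + n] if bootstrapped else 0.0)
        weight = (1 - lam) * lam ** (n - 1) if bootstrapped else lam ** (n - 1)
        total += weight * (target - values[t])
    return total


random.seed(0)
rewards = [random.uniform(-1, 1) for _ in range(6)]
values = [random.uniform(-1, 1) for _ in range(6)]
backward = backward_view_errors(rewards, values, gamma=0.9, lam=0.7)
for t in range(len(rewards)):
    forward = forward_view_error(rewards, values, t, gamma=0.9, lam=0.7)
    print(t, round(forward, 10), round(backward[t], 10))
```

The two columns agree up to floating-point rounding. Tabular TD(λ) usually keeps one trace per state rather than per time step; the two coincide when no state repeats within the episode.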