Reinforcement Learning (13)

TD($\lambda$): Unifying TD(0) and MC

  • 1-step bootstrap (TD(0)): $r_t + \gamma V(s_{t+1})$
  • 2-step bootstrap: $r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$
  • 3-step bootstrap: $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 V(s_{t+3})$
  • $\infty$-step bootstrap: $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots$ is Monte-Carlo (see the code sketch after this list).
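In code, these targets are straightforward to compute. A minimal sketch (with hypothetical `rewards` and `values` arrays indexed by time step, `rewards[t]` $= r_t$ and `values[t]` $= V(s_t)$):

```python
def n_step_target(rewards, values, t, n, gamma):
    """n-step bootstrapped target from time t:
    r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * V(s_{t+n})."""
    G = sum(gamma ** k * rewards[t + k] for k in range(n))  # discounted reward sum
    return G + gamma ** n * values[t + n]                   # bootstrap with V(s_{t+n})
```

With `n=1` this is the TD(0) target; letting `n` run to the end of the episode recovers the Monte-Carlo return.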

Proof of TD($\lambda$)'s correctness

E.g., for the 2-step bootstrap, by the law of total expectation:

$$
\begin{aligned}
& \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) \mid s_t] \\
=\; & \mathbb{E}[r_t + \gamma(r_{t+1} + \gamma V(s_{t+2})) \mid s_t] \\
=\; & \mathbb{E}[r_t \mid s_t] + \gamma\, \mathbb{E}_{s_{t+1}\mid s_t}\!\big[\mathbb{E}[r_{t+1} + \gamma V(s_{t+2}) \mid s_t, s_{t+1}]\big] \\
=\; & \mathbb{E}[r_t + \gamma (\mathcal{T}^\pi V)(s_{t+1}) \mid s_t] \\
=\; & ((\mathcal{T}^\pi)^2 V)(s_t)
\end{aligned}
$$
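Here $\mathcal{T}^\pi$ is the Bellman expectation operator of the policy $\pi$, which in this notation acts on a value function $V$ as

$$(\mathcal{T}^\pi V)(s) = \mathbb{E}\big[r_t + \gamma V(s_{t+1}) \mid s_t = s\big],$$

so by the Markov property the inner conditional expectation above is exactly $(\mathcal{T}^\pi V)(s_{t+1})$. Repeating the argument shows that the expected $n$-step target is $((\mathcal{T}^\pi)^n V)(s_t)$, i.e. every $n$-step bootstrap is an unbiased sample of the $n$-fold Bellman backup.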

TD($\lambda$)

For the $n$-step bootstrap, give a $(1-\lambda)\lambda^{n-1}$ weight, so that the weights over $n = 1, 2, \dots$ sum to 1 (a quick numerical check follows the list below).

  • $\lambda = 0$: only the $n = 1$ return gets nonzero weight, i.e. TD(0).
  • $\lambda \to 1$: the weight shifts onto long returns, giving (almost) Monte-Carlo.
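A quick numerical check of these two limits (hypothetical helper `nstep_weight`):

```python
def nstep_weight(lam, n):
    """Weight (1 - lambda) * lambda**(n - 1) given to the n-step return, n = 1, 2, ..."""
    return (1 - lam) * lam ** (n - 1)

# lambda = 0: all the weight sits on the 1-step return -> TD(0)
print(nstep_weight(0.0, 1), nstep_weight(0.0, 2))              # 1.0 0.0

# lambda close to 1: the weight spreads over long returns -> (almost) Monte-Carlo
print([round(nstep_weight(0.9, n), 4) for n in range(1, 5)])   # [0.1, 0.09, 0.081, 0.0729]
```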

Forward view and backward view

Forward view

$$
\begin{gathered}
(1-\lambda)\cdot (r_1+\gamma V(s_2)-V(s_1)) \\
(1-\lambda)\lambda \cdot (r_1+\gamma r_2+\gamma^2 V(s_3)-V(s_1)) \\
(1-\lambda)\lambda^2 \cdot (r_1+\gamma r_2+\gamma^2 r_3+\gamma^3 V(s_4)-V(s_1)) \\
\cdots
\end{gathered}
$$

and so on; summing these weighted $n$-step errors gives the forward-view update for $V(s_1)$.
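A sketch of that forward-view error for $V(s_1)$ over a finite episode (assuming `rewards[k-1]` $= r_k$, `values[k-1]` $= V(s_k)$, and `values[T]` the value of the state after the last reward, 0 if terminal; the final return carries the remaining tail weight so the weights still sum to 1):

```python
def forward_view_error(rewards, values, lam, gamma):
    """(1 - lam) * lam**(n - 1) weighted sum of n-step errors for s_1."""
    T = len(rewards)
    total, g = 0.0, 0.0
    for n in range(1, T + 1):
        g += gamma ** (n - 1) * rewards[n - 1]       # r_1 + gamma r_2 + ... + gamma^(n-1) r_n
        n_step = g + gamma ** n * values[n]          # n-step bootstrapped target
        w = (1 - lam) * lam ** (n - 1) if n < T else lam ** (T - 1)  # last return takes the tail
        total += w * (n_step - values[0])            # weighted n-step error
    return total
```

$V(s_1)$ would then be nudged by `alpha * forward_view_error(...)`, but only once the whole episode is known, which is the practical drawback of the forward view.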

Backward view

$$
\begin{gathered}
1 \cdot (r_1 + \gamma V(s_2) - V(s_1)) \\
\lambda\gamma \cdot (r_2 + \gamma V(s_3) - V(s_2)) \\
\lambda^2\gamma^2 \cdot (r_3 + \gamma V(s_4) - V(s_3)) \\
\cdots
\end{gathered}
$$
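The backward view is what makes TD($\lambda$) implementable online, via eligibility traces. A minimal tabular sketch (assuming integer state indices, a NumPy float array `V`, and a hypothetical list `episode` of `(s, r, s_next, done)` transitions):

```python
import numpy as np

def td_lambda_backward(V, episode, alpha, gamma, lam):
    """Tabular backward-view TD(lambda) with accumulating eligibility traces."""
    e = np.zeros_like(V)                    # eligibility trace per state
    for s, r, s_next, done in episode:
        delta = r + gamma * (0.0 if done else V[s_next]) - V[s]  # 1-step TD error
        e[s] += 1.0                         # bump the trace of the state just visited
        V += alpha * delta * e              # credit delta to all recently visited states
        e *= gamma * lam                    # every trace decays by gamma * lambda per step
    return V
```

A state visited $k$ steps before the current TD error still carries a trace of $(\gamma\lambda)^k$, which reproduces the $1,\ \lambda\gamma,\ \lambda^2\gamma^2,\ \dots$ weighting listed above.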

