Reinforcement Learning (9)

Recap

In policy iteration, we apply the greedy improvement step at every iteration.

The number of iterations is finite: there are only finitely many deterministic policies, and each greedy step yields a policy at least as good as the last, so the loop terminates. A minimal sketch follows.
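As a concrete reference, here is a minimal tabular policy-iteration sketch; the array shapes, the function name `policy_iteration`, and the exact solve used for evaluation are assumptions for illustration, not from the lecture.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration.

    P: (nS, nA, nS) transition probabilities; R: (nS, nA) expected rewards.
    """
    nS, nA = R.shape
    pi = np.zeros(nS, dtype=int)             # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = P[np.arange(nS), pi]          # (nS, nS) transitions under pi
        r_pi = R[np.arange(nS), pi]
        V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)
        # Greedy improvement: pi_{k+1}(s) = argmax_a Q^{pi_k}(s, a).
        Q = R + gamma * P @ V                # (nS, nA)
        pi_next = Q.argmax(axis=1)
        if np.array_equal(pi_next, pi):      # greedy policy stable => optimal
            return pi, V
        pi = pi_next
```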

Another proof of policy improvement

Performance-difference lemma (P-D lemma)

> This is a fundamental tool in RL; many deep RL algorithms rely on this lemma.
{: .prompt-info }

$$
\forall \pi, \pi', s, \qquad
V^{\color{red}{\pi'}}(s) - V^{\pi}(s) = \frac{1}{1-\gamma} E_{s'\sim d_s^{\color{red}{\pi'}}} \left[ Q^\pi(s', {\color{red}{\pi'}}) - V^\pi(s') \right]
$$
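As a sanity check, the lemma can be verified numerically on a tiny randomly generated MDP; the sketch below is an illustration under made-up assumptions (the MDP, the start state `s0`, and the two deterministic policies are arbitrary). Here $d_s^{\pi'}(s') = (1-\gamma)\sum_{t\ge 0}\gamma^t \Pr(s_t = s' \mid s_0 = s, \pi')$ is the normalized discounted state-occupancy distribution appearing in the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, s0 = 3, 2, 0.9, 0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] = next-state distribution
R = rng.random((nS, nA))                        # R[s, a] = expected reward

def V(pi):
    """Exact V^pi for a deterministic policy pi (array of actions)."""
    P_pi, r_pi = P[np.arange(nS), pi], R[np.arange(nS), pi]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

def Q(pi):
    """Q^pi(s, a) = R[s, a] + gamma * E_{s'|s,a}[V^pi(s')]."""
    return R + gamma * P @ V(pi)

pi, pi_p = np.array([0, 1, 0]), np.array([1, 1, 1])   # pi and pi' (arbitrary)

# d_s^{pi'} = (1 - gamma) * e_s^T (I - gamma * P_{pi'})^{-1}
P_pp = P[np.arange(nS), pi_p]
d = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pp.T, np.eye(nS)[s0])

lhs = V(pi_p)[s0] - V(pi)[s0]
rhs = d @ (Q(pi)[np.arange(nS), pi_p] - V(pi)) / (1 - gamma)
print(lhs, rhs)   # the two sides agree up to floating-point error
```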

Apply the lemma to one policy-iteration step, with $\pi' = \pi_{k+1}$ and $\pi = \pi_k$:

$$
V^{\pi_{k+1}}(s) - V^{\pi_k}(s) = \frac{1}{1-\gamma} E_{s'\sim d_s^{\pi_{k+1}}} \left[ Q^{\pi_k}(s', \pi_{k+1}) - \boxed{V^{\pi_k}(s')} \right]
$$

and

$$
\boxed{V^{\pi_k}(s')} = Q^{\pi_k}(s', \pi_k)
$$

and the RHS is $\ge 0$ because $\pi_{k+1}$ is greedy with respect to $Q^{\pi_k}$, i.e. $Q^{\pi_k}(s', \pi_{k+1}) = \max_a Q^{\pi_k}(s', a) \ge Q^{\pi_k}(s', \pi_k)$. QED

Proof of lemma
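A sketch of the standard telescoping argument (reconstructed, not transcribed from the lecture): expanding along trajectories of $\pi'$ started from $s$,

$$
V^{\pi'}(s) - V^{\pi}(s)
= E_{\tau \sim \pi',\, s_0 = s}\left[ \sum_{t=0}^{\infty} \gamma^t \left( r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) \right) \right],
$$

because the $V^\pi$ terms telescope to $-V^\pi(s)$ while the rewards sum to $V^{\pi'}(s)$ in expectation. Since $E\left[ r_t + \gamma V^{\pi}(s_{t+1}) \mid s_t \right] = Q^{\pi}(s_t, \pi')$ when $a_t \sim \pi'$, each term equals $E\left[ Q^{\pi}(s_t, \pi') - V^{\pi}(s_t) \right]$, and rewriting $\sum_t \gamma^t E_{s_t \sim \pi'}[\,\cdot\,]$ as $\frac{1}{1-\gamma} E_{s' \sim d_s^{\pi'}}[\,\cdot\,]$ gives the lemma.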

