Reinforcement Learning (9)
recap
in policy iteration, apply the greedy improvement step at every iteration.
the number of steps is finite.
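As a rough illustration of the recap, here is a minimal sketch of tabular policy iteration on a small random MDP. The MDP (`P`, `R`), the sizes `S` and `A`, the discount `gamma=0.9`, and the function name `policy_iteration` are all assumptions made for this example, not taken from the lecture. Evaluation is done exactly by solving a linear system, and an action is only switched when it is strictly better, so the loop stops once no state can be improved.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iters=1000):
    """Tabular policy iteration (illustrative sketch).

    P: transition tensor of shape (S, A, S); R: reward table of shape (S, A).
    Returns the final policy, its value function, and the list of value
    functions computed at each iteration.
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial policy
    history = []
    for _ in range(max_iters):
        # policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly
        P_pi = P[idx, policy]               # shape (S, S)
        R_pi = R[idx, policy]               # shape (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        history.append(V)
        # greedy step: Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
        Q = R + gamma * (P @ V)             # shape (S, A)
        # switch only where some action is strictly better (small tolerance)
        improve = Q.max(axis=1) > Q[idx, policy] + 1e-10
        if not improve.any():
            break                           # no state can be improved: stop
        policy = np.where(improve, Q.argmax(axis=1), policy)
    return policy, V, history

# hypothetical random MDP, used only for the demonstration
rng = np.random.default_rng(0)
S, A = 5, 3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)           # normalize rows into distributions
R = rng.random((S, A))

policy, V, history = policy_iteration(P, R)
print("iterations:", len(history), "policy:", policy)
```

Since there are at most `A ** S` deterministic policies and no policy is revisited while the value keeps improving, the iteration count cannot exceed that bound.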
another proof
performance-difference lemma (P-D lemma)
this is a fundamental tool in RL.
many deep RL algorithms rely on this lemma.
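A common statement of the lemma, in notation assumed here (it may differ from the lecture's): $\gamma$ is the discount factor, $A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)$ is the advantage of $\pi$, policies are deterministic, and $d^{\pi'}_{s}$ is the normalized discounted state occupancy of $\pi'$ started from $s$. For any two policies $\pi, \pi'$ and any state $s$:

$$
V^{\pi'}(s) - V^{\pi}(s)
= \frac{1}{1-\gamma}\, \mathbb{E}_{s' \sim d^{\pi'}_{s}}\left[ A^{\pi}\big(s', \pi'(s')\big) \right],
\qquad
d^{\pi'}_{s}(s') = (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} \Pr\left(s_t = s' \mid s_0 = s, \pi'\right).
$$

In words: the gap between the two policies equals the advantage of $\pi'$'s actions, measured under $\pi$, averaged over the states that $\pi'$ visits.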
apply the lemma in the policy iteration steps:
and
and the RHS is trivial. QED
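A sketch of how this step can be written out, under assumed notation: $\pi_k$ is the current policy, $\pi_{k+1}$ is greedy with respect to $Q^{\pi_k}$, $\gamma$ is the discount factor, and $d^{\pi_{k+1}}_{s}$ is the normalized discounted state occupancy of $\pi_{k+1}$ started from $s$. Taking $\pi = \pi_k$ and $\pi' = \pi_{k+1}$ in the lemma gives

$$
V^{\pi_{k+1}}(s) - V^{\pi_k}(s)
= \frac{1}{1-\gamma}\, \mathbb{E}_{s' \sim d^{\pi_{k+1}}_{s}}\left[ Q^{\pi_k}\big(s', \pi_{k+1}(s')\big) - V^{\pi_k}(s') \right],
$$

and, because $\pi_{k+1}$ is greedy,

$$
Q^{\pi_k}\big(s', \pi_{k+1}(s')\big) = \max_{a} Q^{\pi_k}(s', a) \;\ge\; Q^{\pi_k}\big(s', \pi_k(s')\big) = V^{\pi_k}(s').
$$

Every term inside the expectation is therefore nonnegative, so $V^{\pi_{k+1}}(s) \ge V^{\pi_k}(s)$ for all $s$.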
Proof of lemma
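One standard argument is a telescoping sum; it is sketched here in assumed notation and is not necessarily the route taken in the lecture. Let $s_0 = s$ and let $(s_t, a_t, r_t)_{t \ge 0}$ be the trajectory generated by $\pi'$, with $\gamma$ the discount factor and rewards bounded:

$$
\begin{aligned}
V^{\pi'}(s) - V^{\pi}(s)
&= \mathbb{E}_{\pi'}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \right] - V^{\pi}(s_0) \\
&= \mathbb{E}_{\pi'}\left[ \sum_{t=0}^{\infty} \gamma^{t} \big( r_t + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) \big) \right] \\
&= \mathbb{E}_{\pi'}\left[ \sum_{t=0}^{\infty} \gamma^{t} \big( Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) \big) \right] \\
&= \frac{1}{1-\gamma}\, \mathbb{E}_{s' \sim d^{\pi'}_{s}}\left[ Q^{\pi}\big(s', \pi'(s')\big) - V^{\pi}(s') \right].
\end{aligned}
$$

The second line uses the telescoping identity $\sum_{t \ge 0} \gamma^{t} \big( \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) \big) = -V^{\pi}(s_0)$. The third uses $\mathbb{E}\left[ r_t + \gamma V^{\pi}(s_{t+1}) \mid s_t, a_t \right] = Q^{\pi}(s_t, a_t)$. The last rewrites the discounted sum over time as an expectation over $d^{\pi'}_{s}(s') = (1-\gamma) \sum_{t \ge 0} \gamma^{t} \Pr(s_t = s' \mid s_0 = s, \pi')$.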
https://yzzzf.xyz/2024/02/13/reinforcement-learning-lecture-9/