greedy policy:
$$\pi^\star(s) = \arg\max_{a \in \mathcal{A}} Q^\star(s, a)$$
sequence of functions:
$$f_0, f_1, f_2, \dots \to Q^\star$$
define
$$\pi_{f_k}(s) = \arg\max_{a \in \mathcal{A}} f_k(s, a)$$
Claim:
$$\left\| V^\star - V^{\pi_f} \right\|_\infty \le \frac{2}{1-\gamma} \left\| f - Q^\star \right\|_\infty$$
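The claim can be checked numerically on a toy problem. The sketch below is only an illustration: the 2-state, 2-action MDP (`R`, `P`, `GAMMA`) and all helper names are hypothetical, $Q^\star$ is approximated by repeated backups, and $f$ is $Q^\star$ plus bounded random noise.

```python
import random

# Hypothetical toy MDP, chosen only for illustration.
GAMMA = 0.9
N_S, N_A = 2, 2
R = [[0.0, 1.0], [0.5, 0.0]]                          # R[s][a]
P = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [1.0, 0.0]]]  # P[s][a][s']

def q_backup(q):
    """One Bellman optimality backup on a Q-table."""
    return [[R[s][a] + GAMMA * sum(P[s][a][t] * max(q[t]) for t in range(N_S))
             for a in range(N_A)] for s in range(N_S)]

def optimal_q(iters=2000):
    """Approximate Q* by iterating the backup from zero."""
    q = [[0.0] * N_A for _ in range(N_S)]
    for _ in range(iters):
        q = q_backup(q)
    return q

def greedy(f):
    """Greedy policy pi_f(s) = argmax_a f(s, a)."""
    return [max(range(N_A), key=lambda a: f[s][a]) for s in range(N_S)]

def policy_value(pi, iters=2000):
    """Approximate V^pi by iterative policy evaluation."""
    v = [0.0] * N_S
    for _ in range(iters):
        v = [R[s][pi[s]] + GAMMA * sum(P[s][pi[s]][t] * v[t] for t in range(N_S))
             for s in range(N_S)]
    return v

def check_bound(noise, seed=0):
    """Return (||V* - V^{pi_f}||_inf, 2 ||f - Q*||_inf / (1 - gamma))."""
    rng = random.Random(seed)
    q_star = optimal_q()
    v_star = [max(row) for row in q_star]
    f = [[x + rng.uniform(-noise, noise) for x in row] for row in q_star]
    v_pi = policy_value(greedy(f))
    lhs = max(abs(a - b) for a, b in zip(v_star, v_pi))
    err = max(abs(f[s][a] - q_star[s][a]) for s in range(N_S) for a in range(N_A))
    return lhs, 2 * err / (1 - GAMMA)
```

Calling `check_bound(0.5)` returns the left-hand and right-hand sides of the inequality for one noisy $f$; the bound is typically loose on such a small example.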
define operator T :
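$$(\mathcal{T} f)(s) = \max_{a \in \mathcal{A}} \left( R(s, a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \left[ f(s') \right] \right)$$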
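Note:
the $\mathcal{T}$ in $\mathcal{T} Q^\star$ and the $\mathcal{T}$ in $\mathcal{T} V^\star$ are not the same operator: the former acts on functions of $(s, a)$, the latter on functions of $s$.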
{: .prompt-tip }
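To make the distinction in the note concrete, here is a minimal sketch of both operators on a hypothetical 2-state, 2-action MDP. The tables `R` and `P`, the value of `GAMMA`, and the function names are assumptions made for illustration; the Q-version is the standard Bellman optimality operator on state-action functions.

```python
# Hypothetical toy MDP, chosen only for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
R = [[0.0, 1.0], [0.5, 0.0]]                          # R[s][a]
P = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [1.0, 0.0]]]  # P[s][a][s']

def bellman_v(f):
    """State-value version: (T f)(s) = max_a (R(s,a) + gamma * E[f(s')])."""
    return [
        max(R[s][a] + GAMMA * sum(P[s][a][s2] * f[s2] for s2 in STATES)
            for a in ACTIONS)
        for s in STATES
    ]

def bellman_q(f):
    """Action-value version: (T f)(s,a) = R(s,a) + gamma * E[max_a' f(s',a')]."""
    return [
        [R[s][a] + GAMMA * sum(P[s][a][s2] * max(f[s2]) for s2 in STATES)
         for a in ACTIONS]
        for s in STATES
    ]
```

The two differ in where the max sits: `bellman_v` maximizes over the current action outside the expectation, while `bellman_q` maximizes over the next action inside it.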
$V^\star$ Iteration
$$f_0 = 0, \qquad f_k \leftarrow \mathcal{T} f_{k-1}$$
then
$$f_k(s) = \max_{\text{all possible } \pi} \mathbb{E} \left[ \sum_{t=1}^{k} \gamma^{t-1} r_t \,\middle|\, s_1 = s, \pi \right]$$
This follows from the definition of the operator $\mathcal{T}$, by induction on $k$.
{: .prompt-tip }
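The identity for $f_k$ can be sanity-checked by brute force on a tiny example. This is a sketch under stated assumptions: the MDP (`R`, `P`, `GAMMA`) and the helper names are hypothetical, and the search is restricted to deterministic, possibly non-stationary, Markov plans, which is enough to attain the maximum in a finite MDP.

```python
from itertools import product

# Hypothetical toy MDP, chosen only for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
R = [[0.0, 1.0], [0.5, 0.0]]                          # R[s][a]
P = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [1.0, 0.0]]]  # P[s][a][s']

def backup(s, a, f):
    """One-step lookahead: R(s,a) + gamma * E_{s'~P(.|s,a)}[f(s')]."""
    return R[s][a] + GAMMA * sum(P[s][a][s2] * f[s2] for s2 in STATES)

def value_iteration(k):
    """f_0 = 0, then f_j = T f_{j-1} for j = 1..k."""
    f = [0.0 for _ in STATES]
    for _ in range(k):
        f = [max(backup(s, a, f) for a in ACTIONS) for s in STATES]
    return f

def brute_force(k):
    """Best k-step discounted return over all deterministic Markov plans."""
    rules = list(product(ACTIONS, repeat=len(STATES)))  # rule[s] = action in s
    best = [float("-inf") for _ in STATES]
    for plan in product(rules, repeat=k):               # plan[t] = rule at step t+1
        v = [0.0 for _ in STATES]
        for rule in reversed(plan):                     # evaluate from the last step back
            v = [backup(s, rule[s], v) for s in STATES]
        best = [max(b, x) for b, x in zip(best, v)]
    return best
```

For small `k`, `value_iteration(k)` and `brute_force(k)` should agree up to floating-point error; the brute-force search grows exponentially in `k`, which is exactly the cost that the recursion avoids.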
Claim:
$$\left\| f_k - V^\star \right\|_\infty \lesssim \gamma^k$$
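step 1: $f_k \le V^\star$ (assuming non-negative rewards, truncating the return after $k$ steps can only lower it)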
step 2:
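$$
\begin{aligned}
f_k(s) &\ge \mathbb{E} \left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1 = s, \pi^\star \right] - \mathbb{E} \left[ \sum_{t=k+1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1 = s, \pi^\star \right] \\
&\ge V^\star(s) - \gamma^k V_{\max} \qquad \blacksquare
\end{aligned}
$$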
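The two steps together give $0 \le V^\star - f_k \le \gamma^k V_{\max}$, which can be observed numerically. The sketch below is illustrative only: the MDP is hypothetical, rewards are assumed to lie in $[0, R_{\max}]$, $V_{\max}$ is taken to be $R_{\max} / (1 - \gamma)$, and $V^\star$ is approximated by running the iteration for many steps.

```python
# Hypothetical toy MDP with rewards in [0, R_MAX], chosen only for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
R = [[0.0, 1.0], [0.5, 0.0]]                          # R[s][a]
P = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [1.0, 0.0]]]  # P[s][a][s']

R_MAX = max(max(row) for row in R)
V_MAX = R_MAX / (1 - GAMMA)   # assumed definition of V_max

def bellman(f):
    """(T f)(s) = max_a (R(s,a) + gamma * E[f(s')])."""
    return [
        max(R[s][a] + GAMMA * sum(P[s][a][s2] * f[s2] for s2 in STATES)
            for a in ACTIONS)
        for s in STATES
    ]

def iterate(k):
    """f_0 = 0, f_j = T f_{j-1}; returns f_k."""
    f = [0.0 for _ in STATES]
    for _ in range(k):
        f = bellman(f)
    return f

V_STAR = iterate(2000)        # numerical stand-in for V*

def gap(k):
    """Sup-norm distance ||f_k - V*||_inf."""
    return max(abs(v - x) for v, x in zip(V_STAR, iterate(k)))

if __name__ == "__main__":
    for k in (0, 5, 10, 20):
        print(k, gap(k), GAMMA ** k * V_MAX)
```

Each printed row shows the iteration count, the measured gap, and the bound $\gamma^k V_{\max}$; the gap should stay below the bound and shrink geometrically.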