Recall $V^\star, Q^\star, V^\pi, Q^\pi$.
$$V^\star(s) = \max_{a \in \mathcal{A}} \underbrace{\Big( R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[V^\star(s')\big] \Big)}_{Q^\star(s,a)}$$
Bellman Operator
$$\forall f: \mathcal{S} \times \mathcal{A} \to \mathbb{R}, \quad (\mathcal{T}f)(s,a) = R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\max_{a'} f(s',a')\Big], \quad \text{where } \mathcal{T}: \mathbb{R}^{\mathcal{S} \times \mathcal{A}} \to \mathbb{R}^{\mathcal{S} \times \mathcal{A}}.$$
then
$$Q^\star = \mathcal{T} Q^\star, \qquad V^\star = \mathcal{T} V^\star,$$
i.e. $Q^\star$ and $V^\star$ are fixed points of $\mathcal{T}$ (for $V^\star$, with the analogous Bellman operator on $\mathbb{R}^{\mathcal{S}}$).
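To make the operator concrete, here is a minimal NumPy sketch for a tabular MDP (an illustrative sketch under assumed conventions, not from the notes): `R` is an $|\mathcal{S}| \times |\mathcal{A}|$ reward array, `P` an $|\mathcal{S}| \times |\mathcal{A}| \times |\mathcal{S}|$ transition array, and `gamma` the discount.

```python
import numpy as np

def bellman_operator(f, R, P, gamma):
    """Apply (Tf)(s,a) = R(s,a) + gamma * E_{s'~P(.|s,a)}[max_{a'} f(s',a')].

    f : (S, A) array, current Q-function estimate
    R : (S, A) array, expected immediate rewards
    P : (S, A, S) array, P[s, a, s'] = transition probability
    """
    greedy_values = f.max(axis=1)          # max_{a'} f(s', a') for every s'
    return R + gamma * P @ greedy_values   # (S, A, S) @ (S,) -> (S, A)

# Tiny illustrative example: a random 3-state, 2-action MDP.
rng = np.random.default_rng(0)
R = rng.random((3, 2))
P = rng.random((3, 2, 3))
P /= P.sum(axis=2, keepdims=True)          # normalize each P(.|s,a) to a distribution
print(bellman_operator(np.zeros((3, 2)), R, P, gamma=0.9))
```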
Value Iteration Algorithm (VI)
Initialize the function $f_0 = 0 \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$.
Iterate to compute the fixed point:
for $k = 1, 2, 3, \dots$,
$$f_k = \mathcal{T} f_{k-1}$$
How to carry out one iteration:
$$\forall s, a: \quad f_1(s,a) \leftarrow (\mathcal{T} f_0)(s,a)$$
- synchronous iteration: every update in sweep $k$ uses only values from the previous iterate $f_{k-1}$.
- asynchronous iteration: entries are updated in place, so later updates within a sweep may already use the newly written values (see the sketch after this list).
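A minimal sketch of both update schedules, assuming the same tabular `R`/`P` conventions as the sketch above; the function name, iteration count, and schedule flag are illustrative choices, not from the notes.

```python
import numpy as np

def value_iteration(R, P, gamma, num_iters=1000, synchronous=True):
    """Tabular VI: repeatedly apply the Bellman operator starting from f_0 = 0."""
    S, A = R.shape
    f = np.zeros((S, A))
    for _ in range(num_iters):
        if synchronous:
            # Synchronous sweep: every (s, a) uses only the old iterate f_{k-1}.
            f = R + gamma * P @ f.max(axis=1)
        else:
            # Asynchronous sweep: entries are overwritten in place, so later
            # (s, a) pairs may already see the newly written values.
            for s in range(S):
                for a in range(A):
                    f[s, a] = R[s, a] + gamma * P[s, a] @ f.max(axis=1)
    return f
```

Both schedules converge to $Q^\star$ (the asynchronous one because every entry is still updated in each sweep); the synchronous sweep is exactly $f_k = \mathcal{T} f_{k-1}$, while the asynchronous one interleaves the per-entry updates.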
Convergence of VI
Lemma: $\mathcal{T}$ is a $\gamma$-contraction under $\|\cdot\|_\infty$, where $\|f\|_\infty := \max_{s,a} |f(s,a)|$.
That is,
$$\|\mathcal{T}f - \mathcal{T}f'\|_\infty \leq \gamma\, \|f - f'\|_\infty$$
Proof.
$$\|f_k - Q^\star\|_\infty = \|\mathcal{T} f_{k-1} - Q^\star\|_\infty = \|\mathcal{T} f_{k-1} - \mathcal{T} Q^\star\|_\infty \overset{\text{lemma}}{\leq} \gamma\, \|f_{k-1} - Q^\star\|_\infty$$
$$\Rightarrow \|f_k - Q^\star\|_\infty \leq \gamma^k\, \|f_0 - Q^\star\|_\infty \to 0 \quad \text{where } \gamma \in (0,1), \text{ so } f_k \to Q^\star. \;\blacksquare$$
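As a quick numerical illustration (not part of the original argument), the sketch below builds a small random MDP and checks the geometric bound $\|f_k - Q^\star\|_\infty \le \gamma^k \|f_0 - Q^\star\|_\infty$, with $Q^\star$ approximated by running the operator to numerical convergence.

```python
import numpy as np

# Illustrative random MDP; all names and sizes are assumptions, not from the notes.
rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.8
R = rng.random((S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)

def T(f):
    """Bellman operator (Tf)(s,a) = R(s,a) + gamma * E_{s'}[max_{a'} f(s',a')]."""
    return R + gamma * P @ f.max(axis=1)

# Approximate Q* by applying T until numerical convergence.
q_star = np.zeros((S, A))
for _ in range(2000):
    q_star = T(q_star)

f = np.zeros((S, A))                          # f_0 = 0
err0 = np.abs(f - q_star).max()               # ||f_0 - Q*||_inf
for k in range(1, 11):
    f = T(f)
    err = np.abs(f - q_star).max()
    assert err <= gamma**k * err0 + 1e-10     # the geometric bound from the proof
    print(f"k={k:2d}  ||f_k - Q*||_inf = {err:.6f}  bound = {gamma**k * err0:.6f}")
```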
Proof of lemma:
It suffices to prove that for all $(s,a)$, $|(\mathcal{T}f - \mathcal{T}f')(s,a)| \leq \gamma\, \|f - f'\|_\infty$:
$$\begin{aligned}
\big|(\mathcal{T}f - \mathcal{T}f')(s,a)\big| &= \Big|\Big(R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[\max_{a'} f(s',a')\big]\Big) - \Big(R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[\max_{a'} f'(s',a')\big]\Big)\Big| \\
&= \gamma\, \Big|\mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\max_{a'} f(s',a') - \max_{a'} f'(s',a')\Big]\Big|
\end{aligned}$$
WLOG assume $\max_{a'} f(s',a') \geq \max_{a'} f'(s',a')$, and let $a^\star$ (depending on $s'$) be such that $\max_{a} f(s',a) = f(s',a^\star)$. Then
$$\begin{aligned}
\gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\max_{a'} f(s',a') - \max_{a'} f'(s',a')\Big]
&= \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[f(s',a^\star) - \max_{a'} f'(s',a')\Big] \\
&\leq \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[f(s',a^\star) - f'(s',a^\star)\big] \\
&\leq \gamma\, \max_{s'} \big|f(s',a^\star) - f'(s',a^\star)\big| \\
&\leq \gamma\, \|f - f'\|_\infty \quad \blacksquare
\end{aligned}$$
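For completeness, here is a small empirical check of the lemma on an assumed random tabular MDP (again an illustrative sketch, not from the notes): for random pairs $f, f'$, the inequality $\|\mathcal{T}f - \mathcal{T}f'\|_\infty \le \gamma\, \|f - f'\|_\infty$ should hold.

```python
import numpy as np

# Illustrative random MDP; sizes and names are assumptions.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
R = rng.random((S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)          # each P(.|s,a) sums to 1

def T(f):
    """Bellman operator (Tf)(s,a) = R(s,a) + gamma * E_{s'}[max_{a'} f(s',a')]."""
    return R + gamma * P @ f.max(axis=1)

for _ in range(1000):
    f, g = rng.normal(size=(S, A)), rng.normal(size=(S, A))
    lhs = np.abs(T(f) - T(g)).max()         # ||Tf - Tf'||_inf
    rhs = gamma * np.abs(f - g).max()       # gamma * ||f - f'||_inf
    assert lhs <= rhs + 1e-12
print("gamma-contraction inequality held for 1000 random pairs")
```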