Reinforcement Learning (6)

Recall $V^\star, Q^\star, V^\pi, Q^\pi$.

$$
V^\star(s)=\max_{a \in A}\Big(\underbrace{R(s, a)+\gamma \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[V^\star(s')\right]}_{Q^\star(s, a)}\Big)
$$

Bellman Operator

$$
\begin{gathered}
\forall f: S\times A \to \mathbb{R}, \\
(\mathcal{T}f)(s,a) = R(s,a) + \gamma \, \mathbb{E}_{s'\sim P(\cdot|s,a)}\left[\max_{a'} f(s', a')\right] \\
\text{where}\; \mathcal{T}: \mathbb{R}^{SA} \to \mathbb{R}^{SA}.
\end{gathered}
$$

then

$$
\begin{gathered}
Q^\star = \mathcal{T} Q^\star \\
V^\star = \mathcal{T} V^\star
\end{gathered}
$$

i.e. $Q^\star$ and $V^\star$ are fixed points of $\mathcal{T}$ (for $V^\star$, of the analogous operator on $\mathbb{R}^S$).
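As a concrete sketch, one application of $\mathcal{T}$ is a single backup over a tabular Q-function. The two-state, two-action MDP below is hypothetical, with reward and transition tables invented purely for illustration:

```python
# A tiny hypothetical 2-state, 2-action MDP (states 0,1; actions 0,1).
GAMMA = 0.9
R = {(s, a): float(s == a) for s in range(2) for a in range(2)}  # invented rewards
# P[(s, a)] = probability distribution over next states s'
P = {(0, 0): [0.8, 0.2], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}

def bellman(f):
    """Apply the Bellman optimality operator T to a Q-table f."""
    return {(s, a): R[(s, a)] + GAMMA * sum(
                p * max(f[(s2, a2)] for a2 in range(2))
                for s2, p in enumerate(P[(s, a)]))
            for s in range(2) for a in range(2)}
```

Starting from the zero function, one application of `bellman` returns the immediate-reward table, matching $(\mathcal{T}\vec 0)(s,a) = R(s,a)$.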

Value Iteration Algorithm (VI)

Initialize $f_0 = \vec{0} \in \mathbb{R}^{SA}$.

Iterate toward the fixed point:

for $k = 1, 2, 3, \cdots$,

$$
f_k = \mathcal{T} f_{k-1}
$$

How to perform the iteration

$$
\forall s,a: \; f_1(s,a) \leftarrow (\mathcal{T} f_0)(s,a)
$$

  • synchronous iteration: every update in a sweep uses only function values from the previous iterate.
  • asynchronous iteration: updates are applied in place, so later updates within a sweep may already use newly computed values.
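The two update schemes can be sketched side by side. The MDP below is a hypothetical example with invented rewards and transitions; both schemes converge to the same fixed point:

```python
GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP, invented for illustration.
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
P = {(0, 0): [0.8, 0.2], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}

def backup(f, s, a):
    """One Bellman backup for a single (s, a) pair."""
    return R[(s, a)] + GAMMA * sum(
        p * max(f[(s2, a2)] for a2 in range(2))
        for s2, p in enumerate(P[(s, a)]))

def sync_vi(iters):
    """Synchronous VI: each sweep builds a fresh table from the old iterate."""
    f = {(s, a): 0.0 for s in range(2) for a in range(2)}
    for _ in range(iters):
        f = {(s, a): backup(f, s, a) for s in range(2) for a in range(2)}
    return f

def async_vi(iters):
    """Asynchronous (in-place) VI: later backups in a sweep see new values."""
    f = {(s, a): 0.0 for s in range(2) for a in range(2)}
    for _ in range(iters):
        for s in range(2):
            for a in range(2):
                f[(s, a)] = backup(f, s, a)
    return f
```

With enough sweeps, `sync_vi` and `async_vi` agree to numerical precision, and the result is (approximately) a fixed point of the backup.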

Convergence of VI

Lemma: $\mathcal{T}$ is a $\gamma$-contraction under $\|\cdot\|_{\infty}$, where $\|f\|_{\infty} := \max_{s,a} |f(s,a)|$,

which means

$$
\| \mathcal{T}f-\mathcal{T}f'\|_{\infty}\leq\gamma\| f-f'\|_{\infty}
$$

Proof.

$$
\begin{aligned}
\| f_k - Q^\star \|_\infty
&= \| \mathcal{T}f_{k-1} - Q^\star \|_\infty \\
&= \| \mathcal{T}f_{k-1} - \mathcal{T}Q^\star \|_\infty \\
&\overset{\text{lemma}}{\le} \gamma \| f_{k-1} - Q^\star \|_\infty
\end{aligned}
$$

$$
\Rightarrow \| f_k-Q^\star \|_\infty \le \gamma^k \| f_0 -Q^\star \|_\infty \to 0 \;\text{as}\; k \to \infty, \;\text{since}\; \gamma \in (0,1). \quad\blacksquare
$$
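The geometric bound $\|f_k - Q^\star\|_\infty \le \gamma^k \|f_0 - Q^\star\|_\infty$ can be checked empirically. The MDP below is invented for illustration, and $Q^\star$ is approximated by running the iteration far past the horizon being tested:

```python
GAMMA = 0.9
# Hypothetical 2-state, 2-action MDP, invented for illustration.
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}
P = {(0, 0): [0.8, 0.2], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
KEYS = [(s, a) for s in range(2) for a in range(2)]

def bellman(f):
    """Bellman optimality operator on a Q-table."""
    return {(s, a): R[(s, a)] + GAMMA * sum(
                p * max(f[(s2, a2)] for a2 in range(2))
                for s2, p in enumerate(P[(s, a)]))
            for s, a in KEYS}

def sup_dist(f, g):
    """Sup-norm distance between two Q-tables."""
    return max(abs(f[k] - g[k]) for k in KEYS)

# Approximate Q* by iterating until the residual gamma^1000 term is negligible.
q_star = {k: 0.0 for k in KEYS}
for _ in range(1000):
    q_star = bellman(q_star)

# Record the error ||f_k - Q*|| for the first 20 iterates from f_0 = 0.
f = {k: 0.0 for k in KEYS}
err0 = sup_dist(f, q_star)
errors = []
for k in range(1, 21):
    f = bellman(f)
    errors.append((k, sup_dist(f, q_star)))
```

On this example every recorded error sits below $\gamma^k \, \|f_0 - Q^\star\|_\infty$, matching the contraction argument.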

Proof of lemma:

It suffices to bound, for every $(s,a)$:

$$
\begin{aligned}
&\left|(\mathcal{T}f - \mathcal{T}f')(s,a)\right| \\
=&\ \left|\left(R(s,a) + \gamma \mathbb{E}_{s'\sim P(\cdot|s,a)}\left[\max_{a'} f(s', a')\right]\right) - \left(R(s,a) + \gamma \mathbb{E}_{s'\sim P(\cdot|s,a)}\left[\max_{a'} f'(s', a')\right]\right)\right| \\
=&\ \gamma\left|\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[\max_{a'} f(s', a') - \max_{a'} f'(s', a')\right]\right|
\end{aligned}
$$

WLOG assume $\max_{a'} f(s', a') \ge \max_{a'} f'(s', a')$, and let $a^\star$ be such that $\max_{a'} f(s', a') = f(s', a^\star)$; then

$$
\begin{aligned}
&\gamma\left|\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[\max_{a'} f(s', a') - \max_{a'} f'(s', a')\right]\right| \\
=&\ \gamma\left|\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[f(s', a^\star) - \max_{a'} f'(s', a')\right]\right| \\
\le&\ \gamma\left|\mathbb{E}_{s'\sim P(\cdot|s,a)}\left[f(s', a^\star) - f'(s', a^\star)\right]\right| \\
\le&\ \gamma\left|f(s', a^\star) - f'(s', a^\star)\right| \\
\le&\ \gamma \| f-f'\|_\infty
\end{aligned}
$$


Reinforcement Learning (6)
https://yzzzf.xyz/2024/02/11/reinforcement-learning-lecture-6/
Author: Zifan Ying
Posted on: February 11, 2024