Reinforcement Learning (7)

greedy policy:

$$\pi^\star (s) = \argmax_{a\in A} Q^\star (s, a)$$
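A minimal numpy sketch (the $Q$ table below is made up for illustration): the greedy policy is just a row-wise argmax over actions.

```python
import numpy as np

# Hypothetical tabular Q-function: rows = states, columns = actions.
Q = np.array([[1.0, 0.5],
              [0.2, 0.7],
              [0.9, 0.4]])

greedy_policy = Q.argmax(axis=1)   # pi(s) = argmax_a Q(s, a)
print(greedy_policy)               # [0 1 0]
```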

sequence of functions:

$$f_0, f_1, f_2, \cdots \to Q^\star$$

define

$$\pi_{f_k} (s) = \argmax_{a\in A} f_k(s, a)$$

Claim:

$$\| V^\star - V^{\pi_f} \|_\infty \le \frac{2 \| f - Q^\star \|_\infty}{1-\gamma}$$
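A standard way to verify this bound (a sketch; $Q^{\pi_f}$ below denotes the action-value function of $\pi_f$, and the key fact is that $\pi_f$ is greedy with respect to $f$):

$$V^\star(s) - V^{\pi_f}(s) = \underbrace{Q^\star(s, \pi^\star(s)) - Q^\star(s, \pi_f(s))}_{\le\, 2\| f - Q^\star \|_\infty} + \underbrace{Q^\star(s, \pi_f(s)) - Q^{\pi_f}(s, \pi_f(s))}_{=\,\gamma\, \mathbb{E}_{s' \sim P(\cdot\mid s,\pi_f(s))}\left[V^\star(s') - V^{\pi_f}(s')\right]}$$

The first term is at most $2\| f - Q^\star \|_\infty$ because $f(s, \pi_f(s)) \ge f(s, \pi^\star(s))$ and each swap between $f$ and $Q^\star$ costs at most $\| f - Q^\star \|_\infty$. Taking the supremum over $s$ gives $\| V^\star - V^{\pi_f} \|_\infty \le 2\| f - Q^\star \|_\infty + \gamma \| V^\star - V^{\pi_f} \|_\infty$, and rearranging yields the claim.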

define operator $\mathcal{T}$:

$$(\mathcal{T}f)(s)=\max_{a \in A}\left(R(s, a)+\gamma \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s, a)}\left[f\left(s^{\prime}\right)\right]\right)$$

Note:
the $\mathcal{T}$ in $\mathcal{T}Q^\star$ and $\mathcal{T}V^\star$ are not the same operator: one acts on $Q$-functions, the other on $V$-functions.
{: .prompt-tip }
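A minimal sketch of this operator on a made-up tabular MDP (the arrays `R`, `P` and the constants below are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical tabular MDP.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
R = rng.uniform(size=(n_states, n_actions))            # R(s, a) in [0, 1]
P = rng.uniform(size=(n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)                      # P(s' | s, a)

def bellman_optimality_operator(f):
    """(T f)(s) = max_a [ R(s, a) + gamma * E_{s' ~ P(.|s,a)} f(s') ]."""
    return (R + gamma * P @ f).max(axis=1)

f = np.zeros(n_states)
print(bellman_optimality_operator(f))   # equals max_a R(s, a) when f = 0
```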

$V^\star$ Iteration

$$\begin{gathered} f_0 = \vec{0} \\ f_k \leftarrow \mathcal{T}f_{k-1} \end{gathered}$$

then

$$f_k(s)=\max_{\text{all possible } \pi} \mathbb{E}\left[\sum_{t=1}^k \gamma^{t-1} r_t \mid s_1=s, \pi\right]$$

This follows from the definition of the operator $\mathcal{T}$.
{: .prompt-tip }
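A quick numerical check of this interpretation on a made-up 2-state MDP: $f_k = \mathcal{T}^k \vec{0}$ from value iteration matches a brute-force maximum of the $k$-step discounted return over sequences of deterministic Markov policies (which attain the finite-horizon optimum). All names below are illustrative.

```python
import itertools
import numpy as np

# Made-up tabular MDP.
rng = np.random.default_rng(0)
n_states, n_actions, gamma, k = 2, 2, 0.9, 3
R = rng.uniform(size=(n_states, n_actions))            # R(s, a) in [0, 1]
P = rng.uniform(size=(n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)                      # P(s' | s, a)
T = lambda f: (R + gamma * P @ f).max(axis=1)

# f_k via k applications of T to f_0 = 0.
f = np.zeros(n_states)
for _ in range(k):
    f = T(f)

# Brute force over sequences of deterministic Markov policies (pi_1, ..., pi_k).
best = np.full(n_states, -np.inf)
states = np.arange(n_states)
for seq in itertools.product(itertools.product(range(n_actions), repeat=n_states), repeat=k):
    ret = np.zeros(n_states)
    for s0 in range(n_states):
        dist = np.eye(n_states)[s0]                     # distribution of s_t, starting at s0
        for t, pi in enumerate(seq):
            a = np.array(pi)                            # action pi_t(s) for every state s
            ret[s0] += gamma**t * (dist @ R[states, a]) # E[gamma^(t-1) r_t], t starting at 1
            dist = dist @ P[states, a]                  # propagate the state distribution
    best = np.maximum(best, ret)

print("value iteration f_k:", f)
print("brute-force max    :", best)                     # agrees up to float error
```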

Claim:

$$\| f_k - V^\star \| \lesssim \gamma^k$$

step 1: $f_k \le V^\star$ (for any policy, the $k$-step truncated return is at most the full discounted return when rewards are nonnegative, and the latter is at most $V^\star$).

step 2:

$$\begin{aligned} f_k \ge &\boxed{\mathbb{E}\left[\sum_{t=1}^\infty \gamma^{t-1} r_t \mid s_1=s, \pi^\star \right]} - \mathbb{E}\left[\sum_{t=k+1}^\infty \gamma^{t-1} r_t \mid s_1=s, \pi^\star \right] \\ \ge &\boxed{V^\star} - \gamma^k V_{\max} \qquad \blacksquare \end{aligned}$$

The first inequality holds because $\pi^\star$ is one candidate in the maximum defining $f_k$, and the difference of the two expectations is exactly the $k$-step return under $\pi^\star$; the discounted tail is bounded by $\gamma^k V_{\max}$.
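A quick numerical check of the claim on a made-up MDP (here $V^\star$ is approximated by running $\mathcal{T}$ many times, and rewards lie in $[0,1]$ so $V_{\max} = 1/(1-\gamma)$; all names are illustrative):

```python
import numpy as np

# Made-up tabular MDP.
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.9
R = rng.uniform(size=(n_states, n_actions))
P = rng.uniform(size=(n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
T = lambda f: (R + gamma * P @ f).max(axis=1)

# Approximate V* by iterating T to (numerical) convergence.
v_star = np.zeros(n_states)
for _ in range(2000):
    v_star = T(v_star)

v_max = 1.0 / (1 - gamma)            # rewards in [0, 1]
f = np.zeros(n_states)
for k in range(1, 11):
    f = T(f)
    err = np.abs(f - v_star).max()
    print(f"k={k:2d}  ||f_k - V*|| = {err:.4f}   gamma^k * V_max = {gamma**k * v_max:.4f}")
```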

