Reinforcement Learning (8)

Policy Iteration

$$\pi_0: S \to A$$

steps:

- policy evaluation: compute $Q^{\pi_{k-1}} \in \mathbb{R}^{SA}$
- policy improvement: $\pi_k \leftarrow \pi_{Q^{\pi_{k-1}}}$, where

$$\pi_f(s) = \argmax_{a \in A} f(s,a)$$
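As a concrete reference, here is a minimal policy-iteration sketch in NumPy for a generic finite MDP. This is an illustrative sketch, not the lecture's code; the shape conventions `P[s, a, s']` for transitions and `R[s, a]` for rewards, and the name `policy_iteration`, are assumptions.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Illustrative exact policy iteration.

    P[s, a, s'] -- transition probabilities, R[s, a] -- rewards.
    """
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)                 # arbitrary pi_0
    while True:
        # Policy evaluation: V^pi solves (I - gamma P^pi) V = R^pi,
        # then Q^pi = R + gamma P V^pi.
        P_pi = P[np.arange(n_states), pi]              # (S, S')
        R_pi = R[np.arange(n_states), pi]              # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        Q = R + gamma * P @ V                          # (S, A)
        # Policy improvement: greedy policy pi_Q.
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):                 # pi_{Q^pi} = pi: done
            return pi, Q
        pi = pi_new
```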

property:

$$\pi^\star \to Q^{\pi^\star} = Q^\star \to \pi_{Q^\star} = \pi^\star$$

This means that once policy iteration reaches $\pi^\star$, it never leaves it.
{: .prompt-tip }

example

```mermaid
graph TD;
    A([start])
    A -->|+0| B([Japanese])
    A -->|+0| C([Italian])
    B -->|+2| D([Ramen])
    B -->|+2| E([Sushi])
    C -->|+1| F([Steak])
    C -->|+3| G([Pasta])
```

The optimal policy heads for Pasta (start → Italian → Pasta, total reward $+3$).

This example is a finite-horizon case.
To make it an infinite-horizon discounted case, add an absorbing state $T$:

```mermaid
graph TD;
    A([start])
    A -->|+0| B([Japanese])
    A -->|+0| C([Italian])
    B -->|+2| D([Ramen])
    B -->|+2| E([Sushi])
    C -->|+1| F([Steak])
    C -->|+3| G([Pasta])

    D -.->|+0| T((T))
    E -.->|+0| T((T))
    F -.->|+0| T((T))
    G -.->|+0| T((T))
    T --->|+0| T((T))
```

{: .prompt-tip }

To find $V^\star(s)$, update the $V$ values from the leaf states upward to the root state.
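For instance, that bottom-up backup on the example tree can be sketched as follows; the `children` dictionary encoding the deterministic edges and rewards is an assumption for illustration, and leaves have $V^\star = 0$.

```python
# Deterministic tree from the example: children[s] = [(reward, next_state), ...]
children = {
    "start":    [(0, "Japanese"), (0, "Italian")],
    "Japanese": [(2, "Ramen"), (2, "Sushi")],
    "Italian":  [(1, "Steak"), (3, "Pasta")],
}

def v_star(s):
    """Backward induction: V*(leaf) = 0, V*(s) = max over edges of r + V*(s')."""
    if s not in children:                      # leaf state
        return 0
    return max(r + v_star(s2) for r, s2 in children[s])

print(v_star("start"))  # 3, via start -> Italian -> Pasta
```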

policy iteration (example)

iteration #0

Define the initial policy $\pi_0$:

```mermaid
graph TD;
    A([start])
    A -.-|+0| B([Japanese])
    A ==>|+0| C([Italian])
    B ==>|+2| D([Ramen])
    B -.-|+2| E([Sushi])
    C ==>|+1| F([Steak])
    C -.-|+3| G([Pasta])
```

then the corresponding $Q^{\pi_0}(s, a)$:

```mermaid
graph TD;
    A([start])
    A -->|+2| B([Japanese])
    A -->|+1| C([Italian])
    B -->|+2| D([Ramen])
    B -->|+2| E([Sushi])
    C -->|+1| F([Steak])
    C -->|+3| G([Pasta])
```

iteration #1

$$\pi_1(s) := \argmax_{a\in A} Q^{\pi_0}(s, a)$$

```mermaid
graph TD;
    A([start])
    A ==>|+2| B([Japanese])
    A -.-|+1| C([Italian])
    B ==>|+2| D([Ramen])
    B -.-|+2| E([Sushi])
    C -.-|+1| F([Steak])
    C ==>|+3| G([Pasta])
```

$Q^{\pi_1}$:

```mermaid
graph TD;
    A([start])
    A -->|+2| B([Japanese])
    A -->|+3| C([Italian])
    B -->|+2| D([Ramen])
    B -->|+2| E([Sushi])
    C -->|+1| F([Steak])
    C -->|+3| G([Pasta])
```

iteration #2

$$\pi_2(s) := \argmax_{a\in A} Q^{\pi_1}(s, a)$$

```mermaid
graph TD;
    A([start])
    A -.-|+2| B([Japanese])
    A ==>|+3| C([Italian])
    B ==>|+2| D([Ramen])
    B -.-|+2| E([Sushi])
    C -.-|+1| F([Steak])
    C ==>|+3| G([Pasta])
```

Comment

The policy $\pi$ switched to Japanese once, then switched back to Italian in the end.

Also, the policy updates propagate upward, from the leaves toward the root. A numerical sketch of this example follows.
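The switch can be reproduced numerically with the minimal self-contained sketch below; the dictionary encoding, the `depth=3` rollout cap, and the helper names are illustrative assumptions.

```python
# Deterministic example with absorbing state T (gamma = 1, horizon <= 3).
# actions[s] = {action: (reward, next_state)}
actions = {
    "start":    {"Japanese": (0, "Japanese"), "Italian": (0, "Italian")},
    "Japanese": {"Ramen": (2, "T"), "Sushi": (2, "T")},
    "Italian":  {"Steak": (1, "T"), "Pasta": (3, "T")},
    "T":        {"stay": (0, "T")},
}

def evaluate(pi, s, depth=3):
    """V^pi(s): roll the deterministic dynamics forward under pi."""
    if depth == 0:
        return 0
    r, s2 = actions[s][pi[s]]
    return r + evaluate(pi, s2, depth - 1)

# pi_0 as in the graphs: start -> Italian, Japanese -> Ramen, Italian -> Steak.
pi = {"start": "Italian", "Japanese": "Ramen", "Italian": "Steak", "T": "stay"}

for k in (1, 2):
    # Q^{pi_{k-1}}(s, a) = r + V^{pi_{k-1}}(s'), then greedy improvement.
    Q = {s: {a: r + evaluate(pi, s2) for a, (r, s2) in acts.items()}
         for s, acts in actions.items()}
    pi = {s: max(Q[s], key=Q[s].get) for s in actions}
    print(f"pi_{k}:", {s: pi[s] for s in ("start", "Japanese", "Italian")})
# pi_1: start -> Japanese, Italian -> Pasta;  pi_2: start -> Italian again.
```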

Monotone Policy improvement

$$\forall k,\ \forall s: V^{\pi_k}(s) \geq V^{\pi_{k-1}}(s)$$

$$\text{if } \pi_{k-1} \neq \pi^\star,\ \exists s: V^{\pi_k}(s) > V^{\pi_{k-1}}(s)$$

$$\Rightarrow \text{\#iterations} \le |A|^{|S|}$$

Thanks to monotone improvement, policy iteration produces exact solutions in finitely many iterations, while value iteration produces approximate solutions.
{: .prompt-tip }

Proof of $Q^{\pi_{k+1}} \ge Q^{\pi_k}$

lemma 1:

$$Q^{\pi_k} = \mathcal{T}^{\pi_k} Q^{\pi_k} \leq \mathcal{T} Q^{\pi_k}$$

because

$$
\begin{aligned}
& (\mathcal{T}^\pi f)(s, a) = R(s, a) + \gamma \mathbb{E}_{s^{\prime} \sim p(\cdot \mid s, a)}\left[f\left(s^{\prime}, \pi(s^{\prime})\right)\right] \\
\le\ & (\mathcal{T} f)(s, a) = R(s, a) + \gamma \mathbb{E}_{s^{\prime} \sim p(\cdot \mid s, a)}\left[\max_{a^{\prime}} f\left(s^{\prime}, a^{\prime}\right)\right]
\end{aligned}
$$
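A NumPy sketch of the two operators, under the same `P[s, a, s']`, `R[s, a]` conventions as before (the function names are assumptions), makes the inequality concrete: the only difference is evaluating $f$ at $\pi(s')$ versus maximizing over $a'$.

```python
import numpy as np

def bellman_pi(f, pi, P, R, gamma):
    """(T^pi f)(s, a) = R(s, a) + gamma * E_{s'}[ f(s', pi(s')) ]."""
    f_pi = f[np.arange(P.shape[0]), pi]      # f(s', pi(s')), shape (S,)
    return R + gamma * P @ f_pi              # shape (S, A)

def bellman_opt(f, P, R, gamma):
    """(T f)(s, a) = R(s, a) + gamma * E_{s'}[ max_{a'} f(s', a') ]."""
    return R + gamma * P @ f.max(axis=1)     # shape (S, A)

# Lemma 1 follows entrywise: f(s', pi(s')) <= max_{a'} f(s', a'),
# so bellman_pi(f, pi, ...) <= bellman_opt(f, ...) at every (s, a).
```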

lemma 2:

$$\mathcal{T} Q^{\pi_k} = \mathcal{T}^{\pi_{k+1}} Q^{\pi_k}$$

because $\pi_{k+1}$ is greedy with respect to $Q^{\pi_k}$.

lemma 3:

$$\forall f \ge f': \; \mathcal{T}^\pi f \ge \mathcal{T}^\pi f'$$

With Lemmas 1, 2, and 3:

$$
\begin{aligned}
Q^{\pi_k} & \leq \boxed{\mathcal{T}^{\pi_{k+1}} Q^{\pi_k}} \\
\Rightarrow \boxed{\mathcal{T}^{\pi_{k+1}} Q^{\pi_k}} & \leq \mathcal{T}^{\pi_{k+1}} \mathcal{T}^{\pi_{k+1}} Q^{\pi_k} \\
& \ \ \vdots \\
\Rightarrow Q^{\pi_k} & \leq \left(\mathcal{T}^{\pi_{k+1}}\right)^{\infty} Q^{\pi_k} = Q^{\pi_{k+1}}
\end{aligned}
$$

because $Q^{\pi_{k+1}}$ is the fixed point of $\mathcal{T}^{\pi_{k+1}}$.
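This fixed-point step can also be checked numerically: iterating $\mathcal{T}^{\pi}$ (a $\gamma$-contraction) from any $f$ converges to $Q^{\pi}$. Below is a minimal sketch on a random MDP; the random instance and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

# Random MDP instance: P[s, a, :] are distributions, R[s, a] rewards.
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
pi = rng.integers(A, size=S)                       # a fixed policy

# Exact Q^pi via the linear system V = (I - gamma P^pi)^{-1} R^pi.
V = np.linalg.solve(np.eye(S) - gamma * P[np.arange(S), pi],
                    R[np.arange(S), pi])
Q_pi = R + gamma * P @ V

# Iterate T^pi from f = 0: converges geometrically to the fixed point Q^pi.
f = np.zeros((S, A))
for _ in range(200):
    f = R + gamma * P @ f[np.arange(S), pi]        # (T^pi f)(s, a)

print(np.max(np.abs(f - Q_pi)))                    # ~0 up to numerical precision
```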

