Reinforcement Learning (8)
Policy Iteration
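steps:
- policy evaluation: Compute $Q^{\pi_k}$, the value function of the current policy $\pi_k$.
- policy improvement: $\pi_{k+1} = \pi_{Q^{\pi_k}}$, where $\pi_Q(s) = \arg\max_a Q(s, a)$ is the greedy policy with respect to $Q$.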
property:
This means that once policy iteration reaches its goal (an optimal policy), it never leaves it.
{: .prompt-tip }
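As a rough illustration, the evaluation/improvement loop described above could be sketched in plain Python as follows. The MDP encoding (`P[s][a]` as a list of `(prob, next_state)` pairs, `R[s][a]` as rewards), the function name `policy_iteration`, and the two-state toy problem are assumptions made for this sketch, not part of the lecture. Policy evaluation is done here by repeated backups up to a tolerance rather than by an exact linear solve.

```python
def policy_iteration(P, R, gamma=0.9, tol=1e-10):
    """Tabular policy iteration (illustrative sketch).

    P[s][a] is a list of (prob, next_state) pairs; R[s][a] is the reward.
    """
    n_states = len(R)

    def q_value(V, s, a):
        # One-step lookahead: reward plus discounted expected next value.
        return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

    pi = [0] * n_states  # arbitrary initial policy
    while True:
        # Policy evaluation: back up V under pi until it stops changing.
        V = [0.0] * n_states
        while True:
            new_V = [q_value(V, s, pi[s]) for s in range(n_states)]
            delta = max(abs(x - y) for x, y in zip(new_V, V))
            V = new_V
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the evaluated values.
        new_pi = [
            max(range(len(R[s])), key=lambda a: q_value(V, s, a))
            for s in range(n_states)
        ]
        if new_pi == pi:  # greedy policy unchanged: stop
            return pi, V
        pi = new_pi


# Hypothetical two-state MDP: in state 0, action 1 moves to state 1 (+1);
# in state 1, action 0 stays put (+2). All other moves give reward 0.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 0)]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]
pi, V = policy_iteration(P, R, gamma=0.5)
```

On this toy problem the loop stops after the greedy policy stops changing, with the policy "move to state 1, then stay there".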
example
graph TD;
A([start])
A -->|+0| B([Japanese])
A -->|+0| C([Italian])
B -->|+2| D([Ramen])
B -->|+2| E([Sushi])
C -->|+1| F([Steak])
C -->|+3| G([Pasta])
The optimal policy is to head for Pasta (start → Italian → Pasta), which collects a total reward of 3.
This example is a finite-horizon case.
To make it an infinite-horizon discounted problem, add an absorbing terminal state T:
graph TD;
A([start])
A -->|+0| B([Japanese])
A -->|+0| C([Italian])
B -->|+2| D([Ramen])
B -->|+2| E([Sushi])
C -->|+1| F([Steak])
C -->|+3| G([Pasta])
D -.->|+0| T((T))
E -.->|+0| T((T))
F -.->|+0| T((T))
G -.->|+0| T((T))
T --->|+0| T((T))
{: .prompt-tip }
To find the optimal value function, update the values from the leaves upward to the root state.
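The leaf-to-root update can be sketched for the dining example as a small recursion. The dictionary `TREE` and the helper names `optimal_value` and `optimal_path` are made up for this illustration, and the sketch assumes no discounting (rewards are simply summed), which matches the edge labels in the diagram.

```python
# Edge rewards of the dining example: state -> {next_state: reward}.
TREE = {
    "start": {"Japanese": 0, "Italian": 0},
    "Japanese": {"Ramen": 2, "Sushi": 2},
    "Italian": {"Steak": 1, "Pasta": 3},
}


def optimal_value(state):
    """Best total reward obtainable from `state`; leaves are worth 0."""
    actions = TREE.get(state)
    if not actions:
        return 0
    # A parent's value is computed from its children's values (leaf to root).
    return max(r + optimal_value(nxt) for nxt, r in actions.items())


def optimal_path(state):
    """Follow the best action from `state` down to a leaf."""
    path = [state]
    while state in TREE:
        state = max(
            TREE[state],
            key=lambda nxt: TREE[state][nxt] + optimal_value(nxt),
        )
        path.append(state)
    return path


best = optimal_value("start")
path = optimal_path("start")
```

Here `best` is the value of the root and `path` is the route the optimal policy takes through the tree.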
policy iteration (example)
iteration #0
define the initial policy $\pi_0$ (thick arrows mark the chosen actions):
graph TD;
A([start])
A -.-|+0| B([Japanese])
A ==>|+0| C([Italian])
B ==>|+2| D([Ramen])
B -.-|+2| E([Sushi])
C ==>|+1| F([Steak])
C -.-|+3| G([Pasta])
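then the corresponding $Q^{\pi_0}$ (each edge is labeled with the value of taking that action and following $\pi_0$ afterwards):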
graph TD;
A([start])
A -->|+2| B([Japanese])
A -->|+1| C([Italian])
B -->|+2| D([Ramen])
B -->|+2| E([Sushi])
C -->|+1| F([Steak])
C -->|+3| G([Pasta])
iteration #1
greedy policy $\pi_1$ with respect to the values above:
graph TD;
A([start])
A ==>|+2| B([Japanese])
A -.-|+1| C([Italian])
B ==>|+2| D([Ramen])
B -.-|+2| E([Sushi])
C -.-|+1| F([Steak])
C ==>|+3| G([Pasta])
the corresponding $Q^{\pi_1}$:
graph TD;
A([start])
A -->|+2| B([Japanese])
A -->|+3| C([Italian])
B -->|+2| D([Ramen])
B -->|+2| E([Sushi])
C -->|+1| F([Steak])
C -->|+3| G([Pasta])
iteration #2
greedy policy $\pi_2$ with respect to the values above:
graph TD;
A([start])
A -.-|+2| B([Japanese])
A ==>|+3| C([Italian])
B ==>|+2| D([Ramen])
B -.-|+2| E([Sushi])
C -.-|+1| F([Steak])
C ==>|+3| G([Pasta])
Comment
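The policy at the start state switched to Japanese once and then switched back to Italian at the end.
Also, the policy is corrected from the leaves upward: the choice at Italian was fixed first (iteration #1), and the choice at start only afterwards (iteration #2).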
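The three iterations above can be replayed with a short script. This is only a sketch: the names `TREE`, `evaluate`, `improve`, and `history` are hypothetical, rewards are summed without discounting, and ties are broken in favor of the action listed first (which is why Japanese keeps choosing Ramen).

```python
# Edge rewards of the dining example: state -> {next_state: reward}.
TREE = {
    "start": {"Japanese": 0, "Italian": 0},
    "Japanese": {"Ramen": 2, "Sushi": 2},
    "Italian": {"Steak": 1, "Pasta": 3},
}


def evaluate(policy):
    """Q-values of `policy`: take action a, then follow `policy` afterwards."""
    def v(state):
        if state not in TREE:  # leaf: nothing more to collect
            return 0
        nxt = policy[state]
        return TREE[state][nxt] + v(nxt)

    return {s: {a: r + v(a) for a, r in acts.items()}
            for s, acts in TREE.items()}


def improve(q):
    """Greedy policy; max() keeps the first-listed action on ties."""
    return {s: max(acts, key=acts.get) for s, acts in q.items()}


# pi_0 from iteration #0: Italian at the start, Ramen, Steak.
policy = {"start": "Italian", "Japanese": "Ramen", "Italian": "Steak"}
history = [policy]
while True:
    q = evaluate(policy)
    new_policy = improve(q)
    if new_policy == policy:  # no change: policy iteration has converged
        break
    policy = new_policy
    history.append(policy)

start_choices = [p["start"] for p in history]
```

`start_choices` records the action chosen at the start state in each iteration, so the switch to Japanese and back to Italian is visible directly.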
Monotone Policy improvement
Policy iteration, thanks to monotone policy improvement, produces an exact solution after finitely many iterations (in a finite MDP), while value iteration produces only approximate solutions.
{: .prompt-tip }
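To see what "approximate" means for value iteration, consider a hypothetical one-state, one-action chain that pays reward 1 at every step (this toy problem is an assumption for illustration, not from the lecture). Its exact value is $1/(1-\gamma)$. The sketch below uses exact fractions to show that the value iteration error shrinks by a factor of $\gamma$ per backup but never reaches zero.

```python
from fractions import Fraction

gamma = Fraction(1, 2)
v_star = 1 / (1 - gamma)  # exact value of the single state

v = Fraction(0)           # value iteration starts from zero
errors = []
for _ in range(5):
    v = 1 + gamma * v     # one Bellman backup
    errors.append(v_star - v)
```

With only one policy available, evaluating that policy exactly already gives `v_star`, whereas each entry of `errors` is still strictly positive.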
Proof of:
lemma 1:
because
lemma 2:
lemma 3:
with lemma 1,2,3,
because $Q^{\pi_{k+1}}$ is the fixed point of $T^{\pi_{k+1}}$.
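For reference, one standard way to argue monotone improvement is sketched below. The operator notation ($T^{\pi}$ for the Bellman operator of a policy, $T$ for the optimality operator) and the exact form of the three lemmas are assumptions here and may not match the statements intended in the notes.

```latex
% Assumed notation:
%   (T^{\pi} f)(s,a) = R(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)} [ f(s', \pi(s')) ]
%   (T f)(s,a)       = R(s,a) + \gamma \, \mathbb{E}_{s' \sim P(\cdot \mid s,a)} [ \max_{a'} f(s', a') ]

% Lemma 1: one backup under the improved policy does not decrease the value,
% because \pi_{k+1} is greedy with respect to Q^{\pi_k}.
T^{\pi_{k+1}} Q^{\pi_k} = T Q^{\pi_k} \ge T^{\pi_k} Q^{\pi_k} = Q^{\pi_k}

% Lemma 2 (monotonicity of the operator):
f \ge g \implies T^{\pi} f \ge T^{\pi} g

% Lemma 3 (convergence to the fixed point), for any bounded f:
\lim_{n \to \infty} (T^{\pi})^{n} f = Q^{\pi}

% Combining Lemmas 1-3:
Q^{\pi_k}
  \le T^{\pi_{k+1}} Q^{\pi_k}
  \le (T^{\pi_{k+1}})^{2} Q^{\pi_k}
  \le \dots
  \le \lim_{n \to \infty} (T^{\pi_{k+1}})^{n} Q^{\pi_k}
  = Q^{\pi_{k+1}}
```

Lemma 1 starts the chain, Lemma 2 lets each inequality be pushed through another application of $T^{\pi_{k+1}}$, and Lemma 3 identifies the limit.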
https://yzzzf.xyz/2024/02/12/reinforcement-learning-lecture-8/