Reinforcement Learning (2-4)

Markov Decision Processes

Infinite-horizon discounted MDPs

An MDP $M = (S, A, P, R, \gamma)$ consists of:

  • State space $S$.
  • Action space $A$.
  • Transition function $P: S \times A \rightarrow \Delta(S)$. Here $\Delta(S)$ is the probability simplex over $S$, i.e., all non-negative vectors of length $|S|$ that sum to $1$.
  • Reward function $R: S \times A \rightarrow \mathbb{R}$ (a deterministic reward function).
  • Discount factor $\gamma \in [0, 1]$.

The agent

  1. starts in some state $s_1$,
  2. takes action $a_1$,
  3. receives reward $r_1 = R(s_1, a_1)$,
  4. transitions to $s_2 \sim P(\cdot \mid s_1, a_1)$,
  5. takes action $a_2$,
  6. and so on; the process continues forever.

Objective: maximize the (discounted) expected total reward.

  • Other terms used: return, value, utility, long-term reward, etc.
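
To make the interaction loop and the discounted objective concrete, here is a minimal simulation sketch in Python/NumPy. The MDP is a small randomly generated placeholder; the arrays `P` and `R`, the uniform-random action choice, and the truncation length are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.99

# Placeholder MDP: P[s, a] is a distribution over next states, R[s, a] a reward.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)      # normalize each P[s, a] onto the simplex Delta(S)
R = rng.random((n_states, n_actions))

s = 0                                   # start state s_1
G, disc = 0.0, 1.0                      # running return and current gamma^{t-1}
for t in range(1_000):                  # truncate the infinite horizon for simulation
    a = rng.integers(n_actions)         # illustrative choice: a uniformly random action a_t
    G += disc * R[s, a]                 # add gamma^{t-1} * r_t
    disc *= gamma
    s = rng.choice(n_states, p=P[s, a]) # s_{t+1} ~ P(. | s_t, a_t)

print("discounted return of this trajectory:", G)
```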

Example: Gridworld

  • State: grid position $(x, y)$
  • Action: N, S, E, W
  • Dynamics (see the sketch below):
    • most cases: move to the adjacent cell
    • hitting a wall or already at the goal: stay in the current state
  • Reward:
    • $0$ in the goal state
    • $-1$ everywhere else
  • Discount factor $\gamma$: 0.99
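
This gridworld can be written down as explicit tabular arrays. A minimal sketch, assuming a 4x4 grid with the goal in the top-right corner (grid size and goal location are illustrative choices):

```python
import numpy as np

WIDTH, HEIGHT = 4, 4                                     # assumed grid size
GOAL = (WIDTH - 1, HEIGHT - 1)                           # assumed goal cell
gamma = 0.99
MOVES = {0: (0, 1), 1: (0, -1), 2: (1, 0), 3: (-1, 0)}   # N, S, E, W

n_states = WIDTH * HEIGHT
idx = lambda x, y: y * WIDTH + x                         # flatten (x, y) to a state index

P = np.zeros((n_states, 4, n_states))
R = np.full((n_states, 4), -1.0)                         # -1 everywhere ...
R[idx(*GOAL), :] = 0.0                                   # ... except 0 in the goal state

for x in range(WIDTH):
    for y in range(HEIGHT):
        s = idx(x, y)
        for a, (dx, dy) in MOVES.items():
            nx, ny = x + dx, y + dy
            if (x, y) == GOAL or not (0 <= nx < WIDTH and 0 <= ny < HEIGHT):
                P[s, a, s] = 1.0                         # goal or wall: stay put
            else:
                P[s, a, idx(nx, ny)] = 1.0               # otherwise move to the adjacent cell
```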

Discounting

$\gamma = 1$ allows some strategies to obtain $-\infty$ expected return.

For $\gamma < 1$, the total reward is finite.

Finite-horizon vs. infinite-horizon discounted MDPs

  • For finite-horizon MDPs (finitely many steps), $\gamma$ can be $1$.
  • For infinite-horizon MDPs (infinitely many steps), $\gamma < 1$ is required.

Value and policy

The agent should take actions that maximize

$$\mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\right]$$

Assuming $r_t \in [0, R_{\max}]$ and using $\sum_{t=1}^{\infty} \gamma^{t-1} = \frac{1}{1-\gamma}$,

$$\mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t\right] \in \left[0, \frac{R_{\max}}{1-\gamma}\right].$$

A policy describes how the agent acts at a state:

$$a_t = \pi(s_t)$$

Define the value function

$$V^\pi(s) = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1 = s, \pi \right]$$
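
Read operationally, $V^\pi(s)$ can be approximated by averaging truncated discounted returns of rollouts that start in $s$ and follow $\pi$. A Monte Carlo sketch on a placeholder random MDP (the MDP, policy, rollout count, and truncation horizon are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Placeholder MDP and a fixed deterministic policy pi (all arbitrary).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))
pi = rng.integers(n_actions, size=n_states)     # pi[s] = action taken in state s

def mc_value(s0, n_rollouts=1_000, horizon=200):
    """Monte Carlo estimate of V^pi(s0) from truncated rollouts."""
    total = 0.0
    for _ in range(n_rollouts):
        s, G, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            a = pi[s]
            G += disc * R[s, a]
            disc *= gamma
            s = rng.choice(n_states, p=P[s, a])
        total += G
    return total / n_rollouts

print("V^pi(0) is approximately", mc_value(0))
```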

Bellman Equation

$$\begin{aligned} V^\pi(s) & = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1 = s, \pi\right] \\ & = R(s, \pi(s)) + \gamma\left\langle P(\cdot \mid s, \pi(s)),\, V^\pi(\cdot)\right\rangle \end{aligned}$$

Detailed steps

$$\begin{aligned} V^\pi(s) & = \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1=s, \pi\right] \\ & = \mathbb{E}\left[r_1 + \sum_{t=2}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1=s, \pi\right] \\ & = R(s, \pi(s)) + \sum_{s' \in \mathcal{S}} P\left(s' \mid s, \pi(s)\right) \mathbb{E}\left[\gamma \sum_{t=2}^{\infty} \gamma^{t-2} r_t \,\middle|\, s_1=s, s_2=s', \pi\right] \\ & = R(s, \pi(s)) + \sum_{s' \in \mathcal{S}} P\left(s' \mid s, \pi(s)\right) \mathbb{E}\left[\gamma \sum_{t=2}^{\infty} \gamma^{t-2} r_t \,\middle|\, s_2=s', \pi\right] \\ & = R(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} P\left(s' \mid s, \pi(s)\right) \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_1=s', \pi\right] \\ & = R(s, \pi(s)) + \gamma \sum_{s' \in \mathcal{S}} P\left(s' \mid s, \pi(s)\right) V^\pi(s') \\ & = R(s, \pi(s)) + \gamma\left\langle P(\cdot \mid s, \pi(s)),\, V^\pi(\cdot)\right\rangle \end{aligned}$$

where $\langle \cdot, \cdot \rangle$ denotes the dot product.
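
The Bellman equation also yields a simple fixed-point iteration for evaluating a policy: repeatedly apply its right-hand side to the current value estimate until it stops changing. A sketch, again on a placeholder random MDP and deterministic policy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Placeholder random MDP and deterministic policy.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))
pi = rng.integers(n_actions, size=n_states)

P_pi = P[np.arange(n_states), pi]               # rows P(. | s, pi(s))
R_pi = R[np.arange(n_states), pi]               # entries R(s, pi(s))

V = np.zeros(n_states)
for _ in range(10_000):
    # V(s) <- R(s, pi(s)) + gamma * <P(. | s, pi(s)), V(.)>
    V_new = R_pi + gamma * P_pi @ V
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
```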

Matrix form

  • $V^\pi$ as the $|S| \times 1$ vector $[V^\pi(s)]_{s \in S}$
  • $R^\pi$ as the vector $[R(s, \pi(s))]_{s \in S}$
  • $P^\pi$ as the $|S| \times |S|$ matrix $[P(s' \mid s, \pi(s))]_{s \in S, s' \in S}$

$$\begin{gathered} V^\pi = R^\pi + \gamma P^\pi V^\pi \\ (I - \gamma P^\pi)V^\pi = R^\pi \\ V^\pi = (I - \gamma P^\pi)^{-1}R^\pi \end{gathered}$$
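
Since the Bellman equation is linear in $V^\pi$, the matrix form can be evaluated exactly with a single linear solve instead of iterating. A sketch on a placeholder random MDP and policy; `np.linalg.solve` is used rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Placeholder random MDP and deterministic policy.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))
pi = rng.integers(n_actions, size=n_states)

P_pi = P[np.arange(n_states), pi]     # |S| x |S| matrix [P(s' | s, pi(s))]
R_pi = R[np.arange(n_states), pi]     # |S| vector [R(s, pi(s))]

# V^pi = (I - gamma P^pi)^{-1} R^pi, computed as a linear solve.
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
print(V_pi)
```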

Claim: $(I - \gamma P^\pi)$ is invertible.

Proof. It suffices to show that

$$\forall x \ne \vec{0} \in \mathbb{R}^S, \quad (I - \gamma P^\pi)x \ne \vec{0}.$$

Indeed, for any such $x$,

$$\begin{aligned} \| (I - \gamma P^\pi) x \|_{\infty} & = \| x - \gamma P^\pi x \|_{\infty} \\ & \ge \| x \|_{\infty} - \gamma \| P^\pi x \|_{\infty} \\ & \ge \| x \|_{\infty} - \gamma \| x \|_{\infty} \\ & = (1 - \gamma) \| x \|_{\infty} \\ & > 0, \end{aligned}$$

where the second inequality uses that $P^\pi$ is row-stochastic, so $\| P^\pi x \|_{\infty} \le \| x \|_{\infty}$, and the last step uses $\gamma < 1$ and $x \ne \vec{0}$. $\blacksquare$

Generalize to stochastic policies

$$V^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim P(\cdot \mid s, a)}\left[R(s, a) + \gamma V^\pi(s')\right]$$

Matrix form

$$V^\pi = R^\pi + \gamma P^\pi V^\pi$$

still holds with

$$\begin{aligned} & R^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}[R(s, a)] \\ & P^\pi(s' \mid s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) P(s' \mid s, a) \end{aligned}$$
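
For a stochastic policy stored as an $|S| \times |A|$ matrix of probabilities, $R^\pi$ and $P^\pi$ are plain expectations over actions. A sketch of how they might be formed (placeholder random MDP and policy):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Placeholder random MDP.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))

# Placeholder stochastic policy: pi[s, a] = pi(a | s), each row on the simplex.
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=-1, keepdims=True)

R_pi = (pi * R).sum(axis=-1)               # R^pi(s)      = E_{a ~ pi(.|s)}[R(s, a)]
P_pi = np.einsum("sa,sat->st", pi, P)      # P^pi(s' | s) = sum_a pi(a|s) P(s'|s,a)

# The same linear-solve evaluation as in the deterministic case.
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
```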

Optimality

For infinite-horizon discounted MDPs, there always exists a stationary and deterministic policy that is optimal for all starting states simultaneously.

Optimal policy $\pi^\star$ and optimal value function

$$V^\star := V^{\pi^\star}$$

Bellman Optimality Equation

$$V^\star(s) = \max_{a \in A}\left(R(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[V^\star(s')\right]\right)$$
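
The Bellman optimality equation suggests value iteration: apply the max-form update until the values stop changing. A minimal sketch on a placeholder random MDP (the tolerance and iteration cap are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Placeholder random MDP.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(10_000):
    # V(s) <- max_a [ R(s, a) + gamma * E_{s' ~ P(.|s,a)} V(s') ]
    V_new = (R + gamma * P @ V).max(axis=-1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
```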

Q-functions

$$\begin{gathered} Q^\pi(s,a) := \mathbb{E}\left[\sum_{t=1}^{\infty}\gamma^{t-1} r_t \,\middle|\, s_1 = s, a_1 = a; \pi \right] \\ Q^\star := Q^{\pi^\star} \\ V^\pi(s) = Q^\pi(s, \pi(s)), \ \text{also written } Q^\pi(s, \pi) \end{gathered}$$

Bellman equation for $Q$

$$\begin{gathered} Q^\pi(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot \mid s,a)}\left[Q^\pi(s', \pi)\right] \\ Q^\star(s,a) = R(s,a) + \gamma\,\mathbb{E}_{s' \sim P(\cdot \mid s,a)}\left[\max_{a' \in A} Q^\star(s', a')\right] \end{gathered}$$

Define the optimal $V$ and $\pi$ by $Q$

$$V^\star(s) = \max_{a \in A} Q^\star(s,a) = Q^\star(s, \pi^\star(s))$$

$$\pi^\star(s) = \arg\max_{a \in A} Q^\star(s, a)$$
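
Working with $Q^\star$ makes extracting the greedy optimal policy a per-state argmax. A sketch of Q-value iteration followed by this extraction (placeholder random MDP):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Placeholder random MDP.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))

Q = np.zeros((n_states, n_actions))
for _ in range(10_000):
    # Q(s,a) <- R(s,a) + gamma * E_{s' ~ P(.|s,a)}[ max_a' Q(s', a') ]
    Q_new = R + gamma * P @ Q.max(axis=-1)
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new

V_star = Q.max(axis=-1)        # V*(s)  = max_a Q*(s, a)
pi_star = Q.argmax(axis=-1)    # pi*(s) = argmax_a Q*(s, a)
```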

Fixed-horizon MDPs

Specified by $(S, A, R, P, H)$. All trajectories end after exactly $H$ steps.

$$\begin{gathered} V_{H+1}^\pi \equiv 0 \\ V_h^\pi(s) = R(s, \pi(s)) + \mathbb{E}_{s' \sim P(\cdot \mid s, \pi(s))}\left[V_{h+1}^\pi(s')\right] \end{gathered}$$
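
The fixed-horizon recursion is computed by backward induction, from the end of the episode toward the start. A sketch for evaluating a fixed deterministic policy (placeholder MDP, horizon, and policy; no discounting is needed here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, H = 5, 3, 10

# Placeholder random MDP, horizon H, and deterministic policy.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))
pi = rng.integers(n_actions, size=n_states)

P_pi = P[np.arange(n_states), pi]
R_pi = R[np.arange(n_states), pi]

# V[h] stores V_{h+1}^pi in the notes' 1-indexed notation; V[H] is V_{H+1}^pi = 0.
V = np.zeros((H + 1, n_states))
for h in range(H - 1, -1, -1):              # backward induction
    V[h] = R_pi + P_pi @ V[h + 1]           # gamma = 1 is fine with a finite horizon
```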

