Reinforcement Learning (19)

Policy Gradient

Given policy $\pi_\theta$, optimize $J(\pi_\theta) := \mathbb{E}_{s\sim d_0}\left[V^{\pi_\theta}(s)\right]$,
where $d_0$ is the initial state distribution.

  • Use Gradient Ascent ($\nabla_\theta J(\pi_\theta)$)
  • an unbiased estimate can be obtained from a
    single on-policy trajectory
  • no knowledge of $P$ and $R$ of the MDP is required
  • similar to Importance Sampling (IS)

Note that when we write $\pi$, we mean $\pi_{\theta}$ here, and $\nabla$ means $\nabla_\theta$.
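Concretely, gradient ascent repeatedly updates the parameters with some step size $\alpha > 0$ (the step size is an assumption, not specified above):

$$\theta \leftarrow \theta + \alpha\, \nabla_\theta J(\pi_\theta)$$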

About PG:

  • Goal: we want to find a good policy.
  • Value-based RL is indirect.
  • PG is not based on a value function.
    • It is possible that a good policy does not satisfy the Bellman equation.

Example of policy parametrization

Linear + softmax:

  • Featurize state-action pairs: $\phi: S\times A \rightarrow \mathbb{R}^d$
  • Policy (softmax): $\pi(a|s) \propto e^{\theta^{\top} \phi(s, a)}$

Recall that in SARSA we also used a softmax with temperature $T$. But in PG, we don’t need it. Why?

  • In SARSA, the softmax policy is based on the $Q$ function, and the $Q$ function cannot be arbitrary.
  • In PG, $\phi(s,a)$ is an arbitrary feature map, so the temperature $T$ is effectively included (it can be absorbed into the scale of $\theta$); see the sketch below.
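A minimal sketch of this linear-softmax parametrization (assuming NumPy; the feature function `phi(s, a)`, returning a vector in $\mathbb{R}^d$, is hypothetical):

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """Return pi(.|s) for a linear-softmax policy: pi(a|s) ∝ exp(theta^T phi(s, a))."""
    logits = np.array([theta @ phi(s, a) for a in actions])
    logits -= logits.max()              # numerical stability; does not change the distribution
    probs = np.exp(logits)
    return probs / probs.sum()
```

Rescaling $\theta$ by $1/T$ reproduces a softmax with temperature $T$, which is why no explicit temperature parameter is needed.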

PG Derivation

  • The trajectory induced by $\pi$: $\tau:=\left(s_1, a_1, r_1, \ldots, s_{H}, a_{H}, r_{H}\right)$, with $\tau \sim \pi$.
  • Let $R(\tau):=\sum_{t=1}^H \gamma^{t-1} r_t$.

$$J(\pi) := \mathbb{E}_\pi \left[\sum_{t=1}^H \gamma^{t-1}r_t\right] = \mathbb{E}_{\tau\sim\pi} [R(\tau)]$$

$$\begin{aligned}
\nabla J(\pi) &= \nabla \sum_{\tau \in(S \times A)^H} P^{\pi}(\tau) R(\tau) \\
&= \sum_\tau\left(\nabla P^{\pi}(\tau)\right) R(\tau) \\
&= \sum_\tau \frac{P^{\pi}(\tau)}{P^{\pi}(\tau)} \nabla P^{\pi}(\tau) R(\tau) \\
&= \sum_\tau P^{\pi}(\tau) \nabla \log P^{\pi}(\tau)\, R(\tau) \\
&= \mathbb{E}_{\tau \sim \pi}\left[\nabla \log P^{\pi}(\tau)\, R(\tau)\right]
\end{aligned}$$

and

$$\begin{aligned}
\nabla_\theta \log P^{\pi_\theta}(\tau) &= \nabla_\theta \log \left(d_0\left(s_1\right) \pi\left(a_1 | s_1\right) P\left(s_2 | s_1, a_1\right) \pi\left(a_2 | s_2\right) \cdots \right) \\
&= \cancel{\nabla \log d_0(s_1)} + \nabla \log \pi\left(a_1 | s_1\right) + \cancel{\nabla \log P\left(s_2 | s_1, a_1\right)} + \nabla \log \pi\left(a_2 | s_2\right) + \cdots \\
&= \nabla \log \pi\left(a_1 | s_1\right) + \nabla \log \pi\left(a_2 | s_2\right) + \cdots
\end{aligned}$$

The $d_0$ and transition terms vanish because they do not depend on $\theta$.

Note that this form is similar to that discussed in Importance Sampling.

Given the softmax parametrization $\pi(a|s) = \frac{e^{\theta^\top \phi(s,a)}}{\sum_{a'} e^{\theta^\top \phi(s,a')}}$, we have

$$\begin{aligned}
\nabla \log \pi(a|s) &= \nabla\left(\log \left(e^{\theta^\top \phi\left(s, a\right)}\right)-\log \left(\sum_{a^{\prime}} e^{\theta^{\top} \phi\left(s, a^{\prime}\right)}\right)\right) \\
&= \phi(s, a)-\frac{\sum_{a'} e^{\theta^\top \phi(s,a')}\,\phi(s,a')}{\sum_{a^{\prime}} e^{\theta^{\top} \phi\left(s, a^{\prime}\right)}} \\
&= \phi(s,a) - \mathbb{E}_{a'\sim \pi} [\phi(s,a')]
\end{aligned}$$

Note that the expectation of the quantity above is $0$, i.e.

$$\mathbb{E}_{a\sim\pi}\big[ \phi(s,a) - \mathbb{E}_{a'\sim \pi} [\phi(s,a')] \big] = 0$$
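A small numerical check of this score function and of its zero mean under $\pi$ (a sketch reusing the hypothetical `softmax_policy` and `phi` above):

```python
def log_policy_grad(theta, phi, s, a, actions):
    """Score function of the linear-softmax policy: phi(s,a) - E_{a'~pi}[phi(s,a')]."""
    probs = softmax_policy(theta, phi, s, actions)
    expected_phi = sum(p * phi(s, b) for p, b in zip(probs, actions))
    return phi(s, a) - expected_phi

def mean_score(theta, phi, s, actions):
    """E_{a~pi}[grad log pi(a|s)]; should be (numerically) the zero vector."""
    probs = softmax_policy(theta, phi, s, actions)
    return sum(p * log_policy_grad(theta, phi, s, a, actions)
               for p, a in zip(probs, actions))
```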

Conclusion:

So far we have

$$\nabla J(\pi) = \mathbb{E}_{\pi} \left[ \left(\sum_{t=1}^H \gamma^{t-1} r_t \right) \left(\sum_{t=1}^H \nabla\log\pi (a_t|s_t) \right) \right]$$

With the relation discussed above, for any state $s_t$,

$$\mathbb{E}_{\pi}[\nabla\log\pi (a_t|s_t) \mid s_t] = \sum_{a_t}\pi(a_t|s_t)\,\nabla\log\pi (a_t|s_t) = \sum_{a_t}\nabla \pi (a_t|s_t) = \nabla 1 = 0.$$

So, for $t' < t$, the action $a_t$ is drawn from $\pi(\cdot|s_t)$ regardless of the earlier reward $r_{t'}$, so by conditioning on $(s_t, r_{t'})$ we have

$$\mathbb{E}_{\pi}[\nabla\log\pi (a_t|s_t)\, r_{t'}] = \mathbb{E}_{\pi}\big[r_{t'}\, \mathbb{E}_{\pi}[\nabla\log\pi(a_t|s_t) \mid s_t, r_{t'}]\big] = \mathbb{E}_{\pi}[r_{t'} \cdot 0] = 0$$

We can therefore rewrite $\nabla J(\pi)$ as

$$\nabla J(\pi) = \mathbb{E}_{\pi} \left[ \sum_{t=1}^H \left( \nabla\log\pi (a_t|s_t) \sum_{t'=t}^H \gamma^{t'-1} r_{t'} \right) \right]$$
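A REINFORCE-style sketch of a single gradient-ascent step using this reward-to-go estimator from one on-policy trajectory. The environment interface (`env.reset()`, `env.step(a)` returning `(next_state, reward, done)`) and the step size `alpha` are assumptions for illustration; it reuses the `softmax_policy` and `log_policy_grad` sketches above.

```python
import numpy as np

def reinforce_update(theta, phi, env, actions, gamma=0.99, alpha=0.01, H=100):
    """One gradient-ascent step on J using the reward-to-go policy-gradient estimator."""
    # Roll out a single on-policy trajectory of length at most H.
    s = env.reset()
    traj = []                                        # list of (s_t, a_t, r_t), t = 0, 1, ...
    for _ in range(H):
        probs = softmax_policy(theta, phi, s, actions)
        idx = np.random.choice(len(actions), p=probs)
        s_next, r, done = env.step(actions[idx])
        traj.append((s, actions[idx], r))
        s = s_next
        if done:
            break

    # grad ≈ sum_t grad log pi(a_t|s_t) * sum_{t'>=t} gamma^{t'-1} r_{t'}
    grad = np.zeros_like(theta)
    reward_to_go = 0.0
    for t in reversed(range(len(traj))):             # t is 0-indexed, so gamma^t = gamma^{(t+1)-1}
        s_t, a_t, r_t = traj[t]
        reward_to_go = gamma ** t * r_t + reward_to_go
        grad += log_policy_grad(theta, phi, s_t, a_t, actions) * reward_to_go

    return theta + alpha * grad
```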

PG and Value-Based Methods

So far we have

$$\nabla J(\pi) = \mathbb{E}_{\pi} \left[ \sum_{t=1}^H \left( \nabla\log\pi (a_t|s_t) \sum_{t'=t}^H \gamma^{t'-1} r_{t'} \right) \right].$$

Conditioning the inner expectation on $(s_t, a_t)$:

$$\begin{aligned}
\nabla J(\pi) &= \mathbb{E}_{s_t,a_t\sim \pi} \left[ \mathbb{E}_{\pi} \left[ \sum_{t=1}^H \left( \nabla\log\pi (a_t|s_t) \sum_{t'=t}^H \gamma^{t'-1} r_{t'} \right) \Bigg |\, s_t, a_t \right] \right] \\
&= \sum_{t=1}^H \mathbb{E}_{s_t,a_t\sim d_t^\pi} \left[ \nabla\log\pi (a_t|s_t)\, \underbrace{\mathbb{E}_{\pi} \left[ \sum_{t'=t}^H \gamma^{t'-1} r_{t'} \,\bigg |\, s_t, a_t \right]}_{\gamma^{t-1} Q^\pi(s_t, a_t)} \right] \\
&= \sum_{t=1}^H \gamma^{t-1}\, \mathbb{E}_{s,a\sim d_t^\pi} \left[ \nabla\log\pi (a|s)\, Q^\pi (s,a) \right] \\
&= \frac{1}{1-\gamma} \mathbb{E}_{s\sim d^\pi, a\sim \pi(s)} \left[ Q^\pi (s,a)\, \nabla\log\pi(a|s) \right]
\end{aligned}$$
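In the last step, $d^\pi$ is the normalized discounted state distribution; a standard definition (assumed here, exact up to the truncation at horizon $H$) is

$$d^\pi(s) := (1-\gamma)\sum_{t=1}^{H} \gamma^{t-1} d_t^\pi(s),$$

which is where the $\frac{1}{1-\gamma}$ factor comes from.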

Blend PG and Value-Based Methods

Instead of using the MC estimate $\sum_{t'=t}^H \gamma^{t'-1} r_{t'}$ for $Q^\pi\left(s_t, a_t\right)$, use an
approximate value function $\hat{Q}(s_t, a_t)$, often trained by TD, e.g. expected SARSA:

$$Q\left(S_t, A_t\right) \leftarrow Q\left(S_t, A_t\right)+\alpha\left[R_{t+1}+\gamma\, \mathbb{E}_\pi\left[Q\left(S_{t+1}, A_{t+1}\right) \mid S_{t+1}\right]-Q\left(S_t, A_t\right)\right].$$
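A minimal tabular sketch of this expected-SARSA update (the dictionary `Q` keyed by `(state, action)`, the policy function `pi(a, s)` returning $\pi(a|s)$, and `alpha` are illustrative assumptions):

```python
def expected_sarsa_update(Q, pi, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """TD update of Q[s, a] toward r + gamma * E_{a'~pi(.|s_next)}[Q[s_next, a']]."""
    expected_q = sum(pi(a_next, s_next) * Q[s_next, a_next] for a_next in actions)
    td_error = r + gamma * expected_q - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q
```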

Actor-critic

The parametrized policy is called the actor, and
the value-function estimate is called the critic.


Baseline in PG

For any $f: S\to \mathbb{R}$,

$$\nabla J(\pi)=\frac{1}{1-\gamma} \mathbb{E}_{s \sim d^\pi, a \sim \pi(s)}\left[\left(Q^\pi(s, a)-f(s)\right) \nabla \log \pi(a | s)\right]$$

because $f(s)$ does not depend on $a$, so $\mathbb{E}_{a \sim \pi(s)}\left[f(s)\, \nabla \log \pi(a|s)\right] = f(s)\, \mathbb{E}_{a \sim \pi(s)}\left[\nabla \log \pi(a|s)\right] = 0$.

Choose $f(s) = V^\pi(s)$ and

$$\nabla J(\pi)=\frac{1}{1-\gamma} \mathbb{E}_{s \sim d^\pi, a \sim \pi(s)}\left[A^\pi(s, a) \nabla \log \pi(a \mid s)\right]$$

where $A^\pi(s,a) := Q^\pi(s,a) - V^\pi(s)$ is the advantage function.
The baseline does not change the expectation of the gradient, but it lowers the variance.
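A sketch of the resulting advantage-based gradient estimate for a single sampled $(s, a)$ pair; the critics `Q_hat` and `V_hat` are hypothetical function approximators (e.g. trained by TD as above), and it reuses `log_policy_grad` from the earlier sketch.

```python
def advantage_pg_estimate(theta, phi, s, a, actions, Q_hat, V_hat):
    """Single-sample estimate of A(s,a) * grad log pi(a|s) with A(s,a) = Q(s,a) - V(s)."""
    advantage = Q_hat(s, a) - V_hat(s)   # subtracting the baseline V(s) lowers variance
    return advantage * log_policy_grad(theta, phi, s, a, actions)

# The actor update is then theta += alpha * advantage_pg_estimate(...) on samples
# drawn (approximately) as s ~ d^pi, a ~ pi(.|s).
```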

