Policy Gradient
Given a policy $\pi_\theta$, optimize

$$J(\pi_\theta) := \mathbb{E}_{s\sim d_0}\left[V^{\pi_\theta}(s)\right],$$

where $d_0$ is the initial state distribution.
Use gradient ascent on $\nabla_\theta J(\pi_\theta)$:
- an unbiased estimate of the gradient can be obtained from a single on-policy trajectory;
- no knowledge of the MDP's transition kernel $P$ or reward function $R$ is needed;
- the derivation is similar to importance sampling (IS).
Note that when we write $\pi$ below, we mean $\pi_\theta$, and $\nabla$ means $\nabla_\theta$.
About PG:
- Goal: we want to find a good policy.
- Value-based RL does this indirectly, through a value function.
- PG is not based on a value function; it optimizes the policy directly.
- A good policy does not have to satisfy the Bellman equation.
Example of policy parametrization

Linear + softmax:
- Featurize state-action pairs: $\phi: S\times A \rightarrow \mathbb{R}^d$
- Policy (softmax): $\pi(a|s) \propto e^{\theta^{\top} \phi(s, a)}$

Recall that in SARSA we also used a softmax with temperature $T$. In PG we don't need one. Why?
- In SARSA, the softmax policy is built on the $Q$ function, which cannot be arbitrary.
- In PG, $\phi(s,a)$ is an arbitrary feature function, so the temperature is already absorbed into the scale of $\theta$.
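As a minimal sketch of this parametrization (my own toy setup: NumPy, a random feature map `phi`, and small discrete state/action spaces, all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 5, 3, 8
phi = rng.standard_normal((n_states, n_actions, d))  # feature map phi(s, a)
theta = np.zeros(d)                                  # policy parameters

def policy(s, theta):
    """Softmax policy: pi(a|s) proportional to exp(theta^T phi(s, a))."""
    logits = phi[s] @ theta          # shape (n_actions,)
    logits -= logits.max()           # shift for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

a = rng.choice(n_actions, p=policy(0, theta))  # sample an action in state 0
```

Scaling $\theta$ by $1/T$ reproduces any temperature, which is why no separate $T$ is needed.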
PG Derivation
The trajectory induced by $\pi$: $\tau := (s_1, a_1, r_1, \ldots, s_H, a_H, r_H)$, with $\tau \sim \pi$.

Let $R(\tau) := \sum_{t=1}^H \gamma^{t-1} r_t$, so that

$$J(\pi) := \mathbb{E}_\pi \left[\sum_{t=1}^H \gamma^{t-1}r_t\right] = \mathbb{E}_{\tau\sim\pi} [R(\tau)].$$
Then

$$\begin{aligned}
& \nabla J(\pi) \\
={} & \nabla \sum_{\tau \in(S \times A)^H} P^{\pi}(\tau) R(\tau) \\
={} & \sum_\tau\left(\nabla P^{\pi}(\tau)\right) R(\tau) \\
={} & \sum_\tau \frac{P^\pi(\tau)}{P^{\pi}(\tau)} \nabla P^{\pi}(\tau) R(\tau) \\
={} & \sum_\tau P^{\pi}(\tau) \nabla \log P^{\pi}(\tau) R(\tau) \\
={} & \mathbb{E}_{\tau \sim \pi}\left[\nabla \log P^{\pi}(\tau) R(\tau)\right]
\end{aligned}$$
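This log-derivative trick can be sanity-checked numerically. A toy sketch (my own construction, not from the source: a softmax distribution over $K$ abstract "trajectories" with fixed returns) compares the exact gradient of $J$ with the Monte Carlo score-function estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 4
R = rng.standard_normal(K)        # return R(tau) for each of K trajectories
theta = rng.standard_normal(K)    # P(tau) = softmax(theta)

def softmax(th):
    e = np.exp(th - th.max())
    return e / e.sum()

# Exact gradient of J(theta) = sum_tau P(tau) R(tau)
probs = softmax(theta)
grad_exact = probs * (R - probs @ R)

# Score-function form: E_tau[ grad log P(tau) * R(tau) ], estimated by sampling
taus = rng.choice(K, size=100_000, p=probs)
scores = np.eye(K)[taus] - probs           # grad_theta log P(tau) for a softmax
grad_est = (scores * R[taus, None]).mean(axis=0)

print(np.allclose(grad_exact, grad_est, atol=1e-2))  # True up to MC noise
```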
and
$$\begin{aligned}
& \nabla_\theta \log P^{\pi_\theta}(\tau) \\
={} & \nabla_\theta \log \left(d_0(s_1)\, \pi(a_1 | s_1)\, P(s_2 | s_1, a_1)\, \pi(a_2 | s_2) \cdots \right) \\
={} & \cancel{\nabla \log d_0(s_1)} +
\nabla \log \pi(a_1 | s_1) +
\cancel{\nabla \log P(s_2 | s_1, a_1)} +
\nabla \log \pi(a_2 | s_2) +
\cdots \\
={} & \nabla \log \pi(a_1 | s_1) +
\nabla \log \pi(a_2 | s_2) +
\cdots
\end{aligned}$$

The crossed-out terms vanish because $d_0$ and $P$ do not depend on $\theta$; only the policy factors contribute.
Note that this form is similar to the one discussed in Importance Sampling.
Given the softmax parametrization $\pi(a|s) = \frac{e^{\theta^\top \phi(s,a)}}{\sum_{a'} e^{\theta^\top \phi(s,a')}}$, we have
$$\begin{aligned}
& \nabla \log \pi(a|s) \\
={} & \nabla\left(\log \left(e^{\theta^\top \phi(s, a)}\right)-\log \left(\sum_{a^{\prime}} e^{\theta^{\top} \phi(s, a^{\prime})}\right)\right) \\
={} & \phi(s, a)-\frac{\sum_{a'} e^{\theta^\top \phi(s,a')}\,\phi(s,a')}{\sum_{a^{\prime}} e^{\theta^{\top} \phi(s, a^{\prime})}} \\
={} & \phi(s,a) - \mathbb{E}_{a'\sim \pi(\cdot|s)} [\phi(s,a')]
\end{aligned}$$
Note that the expectation of the quantity above under $a \sim \pi(\cdot|s)$ is $0$, i.e.

$$\mathbb{E}_{a\sim\pi}\big[ \phi(s,a) - \mathbb{E}_{a'\sim \pi} [\phi(s,a')] \big] = 0.$$
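A quick numeric check of the score formula and its zero mean, reusing the hypothetical feature setup from the earlier sketch (one fixed state, random features):

```python
import numpy as np

rng = np.random.default_rng(2)
n_actions, d = 3, 8
phi_s = rng.standard_normal((n_actions, d))  # phi(s, a) for one fixed state s
theta = rng.standard_normal(d)

def pi(theta):
    logits = phi_s @ theta
    e = np.exp(logits - logits.max())
    return e / e.sum()

def score(a, theta):
    """grad_theta log pi(a|s) = phi(s, a) - E_{a' ~ pi}[phi(s, a')]."""
    return phi_s[a] - pi(theta) @ phi_s

# The score has zero mean under the policy itself
mean_score = sum(pi(theta)[a] * score(a, theta) for a in range(n_actions))
print(np.allclose(mean_score, 0.0))  # True
```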
Conclusion:
So far we have
$$\nabla J(\pi) =
\mathbb{E}_{\pi} \left[
\left(\sum_{t=1}^H \gamma^{t-1} r_t \right)
\left(\sum_{t=1}^H \nabla\log\pi (a_t|s_t) \right)
\right]$$
From the relation above, for any fixed $s_t$,

$$\mathbb{E}_{a_t \sim \pi(\cdot|s_t)}[\nabla\log\pi (a_t|s_t)] = \sum_{a_t}\nabla \pi (a_t|s_t) = \nabla 1 = 0.$$

Moreover, for $t' < t$, $r_{t'}$ is already determined by the history $h_t := (s_1, a_1, r_1, \ldots, s_t)$, while $a_t$ is drawn from $\pi(\cdot|s_t)$ given that history. Conditioning on $h_t$ therefore gives

$$\mathbb{E}_{\pi}[\nabla\log\pi (a_t|s_t)\, r_{t'}]
= \mathbb{E}_{\pi}\big[ r_{t'}\, \mathbb{E}_{a_t \sim \pi(\cdot|s_t)}[\nabla\log\pi(a_t|s_t)] \big]
= 0.$$
We can therefore drop rewards earlier than $t$ and rewrite $\nabla J(\pi)$ as
$$\nabla J(\pi) =
\mathbb{E}_{\pi} \left[
\sum_{t=1}^H
\nabla\log\pi (a_t|s_t)
\sum_{t'=t}^H \gamma^{t'-1} r_{t'}
\right]$$
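A minimal sketch of the resulting single-trajectory (REINFORCE-style) estimator, assuming the trajectory is stored as `(s, a, r)` tuples and `score(s, a)` returns $\nabla_\theta \log \pi(a|s)$ (e.g. the softmax score above); all names are my own:

```python
import numpy as np

def pg_estimate(traj, score, gamma=0.99):
    """Unbiased estimate of grad J(pi) from one on-policy trajectory.

    traj:  list of (s, a, r) tuples for t = 1, ..., H
    score: function (s, a) -> grad_theta log pi(a|s), array of shape (d,)
    """
    H = len(traj)
    rewards = np.array([r for (_, _, r) in traj], dtype=float)
    discounted = gamma ** np.arange(H) * rewards   # gamma^{t-1} r_t
    rtg = np.cumsum(discounted[::-1])[::-1]        # reward-to-go from each t
    return sum(rtg[t] * score(s, a) for t, (s, a, _) in enumerate(traj))
```

Averaging this estimate over many trajectories and taking an ascent step gives the basic PG algorithm.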
PG and Value-Based Methods
So far we have
$$\nabla J(\pi) =
\mathbb{E}_{\pi} \left[
\sum_{t=1}^H
\nabla\log\pi (a_t|s_t)
\sum_{t'=t}^H \gamma^{t'-1} r_{t'}
\right].$$
Condition the inner sum on $(s_t, a_t)$ and use the tower property:
$$\begin{aligned}
\nabla J(\pi) &=
\sum_{t=1}^H
\mathbb{E}_{s_t,a_t\sim d_t^\pi} \left[
\nabla\log\pi (a_t|s_t)\,
\underbrace{\mathbb{E}_{\pi} \left[
\sum_{t'=t}^H \gamma^{t'-1} r_{t'}
\,\Bigg|\, s_t, a_t
\right]}_{\gamma^{t-1} Q^\pi(s_t, a_t)}
\right] \\
&=
\sum_{t=1}^H
\gamma^{t-1}\, \mathbb{E}_{s,a\sim d_t^\pi} \left[
\nabla\log\pi (a|s)\,
Q^\pi (s,a)
\right] \\
&=
\frac{1}{1-\gamma}\, \mathbb{E}_{s\sim d^\pi,\, a\sim \pi(\cdot|s)}
\left[
Q^\pi (s,a)\, \nabla\log\pi(a|s)
\right],
\end{aligned}$$

where $d_t^\pi$ is the distribution of $(s_t, a_t)$ at time $t$ under $\pi$, and the last step (taking $H \to \infty$) uses the normalized discounted occupancy $d^\pi := (1-\gamma)\sum_{t\ge 1} \gamma^{t-1} d_t^\pi$, whose normalizer accounts for the factor $\frac{1}{1-\gamma}$.
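One way to read the final form: sample a time index $T$ with $\Pr(T = t) = (1-\gamma)\gamma^{t-1}$, roll the policy to step $T$, and the resulting state-action pair is distributed as $d^\pi$. A sketch under that reading (the `env_reset`, `env_step`, and `policy_sample` hooks are hypothetical):

```python
import numpy as np

def sample_occupancy(env_reset, env_step, policy_sample, gamma=0.99,
                     rng=np.random.default_rng()):
    """Draw (s, a) approximately from the discounted occupancy d^pi."""
    T = rng.geometric(1.0 - gamma)   # P(T = t) = (1 - gamma) * gamma^{t-1}
    s = env_reset()
    for _ in range(T - 1):           # roll the policy for T - 1 transitions
        s = env_step(s, policy_sample(s))
    a = policy_sample(s)
    return s, a
```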
Blend PG and Value-Based Methods
Instead of using the MC estimate $\sum_{t'=t}^H \gamma^{t'-1} r_{t'}$ for $\gamma^{t-1} Q^\pi(s_t, a_t)$, use an approximate value function $\hat{Q}(s_t, a_t)$, often trained by TD, e.g. expected SARSA:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t)+\alpha\left[R_{t+1}+\gamma\, \mathbb{E}_\pi\left[Q(S_{t+1}, A_{t+1}) \mid S_{t+1}\right]-Q(S_t, A_t)\right].$$
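A tabular sketch of this critic update (shapes and names are my own; the policy's action probabilities at the next state are passed in explicitly):

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, pi_next, alpha=0.1, gamma=0.99):
    """One tabular expected-SARSA step on the critic.

    Q:       array of shape (n_states, n_actions)
    pi_next: policy probabilities pi(.|s_next), shape (n_actions,)
    """
    target = r + gamma * pi_next @ Q[s_next]  # E_pi[Q(S_{t+1}, A_{t+1}) | S_{t+1}]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```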
Actor-critic
The parametrized policy is called the actor, and the value-function estimate is called the critic.
Baseline in PG
For any $f: S\to \mathbb{R}$,

$$\nabla J(\pi)=\frac{1}{1-\gamma} \mathbb{E}_{s \sim d^\pi, a \sim \pi(\cdot|s)}\left[\left(Q^\pi(s, a)-f(s)\right) \nabla \log \pi(a | s)\right],$$
because $f(s)$ does not depend on $a$ and $\mathbb{E}_{a \sim \pi(\cdot|s)}[\nabla\log\pi(a|s)] = 0$, so the subtracted term vanishes in expectation.
Choose $f = V^\pi$ and

$$\nabla J(\pi)=\frac{1}{1-\gamma} \mathbb{E}_{s \sim d^\pi, a \sim \pi(\cdot|s)}\left[A^\pi(s, a) \nabla \log \pi(a \mid s)\right],$$

where $A^\pi(s,a) := Q^\pi(s,a) - V^\pi(s)$ is the advantage function.
Baselines don't change the expectation of the gradient, but they can substantially lower its variance.
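A sketch of the baselined estimator, swapping the Monte Carlo reward-to-go for a critic-based advantage (here `Q_hat` and `V_hat` are hypothetical learned critics passed in as callables, and `score` is as in the earlier sketches):

```python
def pg_estimate_with_baseline(traj, score, Q_hat, V_hat, gamma=0.99):
    """Gradient estimate weighting each score by A_hat(s, a) = Q_hat(s, a) - V_hat(s).

    Same expectation as the reward-to-go estimator, typically lower variance.
    """
    grad = 0.0
    for t, (s, a, _) in enumerate(traj):     # t = 0 here corresponds to t = 1 above
        advantage = Q_hat(s, a) - V_hat(s)
        grad += gamma ** t * advantage * score(s, a)   # gamma^{t-1} weighting
    return grad
```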