Reinforcement Learning (15)

Recall the Bellman Equation:

$$
\begin{aligned}
\left(T^\pi f\right)(s, a) & = R(s, a) + \gamma\, \mathbb{E}_{s^{\prime} \sim P(\cdot \mid s, a)}\left[f\left(s^{\prime}, \pi\right)\right] \\
& = \mathbb{E}\left[r + \gamma \cdot f\left(s^{\prime}, \pi\right) \mid s, a\right].
\end{aligned}
$$

which can be estimated empirically by:

$$
\frac{1}{n} \sum_{i=1}^n \left(r_i + \gamma\, Q_{k-1}\left(s_i^{\prime}, \pi\right)\right).
$$

With tuples $(s_t, a_t, r_t, s_{t+1})$ collected along a long trajectory, apply the running-average update:

$$
Q_k(s_t, a_t) \leftarrow Q_k(s_t, a_t) + \alpha \left(r_t + \gamma\, Q_{k-1}(s_{t+1}, \pi) - Q_k(s_t, a_t)\right)
$$
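As an illustration (a minimal sketch, not the lecture's code; the tabular NumPy representation and the helper name are assumptions), the running-average backup for a single transition might look like:

```python
import numpy as np

def running_average_backup(Q_k, Q_prev, pi, transition, alpha=0.1, gamma=0.99):
    """One running-average step of Q_k toward the bootstrapped target
    r + gamma * Q_{k-1}(s', pi(s')) for a single (s, a, r, s') tuple."""
    s, a, r, s_next = transition
    target = r + gamma * Q_prev[s_next, pi[s_next]]  # Bellman target built from the previous iterate
    Q_k[s, a] += alpha * (target - Q_k[s, a])        # move the running average toward the target


# toy usage: 4 states, 2 actions, policy pi maps state -> action
Q_prev = np.zeros((4, 2))
Q_k = Q_prev.copy()
pi = np.array([0, 1, 0, 1])
running_average_backup(Q_k, Q_prev, pi, transition=(0, 1, 1.0, 2))
```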

SARSA

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left(r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\right)
$$

Notice that SARSA is not applicable to a deterministic policy, because it requires a non-zero probability of visiting every state-action pair ($\forall (s,a) \in S \times A$), whereas a deterministic policy allows only one action in each state.
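As a sketch (a hypothetical helper, assuming a tabular NumPy `Q`), the SARSA update for one observed $(s, a, r, s', a')$ tuple is:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """SARSA: bootstrap on the action a_next that was actually taken in s_next."""
    td_target = r + gamma * Q[s_next, a_next]  # target uses the next action from the behavior policy
    Q[s, a] += alpha * (td_target - Q[s, a])   # running-average step toward the TD target
```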

SARSA with $\epsilon$-greedy policy

How are the $s, a$ data pairs picked in SARSA?

At each time step $t$: with probability $\epsilon$, choose $a_t$ from the action space uniformly at random; otherwise, $a_t = \arg\max_a Q(s_t, a)$.

When sampling an $s$-$a$-$r$-$s'$-$a'$ tuple along the trajectory, the first action in the tuple was actually generated with the last version of $Q$, so we can say SARSA is not 100% “on-policy”.
{: .prompt-info }
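A minimal sketch of the $\epsilon$-greedy action choice described above (the helper name and tabular `Q` are assumptions):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=None):
    """With probability epsilon pick a uniformly random action; otherwise act greedily w.r.t. Q."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore: uniform over the action space
    return int(np.argmax(Q[s]))               # exploit: greedy action under the current Q
```

Each step of SARSA then selects $a_{t+1}$ with this rule and feeds the resulting $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ tuple into the update above.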

Does SARSA converge to the optimal policy?

The cliff example (pg 132 of Sutton & Barto)

- Deterministic navigation, high penalty for falling off the cliff
- Optimal policy: walk near the cliff
- Unless $\epsilon$ is very small, SARSA will avoid the cliff

*(Figure: the cliff-walking example.)*

The optimal path runs along the edge of the cliff, but on this path $\epsilon$-greedy SARSA will often incur the large penalty (falling off the cliff) and therefore chooses the safe path instead.

Softmax

$\epsilon$-greedy can be replaced by softmax: choose action $a$ with probability

$$
\frac{\exp(Q(s,a) / T)}{\sum_{a'} \exp(Q(s,a') / T)}
$$

where $T$ is the temperature.
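A sketch of softmax (Boltzmann) action selection with temperature $T$ (illustrative; the helper name is an assumption):

```python
import numpy as np

def softmax_action(Q, s, T=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    if rng is None:
        rng = np.random.default_rng()
    prefs = Q[s] / T
    prefs = prefs - prefs.max()                    # shift for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(probs), p=probs))
```

Small $T$ makes the choice nearly greedy; large $T$ makes it nearly uniform.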

