Reinforcement Learning (15)
Recall the Bellman Equation:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[\, r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) \,\right],$$

which, for a sampled transition, is empirically approximated by:

$$Q^{\pi}(s_t, a_t) \approx r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1}).$$

With tuples $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ in the long trajectory, applying the running average:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[\, r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \,\right].$$
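A minimal sketch of this running-average update on a tabular $Q$, assuming integer state and action indices, with `alpha` as the step size and `gamma` as the discount factor:

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One running-average step on a tabular Q of shape (n_states, n_actions)."""
    td_target = r + gamma * Q[s_next, a_next]   # one-sample estimate of the Bellman right-hand side
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q(s_t, a_t) a small step toward the target
    return Q
```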
SARSA
Notice that SARSA is not applicable to a deterministic policy, because it requires a non-zero probability over every state-action pair ($\pi(a \mid s) > 0$), whereas under a deterministic policy only one action is possible in each state.
SARSA with $\epsilon$-greedy policy
How are the data pairs picked in SARSA?
At each time step $t$, with probability $\epsilon$, choose $a_t$ from the action space uniformly at random; otherwise, choose the greedy action $a_t = \arg\max_a Q(s_t, a)$.
When sampling an $(s, a, r, s', a')$ tuple along the trajectory, the first action in the tuple was actually generated by the previous version of the policy (the one derived from the older $Q$), so we can say SARSA is not 100% "on-policy".
{: .prompt-info }
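A minimal sketch of this action-selection rule, assuming a tabular `Q` indexed by state, a state index `s`, and an action count `n_actions`:

```python
import numpy as np

def epsilon_greedy_action(Q, s, n_actions, epsilon=0.1, rng=None):
    """Sample a_t for state s: uniform with probability epsilon, otherwise greedy on Q."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: uniform over the action space
    return int(np.argmax(Q[s]))               # exploit: argmax_a Q(s, a)
```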
Does SARSA converge to the optimal policy?
The cliff example (pg 132 of Sutton & Barto)
- Deterministic navigation, high penalty when falling off the cliff
- Optimal policy: walk near the cliff
- Unless $\epsilon$ is very small, SARSA will learn to avoid the path along the cliff
*Figure: the cliff-walking example.*
The optimal path runs along the edge of the cliff, but on this path the $\epsilon$-greedy SARSA agent occasionally takes an exploratory step off the cliff and sees the large penalty, so it learns to prefer the safer path further from the edge instead.
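To make this behavior concrete, here is a small self-contained sketch of SARSA on a cliff-walking grid; the grid layout, reward values, and hyperparameters are assumptions modeled on the textbook example:

```python
import numpy as np

# A 4x12 cliff-walking grid: start bottom-left, goal bottom-right, the cells
# between them are the cliff (reward -100, reset to start); each step costs -1.
ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    r, c = state
    dr, dc = ACTIONS[action]
    r = min(max(r + dr, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    if r == 3 and 0 < c < 11:                  # fell off the cliff: big penalty, back to start
        return START, -100.0, False
    if (r, c) == GOAL:
        return (r, c), -1.0, True
    return (r, c), -1.0, False

def epsilon_greedy(Q, s, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(len(ACTIONS)))
    return int(np.argmax(Q[s]))

def sarsa_cliff(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((ROWS, COLS, len(ACTIONS)))
    for _ in range(episodes):
        s = START
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            s_next, reward, done = step(s, a)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # On-policy TD target: uses the action actually taken next
            Q[s][a] += alpha * (reward + gamma * Q[s_next][a_next] - Q[s][a])
            s, a = s_next, a_next
    return Q

# With epsilon = 0.1 the greedy path read off from Q tends to stay away from the
# cliff edge; shrinking epsilon pushes it back toward the optimal path along the edge.
Q = sarsa_cliff()
```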
softmax
$\epsilon$-greedy can be replaced by a softmax policy: choose action $a$ with probability

$$\pi(a \mid s) = \frac{\exp\left(Q(s, a)/T\right)}{\sum_{a'} \exp\left(Q(s, a')/T\right)},$$

where $T$ is the temperature.
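A minimal sketch of softmax (Boltzmann) action selection, again assuming a tabular `Q`; subtracting the max before exponentiating is only a numerical-stability detail, not part of the formula:

```python
import numpy as np

def softmax_action(Q, s, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q(s, a) / T)."""
    if rng is None:
        rng = np.random.default_rng()
    prefs = Q[s] / temperature
    prefs = prefs - prefs.max()               # shift for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```

A large $T$ makes the distribution nearly uniform (more exploration), while $T \to 0$ approaches the greedy choice.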