Reinforcement Learning (16)
Q-learning
Update rule:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \Big)
$$
Q-learning is off-policy: how we take actions has nothing to do with our current Q-estimate (or its greedy policy). I.e., the target always takes $\max_{a'} Q(s_{t+1}, a')$, no matter what the behavior policy actually is.
E.g., in the cliff-walking setting, the optimal policy can always be found, no matter the choice of $\epsilon$ in the $\epsilon$-greedy behavior policy.
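
As a concrete illustration of the update rule and its off-policy nature, here is a minimal tabular sketch. It assumes a toy environment whose `reset()` returns a state index and whose `step(a)` returns `(next_state, reward, done)`; this interface and all names are illustrative, not a specific library's API.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy *behavior* policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy (could even be uniformly random;
            # Q-learning still targets Q* given enough exploration).
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy target: bootstrap with the greedy max at s_next,
            # regardless of which action the behavior policy will take there.
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```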
> Exercise: Multi-step Q-learning?
>
> Does the target $r_t + \gamma r_{t+1} + \gamma^2 \max_{a'} Q(s_{t+2}, a')$ work? If not, why?
>
> No. Because it leads to the fixed point
>
> $$
> Q(s_t, a_t) = \mathbb{E}_{a_{t+1} \sim \pi}\Big[ r_t + \gamma r_{t+1} + \gamma^2 \max_{a'} Q(s_{t+2}, a') \Big],
> $$
>
> where the intermediate action $a_{t+1}$ follows the behavior policy $\pi$ rather than $\arg\max_a Q$. The resulting $Q$ still defines an optimal policy, but for another MDP: on odd steps, follow $\pi$; on even steps, you are free to decide.
{: .prompt-info }
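
To make the bias concrete, here is a small sketch contrasting the ordinary one-step Q-learning target with the naive two-step target from the exercise. The function names are purely illustrative.

```python
import numpy as np

def one_step_target(Q, r_t, s_t1, gamma=0.99):
    # Ordinary Q-learning target: only the bootstrap term uses the greedy max,
    # so how a_{t+1} is actually chosen does not matter (off-policy).
    return r_t + gamma * np.max(Q[s_t1])

def naive_two_step_target(Q, r_t, r_t1, s_t2, gamma=0.99):
    # Naive multi-step target from the exercise.
    # r_{t+1} was generated by a_{t+1} ~ behavior policy, NOT by argmax_a Q(s_{t+1}, a),
    # so the fixed point is the optimal Q of a different MDP in which every
    # other step is forced to follow the behavior policy.
    return r_t + gamma * r_t1 + gamma ** 2 * np.max(Q[s_t2])
```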
Q-learning with experience replay
So far, most algorithms we have seen are "one-pass":

- i.e., use each data point once and then discard it
- # updates = # data points

- Concern 1: We need many updates for the optimization to converge. Can we separate optimization from data collection?
- Concern 2: We need to reuse data if the sample size is limited.
Experience replay keeps a bag (buffer) of transition tuples $(s, a, r, s')$:

- Each time we get a new tuple, put it in the bag, then do several updates (sketched below).
- For each update: sample (with replacement) a tuple uniformly at random from the bag, and apply the Q-learning update rule.
- # updates >> # data points
- Not applicable to on-policy control (e.g., SARSA), since the sampled tuples were generated by older behavior policies, which only off-policy methods can tolerate.
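
A minimal sketch of Q-learning with experience replay, under the same toy environment interface as above; the `ReplayBag` class and all parameter names (e.g. `updates_per_step`) are illustrative.

```python
import random
import numpy as np

class ReplayBag:
    """A simple 'bag' of transition tuples (s, a, r, s_next, done)."""
    def __init__(self):
        self.data = []
    def add(self, transition):
        self.data.append(transition)
    def sample(self):
        # Sample with replacement, uniformly at random.
        return random.choice(self.data)

def q_update(Q, transition, alpha=0.1, gamma=0.99):
    s, a, r, s_next, done = transition
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_with_replay(env, n_states, n_actions, episodes=500,
                           updates_per_step=8, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    bag = ReplayBag()
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            bag.add((s, a, r, s_next, done))   # put the new tuple in the bag
            for _ in range(updates_per_step):  # # updates >> # data points
                q_update(Q, bag.sample())
            s = s_next
    return Q
```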