Reinforcement Learning (16)

Q-learning

Update rule:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left(r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right)$$

Q-learning is off-policy: how we take actions has nothing to do with our current Q-estimate (or its greedy policy), i.e., the target always uses $\max_{a'} Q(s_{t+1}, a')$, no matter what the behavior policy actually is.

e.g., in the cliff-walking setting, the optimal policy can always be found, no matter the choice of $\epsilon$.
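As a concrete reference, here is a minimal sketch of the tabular update in Python. The Gymnasium-style environment interface (`env.reset()`, `env.step()`) and the hyperparameter values are assumptions for illustration, not from the lecture.

```python
import numpy as np

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy (sketch).

    Assumes a Gymnasium-style discrete environment exposing
    env.observation_space.n, env.action_space.n, env.reset(), env.step().
    """
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy w.r.t. the current Q-estimate.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Off-policy target: max over next actions, regardless of how `a` was chosen.
            target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Note that the behavior policy only affects which transitions are visited; the update itself always bootstraps from the max, which is what makes the method off-policy.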

Exercise: Multi-step Q-learning?

Does the target $r_t + \gamma r_{t+1} + \gamma^2 \max_{a'} Q(s_{t+2}, a')$ work? If not, why?

No, because it leads to

$$Q \leftarrow \mathcal{T}^\pi \mathcal{T} Q$$

The resulting fixed point of $\mathcal{T}^\pi \mathcal{T} \cdots \mathcal{T}^\pi \mathcal{T} Q$ still corresponds to an optimal policy, but for another MDP: one where on odd steps the agent must follow $\pi$, and on even steps it is free to decide.
{: .prompt-info }
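For reference, the two Bellman operators appearing above can be written as follows (a standard formulation; the lecture's exact notation may differ):

$$(\mathcal{T} Q)(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[\max_{a'} Q(s', a')\right], \qquad (\mathcal{T}^\pi Q)(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[Q(s', \pi(s'))\right].$$

With these definitions, the proposed two-step target is a sample of $(\mathcal{T}^\pi \mathcal{T} Q)(s_t, a_t)$, where $\pi$ is the behavior policy that chose $a_{t+1}$.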

Q-learning with experience replay

So far, most algorithms we have seen are "one-pass":

  • i.e., use each data point once and then discard it

  • # updates = # data points

  • Concern 1: We need many updates for optimization to converge.
    Can we separate optimization from data collection?

  • Concern 2: Need to reuse data if sample size is limited

Experience replay: keep a bag (replay buffer) of transition tuples $(s_t, a_t, r_t, s_{t+1})$. Sample (with replacement) a tuple randomly from the bag and apply the Q-learning update rule.

  • # updates >> # data points

Each time we get a new tuple, we put it in the bag and do several updates.

Not applicable to on-policy control methods (e.g., SARSA).
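Below is a minimal sketch of the replay procedure described above, assuming the same Gymnasium-style environment interface; the buffer size and the number of updates per step are hypothetical choices.

```python
import random
from collections import deque

import numpy as np

def q_learning_with_replay(env, num_steps=10_000, buffer_size=5_000,
                           updates_per_step=8, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with experience replay (sketch).

    Each new transition is put in the bag (replay buffer); we then sample
    tuples with replacement and apply the standard Q-learning update,
    so # updates >> # data points.
    """
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    buffer = deque(maxlen=buffer_size)

    s, _ = env.reset()
    for _ in range(num_steps):
        # Collect one transition with an epsilon-greedy behavior policy.
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        buffer.append((s, a, r, s_next, terminated))
        s = s_next if not (terminated or truncated) else env.reset()[0]

        # Replay: several updates per collected transition, sampled with replacement.
        for _ in range(updates_per_step):
            bs, ba, br, bs_next, bdone = random.choice(buffer)
            target = br + (0.0 if bdone else gamma * np.max(Q[bs_next]))
            Q[bs, ba] += alpha * (target - Q[bs, ba])
    return Q
```

The replayed tuples may come from old behavior, which is fine for an off-policy update like Q-learning but is exactly why the same trick does not carry over to on-policy methods such as SARSA.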

