Reinforcement Learning (11)

Model-based RL with a sampling oracle (Certainty Equivalence) Cont’d

To find $Q^\star_{\hat{M}}$ with empirical $\hat{R}$ and $\hat{P}$:

$$
f_0 \in \mathbb{R}^{SA}, \quad f_k = \hat{\mathcal{T}} f_{k-1}.
$$

where

$$
\begin{aligned}
(\hat{\mathcal{T}} f)(s, a) & = \hat{R}(s, a) + \gamma\, \mathbb{E}_{s^{\prime} \sim \hat{P}(\cdot \mid s, a)} \underbrace{\left[\max_{a^{\prime}} f\left(s^{\prime}, a^{\prime}\right)\right]}_{V_f\left(s^{\prime}\right)} \\
& = \frac{1}{n} \sum_{i=1}^n r_i + \gamma \left\langle \hat{P}(\cdot \mid s, a),\, V_f \right\rangle \\
& = \frac{1}{n} \sum_{i=1}^n r_i + \gamma \sum_{s^{\prime}} \left( \frac{1}{n} \sum_{i=1}^n \mathbb{I}\left[s_i^{\prime}=s^{\prime}\right] \right) \cdot V_f\left(s^{\prime}\right) \\
& = \frac{1}{n} \sum_{i=1}^n r_i + \frac{\gamma}{n} \sum_{i=1}^n \left( \sum_{s^{\prime}} \mathbb{I}\left[s_i^{\prime}=s^{\prime}\right] V_f\left(s^{\prime}\right) \right) \\
& = \frac{1}{n} \sum_{i=1}^n r_i + \gamma \cdot \frac{1}{n} \sum_{i=1}^n V_f\left(s_i^{\prime}\right) \\
& = \frac{1}{n} \sum_{i=1}^n \left( r_i + \gamma \max_{a^{\prime}} f\left(s_i^{\prime}, a^{\prime}\right) \right)
\end{aligned}
$$

is called the Empirical Bellman Update.
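As a concrete illustration, here is a minimal tabular sketch of one application of $\hat{\mathcal{T}}$, assuming integer-indexed states and actions and a hypothetical buffer `samples[(s, a)]` holding the $n$ observed `(r_i, s_i')` pairs for each state-action pair (names and interface are illustrative, not from the lecture):

```python
import numpy as np

def empirical_bellman_update(f, samples, gamma):
    """Apply the empirical Bellman operator T-hat once to a tabular estimate f.

    f       : np.ndarray of shape (S, A), current estimate f(s, a)
    samples : dict mapping (s, a) -> list of (r_i, s_next_i) sampled transitions
    gamma   : discount factor in [0, 1)
    """
    new_f = np.array(f, dtype=float, copy=True)
    v_f = f.max(axis=1)  # V_f(s') = max_{a'} f(s', a')
    for (s, a), transitions in samples.items():
        # (T-hat f)(s, a) = (1/n) * sum_i [ r_i + gamma * max_{a'} f(s'_i, a') ]
        new_f[s, a] = np.mean([r + gamma * v_f[s_next] for r, s_next in transitions])
    return new_f
```

Iterating `f = empirical_bellman_update(f, samples, gamma)` from any initial `f` is exactly the scheme above.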

Computational Complexity

Value Iteration

For the original value iteration, the computational complexity is

$$
|S| \times |A| \times |S|
$$

$|S| \times |A|$ for each $f(s, a)$ and $|S|$ for the expectation.

Empirical Bellman Update

For the Empirical Bellman Update, the computational complexity is

$$
|S| \times |A| \times n
$$

Empirical sampling is used $n$ times for each $(s, a)$ pair.

The Value Prediction Problem

Given $\pi$, we want to know $V^\pi$ and $Q^\pi$.

On-policy Learning
Data used to improve policy $\pi$ is generated by $\pi$ itself.
Off-policy Learning
Data used to improve policy $\pi$ is generated by other policies.

When actions are always chosen by a fixed policy, the MDP reduces to a Markov chain plus a reward function over states, also known as a Markov Reward Process (MRP).
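Concretely, for a fixed (possibly stochastic) policy $\pi$, the induced MRP is given by the standard construction (spelled out here for completeness):

$$
P^\pi(s^{\prime} \mid s) = \sum_{a} \pi(a \mid s)\, P(s^{\prime} \mid s, a), \qquad R^\pi(s) = \sum_{a} \pi(a \mid s)\, R(s, a),
$$

and value prediction for $\pi$ in the MDP is exactly policy evaluation in this MRP.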

Monte-Carlo Value Prediction

For each $s$, roll out $n$ trajectories using policy $\pi$:

  • For episodic tasks, roll out until termination
  • For continuing tasks, roll out to a length (typically $H=\mathrm{O}(1/(1-\gamma))$) such that omitting the future rewards has minimal impact (“small truncation error”)
  • Let $\hat{V}^\pi(s)$ (we will just write $V(s)$) be the average discounted return (see the sketch below)
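A minimal sketch of this batch procedure, assuming a hypothetical `rollout(s, H)` helper that follows $\pi$ from $s$ for $H$ steps and returns the observed reward sequence (for episodic tasks one would instead stop at termination):

```python
import numpy as np

def mc_value_prediction(rollout, states, n, H, gamma):
    """Monte-Carlo value prediction: V(s) = average truncated discounted return."""
    V = {}
    for s in states:
        returns = []
        for _ in range(n):
            rewards = rollout(s, H)  # rewards r_0, ..., r_{H-1} under policy pi
            returns.append(sum(gamma**t * r for t, r in enumerate(rewards)))
        V[s] = float(np.mean(returns))
    return V
```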

Online Monte-Carlo

  • For $i=1,2,\ldots$ as the index of trajectories
    • Draw a starting state $s_i$ from the exploratory initial distribution, roll out a trajectory using $\pi$ from $s_i$, and let $G_i$ be the (random) discounted return
    • Let $n(s_i)$ be the number of times $s_i$ has appeared as an initial state. If $n(s_i)=1$ (first time seeing this state), let $V(s_i) \leftarrow G_i$ (where $G_t=\sum_{t^{\prime}=t}^{t+H} \gamma^{t^{\prime}-t} r_{t^{\prime}}$)
    • Otherwise, $V(s_i) \leftarrow \frac{n(s_i)-1}{n(s_i)} V(s_i) + \frac{1}{n(s_i)} G_i$

No need to store the trajectory.
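The same incremental rule in code (a sketch; `V` and `counts` are the only state kept between trajectories, which is why no trajectory storage is needed):

```python
from collections import defaultdict

V = defaultdict(float)     # V(s), current Monte-Carlo estimate
counts = defaultdict(int)  # n(s), number of times s appeared as an initial state

def online_mc_update(s_i, G_i):
    """Fold the return G_i of a trajectory started at s_i into the running average."""
    counts[s_i] += 1
    n = counts[s_i]
    if n == 1:
        V[s_i] = G_i  # first time seeing this state as an initial state
    else:
        V[s_i] = (n - 1) / n * V[s_i] + G_i / n
    return V[s_i]
```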

More generally,

$$
V(s_i) \leftarrow (1-\alpha)\, V(s_i) + \alpha\, G_i
$$

or

$$
V(s_i) \leftarrow V(s_i) + \alpha \left( G_i - V(s_i) \right)
$$

where $\alpha$ is known as the learning rate, and $G_i$ as the target.

It can be interpreted as stochastic gradient descent. If we have i.i.d. real random variables $v_1, v_2, \ldots, v_n$, the average is the solution of the least-squares optimization problem:

$$
\min_v \frac{1}{2n} \sum_{i=1}^n \left( v - v_i \right)^2
$$

{: .prompt-tip }
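To spell out the SGD interpretation (a standard one-step derivation, added for completeness): a stochastic gradient step on a single term $\frac{1}{2}(v - v_i)^2$ with step size $\alpha$ gives

$$
v \leftarrow v - \alpha \nabla_v \tfrac{1}{2}(v - v_i)^2 = v - \alpha (v - v_i) = v + \alpha (v_i - v),
$$

which is exactly $V(s_i) \leftarrow V(s_i) + \alpha\,(G_i - V(s_i))$ with $v_i = G_i$; choosing $\alpha = 1/n(s_i)$ recovers the exact running average above.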

