Reinforcement Learning (11)

Model-based RL with a sampling oracle (Certainty Equivalence) Cont’d

To find $Q^\star_{\hat{M}}$ with empirical $\hat{R}$ and $\hat{P}$:

$$
f_0 \in \mathbb{R}^{SA}, \quad f_k = \hat{\mathcal{T}} f_{k-1}.
$$

where

$$
\begin{aligned}
(\hat{\mathcal{T}} f)(s, a) & = \hat{R}(s, a) + \gamma\, \mathbb{E}_{s^{\prime} \sim \hat{P}(\cdot \mid s, a)} \underbrace{\left[\max_{a^{\prime}} f\left(s^{\prime}, a^{\prime}\right)\right]}_{V_f\left(s^{\prime}\right)} \\
& = \frac{1}{n} \sum_{i=1}^n r_i + \gamma \left\langle \hat{P}(\cdot \mid s, a),\, V_f \right\rangle \\
& = \frac{1}{n} \sum_{i=1}^n r_i + \gamma \sum_{s^{\prime}} \left( \frac{1}{n} \sum_{i=1}^n \mathbb{I}\left[s_i^{\prime}=s^{\prime}\right] \right) \cdot V_f\left(s^{\prime}\right) \\
& = \frac{1}{n} \sum_{i=1}^n r_i + \frac{\gamma}{n} \sum_{i=1}^n \left( \sum_{s^{\prime}} \mathbb{I}\left[s_i^{\prime}=s^{\prime}\right] V_f\left(s^{\prime}\right) \right) \\
& = \frac{1}{n} \sum_{i=1}^n r_i + \gamma \cdot \frac{1}{n} \sum_{i=1}^n V_f\left(s_i^{\prime}\right) \\
& = \frac{1}{n} \sum_{i=1}^n \left( r_i + \gamma \max_{a^{\prime}} f\left(s_i^{\prime}, a^{\prime}\right) \right)
\end{aligned}
$$

is called the Empirical Bellman Update.
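As a concrete illustration, here is a minimal tabular sketch of one application of $\hat{\mathcal{T}}$, assuming integer-indexed states and actions and a hypothetical buffer `samples[(s, a)]` holding the $n$ observed `(r_i, s_i')` pairs for each state-action pair (names and interface are illustrative, not from the lecture):

```python
import numpy as np

def empirical_bellman_update(f, samples, gamma):
    """Apply the empirical Bellman operator T-hat once to a tabular estimate f.

    f       : np.ndarray of shape (S, A), current estimate f(s, a)
    samples : dict mapping (s, a) -> list of (r_i, s_next_i) sampled transitions
    gamma   : discount factor in [0, 1)
    """
    new_f = np.array(f, dtype=float, copy=True)
    v_f = f.max(axis=1)  # V_f(s') = max_{a'} f(s', a')
    for (s, a), transitions in samples.items():
        # (T-hat f)(s, a) = (1/n) * sum_i [ r_i + gamma * max_{a'} f(s'_i, a') ]
        new_f[s, a] = np.mean([r + gamma * v_f[s_next] for r, s_next in transitions])
    return new_f
```

Iterating `f = empirical_bellman_update(f, samples, gamma)` from any initial `f` is exactly the scheme above.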

Computational Complexity

Value Iteration

For the original value iteration, the computational complexity is

$$
|S| \times |A| \times |S|
$$

$|S| \times |A|$ for each $f(s, a)$ and $|S|$ for the expectation.

Empirical Bellman Update

For the Empirical Bellman Update, the computational complexity is

$$
|S| \times |A| \times n
$$

Empirical sampling is used $n$ times for each $(s, a)$ pair.

The Value Prediction Problem

Given $\pi$, we want to know $V^\pi$ and $Q^\pi$.

On-policy Learning
Data used to improve policy $\pi$ is generated by $\pi$ itself.
Off-policy Learning
Data used to improve policy $\pi$ is generated by other policies.

When actions are always chosen by a fixed policy, the MDP reduces to a Markov chain plus a reward function over states, also known as a Markov Reward Process (MRP).
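Concretely, for a fixed (possibly stochastic) policy $\pi$, the induced MRP is given by the standard construction (spelled out here for completeness):

$$
P^\pi(s^{\prime} \mid s) = \sum_{a} \pi(a \mid s)\, P(s^{\prime} \mid s, a), \qquad R^\pi(s) = \sum_{a} \pi(a \mid s)\, R(s, a),
$$

and value prediction for $\pi$ in the MDP is exactly policy evaluation in this MRP.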

Monte-Carlo Value Prediction

For each $s$, roll out $n$ trajectories using policy $\pi$:

  • For episodic tasks, roll out until termination
  • For continuing tasks, roll out to a length (typically $H=\mathrm{O}(1/(1-\gamma))$) such that omitting the future rewards has minimal impact (“small truncation error”)
  • Let $\hat{V}^\pi(s)$ (we will just write $V(s)$) be the average discounted return (see the sketch below)
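A minimal sketch of this batch procedure, assuming a hypothetical `rollout(s, H)` helper that follows $\pi$ from $s$ for $H$ steps and returns the observed reward sequence (for episodic tasks one would instead stop at termination):

```python
import numpy as np

def mc_value_prediction(rollout, states, n, H, gamma):
    """Monte-Carlo value prediction: V(s) = average truncated discounted return."""
    V = {}
    for s in states:
        returns = []
        for _ in range(n):
            rewards = rollout(s, H)  # rewards r_0, ..., r_{H-1} under policy pi
            returns.append(sum(gamma**t * r for t, r in enumerate(rewards)))
        V[s] = float(np.mean(returns))
    return V
```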

Online Monte-Carlo

  • For $i=1,2,\ldots$ as the index of trajectories
    • Draw a starting state $s_i$ from the exploratory initial distribution, roll out a trajectory using $\pi$ from $s_i$, and let $G_i$ be the (random) discounted return
    • Let $n(s_i)$ be the number of times $s_i$ has appeared as an initial state. If $n(s_i)=1$ (first time seeing this state), let $V(s_i) \leftarrow G_i$ (where $G_t=\sum_{t^{\prime}=t}^{t+H} \gamma^{t^{\prime}-t} r_{t^{\prime}}$)
    • Otherwise, $V(s_i) \leftarrow \frac{n(s_i)-1}{n(s_i)} V(s_i) + \frac{1}{n(s_i)} G_i$

No need to store the trajectory.
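The same incremental rule in code (a sketch; `V` and `counts` are the only state kept between trajectories, which is why no trajectory storage is needed):

```python
from collections import defaultdict

V = defaultdict(float)     # V(s), current Monte-Carlo estimate
counts = defaultdict(int)  # n(s), number of times s appeared as an initial state

def online_mc_update(s_i, G_i):
    """Fold the return G_i of a trajectory started at s_i into the running average."""
    counts[s_i] += 1
    n = counts[s_i]
    if n == 1:
        V[s_i] = G_i  # first time seeing this state as an initial state
    else:
        V[s_i] = (n - 1) / n * V[s_i] + G_i / n
    return V[s_i]
```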

More generally,

$$
V(s_i) \leftarrow (1-\alpha)\, V(s_i) + \alpha\, G_i
$$

or

$$
V(s_i) \leftarrow V(s_i) + \alpha \left( G_i - V(s_i) \right)
$$

where $\alpha$ is known as the learning rate, and $G_i$ as the target.

It can be interpreted as stochastic gradient descent. If we have i.i.d. real random variables $v_1, v_2, \ldots, v_n$, the average is the solution of the least-squares optimization problem:

$$
\min_v \frac{1}{2n} \sum_{i=1}^n \left( v - v_i \right)^2
$$

{: .prompt-tip }
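To spell out the SGD interpretation (a standard one-step derivation, added for completeness): a stochastic gradient step on a single term $\frac{1}{2}(v - v_i)^2$ with step size $\alpha$ gives

$$
v \leftarrow v - \alpha \nabla_v \tfrac{1}{2}(v - v_i)^2 = v - \alpha (v - v_i) = v + \alpha (v_i - v),
$$

which is exactly $V(s_i) \leftarrow V(s_i) + \alpha\,(G_i - V(s_i))$ with $v_i = G_i$; choosing $\alpha = 1/n(s_i)$ recovers the exact running average above.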

