Reinforcement Learning (17)

A Question

$$\mathbb{E}_{s,r,s^{\prime}}\left[\left(V_\theta(s)-r-\gamma V_\theta\left(s^{\prime}\right)\right)^2\right]$$

We do $V_\theta(s)\leftarrow V_\theta(s) + \alpha\left(r+\gamma V_\theta\left(s^{\prime}\right) - V_\theta(s)\right)$ in TD(0).
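
For concreteness, a minimal sketch of this update with linear function approximation (the feature map, step size, and discount below are illustrative choices, not from the lecture):

```python
import numpy as np

def td0_update(theta, phi, s, r, s_next, alpha=0.1, gamma=0.99):
    """One semi-gradient TD(0) step for V_theta(s) = theta @ phi(s)."""
    td_error = r + gamma * theta @ phi(s_next) - theta @ phi(s)  # target minus current estimate
    return theta + alpha * td_error * phi(s)                     # gradient taken only through V_theta(s)

# Example: one-hot features over 5 states.
phi = lambda s: np.eye(5)[s]
theta = td0_update(np.zeros(5), phi, s=0, r=1.0, s_next=1)
```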

What if we minimize the squared error between $V_\theta(s)$ and its target, i.e. $\mathbb{E}_{s,r,s^{\prime}}\left[\left(V_\theta(s)-r-\gamma V_\theta\left(s^{\prime}\right)\right)^2\right]$?

Not correct. It can be decomposed into the sum of two parts (a short derivation follows the list):

  • $\mathbb{E}_s\left[\left(V_\theta(s)-\left(\mathcal{T}^\pi V_\theta\right)(s)\right)^2\right]$
    • Good: this is the squared $L_2$-norm Bellman error.
  • $\gamma^2\, \mathbb{E}_s\left[\operatorname{Var}_{s^{\prime} \mid s, \pi(s)}\left[V_\theta\left(s^{\prime}\right)\right]\right]$
    • Not good: it penalizes high variance in the next-state value $V_\theta(s^{\prime})$.
    • This is fine for a deterministic environment, since the variance is always $0$ in that case.
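
Sketch of the decomposition (conditioning on $s$, and treating the reward as a deterministic function of $s$ and $\pi(s)$, which is why only the variance of $V_\theta(s^{\prime})$ appears):

$$\begin{aligned}\mathbb{E}_{r,s^{\prime}\mid s}\left[\left(V_\theta(s)-r-\gamma V_\theta(s^{\prime})\right)^2\right]&=\left(V_\theta(s)-\mathbb{E}\left[r+\gamma V_\theta(s^{\prime})\mid s\right]\right)^2+\operatorname{Var}\left[r+\gamma V_\theta(s^{\prime})\mid s\right]\\&=\left(V_\theta(s)-\left(\mathcal{T}^\pi V_\theta\right)(s)\right)^2+\gamma^2\operatorname{Var}_{s^{\prime}\mid s,\pi(s)}\left[V_\theta(s^{\prime})\right]\end{aligned}$$

using $\mathbb{E}\left[(c-X)^2\right]=(c-\mathbb{E}[X])^2+\operatorname{Var}[X]$ with $X=r+\gamma V_\theta(s^{\prime})$; taking $\mathbb{E}_s$ of both sides gives the two terms above.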

Solution

If we have a simulator, for each $s$ in the data, draw another independent state transition.

Minimize the objective:

$$\mathbb{E}\left[\left(V_\theta(s)-r-\gamma V_\theta\left(s_A^{\prime}\right)\right)\left(V_\theta(s)-r-\gamma V_\theta\left(s_B^{\prime}\right)\right)\right]$$

This is the “double sampling” trick, used in Baird’s residual algorithm (Bellman residual minimization).
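
A minimal sketch of minimizing this double-sampling objective with stochastic gradient descent on a toy chain MRP (the chain, reward function, step size, and tabular features below are illustrative choices, not the lecture's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, lr = 5, 0.9, 0.05
theta = np.zeros(n_states)                  # tabular V_theta (one-hot linear features)

def reward(s):
    """Toy deterministic reward of the current state."""
    return 1.0 if s == n_states - 1 else 0.0

def sample_next_state(s):
    """Toy stochastic simulator: move right with prob 0.7, otherwise stay."""
    return min(s + 1, n_states - 1) if rng.random() < 0.7 else s

for _ in range(50_000):
    s = int(rng.integers(n_states))
    r = reward(s)
    s_a = sample_next_state(s)              # two *independent* next-state draws from the same s
    s_b = sample_next_state(s)
    delta_a = theta[s] - r - gamma * theta[s_a]
    delta_b = theta[s] - r - gamma * theta[s_b]
    # Unbiased gradient of delta_a * delta_b w.r.t. theta (product rule), accumulated sparsely.
    grad = np.zeros(n_states)
    grad[s] += delta_a + delta_b
    grad[s_a] -= gamma * delta_b
    grad[s_b] -= gamma * delta_a
    theta -= lr * grad

print(np.round(theta, 2))                   # approaches V^pi of the toy chain
```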

Convergence

  • TD with function approximation can diverge in general
  • Is it because of…
    • Randomness in SGD?
      • Nope. Even the batch version doesn’t converge.
    • Sophisticated, non-linear func approx?
      • Nope. Even linear doesn’t converge.
    • That our function class does not capture $V^\pi$?
      • Nope. Even if $V^\pi$ can be exactly represented in the function class (“realizable”), it still does not converge.

Example

graph LR
    1((1)) --> 2((2)) --> 3((3)) --> 4((4)) --> 5((5)) --> 6((6)) --> 7((7)) --> 8((8)) --> 9((9)) --> 10(("10: reward=Ber(0.5)"))

Iterations

With an unrestricted (tabular) function class, each iteration simply propagates the estimate 0.501 one state further back:

| #Iter | 1 | 2 | … | 9 | 10 |
| --- | --- | --- | --- | --- | --- |
| 1 |  |  |  |  | 0.501 |
| 2 |  |  |  | 0.501 | 0.501 |
| … |  |  |  |  |  |
| 10 | 0.501 | 0.501 | … | 0.501 | 0.501 |

Assume the function space has only two possible values at each state:

| State | 1 | 2 | 3 | … | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | 0.5 | 0.5 | 0.5 | … | 0.5 | 0.5 | 0.5 |
|  | 1.012 | 0.756 | 0.628 | … | 0.504 | 0.502 | 0.501 |

(0.5 and 0.502 have the same distance to 0.501;
0.5 and 0.504 have the same distance to 0.502;…)

Then the iterations give:

| #Iter | 1 | 2 | … | 9 | 10 |
| --- | --- | --- | --- | --- | --- |
| 1 |  |  |  |  | 0.501 |
| 2 |  |  |  | 0.502 | 0.501 |
| … |  |  |  |  |  |
| 10 | 1.012 | 0.756 | … | 0.502 | 0.501 |

The values deviate further and further from 0.501 as the iterations proceed.

Geometrically: if the function space is a plane, then the result of each iteration (applying the Bellman operator) generally does not lie on the plane; instead, its projection onto the plane is used.
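
A sketch that reproduces the second table, under assumptions the post leaves implicit: undiscounted backups, a fixed estimate of 0.501 at state 10, candidate values $\{0.5,\ 0.5+0.001\cdot 2^{10-i}\}$ at state $i$ (matching the rows above), and projection ties broken away from 0.5, as in the parenthetical note:

```python
import numpy as np

n_states = 10
target_at_10 = 0.501   # estimated value of state 10 (mean of the observed Ber(0.5) rewards)

# Two-valued function class: state i (1-indexed) can only take 0.5 or 0.5 + 0.001 * 2**(10 - i).
candidates = [(0.5, 0.5 + 0.001 * 2 ** (n_states - i)) for i in range(1, n_states + 1)]

def project(i, target):
    """Nearest candidate value for state i; ties go to the non-0.5 value (as in the tables)."""
    lo, hi = candidates[i]
    return hi if abs(hi - target) <= abs(lo - target) else lo

V = np.full(n_states, 0.5)                 # start from the all-0.5 member of the function class

for it in range(1, n_states + 1):
    targets = np.empty(n_states)
    targets[:-1] = V[1:]                   # Bellman backup: state i's target is V(i+1)
    targets[-1] = target_at_10             # state 10's target is its estimated value
    V = np.array([project(i, t) for i, t in enumerate(targets)])
    print(it, np.round(V, 3))

# Iteration 10 prints 1.012, 0.756, ..., 0.502, 0.501: the projected iterates drift away from
# 0.501 even though the nearby value 0.5 is representable in the function class.
```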

Importance Sampling

We can only sample $x \sim q$ but want to estimate $\mathbb{E}_{x\sim p} f(x)$.

Importance Sampling (also called the importance-weighted or inverse propensity score estimator):

$$\frac{p(x)}{q(x)}f(x)$$

Unbiasedness:

$$\mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]=\sum_x q(x)\left(\frac{p(x)}{q(x)} f(x)\right)=\sum_x p(x) f(x)=\mathbb{E}_{x \sim p}[f(x)]$$
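
A quick numerical check of this identity (the distributions $p$, $q$ and the function $f$ below are toy choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.array([0, 1, 2])
p = np.array([0.2, 0.5, 0.3])               # target distribution (we cannot sample from it)
q = np.array([0.5, 0.3, 0.2])               # behavior distribution (we can sample from it)
f = lambda x: x ** 2

x = rng.choice(xs, size=100_000, p=q)       # samples drawn from q only
is_estimate = np.mean(p[x] / q[x] * f(x))   # importance-weighted estimator of E_{x~p}[f(x)]

print(is_estimate, np.sum(p * f(xs)))       # the estimate should be close to the exact value 1.7
```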

