Reinforcement Learning (17)
A Question
We do the update $V(s) \leftarrow V(s) + \alpha\,\big(r + \gamma V(s') - V(s)\big)$ in TD(0), treating the target $r + \gamma V(s')$ as a constant.
What if we minimize the squared error between $V(s)$ and its target, i.e., $\min_V \mathbb{E}\big[(V(s) - r - \gamma V(s'))^2\big]$?
Not correct. Conditioned on $s$, it can be decomposed as the sum of 2 parts (checked numerically below):
- $\big(V(s) - (\mathcal{T}^\pi V)(s)\big)^2$
  - Good. In expectation over $s$, this is the squared $L_2$-norm Bellman error.
- $\mathrm{Var}_{r, s'}\big[r + \gamma V(s')\big]$
  - Not good. It penalizes value functions whose bootstrap target has large variance.
  - OK for a deterministic environment because the variance is always $0$ in this case.
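A minimal Monte Carlo check of this decomposition, using a made-up single state whose target is $r + \gamma V(s')$ with $r \sim \mathrm{Ber}(0.5)$ and a random next state (all names and numbers here are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
V = {"s": 0.7, "s1": 0.4, "s2": 1.0}  # an arbitrary candidate value function

n = 1_000_000
r = rng.binomial(1, 0.5, size=n)                  # stochastic reward ~ Ber(0.5)
v_next = rng.choice([V["s1"], V["s2"]], size=n)   # V(s') for a random next state
target = r + gamma * v_next                       # bootstrap target r + gamma*V(s')

squared_error = np.mean((V["s"] - target) ** 2)   # the naive objective
bellman_error_sq = (V["s"] - target.mean()) ** 2  # (V(s) - (T^pi V)(s))^2
variance_term = target.var()                      # Var[r + gamma*V(s')]

# The two sides agree up to Monte Carlo noise.
print(squared_error, bellman_error_sq + variance_term)
```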
Solution
If we have a simulator, for each $(s, r, s')$ in the data, draw another independent state transition $s'' \sim P(\cdot \mid s)$ with reward $r'$.
Minimize the objective $\mathbb{E}\big[(V(s) - r - \gamma V(s'))\,(V(s) - r' - \gamma V(s''))\big]$. Since the two transitions are independent given $s$, the expectation of the product is exactly the squared Bellman error, with no variance penalty.
This is “double sampling”; the resulting method is Baird’s residual algorithm (Bellman residual minimization).
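On the same toy setup as above, a sketch of the double-sampling estimator: draw two independent transitions from the same state, and the product of their TD residuals estimates the squared Bellman error with no variance term (again, everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.9
V = {"s": 0.7, "s1": 0.4, "s2": 1.0}  # same arbitrary candidate as above
n = 1_000_000

def residuals():
    """TD residuals V(s) - (r + gamma*V(s')) for n independent transitions."""
    r = rng.binomial(1, 0.5, size=n)
    v_next = rng.choice([V["s1"], V["s2"]], size=n)
    return V["s"] - (r + gamma * v_next)

# Two independent draws of (r, s') from the same state s.
double_sample = np.mean(residuals() * residuals())
print(double_sample)  # ~= (V(s) - (T^pi V)(s))^2, the squared Bellman error
```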
Convergence
- TD with function approximation can diverge in general.
- Is it because of…
  - Randomness in SGD?
    - Nope. Even the batch version doesn’t converge.
  - Sophisticated, non-linear function approximation?
    - Nope. Even linear doesn’t converge.
  - That our function class does not capture $V^\pi$?
    - Nope. Even if $V^\pi$ can be exactly represented in the function class (“realizable”), it still does not converge.
Example

```mermaid
graph LR
  1((1)) --> 2((2)) --> 3((3)) --> 4((4)) --> 5((5))
  5 --> 6((6)) --> 7((7)) --> 8((8)) --> 9((9)) --> 10((10))
  10 -.->|"reward=Ber(0.5)"| 10
```
Iterations

With a tabular representation, every state’s estimate converges to 0.501, the empirical mean of the Ber(0.5) reward (slightly off 0.5 due to sampling noise):

| #Iter | 1 | 2 | … | 9 | 10 |
| --- | --- | --- | --- | --- | --- |
| 1 | | | | | 0.501 |
| 2 | | | | 0.501 | 0.501 |
| … | | | | | |
| 10 | 0.501 | 0.501 | 0.501 | 0.501 | 0.501 |
Assume the function space has two possible values at each state:

| 1 | 2 | 3 | … | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- |
| 0.5 | 0.5 | 0.5 | … | 0.5 | 0.5 | 0.5 |
| 1.012 | 0.756 | 0.628 | … | 0.504 | 0.502 | 0.501 |
(0.5 and 0.502 have the same distance to 0.501;
0.5 and 0.504 have the same distance to 0.502;…)
then

| #Iter | 1 | 2 | … | 9 | 10 |
| --- | --- | --- | --- | --- | --- |
| 1 | | | | | 0.501 |
| 2 | | | | 0.502 | 0.501 |
| … | | | | | |
| 10 | 1.012 | 0.756 | … | 0.502 | 0.501 |
The value estimates deviate further from 0.501 as the iterations go on.
Say the function space is a plane; then the result of each iteration (one application of the Bellman operator) is generally not on the plane, so its projection onto the plane is picked instead. The Bellman backup alone contracts toward the fixed point, but composing it with the projection can push the iterates away from it, as the table above shows.
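A minimal simulation of this example, under the assumptions the tables suggest: an undiscounted 10-state chain whose only (empirical) reward is 0.501 at state 10, exact per-state projection onto the two candidate values, and ties broken away from 0.5:

```python
N = 10
R10 = 0.501                         # empirical mean of the Ber(0.5) reward
f1 = [0.5] * N                      # first candidate function: constant 0.5
f2 = [0.5 + 0.001 * 2 ** (N - s) for s in range(1, N + 1)]  # 1.012, ..., 0.501

V = f1[:]                           # start from the constant function
for it in range(1, N + 1):
    # Bellman backup: V(s) <- V(s+1) for s < 10, V(10) <- empirical reward.
    backup = [V[s + 1] for s in range(N - 1)] + [R10]
    # Per-state projection: nearer candidate wins; ties go away from 0.5.
    V = [a if abs(a - t) < abs(b - t) else b
         for t, a, b in zip(backup, f1, f2)]
    print(it, [round(v, 3) for v in V])
# Final iterate: [1.012, 0.756, 0.628, ..., 0.504, 0.502, 0.501] -- the
# estimates drift away from 0.501 even though every backup target is near 0.5.
```

Each iteration the projection doubles the deviation from 0.5 at the next state up the chain, which is exactly the tie-breaking pattern noted in the parenthetical above.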
Importance Sampling
We can only sample $x \sim p$, but want to estimate $\mathbb{E}_{x \sim q}[f(x)]$.
Importance sampling (or importance weighting, or the inverse propensity score (IPS) estimator):

$$\frac{q(x)}{p(x)}\, f(x), \qquad x \sim p.$$

Unbiasedness:

$$\mathbb{E}_{x \sim p}\!\left[\frac{q(x)}{p(x)}\, f(x)\right] = \sum_x p(x)\,\frac{q(x)}{p(x)}\, f(x) = \sum_x q(x)\, f(x) = \mathbb{E}_{x \sim q}[f(x)].$$
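A quick numerical check of this unbiasedness with made-up discrete distributions (the specific $p$, $q$, and $f$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

xs = np.array([0, 1, 2])
p = np.array([0.6, 0.3, 0.1])   # behavior distribution we can sample from
q = np.array([0.1, 0.3, 0.6])   # target distribution we care about
f = np.array([1.0, 2.0, 4.0])   # function whose mean under q we want

x = rng.choice(xs, size=1_000_000, p=p)      # samples from p only
is_estimate = np.mean((q[x] / p[x]) * f[x])  # importance-weighted average
print(is_estimate, float(np.sum(q * f)))     # both ~= E_{x~q}[f(x)] = 3.1
```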