Reinforcement Learning (12)
Every-visit Monte-Carlo
Suppose we have a continuing task. What if we cannot set the starting state arbitrarily?
i.e. we have a single long trajectory of length $N$.
- we can cut out truncations of length $T$ from the long trajectory.
- we can shift the $T$-length window by 1 each time and get $N - T + 1$ truncations.
This “walk” through the state space should have non-zero probability on each state, i.e. no state is starved.
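As a quick sketch of the truncation step (assuming the trajectory is stored as a Python sequence, e.g. a list of `(state, reward)` pairs; the function name is illustrative):

```python
def truncations(trajectory, T):
    """Yield every T-length truncation of a long trajectory,
    shifting the window by 1 each time (N - T + 1 windows in total)."""
    N = len(trajectory)
    for start in range(N - T + 1):
        yield trajectory[start:start + T]
```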
What if a state occurs multiple times on a trajectory?
- approach 1: only the first occurrence is used (first-visit Monte-Carlo)
- approach 2: all the occurrences are used (every-visit Monte-Carlo)
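Below is a minimal sketch of both approaches, under some assumptions: states are hashable (tabular), `rewards[t]` is the reward $r_{t+1}$ received when leaving `states[t]`, and the discounted return inside each window is used as a truncated estimate of the true return.

```python
import numpy as np
from collections import defaultdict

def mc_value_estimate(states, rewards, T, gamma, every_visit=True):
    """Monte-Carlo value estimation from a single long trajectory.

    states[t]  : state s_t visited at time t
    rewards[t] : reward r_{t+1} received on the transition s_t -> s_{t+1}
    T          : length of each truncation (window)
    every_visit: True  -> use every occurrence of a state inside a window
                 False -> use only the first occurrence (first-visit MC)
    """
    N = len(states)
    returns = defaultdict(list)

    # Shift a T-length window by 1 each time: N - T + 1 truncations.
    for start in range(N - T + 1):
        # Truncated discounted returns inside the window, computed backwards:
        # G_t = r_{t+1} + gamma * G_{t+1}, with G = 0 past the window end.
        G = 0.0
        window_returns = [0.0] * T
        for i in reversed(range(T)):
            G = rewards[start + i] + gamma * G
            window_returns[i] = G

        seen = set()
        for i in range(T):
            s = states[start + i]
            if every_visit or s not in seen:
                returns[s].append(window_returns[i])
            seen.add(s)

    # The value estimate is the average of the collected returns.
    return {s: float(np.mean(g)) for s, g in returns.items()}
```

Note that returns are cut off at the window's end, so estimates for states visited late in a window are truncated more heavily; with discounting, the omitted tail shrinks geometrically in the remaining window length.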
Alternative Approach: TD(0)
Again, suppose we have a single long trajectory $s_0, r_1, s_1, r_2, s_2, \dots$ in a continuing task.
TD(0): for $t = 0, 1, 2, \dots$

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$

- TD: temporal difference
- TD error: $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
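A minimal sketch of this update loop, assuming integer-indexed (tabular) states so $V$ can be a plain array, and that `rewards[t]` holds $r_{t+1}$:

```python
import numpy as np

def td0_evaluate(states, rewards, gamma, alpha, n_states):
    """TD(0) value estimation along a single long trajectory.

    states[t]  : state s_t (an integer index)
    rewards[t] : reward r_{t+1} received on the transition s_t -> s_{t+1}
    """
    V = np.zeros(n_states)
    for t in range(len(states) - 1):
        s, s_next = states[t], states[t + 1]
        td_target = rewards[t] + gamma * V[s_next]  # r_{t+1} + gamma * V(s_{t+1})
        td_error = td_target - V[s]                 # delta_t
        V[s] += alpha * td_error                    # V(s_t) <- V(s_t) + alpha * delta_t
    return V
```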
This is the same as the Monte-Carlo update rule, except that the “target” is $r_{t+1} + \gamma V(s_{t+1})$, which is similar to an empirical Bellman update.
Recall that in Monte-Carlo, the “target” is the return $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$, which is independent of the current value function.
While in TD(0), the target depends on the current value function $V$, i.e. $\text{target} = r_{t+1} + \gamma V(s_{t+1})$.

Compared to value iteration (here, the iterative Bellman backup for evaluating a fixed policy):

$$V_{k+1}(s) = \mathbb{E}\left[ r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s \right]$$

and the TD(0) update above can be rewritten as

$$V(s_t) \leftarrow (1 - \alpha)\, V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) \right]$$

which is an approximate Value Iteration process (the expectation is replaced by a single sampled transition, mixed in with step size $\alpha$). Notice that the whole iteration through the trajectory is only 1 iteration (a single $V_k \to V_{k+1}$ step), so an outside loop is needed if we want to approximate the real $V^\pi$.
Understanding TD(0)
The “approximate” Value Iteration process above is similar to TD(0) but slightly different:
it uses a value function $V_k$ (which stays constant during updates) to update $V_{k+1}$, which is another function. After running long enough, we have $V_{k+1}(s) \approx \mathbb{E}\left[ r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s \right]$ and do $V_k \leftarrow V_{k+1}$, then repeat the process. Finally $V_k$ converges to $V^\pi$.
But in TD(0), we use $V$ to update $V$ itself. The difference is “synchronous” vs. “asynchronous” updates.
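To make the “synchronous” vs. “asynchronous” distinction concrete, here is a minimal sketch under assumed names and a tabular setup: `P[s, s']` is the transition matrix under the fixed policy, `R[s, s']` the expected reward, and states are integer indices.

```python
import numpy as np

def synchronous_sweep(V_k, P, R, gamma):
    """One synchronous backup: V_{k+1}(s) = sum_{s'} P[s, s'] (R[s, s'] + gamma V_k(s')).
    V_k stays constant while the new function V_{k+1} is computed."""
    return (P * (R + gamma * V_k[None, :])).sum(axis=1)

def asynchronous_td0_sweep(V, states, rewards, gamma, alpha):
    """One TD(0) pass: V is updated in place, so each update bootstraps from
    values that earlier updates in the same pass may already have changed."""
    for t in range(len(states) - 1):
        s, s_next = states[t], states[t + 1]
        V[s] += alpha * (rewards[t] + gamma * V[s_next] - V[s])
    return V
```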
> TD(0) is less stable, since its bootstrapped target moves as $V$ itself is updated.
{: .prompt-info }