Reinforcement Learning (12)

Every-visit Monte-Carlo

Suppose we have a continuing task. What if we cannot set the
starting state arbitrarily?

i.e. we have a single long trajectory with length $N$:

$$s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, \ldots$$

  • we can cut $N/H$ truncations of length $H = O(1/(1-\gamma))$ from the long trajectory.
  • we can shift the $H$-length window by 1 each time and get $N - H + 1 \approx N$ truncations; see the sketch after this list.
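
Below is a minimal sketch of the sliding-window scheme, assuming the rewards of the long trajectory are stored in an array; the function name `mc_targets` and the concrete values of $\gamma$ and $H$ are illustrative, not from the lecture.

```python
import numpy as np

def mc_targets(rewards, gamma, H):
    """Truncated discounted return G_t = sum_{k=0}^{H-1} gamma^k * r_{t+k}
    for every start t where a full length-H window fits
    (the sliding-window scheme from the second bullet above)."""
    N = len(rewards)
    targets = []
    for t in range(N - H + 1):              # N - H + 1 ~ N windows
        G = 0.0
        for k in range(H - 1, -1, -1):      # accumulate the window backwards
            G = rewards[t + k] + gamma * G
        targets.append(G)
    return np.array(targets)

# Illustration: gamma = 0.9 gives an effective horizon H = O(1/(1 - gamma)) ~ 10.
gamma, H = 0.9, 10
rewards = np.random.rand(1000)              # stand-in for the single long trajectory
targets = mc_targets(rewards, gamma, H)     # one Monte-Carlo target per window start
```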

This “walk” through the state space should visit each state with non-zero probability, i.e. it should not starve any state.

What if a state occurs multiple times in a trajectory?

  • approach 1: only the first occurrence is used (first-visit Monte-Carlo)
  • approach 2: all the occurrences are used (every-visit Monte-Carlo); see the sketch after this list
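
A minimal sketch of the two approaches, assuming we already have per-step Monte-Carlo targets (e.g. from the window computation above) and hashable state identifiers; the function name `mc_value_estimate` is hypothetical.

```python
from collections import defaultdict

def mc_value_estimate(states, targets, first_visit=True):
    """Estimate V(s) by averaging Monte-Carlo return targets.
    first_visit=True  -> approach 1: only the first occurrence of each state
                         in the trajectory is used (first-visit).
    first_visit=False -> approach 2: all occurrences are used (every-visit)."""
    returns = defaultdict(list)
    seen = set()
    for s, G in zip(states, targets):
        if first_visit and s in seen:
            continue                      # approach 1: skip repeated occurrences
        seen.add(s)
        returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```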

Alternative Approach: TD(0)

Again, suppose we have a single long trajectory $s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, s_4, \ldots$ in a continuing task.

TD(0): for $t = 1, 2, \ldots$,

$$V(s_t) \leftarrow V(s_t) + \alpha\left(r_t + \gamma V(s_{t+1}) - V(s_t)\right)$$

TD stands for temporal difference; the quantity $r_t + \gamma V(s_{t+1}) - V(s_t)$ is called the TD error.
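
A minimal tabular sketch of this update rule, assuming integer state indices and a fixed step size; the function name `td0` and its signature are illustrative.

```python
import numpy as np

def td0(states, rewards, n_states, gamma, alpha):
    """Tabular TD(0) along a single long trajectory:
    V(s_t) <- V(s_t) + alpha * (r_t + gamma * V(s_{t+1}) - V(s_t))."""
    V = np.zeros(n_states)
    for t in range(len(states) - 1):                      # need s_{t+1} for the target
        s, s_next = states[t], states[t + 1]
        td_error = rewards[t] + gamma * V[s_next] - V[s]  # the TD error above
        V[s] += alpha * td_error                          # update V(s_t) in place
    return V
```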

Same as the Monte-Carlo update rule, except that the “target” is $r_t + \gamma V(s_{t+1})$, which is similar to the empirical Bellman update.

Recall that in Monte-Carlo, the “target” is $G_t = \sum_{t'=t}^{t+H} \gamma^{t'-t} r_{t'}$, which is independent of the current value function, while in TD(0) the target $r_t + \gamma V(s_{t+1})$ depends on the current value function $V$.

Compared to value iteration:

$$V_{k+1}(s) := \mathbb{E}_{r, s' \mid s, \pi}\left[r + \gamma V_k(s')\right]$$

and the expectation above can be approximated by a sample average over observed transitions:

$$V_{k+1}(s) \approx \frac{1}{n} \sum_{i=1}^n \left(r_i + \gamma V_k(s_i')\right)$$

which is an approximate value iteration process. Notice that a whole pass through $i = 1, \ldots, n$ is only one iteration (one $V_k \to V_{k+1}$ step), so an outer loop is needed if we want $V$ to approximate the true $V^\pi$; see the sketch below.
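
A sketch of this approximate value iteration with the outer loop made explicit, assuming the samples are stored as integer-indexed $(s, r, s')$ tuples; the function name and the handling of unvisited states are illustrative choices, not from the lecture.

```python
import numpy as np

def approx_policy_evaluation(transitions, n_states, gamma, n_outer_iters):
    """Approximate value iteration for evaluating the policy that generated
    `transitions`, a list of observed (s, r, s') tuples."""
    V = np.zeros(n_states)
    for _ in range(n_outer_iters):              # outer loop over k
        sums = np.zeros(n_states)
        counts = np.zeros(n_states)
        for s, r, s_next in transitions:        # one pass = one V_k -> V_{k+1} step
            sums[s] += r + gamma * V[s_next]    # target uses the frozen V_k
            counts[s] += 1
        # states with no samples keep their old value (an arbitrary choice here)
        V = np.where(counts > 0, sums / np.maximum(counts, 1), V)
    return V
```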

Understanding TD(0)

The “approximate” value iteration process above is similar to TD(0) but slightly different:
it uses a value function $V$ (which stays constant during updates) to update another function $V'$. After long enough, we have $V' = \mathcal{T}^\pi V$; we then set $V \leftarrow V'$ and repeat the process, which finally converges to $V^\pi$.

But in TD(0), we use $V$ to update itself. The difference is “synchronous” vs. “asynchronous” updates; the two schemes are contrasted in the sketch below.
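
To make the contrast concrete, here is a sketch of one sweep of each scheme over the same batch of $(s, r, s')$ samples; the function names and the step-size form of the synchronous sweep are illustrative assumptions.

```python
import numpy as np

def sweep_synchronous(V, transitions, gamma, alpha):
    """Targets read a frozen copy of V (the V_k above); the updated table
    replaces V only after the whole sweep, i.e. V <- V'."""
    V_frozen = V.copy()                          # V stays constant during updates
    V_new = V.copy()
    for s, r, s_next in transitions:
        V_new[s] += alpha * (r + gamma * V_frozen[s_next] - V_new[s])
    return V_new

def sweep_asynchronous(V, transitions, gamma, alpha):
    """TD(0)-style: V is read and written in place, so later updates within
    the sweep already see the effect of earlier ones."""
    for s, r, s_next in transitions:
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V
```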

TD(0) is less stable
{: .prompt-info }

