For the original value iteration, the computational complexity is $|S| \times |A| \times |S|$ per iteration: there are $|S| \times |A|$ state-action pairs $(s, a)$ to back up, and computing the expectation over next states costs $|S|$ for each.
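To make the counting concrete, here is a minimal sketch of one exact backup sweep, assuming hypothetical tabular arrays `P` (transition probabilities) and `R` (rewards); the three nested loops over states, actions, and next states give the $|S| \times |A| \times |S|$ cost.

```python
import numpy as np

def value_iteration_sweep(P, R, V, gamma=0.99):
    """One exact Bellman backup sweep (hypothetical tabular representation).

    P: transition probabilities, shape (S, A, S)
    R: rewards, shape (S, A)
    V: current value estimates, shape (S,)
    """
    S, A = R.shape
    V_new = np.zeros(S)
    for s in range(S):                        # |S| states
        q_values = np.zeros(A)
        for a in range(A):                    # |A| actions per state
            expectation = 0.0
            for s_next in range(S):           # |S| next states for the expectation
                expectation += P[s, a, s_next] * V[s_next]
            q_values[a] = R[s, a] + gamma * expectation
        V_new[s] = q_values.max()             # greedy max over actions
    return V_new
```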
Empirical Bellman Update
For the Empirical Bellman Update, the computational complexity is $|S| \times |A| \times n$: for each $(s, a)$ pair, the exact expectation is replaced by an empirical average over $n$ sampled next states.
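A matching sketch of the empirical version, assuming a hypothetical sampler `sample_next_state(s, a)` that draws $s' \sim P(\cdot \mid s, a)$; only the innermost computation changes, so the cost becomes $|S| \times |A| \times n$.

```python
import numpy as np

def empirical_bellman_sweep(sample_next_state, R, V, n=10, gamma=0.99):
    """One empirical Bellman backup sweep (hypothetical sampler interface).

    sample_next_state(s, a): returns one sampled next state s' ~ P(.|s, a)
    R: rewards, shape (S, A); V: current value estimates, shape (S,)
    """
    S, A = R.shape
    V_new = np.zeros(S)
    for s in range(S):                        # |S| states
        q_values = np.zeros(A)
        for a in range(A):                    # |A| actions per state
            samples = [V[sample_next_state(s, a)] for _ in range(n)]   # n samples
            q_values[a] = R[s, a] + gamma * np.mean(samples)
        V_new[s] = q_values.max()
    return V_new
```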
The Value Prediction Problem
Given $\pi$, we want to know $V^\pi$ and $Q^\pi$.
On-policy Learning
Data used to improve policy $\pi$ is generated by $\pi$ itself.
Off-policy Learning
Data used to improve policy $\pi$ is generated by some other policy.
When the action is always chosen by a fixed policy, the MDP reduces to a Markov chain plus a reward function over states, also known as a Markov Reward Process (MRP).
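Concretely, assuming the usual notation $P(s' \mid s, a)$ and $R(s, a)$ for the MDP's transitions and rewards and a (possibly stochastic) policy $\pi(a \mid s)$, the induced MRP has transition function $P^\pi(s' \mid s) = \sum_a \pi(a \mid s)\, P(s' \mid s, a)$ and reward function $R^\pi(s) = \sum_a \pi(a \mid s)\, R(s, a)$.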
Monte-Carlo Value Prediction
For each $s$, roll out $n$ trajectories using policy $\pi$
For episodic tasks, roll out until termination
For continuing tasks, roll out to a length (typically $H = O(1/(1 - \gamma))$) such that omitting the future rewards has minimal impact (“small truncation error”)
Let $V^\pi(s)$ (we will just write $V(s)$) be the average discounted return over these $n$ trajectories
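A minimal sketch of this procedure, assuming a hypothetical environment interface `env.reset(s)` / `env.step(a)` (the latter returning a next state, a reward, and a done flag) and a policy function `pi(state)`:

```python
import numpy as np

def mc_value_prediction(env, pi, states, n=100, gamma=0.99, H=None):
    """Monte-Carlo value prediction for a fixed policy pi (hypothetical env API)."""
    if H is None:
        H = int(np.ceil(1.0 / (1.0 - gamma)))     # H = O(1 / (1 - gamma))
    V = {}
    for s in states:
        returns = []
        for _ in range(n):                        # n rollouts per state
            state = env.reset(s)                  # start the rollout at s
            G, discount = 0.0, 1.0
            for _ in range(H):                    # truncate at horizon H
                state, reward, done = env.step(pi(state))
                G += discount * reward
                discount *= gamma
                if done:                          # episodic task: stop at termination
                    break
            returns.append(G)
        V[s] = np.mean(returns)                   # average discounted return
    return V
```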
Online Monte-Carlo
For $i = 1, 2, \ldots$ as the index of trajectories:
Draw a starting state $s_i$ from the exploratory initial distribution, roll out a trajectory using $\pi$ from $s_i$, and let $G_i$ be the (random) discounted return
Let $n(s_i)$ be the number of times $s_i$ has appeared as an initial state. If $n(s_i) = 1$ (first time seeing this state), let $V(s_i) \leftarrow G_i$, where $G_t = \sum_{t'=t}^{t+H} \gamma^{t'-t} r_{t'}$.
Otherwise, update $V(s_i) \leftarrow V(s_i) + \alpha \left( G_i - V(s_i) \right)$ with $\alpha = 1/n(s_i)$,
where $\alpha$ is known as the learning rate, and $G_i$ as the target.
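A sketch of the online procedure under the same hypothetical environment interface as above; note that with $\alpha = 1/n(s_i)$ the update is exactly the incremental form of the Monte-Carlo average, and the first-visit case $V(s_i) \leftarrow G_i$ corresponds to $\alpha = 1$.

```python
from collections import defaultdict

def online_mc(env, pi, draw_initial_state, num_trajectories, gamma=0.99, H=100):
    """Online Monte-Carlo value prediction with incremental averaging."""
    V = defaultdict(float)          # value estimates V(s)
    n = defaultdict(int)            # n(s): times s has appeared as an initial state
    for _ in range(num_trajectories):
        s_i = draw_initial_state()              # exploratory initial distribution
        state, G, discount = env.reset(s_i), 0.0, 1.0
        for _ in range(H):                      # roll out and accumulate the return
            state, reward, done = env.step(pi(state))
            G += discount * reward
            discount *= gamma
            if done:
                break
        n[s_i] += 1
        alpha = 1.0 / n[s_i]                    # learning rate
        V[s_i] += alpha * (G - V[s_i])          # move toward the target G
    return dict(V)
```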
The incremental update can be interpreted as stochastic gradient descent. If we have i.i.d. real random variables $v_1, v_2, \ldots, v_n$, the average is the solution of the least-squares optimization problem: