Reinforcement Learning (10)
The Learning Setting
Planning and learning
Planning:
- given MDP model, how to compute optimal policy
- The MDP model is known
Learning:
- MDP model is unknown
- collect data from the MDP: trajectories of states, actions, and rewards $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$.
- Data is often limited, e.g., adaptive medical treatment, dialog systems
- Go, chess, …
- Learning can be useful even if the final goal is planning
- especially when the state space is large and/or only a black-box simulator is available
- e.g., AlphaGo, video game playing, simulated robotics
Monte-Carlo policy evaluation
Given a policy $\pi$, estimate $J(\pi) := \mathbb{E}_{s_0 \sim d_0}\left[V^\pi(s_0)\right]$ ($d_0$ is the initial state distribution).
$J(\pi)$ is the true expected (discounted) cumulative reward of $\pi$.
Monte-Carlo outputs a scalar estimate $\hat{J}$; accuracy is measured by $|\hat{J} - J(\pi)|$.
Monte-Carlo estimation (by sampling different trajectories):
Data: $n$ trajectories starting from $s_0 \sim d_0$ and following $\pi$ (i.e., $a_t \sim \pi(\cdot \mid s_t)$).
This is called on-policy: evaluating a policy using data collected from exactly the same policy.
Otherwise, it is off-policy.
{: .prompt-info }
Estimator: $\hat{J} = \frac{1}{n} \sum_{i=1}^{n} R^{(i)}$, where $R^{(i)} = \sum_{t \ge 0} \gamma^t r_t^{(i)}$ is the discounted return of the $i$-th trajectory.
Guarantee: w.p. at least $1-\delta$, $|\hat{J} - J(\pi)| \le V_{\max} \sqrt{\frac{\ln(2/\delta)}{2n}}$ by Hoeffding's inequality, where $V_{\max}$ bounds the per-trajectory return (larger $n$, higher accuracy).
The guarantee is independent of the size of the state space.
{: .prompt-tip }
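
A minimal sketch of the on-policy Monte-Carlo estimator above, in Python. The environment interface (`reset()` returning $s_0 \sim d_0$, `step(a)` returning `(next_state, reward, done)`) and the policy function `pi(s)` are assumed placeholders, not something defined in these notes:

```python
import numpy as np

def mc_policy_evaluation(env, pi, n_traj=1000, gamma=0.99, horizon=1000):
    """Estimate J(pi) by averaging discounted returns over n_traj on-policy rollouts."""
    returns = []
    for _ in range(n_traj):
        s = env.reset()                   # s0 ~ d0 (initial state distribution)
        G, discount = 0.0, 1.0
        for _ in range(horizon):          # truncate each trajectory at a finite horizon
            a = pi(s)                     # on-policy: actions come from pi itself
            s, r, done = env.step(a)      # assumed interface: (next_state, reward, done)
            G += discount * r
            discount *= gamma
            if done:
                break
        returns.append(G)
    return float(np.mean(returns))        # J_hat; error shrinks at rate O(1/sqrt(n_traj))
```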
Comment on Monte-Carlo
Monte-Carlo is a Zeroth-order (ZO) optimization method, which is not efficient.
- first order: gradient / first derivative (in DL/ML, SGD)
- second order: Hessian matrix / second derivative
Model-based RL with a sampling oracle (Certainty Equivalence)
Certainty equivalence: the rewards and transition probabilities estimated from samples are treated as if they were the true (fixed) model.
{: .prompt-info }
Assume we can sample $r(s, a)$ and $s' \sim P(\cdot \mid s, a)$ for any $(s, a)$.
Collect $N$ samples per $(s, a)$. Total sample size: $|S|\,|A|\,N$.
Estimate an empirical MDP $\hat{M}$ from the data
- i.e., treat the empirical frequencies of next states appearing in the samples as the true transition distribution $\hat{P}(\cdot \mid s, a)$.
Plan in the estimated model $\hat{M}$ and return its optimal policy $\hat{\pi}$.
Transition tuples: $(s, a, r, s')$. Use $(s, a)$ to identify the current state and action, use $r$ for the reward and $s'$ for the transition.
Extract transition tuples from trajectories (a sketch of the whole procedure follows below).
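
A sketch of the certainty-equivalence procedure above for a tabular MDP: estimate $\hat{P}$ and $\hat{r}$ from $N$ samples per $(s, a)$, then plan in $\hat{M}$ with value iteration. The sampling oracle `sample(s, a)` returning an `(r, s_next)` pair is an assumed interface:

```python
import numpy as np

def certainty_equivalence(sample, n_states, n_actions, N=100, gamma=0.99, iters=1000):
    """Build an empirical MDP from N samples per (s, a), then plan in it by value iteration."""
    P_hat = np.zeros((n_states, n_actions, n_states))   # empirical transition frequencies
    r_hat = np.zeros((n_states, n_actions))             # empirical mean rewards
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(N):
                r, s_next = sample(s, a)                 # assumed oracle: reward and s' ~ P(.|s, a)
                r_hat[s, a] += r / N
                P_hat[s, a, s_next] += 1.0 / N

    # Value iteration in the estimated model M_hat
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = r_hat + gamma * P_hat @ V                    # Q_hat(s, a) = r_hat(s, a) + gamma * E_hat[V(s')]
        V = Q.max(axis=1)
    pi_hat = Q.argmax(axis=1)                            # greedy policy of the estimated MDP
    return pi_hat, V
```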
Finding a policy in the estimated environment
True environment: $M = (S, A, P, r, \gamma)$
Estimated environment: $\hat{M} = (S, A, \hat{P}, \hat{r}, \gamma)$
- Notation: hatted quantities ($\hat{P}$, $\hat{r}$, $\hat{V}$, $\hat{\pi}$) refer to the estimated model.
Performance measurement:
- In the true environment, the benchmark is $J(\pi^\star)$, where $\pi^\star = \arg\max_\pi J(\pi)$.
- For the estimated environment, use $J(\hat{\pi})$, i.e., measure the optimal policy of the estimated environment in the real environment.
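
To compute $J(\hat{\pi})$ for a fixed policy in a tabular model, one can solve the Bellman equation for $V^{\hat{\pi}}$ exactly. A sketch assuming access to the true tabular `P`, `r`, and initial distribution `d0` (in practice these are unknown and the evaluation would instead be done by Monte-Carlo rollouts):

```python
import numpy as np

def evaluate_in_true_mdp(P, r, pi, d0, gamma=0.99):
    """Exact policy evaluation: J(pi) = d0^T (I - gamma * P_pi)^{-1} r_pi for deterministic pi."""
    n_states = P.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, pi]                     # P_pi[s, s'] = P(s' | s, pi(s))
    r_pi = r[idx, pi]                     # r_pi[s]     = r(s, pi(s))
    V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return float(d0 @ V_pi)               # J(pi) = E_{s0 ~ d0}[V^pi(s0)]
```

Comparing $J(\pi^\star)$ with $J(\hat{\pi})$ then quantifies how much performance is lost by planning in the estimated model.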