Reinforcement Learning (10)

The Learning Setting

Planning and learning

Planning:

  • Given the MDP model, how to compute the optimal policy.
  • The MDP model is known.

Learning:

  • The MDP model is unknown.
  • Collect data from the MDP: $(s, a, r, s')$ tuples.
  • Data is limited, e.g., adaptive medical treatment, dialog systems.
  • Go, chess, …
  • Learning can be useful even if the final goal is planning
    • especially when $|S|$ is large and/or only a black-box simulator is available
    • e.g., AlphaGo, video game playing, simulated robotics

Monte-Carlo policy evaluation

Given $\pi$, estimate $J(\pi) := \mathbb{E}_{s \sim d_0}\left[V^\pi(s)\right]$ (where $d_0$ is the initial state distribution), i.e., the true expected reward of $\pi$.

Monte-Carlo (by sampling different trajectories) outputs a scalar $v$; accuracy is measured by $|v - J(\pi)|$.

Data: trajectories starting from $s_1 \sim d_0$ using $\pi$ (i.e., $a_t = \pi(s_t)$).

$$\left\{\left(s_1^{(i)}, a_1^{(i)}, r_1^{(i)}, s_2^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}, r_H^{(i)}\right)\right\}_{i=1}^n$$

This is called on-policy: evaluating a policy with data collected from exactly the same policy.

Otherwise, it is off-policy.
{: .prompt-info }
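Below is a minimal sketch of collecting such on-policy trajectory data. The simulator interface (`env.reset()`, `env.step(a)`) and the deterministic `policy(s)` function are hypothetical names for illustration, not part of the lecture.

```python
def collect_trajectories(env, policy, n, H):
    """Roll out n on-policy trajectories of horizon H.

    Assumes a hypothetical black-box simulator `env` where `env.reset()`
    returns s_1 ~ d_0 and `env.step(a)` returns (r, s'), plus a
    deterministic `policy(s) -> a`, i.e., a_t = pi(s_t).
    """
    trajectories = []
    for _ in range(n):
        s = env.reset()                  # s_1 ~ d_0
        traj = []
        for _ in range(H):
            a = policy(s)                # a_t = pi(s_t)
            r, s_next = env.step(a)      # r_t, s_{t+1}
            traj.append((s, a, r))
            s = s_next
        trajectories.append(traj)
    return trajectories
```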

Estimator:

$$\frac{1}{n} \sum_{i=1}^n \sum_{t=1}^H \gamma^{t-1} r_t^{(i)}$$

Guarantee: w.p. at least $1-\delta$, $|v - J(\pi)| \leq \frac{R_{\max}}{1-\gamma} \sqrt{\frac{1}{2n} \ln \frac{2}{\delta}}$ (larger $n$, higher accuracy).

The bound is independent of the size of the state space.
{: .prompt-tip }
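As a sketch, the estimator above can be computed by averaging discounted returns over trajectories, assuming the (s, a, r)-per-step trajectory format from the collection sketch earlier (that format is an assumption, not from the lecture).

```python
def monte_carlo_estimate(trajectories, gamma):
    """v = (1/n) * sum_i sum_{t=1}^H gamma^(t-1) * r_t^(i)."""
    returns = [
        sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))  # t starts at 0
        for traj in trajectories
    ]
    return sum(returns) / len(returns)
```

For reference, the Hoeffding-style bound above is just `r_max / (1 - gamma) * sqrt(log(2 / delta) / (2 * n))` for given `n` and `delta`.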

Comment on Monte-Carlo

Monte-Carlo is a Zeroth-order (ZO) optimization method, which is not efficient.

  • first order: gradient / first derivative (in DL/ML, SGD)
  • second order: Hessian matrix / second derivative

Model-based RL with a sampling oracle (Certainty Equivalence)

That is, treat the reward / transition probabilities estimated via sampling as if they were the true (fixed) quantities.
{: .prompt-info }

Assume we can sample $r \sim R(s, a)$ and $s^{\prime} \sim P(s, a)$ for any $(s, a)$.

Collect $n$ samples per $(s, a)$: $\{(r_i, s_i^{\prime})\}_{i=1}^n$. Total sample size: $n |S \times A|$.

Estimate an empirical MDP $\hat{M}$ from data:

  • $\hat{R}(s, a) := \frac{1}{n} \sum_{i=1}^n r_i, \quad \hat{P}\left(s^{\prime} \mid s, a\right) := \frac{1}{n} \sum_{i=1}^n \mathbb{I}\left[s_i^{\prime} = s^{\prime}\right]$
  • i.e., treat the empirical frequencies of states appearing in $\{s_i^{\prime}\}_{i=1}^n$ as the true distribution (see the sketch below).
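A sketch of this certainty-equivalence estimate under the sampling oracle, assuming tabular states/actions and a hypothetical `sample(s, a)` function that returns one draw $(r, s')$:

```python
import numpy as np

def estimate_model(sample, states, actions, n):
    """Build R_hat[s, a] and P_hat[s, a, s'] from n oracle calls per (s, a)."""
    S, A = len(states), len(actions)
    R_hat = np.zeros((S, A))
    P_hat = np.zeros((S, A, S))
    index = {s: i for i, s in enumerate(states)}
    for si, s in enumerate(states):
        for ai, a in enumerate(actions):
            for _ in range(n):
                r, s_next = sample(s, a)               # r ~ R(s,a), s' ~ P(s,a)
                R_hat[si, ai] += r / n                 # empirical mean reward
                P_hat[si, ai, index[s_next]] += 1 / n  # empirical frequency
    return R_hat, P_hat
```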

Plan in the estimated model and return the optimal policy

Transition tuples: $(s_i, a_i, r_i, s_{i+1})$. Use $s_i, a_i$ to identify the current state and action, $r_i$ for the reward, and $s_{i+1}$ for the transition.

Extract transition tuples from trajectories, as in the sketch below.
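A sketch of that extraction, again assuming the (s, a, r)-per-step trajectory format used in the collection sketch above:

```python
def extract_transitions(trajectories):
    """Turn trajectories of (s, a, r) steps into (s, a, r, s') tuples.

    The next state of step t is the state recorded at step t + 1, so the
    final step of each trajectory (whose successor is unobserved) is dropped.
    """
    tuples = []
    for traj in trajectories:
        for (s, a, r), (s_next, _, _) in zip(traj, traj[1:]):
            tuples.append((s, a, r, s_next))
    return tuples
```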

Finding a policy on the estimated environment

true environment: $M = (S, A, P, R, \gamma)$

estimated environment: $\hat{M} = (S, A, \hat{P}, \hat{R}, \gamma)$

  • notation: $\pi_{\hat{M}}, \, V_{\hat{M}}, \, \ldots$

performance measurement:

  • in the true environment, use $\| V^\star - V^{\pi_f} \|$ where $f \approx Q^\star$
  • in the estimated environment, use $\| V_M^\star - V_M^{\pi_{\hat{M}}^\star} \|$, i.e., measure the optimal policy of the estimated environment in the real environment (see the sketch after this list)
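A sketch of the full certainty-equivalence loop under these definitions: plan in $\hat{M}$ with value iteration, then (for analysis only, when the true $P, R$ are available) evaluate the returned policy in $M$. Array shapes follow the estimation sketch above; `value_iteration` and `policy_evaluation` are illustrative names, not the lecture's notation.

```python
import numpy as np

def value_iteration(P, R, gamma, iters=1000):
    """Greedy policy of the MDP given by P[s, a, s'] and R[s, a]."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V            # Q[s, a] = R[s, a] + gamma * E_{s'}[V(s')]
        V = Q.max(axis=1)
    return Q.argmax(axis=1)              # deterministic policy: one action per state

def policy_evaluation(P, R, gamma, pi):
    """Exact V^pi in the given MDP by solving (I - gamma * P_pi) V = R_pi."""
    S = R.shape[0]
    P_pi = P[np.arange(S), pi]           # (S, S): transitions under pi
    R_pi = R[np.arange(S), pi]           # (S,):  rewards under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# Certainty equivalence: plan in the estimated model M_hat ...
# pi_hat = value_iteration(P_hat, R_hat, gamma)
# ... then compare V_M^* with V_M^{pi_hat} in the true model M (when P, R are known):
# v_star  = policy_evaluation(P, R, gamma, value_iteration(P, R, gamma))
# v_pihat = policy_evaluation(P, R, gamma, pi_hat)
# gap = np.max(np.abs(v_star - v_pihat))
```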
