Reinforcement Learning (18)

Application to contextual bandits (CB)

  • The data point is a tuple $(x, a, r)$

  • The function of interest is $(x, a, r) \mapsto r$

  • The distribution of interest is $x \sim d_0,\ a \sim \pi,\ r \sim R(x, a)$

    • Let the joint density be $p(x, a, r)$
  • The data distribution is $x \sim d_0,\ a \sim \pi_b,\ r \sim R(x, a)$

    • Let the joint density be $q(x, a, r)$
  • IS estimator: $\frac{p(x, a, r)}{q(x, a, r)} \cdot r$

  • Write down the densities

    • $p(x, a, r) = d_0(x) \cdot \pi(a \mid x) \cdot R(r \mid x, a)$
    • $q(x, a, r) = d_0(x) \cdot \pi_b(a \mid x) \cdot R(r \mid x, a)$
    • To compute the importance weight, you don’t need knowledge of $d_0$ or $R$! You just need $\pi_b$ (or even just $\pi_b(a \mid x)$, the “proposal probability”)
  • Let $\rho$ be a shorthand for $\frac{\pi(a \mid x)}{\pi_b(a \mid x)}$, so the estimator is $\rho \cdot r$

  • $\pi_b$ needs to “cover” $\pi$

    • i.e., whenever $\pi(a \mid x) > 0$, we need $\pi_b(a \mid x) > 0$
  • A special case:

    • $\pi$ is deterministic, and $\pi_b$ is uniformly random ($\pi_b(a \mid x) \equiv 1/|A|$)
    • $\frac{\mathbb{I}[a = \pi(x)]}{1/|A|} \cdot r$
      • only look at actions that match what $\pi$ wants to take and discard the other data points
      • If match, $\rho = |A|$; if mismatch, $\rho = 0$
    • On average: only a $1/|A|$ portion of the data is useful (see the numerical sketch below)
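
A minimal numerical sketch of the estimator above (the synthetic contexts, rewards, and variable names are illustrative assumptions, not part of the lecture), using the special case of a deterministic target $\pi$ and a uniform behavior $\pi_b$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_actions = 10_000, 5                 # |A| = 5

# Illustrative logging setup: x ~ d_0, a ~ pi_b (uniform), r ~ R(x, a).
x = rng.integers(0, 3, size=n)             # three discrete contexts
a = rng.integers(0, num_actions, size=n)   # behavior policy pi_b: uniform over actions
r = (a == x).astype(float)                 # reward is 1 iff the action equals the context

def pi(ctx):
    """Deterministic target policy: pi(x) = x (a valid action for these contexts)."""
    return ctx

# Importance weight rho = pi(a|x) / pi_b(a|x) = I[a == pi(x)] / (1/|A|):
# matched samples get weight |A|, all others get weight 0 and are discarded.
rho = (a == pi(x)) * num_actions

print(np.mean(rho * r))                    # IS estimate of E_pi[r]; the true value here is 1
```

Only about $1/|A|$ of the logged samples contribute a nonzero term, matching the observation above.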

A note about using IS

  • We know that shifting all rewards by a constant does not matter (for planning purposes) in fixed-horizon problems
  • However, when you apply IS, shifting the rewards does impact the variance of the estimator
  • Special case:
    • deterministic $\pi$, uniformly random $\pi_b$,
    • reward is deterministic and constant: regardless of $(x, a)$, the reward is always 1 (without any randomness)
    • We know the value of any policy is 1
    • On-policy MC has 0 variance
    • IS still has high variance!
  • Where does variance come from?

$$
\begin{aligned}
\frac{1}{n} \sum_{i=1}^n \frac{\mathbb{I}\left[a^{(i)}=\pi\left(x^{(i)}\right)\right]}{1 /|A|} \cdot r^{(i)}
&= \sum_{i=1}^n \frac{\mathbb{I}\left[a^{(i)}=\pi\left(x^{(i)}\right)\right] \cdot r^{(i)}}{n /|A|} \\
&= \frac{1}{n /|A|} \sum_{i: a^{(i)}=\pi\left(x^{(i)}\right)} r^{(i)}
\end{aligned}
$$

The variance comes from the denominator: $n/|A|$ is the *expected* number of samples whose action $a^{(i)}$ matches $\pi$, not the realized number of matched samples, and this gap between the expected and realized counts is what creates the variance.

Solution: use the true number of matched samples as the denominator,

$$
\frac{1}{\left|\left\{ i : a^{(i)} = \pi\left(x^{(i)}\right) \right\}\right|} \sum_{i : a^{(i)} = \pi\left(x^{(i)}\right)} r^{(i)}
$$
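
A small simulation of this special case (an illustrative sketch mirroring the constant-reward example above): ordinary IS fluctuates because the matched count fluctuates around $n/|A|$, while normalizing by the true matched count returns exactly 1 every time.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_actions, trials = 100, 5, 2_000

is_vals, matched_vals = [], []
for _ in range(trials):
    # Under uniform pi_b, the event a^(i) == pi(x^(i)) is Bernoulli(1/|A|),
    # so we simulate the match indicator directly.
    match = rng.integers(0, num_actions, size=n) == 0   # P(match) = 1/|A|
    r = np.ones(n)                                      # constant reward 1, no randomness
    rho = match * num_actions                           # importance weights: |A| or 0
    is_vals.append(np.mean(rho * r))                    # ordinary IS (denominator n/|A| in effect)
    if match.any():
        matched_vals.append(r[match].mean())            # normalize by the true matched count

print(np.mean(is_vals), np.var(is_vals))                # mean ~ 1, variance ~ (|A|-1)/n > 0
print(np.mean(matched_vals), np.var(matched_vals))      # exactly 1.0, variance 0.0
```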

Multi-step IS in MDPs

  • Data: trajectories starting from $s_1 \sim d_0$ using $\pi_b$ (i.e., $a_t \sim \pi_b(\cdot \mid s_t)$)
    (for simplicity, assume the process terminates in $H$ time steps)

{(s1(i),a1(i),r1(i),s2(i),,sH(i),aH(i),rH(i))}i=1n\left\{\left(s_1^{(i)}, a_1^{(i)}, r_1^{(i)}, s_2^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}, r_H^{(i)}\right)\right\}_{i=1}^n

  • Want to estimate $J(\pi) := \mathbb{E}_{s \sim d_0}\left[V^\pi(s)\right]$
  • Same idea as in bandit: apply IS to the entire trajectory


  • Define $\tau$ as the whole trajectory. The function of interest is $\tau \mapsto \sum_{t=1}^H \gamma^{t-1} r_t$.
  • Let the distribution over trajectories induced by $\pi$ be $p(\tau)$
  • Let the distribution over trajectories induced by $\pi_b$ be $q(\tau)$
  • IS estimator: $\frac{p(\tau)}{q(\tau)} \cdot \sum_{t=1}^H \gamma^{t-1} r_t$

How to compute $p(\tau)/q(\tau)$?

  • $p(\tau) = d_0(s_1) \cdot \pi(a_1 \mid s_1) \cdot P(s_2 \mid s_1, a_1) \cdot \pi(a_2 \mid s_2) \cdots P(s_H \mid s_{H-1}, a_{H-1}) \cdot \pi(a_H \mid s_H)$
  • $q(\tau) = d_0(s_1) \cdot \pi_b(a_1 \mid s_1) \cdot P(s_2 \mid s_1, a_1) \cdot \pi_b(a_2 \mid s_2) \cdots P(s_H \mid s_{H-1}, a_{H-1}) \cdot \pi_b(a_H \mid s_H)$

Here all the $P(\cdot \mid \cdot)$ terms (and $d_0(s_1)$) cancel out, so the ratio involves only the two policies.

Let $\rho_t = \frac{\pi(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$; then $\frac{p(\tau)}{q(\tau)} = \prod_{t=1}^H \rho_t =: \rho_{1:H}$
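
A short sketch of the resulting estimator of $J(\pi)$ (the function and argument names, e.g. `pi_prob(a, s)`, are illustrative assumptions); note that it only queries the two policies’ action probabilities, never $P$ or $d_0$:

```python
import numpy as np

def rho_1H(pi_prob, pib_prob, states, actions):
    """Trajectory-wise importance weight: product over t of pi(a_t|s_t) / pi_b(a_t|s_t)."""
    return np.prod([pi_prob(a, s) / pib_prob(a, s) for s, a in zip(states, actions)])

def trajectory_is_estimate(trajectories, pi_prob, pib_prob, gamma=1.0):
    """Average rho_{1:H} * sum_t gamma^(t-1) r_t over trajectories logged by pi_b."""
    estimates = []
    for states, actions, rewards in trajectories:
        rewards = np.asarray(rewards, dtype=float)
        discounts = gamma ** np.arange(len(rewards))     # gamma^(t-1) for t = 1..H
        ret = float(np.dot(discounts, rewards))          # discounted return of this trajectory
        estimates.append(rho_1H(pi_prob, pib_prob, states, actions) * ret)
    return float(np.mean(estimates))
```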

Examine the special case again

  • $\pi$ is deterministic, and $\pi_b$ is uniformly random ($\pi_b(a \mid s) \equiv 1/|A|$)
  • $\rho_t = \frac{\mathbb{I}\left[a_t = \pi(s_t)\right]}{1/|A|}$
  • only look at trajectories where all actions happen to match what $\pi$ wants to take
    • Only if every action matches, $\rho_{1:H} = |A|^H$; otherwise $\rho_{1:H} = 0$
  • On average: only a $1/|A|^H$ portion of the data is useful
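
As a concrete illustration (numbers chosen here only for scale): with $|A| = 5$ and $H = 20$, a uniformly logged trajectory matches a deterministic $\pi$ at every step with probability $5^{-20} \approx 10^{-14}$, and each matched trajectory then carries the huge weight $|A|^H = 5^{20} \approx 10^{14}$. This exponential dependence on the horizon is why trajectory-wise IS needs so much data in MDPs.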
