Application to contextual bandits (CB)
- The data point is a tuple $(x, a, r)$
- The function of interest is $(x, a, r) \mapsto r$
- The distribution of interest is $x \sim d_0$, $a \sim \pi$, $r \sim R(x, a)$
- Let the joint density be $p(x, a, r)$
- The data distribution is $x \sim d_0$, $a \sim \pi_b$, $r \sim R(x, a)$
- Let the joint density be $q(x, a, r)$
- IS estimator: $\frac{p(x, a, r)}{q(x, a, r)} \cdot r$
- Write down the densities
- $p(x, a, r) = d_0(x) \cdot \pi(a \mid x) \cdot R(r \mid x, a)$
- $q(x, a, r) = d_0(x) \cdot \pi_b(a \mid x) \cdot R(r \mid x, a)$
- To compute the importance weight, you don’t need knowledge of $d_0$ or $R$! You just need $\pi_b$ (or even just $\pi_b(a \mid x)$, the “proposal probability”)
- Let $\rho$ be a shorthand for $\frac{\pi(a \mid x)}{\pi_b(a \mid x)}$, so the estimator is $\rho \cdot r$
- $\pi_b$ needs to “cover” $\pi$
- i.e., whenever $\pi(a \mid x) > 0$, we need $\pi_b(a \mid x) > 0$
- A special case:
- $\pi$ is deterministic, and $\pi_b$ is uniformly random ($\pi_b(a \mid x) \equiv 1/|A|$)
- The estimator becomes $\frac{\mathbb{I}[a = \pi(x)]}{1/|A|} \cdot r = |A| \cdot \mathbb{I}[a = \pi(x)] \cdot r$
- only look at actions that match what $\pi$ wants to take and discard the other data points
- If they match, $\rho = |A|$; if they mismatch, $\rho = 0$
- On average, only a $1/|A|$ fraction of the data is useful (a numerical sketch follows below)
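The following is a minimal numerical sketch of this setup, not anything prescribed by the notes: the toy sizes, the uniform context distribution, and the Bernoulli reward model are all assumptions made for illustration. It logs data under a uniformly random $\pi_b$, forms the IS estimate $\frac{1}{n}\sum_i \rho^{(i)} r^{(i)}$ for a deterministic $\pi$, and checks it against the true value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy contextual bandit; all sizes and the Bernoulli reward model are illustrative assumptions.
n_contexts, n_actions, n = 5, 4, 100_000

R_bar = rng.uniform(size=(n_contexts, n_actions))   # mean rewards; r ~ Bernoulli(R_bar[x, a])
pi = rng.integers(n_actions, size=n_contexts)       # deterministic target policy pi(x)

# Logged data: x ~ d_0 (uniform here), a ~ pi_b (uniformly random), r ~ R(x, a).
x = rng.integers(n_contexts, size=n)
a = rng.integers(n_actions, size=n)
r = rng.binomial(1, R_bar[x, a])

# Importance weight rho = pi(a | x) / pi_b(a | x) = |A| * I[a == pi(x)].
rho = n_actions * (a == pi[x])

is_estimate = np.mean(rho * r)                          # (1/n) sum_i rho_i * r_i
true_value = np.mean(R_bar[np.arange(n_contexts), pi])  # exact J(pi) under uniform d_0

print(f"IS estimate: {is_estimate:.4f}   true value: {true_value:.4f}")
print(f"fraction of data with nonzero weight: {np.mean(rho > 0):.3f} (1/|A| = {1/n_actions:.3f})")
```

As the last print shows, only about a $1/|A|$ fraction of the logged samples receives a nonzero weight, matching the observation above.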
A note about using IS
- We know that shifting rewards by a constant does not matter (for planning purposes) in fixed-horizon problems
- However, when you apply IS, shifting rewards does impact the variance of the estimator
- Special case:
- deterministic $\pi$, uniformly random $\pi_b$
- the reward is deterministic and constant: regardless of $(x, a)$, the reward is always 1 (without any randomness)
- We know the value of any policy is 1
- On-policy MC has zero variance
- IS still has high variance!
- Where does the variance come from?
$$\frac{1}{n} \sum_{i=1}^{n} \frac{\mathbb{I}[a^{(i)} = \pi(x^{(i)})]}{1/|A|} \cdot r^{(i)} \;=\; \sum_{i=1}^{n} \frac{\mathbb{I}[a^{(i)} = \pi(x^{(i)})]}{n/|A|} \cdot r^{(i)} \;=\; \frac{1}{n/|A|} \sum_{i:\, a^{(i)} = \pi(x^{(i)})} r^{(i)}$$
Because $n/|A|$ is only the expected number of samples for which $a^{(i)}$ matches $\pi(x^{(i)})$, not the actual number of matched samples, it introduces variance.
Solution: use the actual number of matched samples as the denominator:
$$\frac{1}{|\{i : a^{(i)} = \pi(x^{(i)})\}|} \sum_{i:\, a^{(i)} = \pi(x^{(i)})} r^{(i)}$$
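To see the difference concretely, here is a small simulation of the constant-reward special case above (the sizes and the convention that $\pi$ always picks action 0 are arbitrary illustrative choices). Vanilla IS divides by the expected match count $n/|A|$ and fluctuates around 1, while dividing by the actual match count returns exactly 1 every time.

```python
import numpy as np

rng = np.random.default_rng(1)

# Constant-reward special case; sizes and the choice pi(x) = action 0 are arbitrary.
n_actions, n, n_trials = 4, 1_000, 2_000

vanilla, matched = [], []
for _ in range(n_trials):
    a = rng.integers(n_actions, size=n)   # a ~ pi_b, uniformly random (contexts don't matter here)
    match = (a == 0)                      # I[a == pi(x)] with pi(x) = 0 for every x
    r = np.ones(n)                        # reward is always 1, so J(pi) = 1 for any pi

    vanilla.append(np.sum(n_actions * match * r) / n)     # divide by expected matches n/|A|
    if match.any():
        matched.append(np.sum(match * r) / match.sum())   # divide by actual matches

print(f"vanilla IS:        mean={np.mean(vanilla):.4f}, std={np.std(vanilla):.4f}")
print(f"match-normalized:  mean={np.mean(matched):.4f}, std={np.std(matched):.4f}")
```

In this special case the second estimator coincides with weighted (self-normalized) importance sampling, $\sum_i \rho^{(i)} r^{(i)} / \sum_i \rho^{(i)}$, which trades a small bias for a large reduction in variance.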
Multi-step IS in MDPs
- Data: trajectories starting from $s_1 \sim d_0$ using $\pi_b$ (i.e., $a_t \sim \pi_b(s_t)$)
(for simplicity, assume the process terminates in $H$ time steps)
$$\{(s_1^{(i)}, a_1^{(i)}, r_1^{(i)}, s_2^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}, r_H^{(i)})\}_{i=1}^{n}$$
- Want to estimate $J(\pi) := \mathbb{E}_{s \sim d_0}[V^\pi(s)]$
- Same idea as in bandit: apply IS to the entire trajectory
- Define $\tau$ as the whole trajectory. The function of interest is $\tau \mapsto \sum_{t=1}^{H} \gamma^{t-1} r_t$.
- Let the distribution over trajectories induced by $\pi$ be $p(\tau)$
- Let the distribution over trajectories induced by $\pi_b$ be $q(\tau)$
- IS estimator: $\frac{p(\tau)}{q(\tau)} \cdot \sum_{t=1}^{H} \gamma^{t-1} r_t$
How to compute $p(\tau)/q(\tau)$?
- $p(\tau) = d_0(s_1) \cdot \pi(a_1 \mid s_1) \cdot P(s_2 \mid s_1, a_1) \cdot \pi(a_2 \mid s_2) \cdots P(s_H \mid s_{H-1}, a_{H-1}) \cdot \pi(a_H \mid s_H)$
- $q(\tau) = d_0(s_1) \cdot \pi_b(a_1 \mid s_1) \cdot P(s_2 \mid s_1, a_1) \cdot \pi_b(a_2 \mid s_2) \cdots P(s_H \mid s_{H-1}, a_{H-1}) \cdot \pi_b(a_H \mid s_H)$
Here all $P(\cdot \mid \cdot)$ terms (and $d_0(s_1)$) cancel out.
Let $\rho_t = \frac{\pi(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}$; then $\frac{p(\tau)}{q(\tau)} = \prod_{t=1}^{H} \rho_t =: \rho_{1:H}$
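As a sketch of how this is used in practice (the function names and the trajectory-as-a-list-of-tuples format are assumptions for illustration, not an interface from the notes), the per-trajectory weight $\rho_{1:H}$ and the IS estimate can be computed from logged data using only the two policies' action probabilities; $d_0$ and $P$ never appear:

```python
import numpy as np

def trajectory_is_estimate(trajectory, pi, pi_b, gamma=1.0):
    """Per-trajectory IS estimate: rho_{1:H} * sum_t gamma^{t-1} r_t.

    trajectory: list of (s_t, a_t, r_t) tuples logged under pi_b.
    pi, pi_b:   functions mapping (a, s) -> probability of taking a in s.
    Only the policies' probabilities are needed; d_0 and P cancel in p(tau)/q(tau).
    """
    rho, ret = 1.0, 0.0
    for t, (s, a, r) in enumerate(trajectory):
        rho *= pi(a, s) / pi_b(a, s)   # per-step ratio rho_t
        ret += gamma**t * r            # gamma^{t-1} under the notes' 1-based indexing
    return rho * ret

def off_policy_value_estimate(trajectories, pi, pi_b, gamma=1.0):
    """Plain IS estimate of J(pi): average of per-trajectory estimates."""
    return np.mean([trajectory_is_estimate(tau, pi, pi_b, gamma) for tau in trajectories])

# Hypothetical usage: deterministic pi that always takes action 0, uniform pi_b over 4 actions.
if __name__ == "__main__":
    pi = lambda a, s: 1.0 if a == 0 else 0.0
    pi_b = lambda a, s: 1.0 / 4
    logged = [[(0, 0, 1.0), (1, 0, 0.5)], [(0, 3, 1.0), (2, 1, 0.5)]]
    print(off_policy_value_estimate(logged, pi, pi_b, gamma=0.9))
```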
Examine the special case again
- $\pi$ is deterministic, and $\pi_b$ is uniformly random ($\pi_b(a \mid s) \equiv 1/|A|$)
- $\rho_t = \frac{\mathbb{I}[a_t = \pi(s_t)]}{1/|A|} = |A| \cdot \mathbb{I}[a_t = \pi(s_t)]$
- only look at trajectories where all actions happen to match what $\pi$ wants to take
- If all actions match, $\rho_{1:H} = |A|^H$; any mismatch gives $\rho_{1:H} = 0$
- On average, only a $1/|A|^H$ fraction of the data is useful
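A self-contained simulation of this special case (the toy MDP's sizes, Dirichlet-sampled dynamics, and uniform initial distribution are all illustrative assumptions) shows the effect: only about a $1/|A|^H$ fraction of trajectories gets the nonzero weight $|A|^H$, and the estimator's variance comes from how few trajectories survive.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy tabular MDP; all sizes and dynamics below are illustrative assumptions.
n_states, n_actions, H, n, gamma = 3, 2, 5, 50_000, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state dist.
R_bar = rng.uniform(size=(n_states, n_actions))                   # deterministic rewards
pi = rng.integers(n_actions, size=n_states)                       # deterministic target policy

weights = np.zeros(n)   # rho_{1:H} per trajectory
returns = np.zeros(n)   # sum_t gamma^{t-1} r_t per trajectory
for i in range(n):
    s = rng.integers(n_states)                 # s_1 ~ d_0 (uniform here)
    rho, ret = 1.0, 0.0
    for t in range(H):
        a = rng.integers(n_actions)            # a_t ~ pi_b(s_t), uniformly random
        rho *= n_actions * (a == pi[s])        # rho_t = |A| * I[a_t == pi(s_t)]
        ret += gamma**t * R_bar[s, a]
        s = rng.choice(n_states, p=P[s, a])    # s_{t+1} ~ P(. | s_t, a_t)
    weights[i], returns[i] = rho, ret

print(f"trajectory-IS estimate of J(pi): {np.mean(weights * returns):.4f}")
print(f"fraction of trajectories with nonzero weight: {np.mean(weights > 0):.4f} "
      f"(1/|A|^H = {1 / n_actions**H:.4f})")
```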