Reinforcement Learning (14)

Value Prediction with Function Approximation

tabular representation vs. function approximation
function approximation can handle an infinite state space (where we cannot enumerate all states).
linear function approximation
design features $\phi(s) \in \mathbb{R}^d$ (“featurizing states”), and approximate $V^\pi(s) \approx \theta^{\top} \phi(s) + b$, where the same $\theta$ is shared across all states (in the following parts, $b$ is ignored because it can be recovered by appending a constant $1$ to the feature vector).

> The tabular value function can be interpreted as linear approximation with feature vectors $\in \mathbb{R}^{|S|}$ of the form $[0,\cdots, 0, 1, 0, \cdots, 0]$, where the position of the $1$ indicates the state.
{: .prompt-info }
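
As a quick sanity check, here is a minimal NumPy sketch (the helper names are illustrative, not from the lecture) showing that linear approximation with one-hot features is exactly a table lookup:

```python
import numpy as np

def one_hot(s, num_states):
    """One-hot feature map: phi(s) has a 1 at index s and 0 elsewhere."""
    phi = np.zeros(num_states)
    phi[s] = 1.0
    return phi

def v_linear(theta, phi_s):
    """Linear value estimate V_theta(s) = theta^T phi(s)."""
    return theta @ phi_s

# With one-hot features, theta itself plays the role of the value table V(s):
num_states = 5
theta = np.array([0.1, -0.3, 0.7, 0.0, 1.2])
assert np.isclose(v_linear(theta, one_hot(2, num_states)), theta[2])
```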

Example: Tetris Game

(Figure: a Tetris game board used as the running example.)

The state space is exponentially large: each cell can be either occupied or unoccupied.

Featurize: the number of blocks in each column. In the example, the feature vector is $(4\;4\;5\;4\;3\;3\;3\;\cdots)$.
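
A minimal sketch of this featurization, assuming the board is encoded as a 0/1 array (1 = occupied); the board below is illustrative, not the one from the figure:

```python
import numpy as np

def column_block_counts(board):
    """Feature from the lecture example: number of occupied cells in each column.

    `board` is a (rows, cols) array with 1 = occupied, 0 = empty.
    """
    return board.sum(axis=0)

# A small illustrative board.
board = np.array([
    [0, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
])
print(column_block_counts(board))  # [4 3 5 4]
```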

Monte-Carlo Value Prediction

$$V^\pi(s)=\mathbb{E}[G \mid s]=\argmin_{f:S\to \mathbb{R}} \mathbb{E}\left[(f(s)-G)^2\right]$$

This is a regression problem.

> Why is the expectation the argmin? See here.
{: .prompt-tip }
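
For completeness, a short version of that argument: for a fixed state $s$,

$$\mathbb{E}\big[(f(s)-G)^2 \mid s\big] = \big(f(s)-\mathbb{E}[G \mid s]\big)^2 + \operatorname{Var}(G \mid s),$$

and only the first term depends on $f$, so the squared error is minimized by $f(s)=\mathbb{E}[G \mid s]=V^\pi(s)$.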

The same idea applies to non-linear value function approximation.

More generally and abstractly, think of function approximation as searching over a restricted function space: a set whose members are functions mapping states to real values.

E.g., the function space for linear value function approximation:

$$\mathscr{F}=\left\{V_\theta: \theta \in \mathbb{R}^{d}\right\}, \text{ where } V_\theta(s)=\theta^{\top} \phi(s)$$

- typically only a small subset of all possible functions
- Using “all possible functions” = tabular!
- Equivalently, tabular MC value prediction can be recovered by choosing $\phi$ as the one-hot indicator features $\phi(s)=\left\{\mathbb{I}\left[s=s^{\prime}\right]\right\}_{s^{\prime} \in S}$

Find the best function in $\mathscr{F}$ by minimizing the empirical squared error:

$$\min_{V_\theta \in \mathscr{F}} \frac{1}{n} \sum_{i=1}^n\left(V_\theta\left(s_i\right)-G_i\right)^2$$

SGD: uniformly sample $i$ and update

$$\theta \leftarrow \theta-\alpha \cdot\left(V_\theta\left(s_i\right)-G_i\right) \cdot \nabla V_\theta\left(s_i\right)$$
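
A minimal NumPy sketch of this procedure, assuming a dataset of $(s_i, G_i)$ pairs and a feature map `phi` (the names and data format are assumptions for illustration):

```python
import numpy as np

def mc_value_prediction(data, phi, dim, alpha=0.01, num_iters=10_000, seed=0):
    """Monte-Carlo value prediction with linear function approximation via SGD.

    data: list of (s_i, G_i) pairs, i.e. states paired with their observed returns.
    phi:  feature map, phi(s) -> np.ndarray of shape (dim,).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(num_iters):
        s, G = data[rng.integers(len(data))]   # uniformly sample i
        v = theta @ phi(s)                     # V_theta(s_i)
        # theta <- theta - alpha * (V_theta(s_i) - G_i) * grad V_theta(s_i),
        # where for a linear model grad V_theta(s_i) = phi(s_i).
        theta -= alpha * (v - G) * phi(s)
    return theta
```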

Interpreting TD(0) with Linear Approximation

TD(0) iteration is equivalent to

$$\theta \leftarrow \theta+\alpha\left(G_t-\phi\left(s_t\right)^{\top} \theta\right) \phi\left(s_t\right).$$

Here $G_t$ denotes the TD target $r_t+\gamma V(s_{t+1})$, $\theta$ is the tabular value function, and $\phi(s_t)$ is the one-hot vector $[0,\cdots, 0, 1, 0, \cdots, 0]$ mentioned above.

TD(0) with Linear Approximation

In TD(0), we do

$$V\left(s_t\right) \leftarrow V\left(s_t\right)+\alpha\left(r_t+\gamma V\left(s_{t+1}\right)-V\left(s_t\right)\right),$$

which, applied over all steps $t$, corresponds to

$$V_{k+1} \leftarrow \mathcal{T}^\pi V_k.$$

i.e.

$$V_{k+1}(s)=\mathbb{E}_\pi\left[r+\gamma V_k\left(s^{\prime}\right) \mid s\right]$$

As in the MC case above, rewriting the conditional expectation as a regression problem,

$$\begin{gathered} V_{k+1}(s)= \argmin_{f:S\to \mathbb{R}} \mathbb{E}_\pi\left[ \big(f(s)-(r+\gamma V_k(s^{\prime}))\big)^2\right] \\ \approx \argmin_{V_\theta\in \mathscr{F}} \frac{1}{n} \sum_{i=1}^n\big(V_\theta(s_i)-r_i-\gamma V_k(s_i^{\prime})\big)^2 . \end{gathered}$$

The corresponding SGD step is

$$\theta \leftarrow \theta-\alpha\left(V_\theta\left(s_t\right)-r_t-\gamma V_k\left(s_{t+1}\right)\right) \nabla V_\theta\left(s_t\right)$$
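
A minimal NumPy sketch of this iteration, assuming sampled transitions `(s, r, s_next, done)`, a feature map `phi`, and a discount `gamma` (all names and the data format are assumptions). The targets use a frozen copy of the parameters, playing the role of $V_k$ above:

```python
import numpy as np

def td0_linear(transitions, phi, dim, gamma=0.99, alpha=0.01, num_outer=50, seed=0):
    """TD(0) value prediction with linear function approximation.

    transitions: list of (s, r, s_next, done) tuples collected by running the policy.
    phi:         feature map, phi(s) -> np.ndarray of shape (dim,).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    for _ in range(num_outer):
        theta_k = theta.copy()                      # freeze V_k for the regression targets
        for _ in range(len(transitions)):
            s, r, s_next, done = transitions[rng.integers(len(transitions))]
            # TD target: r_t + gamma * V_k(s_{t+1}); zero bootstrap at terminal states.
            target = r + (0.0 if done else gamma * (theta_k @ phi(s_next)))
            v = theta @ phi(s)                      # V_theta(s_t)
            theta -= alpha * (v - target) * phi(s)  # SGD step on the squared error
    return theta
```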

