Reinforcement Learning (1)

example: Shortest Path


Shortest Path

  • nodes: states
  • edges: actions

Greedy (always taking the cheapest outgoing edge) is not optimal.

Bellman Equation (Dynamic Programming):

$$V^\star(d) = \min\{3 + V^\star(g),\, 2 + V^\star(f)\}$$
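To make the recursion concrete, here is a minimal dynamic-programming sketch in Python. Only the edges out of d (costs 3 and 2, matching the equation above) come from the example; the remaining nodes, costs, and the goal node G are assumptions made so the code runs end to end.

```python
# A minimal sketch of the shortest-path Bellman recursion on a hypothetical graph.
# Only d -> g (cost 3) and d -> f (cost 2) are taken from the equation above;
# everything else is invented for illustration.
graph = {
    "d": {"g": 3, "f": 2},
    "g": {"G": 2},
    "f": {"G": 1},
    "G": {},          # goal node
}

def shortest_path_value(node, graph, goal="G", memo=None):
    """V*(s) = 0 at the goal, else min over outgoing edges of cost + V*(next)."""
    if memo is None:
        memo = {}
    if node == goal:
        return 0.0
    if node not in memo:
        memo[node] = min(cost + shortest_path_value(nxt, graph, goal, memo)
                         for nxt, cost in graph[node].items())
    return memo[node]

print(shortest_path_value("d", graph))  # min(3 + V*(g), 2 + V*(f)) = min(5, 3) = 3
```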

Stochastic Shortest Path

Markov Decision Process (MDP)



Bellman Equation

$$V^\star(c) = \min\{4 + 0.7 \times V^\star(d) + 0.3 \times V^\star(e),\, 2 + V^\star(e)\}$$

optimal policy: $\pi^\star$
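A minimal value-iteration sketch for the stochastic case. Only the two actions at c (cost 4 with a 0.7/0.3 split between d and e, or cost 2 straight to e) are taken from the equation above; every other node, cost, and transition is made up so the code runs. The optimal policy $\pi^\star$ is then just the minimizing action at each state.

```python
# Value iteration on a hypothetical stochastic shortest-path MDP.
# Each action is (cost, {next_state: probability}); only c's two actions come
# from the equation above, the rest is invented for illustration.
mdp = {
    "c": [(4, {"d": 0.7, "e": 0.3}), (2, {"e": 1.0})],
    "d": [(1, {"G": 1.0})],
    "e": [(3, {"G": 1.0})],
    "G": [],  # goal state: V*(G) = 0
}

def value_iteration(mdp, iters=100):
    V = {s: 0.0 for s in mdp}
    for _ in range(iters):
        for s, actions in mdp.items():
            if actions:  # goal states keep V = 0
                V[s] = min(cost + sum(p * V[s2] for s2, p in nxt.items())
                           for cost, nxt in actions)
    return V

def greedy_policy(mdp, V):
    """pi*(s): index of the action minimizing cost + expected V* of the successor."""
    return {s: min(range(len(acts)),
                   key=lambda i: acts[i][0]
                   + sum(p * V[s2] for s2, p in acts[i][1].items()))
            for s, acts in mdp.items() if acts}

V = value_iteration(mdp)
print(V["c"])                 # min(4 + 0.7*V*(d) + 0.3*V*(e), 2 + V*(e)) = min(5.6, 5) = 5
print(greedy_policy(mdp, V))  # at c, the second action (index 1) is optimal
```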

Model-based RL

The underlying graph (states and transitions) is unknown.
Learn it by trial and error.


a trajectory: $s_0 \to c \to e \to F \to G$

Need to recover the graph by collecting multiple trajectories.

Use empirical frequencies to estimate the transition probabilities.

Assume states & actions are visited roughly uniformly, so every transition is estimated from enough samples.
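A minimal sketch of the empirical-frequency estimate. The trajectory format (state, action, next-state triples) and the example trajectories are assumptions; only the idea of counting observed transitions comes from the notes.

```python
from collections import Counter, defaultdict

# Estimate transition probabilities by empirical frequency from trajectories
# recorded as (state, action, next_state) triples. The data below is made up.
trajectories = [
    [("s0", "a1", "c"), ("c", "a1", "e"), ("e", "a2", "F"), ("F", "a1", "G")],
    [("s0", "a1", "c"), ("c", "a1", "d"), ("d", "a2", "G")],
]

counts = defaultdict(Counter)          # (state, action) -> Counter of next states
for traj in trajectories:
    for s, a, s_next in traj:
        counts[(s, a)][s_next] += 1

# P_hat(s' | s, a) = count(s, a, s') / count(s, a)
P_hat = {
    (s, a): {s_next: n / sum(c.values()) for s_next, n in c.items()}
    for (s, a), c in counts.items()
}
print(P_hat[("c", "a1")])   # {'e': 0.5, 'd': 0.5}
```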

Exploration problem

Random exploration can be inefficient:


example: video game

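A made-up numerical illustration (not the lecture's video-game example): in a length-H "combination lock" where only one of two actions per step keeps you on the path to the reward, uniformly random exploration reaches the goal with probability about $2^{-H}$.

```python
import random

# Hypothetical combination-lock environment: the agent must pick the one
# correct action (out of two) at every one of H steps to reach the reward.
def random_exploration_hits_goal(H, n_episodes=100_000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_episodes):
        if all(rng.random() < 0.5 for _ in range(H)):  # right action H times in a row
            hits += 1
    return hits / n_episodes

for H in (5, 10, 20):
    print(H, random_exploration_hits_goal(H))  # roughly 2**-H; essentially never for H = 20
```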

Objective: maximize the expected cumulative reward

$$\mathbb{E}\left[\sum_{t=1}^{\infty} r_t \mid \pi\right] \quad \text{or} \quad \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid \pi\right]$$
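A quick sanity check of the two objectives on a hypothetical finite reward sequence (the rewards and $\gamma$ below are made up; the infinite sums are truncated at the episode length).

```python
# Undiscounted vs. discounted return on a made-up reward sequence.
rewards = [0.0, 1.0, 0.0, 2.0, 1.0]   # hypothetical r_1, ..., r_5
gamma = 0.9

undiscounted = sum(rewards)
discounted = sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))
print(undiscounted, discounted)   # 4.0 and 0.9 + 2*0.9**3 + 0.9**4 ≈ 3.01
```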

Problem: the graph is too large

There are states that the RL model has never seen, so it needs to generalize across states.
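One standard way to get that generalization (an illustrative choice here, not something the notes specify) is to parameterize the value function by state features instead of keeping one table entry per state, so states never seen before still get a value estimate.

```python
import numpy as np

# Generalization via function approximation (illustrative assumption): the value
# is a function of state features, not a per-state table entry.
def features(state):
    """Hypothetical feature map: a 2-D grid position plus a bias term."""
    x, y = state
    return np.array([x, y, 1.0])

w = np.array([0.5, -0.2, 1.0])          # learned weights (made-up values)

def value(state, w):
    return w @ features(state)          # linear value function V_w(s)

print(value((3, 4), w))                 # works even if (3, 4) was never visited
```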

Contextual bandits

  • Even if the algorithm is good, taking bad actions at the beginning produces bad data.
  • If the learner keeps taking bad actions (e.g. guessing the wrong label in image classification), it never finds out what the right action was.
    • Compare with supervised learning, where the correct label is always revealed.
  • Multi-armed bandit (see the sketch below)
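A minimal $\varepsilon$-greedy sketch for the multi-armed bandit mentioned above. The algorithm choice and the Bernoulli arm means are illustrative assumptions, but they show the explore/exploit trade-off from the bullets: always exploiting an early bad estimate would lock in a bad arm.

```python
import random

# Epsilon-greedy on a Bernoulli multi-armed bandit (arm means are made up).
def epsilon_greedy_bandit(arm_means, T=10_000, eps=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(arm_means)
    values = [0.0] * len(arm_means)      # running average reward per arm
    total = 0.0
    for _ in range(T):
        if rng.random() < eps:                       # explore uniformly at random
            a = rng.randrange(len(arm_means))
        else:                                        # exploit the current best estimate
            a = max(range(len(arm_means)), key=lambda i: values[i])
        r = 1.0 if rng.random() < arm_means[a] else 0.0   # Bernoulli reward
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]     # incremental mean update
        total += r
    return total / T, values

avg_reward, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(avg_reward, estimates)   # average reward nears the best arm's mean, minus exploration cost
```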

RL steps

For round t = 1, 2, …,

  • For time step h=1, 2, …, H, the learner
    • Observes $x_h^{(t)}$
    • Chooses $a_h^{(t)}$
    • Receives $r_h^{(t)} \sim R(x_h^{(t)}, a_h^{(t)})$
    • Next $x_{h+1}^{(t)}$ is generated as a function of $x_h^{(t)}$ and $a_h^{(t)}$
      (or sometimes, all previous x’s and a’s within round t)
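A minimal sketch of this interaction protocol as code. The `env` and `policy` objects and their `reset` / `step` / `act` / `update` methods are hypothetical stand-ins, not any particular library's API.

```python
# One round of the protocol above: H steps of observe x_h, choose a_h, receive r_h.
def run_round(env, policy, H):
    x = env.reset()                      # x_1
    trajectory = []
    for h in range(1, H + 1):
        a = policy.act(x)                # choose a_h after observing x_h
        r, x_next = env.step(x, a)       # r_h ~ R(x_h, a_h); x_{h+1} depends on (x_h, a_h)
        trajectory.append((x, a, r))
        x = x_next
    return trajectory

# for t in range(1, T + 1):              # rounds t = 1, 2, ...
#     data = run_round(env, policy, H)
#     policy.update(data)                # learner improves between rounds
```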
