Reinforcement Learning (1)

example: Shortest Path


Shortest Path

  • nodes: states
  • edges: actions

Greedy (always taking the cheapest outgoing edge) is not optimal.

Bellman Equation (Dynamic Programming):

$$V^\star(d) = \min\{3 + V^\star(g),\, 2 + V^\star(f)\}$$
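To make the recursion concrete, here is a minimal dynamic-programming sketch in Python. Only the edges out of d (costs 3 and 2, matching the equation above) come from the example; the remaining nodes, costs, and the goal node G are assumptions made so the code runs end to end.

```python
# A minimal sketch of the shortest-path Bellman recursion on a hypothetical graph.
# Only d -> g (cost 3) and d -> f (cost 2) are taken from the equation above;
# everything else is invented for illustration.
graph = {
    "d": {"g": 3, "f": 2},
    "g": {"G": 2},
    "f": {"G": 1},
    "G": {},          # goal node
}

def shortest_path_value(node, graph, goal="G", memo=None):
    """V*(s) = 0 at the goal, else min over outgoing edges of cost + V*(next)."""
    if memo is None:
        memo = {}
    if node == goal:
        return 0.0
    if node not in memo:
        memo[node] = min(cost + shortest_path_value(nxt, graph, goal, memo)
                         for nxt, cost in graph[node].items())
    return memo[node]

print(shortest_path_value("d", graph))  # min(3 + V*(g), 2 + V*(f)) = min(5, 3) = 3
```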

Stochastic Shortest Path

Markov Decision Process (MDP)



Bellman Equation

$$V^\star(c) = \min\{4 + 0.7 \times V^\star(d) + 0.3 \times V^\star(e),\, 2 + V^\star(e)\}$$

optimal policy: $\pi^\star$
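A minimal value-iteration sketch for the stochastic case. Only the two actions at c (cost 4 with a 0.7/0.3 split between d and e, or cost 2 straight to e) are taken from the equation above; every other node, cost, and transition is made up so the code runs. The optimal policy $\pi^\star$ is then just the minimizing action at each state.

```python
# Value iteration on a hypothetical stochastic shortest-path MDP.
# Each action is (cost, {next_state: probability}); only c's two actions come
# from the equation above, the rest is invented for illustration.
mdp = {
    "c": [(4, {"d": 0.7, "e": 0.3}), (2, {"e": 1.0})],
    "d": [(1, {"G": 1.0})],
    "e": [(3, {"G": 1.0})],
    "G": [],  # goal state: V*(G) = 0
}

def value_iteration(mdp, iters=100):
    V = {s: 0.0 for s in mdp}
    for _ in range(iters):
        for s, actions in mdp.items():
            if actions:  # goal states keep V = 0
                V[s] = min(cost + sum(p * V[s2] for s2, p in nxt.items())
                           for cost, nxt in actions)
    return V

def greedy_policy(mdp, V):
    """pi*(s): index of the action minimizing cost + expected V* of the successor."""
    return {s: min(range(len(acts)),
                   key=lambda i: acts[i][0]
                   + sum(p * V[s2] for s2, p in acts[i][1].items()))
            for s, acts in mdp.items() if acts}

V = value_iteration(mdp)
print(V["c"])                 # min(4 + 0.7*V*(d) + 0.3*V*(e), 2 + V*(e)) = min(5.6, 5) = 5
print(greedy_policy(mdp, V))  # at c, the second action (index 1) is optimal
```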

Model-based RL

The underlying graph (states and transitions) is unknown.
Learn it by trial and error.


a trajectory: $s_0 \to c \to e \to F \to G$

Need to recover the graph by collecting multiple trajectories.

Use empirical frequencies to estimate the transition probabilities.

Assume states & actions are visited roughly uniformly, so every transition is estimated from enough samples.
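A minimal sketch of the empirical-frequency estimate. The trajectory format (state, action, next-state triples) and the example trajectories are assumptions; only the idea of counting observed transitions comes from the notes.

```python
from collections import Counter, defaultdict

# Estimate transition probabilities by empirical frequency from trajectories
# recorded as (state, action, next_state) triples. The data below is made up.
trajectories = [
    [("s0", "a1", "c"), ("c", "a1", "e"), ("e", "a2", "F"), ("F", "a1", "G")],
    [("s0", "a1", "c"), ("c", "a1", "d"), ("d", "a2", "G")],
]

counts = defaultdict(Counter)          # (state, action) -> Counter of next states
for traj in trajectories:
    for s, a, s_next in traj:
        counts[(s, a)][s_next] += 1

# P_hat(s' | s, a) = count(s, a, s') / count(s, a)
P_hat = {
    (s, a): {s_next: n / sum(c.values()) for s_next, n in c.items()}
    for (s, a), c in counts.items()
}
print(P_hat[("c", "a1")])   # {'e': 0.5, 'd': 0.5}
```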

Exploration problem

Random exploration can be inefficient:


example: video game

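A made-up numerical illustration (not the lecture's video-game example): in a length-H "combination lock" where only one of two actions per step keeps you on the path to the reward, uniformly random exploration reaches the goal with probability about $2^{-H}$.

```python
import random

# Hypothetical combination-lock environment: the agent must pick the one
# correct action (out of two) at every one of H steps to reach the reward.
def random_exploration_hits_goal(H, n_episodes=100_000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_episodes):
        if all(rng.random() < 0.5 for _ in range(H)):  # right action H times in a row
            hits += 1
    return hits / n_episodes

for H in (5, 10, 20):
    print(H, random_exploration_hits_goal(H))  # roughly 2**-H; essentially never for H = 20
```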

Objective: maximize the expected cumulative reward

$$\mathbb{E}\left[\sum_{t=1}^{\infty} r_t \mid \pi\right] \quad \text{or} \quad \mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid \pi\right]$$
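A quick sanity check of the two objectives on a hypothetical finite reward sequence (the rewards and $\gamma$ below are made up; the infinite sums are truncated at the episode length).

```python
# Undiscounted vs. discounted return on a made-up reward sequence.
rewards = [0.0, 1.0, 0.0, 2.0, 1.0]   # hypothetical r_1, ..., r_5
gamma = 0.9

undiscounted = sum(rewards)
discounted = sum(gamma ** (t - 1) * r for t, r in enumerate(rewards, start=1))
print(undiscounted, discounted)   # 4.0 and 0.9 + 2*0.9**3 + 0.9**4 ≈ 3.01
```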

Problem: the graph is too large

There are states that the RL model has never seen, so it needs to generalize across states.
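One standard way to get that generalization (an illustrative choice here, not something the notes specify) is to parameterize the value function by state features instead of keeping one table entry per state, so states never seen before still get a value estimate.

```python
import numpy as np

# Generalization via function approximation (illustrative assumption): the value
# is a function of state features, not a per-state table entry.
def features(state):
    """Hypothetical feature map: a 2-D grid position plus a bias term."""
    x, y = state
    return np.array([x, y, 1.0])

w = np.array([0.5, -0.2, 1.0])          # learned weights (made-up values)

def value(state, w):
    return w @ features(state)          # linear value function V_w(s)

print(value((3, 4), w))                 # works even if (3, 4) was never visited
```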

Contextual bandits

  • Even if the algorithm is good, taking bad actions at the beginning produces bad data.
  • If the learner keeps taking bad actions (e.g. guessing the wrong label in image classification), it never finds out what the right action was.
    • Compare with supervised learning, where the correct label is always revealed.
  • Multi-armed bandit (see the sketch below)
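A minimal $\varepsilon$-greedy sketch for the multi-armed bandit mentioned above. The algorithm choice and the Bernoulli arm means are illustrative assumptions, but they show the explore/exploit trade-off from the bullets: always exploiting an early bad estimate would lock in a bad arm.

```python
import random

# Epsilon-greedy on a Bernoulli multi-armed bandit (arm means are made up).
def epsilon_greedy_bandit(arm_means, T=10_000, eps=0.1, seed=0):
    rng = random.Random(seed)
    counts = [0] * len(arm_means)
    values = [0.0] * len(arm_means)      # running average reward per arm
    total = 0.0
    for _ in range(T):
        if rng.random() < eps:                       # explore uniformly at random
            a = rng.randrange(len(arm_means))
        else:                                        # exploit the current best estimate
            a = max(range(len(arm_means)), key=lambda i: values[i])
        r = 1.0 if rng.random() < arm_means[a] else 0.0   # Bernoulli reward
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]     # incremental mean update
        total += r
    return total / T, values

avg_reward, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.8])
print(avg_reward, estimates)   # average reward nears the best arm's mean, minus exploration cost
```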

RL steps

For round t = 1, 2, …,

  • For time step h=1, 2, …, H, the learner
    • Observes $x_h^{(t)}$
    • Chooses $a_h^{(t)}$
    • Receives $r_h^{(t)} \sim R(x_h^{(t)}, a_h^{(t)})$
    • Next $x_{h+1}^{(t)}$ is generated as a function of $x_h^{(t)}$ and $a_h^{(t)}$
      (or sometimes, all previous x’s and a’s within round t)
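A minimal sketch of this interaction protocol as code. The `env` and `policy` objects and their `reset` / `step` / `act` / `update` methods are hypothetical stand-ins, not any particular library's API.

```python
# One round of the protocol above: H steps of observe x_h, choose a_h, receive r_h.
def run_round(env, policy, H):
    x = env.reset()                      # x_1
    trajectory = []
    for h in range(1, H + 1):
        a = policy.act(x)                # choose a_h after observing x_h
        r, x_next = env.step(x, a)       # r_h ~ R(x_h, a_h); x_{h+1} depends on (x_h, a_h)
        trajectory.append((x, a, r))
        x = x_next
    return trajectory

# for t in range(1, T + 1):              # rounds t = 1, 2, ...
#     data = run_round(env, policy, H)
#     policy.update(data)                # learner improves between rounds
```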
