<aside> 💡 Overview (lecture.pdf) → Covers the MDP, the most fundamental framework in reinforcement learning

* Markov Processes, MRPs, MDPs … → Value function
* State-value function, Action-value function
* Bellman Expectation Equation
* Bellman Optimality Equation

</aside>

Introduction to MDPs

Markov Property

<aside> 💡 Definition

→ The future is independent of the past given the present

$$ \mathbb{P}[S_{t+1}|S_t]=\mathbb{P}[S_{t+1}|S_1,...,S_t] $$

</aside>
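To make the property concrete, here is a rough numerical check (a sketch assuming NumPy; the 3-state matrix `P`, the trajectory length, and the conditioning states are arbitrary choices for illustration): along a sampled trajectory, conditioning on the previous state in addition to the current one does not change the estimated next-state distribution.

```python
import numpy as np

# Hypothetical 3-state chain; the matrix P below is an arbitrary choice for illustration.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.4, 0.4, 0.2],
])

rng = np.random.default_rng(0)

# Sample one long trajectory: the next state is drawn using only the current state.
T = 200_000
states = np.empty(T, dtype=int)
states[0] = 0
for t in range(T - 1):
    states[t + 1] = rng.choice(3, p=P[states[t]])

# Estimate P[S_{t+1} | S_t = 1], with and without also conditioning on S_{t-1} = 0.
prev, cur, nxt = states[:-2], states[1:-1], states[2:]
p_given_cur = np.bincount(nxt[cur == 1], minlength=3) / np.sum(cur == 1)
mask = (cur == 1) & (prev == 0)
p_given_both = np.bincount(nxt[mask], minlength=3) / np.sum(mask)

print(p_given_cur)   # ~ [0.1, 0.6, 0.3]
print(p_given_both)  # ~ the same distribution: the extra history adds nothing
```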

State Transition Matrix

State transition probability

$$ \mathcal{P}_{ss^\prime}=\mathbb{P}[S_{t+1}=s^\prime|S_t=s] $$

State transition matrix

$$ \mathcal{P}= \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n}\\ \vdots & \ddots & \vdots\\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix} $$

(rows are indexed by the current state $s$ ("from"), columns by the successor state $s^\prime$ ("to"), so each row sums to 1)
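As a small numerical sketch of how $\mathcal{P}$ is used (assuming NumPy and an arbitrary 3-state matrix): each row is a distribution over successor states, and multiplying a state distribution on the right by $\mathcal{P}$ propagates it one step forward.

```python
import numpy as np

# Hypothetical 3-state transition matrix (values assumed for illustration);
# rows are "from" states, columns are "to" states.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
    [0.3, 0.0, 0.7],
])

# Every row is a probability distribution over successor states, so each row sums to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# If mu_t is the state distribution at time t (a row vector),
# the distribution one step later is mu_{t+1} = mu_t @ P.
mu0 = np.array([1.0, 0.0, 0.0])           # start in state 0 with probability 1
mu1 = mu0 @ P                             # distribution after one step
mu5 = mu0 @ np.linalg.matrix_power(P, 5)  # distribution after five steps
print(mu1, mu5)
```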

Markov Process

<aside> 💡 Definition → A Markov Process (or Markov Chain) is a tuple $\langle\mathcal{S},\mathcal{P}\rangle$

* $\mathcal{S}$ is a (finite) set of states
* $\mathcal{P}$ is a state transition probability matrix

$$ \mathcal{P}_{ss^\prime}=\mathbb{P}[S_{t+1}=s^\prime|S_t=s] $$

</aside>
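Given $\langle\mathcal{S},\mathcal{P}\rangle$, a Markov process can be simulated directly. The sketch below (assuming NumPy, with made-up state names and probabilities) includes one absorbing state so that sampled episodes terminate.

```python
import numpy as np

# A minimal sketch of a Markov process <S, P>; state names and probabilities
# are made up for illustration and are not taken from the lecture.
S = ["s1", "s2", "s3", "terminal"]
P = np.array([
    [0.0, 0.7, 0.2, 0.1],   # s1 -> s2, s3, or terminal
    [0.3, 0.0, 0.5, 0.2],   # s2 -> s1, s3, or terminal
    [0.0, 0.0, 0.4, 0.6],   # s3 -> s3 or terminal
    [0.0, 0.0, 0.0, 1.0],   # terminal is absorbing
])

def sample_episode(start=0, rng=np.random.default_rng()):
    """Follow P from `start` until the absorbing state is reached."""
    s, episode = start, [start]
    while S[s] != "terminal":
        s = rng.choice(len(S), p=P[s])
        episode.append(s)
    return [S[i] for i in episode]

print(sample_episode())   # e.g. ['s1', 's2', 's3', 'terminal']
```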

Markov Reward Process (MRP)