In this note we formalize the reinforcement learning problem as a Markov Decision Process. The lecture video is here: Lecture 2: Markov Decision Process.

In fact, I was struck by the power of the kind of formalization developed in the West. Perhaps its absence is part of the reason ancient China did not develop mathematics and science in the same way.

# Markov Decision Process

We build the final Markov Decision Process cumulatively. The very basic building block is the well known Markov Process:

A **Markov Process** is a tuple $<S,P>$, where

- $S$ is a finite set of states.

- $P$ is a state transition matrix, where $P_{ss'}=\mathbb{P}(S_{t+1}=s'\mid S_t=s)$

to which we then add *rewards*, to form a Markov Reward Process:

A **Markov Reward Process** is a tuple $<S,P,R,\gamma>$, where

- $S$ is a finite set of states.

- $P$ is a state transition matrix, where $P_{ss'}=\mathbb{P}(S_{t+1}=s'\mid S_t=s)$

- $R$ is a reward function, where $R_s=\mathbb{E}[R_{t+1}\mid S_t=s]$

- $\gamma\in[0,1]$ is a discount factor.

to which we further add *actions*, to form a Markov Decision Process:

A **Markov Decision Process** is a tuple $<S,A,P,R,\gamma>$, where

- $S$ is a finite set of states.

- $A$ is a finite set of actions.

- $P$ is a state transition matrix, where $P_{ss'}^a=\mathbb{P}(S_{t+1}=s'\mid S_t=s, A_t=a)$

- $R$ is a reward function, where $R_s^a$ is the immediate reward received after performing action $a$ in state $s$.

- $\gamma\in[0,1]$ is a discount factor.

The Markov Decision Process is abbreviated as **MDP**.
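
To make the tuple concrete, here is a minimal Python sketch of a tabular representation of the five components; the class name `MDP` and the array layout are my own assumptions for illustration, not anything prescribed by the lecture:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    """Tabular MDP <S, A, P, R, gamma> with n_states states and n_actions actions."""
    n_states: int
    n_actions: int
    P: np.ndarray   # shape (A, S, S): P[a, s, s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
    R: np.ndarray   # shape (A, S):    R[a, s]     = immediate reward R_s^a
    gamma: float    # discount factor in [0, 1]

    def __post_init__(self):
        # Every row of every P[a] must be a probability distribution over next states.
        assert np.allclose(self.P.sum(axis=2), 1.0)
        assert 0.0 <= self.gamma <= 1.0
```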

# Notations and Randomness

Before moving on, we should clarify some facts about the notation and the randomness involved. The whole process is essentially stochastic, and it is described by the following variables and random variables:

- $S_t$, the random variable of agent state at timestamp $t$.
- $A_t$, the random variable of the action taken by the agent at timestamp $t$, distributed according to the policy: $\mathbb{P}(A_t=a\mid S_t=s)=\pi(a\mid s)$.
- $P_{ss'}^a$, the transition probability from state $s$ to state $s'$ after performing action $a$. It is not a random variable.
- $R_s^a$, the immediate reward after performing action $a$ from state $s$. Although it is not a random variable itself, we frequently encounter $R_{S_t}^{A_t}$, $R_{S_t}^{a}$, and $R_{s}^{A_t}$, which are random variables, and we usually abbreviate them to $R_{t+1}$ when the randomness is clear from the context. Note that we write $R_{t+1}$ instead of $R_t$; this is just a notational convention indicating that the feedback from the environment is received at the next timestamp.

Note that $S_t$ and $A_t$ are determined by the agent, while $P_{ss'}^a$ and $R_s^a$ are really a **model** of the environment inside the agent, encapsulating our ignorance of the environment.
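
To see where the randomness enters, here is a hedged sketch of sampling a single step, reusing the tabular `MDP` sketch above: the agent draws $A_t$ from $\pi(\cdot\mid S_t)$, the environment draws $S_{t+1}$ according to $P_{ss'}^a$, and $R_{t+1}=R_{S_t}^{A_t}$ is then fully determined. The function name `step` and the layout `policy[s, a]` $=\pi(a\mid s)$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(mdp, policy, s):
    """Sample (A_t, R_{t+1}, S_{t+1}) given S_t = s.

    policy[s, a] holds pi(a | s); mdp is the tabular MDP sketched above.
    """
    a = rng.choice(mdp.n_actions, p=policy[s])         # A_t ~ pi(. | s): the agent's randomness
    s_next = rng.choice(mdp.n_states, p=mdp.P[a, s])   # S_{t+1} ~ P_{ss'}^a: the environment's randomness
    r = mdp.R[a, s]                                    # R_{t+1} = R_s^a: fixed once (s, a) are known
    return a, r, s_next
```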

Besides, both the agent state and the environment state have the **Markov property** by definition, i.e.

$$\mathbb{P}[S_{t+1}\mid S_t] = \mathbb{P}[S_{t+1}\mid S_t, S_{t-1}, \cdots, S_1]$$

# Value Functions

In the previous post RL Note (1) - Introduction to Reinforcement, we have stated the *Reward Hypothesis*. Here we can express it formally.

The **return** for *a single sample* from timestamp $t$ onwards is

$$G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots = \sum_{k\geqslant 1} \gamma^{k-1} R_{t+k}$$
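
As a quick numerical check of the definition, with a made-up reward sample and a made-up discount factor:

```python
# Discounted return G_t for one sampled reward sequence R_{t+1}, R_{t+2}, ...
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]  # hypothetical sample of R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```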

Note that $G_t$ is a random variable. So the reward hypothesis says that our goal is to maximize the following value:

$$v_{\pi}(s)\triangleq \mathbb{E}[G_t \mid S_t = s]$$

The function $v_{\pi}(s)$ is called the **state value function**. Note that the sequence is implicitly generated according to policy $\pi$, so we index the value function by it.

Don't forget that the reward is gained through performing actions, so besides the state value function we also have an **action value function**, defined as:

$$q_{\pi}(s,a)\triangleq\mathbb{E}[G_t \mid S_t = s, A_t=a]$$
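
Since both value functions are conditional expectations of the return $G_t$, one crude way to estimate $v_{\pi}(s)$ is to average sampled returns. Below is a minimal Monte Carlo sketch that reuses the `step` function above and truncates the infinite sum at a fixed horizon; both choices are my own simplifications, not something taken from the lecture:

```python
import numpy as np

def mc_state_value(mdp, policy, s0, n_episodes=1000, horizon=100):
    """Monte Carlo estimate of v_pi(s0): average sampled returns G_t starting from s0."""
    returns = []
    for _ in range(n_episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):            # truncate the infinite sum at `horizon` steps
            a, r, s = step(mdp, policy, s)  # sample A_t, R_{t+1}, S_{t+1}
            g += discount * r
            discount *= mdp.gamma
        returns.append(g)
    return np.mean(returns)
```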

Rather than caring only about the value function under a particular policy, we actually want to find an optimal policy, one that generates an optimal value function in a suitable sense. So we define the **optimal state value function** as

$$v_*(s)=\max_{\pi}{v_{\pi}(s)}, \forall s$$

and the **optimal action value function** as

$$q_*(s, a)=\max_{\pi}{q_{\pi}(s, a)}, \forall s, a$$

Besides, in order to compare policies, we introduce a partial order on the policy space:

$$\pi\geqslant\pi' \text{ if } v_{\pi}(s)\geqslant v_{\pi'}(s), \forall s$$
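
Note that this order is only *partial*: for two policies it may happen that neither dominates the other at every state. A tiny illustrative check with made-up value vectors:

```python
import numpy as np

v_pi1 = np.array([1.0, 3.0, 2.0])   # hypothetical v_pi(s) for three states
v_pi2 = np.array([1.0, 2.0, 2.5])   # hypothetical v_pi'(s) for the same states

print(bool(np.all(v_pi1 >= v_pi2)))  # False
print(bool(np.all(v_pi2 >= v_pi1)))  # False: the two policies are incomparable
```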

Note that $v_*(s)$ and $q_*(s, a)$ are defined by taking the maximum separately for each $s$, so a priori the optimal value function need not be the value function of any single policy. Thanks to the following theorem, this problem does not arise.

For any Markov Decision Process,

- There exists an optimal policy $\pi_*$, i.e. $\pi_*\geqslant\pi, \forall\pi$

- All optimal policies achieve the optimal value functions, i.e. $v_*=v_{\pi_*}, q_*=q_{\pi_*}$

OK, but here is another question: how do we find an optimal policy? It turns out that once we know the optimal action value function $q_*(s, a)$, we can construct an optimal policy by

$$\pi_*(a\mid s) = \begin{cases}
1, \quad a = \operatorname{argmax}_{a'\in A} q_*(s, a')\\
0, \quad\text{otherwise}
\end{cases}$$

In other words, $\pi_*$ is constructed by deterministically choosing the action with the maximum value.

Note that this is *an* optimal policy, and it is deterministic. There may be other optimal policies that are stochastic. Nevertheless, all optimal policies achieve the optimal value function, which is unique by definition.
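
This construction is easy to write down in code. Here is a minimal sketch of extracting the deterministic greedy policy from a table `q_star[s, a]`; the array name and shape are assumptions made for illustration:

```python
import numpy as np

def greedy_policy(q_star):
    """pi_*(a | s): put probability 1 on argmax_a q_*(s, a) for each state s."""
    n_states, n_actions = q_star.shape
    policy = np.zeros((n_states, n_actions))
    policy[np.arange(n_states), q_star.argmax(axis=1)] = 1.0
    return policy
```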

Given an MDP, i.e. given $S, A, P, R, \gamma$, our goal is to find an optimal policy, or, since it suffices, the optimal action value function. This is what we mean by *solving* an MDP.