RL Note (2.1) - Markov Decision Process

In this note we formalize the reinforcement learning problem as a Markov Decision Process. The lecture video is here: Lecture 2: Markov Decision Process.

In fact, I was struck by the power of the kind of formalization developed in the West. Perhaps its absence is one reason why ancient China did not develop mathematics and science.

Markov Decision Process

We build the final Markov Decision Process cumulatively. The very basic building block is the well known Markov Process:


A Markov Process is a tuple $\langle S,P\rangle$, where
- $S$ is a finite set of states.
- $P$ is a state transition matrix, where $P_{ss'}=\mathbb{P}(S_{t+1}=s'\mid S_t=s)$

to which we then add rewards, to form a Markov Reward Process:


A Markov Reward Process is a tuple $\langle S,P,R,\gamma\rangle$, where
- $S$ is a finite set of states.
- $P$ is a state transition matrix, where $P_{ss'}=\mathbb{P}(S_{t+1}=s'\mid S_t=s)$
- $R$ is a reward function, where $R_s=\mathbb{E}(R_{t+1}\mid S_t=s)$
- $\gamma\in[0,1]$ is a discount factor.

to which we further add actions, to form a Markov Decision Process:


A Markov Decision Process is a tuple $\langle S,A,P,R,\gamma\rangle$, where
- $S$ is a finite set of states.
- $A$ is a finite set of actions.
- $P$ is a state transition matrix, where $P_{ss'}^a=\mathbb{P}(S_{t+1}=s'\mid S_t=s, A_t=a)$
- $R$ is a reward function, where $R_s^a$ is the immediate reward received after performing action $a$ in state $s$.
- $\gamma\in[0,1]$ is a discount factor.

A Markov Decision Process is commonly abbreviated as an MDP.
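
To make the tuple concrete, here is a minimal sketch of how $\langle S,A,P,R,\gamma\rangle$ could be stored in code. The two-state toy MDP below (its states, actions, and all of the numbers) is invented purely for illustration and does not come from the lecture.

```python
# A toy MDP <S, A, P, R, gamma> stored as plain Python containers.
# The states, actions, and numbers below are made up for illustration.

S = ["study", "sleep"]            # finite state set
A = ["work", "rest"]              # finite action set
gamma = 0.9                       # discount factor

# P[s][a][s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
P = {
    "study": {"work": {"study": 0.8, "sleep": 0.2},
              "rest": {"study": 0.1, "sleep": 0.9}},
    "sleep": {"work": {"study": 0.7, "sleep": 0.3},
              "rest": {"study": 0.0, "sleep": 1.0}},
}

# R[s][a] = immediate reward after performing action a in state s
R = {
    "study": {"work": 2.0, "rest": 0.0},
    "sleep": {"work": 1.0, "rest": -1.0},
}
```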

Notations and Randomness

Before moving on, we should clarify some facts about the notation and randomness here. The whole process is essentially stochastic and is described by the following variables and random variables:

  • $S_t$, the random variable of agent state at timestamp $t$.
  • $A_t$, the random variable of the action taken by the agent at timestamp $t$. We have
    $$\mathbb{P}(A_t=a\mid S_t=s)=\pi(a\mid s)$$
  • $P_{ss'}^a$, the transition probability from state $s$ to state $s'$ after performing action $a$. It is a fixed quantity, not a random variable.
  • $R_s^a$, the immediate reward after performing action $a$ in state $s$. Although it is not a random variable itself, we often encounter $R_{S_t}^{A_t}, R_{S_t}^{a}, R_{s}^{A_t}$, which are random variables, and we usually abbreviate them to $R_{t+1}$ when the randomness is clear from the context. Note that we use $R_{t+1}$ instead of $R_t$; this is just a notational convention, indicating that the feedback from the environment is received at the next timestamp.

Note that $S_t$ and $A_t$ are determined by the agent, while $P_{ss'}^a$ and $R_s^a$ constitute the agent's model of the environment, which encapsulates its ignorance of the environment.

Besides, both the agent state and the environment state have the Markov property by definition, i.e.
$$\mathbb{P}[S_{t+1}\mid S_t] = \mathbb{P}[S_{t+1}\mid S_t, S_{t-1}, \cdots, S_1]$$
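
To make the distinction between the random variables and the fixed model quantities concrete, here is a small sampling sketch building on the toy MDP above; the stochastic policy `pi` is made up for illustration. Only $S_t$, $A_t$ and $R_{t+1}$ are sampled, while `P` and `R` stay fixed tables, and the next state depends only on the current state and action, which is exactly the Markov property.

```python
import random

# pi[s][a] = pi(a | s): a made-up stochastic policy over the toy MDP above.
pi = {
    "study": {"work": 0.6, "rest": 0.4},
    "sleep": {"work": 0.5, "rest": 0.5},
}

def sample(dist):
    """Draw one key from a {outcome: probability} dictionary."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point round-off

def rollout(s0, steps=5):
    """Sample (S_t, A_t, R_{t+1}) triples; only S_t and A_t are random draws,
    while P and R are fixed tables of the environment model."""
    s, trajectory = s0, []
    for _ in range(steps):
        a = sample(pi[s])            # A_t ~ pi(. | S_t)
        r = R[s][a]                  # R_{t+1} = R_{S_t}^{A_t}, received at t+1
        s_next = sample(P[s][a])     # S_{t+1} ~ P(. | S_t, A_t): Markov property
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

print(rollout("study"))
```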

Value Functions

In the previous post, RL Note (1) - Introduction to Reinforcement, we stated the Reward Hypothesis. Here we can express it formally.


The return of a single sample sequence from timestamp $t$ onwards is
$$G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots = \sum_{k\geqslant 1} \gamma^{k-1} R_{t+k}$$

Note that it is a random variable. So the reward hypothesis is saying that our goal is to maximize the following value:

$$v_{\pi}(s)\triangleq \mathbb{E}[G_t \mid S_t = s]$$

$v_{\pi}(s)$ is called the state value function. Note that the sequence is implicitly generated according to the policy $\pi$, so we index the value function by it.
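
As a sanity check on the definition, the following sketch (continuing the toy example above) computes the discounted return of a sampled reward sequence and estimates $v_{\pi}(s)$ by averaging returns over many truncated rollouts. This plain Monte Carlo averaging is just one simple way to approximate the expectation, not a method introduced in the lecture, and truncating at a finite horizon is itself an approximation.

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def mc_state_value(s, episodes=10_000, steps=50):
    """Estimate v_pi(s) = E[G_t | S_t = s] by averaging sampled returns."""
    total = 0.0
    for _ in range(episodes):
        rewards = [r for (_, _, r) in rollout(s, steps)]
        total += discounted_return(rewards)
    return total / episodes

print(mc_state_value("study"))
```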

Don’t forget that the reward is gained by performing actions, so besides the state value function, we also have an action value function, defined as:
$$q_{\pi}(s,a)\triangleq\mathbb{E}[G_t \mid S_t = s, A_t=a]$$
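
A similar Monte Carlo sketch estimates $q_{\pi}(s,a)$ by fixing the first action and then following $\pi$; again this is just an illustrative approximation built on the toy example above.

```python
def mc_action_value(s, a, episodes=10_000, steps=50):
    """Estimate q_pi(s, a) = E[G_t | S_t = s, A_t = a] by fixing the first action."""
    total = 0.0
    for _ in range(episodes):
        r0 = R[s][a]                      # first reward with the fixed action a
        s1 = sample(P[s][a])              # first transition under the fixed action
        tail = [r for (_, _, r) in rollout(s1, steps - 1)]  # then follow pi
        total += discounted_return([r0] + tail)
    return total / episodes

print(mc_action_value("study", "work"))
```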

Rather than caring about the value function under a particular policy, we in fact want to find an optimal policy that generates an optimal value function in some sense. So we define the optimal state value function as
$$v_*(s)=\max_{\pi}{v_{\pi}(s)}, \forall s$$

and the optimal action value function as
$$q_*(s, a)=\max_{\pi}{q_{\pi}(s, a)}, \forall s, a$$

Besides, in order to compare policies, we introduce a partial order on the policy space:
$$\pi\geqslant\pi' \text{ if } v_{\pi}(s)\geqslant v_{\pi'}(s), \forall s$$

Note that $v_*(s)$ and $q_*(s, a)$ are defined by taking the maximum for each $s$ separately, so a priori the optimal value function might not be the value function of any single policy. Thanks to the following theorem, this does not happen.


For any Markov Decision Process,
- There exists an optimal policy $\pi_*$, i.e. $\pi_*\geqslant\pi, \forall\pi$
- All optimal policies achieve the optimal value functions, i.e. $v_{\pi_*}=v_*, q_{\pi_*}=q_*$

OK, but here is another question: how do we find an optimal policy? It turns out that as long as we find the optimal action value function $q_*(s, a)$, we can construct an optimal policy by
$$\pi_*(a\mid s) = \begin{cases} 1, \quad a = \operatorname{argmax}_{a'\in A} q_*(s, a')\\ 0, \quad\text{otherwise} \end{cases}$$

In other words, $\pi_*$ is constructed by deterministically choosing the action with the maximum action value.
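
Here is a minimal sketch of this greedy construction, assuming $q_*$ is already available as a table; the dictionary `q_star` below contains made-up numbers over the toy states and actions used earlier.

```python
# q_star[s][a] = q_*(s, a); the numbers are made up for illustration.
q_star = {
    "study": {"work": 7.3, "rest": 5.1},
    "sleep": {"work": 6.0, "rest": 4.2},
}

def greedy_policy(q):
    """pi_*(a | s) = 1 for the argmax action, 0 for all other actions."""
    policy = {}
    for s, action_values in q.items():
        best = max(action_values, key=action_values.get)
        policy[s] = {a: (1.0 if a == best else 0.0) for a in action_values}
    return policy

print(greedy_policy(q_star))
```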

Note that this is an optimal policy which happens to be deterministic; there may be other optimal policies which are stochastic. Nevertheless, all optimal policies achieve the optimal value function, which is unique by definition.

Given an MDP, i.e. given $S, A, P, R, \gamma$, our goal is to find an optimal policy, for which it suffices to find the optimal action value function. This is what we mean by solving an MDP.