In the previous note RL Note (2.1) - Markov Decision Process, we formalized the agent-environment interaction problem into a Markov Decision Process, and defined the value functions of MDP. In this post we deduce a very important property of those value functions, namely the Bellman Equation, which is of fundamental importance to solve the MDP.

# Bellman Expectation Equation

We can sense the recursiveness in the value function, which leads us to a relationship between the value function and itself. We have

\begin{aligned} v_{\pi}(s)&=\mathbb{E}[G_t \mid S_t = s]\\ &=\mathbb{E}[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots \mid S_t = s]\\ &=\mathbb{E}[R_{t+1}\mid S_t=s+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots \mid S_t = s]\\ &=\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[R_{t+2}+\gamma R_{t+3}+\cdots \mid S_t = s]\\ &=\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[G_{t+1} \mid S_t = s]\\ &=\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[\mathbb{E}[G_{t+1}\mid S_{t+1}] \mid S_t = s]\\ &=\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[v_{\pi}(S_{t+1}) \mid S_t = s]\\ \end{aligned}

which can be interpreted intuitively as

The state value of state $s$ is the immediate (expected) reward plus the discounted (expected) values of future states.

However, the above deduction is too simplified on notation, which may lead to confusion as I initially encountered. So I think it worth some explanation. Up to the penultimate line, everything is fine. The penultimate line uses a general fact about the expectation:
$$\mathbb{E}_{Z\sim\mathbb{P}[Z]}[Z] = \mathbb{E}_{Y\sim\mathbb{P}[Y]}[\mathbb{E}[Z\mid Y]]$$

Note that in the right hand side, the random variable has changed from $Z$ to $\mathbb{E}[Z\mid Y]$, and the outer expectation is calculated according to the distribution of $Y$ instead of $Z$. It is a common trick to transform expectation.

Also, the $R_t$ in all formulas is fact a abbreviation of $R_{s}^{A_t}$, which is a random variable distributed based on the distribution of $A_t$ given state $s$, i.e. the policy $\pi(a\mid s)$.

Similarly, we have
\begin{aligned} q_{\pi}(s,a)&=\mathbb{E}[G_t \mid S_t = s, A_t=a]\\ &=\cdots\\ &=\mathbb{E}[R_{t+1}\mid S_t=s,A_t=a] + \gamma\mathbb{E}[q(S_{t+1}, A_{t+1}) \mid S_t = s, A_t=a]\\ &=R_s^a + \gamma\mathbb{E}[q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t=a] \end{aligned}

which can also be interpreted intuitively as

The action value of state $s$ and action $a$ is the immediate reward plus the discounted (expected) values from future states and actions.

Note that here the first term, i.e. $R_s^a$, is a known value of the MDP instead of a expectation.

If we expand the expectation, we get
\begin{aligned} v_{\pi}(s)&=\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[v(S_{t+1}) \mid S_t = s]\\ &=\sum_{a}\pi(a\mid s)\left(R_s^a + \gamma \sum_{s'}\mathbb{P}_{ss'}^av_{\pi}(s')\right) \end{aligned}

and
\begin{aligned} q_{\pi}(s,a)&=R_s^a + \gamma\mathbb{E}[q(S_{t+1}, A_{t+1}) \mid S_t = s, A_t=a]\\ &= R_s^a + \gamma\sum_{s'}\mathbb{P}_{ss'}^a \left(\sum_{a'}\pi(a'\mid s') q_{\pi}(s',a')\right) \end{aligned}

Also, we can deduce the relationship between $v_{\pi}$ and $q_{\pi}$:
\begin{aligned} v_{\pi}(s)&=\mathbb{E}[G_t \mid S_t = s]\\ &=\mathbb{E}[\mathbb{E}[G_t \mid S_t = s, A_t]]\\ &=\mathbb{E}[q_{\pi}(s, A_t)]\\ &=\sum_{a}\pi(a\mid s)q_{\pi}(s,a) \end{aligned}

put it back into the above expression of $q_{\pi}(s,a)$, we get
\begin{aligned} q_{\pi}(s,a) = R_s^a + \gamma\sum_{s'}\mathbb{P}_{ss'}^a v_{\pi}(s') \end{aligned}

Those are called the Bellman Expectation Equation, which are self-entangled recursive relationships between two value functions. Looks frightening, but in fact they can be intuitively deduced by randomly sample 1 and 2 steps from the state $S_t=s$.

Note that the Bellman Expectation Equation is deduced purely from the very definition of the value function, which means that it is a necessory condition for a function to be a valid value function.

# Bellman Optimality Equation

The Bellman Expectation Equation is true for any value function, in particular for an optimal value function. Select an optimal policy (which we have known exists) $\pi_*$, since it deterministically pick the action with maximum action value, we can replace the expection $\mathbb{E}$ in the Bellman Expcetion Equation to some deterministic function:
\begin{aligned} v_{\pi_*}(s)&=\sum_{a}\pi_*(a\mid s)q_{\pi_*}(s,a)\\ &=\max_a{q_{\pi_*}(s,a)}\\ &=\max_a{\left(R_s^a + \gamma \sum_{s'}\mathbb{P}_{ss'}^av_{\pi_*}(s')\right)}\\ &=v_*(s) \end{aligned}

and

\begin{aligned} q_{\pi_*}(s,a)&=R_s^a + \gamma\sum_{s'}\mathbb{P}_{ss'}^a v_{\pi_*}(s')\\ &=R_s^a + \gamma\sum_{s'}\mathbb{P}_{ss'}^a \max_{a'}{q_{\pi_*}(s',a')}\\ &=q_*(s,a) \end{aligned}

Again, it is a necessory condition for a function to be a optimal value function.

Now, to solve the MDP, we need to find a $q_*$ (or $v_*$ equivalently, since they are mutually convertible when the model is known). However, the relationship specified by the Bellman Optimality Equation is not linear (unlike solving a Markov Chain). Luckily we can use some iterative algorithm to do the trick, which I will write in the next post.