RL Note (2.2) - The Bellman Equation

In the previous note RL Note (2.1) - Markov Decision Process, we formalized the agent-environment interaction problem as a Markov Decision Process and defined the value functions of an MDP. In this post we derive a very important property of those value functions, namely the Bellman Equation, which is of fundamental importance for solving the MDP.

Bellman Expectation Equation

There is a recursive structure in the value function, which leads to a relationship between the value function and itself. We have

$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[G_t \mid S_t = s]\\ &=\mathbb{E}[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots \mid S_t = s]\\ &=\mathbb{E}[R_{t+1}+\gamma (R_{t+2}+\gamma R_{t+3}+\cdots) \mid S_t = s]\\ &=\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[R_{t+2}+\gamma R_{t+3}+\cdots \mid S_t = s]\\ &=\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[G_{t+1} \mid S_t = s]\\ &=\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[\mathbb{E}[G_{t+1}\mid S_{t+1}] \mid S_t = s]\\ &=\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[v_{\pi}(S_{t+1}) \mid S_t = s] \end{aligned}$$

which can be interpreted intuitively as

The value of state $s$ is the expected immediate reward plus the discounted expected value of the successor state.

However, the derivation above is notationally terse, which can lead to confusion (as it did for me at first), so it is worth some explanation. Up to the penultimate line, everything is fine. The penultimate line uses a general fact about expectations, the law of total expectation:
$$\mathbb{E}_{Z\sim\mathbb{P}[Z]}[Z] = \mathbb{E}_{Y\sim\mathbb{P}[Y]}[\mathbb{E}[Z\mid Y]]$$

Note that on the right-hand side, the random variable has changed from $Z$ to $\mathbb{E}[Z\mid Y]$, and the outer expectation is taken with respect to the distribution of $Y$ instead of $Z$. It is a common trick for transforming expectations.
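As a quick numerical sanity check of this identity, here is a minimal sketch (the joint distribution of $(Y, Z)$ below is entirely made up for illustration) that computes both sides on a small discrete distribution:

```python
# Verify E[Z] = E_Y[ E[Z | Y] ] on a small, made-up discrete joint distribution.
joint = {  # (y, z) -> P(Y=y, Z=z)
    (0, 1.0): 0.1, (0, 2.0): 0.3,
    (1, 1.0): 0.2, (1, 5.0): 0.4,
}

# Left-hand side: E[Z], taken directly over the joint distribution.
e_z = sum(p * z for (y, z), p in joint.items())

# Marginal distribution of Y.
p_y = {}
for (y, _), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

def cond_exp_z(y):
    """E[Z | Y = y] under the joint distribution above."""
    return sum(p * z for (yy, z), p in joint.items() if yy == y) / p_y[y]

# Right-hand side: the outer expectation is over the distribution of Y,
# and the random variable inside is E[Z | Y].
e_e_z_given_y = sum(p * cond_exp_z(y) for y, p in p_y.items())

print(e_z, e_e_z_given_y)          # both print 2.9 (up to floating-point error)
assert abs(e_z - e_e_z_given_y) < 1e-12
```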

Also, the $R_{t+1}$ in the formulas above is in fact an abbreviation of $R_{s}^{A_t}$, a random variable whose distribution is induced by the distribution of $A_t$ given state $s$, i.e. the policy $\pi(a\mid s)$.

Similarly, we have
$$\begin{aligned} q_{\pi}(s,a)&=\mathbb{E}[G_t \mid S_t = s, A_t=a]\\ &=\cdots\\ &=\mathbb{E}[R_{t+1}\mid S_t=s,A_t=a] + \gamma\mathbb{E}[q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t=a]\\ &=R_s^a + \gamma\mathbb{E}[q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t=a] \end{aligned}$$

which can also be interpreted intuitively as

The action value of state $s$ and action $a$ is the immediate reward plus the discounted expected value of the successor state and action.

Note that here the first term, i.e. $R_s^a$, is a known quantity from the MDP model rather than an expectation.

If we expand the expectation, we get
$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[v_{\pi}(S_{t+1}) \mid S_t = s]\\ &=\sum_{a}\pi(a\mid s)\left(R_s^a + \gamma \sum_{s'}\mathbb{P}_{ss'}^av_{\pi}(s')\right) \end{aligned}$$

and
$$\begin{aligned} q_{\pi}(s,a)&=R_s^a + \gamma\mathbb{E}[q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t=a]\\ &= R_s^a + \gamma\sum_{s'}\mathbb{P}_{ss'}^a \left(\sum_{a'}\pi(a'\mid s') q_{\pi}(s',a')\right) \end{aligned}$$

Also, we can deduce the relationship between $v_{\pi}$ and $q_{\pi}$:
$$\begin{aligned} v_{\pi}(s)&=\mathbb{E}[G_t \mid S_t = s]\\ &=\mathbb{E}[\mathbb{E}[G_t \mid S_t = s, A_t] \mid S_t = s]\\ &=\mathbb{E}[q_{\pi}(s, A_t) \mid S_t = s]\\ &=\sum_{a}\pi(a\mid s)q_{\pi}(s,a) \end{aligned}$$

Substituting this back into the expression for $q_{\pi}(s,a)$ above, we get
$$\begin{aligned} q_{\pi}(s,a) = R_s^a + \gamma\sum_{s'}\mathbb{P}_{ss'}^a v_{\pi}(s') \end{aligned}$$

These are called the Bellman Expectation Equations, which are mutually recursive relationships between the two value functions. They may look intimidating, but in fact they can be derived intuitively by looking ahead one or two steps from the state $S_t=s$ and averaging over the policy and the transition dynamics.
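To make these equations concrete, here is a minimal sketch on a tiny hypothetical MDP (two states, two actions; all rewards, transition probabilities, and the policy are made up for illustration). Since the Bellman Expectation Equation for $v_\pi$ is linear for a fixed policy, the sketch solves it directly with linear algebra, derives $q_\pi$, and then checks the other relationships numerically:

```python
import numpy as np

# A small hypothetical MDP: states {0, 1}, actions {0, 1}, discount gamma.
# R[s, a]     = R_s^a (expected immediate reward, known from the model)
# P[s, a, s'] = P_{s s'}^a (transition probabilities)
# pi[s, a]    = pi(a | s) (a fixed, arbitrary stochastic policy)
gamma = 0.9
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
pi = np.array([[0.6, 0.4],
               [0.3, 0.7]])

# For a fixed policy the Bellman Expectation Equation for v_pi is linear:
#   v_pi = R_pi + gamma * P_pi v_pi,  with R_pi[s] = sum_a pi(a|s) R_s^a
#                                     and  P_pi[s, s'] = sum_a pi(a|s) P_{s s'}^a
R_pi = (pi * R).sum(axis=1)                     # shape (S,)
P_pi = np.einsum('sa,sat->st', pi, P)           # shape (S, S)
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# q_pi from v_pi:  q_pi(s, a) = R_s^a + gamma * sum_{s'} P_{s s'}^a v_pi(s')
q_pi = R + gamma * np.einsum('sat,t->sa', P, v_pi)

# Check the remaining Bellman Expectation Equations numerically.
# 1) v_pi(s) = sum_a pi(a|s) q_pi(s, a)
assert np.allclose(v_pi, (pi * q_pi).sum(axis=1))
# 2) q_pi(s, a) = R_s^a + gamma * sum_{s'} P_{s s'}^a sum_{a'} pi(a'|s') q_pi(s', a')
assert np.allclose(q_pi, R + gamma * np.einsum('sat,t->sa', P, (pi * q_pi).sum(axis=1)))
# 3) v_pi(s) = sum_a pi(a|s) (R_s^a + gamma * sum_{s'} P_{s s'}^a v_pi(s'))
assert np.allclose(v_pi, (pi * (R + gamma * np.einsum('sat,t->sa', P, v_pi))).sum(axis=1))

print("v_pi:", v_pi)
print("q_pi:", q_pi)
```

Note that solving the system directly like this relies on knowing the full model ($R_s^a$ and $\mathbb{P}_{ss'}^a$) and on the state space being tiny.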

Note that the Bellman Expectation Equation is deduced purely from the very definition of the value function, which means that it is a necessary condition for a function to be a valid value function.

Bellman Optimality Equation

The Bellman Expectation Equation is true for any value function, in particular for an optimal value function. Select an optimal policy $\pi_*$ (which we already know exists). Since it deterministically picks an action with maximum action value, we can replace the expectation over actions in the Bellman Expectation Equation with a max:
$$\begin{aligned} v_{\pi_*}(s)&=\sum_{a}\pi_*(a\mid s)q_{\pi_*}(s,a)\\ &=\max_a{q_{\pi_*}(s,a)}\\ &=\max_a{\left(R_s^a + \gamma \sum_{s'}\mathbb{P}_{ss'}^av_{\pi_*}(s')\right)}\\ &=v_*(s) \end{aligned}$$

and

$$\begin{aligned} q_{\pi_*}(s,a)&=R_s^a + \gamma\sum_{s'}\mathbb{P}_{ss'}^a v_{\pi_*}(s')\\ &=R_s^a + \gamma\sum_{s'}\mathbb{P}_{ss'}^a \max_{a'}{q_{\pi_*}(s',a')}\\ &=q_*(s,a) \end{aligned}$$

Again, this is a necessary condition for a function to be an optimal value function.
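To illustrate this necessary condition (and not as a practical solution method), here is a minimal sketch on the same kind of tiny hypothetical MDP as above: it brute-forces all deterministic policies, solves each one's linear Bellman Expectation Equation, takes the state-wise best value as $v_*$, and then checks that $v_*$ and $q_*$ satisfy the Bellman Optimality Equations:

```python
import itertools
import numpy as np

# Same kind of tiny hypothetical MDP (all numbers made up for illustration).
gamma = 0.9
R = np.array([[1.0, 0.0],
              [2.0, -1.0]])                      # R[s, a]     = R_s^a
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])         # P[s, a, s'] = P_{s s'}^a
n_states, n_actions = R.shape

def v_of_deterministic_policy(actions):
    """Solve the (linear) Bellman Expectation Equation for a deterministic policy,
    given as a sequence with actions[s] = the action taken in state s."""
    acts = np.asarray(actions)
    R_pi = R[np.arange(n_states), acts]
    P_pi = P[np.arange(n_states), acts]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Brute force: enumerate every deterministic policy and take the state-wise best
# value. Since an optimal deterministic policy exists, this state-wise max is v_*.
all_values = [v_of_deterministic_policy(acts)
              for acts in itertools.product(range(n_actions), repeat=n_states)]
v_star = np.max(np.stack(all_values), axis=0)

# q_*(s, a) = R_s^a + gamma * sum_{s'} P_{s s'}^a v_*(s')
q_star = R + gamma * np.einsum('sat,t->sa', P, v_star)

# Necessary conditions from the Bellman Optimality Equations:
assert np.allclose(v_star, q_star.max(axis=1))   # v_*(s) = max_a q_*(s, a)
assert np.allclose(q_star, R + gamma * np.einsum('sat,t->sa', P, q_star.max(axis=1)))

print("v_*:", v_star)
print("q_*:", q_star)
```

Brute-force enumeration obviously does not scale beyond toy problems, which is exactly why we need the iterative algorithms mentioned next.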

Now, to solve the MDP, we need to find $q_*$ (or equivalently $v_*$, since they are mutually convertible when the model is known). However, because of the $\max$, the relationship specified by the Bellman Optimality Equation is not linear (unlike evaluating a Markov chain), so in general there is no closed-form solution. Luckily, iterative algorithms can do the trick, which I will write about in the next post.