In this post I clarify how to solve an MDP using the Bellman Expectation and/or Optimality Equation, as introduced in the previous post RL Note (3) - Planning by Dynamic Programming. The lecture video is here: Lecture 3_ Planning by Dynamic Programming.

To be honest, I was initially unclear about what was going on after watching the video, so I re-formulated the algorithms introduced in the video in my own way, which is hopefully clearer and easier to understand.

In this post we consider how to solve for the optimal state value function $v_*$ and an optimal policy $\pi_*$.

# Notations

The Bellman Expectation Equation is
$$v_{\pi}(s)=\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[v_{\pi}(S_{t+1}) \mid S_t = s]$$

For simplicity, we define a Bellman expectation operator $T_E^{\pi}$ on functions over the state space that encapsulates the whole right-hand side, i.e. define

$$\begin{aligned} T_E^{\pi}: (S\rightarrow\mathbb{R})&\rightarrow(S\rightarrow\mathbb{R})\\ v&\mapsto \left(s\mapsto\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[v(S_{t+1}) \mid S_t = s]\right) \end{aligned}$$

So the Bellman Expectation Equation can be expressed simply as:
$$v_{\pi}=T_E^{\pi}(v_{\pi})$$

Also, for the Bellman Optimality Equation, we can similarly define a Bellman optimality operator
$$\begin{aligned} T_O: (S\rightarrow\mathbb{R})&\rightarrow(S\rightarrow\mathbb{R})\\ v&\mapsto \left(s\mapsto\max_a{\left(R_s^a + \gamma \sum_{s'}\mathbb{P}_{ss'}^av(s')\right)}\right) \end{aligned}$$

As a result, the Bellman Optimality Equation for optimal value functions can be written as
$$v_*=T_O(v_*)$$

Furthermore, recall that we can construct an optimal policy $\pi_*$ from an optimal action value function $q_*$ greedily, by picking the action with the maximum value in each state. Since $q_*$ and $v_*$ are inter-convertible, we can also construct $\pi_*$ from an optimal state value function $v_*$. For simplicity, we define the greedy operator $G$ by
$$\begin{aligned} G: (S\rightarrow\mathbb{R})&\rightarrow(S\times A\rightarrow\mathbb{R})\\ v&\mapsto \left((s,a)\mapsto \begin{cases} 1, \quad a = \operatorname{argmax}_{a'} \left(R_s^{a'} + \gamma\sum_{s'}\mathbb{P}_{ss'}^{a'} v(s')\right)\\ 0, \quad\text{otherwise} \end{cases}\right) \end{aligned}$$

That is, $G(v)$ is a deterministic policy: $G(v)(s,a)=\pi(a\mid s)$.

Using these operators, we can state a simple relation between those frightening formulas:
$$T_O(\cdot) = T_E^{G(\cdot)}(\cdot)$$
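To make these operators concrete, here is a minimal NumPy sketch on a tiny two-state, two-action MDP (all transition probabilities and rewards below are invented purely for illustration). It implements $T_E^{\pi}$, $T_O$, and $G$ as plain functions and checks the relation $T_O(v) = T_E^{G(v)}(v)$ numerically.

```python
import numpy as np

# Tiny invented 2-state, 2-action MDP (all numbers are made up).
# P[a, s, s2] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.8, 0.2],
               [0.3, 0.7]],   # action 0
              [[0.1, 0.9],
               [0.5, 0.5]]])  # action 1
R = np.array([[1.0, 0.0],
              [2.0, 1.5]])
gamma = 0.9

def q_from_v(v):
    """q[s, a] = R_s^a + gamma * sum_s' P_ss'^a * v(s')."""
    return R + gamma * np.einsum('ast,t->sa', P, v)

def T_E(v, pi):
    """Bellman expectation operator T_E^pi, with pi[s, a] = pi(a|s)."""
    return np.sum(pi * q_from_v(v), axis=1)

def T_O(v):
    """Bellman optimality operator: max over actions of the one-step lookahead."""
    return q_from_v(v).max(axis=1)

def G(v):
    """Greedy operator: one-hot policy picking the argmax action per state."""
    q = q_from_v(v)
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return pi

# The relation T_O(v) = T_E^{G(v)}(v) holds for any v:
v = np.array([1.0, -0.5])
assert np.allclose(T_O(v), T_E(v, G(v)))
```

Here a policy is stored as a matrix `pi[s, a]` of probabilities, so $G$ returns a one-hot row per state.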

# Three Iterative Methods

In the lecture, three iterative algorithms are introduced:

- Policy Evaluation
  - Input: MDP, a policy $\pi$
  - Output: the value function $v_{\pi}$
- Policy Iteration
  - Input: MDP
  - Output: an optimal value function $v_*$ and an optimal policy $\pi_*$
- Value Iteration
  - Input: MDP
  - Output: an optimal value function $v_*$ and an optimal policy $\pi_*$

At the beginning I was really confused, especially by the latter two. Luckily, with the operators defined above, I can now express all three algorithms clearly.

## Policy Evaluation

- Input:
  - MDP, a policy $\pi$
- Steps:
  - initialize $v^{(0)}$
  - loop until termination:
    - $v^{(k+1)}\leftarrow T^{\pi}_E\left(v^{(k)}\right)$
- Output:
  - $v_{\pi}\leftarrow v^{(\text{last})}$
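As a sketch (on an invented two-state MDP with a uniform random policy, purely for illustration), Policy Evaluation is just repeated application of $T_E^{\pi}$ until the iterates stop changing:

```python
import numpy as np

# Invented 2-state, 2-action MDP and a uniform random policy, for illustration only.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.5, 0.5]]])  # P[a, s, s']
R = np.array([[1.0, 0.0],
              [2.0, 1.5]])                # R[s, a]
gamma = 0.9
pi = np.full((2, 2), 0.5)                 # pi[s, a] = pi(a|s)

def T_E(v):
    """One application of the Bellman expectation operator T_E^pi."""
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    return np.sum(pi * q, axis=1)

v = np.zeros(2)                           # v^(0): any initialization works
for _ in range(2000):                     # loop until (approximate) termination
    v_next = T_E(v)
    if np.max(np.abs(v_next - v)) < 1e-12:
        break
    v = v_next
# v now approximates v_pi, the unique fixed point of T_E^pi (for gamma < 1).
```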

## Policy Iteration

- Input:
  - MDP
- Steps:
  - initialize $v^{(0)}$, $\pi^{(0)}$, and an integer $K\geqslant 1$
  - loop until termination:
    - $v^{(k+1)}\leftarrow \left(T^{\pi^{(k)}}_E\right)^K\left(v^{(k)}\right)$
    - $\pi^{(k+1)}\leftarrow G\left(v^{(k+1)}\right)$
- Output:
  - $v_*\leftarrow v^{(\text{last})}$
  - $\pi_*\leftarrow \pi^{(\text{last})}$
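The loop above can be sketched as follows, again on an invented two-state MDP. The inner loop applies $T_E^{\pi}$ $K$ times, and the greedy step plays the role of $G$:

```python
import numpy as np

# Invented 2-state, 2-action MDP, for illustration only.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.5, 0.5]]])  # P[a, s, s']
R = np.array([[1.0, 0.0],
              [2.0, 1.5]])                # R[s, a]
gamma = 0.9
K = 5                                     # inner sweeps of T_E^pi per iteration

def q_from_v(v):
    return R + gamma * np.einsum('ast,t->sa', P, v)

def greedy(v):                            # the operator G, as a one-hot policy
    pi = np.zeros_like(R)
    pi[np.arange(R.shape[0]), q_from_v(v).argmax(axis=1)] = 1.0
    return pi

v = np.zeros(2)                           # v^(0)
pi = greedy(v)                            # pi^(0)
for _ in range(2000):
    v_old = v
    for _ in range(K):                    # v <- (T_E^pi)^K (v)
        v = np.sum(pi * q_from_v(v), axis=1)
    pi = greedy(v)                        # pi <- G(v)
    if np.max(np.abs(v - v_old)) < 1e-12:
        break
# v approximates v_* and pi is greedy with respect to it, so v should
# (approximately) satisfy the Bellman optimality equation v = T_O(v).
```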

## Value Iteration

- Input:
  - MDP
- Steps:
  - initialize $v^{(0)}$
  - loop until termination:
    - $v^{(k+1)}\leftarrow T_O\left(v^{(k)}\right)$
- Output:
  - $v_*\leftarrow v^{(\text{last})}$
  - $\pi_*\leftarrow G(v_*)$
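A sketch of Value Iteration on the same kind of invented two-state MDP. Unlike Policy Iteration, the policy is extracted only once, at the end:

```python
import numpy as np

# Invented 2-state, 2-action MDP, for illustration only.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.5, 0.5]]])  # P[a, s, s']
R = np.array([[1.0, 0.0],
              [2.0, 1.5]])                # R[s, a]
gamma = 0.9

def q_from_v(v):
    return R + gamma * np.einsum('ast,t->sa', P, v)

v = np.zeros(2)                           # v^(0)
for _ in range(2000):
    v_next = q_from_v(v).max(axis=1)      # v <- T_O(v)
    if np.max(np.abs(v_next - v)) < 1e-12:
        break
    v = v_next

pi_star = np.zeros_like(R)                # pi_* <- G(v_*), extracted once
pi_star[np.arange(R.shape[0]), q_from_v(v).argmax(axis=1)] = 1.0
```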

Note that when $K=1$, Policy Iteration is exactly Value Iteration: since $\pi^{(k)} = G\left(v^{(k)}\right)$, the relation $T_O(\cdot) = T_E^{G(\cdot)}(\cdot)$ gives $T_E^{\pi^{(k)}}\left(v^{(k)}\right) = T_O\left(v^{(k)}\right)$.

It can be shown by the Contraction Mapping Theorem that, when $\gamma<1$:

- $v_{\pi}$ is unique for a given $\pi$
- Policy Evaluation converges to $v_{\pi}$
- Policy Iteration converges to $v_*$
- Value Iteration converges to $v_*$
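The contraction property itself is easy to probe numerically: for any pair of value functions $u, w$, the sup-norm distance after applying $T_O$ shrinks by at least a factor $\gamma$. A small check on the same invented MDP numbers:

```python
import numpy as np

# Invented 2-state, 2-action MDP, for illustration only.
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.5, 0.5]]])  # P[a, s, s']
R = np.array([[1.0, 0.0],
              [2.0, 1.5]])                # R[s, a]
gamma = 0.9

def T_O(v):
    return (R + gamma * np.einsum('ast,t->sa', P, v)).max(axis=1)

# T_O is a gamma-contraction in the sup norm:
#   ||T_O(u) - T_O(w)||_inf <= gamma * ||u - w||_inf   for all u, w.
rng = np.random.default_rng(0)
for _ in range(1000):
    u, w = rng.normal(size=2), rng.normal(size=2)
    lhs = np.max(np.abs(T_O(u) - T_O(w)))
    rhs = gamma * np.max(np.abs(u - w))
    assert lhs <= rhs + 1e-12
```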