RL Note (3) - Planning by Dynamic Programming

In this post I will clarify the methods for solving an MDP using the Bellman Expectation and/or Optimality Equations introduced in the previous post. The lecture video is here: Lecture 3: Planning by Dynamic Programming.

To be honest, I was initially not clear about what was going on after watching the video, so I used my own way to re-formulate the algorithms introduced in it, which is hopefully clearer and easier to understand.

In this post we consider how to solve for the optimal state value function $v_*$ and an optimal policy $\pi_*$.

Notations

The Bellman Expectation Equation is
$$v_{\pi}(s)=\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[v_{\pi}(S_{t+1}) \mid S_t = s]$$

For simplicity, we define a Bellman expectation operator $T_E^{\pi}$ on functions over the state space to encapsulate the whole right-hand side, i.e. we define

$$\begin{aligned} T_E^{\pi}: (S\rightarrow\mathbb{R})&\rightarrow(S\rightarrow\mathbb{R})\\ v&\mapsto \left(s\mapsto\mathbb{E}[R_{t+1}\mid S_t=s] + \gamma\mathbb{E}[v(S_{t+1}) \mid S_t = s]\right) \end{aligned}$$

So the Bellman Expectation Equation can be expressed as simply:
$$v_{\pi}=T_E^{\pi}(v_{\pi})$$
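
To make the operator concrete, here is a minimal NumPy sketch of $T_E^{\pi}$ for a finite (tabular) MDP. The array layout and function name are my own assumptions, not from the lecture: `P[s, a, s']` holds transition probabilities, `R[s, a]` the expected immediate reward, and `pi[s, a]` the probability of taking action $a$ in state $s$.

```python
import numpy as np

def bellman_expectation_operator(v, pi, P, R, gamma):
    """Apply T_E^pi to a state-value vector v of shape (S,).

    Assumed layout: P[s, a, s'] transition probabilities (S, A, S),
    R[s, a] expected immediate rewards (S, A), pi[s, a] action
    probabilities (S, A), gamma the discount factor.
    """
    expected_reward = np.sum(pi * R, axis=1)        # E[R_{t+1} | S_t = s]
    expected_next_v = np.sum(pi * (P @ v), axis=1)  # E[v(S_{t+1}) | S_t = s]
    return expected_reward + gamma * expected_next_v
```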

Also, for the Bellman Optimality Equation, we can similarly define a Bellman optimality operator
$$\begin{aligned} T_O: (S\rightarrow\mathbb{R})&\rightarrow(S\rightarrow\mathbb{R})\\ v&\mapsto \left(s\mapsto\max_a{\left(R_s^a + \gamma \sum_{s'}\mathbb{P}_{ss'}^av(s')\right)}\right) \end{aligned}$$

As a result, the Bellman Optimality Equation for optimal value functions can be written as
$$v_*=T_O(v_*)$$
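
Under the same assumed tabular layout as in the sketch above, $T_O$ differs from $T_E^{\pi}$ only in replacing the expectation over actions by a maximum:

```python
def bellman_optimality_operator(v, P, R, gamma):
    """Apply T_O to a state-value vector v (same assumed P, R layout as above)."""
    q = R + gamma * (P @ v)  # q[s, a] = R_s^a + gamma * sum_{s'} P_{ss'}^a v(s')
    return q.max(axis=1)     # maximum over actions, per state
```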

Furthermore, recall that we can construct an optimal policy $\pi_*$ from an optimal action value function $q_*$ greedily by picking the action with the maximum value. Since $q_*$ and $v_*$ are inter-convertible, we can also construct $\pi_*$ from the optimal state value function $v_*$. For simplicity, we define the greedy operator $G$ by
$$\begin{aligned} G: (S\rightarrow\mathbb{R})&\rightarrow(S\times A\rightarrow[0,1])\\ v&\mapsto \left((s,a)\mapsto \begin{cases} 1, \quad a = \operatorname{argmax}_{a'} \left(R_s^{a'} + \gamma\sum_{s'}\mathbb{P}_{ss'}^{a'} v(s')\right)\\ 0, \quad\text{otherwise} \end{cases}\right) \end{aligned}$$

Using these operators, we can find a simple relation between those frightening formulas:
$$T_O(\cdot) = T_E^{G(\cdot)}(\cdot)$$
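
Here is a sketch of $G$ in the same assumed setting, plus a numerical spot-check of the relation above. The random MDP is just assumed toy data, and the functions (and the `numpy` import) from the previous sketches are reused:

```python
def greedy(v, P, R, gamma):
    """Return G(v) as a one-hot (S, A) matrix: 1 on the greedy action, 0 elsewhere."""
    q = R + gamma * (P @ v)                       # action values induced by v
    pi = np.zeros_like(q)
    pi[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return pi

# Spot-check of T_O(v) = T_E^{G(v)}(v) on a small random MDP (toy data)
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
v = rng.random(S)
assert np.allclose(bellman_optimality_operator(v, P, R, gamma),
                   bellman_expectation_operator(v, greedy(v, P, R, gamma), P, R, gamma))
```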

Three Iterative Methods

In the lecture, the lecturer introduces three iterative algorithms:

  • Policy Evaluation
    • Input: MDP, A policy $\pi$
    • Output: The value function $v_{\pi}$
  • Policy Iteration
    • Input: MDP
    • Output: The optimal value function $v_*$ and an optimal policy $\pi_*$
  • Value Iteration
    • Input: MDP
    • Output: The optimal value function $v_*$ and an optimal policy $\pi_*$

At the beginning I was really confused, especially by the latter two. Luckily, with the operators defined above, I can now express these three algorithms clearly.

Policy Evaluation

  • Input:
    • MDP, A policy $\pi$
  • Steps:
    • initialize $v^{(0)}$
    • loop until should terminate
      • $v^{(k+1)}\leftarrow T^{\pi}_E\left(v^{(k)}\right)$
  • Output:
    • $v_{\pi}\leftarrow v^{(\text{last})}$
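
A minimal sketch of this loop, reusing `bellman_expectation_operator` from above. The stopping rule (stop once the sup-norm change drops below a tolerance) is my own choice for "should terminate":

```python
def policy_evaluation(pi, P, R, gamma, tol=1e-8):
    """Iterate v <- T_E^pi(v) until the largest update is below tol."""
    v = np.zeros(P.shape[0])
    while True:
        v_new = bellman_expectation_operator(v, pi, P, R, gamma)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new  # approximately v_pi
        v = v_new
```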

Policy Iteration

  • Input:
    • MDP
  • Steps:
    • initialize $v^{(0)}, \pi^{(0)}$, and an integer $K\geqslant 1$
    • loop until should terminate
      • $v^{(k+1)}\leftarrow \left(T^{\pi^{(k)}}_E\right)^K\left(v^{(k)}\right)$
      • $\pi^{(k+1)}\leftarrow G\left(v^{(k+1)}\right)$
  • Output:
    • $v_*\leftarrow v^{(\text{last})}$
    • $\pi_*\leftarrow \pi^{(\text{last})}$
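
A sketch of the same loop, again reusing the operator sketches above. The uniform-random initial policy and the sup-norm stopping rule are my own assumptions, not part of the lecture:

```python
def policy_iteration(P, R, gamma, K=1, tol=1e-8):
    """Alternate K sweeps of T_E^pi (evaluation) with a greedy improvement step."""
    S, A = R.shape
    v = np.zeros(S)
    pi = np.full((S, A), 1.0 / A)            # assumed start: uniform random policy
    while True:
        v_old = v
        for _ in range(K):                    # (possibly partial) policy evaluation
            v = bellman_expectation_operator(v, pi, P, R, gamma)
        pi = greedy(v, P, R, gamma)           # policy improvement
        if np.max(np.abs(v - v_old)) < tol:   # one possible stopping rule
            return v, pi
```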

Value Iteration

  • Input:
    • MDP
  • Steps:
    • initialize $v^{(0)}$
    • loop until should terminate
      • $v^{(k+1)}\leftarrow T_O\left(v^{(k)}\right)$
  • Output:
    • $v_*\leftarrow v^{(\text{last})}$
    • $\pi_*\leftarrow G(v_*)$
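
And the corresponding sketch for Value Iteration, which is just repeated application of $T_O$ followed by one greedy step at the end (same assumed layout and stopping rule as before):

```python
def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate v <- T_O(v), then read off the greedy policy G(v)."""
    v = np.zeros(P.shape[0])
    while True:
        v_new = bellman_optimality_operator(v, P, R, gamma)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, greedy(v_new, P, R, gamma)
        v = v_new
```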

Note that when $K=1$, Policy Iteration is exactly Value Iteration: since $\pi^{(k)} = G\left(v^{(k)}\right)$ (for $k\geqslant 1$), the update becomes $v^{(k+1)} = T_E^{G\left(v^{(k)}\right)}\left(v^{(k)}\right) = T_O\left(v^{(k)}\right)$ by the relation above.

It can be shown by the Contraction Mapping Theorem that when $\gamma<1$:

  • $v_{\pi}$ is unique for a given $\pi$
  • Policy Evaluation converges to $v_{\pi}$
  • Policy Iteration converges to $v_*$
  • Value Iteration converges to $v_*$
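
The key fact behind these results is that both operators are $\gamma$-contractions in the sup norm, i.e. $\|T(u)-T(v)\|_\infty \leqslant \gamma\|u-v\|_\infty$. A quick numerical spot-check on assumed toy data, reusing the sketches above:

```python
# Spot-check that T_O is a gamma-contraction in the sup norm (toy data)
rng = np.random.default_rng(1)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
u, v = rng.random(S), rng.random(S)
lhs = np.max(np.abs(bellman_optimality_operator(u, P, R, gamma)
                    - bellman_optimality_operator(v, P, R, gamma)))
assert lhs <= gamma * np.max(np.abs(u - v)) + 1e-12  # small slack for float error
```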