As always, we first tackle evaluation problem, then control problem, where the goal is to find the optimal policy, instead of the value function given a policy. This post is the note for model-free control, and the youtube video is Lecture 5: Model Free Control

If the model is known, as the case in RL Note (3) - Planning by Dynamic Programming, we use an iterative process which asymptotically converge to an optimal policy. However, such approach is not applicable in the model-free case, since the greedy operator $G$, as defined below, is unknown.

\begin{aligned} G: (S\rightarrow\mathbb{R})&\rightarrow(A\rightarrow\mathbb{R})\\ v&\mapsto \left(a\mapsto \begin{cases} 1, \quad a = \operatorname{argmax} \left(R_s^a + \gamma\sum_{s'}\mathbb{P}_{ss'}^a v(s')\right)\\ 0, \quad\text{otherwise} \end{cases}\right) \end{aligned}

To be more specific, recall that in both Policy Iteration and Value Iteration, we use $G$ to generate policy from current state value function $v$ to pick the next action. However, when the model is unknown, we do not know $R_s^a$ and $\mathbb{P}_{ss'}^a$, therefore cannot use $G$ anymore.

The workaround is to use action value function $q$ instead of $v$, and use the following new greedy operator:

\begin{aligned} G: (S\times A\rightarrow\mathbb{R})&\rightarrow(A\rightarrow\mathbb{R})\\ q&\mapsto \left(a\mapsto \begin{cases} 1, \quad a = \operatorname{argmax} q(s,a)\\ 0, \quad\text{otherwise} \end{cases}\right) \end{aligned}

Note that we use the same symbol for greedy operator on $q$ and that on $v$. One can infer correctly from the context.

However, we still have one more problem. To see this, note that when the model is known, we use the Bellman Expectation Equation to iteratively update the state value function $v$, i.e.

$$v^{(k+1)}_{\pi}(s)\leftarrow\sum_{a}\pi(a\mid s)\left(R_s^a + \gamma \sum_{s'}\mathbb{P}_{ss'}^av^{(k)}_{\pi}(s')\right)$$

Now since we use the action value function $q$ instead of $v$, naturally we hope to use the corresponding Bellman Expectation Equation,. i.e.

$$q^{(k+1)}_{\pi}(s,a)\leftarrow R_s^a + \gamma\sum_{s'}\mathbb{P}_{ss'}^a \left(\sum_{a'}\pi(a'\mid s') q^{(k)}_{\pi}(s',a')\right)$$

Sadly, it cannot be done, since the model is unknown, we do not have access to $R_s^a$ and $\mathbb{P}_{ss'}^a$. As a result, we have to seek other approaches to update $q$.

A naive way is to use empirical average by sampling, i.e. the Monte-Carlo way. But it requires us to explore all possible states, which naturally holds when the model is known, but might not be true otherwise. As a result, we must adjust our policy to ensure basic exploration, which leads to the so called $\epsilon$-greedy operator:

\begin{aligned} G_{\epsilon}: (S\times A\rightarrow\mathbb{R})&\rightarrow(A\rightarrow\mathbb{R})\\ q&\mapsto \left(a\mapsto \begin{cases} \frac{\epsilon}{|A|} + 1-\epsilon, \quad a = \operatorname{argmax} q(s,a)\\ \frac{\epsilon}{|A|}, \quad\text{otherwise} \end{cases}\right) \end{aligned}

Such operator ensure a base uniform probability $\frac{\epsilon}{|A|}$ for all possible actions, and assign rest probability to the best one.

# Monte-Carlo Control

Now we can state the most naive attempt to find the optimal policy in a model-free setting:

• Initialize the action value function $q(s,a)$ for all $s, a$, and a policy $\pi$
• For $n$th episode
• Sample $(s_1, a_1, r_2, s_2, \cdots, s_{T}, a_{T}, r_{T+1}, s_{T+1})$ according to $\pi$:
• for each (or the fist) $(s_t, a_t)$ in the episode:
• calculate $q\leftarrow r_{t+1}+\gamma r_{t+2}+\cdots +\gamma^{T-t} r_{T+1}$
• accumulate $N(s_t, a_t)\leftarrow N(s_t, a_t) + 1$
• update $q(s_t, a_t)\leftarrow q(s_t, a_t) + \frac{1}{N(s_t, a_t)}(q-q(s_t, a_t))$
• $\pi\leftarrow G_{\epsilon}(q)$

Looks great, but does it converge to the optimal policy? No, or not necessarily! A sufficient condition for such process to converge is the so-called GLIE condition:

1. All state-action pairs are explored infinitely times, i.e. $N(s_t, a_t)\rightarrow\infty$
2. The policy converges to a greedy policy, i.e. $\pi(a|s)\rightarrow G(q)$

The above process satisfies the first condition as long as $\epsilon>0$, but fails to satisfy the second one. One way to rescue is to make $\epsilon$ decreasing to zero, e.g. $\epsilon=\frac{1}{n}$ for $n$th episode. We call such process the GLIE Monte-Carlo control, which is guaranteed to converge to the optimal action value function, i.e. $q\rightarrow q_*$ as $n\rightarrow\infty$.

We use $G_{\text{GLIE}}$ to represent generating a policy satisfying GLIE condition.

# SARSA

The control process is nothing but an evaluation process alternated with a policy updating process, in which the evaluation process is exactly what is described in RL Note (4.1) - Model-Free Prediction. As a result, a natural improvement to Monte-Carlo Control is to use Temporal-difference(TD) upon Monte-Carlo, and the resulting control algorithm is named SARSA.

• Initialize the action value function $q(s,a)$ for all $s, a$, and a policy $\pi$
• For $n$th episode
• $t\leftarrow 1$
• Initialize the starting state $s_t$
• Sample $a_t$ from $s_t$ according to $\pi$
• Loop
• Taken action $a_t$ to observe $r_{t+1}$ and $s_{t+1}$
• Sample $a_{t+1}$ from $s_{t+1}$ according to $\pi$
• update $q(s_t, a_t)\leftarrow q(s_t, a_t) + \alpha(r_{t+1}+\gamma q(s_{t+1}, a_{t+1})-q(s_t, a_t))$
• $t\leftarrow t + 1$ and repeat unless terminated
• $\pi\leftarrow G_{\text{GLIE}}(q)$

However, in order to let the above process converge to the optimal $q_*$, except for the GLIE condition on $\pi$, we must have additional constraint on $\alpha$, which must in fact be a so-called Robbins-Monro sequence $\alpha_t$:

\begin{aligned} \sum_{t=1}^{\infty}\alpha_t&=\infty\\ \sum_{t=1}^{\infty}\alpha_t^2&<\infty\\ \end{aligned}

# SARSA($\lambda$)

Like in model-free prediction situation we have TD and TD($\lambda$), here we have SARSA($\lambda$) as well, with exactly the same underneath motivation. Furthermore, we improve the calculation using eligibility trace for state-action pairs:

\begin{aligned} E_0(s,a) &= 0\\ E_t(s,a) &= \gamma\lambda E_{t-1}(s,a) + \mathbb{1}(S_t=s, A_t=a) \end{aligned}

The algorithm thus is as following:

• Initialize the action value function $q(s,a)$ for all $s, a$, and a policy $\pi$
• For $n$th episode
• Initialzie $E_0(s,a)\leftarrow 0$ for all $s, a$
• $t\leftarrow 1$
• Initialize the starting state $s_t$
• Sample $a_t$ from $s_t$ according to $\pi$
• Loop
• For each $(s,a)$
• update $E_t(s,a)\leftarrow \gamma\lambda E_{t-1}(s,a) + \mathbb{1}(s_t=s, a_t=a)$
• Taken action $a_t$ to observe $r_{t+1}$ and $s_{t+1}$
• Sample $a_{t+1}$ from $s_{t+1}$ according to $\pi$
• Calcualte TD error $\delta_t \leftarrow r_{t+1}+\gamma q(s_{t+1}, a_{t+1})-q(s_t, a_t)$
• For each $(s,a)$
• update $q(s,a)\leftarrow q(s,a) + \alpha_t\delta_tE_t(s,a)$
• $t\leftarrow t + 1$ and repeat unless terminated
• $\pi\leftarrow G_{\text{GLIE}}(q)$