RL Note (5.2) - Q-Learning

In RL Note (5.1) - Model-Free Control, we explored three ways to perform model-free control to find optimal policy through calculating the optimal action value function. All those three approaches share a common property: they all start from an initial policy and iteratively update the policy itself to approach the optimal one. Such property is called on-policy.

However, in reality, there are often the cases when we have to separate the single policy into two policies: a target policy $\pi$ to compute value function $v$ and $q$, and a behavior policy $\mu$ to pick actions. Such property is called off-policy.

Following are several reasons for off-policy learning.

  • Sometimes we want to learn from some external oracle, such as teaching a robot to move by observing a person, where $\mu$ is the policy the person takes, and $\pi$ is the one the robot need to learn.
  • Sometimes we want to reuse samples generated by previous policy $\pi_t$ during an iterative process to best explore the data. In such case we can treat those old policies as several external behavior policies.
  • Note that the exploration and exploitation is always a contradiction in RL, since the optimal target policy $\pi$ should always be deterministic, and avoid exploitation at all. Thus, we can use a stochastic behavior policy $\mu$ to act as an exploratory policy to guide the exploration during learning.
  • Finally, we may want to learn multiple policies by following one policy. One rough example is that we may want to learn how to study, how to work, and how to exercise based on a single policy, i.e. each one’s life trajectory.

Based on all those reasons, we have to develop a way to learn a policy $\pi$ following another policy $\mu$.


Now let’s consider off-policy learning of action value function $q(s,a)$ in a model-free setting. The process is like SARSA, but with slight difference when sampling actions:

  • Given the behavior policy $\mu$
  • Initialize the action value function $q(s,a)$ for all $s, a$, and a policy $\pi$ as the target policy
  • For $n$th episode
    • $t\leftarrow 1$
    • Initialize the starting state $s_t$
    • Loop
      • Sample $a_t$ from $s_t$ according to behavior policy $\mu$
      • Taken action $a_t$ to observe $r_{t+1}$ and $s_{t+1}$
      • Sample $a'$ from $s_{t+1}$ according to target policy $\pi$
      • update $q(s_t, a_t)\leftarrow q(s_t, a_t) + \alpha_t(r_{t+1}+\gamma q(s_{t+1}, a')-q(s_t, a_t))$
      • $t\leftarrow t + 1$ and repeat unless terminated
    • $\pi\leftarrow G_{\text{GLIE}}(q)$

Note that comparing to SARSA, sampling $a_t$ is moved into the loop.

Furthermore, what if the behavior policy is not external and given, but also generated during the process like $\pi$? In fact, it is a good way to handle the exploration-exploitation issue. Note that we know $\pi$ should be deterministic, and $\mu$ stochastic, as at each step we update these two as

$$\begin{aligned} &\pi\leftarrow G(q)\\ &\mu\leftarrow G_{\epsilon}(q) \end{aligned}$$

Since at timestamp $t$, $a'$ is picked throuth $\pi$, thus we get

$$r_{t+1}+\gamma q(s_{t+1}, a') = r_{t+1} + \gamma\max\limits_aq(s_{t+1}, a)$$

As a result, after implicitly absorbing $\mu$ and $\pi$, the algorithm simplifies to

  • Initialize the action value function $q(s,a)$ for all $s, a$
  • For $n$th episode
    • $t\leftarrow 1$
    • Initialize the starting state $s_t$
    • Loop
      • Sample $a_t$ from $s_t$ based on $q$ $\epsilon$-greedily
      • Taken action $a_t$ to observe $r_{t+1}$ and $s_{t+1}$
      • update $q(s_t, a_t)\leftarrow q(s_t, a_t) + \alpha_t(r_{t+1}+\gamma \max\limits_aq(s_{t+1}, a)-q(s_t, a_t))$
      • $t\leftarrow t + 1$ and repeat unless terminated

It is named Q-Learning.

Correspondence between DP and TD

The following table demonstrates the relationship between two approaches: dynamic programming and temporal difference. Note that $x\Leftarrow y$ means $x\leftarrow x + \alpha(y-x)$ for some step size $\alpha$.

Dynamic Programming Temporal Difference
use case when model is known when model is unknown
strategy directly computing the expectation after one step sampling one step
updating for $v(s_t)$ $\leftarrow\mathbb{E}(R_{t+1}+\gamma v(S_{t+1}) | s)$ $\Leftarrow r_{t+1} + \gamma v(s_{t+1})$
updating for $q(s_t, a_t)$ $\leftarrow\mathbb{E}(R_{t+1}+\gamma q(S_{t+1}, A_{t+1}) | s,a)$ $\Leftarrow r_{t+1} + \gamma q(s_{t+1}, a_{t+1})$
updating for $q(s_t,a_t)$ using deterministicity of the optimal policy $\leftarrow\mathbb{E}(R_{t+1}+\gamma \max\limits_{a'} q(S_{t+1}, a') | s,a)$ $\Leftarrow r_{t+1} + \gamma \max\limits_{a'}q(s_{t+1}, a')$