This note is for lecture Lecture 6: Value Function Approximation.

Up until now, we have deduced several algorithms for both prediction and control, in theory. However, when applying them to practice, we have to face one challenge: the number of states or state-action pairs may be tremendously huge.

The most naive approach to implement those algorithms, in particular to represent value functions $v(s)$ and $q(s,a)$, is to build a lookup table. In reality, however, it is not applicable. For example, for game Backgammon we have roughly $10^{20}$ states, and $10^{170}$ states for the game Go. To make things worse, we have a continuous state space when training a helicopter.

A natural approach to resolve the issue is to build a model $\hat{v}(s;w)$ or $\hat{q}(s,a;w)$ parametrized by some parameters $w$ to approximate the true mapping $v(s)$ or $q(s,a)$.

One way to learn weights $w$ is to act as if it is a supervised learning problem. Now suppose some oracle has told us the true value function $v^{\pi}(s)$ for some policy $\pi$. Our target is to minimize the following loss:

$$L(w) = \mathbb{E}_{\pi}\left[(v^{\pi}(S)-\hat{v}(S;w))^2\right]$$

How to minimize such loss with respect to $w$? Gradient descent! To be more specific, we calculate for some step size $\alpha$ $$\begin{aligned} \Delta w&=-\frac{1}{2}\alpha\nabla_w L(w)\\ &=\alpha\mathbb{E}_{\pi}\left[(v^{\pi}(S)-\hat{v}(S;w))\nabla_w \hat{v}(S;w)\right] \end{aligned}$$

However, it is impossible to calculate the actual expectation, so we have to compromise. The typically way to compromise is sampling.

Suppose we have sampled several steps according to policy $\pi$, and stored them in a set named *experience*, denoted by $\mathcal{D}$:

$$\mathcal{D}=\left\{(s_1, v^{\pi}_1),\dots,(s_n, v^{\pi}_n)\right\}$$

Now we have two ways to learn our parameters $w$ using the experience.

# Incremental Method

One way is to use the sample one by one sequentially, or say incrementally.

- for $t$ in $1\dots n$
- calculate $\Delta w=\alpha(v^{\pi}(s_t)-\hat{v}(s_t;w))\nabla_w \hat{v}(s_t;w)$
- $w\leftarrow w+\Delta w$

In practice, we may not need to sample the whole $\mathcal{D}$ in advance, but sample at the beginning of each step.

Obviously, there is a drawback of the incremental method: the samples are not utilized to their best. We sample one by one, and discard the previous sample immediately after updating the weights using it. This drawback encourages the so-called batch method.

# Batch Method

Instead of sampling and discarding one by one, we set our goal to find an approximator $\hat{v}(s;w)$ best fitting the experience $\mathcal{D}$, in the sense to minimize the *least squares* of the approximation:

$$L(w)=\sum_{i=1}^{n}(v^{\pi}(s_i)-\hat{v}(s_i;w))^2$$

One way to look at it is to rewrite the original loss as

$$\begin{aligned}
L(w) &= \mathbb{E}_{S\sim\pi}\left[(v^{\pi}(S)-\hat{v}(S;w))^2\right]\\
&=\mathbb{E}_{D\sim\pi}\left[\mathbb{E}_{S\sim D}\left[(v^{\pi}(S)-\hat{v}(S;w))^2\right]\right]
\end{aligned}$$

Instead of sampling $S$ one by one, we sample $\mathcal{D}$, i.e. a batch of $S$, to approximate the desired expectation.

How to minimize the least squares? One way is to calculate the full gradient contributed by $n$ samples, i.e gradient descent. However, in reality $n$ may be very large. As a result, we use *stochastic gradient descent*:

- repeat sufficient times
- randomly pick a sample $(s_i, v^{\pi}_i)$
- calculate $\Delta w=\alpha(v^{\pi}_i-\hat{v}(s_i;w))\nabla_w \hat{v}(s_i;w)$
- $w\leftarrow w+\Delta w$

# Missing Oracle

Up until now, we are assuming we know the true value function $v^{\pi}$. However, in reality we never know it. As a compromise, we replace it by the target used in Monte-Carlo, Temporal-Difference and TD($\lambda$) prediction.

Prediction Method | Replacement for $v^{\pi}$ | Require full epoch |
---|---|---|

Monte-Carlo | $G^t$ | yes |

Temporal-Difference | $r_{t+1}+\gamma \hat{v}(s_{t+1};w)$ | no |

TD($\lambda$) in forward-view | $G^{\lambda}$ | yes |

Note that the vanilla TD($\lambda$) requires seeing the full epoch. Thankfully, similar to previous approach, we can use eligibility trace to rewrite the TD($\lambda$) in backward view:

$$\begin{aligned}
\delta_t&=r_{t+1}+\gamma \hat{v}(s_{t+1};w)-\hat{v}(s_{t+1};w)\\
E_t&=\gamma\lambda E_{t-1} + \nabla_w \hat{v}(s_t;w)\\
\Delta w &= \alpha\delta_t E_t
\end{aligned}$$

Note that in the TD and TD($\lambda$) cases, the new term is dependent on $\hat{v}$. However, we **must not** differentiate it when calculating the gradient. An explanation is that we always treat this new term as the replacement, or approximation, of the true $v^{\pi}$, and update our target value function toward it, and not vise versa, i.e. we should not update the assumed true value towards our approximator. Anyway, in reality, if we take this new term into account when doing gradient descent, the performance dramatically drops.

# Choice for function approximators

We have settled down the overall structure of the prediction algorithm using a value function approximator, but we haven’t answer the very first question: what approximator should we use?

Generally speaking, there are two categories: linear and non-linear approximator. The former is simple and straight forward, while the latter is typically a neural network.

## Linear function approximator

In the linear case, we assume

$$\hat{v}(s;w)=\langle x(s), w\rangle$$

where $x(s)$ is the *feature vector* of the state not containing any weights, e.g. price history of a stock, or reading of sensors of a robot. In particular, if $x(s)$ is a one-hot vector for each $s$, the linear approximation degenerates to table lookup.

One great advantage of using linear approximation in an optimization problem is that it always has simple gradient and direct, closed form solution. In fact, we have

$$\nabla_w \hat{v}(s;w) = x(s)$$

Such gradient can be plugged into both Incremental Method and Batch Method directly.

Furthermore, in Batch Method case, the problem turns out to be the classic Linear Least Squares problem, in which the optimal weight $w$ can be directly solved by

$$w = (Y^{\intercal}Z)^{-1}Y^{\intercal}V$$

where

$$Y=\begin{pmatrix}
y^{\intercal}_1\\
\vdots\\
y^{\intercal}_n\\
\end{pmatrix},
Z=\begin{pmatrix}
z^{\intercal}_1\\
\vdots\\
z^{\intercal}_n\\
\end{pmatrix},
V=\begin{pmatrix}
v^{\intercal}_1\\
\vdots\\
v^{\intercal}_n\\
\end{pmatrix}$$

are two $n\times m$ matrix, and a $n\times 1$ vector, respectively, with entries described in the following table.

Prediction Method | $y_t$ | $z_t$ | $v_t$ |
---|---|---|---|

Monte-Carlo | $x(s_t)$ | $x(s_t)$ | $G_t$ |

Temporal-Difference | $x(s_t)$ | $x_t-\gamma v(s_{t+1})$ | $r_{t+1}$ |

TD($\lambda$) in backward-view | $E_t$ | $x_t-\gamma v(s_{t+1})$ | $r_{t+1}$ |

## Non-linear function approximator

In more complicated case, we approximate the value function $\hat{v}(s;w)$ using a neural network. Besides, when doing control instead of prediction, we model $\hat{q}(s,a;w)$. There are two ways to model it:

- $\hat{q}:S\times A\rightarrow \mathbb{R}$
- $\hat{q}:S\rightarrow\mathbb{R}^{|A|}$

The second one is usually preferred.

A well-known success of using a non-linear approximator is the **Deep Q-Network**.

- Initialze an empty experience set $\mathcal{D}$.
- Initialize $w$, and set $w^{-}=w$
- Repeat for several epochs
- Initialize timestmap $t\leftarrow 0$
- Initialize $s_t$
- Repeat
- Break if $s_t$ is termination.
- $w^{-}\leftarrow w$ after $C$ steps.
- Take an action $a_t$ from current state $s_t$ according to $\hat{q}(s_t,a;w)$ $\epsilon$-greedily.
- Apply $a_t$ to observe $r_{t+1}$ and $s_{t+1}$.
- Store $(s_t, a_t, r_{t+1}, s_{t+1})$ in $\mathcal{D}$
- $t\leftarrow t+1$ and jump to next loop if $\mathcal{D}$ is not big enough.
- Randomly sample a mini-batch $\mathcal{D}_s$ from $\mathcal{D}$.
- Use variant of SGD to minizie$$L(w)=\mathbb{E}_{(s,a,r',s')\sim \mathcal{D}_s}\left[\left(r+\gamma\max_{a'}\hat{q}(s',a';w^{-})-\hat{q}(s,a;w)\right)^2\right]$$
- $t\leftarrow t+1$

The original paper can be found here, and a TensorFlow implementation can be found here.

Note there are two key innovation in DQN, the first one is **experience replay**, as has mentioned, which greatly reduces the correlation between training samples. The second one is **fixed $Q$-target**, which fixes the parameters of target policy for several steps, and results in greatly stabilizing the training process.