RL Note (4.2) - Backward View TD

The forward view of TD($\lambda$) is good, but it has the same problems as Monte-Carlo:

  • The episode must terminate.
  • We must wait until the episode terminates.

Well, by the very definition of $G_t^{(\lambda)}$, the episode needs to terminate. This is tolerable, since in many tasks episodes do terminate, and we can enforce termination manually. The second issue, however, is absolutely intolerable: having to wait for the end of every episode before updating leads, for example, to inefficient implementations.

Luckily, there is a very clever way to resolve this. I will state the conclusion directly and then provide the proof, since I have no idea how to deduce it from scratch :(

Eligibility Trace

We maintain an eligibility trace for each state $s$ at each timestamp $t$:

$$\begin{aligned} E_0(s) &= 0\\ E_t(s) &= \gamma\lambda E_{t-1}(s) + \mathbb{1}(S_t=s) \end{aligned}$$

Intuitively, the eligibility of a state bumps up by $1$ each time we visit that state, and at every timestamp it decays exponentially by a factor of $\gamma\lambda$.
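
For instance (made-up numbers), with $\gamma=0.9$ and $\lambda=0.8$ (so $\gamma\lambda=0.72$), if we visit state $s$ at timestamps $1$ and $3$ but not at $2$:

$$\begin{aligned} E_1(s) &= 0.72\times 0 + 1 = 1\\ E_2(s) &= 0.72\times 1 + 0 = 0.72\\ E_3(s) &= 0.72\times 0.72 + 1 = 1.5184 \end{aligned}$$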

We perform the update by

$$\begin{aligned} \delta_t &= R_{t+1}+\gamma v(S_{t+1}) - v(S_t)\\ v(s)&\leftarrow v(s) + \alpha\delta_tE_t(s), \forall s \end{aligned}$$

Note that the update is applied to every state $s$, not only the one we just left. The algorithm can be described as follows (a Python sketch is given after the list):

  • Start an episode, given initial state $s_1$; initialize $E_0(s)=0$ for all $s$
  • For each timestamp $t=1,2,\dots$:
    • for each state $s$: update the trace $E_t(s)\leftarrow \gamma\lambda E_{t-1}(s)+\mathbb{1}(s_t=s)$
    • sample forward one step: $(s_t, a_t, r_{t+1}, s_{t+1})$
    • calculate the TD error $\delta_t \leftarrow r_{t+1}+\gamma v(s_{t+1}) - v(s_t)$
    • for each state $s$: update $v(s)\leftarrow v(s) + \alpha\delta_tE_t(s)$
    • stop if the episode has terminated, otherwise continue with timestamp $t+1$
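
To make the loop concrete, here is a minimal tabular Python sketch of backward-view TD($\lambda$) policy evaluation. The environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(next_state, reward, done)`) and the `policy` function are illustrative assumptions, not part of the note:

```python
import numpy as np

def backward_td_lambda(env, policy, n_states, alpha=0.1, gamma=0.99, lam=0.9, n_episodes=100):
    """Tabular backward-view TD(lambda): online updates via eligibility traces."""
    v = np.zeros(n_states)
    for _ in range(n_episodes):
        e = np.zeros(n_states)                    # eligibility traces, reset each episode
        s = env.reset()                           # assumed interface: returns initial state index
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))  # assumed: (next_state, reward, done)
            e *= gamma * lam                      # decay every trace
            e[s] += 1.0                           # bump the trace of the current state
            target = r if done else r + gamma * v[s_next]  # bootstrap 0 at terminal states
            delta = target - v[s]                 # TD error
            v += alpha * delta * e                # update every state, weighted by its trace
            s = s_next
    return v
```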

With this approach, we no longer need to wait until a complete episode terminates before performing updates. But does it give the same result as the original (forward-view) TD$(\lambda)$? Surprisingly, yes!


The sum of offline updates is identical for forward-view and backward-view TD$(\lambda)$
$$\sum_{t=1}^T\alpha\delta_tE_t(s) = \sum_{t=1}^T\alpha\left(G_t^{\lambda}-v(S_t)\right)\mathbb{1}(S_t=s), \forall s$$

Here offline means that within each episode we compute all updates from a fixed copy of $v(s)$ and apply the accumulated changes only at the end, instead of updating $v(s)$ on the fly.

This theorem says that, over one episode, the total change applied to each state $s$ is the same under the two approaches. Note that it does not say the partial changes before termination are the same; in general they are not.
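
As a quick sanity check (not part of the argument), the sketch below compares the two offline sums numerically on a randomly generated episode with $v$ held fixed. All numbers ($\gamma$, $\lambda$, episode length, state count) are made up, and the terminal state's value is taken to be $0$ on both sides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up episode: states S_1..S_{T+1} (S_{T+1} terminal), rewards R_2..R_{T+1}, fixed value table v.
n_states, T = 4, 6
gamma, lam, alpha = 0.9, 0.8, 0.1
states = rng.integers(0, n_states, size=T + 1)
rewards = rng.normal(size=T)
v = rng.normal(size=n_states)

def value(i):
    """Value used for bootstrapping at 0-indexed position i; the terminal state gets 0."""
    return 0.0 if i == T else v[states[i]]

def n_step_return(t, n):
    """G_t^{(n)} for the 0-indexed timestep t."""
    return sum(gamma**i * rewards[t + i] for i in range(n)) + gamma**n * value(t + n)

def lambda_return(t):
    """Forward-view lambda-return; the remaining weight goes to the final (full) return."""
    g = (1 - lam) * sum(lam**(n - 1) * n_step_return(t, n) for n in range(1, T - t))
    return g + lam**(T - t - 1) * n_step_return(t, T - t)

# Forward view: sum of offline updates per state.
forward = np.zeros(n_states)
for t in range(T):
    forward[states[t]] += alpha * (lambda_return(t) - v[states[t]])

# Backward view: sum of offline updates per state (v fixed within the episode).
backward = np.zeros(n_states)
e = np.zeros(n_states)
for t in range(T):
    e *= gamma * lam
    e[states[t]] += 1.0
    delta = rewards[t] + gamma * value(t + 1) - v[states[t]]
    backward += alpha * delta * e

print(np.allclose(forward, backward))  # True
```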

Proof

Here we prove the above theorem. Initially I tried brute force, expanding both sides, which turned out to be too complicated. Then I turned to mathematical induction, which works out happily.

After cancelling $\alpha$, we need to prove

$$\sum_{t=1}^T\delta_tE_t(s) = \sum_{t=1}^T\left(G_t^{\lambda}-v(S_t)\right)\mathbb{1}(S_t=s), \forall s$$

We perform induction on the terminating time $T$. When $T=1$, the left hand side is

$$(R_2+\gamma v(S_2)-v(S_1))\mathbb{1}(S_1=s)$$

while the right hand side is

$$(G_1^{\lambda}-v(S_1))\mathbb{1}(S_1=s)$$

Also, when terminating after only $1$ step, we have

$$G_1^{\lambda} = G_1^{(1)} = R_2+\gamma v(S_2)$$

So the equation holds.

Now assume the theorem holds for some $T\geqslant 1$; we need to show it holds for $T+1$. What is the difference on each side when $T$ becomes $T+1$? On the left hand side, we get exactly one extra term

$$\delta_{T+1}E_{T+1}(s)$$

The right hand side is not as obvious, since the dependence on $T$ is hidden inside $G_t^{\lambda}$, not only in the summation limit. To emphasize this, we temporarily write $G_t^{\lambda,T}$ for the $G_t^{\lambda}$ of an episode terminating at timestamp $T$, and similarly $G_t^{(n,T)}$ for the corresponding $n$-step return.

We have

$$\begin{aligned} G_t^{\lambda,T} &= (1-\lambda)\sum_{n=1}^{T-t}\lambda^{n-1}G_t^{(n,T)} + \lambda^{T-t}G_t^{(T+1-t, T)}\\ G_t^{\lambda,T+1} &= (1-\lambda)\sum_{n=1}^{T-t+1}\lambda^{n-1}G_t^{(n,T+1)} + \lambda^{T+1-t}G_t^{(T+2-t, T+1)} \end{aligned}$$

Note that when $n\leqslant T+1-t$, $G_t^{(n,T)}$ and $G_t^{(n,T+1)}$ are the same, because bootstrapping cuts the return off after $n$ steps, so it does not depend on when the episode terminates. Hence the difference of the two $\lambda$-returns is

$$\begin{aligned} G_t^{\lambda,T+1}-G_t^{\lambda,T} &= (1-\lambda)\lambda^{T-t}G_t^{(T+1-t,T+1)}+ \lambda^{T+1-t}G_t^{(T+2-t,T+1)}-\lambda^{T-t}G_t^{(T+1-t,T)}\\ &=\lambda^{T+1-t}\left(G_t^{(T+2-t,T+1)}-G_t^{(T+1-t,T+1)}\right)\\ &=\lambda^{T+1-t}\left(\left(\sum_{i=1}^{T+2-t}\gamma^{i-1} R_{t+i}+\gamma^{T+2-t}v(S_{T+2})\right)-\left(\sum_{i=1}^{T+1-t}\gamma^{i-1} R_{t+i}+\gamma^{T+1-t}v(S_{T+1})\right)\right)\\ &=\lambda^{T+1-t}\left(\gamma^{T+1-t} R_{T+2}+\gamma^{T+2-t}v(S_{T+2})-\gamma^{T+1-t}v(S_{T+1})\right)\\ &=(\lambda\gamma)^{T+1-t}\left(R_{T+2}+\gamma v(S_{T+2})-v(S_{T+1})\right) \end{aligned}$$

Looks simple, so it must be right! Now, when $T$ becomes $T+1$, the right hand side changes in two ways: each existing term ($t\leqslant T$) picks up the difference above, and a brand-new term appears for $t=T+1$. The total difference of the right hand side is therefore

$$\sum_{t=1}^T(\lambda\gamma)^{T+1-t}\left(R_{T+2}+\gamma v(S_{T+2})-v(S_{T+1})\right)\mathbb{1}(S_t=s) + (G_{T+1}^{\lambda,T+1}-v(S_{T+1}))\mathbb{1}(S_{T+1}=s)$$

Note that

$$G_{T+1}^{\lambda,T+1} = R_{T+2}+\gamma v(S_{T+2})$$

We can rewrite the difference as

$$\left(R_{T+2}+\gamma v(S_{T+2})-v(S_{T+1})\right)\sum_{t=1}^{T+1}(\lambda\gamma)^{T+1-t}\mathbb{1}(S_{t}=s)$$

Now we must show that the differences of the two sides are equal, that is,

$$\delta_{T+1}E_{T+1}(s)=\left(R_{T+2}+\gamma v(S_{T+2})-v(S_{T+1})\right)\sum_{t=1}^{T+1}(\lambda\gamma)^{T+1-t}\mathbb{1}(S_{t}=s)$$

It holds! Indeed, it is not hard to show that

$$\begin{aligned} \delta_{T+1}&=R_{T+2}+\gamma v(S_{T+2})-v(S_{T+1})\\ E_{T+1}(s)&=\sum_{t=1}^{T+1}(\lambda\gamma)^{T+1-t}\mathbb{1}(S_{t}=s) \end{aligned}$$
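
The first line is just the definition of the TD error at timestamp $T+1$, and the second follows by unrolling the trace recursion from $E_0(s)=0$:

$$E_{T+1}(s) = \gamma\lambda E_T(s) + \mathbb{1}(S_{T+1}=s) = (\gamma\lambda)^2E_{T-1}(s) + \gamma\lambda\,\mathbb{1}(S_T=s) + \mathbb{1}(S_{T+1}=s) = \cdots = \sum_{t=1}^{T+1}(\lambda\gamma)^{T+1-t}\mathbb{1}(S_t=s)$$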

Wonderful!