Up until now, all our works are done under the assumption that a MDP is known, in particular the transition probability $P$ and rewords $R$ are known. However, those two things are determined by the environment, and in real life typically not known by the agent. For example, when playing a game, unless you have read the manual, you have no idea what will happen if you press the up key, and what reward you will get. You must play and observe for a while to generate a good guess.

To handle this, there are two main approaches, differed by creating a model of the environment or not. For example, physics is a model for the universe. This note is for the model free approach to handle prediction, corresponding to Lecture 4: Model-Free Prediction

Recall that a prediction problem is to figure out the value function $v(s)$ given the policy $\pi$.

So, what can we do as the agent, knowing literally nothing at all about the outside world? Sampling!

# Monte-Carlo

Perhaps the most naive and ancient approach to tackle randomness is to use the so called Monte-Carlo method, by estimating the true expectation by empirical average.

In the RL case, to find $v(s)$ for each $s$, we just sample several episode $(s_1, a_1, r_2, s_2, \cdots, s_{T}, a_{T}, r_{T+1}, s_{T+1})$, and calculate the value function $v(s)$ for each state $s$ as the average of all the accumulated discounted rewards starting from state $s$ in all episodes. Note that we could only care $s_1=s$ as well.

The algorithm can be described as:

• For nth episode
• Sample $(s_1, a_1, r_2, s_2, \cdots, s_{T}, a_{T}, r_{T+1}, s_{T+1})$:
• for each (or the fist) $s_t$ in the episode:
• calculate $v\leftarrow r_{t+1}+\gamma r_{t+2}+\cdots +\gamma^{T-t} r_{T+1}$
• accumulate $N(s_t)\leftarrow N(s_t) + 1$
• update $v(s_t)\leftarrow v(s_t) + \frac{1}{N(s_t)}(v-v(s_t))$

Note that here we used a common trick to calculate moving mean:

$$\mu_n=\mu_{n-1}+\frac{1}{n}(a_n-\mu_{n-1})$$

Obviously, there are at least two drawbacks for this approach:

• The episode must terminate.
• We must wait until the episode terminates.

To resolve those two issues, we need to play smarter.

# Temporal-Difference

The idea is, instead of sampling a whole episode, we sample only 1 step forward, and then use the current guess of the new state to update the previous state. The algorithm is:

• For nth episode
• Sample forward one step: $(s_t, a_t, r_{t+1}, s_{t+1})$:
• update $v(s_t)\leftarrow v(s_t) + \alpha(r_{t+1}+\gamma v(s_{t+1})-v(s_t))$

The $r_{t+1}+\gamma v(s_{t+1})$ term is called the TD target, which reflects the new knowledge of $v(s_t)$ after performing an action.

The whole $r_{t+1}+\gamma v(s_{t+1})-v(s_t)$ term is called the TD error, which is the difference between the TD target and the previous guess of $v(s_t)$.

So in one word, every time we move forward one step to get a new state, and update the value function of the previous state by a tiny fraction $\alpha$ of the TD error.

The method that using current guess to update a new guess is called bootstrapping.

Note that the only difference between Monte-Carlo and TD is that, TD bootstraps after 1 step, while Monte-Carlo does not bootstrp at all. So, why 1 step? Why not 2, or 100? Of course we can.

Define n-step return by

\begin{aligned} G_t^{(n)} &= R_{t+1}+\gamma R_{t+2}+\cdots + \gamma^{n-1} R_{t+n} + \gamma^{n}v(S_{t+n})\\ &=\sum_{i=1}^n\gamma^{i-1} R_{t+i}+\gamma^{n}v(S_{t+n}) \end{aligned}

A little word about the notation. Since the $n$ in $G_t^{(n)}$ has to start from $1$, when we say the episode terminates at timestamp $T$, it means the last piece of the sequence is $(s_T, a_T, r_{T+1}, s_{T+1})$, where $s_{T+1}$ is the terminate state.

Now, the Monte-Carlo is simply to update by

$$v(S_{t+1})\leftarrow v(S_{t+1}) + \frac{1}{N} \left(G_t^{(T-t+1)} - v(S_{t+1})\right)$$

and the TD is simply to update by

$$v(S_{t+1})\leftarrow v(S_{t+1}) + \alpha\left(G_t^{(1)} - v(S_{t+1})\right)$$

So why can’t we use $G_t^{(n)}$ to update, like

$$v(S_{t+1})\leftarrow v(S_{t+1}) + \alpha\left(G_t^{(n)} - v(S_{t+1})\right)$$

which apparently will perform better? Yes it is, but depends, on $\alpha$. Note that for different step size $\alpha$, we need different $n$ to get the best result. This is bad, and we want a method where the performance depends (almost) only on a single variable, like $\alpha$.

# TD($\lambda$)

To achieve the above goal, we use a simple solution: just like accumulating discounted rewards as the final reward, we accumulate the discounted n-step return to make the final return, i.e. TD target. We define

$$G_t^{\lambda} \triangleq (1-\lambda)\left(G_t^{(1)}+\lambda G_t^{(2)}+\lambda^2G_t^{(3)}+\cdots\right)$$

and after the termination timestamp $T$, all terms are $G_t^{(T-t+1)}$, i.e. it absorbs all tails. However, to handle $\lambda=1$ correctly, we rewrite the definition as

$$G_t^{\lambda} \triangleq (1-\lambda)\sum_{n=1}^{T-t}\lambda^{n-1}G_t^{(n)} + \lambda^{T-t}G_t^{(T-t+1)}$$

Now we can update the value function towards this integrated return, i.e.

$$v(S_{t+1})\leftarrow v(S_{t+1}) + \alpha\left(G_t^{\lambda} - v(S_{t+1})\right)$$

Does it solve the previous problem? It does! Experiments show that best results are always achieved when $\lambda$ is around $0.9$, so it almost only depends on $\alpha$!