RL Note (1) - Introduction to Reinforcement Learning

Recently I began watching the RL Course by David Silver, and will write a series of notes on this topic. This first note records some symbols and definitions, corresponding to Lecture 1: Introduction to Reinforcement Learning.

Basic Concepts

In reinforcement learning, we study an agent interacting with an environment in a sequential manner. At each time step, the agent performs some action based on the history in its memory. At the next time step, the environment responds with a scalar reward and a new observation.

The history consists of a sequence of observations, actions and rewards, say $h_T = (\ldots, o_t, a_t, r_{t+1}, o_{t+1}, \ldots, o_T)$. However, the full history can be too large to handle. As a result, we define the concept of state, which is simply a function of the history, i.e.

$$s_t = f(h_t)$$

Now we can say the agent acts based on the current state.
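As a concrete (hypothetical) choice of $f$, a common way to compress an unbounded history into a fixed-size state is to keep only the last $k$ observations; the sketch below assumes that choice and a made-up history layout:

```python
# A minimal sketch of the agent state as a function of the history,
# s_t = f(h_t). Here f keeps only the last k observations -- one of many
# possible choices, not the one from the lecture.

def make_state(history, k=3):
    """Return the agent state s_t = f(h_t): the last k observations."""
    observations = [step["obs"] for step in history]
    return tuple(observations[-k:])

# Illustrative history: each entry holds (observation, action, reward).
history = [
    {"obs": "o1", "action": "a1", "reward": 0.0},
    {"obs": "o2", "action": "a2", "reward": 1.0},
    {"obs": "o3", "action": "a3", "reward": 0.5},
    {"obs": "o4", "action": None, "reward": None},
]

state = make_state(history)
print(state)  # ('o2', 'o3', 'o4')
```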

There are two kinds of state: the agent state and the environment state. The former is by definition known to the agent, while the latter may or may not be visible to it, which distinguishes fully observable from partially observable problems.

At every time step, the agent is in some state and must perform some action. The rule according to which the agent chooses an action is called a policy, usually denoted $\pi(a \mid s)$. It can be stochastic or deterministic.
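The two kinds of policy can be sketched as follows (the states, actions and probabilities here are made up for illustration):

```python
import random

# A deterministic policy maps each state to a single action;
# a stochastic policy defines a distribution pi(a|s) over actions.

def deterministic_policy(state):
    """pi(s) -> a: always the same action for a given state."""
    return "right" if state >= 0 else "left"

def stochastic_policy(state, rng=random):
    """pi(a|s): sample an action from a state-dependent distribution."""
    if state >= 0:
        probs = {"right": 0.8, "left": 0.2}
    else:
        probs = {"right": 0.2, "left": 0.8}
    actions, weights = zip(*probs.items())
    return rng.choices(actions, weights=weights)[0]

print(deterministic_policy(1.0))  # right
```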

These are the basic concepts of reinforcement learning.


Our agent does not interact with the environment for nothing: it has a goal, which by definition consists of the immediate reward together with all future rewards. There is a fundamental axiom of reinforcement learning, which says

(Reward Hypothesis) All goals can be described by the maximization of expected cumulative rewards.

All work in reinforcement learning is based on this assumption.
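"Cumulative reward" is usually made precise as the discounted return $G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$, where the discount factor $\gamma$ is introduced in later lectures; the reward values below are made up:

```python
# Sketch: discounted return G_t, computed backwards through the
# reward sequence so each reward is discounted by gamma per step.

def discounted_return(rewards, gamma=0.9):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2, i.e. about 2.62
```

The reward hypothesis then says that any goal can be phrased as maximizing the expectation of this quantity.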

Characteristics of Reinforcement Learning

The main differences between reinforcement learning and other machine learning methods are:

  • No supervisor, only reward signals from the environment.
  • Feedback from the environment may be delayed.
  • Data comes from a sequential random process, not i.i.d. samples.

We should also distinguish between reinforcement learning and planning. In the planning case, the model of the environment, i.e. $P_{ss'}^a$ and $R_s^a$, is known to the agent, which can perform calculations according to the model. In the reinforcement learning case, the model is unknown, and the agent can only update its policy according to the feedback of the environment.
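The contrast can be sketched on a toy two-state chain (all numbers below are made up): with a known model we can do an exact Bellman backup, while without one we can only update estimates from sampled transitions, as in a Q-learning-style rule:

```python
# Planning vs. learning on a toy two-state problem.
GAMMA = 0.9

# Known model (planning case): P[s][a] lists (probability, next_state)
# pairs, i.e. P_{ss'}^a, and R[s][a] is the expected reward R_s^a.
P = {0: {"a": [(1.0, 1)]}, 1: {"a": [(1.0, 0)]}}
R = {0: {"a": 1.0}, 1: {"a": 0.0}}

def bellman_backup(V, s, a):
    """Planning: exact expected backup, possible because the model is known."""
    return R[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])

def q_learning_update(Q, s, a, r, s2, alpha=0.1):
    """Learning: the model is unknown; update from one sampled transition."""
    target = r + GAMMA * max(Q[s2].values())
    Q[s][a] += alpha * (target - Q[s][a])

V = {0: 0.0, 1: 0.0}
print(bellman_backup(V, 0, "a"))  # 1.0
```

The planning agent uses $P_{ss'}^a$ and $R_s^a$ directly; the learning agent only ever sees individual `(s, a, r, s')` samples.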

I think a typical example is our interaction with the earth. The environment is the earth, which has an internal law determining the evolution of the state. We, as agents, know nothing about that law, but have come up with a pretty good approximation: the laws of physics. Using physics, we can in some simple cases accurately determine $P_{ss'}^a$ and $R_s^a$, which turns the problem into a planning one. In other situations, applying the laws of physics is impossible or too complex, which leads us to the reinforcement learning case.