point

Remember me

## Reinforcement Learning in R: Markov Decision Process (MDP) and Value Iteration

Mon, 06 Jan 2014 20:07:37 GMT

How can we find the best long-term plan?

In the last post, we looked at the idea of dynamic programming, where a problem is divided into overlapping parts, each part is solved independently, and then aggregated back up into a global solution. This is one of the most basic and widely used methods for solving problems that involve optimal long-term planning.  I am emphasizing the long-term part because typical machine learning or statistical models consider the optimal solution to the current state.  This short-term solution can lead to sub-optimal behaviors over many steps when there is a dependency between each observation.  Many statistical models assume independent observations, but this is not always a valid assumption.

Reinforcement Learning (RL) is an important but sometimes overlooked method for solving complex planning problems.  Many machine learning textbooks ignore this subject entirely, instead focusing strictly on supervised and unsupervised learning.  RL grew independently within the Artificial Intelligence (AI) and Operations Research (OR) communities (and is increasingly applied in other fields such as finance and economics).

• AI often deals with problems to do with playing games (e.g. checkers, chess) and robots.  These are examples of problems that require taking actions over time to find a long-term optimal solution to a dynamic problem.  They are dynamic in the sense that the conditions are constantly changing, sometimes in response to another agent who can be adversarial.
• OR has a subfield known as optimal control theory, which covers dynamic programming (DP) and approximate dynamic programming (ADP).  ADP is a very specialized topic which aims to solve problems that cannot be solved directly by DP because of the curse of dimensionality.

As with almost all academic disciplines, the lines between these fields has become increasingly fuzzy over time and there is much more interdisciplinary work being done.

### Reinforcement Learning, and Markov Decision Processes

Reinforcement learning models all tend to have the following attributes:

• Environment (E): There is an environment which the agent inhabits.  In mathematics, we think about problems existing within a domain.  The environment provides the same kind of context to an RL problem, by limiting the scope in some clear way, by defining what states are available, what actions can be taken, and other attributes of the problem.
• State (S): The agent have a state value which changes over time.  This is often a finite discrete variable, meaning that it can only vary over a few different values, although it is also important that it will sometimes be less constrained.
• Action (A): Given a current state, the agent is allowed to choose an action.  Actions are also often defined as finite discrete variable as well (a limited range of actions).  This limitation is often realistic.  For instance, a robot may be limited to only moving north, south, east, or west.
• Reward (R): After taking an action, the agent receives a reward and moves into a new state.  The reward value can be anything, including positive or negative values, so it can be used to encourage or discourage behaviors.  The reward is thus a function of the state and action chosen.

We will start by limiting both the actions and states to finite discrete values to avoid issues to do with dimensionality.

We add a few more terms to provide more structure to the problem.  We will start by considering the most common simplifying case: the markov decision process (MDP).  MDP's follow the markov property that the next state only depends on the current state and action.  MDP are an extension of markov chains, with the addition of actions and rewards.

• Policy ($\pi$): A policy consists of a set of state/action pairs.  We may start off in state 1, take action 1, and move to state 2, etc.
• Value (V): the value function, depends on the initial state and policy that is followed.  The value of starting in some state $s_0$ and following a specific policy can thus be represented as $V^{\pi}(s_0)$.  Every state will have an associated value.
• Transition probability matrix (P): This gives the probability of transitioning from one state to another given a certain action.  In some cases, this is provided along with the problem definition, while in other cases this is learned.
• Discount ($\gamma$): The discount factor can be set to one, but it is generally included in order to limit the number of steps in the policy.  This is of great practical importance.  Imagine a problem where a robot can take a short path to its target by walking on one path along the edge of a cliff, or take an extremely long path on very safe ground.  There is a trade-off in this scenario, and the discount factor allows you to add a time value to the ultimate objective.

The basic MDP is a 5-tuple ${S, A, P, R, \gamma}$, meaning that we need to keep track of these five terms at all times.  In the basic setup, $P$ is provided, although alternative versions also allows for this to be a learned value as well.

The fundamental objective of an RL model is to maximize the reward (or value, or utility).  This is usually done using the Bellman equation (here shown as a recursion relation):

$V^*(s)= R(s) + \max_a \gamma \sum_{s'} P(s'|s,a) V^*(s').\$

Where the * represents the optimal policy $\pi$.

Lastly, we should touch on one crucial feature of RL problems: exploration vs. exploitation.  This is sometimes discussed in the context of a problem known as the multi-armed bandit.  A similar statement of this problem is the Feynman Restaurant Problem:

"Richard Feynman and Ralph Leighton had a hard time at restaurants deciding whether to order the best dish they had tried so far or something new, which - who knows - might be even better."

Assume that we start off with no knowledge of the environment (in terms of what rewards are available, and what policies to follow).  We start to explore the environment by moving from one state to another, and in the process, we start to learn about what we think are more optimal policies.  The question is: at what point should we stop exploring and start to exploit our knowledge.  Keep in mind that exploring is costly: we may know a good policy, but choose to ignore it in case we might be missing something better.

### Example: Grid World

Let us start by considering one of the most canonical examples: grid world.  This is a very simple example, but it can easily be extended and it serves to explore all the building blocks.

Consider a 3x4 grid.  An agent (e.g. robot, rat) starts in some state and moves through the grid until it reaches a terminal state, provided by a positive reward (+1) or a negative reward (-1).  We can add a discount factor to encourage the robot to reach the objective more quickly.

The robot can move 1 step at a time, and is able to move north, south, east, or west.  Unfortunately, the robot isn't perfect, so there is a probability transition function such that any time it wants to move in a particular direction (e.g. north), then there is some small chance that it will instead move to the left or right instead (probability = 0.1 respectively).

### Solving an MDP with Value Iteration

Now that we have defined the building blocks of a MDP, let us consider how we might go about explicitly solving the example problem.  Following Sutton and Barto (1998), there are three different basic approaches for solving RL problems: dynamic programming, monte carlo, and temporal-difference learning.  One of the simplest algorithms for dynamic programming is known as value iteration.

Value iteration is very straight forward: it simply repeatedly tries to update the value based on the Bellman equation until it converges.  Here is the pseudo-code from Sutton and Barto (1998):

Value iteration algorithm from Sutton and Barto (1998): http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node44.html.

First, we set up the grid, the possible actions, rewards, and states. (NOTE: This is included as a github gist, which may be viewed here if it doesn't show up embedded in the post.)

Next, we run through the value iteration algorithm (see gist here if not shown below).  Here I simply iterate for a set number of iterations.  In future posts, we will consider convergence of the algorithm and its complexity.

We are now able to get the optimal policy based on these values for a given discount rate.

In future posts, we will study the convergence properties of different RL algorithms, considering alternative methods for solving reinforcement learning problems, including the temporal difference method.  we will also walk through approximate methods and solve problems that are less constrained (continuous, infinite). Lastly, we should spend at least one post considering RL models within the PAC framework (Probably Approximately Correct).

### Resources

There are a number of excellent books that provide good introductions to this subject:

There are also a number of good online courses on this subject. The best that I have seen is Berkeley's CS188 "Artificial Intelligence", which is available as a MOOC from edX.

Also, Andrew Ng's Stanford CS229 "Machine Learning" has a lecture on the subject, including some very good lecture notes.