My RL book

Algorithms for Reinforcement Learning, my sleek book was published by Morgan & Claypool in July 2010.

Download the most recent version in pdf (last update: June 25, 2018), or download the original from the publisher's webpage (if you have access).
Or, buy a printed copy from Amazon.com for ca. USD 35.00, Amazon.ca for ca. CDN$ 42.02, or from Amazon.co.uk for GBP18.99.
Faculty: write to info@morganclaypool.com to request your desk copy today! New! A Japanese translation by Sotetsu Koyamada is ready. The translation has a short supplementary material about the equivalence of the forward and backward views of TD lambda (by Dr. Koyama) and also on deep RL (by Sotetsu Koyamada). Amazon Asia link, Kyoritsu pub, errata.

Book cover for the book 'Algorithms for Reinforcement Learning'

Why this book?

There exist a good number of really great books on Reinforcement Learning. So why a new book? I have to confess: The book arose from selfish reasons: I wanted a short book, which nevertheless contained the major ideas underlying state-of-the-art RL algorithms, a discussion of their relative strengths and weaknesses, with hints on what is known (and not known, but would be good to know) about these algorithms. If I succeeded, time will tell. Or, you can, by sending me an e-mail!


Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, note a large number of state of the art algorithms, followed by the discussion of their theoretical properties and limitations.

Table of contents

1 Markov Decision Processes 1
1.1 Preliminaries1
1.2 Markov Decision Processes 1
1.3 Value functions6
1.4 Dynamic programming algorithms for solving MDPs 10
2 Value Prediction Problems 11
2.1 Temporal difference learning in finite state spaces11
2.1.1 Tabular TD(0) 11
2.1.2 Every-visit Monte-Carlo14
2.1.3 TD(lambda): Unifying Monte-Carlo and TD(0)16
2.2 Algorithms for large state spaces 18
2.2.1 TD(lambda) with function approximation22
2.2.2 Gradient temporal difference learning 25
2.2.3 Least-squares methods27
2.2.4 The choice of the function space 33
3 Control37
3.1 A catalog of learning problems37
3.2 Closed-loop interactive learning 38
3.2.1 Online learning in bandits38
3.2.2 Active learning in bandits 40
3.2.3 Active learning in Markov Decision Processes41
3.2.4 Online learning in Markov Decision Processes 42
3.3 Direct methods47
3.3.1 Q-learning in finite MDPs47
3.3.2 Q-learning with function approximation49
3.4 Actor-critic methods52
3.4.1 Implementing a critic54
3.4.2 Implementing an actor56
4 For Further Exploration 63
4.1 Further reading63
4.2 Applications 63
4.3 Software64
A The Theory of Discounted Markovian Decision Processes 65
A.1 Contractions and Banach’s fixed-point theorem65
A.2 Application to MDPs69
Author's Biography89

List of algorithms

The following algorithms are explained in the book. For algorithms whose names are boldfaced a pseudocode is also given.

Value iterationp. 10 Policy iterationp. 10 Tabular TD(0)p. 12
Every-visit Monte-Carlop. 15 Tabular TD(lambda), accumulating tracesp. 17 First-visit Monte-Carlop. 17
Tabular TD(lambda), replacing tracespp. 17-18 TD(lambda) with function approximationp. 22 GTD2p. 26
TDCp. 27 LSTDp. 28 RLSTDpp. 28-29
(R)LSTD(lambda)pp. 29-30 lambda-LSPEpp. 30-31 Boltzmann explorationp. 38
epsilon-greedyp. 39 UCB1pp. 39-40 Action eliminationp. 41
E^3p. 42 UCRL2pp. 42-44 R-max, MBIE, OI, delayed-Q, MorMaxp. 44
Tabular Q-learningp. 47 Q-learning with function approximationpp. 49-50 State-aggregation based Q-learningp. 50
Soft-state aggregation based Q-learningp. 50 Fitted Q-iterationpp. 51-52 Generalized policy iterationpp. 52-53
SARSAp. 54 SARSA(lambda)pp. 55-55 LSTDQ(lambda)pp. 55-56
LSPI(lambda)p. 57 REINFORCEp. 59 Actor-critic with SARSA(1)pp. 60-61
Natural actor-criticp. 57

Other unique features of the book

The book discusses the following function approximation methods: Linear function approximation, tensor product construction, kernel smoothing, averagers, state aggregation, tile coding, nearest-neighbor methods, nonparametric kernel smoothing, gaussian processes, and RKHS regression.

In addition, it discusses the relative merits of "batch" (LS-type) and incremental (TD-type) algorithms, the influence of the choice of the function approximation method (can we overfit in reinforcement learning?), various theoretically well-founded online learning algorithms (ever wondered about what an efficient exploration method should do?), actor-critic algorithms, and more.

Some connections to other parts of the literature (outside of machine learning) are mentioned. This includes connection of LSTD (and related methods) to Z-estimation (from statistics), sample-average approximation methods (from operations research), or the connection of policy gradient algorithms to likelihood ratio methods (from simulation optimization). In general, the book has many pointers to the literature. I think that the books provides a very good starting point if someone wants to know the status of the theory related to some algorithm or idea. This being said, the book cites 207 works, many of which are quite recent.

The book has even some new results! These include:

  • TD(lambda) with linear function approximation solves a model (previously, this was known for lambda=0 only)
  • A new bound on the complexity of active learning in finite deterministic MDPs, which significantly improves a previous bound by Sebastian Thrun.


Rich and I gave a tutorial at AAAI-2010 in July that was based on the book. The tutorial webpage is here. The slides are here:

Errata (last update: June 25, 2018)

The above errate is based mostly on a list provided by Gabor, who obviously deserves a big special thanks for reading through the text so carefully. I am also indebted to Sotetsu Koyamada who more recently have given me another lengthy list of typos. Earlier (and more recently), several individual read various parts of the draft and have submitted useful suggestions, which I tried to incorporate. They include Dimitri Bertsekas, Gabor Balazs, Bernardo Avila Pires, Warren Powell, Tom Schaul, Rich Sutton, Nikos Vlassis, Hengshuai Yao and Shimon Whiteson, Massimiliano Tomassoli and Hill Ma. Thank You! All the remaining errors are mine.