Algorithms for Reinforcement Learning, my sleek book was published by Morgan & Claypool in July 2010.
Download the most recent version in pdf
(last update: June 25, 2018),
or download the original
from the publisher's webpage (if you have access).

Why this book?
There exist a good number of really great books on Reinforcement Learning. So why a new book? I have to confess: The book arose from selfish reasons: I wanted a short book, which nevertheless contained the major ideas underlying stateoftheart RL algorithms, a discussion of their relative strengths and weaknesses, with hints on what is known (and not known, but would be good to know) about these algorithms. If I succeeded, time will tell. Or, you can, by sending me an email!Abstract
Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a longterm objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, note a large number of state of the art algorithms, followed by the discussion of their theoretical properties and limitations.Table of contents
Preface  ix 
Acknowledgments  xiii 
1 Markov Decision Processes  1 
1.1 Preliminaries  1 
1.2 Markov Decision Processes  1 
1.3 Value functions  6 
1.4 Dynamic programming algorithms for solving MDPs  10 
2 Value Prediction Problems  11 
2.1 Temporal difference learning in finite state spaces  11 
2.1.1 Tabular TD(0)  11 
2.1.2 Everyvisit MonteCarlo  14 
2.1.3 TD(lambda): Unifying MonteCarlo and TD(0)  16 
2.2 Algorithms for large state spaces  18 
2.2.1 TD(lambda) with function approximation  22 
2.2.2 Gradient temporal difference learning  25 
2.2.3 Leastsquares methods  27 
2.2.4 The choice of the function space  33 
3 Control  37 
3.1 A catalog of learning problems  37 
3.2 Closedloop interactive learning  38 
3.2.1 Online learning in bandits  38 
3.2.2 Active learning in bandits  40 
3.2.3 Active learning in Markov Decision Processes  41 
3.2.4 Online learning in Markov Decision Processes  42 
3.3 Direct methods  47 
3.3.1 Qlearning in finite MDPs  47 
3.3.2 Qlearning with function approximation  49 
3.4 Actorcritic methods  52 
3.4.1 Implementing a critic  54 
3.4.2 Implementing an actor  56 
4 For Further Exploration  63 
4.1 Further reading  63 
4.2 Applications  63 
4.3 Software  64 
A The Theory of Discounted Markovian Decision Processes  65 
A.1 Contractions and Banachâ€™s fixedpoint theorem  65 
A.2 Application to MDPs  69 
Bibliography  73 
Author's Biography  89 
List of algorithms
The following algorithms are explained in the book. For algorithms whose names are boldfaced a pseudocode is also given.
Value iteration  p. 10  Policy iteration  p. 10  Tabular TD(0)  p. 12 
Everyvisit MonteCarlo  p. 15  Tabular TD(lambda), accumulating traces  p. 17  Firstvisit MonteCarlo  p. 17 
Tabular TD(lambda), replacing traces  pp. 1718  TD(lambda) with function approximation  p. 22  GTD2  p. 26 
TDC  p. 27  LSTD  p. 28  RLSTD  pp. 2829 
(R)LSTD(lambda)  pp. 2930  lambdaLSPE  pp. 3031  Boltzmann exploration  p. 38 
epsilongreedy  p. 39  UCB1  pp. 3940  Action elimination  p. 41 
E^3  p. 42  UCRL2  pp. 4244  Rmax, MBIE, OI, delayedQ, MorMax  p. 44 
Tabular Qlearning  p. 47  Qlearning with function approximation  pp. 4950  Stateaggregation based Qlearning  p. 50 
Softstate aggregation based Qlearning  p. 50  Fitted Qiteration  pp. 5152  Generalized policy iteration  pp. 5253 
SARSA  p. 54  SARSA(lambda)  pp. 5555  LSTDQ(lambda)  pp. 5556 
LSPI(lambda)  p. 57  REINFORCE  p. 59  Actorcritic with SARSA(1)  pp. 6061 
Natural actorcritic  p. 57 
Other unique features of the book
The book discusses the following function approximation methods: Linear function approximation, tensor product construction, kernel smoothing, averagers, state aggregation, tile coding, nearestneighbor methods, nonparametric kernel smoothing, gaussian processes, and RKHS regression.In addition, it discusses the relative merits of "batch" (LStype) and incremental (TDtype) algorithms, the influence of the choice of the function approximation method (can we overfit in reinforcement learning?), various theoretically wellfounded online learning algorithms (ever wondered about what an efficient exploration method should do?), actorcritic algorithms, and more.
Some connections to other parts of the literature (outside of machine learning) are mentioned. This includes connection of LSTD (and related methods) to Zestimation (from statistics), sampleaverage approximation methods (from operations research), or the connection of policy gradient algorithms to likelihood ratio methods (from simulation optimization). In general, the book has many pointers to the literature. I think that the books provides a very good starting point if someone wants to know the status of the theory related to some algorithm or idea. This being said, the book cites 207 works, many of which are quite recent.
The book has even some new results! These include:
 TD(lambda) with linear function approximation solves a model (previously, this was known for lambda=0 only)
 A new bound on the complexity of active learning in finite deterministic MDPs, which significantly improves a previous bound by Sebastian Thrun.