Algorithms of Reinforcement Learning


  • Download the pdf, free of charge, courtesy of our wonderful publisher. Last update: March 12, 2019

  • Access the original on the Morgan and Claypool webpage

  • Buy a printed copy from

  • Japanese translation by Sotetsu Koyamada! The translation has a short supplementary material about the equivalence of the forward and backward views of TD lambda (by Dr. Koyama) and also on deep RL (by Sotetsu Koyamada). Amazon Asia link, Kyoritsu pub, errata.

Why did I write this book?

Good question! There exist a good number of really great books on Reinforcement Learning. So why a new book?

I had selfish reasons: I wanted a short book, which nevertheless contained the major ideas underlying state-of-the-art RL algorithms (back in 2010), a discussion of their relative strengths and weaknesses, with hints on what is known (and not known, but would be good to know) about these algorithms. If I succeeded, time will tell. Or, you can, by sending me an e-mail at


Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms’ merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, note a large number of state of the art algorithms, followed by the discussion of their theoretical properties and limitations.

Table of contents

  1. Preface ix

  2. Acknowledgments xiii

  3. Markov Decision Processes 1

    1. Preliminaries 1

    2. Markov Decision Processes 1

    3. Value functions 6

    4. Dynamic programming algorithms for solving MDPs 10

  4. Value Prediction Problems 11

    1. Temporal difference learning in finite state spaces 11

      1. Tabular TD(0) 11

      2. Every-visit Monte-Carlo 14

      3. TD(lambda): Unifying Monte-Carlo and TD(0) 16

    2. Algorithms for large state spaces 18

      1. TD(lambda) with function approximation 22

      2. Gradient temporal difference learning 25

      3. Least-squares methods 27

      4. The choice of the function space 33

  5. Control 37

    1. A catalog of learning problems 37

    2. Closed-loop interactive learning 38

      1. Online learning in bandits 38

      2. Active learning in bandits 40

      3. Active learning in Markov Decision Processes 41

      4. Online learning in Markov Decision Processes 42

    3. Direct methods 47

      1. Q-learning in finite MDPs 47

      2. Q-learning with function approximation 49

    4. Actor-critic methods 52

      1. Implementing a critic 54

      2. Implementing an actor 56

  6. For Further Exploration 63

    1. Further reading 63

    2. Applications 63

    3. Software 64

  7. Appendix: The Theory of Discounted Markovian Decision Processes 65

    1. A.1 Contractions and Banach’s fixed-point theorem 65

    2. A.2 Application to MDPs 69

  8. Bibliography 73

  9. Author's Biography 89


The book, as the title suggests, describes a number of algorithms. These are the following. For algorithms whose names are boldfaced a pseudocode is also given.

  1. Value iteration p. 10

  2. Policy iteration p. 10

  3. Tabular TD(0) p. 12

  4. Every-visit Monte-Carlo p. 15

  5. Tabular TD(lambda), accumulating traces p. 17

  6. First-visit Monte-Carlo p. 17

  7. Tabular TD(lambda), replacing traces pp. 17-18

  8. TD(lambda) with function approximation p. 22

  9. GTD2 p. 26

  10. TDC p. 27

  11. LSTD p. 28

  12. RLSTD pp. 28-29

  13. (R)LSTD(lambda) pp. 29-30

  14. lambda-LSPE pp. 30-31

  15. Boltzmann exploration p. 38

  16. epsilon-greedy p. 39

  17. UCB1 pp. 39-40

  18. Action elimination p. 41

  19. E^3 p. 42

  20. UCRL2 pp. 42-44

  21. R-max, MBIE, OI, delayed-Q, MorMax p. 44

  22. Tabular Q-learning p. 47

  23. Q-learning with function approximation pp. 49-50

  24. State-aggregation based Q-learning p. 50

  25. Soft-state aggregation based Q-learning p. 50

  26. Fitted Q-iteration pp. 51-52

  27. Generalized policy iteration pp. 52-53

  28. SARSA p. 54

  29. SARSA(lambda) pp. 55-55

  30. LSTDQ(lambda) pp. 55-56

  31. LSPI(lambda) p. 57

  32. REINFORCE p. 59

  33. Actor-critic with SARSA(1) pp. 60-61

  34. Natural actor-critic p. 57

Other unique features of the book

The book discusses the following function approximation methods:

  1. Linear function approximation,

  2. tensor product construction,

  3. kernel smoothing,

  4. averagers,

  5. state aggregation,

  6. tile coding,

  7. nearest-neighbor methods,

  8. nonparametric kernel smoothing,

  9. gaussian processes, and

  10. RKHS regression.

In addition, it discusses the relative merits of “batch” (LS-type) and incremental (TD-type) algorithms, the influence of the choice of the function approximation method (can we overfit in reinforcement learning?), various theoretically well-founded online learning algorithms (ever wondered about what an efficient exploration method should do?), actor-critic algorithms, and more.

Some connections to other parts of the literature (outside of machine learning) are mentioned. This includes connection of LSTD (and related methods) to Z-estimation (from statistics), sample-average approximation methods (from operations research), or the connection of policy gradient algorithms to likelihood ratio methods (from simulation optimization). In general, the book has many pointers to the literature. I think that the books provides a very good reasonable starting point if someone wants to know the status of the theory related to some algorithm or idea.The book cites 207 works, many of which were quite recent in 2010.

New results in the book

  1. TD(lambda) with linear function approximation solves a model (previously, this was known for lambda=0 only)

  2. A new bound on the complexity of active learning in finite deterministic MDPs, which significantly improves a previous bound by Sebastian Thrun.

Tutorial, slides

Some people find it much easier to learn from slides. Rich and I gave a tutorial at AAAI-2010 in July that was based on the book. The tutorial webpage is here. We used the following slides:

  1. Part 1 – Foundations of computing optimal decisions

  2. Part 2 – Learning to predict values

  3. Part 3 – Learning to control

  4. Part 4 – Summary

Errata (last update: June 25, 2018)

In an ideal world, we would publish with no mistakes. The world is not ideal. The second best thing then is to keep a list of mistakes (and update the pdf!). For your convenience, here I give you an errata both as a pdf file and also in html.

The above errata is based mostly on a list provided by Gabor, who deserves a big thanks for reading through the text so carefully. I am also indebted to Sotetsu Koyamada who more recently have given me another lengthy list of typos. Earlier (and more recently), several individual read various parts of the draft and have submitted useful suggestions, which I tried to incorporate. They include:

  1. Dimitri Bertsekas

  2. Gabor Balazs

  3. Bernardo Avila Pires

  4. Warren Powell

  5. Tom Schaul

  6. Rich Sutton

  7. Nikos Vlassis,

  8. Hengshuai Yao

  9. Shimon Whiteson,

  10. Massimiliano Tomassoli and

  11. Hill Ma.

Thank You! All the remaining errors are mine.