## Algorithms of Reinforcement Learning## AccessDownload the pdf, free of charge, courtesy of our wonderful publisher. Last update: March 12, 2019 Access the original on the Morgan and Claypool webpage Buy a printed copy from Amazon.com ca. USD 35.00 Amazon.ca ca. CDN 42.02 Amazon.co.uk, GBP18.99.
Faculty can write to `info@morganclaypool.com`to request a desk copyJapanese translation by Sotetsu Koyamada! The translation has a short supplementary material about the equivalence of the forward and backward views of TD lambda (by Dr. Koyama) and also on deep RL (by Sotetsu Koyamada). Amazon Asia link, Kyoritsu pub, errata.
## Why did I write this book?
Good question! There exist a good number of really great books on Reinforcement Learning. So why a new book?
I had selfish reasons: I wanted a short book, which nevertheless contained the major ideas underlying state-of-the-art RL algorithms (back in 2010), a discussion of their relative strengths and weaknesses, with hints on what is known (and not known, but would be good to know) about these algorithms. If I succeeded, time will tell. Or, you can, by sending me an e-mail at ## AbstractReinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms’ merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, note a large number of state of the art algorithms, followed by the discussion of their theoretical properties and limitations. ## Table of contentsPreface ix Acknowledgments xiii Markov Decision Processes 1 Preliminaries 1 Markov Decision Processes 1 Value functions 6 Dynamic programming algorithms for solving MDPs 10
Value Prediction Problems 11 Temporal difference learning in finite state spaces 11 Tabular TD(0) 11 Every-visit Monte-Carlo 14 TD(lambda): Unifying Monte-Carlo and TD(0) 16
Algorithms for large state spaces 18 TD(lambda) with function approximation 22 Gradient temporal difference learning 25 Least-squares methods 27 The choice of the function space 33
Control 37 A catalog of learning problems 37 Closed-loop interactive learning 38 Online learning in bandits 38 Active learning in bandits 40 Active learning in Markov Decision Processes 41 Online learning in Markov Decision Processes 42
Direct methods 47 Q-learning in finite MDPs 47 Q-learning with function approximation 49
Actor-critic methods 52 Implementing a critic 54 Implementing an actor 56
For Further Exploration 63 Further reading 63 Applications 63 Software 64
Appendix: The Theory of Discounted Markovian Decision Processes 65 A.1 Contractions and Banachâ€™s fixed-point theorem 65 A.2 Application to MDPs 69
Bibliography 73 Author's Biography 89
## AlgorithmsThe book, as the title suggests, describes a number of algorithms. These are the following. For algorithms whose names are boldfaced a pseudocode is also given. Value iteration p. 10 Policy iteration p. 10 **Tabular TD(0)**p. 12**Every-visit Monte-Carlo**p. 15Tabular TD(lambda), accumulating traces p. 17 First-visit Monte-Carlo p. 17 **Tabular TD(lambda), replacing traces**pp. 17-18**TD(lambda) with function approximation**p. 22**GTD2**p. 26TDC p. 27 LSTD p. 28 **RLSTD**pp. 28-29(R)LSTD(lambda) pp. 29-30 **lambda-LSPE**pp. 30-31Boltzmann exploration p. 38 epsilon-greedy p. 39 **UCB1**pp. 39-40Action elimination p. 41 E^3 p. 42 **UCRL2**pp. 42-44R-max, MBIE, OI, delayed-Q, MorMax p. 44 **Tabular Q-learning**p. 47**Q-learning with function approximation**pp. 49-50State-aggregation based Q-learning p. 50 Soft-state aggregation based Q-learning p. 50 **Fitted Q-iteration**pp. 51-52Generalized policy iteration pp. 52-53 SARSA p. 54 **SARSA(lambda)**pp. 55-55**LSTDQ(lambda)**pp. 55-56**LSPI(lambda)**p. 57REINFORCE p. 59 **Actor-critic with SARSA(1)**pp. 60-61Natural actor-critic p. 57
## Other unique features of the bookThe book discusses the following function approximation methods: Linear function approximation, tensor product construction, kernel smoothing, averagers, state aggregation, tile coding, nearest-neighbor methods, nonparametric kernel smoothing, gaussian processes, and RKHS regression.
In addition, it discusses the relative merits of “batch” (LS-type) and incremental (TD-type) algorithms,
the influence of the choice of the function approximation method (can we overfit in reinforcement learning?),
various theoretically well-founded online learning algorithms (ever wondered about what an efficient exploration method should do?),
actor-critic algorithms,
and more.
Some connections to other parts of the literature (outside of machine learning) are mentioned. This includes connection of LSTD (and related methods) to Z-estimation (from statistics), sample-average approximation methods (from operations research), or the connection of policy gradient algorithms to likelihood ratio methods (from simulation optimization). In general, the book has many pointers to the literature. I think that the books provides a ## New results in the book
TD(lambda) with linear function approximation solves a model (previously, this was known for lambda=0 only) A new bound on the complexity of active learning in finite deterministic MDPs, which significantly improves a previous bound by Sebastian Thrun.
## Tutorial, slidesSome people find it much easier to learn from slides. Rich and I gave a tutorial at AAAI-2010 in July that was based on the book. The tutorial webpage is here. We used the following slides: Part 1 – Foundations of computing optimal decisions Part 2 – Learning to predict values Part 3 – Learning to control Part 4 – Summary
## Errata (last update: June 25, 2018)In an ideal world, we would publish with no mistakes. The world is not ideal. The second best thing then is to keep a list of mistakes (and update the pdf!). For your convenience, here I give you an errata both as a pdf file and also in html. The above errata is based mostly on a list provided by Gabor, who deserves a big thanks for reading through the text so carefully. I am also indebted to Sotetsu Koyamada who more recently have given me another lengthy list of typos. Earlier (and more recently), several individual read various parts of the draft and have submitted useful suggestions, which I tried to incorporate. They include: Hill Ma.
Thank You! All the remaining errors are mine. |