Algorithms for Reinforcement Learning
Errata for the printed book
Last update: July 8, 2017
Page numbers refer to the printed copy. The online version (the “draft”) is up-to-date. Thanks to
my PhD student, Gabor Bartok and Sotetsu Koyamada who have found many of these
- p. xi. Section 2) should be Section 2 (no closing parenthesis)
- p. 1. The dot should be in between the bars in the definition of the infinity norm, not
on the top. That is, ∥⋅∥∞ is the intended form and not ||∞. Also, in “which, if θ,
which” the part “which, if θ” should be deleted.
- p.2. The footnote from p.5 explaining the meaning of “almost surely” should be moved
- p.5. In the example on gambling the personal pronoun “his” should be replaced by
- p.9. In Eq. (1.14) on the right-hand side of the equation Q(y,π(x)) should be Q(y,π(y)).
- p.12. In footnote 1, add “if” before “it”.
- p. 19. line 25: “gives the the sum of rewards along the trajectory” should be “gives the
difference between the return along the trajectory and the value estimate at the start
- p. 21. line 1: “then” should be “than”.
- p. 22. The text “goal is to approximate the value function V underlying ” should
- p. 23. Delete “be” from “is no longer be guaranteed”. After θ(λ) in the middle of page
delete “.”. The phrase “using V θ” should be “ using the chosen features φ”.
- p. 24. line 19: “full control learning task” should be “full, control-learning task”
- p. 25. “some methods using which” should be “some methods that avoid” On the same
page, in the figure caption, “true” should be “two”. Also, on the figure the labels Ln(θ)
and L(θ) are swapped.
- p. 29. “this corresponds to starting with a diagonal matrix in RLSTD” should be “this
corresponds to starting with a positive diagonal matrix in RLSTD”.
- p. 32. The word “complicate” (in the middle of page) should be “complicated”.
- p. 34. “has full column rank” should be “has full row rank”. On the same page, in Eq.
(2.17): Replace ∥θ*⊤φ-r∥2 with ∥θ
μ2 on the right-hand side (although note
that V = r).
- p. 40. The text “Gittins (1989) has shown” should be “Gittins (1989) showed”.
- p. 43. The 4th displayed equation and the text surrounding it should be deleted. This is
the equation that says that RT UCRL2(δ) = O(D2||2|| log(T∕δ)∕ε+εT). This equation
holds (under the cited conditions), but it does not lead itself to a logarithmic regret
- p. 43, line -5: Replace “. This happens” with “as happens to be the case when”.
- p. 47. On line 4 of the 1st paragraph of Section 3.3., “optimalas” should be “optimal
as”. On the same page, on line 3, after Eq. (3.1): “, Algorithm 12 the pseudocode of
Q-learning.” should start with a full stop and is missing the word “shows”. So, the
text should be “. Algorithm 12 shows the pseudocode of Q-learning.”.
- p. 48. Section 3.2 is mentioned twice in the same sentence (around the middle of the
page). The second occurence should be deleted.
- p. 56, Algorithm 16, line 7. The correct update equation is b ← b + Rt+1 ⋅ z [Tom
- p. 58. The definition of regret should be RT = Tρ* -
T [Hamid Reza Maei,
- p. 59. “likelihood ratio methods Glynn, 1990” should be “likelihood ratio methods
- p. 65. The definition of norm is missing the so-called homogeneity condition: For any
λ ∈ ℝ, v ∈ V , f(λv) = |λ|f(v). One the same page, in the bottom, “ℓ∞ norms” should
be “ℓ∞ norm”.
- p. 66. “uniformly bounded” should be “bounded” (when mentioning a single function).
- p. 66. line 8 from the bottom, “fn(x) → 0 for each x” should be “Define f so that
f(x) = 0 if x≠0 and f(0) = 1. Then fn(x) → f(x) for each x.” Replace “However,
∥fn - f∥∞ = ∞ = 1 ⁄→ 0.” with “However, ∥fn - f∥∞ = 1 ⁄→ 0.”
- p. 67. “Polish mathematicians” should be singular: “Polish mathematician”.
- p. 68. On the top of page, “Assume that T is a γ-contraction.” should go into the next
- p. 68. In the first displayed equation the last inequality should be removed. In the next
two displayed equations, on the right-hand side, replace with . Finally,
in the 4th displayed equation on the page, remove from the right-hand side.
- p. 69. In the line preceding the definition of B(), “uniformly bounded” should be
- p. 69. “It is easy to see that V π”: At the end of the display following (in the bottom
of the page) replace V with V π.
- p. 72. “for the final policy π, we have TV π = TπV π = V π.” should be “for the final
policy π, we have T*V π = TπV π = V π.”.
For further information, visit http://www.ualberta.ca/~szepesva/RLBook.html.