## Algorithms for Reinforcement Learning Errata for the printed book

August 7, 2010*

*Last update: June 25, 2018

### Contents

Page numbers refer to the printed copy. The online version (the “draft”) is up-to-date. Thanks to my PhD student, Gabor Bartok and Sotetsu Koyamada who have found many of these errors.

• p. xi. Section 2) should be Section 2 (no closing parenthesis)
• p. 1. The dot should be in between the bars in the definition of the infinity norm, not on the top. That is, ∥⋅∥ is the intended form and not || . Also, in “which, if θ, which” the part “which, if θ” should be deleted.
• p.2. The footnote from p.5 explaining the meaning of “almost surely” should be moved here.
• p.5. In the example on gambling the personal pronoun “his” should be replaced by “her”.
• p.9. In Eq. (1.14) on the right-hand side of the equation Q(y,π(x)) should be Q(y,π(y)).
• p.12. In footnote 1, add “if” before “it”.
• p. 19. line 25: “gives the the sum of rewards along the trajectory” should be “gives the difference between the return along the trajectory and the value estimate at the start state”.
• p. 21. line 1: “then” should be “than”.
• p. 22. The text “goal is to approximate the value function V underlying ” should be deleted.
• p. 23. Delete “be” from “is no longer be guaranteed”. After θ(λ) in the middle of page delete “.”. The phrase “using V θ” should be “ using the chosen features φ”.
• p. 24. line 19: “full control learning task” should be “full, control-learning task”
• p. 25. “some methods using which” should be “some methods that avoid” On the same page, in the figure caption, “true” should be “two”. Also, on the figure the labels Ln(θ) and L(θ) are swapped.
• p. 29. “this corresponds to starting with a diagonal matrix in RLSTD” should be “this corresponds to starting with a positive diagonal matrix in RLSTD”.
• p. 32. The word “complicate” (in the middle of page) should be “complicated”.
• p. 34. “has full column rank” should be “has full row rank”. On the same page, in Eq. (2.17): Replace θ*φ-r2 with θ *φ-V μ2 on the right-hand side (although note that V = r).
• p. 40. The text “Gittins (1989) has shown” should be “Gittins (1989) showed”.
• p. 43. The 4th displayed equation and the text surrounding it should be deleted. This is the equation that says that RT UCRL2(δ) = O(D2| |2| | log(T∕δ)∕ε+εT). This equation holds (under the cited conditions), but it does not lead itself to a logarithmic regret bound.
• p. 43, line -5: Replace “. This happens” with “as happens to be the case when”.
• p. 45, Algorithm 10 (UCRL2): Instead of “repeat-until” the appropriate programming construct should have been “while true-end while” (the body of the loop is repeated indefinitely) [Hill Ma].
• p. 46, Algorithm 11 (OptSolve): In the repeat-until construction the conditions must be flipped. Two places [Hill Ma].
• p. 47. On line 4 of the 1st paragraph of Section 3.3., “optimalas” should be “optimal as”. On the same page, on line 3, after Eq. (3.1): “, Algorithm 12 the pseudocode of Q-learning.” should start with a full stop and is missing the word “shows”. So, the text should be “. Algorithm 12 shows the pseudocode of Q-learning.”.
• p. 48. Section 3.2 is mentioned twice in the same sentence (around the middle of the page). The second occurrence should be deleted.
• p. 56, Algorithm 16, line 7. The correct update equation is b b + Rt+1 z [Tom Schaul, Idsia].
• p. 57, Algorithm 16 (LSPI): Flip the condition in the until construct [Hill Ma].
• p. 58. The definition of regret should be RT = * - T [Hamid Reza Maei, Stanford].
• p. 59. “likelihood ratio methods Glynn, 1990” should be “likelihood ratio methods (Glynn, 1990).”
• p. 61, Algorithm 18: Again, instead of repeat-until, one needs while(true)-endwhile [Hill Ma].
• p. 65. The definition of norm is missing the so-called homogeneity condition: For any λ , v V , f(λv) = |λ|f(v). One the same page, in the bottom, “ norms” should be “ norm”.
• p. 66. “uniformly bounded” should be “bounded” (when mentioning a single function).
• p. 66. line 8 from the bottom, “fn(x) 0 for each x” should be “Define f so that f(x) = 0 if x0 and f(0) = 1. Then fn(x) f(x) for each x.” Replace “However, fn - f = = 1 ⁄→ 0.” with “However, fn - f = 1 ⁄→ 0.”
• p. 67. “Polish mathematicians” should be singular: “Polish mathematician”.
• p. 68. On the top of page, “Assume that T is a γ-contraction.” should go into the next line.
• p. 68. In the first displayed equation the last inequality should be removed. In the next two displayed equations, on the right-hand side, replace with . Finally, in the 4th displayed equation on the page, remove from the right-hand side.
• p. 69. In the line preceding the definition of B( ), “uniformly bounded” should be “bounded”.
• p. 69. “It is easy to see that V π”: At the end of the display following (in the bottom of the page) replace V with V π.
• p. 72. “for the final policy π, we have TV π = TπV π = V π.” should be “for the final policy π, we have T*V π = TπV π = V π.”.

For further information, visit http://www.ualberta.ca/~szepesva/RLBook.html.