CMPUT 654 - Online Learning Algorithms

Welcome to CMPUT 654!

NEWS! The course has an official webpage here, which will be used for the rest of the semester; this page will no longer be updated. Please visit the page above for the latest news.

Fall 2009
Department of Computing Science
University of Alberta

Instructor: Csaba Szepesvári, Ath311, x2-8581, Csaba.Szepesvari at cs dot ualberta dot ca
Room: NRE 2 016
Time: TR 11:00 - 12:20
Office hours: TR 15:00-16:00

Course outline (tentative and subject to change; topics marked by * are for further exploration):

Unit 1 -- slides: Introduction, or how to measure performance? Dowry game by David (Windows only), mind-reader applet by Anup Doshi

Unit 2: The classical approach: Stopping problems, the secretary problem, bandit problems, the concept of regret, Bayesian bandits: Gittins indices.
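To get a feel for the secretary problem before the lectures, here is a minimal Python sketch of the classical 1/e stopping rule (observe the first n/e candidates, then accept the first one who beats them all). The function names and the simulation setup are my own illustration, not course code:

```python
import math
import random

def secretary_rule(values):
    """1/e stopping rule: skip the first n/e candidates, then accept
    the first candidate better than everyone seen so far."""
    n = len(values)
    cutoff = int(n / math.e)
    best_seen = max(values[:cutoff], default=float("-inf"))
    for v in values[cutoff:]:
        if v > best_seen:
            return v
    return values[-1]  # forced to accept the last candidate

def success_rate(n=50, trials=20000, seed=0):
    """Estimate the probability of hiring the single best candidate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        vals = list(range(n))
        rng.shuffle(vals)
        if secretary_rule(vals) == n - 1:
            wins += 1
    return wins / trials
```

For n = 50 candidates the simulated success probability lands near the theoretical 1/e, roughly 0.37, independently of n.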

Unit 3 -- slides: Nonparametric stochastic bandits The upper confidence bound (UCB) based algorithm. *Exploration/exploitation in MDPs, *associative delayed bandits, *PAC learning vs. regret bounds
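The UCB rule of this unit fits in a few lines. Below is a minimal sketch of UCB1 on Bernoulli arms (pull the arm maximizing empirical mean plus sqrt(2 ln t / n_i)); the interface and names are my own illustration:

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """UCB1 on Bernoulli arms with the given success probabilities.
    Returns how often each arm was pulled over the horizon."""
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull each arm once
        else:
            # empirical mean + exploration bonus
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

The theory of the unit shows that each suboptimal arm is pulled only O(log n) times, which is exactly what the pull counts reflect in simulation.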

Unit 4 -- slides: Prediction of individual sequences using expert advice. Finding a perfect expert: Mistake bounds for the halving algorithm. Imperfect experts: weighted majority algorithm (Hedge). Bounding the regret for convex losses: The exponentially weighted average forecaster (EWA). *growing horizon, *improved bounds for small losses, *large number of good experts, *scaling the loss function, *discounting the past
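The exponentially weighted average forecaster at the heart of this unit can be sketched in a few lines: maintain weights proportional to exp(-eta * cumulative loss) and suffer the weighted average of the expert losses each round. The function name and input format are my own illustration:

```python
import math

def hedge(loss_matrix, eta):
    """Exponentially weighted average forecaster (Hedge).
    loss_matrix[t][i] is the loss of expert i in round t, in [0, 1].
    Returns (forecaster's total loss, best expert's total loss)."""
    n_experts = len(loss_matrix[0])
    cum = [0.0] * n_experts
    total = 0.0
    for losses in loss_matrix:
        ws = [math.exp(-eta * c) for c in cum]   # exponential weights
        z = sum(ws)
        ps = [w / z for w in ws]                 # normalized mixture
        total += sum(p, ) if False else sum(p * l for p, l in zip(ps, losses))
        cum = [c + l for c, l in zip(cum, losses)]
    return total, min(cum)
```

With one perfect expert, the forecaster's total loss stays bounded by a constant depending only on eta and the number of experts, mirroring the ln(N)/eta term in the regret bound.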

Unit 5 -- slides: Small regret for specific problems. Efficient gambling: Finite regret for the exponentially weighted average forecaster via the potential function technique for exp-concave losses. Online parameter optimization over convex domains: Logarithmic regret bounds for follow-the-leader for strictly convex, globally Lipschitz loss functions. *the greedy forecaster and the absolute loss, *minimax lower bounds (EWA is 'optimal')

Unit 6 -- slides: Binary prediction problems. Randomized exponentially weighted forecaster, follow the perturbed leader. *internal regret, *calibration
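Follow the perturbed leader, covered in this unit, can be sketched as follows: each round, follow the leader after adding fresh random (here exponentially distributed) bonuses to the cumulative losses, in the spirit of Kalai and Vempala. The setup and names are my own illustration:

```python
import random

def fpl(loss_matrix, epsilon, seed=0):
    """Follow the perturbed leader: each round play the expert whose
    perturbed cumulative loss is smallest. Fresh perturbations are
    drawn every round (the 'lazy' variant would reuse them).
    Returns (forecaster's total loss, best expert's total loss)."""
    rng = random.Random(seed)
    n = len(loss_matrix[0])
    cum = [0.0] * n
    total = 0.0
    for losses in loss_matrix:
        # subtract an Exp(epsilon) bonus, then take the minimizer
        perturbed = [c - rng.expovariate(epsilon) for c in cum]
        pick = min(range(n), key=lambda i: perturbed[i])
        total += losses[pick]
        cum = [c + l for c, l in zip(cum, losses)]
    return total, min(cum)
```

Against a fixed loss sequence with one perfect expert, the randomization causes only a small (constant expected) number of extra mistakes.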

Unit 7 -- slides: Efficient algorithms for large classes of experts. Tracking the best expert: The fixed- and variable-share algorithms; prediction using side information: the context tree algorithm (*KT estimates, log-loss, Markov forecasters). Shortest path problems.

Unit 8 -- slides: Multi-armed bandits in adversarial environments. A high-probability bound on the regret of EXP3.P, *tracking, *efficient estimation, *label efficient prediction, *lazy label efficient prediction, *partial monitoring
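EXP3.P is a high-probability refinement of the basic EXP3 algorithm, which can be sketched as follows: exponential weights over arms, importance-weighted reward estimates, and a gamma-uniform exploration component. The interface and names below are my own illustration:

```python
import math
import random

def exp3(reward_fn, n_arms, horizon, gamma, seed=0):
    """Basic EXP3: reward_fn(arm, rng) must return a reward in [0, 1].
    Returns how often each arm was pulled."""
    rng = random.Random(seed)
    w = [1.0] * n_arms
    pulls = [0] * n_arms
    for _ in range(horizon):
        z = sum(w)
        # mix exponential weights with gamma of uniform exploration
        probs = [(1.0 - gamma) * wi / z + gamma / n_arms for wi in w]
        arm = rng.choices(range(n_arms), weights=probs)[0]
        r = reward_fn(arm, rng)
        # importance-weighted (unbiased) reward estimate
        w[arm] *= math.exp(gamma * (r / probs[arm]) / n_arms)
        pulls[arm] += 1
    return pulls
```

Even with only bandit feedback, the weight of the best arm grows fastest in expectation, so it ends up being pulled most of the time.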

Unit 9 -- slides: Playing against reactive opponents. Hannan consistency vs. finding the best counter-strategy

Unit 10: *Sequential investments: EG, second-order Newton method

Unit 11 -- slides: Linear pattern recognition (prediction with side information). Gradient-based forecasters: Widrow-Hoff, EG. Ridge regression, the Vovk-Azoury-Warmuth forecaster, matching lower bound, tracking bounds, *self-confident linear forecasters
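The Widrow-Hoff (LMS) forecaster mentioned above is just online gradient descent on the squared loss; a minimal sketch, with names and data format of my own choosing:

```python
def widrow_hoff(data, eta):
    """Widrow-Hoff / LMS rule: after predicting w.x, step against the
    gradient of the squared loss (w.x - y)^2, i.e. subtract
    2*eta*(w.x - y)*x. data is a list of (x, y) pairs.
    Returns (final weights, total squared loss suffered online)."""
    w = [0.0] * len(data[0][0])
    total_loss = 0.0
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        err = pred - y
        total_loss += err * err
        w = [wi - 2.0 * eta * err * xi for wi, xi in zip(w, x)]
    return w, total_loss
```

On a consistent linear data stream the weights contract geometrically toward the true parameter vector, so the cumulative loss stays bounded.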

Unit 12 -- slides: Linear classification (binary prediction with side information). Upper-bounding the zero-one loss. Hinge-loss and the margin. The p-norm Perceptron and Winnow. *2nd order perceptron, *ALMA: the approximate large margin algorithm, *kernel-based classification, *ANOVA kernel
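The classical perceptron that opens the linear classification material fits in a few lines; note that what the theory bounds is the number of mistakes, not the number of rounds. Names and the toy data are my own illustration:

```python
def perceptron(data, epochs=1):
    """Classical perceptron: on every mistake (y * w.x <= 0) update
    w <- w + y*x. data is a list of (x, y) pairs with y in {-1, +1}.
    Returns (final weights, total number of mistakes)."""
    w = [0.0] * len(data[0][0])
    mistakes = 0
    for _ in range(epochs):
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                mistakes += 1
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes
```

On linearly separable data the mistake count is finite regardless of how many epochs are run, in line with the classical margin-based mistake bound.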

Course blurb:

Online learning is a hot topic in machine learning: when processing huge amounts of data, algorithms must operate continuously in their environments, making predictions, receiving rewards, and suffering losses. The distinguishing feature of online problems is that the algorithm's performance is measured by comparing it with the best performing algorithm from a larger class of algorithms. This allows for a wide range of environment models, from stationary stochastic models to ones where no statistical constraints are adopted at all.

Examples of online learning problems abound: Finding a large reward, finding the best candidate at a job fair when decisions must be made on the spot, optimal allocation of resources, optimal sequential investments, etc.

The focus of the course will be on the ideas that make effective online learning possible, as well as on the inherent limitations of the approach and of the algorithms. Students attending the course will understand the most important techniques underlying these algorithms, be able to apply these techniques to new problems, and appreciate the differences between the designs.

The course has no formal prerequisites. However, it is expected that students have a rudimentary knowledge of real analysis (derivatives, Taylor's expansion) and probability theory. Students who wish to accomplish a project should have basic programming capabilities.

Required Textbook: Prediction, Learning, and Games by Nicolò Cesa-Bianchi and Gábor Lugosi

Most of the material of the course can be found in the book, though some additional material will also be included.
The slides above are from the Fall 2006 offering of the course. The new slides may differ slightly; if they do, they will be posted to this website in due time.
For additional reading material (that may be handy when it comes to designing your projects), see the annotated references at the end of the page.


Please refer to the Moodle page of the course for details of grading.

Plagiarism and cheating

All assignments should be done individually. The goal of the assignments and the project work is that you gain knowledge and experience; copying someone else's solution will certainly be of no help in this respect. However, it can sometimes be valuable to discuss the course material with your fellow students (or other instructors). If you collaborate on or discuss your assignment with someone else (student or instructor), this must be clearly stated in your report.

"The University of Alberta is committed to the highest standards of academic integrity and honesty. Students are expected to be familiar with these standards regarding academic honesty and to uphold the policies of the University in this respect. Students are particularly urged to familiarize themselves with the provisions of the Code of Student Behaviour (online at and avoid any behaviour which could potentially result in suspicions of cheating, plagiarism, misrepresentation of facts and/or participation in an offence. Academic dishonesty is a serious offence and can result in suspension or expulsion from the University."

"Policy about course outlines can be found in §23.4(2) of the University Calendar."

Additional Reading Material

Papers that make statistical assumptions about the environment

Bandits, limited feedback

Thomas S. Ferguson: Optimal Stopping and Applications (on-line book)

Robert Kleinberg: Nearly Tight Bounds for the Continuum-Armed Bandit Problem (bandits identified by elements of a compact set, Lipschitz losses either adversarially or randomly changing, n^(2/3) regret, almost matching lower bound)

Lai and Yakowitz: Machine Learning and Nonparametric Bandit Theory, IEEE TAC:40, pp. 1199-1209, 1995 (extending bandits to MDPs, approaching logarithmic regret, but algorithm is computationally infeasible)

Todd L. Graves and Tze Leung Lai: Asymptotically efficient adaptive choice of control laws in controlled Markov chains, SIAM J. Control and Opt., 35:715-743, 1997 (goal is to select the best from a finite set of stationary policies in an MDP, MDP can be continuous, explicit testing phases with increasing rarity, blocking technique, log-regret)

Omid Madani, Daniel J. Lizotte and Russell Greiner: Active Model Selection (searching for a large reward?)

Klaus Adam: Learning While Searching for the Best Alternative, Journal of Economic Theory 101:252-280, 2001 (searching for a large reward with costs, reservation prices similar to Gittins indices, introduces a special condition ruling out `strong positive learning effects', but this allows a closed form optimal strategy)

Haya Kaspi and Avishai Mandelbaum: Multi-armed bandits in discrete and continuous time, The Annals of Applied Probability, 8:1270-1290, 1998 (continuous time generalization of bandit problems)

Nicole El Karoui and Ioannis Karatzas: Synchronization and optimality for multi-armed bandit problems in continuous time (generalization to the case when arms are not independent)

Chih-Chun Wang, Sanjeev R. Kulkarni and H. Vincent Poor: Bandit Problems with Side Observations, IEEE TAC, Vol. 50, 2005 (Arms share parameters to be estimated. How to sample? Do we still need to explore? Sometimes not! Sometimes yes. Does there exist a universally optimal strategy?)

Dirk Bergemann and Juuso Valimaki: Bandit Problems (discusses bandit problems and gives some examples from economics)

Peter Auer, Nicolò Cesa-Bianchi and Paul Fischer: Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, 47:235-256, 2002 (regret analysis of upper confidence bound based algorithms, epsilon-greedy)

Yuhong Yang and Dan Zhu: Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates, The Annals of Statistics, 30:100–121, 2002 (plug-in estimation, epsilon-greedy sampling with epsilon going to zero, consistency, but no regret analysis)

Rangarajan K. Sundaram: Generalized Bandit Problems (Bayesian setting: infinite number of bandits, parallel search, cost of switching -- can the Gittins-Jones Thm be generalized to these cases? yes, no, no)

Luc Devroye, Adam Krzyzak: Random search under additive noise (systematic literature review of results on random search, with some consistency results)

Hiraoka, Amari: Optimal Strategy under Unknown Stochastic Environment -- Nonparametric Lob-Pass Problem (O(t^eps) regret for the lob-pass problem, where the opponent reacts based on a simple statistic of our play and in a static, but unknown probabilistic way)

PAC Learning/Exploration (why is this here?:))

Claude-Nicolas Fiechter: Efficient Reinforcement Learning (discounted expected total reward, reset, learns a model and then argues that by continuity the computed policy is near-optimal)

Claude-Nicolas Fiechter: PAC Associative Reinforcement Learning (an early paper on associative bandits, here called associative reinforcement learning, specialized to decision lists, but actually the underlying reasoning holds for more general regression estimation techniques)

Claude-Nicolas Fiechter: Expected mistake bound model for on-line reinforcement learning (here is the answer: "equivalence" of the regret model to the PAC model: If you can learn good policies fast, you can bound the regret)

Michael Kearns and Satinder Singh: Near-Optimal Reinforcement Learning in Polynomial Time (describes E3 and gives a bound on the exploration sample complexity of the algorithm. This algorithm tries to balance exploration and exploitation, unlike Fiechter's `efficient RL' algorithm; no reset, both undiscounted and discounted settings)

Alexander L. Strehl, Michael L. Littman: A Theoretical Analysis of Model-Based Interval Estimation: Proofs ("sample complexity of exploration", bounds, compare with UCB!)

Alexander L. Strehl, Chris Mesterharm, Michael L. Littman, Haym Hirsh: Experience-Efficient Learning in Associative Bandit Problems (PAC bounds, reduction to classification instead of regression as one would have expected, are the rates optimal?)

Shie Mannor and John N. Tsitsiklis: The Sample Complexity of Exploration in the Multi-Armed Bandit Problem, Journal of Machine Learning Research 5:623-648, 2004 (Even-Dar et al.'s median elimination has sample complexity O(n/eps^2 log(1/delta)); here this is shown to be optimal by a matching lower bound)

Papers with no statistical assumptions about the environment

Bandits, limited feedback

P. Auer, N. Cesa-Bianchi, Y. Freund, and Robert Schapire. The Nonstochastic Multiarmed Bandit Problem, SIAM J. Computing, Volume 32, Number 1, pp. 48-77, 2002.

Levente Kocsis and Csaba Szepesvari: Reduced-Variance Payoff Estimation in Adversarial Bandit Problems (Points out that you can use any payoff estimate you want as long as you control the variance of the estimate. What is more, the variance of your estimator will be reflected by the regret -- the variance appears as the constant multiplier of the regret. The findings are empirically demonstrated in a poker program that plays Omaha HiLo)

Nicolo Cesa-Bianchi, Gabor Lugosi and Gilles Stoltz: Regret Minimization Under Partial Monitoring (partial monitoring generalizes bandit problems; you get some feedback from which on the average -- in expectation -- you can decode the losses of the individual experts; examples include dynamic pricing, apple tasting -- regret bound is just n^(2/3), and this is the optimal rate)

Large classes of experts

Adam Kalai and Santosh Vempala: Efficient Algorithms for Online Decision Problems (follow the perturbed leader for shortest path, tree update, etc.)

M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Twentieth International Conference on Machine Learning, 2003 (Convex optimization setting, as in the course. A gradient algorithm is proposed and analyzed. The regret is O(n^{1/2}).)

Abraham D. Flaxman, Adam Tauman Kalai, H. Brendan McMahan: Online convex optimization in the bandit setting: gradient descent without a gradient (Setting: as in the previous paper, but here the full loss function is not revealed, only its value at the chosen point. Can you still compete with the overall best solution? When you cannot compute the gradient, you can use a randomization idea /SPSA/ to estimate it. Even with a single(!) sample. Cool! The total regret is O(n^{3/4}). Is it possible to do any better than this?)

E. Hazan, A. Kalai, S. Kale, and A. Agarwal. Logarithmic Regret Algorithms for Online Convex Optimization, COLT'2006 (Another recent paper, showing logarithmic regret. Note that the follow-the-leader algorithm has logarithmic regret, too!)

Levente Kocsis and Csaba Szepesvari: Discounted UCB -- slides of the presentation at the Workshop on the Exploration-Exploitation Challenge of PASCAL (the environment is changing in an unknown way; typically slowly drifting -- can you design an efficient algorithm?)

András György and György Ottucsák: Adaptive Routing Using Expert Advice (applies Exp3.P to routing problems)

Reactive opponents

Daniela Pucci de Farias, Nimrod Megiddo: Exploration-Exploitation Tradeoffs for Experts Algorithms in Reactive Environments (finding best counter-actions, as above; criticizes the previous non-oblivious opponent results -- I agree with that:))

Jan Poland and Marcus Hutter: Defensive Universal Learning with Experts (generalizes the previous paper to loss functions that can grow in time without limits, countably infinite actions, uses a follow-the-perturbed leader algorithm)

Other papers

Applications to Online search

Laurent Peret and Frederick Garcia: On-line search for solving Markov Decision Processes via heuristic sampling (tries to improve the trajectory tree method by guiding the tree building process by estimating the error of the value function estimates, empirical evaluation on the sailing domain)

Hyeong Soo Chang, Michael C. Fu and Steven I. Marcus: An Adaptive Sampling Algorithm for Solving Markov Decision Processes (same problem as previous paper, depth first search, does not actually build the tree -- never caches results; best for very large domains where it is unlikely that the same state will ever be encountered twice; evaluation on queues)


C. Hayes, P. Massa, P. Avesani, and P. Cunningham: An on-line evaluation framework for recommender systems (not a theoretical work, argues for the advantages of measuring performance online)