Contextual bandits---an extension of the classical multi-armed bandit problem---is a basic decision-making problem where the reward distribution depends on the side-information (covariates) presented to the statistician before each decision. A fundamental challenge in contextual bandits is to develop flexible, general-purpose algorithms with computational requirements no worse than classical supervised learning tasks. In this talk, we will describe a universal and minimax optimal reduction from contextual bandits to online and offline regression. We characterize the minimax rates for contextual bandits with general, potentially nonparametric function classes, and show that our algorithm is minimax optimal whenever the online regression method is optimal. We then turn to the case of iid data and present an adaptive method that attains fast instance-dependent rates, whenever certain disagreement-based notions of problem complexity are bounded. We discuss extensions to combinatorial settings with submodular rewards, as well as reinforcement learning.
Bio: Alexander (Sasha) Rakhlin is a Professor at MIT, with appointments in the Statistics & Data Science Center and the Department of Brain and Cognitive Sciences. His interests are in statistical learning, mathematical statistics, and online methods.