- Yuval Peres

- Microsoft Research-Redmond

Consider a player who repeatedly selects one of two possible actions. In each round the player learns the loss arising from his action, and his regret after $T$ rounds is his cumulative loss minus the loss he would have incurred by consistently selecting the better of the two actions. It is well known that in this setting (the adversarial two-armed bandit problem) several algorithms, including a variant of multiplicative weights, can ensure the player a regret of at most $O(T^{1/2})$, and this is sharp. However, if the player incurs a unit cost each time he switches actions, then the best known upper bound on regret was $O(T^{2/3})$, while the best known lower bound was still of order $T^{1/2}$. We close this gap by proving that the upper bound is sharp (up to a log term). In the corresponding full-information problem, the minimax regret is known to grow at the slower rate of $T^{1/2}$. The difference between these two rates indicates that learning with bandit feedback (i.e., knowing only the loss from the player's own action, not that of the alternative) can be significantly harder than learning with full-information feedback. It also shows that without switching costs, any regret-minimizing algorithm for the bandit problem must sometimes switch actions very frequently. The proof is based on an information-theoretic analysis of a loss process arising from a multi-scale random walk. (Joint work with Ofer Dekel, Jian Ding and Tomer Koren, to appear in STOC 2014; available at http://arxiv.org/abs/1310.2997 )
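
For concreteness, the regret with switching costs described above can be written out as follows. The notation is an assumption introduced here for illustration and does not appear in the abstract: $a_t \in \{1,2\}$ is the action chosen in round $t$, $\ell_t(a)$ is the loss of action $a$ in round $t$, and $\mathbf{1}[\cdot]$ is the indicator function.

$$
R_T \;=\; \sum_{t=1}^{T} \ell_t(a_t) \;+\; \sum_{t=2}^{T} \mathbf{1}[a_t \neq a_{t-1}] \;-\; \min_{a \in \{1,2\}} \sum_{t=1}^{T} \ell_t(a)
$$

In this notation, the result stated above says that with bandit feedback the minimax value of $R_T$ grows as $T^{2/3}$ up to a log term, whereas with full-information feedback it grows as $T^{1/2}$. Dropping the middle sum recovers the usual regret without switching costs.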