Adaptivity, Privacy, and Efficiency
Explicit regularization (via adding a convex regularizer) is the
mainstay of (distribution-free) statistically efficient learning algorithms. [Kalai,
Vempala 2002], following [Hannan 1957], present an alternative framework
(FTPL, Follow the Perturbed Leader) for sequential prediction, based on randomly perturbing the history of
losses incurred on previous predictions. While these perturbation-based algorithms admit
computationally efficient implementations, their statistical efficiency (vis-à-vis
regularization) is, beyond a few simple settings, not well understood.
Building on recent progress towards this understanding, we present three
compelling cases for the use of perturbation methods in online learning, each
improving upon the state of the art. The theme unifying the three applications is
the use of structured noise for the perturbations, an element not
employed in previous analyses.
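To make the framework concrete, here is a minimal sketch (in Python) of FTPL for prediction with expert advice. It is illustrative only: the exponential perturbation law and the scale parameter eta are assumptions made for this sketch, not the tuned choices from the works cited above.

import numpy as np

def ftpl_expert_advice(loss_matrix, eta, rng=None):
    # Follow the Perturbed Leader for the experts setting: at each round,
    # play the expert whose perturbed cumulative loss so far is smallest.
    rng = np.random.default_rng() if rng is None else rng
    T, n = loss_matrix.shape            # T rounds, n experts
    cumulative = np.zeros(n)            # cumulative losses observed so far
    total_loss = 0.0
    for t in range(T):
        noise = rng.exponential(scale=eta, size=n)    # i.i.d. perturbation
        choice = int(np.argmin(cumulative - noise))   # the perturbed leader
        total_loss += loss_matrix[t, choice]
        cumulative += loss_matrix[t]                  # full-information feedback
    return total_loss

# Usage: 1000 rounds, 10 experts, Bernoulli losses; eta on the order of
# sqrt(T) is a common scaling for such perturbations.
rng = np.random.default_rng(0)
losses = rng.binomial(1, 0.5, size=(1000, 10)).astype(float)
print(ftpl_expert_advice(losses, eta=np.sqrt(1000.0), rng=rng))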
Adapting to the Inherent Structure
For the problem of prediction with expert advice, when the loss vectors are restricted to some unknown d-dimensional subspace (a setting fondly referred to as low-rank experts), we demonstrate that a natural variant of FTPL achieves a regret of d^0.75 √T, crucially independent of the number of experts. This improves on the previous best bound of d √T, achieved by regularization-based schemes due to [Hazan, Koren, Livni, Mansour 2016].
Privacy for Free
We design differentially private algorithms for online linear optimization, in
both the full-information and bandit settings, with optimal √T regret bounds. In the
full-information setting, our results demonstrate that differential
privacy may be ensured for free -- in particular, the regret bounds scale as √T + 1/ε. Notably, this phenomenon
had not been observed previously, even in the (weaker) setting of statistical learning.
This separation of the extra (additive) regret suffered to ensure privacy from
the non-private regret bound is made possible by reinterpreting the
noise-addition schemes used to guarantee privacy as random perturbations of the objective.
For bandit linear optimization, and, as a special case, for non-stochastic
multi-armed bandits, the proposed algorithm achieves a regret of √T/ε,
improving on the previously best-known bound of T^0.75/ε.
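To illustrate this reinterpretation, here is a heavily simplified sketch (in Python) for the full-information setting over the probability simplex: the Laplace noise added to the cumulative losses both masks each individual loss vector (privacy) and plays the role of the FTPL perturbation (regret). The noise scale and the use of fresh per-round noise are assumptions made for this sketch; practical constructions keep the cumulative noise much smaller via tree-based aggregation of the loss sequence (in the spirit of the Dwork et al. continual-observation counters listed below), which is omitted here.

import numpy as np

def private_ftpl_full_info(loss_vectors, eps, rng=None):
    # Simplified private online linear optimization over the simplex:
    # perturb the cumulative loss with Laplace noise and play the
    # minimizing vertex.  The same noise serves privacy and regret.
    rng = np.random.default_rng() if rng is None else rng
    T, n = loss_vectors.shape
    cumulative = np.zeros(n)
    plays = []
    for t in range(T):
        noisy_cumulative = cumulative + rng.laplace(scale=1.0 / eps, size=n)
        x = np.zeros(n)
        x[int(np.argmin(noisy_cumulative))] = 1.0   # perturbed leader (a vertex)
        plays.append(x)
        cumulative += loss_vectors[t]               # full-information feedback
    return plays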
Efficient Algorithms for Contextual Learning
We devise the first algorithm for transductive contextual learning that enjoys optimal √T regret guarantees while admitting a computationally efficient implementation at run-time. In contrast, previous works either obtain sub-optimal regret bounds or require explicit enumeration of the policy class at every iteration. Contextual learning (with partial feedback) is a strict, far-reaching generalization of the Markov Decision Process model for policy-based reinforcement learning: it offers the same sample complexity bounds without placing any assumptions on the underlying state dynamics (for example, conditionally Markovian evolution of the states).
The first and third results are joint work with my advisor, Elad Hazan (Princeton). The second result is joint work with Naman Agarwal (Princeton).
Reading List
Texts
Bubeck, Cesa-Bianchi: Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems.
Hazan: Introduction to Online Convex Optimization.
Shalev-Shwartz: Online Learning and Online Convex Optimization.
Szepesvári: Algorithms for Reinforcement Learning.
Papers
Contextual Learning
Dudik, Hsu, Kale, Karampatziakis, Langford, Reyzin, Zhang: Efficient optimal learning for contextual bandits. UAI 2011
Kakade, Kalai: From batch to transductive online learning. NIPS 2005
Langford, Zhang: The epoch-greedy algorithm for multi-armed bandits with side information. NIPS 2008
Rakhlin, Sridharan: BISTRO: An efficient relaxation-based method for contextual bandits. ICML 2016
Syrgkanis, Krishnamurthy, Schapire: Efficient Algorithms for Adversarial Contextual Learning. ICML 2016
Imitation Learning
Chang, Krishnamurthy, Agarwal, Daumé III, Langford: Learning to search better than your teacher. ICML 2015
Ross, Bagnell: Efficient Reductions for Imitation Learning. AISTATS 2010
Ross, Gordon, Bagnell: A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS 2011
Learning with Partial Information
Bubeck, Cesa-Bianchi, Kakade, Mannor, Srebro, Williamson: Towards Minimax Policies for Online Linear Optimization with Bandit Feedback. COLT 2012
Dani, Hayes, Kakade: The Price of Bandit Information for Online Optimization. NIPS 2007
Neu, Bartók: Importance weighting without importance weights: An efficient algorithm for combinatorial semi-bandits. JMLR 2016
Differentially Private Online Learning
Dwork, Naor, Pitassi, Rothblum: Differential privacy under continual observation. STOC 2010
Jain, Kothari, Thakurta: Differentially Private Online Learning. COLT 2012
Smith, Thakurta: (Nearly) optimal algorithms for private online learning in full-information and bandit settings. NIPS 2013
Perturbation Schemes in Statistics
Abernethy, Lee, Sinha, Tewari: Online Linear Optimization via Smoothing. COLT 2014
Hazan, Jaakkola: On the Partition Function and Random Maximum A-Posteriori Perturbations. ICML 2012
No-Regret Algorithms in Structured Settings
Garber, Hazan, Ma: Online Learning of Eigenvectors. ICML 2015
Hazan, Koren, Livni, Mansour: Online Learning with Low-Rank Experts. COLT 2016