Deep Learning from Logged Interventions
by Thorsten Joachims
Every time a system places an ad, presents a search ranking, or makes a recommendation, we can think of this as an intervention for which we can observe the user’s response (e.g., click, dwell time, purchase). Such logged intervention data is one of the most plentiful types of data available, as it can be recorded from a variety of systems (e.g., search engines, recommender systems, ad placement) at little cost. However, this data provides only partial-information feedback – aka “bandit feedback” – limited to the particular intervention chosen by the system. We don’t get to see how the user would have responded if we had chosen a different intervention. This makes learning from logged bandit feedback substantially different from conventional supervised learning, where “correct” predictions together with a loss function provide full-information feedback. It is also different from online learning in the bandit setting, since the algorithm does not assume interactive control of the interventions. In this talk, I will explore learning methods for batch learning from logged bandit feedback (BLBF). Following the inductive principle of Counterfactual Risk Minimization for BLBF, this talk presents an approach to training linear models and deep networks from propensity-scored bandit feedback.
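To make the idea of learning from propensity-scored bandit feedback concrete, here is a minimal sketch (not the speaker's implementation) of the core ingredient: an inverse-propensity-scored (IPS) estimate of a new policy's risk, minimized by gradient descent over a linear softmax policy. All data, dimensions, and the loss definition are invented toy assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy logged bandit data: contexts X, actions A drawn by a uniform logging
# policy, logging propensities p0 = pi0(A|X), and the observed loss only
# for the action that was actually taken (bandit feedback).
n, d, k = 5000, 5, 3
X = rng.normal(size=(n, d))
true_W = rng.normal(size=(d, k))            # hidden "ground truth" scorer
A = rng.integers(0, k, size=n)              # uniform-random logging policy
p0 = np.full(n, 1.0 / k)                    # its propensities
best = (X @ true_W).argmax(axis=1)
loss = (A != best).astype(float)            # 0 if logged action was best, else 1

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def ips_risk(W):
    """IPS estimate of the new policy's expected loss:
    mean over the log of loss * pi(a|x) / pi0(a|x).
    (Counterfactual Risk Minimization additionally clips these
    weights and adds a variance penalty; omitted here for brevity.)"""
    pi = softmax(X @ W)
    return np.mean(loss * pi[np.arange(n), A] / p0)

# Minimize the IPS risk by full-batch gradient descent on the logits.
W = np.zeros((d, k))
lr = 0.5
onehot = np.eye(k)[A]
for _ in range(500):
    pi = softmax(X @ W)
    coef = (loss * pi[np.arange(n), A] / p0)[:, None]
    grad_Z = coef * (onehot - pi) / n       # d/dZ of the IPS objective
    W -= lr * (X.T @ grad_Z)

accuracy = np.mean((X @ W).argmax(axis=1) == best)
```

Because the logging policy is uniform, the IPS weights are unbiased, and minimizing the estimated risk steers probability mass away from logged actions that incurred loss; the learned policy's action-selection accuracy ends up well above the 1/3 random baseline.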
About the speaker: Thorsten Joachims is a Professor in the Department of Computer Science and in the Department of Information Science at Cornell University. His research interests center on a synthesis of theory and system building in machine learning, with applications in information access, language technology, and recommendation. His past research focused on counterfactual and causal inference, support vector machines, text classification, structured output prediction, convex optimization, learning to rank, learning with preferences, and learning from implicit feedback. He is an ACM Fellow, AAAI Fellow, and Humboldt Fellow.
In this talk, I’ll cover three areas our team at DeepMind has been working on in recommender systems. First, recommender systems often observe delayed signals, such as longer-term user engagement and user conversions; delays may also simply result from logging or data-pipeline issues in online learning models. How do we learn from these delayed signals efficiently? Finally, due to positional and contextual biases, top-k ranking may not give us the optimal slate for a user. How do we learn the joint distribution of slates and user responses so that we can directly generate the slate?
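The point that top-k ranking can be suboptimal as a slate can be seen in a tiny toy model (all numbers invented for illustration): if user response depends on position and on interactions between items in the slate, the slate built from the k highest-scoring items need not maximize expected engagement.

```python
from itertools import permutations

# Toy response model: expected engagement of a 2-item slate is a
# position-weighted sum of item attractiveness, minus a redundancy
# penalty when both items cover the same topic.
attract = {"a1": 0.9, "a2": 0.8, "b1": 0.6}
topic = {"a1": "A", "a2": "A", "b1": "B"}
pos_weight = [1.0, 0.5]          # position bias: lower slots get less attention

def engagement(slate):
    val = sum(w * attract[i] for w, i in zip(pos_weight, slate))
    if topic[slate[0]] == topic[slate[1]]:
        val -= 0.4               # users skip redundant results
    return val

# Top-k slate: the two items with the highest individual scores.
greedy = tuple(sorted(attract, key=attract.get, reverse=True)[:2])
# Optimal slate: exhaustive search over all ordered pairs.
optimal = max(permutations(attract, 2), key=engagement)
```

Here the greedy top-2 slate is ("a1", "a2"), both from topic A, with engagement 0.9, while the optimal slate ("a1", "b1") trades the second-best item for diversity and reaches 1.2 — which is why one might want to model and generate whole slates rather than rank items independently.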
About the speaker: Ray Jiang is a Research Scientist in Machine Learning at DeepMind. During her PhD, she worked on deep learning (what-where auto-encoders, speech adaptation, concept drift, …) and protein folding. After graduating, she joined Facebook to work on online recommender systems. Her current interests range from generative modeling for slate optimization and hierarchical reinforcement learning to machine learning fairness.