Monday, June 25, 2012

Coactive Learning

Shivaswamy and Joachims have a paper called Online Structured Prediction via Coactive Learning at ICML 2012 this year. Joachims, of course, is associated with a classic line of research which I'll summarize thusly: attempting to impute absolute relevance scores from behavioral data exhaust is not effective, and that imputing relative preferences leveraging an attentional model (e.g., serial scan) is more effective. This is one of those deep tricks'' that you can carry with you into many different situations.

So the classic example is when you have a search engine result, and you get just one click at a particular position $p$, and your attentional model assumes that the user considered every result up to that position plus one more. Therefore the partial preferences $\forall x \in [1, p + 1], x \neq p: r_p > r_x$ are revealed and added to the (ranking) training set.

Later in my career I began to appreciate stochastic contextual bandits, specifically the importance of debiasing the historical state-action density in order to get consistent estimates. That left me with an inconsistency: on the one hand, optimizing a search engine with click feedback is definitely Learning Through Exploration, since you only get information about the relative preferences of (a subset of) the items presented. On the other hand I'm not attempting to debias the historical state-action density when I'm doing straight Joachims.

I was hoping this paper would resolve this difficulty for me. It did, but not in the way I expected; the contextual bandit literature is only referred to in the introduction for comparative purposes. Instead the authors make the following assumptions:
1. User loss is convex in (linear) utility differences.
2. Users only suggest improvements (i.e., user feedback always points downhill'').
3. Users only suggest significant improvements (i.e., feedback states have a utility increment at least proportional to the increment to the optimum).
Under these circumstances it is sensible that a Perceptron-style algorithm achieves a good regret bound. The authors also explore relaxations of these assumptions (e.g., improvements are only significant in expectation, or feedback occasionally points downhill) and the resulting degradation of the regret guarantee.

I suspect the analysis does not look how I anticipated because, subject to the conditions of the previous paragraph, the user feedback can be chosen adversarially. Nonetheless it could be interesting to consider a contextual bandit style'' formulation, e.g., instead of learning the reward associated with the chosen arm, one learns the difference between the reward of the chosen arm and another arm. A good place to start would be the literature on contextual bandits with controlled side information, but a key difference here is that the user feedback is not under control of the algorithm.