Monday, May 21, 2012

Instruction Theory?

In learning theory we often talk about an environment which is oblivious to our algorithm, or an environment which is fully aware of our algorithm and attempting to cause it to do badly. What about the case where the environment is fully aware of our algorithm and attempting to help it succeed?

Here's a concrete example. Suppose you are trying to communicate a message to a receiver, and this message is one of a finite set of hypotheses. You are forced to communicate to the receiver by sending a sequence of feature-label pairs $(X \times Y)^*$; the receiver will decode the message via ERM on the hypothesis set using the supplied data. How many examples does it take, and how should you chose them? If this sounds corny, consider that evolution works by reuse, so if the capability to learn from experience developed due to other selection pressure, it might be co-opted to service communication a la Cognitive Linguistics.

Intuitively, even if the hypothesis being communicated was learned from experience, it's not good strategy to retransmit the exact data used to learn the hypothesis. In fact, it seems like the best strategy would be not using real data at all; by constructing an artificial set of training examples favorable structure can be induced, e.g., the problem can be realizable. (Funny aside: I TA-d for a professor once who confided that he sometimes lies to undergraduate in introductory courses in order to get them ``closer to the truth''; the idea was, if they took an upper division class he would have the ability to refine their understanding, and if not they were actually better off learning a simplified distortion).

Some concepts from learning theory are backwards in this setup. For instance, Littlestone's dimension indicates the maximum number of errors made in a realizable sequential prediction scenario when the examples are chosen adversarially (and generalizes to the agnostic case). We can chose the examples helpfully here (what's the antonym of adversarial?), but actually we want errors so that many of the hypothesis are incorrect and can be quickly eliminated. Unfortunately we might encounter a condition where the desired-to-be-communicated hypothesis disagrees with at most one other hypothesis on any point. Littlestone finds this condition favorable since a mistake would eliminate all but one hypothesis, and otherwise no harm no foul; but in our situation this is worst-case behaviour, because it makes it difficult to isolate the target hypothesis with examples. In other words, we can chose the data helpfully, but if the set of hypotheses is chosen adversarially this could be still very difficult.

Inventing an optimal fictitious sequence of data might be computationally too difficult for the sender. In this case active learning algorithms might provide good heuristic solutions. Here label complexity corresponds to data compression between the original sequence of data used to learn the hypothesis and the reduced sequence of data used to transmit the hypothesis.

There is fertile ground for variations, e.g., the communication channel is noisy, the receiver does approximate ERM, or the communication is scored on the difference in loss between communicated and received hypothesis rather than 0-1 loss on hypotheses.

Wednesday, May 16, 2012

Active Minimax Forecasting

This is a continuation of my previous post where I noted that the Minimax Forecaster is tantalizing from an active learning perspective. Fortunately I noticed a paper by Abernethy et. al. called A Stochastic View of Optimal Regret through Minimax Duality. In this paper the authors unwrap online convex optimization in a general fashion by leveraging von Neumann's minimax theorem. By doing this they derive a general formula for the value of an online game to the adversary. The intuition of the previous post is that differences in game value between observing and not observing a particular outcome will be key to making active learning decisions in an adversarial setting, so this formula is very interesting.

Abernethy et. al. start out with a more general setup than the previous post, but I'll adapt their conventions to the previous post where there are differences. There is a game with $T$ rounds in it. There is a set $\mathcal{F}$ of experts. On each round, each expert $f$ produces a prediction $f_t$, player produces a prediction $p_t$, adversary simultaneously produces an outcome $y_t$, and player suffers an instantaneous loss $l (p_t, y_t)$. The experts are static (their predictions to do not depend upon previously observed outcomes), so essentially each expert is an sequence $f_{1:T}$. The player wants to generate a sequence of predictions $p_{1:T}$ which minimizes worst-case regret \[
\sup_{y_{1:T} \in \mathcal{Y}^T} L (p_{1:T}, y_{1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:T}),
\] where $L (p_{1:T}, y_{1:T}) = \sum_s l (p_s, y_s)$ is the total loss. (To ease notation supremums and infimums will capture the entire rest of the expression unless explicitly limited by parenthesis.) The Minimax Forecaster of the previous post is an explicit solution to a specific case of this problem, whereas Abernethy et. al. are concerned with characterizing such games in general. They consider the value of the game to the adversary under optimal play, \[
R^* (\mathcal{F}) &= \inf_{p_1 \in \mathcal{P}} \sup_{y_1 \in \mathcal{Y}} \ldots \inf_{p_T \in \mathcal{P}} \sup_{y_T \in \mathcal{Y}} \sum_{t=1}^T l (p_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t).
\] The amazing result is that this is the same as \[
R^* (\mathcal{F}) &= \sup_{Y \in \mathbb{P} (\mathcal{Y}^T)} \mathbb{E}_{y_{1:T} \sim Y} \left[ \sum_{t=1}^T \inf_{p_t \in \mathcal{P}} \biggl( \mathbb{E}_{\tilde y_t} \left[ l (p_t, \tilde y_t) | y_{1:t-1} \right] \biggr) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t) \right],
\] where the supremum is over distributions of outcome sequences $\mathbb{P} (\mathcal{Y}^T)$. In other words, this looks like a game played with an oblivious (non-stationary) environment, but played in the worst possible environment. This is a nifty result, and leads in subsequent work to interpolating between the IID and adversarial settings by constraining the supremum over sequence distributions.

Now with active learning, we make the adversary more powerful by choosing not to observe some of the outcomes (represented by the variable $z_s \in \{ 0, 1 \}$). I can upper bound the game value to the adversary as \[
R^* (\mathcal{F} | z_{1:T}) &\leq \sup_{Y \in \mathbb{P} (\mathcal{Y}^T)} \mathbb{E}_{y_{1:T} \sim Y} \left[ \sum_{t=1}^T \inf_{p_t \in \mathcal{P}} \biggl( \mathbb{E}_{\tilde y_t} \left[ l (p_t, \tilde y_t) | \Omega_{t-1} \right] \biggr) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t) \right],
\] where $\Omega_t = \{ y_s | s \leq t, z_s = 1 \}$ denotes observed outcomes. This is intuitively pleasing because the inner conditional expectation represents the player's knowledge. To derive the upper bound follow the procedure in Appendix A of the paper, after transforming the game value expression whenever $z_s = 0$ from \[
\inf_{p_1 \in \mathcal{P}} \sup_{y_1 \in \mathcal{Y}} \cdots \inf_{p_s \in \mathcal{P}} \sup_{y_s \in \mathcal{Y}} \cdots \inf_{p_T \in \mathcal{P}} \sup_{y_T \in \mathcal{Y}} \sum_{t=1}^T l (p_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t),
\] to \[
\inf_{p_1 \in \mathcal{P}} \sup_{y_1 \in \mathcal{Y}} \cdots \inf_{p_s \in \mathcal{P}} \cdots \inf_{p_T \in \mathcal{P}} \sup_{y_T \in \mathcal{Y}} \sup_{y_s \in \mathcal{Y}} \sum_{t=1}^T l (p_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t),
\] i.e., by letting the adversary defer selection of the unobserved values until the end of the game. This is just an upper bound because in reality the adversary has to chose the outcome for round $s$ on round $s$ so possibly I'm being too generous to the adversary.

Minimax Forecasting

If we design a Minimax Forecaster on the extensive form game associated with the bound, we get an algorithm which optimizes the bound. Consider the case where all outcomes are observed except for round $s$. Repeating the backward induction steps from the original minimax forecaster paper yields the expressions, \[
p_t^* &= \frac{1}{2} \biggl( 1 - R^* (\mathcal{F}, y_{1:t-1}0 | z_s = 0) + R^* (\mathcal{F}, y_{1:t-1}1 | z_s = 0) \biggr). & (t > s)
\] and \[
&R^* (\mathcal{F}, y_{1:t} | z_s = 0) \\
&= \frac{T - t}{2} + \mathbb{E}_{\sigma_{t+1:T}}\left[ \sup_{y_s \in \mathcal{Y}} |p_s - y_s| - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t} \sigma_{t+1:T}) \right] & (t > s) \\
&= R^* (\mathcal{F}, y_{1:t}) \\
&\quad + \mathbb{E}_{\sigma_{t+1:T}}\left[ \left| p_s - \frac{1}{2} \left(1 + \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 0 \sigma_{s+1:T}) \right) - \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 1 \sigma_{s+1:T}) \right) \right) \right| \right].
\] Thus the residual game value having played $p_s$ and not observed outcome $s$ is equal to the fully observed residual game value plus a penalty related to expected absolute loss averaged over all Rademacher distributed playouts. This implies the optimal choice of $p_s$ is a median \[
p_s^* &= \mathop{\operatorname{median}} \left( \frac{1}{2} \left( 1 + \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 0 \sigma_{s+1:T}) \right) - \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 1 \sigma_{s+1:T}) \right) \right) \right).
\] In the case where there are no additional rounds of play between $s$ and $T$, there is only one point in the distribution so the median is that point, so the optimal play $p_s^*$ is the same as in the fully observed case. In the case where there is one additional round of play between $s$ and $T$, there are two points in the distribution so the mean is a median, and again the optimal play $p_s^*$ is the same as in the fully observed game (consistent with the previous blog post). In the case of more than one additional round of play between $s$ and $T$, the mean is not necessarily a median (i.e., does not necessary minimize expected absolute loss) so optimal play $p_s^*$ is different. Maybe this is why Mathematica ``pooped out'' when I tried to solve a longer version of the unobserved game.

Once we've ``hopped over'' the unobserved point and determined $p_s^*$, we can continue the backwards induction; but I think the interesting bit is the penalty term, which says what the value of observing the outcome on round $s$ given all previous and subsequent rounds will be observed, \[
\mathbb{E}_{\sigma_{t+1:T}}\left[ \left| p_s - \frac{1}{2} \left(1 + \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 0 \sigma_{s+1:T}) \right) - \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 1 \sigma_{s+1:T}) \right) \right) \right| \right].
\] This penalty term would be important for deciding whether or not to query the outcome on round $s$. It'd be nice to generalize it to the case where some of the previous outcomes are not observed and also with respect to potentially unobserved future outcomes. Of course, planning the latter might be exponentially difficult.

For this to be practical, it would be nice if medians and expected absolute losses could be cheaply and accurately estimated, e.g., using random playouts. Also perhaps deciding whether to query the outcome needs to be done on a game value bound, e.g., assuming all subsequent observations will be observed rather than taking into account future decisions to observe the outcome. Furthermore, the technique is still fundamentally transductive since the expert predictions for the planning horizon $f_{1:T}$ need to be known. Even with all those caveats, however, there might be an interesting application for this, e.g., in recommendation systems.

Sunday, May 6, 2012

The Minimax Forecaster and Transductive Active Learning

I've been making my way down the NIPS 2011 paper list, and found this nice paper Efficient Online Learning via Randomized Rounding by Cesa-Bianchi and Shamir. This paper is about improving and extending the Minimax Forecaster which is described in Prediction, Learning, and Games. (I own a copy but I confess to not having made it very far into that book.) The Minimax Forecaster uses a different strategy for online learning than mirror descent which is essentially what I (and everybody else?) use everyday. This different setting provides an opportunity to think about adversarial active learning.

Here's the basic setup. There is a game with $T$ rounds in it. There is a set $\mathcal{F}$ of experts. On each round, each expert $f$ produces a prediction $f_t$, player produces a prediction $p_t$, adversary simultaneously produces an outcome $y_t$, and player suffers an instantaneous loss $l (p_t, y_t)$. The experts are static (their predictions to do not depend upon previously observed outcomes), so essentially each expert is an sequence $f_{1:T}$. The player wants to generate a sequence of predictions $p_{1:T}$ which minimizes worst-case regret \[
\sup_{y_{1:T} \in \mathcal{Y}^T} \biggl( L (p_{1:T}, y_{1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:T}) \biggr),
\] where $L (p_{1:T}, y_{1:T}) = \sum_s l (p_s, y_s)$ is the total loss. When the observations are binary $\mathcal{Y} = \{ 0, 1 \}$, then an instantaneous loss of $|p_t - y_t|$ corresponds to expected 0-1 loss when the player is randomizing decisions. Amazingly this case yields a closed-form expression for the optimal prediction \[
p^*_t &= \frac{1}{2} \biggl( 1 + R^* (\mathcal{F}, y_{1:t-1}1) - R^* (\mathcal{F}, y_{1:t-1}0) \biggr) \\
&= \frac{1}{2} \biggl( 1 + \mathbb{E}_{\sigma_{t+1:T}} \left[ \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t-1} 0 \sigma_{t+1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t-1} 1 \sigma_{t+1:T}) \right] \biggr),
\] where temporal sequence concatenation is denoted lexically, $\sigma_t$ is $\mathrm{Bournelli}(1/2)$ distributed a la Rademacher averages, and $R^* (\mathcal{F}, y_{1:t})$ is the residual game value after some rounds of play, \[
R^* (\mathcal{F}, y_{1:t}) &= \frac{1}{2} \biggl( 1 + R^* (\mathcal{F}, y_{1:t-1}0) + R^* (\mathcal{F}, y_{1:t-1}1) \biggr) \\
&= \frac{T - t}{2} - \mathbb{E}_{\sigma_{t+1:T}}\left[ \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t} \sigma_{t+1:T}) \right].
\] Essentially what's happening here is that player is able to make adversary indifferent to playing either option on each round by playing a constant plus the difference between the residual game values associated with playing each option; this causes the residual game value to be a constant plus the average value of continuing after playing each option. Unwrapping the game value recursively leads to the Rademacher style averages. One observation of the paper is that such expectations can be approximated by sampling to achieve a high probability regret bound, aka random playout.

In practice even to do random playout you need to know $f_{1:T}$ for the various experts. When mapping this to a contextual prediction setting, this corresponds to knowing the sequence of features in advance (but not the labels). Thus this is essentially a transductive technique. Some recommendation problems are naturally transductive, and the paper discusses an application to collaborative filtering.

Active Learning?

In principle the setup can be modified to consider active learning. Each round, in addition to generating a prediction, player must make a decision $z_t \in \{ 0, 1 \}$ whether or not to observe $y_t$. If $z_t = 0$, player cannot use the value of $y_t$ in subsequent predictions. Since it is always better for the player to observe $y_t$, there has to be some penalty for doing so, thus consider a constant penalty $\alpha$ per observation. The player wants to generate a sequence of predictions $p_{1:T}$ and queries $z_{1:T}$ which minimizes worst-case regret \[\sup_{y_{1:T} \in \mathcal{Y}^T} \biggl( \sum_s \alpha z_s + L (p_{1:T}, y_{1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:T}) \biggr).
\] Concise general closed-form expressions have eluded me thus far, but there is a non-trivial case which yields nice answers: the two-round game.

It never makes sense to observe the final outcome $y_T$, so $z_T = 0$. In the two-round game, then, the question is whether to observe $y_1$. If $y_1$ is not observed (i.e., $z_1 = 0$), player must ballistically plan both predictions without intermediate feedback, \[
(p_1^*, p_2^*) &= \mathop{\operatorname{arg\,inf}}\limits_{p_{1:2} \in \mathcal{P}^2} \sup_{y_{1:2} \in \mathcal{Y}^2} \left( |p_1 - y_1| + |p_2 - y_2| - \inf_{f \in \mathcal{F}} L (f_{1:2}, y_{1:2}) \right).
\] This can be solved with Mathematica: here's the incantation.
Minimize[{ z, 
           p1 + p2 - inf00 <= z, 
           p1 + (1 - p2) - inf01 <= z, 
           (1 - p1) + p2 - inf10 <= z, 
           (1 - p1) + (1 - p2) - inf11 <= z }, 
           { p1, p2, z }] // Simplify
This has solution \[
p_1^* &= \frac{1}{2} \left( 1 + \frac{1}{2} \sum_{y_2=0}^1 \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, 0y_2) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 1y_2) \right) \right) & (z_1 = 0), \\
p_2^* &= \frac{1}{2} \left(1 + \frac{1}{2} \sum_{y_1=0}^1 \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, y_10) - \inf_{f \in \mathcal{F}} L (f_{1:2}, y_11) \right) \right) & (z_1 = 0),
\] with game value \[
&R (\mathcal{F}, \emptyset | z_1 = 0) \\
&= 1 - \frac{1}{2} \min\left\{ \inf_{f \in \mathcal{F}} L (f_{1:2}, 00) + \inf_{f \in \mathcal{F}} L (f_{1:2}, 11), \inf_{f \in \mathcal{F}} L (f_{1:2}, 01) + \inf_{f \in \mathcal{F}} L (f_{1:2}, 10) \right\}. \\
\] Now compare this to the case of $z_1 = 1$, which is the same as the fully observed Minimax Forecaster. \[
p_1^* &= \frac{1}{2} \left( 1 + \frac{1}{2} \sum_{y_2=0}^1 \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, 0y_2) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 1y_2) \right) \right) & (z_1 = 1), \\
p_2^* &= \frac{1}{2} \left(1 + \inf_{f \in \mathcal{F}} L (f_{1:2}, y_10) - \inf_{f \in \mathcal{F}} L (f_{1:2}, y_11) \right) & (z_1 = 1).
\] The first round prediction $p_1^*$ is the same whether or not $z_1 = 0$ or $z_1 = 1$, but the second round prediction $p_2^*$ is different. If $z_1 = 0$, then $p_2^*$ is computed by averaging over possible histories; whereas if $z_1 = 1$, then $p_2^*$ is computing using the actual observed history. (Aside: perhaps constant-time Radacher averages will be quantum computing's killer app.)

To decide whether to observe $y_1$ or not, we need to know how much better it is to do so, i.e., the difference in game values. When $z_1=1$ this is the same as the fully observed Minimax Forecaster, \[
&R (\mathcal{F}, \emptyset | z_1 = 1) \\
&= 1 - \frac{1}{4} \left( \inf_{f \in \mathcal{F}} L (f_{1:T}, 00) + \inf_{f \in \mathcal{F}} L (f_{1:T}, 01) + \inf_{f \in \mathcal{F}} L (f_{1:T}, 10) + \inf_{f \in \mathcal{F}} L (f_{1:T}, 11) \right),
\] therefore the difference in game value is \[
&R^* (\mathcal{F}, \emptyset | z_t = 0) - R^* (\mathcal{F}, \emptyset | z_t = 1) \\
&= \frac{1}{4} \left| \inf_{f \in \mathcal{F}} L (f_{1:2}, 00) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 01) - \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, 10) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 11) \right) \right|. \\
\] This looks like a difference of future differences. If the game value difference exceeds $\alpha$, then we should decide $z_1 = 1$, otherwise not. So, for instance, if every expert predicts the same value on the first round, then the difference of future differences will be zero and we should not observe $y_1$. That certainly sounds like active learning.

So what should a general case $T$ round solution look like? Intuitively, one would hope that if all the experts that have done well in the past predict the same thing on the current instance, that the value of observing $y_t$ for that instance would go down. That is roughly what agnostic active learning does in the IID setting. Here the future is also important, but analogously if all the experts that are in the running for the infimum at the end of the horizon agree on a value, it should be that observing $y_t$ has less value. As we near to the end of the planning horizon, that will be driven mostly by having done well in the past.