Monday, May 21, 2012

Instruction Theory?

In learning theory we often talk about an environment which is oblivious to our algorithm, or an environment which is fully aware of our algorithm and attempting to cause it to do badly. What about the case where the environment is fully aware of our algorithm and attempting to help it succeed?

Here's a concrete example. Suppose you are trying to communicate a message to a receiver, and this message is one of a finite set of hypotheses. You are forced to communicate with the receiver by sending a sequence of feature-label pairs $(X \times Y)^*$; the receiver will decode the message via ERM over the hypothesis set using the supplied data. How many examples does it take, and how should you choose them? If this sounds corny, consider that evolution works by reuse: if the capability to learn from experience developed due to other selection pressures, it might be co-opted in the service of communication, à la Cognitive Linguistics.
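Here's a minimal sketch of that protocol, assuming a finite hypothesis class over integer features. The names (`erm_decode`, `teach`, `candidate_xs`) are mine, not any standard API, and the sender here just scans candidate points in order; it's an illustration, not a proposed algorithm.

```python
from typing import Callable, List, Tuple

Hypothesis = Callable[[int], int]   # maps a feature x to a label y
Example = Tuple[int, int]           # a (feature, label) pair

def erm_decode(hypotheses: List[Hypothesis], data: List[Example]) -> int:
    """Receiver: index of a hypothesis minimizing empirical 0-1 loss on the data."""
    def empirical_loss(i: int) -> int:
        return sum(1 for x, y in data if hypotheses[i](x) != y)
    return min(range(len(hypotheses)), key=empirical_loss)

def teach(hypotheses: List[Hypothesis], target: int,
          candidate_xs: List[int]) -> List[Example]:
    """Sender: label candidate points with the target hypothesis, stopping as
    soon as the target is the only hypothesis consistent with the data sent."""
    data: List[Example] = []
    for x in candidate_xs:
        data.append((x, hypotheses[target](x)))
        consistent = [i for i, h in enumerate(hypotheses)
                      if all(h(xx) == yy for xx, yy in data)]
        if consistent == [target]:
            break
    return data
```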

Intuitively, even if the hypothesis being communicated was learned from experience, it's not a good strategy to retransmit the exact data used to learn it. In fact, it seems the best strategy would be to not use real data at all: by constructing an artificial set of training examples, favorable structure can be induced, e.g., the problem can be made realizable. (Funny aside: I TA'd for a professor once who confided that he sometimes lied to undergraduates in introductory courses in order to get them "closer to the truth"; the idea was that if they took an upper-division class he would have the chance to refine their understanding, and if not, they were actually better off learning a simplified distortion.)
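As a toy illustration of that point, continuing the sketch above with a made-up hypothesis class and a made-up noisy sample: the fabricated sequence is realizable by construction, because the sender labels it with the target itself, while the original training data need not be.

```python
# Continuing the sketch above. The fabricated examples are labeled by the
# target hypothesis itself, so they are realizable by construction; the
# (made-up) real training data is noisy and is not.
hypotheses = [lambda x: 0, lambda x: x % 2, lambda x: 1]
target = 1                                    # the parity hypothesis
real_data = [(0, 0), (1, 1), (2, 1), (3, 1)]  # noisy: the label at x=2 is wrong
fabricated = teach(hypotheses, target, candidate_xs=[0, 1, 2, 3])

assert any(hypotheses[target](x) != y for x, y in real_data)   # not realizable
assert all(hypotheses[target](x) == y for x, y in fabricated)  # realizable
assert erm_decode(hypotheses, fabricated) == target            # decodes exactly
print(fabricated)   # two examples suffice: [(0, 0), (1, 1)]
```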

Some concepts from learning theory are backwards in this setup. For instance, Littlestone's dimension indicates the maximum number of errors made in a realizable sequential prediction scenario when the examples are chosen adversarially (and it generalizes to the agnostic case). Here we can choose the examples helpfully (what's the antonym of adversarial?), but we actually want errors, so that many of the hypotheses are incorrect and can be quickly eliminated. Unfortunately we might encounter a condition where the hypothesis to be communicated disagrees with at most one other hypothesis on any point. Littlestone finds this condition favorable, since a mistake would eliminate all but one hypothesis, and otherwise no harm no foul; but in our situation this is worst-case behaviour, because each example can rule out at most one competing hypothesis, making it difficult to isolate the target with examples. In other words, we can choose the data helpfully, but if the set of hypotheses is chosen adversarially this could still be very difficult.
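To make that bad case concrete, here's a tiny instance, again reusing the illustrative `teach` sketch above (so the names are mine): the all-zeros hypothesis against $n$ singleton indicators. The target disagrees with exactly one competitor on each point, so every example eliminates at most one hypothesis and the sender ends up needing all $n$ points.

```python
# Continuing the sketch above: the all-zeros hypothesis versus n singleton
# indicators. On each point the target disagrees with exactly one competitor,
# so every example eliminates at most one hypothesis.
n = 5
all_zeros = lambda x: 0
singletons = [lambda x, i=i: 1 if x == i else 0 for i in range(n)]
hypotheses = [all_zeros] + singletons

examples = teach(hypotheses, target=0, candidate_xs=list(range(n)))
print(len(examples))   # n: every point is needed to rule out its singleton
```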

Inventing an optimal fictitious sequence of data might be computationally too difficult for the sender. In this case active learning algorithms might provide good heuristic solutions. Here label complexity corresponds to data compression between the original sequence of data used to learn the hypothesis and the reduced sequence of data used to transmit the hypothesis.
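For instance, a cheap baseline in that spirit, not an active learning algorithm per se but a greedy set-cover-style heuristic, is to repeatedly send the point whose target label rules out the most surviving hypotheses. A sketch, reusing the illustrative types from above:

```python
def greedy_teach(hypotheses: List[Hypothesis], target: int,
                 candidate_xs: List[int]) -> List[Example]:
    """Greedily send the point whose target label eliminates the most
    hypotheses still consistent with what has been sent so far."""
    alive = set(range(len(hypotheses)))
    data: List[Example] = []
    while alive != {target}:
        def eliminated(x: int) -> int:
            y = hypotheses[target](x)
            return sum(1 for i in alive if hypotheses[i](x) != y)
        x = max(candidate_xs, key=eliminated)
        if eliminated(x) == 0:       # target is not uniquely identifiable
            break
        y = hypotheses[target](x)
        data.append((x, y))
        alive = {i for i in alive if hypotheses[i](x) == y}
    return data
```

On the singleton class above this can do no better than one elimination per example, but when single points can knock out many hypotheses at once the transmitted sequence can be much shorter than the original training data.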

There is fertile ground for variations, e.g., the communication channel is noisy, the receiver does approximate ERM, or the communication is scored on the difference in loss between communicated and received hypothesis rather than 0-1 loss on hypotheses.

2 comments:

  1. See http://www.cis.upenn.edu/~mkearns/papers/teaching.pdf
    (teaching dimension).

    1. Wow, _exactly_ what I was looking for ... and a perfect example of "I couldn't figure out what to ask Google".
