Wednesday, October 2, 2013

Lack of Supervision

For computational advertising and internet dating, the standard statistical learning theory playbook worked pretty well for me. Yes, there were nonstationary environments, explore-exploit dilemmas, and other covariate shifts; but mostly the intuition from the textbook was valuable. Now, in a potential instance of the Peter Principle, I'm mostly encountering problems in operational telemetry and security that seem very different, for which the textbook is less helpful. In a nod to Gartner, I have summarized my intimidation into a 4 quadrant mnemonic.
LabelsAbundantTextbook Machine LearningMalware Detection
RareService Monitoring and AlertingIntrusion Detection

The first dimension is the environment: is it oblivious or adversarial? Oblivious means that, while the environment might be changing, it is doing so independent of any decisions my system makes. Adversarial means that the environment is changing based upon the decisions I make in a manner to make my decisions worse. (Adversarial is not the opposite of oblivious, of course: the environment could be beneficial.) The second dimension is the prevalence of label information, which I mean in the broadest sense as the ability to define model quality via data. For each combination I give an example problem.

In the top corner is textbook supervised learning, in which the environment is oblivious and labels are abundant. My current employer has plenty of problems like this, but also has lots of people to work on them, and plenty of cool tools to get them done. In the bottom corner is intrusion detection, a domain in which everybody would like to do a better job, but which is extremely challenging. Here's where the quadrant starts to help, by suggesting relaxations of the difficulties of intrusion detection that I can use as a warm-up. In malware detection, the environment is highly adversarial, but labels are abundant. That may sound surprising given that Stuxnet stayed hidden for so long, but actually all the major anti-virus vendors employ legions of humans whose daily activities provide abundant label information, albeit admittedly incomplete. In service monitoring and alerting, certain labels are relatively rare (because severe outages are thankfully infrequent), but the engineers are not injecting defects in a manner designed to explicitly evade detection (although it can feel like that sometimes).

I suspect the key to victory when label information is rare is to decrease the cost of label acquisition. That almost sounds tautological, but it does suggest ideas from active learning,crowdsourcing, exploratory data analysis, search, and implicit label imputation; so it's not completely vacuous. In other words, I'm looking for a system that interrogates domain experts judiciously, asks a question that can be reliably answered and whose answer has high information content, presents the information they need to answer the question in an efficient format, allows the domain export to direct the learning, and can be bootstrapped from existing unlabeled data. Easy peasy!

For adversarial setups I think online learning is an important piece of the puzzle, but only a piece. In particular I'm sympathetic to the notion that in adversarial settings intelligible models have an advantage because they work better with the humans who need to maintain them, understand their vulnerabilities, and harden them against attacks both proactively and reactively. I grudgingly concede this because I feel a big advantage of machine learning to date is the ability to use unintelligible models: intelligibility is a severe constraint! However intelligibility is not a fixed concept, and given the right (model and data) visualization tools a wider class of machine learning techniques become intelligible.

Interestingly both for rare labels and for adversarial problems user interface issues seem important, because both require efficient interaction with a human (for different purposes).

No comments:

Post a Comment