Saturday, May 3, 2014

The Most Causal Observer

David K. Park recently had a guest post on Gelman's blog. You should read it. The tl;dr is "Big Data is a Big Deal, but causality is important and not the same as prediction."

I agree with the basic message: causality is important. As a bit of career advice, if you are just starting out, focusing on causality would be a good idea. One almost never builds a predictive model purely for the sake of prediction; rather, the point is to suggest an intervention. For example, why predict the fraud risk of a credit card transaction? Presumably the goal is to decline some transactions. When you do this, things change. Most simply, if you decline a transaction you do not learn about the counterfactual of what would have happened had you approved it. Additional issues arise because of the adversarial nature of the problem, i.e., fraudsters will react to your model. Not paying attention to these effects will lead to unintended consequences.
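
To make the counterfactual point concrete, here is a toy Python sketch (every quantity in it is invented for illustration): a deployed fraud model declines high-scoring transactions, and the labeled data available for retraining ends up censored by the model's own decisions.

```python
# Hypothetical sketch of the missing-counterfactual problem in fraud scoring.
# Transactions the model declines never get an observed label, so the data
# available for retraining is censored by the model's own decisions.
import random

random.seed(0)

def true_fraud_prob(amount):
    # Unknown ground truth: larger transactions are riskier (made up for illustration).
    return min(0.9, amount / 1000.0)

def current_model_score(amount):
    # Deployed model's risk estimate; imperfect on purpose.
    return min(1.0, amount / 800.0)

observed = []   # (amount, label) pairs we actually get to see
censored = 0    # declined transactions with no observed outcome

for _ in range(10000):
    amount = random.uniform(0, 1000)
    if current_model_score(amount) > 0.5:
        censored += 1          # declined: counterfactual outcome never observed
    else:
        label = random.random() < true_fraud_prob(amount)
        observed.append((amount, label))

print(f"labeled examples: {len(observed)}, censored (declined): {censored}")
print("max amount with an observed label:",
      max(a for a, _ in observed))  # retraining data only covers low amounts
```

Running it shows that no labeled example above the decline threshold ever appears, so a model retrained on this data has nothing to say about the region where the old model intervened.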

However, I have reservations about the idea that "creative humans who need to think very hard about a problem and the underlying mechanisms that drive those processes" are necessarily required to "fulfill the promise of Big Data". When I read those words, I translate them as "strong structural prior knowledge will have to be brought to bear to model causal relationships, despite the presence of large volumes of data." That statement appears to take off the table the idea that Big Data, gathered by Big Experimentation systems, will be able to discover causal relationships in an agnostic fashion. Here "agnostic" basically means "weak structural assumptions which are amenable to automation." Of course there are always assumptions, e.g., when doing Vapnik-style ERM, one makes an iid assumption about the data generating process. The question is whether humans and creativity will be required.

Perhaps a better statement would be "creative humans will be required to fulfill the promise of Big Observational Data." I think this is true, and the social sciences have been working with observational data for a while, so they have relevant experience, insights, and training to which we should pay more attention. Furthermore, another reasonable claim is that "Big Data will be observational for the near future." Certainly it's easy to monitor the Twitter firehose, whereas it is completely unclear to me how an experimentation platform would manipulate Twitter to determine causal relationships. Nonetheless I think that automated experimental design at massive scale has enormous disruptive potential.

The main difference I'm positing is that Machine Learning will increasingly move from working with a pile of data generated by another process to driving the process that gathers the data. For computational advertising this is already the case: advertisements are placed by balancing exploitation (making money) and exploration (learning about which ads will do well under which conditions). Contextual bandit technology is already mature, and Big Experimentation is not a myth; it happens every day. One could argue that advertising is a particular application vertical of such extreme economic importance that creative humans have worked out a structural model that allows for causal reasoning, cf. Bottou et al. I would say this is correct, but perhaps just a first step. For prediction we no longer have to do parametric modeling where the parameters are meaningful: nowadays we have lots of models with essentially nuisance parameters. Once we have systems that are gathering the data as well as modeling it, will strong structural models with meaningful parameters be required, or will there be some agnostic way of capturing a large class of causal relationships with enough data and experimentation?
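
As a concrete (and deliberately toy) illustration of the explore/exploit tradeoff, here is a minimal epsilon-greedy contextual bandit sketch in Python. The contexts, ads, and click-through rates are all made up, and production systems are considerably more sophisticated; the point is only that the system both places ads and generates the data it learns from.

```python
# Toy epsilon-greedy contextual bandit for ad selection: a minimal sketch,
# not the mature contextual bandit systems referenced above.
import random

random.seed(1)
ADS = ["ad_a", "ad_b", "ad_c"]
EPSILON = 0.1  # fraction of traffic spent on exploration

counts = {}  # per-(context, ad) number of impressions
values = {}  # per-(context, ad) running click-through estimate

def true_ctr(context, ad):
    # Made-up ground truth: each ad works best for a different context.
    best = {"sports": "ad_a", "news": "ad_b", "music": "ad_c"}[context]
    return 0.10 if ad == best else 0.02

def choose(context):
    if random.random() < EPSILON:
        return random.choice(ADS)  # explore: learn about other ads
    return max(ADS, key=lambda ad: values.get((context, ad), 0.0))  # exploit

def update(context, ad, reward):
    # Incremental mean update of the click-through estimate.
    key = (context, ad)
    counts[key] = counts.get(key, 0) + 1
    values[key] = values.get(key, 0.0) + (reward - values.get(key, 0.0)) / counts[key]

for _ in range(50000):
    context = random.choice(["sports", "news", "music"])
    ad = choose(context)
    reward = 1.0 if random.random() < true_ctr(context, ad) else 0.0
    update(context, ad, reward)

for context in ["sports", "news", "music"]:
    best = max(ADS, key=lambda ad: values.get((context, ad), 0.0))
    print(context, "->", best)  # should recover the per-context best ad
```

The exploration traffic is the price paid to keep learning about actions the current policy would otherwise never take, which is exactly the counterfactual information an observational pile of logs cannot provide.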