Monday, July 17, 2017

Tiered Architectures, Counterfactual Learning, and Sample Complexity

I'm on a product team now, and once again I find myself working on a tiered architecture: an “L1” model selects some candidates which are passed to an “L2” model which reranks and filters the candidates which are passed to an “L3”, etc. The motivation for this is typically computational, e.g., you can index a DSSM model pretty easily but indexing a BIDAF model is more challenging. However I think there are potential sample complexity benefits as well.

I worry about sample complexity in counterfactual setups, because I think it is the likely next source for AI winter. Reinforcement learning takes a tremendous amount of data to converge, which is why all the spectacular results from the media are in simulated environments, self-play scenarios, discrete optimization of a sub-component within a fully supervised setting, or other situations where there is essentially infinite data. In real life data is limited.

So when I read Deep Reinforcement Learning in Large Discrete Action Spaces by Dulac-Arnold et. al., I noticed that the primary motivation was computational, but figured another (more important?) benefit might be statistical. Tiered architectures cannot overcome worst-case sample complexity bounds, but I think in practice they are a good strategy for counterfactual setups.

Tiered architectures admit semi-supervised approaches, because an L1 model can often be initialized using unsupervised techniques (e.g., word embeddings, sentence embeddings, inverted indicies with tf-idf). Learning the L2 model utilizing this L1 model only has a sample complexity based upon the number of candidates produced by the L1 model, rather than the total number of candidates. Of course, learning the L1 still has a sample complexity based upon the total number of candidates, but if the unsupervised initialization is good then it is ok that the L1 learns slowly. Furthermore in practice the L1 hypothesis class is simpler (because of computational reasons) which mitigates the sample complexity.

There was a workshop called ``coarse-to-fine inference'' at NIPS 2017 which presumably explored these ideas, but I didn't attend it and their website is down. Hopefully there will be another one, I will attend!