Sunday, March 9, 2014

Failing Fast

I spent Christmas break working on some matrix-factorization-related ideas. There were two things I was wondering about: first, whether dropout is a good regularizer for discriminative low-rank quadratics; and second, how to do the analog of GeV representation learning for discriminative low-rank quadratics. For the latter, I had an idea I was sure would work, but I couldn't make it work. There's a saying: “your baby is not as beautiful as you think it is”. Most ideas are not good ideas despite our prior beliefs, so it's important to eliminate ideas as quickly as possible.

Startups have popularized the idea of failing fast, since most business ideas are also not good ideas, and the minimum viable product has become central dogma in the startup community. Analogously, when testing machine learning ideas, it's best to start with the “minimum viable algorithm”: write it in as high-level a language as possible (e.g., Matlab, NumPy), lean on as many existing libraries and packages as possible (e.g., CVX), and don't take any computational shortcuts for efficiency.

I started playing around with dropout regularization for matrix factorization in Matlab, and once I saw it working on movielens, I spent the time to implement it as a reduction in vw. Knowing it should work let me power through the multiple defects I introduced while implementing. Short story even shorter: the result is in the main branch, and you can check out the demo.
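For concreteness, here is a minimal NumPy sketch of the kind of “minimum viable” version of that first experiment: SGD on a rank-$k$ dyadic predictor with dropout applied to the latent dimensions. The function name, hyperparameters, and the inverted-dropout choice are all illustrative; this is neither the Matlab prototype nor the vw reduction.

```python
import numpy as np

def mf_dropout_sgd(ratings, n_users, n_movies, rank=10, lr=0.02,
                   drop_prob=0.5, epochs=10, seed=0):
    """SGD for a dyadic predictor w^T (l, r) + sum_k (l^T p_k)(r^T q_k)
    on indicator features, with dropout applied to the latent dimensions.

    ratings: iterable of (user_id, movie_id, rating) triples.
    Hyperparameters are illustrative placeholders.
    """
    rng = np.random.default_rng(seed)
    w_user, w_movie = np.zeros(n_users), np.zeros(n_movies)   # linear part
    P = 0.1 * rng.standard_normal((n_users, rank))    # latent user features
    Q = 0.1 * rng.standard_normal((n_movies, rank))   # latent movie features
    for _ in range(epochs):
        for u, i, y in ratings:
            # inverted dropout: zero latent dimensions at random, rescale so
            # the quadratic term is unbiased in expectation
            mask = (rng.random(rank) > drop_prob) / (1.0 - drop_prob)
            err = w_user[u] + w_movie[i] + (P[u] * mask) @ Q[i] - y
            gP, gQ = err * Q[i] * mask, err * P[u] * mask
            w_user[u] -= lr * err          # squared-loss gradient steps;
            w_movie[i] -= lr * err         # masked dimensions get no update
            P[u] -= lr * gP
            Q[i] -= lr * gQ
    return w_user, w_movie, P, Q
```

At prediction time the masks are dropped and the deterministic predictor is used; the inverted-dropout rescaling keeps training and test predictions on the same scale.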

The next idea I tried was to pose learning low-rank quadratic features as an alternating linear-fractional optimization problem. The analogy to alternating least squares was so strong that aesthetically I was sure it was a winner. For a multi-class prediction task (e.g., movielens without side information) over binary dyadic examples $S = \{ ((l, r), y) \mid l \in \{0, 1\}^m, r \in \{ 0,1 \}^n, y \in \{ 0, 1, \ldots, k \} \}$, a predictor with a single latent MF-style feature looks like \[
f (( l, r ); w, p, q) = w^\top (l, r) + (l^\top p) (r^\top q),
\] ignoring the constant feature for simplicity. Here $p \in \mathbb{R}_+^m$ and $q \in \mathbb{R}_+^n$ together define the single latent feature. On movielens without side information, $l$ and $r$ are indicator variables of the user id and movie id respectively, so $p$ and $q$ are indexed by those identifiers, and each example contributes one scalar from $p$ and one scalar from $q$ whose product is added to the predictor.
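For concreteness, a hypothetical NumPy rendering of this predictor, just to show how the indicator structure turns the quadratic term into a product of two scalar lookups (names and toy values are illustrative):

```python
import numpy as np

def predict(user_id, movie_id, w_user, w_movie, p, q):
    """f((l, r); w, p, q) = w^T (l, r) + (l^T p)(r^T q) when l and r are
    indicators of the user id and movie id: two weight lookups plus the
    product of two scalar latent-feature lookups."""
    return w_user[user_id] + w_movie[movie_id] + p[user_id] * q[movie_id]

# toy usage with 3 users and 4 movies (random parameters, purely illustrative)
rng = np.random.default_rng(0)
w_user, w_movie = rng.standard_normal(3), rng.standard_normal(4)
p, q = rng.random(3), rng.random(4)          # nonnegative latent feature
print(predict(1, 2, w_user, w_movie, p, q))
```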

The idea was to choose the latent feature to be highly active on class $i$ and highly inactive on another class $j$, \[
\max_{\substack{p \in \mathbb{R}_+^m,\, p \preceq 1 \\ q \in \mathbb{R}_+^n,\, q \preceq 1}} \frac{p^\top \mathbb{E}[l r^\top \mid y = i]\, q}{\alpha + p^\top \mathbb{E}[l r^\top \mid y = j]\, q},
\] where the box constraints $p \preceq 1$ and $q \preceq 1$ keep the objective from diverging and $\alpha > 0$ is a hyperparameter which regularizes the denominator. Of course in practice the expectations are replaced by averages over the training set.
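For fixed $p$, the objective is a ratio of two affine functions of $q$ over a box, and the Charnes-Cooper transformation turns that into a linear program. The sketch below is one way such a half-step could look in Python with SciPy's LP solver; it is an illustrative reconstruction, not the Matlab/CVX code from the actual experiments.

```python
import numpy as np
from scipy.optimize import linprog

def update_q(p, A_i, A_j, alpha):
    """One alternating half-step: maximize (p^T A_i q) / (alpha + p^T A_j q)
    over 0 <= q <= 1 for fixed p, via the Charnes-Cooper transformation.

    A_i, A_j: empirical estimates of E[l r^T | y = i] and E[l r^T | y = j].
    """
    a = A_i.T @ p                     # numerator coefficients (nonnegative)
    b = A_j.T @ p                     # denominator coefficients (nonnegative)
    n = a.shape[0]
    # substitute z = t q, t = 1 / (alpha + b^T q); variables x = [z, t]
    c = np.concatenate([-a, [0.0]])                      # maximize a^T z
    A_eq = np.concatenate([b, [alpha]]).reshape(1, -1)   # b^T z + alpha t = 1
    A_ub = np.hstack([np.eye(n), -np.ones((n, 1))])      # z <= t, i.e. q <= 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=A_eq, b_eq=np.array([1.0]),
                  bounds=[(0, None)] * (n + 1))
    z, t = res.x[:n], res.x[n]
    return z / t                      # recover q in [0, 1]^n
```

The $p$ half-step is symmetric (swap the roles of $p$ and $q$ and transpose the moment matrices).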

For fixed $p$ this is a linear-fractional program in $q$, and vice versa, so starting from a random point I was able to quickly alternate my way to features that looked good visually (high product energy on high-rating user-movie pairs, low product energy on low-rating user-movie pairs). However, the predictive lift on the test set from these features, compared to a linear model without interactions, was almost nonexistent. Then I tried a boosting variant: first fit a linear model without interactions, then try to discriminate between positive- and negative-residual examples. This was more interesting: the learned features ended up being mostly zero except on a small percentage of the data, suggesting that although the original features look good visually, they mostly provide information redundant with a linear model.
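A minimal sketch of that boosting setup, with ridge regression standing in for the interaction-free linear model and a sign-of-residual split into the two pseudo-classes (both are illustrative choices, not necessarily what the original experiments did):

```python
import numpy as np

def residual_classes(X, y, lam=1.0):
    """Fit a linear model without interactions, then split examples into
    positive- and negative-residual pseudo-classes for the ratio objective.

    X: feature matrix (e.g., stacked user/movie indicators), y: ratings.
    Ridge regression is a stand-in for the actual interaction-free learner.
    """
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    residual = y - X @ w
    return w, residual > 0, residual < 0   # weights, class i mask, class j mask
```

The conditional moment matrices $\mathbb{E}[l r^\top \mid y = i]$ and $\mathbb{E}[l r^\top \mid y = j]$ would then be re-estimated on the two residual subsets and handed to the same alternating procedure.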

I was able to crank out these negative results in just a few days using Matlab and CVX (it helps that there are no meetings at work during the holidays). Is it possible I screwed this up and the idea is actually a good one? Yes, but working at such a high level eliminates concerns about the optimizer, which makes it more likely that the strategy itself is at fault. In any event, I have a portfolio of ideas, and I need to invest my time in those most likely to yield something interesting. Although not definitive, these quick experiments suggested that I should spend my time elsewhere.

Think of it as Bayesian search theory over the space of ideas.