Comments on Machined Learnings: Learning is easier than optimization

Ok, I guess it's a question of range of value,...

2013-05-07T14:14:49.585-07:00

Ok, I guess it's a question of range of value, since as soon as you hash on more than ~20 bits, the parameter vector will likely be too big to fit in any CPU cache, and thus increasing it even more shouldn't change that much the (absence of) memory locality speedups. But for less than 1M parameters, I understand that this can make a big difference, and you'd better learn on more data with less parameters.

The speed decrease comes from decreased cache effi...

2013-05-07T11:08:05.657-07:00

The speed decrease comes from decreased cache efficiency as hash bits increases.

Thanks for your quick answer. I've been in th...

2013-05-07T10:39:23.627-07:00

Thanks for your quick answer.

I've been in the kind of situation you described. The best solution I empirically found was also to use L1 reg, with some SGD solver. But tuning the regularization parameter + putting as many bits as possible always worked best. To me, I see the hashing trick as a convenient way to handle a very sparse feature space when you have a limited amount of RAM per server, avoiding the use of a more complicated infrastructure like a parameter server. The regularization effect is just the cherry on top of the cake.

I don't really understand the speed decrease argument though: if you have a lot more observations n than the total features space cardinality d, on a typical large-scale setting with sparse features, the CPU complexity of your learning algorithm will relate to the average number of active features per observation, not d (for both SGD and QN methods since a pass over the data will dominate an update of the parameters). So you are not really slower when you increase the hash bits.

I've had this happen when you have lots of fea...

2013-05-07T08:46:06.574-07:00

I've had this happen when you have lots of features, the kind of situation where L1 regularization also helps. It's possible if I thoroughly explored the space of combinations of (hash bits, L1 eta) I would find something with more hash bits that worked better, but the speed decrease from increasing hash bits is a real disincentive to explore.

Hi Paul. Do you have any precise example of when y...

2013-05-06T21:19:59.762-07:00

Hi Paul. Do you have any precise example of when you've seen the hashing trick perform worse when you increase the number of bits?
I've never encountered such a phenomenon. Since hashing acts like a regularizer, it seems to me that as long as you tune your regularization parameter(s), you should get a better test error with more bits.
Of course, sometimes finding a good regularizer is not that easy, but I can't easily imagine a scenario where one wouldn't be able to come up with a better regularization strategy than the pure random projection provided by the hashing trick.

Thanks