Comments on Machined Learnings: Do you really have big data?

Thanks Paul. The term 'Big Data' is a mis...

2012-12-22T03:21:56.700-08:00

Thanks Paul.

The term 'Big Data' is a misnomer because there can be significant value in small data, but as you say, it depends on the problem type.

Wrt c), it would be useful if the industry began to classify the types of problems at which DL is superior; otherwise the religious war will continue and confuse the heck out of organizations that want to do the 'right' thing.

As an aside, it was good to see the analysis based on a computational constraint (of 27 secs). An area that is rarely covered is the time to re-train as new data becomes available, which must become ever more important as near real-time becomes a reality.

Mahout exemplifies the challenges that I/O can cre...

2012-12-21T21:39:14.591-08:00

Mahout exemplifies the challenges that I/O can create for distributed iterative computations.

In general, everyone agrees that starting with a sample of data on a laptop for quick initial experimentation and visualization is sound procedure. Whether or not that's sufficient is context-dependent. By analogy, many programs are written in scripting languages, and a few are written in assembly for maximum performance. Ad serving is so lucrative that minor improvements are economically valuable, but I suspect that for many problems the predictor that arises from the subsampled data might be the final product.
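
For readers who want to try the sampling step, here is a minimal sketch of one common way to draw a fixed-size uniform sample from a data set too large to load at once (reservoir sampling). The function name, the `seed` parameter, and the choice of technique are illustrative assumptions, not something specified in the post or this thread.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Return up to k items drawn uniformly at random from an iterable.

    Hypothetical helper for illustration: the stream is read once and
    only k items are ever held in memory.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    # Stand-in for a large log file: one million integer "records".
    subset = reservoir_sample(range(1_000_000), k=5)
    print(subset)
```

In practice `stream` could be an open file iterated line by line, so the sample can be drawn on a laptop even when the full data set does not fit in memory.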

I'm not aware of any published comparison of the kind you seek.

The overall message is basically "there's...

2012-12-21T19:53:11.837-08:00

The overall message is basically that "there's no data like more data" is sometimes misleading, because in practice there is a computational constraint that affects which learning methods are feasible. Sometimes throwing away data to allow for more complicated learning methods is worth it.

Specifically:

a) I would say try deep learning (specifically neural networks) if you have a moderate-sized data set (moderate = "fits on a single box and tolerably slow to train with multicores/GPUs") and little intuition about what kinds of transformations would make your problem look more linear (aka, no intuition about a good kernel). You should also look at GBDTs, since they work great in this zone.

b) Yes, constant factors do matter! But the important message is that network I/O is a major bottleneck in the distributed context. If NUMA becomes popular, then the machines will start looking even more like little clusters, so we can't just ignore this problem.

c) Deep learning methods are superior to all other existing choices for particular types of problems. The key to success in a practical context is to understand the type of problem you face and which techniques are likely to work well.

(Posted on behalf of Dinesh B Vadhia) Hi Paul T...

2012-12-21T19:44:13.749-08:00

(Posted on behalf of Dinesh B Vadhia)

Hi Paul

There isn't an option to leave a comment without owning an identity through a 3rd party. Anyway, another terrific post.

Are the conclusions:
a) If you want predictions in a commercially competitive (reasonable) time, then stick to traditional learning methods; if not, then try deep learning?

b) Faster hardware helps both types of methods.

c) The intrinsic performance of deep learning algorithms has to improve by a significant factor to be commercially competitive with existing methods.

Is that about right?

Best ...
Dinesh

I wasn't aware there was a war brewing but I h...

2012-12-21T02:42:17.678-08:00

I wasn't aware there was a war brewing, but I have been wondering about the value of Mahout in certain scenarios. I've seen people use Mahout on clusters where I thought other tools (VW) on my laptop could have done the trick.

Has anyone done a proper comparison of certain use cases that you know of?

Yup. In contrast, a lot of natural language proce...

2012-12-21T00:19:54.839-08:00

Yup. In contrast, a lot of natural language processing really seems to benefit from more data. It just really depends.

However, even if the actual model estimation is done with a relatively small dataset, you still might have to use a cluster to quickly apply the model to the rest of the dataset. That's the situation I ran into.
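
As a rough illustration of why this step spreads well across a cluster: once the model is fixed, each batch of rows can be scored independently of every other batch. The sketch below assumes a simple linear model and runs the batches in a single process; the function names and the batch size are hypothetical, and handing each batch to a separate worker is left out.

```python
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` rows from an iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def score(weights, row):
    """Linear model prediction: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, row))


def score_in_batches(weights, rows, size=1000):
    """Score every row, one batch at a time.

    Batches share no state, so each could be sent to a different
    machine; here they are processed sequentially for illustration.
    """
    for chunk in batches(rows, size):
        yield [score(weights, row) for row in chunk]


if __name__ == "__main__":
    weights = [1.0, 2.0]  # assumed to come from training on the small sample
    rows = [[1, 1], [2, 0], [0, 3]]
    for scores in score_in_batches(weights, rows, size=2):
        print(scores)
```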