tag:blogger.com,1999:blog-4446292666398344382.comments2014-07-07T12:46:56.506-07:00Machined LearningsPaul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comBlogger110125tag:blogger.com,1999:blog-4446292666398344382.post-29906688443169310342014-05-04T07:10:17.947-07:002014-05-04T07:10:17.947-07:00Hey! I thought this post was pretty good. You sh...Hey! I thought this post was pretty good. You should take a look at this presentation: http://users.soe.ucsc.edu/~niejiazhong/slides/chandra.pdf in particular the "design principles" section. I think most people haven't realizedf that machine learning, HPC, and mapreduce have more or less converged.David Konerdinghttp://www.blogger.com/profile/08061100933668606153noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-66677149958830319022014-04-25T09:07:03.689-07:002014-04-25T09:07:03.689-07:00It is domain dependent. Early experience in compu...It is domain dependent. Early experience in computation advertising, with zipf-distributed text features and linear models, suggested larger amounts of data led to improved performance. This was believed to be generally true for a while but is now recognized as a specific property of these problems, attributable to the long tail of the zipf coupled with the need to have roughly 10 observations per feature in a linear model. For other types of models in other domains, it is possible to hit the Bayes limit (optimal performance given inherent noise) with training data that comfortably fits on a single machine, in which case more data is not helpful.<br /><br />Computer vision is benefiting from large labeled datasets, essentially because very flexible models are required for good performance and therefore lots of training data is required to prevent overfitting. I see this trend continuing for a while: arguably it is held back by our (in)ability to optimize large deep learning models, not our ability to assemble large computer vision training sets.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-33218079827359238742014-03-30T01:10:33.399-07:002014-03-30T01:10:33.399-07:00A terrific post which I keep coming back to freque...A terrific post which I keep coming back to frequently to learn something else. It would be good to know what "scaling" machine learning means wrt size ie. how big are the models today and what are the expected sizes in the future?Unknownhttp://www.blogger.com/profile/16570083132483784714noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-20780882857342199482013-12-12T15:03:33.286-08:002013-12-12T15:03:33.286-08:00Updated link here.Updated link <a href="http://www.cs.columbia.edu/~djhsu/papers/multiview-slides.pdf" rel="nofollow">here</a>.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-47641651422335418292013-12-10T23:33:16.499-08:002013-12-10T23:33:16.499-08:00That is indeed a brilliant idea. I'm pretty su...That is indeed a brilliant idea. I'm pretty sure it couldn't work at math conferences but it's not like anything else does either.Allen Knutsonhttp://www.blogger.com/profile/15616422252030334511noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-21226269777715131752013-11-15T19:50:21.983-08:002013-11-15T19:50:21.983-08:00Can you clarify which of the multiple algorithms i...Can you clarify which of the multiple algorithms in Halko et al this is (maybe a variant of 4.4?). Also, do 'pcas' and 'pcav' here refer to the singular values and singular vectors respectively (S and V in svd), and if so is there a reason not to do svd() directly on secondpass instead of eig() on secondpass' * secondpass, supposedly svd is numerically more stable. Thanks.Gad Abrahamhttp://www.blogger.com/profile/02690671851015634332noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-32300478019883902712013-10-16T17:22:27.039-07:002013-10-16T17:22:27.039-07:00Nice post. It's worth noting that the randomiz...Nice post. It's worth noting that the randomized SVD is an early-stopped version of classical SVD algorithms like the subspace iteration algorithm.Alexhttp://www.blogger.com/profile/07509233273590036699noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-37023693830464626192013-10-13T11:37:01.099-07:002013-10-13T11:37:01.099-07:00Thanks for publishing this code. Here's a slig...Thanks for publishing this code. Here's a slightly modified version for the Digit Recognizer competition at Kaggle:<br /><br />https://github.com/zygmuntz/kaggle-digits<br /><br />It scores ~ 0.977.Zygmunt Zająchttp://www.blogger.com/profile/01337002644285268044noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-87365800147205382882013-08-29T08:44:38.167-07:002013-08-29T08:44:38.167-07:00@DatsMe:
It's not clear if you want something...@DatsMe:<br /><br />It's not clear if you want something that will work with doubles but with the same precision approximation, or something with higher precision.<br /><br />The former is easy (if somewhat nonsensical): downcast to float, call the routine, then cast the result back to double.<br /><br />The latter will require adjusting the constants in the approximation. The mathematica notebook (https://code.google.com/p/fastapprox/source/browse/trunk/fastapprox/tests/fastapprox.nb) contains the derivation which could be adapted to the proper mantissa, base, bitmask, etc. for doubles.<br /><br />It's possible it will not be faster than standard libraries once you engineer for additional precision.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-42215435067760513592013-08-29T06:50:58.750-07:002013-08-29T06:50:58.750-07:00I'm trying to adapt fastexp() to work with dou...I'm trying to adapt fastexp() to work with double precision, with little success.<br />I'd appreciate if anyone could point me in the right direction.<br />Thanks!DatsMehttp://www.blogger.com/profile/01020146443768428068noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-71279424754127429012013-07-26T09:55:59.086-07:002013-07-26T09:55:59.086-07:00The technique from the paper only discusses the pr...The technique from the paper only discusses the process of making a smaller dataset from your original dataset which has fidelity for learning (ERM). So we do a) in the paper just to give people an idea of how this method of choosing a subsample is better than alternatives. Your question suggests spending time getting the size of the subsample "optimally small" in some sense, but this hasn't been how I've applied the technique. In practice the size of the subsample is set by some computational budget, e.g., how much RAM you have on your desktop, or how much data you can process in 5 minutes for rapid experimentation. <br /><br />You want a smaller dataset to do something like b), i.e., run lots of modeling experiments, so yes you are going to do that.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-41273584424166507392013-07-26T07:46:36.080-07:002013-07-26T07:46:36.080-07:00I am a bit confused, though it has probably more t...I am a bit confused, though it has probably more to do with my knowledge than your article :).<br /><br />Let's say i have n samples, m features. I should find a linear regression (h) which will minimize the loss. <br /><br />Should I then use h to:<br />a. select N samples until the maximum AUC is approached? (Like you seem to do in the paper?)<br />b. Use h to build more features m, and experiment if they improve AUC.<br />c. Both?<br /><br />Thanks for any intuition you could provide.Johan J.http://www.blogger.com/profile/16511137688482612834noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-55015953410337103812013-07-21T22:00:54.545-07:002013-07-21T22:00:54.545-07:00"A reviewer pointed out that the excess risk ..."A reviewer pointed out that the excess risk bound diverged in this case" <br /><br />By "this case" I meant the case that the loss of tilde h goes to zero.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-81861621710024308952013-07-21T21:58:24.483-07:002013-07-21T21:58:24.483-07:00Doh!
Originally we didn't distinguish between...Doh!<br /><br />Originally we didn't distinguish between a hypothesis and the induced loss, as in the empirical Bernstein paper, but the reviewers really thought this cluttered the exposition (and, in hindsight, they were correct). So we threaded the loss through everything for the camera ready, but apparently not without error :)Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-8041328260069353942013-07-21T21:56:16.093-07:002013-07-21T21:56:16.093-07:00Originally Pmin was not in the paper and we just s...Originally Pmin was not in the paper and we just set the minimum probability to the empirical loss of tilde h. A reviewer pointed out that the excess risk bound diverged in this case, so we added Pmin, but we suspect the right choice for Pmin is the empirical loss of tilde h. Then you set lambda to hit your subsample budget size. This is what we did for the DNA experiment, although we need some more experience with problems before we can say confidently this is the way to go. (Note that if tilde h has a loss which is indistinguishable from zero given Hoeffding on the large data set, tilde h is the hypothesis you seek and so you really don't need to subsample.)<br /><br />The loss is the proxy being optimized (e.g., logistic loss) and not the true underlying loss (e.g., zero-one), as we leverage the ordering implied by optimization in the proof. Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-56910346197576299952013-07-21T04:07:52.689-07:002013-07-21T04:07:52.689-07:00BTW, I think there is a typo in the article: R_X(h...BTW, I think there is a typo in the article: R_X(h) is defined in terms of a sum of h. It should be a sum of l_h instead.Olivier Griselhttp://www.blogger.com/profile/05751090858946703320noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-75760927195194683902013-07-21T04:05:43.445-07:002013-07-21T04:05:43.445-07:00Thanks for this post and your paper. Could you tel...Thanks for this post and your paper. Could you tell us a bit more informally on how to deal with the newly introduced hyperparameters? In the paper you state:<br /><br />"""<br />In practice Pmin and λ are chosen according to the subsample budget, since the expected size of the subsample is upper bounded by (Pmin+λRX(h ̃))n. Unfortunately there are two hyperparameters and the analysis presented here does not guide the choice except for suggesting the constraints Pmin ≥ RX (h ̃ ) and λ ≥ 1; this is a subject for future investigation.<br />"""<br /><br />In practice do you run a grid search for Pmin and λ on a small uniform subset of the big dataset and select the (Pmin, λ) pair that has the fastest decrease in validation error? Or do you just fix Pmin according to your budget (e.g. 0.01) and grid search λ?<br /><br />Also in the paper I am not sure whether you define R_X (h ̃ ) in terms of the loss function being optimized by h ̃(e.g. squared hinge loss for a linear support vector machine) or the true target loss you are interested in (e.g. the zero one loss).Olivier Griselhttp://www.blogger.com/profile/05751090858946703320noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-38146236131575212552013-06-22T19:23:26.090-07:002013-06-22T19:23:26.090-07:00In order to preserve the illusion of serial execut...In order to preserve the illusion of serial execution of the code written by "you", the learning algorithm calls your code many times to simulate different paths through the decision space. Many of these paths are shared and duplicate evaluation could be avoided if you wrote your code as a state machine, but the whole point is to avoid forcing you to write code as a state machine. The compromise is that you can decorate your code with special function calls that help the learning algorithm avoid duplicate path evaluation (these calls are harmless at evaluation time).<br /><br />By the way, this means the code written by "you" has to be a pure function (i.e., side effect free).Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-63201442876399096522013-06-22T19:11:22.722-07:002013-06-22T19:11:22.722-07:00Hi Paul, great post! Could you elaborate a bit on ...Hi Paul, great post! Could you elaborate a bit on the last point:<br />"Testing time overhead is introduced but this can be mitigated by decorating the code with additional annotations that allow the learning algorithm to memoize appropriately."<br />Why is there an overhead between this approach and the usual one, and why is memoization a fix? Thanks!goodwind89http://goodwind89.wordpress.com/noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-51214560776713453702013-06-08T08:01:19.262-07:002013-06-08T08:01:19.262-07:00Re: Sublinear debugging. There is something subst...Re: Sublinear debugging. There is something substantive that I neglected to mention above, namely, avoid techniques during experimentation that do not admit this kind of treatment.<br /><br />There's often a choice between optimization methods with poor asymptotic convergence but cheap steps (e.g., SGD), and optimization methods with fast asymptotic convergence but expensive steps (e.g., quasi-Newton). The former tend to provide good intermediate information, whereas the latter can look like they are making little progress initially even though they end up somewhere better. So you would use the former during experimentation and the latter for the finished product.<br /><br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-84102509596035062822013-06-08T03:30:25.094-07:002013-06-08T03:30:25.094-07:00Agree completely. The first ("Use Less Data&...Agree completely. The first ("Use Less Data") is at my top of the list too and cannot be recommended enough.<br /><br />Isn't "Sublinear Debugging" just a fancy term for the judicious use of print statements?<br /><br />When dealing with a new data set, I start with the simple or obvious features ('the low hanging fruit') to get everything working and then go back to optimize the feature engineering.<br /><br />Very useful article. Best ...<br /><br />Unknownhttp://www.blogger.com/profile/16570083132483784714noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-9437838219267722402013-05-07T14:14:49.585-07:002013-05-07T14:14:49.585-07:00Ok, I guess it's a question of range of value,...Ok, I guess it's a question of range of value, since as soon as you hash on more than ~20 bits, the parameter vector will likely be too big to fit in any CPU cache, and thus increasing it even more shouldn't change that much the (absence of) memory locality speedups. But for less than 1M parameters, I understand that this can make a big difference, and you'd better learn on more data with less parameters.Benoit Rostykushttp://www.blogger.com/profile/15784288105272457071noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-461134767395066542013-05-07T11:08:05.657-07:002013-05-07T11:08:05.657-07:00The speed decrease comes from decreased cache effi...The speed decrease comes from decreased cache efficiency as hash bits increases.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-72971638363216434302013-05-07T10:39:23.627-07:002013-05-07T10:39:23.627-07:00Thanks for your quick answer.
I've been in th...Thanks for your quick answer.<br /><br />I've been in the kind of situation you described. The best solution I empirically found was also to use L1 reg, with some SGD solver. But tuning the regularization parameter + putting as many bits as possible always worked best. To me, I see the hashing trick as a convenient way to handle a very sparse feature space when you have a limited amount of RAM per server, avoiding the use of a more complicated infrastructure like a parameter server. The regularization effect is just the cherry on top of the cake.<br /><br />I don't really understand the speed decrease argument though: if you have a lot more observations n than the total features space cardinality d, on a typical large-scale setting with sparse features, the CPU complexity of your learning algorithm will relate to the average number of active features per observation, not d (for both SGD and QN methods since a pass over the data will dominate an update of the parameters). So you are not really slower when you increase the hash bits.Benoit Rostykushttp://www.blogger.com/profile/15784288105272457071noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-72695183519653194752013-05-07T08:46:06.574-07:002013-05-07T08:46:06.574-07:00I've had this happen when you have lots of fea...I've had this happen when you have lots of features, the kind of situation where L1 regularization also helps. It's possible if I thoroughly explored the space of combinations of (hash bits, L1 eta) I would find something with more hash bits that worked better, but the speed decrease from increasing hash bits is a real disincentive to explore. Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.com