Comments on Machined Learnings: "Using Less Data" (Paul Mineiro)

Paul Mineiro (2013-07-26 09:55):

The technique from the paper only covers how to construct, from your original dataset, a smaller dataset that has fidelity for learning (ERM). We do (a) in the paper just to show that this method of choosing a subsample is better than the alternatives. Your question suggests spending effort making the subsample "optimally small" in some sense, but that is not how I have applied the technique. In practice the subsample size is set by a computational budget, e.g., how much RAM you have on your desktop, or how much data you can process in five minutes for rapid experimentation.

You want a smaller dataset precisely to do something like (b), i.e., run lots of modeling experiments, so yes, you would do that.

Johan J. (2013-07-26 07:46):

I am a bit confused, though that probably has more to do with my knowledge than with your article :).

Let's say I have n samples and m features, and I fit a linear regression h that minimizes the loss.

Should I then use h to:
a. select N samples until the maximum AUC is approached (as you seem to do in the paper)?
b. build more features and check whether they improve AUC?
c.
Both?

Thanks for any intuition you could provide.

Paul Mineiro (2013-07-21 22:00):

Regarding "A reviewer pointed out that the excess risk bound diverged in this case": by "this case" I meant the case where the loss of h̃ goes to zero.

Paul Mineiro (2013-07-21 21:58):

Doh!
Originally we didn't distinguish between a hypothesis and its induced loss, as in the empirical Bernstein paper, but the reviewers felt this cluttered the exposition (and, in hindsight, they were correct). So we threaded the loss through everything for the camera-ready, but apparently not without error :)

Paul Mineiro (2013-07-21 21:56):

Originally Pmin was not in the paper and we simply set the minimum probability to the empirical loss of h̃. A reviewer pointed out that the excess risk bound diverges in this case, so we added Pmin, but we suspect the right choice for Pmin is the empirical loss of h̃. You then set λ to hit your subsample budget size. This is what we did for the DNA experiment, although we need more experience with other problems before we can confidently say this is the way to go. (Note that if h̃ has a loss which is indistinguishable from zero given Hoeffding on the large dataset, then h̃ is the hypothesis you seek, so you don't really need to subsample.)

The loss is the proxy being optimized (e.g., logistic loss), not the true underlying loss (e.g., zero-one), because we leverage the ordering implied by optimization in the proof.

Olivier Grisel (2013-07-21 04:07):

By the way, I think there is a typo in the article: R_X(h) is defined in terms of a sum over h.
It should be a sum over l_h instead.

Olivier Grisel (2013-07-21 04:05):

Thanks for this post and your paper. Could you tell us a bit more, informally, about how to deal with the newly introduced hyperparameters? In the paper you state:

"In practice Pmin and λ are chosen according to the subsample budget, since the expected size of the subsample is upper bounded by (Pmin + λ R_X(h̃)) n. Unfortunately there are two hyperparameters, and the analysis presented here does not guide the choice except for suggesting the constraints Pmin ≥ R_X(h̃) and λ ≥ 1; this is a subject for future investigation."

In practice, do you run a grid search for Pmin and λ on a small uniform subset of the big dataset and select the (Pmin, λ) pair that gives the fastest decrease in validation error? Or do you just fix Pmin according to your budget (e.g., 0.01) and grid search over λ?

Also, in the paper I am not sure whether you define R_X(h̃) in terms of the loss function being optimized by h̃ (e.g., squared hinge loss for a linear support vector machine) or the true target loss you are interested in (e.g., the zero-one loss).
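For concreteness, the sampling scheme discussed in this thread (keep each example with probability min(1, Pmin + λ · loss of the pilot hypothesis h̃ on that example), then reweight kept examples by the inverse keep probability for subsequent weighted ERM) can be sketched as below. This is a minimal illustration under my reading of the comments, not code from the paper; the function name and NumPy interface are my own.

```python
import numpy as np

def loss_proportional_subsample(losses, p_min, lam, seed=0):
    """Keep example i with probability min(1, p_min + lam * losses[i]);
    return the kept indices and importance weights 1/p for weighted ERM."""
    rng = np.random.default_rng(seed)
    losses = np.asarray(losses, dtype=float)
    p = np.minimum(1.0, p_min + lam * losses)  # per-example keep probability
    keep = rng.random(losses.shape[0]) < p     # independent coin flips
    return np.flatnonzero(keep), 1.0 / p[keep]
```

As the quoted passage from the paper notes, the expected subsample size is at most (Pmin + λ R_X(h̃)) n, so in practice p_min and lam would be set to hit a computational budget rather than tuned for statistical optimality.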