tag:blogger.com,1999:blog-4446292666398344382.comments2015-11-26T15:36:47.638-08:00Machined LearningsPaul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comBlogger119125tag:blogger.com,1999:blog-4446292666398344382.post-21076556296746487712015-11-26T14:37:22.509-08:002015-11-26T14:37:22.509-08:00This is also a pretty fast approximation for float...This is also a pretty fast approximation for floats (it is also reasonable for doubles; just change log_t to double), based on a power series expansion of the artanh identity for ln. The compiler can vectorize it, which speeds it up considerably.<br /><br /><br /><br />/* Needed for M_LN2 and frexpf */<br />#define __USE_MISC 1<br />#include <math.h><br /><br />/* Predefined constants: reciprocals of the odd integers 3 through 15 */<br />#define i3 0.33333333333333333<br />#define i5 0.2<br />#define i7 0.14285714285714285<br />#define i9 0.11111111111111111<br />#define i11 0.09090909090909091<br />#define i13 0.07692307692307693<br />#define i15 0.06666666666666667<br /><br /><br />/* alias the type to reuse the same core for more precise approximations */<br />typedef float log_t;<br />/**<br /> This is a logarithmic approximation function. It is done by splitting up the mantissa and the<br /> exponent. 
It then uses the identity ln z = 2 * artanh((z - 1)/(z + 1))<br /> **/<br /><br />/** Logarithm for small fractions<br /> It calculates ln x, where x <= 1.0<br />**/<br />static inline log_t log1ds(log_t z){<br /> log_t pz = (z-1)/(z+1);<br /> log_t pz2 = pz*pz;<br /> log_t pz3 = pz2*pz;<br /> log_t pz5 = pz3*pz2;<br /> log_t pz7 = pz5*pz2;<br /> log_t pz9 = pz7*pz2;<br /> log_t pz11 = pz9*pz2;<br /> log_t pz13 = pz11*pz2;<br /> log_t pz15 = pz13*pz2;<br /> log_t y = 2 * ( pz + pz3 * i3 + pz5 * i5 + pz7 * i7 + pz9 * i9 + pz11 * i11 + pz13 * i13 + pz15 * i15);<br /> return y;<br />}<br /><br />/** The natural logarithm **/<br />log_t logd(log_t x){<br /> int exp;<br /> log_t fraction = frexpf((log_t)x, &exp);<br /> log_t res = log1ds(fraction);<br /> return res + exp * M_LN2;<br />}<br />Goddard LeRoyhttp://www.blogger.com/profile/02898381097003622186noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-1589142078264548032015-11-06T07:51:32.935-08:002015-11-06T07:51:32.935-08:00around 1.0 the approximation of fastlog and faster...Around 1.0 the approximations of fastlog and fasterlog are poor (relative error > 13 for fastlog and much worse for fasterlog)Soeren Sonnenburghttp://www.blogger.com/profile/00844611789125605702noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-10878158591112471102015-10-14T04:51:11.157-07:002015-10-14T04:51:11.157-07:00As for automated training, chalearn is running a c...As for automated training, ChaLearn is running a challenge on that: http://automl.chalearn.org/<br /><br />Also I liked a knowledge competition on Kaggle where the task was to figure out the rules of poker (category rank of 5-card hand). 
Many incorporated domain knowledge, like the fact that the order of the cards dealt does not matter, so next time it should be an entirely new game.Anonhttp://www.blogger.com/profile/01360755535659830635noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-68047873625415673592015-10-14T04:39:12.474-07:002015-10-14T04:39:12.474-07:00I like the partial feedback idea. I've been wo...I like the partial feedback idea. I've been wondering whether it would be possible to do a bandit problem on Kaggle's current platform, but I don't think that's feasible right now.<br /><br />For instance it would be cool to have 200 slot machines with a variable payout and 10000 pulls, and the objective is to get as much payout as possible. Could even add a little context, like time of day or distance to the walking aisle.Anonhttp://www.blogger.com/profile/01360755535659830635noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-87584124273914336702015-08-19T17:21:25.804-07:002015-08-19T17:21:25.804-07:00Re: anti-skilled immigrants, I was reacting to the...Re: anti-skilled immigrants, I was reacting to the H-1B elements of Trump's immigration proposal.<br /><br />I relate to your focus on non-monetary aspects of compensation, as such factors are very important to me as well. However, this indicates the fundamental strength of the labor market in technology right now, that laborers can be so selective.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-63915873114858811812015-08-19T12:25:37.251-07:002015-08-19T12:25:37.251-07:00I'm American. I can see how the current climat...I'm American. I can see how the current climate is anti-illegal-immigrant, but anti-skilled-worker-immigrant? Doesn't seem like it to me. They might get some pushback just because we have so many illegals, dunno.<br /><br /> My experience applying to large corps like MS/Google etc is that I never get any response at all. 
Though I'm not particularly interested in working at these places (they seem like career places; I'm more interested in working for a year or two, then traveling for a few years... repeat). <br /><br />They are also located in horribly overcrowded cities like SF and Seattle, yuck.<br /><br /> Anyway I suspect there are many Americans being ignored because of a rather poor screening process (clueless HR people + resume) that assumes everyone has to have the same type of background to be qualified.<br />niadhttp://www.blogger.com/profile/08197767290360188611noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-39793983843240447992015-07-20T11:31:11.135-07:002015-07-20T11:31:11.135-07:00As a model builder, training times are painful. B...As a model builder, training times are painful. But inference (evaluation) times are currently more painful in practice. Neural networks may be difficult to optimize, but the resulting function can be fairly compact and cheap to evaluate compared to alternatives with similar statistical performance.<br /><br />Slow training times limit model building experimental turn-around, but experiment-level parallelism helps practitioners at "the majors".Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-61038018502091054392015-07-20T10:52:41.961-07:002015-07-20T10:52:41.961-07:00Thank-you for the review which is appreciated. As ...Thank you for the review, which is appreciated. As deep-everything becomes pervasive, my sense is that more resources (hopefully) will be focused on reducing training time (scale). 
Whether performed internally or on public clouds, training times that can easily run into the weeks are not going to cut it in the real world.Unknownhttp://www.blogger.com/profile/16570083132483784714noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-38817340679615820632015-02-28T23:10:08.792-08:002015-02-28T23:10:08.792-08:00The economics just don't work for the business...The economics just don't work for the business. The amortized cost of perks is much less than that of salary increases, so as cost of labor threatens to consume all profits (especially in late-stage startups), compensation turns to intangible perks that are harder to value. In the finance industry they just pay cash.Evanhttp://www.blogger.com/profile/03067803657727403662noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-29906688443169310342014-05-04T07:10:17.947-07:002014-05-04T07:10:17.947-07:00Hey! I thought this post was pretty good. You sh...Hey! I thought this post was pretty good. You should take a look at this presentation: http://users.soe.ucsc.edu/~niejiazhong/slides/chandra.pdf in particular the "design principles" section. I think most people haven't realized that machine learning, HPC, and MapReduce have more or less converged.David Konerdinghttp://www.blogger.com/profile/08061100933668606153noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-66677149958830319022014-04-25T09:07:03.689-07:002014-04-25T09:07:03.689-07:00It is domain dependent. Early experience in compu...It is domain dependent. Early experience in computational advertising, with Zipf-distributed text features and linear models, suggested larger amounts of data led to improved performance. This was believed to be generally true for a while but is now recognized as a specific property of these problems, attributable to the long tail of the Zipf distribution coupled with the need to have roughly 10 observations per feature in a linear model. 
For other types of models in other domains, it is possible to hit the Bayes limit (optimal performance given inherent noise) with training data that comfortably fits on a single machine, in which case more data is not helpful.<br /><br />Computer vision is benefiting from large labeled datasets, essentially because very flexible models are required for good performance and therefore lots of training data is required to prevent overfitting. I see this trend continuing for a while: arguably it is held back by our (in)ability to optimize large deep learning models, not our ability to assemble large computer vision training sets.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-33218079827359238742014-03-30T01:10:33.399-07:002014-03-30T01:10:33.399-07:00A terrific post which I keep coming back to freque...A terrific post which I keep coming back to frequently to learn something new. It would be good to know what "scaling" machine learning means with respect to size, i.e., how big are the models today and what are the expected sizes in the future?Unknownhttp://www.blogger.com/profile/16570083132483784714noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-20780882857342199482013-12-12T15:03:33.286-08:002013-12-12T15:03:33.286-08:00Updated link here.Updated link <a href="http://www.cs.columbia.edu/~djhsu/papers/multiview-slides.pdf" rel="nofollow">here</a>.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-47641651422335418292013-12-10T23:33:16.499-08:002013-12-10T23:33:16.499-08:00That is indeed a brilliant idea. I'm pretty su...That is indeed a brilliant idea. 
I'm pretty sure it couldn't work at math conferences but it's not like anything else does either.Allen Knutsonhttp://www.blogger.com/profile/15616422252030334511noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-21226269777715131752013-11-15T19:50:21.983-08:002013-11-15T19:50:21.983-08:00Can you clarify which of the multiple algorithms i...Can you clarify which of the multiple algorithms in Halko et al. this is (maybe a variant of 4.4?). Also, do 'pcas' and 'pcav' here refer to the singular values and singular vectors respectively (S and V in svd), and if so is there a reason not to do svd() directly on secondpass instead of eig() on secondpass' * secondpass; supposedly svd is numerically more stable. Thanks.Gad Abrahamhttp://www.blogger.com/profile/02690671851015634332noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-32300478019883902712013-10-16T17:22:27.039-07:002013-10-16T17:22:27.039-07:00Nice post. It's worth noting that the randomiz...Nice post. It's worth noting that the randomized SVD is an early-stopped version of classical SVD algorithms like the subspace iteration algorithm.Alexhttp://www.blogger.com/profile/07509233273590036699noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-37023693830464626192013-10-13T11:37:01.099-07:002013-10-13T11:37:01.099-07:00Thanks for publishing this code. Here's a slig...Thanks for publishing this code. Here's a slightly modified version for the Digit Recognizer competition at Kaggle:<br /><br />https://github.com/zygmuntz/kaggle-digits<br /><br />It scores ~ 0.977.Zygmunt Zajączhttp://www.blogger.com/profile/01337002644285268044noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-87365800147205382882013-08-29T08:44:38.167-07:002013-08-29T08:44:38.167-07:00@DatsMe:
It's not clear if you want something...@DatsMe:<br /><br />It's not clear if you want something that will work with doubles but with the same precision approximation, or something with higher precision.<br /><br />The former is easy (if somewhat nonsensical): downcast to float, call the routine, then cast the result back to double.<br /><br />The latter will require adjusting the constants in the approximation. The mathematica notebook (https://code.google.com/p/fastapprox/source/browse/trunk/fastapprox/tests/fastapprox.nb) contains the derivation which could be adapted to the proper mantissa, base, bitmask, etc. for doubles.<br /><br />It's possible it will not be faster than standard libraries once you engineer for additional precision.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-42215435067760513592013-08-29T06:50:58.750-07:002013-08-29T06:50:58.750-07:00I'm trying to adapt fastexp() to work with dou...I'm trying to adapt fastexp() to work with double precision, with little success.<br />I'd appreciate if anyone could point me in the right direction.<br />Thanks!DatsMehttp://www.blogger.com/profile/01020146443768428068noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-71279424754127429012013-07-26T09:55:59.086-07:002013-07-26T09:55:59.086-07:00The technique from the paper only discusses the pr...The technique from the paper only discusses the process of making a smaller dataset from your original dataset which has fidelity for learning (ERM). So we do a) in the paper just to give people an idea of how this method of choosing a subsample is better than alternatives. Your question suggests spending time getting the size of the subsample "optimally small" in some sense, but this hasn't been how I've applied the technique. 
In practice the size of the subsample is set by some computational budget, e.g., how much RAM you have on your desktop, or how much data you can process in 5 minutes for rapid experimentation. <br /><br />You want a smaller dataset to do something like b), i.e., run lots of modeling experiments, so yes you are going to do that.<br />Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-41273584424166507392013-07-26T07:46:36.080-07:002013-07-26T07:46:36.080-07:00I am a bit confused, though it has probably more t...I am a bit confused, though it has probably more to do with my knowledge than your article :).<br /><br />Let's say I have n samples, m features. I should find a linear regression (h) which will minimize the loss. <br /><br />Should I then use h to:<br />a. Select N samples until the maximum AUC is approached? (Like you seem to do in the paper?)<br />b. Use h to build more features m, and experiment to see if they improve AUC.<br />c. Both?<br /><br />Thanks for any intuition you could provide.Johan J.http://www.blogger.com/profile/16511137688482612834noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-55015953410337103812013-07-21T22:00:54.545-07:002013-07-21T22:00:54.545-07:00"A reviewer pointed out that the excess risk ..."A reviewer pointed out that the excess risk bound diverged in this case" <br /><br />By "this case" I meant the case that the loss of tilde h goes to zero.Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-81861621710024308952013-07-21T21:58:24.483-07:002013-07-21T21:58:24.483-07:00Doh!
Originally we didn't distinguish between...Doh!<br /><br />Originally we didn't distinguish between a hypothesis and the induced loss, as in the empirical Bernstein paper, but the reviewers really thought this cluttered the exposition (and, in hindsight, they were correct). So we threaded the loss through everything for the camera ready, but apparently not without error :)Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-8041328260069353942013-07-21T21:56:16.093-07:002013-07-21T21:56:16.093-07:00Originally Pmin was not in the paper and we just s...Originally Pmin was not in the paper and we just set the minimum probability to the empirical loss of tilde h. A reviewer pointed out that the excess risk bound diverged in this case, so we added Pmin, but we suspect the right choice for Pmin is the empirical loss of tilde h. Then you set lambda to hit your subsample budget size. This is what we did for the DNA experiment, although we need some more experience with problems before we can say confidently this is the way to go. (Note that if tilde h has a loss which is indistinguishable from zero given Hoeffding on the large data set, tilde h is the hypothesis you seek and so you really don't need to subsample.)<br /><br />The loss is the proxy being optimized (e.g., logistic loss) and not the true underlying loss (e.g., zero-one), as we leverage the ordering implied by optimization in the proof. Paul Mineirohttp://www.blogger.com/profile/05439062526157173163noreply@blogger.comtag:blogger.com,1999:blog-4446292666398344382.post-56910346197576299952013-07-21T04:07:52.689-07:002013-07-21T04:07:52.689-07:00BTW, I think there is a typo in the article: R_X(h...BTW, I think there is a typo in the article: R_X(h) is defined in terms of a sum of h. It should be a sum of l_h instead.Olivier Griselhttp://www.blogger.com/profile/05751090858946703320noreply@blogger.com