Revenge of the CISC

So the thousand-fold increase in clock speeds from my childhood Commodore 64 is not to be repeated, as we've all heard by now. The machine learning community, like other scientific computing communities, is earnestly embracing multicore and cluster computing in order to scale further, and this is a good thing.
However, what this exercise taught me is that computers are still getting faster: not in terms of clock speed, but in terms of how much work is done per clock tick. Not only do the chip designers add awesome new instructions, but with each chip release they make those instructions more efficient (e.g., lower latency, fewer stalls). Some of the gains accrue automatically if you are using a recent version of your compiler: gcc 4 automatically vectorizes, for instance, but gcc 3 does not. This is one reason why I "only" saw an additional 20% improvement from vectorization of the LDA implementation in Vowpal Wabbit: the compiler had already vectorized the obvious stuff (e.g., dot products, loops accumulating counters), but wasn't smart enough to recognize that my digamma, lgamma, and exponential approximations could be vectorized as well. If you are using gcc 3 to compile Vowpal Wabbit, you are probably paying a speed penalty.
Alright, so what? To get to the petabyte level you need a cluster anyway, right? Well, there are lots of problems where having a few-gigabyte sample of the data on a laptop is useful for quick experimentation. Therefore it helps to have the fastest implementation of your learning algorithm that runs on your laptop. To get to that fast implementation, pay attention to the instruction set!