As CPUs and GPUs gain more cores with every generation, clock speeds have plateaued. Phil Rhodes explains why.
Between about 2000 and 2005, computer clock speeds jumped from around 500 to 3500MHz, and that was after progressing from 200 to 500MHz in the previous five years. This progression reaches all the way back to the late 70s in some form, but in recent years it has slowed down, stopped, or in some cases even gone slightly backwards. We're told that multi-core processing is the answer, but here's the catch: it just isn't.
We talked about this a few weeks ago when rumours of a lower-cost, multi-core Intel CPU surfaced, but regardless of the veracity of that idea, there are some serious problems with parallelism. These problems extend to both the handful of cores on a CPU and the hundreds of cores on a graphics card; crucially, the main issue can't even be solved with cleverly-written software.
Limitations of parallelisation
Writing code that takes the best possible advantage of multi-core CPUs is an art and a science that's currently under massive development, but it relies on one simple idea: that we can do lots of things at once. Well, that's great: next time I'm painting something that requires several coats, I'll get several mates round and we'll apply them all at once.
Oh dear. That won't work.
It might work if we had ten items to paint, but otherwise, applying multiple coats of paint is not something that is, in computer terms, a parallel task. And yes, this applies to code as well. It's even been formalised as Amdahl's Law, a reference to Gene Amdahl, who passed away on November 10. Amdahl described the problem as early as 1967, but it's just as relevant now. In practically every computer task, there are parts that can't be parallelised – that is, they can't be split up into smaller tasks and done all at once. The mathematics associated with Amdahl's Law suggests that even if 95% of the task can be split up into parallel tasks, the remaining 5% caps the overall speed-up at 20 times, and once there are something like 4096 processors involved, the improvement in performance that can be achieved by adding more processors is close to zero. Actually, it's very nearly zero at 2048, and adding more processing power becomes uneconomic at far lower numbers than that. The graph of speed-up against processor count begins to flatten out perhaps as low as 64 processors, and it's possible to build a big, expensive workstation with that number of processors right now. And that assumes a 95%-parallel task, which many aren't. This is, to put it bluntly, an extremely grim situation.
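To put numbers on that, here is a minimal Python sketch of the standard form of Amdahl's Law, speed-up = 1 / ((1 - p) + p / n), where p is the parallel fraction and n the number of processors. The function name and the list of processor counts are illustrative choices, not anything taken from a particular library.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Overall speed-up predicted by Amdahl's Law.

    parallel_fraction: share of the work that can be split up (0.0 to 1.0).
    processors: number of processors sharing the parallel part.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    p = 0.95  # the 95%-parallel task discussed above
    for n in (1, 2, 8, 64, 512, 2048, 4096):
        print(f"{n:>5} processors: {amdahl_speedup(p, n):6.2f}x")
    # The serial 5% caps the speed-up at 1 / 0.05 = 20x,
    # however many processors are added.
    print(f"ceiling: {1.0 / (1.0 - p):.0f}x")
```

Under this model, 64 processors already deliver roughly 15.4x of the 20x ceiling, while 2048 and 4096 give roughly 19.8x and 19.9x: doubling the hardware at that point buys less than half a percent.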
It's easy to be enthused by the extremely large performance increases which have become possible for certain tasks. Monstrously expensive colour grading hardware has, famously, been obsoleted very quickly and very thoroughly by multi-core computing. The problem is that these big performance increases are not going to be improved upon quickly – and they're not going to happen again. That particular revolution has revolved. Other things may help, particularly faster memory; memory speed is a big enough problem that CPUs include a cache, a small scratch-pad of extremely fast memory for moment-to-moment notes and intermediate results. Throwing ever more cores at the problem, however, is not, no matter how hard we try, going to speed things up in the same way as doubling the CPU clock speed.
The canonical example of this is in video encoding. At first, the problem seems easy: most modern codecs (such as H.264) work by dividing the frame up into blocks and treating them individually. Often the blocks are eight pixels square and there are thus 32,400 of them in an HD frame. This immediately sounds fantastic, because we can assign each one of those blocks to a GPU core and work on them all at once. Unfortunately, that doesn't work very well, because the mathematics involved requires results from neighbouring blocks, so that only a few can be done at once. The final stage in H.264 encoding can involve arithmetic coding, a lossless compression technique which squeezes each value down according to how likely it is, given the data that came before it. Because every step depends on the one before, the data has to be handled as one long string; it isn't something that can be done in chunks. There are at least partial solutions to some of these things, although some of them can adversely affect picture quality.
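The neighbouring-blocks problem can be sketched in a few lines of Python. This is a deliberately simplified model, not how any real encoder schedules its work: it assumes each block needs only the block to its left and the block above it to be finished first, so the blocks that can run together are those on the same diagonal of the frame. The function names are hypothetical.

```python
FRAME_WIDTH, FRAME_HEIGHT = 1920, 1080  # HD frame, in pixels
BLOCK_SIZE = 8                          # eight-pixel-square blocks


def block_grid(width, height, block):
    """Return (columns, rows) of blocks covering the frame."""
    return width // block, height // block


def wavefront_widths(cols, rows):
    """Number of blocks that can be processed together at each step.

    Assumption: a block can start only once its left and upper
    neighbours are done, so only blocks on the same anti-diagonal
    (row + column equal) are independent of one another.
    """
    widths = []
    for step in range(cols + rows - 1):
        # Rows r for which column c = step - r lies inside the grid.
        first_row = max(0, step - (cols - 1))
        last_row = min(rows - 1, step)
        widths.append(last_row - first_row + 1)
    return widths


if __name__ == "__main__":
    cols, rows = block_grid(FRAME_WIDTH, FRAME_HEIGHT, BLOCK_SIZE)
    widths = wavefront_widths(cols, rows)
    total = sum(widths)
    print(f"blocks in frame:     {total}")
    print(f"sequential steps:    {len(widths)}")
    print(f"most blocks at once: {max(widths)}")
    print(f"average at once:     {total / len(widths):.1f}")
```

With these assumptions the 32,400 blocks take 374 dependent steps, and no more than 135 blocks are ever ready at the same moment, however many GPU cores are waiting for work.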
To be fair, the situation isn't entirely doom and gloom. For some really intensive tasks, such as rendering a huge After Effects composition, the parallel part of the task can be a lot more than 95% of the task. It can be a huge-string-of-nines percentage of the task, so more cores keep on helping, right up to thousands of cores. And that's good, because for some of these tasks, current performance isn't half of the performance we'd ideally like. It's not a tenth. It's not a hundredth. What we actually want is a computer that's thousands of times faster, right up to the point where I can view my AE composition, which takes 25 minutes per frame to render, at 25 frames per second. Yes, I'm asking for a 37,500-times speed-up there. Where's my jet pack?
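The 37,500 figure, and what it demands of the task, can be checked with a short Python sketch. It reuses the standard form of Amdahl's Law, solved for the number of processors; the 99.999% parallel fraction in the example is purely a hypothetical value chosen for illustration, and the function names are made up for this sketch.

```python
import math


def required_speedup(render_seconds_per_frame, target_fps):
    """Speed-up needed to play back in real time."""
    return render_seconds_per_frame * target_fps


def processors_needed(parallel_fraction, target_speedup):
    """Processors Amdahl's Law says are needed, or None if unreachable.

    Solves 1 / ((1 - p) + p / n) >= target_speedup for n.
    """
    serial_fraction = 1.0 - parallel_fraction
    headroom = 1.0 / target_speedup - serial_fraction
    if headroom <= 0:
        return None  # the serial part alone already takes too long
    return math.ceil(parallel_fraction / headroom)


if __name__ == "__main__":
    speedup = required_speedup(25 * 60, 25)  # 25 minutes per frame, 25 fps
    print(f"required speed-up: {speedup}x")
    # The serial share must be below 1 / speedup for the target to be
    # reachable at all, whatever the processor count.
    print(f"serial share must be under {100.0 / speedup:.5f}%")
    for p in (0.95, 0.9999, 0.99999):  # hypothetical parallel fractions
        print(f"p = {p}: {processors_needed(p, speedup)} processors")
```

Under this model, a task that is 95% or even 99.99% parallel can never reach the target, and a 99.999%-parallel one needs something in the region of 60,000 processors, which is why the string of nines has to be very long indeed.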
Multi-core processing might help us a bit there, but as we've seen, it is, at best, a part of the answer. At some point, we're going to need more CPU cycles per second, a lot more, and not spread over a dozen cores.
Graphic by Shutterstock