Guest author Rakesh Malik explores why we haven't seen great increases in processor speeds as of late, and how processors continue to improve in other ways.
Throughout the '90s and up until nearly 2010, it seemed like x86 processors were getting speed bumps every few months, with big jumps in maximum clock speed with every new generation.
Clearly, that's no longer the case. Clock speeds haven't actually gone up significantly for quite a while now.
This, of course, leads quite a few people to wonder, "Why did the speed race stop?"
The short version is that it hasn't truly stopped; rather, it's changed. Clock speed isn't the only factor that determines the performance of a processor.
Pipelining is splitting the stages of execution into smaller stages, allowing for higher clock speeds.
To understand how this works, imagine a bucket brigade. If there are only a few people in the bucket brigade, then each person has to walk back and forth to grab a bucket from the previous person and hand it off to the next. The walking back and forth makes each stage take longer, which in processor parlance is called latency. Now imagine adding enough people to the bucket brigade so that a person can grab a bucket and hand it off to the next person without needing to even take a step. Making each stage simpler in this way is what enables a higher clock speed.
The tradeoff is that if one of those stages gets stalled, all of them do. If one person is stuck waiting on a bucket, everyone before them in the line has to pause as well. (We'll get to branch prediction later.)
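The bucket brigade arithmetic can be sketched in a few lines of Python. This is a simplified model rather than a description of any real processor: it assumes the work per instruction divides evenly across the stages, that one instruction finishes per cycle once the pipeline is full, and that a stall idles the whole pipeline. The function names and the example numbers are made up for illustration.

```python
def clock_period_ns(logic_delay_ns, num_stages, latch_overhead_ns=0.0):
    """Shorter stages mean a shorter clock period, i.e. a higher clock speed."""
    return logic_delay_ns / num_stages + latch_overhead_ns


def run_time_ns(num_instructions, logic_delay_ns, num_stages,
                stall_cycles=0, latch_overhead_ns=0.0):
    """Time to run a stream of instructions through the pipeline.

    The first instruction needs num_stages cycles to reach the end of the
    line; after that, one instruction completes per cycle. stall_cycles is
    the total number of cycles the whole pipeline spent waiting.
    """
    if num_instructions == 0:
        return 0.0
    cycles = num_stages + (num_instructions - 1) + stall_cycles
    return cycles * clock_period_ns(logic_delay_ns, num_stages, latch_overhead_ns)


# 1000 instructions, each needing 10 ns of logic in total (assumed figures).
print(run_time_ns(1000, 10.0, 1))                    # 10000.0 (no pipelining)
print(run_time_ns(1000, 10.0, 5))                    # 2008.0  (5 stages)
print(run_time_ns(1000, 10.0, 5, stall_cycles=200))  # 2408.0  (5 stages, with stalls)
```

The stalled run is still far faster than the unpipelined one, but every stalled cycle is paid for by the entire line, not just the stage that was waiting.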
Most modern processors have a number of pipelines that can execute instructions in parallel. They're generally optimized for certain types of tasks, such as integer math, floating-point math, ALU operations, and load/store. Issuing instructions in parallel is basically like having multiple bucket brigades working together, but there's a limit to how many instructions the processor can issue at the same time.
Having a lot of execution units is one thing; keeping them occupied is another. The scheduler has to determine the most efficient way to schedule instructions of different types. It also has to schedule the instructions that retrieve data other instructions are going to need, accounting for how long it will take to load the data from memory into the caches and from the caches into the appropriate registers.
Pipelines that aren't executing instructions are just consuming power, and as the processor's issue width grows, it gets harder for the scheduler to keep the units busy. Because of this, processors tend to top out at six-to-eight-issue superscalar designs.
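To see why a wider issue width runs into diminishing returns, here is a toy model in Python. It is only a sketch under stated assumptions: every instruction takes one cycle, instructions issue strictly in program order, and each instruction lists the earlier instructions it depends on. The names in_order_cycles and utilization, and the example program, are hypothetical.

```python
def in_order_cycles(deps, width):
    """Count cycles needed to issue a program, at most `width` per cycle.

    deps[i] is the set of earlier instruction indices that instruction i
    must wait for. An instruction cannot issue in the same cycle as
    something it depends on, and nothing may pass a blocked instruction.
    """
    issued_in = {}   # instruction index -> cycle in which it issued
    cycle = 0
    next_instr = 0
    while next_instr < len(deps):
        cycle += 1
        issued = 0
        while next_instr < len(deps) and issued < width:
            if any(issued_in[d] >= cycle for d in deps[next_instr]):
                break  # a dependency issued this very cycle; wait
            issued_in[next_instr] = cycle
            next_instr += 1
            issued += 1
    return cycle


def utilization(deps, width):
    """Fraction of issue slots that actually carried an instruction."""
    return len(deps) / (in_order_cycles(deps, width) * width)


# Four independent chains of three instructions each, interleaved.
program = [set() if i < 4 else {i - 4} for i in range(12)]

for width in (1, 2, 4, 8):
    print(width, in_order_cycles(program, width), utilization(program, width))
# 1 12 1.0
# 2 6 1.0
# 4 3 1.0
# 8 3 0.5
```

Going from four-wide to eight-wide buys nothing for this program, because only four instructions are ever independent at once. The extra issue slots sit idle, which is the situation the scheduler is trying to avoid.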
Compilers generate instructions based on an Instruction Set Architecture, or ISA. Since that's a specification that spans a variety of processors, each processor design has to decode the compiler-generated instructions into its own native instructions. In some processor architectures, each pipeline has its own decoder; others have just one decoder that saves the decoded instructions in a trace cache.
With a trace cache, the decoder doesn't have to decode instructions every time it encounters them, just the first time. This can add up to a significant improvement in efficiency and performance in programs that execute the same sequences of instructions repeatedly on a large set of data, for example.
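The decode-once idea can be sketched as a simple lookup table keyed by instruction address. This is a deliberately reduced model: a real trace cache stores whole sequences of decoded instructions, while this one caches each instruction separately. The toy instruction set, the decode table, and the class and function names are all invented for the example.

```python
# Hypothetical toy ISA: each instruction expands to a tuple of native operations.
TOY_DECODE_TABLE = {
    "load": ("load",),
    "add": ("add",),
    "add_from_memory": ("load", "add"),
    "increment_memory": ("load", "add", "store"),
}


class CachingDecoder:
    def __init__(self):
        self._decoded = {}     # instruction address -> decoded operations
        self.decode_count = 0  # how many times the slow decode path ran

    def fetch(self, address, program):
        """Return the decoded form, decoding only on the first encounter."""
        if address not in self._decoded:
            self.decode_count += 1
            self._decoded[address] = TOY_DECODE_TABLE[program[address]]
        return self._decoded[address]


def run_loop(program, iterations):
    """Run the same instruction sequence repeatedly, as a loop over data would."""
    decoder = CachingDecoder()
    executed = 0
    for _ in range(iterations):
        for address in range(len(program)):
            executed += len(decoder.fetch(address, program))
    return executed, decoder.decode_count


loop_body = ["load", "add_from_memory", "increment_memory"]
print(run_loop(loop_body, 1000))  # (6000, 3)
```

Six thousand native operations were executed, but the decoder only did its work three times, once per distinct instruction in the loop.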
Out of Order Execution
The instruction scheduler in the processor tries to keep as many pipelines and execution units busy as it can. To do this, it will go so far as to figure out which instructions depend on the outputs of other instructions, and reorder them to keep the processor as busy as possible. The processor will also attempt to predict what data the instructions will need in order to complete executing, and load the data from memory to have it available when the instruction is ready to execute. If its prediction is correct, the processor stays busier; if not, it suffers a mispredict penalty.
Some processors can issue instructions out of order, but must retire them in order. Others can issue and retire instructions out of order, which allows the scheduler to have additional flexibility in how to reorder instructions.
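A small Python sketch can show what reordering buys. It models issue only, not retirement or data prediction, and it assumes every instruction takes one cycle and depends only on earlier instructions. The schedule function and the six-instruction program are hypothetical.

```python
def schedule(deps, width, out_of_order):
    """Return a list of cycles, each listing the instruction indices issued.

    deps[i] is the set of earlier instruction indices that instruction i
    needs results from. At most `width` instructions issue per cycle.
    """
    issued_at = {}
    pending = list(range(len(deps)))
    cycles = []
    while pending:
        cycle = len(cycles) + 1
        this_cycle = []
        for i in pending:
            if len(this_cycle) == width:
                break
            # Ready only if every dependency issued in an earlier cycle.
            ready = all(issued_at.get(d, cycle) < cycle for d in deps[i])
            if ready:
                this_cycle.append(i)
            elif not out_of_order:
                break  # in-order issue: nothing may pass a blocked instruction
        for i in this_cycle:
            issued_at[i] = cycle
            pending.remove(i)
        cycles.append(this_cycle)
    return cycles


# Instructions 0 -> 1 -> 2 form a chain; 3, 4 and 5 are independent.
program = [set(), {0}, {1}, set(), set(), set()]

print(schedule(program, width=2, out_of_order=False))
# [[0], [1], [2, 3], [4, 5]]
print(schedule(program, width=2, out_of_order=True))
# [[0, 3], [1, 4], [2, 5]]
```

The in-order version wastes an issue slot in each of the first two cycles while the chain blocks everything behind it. The out-of-order version fills those slots with the independent instructions and finishes a cycle sooner.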
Single Instruction, Multiple Data, or SIMD, is a way of packing multiple operations into a single instruction. Also known as vector computing, this is a common technique in supercomputers that is now ubiquitous in personal computers. Each vector contains multiple pieces of data, and the processor can execute that one instruction on all of them at the same time. One example is brightening all of the color values of one pixel in Photoshop by an integer value. This involves packing the pixel's color values into one vector, packing the number to add into each slot of another vector, and issuing a vector add instruction. Each value in the first vector is then added to the corresponding value in the second, all with one instruction.
This requires explicit parallelization when writing the program in order to work, and some workloads are more amenable to parallelism of this type than others. Photo and video editing applications lend themselves to this sort of parallelism quite well. 3D rendering applications generally do not, because their data access patterns are not very predictable.
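The pixel-brightening example can be modeled in Python as follows. This only illustrates what a vector add does to its operands: Python runs the lanes one at a time, so nothing here is actually executed as SIMD. The four-lane width, the RGBA layout, the clamping at 255, and the choice to leave the alpha channel untouched are all assumptions made for the sketch.

```python
LANES = 4  # assumed vector width: one 8-bit lane each for R, G, B and A


def vector_add_saturating(a, b):
    """Model of one vector add: lane i of a is added to lane i of b.

    Each result is clamped to 255 rather than wrapping around, so a
    bright channel stays bright instead of overflowing to a dark value.
    """
    assert len(a) == len(b) == LANES
    return tuple(min(x + y, 255) for x, y in zip(a, b))


def brighten_pixel(pixel, amount):
    """Pack the amount into a second vector, then issue one vector add."""
    amounts = (amount, amount, amount, 0)  # assumption: alpha is left alone
    return vector_add_saturating(pixel, amounts)


print(brighten_pixel((10, 200, 250, 255), 20))  # (30, 220, 255, 255)
```

On real hardware the whole tuple would live in one vector register and the add would be a single instruction, which is where the speedup comes from.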