In order to speed up the operation of a computer system beyond what is possible with sequential execution, methods must be found to perform more than one task at a time. One method for gaining significant speedup with modest hardware cost is the technique of pipelining. In this technique, a task is broken down into multiple steps, and independent processing units are assigned to each step. Once a task has completed its initial step, another task may enter that step while the original task moves on to the following step. The process is much like an assembly line, with a different task in progress at each stage. In theory, a pipeline which breaks a process into N steps could achieve an N-fold increase in processing speed. Due to various practical problems, the actual gain may be significantly less.
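The N-fold claim and its practical limit can be made concrete with a small calculation. The sketch below (illustrative only; the function name is ours) uses the standard counting argument: with k stages and n tasks, a pipeline needs k + (n - 1) stage-times rather than n * k.

```python
# Worked example of ideal pipeline speedup: with k stages and n tasks,
# the pipeline finishes in k + (n - 1) stage-times instead of n * k,
# so the speedup approaches k only as n grows large.
def pipeline_speedup(k, n):
    sequential = n * k           # each task takes k stage-times on its own
    pipelined = k + (n - 1)      # k cycles to fill the pipe, then 1 per task
    return sequential / pipelined

# A 4-stage pipeline on only 4 tasks gains far less than 4x:
print(pipeline_speedup(4, 4))     # 16 / 7, about 2.29
# but approaches 4x over a long run of tasks:
print(pipeline_speedup(4, 1000))
```

Even before timing variations and hazards are considered, the fill time of the pipe means short bursts of work never see the full N-fold gain.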
The concept of pipelines can be extended to various structures of interconnected processing elements, including those in which data flows from more than one source or to more than one destination, or may be fed back into an earlier stage. We will limit our attention to linear sequential pipelines in which all data flows through the stages in the same sequence, and data remains in the same order in which it originally entered.
Pipelining is most suited for tasks in which essentially the same sequence of steps must be repeated many times for different data. This is true, for example, in many numerical problems which systematically process data from arrays. Arithmetic pipelining is used in some specialized computers discussed elsewhere. One action common to all computers, however, is the systematic fetch and execute of instructions. This process can be effectively pipelined, and this instruction pipelining is the subject to be considered in this chapter.
The first step in applying pipelining techniques to instruction processing is to divide the task into steps that may be performed with independent hardware. The most obvious division is between the FETCH cycle (fetch and interpret instructions) and the EXECUTE cycle (access operands and perform operation). If these two activities are to run simultaneously, they must use independent registers and processing circuits, including independent access to memory (separate MAR and MBR).
It is possible to further divide FETCH into fetching and interpreting, but since interpreting is very fast this is not generally done. To gain the benefits of pipelining it is desirable that each stage take a comparable amount of time.
A more practical division would split the EXECUTE cycle into three parts: Fetch operands, perform operation, and store results. A typical pipeline might then have four stages through which instructions pass, and each stage could be processing a different instruction at the same time. The result of each stage is passed on to the next stage.
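The overlap among the four stages can be visualized with a cycle-by-cycle trace. The following sketch (stage names follow the division above; the shift-register model is a simplification that ignores stalls) shows a different instruction occupying each stage in the same cycle.

```python
# Cycle-by-cycle sketch of the four-stage pipeline described above.
# Each cycle, every instruction advances one stage and a new one enters
# FETCH; after four cycles, all four stages are busy at once.
STAGES = ["FETCH", "FETCH OPERANDS", "EXECUTE", "STORE RESULTS"]

def trace(instructions, cycles):
    pipe = [None] * len(STAGES)        # one instruction slot per stage
    stream = iter(instructions)
    snapshots = []
    for _ in range(cycles):
        # everything shifts one stage down the pipe; next instruction enters
        pipe = [next(stream, None)] + pipe[:-1]
        snapshots.append(list(pipe))
    return snapshots

for cycle, snap in enumerate(trace(["I1", "I2", "I3", "I4"], 5)):
    print(cycle, dict(zip(STAGES, snap)))
```

On cycle 3 the trace shows I4 in FETCH while I1 reaches STORE RESULTS: four instructions in flight simultaneously.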
Several difficulties prevent instruction pipelining from being as simple as the above description suggests. The principal problems are:
TIMING VARIATIONS: Not all stages take the same amount of time. This means that the speed gain of a pipeline will be determined by its slowest stage. This problem is particularly acute in instruction processing, since different instructions have different operand requirements and sometimes vastly different processing time. Moreover, synchronization mechanisms are required to ensure that data is passed from stage to stage only when both stages are ready.
DATA HAZARDS: When several instructions are in partial execution, a problem arises if they reference the same data. We must ensure that a later instruction does not attempt to access data sooner than a preceding instruction, if this will lead to incorrect results. For example, instruction N+1 must not be permitted to fetch an operand that is yet to be stored into by instruction N.
BRANCHING: In order to fetch the "next" instruction, we must know which one is required. If the present instruction is a conditional branch, the next instruction may not be known until the current one is processed.
INTERRUPTS: Interrupts insert unplanned "extra" instructions into the instruction stream. The interrupt must take effect between instructions, that is, when one instruction has completed and the next has not yet begun. With pipelining, the next instruction has usually begun before the current one has completed.
All of these problems must be solved in the context of our need for high speed performance. If we cannot achieve sufficient speed gain, pipelining may not be worth the cost.
Possible solutions to the problems described above include the following strategies:
To maximize the speed gain, stages must first be chosen to be as uniform as possible in timing requirements. However, a timing mechanism is needed. A synchronous method could be used, in which a stage is assumed to be complete in a definite number of clock cycles. However, asynchronous techniques are generally more efficient. A flag bit or signal line is passed forward to the next stage indicating when valid data is available. A signal must also be passed back from the next stage when the data has been accepted.
In all cases there must be a buffer register between stages to hold the data; sometimes this buffer is expanded to a memory which can hold several data items. Each stage must take care not to accept input data until it is valid, and not to produce output data until there is room in its output buffer.
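The valid/accepted handshake and the single-slot buffer can be sketched directly. The class below is a minimal model (the names `put` and `take` are ours): the producer sets a valid flag with the data, and the consumer clears it to signal acceptance.

```python
# Sketch of the stage-to-stage handshake described above: a one-slot
# buffer register between stages, with a 'valid' flag set by the
# producing stage and cleared by the consuming stage on acceptance.
class StageBuffer:
    def __init__(self):
        self.data = None
        self.valid = False           # set by producer, cleared by consumer

    def put(self, item):
        if self.valid:
            return False             # downstream has not accepted; stall
        self.data, self.valid = item, True
        return True

    def take(self):
        if not self.valid:
            return None              # no valid data yet; consumer must wait
        item, self.valid = self.data, False
        return item
```

A producing stage that sees `put` return False must hold its result and retry, which is exactly the stall that propagates backward when a slow stage blocks the pipe.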
To guard against data hazards it is necessary for each stage to be aware of the operands in use by stages further down the pipeline. The type of use must also be known, since two successive reads do not conflict and should not be cause to slow the pipeline. Only when writing is involved is there a possible conflict.
The pipeline is typically equipped with a small associative check memory which can store the address and operation type (read or write) for each instruction currently in the pipe. The concept of "address" must be extended to identify registers as well. Each instruction can affect only a small number of operands, but indirect effects of addressing must not be neglected.
As each instruction prepares to enter the pipe, its operand addresses are compared with those already stored. If there is a conflict, the instruction (and usually those behind it) must wait. When there is no conflict, the instruction enters the pipe and its operand addresses are stored in the check memory. When the instruction completes, these addresses are removed. The memory must be associative to handle the high-speed lookups required.
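The conflict rule, two reads never conflict, any write does, can be captured in a few lines. The sketch below models the check memory as a simple list (real hardware uses associative lookup; the class and method names are ours).

```python
# Sketch of the check-memory protocol described above.  Each in-flight
# instruction registers its operand addresses and access mode; a new
# instruction conflicts only when a write is involved on either side.
class CheckMemory:
    def __init__(self):
        self.in_flight = []              # (addr, mode) pairs, mode 'R' or 'W'

    def conflicts(self, addr, mode):
        # two reads of the same address are harmless; any write conflicts
        return any(a == addr and (m == 'W' or mode == 'W')
                   for a, m in self.in_flight)

    def enter(self, operands):
        """operands: list of (addr, mode).  Returns False (stall) on conflict."""
        if any(self.conflicts(a, m) for a, m in operands):
            return False
        self.in_flight.extend(operands)
        return True

    def complete(self, operands):
        # on instruction completion, its entries are removed
        for pair in operands:
            self.in_flight.remove(pair)
```

Note that a register name serves as an "address" here just as a memory location does, matching the extended notion of address above.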
The problem in branching is that the pipeline may be slowed down by a branch instruction because we do not know which branch to follow. In the absence of any special help in this area, it would be necessary to delay processing of further instructions until the branch destination is resolved. Since branches are extremely frequent, this delay would be unacceptable.
One solution which is widely used, especially in RISC architectures, is deferred branching. In this method, the instruction set is designed so that after a conditional branch instruction, the next instruction in sequence is always executed, and then the branch is taken. Thus every branch must be followed by one instruction which logically precedes it and is to be executed in all cases. This gives the pipeline some breathing room. If necessary this instruction can be a no-op, but frequent use of no-ops would destroy the speed benefit.
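Deferred-branch semantics can be demonstrated with a toy interpreter. The mini-ISA below is entirely hypothetical (two instruction forms, with every branch assumed to be followed by an `add` in its delay slot); it shows the defining property that the delay-slot instruction executes exactly once on both paths.

```python
# Toy interpreter for a hypothetical mini-ISA, illustrating deferred
# branching: the instruction after a branch (the delay slot) executes
# exactly once whether or not the branch is taken.
def run_with_delay_slot(prog, regs):
    """Instructions: ('add', dst, src, imm) or ('beqz', reg, target_index).
    Every branch is assumed to have an 'add' in its delay slot."""
    pc = 0
    while pc < len(prog):
        op = prog[pc]
        if op[0] == 'add':
            regs[op[1]] = regs[op[2]] + op[3]
            pc += 1
        else:                                       # 'beqz'
            taken = regs[op[1]] == 0                # condition tested at branch
            slot = prog[pc + 1]                     # delay-slot instruction
            regs[slot[1]] = regs[slot[2]] + slot[3] # executes on both paths
            pc = op[2] if taken else pc + 2
    return regs
```

The compiler's job is to find an instruction that logically precedes the branch and is safe on both paths, and move it into the slot; when none exists, the slot holds a no-op.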
Use of this technique requires a coding method which is confusing for programmers but not too difficult for compiler code generators.
Most other techniques involve some type of speculative execution, in which instructions are processed which are not known with certainty to be correct. It must be possible to discard or "back out" from the results of this execution if necessary.
The usual solution is to follow the "obvious" branch, that is, the next sequential instruction, taking care to perform no irreversible action. Operands may be fetched and processed, but no results may be stored until the branch is decoded. If the choice was wrong, it can be abandoned and the alternate branch can be processed.
This method works reasonably well if the obvious branch is usually right. When coding for such pipelined CPUs, care should be taken to code branches (especially error transfers) so that the "straight through" path is the one usually taken. Of course, unnecessary branching should be avoided.
Another possibility is to restructure programs so that fewer branches are present, such as by "unrolling" certain types of loops. This can be done by optimizing compilers or, in some cases, by the hardware itself.
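The effect of unrolling on branch frequency is easy to see side by side. The sketch below (an assumed unrolling factor of 4, and a trip count assumed to be a multiple of 4) trades one branch test per element for one per four elements.

```python
# Loop unrolling as mentioned above: the rolled loop takes one branch
# test per element; the version unrolled by 4 takes one per four
# elements, assuming len(xs) is a multiple of 4.
def total_rolled(xs):
    s = 0
    for x in xs:                        # one loop-closing branch per element
        s += x
    return s

def total_unrolled4(xs):
    s = 0
    for i in range(0, len(xs), 4):      # one loop-closing branch per 4 elements
        s += xs[i] + xs[i + 1] + xs[i + 2] + xs[i + 3]
    return s
```

A production compiler must also generate a cleanup loop for trip counts that are not a multiple of the unrolling factor; that detail is omitted here.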
A widely-used strategy in many current architectures is some type of branch prediction. This may be based on information provided by the compiler or on statistics collected by the hardware. The goal in any case is to make the best guess as to whether or not a particular branch will be taken, and to use this guess to continue the pipeline.
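One common hardware scheme for statistics-based prediction (an assumption here; the text does not name a specific mechanism) is a 2-bit saturating counter per branch. The prediction flips only after two consecutive wrong guesses, so a loop-closing branch is not mispredicted twice per loop execution.

```python
# Sketch of a 2-bit saturating-counter branch predictor, one widely used
# form of hardware branch prediction.  States 0-1 predict not-taken,
# states 2-3 predict taken; a single misprediction cannot flip a
# strongly-established prediction.
class TwoBitPredictor:
    def __init__(self):
        self.state = 0                  # start at "strongly not-taken"

    def predict(self):
        return self.state >= 2          # True means "predict taken"

    def update(self, taken):
        # saturate at the ends of the 0..3 range
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)
```

In hardware, a small table of such counters is indexed by branch address; the sketch models a single entry.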
A more costly solution occasionally used is to split the pipeline and begin processing both branches. This idea is receiving new attention in some of the newest processors.
The fastest but most costly solution to the interrupt problem would be to include as part of the saved "hardware state" of the CPU the complete contents of the pipeline, so that all instructions may be restored to their original state in the pipeline. This strategy is too expensive in other ways and is not practical.
The simplest solution is to wait until all instructions in the pipeline complete, that is, to let the pipeline drain, before admitting the interrupt sequence. If interrupts are frequent, this would greatly slow down the pipeline; moreover, critical interrupts would be delayed.
A compromise solution identifies a "point of no return," the point in the pipe at which instructions may first perform an irreversible action such as storing operands. Instructions which have passed this point are allowed to complete, while instructions that have not reached this point are canceled.
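The compromise can be stated as a simple partition of the in-flight instructions. The sketch below (stage numbering and the commit-stage index are hypothetical) splits the pipeline contents at the point of no return when an interrupt arrives.

```python
# Sketch of the "point of no return" compromise above: on an interrupt,
# instructions at or past the stage where irreversible actions (stores)
# begin are drained to completion; earlier ones are canceled and will
# be refetched after the interrupt is serviced.
COMMIT_STAGE = 3     # hypothetical stage index where stores may first occur

def on_interrupt(pipeline):
    """pipeline: list of (instruction, stage) pairs, deepest stage first.
    Returns (allowed_to_complete, canceled)."""
    completed = [i for i, s in pipeline if s >= COMMIT_STAGE]
    canceled  = [i for i, s in pipeline if s <  COMMIT_STAGE]
    return completed, canceled
```

The canceled instructions have performed no irreversible action, so restarting them from scratch after the interrupt is always safe.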
More sophisticated instruction pipelines can sometimes be nonlinear or nonsequential. One example is branch processing in which the pipeline has two forks to process two possible paths at once. Sequential processing can be relaxed by a pipeline which allows a later instruction to enter when a previous one is stalled by a data conflict. This, of course, introduces much more difficult timing and consistency problems.
Pipelines for arithmetic processing often are extended to two-dimensional structures in which input data comes from several other stages and output may be passed to more than one destination. Feedback to previous stages can also occur. For such pipelines special algorithms are devised, called systolic algorithms, to effectively use the available stages in a synchronized fashion.