The Central Processing Unit (CPU)

The central processing unit (CPU) is the element in a computer that contains the electronic components and circuitry required to execute program instructions and perform arithmetic and logic operations. Before the advent of the microprocessor, electronic CPUs typically consisted of a number of discrete components, and later a smaller number of small-scale integrated circuits. By integrating the elements that carry out the main functions of the CPU onto one mass-produced large-scale integrated circuit containing thousands or millions of transistors, both the size and cost of the CPU could be greatly reduced. The main purpose of the CPU is to execute program instructions. These instructions are stored as binary machine code on some kind of secondary storage device, such as a hard disk drive. Before the program can be executed, it must be loaded into the computer's working memory (RAM). Virtually all CPUs carry out their operations in a series of discrete steps (the instruction cycle). Typically, the CPU fetches the instruction referenced by the program counter, decodes it, fetches any operands the instruction needs, executes it, and writes back the result.

Once these steps have been carried out, the instruction cycle is repeated using the next program instruction referenced by the program counter.
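
The repeating cycle can be illustrated with a minimal sketch of a fetch-decode-execute loop. The machine below is hypothetical: the opcodes (LOAD, ADD, STORE, HALT), the single accumulator register, the memory layout and the function name run are all invented for illustration and do not correspond to any real instruction set.

```python
# Toy accumulator machine. Opcodes and memory layout are hypothetical.
LOAD, ADD, STORE, HALT = range(4)


def run(memory):
    """Repeat the fetch-decode-execute cycle until a HALT instruction is reached."""
    pc = 0    # program counter: address of the next instruction
    acc = 0   # accumulator register
    while True:
        opcode, operand = memory[pc]   # fetch the instruction the program counter refers to
        pc += 1                        # point the program counter at the next instruction
        if opcode == LOAD:             # decode the opcode, then execute it
            acc = memory[operand]      # fetch the operand from memory
        elif opcode == ADD:
            acc += memory[operand]
        elif opcode == STORE:
            memory[operand] = acc      # write the result back to memory
        elif opcode == HALT:
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode}")


program = [
    (LOAD, 4),    # acc = memory[4]
    (ADD, 5),     # acc = acc + memory[5]
    (STORE, 6),   # memory[6] = acc
    (HALT, 0),
    2, 3, 0,      # data stored at addresses 4, 5 and 6
]
result = run(program)
print(result[6])   # sum of the two data values
```

Instructions and data share one memory here, which mirrors the way a loaded program and its data both reside in RAM.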


Most CPUs are synchronous in nature, in the sense that a clock signal is used to synchronise the various operations that the CPU carries out. The clock signal usually takes the form of a periodic square wave. The design of the CPU will determine how many clock cycles are required to complete a single operation, or (in the case of today's much faster and more sophisticated CPUs) how many operations can be completed within a single clock cycle. Because each instruction executed by the CPU is carried out in a given number of clock cycles, the overall speed of operation of the CPU (i.e. the speed with which it can execute programs) is directly related to the number of clock cycles per second. In other words, all other factors being equal, the faster the clock speed, the faster the CPU will operate. Unfortunately, because every switching operation within a CPU uses a discrete amount of energy, the faster the CPU operates, the more energy it uses and the more heat it dissipates, necessitating the development of more efficient CPU cooling systems.


The Intel Core 2 Extreme QX6700 quad-core CPU




CPU History

Between 1961 and 1971, the number of transistors that could reside on a single microchip doubled each year, and integrated circuits became ever more complex. Although CPUs built from a number of discrete transistors and simple integrated circuits already existed, it was inevitable that sooner or later the functionality of the CPU would be integrated into a single microchip, or microprocessor. Intel's 4-bit 4004 chip, first manufactured in 1971, contained approximately 2,300 transistors and is considered to have been the first general-purpose programmable microprocessor. Intel followed this with an 8-bit microprocessor, the 8008, which appeared in 1972. This first, relatively unsophisticated 8-bit microprocessor was not a great commercial success, and 1974 saw the introduction of Intel's 8080 microprocessor, which had a separate 8-bit data bus and a 16-bit address bus that could address up to 64 kilobytes of memory (this was considered a huge amount of memory in 1975).

Late in 1974, Motorola produced its own 8-bit microprocessor, the 6800, which provided similar performance but with a significantly different internal architecture. Other, smaller companies also entered the 8-bit microprocessor arena at this time. In 1975, MOS Technology introduced the 8-bit 6502 microprocessor, which was based on the Motorola 6800, but which was cheaper and faster than most other microprocessors on the market. The 6502 is still used in embedded systems.

Zilog introduced the Z80 in 1976. The Z80 essentially represented a superset of the Intel 8080 in terms of its architecture, which meant that it offered advantages over the 8080, such as extra registers and instructions, while at the same time being able to execute all of the 8080's machine code instructions. As a result, Z80 microprocessors were used in many of the first generation of personal computers. The success of the Z80 also served to demonstrate the commercial advantages of maintaining compatibility with an existing architecture, as opposed to creating an entirely new architecture, since it meant that existing software would not need to be re-written. The disadvantage of this approach is that it inhibits the introduction of the radical improvements in architecture that could otherwise be made.

A number of what were generically termed personal computers emerged during the latter half of the 1970s and into the early 1980s which were based on one or the other of these microprocessors. The early systems were mainly of interest to the electronics enthusiast, since they came in kit form and required assembly. One of the most influential of these was the Altair 8800 marketed by MITS. Other microcomputers that emerged during the 8-bit era of microcomputing included the Apple I (1976) and Apple II (1977), the Commodore PET (1977), the Commodore VIC-20 (1980), and the Sinclair ZX80 (1980). The Apple and Commodore micros were all based on the 6502 microprocessor, while the Sinclair machine used Zilog's Z80. Within a few years, however, the microcomputer market would be dominated by the IBM PC and its clones, and the Apple Macintosh. The two 16-bit microprocessors that were successful were Intel's 8086 and Motorola's 68000.

Intel based their 16-bit 8086 microprocessor on the core of the 8080. This proved to be a good commercial decision, in the light of their subsequent share of the microprocessor market, but resulted in an architecture that left a lot to be desired. The 8086 retained 16-bit addresses within a segment, which meant that a single offset could not reach more than 64 kilobytes (2^16 = 65,536 bytes) of memory. The address bus was expanded to 20 bits, and each physical address was formed by shifting a 16-bit segment value left by 4 bits and adding the 16-bit offset. This meant that, at any one time, the 8086 could access a 64-kilobyte segment located within a total address space of 1 Megabyte (2^20 bytes, equivalent to 16 non-overlapping 64-kilobyte segments). When the 8086 architecture was chosen for IBM's personal computer (PC), 384 kilobytes were reserved for the operating system and video memory, which left only 640 kilobytes for user applications.
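
The address arithmetic can be checked with a short sketch. On the 8086, a physical address is formed by shifting a 16-bit segment value left by four bits and adding a 16-bit offset, with the result wrapping at 20 bits. The function name physical_address is an assumption made for this illustration, not part of any Intel documentation or library.

```python
def physical_address(segment, offset):
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit 8086 address.

    The function name is hypothetical; the arithmetic is segment * 16 + offset,
    truncated to the 20 address lines of the 8086.
    """
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must each fit in 16 bits")
    return ((segment << 4) + offset) & 0xFFFFF


print(2 ** 16)                                  # bytes reachable with a 16-bit offset
print(2 ** 20)                                  # bytes reachable with a 20-bit address
print(hex(physical_address(0x1234, 0x0010)))    # segment 0x1234, offset 0x0010
print((2 ** 20 - 384 * 1024) // 1024)           # kilobytes left after reserving 384 KB
```

Because segments start every 16 bytes, many different segment and offset pairs refer to the same physical byte.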

Motorola did not attempt to extend their 8-bit 6800 processor, or to achieve backward compatibility. Their 68000 16-bit microprocessor, which went into production in 1979, had 8 general-purpose address registers and 8 general-purpose data registers, and was one of the first microprocessors to use microcoding to define its instruction set. Ironically, despite being marketed as a 16-bit microprocessor, the 68000 had a 32-bit architecture. The address and data registers were 32 bits wide, and 32-bit operations were supported. Addresses were also 32 bits wide, and segmentation was not necessary. The 68000 only had 24 address pins, so only 2^24 bytes (16 Megabytes) of external memory could be directly accessed. The 68000 was used in the Apple Macintosh, and in the Atari and Amiga computers. Although all three were highly regarded from a technical viewpoint, the Macintosh was very expensive in comparison to the IBM PC and subsequently failed to capture a significant market share, while the Atari and Amiga computers were generally considered to be gaming computers. Motorola later produced both a memory management unit (MMU) and a floating point processor (FPP) for the 68000 series of microprocessors, implemented as co-processor chips (an idea first introduced by Intel) that were tightly coupled to the CPU.

The development of microprocessor architectures since 1981 has been influenced as much by commercial considerations as by advances in technology. In making the IBM PC technology non-proprietary, IBM opened the way for other manufacturers to build a copy (or clone) of the IBM PC. Hundreds of manufacturers started producing PC components and peripherals, giving rise to an entire new industry. The intense competition in the PC market both fuelled innovation and forced prices down. The consequent popularity of the IBM PC and its clones had the additional effect of creating a huge market for PC software.

The first IBM PCs operated at a clock speed of 4.77 MHz. They were not built around the 8086, but a modification of the 8086 called the 8088, which had the same architecture as the 8086 but which communicated with memory via an 8-bit bus, reducing hardware complexity. Intel subsequently paired the 8087 maths co-processor with both 8086 and 8088-based processors to improve the execution speed of applications requiring large numbers of floating point operations.

In 1982, Intel introduced the 80286, which initially operated at a clock speed of 6 MHz (this later rose to 20 MHz), and had a 16-bit data bus. More efficiently organised than the 8086, the 80286 increased throughput roughly fivefold. IBM adopted the 80286 for its PC AT architecture in 1984, ensuring the continuing success of the x86 family of microprocessors. Other features of the 80286 included 24 address lines that enabled it to access 16 Megabytes of memory, a simple on-chip MMU that supported multitasking, and access to up to 1 Gigabyte of virtual memory. The 80286 could operate either in real mode, in which it could run software written for the 8086 but could address only 1 Megabyte of memory, or in protected mode, in which the entire 16 Megabyte memory space could be addressed.

1985 saw the introduction of the 80386, which had 32-bit address and data buses, improved memory management, and an additional operating mode (virtual 8086), which made it easier to run 8086 programs. In 1989, Intel introduced the 80486. This processor introduced relatively few architectural changes, but was the first Intel processor to include an on-board maths co-processor. The Intel Pentium followed in 1993. Intel chose the name Pentium (in preference to 80586) because a number cannot be trademarked. An initial clock speed of 60 MHz rose to 166 MHz in later versions. The Pentium was architecturally similar to the 32-bit 80486, but had a 64-bit data bus and a slightly expanded instruction set. Performance was enhanced using parallel processing and a 16 kilobyte cache memory (split into code and data sections).

Intel microprocessors have been cloned by other semiconductor manufacturers. Many of the clones have provided a similar level of performance to the original at a significantly lower price. The NexGen Nx586, for example, offered comparable performance to a Pentium running at 90 MHz. By 1999 some of Intel's competitors were attempting to improve on Intel's processors rather than just producing cheaper, functionally equivalent copies. The PowerPC is worth mentioning here, and was the result of a collaboration between IBM, Motorola, and Apple. IBM provided the architecture, Motorola manufactured the chip, and Apple used it in their personal computers. IBM had been working on their own RISC technology since 1975 with some success, and the resulting POWER architecture, designed for use in their RS/6000 series workstations, incorporated both RISC and CISC features. IBM, Motorola and Apple engineers developed the POWER technology to produce the PowerPC family of microprocessors. The PowerPC microprocessor was used in most Apple Macintosh computers up until 2006, and is today widely used in automotive applications.


The IBM PowerPC



Today's microprocessors are in the order of 10,000 times faster than the first generation of microprocessors, while microprocessor-based computer systems are now up to fifty times cheaper than their early predecessors in real terms. This amounts to a colossal improvement in terms of the cost-to-performance ratio over the past thirty or so years.


Microprocessor architecture

Microprocessor architecture describes the organisation and functionality of the components and circuitry that make up the microprocessor hardware. The architectural design features of the processor will have a bearing on its cost, performance and power consumption. The physical architecture is often represented as a block diagram that describes the functional areas of the processor and the interconnections between them, including the number and type of execution unit, cache, and bus interface.

The execution units include the Arithmetic and Logic Units (ALUs) and Floating Point Units (FPUs) that carry out the mathematical and logical operations of the processor. The inclusion of on-board cache memory has been an important factor in improving the speed with which the processor can carry out operations, since one of the limiting factors in terms of performance is the length of time taken to fetch program instructions and data from main memory. The cache acts as an intermediate store for frequently used instructions and data, and can be accessed much more quickly than RAM. As processor technology has advanced, the size of on-board cache memory has gone from a few kilobytes to several hundred kilobytes, and is continuing to increase in both size and complexity.

Branch prediction units are now a feature of many microprocessors. The purpose of branch prediction is to attempt to predict the outcome of a conditional operation to determine which branch of the program will be followed. This enables the program instructions to be fetched without waiting for the result of the conditional operation, speeding up program execution. In the case of complex operations involving a number of steps, some of these program instructions may also be executed in order to speed things up even more. This is known as speculative execution. A superscalar architecture allows a degree of parallelism by duplicating execution units such as the ALU and FPU. Where delays in processing occur due to the need to wait for the result of a particular operation, subsequent instructions that are ready to be executed may be executed out of order, and the results stored until they are needed. This is known as out-of-order execution. Recent advances in semiconductor technology have reduced the size of transistors still further, allowing multiple processors to be implemented on the same chip. These multi-core CPUs are now appearing in desktop personal computers.
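
The text above does not say how a branch prediction unit makes its prediction. As one simple and widely described scheme, offered here as an assumption and not as the method used by any particular processor, a two-bit saturating counter can be kept for each branch. Counter values 0 and 1 predict "not taken", and values 2 and 3 predict "taken". The function name simulate and the example outcome sequence are hypothetical.

```python
def simulate(outcomes, state=2):
    """Count correct predictions made by a two-bit saturating counter.

    outcomes: sequence of booleans, True meaning the branch was taken.
    state: initial counter value (0-1 predict not taken, 2-3 predict taken).
    """
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2          # predict before the outcome is known
        if predicted_taken == taken:
            correct += 1
        # Move the counter towards the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


# A loop branch that is taken nine times and then falls through once.
loop_branch = [True] * 9 + [False]
print(simulate(loop_branch), "of", len(loop_branch), "predictions correct")
```

The counter needs two consecutive mispredictions to change its prediction, so a single loop exit does not disturb the prediction for the next run of the loop.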


Block diagram of Intel's Pentium 4 architecture




CISC v RISC

CISC stands for Complex Instruction Set Computer, and RISC for Reduced Instruction Set Computer. In a CISC microprocessor, a single machine code instruction can execute a number of low-level operations. The resulting high code density means that programs take up less space in memory. This is an important consideration when only a relatively small amount of memory is available. In addition, the ability to carry out multiple operations with a single instruction means that fewer instructions need to be retrieved from memory for a given program, and programmers are more productive because they need to write less code.

As more and more functionality was added to CISC microprocessor instruction sets, the number of available instructions increased. It became apparent, however, that the majority of operations carried out by a program could be achieved using a relatively small subset of the overall instruction set, and that some of the more complex instructions were rarely, if ever, used. Furthermore, many complex instructions could be carried out as efficiently (if not more so) using a number of much more basic instructions. A microprocessor that implements a relatively small number of instructions tends to run faster because the individual instructions take less time to decode, and the microcode that implements them is less complex.

As well as IBM's work on RISC technology, which was not widely known about at the time, both UC Berkeley and Stanford University were working on RISC projects sponsored by DARPA (Defense Advanced Research Projects Agency) as part of their VLSI (Very Large Scale Integration) program. Berkeley's RISC project concentrated on the use of pipelining and register windows.

Each program instruction must be carried out in a number of discrete steps. A typical sequence might be to fetch the instruction, decode it, fetch the operand the instruction needs, execute the instruction, and finally write the result to memory. Each operation would typically require one clock cycle to complete. The principle behind pipelining is that, once the first instruction has been fetched, the next instruction can be fetched during the same clock cycle used to decode the first instruction. A number of instructions could thus be "in the pipeline" at any given time. In this manner, the average number of clock cycles required per completed instruction could be reduced, improving performance. This idea can be carried even further by including additional execution units within the same processor, so that multiple instructions can be fetched and executed simultaneously. This type of processor architecture is known as superscalar.
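
The benefit of pipelining can be estimated with a small calculation. The sketch below assumes an idealised pipeline in which every stage takes exactly one clock cycle and there are no stalls or branch penalties. Real pipelines fall short of this. The function names are hypothetical.

```python
def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages):
    """The first instruction fills the pipeline; after that, one completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


n, stages = 100, 5   # five stages: fetch, decode, operand fetch, execute, write
before = cycles_unpipelined(n, stages)
after = cycles_pipelined(n, stages)
print(before, after)
print(f"average cycles per instruction: {before / n:.2f} -> {after / n:.2f}")
```

Under these assumptions each individual instruction still takes five cycles from start to finish. It is the average number of cycles per completed instruction that approaches one as the instruction count grows.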

Non-RISC processors have only a relatively small number of registers. Since registers are essentially the fastest form of memory, it makes sense to store the variables required by a procedure in one or more registers. Unfortunately, the small number of available registers means that all active processes are competing for the same registers. Each time an active process has to relinquish control of the processor, the contents of any registers used by that process must be saved to memory, and restored when the process regains control of the processor.

The idea behind Berkeley's RISC technology was to provide a much larger number of registers (e.g. 64), but limit each process to only 8 registers. Each procedure would therefore be allocated its own window (a set of 8 registers), and up to eight procedures could be allocated a window at any one time. This scheme greatly reduced the frequency with which data had to be moved back and forth between the CPU registers and main memory (or cache memory), enabling RISC processors to significantly outperform existing non-RISC processors.

Berkeley's RISC-II microprocessor had only 40,760 transistors, fewer than half the number found in equivalent non-RISC processors, and only 39 instructions in its instruction set. The reduction in the number of transistors had two advantages. It reduced the amount of power consumed by switching operations and consequently made the problem of heat dissipation easier to manage, and it meant that there was space available to include a greater number of registers. The reduced number of instructions, together with a simplified and uniform instruction format, also meant that decoding operations could be carried out more efficiently. The RISC-II design influenced many later RISC processors, and was used by Sun Microsystems as the basis for the SPARC (Scalable Processor ARChitecture) processors used in their workstations, which dominated the workstation market in the 1990s.

The performance of any microprocessor, in terms of its speed of operation, depends on the relationship between the number of instructions that must be executed to achieve a particular task, and the number of instructions that can be executed in each clock cycle. The complex instructions characteristic of CISC processors give a reduced number of instructions overall, at the cost of more cycles per instruction. RISC processors employ simplified instructions, reducing the number of clock cycles required per instruction at the cost of an increased overall number of instructions.
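
This trade-off is commonly expressed as execution time = instruction count x cycles per instruction / clock rate. The sketch below applies that formula to two hypothetical processors. The instruction counts, cycles-per-instruction values and clock rate are invented purely to illustrate the arithmetic and are not measurements of any real CISC or RISC design.

```python
def execution_time(instruction_count, cycles_per_instruction, clock_hz):
    """Return the run time in seconds for a program on a given processor."""
    return instruction_count * cycles_per_instruction / clock_hz


CLOCK_HZ = 100e6   # both hypothetical processors run at 100 MHz

# Fewer, more complex instructions, each taking more cycles on average.
cisc_time = execution_time(1_000_000, 4.0, CLOCK_HZ)

# More, simpler instructions, each taking fewer cycles on average.
risc_time = execution_time(1_500_000, 1.5, CLOCK_HZ)

print(f"CISC-style: {cisc_time * 1000:.1f} ms")
print(f"RISC-style: {risc_time * 1000:.1f} ms")
```

With these made-up figures the RISC-style processor finishes first even though it executes 50% more instructions. Different figures could just as easily reverse the result, which is why neither approach is faster in principle.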

One of the disadvantages of RISC was that it required a much larger number of CPU registers, the most expensive form of memory to implement. Another drawback was that the removal of complex instructions meant that more work had to be done by the compiler, and the resulting low code density meant that programs written for RISC processors required more RAM. On the plus side, the price of RAM has decreased dramatically, while compiler technology has grown in sophistication.

Despite the obvious advantages of RISC, however, the PC market is still dominated by x86 processors. The main reason for this is that Intel and other large manufacturers of x86 processors have been able to invest huge amounts of money in improving the design of x86 processors. Furthermore, the distinction between RISC and non-RISC processors has become far less well-defined, as RISC processors have increased in complexity, and many of the features typically found in RISC processors have found their way into modern x86 processors such as the Intel® Core™2 and the AMD K8. These x86 designs are as fast as (if not faster than) the fastest true RISC single-chip solutions on the market.