Representing Numbers in Computers
The smallest unit of information that can be stored by a computer is a binary digit (usually shortened to "bit"). Its value is usually held in memory as an electrical charge stored in a capacitor. Modern memory chips contain millions of these tiny capacitors, each of which is capable of storing exactly one bit of information. A bit can have one of two values at any given time: one or zero. This rather limits the usefulness of the bit in terms of the data it can store. Generally speaking, single bits are used only for storing Boolean values (true or false). In most programming languages, true equates to 1, while false equates to 0.
The smallest unit of data that can be addressed in computer memory is the byte. The definition of the byte has varied over the years, but it is now generally considered to be a group of eight bits. A byte can be used to represent alphanumeric and non-printing characters, unsigned integer (whole number) values from 0 to 255, or signed integer values from -128 to +127. Some texts refer to a group of eight bits as an octet to avoid any possible ambiguity. Because the hexadecimal number system consists of sixteen digits, each of which can be specified using just four bits, it is sometimes useful to consider groupings of four bits as a unit. Such a grouping is often referred to as a nibble (which undoubtedly proves that programmers have a sense of humour after all!).
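The relationship between a byte, its two nibbles and its hexadecimal digits can be sketched in a few lines of Python. The helper name split_nibbles is made up for this illustration; only built-in integer operations are used.

```python
# Minimal sketch: split one byte into its two four-bit nibbles.
# split_nibbles is a hypothetical helper name, not a standard function.

def split_nibbles(byte_value: int) -> tuple[int, int]:
    """Return the (high, low) four-bit halves of an eight-bit value."""
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("expected a value that fits in one byte (0 to 255)")
    high = (byte_value >> 4) & 0x0F  # top four bits
    low = byte_value & 0x0F          # bottom four bits
    return high, low


value = 0xA7                                 # 167 in decimal
high, low = split_nibbles(value)
print(format(value, "08b"))                  # the eight bits of the byte
print(format(high, "X"), format(low, "X"))   # one hexadecimal digit per nibble
```

Each nibble maps onto exactly one hexadecimal digit, which is why a byte is conventionally written as two hex digits.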
The number of bits that can be processed by a CPU in a single machine operation depends on the number of bits it can store in its internal registers. In the early days of computing this was a relatively small number (four or eight bits). At some point, therefore, the size of the processor's registers coincided with the size of a byte. For many years now, however, this has not been the case. As CPU architecture has evolved, we have seen the size of registers double and re-double. Most processors now have either 32-bit or 64-bit registers that can hold four or eight bytes of data respectively. The unit of data that can be processed in a single operation by a machine-code instruction is called a word. A word can be viewed as a 32-bit or 64-bit binary number. The range of values that can be represented by a word is therefore dependent on microprocessor architecture, and will determine the size of the memory space that can be addressed. As of 2010, practically all personal computers are capable of processing 64 bits of data at a time, although they are frequently used in 32-bit mode in order to provide support for existing software. Keep in mind, however, that many embedded systems still use microcontroller chips that have eight-bit or sixteen-bit registers.
Integers
Integers are whole numbers. The range of values that can be stored as an integer depends on whether or not the number is signed (i.e. can be positive or negative), and how much memory is allocated for it. Programming languages can generally represent integers that are signed or unsigned, and of different sizes. A single byte, for example, can represent unsigned numbers ranging in value from 0 to 255 or signed numbers ranging from -128 to +127. If two bytes are used, unsigned numbers from 0 to 65,535 or signed numbers from -32,768 to +32,767 can be stored. Much larger numbers can be represented if more bytes are made available. For signed numbers, one bit is used to store the sign (+ or -) of the number, so the absolute value of the biggest number that can be stored is only about half that for unsigned numbers. The number of bits used to represent an integer value will equal the number of bytes multiplied by eight. An integer represented by n bits can take 2^n different values. A four-byte integer therefore has 2^(4 x 8) = 2^32 = 4,294,967,296 possible values, which means it can hold an unsigned value of up to 4,294,967,295 (a tad over four billion). Negative numbers can be represented in several different ways in binary number systems, although the most commonly used method is two's complement (the subject of two's complement is dealt with elsewhere).
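The ranges quoted above follow directly from the number of bits. The sketch below computes them for a few common sizes; the function names are invented for the example, and the signed range assumes two's complement representation.

```python
# Minimal sketch: value ranges for n-bit integers.
# unsigned_range and signed_range are hypothetical helper names.

def unsigned_range(bits: int) -> tuple[int, int]:
    """Smallest and largest unsigned values that fit in the given number of bits."""
    return 0, 2**bits - 1


def signed_range(bits: int) -> tuple[int, int]:
    """Smallest and largest signed values, assuming two's complement."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for n_bytes in (1, 2, 4, 8):
    bits = n_bytes * 8  # number of bits = number of bytes multiplied by eight
    print(n_bytes, "byte(s):", unsigned_range(bits), signed_range(bits))
```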
Fixed-point numbers
A fixed-point number is used to represent a real number (one that has a fractional part) using a fixed number of digits after the radix point. The radix point is called the decimal point for real numbers to base 10. In binary number systems it would be called the binary point. Fixed-point numbers are sometimes used where the processor employed does not have a floating-point unit (FPU), which is often the case in low-cost microcontrollers. Fixed-point fractional numbers are usually represented by integer values that are scaled by an appropriate factor. For example, the real number 1.234 could be represented by the integer value 1234, with a scaling factor of 1/1000 (or 10^-3), while the number 1,234,000 could also be represented by the integer value 1234 but using a scaling factor of 1000 (10^3). The difference between fixed-point and floating-point representation of real numbers is that the scaling factor remains the same for all values represented by a particular fixed-point data type. The scaling factor used will (usually) be a power of ten for denary (base 10) numbers or a power of two for binary numbers. The maximum and minimum values that can be represented by a fixed-point data type will depend on the maximum and minimum values that can be represented by the underlying integer data type, and the scaling factor.
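As a rough illustration of the idea, the following sketch stores values as plain integers with an assumed scaling factor of 1/1000 (three decimal places). The names SCALE, to_fixed and from_fixed are chosen for the example only; Python's decimal module is used for the conversions so that no binary rounding creeps in.

```python
from decimal import Decimal

# Assumed for illustration: every stored integer counts thousandths.
SCALE = 1000


def to_fixed(text: str) -> int:
    """Convert a decimal string such as '1.234' to its scaled integer form."""
    return int(Decimal(text) * SCALE)  # digits beyond three places are dropped


def from_fixed(raw: int) -> Decimal:
    """Convert a scaled integer back to the real value it represents."""
    return Decimal(raw) / SCALE


a = to_fixed("1.234")   # stored as the integer 1234
b = to_fixed("2.5")     # stored as the integer 2500
total = a + b           # addition is ordinary integer addition
print(a, b, total, from_fixed(total))
```

Because both operands share the same scaling factor, addition and subtraction need no adjustment at all, which is much of the appeal on processors without an FPU.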
Arithmetic operations on fixed-point numbers can produce answers that cannot be accurately represented using the number of places available either before or after the radix point. In such cases, the answer will be rounded or truncated. The options are either to keep the same number format for the answer and accept that there will be some loss of accuracy, or to convert the result to a more appropriate data type to preserve accuracy. In the first approach, the number of digits before and after the radix point remains the same for the result of an operation. If digits are lost from the fractional part of the result, there will be an associated loss of precision that may be acceptable in many cases. If digits are lost from the integer part of the result however, the result will be fundamentally incorrect. When writing programs for control systems that will be implemented on microprocessors, it is essential to understand the limitations of the microprocessor being used in terms of the maximum size of the integer values it can store. This will usually depend on the size of its internal registers.
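Both kinds of loss can be seen in a small sketch. It assumes a fixed-point format with two decimal places held in a 16-bit signed integer; SCALE, fixed_mul and fits_int16 are illustrative names, not part of any library.

```python
# Assumed format: two decimal places, stored in a 16-bit signed integer.
SCALE = 100
INT16_MIN, INT16_MAX = -32768, 32767


def fixed_mul(a: int, b: int) -> int:
    """Multiply two scaled integers and return a result in the same format.

    The raw product carries the scaling factor twice, so it is divided by
    SCALE once. Floor division discards the extra fractional digits (for
    non-negative operands this is simple truncation).
    """
    return (a * b) // SCALE


def fits_int16(raw: int) -> bool:
    """Report whether a scaled integer fits in the assumed 16-bit storage."""
    return INT16_MIN <= raw <= INT16_MAX


small = fixed_mul(125, 333)       # 1.25 x 3.33 = 4.1625, stored as 416 (4.16)
large = fixed_mul(20000, 20000)   # 200.00 x 200.00 = 40000.00, raw 4000000
print(small, fits_int16(small))   # fractional digits lost, result still usable
print(large, fits_int16(large))   # integer part does not fit: result unusable
```

Python integers never overflow, so the range check has to be explicit here; on a real 16-bit register the second result would wrap around or saturate instead.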
Floating point numbers
Floating point numbers are somewhat more complicated to deal with because the radix point does not occupy a fixed position (i.e. it can "float" to the left or right within the representation of a real number depending on the number’s magnitude). In most commonly used encodings, a floating point value is stored as three separate components – the significand, the exponent, and the sign. A 32-bit floating-point number is typically made up as follows:
- Significand - 23 bits
- Exponent - 8 bits
- Sign - 1 bit
The significand represents the significant digits of the number itself, while the exponent essentially represents the position occupied within those digits by the decimal (or binary) point. When a floating-point value is stored in memory, it is first normalised. This means moving the radix point to the left or right until it is immediately to the right of the most significant (left-most) non-zero digit. The number of places that the radix point must be moved in order to achieve this gives the exponent, which is positive if the point moves to the left and negative if it moves to the right. As an example, take the real number 1234.56 (which has a significand of 123456). In order to normalise the number, we need to move the radix point (in this case, a decimal point) three places to the left, resulting in a normalised value of 1.23456 and an exponent of 3. Because we are looking at a denary (base 10) number, we can now write the number as 1.23456 x 10^3. Binary real numbers can be dealt with in exactly the same way, except that the exponent will be applied to a number base of 2. Note however that some fractional values that can be represented exactly in base 10, such as 1/10 (0.1), cannot be exactly represented in base 2.
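The normalisation step for a denary number can be sketched with Python's decimal module. The function name normalise is invented for the example; Decimal.adjusted() and Decimal.scaleb() are standard methods. The last two lines show the exact value that is actually stored when 0.1 is held as a binary floating-point number.

```python
from decimal import Decimal


def normalise(text: str) -> tuple[Decimal, int]:
    """Return (significand, exponent) with one non-zero digit before the point."""
    value = Decimal(text)
    if value == 0:
        return value, 0
    exponent = value.adjusted()            # position of the most significant digit
    significand = value.scaleb(-exponent)  # shift the decimal point accordingly
    return significand, exponent


print(normalise("1234.56"))   # point moves three places to the left
print(normalise("0.00123"))   # point moves three places to the right

stored = Decimal(0.1)                   # exact value of the binary float nearest 0.1
print(stored == Decimal("0.1"))         # False: 0.1 has no exact binary form
print(stored)
```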
From the above, it should be evident that the maximum number of digits that can be used to represent any real number in a given format will be fixed. It follows that for any given number, the precision (accuracy) of its representation will depend on the number of digits required to represent it exactly. If the number of digits required is less than or equal to the number of digits available in a given representation, there will be no loss of precision. If the number of digits required is greater than the number available, there will inevitably be some loss of precision. Of course, for values that do not have an exact representation in a given number base precision will be limited in any case. The fractional value 1/3, for example, cannot be exactly represented in a denary number system, no matter how many digits are used after the decimal point (although as more digits are added, the value stored will more closely approach the actual value). A number that cannot be represented exactly in a number base regardless of the number of digits used is said to be non-terminating.
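The 1/3 example can be checked numerically. The sketch below, which assumes nothing beyond Python's standard decimal and fractions modules, computes 1/3 to a chosen number of significant digits and measures how far the result is from the true value; the helper names are made up for the illustration.

```python
from decimal import Decimal, localcontext
from fractions import Fraction


def third_with_digits(digits: int) -> Decimal:
    """Compute 1/3 in base 10 using the given number of significant digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        return Decimal(1) / Decimal(3)


def approximation_error(digits: int) -> Fraction:
    """Exact difference between 1/3 and its rounded denary representation."""
    return Fraction(1, 3) - Fraction(third_with_digits(digits))


for digits in (3, 6, 12):
    print(digits, third_with_digits(digits), float(approximation_error(digits)))
```

The error shrinks by a factor of ten for each extra digit but never reaches zero, which is what it means for the expansion to be non-terminating.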
The main advantage floating point numbers have over fixed point numbers is that they can represent a much greater range of values. If we use a fixed point format with two digits after the decimal point, for example, the digits 1234567 could represent the values 12,345.67, 1,234.56, 123.45, 12.34 and so on. A floating point number with the same significand could represent values such as 1.234567, 123,456.7, 0.00001234567, 1,234,567,000,000 and so on. The downside is that the floating point format needs some of its bits to store the exponent part of the number, so floating point numbers that occupy the same space as a fixed-point data type achieve a greater range at the expense of some loss of precision. Generally speaking, the larger the range of values we wish to represent, the more bits are needed to store numbers in that range. Because the number of bits available for both the significand and the exponent will be fixed for a given real number data type, programming languages tend to offer floating point data types of different sizes (and hence precision) so that the programmer can select the type most appropriate for the intended purpose of a variable. In this way, memory can be used more economically than if there were a single "one size fits all" real number data type. Floating point values are usually represented using either 32 bits (single precision) or 64 bits (double precision).
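The difference in precision between the two sizes can be demonstrated with Python's standard struct module, which can pack a value into 32-bit ("f") or 64-bit ("d") form. The helper names are chosen for this sketch, and the test value 2^24 + 1 is picked because it needs one more significant bit than single precision provides.

```python
import struct


def round_trip_single(value: float) -> float:
    """Store a value in 32-bit single precision and read it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def round_trip_double(value: float) -> float:
    """Store a value in 64-bit double precision and read it back."""
    return struct.unpack("<d", struct.pack("<d", value))[0]


print(struct.calcsize("<f"), struct.calcsize("<d"))  # storage size in bytes

n = 16_777_217.0              # 2**24 + 1
print(round_trip_single(n))   # the final digit is lost in single precision
print(round_trip_double(n))   # double precision keeps the value intact
```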
As we said earlier, the significand is stored as an integer having a fixed number of digits, with an implied radix point to the immediate right of the most significant digit. In order to derive the real number value being stored, the significand must be multiplied by the base raised to the power of the exponent. This will effectively move the radix point from its implied position by the number of places given by the exponent. If the exponent is positive, the radix point moves to the right. If the exponent is negative, it moves to the left. Using a denary example, the number 12345.67 would be normalised to 1.234567 with an exponent of 4. In order to restore the number to its original value, the normalised value would have to be multiplied by 10^4. Note that the computer representation of binary floating point numbers is standardised in IEEE 754.
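For the 32-bit layout listed earlier (1 sign bit, 8 exponent bits, 23 significand bits), the stored fields can be pulled apart and the value rebuilt as sketched below. Two details come from IEEE 754 rather than from the text above: the exponent is stored with a bias of 127, and normal values have an implicit leading 1 that is not stored. The function names are invented for the example, and the rebuild step deliberately ignores zeros, subnormal numbers, infinities and NaNs.

```python
import struct


def decode_single(value: float) -> tuple[int, int, int]:
    """Return the raw (sign, exponent, fraction) fields of a value stored in 32 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF        # 23 stored significand bits
    return sign, exponent, fraction


def rebuild_normal(sign: int, exponent: int, fraction: int) -> float:
    """Rebuild a normal value: significand times 2 to the unbiased exponent."""
    significand = 1 + fraction / 2**23   # restore the implicit leading 1
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)


fields = decode_single(-6.5)     # -6.5 is 1.625 x 2**2 with a negative sign
print(fields)
print(rebuild_normal(*fields))
```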