Representing Numbers in Computers

The smallest unit of information that can be stored by a computer is a binary digit (usually shortened to "bit"). Its value is usually held in memory as an electrical charge stored in a capacitor. Modern memory chips contain millions of these tiny capacitors, each of which is capable of storing exactly one bit of information. A single bit can have one of two values at any given time - one or zero. As we shall see, in order to represent a number greater than one, we will have to use several bits.


Overview

Generally speaking, single bits are used only for storing Boolean values (true or false). In most programming languages, true equates to one, while false equates to zero. The smallest unit of data that can be addressed in computer memory is the byte. The definition of the byte has varied over the years but it is now generally considered to be a group of eight bits, and can be used to represent alpha-numeric and non-printing characters, unsigned integer (whole number) values from 0 to 255, or signed integer values from -127 to +127.


A byte usually consists of eight bits

A byte usually consists of eight bits


Some texts refer to a group of eight bits as an octet to avoid any possible ambiguity. Because the hexadecimal number system consists of sixteen digits, each of which can be specified using just four bits, it is sometimes useful to consider groupings of four bits as a unit. Such a grouping is often referred to as a nibble (which undoubtedly proves that computer scientists have a sense of humour after all!).

The number of bits that can be processed by a CPU in a single machine operation is dependent upon the number of bits it can store in its internal registers. In the early days of computing this was a relatively small number (four or eight bits). At some point, therefore, the size of the processor's register coincided with the size of a byte. For many years now, however, this has not been the case. As CPU architecture has evolved, we have seen the size of registers double and re-double. Most processors now have either 32-bit or 64-bit registers that can hold four or eight bytes of data respectively.


A modern CPU has a 64-bit architecture

A modern CPU has a 64-bit architecture


The unit of data that can be processed in a single operation by a machine-code instruction is called a word. A word can be viewed as a 32-bit or 64-bit binary number. The range of values that can be represented by a word is therefore dependent on microprocessor architecture, and will determine the size of the memory space that can be addressed.

As of 2010, practically all personal computers are capable of processing 64-bits of data, although they are frequently used in 32-bit mode in order to provide support for existing software. Keep in mind however that many embedded systems still use microcontroller chips that have eight or sixteen-bit registers.


Integers

Integers are whole numbers. The range of values that can be stored as an integer depends on whether or not the number is signed (i.e. positive or negative), and how much memory is allocated for it in memory. Programming languages can generally represent integers that are signed or unsigned, and of different sizes.

A single byte can represent unsigned numbers ranging in value from 0 to 255, or signed integers ranging from -128 to +127. If two bytes are used, unsigned numbers from 0 to 65,535 or signed numbers from -32,768 to 32,767 can be stored. Much larger numbers can be represented if more bytes are made available.

For signed numbers, one bit is used to store the sign (+ or -) of the number, so the absolute value of the biggest number that can be stored is only half that for unsigned numbers. The number of bits used to represent an integer value will equal the number of bytes multiplied by eight. An integer represented by n bits can represent 2n numbers.

The magnitude of a four-byte integer can thus be anything up to 2 (4×8) or 2 32 which means it can hold an unsigned value of up to 4,294,967,296 (a tad over two billion). Negative numbers can be represented in several different ways in binary number systems, although the most commonly used method is two's complement (the subject of two's complement is dealt with below).


Fixed-point numbers

A fixed-point number is used to represent a real number (one that has a fractional part) using a fixed number of digits after the radix point. The radix point is called the decimal point for real numbers to base ten. In binary number systems it would be called the binary point. Fixed-point numbers are sometimes used where the processor employed does not have a floating-point unit (FPU), which is often the case in low-cost microcontrollers.

Fixed point fractional numbers are usually represented by integer values that are scaled by an appropriate factor (the exponent). For example, the real number 1.234 could be represented by the integer value 1234, with a scaling factor of  1/1000  (or 10 -3), while the number 1,234,000 could also be represented by the integer value 1234 but using a scaling factor of 1000 (10 3).

The difference between fixed-point and floating-point representation of real numbers is that the scaling factor remains the same for all values represented by a particular fixed-point data type. The scaling factor used will (usually) be a power of ten for denary (base ten) numbers or a power of two for binary numbers.

The maximum and minimum values that can be represented by a fixed-point data type will depend on the maximum and minimum values that can be represented by the underlying integer data type, and the scaling factor.

Arithmetic operations on fixed-point numbers can produce answers that cannot be accurately represented using the number of places available either before or after the radix point. In such cases, the answer will be rounded or truncated. The options are either to keep the same number format for the answer and accept that there will be some loss of accuracy, or to convert the result to a more appropriate data type to preserve accuracy.

In the first approach, the number of digits before and after the radix point remains the same for the result of an operation. If digits are lost from the fractional part of the result, there will be an associated loss of precision that may be acceptable in many cases. If digits are lost from the integer part of the result however, the result will be fundamentally incorrect.

When writing programs for control systems that will be implemented on microprocessors, it is essential to understand the limitations of the microprocessor being used in terms of the maximum size of the integer values it can store. This will usually depend on the size of its internal registers.


Floating-point numbers

Floating-point numbers are somewhat more complicated to deal with because the radix point does not occupy a fixed position (i.e. it can "float" to the left or right within the representation of a real number depending on the number's magnitude). In most commonly used encodings, a floating-point value is stored as three separate components – the significand, the exponent, and the sign. A 32-bit floating-point number is typically made up as follows:

The significand represents the significant digits of the number itself, while the exponent essentially represents the position occupied within those digits of the decimal (or binary) point. When a floating-point value is stored in memory, it is first normalised. This means moving the decimal point to the left until it is immediately on the right of the most significant (left-most) digit. The number of places that the decimal point must be moved in order to achieve this is the exponent.

As an example, take the real number 1234.56 (which has a significand of 123456). In order to normalise the number, we need to move the radix point (in this case, a decimal point) three places to the left, resulting in a normalised value of 1.23456 and an exponent of 3. Because we are looking at a denary (base ten) number, we can now write the number as 1.23456 × 10 3.

Binary real numbers can be dealt with in exactly the same way, except that the exponent will be applied to a number base of two. Note however that fractional values such as  1/5  (0.2) that can be represented exactly in base ten cannot be exactly represented in base two.

From the above, it should be evident that the maximum number of digits that can be used to represent any real number in a given format will be fixed. It follows that for any given number, the precision of its representation will depend on the number of digits required to represent it exactly. If the number of digits required is less than or equal to the number of digits available in a given representation, there will be no loss of precision. If the number of digits required is greater than the number available, there will inevitably be some loss of precision.

Of course, for values that do not have an exact representation in a given number base precision will be limited in any case. The fractional value  1/3 , for example, cannot be exactly represented in a denary number system, no matter how many digits are used after the decimal point (although as more digits are added, the value stored will more closely approach the actual value). A number that cannot be represented exactly in a number base regardless of the number of digits used is said to be non-terminating.

The main advantage floating-point numbers have over fixed-point numbers is that they can represent a much greater range of values. If we use a fixed-point format with two significant digits after the decimal point, for example, the significand 1234567 could represent the values 12,345.67, 1,234.56, 123.45, 12.34 and so on. A floating-point number with the same significand could represent values such as 1.234567, 123,456.7, 0.00001234567, 1,234,567,000,000 and so on.

The down side is that the floating-point format requires more bits to store the exponent part of the number, so floating-point numbers that occupy the same space as a fixed-point data type achieve a greater range at the expense of some loss of precision. Generally speaking, the larger the range of values we wish to represent, the more bits are needed to store numbers in that range.

Because the number of bits available for both the significand and the exponent will be fixed for a given real number data type, programming languages tend to offer floating-point data types of different sizes (and hence precision) so that the programmer can select the type most appropriate for the intended purpose of a variable. In this way, memory can be used more economically than if there were a single "one size fits all" real number data type. Floating-point values are usually represented using either 32 bits (single precision) or 64 bits (double precision).

As we said earlier, the significand is stored as an integer having a fixed number of digits, with an implied radix point to the immediate right of the most significant digit. In order to derive the real number value being stored, the significand must be multiplied by the base raised to the power of the exponent. This will effectively move the radix point from its implied position by the number of places given by the exponent.

If the exponent is positive, the radix point moves to the right. If the exponent is negative, it moves to the left. Using a denary example, the number 12345.67 would be normalised to 1.234567. In order to restore the number to its original value, the normalised value would have to be multiplied by 10 4. Note that the computer representation of binary floating-point numbers is standardised in IEEE 754-2008 - the IEEE Standard for Floating-Point Arithmetic.


One's Complement

A computer uses a fixed number of bits to store each of the common data types. For example, a byte (8 bits) is commonly used to represent alphanumeric characters, unsigned integer values from 0 to 255, or signed integer values from -127 to +127.

For signed integer data types, the range of values that can be represented is only half that for unsigned integer data types because the most significant bit is used to signify whether the number is positive (0) or negative (1), leaving one less bit available to represent the absolute value of the number.

The range of signed 4-bit integers that can be represented using this system (known as sign and magnitude) is shown in the table below, and illustrates how the system works. Note that there are two possible representations for zero (00002  = +010  and 10002  = -010 ).


4-bit Signed Binary Integers
DecimalBinaryDecimalBinary
+00000-01000
+10001-11001
+20010-21010
+30011-31011
+40100-41100
+50101-51101
+60110-61110
+70111=71111

Signed integers can be represented using a number of alternative systems, one of which is called one's complement. Using one's complement, the most significant bit is again used to indicate the sign (0 = positive, 1 = negative), and positive numbers are represented in the normal way (see above).

To change the sign of a positive number (i.e. to negate it) using one's complement however, all of the bits are inverted (or "flipped"). In other words, all the ones are replaced with zeros, and all the zeros are replaced with ones. The following binary numbers are the 8-bit one's complement representation of 1210 and -1210, respectively:

1210  =  00011002

-1210  =  11100112

For those familiar with Boolean logic, the one's complement of any binary number is the equivalent of carrying out a bitwise NOT operation on it and it should be fairly obvious that the one's complement of a negative one's complement number is its positive counterpart.

The 8-bit binary representation of zero using one's complement can take one of two forms - 000000002 (+010) or 111111112 (−010), and the range of values that can be represented using 8 bits is -12710 to +12710. One of the benefits of using one's complement is that addition and subtraction can both be carried out using binary addition, with end-around carry (this simply means that if there is a carry of one bit into the bit position to the left of the most significant bit as a result of the addition, the bit is added back to the least significant bit).

To subtract one binary number x from another binary number y using one's complement, simply add y to the one's complement of x. The two examples below illustrate the principle.

Example 1:

11810 – 8510  =  011101102 – 010101012  =  3310  =  001000012

   01110110
+  10101010    (one's complement of 01010101)
   --------
= 100100000    (result of standard binary addition)

=  00100001    (overflow bit is added back to LSB)

Example 2:

10610 + -7910 (011010102 – 101100002) = 2710 (000110112)

   01101010
+  10110000    (one's complement of 01001111)
   --------
= 100011010    (result of standard binary addition)

=  00011011    (overflow bit is added back to LSB)


Two's Complement

Because of the limitations of one's complement when carrying out arithmetic operations such as multiplication and division, the most commonly used way of representing signed integers on computers is to use two's complement, because it allows the logic that handles arithmetic functions to be implemented more easily in hardware.

The two's complement of a binary number is the value obtained by subtracting the number from a large power of two (specifically, from 2n for an n-bit number). As with one's complement, the most significant bit is used to indicate the sign (0 = positive, 1 = negative), and positive numbers are represented in the same way. To negate a positive number, its two's complement is used.

Taking the unsigned number 310 as an example, this would be represented as a positive 8-bit two's complement binary number as 000000112. The value of 28 expressed in standard binary format is 100000000. In order to find the two's complement of +3, therefore, we would carry out the following arithmetic operation:

 100000000
- 00000011
  --------
= 11111101

Negative binary numbers can thus be represented using the two's complement of their absolute value. The absolute value of the largest negative number that can be represented with a given number of bits is always one more than that of the largest positive number that can be represented using the same number of bits. Thus, a two's complement 8-bit binary number can represent signed integer values from −128 to +127 (note that the two's complement of -128 is -128).

For similar reasons, zero has only a single representation in a two's complement binary system, since the two's complement of zero is zero. It is worth noting that the two's complement of a binary number can also be obtained by adding one to the one's complement of that number.

Putting this in simple terms, you can get the two's complement of a number simply by inverting all the bits (i.e. changing ones to zeros, and zeros to ones) to get the one's complement, and adding one to the result (any resulting carry past the most significant bit is ignored). The range of signed 4-bit integers that can be represented using two's complement is shown in the table below.


4-bit Signed Binary Integers
DecimalBinaryDecimalBinary
00000-81000
10001-11111
20010-21110
30011-31101
40100-41100
50101-51011
60110-61010
70111-71001

Like one's complement, two's complement allows an addition operation to be used to carry out both addition and subtraction. To subtract one binary number x from another binary number y using two's complement, simply add y to the two's complement of x (any carry past the most significant bit is simply ignored). The two examples below illustrate the principle.

Example 1:

11810 - 8510 (011101102 – 010101012)  =  3310 (001000012)

  01110110
+ 10101011    (two's complement of 01010101)
  --------
= 00100001    (result of standard binary addition, ignoring overflow)

Example 2:

10610 + -7910 (011010102 + 101100002)  =  2710 (000110112)

  01101010
+ 10110001    (two's complement of 01001111)
  --------
= 00011011    (result of standard binary addition, ignoring overflow)

Note that although binary arithmetic (addition, subtraction, multiplication and division) is simplified by the use of the two's complement representation of signed binary values, the problem remains that an arithmetic operation involving numbers having a fixed number of bits may well produce a result that requires more than the number of bits provided.

The addition of the 8-bit signed integers 11710 and 9610, for example, would produce the result 21310. This result is outside the range of values that can be stored as an 8-bit signed binary integer, regardless of the type of representation used (sign and magnitude, one's complement, or two's complement). For this reason, computers will carry out checks to determine whether the result of an arithmetic operation will result in an overflow, and as a result will often need to promote a value from one data type to another (e.g. from integer to long integer).

When converting an 8-bit two's complement number to a 16-bit two's complement number, a process of sign extension occurs, in which the sign bit (the most significant bit of the original 8-bit binary number) is repeated in every bit position to the left of it in the new 16-bit version. The following examples illustrate this:

Example 1:

010101012 (+8510) = 00000000 010101012 (as a 16-bit number)

Example 2:

001110112 (-5910) = 11111111 101010112 (as a 16-bit number)

Finally, it is worth noting that binary multiplication is also relatively straightforward in a two's complement system. For example, a bitwise multiplication of 111101002 (-1210) and 000010002 (+810) gives the result 111101000002.

Since we are dealing with 8-bit numbers, we can drop the leftmost three bits, leaving 101000002. Since the most significant bit is 1, we know the result has a negative value. The two's complement of 101000002 is 011000002, or 9610, so the result represents -9610, which is the correct result of multiplying -1210 and +810.


Binary-Coded Decimal

In electronic control systems, binary-coded decimal (BCD) is a method for representing decimal numbers in which each decimal digit is represented by a sequence of binary digits. This makes it relatively easy for the system to convert the numeric representation for printing or display purposes, and speeds up decimal calculations. The main disadvantage is that representing decimal numbers in this way takes up more space in memory than using a more conventional binary representation.

The circuitry required to carry out calculations also tends to be more complex. The decimal digits 0-9 are each represented using four bits. Additional bit combinations may be used to represent sign values (+ or -) or other values (e.g. overflow or error conditions). The table below shows the standard BCD encodings for the decimal digits 0-9. Note that values greater than 1001 (1010, 1011, 1100, 1101, 1110, or 1111) are not valid BCD decimal values.



BCD Representation
DecimalBCD Encoding
00000
10001
20010
30011
40100
50101
60110
70111
81000
91001

The BCD encoding for the number 123 would therefore be:

0001 0010 0011

As opposed to the normal binary representation:

1111011

Computers usually store data in blocks of 8 bits (called bytes). Two main methods of storing BCD digits are used. In the first method (called zoned BCD), each 4-bit BCD representation of a decimal digit is stored in the right-most four bits of a byte. The left-most four bits are all set to zero, or all set to one in systems such as mainframe computers that use Extended Binary Coded Decimal Interchange Code (EBCDIC), or to 0011 (in systems that use the American Standard Code for Information Interchange (ASCII).

In the second form of representation, two BCD encoded decimal digits are stored in a single byte. Displaying a BCD-encoded number is relatively straightforward for hardware, because each of the numeric characters is mapped to a distinct bit pattern consisting of four binary digits. The control circuitry required to display each number is therefore relatively straightforward since no arithmetic conversion is required.

Another variation on BCD encoding called packed BCD uses each byte of a multi-byte word to store two BCD-encoded decimal digits except the lowest byte. In this byte, the upper four bits are used to store a BCD-encoded decimal digit, but the lowest four bits are used to store the sign value (most commonly 1100 for '+' and 1101 for '-'). A 32-bit word can thus hold a 7-digit signed decimal number using binary coded decimal encoding. The number -1,234,567 would therefore be represented as follows:

0001 0010 0011 0100 0101 0110 0111 1101

The same sign values can be used with zoned BCD to represent signed numbers. The left-most four bits of the least significant byte are used to represent the sign of the number. For an EBCDIC representation of -123, therefore, the binary pattern used would be as follows:

1111 0001 1111 0010 1101 0011


BCD addition

Unpacked BCD numbers can be added together in the same way that other binary numbers are added together, with the result being put into the appropriate binary format afterwards. If the result of adding two BCD numbers is greater than 910 (1001), however, the result will be invalid and must be corrected by adding 610 (0110). Take the example of adding 7 and 8:

  0000 0111
+ 0000 1000
  ---------
= 0000 1111

The answer derived is OK for normal binary addition (11112 = 1510) but results in an invalid BCD representation. In order to rectify this, we need to add 6 (0110) to the result, as follows (note that the number being added must also be in the unpacked format):

  0000 1111
+ 0000 0110
  ---------
= 0001 0101

We now have two BCD values represented, 1 and 5. This is the correct BCD representation of 15 (the correct result of adding 7 and 8). In order to correctly represent this number in the unpacked BCD format, however, we need to shift the left-most four bits of the answer into a higher-order byte, as shown below.

0000 0001 0000 0101 = 1510

Adding together groups of BCD values is trickier, but basically involves adding each set of BCD encoded values from right to left, and carrying the second digit to the next highest order byte (remembering of course to correct for invalid BCD results before doing so). The following example, in which we will add 97 to 48, illustrates the principle (note that the BCD-encoded values that will appear in the final encoding are highlighted at each stage):

Add the first set of values:

  0000 0111
+ 0000 1000
  ---------
= 0000 1111

The result is an invalid BCD value, so add 0110:

  0000 1111
+ 0000 0110
  ---------
= 0001 0101

Add the second set of values (plus the 1 carried from the first addition):

  0000 1001
+ 0000 0100
+ 0000 0001
  ---------
= 0000 1110

This result of this addition is also an invalid BCD value, so add 0110:

  0000 1110
+ 0000 0110
  ---------
= 0001 0100

The 1 will be carried into the next highest order byte. The overall result of the addition is shown below (note that the same method for addition can also be applied to packed BCD - just remember that each byte contains two BCD-encoded decimal digits rather than one).

0000 0001 0000 0100 0000 0101 = 14510


BCD subtraction

Binary subtraction can be carried out on both packed and unpacked BCD-encoded numbers in the same way that it can for numbers that use a standard binary representation. Note that, as with addition, subtraction can result in an invalid BCD value. It can also give an erroneous answer if the subtraction operation results in a borrow (from the next highest BCD-encoded value).

If either of these situations occurs, an adjustment is needed to produce a correct result. For subtraction, we need to subtract 6 (0110) from the invalid or erroneous BCD value. Some examples will illustrate the point (we will use packed BCD values for the purposes of this exercise). First, a straightforward example that requires no corrections would be to subtract 12 from 37:

  0011 0111
- 0001 0010
  ---------
= 0010 0101

For the next example, we will subtract 19 from 65:

  0110 0101
- 0001 1001
  ---------
= 0100 1100

Since the right-most BCD value is invalid, we need to subtract 6 (0110):

  0100 1100
- 0000 0110
  ---------
= 0100 0110

The adjustment gives us the correct answer:

0100 0110 = 4610

For the final example we will subtract 18 from 41:

  0100 0001
- 0001 1000
  ---------
= 0010 1001

Since we had to borrow from the left most BCD value when subtracting, the right most BCD value in the result is erroneous and we again need to subtract 6 (0110):

  0010 1001
- 0000 0110
  ---------
= 0010 0011

Once again, the adjustment gives us the correct answer:

0010 0011 = 2310