The fact that computers are finite has important design implications. It means that computers can never faithfully represent the set of integers or the set of real numbers, both of which are infinite. Since these are generally what we work with in physics and mathematics, it’s important to understand how integers and reals are approximated on a computer. (What we’ll discover is that the integers are replaced by a finite range of integers with arithmetic modulo a power of two, and the reals by a finite subset of the rationals.)

But first, let’s consider how we might represent numbers at all. What strategies could we employ? Historically, the most common are these two:

- simple enumeration (/ // /// //// ...)
- grouping and labelling (e.g., Roman numerals I II III IV V X L C D M; 1998 = MCMXCVIII)

Neither, however, is suitable for computation. The first is grossly inefficient, requiring \(N\) bits to store numbers as large as \(N\). The second requires an ever-growing family of new symbols to represent large values. What we need instead is a systematic and extensible representation in which the basic arithmetic operations are mechanical. The usual solution to this dilemma is the following.

- positional number systems: (\(a_3a_2a_1a_0.a_{-1}a_{-2}\)) base \(b\) with \(0 \le a_k < b\)

By convention, the leading (high) digit is most significant and the trailing (low) digit the least. The part to the right of the radix point is understood to be fractional. Our conventional decimal number system corresponds to \(b = 10\) with digits \(a_k \in \{ 0,1,2,\ldots,9 \}\). Bases \(b = 2, 8, 16\) (all powers of two) are the most commonly used in computer science, since computers use the binary representation in hardware.

base | common name | math name | example | decimal conversion |
---|---|---|---|---|
2 | binary | | \(10010111_{2}\) | 151 |
8 | octal | octonal | \(1735_{8}\) | 989 |
10 | decimal | | 234 | 234 |
16 | hexadecimal | sexadecimal | \(\text{3F7A}_{16}\) | 16250 |
60 | sexagesimal | | 23 44’ 12” | ≈ 23.7367 |

The conversion to decimal can be carried out by summing powers.

\[\begin{split}10010111_2 &= 2^7+ 2^4 + 2^2 + 2^1 + 2^0 = 151 \\
1735_8 &= 1\times 8^3 + 7\times 8^2+ 3\times 8^1 + 5\times 8^ 0 = 989\end{split}\]

Note that for bases greater than 10, we run out of Arabic numerals. The convention is to fill out the missing digits using letters of the Latin alphabet. For base 16, the digits are

\[a_k \in \{ 0,1,2,\cdots ,9,\text{A},\text{B},\text{C},\text{D},\text{E},\text{F} \}\]

Hence,

\[\text{3F7A}_{16} = 3\times 16^3 + 15\times 16^2 + 7\times 16^1 + 10\times 16^0 = 16250\]
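The sum of powers can be evaluated digit by digit, from the most significant digit down, without forming the powers explicitly (Horner's rule). The following is a minimal sketch, not a library routine: the names `digit_value` and `to_value` are made up for illustration, the letter-to-digit mapping assumes an ASCII-like character set, and no overflow check is performed.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>

// Value of one digit character: '0'-'9' -> 0-9, 'A'-'Z' or 'a'-'z' -> 10-35.
// Returns -1 for any other character. (Hypothetical helper.)
int digit_value(char c)
{
    const unsigned char u = static_cast<unsigned char>(c);
    if (std::isdigit(u)) return c - '0';
    if (std::isalpha(u)) return std::toupper(u) - 'A' + 10;
    return -1;
}

// Convert a string of digits in the given base (2 <= base <= 36) to its value.
// Horner's rule: value = (...(a_n * b + a_{n-1}) * b + ...) * b + a_0,
// which equals the sum of a_k * b^k.
unsigned long to_value(const std::string& digits, int base)
{
    if (base < 2 || base > 36) throw std::invalid_argument("base out of range");
    unsigned long value = 0;
    for (char c : digits) {
        const int d = digit_value(c);
        if (d < 0 || d >= base) throw std::invalid_argument("invalid digit");
        value = value * base + d;   // no overflow check: keep the inputs short
    }
    return value;
}
```

With these definitions, `to_value("1735", 8)` evaluates to 989 and `to_value("3F7A", 16)` to 16250, in agreement with the sums above.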

Of course, for base \(b > 36\) we run out of symbols, and the notation becomes idiosyncratic. For sexagesimal fractions of a degree, the convention is to use an increasing number of primes to mark the places:

\[23\ 44'\ 12'' = 23 + 44\times 60^{-1} + 12\times 60^{-2} = 23.736666\ldots\]

Binary is not as foreign as it first seems. Humans have invented many binary number systems; good examples are the Western system of musical notation (1 whole note = 2 half notes = 4 quarter notes = 8 eighth notes, etc.) and the British system of weights and measures (1 gallon = 2 pottles = 4 quarts = 8 pints, etc.).

A fixed-width binary number is a sequence of \(N\) bits. The smallest possible number is \(0\) and the largest is \(2^N-1\). For example, with eight bits the numbers \(0\) to \(255\) are represented by the \(2^8\) unique patterns of \(0\) and \(1\).

value | bit pattern |
---|---|
0 | 0 0 0 0 0 0 0 0 |
1 | 0 0 0 0 0 0 0 1 |
2 | 0 0 0 0 0 0 1 0 |
3 | 0 0 0 0 0 0 1 1 |
⋮ | ⋮ |
255 | 1 1 1 1 1 1 1 1 |
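To inspect such patterns directly, the standard `std::bitset` class template can convert between an integer and its string of bits. This is only a small sketch; the wrapper names `bit_pattern8` and `pattern_value8` are hypothetical.

```cpp
#include <bitset>
#include <string>

// The 8-bit pattern of n as a string of '0' and '1', most significant bit first.
// Bits of n above the eighth are discarded by std::bitset<8>.
std::string bit_pattern8(unsigned n)
{
    return std::bitset<8>(n).to_string();
}

// The unsigned value of an 8-character pattern of '0' and '1'.
// std::bitset throws std::invalid_argument on any other character.
unsigned long pattern_value8(const std::string& pattern)
{
    return std::bitset<8>(pattern).to_ulong();
}
```

For instance, `bit_pattern8(3)` yields `"00000011"` and `pattern_value8("11111111")` yields 255.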

Negative numbers can be represented using what’s called the *two’s complement scheme*. Here, the bit patterns for \(\{0,1,\ldots,255\}\) are reinterpreted as \(\{-128,-127,\ldots,127\}\): a pattern whose leading (most significant) bit is set, i.e., one with unsigned value \(x \ge 128\), stands for \(x-256\).

two’s complement | bit pattern | unsigned binary |
---|---|---|
-128 | 1 0 0 0 0 0 0 0 | 128 |
⋮ | ⋮ | ⋮ |
-3 | 1 1 1 1 1 1 0 1 | 253 |
-2 | 1 1 1 1 1 1 1 0 | 254 |
-1 | 1 1 1 1 1 1 1 1 | 255 |
0 | 0 0 0 0 0 0 0 0 | 0 |
1 | 0 0 0 0 0 0 0 1 | 1 |
2 | 0 0 0 0 0 0 1 0 | 2 |
3 | 0 0 0 0 0 0 1 1 | 3 |
⋮ | ⋮ | ⋮ |
127 | 0 1 1 1 1 1 1 1 | 127 |
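The mapping between the first and third columns of the table can be written out in a few lines. The sketch below does the reinterpretation with explicit arithmetic rather than relying on how the compiler stores signed types; the function names are invented for illustration.

```cpp
// Read an 8-bit pattern, given by its unsigned value 0..255, as two's complement:
// if the leading bit is set, the pattern stands for x - 256.
constexpr int as_twos_complement8(unsigned x)
{
    return (x & 0x80u) ? static_cast<int>(x) - 256 : static_cast<int>(x);
}

// Inverse direction: the unsigned value of the 8-bit pattern that stores v
// (intended for -128 <= v <= 127). Conversion to unsigned is modular, so
// masking with 0xFF leaves v mod 256.
constexpr unsigned to_pattern8(int v)
{
    return static_cast<unsigned>(v) & 0xFFu;
}

static_assert(as_twos_complement8(253) == -3, "253 is the pattern for -3");
static_assert(to_pattern8(-128) == 128, "-128 is stored as 1000 0000");
```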

Under two’s complement, the high bit effectively contains the sign information. Still, this is quite different from having the high bit represent the sign explicitly and the remaining low bits the magnitude. Under that sign-magnitude scheme, eight bits give the numbers

\[-127, -126, \ldots, -2, -1, -0, +0, 1, 2, \ldots, 127\]

Note that there are two representations of zero. Two’s complement has the advantage of no redundancy.
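The redundancy is easy to see by decoding sign-magnitude patterns by hand. A minimal sketch, with a made-up function name, assuming bit 7 holds the sign and bits 0 to 6 the magnitude:

```cpp
// Decode an 8-bit sign-magnitude pattern given by its unsigned value 0..255.
constexpr int from_sign_magnitude8(unsigned x)
{
    return (x & 0x80u) ? -static_cast<int>(x & 0x7Fu)
                       :  static_cast<int>(x & 0x7Fu);
}

// Two distinct patterns, 0000 0000 and 1000 0000, both decode to zero.
static_assert(from_sign_magnitude8(0x00u) == 0, "+0");
static_assert(from_sign_magnitude8(0x80u) == 0, "-0");
static_assert(from_sign_magnitude8(0xFFu) == -127, "most negative value");
```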

A more important advantage is that addition and subtraction of two’s complement numbers can be carried out in exactly the same way as for unsigned binary numbers. In the following sum, the carry bit is shown in red. The same bit-level addition can be read as acting on either unsigned or signed integers.

\[\begin{split}&\ \ {\color{red} 1}\\
&0011\\
+&1010\\ \hline
&1101\end{split}\]

This 4-digit binary computation can be interpreted in two ways: unsigned \(3 + 10 = 13\) or two’s complement \(3 + (-6) = -3\).
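The two readings can be checked with a short sketch that emulates a 4-bit adder by masking off everything above the low four bits. The helper names `add4` and `signed4` are hypothetical.

```cpp
// Add two 4-bit patterns and keep only the low four bits of the result.
constexpr unsigned add4(unsigned a, unsigned b)
{
    return (a + b) & 0xFu;
}

// Read a 4-bit pattern as a two's complement value in the range -8..7.
constexpr int signed4(unsigned x)
{
    return (x & 0x8u) ? static_cast<int>(x) - 16 : static_cast<int>(x);
}

// One addition, two interpretations of the same bits:
static_assert(add4(0x3u, 0xAu) == 13u, "unsigned: 3 + 10 = 13");
static_assert(signed4(0xAu) == -6, "1010 read as two's complement");
static_assert(signed4(add4(0x3u, 0xAu)) == -3, "signed: 3 + (-6) = -3");
```

The adder itself never needs to know which interpretation the programmer has in mind; only the final decoding step differs.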

Fixed-width binary numbers can only represent a limited range of integers. The result of an operation (such as addition or multiplication) performed on a pair of representable integers may not itself be representable. This condition is called *overflow*.

The next example provides a simple demonstration. The overflow result can be understood in terms of arithmetic modulo \(2^4 = 16\).

\[\begin{split}{\color{red} 1}& \ \ \ \ {\color{red} 1}\\
&1001\\
+&1101\\ \hline
&0110\end{split}\]

For unsigned binary, \(9 + 13 = (\textit{overflow})\ 6\), which is \(22\ \text{mod}\ 16\). For two’s complement, \((-7) + (-3) = (\textit{overflow})\ 6\), which is \(-10\ \text{mod}\ 16\).
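Both kinds of overflow can be detected from the bits alone. The sketch below uses one common pair of tests: an unsigned sum overflows when there is a carry out of the top bit, and a two's complement sum overflows when both operands have the same sign bit but the result does not. The type `Sum4` and function `add4_checked` are invented for this illustration.

```cpp
// Result of a 4-bit addition together with its overflow flags.
struct Sum4 {
    unsigned bits;            // low four bits of the sum (the sum modulo 16)
    bool unsigned_overflow;   // a carry left the top bit
    bool signed_overflow;     // operands share a sign bit that the result lacks
};

Sum4 add4_checked(unsigned a, unsigned b)
{
    const unsigned full = (a & 0xFu) + (b & 0xFu);   // at most 30, fits easily
    const unsigned bits = full & 0xFu;
    const bool carry_out = (full >> 4) != 0;

    const bool sign_a = (a & 0x8u) != 0;
    const bool sign_b = (b & 0x8u) != 0;
    const bool sign_r = (bits & 0x8u) != 0;
    const bool signed_ovf = (sign_a == sign_b) && (sign_r != sign_a);

    return {bits, carry_out, signed_ovf};
}
```

For the example above, `add4_checked(0x9, 0xD)` returns the bits `0110` with both flags set. Note that the two flags are independent: `add4_checked(0x7, 0x1)` produces `1000` with no carry out, which is fine as unsigned \(7 + 1 = 8\) but overflows as two's complement, since \(8\) lies outside \(-8,\ldots,7\).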

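The integer types provided by C++ do not have a guaranteed exact width. Rather, the standard fixes only minimum widths (at least 8 bits for `char`, 16 for `short` and `int`, 32 for `long`) and the relative ordering of their sizes.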

```
// Only the ordering of the sizes is guaranteed, not the sizes themselves.
assert( sizeof(char)  <= sizeof(short) and
        sizeof(short) <= sizeof(int)   and
        sizeof(int)   <= sizeof(long) );
```

In practice, however, the sizes are the same across compilers for 32-bit Intel machines.

type | width |
---|---|
`char` | 8 bits, 1 byte |
`short int` | 16 bits, 2 bytes, 1 word |
`int`, `long int` | 32 bits, 4 bytes, 1 double word |
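Because these widths are a property of the platform rather than of the language, it can be worth checking them on the machine at hand; on many 64-bit systems, for example, `long int` is 64 bits wide. The sketch below computes a type's width from `sizeof` and `CHAR_BIT` (the helper name `width_in_bits` is hypothetical). When an exact width matters, the `<cstdint>` header provides fixed-width aliases such as `std::int32_t` on platforms that support them.

```cpp
#include <climits>   // CHAR_BIT: number of bits in a byte
#include <cstddef>   // std::size_t
#include <cstdint>   // std::int16_t, std::int32_t (where available)

// Storage width in bits of type T on the platform doing the compiling.
template <typename T>
constexpr std::size_t width_in_bits()
{
    return sizeof(T) * CHAR_BIT;
}

// Guaranteed on every conforming implementation:
static_assert(width_in_bits<char>() >= 8, "char has at least 8 bits");
static_assert(width_in_bits<short>() >= 16, "short has at least 16 bits");
static_assert(width_in_bits<long>() >= 32, "long has at least 32 bits");

// Fixed-width aliases have exactly the advertised width:
static_assert(width_in_bits<std::int32_t>() == 32, "int32_t is 32 bits");
```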

By default, the integer types are signed (the exception is plain `char`, whose signedness is left to the implementation). On almost all architectures, they are represented internally using the two’s complement scheme, and since C++20 the standard requires it. Unsigned binary versions can be specified by prepending the `unsigned` keyword: e.g., `unsigned char`, `unsigned short int`, `unsigned int`, `unsigned long int`.
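Arithmetic on the unsigned types is defined by the language to wrap around modulo \(2^N\), which matches the fixed-width behaviour described earlier. The following sketch illustrates this; the function names are made up, and `std::numeric_limits` from `<limits>` is the standard way to query the actual range of a type.

```cpp
#include <limits>

// Largest value of an unsigned type T. Converting -1 to an unsigned type is
// modular, so the result is 2^N - 1, i.e. the pattern with all bits set.
template <typename T>
constexpr T max_unsigned()
{
    return static_cast<T>(-1);
}

// Unsigned addition wraps modulo 2^N instead of overflowing.
constexpr unsigned wrap_increment(unsigned x)
{
    return x + 1u;
}

static_assert(max_unsigned<unsigned>() == std::numeric_limits<unsigned>::max(),
              "all bits set is the largest unsigned value");
static_assert(wrap_increment(std::numeric_limits<unsigned>::max()) == 0u,
              "largest value plus one wraps to zero");
```

The same is not true of the signed types: overflow in signed arithmetic is undefined behaviour in C++, so code should not rely on it wrapping.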