## What Is Floating-Point Arithmetic?

A floating-point number system $F$ is a finite subset of the real line comprising numbers of the form $y = \pm m \times \beta^{e-t},$

where $\beta$ is the base, $t$ is the precision, and $e\in [e_{\min},e_{\max}]$ is the exponent. The system is completely defined by the four integers $\beta$, $t$, $e_{\min}$, and $e_{\max}$. The significand $m$ satisfies $0 \le m \le \beta^t-1$. Normalized numbers are those for which $m \ge \beta^{t-1}$, and they have a unique representation. Subnormal numbers are those with $0 < m < \beta^{t-1}$ and $e = e_{\min}$.

An alternative representation of $y\in F$ is $y = \pm \beta^e \left( \displaystyle\frac{d_1}{\beta} + \frac{d_2}{\beta^2} + \cdots + \frac{d_t}{\beta^t} \right) = \pm \beta^e \times .d_1 d_2 \dots d_t,$

where each digit $d_i$ satisfies $0 \le d_i \le \beta-1$ and $d_1 \ne 0$ for normalized numbers.
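As an illustration, the two representations can be recovered for an IEEE double precision number in Python. This is a minimal sketch, not part of the original text: it assumes the platform `float` is an IEEE double (so $\beta = 2$ and $t = 53$), it handles normalized numbers only, and `decompose` is a hypothetical helper name.

```python
import math
import sys

def decompose(y: float):
    """Return (sign, m, e) with y == sign * m * 2**(e - t) for a normalized double."""
    t = sys.float_info.mant_dig        # precision t (53 for IEEE double)
    frac, e = math.frexp(abs(y))       # abs(y) == frac * 2**e with 0.5 <= frac < 1
    m = int(frac * 2**t)               # integer significand; exact, frac has at most t bits
    sign = -1 if y < 0 else 1
    return sign, m, e

t = sys.float_info.mant_dig
sign, m, e = decompose(0.1)
digits = format(m, "b")                # the binary digits d_1 d_2 ... d_t
print(sign, m, e)
print("." + digits, "x 2^%d" % e)
```

In this notation `sys.float_info.min_exp` and `sys.float_info.max_exp` play the roles of $e_{\min}$ and $e_{\max}$, because Python follows the same convention of a fraction in $[1/2, 1)$.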

The floating-point numbers are not equally spaced, but they have roughly constant relative spacing (varying by up to a factor $\beta$).
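The factor-$\beta$ variation in relative spacing can be observed directly. The sketch below assumes IEEE double precision and Python 3.9 or later (for `math.ulp` and `math.nextafter`); it compares the spacing relative to $x$ at $x = 1$ and at the largest number below $2$.

```python
import math

# Absolute and relative spacing at a few points in [1, 4).
for x in (1.0, 1.5, math.nextafter(2.0, 0.0), 2.0, 3.0):
    print(x, math.ulp(x), math.ulp(x) / x)

rel_at_1 = math.ulp(1.0) / 1.0                       # largest relative spacing in [1, 2)
just_below_2 = math.nextafter(2.0, 0.0)              # largest double less than 2
rel_below_2 = math.ulp(just_below_2) / just_below_2  # smallest relative spacing in [1, 2)
print(rel_at_1 / rel_below_2)                        # close to beta = 2
```

The absolute spacing is constant between consecutive powers of 2, so the relative spacing falls by nearly a factor of 2 across the interval and then jumps back at the next power of 2.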

Consider the normalized nonnegative numbers in a toy system with $\beta = 2$, $t = 3$, and $e \in [-1, 3]$: there are 20 of them, ranging from $0.25$ to $7$. Three key properties that hold in general for binary arithmetic are visible in this example.

• The spacing of the numbers increases by a factor 2 at every power of 2.
• The spacing of the numbers between $1/2$ and $1$ is $u = 2^{-t}$, which is called the unit roundoff. The spacing of the numbers between $1$ and $2$ is $\epsilon = 2^{1-t}$, which is called the machine epsilon. Note that $\epsilon = 2u$.
• There is a gap between $0$ and the smallest normalized number, which is $2^{e_{\min}-1}$. The subnormal numbers fill this gap with numbers having the same spacing as those between $2^{e_{\min}-1}$ and $2^{e_{\min}}$, namely $2^{e_{\min}-t}$.

The next diagram shows the complete set of nonnegative normalized and subnormal numbers in the toy system.

In MATLAB, `eps` is the machine epsilon, `eps(x)` is the distance from `x` to the next larger (in magnitude) floating-point number, `realmax` is the largest finite number, and `realmin` is the smallest normalized positive number.
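The toy system is small enough to enumerate exactly, which makes the three properties above easy to check by hand. The following sketch uses Python's `fractions` module so that every value is exact; `toy_system` is a hypothetical helper, with the toy parameters from the text as defaults.

```python
from fractions import Fraction

def toy_system(beta=2, t=3, emin=-1, emax=3):
    """Return (normalized, subnormal) nonnegative numbers m * beta**(e - t), sorted."""
    normalized = sorted(
        Fraction(m) * Fraction(beta) ** (e - t)
        for e in range(emin, emax + 1)
        for m in range(beta ** (t - 1), beta ** t)   # beta^(t-1) <= m <= beta^t - 1
    )
    subnormal = sorted(
        Fraction(m) * Fraction(beta) ** (emin - t)
        for m in range(1, beta ** (t - 1))           # 0 < m < beta^(t-1), e = emin
    )
    return normalized, subnormal

normalized, subnormal = toy_system()
print("normalized:", [float(x) for x in normalized])
print("subnormal: ", [float(x) for x in subnormal])
```

For IEEE double precision, the Python counterparts of the MATLAB quantities are `sys.float_info.epsilon`, `math.ulp(x)`, `sys.float_info.max`, and `sys.float_info.min`.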

A real number $x$ is mapped into $F$ by rounding, and the result is denoted by $\mathrm{f\kern.2ptl}(x)$. If $|x|$ exceeds the largest number in $F$ then we say that $\mathrm{f\kern.2ptl}(x)$ overflows, and in IEEE arithmetic it is represented by Inf with the sign of $x$. If $x\in F$ then $\mathrm{f\kern.2ptl}(x) = x$; otherwise, $x$ lies between two floating-point numbers and we need a rule for deciding which one to round $x$ to. The usual rule, known as round to nearest, is to round to whichever number is nearer. If $x$ is midway between two floating-point numbers then we need a tie-breaking rule, which is usually to round to the number with an even last digit. If $x\ne0$ and $\mathrm{f\kern.2ptl}(x) = 0$ then $\mathrm{f\kern.2ptl}(x)$ is said to underflow.
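The tie-breaking rule, overflow, and underflow can all be observed in IEEE double precision. This sketch assumes that Python's `float` is an IEEE double and relies on the conversion `float(Fraction)` being correctly rounded, which is the case in CPython; the exact midpoints are built with `Fraction` so that no rounding occurs before the conversion.

```python
from fractions import Fraction

u = Fraction(1, 2**53)         # unit roundoff for IEEE double (t = 53)
eps = 2 * u                    # machine epsilon

# Exact midpoints between adjacent doubles just above 1.
tie_low = float(1 + u)         # midway between 1 and 1 + eps
tie_high = float(1 + 3 * u)    # midway between 1 + eps and 1 + 2*eps
print(tie_low == 1.0)                    # True: 1 has an even last bit
print(tie_high == float(1 + 2 * eps))    # True: 1 + 2*eps has an even last bit

overflowed = 1e308 * 10        # exact result exceeds the largest finite double
underflowed = 5e-324 / 4       # nonzero exact result, below half the smallest subnormal
print(overflowed, underflowed)
```

Both ties round to the neighbour whose last bit is zero, so one rounds down and the other rounds up.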

For round to nearest and $x$ in the normalized range of $F$, it can be shown that $\mathrm{f\kern.2ptl}(x) = x(1+\delta), \quad |\delta|\le u.$

This result shows that rounding introduces a relative error no larger than $u$.
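The bound can be checked exactly for particular values of $x$ by carrying out the comparison in rational arithmetic. The sketch below assumes IEEE double precision, so $u = 2^{-53}$; `rounding_delta` is a hypothetical helper name.

```python
from fractions import Fraction

u = Fraction(1, 2**53)         # unit roundoff for IEEE double (t = 53)

def rounding_delta(x: Fraction) -> Fraction:
    """Return delta such that fl(x) == x * (1 + delta), computed exactly."""
    fl_x = Fraction(float(x))  # float(x) rounds to nearest; Fraction(float) is exact
    return (fl_x - x) / x

deltas = {x: rounding_delta(x)
          for x in (Fraction(1, 10), Fraction(1, 3), Fraction(22, 7))}
for x, delta in deltas.items():
    print(x, float(delta), abs(delta) <= u)
```

A number that is already in $F$, such as $1/2$, gives $\delta = 0$.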

Elementary floating-point operations, $+$, $-$, $*$, $/$, and $\sqrt{}$ are usually defined to return the correctly rounded exact result, so they satisfy $\mathrm{f\kern.2ptl}(x\mathbin{\mathrm{op}} y) = (x \mathbin{\mathrm{op}} y)(1+\delta), \quad |\delta|\le u, \quad \mathbin{\mathrm{op}}\in\{+,-,*,/,\sqrt{}\}.$
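The same exact comparison works for the arithmetic operations. In the following sketch, which again assumes IEEE double precision with round to nearest, the operands are the doubles nearest to $0.1$ and $0.2$, and `op_delta` is a hypothetical helper that returns the relative error of one operation. The square root is omitted because its exact value is not rational.

```python
from fractions import Fraction
import operator

u = Fraction(1, 2**53)         # unit roundoff for IEEE double (t = 53)

def op_delta(op, x: float, y: float) -> Fraction:
    """Return delta with fl(x op y) == (x op y) * (1 + delta), computed exactly."""
    exact = op(Fraction(x), Fraction(y))   # exact result for the stored operands
    computed = Fraction(op(x, y))          # what floating-point arithmetic returns
    return (computed - exact) / exact

x, y = 0.1, 0.2
results = {
    name: op_delta(op, x, y)
    for name, op in [("+", operator.add), ("-", operator.sub),
                     ("*", operator.mul), ("/", operator.truediv)]
}
for name, delta in results.items():
    print(name, float(delta), abs(delta) <= u)
```

Each operation individually meets the bound, even though `0.1 + 0.2 == 0.3` is false: the three decimal constants are themselves rounded when they are converted to binary.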

Most floating-point arithmetics adhere to the IEEE standard, which defines several floating-point formats and four different rounding modes.
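To give a feel for how the formats differ, the sketch below rounds $1/10$ to the IEEE half, single, and double precision binary formats using the standard `struct` module, whose `e`, `f`, and `d` codes correspond to those formats. The precisions $t = 11, 24, 53$ are those of the formats, including the implicit leading bit. The example illustrates the formats only; it does not exercise the rounding modes.

```python
import struct
from fractions import Fraction

# (struct format code, precision t) for IEEE binary16, binary32, and binary64.
FORMATS = {"half": ("<e", 11), "single": ("<f", 24), "double": ("<d", 53)}

x = Fraction(1, 10)
rel_errors = {}
for name, (code, t) in FORMATS.items():
    # float(x) rounds to double first and packing rounds again to the target
    # format; for this x the result matches rounding directly to that format.
    stored = struct.unpack(code, struct.pack(code, float(x)))[0]
    rel_errors[name] = abs(Fraction(stored) - x) / x
    print(name, t, stored, float(rel_errors[name]))
```

Each relative error is bounded by the unit roundoff $2^{-t}$ of the corresponding format.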

Another form of finite precision arithmetic is fixed-point arithmetic, in which numbers have the same form as $F$ but with a fixed exponent $e$, so all the numbers are equally spaced. In most scientific computations scale factors must be introduced in order to be able to represent the range of numbers occurring. Fixed-point arithmetic is mainly used on special purpose devices such as FPGAs and in embedded systems.
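A small sketch shows why scale factors are needed. Here numbers are stored as integers that count multiples of $2^{-8}$; the choice of 8 fractional bits and the names `FRAC_BITS`, `to_fixed`, and `from_fixed` are assumptions made for illustration.

```python
FRAC_BITS = 8                  # hypothetical choice: 8 fractional bits
SCALE = 2 ** FRAC_BITS

def to_fixed(x: float) -> int:
    """Round x to the nearest multiple of 2**-FRAC_BITS, stored as an integer."""
    return round(x * SCALE)

def from_fixed(n: int) -> float:
    """Convert the stored integer back to the value it represents."""
    return n / SCALE

for x in (100.3, 1.3, 0.013, 0.001):
    approx = from_fixed(to_fixed(x))
    abs_err = abs(approx - x)
    print(x, approx, abs_err, abs_err / x)
```

The absolute error is at most half the spacing for every input, but the relative error grows as $x$ shrinks, and $0.001$ is rounded to zero unless the data is rescaled first.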

## References

This is a minimal set of references, which contain further useful references within.