What Is Floating-Point Arithmetic?

A floating-point number system F is a finite subset of the real line comprising numbers of the form

y = \pm m \times \beta^{e-t},

where \beta is the base, t is the precision, and e\in [e_{\min},e_{\max}] is the exponent. The system is completely defined by the four integers \beta, t, e_{\min}, and e_{\max}. The significand m satisfies 0 \le m \le \beta^t-1. Normalized numbers are those for which m \ge \beta^{t-1}, and they have a unique representation. Subnormal numbers are those with 0 < m < \beta^{t-1} and e = e_{\min}.

An alternative representation of y\in F is

y = \pm \beta^e \left( \frac{d_1}{\beta} + \frac{d_2}{\beta^2} + \cdots + \frac{d_t}{\beta^t} \right) = \pm \beta^e \times .d_1 d_2 \dots d_t,

where each digit d_i satisfies 0 \le d_i \le \beta-1 and d_1 \ne 0 for normalized numbers.

The floating-point numbers are not equally spaced, but they have roughly constant relative spacing (varying by up to a factor \beta).

Consider a toy system with \beta = 2, t = 3, and e \in [-1, 3]. Its positive normalized numbers are the 20 values m \times 2^{e-3} with 4 \le m \le 7, ranging from 2^{-2} = 0.25 to 7.
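The members of the toy system can be enumerated directly from the definition y = m \times \beta^{e-t}. The following Python sketch is illustrative only: the function name `normalized_numbers` is an assumption made for this example, and exact rational arithmetic is used so that no rounding enters the enumeration.

```python
from fractions import Fraction

def normalized_numbers(beta, t, emin, emax):
    """Positive normalized numbers m * beta**(e - t),
    with beta**(t-1) <= m <= beta**t - 1 and emin <= e <= emax."""
    return sorted(
        m * Fraction(beta) ** (e - t)
        for e in range(emin, emax + 1)
        for m in range(beta ** (t - 1), beta ** t)
    )

# The toy system: beta = 2, t = 3, e in [-1, 3].
toy = normalized_numbers(beta=2, t=3, emin=-1, emax=3)
print(len(toy))                  # 20
print([float(y) for y in toy])   # starts at 0.25, ends at 7.0
```

Each exponent contributes four numbers (m = 4, 5, 6, 7), and the gap between neighbours doubles each time e increases by one.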


Three key properties that hold in general for binary arithmetic are visible in this example.

  • The spacing of the numbers increases by a factor 2 at every power of 2.
  • The spacing of the numbers between 1/2 and 1 is u = 2^{-t}, which is called the unit roundoff. The spacing of the numbers between 1 and 2 is \epsilon = 2^{1-t}, which is called the machine epsilon. Note that \epsilon = 2u.
  • There is a gap between 0 and the smallest normalized number, which is 2^{e_{\min}-1}. The subnormal numbers fill this gap with numbers having the same spacing as those between 2^{e_{\min}-1} and 2^{e_{\min}}, namely 2^{e_{\min}-t}. In the toy system the spacing is 2^{-4}, and the subnormal numbers are 1/16, 2/16, and 3/16.
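The three properties above can be checked numerically for the toy system. This is a minimal sketch in Python using exact rational arithmetic; the helper name `spacing_in` is hypothetical and introduced only for this illustration.

```python
from fractions import Fraction

beta, t, emin, emax = 2, 3, -1, 3  # the toy system from the text

# Normalized numbers: beta**(t-1) <= m <= beta**t - 1, any exponent in range.
normalized = sorted(
    m * Fraction(beta) ** (e - t)
    for e in range(emin, emax + 1)
    for m in range(beta ** (t - 1), beta ** t)
)
# Subnormal numbers: 0 < m < beta**(t-1), with e fixed at emin.
subnormal = [m * Fraction(beta) ** (emin - t) for m in range(1, beta ** (t - 1))]

def spacing_in(lo, hi):
    """Set of gaps between adjacent normalized numbers in [lo, hi]."""
    pts = [y for y in normalized if lo <= y <= hi]
    return {b - a for a, b in zip(pts, pts[1:])}

u = Fraction(1, 2 ** t)          # unit roundoff 2**(-t)
eps = Fraction(1, 2 ** (t - 1))  # machine epsilon 2**(1-t)

print(spacing_in(Fraction(1, 2), 1) == {u})              # True
print(spacing_in(1, 2) == {eps})                         # True
print(min(normalized) == Fraction(beta) ** (emin - 1))   # True
print([float(y) for y in subnormal])                     # [0.0625, 0.125, 0.1875]
```

The subnormal numbers are multiples of 2^{e_{\min}-t} = 1/16, which is also the spacing of the normalized numbers between 1/4 and 1/2.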


In MATLAB, eps is the machine epsilon, eps(x) is the distance from x to the next larger (in magnitude) floating-point number, realmax is the largest finite number, and realmin is the smallest normalized positive number.
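For readers without MATLAB, similar quantities are available in Python's standard library. The sketch below assumes Python 3.9 or later (for `math.ulp`) and IEEE double precision floats; the variable names `realmax` and `realmin` are chosen here only to mirror the MATLAB names.

```python
import math
import sys

eps = sys.float_info.epsilon   # machine epsilon, 2**-52 for IEEE double
realmax = sys.float_info.max   # largest finite number
realmin = sys.float_info.min   # smallest normalized positive number

print(eps == 2.0 ** -52)               # True
print(math.ulp(1.0) == eps)            # True: spacing just above 1
print(math.ulp(1024.0) == 2.0 ** -42)  # True: spacing grows with |x|
print(realmin == 2.0 ** -1022)         # True
print(math.ulp(0.0))                   # smallest positive subnormal number
```

For positive finite x, `math.ulp(x)` plays the role of MATLAB's eps(x).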

A real number x is mapped into F by rounding, and the result is denoted by \mathrm{f\kern.2ptl}(x). If x exceeds the largest number in F in magnitude then we say that \mathrm{f\kern.2ptl}(x) overflows, and in IEEE arithmetic it is represented by Inf with the sign of x. If x\in F then \mathrm{f\kern.2ptl}(x) = x; otherwise, x lies between two floating-point numbers and we need a rule for deciding which one to round x to. The usual rule, known as round to nearest, is to round to whichever number is nearer. If x is midway between two floating-point numbers then we need a tie-breaking rule, which is usually to round to the number with an even last digit. If x\ne0 and \mathrm{f\kern.2ptl}(x) = 0 then \mathrm{f\kern.2ptl}(x) is said to underflow.
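Round to nearest with ties to even, overflow, and underflow can all be observed in a few lines of Python. This sketch assumes IEEE double precision with the default rounding mode, which is what Python floats use on common platforms; the variable names are illustrative.

```python
u = 2.0 ** -53  # unit roundoff for IEEE double precision (t = 53)

# 1 + u is midway between 1 and 1 + 2u; the tie goes to 1 (even last bit).
tie_down = 1.0 + u
# 1 + 3u is midway between 1 + 2u and 1 + 4u; the tie goes to 1 + 4u (even last bit).
tie_up = 1.0 + 3 * u

print(tie_down == 1.0)         # True
print(tie_up == 1.0 + 4 * u)   # True

# Overflow: the exact product exceeds the largest finite double.
print(1e308 * 10)              # inf
# Underflow: the exact product is nonzero but rounds to zero.
print(1e-200 * 1e-200)         # 0.0
```

In both ties the result is the neighbour whose significand ends in a zero bit, so one tie rounds down and the other rounds up.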

For round to nearest it can be shown that, provided |x| lies between the smallest and largest positive normalized numbers of F,

\mathrm{f\kern.2ptl}(x)  = x(1+\delta), \quad |\delta|\le u.

This result shows that rounding introduces a relative error no larger than u.
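The bound can be checked exactly for particular values of x by comparing a decimal number with its double precision representation in rational arithmetic. This is a sketch under the assumption that Python's conversion from a decimal string to a float is correctly rounded (as it is in CPython on common platforms); the function name `relative_rounding_error` is hypothetical.

```python
from fractions import Fraction

u = Fraction(1, 2 ** 53)  # unit roundoff for IEEE double precision

def relative_rounding_error(decimal_string):
    """Return delta such that fl(x) = x * (1 + delta), computed exactly."""
    x = Fraction(decimal_string)             # the exact real number x
    fl_x = Fraction(float(decimal_string))   # exact value of the rounded double
    return (fl_x - x) / x

for s in ["0.1", "0.3", "2.675", "123456.789"]:
    delta = relative_rounding_error(s)
    print(s, float(delta), abs(delta) <= u)
```

A number such as 0.5 that belongs to F gives \delta = 0, since no rounding takes place.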

The elementary floating-point operations +, -, *, /, and \sqrt{} are usually defined to return the correctly rounded exact result, so, barring overflow and underflow, they satisfy

\mathrm{f\kern.2ptl}(x\mathbin{\mathrm{op}} y) = (x \mathbin{\mathrm{op}} y)(1+\delta), \quad |\delta|\le u, \quad \mathbin{\mathrm{op}}\in\{+,-,*,/,\sqrt{}\}.
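This model can be tested for the four binary operations by computing the exact result in rational arithmetic and comparing it with the floating-point result. The sketch below assumes IEEE double precision; the helper `op_error` is a name invented for this example, and the square root is omitted because its exact value is generally irrational.

```python
import operator
from fractions import Fraction

u = Fraction(1, 2 ** 53)  # unit roundoff for IEEE double precision

def op_error(op, x, y):
    """delta in fl(x op y) = (x op y)(1 + delta), computed exactly
    for floating-point operands x and y."""
    exact = op(Fraction(x), Fraction(y))   # exact result of x op y
    computed = Fraction(op(x, y))          # exact value of the rounded result
    return (computed - exact) / exact

x, y = 0.1, 0.3  # the stored doubles nearest to 1/10 and 3/10
for name, op in [("+", operator.add), ("-", operator.sub),
                 ("*", operator.mul), ("/", operator.truediv)]:
    delta = op_error(op, x, y)
    print(name, float(delta), abs(delta) <= u)
```

Here x and y are taken to be the doubles themselves, so the only error measured is the one committed by the operation.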

Most floating-point arithmetics adhere to the IEEE standard, which defines several floating-point formats and four different rounding modes.

Another form of finite precision arithmetic is fixed-point arithmetic, in which numbers have the same form as F but with a fixed exponent e, so all the numbers are equally spaced. To use it in most scientific computations, scale factors must be introduced in order to represent the range of numbers occurring. Fixed-point arithmetic is mainly used on special purpose devices such as FPGAs and in embedded systems.
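To make the idea of a fixed exponent and a scale factor concrete, here is a minimal Python sketch of a hypothetical fixed-point format in which every value is an integer multiple of 2^{-8}. The names `FRAC_BITS`, `to_fixed`, `to_float`, and `fixed_mul` are assumptions for this illustration and do not correspond to any particular device or library.

```python
FRAC_BITS = 8            # hypothetical format: values are integers times 2**-8
SCALE = 1 << FRAC_BITS   # the scale factor 2**8

def to_fixed(x):
    """Round a real value to the nearest multiple of 2**-FRAC_BITS,
    stored as an integer."""
    return round(x * SCALE)

def to_float(a):
    """Convert a stored integer back to the value it represents."""
    return a / SCALE

def fixed_mul(a, b):
    """Multiply two fixed-point values. The raw integer product carries
    scale 2**-16, so it is shifted back to scale 2**-8 (discarding low bits)."""
    return (a * b) >> FRAC_BITS

a, b = to_fixed(1.5), to_fixed(2.25)
print(to_float(a + b))             # 3.75
print(to_float(fixed_mul(a, b)))   # 3.375
print(to_float(1))                 # 0.00390625, the spacing everywhere
```

Addition needs no rescaling, whereas multiplication does, and the spacing 2^{-8} is the same for small and large values, in contrast to floating-point arithmetic.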




This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.
