A floating-point number system $F$ is a finite subset of the real line comprising numbers of the form

$$y = \pm m \times \beta^{e-t},$$

where $\beta$ is the base, $t$ is the precision, and $e$ is the exponent, with $e_{\min} \le e \le e_{\max}$. The system is completely defined by the four integers $\beta$, $t$, $e_{\min}$, and $e_{\max}$. The significand $m$ is an integer satisfying $0 \le m \le \beta^t - 1$. Normalized numbers are those for which $m \ge \beta^{t-1}$, and they have a unique representation. Subnormal numbers are those with $e = e_{\min}$ and $0 < m < \beta^{t-1}$.
An alternative representation of $y \in F$ is

$$y = \pm \beta^e \times .d_1 d_2 \dots d_t,$$

where each digit $d_i$ satisfies $0 \le d_i \le \beta - 1$ and $d_1 \ne 0$ for normalized numbers.
The floating-point numbers are not equally spaced, but they have roughly constant relative spacing (varying by up to a factor $\beta$).
Here are the normalized nonnegative numbers in a toy binary ($\beta = 2$) system with $t = 3$, $e_{\min} = -1$, and $e_{\max} = 3$.
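These numbers can be enumerated directly from the definition $y = \pm m \times \beta^{e-t}$. The following sketch assumes the toy parameters $\beta = 2$, $t = 3$, $e_{\min} = -1$, $e_{\max} = 3$:

```python
# Enumerate the nonnegative normalized numbers y = m * 2**(e - t) of the
# toy binary system with t = 3, e_min = -1, e_max = 3 (assumed parameters).
beta, t, e_min, e_max = 2, 3, -1, 3

normalized = sorted(
    m * beta ** (e - t)
    for e in range(e_min, e_max + 1)
    for m in range(beta ** (t - 1), beta ** t)  # normalized: 4 <= m <= 7
)
print(len(normalized))                # 20
print(normalized[0], normalized[-1])  # 0.25 7.0
```

Each exponent contributes $\beta^{t-1}$ normalized significands, so the system contains $20$ normalized positive numbers, ranging from $1/4$ to $7$.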
Three key properties that hold in general for binary arithmetic are visible in this example.
- The spacing of the numbers increases by a factor 2 at every power of 2.
- The spacing of the numbers between $1/2$ and $1$ is $2^{-t}$, which is called the unit roundoff $u$. The spacing of the numbers between $1$ and $2$ is $2^{1-t}$, which is called the machine epsilon $\epsilon$. Note that $\epsilon = 2u$.
- There is a gap between $0$ and the smallest normalized number, which is $2^{e_{\min}-1}$. The subnormal numbers fill this gap with numbers having the same spacing as those between $2^{e_{\min}-1}$ and $2^{e_{\min}}$, namely $2^{e_{\min}-t}$. The next diagram shows the complete set of nonnegative normalized and subnormal numbers in the toy system.
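The subnormal numbers of the toy system can be listed the same way (a sketch assuming, as before, $\beta = 2$, $t = 3$, $e_{\min} = -1$; the subnormals have $e = e_{\min}$ and $0 < m < 2^{t-1}$):

```python
# Subnormal numbers of the toy system: y = m * 2**(e_min - t), 0 < m < 2**(t-1).
beta, t, e_min = 2, 3, -1

subnormals = [m * beta ** (e_min - t) for m in range(1, beta ** (t - 1))]
print(subnormals)  # [0.0625, 0.125, 0.1875]

# Their spacing is 2**(e_min - t), the same as the spacing just above
# the smallest normalized number 2**(e_min - 1) = 0.25.
spacing = beta ** (e_min - t)
print(spacing)     # 0.0625
```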
In MATLAB, `eps` is the machine epsilon, `eps(x)` is the distance from `x` to the next larger (in magnitude) floating-point number, `realmax` is the largest finite number, and `realmin` is the smallest normalized positive number.
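Python exposes the same quantities for IEEE double precision (a sketch; `math.ulp` requires Python 3.9 or later):

```python
import math
import sys

print(sys.float_info.epsilon)  # machine epsilon for doubles, 2**-52
print(math.ulp(1.0))           # distance from 1.0 to the next double, also 2**-52
print(math.ulp(2.0))           # the spacing doubles at a power of 2: 2**-51
print(sys.float_info.max)      # largest finite double (realmax)
print(sys.float_info.min)      # smallest positive normalized double (realmin)
```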
A real number $x$ is mapped into $F$ by rounding, and the result is denoted by $fl(x)$. If $|x|$ exceeds the largest number in $F$ then we say that $fl(x)$ overflows, and in IEEE arithmetic it is represented by `Inf`. If $x \in F$ then $fl(x) = x$; otherwise, $x$ lies between two floating-point numbers and we need a rule for deciding which one to round to. The usual rule, known as round to nearest, is to round to whichever number is nearer. If $x$ is midway between two floating-point numbers then we need a tie-breaking rule, which is usually to round to the number with an even last digit. If $fl(x) \ne 0$ and $|fl(x)| < \beta^{e_{\min}-1}$ then $fl(x)$ is said to underflow.
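These rules can be observed directly in IEEE double precision, for which $t = 53$ (a Python sketch):

```python
import sys

# Ties: 1 + 2**-53 is exactly midway between the adjacent doubles
# 1 and 1 + 2**-52, and rounds to 1, whose last bit is even.
assert 1.0 + 2**-53 == 1.0

# Just above the midpoint, rounding goes to the upper neighbor.
assert 1.0 + (2**-53 + 2**-60) == 1.0 + 2**-52

# Overflow: a result exceeding the largest finite double becomes Inf.
assert 2.0 * sys.float_info.max == float("inf")

# Underflow: results below the normalized range become subnormal.
tiny = sys.float_info.min / 2  # 2**-1023, a subnormal number
assert 0.0 < tiny < sys.float_info.min
```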
For round to nearest it can be shown that

$$fl(x) = x(1+\delta), \quad |\delta| \le u,$$

where $u = \frac{1}{2}\beta^{1-t}$ is the unit roundoff. This result shows that rounding introduces a relative error no larger than $u$.
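The bound can be checked with exact rational arithmetic (a sketch assuming IEEE double precision, where $u = 2^{-53}$):

```python
from fractions import Fraction

u = Fraction(1, 2**53)  # unit roundoff of IEEE double precision

x = Fraction(1, 3)         # an exactly known real number
fl_x = Fraction(float(x))  # float(x) rounds x to the nearest double
delta = abs(fl_x - x) / x  # the exact relative error of rounding
assert delta <= u
print(float(delta))
```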
Elementary floating-point operations $+$, $-$, $\times$, $/$, and $\sqrt{\phantom{x}}$ are usually defined to return the correctly rounded exact result, so they satisfy

$$fl(x \mathbin{\mathrm{op}} y) = (x \mathbin{\mathrm{op}} y)(1+\delta), \quad |\delta| \le u, \quad \mathrm{op} \in \{+, -, \times, /\},$$

with an analogous result for the square root.
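This property can also be verified with exact rational arithmetic, again assuming IEEE double precision (the helper `rel_err` is illustrative):

```python
from fractions import Fraction

u = Fraction(1, 2**53)  # unit roundoff of IEEE double precision

def rel_err(computed: float, exact: Fraction) -> Fraction:
    """Exact relative error of a floating-point result."""
    return abs(Fraction(computed) - exact) / abs(exact)

a, b = 0.1, 0.3  # the stored doubles nearest 0.1 and 0.3
assert rel_err(a + b, Fraction(a) + Fraction(b)) <= u
assert rel_err(a * b, Fraction(a) * Fraction(b)) <= u
assert rel_err(a / b, Fraction(a) / Fraction(b)) <= u
```

Note that the exact operands here are the stored doubles, not the decimal literals: $0.1$ and $0.3$ are themselves rounded on input.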
Most floating-point arithmetics adhere to the IEEE standard, which defines several floating-point formats and four different rounding modes.
Another form of finite precision arithmetic is fixed-point arithmetic, in which numbers have the form $\pm m \times \beta^{e-t}$ as before but with a fixed exponent $e$, so all the numbers are equally spaced. In most scientific computations scale factors must be introduced in order to be able to represent the range of numbers occurring. Fixed-point arithmetic is mainly used on special-purpose devices such as FPGAs and in embedded systems.
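A minimal sketch of the idea, using a Q16.16 binary fixed-point format (16 integer and 16 fractional bits, so every value is a multiple of $2^{-16}$; the names `to_fix` and `fix_mul` are illustrative):

```python
SCALE = 16  # numbers are stored as integers scaled by 2**16

def to_fix(x: float) -> int:
    """Round a real number to the nearest Q16.16 fixed-point value."""
    return round(x * (1 << SCALE))

def to_float(f: int) -> float:
    return f / (1 << SCALE)

def fix_mul(a: int, b: int) -> int:
    # The raw product carries a 2**-32 scale; shift back to 2**-16.
    return (a * b) >> SCALE

a, b = to_fix(1.5), to_fix(2.25)
print(to_float(a + b))          # 3.75  (addition needs no rescaling)
print(to_float(fix_mul(a, b)))  # 3.375
```

Multiplication must rescale the result, and any intermediate quantity outside the fixed range overflows, which is why scale factors are needed in practice.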
This is a minimal set of references, which themselves contain further useful references.
- David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic, ACM Computing Surveys 23, 5–48, 1991.
- Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms, second edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002.
- IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019 (Revision of IEEE 754-2008), The Institute of Electrical and Electronics Engineers, New York, 2019.
- Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, and Serge Torres, Handbook of Floating-Point Arithmetic, second edition, Birkhäuser, Boston, MA, 2018.
Related Blog Posts
- A Multiprecision World (2017)
- Book Review Revisited: Overton’s Numerical Computing with IEEE Floating Point Arithmetic (2014)
- Half Precision Arithmetic: fp16 Versus bfloat16 (2018)
- The Rise of Mixed Precision Arithmetic (2015)
- What Is IEEE Standard Arithmetic? (forthcoming)
- What Is Rounding? (2020)