What Is IEEE Standard Arithmetic?

The IEEE Standard 754, published in 1985 and revised in 2008 and 2019, is a standard for binary and decimal floating-point arithmetic. The standard for decimal arithmetic (IEEE Standard 854) was separate when it was first published in 1987, but it was included with the binary standard from 2008. We focus here on the binary part of the standard.

The standard specifies floating-point number formats, the results of the basic floating-point operations and comparisons, rounding modes, floating-point exceptions and their handling, and conversion between different arithmetic formats.

A binary floating-point number is represented as

y = \pm m \times 2^{e-t},

where t is the precision and e\in [e_{\min},e_{\max}] is the exponent. The significand m is an integer satisfying m \le 2^t-1. Numbers with m \ge 2^{t-1} are called normalized. Subnormal numbers, for which 0 < m <2^{t-1} and e = e_{\min}, are supported.

Four formats are defined, whose key parameters are summarized in the following table. The second column shows the number of bits allocated to store the significand and the exponent. I use the prefix “fp” instead of the prefix “binary” used in the standard. The unit roundoff is u = 2^{-t}.

ieee_params_table.jpg Fp32 (single precision) and fp64 (double precision) were in the 1985 standard; fp16 (half precision) and fp128 (quadruple precision) were introduced in 2008. Fp16 is defined only as a storage format, though it is widely used for computation.

The size of these different number systems varies greatly. The next table shows the number of normalized and subnormal numbers in each system.


We see that while one can easily carry out a computation on every fp16 number (to check that the square root function is correctly computed, for example), it is impractical to do so for every double precision number.

A key feature of the standard is that it is a closed system, thanks to the inclusion of NaN (Not a Number) and \infty (usually written as inf in programing languages) as floating-point numbers: every arithmetic operation produces a number in the system. A NaN is generated by operations such as

0/0, \quad 0 \times \infty, \quad \infty/\infty, \quad (+\infty) + (-\infty), \quad \sqrt{-1}.

Arithmetic operations involving a NaN return a NaN as the answer. The number \infty obeys the usual mathematical conventions regarding infinity, such as

\infty+\infty = \infty, \quad (-1)\times\infty = -\infty, \quad   ({\textrm{finite}})/\infty = 0.

This means, for example, that 1 + 1/x evaluates as 1 when x = \infty.

The standard specifies that all arithmetic operations (including square root) are to be performed as if they were first calculated to infinite precision and then rounded to the target format. A number is rounded to the next larger or next smaller floating-point number according to one of four rounding modes:

  • round to the nearest floating-point number, with rounding to even (rounding to the number with a zero least significant bit) in the case of a tie;
  • round towards plus infinity and round towards minus infinity (used in interval arithmetic); and
  • round towards zero (truncation, or chopping).

For round to nearest it follows that

\mathrm{f\kern.2ptl}(x\mathbin{\mathrm{op}} y)     = (x \mathbin{\mathrm{op}} y)(1+\delta),     \quad |\delta|\le u, \quad \mathbin{\mathrm{op}}\in\{+,-,*,/,\sqrt{}\},

where \mathrm{f\kern.2ptl} denotes the computed result. The standard also includes a fused multiply-add operation (FMA), x*y + z. The definition requires it to be computed with just one rounding error, so that \mathrm{f\kern.2ptl}(x*y+z) is the rounded version of x*y+z, and hence satisfies

\mathrm{f\kern.2ptl}(x*y + z) = (x*y + z)(1+\delta), \quad |\delta|\le u.

FMAs are supported in some hardware and are usually executed at the same speed as a single addition or multiplication.

The standard recommends the provision of correctly rounded exponentiation (x^y) and transcendental functions (\exp, \log, \sin, \mathrm{acos}, etc.) and defines domains and special values for them, but these functions are not required.

A new feature of the 2019 standard is augmented arithmetic operations, which compute \mathrm{fl}(x \mathbin{\mathrm{op}}y) along with the error x\mathbin{\mathrm{op}}y - \mathrm{fl}(x \mathbin{\mathrm{op}}y), for \mathbin{\mathrm{op}} = +,-,*. These operations are useful for implementing compensated summation and other special high accuracy algorithms.

William (“Velvel”) Kahan of the University of California at Berkeley received the 1989 ACM Turing Award for his contributions to computer architecture and numerical analysis, and in particular for his work on IEEE floating-point arithmetic standards 754 and 854.


This is a minimal set of references, which contain further useful references within.

This entry was posted in what-is and tagged . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s