What Is a Vector Norm?

A vector norm measures the size, or length, of a vector. For complex $n$-vectors, a vector norm is a function $\|\cdot\| : \mathbb{C}^n \to \mathbb{R}$ satisfying

• $\|x\| \ge 0$ with equality if and only if $x=0$ (nonnegativity),
• $\|\alpha x\| =|\alpha| \|x\|$ for all $\alpha\in\mathbb{C}$ and $x\in\mathbb{C}^n$ (homogeneity),
• $\|x+y\| \le \|x\|+\|y\|$ for all $x,y\in\mathbb{C}^n$ (the triangle inequality).

An important class of norms is the Hölder $p$-norms

$\|x\|_p = \biggl( \displaystyle\sum_{i=1}^n |x_i|^p \biggr)^{1/p}, \quad p\ge 1. \qquad\qquad (1)$

The $\infty$-norm is interpreted as $\lim_{p\to\infty}\|x\|_p$ and is given by

$\notag \|x\|_{\infty} = \displaystyle\max_{1\le i\le n} |x_i|.$

Other important special cases are

\notag \begin{alignedat}{2} \|x\|_1 &= \sum_{i=1}^n |x_i|, &\quad& \mbox{``Manhattan'' or ``taxicab'' norm}, \\ \|x\|_2 &= \biggl( \sum_{i=1}^n |x_i|^2 \biggr)^{1/2} = (x^*x)^{1/2}, &\quad& \mbox{Euclidean length}. \end{alignedat}
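These special cases are easy to verify numerically. Here is a quick check in NumPy of the $1$-, $2$-, and $\infty$-norms of a small vector (the values chosen so that the norms are exact integers):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

def p_norm(x, p):
    # Definition (1): (sum_i |x_i|^p)^(1/p) for p >= 1.
    return np.sum(np.abs(x)**p)**(1.0/p)

one_norm = p_norm(x, 1)          # 3 + 4 = 7
two_norm = p_norm(x, 2)          # sqrt(9 + 16) = 5
inf_norm = np.max(np.abs(x))     # max(3, 4, 0) = 4

print(one_norm, two_norm, inf_norm)  # 7.0 5.0 4.0
```

The same values are returned by `numpy.linalg.norm` with `ord` set to `1`, `2`, or `np.inf`.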

A useful concept is that of the dual norm associated with a given vector norm, which is defined by

$\notag \|y\|^D = \displaystyle\max_{x\ne0} \displaystyle\frac{\mathop{\mathrm{Re}}x^* y}{\|x\|}.$

The maximum is attained because the definition is equivalent to $\|y\|^D = \max\{ \, \mathop{\mathrm{Re}} x^*y: \|x\| = 1\,\}$, in which we are maximizing a continuous function over a compact set. Importantly, the dual of the dual norm is the original norm (Horn and Johnson, 2013, Thm.$\,$ 5.5.9(c)).

We can rewrite the definition of dual norm, using the homogeneity of vector norms, as

$\notag \|y\|^D = \displaystyle\max_{|c| = 1} \| cy \|^D = \max_{|c| = 1} \max_{x\ne 0} \frac{\mathop{\mathrm{Re}} x^*(cy) }{\|x\|} = \max_{x\ne 0} \max_{|c| = 1} \frac{\mathop{\mathrm{Re}} c(x^*y) }{\|x\|} = \max_{x\ne 0} \frac{ |x^*y| }{\|x\|}.$

Hence we have the attainable inequality

$\notag |x^*y| \le \|x\| \, \|y\|^D, \qquad\qquad (2)$

which is the generalized Hölder inequality.

The dual of the $p$-norm is the $q$-norm, where $p^{-1} + q^{-1} = 1$, so for the $p$-norms the inequality (2) becomes the (standard) Hölder inequality,

$\notag |x^*y| \le \|x\|_p \, \|y\|_q, \quad \displaystyle\frac{1}{p} + \frac{1}{q} = 1.$

An important special case is the Cauchy–Schwarz inequality,

$\notag |x^*y| \le \|x\|_2 \, \|y\|_2.$
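As a sanity check, we can verify the Hölder inequality (with, say, $p = 3$, $q = 3/2$) and the Cauchy–Schwarz inequality on random complex vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8) + 1j*rng.standard_normal(8)
y = rng.standard_normal(8) + 1j*rng.standard_normal(8)

p, q = 3.0, 1.5                 # 1/p + 1/q = 1/3 + 2/3 = 1
lhs = abs(np.vdot(x, y))        # np.vdot conjugates its first argument: x^* y
holder = np.linalg.norm(x, p) * np.linalg.norm(y, q)
cauchy = np.linalg.norm(x, 2) * np.linalg.norm(y, 2)

print(lhs, holder, cauchy)      # lhs is bounded by both products
```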

The notation $\|x\|_0$ is used to denote the number of nonzero entries in $x$, even though it is not a vector norm and is not obtained from (1) with $p = 0$. In portfolio optimization, if $x_k$ specifies how much to invest in stock $k$ then the inequality $\|x\|_0 \le k$ says “invest in at most $k$ stocks”.

In numerical linear algebra, vector norms play a crucial role in the definition of a subordinate matrix norm, as we will explain in the next post in this series.

All norms on $\mathbb{C}^n$ are equivalent in the sense that for any two norms $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ there exist positive constants $\nu_1$ and $\nu_2$ such that

$\nu_1\|x\|_\alpha \le \|x\|_\beta \le \nu_2 \|x\|_\alpha \quad \mathrm{for~all}~x\in \mathbb{C}^n.$

For example, it is easy to see that

\notag \begin{aligned} \|x\|_2 &\le \|x\|_1 \le \sqrt{n} \|x\|_2,\\ \|x\|_\infty &\le \|x\|_2 \le \sqrt{n} \|x\|_\infty,\\ \|x\|_\infty &\le \|x\|_1 \le n \|x\|_\infty. \end{aligned}
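These three pairs of inequalities can be checked numerically for a random vector:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
x = rng.standard_normal(n)

n1 = np.linalg.norm(x, 1)
n2 = np.linalg.norm(x, 2)
ninf = np.linalg.norm(x, np.inf)

# The equivalence constants from the displayed inequalities.
print(n2 <= n1 <= np.sqrt(n)*n2)
print(ninf <= n2 <= np.sqrt(n)*ninf)
print(ninf <= n1 <= n*ninf)
```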

The 2-norm is invariant under unitary transformations: if $Q^*Q = I$, then $\|Qx\|^2 = x^*Q^* Qx = x^*x = \|x\|_2^2$.
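We can illustrate the unitary invariance with a $Q$ factor from a QR factorization (here real, so $Q^TQ = I$); the two 2-norms agree up to rounding error:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
Q, _ = np.linalg.qr(A)         # Q has orthonormal columns
x = rng.standard_normal(5)

# ||Qx||_2 = ||x||_2 in exact arithmetic.
print(np.linalg.norm(Q @ x), np.linalg.norm(x))
```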

Care must be taken to avoid overflow and (damaging) underflow when evaluating a vector $p$-norm in floating-point arithmetic for $p\ne 1,\infty$. One can simply use the formula $\|x\|_p = \| (x/\|x\|_{\infty}) \|_p \|x\|_{\infty}$, but this requires two passes over the data (the first to evaluate $\|x\|_{\infty}$). For more efficient one-pass algorithms for the $2$-norm see Higham (2002, Sec. 21.8) and Harayama et al. (2021).
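The two-pass scaling formula is simple to implement. A minimal sketch, for real or complex input:

```python
import numpy as np

def safe_p_norm(x, p):
    # Two-pass evaluation: first find ||x||_inf, then scale by it so
    # that |x_i/t|^p cannot overflow (and damaging underflow is avoided).
    t = np.max(np.abs(x))
    if t == 0:
        return 0.0
    return t * np.sum(np.abs(x/t)**p)**(1.0/p)

x = np.array([1e200, 1e200])
# Naive evaluation overflows: (1e200)^2 = 1e400 is not representable.
print(np.sum(np.abs(x)**2)**0.5)   # inf
print(safe_p_norm(x, 2))           # the correct value, about 1.414e200
```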

References

This is a minimal set of references, which contain further useful references within.

Note: This article was revised on October 12, 2021 to change the definition of dual norm to use $\mathrm{Re}$.

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

Can We Solve Linear Algebra Problems at Extreme Scale and Low Precisions?

The largest dense linear systems being solved today are of order $n = 10^7$, and future exascale computer systems will be able to tackle even larger problems. Rounding error analysis shows that the computed solution satisfies a componentwise backward error bound that, under favorable assumptions, is of order $nu$, where $u$ is the unit roundoff of the floating-point arithmetic: $u \approx 10^{-16}$ for double precision and $u \approx 10^{-8}$ for single precision. This backward error bound cannot guarantee any stability for single precision solution of today’s largest problems and suggests a loss of half the digits in the backward error for double precision.

Half precision floating-point arithmetic is now readily available in hardware, in both the IEEE binary16 format and the bfloat16 format, and it is increasingly being used in machine learning and in scientific computing more generally. For the computation of the inner product of two $n$-vectors the backward error bound is again of order $nu$, and this bound exceeds $1$ for $n \ge 684$ for both half precision formats, suggesting a potentially complete loss of numerical stability. Yet inner products with $n \ge 684$ are successfully used in half precision computations in practice.

The error bounds I have referred to are upper bounds and so bound the worst case over all possible rounding errors. Their main purpose is to reveal potential instabilities rather than to provide realistic error estimates. Yet we do need to know the limits of what we can compute, and for mission-critical applications we need to be able to guarantee a successful computation.

Can we understand the behavior of linear algebra algorithms at extreme scale and in low precision floating-point arithmetics?

To a large extent the answer is yes if we exploit three different features to obtain smaller error bounds.

Blocked Algorithms

Many algorithms are implemented in blocked form. For example, an inner product $x^Ty$ of two $n$-vectors $x$ and $y$ can be computed as

\notag \begin{aligned} s_i &= x((i-1)b+1:ib)^T y((i-1)b+1:ib), \quad i = 1:k,\\ s &= s_1 + s_2 + \dots + s_k, \end{aligned}

where $n = kb$ and $b \ll n$ is the block size. The inner product has been broken into $k$ smaller inner products of size $b$, which are computed independently then summed. Many linear algebra algorithms are blocked in an analogous way, where the blocking is into submatrices with $b$ rows or $b$ columns (or both). Careful rounding error analysis shows that a blocked algorithm has an error bound about a factor of $b$ smaller than that for the corresponding unblocked algorithm. Practical block sizes for matrix algorithms are typically $128$ or $256$, so blocking brings a substantial reduction in the error bounds.
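The blocked inner product above can be sketched in a few lines (the block size `b` here is illustrative):

```python
import numpy as np

def blocked_inner(x, y, b):
    # Split the length-n inner product into n/b partial inner products
    # s_i of size b, computed independently, then sum the s_i.
    n = x.size
    s = [x[i:i+b] @ y[i:i+b] for i in range(0, n, b)]
    return sum(s)

rng = np.random.default_rng(3)
n, b = 1024, 128
x = rng.standard_normal(n)
y = rng.standard_normal(n)
print(blocked_inner(x, y, b), x @ y)  # agree to rounding error
```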

In fact, one can do even better than an error bound of order $(n/b)u$. By computing the sum $s= s_1 + s_2 + \dots + s_k$ with a more accurate summation method the error constant is further reduced to $bu + O(u^2)$ (this is the FABsum method of Blanchard et al. (2020)).
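As a rough sketch of the FABsum idea (not the implementation of Blanchard et al.), one can sum each block with fast ordinary summation and combine the block sums with a more accurate method, here compensated (Kahan) summation:

```python
import numpy as np

def fabsum_sketch(x, b):
    # Sum each block of size b with ordinary recursive summation (fast),
    # then accumulate the block sums with Kahan compensated summation
    # (accurate), so the overall error constant is about b*u.
    s, c = 0.0, 0.0
    for i in range(0, x.size, b):
        block_sum = float(np.sum(x[i:i+b]))
        y = block_sum - c          # apply the running compensation
        t = s + y
        c = (t - s) - y            # recover the rounding error of s + y
        s = t
    return s

x = np.ones(1000, dtype=np.float64)
print(fabsum_sketch(x, 128))  # 1000.0
```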

Architectural Features

Intel x86 processors support an 80-bit extended precision format with a 64-bit significand, which is compatible with that specified in the IEEE standard. When a compiler uses this format with 80-bit registers to accumulate sums and inner products it is effectively working with a unit roundoff of $2^{-64}$ rather than $2^{-53}$ for double precision, giving error bounds smaller by a factor up to $2^{11} = 2048$.

Some processors have a fused multiply–add (FMA) operation, which computes a combined multiplication and addition $x + yz$ with one rounding error instead of two. This results in a reduction in error bounds by a factor $2$.
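The effect of the single rounding can be simulated in software: form $x y + z$ exactly in double precision and round once to single precision, versus rounding the product and the sum separately. With the values below (chosen so that $xy$ is not representable in single precision), the two-rounding result loses the answer entirely:

```python
import numpy as np

# Simulated FMA: one rounding versus two, in single precision.
x = np.float32(1.0 + 2.0**-12)
y = np.float32(1.0 + 2.0**-12)
z = np.float32(-(1.0 + 2.0**-11))

# Two roundings: x*y rounds to 1 + 2^-11, then adding z gives 0.
two_roundings = np.float32(np.float32(x*y) + z)
# One rounding: x*y + z = 2^-24 exactly in double, rounded once.
one_rounding = np.float32(np.float64(x)*np.float64(y) + np.float64(z))

print(two_roundings, one_rounding)  # 0.0 versus 2^-24
```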

Mixed precision block FMA operations $D = C + AB$, with matrices $A,B,C,D$ of fixed size, are available on Google tensor processing units, NVIDIA GPUs, and in the ARMv8-A architecture. For half precision inputs these devices can produce results of single precision quality, which can give a significant boost in accuracy when block FMAs are chained together to form a matrix product of arbitrary dimension.

Probabilistic Bounds

Worst-case rounding error bounds suffer from the problem that they are not attainable for most specific sets of data and are unlikely to be nearly attained. Stewart (1990) noted that

To be realistic, we must prune away the unlikely. What is left is necessarily a probabilistic statement.

Theo Mary and I have recently developed probabilistic rounding error analysis, which makes probabilistic assumptions on the rounding errors and derives bounds that hold with a certain probability. The key feature of the bounds is that they are proportional to $\sqrt{n}u$ when a corresponding worst-case bound is proportional to $nu$. In the most general form of the analysis (Connolly, Higham, and Mary, 2021), the rounding errors are assumed to be mean independent and of mean zero, where mean independence is a weaker assumption than independence.
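A small experiment illustrates the phenomenon: for random data, the relative error of a single precision inner product is typically far below the worst-case bound of order $nu$ and closer to the $\sqrt{n}u$ scale (note that NumPy's `dot` also benefits from blocked summation in the underlying BLAS):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10**5
u = 2.0**-24                   # unit roundoff of IEEE single precision
x = rng.standard_normal(n).astype(np.float32)
y = rng.standard_normal(n).astype(np.float32)

exact = np.dot(x.astype(np.float64), y.astype(np.float64))
err = abs(float(np.dot(x, y)) - exact)
scale = float(np.sum(np.abs(x.astype(np.float64)*y.astype(np.float64))))

print(err/scale, np.sqrt(n)*u, n*u)  # observed error vs. the two bounds
```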

Putting the Pieces Together

The different features we have described can be combined to obtain significantly smaller error bounds. If we use a blocked algorithm with block size $b \ll n$ then in an inner product the standard error bound of order $nu$ reduces to a probabilistic bound of order $(\sqrt{n/b})u$, which is a significant reduction. Block FMAs and extended precision registers provide further reductions.

For example, for a linear system of order $10^7$ solved in single precision with a block size of $256$, the probabilistic error bound is of order $10^{-5}$ versus $1$ for the standard worst-case bound. If FABsum is used then the bound is further reduced.
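The arithmetic behind this example is simple to reproduce:

```python
import math

# Back-of-envelope for the example in the text: a system of order 10^7
# in single precision (u = 2^-24) with block size 256.
n, b, u = 10**7, 256, 2.0**-24

worst_case = n*u                   # standard bound: about 0.6
probabilistic = math.sqrt(n/b)*u   # about 1.2e-5

print(worst_case, probabilistic)
```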

Our conclusion is that we can successfully solve linear algebra problems of greater size and at lower precisions than the standard rounding error analysis suggests. A priori bounds will always be pessimistic, though. One should compute a posteriori residuals or backward errors (depending on the problem) in order to assess the quality of a numerical solution.

For full details of the work summarized here, see Higham (2021).