Seven Sins of Numerical Linear Algebra

In numerical linear algebra we are concerned with solving linear algebra problems accurately and efficiently and understanding the sensitivity of the problems to perturbations. We describe seven sins, whereby accuracy or efficiency is lost or misleading information about sensitivity is obtained.

1. Inverting a Matrix

In linear algebra courses we learn that the solution to a linear system Ax = b of n equations in n unknowns can be written x = A^{-1}b, where A^{-1} is the matrix inverse. What is not always emphasized is that there are very few circumstances in which one should compute A^{-1}. Indeed one would not solve the scalar (n=1) system 7x = 21 by computing x = 7^{-1} \times 21, but rather would carry out a division x = 21/7. In the n\times n case, it is faster and more accurate to solve a linear system by LU factorization (Gaussian elimination) with partial pivoting than by inverting A (which has, in any case, to be done by LU factorization).

Rare cases where A^{-1} is required are in statistics, where the diagonal elements of the inverse of the covariance matrix are relevant quantities, and in certain algorithms for computing matrix functions.
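As a minimal sketch of the recommended approach, the following pure-Python routine solves Ax = b by Gaussian elimination with partial pivoting and never forms A^{-1}. The function name solve_gepp and the list-of-lists storage are assumptions made for illustration; in practice one would call a library LU solver.

```python
def solve_gepp(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Work on an augmented copy [A | b] so the inputs are left untouched.
    M = [[float(a) for a in row] + [float(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        # Partial pivoting: move the largest entry of column k to the pivot row.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            m = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= m * M[k][j]
    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

print(solve_gepp([[7]], [21]))   # the scalar system 7x = 21: a single division
```

For n = 1 the routine reduces to the division 21/7, which is exactly the point of the scalar analogy.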

2. Forming the Cross-Product Matrix A^TA

The solution to the linear least squares problem \min_x\| b - Ax \|_2, where A is a full-rank m\times n matrix with m\ge n, satisfies the normal equations A^T\!A x = A^Tb. It is therefore natural to form the symmetric positive definite matrix A^T\!A and solve the normal equations by Cholesky factorization. While fast, this method is numerically unstable when A is ill conditioned. By contrast, solving the least squares problem via QR factorization is always numerically stable.

What is wrong with the cross-product matrix A^T\!A (also known as the Gram matrix)? It squares the data, which can cause a loss of information in floating-point arithmetic. For example, if

A = \begin{bmatrix} 1 & 1 \\ \epsilon & 0 \end{bmatrix},        \quad  0 < \epsilon < \sqrt{u},

where u is the unit roundoff of the floating-point arithmetic, then

A^T\!A = \begin{bmatrix} 1 + \epsilon^2 & 1 \\                              1              & 1 \end{bmatrix}

is positive definite but, since \epsilon^2<u, in floating-point arithmetic 1+\epsilon^2 rounds to 1 and so

\mathrm{f\kern.2ptl}( A^T\!A) = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},

which is singular, and the information in \epsilon has been lost.

Another problem with the cross-product matrix is that the 2-norm condition number of A^T\!A is the square of that of A, and this leads to numerical instability in algorithms that work with A^T\!A when the condition number is large.
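The loss of information can be reproduced in IEEE double precision, where u = 2^{-53}. The sketch below is illustrative only: it picks \epsilon = 2^{-27} (an assumption satisfying \epsilon < \sqrt{u}) and forms the cross-product matrix once in floating-point arithmetic and once exactly, using Python's Fraction type.

```python
from fractions import Fraction

def gram(A):
    """Return A^T A for a matrix stored as a list of rows."""
    n = len(A[0])
    return [[sum(row[i] * row[j] for row in A) for j in range(n)]
            for i in range(n)]

def det2(G):
    """Determinant of a 2 x 2 matrix."""
    return G[0][0] * G[1][1] - G[0][1] * G[1][0]

eps = 2.0 ** -27            # eps**2 = 2**-54 < u = 2**-53
A = [[1.0, 1.0], [eps, 0.0]]

G_float = gram(A)                                         # rounded arithmetic
G_exact = gram([[Fraction(a) for a in row] for row in A])  # exact arithmetic

print(det2(G_float))   # computed cross-product matrix is exactly singular
print(det2(G_exact))   # exact determinant is eps**2 > 0
```

The exact determinant is \epsilon^2, which is positive, while the computed matrix has determinant zero.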

3. Evaluating Matrix Products in an Inefficient Order

The cost of evaluating a matrix product depends on the order in which the product is evaluated (assuming the matrices are not all n\times n). More precisely, matrix multiplication is associative, so A(BC) = (AB)C, and in general the cost of the evaluation of a product depends on where one puts the parentheses. One order may be much superior to others, so one should not simply evaluate the product in a fixed left-right or right-left order. For example, if x, y, and z are n-vectors then xy^Tz can be evaluated as

  • (xy^T)z: a vector outer product followed by a matrix–vector product, costing O(n^2) operations, or
  • x (y^Tz): a vector scalar product followed by a vector scaling, costing just O(n) operations.

In general, finding where to put the parentheses in a matrix product A_1A_2\dots A_k in order to minimize the operation count is a difficult problem, but for many cases that arise in practice it is easy to determine a good order.
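To make the operation counts concrete, here is a small sketch that counts scalar multiplications for a chain of matrices given only their dimensions. The function names are hypothetical. The second function uses the classical dynamic programming recurrence for the matrix chain problem, which is one standard way to search over parenthesizations when k is modest; it is offered as an illustration rather than as the method of choice.

```python
def left_to_right_cost(dims):
    """Multiplications for ((A_1 A_2) A_3)..., where A_i is dims[i-1] x dims[i]."""
    cost = 0
    for k in range(1, len(dims) - 1):
        # The running product always has dims[0] rows.
        cost += dims[0] * dims[k] * dims[k + 1]
    return cost

def optimal_cost(dims):
    """Minimum multiplications over all parenthesizations (dynamic programming)."""
    k = len(dims) - 1                       # number of matrices in the chain
    best = [[0] * k for _ in range(k)]      # best[i][j]: cost of A_i ... A_j (0-based)
    for length in range(2, k + 1):
        for i in range(k - length + 1):
            j = i + length - 1
            best[i][j] = min(
                best[i][s] + best[s + 1][j] + dims[i] * dims[s + 1] * dims[j + 1]
                for s in range(i, j)
            )
    return best[0][k - 1]

n = 1000
dims = [n, 1, n, 1]     # x is n x 1, y^T is 1 x n, z is n x 1
print(left_to_right_cost(dims), optimal_cost(dims))
```

For xy^Tz the left-to-right order costs 2n^2 multiplications, whereas the best order, x(y^Tz), costs 2n.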

4. Assuming that a Matrix is Positive Definite

Symmetric positive definite matrices (symmetric matrices with positive eigenvalues) are ubiquitous, not least because they arise in the solution of many minimization problems. However, a matrix that is supposed to be positive definite may fail to be so for a variety of reasons. Missing or inconsistent data in forming a covariance matrix or a correlation matrix can cause a loss of definiteness, and rounding errors can cause a tiny positive eigenvalue to go negative.

Definiteness implies that

  • the diagonal entries are positive,
  • \det(A) > 0,
  • |a_{ij}| < \sqrt{a_{ii}a_{jj}} for all i \ne j,

but none of these conditions, or even all taken together, guarantees that the matrix has positive eigenvalues.

The best way to check definiteness is to compute a Cholesky factorization, which is often needed anyway. The MATLAB function chol returns an error message if the factorization fails, and a second output argument can be requested, which is set to the number of the stage on which the factorization failed, or to zero if the factorization succeeded. In the case of failure, the partially computed R factor is returned in the first argument, and it can be used to compute a direction of negative curvature (as needed in optimization), for example.

This sin takes the top spot in Schmelzer and Hauser’s Seven Sins in Portfolio Optimization, because in portfolio optimization a negative eigenvalue in the covariance matrix can identify a portfolio with negative variance, promising an arbitrarily large investment with no risk!
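The following sketch illustrates both points of this section on a 4\times 4 example of our own choosing. The matrix has positive diagonal, positive determinant, and off-diagonal entries satisfying |a_{ij}| < \sqrt{a_{ii}a_{jj}}, yet it has two negative eigenvalues (its eigenvalues are 1 \pm 0.9\sqrt{2}, each repeated twice). The function chol_stage is a hypothetical pure-Python analogue of a Cholesky routine that reports the stage of failure; it is not MATLAB's chol.

```python
import math

def chol_stage(A):
    """Attempt A = R^T R. Return (R, 0) on success, or (R, k) if stage k fails."""
    n = len(A)
    R = [[0.0] * n for _ in range(n)]
    for k in range(n):
        d = A[k][k] - sum(R[i][k] ** 2 for i in range(k))
        if d <= 0.0:
            return R, k + 1          # pivot not positive: A is not positive definite
        R[k][k] = math.sqrt(d)
        for j in range(k + 1, n):
            s = sum(R[i][k] * R[i][j] for i in range(k))
            R[k][j] = (A[k][j] - s) / R[k][k]
    return R, 0

def det(A):
    """Determinant by cofactor expansion (fine for tiny matrices only)."""
    if len(A) == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] * det([row[:j] + row[j + 1:] for row in A[1:]])
               for j in range(len(A)))

A = [[1.0, 0.0, 0.9, 0.9],
     [0.0, 1.0, 0.9, -0.9],
     [0.9, 0.9, 1.0, 0.0],
     [0.9, -0.9, 0.0, 1.0]]

R, p = chol_stage(A)
print("det(A) =", det(A))          # positive, although A is indefinite
print("Cholesky failed at stage", p)
```

All three necessary conditions hold for this A, but the factorization breaks down at the third stage, which reveals the indefiniteness.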

5. Not Exploiting Structure in the Matrix

One of the fundamental tenets of numerical linear algebra is that one should try to exploit any matrix structure that might be present. Sparsity (a matrix having a large number of zeros) is particularly important to exploit, since algorithms intended for dense matrices may be impractical for sparse matrices because of extensive fill-in (zeros becoming nonzero). Here are two examples of structures that can be exploited.

Matrices from saddle point problems are symmetric indefinite and of the form

\notag  C =  \begin{bmatrix} A & B^T \\ B & 0       \end{bmatrix},

with A symmetric positive definite. Much work has been done on developing numerical methods for solving Cx = b that exploit the block structure and possible sparsity in A and B. A second example is a circulant matrix

\notag    C = \begin{bmatrix} c_1     & c_2    & \dots   & c_n     \\                        c_n     & c_1    & \dots   & \vdots  \\                        \vdots  & \ddots & \ddots  & c_2     \\                        c_2     & \dots  & c_n     & c_1     \\      \end{bmatrix}.

Circulant matrices have the important property that they are diagonalized by a unitary matrix called the discrete Fourier transform matrix. Using this property one can solve Cx = v in O(n \log_2n) operations, rather than the O(n^3) operations required if the circulant structure is ignored.
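Here is a sketch of the fast circulant solve, assuming n is a power of 2 and that C is nonsingular with real data. With C laid out as above (first row c_1, c_2, \dots, c_n), the first column is c_1, c_n, \dots, c_2, and Cx is the circular convolution of that column with x, so the system decouples after a discrete Fourier transform. The radix-2 FFT and the name circulant_solve are written here for illustration; a library FFT would normally be used.

```python
import cmath

def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    sign = 1 if inverse else -1
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def circulant_solve(c, v):
    """Solve Cx = v, where the first row of the circulant C is c."""
    n = len(c)
    assert n & (n - 1) == 0, "this sketch assumes n is a power of 2"
    col = [c[(-i) % n] for i in range(n)]     # first column of C
    lam = fft(col)                            # eigenvalues of C
    vh = fft(v)
    xh = [vh[k] / lam[k] for k in range(n)]   # diagonal solve in Fourier space
    return [z.real / n for z in fft(xh, inverse=True)]

c = [5, 1, 2, 0]
x_true = [1, 2, 3, 4]
C = [[c[(j - i) % 4] for j in range(4)] for i in range(4)]
v = [sum(C[i][j] * x_true[j] for j in range(4)) for i in range(4)]
print(circulant_solve(c, v))
```

The whole computation is three FFTs and n divisions, which is where the O(n \log_2 n) cost comes from.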

Ideally, linear algebra software would detect structure in a matrix and call an algorithm that exploits that structure. A notable example of such a meta-algorithm is the MATLAB backslash function x = A\b for solving Ax = b. Backslash checks whether the matrix is triangular (or a permutation of a triangular matrix), upper Hessenberg, symmetric, or symmetric positive definite, and applies an appropriate method. It also allows A to be rectangular and solves the least squares problem if there are more rows than columns and the underdetermined system if there are more columns than rows.

6. Using the Determinant to Detect Near Singularity

An n\times n matrix A is nonsingular if and only if its determinant is nonzero. One might therefore expect that a small value for \det(A) indicates a matrix that is nearly singular. However, the size of \det(A) tells us nothing about near singularity. Indeed, since \det(\alpha A) = \alpha^n \det(A) we can achieve any value for the determinant by multiplying by a scalar \alpha, yet \alpha A is no more or less nearly singular than A for \alpha \ne 0.

Another limitation of the determinant is shown by the two matrices

\notag  T =  \begin{bmatrix}    1 & -1 & -1 & \dots  & -1\\      &  1 & -1 & \dots  & -1\\      &    &  1 & \dots  & \vdots\\      &    &    & \ddots & -1 \\      &    &    &        & 1   \end{bmatrix}, \quad  U =  \begin{bmatrix}    1 &  1 &  1 & \dots  &  1\\      &  1 &  1 & \dots  &  1\\      &    &  1 & \dots  & \vdots\\      &    &    & \ddots &  1 \\      &    &    &        & 1  \end{bmatrix}  \qquad (1)

Both matrices have unit diagonal and off-diagonal elements bounded in modulus by 1. So \det(T) = \det(U) = 1, yet

\notag  T^{-1} =  \begin{bmatrix}    1 &  1 & 2  & \dots  & 2^{n-2}\\      &  1 &  1 & \dots  & \vdots\\      &    &  1 & \ddots  & 2\\      &    &    & \ddots & 1 \\      &    &    &        & 1   \end{bmatrix}, \quad  U^{-1} =  \begin{bmatrix}    1 &  -1 &    &        &   \\      &  1 &  -1 &        &   \\      &    &  1 & \ddots  &       \\      &    &    & \ddots & -1 \\      &    &    &        & 1  \end{bmatrix}.

So T is ill conditioned for large n. In fact, if we change the (n,1) element of T to -2^{2-n} then the matrix becomes singular! By contrast, U is always very well conditioned. The determinant cannot distinguish between the ill-conditioned T and the well-conditioned U.
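These claims can be checked in exact rational arithmetic. The sketch below (with helper names of our own) builds T and U for n = 10, confirms that both determinants are 1, compares the largest entries of the two inverses, and verifies that a perturbation of size 2^{2-n} in the (n,1) position of T produces a singular matrix.

```python
from fractions import Fraction

def unit_upper(n, off):
    """n x n upper triangular matrix with unit diagonal and `off` above it."""
    return [[Fraction(1) if i == j else Fraction(off) if i < j else Fraction(0)
             for j in range(n)] for i in range(n)]

def det(A):
    """Exact determinant by Gaussian elimination in rational arithmetic."""
    A = [row[:] for row in A]
    n, d = len(A), Fraction(1)
    for k in range(n):
        p = next((i for i in range(k, n) if A[i][k] != 0), None)
        if p is None:
            return Fraction(0)
        if p != k:
            A[k], A[p] = A[p], A[k]
            d = -d
        d *= A[k][k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
    return d

def inv_upper(T):
    """Exact inverse of a nonsingular upper triangular matrix, column by column."""
    n = len(T)
    X = [[Fraction(0)] * n for _ in range(n)]
    for j in range(n):
        for i in range(j, -1, -1):
            s = sum(T[i][k] * X[k][j] for k in range(i + 1, j + 1))
            X[i][j] = ((1 if i == j else 0) - s) / T[i][i]
    return X

def max_abs(A):
    return max(abs(a) for row in A for a in row)

n = 10
T, U = unit_upper(n, -1), unit_upper(n, 1)
print(det(T), det(U))                                 # identical determinants
print(max_abs(inv_upper(T)), max_abs(inv_upper(U)))   # very different inverses

T_pert = [row[:] for row in T]
T_pert[n - 1][0] = -Fraction(1, 2 ** (n - 2))         # tiny change in the (n,1) entry
print(det(T_pert))
```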

7. Using Eigenvalues to Estimate Conditioning

For any n\times n matrix A and any consistent matrix norm it is true that \|A\| \ge |\lambda_i| for all i, where the \lambda_i are the eigenvalues of A. Since the eigenvalues of A^{-1} are \lambda_i^{-1}, it follows that the matrix condition number \kappa(A) = \|A\| \, \|A^{-1}\| is bounded below by the ratio of largest to smallest eigenvalue in absolute value, that is,

\notag      \kappa(A) \ge \displaystyle\frac{ \max_i |\lambda_i| }                                            { \min_i |\lambda_i| }.

But as the matrix T in (1) shows, this bound can be very weak.

It is singular values not eigenvalues that characterize the condition number for the 2-norm. Specifically,

\notag        \kappa_2(A) = \displaystyle\frac{\sigma_1}{\sigma_n},

where A = U\Sigma V^T is a singular value decomposition (SVD), with U and V orthogonal and \Sigma = \mathrm{diag}(\sigma_i), \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0. If A is symmetric, for example, then the sets \{ |\lambda_i| \} and \{\sigma_i \} are the same, but in general the eigenvalues \lambda_i and singular values \sigma_i can be very different.
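For 2\times 2 matrices both quantities have closed forms, which gives a compact illustration. The sketch below (function names are our own) uses \sigma_1^2 + \sigma_2^2 = \|A\|_F^2 and \sigma_1\sigma_2 = |\det(A)| to obtain \kappa_2(A) = \sigma_1^2/|\det(A)|, and the quadratic formula for the eigenvalues.

```python
import cmath
import math

def eig_ratio_2x2(a, b, c, d):
    """max|lambda| / min|lambda| for the matrix [[a, b], [c, d]]."""
    tr, det = a + d, a * d - b * c
    root = cmath.sqrt(tr * tr - 4 * det)
    lams = [abs((tr + root) / 2), abs((tr - root) / 2)]
    return max(lams) / min(lams)

def cond2_2x2(a, b, c, d):
    """kappa_2 = sigma_1 / sigma_2 for a nonsingular 2 x 2 matrix."""
    s = a * a + b * b + c * c + d * d          # ||A||_F^2 = sigma_1^2 + sigma_2^2
    det = abs(a * d - b * c)                   # |det(A)| = sigma_1 * sigma_2
    disc = max(0.0, s * s - 4 * det * det)     # guard against tiny negative rounding
    sigma1_sq = (s + math.sqrt(disc)) / 2
    return sigma1_sq / det                     # sigma_1/sigma_2 = sigma_1^2/|det|

# Nonnormal matrix: both eigenvalues equal 1, but the matrix is ill conditioned.
print(eig_ratio_2x2(1.0, 1000.0, 0.0, 1.0), cond2_2x2(1.0, 1000.0, 0.0, 1.0))

# Symmetric matrix: the two quantities agree.
print(eig_ratio_2x2(2.0, 1.0, 1.0, 2.0), cond2_2x2(2.0, 1.0, 1.0, 2.0))
```

For the first matrix the eigenvalue ratio is 1 while \kappa_2 is about 10^6.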

Can We Solve Linear Algebra Problems at Extreme Scale and Low Precisions?


Figure: The Fugaku supercomputer, which tops the HPL-AI mixed-precision benchmark in the June 2021 TOP500 list. It solved a linear system of order 10^7 using IEEE half precision arithmetic for most of the computations.

The largest dense linear systems being solved today are of order n = 10^7, and future exascale computer systems will be able to tackle even larger problems. Rounding error analysis shows that the computed solution satisfies a componentwise backward error bound that, under favorable assumptions, is of order nu, where u is the unit roundoff of the floating-point arithmetic: u \approx 10^{-16} for double precision and u \approx 10^{-8} for single precision. This backward error bound cannot guarantee any stability for single precision solution of today’s largest problems and suggests a loss of half the digits in the backward error for double precision.

Half precision floating-point arithmetic is now readily available in hardware, in both the IEEE binary16 format and the bfloat16 format, and it is increasingly being used in machine learning and in scientific computing more generally. For the computation of the inner product of two n-vectors the backward error bound is again of order nu, and this bound exceeds 1 for n \ge 2048 for both half precision formats, suggesting a potentially complete loss of numerical stability. Yet inner products with n \ge 2048 are successfully used in half precision computations in practice.

The error bounds I have referred to are upper bounds and so bound the worst case over all possible rounding errors. Their main purpose is to reveal potential instabilities rather than to provide realistic error estimates. Yet we do need to know the limits of what we can compute, and for mission-critical applications we need to be able to guarantee a successful computation.

Can we understand the behavior of linear algebra algorithms at extreme scale and in low precision floating-point arithmetics?

To a large extent the answer is yes if we exploit three different features to obtain smaller error bounds.

Blocked Algorithms

Many algorithms are implemented in blocked form. For example, an inner product x^Ty of two n-vectors x and y can be computed as

\notag \begin{aligned} s_i &= x((i-1)b+1:ib)^T y((i-1)b+1:ib), \quad i = 1:k,\\ s   &= s_1 + s_2 + \dots + s_k, \end{aligned}

where n = kb and b \ll n is the block size. The inner product has been broken into k smaller inner products of size b, which are computed independently then summed. Many linear algebra algorithms are blocked in an analogous way, where the blocking is into submatrices with b rows or b columns (or both). Careful error analysis shows that a blocked algorithm has an error bound about a factor of b smaller than that for the corresponding unblocked algorithm. Practical block sizes for matrix algorithms are typically 128 or 256, so blocking brings a substantial reduction in the error bounds.
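The blocked inner product can be sketched in a few lines. The code below simulates single precision in pure Python by rounding every operation through the struct module (an assumption made so that the example needs no external libraries), and compares the unblocked and blocked results with a reference sum computed by math.fsum. The test data, block size, and function names are illustrative choices.

```python
import math
import random
import struct

def f32(x):
    """Round a Python float (a double) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_f32(x, y):
    """Left-to-right inner product with every operation rounded to single."""
    s = 0.0
    for xi, yi in zip(x, y):
        s = f32(s + f32(xi * yi))
    return s

def blocked_dot_f32(x, y, b):
    """Inner product formed from block inner products of size b, then summed."""
    partial = [dot_f32(x[i:i + b], y[i:i + b]) for i in range(0, len(x), b)]
    s = 0.0
    for p in partial:
        s = f32(s + p)
    return s

random.seed(1)
n, b = 4096, 256
x = [f32(random.uniform(0.0, 1.0)) for _ in range(n)]
y = [f32(random.uniform(0.0, 1.0)) for _ in range(n)]

# Products of two single precision numbers are exact in double precision.
exact = math.fsum(xi * yi for xi, yi in zip(x, y))
for name, s in [("unblocked", dot_f32(x, y)), ("blocked", blocked_dot_f32(x, y, b))]:
    print(name, abs(s - exact) / exact)
```

Because the data are positive, the relative error printed here coincides with the backward error of the computed inner product.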


Figure: Backward errors for the inner product of two vectors with elements of the form -0.25 + randn, computed in single precision in MATLAB with block size 256.

In fact, one can do even better than an error bound of order (n/b)u. By computing the sum s= s_1 + s_2 + \dots + s_k with a more accurate summation method the error constant is further reduced to bu + O(u^2) (this is the FABsum method of Blanchard et al. (2020)).

Architectural Features

Intel x86 processors support an 80-bit extended precision format with a 64-bit significand, which is compatible with that specified in the IEEE standard. When a compiler uses this format with 80-bit registers to accumulate sums and inner products it is effectively working with a unit roundoff of 2^{-64} rather than 2^{-53} for double precision, giving error bounds smaller by a factor up to 2^{11} = 2048.

Some processors have a fused multiply–add (FMA) operation, which computes a combined multiplication and addition x + yz with one rounding error instead of two. This results in a reduction in error bounds by a factor 2.

Mixed precision block FMA operations D = C + AB, with matrices A,B,C,D of fixed size, are available on Google tensor processing units, NVIDIA GPUs, and in the ARMv8-A architecture. For half precision inputs these devices can produce results of single precision quality, which can give a significant boost in accuracy when block FMAs are chained together to form a matrix product of arbitrary dimension.

Probabilistic Bounds

Worst-case rounding error bounds suffer from the problem that they are not attainable for most specific sets of data and are unlikely to be nearly attained. Stewart (1990) noted that

To be realistic, we must prune away the unlikely. What is left is necessarily a probabilistic statement.

Theo Mary and I have recently developed probabilistic rounding error analysis, which makes probabilistic assumptions on the rounding errors and derives bounds that hold with a certain probability. The key feature of the bounds is that they are proportional to \sqrt{n}u when a corresponding worst-case bound is proportional to nu. In the most general form of the analysis (Connolly, Higham, and Mary, 2021), the rounding errors are assumed to be mean independent and of mean zero, where mean independence is a weaker assumption than independence.

Putting the Pieces Together

The different features we have described can be combined to obtain significantly smaller error bounds. If we use a blocked algorithm with block size b \ll n then in an inner product the standard error bound of order nu reduces to a probabilistic bound of order (\sqrt{n/b})u, which is a significant reduction. Block FMAs and extended precision registers provide further reductions.

For example, for a linear system of order 10^7 solved in single precision with a block size of 256, the probabilistic error bound is of order 10^{-5} versus 1 for the standard worst-case bound. If FABsum is used then the bound is further reduced.
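The orders of magnitude quoted here follow from simple arithmetic, sketched below with u = 2^{-24} for single precision. The function name and dictionary keys are our own, and constants in the bounds are ignored, so the numbers are indicative only.

```python
import math

def error_bound_orders(n, b, u):
    """Leading-order sizes of the bounds discussed above (constants omitted)."""
    return {
        "worst-case, unblocked": n * u,
        "worst-case, blocked": (n / b) * u,
        "probabilistic, unblocked": math.sqrt(n) * u,
        "probabilistic, blocked": math.sqrt(n / b) * u,
    }

u_single = 2.0 ** -24
for name, value in error_bound_orders(10 ** 7, 256, u_single).items():
    print(f"{name:26s} {value:.1e}")
```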

Our conclusion is that we can successfully solve linear algebra problems of greater size and at lower precisions than the standard rounding error analysis suggests. A priori bounds will always be pessimistic, though. One should compute a posteriori residuals or backward errors (depending on the problem) in order to assess the quality of a numerical solution.

For full details of the work summarized here, see Higham (2021).


Bounds for the Matrix Condition Number

We present a selection of bounds for the condition number \kappa(A) = \|A\| \|A^{-1}\| of a nonsingular matrix A\in\mathbb{C}^{n\times n} in terms of quantities that might be known or can be estimated.

General Matrices

From the inequality \|A\| \ge \rho(A), for any matrix norm, where \rho(A) is the spectral radius (the largest magnitude of any eigenvalue of A) we have

\notag       \kappa(A) \ge \rho(A) \rho(A^{-1}).  \qquad (1)

For the 2-norm, this bound is an equality for a normal matrix (one for which A^*A = AA^*), but it can be arbitrarily weak for nonnormal matrices.

Guggenheimer, Edelman, and Johnson (1995) obtain the bound

\notag       \kappa_2(A) < \displaystyle\frac{2}{|\det(A)|}                  \left( \frac{\|A\|_F}{n^{1/2}} \right)^n. \qquad (2)

The proof of the bound applies the arithmetic–geometric mean inequality to the n numbers \sigma_1^2/2, \sigma_1^2/2, \sigma_2^2, \sigma_3^2, \dots, \sigma_{n-1}^2, where the \sigma_i are the singular values of A. This bound can be arbitrarily weak but it is an approximate equality when \sigma_1, \sigma_2, \dots, \sigma_{n-1} are of similar order of magnitude.

Merikoski, Urpala, Virtanen, Tam, and Uhlig (1997) obtain the bound

\notag  \kappa_2(A) \le  \left(\displaystyle\frac{1+x}{1-x}\right)^{1/2}, \quad      x = \sqrt{1 - (n/\|A\|_F^2)^n |\det(A)|^2 }. \qquad (3)

Their proof uses a more refined application of the arithmetic–geometric mean inequality, and they show that this bound is the smallest that can be obtained based on \|A\|_F, \det(A), and n only. Hence (3) is no larger than (2), and they show that it can be smaller by no more than 1.5. Equality holds in (3) if and only if \sigma_2^2 = \sigma_3^2 = \cdots = \sigma_{n-1}^2 = (\sigma_1^2 + \sigma_n^2)/2.

As an example, here are the bounds (2) and (3) for three random 25\times 25 matrices with \kappa_2(A) = 10, generated by gallery('randsvd') with three different singular value distributions:

Singular value distribution                 Bound (2)   Bound (3)
One large singular value                    9.88e+07    9.88e+07
One small singular value                    1.21e+01    1.20e+01
Geometrically distributed singular values   5.71e+04    5.71e+04

We note that for larger \kappa_2(A) the formula (3) is prone to overflow, which can be avoided by evaluating it in higher precision arithmetic.
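Since (2) and (3) depend only on \|A\|_F, |\det(A)|, and n, they can be evaluated directly from a list of singular values. The sketch below (function names are our own) does this for the "one small singular value" distribution with n = 25 and \kappa_2(A) = 10, that is, 24 singular values equal to 1 and one equal to 0.1. It evaluates (3) through the identity (1+x)/(1-x) = (1+x)^2/(1-x^2) to avoid forming 1 - x explicitly.

```python
import math

def bound2(sigma):
    """Bound (2), evaluated from the singular values of A."""
    n = len(sigma)
    fro2 = sum(s * s for s in sigma)           # ||A||_F^2
    absdet = math.prod(sigma)                  # |det(A)|
    return 2.0 / absdet * (fro2 / n) ** (n / 2)

def bound3(sigma):
    """Bound (3), evaluated from the singular values of A."""
    n = len(sigma)
    fro2 = sum(s * s for s in sigma)
    absdet = math.prod(sigma)
    t = (n / fro2) ** n * absdet ** 2          # t = 1 - x^2
    x = math.sqrt(max(0.0, 1.0 - t))           # guard against t rounding above 1
    return (1.0 + x) / math.sqrt(t)            # ((1+x)/(1-x))^(1/2)

sigma = [1.0] * 24 + [0.1]                     # kappa_2(A) = 10
print(f"{bound2(sigma):.2e} {bound3(sigma):.2e}")
```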

Hermitian Positive Definite Matrices

Merikoski et al. (1997) also give a version of (3) for Hermitian positive definite A\in\mathbb{C}^{n\times n}:

\kappa_2(A) \le \displaystyle\frac{1+x}{1-x}, \quad      x = \sqrt{1 - (n/\mathrm{trace}(A))^n \det(A) }.     \qquad (4)

This is the smallest bound that can be obtained based on \mathrm{trace}(A), \det(A), and n only. Equality holds in (4) if and only if the eigenvalues \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n of A satisfy \lambda_2 = \lambda_3 = \cdots = \lambda_{n-1} = (\lambda_1 + \lambda_n)/2. We can rewrite this upper bound as

\displaystyle\frac{1+x}{1-x} = \frac{(1+x)^2}{1-x^2}                < \frac{4}{1-x^2},

which gives the weaker bound

\notag   \kappa_2(A) < \displaystyle\frac{4}{\det(A)} \Bigl(\displaystyle\frac{\mathrm{trace}(A)}{n}\Bigr)^n.     \qquad (5)

This bound is analogous to (2) and is up to a factor 4 larger than (4), this factor being attained for A = I.

If \mathrm{trace}(A) = n then (4) reduces to

\notag \begin{aligned} \kappa_2(A) &\le \displaystyle\frac{1 + \sqrt{1-\det(A)}}{1 - \sqrt{1-\det(A)}} = \displaystyle\frac{\bigl(1 + \sqrt{1-\det(A)}\,\bigr)^2}{\det(A)} \qquad(6)\\ &< \displaystyle\frac{4}{\det(A)}. \end{aligned}

These bounds hold for any positive definite matrix with unit diagonal, that is, any nonsingular correlation matrix.

We can sometimes get a sharper bound than (4) and (5) by writing A = DCD, where D = \mathrm{diag}(a_{ii}^{1/2}) and c_{ii} \equiv 1 (thus C is a correlation matrix), using

\notag \kappa_2(A) \le \kappa_2(D)^2 \kappa_2(C)           = \displaystyle\frac{\max_i a_{ii}}{\min_i a_{ii}} \kappa_2(C), \qquad (7)

and bounding \kappa_2(C) using (6). For example, for the 5\times 5 Pascal matrix

\notag P_5 = \left[\begin{array}{ccccc} 1 & 1 & 1 & 1 & 1\\ 1 & 2 & 3 & 4 & 5\\ 1 & 3 & 6 & 10 & 15\\ 1 & 4 & 10 & 20 & 35\\ 1 & 5 & 15 & 35 & 70 \end{array}\right]

the condition number is \kappa_2(P_5) = 8.52 \times 10^3. The bounds from (4) and (5) are both 1.22 \times 10^7, whereas combining (4) and (7) gives a bound of 4.70 \times 10^6.
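The three numbers for P_5 can be reproduced as follows. This is a sketch with helper names of our own; the determinant is computed exactly in rational arithmetic, and the scaled matrix C = D^{-1}P_5D^{-1} enters only through \det(C) = \det(P_5)/\prod_i a_{ii} and \mathrm{trace}(C) = n.

```python
import math
from fractions import Fraction

def det(A):
    """Exact determinant by Gaussian elimination in rational arithmetic."""
    A = [[Fraction(a) for a in row] for row in A]
    n, d = len(A), Fraction(1)
    for k in range(n):
        p = next((i for i in range(k, n) if A[i][k] != 0), None)
        if p is None:
            return Fraction(0)
        if p != k:
            A[k], A[p] = A[p], A[k]
            d = -d
        d *= A[k][k]
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
    return d

def bound4(trace, d, n):
    """Bound (4), written as (1+x)^2/(1-x^2) to avoid forming 1 - x."""
    t = (n / trace) ** n * d                   # t = 1 - x^2
    x = math.sqrt(max(0.0, 1.0 - t))
    return (1.0 + x) ** 2 / t

def bound5(trace, d, n):
    """Bound (5)."""
    return 4.0 / d * (trace / n) ** n

n = 5
P = [[math.comb(i + j, i) for j in range(n)] for i in range(n)]   # Pascal matrix
diag = [P[i][i] for i in range(n)]
d = float(det(P))

b4 = bound4(sum(diag), d, n)
b5 = bound5(sum(diag), d, n)
det_C = d / math.prod(diag)                    # determinant of the scaled matrix C
b_scaled = max(diag) / min(diag) * bound4(n, det_C, n)   # (7) with kappa_2(C) bounded via (4)
print(f"{b4:.2e} {b5:.2e} {b_scaled:.2e}")
```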


Many other condition number bounds are available in the literature. All have their pros and cons and any bound based on limited information such as traces of powers of A and the determinant will be potentially very weak.

A drawback of the bounds (2)–(6) is that they require \det(A). Sometimes the determinant is easily computable, as for a Vandermonde matrix, or can be bounded: for example, |\det(A)| \ge 1 for a nonsingular matrix with integer entries. If a Cholesky, LU, or QR factorization of A is available then |\det(A)| is easily computable, but in this case a good order of magnitude estimate of the condition number can be cheaply computed using condition estimation techniques (Higham, 2002, Chapter 15).

The bounds (3) and (4) are used by Higham and Lettington (2021) in investigating the most ill conditioned 4\times 4 symmetric matrices with integer elements bounded by 10; see What Is the Wilson Matrix?



Singular Value Inequalities

Recall that the singular value decomposition (SVD) of a matrix A \in\mathbb{C}^{m\times n} is a factorization A = U\Sigma V^*, where U\in\mathbb{C}^{m\times m} and V\in\mathbb{C}^{n\times n} are unitary and \Sigma = \mathrm{diag}(\sigma_1,\dots, \sigma_p)\in\mathbb{R}^{m\times n}, with \sigma_1\ge \sigma_2\ge \cdots \ge \sigma_p \ge 0, where p = \min(m,n). We sometimes write \sigma_i(A) to specify the matrix to which the singular value belongs.

A standard technique for obtaining singular value inequalities for A is to apply eigenvalue inequalities to the Hermitian positive semidefinite matrices A^*A or AA^*, whose eigenvalues are the squares of the singular values of A, or to the Hermitian matrix

\notag    \begin{bmatrix}      0   & A \\      A^* & 0     \end{bmatrix}, \qquad (1)

whose eigenvalues are plus and minus the singular values of A together with |m-n| zero eigenvalues if m\ne n.

We begin with a variational characterization of singular values.

Theorem 1. For A\in\mathbb{C}^{m\times n},

\notag \begin{aligned}   \sigma_k &= \min_{\dim(S)=n-k+1} \, \max_{0\ne x\in S} \frac{\|Ax\|_2}{\|x\|_2}\\            &= \max_{\dim(S)= k} \, \min_{0\ne x\in S} \frac{\|Ax\|_2}{\|x\|_2},                  \quad k=1\colon \min(m,n), \end{aligned}

where S\subseteq \mathbb{C}^n.

Proof. The result is obtained by applying the Courant–Fischer theorem (a variational characterization of eigenvalues) to A^*A. ~\square

As a special case of Theorem 1, we have

\notag    \sigma_1 = \displaystyle\max_{x \ne 0}\frac{ \|Ax\|_2 }{ \|x\|_2 },               \qquad (2)

and, for m\ge n,

\notag    \sigma_n = \displaystyle\min_{x \ne 0}\frac{ \|Ax\|_2 }{ \|x\|_2 }.               \qquad (3)

The expression in the theorem can be rewritten using \|x\|_2 = \max_{y\ne 0}|y^*x|/\|y\|_2 (the equality case in the Cauchy–Schwarz inequality). For example, (2) is equivalent to

\notag   \sigma_1 = \displaystyle\max_{0\ne x\in \mathbb{C}^n\atop 0 \ne y \in \mathbb{C}^m}         \displaystyle\frac{|y^*Ax|}{\|x\|_2\|y\|_2}.

Our first perturbation result bounds the change in a singular value.

Theorem 2. For A,B\in\mathbb{C}^{m\times n},

\notag   |\sigma_i(A) - \sigma_i(B)| \le \|A - B \|_2, \quad i = 1\colon \min(m,n).    \qquad (4)

Proof. The bound is obtained by applying the corresponding result for the Hermitian eigenvalue problem to (1). ~\square

The bound (4) says that the singular values of a matrix are well conditioned under additive perturbation. Now we consider multiplicative perturbations.

The next result is an analogue for singular values of Ostrowski’s theorem for eigenvalues.

Theorem 3. For A\in \mathbb{C}^{m\times n} and nonsingular X\in\mathbb{C}^{n\times n} and Y\in\mathbb{C}^{m\times m},

\notag     \sigma_i(Y^*AX) = \theta_i \sigma_i(A), \quad i = 1\colon \min(m,n),      \qquad (5)

where \sigma_n(X)\sigma_m(Y) \le \theta_i \le \sigma_1(X) \sigma_1(Y).

A corollary of this result is

\notag   |\sigma_i(A) - \sigma_i(Y^*AX)| \le \sigma_i(A) \epsilon, \quad i = 1\colon \min(m,n),    \qquad (6)

where \epsilon = \max(\|X^*X - I\|_2,\|Y^*Y - I\|_2). The bounds (5) and (6) are intuitively reasonable, because unitary transformations preserve singular values and the bounds quantify in different ways how close X and Y are to being unitary.

Next, we have an interlacing property.

Theorem 4. Let A\in\mathbb{C}^{m\times n}, A_k = A(:,1\colon k), and q = \min(m,k). Then

\notag  \sigma_{i+1}(A_{k+1}) \le \sigma_i(A_k) \le \sigma_i(A_{k+1}), \quad   i=1\colon q, \quad k = 1\colon n-1,

where we define \sigma_{q+1}(A_{k+1}) = 0 if m < k+1.

Proof. The result is obtained by applying the Cauchy interlace theorem to A^*A, noting that A_k^*A_k is the leading principal submatrix of order k of A^*A. ~\square

An analogous result holds with rows playing the role of columns (just apply Theorem 4 to A^*).

Theorem 4 encompasses two different cases, which we illustrate with i = q and k = n-1. The first case is m \ge n, so that q = n-1 and

\notag   \sigma_n(A) \le \sigma_{n-1}(A_{n-1}) \le \sigma_{n-1}(A).

The second case is m < n, so q = m and

\notag   0 \le \sigma_m(A_{n-1}) \le \sigma_m(A).

Therefore Theorem 4 shows that removing a column from A does not increase any singular value and that when m\ge n no singular value decreases below \sigma_n(A). However, when m < n the smallest singular value of A_{n-1} may be less than the smallest singular value of A.

Here is a numerical example. Note that transposing A does not change its singular values.

>> rng(1), A = rand(5,4); % Case 1.
>> B = A(:,1:end-1); sv_A = svd(A)', sv_B = svd(B)'
sv_A =
   1.7450e+00   6.4492e-01   5.5015e-01   3.2587e-01
sv_B =
   1.5500e+00   5.8472e-01   3.6128e-01
>> A = A'; B = A(:,1:end-1); sv_B = svd(B)' % Case 2
sv_B =
   1.7098e+00   6.0996e-01   4.6017e-01   1.0369e-01

By applying Theorem 4 repeatedly we find that if we partition A = [A_{11}~A_{12}] then \sigma_i(A_{11}) \le \sigma_i(A) for all i for which the left-hand side is defined.


Bounds for the Norm of the Inverse of a Triangular Matrix

In many situations we need to estimate or bound the norm of the inverse of a matrix, for example to compute an error bound or to check whether an iterative process is guaranteed to converge. This is the same problem as bounding the condition number \kappa(A) = \|A\| \|A^{-1}\|, assuming \|A\| is easy to compute or estimate. Here, we focus on triangular matrices. The bounds we derive can be applied to a general matrix if an LU or QR factorization is available.

We denote by \|\cdot\| any matrix norm, and we take the consistency condition \|AB\| \le \|A\| \|B\| as one of the defining properties of a matrix norm.

It will be useful to note that

\notag       \left[\begin{array}{crrrr}       1 & -\theta & -\theta & -\theta & -\theta\\         & 1 & -\theta & -\theta & -\theta\\         &   & 1 & -\theta & -\theta\\         &   &   & 1 & -\theta\\         &   &   &   & 1 \end{array}\right]^{-1} =      \left[\begin{array}{ccccc}      1 & \theta & \theta(1+\theta) & \theta(1+\theta)^2 & \theta(1+\theta)^3\\        & 1 & \theta & \theta(1+\theta) & \theta(1+\theta)^2\\        &   & 1 & \theta & \theta(1+\theta)\\        &   &   & 1 & \theta\\        &   &   &   & 1      \end{array}\right]

and that more generally the inverse of the n\times n upper triangular matrix T(\theta) with

\notag   (T(\theta))_{ij} = \begin{cases} 1, & i=j, \\                     -\theta, & i<j, \end{cases} \qquad (1)

is given by

\notag   \bigl(T(\theta)^{-1}\bigr)_{ij} =     \begin{cases} 1, & i=j, \\                 \theta(1+\theta)^{j-i-1}, & j > i. \end{cases} \qquad (2)
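The formula (2) is easy to confirm in exact arithmetic. The following sketch (helper names are our own) builds T(\theta) from (1) and the claimed inverse from (2), and checks that their product is the identity for several values of \theta.

```python
from fractions import Fraction

def T(theta, n):
    """The matrix T(theta) defined in (1)."""
    return [[Fraction(1) if i == j else -theta if i < j else Fraction(0)
             for j in range(n)] for i in range(n)]

def T_inv_formula(theta, n):
    """The claimed inverse from (2)."""
    return [[Fraction(1) if i == j
             else theta * (1 + theta) ** (j - i - 1) if j > i
             else Fraction(0)
             for j in range(n)] for i in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]

n = 6
for theta in (Fraction(2), Fraction(-2), Fraction(1, 3)):
    assert matmul(T(theta, n), T_inv_formula(theta, n)) == identity(n)

print(T_inv_formula(Fraction(2), 5)[0])   # first row of the inverse for theta = 2
```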

Lower Bound

First, we consider a general matrix A\in\mathbb{C}^{n\times n} and let \lambda be an eigenvalue with |\lambda| = \rho(A) (the spectral radius) and x a corresponding eigenvector. With X = xe^T \in\mathbb{C}^{n\times n}, where e is the vector of ones, AX = \lambda X, so

\notag     |\lambda| \|X\| = \|\lambda X\| = \| AX \| \le \|A\| \|X\|,

which implies |\lambda| \le \|A\| since X\ne 0. Hence

\notag           \|A\| \ge \rho(A).

Let T be a triangular matrix. Applying the latter bound to T^{-1}, whose eigenvalues are its diagonal entries t_{ii}^{-1}, gives

\notag       \|T^{-1}\| \ge \displaystyle\frac{1}{\min_i |t_{ii}|}.  \qquad (3)

Combining this bound with the analogous bound for \|T\| gives

\notag       \kappa(T) \ge \displaystyle\frac{\max_i |t_{ii}|}{\min_i |t_{ii}|}. \qquad (4)

We note that commonly used norms satisfy \|A\| \ge \max_{i,j}|a_{ij}|, which yields another proof of (3) and (4).

For any x and y such that y = Tx we have the lower bound \|x\| / \|y\| \le \| T^{-1} \|. We can choose y and then solve the triangular system Tx = y for x to obtain the lower bound. Condition number estimation techniques, which we will describe in another article, provide ways to choose y that usually yield estimates of \| T^{-1} \| correct to within an order of magnitude.

For the 2-norm, we can choose y and then compute x = (T^TT)^{-k}y by repeated triangular solves, obtaining the lower bound (\|x\|_2 / \|y\|_2)^{\frac{1}{2k}} \le \| T^{-1} \|_2. This bound is simply the power method applied to (T^TT)^{-1}.
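A sketch of this lower bound for a real upper triangular T follows. Each step applies (T^TT)^{-1} by one forward substitution with T^T and one back substitution with T. The function names and the choice of y as the vector of ones are assumptions made for illustration; no inverse is ever formed.

```python
import math

def solve_transposed(T, y):
    """Solve T^T w = y by forward substitution (T upper triangular)."""
    n = len(T)
    w = [0.0] * n
    for i in range(n):
        w[i] = (y[i] - sum(T[j][i] * w[j] for j in range(i))) / T[i][i]
    return w

def solve_upper(T, w):
    """Solve T x = w by back substitution."""
    n = len(T)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (w[i] - sum(T[i][j] * x[j] for j in range(i + 1, n))) / T[i][i]
    return x

def norm2(v):
    return math.sqrt(sum(vi * vi for vi in v))

def inv_norm_lower_bound(T, y, k):
    """(||x||_2 / ||y||_2)^(1/(2k)) with x = (T^T T)^{-k} y."""
    x = list(y)
    for _ in range(k):
        x = solve_upper(T, solve_transposed(T, x))
    return (norm2(x) / norm2(y)) ** (1.0 / (2 * k))

T = [[1.0, -1.0, -1.0],
     [0.0, 1.0, -1.0],
     [0.0, 0.0, 1.0]]
y = [1.0, 1.0, 1.0]
for k in (1, 2, 3):
    print(k, inv_norm_lower_bound(T, y, k))
```

For this T the Frobenius norm of T^{-1} is 3, so every value printed must lie below 3, and the lower bounds increase with k.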

Upper Bounds

Let T\in\mathbb{C}^{n\times n} be an upper triangular matrix. The upper bounds for \|T^{-1}\| that we will discuss depend only on the absolute values of the elements of T. This limits the ability of the bounds to distinguish between well-conditioned and ill-conditioned matrices. For example, consider

\notag \begin{gathered}      T_1 =       \left[\begin{array}{crrrr} 1 & -2 & -2 & -2 & -2\\         & 1 & -2 & -2 & -2\\         &   & 1 & -2 & -2\\         &   &   & 1 & -2\\         &   &   &   & 1 \end{array}\right], \quad      T_1^{-1} =      \left[\begin{array}{ccccc}      1 & 2 & 6 & 18 & 54\\        & 1 & 2 & 6 & 18\\        &   & 1 & 2 & 6\\        &   &   & 1 & 2\\        &   &   &   & 1      \end{array}\right], \\     T_2 =     \left[\begin{array}{ccccc}     1 & 2 & 2 & 2 & 2\\       & 1 & 2 & 2 & 2\\       &   & 1 & 2 & 2\\       &   &   & 1 & 2\\       &   &   &   & 1     \end{array}\right], \quad   T_2^{-1} =     \left[\begin{array}{crrrr}      1 & -2 & 2 & -2 & 2\\        & 1 & -2 & 2 & -2\\        &   & 1 & -2 & 2\\        &   &   & 1 & -2\\        &   &   &   & 1     \end{array}\right]. \end{gathered}

The bounds for T_1^{-1} and T_2^{-1} will be the same, yet the inverses are of different sizes (the more so as the dimension increases).

Let D = \mathrm{diag}(T) and write

\notag    T = D(I - N),

where N is strictly upper triangular and hence nilpotent with N^n = 0. Then

\notag    T^{-1} = (I + N + N^2 + \cdots + N^{n-1}) D^{-1}.

Taking absolute values and using the triangle inequality gives

\notag   |T^{-1}| \le (I + |N| + |N|^2 + \cdots + |N|^{n-1}) |D|^{-1}, \qquad(5)

where the inequalities hold elementwise.

The comparison matrix M(A) associated with a general A\in\mathbb{C}^{n \times n} is the matrix with

\notag   (M(A))_{ij} =    \begin{cases} |a_{ii}|, & i=j, \\                 -|a_{ij}|, & i\ne j.     \end{cases}

It is not hard to see that M(T) is upper triangular with M(T) = |D| (I - |N|) and so the bound (5) is

\notag   |T^{-1}| \le M(T)^{-1}.

If we replace every element above the diagonal of M(T) by the most negative off-diagonal element in its row we obtain the upper triangular matrix W(T) with

\notag     (W(T))_{ij} = \begin{cases}                     |t_{ii}|, & i=j, \\                             -\max_{i+1\le k\le n}|t_{ik}|, & i<j. \\                 \end{cases}

Then W(T) = |D| (I - |N_1|), where |N| \le |N_1|, so

\notag \begin{aligned}   M(T)^{-1} &= (I + |N| + |N|^2 + \cdots + |N|^{n-1}) |D|^{-1}\\   & \le (I + |N_1| + |N_1|^2 + \cdots + |N_1|^{n-1}) |D|^{-1} = W(T)^{-1}. \end{aligned}

Finally, let Z(T) = \min_i|t_{ii}|(I - |N_2|), where N_2 is strictly upper triangular with every element above the diagonal equal to the maximum element of |N_1|, that is,

\notag      (Z(T))_{ij} = \begin{cases}       \alpha, & i=j, \\                               -\alpha\beta, & i<j, \\                 \end{cases} \qquad     \alpha = \min_i|t_{ii}|, \quad      \beta = \max_{i < j}|t_{ij}|/|t_{ii}|.


Then

\notag \begin{aligned}   W(T)^{-1} &= (I + |N_1| + |N_1|^2 + \cdots + |N_1|^{n-1}) |D|^{-1} \\             &\le \alpha^{-1} (I + |N_2| + |N_2|^2 + \cdots + |N_2|^{n-1}) = Z(T)^{-1}. \end{aligned}

We note that M(T), W(T), and Z(T) are all nonsingular M-matrices. We summarize the bounds.

Theorem 1.

If T\in\mathbb{C}^{n\times n} is a nonsingular upper triangular matrix then

\notag      |T^{-1}|  \le M(T)^{-1}                \le W(T)^{-1}                \le Z(T)^{-1}. \qquad (6)

We make two remarks.

  • The bounds (6) are equally valid for lower triangular matrices as long as the maxima in the definitions of W(T) and Z(T) are taken over columns instead of rows.
  • We could equally well have written T = (I-N)D. The comparison matrix M(T) = (I - |N|)|D| is unchanged, and (6) continues to hold as long as the maxima in the definitions of W(T) and Z(T) are taken over columns rather than rows.

It follows from the theorem that

\notag    \|T^{-1}\|  \le \|M(T)^{-1}\|                \le \|W(T)^{-1}\|                \le \|Z(T)^{-1}\|

for the 1-, 2-, and \infty-norms and the Frobenius norm. Now M(T), W(T), and Z(T) all have nonnegative inverses, and for a matrix A with nonnegative inverse we have \|A^{-1}\|_{\infty} = \|A^{-1}e\|_{\infty}. Hence

\notag   \begin{aligned}    \|T^{-1}\|_{\infty}                &\le \|M(T)^{-1}e\|_{\infty}                \le \|W(T)^{-1}e\|_{\infty}                \le \|Z(T)^{-1}e\|_{\infty}\\         O(n^3) \hskip10pt & \hskip35pt  O(n^2)                  \hskip65pt  O(n)                  \hskip65pt O(1)   \end{aligned}

where the big-Oh expressions show the asymptotic cost in flops of evaluating each term by solving the relevant triangular system. As the bounds become less expensive to compute they become weaker. The quantity \|Z(T)^{-1}\|_p can be explicitly evaluated for p = \infty, using (2). It has the same value for p = 1, and since \|A\|_2 \le (\|A\|_1\|A\|_{\infty})^{1/2} we have

\notag    \|T^{-1}\|_p \le \displaystyle\frac{ (\beta + 1)^{n-1}}{\alpha}, \quad p = 1,2,\infty.    \qquad(7)

This bound is an equality for p = 1,\infty for the matrix T(\theta) in (1).
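The hierarchy of bounds and costs can be made concrete with a short Python sketch. It assumes a real upper triangular T stored as a list of rows; the function names are ours, and the O(n) solve with W(T) uses a running sum of the solution components, which is one possible way to exploit the constant rows of W(T).

```python
def back_substitute(U, b):
    """Solve U x = b for upper triangular U."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

def inv_norm_inf(T):
    """||T^{-1}||_inf from the explicit inverse: O(n^3) flops."""
    n = len(T)
    cols = [back_substitute(T, [1.0 if i == j else 0.0 for i in range(n)])
            for j in range(n)]
    return max(sum(abs(cols[j][i]) for j in range(n)) for i in range(n))

def bound_M(T):
    """||M(T)^{-1} e||_inf: one triangular solve, O(n^2) flops."""
    n = len(T)
    M = [[abs(T[i][j]) if i == j else -abs(T[i][j]) for j in range(n)]
         for i in range(n)]
    return max(back_substitute(M, [1.0] * n))

def bound_W(T):
    """||W(T)^{-1} e||_inf: O(n) flops once the row maxima are known."""
    n = len(T)
    y_sum, best = 0.0, 0.0
    for i in range(n - 1, -1, -1):
        w = max((abs(T[i][k]) for k in range(i + 1, n)), default=0.0)
        y = (1.0 + w * y_sum) / abs(T[i][i])  # row i of W(T) y = e
        y_sum += y
        best = max(best, y)
    return best

def bound_Z(T):
    """The closed form (beta + 1)^(n-1) / alpha from (7)."""
    n = len(T)
    alpha = min(abs(T[i][i]) for i in range(n))
    beta = max((abs(T[i][j]) / abs(T[i][i])
                for i in range(n) for j in range(i + 1, n)), default=0.0)
    return (beta + 1.0) ** (n - 1) / alpha

n = 5
T1 = [[1.0 if i == j else (-2.0 if j > i else 0.0) for j in range(n)]
      for i in range(n)]
T2 = [[abs(v) for v in row] for row in T1]
for name, T in (("T1", T1), ("T2", T2)):
    print(name, inv_norm_inf(T), bound_M(T), bound_W(T), bound_Z(T))
```

For the matrices T_1 and T_2 displayed earlier, all three bounds equal 81 in both cases, while the true norms are 81 for T_1 and 9 for T_2. The bounds see only the absolute values of the elements and so cannot tell the two matrices apart.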

For the Frobenius norm, evaluating \|Z(T)^{-1}\|_F, and using \|A\|_2 \le \|A\|_F, gives

\notag  \|T^{-1}\|_{2,F} \le        \displaystyle\frac{ \bigl( (\beta + 1)^{2n} + 2n(\beta + 2) - 1 \bigr)^{1/2}}             {\alpha(\beta + 2)}.       \qquad(8)

For the 2-norm, either of (7) and (8) can be the smaller bound depending on \beta.

For the special case of a bidiagonal matrix B it is easy to show that |B^{-1}| = M(B)^{-1}, and so \|B^{-1}\|_{\infty} = \|M(B)^{-1}\|_{\infty} = \|M(B)^{-1}e\|_{\infty} can be computed exactly in O(n) flops.
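A minimal Python sketch of the O(n) computation follows, assuming a real upper bidiagonal B given by its diagonal and superdiagonal (this storage format and the function name are our own choices). It solves M(B)y = e by the two-term recurrence that back substitution reduces to in the bidiagonal case.

```python
def bidiag_inv_norm_inf(diag, superdiag):
    """||B^{-1}||_inf for upper bidiagonal B, via M(B) y = e in O(n) flops."""
    n = len(diag)
    y, best = 0.0, 0.0
    for i in range(n - 1, -1, -1):
        off = abs(superdiag[i]) if i < n - 1 else 0.0
        # Row i of M(B) y = e: |b_ii| y_i - |b_{i,i+1}| y_{i+1} = 1.
        y = (1.0 + off * y) / abs(diag[i])
        best = max(best, y)
    return best

print(bidiag_inv_norm_inf([1.0, 1.0, 1.0], [-2.0, -2.0]))
```

In the example B has unit diagonal and superdiagonal elements -2, so B^{-1} has first row [1, 2, 4] and the function returns 7.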

These upper bounds can be arbitrarily weak, even for a fixed n, as shown by the example

\notag   T(\theta) = \begin{bmatrix} \theta^{-1} &   1      & 1       \\                       0     &  \theta^{-1} & \theta^{-1} \\                       0     &   0      & \theta^{-2} \end{bmatrix},     \quad \theta > 0,

for which

\notag   T(\theta)^{-1} =           \begin{bmatrix} \theta      &  -\theta^2   & 0       \\                       0     &  \theta      & -\theta^2   \\                       0     &   0      & \theta^2    \end{bmatrix},           \quad   M(T(\theta))^{-1} =           \begin{bmatrix} \theta      &  \theta^2    & 2\theta^3   \\                       0     &  \theta      & \theta^2    \\                       0     &   0      & \theta^2    \end{bmatrix}.

As \theta\to\infty, \|M(T(\theta))^{-1}\|_{\infty} /\|T(\theta)^{-1}\|_{\infty} \approx 2\theta. On the other hand, the overestimation is bounded as a function of n for triangular matrices resulting from certain pivoting strategies.

Theorem 2.

Suppose the upper triangular matrix T\in\mathbb{C}^{n\times n} satisfies

\notag       |t_{ii}| \ge |t_{ij}|, \quad j>i. \qquad (9)

Then, for the 1-, 2-, and \infty-norms,

\notag     \displaystyle\frac{1}{\min_i|t_{ii}|} \le \|T^{-1}\| \le \|M(T)^{-1}\|                               \le \|W(T)^{-1}\|                               \le \|Z(T)^{-1}\|                               \le \displaystyle\frac{2^{n-1}}{{\min_i|t_{ii}|}}.

Proof. The first four inequalities are a combination of (3) and (6). The fifth inequality is obtained from the expression (7) for \|Z(T)^{-1}\|, since (9) implies \beta \le 1.

Condition (9) is satisfied for the triangular factors from QR factorization with column pivoting and for the transpose of the unit lower triangular factors from LU factorization with any form of pivoting.

The upper bounds we have described have been derived independently by several authors, as explained by Higham (2002).


Eigenvalue Inequalities for Hermitian Matrices

The eigenvalues of Hermitian matrices satisfy a wide variety of inequalities. We present some of the most useful and explain their implications. Proofs are omitted, but as Parlett (1998) notes, the proofs of the Courant–Fischer, Weyl, and Cauchy results are all consequences of the elementary fact that if the sum of the dimensions of two subspaces of \mathbb{C}^n exceeds n then the subspaces have a nontrivial intersection.

The eigenvalues of a Hermitian matrix A\in\mathbb{C}^{n\times n} are real and we order them \lambda_n\le \lambda_{n-1} \le \cdots \le \lambda_1. Note that in some references, such as Horn and Johnson (2013), the reverse ordering is used, with \lambda_n the largest eigenvalue. When it is necessary to specify what matrix \lambda_k is an eigenvalue of we write \lambda_k(A): the kth largest eigenvalue of A. All the following results also hold for symmetric matrices over \mathbb{R}^{n\times n}.

Quadratic Form

The function f(x) = x^*Ax/x^*x is the quadratic form x^*Ax for A evaluated on the unit sphere, since f(x) = f(x/\|x\|_2). As A is Hermitian it has a spectral decomposition A = Q\Lambda Q^*, where Q is unitary and \Lambda = \mathrm{diag}(\lambda_i). Then

\notag f(x) = \displaystyle\frac{x^*Q\Lambda Q^*x}{x^*x}             = \displaystyle\frac{y^*\Lambda y}{y^*y}             = \displaystyle\frac{\sum_{i=1}^{n}\lambda_i |y_i|^2}                                 {\sum_{i=1}^{n}|y_i|^2} \quad (y = Q^*x),

from which it is clear that

\notag  \lambda_n = \displaystyle\min_{x\ne0} \displaystyle\frac{x^*Ax}{x^*x}, \quad  \lambda_1 = \displaystyle\max_{x\ne0} \displaystyle\frac{x^*Ax}{x^*x}, \qquad(*)

with the minimum and maximum attained when x is an eigenvector corresponding to \lambda_n and \lambda_1, respectively. This characterization of the extremal eigenvalues of A as the extrema of f is due to Lord Rayleigh (John William Strutt), and f(x) is called a Rayleigh quotient. The intermediate eigenvalues correspond to saddle points of f.
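As a small worked illustration (our own example), take the real symmetric matrix A with diagonal elements 2 and off-diagonal elements 1, whose eigenvalues are \lambda_1 = 3 and \lambda_2 = 1. For a real unit vector x = [\cos\phi, \sin\phi]^T the Rayleigh quotient is

\notag   f(x) = 2\cos^2\phi + 2\sin\phi\cos\phi + 2\sin^2\phi = 2 + \sin 2\phi,

so f attains its maximum 3 at \phi = \pi/4, where x is a multiple of the eigenvector [1, 1]^T, and its minimum 1 at \phi = -\pi/4, where x is a multiple of the eigenvector [1, -1]^T.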

Courant–Fischer Theorem

The Courant–Fischer theorem (1905) states that every eigenvalue of a Hermitian matrix A\in\mathbb{C}^{n\times n} is the solution of both a min-max problem and a max-min problem over suitable subspaces of \mathbb{C}^n.

Theorem (Courant–Fischer).

For a Hermitian A\in\mathbb{C}^{n\times n},

\notag \begin{aligned}    \lambda_k &= \min_{\dim(S)=n-k+1} \, \max_{0\ne x\in S} \frac{x^*Ax}{x^*x}\\              &= \max_{\dim(S)= k} \, \min_{0\ne x\in S} \frac{x^*Ax}{x^*x},                  \quad k=1\colon n. \end{aligned}

Note that the equalities (*) are special cases of these characterizations.

In general there is no useful formula for the eigenvalues of a sum A+B of Hermitian matrices. However, the Courant–Fischer theorem yields the upper and lower bounds

\notag  \lambda_k(A) + \lambda_n(B) \le \lambda_k(A+B) \le \lambda_k(A) + \lambda_1(B),   \qquad (1)

from which it follows that

\notag   \max_k|\lambda_k(A+B)-\lambda_k(A)| \le \max(|\lambda_n(B)|,|\lambda_1(B)|)     = \|B\|_2.

This inequality shows that the eigenvalues of a Hermitian matrix are well conditioned under perturbation. We can rewrite the inequality in the symmetric form

\notag   \max_k |\lambda_k(A)-\lambda_k(B)| \le \|A-B\|_2.

If B is positive semidefinite then (1) gives

\notag    \lambda_k(A) \le \lambda_k(A + B),    \quad k = 1\colon n, \qquad (2)

while if B is positive definite then strict inequality holds for all k. These bounds are known as the Weyl monotonicity theorem.

Weyl’s Inequalities

Weyl’s inequalities (1912) bound the eigenvalues of A+B in terms of those of A and B.

Theorem (Weyl).

For Hermitian A,B\in\mathbb{C}^{n\times n} and i,j = 1\colon n,

\notag \begin{aligned}     \lambda_{i+j-1}(A+B) &\le \lambda_i(A) + \lambda_j(B),     \quad i+j \le n+1, \qquad (3)\\     \lambda_i(A) + \lambda_j(B) &\le \lambda_{i+j-n}(A+B),     \quad i+j \ge n+1. \qquad (4) \end{aligned}

The Weyl inequalities yield much information about the effect of low rank perturbations. Consider a positive semidefinite rank-1 perturbation B = zz^*. Inequality (3) with j = 1 gives

\notag     \lambda_i(A+B) \le \lambda_i(A) + z^*z,       \quad i = 1\colon n

(which also follows from (1)). Inequality (3) with j = 2, combined with (2), gives

\notag     \lambda_{i+1}(A) \le \lambda_{i+1}(A + zz^*) \le \lambda_i(A),       \quad i = 1\colon n-1. \qquad (5)

These inequalities confine each eigenvalue of A + zz^* to the interval between two adjacent eigenvalues of A; the eigenvalues of A + zz^* are said to interlace those of A. The following figure illustrates the case n = 4, showing a possible configuration of the eigenvalues \lambda_i of A and \mu_i of A + zz^*.

[Figure weyl_fig.jpg: eigenvalues \lambda_i of A and \mu_i of A + zz^* for n = 4.]

A specific example, in MATLAB, is

>> n = 4; eig_orig = 5:5+n-1
>> D = diag(eig_orig); eig_pert = eig(D + ones(n))'
eig_orig =
     5     6     7     8
eig_pert =
   5.2961e+00   6.3923e+00   7.5077e+00   1.0804e+01

Since \mathrm{trace}(A + zz^*) = \mathrm{trace}(A) + z^*z and the trace is the sum of the eigenvalues, we can write

\notag       \lambda_i(A + zz^*) = \lambda_i(A) + \theta_i z^*z,

where the \theta_i are nonnegative and sum to 1. If we greatly increase z^*z, the norm of the perturbation, then most of the increase in the eigenvalues is concentrated in the largest, since (5) bounds how much the smaller eigenvalues can change:

>> eig_pert = eig(D + 100*ones(n))'
eig_pert =
   5.3810e+00   6.4989e+00   7.6170e+00   4.0650e+02
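The interlacing can also be used constructively. The Python sketch below relies on a standard fact that is not derived in the text: for a diagonal D with distinct diagonal elements d_i, a vector z with no zero components, and \rho > 0, the eigenvalues of D + \rho zz^T are the roots of the secular function 1 + \rho\sum_i z_i^2/(d_i - \lambda). Interlacing supplies one bracketing interval per root, so plain bisection suffices. The function name and the use of bisection are our own choices, and the d_i are assumed to be sorted in increasing order.

```python
def rank_one_update_eigenvalues(d, z, rho, iters=200):
    """Eigenvalues (increasing) of diag(d) + rho*z*z^T by bisection.

    Assumes d is strictly increasing, every z_i is nonzero, and rho > 0.
    """
    def secular(lam):
        return 1.0 + rho * sum(zi * zi / (di - lam) for di, zi in zip(d, z))

    # Interlacing: one root in each (d_i, d_{i+1}); the largest root lies
    # between d_n and d_n + rho * z^T z.
    upper_ends = list(d[1:]) + [d[-1] + rho * sum(zi * zi for zi in z)]
    roots = []
    for lo, hi in zip(d, upper_ends):
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if mid <= lo or mid >= hi:
                break  # interval can no longer be halved in floating point
            if secular(mid) < 0.0:
                lo = mid  # the function increases from -inf just above d_i
            else:
                hi = mid
        roots.append(0.5 * (lo + hi))
    return roots

d = [5.0, 6.0, 7.0, 8.0]
z = [1.0] * 4
for rho in (1.0, 100.0):
    print(rho, rank_one_update_eigenvalues(d, z, rho))
```

The two calls correspond to the MATLAB computations with D + ones(n) and D + 100*ones(n) above, and the computed roots can be compared with the displayed values of eig_pert.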

More generally, if B has p positive eigenvalues and q negative eigenvalues then (3) with j = p+1 gives

\notag     \lambda_{i+p}(A+B) \le \lambda_i(A),      \quad i = 1\colon n-p,

while (4) with j = n-q gives

\notag     \lambda_i(A) \le \lambda_{i-q}(A + B),     \quad i = q+1\colon n.

So the inertia of B (the number of negative, zero, and positive eigenvalues) determines how far the eigenvalues can move as measured relative to the indexes of the eigenvalues of A.

An important implication of the last two inequalities is for the case A = I, for which we have

\notag \begin{aligned}  \lambda_{i+p}(I+B) &\le 1, \quad i = 1 \colon n-p, \\  \lambda_{i-q}(I+B) &\ge 1, \quad i = q+1 \colon n. \end{aligned}

Exactly p+q eigenvalues appear in one of these inequalities and n-(p+q) appear in both. Therefore n - (p+q) of the eigenvalues are equal to 1 and so only \mathrm{rank}(B) = p+q eigenvalues can differ from 1. So perturbing the identity matrix by a Hermitian matrix of rank r changes at most r of the eigenvalues. (In fact, it changes exactly r eigenvalues, as can be seen from a spectral decomposition.)
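As a concrete instance (our own example), take B = ee^*, where e is the vector of ones, so that p = 1 and q = 0. The two sets of inequalities read

\notag   \lambda_{i+1}(I + ee^*) \le 1, \quad i = 1\colon n-1, \qquad   \lambda_i(I + ee^*) \ge 1, \quad i = 1\colon n,

which force \lambda_2 = \cdots = \lambda_n = 1. The remaining eigenvalue is \lambda_1 = 1 + e^*e = n+1, with eigenvector e, so exactly one eigenvalue has moved.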

Finally, if B has rank r then \lambda_{r+1}(B) \le 0 and \lambda_{n-r}(B) \ge 0 and so taking j = r+1 in (3) and j = n-r in (4) gives

\notag   \begin{aligned}     \lambda_{i+r}(A+B) &\le \lambda_i(A),      ~~\qquad\qquad i = 1\colon n-r, \\         \lambda_i(A) &\le \lambda_{i-r}(A + B), ~~\quad i = r+1\colon n.   \end{aligned}

Cauchy Interlace Theorem

The Cauchy interlace theorem relates the eigenvalues of successive leading principal submatrices of a Hermitian matrix. We denote the leading principal submatrix of A of order k by A_k = A(1\colon k, 1\colon k).

Theorem (Cauchy).

For a Hermitian A\in\mathbb{C}^{n\times n},

\notag  \lambda_{i+1}(A_{k+1}) \le \lambda_i(A_k) \le \lambda_i(A_{k+1}),    \quad i = 1\colon k, \quad k=1\colon n-1.

The theorem says that the eigenvalues of A_k interlace those of A_{k+1} for all k. Two immediate implications are that (a) if A is Hermitian positive definite then so are all its leading principal submatrices and (b) appending a row and a column to a Hermitian matrix does not decrease the largest eigenvalue or increase the smallest eigenvalue.
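For a worked illustration (our own example), let A be the 3\times 3 symmetric tridiagonal matrix with diagonal elements 2 and off-diagonal elements -1. The eigenvalues of A_3 = A are 2 \pm \sqrt{2} and 2, those of A_2 are 3 and 1, and A_1 has the single eigenvalue 2. Writing each set in decreasing order shows the interlacing:

\notag   \begin{array}{lccccc}   A_3\colon & 2+\sqrt{2} &   & 2 &   & 2-\sqrt{2} \\   A_2\colon &            & 3 &   & 1 &            \\   A_1\colon &            &   & 2 &   &   \end{array}

Each eigenvalue of A_k lies between the two eigenvalues of A_{k+1} on the row above it.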

Since eigenvalues are unchanged under symmetric permutations of the matrix, the theorem can be reformulated to say that the eigenvalues of any principal submatrix of order n-1 interlace those of A. A generalization to principal submatrices of order n-\ell is given in the next result.


Theorem.

If B is a principal submatrix of order n-\ell of a Hermitian A\in\mathbb{C}^{n\times n} then

\notag  \lambda_{i+\ell}(A) \le \lambda_i(B) \le \lambda_i(A),    \quad i=1\colon n-\ell.

Majorization Results

It follows by taking x to be a unit vector e_i in the formula \lambda_1 = \max_{x\ne0} x^*Ax/(x^*x) that \lambda_1 \ge a_{ii} for all i. And of course the trace of A is the sum of the eigenvalues: \sum_{i=1}^n a_{ii} = \sum_{i=1}^n \lambda_i. These relations are the first and last in a sequence of inequalities relating sums of eigenvalues to sums of diagonal elements obtained by Schur in 1923.

Theorem (Schur).

For a Hermitian A\in\mathbb{C}^{n\times n},

\notag     \displaystyle\sum_{i=1}^k \lambda_i \ge \displaystyle\sum_{i=1}^k \widetilde{a}_{ii},     \quad k=1\colon n,

where \{\widetilde{a}_{ii}\} is the set of diagonal elements of A arranged in decreasing order: \widetilde{a}_{11} \ge \cdots \ge \widetilde{a}_{nn}.

These inequalities say that the vector [\lambda_1,\dots,\lambda_n] of eigenvalues majorizes the ordered vector [\widetilde{a}_{11},\dots,\widetilde{a}_{nn}] of diagonal elements.

An interesting special case is a correlation matrix, a symmetric positive semidefinite matrix with unit diagonal, for which the inequalities are

\notag     \lambda_1 \ge 1, \quad     \lambda_1+ \lambda_2\ge 2, \quad \dots, \quad     \lambda_1+ \lambda_2 + \cdots + \lambda_{n-1} \ge n-1,

and \lambda_1+ \lambda_2 + \cdots + \lambda_n = n. Here is an illustration in MATLAB.

>> n = 5; rng(1); A = gallery('randcorr',n);
>> e = sort(eig(A)','descend'), partial_sums = cumsum(e)
e =
  2.2701e+00   1.3142e+00   9.5280e-01   4.6250e-01   3.6045e-04
partial_sums =
  2.2701e+00   3.5843e+00   4.5371e+00   4.9996e+00   5.0000e+00
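A case where the partial sums can be written down explicitly (our own example, separate from the MATLAB run above) is the equicorrelation matrix C = (1-\rho)I + \rho ee^T with 0 \le \rho \le 1, whose eigenvalues are \lambda_1 = 1 + (n-1)\rho and \lambda_2 = \cdots = \lambda_n = 1-\rho. Then

\notag   \displaystyle\sum_{i=1}^k \lambda_i = 1 + (n-1)\rho + (k-1)(1-\rho) = k + (n-k)\rho \ge k,   \quad k = 1\colon n,

with equality for k = n, in agreement with the inequalities for a correlation matrix.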

Ky Fan (1949) proved a majorization relation between the eigenvalues of A, B, and A+B:

\notag   \displaystyle\sum_{i=1}^k \lambda_i(A+B) \le   \displaystyle\sum_{i=1}^k \lambda_i(A) +   \displaystyle\sum_{i=1}^k \lambda_i(B), \quad k = 1\colon n.

For k = 1, the inequality is the same as the upper bound of (1), and for k = n it is an equality: \mathrm{trace}(A+B) = \mathrm{trace}(A) + \mathrm{trace}(B).

Ostrowski’s Theorem

For a Hermitian A and a nonsingular X, the transformation A\to X^*AX is a congruence transformation. Sylvester’s law of inertia says that congruence transformations preserve the inertia. A result of Ostrowski (1959) goes further by providing bounds on the ratios of the eigenvalues of the original and transformed matrices.

Theorem (Ostrowski).

For a Hermitian A\in \mathbb{C}^{n\times n} and X\in\mathbb{C}^{n\times n},

\notag   \lambda_k(X^*AX) = \theta_k \lambda_k(A), \quad k=1\colon n,

where \lambda_n(X^*X) \le \theta_k \le \lambda_1(X^*X).

If X is unitary then X^*X = I and so Ostrowski’s theorem reduces to the fact that a congruence with a unitary matrix is a similarity transformation and so preserves eigenvalues. The theorem shows that the further X is from being unitary the greater the potential change in the eigenvalues.
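A simple worked example (ours) shows that both extremes of the interval for \theta_k can be attained. Take A = \mathrm{diag}(2,-1) and X = \mathrm{diag}(1,3), so that X^*AX = \mathrm{diag}(2,-9) and X^*X = \mathrm{diag}(1,9). Then

\notag   \lambda_1(X^*AX) = 2 = 1\cdot\lambda_1(A), \qquad   \lambda_2(X^*AX) = -9 = 9\cdot\lambda_2(A),

so \theta_1 = 1 = \lambda_2(X^*X) and \theta_2 = 9 = \lambda_1(X^*X). The inertia is unchanged, as Sylvester's law requires.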

Ostrowski’s theorem can be generalized to the situation where X is rectangular (Higham and Cheng, 1998).


The results we have described are strongly interrelated. For example, the Courant–Fischer theorem and the Cauchy interlace theorem can be derived from each other, and Ostrowski’s theorem can be proved using the Courant–Fischer theorem.


Matrix Rank Relations

Matrix rank is an important concept in linear algebra. While rank deficiency can be a sign of an incompletely or improperly specified problem (a singular system of linear equations, for example), in some problems low rank of a matrix is a desired property or outcome. Here we present some fundamental rank relations in a concise form useful for reference. These are all immediate consequences of the singular value decomposition (SVD), but we give elementary (albeit not entirely self-contained) proofs of them.

The rank of a matrix A\in\mathbb{R}^{m\times n} is the maximum number of linearly independent columns, which is the dimension of the range space of A, \mathrm{range}(A) = \{\, Ax: x \in\mathbb{R}^n \,\}. An important but non-obvious fact is that this is the same as the maximum number of linearly independent rows (see (5) below).

A rank-1 matrix has the form xy^*, where x and y are nonzero vectors. Every column is a multiple of x and every row is a multiple of y^*. A sum of k rank-1 matrices has the form

\notag    A = \displaystyle\sum_{i=1}^{k} x_iy_i^*      =  \begin{bmatrix} x_1 & x_2 & \dots & x_k \end{bmatrix}         \begin{bmatrix} y_1^* \\ y_2^* \\ \vdots\\ y_k^* \end{bmatrix}      \equiv XY^*.         \qquad (0)

Each column of A is a linear combination of the vectors x_1, x_2, …, x_k, so A has at most k linearly independent columns, that is, A has rank at most k. In fact, \mathrm{rank}(A) = k if X and Y have rank k, as follows from (4) below. Any rank-k matrix can be written in the form (0) with X and Y of rank k; indeed this is the full-rank factorization below.

Here are some fundamental rank equalities and inequalities.

Rank-Nullity Theorem

The rank-nullity theorem says that

\notag    \boxed{ \mathrm{rank}(A) +  \mathrm{dim}( \mathrm{null}(A) ) = n,     \quad A\in\mathbb{R}^{m\times n},}

where \mathrm{null}(A) = \{\, x \in\mathbb{R}^n: Ax = 0 \,\} is the null space of A.

Rank Bound

The rank cannot exceed the number of columns, or, by (5) below, the number of rows:

\notag    \boxed{ \mathrm{rank}(A) \le \min(m,n), \quad A\in\mathbb{C}^{m\times n}. }

Rank of a Sum

For any A and B of the same dimension,

\notag     \boxed{|\mathrm{rank}(A) - \mathrm{rank}(B)| \le    \mathrm{rank}(A+B) \le \mathrm{rank}(A) + \mathrm{rank}(B).} \qquad (1)

The upper bound follows from the fact that the dimension of the sum of two subspaces cannot exceed the sum of the dimensions of the subspaces. Interestingly, the upper bound is also a corollary of the bound (3) for the rank of a matrix product, because

\notag   \begin{aligned}    \mathrm{rank}(A+B) &= \mathrm{rank}\biggl(  \begin{bmatrix} A & B  \end{bmatrix}  \begin{bmatrix} I \\ I  \end{bmatrix} \biggr)\\  &\le  \min\biggl(\mathrm{rank}\bigl(\begin{bmatrix} A & B  \end{bmatrix}\bigr),  \mathrm{rank}\biggl(\begin{bmatrix} I \\ I  \end{bmatrix} \biggr)\biggr)\\  &\le \mathrm{rank}\bigl(\begin{bmatrix} A & B  \end{bmatrix}\bigr)\\   &\le \mathrm{rank}(A) + \mathrm{rank}(B).   \end{aligned}

For the lower bound, writing A = -B + (A+B) and applying the upper bound gives \mathrm{rank}(A) \le \mathrm{rank}(-B) + \mathrm{rank}(A+B) = \mathrm{rank}(B) + \mathrm{rank}(A+B), and likewise with the roles of A and B interchanged.

Rank of A and A^*A

For any A,

\notag    \boxed{\mathrm{rank}(A^*A) = \mathrm{rank}(A).}  \qquad (2)

Indeed Ax = 0 implies A^*Ax = 0, and A^*Ax = 0 implies 0 = x^*A^*Ax = (Ax)^*(Ax), which implies Ax = 0. Hence the null spaces of A and A^*A are the same. The equality (2) follows from the rank-nullity theorem.

Rank of a General Product

For any A and B for which the product AB is defined,

\notag   \boxed{\mathrm{rank}(AB) \le \min\bigl( \mathrm{rank}(A), \mathrm{rank}(B) \bigr).}     \qquad (3)

If B = [b_1,\dots,b_n] then AB = [Ab_1,\dots,Ab_n], so the columns of AB are linear combinations of those of A and so AB cannot have more linearly independent columns than A, that is, \mathrm{rank}(AB) \le \mathrm{rank}(A). Using (5) below, we then have

\notag  \mathrm{rank}(AB) = \mathrm{rank}(B^*A^*)   \le \mathrm{rank}(B^*) = \mathrm{rank}(B).

The latter inequality can be proved without using (5) (our proof of which uses (3)), as follows. Suppose \mathrm{rank}(B) < \mathrm{rank}(AB) = r. Let the columns of Y form a basis for \mathrm{range}(AB), so that Y has r columns and Y = ABZ for some matrix Z with r columns. Now \mathrm{rank}(BZ) \le \mathrm{rank}(B) < r by the first part, so BZg = 0 for some nonzero g. But then Yg = ABZg = 0, which contradicts the linear independence of the columns of Y, so we must have \mathrm{rank}(B) \ge \mathrm{rank}(AB).

Rank of a Product of Full-Rank Matrices

We have

\notag    \boxed{  \mathrm{rank}(AB) = r, \quad     A\in\mathbb{C}^{m\times r}, \;     B\in\mathbb{C}^{r\times n}, \;     \mathrm{rank}(A) = \mathrm{rank}(B) = r .}     \qquad (4)

We note that A^*A and BB^* are both nonsingular r\times r matrices by (2), so their product has rank r. Using (3),

\notag   r = \mathrm{rank}(A^*A BB^*)     \le \mathrm{rank}(A B) \le r,

and hence \mathrm{rank}(A B) = r.

Another important relation is

\notag   \boxed{ \mathrm{rank}(XAY ) = \mathrm{rank}(A), \quad    X\in\mathbb{C}^{m\times m} \;\mathrm{and}\;    Y\in\mathbb{C}^{n\times n}\; \mathrm{nonsingular}. }

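This is a consequence of the equality \mathrm{range}(XAY) = X\,\mathrm{range}(A) = \{\, Xw: w \in \mathrm{range}(A) \,\}, which holds because the nonsingular Y maps \mathbb{C}^n onto itself, together with the fact that multiplication by the nonsingular X preserves the dimension of a subspace.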

Ranks of A and A^*

By (2) and (3) we have \mathrm{rank}(A) = \mathrm{rank}(A^*A) \le \mathrm{rank}(A^*). Interchanging the roles of A and A^* gives \mathrm{rank}(A^*) \le \mathrm{rank}(A) and so

\notag   \boxed{ \mathrm{rank}(A^*) = \mathrm{rank}(A). } \qquad (5)

In other words, the rank of A is equal to the maximum number of linearly independent rows as well as the maximum number of linearly independent columns.

Full-Rank Factorization

A\in\mathbb{C}^{m \times n} has rank r if and only if A = GH for some G\in\mathbb{C}^{m \times r} and H\in\mathbb{C}^{r \times n}, both of rank r, and this is called a full-rank factorization. The existence of such a factorization implies that \mathrm{rank}(A) = r by (4). Conversely, suppose that A has rank r. Let the columns of X\in\mathbb{C}^{m \times r} form a basis for the range space of A. Then there are r-vectors y_j such that a_j = Xy_j, j = 1\colon n, and with Y = [y_1,y_2,\dots, y_n] we have A = XY. Finally, r = \mathrm{rank}(A)     = \mathrm{rank}(XY)     \le \mathrm{rank}(Y) by (3), and since \mathrm{rank}(Y) \le r we have \mathrm{rank}(Y) = r.

Rank and Minors

A characterization of rank that is sometimes used as the definition is that it is the size of the largest nonsingular square submatrix. Equivalently, the rank is the size of the largest nonzero minor, where a minor of size k is the determinant of a k\times k submatrix.

rank(AB) and rank(BA)

Although AB and BA have some properties in common when both products are defined (notably they have the same nonzero eigenvalues), \mathrm{rank}(AB) is not always equal to \mathrm{rank}(BA). A simple example is A = x and B = y^* with x and y nonzero orthogonal vectors: AB = xy^* has rank 1 but BA = y^*x = 0 has rank 0. An example with square A and B is

\notag   \begin{gathered}   A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad   B = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \\    \mathrm{rank}(AB) =    \mathrm{rank}\biggl( \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}    \biggr) = 0, \quad    \mathrm{rank}(BA) =    \mathrm{rank}\biggl( \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}    \biggr) = 1.   \end{gathered}

Note that A = e_1e_2^T and B = e_1e_1^T, where e_i has 1 in the ith position and zeros everywhere else. Such matrices are easy to manipulate in this form (e.g., AB = e_1 (e_2^Te_1)e_1^T = 0) and are useful for constructing examples.
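Several of the relations in this section can be spot-checked in exact arithmetic. The following Python sketch computes the rank by Gaussian elimination over the rationals, which sidesteps rounding errors. The helper names are ours, and the final matrix, with elements 1 to 16 stored columnwise, is an illustrative choice that happens to have an explicit full-rank factorization of the form (0).

```python
from fractions import Fraction

def rank(A):
    """Exact rank by Gaussian elimination over the rationals."""
    M = [[Fraction(v) for v in row] for row in A]
    r = 0
    ncols = len(M[0]) if M else 0
    for c in range(ncols):
        # Find a row at or below position r with a nonzero in column c.
        piv = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
    return r

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

A = [[0, 1], [0, 0]]  # e_1 e_2^T
B = [[1, 0], [0, 0]]  # e_1 e_1^T
print(rank(matmul(A, B)), rank(matmul(B, A)))        # rank(AB), rank(BA)
print(rank(add(A, B)), rank(A) + rank(B))            # rank of a sum, relation (1)
print(rank(matmul(transpose(A), A)), rank(A))        # relation (2)

# A = x e^T + e y^T written as X Y^T with X = [x e] and Y = [e y].
X = [[i + 1, 1] for i in range(4)]
Yt = [[1, 1, 1, 1], [0, 4, 8, 12]]
C = matmul(X, Yt)  # element (i, j) is (i + 1) + 4 j, counting from zero
print(C, rank(C))
```

Because the elimination is exact, the computed ranks are the true ranks. In floating-point arithmetic the same elimination could misreport the rank of a rank-deficient matrix.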

How to Find Rank

If we have a full-rank factorization of A then we can read off the rank from the dimensions of the factors. But finding a full-rank factorization is a nontrivial task. The ultimate full-rank factorization is the SVD

\notag     A = U\Sigma V^T,

where U\in\mathbb{R}^{m\times m} and V\in\mathbb{R}^{n\times n} are orthogonal, \Sigma = \mathrm{diag}(\sigma_1,\dots, \sigma_p)\in\mathbb{R}^{m\times n}, where p = \min(m,n), and \sigma_1\ge \sigma_2\ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_p = 0. The rank of A is r, the number of nonzero singular values.

In floating-point arithmetic, the standard algorithms for computing the SVD are numerically stable, that is, the computed singular values are the exact singular values of a matrix A + \Delta A with \|\Delta A\|_2 \le c_{m,n}u\|A\|_2, where c_{m,n} is a constant and u is the unit roundoff. Unfortunately, A + \Delta A will typically be full rank when A is rank deficient. For example, consider this computation.

>> n = 4; A = zeros(n); A(:) = 1:n^2, svd(A)
A =
     1     5     9    13
     2     6    10    14
     3     7    11    15
     4     8    12    16
ans =

The matrix has rank 2 and the two zero singular values are approximated by computed singular values of order 10^{-15}. In general, we have no way to know whether tiny computed singular values signify exactly zero singular values. In practice, one typically defines a numerical rank based on a threshold and regards computed singular values less than the threshold as zero. Indeed the MATLAB rank function computes the rank as the number of singular values exceeding 2u \max(m,n)\widehat{\sigma}_1, where \widehat{\sigma}_1 is the largest computed singular value. If the data from which the matrix is constructed is uncertain then the definition of numerical rank should take into account the level of uncertainty in the data. Dealing with rank deficiency in the presence of data errors and in finite precision arithmetic is a tricky business.


An excellent reference for further rank relations is Horn and Johnson. Stewart describes some of the issues associated with rank-deficient matrices in practical computation.

Diagonally Perturbing a Symmetric Matrix to Make It Positive Definite

Suppose A is a matrix that is symmetric but not positive definite. What is the best way to perturb the diagonal to make A positive definite? We want to compute a vector d such that

\notag     A(d) = A + D, \quad D = \mathrm{diag}(d),

is positive definite. Since the positive definite matrices form an open set there is no minimal d, so we relax the requirement to be that A(d) is positive semidefinite. The perturbation D needs to make any negative eigenvalues of A become nonnegative. We will require all the entries of d to be nonnegative. Denote the eigenvalues of A by \lambda_n(A) \le \lambda_{n-1}(A) \le \cdots \le \lambda_1(A) and assume that \lambda_n(A) < 0.

A natural choice is to take D to be a multiple of the identity matrix. For d_i \equiv \delta, A(d) has eigenvalues \lambda_i + \delta, and so the smallest possible \delta is \delta = - \lambda_n(A). This choice of D shifts all the diagonal elements by the same amount, which might be undesirable for a matrix with widely varying diagonal elements.

When the diagonal entries of A are positive another possibility is to take d_i = \alpha a_{ii}, so that each diagonal entry undergoes a relative perturbation of size \alpha. Write D_A = \mathrm{diag}(a_{ii}) and note that C = D_A^{-1/2}A D_A^{-1/2} is symmetric with unit diagonal. Then

\notag     A + \alpha D_A = D_A^{1/2}(C + \alpha I)D_A^{1/2}.

Since A + \alpha D_A is positive semidefinite if and only if C + \alpha I is positive semidefinite, the smallest possible \alpha is \alpha = -\lambda_n(C).
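Both choices amount to finding the smallest s for which A + s\,\mathrm{diag}(w) is positive semidefinite, with w the vector of ones for the shift and w = \mathrm{diag}(A) for the relative perturbation. The text obtains s from an extreme eigenvalue. The Python sketch below instead approximates it by bisection, using the success of a Cholesky factorization as the test for definiteness. This is a swapped-in technique chosen because it needs no eigenvalue solver, not the method of the text, and the function names are ours.

```python
import math

def is_positive_definite(A):
    """Attempt a Cholesky factorization; success signals definiteness."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            return False
        L[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return True

def smallest_scaling(A, w, tol=1e-10):
    """Approximate the infimum of s >= 0 with A + s*diag(w) positive definite.

    Assumes A is symmetric and every w_i is positive.
    """
    n = len(A)

    def perturbed(s):
        return [[A[i][j] + (s * w[i] if i == j else 0.0) for j in range(n)]
                for i in range(n)]

    if is_positive_definite(A):
        return 0.0
    lo, hi = 0.0, 1.0
    while not is_positive_definite(perturbed(hi)):
        hi *= 2.0  # expand until the perturbed matrix is definite
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_positive_definite(perturbed(mid)):
            hi = mid
        else:
            lo = mid
    return hi

# Uniform shift for the 5x5 matrix with elements |i - j|.
F = [[float(abs(i - j)) for j in range(5)] for i in range(5)]
print(smallest_scaling(F, [1.0] * 5))

# Relative perturbation d_i = alpha * a_ii for a matrix with positive diagonal.
A = [[1.0, 10.0], [10.0, 4.0]]
print(smallest_scaling(A, [A[0][0], A[1][1]]))
```

For the 2\times 2 matrix, C has off-diagonal element 5, so \lambda_n(C) = -4 and the second call should return a value close to \alpha = 4. The bisection is valid because adding a larger multiple of a positive diagonal matrix can only increase the eigenvalues.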

More generally, we can treat the d_i as n independent variables and ask for the solution of the optimization problem

\notag    \min \|d\| ~~\mathrm{subject~to}~~ A + \mathrm{diag}(d)               ~\mathrm{positive~semidefinite}, ~d \ge 0.   \qquad(\dagger)

Of particular interest are the norms \|d\|_1 = \sum_i d_i = \mathrm{trace}(D) (since d\ge0) and \|d\|_\infty = \max_i d_i.

If A+D is positive semidefinite then from standard eigenvalue inequalities,

\notag       0 \le \lambda_n(A+D) \le  \lambda_n(A) +  \lambda_1(D),

so that

\notag       \max_i d_i \ge -\lambda_n(A).

Since d_i \equiv -\lambda_n(A) satisfies the constraints of (\dagger), this means that this d solves (\dagger) for the \infty-norm, though the solution is obviously not unique in general.

For the 1- and 2-norms, (\dagger) does not have an explicit solution, but it can be solved by semidefinite programming techniques.

Another approach to finding a suitable D is to compute a modified Cholesky factorization. Given a symmetric A, such a method computes a perturbation E such that A + E = R^TR for an upper triangular R with positive diagonal elements, so that A + E is positive definite. The methods of Gill, Murray, and Wright (1981) and Schnabel and Eskow (1990) compute a diagonal E. The cost in flops is essentially the same as that of computing a Cholesky factorization (n^3/3 flops), so this approach is likely to require fewer flops than computing the minimum eigenvalue or solving an optimization problem, but the perturbations produced will not be optimal.


We take the 5\times 5 Fiedler matrix

>> A = gallery('fiedler',5)
A =
     0     1     2     3     4
     1     0     1     2     3
     2     1     0     1     2
     3     2     1     0     1
     4     3     2     1     0

The smallest eigenvalue is -5.2361, so A+D is positive semidefinite for D_1 = 5.2361I. The Gill–Murray–Wright method gives D_{\mathrm{GMW}} with diagonal elements

2.0000e+01   4.6159e+00   1.3194e+00   2.5327e+00   1.0600e+01

and has \lambda_5(A+D_{\mathrm{GMW}}) = 0.5196 while the Schnabel–Eskow method gives D_{\mathrm{SE}} with diagonal elements

6     6     6     6     6

and has \lambda_5(A+D_{\mathrm{SE}}) = 0.7639. If we increase the diagonal elements of D_1 by 0.5 to give comparable smallest eigenvalues for the perturbed matrices then we have

                      \Vert d\Vert_{\infty}   \Vert d\Vert_1
Shift                 5.2361                  26.180
Gill–Murray–Wright    20.000                  39.068
Schnabel–Eskow        6.000                   30.000


When Does Thresholding Preserve Positive Definiteness?

Does a symmetric positive definite matrix remain positive definite when we set one or more elements to zero? This question arises in thresholding, in which elements of absolute value less than some tolerance are set to zero. Thresholding is used in some applications to remove correlations thought to be spurious, so that only statistically significant ones are retained.

We will focus on the case where just one element is changed and consider an arbitrary target value rather than zero. Given an n\times n symmetric positive definite matrix A we define A(t) to be the matrix resulting from adding t to the (i,j) and (j,i) elements and we ask when is A(t) positive definite. We can write

\notag   A(t) = A + t(e_i^{}e_j^T + e_j^{}e_i^T) \equiv A + tE_{ij},

where e_i is the ith column of the identity matrix. The perturbation E_{ij} has rank 2, with eigenvalues -1, 1, and 0 repeated n-2 times. Hence we can write E_{ij} in the form E_{ij} = pp^T - qq^T, where p^Tp = q^Tq = 1 and p^Tq = 0. Adding pp^T to A causes each eigenvalue to increase or stay the same, while subtracting qq^T decreases or leaves unchanged each eigenvalue. However, more is true: after each of these rank-1 perturbations the eigenvalues of the original and perturbed matrices interlace, by Weyl’s theorem. Hence, with the eigenvalues of A ordered as \lambda_n(A) \le \cdots \le \lambda_1(A), we have (Horn and Johnson, Cor. 4.3.7)

\notag \begin{aligned}   \lambda_n(A(t)) &\le \lambda_{n-1}(A), \\   \lambda_{i+1}(A) &\le \lambda_i(A(t)) \le \lambda_{i-1}(A),    \quad i = 2\colon n-1, \\   \lambda_2(A) &\le \lambda_1(A(t)). \end{aligned}

Because A is positive definite these inequalities imply that \lambda_{n-1}(A(t)) \ge \lambda_n(A) > 0, so A(t) has at most one negative eigenvalue. Since \det(A(t)) is the product of the eigenvalues of A(t) this means that A(t) is positive definite precisely when \det(A(t)) > 0.
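
One explicit choice of the vectors p and q in the decomposition of E_{ij} is given below. It is offered as a worked check under the assumption i \ne j, and is not necessarily the choice the cited references use.

```latex
p = \frac{1}{\sqrt{2}}(e_i + e_j), \qquad
q = \frac{1}{\sqrt{2}}(e_i - e_j), \qquad
p^Tp = q^Tq = 1, \quad p^Tq = \tfrac{1}{2}(1 - 1) = 0,
\\
pp^T - qq^T
  = \tfrac{1}{2}\bigl[(e_i+e_j)(e_i+e_j)^T - (e_i-e_j)(e_i-e_j)^T\bigr]
  = e_i^{}e_j^T + e_j^{}e_i^T = E_{ij}.
```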

There is a simple expression for \det(A(t)), which follows from a lemma of Chan (1984), as explained by Georgescu, Higham, and Peters (2018):

\notag  \det(A(t)) = \det(A)\big(1+ 2t b_{ij} + t^2(b_{ij}^2-b_{ii}b_{jj})\big),

where B = A^{-1}. Hence the condition for A(t) to be positive definite is

\notag  q_{ij}(t) = 1 + 2t b_{ij} + t^2(b_{ij}^2-b_{ii}b_{jj}) > 0.

We can factorize

\notag     q_{ij}(t) = \Bigl( t\bigl(b_{ij}  - \sqrt{b_{ii}b_{jj}}\bigr) + 1 \Bigr)                 \Bigl( t\bigl(b_{ij}  + \sqrt{b_{ii}b_{jj}}\bigr) + 1 \Bigr),

so q_{ij}(t) > 0 for

\notag    t\in \left( \displaystyle\frac{-1}{ \sqrt{b_{ii}b_{jj}} + b_{ij} },                 \displaystyle\frac{1}{ \sqrt{b_{ii}b_{jj}} - b_{ij} } \right) =: I_{ij},

where the endpoints are finite because B, like A, is positive definite and so |b_{ij}| < \sqrt{b_{ii}b_{jj}}.
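
To see where the interval comes from, it may help to expand the factorization and read off the roots. The shorthand s = \sqrt{b_{ii}b_{jj}} is introduced here for convenience and is not part of the original notation.

```latex
\begin{aligned}
\bigl(t(b_{ij}-s)+1\bigr)\bigl(t(b_{ij}+s)+1\bigr)
   &= t^2(b_{ij}^2 - s^2) + 2t b_{ij} + 1
    = 1 + 2t b_{ij} + t^2(b_{ij}^2 - b_{ii}b_{jj}), \\
q_{ij}(t) = 0 \quad &\Longleftrightarrow \quad
   t = \frac{-1}{s + b_{ij}} \quad \text{or} \quad t = \frac{1}{s - b_{ij}}.
\end{aligned}
```

Since |b_{ij}| < s, the coefficient of t^2 is negative and q_{ij}(0) = 1 > 0, so the quadratic is positive exactly between its two roots, which are the endpoints of I_{ij}.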

The condition for A to remain positive definite when a_{ij} is set to zero is q_{ij}(-a_{ij}) > 0, or equivalently -a_{ij} \in I_{ij}. To check either of these conditions we need just b_{ij}, b_{ii}, and b_{jj}. These elements can be computed without computing the whole inverse by solving the equations Ab_k = e_k for k = i,j, for the kth column b_k of B, making use of a Cholesky factorization of A.
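
The procedure just described can be sketched in MATLAB as follows. The 3\times 3 matrix and the choice (i,j) = (1,3) are hypothetical, chosen only to illustrate the steps; the variable names are likewise illustrative.

```matlab
% Hypothetical symmetric positive definite test matrix.
A = [4 1 2; 1 3 0; 2 0 5];
i = 1; j = 3;
n = size(A,1);

% Columns i and j of B = inv(A) from one Cholesky factorization A = R'*R.
R = chol(A);
I = eye(n);
Bcols = R \ (R' \ I(:,[i j]));
bii = Bcols(i,1);  bij = Bcols(j,1);  bjj = Bcols(j,2);

% Endpoints of the interval I_ij on which A(t) is positive definite.
s = sqrt(bii*bjj);
t_lo = -1/(s + bij);
t_hi =  1/(s - bij);

% Zeroing a_ij corresponds to t = -a_ij.
t = -A(i,j);
preserved = (t > t_lo) && (t < t_hi);
fprintf('I_ij = (%.4f, %.4f), -a_ij = %.4f, preserved = %d\n', ...
        t_lo, t_hi, t, preserved);

% Perturbation matrix E_ij, useful for checking the result with eig.
E = zeros(n);  E(i,j) = 1;  E(j,i) = 1;
fprintf('min eig after zeroing: %.4f\n', min(eig(A + t*E)));
```

Only two triangular solves per column are needed, so the cost is dominated by the Cholesky factorization.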

As an example, we consider the 4\times 4 Lehmer matrix, which has (i,j) element i/j for i \le j and is symmetric:

\notag   A = \begin{bmatrix}         1           & \frac{1}{2}  & \frac{1}{3} & \frac{1}{4} \\[3pt]         \frac{1}{2} &           1  & \frac{2}{3} & \frac{1}{2} \\[3pt]         \frac{1}{3} &  \frac{2}{3} & 1           & \frac{3}{4} \\[3pt]         \frac{1}{4} &  \frac{1}{2} & \frac{3}{4} &  1         \end{bmatrix}.

The smallest eigenvalue of A is 0.208. Any off-diagonal element except the (2,4) element can be zeroed without destroying positive definiteness, and if the (2,4) element is zeroed then the new matrix has smallest eigenvalue -0.0249. For i=2 and j=4, the following plot shows in red \lambda_{\min}(A(t)) and in blue q_{24}(t); the black dots are the endpoints of the closure of the interval I_{24} = (-0.453,0.453) and the vertical black line is the value -a_{24}. Clearly, -a_{24} lies outside I_{24}, which is why zeroing this element causes a loss of positive definiteness. Note that I_{24} also tells us that we can increase a_{24} to any number less than 0.953 without losing definiteness.
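
The claims in this example can be checked with a few lines of MATLAB. The sketch below loops over all off-diagonal positions, tests whether -a_{ij} lies in I_{ij}, and compares with the smallest eigenvalue of the thresholded matrix. The full inverse is formed only because every element of B is needed for this small example; the variable names are illustrative.

```matlab
A = gallery('lehmer',4);
n = 4;
B = inv(A);            % all elements of B are used below
fails = zeros(0,2);    % positions whose zeroing destroys definiteness

for i = 1:n-1
    for j = i+1:n
        s = sqrt(B(i,i)*B(j,j));
        t_lo = -1/(s + B(i,j));
        t_hi =  1/(s - B(i,j));
        t = -A(i,j);
        inside = (t > t_lo) && (t < t_hi);

        A0 = A;  A0(i,j) = 0;  A0(j,i) = 0;   % thresholded matrix
        fprintf('(%d,%d): I = (%7.3f,%7.3f), -a_ij = %6.3f, inside = %d, min eig = %7.4f\n', ...
                i, j, t_lo, t_hi, t, inside, min(eig(A0)));
        if ~inside
            fails(end+1,:) = [i j];
        end
    end
end
```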


Given a positive definite matrix and a set S of elements to be modified we may wish to determine subsets (including a maximal subset) of S for which the modifications preserve definiteness. Efficiently determining these subsets appears to be an open problem.

In practical applications thresholding may lead to an indefinite matrix. Definiteness must then be restored to obtain a valid correlation matrix. One way to do this is to find the nearest correlation matrix in the Frobenius norm such that the zeroed elements remain zero. This can be done by the alternating projections method with a projection to keep the zeroed elements fixed. Since the nearest correlation matrix is in general only positive semidefinite, it is also desirable to incorporate a lower bound \delta > 0 on the smallest eigenvalue, which corresponds to another projection. Both these projections are supported in the algorithm of Higham and Strabić (2016), for which an implementation is available. For the Lehmer matrix, the nearest correlation matrix with zero (2,4) element and eigenvalues at least \delta = 0.01 is (to four significant figures)

\notag   \begin{bmatrix}    1       &    0.4946  &    0.3403  &    0.2445  \\    0.4946  &    1       &    0.6439  &    0       \\    0.3403  &    0.6439  &    1       &    0.7266  \\    0.2445  &    0       &    0.7266  &    1    \end{bmatrix}.
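
The printed matrix can be sanity-checked directly. The sketch below does not run the Higham and Strabić algorithm; it only verifies the structural properties of the four-figure matrix above and measures its Frobenius-norm distance from the Lehmer matrix. Because the entries are rounded, the computed smallest eigenvalue will agree with \delta only approximately.

```matlab
% Matrix as printed in the text (four significant figures).
X = [1      0.4946 0.3403 0.2445
     0.4946 1      0.6439 0
     0.3403 0.6439 1      0.7266
     0.2445 0      0.7266 1     ];

is_sym    = isequal(X, X');          % symmetric
unit_diag = all(diag(X) == 1);       % unit diagonal
zero_24   = (X(2,4) == 0);           % thresholded element stays zero
lam_min   = min(eig(X));             % should be close to delta = 0.01

A = gallery('lehmer',4);
dist = norm(X - A, 'fro');           % distance from the original matrix

fprintf('symmetric = %d, unit diagonal = %d, zero (2,4) = %d\n', ...
        is_sym, unit_diag, zero_24);
fprintf('min eig = %.4f, Frobenius distance from A = %.4f\n', lam_min, dist);
```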

A related question is: for which patterns of zeroed elements is positive definiteness guaranteed to be preserved for all positive definite A? Clearly, setting all the off-diagonal elements to zero preserves definiteness, since the diagonal of a positive definite matrix is positive. Guillot and Rajaratnam (2012) show that the answer to the question is that the new matrix must be a symmetric permutation of a block diagonal matrix. However, for a particular A this restriction does not necessarily apply, as the Lehmer matrix example shows.


Randsvd Matrices with Large Growth Factors

Sixty years ago James Wilkinson published his backward error analysis of Gaussian elimination for solving a linear system Ax = b, where A is a nonsingular n\times n matrix. He showed that in floating-point arithmetic the computed solution \widehat{x} satisfies

(A+\Delta A) \widehat{x} = b, \qquad      \|\Delta A\|_{\infty} \le  p(n) \rho_n  u \|A\|_{\infty},

where u is the unit roundoff and p is a low degree polynomial. The term \rho_n is the growth factor, defined by

\rho_n = \displaystyle\frac{\max_{i,j,k} |a_{ij}^{(k)}|}               {\max_{i,j}|a_{ij}|} \ge 1,

where the a_{ij}^{(k)} are the elements at the kth stage of Gaussian elimination. The growth factor measures how much elements grow during the elimination. We would like the product p(n)\rho_n to be of order 1, so that \Delta A is a small relative perturbation of A. We therefore need \rho_n not to be too large.

With partial pivoting, in which row interchanges are used to ensure that at each stage the pivot element is the largest in its column, Wilkinson showed that \rho_n \le 2^{n-1} and that equality is possible. Such exponential growth implies a large \Delta A (unless we are lucky), meaning a severe loss of numerical stability. However, seventy years of digital computing experience have shown that \rho_n is usually of modest size in practice. Explaining why this is the case is one of the outstanding problems in numerical analysis.
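
The definition of the growth factor, and the attainability of the bound 2^{n-1}, can be illustrated with a short script. The sketch below carries out Gaussian elimination with partial pivoting explicitly so that every stage k is included in the maximum. The test matrix is the classical worst-case example (unit diagonal, -1 below the diagonal, ones in the last column); the variable names are illustrative.

```matlab
n = 10;

% Classical matrix for which partial pivoting gives growth 2^(n-1).
A = eye(n) - tril(ones(n),-1);
A(:,n) = 1;

U = A;
maxelt = max(abs(U(:)));                % running max over all stages
for k = 1:n-1
    % Partial pivoting: first entry of largest magnitude in column k.
    [~,p] = max(abs(U(k:n,k)));
    p = p + k - 1;
    U([k p],:) = U([p k],:);

    % Eliminate below the pivot.
    m = U(k+1:n,k)/U(k,k);
    U(k+1:n,k+1:n) = U(k+1:n,k+1:n) - m*U(k,k+1:n);
    U(k+1:n,k) = 0;

    maxelt = max(maxelt, max(abs(U(:))));
end

rho = maxelt/max(abs(A(:)));
fprintf('n = %d: growth factor = %g, 2^(n-1) = %g\n', n, rho, 2^(n-1));
```

For this matrix no row interchanges take place and the last column doubles at each stage, which is what produces the exponential growth.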

It is easy to experiment with growth factors in MATLAB. I will use the function

function g = gf(A)
%GF     Approximate growth factor.
%   g = GF(A) is an approximation to the
%   growth factor for LU factorization
%   with partial pivoting.
[~,U] = lu(A);
g = max(abs(U),[],'all')/max(abs(A),[],'all');

It computes a lower bound on the growth factor (since it only considers k=n in the numerator in the definition), but it is entirely adequate for our purposes here. Let’s compute the growth factor for a random matrix of order 10,000 with elements from the standard normal distribution (mean 0, variance 1):

>> rng(1); n = 10000; gf(randn(n))
ans =

Growth of 61 is unremarkable for a matrix of this size. Now we try a matrix of the same size generated by the gallery('randsvd') function:

>> A = gallery('randsvd',n,1e6,2,[],[],1);
>> gf(A)
ans =

This function generates an n\times n matrix with known singular value distribution and with singular vector matrices that are random orthogonal matrices from the Haar distribution. The parameter 1e6 specifies the 2-norm condition number, while the 2 (the mode parameter) specifies that there is only one small singular value, so the singular values are 1 repeated n-1 times and 1e-6. Growth of 975 is exceptional! These matrices have been in MATLAB since the 1990s, but this large growth property has apparently not been noticed before.
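
The singular value distribution described above can be confirmed on a smaller matrix. This sketch uses n = 100 (an arbitrary choice to keep the run time short) with the same gallery call as in the text.

```matlab
rng(1);
n = 100;  kappa = 1e6;
A = gallery('randsvd', n, kappa, 2, [], [], 1);   % mode 2: one small singular value

s = svd(A);          % singular values in decreasing order
fprintf('largest = %.6f, (n-1)st = %.6f, smallest = %.3e, cond = %.3e\n', ...
        s(1), s(n-1), s(n), s(1)/s(n));
```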

It turns out that mode 2 randsvd matrices generate with high probability growth factors of size at least n/(4 \log n) for any condition number and for any pivoting strategy, not just partial pivoting. One way to check this is to randomly permute the columns of A before doing the LU factorization with partial pivoting:

>> gf(A(:,randperm(n)))
ans =

Here is a plot showing the maximum over 12 randsvd matrices for each n of the growth factors for three different pivoting strategies, along with the maximum growth factors for partial pivoting for rand and randn matrices. The black curve is n/(4 \log n). This plot emphasizes the unusually large growth for mode 2 randsvd matrices.
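
A small-scale version of this comparison can be run as follows. It is only a sketch: it uses one matrix per dimension rather than a maximum over twelve, partial pivoting only, and the same lower-bound approximation to the growth factor as gf. Since the matrices are random, the printed growth values will vary with the seed and MATLAB version.

```matlab
rng(1);
for n = [250 500 1000 2000]
    A = gallery('randsvd', n, 1e6, 2, [], [], 1);
    [~,U] = lu(A);                                   % partial pivoting
    g = max(abs(U),[],'all')/max(abs(A),[],'all');   % approximate growth factor
    fprintf('n = %4d: growth = %8.1f,  n/(4 log n) = %6.1f\n', ...
            n, g, n/(4*log(n)));
end

% Value of the reference curve at the dimension used in the text.
bound10000 = 1e4/(4*log(1e4));
fprintf('n/(4 log n) at n = 10000: %.2f\n', bound10000);
```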


What is the explanation for this large growth? It stems from three facts.

  • Haar distributed orthogonal matrices have the property that their elements are fairly small with high probability, as shown by Jiang in 2005.
  • If the largest entries in magnitude of A and A^{-1} are both small, in the sense that their product is \theta \ll 1, then A will produce a growth factor of at least 1/\theta for any pivoting strategy. This was proved by Des Higham and me in the paper Large Growth Factors in Gaussian Elimination with Pivoting.
  • If W is an orthogonal matrix generating large growth then a rank-1 perturbation of 2-norm at most 1 tends to preserve the large growth.
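
The first two facts can be explored numerically. The sketch below generates a Haar distributed orthogonal matrix by the standard device of a QR factorization of a Gaussian matrix with a sign correction (an assumption of this example, not necessarily how gallery('randsvd') does it), computes \theta, and compares 1/\theta with the growth observed for partial pivoting. For an orthogonal Q the inverse is Q^T, so \theta is the square of the largest entry in magnitude.

```matlab
rng(1);
n = 500;

% Haar distributed orthogonal matrix via QR of a Gaussian matrix.
[Q,R] = qr(randn(n));
Q = Q*diag(sign(diag(R)));       % sign fix makes the distribution Haar

% theta = max|q_ij| * max|(Q^{-1})_ij| = (max|q_ij|)^2 since Q^{-1} = Q'.
theta = max(abs(Q),[],'all')^2;

% Approximate growth factor for partial pivoting (as in gf).
[~,U] = lu(Q);
g = max(abs(U),[],'all')/max(abs(Q),[],'all');

fprintf('largest |q_ij| = %.4f, 1/theta = %.1f, observed growth = %.1f\n', ...
        sqrt(theta), 1/theta, g);
```

The observed growth should be no smaller than 1/\theta, up to rounding error, in line with the second fact.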

For full details see the new EPrint Random Matrices Generating Large Growth in LU Factorization with Pivoting by Des Higham, Srikara Pranesh and me.

Is growth of order n a problem in practice? It can be for two reasons.

  • The largest dense linear systems Ax = b solved today are of dimension n = 10^7. If we work in single precision then nu \approx 1 and so LU factorization can potentially be completely unstable if there is growth of order n.
  • For IEEE half precision arithmetic growth of order n will cause overflow once n exceeds 10^5 / \max_{i,j} |a_{ij}|. It was overflow in half precision LU factorization on randsvd matrices that alerted us to the large growth.
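
The two figures quoted above follow from simple arithmetic, sketched here. The unit roundoff for IEEE single precision is 2^{-24}, and the largest finite IEEE half precision number is 65504, which the text rounds up to 10^5.

```matlab
% Single precision: unit roundoff times the dimension.
u_single = double(eps('single'))/2;    % 2^(-24), about 5.96e-8
n = 1e7;
nu = n*u_single;
fprintf('u = %.3e, n*u = %.3f\n', u_single, nu);

% Half precision: largest finite number (5 exponent bits, 10 fraction bits).
half_max = (2 - 2^(-10))*2^15;
fprintf('largest half precision number = %d\n', half_max);
```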