# What Is Gershgorin’s Theorem?

For a given $n\times n$ matrix, Gershgorin’s theorem defines $n$ discs in the complex plane whose union contains the eigenvalues of the matrix. The theorem can provide approximations to eigenvalues. It can also provide qualitative information, such as that all the eigenvalues lie in a particular half-plane.

Theorem 1 (Gershgorin’s theorem).

The eigenvalues of $A\in\mathbb{C}^{n\times n}$ lie in the union of the $n$ discs in the complex plane

$\notag D_i = \Big\{ z\in\mathbb{C}: |z-a_{ii}| \le \displaystyle\sum_{j\ne i} |a_{ij}|\Big\}, \quad i=1\colon n.$

Proof. Let $\lambda$ be an eigenvalue of $A$ and $x$ a corresponding eigenvector and let $|x_k| = \|x\|_\infty$. From the $k$th equation in $Ax=\lambda x$ we have

$\notag \sum_{j=1 \atop j\ne k}^n a_{kj}x_j = (\lambda-a_{kk}) x_k.$

Hence

$\notag |\lambda-a_{kk}| \le \sum_{j=1 \atop j\ne k}^n |a_{kj}||x_j|/|x_k|,$

and since $|x_j|/|x_k|\le 1$ it follows that $\lambda$ belongs to the $k$th disc, $D_k$.

The Gershgorin discs $D_i$ are defined in terms of a summation over the rows of $A$, but since the eigenvalues of $A$ are the same as those of $A^T$ the same result holds with summation over the columns.

A consequence of the theorem is that if $0$ does not belong to any of the Gershgorin discs then $A$ is nonsingular. This is equivalent to the well-known result that a strictly diagonally dominant matrix is nonsingular.

Another consequence of the theorem is that if $A$ is symmetric and all the discs lie in the open right half-plane then the eigenvalues are positive and so $A$ is positive definite. This condition is equivalent to $A$ having positive diagonal elements and being strictly diagonally dominant.

The Gershgorin discs for the $5\times 5$ matrix

$\notag \left[\begin{array}{ccccc} 5/4 & 1 & 3/4 & 1/2 & 1/4 \\ 1 & 0 & 0 & 0 & 0\\ -1 & 1 & 0 & 0 & 0\\ 0 & 0 & 1 & 3 & 0\\ 0 & 0 & 0 & 1/2 & 5 \end{array}\right]$

are shown here:

The eigenvalues—three real and one complex conjugate pair—are the black dots. It happens that each disc contains an eigenvalue, but this is not always the case. For the matrix

$\notag \left[\begin{array}{cc} 2 & -1\\ 2 & 0 \end{array}\right]$

the discs are

and the blue disc does not contain an eigenvalue. The next result, which is proved by a continuity argument, provides additional information that increases the utility of Gershgorin’s theorem. In particular it says that if a disc is disjoint from the other discs then it contains an eigenvalue.

Theorem 2.

If $k$ of the Gershgorin discs of $A\in\mathbb{C}^{n\times n}$ are disjoint from the other discs then their union contains $k$ eigenvalues of $A$.

Theorem 2 tells us that the rightmost disc in our $5\times 5$ example contains one eigenvalue, $\lambda$, since it is disjoint from the other discs, and the union of the other four discs contains four eigenvalues. Furthermore, $\lambda$ must be real, because if not it occurs in a complex conjugate pair since the matrix is real, and as the disc is symmetric about the real axis $\overline{\lambda}$ would also lie in the disc, contradicting Theorem 2.

Gershgorin’s theorem is most useful for matrices that are close to being diagonal. A technique that can produce improved eigenvalue estimates is to apply the theorem to $D^{-1}AD$ for some nonsingular diagonal matrix $D$. This similarity transformation does not change the eigenvalues but it does change the discs, and the aim is to choose $D$ to reduce the radiuses of the discs. Consider our $5\times 5$ example. We know that there is one eigenvalue $\lambda$ in the rightmost disc and that it is real, so $4.5 \le \lambda \le 5.5$. For $D = \mathrm{diag}(1,1,1,1,1/4)$ the rightmost disc shrinks and remains distinct from the others and we obtain the sharper bounds $4.875 \le \lambda \le 5.126$. The discs for $D^{-1}AD$ are shown here:

## Notes

Most books on matrix analysis or numerical linear algebra include Gershgorin’s theorem.

Eigenvalue inclusion regions have been developed with discs replaced by more complicated shapes, such as Brauer’s ovals of Cassini.

Varga’s 2004 book is devoted to Gershgorin’s theorem and related results. It reproduces Gershgorin’s 1931 paper in an appendix.

## References

• S. Gerschgorin. Uber die Abgrenzung der Eigenwerte einer Matrix. Izv. Akad. Nauk. SSSR, 1:749–754, 1931.
• Richard S. Varga. Geršgorin and His Circles. Springer-Verlag, Berlin, Germany, 2004.

# What Is a Nilpotent Matrix?

An $n\times n$ matrix $A$ is nilpotent if $A^k =0$ for some positive integer $k$. A nonzero nilpotent matrix must have both positive and negative entries in order for cancellation to take place in the matrix powers. The smallest $k$ for which $A^k =0$ is called the index of nilpotency. The index does not exceed $n$, as we will see below.

Here are some examples of nilpotent matrices.

\notag \begin{aligned} A_1 &= \begin{bmatrix}0 & 1 \\ 0 & 0 \end{bmatrix}, \quad A_1^2=0,\\ A_2 &= \begin{bmatrix}0 & 1 & 1\\ 0 & 0 & 1\\ 0 & 0 & 0 \end{bmatrix}, \quad A_2^3 = 0,\\ A_3 &= \begin{bmatrix}1 & -1 \\ 1 & -1 \end{bmatrix}, \quad A_3^2 = 0,\\ A_4 &= \begin{bmatrix}2 & 2 & 4\\ -4 & -4 & -8\\ 1 & 1 & 2 \end{bmatrix}, \quad A_4^2 = 0. \end{aligned}

Matrix $A_1$ is the $2\times 2$ instance of the upper bidiagonal $p\times p$ matrix

$\notag N = \begin{bmatrix} 0 & 1 & & \\ & 0 & \ddots & \\ & & \ddots & 1 \\ & & & 0 \end{bmatrix}, \qquad (1)$

for which

$\notag N^2 = \begin{bmatrix} 0 & 0 & 1 & & \\ & 0 & \ddots & \ddots & \\ & & & \ddots & 1 \\ & & & \ddots & 0 \\ & & & & 0 \end{bmatrix}, \quad \dots, \quad N^{p-1} = \begin{bmatrix} 0 & 0 & \dots & 0 & 1 \\ & 0 & \ddots & & 0 \\ & & \ddots & \ddots & \vdots \\ & & & 0 & 0 \\ & & & & 0 \end{bmatrix}$

and $N^p = 0$. The superdiagonal of ones moves up to the right with each increase in the index of the power until it disappears off the top right corner of the matrix.

Matrix $A_4$ has rank $1$ and was constructed using a general formula: if $A = xy^T$ with $y^Tx = 0$ then $A^2 = xy^T xy^T = (y^Tx) xy^T = 0$. We simply took orthogonal vectors $x =[2, -4, 1]^T$ and $y = [1, 1, 2]^T$.

If $A$ is nilpotent then every eigenvalue is zero, since $Ax = \lambda x$ with $x\ne 0$ implies $0 = A^nx = \lambda^n x$ or $\lambda = 0$. Consequently, the trace and determinant of a nilpotent matrix are both zero.

If $A$ is nilpotent and Hermitian or symmetric, or more generally normal ($A^*A = AA^*$), then $A = 0$, since such a matrix has a spectral decomposition $A = Q \mathrm{diag}(\lambda_i)Q^*$ and the matrix $\mathrm{diag}(\lambda_i)$ is zero. It is only for nonnormal matrices that nilpotency is a nontrivial property, and the best way to understand it is with the Jordan canonical form (JCF). The JCF of a matrix with only zero eigenvalues has the form $A = XJX^{-1}$, where $J = \mathrm{diag}(J_{m_1},J_{m_2}, \dots, J_{m_p})$, where $J_{m_i}$ is of the form (1) and hence $J_{m_i}^{m_i} = 0$. It follows that the index of nilpotency is $k = \max\{\,m_i : i=1\colon p\,\} \le n$.

What is the rank of an $n\times n$ nilpotent matrix $A$? The minimum possible rank is $0$, attained for the zero matrix. The maximum possible rank is $n-1$, attained when the JCF of $A$ has just one Jordan block of size $n$. Any rank between $0$ and $n-1$ is possible: rank $j$ is attained when there is a Jordan block of size $j+1$ and all other blocks are $1\times 1$.

Finally, while a nilpotent matrix is obviously not invertible, like every matrix it has a Moore–Penrose pseudoinverse. The pseudoinverse of a Jordan block with eigenvalue zero is just the transpose of the block: $N^+ = N^T$ for $N$ in (1).

# What Is an Eigenvalue?

An eigenvalue of a square matrix $A$ is a scalar $\lambda$ such that $Ax = \lambda x$ for some nonzero vector $x$. The vector $x$ is an eigenvector of $A$ and it has the distinction of being a direction that is not changed on multiplication by $A$.

An $n\times n$ matrix has $n$ eigenvalues. This can be seen by noting that $Ax = \lambda x$ is equivalent to $(\lambda I - A) x = 0$, which means that $\lambda I - A$ is singular, since $x\ne 0$. Hence $\det(\lambda I - A) = 0$. But

$\notag \det(\lambda I - A) = \lambda^n + a_{n-1}\lambda^{n-1} + \cdots + a_1 \lambda + a_0$

is a scalar polynomial of degree $n$ (the characteristic polynomial of $A$) with nonzero leading coefficient and so has $n$ roots, which are the eigenvalues of $A$. Since $\det(\lambda I - A) = \det( (\lambda I - A)^T) = \det(\lambda I - A^T)$, the eigenvalues of $A^T$ are the same as those of $A$.

A real matrix may have complex eigenvalues, but they appear in complex conjugate pairs. Indeed $Ax = \lambda x$ implies $\overline{A}\overline{x} = \overline{\lambda} \overline{x}$, so if $A$ is real then $\overline{\lambda}$ is an eigenvalue of $A$ with eigenvector $\overline{x}$.

Here are some $2\times 2$ matrices and their eigenvalues.

\notag \begin{aligned} A_1 &= \begin{bmatrix}1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \lambda = 1,1,\\ A_2 &= \begin{bmatrix}0 & 1 \\ 0 & 0 \end{bmatrix}, \quad \lambda = 0,0,\\ A_3 &= \begin{bmatrix}0 & 1 \\ -1 & 0 \end{bmatrix}, \quad \lambda = \mathrm{i},\mathrm{-i}. \end{aligned}

Note that $A_1$ and $A_2$ are upper triangular, that is, $a_{ij} = 0$ for $i>j$. For such a matrix the eigenvalues are the diagonal elements.

A symmetric matrix ($A^T = A$) or Hermitian matrix ($A^* = A$, where $A^* = \overline{A}^T$) has real eigenvalues. A proof is $Ax = \lambda x \Rightarrow x^*A^* = \overline{\lambda} x^*$ so premultiplying the first equation by $x^*$ and postmultiplying the second by $x$ gives $x^*Ax = \lambda x^*x$ and $x^*Ax = \overline{\lambda} x^*x$, which means that $(\lambda-\overline{\lambda})x^*x = 0$, or $\lambda=\overline{\lambda}$ since $x^*x \ne 0$. The matrix $A_1$ above is symmetric.

A skew-symmetric matrix ($A^T = -A$) or skew-Hermitian complex matrix ($A^* = -A$) has pure imaginary eigenvalues. A proof is similar to the Hermitian case: $Ax = \lambda x \Rightarrow -x^*A = x^*A^* = \overline{\lambda} x^*$ and so $x^*Ax$ is equal to both $\lambda x^*x$ and $-\overline{\lambda} x^*x$, so $\lambda = -\overline{\lambda}$. The matrix $A_3$ above is skew-symmetric.

In general, the eigenvalues of a matrix $A$ can lie anywhere in the complex plane, subject to restrictions based on matrix structure such as symmetry or skew-symmetry, but they are restricted to the disc centered at the origin with radius $\|A\|$, because for any matrix norm $\|\cdot\|$ it can be shown that every eigenvalue satisfies $|\lambda| \le \|A\|$.

Here are some example eigenvalue distributions, computed in MATLAB. (The eigenvalues are computed at high precision using the Advanpix Multiprecision Computing Toolbox in order to ensure that rounding errors do not affect the plots.) The second and third matrices are real, so the eigenvalues are symmetrically distributed about the real axis. (The first matrix is complex.)

Although this article is about eigenvalues we need to say a little more about eigenvectors. An $n\times n$ matrix $A$ with distinct eigenvalues has $n$ linearly independent eigenvectors. Indeed it is diagonalizable: $A = XDX^{-1}$ for some nonsingular matrix $X$ with $D = \mathrm{diag}(\lambda_i)$ the matrix of eigenvalues. If we write $X$ in terms of its columns as $X = [x_1,x_2,\dots,x_n]$ then $AX = XD$ is equivalent to $Ax_i = \lambda _i x_i$, $i=1\colon n$, so the $x_i$ are eigenvectors of $A$. The matrices $A_1$ and $A_3$ above both have two linearly independent eigenvectors.

If there are repeated eigenvalues there can be less than $n$ linearly independent eigenvectors. The matrix $A_2$ above has only one eigenvector: the vector $\left[\begin{smallmatrix}1 \\ 0 \end{smallmatrix}\right]$ (or any nonzero scalar multiple of it). This matrix is a Jordan block. The matrix $A_1$ shows that a matrix with repeated eigenvalues can have linearly independent eigenvectors.

Here are some questions about eigenvalues.

• What matrix decompositions reveal eigenvalues? The answer is the Jordan canonical form and the Schur decomposition. The Jordan canonical form shows how many linearly independent eigenvectors are associated with each eigenvalue.
• Can we obtain better bounds on where eigenvalues lie in the complex plane? Many results are available, of which the most well-known is Gershgorin’s theorem.
• How can we compute eigenvalues? Various methods are available. The QR algorithm is widely used and is applicable to all types of eigenvalue problems.

Finally, we note that the concept of eigenvalue is more general than just for matrices: it extends to nonlinear operators on finite or infinite dimensional spaces.

## References

Many books include treatments of eigenvalues of matrices. We give just three examples.

• Gene Golub and Charles F. Van Loan, Matrix Computations, fourth edition, Johns Hopkins University Press, Baltimore, MD, USA, 2013.
• Roger A. Horn and Charles R. Johnson, Matrix Analysis, second edition, Cambridge University Press, 2013. My review of the second edition.
• Carl D. Meyer, Matrix Analysis and Applied Linear Algebra, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.

# What Is a Symmetric Indefinite Matrix?

A symmetric indefinite matrix $A$ is a symmetric matrix for which the quadratic form $x^TAx$ takes both positive and negative values. By contrast, for a positive definite matrix $x^TAx > 0$ for all nonzero $x$ and for a negative definite matrix $x^TAx < 0$ for all nonzero $x$.

A neat way to express the indefinitess is that there exist vectors $x$ and $y$ for which $(x^TAx)(y^TAy) < 0$.

A symmetric indefinite matrix has both positive and negative eigenvalues and in some sense is a typical symmetric matrix. For example, a random symmetric matrix is usually indefinite:

>> rng(3); B = rand(4); A = B + B'; eig(A)'
ans =
-8.9486e-01  -6.8664e-02   1.1795e+00   3.9197e+00


In general it is difficult to tell if a symmetric matrix is indefinite or definite, but there is one easy-to-spot sufficient condition for indefinitess: if the matrix has a zero diagonal element that has a nonzero element in its row then it is indefinite. Indeed if $a_{kk} = 0$ then $e_k^TAe_k = a_{kk} = 0$, where $e_k$ is the $k$th unit vector, so $A$ cannot be positive definite or negative definite. The existence of a nonzero element in the row of the zero rules out the matrix being positive semidefinite ($x^TAx \ge 0$ for all $x$) or negative semidefinite ($x^TAx \le 0$ for all $x$).

An example of a symmetric indefinite matrix is a saddle point matrix, which has the block $2\times 2$ form

$\notag C = \begin{bmatrix} A & B^T \\ B & 0 \end{bmatrix},$

where $A$ is symmetric positive definite and $B\ne0$. When $A$ is the identity matrix, $C$ is the augmented system matrix associated with a least squares problem $\min_x \|Bx - d\|_2$. Another example is the $n\times n$ reverse identity matrix $J_n$, illustrated by

$\notag J_4 = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix},$

which has eigenvalues $\pm1$ (exercise: how many $1$s and how many $-1$s?). A third example is a Toeplitz tridiagonal matrix with zero diagonal:

>> A = full(gallery('tridiag',5,1,0,1)), eig(sym(A))'
A =
0     1     0     0     0
1     0     1     0     0
0     1     0     1     0
0     0     1     0     1
0     0     0     1     0
ans =
[-1, 0, 1, 3^(1/2), -3^(1/2)]


How can we exploit symmetry in solving a linear system $Ax = b$ with a symmetric indefinite matrix $A$? A Cholesky factorization does not exist, but we could try to compute a factorization $A = LDL^T$, where $L$ is unit lower triangular and $D$ is diagonal with both positive and negative diagonal entries. However, this factorization does not always exist and if it does, its computation in floating-point arithmetic can be numerically unstable. The simplest example of nonexistence is the matrix

$\notag \begin{bmatrix} 0 & 1\\ 1 & 1 \end{bmatrix} \ne \begin{bmatrix} 1 & 0 \\ \ell_{21} & 0 \end{bmatrix} \begin{bmatrix} d_{11} & 0 \\ 0 & d_{22}\end{bmatrix} \begin{bmatrix} 1 & \ell_{21}\\ 0 & 1\end{bmatrix}.$

The way round this is to allow $D$ to have $2 \times 2$ blocks. We can compute a block $\mathrm{LDL^T}$ factorization $PAP^T = LDL^T$, were $P$ is a permutation matrix, $L$ is unit lower triangular, and $D$ is block diagonal with diagonal blocks of size $1$ or $2$. Various pivoting strategies, which determine $P$, are possible, but the recommend one is the symmetric rook pivoting strategy of Ashcraft, Grimes, and Lewis (1998), which has the key property of producing a bounded $L$ factor. Solving $Ax = b$ now reduces to substitutions with $L$ and a solve with $D$, which involves solving $2\times 2$ linear systems for the $2\times 2$ blocks and doing divisions for the $1\times 1$ blocks (scalars).

MATLAB implements $\mathrm{LDL^T}$ factorization in its ldl function. Here is an example using Anymatrix:

>> A = anymatrix('core/blockhouse',4), [L,D,P] = ldl(A), eigA = eig(A)'
A =
-4.0000e-01  -8.0000e-01  -2.0000e-01   4.0000e-01
-8.0000e-01   4.0000e-01  -4.0000e-01  -2.0000e-01
-2.0000e-01  -4.0000e-01   4.0000e-01  -8.0000e-01
4.0000e-01  -2.0000e-01  -8.0000e-01  -4.0000e-01
L =
1.0000e+00            0            0            0
0   1.0000e+00            0            0
5.0000e-01  -8.3267e-17   1.0000e+00            0
-2.2204e-16  -5.0000e-01            0   1.0000e+00
D =
-4.0000e-01  -8.0000e-01            0            0
-8.0000e-01   4.0000e-01            0            0
0            0   5.0000e-01  -1.0000e+00
0            0  -1.0000e+00  -5.0000e-01
P =
1     0     0     0
0     1     0     0
0     0     1     0
0     0     0     1
eigA =
-1.0000e+00  -1.0000e+00   1.0000e+00   1.0000e+00


Notice the $2\times 2$ blocks on the diagonal of $D$, each of which contains one negative eigenvalue and one positive eigenvalue. The eigenvalues of $D$ are not the same as those of $A$, but since $A$ and $D$ are congruent they have the same number of positive, zero, and negative eigenvalues.

# What Is a Toeplitz Matrix?

$T\in\mathbb{C}^{n\times n}$ is a Toeplitz matrix if $t_{ij} = t_{i-j}$ for $2n-1$ parameters $t_{1-n},\dots,t_{n-1}$. A Toeplitz matrix has constant diagonals. For $n = 4$:

$\notag T = \begin{bmatrix} t_0 & t_{-1} & t_{-2} & t_{-3}\\ t_1 & t_0 & t_{-1} & t_{-2}\\ t_2 & t_1 & t_0 & t_{-1}\\ t_3 & t_2 & t_1 & t_0 \end{bmatrix}.$

Toeplitz matrices arise in various problems, including analysis of time series, discretization of constant coefficient differential equations, and discretization of convolution equations $\int a(t-s)x(s)\,\mathrm{d} s = b(t)$.

Since a Toeplitz matrix depends on just $2n-1$ parameters it is reasonable to expect that a linear system $Tx = b$ can be solved in less than the $O(n^3)$ flops that would be required by LU factorization. Indeed methods are available that require only $O(n^2)$ flops; see Golub and Van Loan (2013) for details.

Upper triangular Toeplitz matrices can be written in the form

$\notag T = \sum_{j=1}^n t_{1-j} N^{j-1}, \quad N = \begin{bmatrix} 0 & 1 & & \\ & 0 & \ddots & \\ & & \ddots & 1 \\ & & & 0 \end{bmatrix},$

where $N$ is upper bidiagonal with a superdiagonal of ones and $N^n = 0$. It follows that the product of two upper triangular Toeplitz matrices is again upper triangular Toeplitz, upper triangular Toeplitz matrices commute, and $T^{-1}$ is also an upper triangular Toeplitz matrix (assuming $t_0$ is nonzero, so that $T$ is nonsingular).

Tridiagonal Toeplitz matrices arise frequently:

$\notag T(c,d,e) = \begin{bmatrix} d & e & & \\ c & d & \ddots & \\ & \ddots & \ddots & e \\ & & c & d \end{bmatrix} \in\mathbb{C}^{n\times n}.$

The eigenvalues of $T(c,d,e)$ are

$\notag d + 2 (ce)^{1/2} \cos\biggl( \displaystyle\frac{k \pi}{n+1} \biggr), \quad k = 1:n.$

The Kac–Murdock–Szegö matrix is the symmetric Toeplitz matrix

$\notag A(\rho) = \begin{bmatrix} 1 & \rho & \rho^2 & \dots & \rho^{n-1} \\ \rho & 1 & \rho & \dots & \rho^{n-2} \\ \rho^2 & \rho & 1 & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & \rho \\ \rho^{n-1} & \rho^{n-2} & \dots & \rho & 1 \end{bmatrix}.$

In MATLAB, a Toeplitz matrix can be constructed using toeplitz(c,r), which produces the matrix with first column c and first row r. Example:

>> n = 5; A = toeplitz(1:n,[1 -2:-1:-n])
A =
1    -2    -3    -4    -5
2     1    -2    -3    -4
3     2     1    -2    -3
4     3     2     1    -2
5     4     3     2     1


## References

• Gene Golub and Charles F. Van Loan, Matrix Computations, fourth edition, Johns Hopkins University Press, Baltimore, MD, USA, 2013. Section 4.7.

# Seven Sins of Numerical Linear Algebra

In numerical linear algebra we are concerned with solving linear algebra problems accurately and efficiently and understanding the sensitivity of the problems to perturbations. We describe seven sins, whereby accuracy or efficiency is lost or misleading information about sensitivity is obtained.

## 1. Inverting a Matrix

In linear algebra courses we learn that the solution to a linear system $Ax = b$ of $n$ equations in $n$ unknowns can be written $x = A^{-1}b$, where $A^{-1}$ is the matrix inverse. What is not always emphasized is that there are very few circumstances in which one should compute $A^{-1}$. Indeed one would not solve the scalar ($n=1$) system $7x = 21$ by computing $x = 7^{-1} \times 21$, but rather would carry out a division $x = 21/7$. In the $n\times n$ case, it is faster and more accurate to solve a linear system by LU factorization (Gaussian elimination) with partial pivoting than by inverting $A$ (which has, in any case, to be done by LU factorization).

Rare cases where $A^{-1}$ is required are in statistics, where the diagonal elements of the inverse of the covariance matrix are relevant quantities, and in certain algorithms for computing matrix functions.

## 2. Forming the Cross-Product Matrix A^TA

The solution to the linear least squares problem $\min_x\| b - Ax \|_2$, where $A$ is a full-rank $m\times n$ matrix with $m\ge n$, satisfies the normal equations $A^T\!A x = A^Tb$. It is therefore natural to form the symmetric positive definite matrix $A^T\!A$ and solve the normal equations by Cholesky factorization. While fast, this method is numerically unstable when $A$ is ill conditioned. By contrast, solving the least squares problem via QR factorization is always numerically stable.

What is wrong with the cross-product matrix $A^T\!A$ (also known as the Gram matrix)? It squares the data, which can cause a loss of information in floating-point arithmetic. For example, if

$A = \begin{bmatrix} 1 & 1 \\ \epsilon & 0 \end{bmatrix}, \quad 0 < \epsilon < \sqrt{u},$

where $u$ is the unit roundoff of the floating point arithmetic, then

$A^T\!A = \begin{bmatrix} 1 + \epsilon^2 & 1 \\ 1 & 1 \end{bmatrix}$

is positive definite but, since $\epsilon^2, in floating-point arithmetic $1+\epsilon^2$ rounds to $1$ and so

$\mathrm{f\kern.2ptl}( A^T\!A) = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}.$

which is singular, and the information in $\epsilon$ has been lost.

Another problem with the cross product matrix is that the $2$-norm condition number of $A^T\!A$ is the square of that of $A$, and this leads to numerical instability in algorithms that work with $A^T\!A$ when the condition number is large.

## 3. Evaluating Matrix Products in an Inefficient Order

The cost of evaluating a matrix product depends on the order in which the product is evaluated (assuming the matrices are not all $n\times n$). More precisely, matrix multiplication is associative, so $A(BC) = (AB)C$, and in general the cost of the evaluation of a product depends on where one puts the parentheses. One order may be much superior to others, so one should not simply evaluate the product in a fixed left-right or right-left order. For example, if $x$, $y$, and $z$ are $n$-vectors then $xy^Tz$ can be evaluated as

• $(xy^T)z$: a vector outer product followed by a matrix–vector product, costing $O(n^2)$ operations, or
• $x (y^Tz)$: a vector scalar product followed by a vector scaling, costing just $O(n)$ operations.

In general. finding where to put the parentheses in a matrix product $A_1A_2\dots A_k$ in order to minimize the operation count is a difficult problem, but for many cases that arise in practice it is easy to determine a good order.

## 4. Assuming that a Matrix is Positive Definite

Symmetric positive definite matrices (symmetric matrices with positive eigenvalues) are ubiquitous, not least because they arise in the solution of many minimization problems. However, a matrix that is supposed to be positive definite may fail to be so for a variety of reasons. Missing or inconsistent data in forming a covariance matrix or a correlation matrix can cause a loss of definiteness, and rounding errors can cause a tiny positive eigenvalue to go negative.

Definiteness implies that

• the diagonal entries are positive,
• $\det(A) > 0$,
• $|a_{ij}| < \sqrt{a_{ii}a_{jj}}$ for all $i \ne j$,

but none of these conditions, or even all taken together, guarantees that the matrix has positive eigenvalues.

The best way to check definiteness is to compute a Cholesky factorization, which is often needed anyway. The MATLAB function chol returns an error message if the factorization fails, and a second output argument can be requested, which is set to the number of the stage on which the factorization failed, or to zero if the factorization succeeded. In the case of failure, the partially computed $R$ factor is returned in the first argument, and it can be used to compute a direction of negative curvature (as needed in optimization), for example.

This sin takes the top spot in Schmelzer and Hauser’s Seven Sins in Portfolio Optimization, because in portfolio optimization a negative eigenvalue in the covariance matrix can identify a portfolio with negative variance, promising an arbitrarily large investment with no risk!

## 5. Not Exploiting Structure in the Matrix

One of the fundamental tenets of numerical linear algebra is that one should try to exploit any matrix structure that might be present. Sparsity (a matrix having a large number of zeros) is particularly important to exploit, since algorithms intended for dense matrices may be impractical for sparse matrices because of extensive fill-in (zeros becoming nonzero). Here are two examples of structures that can be exploited.

Matrices from saddle point problems are symmetric indefinite and of the form

$\notag C = \begin{bmatrix} A & B^T \\ B & 0 \end{bmatrix},$

with $A$ symmetric positive definite. Much work has been done on developing numerical methods for solving $Cx = b$ that exploit the block structure and possible sparsity in $A$ and $B$. A second example is a circulant matrix

$\notag C = \begin{bmatrix} c_1 & c_2 & \dots & c_n \\ c_n & c_1 & \dots & \vdots \\ \vdots & \ddots & \ddots & c_2 \\ c_2 & \dots & c_n & c_1 \\ \end{bmatrix}.$

Circulant matrices have the important property that they are diagonalized by a unitary matrix called the discrete Fourier transform matrix. Using this property one can solve $Cx = v$ in $O(n \log_2n)$ operations, rather than the $O(n^3)$ operations required if the circulant structure is ignored.

Ideally, linear algebra software would detect structure in a matrix and call an algorithm that exploits that structure. A notable example of such a meta-algorithm is the MATLAB backslash function x = A\b for solving $Ax = b$. Backslash checks whether the matrix is triangular (or a permutation of a triangular matrix), upper Hessenberg, symmetric, or symmetric positive definite, and applies an appropriate method. It also allows $A$ to be rectangular and solves the least squares problem if there are more rows than columns and the underdetermined system if there are more columns than rows.

## 6. Using the Determinant to Detect Near Singularity

An $n\times n$ matrix $A$ is nonsingular if and only if its determinant is nonzero. One might therefore expect that a small value for $\det(A)$ indicates a matrix that is nearly singular. However, the size of $\det(A)$ tells us nothing about near singularity. Indeed, since $\det(\alpha A) = \alpha^n \det(A)$ we can achieve any value for the determinant by multiplying by a scalar $\alpha$, yet $\alpha A$ is no more or less nearly singular than $A$ for $\alpha \ne 0$.

Another limitation of the determinant is shown by the two matrices

$\notag T = \begin{bmatrix} 1 & -1 & -1 & \dots & -1\\ & 1 & -1 & \dots & -1\\ & & 1 & \dots & \vdots\\ & & & \ddots & -1 \\ & & & & 1 \end{bmatrix}, \quad U = \begin{bmatrix} 1 & 1 & 1 & \dots & 1\\ & 1 & 1 & \dots & 1\\ & & 1 & \dots & \vdots\\ & & & \ddots & 1 \\ & & & & 1 \end{bmatrix} \qquad (1)$

Both matrices have unit diagonal and off-diagonal elements bounded in modulus by $1$. So $\det(T) = \det(U) = 1$, yet

$\notag T^{-1} = \begin{bmatrix} 1 & 1 & 2 & \dots & 2^{n-2}\\ & 1 & 1 & \dots & \vdots\\ & & 1 & \ddots & 2\\ & & & \ddots & 1 \\ & & & & 1 \end{bmatrix}, \quad U^{-1} = \begin{bmatrix} 1 & -1 & & & \\ & 1 & -1 & & \\ & & 1 & \ddots & \\ & & & \ddots & -1 \\ & & & & 1 \end{bmatrix}.$

So $T$ is ill conditioned for large $n$. In fact, if we change the $(n,1)$ element of $T$ to $-2^{n-2}$ then the matrix becomes singular! By contrast, $U$ is always very well conditioned. The determinant cannot distinguish between the ill-conditioned $T$ and the well-conditioned $U$.

## 7. Using Eigenvalues to Estimate Conditioning

For any $n\times n$ matrix $A$ and any consistent matrix norm it is true that $\|A\| \ge |\lambda_i|$ for all $i$, where the $\lambda_i$ are the eigenvalue of $A$. Since the eigenvalues of $A^{-1}$ are $\lambda^{-1}$, it follows that the matrix condition number $\kappa(A) = \|A\| \, \|A^{-1}\|$ is bounded below by the ratio of largest to smallest eigenvalue in absolute value, that is,

$\notag \kappa(A) \ge \displaystyle\frac{ \max_i |\lambda_i| } { \min_i |\lambda_i| }.$

But as the matrix $T$ in (1) shows, this bound can be very weak.

It is singular values not eigenvalues that characterize the condition number for the 2-norm. Specifically,

$\notag \kappa_2(A) = \displaystyle\frac{\sigma_1}{\sigma_n},$

where $A = U\Sigma V^T$ is a singular value decomposition (SVD), with $U$ and $V$ orthogonal and $\Sigma = \mathrm{diag}(\sigma_i)$, $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$. If $A$ is symmetric, for example, then the sets $\{ |\lambda_i| \}$ and $\{\sigma_i \}$ are the same, but in general the eigenvalues $\lambda_i$ and singular values $\sigma_i$ can be very different.

# Cleve Moler Wins ICIAM Industry Prize 2023

Congratulations to Cleve Moler, who has won the inaugural ICIAM Industry Prize 2023 for “outstanding contributions to the development of mathematical and computational tools and methods for the solution of science and engineering problems and his invention of MATLAB”.

I first saw Cleve demonstrate the original Fortran version of MATLAB on an IBM PC at the Gatlinburg meeting at the University of Waterloo in 1984. The commercial version of MATLAB was released soon after, and it has been my main programming environment ever since.

MATLAB succeeded for a number of reasons, some of which Dennis Sherwood and I describe in one of the creativity stories in our recent book How to Be Creative: A Practical Guide for the Mathematical Sciences. But there is one reason that is rarely mentioned.

From the beginning, MATLAB supported complex arithmetic—indeed, the basic data type has always been a complex matrix. The original 1980 MATLAB Users’ Guide says

MATLAB works with essentially only one kind of object, a rectangular matrix with complex elements. If the imaginary parts of the elements are all zero, they are not printed, but they still occupy storage.

By contrast, early competing packages usually supported only real arithmetic (see my 1989 SIAM News article Matrix Computations on a PC for a comparison of PC-MATLAB and GAUSS). Cleve understood the fundamental need to compute in the complex plane in real life problems, as opposed to textbook examples, and he appreciated how tedious it is to program with real and imaginary parts stored in separate arrays. The storing of zero imaginary parts of real numbers was a small price to pay for the convenience. Of course, the commercial version of MATLAB was optimized not to store the imaginary part of reals. Control engineers—a group who were early adopters of MATLAB—appreciated the MATLAB approach, because the stability of control systems depends on eigenvalues, which are in general complex.

Another wise choice was that MATLAB allows the imaginary unit to be written as i or j, thus keeping mathematicians and electrical engineers happy!

Here is Cleve demonstrating MATLAB in October 2000:

# What Is a Circulant Matrix?

An $n\times n$ circulant matrix is defined by $n$ parameters, the elements in the first row, and each subsequent row is a cyclic shift forward of the one above:

$\notag C = \begin{bmatrix} c_1 & c_2 & \dots & c_n \\ c_n & c_1 & \dots & \vdots \\ \vdots & \ddots & \ddots & c_2 \\ c_2 & \dots & c_n & c_1 \\ \end{bmatrix}.$

Circulant matrices have the important property that they are diagonalized by the discrete Fourier transform matrix

$\notag F_n = \Bigl(\exp( -2\pi \mathrm{i} (r-1)(s-1) / n )\Bigr)_{r,s=1}^n,$

which satisfies $F_n^*F_n = nI$, so that $n^{-1/2}F_n$ is a unitary matrix. ($F_n$ is a Vandermonde matrix with points the roots of unity.) Specifically,

$\notag F_n C F_n^{-1} = D = \mathrm{diag}(d_i). \qquad(1)$

Hence circulant matrices are normal ($C^*C = CC^*$). Moreover, the eigenvalues are given by $d = F_n Ce_1$, where $e_1$ is the first unit vector.

Note that one particular eigenpair is immediate, since $Ce = \bigl(\sum_{i=1}^n c_i\bigr) e$, where $e$ is the vector of ones.

The factorization (1) enables a circulant linear system to be solved in $O(n\log n)$ flops, since multiplication by $F_n$ can be done using the fast Fourier transform.

A particular circulant matrix is the (up) shift matrix $K_n$, the $4\times 4$ version of which is

$\notag K_4 = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}.$

It is not hard to see that

$C = c_1 I + c_2 K_n + c_3K_n^2 + \cdots + c_n K_n^{n-1}.$

Since powers of $K_n$ commute, it follows that any two circulant matrices commute (this is also clear from (1)). Furthermore, the sum and product of two circulant matrices is a circulant matrix and the inverse of a nonsingular circulant matrix is a circulant matrix.

One important use of circulant matrices is to act as preconditioners for Toeplitz linear systems. Several methods have been proposed for constructing a circulant matrix from a Toeplitz matrix, including copying the central diagonals and wrapping them around and finding the nearest circulant matrix to the Toeplitz matrix. See Chan and Ng (1996) or Chan and Jin (2017) for a summary of work on circulant preconditioners for Toeplitz systems.

An interesting circulant matrix is anymatrix('core/circul_binom',n) in the Anymatrix collection, which is the $n\times n$ circulant matrix whose first row has $i$th element $n \choose ,i-1$. The eigenvalues are $(1 + w^i)^n - 1$, $i=1:n$, where $w = \exp(2\pi\mathrm{i}/n)$. The matrix is singular when $n$ is a multiple of 6, in which case the null space has dimension 2. Example:

>> n = 6; C = anymatrix('core/circul_binom',n), svd(C)
C =
1     6    15    20    15     6
6     1     6    15    20    15
15     6     1     6    15    20
20    15     6     1     6    15
15    20    15     6     1     6
6    15    20    15     6     1
ans =
6.3000e+01
2.8000e+01
2.8000e+01
1.0000e+00
2.0244e-15
7.6607e-16


## Notes

A classic reference on circulant matrices is Davis (1994).

# Photos and Videos from NJH60 Conference

The conference Advances in Numerical Linear Algebra: Celebrating the 60th Birthday of Nick Higham was held at the University of Manchester, July 6–8, 2022.

Most of the talks are available on the NLA Group YouTube channel and links to them are available on the conference web page.

Here is the conference photo.

Here is me with some of my current and former PhD students.

And with some of my current and former postdocs.

After-dinner speaker Charlie Van Loan:

Rob Corless kindly gave me a Bohemian matrix eigenvalue tie based on this image.

Many thanks to Stefan Güttel, Sven Hammarling, Stephanie Lai, Françoise Tisseur and Marcus Webb for organizing the conference and the Royal Society, MathWorks and the University of Manchester for financial support.

# What Is Fast Matrix Multiplication?

The definition of matrix multiplication says that for $n\times n$ matrices $A$ and $B$, the product $C = AB$ is given by $c_{ij} = \sum_{k=1}^n a_{ik}b_{kj}$. Each element of $C$ is an inner product of a row of $A$ and a column of $B$, so if this formula is used then the cost of forming $C$ is $n^2(2n-1)$ additions and multiplications, that is, $O(n^3)$ operations. For over a century after the development of matrix algebra in the 1850s by Cayley, Sylvester and others, all methods for matrix multiplication were based on this formula and required $O(n^3)$ operations.

In 1969 Volker Strassen showed that when $n=2$ the product can be computed from the formulas

\notag \begin{gathered} p_1 =(a_{11}+a_{22})(b_{11}+b_{22}), \\ p_2=(a_{21}+a_{22})b_{11}, \qquad p_3=a_{11}(b_{12}-b_{22}), \\ p_4=a_{22}(b_{21}-b_{11}), \qquad p_5=(a_{11}+a_{12})b_{22}, \\ p_6=(a_{21}-a_{11})(b_{11}+b_{12}), \qquad p_7=(a_{12}-a_{22})(b_{21}+b_{22}), \\ \noalign{\smallskip} C = \begin{bmatrix} p_1+p_4-p_5+p_7 & p_3+p_5 \\ p_2+p_4 & p_1+p_3-p_2+p_6 \end{bmatrix}. \end{gathered}

The evaluation requires $7$ multiplications and $18$ additions instead of $8$ multiplications and $4$ additions for the usual formulas.

At first sight, Strassen’s formulas may appear simply to be a curiosity. However, the formulas do not rely on commutativity so are valid when the $a_{ij}$ and $b_{ij}$ are matrices, in which case for large dimensions the saving of one multiplication greatly outweighs the extra $14$ additions. Assuming $n$ is a power of $2$, we can partition $A$ and $B$ into four blocks of size $n/2$, apply Strassen’s formulas for the multiplication, and then apply the same formulas recursively on the half-sized matrix products.

Let us examine the number of multiplications for the recursive Strassen algorithm. Denote by $M(k)$ the number of scalar multiplications required to multiply two $2^k \times 2^k$ matrices. We have $M(k) = 7M(k-1)$, so

$M(k) = 7M(k-1) = 7^2M(k-2) = \cdots = 7^k M(0) = 7^k.$

But $7^k = 2^{\log_2{7^k}} = (2^k)^{\log_2 7} = n^{\log_2 7} = n^{2.807\dots}$. The number of additions can be shown to be of the same order of magnitude, so the algorithm requires $O(n^{\log_27})=O(n^{2.807\dots})$ operations.

Strassen’s work sparked interest in finding matrix multiplication algorithms of even lower complexity. Since there are $O(n^2)$ elements of data, which must each participate in at least one operation, the exponent of $n$ in the operation count must be at least $2$.

The current record upper bound on the exponent is $2.37286$, proved by Alman and Vassilevska Williams (2021) which improved on the previous record of $2.37287$, proved by Le Gall (2014) The following figure plots the best upper bound for the exponent for matrix multiplication over time.

In the methods that achieve exponents lower than 2.775, various intricate techniques are used, based on representing matrix multiplication in terms of bilinear or trilinear forms and their representation as tensors having low rank. Laderman, Pan, and Sha (1993) explain that for these methods “very large overhead constants are hidden in the `$O$‘ notation”, and that the methods “improve on Strassen’s (and even the classical) algorithm only for immense numbers $N$.”

Strassen’s method, when carefully implemented, can be faster than conventional matrix multiplication for reasonable dimensions. In practice, one does not recur down to $1\times 1$ matrices, but rather uses conventional multiplication once $n_0\times n_0$ matrices are reached, where the parameter $n_0$ is tuned for the best performance.

Strassen’s method has the drawback that it satisfies a weaker form of rounding error bound that conventional multiplication. For conventional multiplication $C = AB$ of $A\in\mathbb{R}^{m\times n}$ and $B\in\mathbb{R}^{n\times p}$ we have the componentwise bound

$\notag |C - \widehat{C}| \le \gamma^{}_n |A|\, |B|, \qquad(1)$

where $\gamma^{}_n = nu/(1-nu)$ and $u$ is the unit roundoff. For Strassen’s method we have only a normwise error bound. The following result uses the norm $\|A\| = \max_{i,j} |a_{ij}|$, which is not a consistent norm.

Theorem 1 (Brent). Let $A,B \in \mathbb{R}^{n\times n}$, where $n=2^k$. Suppose that $C=AB$ is computed by Strassen’s method and that $n_0=2^r$ is the threshold at which conventional multiplication is used. The computed product $\widehat{C}$ satisfies

$\notag \|C - \widehat{C}\| \le \left[ \Bigl( \displaystyle\frac{n}{n_0} \Bigr)^{\log_2{12}} (n_0^2+5n_0) - 5n \right] u \|A\|\, \|B\| + O(u^2). \qquad(2)$

With full recursion ($n_0=1$) the constant in (2) is $6 n^{\log_2 12}-5n \approx 6 n^{3.585}-5n$, whereas with just one level of recursion ($n_0 = n/2$) it is $3n^2+25n$. These compare with $n^2u + O(u^2)$ for conventional multiplication (obtained by taking norms in (1)). So the constant for Strassen’s method grows at a faster rate than that for conventional multiplication no matter what the choice of $n_0$.

The fact that Strassen’s method does not satisfy a componentwise error bound is a significant weakness of the method. Indeed Strassen’s method cannot even accurately multiply by the identity matrix. The product

$\notag C = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & \epsilon \\ \epsilon & \epsilon^2 \end{bmatrix}, \quad 0 < \epsilon \ll 1$

is evaluated exactly in floating-point arithmetic by conventional multiplication, but Strassen’s method computes

$c_{22} = 2(1+\epsilon^2) + (\epsilon-\epsilon^2) - 1 - (1+\epsilon).$

Because $c_{22}$ involves subterms of order unity, the error $c_{22} - \widehat c_{22}$ will be of order $u$. Thus the relative error $|c_{22} - \widehat c_{22}| / |c_{22}| = O(u/\epsilon^2) \gg O(u)$,

Another weakness of Strassen’s method is that while the scaling $AB \to (AD) (D^{-1}B)$, where $D$ is diagonal, leaves (1) unchanged, it can alter (2) by an arbitrary amount. Dumitrescu (1998) suggests computing $D_1(D_1^{-1}A\cdot B D_2^{-1})D_2$, where the diagonal matrices $D_1$ and $D_2$ are chosen to equilibrate the rows of $A$ and the columns of $B$ in the $\infty$-norm; he shows that this scaling can improve the accuracy of the result. Further investigations along these lines are made by Ballard et al. (2016).

Should one use Strassen’s method in practice, assuming that an implementation is available that is faster than conventional multiplication? Not if one needs a componentwise error bound, which ensures accurate products of matrices with nonnegative entries and ensures that the column scaling of $A$ and row scaling of $B$ has no effect on the error. But if a normwise error bound with a faster growing constant than for conventional multiplication is acceptable then the method is worth considering.

## Notes

For recent work on high-performance implementation of Strassen’s method see Huang et al. (2016, 2020).

Theorem 1 is from an unpublished technical report of Brent (1970). A proof can be found in Higham (2002, §23.2.2).

For more on fast matrix multiplication see Bini (2014) and Higham (2002, Chapter 23).

## References

This is a minimal set of references, which contain further useful references within.