What Is a (Non)normal Matrix?

An $n\times n$ matrix is normal if $A^*A = AA^*$, that is, if $A$ commutes with its conjugate transpose. Although the definition is simple to state, its significance is not immediately obvious.

The definition says that the inner product of the $i$th and $j$th columns equals the inner product of the $i$th and $j$th rows for all $i$ and $j$. For $i = j$ this means that the $i$th row and the $i$th column have the same $2$-norm for all $i$. This fact can easily be used to show that a normal triangular matrix must be diagonal. It then follows from the Schur decomposition that $A$ is normal if it is unitarily diagonalizable: $A = QDQ^*$ for some unitary $Q$ and diagonal $D$, where $D$ contains the eigenvalues of $A$ on the diagonal. Thus the normal matrices are those with a complete set of orthonormal eigenvectors.

For a general diagonalizable matrix, $A = XDX^{-1}$, the condition number $\kappa(X) = \|X| \|X^{-1}\|$ can be arbitrarily large, but for a normal matrix $X$ can be taken to have 2-norm condition number $1$. This property makes normal matrices well-behaved for numerical computation.

Many equivalent conditions to $A$ being normal are known: seventy are given by Grone et al. (1987) and a further nineteen are given by Elsner and Ikramov (1998).

The normal matrices include the classes of matrix given in this table:

Real Complex
Diagonal Diagonal
Symmetric Hermitian
Skew-symmetric Skew-Hermitian
Orthogonal Unitary
Circulant Circulant

Circulant matrices are $n\times n$ Toeplitz matrices in which the diagonals wrap around:

$\notag \begin{bmatrix} c_1 & c_n & \dots & c_2 \\ c_2 & c_1 & \dots & \vdots \\ \vdots & \ddots & \ddots & c_n \\ c_n & \dots & c_2 & c_1 \\ \end{bmatrix}.$

They are diagonalized by a unitary matrix known as the discrete Fourier transform matrix, which has $(r,s)$ element $\exp( -2\pi \mathrm{i} (r-1)(s-1) / n )$.

A normal matrix is not necessarily of the form given in the table, even for $n = 2$. Indeed, a $2\times 2$ normal matrix must have one of the forms

$\notag \left[\begin{array}{@{\mskip2mu}rr@{\mskip2mu}} a & b\\ b & c \end{array}\right], \quad \left[\begin{array}{@{}rr@{\mskip2mu}} a & b\\ -b & a \end{array}\right].$

The first matrix is symmetric. The second matrix is of the form $aI + bJ$, where $J = \bigl[\begin{smallmatrix}\!\phantom{-}0 & 1\\\!-1 & 0 \end{smallmatrix}\bigr]$ is skew-symmetric and satisfies $J^2 = -I$, and it has eigenvalues $a \pm \mathrm{i}b$.

It is natural to ask what the commutator $C = AA^*- A^*A$ can look like when $A$ is not normal. One immediate observation is that $C$ has zero trace, so its eigenvalues sum to zero, implying that $C$ is an indefinite Hermitian matrix if it is not zero. Since an indefinite matrix has at least two different nonzero eigenvalues, $C$ cannot be of rank $1$.

In the polar decomposition $A = UH$, where $U$ is unitary and $H$ is Hermitian positive semidefinite, the polar factors commute if and only if $A$ is normal.

The field of values, also known as the numerical range, is defined for $A\in\mathbb{C}^{n\times n}$ by

$F(A) = \biggl\{\, \displaystyle\frac{z^*Az}{z^*z} : 0 \ne z \in \mathbb{C}^n \, \biggr\}.$

The set $F(A)$ is compact and convex (a nontrivial property proved by Toeplitz and Hausdorff), and it contains all the eigenvalues of $A$. Normal matrices have the property that the field of values is the convex hull of the eigenvalues. The next figure illustrates two fields of values, with the eigenvalues plotted as dots. The one on the left is for the nonnormal matrix gallery('smoke',16) and that on the right is for the circulant matrix gallery('circul',x) with x constructed as x = randn(16,1); x = x/norm(x).

Measures of Nonnormality

How can we measure the degree of nonnormality of a matrix? Let $A$ have the Schur decomposition $A = QTQ^*$, where $Q$ is unitary and $T$ is upper triangular, and write $T = D+M$, where $D = \mathrm{diag}(\lambda_i)$ is diagonal with the eigenvalues of $A$ on its diagonal and $M$ is strictly upper triangular. If $A$ is normal then $M$ is zero, so $\|M\|_F$ is a natural measure of how far $A$ is from being normal. While $M$ depends on $Q$ (which is not unique), its Frobenius norm does not, since $\|A\|_F^2 = \|T\|_F^2 = \|D\|_F^2 + \|M\|_F^2$. Accordingly, Henrici defined the departure from normality by

$\notag \nu(A) = \biggl( \|A\|_F^2 - \displaystyle\sum_{j=1}^n |\lambda_j|^2 \biggr)^{1/2}.$

Henrici (1962) derived an upper bound for $\nu(A)$ and Elsner and Paardekooper (1987) derived a lower bound, both based on the commutator:

$\notag \displaystyle\frac{\|A^*A-AA^*\|_F}{4\|A\|_2} \le \nu(A) \le \Bigl( \displaystyle\frac{n^3-n}{12} \Bigr)^{1/4} \|A^*A-AA^*\|_F^{1/2}.$

The distance to normality is

$\notag d(A) = \min \bigl\{\, \|E\|_F: ~A+E \in \mathbb{C}^{n\times n}~~\mathrm{is~ normal} \,\bigr\}.$

This quantity can be computed by an algorithm of Ruhe (1987). It is trivially bounded above by $\nu(A)$ and is also bounded below by a multiple of it (László, 1994):

$\notag \displaystyle\frac{\nu(A)}{n^{1/2}} \le d(A) \le \nu(A).$

Normal matrices are a particular class of diagonalizable matrices. For diagonalizable matrices various bounds are available that depend on the condition number of a diagonalizing transformation. Since such a transformation is not unique, we take a diagonalization $A = XDX^{-1}$, $D = \mathrm{diag}(\lambda_i)$, with $X$ having minimal 2-norm condition number:

$\kappa_2(X) = \min\{\, \kappa_2(Y): A = YDY^{-1}, ~D~\mathrm{diagonal} \,\}.$

Here are some examples of such bounds. We denote by $\rho(A)$ the spectral radius of $A$, the largest absolute value of any eigenvalue of $A$.

• By taking norms in the eigenvalue-eigenvector equation $Ax = \lambda x$ we obtain $\rho(A) \le \|A\|_2$. Taking norms in $A = XDX^{-1}$ gives $\|A\|_2 \le \kappa_2(X) \|D\|_2 = \kappa_2(X)\rho(A)$. Hence

$\notag \displaystyle\frac{\|A\|_2}{\kappa_2(X)} \le \rho(A) \le \|A\|_2.$

• If $A$ has singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$ and its eigenvalues are ordered $|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n|$, then (Ruhe, 1975)

$\notag \displaystyle\frac{\sigma_i(A)}{\kappa_2(X)} \le |\lambda_i(A)| \le \kappa_2(X) \sigma_i(A), \quad i = 1\colon n.$

Note that for $i=1$ the previous upper bound is sharper.

• For any real $p > 0$,

$\notag \displaystyle\frac{\rho(A)^p}{\kappa_2(X)} \le \|A^p\|_2 \le \kappa_2(X) \rho(A)^p.$

• For any function $f$ defined on the spectrum of $A$,

$\notag \displaystyle\frac{\max_i|f(\lambda_i)|}{\kappa_2(X)} \le \|f(A)\|_2 \le \max_i|f(\lambda_i)|.$

For normal $A$ we can take $X$ unitary and so all these bounds are equalities. The condition number $\kappa_2(X)$ can therefore be regarded as another measure of non-normality, as quantified by these bounds.

References

This is a minimal set of references, which contain further useful references within.

What Is the Matrix Logarithm?

A logarithm of a square matrix $A$ is a matrix $X$ such that $\mathrm{e}^X = A$, where $\mathrm{e}^X$ is the matrix exponential. Just as in the scalar case, the matrix logarithm is not unique, since if $\mathrm{e}^X = A$ then $\mathrm{e}^{X+2k\pi\mathrm{i}I} = A$ for any integer $k$. However, for matrices the nonuniqueness is complicated by the presence of repeated eigenvalues. For example, the matrix

$\notag X(t) = 2\pi \mathrm{i} \begin{bmatrix}1 & -2t & 2t^2 \\ 0 & -1 & -t \\ 0 & 0 & 0 \end{bmatrix}$

is an upper triangular logarithm of the $3\times 3$ identity matrix $I_3$ for any $t$, whereas the obvious logarithms are the diagonal matrices $2\pi\mskip1mu\mathrm{i}\mskip1mu\mathrm{diag}(k_1,k_2,k_3)$, for integers $k_1$, $k_2$, and $k_3$. Notice that the repeated eigenvalue $1$ of $I_3$ has been mapped to the distinct eigenvalues $2\pi$, $-2\pi$, and $0$ in $X(t)$. This is characteristic of nonprimary logarithms, and in some applications such strange logarithms may be required—an example is the embeddability problem for Markov chains.

An important question is whether a nonsingular real matrix has a real logarithm. The answer is that it does if and only if the Jordan canonical form contains an even number of Jordan blocks of each size for every negative eigenvalue. This means, in particular, that if $A$ has an unrepeated negative eigenvalue then it does not have a real logarithm. Minus the $n\times n$ identity matrix has a real logarithm for even $n$ but not for odd $n$. Indeed, for $n=2$,

$\notag X = \pi\left[\begin{array}{@{}rr@{}} 0 & 1\\ -1 & 0 \end{array}\right]$

satisfies $\mathrm{e}^X = -I_2$, as does $Y = ZXZ^{-1}$ for any nonsingular $Z$, since $\mathrm{e}^Y = Z\mathrm{e}^X Z^{-1} = -Z Z^{-1} = -I_2$.

For most practical purposes it is the principal logarithm that is of interest, defined for any matrix with no negative real eigenvalues as the unique logarithm whose eigenvalues lie in the strip $\{\, z : -\pi < \mathrm{Im}(z) < \pi \,\}$. From this point on we assume that $A$ has no negative eigenvalues and that the logarithm is the principal logarithm.

Various explicit representations of the logarithm are available, including

\notag \begin{aligned} \log A &= \int_0^1 (A-I)\bigl[ t(A-I) + I \bigr]^{-1} \,\mathrm{d}t, \\ \log(I+X) &= X - \frac{X^2}{2} + \frac{X^3}{3} - \frac{X^4}{4} + \cdots, \quad \rho(X)<1, \end{aligned}

where the spectral radius $\rho(X) = \max\{\, |\lambda| : \lambda~\textrm{is an eigenvalue of}~X\,\}$. A useful relation is $\log (A^{\alpha}) = \alpha \log A$ for $\alpha\in[-1,1]$, with important special cases $\log (A^{-1}) = - \log A$ and $\log (A^{1/2}) = \frac{1}{2} \log A$ (where the square root is the principal square root). Recurring the latter expression gives, for any positive integer $k$,

$\notag \log(A) = 2^k \log\bigl(A^{1/2^k}\bigr).$

This formula is the basis for the inverse scaling and squaring method for computing the logarithm, which chooses $k$ so that $E = I - A^{1/2^k}$ is small enough that $\log(I + E)$ can be efficiently approximated by Padé approximation. The MATLAB function logm uses the inverse scaling and squaring method together with a Schur decomposition.

References

This references contains more on the facts above, as well as further references.

What Is a QR Factorization?

A QR factorization of a rectangular matrix $A\in\mathbb{R}^{m\times n}$ with $m\ge n$ is a factorization $A = QR$ with $Q\in\mathbb{R}^{m\times m}$ orthonormal and $R\in\mathbb{R}^{m\times n}$ upper trapezoidal. The $R$ factor has the form $R = \left[\begin{smallmatrix}R_1\\ 0\end{smallmatrix}\right]$, where $R_1$ is $n\times n$ and upper triangular. Partitioning $Q$ conformably with $R$ we have

$\notag A = QR = \begin{array}[b]{@{\mskip-20mu}c@{\mskip0mu}c@{\mskip-1mu}c@{}} & \mskip10mu\scriptstyle n & \scriptstyle m-n \\ \mskip15mu \begin{array}{r} \scriptstyle m \end{array}~ & \multicolumn{2}{c}{\mskip-15mu \left[\begin{array}{c@{~}c@{~}} Q_1 & Q_2 \end{array}\right] } \end{array} \mskip-10mu \begin{array}[b]{@{\mskip-25mu}c@{\mskip-20mu}c@{}} \scriptstyle n \\ \multicolumn{1}{c}{ \left[\begin{array}{@{}c@{}} R_1\\ 0 \end{array}\right]} & \mskip-12mu\ \begin{array}{l} \scriptstyle n \\ \scriptstyle m-n \end{array} \end{array} = Q_1 R_1.$

There are therefore two forms of QR factorization:

• $A = QR$ is the full QR factorization,
• $A = Q_1R_1$ is the reduced (also called economy-sized, or thin) QR factorization.

To prove the existence of a QR factorization note that if $A$ has full rank then $A^T\!A$ is symmetric positive definite. Since $A = Q_1R_1$ implies $A^T\!A = R_1^TQ_1^TQ_1R_1 = R_1^TR_1$, we can take $R_1$ to be the Cholesky factor of $A^T\!A$ and then define $Q_1 = AR_1^{-1}$. The resulting $Q_1$ has orthonormal columns because

$\notag Q_1^TQ_1 = R_1^{-T} A^T A R_1^{-1} = R_1^{-T} R_1^T R_1 R_1^{-1} = I.$

Therefore when $A$ has full rank there is a unique reduced QR factorization if we require $R_1$ to have positive diagonal elements. (Without this requirement we can multiply the $i$th column of $Q$ and the $i$th row of $R$ by $-1$ and obtain another QR factorization.)

When $A$ has full rank the columns of $Q_1$ span the column space (or range) of $A$. Indeed $Ax = Q_1R_1x = Q_1(R_1x)$ implies $\mathrm{range}(A) \subseteq \mathrm{range}(Q_1)$ while $Q_1x = Q_1R_1\cdot R_1^{-1}x =: Ay$ implies $\mathrm{range}(Q_1) \subseteq \mathrm{range}(A)$, so $\mathrm{range}(Q_1) = \mathrm{range}(A)$. Furthermore, $Q^TA = R$ gives $Q_2^TA = 0$, so the columns of $Q_2$ span the null space of $A^T$.

The QR factorization provides a way of orthonormalizing the columns of a matrix. An alternative is provided by the polar decomposition $A = UH$, where $U$ has orthonormal columns and $H$ is positive semidefinite. The orthogonal polar factor $U$ is the closest matrix with orthonormal columns to $A$ in any unitarily invariant norm, but it is more expensive to compute than the $Q$ factor.

There are three standard ways of computing a QR factorization.

Gram–Schmidt orthogonalization computes the reduced factorization. It has the disadvantage that in floating-point arithmetic the computed $\widehat{Q}$ is not guaranteed to be orthonormal to the working precision. The modified Gram–Schmidt method (a variation of the classical method) is better behaved numerically in that $\|\widehat{Q}^T\widehat{Q} - I\|_F \le c_1(m,n)\kappa_2(A)u$ for some constant $c_1(m,n)$, where $u$ is the unit roundoff, so the loss of orthogonality is bounded.

Householder QR factorization and Givens QR factorization both construct $Q^T$ as a product of orthogonal matrices that are chosen to reduce $A$ to upper trapezoidal form. In both methods, at the start of the $k$th stage we have

$\notag \qquad\qquad\qquad\qquad A^{(k)} = Q_{k-1}^T A = \begin{array}[b]{@{\mskip35mu}c@{\mskip20mu}c@{\mskip-5mu}c@{}c} \scriptstyle k-1 & \scriptstyle 1 & \scriptstyle n-k & \\ \multicolumn{3}{c}{ \left[\begin{array}{c@{\mskip10mu}cc} R_{k-1} & y_k & B_k \\ 0 & z_k & C_k \end{array}\right]} & \mskip-12mu \begin{array}{c} \scriptstyle k-1 \\ \scriptstyle m-k+1 \end{array} \end{array}, \qquad\qquad\qquad\qquad (*)$

where $R_{k-1}$ is upper triangular and $Q_{k-1}$ is a product of Householder transformations or Givens rotations. Working on $A^{(k)}(k:m,k:n)$ we now apply a Householder transformation or $n-k$ Givens rotations in order to zero out the last $n-k$ elements of $z_k$ and thereby take the matrix one step closer to upper trapezoidal form.

Householder QR factorization is the method of choice for general matrices, but Givens QR factorization is preferred for structured matrices with a lot of zeros, such as upper Hessenberg matrices and tridiagonal matrices.

Both these methods produce $Q$ in factored form and if the product is explicitly formed they yield a computed $\widehat{Q}$ that is orthogonal to the working precision, that is, $\|\widehat{Q}^T\widehat{Q} - I\|_F \le c_2(m,n)u$, for some constant $c_2$.

Modified Gram–Schmidt, Householder QR, and Givens QR all have the property that there exists an exactly orthogonal $Q$ such that the computed $\widehat{R}$ satisfies

$\notag A + \Delta A = Q \widehat{R}, \quad \|\Delta A\|_F \le c_3(m,n)u \|A\|_F,$

for some constant $c_3$.

Another way of computing a QR factorization is by the technique in the existence proof above, via Cholesky factorization of $A^T\!A$. This is known as the Cholesky QR algorithm and it has favorable computational cost when $m \gg n$. In its basic form, this method is not recommended unless $A$ is extremely well conditioned, because the computed $\widehat{Q}$ is far from orthonormal for ill conditioned matrices. The method can be made competitive with the others either by using extra precision or by iterating the process.

Column Pivoting and Rank-Revealing QR Factorization

In practice, we often want to compute a basis for the range of $A$ when $A$ is rank deficient. The basic QR factorization may not do so. Householder QR factorization with column pivoting reveals rank deficiency by incorporating column interchanges. At the $k$th stage, before applying a Householder transformation to $(*)$, the column of largest $2$-norm of $C_k$, the $j$th say, is determined, and if its norm exceeds that of $z_K$ then the $k$th and $(k+j)$th columns of $A^{(k)}$ are interchanged. The result is a factorization $A\Pi = QR$, where $\Pi$ is a permutation matrix and $R$ satisfies the inequalities

$\notag |r_{kk}|^2 \ge \displaystyle\sum_{i=k}^j |r_{ij}|^2, \quad j=k+1\colon n, \quad k=1\colon n.$

In particular,

$\notag \qquad\qquad\qquad\qquad\qquad\qquad |r_{11}| \ge |r_{22}| \ge \cdots \ge |r_{nn}|. \qquad\qquad\qquad\qquad\qquad\qquad (\dagger)$

If $A$ is rank deficient then $R$ has the form

$\notag R = \begin{bmatrix}R_{11} & R_{12} \\ 0 & 0 \end{bmatrix},$

with $R_{11}$ nonsingular, and the rank of $A$ is the dimension of $R_{11}$.

Near rank deficiency of $A$ to tends to be revealed by a small trailing diagonal block of $R$, but this is not guaranteed. Indeed for the Kahan matrix

$\notag U_n(\theta) = \mathrm{diag}(1,s,\dots,s^{n-1}) \begin{bmatrix} 1 & -c & -c & \dots & -c \\ & 1 & -c & \dots & -c \\ & & \ddots &\ddots & \vdots \\ & & &\ddots & -c \\ & & & & 1 \end{bmatrix}$

where $c =\cos\theta$ and $s = \sin\theta$, $u_{nn}$ is of order $2^n$ times larger than the smallest singular value for small $\theta$ and $U_n(\theta)$ is invariant under QR factorization with column pivoting.

In practice, column pivoting reduces the efficiency of Householder QR factorization because it limits the amount of the computation that can be expressed in terms of matrix multiplication. This has motivated the development of methods that select the pivot columns using randomized projections. These methods gain speed and produce factorizations of similar rank-revealing quality, though they give up the inequalities $(\dagger)$.

It is known that at least one permutation matrix $\Pi$ exists such that the QR factorization of $A\Pi$ is rank-revealing. Computing such a permutation is impractical, but heuristic algorithms for producing an approximate rank-revealing factorization are available.

References

This is a minimal set of references, which contain further useful references within.

What is the Cayley–Hamilton Theorem?

The Cayley–Hamilton Theorem says that a square matrix $A$ satisfies its characteristic equation, that is $p(A) = 0$ where $p(t) = \det(tI-A)$ is the characteristic polynomial. This statement is not simply the substitution “$p(A) = \det(A - A) = 0$”, which is not valid since $t$ must remain a scalar inside the $\det$ term. Rather, for an $n\times n$ $A$, the characteristic polynomial has the form

$\notag p(t) = t^n + a_{n-1}t^{n-1} + \cdots + a_1 t + a_0$

and the Cayley–Hamilton theorem says that

$\notag p(A) = A^n + a_{n-1}A^{n-1} + \cdots + a_1 A + a_0 I = 0.$

Various proofs of the theorem are available, of which we give two. The first is the most natural for anyone familiar with the Jordan canonical form. The second is more elementary but less obvious.

First proof.

Consider a $4\times 4$ Jordan block with eigenvalue $\lambda$:

$\notag J = \begin{bmatrix} \lambda & 1 & 0 & 0 \\ & \lambda & 1 & 0\\ & & \lambda & 1\\ & & & \lambda \end{bmatrix}.$

We have

$\notag J - \lambda I = \begin{bmatrix} 0 & 1 & 0 & 0 \\ & 0 & 1 & 0\\ & & 0 & 1\\ & & & 0 \end{bmatrix}, \quad (J - \lambda I)^2 = \begin{bmatrix} 0 & 0 & 1 & 0 \\ & 0 & 0 & 1 \\ & & 0 & 0 \\ & & & 0 \end{bmatrix}, \quad (J - \lambda I)^3 = \begin{bmatrix} 0 & 0 & 0 & 1 \\ & 0 & 0 & 0 \\ & & 0 & 0 \\ & & & 0 \end{bmatrix},$

and then $(J - \lambda I)^4 = 0$. In general, for an $n \times n$ Jordan block $J$ with eigenvalue $\lambda$, $(J - \lambda I)^k$ is zero apart from a $k$th superdiagonal of ones for $k\le n-1$, and $(J - \lambda I)^n = 0$.

Let $A$ have the Jordan canonical form $A = Z JZ^{-1}$, where $J = \mathrm{diag}(J_1, \dots, J_k)$ and each $J_i$ is an $m_i\times m_i$ Jordan block with eigenvalue $\lambda_i$. The characteristic polynomial of $A$ can be factorized as $p(t) = (t-\lambda_1)^{m_1}(t-\lambda_2)^{m_2}\dots(t-\lambda_k)^{m_k}$. Note that $A^i = Z J^i Z^{-1}$ for all $i$, and hence $q(A) = Z q(J) Z^{-1}$ for any polynomial $q$. Then

$\notag Z^{-1}p(A)Z = p(J) = \mathrm{diag}\bigl( p(J_1), p(J_2), \dots, p(J_k) \bigr),$

and $p(J_i)$ is zero because it contains a factor $(J_i - \lambda_i I\bigr)^{m_i}$ and this factor is zero, as noted above. Hence $Z^{-1}p(A)Z = 0$ and therefore $p(A) = 0$.

Second Proof

Recall that the adjugate $\mathrm{adj}(B)$ of an $n\times n$ matrix $B$ is the transposed matrix of cofactors, where a cofactor is a signed sum of products of $n-1$ entries of $B$, and that $B\mathrm{adj}(B) = \det(B)I$. With $B = tI - A$, each entry of $\mathrm{adj}(B)$ is a polynomial of degree $n-1$ in $t$, so $B^{-1} = \mathrm{adj}(B)/\det(B)$ can be written

$\notag (tI - A)^{-1} = \displaystyle\frac{\mathrm{adj}(tI - A)}{\det(tI - A)} = \displaystyle\frac{t^{n-1}C_{n-1} + \cdots + tC_1 + C_0}{p(t)},$

for some matrices $C_{n-1}$, $\dots$, $C_0$ not depending on $t$. Rearranging, we obtain

$\notag (tI - A) \bigl(t^{n-1}C_{n-1} + \cdots + tC_1 + C_0 \bigr) = p(t)I,$

and equating coefficients of $t^n$, …, $t^0$ gives

\notag \begin{aligned} C_{n-1} &= I,\\ C_{n-2} - A C_{n-1} &= a_{n-1}I,\\ & \vdots\\ C_0 - A C_1 &= a_1 I,\\ - A C_0 &= a_0I. \end{aligned}

Premultiplying the first equation by $A^n$, the second by $A^{n-1}$, and so on, and adding, gives

$\notag 0 = A^n + a_{n-1}A^{n-1} + \cdots + a_1 A + a_0 I = p(A),$

as required. This proof is by Buchheim (1884).

Applications and Generalizations

A common use of the Cayley–Hamilton theorem is to show that $A^{-1}$ is expressible as a linear combination of $I$, $A$, …, $A^{n-1}$. Indeed for a nonsingular $A$, $p(A) = 0$ implies that

$\notag A^{-1} = -\displaystyle\frac{1}{a_0} \bigl( A^{n-1} + a_{n-1}A^{n-2} + \cdots + a_1 I \bigr),$

since $a_0 = \det(A) \ne 0$.

Similarly, $A^k$ for any $k \ge n$ can be expressed as a linear combination of $I$, $A$, …, $A^{n-1}$. An interesting implication is that any matrix power series is actually a polynomial in the matrix. Thus the matrix exponential $\mathrm{e}^A = I + A + A^2/2! + \cdots$ can be written $\mathrm{e}^A = c_{n-1} A^{n-1} + \cdots + C_1A + c_0I$ for some scalars $c_{n-1}$, …, $c_0$. However, the $c_i$ depend on $A$, which reduces the usefulness of the polynomial representation. A rare example of an explicit expression of this form is Rodrigues’s formula for the exponential of a skew-symmetric matrix $A \in \mathbb{R}^{3\times 3}$:

$\notag \mathrm{e}^A = I + \displaystyle\frac{\sin\theta}{\theta} A + \displaystyle\frac{1-\cos\theta}{\theta^2} A^2,$

where $\theta = \sqrt{\|A\|_F^2/2}$.

Cayley used the Cayley–Hamilton theorem to find square roots of a $2\times 2$ matrix. If $X^2 = A$ then applying the theorem to $X$ gives $X^2 - \mathrm{trace}(X)X + \det(X)I = 0$, or

$\notag \qquad\qquad\qquad\qquad A - \mathrm{trace}(X)X + \det(X)I = 0, \qquad\qquad\qquad\qquad (*)$

which gives

$X = \displaystyle\frac{A + \det(X)I}{\mathrm{trace}(X)}.$

Now $\det(X) = \sqrt{\det(A)}$ and taking the trace in $(*)$ gives an equation for $\mathrm{trace}(X)$, leading to

$\notag X = \displaystyle\frac{ A + \sqrt{\det(A)} \,I} {\sqrt{\mathrm{trace}(A) + 2 \sqrt{\det(A)}}}.$

With appropriate choices of signs for the square roots this formula gives all four square roots of $A$ when $A$ has distinct eigenvalues, but otherwise the formula can break down.

Expressions obtained from the Cayley–Hamilton theorem are of little practical use for general matrices, because algorithms that compute the coefficients $a_i$ of the characteristic polynomial are typically numerically unstable.

The Cayley–Hamilton theorem has been generalized in various directions. The theorem can be interpreted as saying that the powers $A^i$ for all nonnegative $i$ generate a vector space of dimension at most $n$. Gerstenhaber (1961) proved that if $A$ and $B$ are two commuting $n\times n$ matrices then the matrices $A^iB^j$, for all nonnegative $i$ and $j$, generate a vector space of dimension at most $n$. It is conjectured that this result extends to three matrices.

Historical Note

The Cayley–Hamilton theorem appears in the 1858 memoir in which Cayley introduced matrix algebra. Cayley gave a proof for $n = 2$ and stated that he had verified the result for $n = 3$, adding “I have not thought it necessary to undertake the labour of a formal proof of the theorem in the general case of a matrix of any degree.” Hamilton had proved the result for quaternions in 1853. Cayley actually discovered a more general version of the Cayley–Hamilton theorem, which appears in an 1857 letter to Sylvester but not in any of his published work: if the square matrices $A$ and $B$ commute and $f(x,y) = \det(x A - y B)$ then $f(B,A) = 0$.