What Is the Schur Complement of a Matrix?

For an n\times n matrix

\notag   A =   \begin{bmatrix} A_{11} & A_{12} \\                   A_{21} & A_{22}   \end{bmatrix} \qquad (1)

with nonsingular (1,1) block A_{11}, the Schur complement is A_{22} - A_{21}A_{11}^{-1}A_{12}. It is denoted by A/A_{11}. The block with respect to which the Schur complement is taken need not be the (1,1) block. Assuming that A_{22} is nonsingular, the Schur complement of A_{22} in A is A_{11} - A_{12}A_{22}^{-1}A_{21}.

More generally, for an index vector \alpha = [i_1,i_2,\dots,i_p], where 1\le i_1 < i_2 < \cdots < i_p \le n, the Schur complement of A(\alpha,\alpha) in A is A(\alpha^c,\alpha^c) - A(\alpha^c,\alpha) A(\alpha,\alpha)^{-1} A(\alpha,\alpha^c), where \alpha^c is the complement of \alpha (the vector of indices not in \alpha). Most often, it is the Schur complement of A_{11} in A that is of interest, and the general case can be reduced to it by row and column permutations that move A(\alpha,\alpha) to the (1,1) block of A.

Schur Complements in Gaussian Elimination

The reduced submatrices in Gaussian elimination are Schur complements. Write an n\times n matrix A with nonzero (1,1) element \alpha as

\notag   A =   \begin{bmatrix} \alpha & a^*   \\                    b     & C   \end{bmatrix}    =   \begin{bmatrix}  1            & 0      \\                    b/\alpha     & I_{n-1}   \end{bmatrix}   \begin{bmatrix}  \alpha       & a^*    \\                     0           & C - ba^*/\alpha   \end{bmatrix}.

This factorization represents the first step of Gaussian elimination.

After k stages of Gaussian elimination we have computed the following factorization, in which the (1,1) blocks are k\times k:

\notag   \begin{bmatrix} L_{11} & 0      \\                   L_{21} & I   \end{bmatrix}   \begin{bmatrix} A_{11} & A_{12} \\                   A_{21} & A_{22}   \end{bmatrix}  =   \begin{bmatrix} U_{11} & U_{12}      \\                      0   & S   \end{bmatrix},

where the first matrix on the left is the product of the elementary transformations that reduce the first k columns to upper triangular form. Equating (2,1) and (2,2) blocks in this equation gives L_{21}A_{11} + A_{21} = 0 and L_{21}A_{12} + A_{22} = S. Hence S= A_{22} + L_{21}A_{12}                       = A_{22} + (-A_{21}A_{11}^{-1})A_{12}, which is the Schur complement of A_{11} in A.

The next step of the elimination, which zeros out the first column of S below the diagonal, succeeds as long as the (1,1) element of S is nonzero (or the whole first column of S is zero, in which case there is nothing to do, though A is then singular).

For a number of matrix structures, such as

  • Hermitian (or real symmetric) positive definite matrices,
  • totally positive matrices,
  • matrices diagonally dominant by rows or columns,
  • M-matrices,

one can show that the Schur complement inherits the structure. For these four structures the (1,1) element of the matrix is nonzero, so the success of Gaussian elimination (or equivalently, the existence of an LU factorization) is guaranteed. In the first three cases the preservation of structure is also the basis for a proof of the numerical stability of Gaussian elimination.
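
As a quick numerical illustration of the first of these properties, the following MATLAB snippet (a sketch, with an arbitrarily chosen test matrix) checks that the Schur complement of the (1,1) block of a symmetric positive definite matrix is again positive definite.

% Check that the Schur complement of an SPD matrix is SPD (a sketch).
n = 6; k = 2;
A = gallery('lehmer',n);               % symmetric positive definite test matrix
A11 = A(1:k,1:k);     A12 = A(1:k,k+1:n);
A21 = A(k+1:n,1:k);   A22 = A(k+1:n,k+1:n);
S = A22 - A21*(A11\A12);               % Schur complement of A11 in A
min(eig(S))                            % positive, so S is positive definite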

Inverse of a Block Matrix

The Schur complement arises in formulas for the inverse of a block 2\times 2 matrix. If A_{11} is nonsingular, we can write

\notag   A = \begin{bmatrix}A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}     = \begin{bmatrix}I & 0 \\ A_{21}A_{11}^{-1} & I \end{bmatrix}       \begin{bmatrix}A_{11} & A_{12} \\ 0 & S \end{bmatrix}, \quad                       S = A_{22} - A_{21}A_{11}^{-1}A_{12}.  \qquad (2)

If S is nonsingular then inverting gives

\notag \begin{aligned}   A^{-1} &=       \begin{bmatrix}A_{11}^{-1} & -A_{11}^{-1}A_{12}S^{-1} \\                     0 & S^{-1}                     \end{bmatrix}                         \begin{bmatrix}I & 0 \\ -A_{21}A_{11}^{-1} & I                           \end{bmatrix} \notag\\         &=       \begin{bmatrix}A_{11}^{-1} + A_{11}^{-1}A_{12}S^{-1}A_{21}A_{11}^{-1} & -A_{11}^{-1}A_{12}S^{-1} \\                -S^{-1}A_{21}A_{11}^{-1} & S^{-1}       \end{bmatrix}. \end{aligned}

So S^{-1} is the (2,2) block of A^{-1}.

One can obtain an analogous formula for A^{-1} in terms of the Schur complement of A_{22} in A, in which the inverse of the Schur complement is the (1,1) block of A^{-1}. More generally, we have (A/A(\alpha,\alpha))^{-1} = A^{-1}(\alpha^c,\alpha^c).
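
A quick MATLAB check of this last identity (a sketch, with an arbitrarily chosen matrix and index vector):

% Verify that the inverse of the Schur complement of A(alpha,alpha) in A
% is the (alpha^c,alpha^c) block of inv(A) (a sketch).
n = 5; rng(1); A = randn(n) + n*eye(n);   % nonsingular, with nonsingular A(alpha,alpha)
alpha = [2 4]; alphac = setdiff(1:n,alpha);
S = A(alphac,alphac) - A(alphac,alpha)*(A(alpha,alpha)\A(alpha,alphac));
B = inv(A);
norm(inv(S) - B(alphac,alphac))           % tiny: agreement to roundoff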

Test for Positive Definiteness

For Hermitian matrices the Schur complement provides a test for positive definiteness. Suppose

\notag   A =   \begin{bmatrix} A_{11} & A_{12} \\                   A_{12}^* & A_{22}   \end{bmatrix}

with A_{11} positive definite. The factorization

\notag  A =  \begin{bmatrix} I & 0 \cr A_{12}^*A_{11}^{-1} & I \end{bmatrix}       \begin{bmatrix} A_{11} & 0 \cr 0 & S \end{bmatrix}        \begin{bmatrix} I & A_{11}^{-1}A_{12} \cr 0 & I \end{bmatrix}, \quad    S = A_{22} - A_{12}^*A_{11}^{-1}A_{12},

shows that A is congruent to the block diagonal matrix \mathrm{diag}(A_{11},S), and since congruences preserve inertia (the number of positive, zero, and negative eigenvalues), A is positive definite if and only if \mathrm{diag}(A_{11},S) is positive definite, that is, if and only if S is positive definite. The same equivalence holds with “definite” replaced by “semidefinite” (with A_{11} still required to be positive definite).
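
Here is a small MATLAB illustration of the test (a sketch, with an arbitrarily chosen matrix): chol returns a second output that is zero if and only if its argument is (numerically) positive definite.

% Test positive definiteness via the Schur complement (a sketch).
A = gallery('moler',5);                   % symmetric positive definite
k = 2;
A11 = A(1:k,1:k); A12 = A(1:k,k+1:end); A22 = A(k+1:end,k+1:end);
S = A22 - A12'*(A11\A12);                 % Schur complement of A11 in A
[~,p1] = chol(A11); [~,p2] = chol(S); [~,p3] = chol(A);
[p1 p2 p3]                                % all zero: A11, S, and hence A are positive definite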

Generalized Schur Complement

For A in (1) with a possibly singular (1,1) block A_{11} we can define the generalized Schur complement A_{22} - A_{21}A_{11}^+A_{12}, where “+” denotes the Moore-Penrose pseudo-inverse. The generalized Schur complement is useful in the context of Hermitian positive semidefinite matrices, as the following result shows.

Theorem 1.

If

\notag        A = \begin{bmatrix} A_{11} & A_{12} \\                            A_{12}^* & A_{22}            \end{bmatrix} \in\mathbb{C}^{n\times n}

is Hermitian then A is positive semidefinite if and only if \mathrm{range}(A_{12}) \subseteq \mathrm{range}(A_{11}) and A_{11} and the generalized Schur complement S = A_{22} - A_{12}^*A_{11}^+ A_{12} are both positive semidefinite.

If A_{11} is positive definite then the range condition is trivially satisfied.
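
The following MATLAB snippet illustrates Theorem 1 (a sketch, with an arbitrarily constructed positive semidefinite matrix whose (1,1) block is singular), using pinv for the pseudo-inverse.

% Illustrate Theorem 1 for a PSD matrix with singular (1,1) block (a sketch).
rng(1); B = randn(3,5); B(:,2) = B(:,1);   % force a rank-deficient (1,1) block
A = B'*B;                                  % 5-by-5 positive semidefinite, rank 3
k = 2;
A11 = A(1:k,1:k); A12 = A(1:k,k+1:end); A22 = A(k+1:end,k+1:end);
S = A22 - A12'*pinv(A11)*A12;              % generalized Schur complement
norm(A12 - A11*(pinv(A11)*A12))            % tiny: range(A12) lies in range(A11)
min(eig(S))                                % nonnegative to within roundoff: S is PSD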

Notes and References

The term “Schur complement” was coined by Haynsworth in 1968, thereby focusing attention on this important form of matrix and spurring many subsequent papers that explore its properties and applications. The name “Schur” was chosen because a 1917 determinant lemma of Schur says that if A_{11} is nonsingular then

\notag    \det\left(        \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \right)      = \det(A_{11}) \det(A_{22} - A_{21}A_{11}^{-1} A_{12})      = \det(A_{11}) \det(A/A_{11}),

which is obtained by taking determinants in (2).

For much more about the Schur complement see Zhang (2005) or Horn and Johnson (2013).

What Is the Minimal Polynomial of a Matrix?

For a polynomial

\notag  \phi(t) = a_kt^k + \cdots + a_1t + a_0,

where a_i\in\mathbb{C} for all i, the matrix polynomial obtained by evaluating \phi at A\in\mathbb{C}^{n \times n} is

\notag  \phi(A) = a_kA^k + \cdots + a_1A + a_0 I.

(Note that the constant term is a_0 A^0 = a_0 I). The polynomial \phi is monic if a_k = 1.

The characteristic polynomial of a matrix A\in\mathbb{C}^{n \times n} is p(t) = \det(t I - A), a degree n monic polynomial whose roots are the eigenvalues of A. The Cayley–Hamilton theorem tells us that p(A) = 0, but p may not be the polynomial of lowest degree that annihilates A. The monic polynomial \psi of lowest degree such that \psi(A) = 0 is the minimal polynomial of A. Clearly, \psi has degree at most n.

The minimal polynomial divides any polynomial \phi such that \phi(A) = 0, and in particular it divides the characteristic polynomial. Indeed by polynomial long division we can write \phi(t) = q(t)\psi(t) + r(t), where the degree of r is less than the degree of \psi. Then

\notag      0 = \phi(A) = q(A) \psi(A) + r(A) = r(A).

If r\ne 0 then we have a contradiction to the minimality of the degree of \psi. Hence r = 0 and so \psi divides \phi.

The minimal polynomial is unique. For if \psi_1 and \psi_2 are two different monic polynomials of minimum degree m such that \psi_i(A) = 0, i = 1,2, then \psi_3 = \psi_1 - \psi_2 is a polynomial of degree less than m with \psi_3(A) = \psi_1(A) - \psi_2(A) = 0. If \psi_3 \ne 0 then it can be scaled to be monic, contradicting the minimality of m. Hence \psi_3 = 0, that is, \psi_1 = \psi_2.

If A has distinct eigenvalues then the characteristic polynomial and the minimal polynomial are equal. When A has repeated eigenvalues the minimal polynomial can have degree less than n. An extreme case is the identity matrix, for which \psi(t) = t - 1, since \psi(I) = I - I = 0. On the other hand, for the Jordan block

\notag  J = \begin{bmatrix}      \lambda & 1 & 0 \\      0 & \lambda & 1 \\      0 & 0 & \lambda   \end{bmatrix}

the characteristic polynomial and the minimal polynomial are both equal to (t - \lambda)^3.

The minimal polynomial has degree less than n when an eigenvalue appears in more than one Jordan block in the Jordan canonical form of A. Indeed it is not hard to show that the minimal polynomial can be written

\notag     \psi(t) = \displaystyle\prod_{i=1}^s (t-\lambda_i)^{n_i},

where \lambda_1,\lambda_2,\dots,\lambda_s are the distinct eigenvalues of A and n_i is the dimension of the largest Jordan block in which \lambda_i appears. This expression is composed of linear factors (that is, n_i = 1 for all i) if and only if A is diagonalizable.

To illustrate, for the matrix

\notag  A = \left[\begin{array}{cc|c|c|c}      \lambda &  1      &         &     &   \\              & \lambda &         &     &   \\\hline              &         & \lambda &     &   \\\hline              &         &         & \mu &    \\\hline              &         &         &     & \mu   \end{array}\right]

in Jordan form (where blank elements are zero), the minimal polynomial is \psi(t) = (t-\lambda)^2(t-\mu), while the characteristic polynomial is p(t) = (t-\lambda)^3(t-\mu)^2.
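
If the Symbolic Math Toolbox is available, this example can be checked with minpoly and charpoly (a sketch; the values \lambda = 2 and \mu = 3 are arbitrary choices):

% Minimal and characteristic polynomials of the Jordan form example
% (a sketch; requires the Symbolic Math Toolbox).
lam = 2; mu = 3;                            % arbitrary numerical values
A = blkdiag([lam 1; 0 lam], lam, mu, mu);   % the Jordan form above
syms t
factor(minpoly(sym(A),t))                   % factors t-2 (twice) and t-3
factor(charpoly(sym(A),t))                  % factors t-2 (three times) and t-3 (twice)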

What is the minimal polynomial of a rank-1 matrix, A = xy^* \ne 0? Since A^2 = (y^*x) xy^*, we have q(A) = 0 for q(t) = t^2 - (y^*x) t = t^2 - \mathrm{trace}(A) t. For any linear polynomial p(t) = t - a_0, p(A) = xy^* - a_0 I, which is nonzero since xy^* has rank 1 and I has rank n. Hence the minimal polynomial is q.

The minimal polynomial is important in the theory of matrix functions and in the theory of Krylov subspace methods. One does not normally need to compute the minimal polynomial in practice.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

Ten Year Anniversary of the Blog


This blog is ten years old! I made the first post, Trefethen’s Approximation Theory and Approximation Practice, on January 1, 2013. 347 posts and over one million views (for the combined website and blog) later, the aims of the blog are unchanged: to cover numerical linear algebra, software, and numerical analysis and applied mathematics more generally.

In the last three years most of my posts have been in the What Is series, which now has 84 articles. I will continue to add to this series, and will be grateful for any suggestions of topics (please put them in the “Leave a Reply” box below).

There is a lot of content on this site. The best way to find specific information is to use the gray search box at the top right of the page. I use it frequently, often finding things I’d forgotten are here.

Image credit: Bernie Goldbach.

What Is Iterative Refinement?

Iterative refinement is a method for improving the quality of an approximate solution \widehat{x} to a linear system of equations Ax = b, where A is an n\times n nonsingular matrix. The basic iterative refinement algorithm is very simple.

  1. Compute the residual r = b - A\widehat{x}.
  2. Solve Ad = r.
  3. Update \widehat{x} \gets \widehat{x} + d.
  4. Repeat from step 1 if necessary.

At first sight, this algorithm seems as expensive as solving for the original \widehat{x}. However, usually the solver is LU factorization with pivoting, A = LU (where we include the permutations in L). Most of the work is in the LU factorization, which costs O(n^3) flops, and each iteration requires a multiplication with A for the residual and two substitutions to compute d, which total only O(n^2) flops. If the refinement converges quickly then it is inexpensive.

Turning to the error, with a stable LU factorization the initial \widehat{x} computed in floating-point arithmetic of precision u satisfies (omitting constants)

\notag     \displaystyle\frac{\|x- \widehat{x}\|}{\|x\|}\lesssim \kappa_{\infty}(A) u, \qquad (1)

where \kappa_{\infty}(A) = \|A\|_{\infty} \|A^{-1}\|_{\infty} \ge 1 is the matrix condition number in the \infty-norm. We would like refinement to produce a solution accurate to precision u:

\notag     \displaystyle\frac{\|x- \widehat{x}\|}{\|x\|}\approx u. \qquad (2)

But if the solver cannot compute the initial \widehat{x} accurately when A is ill-conditioned, why should it be able to produce an update d that improves \widehat{x}?

The simplest answer is that when iterative refinement was first used on digital computers the residual r was computed at twice the working precision, which could be done at no extra cost in the hardware. If \widehat{x} is a reasonable approximation then we expect cancellation in forming r = b - A\widehat{x}, so using extra precision in forming r ensures that r has enough correct digits to yield a correction d that improves \widehat{x}. This form of iterative refinement produces a solution satisfying (2) as long as \kappa_{\infty}(A) < u^{-1}.

Here is a MATLAB example, where the working precision is single and residuals are computed in double precision.

n = 8; A = single(gallery('frank',n)); xact = ones(n,1);
b = A*xact; % b is formed exactly for small n.
x = A\b;
fprintf('Initial error = %4.1e\n', norm(x - xact,inf))
r = single( double(b) - double(A)*double(x) );
d = A\r;
x = x + d;
fprintf('Second error  = %4.1e\n', norm(x - xact,inf))

The output is

Initial error = 9.1e-04
Second error  = 6.0e-08

which shows that after just one step the error has been brought down from 10^{-4} to the level of u_s\approx 5\times 10^{-8}, the unit roundoff for IEEE single precision arithmetic.

Fixed Precision Iterative Refinement

By the 1970s, computers had started to lose the ability to cheaply accumulate inner products in extra precision, and extra precision could not be programmed portably in software. It was discovered, though, that even if iterative refinement is run entirely in one precision it can bring benefits when \kappa_{\infty}(A) < u^{-1}. Specifically,

  • if the solver is somewhat numerically unstable the instability is cured by the refinement, in that a relative residual satisfying

    \notag     \displaystyle\frac{\|b-A\widehat{x}\|_{\infty}}{\|A\|_{\infty} \|\widehat{x}\|_{\infty} + \|b\|_{\infty}} \approx u \qquad(3)

    is produced, and

  • a relative error satisfying

    \notag     \displaystyle\frac{\|x- \widehat{x}\|_{\infty}}{\|x\|_{\infty}}      \lesssim \mathrm{cond}(A,x) u   \qquad (4)

    is produced, where

\notag    \mathrm{cond}(A,x) = \displaystyle\frac{ \|\, |A^{-1}| |A| |x| \,\|_{\infty} } {\|x\|_{\infty}}.

The bound (4) is stronger than (1) because \mathrm{cond}(A,x) is no larger than \kappa_{\infty}(A) and can be much smaller, especially if A has badly scaled rows.

Low Precision Factorization

In the 2000s processors became available in which 32-bit single precision arithmetic ran at twice the speed of 64-bit double precision arithmetic. A new usage of iterative refinement was developed in which the working precision is double precision and a double precision matrix A is factorized in single precision.

  1. Factorize A = LU in single precision.
  2. Solve LUx = b by substitution in single precision, obtaining \widehat{x}.
  3. Compute the residual r = b - A\widehat{x} in double precision.
  4. Solve LUd = r in single precision.
  5. Update \widehat{x} \gets \widehat{x} + d in double precision.
  6. Repeat from step 3 if necessary.

Since most of the work is in the single precision factorization, this algorithm is potentially twice as fast as solving Ax=b entirely in double precision arithmetic. The algorithm achieves the same limiting accuracy (4) and limiting residual (3) provided that \kappa_{\infty}(A) < u_s^{-1} \approx 2\times 10^7.
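
Here is a minimal MATLAB sketch of this algorithm, in which the test matrix (generated with gallery('randsvd')) and the number of refinement steps are arbitrary choices.

% LU-based iterative refinement with a single precision factorization
% and a double precision working precision (a sketch).
n = 100; rng(1);
A = gallery('randsvd',n,1e6);          % random matrix with 2-norm condition number 1e6
xact = ones(n,1); b = A*xact;
[L,U,p] = lu(single(A),'vector');      % O(n^3) work, done in single precision
x = double(U\(L\single(b(p))));        % initial solve using the single precision factors
for k = 1:5
    r = b - A*x;                       % residual in double precision
    d = double(U\(L\single(r(p))));    % correction from the single precision factors
    x = x + d;
    fprintf('Iteration %d: error = %4.1e\n', k, norm(x - xact,inf)/norm(xact,inf))
end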

GMRES-IR

A way to weaken this restriction on \kappa_{\infty}(A) is to use a different solver on step 4: solve

\notag      \widetilde{A}d := U^{-1}L^{-1}Ad = U^{-1}L^{-1}r

by GMRES (a Krylov subspace iterative method) in double precision. Within GMRES, \widetilde{A}d is not formed but is applied to a vector as a multiplication with A followed by substitutions with L and U. As long as GMRES converges quickly for this preconditioned system the speed gain from the fast single precision factorization will not be lost. Moreover, this different step 4 results in convergence for \kappa_{\infty}(A) a couple of orders of magnitude larger than before, and if residuals are computed in quadruple precision then a limiting accuracy (2) is achieved.
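
A MATLAB sketch of this modified correction step, reusing the single precision factors L, U and the permutation p from the sketch above (the GMRES tolerance and iteration limit are arbitrary choices):

% GMRES-based correction step (a sketch): solve (U^{-1}L^{-1}A)d = U^{-1}L^{-1}r
% in double precision, applying the preconditioned matrix via a function handle.
L = double(L); U = double(U); Ap = A(p,:);   % A(p,:) is approximately L*U
Atilde = @(v) U\(L\(Ap*v));                  % apply U^{-1}L^{-1}A to a vector
r = b - A*x;                                 % residual in double precision
[d,flag] = gmres(Atilde, U\(L\r(p)), [], 1e-10, 50);
x = x + d;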

This GMRES-based iterative refinement becomes particularly advantageous when the fast half precision arithmetic now available in hardware is used within the LU factorization, and one can use three or more precisions in the algorithm in order to balance speed, accuracy, and the range of problems that can be solved.

Finally, we note that iterative refinement can also be applied to least squares problems, eigenvalue problems, and the singular value decomposition. See Higham and Mary (2022) for details and references.

References

We give five references, which contain links to the earlier literature.

Related Blog Posts

Gershgorin Graphics

It’s not so often in numerical linear algebra that the plots we produce are visually attractive. This plot came up in some MATLAB experiments. Can you guess what it is?

[Image: gersh_art64.jpg]

I plotted the Gershgorin discs of a stochastic matrix: a matrix with nonnegative elements and row sums all equal to 1. The Gershgorin discs for an n\times n matrix A are the n discs in the complex plane defined by

\notag      \Big\{ z\in\mathbb{C}: |z-a_{ii}| \le \displaystyle\sum_{j\ne i}      |a_{ij}|\Big\}, \quad i=1\colon n.

Gershgorin’s theorem says that the eigenvalues of A lie in the union of the discs.

Why do the discs form this interesting pattern? For a stochastic matrix the ith Gershgorin disc is

\notag  \Big\{ z\in\mathbb{C}: |z - a_{ii}| \le \displaystyle\sum_{j\ne i} a_{ij} = 1 - a_{ii}   \Big\}.

This disc goes through 1 and the closer a_{ii} is to 1 the smaller the radius of the disc, so the discs are nested, with the disc corresponding to \min_i a_{ii} containing all the others.

The matrix used for the plot is A = anymatrix('core/symmstoch',64) from the Anymatrix collection. It has diagonal elements approximately uniformly distributed on [0,1], so the centers of the discs are roughly equally spaced and shrink as the centers move to the right.

The image above is for the matrix of dimension n = 64. The black dots are the eigenvalues. Here is the plot for n = 24. The function used to produce these plots is gersh from the Matrix Computation Toolbox.
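
A basic version of such a plot can be produced directly (a sketch, not the gersh function itself; it assumes the Anymatrix collection is installed):

% Plot Gershgorin discs and eigenvalues of a stochastic matrix (a sketch).
A = anymatrix('core/symmstoch',64);
c = diag(A);                      % disc centers
r = sum(abs(A),2) - abs(c);       % disc radii (off-diagonal absolute row sums)
theta = linspace(0,2*pi,200);
hold on
for i = 1:length(c)
    plot(c(i) + r(i)*cos(theta), r(i)*sin(theta), 'b-')
end
e = eig(A);
plot(real(e), imag(e), 'k.', 'MarkerSize', 10)   % eigenvalues
axis equal, hold off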

[Image: gersh_art24.jpg]

Here are two other matrices whose Gershgorin discs make a graphically interesting plot.

[Image: gersh_art_lesp64.jpg]

[Image: gersh_art_smoke40.jpg]

If you know of any other interesting examples please put them in the comments below.

What Is the Trace of a Matrix?

The trace of an n\times n matrix is the sum of its diagonal elements: \mathrm{trace}(A) = \sum_{i=1}^n a_{ii}. The trace is linear, that is, \mathrm{trace}(A+B) = \mathrm{trace}(A) + \mathrm{trace}(B), and \mathrm{trace}(A) = \mathrm{trace}(A^T).

A key fact is that the trace is also the sum of the eigenvalues. The proof is by considering the characteristic polynomial p(t) = \det(t I - A) = t^n + a_{n-1}t^{n-1} + \dots + a_1 t + a_0. The roots of p are the eigenvalues \lambda_1, \lambda_2, \dots, \lambda_n of A, so p can be factorized

\notag  p(t) = (t - \lambda_1) (t - \lambda_2) \dots (t - \lambda_n),

and so a_{n-1} = -(\lambda_1 + \lambda_2 + \cdots + \lambda_n). The Laplace expansion of \det(t I - A) shows that the coefficient of t^{n-1} is -(a_{11} + a_{22} + \cdots + a_{nn}). Equating these two expressions for a_{n-1} gives

\notag     \mathrm{trace}(A) = \displaystyle\sum_{i=1}^n \lambda_i.  \qquad\qquad(1)

A consequence of (1) is that any transformation that preserves the eigenvalues preserves the trace. Therefore the trace is unchanged under similarity transformations: \mathrm{trace}(X^{-1}AX) = \mathrm{trace}(A) for any nonsingular X.

As an example of how the trace can be useful, suppose A is a symmetric and orthogonal n\times n matrix, so that its eigenvalues are \pm 1. If there are p eigenvalues 1 and q eigenvalues -1 then \mathrm{trace}(A) = p - q and n = p + q. Therefore p = (n + \mathrm{trace}(A))/2 and q = (n - \mathrm{trace}(A))/2.

Another important property is that for an m\times n matrix A and an n\times m matrix B,

\notag  \mathrm{trace}(AB) = \mathrm{trace}(BA) \qquad\qquad(2)

(despite the fact that AB \ne BA in general). The proof is simple:

\notag \begin{aligned}   \mathrm{trace}(AB) &= \displaystyle\sum_{i=1}^m (AB)_{ii}                      = \sum_{i=1}^m \sum_{k=1}^n a_{ik} b_{ki}                      = \sum_{k=1}^n \sum_{i=1}^m b_{ki} a_{ik} \\                      & = \sum_{k=1}^n (BA)_{kk}                      = \mathrm{trace}(BA). \end{aligned}

This simple fact can have non-obvious consequences. For example, consider the equation AX - XA = I in n\times n matrices. Taking the trace gives 0 = \mathrm{trace}(AX) - \mathrm{trace}(XA)    = \mathrm{trace}(AX - XA)    = \mathrm{trace}(I) = n, which is a contradiction. Therefore the equation has no solution.

The relation (2) gives \mathrm{trace}(ABC) = \mathrm{trace}((AB)C) = \mathrm{trace}(C(AB)) = \mathrm{trace}(CAB) for n\times n matrices A, B, and C, that is,

\notag   \mathrm{trace}(ABC) = \mathrm{trace}(CAB). \qquad\qquad(3)

So we can cyclically permute terms in a matrix product without changing the trace.

As an example of the use of (2) and (3), if x and y are n-vectors then \mathrm{trace}(xy^T) = \mathrm{trace}(y^Tx) = y^Tx. If A is an n\times n matrix then \mathrm{trace}(xy^TA) can be evaluated without forming the matrix xy^TA since, by (3), \mathrm{trace}(xy^TA) =                 \mathrm{trace}(y^TAx) = y^TAx.

The trace is useful in calculations with the Frobenius norm of an m\times n matrix:

\notag     \|A\|_F = \left(\displaystyle\sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2\right)^{1/2}             = \bigl(\mathrm{trace}(A^*A)\bigr)^{1/2},

where * denotes the conjugate transpose. For example, we can generalize the formula |x+\mathrm{i}y|^2 = x^2 + y^2 for a complex number to an m\times n matrix A by splitting A into its Hermitian and skew-Hermitian parts:

\notag  A = \frac{1}{2}(A+A^*) + \frac{1}{2}(A-A^*) \equiv B + C,

where B = B^* and C = -C^*. Then

\notag \begin{aligned}     \|A\|_F^2 =     \|B + C\|_F^2 &= \mathrm{trace}((B+C)^*(B+C))\\                 &= \mathrm{trace}(B^*B + C^*C) + \mathrm{trace}(B^*C + C^*B)\\                 &= \mathrm{trace}(B^*B + C^*C) + \mathrm{trace}(BC - CB)\\                 &= \mathrm{trace}(B^*B + C^*C)\\                 &= \|B\|_F^2 + \|C\|_F^2. \end{aligned}

If a matrix is not explicitly known but we can compute matrix–vector products with it then the trace can be estimated by

\notag      \mathrm{trace}(A) \approx x^TAx,

where the vector x has elements independently drawn from the standard normal distribution with mean 0 and variance 1. The expectation of this estimate is

\notag \begin{aligned}     E\bigl( x^TAx \bigr)  &=     E\bigl( \mathrm{trace}(x^TAx) \bigr)  =     E\bigl( \mathrm{trace}(Axx^T) \bigr)  =     \mathrm{trace} \bigl( E(Axx^T) \bigr) \\      &=  \mathrm{trace}\bigl(A E(xx^T) \bigr)       = \mathrm{trace}(A), \end{aligned}

since E(x_ix_j) = 0 for i\ne j and E(x_i^2) = 1 for all i, so E(xx^T) = I. This stochastic estimate, which is due to Hutchinson, is therefore unbiased.
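
Here is a small experiment with the estimator (a sketch; the test matrix, seed, and sample size are arbitrary choices). Averaging over independent samples reduces the variance while keeping the estimate unbiased.

% Hutchinson's trace estimator (a sketch): average x'*A*x over
% independent standard normal vectors x.
rng(1)                                 % arbitrary seed for reproducibility
n = 100; A = gallery('lehmer',n);      % trace(A) = n = 100
nsamples = 1000; est = 0;
for k = 1:nsamples
    x = randn(n,1);
    est = est + x'*(A*x);
end
est = est/nsamples                     % close to trace(A)
trace(A)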


Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is a Diagonalizable Matrix?

A matrix A \in\mathbb{C}^{n\times n} is diagonalizable if there exists a nonsingular matrix X\in\mathbb{C}^{n\times n} such that X^{-1}AX is diagonal. In other words, a diagonalizable matrix is one that is similar to a diagonal matrix.

The condition X^{-1}AX = D = \mathrm{diag}(\lambda_i) is equivalent to AX = XD with X nonsingular, that is, Ax_i = \lambda_ix_i, i=1\colon n, where X = [x_1,x_2,\dots, x_n]. Hence A is diagonalizable if and only if it has a complete set of linearly independent eigenvectors.

A Hermitian matrix is diagonalizable because the eigenvectors can be taken to be mutually orthogonal. The same is true for a normal matrix (one for which A^*A = AA^*). A matrix with distinct eigenvalues is also diagonalizable.

Theorem 1.

If A\in\mathbb{C}^{n\times n} has distinct eigenvalues then it is diagonalizable.

Proof. Let A have eigenvalues \lambda_1,\lambda_2,\dots,\lambda_n with corresponding eigenvectors x_1,x_2,\dots,x_n. Suppose that y = \sum_{i=1}^n \alpha_i x_i = 0 for some \alpha_1,\alpha_2,\dots,\alpha_n. Then

\notag \begin{aligned}  0 &= (A-\lambda_2 I)\cdots (A-\lambda_n I)y    = \sum_{i=1}^n \alpha_i (A-\lambda_2 I)\cdots (A-\lambda_n I) x_i\\    &= \sum_{i=1}^n \alpha_i (\lambda_i-\lambda_2)\cdots(\lambda_i-\lambda_n)x_i    =  \alpha_1 (\lambda_1-\lambda_2)\cdots(\lambda_1-\lambda_n)x_1, \end{aligned}

which implies \alpha_1 = 0 since \lambda_1\ne \lambda_j for j\ge 2 and x_1\ne0. Premultiplying y = \sum_{i=2}^n \alpha_i x_i = 0 by \prod_{j=3}^n(A-\lambda_j I) shows, in the same way, that \alpha_2 = 0. Continuing in this way we find that \alpha_1 = \alpha_2 = \cdots = \alpha_n = 0. Therefore the x_i are linearly independent and hence A is diagonalizable.

A matrix can have repeated eigenvalues and be diagonalizable, as diagonal matrices with repeated diagonal entries show. What is needed for diagonalizability is that every k-times repeated eigenvalue has k linearly independent eigenvectors associated with it. Equivalently, the algebraic and geometric multiplicities of every eigenvalue must be equal, that is, the eigenvalues must all be semisimple. Another equivalent condition is that the degree of the minimal polynomial is equal to the number of distinct eigenvalues.

The simplest example of a matrix that is not diagonalizable is \bigl[\begin{smallmatrix} 0 & 1 \\ 0 & 0 \end{smallmatrix}\bigr]. This matrix is a 2\times 2 Jordan block with the eigenvalue 0. Diagonalizability is easily understood in terms of the Jordan canonical form: A is diagonalizable if and only if all the Jordan blocks in its Jordan form are 1\times 1.

Most matrices are diagonalizable, in the sense that the diagonalizable matrices are dense in \mathbb{C}^{n\times n}, that is, any matrix in \mathbb{C}^{n\times n} is arbitrarily close to a diagonalizable matrix. This property is useful because it can be convenient to prove a result by first proving it for diagonalizable matrices and then arguing that by continuity the result holds for a general matrix.

Is a rank-1 matrix A = xy^* diagonalizable, where x,y\in\mathbb{C}^n are nonzero? There are n-1 zero eigenvalues with eigenvectors any set of linearly independent vectors orthogonal to y. If y^*x \ne 0 then y^*x is the remaining eigenvalue, with eigenvector x, which is linearly independent of the eigenvectors for 0, and A is diagonalizable. If y^*x = 0 then all the eigenvalues of A are zero and so A cannot be diagonalizable, as the only diagonalizable matrix whose eigenvalues are all zero is the zero matrix. For the matrix \bigl[\begin{smallmatrix} 0 & 1 \\ 0 & 0 \end{smallmatrix}\bigr] mentioned above, x = \bigl[{1 \atop 0} \bigr] and y = \bigl[{0 \atop 1} \bigr], so y^*x = 0, confirming that this matrix is not diagonalizable.
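
These two cases can be checked via the Jordan form (a sketch; jordan requires the Symbolic Math Toolbox and the vectors are arbitrary choices):

% Rank-1 examples (a sketch; requires the Symbolic Math Toolbox).
x = [1; 2; 3]; y = [3; -1; 1];    % y'*x = 4, nonzero
jordan(sym(x*y'))                 % diagonal, with eigenvalues 0, 0, 4: diagonalizable
x = [1; 0; 0]; y = [0; 1; 0];     % y'*x = 0
jordan(sym(x*y'))                 % contains a 2-by-2 Jordan block: not diagonalizable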

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

The Ten Most-Viewed Posts of 2022


According to the WordPress statistics, this blog received over 192,000 visitors and 281,000 views in 2022. These are the ten most-viewed posts published during the year.

  1. Seven Sins of Numerical Linear Algebra
  2. What Is an Eigenvalue?
  3. The Big Six Matrix Factorizations
  4. What Is a Schur Decomposition?
  5. What Is a Permutation Matrix?
  6. What Is the Logarithmic Norm?
  7. What Is the Jordan Canonical Form?
  8. What Is the Frank Matrix?
  9. What Is a Circulant matrix?
  10. What Is a Toeplitz Matrix?

Eight of the posts are from the What Is series, which now contains 83 articles; more will follow in 2023.

What Is a Stochastic Matrix?

A stochastic matrix is an n\times n matrix with nonnegative entries and unit row sums. If A\in\mathbb{R}^{n\times n} is stochastic then Ae = e, where e = [1,1,\dots,1]^T is the vector of ones. This means that e is an eigenvector of A corresponding to the eigenvalue 1.

The identity matrix is stochastic, as is any permutation matrix. Here are some other examples of stochastic matrices:

\notag \begin{aligned}   A_n &= n^{-1}ee^T, \quad \textrm{in particular}~~     A_3 = \begin{bmatrix}    \frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\[3pt]    \frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\[3pt]    \frac{1}{3} & \frac{1}{3} & \frac{1}{3}    \end{bmatrix}, \qquad (1)\\   B_n &= \frac{1}{n-1}(ee^T -I), \quad \textrm{in particular}~~     B_3 = \begin{bmatrix}              0 & \frac{1}{2} & \frac{1}{2}\\[2pt]    \frac{1}{2} & 0           & \frac{1}{2}\\[2pt]    \frac{1}{2} & \frac{1}{2} & 0    \end{bmatrix},    \qquad (2)\\   C_n &= \begin{bmatrix}             1         &                   &           &    \\      \frac{1}{2}      &  \frac{1}{2}      &           &    \\      \vdots           &  \vdots           &\ddots     &    \\      \frac{1}{n}      &  \frac{1}{n}      &\cdots     &  \frac{1}{n}      \end{bmatrix}.   \qquad (3)\\ \end{aligned}

For any matrix A, the spectral radius \rho(A) = \max\{ |\lambda| : \lambda \mathrm{~is~an~eigenvalue~of~} A \} is bounded by \rho(A) \le \|A\| for any norm. For a stochastic matrix, taking the \infty-norm (the maximum row sum of absolute values) gives \rho(A) \le 1, so since we know that 1 is an eigenvalue, \rho(A) = 1. It can be shown that 1 is a semisimple eigenvalue, that is, if there are k eigenvalues equal to 1 then there are k linearly independent eigenvectors corresponding to 1 (Meyer, 2000, p. 696).

A lower bound on the spectral radius can be obtained from Gershgorin’s theorem. The ith Gershgorin disc is defined by |\lambda - a_{ii}| \le \sum_{j\ne i} |a_{ij}| = 1 - a_{ii}, which implies |\lambda| \ge 2a_{ii}-1. Every eigenvalue \lambda lies in the union of the n discs and so must satisfy

\notag         2\min_i a_{ii}-1 \le |\lambda| \le 1.

The lower bound is positive if A is strictly diagonally dominant by rows.

If A and B are stochastic then AB is nonnegative and ABe = Ae = e, so AB is stochastic. In particular, any positive integer power of A is stochastic. Does A^k converge as k\to\infty? The answer is that it does, and the limit is stochastic, as long as 1 is the only eigenvalue of modulus 1, and this will be the case if all the elements of A are positive (by Perron’s theorem). The simplest example of non-convergence is the stochastic matrix

\notag    A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},

which has eigenvalues 1 and -1. Since A^2 = I, all even powers are equal to I and all odd powers are equal to A. For the matrix (1), A_n^k = A_n for all k, while for (2), B_n^k \to A_n as k\to\infty. For (3), C_n^k converges to the matrix with 1 in every entry of the first column and zeros everywhere else.
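
A quick numerical check of the convergence of B_n^k to A_n (a sketch; n = 3 and k = 50 are arbitrary choices):

% Powers of B_3 in (2) converge to A_3 in (1) (a sketch).
n = 3; e = ones(n,1);
A3 = e*e'/n;
B3 = (e*e' - eye(n))/(n-1);
norm(B3^50 - A3, inf)        % about 1e-15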

Stochastic matrices arise in Markov chains. A transition matrix for a finite homogeneous Markov chain is a matrix whose (i,j) element is the probability of moving from state i to state j over a time step. It has nonnegative entries and the rows sum to 1, so it is a stochastic matrix. In applications including finance and healthcare, a transition matrix may be estimated for a certain time period, say one year, but a transition matrix for a shorter period, say one month, may be needed. If A is a transition matrix for a time period p then a stochastic pth root of A, which is a stochastic solution X of the equation X^p = A, is a transition matrix for a time period a factor p smaller. Therefore p = 12 (years to months) and p = 7 (weeks to days) are among the values of interest. Unfortunately, a stochastic pth root may not exist. For example, the matrix

\notag \begin{bmatrix}    0 & 1 & 0 \\    0 & 0 & 1 \\    0 & 0 & 1 \end{bmatrix}

has no pth roots at all, let alone stochastic ones. Yet many stochastic matrices do have stochastic roots. The matrix (3) has a stochastic pth root for all p, as shown by Higham and Lin (2011), who give a detailed analysis of pth roots of stochastic matrices. For example, to four decimal places,

\notag  C_6^{1/7} = \begin{bmatrix} 1.000  &        &        &        &        &        \\ 0.094  & 0.906  &        &        &        &        \\ 0.043  & 0.102  & 0.855  &        &        &        \\ 0.027  & 0.050  & 0.103  & 0.820  &        &        \\ 0.019  & 0.032  & 0.052  & 0.103  & 0.795  &        \\ 0.014  & 0.023  & 0.034  & 0.053  & 0.102  & 0.774  \\ \end{bmatrix}.

A stochastic matrix is sometimes called a row-stochastic matrix to distinguish it from a column-stochastic matrix, which is a nonnegative matrix for which e^TA = e^T (so that A^T is row-stochastic). A matrix that is both row-stochastic and column-stochastic is called doubly stochastic. A permutation matrix is an example of a doubly stochastic matrix. If U is a unitary matrix then the matrix with a_{ij} = |u_{ij}|^2 is doubly stochastic. A magic square scaled by the magic sum is also doubly stochastic. For example,

>> M = magic(4), A = M/sum(M(1,:)) % A is doubly stochastic.
M =
    16     2     3    13
     5    11    10     8
     9     7     6    12
     4    14    15     1
A =
   4.7059e-01   5.8824e-02   8.8235e-02   3.8235e-01
   1.4706e-01   3.2353e-01   2.9412e-01   2.3529e-01
   2.6471e-01   2.0588e-01   1.7647e-01   3.5294e-01
   1.1765e-01   4.1176e-01   4.4118e-01   2.9412e-02
>> [sum(A) sum(A')]
ans =
     1     1     1     1     1     1     1     1
>> eig(A)'
ans =
   1.0000e+00   2.6307e-01  -2.6307e-01   8.5146e-18

References

  • Nicholas J. Higham and Lijing Lin, On pth Roots of Stochastic Matrices, Linear Algebra Appl. 435, 448–463, 2011.
  • Carl D. Meyer, Matrix Analysis and Applied Linear Algebra, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.

Related Blog Posts

What Is the Inertia of a Matrix?

The inertia of a real symmetric n\times n matrix A is a triple, written \mathrm{In}(A) = (i_+(A),i_-(A),i_0(A)), where i_+(A) is the number of positive eigenvalues of A, i_-(A) is the number of negative eigenvalues of A, and i_0(A) is the number of zero eigenvalues of A.

The rank of A is i_+(A) + i_-(A). The difference i_+(A) - i_-(A) is called the signature.

In general it is not possible to determine the inertia by inspection, but some deductions can be made. If A has both positive and negative diagonal elements then i_+(A) \ge 1 and i_-(A) \ge 1. But in general the diagonal elements do not tell us much about the inertia. For example, here is a matrix that has positive diagonal elements but only one positive eigenvalue (and this example works for any n):

>> n = 4; A = -eye(n) + 2*ones(n), eigA = eig(sym(A))'
A =
     1     2     2     2
     2     1     2     2
     2     2     1     2
     2     2     2     1
eigA =
[-1, -1, -1, 7]

A congruence transformation of a symmetric matrix A is a transformation A \to X^T\!AX for a nonsingular matrix X. The result is clearly symmetric. Sylvester’s law of inertia (1852) says that the inertia is preserved under congruence transformations.

Theorem 1 (Sylvester’s law of inertia).

If A\in\mathbb{R}^{n\times n} is symmetric and X\in\mathbb{R}^{n\times n} is nonsingular then \mathrm{In}(A) = \mathrm{In}(X^T\!AX).

Sylvester’s law gives a way to determine the inertia without computing eigenvalues: find a congruence transformation that transforms A to a matrix whose inertia can be easily determined. A factorization PAP^T = LDL^T does the job, where P is a permutation matrix, L is unit lower triangular, and D is diagonal. Then \mathrm{In}(A) = \mathrm{In}(D), and \mathrm{In}(D) can be read off the diagonal of D. This factorization does not always exist, and when it does exist its computation can be numerically unstable. A block LDL^T factorization, in which D is block diagonal with diagonal blocks of size 1 or 2, always exists, and its computation is numerically stable with a suitable pivoting strategy such as symmetric rook pivoting.

For the matrix above we can compute a block LDL^T factorization using the MATLAB ldl function:

>> [L,D,P] = ldl(A); D
D =
   1.0000e+00   2.0000e+00            0            0
   2.0000e+00   1.0000e+00            0            0
            0            0  -1.6667e+00            0
            0            0            0  -1.4000e+00

Since the leading 2-by-2 block of D has negative determinant and hence one positive eigenvalue and one negative eigenvalue, it follows that A has one positive eigenvalue and three negative eigenvalues.
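
Alternatively, Sylvester’s law can be checked directly on this matrix with a random congruence (a sketch; the seed and the congruence matrix X are arbitrary choices):

% Check that a congruence transformation preserves the inertia (a sketch).
rng(1)                                 % arbitrary seed
n = 4; A = -eye(n) + 2*ones(n);        % inertia (1,3,0), from the example above
X = randn(n);                          % nonsingular with probability 1
M = X'*A*X; M = (M + M')/2;            % symmetrize to remove roundoff asymmetry
e = eig(M);
[sum(e > 0), sum(e < 0), sum(e == 0)]  % returns 1  3  0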

A congruence transformation preserves the signs of the eigenvalues but not their magnitude. A result of Ostrowski (1959) bounds the ratios of the eigenvalues of the original and transformed matrices. Let the eigenvalues of a symmetric matrix be ordered \lambda_n \le \cdots \le \lambda_1.

Theorem (Ostrowski).

For a symmetric A\in \mathbb{R}^{n\times n} and X\in\mathbb{R}^{n\times n},

\lambda_k(X^*AX) = \theta_k \lambda_k(A), \quad k=1\colon n,

where \lambda_n(X^*X) \le \theta_k \le \lambda_1(X^*X).

The theorem shows that the further X is from being orthogonal the greater the potential change in the eigenvalues.

Finally, we note that everything here generalizes to complex Hermitian matrices by replacing transpose by conjugate transpose.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.