What Is the Sherman–Morrison–Woodbury Formula?

When a nonsingular n\times n matrix A is perturbed by a matrix of rank k, the inverse also undergoes a rank-k perturbation. More precisely, if E has rank k and B = A+E is nonsingular then the identity A^{-1} - B^{-1} =  A^{-1} (B-A) B^{-1} shows that

\notag   \mathrm{rank}(A^{-1} -  B^{-1})    = \mathrm{rank}(A^{-1} E B^{-1}) = \mathrm{rank}(E) = k.

The Sherman–Morrison–Woodbury formula provides an explicit formula for the inverse of the perturbed matrix B.

Sherman–Morrison Formula

We will begin with the simpler case of a rank-1 perturbation: B = A + uv^*, where u and v are n-vectors, and we consider first the case where A = I. We might expect that (I + uv^*)^{-1} = I + \theta uv^* for some \theta (consider a binomial expansion of the inverse). Multiplying out, we obtain

\notag   (I + uv^*) (I + \theta uv^*) = I + (1 + \theta + \theta v^*u) uv^*,

so the product equals the identity matrix when \theta = -1/(1 + v^*u). The condition that I + uv^* be nonsingular is v^*u \ne -1 (as can also be seen from \det(I + uv^*) = 1 + v^*u, derived in What Is a Block Matrix?). So

\notag    (I + uv^*)^{-1} = I - \displaystyle\frac{1}{1 + v^*u} uv^*.

For the general case write B = A + uv^* = A(I + A^{-1}u v^*). Inverting this equation and applying the previous result gives

\notag   (A + uv^*)^{-1} = A^{-1} - \displaystyle\frac{A^{-1} uv^* A^{-1}}{1 + v^* A^{-1} u},

subject to the nonsingularity condition v^*A^{-1}u \ne -1. This is known as the Sherman–Morrison formula. It explicitly identifies the rank-1 change to the inverse.
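The formula is easy to check numerically. The following is a minimal sketch in Python with NumPy, restricted to real data (so v^* = v^T); the test matrix and vectors are arbitrary choices made for illustration, not taken from the article.

```python
import numpy as np

# Arbitrary real test data: A is symmetric tridiagonal and nonsingular.
A = np.array([[4., 1., 0., 0.],
              [1., 4., 1., 0.],
              [0., 1., 4., 1.],
              [0., 0., 1., 4.]])
u = np.array([[1.], [2.], [3.], [4.]])
v = np.ones((4, 1))

A_inv = np.linalg.inv(A)
denom = 1.0 + (v.T @ A_inv @ u).item()   # must be nonzero for A + u v^T to be nonsingular

# Sherman-Morrison: rank-1 correction to the known inverse.
B_inv_sm = A_inv - (A_inv @ u @ v.T @ A_inv) / denom
B_inv_direct = np.linalg.inv(A + u @ v.T)

sm_error = np.linalg.norm(B_inv_sm - B_inv_direct)
rank_of_change = np.linalg.matrix_rank(A_inv - B_inv_direct, tol=1e-10)
```

Here sm_error should be at the level of rounding error and rank_of_change should equal 1.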

As an example, if we take u = te_i and v = e_j (where e_k is the kth column of the identity matrix) then, writing A^{-1} = (\alpha_{ij}), we have

\notag   \bigl(A + te_ie_j^*\bigr)^{-1}    = A^{-1} - \displaystyle\frac{tA^{-1}e_i e_j^* A^{-1}}{1 +  t \alpha_{ji}}.

The Frobenius norm of the change to A^{-1} is

\notag   \displaystyle\frac{  |t|\, \| A^{-1}e_i\|_2 \| e_j^*A^{-1}\|_2 }                     {|1 + t \alpha_{ji}|}.

If t is sufficiently small then this quantity is approximately maximized for i and j such that the product of the norms of the ith column and the jth row of A^{-1} is maximized. For an upper triangular matrix, i = n and j = 1 are likely to give the maximum, which means that the inverse of an upper triangular matrix is likely to be most sensitive to perturbations in the (n,1) element of the matrix. To illustrate, we consider the matrix

\notag  T =  \left[\begin{array}{rrrr}   1 & -1 & -2 & -3\\   0 & 1 & -4 & -5\\   0 & 0 & 1 & -6\\   0 & 0 & 0 & 1   \end{array}\right].

The (i,j) element of the following matrix is \| T^{-1} - (T + 10^{-3}e_ie_j^*)^{-1} \|_F:

\notag   \left[\begin{array}{cccc}  0.044  &  0.029  &  0.006  &  0.001  \\  0.063  &  0.041  &  0.009  &  0.001  \\  0.322  &  0.212  &  0.044  &  0.007  \\  2.258  &  1.510  &  0.321  &  0.053  \\   \end{array}\right]

As our analysis suggests, the (4,1) entry is the most sensitive to perturbation.
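The table of norms can be reproduced directly. This is a sketch in Python with NumPy that forms each perturbed inverse explicitly (rather than using the Sherman–Morrison formula); the variable names are illustrative.

```python
import numpy as np

T = np.array([[1., -1., -2., -3.],
              [0.,  1., -4., -5.],
              [0.,  0.,  1., -6.],
              [0.,  0.,  0.,  1.]])
n = T.shape[0]
t = 1e-3
T_inv = np.linalg.inv(T)

# S[i, j] = || T^{-1} - (T + t e_i e_j^T)^{-1} ||_F, with zero-based i and j.
S = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = t
        S[i, j] = np.linalg.norm(T_inv - np.linalg.inv(T + E), "fro")

# Zero-based position of the largest entry of S.
most_sensitive = np.unravel_index(np.argmax(S), S.shape)
```

The largest entry of S sits at zero-based position (3, 0), that is, the (4,1) element in the article's one-based indexing.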

Sherman–Morrison–Woodbury Formula

Now consider a perturbation UV^*, where U and V are n\times k. This perturbation has rank at most k, and its rank is k if U and V are both of rank k. If I + V^* A^{-1} U is nonsingular then A + UV^* is nonsingular and

\notag   (A + UV^*)^{-1} = A^{-1} - A^{-1} U (I + V^* A^{-1} U)^{-1} V^*                      A^{-1},

which is the Sherman–Morrison–Woodbury formula. The significance of this formula is that I + V^* A^{-1} U is k\times k, so if k\ll n and A^{-1} is known then it is much cheaper to evaluate the right-hand side than to invert A + UV^* directly. In practice, of course, we rarely invert matrices, but rather exploit factorizations of them. If we have an LU factorization of A then we can use it in conjunction with the Sherman–Morrison–Woodbury formula to solve (A + UV^*)x = b in O(kn^2 + k^3) flops (the kn^2 term comes from the k+1 pairs of triangular solves needed to form A^{-1}b and A^{-1}U), as opposed to the O(n^3) flops required to factorize A + UV^* from scratch.
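As a rough sketch of this solution process in Python with NumPy (real data, so V^* = V^T): the function name smw_solve and the test data are hypothetical. A single call to np.linalg.solve with several right-hand sides stands in for reusing a stored LU factorization of A; in a real code one would keep the factors, for example with scipy.linalg.lu_factor and lu_solve.

```python
import numpy as np

def smw_solve(A, U, V, b):
    """Solve (A + U V^T) x = b via the Sherman-Morrison-Woodbury formula (real case)."""
    k = U.shape[1]
    # One call, so A is factorized once and reused for the k + 1 right-hand sides.
    Y = np.linalg.solve(A, np.column_stack([b, U]))
    y, Z = Y[:, 0], Y[:, 1:]                 # y = A^{-1} b,  Z = A^{-1} U
    small = np.eye(k) + V.T @ Z              # the k-by-k matrix I + V^T A^{-1} U
    return y - Z @ np.linalg.solve(small, V.T @ y)

n, k = 6, 2
A = 4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
U = np.arange(1.0, n * k + 1).reshape(n, k) / 10.0
V = U.copy()
b = np.ones(n)

x = smw_solve(A, U, V, b)
residual = np.linalg.norm((A + U @ V.T) @ x - b)
```

Only a k-by-k system is solved beyond the solves with A, which is where the saving over refactorizing A + UV^T comes from.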

The Sherman–Morrison–Woodbury formula is straightforward to verify, by showing that the product of the two sides is the identity matrix. How can the formula be derived in the first place? Consider any two matrices F and G such that FG and GF are both defined. The associative law for matrix multiplication gives F(GF) = (FG)F, or (I + FG)F = F (I + GF), which can be written as F(I+GF)^{-1} = (I+FG)^{-1}F. Postmultiplying by G gives

\notag   F(I+GF)^{-1}G = (I+FG)^{-1}FG                 = (I+FG)^{-1}(I + FG - I)                 = I - (I+FG)^{-1}.

Setting F = U and G = V^* gives the special case of the Sherman–Morrison–Woodbury formula with A = I, and the general formula follows from A + UV^* = A(I + A^{-1}U V^*).

General Formula

We will give a different derivation of an even more general formula using block matrices. Consider the block matrix

\notag   X =  \begin{bmatrix} A & U \\ V^* & -W^{-1} \end{bmatrix}

where A is n\times n, U and V are n\times k, and W is k\times k. We will obtain a formula for (A + UWV^*)^{-1} by looking at X^{-1}.

It is straightforward to verify that

\notag    \begin{bmatrix} A & U \\ V^* & -W^{-1} \end{bmatrix}     =    \begin{bmatrix} I & 0 \\ V^*A^{-1} & I \end{bmatrix}    \begin{bmatrix} A & 0 \\ 0 & -(W^{-1} + V^*A^{-1}U) \end{bmatrix}    \begin{bmatrix} I & A^{-1}U \\ 0 & I \end{bmatrix}.

Hence

\notag \begin{aligned}    \begin{bmatrix} A & U \\ V^* & -W^{-1} \end{bmatrix}^{-1}     &=    \begin{bmatrix} I & -A^{-1}U \\ 0 & I \end{bmatrix}    \begin{bmatrix} A^{-1} & 0 \\ 0 & -(W^{-1} + V^*A^{-1}U)^{-1} \end{bmatrix}    \begin{bmatrix} I & 0 \\ -V^*A^{-1} & I \end{bmatrix}\\[\smallskipamount]     &=    \begin{bmatrix} A^{-1} - A^{-1}U(W^{-1} + V^*A^{-1}U)^{-1}V^*A^{-1} &                    A^{-1}U(W^{-1} + V^*A^{-1}U)^{-1} \\          (W^{-1} + V^*A^{-1}U)^{-1} V^*A^{-1} & -(W^{-1} + V^*A^{-1}U)^{-1}      \end{bmatrix}. \end{aligned}

In the (1,1) block we see the right-hand side of a Sherman–Morrison–Woodbury-like formula, but it is not immediately clear how this relates to (A + UWV^*)^{-1}. Let P = \bigl[\begin{smallmatrix} 0 & I \\ I & 0 \end{smallmatrix} \bigr], and note that P^{-1} = P. Then

\notag     PXP = \begin{bmatrix} -W^{-1} & V^* \\ U & A \end{bmatrix}

and applying the above formula (appropriately renaming the blocks) gives, with \times denoting a block whose value does not matter,

\notag     PX^{-1}P = (PXP)^{-1}  = \begin{bmatrix} \times & \times \\ \times & (A + UWV^*)^{-1} \end{bmatrix}.

Hence (X^{-1})_{11} = (A + UWV^*)^{-1}. Equating our two formulas for (X^{-1})_{11} gives

\notag    \qquad\qquad\qquad   (A + UWV^*)^{-1} = A^{-1} - A^{-1}U (W^{-1} + V^*A^{-1}U)^{-1} V^*A^{-1},    \qquad\qquad\qquad(*)

provided that W^{-1} + V^*A^{-1}U is nonsingular.

To see one reason why this formula is useful, suppose that the matrix A and its perturbation are symmetric and we wish to preserve symmetry in our formulas. The Sherman–Morrison–Woodbury formula requires us to write the perturbation as UU^*, so the perturbation must be positive semidefinite. In (*), however, we can write an arbitrary symmetric perturbation as UWU^*, with W symmetric but possibly indefinite, and obtain a symmetric formula.
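A small numerical illustration of (*) with a symmetric, indefinite W, sketched in Python with NumPy. The matrices are arbitrary real test data chosen so that every matrix being inverted is nonsingular.

```python
import numpy as np

n, k = 5, 2
A = 4.0 * np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)  # symmetric
U = np.array([[1., 0.], [0., 1.], [1., 1.], [0., 2.], [2., 0.]]) / 4.0
W = np.diag([1.0, -1.0])                     # symmetric but indefinite

A_inv = np.linalg.inv(A)
core = np.linalg.inv(W) + U.T @ A_inv @ U    # W^{-1} + V^T A^{-1} U with V = U
update_inv = A_inv - A_inv @ U @ np.linalg.solve(core, U.T @ A_inv)

direct = np.linalg.inv(A + U @ W @ U.T)
formula_error = np.linalg.norm(update_inv - direct)
symmetry_defect = np.linalg.norm(update_inv - update_inv.T)
```

Both formula_error and symmetry_defect should be at the level of rounding error, reflecting that the formula reproduces the inverse and preserves symmetry.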

The matrix -(W^{-1} + V^*A^{-1}U) is the Schur complement of A in X. Consequently the inversion formula (*) is intimately connected with the theory of Schur complements. By manipulating the block matrices in different ways it is possible to derive variations of (*). We mention just the simple rewriting

\notag   (A + UWV^*)^{-1} = A^{-1} - A^{-1}U W(I + V^*A^{-1}UW)^{-1} V^*A^{-1},

which is valid even if W is singular, as long as I + V^*A^{-1}UW is nonsingular. Note that the formula is not symmetric when V = U and W = W^*. This variant can also be obtained by replacing U by UW in the Sherman–Morrison–Woodbury formula.

Historical Note

Formulas for the change in a matrix inverse under low rank perturbations have a long history. They have been rediscovered on multiple occasions, sometimes appearing without comment within other formulas. Equation (*) is given by Duncan (1944), which is the earliest appearance in print that I am aware of. For discussions of the history of these formulas see Henderson and Searle (1981) or Puntanen and Styan (2005).


This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is a Block Matrix?

A matrix is a rectangular array of numbers treated as a single object. A block matrix is a matrix whose elements are themselves matrices, which are called submatrices. By allowing a matrix to be viewed at different levels of abstraction, the block matrix viewpoint enables elegant proofs of results and facilitates the development and understanding of numerical algorithms.

A block matrix is defined in terms of a partitioning, which breaks a matrix into contiguous pieces. The most common and important case is for an n\times n matrix to be partitioned as a block 2\times 2 matrix (two block rows and two block columns). For n = 4, partitioning into 2\times 2 blocks gives

\notag   A = \left[\begin{array}{cc|cc}         a_{11} & a_{12} & a_{13} & a_{14}\\         a_{21} & a_{22} & a_{23} & a_{24}\\\hline         a_{31} & a_{32} & a_{33} & a_{34}\\         a_{41} & a_{42} & a_{43} & a_{44}\\         \end{array}\right]    =  \begin{bmatrix}         A_{11} & A_{12} \\         A_{21} & A_{22}        \end{bmatrix},

where

\notag  A_{11} =   A(1\colon2,1\colon2) = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix},

and similarly for the other blocks. The diagonal blocks in a partitioning of a square matrix are usually square (but not necessarily so), and they do not have to be of the same dimensions. This same 4\times 4 matrix could be partitioned as

\notag   A = \left[\begin{array}{c|ccc}         a_{11} & a_{12} & a_{13} & a_{14}\\\hline         a_{21} & a_{22} & a_{23} & a_{24}\\         a_{31} & a_{32} & a_{33} & a_{34}\\         a_{41} & a_{42} & a_{43} & a_{44}\\         \end{array}\right]    =  \begin{bmatrix}         A_{11} & A_{12} \\         A_{21} & A_{22}        \end{bmatrix},

where A_{11} = (a_{11}) is a scalar, A_{21} is a column vector, and A_{12} is a row vector.

The sum C = A + B of two block matrices A = (A_{ij}) and B = (B_{ij}) of the same dimension is obtained by adding blockwise as long as A_{ij} and B_{ij} have the same dimensions for all i and j, and the result has the same block structure: C_{ij} = A_{ij}+B_{ij}.

The product C = AB of an m\times n matrix A = (A_{ij}) and an n\times p matrix B = (B_{ij}) can be computed as C_{ij} = \sum_k A_{ik}B_{kj} as long as the products A_{ik}B_{kj} are all defined. In this case the matrices A and B are said to be conformably partitioned for multiplication. Here, C has as many block rows as A and as many block columns as B. For example,

\notag   AB =        \begin{bmatrix}         A_{11} & A_{12} \\         A_{21} & A_{22}        \end{bmatrix}        \begin{bmatrix}         B_{11} & B_{12} \\         B_{21} & B_{22}        \end{bmatrix}   =  \begin{bmatrix}         A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\         A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}        \end{bmatrix}

as long as all eight products A_{ik}B_{kj} are defined.

Block matrix notation is an essential tool in numerical linear algebra. Here are some examples of its usage.

Matrix Factorization

For an n\times n matrix A with nonzero (1,1) element \alpha we can write

\notag   A =  \begin{bmatrix}         \alpha & b^T \\           c & D        \end{bmatrix}    =        \begin{bmatrix}         1 & 0 \\           c/\alpha & I_{n-1}        \end{bmatrix}       \begin{bmatrix}         \alpha  & b^T \\          0 & D - cb^T/\alpha        \end{bmatrix} =: L_1U_1.

The first row and column of L_1 have the correct form for a unit lower triangular matrix and likewise the first row and column of U_1 have the correct form for an upper triangular matrix. If we can find an LU factorization S = L_2U_2 of the (n-1)\times (n-1) Schur complement S = D - cb^T/\alpha then A = L_1\mathrm{diag}(1,L_2)\cdot \bigl[\begin{smallmatrix} \alpha & b^T \\ 0 & U_2 \end{smallmatrix}\bigr] is an LU factorization of A. This construction is the basis of an inductive proof of the existence of an LU factorization (provided all the pivots are nonzero) and it also yields an algorithm for computing it.
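The block step above can be turned into a short recursive routine. The following Python/NumPy sketch assumes that every pivot encountered is nonzero and does no pivoting, so it illustrates the construction rather than providing a stable general-purpose factorization; the function name lu_recursive is hypothetical.

```python
import numpy as np

def lu_recursive(A):
    """LU factorization without pivoting, built from the 2x2 block step.

    Assumes every pivot encountered is nonzero.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    if n == 1:
        return np.ones((1, 1)), A.copy()
    alpha, b, c, D = A[0, 0], A[0, 1:], A[1:, 0], A[1:, 1:]
    S = D - np.outer(c, b) / alpha           # Schur complement of alpha in A
    L2, U2 = lu_recursive(S)
    L = np.eye(n)
    L[1:, 0] = c / alpha
    L[1:, 1:] = L2
    U = np.zeros((n, n))
    U[0, 0], U[0, 1:] = alpha, b
    U[1:, 1:] = U2
    return L, U

A = np.array([[4., 3., 2.], [8., 7., 9.], [4., 6., 5.]])
L, U = lu_recursive(A)
```

For this A the pivots are 4, 1, and -12, and L @ U reproduces A.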

The same type of construction applies to other factorizations, such as Cholesky factorization, QR factorization, and the Schur decomposition.

Matrix Inverse

A useful formula for the inverse of a nonsingular block triangular matrix

\notag      T = \begin{bmatrix}         T_{11} & T_{12} \\           0 & T_{22}        \end{bmatrix}

is

\notag      T^{-1}        =        \begin{bmatrix}      T_{11}^{-1} & - T_{11}^{-1}T_{12}T_{22}^{-1}\\               0  & T_{22}^{-1}        \end{bmatrix},

which has the special case

\notag        \begin{bmatrix}         I & X \\           0 & I        \end{bmatrix}^{-1}    =        \begin{bmatrix}              I   & -X\\               0  & I        \end{bmatrix}.

If T is upper triangular then so are T_{11} and T_{22}. By taking T_{11} of dimension the nearest integer to n/2 this formula can be used to construct a divide and conquer algorithm for computing T^{-1}.
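A minimal sketch of such a divide and conquer inversion in Python with NumPy. The function name triu_inv and the test matrix are illustrative choices; the matrix is assumed to be nonsingular and upper triangular.

```python
import numpy as np

def triu_inv(T):
    """Invert a nonsingular upper triangular matrix by block recursion."""
    n = T.shape[0]
    if n == 1:
        return np.array([[1.0 / T[0, 0]]])
    m = n // 2                               # split point near n/2
    X11 = triu_inv(T[:m, :m])
    X22 = triu_inv(T[m:, m:])
    X = np.zeros((n, n))
    X[:m, :m] = X11
    X[m:, m:] = X22
    X[:m, m:] = -X11 @ T[:m, m:] @ X22       # the (1,2) block of the formula
    return X

T = np.array([[1., -1., -2., -3.],
              [0.,  1., -4., -5.],
              [0.,  0.,  1., -6.],
              [0.,  0.,  0.,  1.]])
T_inv = triu_inv(T)
```

Each level of the recursion needs only two half-sized inversions and two matrix products.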

We note that \det(T) = \det(T_{11}) \det(T_{22}), a fact that will be used in the next section.

Determinantal Formulas

Block matrices provide elegant proofs of many results involving determinants. For example, consider the equations

\notag        \begin{bmatrix}         I & -A \\           0 & I        \end{bmatrix}        \begin{bmatrix}         I+AB & 0 \\           B & I        \end{bmatrix}      =        \begin{bmatrix}               I  & -A\\               B  & I        \end{bmatrix}      =        \begin{bmatrix}         I & 0 \\         B & I+BA        \end{bmatrix}        \begin{bmatrix}         I & -A \\         0 & I        \end{bmatrix},

which hold for any A and B such that AB and BA are defined. Taking determinants gives the formula \det(I + AB) = \det(I + BA). In particular we can take A = x, B = y^T, for n-vectors x and y, giving \det(I + xy^T) = 1 + y^Tx.
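Both determinant identities can be checked on small examples. A sketch in Python with NumPy, with arbitrary rectangular A and B so that I + AB and I + BA have different dimensions:

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.], [5., 6.]])     # 3-by-2
B = np.array([[1., 0., -1.], [2., 1., 0.]])      # 2-by-3
det_big = np.linalg.det(np.eye(3) + A @ B)       # 3-by-3 determinant
det_small = np.linalg.det(np.eye(2) + B @ A)     # 2-by-2 determinant

x = np.array([1., 2., 3.])
y = np.array([4., -5., 6.])
det_rank1 = np.linalg.det(np.eye(3) + np.outer(x, y))   # should equal 1 + y^T x
```

For this data both det_big and det_small equal -7 up to rounding, and det_rank1 equals 1 + y^Tx = 13.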

Constructing Matrices with Required Properties

We can sometimes build a matrix with certain desired properties by a block construction. For example, if X is an n\times n involutory matrix (X^2 = I) then

\notag        \begin{bmatrix}         X & I \\         0 & -X        \end{bmatrix}

is a (block triangular) 2n\times 2n involutory matrix. And if A and B are any two n\times n matrices then

\notag        \begin{bmatrix}               I - BA & B \\               2A-ABA & AB-I        \end{bmatrix}

is involutory.
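Both constructions can be confirmed numerically. A sketch in Python with NumPy, using small arbitrary matrices:

```python
import numpy as np

n = 2
I = np.eye(n)
X = np.array([[0., 1.], [1., 0.]])              # involutory: X @ X = I
M1 = np.block([[X, I], [np.zeros((n, n)), -X]])

A = np.array([[1., 2.], [3., 4.]])
B = np.array([[0., 1.], [-1., 2.]])
M2 = np.block([[I - B @ A, B], [2 * A - A @ B @ A, A @ B - I]])

M1_squared = M1 @ M1                            # expected to be the 4-by-4 identity
M2_squared = M2 @ M2                            # expected to be the 4-by-4 identity
```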

The Anti Block Diagonal Trick

For n\times n matrices A and B consider the anti block diagonal matrix

\notag     X = \begin{bmatrix}               0 & A \\               B & 0        \end{bmatrix}.

Note that

\notag     X^2 = \begin{bmatrix}               AB & 0 \\               0 & BA        \end{bmatrix}, \quad     X^{-1} = \begin{bmatrix}               0 & B^{-1} \\               A^{-1} & 0        \end{bmatrix}.

Using these properties one can show a relation between the matrix sign function and the principal matrix square root:

\notag     \mathrm{sign}\left(        \begin{bmatrix} 0 & A \\ I & 0 \end{bmatrix}                 \right)         = \begin{bmatrix} 0 & A^{1/2} \\ A^{-1/2} & 0 \end{bmatrix}.

This allows one to derive iterations for computing the matrix square root and its inverse from iterations for computing the matrix sign function.
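As an illustration, the sketch below (Python with NumPy) applies the Newton iteration for the matrix sign function, X \leftarrow (X + X^{-1})/2, to the block matrix and reads off A^{1/2} and A^{-1/2}. The choice of the Newton iteration, the fixed iteration count, and the symmetric positive definite test matrix are assumptions made for this example.

```python
import numpy as np

A = np.array([[4., 1.], [1., 3.]])               # symmetric positive definite
n = A.shape[0]
X = np.block([[np.zeros((n, n)), A], [np.eye(n), np.zeros((n, n))]])

# Newton iteration for the matrix sign function: X <- (X + X^{-1}) / 2.
for _ in range(30):
    X = 0.5 * (X + np.linalg.inv(X))

sqrt_from_sign = X[:n, n:]                       # (1,2) block: A^{1/2}
inv_sqrt_from_sign = X[n:, :n]                   # (2,1) block: A^{-1/2}

w, Q = np.linalg.eigh(A)                         # reference via the eigendecomposition
sqrt_ref = Q @ np.diag(np.sqrt(w)) @ Q.T
```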

It is easy to derive explicit formulas for all the powers of X, and hence for any power series evaluated at X. In particular, we have the formula

\notag   \mathrm{e}^X = \left[\begin{array}{cc}                   \cosh\sqrt{AB} & A (\sqrt{BA})^{-1} \sinh \sqrt{BA}                        \\[\smallskipamount]                       B(\sqrt{AB})^{-1} \sinh \sqrt{AB} &                      \cosh\sqrt{BA}                  \end{array}\right],

where \sqrt{Y} denotes any square root of Y. With B = I, this formula arises in the solution of the ordinary differential equation initial value problem y'' + Ay = 0, y(0)=y_0, y'(0)=y'_0.

The most well known instance of the trick is when B = A^T. The eigenvalues of

\notag     X = \begin{bmatrix}               0 & A \\               A^T & 0        \end{bmatrix}

are plus and minus the singular values of A, together with |m-n| additional zeros if A is m\times n with m \ne n, and the eigenvectors of X and the singular vectors of A are also related. Consequently, by applying results or algorithms for symmetric matrices to X one obtains results or algorithms for the singular value decomposition of A.
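The eigenvalue relation is easy to observe numerically. A sketch in Python with NumPy for an arbitrary 3-by-2 matrix, so that one extra zero eigenvalue is expected:

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.], [5., 6.]])     # m = 3, n = 2
m, n = A.shape
X = np.block([[np.zeros((m, m)), A], [A.T, np.zeros((n, n))]])

eigs = np.sort(np.linalg.eigvalsh(X))            # X is symmetric
sv = np.linalg.svd(A, compute_uv=False)          # singular values of A
expected = np.sort(np.concatenate([sv, -sv, np.zeros(abs(m - n))]))
```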

References

This is a minimal set of references, which contain further useful references within.

  • Gene Golub and Charles F. Van Loan, Matrix Computations, fourth edition, Johns Hopkins University Press, Baltimore, MD, USA, 2013.
  • Nicholas J. Higham, Functions of Matrices: Theory and Computation, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008. (Sections 1.5 and 1.6 for the theory of matrix square roots.)
  • Roger A. Horn and Charles R. Johnson, Matrix Analysis, second edition, Cambridge University Press, 2013.


What Is a Householder Matrix?

A Householder matrix is an n\times n orthogonal matrix of the form

\notag         P = I - \displaystyle\frac{2}{v^Tv} vv^T, \qquad 0 \ne v \in\mathbb{R}^n.

It is easily verified that P is

  • orthogonal (P^TP = I),
  • symmetric (P^T = P),
  • involutory (P^2 = I, that is, P is a square root of the identity matrix),

where the last property follows from the first two.

A Householder matrix is a rank-1 perturbation of the identity matrix and so all but one of its eigenvalues are 1. The eigensystem can be fully described as follows.

  • P has an eigenvalue -1 with eigenvector v, since Pv = -v.
  • P has n-1 eigenvalues 1 with eigenvectors any set of n-1 linearly independent vectors orthogonal to v, which can be taken to be mutually orthogonal: Px = x for every such x.

P has trace n-2 and determinant -1, as can be derived directly or deduced from the facts that the trace is the sum of the eigenvalues and the determinant is the product of the eigenvalues.
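These properties can be confirmed for a particular v. A minimal sketch in Python with NumPy; the helper name householder and the vector v are illustrative.

```python
import numpy as np

def householder(v):
    """Return P = I - 2 v v^T / (v^T v) for a nonzero real vector v."""
    v = np.asarray(v, dtype=float)
    return np.eye(v.size) - 2.0 * np.outer(v, v) / (v @ v)

v = np.array([1., 2., 2.])
P = householder(v)
eigenvalues = np.sort(np.linalg.eigvalsh(P))     # P is symmetric
trace_P = np.trace(P)                            # expected n - 2 = 1
det_P = np.linalg.det(P)                         # expected -1
```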

For n = 2, a Householder matrix can be written as

\notag      P = \begin{bmatrix} \cos\theta & \sin\theta\\                          \sin\theta & -\cos\theta           \end{bmatrix}.

Simple examples of Householder matrices are obtained by choosing v = e = [1,1,\dots,1]^T, for which P = I  - (2/n)ee^T. For n=2,3,4,5,6 we obtain the matrices

\notag    \begin{gathered}        \left[\begin{array}{@{\mskip2mu}rr@{\mskip2mu}}                      0 & -1 \\ -1 & 0        \end{array}\right], \quad   \displaystyle\frac{1}{3}        \left[\begin{array}{@{\mskip2mu}rrr@{\mskip2mu}}                       1 &  -2 &  -2\\                      -2 &   1 &  -2\\                      -2 &  -2 &   1\\        \end{array}\right], \quad    \displaystyle\frac{1}{2}        \left[\begin{array}{@{\mskip2mu}rrrr@{\mskip2mu}}                        1 &   -1 &   -1 &   -1\\                       -1 &    1 &   -1 &   -1\\                       -1 &   -1 &    1 &   -1\\                       -1 &   -1 &   -1 &    1\\        \end{array}\right], \\   \displaystyle\frac{1}{5}        \left[\begin{array}{@{\mskip2mu}rrrrr@{\mskip2mu}}    3 & -2 & -2 & -2 & -2\\ -2 & 3 & -2 & -2 &    -2\\ -2 & -2 & 3 & -2 & -2\\ -2 & -2 & -2 & 3 & -2\\ -2 & -2 & -2 & -2 & 3   \end{array}\right], \quad  \displaystyle\frac{1}{3} \left[\begin{array}{@{\mskip2mu}rrrrrr@{\mskip2mu}}   2 & -1 & -1 & -1 & -1 & -1\\ -1 & 2 & -1 & -1 & -1 & -1\\   -1 & -1 & 2 & -1 & -1 & -1\\ -1 & -1 & -1 & 2 & -1 & -1\\ -1 & -1 & -1 & -1 & 2 & -1\\   -1 & -1 & -1 & -1 & -1 & 2 \end{array}\right].    \end{gathered}

Note that the 4\times 4 matrix is 1/2 times a Hadamard matrix.

Applying P to a vector x gives

\notag    Px = x - \displaystyle\left( \frac{2 v^Tx}{v^Tv} \right) v.

This equation shows that P reflects x about the hyperplane {\mathrm{span}}(v)^{\perp}, as illustrated in the following diagram, which explains why P is sometimes called a Householder reflector. Another way of expressing this property is to write x = \alpha v + z, where z is orthogonal to v. Then Px = -\alpha v + z, so the component of x in the direction v has been reversed. If we take v = e_i, the ith unit vector, then P = I - 2e_ie_i^T = \mathrm{diag}(1,1,\dots,-1,1,\dots,1), which has -1 in the (i,i) position. In this case premultiplying a vector by P flips the sign of the ith component.

householder_fig.jpg

Transforming a Vector

Householder matrices are powerful tools for introducing zeros into vectors. Suppose we are given vectors x and y and wish to find a Householder matrix P such that Px=y. Since P is orthogonal, we require that \|x\|_2 = \|y\|_2, and we exclude the trivial case x = y. Now

Px = y \quad \Longleftrightarrow \quad             x - 2 \left( \displaystyle\frac{v^Tx}{v^Tv} \right)  v = y,

and this last equation has the form \alpha v = x-y for some \alpha. But P is independent of the scaling of v, so we can set \alpha=1. Now with v=x-y we have

\notag        v^Tv = x^Tx + y^Ty  -2x^Ty

and, since x^Tx = y^Ty,

\notag       v^Tx = x^Tx - y^Tx = \frac{1}{2} v^Tv.

Therefore

\notag       Px = x - v = y,

as required. Most often we choose y to be zero in all but its first component.
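The construction can be sketched in Python with NumPy as follows. The vector x is arbitrary, and the sign chosen for the first component of y is a common practical choice that avoids cancellation in v = x - y; the derivation above does not require it.

```python
import numpy as np

x = np.array([3., 4., 0., 12.])
y = np.zeros_like(x)
y[0] = -np.linalg.norm(x)        # target: a multiple of e_1 with the same 2-norm as x
v = x - y
P = np.eye(x.size) - 2.0 * np.outer(v, v) / (v @ v)
Px = P @ x                       # should equal y, with zeros below the first component
```

Here \|x\|_2 = 13, so Px = [-13, 0, 0, 0]^T.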

Square Roots

What can we say about square roots of a Householder matrix, that is, matrices X such that X^2 = P?

We note first that the eigenvalues of X are the square roots of those of P and so n-1 of them will be \pm 1 and one will be \pm \mathrm{i}. This means that X cannot be real, as the nonreal eigenvalues of a real matrix must appear in complex conjugate pairs.

Write P = I - 2vv^T, where v is normalized so that v^Tv = 1. It is natural to look for a square root of the form X = I - \theta vv^T. Setting X^2 = P leads to the quadratic equation \theta^2-2\theta + 2 = 0, and hence \theta = 1 \pm \mathrm{i}. As expected, these two square roots are complex even though P is real. As an example, \theta = 1 - \mathrm{i} gives the following square root of the matrix above corresponding to v = e/n^{1/2} with n = 3:

\notag X = \displaystyle\frac{1}{3} \left[\begin{array}{@{\mskip2mu}rrr} 2+\mathrm{i} & -1+\mathrm{i} & -1+\mathrm{i}\\ -1+\mathrm{i} & 2+\mathrm{i} & -1+\mathrm{i}\\ -1+\mathrm{i} & -1+\mathrm{i} & 2+\mathrm{i} \end{array}\right].
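A quick check of this square root in Python with NumPy (a sketch; complex arithmetic is needed because \theta is nonreal):

```python
import numpy as np

n = 3
v = np.ones(n) / np.sqrt(n)                      # normalized so that v^T v = 1
P = np.eye(n) - 2.0 * np.outer(v, v)
theta = 1 - 1j                                   # a root of theta^2 - 2*theta + 2 = 0
X = np.eye(n) - theta * np.outer(v, v)           # candidate square root of P
```

Squaring X recovers P, and 3X has diagonal entries 2 + i and off-diagonal entries -1 + i.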

A good way to understand all the square roots is to diagonalize P, which can be done by a similarity transformation with a Householder matrix! Normalizing v^Tv = 1 again, let w = v - e_1 and H = I - 2ww^T/(w^Tw). Then from the construction above we know that Hv = e_1. Hence

\notag   H^T\!PH = HPH = I - 2 Hv v^T\!H = I - 2 e_1e_1^T         = \mathrm{diag}(-1,1,1,\dots,1)=: D.

Then P = HDH^T and so X = H \sqrt{D} H^T gives 2^n square roots on taking all possible combinations of signs on the diagonal for \sqrt{D}. Because P has repeated eigenvalues these are not the only square roots. The infinitely many others are obtained by taking non-diagonal square roots of D, which are of the form \mathrm{diag}(\pm \mathrm{i}, Y), where Y is any non-diagonal square root of the (n-1)\times (n-1) identity matrix, which in particular could be a Householder matrix!

Block Householder Matrix

It is possible to define an n\times n block Householder matrix in terms of a given Z\in\mathbb{R}^{n\times p}, where n\ge p, as

\notag      P = I - 2 Z(Z^TZ)^+Z^T.

Here, “+” denotes the Moore–Penrose pseudoinverse. For p=1, P clearly reduces to a standard Householder matrix. It can be shown that (Z^TZ)^+Z^T = Z^+ (this is most easily proved using the SVD), and so

P = I - 2 ZZ^+ = I - 2 P_Z,

where P_Z = ZZ^+ is the orthogonal projector onto the range of Z (that is, \mathrm{range}(P_Z) = \mathrm{range}(Z), P_Z^2 = P_Z, and P_Z = P_Z^T). Hence, like a standard Householder matrix, P is symmetric, orthogonal, and involutory. Furthermore, premultiplication of a matrix by P has the effect of reversing the component in the range of Z.

As an example, here is the block Householder matrix corresponding to Z = \bigl[\begin{smallmatrix} 1 & 2 & 3 & 4\\ 5 & 6 & 7 & 8 \end{smallmatrix}\bigr]^T:

\notag \displaystyle\frac{1}{5} \left[\begin{array}{@{\mskip2mu}rrrr@{\mskip2mu}} -2 & -4 & -1 & 2\\ -4 & 2 & -2 & -1\\ -1 & -2 & 2 & -4\\ 2 & -1 & -4 & -2 \end{array}\right].

One can show (using the SVD again) that the eigenvalues of P are -1 repeated r times and 1 repeated n-r times, where r = \mathrm{rank}(Z). Hence \mathrm{trace}(P) = n - 2r and \det(P) = (-1)^r.
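The example can be reproduced with a pseudoinverse routine. A sketch in Python with NumPy, using np.linalg.pinv for Z^+:

```python
import numpy as np

Z = np.array([[1., 2., 3., 4.], [5., 6., 7., 8.]]).T    # n = 4, p = 2
P_Z = Z @ np.linalg.pinv(Z)                  # orthogonal projector onto range(Z)
P = np.eye(4) - 2.0 * P_Z                    # block Householder matrix
r = np.linalg.matrix_rank(Z)

# 5 * P should reproduce the integer matrix displayed above.
expected = np.array([[-2., -4., -1.,  2.],
                     [-4.,  2., -2., -1.],
                     [-1., -2.,  2., -4.],
                     [ 2., -1., -4., -2.]]) / 5.0
```

With r = 2 and n = 4 the trace of P is n - 2r = 0 and the determinant is (-1)^r = 1.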

Schreiber and Parlett (1988) note the representation for n = 2k,

\notag    P =  \pm \mathrm{diag}(Q_1,Q_2)         \begin{bmatrix} \cos(2\Theta) & \sin(2\Theta) \\                         \sin(2\Theta) & -\cos(2\Theta)         \end{bmatrix}         \mathrm{diag}(Q_1,Q_2)^T,

where Q_1 and Q_2 are orthogonal and \Theta is symmetric positive definite. This formula neatly generalizes the formula for a standard Householder matrix for n = 2 given above, and a similar formula holds for odd n.

Schreiber and Parlett also show how given E\in\mathbb{R}^{n\times p} (n > p) one can construct a block Householder matrix H such that

\notag      HE = \begin{bmatrix} F \\ 0 \end{bmatrix},          \qquad F \in \mathbb{R}^{p\times p}.

The polar decomposition plays a key role in the theory and algorithms for such H.

Rectangular Householder Matrix

We can define a rectangular Householder matrix as follows. Let m > n, u \in \mathbb{R}^n, v \in \mathbb{R}^{m-n}, and

\notag   P =   \begin{bmatrix} I_n\\0 \end{bmatrix}   + \alpha   \begin{bmatrix}     u\\v   \end{bmatrix}u^T =   \begin{bmatrix}     I_n + \alpha u u^T\\ \alpha vu^T   \end{bmatrix} \in \mathbb{R}^{m\times n}.

Then P^TP = I, that is, P has orthonormal columns, if

\alpha = \displaystyle\frac{-2}{u^Tu + v^Tv}.

Of course, P is just the first n columns of the Householder matrix built from the vector [u^T~v^T]^T.
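A short check in Python with NumPy (a sketch with arbitrary u and v) that P has orthonormal columns and coincides with the leading columns of the full Householder matrix:

```python
import numpy as np

m, n = 5, 3
u = np.array([1., 2., 3.])                       # u in R^n
v = np.array([4., 5.])                           # v in R^(m-n)
w = np.concatenate([u, v])                       # the stacked vector [u; v]
alpha = -2.0 / (u @ u + v @ v)

P = np.vstack([np.eye(n), np.zeros((m - n, n))]) + alpha * np.outer(w, u)
H = np.eye(m) - 2.0 * np.outer(w, w) / (w @ w)   # full m-by-m Householder matrix
```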

Historical Note

The earliest appearance of Householder matrices is in the book by Turnbull and Aitken (1932). These authors show that if \|x\|_2 = \|y\|_2 (x\ne -y) then a unitary matrix of the form R = \alpha zz^* - I (in their notation) can be constructed so that Rx = y. They use this result to prove the existence of the Schur decomposition. The first systematic use of Householder matrices for computational purposes was by Householder (1958) who used them to construct the QR factorization.


What Is a Sparse Matrix?

A sparse matrix is one with a large number of zero entries. A more practical definition is that a matrix is sparse if the number or distribution of the zero entries makes it worthwhile to avoid storing or operating on the zero entries.

Sparsity is not to be confused with data sparsity, which refers to the situation where, because of redundancy, the data can be efficiently compressed while controlling the loss of information. Data sparsity typically manifests itself in low rank structure, whereas sparsity is solely a property of the pattern of nonzeros.

Important sources of sparse matrices include discretization of partial differential equations, image processing, optimization problems, and networks and graphs. In designing algorithms for sparse matrices we have several aims.

  • Store the nonzeros only, in some suitable data structure.
  • Avoid operations involving only zeros.
  • Preserve sparsity, that is, minimize fill-in (a zero element becoming nonzero).

We wish to achieve these aims without sacrificing speed, stability, or reliability.

An important class of sparse matrices is banded matrices. A matrix A has bandwidth p if the elements outside the main diagonal and the first p superdiagonals and subdiagonals are zero, that is, if a_{ij} = 0 for j>i+p or i>j+p.

The most common type of banded matrix is a tridiagonal matrix (p = 1), of which an archetypal example is the second-difference matrix, illustrated for n = 5 by

\notag    A_5 = \left[    \begin{array}{@{}*{4}{r@{\mskip10mu}}r}                 2 &  -1  & 0  & 0 & 0\\                 -1 & 2  & -1  & 0 & 0\\                  0 & -1  & 2 & -1 & 0\\                  0 &  0  &-1 & 2  & -1\\                  0 &  0  & 0 & -1 & 2    \end{array}\right].

This matrix (or more precisely its negative) corresponds to a centered finite difference approximation to a second derivative: f''(x) \approx (f(x+h) -2 f(x) + f(x-h))/h^2.
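For illustration, the sketch below (Python with NumPy) builds the second-difference matrix and computes its bandwidth from the definition. It stores the matrix densely purely for brevity, which is exactly what one avoids in practice; the function names are hypothetical.

```python
import numpy as np

def second_difference(n):
    """Tridiagonal second-difference matrix: 2 on the diagonal, -1 beside it."""
    return 2.0 * np.eye(n) - np.diag(np.ones(n - 1), 1) - np.diag(np.ones(n - 1), -1)

def bandwidth(A):
    """Smallest p such that a_ij = 0 whenever |i - j| > p."""
    i, j = np.nonzero(A)
    return int(np.max(np.abs(i - j))) if i.size else 0

A5 = second_difference(5)
p = bandwidth(A5)                                # 1 for a tridiagonal matrix
nnz = int(np.count_nonzero(A5))                  # 3n - 2 = 13 nonzeros out of 25 entries
```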

The following plots show the sparsity patterns for two symmetric positive definite matrices. Here, the nonzero elements are indicated by dots.

sparse_plots.jpg

The matrices are both from power network problems and they are taken from the SuiteSparse Matrix Collection (https://sparse.tamu.edu/). The matrix names are shown in the titles and the nz values below the x-axes are the numbers of nonzeros. The plots were produced using MATLAB code of the form

W = ssget('HB/494_bus'); A = W.A; spy(A)

where the ssget function is provided with the collection. The matrix on the left shows no particular pattern for the nonzero entries, while that on the right has a structure comprising four diagonal blocks with a relatively small number of elements connecting the blocks.

It is important to realize that while the sparsity pattern often reflects the structure of the underlying problem, it is arbitrary in that it will change under row and column reorderings. If we are interested in solving Ax = b, for example, then for any permutation matrices P and Q we can form the transformed system PAQ (Q^*x) = Pb, which has a coefficient matrix PAQ having permuted rows and columns, a permuted right-hand side Pb, and a permuted solution. We usually wish to choose the permutations to minimize the fill-in or (almost equivalently) the number of nonzeros in the factors L and U of an LU factorization. Various methods have been derived for this task; they are necessarily heuristic because finding the minimum is in general an NP-complete problem. When A is symmetric we take Q = P^T in order to preserve symmetry.

For the HB/494_bus matrix the symmetric reverse Cuthill-McKee permutation gives a reordered matrix with the following sparsity pattern, plotted with the MATLAB commands

r = symrcm(A); spy(A(r,r))

sparse_plots_rcm.jpg

The reordered matrix has a variable band structure that is characteristic of the symmetric reverse Cuthill-McKee permutation. The number of nonzeros is, of course, unchanged by reordering, so what has been gained? The next plots show the Cholesky factors of the HB/494_bus matrix and the reordered matrix. The Cholesky factor for the reordered matrix has a much narrower bandwidth than that for the original matrix and has fewer nonzeros by a factor 3. Reordering has greatly reduced the amount of fill-in that occurs; it leads to a Cholesky factor that is cheaper to compute and requires less storage.

sparse_plots_chol.jpg

Because Cholesky factorization is numerically stable, the matrix can be permuted without affecting the numerical stability of the computation. For a nonsymmetric problem the choice of row and column interchanges also needs to take into account the need for numerical stability, which complicates matters.
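The effect of ordering on fill-in shows up even on a tiny example. The sketch below (Python with NumPy) uses a symmetric positive definite arrowhead matrix and a simple reversal of the ordering. This is a toy illustration with dense storage, not the reverse Cuthill-McKee algorithm or the HB/494_bus matrix.

```python
import numpy as np

n = 6
A = 4.0 * np.eye(n)
A[0, 1:] = 1.0
A[1:, 0] = 1.0
A[0, 0] = n                                      # arrowhead matrix, diagonally dominant

perm = np.arange(n)[::-1]                        # reverse ordering (symmetric permutation)
A_perm = A[np.ix_(perm, perm)]

def nnz(M, tol=1e-14):
    """Number of entries of M larger than tol in magnitude."""
    return int(np.count_nonzero(np.abs(M) > tol))

nnz_original = nnz(np.linalg.cholesky(A))        # factor fills in completely
nnz_reordered = nnz(np.linalg.cholesky(A_perm))  # factor keeps the arrow pattern
```

With the dense row and column first, the Cholesky factor has a full lower triangle (21 nonzeros); with them last there is no fill-in (11 nonzeros).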

The world of sparse matrix computations is very different from that for dense matrices. In the first place, sparse matrices are not stored as n\times n arrays, but rather just the nonzeros are stored, in some suitable data structure. Programming sparse matrix computations is, consequently, more difficult than for dense matrix computations. A second difference from the dense case is that certain operations are, for practical purposes, forbidden. Most notably, we never invert sparse matrices because of the possibly severe fill-in. Indeed the inverse of a sparse matrix is usually dense. For example, the inverse of the tridiagonal matrix given at the start of this article is

\notag   A_5^{-1} = \displaystyle\frac{1}{6}  \begin{bmatrix} 5 & 4 & 3 & 2 & 1\\ 4 & 8 & 6 & 4 & 2\\ 3 & 6 & 9 & 6 & 3\\ 2 & 4 & 6 & 8 & 4\\ 1 & 2 & 3 & 4 & 5  \end{bmatrix}.

While it is always true that one should not solve Ax = b by forming x = A^{-1} \times b, for reasons of cost and numerical stability (unless A is orthogonal!), it is even more true when A is sparse.

Finally, we mention an interesting property of A_5^{-1}. Up to the scalar factor 1/6, its upper triangle agrees with the upper triangle of the rank-1 matrix

\notag  \begin{bmatrix}   1 \\ 2 \\ 3 \\ 4 \\ 5  \end{bmatrix}  \begin{bmatrix}    5 & 4 & 3 & 2 & 1  \end{bmatrix} =  \begin{bmatrix}    5 & 4 & 3 & 2 & 1\\    10 & 8 & 6 & 4 & 2\\    15 & 12& 9 & 6 & 3\\    20 & 16& 12& 8 & 4\\    25 & 20& 15& 10& 5  \end{bmatrix}.

This property generalizes to other tridiagonal matrices. So while a tridiagonal matrix is sparse, its inverse is data sparse, as it has to be because in general A depends on 3n-2 parameters (2n-1 if A is symmetric) and hence so does A^{-1}. One implication of this property is that it is possible to compute the condition number \kappa_{\infty}(A) = \|A\|_{\infty} \|A^{-1}\|_{\infty} of a tridiagonal matrix in O(n) flops.
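The displayed inverse and the rank-1 property can be checked numerically. The following is a minimal sketch in Python with NumPy (our choice of language for illustration). It assumes that A_5 is the second-difference matrix tridiag(-1, 2, -1), which is not restated in this section; the variable names are hypothetical.

```python
import numpy as np

n = 5
# Assumption: A_5 is the second-difference matrix tridiag(-1, 2, -1).
A5 = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# The inverse displayed in the text, scaled by 6 so that all entries are integers.
M = np.array([[5, 4, 3, 2, 1],
              [4, 8, 6, 4, 2],
              [3, 6, 9, 6, 3],
              [2, 4, 6, 8, 4],
              [1, 2, 3, 4, 5]], dtype=float)

# A5 * (M / 6) = I is equivalent to A5 * M = 6 I.
inverse_ok = np.allclose(A5 @ M, 6 * np.eye(n))

# Rank-1 matrix [1..5]^T [5..1]; its upper triangle should match that of M.
rank1 = np.outer(np.arange(1, n + 1), np.arange(n, 0, -1))
upper_ok = np.array_equal(np.triu(M), np.triu(rank1))

# Sparse matrix, dense inverse: compare the nonzero counts.
nnz_A = np.count_nonzero(A5)
nnz_inv = np.count_nonzero(M)
print(inverse_ok, upper_ok, nnz_A, nnz_inv)
```

The nonzero counts illustrate the fill-in: the tridiagonal matrix has 3n-2 nonzeros, whereas its inverse is full.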

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is the Sylvester Equation?

The Sylvester equation is the linear matrix equation

AX - XB = C,

where A\in\mathbb{C}^{m\times m}, B\in\mathbb{C}^{n\times n}, and X,C\in\mathbb{C}^{m\times n}. It is named after James Joseph Sylvester (1814–1897), who considered the homogeneous version of the equation, AX - XB = 0, in 1884. Special cases of the equation are Ax = b (a standard linear system), AX = XA (matrix commutativity), Ax = \lambda x (an eigenvalue–eigenvector equation), and AX = I (matrix inversion).

In the case where B = A, taking the trace of both sides of the equation gives

\mathrm{trace}(C) = \mathrm{trace}(AX - XA) = \mathrm{trace}(AX) -  \mathrm{trace} (XA) = 0,

so a solution can exist only when C has zero trace. Hence AX - XA = I, for example, has no solution.

To determine when the Sylvester equation has a solution we will transform it into a simpler form. Let A = URU^* and B = VSV^* be Schur decompositions, where U and V are unitary and R and S are upper triangular. Premultiplying the Sylvester equation by U^*, postmultiplying by V, and setting Y = U^*XV and D = U^*CV, we obtain

RY - YS = D,

which is a Sylvester equation with upper triangular coefficient matrices. Equating the jth columns on both sides leads to

(R - s_{jj}I) y_j = d_j + \displaystyle\sum_{k=1}^{j-1} s_{kj} y_k, \quad j = 1\colon n.

As long as the triangular matrices R - s_{jj}I are nonsingular for all j we can uniquely solve for y_1, y_2, …, y_n in turn. Hence the Sylvester equation has a unique solution if r_{ii} \ne s_{jj} for all i and j, that is, if A and B have no eigenvalue in common.
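The column-by-column solution process can be sketched in a few lines. The code below is an illustrative Python sketch using SciPy's `schur` and `solve_triangular`, not a production solver. The function name `solve_sylvester_schur` and the test matrices are hypothetical. It uses the recurrence obtained from column j of RY - YS = D, namely (R - s_{jj}I) y_j = d_j + \sum_{k<j} s_{kj} y_k.

```python
import numpy as np
from scipy.linalg import schur, solve_triangular


def solve_sylvester_schur(A, B, C):
    """Solve AX - XB = C via complex Schur forms (illustrative sketch)."""
    R, U = schur(A, output="complex")   # A = U R U^*
    S, V = schur(B, output="complex")   # B = V S V^*
    D = U.conj().T @ C @ V              # transformed right-hand side
    m, n = D.shape
    Y = np.zeros((m, n), dtype=complex)
    I = np.eye(m)
    for j in range(n):
        # Column j of RY - YS = D, with the already computed columns moved right.
        rhs = D[:, j] + Y[:, :j] @ S[:j, j]
        # R - s_jj I is upper triangular, so back substitution suffices.
        Y[:, j] = solve_triangular(R - S[j, j] * I, rhs)
    return U @ Y @ V.conj().T           # undo the transformation: X = U Y V^*


rng = np.random.default_rng(0)
# Shifts of +5 and -5 keep the spectra of A and B well apart.
A = rng.standard_normal((4, 4)) + 5 * np.eye(4)
B = rng.standard_normal((3, 3)) - 5 * np.eye(3)
C = rng.standard_normal((4, 3))
X = solve_sylvester_schur(A, B, C)
residual = np.linalg.norm(A @ X - X @ B - C) / np.linalg.norm(C)
print(residual)
```

Because the complex Schur form is used, the computed X is returned as a complex array even for real data; its imaginary part should be at the level of rounding error.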

Since the Sylvester equation is linear it must be possible to express it in the standard form “Ax = b”. This can be done by applying the vec operator, which yields

(I_n \otimes A - B^T \otimes I_m) \mathrm{vec}(X) = \mathrm{vec}(C), \qquad (*)

where \otimes is the Kronecker product. Using the Schur transformations above it is easy to show that the eigenvalues of the coefficient matrix are given in terms of those of A and B by

\lambda_{ij} (I_n\otimes A - B^T\otimes I_m) = \lambda_i(A) - \lambda_j(B),   \quad i=1\colon m,  \quad j=1\colon n,

so the coefficient matrix is nonsingular when \lambda_i(A) \ne \lambda_j(B) for all i and j.
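As a small numerical illustration, we can form the Kronecker coefficient matrix explicitly, solve the linear system, and compare its eigenvalues with the differences \lambda_i(A) - \lambda_j(B). This is a sketch in Python with NumPy on small hypothetical test matrices; forming the mn\times mn matrix is only sensible for small m and n.

```python
import numpy as np

# Hypothetical test data with m = 3, n = 2; A and B share no eigenvalue.
A = np.array([[1.0, 2.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 4.0]])
B = np.array([[-1.0, 1.0], [0.0, -2.0]])
C = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
m, n = C.shape

# Coefficient matrix of (*). vec stacks columns, hence order="F" below.
P = np.kron(np.eye(n), A) - np.kron(B.T, np.eye(m))
x = np.linalg.solve(P, C.flatten(order="F"))
X = x.reshape((m, n), order="F")
residual = np.linalg.norm(A @ X - X @ B - C)

# Every difference lambda_i(A) - lambda_j(B) should be an eigenvalue of P.
eig_P = np.linalg.eigvals(P)
diffs = (np.linalg.eigvals(A)[:, None] - np.linalg.eigvals(B)[None, :]).ravel()
max_gap = max(np.min(np.abs(eig_P - d)) for d in diffs)
print(residual, max_gap)
```

Note that the transpose in B^T is a plain transpose, not a conjugate transpose, which matters when B is complex.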

By considering the derivative of Z(t) = \mathrm{e}^{At}C\mathrm{e}^{-Bt}, it can be shown that if the eigenvalues of A and -B have negative real parts (that is, A and -B are stable matrices) then

X = -\displaystyle\int_0^{\infty} \mathrm{e}^{At} C \mathrm{e}^{-Bt} \, \mathrm{d}t

is the unique solution of AX - XB = C.
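The steps of this argument can be written out as follows. This is a sketch of the standard derivation, using only that \mathrm{e}^{-Bt} commutes with B and that Z(t) \to 0 as t\to\infty under the stated stability assumption.

```latex
\begin{aligned}
Z'(t) &= A\mathrm{e}^{At}C\mathrm{e}^{-Bt} - \mathrm{e}^{At}C\mathrm{e}^{-Bt}B
       = AZ(t) - Z(t)B, \\
\int_0^{\infty} Z'(t)\,\mathrm{d}t &= \lim_{t\to\infty} Z(t) - Z(0) = -C, \\
-C &= A\Bigl(\int_0^{\infty} Z(t)\,\mathrm{d}t\Bigr)
      - \Bigl(\int_0^{\infty} Z(t)\,\mathrm{d}t\Bigr)B
    = -(AX - XB).
\end{aligned}
```

Uniqueness follows because the eigenvalues of A have negative real parts and those of B have positive real parts, so A and B have no eigenvalue in common.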

Applications

An important application of the Sylvester equation is in block diagonalization. Consider the block upper triangular matrix

T = \begin{bmatrix}       A & C\\       0 & B    \end{bmatrix}.

If we can find a nonsingular matrix Z such that Z^{-1}TZ = \mathrm{diag}(A,B) then certain computations with T become much easier. For example, for any function f,

f(T) = f(Z \mathrm{diag}(A,B) Z^{-1})          = Zf(\mathrm{diag}(A,B)) Z^{-1}          = Z\mathrm{diag}(f(A),f(B)) Z^{-1},

so computing f(T) reduces to computing f(A) and f(B). Setting

Z = \begin{bmatrix}       I & -X\\       0 & I    \end{bmatrix}

and noting that Z^{-1} is just Z with the sign of the (1,2) block reversed, we find that

Z^{-1} TZ = \begin{bmatrix}       A & -AX + XB + C\\       0 & B    \end{bmatrix}.

Hence Z block diagonalizes T if X satisfies the Sylvester equation AX - XB = C, which we know has a solution if A and B have no eigenvalue in common. This restriction is unsurprising, as without it we could use this construction to diagonalize a 2\times 2 Jordan block, which of course is impossible.
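The block diagonalization can be verified numerically. The sketch below uses Python with SciPy's `solve_sylvester`, which solves AX + XB = Q, so we pass -B to obtain a solution of AX - XB = C. The test matrices are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_sylvester

# Hypothetical data: eigenvalues of A are {1, 3}, those of B are {-1, -2, -4}.
A = np.array([[1.0, 2.0], [0.0, 3.0]])
B = np.array([[-1.0, 1.0, 0.0], [0.0, -2.0, 1.0], [0.0, 0.0, -4.0]])
C = np.arange(1.0, 7.0).reshape(2, 3)
m, n = C.shape

# SciPy's convention is AX + XB = Q, so pass -B to solve AX - XB = C.
X = solve_sylvester(A, -B, C)

T = np.block([[A, C], [np.zeros((n, m)), B]])
Z = np.block([[np.eye(m), -X], [np.zeros((n, m)), np.eye(n)]])
# Z^{-1} is Z with the sign of the (1,2) block reversed.
Zinv = np.block([[np.eye(m), X], [np.zeros((n, m)), np.eye(n)]])

D = Zinv @ T @ Z
off_diag = np.linalg.norm(D[:m, m:])   # norm of -AX + XB + C, should be tiny
print(off_diag)
```

If A and B had a common eigenvalue, the solver would be faced with a singular problem and no such X need exist.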

For another way in which Sylvester equations arise, consider the expansion (X+E)^2 = X^2 + XE + EX + E^2 for square matrices X and E, from which it follows that XE + EX is the Fréchet derivative of the function x^2 at X in the direction E, written L_{x^2}(X,E). Consequently, Newton’s method for the square root requires the solution of Sylvester equations, though in practice certain simplifications can be made to avoid their appearance. We can find the Fréchet derivative of x^{1/2} by applying the chain rule to \bigl(x^{1/2}\bigr)^2 = x, which gives L_{x^2}\left(X^{1/2}, L_{x^{1/2}}(X,E)\right) = E. Therefore Z = L_{x^{1/2}}(X,E) is the solution to the Sylvester equation X^{1/2} Z + Z X^{1/2}  = E. Consequently, the Sylvester equation plays a role in the perturbation theory for matrix square roots.
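The characterization of the Fréchet derivative of the square root can be checked against a finite difference. The following Python sketch uses SciPy's `sqrtm` and `solve_sylvester`; the matrices X and E and the step h are hypothetical choices, and the finite difference is only accurate to roughly the size of h.

```python
import numpy as np
from scipy.linalg import solve_sylvester, sqrtm

X = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite test matrix
E = np.array([[1.0, 2.0], [0.0, 1.0]])   # arbitrary direction

S = sqrtm(X)                              # principal square root X^{1/2}
# Solve S Z + Z S = E; this matches SciPy's convention AX + XB = Q directly.
Z = solve_sylvester(S, S, E)

# Forward finite-difference approximation to the Frechet derivative.
h = 1e-6
fd = (sqrtm(X + h * E) - S) / h
err = np.linalg.norm(Z - fd) / np.linalg.norm(Z)
print(err)
```

The two approximations should agree to within the truncation error of the finite difference.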

Sylvester equations also arise in the Schur–Parlett algorithm for computing matrix functions, which reduces a matrix to triangular Schur form T and then solves TF-FT = 0 for F = f(T), blockwise, by a recurrence.

Solution Methods

How can we solve the Sylvester equation? One possibility is to solve (*) by LU factorization with partial pivoting. However, the coefficient matrix is mn\times mn and LU factorization cannot exploit the Kronecker product structure, so this approach is prohibitively expensive unless m and n are small. It is more efficient to compute Schur decompositions of A and B, transform the problem, and solve a sequence of triangular systems, as described above in our derivation of the conditions for the existence of a unique solution. This method was developed by Bartels and Stewart in 1972 and it is implemented in the MATLAB function sylvester.

In recent years research has focused particularly on solving Sylvester equations in which A and B are large and sparse and C has low rank, which arise in applications in control theory and model reduction, for example. In this case it is usually possible to find good low-rank approximations to X, and iterative methods based on Krylov subspaces have been very successful.

Sensitivity and the Separation

Define the separation of A and B by

\mathrm{sep}(A,B) =       \displaystyle\min_{Z\ne0} \displaystyle\frac{ \|AZ-ZB\|_F }{ \|Z\|_F }.

The separation is positive if A and B have no eigenvalue in common, which we now assume. If X is the solution to AX - XB = C then

\notag   \mathrm{sep}(A,B) \le \displaystyle\frac{ \|AX-XB\|_F }{ \|X\|_F }                      = \frac{\|C\|_F}{\|X\|_F},

so X is bounded by

\notag   \|X\|_F \le \displaystyle\frac{\|C\|_F}{\mathrm{sep}(A,B)}.

It is not hard to show that \mathrm{sep}(A,B)^{-1} = \|P^{-1}\|_2, where P is the matrix in (*). This bound on \|X\|_F is a generalization of \|x\|_2 \le \|A^{-1}\|_2 \|b\|_2 for Ax = b.
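Since \|AZ - ZB\|_F = \|P\mathrm{vec}(Z)\|_2, the separation is the smallest singular value of P, which gives a direct (if expensive) way to compute it for small matrices. The sketch below, in Python with NumPy and SciPy, defines a hypothetical helper `sep` along these lines and checks the bound on \|X\|_F for some hypothetical test data.

```python
import numpy as np
from scipy.linalg import solve_sylvester


def sep(A, B):
    """Separation of A and B: the smallest singular value of the Kronecker matrix P."""
    m, n = A.shape[0], B.shape[0]
    P = np.kron(np.eye(n), A) - np.kron(B.T, np.eye(m))
    # Singular values are returned in descending order.
    return np.linalg.svd(P, compute_uv=False)[-1]


A = np.array([[1.0, 2.0], [0.0, 3.0]])    # eigenvalues 1 and 3
B = np.array([[-1.0, 1.0], [0.0, -2.0]])  # eigenvalues -1 and -2
C = np.array([[1.0, 0.0], [2.0, 1.0]])

X = solve_sylvester(A, -B, C)   # SciPy solves AX + XB = Q, hence -B
s = sep(A, B)
lhs = np.linalg.norm(X, "fro")
rhs = np.linalg.norm(C, "fro") / s
print(s, lhs, rhs)
```

This approach costs O(m^3n^3) flops, so it is suitable only for experiments with small matrices.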

The separation features in a perturbation bound for the Sylvester equation. If

\notag     (A+\Delta A)(X+\Delta X) - (X+\Delta X)(B+\Delta B) = C+\Delta C,

then

\notag      \displaystyle\frac{ \|\Delta X\|_F }{ \|X\|_F }      \le 2\sqrt{3}\, \mathrm{sep}(A,B)^{-1} (\|A\|_F + \|B\|_F) \epsilon           + O(\epsilon^2),

where

\notag    \epsilon = \max \left\{ \displaystyle\frac{\|\Delta A\|_F}{\|A\|_F},                            \frac{\|\Delta B\|_F}{\|B\|_F},                            \frac{\|\Delta C\|_F}{\|C\|_F} \right\}.

While we have the upper bound \mathrm{sep}(A,B) \le \min_{i,j} |\lambda_i(A) - \lambda_j(B)|, this inequality can be extremely weak for nonnormal matrices, so two matrices can have a small separation even if their eigenvalues are well separated. To illustrate, let T(\alpha) denote the n\times n upper triangular matrix with \alpha on the diagonal and -1 in all entries above the diagonal. The following table shows the values of \mathrm{sep}(T(1),T(1.1)) for several values of n.

sep_table.jpg

Even though the eigenvalues of A = T(1) and B = T(1.1) are 0.1 apart, the separation is at the level of the unit roundoff for n as small as 8.
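An experiment of this kind can be reproduced with a few lines of code. The following Python sketch computes \mathrm{sep}(T(1),T(1.1)) as the smallest singular value of the Kronecker matrix in (*). The helper names `T` and `sep` and the choice of values of n are hypothetical, and for larger n the computed values are dominated by rounding errors, so only their order of magnitude is meaningful.

```python
import numpy as np


def T(alpha, n):
    """Upper triangular matrix with alpha on the diagonal and -1 above it."""
    return alpha * np.eye(n) - np.triu(np.ones((n, n)), k=1)


def sep(A, B):
    """Smallest singular value of I (x) A - B^T (x) I."""
    m, n = A.shape[0], B.shape[0]
    P = np.kron(np.eye(n), A) - np.kron(B.T, np.eye(m))
    return np.linalg.svd(P, compute_uv=False)[-1]


seps = {n: sep(T(1.0, n), T(1.1, n)) for n in (1, 2, 4, 6, 8)}
for n, s in seps.items():
    print(f"n = {n}: sep = {s:.2e}")
```

For n = 1 the separation equals the eigenvalue gap 0.1, and it decays rapidly as n increases.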

The sep function was originally introduced by Stewart in the 1970s as a tool for studying the sensitivity of invariant subspaces.

Variations and Generalizations

The Sylvester equation has many variations and special cases, including the Lyapunov equation AX + XA^* = C, the discrete Sylvester equation X + AXB = C, and versions of all these for operators. It has also been generalized to multiple terms and to have coefficient matrices on both sides of X, yielding

\displaystyle\sum_{i=1}^k A_i X B_i = C.

For k\le 2 and m=n this equation can be solved in O(n^3) flops. For k > 2, no O(n^3) flops algorithm is known and deriving efficient numerical methods remains an open problem. The equation arises in stochastic finite element discretizations of partial differential equations with random inputs, where the matrices A_i and B_i are large and sparse and, depending on the statistical properties of the random inputs, k can be arbitrarily large.
