What Is a QR Factorization?

A QR factorization of a rectangular matrix $A\in\mathbb{R}^{m\times n}$ with $m\ge n$ is a factorization $A = QR$ with $Q\in\mathbb{R}^{m\times m}$ orthogonal and $R\in\mathbb{R}^{m\times n}$ upper trapezoidal. The $R$ factor has the form $R = \left[\begin{smallmatrix}R_1\\ 0\end{smallmatrix}\right]$ , where $R_1$ is $n\times n$ and upper triangular. Partitioning $Q$ conformably with $R$ we have

$\notag A = QR = \begin{array}[b]{@{\mskip-20mu}c@{\mskip0mu}c@{\mskip-1mu}c@{}} & \mskip10mu\scriptstyle n & \scriptstyle m-n \\ \mskip15mu \begin{array}{r} \scriptstyle m \end{array}~ & \multicolumn{2}{c}{\mskip-15mu \left[\begin{array}{c@{~}c@{~}} Q_1 & Q_2 \end{array}\right] } \end{array} \mskip-10mu \begin{array}[b]{@{\mskip-25mu}c@{\mskip-20mu}c@{}} \scriptstyle n \\ \multicolumn{1}{c}{ \left[\begin{array}{@{}c@{}} R_1\\ 0 \end{array}\right]} & \mskip-12mu\ \begin{array}{l} \scriptstyle n \\ \scriptstyle m-n \end{array} \end{array} = Q_1 R_1.$

There are therefore two forms of QR factorization:

$A = QR$ is the full QR factorization,
$A = Q_1R_1$ is the reduced (also called economy-sized, or thin) QR factorization.

To prove the existence of a QR factorization note that if $A$ has full rank then $A^T\!A$ is symmetric positive definite. Since $A = Q_1R_1$ implies $A^T\!A = R_1^TQ_1^TQ_1R_1 = R_1^TR_1$ , we can take $R_1$ to be the Cholesky factor of $A^T\!A$ and then define $Q_1 = AR_1^{-1}$ . The resulting $Q_1$ has orthonormal columns because

$\notag Q_1^TQ_1 = R_1^{-T} A^T A R_1^{-1} = R_1^{-T} R_1^T R_1 R_1^{-1} = I.$

Therefore when $A$ has full rank there is a unique reduced QR factorization if we require $R_1$ to have positive diagonal elements. (Without this requirement we can multiply the $i$ th column of $Q$ and the $i$ th row of $R$ by $-1$ and obtain another QR factorization.)

When $A$ has full rank the columns of $Q_1$ span the column space (or range) of $A$ . Indeed $Ax = Q_1R_1x = Q_1(R_1x)$ implies $\mathrm{range}(A) \subseteq \mathrm{range}(Q_1)$ while $Q_1x = Q_1R_1\cdot R_1^{-1}x =: Ay$ implies $\mathrm{range}(Q_1) \subseteq \mathrm{range}(A)$ , so $\mathrm{range}(Q_1) = \mathrm{range}(A)$ . Furthermore, $Q^TA = R$ gives $Q_2^TA = 0$ , so the columns of $Q_2$ span the null space of $A^T$ .

The QR factorization provides a way of orthonormalizing the columns of a matrix. An alternative is provided by the polar decomposition $A = UH$ , where $U$ has orthonormal columns and $H$ is positive semidefinite. The orthogonal polar factor $U$ is the closest matrix with orthonormal columns to $A$ in any unitarily invariant norm, but it is more expensive to compute than the $Q$ factor.

There are three standard ways of computing a QR factorization.

Gram–Schmidt orthogonalization computes the reduced factorization. It has the disadvantage that in floating-point arithmetic the computed $\widehat{Q}$ is not guaranteed to be orthonormal to the working precision. The modified Gram–Schmidt method (a variation of the classical method) is better behaved numerically in that $\|\widehat{Q}^T\widehat{Q} - I\|_F \le c_1(m,n)\kappa_2(A)u$ for some constant $c_1(m,n)$ , where $u$ is the unit roundoff, so the loss of orthogonality is bounded.
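The following is a minimal sketch of modified Gram–Schmidt in plain Python, assuming a small dense matrix stored as a list of rows and full column rank. The function name `mgs_qr` and the test matrix are illustrative choices, not part of any library.

```python
import math

def mgs_qr(A):
    """Reduced QR factorization A = Q1 R1 by modified Gram-Schmidt.

    A is an m-by-n list of rows with m >= n, assumed to have full column rank.
    """
    m, n = len(A), len(A[0])
    # V[j] holds column j of A, overwritten by q_j as the loop proceeds.
    V = [[float(A[i][j]) for i in range(m)] for j in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for k in range(n):
        R[k][k] = math.sqrt(sum(x * x for x in V[k]))
        V[k] = [x / R[k][k] for x in V[k]]
        # Modified variant: remove the q_k component from all later columns now,
        # using the partially orthogonalized columns rather than the original ones.
        for j in range(k + 1, n):
            R[k][j] = sum(V[k][i] * V[j][i] for i in range(m))
            V[j] = [V[j][i] - R[k][j] * V[k][i] for i in range(m)]
    Q1 = [[V[j][i] for j in range(n)] for i in range(m)]
    return Q1, R

A = [[3.0, 1.0], [4.0, 2.0], [0.0, 2.0]]   # illustrative 3-by-2 matrix
Q1, R1 = mgs_qr(A)
# Departure from orthonormality: largest entry of |Q1^T Q1 - I|.
loss = max(abs(sum(Q1[i][a] * Q1[i][b] for i in range(3)) - (a == b))
           for a in range(2) for b in range(2))
print(R1, loss)
```

For this well conditioned example the loss of orthogonality is at the level of the unit roundoff; the bound above indicates that it can grow in proportion to $\kappa_2(A)$ .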

Householder QR factorization and Givens QR factorization both construct $Q^T$ as a product of orthogonal matrices that are chosen to reduce $A$ to upper trapezoidal form. In both methods, at the start of the $k$ th stage we have

$\notag \qquad\qquad\qquad\qquad A^{(k)} = Q_{k-1}^T A = \begin{array}[b]{@{\mskip35mu}c@{\mskip20mu}c@{\mskip-5mu}c@{}c} \scriptstyle k-1 & \scriptstyle 1 & \scriptstyle n-k & \\ \multicolumn{3}{c}{ \left[\begin{array}{c@{\mskip10mu}cc} R_{k-1} & y_k & B_k \\ 0 & z_k & C_k \end{array}\right]} & \mskip-12mu \begin{array}{c} \scriptstyle k-1 \\ \scriptstyle m-k+1 \end{array} \end{array}, \qquad\qquad\qquad\qquad (*)$

where $R_{k-1}$ is upper triangular and $Q_{k-1}$ is a product of Householder transformations or Givens rotations. Working on $A^{(k)}(k:m,k:n)$ we now apply a Householder transformation or $m-k$ Givens rotations in order to zero out the last $m-k$ elements of $z_k$ and thereby take the matrix one step closer to upper trapezoidal form.
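The Householder variant of this process can be sketched in plain Python as follows. This is an unblocked illustration that forms $Q$ explicitly, under the assumption of a small dense matrix stored as a list of rows; production codes keep $Q$ in factored form. The name `householder_qr` is hypothetical.

```python
import math

def householder_qr(A):
    """Full QR factorization A = Q R by Householder reflections (m >= n)."""
    m, n = len(A), len(A[0])
    R = [[float(x) for x in row] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]   # accumulates Q
    for k in range(min(n, m - 1)):
        z = [R[i][k] for i in range(k, m)]           # the vector z_k in (*)
        norm_z = math.sqrt(sum(x * x for x in z))
        if norm_z == 0.0:
            continue                                  # nothing to eliminate
        v = z[:]
        v[0] += math.copysign(norm_z, z[0])           # sign chosen to avoid cancellation
        vtv = sum(x * x for x in v)
        # Apply P = I - 2 v v^T / (v^T v) to rows k..m-1 of R.
        for j in range(k, n):
            s = 2.0 * sum(v[i] * R[k + i][j] for i in range(m - k)) / vtv
            for i in range(m - k):
                R[k + i][j] -= s * v[i]
        # Accumulate Q <- Q P, which touches columns k..m-1 of Q.
        for i in range(m):
            s = 2.0 * sum(Q[i][k + l] * v[l] for l in range(m - k)) / vtv
            for l in range(m - k):
                Q[i][k + l] -= s * v[l]
        for i in range(k + 1, m):
            R[i][k] = 0.0       # replace rounding-level residue by exact zeros
    return Q, R

A = [[3.0, 1.0], [4.0, 2.0], [0.0, 2.0]]   # illustrative 3-by-2 matrix
Q, R = householder_qr(A)
print(R)
```

With this sign choice the diagonal of $R$ may be negative; flipping the signs of the corresponding columns of $Q$ and rows of $R$ gives the normalization with positive diagonal.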

Householder QR factorization is the method of choice for general matrices, but Givens QR factorization is preferred for structured matrices with a lot of zeros, such as upper Hessenberg matrices and tridiagonal matrices.

Both these methods produce $Q$ in factored form and if the product is explicitly formed they yield a computed $\widehat{Q}$ that is orthogonal to the working precision, that is, $\|\widehat{Q}^T\widehat{Q} - I\|_F \le c_2(m,n)u$ , for some constant $c_2$ .

Modified Gram–Schmidt, Householder QR, and Givens QR all have the property that there exists an exactly orthogonal $Q$ such that the computed $\widehat{R}$ satisfies

$\notag A + \Delta A = Q \widehat{R}, \quad \|\Delta A\|_F \le c_3(m,n)u \|A\|_F,$

for some constant $c_3$ .

Another way of computing a QR factorization is by the technique in the existence proof above, via Cholesky factorization of $A^T\!A$ . This is known as the Cholesky QR algorithm and it has favorable computational cost when $m \gg n$ . In its basic form, this method is not recommended unless $A$ is extremely well conditioned, because the computed $\widehat{Q}$ is far from orthonormal for ill-conditioned matrices. The method can be made competitive with the others either by using extra precision or by iterating the process.
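A minimal sketch of the basic Cholesky QR algorithm in plain Python follows, assuming a small dense matrix of full column rank stored as a list of rows. It is meant only to make the three steps concrete (form $A^T\!A$ , factorize, solve a triangular system); the name `cholesky_qr` is illustrative.

```python
import math

def cholesky_qr(A):
    """Reduced QR via Cholesky factorization of A^T A (full column rank assumed)."""
    m, n = len(A), len(A[0])
    # G = A^T A is n-by-n and symmetric positive definite.
    G = [[sum(A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    # Cholesky factorization G = R^T R with R upper triangular, positive diagonal.
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        R[i][i] = math.sqrt(G[i][i] - sum(R[k][i] ** 2 for k in range(i)))
        for j in range(i + 1, n):
            R[i][j] = (G[i][j] - sum(R[k][i] * R[k][j] for k in range(i))) / R[i][i]
    # Q1 = A R^{-1}: solve q R = a for each row a of A by forward substitution.
    Q1 = [[0.0] * n for _ in range(m)]
    for r in range(m):
        for j in range(n):
            Q1[r][j] = (A[r][j] - sum(Q1[r][k] * R[k][j] for k in range(j))) / R[j][j]
    return Q1, R

A = [[3.0, 1.0], [4.0, 2.0], [0.0, 2.0]]   # illustrative 3-by-2 matrix
Q1, R1 = cholesky_qr(A)
print(R1)
```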

Column Pivoting and Rank-Revealing QR Factorization

In practice, we often want to compute a basis for the range of $A$ when $A$ is rank deficient. The basic QR factorization may not do so. Householder QR factorization with column pivoting reveals rank deficiency by incorporating column interchanges. At the $k$ th stage, before applying a Householder transformation to $(*)$ , the column of largest $2$ -norm of $C_k$ , the $j$ th say, is determined, and if its norm exceeds that of $z_k$ then the $k$ th and $(k+j)$ th columns of $A^{(k)}$ are interchanged. The result is a factorization $A\Pi = QR$ , where $\Pi$ is a permutation matrix and $R$ satisfies the inequalities

$\notag |r_{kk}|^2 \ge \displaystyle\sum_{i=k}^j |r_{ij}|^2, \quad j=k+1\colon n, \quad k=1\colon n.$

In particular,

$\notag \qquad\qquad\qquad\qquad\qquad\qquad |r_{11}| \ge |r_{22}| \ge \cdots \ge |r_{nn}|. \qquad\qquad\qquad\qquad\qquad\qquad (\dagger)$

If $A$ is rank deficient then $R$ has the form

$\notag R = \begin{bmatrix}R_{11} & R_{12} \\ 0 & 0 \end{bmatrix},$

with $R_{11}$ nonsingular, and the rank of $A$ is the dimension of $R_{11}$ .
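To illustrate the pivoting strategy, here is a sketch in plain Python. For brevity it uses modified Gram–Schmidt in place of Householder transformations, which is a substitution made for this example only; the pivot rule (bring the remaining column of largest $2$ -norm to the front) is the same. The function name, the threshold `rtol`, and the test matrix are illustrative assumptions.

```python
import math

def pivoted_mgs_qr(A, rtol=1e-10):
    """Sketch of QR with column pivoting: A[:, perm] ~= Q R for nonzero A.

    Q is returned as a list of orthonormal columns and R as a list of rows.
    rtol is an illustrative threshold relative to r_11, not a recommended default.
    """
    m, n = len(A), len(A[0])
    V = [[float(A[i][j]) for i in range(m)] for j in range(n)]   # columns of A
    perm = list(range(n))
    Q, R = [], []
    for k in range(n):
        # Pivot: bring the remaining column of largest 2-norm into position k.
        norms = [math.sqrt(sum(x * x for x in V[j])) for j in range(k, n)]
        rkk = max(norms)
        p = k + norms.index(rkk)
        V[k], V[p] = V[p], V[k]
        perm[k], perm[p] = perm[p], perm[k]
        for row in R:
            row[k], row[p] = row[p], row[k]
        if k > 0 and rkk <= rtol * R[0][0]:
            break                  # remaining columns negligible: numerical rank is k
        q = [x / rkk for x in V[k]]
        row = [0.0] * n
        row[k] = rkk
        for j in range(k + 1, n):
            row[j] = sum(q[i] * V[j][i] for i in range(m))
            V[j] = [V[j][i] - row[j] * q[i] for i in range(m)]
        Q.append(q)
        R.append(row)
    return Q, R, perm, len(Q)

# Rank-2 example: the columns are in arithmetic progression.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
Q, R, perm, rank = pivoted_mgs_qr(A)
print(rank, perm, [R[k][k] for k in range(rank)])
```

The loop stops as soon as every remaining column is negligible relative to $r_{11}$ , which corresponds to the zero trailing block in the form of $R$ above.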

Near rank deficiency of $A$ tends to be revealed by a small trailing diagonal block of $R$ , but this is not guaranteed. Indeed for the Kahan matrix

$\notag U_n(\theta) = \mathrm{diag}(1,s,\dots,s^{n-1}) \begin{bmatrix} 1 & -c & -c & \dots & -c \\ & 1 & -c & \dots & -c \\ & & \ddots &\ddots & \vdots \\ & & &\ddots & -c \\ & & & & 1 \end{bmatrix}$

where $c =\cos\theta$ and $s = \sin\theta$ , $u_{nn}$ is of order $2^n$ times larger than the smallest singular value for small $\theta$ and $U_n(\theta)$ is invariant under QR factorization with column pivoting.

In practice, column pivoting reduces the efficiency of Householder QR factorization because it limits the amount of the computation that can be expressed in terms of matrix multiplication. This has motivated the development of methods that select the pivot columns using randomized projections. These methods gain speed and produce factorizations of similar rank-revealing quality, though they give up the inequalities $(\dagger)$ .

It is known that at least one permutation matrix $\Pi$ exists such that the QR factorization of $A\Pi$ is rank-revealing. Computing such a permutation is impractical, but heuristic algorithms for producing an approximate rank-revealing factorization are available.

References

This is a minimal set of references, which contain further useful references within.

Shivkumar Chandrasekaran and Ilse Ipsen, On Rank-Revealing Factorisations, SIAM J. Matrix Anal. Appl. 15(2), 592–622, 1994.
Per-Gunnar Martinsson and Joel A. Tropp, Randomized Numerical Linear Algebra: Foundations and Algorithms, Acta Numerica, 2020, to appear.
Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms, second edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002.
Takeshi Fukaya, Ramaseshan Kannan, Yuji Nakatsukasa, Yusaku Yamamoto and Yuka Yanagisawa, Shifted Cholesky QR for Computing the QR Factorization of Ill-Conditioned Matrices, SIAM J. Sci. Comput. 42(1), A477–A503, 2020.

What Is the Cayley–Hamilton Theorem?

The Cayley–Hamilton Theorem says that a square matrix $A$ satisfies its characteristic equation, that is $p(A) = 0$ where $p(t) = \det(tI-A)$ is the characteristic polynomial. This statement is not simply the substitution “ $p(A) = \det(A - A) = 0$ ”, which is not valid since $t$ must remain a scalar inside the $\det$ term. Rather, for an $n\times n$ $A$ , the characteristic polynomial has the form

$\notag p(t) = t^n + a_{n-1}t^{n-1} + \cdots + a_1 t + a_0$

and the Cayley–Hamilton theorem says that

$\notag p(A) = A^n + a_{n-1}A^{n-1} + \cdots + a_1 A + a_0 I = 0.$
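As a quick sanity check, consider the $2\times 2$ matrix below (chosen here purely for illustration), for which $p(t) = t^2 - \mathrm{trace}(A)t + \det(A) = t^2 - 5t - 2$ :

$\notag A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad p(A) = A^2 - 5A - 2I = \begin{bmatrix} 7 & 10 \\ 15 & 22 \end{bmatrix} - \begin{bmatrix} 5 & 10 \\ 15 & 20 \end{bmatrix} - \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} = 0.$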

Various proofs of the theorem are available, of which we give two. The first is the most natural for anyone familiar with the Jordan canonical form. The second is more elementary but less obvious.

First Proof

Consider a $4\times 4$ Jordan block with eigenvalue $\lambda$ :

$\notag J = \begin{bmatrix} \lambda & 1 & 0 & 0 \\ & \lambda & 1 & 0\\ & & \lambda & 1\\ & & & \lambda \end{bmatrix}.$

We have

$\notag J - \lambda I = \begin{bmatrix} 0 & 1 & 0 & 0 \\ & 0 & 1 & 0\\ & & 0 & 1\\ & & & 0 \end{bmatrix}, \quad (J - \lambda I)^2 = \begin{bmatrix} 0 & 0 & 1 & 0 \\ & 0 & 0 & 1 \\ & & 0 & 0 \\ & & & 0 \end{bmatrix}, \quad (J - \lambda I)^3 = \begin{bmatrix} 0 & 0 & 0 & 1 \\ & 0 & 0 & 0 \\ & & 0 & 0 \\ & & & 0 \end{bmatrix},$

and then $(J - \lambda I)^4 = 0$ . In general, for an $n \times n$ Jordan block $J$ with eigenvalue $\lambda$ , $(J - \lambda I)^k$ is zero apart from a $k$ th superdiagonal of ones for $k\le n-1$ , and $(J - \lambda I)^n = 0$ .

Let $A$ have the Jordan canonical form $A = Z JZ^{-1}$ , where $J = \mathrm{diag}(J_1, \dots, J_k)$ and each $J_i$ is an $m_i\times m_i$ Jordan block with eigenvalue $\lambda_i$ . The characteristic polynomial of $A$ can be factorized as $p(t) = (t-\lambda_1)^{m_1}(t-\lambda_2)^{m_2}\dots(t-\lambda_k)^{m_k}$ . Note that $A^i = Z J^i Z^{-1}$ for all $i$ , and hence $q(A) = Z q(J) Z^{-1}$ for any polynomial $q$ . Then

$\notag Z^{-1}p(A)Z = p(J) = \mathrm{diag}\bigl( p(J_1), p(J_2), \dots, p(J_k) \bigr),$

and $p(J_i)$ is zero because it contains a factor $(J_i - \lambda_i I)^{m_i}$ and this factor is zero, as noted above. Hence $Z^{-1}p(A)Z = 0$ and therefore $p(A) = 0$ .

Second Proof

Recall that the adjugate $\mathrm{adj}(B)$ of an $n\times n$ matrix $B$ is the transposed matrix of cofactors, where a cofactor is a signed sum of products of $n-1$ entries of $B$ , and that $B\mathrm{adj}(B) = \det(B)I$ . With $B = tI - A$ , each entry of $\mathrm{adj}(B)$ is a polynomial of degree at most $n-1$ in $t$ , so $B^{-1} = \mathrm{adj}(B)/\det(B)$ can be written

$\notag (tI - A)^{-1} = \displaystyle\frac{\mathrm{adj}(tI - A)}{\det(tI - A)} = \displaystyle\frac{t^{n-1}C_{n-1} + \cdots + tC_1 + C_0}{p(t)},$

for some matrices $C_{n-1}$ , $\dots$ , $C_0$ not depending on $t$ . Rearranging, we obtain

$\notag (tI - A) \bigl(t^{n-1}C_{n-1} + \cdots + tC_1 + C_0 \bigr) = p(t)I,$

and equating coefficients of $t^n$ , …, $t^0$ gives

$\notag \begin{aligned} C_{n-1} &= I,\\ C_{n-2} - A C_{n-1} &= a_{n-1}I,\\ & \vdots\\ C_0 - A C_1 &= a_1 I,\\ - A C_0 &= a_0I. \end{aligned}$

Premultiplying the first equation by $A^n$ , the second by $A^{n-1}$ , and so on, and adding, gives

$\notag 0 = A^n + a_{n-1}A^{n-1} + \cdots + a_1 A + a_0 I = p(A),$

as required. This proof is by Buchheim (1884).

Applications and Generalizations

A common use of the Cayley–Hamilton theorem is to show that $A^{-1}$ is expressible as a linear combination of $I$ , $A$ , …, $A^{n-1}$ . Indeed for a nonsingular $A$ , $p(A) = 0$ implies that

$\notag A^{-1} = -\displaystyle\frac{1}{a_0} \bigl( A^{n-1} + a_{n-1}A^{n-2} + \cdots + a_1 I \bigr),$

since $a_0 = p(0) = \det(-A) = (-1)^n\det(A) \ne 0$ .
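The formula for $A^{-1}$ can be tried out in exact rational arithmetic. In the sketch below the coefficients $a_k$ are obtained by the Faddeev–LeVerrier recurrence, which is an assumption of this example rather than something used in the text; the recurrence is the same as the coefficient equations in the second proof, supplemented by $a_k = -\mathrm{trace}(AC_k)/(n-k)$ . The function names and the test matrix are illustrative.

```python
from fractions import Fraction

def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inverse_via_cayley_hamilton(A):
    """Return (a, A_inv), where a[k] is the coefficient a_k of p(t) and
    A_inv = -(A^{n-1} + a_{n-1} A^{n-2} + ... + a_1 I) / a_0.

    Uses the Faddeev-LeVerrier recurrence (an assumption of this sketch):
    C_{n-1} = I, a_k = -trace(A C_k)/(n-k), C_{k-1} = A C_k + a_k I.
    """
    n = len(A)
    A = [[Fraction(x) for x in row] for row in A]
    C = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    a = [None] * n
    for k in range(n - 1, -1, -1):
        AC = matmul(A, C)
        a[k] = -sum(AC[i][i] for i in range(n)) / (n - k)
        if k > 0:
            # Horner-like step: C_{k-1} = A C_k + a_k I.
            C = [[AC[i][j] + (a[k] if i == j else 0) for j in range(n)]
                 for i in range(n)]
    if a[0] == 0:
        raise ValueError("matrix is singular")
    # Now C = C_0 = A^{n-1} + a_{n-1} A^{n-2} + ... + a_1 I.
    return a, [[-C[i][j] / a[0] for j in range(n)] for i in range(n)]

A = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]      # illustrative nonsingular matrix
a, A_inv = inverse_via_cayley_hamilton(A)
print(a)                                   # coefficients a_0, a_1, a_2
print(matmul([[Fraction(x) for x in row] for row in A], A_inv))
```

Exact arithmetic is used deliberately here: in floating-point arithmetic this approach inherits the instability of computing the characteristic polynomial coefficients.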

Similarly, $A^k$ for any $k \ge n$ can be expressed as a linear combination of $I$ , $A$ , …, $A^{n-1}$ . An interesting implication is that any matrix power series is actually a polynomial in the matrix. Thus the matrix exponential $\mathrm{e}^A = I + A + A^2/2! + \cdots$ can be written $\mathrm{e}^A = c_{n-1} A^{n-1} + \cdots + c_1A + c_0I$ for some scalars $c_{n-1}$ , …, $c_0$ . However, the $c_i$ depend on $A$ , which reduces the usefulness of the polynomial representation. A rare example of an explicit expression of this form is Rodrigues’s formula for the exponential of a skew-symmetric matrix $A \in \mathbb{R}^{3\times 3}$ :

$\notag \mathrm{e}^A = I + \displaystyle\frac{\sin\theta}{\theta} A + \displaystyle\frac{1-\cos\theta}{\theta^2} A^2,$

where $\theta = \sqrt{\|A\|_F^2/2}$ .
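Rodrigues’s formula is easy to check numerically. The sketch below, in plain Python, compares it with a partial sum of the power series for one illustrative skew-symmetric matrix; it assumes $A \ne 0$ so that $\theta > 0$ , and the function names are hypothetical.

```python
import math

def matmul(X, Y):
    """Product of two 3-by-3 matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def expm_rodrigues(A):
    """exp(A) for a nonzero 3-by-3 skew-symmetric A via Rodrigues's formula."""
    theta = math.sqrt(sum(x * x for row in A for x in row) / 2)
    A2 = matmul(A, A)
    c1 = math.sin(theta) / theta
    c2 = (1 - math.cos(theta)) / theta ** 2
    return [[(i == j) + c1 * A[i][j] + c2 * A2[i][j] for j in range(3)]
            for i in range(3)]

def expm_taylor(A, terms=30):
    """Reference value: partial sum I + A + A^2/2! + ... of the power series."""
    E = [[float(i == j) for j in range(3)] for i in range(3)]
    T = [row[:] for row in E]              # current term A^k / k!
    for k in range(1, terms):
        T = [[x / k for x in row] for row in matmul(T, A)]
        E = [[E[i][j] + T[i][j] for j in range(3)] for i in range(3)]
    return E

A = [[0.0, -0.6, 0.6], [0.6, 0.0, -0.3], [-0.6, 0.3, 0.0]]   # theta = 0.9
diff = max(abs(x - y)
           for r, s in zip(expm_rodrigues(A), expm_taylor(A))
           for x, y in zip(r, s))
print(diff)
```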

Cayley used the Cayley–Hamilton theorem to find square roots of a $2\times 2$ matrix. If $X^2 = A$ then applying the theorem to $X$ gives $X^2 - \mathrm{trace}(X)X + \det(X)I = 0$ , or

$\notag \qquad\qquad\qquad\qquad A - \mathrm{trace}(X)X + \det(X)I = 0, \qquad\qquad\qquad\qquad (*)$

which gives

$X = \displaystyle\frac{A + \det(X)I}{\mathrm{trace}(X)}.$

Now $\det(X) = \sqrt{\det(A)}$ and taking the trace in $(*)$ gives an equation for $\mathrm{trace}(X)$ , leading to

$\notag X = \displaystyle\frac{ A + \sqrt{\det(A)} \,I} {\sqrt{\mathrm{trace}(A) + 2 \sqrt{\det(A)}}}.$

With appropriate choices of signs for the square roots this formula gives all four square roots of $A$ when $A$ has distinct eigenvalues, but otherwise the formula can break down.
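Here is a small sketch of the formula in plain Python, restricted to real arithmetic. It assumes the quantities under both square roots are positive for the chosen signs, and the function name, the sign arguments, and the test matrix are illustrative.

```python
import math

def sqrt_2x2(A, sign_det=1, sign_trace=1):
    """One square root of a real 2-by-2 matrix A via Cayley's formula.

    sign_det and sign_trace (+1 or -1) select the branches of the two square
    roots. Assumes det(A) > 0 and trace(A) + 2*det(X) > 0 (real arithmetic only).
    """
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    tr = A[0][0] + A[1][1]
    d = sign_det * math.sqrt(det)            # det(X)
    t = sign_trace * math.sqrt(tr + 2 * d)   # trace(X)
    return [[(A[0][0] + d) / t, A[0][1] / t],
            [A[1][0] / t, (A[1][1] + d) / t]]

A = [[2.0, 1.0], [1.0, 2.0]]     # eigenvalues 1 and 3, so four square roots
roots = [sqrt_2x2(A, sd, st) for sd in (1, -1) for st in (1, -1)]
for X in roots:
    X2 = [[sum(X[i][k] * X[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    print(X, X2)
```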

Expressions obtained from the Cayley–Hamilton theorem are of little practical use for general matrices, because algorithms that compute the coefficients $a_i$ of the characteristic polynomial are typically numerically unstable.

The Cayley–Hamilton theorem has been generalized in various directions. The theorem can be interpreted as saying that the powers $A^i$ for all nonnegative $i$ generate a vector space of dimension at most $n$ . Gerstenhaber (1961) proved that if $A$ and $B$ are two commuting $n\times n$ matrices then the matrices $A^iB^j$ , for all nonnegative $i$ and $j$ , generate a vector space of dimension at most $n$ . It is conjectured that this result extends to three matrices.

Historical Note

The Cayley–Hamilton theorem appears in the 1858 memoir in which Cayley introduced matrix algebra. Cayley gave a proof for $n = 2$ and stated that he had verified the result for $n = 3$ , adding “I have not thought it necessary to undertake the labour of a formal proof of the theorem in the general case of a matrix of any degree.” Hamilton had proved the result for quaternions in 1853. Cayley actually discovered a more general version of the Cayley–Hamilton theorem, which appears in an 1857 letter to Sylvester but not in any of his published work: if the square matrices $A$ and $B$ commute and $f(x,y) = \det(x A - y B)$ then $f(B,A) = 0$ .

References

Arthur Buchheim, Mathematical Notes, Messenger Math. 13, 62–66, 1884.
Arthur Cayley, A Memoir on the Theory of Matrices, Philos. Trans. Roy. Soc. London 148, 17–37, 1858.
Tony Crilly, Cayley’s Anticipation of a Generalised Cayley–Hamilton Theorem, Historia Mathematica 5, 211–219, 1978.

What Is the CS Decomposition?

The CS (cosine-sine) decomposition reveals close relationships between the singular value decompositions (SVDs) of the blocks of an orthogonal matrix expressed in block $2\times 2$ form. In full generality, it applies when the diagonal blocks are not necessarily square. We focus here mainly on the most practically important case of square diagonal blocks.

Let $Q\in\mathbb{R}^{n \times n}$ be orthogonal and suppose that $n = 2p$ is even and $Q$ is partitioned into four equally sized blocks:

$\notag Q = \begin{array}[b]{@{\mskip35mu}c@{\mskip-20mu}c@{\mskip-10mu}c@{}} \scriptstyle p & \scriptstyle p & \\ \multicolumn{2}{c}{ \left[\begin{array}{c@{~}c@{~}} Q_{11}& Q_{12} \\ Q_{21}& Q_{22} \\ \end{array}\right]} & \mskip-12mu\ \begin{array}{c} \scriptstyle p \\ \scriptstyle p \end{array} \end{array}.$

Then there exist orthogonal matrices $U_1,U_2,V_1,V_2\in\mathbb{R}^{p \times p}$ such that

$\notag \begin{bmatrix} U_1^T & 0\\ 0 & U_2^T \end{bmatrix} \begin{bmatrix} Q_{11} & Q_{12}\\ Q_{21} & Q_{22} \end{bmatrix} \begin{bmatrix} V_1 & 0\\ 0 & V_2 \end{bmatrix} = \begin{array}[b]{@{\mskip36mu}c@{\mskip-13mu}c@{\mskip-10mu}c@{}} \scriptstyle p & \scriptstyle p & \\ \multicolumn{2}{c}{ \left[\begin{array}{@{\mskip3mu}rr@{~}} C & S \\ -S & C \end{array}\right]} & \mskip-12mu\ \begin{array}{c} \scriptstyle p \\ \scriptstyle p \end{array} \end{array},$

where $C = \mathrm{diag}(c_i)$ and $S = \mathrm{diag}(s_i)$ with $c_i \ge 0$ , $s_i \ge 0$ , and $c_i^2 + s_i^2 = 1$ for all $i$ . This CS decomposition comprises four SVDs:

$\notag \begin{alignedat}{2} Q_{11} &= U_1CV_1^T, &\quad Q_{12} &= U_1 S V_2^T, \\ Q_{21} &= U_2 (-S) V_1^T, &\quad Q_{22} &= U_2C V_2^T. \end{alignedat}$

(Strictly speaking, for $Q_{21}$ we need to move the minus sign from $S$ to $U_2$ or $V_1$ to obtain an SVD.) The orthogonality ensures that there are only four different singular vector matrices instead of eight, and it makes the singular values of the blocks closely linked. We also obtain SVDs of four cross products of the blocks: $Q_{11}^TQ_{12} = V_1 CS V_2^T$ , etc.
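To see how the link between the singular values is forced by orthogonality, we can substitute the SVDs of $Q_{11}$ and $Q_{21}$ into the $(1,1)$ block of $Q^TQ = I$ . This is only a consistency check, not a proof that the decomposition exists:

$\notag I = Q_{11}^TQ_{11} + Q_{21}^TQ_{21} = V_1 C U_1^TU_1 C V_1^T + V_1 S U_2^TU_2 S V_1^T = V_1 (C^2 + S^2) V_1^T,$

which holds precisely because $C^2 + S^2 = I$ .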

Note that for $p = 1$ , the CS decomposition reduces to the fact that any $2\times 2$ orthogonal matrix is of the form $\left[\begin{smallmatrix} c & s \\ -s & c \end{smallmatrix}\right]$ (a rotation) up to multiplication of a row or column by $-1$ .

A consequence of the decomposition is that $Q_{11}$ and $Q_{22}$ have the same 2-norms and Frobenius norms, as do their inverses if they are nonsingular. The same is true for $Q_{12}$ and $Q_{21}$ .

Now we drop the requirement that $n$ is even and consider diagonal blocks of different sizes:

$\notag Q = \begin{array}[b]{@{\mskip33mu}c@{\mskip-16mu}c@{\mskip-10mu}c@{}} \scriptstyle p & \scriptstyle n-p & \\ \multicolumn{2}{c}{ \left[\begin{array}{c@{~}c@{~}} Q_{11}& Q_{12} \\ Q_{21}& Q_{22} \\ \end{array}\right]} & \mskip-12mu\ \begin{array}{c} \scriptstyle p \\ \scriptstyle n-p \end{array} \end{array}, \quad p \le \displaystyle\frac{n}{2}.$

The CS decomposition now has the form

$\notag \begin{bmatrix} U_1^T & 0\\ 0 & U_2^T \end{bmatrix} \begin{bmatrix} Q_{11} & Q_{12}\\ Q_{21} & Q_{22} \end{bmatrix} \begin{bmatrix} V_1 & 0\\ 0 & V_2 \end{bmatrix} = \begin{array}[b]{@{\mskip35mu}c@{\mskip30mu}c@{\mskip-10mu}c@{}c} \scriptstyle p & \scriptstyle p & \scriptstyle n-2p & \\ \multicolumn{3}{c}{ \left[\begin{array}{c@{~}|c@{~}c} C & S & 0 \\ \hline -S & C & 0 \\ 0 & 0 & I_{n-2p} \end{array}\right]} & \mskip-12mu \begin{array}{c} \scriptstyle p \\ \scriptstyle p \\ \scriptstyle n-2p \end{array} \end{array},$

with $U_1$ , $V_1$ , $C$ , and $S$ , and $U_2$ and $V_2$ (both now $(n-p) \times (n-p)$ ), having the same properties as before. The new feature for $p < n/2$ is the identity matrix in the bottom right-hand corner on the right-hand side. Here is an example with $p = 2$ and $n=5$ , with elements shown to two decimal places:

$\notag \begin{aligned} \left[\begin{array}{rr|rrr} 0.71 & -0.71 & 0 & 0 & 0 \\ -0.71 & -0.71 & 0 & 0 & 0 \\\hline 0 & 0 & 0.17 & 0.61 & -0.78 \\ 0 & 0 & -0.58 & -0.58 & -0.58 \\ 0 & 0 & -0.80 & 0.54 & 0.25 \\ \end{array}\right] \left[\begin{array}{rr|rrr} -0.60 & -0.40 & -0.40 & -0.40 & -0.40 \\ 0.40 & 0.60 & -0.40 & -0.40 & -0.40 \\\hline 0.40 & -0.40 & 0.60 & -0.40 & -0.40 \\ 0.40 & -0.40 & -0.40 & 0.60 & -0.40 \\ 0.40 & -0.40 & -0.40 & -0.40 & 0.60 \\ \end{array}\right] \\ \times \left[\begin{array}{rr|rrr} -0.71 & 0.71 & 0 & 0 & 0 \\ -0.71 & -0.71 & 0 & 0 & 0 \\\hline 0 & 0 & 0.17 & 0.58 & -0.80 \\ 0 & 0 & 0.61 & 0.58 & 0.54 \\ 0 & 0 & -0.78 & 0.58 & 0.25 \\ \end{array}\right] = \left[\begin{array}{rr|rrr} 1.00 & 0 & 0 & 0 & 0 \\ 0 & 0.20 & 0 & 0.98 & 0 \\\hline 0 & 0 & 1.00 & 0 & 0 \\ 0 & -0.98 & 0 & 0.20 & 0 \\ 0 & 0 & 0 & 0 & 1.00 \\ \end{array}\right]. \end{aligned}$

We mention two interesting consequences of the CS decomposition.

With $p=1$ : if $q_{11} = 0$ then $Q_{22}$ is singular.
For unequally sized diagonal blocks it is no longer always true that $Q_{11}$ and $Q_{22}$ have the same norms, but their inverses do: $\|Q_{11}^{-1}\|_2 = \|Q_{22}^{-1}\|_2 = 1/\min_ic_i \ge 1$ . When $p = 1$ , this relation becomes $\|Q_{22}^{-1}\|_2 = 1/|q_{11}|$ .

The CS decomposition also exists for a rectangular matrix with orthonormal columns,

$\notag Q = \begin{array}[b]{@{\mskip-25mu}c@{\mskip-20mu}c@{}} \scriptstyle n \\ \multicolumn{1}{c}{ \left[\begin{array}{@{}c@{}} Q_{1}\\ Q_{2} \end{array}\right]} & \mskip-12mu\ \begin{array}{c} \scriptstyle p \\ \scriptstyle q \end{array} \end{array}, \quad p\ge n, \quad q \ge n.$

Now the decomposition takes the form

$\notag \begin{bmatrix} U_1^T & 0\\ 0 & U_2^T \end{bmatrix} \begin{bmatrix} Q_{1}\\ Q_{2} \end{bmatrix} V = \begin{array}[b]{@{\mskip-25mu}c@{\mskip-20mu}c@{}} \scriptstyle n \\ \multicolumn{1}{c}{ \left[\begin{array}{c@{~}} C\\ S \end{array}\right]} & \mskip-12mu\ \begin{array}{c} \scriptstyle p \\ \scriptstyle q \end{array} \end{array},$

where $U_1\in\mathbb{R}^{p\times p}$ , $U_2\in\mathbb{R}^{q\times q}$ , and $V\in\mathbb{R}^{n\times n}$ are orthogonal and $C$ and $S$ have the same form as before except that they are rectangular.

The most general form of the CS decomposition is for an orthogonal matrix with diagonal blocks that are not square. Now the matrix on the right-hand side has a more complicated block structure (see the references for details).

The CS decomposition arises in measuring angles and distances between subspaces. These are defined in terms of the orthogonal projectors onto the subspaces, so singular values of orthonormal matrices naturally arise.

Software for computing the CS decomposition is available in LAPACK, based on an algorithm of Sutton (2009). We used a MATLAB interface to it, available on MathWorks File Exchange, for the numerical example. Note that the output of this code is not quite in the form in which we have presented the decomposition, so some post-processing is required to achieve it.

References

This is a minimal set of references, which contain further useful references within.

Gene Golub and Charles F. Van Loan, Matrix Computations, fourth edition, Johns Hopkins University Press, Baltimore, MD, USA, 2013.
C. C. Paige and M. Wei, History and Generality of the CS Decomposition, Linear Algebra Appl. 208/209, 303–326, 1994.
Brian Sutton, Computing the Complete CS Decomposition, Numer. Algorithms 50(1), 33–65, 2009.

What’s New in MATLAB R2020a and R2020b?

In this post I discuss new features in MATLAB R2020a and R2020b. As usual in this series, I focus on a few of the features most relevant to my work. See the release notes for a detailed list of the many changes in MATLAB and its toolboxes.

Exportgraphics (R2020a)

The exportgraphics function is very useful for saving to a file a tightly cropped version of a figure with the border white instead of gray. Simple usages are

exportgraphics(gca,'image.pdf')
exportgraphics(gca,'image.jpg','Resolution',200)

I have previously used the export_fig function, which is not built into MATLAB but is available from File Exchange; I think I will be using exportgraphics instead from now on.

Svdsketch (R2020b)

The new svdsketch function computes the singular value decomposition (SVD) $USV^T$ of a low rank approximation to a matrix ( $U$ and $V$ orthogonal, $S$ diagonal with nonnegative diagonal entries). It is mainly intended for use with matrices that are close to having low rank, as is the case in various applications.

This function uses a randomized algorithm that computes a sketch of the given $m$ -by- $n$ matrix $A$ , which is essentially a product $Q^TA$ , where $Q$ is an orthonormal basis for the product $A\Omega$ , where $\Omega$ is a random $n$ -by- $k$ matrix. The value of $k$ is chosen automatically to achieve $\|USV^T-A\|_F \le \mathrm{tol}\|A\|_F$ , where $\mathrm{tol}$ is a tolerance that defaults to $\epsilon^{1/4}$ and must not be less than $\epsilon^{1/2}$ , where $\epsilon$ is the machine epsilon ( $2\times 10^{-16}$ for double precision). The algorithm includes a power method iteration that refines the sketch before computing the SVD.

The output of the function is an SVD in which $U$ and $V$ are numerically orthogonal and the singular values in $S$ of size $\mathrm{tol}$ or larger are good approximations to singular values of $A$ , but smaller singular values in $S$ may not be good approximations to singular values of $A$ .

Here is an example. The code

n = 8; rng(1); A = gallery('randsvd',n,1e8,3);
[U,S,V] = svdsketch(A,1e-3);
rel_res = norm(A-U*S*V')/norm(A)
singular_values = [svd(A) [diag(S); zeros(n-length(S),1)]]

produces the following output, with the exact singular values in the first column and the approximate ones in the second column:

rel_res =
   1.9308e-06
singular_values =
   1.0000e+00   1.0000e+00
   7.1969e-02   7.1969e-02
   5.1795e-03   5.1795e-03
   3.7276e-04   3.7276e-04
   2.6827e-05   2.6827e-05
   1.9307e-06            0
   1.3895e-07            0
   1.0000e-08            0

The approximate singular values are correct down to around $10^{-5}$ , which is more than the $10^{-3}$ requested. This is a difficult matrix for svdsketch because there is no clear gap in the singular values of $A$ .

Axis Padding (R2020b)

The padding property of an axis puts some padding between the axis limits and the surrounding box. The code

x = linspace(0,2*pi,50); plot(x,tan(x),'linewidth',1.4)
title('Original axis')
axis padded, title('Padded axis')

produces the output

Turbo Colormap (R2020b)

The default colormap changed from jet (the rainbow color map) to parula in R2014b (with a tweak in R2017a), because parula is more perceptually uniform and maintains information when printed in monochrome. The new turbo colormap is a more perceptually uniform version of jet, as these examples show. Notice that turbo has a longer transition through the greens and yellows. If you can’t give up on jet, use turbo instead.

Turbo:

Jet:

Parula:

ND Arrays (R2020b)

The new pagemtimes function performs matrix multiplication on pages of $n$ -dimensional arrays, while pagetranspose and pagectranspose carry out the transpose and conjugate transpose, respectively, on pages of $n$ -dimensional arrays.

Performance

Both releases report significantly improved speed of certain functions, including some of the ODE solvers.

What Is the Singular Value Decomposition?

A singular value decomposition (SVD) of a matrix $A\in\mathbb{R}^{m\times n}$ is a factorization

$\notag A = U\Sigma V^T,$

where $U\in\mathbb{R}^{m\times m}$ and $V\in\mathbb{R}^{n\times n}$ are orthogonal, $\Sigma = \mathrm{diag}(\sigma_1,\dots, \sigma_p)\in\mathbb{R}^{m\times n}$ , where $p = \min(m,n)$ , and $\sigma_1\ge \sigma_2\ge \cdots \ge \sigma_p \ge 0$ .

Partition $U =[ u_1,\dots,u_m]$ and $V = [v_1,\dots, v_n]$ . The $\sigma_i$ are called the singular values of $A$ and the $u_i$ and $v_i$ are the left and right singular vectors. We have $Av_i = \sigma_i u_i$ , $i = 1 \colon p$ . The matrix $\Sigma$ is unique but $U$ and $V$ are not. The form of $\Sigma$ is

$\notag \Sigma = \left[\begin{array}{ccc}\sigma_1&&\\ &\ddots&\\& &\sigma_n\\\hline &\rule{0cm}{15pt} \text{\Large 0} & \end{array}\right] \mathrm{for}~ m \ge n, \quad \Sigma = \begin{bmatrix} \begin{array}{ccc|c@{\mskip5mu}}\sigma_1&&\\ &\ddots& & \text{\Large 0} \\& &\sigma_m\end{array}\\ \end{bmatrix} \mathrm{for}~ m \le n$

Here is an example, in which the entries of $A$ have been specially chosen to give simple forms for the elements of the factors:

$\notag A = \left[\begin{array}{rr} 0 & \frac{4}{3}\\[\smallskipamount] -1 & -\frac{5}{3}\\[\smallskipamount] -2 & -\frac{2}{3} \end{array}\right] = \underbrace{ \displaystyle\frac{1}{3} \left[\begin{array}{rrr} 1 & -2 & -2\\ -2 & 1 & -2\\ -2 & -2 & 1 \end{array}\right] }_U \mskip5mu \underbrace{ \left[\begin{array}{cc} 2\,\sqrt{2} & 0\\ 0 & \sqrt{2}\\ 0 & 0 \end{array}\right] }_{\Sigma} \mskip5mu \underbrace{ \displaystyle\frac{1}{\sqrt{2}} \left[\begin{array}{cc} 1 & 1\\ 1 & -1 \end{array}\right] }_{V^T}.$

The power of the SVD is that it reveals a great deal of useful information about norms, rank, and subspaces of a matrix and it enables many problems to be reduced to a trivial form.

Since $U$ and $V$ are nonsingular, $\mathrm{rank}(A) = \mathrm{rank}(\Sigma) = r$ , where $r \le p$ is the number of nonzero singular values. Since the $2$ -norm and Frobenius norm are invariant under orthogonal transformations, $\|A\| = \|\Sigma\|$ for both norms, giving

$\notag \|A\|_2 = \sigma_1, \quad \|A\|_F = \Bigl(\displaystyle\sum_{i=1}^r \sigma_i^2\Bigr)^{1/2},$

and hence $\|A\|_2 \le \|A\|_F \le r^{1/2} \|A\|_2$ . The range space and null space of $A$ are given in terms of the columns of $U$ and $V$ by

$\notag \begin{aligned} \mathrm{null}(A) &= \mathrm{span} \{ v_{r+1}, \dots,v_n \},\\ \mathrm{range}(A) &= \mathrm{span} \{u_1,u_2,\dots, u_r\}. \end{aligned}$

We can write the SVD as

$\notag \qquad\qquad A = \begin{bmatrix} u_1, u_2 \dots, u_r \end{bmatrix} \mathrm{diag}(\sigma_1,\dots, \sigma_r) \begin{bmatrix} v_1^T\\ v_2^T\\ \vdots\\ v_r^T \end{bmatrix} = \displaystyle\sum_{i=1}^{r} \sigma_i u_i v_i^T, \qquad\qquad(*)$

which expresses $A$ as a sum of $r$ rank- $1$ matrices, the $i$ th of which has $2$ -norm $\sigma_i$ . The famous Eckart–Young theorem (1936) says that

$\notag \min_{\mathrm{rank}(B) = k} \|A-B\|_q = \begin{cases} \sigma_{k+1}, & q = 2, \\ \Bigl(\sum_{i=k+1}^r \sigma_i^2\Bigr)^{1/2}, & q = F, \end{cases}$

and that the minimum is attained at

$\notag A_k = U D_k V^T, \quad D_k = \mathrm{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0).$

In other words, truncating the sum $(*)$ after $k < r$ terms gives the best rank- $k$ approximation to $A$ in both the $2$ -norm and the Frobenius norm. In particular, this result implies that when $A$ has full rank the distance from $A$ to the nearest rank-deficient matrix is $\sigma_r$ .
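The truncation can be tried on the $3\times 2$ example above, whose SVD is known exactly. The following plain Python sketch forms $A$ and its best rank- $1$ approximation $A_1$ from the factors and measures the error; the helper name `truncated` is illustrative.

```python
import math

s2 = math.sqrt(2.0)
# Factors of the 3-by-2 example: A = U * Sigma * V^T.
U = [[1 / 3, -2 / 3, -2 / 3], [-2 / 3, 1 / 3, -2 / 3], [-2 / 3, -2 / 3, 1 / 3]]
sigma = [2 * s2, s2]
Vt = [[1 / s2, 1 / s2], [1 / s2, -1 / s2]]      # rows are v_1^T and v_2^T

def truncated(k):
    """Sum of the first k terms sigma_i u_i v_i^T of the expansion (*)."""
    return [[sum(sigma[i] * U[r][i] * Vt[i][c] for i in range(k))
             for c in range(2)] for r in range(3)]

A = truncated(2)       # all r = 2 terms reproduce A
A1 = truncated(1)      # best rank-1 approximation
err_F = math.sqrt(sum((A[r][c] - A1[r][c]) ** 2
                      for r in range(3) for c in range(2)))
print(A, A1, err_F)
```

Because $A - A_1 = \sigma_2 u_2v_2^T$ has rank $1$ , its $2$ -norm and Frobenius norm coincide, and both equal $\sigma_2 = \sqrt{2}$ .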

Relations with Symmetric Eigenvalue Problem

The SVD is not directly related to the eigenvalues and eigenvectors of $A$ . However, for $m\ge n$ , $A = U \Sigma V^T$ implies

$\notag A^T\!A = V \mathrm{diag}(\sigma_1^2,\dots,\sigma_n^2) V^T, \quad AA^T = U \mathrm{diag}(\sigma_1^2,\dots,\sigma_n^2,\underbrace{0,\dots,0}_{m-n}) U^T,$

so the singular values of $A$ are the square roots of the eigenvalues of the symmetric positive semidefinite matrices $A^T\!A$ and $AA^T$ (modulo $m-n$ zeros in the latter case), and the singular vectors are eigenvectors. Moreover, the eigenvalues of the $(m+n)\times (m+n)$ matrix

$\notag C = \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}$

are plus and minus the singular values of $A$ , together with $|m-n|$ additional zeros if $m \ne n$ , and the eigenvectors of $C$ and the singular vectors of $A$ are also related.
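The relation between the eigenvectors of $C$ and the singular vectors can be checked directly. For $i = 1\colon p$ , using $Av_i = \sigma_i u_i$ and $A^Tu_i = \sigma_i v_i$ , we have

$\notag \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix} \begin{bmatrix} u_i \\ \pm v_i \end{bmatrix} = \begin{bmatrix} \pm Av_i \\ A^Tu_i \end{bmatrix} = \begin{bmatrix} \pm \sigma_i u_i \\ \sigma_i v_i \end{bmatrix} = \pm \sigma_i \begin{bmatrix} u_i \\ \pm v_i \end{bmatrix},$

where the same sign is taken throughout.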

Consequently, by applying results or algorithms for the eigensystem of a symmetric matrix to $A^T\!A$ , $AA^T$ , or $C$ one obtains results or algorithms for the singular value decomposition of $A$ .

Connections with Other Problems

The pseudoinverse of a matrix $A\in\mathbb{R}^{m\times n}$ can be expressed in terms of the SVD as

$\notag A^+ = V\mathrm{diag}(\sigma_1^{-1},\dots,\sigma_r^{-1},0,\dots,0)U^T.$

The least squares problem $\min_x \|b - Ax\|_2$ , where $A\in\mathbb{R}^{m\times n}$ with $m \ge n$ , is solved by $x = A^+b$ , and when $A$ is rank-deficient this is the solution of minimum $2$ -norm. For $m < n$ this is an underdetermined system and $x = A^+b$ gives the minimum 2-norm solution.
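As an illustration, the least squares solution $x = A^+b$ can be computed for the $3\times 2$ example matrix from its known SVD. The sketch below is plain Python; the right-hand side $b$ is an arbitrary choice made for this example.

```python
import math

s2 = math.sqrt(2.0)
# SVD factors of the 3-by-2 example matrix A (both singular values nonzero).
U = [[1 / 3, -2 / 3, -2 / 3], [-2 / 3, 1 / 3, -2 / 3], [-2 / 3, -2 / 3, 1 / 3]]
sigma = [2 * s2, s2]
Vt = [[1 / s2, 1 / s2], [1 / s2, -1 / s2]]      # rows are v_1^T and v_2^T
A = [[0.0, 4 / 3], [-1.0, -5 / 3], [-2.0, -2 / 3]]
b = [1.0, 2.0, 3.0]                              # arbitrary right-hand side

# x = A^+ b = sum over nonzero sigma_i of (u_i^T b / sigma_i) v_i.
coeffs = [sum(U[r][i] * b[r] for r in range(3)) / sigma[i] for i in range(2)]
x = [sum(coeffs[i] * Vt[i][c] for i in range(2)) for c in range(2)]

# At a least squares solution the residual is orthogonal to range(A).
res = [b[r] - sum(A[r][c] * x[c] for c in range(2)) for r in range(3)]
grad = [sum(A[r][c] * res[r] for r in range(3)) for c in range(2)]
print(x, grad)
```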

We can write $A = U\Sigma V^T = UV^T \cdot V \Sigma V^T \equiv PQ$ , where $P$ is orthogonal and $Q$ is symmetric positive semidefinite. This decomposition $A = PQ$ is the polar decomposition and $Q = (A^T\!A)^{1/2}$ is unique. This connection between the SVD and the polar decomposition is useful both theoretically and computationally.
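The construction of the polar factors from the SVD can be sketched as follows, assuming NumPy and a random square matrix (an illustrative choice).

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
U, s, Vt = np.linalg.svd(A)

P = U @ Vt                    # orthogonal polar factor U V^T
Q = Vt.T @ np.diag(s) @ Vt    # symmetric positive semidefinite factor V Sigma V^T

print(np.linalg.norm(P @ Q - A))      # residual of A = PQ
print(np.linalg.norm(Q @ Q - A.T @ A))  # Q is a square root of A^T A
```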

Applications

The SVD is used in a very wide variety of applications—too many and varied to attempt to summarize here. We just mention two.

The SVD can be used to help identify to which letters vowels and consonants have been mapped in a substitution cipher (Moler and Morrison, 1983).

An inverse use of the SVD is to construct test matrices by forming a diagonal matrix of singular values from some distribution then pre- and post-multiplying by random orthogonal matrices. The result is matrices with known singular values and 2-norm condition number that are nevertheless random. Such “randsvd” matrices are widely used to test algorithms in numerical linear algebra.

History and Computation

The SVD was introduced independently by Beltrami in 1873 and Jordan in 1874. Golub popularized the SVD as an essential computational tool and developed the first reliable algorithms for computing it. The Golub–Reinsch algorithm, dating from the late 1960s and based on bidiagonalization and the QR algorithm, is the standard way to compute the SVD. Various alternatives are available; see the references.

References

This is a minimal set of references, which contain further useful references within.

Jack Dongarra, Mark Gates, Azzam Haidar, Jakub Kurzak, Piotr Luszczek, Stanimire Tomov, and Ichitaro Yamazaki, The Singular Value Decomposition: Anatomy of Optimizing an Algorithm for Extreme Scale, SIAM Rev. 60(4), 808–865, 2018.
Gene Golub and Charles F. Van Loan, Matrix Computations, fourth edition, Johns Hopkins University Press, Baltimore, MD, USA, 2013.
Roger A. Horn and Charles R. Johnson, Topics in Matrix Analysis, Cambridge University Press, 1991. Chapter 3.
Cleve B. Moler and Donald Morrison, Singular Value Analysis of Cryptograms, Amer. Math. Monthly 90, 78–87, 1983.
Yuji Nakatsukasa and Nicholas J. Higham, Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD, SIAM J. Sci. Comput. 35(3), A1325–A1349, 2013.

What Is the Complex Step Approximation?

In many situations we need to evaluate the derivative of a function but we do not have an explicit formula for the derivative. The complex step approximation approximates the derivative (and the function value itself) from a single function evaluation. The catch is that it involves complex arithmetic.

For an analytic function $f$ we have the Taylor expansion

$\notag \qquad\qquad\qquad\qquad f(x + \mathrm{i}h) = f(x) + \mathrm{i}h f'(x) - h^2\displaystyle\frac{f''(x)}{2} + O(h^3), \qquad\qquad\qquad\qquad(*)$

where $\mathrm{i} = \sqrt{-1}$ is the imaginary unit. Assume that $f$ maps the real line to the real line and that $x$ and $h$ are real. Then equating real and imaginary parts in $(*)$ gives $\mathrm{Re} f(x+\mathrm{i}h) = f(x) + O(h^2)$ and $\mathrm{Im} f(x+\mathrm{i}h) = hf'(x) + O(h^3)$ . This means that for small $h$ , the approximations

$\notag f(x) \approx \mathrm{Re} f(x+\mathrm{i}h), \quad f'(x) \approx \mathrm{Im} \displaystyle\frac{f(x+\mathrm{i}h)}{h}$

both have error $O(h^2)$ . So a single evaluation of $f$ at a complex argument gives, for small $h$ , a good approximation to $f'(x)$ , as well as a good approximation to $f(x)$ if we need it.

The usual way to approximate derivatives is with finite differences, for example by the forward difference approximation

$\notag f'(x) \approx \displaystyle\frac{f(x+h) - f(x)}{h}.$

This approximation has error $O(h)$ so it is less accurate than the complex step approximation for a given $h$ , but more importantly it is prone to numerical cancellation. For small $h$ , $f(x+h)$ and $f(x)$ agree to many significant digits and so in floating-point arithmetic the difference approximation suffers a loss of significant digits. Consequently, as $h$ decreases the error in the computed approximation eventually starts to increase. As numerical analysis textbooks explain, the optimal choice of $h$ that balances truncation error and rounding errors is approximately

$\notag h_{\mathrm{opt}} = 2\Bigl|\displaystyle\frac{u f(x)}{f''(x)} \Bigr|^{1/2},$

where $u$ is the unit roundoff. The optimal error is therefore of order $u^{1/2}$ .

A simple example illustrates these ideas. For the function $f(x) = \mathrm{e}^x$ with $x = 1$ , we plot in the figure below the relative error for the finite difference, in blue, and the relative error for the complex step approximation, in orange, for $h$ ranging from about $10^{-5}$ to $10^{-11}$ . The dotted lines show $u$ and $u^{1/2}$ . The computations are in double precision ( $u \approx 1.1\times 10^{-16}$ ). The finite difference error decreases with $h$ until it reaches about $h_{\mathrm{opt}} = 2.1\times 10^{-8}$ ; thereafter the error grows, giving the characteristic V-shaped error curve. The complex step error decreases steadily until it is of order $u$ for $h \approx u^{1/2}$ , and for each $h$ it is about the square of the finite difference error, as expected from the theory.

Remarkably, one can take $h$ extremely small in the complex step approximation (e.g., $h = 10^{-100}$ ) without any ill effects from roundoff.
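The experiment is easy to reproduce. The following sketch, assuming NumPy and with hypothetical helper names, prints the relative errors of the forward difference and the complex step approximation for $f(x) = \mathrm{e}^x$ at $x = 1$ for a few values of $h$ , including $h = 10^{-100}$ .

```python
import numpy as np

f = np.exp
x = 1.0
exact = np.exp(x)  # f'(x) = e^x for this choice of f

def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h

def complex_step(f, x, h):
    # A single evaluation of f at the complex argument x + ih.
    return np.imag(f(x + 1j * h)) / h

for h in [1e-5, 1e-8, 1e-11, 1e-100]:
    err_fd = abs(forward_difference(f, x, h) - exact) / exact
    err_cs = abs(complex_step(f, x, h) - exact) / exact
    print(f"h = {h:.0e}: forward difference {err_fd:.1e}, complex step {err_cs:.1e}")
```

For $h = 10^{-100}$ the forward difference returns zero, because $x + h$ rounds to $x$ , whereas the complex step approximation remains accurate.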

The complex step approximation carries out a form of approximate automatic differentiation, with the variable $h$ functioning like a symbolic variable that propagates through the computations in the imaginary parts.

The complex step approximation applies to gradient vectors and it can be extended to matrix functions. If $f$ is analytic and maps real $n\times n$ matrices to real $n\times n$ matrices and $A$ and $E$ are real then (Al-Mohy and Higham, 2010)

$\notag L_f(A,E) \approx \mathrm{Im} \displaystyle\frac{f(A+\mathrm{i}hE)}{h},$

where $L_f(A,E)$ is the Fréchet derivative of $f$ at $A$ in the direction $E$ . It is important to note that the method used to evaluate $f$ must not itself use complex arithmetic (as methods based on the Schur decomposition do); if it does, then the interaction of those complex terms with the much smaller $\mathrm{i}hE$ term can lead to damaging subtractive cancellation.
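As a small illustration, take the hypothetical choice $f(X) = X^3$ , which is evaluated using only matrix products and for which the Fréchet derivative is known explicitly: $L_f(A,E) = A^2E + AEA + EA^2$ . The sketch below, assuming NumPy, compares this with the complex step approximation.

```python
import numpy as np

def f(X):
    # Evaluated with matrix products only, so no complex arithmetic
    # enters beyond what the argument itself carries.
    return X @ X @ X

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 4))
E = rng.standard_normal((4, 4))

h = 1e-20
L_cs = np.imag(f(A + 1j * h * E)) / h          # complex step approximation
L_exact = A @ A @ E + A @ E @ A + E @ A @ A    # exact Frechet derivative of the cube
print(np.linalg.norm(L_cs - L_exact) / np.linalg.norm(L_exact))
```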

The complex step approximation has also been extended to higher derivatives by using “different imaginary units” in different components (Lantoine et al., 2012).

Here are some applications where the complex step approximation has been used.

Sensitivity analysis in engineering applications (Giles et al., 2003).
Approximating gradients in deep learning (Goodfellow et al., 2016).
Approximating the exponential of an operator in option pricing (Ackerer and Filipović, 2019).

Software has been developed for automatically carrying out the complex step method—for example, by Shampine (2007).

The complex step approximation has been rediscovered many times. The earliest published appearance that we are aware of is in a paper by Squire and Trapp (1998), who acknowledge earlier work of Lyness and Moler on the use of complex variables to approximate derivatives.

References

This is a minimal set of references, which contain further useful references within.

Awad H. Al-Mohy and Nicholas J. Higham, The Complex Step Approximation to the Fréchet Derivative of a Matrix Function, Numer. Algorithms 53, 133–148, 2010.
Damien Ackerer and Damir Filipović, Option Pricing with Orthogonal Polynomial Expansions, Mathematical Finance 30, 47–84, 2019.
Michael B. Giles, Mihai C. Duta, Jens-Dominik Müller, and Niles A. Pierce, Algorithm Developments for Discrete Adjoint Methods, AIAA Journal 41(2), 198–205, 2003.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016. Page 434.
Gregory Lantoine, Ryan P. Russell, and Thierry Dargent, Using Multicomplex Variables for Automatic Computation of High-Order Derivatives, ACM Trans. Math. Software 38, 16:1–16:21, 2012.
L. F. Shampine, Accurate Numerical Derivatives in MATLAB, ACM Trans. Math. Software 33, 26:1–26:17, 2007.
W. Squire and G. E. Trapp, Using Complex Variables to Estimate Derivatives of Real Functions, SIAM Rev. 40(1), 110–112, 1998.

What Is the Sherman–Morrison–Woodbury Formula?

When a nonsingular $n\times n$ matrix $A$ is perturbed by a matrix of rank $k$ , the inverse also undergoes a rank- $k$ perturbation. More precisely, if $E$ has rank $k$ and $B = A+E$ is nonsingular then the identity $A^{-1} - B^{-1} = A^{-1} (B-A) B^{-1}$ shows that

$\notag \mathrm{rank}(A^{-1} - B^{-1}) = \mathrm{rank}(A^{-1} E B^{-1}) = \mathrm{rank}(E) = k.$

The Sherman–Morrison–Woodbury formula provides an explicit formula for the inverse of the perturbed matrix $B$ .

Sherman–Morrison Formula

We will begin with the simpler case of a rank- $1$ perturbation: $B = A + uv^*$ , where $u$ and $v$ are $n$ -vectors, and we consider first the case where $A = I$ . We might expect that $(I + uv^*)^{-1} = I + \theta uv^*$ for some $\theta$ (consider a binomial expansion of the inverse). Multiplying out, we obtain

$\notag (I + uv^*) (I + \theta uv^*) = I + (1 + \theta + \theta v^*u) uv^*,$

so the product equals the identity matrix when $\theta = -1/(1 + v^*u)$ . The condition that $I + uv^*$ be nonsingular is $v^*u \ne -1$ (as can also be seen from $\det(I + uv^*) = 1 + v^*u$ , derived in What Is a Block Matrix?). So

$\notag (I + uv^*)^{-1} = I - \displaystyle\frac{1}{1 + v^*u} uv^*.$

For the general case write $B = A + uv^* = A(I + A^{-1}u v^*)$ . Inverting this equation and applying the previous result gives

$\notag (A + uv^*)^{-1} = A^{-1} - \displaystyle\frac{A^{-1} uv^* A^{-1}}{1 + v^* A^{-1} u},$

subject to the nonsingularity condition $v^*A^{-1}u \ne -1$ . This is known as the Sherman–Morrison formula. It explicitly identifies the rank- $1$ change to the inverse.
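The Sherman–Morrison formula can be checked on a small real example. This is a sketch assuming NumPy; the test matrix (a random matrix shifted by a multiple of the identity) and the vectors are arbitrary illustrative choices, and explicit inverses are formed only for the purpose of the comparison.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)
u = rng.standard_normal(n)
v = rng.standard_normal(n)

A_inv = np.linalg.inv(A)
denom = 1.0 + v @ A_inv @ u  # the formula requires this to be nonzero

# Rank-1 update of the inverse.
B_inv = A_inv - np.outer(A_inv @ u, v @ A_inv) / denom
print(np.linalg.norm(B_inv - np.linalg.inv(A + np.outer(u, v))))
```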

As an example, if we take $u = te_i$ and $v = e_j$ (where $e_k$ is the $k$ th column of the identity matrix) then, writing $A^{-1} = (\alpha_{ij})$ , we have

$\notag \bigl(A + te_ie_j^*\bigr)^{-1} = A^{-1} - \displaystyle\frac{tA^{-1}e_i e_j^* A^{-1}}{1 + t \alpha_{ji}}.$

The Frobenius norm of the change to $A^{-1}$ is

$\notag \displaystyle\frac{ |t|\, \| A^{-1}e_i\|_2 \| e_j^*A^{-1}\|_2 } {|1 + t \alpha_{ji}|}.$

If $t$ is sufficiently small then this quantity is approximately maximized for $i$ and $j$ such that the product of the norms of the $i$ th column and the $j$ th row of $A^{-1}$ is maximized. For an upper triangular matrix $i = n$ and $j = 1$ are likely to give the maximum, which means that the inverse of an upper triangular matrix is likely to be most sensitive to perturbations in the $(n,1)$ element of the matrix. To illustrate, we consider the matrix

$\notag T = \left[\begin{array}{rrrr} 1 & -1 & -2 & -3\\ 0 & 1 & -4 & -5\\ 0 & 0 & 1 & -6\\ 0 & 0 & 0 & 1 \end{array}\right]$

The $(i,j)$ element of the following matrix is $\| T^{-1} - (T + 10^{-3}e_ie_j^*)^{-1} \|_F$ :

$\notag \left[\begin{array}{cccc} 0.044 & 0.029 & 0.006 & 0.001 \\ 0.063 & 0.041 & 0.009 & 0.001 \\ 0.322 & 0.212 & 0.044 & 0.007 \\ 2.258 & 1.510 & 0.321 & 0.053 \\ \end{array}\right]$

As our analysis suggests, the $(4,1)$ entry is the most sensitive to perturbation.
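The table of sensitivities can be recomputed directly. The sketch below assumes NumPy and uses explicit inverses for simplicity; indices in the code are zero-based, so the $(4,1)$ entry is `S[3, 0]`.

```python
import numpy as np

T = np.array([[1, -1, -2, -3],
              [0,  1, -4, -5],
              [0,  0,  1, -6],
              [0,  0,  0,  1]], dtype=float)
T_inv = np.linalg.inv(T)
t = 1e-3
n = T.shape[0]

S = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n))
        E[i, j] = t  # perturbation t * e_i e_j^*
        S[i, j] = np.linalg.norm(T_inv - np.linalg.inv(T + E), "fro")

print(np.round(S, 3))
i_max, j_max = np.unravel_index(np.argmax(S), S.shape)
print(i_max + 1, j_max + 1)  # one-based position of the largest change
```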

Sherman–Morrison–Woodbury Formula

Now consider a perturbation $UV^*$ , where $U$ and $V$ are $n\times k$ . This perturbation has rank at most $k$ , and its rank is $k$ if $U$ and $V$ are both of rank $k$ . If $I + V^* A^{-1} U$ is nonsingular then $A + UV^*$ is nonsingular and

$\notag (A + UV^*)^{-1} = A^{-1} - A^{-1} U (I + V^* A^{-1} U)^{-1} V^* A^{-1},$

which is the Sherman–Morrison–Woodbury formula. The significance of this formula is that $I + V^* A^{-1} U$ is $k\times k$ , so if $k\ll n$ and $A^{-1}$ is known then it is much cheaper to evaluate the right-hand side than to invert $A + UV^*$ directly. In practice, of course, we rarely invert matrices, but rather exploit factorizations of them. If we have an LU factorization of $A$ then we can use it in conjunction with the Sherman–Morrison–Woodbury formula to solve $(A + UV^*)x = b$ in $O(kn^2 + k^3)$ flops, as opposed to the $O(n^3)$ flops required to factorize $A + UV^*$ from scratch.
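A sketch of solving $(A + UV^*)x = b$ in this way follows, assuming NumPy and real data. For brevity a single call to `np.linalg.solve` with $k+1$ right-hand sides stands in for reusing a stored LU factorization of $A$ ; the dimensions and the test matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 200, 3
A = rng.standard_normal((n, n)) + n * np.eye(n)
U = rng.standard_normal((n, k))
V = rng.standard_normal((n, k))
b = rng.standard_normal(n)

# One factorization of A serves all k + 1 right-hand sides [b, U].
Y = np.linalg.solve(A, np.column_stack([b, U]))
y, Z = Y[:, 0], Y[:, 1:]         # y = A^{-1} b, Z = A^{-1} U

small = np.eye(k) + V.T @ Z      # the k-by-k matrix I + V^* A^{-1} U
x = y - Z @ np.linalg.solve(small, V.T @ y)

print(np.linalg.norm((A + U @ V.T) @ x - b))  # residual of the updated system
```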

The Sherman–Morrison–Woodbury formula is straightforward to verify, by showing that the product of the two sides is the identity matrix. How can the formula be derived in the first place? Consider any two matrices $F$ and $G$ such that $FG$ and $GF$ are both defined. The associative law for matrix multiplication gives $F(GF) = (FG)F$ , or $(I + FG)F = F (I + GF)$ , which can be written as $F(I+GF)^{-1} = (I+FG)^{-1}F$ . Postmultiplying by $G$ gives

$\notag F(I+GF)^{-1}G = (I+FG)^{-1}FG = (I+FG)^{-1}(I + FG - I) = I - (I+FG)^{-1}.$

Setting $F = U$ and $G = V^*$ gives the special case of the Sherman–Morrison–Woodbury formula with $A = I$ , and the general formula follows from $A + UV^* = A(I + A^{-1}U V^*)$ .

General Formula

We will give a different derivation of an even more general formula using block matrices. Consider the block matrix

$\notag X = \begin{bmatrix} A & U \\ V^* & -W^{-1} \end{bmatrix}$

where $A$ is $n\times n$ , $U$ and $V$ are $n\times k$ , and $W$ is $k\times k$ . We will obtain a formula for $(A + UWV^*)^{-1}$ by looking at $X^{-1}$ .

It is straightforward to verify that

$\notag \begin{bmatrix} A & U \\ V^* & -W^{-1} \end{bmatrix} = \begin{bmatrix} I & 0 \\ V^*A^{-1} & I \end{bmatrix} \begin{bmatrix} A & 0 \\ 0 & -(W^{-1} + V^*A^{-1}U) \end{bmatrix} \begin{bmatrix} I & A^{-1}U \\ 0 & I \end{bmatrix}.$

Hence

$\notag \begin{aligned} \begin{bmatrix} A & U \\ V^* & -W^{-1} \end{bmatrix}^{-1} &= \begin{bmatrix} I & -A^{-1}U \\ 0 & I \end{bmatrix} \begin{bmatrix} A^{-1} & 0 \\ 0 & -(W^{-1} + V^*A^{-1}U)^{-1} \end{bmatrix} \begin{bmatrix} I & 0 \\ -V^*A^{-1} & I \end{bmatrix}\\[\smallskipamount] &= \begin{bmatrix} A^{-1} - A^{-1}U(W^{-1} + V^*A^{-1}U)^{-1}V^*A^{-1} & A^{-1}U(W^{-1} + V^*A^{-1}U)^{-1} \\ (W^{-1} + V^*A^{-1}U)^{-1} V^*A^{-1} & -(W^{-1} + V^*A^{-1}U)^{-1} \end{bmatrix}. \end{aligned}$

In the $(1,1)$ block we see the right-hand side of a Sherman–Morrison–Woodbury-like formula, but it is not immediately clear how this relates to $(A + UWV^*)^{-1}$ . Let $P = \bigl[\begin{smallmatrix} 0 & I \\ I & 0 \end{smallmatrix} \bigr]$ , and note that $P^{-1} = P$ . Then

$\notag PXP = \begin{bmatrix} -W^{-1} & V^* \\ U & A \end{bmatrix}$

and applying the above formula (appropriately renaming the blocks) gives, with $\times$ denoting a block whose value does not matter,

$\notag PX^{-1}P = (PXP)^{-1} = \begin{bmatrix} \times & \times \\ \times & (A + UWV^*)^{-1} \end{bmatrix}.$

Hence $(X^{-1})_{11} = (A + UWV^*)^{-1}$ . Equating our two formulas for $(X^{-1})_{11}$ gives

$\notag \qquad\qquad\qquad (A + UWV^*)^{-1} = A^{-1} - A^{-1}U (W^{-1} + V^*A^{-1}U)^{-1} V^*A^{-1}, \qquad\qquad\qquad(*)$

provided that $W^{-1} + V^*A^{-1}U$ is nonsingular.

To see one reason why this formula is useful, suppose that the matrix $A$ and its perturbation are symmetric and we wish to preserve symmetry in our formulas. The Sherman–Morrison–Woodbury formula requires us to write the perturbation as $UU^*$ , so the perturbation must be positive semidefinite. In $(*)$ , however, we can write an arbitrary symmetric perturbation as $UWU^*$ , with $W$ symmetric but possibly indefinite, and obtain a symmetric formula.
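The symmetric case can be illustrated with a small real example in which $W$ is indefinite. This is a sketch assuming NumPy; the matrices, the scaling of $U$ , and the choice $W = \mathrm{diag}(1,-1)$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 6, 2
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
U = 0.1 * rng.standard_normal((n, k))
W = np.diag([1.0, -1.0])             # symmetric but indefinite

A_inv = np.linalg.inv(A)
S = np.linalg.inv(W) + U.T @ A_inv @ U   # W^{-1} + V^* A^{-1} U with V = U
B_inv = A_inv - A_inv @ U @ np.linalg.solve(S, U.T @ A_inv)

print(np.linalg.norm(B_inv - np.linalg.inv(A + U @ W @ U.T)))
print(np.linalg.norm(B_inv - B_inv.T))   # the computed inverse is symmetric up to rounding
```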

The matrix $-(W^{-1} + V^*A^{-1}U)$ is the Schur complement of $A$ in $X$ . Consequently the inversion formula $(*)$ is intimately connected with the theory of Schur complements. By manipulating the block matrices in different ways it is possible to derive variations of $(*)$ . We mention just the simple rewriting

$\notag (A + UWV^*)^{-1} = A^{-1} - A^{-1}U W(I + V^*A^{-1}UW)^{-1} V^*A^{-1},$

which is valid if $W$ is singular, as long as $I + V^*A^{-1}UW$ is nonsingular. Note that the formula is not symmetric when $V = U$ and $W = W^*$ . This variant can also be obtained by replacing $U$ by $UW$ in the Sherman–Morrison–Woodbury formula.

Historical Note

Formulas for the change in a matrix inverse under low rank perturbations have a long history. They have been rediscovered on multiple occasions, sometimes appearing without comment within other formulas. Equation $(*)$ is given by Duncan (1944), which is the earliest appearance in print that I am aware of. For discussions of the history of these formulas see Henderson and Searle (1981) or Puntanen and Styan (2005).

References

This is a minimal set of references, which contain further useful references within.

W. J. Duncan, LXXVIII. Some Devices for the Solution of Large Sets of Simultaneous Linear Equations, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 35, 660–670, 1944.
H. V. Henderson and S. R. Searle, On Deriving the Inverse of a Sum of Matrices, SIAM Rev. 23(1), 53–60, 1981.
Simo Puntanen and George Styan, Historical Introduction: Issai Schur and the Early Development of the Schur Complement, pages 1–16 in Fuzhen Zhang, ed., The Schur Complement and Its Applications, Springer-Verlag, New York, 2005.

What Is a Block Matrix?

A matrix is a rectangular array of numbers treated as a single object. A block matrix is a matrix whose elements are themselves matrices, which are called submatrices. By allowing a matrix to be viewed at different levels of abstraction, the block matrix viewpoint enables elegant proofs of results and facilitates the development and understanding of numerical algorithms.

A block matrix is defined in terms of a partitioning, which breaks a matrix into contiguous pieces. The most common and important case is for an $n\times n$ matrix to be partitioned as a block $2\times 2$ matrix (two block rows and two block columns). For $n = 4$ , partitioning into $2\times 2$ blocks gives

$\notag A = \left[\begin{array}{cc|cc} a_{11} & a_{12} & a_{13} & a_{14}\\ a_{21} & a_{22} & a_{23} & a_{24}\\\hline a_{31} & a_{32} & a_{33} & a_{34}\\ a_{41} & a_{42} & a_{43} & a_{44}\\ \end{array}\right] = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},$

where

$\notag A_{11} = A(1\colon2,1\colon2) = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix},$

and similarly for the other blocks. The diagonal blocks in a partitioning of a square matrix are usually square (but not necessarily so), and they do not have to be of the same dimensions. This same $4\times 4$ matrix could be partitioned as

$\notag A = \left[\begin{array}{c|ccc} a_{11} & a_{12} & a_{13} & a_{14}\\\hline a_{21} & a_{22} & a_{23} & a_{24}\\ a_{31} & a_{32} & a_{33} & a_{34}\\ a_{41} & a_{42} & a_{43} & a_{44}\\ \end{array}\right] = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},$

where $A_{11} = (a_{11})$ is a scalar, $A_{21}$ is a column vector, and $A_{12}$ is a row vector.

The sum $C = A + B$ of two block matrices $A = (A_{ij})$ and $B = (B_{ij})$ of the same dimension is obtained by adding blockwise as long as $A_{ij}$ and $B_{ij}$ have the same dimensions for all $i$ and $j$ , and the result has the same block structure: $C_{ij} = A_{ij}+B_{ij}$ .

The product $C = AB$ of an $m\times n$ matrix $A = (A_{ij})$ and an $n\times p$ matrix $B = (B_{ij})$ can be computed as $C_{ij} = \sum_k A_{ik}B_{kj}$ as long as the products $A_{ik}B_{kj}$ are all defined. In this case the matrices $A$ and $B$ are said to be conformably partitioned for multiplication. Here, $C$ has as many block rows as $A$ and as many block columns as $B$ . For example,

$\notag AB = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{bmatrix}$

as long as all the eight products $A_{ik}B_{kj}$ are defined.

Block matrix notation is an essential tool in numerical linear algebra. Here are some examples of its usage.

Matrix Factorization

For an $n\times n$ matrix $A$ with nonzero $(1,1)$ element $\alpha$ we can write

$\notag A = \begin{bmatrix} \alpha & b^T \\ c & D \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ c/\alpha & I_{n-1} \end{bmatrix} \begin{bmatrix} \alpha & b^T \\ 0 & D - cb^T/\alpha \end{bmatrix} =: L_1U_1.$

The first row and column of $L_1$ have the correct form for a unit lower triangular matrix and likewise the first row and column of $U_1$ have the correct form for an upper triangular matrix. If we can find an LU factorization $L_2U_2$ of the $(n-1)\times (n-1)$ Schur complement $D - cb^T/\alpha$ then $A = L_1\mathrm{diag}(1,L_2)\cdot \left[\begin{smallmatrix} \alpha & b^T \\ 0 & U_2 \end{smallmatrix}\right]$ is an LU factorization of $A$ . This construction is the basis of an inductive proof of the existence of an LU factorization (provided all the pivots are nonzero) and it also yields an algorithm for computing it.
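The inductive construction translates directly into a recursive algorithm. The following is a minimal sketch, assuming NumPy, no pivoting, and nonzero pivots throughout; the function name `lu_recursive` is hypothetical and the code favors clarity over efficiency.

```python
import numpy as np

def lu_recursive(A):
    """LU factorization without pivoting; assumes every pivot is nonzero."""
    n = A.shape[0]
    if n == 1:
        return np.ones((1, 1)), A.astype(float)
    alpha, b, c, D = A[0, 0], A[0, 1:], A[1:, 0], A[1:, 1:]
    S = D - np.outer(c, b) / alpha   # Schur complement D - c b^T / alpha
    L2, U2 = lu_recursive(S)
    L = np.eye(n)
    L[1:, 0] = c / alpha             # first column of L_1
    L[1:, 1:] = L2
    U = np.zeros((n, n))
    U[0, 0] = alpha
    U[0, 1:] = b                     # first row of U_1
    U[1:, 1:] = U2
    return L, U

rng = np.random.default_rng(9)
A = rng.standard_normal((5, 5)) + 10 * np.eye(5)
L, U = lu_recursive(A)
print(np.linalg.norm(L @ U - A))
```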

The same type of construction applies to other factorizations, such as Cholesky factorization, QR factorization, and the Schur decomposition.

Matrix Inverse

A useful formula for the inverse of a nonsingular block triangular matrix

$\notag T = \begin{bmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{bmatrix}$

is

$\notag T^{-1} = \begin{bmatrix} T_{11}^{-1} & - T_{11}^{-1}T_{12}T_{22}^{-1}\\ 0 & T_{22}^{-1} \end{bmatrix},$

which has the special case

$\notag \begin{bmatrix} I & X \\ 0 & I \end{bmatrix}^{-1} = \begin{bmatrix} I & -X\\ 0 & I \end{bmatrix}.$

If $T$ is upper triangular then so are $T_{11}$ and $T_{22}$ . By taking $T_{11}$ of dimension the nearest integer to $n/2$ this formula can be used to construct a divide and conquer algorithm for computing $T^{-1}$ .
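A divide and conquer inversion based on this formula might be sketched as follows, assuming NumPy. The function name `tri_inv` is hypothetical, and the split point `n // 2` is one way of choosing an integer nearest to $n/2$ .

```python
import numpy as np

def tri_inv(T):
    """Invert a nonsingular upper triangular matrix by recursive 2-by-2 blocking."""
    n = T.shape[0]
    if n == 1:
        return np.array([[1.0 / T[0, 0]]])
    p = n // 2
    T11_inv = tri_inv(T[:p, :p])
    T22_inv = tri_inv(T[p:, p:])
    X = np.zeros((n, n))
    X[:p, :p] = T11_inv
    X[p:, p:] = T22_inv
    X[:p, p:] = -T11_inv @ T[:p, p:] @ T22_inv  # the (1,2) block of the inverse
    return X

rng = np.random.default_rng(10)
T = np.triu(rng.standard_normal((7, 7))) + 4 * np.eye(7)
X = tri_inv(T)
print(np.linalg.norm(X @ T - np.eye(7)))
```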

We note that $\det(T) = \det(T_{11}) \det(T_{22})$ , a fact that will be used in the next section.

Determinantal Formulas

Block matrices provide elegant proofs of many results involving determinants. For example, consider the equations

$\notag \begin{bmatrix} I & -A \\ 0 & I \end{bmatrix} \begin{bmatrix} I+AB & 0 \\ B & I \end{bmatrix} = \begin{bmatrix} I & -A\\ B & I \end{bmatrix} = \begin{bmatrix} I & 0 \\ B & I+BA \end{bmatrix} \begin{bmatrix} I & -A \\ 0 & I \end{bmatrix},$

which hold for any $A$ and $B$ such that $AB$ and $BA$ are defined. Taking determinants gives the formula $\det(I + AB) = \det(I + BA)$ . In particular we can take $A = x$ , $B = y^T$ , for $n$ -vectors $x$ and $y$ , giving $\det(I + xy^T) = 1 + y^Tx$ .
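Both determinantal identities are easily checked numerically. This sketch assumes NumPy and uses random rectangular matrices and vectors chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
A = rng.standard_normal((5, 2))
B = rng.standard_normal((2, 5))

d_big = np.linalg.det(np.eye(5) + A @ B)    # determinant of a 5-by-5 matrix
d_small = np.linalg.det(np.eye(2) + B @ A)  # determinant of a 2-by-2 matrix

x = rng.standard_normal(5)
y = rng.standard_normal(5)
d_rank1 = np.linalg.det(np.eye(5) + np.outer(x, y))

print(d_big, d_small)
print(d_rank1, 1 + y @ x)
```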

Constructing Matrices with Required Properties

We can sometimes build a matrix with certain desired properties by a block construction. For example, if $X$ is an $n\times n$ involutory matrix ( $X^2 = I$ ) then

$\notag \begin{bmatrix} X & I \\ 0 & -X \end{bmatrix}$

is a (block triangular) $2n\times 2n$ involutory matrix. And if $A$ and $B$ are any two $n\times n$ matrices then

$\notag \begin{bmatrix} I - BA & B \\ 2A-ABA & AB-I \end{bmatrix}$

is involutory.

The Anti Block Diagonal Trick

For $n\times n$ matrices $A$ and $B$ consider the anti block diagonal matrix

$\notag X = \begin{bmatrix} 0 & A \\ B & 0 \end{bmatrix}.$

Note that

$\notag X^2 = \begin{bmatrix} AB & 0 \\ 0 & BA \end{bmatrix}, \quad X^{-1} = \begin{bmatrix} 0 & B^{-1} \\ A^{-1} & 0 \end{bmatrix}.$

Using these properties one can show a relation between the matrix sign function and the principal matrix square root:

$\notag \mathrm{sign}\left( \begin{bmatrix} 0 & A \\ I & 0 \end{bmatrix} \right) = \begin{bmatrix} 0 & A^{1/2} \\ A^{-1/2} & 0 \end{bmatrix}.$

This allows one to derive iterations for computing the matrix square root and its inverse from iterations for computing the matrix sign function.
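To illustrate, the sketch below applies the Newton iteration $X_{k+1} = (X_k + X_k^{-1})/2$ for the matrix sign function (one standard choice of sign iteration, assumed here rather than taken from the text) to the block matrix above and reads off $A^{1/2}$ and $A^{-1/2}$ . It assumes NumPy and a symmetric positive definite $A$ ; the fixed iteration count is a simplification in place of a convergence test.

```python
import numpy as np

rng = np.random.default_rng(12)
n = 3
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)   # symmetric positive definite, so the principal square root exists

X = np.block([[np.zeros((n, n)), A], [np.eye(n), np.zeros((n, n))]])
for _ in range(50):
    X = 0.5 * (X + np.linalg.inv(X))   # Newton step for the matrix sign function

sqrt_A = X[:n, n:]       # (1,2) block approximates A^{1/2}
inv_sqrt_A = X[n:, :n]   # (2,1) block approximates A^{-1/2}
print(np.linalg.norm(sqrt_A @ sqrt_A - A))
print(np.linalg.norm(inv_sqrt_A @ sqrt_A - np.eye(n)))
```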

It is easy to derive explicit formulas for all the powers of $X$ , and hence for any power series evaluated at $X$ . In particular, we have the formula

$\notag \mathrm{e}^X = \left[\begin{array}{cc} \cosh\sqrt{AB} & A (\sqrt{BA})^{-1} \sinh \sqrt{BA} \\[\smallskipamount] B(\sqrt{AB})^{-1} \sinh \sqrt{AB} & \cosh\sqrt{BA} \end{array}\right],$

where $\sqrt{Y}$ denotes any square root of $Y$ . With $B = I$ , this formula arises in the solution of the ordinary differential equation initial value problem $y'' + Ay = 0$ , $y(0)=y_0$ , $y'(0)=y'_0$ .

The most well known instance of the trick is when $B = A^T$ . The eigenvalues of

$\notag X = \begin{bmatrix} 0 & A \\ A^T & 0 \end{bmatrix}$

are plus and minus the singular values of $A$ , together with $|m-n|$ additional zeros if $A$ is $m\times n$ with $m \ne n$ , and the eigenvectors of $X$ and the singular vectors of $A$ are also related. Consequently, by applying results or algorithms for symmetric matrices to $X$ one obtains results or algorithms for the singular value decomposition of $A$ .

References

This is a minimal set of references, which contain further useful references within.

Gene Golub and Charles F. Van Loan, Matrix Computations, fourth edition, Johns Hopkins University Press, Baltimore, MD, USA, 2013.
Nicholas J. Higham, Functions of Matrices: Theory and Computation, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008. (Sections 1.5 and 1.6 for the theory of matrix square roots.)
Roger A. Horn and Charles R. Johnson, Matrix Analysis, second edition, Cambridge University Press, 2013. My review of the second edition.

What Is a Householder Matrix?

A Householder matrix is an $n\times n$ orthogonal matrix of the form

$\notag P = I - \displaystyle\frac{2}{v^Tv} vv^T, \qquad 0 \ne v \in\mathbb{R}^n.$

It is easily verified that $P$ is

orthogonal ( $P^TP = I$ ),
symmetric ( $P^T = P$ ),
involutory ( $P^2 = I$ , that is, $P$ is a square root of the identity matrix),

where the last property follows from the first two.

A Householder matrix is a rank- $1$ perturbation of the identity matrix and so all but one of its eigenvalues are $1$ . The eigensystem can be fully described as follows.

$P$ has an eigenvalue $-1$ with eigenvector $v$ , since $Pv = -v$ .
$P$ has $n-1$ eigenvalues $1$ with eigenvectors any set of $n-1$ linearly independent vectors orthogonal to $v$ , which can be taken to be mutually orthogonal: $Px = x$ for every such $x$ .

$P$ has trace $n-2$ and determinant $-1$ , as can be derived directly or deduced from the facts that the trace is the sum of the eigenvalues and the determinant is the product of the eigenvalues.

For $n = 2$ , a Householder matrix can be written as

$\notag P = \begin{bmatrix} \cos\theta & \sin\theta\\ \sin\theta & -\cos\theta \end{bmatrix}.$

Simple examples of Householder matrices are obtained by choosing $v = e = [1,1,\dots,1]^T$ , for which $P = I - (2/n)ee^T$ . For $n=2,3,4,5,6$ we obtain the matrices

$\notag \begin{gathered} \left[\begin{array}{@{\mskip2mu}rr@{\mskip2mu}} 0 & -1 \\ -1 & 0 \end{array}\right], \quad \displaystyle\frac{1}{3} \left[\begin{array}{@{\mskip2mu}rrr@{\mskip2mu}} 1 & -2 & -2\\ -2 & 1 & -2\\ -2 & -2 & 1\\ \end{array}\right], \quad \displaystyle\frac{1}{2} \left[\begin{array}{@{\mskip2mu}rrrr@{\mskip2mu}} 1 & -1 & -1 & -1\\ -1 & 1 & -1 & -1\\ -1 & -1 & 1 & -1\\ -1 & -1 & -1 & 1\\ \end{array}\right], \\ \displaystyle\frac{1}{5} \left[\begin{array}{@{\mskip2mu}rrrrr@{\mskip2mu}} 3 & -2 & -2 & -2 & -2\\ -2 & 3 & -2 & -2 & -2\\ -2 & -2 & 3 & -2 & -2\\ -2 & -2 & -2 & 3 & -2\\ -2 & -2 & -2 & -2 & 3 \end{array}\right], \quad \displaystyle\frac{1}{3} \left[\begin{array}{@{\mskip2mu}rrrrrr@{\mskip2mu}} 2 & -1 & -1 & -1 & -1 & -1\\ -1 & 2 & -1 & -1 & -1 & -1\\ -1 & -1 & 2 & -1 & -1 & -1\\ -1 & -1 & -1 & 2 & -1 & -1\\ -1 & -1 & -1 & -1 & 2 & -1\\ -1 & -1 & -1 & -1 & -1 & 2 \end{array}\right]. \end{gathered}$

Note that the $4\times 4$ matrix is $1/2$ times a Hadamard matrix.

Applying $P$ to a vector $x$ gives

$\notag Px = x - \displaystyle\left( \frac{2 v^Tx}{v^Tv} \right) v.$

This equation shows that $P$ reflects $x$ about the hyperplane ${\mathrm{span}}(v)^{\perp}$ , as illustrated in the following diagram, which explains why $P$ is sometimes called a Householder reflector. Another way of expressing this property is to write $x = \alpha v + z$ , where $z$ is orthogonal to $v$ . Then $Px = -\alpha v + z$ , so the component of $x$ in the direction $v$ has been reversed. If we take $v = e_i$ , the $i$ th unit vector, then $P = I - 2e_ie_i^T = \mathrm{diag}(1,1,\dots,-1,1,\dots,1)$ , which has $-1$ in the $(i,i)$ position. In this case premultiplying a vector by $P$ flips the sign of the $i$ th component.

Transforming a Vector

Householder matrices are powerful tools for introducing zeros into vectors. Suppose we are given vectors $x$ and $y$ and wish to find a Householder matrix $P$ such that $Px=y$ . Since $P$ is orthogonal, we require that $\|x\|_2 = \|y\|_2$ , and we exclude the trivial case $x = y$ . Now

$Px = y \quad \Longleftrightarrow \quad x - 2 \left( \displaystyle\frac{v^Tx}{v^Tv} \right) v = y,$

and this last equation has the form $\alpha v = x-y$ for some $\alpha$ . But $P$ is independent of the scaling of $v$ , so we can set $\alpha=1$ . Now with $v=x-y$ we have

$\notag v^Tv = x^Tx + y^Ty -2x^Ty$

and, since $x^Tx = y^Ty$ ,

$\notag v^Tx = x^Tx - y^Tx = \frac{1}{2} v^Tv.$

Therefore

$\notag Px = x - v = y,$

as required. Most often we choose $y$ to be zero in all but its first component.
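The construction can be sketched as follows, assuming NumPy. The helper name `householder` is hypothetical, and the sign of the first component of $y$ is chosen opposite to that of $x_1$ , a common convention (assumed here, not stated in the text) that avoids cancellation in forming $v = x - y$ .

```python
import numpy as np

def householder(v):
    """Return P = I - 2 v v^T / (v^T v) for a nonzero vector v."""
    v = np.asarray(v, dtype=float)
    return np.eye(v.size) - 2.0 * np.outer(v, v) / (v @ v)

x = np.array([3.0, 1.0, 5.0, 1.0])          # 2-norm equal to 6
y = np.zeros_like(x)
y[0] = -np.sign(x[0]) * np.linalg.norm(x)   # same 2-norm as x, zero below the first entry

v = x - y
P = householder(v)
print(P @ x)                      # agrees with y up to rounding
print(np.trace(P), np.linalg.det(P))
```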

Square Roots

What can we say about square roots of a Householder matrix, that is, matrices $X$ such that $X^2 = P$ ?

We note first that the eigenvalues of $X$ are the square roots of those of $P$ and so $n-1$ of them will be $\pm 1$ and one will be $\pm \mathrm{i}$ . This means that $X$ cannot be real, as the nonreal eigenvalues of a real matrix must appear in complex conjugate pairs.

Write $P = I - 2vv^T$ , where $v$ is normalized so that $v^Tv = 1$ . It is natural to look for a square root of the form $X = I - \theta vv^T$ . Setting $X^2 = P$ leads to the quadratic equation $\theta^2-2\theta + 2 = 0$ , and hence $\theta = 1 \pm \mathrm{i}$ . As expected, these two square roots are complex even though $P$ is real. As an example, $\theta = 1 - \mathrm{i}$ gives the following square root of the matrix above corresponding to $v = e/n^{1/2}$ with $n = 3$ :

$\notag X = \displaystyle\frac{1}{3} \left[\begin{array}{@{\mskip2mu}rrr} 2+\mathrm{i} & -1+\mathrm{i} & -1+\mathrm{i}\\ -1+\mathrm{i} & 2+\mathrm{i} & -1+\mathrm{i}\\ -1+\mathrm{i} & -1+\mathrm{i} & 2+\mathrm{i} \end{array}\right].$

A good way to understand all the square roots is to diagonalize $P$ , which can be done by a similarity transformation with a Householder matrix! Normalizing $v^Tv = 1$ again, let $w = v - e_1$ and $H = I - 2ww^T/(w^Tw)$ . Then from the construction above we know that $Hv = e_1$ . Hence

$\notag H^T\!PH = HPH = I - 2 Hv v^T\!H = I - 2 e_1e_1^T = \mathrm{diag}(-1,1,1,\dots,1)=: D.$

Then $P = HDH^T$ and so $X = H \sqrt{D} H^T$ gives $2^n$ square roots on taking all possible combinations of signs on the diagonal for $\sqrt{D}$ . Because $P$ has repeated eigenvalues these are not the only square roots. The infinitely many others are obtained by taking non-diagonal square roots of $D$ , which are of the form $\mathrm{diag}(\pm \mathrm{i}, Y)$ , where $Y$ is any non-diagonal square root of the $(n-1)\times (n-1)$ identity matrix, which in particular could be a Householder matrix!
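Both routes to a square root can be checked for $v = e/n^{1/2}$ with $n = 3$ . This is a sketch assuming NumPy; the particular sign pattern used for $\sqrt{D}$ is just one of the $2^n$ possibilities.

```python
import numpy as np

n = 3
v = np.ones(n) / np.sqrt(n)            # normalized so that v^T v = 1
P = np.eye(n) - 2 * np.outer(v, v)

# Square root of the form I - theta v v^T with theta = 1 - i.
X = np.eye(n) - (1 - 1j) * np.outer(v, v)

# Diagonalize P by a Householder similarity that maps v to e_1.
e1 = np.zeros(n)
e1[0] = 1.0
w = v - e1
H = np.eye(n) - 2 * np.outer(w, w) / (w @ w)
D = H @ P @ H                          # diag(-1, 1, 1) up to rounding

sqrt_D = np.diag([1j, 1.0, -1.0])      # one choice of signs on the diagonal
X2 = H @ sqrt_D @ H.T

print(np.linalg.norm(X @ X - P), np.linalg.norm(X2 @ X2 - P))
```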

Block Householder Matrix

It is possible to define an $n\times n$ block Householder matrix in terms of a given $Z\in\mathbb{R}^{n\times p}$ , where $n\ge p$ , as

$\notag P = I - 2 Z(Z^TZ)^+Z^T.$

Here, “ $+$ ” denotes the Moore–Penrose pseudoinverse. For $p=1$ , $P$ clearly reduces to a standard Householder matrix. It can be shown that $(Z^TZ)^+Z^T = Z^+$ (this is most easily proved using the SVD), and so

$P = I - 2 ZZ^+ = I - 2 P_Z,$

where $P_Z = ZZ^+$ is the orthogonal projector onto the range of $Z$ (that is, $\mathrm{range}(P_Z) = \mathrm{range}(Z)$ , $P_Z^2 = P_Z$ , and $P_Z = P_Z^T$ ). Hence, like a standard Householder matrix, $P$ is symmetric, orthogonal, and involutory. Furthermore, premultiplication of a matrix by $P$ has the effect of reversing the component in the range of $Z$ .

As an example, here is the block Householder matrix corresponding to $Z = \bigl[\begin{smallmatrix} 1 & 2 & 3 & 4\\ 5 & 6 & 7 & 8 \end{smallmatrix}\bigr]^T$ :

$\notag \displaystyle\frac{1}{5} \left[\begin{array}{@{\mskip2mu}rrrr@{\mskip2mu}} -2 & -4 & -1 & 2\\ -4 & 2 & -2 & -1\\ -1 & -2 & 2 & -4\\ 2 & -1 & -4 & -2 \end{array}\right].$

One can show (using the SVD again) that the eigenvalues of $P$ are $-1$ repeated $r$ times and $1$ repeated $n-r$ times, where $r = \mathrm{rank}(Z)$ . Hence $\mathrm{trace}(P) = n - 2r$ and $\det(P) = (-1)^r$ .
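The example and the trace formula can be reproduced in exact rational arithmetic. This is a minimal sketch that assumes $Z$ has full column rank, so that $Z^+ = (Z^TZ)^{-1}Z^T$ and the $2\times 2$ Gram matrix can be inverted explicitly; the helper names are hypothetical.

```python
from fractions import Fraction

def matmul(A, B):
    """Product of two dense matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

# Z = [1 2 3 4; 5 6 7 8]^T, stored as a 4-by-2 matrix of exact fractions.
Z = [[Fraction(a), Fraction(b)] for a, b in zip(range(1, 5), range(5, 9))]
Zt = transpose(Z)

# Invert the 2-by-2 Gram matrix Z^T Z by the adjugate formula
# (valid here because Z has full column rank, an assumption of this sketch).
G = matmul(Zt, Z)
det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
Ginv = [[G[1][1] / det, -G[0][1] / det],
        [-G[1][0] / det, G[0][0] / det]]

# Orthogonal projector onto range(Z), then the block Householder matrix.
PZ = matmul(matmul(Z, Ginv), Zt)
n = len(Z)
P = [[(1 if i == j else 0) - 2 * PZ[i][j] for j in range(n)]
     for i in range(n)]

trace_P = sum(P[i][i] for i in range(n))   # n - 2r with n = 4, r = 2
P_squared = matmul(P, P)                   # identity, since P is involutory
```

Because the arithmetic is exact, `P` can be compared entry by entry with the matrix displayed above.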

Schreiber and Parlett (1988) note the representation for $n = 2k$ ,

$\notag P = \pm \mathrm{diag}(Q_1,Q_2) \begin{bmatrix} \cos(2\Theta) & \sin(2\Theta) \\ \sin(2\Theta) & -\cos(2\Theta) \end{bmatrix} \mathrm{diag}(Q_1,Q_2)^T,$

where $Q_1$ and $Q_2$ are orthogonal and $\Theta$ is symmetric positive definite. This formula neatly generalizes the formula for a standard Householder matrix for $n = 2$ given above, and a similar formula holds for odd $n$ .

Schreiber and Parlett also show how given $E\in\mathbb{R}^{n\times p}$ ( $n > p$ ) one can construct a block Householder matrix $H$ such that

$\notag HE = \begin{bmatrix} F \\ 0 \end{bmatrix}, \qquad F \in \mathbb{R}^{p\times p}.$

The polar decomposition plays a key role in the theory and algorithms for such $H$ .

Rectangular Householder Matrix

We can define a rectangular Householder matrix as follows. Let $m > n$ , $u \in \mathbb{R}^n$ , $v \in \mathbb{R}^{m-n}$ , and

$\notag P = \begin{bmatrix} I_n\\0 \end{bmatrix} + \alpha \begin{bmatrix} u\\v \end{bmatrix}u^T = \begin{bmatrix} I_n + \alpha u u^T\\ \alpha vu^T \end{bmatrix} \in \mathbb{R}^{m\times n}.$

Then $P^TP = I$ , that is, $P$ has orthonormal columns, if

$\notag \alpha = \displaystyle\frac{-2}{u^Tu + v^Tv}.$

Of course, $P$ is just the first $n$ columns of the Householder matrix built from the vector $[u^T~v^T]^T$ .
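As a quick check of the orthonormality condition, here is a minimal sketch that builds $P$ for small, arbitrarily chosen $u$ and $v$ (illustrative values, not from any application) and forms $P^TP$ in exact arithmetic. The function name `rectangular_householder` is hypothetical.

```python
from fractions import Fraction

def rectangular_householder(u, v):
    """Return the m-by-n matrix [I + alpha*u*u^T; alpha*v*u^T] as a list of rows,
    with alpha = -2 / (u^T u + v^T v)."""
    n = len(u)
    alpha = Fraction(-2) / (sum(x * x for x in u) + sum(x * x for x in v))
    top = [[(1 if i == j else 0) + alpha * u[i] * u[j] for j in range(n)]
           for i in range(n)]
    bottom = [[alpha * v[i] * u[j] for j in range(n)] for i in range(len(v))]
    return top + bottom

# Illustrative data with n = 2 and m = 5.
u = [Fraction(1), Fraction(2)]
v = [Fraction(2), Fraction(0), Fraction(1)]
P = rectangular_householder(u, v)

# Gram matrix P^T P, which should be the 2-by-2 identity.
n = len(u)
gram = [[sum(P[k][i] * P[k][j] for k in range(len(P))) for j in range(n)]
        for i in range(n)]
```

Expanding $P^TP = I + \alpha\bigl(2 + \alpha(u^Tu + v^Tv)\bigr)uu^T$ shows why this value of $\alpha$ makes the second term vanish.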

Historical Note

The earliest appearance of Householder matrices is in the book by Turnbull and Aitken (1932). These authors show that if $\|x\|_2 = \|y\|_2$ ( $x\ne -y$ ) then a unitary matrix of the form $R = \alpha zz^* - I$ (in their notation) can be constructed so that $Rx = y$ . They use this result to prove the existence of the Schur decomposition. The first systematic use of Householder matrices for computational purposes was by Householder (1958) who used them to construct the QR factorization.

References

This is a minimal set of references, which contain further useful references within.

Massimiliano Fasi and Nicholas J. Higham, Generating Extreme-Scale Matrices with Specified Singular Values or Condition Numbers, MIMS EPrint 2020.8, Manchester Institute for Mathematical Sciences, The University of Manchester, UK, March 2020. (For the use of rectangular Householder matrices.)
Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms, second edition, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002. (Chapter 19.)
Nicholas J. Higham, Functions of Matrices: Theory and Computation, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008. (Sections 1.5 and 1.6 for the theory of matrix square roots.)
Robert S. Schreiber and Beresford N. Parlett, Block Reflectors: Theory and Computation, SIAM J. Numer. Anal. 25(1), 189–205, 1988.

What Is a Sparse Matrix?

A sparse matrix is one with a large number of zero entries. A more practical definition is that a matrix is sparse if the number or distribution of the zero entries makes it worthwhile to avoid storing or operating on the zero entries.

Sparsity is not to be confused with data sparsity, which refers to the situation where, because of redundancy, the data can be efficiently compressed while controlling the loss of information. Data sparsity typically manifests itself in low rank structure, whereas sparsity is solely a property of the pattern of nonzeros.

Important sources of sparse matrices include discretization of partial differential equations, image processing, optimization problems, and networks and graphs. In designing algorithms for sparse matrices we have several aims.

Store the nonzeros only, in some suitable data structure.
Avoid operations involving only zeros.
Preserve sparsity, that is, minimize fill-in (a zero element becoming nonzero).

We wish to achieve these aims without sacrificing speed, stability, or reliability.

An important class of sparse matrices is banded matrices. A matrix $A$ has bandwidth $p$ if the elements outside the main diagonal and the first $p$ superdiagonals and subdiagonals are zero, that is, if $a_{ij} = 0$ whenever $j>i+p$ or $i>j+p$ (equivalently, whenever $|i-j| > p$ ).

The most common type of banded matrix is a tridiagonal matrix ( $p = 1$ ), of which an archetypal example is the second-difference matrix, illustrated for $n = 5$ by

$\notag A_5 = \left[ \begin{array}{@{}*{4}{r@{\mskip10mu}}r} 2 & -1 & 0 & 0 & 0\\ -1 & 2 & -1 & 0 & 0\\ 0 & -1 & 2 & -1 & 0\\ 0 & 0 &-1 & 2 & -1\\ 0 & 0 & 0 & -1 & 2 \end{array}\right].$

This matrix (or more precisely its negative) corresponds to a centered finite difference approximation to a second derivative: $f''(x) \approx (f(x+h) -2 f(x) + f(x-h))/h^2$ .
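To illustrate both the storage idea and the finite difference connection, the following minimal sketch stores only the three diagonals of the second-difference matrix and applies it to samples of $f(x) = x^2$ . The function name `tridiag_matvec`, the step $h = 1/10$, and the grid are assumptions made for this example.

```python
from fractions import Fraction

def tridiag_matvec(sub, diag, sup, x):
    """Compute y = A*x for a tridiagonal A given by its subdiagonal,
    diagonal, and superdiagonal; only the nonzeros are stored."""
    n = len(diag)
    y = []
    for i in range(n):
        s = diag[i] * x[i]
        if i > 0:
            s += sub[i - 1] * x[i - 1]
        if i < n - 1:
            s += sup[i] * x[i + 1]
        y.append(s)
    return y

n = 5
sub, diag, sup = [-1] * (n - 1), [2] * n, [-1] * (n - 1)

h = Fraction(1, 10)                      # assumed grid spacing
xs = [h * (i + 1) for i in range(n)]     # equally spaced points
f = [x * x for x in xs]                  # samples of f(x) = x^2

Af = tridiag_matvec(sub, diag, sup, f)

# Interior rows of -A*f/h^2 are centered differences approximating f''.
# The first and last rows are skipped because they lack a boundary value.
second_diff = [-Af[i] / h**2 for i in range(1, n - 1)]
```

For a quadratic the centered difference is exact, so each entry of `second_diff` equals $f''(x) = 2$ . The matrix-vector product costs $O(n)$ operations instead of the $O(n^2)$ needed for a dense array.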

The following plots show the sparsity patterns for two symmetric positive definite matrices. Here, the nonzero elements are indicated by dots.

The matrices are both from power network problems and they are taken from the SuiteSparse Matrix Collection (https://sparse.tamu.edu/). The matrix names are shown in the titles and the nz values below the $x$ -axes are the numbers of nonzeros. The plots were produced using MATLAB code of the form

W = ssget('HB/494_bus'); A = W.A; spy(A)

where the ssget function is provided with the collection. The matrix on the left shows no particular pattern for the nonzero entries, while that on the right has a structure comprising four diagonal blocks with a relatively small number of elements connecting the blocks.

It is important to realize that while the sparsity pattern often reflects the structure of the underlying problem, it is arbitrary in that it will change under row and column reorderings. If we are interested in solving $Ax = b$ , for example, then for any permutation matrices $P$ and $Q$ we can form the transformed system $PAQ (Q^*x) = Pb$ , which has a coefficient matrix $PAQ$ having permuted rows and columns, a permuted right-hand side $Pb$ , and a permuted solution. We usually wish to choose the permutations to minimize the fill-in or (almost equivalently) the number of nonzeros in the factors $L$ and $U$ of an LU factorization of $PAQ$ . Various methods have been derived for this task; they are necessarily heuristic because finding the minimum is in general an NP-complete problem. When $A$ is symmetric we take $Q = P^T$ in order to preserve symmetry.

For the HB/494_bus matrix the symmetric reverse Cuthill–McKee permutation gives a reordered matrix with the following sparsity pattern, plotted with the MATLAB commands

r = symrcm(A); spy(A(r,r))

The reordered matrix has a variable band structure that is characteristic of the symmetric reverse Cuthill–McKee permutation. The number of nonzeros is, of course, unchanged by reordering, so what has been gained? The next plots show the Cholesky factors of the HB/494_bus matrix and the reordered matrix. The Cholesky factor for the reordered matrix has a much narrower bandwidth than that for the original matrix and has fewer nonzeros by a factor of 3. Reordering has greatly reduced the amount of fill-in that occurs; it leads to a Cholesky factor that is cheaper to compute and requires less storage.
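The effect of ordering on fill-in can be seen on a much smaller example than HB/494_bus. The sketch below is an illustration only: it uses a hypothetical $6\times 6$ "arrow" pattern (dense first row and column plus the diagonal) and a simple reversal of the ordering, not the reverse Cuthill–McKee algorithm, and it counts fill-in with a symbolic elimination on the sparsity pattern. The function name `cholesky_fill` is made up for this example.

```python
def cholesky_fill(pattern, n):
    """Count the fill-in entries created in the lower triangle by a symbolic
    Cholesky factorization of a symmetric pattern, given as a set of (i, j)."""
    adj = [set() for _ in range(n)]
    for i, j in pattern:
        if i != j:
            adj[i].add(j)
            adj[j].add(i)
    fill = 0
    for k in range(n):
        # Eliminating variable k couples all of its not-yet-eliminated neighbours.
        later = sorted(j for j in adj[k] if j > k)
        for a in range(len(later)):
            for b in range(a + 1, len(later)):
                i, j = later[a], later[b]
                if j not in adj[i]:
                    adj[i].add(j)
                    adj[j].add(i)
                    fill += 1
    return fill

n = 6
# Arrow pattern: diagonal plus a dense first row and column.
arrow = ({(i, i) for i in range(n)}
         | {(0, j) for j in range(1, n)}
         | {(j, 0) for j in range(1, n)})

# Symmetric permutation that reverses the ordering (Q = P^T).
perm = list(reversed(range(n)))
new_index = {old: new for new, old in enumerate(perm)}
reordered = {(new_index[i], new_index[j]) for i, j in arrow}

fill_original = cholesky_fill(arrow, n)       # factor becomes completely dense
fill_reordered = cholesky_fill(reordered, n)  # no fill-in at all
```

With the dense row and column first, eliminating the first variable couples all the others, so the factor fills in completely; with them last, the factor has exactly the sparsity of the lower triangle of the matrix.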

Because Cholesky factorization is numerically stable, the matrix can be permuted without affecting the numerical stability of the computation. For a nonsymmetric problem the choice of row and column interchanges also needs to take into account the need for numerical stability, which complicates matters.

The world of sparse matrix computations is very different from that for dense matrices. In the first place, sparse matrices are not stored as $n\times n$ arrays, but rather just the nonzeros are stored, in some suitable data structure. Programming sparse matrix computations is, consequently, more difficult than for dense matrix computations. A second difference from the dense case is that certain operations are, for practical purposes, forbidden. Most notably, we never invert sparse matrices because of the possibly severe fill-in. Indeed the inverse of a sparse matrix is usually dense. For example, the inverse of the tridiagonal matrix given at the start of this article is

$\notag A_5^{-1} = \displaystyle\frac{1}{6} \begin{bmatrix} 5 & 4 & 3 & 2 & 1\\ 4 & 8 & 6 & 4 & 2\\ 3 & 6 & 9 & 6 & 3\\ 2 & 4 & 6 & 8 & 4\\ 1 & 2 & 3 & 4 & 5 \end{bmatrix}.$

While it is always true that one should not solve $Ax = b$ by forming $x = A^{-1} \times b$ , for reasons of cost and numerical stability (unless $A$ is orthogonal!), it is even more true when $A$ is sparse.

Finally, we mention an interesting property of $A_5^{-1}$ . Apart from the scalar factor $1/6$ , its upper triangle agrees with the upper triangle of the rank- $1$ matrix

$\notag \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{bmatrix} \begin{bmatrix} 5 & 4 & 3 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 5 & 4 & 3 & 2 & 1\\ 10 & 8 & 6 & 4 & 2\\ 15 & 12& 9 & 6 & 3\\ 20 & 16& 12& 8 & 4\\ 25 & 20& 15& 10& 5 \end{bmatrix}.$

This property generalizes to other tridiagonal matrices. So while a tridiagonal matrix is sparse, its inverse is data sparse, as it has to be because in general a symmetric tridiagonal $A$ depends on $2n-1$ parameters and hence so does $A^{-1}$ . One implication of this property is that it is possible to compute the condition number $\kappa_{\infty}(A) = \|A\|_{\infty} \|A^{-1}\|_{\infty}$ of a tridiagonal matrix in $O(n)$ flops.
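To see how the rank-1 structure leads to an $O(n)$ computation, here is a minimal sketch for the second-difference matrix $A_5$ only. It assumes the generating vectors $x = [1,\dots,5]$ and $y = [5,\dots,1]$ and the scaling $1/6$ read off from the displays above, and it relies on all entries of $A_5^{-1}$ being positive; it is not the general algorithm of Hargreaves (2006).

```python
from fractions import Fraction

n = 5
x = list(range(1, n + 1))        # [1, 2, 3, 4, 5]
y = list(range(n, 0, -1))        # [5, 4, 3, 2, 1]
scale = Fraction(1, n + 1)       # the factor 1/6

# Dense inverse, built only to check the O(n) computation below:
# the upper triangle is scale*x[i]*y[j]; the lower follows by symmetry.
inv = [[scale * (x[i] * y[j] if i <= j else x[j] * y[i]) for j in range(n)]
       for i in range(n)]

# Row sums of the inverse in O(n) using running sums, without forming inv.
prefix_x = 0                     # sum of x[j] for j < i
suffix_y = sum(y)                # sum of y[j] for j >= i
row_sums = []
for i in range(n):
    row_sums.append(scale * (y[i] * prefix_x + x[i] * suffix_y))
    prefix_x += x[i]
    suffix_y -= y[i]

norm_inv = max(row_sums)         # infinity-norm of the inverse (entries positive)
# Infinity-norm of A_5 from its three diagonals: |2| plus one or two |-1| terms.
norm_A = max(2 + (1 if i > 0 else 0) + (1 if i < n - 1 else 0) for i in range(n))
kappa_inf = norm_A * norm_inv
```

For this matrix the computation gives $\|A_5^{-1}\|_\infty = 9/2$ and hence $\kappa_\infty(A_5) = 18$ .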

References

This is a minimal set of references, which contain further useful references within.

Timothy A. Davis, Direct Methods for Sparse Linear Systems, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2006.
Timothy A. Davis, Sivasankaran Rajamanickam, and Wissam M. Sid-Lakhdar, A Survey of Direct Methods for Sparse Linear Systems, Acta Numerica 25, 383–566, 2016.
Timothy A. Davis and Yifan Hu, The University of Florida Sparse Matrix Collection, ACM Trans. Math. Software 38 (1), 1:1–1:25, 2011. Note: this collection is now called the SuiteSparse Matrix Collection.
Gareth I. Hargreaves, Computing the Condition Number of Tridiagonal and Diagonal-Plus-Semiseparable Matrices in Linear Time, SIAM J. Matrix Anal. Appl. 27, 801–820, 2006.
Gérard Meurant, A Review on the Inverse of Symmetric Tridiagonal and Block Tridiagonal Matrices, SIAM J. Matrix Anal. Appl. 13, 707–728, 1992.
Yousef Saad, Iterative Methods for Sparse Linear Systems, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2003.

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.
