What Is a Rank-Revealing Factorization?

In many applications a matrix A\in\mathbb{R}^{m\times n} has less than full rank, that is, r = \mathrm{rank}(A) < \min(m,n). Sometimes, r is known, and a full-rank factorization A = GH with G\in\mathbb{R}^{m \times r} and H\in\mathbb{R}^{r \times n}, both of rank r, is given—especially when r = 1 or r = 2. Often, though, the rank r is not known. Moreover, rather than being of exact rank r, A is merely close to a rank r matrix because of errors from various possible sources.

What is usually wanted is a factorization that displays how close A is to having particular ranks and provides an approximation to the range space of a lower rank matrix. The ultimate tool for providing this information is the singular value decomposition (SVD)

\notag     A = U\Sigma V^T, \quad    \Sigma = \mathrm{diag}(\sigma_1,\dots, \sigma_p)\in\mathbb{R}^{m\times n},

where p = \min(m,n), \sigma_1\ge \sigma_2\ge \cdots \ge \sigma_p \ge 0, and U\in\mathbb{R}^{m\times m} and V\in\mathbb{R}^{n\times n} are orthogonal. The Eckart–Young theorem says that

\notag  \min_{\mathrm{rank}(B) = k} \|A-B\|_q =  \begin{cases}      \sigma_{k+1},                                & q = 2,  \\      \Bigl(\sum_{i=k+1}^r \sigma_i^2\Bigr)^{1/2}, & q = F,   \end{cases}

and that the minimum is attained at

\notag    A_k = U \Sigma_k V^T, \quad    \Sigma_k = \mathrm{diag}(\sigma_1, \dots, \sigma_k, 0, \dots, 0),

so A_k is the best rank-k approximation to A in both the 2-norm and the Frobenius norm.

Although the SVD is expensive to compute, it may not be significantly more expensive than alternative factorizations. However, the SVD is expensive to update when a row or column is added to or removed from the matrix, as happens repeatedly in signal processing applications.

Many different definitions of a rank-revealing factorization have been given, and they usually depend on a particular matrix factorization. We will use the following general definition.

Definition 1. A rank-revealing factorization (RRF) of A\in\mathbb{R}^{m\times n} is a factorization

\notag   A = XDY^T, \quad   X\in\mathbb{R}^{m\times p}, \quad   D\in\mathbb{R}^{p\times p}, \quad   Y\in\mathbb{R}^{n\times p},

where p \le \min(m,n), D is diagonal and nonsingular, and X and Y are well conditioned.

An RRF concentrates the rank deficiency and ill condition of A into the diagonal matrix D. An RRF clearly exists, because the SVD is one, with X and Y having orthonormal columns and hence being perfectly conditioned. Justification for this definition comes from a version of Ostrowski’s theorem, which shows that

\notag     \sigma_i(A) = \theta_i \sigma_i(D), \quad i = 1\colon \min(m,n),      \qquad (1)

where \sigma_p(X)\sigma_p(Y) \le \theta_i \le \sigma_1(X) \sigma_1(Y). Hence as long as X and Y are well conditioned, the singular values are good order of magnitude approximations to those of A up a scale factor.

Without loss of generality we can assume that

\notag   D = \mathrm{diag}(d_i), \quad   |d_1| \ge |d_2| \ge \cdots \ge |d_p|

(since XDY^T = XP\cdot P^TDP \cdot P^T Y^T for any permutation matrix P and the second expression is another RRF). For \widetilde{A}_k = X \mathrm{diag}(d_1,\dots,d_k,0,\dots,0)Y^T we have

\notag   \|A - \widetilde{A}_k\| \le \|X\| \|Y\|   \|\mathrm{diag}(0,\dots,0,d_{k+1},\dots,d_p)\|,

so A is within distance of order |d_{k+1}| from the rank-k matrix \widetilde{A}_k, which is the same order as the distance to the nearest rank-k matrix if |d_{k+1}| \approx \sigma_{k+1}.

Definition 2 is a strong requirement, since it requires all the singular values of A to be well approximated by the (scaled) diagonal elements of D. We will investigate below how it compares with another definition of RRF.

Numerical Rank

An RRF helps to determine the numerical rank, which we now define.

Definition 2. For a given \epsilon > 0 the numerical rank of A is the largest integer k such that \sigma_k > \epsilon.

By the Eckart–Young theorem, the numerical rank is the smallest rank attained over all A+E with \|E\|_2 \le \epsilon. For the numerical rank to be meaningful in the sense that it is unchanged if \epsilon is perturbed slightly, we need \epsilon not to be too close to \sigma_k or \sigma_{k+1}, which means that there must be a significant gap between these two singular values.

QR Factorization

One might attempt to compute an RRF by using a QR factorization A = QR, where Q\in\mathbb{R}^{m\times n} has orthonormal columns, R\in\mathbb{R}^{n\times n} is upper triangular, and we assume that m\ge n. In Definition 1, we can take

\notag   X = I, \quad D = \mathrm{diag}(R), \quad Y^T = D^{-1}R. \qquad (*)

However, it is easy to see that QR factorization in its basic form is flawed as a means for computing an RRF. Consider the matrix

\notag   A = \begin{bmatrix}        0 & 1  & 0 &0\\        0 & 0  & 1 &0 \\        0 & 0  & 0 &1\\        0 & 0  & 0 &0   \end{bmatrix},

which is a Jordan block with zero eigenvalue. This matrix is its own QR factorization (R = A), and the prescription (*) gives D = 0, so A \ne XDY^T. The essential problem is that the diagonal of R has no connection with the nonzero singular values of A. What is needed are column permutations: A\Pi = \mathrm{diag}(1,1,1,0) for the permutation matrix \Pi that reorders [a_1,a_2,a_3,a_4] to [a_2,a_3,a_4,a_1], and this is a perfect RRF with X = Y = I.

For a less trivial example, consider the matrix

\notag   A = \left[\begin{array}{rrrr}        1 & 1  &\theta &0\\        1 & -1 & 2 &1 \\        1 & 0  &1+\theta &-1\\        1 &-1  & 2 &-1   \end{array}\right], \quad \theta = 10^{-8}. \qquad (\dagger)

Computing the QR factorization we obtain

R =
  -2.0000e+00   5.0000e-01  -2.5000e+00   5.0000e-01
            0   1.6583e+00  -1.6583e+00  -1.5076e-01
            0            0  -4.2640e-09   8.5280e-01
            0            0            0  -1.4142e+00

The (3,3) element tells us that A is within distance about 4\times 10^{-9} of being rank deficient and so has a singular value bounded above by this quantity, but it does not provide any information about the next larger singular value. Moreover, in (*), \kappa_2(Y) is of order 10^{16} for this factorization. We need any small diagonal elements to be in the bottom right-hand corner, and to achieve this we need to introduce column permutations to move the “dependent columns” to the end.

QR Factorization With Column Pivoting

A common method for computing an RRF is QR factorization with column pivoting, which for a matrix A\in\mathbb{R}^{m\times n} with m\ge n computes a factorization A\Pi = QR, where \Pi is a permutation matrix, Q\in\mathbb{R}^{m\times n} has orthonormal columns, and R\in\mathbb{R}^{n\times n} is upper triangular and satisfies the inequalities

\notag    |r_{kk}|^2 \ge \displaystyle\sum_{i=k}^j |r_{ij}|^2,     \quad j=k+1\colon n, \quad k=1\colon n. \qquad (2)

In particular,

\notag   |r_{11}| \ge |r_{22}| \ge \cdots \ge |r_{nn}|.   \qquad(3)

If |r_{kk}| \ge \epsilon \ge |r_{k+1,k+1}| with \epsilon > 0 then we can write

\notag    R =    \begin{array}[b]{@{\mskip33mu}c@{\mskip-16mu}c@{\mskip-10mu}c@{}}    \scriptstyle k &    \scriptstyle n-k &    \\    \multicolumn{2}{c}{        \left[\begin{array}{c@{~}c@{~}}                  R_{11}& R_{12} \\                    0   & R_{22} \\              \end{array}\right]}    & \mskip-12mu\          \begin{array}{c}              \scriptstyle k \\              \scriptstyle n-k              \end{array}    \end{array},  \qquad(4)

with

\notag   \|R_{22}\|_2 \le \|R_{22}\|_F \le 2^{-1/2}(n-k+1)\epsilon.

Hence R is within 2-norm distance 2^{-1/2}(n-k+1)\epsilon of the rank-k matrix \left[\begin{smallmatrix} R_{11} & R_{12} \\ 0 & 0 \end{smallmatrix}\right]. Note that if Q = [Q_1~Q_2] is partitioned conformally with Q in (4) then

\notag   A\Pi =   \begin{bmatrix}    Q_1 & Q_2   \end{bmatrix}   \begin{bmatrix}    R_{11} & R_{12} \\     0     & R_{22} \\   \end{bmatrix}    = Q_1   \begin{bmatrix}    R_{11} & R_{12}   \end{bmatrix}   + \begin{bmatrix}    0 & Q_2 R_{22}   \end{bmatrix},

so \| A\Pi - Q_1 [R_{11}~R_{12}]\|_2 \le \|R_{22}\|_2, which means that Q_1 provides an O(\epsilon) approximation to the range of A.

To assess how good an RRF this factorization is (with p = n) we write it as

\notag   A = QR\Pi^T = Q D Y^T, \quad D = \mathrm{diag}(r_{ii}),                          \quad Y^T = D^{-1}R \Pi^T. \quad (\#)

Applying (1) gives

\notag     \sigma_i(A) = \theta_i \sigma_i(D), \quad i = 1\colon p, \qquad (5)

where \sigma_n(Y)\le \theta_i \le \sigma_1(Y), since Q has orthonormal columns and so has unit singular values. Now D^{-1}R has unit diagonal and, in view of (2), its off-diagonal elements are bounded by 1. Therefore \sigma_1(Y) = \|Y\|_2 \le \|Y\|_F \le (n(n+1)/2)^{1/2}. On the other hand, \sigma_p(Y)^{-1} \le 2^{n-1} by Theorem 1 in Bounds for the Norm of the Inverse of a Triangular Matrix. Therefore

\notag    2^{1-n} \le \theta_i \le (n(n+1)/2)^{1/2}.

The lower bound is an approximate equality for small \tau for the triangular matrix

\notag        R_n(\theta) = \mathrm{diag}(1,s,\dots,s^{n-1})             \begin{bmatrix} 1 & -c & -c     & \dots & -c \\                               & 1  & -c     & \dots & -c \\                               &    & \ddots &\ddots & \vdots \\                               &    &        &\ddots & -c \\                               &    &        &       &  1 \end{bmatrix},      \quad c=\cos\tau, \quad s=\sin\tau,

devised by Kahan, which is invariant under QR factorization with column pivoting. Therefore QR factorization with column pivoting is not guaranteed to reveal the rank, and indeed it can fail to do so by an exponentially large factor.

For the matrix (\dagger), QR with column pivoting reorders A to A\Pi = [a_3,~a_4,~a_2,~a_1] and yields

R =
  -3.0000e+00   3.3333e-01   1.3333e+00  -1.6667e+00
            0  -1.6997e+00   2.6149e-01   2.6149e-01
            0            0   1.0742e+00   1.0742e+00
            0            0            0   3.6515e-09

This R suggests a numerical rank of 3 for \epsilon = 10^{-8} (say). In fact, this factorization provides a very good RRF, as in (\#) we have \kappa_2(Y) \approx 3.4.

QR Factorization with Other Pivoting Choices

Consider a QR factorization A\Pi = QR with triangular factor partitioned as

\notag    R =    \begin{array}[b]{@{\mskip33mu}c@{\mskip-16mu}c@{\mskip-10mu}c@{}}    \scriptstyle k &    \scriptstyle n-k &    \\    \multicolumn{2}{c}{        \left[\begin{array}{c@{~}c@{~}}                  R_{11}& R_{12} \\                    0   & R_{22} \\              \end{array}\right]}    & \mskip-12mu\          \begin{array}{c}              \scriptstyle k \\              \scriptstyle n-k              \end{array}    \end{array}. \qquad (6)

We have

\notag    \begin{aligned}      \sigma_{\min}(R_{11}) &\le \sigma_k(A),    \quad \qquad (7)\\      \sigma_{\max}(R_{22}) &\ge \sigma_{k+1}(A), ~\qquad (8)    \end{aligned}

where (7) is from singular value interlacing inequalities and (8) follows from the Eckart-Young theorem, since setting R_{22} to zero gives a rank-k matrix. Suppose A has numerical rank k and \sigma_{k+1} \ll \sigma_k. We would like to be able to detect this situation from R, so clearly we need

\notag  \sigma_{\min}(R_{11}) \approx \sigma_k(A), \quad  \sigma_{\max}(R_{22}) \approx \sigma_{k+1}(A). \qquad (9)

In view of the inequalities (7) and (8) this means that we wish to choose \Pi maximize \sigma_{\min}(R_{11}) and minimize \sigma_{\max}(R_{22}).

Some theoretical results are available on the existence of such QR factorizations. First, we give a result that shows that for k = n-1 the approximations in (9) can hold to within a factor n^{1/2}.

Theorem 1. For A\in\mathbb{R}^{m\times n} with m\ge n there exists a permutation \Pi such that A has the QR factorization A\Pi = QR with |r_{nn}| \le n^{1/2}\sigma_n(A) and \sigma_{\min}(R_{11}) \ge n^{-1/2} \sigma_{n-1}(A), where R_{11} = R(1\colon n-1, 1\colon n-1).

Proof. Let Av = \sigma_n u, with \|v\|_2 = \|u\|_2 = 1 and let \Pi^T be such that \widetilde{v}  = \Pi^Tv satisfies |\widetilde{v}_n| = \|\widetilde{v}\|_{\infty}. Then if A\Pi = QR is a QR factorization,

\notag \begin{aligned}    \sigma_n = \| \sigma_n u \|_2             = \| Av \|_2             = \| QR\Pi^Tv \|_2             = \| R\mskip1mu \widetilde{v} \|_2             \ge | r_{nn} \widetilde{v}_n |             \ge n^{-1/2} | r_{nn} |, \end{aligned}

since \|\widetilde{v}\|_2 = 1, which yields the result.

Next, we write \Pi = [\Pi_1~\pi], where \pi\in\mathbb{R}^n, and partition

\notag   R = \begin{bmatrix}    R_{11} & R_{12} \\     0     & R_{22} \\   \end{bmatrix}

with R_{11}\in\mathbb{R}^{(n-1)\times (n-1)}. Then

A\Pi_1 = Q \begin{bmatrix} R_{11} \\ 0 \end{bmatrix}

implies \sigma_{\min}(A\Pi_1) = \sigma_{\min}(R_{11}). On the other hand, if A = U\Sigma V^T is an SVD with U\in\mathbb{R}^{m\times n}, \Sigma = \mathrm{diag}(D_1,\sigma_n)\in\mathbb{R}^{n\times n}, and V = [V_1~v] then

\notag  A\Pi_1 = U\Sigma V^T \Pi_1 =    U    \begin{bmatrix}    D_1    & 0 \\     0     & \sigma_n \\   \end{bmatrix}   \begin{bmatrix}    V_1^T \\ v^T   \end{bmatrix} \Pi_1   = U   \begin{bmatrix}    D_1V_1^T \\ \sigma_n v^T   \end{bmatrix} \Pi_1,

so

\notag   \sigma_{\min}(A\Pi_1) =   \sigma_{\min}\left(   \begin{bmatrix}    D_1V_1^T \\ \sigma_n v^T   \end{bmatrix} \Pi_1 \right)    \ge \sigma_{\min}(D_1V^T\Pi_1)    \ge \sigma_{n-1}\sigma_{\min}(V^T\Pi_1).

Finally, we note that we can partition the orthogonal matrix V^T\Pi_1 as

\notag  V^T\Pi =    \begin{bmatrix}    V_1^T\Pi_1  & V_1^T\pi \\    v^T\Pi_1    & v^T\pi   \end{bmatrix},

and the CS decomposition implies that

\notag    \sigma_{\min}(V_1^T\Pi_1) =    \sigma_{\min}(v^T\pi) =    |v^T\pi| = |\widetilde{v}_n| \ge n^{-1/2}.

Hence \sigma_{\min}(R_{11}) \ge n^{-1/2} \sigma_{n-1}, as required. ~\square

Theorem 1 is a special case of the next result of Hong and Pan (1992).

Theorem 2. For A\in\mathbb{R}^{m\times n} with m\ge n and any k there exists a permutation matrix \Pi such that A has the QR factorization A\Pi = QR where, with R partitioned as in (6),

\notag    \sigma_{\max}(R_{22}) \le f(k,n) \sigma_{k+1}(A), \quad    \sigma_{\min}(R_{11}) \ge f(k,n)^{-1} \sigma_k(A),

where f(k,n) = (k(n-k) + \min(k,n-k))^{1/2}.

The proof of Theorem 2 is constructive and chooses \Pi to move a submatrix of maximal determinant of V_2 to the bottom of V_2, where V_2 comprises the last n-k columns of the matrix of right singular vectors.

Theorem 2 shows the existence of an RRF up to the factor f(k,n) \le (n+1)/2, but it does not provide an efficient algorithm for computing one.

How do the conditions (9) relate to Definition 1? Since algorithms for computing an RRF are usually not specialized to any particular k, it is reasonable to ask that (9) holds for all k. We consider what can be said if the first condition in (9) holds as an equality for all k.

Lemma 3. If R\in\mathbb{R}^{m\times n} is upper triangular and satisfies \sigma_{\min}(R_{11}) = \sigma_k(R) for k=1\colon n-1 in the partitioning (6) then R is diagonal.

Proof. The proof is by induction. Assume that R_{k-1} = \mathrm{diag}(\sigma_1(R),\dots,\sigma_{k-1}(R)). This is clearly true for k = 2, since |r_{11}| = \sigma_1(R). Write

\notag       R_k =       \begin{bmatrix}       R_{k-1} & z \\         0     & \rho       \end{bmatrix}.

Then |\rho| \ge \sigma_{\min}(R_k) by (8). Also \sigma_i(R_k) \le \sigma_i(R) for i = 1\colon k by standard singular value inequalities. We therefore have

\notag \begin{aligned}    \sum_{i=1}^k \sigma_i(R)^2    \ge \sum_{i=1}^k \sigma_i(R_k)^2     = \|R_k\|_F^2     = \|R_{k-1}\|_F^2  + \|z\|_2^2 + \rho^2\\     = \sum_{i=1}^{k-1} \sigma_i(R)^2 + \|z\|_2^2 + \rho^2     \ge \sum_{i=1}^k \sigma_i(R)^2 + \|z\|_2^2. \end{aligned}

It follows that z = 0, and so R_k is diagonal, which completes the induction.

We conclude from the lemma that if the first condition in (9) is an equality for all k then we have a perfect RRF A\Pi = QR with R diagonal. Therefore if the approximations in (9) are reasonably good for all k we should have a reasonably good RRF.

Much work has been done on algorithms that choose the permutation matrix \Pi in a different way to column pivoting or post-process a QR factorization with column pivoting, with the aim of satisfying (9) at reasonable cost. Typically, these algorithms involve estimating singular values and singular vectors. We are not aware of any algorithm that is guaranteed to satisfy (9) and requires only O(n^3) flops.

UTV Decomposition

By applying Householder transformations on the right, a QR factorization with column pivoting can be turned into a complete orthogonal decomposition of A\in\mathbb{R}^{m\times n}, which has the form

\notag      A = U \begin{bmatrix} T & 0 \\ 0 & 0 \end{bmatrix} V^T, \qquad (10)

where T\in\mathbb{R}^{r \times r} is upper triangular and U\in\mathbb{R}^{m\times m} and V\in\mathbb{R}^{n\times n} are orthogonal. Stewart (1998) calls (6) with T upper triangular or lower triangular a UTV decomposition and he defines a rank-revealing UTV decomposition of numerical rank r by

\notag \begin{aligned}      A &= U \begin{bmatrix} T & F \\ 0 & G  \end{bmatrix} V^T,          \qquad T\in\mathbb{R}^{r \times r}, \\         &          \sigma_r(T) \approx \sigma_r(A), \quad          \|F\|_F^2 + \|G\|_F^2 \approx \sigma_{r+1}^2 + \cdots + \sigma_n^2. \end{aligned}

The UTV decomposition is easy to update (when a row is added) and downdate (when a row is removed) using Givens rotations and it is suitable for parallel implementation. Initial determination of the UTV decomposition can be done by applying the updating algorithm as the rows are brought in one at a time.

LU Factorization

Instead of QR factorization we can build an RRF from an LU factorization with pivoting. For A\in\mathbb{R}^{m\times n} with m\ge n, let

\notag     \Pi_1 A \Pi_2 = LU =     \begin{bmatrix}       L_{11} & 0 \\       L_{12} & L_{22}     \end{bmatrix}     \begin{bmatrix}     U_{11} & U_{12}\\       0    & U_{22}     \end{bmatrix},

where \Pi_1 and \Pi_2 are permutation matrices, L and U are m\times n lower and n\times n upper triangular, respectively, and L_{11} and U_{11} are k\times k. Analogously to (7) and (8), we always have \sigma_{\min}(L_{11}U_{11}) \le \sigma_k(A) and \sigma_{\max}(L_{22}U_{22}) \ge \sigma_{k+1}(A). With a suitable pivoting strategy we can hope that \sigma_{\min}(L_{11}U_{11}) \approx \sigma_k(A) and \sigma_{\max}(L_{22}U_{22}) \approx \sigma_{k+1}(A).

A result of Pan (2000) shows that an RRF based on LU factorization always exists up to a modest factor f(k,n). This is analogue for LU factorization of Theorem 2.

Theorem 3 For A\in\mathbb{R}^{m\times n} with m\ge n and any k there exist permutation matrices \Pi_1 and \Pi_2 such that

\notag     \Pi_1 A \Pi_2 = LU =     \begin{bmatrix}     L_{11} &  0    \\     L_{12} & I_{m-k,n-k}     \end{bmatrix}    \begin{array}[b]{@{\mskip33mu}c@{\mskip-16mu}c@{\mskip-10mu}c@{}}    \scriptstyle k &    \scriptstyle n-k &    \\    \multicolumn{2}{c}{        \left[\begin{array}{c@{~}c@{~}}                  U_{11}& U_{12} \\                    0   & U_{22} \\              \end{array}\right]}    & \mskip-12mu\          \begin{array}{c}              \scriptstyle k \\              \scriptstyle n-k              \end{array}    \end{array},

where L_{11} is unit lower triangular, U_{11} is upper triangular, and

\notag    \sigma_{\max}(U_{22}) \le f(k,n) \sigma_{k+1}(A), \quad    \sigma_{\min}(L_{11}U_{11}) \ge f(k,n)^{-1} \sigma_k(A),

where f(k,n) = k(n-k) + 1.

Again the proof is constructive, but the permutations it chooses are too expensive to compute. In practice, complete pivoting often yields a good RRF.

In terms of Definition 1, an RRF has

\notag     X = \Pi_1^TL D, \quad D = \mathrm{diag}(u_{ii}), \quad     Y^T = D^{-1}U\Pi_2. \qquad (\ddagger)

For the matrix (\dagger), the U factor for LU factorization without pivoting is

U =
   1.0000e+00   1.0000e+00   1.0000e-08            0
            0  -2.0000e+00   2.0000e+00   1.0000e+00
            0            0   5.0000e-09  -1.5000e+00
            0            0            0  -2.0000e+00

As for QR factorization without pivoting, an RRF is not obtained from (\ddagger).. However, with complete pivoting we obtain

U =
   2.0000e+00   1.0000e+00  -1.0000e+00   1.0000e+00
            0  -2.0000e+00            0            0
            0            0   1.0000e+00   1.0000e+00
            0            0            0  -5.0000e-09

which yields a very good RRF (\ddagger) with \kappa_2(X) = 3.5 and \kappa_2(Y) = 3.4.

Notes

QR factorization with column pivoting is difficult to implement efficiently, as the criterion for choosing the pivots requires the norms of the active parts of the remaining columns and this requires a significant amount of data movement. In recent years, randomized RRF algorithms have been developed that use projections with random matrices to make pivot decisions based on small sample matrices and thereby reduce the amount of data movement. See, for example, Martinsson et al. (2019).

References

This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s