Matrix Rank Relations

Matrix rank is an important concept in linear algebra. While rank deficiency can be a sign of an incompletely or improperly specified problem (a singular system of linear equations, for example), in some problems low rank of a matrix is a desired property or outcome. Here we present some fundamental rank relations in a concise form useful for reference. These are all immediate consequences of the singular value decomposition (SVD), but we give elementary (albeit not entirely self-contained) proofs of them.

The rank of a matrix $A\in\mathbb{R}^{m\times n}$ is the maximum number of linearly independent columns, which is the dimension of the range space of $A$, $\mathrm{range}(A) = \{\, Ax: x \in\mathbb{R}^n \,\}$. An important but non-obvious fact is that this is the same as the maximum number of linearly independent rows (see (5) below).

A rank-$1$ matrix has the form $xy^*$, where $x$ and $y$ are nonzero vectors. Every column is a multiple of $x$ and every row is a multiple of $y^*$. A sum of $k$ rank-$1$ matrices has the form

$\notag A = \displaystyle\sum_{i=1}^{k} x_iy_i^* = \begin{bmatrix} x_1 & x_2 & \dots & x_k \end{bmatrix} \begin{bmatrix} y_1^* \\ y_2^* \\ \vdots\\ y_k^* \end{bmatrix} \equiv XY^*. \qquad (0)$

Each column of $A$ is a linear combination of the vectors $x_1$, $x_2$, …, $x_k$, so $A$ has at most $k$ linearly independent columns, that is, $A$ has rank at most $k$. In fact, $\mathrm{rank}(A) = k$ if $X$ and $Y$ have rank $k$, as follows from (4) below. Any rank-$k$ matrix can be written in the form $(0)$ with $X$ and $Y$ of rank $k$; indeed this is the full-rank factorization below.

Here are some fundamental rank equalities and inequalities.

Rank-Nullity Theorem

The rank-nullity theorem says that

$\notag \boxed{ \mathrm{rank}(A) + \mathrm{dim}( \mathrm{null}(A) ) = n, \quad A\in\mathbb{R}^{m\times n},}$

where $\mathrm{null}(A) = \{\, x \in\mathbb{R}^n: Ax = 0 \,\}$ is the null space of $A$.

Rank Bound

The rank cannot exceed the number of columns, or, by (5) below, the number of rows:

$\notag \boxed{ \mathrm{rank}(A) \le \min(m,n), \quad A\in\mathbb{C}^{m\times n}. }$

Rank of a Sum

For any $A$ and $B$ of the same dimension,

$\notag \boxed{|\mathrm{rank}(A) - \mathrm{rank}(B)| \le \mathrm{rank}(A+B) \le \mathrm{rank}(A) + \mathrm{rank}(B).} \qquad (1)$

The upper bound follows from the fact that the dimension of the sum of two subspaces cannot exceed the sum of the dimensions of the subspaces. Interestingly, the upper bound is also a corollary of the bound (3) for the rank of a matrix product, because

\notag \begin{aligned} \mathrm{rank}(A+B) &= \mathrm{rank}\biggl( \begin{bmatrix} A & B \end{bmatrix} \begin{bmatrix} I \\ I \end{bmatrix} \biggr)\\ &\le \min\biggl(\mathrm{rank}\bigl(\begin{bmatrix} A & B \end{bmatrix}\bigr), \mathrm{rank}\biggl(\begin{bmatrix} I \\ I \end{bmatrix} \biggr)\biggr)\\ &\le \mathrm{rank}\bigl(\begin{bmatrix} A & B \end{bmatrix}\bigr)\\ &\le \mathrm{rank}(A) + \mathrm{rank}(B). \end{aligned}

For the lower bound, writing $A = -B + A+B$ and applying the upper bound gives $\mathrm{rank}(A) \le \mathrm{rank}(-B) + \mathrm{rank}(A+B) = \mathrm{rank}(B) + \mathrm{rank}(A+B)$, and likewise with the roles of $A$ and $B$ interchanged.

Rank of $A$ and $A^*A$

For any $A$,

$\notag \boxed{\mathrm{rank}(A^*A) = \mathrm{rank}(A).} \qquad (2)$

Indeed $Ax = 0$ implies $A^*Ax = 0$, and $A^*Ax = 0$ implies $0 = x^*A^*Ax = (Ax)^*(Ax)$, which implies $Ax = 0$. Hence the null spaces of $A$ and $A^*A$ are the same. The equality (2) follows from the rank-nullity theorem.

Rank of a General Product

For any $A$ and $B$ for which the product $AB$ is defined,

$\notag \boxed{\mathrm{rank}(AB) \le \min\bigl( \mathrm{rank}(A), \mathrm{rank}(B) \bigr).} \qquad (3)$

If $B = [b_1,\dots,b_n]$ then $AB = [Ab_1,\dots,Ab_n]$, so the columns of $AB$ are linear combinations of those of $A$ and so $AB$ cannot have more linearly independent columns than $A$, that is, $\mathrm{rank}(AB) \le \mathrm{rank}(A)$. Using (5) below, we then have

$\notag \mathrm{rank}(AB) = \mathrm{rank}(B^*A^*) \le \mathrm{rank}(B^*) = \mathrm{rank}(B).$

The latter inequality can be proved without using (5) (our proof of which uses (3)), as follows. Suppose $\mathrm{rank}(B) < \mathrm{rank}(AB) = r$. Let the columns of $Y$ span $\mathrm{range}(AB)$, so that $Y$ has $r$ columns and $Y = ABZ$ for some matrix $Z$ with $r$ columns. Now $\mathrm{rank}(BZ) \le \mathrm{rank}(B) < r$ by the first part, so $BZg = 0$ for some nonzero $g$. But then $Yg = ABZg = 0$, which contradicts the linear independence of the columns of $Y$, so we must have $\mathrm{rank}(B) \ge \mathrm{rank}(AB)$.

Rank of a Product of Full-Rank Matrices

We have

$\notag \boxed{ \mathrm{rank}(AB) = r, \quad A\in\mathbb{C}^{m\times r}, \; B\in\mathbb{C}^{r\times n}, \; \mathrm{rank}(A) = \mathrm{rank}(B) = r .} \qquad (4)$

We note that $A^*A$ and $BB^*$ are both nonsingular $r\times r$ matrices by (2), so their product has rank $r$. Using (3),

$\notag r = \mathrm{rank}(A^*A BB^*) \le \mathrm{rank}(A B) \le r,$

and hence $\mathrm{rank}(A B) = r$.

Another important relation is

$\notag \boxed{ \mathrm{rank}(XAY ) = \mathrm{rank}(A), \quad X\in\mathbb{C}^{m\times m} \;\mathrm{and}\; Y\in\mathbb{C}^{n\times n}\; \mathrm{nonsingular}. }$

This is a consequence of the equality $\mathrm{range}(XAY) = X\mathrm{range}(A)Y$ for nonsingular $X$ and $Y$.

Ranks of $A$ and $A^*$

By (2) and (3) we have $\mathrm{rank}(A) = \mathrm{rank}(A^*A) \le \mathrm{rank}(A^*)$. Interchanging the roles of $A$ and $A^*$ gives $\mathrm{rank}(A^*) \le \mathrm{rank}(A)$ and so

$\notag \boxed{ \mathrm{rank}(A^*) = \mathrm{rank}(A). } \qquad (5)$

In other words, the rank of $A$ is equal to the maximum number of linearly independent rows as well as the maximum number of linearly independent columns.

Full-Rank Factorization

$A\in\mathbb{C}^{m \times n}$ has rank $r$ if and only if $A = GH$ for some $G\in\mathbb{C}^{m \times r}$ and $H\in\mathbb{C}^{r \times n}$, both of rank $r$, and this is called a full-rank factorization. The existence of such a factorization implies that $\mathrm{rank}(A) = r$ by (4). Conversely, suppose that $A$ has rank $r$. Let the columns of $X\in\mathbb{C}^{m \times r}$ form a basis for the range space of $A$. Then there are $r$-vectors $y_j$ such that $a_j = Xy_j$, $j = 1\colon n$, and with $Y = [y_1,y_2,\dots, y_n]$ we have $A = XY$. Finally, $r = \mathrm{rank}(A) = \mathrm{rank}(XY) \le \mathrm{rank}(Y)$ by (3), and since $\mathrm{rank}(Y) \le r$ we have $\mathrm{rank}(Y) = r$.

Rank and Minors

A characterization of rank that is sometimes used as the definition is that it is the size of the largest nonsingular square submatrix. Equivalently, the rank is the size of the largest nonzero minor, where a minor of size $k$ is the determinant of a $k\times k$ submatrix.

rank(AB) and rank(BA)

Although $AB$ and $BA$ have some properties in common when both products are defined (notably they have the same nonzero eigenvalues), $\mathrm{rank}(AB)$ is not always equal to $\mathrm{rank}(BA)$. A simple example is $A = x$ and $B = y^*$ with $x$ and $y$ orthogonal vectors: $AB = xy^*$ but $BA = y^*x = 0$. An example with square $A$ and $B$ is

$\notag \begin{gathered} A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \\ \mathrm{rank}(AB) = \mathrm{rank}\biggl( \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix} \biggr) = 0, \quad \mathrm{rank}(BA) = \mathrm{rank}\biggl( \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \biggr) = 1. \end{gathered}$

Note that $A = e_1e_2^T$ and $B = e_1e_1^T$, where $e_i$ has $1$ in the $i$th position and zeros everywhere else. Such matrices are easy to manipulate in this form (e.g., $AB = e_1 (e_2^Te_1)e_1^T = 0$) and are useful for constructing examples.

How to Find Rank

If we have a full-rank factorization of $A$ then we can read off the rank from the dimensions of the factors. But finding a full-rank factorization is a nontrivial task. The ultimate full-rank factorization is the SVD

$\notag A = U\Sigma V^T,$

where $U\in\mathbb{R}^{m\times m}$ and $V\in\mathbb{R}^{n\times n}$ are orthogonal, $\Sigma = \mathrm{diag}(\sigma_1,\dots, \sigma_p)\in\mathbb{R}^{m\times n}$, where $p = \min(m,n)$, and $\sigma_1\ge \sigma_2\ge \cdots \ge \sigma_r > 0 = \sigma_{r+1} = \cdots = \sigma_p = 0$. The rank of $A$ is $r$, the number of nonzero singular values.

In floating-point arithmetic, the standard algorithms for computing the SVD are numerically stable, that is, the computed singular values are the exact singular values of a matrix $A + \Delta A$ with $\|\Delta A\|_2 \le c_{m,n}u\|A\|$, where $c_{m,n}$ is a constant and $u$ is the unit roundoff. Unfortunately, $A + \Delta A$ will typically be full rank when $A$ is rank deficient. For example, consider this computation.

>> n = 4; A = zeros(n); A(:) = 1:n^2, svd(A)
A =
1     5     9    13
2     6    10    14
3     7    11    15
4     8    12    16
ans =
3.8623e+01
2.0713e+00
1.5326e-15
1.3459e-16


The matrix has rank $2$ and the two zero singular values are approximated by computed singular values of order $10^{-15}$. In general, we have no way to know whether tiny computed singular values signify exactly zero singular values. In practice, one typically defines a numerical rank based on a threshold and regards computed singular values less than the threshold as zero. Indeed the MATLAB rank function computes the rank as the number of singular values exceeding $2u \max(m,n)\widehat{\sigma}_1$, where $\widehat{\sigma}_1$ is the largest computed singular value. If the data from which the matrix is constructed is uncertain then the definition of numerical rank should take into account the level of uncertainty in the data. Dealing with rank deficiency in the presence of data errors and in finite precision arithmetic is a tricky business.

References

An excellent reference for further rank relations is Horn and Johnson. Stewart describes some of the issues associated with rank-deficient matrices in practical computation.

Diagonally Perturbing a Symmetric Matrix to Make It Positive Definite

Suppose $A$ is a matrix that is symmetric but not positive definite. What is the best way to perturb the diagonal to make $A$ positive definite? We want to compute a vector $d$ such that

$\notag A(d) = A + D, \quad D = \mathrm{diag}(d),$

is positive definite. Since the positive definite matrices form an open set there is no minimal $d$, so we relax the requirement to be that $A$ is positive semidefinite. The perturbation $D$ needs to make any negative eigenvalues of $A$ become nonnegative. We will require all the entries of $d$ to be nonnegative. Denote the eigenvalues of $A$ by $\lambda_n(A) \le \lambda_{n-1}(A) \le \cdots \le \lambda_1(A)$ and assume that $\lambda_n(A) < 0$.

A natural choice is to take $D$ to be a multiple of the identity matrix. For $d_i \equiv \delta$, $A(d)$ has eigenvalues $\lambda_i + \delta$, and so the smallest possible $\delta$ is $\delta = - \lambda_n(A)$. This choice of $D$ shifts all the diagonal elements by the same amount, which might be undesirable for a matrix with widely varying diagonal elements.

When the diagonal entries of $A$ are positive another possibility is to take $d_i = \alpha a_{ii}$, so that each diagonal entry undergoes a relative perturbation of size $\alpha$. Write $D_A = \mathrm{diag}(a_{ii})$ and note that $C = D_A^{-1/2}A D_A^{-1/2}$ is symmetric with unit diagonal. Then

$\notag A + \alpha D_A = D_A^{1/2}(C + \alpha I)D_A^{1/2}.$

Since $A + \alpha D_A$ is positive semidefinite if and only if $C + \alpha I$ is positive semidefinite, the smallest possible $\alpha$ is $\alpha = -\lambda_n(C)$.

More generally, we can treat the $d_i$ as $n$ independent variables and ask for the solution of the optimization problem

$\notag \min \|d\| ~~\mathrm{subject~to}~~ A + \mathrm{diag}(d) ~\mathrm{positive~semidefinite}, ~d \ge 0. \qquad(\dagger)$

Of particular interest are the norms $\|d\|_1 = \sum_i d_i = \mathrm{trace}(D)$ (since $d\ge0)$ and $\|d\|_\infty = \max_i d_i$.

If $A+D$ is positive semidefinite then from standard eigenvalue inequalities,

$\notag 0 \le \lambda_n(A+D) \le \lambda_n(A) + \lambda_1(D),$

so that

$\notag \max_i d_i \ge -\lambda_n(A).$

Since $d_i \equiv -\lambda_n(A)$ satisfies the constraints of $(\dagger)$, this means that this $d$ solves $(\dagger)$ for the $\infty$-norm, though the solution is obviously not unique in general.

For the $1$– and $2$-norms, $(\dagger)$ does not have an explicit solution, but it can be solved by semidefinite programming techniques.

Another approach to finding a suitable $D$ is to compute a modified Cholesky factorization. Given a symmetric $A$, such a method computes a perturbation $E$ such that $A + E = R^TR$ for an upper triangular $R$ with positive diagonal elements, so that $A + E$ is positive definite. The methods of Gill, Murray, and Wright (1981) and Schnabel and Eskow (1990) compute a diagonal $E$. The cost in flops is essentially the same as that of computing a Cholesky factorization ($n^3/3$) flops), so this approach is likely to require fewer flops than computing the minimum eigenvalue or solving an optimization problem, but the perturbations produces will not be optimal.

Example

We take the $5\times 5$ Fiedler matrix

>> A = gallery('fiedler',5)
A =
0     1     2     3     4
1     0     1     2     3
2     1     0     1     2
3     2     1     0     1
4     3     2     1     0


The smallest eigenvalue is $-5.2361$, so $A+D$ is positive semidefinite for $D_1 = 5.2361I$. The Gill–Murray–Wright method gives $D_{\mathrm{GMW}}$ with diagonal elements

2.0000e+01   4.6159e+00   1.3194e+00   2.5327e+00   1.0600e+01


and has $\lambda_5(A+D_{\mathrm{GMW}}) = 0.5196$ while the Schnabel–Eskow method gives $D_{\mathrm{SE}}$ with diagonal elements

6     6     6     6     6


and has $\lambda_5(A+D_{\mathrm{SE}}) = 0.7639$. If we increase the diagonal elements of $D_1$ by 0.5 to give comparable smallest eigenvalues for the perturbed matrices then we have

$\Vert d\Vert_{\infty}$ $\Vert d \Vert_1$
Shift 5.2361 26.180
Gill-Murray-Wright 20.000 39.068
Schnabel–Eskow 6.000 30.000

When Does Thresholding Preserve Positive Definiteness?

Does a symmetric positive definite matrix remain positive definite when we set one or more elements to zero? This question arises in thresholding, in which elements of absolute value less than some tolerance are set to zero. Thresholding is used in some applications to remove correlations thought to be spurious, so that only statistically significant ones are retained.

We will focus on the case where just one element is changed and consider an arbitrary target value rather than zero. Given an $n\times n$ symmetric positive definite matrix $A$ we define $A(t)$ to be the matrix resulting from adding $t$ to the $(i,j)$ and $(j,i)$ elements and we ask when is $A(t)$ positive definite. We can write

$\notag A(t) = A + t(e_i^{}e_j^T + e_j^{}e_i^T) \equiv A + tE_{ij},$

where $e_i$ is the $i$th column of the identity matrix. The perturbation $E_{ij}$ has rank $2$, with eigenvalues $-1$, $1$, and $0$ repeated $n-2$ times. Hence we can write $E_{ij}$ in the form $E_{ij} = pp^T - qq^T$, where $p^Tp = q^Tq = 1$ and $p^Tq = 0$. Adding $pp^T$ to $A$ causes each eigenvalue to increase or stay the same, while subtracting $qq^T$ decreases or leaves unchanged each eigenvalue. However, more is true: after each of these rank-$1$ perturbations the eigenvalues of the original and perturbed matrices interlace, by Weyl’s theorem. Hence, with the eigenvalues of $A$ ordered as $\lambda_n(A) \le \cdots \le \lambda_1(A)$, we have (Horn and Johnson, Cor. 4.3.7)

\notag \begin{aligned} \lambda_n(A(t)) &\le \lambda_{n-1}(A), \\ \lambda_{i+1}(A) &\le \lambda_i(A(t)) \le \lambda_{i-1}(A), \quad i = 2\colon n-1, \\ \lambda_2(A) &\le \lambda_1(A(t)). \end{aligned}

Because $A$ is positive definite these inequalities imply that $\lambda_{n-1}(A(t)) \ge \lambda_n(A) > 0$, so $A(t)$ has at most one negative eigenvalue. Since $\det(A(t))$ is the product of the eigenvalues of $A(t)$ this means that $A(t)$ is positive definite precisely when $\det(A(t)) > 0$.

There is a simple expression for $\det(A(t))$, which follows from a lemma of Chan (1984), as explained by Georgescu, Higham, and Peters (2018):

$\notag \det(A(t)) = \det(A)\big(1+ 2t b_{ij} + t^2(b_{ij}^2-b_{ii}b_{jj})\big),$

where $B = A^{-1}$. Hence the condition for $A(t)$ to be positive definite is

$\notag q_{ij}(t) = 1 + 2t b_{ij} + t^2(b_{ij}^2-b_{ii}b_{jj}) > 0.$

We can factorize

$\notag q_{ij}(t) = \Bigl( t\bigl(b_{ij} - \sqrt{b_{ii}b_{jj}}\bigr) + 1 \Bigr) \Bigl( t\bigl(b_{ij} + \sqrt{b_{ii}b_{jj}}\bigr) + 1 \Bigr),$

so $q_{ij}(t) > 0$ for

$\notag t\in \left( \displaystyle\frac{-1}{ \sqrt{b_{ii}b_{jj}} + b_{ij} }, \displaystyle\frac{1}{ \sqrt{b_{ii}b_{jj}} - b_{ij} } \right) =: I_{ij},$

where the endpoints are finite because $B$, like $A$, is positive definite and so $|b_{ij}| < \sqrt{b_{ii}b_{jj}}$.

The condition for $A$ to remain positive definite when $a_{ij}$ is set to zero is $q_{ij}(-a_{ij}) > 0$, or equivalently $-a_{ij} \in I_{ij}$. To check either of these conditions we need just $b_{ij}$, $b_{ii}$, and $b_{jj}$. These elements can be computed without computing the whole inverse by solving the equations $Ab_k = e_k$ for $k = i,j$, for the $k$th column $b_k$ of $B$, making use of a Cholesky factorization of $A$.

As an example, we consider the $4\times 4$ Lehmer matrix, which has $(i,j)$ element $i/j$ for $i \ge j$:

$\notag A = \begin{bmatrix} 1 & \frac{1}{2} & \frac{1}{3} & \frac{1}{4} \\[3pt] \frac{1}{2} & 1 & \frac{2}{3} & \frac{1}{2} \\[3pt] \frac{1}{3} & \frac{2}{3} & 1 & \frac{3}{4} \\[3pt] \frac{1}{4} & \frac{1}{2} & \frac{3}{4} & 1 \end{bmatrix}.$

The smallest eigenvalue of $A$ is $0.208$. Any off-diagonal element except the $(2,4)$ element can be zeroed without destroying positive definiteness, and if the $(2,4)$ element is zeroed then the new matrix has smallest eigenvalue $-0.0249$. For $i=2$ and $j=4$, the following plot shows in red $\lambda_{\min}(A(t))$ and in blue $q_{24}(t)$; the black dots are the endpoints of the closure of the interval $I_{24} = (-0.453,0.453)$ and the vertical black line is the value $-a_{24}$. Clearly, $-a_{24}$ lies outside $I_{24}$, which is why zeroing this element causes a loss of positive definiteness. Note that $I_{24}$ also tells us that we can increase $a_{24}$ to any number less than $0.953$ without losing definiteness.

Given a positive definite matrix and a set $S$ of elements to be modified we may wish to determine subsets (including a maximal subset) of $S$ for which the modifications preserve definiteness. Efficiently determining these subsets appears to be an open problem.

In practical applications thresholding may lead to an indefinite matrix. Definiteness must then be restored to obtain a valid correlation matrix. One way to do this is to find the nearest correlation matrix in the Frobenius norm such that the zeroed elements remain zero. This can be done by the alternating projections method with a projection to keep the zeroed elements fixed. Since the nearest correlation matrix is positive semidefinite, it is also desirable to to incorporate a lower bound $\delta > 0$ on the smallest eigenvalue, which corresponds to another projection. Both these projections are supported in the algorithm of Higham and Strabić (2016), implemented in the code at https://github.com/higham/anderson-accel-ncm. For the Lehmer matrix, the nearest correlation matrix with zero $(2,4)$ element and eigenvalues at least $\delta = 0.01$ is (to four significant figures)

$\notag \begin{bmatrix} 1 & 0.4946 & 0.3403 & 0.2445 \\ 0.4946 & 1 & 0.6439 & 0 \\ 0.3403 & 0.6439 & 1 & 0.7266 \\ 0.2445 & 0 & 0.7266 & 1 \end{bmatrix}.$

A related question is for what patterns of elements that are set to zero is positive definiteness guaranteed to be preserved for all positive definite $A$? Clearly, setting all the off-diagonal elements to zero preserves definiteness, since the diagonal of a positive definite matrix is positive. Guillot and Rajaratnam (2012) show that the answer to the question is that the new matrix must be a symmetric permutation of a block diagonal matrix. However, for particular $A$ this restriction does not necessarily hold, as the Lehmer matrix example shows.

What Is a Unitarily Invariant Norm?

A norm on $\mathbb{C}^{m \times n}$ is unitarily invariant if $\|UAV\| = \|A\|$ for all unitary $U\in\mathbb{C}^{m \times m}$ and $V\in\mathbb{C}^{n\times n}$ and for all $A\in\mathbb{C}^{m \times n}$. One can restrict the definition to real matrices, though the term unitarily invariant is still typically used.

Two widely used matrix norms are unitarily invariant: the $2$-norm and the Frobenius norm. The unitary invariance follows from the definitions. For the $2$-norm, for any unitary $U$ and $V$, using the fact that $\|Uz\|_2 = \|z\|_2$, we obtain

\notag \begin{aligned} \|UAV\|_2 &= \max_{x\ne0} \frac{\|UAVx\|_2}{\|x\|_2} = \max_{x\ne0} \frac{\|AVx\|_2}{\|x\|_2}\\ &= \max_{x\ne0} \frac{\|Ay\|_2}{\|V^*y\|_2} \quad(y = Vx)\\ &= \max_{y\ne0} \frac{\|Ay\|_2}{\|y\|_2} = \|A\|_2. \end{aligned}

For the Frobenius norm, using $\|A\|_F^2 = \mathrm{trace}(A^*A)$,

\notag \begin{aligned} \|UAV\|_F^2 &= \mathrm{trace}(V^*A^*U^* \cdot UAV)\\ &= \mathrm{trace}(V^*A^*AV)\\ &= \mathrm{trace}(A^*A) = \|A\|_F^2, \end{aligned}

since the trace is invariant under similarity transformations.

More insight into unitarily invariant norms comes from recognizing a connection with the singular value decomposition

$\notag A = P\Sigma Q^*, \quad P^*P = I_m, \quad Q^*Q = I_n, \quad \Sigma = \mathrm{diag}(\sigma_i), \quad \sigma_1 \ge \cdots \ge \sigma_q \ge 0.$

Clearly, $\|A\| = \|\Sigma\|$, so $\|A\|$ depends only on the singular values. Indeed, for the 2-norm and the Frobenius norm we have $\|A\|_2 = \sigma_1$ and $\|A\|_F = (\sum_{i=1}^q \sigma_i^2)^{1/2}$. Here, and throughout this article, $q = \min(m,n)$. Another implication of the singular value dependence is that $\|A\| = \|A^*\|$ for all $A$ for any unitarily invariant norm.

There is a beautiful characterization of unitarily invariant norms in terms of symmetric gauge functions, which are functions $f:\mathbb{R}^q\to\mathbb{R}^q$ such that $f$ is an absolute norm on $\mathbb{R}^q$ and $f(Px) = f(x)$ for any permutation matrix $P$ and all $x$. An absolute norm is one with the property that $\|x\| = \||x|\|$ for all $x$, and this condition is equivalent to the monotonicity condition that $|x| \le |y|$ implies $\|x\| \le \|y\|$ for all $x$ and $y$.

Theorem.

A norm on $\mathbb{C}^{m \times n}$ is unitarily invariant if and only if $\|A\| = f(\sigma_1, \dots, \sigma_q)$ for all $A$ for some symmetric gauge function $f$, where $\sigma_1, \dots, \sigma_q$ are the singular values of $A$.

The matrix $2$-norm and the Frobenius norm correspond to $f$ being the vector $\infty$-norm and the $2$-norm, respectively. More generally, we can take for $f$ any vector $p$-norm, obtaining the class of Schatten $p$-norms:

$\notag \|A\|_p^S = \biggl(\displaystyle\sum_{i=1}^q \sigma_i^p\biggr)^{1/p}, \quad 1\le p\le\infty.$

The Schatten $1$-norm is the sum of the singular values, $\|A\|_1^S = \sum_{i=1}^q \sigma_i$, which is called the trace norm or nuclear norm. It can act as a proxy for the rank of a matrix. The trace norm can be expressed as $\|A\|_1^S = \mathrm{trace}(H) = \mathrm{trace}\bigl((A^*A)^{1/2}\bigr)$, where $A = UH$ is a polar decomposition.

Another class of unitarily invariant norms is the Ky Fan $k$-norms

$\notag \|A\|_{(k)} = \displaystyle\sum_{i=1}^k \sigma_i, \quad 1\le k \le n.$

We have $\|A\|_{(n)} = \|A\|_1^S$ and $\|A\|_{(1)} = \|A\|_2$.

Among unitarily invariant norms, the $2$-norm and the Frobenius norm are widely usd in numerical analysis and matrix analysis. The nuclear norm is used in problems involving matrix rank minimization, such as matrix completion problems.

The benefit of the concept of unitarily invariant norm is that one can prove certain results for this whole class of norms, obtaining results for the particular norms of interest as special cases. Here are three important examples.

• For $A\in\mathbb{C}^{n\times n}$, the matrix $(A+A^*)/2$ is the nearest Hermitian matrix to $A$ in any unitarily invariant norm.
• For $A\in\mathbb{C}^{m\times n}$ ($m\ge n$), the unitary polar factor is the nearest matrix with orthonormal columns to $A$ in any unitarily invariant norm.
• The best rank-$k$ approximation to $A\in\mathbb{C}^{m\times n}$ in any unitarily invariant norm is obtained by setting all the singular values beyond the $k$th to zero in the SVD of $A$. This result, which generalizes the Eckart-Young theorem, which covers the 2- and Frobenius norm instances, is an easy consequence of the following result of Mirsky (1960).

Theorem.

Let $A,B\in\mathbb{C}^{m\times n}$ have SVDs with diagonal matrices $\Sigma_A, \Sigma_B \in\mathbb{R}^{m\times n}$, where the diagonal elements are arranged in nonincreasing order. Then $\|A-B\| \ge \|\Sigma_A-\Sigma_B\|$ for every unitarily invariant norm.

We also give a useful matrix norm inequality. For any matrices $A$, $B$, and $C$ for which the product $ABC$ is defined,

$\notag \|ABC\| \le \|A\|_2 \mskip2mu \|B\| \mskip2mu \|C\|_2$

holds for any unitarily invariant norm, and in fact, any two of the norms on the right-hand side can be 2-norms.

Finally, here are three interesting facts about unitarily invariant norms.

• A unitarily invariant norm is consistent, that is, it satisfies $\|AB\| \le \|A\| \|B\|$ for all $A,B\in\mathbb{C}^{n\times n}$ if and only if $\|A\| \ge \|A\|_2$ for all $A\in\mathbb{C}^{n\times n}$.
• A unitarily invariant norm is absolute, that is, $\|A\| = \||A|\|$ for all $A\in\mathbb{C}^{n\times n}$, where $|A| = (|a_{ij}|)$ if and only if the norm is a positive scalar multiple of the Frobenius norm.
• The only subordinate unitarily invariant norm is the $2$-norm.

References

This is a minimal set of references, which contain further useful references within.