What Is the Nearest Positive Semidefinite Matrix?

Given a symmetric matrix and a nonnegative number \delta, what is the nearest symmetric matrix whose eigenvalues are all at least \delta? In other words, how can we project a symmetric matrix onto the set of symmetric positive semidefinite matrices with eigenvalues at least \delta?

Distance can be measured in any norm, but the most common choice is the Frobenius norm, \|A\|_F = (\sum_{i,j = 1}^n |a_{ij}|^2)^{1/2}.

This problem occurs in a very wide range of applications. A typical scenario is that a covariance matrix is approximated in some way that does not guarantee positive semidefiniteness, for example by treating blocks of the matrix independently. In machine learning, some methods use indefinite kernels, and these can require an indefinite similarity matrix to be replaced by a semidefinite one. When \delta > 0, the problem arises when the matrix is positive definite but ill conditioned and a matrix of smaller condition number is required, or when rounding errors result in small negative eigenvalues and a “safely positive definite” matrix is wanted.

The following theorem gives the solution to the problem for the Frobenius norm.

Theorem (Cheng and Higham, 1998).

Let the symmetric matrix A\in\mathbb{R}^{n \times n} have the spectral decomposition A = Q \mskip1mu\mathrm{diag}(\lambda_i) \mskip1muQ^T and let \delta \ge 0. The unique matrix with smallest eigenvalue at least \delta nearest to A in the Frobenius norm is given by

\notag        X = Q \mskip1mu \mathrm{diag}(\tau_i) \mskip1mu Q^T, \quad  \tau_i = \begin{cases} \lambda_i, & \lambda_i \geq \delta\\                         \delta, & \lambda_i < \delta.           \end{cases}

The theorem says that there is a unique nearest matrix and that it has the same eigenvectors as A, with any eigenvalues less than \delta increased to \delta. When \delta = 0, another way to express this result is in terms of the polar decomposition A = UH, where U is orthogonal and H is symmetric positive semidefinite (with H = Q \mskip1mu\mathrm{diag}(|\lambda_i|)Q^T, since A is symmetric): X = (A + H)/2.
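
In MATLAB the theorem translates directly into code. The following is a minimal sketch, assuming A is symmetric and a threshold delta is given (delta = 0.1 is just an illustrative value); the last line re-symmetrizes the result to counter rounding errors:

delta = 0.1;                          % illustrative threshold
[Q,d] = eig(A,'vector');              % spectral decomposition of symmetric A
X = Q*diag(max(d,delta))*Q';          % raise eigenvalues below delta to delta
X = (X + X')/2;                       % re-symmetrize to counter rounding errors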

One can pose the same nearness problem for nonsymmetric A. For the Frobenius norm there is again a unique nearest symmetric matrix whose eigenvalues are all at least \delta, and it is given by applying the theorem to the symmetric part B of A (which is the nearest symmetric matrix to A), where

\notag   A = \frac{1}{2} (A + A^T ) + \frac{1}{2} (A - A^T )  \equiv B + C,    \quad(B = B^T, ~C = -C^T).

For the 2-norm the answer is known for nonsymmetric A only for \delta = 0, so we take \delta = 0 from this point on. We write X \ge 0 to denote that the symmetric matrix X is positive semidefinite and use a superscript 1/2 to denote the positive semidefinite square root.

Theorem (Halmos, 1972).

The 2-norm distance from A\in\mathbb{R}^{n \times n} to the nearest positive semidefinite matrix is

\notag   d_2(A) = \min\{\, r \ge 0: r^2I + C^2 \ge 0 ~\mathrm{and}~ G(r) \ge 0 \,\},

where

\notag    G(r) = B + \bigl(r^2I + C^2\bigr)^{1/2},

and a nearest matrix is

\notag   X = B + (d_2(A)^2I + C^2)^{1/2}.

When A is symmetric, so that C = 0, the nearest matrix and the distance reduce to

\notag   X = B + d_2(A)I, \quad d_2(A) = \max\{\, |\lambda_i| : \lambda_i < 0 \,\}.

Clearly, X is not, in general, unique: when A is symmetric, instead of shifting B by a multiple of I we could simply perturb the negative eigenvalues to zero, as in the optimal Frobenius norm perturbation, without changing the 2-norm of the perturbation. A distinguishing feature of the X in Halmos’s theorem is that X-P \ge 0 for any other nearest positive semidefinite matrix P (Bouldin, 1973, Theorem 4.2); thus X has the minimum number of zero eigenvalues over all nearest positive semidefinite matrices.

The Frobenius norm solution X_F is an approximate minimizer of the 2-norm distance since (Higham, 1988)

\notag   d_2(A) \le \| A - X_F \|_2 \le 2 d_2(A).

Halmos’s theorem simplifies the computation of d_2(A) because it reduces the minimization problem to one dimension. However, the problem is nonlinear and has no closed-form solution for general A, so iterative methods are needed to compute d_2(A). Two algorithms are developed in Higham (1988). Both are based on the following properties of the matrix G(r) in the theorem: \lambda_{\min}(G(r)) is monotone increasing on [\,\rho(C), \infty), and either \lambda_{\min}(G(\rho(C))) \ge 0, in which case d_2(A) = \rho(C), or \lambda_{\min}(G(r)) has a unique zero r^* = d_2(A) > \rho(C). The first algorithm uses the bisection method, determining the sign of \lambda_{\min}(G(r)) by attempting a Cholesky factorization of G(r): \lambda_{\min}(G(r)) is nonnegative if the factorization succeeds and negative otherwise. This approach is attractive if d_2(A) is required only to low accuracy. For higher accuracy it is better to apply a more rapidly converging zero finder to the equation f(r) = \lambda_{\min} (G(r)) = 0; a hybrid Newton–bisection method is used in Higham (1988).
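
The following minimal MATLAB sketch illustrates the bisection approach. The convergence tolerance, the initial interval [\rho(C), \|A\|_2] (the upper end is valid because X = 0 is positive semidefinite), and the use of an eigenvalue test in place of a Cholesky attempt are illustrative choices, not those of Higham (1988).

B = (A + A')/2;  C = (A - A')/2;  I = eye(size(A));
G = @(r) B + real(sqrtm(r^2*I + C^2));  % real() discards rounding-level imaginary parts
psd = @(M) min(eig((M+M')/2)) >= 0;     % sign of lambda_min; a chol attempt also works
rhoC = max(abs(eig(C)));                % = norm(C,2), since C is skew-symmetric
if psd(G(rhoC))
    d2 = rhoC;                          % d_2(A) = rho(C)
else
    lo = rhoC;  hi = norm(A);           % d_2(A) <= ||A||_2, since X = 0 is psd
    while hi - lo > 1e-10*hi
        mid = (lo + hi)/2;
        if psd(G(mid)), hi = mid; else, lo = mid; end
    end
    d2 = hi;
end
X2 = G(d2);                             % a nearest psd matrix in the 2-norm

Applied to the Jordan block in the example below, this sketch should return d2 \approx 0.9872, matching \|A - X_2\|_2 there.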

Example

For a numerical example (in MATLAB), we take the Jordan block

\notag  A =  \begin{bmatrix}  0  & 1  & 0  & 0  & 0  \\  0  & 0  & 1  & 0  & 0  \\  0  & 0  & 0  & 1  & 0  \\  0  & 0  & 0  & 0  & 1  \\  0  & 0  & 0  & 0  & 0  \end{bmatrix}.

The symmetric part of A, B = (A+A^T)/2, has two negative eigenvalues and a zero eigenvalue:

>> A = gallery('jordbloc',5,0); As = sym(A); eig_B = eig((As+As')/2)'
eig_B =
[-1/2, 0, 1/2, -3^(1/2)/2, 3^(1/2)/2]
>> double(eig_B)
ans =
  -5.0000e-01            0   5.0000e-01  -8.6603e-01   8.6603e-01

The nearest symmetric positive semidefinite matrix to A in the Frobenius norm can be computed in MATLAB as

B = (A + A')/2; [Q,D] = eig(B); d = diag(D); X_F = Q*diag(max(d,0))*Q';

We can improve this code by using the implicit expansion feature of MATLAB to avoid forming a diagonal matrix. Since the computed result is not exactly symmetric because of rounding errors, we also need to replace it by the nearest symmetric matrix:

[Q,d] = eig(B,'vector'); X_F = Q*(max(d,0).*Q'); X_F = (X_F + X_F')/2;

We obtain

X_F = 
 1.972e-01   2.500e-01   1.443e-01  -4.163e-17  -5.283e-02  
 2.500e-01   3.415e-01   2.500e-01   9.151e-02  -1.665e-16  
 1.443e-01   2.500e-01   2.887e-01   2.500e-01   1.443e-01  
-4.163e-17   9.151e-02   2.500e-01   3.415e-01   2.500e-01  
-5.283e-02  -1.665e-16   1.443e-01   2.500e-01   1.972e-01

which has eigenvalues

-3.517e-17   7.246e-17   9.519e-17   5.000e-01   8.660e-01  

Notice that the three eigenvalues that would be zero in exact arithmetic are of order the unit roundoff. A nearest matrix in the 2-norm, as given by Halmos’s theorem, is

X_2 = 
 8.336e-01   5.000e-01   1.711e-01   0.000e+00  -1.756e-02  
 5.000e-01   6.625e-01   5.000e-01   1.887e-01   0.000e+00  
 1.711e-01   5.000e-01   6.450e-01   5.000e-01   1.711e-01  
 0.000e+00   1.887e-01   5.000e-01   6.625e-01   5.000e-01  
-1.756e-02   0.000e+00   1.711e-01   5.000e-01   8.336e-01 

and its eigenvalues are

-7.608e-17   1.281e-01   5.436e-01   1.197e+00   1.769e+00  

The distances are

\notag  \begin{aligned}   1.732  &= \|A - X_F\|_F < \|A - X_2\|_F = 2.207,\\   0.9872 &= \|A - X_2\|_2 < \|A - X_F\|_2 = 1.0355.   \end{aligned}

References

This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is the Nearest Symmetric Matrix?

What is the nearest symmetric matrix to a real, nonsymmetric square matrix A? This question arises whenever a symmetric matrix is approximated by formulas that do not respect the symmetry. For example, a finite difference approximation to a Hessian matrix can be nonsymmetric even though the Hessian is symmetric. In some cases, lack of symmetry is caused by rounding errors. The natural way to symmetrize A is to replace it by (A + A^T)/2. Is this the best choice?

As our criterion of optimality we take that \| A - X \| is minimized over symmetric X for some suitable norm. Fan and Hoffman (1955) showed that (A + A^T)/2 is a solution in any unitarily invariant norm. A norm is unitarily invariant if \|A\| = \|UAV\| for all unitary U and V. Such a norm depends only on the singular values of A, and hence \|A\| = \|A^T\| since A and A^T have the same singular values. The most important examples of unitarily invariant norms are the 2-norm and the Frobenius norm.

The proof that (A+A^T)/2 is optimal is simple. For any symmetric Y,

\notag  \begin{aligned}   \Bigl\| A - \frac{1}{2}(A + A^T) \Bigr\|         &=  \frac{1}{2} \|A - A^T \|          = \frac{1}{2} \| A - Y + Y^T - A^T \| \\         &\le \frac{1}{2} \| A - Y \| + \frac{1}{2} \|(Y - A)^T \| \\         &= \| A - Y\|.   \end{aligned}

Simple examples of a matrix and a nearest symmetric matrix are

\notag    A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \;\;    X = \begin{bmatrix} 0 & \frac{1}{2} \\ \frac{1}{2} & 0 \end{bmatrix},\qquad    A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}, \;\;    X = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}.

Note that any A can be written

\notag   A = \frac{1}{2} (A + A^T ) + \frac{1}{2} (A - A^T )  \equiv B + C,

where B and C are the symmetric part and the skew-symmetric part of A, respectively, so the nearest symmetric matrix to A is the symmetric part of A.

For the Frobenius norm, (A + A^T)/2 is the unique nearest symmetric matrix, which follows from the fact that \|S + K\|_F^2 = \|S\|_F^2 + \|K\|_F^2 for symmetric S and skew-symmetric K. For the 2-norm, however, the nearest symmetric matrix is not unique in general. An example of non-uniqueness is

\notag    A = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \quad    X = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & \frac{1}{2} \\ 0 & \frac{1}{2} & 0 \end{bmatrix}, \quad    Y = \begin{bmatrix} \theta & 0 & 0 \\ 0 & 0 & \frac{1}{2} \\ 0 & \frac{1}{2} & 0 \end{bmatrix},

for which \|A - X\|_2 = 0.5, and \|A - Y\|_2 = 0.5 for any \theta such that |\theta| \le 0.5.
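
The non-uniqueness is easily confirmed numerically; in the following minimal MATLAB check, theta = 0.3 is an arbitrary admissible value:

A = [0 0 0; 0 0 1; 0 0 0];
X = [0 0 0; 0 0 1/2; 0 1/2 0];
theta = 0.3;                          % any theta with |theta| <= 1/2 will do
Y = X;  Y(1,1) = theta;
[norm(A - X, 2), norm(A - Y, 2)]      % both distances equal 0.5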

Entirely analogous to the above is the nearest skew-symmetric matrix problem, for which the solution is the skew-symmetric part for any unitarily invariant norm. For complex matrices, these results generalize in the obvious way: (A + A^*)/2 is the nearest Hermitian matrix to A and (A - A^*)/2 is the nearest skew-Hermitian matrix to A in any unitarily invariant norm.

Reference

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is the Softmax Function?

The softmax function takes as input a real n-vector x and returns the vector g with elements given by

\notag   \qquad\qquad   g_j(x) = \displaystyle\frac{\mathrm{e}^{x_j}}                              {\sum_{i=1}^n \mathrm{e}^{x_i}}, \quad          j=1\colon n.   \qquad\qquad (*)

It arises in machine learning, game theory, and statistics. Since \mathrm{e}^{x_j} \ge 0 and \sum_{j=1}^n g_j = 1, the softmax function is often used to convert a vector x into a vector of probabilities, with the more positive entries giving the larger probabilities.

The softmax function is the gradient of the log-sum-exp function

\notag      \mathrm{lse}(x) = \log\displaystyle\sum_{i=1}^n \mathrm{e}^{x_i},

where \log is the natural logarithm, that is, g_j(x) = (\partial/\partial x_j) \mathrm{lse}(x).
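
This gradient relationship can be checked with a forward difference approximation. The following is a minimal MATLAB sketch; the anonymous functions are for illustration only and are unrelated to the codes used below.

lse = @(x) log(sum(exp(x)));                 % log-sum-exp (unshifted; fine for small x)
smax = @(x) exp(x)/sum(exp(x));              % softmax, as in (*)
x = [-1 0 1];  h = 1e-8;  fd = zeros(1,3);
for j = 1:3
    e = zeros(1,3);  e(j) = 1;
    fd(j) = (lse(x + h*e) - lse(x))/h;       % forward difference in the jth direction
end
[fd; smax(x)]                                % the two rows agree to roughly eight digits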

The following plots show the two components of softmax for n = 2. Note that they are constant on lines x_1 - x_2 = \mathrm{constant}, as shown by the contours.

[Figure softmax2d.jpg: the two components of softmax for n = 2, with contour lines.]

Here are some examples:

>> softmax([-1 0 1])
ans =
   9.0031e-02
   2.4473e-01
   6.6524e-01
>> softmax([-1 0 10])
ans =
   1.6701e-05
   4.5397e-05
   9.9994e-01

Note how softmax increases the relative weighting of the larger components over the smaller ones. The MATLAB function softmax used here is available at https://github.com/higham/logsumexp-softmax.

A concise alternative formula, which removes the denominator of (*) by rewriting it as the exponential of \mathrm{lse}(x) and moving it into the numerator, is

\notag   \qquad\qquad     g_j = \exp\bigl(x_j - \mathrm{lse}(x)\bigr).   \qquad\qquad (\#)

Straightforward evaluation of softmax from either (*) or (\#) is not recommended, because of the possibility of overflow. Overflow can be avoided in (*) by shifting the components of x, just as for the log-sum-exp function, to obtain

\notag   \qquad\qquad   g_j(x) =  \displaystyle\frac{\mathrm{e}^{x_j-\max(x)}}{\sum_{i=1}^n \mathrm{e}^{x_i-\max(x)}}, \quad             j=1\colon n,   \qquad\qquad (\dagger)

where \max(x) = \max_i x_i. It can be shown that computing softmax via this formula is numerically reliable. The shifted version of (\#) tends to be less accurate, so (\dagger) is preferred.
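
Here is a minimal MATLAB sketch of (\dagger); the function name softmax_shifted is illustrative, and this is not the softmax function from the repository linked above.

function g = softmax_shifted(x)
%SOFTMAX_SHIFTED Softmax of a vector x evaluated via the shifted formula.
x = x(:);                     % work with a column vector
e = exp(x - max(x));          % all exponents are <= 0, so overflow cannot occur
g = e/sum(e);
end

For example, softmax_shifted([-1 0 10]) reproduces the second example above to the digits displayed.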

References

This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.

What Is the Log-Sum-Exp Function?

The log-sum-exp function takes as input a real n-vector x and returns the scalar

\notag      \mathrm{lse}(x) = \log \displaystyle\sum_{i=1}^n \mathrm{e}^{x_i},

where \log is the natural logarithm. It provides an approximation to the largest element of x, which is given by the \max function, \max(x) = \max_i x_i. Indeed,

\notag   \mathrm{e}^{\max(x)} \le \displaystyle\sum_{i=1}^n \mathrm{e}^{x_i}                        \le n \mskip1mu \mathrm{e}^{\max(x)},

and on taking logs we obtain

\notag   \qquad\qquad       \max(x) \le \mathrm{lse}(x) \le \max(x) + \log n.   \qquad\qquad (*)

The log-sum-exp function can be thought of as a smoothed version of the max function, because whereas the max function is not differentiable at points where the maximum is achieved in two different components, the log-sum-exp function is infinitely differentiable everywhere. The following plots of \mathrm{lse}(x) and \max(x) for n = 2 show this connection.

[Figure lse2d.jpg: lse(x) and max(x) for n = 2.]

The log-sum-exp function appears in a variety of settings, including statistics, optimization, and machine learning.

For the special case where x = [0~t]^T, we obtain the function f(t) = \log(1+\mathrm{e}^t), which is known as the softplus function in machine learning. The softplus function approximates the ReLU (rectified linear unit) activation function \max(t,0) and satisfies, by (*),

\notag     \max(t,0) \le f(t) \le \max(t,0) + \log 2.
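
These bounds are easy to check numerically over a few illustrative values of t (for large t, \log(1+\mathrm{e}^t) should itself be evaluated in shifted form to avoid overflow, as discussed below):

f = @(t) log(1 + exp(t));              % softplus (fine for the modest values used here)
t = -3:3;
[max(t,0); f(t); max(t,0) + log(2)]    % columnwise: row 1 <= row 2 <= row 3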

Two points are worth noting.

  • While \log(x_1 + x_2) \ne \log x_1 + \log x_2, in general, we do (trivially) have \log(x_1 + x_2) = \mathrm{lse}(\log x_1,\log x_2), and more generally \log(x_1 + x_2 + \cdots + x_n) = \mathrm{lse}(\log x_1,\log   x_2,\dots,\log x_n).
  • The log-sum-exp function is not to be confused with the exp-sum-log function: \exp \sum_{i=1}^n \log x_i = x_1x_2\dots x_n.

Here are some examples:

>> format long e
>> logsumexp([1 2 3])
ans =
     3.407605964444380e+00

>> logsumexp([1 2 30])
ans =
     3.000000000000095e+01

>> logsumexp([1 2 -3])
ans =
     2.318175429247454e+00

The MATLAB function logsumexp used here is available at https://github.com/higham/logsumexp-softmax.

Straightforward evaluation of log-sum-exp from its definition is not recommended, because of the possibility of overflow. Indeed, \exp(x) overflows for x = 12, x = 89, and x = 710 in IEEE half, single, and double precision arithmetic, respectively. Overflow can be avoided by writing

\notag \begin{aligned}   \mathrm{lse}(x) &= \log \sum_{i=1}^n \mathrm{e}^{x_i}          = \log \sum_{i=1}^n \mathrm{e}^a \mathrm{e}^{x_i - a}          = \log  \left(\mathrm{e}^a\sum_{i=1}^n \mathrm{e}^{x_i - a}\right), \end{aligned}

which gives

\notag     \mathrm{lse}(x) = a + \log\displaystyle\sum_{i=1}^n \mathrm{e}^{x_i - a}.

We take a = \max(x), so that all exponentiations are of nonpositive numbers and therefore overflow is avoided. Any underflows are harmless. A refinement is to write

\notag     \qquad\qquad     \mathrm{lse}(x) = \max(x) + \mathrm{log1p}\Biggl( \displaystyle\sum_{i=1 \atop i\ne k}^n                                 \mathrm{e}^{x_i - \max(x)}\Biggr),     \qquad\qquad (\#)

where x_k = \max(x) (if there is more than one such k, we can take any of them). Here, \mathrm{log1p}(x) = \log(1+x) is a function provided in MATLAB and various other languages that accurately evaluates \log(1+x) even when x is small, in which case 1+x would suffer a loss of precision if it was explicitly computed.
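
Here is a minimal MATLAB sketch of (\#); the function name logsumexp_shifted is illustrative, and this is not the logsumexp function from the repository linked above.

function y = logsumexp_shifted(x)
%LOGSUMEXP_SHIFTED Log-sum-exp of a vector x evaluated via the shifted formula (#).
x = x(:);
[xmax, k] = max(x);           % shift by the largest element; k is one maximizing index
s = exp(x - xmax);            % all exponents are <= 0, so overflow cannot occur
s(k) = [];                    % drop the term with x_k = max(x), which equals 1
y = xmax + log1p(sum(s));     % log1p evaluates log(1 + .) accurately for small sums
end

For example, logsumexp_shifted([1 2 30]) reproduces the second example above to the digits displayed.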

Whereas the original formula involves the logarithm of a sum of nonnegative quantities, when \max(x) < 0 the shifted formula (\#) computes \mathrm{lse}(x) as the sum of two terms of opposite sign, so could potentially suffer from numerical cancellation. It can be shown by rounding error analysis, however, that computing log-sum-exp via (\#) is numerically reliable.

References

This is a minimal set of references, which contain further useful references within.

Related Blog Posts

This article is part of the “What Is” series, available from https://nhigham.com/category/what-is and in PDF form from the GitHub repository https://github.com/higham/what-is.