# What Is the Nearest Symmetric Matrix?

What is the nearest symmetric matrix to a real, nonsymmetric square matrix $A$? This question arises whenever a symmetric matrix is approximated by formulas that do not respect the symmetry. For example, a finite difference approximation to a Hessian matrix can be nonsymmetric even though the Hessian is symmetric. In some cases, lack of symmetry is caused by rounding errors. The natural way to symmetrize $A$ is to replace it by $(A + A^T)/2$. Is this the best choice?

As our criterion of optimality we take that $\| A - X \|$ is minimized over symmetric $X$ for some suitable norm. Fan and Hoffman (1955) showed that $(A + A^T)/2$ is a solution in any unitarily invariant norm. A norm is unitarily invariant if $\|A\| = \|UAV\|$ for all unitary $U$ and $V$. Such a norm depends only on the singular values of $A$, and hence $\|A\| = \|A^T\|$ since $A$ and $A^T$ have the same singular values. The most important examples of unitarily invariant norms are the $2$-norm and the Frobenius norm.

The proof that $(A+A^T)/2$ is optimal is simple. For any symmetric $Y$, using $Y = Y^T$ and the invariance of the norm under transposition,

$\notag \begin{aligned} \Bigl\| A - \frac{1}{2}(A + A^T) \Bigr\| &= \frac{1}{2} \|A - A^T \| = \frac{1}{2} \| A - Y + Y^T - A^T \| \\ &\le \frac{1}{2} \| A - Y \| + \frac{1}{2} \|(Y - A)^T \| \\ &= \| A - Y\|. \end{aligned}$

Simple examples of a matrix and a nearest symmetric matrix are

$\notag A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \;\; X = \begin{bmatrix} 0 & \frac{1}{2} \\ \frac{1}{2} & 0 \end{bmatrix},\qquad A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}, \;\; X = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}.$

Note that any $A$ can be written

$\notag A = \frac{1}{2} (A + A^T ) + \frac{1}{2} (A - A^T ) \equiv B + C,$

where $B$ and $C$ are the symmetric part and the skew-symmetric part of $A$, respectively, so the nearest symmetric matrix to $A$ is the symmetric part of $A$.
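
In MATLAB the decomposition is a one-liner. Here is a minimal sketch, with an arbitrarily chosen matrix, that forms the two parts and the nearest symmetric matrix:

A = [1 2; 3 4];            % any real square matrix
B = (A + A')/2;            % symmetric part: the nearest symmetric matrix to A
C = (A - A')/2;            % skew-symmetric part
norm(A - (B + C))          % 0: the two parts reconstruct A
norm(A - B, 'fro')         % Frobenius-norm distance from A to the nearest symmetric matrix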

For the Frobenius norm, $(A + A^T)/2$ is the unique nearest symmetric matrix, which follows from the fact that $\|S + K\|_F^2 = \|S\|_F^2 + \|K\|_F^2$ for symmetric $S$ and skew-symmetric $K$. For the $2$-norm, however, the nearest symmetric matrix is not unique in general. An example of non-uniqueness is

$\notag A = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}, \quad X = \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & \frac{1}{2} \\ 0 & \frac{1}{2} & 0 \end{bmatrix}, \quad Y = \begin{bmatrix} \theta & 0 & 0 \\ 0 & 0 & \frac{1}{2} \\ 0 & \frac{1}{2} & 0 \end{bmatrix},$

for which $\|A - X\|_2 = 0.5$, and $\|A - Y\|_2 = 0.5$ for any $\theta$ such that $|\theta| \le 0.5$.
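
This equality is easy to verify numerically. Here is a sketch with one admissible value of $\theta$:

A = [0 0 0; 0 0 1; 0 0 0];
X = [0 0 0; 0 0 1/2; 0 1/2 0];
theta = 0.3;                          % any value with |theta| <= 0.5 works
Y = [theta 0 0; 0 0 1/2; 0 1/2 0];
[norm(A - X, 2), norm(A - Y, 2)]      % both 2-norm distances equal 0.5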

Entirely analogous to the above is the nearest skew-symmetric matrix problem, for which the solution is the skew-symmetric part for any unitarily invariant norm. For complex matrices, these results generalize in the obvious way: $(A + A^*)/2$ is the nearest Hermitian matrix to $A$ and $(A - A^*)/2$ is the nearest skew-Hermitian matrix to $A$ in any unitarily invariant norm.

# What Is the Softmax Function?

The softmax function takes as input a real $n$-vector $x$ and returns the vector $g$ with elements given by

$\notag \qquad\qquad g_j(x) = \displaystyle\frac{\mathrm{e}^{x_j}}{\sum_{i=1}^n \mathrm{e}^{x_i}}, \quad j=1\colon n. \qquad\qquad (*)$

It arises in machine learning, game theory, and statistics. Since $\mathrm{e}^{x_j} > 0$ and $\sum_{j=1}^n g_j = 1$, the softmax function is often used to convert a vector $x$ into a vector of probabilities, with the more positive entries giving the larger probabilities.

The softmax function is the gradient of the log-sum-exp function

$\notag \mathrm{lse}(x) = \log\displaystyle\sum_{i=1}^n \mathrm{e}^{x_i},$

where $\log$ is the natural logarithm, that is, $g_j(x) = (\partial/\partial x_j) \mathrm{lse}(x)$.
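
This relationship is easy to check numerically by comparing softmax with a central difference approximation to the gradient of log-sum-exp. The sketch below is purely illustrative; lse is a naive anonymous function, which is adequate for small arguments:

lse = @(x) log(sum(exp(x)));                   % log-sum-exp (naive form, fine for small x)
x = [-1 0 1];
g = exp(x)/sum(exp(x))                         % softmax of x
fd = zeros(1,3); h = 1e-6;                     % central difference step
for j = 1:3
    e = zeros(1,3); e(j) = h;
    fd(j) = (lse(x + e) - lse(x - e))/(2*h);   % approximates d lse / d x_j
end
fd                                             % agrees with g to roughly nine digits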

The following plots show the two components of softmax for $n = 2$. Note that they are constant on lines $x_1 - x_2 = \mathrm{constant}$, as shown by the contours.

Here are some examples:

>> softmax([-1 0 1])
ans =
9.0031e-02
2.4473e-01
6.6524e-01
>> softmax([-1 0 10])
ans =
1.6701e-05
4.5397e-05
9.9994e-01


Note how softmax increases the relative weighting of the larger components over the smaller ones. The MATLAB function softmax used here is available at https://github.com/higham/logsumexp-softmax.

A concise alternative formula, which removes the denominator of $(*)$ by rewriting it as the exponential of $\mathrm{lse}(x)$ and moving it into the numerator, is

$\notag \qquad\qquad g_j = \exp\bigl(x_j - \mathrm{lse}(x)\bigr). \qquad\qquad (\#)$

Straightforward evaluation of softmax from either $(*)$ or $(\#)$ is not recommended, because of the possibility of overflow. Overflow can be avoided in $(*)$ by shifting the components of $x$, just as for the log-sum-exp function, to obtain

$\notag \qquad\qquad g_j(x) = \displaystyle\frac{\mathrm{e}^{x_j-\max(x)}}{\sum_{i=1}^n \mathrm{e}^{x_i-\max(x)}}, \quad j=1\colon n, \qquad\qquad (\dagger)$

where $\max(x) = \max_i x_i$. It can be shown that computing softmax via this formula is numerically reliable. The shifted version of $(\#)$ tends to be less accurate, so $(\dagger)$ is preferred.
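
Here is a minimal MATLAB sketch of $(\dagger)$. It is not the softmax function from the repository mentioned above, just an illustration of the shifted formula, saved under the hypothetical file name softmax_shifted.m:

function g = softmax_shifted(x)
%SOFTMAX_SHIFTED Softmax of a real vector x via the shifted formula.
    e = exp(x - max(x));   % largest exponent is 0, so exp cannot overflow
    g = e/sum(e);          % denominator is at least 1
end

For example, softmax_shifted([-1 0 10]) reproduces the values shown above, and softmax_shifted([0 1000]) returns [0 1], whereas the unshifted formula $(*)$ produces a NaN from the quotient Inf/Inf.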


# What Is the Log-Sum-Exp Function?

The log-sum-exp function takes as input a real $n$-vector $x$ and returns the scalar

$\notag \mathrm{lse}(x) = \log \displaystyle\sum_{i=1}^n \mathrm{e}^{x_i},$

where $\log$ is the natural logarithm. It provides an approximation to the largest element of $x$, which is given by the $\max$ function, $\max(x) = \max_i x_i$. Indeed,

$\notag \mathrm{e}^{\max(x)} \le \displaystyle\sum_{i=1}^n \mathrm{e}^{x_i} \le n \mskip1mu \mathrm{e}^{\max(x)},$

and on taking logs we obtain

$\notag \qquad\qquad \max(x) \le \mathrm{lse}(x) \le \max(x) + \log n. \qquad\qquad (*)$

The log-sum-exp function can be thought of as a smoothed version of the max function, because whereas the max function is not differentiable at points where the maximum is achieved in two different components, the log-sum-exp function is infinitely differentiable everywhere. The following plots of $\mathrm{lse}(x)$ and $\max(x)$ for $n = 2$ show this connection.

The log-sum-exp function appears in a variety of settings, including statistics, optimization, and machine learning.

For the special case where $x = [0~t]^T$, we obtain the function $f(t) = \log(1+\mathrm{e}^t)$, which is known as the softplus function in machine learning. The softplus function approximates the ReLU (rectified linear unit) activation function $\max(t,0)$ and satisfies, by $(*)$,

$\notag \max(t,0) \le f(t) \le \max(t,0) + \log 2.$
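
As a quick numerical check of these bounds, here is a sketch in which the anonymous function f is just the definition above evaluated naively, which is fine for moderate $t$:

f = @(t) log(1 + exp(t));            % softplus (naive; exp(t) overflows for large t)
t = [-5 -1 0 1 5];
[max(t,0); f(t); max(t,0) + log(2)]  % in each column the middle value lies between the other two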

Two points are worth noting.

• While $\log(x_1 + x_2) \ne \log x_1 + \log x_2$ in general, we do (trivially) have $\log(x_1 + x_2) = \mathrm{lse}(\log x_1,\log x_2)$, and more generally $\log(x_1 + x_2 + \cdots + x_n) = \mathrm{lse}(\log x_1,\log x_2,\dots,\log x_n)$.
• The log-sum-exp function is not to be confused with the exp-sum-log function: $\exp \sum_{i=1}^n \log x_i = x_1x_2\dots x_n$.

Here are some examples:

>> format long e
>> logsumexp([1 2 3])
ans =
3.407605964444380e+00

>> logsumexp([1 2 30])
ans =
3.000000000000095e+01

>> logsumexp([1 2 -3])
ans =
2.318175429247454e+00


The MATLAB function logsumexp used here is available at https://github.com/higham/logsumexp-softmax.

Straightforward evaluation of log-sum-exp from its definition is not recommended, because of the possibility of overflow. Indeed, $\exp(x)$ overflows for $x = 12$, $x = 89$, and $x = 710$ in IEEE half, single, and double precision arithmetic, respectively. Overflow can be avoided by introducing an arbitrary scalar $a$ and writing

$\notag \mathrm{lse}(x) = \log \sum_{i=1}^n \mathrm{e}^{x_i} = \log \sum_{i=1}^n \mathrm{e}^a \mathrm{e}^{x_i - a} = \log \left(\mathrm{e}^a\sum_{i=1}^n \mathrm{e}^{x_i - a}\right),$

which gives

$\notag \mathrm{lse}(x) = a + \log\displaystyle\sum_{i=1}^n \mathrm{e}^{x_i - a}.$

We take $a = \max(x)$, so that all exponentiations are of nonpositive numbers and therefore overflow is avoided. Any underflows are harmless. A refinement is to write

$\notag \qquad\qquad \mathrm{lse}(x) = \max(x) + \mathrm{log1p}\Biggl( \displaystyle\sum_{i=1 \atop i\ne k}^n \mathrm{e}^{x_i - \max(x)}\Biggr), \qquad\qquad (\#)$

where $x_k = \max(x)$ (if there is more than one such $k$, we can take any of them). Here, $\mathrm{log1p}(x) = \log(1+x)$ is a function provided in MATLAB and various other languages that accurately evaluates $\log(1+x)$ even when $x$ is small, in which case $1+x$ would suffer a loss of precision if it was explicitly computed.

Whereas the original formula involves the logarithm of a sum of nonnegative quantities, when $\max(x) < 0$ the shifted formula $(\#)$ computes $\mathrm{lse}(x)$ as the sum of two terms of opposite sign, so could potentially suffer from numerical cancellation. It can be shown by rounding error analysis, however, that computing log-sum-exp via $(\#)$ is numerically reliable.
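
Here is a minimal MATLAB sketch of $(\#)$. It is not the logsumexp function from the repository mentioned above, just an illustration of the shifted formula with the log1p refinement, under the hypothetical file name logsumexp_shifted.m:

function y = logsumexp_shifted(x)
%LOGSUMEXP_SHIFTED Log-sum-exp of a real vector x via the shifted formula.
    [a, k] = max(x);          % shift a = max(x), attained at index k
    t = exp(x - a);           % all exponents are nonpositive: no overflow
    t(k) = [];                % drop the i = k term, which is exp(0) = 1
    y = a + log1p(sum(t));    % formula (#)
end

For example, logsumexp_shifted([1 2 30]) reproduces the value shown above, and logsumexp_shifted([-1000 -999]) returns about -998.69, whereas the unshifted formula gives log(0) = -Inf because both exponentials underflow.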
