# What Is the Softmax Function?

The softmax function takes as input a real $n$-vector $x$ and returns the vector $g$ with elements given by

$\notag \qquad\qquad g_j(x) = \displaystyle\frac{\mathrm{e}^{x_j}}{\sum_{i=1}^n \mathrm{e}^{x_i}}, \quad j=1\colon n. \qquad\qquad (*)$

It arises in machine learning, game theory, and statistics. Since $g_j(x) > 0$ for all $j$ and $\sum_{j=1}^n g_j(x) = 1$, the softmax function is often used to convert a vector $x$ into a vector of probabilities, with the larger entries of $x$ giving the larger probabilities.
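As a minimal sketch of the definition $(*)$, the following Python function evaluates the formula exactly as written, using only the standard library. The name `softmax_naive` is hypothetical, chosen for illustration, and the code makes no attempt to guard against overflow.

```python
import math

def softmax_naive(x):
    """Evaluate softmax directly from formula (*); hypothetical helper name."""
    exps = [math.exp(xj) for xj in x]   # numerators e^{x_j}
    total = sum(exps)                   # common denominator
    return [e / total for e in exps]

g = softmax_naive([-1.0, 0.0, 1.0])
print(g)        # three positive entries, increasing with the entries of x
print(sum(g))   # equal to 1 up to rounding error
```

The output is a vector of positive numbers that sum to one, ordered in the same way as the entries of the input.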

The softmax function is the gradient of the log-sum-exp function

$\notag \mathrm{lse}(x) = \log\displaystyle\sum_{i=1}^n \mathrm{e}^{x_i},$

where $\log$ is the natural logarithm, that is, $g_j(x) = (\partial/\partial x_j) \mathrm{lse}(x)$.
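The gradient relation follows from the chain rule, and it can be checked numerically. The sketch below compares a central-difference approximation to the gradient of $\mathrm{lse}$ with softmax evaluated from $(*)$. The function names and the step size `h = 1e-6` are assumptions made for this illustration, not part of any library.

```python
import math

def lse(x):
    """Log-sum-exp evaluated directly from its definition (no overflow protection)."""
    return math.log(sum(math.exp(xi) for xi in x))

def softmax_naive(x):
    """Softmax evaluated directly from formula (*)."""
    exps = [math.exp(xi) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]

def lse_gradient_fd(x, h=1e-6):
    """Central-difference approximation to the gradient of lse at x."""
    grad = []
    for j in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[j] += h
        xm[j] -= h
        grad.append((lse(xp) - lse(xm)) / (2 * h))
    return grad

x = [-1.0, 0.0, 1.0]
fd = lse_gradient_fd(x)
g = softmax_naive(x)
max_diff = max(abs(a - b) for a, b in zip(fd, g))
print(max_diff)  # small: limited by the finite-difference error, not by the identity
```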

The following plots show the two components of softmax for $n = 2$. Note that they are constant on lines $x_1 - x_2 = \mathrm{constant}$, as shown by the contours.
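The reason is easy to see for $n = 2$. Dividing the numerator and denominator of $(*)$ by $\mathrm{e}^{x_1}$ for $g_1$, and by $\mathrm{e}^{x_2}$ for $g_2$, gives

$\notag \qquad\qquad g_1(x) = \displaystyle\frac{1}{1 + \mathrm{e}^{-(x_1 - x_2)}}, \qquad g_2(x) = \displaystyle\frac{1}{1 + \mathrm{e}^{x_1 - x_2}},$

so both components depend on $x$ only through the difference $x_1 - x_2$. More generally, adding the same constant $c$ to every component of $x$ multiplies the numerator and the denominator of $(*)$ by $\mathrm{e}^c$ and so leaves softmax unchanged.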

Here are some examples:

```
>> softmax([-1 0 1])
ans =
   9.0031e-02
   2.4473e-01
   6.6524e-01
>> softmax([-1 0 10])
ans =
   1.6701e-05
   4.5397e-05
   9.9994e-01
```


Note how softmax increases the relative weighting of the larger components over the smaller ones. The MATLAB function softmax used here is available at https://github.com/higham/logsumexp-softmax.

A concise alternative formula, which removes the denominator of $(*)$ by rewriting it as the exponential of $\mathrm{lse}(x)$ and moving it into the numerator, is

$\notag \qquad\qquad g_j = \exp\bigl(x_j - \mathrm{lse}(x)\bigr). \qquad\qquad (\#)$
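Written out as a chain of equalities, and using $\sum_{i=1}^n \mathrm{e}^{x_i} = \mathrm{e}^{\mathrm{lse}(x)}$, which is just the definition of $\mathrm{lse}$ exponentiated, the step from $(*)$ to $(\#)$ is

$\notag \qquad\qquad g_j(x) = \displaystyle\frac{\mathrm{e}^{x_j}}{\sum_{i=1}^n \mathrm{e}^{x_i}} = \displaystyle\frac{\mathrm{e}^{x_j}}{\mathrm{e}^{\mathrm{lse}(x)}} = \exp\bigl(x_j - \mathrm{lse}(x)\bigr).$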

Straightforward evaluation of softmax from either $(*)$ or $(\#)$ is not recommended, because of the possibility of overflow. Overflow can be avoided in $(*)$ by shifting the components of $x$, just as for the log-sum-exp function, to obtain

$\notag \qquad\qquad g_j(x) = \displaystyle\frac{\mathrm{e}^{x_j-\max(x)}}{\sum_{i=1}^n \mathrm{e}^{x_i-\max(x)}}, \quad j=1\colon n, \qquad\qquad (\dagger)$

where $\max(x) = \max_i x_i$. It can be shown that computing softmax via this formula is numerically reliable. The shifted version of $(\#)$ tends to be less accurate, so $(\dagger)$ is preferred.
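The effect of the shift can be illustrated with a short Python sketch that uses only the standard library. The function names are hypothetical. One assumption matters here: Python's `math.exp` raises `OverflowError` when the result is too large, whereas environments that follow IEEE arithmetic conventions more directly, such as MATLAB, would typically return `Inf` and then produce `NaN` from `Inf/Inf`.

```python
import math

def softmax_naive(x):
    """Formula (*) evaluated directly, with no shift."""
    exps = [math.exp(xi) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_shifted(x):
    """Formula (dagger): subtract max(x) from every component before exponentiating."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]  # every exponent is <= 0, so exp cannot overflow
    total = sum(exps)                      # total >= 1, since one term is exp(0) = 1
    return [e / total for e in exps]

x = [1000.0, 1001.0, 1002.0]

try:
    softmax_naive(x)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True                # math.exp(1000.0) is out of range for a double

g = softmax_shifted(x)
print(naive_overflowed)
print(g)
```

Because softmax is unchanged when a constant is added to every component, the shifted result for this $x$ is the same as softmax of $[-2, -1, 0]$. The shift can cause some terms to underflow to zero, but the sum in the denominator is always at least one, so the division is safe.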

## References

This is a minimal set of references, which contain further useful references within.