Faster SVD via Polar Decomposition

The car number plate of Gene Golub, who did so much to promote the SVD. Photo credit: P. M. Kroonenberg.

The singular value decomposition (SVD) is one of the most important tools in matrix theory and matrix computations. It is described in many textbooks and is provided in all the standard numerical computing packages. I wrote a two-page article about the SVD for The Princeton Companion to Applied Mathematics, which can be downloaded in pre-publication form as an EPrint.

The polar decomposition is a close cousin of the SVD. While it is probably less well known it also deserves recognition as a fundamental theoretical and computational tool. The polar decomposition is the natural generalization to matrices of the polar form z = r \mathrm{e}^{\mathrm{i}\theta} for complex numbers, where r\ge0, \mathrm{i} is the imaginary unit, and \theta\in(-\pi,\pi]. The generalization to an m\times n matrix is A = UH, where U is m\times n with orthonormal columns and H is n\times n and Hermitian positive semidefinite. Here, U plays the role of \mathrm{e}^{\mathrm{i}\theta} in the scalar case and H the role of r.

It is easy to prove existence and uniqueness of the polar decomposition when A has full rank. Since A = UH implies A^*A = HU^*UH = H^2, we see that H must be the Hermitian positive definite square root of the Hermitian positive definite matrix A^*A. Therefore we set H = (A^*A)^{1/2}, after which U = AH^{-1} is forced. It just remains to check that this U has orthonormal columns: U^*U = H^{-1}A^*AH^{-1} = H^{-1}H^2H^{-1} = I.

Many applications of the polar decomposition stem from a best approximation property: for any m\times n matrix A the nearest matrix with orthonormal columns is the polar factor U, for distance measured in the 2-norm, the Frobenius norm, or indeed any unitarily invariant norm. This result is useful in applications where a matrix that should be unitary turns out not to be so because of errors of various kinds: one simply replaces the matrix by its unitary polar factor.

However, a more trivial property of the polar decomposition is also proving to be important. Suppose we are given A = UH and we compute the eigenvalue decomposition H = QDQ^*, where D is diagonal with the eigenvalues of H on its diagonal and Q is a unitary matrix of eigenvectors. Then A = UH = UQDQ^* = (UQ)DQ^* \equiv P\Sigma Q^* is an SVD! My PhD student Pythagoras Papadimitriou and I proposed using this relation to compute the SVD in 1994, and obtained speedups by a factor six over the LAPACK SVD code on the Kendall Square KSR1, a shared memory parallel machine of the time.

Yuji Nakatsukasa and I recently revisited this idea. In a 2013 paper in the SIAM Journal of Scientific Computing we showed that on modern architectures it is possible to compute the SVD via the polar decomposition in a way that is both numerically stable and potentially much faster than the standard Golub–Reinsch SVD algorithm. Our algorithm has two main steps.

  1. Compute the polar decomposition by an accelerated Halley algorithm called QDWH devised by Nakatsukasa, Bai, and Gygi (2010), for which the main computational kernel is QR factorization.
  2. Compute the eigendecomposition of the Hermitian polar factor by a spectral divide and conquer algorithm. This algorithm repeatedly applies QDWH to the current block to compute an invariant subspace corresponding to the positive or negative eigenvalues and thereby divides the problem into two smaller pieces.

The polar decomposition is fundamental to both steps of the algorithm. While the total number of flops required is greater than for the standard SVD algorithm, the new algorithm has lower communication costs and so should be faster on parallel computing architectures once communication costs are sufficiently greater than the costs of floating point arithmetic. Sukkari, Ltaief, and Keyes have recently shown that on a multicore architecture enhanced with multiple GPUs the new QDWH-based algorithm is indeed faster than the standard approach. Another interesting feature of the new algorithm is that it has been found experimentally to have better accuracy.

The Halley iteration that underlies the QDWH algorithm for the polar decomposition has cubic convergence. A version of QDWH with order of convergence seventeen, which requires just two iterations to converge to double-precision accuracy, has been developed by Nakatsukasa and Freund (2015), and is aimed particularly at parallel architectures. This is a rare example of an iteration with a very high order of convergence actually being of practical use.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s