1 Introduction
Modern information processing and machine learning often have to deal with (structured) low-rank matrix factorization. Given a few observations about a matrix $M^{\star}$ of rank $r$, one seeks a low-rank solution compatible with this set of observations as well as other prior constraints. Examples include low-rank matrix completion [1, 2, 3], phase retrieval [4], blind deconvolution and self-calibration [5, 6], robust principal component analysis [7, 8], and synchronization and alignment [9, 10], to name just a few. A common goal of these problems is to develop reliable, scalable, and robust algorithms to estimate a low-rank matrix from potentially noisy, nonlinear, and highly incomplete observations.
1.1 Optimization-based methods
Towards this goal, arguably one of the most popular approaches is optimization-based methods. By factorizing a candidate solution $M \in \mathbb{R}^{n_1 \times n_2}$ as $M = XY^{\top}$ with low-rank factors $X \in \mathbb{R}^{n_1 \times r}$ and $Y \in \mathbb{R}^{n_2 \times r}$, one attempts recovery by solving an optimization problem of the form
$\text{minimize}_{\,M \in \mathbb{R}^{n_1 \times n_2}} \quad f(M)$  (1a)
subject to  $M = XY^{\top}$  (1b)
$M \in \mathcal{C}$  (1c)
Here, $f(\cdot)$ is a certain empirical risk function (e.g. the Euclidean loss, the negative log-likelihood) that evaluates how well a candidate solution fits the observations, and the set $\mathcal{C}$ encodes additional prior constraints, if any. This problem is often highly nonconvex and appears daunting to solve to global optimality at first sight. After all, conventional wisdom usually perceives nonconvex optimization as a computationally intractable task that is susceptible to local minima.
To bypass the challenge, one can resort to convex relaxation, an effective strategy that already enjoys theoretical success in addressing a large number of problems. The basic idea is to convexify the problem by, among other things, dropping or replacing the low-rank constraint [cf. (1b)] by a nuclear norm constraint [11, 12, 1, 13, 2, 3], and solving the convexified problem in the full matrix space (i.e. the space of $M$). While such convex relaxation schemes exhibit intriguing performance guarantees in several aspects (e.g. near-minimal sample complexity and stability against noise), their computational cost often scales at least cubically in the size of the matrix $M$, which often far exceeds the time taken to read the data. In addition, the prohibitive storage complexity associated with the convex relaxation approach presents another hurdle that limits its applicability to large-scale problems.
This overview article focuses on provable low-rank matrix estimation based on nonconvex optimization. This approach operates over the parsimonious factorized representation [cf. (1b)] and optimizes the nonconvex loss directly over the low-rank factors $X$ and $Y$. The advantage is clear: adopting an economical representation of the low-rank matrix results in low storage requirements, affordable per-iteration computational cost, amenability to parallelization, and scalability to large problem sizes when performing iterative optimization methods like gradient descent. However, despite its wide use and remarkable performance in practice [14, 15], the foundational understanding of generic nonconvex optimization is far from mature. It is often unclear whether an optimization algorithm can converge to the desired global solution and, if so, how fast this can be accomplished. For many nonconvex problems, theoretical underpinnings had been lacking until very recently.
1.2 Nonconvex optimization meets statistical models
Fortunately, despite general intractability, some important nonconvex problems may not be as hard as they seem. For instance, for several low-rank matrix factorization problems, it has been shown that, under proper statistical models, simple first-order methods are guaranteed to succeed in a small number of iterations, achieving low computational and sample complexities simultaneously (e.g. [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26]). The key to enabling guaranteed and scalable computation is to concentrate on problems arising from specific statistical signal estimation tasks, which may exhibit benign structures amenable to computation and rule out undesired “hard” instances by focusing on the average-case performance. Two messages deserve particular attention when we examine the geometry of the associated nonconvex loss functions:

Basin of attraction. For several statistical problems of this kind, there often exists a reasonably large basin of attraction around the global solution, within which an iterative method like gradient descent is guaranteed to be successful and converge fast. Such a basin might exist even when the sample complexity is quite close to the information-theoretic limit [16, 17, 18, 19, 20, 21, 22].

Benign global landscape. Several problems provably enjoy a benign optimization landscape when the sample size is sufficiently large, in the sense that there are no spurious local minima (i.e. all local minima are also global minima), and that the only undesired stationary points are strict saddle points [27, 28, 29, 30, 31].
These important messages have inspired a recent flurry of activity in the design of two contrasting algorithmic approaches:

Two-stage approach. Motivated by the existence of a basin of attraction, a large number of works follow a two-stage paradigm: (1) initialization, which locates an initial guess within the basin; (2) iterative refinement, which successively refines the estimate without leaving the basin. This approach often leads to very efficient algorithms that run in time proportional to that taken to read the data.

Saddle-point escaping algorithms. In the absence of spurious local minima, a key challenge boils down to how to efficiently escape undesired saddle points and find a local minimum, which is the focus of this approach. This approach does not rely on carefully designed initialization.
The research along these lines highlights the synergy between statistics and optimization in signal processing and machine learning. The algorithmic choice often needs to properly exploit the underlying statistical models in order to be truly efficient, in terms of both statistical accuracy and computational efficiency.
1.3 This paper
Understanding the effectiveness of nonconvex optimization is currently among the most active areas of research in signal processing, machine learning, optimization and statistics. Many exciting new developments in the last several years have significantly advanced our understanding of this approach for various statistical problems. This article aims to provide a thorough technical overview of important recent results in this exciting area, targeting the broader signal processing, machine learning, statistics, and optimization communities.
The rest of this paper is organized as follows. Section 2 reviews some preliminary facts on optimization that are instrumental to understanding the materials in this paper. Section 3 uses a toy (but nontrivial) example, namely rank-1 matrix factorization, to illustrate why there is hope of solving a nonconvex problem to global optimality, through both local and global lenses. Section 4 introduces a few canonical statistical estimation problems that will be visited multiple times in the sequel. Sections 5 and 6 review gradient descent and its many variants as a local refinement procedure, followed by a discussion of other methods in Section 7. Section 8 discusses the spectral method, which is commonly used to provide an initialization within the basin of attraction. Section 9 provides a global landscape analysis, in conjunction with algorithms that work without the need for careful initialization. We conclude the paper in Section 10 with some discussions and remarks. Furthermore, a short note is provided at the end of several sections to cover historical remarks and provide further pointers.
1.4 Notations
It is convenient to introduce a few notations that will be used throughout. We use boldfaced symbols to represent vectors and matrices. For any vector $v$, we let $\|v\|_1$, $\|v\|_2$, and $\|v\|_{\infty}$ denote its $\ell_1$, $\ell_2$, and $\ell_{\infty}$ norm, respectively. For any matrix $M$, let $\|M\|$, $\|M\|_{\mathrm{F}}$, $\|M\|_{*}$, $\|M\|_{2,\infty}$, and $\|M\|_{\infty}$ stand for the spectral norm (i.e. the largest singular value), the Frobenius norm, the nuclear norm (i.e. the sum of the singular values), the $\ell_2 / \ell_{\infty}$ norm (i.e. the largest $\ell_2$ norm of the rows), and the entrywise $\ell_{\infty}$ norm (the largest magnitude of all entries), respectively. We denote by $\sigma_i(M)$ (resp. $\lambda_i(M)$) the $i$th largest singular value (resp. eigenvalue) of $M$, and let $M_{i,\cdot}$ (resp. $M_{\cdot,j}$) represent its $i$th row (resp. $j$th column). The condition number of $M$ is denoted by $\kappa(M) = \sigma_{\max}(M)/\sigma_{\min}(M)$. In addition, $M^{\top}$, $M^{\mathrm{H}}$, and $\overline{M}$ indicate the transpose, the conjugate transpose, and the entrywise conjugate of $M$, respectively. For two matrices $A$ and $B$ of the same size, we define their inner product as $\langle A, B \rangle = \mathrm{Tr}(A^{\top} B)$, where $\mathrm{Tr}(\cdot)$ stands for the trace. The matrix $I_n$ denotes the $n \times n$ identity matrix, and $e_i$ denotes its $i$th column. For any linear operator $\mathcal{A}$, denote by $\mathcal{A}^{*}$ its adjoint operator. For example, if $\mathcal{A}$ maps $M$ to $\{\langle A_i, M \rangle\}_{1 \le i \le m}$, then $\mathcal{A}^{*}(y) = \sum_{i=1}^{m} y_i A_i$. We also let $\mathrm{vec}(M)$ denote the vectorization of a matrix $M$. The indicator function $\mathbb{1}_{\{E\}}$ equals $1$ when the event $E$ holds true, and $0$ otherwise. Further, the notation $\mathcal{O}^{r \times r}$ denotes the set of $r \times r$ orthonormal matrices. Let $A$ and $B$ be two square matrices. We write $A \succ B$ (resp. $A \succeq B$) if their difference $A - B$ is a positive definite (resp. positive semidefinite) matrix. Additionally, the standard notation $f(n) = O(g(n))$ or $f(n) \lesssim g(n)$ means that there exists a constant $c_1 > 0$ such that $|f(n)| \le c_1 |g(n)|$, $f(n) \gtrsim g(n)$ means that there exists a constant $c_2 > 0$ such that $|f(n)| \ge c_2 |g(n)|$, and $f(n) \asymp g(n)$ means that both $f(n) \lesssim g(n)$ and $f(n) \gtrsim g(n)$ hold.
2 Preliminaries in optimization theory
We start by reviewing some basic concepts and preliminary facts in optimization theory that play a vital role in the main development of the theory. For simplicity of presentation, this section focuses on the unconstrained problem
$\text{minimize}_{\,x \in \mathbb{R}^n} \quad f(x).$  (2)
The optimal solution, if it exists, is denoted by
$x^{\star} := \arg\min_{x} f(x).$  (3)
When $f$ is strictly convex,^{1} $x^{\star}$ is unique, but it may be non-unique when $f$ is nonconvex. (^{1}Recall that $f$ is said to be strictly convex if and only if for any $x \neq y$ and $t \in (0, 1)$, one has $f(tx + (1-t)y) < t f(x) + (1-t) f(y)$.)
2.1 Gradient descent for locally strongly convex functions
To solve (2), arguably the simplest method is (vanilla) gradient descent (GD), which follows the update rule
$x^{t+1} = x^{t} - \eta_t \nabla f(x^{t}), \qquad t = 0, 1, \ldots$  (4)
Here, $\eta_t$ is the step size or learning rate at the $t$th iteration, and $x^{0}$ is the initial point. This method and its variants are widely used in practice, partly due to their simplicity and scalability to large-scale problems.
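For concreteness, the update rule can be sketched in a few lines of NumPy. The test problem below (a small least-squares instance, with sizes, seed, and step size of our own choosing) is purely illustrative and not taken from the text.

```python
import numpy as np

def gradient_descent(grad, x0, eta, num_iters):
    """Vanilla GD: x^{t+1} = x^t - eta * grad(x^t), with a constant step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - eta * grad(x)
    return x

# Hypothetical test problem: f(x) = 0.5 * ||A x - b||_2^2, whose gradient
# is A^T (A x - b) and whose minimizer is the least-squares solution.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
grad = lambda x: A.T @ (A @ x - b)

eta = 1.0 / np.linalg.norm(A, 2) ** 2          # eta = 1/beta with beta = ||A^T A||
x_hat = gradient_descent(grad, np.zeros(5), eta, 2000)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]  # closed-form minimizer, for comparison
```

With this (conservative) step size the iterates match the closed-form least-squares solution to machine precision.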
A central question is when GD converges fast to the global minimum $x^{\star}$. As is well-known in the optimization literature, GD is provably convergent at a linear rate when $f$ is (locally) strongly convex and smooth. Here, an algorithm is said to converge linearly if the error $\|x^{t} - x^{\star}\|_2$ converges to $0$ as a geometric series. To formally state this result, we define two concepts that commonly arise in the optimization literature.
Definition 1 ($\alpha$-strong convexity).
A twice continuously differentiable function $f$ is said to be $\alpha$-strongly convex in a set $\mathcal{B}$ if
$\nabla^2 f(x) \succeq \alpha I_n, \qquad \forall x \in \mathcal{B}.$  (5)
Definition 2 ($\beta$-smoothness).
A twice continuously differentiable function $f$ is said to be $\beta$-smooth in a set $\mathcal{B}$ if
$\left\| \nabla^2 f(x) \right\| \le \beta, \qquad \forall x \in \mathcal{B}.$  (6)
With these in place, we have the following standard result:
Lemma 1.
Suppose that $f$ is $\alpha$-strongly convex and $\beta$-smooth within a local ball $\mathcal{B}_{\delta}(x^{\star}) := \{ x : \|x - x^{\star}\|_2 \le \delta \}$, and that $x^{0} \in \mathcal{B}_{\delta}(x^{\star})$. If $\eta_t \equiv 1/\beta$, then GD obeys
$\|x^{t} - x^{\star}\|_2 \le \left( 1 - \alpha/\beta \right)^{t} \|x^{0} - x^{\star}\|_2, \qquad t = 0, 1, \ldots$
Proof of Lemma 1.
The optimality of $x^{\star}$ indicates that $\nabla f(x^{\star}) = 0$, which allows one to rewrite the GD update rule as
$x^{t+1} - x^{\star} = x^{t} - \eta_t \nabla f(x^{t}) - \left( x^{\star} - \eta_t \nabla f(x^{\star}) \right) = \left[ I_n - \eta_t \int_0^1 \nabla^2 f(x(\tau))\, d\tau \right] (x^{t} - x^{\star}),$
where $x(\tau) := x^{\star} + \tau (x^{t} - x^{\star})$. Here, the second identity arises from the fundamental theorem of calculus [32, Theorem 4.2]. If $x^{t} \in \mathcal{B}_{\delta}(x^{\star})$, then it is self-evident that $x(\tau) \in \mathcal{B}_{\delta}(x^{\star})$ for all $0 \le \tau \le 1$, which combined with the assumptions of Lemma 1 gives
$\alpha I_n \preceq \nabla^2 f(x(\tau)) \preceq \beta I_n, \qquad 0 \le \tau \le 1.$
Therefore, as long as $0 < \eta_t \le 1/\beta$ (and hence $\| I_n - \eta_t \nabla^2 f(x(\tau)) \| \le 1 - \alpha \eta_t$), we have
$\left\| I_n - \eta_t \int_0^1 \nabla^2 f(x(\tau))\, d\tau \right\| \le 1 - \alpha \eta_t.$
This together with the sub-multiplicativity of $\|\cdot\|$ yields
$\|x^{t+1} - x^{\star}\|_2 \le \left\| I_n - \eta_t \int_0^1 \nabla^2 f(x(\tau))\, d\tau \right\| \cdot \|x^{t} - x^{\star}\|_2 \le (1 - \alpha \eta_t)\, \|x^{t} - x^{\star}\|_2.$
By setting $\eta_t \equiv 1/\beta$, we arrive at the desired error contraction, namely,
$\|x^{t+1} - x^{\star}\|_2 \le \left( 1 - \alpha/\beta \right) \|x^{t} - x^{\star}\|_2.$  (7)
A byproduct is: if $x^{t} \in \mathcal{B}_{\delta}(x^{\star})$, then the next iterate $x^{t+1}$ also falls in $\mathcal{B}_{\delta}(x^{\star})$. Consequently, applying the above argument repetitively and recalling the assumption $x^{0} \in \mathcal{B}_{\delta}(x^{\star})$, we see that all GD iterates remain within $\mathcal{B}_{\delta}(x^{\star})$. Hence, (7) holds true for all $t$. This immediately concludes the proof. ∎
This result essentially implies that: to yield $\varepsilon$-accuracy (in a relative sense), i.e. $\|x^{t} - x^{\star}\|_2 \le \varepsilon \|x^{0} - x^{\star}\|_2$, the number of iterations required for GD — termed the iteration complexity — is at most
$O\Big( \frac{\beta}{\alpha} \log \frac{1}{\varepsilon} \Big)$
if we initialize GD properly such that $x^{0}$ lies in the local region $\mathcal{B}_{\delta}(x^{\star})$. In words, the iteration complexity scales linearly with the condition number $\beta / \alpha$ — the ratio of the smoothness to the strong convexity parameter. As we shall see, for multiple problems considered herein, the radius $\delta$ of this locally strongly convex and smooth ball can be reasonably large (e.g. on the same order as $\|x^{\star}\|_2$).
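The per-iteration contraction of Lemma 1 can be checked numerically on a strongly convex quadratic, for which $\alpha$ and $\beta$ are simply the extreme eigenvalues of the Hessian. The instance below (dimension, spectrum, seed) is an arbitrary sketch of ours, not an example from the text.

```python
import numpy as np

# f(x) = 0.5 * x^T H x has minimizer x* = 0; its strong convexity and
# smoothness parameters are the extreme eigenvalues of H (alpha=1, beta=10).
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))
eigs = np.linspace(1.0, 10.0, 10)
H = Q @ np.diag(eigs) @ Q.T
alpha, beta = eigs[0], eigs[-1]

x = rng.standard_normal(10)
errs = [np.linalg.norm(x)]
for _ in range(50):
    x = x - (1.0 / beta) * (H @ x)   # GD with step size eta = 1/beta
    errs.append(np.linalg.norm(x))

# Each step should contract the error by at least the factor (1 - alpha/beta).
ratios = [errs[t + 1] / errs[t] for t in range(50)]
```

Every observed ratio stays below $1 - \alpha/\beta = 0.9$, matching the geometric decay predicted by (7).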
2.2 Convergence under regularity conditions
Another condition that has been extensively employed in the literature is the Regularity Condition (RC), which accommodates algorithms beyond vanilla GD and is applicable to possibly nonsmooth functions. Specifically, consider the iterative algorithm
$x^{t+1} = x^{t} - \eta_t\, g(x^{t})$  (8)
for some general mapping $g: \mathbb{R}^n \to \mathbb{R}^n$. In vanilla GD, $g(x) = \nabla f(x)$, but $g$ can also incorporate several variants of GD; see Section 6. The regularity condition is defined as follows.
Definition 3 (Regularity condition).
$g$ is said to obey the regularity condition $\mathrm{RC}(\mu, \lambda)$ for some $\mu, \lambda > 0$ if
$\langle g(x), x - x^{\star} \rangle \ge \frac{\mu}{2} \|g(x)\|_2^2 + \frac{\lambda}{2} \|x - x^{\star}\|_2^2$  (9)
for all $x \in \mathcal{B}_{\delta}(x^{\star})$.
This condition basically implies that at any feasible point $x$, the associated negative search direction $-g(x)$ is positively correlated with the error $x^{\star} - x$, and hence the update rule (8) — in conjunction with a sufficiently small step size — drags the current point closer towards the global solution. It follows from the Cauchy–Schwarz inequality that one must have $\mu \lambda \le 1$.
It is worth noting that this condition does not require $g$ to be differentiable. Also, when $g = \nabla f$, it does not require $f$ to be convex within $\mathcal{B}_{\delta}(x^{\star})$; see Fig. 1 for an example. Instead, the regularity condition can be viewed as a combination of smoothness and “one-point strong convexity” (as the condition is stated w.r.t. the point $x^{\star}$), the latter defined as follows:
$f(x^{\star}) \ge f(x) + \langle \nabla f(x), x^{\star} - x \rangle + \frac{\lambda}{2} \|x - x^{\star}\|_2^2$  (10)
for all $x \in \mathcal{B}_{\delta}(x^{\star})$. To see this, observe that
$\langle \nabla f(x), x - x^{\star} \rangle \ge f(x) - f(x^{\star}) + \frac{\lambda}{2} \|x - x^{\star}\|_2^2 \ge \frac{1}{2\beta} \|\nabla f(x)\|_2^2 + \frac{\lambda}{2} \|x - x^{\star}\|_2^2,$  (11)
where the second inequality follows from an equivalent characterization of $\beta$-smoothness [33, Theorem 5.8], namely $f(x) - f(x^{\star}) \ge \frac{1}{2\beta} \|\nabla f(x)\|_2^2$. Combining (10) and (11) arrives at the regularity condition $\mathrm{RC}(1/\beta, \lambda)$ with $g = \nabla f$.
Under the regularity condition, (8) converges linearly:
Lemma 2.
Under $\mathrm{RC}(\mu, \lambda)$, the iterative algorithm (8) with $\eta_t \equiv \eta = \mu$ and $x^{0} \in \mathcal{B}_{\delta}(x^{\star})$ obeys
$\|x^{t} - x^{\star}\|_2^2 \le (1 - \mu \lambda)^{t} \|x^{0} - x^{\star}\|_2^2, \qquad t = 0, 1, \ldots$
Proof of Lemma 2.
Assuming that $x^{t} \in \mathcal{B}_{\delta}(x^{\star})$, we can obtain
$\|x^{t+1} - x^{\star}\|_2^2 = \|x^{t} - \eta\, g(x^{t}) - x^{\star}\|_2^2 = \|x^{t} - x^{\star}\|_2^2 - 2\eta \langle g(x^{t}), x^{t} - x^{\star} \rangle + \eta^2 \|g(x^{t})\|_2^2 \le (1 - \eta \lambda) \|x^{t} - x^{\star}\|_2^2 + \eta (\eta - \mu) \|g(x^{t})\|_2^2 \le (1 - \eta \lambda) \|x^{t} - x^{\star}\|_2^2,$
where the first inequality comes from $\mathrm{RC}(\mu, \lambda)$, and the last line arises if $0 < \eta \le \mu$. By setting $\eta = \mu$, we arrive at
$\|x^{t+1} - x^{\star}\|_2^2 \le (1 - \mu \lambda) \|x^{t} - x^{\star}\|_2^2,$  (12)
which also shows that $x^{t+1} \in \mathcal{B}_{\delta}(x^{\star})$. The claim then follows by induction under the assumption $x^{0} \in \mathcal{B}_{\delta}(x^{\star})$. ∎
In view of Lemma 2, the iteration complexity to reach $\varepsilon$-accuracy (i.e. $\|x^{t} - x^{\star}\|_2 \le \varepsilon \|x^{0} - x^{\star}\|_2$) is at most
$O\Big( \frac{1}{\mu \lambda} \log \frac{1}{\varepsilon} \Big),$
as long as a suitable initialization is provided.
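To make the regularity condition concrete, the sketch below checks $\mathrm{RC}(\mu, \lambda)$ with $\mu = 1/\beta$ and $\lambda = \alpha$ for $g(x) = \nabla f(x)$ on a strongly convex quadratic (where one-point strong convexity holds with parameter $\alpha$), and then checks the per-step contraction of Lemma 2. The instance is our own hypothetical example.

```python
import numpy as np

# f(x) = 0.5 * x^T H x, g(x) = grad f(x) = H x, x* = 0; alpha = 2, beta = 8.
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((8, 8)))
eigs = np.linspace(2.0, 8.0, 8)
H = Q @ np.diag(eigs) @ Q.T
alpha, beta = eigs[0], eigs[-1]
mu, lam = 1.0 / beta, alpha

rc_holds = True
for _ in range(100):
    x = rng.standard_normal(8)
    g = H @ x
    # <g(x), x - x*> >= (mu/2) ||g(x)||^2 + (lam/2) ||x - x*||^2
    rc_holds = rc_holds and (g @ x >= 0.5 * mu * (g @ g) + 0.5 * lam * (x @ x) - 1e-9)

# Iterating x^{t+1} = x^t - mu * g(x^t) contracts the squared error by (1 - mu*lam).
contract_ok, x = True, rng.standard_normal(8)
for _ in range(30):
    x_new = x - mu * (H @ x)
    contract_ok = contract_ok and (x_new @ x_new <= (1 - mu * lam) * (x @ x) + 1e-12)
    x = x_new
```

Here $\mu \lambda = 1/4 \le 1$, consistent with the Cauchy–Schwarz constraint noted above.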
2.3 Critical points
An iterative algorithm like gradient descent often converges to one of its fixed points [34]. For gradient descent, the associated fixed points are (firstorder) critical points or stationary points of the loss function, defined as follows.
Definition 4 (First-order critical point).
A first-order critical point (stationary point) of $f$ is any point $x$ that satisfies
$\nabla f(x) = 0.$
Moreover, we call a point $x$ an $\varepsilon$-first-order critical point, for some $\varepsilon > 0$, if it satisfies $\|\nabla f(x)\|_2 \le \varepsilon$.
A critical point can be a local minimum, a local maximum, or a saddle point of $f$, depending on the curvatures at and surrounding the point. Specifically, denote by $\nabla^2 f(x)$ the Hessian matrix of $f$ at $x$, and let $\lambda_{\min}(\nabla^2 f(x))$ be its minimum eigenvalue. Then for any first-order critical point $x$:

if $\nabla^2 f(x) \prec 0$, then $x$ is a local maximum;

if $\nabla^2 f(x) \succ 0$, then $x$ is a local minimum;

if $\lambda_{\min}(\nabla^2 f(x)) = 0$, then $x$ is either a local minimum or a degenerate saddle point;

if $\lambda_{\min}(\nabla^2 f(x)) < 0$, then $x$ is a strict saddle point.
Another concept that will be useful is that of second-order critical points, defined as follows.
Definition 5 (Second-order critical point).
A point $x$ is said to be a second-order critical point (stationary point) if $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$.
Clearly, second-order critical points do not encompass local maxima and strict saddle points, and, as we shall see, are of more interest for the nonconvex problems considered herein. Since we are interested in minimizing the loss function, we do not distinguish local maxima from strict saddle points, as both admit directions of strictly negative curvature.
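The classification above can be mimicked numerically: at a critical point, the sign of the smallest Hessian eigenvalue separates local minima from strict saddles. The toy function below is our own example, not one from the text.

```python
import numpy as np

# Toy function f(x, y) = x^4/4 - x^2/2 + y^2/2 (hypothetical example):
# its critical points are (0, 0), (1, 0), (-1, 0), and its Hessian is
# diag(3x^2 - 1, 1).
def grad(x, y):
    return np.array([x**3 - x, y])

def hessian(x, y):
    return np.array([[3.0 * x**2 - 1.0, 0.0],
                     [0.0, 1.0]])

def classify(x, y):
    assert np.allclose(grad(x, y), 0.0)    # only classify critical points
    lam_min = np.linalg.eigvalsh(hessian(x, y))[0]
    if lam_min > 0:
        return "local minimum"             # also a second-order critical point
    if lam_min < 0:
        return "strict saddle"             # has a negative-curvature direction
    return "degenerate"
```

For this function, the origin is a strict saddle while $(\pm 1, 0)$ are local (in fact global) minima.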
3 A warm-up example: rank-1 matrix factorization
For pedagogical reasons, we begin with a self-contained study of a simple nonconvex matrix factorization problem, demonstrating local convergence within the basin of attraction in Section 3.1 and the benign global landscape in Section 3.2. The analysis in this section requires only elementary calculations.
Specifically, consider a positive semidefinite matrix $M \in \mathbb{R}^{n \times n}$ (which is not necessarily low-rank) with eigendecomposition $M = \sum_{i=1}^{n} \lambda_i u_i u_i^{\top}$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$. We assume throughout this section that there is a gap between the 1st and 2nd largest eigenvalues:
$\lambda_1 > \lambda_2.$  (13)
The aim is to find the best rank-1 approximation of $M$. Clearly, this can be posed as the following problem:^{2}
$\text{minimize}_{\,x \in \mathbb{R}^n} \quad f(x) = \frac{1}{4} \left\| x x^{\top} - M \right\|_{\mathrm{F}}^2,$  (14)
(^{2}Here, the pre-constant $1/4$ is introduced for the purpose of normalization and does not affect the solution.)
where $f(x)$ is a degree-four polynomial in $x$ and highly nonconvex. The solution to (14) can be expressed in closed form as the scaled leading eigenvector $x^{\star} = \pm \sqrt{\lambda_1}\, u_1$. See Fig. 2 for an illustration of the function $f(\cdot)$. This problem stems from interpreting principal component analysis (PCA) from an optimization perspective, which has a long history in the literature on (linear) neural networks and unsupervised learning; see for example [35, 36, 37, 38, 39, 40]. We attempt to minimize the nonconvex function $f(\cdot)$ directly in spite of nonconvexity. This problem, though simple, plays a critical role in understanding the success of nonconvex optimization, since several important nonconvex estimation problems can be regarded as randomized versions or extensions of this problem.
3.1 Local linear convergence of gradient descent
To begin with, we demonstrate that gradient descent, when initialized at a point sufficiently close to the global optimizer $x^{\star} = \sqrt{\lambda_1}\, u_1$, is guaranteed to converge fast.
Remark 1.
By symmetry, Theorem 1 continues to hold if $x^{\star}$ is replaced by $-x^{\star}$.
In a nutshell, Theorem 1 establishes linear convergence of GD for rank-1 matrix factorization, where the convergence rate largely depends upon the eigengap $\lambda_1 - \lambda_2$ (relative to the largest eigenvalue $\lambda_1$). This is a local result, which assumes that a suitable initialization within the basin of attraction $\mathcal{B}_{\delta}(x^{\star})$ is available. The radius $\delta$ of this basin, which is not optimized in this theorem, also scales with the relative eigengap $(\lambda_1 - \lambda_2)/\lambda_1$.
Proof of Theorem 1.
The proof mainly consists of showing that $f$ is locally strongly convex and smooth around $x^{\star}$, which allows us to invoke Lemma 1. The gradient and the Hessian of $f$ are given respectively by
$\nabla f(x) = \left( x x^{\top} - M \right) x;$  (15)
$\nabla^2 f(x) = 2\, x x^{\top} + \|x\|_2^2\, I_n - M.$  (16)
For notational simplicity, let $\Delta := x - x^{\star}$. A little algebra yields that if $\|\Delta\|_2$ is sufficiently small relative to the eigengap, then
$\nabla^2 f(x) \succeq \alpha I_n$  (17)
$\left\| \nabla^2 f(x) \right\| \le \beta$  (18)
for some strong convexity and smoothness parameters $\alpha, \beta > 0$ determined by $\lambda_1$ and the eigengap $\lambda_1 - \lambda_2$.
We start with the smoothness condition. The triangle inequality gives
where the last line follows from (18).
The question then comes down to whether one can secure an initial guess of this quality. One popular approach is spectral initialization, obtained by computing the leading eigenvector of . For this simple problem, this already yields a solution of arbitrary accuracy. As it turns out, such a spectral initialization approach is particularly useful when dealing with noisy and incomplete measurements. We refer the readers to Section 8 for detailed discussions on spectral initialization methods.
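As a numerical companion to this subsection, the sketch below runs GD for $f(x) = \frac{1}{4}\|xx^{\top} - M\|_{\mathrm{F}}^2$ starting from a (mildly perturbed) spectral initialization. The test matrix, perturbation level, step size, and iteration count are our own illustrative choices, not constants from the text.

```python
import numpy as np

# Rank-1 factorization: minimize f(x) = 0.25 * ||x x^T - M||_F^2 by GD,
# initialized at the scaled leading eigenvector of M (spectral init).
rng = np.random.default_rng(3)
n = 6
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
lams = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])   # eigengap: lambda_1 > lambda_2
M = U @ np.diag(lams) @ U.T

def grad(x):
    return (np.outer(x, x) - M) @ x               # gradient of f

w, V = np.linalg.eigh(M)                          # eigenvalues in ascending order
x = np.sqrt(w[-1]) * V[:, -1]                     # spectral init: sqrt(lam_1) * u_1
x = x + 0.05 * rng.standard_normal(n)             # small perturbation, still in the basin
for _ in range(500):
    x = x - 0.05 * grad(x)                        # conservative constant step size

x_star = np.sqrt(lams[0]) * U[:, 0]
err = min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star))
```

The error is measured up to the global sign, reflecting the $\pm x^{\star}$ ambiguity noted in Remark 1.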
3.2 Global optimization landscape
We then move on to examining the optimization landscape of this simple problem. In particular, what kinds of critical points does $f$ have? This is addressed as follows.
Theorem 2.
Consider the problem (14). All local minima of $f$ are global optima. The rest of the critical points are either local maxima or strict saddle points.
Proof of Theorem 2.
Recall that $x$ is a critical point of $f$ if and only if $\nabla f(x) = (x x^{\top} - M) x = 0$, or equivalently,
$M x = \|x\|_2^2\, x.$
As a result, $x$ is a critical point if either it aligns with an eigenvector of $M$ or $x = 0$. Given that the eigenvectors of $M$ obey $\|u_i\|_2 = 1$, by properly adjusting the global scaling, we determine the set of critical points to be
$\{ 0 \} \cup \left\{ \pm \sqrt{\lambda_i}\, u_i : \lambda_i > 0,\ 1 \le i \le n \right\}.$
To further categorize the critical points, we need to examine the Hessian matrices as given by (16). Regarding the critical points $\pm \sqrt{\lambda_i}\, u_i$, we have
$\nabla^2 f\big( \pm \sqrt{\lambda_i}\, u_i \big) = 2 \lambda_i u_i u_i^{\top} + \lambda_i I_n - M.$
We can then categorize them as follows:

With regards to the points $\pm \sqrt{\lambda_1}\, u_1$, one has
$\nabla^2 f\big( \pm \sqrt{\lambda_1}\, u_1 \big) = 2 \lambda_1 u_1 u_1^{\top} + \lambda_1 I_n - M \succeq (\lambda_1 - \lambda_2)\, I_n \succ 0,$
and hence they are (equivalent) local minima of $f$;

For the points $\pm \sqrt{\lambda_i}\, u_i$ with $2 \le i \le n$ and $\lambda_i > 0$, one has
$u_1^{\top} \nabla^2 f\big( \pm \sqrt{\lambda_i}\, u_i \big)\, u_1 = \lambda_i - \lambda_1 < 0 \quad \text{and} \quad u_i^{\top} \nabla^2 f\big( \pm \sqrt{\lambda_i}\, u_i \big)\, u_i = 2 \lambda_i > 0,$
and therefore they are strict saddle points of $f$.

Finally, the critical point at the origin satisfies
$\nabla^2 f(0) = -M \preceq 0,$
and is hence either a local maximum (if $M \succ 0$) or a strict saddle point (if $\lambda_{\min}(M) = 0$, in which case $\lambda_{\min}(\nabla^2 f(0)) = -\lambda_1 < 0$).
∎
This result reveals the benign geometry of the problem (14), making it amenable to optimization. All undesired fixed points of gradient descent are strict saddles, which have negative directional curvature and may not be difficult to escape or avoid.
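The categorization in the proof is easy to check numerically: for an arbitrary PSD test matrix (our own choice), the scaled top eigenvector has a positive definite Hessian, while a scaled non-leading eigenvector exhibits a direction of negative curvature.

```python
import numpy as np

# Verify the landscape claims for f(x) = 0.25 * ||x x^T - M||_F^2 on a
# hypothetical PSD matrix M with an eigengap.
rng = np.random.default_rng(4)
U, _ = np.linalg.qr(rng.standard_normal((5, 5)))
lams = np.array([4.0, 2.0, 1.0, 0.5, 0.25])
M = U @ np.diag(lams) @ U.T

def grad(x):
    return (np.outer(x, x) - M) @ x

def hess(x):
    return (x @ x) * np.eye(x.size) + 2.0 * np.outer(x, x) - M

x1 = np.sqrt(lams[0]) * U[:, 0]   # candidate global minimum
x2 = np.sqrt(lams[1]) * U[:, 1]   # candidate strict saddle

grad_ok = max(np.linalg.norm(grad(x1)), np.linalg.norm(grad(x2))) < 1e-10
min_ok = np.linalg.eigvalsh(hess(x1))[0] > 0      # Hessian PD at the minimum
saddle_ok = np.linalg.eigvalsh(hess(x2))[0] < 0   # negative curvature at the saddle
```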
4 Formulations of a few canonical problems
For a paper of this length, it is impossible to cover all nonconvex statistical problems of interest. Instead, we decide to focus on a few concrete and fundamental matrix factorization problems. This section presents formulations of several such examples that will be visited multiple times throughout this article. Unless otherwise noted, the assumptions made in this section (e.g. restricted isometry for matrix sensing, Gaussian design for phase retrieval) will be imposed throughout the rest of the paper.
4.1 Matrix sensing
Suppose that we are given a set of $m$ measurements of a matrix $M^{\star} \in \mathbb{R}^{n_1 \times n_2}$ of the form
$y_i = \langle A_i, M^{\star} \rangle, \qquad i = 1, \ldots, m,$  (19)
where $\{A_i\}_{i=1}^{m}$ is a collection of sensing matrices known a priori. We are asked to recover $M^{\star}$ — which is assumed to be of rank $r$ — from these linear matrix equations [12, 41].
When $M^{\star} = X^{\star} X^{\star\top}$ is positive semidefinite, with $X^{\star} \in \mathbb{R}^{n \times r}$ (where $n_1 = n_2 = n$), this can be cast as solving the least-squares problem
$\text{minimize}_{\,X \in \mathbb{R}^{n \times r}} \quad f(X) = \frac{1}{4m} \sum_{i=1}^{m} \big( \langle A_i, X X^{\top} \rangle - y_i \big)^2.$  (20)
Clearly, we cannot distinguish $X^{\star}$ from $X^{\star} R$ for any orthonormal matrix $R \in \mathcal{O}^{r \times r}$, as they correspond to the same low-rank matrix $(X^{\star} R)(X^{\star} R)^{\top} = X^{\star} X^{\star\top}$. This simple fact implies that there exist multiple global optima for (20), a phenomenon that holds for most problems discussed herein.
For the general case where $M^{\star} = X^{\star} Y^{\star\top}$ is asymmetric, we wish to minimize
$f(X, Y) = \frac{1}{2m} \sum_{i=1}^{m} \big( \langle A_i, X Y^{\top} \rangle - y_i \big)^2$  (21)
over $X \in \mathbb{R}^{n_1 \times r}$ and $Y \in \mathbb{R}^{n_2 \times r}$. Similarly, we cannot distinguish $(X^{\star}, Y^{\star})$ from $(X^{\star} \Lambda, Y^{\star} \Lambda^{-\top})$ for any invertible matrix $\Lambda \in \mathbb{R}^{r \times r}$, as $X^{\star} \Lambda \big( Y^{\star} \Lambda^{-\top} \big)^{\top} = X^{\star} Y^{\star\top}$. Throughout the paper, we denote by
$X^{\star} := U^{\star} \big( \Sigma^{\star} \big)^{1/2} \quad \text{and} \quad Y^{\star} := V^{\star} \big( \Sigma^{\star} \big)^{1/2}$  (22)
the true low-rank factors, where
$M^{\star} = U^{\star} \Sigma^{\star} V^{\star\top}$
stands for its singular value decomposition.
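A small matrix sensing instance is easy to simulate. The sketch below draws Gaussian sensing matrices, forms the linear measurements, and checks that a factored least-squares loss (using one common $1/(2m)$ scaling, an assumption of ours) vanishes at the truth and is invariant under the invertible re-parametrization described above. All sizes and seeds are arbitrary.

```python
import numpy as np

# Matrix sensing: y_i = <A_i, M*> with M* = X* Y*^T of rank r.
rng = np.random.default_rng(5)
n1, n2, r, m = 8, 6, 2, 200
X_star = rng.standard_normal((n1, r))
Y_star = rng.standard_normal((n2, r))
M_star = X_star @ Y_star.T

A = rng.standard_normal((m, n1, n2))
y = np.tensordot(A, M_star, axes=([1, 2], [0, 1]))   # y_i = <A_i, M*>

def loss(X, Y):
    resid = np.tensordot(A, X @ Y.T, axes=([1, 2], [0, 1])) - y
    return 0.5 / m * np.sum(resid**2)

# Ambiguity: (X, Y) -> (X L, Y L^{-T}) leaves X Y^T, hence the loss, unchanged.
L = rng.standard_normal((r, r)) + 3.0 * np.eye(r)    # invertible for this seed
```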
In order to make the problem well-posed, we need to make proper assumptions on the sensing operator $\mathcal{A}: \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^{m}$ defined by
$\mathcal{A}(M) := \big\{ \langle A_i, M \rangle \big\}_{1 \le i \le m}.$  (23)
A useful property for the sensing operator that enables tractable algorithmic solutions is the Restricted Isometry Property (RIP), which says that the operator approximately preserves the Euclidean norm of the input matrix when restricted to the set of low-rank matrices. More formally:
Definition 6 ([42]).
An operator $\mathcal{A}$ is said to satisfy the $r$-RIP with RIP constant $\delta_r < 1$ if
$(1 - \delta_r) \|M\|_{\mathrm{F}}^2 \le \|\mathcal{A}(M)\|_2^2 \le (1 + \delta_r) \|M\|_{\mathrm{F}}^2$  (24)
holds simultaneously for all $M$ of rank at most $r$.
As an immediate consequence, the inner product between two low-rank matrices is also nearly preserved if $\mathcal{A}$ satisfies the RIP. Therefore, $\mathcal{A}$ behaves like an isometry when restricting its operations to low-rank matrices.
Lemma 3 ([42]).
If an operator $\mathcal{A}$ satisfies the $2r$-RIP with RIP constant $\delta_{2r}$, then
$\big| \langle \mathcal{A}(M_1), \mathcal{A}(M_2) \rangle - \langle M_1, M_2 \rangle \big| \le \delta_{2r} \|M_1\|_{\mathrm{F}} \|M_2\|_{\mathrm{F}}$  (25)
holds simultaneously for all $M_1$ and $M_2$ of rank at most $r$.
Notably, many random sensing designs are known to satisfy the RIP with high probability, with one example given below.
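As a quick empirical illustration in the spirit of the RIP, the sketch below uses i.i.d. Gaussian sensing matrices with a $1/\sqrt{m}$ normalization (our own choice) and checks that $\|\mathcal{A}(M)\|_2^2$ concentrates around $\|M\|_{\mathrm{F}}^2$ on random low-rank test matrices. This is a sanity check on random instances, not a proof that the RIP holds uniformly.

```python
import numpy as np

# Near-isometry check: A(M) = {<A_i, M>} / sqrt(m) with Gaussian A_i.
rng = np.random.default_rng(6)
n, r, m = 10, 2, 4000
A = rng.standard_normal((m, n, n))

def sensing_op(M):
    return np.tensordot(A, M, axes=([1, 2], [0, 1])) / np.sqrt(m)

ratios = []
for _ in range(20):
    B = rng.standard_normal((n, r))
    M = B @ B.T                                   # random rank-r test matrix
    ratios.append(np.sum(sensing_op(M) ** 2) / np.linalg.norm(M, 'fro') ** 2)
```

All observed energy ratios fall close to $1$, as the restricted isometry intuition suggests for sufficiently large $m$.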
4.2 Phase retrieval and quadratic sensing
Imagine that we have access to $m$ quadratic measurements of a rank-1 matrix $M^{\star} = x^{\star} x^{\star\top}$:
$y_i = \big( a_i^{\top} x^{\star} \big)^2, \qquad i = 1, \ldots, m,$  (26)
where $a_i \in \mathbb{R}^{n}$ is the design vector known a priori. How can we reconstruct $x^{\star}$ — or equivalently, $M^{\star} = x^{\star} x^{\star\top}$ — from this collection of quadratic equations about $x^{\star}$? This problem, often dubbed phase retrieval, arises for example in X-ray crystallography, where one needs to recover a specimen based on the intensities of the diffracted waves scattered by the object [43, 4, 44, 45]. Mathematically, the problem can be posed as finding a solution to the following program:
$\text{minimize}_{\,x \in \mathbb{R}^{n}} \quad f(x) = \frac{1}{4m} \sum_{i=1}^{m} \Big( \big( a_i^{\top} x \big)^2 - y_i \Big)^2.$  (27)
More generally, consider the quadratic sensing problem, where we collect $m$ quadratic measurements of a rank-$r$ matrix $M^{\star} = X^{\star} X^{\star\top}$ with $X^{\star} \in \mathbb{R}^{n \times r}$:
$y_i = a_i^{\top} M^{\star} a_i = \big\| X^{\star\top} a_i \big\|_2^2, \qquad i = 1, \ldots, m.$  (28)
This subsumes phase retrieval as a special case ($r = 1$), and comes up in applications such as covariance sketching for streaming data [46, 47] and phase space tomography under the name coherence retrieval [48, 49]. Here, we wish to solve
$\text{minimize}_{\,X \in \mathbb{R}^{n \times r}} \quad f(X) = \frac{1}{4m} \sum_{i=1}^{m} \Big( \big\| X^{\top} a_i \big\|_2^2 - y_i \Big)^2.$  (29)
Clearly, this is equivalent to the matrix sensing problem with sensing matrices $A_i = a_i a_i^{\top}$. Here and throughout, we assume an i.i.d. Gaussian design, a tractable model that has been extensively studied recently.
Assumption 1.
Suppose that the design vectors $a_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_n)$, $1 \le i \le m$.
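Simulating phase retrieval measurements under Assumption 1 takes a few lines; the sketch below also checks the "lifted" matrix sensing view $y_i = \langle a_i a_i^{\top}, x^{\star} x^{\star\top} \rangle$ and the inherent global sign ambiguity. Sizes and seed are arbitrary choices of ours.

```python
import numpy as np

# Phase retrieval: y_i = (a_i^T x*)^2 with Gaussian design vectors.
rng = np.random.default_rng(7)
n, m = 12, 50
x_star = rng.standard_normal(n)
a = rng.standard_normal((m, n))                   # a_i ~ N(0, I_n), i.i.d.
y = (a @ x_star) ** 2

M_star = np.outer(x_star, x_star)                 # lifted rank-1 matrix
y_lifted = np.einsum('mi,ij,mj->m', a, M_star, a) # <a_i a_i^T, x* x*^T>
```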
4.3 Matrix completion
Suppose we observe partial entries of a lowrank matrix of rank , indexed by the sampling set . It is convenient to introduce a projection operator such that for an input matrix ,
(30) 
The matrix completion problem then boils down to recovering from (or equivalently, from the partially observed entries of ) [1, 13]. This arises in numerous scenarios; for instance, in collaborative filtering, we may want to predict the preferences of all users about a collection of movies, based on partially revealed ratings from the users. Throughout this paper, we adopt the following random sampling model:
Assumption 2.
Each entry $(i, j)$ is observed independently with probability $0 < p \le 1$, i.e.
$\mathbb{P}\{ (i, j) \in \Omega \} = p.$  (31)
For the positive semidefinite case where $M^{\star} = X^{\star} X^{\star\top}$ with $X^{\star} \in \mathbb{R}^{n \times r}$, the task can be cast as solving
$\text{minimize}_{\,X \in \mathbb{R}^{n \times r}} \quad f(X) = \frac{1}{4p} \big\| \mathcal{P}_{\Omega}\big( X X^{\top} - M^{\star} \big) \big\|_{\mathrm{F}}^2.$  (32)
When it comes to the more general case where $M^{\star} = X^{\star} Y^{\star\top}$, the task boils down to solving
$\text{minimize}_{\,X, Y} \quad f(X, Y) = \frac{1}{2p} \big\| \mathcal{P}_{\Omega}\big( X Y^{\top} - M^{\star} \big) \big\|_{\mathrm{F}}^2.$  (33)
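The Bernoulli sampling model of Assumption 2 and the projection $\mathcal{P}_{\Omega}$ are straightforward to simulate; the sketch below (with arbitrary sizes, rank, and sampling rate of our own choosing) checks that $\mathcal{P}_{\Omega}$ is idempotent and that roughly a $p$-fraction of the entries is observed.

```python
import numpy as np

# Matrix completion setup: random rank-r M*, Bernoulli(p) sampling mask,
# and the projection P_Omega that zeroes out unobserved entries.
rng = np.random.default_rng(8)
n1, n2, r, p = 40, 30, 2, 0.5
M_star = rng.standard_normal((n1, r)) @ rng.standard_normal((n2, r)).T

mask = rng.random((n1, n2)) < p                   # (i, j) in Omega w.p. p

def proj_omega(M):
    return np.where(mask, M, 0.0)

frac_observed = mask.mean()                       # should be close to p
```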
One parameter that plays a crucial role in determining the feasibility of matrix completion is a certain coherence measure of $M^{\star}$, defined as follows [1].
Definition 7 ($\mu$-incoherence).
A rank-$r$ matrix $M^{\star}$ with singular value decomposition (SVD) $M^{\star} = U^{\star} \Sigma^{\star} V^{\star\top}$ is said to be $\mu$-incoherent if
$\| U^{\star} \|_{2,\infty} \le \sqrt{\frac{\mu r}{n_1}} \quad \text{and} \quad \| V^{\star} \|_{2,\infty} \le \sqrt{\frac{\mu r}{n_2}}.$