Acknowledgement: This grad-level course (CSCI 8980 Topics in Modern Machine Learning) was offered by Prof. Ju Sun at the University of Minnesota in Fall 2021. The course website can be found here. The course is organized around six thrusts: Approximation, Optimization, Generalization, Invariance, Robustness, and Generation. Since our bi-weekly reflection assignments ask us to discuss the strengths and limitations of what we learned each week, to implement ways to test or disprove certain theories or methods, or to raise new problems, I plan to use this blog page to record some nuggets.
Federated Optimization of a Non-convex Objective with the Polyak-Łojasiewicz Condition on Non-i.i.d. Data Distributions (Sep 26)
Week Summary: This week we learned about two important families of tractable nonconvex problems for gradient descent: star-convex functions and functions that satisfy the Polyak-Łojasiewicz (PL) condition. Both guarantee the equivalence of stationary points, local minimizers, and global minimizers, and admit strong optimization guarantees for functions with Lipschitz gradients. We put more emphasis on the PL condition and sketched the high-level strategy people employ to provide global optimization guarantees for neural networks.
In Federated Learning (FL), the goal is to optimize a global training objective over many different local clients, where the issue of data-distribution discrepancy occurs quite often in practice. In this reflection, we survey the optimization of FL via local Gradient Descent (GD) with periodic averaging, under 1) a non-convex objective satisfying the Polyak-Łojasiewicz (PL) condition and 2) non-i.i.d. data distributions.
Convergence Analysis of local GD
We have a global training objective of FL:

$$\min_{\mathbf{w}\in\mathbb{R}^d} \; F(\mathbf{w}) := \sum_{i=1}^{N} p_i F_i(\mathbf{w}),$$

where $N$ is the number of clients and $p_i$ is the weight of the $i$-th client, with $p_i \ge 0$ and $\sum_{i=1}^{N} p_i = 1$. We assume that the $i$-th client has $n_i$ training data points sampled i.i.d. from the $i$-th local client distribution $\mathcal{P}_i$, so the local loss function can be written as:

$$F_i(\mathbf{w}) := \frac{1}{n_i}\sum_{j=1}^{n_i} \ell\big(\mathbf{w};\, \xi_j^{(i)}\big), \qquad \xi_j^{(i)} \sim \mathcal{P}_i .$$
In practice, the discrepancy of data distributions often causes FedAvg or local SGD [2] to suffer from a slow convergence rate or even divergence when the learning rate or the number of local updates is not properly chosen. Intuitively, if the local gradients of different clients are orthogonal or even point in opposite directions, then one can only expect little gain from aggregated optimization, because the global optimum might be far away from those local optima. [1] introduced weighted gradient diversity to measure the data discrepancy among local objectives, which is used to establish conditions for the convergence analysis in the non-i.i.d. setting:

$$\Lambda(\mathbf{w}) := \frac{\sum_{i=1}^{N} p_i \,\|\nabla F_i(\mathbf{w})\|^2}{\left\|\sum_{i=1}^{N} p_i \,\nabla F_i(\mathbf{w})\right\|^2}.$$
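To make the quantity concrete, here is a tiny numerical sketch (our own notation and toy numbers, mirroring the reconstructed definition above): the weighted gradient diversity stays close to 1 when client gradients are aligned and blows up when they are nearly orthogonal or opposing.

```python
# Tiny sketch of weighted gradient diversity (our notation and toy numbers).
import numpy as np

def gradient_diversity(grads, weights):
    """grads: (N, d) array of local gradients; weights: (N,) summing to 1."""
    num = np.sum(weights * np.sum(grads ** 2, axis=1))               # sum_i p_i ||grad_i||^2
    denom = np.linalg.norm(np.sum(weights[:, None] * grads, axis=0)) ** 2
    return num / denom

p = np.array([0.5, 0.5])
aligned = np.array([[1.0, 0.0], [0.9, 0.1]])      # nearly i.i.d. clients
opposing = np.array([[1.0, 0.0], [-0.9, 0.05]])   # heterogeneous clients
print(gradient_diversity(aligned, p))    # close to 1
print(gradient_diversity(opposing, p))   # much larger
```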
The assumptions include:
- Bounded Gradient Diversity: $\Lambda(\mathbf{w}) \le \lambda$ for all $\mathbf{w}$.
- $L$-Lipschitz Gradient: $\|\nabla F(\mathbf{w}) - \nabla F(\mathbf{w}')\| \le L\|\mathbf{w} - \mathbf{w}'\|$, with the value of the global objective function bounded below by a scalar $F^*$.
- $\mu$-PL Condition: $\frac{1}{2}\|\nabla F(\mathbf{w})\|^2 \ge \mu\,\big(F(\mathbf{w}) - F(\mathbf{w}^*)\big)$, where $\mathbf{w}^*$ is the global optimum (a concrete non-convex example satisfying PL is sketched right after this list).
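For concreteness, here is a standard textbook-style example (not from the paper) of a function that is non-convex yet satisfies the PL condition, so gradient descent still converges to its global minimizer:

$$f(x) = x^2 + 3\sin^2 x, \qquad f''(x) = 2 + 6\cos 2x \ \text{changes sign (so $f$ is non-convex)}, \qquad \tfrac{1}{2}\,|f'(x)|^2 \;\ge\; \mu\,\big(f(x) - f(0)\big) \ \text{for some } \mu > 0,$$

with the unique global minimizer $x^* = 0$.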
With these assumptions established, [1] proves a convergence theorem for local GD with periodic averaging; roughly speaking, choosing the learning rate (and the number of local updates) inversely proportional to the gradient diversity bound $\lambda$ guarantees convergence to the global optimum, at a rate governed by $\mu$, $L$, $\lambda$, and the number of local updates $\tau$.
We know that a large learning rate, as long as convergence is still guaranteed, helps achieve fast convergence. From the theorem above, we see that the admissible learning rate is inversely proportional to the gradient diversity measure. If the data distributions of different clients are nearly i.i.d. (e.g., MRI imaging with different sequences such as T1, T2, and FLAIR), then a larger learning rate or a larger number of local updates can be taken. If the data distributions are non-i.i.d. (e.g., photos vs. sketch drawings), then a small learning rate should be taken. This agrees with the practice that slow convergence or even divergence occurs when applying FL to heterogeneous data distributions.
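Below is a bare-bones sketch of local GD with periodic averaging (essentially FedAvg with exact local gradients) on a toy quadratic problem; the client objectives, weights $p_i$, learning rate $\eta$, and number of local steps $\tau$ are all illustrative choices of ours, meant only to show where the two knobs discussed above enter.

```python
# Local GD with periodic averaging on toy quadratic client objectives (our own toy setup).
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 3
A = [np.diag(rng.uniform(0.5, 2.0, size=d)) for _ in range(N)]   # local curvatures
b = [rng.normal(size=d) for _ in range(N)]                       # local optima shifts
p = np.ones(N) / N                                               # client weights

def local_grad(i, w):
    # gradient of F_i(w) = 0.5 * (w - b_i)^T A_i (w - b_i)
    return A[i] @ (w - b[i])

def local_gd_with_averaging(eta=0.1, tau=5, rounds=50):
    w_global = np.zeros(d)
    for _ in range(rounds):
        local_models = []
        for i in range(N):
            w = w_global.copy()
            for _ in range(tau):          # tau local GD steps per round
                w -= eta * local_grad(i, w)
            local_models.append(w)
        w_global = sum(p[i] * local_models[i] for i in range(N))  # periodic averaging
    return w_global

w = local_gd_with_averaging()
global_grad = sum(p[i] * local_grad(i, w) for i in range(N))
print("norm of global gradient after training:", np.linalg.norm(global_grad))
```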
All the results of [1] rely heavily on the bounded gradient diversity assumption. If the gradient diversity is too large, convergence may fail and tuning the learning rate may be challenging. This in a sense explains why much of the recent literature [4,5,6] modifies the FL training algorithm precisely to reduce the gradient diversity. Although these works take different forms of regularization on the gradient descent step, the core idea is similar: correct the local drift across clients for better generalization. Meanwhile, once they propose a way to mitigate the local drift, they can always make sure the algorithm converges, because essentially they are also reducing the gradient diversity defined in [1].
Here we again see the $L$-Lipschitz gradient assumption that is commonly used to analyze smooth functions. People often make this smoothness assumption in federated optimization so that, intuitively, the local gradients won't drift too far away from the averaged gradient. But what if, in practice, we use a neural network with a non-smooth activation such as ReLU?
Overparameterization is a prevalent machine learning setting today. Does local (S)GD still have a convergence guarantee and enjoy linear speedup under this setting? [7] has shown this for strongly convex smooth problems, but it remains an open question for the non-convex case.
[1] Farzin Haddadpour and Mehrdad Mahdavi. On the convergence of local descent methods in federated learning, 2019.
[2] Sebastian U. Stich. Local SGD converges fast and communicates little, 2019.
[3] Farzin Haddadpour and Mehrdad Mahdavi. On the convergence of local descent methods in federated learning, 2019.
[4] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning, 2021.
[5] Sai Praneeth Karimireddy, Martin Jaggi, Satyen Kale, Mehryar Mohri, Sashank J. Reddi, Sebastian U. Stich, and Ananda Theertha Suresh. Mime: Mimicking centralized stochastic algorithms in federated learning, 2021.
[6] Tomoya Murata and Taiji Suzuki. Bias-variance reduced local SGD for less heterogeneous federated learning, 2021.
[7] Zhaonan Qu, Kaixiang Lin, Jayant Kalagnanam, Zhaojian Li, Jiayu Zhou, and Zhengyuan Zhou. Federated learning's blessing: FedAvg has linear speedup, 2020.
The Fertility of Linearized Models in the Infinite-width Limit (Oct 3)
Week Summary: This week we learned how to apply the PL condition to obtain global optimization guarantees for 2-layer ("1.5-layer" only, in fact) neural networks, as well as a short-path claim on the optimization trajectories, due to the nearness to a linearized problem. We then learned about why and when linearization makes sense. For optimization alone, linearization is not necessary; we learned that networks with nonlinear output layers, as well as bottleneck layers, can make linearization highly inaccurate even as the network width approaches infinity. But one can still apply PL-style arguments, which for nonlinear models rely, at their core, only on the well-conditionedness of the Jacobian.
The two papers by Liu et al. (2021) and Chizat et al. (2020) provide slightly different viewpoints about why linearization makes sense, and when it does not. This reflection aims to discuss and compare their main arguments, as well as their strengths and limitations.
Jacot et al. (2020) identified the phenomenon that very wide neural networks become linear functions w.r.t. their parameters, because their tangent kernel stays (nearly) constant in the weights during training. This phenomenon is now often called "transition to linearity" in the machine learning (ML) community. Liu et al. (2021) argued that this phenomenon should be explained by linearity within a Euclidean ball of an arbitrary fixed radius, rather than within a small neighborhood of the initialization point (i.e., the "lazy training" regime).
Chizat et al. (2020) claimed that the linearization of a neural network around its initialization is due more to an implicit choice of scaling than to over-parameterization. They supported the claim by introducing an explicit scale factor $\alpha$ on the model output and showing, through a theoretical analysis, that for large $\alpha$ the rescaled model's gradients do not change appreciably while the loss enjoys a sharp decrease.
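The following is a minimal numerical sketch of that scaling claim under our own toy setup (a small two-layer tanh network, squared loss, and a Chizat-style learning rate $\eta/\alpha^2$; the model is centered at initialization so that $h(\mathbf{w}_0)=0$, which the argument needs, as discussed below). For large $\alpha$ the loss should drop by a similar amount while the parameters barely move.

```python
# Minimal sketch of the lazy-training scaling claim (our own toy construction).
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 50, 200                          # input dim, width, sample size
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])                          # a simple target; initial loss is 0.5

def forward(W, v):
    return np.tanh(X @ W.T) @ v / np.sqrt(m)  # h(x; W, v)

def grads(W, v, r):
    # gradient of mean(r * h) w.r.t. (W, v); r is a per-sample chain-rule weight
    H = np.tanh(X @ W.T)
    dv = (H * r[:, None]).mean(axis=0) / np.sqrt(m)
    dH = (1 - H ** 2) * (v / np.sqrt(m))[None, :] * r[:, None]
    dW = dH.T @ X / n
    return dW, dv

for alpha in [1.0, 10.0, 100.0]:
    W0, v0 = rng.normal(size=(m, d)), rng.normal(size=m)
    h0 = forward(W0, v0)                      # center so the scaled model starts at 0
    W, v = W0.copy(), v0.copy()
    lr = 0.2 / alpha ** 2                     # Chizat-style learning-rate scaling
    for _ in range(1000):
        out = alpha * (forward(W, v) - h0)
        r = alpha * (out - y)                 # chain-rule factor; grads() averages over samples
        dW, dv = grads(W, v, r)
        W -= lr * dW
        v -= lr * dv
    out = alpha * (forward(W, v) - h0)
    loss = 0.5 * np.mean((out - y) ** 2)
    move = np.linalg.norm(W - W0) / np.linalg.norm(W0)
    print(f"alpha={alpha:6.1f}  final loss={loss:.3f}  relative weight change={move:.5f}")
```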
As we know from calculus, the deviation of a function from its linear approximation is controlled by its second derivative. Liu et al. (2021) demonstrate that if the spectral norm of the second derivative (i.e., the Hessian $\nabla^2 f$) of a model is small compared to the first derivative (i.e., the gradient $\nabla f$) in a ball of a fixed radius, then the model is close to linear and has a near-constant tangent kernel in that ball. In fact, once we notice that the gradient norm is controlled by the 2-norm of the vector of hidden-layer activations (the value of the $l$-th hidden layer), while the spectral norm of the Hessian is controlled by its $\infty$-norm, it is not difficult to see that the discrepancy between these two norms grows as the network width $m$ (i.e., the minimal width of the network) goes to infinity. Numerically, the gradient norm is of the order $\Theta(1)$ while the spectral norm of the Hessian scales as $\tilde{O}(1/\sqrt{m})$. As the network gets wider, the contribution of the Hessian to the optimization trajectory becomes more and more negligible.
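A rough numerical check of this width scaling, with our own toy construction (a two-layer network with a smooth activation and only the hidden weights treated as variables, so the Hessian is available in closed form):

```python
# Numerical check (ours, not the paper's): gradient norm stays O(1) while the
# Hessian spectral norm shrinks roughly like 1/sqrt(m) as the width m grows.
import numpy as np

rng = np.random.default_rng(0)
d = 10
x = rng.normal(size=d) / np.sqrt(d)        # a fixed, roughly unit-norm input

for m in [100, 1_000, 10_000, 100_000]:
    W = rng.normal(size=(m, d))            # hidden weights (the variables)
    v = rng.choice([-1.0, 1.0], size=m)    # fixed output weights
    z = W @ x                              # pre-activations
    # f(W) = (1/sqrt(m)) * sum_j v_j * tanh(w_j . x)
    grad = (v * (1 - np.tanh(z) ** 2))[:, None] * x[None, :] / np.sqrt(m)
    grad_norm = np.linalg.norm(grad)
    # The Hessian is block-diagonal with blocks (v_j * tanh''(z_j) / sqrt(m)) * x x^T,
    # so its spectral norm is max_j |v_j tanh''(z_j)| * ||x||^2 / sqrt(m).
    tanh_dd = -2 * np.tanh(z) * (1 - np.tanh(z) ** 2)   # second derivative of tanh
    hess_norm = np.max(np.abs(v * tanh_dd)) * (x @ x) / np.sqrt(m)
    print(f"m={m:>7}  ||grad|| = {grad_norm:.3f}  ||Hessian||_2 = {hess_norm:.5f}")
```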
Besides stating their own argument for linearity, Liu et al. (2021) also disputed the model-scaling perspective brought forth by Chizat et al. (2020). In the latter's paper, laziness is governed (roughly) by a ratio of the form

$$\|\alpha\, h(\mathbf{w}_0) - y\|\cdot\frac{\|\nabla^2 h(\mathbf{w}_0)\|}{\alpha\,\|\nabla h(\mathbf{w}_0)\|^2}:$$

when $h(\mathbf{w}_0) \approx 0$ and $\alpha$ is large, the Hessian-over-gradient part gets small enough to force laziness while the gradient part remains of the same order. However, the original Neural Tangent Kernel (NTK) work of Jacot et al. (2020) suggests that many practical neural networks are initialized with $h(\mathbf{w}_0) = \Theta(1)$ rather than $h(\mathbf{w}_0) = 0$, so scaling of the model cannot explain the transition to linearity there. Rather, Liu et al.'s argument from the previous paragraph implies that the Hessian part becomes vanishingly small simply because the network is wide enough, so the lazy-training condition is still satisfied, but for a different reason.
Given that the gradient norm is of the order $\Theta(1)$ and the spectral norm of the Hessian scales as $\tilde{O}(1/\sqrt{m})$, Liu et al. (2021) echo the view that very wide models are close to linear. However, they also argued that linearity will not hold when 1) there is a non-linear activation on the output, or 2) a bottleneck layer is very narrow.
Strengths and limitations
As for the two papers we discussed, Chizat et al. (2020)'s viewpoint shows that the lazy regime degrades the performance of CNNs on many high-dimensional tasks, where it is even worse than some classical linear models; it therefore does not interpret well the many successes of deep neural networks. Also, although their analysis is based on an explicit scaling factor, in practice the scaling factors are often implicit, such as choices of hyper-parameters that govern normalization, initialization, and the number of iterations. On the other hand, Liu et al. (2021) explicitly unveiled that linearity does not always hold, especially in the following situations: 1) there is a non-linear activation on the output, and 2) a bottleneck layer is very narrow.
Why do we care about the linearization of neural networks? It provides guarantees that fast training is indeed possible, at the cost of recovering a linear method. Also, the framework that assumes the infinite-width limit and certain initialization schemes helps reduce key questions in deep learning theory to the study of linear methods and convex functional analysis, which are rather well understood.
Nevertheless, it’s also been observed empirically that linearized models who use first-order Taylor expansion around the initialization perform much worse than the networks they approximate . In particular, if one linearizes the network at a later stage, then so called non-linear advantages described before this sentence is significantly reduced. The reason behind this phenomenon is still a open question.
Chaoyue Liu, Libin Zhu, and Mikhail Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant, 2021.
Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming, 2020.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks, 2020.
Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel, 2020.
Accelerated Gradient Flow VS Accelerated Gradient Descent (Oct 10)
Week Summary: This week we learned how to move beyond linearization in optimization theory for DL, using PL conditions. In the week starting Oct 10, we learned about another line of research based on two key ideas: 1) pass to infinite width, i.e., asymptotic analysis, and 2) pass to infinitesimal step size, i.e., use gradient flow. We described the mean-field analysis founded on these ideas. The following week, we shifted gears toward generalization, reviewed the basics of statistical learning theory, highlighted the major results around Rademacher complexity and VC dimension, and discussed how they tend to lead to vacuous results in modern deep learning practice.
Nestorov’s acceleration method is a backbone for optimizing machine learning and deep learning models: even Adam optimizer essentially borrows the momentum idea, besides being adaptive. However, it’s still debatable what the essential intuitions and ideas are behind the method. In this report, we sketch the key arguments of gradient flow analysis as one candidate explanation from the work of .
The (plain) gradient flow

$$\dot{X}(t) = -\nabla f(X(t))$$

is the continuous-time counterpart of gradient descent, with a matching convergence rate but fewer assumptions. For instance, gradient descent converges at rate $O(1/k)$ when $f$ is convex and $\nabla f$ is $L$-Lipschitz, while gradient flow converges at rate $O(1/t)$ when $f$ is merely convex. If $\eta$ is the discretization time step and $t = \eta k$, the convergence rates match. Also, the Lipschitz assumption vanishes in continuous time because the discretization error disappears as $\eta \to 0$.
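As a one-line sanity check on this correspondence (our own restatement, not a result from the papers), a forward-Euler discretization of the gradient flow ODE with step size $\eta$ recovers gradient descent exactly:

$$X(t+\eta)\;\approx\;X(t)-\eta\,\nabla f(X(t))\;\;\Longleftrightarrow\;\;x_{k+1}=x_k-\eta\,\nabla f(x_k),\qquad t=\eta k,$$

so the continuous-time rate $O(1/t)$ and the discrete rate $O(1/k)$ are two views of the same statement.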
Nearly forty years ago, Nesterov (1983) proposed accelerated gradient descent,

$$x_{k} = y_{k-1} - s\,\nabla f(y_{k-1}), \qquad y_{k} = x_{k} + \frac{k-1}{k+2}\,\big(x_{k} - x_{k-1}\big),$$

which has the same assumptions as gradient descent but achieves the faster and optimal convergence rate $O(1/k^2)$. Although accelerated gradient descent is an oscillatory (i.e., not always locally descending) method, it still enjoys the benefit of faster convergence without making additional assumptions. Does the gradient flow line of work have a similar acceleration? It does. The work of Su et al. (2014) "completed the square" (see Figure 1), which we introduce in the next section.
Accelerated Gradient Flow
Su et al. (2014) derived the accelerated gradient flow ODE

$$\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f(X(t)) = 0, \qquad X(0) = x_0,\ \ \dot{X}(0) = 0.$$

It carries the momentum term of accelerated gradient descent implicitly in the velocity $\dot{X}$, and the second-order term $\ddot{X}$ means that we care about the notion of acceleration. The derivation of this differential equation is described in detail in the paper; here we summarize some key properties. In the limit of an infinitesimal step size, accelerated gradient flow does match accelerated gradient descent (see Figure 2).
Figure 2: The trajectories of accelerated gradient descent with different small step sizes agree with the curve of accelerated gradient flow (left). The function values of accelerated gradient descent match those of accelerated gradient flow (right).
From Figure 2, we also confirm that accelerated gradient flow is oscillatory (i.e., the function value of $f$ bounces up and down). It does achieve the faster convergence rate $O(1/t^2)$, and the proof is in the paper as well, under the assumption that $f$ is convex.
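To see the matching numerically, here is a small self-contained sketch (our own toy quadratic and step sizes, not the paper's experiments) that runs Nesterov's accelerated gradient descent next to a simple discretization of the ODE $\ddot{X} + \frac{3}{t}\dot{X} + \nabla f(X) = 0$, using the time identification $t_k \approx k\sqrt{s}$:

```python
# Accelerated gradient descent vs. a discretization of the accelerated
# gradient flow ODE on f(x) = 0.5 x^T Q x (our own toy setup).
import numpy as np

Q = np.diag([1.0, 50.0])
x0 = np.array([1.0, 1.0])

def grad(x):
    return Q @ x

# Nesterov's accelerated gradient descent (Su et al.'s formulation)
s = 1.0 / 50.0                           # step size <= 1/L
x_prev, y = x0.copy(), x0.copy()
agd_vals = []
for k in range(1, 300):
    x = y - s * grad(y)
    y = x + (k - 1) / (k + 2) * (x - x_prev)
    x_prev = x
    agd_vals.append(0.5 * x @ Q @ x)

# Discretize the ODE with time step dt = sqrt(s), so iteration k sits at t = k*dt
dt = np.sqrt(s)
X, V = x0.copy(), np.zeros(2)
ode_vals = []
for k in range(1, 300):
    t = max(k * dt, dt)                  # avoid the singularity at t = 0
    V = V + dt * (-(3.0 / t) * V - grad(X))
    X = X + dt * V
    ode_vals.append(0.5 * X @ Q @ X)

# Both curves decrease non-monotonically (they oscillate) and roughly track each other.
for k in [50, 150, 299]:
    print(f"iter {k:3d}:  f(AGD) = {agd_vals[k-1]:.2e}   f(ODE) = {ode_vals[k-1]:.2e}")
```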
Can the convergence rate be even faster?
I guess a natural question is why the coefficient has to be 3 in the equation. Su et al. (2014) also studied the general form $\ddot{X} + \frac{r}{t}\dot{X} + \nabla f(X) = 0$, and they showed that the convergence rate is still $O(1/t^2)$ whenever $r \ge 3$; they call 3 the magic constant in their paper. Furthermore, they demonstrated a faster convergence rate if a stronger convexity condition holds, such as strong convexity. Usually, a $\mu$-strong convexity assumption leads to an exponential (linear) convergence rate, as in gradient descent or gradient flow. However, Su et al. (2014) showed that accelerated gradient flow cannot obtain an exponential convergence rate: already for a simple quadratic $f$, the decay is only polynomial in $t$. One viable way they propose to break this barrier is a restarting scheme built on top of accelerated gradient flow: restart the dynamics whenever the convergence speed starts decreasing. This scheme rests on the assumption that $f$ is both strongly convex and has an $L$-Lipschitz gradient in continuous time, which is often impractical.
In any case, the gradient flow line of work does have matching results and yields similar promises to the gradient descent line of work.
Weijie Su, Stephen Boyd, and Emmanuel Candès. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014.
Andre Wibisono. Accelerated gradient flow, Jun 2016. URL http://awibisono.github.io/2016/06/27/accelerated-gradient-flow.html.
Yurii Nesterov. A method for solving the convex programming problem with convergence rate $O(1/k^2)$. Proceedings of the USSR Academy of Sciences, 269:543–547, 1983.
Existing Generalization Results When Extra Assumptions on Data Are Made (Oct 24)
Week Summary: This past week we computed the VC dimensions of DNNs with simple (i.e., sign) activations, and described the state-of-the-art results and how they couldn't quite explain the generalization behavior of DNNs in practical regimes, where the # of weights far exceeds the # of data points. We then quickly reviewed the margin theory for classification, which converts discrete problems into continuous ones so that norm-based capacity control can be naturally enforced. Rademacher complexity, together with margin-based generalization bounds, provides a convincing explanation for large-margin learning models such as SVMs, but for DNNs, norm-based control is seldom explicitly enforced in practice. We then learned some empirical arguments suggesting the conventional wisdom may be unable to explain the success of deep learning.
A major gap in current generalization theory concerns data with noise: the distribution we consider can be an arbitrary fixed distribution, but one can adequately capture the generalization error only when extra assumptions are made on the data distribution $\mathcal{D}$. In this report, we look at the recent work of Feldman (2021) and Li and Liang (2019), where extra assumptions on the data distribution are used to derive generalization results.
Memorization is Necessary
Feldman (2021) drew the conclusion that memorization of labels (even including outliers) is crucial for accurate learning.
The basic model that Feldman (2021) proposed explicitly incorporates a prior distribution on the frequencies of subpopulations (i.e., portions of the data of a certain class) in the data, and he argues that such modeling is necessary to bridge the gap between the classical view of generalization and the practice of machine learning.
As background: since natural data tend to be long-tailed (see Figure 1), Feldman (2021) assumed that the prior distribution over the frequencies of subpopulations is also long-tailed. Additionally, it is natural to presume that the model does not know the frequencies of the subpopulations before training, and that the model cannot predict correctly on a subpopulation until at least one example from it has been observed. The challenge is that it is impossible to distinguish whether a single observed example comes from an "atypical" subpopulation (correctly labeled) or is an "outlier" (wrongly labeled). Memorizing the former reduces the generalization error by an amount on the order of that subpopulation's frequency (roughly $1/n$ for a subpopulation observed only once among $n$ samples), while memorizing the latter brings no significant benefit.
The long-tailed effect can also explain, at a high level, why memorization is necessary. If an observed label is more likely to be true and the single example comes from an "atypical" subpopulation, then memorizing such a label is necessary. On the other hand, if a mislabeled example comes from a subpopulation with many other examples, then the correct label can be inferred from the other correctly-labeled examples of that subpopulation, so memorization is not necessary.
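A toy simulation of this long-tail argument (entirely our own construction, not Feldman's actual model): subpopulation frequencies are drawn from a heavy-tailed prior, and a learner that refuses to memorize singleton examples pays extra error roughly equal to the total mass of subpopulations it has seen only once.

```python
# Toy long-tail simulation (our construction, not Feldman's exact model).
import numpy as np

rng = np.random.default_rng(0)
K, n, trials = 5_000, 10_000, 20             # subpopulations, samples, repetitions

def run():
    freq = rng.pareto(1.5, size=K)            # heavy-tailed subpopulation frequencies
    freq /= freq.sum()
    counts = rng.multinomial(n, freq)         # how often each subpopulation is sampled
    # error = probability mass of subpopulations the learner cannot predict correctly
    err_memorize = freq[counts == 0].sum()    # fits (memorizes) every observed example
    err_no_memo = freq[counts <= 1].sum()     # additionally gives up on singletons
    return err_memorize, err_no_memo

errs = np.array([run() for _ in range(trials)])
print("avg error with memorization   :", errs[:, 0].mean())
print("avg error without memorization:", errs[:, 1].mean())
```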
However, memorization also brings data-privacy issues. There are a number of black-box membership inference attacks with high accuracy, so it is not always good to memorize the training data perfectly, especially when it contains sensitive information. To train a data-hungry, well-performing model when data privacy is emphasized, people tend to use the federated learning regime. If memorization is necessary, as stated above, will it become a limitation of federated learning? Luckily, some people have started looking into unintended memorization in federated learning. Thakkar et al. (2020) found that clustering the training data helps reduce unintended memorization, which suggests that larger data discrepancy across clients can actually help here. This data-heterogeneity setting happens quite often (even by design) in federated learning research. Although memorization helps a model attain high classification accuracy, it is not practical in domains where privacy is stressed.
Overparameterization and Random Initialization Help Generalization
In federated learning, if it is better for the data to be separable at the client level, then what about within each client? In fact, the separability of data also helps derive a better-shaped generalization result for local data, as shown in the work of Li and Liang (2019). They observed that the solution learned from less-structured data is further away from its initialization; the model has higher complexity and needs more iterations to converge. This work is based on overparameterization and random initialization. Here, the role of overparameterization is to reduce the likelihood that the activation pattern of a neuron on a data point changes within a fixed number of iterations, while random initialization of the weights provides strong cancellation of the initial contributions, so that the structured signal accumulated in the weight updates collectively drives the model output.
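A rough experiment in the spirit of this observation (our own toy setup, not the paper's): train only the first layer of an over-parameterized two-layer ReLU network with GD on (a) two well-separated clusters with consistent labels and (b) unstructured random inputs with random labels, and compare how far the weights travel from their random initialization after the same budget of iterations.

```python
# Distance from initialization: structured vs. unstructured data (our toy setup).
import numpy as np

rng = np.random.default_rng(0)
n, d, m, steps, lr = 200, 20, 1000, 1000, 1.5

def normalize(X):
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def structured_data():
    centers = rng.normal(size=(2, d))
    X = np.vstack([centers[i % 2] + 0.1 * rng.normal(size=d) for i in range(n)])
    y = np.array([1.0 if i % 2 == 0 else -1.0 for i in range(n)])
    return normalize(X), y

def unstructured_data():
    return normalize(rng.normal(size=(n, d))), rng.choice([-1.0, 1.0], size=n)

def train(X, y):
    W0 = rng.normal(size=(m, d))
    a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)     # fixed output layer
    W = W0.copy()
    for _ in range(steps):
        Z = X @ W.T                                      # (n, m) pre-activations
        out = np.maximum(Z, 0.0) @ a
        r = (out - y) / n                                # residual of 0.5*mean squared loss
        dW = ((Z > 0) * a[None, :] * r[:, None]).T @ X   # gradient w.r.t. first-layer weights
        W -= lr * dW
    loss = 0.5 * np.mean((np.maximum(X @ W.T, 0.0) @ a - y) ** 2)
    return np.linalg.norm(W - W0) / np.linalg.norm(W0), loss

for name, (X, y) in [("structured", structured_data()), ("unstructured", unstructured_data())]:
    dist, loss = train(X, y)
    print(f"{name:>12}: relative distance from init = {dist:.4f}, final loss = {loss:.4f}")
```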
Vitaly Feldman. Does learning require memorization? A short tale about a long tail, 2021.
Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data, 2019.
Om Thakkar, Swaroop Ramaswamy, Rajiv Mathews, and Françoise Beaufays. Understanding unintended memorization in federated learning, 2020.
A Naive Wander-Around of Double Descent (Nov 21)
Week Summary: We started by learning Occam's Razor, which says among all explanations to a phenomenon, the simplest one is preferred. When mapped to supervised learning, which essentially concerns function learning, this principle implies that the simplest (or "smoothest" in a very loosely defined sense) function is preferred among all functions that can interpolate a training set perfectly. This leads to an intriguing phenomenon called "double descent", which transcends our intuition gathered in classical machine learning. We then used linear and kernel regression as concrete examples to explain how the (expected) excess risk can be well controlled when the least L2 norm solution is returned, and then extended it to 2-layer (or 1.5-layer) NNs by assuming linearization. The key idea underlying the analysis is a prediction-interpolation decomposition, where the (transformed) features are partitioned into primary features, which are responsible for generalizable prediction, and secondary features, which are responsible for interpolating the training data, particularly the potential noise thereof.
Occam’s Razor says among all explanations to a phenomenon, the simplest one is preferred. When mapped to supervised learning, which essentially concerns function learning, this principle implies that the simplest (or “smoothest” in a very loosely defined sense) function is preferred among all functions that can interpolate a training set perfectly. This leads to an intriguing phenomenon called “double descent”, which transcends our intuition gathered in classical machine learning.
Plotting the training and testing error against some measure of model complexity for classical machine learning under the bias-variance tradeoff regime may look like the first row of Figure 1. We observe that both errors decrease, with the testing error slightly higher than the training error, and the lowest point of the test error is what we usually call the optimum of an under-parameterized model for this training data. As we move further along the x-axis, we expect the test error to skyrocket due to overfitting while the training error continues decreasing.
However, when we go beyond the under-parameterized regime, past the interpolation threshold in Figure 1, the generalization keeps improving as we continue increasing the model complexity, even though generalization is still poor right at the interpolation threshold (e.g., high variance there). The new optimum reached after the curve settles is empirically better than the old optimum. This phenomenon was first phrased as double descent by Belkin et al. (2019). Nakkiran et al. (2019) further conducted massive experiments showing that this phenomenon also exists in many popular deep neural networks such as CNNs, ResNets, and Transformers. They demonstrated that, for a sufficiently large model, training long enough will surface the double descent phenomenon. In addition, the height of the test-error peak and the position of the interpolation threshold are mostly sensitive to label noise: adding a certain level of label noise amplifies the double descent behavior, and the second descent can then push the test error below the previous optimum.
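A minimal random-features sketch of the phenomenon (our own toy, not the cited papers' experiments): fit noisy data with the minimum-L2-norm least-squares solution on an increasing number of random ReLU features; the test error typically peaks near the interpolation threshold (#features ≈ #samples) and descends again in the over-parameterized regime.

```python
# Minimum-norm least squares on random ReLU features: a toy double descent curve.
import numpy as np

rng = np.random.default_rng(0)
n, n_test, d = 40, 1_000, 15
w_true = rng.normal(size=d)

def sample(m_samples):
    X = rng.normal(size=(m_samples, d))
    y = X @ w_true + 0.5 * rng.normal(size=m_samples)   # noisy labels
    return X, y

X_train, y_train = sample(n)
X_test, y_test = sample(n_test)
V = rng.normal(size=(d, 400)) / np.sqrt(d)              # random feature directions

def features(X, p):
    return np.maximum(X @ V[:, :p], 0.0)                # p random ReLU features

for p in [5, 10, 20, 35, 40, 45, 60, 100, 200, 400]:
    Phi, Phi_test = features(X_train, p), features(X_test, p)
    beta = np.linalg.pinv(Phi) @ y_train                # minimum-L2-norm solution
    test_mse = np.mean((Phi_test @ beta - y_test) ** 2)
    print(f"#features={p:4d}  test MSE={test_mse:10.3f}")
```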
Do We Need Remedies for Double Descent?
If we buy the Occam’s Razor in the abstract, can a model with more parameters actually have a real tendency to be simpler than the one with less parameters? I’m afraid that probably yes. The double descent paradigm shows that over-parameterized models with zero training error generalize better than under-parameterized models with zero training error, maybe because they do better in stochastic gradient descent’s inductive biases (i.e., they’re better behaved between the points they’re interpolating). However, is this paradigm practical? As we know from previous section, one has to train “forever” to observe the double descent phenomenon; in practice, people use early stopping in realistic setting, which may indicate that it may never get to the effect of double descent.
Then, what’s a better early stopping technique when taking double descent into account?  believe that double descent in deep models is caused by a superposition of two or more bias-variance tradeoffs that arise because different layers of the networks are learned in different stages (i.e., epochs). They also empirically and theoretically show that eliminating such a superposition can improve the early stopping performance (see Figure 2).
Another recent work, by Stephenson and Lee (2021), argues that the training data contain slow-to-learn features, and, by analyzing a simple model, they prove that epoch-wise double descent can be avoided by removing the slow-to-learn but informative features from the training data. While the model's performance is not comparable after removing the informative features, they also suggest that training only the last layer of a deep network to convergence can remove epoch-wise double descent. Although the results are less convincing, I think the intuition is well motivated. From the deep double descent paper mentioned above, we know the appearance of double descent behavior requires a certain level of noise. Additionally, we recall that the representations learned by the early layers of deep networks stabilize quickly, leaving the later layers to adapt to the noise. In transfer learning, likewise, the feature layers do not change much during fine-tuning; mostly the classification layers do. Stephenson and Lee (2021) trained the last layer of the network to convergence while training the previous layers for a fixed number of epochs, and this yields generalization performance matching or exceeding the train-all-layers-to-convergence counterpart.
Jared Wilber and Brent Werness. Double descent part 1: A visual introduction, Dec 2021. URL: https://mlu-explain.github.io/double-descent/.
Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning practice and the bias-variance trade-off, 2019.
Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt, 2019.
Reinhard Heckel and Fatih Furkan Yilmaz. Early stopping in deep networks: Double descent and how to eliminate it, 2020.
Cory Stephenson and Tyler Lee. When and how epoch-wise double descent happens, 2021.