Communication-efficient algorithms for online distributed multitask learning

online optimization
distributed optimization
multitask learning
communication-efficient algorithms
Author
Affiliation

The Hong Kong Polytechnic University

Published

March 6, 2024

Modified

March 14, 2024

Abstract

This is a short review of communication-efficient algorithms for online distributed multitask learning.

Keywords

online optimization, distributed optimization, heterogeneity, communication-efficient algorithm, subgroup analysis, multitask learning

1 Introduction

Traditional machine learning paradigms aggregate all data in advance and transfer them to a centralized entity for processing and training. This offline centralized approach, however, suffers from significant challenges, including inflexibility in model updating, high communication and storage costs, heavy computational demands, and concerns regarding data privacy; for a detailed discussion, see (Verbraeken et al. 2020, sec. 2). To address the issues inherent to the offline centralized approach, and to accommodate the proliferation of intelligent devices generating massive data as well as increasing privacy concerns, online distributed learning has emerged as a paradigm shift (Predd, Kulkarni, and Poor 2006; Yang et al. 2019; T.-H. Chang et al. 2020; Verbraeken et al. 2020; S. Hu et al. 2021; X. Li, Xie, and Li 2023; Beltrán et al. 2023). This new paradigm enables multiple entities to collaboratively train models in a distributed fashion using their local online/streaming data, without sharing the data themselves, thereby significantly improving training efficiency and enhancing privacy protection (see Lian et al. 2017; also Verbraeken et al. 2020, sec. 5).

Online distributed learning can be roughly divided into two components: online optimization and distributed learning. In the field of online optimization, compared with the well-developed and long-studied online convex optimization (Hazan 2021; X. Li, Xie, and Li 2023), the exploration of online non-convex optimization remains relatively limited (Xin 2022, secs. 4, 5; Lu and Wang 2023; T.-H. Chang et al. 2020). In addition, research also focuses on how to handle different kinds of streaming data, such as the stochastic setting (Alghunaim and Yuan 2024) and the multiarmed bandit problem (Hazan 2021, sec. 6). On the other hand, the efficiency of distributed learning is impeded by constraints such as the limited and unbalanced computational resources across the nodes of the network and the substantial communication costs. These factors pose significant challenges to efficient online distributed learning and have led to a surge of research interest, especially in communication-efficient algorithm design (see X. Cao et al. 2023 and references therein) and the computation-communication trade-off (Nedic, Olshevsky, and Rabbat 2018; Berahas et al. 2019; Berahas, Bollapragada, and Gupta 2023).

The goal of distributed learning is typically to obtain a single global model, which assumes consensus among all nodes in the network. This assumption rests on the premise that the data across different nodes are independently and identically distributed (i.i.d.). However, this premise can be violated in many real-world scenarios, especially when the data are collected from multiple sources; see, for example, (Lee, Zhu, and Toscas 2015; Tong et al. 2022). The violation of this assumption, termed heterogeneity, complicates the design and analysis of algorithms, particularly gradient-based ones. For instance, decentralized gradient descent (DGD) and its variants (Kar, Moura, and Ramanan 2012; Nedic and Ozdaglar 2009; J. Chen and Sayed 2012), which combine average consensus (Olfati-Saber, Fax, and Murray 2007) with gradient descent, may diverge or converge to biased solutions when node-specific gradients differ significantly from the global gradient due to heterogeneity. To address this issue, numerous studies have been conducted over the past decades, most notably on the well-known gradient tracking technique; see, for example, (Zhu and Martínez 2010; Xin and Khan 2018; Pu et al. 2021; Nedić, Olshevsky, and Shi 2017; Lorenzo and Scutari 2016; Qu and Li 2018; B. Li et al. 2020; Alghunaim and Yuan 2024). Its basic idea is that each node maintains a local estimate of the global gradient in order to eliminate the influence of heterogeneity. Under the assumption of bounded heterogeneity, i.e., that the differences between the local gradients and the global gradient are bounded from above, the gradient tracking technique achieves better performance; see (Xin 2022, secs. 1.4, 1.5) for a detailed discussion and a review of the related literature. The assumption of bounded heterogeneity is essential in the convergence analysis of DGD and decentralized stochastic gradient descent (DSGD), especially in non-convex settings: a counterexample in (T.-H. Chang et al. 2020) shows that, without bounded heterogeneity, DSGD diverges for any constant step-size. Many efforts have been made to remove this kind of assumption, such as the GT-VR framework in (Xin 2022, chaps. 2, 3; Xin, Kar, and Khan 2020).
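
To make the gradient-tracking idea concrete, the following minimal sketch runs decentralized gradient tracking on a toy least-squares problem over a ring network. The data, mixing weights, step size, and iteration count are illustrative assumptions of ours, not taken from any of the cited works.

```python
import numpy as np

# Gradient tracking on a toy problem: node i holds f_i(x) = 0.5*||A_i x - b_i||^2
# and the nodes cooperate to minimize the sum over a ring network.
rng = np.random.default_rng(0)
n_nodes, dim = 5, 3
A = [rng.standard_normal((10, dim)) for _ in range(n_nodes)]
b = [rng.standard_normal(10) for _ in range(n_nodes)]

def grad(i, x):
    return A[i].T @ (A[i] @ x - b[i])

# Doubly stochastic mixing matrix for the ring graph.
W = np.zeros((n_nodes, n_nodes))
for i in range(n_nodes):
    W[i, i] = 0.5
    W[i, (i - 1) % n_nodes] = 0.25
    W[i, (i + 1) % n_nodes] = 0.25

alpha = 0.01                                            # step size (assumed)
x = np.zeros((n_nodes, dim))                            # local models
y = np.array([grad(i, x[i]) for i in range(n_nodes)])   # local gradient trackers

for _ in range(1000):
    x_new = W @ x - alpha * y            # mix neighbours' models, step along tracker
    g_old = np.array([grad(i, x[i]) for i in range(n_nodes)])
    g_new = np.array([grad(i, x_new[i]) for i in range(n_nodes)])
    y = W @ y + g_new - g_old            # tracker keeps following the average gradient
    x = x_new

# All local models should now be close to the minimizer of the global sum.
x_star = np.linalg.lstsq(np.vstack(A), np.concatenate(b), rcond=None)[0]
print(np.max(np.linalg.norm(x - x_star, axis=1)))
```

Even on this small example the role of the tracker \(y\) is visible: replacing it with the purely local gradient recovers DGD, which under a constant step size settles at a bias that grows with the discrepancy between local and global gradients.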

So-called personalized learning offers an alternative approach to handling unbounded heterogeneity by allowing nodes to train personalized models distinct from the global model (Tan et al. 2023). For instance, as a specific case of personalized learning, clustered federated learning (Ghosh et al. 2022; Sattler, Muller, and Samek 2021; Armacki et al. 2024) aims to partition the nodes of the network into \(K\) mutually disjoint clusters, with the number of clusters \(K\) known beforehand. Within each cluster, the nodes share the same model, resulting in \(K\) different models.1 In settings where the network is fully decentralized and the number of clusters is unknown in advance, clustered federated learning can be further extended to subgroup analysis, which originated in statistics (S. Ma and Huang 2017). Following the pioneering work (S. Ma and Huang 2017), subgroup analysis is typically formulated as an optimization problem with a pairwise fusion regularizer. It is precisely because of this pairwise fusion regularizer that subgroup analysis appears in the literature under various names, including network lasso (Hallac, Leskovec, and Boyd 2015), spatial heterogeneity detection/identification (F. Li and Sang 2019; X. Zhang, Liu, and Zhu 2022), spatial cost minimization (Y. Lin et al. 2022; P. Li et al. 2023), network cost minimization (X. Cao and Liu 2018; X. Cao and Liu 2021), pairwise constrained optimization (X. Cao and Basar 2020), and the broader topic of multitask learning (Caruana 1997; Yu Zhang and Yang 2022; Jacob, Bach, and Vert 2008; Zhou, Chen, and Ye 2011; Smith et al. 2017; Koppel, Sadler, and Ribeiro 2017; X. Cao and Liu 2017; Mortaheb, Vahapoglu, and Ulukus 2022; H. Hu et al. 2023; H. Ma, Guo, and Lau 2023; Okazaki and Kawano 2023; Singh et al. 2024; Yan and Cao 2024).

1 It should be noted that clustered federated learning differs from conventional clustering in unsupervised learning in that the former clusters models obtained by training on data distributed over different nodes, while the latter clusters the data directly.
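
For concreteness, the pairwise fusion formulation of subgroup analysis in (S. Ma and Huang 2017) can be sketched as follows (the notation here is ours and slightly simplified): \[ \min_{\beta,\,\mu_1,\dots,\mu_n} \; \frac{1}{2}\sum_{i=1}^{n}\bigl(y_i - z_i^{\top}\beta - \mu_i\bigr)^2 \;+\; \sum_{1 \le i < j \le n} p\bigl(|\mu_i - \mu_j|;\,\lambda\bigr), \] where \(y_i\) and \(z_i\) are the response and covariates of subject \(i\), \(\mu_i\) is a subject-specific intercept, \(\beta\) collects the common regression coefficients, and \(p(\cdot;\lambda)\) is a (typically concave) penalty such as SCAD or MCP. Subjects whose estimated intercepts \(\mu_i\) coincide are assigned to the same subgroup, and it is the pairwise fusion term that encourages such coincidences.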

The prevalence of unbounded heterogeneity in the real world has led to increased research interest in clustered multitask learning, epitomized by subgroup analysis. However, the lack of a methodological framework analogous to gradient tracking has resulted in a notable paucity of research concerning online, communication-efficient algorithms for this topic.

In Section 2, we present related work in the areas surrounding the aforementioned problem, including intuitive explanations and crucial techniques. A discussion of the problem formulation and its challenges is then provided in Section 3.

3 Problem formulation and challenges

In this section, we provide a brief introduction to the problem we consider: how to design communication-efficient algorithms for online distributed multitask learning. Since the mathematical formulation of multitask learning varies from application to application, we only present two widely used formulations that cover many applications, including consensual distributed optimization, network lasso, and subgroup analysis.

The first formulation is an unconstrained optimization problem of the form \[ \min_{\mathbf{x}} \sum_{k=1}^K f_k(x_k) + \sum_{(j, k) \in \mathcal{E}} g_{jk}(x_j, x_k), \tag{4}\] where \(\mathbf{x} := (x_k^{\top})_{k=1}^K\), \(f_k\) is the local cost function at the \(k\)-th node, and \(g_{jk}\) is the cost function associated with the edge \((j, k)\). Alternatively, one may consider a constrained optimization problem of the form \[ \begin{aligned} \min_{\mathbf{x}} & \quad \sum_{k=1}^K f_k(x_k) \\ \text{s.t. } & \quad h_{jk}(x_j, x_k) \leq \gamma_{jk} \quad \forall \, (j, k) \in \mathcal{E}, \end{aligned} \tag{5}\] where \(h_{jk}\) is the cost function associated with the edge \((j, k)\).

Next we give some examples.

  • If we let \(h_{jk}(x_j, x_k)\) be a norm of the difference \(x_j - x_k\) and let \(\gamma_{jk} = 0\), then problem (5) reduces to consensual distributed optimization.

  • If we let \(g_{jk}(x_j, x_k) = \|x_j - x_k\|_2\), then (4) reduces to network lasso in (Hallac, Leskovec, and Boyd 2015).

  • If we let \(f_k(x_k) = \frac{1}{2}\| A_k x_k - b_k \|_2^2\) with some local data \((A_k, b_k)\) and let \(g_{jk}(x_j, x_k) = \|x_j - x_k\|_1\), then (4) reduces to the subgroup analysis problem in (F. Li and Sang 2019; X. Zhang, Liu, and Zhu 2022).
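
To make formulation (4) concrete, the following sketch simply evaluates the subgroup-analysis instance above on a toy graph; the graph and data are illustrative assumptions of ours.

```python
import numpy as np

# Evaluate objective (4) with quadratic local costs f_k and an l1 pairwise
# fusion term g_jk, i.e. the subgroup-analysis example above.
rng = np.random.default_rng(1)
K, dim = 4, 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]      # edge set E (a ring, assumed)
A = [rng.standard_normal((8, dim)) for _ in range(K)]
b = [rng.standard_normal(8) for _ in range(K)]

def f(k, x_k):
    # local cost f_k(x_k) = 0.5 * ||A_k x_k - b_k||_2^2
    return 0.5 * np.sum((A[k] @ x_k - b[k]) ** 2)

def g(x_j, x_k):
    # edge cost g_jk(x_j, x_k) = ||x_j - x_k||_1
    return np.sum(np.abs(x_j - x_k))

def objective(x):
    # x has shape (K, dim): one local model per node
    return sum(f(k, x[k]) for k in range(K)) + sum(g(x[j], x[k]) for j, k in edges)

print(objective(np.zeros((K, dim))))
```

Note that the local terms \(f_k\) are separable across nodes, while the edge terms couple neighbouring nodes; this coupling is exactly what distinguishes (4) and (5) from consensual distributed optimization.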

The main challenge stems from the edge functions \(g_{jk}\) and \(h_{jk}\), which make the problem non-consensual, so many existing algorithms for consensual distributed optimization cannot be applied directly. One possible way to solve (4) is to transform it into a consensual distributed optimization problem by letting each node maintain a copy of the whole variable \(\mathbf{x}\), as in (Jiang, Mukherjee, and Sarkar 2018). Although this allows us to use consensual distributed optimization techniques, it significantly increases the problem scale and does not exploit the problem structure.
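
The lifting just described can be sketched as follows (our notation, meant only to illustrate the idea rather than the exact formulation of the cited work): each node \(k\) keeps a full copy \(\mathbf{x}^{(k)}\) of all task variables and the network solves \[ \min_{\mathbf{x}^{(1)},\dots,\mathbf{x}^{(K)}} \; \sum_{k=1}^{K} F_k\bigl(\mathbf{x}^{(k)}\bigr) \quad \text{s.t.} \quad \mathbf{x}^{(j)} = \mathbf{x}^{(k)} \quad \forall\,(j,k)\in\mathcal{E}, \] where \(F_k\) collects \(f_k\) together with a share of the edge costs incident to node \(k\), chosen so that \(\sum_{k=1}^{K} F_k(\mathbf{x})\) recovers the objective in (4). Any consensual distributed algorithm then applies, but each node must store and communicate a variable that is \(K\) times larger than its own task variable, which is precisely the scalability issue noted above.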

Viewing the problem as a regularized separable problem, it is natural to consider the celebrated ADMM; efforts in this direction include (Hallac, Leskovec, and Boyd 2015; X. Cao and Liu 2017; X. Cao and Liu 2018; F. Li and Sang 2019; X. Zhang, Liu, and Zhu 2022). However, the ADMM variants in these papers still require full communication in each iteration, with the exception of two minimum spanning tree (MST)-based methods (F. Li and Sang 2019; X. Zhang, Liu, and Zhu 2022); these two methods, in turn, implicitly assume that the data at each node are homogeneous and sufficient to provide a good initial point, and hence fail to handle heterogeneity and online settings.
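
To make the discussion concrete, the edge-based splitting underlying such ADMM approaches can be sketched as follows (our notation; not the exact form used in any single cited paper): introduce a pair of copies \(z_{jk}, z_{kj}\) for every edge and solve \[ \min_{\mathbf{x},\,\mathbf{z}} \; \sum_{k=1}^{K} f_k(x_k) + \sum_{(j,k)\in\mathcal{E}} g_{jk}(z_{jk}, z_{kj}) \quad \text{s.t.} \quad z_{jk} = x_j,\; z_{kj} = x_k \quad \forall\,(j,k)\in\mathcal{E}. \] The resulting augmented Lagrangian separates nicely: the \(x\)-update decouples across nodes, the \(z\)-update decouples across edges (a proximal step involving \(g_{jk}\)), and the dual update is local to each edge. The price is that every iteration requires each pair of neighbouring nodes to exchange their current iterates, which is the full per-iteration communication mentioned above.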

Some other attempts can be found in the literature. For example, (X. Cao and Liu 2021) uses a distributed Newton’s method to solve (4). However, this Newton’s method requires \(f_k\) and \(g_{jk}\) to be twice differentiable for all \(k\) and \((j, k) \in \mathcal{E}\), which is restrictive in practice. Saddle point methods are employed in (Koppel, Sadler, and Ribeiro 2017; X. Cao and Basar 2020; Singh et al. 2024; Yan and Cao 2024) to solve (5), where the smoothness of \(f_k\) and \(h_{jk}\) for all \(k\) and \((j, k) \in \mathcal{E}\) is required. However, neither the network lasso problem nor the subgroup analysis problem satisfies these conditions.
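
For reference, the saddle-point reformulation of (5) used in this line of work can be sketched as the Lagrangian problem (our notation) \[ \min_{\mathbf{x}} \; \max_{\lambda_{jk} \ge 0} \; \sum_{k=1}^{K} f_k(x_k) + \sum_{(j,k)\in\mathcal{E}} \lambda_{jk}\bigl(h_{jk}(x_j, x_k) - \gamma_{jk}\bigr), \] typically tackled by alternating a primal (stochastic sub)gradient step on \(\mathbf{x}\) with a projected dual ascent step on the multipliers \(\lambda_{jk}\). The gradient-based primal step is where the smoothness of \(f_k\) and \(h_{jk}\) enters, and it is exactly this requirement that the nonsmooth penalties in network lasso and subgroup analysis violate.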

To conclude, research on these two problems remains limited so far.

References

Alghunaim, Sulaiman A., and Kun Yuan. 2024. “An Enhanced Gradient-Tracking Bound for Distributed Online Stochastic Convex Optimization.” Signal Processing 217 (April): 109345. https://doi.org/10.1016/j.sigpro.2023.109345.
Armacki, Aleksandar, Dragana Bajović, Dušan Jakovetić, and Soummya Kar. 2024. “A One-Shot Framework for Distributed Clustered Learning in Heterogeneous Environments.” IEEE Transactions on Signal Processing 72: 636–51. https://doi.org/10.1109/tsp.2023.3343561.
Beltrán, Enrique Tomás Martínez, Mario Quiles Pérez, Pedro Miguel Sánchez Sánchez, Sergio López Bernal, Gérôme Bovet, Manuel Gil Pérez, Gregorio Martínez Pérez, and Alberto Huertas Celdrán. 2023. “Decentralized Federated Learning: Fundamentals, State of the Art, Frameworks, Trends, and Challenges.” IEEE Communications Surveys & Tutorials 25 (4): 2983–3013. https://doi.org/10.1109/comst.2023.3315746.
Berahas, Albert S., Raghu Bollapragada, and Shagun Gupta. 2023. “Balancing Communication and Computation in Gradient Tracking Algorithms for Decentralized Optimization.” 2023. https://arxiv.org/abs/2303.14289v2.
Berahas, Albert S., Raghu Bollapragada, Nitish Shirish Keskar, and Ermin Wei. 2019. “Balancing Communication and Computation in Distributed Optimization.” IEEE Transactions on Automatic Control 64 (8): 3141–55. https://doi.org/10.1109/tac.2018.2880407.
Borsos, Zalán, Sebastian Curi, Kfir Yehuda Levy, and Andreas Krause. 2019. “Online Variance Reduction with Mixtures.” In International Conference on Machine Learning, 705–14. PMLR.
Borsos, Zalán, Andreas Krause, and Kfir Y Levy. 2018. “Online Variance Reduction for Stochastic Optimization.” In Conference on Learning Theory, 324–57. PMLR.
Boyd, S., A. Ghosh, B. Prabhakar, and D. Shah. 2006. “Randomized Gossip Algorithms.” IEEE Transactions on Information Theory 52 (6): 2508–30. https://doi.org/10.1109/tit.2006.874516.
Boyd, Stephen, and Lieven Vandenberghe. 2004. Convex Optimization. Cambridge University Press. https://doi.org/10.1017/cbo9780511804441.
Cao, Ming, Daniel A Spielman, and Edmund M Yeh. 2006. “Accelerated Gossip Algorithms for Distributed Computation.” In Proc. Of the 44th Annual Allerton Conference on Communication, Control, and Computation, 952–59.
Cao, Xuanyu, and Tamer Basar. 2020. “Decentralized Multi-Agent Stochastic Optimization with Pairwise Constraints and Quantized Communications.” IEEE Transactions on Signal Processing 68: 3296–3311. https://doi.org/10.1109/tsp.2020.2997394.
———. 2021. “Decentralized Online Convex Optimization with Event-Triggered Communications.” IEEE Transactions on Signal Processing 69: 284–99. https://doi.org/10.1109/tsp.2020.3044843.
Cao, Xuanyu, Tamer Başar, Suhas Diggavi, Yonina C. Eldar, Khaled B. Letaief, H. Vincent Poor, and Junshan Zhang. 2023. “Communication-Efficient Distributed Learning: An Overview.” IEEE Journal on Selected Areas in Communications 41 (4): 851–73. https://doi.org/10.1109/jsac.2023.3242710.
Cao, Xuanyu, and K. J. Ray Liu. 2017. “Decentralized Sparse Multitask RLS over Networks.” IEEE Transactions on Signal Processing 65 (23): 6217–32. https://doi.org/10.1109/tsp.2017.2750110.
———. 2021. “Distributed Newton’s Method for Network Cost Minimization.” IEEE Transactions on Automatic Control 66 (3): 1278–85. https://doi.org/10.1109/tac.2020.2989266.
Cao, Xuanyu, and K. J. Ray Liu. 2018. “Distributed Linearized ADMM for Network Cost Minimization.” IEEE Transactions on Signal and Information Processing over Networks 4 (3): 626–38. https://doi.org/10.1109/tsipn.2018.2806841.
Caruana, Rich. 1997. “Multitask Learning.” Machine Learning 28 (1): 41–75. https://doi.org/10.1023/A:1007379606734.
Chang, Ting-Jui, and Shahin Shahrampour. 2021. “On Online Optimization: Dynamic Regret Analysis of Strongly Convex and Smooth Problems.” In Proceedings of the AAAI Conference on Artificial Intelligence, 35:6966–73. 8.
Chang, Tsung-Hui, Mingyi Hong, Hoi-To Wai, Xinwei Zhang, and Songtao Lu. 2020. “Distributed Learning in the Nonconvex World: From Batch Data to Streaming and Beyond.” IEEE Signal Processing Magazine 37 (3): 26–38. https://doi.org/10.1109/msp.2020.2970170.
Chen, Jianshu, and Ali H. Sayed. 2012. “Diffusion Adaptation Strategies for Distributed Optimization and Learning over Networks.” IEEE Transactions on Signal Processing 60 (8): 4289–4305. https://doi.org/10.1109/tsp.2012.2198470.
Chen, Mingzhe, Nir Shlezinger, H. Vincent Poor, Yonina C. Eldar, and Shuguang Cui. 2021. “Communication-Efficient Federated Learning.” Proceedings of the National Academy of Sciences 118 (17). https://doi.org/10.1073/pnas.2024789118.
Chen, Tianlong, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, and Wotao Yin. 2022. “Learning to Optimize: A Primer and A Benchmark.” Journal of Machine Learning Research 23: 189:1–59. http://jmlr.org/papers/v23/21-0308.html.
Dimakis, Alexandros G, Soummya Kar, José MF Moura, Michael G Rabbat, and Anna Scaglione. 2010. “Gossip Algorithms for Distributed Signal Processing.” Proceedings of the IEEE 98 (11): 1847–64.
Dong, Xingrong, Zhaoxian Wu, Qing Ling, and Zhi Tian. 2024. “Byzantine-Robust Distributed Online Learning: Taming Adversarial Participants in an Adversarial Environment.” IEEE Transactions on Signal Processing 72: 235–48. https://doi.org/10.1109/tsp.2023.3340028.
Elgabli, Anis, Jihong Park, Amrit S. Bedi, Mehdi Bennis, and Vaneet Aggarwal. 2020. “GADMM: Fast and Communication Efficient Framework for Distributed Machine Learning.” Journal of Machine Learning Research 21: 76:1–39. http://jmlr.org/papers/v21/19-718.html.
Elgabli, Anis, Jihong Park, Amrit Singh Bedi, Chaouki Ben Issaid, Mehdi Bennis, and Vaneet Aggarwal. 2021. “Q-GADMM: Quantized Group ADMM for Communication Efficient Decentralized Machine Learning.” IEEE Transactions on Communications 69 (1): 164–81. https://doi.org/10.1109/tcomm.2020.3026398.
Fang, Huang, Nicholas J. A. Harvey, Victor S. Portella, and Michael P. Friedlander. 2022. “Online Mirror Descent and Dual Averaging: Keeping Pace in the Dynamic Case.” Journal of Machine Learning Research 23 (121): 1–38. http://jmlr.org/papers/v23/21-1027.html.
Ghosh, Avishek, Jichan Chung, Dong Yin, and Kannan Ramchandran. 2022. “An Efficient Framework for Clustered Federated Learning.” IEEE Transactions on Information Theory 68 (12): 8076–91. https://doi.org/10.1109/tit.2022.3192506.
Gower, Robert M., Mark Schmidt, Francis Bach, and Peter Richtarik. 2020. “Variance-Reduced Methods for Machine Learning.” Proceedings of the IEEE 108 (11): 1968–83. https://doi.org/10.1109/jproc.2020.3028013.
Hallac, David, Jure Leskovec, and Stephen Boyd. 2015. “Network Lasso: Clustering and Optimization in Large Graphs.” In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’15. ACM. https://doi.org/10.1145/2783258.2783313.
Hazan, Elad. 2021. Introduction to Online Convex Optimization. now Publishers Inc. https://doi.org/10.1561/9781680831719.
Hazan, Elad, Amit Agarwal, and Satyen Kale. 2007. “Logarithmic Regret Algorithms for Online Convex Optimization.” Machine Learning 69: 169–92.
Hazan, Elad, and Haipeng Luo. 2016. “Variance-Reduced and Projection-Free Stochastic Optimization.” In International Conference on Machine Learning, 1263–71. PMLR.
He, Xuechao, Heng Zhu, and Qing Ling. 2023. “C-RSA: Byzantine-Robust and Communication-Efficient Distributed Learning in the Non-Convex and Non-IID Regime.” Signal Processing 213 (December): 109222. https://doi.org/10.1016/j.sigpro.2023.109222.
Hu, Haoyang, Youlong Wu, Yuanming Shi, Songze Li, Chunxiao Jiang, and Wei Zhang. 2023. “Communication-Efficient Coded Computing for Distributed Multi-Task Learning.” IEEE Transactions on Communications 71 (7): 3861–75. https://doi.org/10.1109/tcomm.2023.3275194.
Hu, Shuyan, Xiaojing Chen, Wei Ni, Ekram Hossain, and Xin Wang. 2021. “Distributed Machine Learning for Wireless Communication Networks: Techniques, Architectures, and Applications.” IEEE Communications Surveys & Tutorials 23 (3): 1458–93. https://doi.org/10.1109/comst.2021.3086014.
Jacob, Laurent, Francis R. Bach, and Jean-Philippe Vert. 2008. “Clustered Multi-Task Learning: A Convex Formulation.” In Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, 745–52. https://proceedings.neurips.cc/paper/2008/hash/fccb3cdc9acc14a6e70a12f74560c026-Abstract.html.
Jaggi, Martin, Virginia Smith, Martin Takác, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. 2014. “Communication-Efficient Distributed Dual Coordinate Ascent.” In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 3068–76. https://proceedings.neurips.cc/paper/2014/hash/894b77f805bd94d292574c38c5d628d5-Abstract.html.
Jiang, Zhanhong, Kushal Mukherjee, and Soumik Sarkar. 2018. “On Consensus-Disagreement Tradeoff in Distributed Optimization.” In 2018 Annual American Control Conference (ACC). IEEE. https://doi.org/10.23919/acc.2018.8430883.
Kar, Soummya, José M. F. Moura, and Kavita Ramanan. 2012. “Distributed Parameter Estimation in Sensor Networks: Nonlinear Observation Models and Imperfect Communication.” IEEE Transactions on Information Theory 58 (6): 3575–605. https://doi.org/10.1109/tit.2012.2191450.
Kaya, Ege C., Mehmet Berk Sahin, and Abolfazl Hashemi. 2023. “Communication-Efficient Zeroth-Order Distributed Online Optimization: Algorithm, Theory, and Applications.” IEEE Access 11: 61173–91. https://doi.org/10.1109/access.2023.3284891.
Koppel, Alec, Brian M. Sadler, and Alejandro Ribeiro. 2017. “Proximity Without Consensus in Online Multiagent Optimization.” IEEE Transactions on Signal Processing 65 (12): 3062–77. https://doi.org/10.1109/tsp.2017.2686368.
Lamport, Leslie, Robert E. Shostak, and Marshall C. Pease. 1982. “The Byzantine Generals Problem.” ACM Trans. Program. Lang. Syst. 4 (3): 382–401. https://doi.org/10.1145/357172.357176.
Lan, Guanghui, Soomin Lee, and Yi Zhou. 2020. “Communication-Efficient Algorithms for Decentralized and Stochastic Optimization.” Mathematical Programming 180 (1–2): 237–84. https://doi.org/10.1007/s10107-018-1355-4.
Lee, D.‐J., Z. Zhu, and P. Toscas. 2015. “Spatio‐temporal Functional Data Analysis for Wireless Sensor Networks Data.” Environmetrics 26 (5): 354–62. https://doi.org/10.1002/env.2344.
Lesage-Landry, Antoine, Joshua A. Taylor, and Iman Shames. 2021. “Second-Order Online Nonconvex Optimization.” IEEE Transactions on Automatic Control 66 (10): 4866–72. https://doi.org/10.1109/tac.2020.3040372.
Li, Boyue, Shicong Cen, Yuxin Chen, and Yuejie Chi. 2020. “Communication-Efficient Distributed Optimization in Networks with Gradient Tracking and Variance Reduction.” Journal of Machine Learning Research 21: 180:1–51. http://jmlr.org/papers/v21/20-210.html.
Li, Furong, and Huiyan Sang. 2019. “Spatial Homogeneity Pursuit of Regression Coefficients for Large Datasets.” Journal of the American Statistical Association 114 (527): 1050–62. https://doi.org/10.1080/01621459.2018.1529595.
Li, Liping, Wei Xu, Tianyi Chen, Georgios B. Giannakis, and Qing Ling. 2019. “RSA: Byzantine-Robust Stochastic Aggregation Methods for Distributed Learning from Heterogeneous Datasets.” In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, the Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, the Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, 1544–51. https://doi.org/10.1609/AAAI.V33I01.33011544.
Li, Pengfei, Jianyi Yang, Adam Wierman, and Shaolei Ren. 2023. “Learning-Augmented Decentralized Online Convex Optimization in Networks.” 2023. https://arxiv.org/abs/2306.10158v2.
Li, Xiuxian, Lihua Xie, and Na Li. 2023. “A Survey on Distributed Online Optimization and Online Games.” Annual Reviews in Control 56: 100904. https://doi.org/10.1016/j.arcontrol.2023.100904.
Lian, Xiangru, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. 2017. “Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent.” In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 5330–40. https://proceedings.neurips.cc/paper/2017/hash/f75526659f31040afeb61cb7133e4e6d-Abstract.html.
Liao, Yiwei, Zhuorui Li, Kun Huang, and Shi Pu. 2022. “A Compressed Gradient Tracking Method for Decentralized Optimization with Linear Convergence.” IEEE Transactions on Automatic Control 67 (10): 5622–29. https://doi.org/10.1109/tac.2022.3180695.
Lin, Sen, Daouda Sow, Kaiyi Ji, Yingbin Liang, and Ness B. Shroff. 2023. “Non-Convex Bilevel Optimization with Time-Varying Objective Functions.” In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023. http://papers.nips.cc/paper_files/paper/2023/hash/5ee60ca5686bbcf756e56a6c75e66f32-Abstract-Conference.html.
Lin, Yiheng, Judy Gan, Guannan Qu, Yash Kanoria, and Adam Wierman. 2022. “Decentralized Online Convex Optimization in Networked Systems.” In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, 13356–93. https://proceedings.mlr.press/v162/lin22c.html.
Liu, Huikang, Jiaojiao Zhang, Anthony Man-Cho So, and Qing Ling. 2023. “A Communication-Efficient Decentralized Newton’s Method with Provably Faster Convergence.” IEEE Transactions on Signal and Information Processing over Networks 9: 427–41. https://doi.org/10.1109/tsipn.2023.3290397.
Liu, Sijia, Jie Chen, Pin-Yu Chen, and Alfred Hero. 2018. “Zeroth-Order Online Alternating Direction Method of Multipliers: Convergence Analysis and Applications.” In International Conference on Artificial Intelligence and Statistics, 288–97. PMLR.
Liu, Yuanyuan, Jiacheng Geng, Fanhua Shang, Weixin An, Hongying Liu, and Qi Zhu. 2022. “Loopless Variance Reduced Stochastic ADMM for Equality Constrained Problems in IoT Applications.” IEEE Internet of Things Journal 9 (3): 2293–303. https://doi.org/10.1109/jiot.2021.3095561.
Liu, Yuanyuan, Fanhua Shang, and James Cheng. 2017. “Accelerated Variance Reduced Stochastic ADMM.” In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 31. 1.
Liu, Yuanyuan, Fanhua Shang, Hongying Liu, Lin Kong, Licheng Jiao, and Zhouchen Lin. 2021. “Accelerated Variance Reduction Stochastic ADMM for Large-Scale Machine Learning.” IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (12): 4242–55. https://doi.org/10.1109/tpami.2020.3000512.
Lorenzo, Paolo Di, and Gesualdo Scutari. 2016. “NEXT: In-Network Nonconvex Optimization.” IEEE Transactions on Signal and Information Processing over Networks 2 (2): 120–36. https://doi.org/10.1109/tsipn.2016.2524588.
Lu, Kaihong, and Long Wang. 2023. “Online Distributed Optimization with Nonconvex Objective Functions via Dynamic Regrets.” IEEE Transactions on Automatic Control 68 (11): 6509–24. https://doi.org/10.1109/tac.2023.3239432.
Ma, Haoyu, Huayan Guo, and Vincent K. N. Lau. 2023. “Communication-Efficient Federated Multitask Learning over Wireless Networks.” IEEE Internet of Things Journal 10 (1): 609–24. https://doi.org/10.1109/jiot.2022.3201310.
Ma, Shujie, and Jian Huang. 2017. “A Concave Pairwise Fusion Approach to Subgroup Analysis.” Journal of the American Statistical Association 112 (517): 410–23. https://doi.org/10.1080/01621459.2016.1148039.
Maass, Alejandro I., Chris Manzie, Dragan Nešić, Jonathan H. Manton, and Iman Shames. 2022. “Tracking and Regret Bounds for Online Zeroth-Order Euclidean and Riemannian Optimization.” SIAM Journal on Optimization 32 (2): 445–69. https://doi.org/10.1137/21m1405551.
Mao, Xianghui, Kun Yuan, Yubin Hu, Yuantao Gu, Ali H. Sayed, and Wotao Yin. 2020. “Walkman: A Communication-Efficient Random-Walk Algorithm for Decentralized Optimization.” IEEE Transactions on Signal Processing 68: 2513–28. https://doi.org/10.1109/tsp.2020.2983167.
Molinaro, Marco. 2021. “Robust Algorithms for Online Convex Problems via Primal-Dual.” In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), 2078–92. Society for Industrial; Applied Mathematics. https://doi.org/10.1137/1.9781611976465.124.
Mortaheb, Matin, Cemil Vahapoglu, and Sennur Ulukus. 2022. “Personalized Federated Multi-Task Learning over Wireless Fading Channels.” Algorithms 15 (11): 421. https://doi.org/10.3390/A15110421.
Nedic, Angelia, Alex Olshevsky, and Michael G. Rabbat. 2018. “Network Topology and Communication-Computation Tradeoffs in Decentralized Optimization.” Proceedings of the IEEE 106 (5): 953–76. https://doi.org/10.1109/jproc.2018.2817461.
Nedic, Angelia, and Asuman Ozdaglar. 2009. “Distributed Subgradient Methods for Multi-Agent Optimization.” IEEE Transactions on Automatic Control 54 (1): 48–61. https://doi.org/10.1109/tac.2008.2009515.
Nedić, Angelia, Alex Olshevsky, and Wei Shi. 2017. “Achieving Geometric Convergence for Distributed Optimization over Time-Varying Graphs.” SIAM Journal on Optimization 27 (4): 2597–2633. https://doi.org/10.1137/16m1084316.
Niu, Youcheng, Jinming Xu, Ying Sun, Yan Huang, and Li Chai. 2023. “Distributed Stochastic Bilevel Optimization: Improved Complexity and Heterogeneity Analysis.” 2023. https://arxiv.org/abs/2312.14690v2.
Nocedal, Jorge, and Stephen J. Wright. 2006. Numerical Optimization. Second. Springer Series in Operations Research and Financial Engineering. Springer, New York.
Okazaki, Akira, and Shuichi Kawano. 2023. “Multi-Task Learning Regression via Convex Clustering.” 2023. https://arxiv.org/abs/2304.13342v1.
Olfati-Saber, Reza, J. Alex Fax, and Richard M. Murray. 2007. “Consensus and Cooperation in Networked Multi-Agent Systems.” Proceedings of the IEEE 95 (1): 215–33. https://doi.org/10.1109/jproc.2006.887293.
Ozkara, Kaan, Navjot Singh, Deepesh Data, and Suhas N. Diggavi. 2021. “QuPeD: Quantized Personalization via Distillation with Applications to Federated Learning.” In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, Virtual, 3622–34. https://proceedings.neurips.cc/paper/2021/hash/1dba3025b159cd9354da65e2d0436a31-Abstract.html.
Predd, Joel B., Sanjeev B. Kulkarni, and H. Vincent Poor. 2006. “Distributed Learning in Wireless Sensor Networks.” IEEE Signal Processing Magazine 23 (4): 56–69. https://doi.org/10.1109/msp.2006.1657817.
Pu, Shi, Wei Shi, Jinming Xu, and Angelia Nedic. 2021. “Push–Pull Gradient Methods for Distributed Optimization in Networks.” IEEE Transactions on Automatic Control 66 (1): 1–16. https://doi.org/10.1109/tac.2020.2972824.
Qu, Guannan, and Na Li. 2018. “Harnessing Smoothness to Accelerate Distributed Optimization.” IEEE Transactions on Control of Network Systems 5 (3): 1245–60. https://doi.org/10.1109/tcns.2017.2698261.
Sattler, Felix, Klaus-Robert Muller, and Wojciech Samek. 2021. “Clustered Federated Learning: Model-Agnostic Distributed Multitask Optimization Under Privacy Constraints.” IEEE Transactions on Neural Networks and Learning Systems 32 (8): 3710–22. https://doi.org/10.1109/tnnls.2020.3015958.
Scavuzzo, Lara, Karen Aardal, Andrea Lodi, and Neil Yorke-Smith. 2024. “Machine Learning Augmented Branch and Bound for Mixed Integer Linear Programming.” 2024. https://arxiv.org/abs/2402.05501v1.
Schraudolph, Nicol N, Jin Yu, and Simon Günter. 2007. “A Stochastic Quasi-Newton Method for Online Convex Optimization.” In Artificial Intelligence and Statistics, 436–43. PMLR.
Shalev-Shwartz, Shai, and Yoram Singer. 2007. “A Primal-Dual Perspective of Online Learning Algorithms.” Machine Learning 69: 115–42.
Shalev-Shwartz, Shai, and Tong Zhang. 2013. “Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization.” Journal of Machine Learning Research 14 (1).
Shen, Lingqing, Nam Ho-Nguyen, and Fatma Kılınç-Karzan. 2022. “An Online Convex Optimization-Based Framework for Convex Bilevel Optimization.” Mathematical Programming 198 (2): 1519–82. https://doi.org/10.1007/s10107-022-01894-5.
Shi, Wei, Qing Ling, Gang Wu, and Wotao Yin. 2015. “EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization.” SIAM Journal on Optimization 25 (2): 944–66. https://doi.org/10.1137/14096668x.
Singh, Navjot, Xuanyu Cao, Suhas Diggavi, and Tamer Başar. 2024. “Decentralized Multi-Task Stochastic Optimization with Compressed Communications.” Automatica 159 (January): 111363. https://doi.org/10.1016/j.automatica.2023.111363.
Smith, Virginia, Chao-Kai Chiang, Maziar Sanjabi, and Ameet Talwalkar. 2017. “Federated Multi-Task Learning.” In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 4424–34. https://proceedings.neurips.cc/paper/2017/hash/6211080fa89981f66b1a0c9d55c61d0f-Abstract.html.
Smith, Virginia, Simone Forte, Chenxin Ma, Martin Takáč, Michael I Jordan, and Martin Jaggi. 2018. “CoCoA: A General Framework for Communication-Efficient Distributed Optimization.” Journal of Machine Learning Research 18 (230): 1–49.
Suzuki, Taiji. 2014. “Stochastic Dual Coordinate Ascent with Alternating Direction Method of Multipliers.” In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, 736–44. http://proceedings.mlr.press/v32/suzuki14.html.
Tan, Alysa Ziying, Han Yu, Lizhen Cui, and Qiang Yang. 2023. “Towards Personalized Federated Learning.” IEEE Transactions on Neural Networks and Learning Systems 34 (12): 9587–9603. https://doi.org/10.1109/tnnls.2022.3160699.
Tarzanagh, Davoud Ataee, Parvin Nazari, Bojian Hou, Li Shen, and Laura Balzano. 2022. “Online Bilevel Optimization: Regret Analysis of Online Alternating Gradient Methods.” 2022. https://arxiv.org/abs/2207.02829v6.
Tong, Jiayi, Chongliang Luo, Md Nazmul Islam, Natalie E. Sheils, John Buresh, Mackenzie Edmondson, Peter A. Merkel, Ebbing Lautenbach, Rui Duan, and Yong Chen. 2022. “Distributed Learning for Heterogeneous Clinical Data with Application to Integrating COVID-19 Data Across 230 Sites.” Npj Digital Medicine 5 (1). https://doi.org/10.1038/s41746-022-00615-8.
Verbraeken, Joost, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S. Rellermeyer. 2020. “A Survey on Distributed Machine Learning.” ACM Computing Surveys 53 (2): 1–33. https://doi.org/10.1145/3377454.
Wang, Lei, Le Bao, and Xin Liu. 2024. “A Decentralized Proximal Gradient Tracking Algorithm for Composite Optimization on Riemannian Manifolds.” 2024. https://arxiv.org/abs/2401.11573v1.
Wang, Lei, and Xin Liu. 2022. “Decentralized Optimization over the Stiefel Manifold by an Approximate Augmented Lagrangian Function.” IEEE Transactions on Signal Processing 70: 3029–41. https://doi.org/10.1109/tsp.2022.3182883.
Wang, Shuai, Yanqing Xu, Zhiguo Wang, Tsung-Hui Chang, Tony Q. S. Quek, and Defeng Sun. 2023. “Beyond ADMM: A Unified Client-Variance-Reduced Adaptive Federated Learning Framework.” In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, 10175–83. https://doi.org/10.1609/AAAI.V37I8.26212.
Wang, Xi, Zhipeng Tu, Yiguang Hong, Yingyi Wu, and Guodong Shi. 2021. “No-Regret Online Learning over Riemannian Manifolds.” Advances in Neural Information Processing Systems 34: 28323–35.
Xin, Ran. 2022. “Decentralized Non-Convex Optimization and Learning over Heterogeneous Networks.” PhD thesis, Carnegie Mellon University. https://www.proquest.com/openview/9eec69c1293e6fb01fd4daa8cb514015.
Xin, Ran, Soummya Kar, and Usman A Khan. 2020. “Decentralized Stochastic Optimization and Machine Learning: A Unified Variance-Reduction Framework for Robust Performance and Fast Convergence.” IEEE Signal Processing Magazine 37 (3): 102–13.
Xin, Ran, and Usman A. Khan. 2018. “A Linear Algorithm for Optimization over Directed Graphs with Geometric Convergence.” IEEE Control Systems Letters 2 (3): 315–20. https://doi.org/10.1109/lcsys.2018.2834316.
Xu, Ping, Yue Wang, Xiang Chen, and Zhi Tian. 2024. “QC-ODKLA: Quantized and Communication-Censored Online Decentralized Kernel Learning via Linearized ADMM.” IEEE Transactions on Neural Networks and Learning Systems, 1–13. https://doi.org/10.1109/tnnls.2023.3310499.
Yan, Wenjing, and Xuanyu Cao. 2024. “Decentralized Multitask Online Convex Optimization Under Random Link Failures.” IEEE Transactions on Signal Processing 72: 622–35. https://doi.org/10.1109/tsp.2023.3348954.
Yang, Tao, Xinlei Yi, Junfeng Wu, Ye Yuan, Di Wu, Ziyang Meng, Yiguang Hong, Hong Wang, Zongli Lin, and Karl H. Johansson. 2019. “A Survey of Distributed Optimization.” Annual Reviews in Control 47: 278–305. https://doi.org/10.1016/j.arcontrol.2019.05.006.
Zhang, Jiaojiao, Huikang Liu, Anthony Man-Cho So, and Qing Ling. 2023. “Variance-Reduced Stochastic Quasi-Newton Methods for Decentralized Learning.” IEEE Transactions on Signal Processing 71: 311–26. https://doi.org/10.1109/tsp.2023.3240652.
Zhang, Mengfei, Danqi Jin, and Jie Chen. 2020. “Variance Reduced Diffusion Adaptation for Online Learning over Networks.” In 2020 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC). IEEE. https://doi.org/10.1109/icspcc50002.2020.9259540.
Zhang, Xin, Jia Liu, and Zhengyuan Zhu. 2022. “Learning Coefficient Heterogeneity over Networks: A Distributed Spanning-Tree-Based Fused-Lasso Regression.” Journal of the American Statistical Association, December, 1–13. https://doi.org/10.1080/01621459.2022.2126363.
Zhang, Yijian, Emiliano Dall’Anese, and Mingyi Hong. 2021. “Online Proximal-ADMM for Time-Varying Constrained Convex Optimization.” IEEE Transactions on Signal and Information Processing over Networks 7: 144–55. https://doi.org/10.1109/tsipn.2021.3051292.
Zhang, Yu, and Qiang Yang. 2022. “A Survey on Multi-Task Learning.” IEEE Transactions on Knowledge and Data Engineering 34 (12): 5586–5609. https://doi.org/10.1109/tkde.2021.3070203.
Zhou, Jiayu, Jianhui Chen, and Jieping Ye. 2011. “Clustered Multi-Task Learning via Alternating Structure Optimization.” In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a Meeting Held 12-14 December 2011, Granada, Spain, 702–10. https://proceedings.neurips.cc/paper/2011/hash/a516a87cfcaef229b342c437fe2b95f7-Abstract.html.
Zhu, Minghui, and Sonia Martínez. 2010. “Discrete-Time Dynamic Average Consensus.” Automatica 46 (2): 322–29. https://doi.org/10.1016/j.automatica.2009.10.021.
Zinkevich, Martin. 2003. “Online Convex Programming and Generalized Infinitesimal Gradient Ascent.” In Proceedings of the 20th International Conference on Machine Learning (Icml-03), 928–36.