We introduce a novel optimization formulation incorporating pre-trained models and random sketch operators, enabling sparsification-aware training with tighter convergence rates.
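To make "random sketch operator" concrete, here is a minimal numpy sketch of an unbiased Rand-K operator applied to a gradient at a stand-in pre-trained point; the function name, the d/k rescaling, and the toy objective are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def rand_k_sketch(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased Rand-K sketch: keep k random coordinates, rescale by d/k,
    so that E[sketch(x)] = x (the standard property assumed of such operators)."""
    d = x.size
    keep = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[keep] = x[keep] * (d / k)
    return out

# Toy usage: one sparsification-aware gradient step from a "pre-trained" point x0.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(10)          # stands in for pre-trained weights
grad = 2 * x0                         # gradient of the toy objective f(x) = ||x||^2
x1 = x0 - 0.1 * rand_k_sketch(grad, k=3, rng=rng)
```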
We provide a theoretical analysis of Independent Subnetwork Training (IST), identifying fundamental differences from alternative approaches and analyzing its optimization performance.
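As a rough illustration of the IST mechanism being analyzed, the toy sketch below partitions coordinates across workers, lets each worker train only its own block independently, and then reassembles the full parameter vector; the random partitioning, the quadratic objective, and the step counts are assumptions chosen for brevity, not the paper's exact model.

```python
import numpy as np

def ist_round(x, grad_fn, n_workers, local_steps, lr, rng):
    """One IST-style round: split coordinates into disjoint blocks (subnetworks),
    let each worker train only its block independently, then reassemble."""
    d = x.size
    perm = rng.permutation(d)
    blocks = np.array_split(perm, n_workers)    # disjoint subnetworks
    new_x = x.copy()
    for block in blocks:
        local = x.copy()
        for _ in range(local_steps):            # independent local training
            g = grad_fn(local)
            local[block] -= lr * g[block]       # worker updates only its block
        new_x[block] = local[block]             # reassemble the full vector
    return new_x

rng = np.random.default_rng(1)
x = rng.standard_normal(8)
x = ist_round(x, grad_fn=lambda w: 2 * w, n_workers=4, local_steps=5, lr=0.1, rng=rng)
```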
We develop a unified framework for studying distributed optimization methods with compression.
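A minimal sketch of the kind of method such a framework covers: distributed SGD in which each worker compresses its gradient before aggregation, here with a greedy Top-K compressor as one common choice; the compressor and the simple averaging rule are illustrative assumptions.

```python
import numpy as np

def top_k(g: np.ndarray, k: int) -> np.ndarray:
    """Greedy Top-K compressor: keep the k largest-magnitude entries of g."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def compressed_sgd_step(x, worker_grads, k, lr):
    """One step of distributed SGD with compressed communication:
    each worker sends top_k(grad_i); the server averages and updates x."""
    compressed = [top_k(g, k) for g in worker_grads]
    avg = np.mean(compressed, axis=0)
    return x - lr * avg

rng = np.random.default_rng(2)
x = rng.standard_normal(6)
grads = [2 * x + 0.1 * rng.standard_normal(6) for _ in range(4)]  # noisy local gradients
x = compressed_sgd_step(x, grads, k=2, lr=0.05)
```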
We propose ADOM, an accelerated method for smooth and strongly convex decentralized optimization over time-varying networks.
We fix a fundamental issue in the stochastic extragradient method via a new sampling strategy.
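The one-liner does not spell the strategy out, so the sketch below shows one sampling choice studied in this line of work: reusing the same sampled operator for both the extrapolation and the update step ("same-sample" stochastic extragradient), on a toy monotone problem. Treating this as the paper's exact fix is an assumption.

```python
import numpy as np

def seg_same_sample(z0, operators, lr, steps, rng):
    """Stochastic extragradient where the SAME sampled operator F_i is used for
    both the extrapolation step and the update step (same-sample strategy)."""
    z = z0.copy()
    n = len(operators)
    for _ in range(steps):
        i = rng.integers(n)                   # draw one sample per iteration
        z_half = z - lr * operators[i](z)     # extrapolation with F_i
        z = z - lr * operators[i](z_half)     # update with the SAME F_i
    return z

# Toy monotone problem: F_i(z) = A_i z with skew-symmetric-plus-identity A_i.
rng = np.random.default_rng(3)
ops = []
for _ in range(5):
    B = rng.standard_normal((4, 4))
    A = (B - B.T) + 0.1 * np.eye(4)           # strongly monotone linear operator
    ops.append(lambda z, A=A: A @ z)
z = seg_same_sample(np.ones(4), ops, lr=0.05, steps=200, rng=rng)
```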
We propose a general yet simple theorem describing the convergence of SGD under the arbitrary sampling paradigm.
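As a hedged illustration of the arbitrary sampling paradigm, the sketch below runs SGD where each component enters the minibatch independently with its own probability p_i, and sampled gradients are reweighted by 1/(n p_i) so the estimator stays unbiased for the full gradient; the independent-inclusion sampling and the toy quadratics are assumptions made for this example.

```python
import numpy as np

def arbitrary_sampling_sgd(x0, grads, probs, lr, steps, rng):
    """SGD where component i enters the minibatch independently with
    probability probs[i]; each sampled gradient is reweighted by 1/(n * p_i)
    so the estimator is unbiased for (1/n) * sum_i grad_i."""
    x = x0.copy()
    n = len(grads)
    for _ in range(steps):
        mask = rng.random(n) < probs                  # independent sampling
        g = np.zeros_like(x)
        for i in np.flatnonzero(mask):
            g += grads[i](x) / (n * probs[i])         # importance reweighting
        x -= lr * g
    return x

# Toy usage: f_i(x) = ||x - a_i||^2 / 2 with non-uniform inclusion probabilities.
rng = np.random.default_rng(4)
targets = [rng.standard_normal(3) for _ in range(5)]
grads = [lambda x, a=a: x - a for a in targets]
probs = np.array([0.9, 0.5, 0.5, 0.3, 0.3])
x = arbitrary_sampling_sgd(np.zeros(3), grads, probs, lr=0.2, steps=300, rng=rng)
```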