Local SGD is a distributed training algorithm that performs multiple gradient updates locally before synchronizing. By letting each worker train independently for H steps before parameters are averaged, it dramatically reduces communication overhead in distributed and federated learning, making training practical over slow networks.
What Is Local SGD?
- Definition: Distributed optimization with periodic synchronization.
- Algorithm: Each worker performs H local SGD steps, then synchronizes.
- Goal: Reduce communication rounds by H× while maintaining convergence.
- Also Known As: FedAvg (Federated Averaging) in federated learning context.
Why Local SGD Matters
- Communication Efficiency: H× reduction in communication rounds.
- Slow Network Tolerance: Works with commodity networks, not just high-speed interconnects.
- Straggler Handling: Slow workers don't block others during local phase.
- Federated Learning Enabler: Makes training on mobile devices practical.
- Cost Reduction: Less communication = lower cloud egress costs.
Algorithm
Initialization:
- All workers start with same model parameters θ_0.
- Agree on local steps H and learning rate schedule.
Training Loop:
````
For round t = 1, 2, 3, ...:
    // Local training phase
    Each worker k independently:
        For h = 1 to H:
            Sample mini-batch from local data
            Compute gradient g_k
            Update: θ_k ← θ_k - η · g_k
    // Synchronization phase
    Aggregate: θ_global ← (1/K) Σ_k θ_k
    Broadcast θ_global to all workers
````
Key Parameters:
- H (local steps): Number of SGD steps between synchronizations.
- K (workers): Number of parallel workers.
- η (learning rate): Step size for local updates.
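A minimal single-process simulation of this loop, assuming a toy least-squares objective and NumPy; all names, shapes, and hyperparameter values here are illustrative, not from the source:

````python
import numpy as np

# Illustrative Local SGD simulation: K workers fit a linear model
# y = X @ theta, each on its own local data shard.
rng = np.random.default_rng(0)
K, H, T, eta, d = 4, 10, 50, 0.05, 5      # workers, local steps, rounds, lr, dims

theta_true = rng.normal(size=d)
shards = []                               # each worker's private data
for _ in range(K):
    X = rng.normal(size=(100, d))
    y = X @ theta_true + 0.1 * rng.normal(size=100)
    shards.append((X, y))

theta_global = np.zeros(d)                # all workers start from theta_0
for t in range(T):
    local_models = []
    for X, y in shards:                   # local training phase (parallel in reality)
        theta_k = theta_global.copy()
        for h in range(H):
            idx = rng.integers(0, len(X), size=16)          # sample a mini-batch
            g = X[idx].T @ (X[idx] @ theta_k - y[idx]) / len(idx)
            theta_k -= eta * g            # local SGD step
        local_models.append(theta_k)
    theta_global = np.mean(local_models, axis=0)            # synchronization phase
print("parameter error:", np.linalg.norm(theta_global - theta_true))
````

In a real deployment each worker runs on its own node and the averaging step is a collective operation rather than an in-memory mean (see the PyTorch sketch under Tools & Implementations).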
Convergence Analysis
Convergence Guarantee:
- Converges to same solution as standard SGD (under assumptions).
- Convergence rate: O(1/√(KHT)) on the optimality gap for convex objectives, and the same order on the squared gradient norm for non-convex objectives (T rounds, so HT total local steps per worker).
- Requires learning rate adjustment for large H.
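In the notation above (K workers, H local steps, T rounds), the convex-case guarantee is commonly written as follows; exact constants and assumptions vary across papers:

````latex
% Expected suboptimality of the averaged iterate \bar{\theta}_T after
% T rounds of Local SGD, convex smooth case, K workers, H local steps:
\mathbb{E}\left[ f(\bar{\theta}_T) - f(\theta^\star) \right]
  = \mathcal{O}\!\left( \frac{1}{\sqrt{K H T}} \right)
````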
Key Insights:
- Worker Divergence: Local models diverge during local phase.
- Synchronization Corrects: Averaging brings models back together.
- Trade-Off: Larger H → more divergence but less communication.
Optimal H Selection:
- Too small: Excessive communication overhead.
- Too large: Worker divergence hurts convergence.
- Typical: H = 10-100 for datacenter, H = 100-1000 for federated.
Comparison with Other Methods
vs. Synchronous SGD:
- Local SGD: H local steps, then sync (H=1 is sync SGD).
- Sync SGD: Every step synchronized.
- Trade-Off: Local SGD reduces communication, slightly slower convergence.
vs. Asynchronous SGD:
- Local SGD: Periodic synchronization, bounded staleness.
- Async SGD: Continuous asynchronous updates, unbounded staleness.
- Trade-Off: Local SGD is more stable and easier to reason about; async SGD never blocks on stragglers but suffers from stale gradients.
vs. Gradient Compression:
- Local SGD: Reduce communication frequency.
- Compression: Reduce communication size per round.
- Combination: Can use both together for maximum efficiency.
Variants & Extensions
Adaptive H Selection:
- Dynamically adjust H based on worker divergence.
- Increase H when models are similar, decrease when diverging.
- Improves convergence while maintaining communication efficiency.
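A sketch of one such heuristic, measuring divergence as the mean distance of worker models from the global average (matching the NumPy variables in the earlier sketch); the thresholds and doubling/halving rule are illustrative assumptions, not a published recipe:

````python
import numpy as np

def adapt_H(H, local_models, theta_global, low=0.01, high=0.1,
            H_min=1, H_max=256):
    """Illustrative heuristic: grow H when workers agree, shrink when they diverge."""
    # Mean parameter distance from the global model, relative to its norm.
    div = np.mean([np.linalg.norm(m - theta_global) for m in local_models])
    div /= np.linalg.norm(theta_global) + 1e-12
    if div < low:                 # models similar: sync less often
        H = min(2 * H, H_max)
    elif div > high:              # models diverging: sync more often
        H = max(H // 2, H_min)
    return H                      # broadcast alongside theta_global each round
````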
Periodic Averaging Schedules:
- Exponentially increasing H: H = 1, 2, 4, 8, ...
- Allows frequent sync early, less frequent later.
- Balances exploration and communication.
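Such a schedule is only a few lines; the cap `H_max` below is an assumed hyperparameter:

````python
def exponential_H_schedule(rounds, H_max=128):
    """Yield H = 1, 2, 4, 8, ... capped at H_max, one value per round."""
    H = 1
    for _ in range(rounds):
        yield H
        H = min(2 * H, H_max)
````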
Momentum-Based Local SGD:
- Add momentum to local updates.
- Helps smooth noisy updates and escape shallow local minima during the local phase.
- Improves convergence quality.
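A sketch of H local steps with heavy-ball momentum; resetting the momentum buffer at each synchronization is one design choice among several (some variants average or retain the buffers), and β is an assumed value:

````python
import numpy as np

def local_steps_with_momentum(theta_global, grad_fn, H, eta, beta=0.9):
    """Run H local SGD steps with heavy-ball momentum on one worker."""
    theta = theta_global.copy()
    v = np.zeros_like(theta)            # momentum buffer, reset each round here
    for _ in range(H):
        v = beta * v + grad_fn(theta)   # accumulate velocity
        theta -= eta * v                # heavy-ball update
    return theta                        # sent to the averaging step as before
````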
Applications
Datacenter Distributed Training:
- Train large models across GPU clusters.
- Reduce network bottleneck in multi-node training.
- Typical: H = 10-50 for fast interconnects.
Federated Learning:
- Train on mobile devices with slow, intermittent connections.
- FedAvg is essentially Local SGD adapted to the federated setting.
- Typical: H = 100-1000 for mobile devices.
Edge Computing:
- Train on edge devices with limited connectivity.
- Periodic synchronization with cloud server.
- Balances local computation and communication.
Practical Considerations
Learning Rate Tuning:
- Larger H may require learning rate adjustment.
- Rule of thumb: keep the learning rate constant or scale it by √H, then validate empirically.
- Warmup helps stabilize early training.
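A sketch combining the √H heuristic with linear warmup; both the scaling choice and the warmup length are assumptions to tune for your workload:

````python
import math

def local_lr(base_lr, H, step, warmup_steps=500):
    """Illustrative schedule: scale base LR by sqrt(H), with linear warmup."""
    lr = base_lr * math.sqrt(H)             # one common heuristic; tune empirically
    if step < warmup_steps:
        lr *= (step + 1) / warmup_steps     # linear warmup stabilizes early rounds
    return lr
````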
Batch Size:
- Local batch size affects convergence.
- Larger local batches can compensate for larger H.
- Trade-off: Memory vs. convergence speed.
Non-IID Data:
- Worker data distributions may differ (federated learning).
- Non-IID data increases worker divergence.
- May need smaller H or additional regularization.
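One mitigation from the federated learning literature is FedProx, which adds a proximal term (μ/2)·‖θ − θ_global‖² to each worker's local objective so local models cannot drift too far from the global model. A minimal sketch, with μ and the gradient interface as assumptions:

````python
import numpy as np

def proximal_local_steps(theta_global, grad_fn, H, eta, mu=0.01):
    """FedProx-style local updates: loss gradient plus a pull toward theta_global."""
    theta = theta_global.copy()
    for _ in range(H):
        # Gradient of the proximal term (mu/2)*||theta - theta_global||^2
        g = grad_fn(theta) + mu * (theta - theta_global)
        theta -= eta * g
    return theta
````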
Tools & Implementations
- PyTorch Distributed: straightforward to implement with torch.distributed collectives (note that DDP itself synchronizes gradients every step).
- TensorFlow Federated: Built-in FedAvg (Local SGD).
- Horovod: Supports periodic averaging for Local SGD.
- Custom: Simple to implement with any distributed framework.
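As a concrete example of the custom route, periodic parameter averaging can be written directly on torch.distributed collectives. This sketch assumes a process group is already initialized and that every worker runs the same loop:

````python
import torch
import torch.distributed as dist

def average_parameters(model: torch.nn.Module) -> None:
    """Average model parameters across all workers (the synchronization phase)."""
    world_size = dist.get_world_size()
    with torch.no_grad():
        for p in model.parameters():
            dist.all_reduce(p, op=dist.ReduceOp.SUM)  # sum parameters across workers
            p /= world_size                           # divide to get the mean

# Inside the training loop, on every worker (model, loader, loss_fn assumed):
# for step, (x, y) in enumerate(loader):
#     loss = loss_fn(model(x), y)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
#     if (step + 1) % H == 0:          # every H local steps
#         average_parameters(model)
````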
Best Practices
- Start with H=1: Verify convergence, then increase H.
- Monitor Divergence: Track worker model differences.
- Tune Learning Rate: Adjust for your specific H value.
- Use Warmup: Stabilize early training with frequent sync.
- Combine with Compression: Maximize communication efficiency.
Local SGD is a foundation of practical distributed training. By allowing workers to train independently between synchronizations, it makes distributed learning feasible over slow networks and enables federated learning on mobile devices, transforming how large-scale machine learning models are trained.