Local SGD is a distributed training algorithm that performs multiple gradient updates locally before synchronizing. By letting each worker train independently for H steps before parameters are averaged, it dramatically reduces communication overhead in distributed and federated learning, making training practical over slow networks.
What Is Local SGD?
- Definition: Distributed optimization with periodic synchronization.
- Algorithm: Each worker performs H local SGD steps, then synchronizes.
- Goal: Reduce communication rounds by H× while maintaining convergence.
- Also Known As: FedAvg (Federated Averaging) in federated learning context.
Why Local SGD Matters
- Communication Efficiency: H× reduction in communication rounds.
- Slow Network Tolerance: Works with commodity networks, not just high-speed interconnects.
- Straggler Handling: Slow workers don't block others during local phase.
- Federated Learning Enabler: Makes training on mobile devices practical.
- Cost Reduction: Less communication = lower cloud egress costs.
Algorithm
Initialization:
- All workers start with same model parameters θ_0.
- Agree on local steps H and learning rate schedule.
Training Loop:
````
For round t = 1, 2, 3, ...:
    // Local training phase
    Each worker k independently:
        For h = 1 to H:
            Sample mini-batch from local data
            Compute gradient g_k
            Update: θ_k ← θ_k - η · g_k
    // Synchronization phase
    Aggregate: θ_global ← (1/K) Σ_k θ_k
    Broadcast θ_global to all workers
````
Key Parameters:
- H (local steps): Number of SGD steps between synchronizations.
- K (workers): Number of parallel workers.
- η (learning rate): Step size for local updates.
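A minimal single-process simulation of this loop, assuming a toy least-squares objective and NumPy; all names, shapes, and hyperparameter values here are illustrative, not from the source:

````python
import numpy as np

# Illustrative Local SGD simulation: K workers fit a linear model
# y = X @ theta, each on its own local data shard.
rng = np.random.default_rng(0)
K, H, T, eta, d = 4, 10, 50, 0.05, 5      # workers, local steps, rounds, lr, dims

theta_true = rng.normal(size=d)
shards = []                               # each worker's private data
for _ in range(K):
    X = rng.normal(size=(100, d))
    y = X @ theta_true + 0.1 * rng.normal(size=100)
    shards.append((X, y))

theta_global = np.zeros(d)                # all workers start from theta_0
for t in range(T):
    local_models = []
    for X, y in shards:                   # local training phase (parallel in reality)
        theta_k = theta_global.copy()
        for h in range(H):
            idx = rng.integers(0, len(X), size=16)          # sample a mini-batch
            g = X[idx].T @ (X[idx] @ theta_k - y[idx]) / len(idx)
            theta_k -= eta * g            # local SGD step
        local_models.append(theta_k)
    theta_global = np.mean(local_models, axis=0)            # synchronization phase
print("parameter error:", np.linalg.norm(theta_global - theta_true))
````

In a real deployment each worker runs on its own node and the averaging step is a collective operation rather than an in-memory mean (see the PyTorch sketch under Tools & Implementations).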
Convergence Analysis
Convergence Guarantee:
- Converges to same solution as standard SGD (under assumptions).
- Convergence rate: O(1/√(KHT)) on the optimality gap for convex objectives, and the same order on the squared gradient norm for non-convex objectives (T rounds, so HT total local steps per worker).
- Requires learning rate adjustment for large H.
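In the notation above (K workers, H local steps, T rounds), the convex-case guarantee is commonly written as follows; exact constants and assumptions vary across papers:

````latex
% Expected suboptimality of the averaged iterate \bar{\theta}_T after
% T rounds of Local SGD, convex smooth case, K workers, H local steps:
\mathbb{E}\left[ f(\bar{\theta}_T) - f(\theta^\star) \right]
  = \mathcal{O}\!\left( \frac{1}{\sqrt{K H T}} \right)
````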
Key Insights:
- Worker Divergence: Local models diverge during local phase.
- Synchronization Corrects: Averaging brings models back together.
- Trade-Off: Larger H → more divergence but less communication.
Optimal H Selection:
- Too small: Excessive communication overhead.
- Too large: Worker divergence hurts convergence.
- Typical: H = 10-100 for datacenter, H = 100-1000 for federated.
Comparison with Other Methods
vs. Synchronous SGD:
- Local SGD: H local steps, then sync (H=1 is sync SGD).
- Sync SGD: Every step synchronized.
- Trade-Off: Local SGD reduces communication, slightly slower convergence.
vs. Asynchronous SGD:
- Local SGD: Periodic synchronization, bounded staleness.
- Async SGD: Continuous asynchronous updates, unbounded staleness.
- Trade-Off: Local SGD is more stable and easier to reason about; async SGD never blocks on stragglers but suffers from stale gradients.
vs. Gradient Compression:
- Local SGD: Reduce communication frequency.
- Compression: Reduce communication size per round.
- Combination: Can use both together for maximum efficiency.
Variants & Extensions
Adaptive H Selection:
- Dynamically adjust H based on worker divergence.
- Increase H when models are similar, decrease when diverging.
- Improves convergence while maintaining communication efficiency.
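A sketch of one such heuristic, measuring divergence as the mean distance of worker models from the global average (matching the NumPy variables in the earlier sketch); the thresholds and doubling/halving rule are illustrative assumptions, not a published recipe:

````python
import numpy as np

def adapt_H(H, local_models, theta_global, low=0.01, high=0.1,
            H_min=1, H_max=256):
    """Illustrative heuristic: grow H when workers agree, shrink when they diverge."""
    # Mean parameter distance from the global model, relative to its norm.
    div = np.mean([np.linalg.norm(m - theta_global) for m in local_models])
    div /= np.linalg.norm(theta_global) + 1e-12
    if div < low:                 # models similar: sync less often
        H = min(2 * H, H_max)
    elif div > high:              # models diverging: sync more often
        H = max(H // 2, H_min)
    return H                      # broadcast alongside theta_global each round
````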
Periodic Averaging Schedules:
- Exponentially increasing H: H = 1, 2, 4, 8, ...
- Allows frequent sync early, less frequent later.
- Balances exploration and communication.
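Such a schedule is only a few lines; the cap `H_max` below is an assumed hyperparameter:

````python
def exponential_H_schedule(rounds, H_max=128):
    """Yield H = 1, 2, 4, 8, ... capped at H_max, one value per round."""
    H = 1
    for _ in range(rounds):
        yield H
        H = min(2 * H, H_max)
````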
Momentum-Based Local SGD:
- Add momentum to local updates.
- Helps smooth noisy updates and escape shallow local minima during the local phase.
- Improves convergence quality.
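A sketch of H local steps with heavy-ball momentum; resetting the momentum buffer at each synchronization is one design choice among several (some variants average or retain the buffers), and β is an assumed value:

````python
import numpy as np

def local_steps_with_momentum(theta_global, grad_fn, H, eta, beta=0.9):
    """Run H local SGD steps with heavy-ball momentum on one worker."""
    theta = theta_global.copy()
    v = np.zeros_like(theta)            # momentum buffer, reset each round here
    for _ in range(H):
        v = beta * v + grad_fn(theta)   # accumulate velocity
        theta -= eta * v                # heavy-ball update
    return theta                        # sent to the averaging step as before
````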
Applications
Datacenter Distributed Training:
- Train large models across GPU clusters.
- Reduce network bottleneck in multi-node training.
- Typical: H = 10-50 for fast interconnects.
Federated Learning:
- Train on mobile devices with slow, intermittent connections.
- FedAvg is essentially Local SGD adapted to the federated setting.
- Typical: H = 100-1000 for mobile devices.
Edge Computing:
- Train on edge devices with limited connectivity.
- Periodic synchronization with cloud server.
- Balances local computation and communication.
Practical Considerations
Learning Rate Tuning:
- Larger H may require learning rate adjustment.
- Rule of thumb: keep the learning rate constant or scale it by √H, then validate empirically.
- Warmup helps stabilize early training.
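A sketch combining the √H heuristic with linear warmup; both the scaling choice and the warmup length are assumptions to tune for your workload:

````python
import math

def local_lr(base_lr, H, step, warmup_steps=500):
    """Illustrative schedule: scale base LR by sqrt(H), with linear warmup."""
    lr = base_lr * math.sqrt(H)             # one common heuristic; tune empirically
    if step < warmup_steps:
        lr *= (step + 1) / warmup_steps     # linear warmup stabilizes early rounds
    return lr
````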
Batch Size:
- Local batch size affects convergence.
- Larger local batches can compensate for larger H.
- Trade-off: Memory vs. convergence speed.
Non-IID Data:
- Worker data distributions may differ (federated learning).
- Non-IID data increases worker divergence.
- May need smaller H or additional regularization.
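One mitigation from the federated learning literature is FedProx, which adds a proximal term (μ/2)·‖θ − θ_global‖² to each worker's local objective so local models cannot drift too far from the global model. A minimal sketch, with μ and the gradient interface as assumptions:

````python
import numpy as np

def proximal_local_steps(theta_global, grad_fn, H, eta, mu=0.01):
    """FedProx-style local updates: loss gradient plus a pull toward theta_global."""
    theta = theta_global.copy()
    for _ in range(H):
        # Gradient of the proximal term (mu/2)*||theta - theta_global||^2
        g = grad_fn(theta) + mu * (theta - theta_global)
        theta -= eta * g
    return theta
````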
Tools & Implementations
- PyTorch Distributed: straightforward to implement with torch.distributed collectives (note that DDP itself synchronizes gradients every step).
- TensorFlow Federated: Built-in FedAvg (Local SGD).
- Horovod: Supports periodic averaging for Local SGD.
- Custom: Simple to implement with any distributed framework.
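As a concrete example of the custom route, periodic parameter averaging can be written directly on torch.distributed collectives. This sketch assumes a process group is already initialized and that every worker runs the same loop:

````python
import torch
import torch.distributed as dist

def average_parameters(model: torch.nn.Module) -> None:
    """Average model parameters across all workers (the synchronization phase)."""
    world_size = dist.get_world_size()
    with torch.no_grad():
        for p in model.parameters():
            dist.all_reduce(p, op=dist.ReduceOp.SUM)  # sum parameters across workers
            p /= world_size                           # divide to get the mean

# Inside the training loop, on every worker (model, loader, loss_fn assumed):
# for step, (x, y) in enumerate(loader):
#     loss = loss_fn(model(x), y)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
#     if (step + 1) % H == 0:          # every H local steps
#         average_parameters(model)
````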
Best Practices
- Start with H=1: Verify convergence, then increase H.
- Monitor Divergence: Track worker model differences.
- Tune Learning Rate: Adjust for your specific H value.
- Use Warmup: Stabilize early training with frequent sync.
- Combine with Compression: Maximize communication efficiency.
Local SGD is a foundation of practical distributed training. By allowing workers to train independently between synchronizations, it makes distributed learning feasible over slow networks and enables federated learning on mobile devices, transforming how large-scale machine learning models are trained.