1-Bit SGD is a gradient quantization method that compresses each gradient component to a single bit (its sign), transmitting only +1 or -1 per element. This yields 32× compression compared to 32-bit floating point, while error feedback preserves convergence.
1-Bit SGD Algorithm
- Sign: $\hat{g}_i = \text{sign}(g_i + e_i)$ — quantize to +1 or -1.
- Scale: Multiply by the mean absolute gradient magnitude so the reconstructed gradient matches the original scale.
- Error Feedback: $e_i \leftarrow (g_i + e_i) - \hat{g}_i$ — accumulate the quantization error and add it back on the next step.
- Communication: 1 bit per gradient component + 1 scalar (mean magnitude) per layer.
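The steps above can be sketched in NumPy. This is a minimal illustration, not the original implementation; the function name `one_bit_quantize` and the use of the scaled reconstruction (sign times mean magnitude) when computing the residual are assumptions, though the scaled residual is what makes error feedback consistent with the rescaling step.

```python
import numpy as np

def one_bit_quantize(grad, error):
    """Quantize a gradient to signs with error feedback.

    Returns (signs, scale, new_error), where signs * scale
    is the reconstructed gradient the receiver would apply.
    """
    # Error feedback: add the residual left over from the previous step.
    corrected = grad + error
    # Scale: one scalar per tensor, the mean absolute magnitude.
    scale = np.mean(np.abs(corrected))
    # Sign: 1 bit per component.
    signs = np.where(corrected >= 0, 1.0, -1.0)
    # Residual: whatever the 1-bit representation failed to capture,
    # carried forward into the next iteration.
    new_error = corrected - scale * signs
    return signs, scale, new_error

# Example: only `signs` (1 bit each) and `scale` (one float) are communicated.
g = np.array([0.5, -1.0, 0.25, -0.25])
signs, scale, err = one_bit_quantize(g, np.zeros_like(g))
# Reconstruction plus residual recovers the corrected gradient exactly.
assert np.allclose(scale * signs + err, g)
```

Note that the residual `err` is never transmitted; each worker keeps it locally and folds it into its next gradient, which is why convergence survives such aggressive compression.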
Why It Matters
- 32× Compression: Reduces gradient communication by 32× compared to full precision.
- Error Feedback Essential: Without error feedback, 1-bit SGD diverges. With it, convergence is preserved.
- Microsoft: Originally proposed by Microsoft Research, where it was used to scale distributed speech recognition training.
1-Bit SGD is extreme gradient quantization: compress gradients to their signs for massive communication savings, and rely on error feedback to preserve convergence.