Cryptographic Accelerator Design: Dedicated Hardware for AES/RSA/ECC/SHA — specialized MAC engines and multipliers for symmetric/asymmetric encryption enabling Gbps throughput and TLS protocol acceleration

AES Hardware Engine
- Cipher Block Size: 128-bit block, operates on 4×4 byte state matrix, 10/12/14 rounds (AES-128/192/256)
- Round Operations: SubBytes (S-box byte substitution), ShiftRows (cyclic row shifts), MixColumns (GF(2^8) column mixing, skipped in the final round), AddRoundKey (XOR with round key)
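- Pipelined Implementation: an iterative core computes 1 round per cycle (10-14 cycles per block); a fully unrolled pipeline accepts one 128-bit block per cycle, giving roughly 10-100 Gbps at clock rates from about 100 MHz to 1 GHz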
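- Modes of Operation: CBC encryption is sequential (each block depends on the previous ciphertext block); ECB, CTR, GCM encryption, and CBC decryption can process blocks in parallel; hardware supports multiple modes via mode-specific control logic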
- GCM Mode: authenticated encryption (AES-CTR + GHASH), GHASH operates in GF(2^128) (polynomial multiplication), critical for TLS 1.3
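
The MixColumns step can be sketched in a few lines to show what "GF(2^8) mixing" means in practice. This is a minimal software model, not a hardware description; the function names xtime and mix_column are illustrative choices, and the reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) is the one AES specifies.

```python
def xtime(b: int) -> int:
    """Multiply a byte by 2 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    if b & 0x100:          # overflow out of 8 bits: subtract (XOR) the modulus
        b ^= 0x11B
    return b & 0xFF


def mix_column(col):
    """Apply the AES MixColumns matrix (2 3 1 1 / 1 2 3 1 / ...) to one column."""
    a0, a1, a2, a3 = col
    # Multiplying by 3 is xtime(a) XOR a, since 3 = 2 + 1 in GF(2^8).
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]


if __name__ == "__main__":
    print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```

In hardware each xtime is a shift plus a conditional XOR, so a whole column reduces to a small XOR network, which is why a full round fits in one clock cycle.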

AES-GCM Throughput
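- GCM Bottleneck: GHASH is a serial chain (one GF(2^128) polynomial multiply per block, each depending on the previous result), which limits throughput relative to CTR parallelism
- Fast GHASH: Karatsuba multiplication (3 half-width multiplies instead of 4), precomputed powers of H so several blocks can be multiplied in parallel, precomputed lookup tables, 1-2 cycles per block achievable
- 1.4 Tbps Target: 1400 Gbps of AES-256-GCM corresponds to 175 GB/s, not 1.4 TB/s; a single engine at 1 byte/cycle delivers only 8 Gbps at 1 GHz, so rates of this order require one 128-bit block per cycle per engine (128 Gbps at 1 GHz) and about 11 such engines in parallel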
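
The GHASH chain and its parallel form can be illustrated with a bit-serial software model. This is a sketch under stated assumptions: blocks are taken as 128-bit big-endian integers (so GCM's "bit 0" is the integer's most significant bit), the names gf128_mul and ghash are illustrative, and padding, AAD handling, and the final length block of real GCM are omitted.

```python
R = 0xE1 << 120  # reduction constant for x^128 + x^7 + x^2 + x + 1 in GCM bit order


def gf128_mul(x: int, y: int) -> int:
    """Bit-serial multiply in GF(2^128) using GCM's reflected bit ordering."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:   # scan x from GCM bit 0 (integer MSB)
            z ^= v
        # Multiply v by the field variable: shift right, reduce if a bit falls off.
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def ghash(h: int, blocks) -> int:
    """Serial Horner chain: Y_i = (Y_{i-1} XOR X_i) * H."""
    y = 0
    for b in blocks:
        y = gf128_mul(y ^ b, h)
    return y


if __name__ == "__main__":
    h = 0x66E94BD4EF8A2C3B884CFA59CA342B2E
    a, b = 0x0123456789ABCDEF0123456789ABCDEF, 0xFEDCBA9876543210FEDCBA9876543210
    serial = ghash(h, [a, b])
    # Parallel form: a*H^2 XOR b*H, using a precomputed power of H.
    parallel = gf128_mul(a, gf128_mul(h, h)) ^ gf128_mul(b, h)
    print(serial == parallel)
```

The parallel form is what lets hardware break the serial dependency: with H, H^2, ..., H^k precomputed, k multipliers can work on k blocks at once and their results are XORed together.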

RSA/ECC Public-Key Accelerator
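- RSA Encryption: C = M^e mod N (public-exponent operation), requires modular exponentiation over a large modulus (2048 bits or more) but with a short exponent (typically e=65537), so it is comparatively cheap
- RSA Decryption: M = C^d mod N (private exponent d is about as long as the modulus, e.g. 2048 bits for RSA-2048), computationally intensive; CRT splits it into two half-width exponentiations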
- Montgomery Multiplier: core building block, computes A×B mod N efficiently (no division), pipelined for speed
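- Modular Exponentiation: binary exponentiation (square-and-multiply algorithm); a 2048-bit exponent needs 2048 squarings plus about 1024 multiplies on average, roughly 3000 modmuls (@ 50-200 ns/modmul = 150-600 µs per private-key operation before CRT or windowing optimizations)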
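
The Montgomery multiplier and the square-and-multiply loop built on it can be modelled as follows. This is a minimal sketch: Python big integers stand in for the word-serial datapath of a real multiplier, the names mont_setup, redc, and mont_modexp are illustrative, and the modulus is assumed odd (as any RSA modulus is). The modular inverse via pow(n, -1, r) needs Python 3.8 or later.

```python
def mont_setup(n: int):
    """Pick R = 2^k > n and compute n' with n * n' = -1 (mod R)."""
    k = n.bit_length()
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r
    return k, r, n_prime


def redc(t: int, n: int, k: int, n_prime: int) -> int:
    """Montgomery reduction: returns t * R^-1 mod n using only shifts and masks."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> k            # exact division by R, no trial division by n
    return u - n if u >= n else u


def mont_modexp(base: int, exp: int, n: int):
    """Left-to-right square-and-multiply; returns (result, number of modmuls)."""
    k, r, n_prime = mont_setup(n)
    x = redc((base % n) * (r * r % n), n, k, n_prime)   # base in Montgomery form
    acc = r % n                                          # 1 in Montgomery form
    modmuls = 0
    for bit in bin(exp)[2:]:
        acc = redc(acc * acc, n, k, n_prime)             # square every bit
        modmuls += 1
        if bit == "1":
            acc = redc(acc * x, n, k, n_prime)           # multiply on set bits
            modmuls += 1
    return redc(acc, n, k, n_prime), modmuls             # leave Montgomery form


if __name__ == "__main__":
    result, count = mont_modexp(65, 17, 3233)
    print(result == pow(65, 17, 3233), count)
```

The returned count makes the cost model visible: one squaring per exponent bit plus one multiply per set bit, which is why the short public exponent 65537 is so much cheaper than a full-length private exponent.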

ECC Hardware Acceleration
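- ECDSA Signature: dominated by point multiplication (k×P), which with double-and-add on P-256 takes about 256 point doublings plus ~128 point additions on average, 100-1000 µs per signature in hardware (vs ~10 ms in software on embedded-class CPUs)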
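- Curve Types: NIST curves (P-256, P-384, P-521), Curve25519/Curve448 (used in TLS 1.3 as X25519/X448); the set of supported curves varies by accelerator
- Point Operations: point addition (A+B), point doubling (2A); in affine coordinates both require a modular inversion (100-1000 cycles via extended Euclidean algorithm), so hardware typically works in projective or Jacobian coordinates and performs a single inversion at the end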
- Accelerator Design: dedicated adder/multiplier for field arithmetic, pipelined point doubling
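
Point multiplication by double-and-add can be sketched on a toy curve. The curve y^2 = x^3 + 2x + 2 over GF(17) with base point (5, 1) is a textbook-sized assumption chosen so the values can be checked by hand; it is not a curve any accelerator would use. The sketch uses affine coordinates, so every addition and doubling pays for one modular inversion, which is the cost that projective coordinates avoid.

```python
P, A = 17, 2          # toy curve y^2 = x^3 + 2x + 2 over GF(17) (illustrative only)
G = (5, 1)            # base point of order 19 on this curve


def inv(x: int) -> int:
    """Modular inverse in GF(P); this is the expensive step in affine coordinates."""
    return pow(x % P, -1, P)


def point_add(p1, p2):
    """Affine addition/doubling; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None                              # P + (-P) = infinity
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * inv(2 * y1) % P   # tangent slope (doubling)
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P          # chord slope (addition)
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)


def scalar_mult(k: int, pt):
    """Left-to-right double-and-add: one doubling per bit, one add per set bit."""
    result = None
    for bit in bin(k)[2:]:
        result = point_add(result, result)
        if bit == "1":
            result = point_add(result, pt)
    return result


if __name__ == "__main__":
    print(scalar_mult(2, G), scalar_mult(3, G), scalar_mult(19, G))
```

Note that this loop branches on the secret scalar bits, so it is not constant-time; production designs typically use a Montgomery ladder or fixed-window method to avoid leaking k through timing or power.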

SHA Hash Engine
- SHA-256: 256-bit digest, 512-bit message block, 64 rounds per block, sequential round processing
- SHA-3: Keccak permutation (1600-bit state), 24 rounds (vs SHA-256 64 rounds), higher throughput potential (each round is wide bitwise logic that parallelizes across the state; the rounds themselves remain sequential)
- Pipelined SHA: blocks of one message are chained (block 2 needs the output of block 1), so a pipeline interleaves blocks from independent messages; 10+ GB/s aggregate throughput across streams
- HMAC: hash-based MAC, SHA((key XOR opad) || SHA((key XOR ipad) || msg)) where || is concatenation; the two hash operations are sequential (limited pipeline benefit)
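
The HMAC construction can be written out directly on top of a hash primitive, which is how a hardware engine layers it over its SHA core. This is a minimal sketch using Python's standard hashlib; the function name hmac_sha256 is illustrative, and the standard hmac module is imported only to cross-check the result.

```python
import hashlib
import hmac


def hmac_sha256(key: bytes, msg: bytes) -> bytes:
    """HMAC-SHA-256 built from two sequential SHA-256 invocations."""
    block = 64                                   # SHA-256 block size in bytes
    if len(key) > block:
        key = hashlib.sha256(key).digest()       # long keys are hashed first
    key = key.ljust(block, b"\x00")              # then zero-padded to one block
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(ipad + msg).digest()  # first pass over the message
    return hashlib.sha256(opad + inner).digest() # second pass over a short input


if __name__ == "__main__":
    tag = hmac_sha256(b"key", b"message")
    print(tag == hmac.new(b"key", b"message", hashlib.sha256).digest())
```

The outer hash cannot start until the inner digest is complete, but its input is short (one padded key block plus a 32-byte digest), so the sequential penalty is a fixed overhead per message rather than a cost per byte.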

TRNG (True Random Number Generator)
- Entropy Source: thermal noise (resistor Johnson noise), oscillator jitter, metastability
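- Von Neumann Corrector: post-processor that removes bias from independent but biased bits (pair 01 -> 0, pair 10 -> 1, pairs 00 and 11 discarded); it does not remove correlation between bits, and it lowers the output rate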
- NIST DRBG: deterministic random bit generator (seeded with entropy), provides cryptographic RNG (HMAC-DRBG, CTR-DRBG)
- Throughput: 1 Mbps typical for dedicated TRNG, sufficient for key generation + seed replenishment
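
The Von Neumann corrector is simple enough to show in full. This is a minimal sketch operating on a list of 0/1 integers; the function name von_neumann and the convention of emitting the first bit of each unequal pair are illustrative choices (emitting the second bit works equally well).

```python
def von_neumann(bits):
    """Debias a bit stream: keep the first bit of each unequal pair, drop equal pairs."""
    out = []
    for i in range(0, len(bits) - 1, 2):    # non-overlapping pairs; odd tail is ignored
        a, b = bits[i], bits[i + 1]
        if a != b:
            out.append(a)                   # 01 -> 0, 10 -> 1
    return out


if __name__ == "__main__":
    raw = [0, 1, 1, 0, 1, 1, 0, 0, 0, 1]
    print(von_neumann(raw))
```

For independent input bits with probability p of being 1, the pairs 01 and 10 are equally likely (each p(1-p)), which is why the output is unbiased; the price is that at most a quarter of the input bits survive on average.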

Post-Quantum Cryptography (PQC) Hardware
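- CRYSTALS-Kyber: lattice-based KEM (key encapsulation), standardized as ML-KEM in FIPS 203, polynomial multiplication over Z_q (q=3329), public keys of 800-1568 B depending on parameter set, 256-bit shared secret, ~0.5 ms software (CPU)
- CRYSTALS-Dilithium: lattice-based signature, standardized as ML-DSA in FIPS 204, polynomial-ring operations; it uses uniform sampling with rejection rather than Gaussian sampling (Gaussian sampling is a Falcon requirement), and the variable-iteration signing loop is challenging to accelerate
- Hardware Acceleration: dedicated modular multiplier (mod q), NTT-based polynomial multiplier, achieves 10-100 µs KEM key generation
- Constraints: larger keys and ciphertexts (about 2.3 kB for a Kyber-768 public key plus ciphertext, vs 96 B for an ECDSA P-384 key or signature), integrate gradually into TLS stacks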
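
The core operation a Kyber accelerator speeds up is polynomial multiplication in Z_q[x]/(x^n + 1). The sketch below is the schoolbook O(n^2) reference that an NTT-based hardware multiplier replaces, shown for small n so it can be checked by hand; the function name poly_mul_negacyclic is illustrative, and real Kyber uses n = 256 with an NTT rather than this loop.

```python
Q = 3329  # Kyber modulus


def poly_mul_negacyclic(a, b, q=Q):
    """Multiply two polynomials modulo (x^n + 1) and q; coefficients are low-degree first."""
    n = len(a)
    if len(b) != n:
        raise ValueError("polynomials must have the same length")
    c = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < n:
                c[k] = (c[k] + ai * bj) % q
            else:
                # x^n = -1 in this ring, so wrapped terms are subtracted.
                c[k - n] = (c[k - n] - ai * bj) % q
    return c


if __name__ == "__main__":
    x = [0, 1, 0, 0]        # x
    x3 = [0, 0, 0, 1]       # x^3
    print(poly_mul_negacyclic(x, x3))   # x^4 = -1 mod (x^4 + 1)
```

Each inner step is one multiply-accumulate mod q, which is why a dedicated mod-3329 multiplier is the basic building block; the NTT reduces the number of such steps from n^2 to roughly n log n.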

Protocol Offload (TLS/IPsec)
- TLS Offload: accelerator executes record-layer encryption (AES-GCM), reduces CPU load (offload ~80% CPU for HTTPS)
- IPsec Offload: encrypt/authenticate IP packets inline (AES-GCM, or AES-CBC with HMAC-SHA-256), enables 1-10 Gbps throughput on standard CPU
- Handshake: RSA/ECDSA/ECDH operations dominate handshake compute (the 100-1000 ms handshake total also includes network round trips), accelerator speeds server handshake
- Session Key Derivation: HKDF or PRF (pseudo-random function), lower priority (not data-path bottleneck)
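
Session key derivation with HKDF is a short chain of HMAC calls, which helps explain why it is rarely a data-path bottleneck. The sketch below follows the extract-then-expand structure of RFC 5869 with SHA-256; the function names hkdf_extract and hkdf_expand are illustrative, and TLS 1.3's label encoding (HKDF-Expand-Label) is not modelled here.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """Concentrate input keying material into a fixed-length pseudorandom key."""
    if not salt:
        salt = b"\x00" * HASH_LEN
    return hmac.new(salt, ikm, hashlib.sha256).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """Stretch the pseudorandom key into `length` bytes of output keying material."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested output too long")
    okm, t, counter = b"", b"", 1
    while len(okm) < length:
        # T(i) = HMAC(PRK, T(i-1) || info || i)
        t = hmac.new(prk, t + info + bytes([counter]), hashlib.sha256).digest()
        okm += t
        counter += 1
    return okm[:length]


if __name__ == "__main__":
    prk = hkdf_extract(b"salt", b"shared secret from key exchange")
    print(hkdf_expand(prk, b"record key", 16).hex())
```

Deriving a 16- or 32-byte record key takes a single HMAC call in the expand step, so the work is a few hash blocks per session compared with the millions of AES blocks the same session may encrypt.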

Performance Characteristics
- AES-256: 1-10 Gbps throughput, 100-200 mW power (energy efficiency ~10-200 pJ/bit, i.e. 80-1600 pJ/byte, from these figures)
- RSA-2048 Signature: 100-400 µs (vs 10-100 ms software on embedded-class CPUs), 500 mW peak power
- ECDSA-P256 Signature: 100-500 µs (vs 5-50 ms software on embedded-class CPUs), 300 mW peak power
- SHA-256: 1-10 Gbps, 50-100 mW power
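
Energy-per-bit figures follow directly from power and throughput, so a quick calculation is a useful sanity check on any datasheet-style numbers. The helper names below are illustrative; the only fact used is that 1 mW divided by 1 Gbps equals 1 pJ per bit.

```python
def energy_pj_per_bit(power_mw: float, throughput_gbps: float) -> float:
    """mW / Gbps = (1e-3 J/s) / (1e9 bit/s) = 1e-12 J/bit = pJ/bit."""
    return power_mw / throughput_gbps


def energy_pj_per_byte(power_mw: float, throughput_gbps: float) -> float:
    """Eight bits per byte."""
    return 8 * energy_pj_per_bit(power_mw, throughput_gbps)


if __name__ == "__main__":
    # Best and worst corners of a 1-10 Gbps, 100-200 mW engine.
    print(energy_pj_per_bit(100, 10), energy_pj_per_byte(100, 10))
    print(energy_pj_per_bit(200, 1), energy_pj_per_byte(200, 1))
```

Checking both corners of a quoted range this way shows how wide the implied efficiency band is, and whether a figure was stated per bit or per byte.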

Area and Power Trade-offs
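- Unrolled Pipeline: deeper unrolling (multiple rounds per cycle or per pipeline stage) increases throughput, but area and power grow roughly linearly with the number of unrolled rounds
- Shared Multiplier: a single modular multiplier shared by RSA and ECC (SHA uses adders and rotations, not multipliers) saves area (20-30% area reduction) and slightly reduces peak throughput because the operations serialize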
- Thermal Management: high-power cryptographic operations (RSA, ECC) generate heat, requires thermal throttling or cooling

Integration in SoC
- Memory Hierarchy: accelerator attached to system memory (DDR/HBM), key/data loaded via DMA
- Interrupt Handling: operation completion signaled via interrupt (CPU processes result), or polling (CPU waits)
- Power Saving: accelerator enters sleep when idle (low-power mode), reduces standby power

Future Roadmap: NIST published the first PQC standards in 2024 (FIPS 203 ML-KEM, FIPS 204 ML-DSA, FIPS 205 SLH-DSA); hybrid classical+PQC deployment expected through 2025-2030; PQC-oriented ISA extensions (ARM, RISC-V) are still emerging.
