#### **Sapphire:** A Configurable Crypto-Processor for Post-Quantum Lattice-based Protocols

Utsav Banerjee\*, Tenzin S. Ukyab, Anantha P. Chandrakasan

\*<u>utsav@mit.edu</u>

Massachusetts Institute of Technology



# **Post-Quantum Cryptography**



Current public key cryptography vulnerable to quantum attacks

- NIST post-quantum crypto standardization in progress
- Round 2 has 26 candidates:
  - Lattice-based (9 KEM + 3 Sign)
  - Code-based (7 KEM)
  - Hash-based (1 Sign)
  - Multivariate (4 Sign)
  - Supersingular isogeny (1 KEM)
  - Zero-knowledge proofs (1 Sign)

# **Learning with Errors**

#### **Learning with Errors (LWE) and its variants:**



**Computational requirements (apart from standard arithmetic):** 

- Modular arithmetic over various small primes
- Polynomial arithmetic for Ring-LWE and Module-LWE
- Sampling of matrices and polynomials from discrete distributions

# **Sapphire Crypto-Processor**

Energy-efficient configurable lattice-crypto-processor



# Outline

□ Efficient Lattice-Crypto Hardware Implementation

- Configurable Modular Multiplier
- Area-Efficient NTT
- Energy-Efficient Sampler
- **Chip Architecture**
- Measurement Results
- □ Side-Channel Analysis

# **Modular Multiplication**

Algorithm Modular Multiplication with Barrett Reduction Require:  $x, y \in \mathbb{Z}_q$ , m and k such that  $m = \lfloor 2^k/q \rfloor$ Ensure:  $z = x \cdot y \mod q$ 1:  $z \leftarrow x \cdot y$ 2:  $t \leftarrow (z \cdot m) \gg k$ 3:  $z \leftarrow z - (t \cdot q)$ 4: if  $z \ge q$  then 5:  $z \leftarrow z - q$ 6: end if 7: return z

#### **Reduction with fully configurable modulus:**

 $\Box$  configurable parameters *m*, *k*, *q* 

 $\square$  *m* and *q* up to 24 bits

 $\Box 16 \le k \le 48$ 

□ requires 2 explicit multipliers for reduction



# **Modular Multiplication**

Algorithm Modular Multiplication with Barrett Reduction Require:  $x, y \in \mathbb{Z}_q$ , m and k such that  $m = \lfloor 2^k/q \rfloor$ Ensure:  $z = x \cdot y \mod q$ 1:  $z \leftarrow x \cdot y$ 2:  $t \leftarrow (z \cdot m) \gg k$ 3:  $z \leftarrow z - (t \cdot q)$ 4: if  $z \ge q$  then 5:  $z \leftarrow z - q$ 6: end if 7: return z

#### Reduction with pseudo-configurable modulus:

- $\Box$  choice of *q* from a set of primes
- □ reduction coded in digital logic
- □ requires no explicit multiplier for reduction
- □ up to 6× more energy-efficient



# **Unified Butterfly**



# **Number Theoretic Transform**



NTT memory banks using dual-port SRAMs have large area overheads

□ Proposed single-port SRAM-based NTT

Based on constant geometry FFT data-flow [Pease, J. ACM, 1968]

Polynomials split among four single-port SRAMs based on address parity:



Achieves > 30% area savings compared to dual-port implementation (without loss in throughput)

#### **NTT Data Flow**

#11

R

R

W

W

#11 #12

R

R

W

w

w

W

#12

R

W

W



• One butterfly per cycle

□ No read / write hazards

□ No energy overheads

#### **Energy-Efficient PRNG**

**Standard CS-PRNG: D** SHAKE-128 / 256 **D** AES-128 / 256 **D** ChaCha20



Keccak-based PRNG:

24-cycles and 2.33 nJ per round @ 1.1V

#### **Discrete Distribution Sampler**



# **Test Chip Overview**

□ Crypto core integrated with RISC-V processor





**Chip Micrograph** 

### **Protocol Implementations**

□ Following NIST Round 2 protocols were implemented on our test chip:

| CCA-KEM | LWE        | Frodo          |           | Ring-LWE   | aTesla             |  |
|---------|------------|----------------|-----------|------------|--------------------|--|
|         | Ring-LWE   | NewHope        | Signature |            | CRYSTALS-Dilithium |  |
|         | Module-LWE | CRYSTALS-Kyber |           | Module-LWE |                    |  |

□ Computations shared between crypto core and RISC-V processor:

PKE / KEM:

Sign:

| Enc | oding / Compression | Ence               | oding / Compression |  |  |
|-----|---------------------|--------------------|---------------------|--|--|
|     | CCA-KEM             |                    | Sign                |  |  |
|     | CPA-PKE             |                    |                     |  |  |
|     | RISC-               | Lattice-Crypto H/W |                     |  |  |

# Implementation of RLWE and MLWE

□ Efficient utilization of 24 KB polynomial memory with 8192 elements



- Crypto core used to accelerate sampling and polynomial arithmetic
- □ Protocol scheduling, compression and encoding performed on RISC-V processor

# Implementation of LWE

Polynomial memory tiled to support non-power-of-two-size matrix manipulation



- Crypto core used to accelerate sampling and matrix arithmetic
- □ Protocol scheduling, compression and encoding performed on RISC-V processor

#### **Protocol Evaluation Results**



\* Cycle counts for CCA-KEM-Encaps and Sign

Order of magnitude improvement in energy-efficiency and performance

#### **Protocol Evaluation Results**



\* Measured using test chip operating at 1.1 V and 72 MHz

# **Performance Comparison**

| Design                            | Platform | Tech<br>(nm) | VDD<br>(V) | Freq<br>(MHz)     | Protocol                                                                                                                           | Area<br>(kGE)        | Cycles                                                | Energy<br>(μJ)                             |
|-----------------------------------|----------|--------------|------------|-------------------|------------------------------------------------------------------------------------------------------------------------------------|----------------------|-------------------------------------------------------|--------------------------------------------|
|                                   |          |              |            |                   | NewHope-512-CCA-KEM-Encaps<br>NewHope-1024-CPA-PKE-Encrypt                                                                         |                      | 136,077<br>106,611                                    | 10.02<br>12.00                             |
| This work                         | ASIC     | 40           | 1.1        | 72                | Kyber-512-CCA-KEM-Encaps<br>Kyber-768-CPA-PKE-Encrypt<br>Kyber-768-CCA-KEM-Encaps<br>Frodo-640-CCA-KEM-Encaps<br>Dilithium-II-Sign | 106                  | 131,698<br>94,440<br>177,540<br>11,609,668<br>514 246 | 9.37<br>10.31<br>12.80<br>1129.95<br>54.82 |
| Basu et al. [BSNK19] <sup>†</sup> | ASIC     | 65           | 1.2        | 169<br>200<br>158 | NewHope-512-CCA-KEM-Encaps<br>Kyber-512-CCA-KEM-Encaps<br>Dilithium-II-Sign                                                        | 1273<br>1341<br>1603 | 307,847<br>31,669<br>155,166                          | 69.42<br>6.21<br>50.42                     |
| Albrecht et al. [AHH+18]          | SLE 78   | -            | -          | 50                | Kyber-768-CPA-PKE-Encrypt<br>Kyber-768-CCA-KEM-Encaps                                                                              | -                    | 4,747,291<br>5,117,996                                | -                                          |
| Oder et al. [OG17]                | FPGA     | -            | -          | 117               | NewHope-1024-Simple-Encrypt                                                                                                        | -                    | 179,292                                               | -                                          |
| Howe et al. [HOKG18]              | FPGA     | -            | -          | 167               | Frodo-640-CCA-KEM-Encaps                                                                                                           | -                    | 3,317,760                                             | -                                          |
| Fritzmann et al. [FSM+19]         | FPGA     | -            | -          | -                 | NewHope-1024-CPA-PKE-Encrypt                                                                                                       | -                    | 589,285                                               | -                                          |

<sup>†</sup> Only post-synthesis area and energy consumption reported

#### **Side-Channel Analysis Setup**



# **Timing and SPA Side-Channels**

□ All key building blocks constant-time by design

- □ Energy consumption of sampling and polynomial arithmetic follows a narrow distribution with coefficient of variation ≤ 0.5% (=  $\sigma/\mu$ )
- □ SPA attacks target polynomial arithmetic:
  - Number Theoretic Transform
  - Coefficient-wise Multiplication
  - Coefficient-wise Addition
- SPA resistance of polynomial arithmetic evaluated using difference-of-means test with 99.99% confidence interval



# Masking for DPA Security

□ Protocol evaluations without any DPA countermeasures

□ Masked NewHope-CPA-PKE-Decrypt based on additively homomorphic property:

- 1. Generate secret message  $\mu_r$
- 2. Encrypt  $\mu_r$  to its corresponding ciphertext  $c_r = (\hat{u}_r, v'_r)$
- 3. Compute  $c_m = (\hat{u} + \hat{u}_r, v' + v'_r)$  where  $c = (\hat{u}, v')$  is the original ciphertext
- 4. Decrypt  $c_m$  to obtain  $\mu_m = \mu \oplus \mu_r$  where  $\mu$  is the original message
- 5. Recover original message as  $\mu = \mu_m \oplus \mu_r$

□ Masked decryption using same hardware; 3× slower than unmasked version

□ Masking increases decryption failure rate, which can be resolved by decreasing std. dev.  $\sigma$  of error distribution (at the cost of slightly lower security level)

□ Leakage tests and CCA-KEM masking – work in progress

[Reparaz et al, PQCrypto, 2016]

#### Conclusion

- Configurable crypto-processor for LWE, Ring-LWE and Module-LWE protocols
- Area-efficient NTT, energy-efficient sampler and flexible parameters
- ASIC demonstration of NIST Round 2 CCA-KEM and signature protocols: Frodo, NewHope, Kyber, qTesla, Dilithium
- Order of magnitude improvement in performance and energy-efficiency compared to state-of-the-art software and hardware
- Hardware building blocks constant-time and SPA-secure by design; masking can also be implemented for DPA security

#### Acknowledgements

#### Texas Instruments for funding

□ TSMC University Shuttle Program for chip fabrication

#### Questions

