# Smashing the Implementation Records of AES S-box

<u>Arash Reyhani-Masoleh</u>, Mostafa Taha, and Doaa Ashmawy

Western University London, Ontario, Canada

CHES-2018

### Outline

- Introduction.
- Proposed AES S-box Architecture.
- New Logic-Minimization Algorithms.
- New GF((2<sup>4</sup>)<sup>2</sup>) Inversion.
  - New Exponentiation Stage.
  - New Representation of Subfield Inversion.
  - New Output Multipliers.
- Comparisons and Concluding Remarks.









1. The most compact S-box to date.

2. The most efficient S-box to date.

1. Use AND gates,

when NAND gates have smaller area and delay in all technology libraries.

1. Use AND gates,

when NAND gates have smaller area and delay in all technology libraries.

2. Use only simple gates,

when compound gates (AND-OR-Invert, OR-AND-Invert) may be more efficient.

1. Use AND gates,

when NAND gates have smaller area and delay in all technology libraries.

2. Use only simple gates,

when compound gates (AND-OR-Invert, OR-AND-Invert) may be more efficient.

• We improved previous designs using AND gates to the ones using NAND/NOR gates:

| S-box                   | Area     | (GEs)    | Delay (ns) |          |  |
|-------------------------|----------|----------|------------|----------|--|
| 2-00%                   | Original | Improved | Original   | Improved |  |
| Canright [Can05b]       | 200      |          | 1.253      |          |  |
| 113-gates [Boy16]       | 202      | 194      | 1.523      | 1.346    |  |
| Depth-16 (2012) [BP12]  | 230.5    | 222      | 0.960      | 0.906    |  |
| Depth-16 (2017) [BFP17] | 224.5    | 216      | 0.957      | 0.912    |  |
| Ueno et al. [UHS+15]    | 256.5    | 238      | 0.831      | 0.772    |  |

Targeting STM 65-nm CMOS standard library

1. Use AND gates,

when NAND gates have smaller area and delay in all technology libraries.

2. Use only simple gates,

when compound gates (AND-OR-Invert, OR-AND-Invert) may be more efficient.

• We improved previous designs using AND gates to the ones using NAND/NOR gates:

| The smallestThe fastestoriginaloriginal |        |          |         |            |  |  |  |  |
|-----------------------------------------|--------|----------|---------|------------|--|--|--|--|
| <u>C</u> boy                            | Area   | a (GEs)  |         | Pelay (ns) |  |  |  |  |
| S-box                                   | Origin | Improved | Origina | Improved   |  |  |  |  |
| Canright [Can05b]                       | 200    |          | 1.253   |            |  |  |  |  |
| 113-gates [Boy16]                       | 202    | 194      | 1.523   | 1.346      |  |  |  |  |
| Depth-16 (2012) [BP12]                  | 230.5  | 222      | 0.960   | 0.906      |  |  |  |  |
| Depth-16 (2017) [BFP17]                 | 224.5  | 216      | 0.957   | 0.912      |  |  |  |  |
| Ueno et al. [UHS+15]                    | 256.5  | 238      | 0.831   | 0.772      |  |  |  |  |

Targeting STM 65-nm CMOS standard library

1. Use AND gates,

when NAND gates have smaller area and delay in all technology libraries.

2. Use only simple gates,

when compound gates (AND-OR-Invert, OR-AND-Invert) may be more efficient.

• We improved previous designs using AND gates to the ones using NAND/NOR gates:

|                         | The sma<br>origina |         | The fa<br>origi |          |
|-------------------------|--------------------|---------|-----------------|----------|
| S-box                   | Area               | (GEs)   | )e              | lay (ns) |
| S-DOX                   | Origin             | Improve | Origina         | Improved |
| Canright [Can05b]       | 200                |         | 1.253           |          |
| 113-gates [Boy16]       | 202                | 194     | 1.523           | 1.346    |
| Depth-16 (2012) [BP12]  | 230.5              | 222     | 0.960           | 0.906    |
| Depth-16 (2017) [BFP17] | 224.5              | 216     | 0.957           | 0.912    |
| Ueno et al. [UHS+15]    | 256.5              | 238     | 0.831           | 0.772    |

Targeting STM 65-nm CMOS standard library

1. Use AND gates,

when NAND gates have smaller area and delay in all technology libraries.

2. Use only simple gates,

when compound gates (AND-OR-Invert, OR-AND-Invert) may be more efficient.

• We improved previous designs using AND gates to the ones using NAND/NOR gates:

| The smallest<br>originalThe smallest<br>improvedThe fastest<br>originalThe faste<br>improved |        |         |         |           |  |  |  |  |
|----------------------------------------------------------------------------------------------|--------|---------|---------|-----------|--|--|--|--|
| S-box                                                                                        | Area   | (GEs)   | þ       | elay (ns) |  |  |  |  |
| S-DOX                                                                                        | Origin | Improve | Origina | Improved  |  |  |  |  |
| Canright [Can05b]                                                                            | 200    |         | 1.253   |           |  |  |  |  |
| 113-gates [Boy16]                                                                            | 202    | 194     | 1.523   | 1.346     |  |  |  |  |
| Depth-16 (2012) [BP12]                                                                       | 230.5  | 222     | 0.960   | 0.906     |  |  |  |  |
| Depth-16 (2017) [BFP17]                                                                      | 224.5  | 216     | 0.957   | 0.912     |  |  |  |  |
| Ueno et al. [UHS+15]                                                                         | 256.5  | 238     | 0.831   | 0.772     |  |  |  |  |

Targeting STM 65-nm CMOS standard library

At the end, we compare only against the Improved Versions. Formulations of the improved designs are included in the paper.

### **AES S-box**



### **AES S-box**



• Typical implementation using Composite Fields in Normal Basis



### **Proposed AES S-box Architecture**

• 12 terms are shared between the Exponentiation and Multipliers



### **Proposed AES S-box Architecture**

• 12 terms are shared between the Exponentiation and Multipliers



### **Proposed AES S-box Architecture**

• 12 terms are shared between the Exponentiation and Multipliers



Everything optimized by-hand and by CAD tools at various abstraction levels (promote using NAND/NOR and compound gates )

### Outline

- Introduction, Motivation and Previous Work.
- Proposed AES S-box Architecture.
- New Logic-Minimization Algorithms.
- New GF((2<sup>4</sup>)<sup>2</sup>) Inversion.
  - New Exponentiation Stage.
  - New Representation of Subfield Inversion.
  - New Output Multipliers.
- Comparisons and Concluding Remarks.

### **Logic-Minimization Algorithms**

- Implement isomorphic transformation • NP-hard problem [BMP08].





- Implement isomorphic transformation matrices using smallest number of gates.
  - NP-hard problem [BMP08].
- Previous work
  - Cancellation-free search: Gates are never used to cancel-out common terms, Canright [Can05b] and Paar [Paa94].

| $ \begin{bmatrix} a_0\\a_1\\a_2\\a_3\\b_0\\b_1\\b_2 \end{bmatrix} $ | ] = | $ \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} $ | $egin{array}{c} 0 \\ 0 \\ 1 \\ 0 \\ 1 \\ 0 \\ 1 \\ 1 \end{array}$ | $egin{array}{c} 0 \\ 0 \\ 1 \\ 1 \\ 1 \\ 1 \\ 0 \end{array}$ | $egin{array}{c} 0 \\ 0 \\ 1 \\ 0 \end{array}$ | 0 | $0 \\ 0 \\ 1 \\ 1$ | $egin{array}{c} 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 1 \\ 0 \end{array}$ | 1<br>1<br>1<br>1<br>1<br>1<br>1<br>1 | $   \begin{bmatrix}     g_7 \\     g_6 \\     g_5 \\     g_4 \\     g_3 \\     g_2 \\     g_1   \end{bmatrix} $ |
|---------------------------------------------------------------------|-----|-------------------------------------------------------------------|-------------------------------------------------------------------|--------------------------------------------------------------|-----------------------------------------------|---|--------------------|--------------------------------------------------------------|--------------------------------------|-----------------------------------------------------------------------------------------------------------------|
| $\begin{bmatrix} b_2 \\ b_3 \end{bmatrix}$                          |     | $\begin{array}{c} 0 \\ 1 \end{array}$                             | 1<br>1                                                            | $\begin{array}{c} 0 \\ 1 \end{array}$                        | 0                                             |   | 1                  | $\begin{array}{c} 0 \\ 0 \end{array}$                        | 1<br>1                               | $\begin{bmatrix} g_1 \\ g_0 \end{bmatrix}$                                                                      |

- Implement isomorphic transformation matrices using smallest number of gates.
  - NP-hard problem [BMP08].
- Previous work
  - Cancellation-free search: Gates are never used to cancel-out common terms, Canright [Can05b] and Paar [Paa94].
  - Heuristics (with cancellation): Normal-BP (Boyar and Peralta [BP10])

| $\begin{bmatrix} a_0 \end{bmatrix}$ |   | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | $\left\lceil g_7 \right\rceil$ |
|-------------------------------------|---|---|---|---|---|---|---|---|---|--------------------------------|
| $a_1$                               |   | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | $ g_6 $                        |
| $a_2$                               |   | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | $ g_5 $                        |
| $a_3$                               | _ | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | $ g_4 $                        |
| $b_0$                               | _ | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | $ g_3 $                        |
| $b_1$                               |   | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | $ g_2 $                        |
| $b_2$                               |   | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | $ g_1 $                        |
| $b_3$                               |   | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | $\lfloor g_0 \rfloor$          |
| $g_7$                               |   |   |   |   |   |   |   |   |   | $a_0$                          |
| $g_6$                               |   |   |   |   |   |   |   |   |   | $a_1$                          |
| $g_5$                               |   |   |   |   |   |   |   |   |   | $a_2$                          |
| $g_4$                               |   |   |   |   |   |   |   |   |   | $a_3$                          |
| $g_3$                               |   |   |   |   |   |   |   |   |   | $b_0$                          |
| $g_2$                               |   |   |   |   |   |   |   |   |   | $b_1$                          |
| $g_1$                               |   |   |   |   |   |   |   |   |   | $b_2$                          |
| $g_0$                               |   |   |   |   |   |   |   |   |   | $b_3$                          |
|                                     |   |   |   |   |   |   |   |   |   |                                |

- Implement isomorphic transformation matrices using smallest number of gates.
  - NP-hard problem [BMP08].
- Previous work
  - Cancellation-free search: Gates are never used to cancel-out common terms, Canright [Can05b] and Paar [Paa94].
  - Heuristics (with cancellation): Normal-BP (Boyar and Peralta [BP10])
    - 1. Test adding one gate

| $\begin{bmatrix} a_0 \end{bmatrix}$ |                 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | $\left\lceil g_7 \right\rceil$ |
|-------------------------------------|-----------------|---|---|---|---|---|---|---|---|--------------------------------|
| $a_1$                               |                 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | $ g_6 $                        |
| $a_2$                               |                 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | $ g_5 $                        |
| $a_3$                               | _               | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | $ g_4 $                        |
| $b_0$                               | _               | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | $ g_3 $                        |
| $b_1$                               |                 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | $ g_2 $                        |
| $b_2$                               |                 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | $ g_1 $                        |
| $b_3$                               |                 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | $\lfloor g_0 \rfloor$          |
| $g_7$                               |                 |   |   |   |   |   |   |   |   | $a_0$                          |
| $g_6$                               | $\widehat{(1)}$ | ) |   |   |   |   |   |   |   | $a_1$                          |
| $g_5$ -                             |                 |   |   |   |   |   |   |   |   | $a_2$                          |
| $g_4$ -                             |                 |   |   |   |   |   |   |   |   | $a_3$                          |
| $g_3$                               |                 |   |   |   |   |   |   |   |   | $b_0$                          |
| $g_2$                               |                 |   |   |   |   |   |   |   |   | $b_1$                          |
| $g_1$                               |                 |   |   |   |   |   |   |   |   | $b_2$                          |
| $g_0$                               |                 |   |   |   |   |   |   |   |   | $b_3$                          |
|                                     |                 |   |   |   |   |   |   |   |   |                                |

- Implement isomorphic transformation matrices using smallest number of gates.
  - NP-hard problem [BMP08].
- Previous work
  - Cancellation-free search: Gates are never used to cancel-out common terms, Canright [Can05b] and Paar [Paa94].
  - Heuristics (with cancellation): Normal-BP (Boyar and Peralta [BP10])
    - 1. Test adding one gate
    - 2. Compute *Distance* to each target (assuming no sharing)



- Implement isomorphic transformation matrices using smallest number of gates.
  - NP-hard problem [BMP08].
- Previous work
  - Cancellation-free search: Gates are never used to cancel-out common terms, Canright [Can05b] and Paar [Paa94].
  - Heuristics (with cancellation): Normal-BP (Boyar and Peralta [BP10])
    - 1. Test adding one gate
    - 2. Compute *Distance* to each target (assuming no sharing)
    - 3. Select a gate leading to the (min average *Dist*) Resolve ties using different methods.



- Implement isomorphic transformation matrices using smallest number of gates.
  - NP-hard problem [BMP08].
- Previous work
  - Cancellation-free search: Gates are never used to cancel-out common terms, Canright [Can05b] and Paar [Paa94].
  - Heuristics (with cancellation): Normal-BP (Boyar and Peralta [BP10])

Add the selected gate and redo

- 1. Test adding one gate
- Compute *Distance* to each target (assuming no sharing)
- 3. Select a gate leading to the (min average *Dist*)Resolve ties using different methods.



#### 10

# Logic-Minimization Algorithms (cont.)

- Proposed Logic-Minimization Algorithms
  - Improved-BP:
    - Test <u>all</u> the ties.
    - Monitor progress of the delay.
  - Shortest-Dist-First:
    - Select a gate leading to <u>many</u> small (short) *Distances* (prioritize small *Distances*, not the average).
    - Test <u>all</u> the ties and monitor the delay.
  - Focused-Search:
    - Select a gate leading to <u>any</u> small (short) *Distance* (ignore the count and search through more cases) (close to exhaustive search).
    - Test <u>all</u> the ties and monitor the delay.



 Studied T<sub>in</sub> and T<sub>out</sub> for all possible isomorphic transformations (a total of <u>96</u> matrices).

- Studied T<sub>in</sub> and T<sub>out</sub> for all possible isomorphic transformations (a total of <u>96</u> matrices).
- The proposed algorithms consistently lead to <u>equal or better</u> implementations.

- Studied  $T_{in}$  and  $T_{out}$  for all possible isomorphic transformations (a total of <u>96</u> matrices).
- The proposed algorithms consistently lead to <u>equal or better</u> implementations.
- Lightweight Implementation

|                           | Optimized<br>by CAD tools | Normal-BP | Improved-BP | Shortest-Dist-<br>First | Focused-Search |
|---------------------------|---------------------------|-----------|-------------|-------------------------|----------------|
| T <sub>in</sub> (#gates)  | 29                        | 19        | 19          | 19                      | 19             |
| T <sub>out</sub> (#gates) | 23                        | 19        | 17          | 17                      | 16             |

- Studied  $T_{in}$  and  $T_{out}$  for all possible isomorphic transformations (a total of <u>96</u> matrices).
- The proposed algorithms consistently lead to <u>equal or better</u> implementations.
- Lightweight Implementation

|                           | Optimized<br>by CAD tools | Normal-BP | Improved-BP | Shortest-Dist-<br>First | Focused-Search |
|---------------------------|---------------------------|-----------|-------------|-------------------------|----------------|
| T <sub>in</sub> (#gates)  | 29                        | 19        | 19          | 19                      | 19             |
| T <sub>out</sub> (#gates) | 23                        | 19        | 17          | 17                      | 16             |

• Fast Implementation

|                           | Area (# XOR gates) | Delay (levels of XOR gates) |
|---------------------------|--------------------|-----------------------------|
| T <sub>in</sub> (#gates)  | 24                 | 3                           |
| T <sub>out</sub> (#gates) | 21                 | 3                           |

### Outline

- Introduction, Motivation and Previous Work.
- Proposed AES S-box Architecture.
- New Logic-Minimization Algorithms.
- New GF((2<sup>4</sup>)<sup>2</sup>) Inversion.
  - New Exponentiation Stage.
  - New Representation of Subfield Inversion.
  - New Output Multipliers.
- Comparisons and Concluding Remarks.

### **New Exponentiation Stage**

 Express as one operation with closed-form equations (allows for maximum sharing).



## **New Exponentiation Stage**

- Express as one operation with closed-form equations (allows for maximum sharing).
- Two designs: Lightweight and Fast. (Optimized by hand)
- One design optimized by CAD tools.



### New Exponentiation Stage (cont.)



### **New Subfield Inversion**

- Express in closed-form equations
- Derive 12 equivalent functions using Karnough maps, and optimize by-hand.
- Optimized using CAD tools.



### **New Subfield Inversion**

- Express in closed-form equations
- Derive 12 equivalent functions using Karnough maps, and optimize by-hand.
- Optimized using CAD tools.

Lightweight and fast, optimized by-hand Used NAND3 gates Optimized by CAD tools Used OR-AND-Invert gates

|                                          | Area (GEs) | Delay (ns) |
|------------------------------------------|------------|------------|
| Lightweight and fast (optimized by-hand) | 36         | 0.121      |
| Optimized by CAD tools                   | 31         | 0.102      |





# **New Output Multipliers**

• Two multipliers with a common input:  $W = B \times E$  &  $Z = A \times E$ 



# **New Output Multipliers**

Two multipliers with a common input:
 W = B x E
 & Z = A x E



Input and output terms represented as

 4 bits x 4 bits → 5 bits
 Reduction from 5 bits back to 4 bits is part of T<sub>out</sub>.

# **New Output Multipliers**

Two multipliers with a common input:
 W = B x E
 & Z = A x E



- Input and output terms represented as

   4 bits x 4 bits → 5 bits
   Reduction from 5 bits back to 4 bits is part of T<sub>out</sub>.
- Previous work:

 $4x4 \rightarrow 4$  [Can05b],  $5x5 \rightarrow 5$  [NNI12],  $4x5 \rightarrow 5$  [UHS<sup>+</sup>15]









• Some multipliers do not allow sharing ([Mas91], [RDJ<sup>+</sup>01] and [GM16]).

### Space and time complexities of a single multiplier

|                                                     | Multiplier used in     | Space Complexity | Time Complexity                    |
|-----------------------------------------------------|------------------------|------------------|------------------------------------|
| 2)2)                                                | Satoh et al. [SMTM01]  | 21 XOR + 9 AND   | $4 D_{\chi} + D_{AD}$              |
| GF(((2 <sup>2</sup> ) <sup>2</sup> ) <sup>2</sup> ) | Canright [Can05b]      | 20 XOR + 9 NAND  | $4 D_{X} + D_{ND}$                 |
| GF(                                                 | Nogami et al. [NNT+10] | 21 XOR + 9 AND   | $4 D_{X} + D_{AD}$                 |
|                                                     | Rudra et al. [RDJ⁺01]  | 15 XOR + 16 AND  | $3 D_{X} + D_{AD}$                 |
| )2)                                                 | Gueron et al. [GM16]   | 15 XOR + 16 AND  | 3 D <sub>X</sub> + D <sub>ND</sub> |
| GF((2 <sup>4</sup> ) <sup>2</sup> )                 | Nekado et al. [NNI12]  | 25 XOR + 10 AND  | $2 D_{X} + D_{AD}$                 |
| GF                                                  | Ueno et al. [UHS⁺15]   | 21 XOR + 10 AND  | $2 D_{X} + D_{AD}$                 |
|                                                     | This work              | 17 XOR + 10 NAND | $2 D_{X} + D_{ND}$                 |

### Space and time complexities of a single multiplier

|                                                     | Multiplier used in     | Space Complexity | Time Complexity                    |  |  |
|-----------------------------------------------------|------------------------|------------------|------------------------------------|--|--|
| 2)2)                                                | Satoh et al. [SMTM01]  | 21 XOR + 9 AND   | 4 D <sub>X</sub> + D <sub>AD</sub> |  |  |
| GF(((2 <sup>2</sup> ) <sup>2</sup> ) <sup>2</sup> ) | Canright [Can05b]      | 20 XOR + 9 NAND  | $4 D_{X} + D_{ND}$                 |  |  |
| GF(                                                 | Nogami et al. [NNT+10] | 21 XOR + 9 AND   | $4 D_{X} + D_{AD}$                 |  |  |
|                                                     | Rudra et al. [RDJ⁺01]  | 15 XOR + 16 AND  | 3 D <sub>X</sub> + D <sub>AD</sub> |  |  |
| )2)                                                 | Gueron et al. [GM16]   | 15 XOR + 16 AND  | 3 D <sub>X</sub> + D <sub>ND</sub> |  |  |
| GF((2 <sup>4</sup> ) <sup>2</sup> )                 | Nekado et al. [NNI12]  | 25 XOR + 10 AND  | $2 D_{X} + D_{AD}$                 |  |  |
| GF                                                  | Ueno et al. [UHS⁺15]   | 21 XOR + 10 AND  | $2 D_{X} + D_{AD}$                 |  |  |
|                                                     | This work              | 17 XOR + 10 NAND | $2 D_{\chi} + D_{ND}$              |  |  |
| The smallest and fastest 4-bit multiplier to date   |                        |                  |                                    |  |  |

The smallest and fastest 4-bit multiplier to date among all the  $GF((2^4)^2)$  and  $GF(((2^2)^2)^2)$  multipliers

# Additional area and delay required for the multipliers

|                        | Area (GEs) | Delay (ns) |
|------------------------|------------|------------|
| Optimized by-hand      | 52         | 0.099      |
| Optimized by CAD tools | 53.5       | 0.121      |



### Outline

- Introduction, Motivation and Previous Work.
- Architecture of the Proposed AES S-box.
- New Logic-Minimization Algorithms.
- New GF((2<sup>4</sup>)<sup>2</sup>) Inversion.
  - New Exponentiation Stage.
  - New Representation of Subfield Inversion.
  - New Output Multipliers.
- Comparisons and Concluding Remarks.

## Comparisons

• Targeting Lightweight Implementation

| S-box                   | Area (GEs) | Delay (ns) | Area-Time Product |
|-------------------------|------------|------------|-------------------|
| Canright [Can05b]       | 200        | 1.25       | 250               |
| Improved 113-gates      | 194        | 1.35       | 261.9             |
| This work (Lightweight) | 182.25     | 1.20       | 218.7             |
|                         |            |            |                   |

The smallest, fastest and most efficient Lightweight S-box

# Comparisons

• Targeting Lightweight Implementation

| S-box                   | Area (GEs) | Delay (ns) | Area-Time Product |
|-------------------------|------------|------------|-------------------|
| Canright [Can05b]       | 200        | 1.25       | 250               |
| Improved 113-gates      | 194        | 1.35       | 261.9             |
| This work (Lightweight) | 182.25     | 1.20       | 218.7             |

The smallest, fastest and most efficient Lightweight S-box

### • Targeting Fast Implementation

| S-box                    | Area (GEs) | Delay (ns) | Area-Time Product |
|--------------------------|------------|------------|-------------------|
| Improved Depth-16 (2012) | 222        | 0.91       | 202.02            |
| Improved Depth-16 (2017) | 216        | 0.91       | 196.56            |
| Improved Ueno et al.     | 238        | 0.77       | 183.26            |
| This work (Fast)         | 208        | 0.78       | 162.24            |

At STM 65-nm CMOS standard technology library The smallest, fastest and most

efficient Fast S-box

# Comparisons

• Targeting Lightweight Implementation

| S-box                   | Area (GEs) | Delay (ns) | Area-Time Product |
|-------------------------|------------|------------|-------------------|
| Canright [Can05b]       | 200        | 1.25       | 250               |
| Improved 113-gates      | 194        | 1.35       | 261.9             |
| This work (Lightweight) | 182.25     | 1.20       | 218.7             |

The smallest, fastest and most efficient Lightweight S-box

### • Targeting Fast Implementation

| S-box                    | Area (GEs) | Delay (ns) | Area-Time Product |
|--------------------------|------------|------------|-------------------|
| Improved Depth-16 (2012) | 222        | 0.91       | 202.02            |
| Improved Depth-16 (2017) | 216        | 0.91       | 196.56            |
| Improved Ueno et al.     | 238        | 0.77       | 183.26            |
| This work (Fast)         | 208        | 0.78       | 162.24            |

At STM 65-nm CMOS standard technology library The smallest, fastest and most

efficient Fast S-box

As a result of testing more than <u>46</u> pieces of VHDL code, at various abstraction levels of the designs

As compared against the improved versions proposed in this paper

- Industrial technology libraries (e.g., STM and TSMC):
  - Lightweight: Used XOR3 and OAI32  $\rightarrow$  182.25 GEs.
  - Fast: Used NAND3  $\rightarrow$  208 GEs.

- Industrial technology libraries (e.g., STM and TSMC):
  - Lightweight: Used XOR3 and OAI32  $\rightarrow$  182.25 GEs.
  - Fast: Used NAND3  $\rightarrow$  208 GEs.
- NanGate45nm:
  - Lightweight: Used AOI12 and OAI12 gates  $\rightarrow$  186 GEs.
  - Fast: Used NAND3  $\rightarrow$  208 GEs (no change).

- Industrial technology libraries (e.g., STM and TSMC):
  - Lightweight: Used XOR3 and OAI32  $\rightarrow$  182.25 GEs.
  - Fast: Used NAND3  $\rightarrow$  208 GEs.
- NanGate45nm:
  - Lightweight: Used AOI12 and OAI12 gates  $\rightarrow$  186 GEs.
  - Fast: Used NAND3  $\rightarrow$  208 GEs (no change).
- Without using any compound gate:
  - Lightweight: 191 GEs (best previous work: 194 GEs)
  - Fast: 211 GEs (best previous work: 216 GEs)

- Industrial technology libraries (e.g., STM and TSMC):
  - Lightweight: Used XOR3 and OAI32  $\rightarrow$  182.25 GEs.
  - Fast: Used NAND3  $\rightarrow$  208 GEs.
- NanGate45nm:
  - Lightweight: Used AOI12 and OAI12 gates  $\rightarrow$  186 GEs.
  - Fast: Used NAND3  $\rightarrow$  208 GEs (no change).
- Without using any compound gate:
  - Lightweight: 191 GEs (best previous work: 194 GEs)
  - Fast: 211 GEs (best previous work: 216 GEs)

The proposed designs are superior under any restriction by the target library.

# **Concluding Remarks**

- In this paper, we proposed:
  - Two new designs for the AES S-box: Lightweight and fast.
  - New logic-minimization heuristics.
  - New formulations for each stage of the S-box.
  - New output multipliers.
  - Design methodology for an optimum synergy between theoretical analysis and technology-assisted CAD tools.

### References

- [Can05b] David Canright. A very compact S-box for AES. CHES-2005.
- [Boy16] CMT: Circuit minimization team, 2016. http://www.cs.yale.edu/homes/peralta/CircuitStuff/CMT.html,
- [BP12] Joan Boyar and René Peralta. A small depth-16 circuit for the AES S-box. Information Security and Privacy Conference, SEC 2012.
- [BFP17] Joan Boyar, Magnus Find, and René Peralta. Low-depth, low-size circuits for cryptographic applications. In Boolean Functions and their Applications BFA-2017.
- [UHS<sup>+</sup>15] Rei Ueno, Naofumi Homma, Yukihiro Sugawara, Yasuyuki Nogami, and Takafumi Aoki. Highly efficient GF(2<sup>8</sup>) inversion circuit based on redundant GF arithmetic and its application to AES design. CHES-2015.
- [BMP08] Joan Boyar, Philip Matthews, and René Peralta. On the shortest linear straight-line program for computing linear forms. Mathematical Foundations of Computer Science, MFCS 2008.
- [Paa94] Christof Paar. Efficient VLSI architectures for bit parallel computation in Galios fields. PhD thesis, University of Duisburg-Essen, Germany, 1994.
- [BP10] Joan Boyar and René Peralta. A new combinational logic minimization technique with applications to cryptology. Symposium on Experimental Algorithms, SEA 2010.
- [NNI12] Kenta Nekado, Yasuyuki Nogami, and Kengo Iokibe. Very short critical path implementation of AES with direct logic gates. International Workshop on Security, IWSEC 2012.
- [Mas91] E. D. Mastrovito. VLSI Architectures for Computation in Galois Fields. PhD thesis, Linkoping Univ., Linkoping Sweden, 1991.
- [RDJ+01] Atri Rudra, Pradeep K. Dubey, Charanjit S. Jutla, Vijay Kumar, Josyula R.Rao, and Pankaj Rohatgi. Efficient Rijndael encryption implementation with composite field arithmetic. CHES 2001.
- [GM16] Shay Gueron and Sanu Mathew. Hardware implementation of AES using area-optimal polynomials for composite-field representation GF((2<sup>4</sup>)<sup>2</sup>) of GF(2<sup>8</sup>). ARITH 2016.
- [SMTM01] Akashi Satoh, Sumio Morioka, Kohji Takano, and Seiji Munetoh. A compact Rijndael hardware architecture with S-box optimization. ASIACRYPT 2001.
- [NNT<sup>+</sup>10] Yasuyuki Nogami, Kenta Nekado, Tetsumi Toyota, Naoto Hongo, and Yoshitaka Morikawa. Mixed bases for efficient inversion in F((2<sup>2</sup>)<sup>2</sup>)<sup>2</sup> and conversion matrices of subbytes of AES. CHES-2010.

Thank You, Questions?

## **Logic-Minimization Algorithms**

| original the inputs                                  | with <i>Dist</i> =1                                  | Dist, assume<br>using <mark>w<sub>0</sub>+w<sub>1</sub></mark> | Dist, assume<br>using w <sub>0</sub> +w <sub>2</sub> | Dist, assume<br>using w <sub>0</sub> +w <sub>3</sub>                                                                                                                                                             | Dist, assume<br>using w <sub>0</sub> +w <sub>4</sub>                                                                                                                                                                              | T <sub>out</sub>                                                                                                                                                                                                                                                                                                                                                                                                                            |                                                                                          |
|------------------------------------------------------|------------------------------------------------------|----------------------------------------------------------------|------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| $\begin{array}{cccccccccccccccccccccccccccccccccccc$ | $\begin{array}{cccccccccccccccccccccccccccccccccccc$ | $\begin{array}{cccccccccccccccccccccccccccccccccccc$           | $\begin{array}{cccccccccccccccccccccccccccccccccccc$ | $w_{0} = 3  s_{7}$ $w_{1} = 6  s_{6}$ $w_{2} = 6  s_{5}$ $w_{4} = 5  s_{4}$ $z_{0} = 3  s_{3}$ $z_{1} = 2  s_{2}$ $z_{2} = 1  s_{1}$ $z_{4} = 5  s_{0}$ $w_{2} \oplus w_{4}$ $w_{0} \oplus w_{3}$ Sum(Dist) = 31 | $w_{0} = 3 \ s_{7}$ $w_{1} = 6 \ s_{6}$ $w_{2} = 6 \ s_{5}$ $w_{4} = 5 \ s_{4}$ $z_{0} = 3 \ s_{3}$ $z_{1} = 2 \ s_{2}$ $z_{2} = 1 \ s_{1}$ $z_{4} = 5 \ s_{0}$ $w_{2} \oplus w_{4}$ $w_{0} \oplus w_{4}$ Sum( <i>Dist</i> ) = 31 | $\begin{bmatrix} s_7\\ s_6\\ s_5\\ s_4\\ s_3\\ s_2\\ s_1\\ s_0 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1\\ 1 & 1 & 1 & 0 & 1 & 1 & 1 & 1 & 0 & 1\\ 1 & 1 & 0 & 1 & 1 & 0 & 1 & 1 & 1 & 1\\ 1 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1\\ 0 & 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 & 1\\ 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix}$ | $egin{array}{c} w_0 \ w_1 \ w_2 \ w_3 \ w_4 \ z_0 \ z_1 \ z_2 \ z_3 \ z_4 \end{bmatrix}$ |

#### Normal-BP:

1.Test all the possible XOR gates that can use the previous level gates (the inputs and  $(w_2+w_4)$ ). That is: from  $(w_0+w_1)$  all the way to  $(z_4 + (w_2+w_4))$ . 2.Select one gate that leads to [min (sum (*Dist*))]. In case of ties, select one gate based on different tie breaking criteria. For example, within the best gates, select one gate that maximizes the Euclidean norm of *Dist* 

#### • Improved-BP:

Similar to Normal-BP, but try all the tie, and monitor progress of the Delay.

#### • Shortest-Dist-First

Similar to Norma-BP, but select all the gates that as many small numbers in the *Dist* as possible. If we consider the four cases above, we will select all of them because the smallest number is 2 (excluding ones), and this number (2) appears one time in each case. If it were to appear twice in any case, I would have selected that case. If the smallest number is 3, so that is the smallest *Dist*, and select the case that leads to as many (Dist=3) as possible.

#### Focused-Search

Similar to 'Shortest-Dist-First', but we ignore the count of (Dist=2) or (Dist=3). Here, we select all the gates that include (Dist=2) within the vector of Distances. We do not differentiate based on the count. If there is no gate that lead to Dist=2, select all the gates that include Dist=3, and so on.