RISQ-V: Tightly Coupled RISC-V Accelerators for Post-Quantum Cryptography

Abstract. Empowering electronic devices to support Post-Quantum Cryptography (PQC) is a challenging task. Compared with traditional cryptography, PQC introduces new mathematical elements and operations that are usually not easy to implement on standard CPU architectures. Especially for low-cost and resource-constrained devices, hardware acceleration is absolutely required. In addition, as the standardization process of PQC is still ongoing, a focus on maintaining crypto-agility is mandatory. To cope with such requirements, Hardware/Software Co-Design techniques have recently been used for developing complex and highly customized PQC solutions. However, while most previous works have developed loosely coupled PQC accelerators, the design of tightly coupled accelerators and Instruction Set Architecture (ISA) extensions for PQC has barely been explored. To this end, we present RISQ-V, an enhanced RISC-V architecture that integrates a set of powerful tightly coupled accelerators to speed up lattice-based PQC. RISQ-V efficiently reuses processor resources and reduces the number of memory accesses. This significantly increases performance while keeping the silicon area overhead low. We present three contributions. First, we propose a set of powerful hardware accelerators deeply integrated into the RISC-V pipeline. Second, we extend the RISC-V ISA with 28 new instructions to efficiently perform operations for lattice-based cryptography. Third, we implement RISQ-V in ASIC technology and on FPGA. We evaluate the performance of NewHope, Kyber, and Saber on RISQ-V. Compared to the pure software implementation on RISC-V, our Co-Design implementations show speedup factors of up to 10.5 for NewHope, 9.6 for Kyber, and 2.7 for Saber. For the ASIC implementation, the energy consumption was reduced by factors of up to 8.8 for NewHope, 7.7 for Kyber, and 2.1 for Saber.
The cell count of the CPU was increased by a factor of 1.6 compared to the original RISC-V design, which can be considered a moderate increase for the achieved performance gain.


Introduction
Public-key cryptography (PKC) provides the basis for establishing secured communication channels between multiple parties. The security of the PKC in use today is mainly based on the hardness of two mathematical problems: the factorization of large numbers (e.g., RSA) and the computation of discrete logarithms (e.g., ECC). However, both problems can be solved in polynomial time once a large-scale quantum computer is built. The foreseeable breakthrough of quantum computers therefore represents a risk for all communication systems. In order to ensure PKC security, cryptographic algorithms based on new hard mathematical problems are required. Post-Quantum Cryptography refers to a set of algorithms based on different hard mathematical problems that are considered secure against attacks by quantum computers. A common way to support such algorithms on embedded devices is to attach loosely coupled (standalone) hardware accelerators. However, these accelerators have two main drawbacks:
• High amount of hardware resources: they require huge buffers to store the input and output data of the accelerator. In particular, the NTT module usually requires a large amount of memory for storing input data, output data, and LUTs for the Twiddle factors of the forward and inverse NTT. Previously proposed works use up to four separate memory blocks to execute the NTT transform. While this excessive use of resources might be acceptable for high-performance devices, it is questionable whether it is suitable for small embedded devices. The usage of multiple memories unnecessarily increases the design complexity and cost.
• Low flexibility: since standalone hardware accelerators usually run a complete cryptographic algorithm, they are inflexible and hard to update.
To the best of our knowledge, only a single work has exploited the potential of Hardware/Software Co-Design to create tightly coupled accelerators for Post-Quantum Cryptography. In [AEL+20], the authors integrated a finite field multiplier into the RISC-V architecture in order to accelerate polynomial multiplications in NewHope and Kyber. Their proposed finite field multiplier already leads to a good performance improvement. In contrast to the work of [AEL+20], in this work we present a set of more powerful hardware accelerators capable of performing vectorized modulo arithmetic, NTT computations, and efficient generation of random polynomials. Polynomial generation has been pointed out by several previous works [KRSS19, AEL+20, ABCG20] as one of the main reasons for performance degradation. This bottleneck is usually even more critical than the polynomial arithmetic.

Contributions.
In this work, we propose RISQ-V, an enhanced RISC-V architecture that embeds a set of powerful tightly coupled accelerators to speed up lattice-based Post-Quantum Cryptography. Because the accelerators are directly integrated into the RISC-V processing element, they eliminate the drawbacks of loosely coupled accelerators by sharing the existing processor resources. Our results show that our proposal combines good performance with flexibility and has a small silicon footprint. Thus, our solution is suitable for a wide range of applications, from constrained devices, such as smartcards or IoT platforms, to powerful Multi-Processor Systems-on-Chip (MPSoCs). In summary, this paper presents three contributions: First, we propose a set of tightly coupled hardware accelerators capable of performing: i) two butterfly operations in parallel; ii) on-the-fly Twiddle factor generation; iii) vectorized modulo arithmetic; and iv) efficient generation of random polynomials. These accelerators are deeply integrated into the RISC-V pipeline.
Second, we create 28 new instructions for performing packed modular arithmetic (addition, subtraction, multiplication, multiply-accumulate), butterfly operation, update of Twiddle factors, update/multiplication with scaling factors, bit-reversal, hash computations, and binomial sampling.
Third, we developed a RISQ-V FPGA prototype and an ASIC implementation to evaluate its performance and cost. We exploit the RISQ-V capabilities to implement the Post-Quantum algorithms NewHope, Kyber and Saber with different security levels.
Organization. The remainder of this article is structured as follows: In Section 2, lattice-based cryptography and its performance bottlenecks are described. In Section 3, we present the design decisions and hardware architectures of the developed Post-Quantum accelerators. In Section 4, the integration of these accelerators into the RISC-V platform and the new Post-Quantum ISA extension are discussed. In Section 5, evaluation results of RISQ-V for an FPGA prototype and an ASIC implementation are presented. A conclusion is given in Section 6.

Lattice-based Cryptography
Most lattice-based cryptosystems are built upon the Learning With Errors (LWE) problem. Among the different alternatives, the algebraically structured variants Ring-LWE (R-LWE) and Module-LWE (M-LWE); and the deterministic variant Module-Learning With Rounding (M-LWR), are very attractive since they offer performance advantages and smaller key sizes over the plain LWE problem [AD17,BPR12].
The R-LWE problem was introduced in [LPR10]. In contrast to the plain LWE problem, R-LWE replaces n-dimensional vectors by polynomials of degree smaller than n. These polynomials are elements of the ring R = Z_q[x]/⟨ϕ(x)⟩, with integers n and q, and the cyclotomic polynomial ϕ(x), which is usually chosen as x^n + 1. This substitution increases the computational performance and decreases key and ciphertext sizes. The basic R-LWE instance can be written as in Eq. (1), where a is a public polynomial, s a secret polynomial, and e an error polynomial:

b = a · s + e   (1)

All arithmetic operations are performed in the ring R. The hardness of the R-LWE problem is based on the difficulty of recovering s when a and the result b are given (search problem). Moreover, it is known to be a hard problem to distinguish between the pair (a, b) and a truly uniform and independent sample pair (decision problem). Usually, a is sampled from a uniform distribution U_q whose outcome ranges between 0 and q − 1. The polynomials s and e are sampled from an error distribution Ψ_k, which is usually a binomial or Gaussian distribution. Among the candidates of the second round of the NIST Post-Quantum standardization process, NewHope [AAB+19] is the only Key Encapsulation Mechanism (KEM) that is built upon the R-LWE problem.
The M-LWE problem was introduced in [LS15]. In contrast to R-LWE, the M-LWE problem replaces the single ring elements with module elements over the same ring. Instead of using a single large ring element, it uses matrices of smaller ring elements for constructing the public element, and vectors of smaller ring elements for the secret and error elements. R-LWE can be considered a special case of the M-LWE problem, where the width of the matrix A over the ring R is always 1 [ABD+19]. To increase the security level, it is sufficient to increase the dimension of the matrices and vectors, while the ring stays the same. That is, the underlying ring operations do not change; only the number of times these operations are performed changes. One of the main advantages of M-LWE is that the concrete-security/efficiency trade-off is highly tunable [AD17]. Kyber [ABD+19] is a second round Post-Quantum NIST candidate built upon M-LWE.
The LWR problem was introduced in [BPR12] as a derandomized version of the LWE problem. It uses the rounding operator ⌊x⌉_p with respect to some modulus p < q, which scales x by p/q and then rounds the result to the nearest integer modulo p. Therefore, in contrast to the LWE problem, the noise in the LWR problem is generated deterministically. LWR replaces a·s+e (given in Eq. 1) by the rounding function ⌊a·s⌉_p with respect to some modulus p. The module variant M-LWR generalizes LWR in the same way that M-LWE generalizes LWE. As a result, the schemes based on M-LWR are characterized by a reduced bandwidth for keys and ciphertexts. Saber [DKRV19] is a second round Post-Quantum NIST candidate based on M-LWR.
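As a toy illustration, the rounding operator ⌊x⌉_p can be sketched in Python; the moduli below are illustrative powers of two, not Saber's actual parameter set:

```python
def lwr_round(x: int, q: int, p: int) -> int:
    """Rounding operator |x>_p: scale x by p/q, round to the nearest
    integer (half rounds up here; the exact tie-breaking is a convention),
    and reduce mod p, with p < q."""
    return ((p * x + q // 2) // q) % p

# Illustrative moduli (NOT Saber's parameters): each coefficient keeps only
# its top bits, so the "noise" is the deterministic rounding error rather
# than a sampled error polynomial.
q, p = 1024, 256
assert lwr_round(0, q, p) == 0
assert lwr_round(4, q, p) == 1      # 4 * 256/1024 = 1.0
assert lwr_round(1023, q, p) == 0   # rounds up to 256, wraps to 0 mod p
```

Because p and q are powers of two in Saber, this scaling reduces in hardware to a simple bit shift, which is one reason the scheme avoids NTT-friendly primes.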
The tightly coupled hardware accelerators developed in this work were used for enhancing the performance of three promising NIST Post-Quantum candidates: NewHope (R-LWE), Kyber (M-LWE), and Saber (M-LWR). The developed hardware accelerators are able to support multiple security levels (from I to V). For the performance evaluation, the lowest and highest NIST security levels were always chosen. To be precise, the following instances were evaluated: NewHope-512 (Level I), NewHope-1024 (Level V), Kyber-512 (Level I), Kyber-1024 (Level V), Lightsaber (Level I), and Firesaber (Level V). The following sections describe the performance bottlenecks of these schemes and the proposed hardware accelerators.

Performance bottlenecks
The matrix/polynomial generation, addition and multiplication are the basic operations required to encrypt, decrypt, sign, and verify information in lattice-based cryptosystems.
Polynomial Generation. Structured LWE-based cryptography requires the sampling of large random polynomials. The standard way to create the public, secret, and error polynomials is to use uniformly distributed random numbers and model each coefficient according to the desired distribution. The coefficients of the public polynomial can be obtained through a rejection sampling process. This technique selects (accepts) the uniform samples that fit the target distribution (desired range); otherwise, the sample is rejected. The coefficients of the secret and error polynomials must be binomially distributed. Although turning uniform random numbers into binomial samples requires only a few computational cycles for a single polynomial coefficient, this operation does not scale well. Usually, to achieve the amount of randomness required to implement these two types of sampling processes, a Pseudo Random Number Generator (PRNG), e.g., a hash function, is included. This PRNG expands a small random seed extracted from a physical source of entropy in an electronic device. As reported in [KRSS19], the amount of time spent in the execution of hash functions for lattice-based Post-Quantum schemes, like NewHope and Kyber, is in the range of 50% of the total execution time.
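The centered binomial sampling described above can be sketched as follows. The PRNG here is SHAKE-128 from Python's hashlib, and the parameters n and k are illustrative; the real schemes fix them per instance and use scheme-specific domain separation:

```python
import hashlib

def cbd_sample(prng_bytes: bytes, k: int, n: int, q: int):
    """Turn uniform PRNG output into n coefficients from a centered
    binomial distribution: each coefficient is the difference of the
    Hamming weights of two k-bit strings, i.e. a value in [-k, k]."""
    bits = ''.join(f'{b:08b}' for b in prng_bytes)
    coeffs = []
    for i in range(n):
        chunk = bits[2 * k * i : 2 * k * (i + 1)]
        a = chunk[:k].count('1')   # Hamming weight of the first k bits
        b = chunk[k:].count('1')   # Hamming weight of the second k bits
        coeffs.append((a - b) % q)
    return coeffs

# SHAKE-128 as the seed-expanding PRNG (the schemes use Keccak-based XOFs;
# the exact input formatting differs per scheme and is omitted here).
seed = b'\x00' * 32
n, k, q = 8, 2, 3329                      # illustrative toy parameters
stream = hashlib.shake_128(seed).digest(2 * k * n // 8)
poly = cbd_sample(stream, k, n, q)
assert len(poly) == n
assert all(c <= k or c >= q - k for c in poly)  # values in [-k, k] mod q
```

Each coefficient consumes 2k fresh bits, which is why the sampling cost is dominated by the Keccak permutations that produce the bit stream.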
Polynomial Multiplication using the NTT. Besides the generation of random polynomials, polynomial multiplication has been identified as the other most computationally intensive operation [PG12,DB16]. While polynomial addition can be performed with O(n) operations, a naive polynomial multiplication requires O(n^2) operations [FS19]. Thus, it becomes especially prohibitive for high-dimensional polynomials (larger values of n). However, when polynomials are expressed in the spectral domain, the multiplication operation between a pair of polynomials can be performed coefficient-wise.
The Discrete Fourier Transform (DFT) is a general method to transform a polynomial a into the spectral domain. The coefficients of a are transformed as in Eq. (2), where n − 1 is the degree of the polynomial and ω_n the n-th root of unity:

â_i = Σ_{j=0}^{n−1} a_j · ω_n^{i·j},  for i = 0, . . . , n − 1   (2)
To transform the result vector from the spectral domain back into the normal domain, the inverse transform must be applied as in Eq. (3):

a_i = n^{−1} · Σ_{j=0}^{n−1} â_j · ω_n^{−i·j},  for i = 0, . . . , n − 1   (3)
The Number Theoretic Transform (NTT) is a specialized version of the DFT where ω_n is an element of the finite ring Z_q instead of a complex number. For the NTT computation, the coefficient ring must contain primitive roots of unity. The n-th root of unity ω_n ∈ Z_q satisfies ω_n^n = 1 mod q and ω_n^i ≠ 1 mod q for all i ∈ [1, n − 1]. The polynomial multiplication c between a pair of polynomials a and s through the NTT is calculated as in Eq. (4):

c = NTT^{−1}(NTT(a) ⊙ NTT(s))   (4)
where ⊙ denotes a coefficient-wise multiplication and NTT^{−1} is the inverse NTT. The product polynomial is reduced after the polynomial multiplication by the cyclotomic polynomial ϕ(x) = x^n + 1. In this way, the 2n-length product polynomial is reduced to an n-length polynomial. When the 2n-th root of unity γ_n = √ω_n exists, the reduction in the spectral domain simply corresponds to scaling the i-th coefficient of the polynomials with the factor γ_n^i before the NTT, and with γ_n^{−i} after the NTT^{−1}. Thus, padding the polynomials a and s in Eq. (4) with zeros can be avoided. The 2n-th root of unity exists when q ≡ 1 mod 2n holds. The NTT (forward) and NTT^{−1} (inverse) reduced by ϕ(x) can be written as in Eq. (5) and Eq. (6), respectively:

â_i = Σ_{j=0}^{n−1} γ_n^j · a_j · ω_n^{i·j} mod q   (5)

a_i = n^{−1} · γ_n^{−i} · Σ_{j=0}^{n−1} â_j · ω_n^{−i·j} mod q   (6)
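The relations in Eqs. (4)–(6) can be checked with a small Python sketch using toy parameters (n = 8, q = 17, not a real scheme's parameter set); the transforms are written as naive O(n^2) sums to stay close to the equations:

```python
n, q = 8, 17               # toy parameters with q = 1 mod 2n
gamma = 3                  # 2n-th root of unity mod 17 (3^8 = -1, 3^16 = 1)
omega = gamma * gamma % q  # n-th root of unity

def ntt(a):
    """Forward transform with the gamma^j scaling folded in (Eq. 5)."""
    return [sum(a[j] * pow(gamma, j, q) * pow(omega, i * j, q)
                for j in range(n)) % q for i in range(n)]

def intt(A):
    """Inverse transform with gamma^{-i} and n^{-1} folded in (Eq. 6)."""
    n_inv = pow(n, -1, q)
    return [n_inv * pow(gamma, -i, q) *
            sum(A[j] * pow(omega, -i * j, q) for j in range(n)) % q
            for i in range(n)]

def schoolbook_negacyclic(a, s):
    """Naive multiplication reduced mod x^n + 1 (x^n = -1)."""
    c = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                c[i + j] = (c[i + j] + a[i] * s[j]) % q
            else:
                c[i + j - n] = (c[i + j - n] - a[i] * s[j]) % q
    return c

a = [1, 2, 3, 4, 5, 6, 7, 8]
s = [8, 7, 6, 5, 4, 3, 2, 1]
pointwise = [x * y % q for x, y in zip(ntt(a), ntt(s))]  # Eq. (4)
assert intt(pointwise) == schoolbook_negacyclic(a, s)
assert intt(ntt(a)) == a
```

The point of the sketch is the identity itself: the coefficient-wise product in the spectral domain equals the negacyclic product in the normal domain, with no zero-padding thanks to the γ scaling.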
To speed up the DFT from a complexity of O(n^2) to O(n · log(n)), the Fast Fourier Transform (FFT) can be used. This is also true for the NTT, although in that case the name NTT remains. The FFT exploits the symmetric structure of the Fourier transform by splitting one DFT into smaller DFTs. The computation that breaks a larger DFT into smaller DFTs (subtransforms) is called the butterfly operation. The butterfly operation is also used to combine the smaller DFTs into a larger DFT. Two of the most used algorithms to realize the fast NTT and NTT^{−1} operations are the Cooley-Tukey (CT) [CT65] and the Gentleman-Sande (GS) [GS66] algorithms. While the arithmetic of both algorithms is similar, they mainly differ in the way they store and access data and in the way the butterfly operation is performed. The butterfly operation consists of a multiplication by a power of ω_n, an addition, and a subtraction in Z_q. The powers of ω_n are called Twiddle factors. While the Cooley-Tukey algorithm follows the decimation-in-time approach, i.e., x′ ← x + y·ω_n and y′ ← x − y·ω_n with ω_n, x, y ∈ Z_q, the Gentleman-Sande algorithm follows the decimation-in-frequency approach, i.e., x′ ← x + y and y′ ← (x − y)·ω_n.
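The two butterfly variants can be sketched as follows; the identity checked at the end (GS with ω^{-1} undoes CT up to a factor of 2 per output) is one way to see why the inverse NTT ends with a multiplication by n^{-1}. The modulus is a toy value:

```python
q = 17  # toy modulus; x, y, w are elements of Z_q

def ct_butterfly(x, y, w):
    """Cooley-Tukey (decimation in time): x' = x + y*w, y' = x - y*w."""
    return (x + y * w) % q, (x - y * w) % q

def gs_butterfly(x, y, w):
    """Gentleman-Sande (decimation in frequency): x' = x + y, y' = (x - y)*w."""
    return (x + y) % q, ((x - y) * w) % q

# Applying GS with the inverse Twiddle factor after CT returns (2x, 2y):
# over log2(n) layers this accumulates to a factor n, removed by n^{-1}.
x, y, w = 5, 11, 9
w_inv = pow(w, -1, q)
xp, yp = ct_butterfly(x, y, w)
assert gs_butterfly(xp, yp, w_inv) == (2 * x % q, 2 * y % q)
```

This pairing (CT forward, GS inverse) is exactly the algorithm combination discussed later for Kyber.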
Polynomial Multiplication using Karatsuba and Toom-Cook. Polynomial multiplications using the NTT are very efficient. However, the NTT can only be applied to a restricted parameter set. The modulus should be 'NTT-friendly', i.e., it should be chosen such that x^n − 1 can be expressed as a product of linear factors. This, for instance, is the case when q ≡ 1 mod 2n. Not all lattice-based algorithms have NTT-friendly primes. For example, Saber does not use a prime for its modulo reductions and sets q = 2^13. This excludes the usage of the NTT but simplifies the modulo arithmetic. There are other alternatives to efficiently multiply polynomials. Two of the most used algorithms for performing flexible and efficient multiplications are Karatsuba and Toom-Cook. These algorithms iteratively split a large polynomial multiplication into several smaller polynomial multiplications. Karatsuba outperforms Toom-Cook for low- to medium-degree polynomials; otherwise, Toom-Cook performs more efficiently.
Karatsuba Multiplication: The Karatsuba algorithm reduces the quadratic runtime of the polynomial multiplication to a complexity of O(n^{log2(3)}) ≈ O(n^{1.58}). The length-m polynomials a and s are each split into two length-m/2 polynomials: i) a lower part (a_l, s_l); and ii) a higher part (a_h, s_h), such that a = a_l + a_h·x^{m/2} and s = s_l + s_h·x^{m/2}. Instead of four polynomial multiplications of these half-length polynomials, Karatsuba's tweak requires only three different multiplications, as in Eq. (7):

a · s = a_h·s_h·x^m + [(a_l + a_h)(s_l + s_h) − a_l·s_l − a_h·s_h]·x^{m/2} + a_l·s_l   (7)

Toom-Cook Multiplication: The Toom-Cook multiplication is a generalization of the Karatsuba algorithm. Instead of splitting polynomials into two smaller polynomials, the polynomial is split into k parts. For the case k = 2, the k-way Toom-Cook algorithm is identical to the Karatsuba algorithm.
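A minimal recursive sketch of the Karatsuba recombination (for power-of-two lengths, with a schoolbook base case) might look as follows; the recursion threshold of 2 is illustrative:

```python
def schoolbook(a, s):
    """Naive O(m^2) polynomial multiplication (coefficient lists)."""
    c = [0] * (len(a) + len(s) - 1)
    for i, ai in enumerate(a):
        for j, sj in enumerate(s):
            c[i + j] += ai * sj
    return c

def karatsuba(a, s):
    """Multiply two length-m polynomials (m a power of two) using
    3 instead of 4 half-size multiplications (Eq. 7)."""
    m = len(a)
    if m <= 2:                       # illustrative threshold
        return schoolbook(a, s)
    h = m // 2
    al, ah = a[:h], a[h:]            # a = al + ah * x^h
    sl, sh = s[:h], s[h:]
    lo = karatsuba(al, sl)                               # al*sl
    hi = karatsuba(ah, sh)                               # ah*sh
    mid = karatsuba([x + y for x, y in zip(al, ah)],
                    [x + y for x, y in zip(sl, sh)])     # (al+ah)(sl+sh)
    c = [0] * (2 * m - 1)
    for i, v in enumerate(lo):       # + lo, and subtract lo from the middle
        c[i] += v
        c[i + h] -= v
    for i, v in enumerate(hi):       # + hi * x^(2h), subtract hi from middle
        c[i + 2 * h] += v
        c[i + h] -= v
    for i, v in enumerate(mid):      # + (al+ah)(sl+sh) * x^h
        c[i + h] += v
    return c

a = [1, 2, 3, 4]
s = [5, 6, 7, 8]
assert karatsuba(a, s) == schoolbook(a, s)
```

Reducing the full-length product modulo x^n + 1 (or applying Saber's power-of-two modulus) is left out here, since the splitting trick itself is independent of the ring reduction.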

Hardware Accelerators
In order to enhance the performance of lattice-based cryptography, in this work the following hardware accelerators were designed: i) an NTT and modular arithmetic accelerator; ii) an accelerator used for Karatsuba/Toom-Cook multiplications; iii) a Keccak accelerator for the pseudo random number generation; and iv) a binomial sampling accelerator. These modules reduce the performance bottlenecks discussed in Section 2 and speed up the overall performance.

Number Theoretic Transform (NTT) Design
In this subsection, the optimization techniques and design choices for our NTT architecture are presented. The performance and size of the architecture for the NTT described in Section 2 are highly influenced by: i) the calculation of Twiddle factors; ii) the bit-reversal computation; and iii) the memory access strategy. Our NTT architecture is mainly designed for the NewHope instances NewHope-512 and NewHope-1024, but it also supports the Kyber instances Kyber-512, Kyber-768, and Kyber-1024. As Saber chooses the modulus q = 2^13, the NTT is not applicable to it. While NewHope-512 and NewHope-1024 have polynomial lengths of n = 512 and n = 1024, respectively, Kyber has the same polynomial length of n = 256 for all instances. As the polynomial length highly affects the NTT costs, different design decisions were made for NewHope and Kyber.
Calculation of Twiddle factors (powers of ω_n): Most optimized software implementations of the NTT precompute the Twiddle factors and store them in a separate memory location. As the generated tables for storing these Twiddle factors are very large for high-degree polynomials, previous hardware architectures devised an approach to compute these values (powers of ω_n) on-the-fly [RVM+14, FS19]. Due to the high memory access latency of long tables, calculating the Twiddle factors on-the-fly can even be faster than loading them from memory. In order to combine the advantages of precomputation and on-the-fly Twiddle factor computation, in this work we propose a hybrid approach. For larger polynomials, such as in NewHope-512 and NewHope-1024, on-the-fly computation is used. Otherwise, precomputed tables are used, i.e., for all Kyber instances.
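The trade-off can be illustrated with toy parameters (n = 8, q = 17): on-the-fly generation keeps only the log2(n) layer constants ω_m and regenerates the Twiddle factors of a layer with one modular multiplication each:

```python
n, q, omega = 8, 17, 9  # toy parameters; omega is a primitive n-th root of unity mod q

# Precomputed approach: a full table with n/2 Twiddle factors per direction.
table = [pow(omega, i, q) for i in range(n // 2)]

# On-the-fly approach: only the log2(n) layer constants w_m = omega^(n/m)
# are stored; the running Twiddle factor is updated by a single modular
# multiplication per butterfly.
layer_consts = {m: pow(omega, n // m, q) for m in (2, 4, 8)}  # m = 2, 4, ..., n
w, w_m = 1, layer_consts[n]
regenerated = []
for _ in range(n // 2):
    regenerated.append(w)
    w = w * w_m % q
assert regenerated == table  # the recurrence reproduces the precomputed table
```

For NewHope-1024 the full table would hold hundreds of 14-bit entries per direction, while the recurrence needs only the handful of layer constants, which is what makes the hybrid approach attractive for the larger polynomials.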
Bit-reversal computation: Performing a reorder or bit-reversal step before or after the NTT computation can be an expensive operation. There are several in-place variants of the Cooley-Tukey and Gentleman-Sande algorithms, in the following referred to as NTT_CT(br→no), NTT_CT(no→br), NTT_GS(br→no), and NTT_GS(no→br). The difference between these variants is the order of the input and output data. In the bit-reversed-to-normal-order (br→no) variant, the input coefficients must be loaded in bit-reversed order and the output is given in normal order. For the (no→br) variant, the input is in normal order and the output in bit-reversed order. Previous works avoid the bit-reversal step by using a combination of different algorithms for the forward and inverse transforms, e.g., NTT_CT(no→br) and INV-NTT_GS(br→no) [POG15]. However, this combination requires that the Twiddle factors are loaded during the NTT computation in bit-reversed order, which increases the complexity of the on-the-fly computation. In order to avoid large memories, we decided to use the basic NTT_CT(br→no) algorithm for the forward as well as the inverse transform of NewHope. The input coefficients of the forward transform are randomly sampled from the error distribution; consequently, no bit-reversal is required for the forward transform. For Kyber, the precomputation of the Twiddle factors is less expensive due to the smaller polynomial length, so we use the combination of NTT_CT(no→br) and INV-NTT_GS(br→no) to completely avoid the bit-reversal step. This approach follows the reference implementation of Kyber, which uses these two NTT variants.
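The bit-reversal permutation underlying the br→no and no→br orders can be sketched as:

```python
def bit_reverse(x: int, bits: int) -> int:
    """Reverse the 'bits' least significant bits of x."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def bitrev_permute(a):
    """Reorder a length-2^k list into bit-reversed index order."""
    n = len(a)
    k = n.bit_length() - 1
    return [a[bit_reverse(i, k)] for i in range(n)]

# For n = 16 this yields the load order a0, a8, a4, a12, a2, a10, a6, a14, ...
order = bitrev_permute(list(range(16)))
assert order[:8] == [0, 8, 4, 12, 2, 10, 6, 14]
# The permutation is an involution: applying it twice restores the order.
assert bitrev_permute(bitrev_permute(list(range(16)))) == list(range(16))
```

Because random error polynomials are order-agnostic, a br→no forward transform can simply declare its freshly sampled input to already be in bit-reversed order, which is the trick used above for NewHope.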

Memory Access Strategy:
The main performance bottlenecks of the NTT are the load and store operations that transfer the coefficients between the main memory and the processing element. A non-optimized NTT architecture always loads two coefficients from the memory into the register file, performs the butterfly operation, and stores the result. The NTT can be divided into log2(n) layers, with n being the polynomial length. A naive approach requires loading and storing all n coefficients in each layer, resulting in a total of n · log2(n) load and n · log2(n) store operations. To decrease the memory access overhead, two methods can be used: merging NTT layers or storing two coefficients in one memory word.
Merging the NTT layers was already discussed in [AJS16, BKS19, ABCG20]. The goal is to keep a certain amount of coefficients as long as possible within the register file. Let l denote the total amount of coefficients that can be stored within the register file. After computing the butterfly operations for the l coefficients within the first layer, the results are not written back to the main memory. Instead of completing the first layer, the next layer is already processed. This method saves the memory accesses for the next layer. When l coefficients can be loaded into the registers, up to log2(l) layers can be merged. Previous works loaded up to 16 coefficients into the registers to process up to four NTT levels without reloading coefficients between the layers. In [ABCG20], 3 + 3 + 3 of the 9 layers were merged for NewHope-512, and 3 + 3 + 1 of the 7 layers for Kyber. In this work, we use the complete floating point register set of the RISC-V core (32 × 32 bit) to store l = 2 × 32 = 64 coefficients (each 16 bit) for NewHope. An address controller loads the coefficients directly from the floating point register set into two butterfly units such that multiple NTT levels can be calculated. As Kyber requires different algorithms for the forward and inverse transforms, no dedicated address controllers were developed. For Kyber, the general purpose register set is therefore used to store l = 16 coefficients (each 16 bit), allowing 3 + 3 + 1 layers to be merged.
To further reduce the amount of load and store operations, multiple coefficients can be stored in a single memory word (32 bit). As the polynomial coefficients of NewHope and Kyber can be expressed with at most 16 bit, two coefficients are stored in a single word. After performing the butterfly operation, the intermediate results of the coefficients are swapped in order to prepare them for the next layer. Note that precalculated Twiddle factors can be stored in any desired order. This property increases the possibilities of traversing the NTT structure and allows the swapping operation to be avoided.

Figure 1 illustrates an example of the applied NTT layer merging and swapping techniques for the NTT_CT(br→no) variant that we use for NewHope. In the small example for n = 16, eight coefficients can be loaded from the main memory into the registers (l = 8). Moreover, two butterfly operations can be processed in parallel within the units BF0 and BF1. The red rectangles show the order in which the coefficient pairs are processed by the two butterfly units. At the beginning, the first eight coefficients a_0, a_8, a_4, a_12, a_2, a_10, a_6, a_14 are loaded pairwise into the register file. In the following, we use the term distance to describe the difference between the positions of two coefficients in an array. In the first layer, the distance between the coefficients that are processed together is equal to one, i.e., they are located in the same register. The coefficient pairs (a_0, a_8), (a_4, a_12) are processed first using the two butterfly units, and (a_2, a_10), (a_6, a_14) are processed next. Instead of storing the results back to the main memory, the second layer is already processed. In the second layer, the distance between the coefficients that are processed together is equal to two. It requires the results of the first layer for the coefficient pairs (a_0, a_4), (a_2, a_6) and (a_8, a_12), (a_10, a_14).
As these coefficient pairs are not next to each other, i.e., they are not within the same register, a swap is performed by the butterfly units, indicated in Figure 1 by the blue arrows. This swapping operation, which is an exchange of 16-bit values between two registers, can be performed in hardware almost for free. In the third layer, the distance between the coefficients that are processed together is four, which means that some required coefficients would still be within the register set. However, not all coefficients required for the swapping operation to prepare the content of the registers for the next layer are located in the register set. Therefore, the amount of layers that can be merged is log2(l) − 1, which is two in the example shown in Figure 1. However, the swapping technique ensures that the right coefficients are always in the same memory word, which significantly decreases the accesses to the main memory. Table 1 illustrates the register content and the input values for the butterfly units BF0 and BF1 for the small example in Figure 1. An unoptimized version of the NTT diagram is shown in Figure 9 (Appendix A). This unoptimized version only loads two coefficients at the same time, has only one butterfly unit, and processes layer after layer. While in this small example the unoptimized version shown in Figure 9 requires n · log2(n) = 16 · 4 = 64 memory accesses, the optimized version shown in Figure 1 (n = 16, l = 8) requires n/2 · (log2(n) + 1 − (log2(l) − 1)) = 24 memory accesses.

Figure 1: NTT layer merging and swapping for the NTT_CT(br→no) variant (n = 16, l = 8). Two coefficients are stored in a single word, eight coefficients can be loaded into the register file, and two pairs of coefficients can be processed in parallel. The red boxes indicate which coefficients are stored together in one word and in which order the coefficients are processed by the two butterfly units. The blue arrows show the coefficients that are swapped after the butterfly operations. For l = 8, log2(l) − 1 = 2 layers are merged.
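The two memory-access counts can be encoded directly from the formulas above; the NewHope-1024 figure at the end simply applies the same formula with l = 64 and is not a measured result:

```python
import math

def naive_accesses(n):
    """Naive NTT: load and store all n coefficients in each of the
    log2(n) layers."""
    return n * int(math.log2(n))

def merged_accesses(n, l):
    """Two coefficients per 32-bit word and log2(l) - 1 merged layers:
    n/2 words, each touched once per group of merged layers."""
    return n // 2 * (int(math.log2(n)) + 1 - (int(math.log2(l)) - 1))

assert naive_accesses(16) == 64       # unoptimized example from the text
assert merged_accesses(16, 8) == 24   # optimized example from the text
# Applying the formula to NewHope-1024 with l = 64 register coefficients:
assert merged_accesses(1024, 64) == 3072   # vs. naive_accesses(1024) == 10240
```

The gap between the two counts grows with n, which is why the merging pays off most for the large NewHope polynomials.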
Table 1: Register content and input for the two butterfly units BF0 and BF1 for the example n = 16, l = 8.

Reg | Load (part 1) | Layer 1 input | Layer 2 input (after swap) | Load (part 2)
R0  | a_0, a_8      | a_0, a_8      | a_0, a_4                   | a_1, a_9
R1  | a_4, a_12     | a_4, a_12     | a_8, a_12                  | a_5, a_13
R2  | a_2, a_10     | a_2, a_10     | a_2, a_6                   | a_3, a_11
R3  | a_6, a_14     | a_6, a_14     | a_10, a_14                 | a_7, a_15

Algorithm 1 shows the NTT_CT(br→no) algorithm with the discussed optimizations: on-the-fly Twiddle factor calculation, storing two coefficients in one memory word, and the swapping operation. The merging technique is not directly included in this algorithm, but will be discussed as well. The algorithm can be divided into three steps: input preparation, calculation of the first NTT layers, and calculation of the last NTT layer.
In the first step, the coefficients of polynomial a are stored in a bit-reversed order in the main memory (line 1). The input coefficients are stored in a way that two coefficients are always stored in a single word.
In the second step (lines 2-19), the first log2(n) − 1 NTT layers are calculated. The Twiddle factor ω is initialized in line 4 and updated during runtime by a modular multiplication with ω_m (line 17). The value of ω_m depends on the current NTT layer, indicated by the variable m. In practice, the values for ω_m still need to be precomputed.

Algorithm 1: NTT transform
However, only one entry for each layer is required, resulting in log2(n) precomputations. To merge the multiplication by powers of γ, ω is initialized with the square root of ω_m (line 4). The same precomputations as for ω_m can be used because ω_m^{1/2} and the ω_m of the previous layer have the same value. Only for the first layer (m = 2) does the value ω_m^{1/2} have to be computed.
The most important part of Algorithm 1 is the butterfly operations, described in the inner loop (lines 7-15). At the beginning, two coefficient pairs are loaded from the main memory locations MEM_{k+j} and MEM_{k+j+m/2} and are assigned to two temporary variables, each consisting of a lower halfword L1/L2 and a higher halfword H1/H2 (lines 7-10). Then, two butterfly operations are performed (lines 11-13). Finally, the result is swapped and stored in the respective memory locations (lines 14-15).
The third and last step is the computation of the last NTT layer (lines 20-30). The operations of the last layer are similar to the operations of the other layers. The main difference is that no swapping operation is required.
The algorithmic modifications required for the merging technique include the manipulation of the start and end values of the outer and inner loops (lines 2 and 6). For the hardware architecture, we consider the 32 registers available in the floating point register set for storing l = 64 coefficients. In this way, five layers can be merged. The nested loops can be split into n/64 parts. The outer loop, which indicates the current layer, starts for each part from 2^1 and terminates at the value 2^5, i.e., m = {2, 4, 8, 16, 32}. The inner loop iterates for the i-th part from k = 32i to k = 32i + 31, where i ∈ [0, . . . , n/64 − 1]. Instead of loading and storing the coefficients from the main memory, the coefficients are kept within the register set and are only refreshed between the n/64 parts.
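The loop structure described above can be sketched in Python as a plain in-place NTT_CT(br→no) with on-the-fly Twiddle updates; the γ scaling and the two-coefficients-per-word packing of Algorithm 1 are omitted, and the parameters are a toy set (n = 8, q = 17):

```python
def bit_reverse(x, bits):
    r = 0
    for _ in range(bits):
        r, x = (r << 1) | (x & 1), x >> 1
    return r

def ntt_ct_br_to_no(a, omega, q):
    """In-place NTT_CT(br->no): input in bit-reversed order, output in
    normal order. The Twiddle factor w is updated on the fly by one
    modular multiplication with the per-layer constant w_m."""
    n = len(a)
    m = 2
    while m <= n:
        w_m = pow(omega, n // m, q)  # one of the log2(n) precomputed values
        for k in range(0, n, m):
            w = 1
            for j in range(m // 2):
                t = w * a[k + j + m // 2] % q
                u = a[k + j]
                a[k + j] = (u + t) % q           # butterfly: x' = x + y*w
                a[k + j + m // 2] = (u - t) % q  #            y' = x - y*w
                w = w * w_m % q                  # on-the-fly Twiddle update
        m *= 2
    return a

# Check against the O(n^2) definition of the transform.
n, q, omega = 8, 17, 9
a = [3, 1, 4, 1, 5, 9, 2, 6]
reference = [sum(a[j] * pow(omega, i * j, q) for j in range(n)) % q
             for i in range(n)]
bits = n.bit_length() - 1
a_br = [a[bit_reverse(i, bits)] for i in range(n)]
assert ntt_ct_br_to_no(a_br, omega, q) == reference
```

Merging layers then amounts to restricting the m and k loops to the register-resident slice of coefficients, exactly as described for the n/64 parts above.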

Number Theoretic Transform (NTT) Architecture
Our proposed hardware architecture for the NTT and Modular Arithmetic Unit is shown in Figure 2. It is composed of three main modules: the Address Unit, the Twiddle Update Unit, and the Modular Arithmetic Unit. The architecture is optimized for the NTT_CT(br→no) algorithm, which is used in NewHope. However, it also supports different NTT algorithms. The input of the NTT and Modular Arithmetic Unit is the content of two registers from the processor's register bank (with lower halfwords L1/L2 and higher halfwords H1/H2), and the control signals from the instruction decoder. The output consists of the processed input values and the control signals for the register bank.

Address Unit: This module controls the merging operation by setting the two read and write addresses for the register bank according to Algorithm 1 (with the modified loop start and end values) in order to merge five NTT layers. In this work, the automatic address calculation is only supported for NTT_CT(br→no) and therefore only for NewHope. Supporting different NTT variants would be possible but would also lead to a non-negligible increase in area. The address calculation is triggered by the multiple_bf signal. At each clock cycle, three sets of signals are updated: the read addresses raddr, the write addresses waddr, and the write enable signal wen. The Address Unit is also responsible for selecting the correct index for the small LUT of the precalculated values of ω_m and for triggering the Twiddle factor update. When the NTT and Modular Arithmetic Unit is in single operation mode, the Address Unit remains in the idle state.
Twiddle Update Unit: This module calculates the Twiddle factors ω on-the-fly. The update of ω is always triggered by the update_w signal, which is either set by the Address Unit or by the corresponding signal from the instruction decoder. The current Twiddle factor ω = ω · ω_m mod q is updated as described in lines 17 and 29 of Algorithm 1. The value of ω is initialized according to line 4 and line 22. The value for ω_m is determined by the current NTT layer. All log2(n) possible values for ω_m are precalculated and stored in a small LUT. The Butterfly Operation calculates two butterfly operations {L1 − H1 · ω, L1 + H1 · ω} and {L2 − H2 · ω, L2 + H2 · ω} in parallel. The calculation is triggered by the butterfly signal. In order to calculate two butterfly operations in parallel within the last NTT layer (Algorithm 1, lines 23-30), the previous Twiddle factor ω′ is also forwarded to the Modular Arithmetic Unit. As a result, the utilization of the Modular Arithmetic Unit can be increased and the execution time of the inverse NTT can be decreased. The Reordering MUX is responsible for preparing the coefficients for the next round (layer) by swapping the coefficients after the butterfly operation for all rounds, except for the last one. The architecture was extended in order to support the calculation of the decimation-in-frequency butterfly operation, i.e., the calculation of {L1 + H1, (L1 − H1) · ω} and {L2 + H2, (L2 − H2) · ω}. For reasons of better visibility, this extension is not shown in Figure 3. The calculation of L1 + H1 and L2 + H2 can be done by the existing modulo adders. To avoid any combinatorial loop in the circuit, either two additional modulo multipliers or modulo subtractors are required. As the modulo subtractors are more compact (less area consuming) than the modulo multipliers, they were used in this work to enhance the design. These subtractors were placed on top of the existing multipliers.
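The two butterfly variants can be written out as plain C for reference (reduction by % for clarity; the hardware unit itself uses Montgomery multipliers, and the NewHope modulus is used here as an example):

```c
#include <stdint.h>

#define Q 12289u  /* NewHope modulus, used as an example */

/* Decimation-in-time butterfly: (L, H) -> (L + H*w, L - H*w) mod q */
static void bf_dit(uint32_t *lo, uint32_t *hi, uint32_t w)
{
    uint32_t t = (uint32_t)(((uint64_t)*hi * w) % Q);
    uint32_t l = *lo;
    *lo = (l + t) % Q;
    *hi = (l + Q - t) % Q;
}

/* Decimation-in-frequency butterfly: (L, H) -> (L + H, (L - H)*w) mod q */
static void bf_dif(uint32_t *lo, uint32_t *hi, uint32_t w)
{
    uint32_t l = *lo, h = *hi;
    *lo = (l + h) % Q;
    *hi = (uint32_t)(((uint64_t)(l + Q - h) * w) % Q);
}
```

The hardware computes two such butterflies per cycle, one on the (L1, H1) lane and one on the (L2, H2) lane.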
The Post-Processing Operation (mul_gamma1, mul_gamma2, update_gamma) is used to calculate the multiplications with n^-1 and γ^-i at the inverse NTT operation. These multiplications can be merged with the last layer of the inverse NTT. Therefore, first the control signal mul_gamma1 ensures that the coefficients a_i and a_{i+n/2} of the first input (L1, H1) are multiplied with γ1 = n^-1 γ^-i and γ2 = n^-1 γ^{-i-n/2}, respectively. Before the multiplication with the first coefficients a_0 and a_{n/2} starts, γ1 is initialized with n^-1 γ_n^-1 mod q and γ2 with n^-1 γ_n^{-n/2} mod q. The mul_gamma2 control signal is used to perform the multiplications with the next coefficient pair a_{i+1} and a_{i+1+n/2}. The Reordering MUX brings the results of the mul_gamma1 and mul_gamma2 operations into the desired order {a_{i+1}, a_i} and {a_{i+1+n/2}, a_{i+n/2}} and assigns the result to the output. The update_gamma signal is used to update γ1 = γ1 γ_n^-1 and γ2 = γ2 γ_n^-1 after each mul_gamma1 and mul_gamma2 operation.
The Vectorized Modulo Arithmetic (mod_mul, mod_add, mod_sub) operation uses the Modular Arithmetic Unit to calculate vectorized modulo multiplications (L1 · L2 mod q and H1 · H2 mod q), additions (L1 + L2 mod q and H1 + H2 mod q), and subtractions (L1 − L2 mod q and H1 − H2 mod q), where L1 and L2 are the lower 16 bits of the two forwarded registers from the register set and H1 and H2 are the higher 16 bits. The two results from the modulo multipliers, adders, or subtractors are combined in the Reordering MUX and assigned to the output signal. The corresponding control signals mod_mul, mod_add, and mod_sub are used to perform the vectorized modulo arithmetic and to output the desired result. Similar to the Butterfly Operation, the Vectorized Modulo Arithmetic belongs to the category of packed (vectorized) arithmetic and follows the Single Instruction Multiple Data (SIMD) principle: a single instruction processes multiple data elements in parallel.
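The packed semantics can be modeled in C as follows (a functional sketch, not the RTL; q is passed explicitly here, whereas in hardware it is set by the configuration):

```c
#include <stdint.h>

/* Two 16-bit coefficients are packed into one 32-bit register word. */
static uint32_t pack2(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Vectorized modulo multiplication: both 16-bit lanes in one operation. */
static uint32_t vec_mod_mul(uint32_t a, uint32_t b, uint32_t q)
{
    uint32_t lo = (uint32_t)(((uint64_t)(a & 0xFFFF) * (b & 0xFFFF)) % q);
    uint32_t hi = (uint32_t)(((uint64_t)(a >> 16) * (b >> 16)) % q);
    return lo | (hi << 16);
}

/* Vectorized modulo addition on both lanes. */
static uint32_t vec_mod_add(uint32_t a, uint32_t b, uint32_t q)
{
    uint32_t lo = ((a & 0xFFFF) + (b & 0xFFFF)) % q;
    uint32_t hi = ((a >> 16) + (b >> 16)) % q;
    return lo | (hi << 16);
}
```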

Bit-Reversal
The bit-reversal operation is a particular permutation of a sequence of elements. As discussed in Section 3.1, it is a key part of the NTT. Consider an array of n elements; the index of the i-th element a_i can also be represented in binary notation i = {b_0, b_1, . . . , b_{log2(n)−1}}. The bit-reversal operation swaps the i-th element with the j-th element, which has the bit-reversed index j = {b_{log2(n)−1}, b_{log2(n)−2}, . . . , b_0}. Reversing the bits is an expensive software task. A straightforward approach with a runtime of O(m) is looping through all m = log2(n) bits of an integer. The fastest solution is to use a LUT. This LUT requires n entries, each consisting of m bits. In particular for large arrays, i.e., for large polynomial lengths, this approach leads to a high memory footprint (large LUT). In order to achieve a better trade-off between memory footprint and performance, we extend the RISC-V ISA and develop special instructions for the bit-reversal operation. These instructions are derived from the store word/halfword operations sw, sh. In addition to the value to be stored and the destination address, the new instructions also take an offset. This offset is added in bit-reversed order to the destination address. The functionality of the new instructions can be expressed as MEM[rs1+bitrev(rs2)] ← rd, where rs1 contains the destination address, rs2 the offset, and rd the value that is stored. Reversing the offset in hardware can be efficiently solved through rewiring. The bit-reversal of a whole polynomial can be performed by loading each coefficient in a loop and storing it with the new instruction; in this case, the offset simply corresponds to the loop counter.
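A C model of the address computation and the resulting polynomial permutation might look like the following (in hardware the reversal is pure rewiring; here it is the O(m) bit loop mentioned above):

```c
#include <stdint.h>

/* Reverse the lowest m = log2(n) bits of x, mirroring the bitrev(rs2)
   step of the new store instructions. */
static uint32_t bitrev(uint32_t x, unsigned m)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < m; i++)
        r = (r << 1) | ((x >> i) & 1);
    return r;
}

/* Bit-reversal of a whole polynomial: load each coefficient in a loop and
   store it at the bit-reversed index, i.e., MEM[rs1 + bitrev(rs2)] <- rd
   with the loop counter as offset rs2. */
static void poly_bitrev(uint16_t *dst, const uint16_t *src,
                        unsigned n, unsigned m)
{
    for (uint32_t i = 0; i < n; i++)
        dst[bitrev(i, m)] = src[i];
}
```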

Results for the NTT Accelerator
In this paper, we use the PULP RISC-V Toolchain 1 (version 7.1.1) with the optimization flag 'O3' (optimization for speed) to compile the code. A description of the used RISC-V platform is given in Section 4. Table 2 summarizes the clock cycle count required for performing the NTT, inverse NTT, and bit-reversal step in NewHope and Kyber. While in NewHope different polynomial lengths are used for different security levels (n = 512 and 1024), Kyber uses the same polynomial length for all security levels (n = 256). In all implementations mentioned in Table 2, the bit-reversal step was eliminated for Kyber by using two different NTT variants. The baseline for our optimized implementations are the clean C-code versions of the PQ-M4 project in [KRSS19], which are based on the reference implementations in [AAB + 19] and [ABD + 19].
In comparison to the baseline implementation on RISC-V, the optimized NewHope implementation achieves a speedup factor of 13.18/12.40 (NTT/NTT^-1) for length-512 polynomials and a speedup factor of 13.01/11.95 (NTT/NTT^-1) for length-1024 polynomials. The integrated optimization techniques (on-the-fly Twiddle factor calculation, merging of five NTT layers with direct access of the NTT and Modular Arithmetic Unit to the processor's floating point register set, and the storage of two coefficients in one word) result in a significant performance improvement when compared to the baseline implementation. The new bit-reversal instruction not only eliminates the LUT for this step (1024 bytes for NewHope-512 and 2048 bytes for NewHope-1024), but also leads to an improvement in speed. Although the architecture of the NTT and Modular Arithmetic Unit is not optimized for the NTT CT no→br and INV-NTT GS br→no variants, Kyber also achieves a considerable speedup factor of 17.93/27.79 (NTT/NTT^-1) when compared to the baseline implementation. In fact, only the parallel decimation-in-frequency butterfly operations and the vectorized modulo multiplication of the NTT and Modular Arithmetic Unit are used.
Our results show that our architecture beats the clock cycle count of the latest assembler-optimized ARM Cortex-M4 implementations in [ABCG20], the RISC-V implementation in [AEL + 20], which uses a finite field multiplier to accelerate the NTT, and the Hardware/Software Co-Design architecture of NewHope in [FSMG + 19], which uses a loosely coupled NTT accelerator. In [AEL + 20], the bit-reversal step was also eliminated for NewHope. However, two different NTT variants have to be used, implying an increase in code size. In [FSMG + 19], the bit-reversal costs were hidden by a re-wiring during the transfer of the coefficients to the memory of the NTT accelerator.
In addition, in contrast to the NewHope baseline implementation and the implementations presented in [ABCG20, AEL + 20], our proposed architecture does not require large LUTs. While the NewHope reference implementation in [AAB + 19] requires 7n bytes (n denotes the polynomial length) for storing LUTs for the bit-reversal step, Twiddle factors, and pre-/post-processing, our implementation only requires 4 · log2(n) + 4 bytes. In concrete numbers, the LUTs were reduced from 7168 to 44 bytes for NewHope-1024 and from 3584 to 40 bytes for NewHope-512. In contrast to the design in [FSMG + 19], no additional data memory is required for the accelerator. A summary of the required hardware costs for the NTT as well as the whole NewHope and Kyber implementations is presented in Section 5.

Hardware Accelerator for Karatsuba/Toom-Cook Polynomial Multiplication
Recent works propose a combination of the Karatsuba and Toom-Cook methods to perform the polynomial multiplication in Saber [DKRV18, DKRV19, MKV20]. Using the four-way Toom-Cook method, the product of a pair of 256-coefficient polynomials is split into seven polynomial multiplications with polynomials of length 64. These 64 × 64-coefficient multiplications are then further split into 16 × 16-coefficient multiplications using two levels of Karatsuba. After performing the recursive splittings, the polynomial length is small enough to efficiently perform the schoolbook multiplication. At a certain point, further splitting the polynomials does not bring any performance advantage since the savings derived from the multiplication do not outweigh the increasing number of additions. Similar to [DKRV18] and the reference implementation provided to NIST, in this work we stop the recursive splitting at a polynomial length of 16.
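The Karatsuba recursion with a length-16 schoolbook base case can be sketched as follows (a simplified model, not the optimized Saber code: one-level Karatsuba applied recursively for power-of-two lengths up to 64, with coefficients reduced modulo 2^16 via uint16_t wrap-around, matching Saber's power-of-two modulus):

```c
#include <stdint.h>
#include <string.h>

/* Multiply two polynomials of length n (power of two, n <= 64) modulo 2^16.
   Karatsuba splitting down to length 16, then schoolbook. Result has 2n-1
   coefficients. */
static void poly_mul(uint16_t *r, const uint16_t *a, const uint16_t *b,
                     unsigned n)
{
    if (n <= 16) {                       /* schoolbook base case */
        memset(r, 0, (2 * n - 1) * sizeof(uint16_t));
        for (unsigned i = 0; i < n; i++)
            for (unsigned j = 0; j < n; j++)
                r[i + j] += (uint16_t)((uint32_t)a[i] * b[j]);
        return;
    }
    unsigned h = n / 2;
    uint16_t z0[128], z1[128], z2[128], as[64], bs[64]; /* sized for n <= 64 */
    poly_mul(z0, a, b, h);               /* z0 = a_lo * b_lo */
    poly_mul(z2, a + h, b + h, h);       /* z2 = a_hi * b_hi */
    for (unsigned i = 0; i < h; i++) {   /* pointwise sums for the middle term */
        as[i] = (uint16_t)(a[i] + a[h + i]);
        bs[i] = (uint16_t)(b[i] + b[h + i]);
    }
    poly_mul(z1, as, bs, h);             /* z1 = (a_lo+a_hi)(b_lo+b_hi) */
    memset(r, 0, (2 * n - 1) * sizeof(uint16_t));
    for (unsigned i = 0; i < 2 * h - 1; i++) {
        r[i]         += z0[i];
        r[i + h]     += (uint16_t)(z1[i] - z0[i] - z2[i]); /* middle term */
        r[i + 2 * h] += z2[i];
    }
}
```

One Karatsuba level thus replaces four half-length products by three, at the cost of the extra additions mentioned above.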
Since the coefficients in Saber are, similar to NewHope and Kyber, smaller than 16 bits, they are suitable for packed (vectorized) arithmetic. Although the ISA extension of the RISC-V specification 2 for packed arithmetic is still in draft mode, the RISC-V core used in this work already supports some packed operations [TGS19] that are useful for the polynomial multiplication in Saber; these instructions are comparable to those of the ARM Cortex-M4. Saber and also other Post-Quantum NIST candidates, such as some NTRU variants, use a modulus that is a power of two with q ≤ 2^16. For this reason, it is possible to calculate and store two multiplications in parallel. Moreover, the schoolbook multiplication as well as the Karatsuba step in Eq. 7 can benefit from a Multiply Accumulate (MAC) function. Therefore, we develop a vectorized modulo multiply accumulate function, in the following denoted as pq.mac. It multiplies the two packed 16-bit halfwords of rs1 and rs2 and accumulates the products onto the corresponding halfwords of rd, each reduced modulo q′. In this work, the parameter q′ in pq.mac was set to 2^16. In this way, it is suitable for all schemes that use a power-of-two modulus smaller than or equal to 2^16. After performing the polynomial multiplication, the result can be reduced with the original modulus q (simple masking operation) because (a mod q′) mod q ≡ a mod q as long as both moduli are a power of two and q′ ≥ q. The hardware architecture for the pq.mac operation is shown in Figure 4. Using the pq.mac operation, the amount of clock cycles for the polynomial multiplication in Saber was reduced from 104,074 to 71,349.
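A functional C model of pq.mac with q′ = 2^16 could look like this (the uint16_t wrap-around of each lane is exactly the reduction modulo 2^16):

```c
#include <stdint.h>

/* Software model of the proposed pq.mac instruction with q' = 2^16: two
   packed 16-bit multiply-accumulates per call, rd acting as accumulator. */
static uint32_t pq_mac(uint32_t rd, uint32_t rs1, uint32_t rs2)
{
    /* lower lane: rd.lo + rs1.lo * rs2.lo (mod 2^16) */
    uint16_t lo = (uint16_t)((rd & 0xFFFF) + (rs1 & 0xFFFF) * (rs2 & 0xFFFF));
    /* upper lane: rd.hi + rs1.hi * rs2.hi (mod 2^16) */
    uint16_t hi = (uint16_t)((rd >> 16) + (rs1 >> 16) * (rs2 >> 16));
    return (uint32_t)lo | ((uint32_t)hi << 16);
}
```

After the full polynomial multiplication, masking each lane with q − 1 yields the result modulo the original power-of-two modulus q.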

Pseudo Random Number Generation
Most Post-Quantum cryptosystems require a huge amount of randomness. This is particularly true for the generation of random polynomials in lattice-based cryptography. To produce this large amount of randomness, usually a small seed is expanded using a PRNG. Three primitives are especially suitable for this task: SHA-3, AES, and ChaCha20. Among these three alternatives, SHA-3 is the most energy-efficient option because it generates the highest amount of pseudo-random bits per round [BUC19]. SHA-3 is a subset of the Keccak family standardized by NIST. The standard lists four specific instances of SHA-3 and two extendable-output functions (SHAKE128 and SHAKE256). While the SHA-3 functions have a specified output length, the two SHAKE variants permit extracting a variable length of output data, which makes them suitable candidates for pseudorandom bit generation. All instances are based on a 1600-bit state. This state can be represented as a three-dimensional array containing 25 words, each with a length of 64 bits. These words can be structured in a cube with x and y coordinates indexed by 0 ≤ x < 5 and 0 ≤ y < 5. Each bit of this cube can be addressed with A[x, y, z]. In order to facilitate the description of the applied functions, the following conventions are used: the part of the state which represents a word is called a lane; a two-dimensional part of the state with fixed z is called a slice; and all lanes with the same x-coordinate form a sheet.
The most important part of the SHA-3 and SHAKE primitives is the Keccak permutation f-1600, which consists of 24 rounds. Each round is characterized by the five consecutive steps θ, ρ, π, χ, and ι. These steps take a state array A as input and output B, a processed new state array.
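As an illustration of the conventions above, the θ step can be modeled in C with the usual lane-oriented layout A[x + 5y] (a reference model, not the combinatorial hardware; the remaining steps ρ, π, χ, and ι operate on the same representation):

```c
#include <stdint.h>

#define ROTL64(x, s) (((x) << (s)) | ((x) >> (64 - (s))))

/* Theta step of one Keccak round on the 5x5 lane state A[x + 5*y]:
   each lane is XORed with the parities of its two neighboring sheets. */
static void keccak_theta(uint64_t A[25])
{
    uint64_t C[5], D[5];
    for (int x = 0; x < 5; x++)            /* C[x] = parity of sheet x */
        C[x] = A[x] ^ A[x + 5] ^ A[x + 10] ^ A[x + 15] ^ A[x + 20];
    for (int x = 0; x < 5; x++)            /* combine the two neighbor sheets */
        D[x] = C[(x + 4) % 5] ^ ROTL64(C[(x + 1) % 5], 1);
    for (int x = 0; x < 5; x++)
        for (int y = 0; y < 5; y++)
            A[x + 5 * y] ^= D[x];
}
```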

Hardware Accelerator -Pseudo Random Number Generation
The Keccak team presented three different optimized hardware implementations of Keccak: a high-speed core, a mid-range core (trade-off between speed and area), and a low-area coprocessor [BDH + 20]. A description of these implementations can be found on their website 3 . The high-speed core operates in a standalone fashion and requires no further resources for the Keccak calculations. Chunks of a message can be sent to the accelerator and the core will output the hash value. This approach was also used in [FSMG + 19]. The authors connected a loosely coupled high-performance Keccak accelerator to the AHB in order to accelerate the Keccak operations in NewHope. The low-area solution (of the Keccak team) is a coprocessor which uses the system memory as data storage instead of storing the Keccak state internally. Only temporary results are kept internally in registers. As a trade-off between the high-speed and low-area solutions, the Keccak team presented the mid-range core, which is based on some of the ideas presented in [JA11]. This core rearranges the order of the permutation function such that the two slice-oriented steps χ and θ are processed closer together. In order to reduce the required area, the core only works on a subset of the slices that compose the state. This approach requires a different initial and final round. However, as the ρ, π, and ι steps work on lanes and the χ and θ steps on slices, splitting the state is not optimal, since the state must be loaded from the memory multiple times.
In this work, we develop an alternative solution for Keccak which presents a trade-off between performance and area. It can be classified between the high-speed and low-area Keccak solutions of the Keccak team. To avoid a high access rate to the main memory, the state is not split into multiple parts. However, instead of a standalone loosely coupled Keccak accelerator, we design a hardware accelerator for a single round of the Keccak permutation. To reuse existing resources from the RISC-V core, the complete Floating Point Register Set (FPR) with 32 × 32-bit registers and a part of the General Purpose Register Set (GPR) with 18 × 32-bit registers are used. In the following, this combination of FPR and GPR is denoted as Post-Quantum Register Set (PQR). To be precise, the temporary registers t0 to t6 and the saved registers s1 to s11 are used from the GPR. In this way, enough registers remain in the GPR to guarantee normal operation of the RISC-V core, while enough registers are used to store the complete Keccak state in the PQR. The saved registers have to be stored on the stack before the Keccak function starts. Similar to the NTT and Modular Arithmetic Unit, the registers can be accessed in parallel. Figure 5 illustrates our Keccak hardware accelerator. The input of this accelerator is the content of the PQR, the current round for determining the round constant, and a start and reset signal. Triggered by the start signal, the accelerator performs one round of the f-1600 function, where the round signal selects the corresponding round constant. The result of the processed state is written back to the PQR. This step can be repeated for all 24 rounds. The permuted state can then be stored from the PQR to the main memory.
To reduce memory accesses during the generation of random polynomials, we keep the state in the registers as long as possible. The following steps are used for the generation of random polynomials. First, the state is set to zero with the keccak_rst signal, which resets the related registers of the PQR. After setting all registers to zero, the Keccak absorption phase begins. During this phase, the input message or a message block is written into a subset of the state. The state permutation then transforms this state. Depending on the rate, a certain number of bits is squeezed out while the remaining part of the state, the capacity, remains untouched. The squeezed output is then processed in order to generate the desired random polynomials. During this processing, the state registers must remain untouched as long as the state is not written back to the memory. To obtain fresh randomness, the state is permuted and squeezed again. Keeping the state within the PQR for the whole polynomial sampling process leads to a significant performance improvement, as accesses to the main memory are eliminated in this case.
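The processing of the squeezed output follows the usual uniform rejection sampling; a sketch over an abstract byte stream might look like this (the stream stands in for the squeezed state bytes; in RISQ-V the state stays in the PQR and is permuted again whenever more bytes are needed, and the 16-bit chunking and NewHope modulus here are illustrative choices):

```c
#include <stdint.h>
#include <stddef.h>

/* Uniform rejection sampling: interpret the stream as 16-bit little-endian
   values and accept those below q. Returns the number of coefficients
   produced from the available bytes. */
static size_t sample_uniform(uint16_t *poly, size_t n,
                             const uint8_t *stream, size_t len, uint16_t q)
{
    size_t got = 0, pos = 0;
    while (got < n && pos + 1 < len) {
        uint16_t v = (uint16_t)(stream[pos] | (stream[pos + 1] << 8));
        pos += 2;
        if (v < q)               /* rejection step */
            poly[got++] = v;
    }
    return got;
}
```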

Binomial Sampling
Many LWE-based schemes, such as NewHope, Kyber, and Saber, replaced the discrete Gaussian error distribution with a centered binomial distribution. This significantly increases the efficiency, avoids complex arithmetic or large LUTs, and offers better protection against side-channel attacks. Let Ψ_k be a binomial distribution which is centered at zero and has a standard deviation of σ = √(k/2). The distribution is determined by Σ_{i=0}^{k−1} (b_i − b′_i), where b_i, b′_i ∈ {0, 1} are uniform independent bits. This is equivalent to taking two k-bit integers b and b′, calculating the respective Hamming weights, and subtracting one Hamming weight from the other (modulo subtraction). Figure 6 shows the developed hardware architecture of the Binomial Sampling Unit, which turns uniformly distributed samples into binomially distributed ones. The Binomial Sampling Unit reads the uniform samples contained in the two registers rs1 and rs2 from the GPR and calculates the result rd. As the modulus is smaller than 16 bits for the considered schemes, the output register can concatenate two samples. The Binomial Sampling Unit supports different parameters of k: for NewHope k = 8 [AAB + 19], for Kyber k = 2 [ABD + 19], for Lightsaber k = 5, and for Firesaber k = 3 [DKRV19]. Depending on the mode signal, the multiplexers forward two sums of k bits to the modulo subtractors. The two output samples of the subtractors are combined and written back to the GPR. Table 3 summarizes the clock cycle count of the SHAKE256 implementation, the uniform sampling, and the binomial sampling. While uniform rejection sampling is used in LWE-based schemes to generate the random public polynomial, binomial sampling is used to generate the secret and error polynomials for R-LWE/M-LWE schemes. For M-LWR schemes, binomial sampling is only used for generating the secret polynomials. The results show that the SHAKE256 function was accelerated by a factor of 103.59 in comparison to the software baseline implementation.
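The Hamming-weight formulation of Ψ_k can be modeled directly in C for one sample (a behavioral model of the sampling step, not the two-sample packed hardware datapath):

```c
#include <stdint.h>

/* Hamming weight of a 32-bit word. */
static int popcnt(uint32_t x)
{
    int c = 0;
    while (x) { c += x & 1; x >>= 1; }
    return c;
}

/* One centered-binomial sample: take k uniform bits from each of b and b2
   and subtract the Hamming weights modulo q. */
static uint16_t binomial_sample(uint32_t b, uint32_t b2,
                                unsigned k, uint16_t q)
{
    uint32_t mask = (1u << k) - 1;
    int d = popcnt(b & mask) - popcnt(b2 & mask);
    return (uint16_t)((d + q) % q);        /* modulo subtraction */
}
```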
Also compared to the baseline implementation, the uniform sampling was accelerated by a factor of 26.90 for NewHope-512 (n = 512), 27.12 for NewHope-1024 (n = 1024), 22.37 for Kyber-512 (n = 256 for all Kyber security levels), and 31.74 for Lightsaber (n = 256 for all Saber security levels). The optimized binomial sampling makes use of the fast Keccak Accelerator described in Section 3.7 as well as the Binomial Sampling Unit described in Section 3.8. With these optimizations, we achieve a speedup factor of 35.69, 35.80, 14.27, and 40.15 for NewHope-512, NewHope-1024, Kyber-512, and Lightsaber, respectively. The speedup factor for Kyber is smaller when compared to NewHope or Lightsaber due to the low variance of the error distribution.

Results for Keccak and Polynomial Sampling
Although we aimed for a mid-range Keccak solution, the results show that our design outperforms the uniform and binomial sampling of [FSMG + 19], which uses a loosely coupled standalone high-performance Keccak implementation. Their architecture is able to calculate two Keccak rounds in one clock cycle. However, the large communication overhead poses a substantial drawback of their architecture. In this work, the communication overhead was nearly eliminated: during the Keccak absorption phase the message initializes the state, and afterwards the Keccak state is held in the registers for the complete sampling process. The required hardware costs are summarized in Section 5.

Rocket Chip was constructed using the hardware construction language Chisel 7 . The implementation offers a dedicated interface, called Rocket Custom Coprocessor (RCC), to extend the system with hardware accelerators. However, the presented design does not allow the pipeline stages to be easily modified and is therefore not suitable for the development of tightly coupled accelerators. VexRiscv was developed using another high-level hardware description language called SpinalHDL 8 . The VexRiscv project allows modifications of the pipeline stages. This platform was used in [AEL + 20] to integrate a tightly coupled finite field multiplier and in [WTJ + 20] for developing loosely coupled qTESLA accelerators. The PULP project features three different RISC-V cores designed using the hardware description language SystemVerilog. The Ariane core is a 6-stage 64-bit solution. For smaller embedded devices, the PULP team offers the 2-stage 32-bit solution Ibex (formerly Zero-riscy) and the 4-stage 32-bit solution CV32E40P (formerly RI5CY). The RISC-V cores CV32E40P and Ibex can be integrated into the single-core microcontroller platform PULPino 9 , offering a rich set of peripherals such as I2C, SPI, UART, and GPIO.

RISC-V is an open Instruction Set Architecture (ISA).
Similar to [FSMG + 19], we decided to use the CV32E40P core and integrated this core into the PULPino platform. As this core and platform are written in SystemVerilog, we have full control over the whole architecture, which makes it ideally suited for our modifications of the core and pipeline stages. The CV32E40P core has a performance comparable to the widely deployed ARM Cortex-M4, but it is slightly slower. The PQR-ALU consists of the NTT and Modular Arithmetic Unit and the Keccak Accelerator. These two modules require parallel access to the register banks. Therefore, the PQR-ALU is located directly within the Decoding Stage. This avoids routing the register signals to the Execution Stage. The PQ-ALU contains the Binomial Sampling Unit. This accelerator requires only two input registers and one output register and has a construction similar to the MULT unit and the ALU. To reuse existing hardware resources, we integrate the pq.mac operation, described in Section 3.5, directly into the MULT unit. The hardware resources for performing the multiplications are already available in the MULT unit, and an extension for the pq.mac support comes with a negligible overhead of multiplexers and two additions. Enhancing or downsizing our developed PQR-ALU and PQ-ALU is straightforward. All accelerators are added as modules and can be selected using dedicated define directives. Thus, the accelerators can be selected according to the application requirements.

RISC-V ISA Extension
To enhance the basic integer instruction set (I), RISC-V defines several standard extensions, including the extensions for multiplication/division (M); single, double, and quad precision floating point operations (F, D, Q); atomic operations (A); and compressed instructions (C).
The CV32E40P core fully supports the I instruction set and the extensions M, F, and C. In addition, this core provides the PULP-specific extension Xpulp, which includes hardware loops, SIMD extensions, bit manipulation, and post-increment instructions. In this work, we develop the PQ extension. Figure 8 shows the RISC-V base instruction format types R, I, S, and U. Depending on the instruction type, the instruction structure consists of an opcode, function fields, immediate values, the source registers rs1 and rs2, and the destination register rd. We decided to use the R-type for all Post-Quantum instructions. This allows the use of multiple operations with only a single opcode. The opcode chosen in this work is the unused value 0x77. All new instructions are listed in Table 9 (Appendix C).

NTT Configuration Class:
This class performs the following functionalities: i) setting the selected scheme: NewHope-512, NewHope-1024, or Kyber (all security categories); ii) setting the forward or inverse NTT; and iii) setting either the first rounds or the last round of the NTT. When NewHope-512 or NewHope-1024 is selected, the parameters for the optimized NTT CT br←no calculation are set accordingly. Moreover, selecting one of the NewHope variants configures the modulus to q = 12289 and the Montgomery parameter for the modulo multipliers to −q^-1 mod R = 12287, with R = 2^18 for all schemes. When Kyber is selected, the modulo parameters are configured to q = 3329 and −q^-1 mod R = 199935. Setting the forward or inverse NTT selects the corresponding precomputed values for ω_m of the Twiddle Update Unit. Finally, the selection of the first rounds or last round affects the Reordering MUX of the Modular Arithmetic Unit, more specifically, the swapping of the coefficients.
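The stated Montgomery constants can be checked directly: a value p equals −q^-1 mod R exactly when q · (R − p) ≡ 1 (mod R). A small verification helper:

```c
#include <stdint.h>

/* Check that p == -q^(-1) mod R, i.e., q * (R - p) == 1 (mod R). */
static int is_neg_qinv(uint32_t q, uint32_t p, uint32_t R)
{
    return (uint32_t)(((uint64_t)q * (R - p)) % R) == 1u;
}
```

Both configured parameter pairs satisfy this identity with R = 2^18.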

NTT Operation Class:
This class contains all instructions for the optimized NTT CT br←no calculation. The pq.ntt_multiple_bf instruction triggers the automatic calculation of the first five merged NTT rounds. The pq.ntt_single_bf instruction is used to calculate two parallel decimation-in-time butterfly operations. The instructions pq.update_m (rs1 is used for the index) and pq.update_omega are used to control the Twiddle Update Unit. The pq.mul_gamma1, pq.mul_gamma2, and pq.update_gamma instructions are used for calculating the scaling by n^-1 and γ^-i within the inverse NTT.

Modulo Arithmetic Operation Class:
This class is used for the vectorized modulo multiplication, addition, subtraction, and the butterfly operations decimation-in-time and decimation-in-frequency. In this context, rs1 and rs2 are used as source registers and rd as destination register (for the butterfly instructions, rs1 is also used for the second output). In contrast to the pq.ntt_single_bf operation, which is used for the optimized NTT CT br←no, the pq.bf_dit operation does not use the registers from the FPR but from the GPR.

Bit-Reversal Class:
This class contains the bit-reversal instructions for the polynomial lengths n = 256, n = 512, and n = 1024. The register rd contains the value that has to be stored, rs1 the base address of the store location, and rs2 the value of the offset. The offset value will be added to the base address in bit-reversed order. The Load Store Unit will then store the coefficient to the desired location.

PQ MAC Class:
This class contains the vectorized modulo multiply accumulate function pq.mac described in Section 3.5.

Keccak Operation Class:
This class is used to perform one round of the Keccak permutation. The keccak_wen signal, described in Section 3.7, is set to one when the pq.keccak_f1600 instruction is executed. In this context, rs1 selects the current Keccak round and rs2 is used to reset the state.

Binomial Sampling Class:
This class is used to turn uniform samples in rs1 and rs2 into binomially distributed coefficients in rd.

Experimental Results
This section presents our experimental results for NewHope, Kyber, and Saber. For all measurements, the CCA-secure (Chosen Ciphertext Attack) variant of the schemes was chosen, and for all schemes the lowest and highest NIST security categories were evaluated. The software baseline measurements use the clean C-code versions of the PQ-M4 project in [KRSS19], which are based on the reference implementations in [AAB + 19], [ABD + 19], and [DKRV19].

Cycle Count
The results of our clock cycle benchmarks are summarized in Table 4. In comparison to the software baseline implementation, a speedup factor of 10.07 (NewHope-512), 10.47 (NewHope-1024), 7.68 (Kyber-512), 9.62 (Kyber-1024), 2.65 (Lightsaber), and 2.48 (Firesaber) for a complete algorithm run was achieved. Although the NTT and Modular Arithmetic Unit was not particularly designed for the NTT types used in Kyber, a significant performance improvement was measured. The second-round Kyber submission chooses a prime q for which the condition q ≡ 1 mod 2n does not hold, so an early termination of the NTT is required. This reduces the cost for the NTT, but a so-called basecase multiplication consisting of 128 products becomes necessary. For this basecase multiplication, the vectorized modulo arithmetic of the Modular Arithmetic Unit was exploited.
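Each of the 128 basecase products multiplies a pair of degree-one polynomials modulo X^2 − ζ; a scalar C model of one such product (with ζ and the inputs chosen freely for illustration) is:

```c
#include <stdint.h>

#define KYBER_Q 3329u

/* Basecase multiplication in Kyber's NTT domain:
   (a0 + a1*X)(b0 + b1*X) mod (X^2 - zeta), i.e.,
   c0 = a0*b0 + a1*b1*zeta and c1 = a0*b1 + a1*b0 (mod q).
   The component products map onto the vectorized modulo arithmetic. */
static void basemul(uint16_t c[2], const uint16_t a[2],
                    const uint16_t b[2], uint16_t zeta)
{
    uint64_t t = ((uint64_t)a[1] * b[1]) % KYBER_Q;
    c[0] = (uint16_t)((((uint64_t)a[0] * b[0]) + t * zeta) % KYBER_Q);
    c[1] = (uint16_t)((((uint64_t)a[0] * b[1]) + (uint64_t)a[1] * b[0]) % KYBER_Q);
}
```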
Besides that, our work beats the cycle count of the latest assembler-optimized ARM Cortex-M4 implementations of NewHope and Kyber in [ABCG20, KRSS19] and the RISC-V VexRiscv implementation in [AEL + 20]. Using more powerful accelerators than only a basic finite field multiplier as in [AEL + 20] pays off: in comparison to their work, a speedup factor of 6.51 (NewHope-512), 6.53 (NewHope-1024), 4.65 (Kyber-512), and 6.15 (Kyber-1024) was achieved. Furthermore, our design is faster than the CPA version of the NewHope-1024 implementation in [FSMG + 19], which uses loosely coupled accelerators, although the CPA version has no costly re-encryption step (1,078,695 vs. 1,113,984 cycles). The better performance can be explained by the reduced communication overhead due to the tight coupling of the NTT and Keccak accelerators to the register banks and by the additional usage of the new accelerators developed in this work (Modular Arithmetic Unit and Binomial Sampling Unit). For the Saber instances, the polynomial multiplication remains the performance bottleneck, although the pq.mac operation already brings a considerable improvement. Just recently, the authors in [MKV20] proposed a technique called lazy interpolation in order to accelerate the evaluation and interpolation phases of the Toom-Cook multiplication. The integration of this optimization technique into our approach would probably result in a further improvement. This has been left as future work.
Just recently, the authors of NewHope announced in the NIST PQC Forum 10 that a new reference code is available. This new version appeared in response to the work presented in [BDG20], which exploits the incorrect oracle cloning of some NIST KEM PQC candidates to perform key-recovery attacks. The new version of NewHope includes a domain separation for the SHAKE calls in order to make each hash call independent. That is, all hash calls with the same input size use a domain separator label (e.g., a nonce). The NewHope team estimates that this modification will have a negligible influence on the performance results of NewHope. RISQ-V is able to efficiently support the new version of NewHope. The absorption of the input message is completely controlled by software and can be easily modified. The integration of a complete domain separation has been left as future work.

Table 5 summarizes the measured code size for both the baseline and the optimized implementations. In particular for the optimized NewHope implementations, the code size was significantly decreased compared to the baseline implementation. This is mainly due to the fact that the large LUTs for the Twiddle factors and bit-reversal were eliminated. Moreover, the ISA extensions have the side effect that fewer instructions are required for complex operations. It should also be noted that the Saber implementations in [KRSS19] have a significantly larger memory consumption. This is caused by the fact that the authors developed a tool for the automatic generation of optimized assembler code for the polynomial multiplication in Saber. While the optimized assembler code leads to a very fast polynomial multiplication, the code size is large (nearly 10,000 lines 11 ).

FPGA Results
RISQ-V can be synthesized for FPGAs as well as ASICs. For the FPGA evaluation, the Xilinx Zynq-7000 programmable SoC was chosen. The resource utilization of the complete RISQ-V implementation and the costs for the single accelerators are provided in Table 6. In total, the circuit size of RISQ-V is 9,210 LUTs and 1,261 registers larger than the circuit size of the original PULPino platform. For this comparison, we omitted the FPU of the original PULPino platform as it is not necessarily required for the considered Post-Quantum algorithms.
Compared to the loosely coupled NTT accelerator in [FSMG + 19], our accelerator has a higher number of LUTs, but a lower number of registers. The higher number of LUTs can be explained by the higher flexibility of the NTT and Modular Arithmetic Unit. Instead of only being capable of calculating the decimation-in-time butterfly operation, the proposed architecture also supports parallel calculations of the decimation-in-frequency butterfly operation and packed modular arithmetic. It also has to be noted that our design does not need any further BRAM block for storing the input and output data or the Twiddle factors. Instead of hiding the post-processing step (scaling by n^(-1)·γ^(-i)) by using extra multipliers, we use the same multipliers for this operation as for the butterfly calculation.

[Displaced fragment of the cycle-count table: RISC-V (PULPino) 1,300,272 / 1,622,818 / 1,898,051; footnote a): the cycle count was only reported for the CPA-secure version, which, due to the missing re-encryption step, is significantly faster during decapsulation than the CCA-secure versions.]
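The two butterfly variants mentioned above can be sketched in plain C. This is a minimal software illustration of the arithmetic (using Kyber's modulus q = 3329 as an example and naive `%` reduction), not the hardware datapath or the actual RISQ-V instructions:

```c
#include <stdint.h>

#define Q 3329  /* Kyber modulus, used here only for illustration */

/* Decimation-in-time (Cooley-Tukey) butterfly:
 * (a, b) -> (a + w*b, a - w*b) mod Q */
static void ct_butterfly(int32_t *a, int32_t *b, int32_t w) {
    int32_t t  = (int32_t)(((int64_t)w * *b) % Q);
    int32_t hi = (*a + t) % Q;
    int32_t lo = (*a - t + Q) % Q;
    *a = hi;
    *b = lo;
}

/* Decimation-in-frequency (Gentleman-Sande) butterfly:
 * (a, b) -> (a + b, w*(a - b)) mod Q */
static void gs_butterfly(int32_t *a, int32_t *b, int32_t w) {
    int32_t sum  = (*a + *b) % Q;
    int32_t diff = (*a - *b + Q) % Q;
    *a = sum;
    *b = (int32_t)(((int64_t)w * diff) % Q);
}
```

Supporting both variants in one unit is what allows the forward and inverse NTT to run without a separate bit-reversal pass, at the cost of the extra LUTs noted above.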
The tightly coupled Keccak implementation uses only combinatorial logic. No further registers are needed in this design, as the state is stored in the FPR and GPR. The tight coupling saves all logic and registers that would otherwise be used for buffering the input and output data during the Keccak absorption and squeezing phases. In contrast to the Keccak implementation in [FSMG + 19], we decided to use a variant which calculates one round of the permutation function per clock cycle instead of two. This is significantly faster than the low-area solution of the Keccak Team and is comparable to their high-speed solution [BDH + 20].
The two hardware accelerators of the Execution Stage have a negligible hardware overhead. While the Binomial Sampling Unit only requires 106 LUTs, the overhead for supporting the pq.mac operation was nearly eliminated by reusing the resources of the MULT unit.
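The operation performed by the Binomial Sampling Unit is simple enough to state in a few lines of C: a centered binomial sample is the difference of the Hamming weights of two short strings of PRNG bits. The sketch below uses η = 2 (as in Kyber) for illustration; it mirrors the sampler's arithmetic, not the unit's actual interface:

```c
#include <stdint.h>

/* Centered binomial sample with parameter eta = 2: take two random
 * bits for a and two for b from the PRNG output and return
 * popcount(a) - popcount(b), a value in [-2, 2]. */
static int cbd2_sample(uint8_t bits) {
    int a = (bits & 1) + ((bits >> 1) & 1);         /* two bits for a */
    int b = ((bits >> 2) & 1) + ((bits >> 3) & 1);  /* two bits for b */
    return a - b;
}
```

Because the operation is just bit-counting and a subtraction, the unit costs only 106 LUTs in hardware, as reported above.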

ASIC Results
We synthesized the ASIC design with the UMC 65 nm technology. The main objective of this work was to achieve a low-energy design; therefore, a low-leakage standard cell library with a high threshold voltage was chosen. Table 7 shows the number of logic cells and the area consumption of the original PULPino and RISQ-V. When both designs are compared, an increase in logic cells for RISQ-V can be observed. The cell area increased by 64,108 µm² for the combinatorial logic and by 9,927 µm² for the sequential logic. However, this increase does not have a large impact on the overall area because the memory is by far the largest part of both designs. The maximum clock frequency was reduced from 79.66 MHz to 45.26 MHz, as the Modular Arithmetic Unit has a relatively long critical path. A solution to break this critical path could be to add pipeline registers within the modulo multipliers; however, this would increase the latency by one cycle. As the achieved frequency is acceptable for most embedded applications, we omitted these registers.

Our simulated measurements for the power and energy consumption were performed at a frequency of 10 MHz, a nominal supply voltage of 1.2 V, and a temperature of 25 °C. To obtain realistic results, we extracted the dynamic post-synthesis power consumption by means of gate-level switching activity files. The Switching Activity Interchange Format (SAIF) file for the dynamic power calculation was generated using Cadence's Incisive Enterprise Simulator, and the power consumption was calculated using the Cadence power analyzer Joules. The results of the power measurements of the original PULPino and our RISQ-V implementation are summarized in Table 8. The leakage power belongs to the category of static power consumption. Due to the higher area consumption, the leakage power is marginally higher for RISQ-V than for the original PULPino.
An interesting aspect is that the total power consumption mainly depends on the applied scheme but not on the security level. For instance, the optimized NewHope-512 implementation has a power consumption of 2.42 mW, whereas the optimized NewHope-1024 implementation has a power consumption of 2.41 mW. This can be explained by the fact that both instances basically use the same operations and only the execution time differs. Due to the shorter execution time, the energy consumption is significantly lower when using our tightly coupled accelerators.
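The relation between the reported figures is simply E = P · t with t = cycles / f. The small helper below makes the unit bookkeeping explicit; the power value and frequency are taken from the measurements above, while the cycle count in the example is purely illustrative:

```c
/* Energy estimate E = P * t, with t = cycles / f.
 * power_mw  : average power in milliwatts
 * freq_mhz  : clock frequency in megahertz
 * cycles    : executed clock cycles (illustrative value below)
 * Returns the energy in microjoules. */
static double energy_uj(double power_mw, double freq_mhz, double cycles) {
    double t_us = cycles / freq_mhz;  /* execution time in microseconds */
    return power_mw * t_us / 1000.0;  /* mW * us = nJ; /1000 -> uJ */
}
```

At a fixed power level, the energy therefore scales linearly with the cycle count, which is why the cycle-count reductions of the accelerators translate directly into the reported energy savings.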

Conclusion
The generation of uniformly and binomially distributed random polynomials and the polynomial arithmetic are the performance bottlenecks of lattice-based cryptography. Previous works developed loosely coupled accelerators to improve the performance characteristics of several lattice-based schemes. However, loosely coupled accelerators usually have a high data transfer overhead, require a large amount of hardware resources, and suffer from low flexibility. In this work, we developed RISQ-V, an enhanced RISC-V architecture that integrates powerful tightly coupled accelerators directly into the processing pipeline to speed up lattice-based cryptography. The accelerators include an arithmetic unit for vectorized modulo arithmetic and NTT operations, a vectorized modulo multiply-accumulate unit, a Keccak accelerator for the pseudo-random bit generation, and a binomial sampling unit for the generation of binomially distributed samples. To control the tightly coupled accelerators, we extended the RISC-V ISA with 28 Post-Quantum instructions. The design strategy of this work was to reuse existing hardware resources of the system processor, such as the register banks and multipliers, in order to keep the area footprint small. Moreover, we developed design strategies to decrease the access rates to the system memory during the NTT and Keccak computations. Holding data as long as possible within the processor's registers resulted in a significant performance improvement. RISQ-V was synthesized for an FPGA prototype and an ASIC. Compared to the baseline software implementation, the performance evaluation has shown that the developed tightly coupled accelerators lead to a significant reduction of the clock cycle count and energy consumption for NewHope, Kyber, and Saber.

Figure 9: Example NTT CT br←no with n = 16 (unoptimized; layers 1-4 shown).
In this case, one coefficient is stored in a single memory word, two coefficients can be loaded into the register file (l = 2), and one pair of coefficients can be processed in parallel. The red boxes indicate which coefficients are stored together in one word (in this case, they are not stored together) and in which order the coefficients are processed by a single butterfly unit.
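The packed layout referred to by l = 2 can be sketched as placing two 16-bit coefficients in one 32-bit register word. The helper functions below are an illustrative software model of this packing, not the processor's load/store path:

```c
#include <stdint.h>

/* Pack two 16-bit polynomial coefficients into one 32-bit word
 * (the l = 2 case): c0 in the low half, c1 in the high half. */
static uint32_t pack2(uint16_t c0, uint16_t c1) {
    return (uint32_t)c0 | ((uint32_t)c1 << 16);
}

/* Recover both coefficients from a packed word. */
static void unpack2(uint32_t w, uint16_t *c0, uint16_t *c1) {
    *c0 = (uint16_t)(w & 0xFFFF);
    *c1 = (uint16_t)(w >> 16);
}
```

Packing halves the number of register-file accesses per coefficient pair, which is what enables the parallel (packed) modular arithmetic described earlier.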