Standard Lattice-Based Key Encapsulation on Embedded Devices

Lattice-based cryptography is one of the most promising candidates being considered to replace current public-key systems in the era of quantum computing. In 2016, Bos et al. proposed the key exchange scheme FrodoCCS, which was later modified into a key encapsulation mechanism (FrodoKEM) and submitted to the NIST post-quantum standardization process. The security of the scheme is based on standard lattices and the learning with errors problem. Due to their large parameters, standard lattice-based schemes have long been considered impractical on embedded devices. The FrodoKEM proposal, however, comes with parameters that bring standard lattice-based cryptography within reach of constrained devices. In this work, we take the final step of efficiently implementing the scheme on a low-cost FPGA and on microcontrollers, thus making conservative post-quantum cryptography practical on small devices. Our FPGA implementation of the decapsulation (the computationally most expensive operation) needs 7,220 look-up tables (LUTs), 3,549 flip-flops (FFs), a single DSP, and only 16 block RAM modules. The maximum clock frequency is 162 MHz and the decapsulation takes 20.7 ms to execute. Our microcontroller implementation reduces peak stack usage by 66% in comparison to the reference implementation and needs 266 ms for key pair generation, 284 ms for encapsulation, and 286 ms for decapsulation. Our results contribute to the practical evaluation of a post-quantum standardization candidate.


Introduction
Secure communications channels have become essential for the transmission of sensitive information over the Internet or between embedded devices, requiring protocols such as public-key encryption and digital signatures. Furthermore, these requirements for data security and privacy are becoming more important as the number of connected devices increases, due to the popularity of the Internet of Things.
So far, practitioners have relied on cryptography based on the hardness of factoring (RSA) or the discrete logarithm problem (ECC). However, should a quantum computer be realized, the hardness of these problems would be seriously undermined. This issue affects not only future communications but also secure messages sent today, which could be intercepted and stored, then decrypted by a device built a decade from now. Preparing for this is therefore paramount, and hence quantum-safe alternatives are needed to provide long-term security. The National Institute of Standards and Technology (NIST) has called for quantum-resistant cryptographic algorithms for new public-key cryptography standards, similar to the previous AES and SHA-3 competitions [CCJ + 16].
Lattice-based cryptography is one of the most promising replacements for classical cryptography, accounting for more than 40% of the submissions to the NIST post-quantum standardization effort. This is due to many reasons, one of which is that most of the required computations involve very simple and parallelizable operations like integer multiplication, addition, and modular reduction, unlike RSA-based schemes, which involve expensive computations like modular exponentiation. Moreover, lattice-based cryptography benefits from the strong security notion of worst-case to average-case hardness, meaning average-case instances are at least as hard as worst-case instances of related (much smaller) lattice problems [MP13].
In recent years, there has been tremendous growth in lattice-based cryptography as a research field. As a result, concepts such as functional encryption [BSW11], identity-based encryption [DLP14], attribute-based encryption [Boy13], and fully homomorphic encryption [Gen09] are now available. On the practical front, some constructions of public-key encryption schemes and digital signature schemes based on lattice problems are now more practical than traditional schemes based on RSA [HPO + 15]. Lyubashevsky [Lyu08,LPR10] proposed a new class of lattices, ideal lattices, that provide higher efficiency than standard lattices: schemes based on ideal lattices require less memory and perform better than schemes based on standard lattices. Therefore, the majority of research targeting practical evaluations of lattice-based cryptography has focused on ideal lattices [GFS + 12,GOPS13,DDLL13,RVM + 14,LSR + 15, LN16,HRKO17]. This improvement in efficiency is possible due to the structure introduced into the underlying lattice in these constructions. While the main operation in standard lattices is matrix-vector (or matrix-matrix) multiplication, it is reduced to polynomial multiplication in ideal lattice-based cryptography. However, to this day it is not clear whether a future (quantum) attack might be able to exploit this additional structure of ideal lattices in order to break the cryptosystem. Standard lattices do not suffer from this potential weakness and can therefore be considered the more conservative choice, as recommended by Howe et al. [HMO + 16] and the EU Horizon 2020 project PQCRYPTO [ABB + ]. The EU Horizon 2020 project SAFEcrypto also investigates the use of (standard) lattice-based cryptography in hardware, specifically for conservative use cases such as satellite communications.
The FrodoCCS key exchange scheme by Bos et al. [BCD + 16] is designed to offer exactly this: trading some efficiency for high trust in its security in the post-quantum era. Another design rationale for FrodoCCS is simplicity, which is reflected in its use of basic operations like addition and multiplication. The parameter sets are much more flexible and easier to scale than those of the NewHope variants [ADPS16, PAA + 17], which are subject to a number of restrictions in order to use NTT polynomial multiplication; Frodo can thus target more security levels, with costs that scale essentially linearly.
A modified version of FrodoCCS [BCD + 16] has been submitted to the NIST standardization process [ABD + ] as a key encapsulation mechanism (KEM) named FrodoKEM. The submission comes with a reference implementation and a vectorized implementation for high-end Intel CPUs. However, to date there has been no research into the feasibility of Frodo variants on embedded devices. In this paper, we bridge the gap between the lack of practical evaluations of standard lattice-based cryptography and the need for long-term security solutions for the Internet of Things. This task is especially challenging considering the conservative parameters that were a design rationale of the Frodo variants. As embedded devices like FPGAs and microcontrollers usually have very limited memory, we pay special attention to minimizing the memory consumption of our implementations while not deteriorating performance so much that the limited computing capabilities of these platforms are overexerted.

Related Work
To the best of the authors' knowledge, there has been no previous research on evaluating FrodoKEM on embedded devices. There has been some evaluation of standard lattice-based cryptography on constrained devices; however, this area of research is very limited due to the high demand of resources these schemes inherently require. Howe et al. [HMO + 16] present an FPGA implementation of the standard lattice-based encryption scheme proposed by Lindner and Peikert [LP11]. The encryption scheme is based on the Learning with Errors problem, the same hardness problem on which the security of FrodoKEM rests. However, the parameters and the operations are significantly different. Most notably, LWE encryption requires just a vector-matrix multiplication, while FrodoKEM requires matrix-matrix multiplication. Another lattice-based submission to the NIST standardization is the NewHopeNIST key exchange [ADPS16, PAA + 17]. In contrast to FrodoKEM, NewHopeNIST is based on ideal lattices and therefore its implementations are much more efficient. The works by Kuo et al. [KLC + 17] and Oder and Güneysu [OG17] implement the scheme on FPGAs and Alkim et al. [AJS16] present a microcontroller implementation.
Key exchange schemes based on classes of mathematical problems other than lattices have been implemented on embedded devices as well. Koziel et al. [KAK16] have implemented a key exchange scheme based on supersingular isogenies for FPGAs. Also, von Maurich et al. [vMHG16] implemented a key encapsulation scheme based on linear codes for ARM microcontrollers. There are also numerous implementations of key exchange schemes based on elliptic curves [RMF + 15, LSH + 15, DHH + 15]. But as Shor's algorithm [Sho94] can efficiently solve the elliptic curve discrete logarithm problem, those schemes are not considered secure in a post-quantum age.

Contribution
In this work, we present the first implementations of FrodoKEM targeting constrained devices in hardware and software, and demonstrate that the conservative FrodoKEM scheme is a suitable option for embedded devices.
• Our FPGA design targets a balance between area consumption and throughput. This design choice is reflected in the use of a single multiplier module and minimal use of memory. An LWE multiplication core is proposed that is constantly reused for the main operations of the scheme, while the remaining operations are computed in parallel, essentially making multiplication the critical path of the design. Its runtime depends only on the number of inputs, meaning all designs run in constant time. Most designs utilize fewer than 2,000 FPGA slices and can output 51 operations per second (20 ms) for the main parameter set and 22 operations per second (45 ms) for the higher parameter set.
• Our ARM implementations make use of an optimized memory allocation that makes the implementation small enough to fit on embedded microcontrollers. We developed an assembly multiplication routine to speed up our implementation, realizing a performance that fits the requirements of common use-cases. The implementation for 128-bit security takes 266 ms for key generation, 284 ms for encapsulation, 286 ms for decapsulation, resulting in a total execution time of 836 ms for a full run of the protocol. To allow independent verification of our results and further improvements, our source code will be made publicly available with publication of this work.
Our results show that even one of the more conservative lattice-based submissions to the NIST standardization process (i.e., no ring structure as in ideal lattices) can be run efficiently on constrained devices. Our implementations are fully compliant with the official specification of FrodoKEM to ensure compatibility with implementations on other platforms. The intention of our work is to contribute to the NIST standardization process by demonstrating the practicability of the promising post-quantum candidate FrodoKEM.

Preliminaries
In this section we review the theoretical background that is relevant for our work. We explain the LWE problem and how it is used in the key encapsulation protocol FrodoKEM.

Notation
In this work, we adopt most of the notation that is used in the official specification of FrodoKEM [ABD + ]. We use bold lower-case letters to denote vectors and bold upper-case letters to denote matrices. We denote the set of all integers by Z and by Z_q we denote the quotient ring of integers modulo q. For two n-dimensional vectors a, b their inner product is denoted by ⟨a, b⟩ = Σ_{i=0}^{n−1} a_i b_i. The concatenation of two vectors a, b is denoted with the ||-operator.

The Learning with Errors problem
In 2005, Regev introduced the Learning with Errors (LWE) problem [Reg05]. The LWE problem is defined in [Reg10] as follows: Fix a size parameter n ≥ 1, a modulus q ≥ 2, and an 'error' probability distribution χ on Z_q. Let A_{s,χ} on Z_q^n × Z_q be the probability distribution obtained by choosing a vector a ∈ Z_q^n uniformly at random, choosing e ∈ Z_q according to χ, and outputting (a, ⟨a, s⟩ + e), where additions are performed in Z_q, i.e., modulo q. We say that an algorithm solves LWE with modulus q and error distribution χ if, for any s ∈ Z_q^n, given an arbitrary number of independent samples from A_{s,χ} it outputs s (with high probability).
In other words, solving a system of linear equations is usually easy, but as soon as an error e is added to the equations, it becomes a hard mathematical problem. To date, no quantum algorithm is known that could solve this problem in polynomial time. Therefore schemes based on LWE with sufficiently large parameters are considered quantum-secure.
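As a toy illustration of the definition above, one LWE sample can be produced as follows. The dimension, modulus, and fixed values here are made-up toy numbers for illustration only, not parameters of any real scheme:

```c
#include <stdint.h>

#define N 4    /* toy dimension */
#define Q 97   /* toy modulus   */

/* Produce the second component of one LWE sample (a, <a,s> + e mod q):
 * the inner product of the public vector a with the secret s, plus an
 * error value e drawn from the distribution chi (passed in here). */
uint32_t lwe_sample_b(const uint32_t a[N], const uint32_t s[N], uint32_t e)
{
    uint32_t acc = 0;
    for (int i = 0; i < N; i++)
        acc = (acc + a[i] * s[i]) % Q;  /* <a, s> mod q */
    return (acc + e) % Q;               /* add the error term */
}
```

Without the error term e, the secret s could be recovered from n samples by Gaussian elimination; the added error is exactly what makes the problem hard.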

The Frodo Key Encapsulation Mechanism scheme
The key pair generation, encapsulation, and decapsulation of FrodoKEM are shown in Algorithms 1, 2, and 3, respectively. These algorithms call several subroutines, which we explain very briefly here; we refer to the original specification [ABD + ] for details. Frodo.Gen() uniformly samples an n × n matrix. Another sampling algorithm (Frodo.SampleMatrix()) samples a matrix from a specific distribution that is defined in the parameter sets shown in Table 1. Frodo.Pack() and Frodo.Unpack() transform a matrix into a bit string (i.e., a format suitable for transmission) and vice versa. Frodo.Encode() encodes a bit string as a mod-q integer matrix, using B bits of the bit string to generate one element of the matrix, where B is defined by the parameter set. The inverse operation is Frodo.Decode().
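A per-element sketch of Frodo.Encode()/Frodo.Decode() is shown below, assuming (as in FrodoKEM) a power-of-two modulus q = 2^D: a B-bit value k is mapped to k · q/2^B, and decoding rounds back to the nearest multiple, so small additive errors are absorbed. D = 15 and B = 2 are the FrodoKEM-640 values; the loops over the full matrix are omitted:

```c
#include <stdint.h>

#define D 15   /* q = 2^15 for FrodoKEM-640 */
#define B 2    /* bits encoded per matrix element */

/* Encode a B-bit value k as the mod-q element k * q / 2^B. */
uint16_t frodo_encode_elem(uint16_t k)
{
    return (uint16_t)(k << (D - B));
}

/* Decode: round(c * 2^B / q) mod 2^B, i.e. shift back with rounding,
 * so any additive error smaller than q / 2^(B+1) is removed. */
uint16_t frodo_decode_elem(uint16_t c)
{
    return (uint16_t)(((c + (1u << (D - B - 1))) >> (D - B)) & ((1u << B) - 1u));
}
```

This spacing of the encoded values by q/2^B is what later allows the LWE noise accumulated in V to be stripped off during decapsulation.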

[Algorithm listings 1–3 (FrodoKEM key generation, encapsulation, and decapsulation) are omitted here; see the specification [ABD + ] for the full pseudocode.]

The main operation of the key generation (Algorithm 1) is the generation of the LWE sample B ← AS + E, where A is a uniformly random matrix, and E and S are distributed according to χ. A is generated by a pseudo-random number generator; the designers of FrodoKEM proposed to instantiate it with either AES or cSHAKE. B and the seed for the generation of A then form the public key, and S is the secret key. During encapsulation, three noise matrices S′, E′, and E′′ are generated. Then, the two parts of the public key, A and B, are each used to compute one part of the ciphertext, via B′ ← S′A + E′ and V ← S′B + E′′. An encoded random bit string µ is added to V. The shared symmetric key is then computed by hashing both ciphertexts and some salt. The decapsulation checks whether the input is a valid ciphertext by first decrypting µ, then re-encrypting it and checking whether both ciphertexts match. If so, the shared symmetric key is again generated by hashing both ciphertexts and the salt.

Error Sampling
The coefficients of the noise matrices are sampled from a discrete, symmetric distribution on Z that approximates a rounded, zero-centered Gaussian distribution. Each parameter set uses a different probability density function (PDF); both are shown in Table 2. In the implementation, each discrete PDF is converted into a discrete cumulative distribution function (CDF) to enable inversion sampling. A CDF f(x) returns the probability of a value being x or less. We demonstrate with an example how inversion sampling works for the FrodoKEM-640 parameter set. The CDF results in the
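Inversion sampling from such a CDF table can be sketched as follows. The table below is illustrative only, not the real FrodoKEM-640 CDF: a 16-bit random value supplies a sign bit and a 15-bit value that is compared against every table entry, mirroring the bank of comparators used in the FPGA sampler described later:

```c
#include <stdint.h>

/* Illustrative (made-up) 15-bit CDF table; entry i is the scaled
 * probability of sampling a magnitude of at most i. */
static const uint16_t CDF_TABLE[] = { 9000, 20000, 28000, 31500, 32400, 32700, 32766 };
#define CDF_LEN (sizeof CDF_TABLE / sizeof CDF_TABLE[0])

int16_t sample_error(uint16_t r)
{
    uint16_t t = r >> 1;     /* 15-bit value compared against the table */
    uint16_t sign = r & 1;   /* random sign bit for the symmetric distribution */
    int16_t e = 0;
    for (unsigned i = 0; i < CDF_LEN; i++)
        /* constant-time "e += (t > CDF_TABLE[i])" via the borrow bit */
        e += (int16_t)((uint16_t)(CDF_TABLE[i] - t) >> 15);
    return sign ? (int16_t)-e : e;  /* real code applies the sign branch-free */
}
```

Because every table entry is always compared, the number of operations is independent of the sampled value, which is what makes table-based inversion sampling attractive for constant-time implementations.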

Structured Lattices
For comparison purposes, it is prudent to discuss the differences between the "standard" LWE problem and the Ring-LWE problem, since we draw comparisons to NewHopeNIST [ADPS16,PAA + 17], a Ring-LWE KEM submitted to NIST for post-quantum standardization.
In Ring-LWE, the operating lattice is an ideal lattice, which is a subset of (standard) lattices, computationally related to polynomials via matrices of a specific form. Thus, instead of having an identically and independently distributed matrix of the form A ∈ Z_q^{n×n}, we use a matrix that is structured in such a way that a single column a_1 ∈ Z_q^n suffices. The remaining n − 1 columns are then derived as the coefficient representations of the polynomials a_1 · x^i in the ring Z_q[x]/⟨f⟩, for some univariate polynomial f [Lyu12]. Hence, we are able to represent matrices from the standard LWE problem as polynomials in the Ring-LWE problem. To date, there have been no significant attacks exploiting this added structure.
Once polynomials are utilised, more efficient multiplication techniques can be used, such as the Number Theoretic Transform (NTT). This reduces the multiplication complexity from quadratic (O(n²)) to quasi-linear (O(n log n)) for degree-n inputs. This, coupled with the significantly smaller key sizes, means that NewHopeNIST is more efficient than FrodoKEM.

FPGA Design
In this section, we explain our design decisions and details of the FPGA implementations. The device targeted is a Xilinx Artix-7 FPGA, although the design is not device specific and is generic enough to comfortably fit on most low-cost FPGA devices. The generic design also applies across parameter sets, meaning the design principles are the same for both FrodoKEM-640 and FrodoKEM-976. We propose a design that aims to balance FPGA area consumption against throughput / runtime of the operations. There are separate designs for key generation, encapsulation, and decapsulation, since we expect an embedded device to usually compute these operations separately.

Overview
The FPGA designs of all three cryptographic operations consist of three main components: matrix-matrix multiplication, addition of an error distribution, and the use of random oracles via cSHAKE. Our designs use two cSHAKE modules and one AES module. Essentially all the proposed designs have the same critical path, that is, the matrix-matrix multiplications. All other modules, such as random number generation, operate in parallel to this, which saves significant clock cycles and simplifies the overall design. This also means the clock cycles per operation are easily calculable and, more importantly, constant; for example, encapsulation takes n̄ × (n × n + n × n̄) clock cycles. Efficient constant runtime is a practical countermeasure to some simple side-channel attacks such as timing analysis.
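The cycle-count formula can be sanity-checked against the reported figures. The helper below is our own sketch, not part of the design; for FrodoKEM-640 it gives 8 × (640² + 640·8) = 3,317,760 cycles, i.e. about 20.5 ms at the 162 MHz maximum clock frequency, consistent with the reported 51 operations per second (≈20 ms) for the main parameter set:

```c
#include <stdint.h>

/* Clock cycles for one encapsulation: nbar * (n*n + n*nbar), since the
 * MAC pipeline is the critical path and all other modules run in
 * parallel to it. */
uint64_t encaps_cycles(uint64_t n, uint64_t nbar)
{
    return nbar * (n * n + n * nbar);
}
```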
Instead of creating a hardware module for matrix-matrix multiplication, we utilize a vector-matrix multiplier that loops over the n̄ = 8 rows of the S′ matrix to calculate both the B′ and V matrices (i.e., for encapsulation in Algorithm 2). This equivalent operation saves area by not requiring all of S′ to be stored: once a row of S′ has been used, it is not needed again. This holds for both encapsulation and decapsulation, where S′ is required in more than one multiplication instance.
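The vector-matrix view can be sketched in C as follows. This is our own illustrative model, not the HDL, with a toy dimension and arithmetic modulo 2^16 via natural uint16_t wrap-around (as in FrodoKEM-976):

```c
#include <stdint.h>

#define TN 4  /* toy n; the real designs use n = 640 or 976 */

/* One row of B' = S'A + E': a single row of S' is streamed against all
 * of A and one row of E' is added, so only one row of S' needs to be
 * live at a time instead of the full matrix. */
void vec_mat_mac(const uint16_t s_row[TN], const uint16_t a[TN][TN],
                 const uint16_t e_row[TN], uint16_t b_row[TN])
{
    for (int col = 0; col < TN; col++) {
        uint16_t acc = e_row[col];                         /* start from the error */
        for (int k = 0; k < TN; k++)
            acc = (uint16_t)(acc + s_row[k] * a[k][col]);  /* MAC step */
        b_row[col] = acc;
    }
}
```

Calling this once per row of S′ reproduces the full matrix-matrix product while keeping the storage footprint at a single row.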
Many of the operations within FrodoKEM key generation, encapsulation, and decapsulation are similar, with some differences in the use of encoding or decoding. More specifically, the main operations of key generation also appear in encapsulation and decapsulation, and (with the exception of calculating M) encapsulation and decapsulation also share many of the same operations. These similarities are also reflected in the similar area consumption on the FPGA, due to the use of the same sub-modules. Thus, the operations of encapsulation will be described; a high-level overview of the architecture is given in Figure 2.
FrodoKEM encapsulation begins with an initialization stage. This step is mainly required for reading the public-key information onto the device. The time this takes is exploited to also initialize the cSHAKE module (used for generating A), generate the uniformly random key µ, and pre-generate some of the matrix A. The public-key (B) information and the matrix A are stored in block RAM (BRAM) and are called upon in the multiplication component depending on a row-column (address) count.
Once the initialization has finished, the computation of the matrix B′ starts, requiring a vector of values from the error distribution, a matrix of uniform values, and further error distribution values used for addition. The difference between these error distribution values is that the first kind, used in S′, is required again for the multiplication producing V, and is hence stored, whereas those required for the addition of E′ and E′′ are not needed again and are not stored for further use. Once a single row of S′ is generated, the next row is generated in parallel to the multiplication producing B′, using a double-buffered store (sometimes called the page-flip method or ping-pong buffering). At the end of each vector-matrix operation (n̄ = 8 of these are required), the buffers are swapped. This technique reduces the storage requirements of S′ by a factor of 4 and ensures there is no delay between any of the vector-matrix multiplication operations in the LWE multiplier.
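The double-buffered storage of S′ rows can be sketched as follows. This is our own minimal C model of the idea; in hardware the "fill" side is driven by the error sampler and the "active" side by the multiplier:

```c
#include <stdint.h>

#define ROW_LEN 4  /* toy row length */

/* Two row buffers: the multiplier reads buf[active] while the sampler
 * writes buf[1 - active]; a swap at the end of each vector-matrix
 * operation makes the freshly sampled row the active one. */
typedef struct {
    uint16_t buf[2][ROW_LEN];
    int active;
} pingpong_t;

const uint16_t *active_row(const pingpong_t *p) { return p->buf[p->active]; }
uint16_t       *fill_row(pingpong_t *p)         { return p->buf[1 - p->active]; }
void            swap_rows(pingpong_t *p)        { p->active = 1 - p->active; }
```

Because sampling and multiplication overlap, the multiplier never stalls waiting for the next row of S′.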
An encapsulation operation is complete when the vector-matrix LWE multiplier has looped over the n̄ = 8 vectors of S′, calculating B′ and V. As the coefficients of these matrices become available, they are input into a second cSHAKE module, used as a random oracle to calculate the shared secret ss. This stage is computed in parallel to the next encapsulation operation, which simplifies the overall design and ensures the constant runtime.

Figure 1: A high-level overview of the pipeline incorporated within the LWE multiplication core, for example B′ = S′A + E′. Latency is minimised due to parallelising the multiply-accumulate (MAC) operations within the DSP and the additions with the error.

LWE Multiplication Core
At the center of the FrodoKEM FPGA design is an LWE multiplication core, which consists of vector-matrix multiplication and addition of the error distribution and, if required, the message data. The generic design generates and stores a single row of the error matrix S′ for use in the calculation of B′ and V. While these operations are taking place, the next row of S′ is being generated, and the vectors are swapped at the end of each vector-matrix multiplication. This process loops for n̄ = m̄ = 8 rows, the same for both parameter sets. The design exploits a DSP block on the FPGA device, as it matches the requirements of the vector-matrix multiplications: a 25-by-18-bit multiplier and a 48-bit accumulator. Each vector loop of S′ is multiplied by a matrix (either A or B) and adds an error distribution value and, if required, message data from the encoding of µ. The nature of the DSP means that each multiplication within the MAC happens in a single clock cycle, which ensures constant runtime and makes the clock cycle counts easily calculable, since the MAC operations are the critical path of the proposed designs. Figure 1 shows the pipeline of each vector-matrix MAC operation, as well as its parallelisation with the required additions.

Additional Modules
The generation of the deterministic matrix A uses cSHAKE. For the cSHAKE implementation, a balanced design is used which is based around the mid-range core of KECCAK. Due to the deterministic nature of the matrix A, it does not need to be stored in its entirety. This is essential, as otherwise the storage requirements would exceed the capacity of the FPGA, even for the smaller parameter set. Instead, enough of the matrix is pre-generated during the initialization stage, and the remainder is generated on-the-fly, continuously reusing the same memory blocks. This is similar to the page-flip technique used for S′. This module runs in parallel to the LWE multiplication core and is thus not a part of the critical path or the clock cycle counts of the operations.
Error sampling is another important module within FrodoKEM. The FrodoKEM-640 and FrodoKEM-976 parameter sets require slightly different distributions; however, the standard deviations are close enough that essentially the same FPGA area and performance are attained. A large number of samples is required during a run of FrodoKEM; because of this, a fast but rather large sampler is designed in order to keep up with the LWE multiplier. Instead of using a binary search for the table look-up, a large number of comparators is used in order to instantly output an error distribution value from the look-up table.

Microcontroller Design
We present four implementations of FrodoKEM: we implemented both parameter sets, FrodoKEM-640 and FrodoKEM-976, each with both possible pseudo-random number generators for the generation of A. For AES, we rely on the optimized implementation by Schwabe and Stoffelen [SS16] and for cSHAKE we use the assembly implementation from the official KECCAK code package [BDP + a].

Target Platform
We evaluate our microcontroller implementation of FrodoKEM on the STM32F407 Discovery board, which has a 32-bit ARM Cortex-M4F microprocessor running at up to 168 MHz. Our development board comes with 192 kilobytes of RAM and one megabyte of flash memory. Furthermore, the Cortex-M4 features powerful DSP instructions like single-cycle multiply-with-accumulate, and a true random number generator based on analog noise. However, in contrast to some other M4-based microcontrollers, our development board does not have an AES accelerator that could be used to speed up FrodoKEM-AES. As development environment we use CooCox CoIDE version 1.7.7 with the gcc-arm-none-eabi 5.4 2016 toolchain. The Cortex-M4F has 13 general-purpose registers (R0–R12), one register reserved for the stack pointer, a link register, one register reserved for the program counter, and special-purpose program status registers. When mixing C with assembly it is important to note that the calling convention requires parameters to be passed in R0–R3 and the result to be returned in R0–R1. The link register can be used as a general-purpose register as well, provided the assembly function does not call any other function and its original value is restored before leaving the function.

High-level Memory Optimization
The official specification of FrodoKEM reports a peak stack memory usage of 189,176 bytes for FrodoKEM-976-AES. As our microcontroller only has access to 192 kilobytes of RAM, we carefully analyze the memory allocation of the reference implementation to see whether we can make it more efficient in terms of memory usage. Keep in mind that in many applications other software runs besides the KEM; it is therefore sensible to reduce the memory consumption as much as possible without sacrificing performance. With the help of the flow charts of the most important operations in FrodoKEM (Figures 3 and 4), it is easy to see which matrices are used for which computations. The highlighted intermediate values are large arrays with n × n̄ elements. As we store each element in two bytes, one large array requires 976 × 8 × 2 = 15,616 bytes of RAM for FrodoKEM-976-AES.
The non-highlighted intermediate values are small (m̄ × n̄ elements, i.e., 128 bytes), and therefore we focus on optimizing the large ones.
The first thing to note about decapsulation, as shown in Figure 4, is that we need memory for at least two large arrays. For instance, during the computation of B′, both inputs E′ and S′ are large. While E′ can be generated on the fly, S′ is loaded multiple times during the multiplication by A, so on-the-fly computation would imply regenerating the same values over and over again. Therefore we decided that the better trade-off is to keep storage space for at least two large arrays. Another thing to note is that the right-hand side can be computed completely independently of the left-hand side. Therefore we can store S′ in one of our two memory slots for large arrays and compute V and B′ using the other memory slot. Once V and B′ are calculated, S′ is no longer used and can be replaced by B′.
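The two-slot strategy can be made concrete with a size check. The buffers below are our own sketch and only illustrate the reservation, not the computations; each slot holds one large n × n̄ matrix of 16-bit entries, matching the 15,616-byte figure above for FrodoKEM-976:

```c
#include <stdint.h>
#include <stddef.h>

#define NBAR 8
#define N976 976

/* The decapsulation runs in two reusable "large array" slots: slot0
 * first holds S' and is later overwritten, while slot1 is the working
 * space for the matrix currently being computed. */
static uint16_t slot0[N976 * NBAR];
static uint16_t slot1[N976 * NBAR];

size_t slot_bytes(void) { return sizeof slot0; }
```

Two such slots (about 31 KB) plus the small arrays fit comfortably in 192 KB of RAM, whereas the reference implementation's 189,176-byte peak leaves almost no headroom.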
In the flow chart of encapsulation in Figure 3 we can see that two large arrays also suffice for the encapsulation, as the sampling of E′ and the unpacking of B can be done on the fly. In fact, the encapsulation needs even less memory, as for instance the packing of B′ could be done on the fly as well. But as the bottleneck in terms of memory consumption is the decapsulation, we do not optimize the encapsulation further.

Low-level Assembly Optimization
Our measurements indicate that, besides the generation of the matrix A, the multiplication of A with the secret matrix consumes most of the cycles, and therefore optimizing the multiplication is profitable. The multiplication operations themselves consume only a small part of the cycles; the loading and storing of the matrix entries is the decisive part. Therefore minimizing the memory accesses is the key to a short run-time. Since the matrix A needs to be generated on-the-fly due to the memory constraints of the ARM Cortex-M4F, the multiplication cannot be done on the whole matrix A at once. We chose to generate A row-by-row when computing AS. Implementing the multiplication in assembly language gives us more control over the implementation and allows us to incorporate enhancements the compiler cannot engineer. Since the number of memory accesses is decisive for the speed, our goal is to load the necessary matrix entries from RAM as rarely as possible and to use all the available registers. When A is generated on-the-fly, a straightforward implementation of the multiplication AS has two loops. The first loop iterates over the columns of S with n̄ iterations. The second loop iterates over the rows of S, respectively the entries of the generated row of A, with n iterations. But as n̄ = 8 in both parameter sets, it is possible to implement the matrix multiplication using only one loop. During the multiplication of one row of A with S, only eight entries of AS are computed. Since these entries are sums of n products, they are updated many times during the computation. Storing the eight entries of AS in registers during the whole computation saves many memory accesses: instead of iterating over the eight columns of S, it is possible to process one entry of A with a complete row of S during one iteration. Figure 5 presents this concept graphically.
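In C, the single-loop concept of Figure 5 looks roughly as follows. This is our own sketch of the idea behind the assembly routine, with a toy n; in the real code the inner eight updates are unrolled and the accumulators live in registers R4–R11:

```c
#include <stdint.h>

#define NBAR 8
#define TOYN 16  /* toy n; FrodoKEM uses 640 or 976 */

/* Multiply one freshly generated row of A with all of S, keeping the
 * eight output entries in local accumulators for the whole loop so
 * that memory is touched only to stream in A and S. */
void row_times_S(const uint16_t a_row[TOYN], const uint16_t s[TOYN][NBAR],
                 uint16_t out_row[NBAR])
{
    uint16_t acc[NBAR] = {0};           /* the eight register-held entries */
    for (int i = 0; i < TOYN; i++)      /* one loop: entry of A x row of S */
        for (int j = 0; j < NBAR; j++)  /* unrolled in the assembly version */
            acc[j] = (uint16_t)(acc[j] + a_row[i] * s[i][j]);
    for (int j = 0; j < NBAR; j++)
        out_row[j] = acc[j];
}
```

The payoff is that each output entry is written to memory exactly once, instead of being loaded and stored on every one of its n updates.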
The multiplication of A and the matrix S′ is slightly different, and varies with the use of either AES or cSHAKE. With cSHAKE, the matrix A is intended to be generated only in entire rows, which is convenient for the computation of AS but inefficient for the computation of S′A, because one row of A affects all elements of S′A. With AES, A is generated in blocks of 128 bits; therefore it is not only possible to generate A row-by-row but also by producing eight columns at a time. In a straightforward implementation, this leads to a third loop iterating over the eight columns. The fact that the number of processed columns of A is eight, just like the parameter n̄, enables us to use the same concept that we used for the multiplication AS. We illustrate this concept in Figure 6. To avoid nested loops, we unroll the loop that is added due to the eight columns, ending up with eight loops that are run through one after the other. The ARM Cortex-M4F offers 13 general-purpose registers R0–R12, all of which we use in our assembly matrix multiplication to maximize memory efficiency. Furthermore, we use the designated link register R14, whose content we preserve on the stack. In R0 a pointer to matrix A is passed, in R1 a pointer to matrix S, and in R2 a pointer to matrix B. After loading the eight relevant entries of B into the registers R4–R11, we reuse R2 to store elements of A. In R3 we pass the parameter n, which defines the number of iterations through the loop. In R12 and R14, entries of S are stored.
In both parameter sets, matrix entries are stored in 16-bit data types, but the ARM Cortex-M4F is a 32-bit architecture. This enables us to reduce memory accesses by loading two entries of matrix A simultaneously, as in Line 1 of Listing 1. In the next line we use an instruction that loads multiple aligned words to fetch four entries of S with only one instruction. The single-cycle multiply-with-accumulate capabilities are very valuable for the actual matrix multiplication, used for example in Line 3 of Listing 1.
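The packed-load trick can be expressed in portable C as follows; this is our own sketch, and it assumes little-endian byte order as on the Cortex-M4. On the target, the load is a single LDR and the subsequent multiply-accumulates can collapse into one SMLAD-style DSP instruction:

```c
#include <stdint.h>
#include <string.h>

/* Fetch two adjacent 16-bit matrix entries with one aligned 32-bit
 * load, then split them, halving the number of load instructions. */
void load_two_u16(const uint16_t *p, uint16_t *lo, uint16_t *hi)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);       /* one 32-bit load covering p[0], p[1] */
    *lo = (uint16_t)(w & 0xFFFFu); /* p[0] on a little-endian machine */
    *hi = (uint16_t)(w >> 16);     /* p[1] */
}
```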

Protection Against Timing Side Channels
Our implementations using cSHAKE run in constant time and are therefore protected against timing attacks. The AES implementation from [SS16] is very efficient but, due to the caches on our development board, not constant-time. We therefore disabled the data and instruction caches by clearing bit 9 and bit 10 of the FLASH_ACR register. We noticed only a negligible drop in performance (< 1%) after disabling the caches.
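On the device this is a read-modify-write of a memory-mapped register; the masking itself can be modeled as follows (bit positions as stated above; the example register value 0x0705 is hypothetical):

```python
ICEN = 1 << 9    # instruction cache enable (bit 9 of FLASH_ACR)
DCEN = 1 << 10   # data cache enable (bit 10 of FLASH_ACR)

def disable_caches(flash_acr):
    """Clear the instruction- and data-cache enable bits, leaving all
    other bits (e.g. prefetch and wait-state settings) untouched."""
    return flash_acr & ~(ICEN | DCEN)

# example: prefetch + both caches + 5 wait states -> caches cleared
print(hex(disable_caches(0x0705)))  # 0x105
```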

Results and Comparison
In this section we discuss the results of our FPGA and microcontroller implementations and compare them with others. In particular, we also compare with implementations of NewHopeUSENIX, even though comparing a standard lattice-based scheme with an ideal lattice-based scheme is not exactly an apples-to-apples comparison (as discussed in Section 2.6). Our intention behind this comparison is to show the cost of removing the potential additional attack vector of ideal lattices (i.e., the ring structure). Table 3 provides post-place-and-route results of the proposed hardware designs, as well as comparable lattice-based cryptographic hardware designs. The 18Kb BRAM usage follows from the inputs of the operations, that is, the public key, the secret key, and the ciphertext. The increased BRAM usage in Decaps results in a slight decrease in clock frequency. The area consumption of all the modules is similar, at least within each parameter set. This is essentially due to the LWE multiplication core, which is reused for all vector-matrix multiplication and error addition operations. The increase between parameter sets is due to the growth of the matrix dimension from 640 to 976; the rest of the design essentially remains the same. Hardware results are also provided for the main components required: the error distribution sampler and the cSHAKE module. As per the specification, the error sampler is combined with AES as a PRNG feeding the lookup table. The large area consumption of this module is due to the use of AES, as well as the large number of comparators employed to achieve high throughput. One cSHAKE module is used for generating the randomness for the matrix A and a second is used to generate the shared secret ss on-the-fly, which makes these cSHAKE modules the largest components overall.
The remaining area is consumed by control logic and the LWE multiplier, which requires a DSP for multiplication and a reasonable amount of LUTs for storage.

FPGA Results
In Table 4 we present the cycle counts for our FPGA designs. Clock cycle counts in Table 4 are identical for either PRNG choice (AES or cSHAKE), as this module runs in parallel to the vector-matrix multiplication within the LWE multiplication core and does not affect the critical path of the operations (as described in Section 3.2). The clock cycle count of each cryptographic operation is dominated by the MAC operations of the matrix-matrix multiplications it requires. In particular, the multiplication of the largest matrices in each cryptographic operation, that is, B ← AS + E for key generation and B ← SA + E for encapsulation and decapsulation, contributes 100%, 97.5%, and 97.5%, respectively, to the overall clock cycle counts. As described in Section 3.1, there is a one-time initialisation stage for loading input information, initialising modules, and pre-storing matrices. This process takes between 5.1k and 23.5k clock cycles, depending on the operation and parameter set used. This extra latency is not included in Table 4 as it is negligible, even for one run of a FrodoKEM operation (at most 0.5%), and becomes even more so when averaged over numerous operations. Comparison results for related works are given in Table 3. The FPGA device used, the Xilinx Artix-7 XC7A35T, is similar to the one used by Oder and Güneysu [OG17], to enable a fair comparison. Although the NewHopeUSENIX design outperforms our proposed designs in terms of operations per second, the area consumption is comparable. The loss in throughput is expected and almost entirely due to the number of clock cycles required per operation; NewHopeUSENIX requires 171k clock cycles for the server-side operations whereas FrodoKEM requires 3.3m. The increase in memory requirements is due to the differences in key sizes, since NewHopeUSENIX uses polynomials instead of the matrices used in FrodoKEM (as mentioned in Section 2.6).
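Since the large multiplication dominates, the cycle counts can be estimated from its MAC count alone, assuming one MAC per clock cycle on the single DSP (a back-of-the-envelope check, not the exact Table 4 figures):

```python
def mac_cycles(n, nbar=8):
    """Approximate cycle count of an (n x n) * (n x nbar) multiplication
    at one multiply-accumulate per clock cycle."""
    return n * n * nbar

print(mac_cycles(640))  # 3276800 -- matches the ~3.3m cycles quoted above
print(mac_cycles(976))  # 7620608
```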
The only other implementation of standard lattice-based cryptography in hardware is by Howe et al. [HMO+16], referred to as Standard-LWE, and is discussed here for comparison. The area consumptions are similar due to the similar LWE multiplication operations required; however, Howe et al. use significantly smaller matrix dimensions than the FrodoKEM parameters, and hence our results represent an improvement. Moreover, their use of BRAM is significantly larger due to precomputed keys, which we avoid by using on-the-fly generation. Reusing keys is discussed by the authors of FrodoKEM [ABD+, Sec. 5.1.4] but is not recommended due to the potential attack vector it provides. Additionally, we wanted to keep the memory requirements low and therefore decided not to store the entire matrix A in memory. The throughput of Standard-LWE is much higher than ours, due to its much lower clock cycle count of 98k. This is again essentially due to the significantly smaller matrix dimensions and hence fewer multiplications. Comparing the overall security targets of the schemes shows that the Standard-LWE implementation only provides 128 bits of classical security, whereas our implementations provide 128 and 192 bits of post-quantum security.

Microcontroller Results
We use the pqm4 framework [pqm] to evaluate the proposed microcontroller implementation. In the framework, the running time of an operation is measured in cycle counts using libopencm3. The framework can also measure the stack usage with the help of stack canaries. Our development board runs at a clock frequency of 168 MHz.
In Table 5 we show the cycle counts for the major building blocks of FrodoKEM as well as for the entire key pair generation, encapsulation, and decapsulation. At 168 MHz, key generation takes 266 ms, encapsulation 284 ms, and decapsulation 286 ms for FrodoKEM-640-AES. For FrodoKEM-976-AES the cycle counts are more than twice as high as for FrodoKEM-640-AES. The main reason is that the size of the matrix A grows quadratically in n, and the generation of A is the most time-consuming part of our implementation. Further speed-ups could be achieved by speeding up the AES implementation. However, to the best of our knowledge, the implementation by Schwabe and Stoffelen [SS16] that we used is already the fastest published AES implementation. Some Cortex-M4-based microcontrollers also have access to an on-chip hardware AES engine that would speed up the AES further. The implementations of FrodoKEM-cSHAKE are slower than the FrodoKEM-AES implementations, as cSHAKE is based on KECCAK, which excels on hardware platforms rather than in software [BDP+b]. We therefore expected FrodoKEM-cSHAKE to perform worse than FrodoKEM-AES.
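The more-than-doubling between the parameter sets is consistent with the quadratic growth of A (a quick consistency check, ignoring the parts of the computation that grow more slowly):

```python
# number of entries of A grows with n^2, so the dominant cost scales by:
growth = (976 / 640) ** 2
print(round(growth, 2))  # 2.33 -- i.e. more than a factor of two
```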
We furthermore see in Table 5 that the computation of AS + E is the most time-consuming part of the key pair generation, accounting for 93% of the run time for FrodoKEM-640-AES. Similarly, SA + E is the bottleneck for encapsulation and decapsulation (88% and 87%, respectively). The second most time-consuming operation is the sampling of a noise matrix. We measured the performance of noise sampling for matrices of dimension n × n̄. This operation is performed twice during every run of key pair generation, encapsulation, and decapsulation, and accounts for 6% of the run time of each of the three algorithms (FrodoKEM-640-AES). Compared to the computation of AS + E or SA + E, multiplications of smaller matrices cost far fewer cycles, as the cycle counts for SB + E show.

In Table 6 we list the peak stack usage of our implementations. For n = 976 the FrodoKEM specification [ABD+] reports a peak stack usage of 189 kilobytes when using AES as PRNG and 156 kilobytes when using cSHAKE. The specification reports at most 81,836 bytes as the static library size for non-vectorized implementations. As our development board has 192 kilobytes of RAM and one megabyte of flash memory, the peak stack usage is what we focus on in the following. We managed to reduce these numbers so that the implementation comfortably fits onto the microcontroller and still leaves space for other applications. For the AES-based implementation we reduced the memory consumption by 66% and for the cSHAKE-based implementation by 63%.

In Table 7 we compare our implementation with other implementations of key exchange schemes on Cortex-M microcontrollers. Our implementation of FrodoKEM-976-AES has a performance similar to the implementation of FrodoKEM-640-cSHAKE from the pqm4 library [pqm], even though the security level is higher (192 vs. 128 bits). Our implementation is two orders of magnitude slower than the NewHopeUSENIX implementation from [AJS16].
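Applying the stated reductions to the specification's baseline figures gives a rough idea of the resulting footprints (approximate values derived from the percentages above; Table 6 lists the measured numbers):

```python
def reduced_kb(baseline_kb, reduction_pct):
    """Peak stack usage after the stated percentage reduction."""
    return baseline_kb * (1 - reduction_pct / 100)

print(round(reduced_kb(189, 66)))  # ~64 KB, AES-based, n = 976
print(round(reduced_kb(156, 63)))  # ~58 KB, cSHAKE-based, n = 976
# both fit comfortably within the board's 192 KB of RAM
```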
The reason is that NewHopeUSENIX is based on ideal lattices, which are inherently much more efficient: the main operation in ideal lattice-based cryptography is polynomial multiplication. Polynomial multiplication in integer rings can be realized efficiently with the number-theoretic transform, which has a complexity of O(n log n), while the matrix operations in FrodoKEM have a complexity of O(n²). Therefore a decently optimized implementation of a scheme based on ideal lattices will always be faster than any implementation of a scheme based on standard lattices at a similar security level. Furthermore, the implementation in [AJS16] is only secure against chosen-plaintext attacks, not chosen-ciphertext attacks. Unsurprisingly, the KyberNIST-768 implementation from pqm4 also provides better performance, as KyberNIST-768 is a module lattice-based scheme that retains some of the structure of ideal lattices and in particular also benefits from the speed-ups of the number-theoretic transform. The ECDH implementation of [DHH+15] is also much more efficient, but as ECDH is not secure against attacks by quantum computers, it cannot be considered an alternative to FrodoKEM.
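The asymptotic gap can be illustrated numerically (a rough ratio ignoring constant factors and the schemes' other operations, so it overstates nothing precise about the measured slowdown):

```python
import math

def ops_ratio(n):
    """Rough ratio of O(n^2) matrix work to O(n log n) NTT-based work
    for dimension n, dropping all constant factors."""
    return (n * n) / (n * math.log2(n))

print(round(ops_ratio(640), 1))  # ~68.7
print(round(ops_ratio(976), 1))  # ~98.3
```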
While significantly slower than implementations of other schemes, FrodoKEM is a very conservative choice that does not rely on a structured lattice as a potential additional attack vector (as, for instance, NewHopeUSENIX does). Depending on the use case, it might be sensible to use NewHopeUSENIX in scenarios that demand high efficiency and moderate security, and FrodoKEM in cases with very high security requirements.

Conclusion
In this paper we present a thorough evaluation of the NIST post-quantum standardization candidate FrodoKEM on embedded devices. We have developed an FPGA implementation that fits into 2,000 slices on a low-cost Xilinx Artix-7 FPGA. The FPGA implementation needs 60 ms to run a full key exchange at a security level of 128 bits and 135 ms at a security level of 192 bits. We also developed an ARM Cortex-M4 microcontroller implementation that needs 836 ms to perform a full run of the protocol at 128 bits of security and 1.84 s at 192 bits of security. Our implementations are compatible with the reference implementation and cover all implementation options given in the specification, i.e., both parameter sets and both PRNG options. Our results show the efficiency of FrodoKEM and help to assess the practical performance of a possible future post-quantum standard.
For future work it would be interesting to further analyze the side-channel resistance of our implementations and the cost of applying countermeasures against side-channel attacks. Our implementations are protected against timing attacks as they have a constant execution time, but more sophisticated attacks, such as differential power analysis or fault attacks, are not covered in this work. Furthermore, we only applied those memory-usage optimizations that did not have a sizeable impact on performance. It would be interesting to know the performance cost of getting the scheme to run on even smaller microcontrollers, like a Cortex-M0, which have even less memory available.