A Compact Hardware Implementation of CCA-Secure Key Exchange Mechanism CRYSTALS-KYBER on FPGA

. Post-quantum cryptosystems should be prepared before the advent of powerful quantum computers to ensure information secure in our daily life. In 2016 a post-quantum standardization contest was launched by National Institute of Standards and Technology (NIST), and there have been lots of works concentrating on evaluation of these candidate protocols, mainly in pure software or through hardware-software co-design methodology on diﬀerent platforms. As the contest progresses to third round in July 2020 with only 7 ﬁnalists and 8 alternate candidates remained, more dedicated and speciﬁc hardware designs should be considered to illustrate the intrinsic property of a certain protocol and achieve better performance. To this end, we present a standalone hardware design of CRYSTALS-KYBER, a module learning-with-errors (MLWE) based key exchange mechanism (KEM) protocol within the 7 ﬁnalists on FPGA platform. Through elaborate scheduling of sampling and number theoretic transform (NTT) related calculations, decent performance is achieved with limited hardware resources. The way that Encode/Decode and the tweaked Fujisaki-Okamoto transform are implemented is demonstrated in detail. Analysis about minimizing memory footprint is also given out. In summary, we realize the adaptive chosen ciphertext attack (CCA) secure Kyber with all selectable module dimension k on the smallest Xilinx Artix-7 device. Our design computes key-generation, encapsulation (encryption) and decapsulation (decryption and re-encryption) phase in 3768/5079/6668 cycles when k = 2, 6316/7925/10049 cycles when k = 3, and 9380/11321/13908 cycles when k = 4, consuming 7412/6785 LUTs, 4644/3981 FFs, 2126/1899 slices, 2/2 DSPs and 3/3 BRAMs in server/client with 6 . 2 / 6 . 0 ns critical path delay, outperforming corresponding high level synthesis (HLS) based designs or hardware-software co-designs to a large extent.


Introduction
Powerful quantum computers would render the public key cryptosystems currently used in our daily life insecure, with which the underlying mathematical hard problem integer factorization, discrete logarithm that considered to be infeasible to handle in classic computer architecture now can be broken in polynomial time by Shor's algorithm [Sho94]. As a result, the post-quantum cryptography (PQC), study of cryptosystems that would be secure against adversaries who have access to both classic and quantum computers, is under intensive research not only in academia but also within industrial community. In December 2016, NIST launched a contest of new post-quantum cryptographic schemes and called for proposals from all over the world, aiming to form public key cryptography standards before the upcoming quantum era. In December 2017 and January 2019, the first and second round of candidates were announced respectively, each followed by a public comment period. Very recently, the third round of candidates were announced in July 2020. In total 15 out of 26 second round candidates get through, of which 7 have been selected as finalists and the other 8 as alternate candidates [AASA + 20]. These schemes base their security on different mathematical hard problems, and the evaluation criteria used to compare schemes throughout standardization process mainly focuses on three aspects, namely security, cost & performance, and algorithm & implementation characteristics, in a decreasing order of importance.
There have been lots of works concentrating on achievable performance of PQC protocols in hardware-software co-design, aiming to evaluate several candidates with similar computational tasks on the same platform, obtaining relatively precise rankings among them, for instance [NCD19][NVB + 20] [FDAG19]. It would give preference to some protocols over others in certain use cases, and further impact the standardization process to some extent [FDAG19]. The most time-consuming parts of the protocol are offloaded to separate hardware accelerators, referring mainly to NTT and hash functions, while others are implemented with software in processors such as ARM Cortex series. The hardwired ARM core can be replaced with popular RISC-V soft cores, which can be implemented with configurable logic in FPGA, making it possible to implement the whole design in hardware, avoiding slow data transmission across hardware/software boundary to a large extent, including works in [BUC19][AEL + 20] [FSS20][XHY + 20]. The accelerators can be designed as vector coprocessor, being able to speed up corresponding operations massively [XHY + 20]. In addition to performance gain compared with pure software design, flexibility is another primary advantage in consideration that the PQC contest is still in progress and different tweaks may be involved in a certain protocol. By dividing tasks into hardware and software parts, procedures that are standard and fixed by a few parameters would be implemented in hardware, while the ones that are subject to tweak and often intractable to handle in hardware, would resort to software design.
Complete hardware designs are still rare up to now. As the contest progresses to the third round, and is expected to last 12-18 months from July 2020, more dedicated designs would better demonstrate the strength and illustrate the intrinsic property of a certain protocol, fitting the expectation from NIST that more performance data of seven finalists and eight alternate candidates for hardware implementation would emerge [AASA + 20]. A few published ones that reported results for Kyber are [DFA + 20] and [HHLW20]. Moreover, some procedures that are intractable to handle in hardware but shared among different participants in the contest, can be properly designed once and used in a similarly way among corresponding protocols, especially tasks required by the Fujisaki-Okamoto transform in re-encryption phase. To this end, we design and implement all building blocks of a specific protocol -CRYSTALS-KYBER in hardware.
As one of the finalists, CRYSTALS-KYBER [SAB + ] is a KEM protocol basing its security on the presumed hardness of the MLWE problem [LS12]. It first constructs public key encryption from the original method in [Reg04] and then achieves adaptive CCA-secure through the tweaked Fujisaki-Okamoto transform [FO99]. The scheme possesses high performance on both hardware and software platforms, and another advantage over some Ring-LWE-based candidates in round 2 is that the security level can be easily adjusted by changing module dimension k. Specifically, security category 1,3 and 5, equivalent to strength of AES-128,AES-192 and AES-256 respectively are supported in Kyber. Contributions. A manually designed hardware implementation of Kyber is presented to illustrate its strength over HLS-based designs and hardware-software co-designs. Through compact scheduling of sampling and NTT related processes, decent performance is achieved with limited computational resources in NTT core, and the method, especially the use of predefined order sequence table and endpoint table can be extended to other candidates. Unified butterfly pair is proposed, within which NTT-related procedures, including the special point-wise multiplication (PWM) are all well supported and can be easily switched by short control code. A reference realization of the Fujisaki-Okamoto transform is given out, through use of a set of FIFOs with different specifications. We also explore the minimal memory footprint of all the RAMs and FIFOs adopted in out design. By means of unified computation cores and minimal memory footprint, the whole design is able to be deployed in the smallest device in Xilinx Artix-7 series. Our design has been verified against known answer test (KAT) files of Kyber in the round-3 submission to NIST and the Vivado project has been uploaded to https://github.com/xingyf14/CRYSTALS-KYBER to ease comparison. Structure. The rest of our paper is organized as follows: Section 2 clarifies the notations adopted in our paper firstly, then gives a brief introduction to CRYSTALS-KYBER and describes the special NTT and PWM. Section 3 describes design rationale of main hardware modules and our optimizations in implementing the protocol, from both hardware and algorithm point of view. The way towards minimizing memory footprint is discussed in Section 4. Performance results on the selected FPGA device and comparison with related works are given out in Section 5, followed by discussion about other design choices. Finally we draw our conclusion in Section 6.

Notation
The ring of integers modulo prime q is denoted as Z q . The ring of integer polynomials modulo prime q is denoted as Z q [X]. Then polynomials modulo both q and X n + 1 form the ring R q = Z q [X]/(X n + 1). In MLWE-based protocol Kyber, polynomial vector that contains k elements in R q is represented in bold lowercase where k represents module dimension in MLWE, while pure lowercase corresponds to a single element. Furthermore, polynomial matrix that consists of k × k elements in R q is represented in bold uppercase. The i-th entry within a polynomial vector s is denoted by s i , and the entry that lies in i-th row, j-th column in polynomial matrix A is denoted by A ij . Greek symbol ζ is selected to denote the n-th primitive root of unity in Z q , in conformance with Kyber specification, then by ξ we mean the square of ζ. The binomial distribution with parameter η is denoted as B η , where η directly relates with possible range of noise samples. In Kyber, n, q, ζ, η are set to 256, 3329, 17, 2 or 3 respectively and k takes value from 2,3,4.

CRYSTALS-KYBER KEM
Kyber, as part of the Cryptographic Suite for Algebraic Lattics (CRYSTALS), is a postquantum key exchange mechanism that bases its security on the MLWE problem. It follows conventional construction method to build an IND-CPA public key encryption (PKE) scheme firstly and then turns it into an IND-CCA KEM through the tweaked Fujisaki-Okamoto transform. Detailed procedures corresponding to KYBER.CPAPKE and KYBER.CCAKEM are listed in Algorithm 5-Algorithm 10 in appendix. Parameter sets and instantiation of symmetric primitives are listed in Table 6 and Table 7 respectively, with definition of auxiliary functions depicted in Equation 12. In the first-round submission to NIST, prime q is chosen to be 7681 and satisfies q ≡ 1 mod 2n with degree of modular polynomial n = 256, implying classic NTT can be utilized to speed up polynomial multiplication in R q . Current version of Kyber adjusts the protocol in several places, including the change of prime q from 7681 to 3329, resulting in several differences in detailed procedure, of which the most important one is that there is no 2n-th primitive root of unity in Z q , thus polynomials during NTT process are not evaluated at all the 2n-th roots of unity in Z q , but rather at all the n-th roots of unity. Although detailed Algorithm 1 Polynomial Multiplication over Z q [X]/(X n + 1) using negative wrapped convolution (NWT) both in Z q /(X n + 1), the n-th primitive root of unity ω n in Z q , the square root of ω n , denoted as ϕ 2n , prime q satisfying q ≡ 1 mod 2n.
algorithm is changed, the computational complexity remains the same, which would be demonstrated further in Subsection 2.3. On the basis of a rather small prime, the noise in Kyber can be selected within a tight range.

NTT in Kyber
NTT resembles discrete Fourier transform (DFT), which is widely used in signal processing and conducted in complex field C. Differently, NTT is conducted in finite field Z q and it transforms an integer polynomial from coefficient representation into point-value representation, and can be sped up in the same way as fast Fourier transform (FFT) used in DFT case, only if we choose a set of special points to evaluate the polynomial. By the usage of NTT, computational complexity of multiplying two n-term integer polynomial can be decreased from O(n 2 ) to O(n log n). Typically, prime q is chosen such that there exists 2n-th primitive root of unity in Z q , where the polynomial modulo X n + 1 completely factors into n degree 1 polynomials. Defining ϕ an arbitrary 2n-th primitive root of unity and ω the square of ϕ, the n roots of X n + 1 can be generated completely by ϕ 2k+1 with integer k ranges from 0 to n − 1. The n-point NTT/INTT of vector x/X are defined in Equation 1 and Equation 2, both referred to as classic NTT hereafter.
Compared with evaluation process in FFT, pre-scaling is needed for input polynomial terms, corresponding to multiplication x k ϕ k in Equation 1. Similarly, post-scaling (multiplication with ϕ −k in Equation 2) should be conducted on output polynomial terms. These two auxiliary procedures result from the chosen polynomial modulo X n + 1, and help to make NTT applicable to this special modular polynomial multiplication. The complete processes are also known as negative wrapped convolution (NWC), as depicted in Algorithm 1. NTT in Kyber is slightly different, where field Z q contains n-th primitive roots of unity, but not 2n-th primitive ones, thus the modulo X n + 1 can not fully factor into n degree 1 polynomials, but rather n/2 degree 2 ones. Specifically, X 256 + 1 = 127 i=0 (X 2 − ζ 2i+1 ) in Kyber, and the evaluation process is applied to the set of all the 256-th root of unity {ζ 1 , ζ 3 , · · · , ζ 255 }. For an arbitrary polynomial f (x) = The evaluation process is then simplified as substituting y with each 256-th root of unity In an alternative form that conforms the definition in Kyber specification, ξ = ζ 2 is the 128-th root of unity in Z q and br(i) for i = 0, · · · , 127 is the bit reversal of the unsigned 7-bit integer i. To this end, one 256-term NTT process in Kyber can be conducted by two separate 128-term classic ones as aforementioned in Equation 1. After NTT, polynomial multiplication h(x) = f (x) · g(x) mod (X n + 1) corresponds to PWM at these selected points, i.e. H i = F i · G i mod (X 2 − ζ 2br(i)+1 ). Specifically, 2i+1ĝ2i+1 +x(f 2if2i+1 +ĝ 2iĝ2i+1 ) (5) It is quite different from classic PWM where only one multiplication in Z q would be conducted, here five multiplications are needed to complete the calculation, as the evaluation value at a certain point is not one element in Z q , but rather one degree 1 polynomial in Z q . INTT in Kyber can be derived similarly as NTT, with one INTT corresponds to two classic ones as in Equation 2. When NTT and INTT are applied to vector or matrix of polynomials in R q , the respective operation is conducted on each element individually.

Order-relating issues and rearranged iterative NTT algorithms
In Kyber specification, the generated secret polynomial s, r, noise polynomial e, e , e are considered to be in natural order, e.g. elements s 00 , s 01 , s 02 · · · in s 0 are generated one by one in order. Besides, the reference software implementation chooses decimation in time (DIT) strategy for NTT and decimation in frequency (DIF) strategy for INTT, which means input vector is considered to be in natural order, and the output vector of NTT is in bit-reversed order correspondingly, resulting from the property of NTT calculation. After INTT, the output vector would return to natural order. Noticing the order of input, output vector in original DIT NTT and DIF INTT is contrary to actual requirement in Kyber, both iterative DIT and DIF algorithm need modification to be applicable. Detailed schematic during each stage of rearranged DIT/DIF strategy is illustrated in left/right part of Figure 1 [POG15], with corresponding algorithm presented in Algorithm 2/Algorithm 3. It is noteworthy that twiddle factors are rearranged as well and further merged with scaling factors, and multiplication with n −1 in INTT is Algorithm 2 Rearranged Iterative DIT NTT Input: digits vector x = (x 0 , x 1 , · · · , x n−1 ), precomputed ϕ k for k ∈ [0, n − 1) stored in bitreversed order of k in ROM_forward. t ← ω · x j+k+m/2 7: u ← x j+k 8: x j+k ← u + t 9: x j+k+m/2 ← u − t distributed among stages where simple multiplication with 2 −1 is conducted instead. These two methods are meant for decreasing computational complexity in terms of the number of multiplications in Z q . When conducting point-wise multiplication, the evaluation points are in bit-reversed order, requiring the corresponding 256-th primitive roots of unity in Equation 5 also in bit-reversed order, just as in Kyber specification ζ 2br7(i)+1 is used in an increasing order of i rather than ζ 2i+1 . For the same reason, elements of public matrix A,Â T should also be generated in bit-reversed order.

Binomial sampling
The noise polynomials s, e, r, e , e in Kyber obey binomial distribution with parameter η = 2 or 3, implying only four or six pseudo random bits are required to form one desired sample. Taking the former as an example, four random bits, denoted as b . The output sample would simply be w 1 − w 0 , lying within [−2, 2] ∩ Z. As can be seen, the whole process is pretty succinct and easy to realize in constant time.

Design Rationale
The overall data flow chart in server is depicted in Figure 2, details corresponding to major parts will be demonstrated further in the following subsections, including NTT core and its components, encoder/decoder as well as hash module.

Butterfly unit
As aforementioned in Subsection 2.3, one 256-term NTT in Kyber can be regarded as two separate 128-term ones, namely the even index one and the odd index one, and each of them can be implemented in a classic way. These two separate NTTs share the same scaling factors and twiddle factors, thus can be calculated concurrently. Naturally two sets of butterfly units should be adopted to handle the even part and the odd part. Considering both DIT and DIF strategies should be supported, the unified butterfly structure would be more suitable than two separate operators that each supports one strategy [BUC19]. Details with regard to active operators during DIT NTT and DIF INTT is illustrated in upper part of Figure 3. In addition to conventional use where it is either working in DIF mode or in DIT mode, several other operations are also adapted to functionality of the unified butterfly units, as   Table 1. Correspondence between 10-bit control code sel with all the operators is depicted in Figure 10. The special point-wise multiplication in Kyber, corresponding to Equation 5, is dispatched among operators in two butterfly units and implemented in two cycles, namely PWM0 and PWM1. Active operators during each cycle are illustrated in lower part of Figure 3 and details would be further demonstrated in Subsection 3.3. Similarly, three other operations are also supported with some modifications made to operators in unified butterfly units or expression form of procedures in Kyber protocol. For the former one, INTTm function absorbs addition with Decompress q (Decode 1 (m), 1) into the last stage of INTT(t T •r), from an observation that add2,sub2,add3 and sub3 are all idle during INTT. Obviously sub2 and sub3 should be modified in order to support extra addition functionality. For the latter one, one example is u = NTT −1 (Â T •r) + e in encryption phase. Noticing u would be compressed before it forms ciphertext c 1 , addition with noise samples can be delayed (thus u = NTT −1 (Â T •r)) and handled in Compress procedure by means of Compress q (u − e , d u ), equaling to 2 du (u−e ) q mod 2 du . The new operation is conformed to basic functionality of DIF butterfly unit and thus supported by our unified butterfly units, implying the addition with noise samples can be absorbed, saving both time and resources potentially. The same method also goes for Compress q (v, d v ), corresponding to COMPs in Table 1. The sign of noise samples should have no effect on security of the protocol, but to be compatible with software implementation, the centered binomial distributed noise samples in polynomial e and e need to be negated, corresponding to position switch between two Hamming weights in subtraction. The DECOMP function realizes Decompress q (Decode du (c 1 ), d u ) and Decompress q (Decode dv (c 2 ), d v ), in which multiplication with q would be conducted.

RAM structure
The corresponding even and odd index coefficients, arranged as a pair, reside in the same place within a RAM. During NTT process, two pairs of coefficients are fetched from RAM, with the even index coefficient from each pair sent to one butterfly unit, and the odd index coefficient from each pair sent to the other. In this way, the shared twiddle factors need to be read only once. An overall schematic of NTT core is illustrated in Figure 4.
The noise polynomials are written into RAM in natural order as discussed in Subsection 2.4. Each cycle two pairs of points would be fetched and two pairs would be written back concurrently in consideration that butterfly units function in a pipelined manner. At the first sight, the two pairs can be stored in the same place within a certain RAM, and output pairs between two consecutive cycles should be interleaved and stored back [RVM + 14] [FSS20]. The strategy is not suitable in our case as points with increasing index are stored continuously in RAM, thus not ready for application of the strategy unless an extra bit-reverse process is conducted, which is time-consuming. In our design, RAMs are arranged as two banks instead, each with the width of one pair of points. Strict in-place strategy is adopted, with additional advantage relating with PWM between data stored in RAMs and input points. Specifically, when two pairs of points are stored in the same place, one of them needs to be registered as two butterfly units can only handle one PWM per cycle. By each bank realized in separate RAM, granular data movement can be achieved. Detailed RAM structure is illustrated in Figure 5.

Adopting Karatsuba algorithm in point-wise multiplication
Point-wise multiplication, or basecase multiplication referred in Kyber specificationĥ 2i Bank0 Bank1 Figure 5: Detailed RAM structure, the left exhibits specific content after samples are written into the RAM and the right corresponds to results after NTT process. straightforward way :ĥ Clearly there are five multiplications in Z q in total, and two consecutive ones,f 2i+1ĝ2i+1 · ζ 2br(i)+1 form the longest path. Despite the longest path can not be shortened, the total number of multiplications can be decreased by means of Karatsuba algorithm, which is originally used in integer multiplication a × b = (a 0 + a 1 2 n ) × (b 0 + b 1 2 n ) and depicted as follows : The number of n-bit multiplications then decreases from four to three. On this basis, Karatsuba algorithm can also be used in PWM in our case: As can be seen, the number of multiplications decreases from five to four, andĥ 2i +ĥ 2i+1 X is now obtained byĥ The calculation process is divided and allocated in two cycles in our design, according to functionality of butterfly units. The way that each step is realized with a specific operator in butterfly units has been shown in Figure 3 and Table 1.

Modular reduction
In Kyber a fixed modulus q = 3329 is used, and dedicated modular reduction method can be implemented to accelerate corresponding operations, especially modular multiplication in Z q . There exist two multiplications in generic Barrett or Montgomery reduction method, demanding extra DSP resources or forcing the reduction unit to work in a time-multiplexed way with other operations that involve multiplications. In fact, these two multiplications can be replaced with shifts and additions by making use of property of q. We adapt Barrett reduction method to modular multiplication in our design, for the sake of its clarity. Another advantage is that approximate quotient is also generated along with the remainder, which would be beneficial in Compress procedure during encryption phase.

Algorithm 4 Modified Barrett Reduction
Input: Integer prod with maximum length 24 bits, and fixed modulo q being 3329 10 Output: res = prod mod q, quo = prod/q 1: µ ← 2 24 /q = 2 12 + 2 10 − 2 6 − 2 4 2: quo ← prod Detailed algorithm is given out in Algorithm 4, and quo in line 2 is an approximation of actual quotient prod q . As quo ∈ prod q − 1, prod q + 3 from analysis, the difference between input prod and quo · q lies in [−3q, 2q). With follow-up steps presented in line 4 to 15, precise remainder can be obtained. Noticing in Algorithm 4 the intermediate value dif f is a signed 15-bit integer.

Encode and decode with shift register
Encode procedure packs polynomials into 32-bit block arrays. As the possible coefficient length (12 int, d u in u, d v in v) does not always divide 32-bit data bandwidth, points in polynomials need to be regularized before they are transmitted. Conversely, packed 32-bit block arrays should be decoded into data points with proper length, before they participate in calculation in subsequent procedure. Dedicated encoder and decoder are adopted in our design, both implemented with shift register. Detailed working flow and content in shift register during each cycle is clarified in Figure 6. For decoder, it requests 32-bit data blocks from FIFO whenever left bits in shift register are not enough to form one pair of data points with proper length. Decoded data points stream out continuously as long as corresponding FIFO is not empty, as input data rate is greater than output data rate. The encoder is working in a reversed way, receiving output data points continuously from two sets of butterfly units, then generating regularized 32-bit data blocks discontinuously.
Encoder and decoder can also be implemented with long buffer in the same way as [RB20] did to deal with length mismatch between coefficients in polynomial and data blocks stored in FIFO. Initially a 352-bit (LCM (2 × d u , 32)) buffer should be involved. For decoder in server when k = 4 with d u = 11, the buffer is filled after 11 cycles of data loading, then 22-bit data pieces can be read out from the most significant part or the least significant part, depending on shift direction in 16 cycles continuously. Dedicated cycles in data load drag down the overall performance severely. To address this issue, the authors in [RB20] proposed an improvement that data points can be fetched before the whole buffer  is filled, reducing both the buffer size and cycle overhead. However, the improvement comes at the cost of an extra large multiplexer, which would be a 11-to-1 22-bit-width one in our use case. Noticing in Figure 6, the multiplexer is 16-to-1 with the same 22-bit length, and the working flow is almost the same as long buffer in [RB20] whilst consuming much less register bits (52 in our design), the decoder adopted in our design should be advantageous. For v = Decode(c 2 ) with different k, output bits are fetched from positions included in u = Decode(c 1 ) scenario. The longer output than needed (when k < 4 or during v = Decode(c 2 )) would be truncated when fed into butterfly units for subsequent calculation. Decoder in client is more succinct as it receives 32-bit data and outputs fixed 24-bit data, corresponding tot = Decode(pk).

Hash module
The hash module accounts for a great portion of the total resource consumption, thus special attention should be paid to reach an overall efficient design. It needs to support several function modes specified in SHA3 standard [oST15], including SHAKE-128, SHAKE-256, SHA3-256, SHA3-512. They share the same underlying function Keccak-f [1600], but with different rate r and different message suffix appended for domain separation. The constitution of r bits out of 1600-bit input is decided by the top module of either server or client. Parts of register data, appended suffix and part of padding data are selected through multiplexer under the control of a finite state machine, sent to ififo of hash module, as illustrated in Figure 2. Keccak core is working in an autonomous way that whenever ififo is not empty and it has done the last hash, data would be fetched from ififo until enough data has been collected to begin the next hash process. Detailed data flow is shown in Figure 7. It is especially useful in H(pk) and H(c) cases, where 32-bit data stream of pk or c flows into ififo and Keccak core decides when to fetch data from ififo. . .
. The bandwidth of Keccak core is set to 32 bits, aiming to be compatible with software implementations, thus the 1600-bit internal state would return to its initial value after 50 cycles of shift, which would be helpful in Hash(pk) and Hash(c). As elements of matrix A/A T are sampled from 12-bit pseudo random integers, the Keccak core should naturally generate 24-bit data per cycle in our design, which is not a divisor of 1600. Similarly, noise samples in s, e, r in Kyber512 (η 1 = 3) each consumes 6 pseudo random bits, then the problem that 6(×4) is not a divisor of 1600 arises as well. In both cases there would be samples split over two invocations of Keccak-f function, and updating internal state with data pieces from different indices would cause an obvious increase in total resource consumption. Considering these two kinds of samples are stored in separate places in our design, namely ofifo0 and ofifo1 to ease elaborate manipulation when read out, a data buffer is inserted between the Keccak core and ofifo0/ofifo1 in Figure 2, which receives 32-bit data pieces and generates 24-bit and 16-bit (η 1 , η 2 = 2) output data. It consists of a small FIFO (32 × 32) and a small decoder, in a similar way with design rationale of Decode in the protocol.
The uniform and binomial distributed samples correspond to different input from register file (ρ/σ) as well as padding bits, as illustrated in Figure 2. Whether the samples should be fed into ofifo0/ofifo1 or not is determined before they are generated by the Keccak core. Considering the complexity in generating dedicated logic to control the order of sampling process, a predefined order table would be more suitable. The sort of samples that being generated is determined by the current bit fetched from order sequence, which we depict in Table 2. In Table 2, bit 1 (and specially 2&3) corresponds to centered binomial sampling, and bit 0 uniform random sampling. The underline beneath a certain bit indicates the end of sampling process, and a dedicated endpoint table highlights the endpoint position. The order sequence is fetched and stored in the 73-bit register patt upon reset signal is valid. The endpoint sequence, similarly, would be stored in the 73-bit endp concurrently. During sampling, the Keccak core would take register value and padding as input according to the most significant bit of patt, and patt would shift left one bit upon start signal go of the current Keccak process is valid. When the last bit corresponding to endpoint is fetched, the sampling process would terminate after current generation of samples and the top module moves to next procedure. The endpoint table is essential as the number of Keccak invocations varies among different system parameter k.
According to the specification of Kyber, noise polynomials in centered binomial distribution ψ n η1 , ψ n η2 are generated by PRF, which is instantiated with SHAKE-256 and outputs Enc k endpoint 0 2 0000000000000000000100000000000000000000000000000000000000000000000000000 1 2 0000000000000000000010000000000000000000000000000000000000000000000000000 0 3 0000000000000000000000000000000000000000010000000000000000000000000000000 1 3 0000000000000000000000000000000000000000001000000000000000000000000000000 0 4 0000000000000000000000000000000000000000000000000000000000000000000000010 1 4 0000000000000000000000000000000000000000000000000000000000000000000000001 1088 bits data after 24-round Keccak-f function. The parameter η 1 equals to 3 in k = 2 case, otherwise η 1 is set to 2, while η 2 always takes 2. For one 256-term polynomial obeying ψ 256 η , the total number of pseudo random bits needed is 2 × η × 256, which would be as large as 1536, implying that the PRF in generating s, e, r corresponds to two consecutive invocations of Keccak-f function in Kyber512 case, illustrated as 32 pairs in Table 2. In practice, both 3 and 2 need two bits space, so an auxiliary table that highlights the cases where η 1 = 3 is added along with order sequence table. Similarly, public matrixÂ/Â T is sampled from pseudo random bit stream with XOF, which is instantiated with SHAKE-128 and outputs 1344 bits after 24-round Keccak-f function. Elements in matrixÂ/Â T should be uniform random distributed in Z q , thus some samples are rejected when they fall outside the desired range. To be compatible with software implementation, the range is set to be [0, q), corresponding to 12-bit pseudo random numbers, and the acceptance rate is 3329/2 12 = 81.3%. Therefore, the number of generated elements in Z q that one 24-round XOF generates is 1344/12 × 81.3% = 91.0, implying with at least four XOFs can the generation of one element polynomial ofÂ/Â T in R q be guaranteed, in consideration that binocdf (255, 3 × 112, 81.3%) = 8.33 × 10 −3 , which is not negligible. Sampling of one uniform polynomial in R q corresponds to 0000 bit strings (ahead of endpoint) in Table 2. Details would be further demonstrated in Subsection 4.2.

RAM dimension
In order to reduce memory footprint, only the points that will participate in later calculation whilst other participants are not ready at the time would be stored in memory. For example, polynomialŝ i for i ∈ {0, 1, 2, 3} must be preserved after key generation phase in server side, asŝ i would multiply withû i during decryption, thus dedicated memory block must be allocated for storage of eachŝ i and it is free to use only after decryption. The same goes forr i in client side. In other words, data points should be preserved in RAM during their lifecycle, and bothŝ andr can be stored in a single large RAM block as under no circumstances would differentŝ i parts orr i parts be accessed concurrently. As aforementioned in Subsection 3.2, memory block is divided into two banks, with one pair of  In addition, one wider RAM block, depicted as RAM4, is involved and dedicated to PWM between two polynomials in NTT representation, includingÂ •ŝ +ê in key generation phase,Â T •r,t T •r in encryption phase as well asŝ •û in decryption phase. Based on the observation that each PWM needs space as large as four intermediate points from Subsection 3.3, RAM4 is arranged as 48-bit-width, 128-piece-depth memory block. It is especially essential inÂ •ŝ +ê andÂ T •r, where inputÂ andÂ T are fetched from ofifo0,ŝ andr fetched from RAM0/RAM1. The intermediate points can not be fully stored in the same place in RAM where elements ofŝ andr residue, as two points from RAM would explode to four intermediate ones, and ofifo0 is busy in generating other parts of public matrixÂ orÂ T at the same time.

Specification of FIFOs
The depth of each FIFO depends on its functionality in our design. In Figure 2, the FIFOs with lowercase name ififo, ofifo, ofifo0 and ofifo1 reside in hash module. They are related with Keccak function and all serve as data buffer, thus their size should be adequate that no data is lost during transmission resulting from mismatch between read and write rate. The same goes for IFIFO in top module. Differently, DFIFO0, DFIFO1, and OFIFO in top module serve as data storage, thus their size is determined by the largest data block they would store.
IFIFO lies ahead of decoder and its size is determined by mismatch between input rate from client and processing ability of decoder. Detailed working flow of decoder in decoding polynomial u (with parameter k = 4) is illustrated in Figure 6. In 16 cycles, it outputs 22-bit data consecutively, and requests 32-bit data of c from IFIFO d u times. Consequently, it needs du×256×k 32 × 16 du = 128k cycles to decode polynomial u from transmitted ciphertext c, which is greater than du×256×k 32 + dv×256 32 = 8kd u + 8d v in all cases when k, and correlated d u , d v take different values. As a result, the decoder is always decoding polynomial u when IFIFO receives ciphertext c from client, and in total (8kd u + 8d v ) × du 16 = 1 2 kd 2 u + 1 2 d u d v 32-bit data is fetched from IFIFO, leaving behind 8kd u + 8d v − 1 2 kd 2 u − 1 2 d u d v data, which equals to 72/102/122.5 when k takes 2/3/4. Consequently, the depth of IFIFO can be as small as 128, and the width is 32 bits, equaling to transmission bandwidth.
The size of DFIFO0, DFIFO1 is determined by the content that it stores, i.e. polynomial u/t and v respectively. Noticing u,t are both vector of 256-term polynomials while the width of u is smaller thant in all cases when k takes different value, the DFIFO0 should  Figure 9: Elaborate scheduling during sampling and corresponding NTT procedures be set as 24 × 512 data storage, in order to fully savet when k takes 4. Similarly, DFIFO1 should be set as 10 × 128 data storage. OFIFO of server needs to storet and publicseed ρ, in total 96k + 8 pieces of 32-bit data, whose maximal value is 392, implying the depth should be set to 512 at least. The width of OFIFO is 34. Besides 32-bit data transmitted to client, two flag bits are added in each piece, indicating whether the current piece corresponds to publicseed ρ and whether it is the last one that would be transmitted. The flag bits would not be sent to client, but rather are helpers that determine when the transmission is done and which data pieces should be sent back to the input port of OFIFO when they are requested and outputted in preparation for later use in re-encryption process v = NTT −1 t T •r + e + Decompress q (Decode 1 (m), 1). More details would be given out in Subsection 4.3.
As for data buffer ififo in hash module, the worst case happens when server/client requests c/pk data from OFIFO in client/server. At this time, continuous data stream sent to server/client would also be stored in ififo, corresponding to procedure H(c)/H(pk). From Kyber specification we know H function is instantiated as SHA3-256 from FIPS-202 standard, where 1088 = 32 × 34 bits are absorbed in one 24-round Keccak-f function. In our design one 24-round Keccak-f would take 79 cycles. Consequently, the ratio of speed between continuous data stream fed into ififo and discontinuous data request from Keccak core is 79:34, thus the leftover 32-bit data in ififo during H(c) would be (8kd u + 8d v ) × (1 − 34/79) = 109/155/223 when k takes 2/3/4. For H(pk), that would be (96k + 8) × (1 − 34/79) = 114/169/223. As a result, the lowest yet adequate depth of ififo is 256, and the width should be 32 bits, determined by bandwidth of Keccak core.
Up to now, we obtain the size constraint of IFIFO, DFIFO0, DFIFO1, OFIFO and ififo from two factors, namely the worst case of rate mismatch in data buffer and largest data block that storage would preserve. In both cases, there would be data losses when corresponding FIFO is not large enough. For ofifo0 and ofifo1, the constraint is looser in terms of their size. When they are not enough for storing the whole newly-sampled data pieces needed by following processing module, bubbles would be inserted into calculation procedure such that enough data can be collected and transmitted, implying the performance would be dragged down. On the other hand, we can arrange the sampling procedure elaborately to decrease the size of ofifo0 and ofifo1. Now we demonstrate how the sampling procedures are arranged in cooperation with NTT calculation, with details illustrated in Figure 9. Firstly, one NTT on a certain polynomial in R q would take 128/2×log 2 128 = 448 cycles. In NTT(s i ), NTT(e i ), NTT(r i ), process that writing data into RAM takes another 64 cycles, and in Figure 9 they are merged into the cycles relating with NTT for clarity, amounting to 512 cycles in total. Similarly, cycles corresponding to INTT(u i ) includes compression procedure in addition to INTT, which takes 128 cycles. During one NTT or INTT, one noise polynomial and one element polynomial of public matrixÂ/Â T are guaranteed to be generated. Secondly, one point-wise multiplication betweenÂ ij /Â T ij andŝ i /r i takes 256 cycles. As a result, during two such multiplications, one element of public matrix and one noise polynomial are guaranteed to be sampled when k = 2. For k = 3, during three such multiplications, two elements of public matrix and one noise polynomial are guaranteed to be sampled. Similarly, the number would be three and one respectively in k = 4 case. To this end, we present a compact sampling procedure that cooperates with NTT-related calculation. The complete sequences in both server and client are listed before in Table 2. Noticing Figure 9 means to illustrate the margin between sampling and corresponding NTT. In practice, the sampling procedures are continuous except when corresponding ofifo that data should be sent to is full. At that moment, sampling would be halted until corresponding ofifo is ready to receive new samples. Basically, it follows the so called just-in-time principle [HOKG18][KMRV18] [BUC19], which corresponds to minimal memory footprint. The compactness comes from two aspects, of which the first refers to clock cycles, i.e. the samples generated from Keccak core are always ready to be fetched from ofifo where they are stored, thus no bubble needs to be inserted and performance would not be affected by sampling. The other goes to the size of ofifo0 and ofifo1. ofifo0 is designed to store only two elements of matrixÂ orÂ T , and ofifo1 only one element of noise polynomial vector s, e, r, e , e . Thus the size of ofifo0 and ofifo1 is set to 24 × 256 and 24 × 64 respectively. Noticing the bandwidth of ofifo1 is 24 bits, enough for 4 centered binomial distributed samples when η 1 = 3. They are transformed to samples in NTT module before written into RAM, as illustrated in Figure 4.

Making full use of data storage FIFOs
In Kyber.CCAKEM.Dec, re-encryption process is required, which implements almost all the procedures in client. Besides the increase of computational complexity, an important problem arises that data stream of c 1 , c 2 , received from client during decryption phase as well as data points oft in key generation phase would be requested again in the re-encryption phase. Message c 1 and c 2 would be requested and compared with the newly formed c 1 and c 2 to validate correctness of decryption, influencing generation of the final agreed key, andt would be requested in restoring message v with v = NTT −1 (t T •r ) + e + Decompress q (Decode 1 (m ), 1), implying these related message must be stored properly in some place. Instead of initializing more memory blocks, we make full use of the data storage FIFOs, in consideration that they are free to use in re-encryption phase after occupation in corresponding process.
To implement the strategy, output data of DFIFO0/DFIFO1 and OFIFO are redirected to its corresponding input port if they are not working in re-encryption phase. For DFIFO0 and DFIFO1, the redirected data points actually belong to Compress q (u, d u ) and Compress q (v, d v ) respectively, and they are compared with Compress q (u , d u ) and Compress q (v , d v ) in re-encryption directly, without further resorting to Encode procedure. The results of above comparisons are equivalent to comparison between c 1 , c 2 and c 1 , c 2 , with two encode procedures eliminated. Different from DFIFO0 and DFIFO1 where all data points are needed in later comparison, register value ρ stored in OFIFO would not be requested. To distinguish which part of data points would be redirected, two flag bits are added, expanding the width of OFIFO from 32 bits to 34 bits, as depicted in Subsection 4.2. Noticing the redirected message is actually Encode(t), a decode process should be involved such that the message is unpacked into data points and ready for participating in point-wise multiplication withr in re-encryption phase. The decoded data points are fed into DFIFO0, just after Compress q (u, d u ) being read out for comparison. Detailed data flow is depicted in Figure 2.

Results, Comparison and Discussion
The proposed design of CRYSTALS-KYBER has been simulated, synthesized and implemented with Vivado 2017.3 design suite on Xilinx XC7A12TCPG238-1 device, with all building blocks implemented in hardware. All module dimension k are supported and can be selected through dedicated input port. Bandwidth of data transmission in both server and client is set to be 32 bits, in order to be compatible with bandwidth of hash module.

Results
Cycle counts at different protocol phases with different parameter k are depicted in Table 3. The critical path delay is 6.2/6.0 ns, corresponding to 161/167 MHz maximum frequency respectively in server/client, and execution time of each phase is calculated accordingly. Detailed cycle counts corresponding to each procedure in server side when k takes 3 is listed in Table 8, where procedures conducted concurrently in NTT core and hash module reside in the same line. The execution time of sampling, both noise polynomials and public matrices, is basically hidden behind NTT and related PWM processes. The procedures H(pk) and H(c) are conducted simultaneously along with transmission/reception of pk and c. Overall, the execution time occupied by pure hash function accounts for rather small portion of the total time, as a result of elaborate scheduling between sampling and NTT procedures, as depicted in Figure 9. Consequently, timing performance is determined by computational ability of NTT cores, or more specifically butterfly units to a great extent. Key generation, encapsulation and decapsulation phase in CCA-secure Kyber can be completed in 23.4/30.5/41.3 µs when k = 2, and 39.2/47.6/62.3 µs when k = 3, corresponding figures would be 58.2/67.9/86.2 µs when k = 4. Resource consumption of our proposed design is depicted in Table 4, where occupation by major submodules are also listed out. As can be seen, Keccak core accounts for more than 40% LUTs and 35% FFs of total consumption, resulting from that a high speed Keccak design is involved to keep up with computational ability of butterfly units, generating enough samples before they are required. The second most area-consuming module is NTT core, where two sets of butterfly units are allocated, cooperating with each other to realize several extra functionalities in addition to conventional use in NTT and INTT, as depicted in Table 1. There are differences in input/output multiplexing relating to inner adder/subtracter of each butterfly unit as well as intercommunication between two sets of them, so they are implemented in a single butterfly module, where two reduction units are adopted. In NTT core, RAM0 and RAM1, arranged as 24×256 memory block, each consumes one 18k BRAM. Similarly, RAM2 and RAM3, arranged as 24×64 memory block, each consumes one 18k BRAM, but only occupies 1 12 storage capacity. Differently, RAM4 consumes one 36k BRAM because that its data width is 48 bits, exceeding 36-bit boundary that one 18k BRAM can hold. In total, five RAM blocks occupy three 36k BRAMs. Twiddle factors as well as scaling factors are stored in distributed RAM, in consideration that each one of ROM0, ROM1, ROM2 contains only 12×128 data bits, used in NTT, INTT, PWM respectively. Overall, our design consumes 7412/6785 LUTs, 4644/3981 FFs, 2126/1899 slices, 3/3 BRAMs and 2/2 DSPs in server/client side, and can

Comparison with related works
Comparison with related works that concentrate on hardware implementation of contest candidates, especially CRYSTALS-KYBER is demonstrated in Table 5. Both [DFA + 20] and [HHLW20] report performance and resource consumption of standalone hardware implementation of Kyber. In [DFA + 20], besides summarizing and analyzing massive results reported by other groups, four CCA-secure KEM candidates of the second round are implemented in pure hardware as the authors report, including Kyber. Compared with [DFA + 20], our proposed design is more than two times slower, resulting from 1.5 -2× more cycle counts and lower maximum frequency. It needs to be noted that key generation is assumed to be performed in software in [DFA + 20] and only one module dimension k is supported at a time. As [DFA + 20] does not give out any detail of either hardware architecture or implementation procedures, we can only make some speculations from the reported resource consumption. Speed grade of our device is -1, which is the slowest among the device family. Beside, more than two times of FFs are utilized in [DFA + 20] compared with our design, facilitating shorter critical path, thus higher maximum frequency can be obtained. 4× of DSPs, 5× of BRAMs as well as 1.5× LUTs are involved, implying more computational cores participate in calculation and total cycle counts would decrease.
In [HHLW20], BRAMs are inserted between different modules for communication, and bottom modules are reused in realizing related upper functionalities extensively. Through elaborate scheduling with limited computational resources, cycle counts of our design is 10 times smaller than that of [HHLW20], and resource consumption corresponding to LUT and FF are more than 10× and nearly 40× smaller as well.
[BUC19] [FSS20][AEL + 20] are all crypto-processors constructed upon a RISC-V core. The RISC-V soft core is implemented in fabric, configured in pipeline structure with different depth. Accelerators are inserted into the pipeline, and would be invoked in the same way as arithmetic logical unit (ALU) by user defined instructions, which is extended based on existing instruction set architecture (ISA). Procedures in the protocol would be translated to supported instructions, and conducted in the RISC-V core. Total cycle counts would generally be much larger compared with dedicated, standalone design. Besides, maximum frequency is subject to timing performance of the RISC-V core severely. It can be seen from  [FSS20]. In summary, a general purpose computational platform is able to implement cryptographic protocol, but it possesses no particular advantage except flexibility in deploying different protocols.
[BSNK19] is based on HLS design methodology, aiming to overcome the long deployment time problem. A widely accepted view is that results from HLS design are much worse, in terms of both timing performance and resource utilization than designs obtained through manual hardware description language (HDL) coding, which is also true in comparison between [BSNK19] and our design.
In addition to related hardware implementation of Kyber on FPGA, [RB20] is a perfect counterpart as comparison work, in which a similar module lattice-based KEM protocol Saber [DKRV] is implemented. Saber is another finalist in the third round contest, basing its security on module learning-with-rounding problem. Dimension of polynomial ring n = 256 is the same with Kyber, as well as module dimension parameter k = 2, 3, 4. Binomial distributed noise is also adopted with parameter η = 3, 4, 5, corresponding to different k. As NTT is not directly applicable in Saber, NTT procedures are not involved in its protocol, making it possible to take advantage of small norm of noise polynomials during polynomial multiplications by simple schoolbook method. 256 multiply-and-accumulate units are instantiated in parallel, being able to compute one polynomial multiplication in only 256 cycles. Besides, as noise term is within a small range, such multiplication in field Z q can be realized with simple shifts and additions, which means the whole design can be implemented without DSP. Timing performance of our design is pretty close to that in [RB20], while LUT, FF consumption is more than 3×,2× less than the design of Saber.

Discussions about performance and resource utilization
Our presented design targets to explore intrinsic relation between two core modules, NTT and Keccak to achieve decent performance with limited computational resources. The second purpose is to give out a design method in implementing procedures involved in the Fujisaki-Okamoto transform through reusing as much hardware structure and memory as possible. In pursuit of higher performance design, more butterfly units should be exploited, with other modules keeping up with data input/output requirement of them. The butterfly cores can be organized as vector processor, with dedicated permutation networks located in front of and behind them to arrange data points in desired order at the current stage of NTT process [PNPM15][XHY + 20]. Alternatively, constant geometry NTT [Pea68] can be adopted to avoid complex and variable read/write pattern in different stages [BUC19] [XL20]. Noticing Kyber is based on MLWE construction, NTT process is conducted on each element in R q within a vector separately, implying only one extra memory block for storing one element in R q , resulting from the out-of-place property of the strategy, would be adequate. With more butterfly cores, the number of reduction units, located behind multiplier in butterfly units should also be increased accordingly. Besides, RAM structure should be modified elaborately to ensure enough data bandwidth [BUC19] [XL20]. Similarly, bandwidth of Keccak core needs to be tuned accordingly, which is relatively easy as data input/output and round function are conducted separately. In a low area design contrarily, the first thing to do is adopting a low-cost Keccak core [Tea20] where Keccak round function is conducted in multiple cycles, as many works have reported hash module accounts for a great portion of the overall resource consumption.

Discussions about a constant-time design
Almost all the procedures in the CCA-secure Kyber are implemented in constant-time in our design, except the generation of public matrices A/A T . As samples of A/A T are obtained from pseudo random bits with abortion, the total cycles would be variable with different publicseed ρ. However, this would not cause any trouble in a constant-time design from two aspects. Firstly, the publicseed ρ and the generation algorithm Parse,XOF are all public, which means the variation in time would not leak any information that should be private. Secondly, the generation processes are conducted along with NTT-related calculations, and the number of total cycles consumed in our design is irrelevant with these processes. With these facts, our design should be resistant to timing attacks.

Conclusion
In this work, a pure hardware implementation of CRYSTALS-KYBER, one of the third round finalists of PQC contest is presented. The whole implementation is manually designed, without resort to either hard-wired processor such as ARM Cortex series or soft core that can be implemented with reconfigurable logic such as popular RISC-V processors. Achievable performance is explored with limited computational resources. Through elaborate scheduling of NTT related procedures and noise sampling, our design achieves decent performance and it can fit in the smallest device in Xilinx Artix-7 series FPGA. The number of utilized butterfly units is two in our design on the basis of special form of the NTT in Kyber, and we also discuss different solutions in cases where either higher performance or lower cost design is demanded. Similar scheduling method can be derived according to relation of speed between NTT core and hash module. In addition, detailed design methods corresponding to data transmission, reception and the Fujisaki-Okamoto transform are illustrated, through adoption of a set of FIFOs with different specification. We further explore minimal memory footprint of each storage block, and avoid involving more memory blocks in re-encryption phase through data redirection of corresponding FIFOs.
Overall, our pure hardware implementation can execute key generation, encapsulation, decapsulation in 23.4/30.5/41.3 µs when module dimension k takes 2, in 39.2/47.6/62.3 µs when k takes 3, and in 58.2/67.9/86.2 µs when k takes 4, with 7412/6785 LUTs, 4644/3981 FFs, 2126/1899 slices, 2/2 DSPs and 3/3 BRAMs consumed in server/client side, shedding light on achievable performance in a standalone hardware design. The results verify computational efficiency of schemes based on structured lattice, and better performance of manual hardware design compared with that implemented with hardware-software co-design or HLS method.  Figure 10: Structure of two sets of butterfly units. Each path is illustrated separately, with two input data pairs (in 1 , in 0 ), (in 3 , in 2 ) and corresponding output pairs (out 1 , out 0 ), (out 3 , out 2 ). Control codes in different operations are presented in Table 1.