Racing BIKE: Improved Polynomial Multiplication and Inversion in Hardware

Abstract. BIKE is a Key Encapsulation Mechanism selected as an alternate candidate in NIST's PQC standardization process, in which performance plays a significant role in the third round. This paper presents FPGA implementations of BIKE with the best area-time performance reported in the literature. We optimize two key arithmetic operations, namely the sparse polynomial multiplication and the polynomial inversion. Our sparse multiplier achieves time-constancy for sparse polynomials of indefinite Hamming weight used in BIKE's encapsulation. The polynomial inversion is based on the extended Euclidean algorithm, which is unprecedented in current BIKE implementations. Our optimized design results in a 5.5 times faster key generation compared to previous implementations based on Fermat's little theorem. Besides the arithmetic optimizations, we present a united hardware design of BIKE with shared resources and shared sub-modules among KEM functionalities. On Xilinx Artix-7 FPGAs, our light-weight implementation consumes only 3 777 slices and performs a key generation, encapsulation, and decapsulation in 3 797 µs, 443 µs, and 6 896 µs, respectively. Our high-speed design requires 7 332 slices and performs the three KEM operations in 1 672 µs, 132 µs, and 1 892 µs, respectively.


Introduction
Due to extensive research and advanced progress in quantum computation during the last decades [Gam20], in 2017, the National Institute of Standards and Technology (NIST) announced a Post-Quantum Cryptography (PQC) standardization process with the target to find public-key cryptographic algorithms that provide security in the presence of quantum computers [NIS17].
After the call for proposals, NIST received 69 submissions, which were reviewed with respect to security, efficiency (e.g., key sizes and latency), and implementation costs in software and hardware. Eventually, for the third round, they selected seven finalists and eight alternate candidates [NIS20b]. While the finalists are all considered for standardization, the alternate candidates will be reviewed and may be evaluated in a fourth round such that they could potentially be standardized as well [NIS20b].
The Bit Flipping Key Encapsulation (BIKE) [ABB + 20] is one of the NIST's alternate candidates in the Key Encapsulation Mechanism (KEM) category. The security of BIKE relies on the hardness of decoding linear error-correcting codes. More specifically, as underlying linear codes, BIKE utilizes Quasi-Cyclic Moderate-Density Parity-Check (QC-MDPC) codes, which were first presented by Misoczki et al. [MTSB13] in 2013.
In this work, we aim to improve the efficiency of the KEM functionalities of BIKE through a Field-Programmable Gate Array (FPGA) hardware design. Since NIST announced that performance plays an important role in their PQC standardization efforts [NIS20a], researchers have presented several optimization techniques for BIKE on the suggested platforms, including the AVX2 instruction set on x86, embedded microprocessors, and FPGAs. For example, Drucker et al. [DGK20a] optimized BIKE for x86 Central Processing Units (CPUs). Chen et al. [CCK21] presented optimization techniques for x86 and Arm Cortex-M4. Richter-Brockmann et al. [RBMG21] proposed an optimized, scalable hardware implementation for reconfigurable devices. In this paper, we propose new optimization techniques for efficient FPGA implementations of BIKE and report significant improvements compared to previous works.

Related Works on FPGAs.
Although there were several early works implementing QC-MDPC codes on hardware devices for variants of the McEliece cryptosystem [VMG14,HVMG13] and for the Niederreiter framework [HC17], the first hardware implementation of BIKE was presented with the round-two submission of NIST's PQC standardization process [ABB + 19]. The implementation was designed for an older version of BIKE (called BIKE-1) and only supported the key generation and encapsulation.
In 2020, Reinders et al. [RMGS20] proposed a complete hardware design which, however, targets the older parameters of BIKE. Besides, they presented an efficient hardware implementation for a novel constant-time decoder.
Recently, Richter-Brockmann et al. [RBMG21] presented the first complete hardware design of the current BIKE version [ABB + 20]. They implemented for the first time the Black-Gray-Flip (BGF) decoder on hardware, introduced an optimized polynomial inversion module (based on Fermat's little theorem), and proposed a scalable multiplier.
In further detail, BIKE poses several challenges on the arithmetic level. For improving the polynomial multipliers in code-based schemes, Hu et al. [HWCW19] presented two different approaches. While the first design is based on a schoolbook multiplication, the second multiplier improves multiplications by exploiting the sparseness of the polynomials used in QC-MDPC codes. Additionally, they instantiated their designs to create a key generation module based on previous parameter sets of BIKE. Barenghi et al. [BFG + 19] presented similar approaches to implement polynomial multiplications for the code-based scheme LEDAcrypt [BBC + 19]. They explored different configurations of schoolbook and sparse multipliers for Xilinx FPGAs.
Contribution. In this work, we revisit previous concepts and identify significant improvements through a systematic exploration of the hardware implementation of BIKE on FPGAs. Specifically, we introduce an optimized polynomial multiplier that exploits the sparseness of QC-MDPC codes while performing all multiplications applied in BIKE in constant time. In addition, we present a novel component for polynomial inversion based on the extended Euclidean algorithm (extGCD), accelerating the key generation in hardware. To this end, we adapt the constant-time extGCD recently proposed by Bernstein and Yang [BY19] and demonstrate that this approach clearly outperforms previous implementations based on Fermat's little theorem in the specific case of BIKE. Moreover, our implementation is highly scalable such that specifically tailored cryptographic components can be instantiated for any use-case.
Besides these major arithmetic-oriented optimizations, we also replace the symmetric primitives used in the encapsulation and decapsulation implementations of [RBMG21] by a single Keccak core, confirming the authors' assertion that this modification achieves a lower footprint. Additionally, we present a combined hardware implementation of BIKE that consolidates all three KEM algorithms in one single, united design. This approach enables resource and module sharing between the KEM algorithms, resulting in a design that reduces the overall implementation costs.
Our implementations are written in Verilog and are publicly available at https://github.com/Chair-for-Security-Engineering/RacingBIKE.

Outline.
In Section 2, we briefly introduce BIKE and cover the background on polynomial arithmetic that is necessary for our hardware implementations. Section 3 starts with an introduction of our design considerations. Afterwards, we introduce our modifications with respect to the random oracles, present our multiplier and inversion modules, and describe the composition of a united hardware design. In Section 4, we evaluate all designs with respect to implementation costs and performance. Before we conclude our work in Section 6, we briefly discuss the resistance against side channels and address the transferability of our approaches to software implementations in Section 5.

Preliminaries
In this section, we describe the algorithms and parameters forming BIKE. Then, we summarize important polynomial arithmetic.

Notations
Throughout this work, we use the following notations:
• F_2: Finite field of two elements {0, 1}.
In this work, we store a polynomial a = a_0 + a_1X + · · · ∈ F_2[X] as the bit sequence of its coefficients (a_0, a_1, . . .). The 0-th bit corresponds to the coefficient a_0, the first bit to the coefficient a_1, and so on.
• r: The parameter defining the length of polynomials in BIKE.
• R: The cyclic polynomial ring F_2[X]/(X^r − 1).
Multiplication in R is generally implemented as a multiplication of bit polynomials in F_2[X] followed by a reduction modulo X^r − 1 for terms of degree ≥ r.
• |f|: Hamming weight of a bit polynomial f.
• b: Bandwidth (in bits) for accessing data from memory in our FPGA implementation.
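The reduction modulo X^r − 1 mentioned above can be illustrated with a small sketch (our helper, not from the specification), storing a polynomial as a Python integer whose bit i holds the coefficient a_i:

```python
def reduce_mod(p: int, r: int) -> int:
    """Reduce a polynomial of degree <= 2r-2 modulo X^r - 1.

    Since X^r = 1 in R, every term X^(r+k) folds back onto X^k, which is
    a single shift and xor on the bit representation."""
    return (p & ((1 << r) - 1)) ^ (p >> r)

# X^12 is congruent to X modulo X^11 - 1
print(reduce_mod(1 << 12, 11))  # prints 2, i.e., the polynomial X
```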

BIKE
We divide this section into three paragraphs describing the KEM functions of BIKE, introducing the required hash functions, and summarizing BIKE's parameters.
KEM Functions. The BIKE KEM comprises three algorithms: key generation, encapsulation, and decapsulation. The key generation (see Algorithm 1) outputs a key pair. It randomly samples two sparse polynomials (h_0, h_1) ∈ R^2 and a random string σ as the private key. By inverting h_0 and multiplying the result by h_1, the key generation computes the public key h as shown in line 3.

Algorithm 1: Key Generation.
Input: BIKE parameters n, w, t, ℓ.
Output: Private key (h_0, h_1, σ) and public key h.

Algorithm 2 describes the encapsulation, which starts by sampling a message m and deriving two error polynomials (e_0, e_1) from H(m). Afterwards, it computes the first part of the cryptogram c_0 by multiplying e_1 by the public key h and adding (xor) the result to e_0. This step represents the encoding procedure of linear codes, with the difference that the errors are added intentionally. The second part of the cryptogram c_1 is generated by adding the message m to the output of the hash function L. Eventually, the algorithm derives the shared key K by hashing the cryptogram and the message with K. Algorithm 3 shows the decapsulation that recovers the error polynomials (e_0, e_1) from the cryptogram C. It first computes the syndrome s in line 1, a common procedure for decoding linear codes. The syndrome and the private key (h_0, h_1) are then fed into a decoder to determine the error polynomials (e_0, e_1). The BIKE specification [ABB+21] applies a BGF decoder, which was extensively investigated in [DGK20b]. If the decoding is successful, the algorithm calculates the message m′ from the ciphertext C and the error polynomials (cf. line 3). To ensure that m′ and the determined error polynomials match the ones generated in the encapsulation, it applies the same sampling algorithm H to m′ and compares the result to the error polynomials returned from the decoder. If the pair is valid, it computes the shared key K = K(m′, C). Otherwise, it computes K using the secret string σ belonging to the private key.
Hash Functions. In BIKE, the encapsulation and decapsulation utilize the three functions H, K, and L, which are modeled as random oracles. The latest specification [ABB+21] unifies the three random oracles to hash functions based on Keccak [BDPA13]. H maps an ℓ-bit string to a 2r-bit string with Hamming weight t. It is implemented by a SHAKE256-based Pseudo-Random Number Generator (PRNG), while it was realized by AES-256 in previous versions. K and L use a SHA3-384 implementation, replacing the SHA2-384 of previous versions.
Parameters. Table 1 summarizes the parameters of BIKE for the various security levels of NIST's PQC standardization process. As already introduced above, the parameter r represents the length of the polynomials used in BIKE. The parameter w specifies the Hamming weight of the private key polynomials (h_0, h_1), satisfying |h_0| = |h_1| = w/2. The parameter t defines the decoding radius, i.e., the Hamming weight of the errors randomly sampled in the encapsulation. Eventually, ℓ specifies the length of the shared key of the KEM, which is fixed to 256 bits for all security levels.

Polynomial Multiplication by Sparse Polynomials
In BIKE, all multiplications in R comprise a sparse operand f ∈ R with |f| ≪ r. For the key generation, h_1 is the sparse polynomial in the multiplication h_1 · h_0^{−1}. For the encapsulation, e_1 is sparse in e_1 · h. For the decapsulation, h_0 is sparse in c_0 · h_0. The decoder contains some additional multiplications by the sparse polynomials (h_0, h_1), which are part of the private key.
We represent a sparse polynomial as a set of indexes corresponding to its non-zero terms. For example, the set I_f = {i_1, . . . , i_t} represents the sparse polynomial f = X^{i_1} + · · · + X^{i_t} with Hamming weight |f| = t. Multiplying a dense polynomial g by the sparse polynomial f simply accumulates the t products g · X^i for i ∈ I_f. Since g is represented as a bit sequence, multiplication by X^i shifts the bit sequence i bits to the left, and reduction modulo X^r − 1 moves the shifted bit segment exceeding the r-th bit to the empty bit segment starting from the 0-th bit. In other words, the multiplication simply accumulates t copies of g rotated by i_1, . . . , i_t bits.
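This rotate-and-accumulate procedure can be sketched in a few lines of Python (our illustration, with polynomials stored as integers where bit i is the coefficient of X^i):

```python
def rotl(g: int, i: int, r: int) -> int:
    """g * X^i in F2[X]/(X^r - 1): an i-bit cyclic left rotation of r bits."""
    mask = (1 << r) - 1
    return (((g << i) | (g >> (r - i))) & mask) if i else g

def sparse_mul(idx_set, g: int, r: int) -> int:
    """Multiply dense g by the sparse polynomial f = sum of X^i, i in idx_set."""
    res = 0
    for i in idx_set:
        res ^= rotl(g, i, r)  # accumulate one rotated copy per index
    return res

# (X + X^4) * (X^2 + 1) = X^6 + X^4 + X^3 + X modulo X^11 - 1
print(sparse_mul([1, 4], 0b101, 11))  # prints 90
```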

Polynomial Inversion with the Extended Euclidean Algorithm
The key generation (cf. Algorithm 1) computes the multiplicative inverse of a secret polynomial h_0 ∈ R. Previous works, e.g., [ABB+20, HWCW19, RBMG21], computed the inversion by raising h_0 to the power of 2^{r−1} − 2 (Fermat's little theorem).
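For reference, such a Fermat-based inversion can be sketched as square-and-multiply; the exponent below assumes BIKE's setting where r is prime and 2 is primitive modulo r, so every invertible h satisfies h^(2^(r−1)−1) = 1 (helper names are ours):

```python
def mul_mod(a: int, b: int, r: int) -> int:
    """Carry-less schoolbook product in F2[X], folded modulo X^r - 1."""
    prod = 0
    for i in range(r):
        if (a >> i) & 1:
            prod ^= b << i
    return (prod & ((1 << r) - 1)) ^ (prod >> r)

def flt_inverse(h: int, r: int) -> int:
    """h^(2^(r-1) - 2) mod X^r - 1 via left-to-right square-and-multiply."""
    e = (1 << (r - 1)) - 2
    acc = 1
    for bit in reversed(range(e.bit_length())):
        acc = mul_mod(acc, acc, r)        # square
        if (e >> bit) & 1:
            acc = mul_mod(acc, h, r)      # multiply
    return acc

# toy example: r = 5, h = 1 + X + X^2; the inverse is X + X^2 + X^4
print(bin(flt_inverse(0b111, 5)))  # prints 0b10110
```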
In this work, we compute the inversion with the extended Euclidean algorithm (extGCD). The extGCD takes two input polynomials (f, g) and outputs three polynomials (gcd(f, g), u, v), where gcd(f, g) is the greatest common divisor of f and g and gcd(f, g) = u · f − v · g. In the context of BIKE, all polynomials are in F_2[X].
In a nutshell, we compute extGCD(X^r − 1, h_0) to obtain the inverse h_0^{−1}. Under the parameters of BIKE, the polynomial X^r − 1 has two factors, X^r − 1 = (X − 1) · (X^{r−1} + · · · + X + 1). Since |h_0| = w/2 is an odd number, h_0 is not a multiple of X − 1. Since |h_0| ≠ r, h_0 is also not the second factor X^{r−1} + · · · + X + 1, which is irreducible for BIKE's choice of r. Hence, gcd(X^r − 1, h_0) = 1 and h_0 is invertible in R. However, a traditional extGCD is unsuitable for cryptographic applications because it usually contains branches that depend on the inputs. Since the inputs are secret, an attacker could collect information about them through running-time differences. Hence, we have to apply a constant-time extGCD to prevent the leakage of timing side-channel information.
In this work, we adopt the constant-time version of the extGCD proposed by Bernstein and Yang [BY19]. In contrast to the traditional extGCD, which eliminates the head coefficients of polynomials at arbitrary degrees, the constant-time extGCD in [BY19] always eliminates the 0-th bit of the polynomials. This requires an extra coefficient reversal of the inputs, moving the head coefficient to the 0-th bit position, and another reversal before output to recover the original coefficient order. Considering for example an input polynomial f, the coefficient reversal is equivalent to performing the operation f ← f(1/X) · X^{deg(f)}. This operation moves the original head coefficient of f to the new position of degree 0, which is accessed by f[0]. Thus, the extGCD always eliminates the head coefficients at the 0-th bit.

Division Steps and Transition Matrix.
In this work, we simplify the extGCD in [BY19] over F_2[X] for the BIKE application. The algorithm consists of a constant number of simple division steps (divsteps) applied to the two input polynomials. Define divstep : Z × F_2[X] × F_2[X] → Z × F_2[X] × F_2[X] as

divstep(δ, f, g) = (1 − δ, g, (g + f)/X)         if δ > 0 and g(0) = 1,
divstep(δ, f, g) = (1 + δ, f, (g + g(0) · f)/X)  otherwise.

Here, δ denotes the degree difference between f and g. Besides the updated δ, the divstep outputs two polynomials. The first is the polynomial of higher degree among the two inputs. The second is the sum of the two polynomials, which eliminates one head term; the division by X then adjusts the new head term to the degree-0 coefficient.
Since the division by X can cause negative degrees, we adjust the representation of polynomials to prevent them. If the polynomial f contains a monomial of negative degree, e.g., 1/X^i, we store f as an alternative polynomial f′ such that f = f′ · (1/X)^i and all monomials of f′ have non-negative degrees. For applying the divstep multiple times, define (δ_n, f_n, g_n) = divstep^n(δ, f, g), i.e., the result of applying the divstep to the inputs (δ, f, g) n times.
Bernstein and Yang describe the transition of the two polynomials (f, g) under the divstep operation as a matrix-vector multiplication. Let T(δ, f, g) be the 2 × 2 transition matrix which performs the transition (f, g) → (f_1, g_1) as matrix multiplication:

(f_1, g_1)^T = T(δ, f, g) · (f, g)^T.

Define the transition matrix of the i-th step as T_i = T(δ_i, f_i, g_i). After n steps, the input polynomials (f, g) become

(f_n, g_n)^T = T_{n−1} · · · T_0 · (f, g)^T.

Note that we use w instead of the original r in [BY19] to avoid a symbol conflict. Since we aim for the polynomial inversion in BIKE, we only keep the two vectors (f, g) and (v, w) in our storage space for all (f_i, g_i) and (v_i, w_i), i ∈ {0, . . . , n}, instead of tracking full transition matrices. The polynomials (f, g) and (v, w) are stored in different formats. Since (v_i, w_i)^T is part of the transition matrix, these polynomials contain monomials of negative degrees. Hence, we store the vector (v_i, w_i) in the form (v_i, w_i) · (1/X)^i, where i increases with the steps, to keep the polynomials (v_i, w_i) at non-negative degrees. Since (f_i, g_i)^T and (v_i, w_i)^T are multiplied by the same transition matrix, we update the two vectors with similar operations except for the degree adjustment: we remove the coefficient of the constant term of g for the division by X but increase the coefficients of v by one degree to keep the correct form of (v_i, w_i) · (1/X)^i.
Last, we describe the overall algorithm for the polynomial inversion in BIKE. We initialize the two input polynomials as f = X^r − 1 and g = h_0(1/X) · X^r, and their degree difference as δ = 1. Note that g is initialized in bit-reversed form. The (v, w) polynomials are initialized to (0, 1), the right column of an identity matrix. Then, we perform 2r − 1 divsteps to update (δ, f, g) as well as (v, w). After the divsteps, we reverse the coefficients of the polynomial v and output it as the inverse h_0^{−1}.
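The complete flow can be mirrored in software on toy parameters. The sketch below is our behavioral model: polynomials are Python integers (bit i is the coefficient of X^i), and the reversal and recovery conventions are one consistent choice following Bernstein and Yang; the in-memory storage of the hardware design may differ in detail.

```python
def poly_inverse(h: int, r: int) -> int:
    """Invert h in F2[X]/(X^r - 1) with 2r-1 divsteps (behavioral model)."""
    rev = lambda p, n: int(format(p, f"0{n}b")[::-1], 2)  # bit-reversal in n bits
    f = (1 << r) | 1                  # X^r + 1, which is its own reversal
    g = rev(h, r)                     # bit-reversed input polynomial
    delta, v, w = 1, 0, 1             # (v, w): right column of the identity matrix
    for _ in range(2 * r - 1):
        if delta > 0 and (g & 1):     # conditional swap
            delta, f, g, v, w = -delta, g, f, w, v
        g0 = g & 1
        g = (g ^ (g0 * f)) >> 1       # eliminate g's 0-th bit, then divide by X
        w ^= g0 * v                   # same transition for (v, w) ...
        v <<= 1                       # ... with the 1/X factor moved onto v
        delta += 1
    u = v >> 1                            # drop the surplus factor of X
    u = (u & ((1 << r) - 1)) ^ (u >> r)   # X^(-k) equals X^(r-k) mod X^r - 1
    return rev(u, r)                      # undo the bit-reversal

# toy check: the inverse of 1 + X + X^2 modulo X^5 - 1 is X + X^2 + X^4
print(bin(poly_inverse(0b111, 5)))  # prints 0b10110
```

On BIKE's real parameters the same loop runs 2r − 1 times with r ≥ 12 323; the hardware design described later processes s of these divsteps per iteration.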

Optimization Strategies
In this section, we propose several optimization strategies to improve the hardware implementation of BIKE. We start by describing the exchange of the symmetric cryptographic building blocks, i.e., AES-256 and SHA2-384, with a single Keccak core. Then, we introduce a new design of a multiplier exploiting the sparseness of QC-MDPC polynomials. Afterwards, we present an improved inversion module based on the algorithm proposed by Bernstein and Yang [BY19]. We conclude this section with a united hardware design which consolidates all three KEM algorithms of BIKE in one implementation.

Design Considerations
We start with our design considerations. First, our implementations utilize the framework presented in [RBMG21], while we modify and optimize several hardware modules described in the following sections. Besides these modifications, the main structure is based on the original implementation. However, we translate all modules to Verilog. Second, we keep the same bandwidth parameter b in our modified modules as proposed in the original implementations from [RBMG21]. Hence, our design is scalable with b as well, and we benchmark our designs with the same instantiations b ∈ B = {32, 64, 128}. A larger b generally improves the latency of the corresponding computation since b-bit chunks of polynomials can be accessed and processed in parallel.

Random Oracles
The BIKE team recently updated the random oracles H, K, and L in their latest specification of version 4.2 [ABB+21]. They changed the core components of these functions from AES-256 and SHA2-384 to SHAKE256 and SHA3-384, respectively, both based on a unified Keccak core. While Richter-Brockmann et al. suggested that a unified symmetric core would be beneficial for a hardware implementation, they did not evaluate this suggestion in [RBMG21].
In this work, we adapt the implementations presented in [RBMG21] to the latest specification of the hash functions and report the comparisons in Section 4.1. To this end, we implement a simple Keccak core which only contains the round function and a controlling interface. In the following, we describe the implementations of the wrappers that are connected to the Keccak core and form the random oracles.
First, for the H function, we instantiate a SHAKE256 based on the Keccak round function. As in Algorithm 2, H uses a 256-bit message m as seed for SHAKE256, which is requested through a dedicated interface in our implementation. Then, with correct padding and controlling of the Keccak core, the wrapper divides the 1 088 output bits into 32-bit chunks. The integrated sampler uses the chunks to generate the indexes of the error polynomials (e_0, e_1) and rejects illegal samplings. When the sampler has consumed all randomness, the wrapper initiates an additional squeezing phase of SHAKE256.
Second, for generating the private key (h_0, h_1) in the key generation (cf. Algorithm 1), our wrapper operates similarly to the H function, apart from the different Hamming weights.
Third, for the L function, the wrapper uses the error polynomials (e_0, e_1) and provides them to the Keccak core in the absorbing phases. In this case, it performs a SHA3-384 hashing operation. Besides the correct padding, the wrapper ensures that the error polynomials are concatenated in eight-bit blocks. Last, it truncates the 384-bit hash value to a 256-bit value and adds it to m.
Fourth, our wrapper for the K function is realized similarly to the L function. However, the input to the SHA3-384 slightly differs since a 256-bit string needs to be concatenated with an r-bit polynomial and with another 256-bit string. Nevertheless, it truncates the 384-bit output to 256 bits in the same way.
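As an illustration of the rejection sampling performed inside the H wrapper, the following sketch derives t distinct indexes in [0, 2r) from a SHAKE256 stream using Python's hashlib. It is a simplification of ours, not the exact BIKE encoding: the chunk size, masking, and output budget are assumptions.

```python
import hashlib

def sample_error_indexes(seed: bytes, t: int, r: int) -> list:
    """Rejection-sample t distinct indexes in [0, 2r) from SHAKE256(seed)."""
    stream = hashlib.shake_256(seed).digest(4 * 64 * (t + 64))  # generous budget
    mask = (1 << (2 * r - 1).bit_length()) - 1  # power-of-two mask covering 2r
    out, pos = [], 0
    while len(out) < t:
        chunk = int.from_bytes(stream[pos:pos + 4], "little")   # 32-bit chunk
        pos += 4
        idx = chunk & mask
        if idx < 2 * r and idx not in out:  # reject out-of-range and duplicates
            out.append(idx)
    return out
```

A hardware sampler squeezes further SHAKE256 blocks on demand instead of precomputing a fixed budget.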

Sparse Polynomial Multiplier
In this section, we present the hardware design of the sparse polynomial multiplier for BIKE. In 2019, Hu et al. [HWCW19] already applied the approach of sparse multiplications to BIKE. However, compared to their design, our optimized implementation achieves a better area-time product and reduces the latency (for detailed information see Section 4.2). Additionally, our design keeps the time-constancy for the encapsulation while computing e 0 + e 1 · h with the indefinite Hamming weight of e 1 .
As in Section 2.3, we are given a multiplication p_res = p_sparse · p_arb, where p_sparse, p_arb ∈ R and |p_sparse| ≪ r. Further, the polynomial p_sparse is represented as a set of indexes of non-zero terms, and p_arb is an r-bit sequence divided into ⌈r/b⌉ chunks. We then conduct the multiplication by reading the non-zero indexes of p_sparse, rotating p_arb to the left by each index, and accumulating the rotated results into the product p_res.

General Sparse Multiplier. Figure 1 shows a simplified architecture of the general sparse multiplier, which iterates over the indexes of the sparse polynomial p_sparse. Each iteration is initiated by reading a non-zero index from p_sparse. Meanwhile, it starts to access the values of the polynomial p_arb in ascending order, starting at the second uppermost address, proceeding with the uppermost, and then continuing from address zero. This procedure simplifies dealing with the most significant bits of p_arb since r mod b ≠ 0 (r is always prime). Figure 1 neglects the hardware dealing with this exception (mostly multiplexers) for clarity.
While processing a particular index from p sparse , the lower log b bits of the index determine the number of bits to shift the input from p arb. to the left. The shifted output is added to the current intermediate result depicted by the xor-gates in Figure 1. We instantiate two memories to store the intermediate results of the multiplication. This allows us to read the current intermediate result from one memory and write the new result to the other one in the same clock cycle.
The upper part of the schematic in Figure 1 determines the addresses for both memories. When an index of the sparse polynomial is read, its upper bits are sampled into a register that serves as initial value for a counter. To handle the jump from the highest address (i.e., ⌈r/b⌉) to zero, our final design contains slightly more logic. Again, Figure 1 neglects this logic for the sake of clarity. The output of the counter is decremented by one, and two multiplexers decide which of the two address values is used to access which of the memories. The decision signal sel is determined based on the LSB of the address counter used to read out the indexes of the sparse polynomial.
For each index of the sparse polynomial, our multiplier spends ⌈r/b⌉ + 4 clock cycles for shifting and accumulating the intermediate results. The total latency is given by

L_mult(t_h) = t_h · (⌈r/b⌉ + 4) + 1,

where t_h denotes the weight of the sparse polynomial (e.g., t_h = w/2 for the key generation in BIKE). The additional clock cycle is spent when the circuit switches to the DONE state. This design iterates over a fixed number of indexes of the sparse polynomial. While this approach is capable of processing the secret polynomials (h_0, h_1), it cannot process the multiplication e_1 · h in the encapsulation with constant latency since the Hamming weight of e_1 is unknown. Therefore, we modify the design of the general sparse multiplier into a dedicated multiplier for BIKE in the next paragraph.
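The per-index cost of ⌈r/b⌉ + 4 cycles plus the final DONE cycle gives a simple latency model. The helper below (ours) evaluates it for the BIKE level-1 parameters r = 12 323, w = 142, and t = 134:

```python
from math import ceil

def mult_latency(th: int, r: int, b: int) -> int:
    """Cycle count of the sparse multiplier: th indexes, ceil(r/b) + 4 cycles
    each, plus one cycle for the transition to the DONE state."""
    return th * (ceil(r / b) + 4) + 1

r, w, t = 12323, 142, 134  # BIKE level-1 parameters
for b in (32, 64, 128):
    print(b, mult_latency(w // 2, r, b), mult_latency(t, r, b))
```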

Tailored Constant-time Multiplier for BIKE.
To deal with the indefinite weight of e_1 in the encapsulation, we utilize the relation |e_0| + |e_1| = t defined by BIKE. It allows us to rephrase the encoding operation as an addition of two multiplications c_0 = e_0 · 1 + e_1 · h. (2) For computing c_0, we modify the general sparse multiplier introduced above and add a multiplexer choosing h or 1 as input for p_arb, depending on whether e_0 or e_1 is processed. To indicate whether an index belongs to e_0 or e_1, we add an additional leading bit to the indexes and set the MSB of the indexes belonging to e_0 to '1'. We embed this operation directly into the sampling function H. Hence, the multiplexer selects its output according to the MSB of the indexes of the sparse polynomial. In order to illustrate the two modes of the multiplication engine, we provide a small example for r = 11, b = 4, and p_arb = X^10 + X^8 + X^7 + X^6 + X^5 + X^4 + X^3 + 1 = 101 1111 1001 (corresponding to h in Equation 2). For the error polynomials, we exemplarily assume e_0 = X^5 and e_1 = X^7 with the corresponding indexes e_0,idx = 1 0101 and e_1,idx = 0 0111, respectively. For both modes, we assume that the current intermediate result is p_int = 010 1001 0110. Figure 2 visualizes the multiplication e_1 · p_arb, where each dashed line separates the data flow between the clock cycles. In this case, the expected result is the accumulation of the rotated p_arb into p_int.

Figure 3: Example for a multiplication with an index from e_0.
As described above, the module first reads the second uppermost chunk of the input polynomial, which is 1111 in our example. Since r = 11 and b = 4, only the most significant bit of this chunk is required and stored in the register reg_rot (cf. Figure 1). The remaining three bits are taken from the uppermost chunk. Afterwards, the process proceeds in a regular pattern by reading a new chunk and moving the old chunk to the lower part of reg_rot. The multiplier determines the starting address for reading the first chunk of the intermediate result from the upper bits of the error index, i.e., 0x01 in our example. This describes the required shift on word level. The output of the register is shifted to the left by 3 bits, given by the lower log_2(b) bits of the index e_1,idx, which describe the required shift on bit level. Hence, the first chunk of the new intermediate result is written to address 0x01. Note that when the multiplier writes the result to address 0x02, the most significant bit is set to 0 since it does not belong to a valid polynomial of size r = 11. The procedure for a multiplication with the index e_0,idx is similar. Instead of providing the polynomial p_arb to the multiplier, the polynomial p_one = 1 = 000 0000 0001 is selected by the most significant bit of e_0,idx. The corresponding data flow is visualized in Figure 3. It is clearly visible that the multiplication with an index from e_0 requires the same number of clock cycles such that a constant-time operation is guaranteed.
To this end, Figure 4 shows the adjustment for processing the operand p_arb in the multiplier. Note that the polynomial 1 does not require an extra memory but is generated on the fly. While accessing the 0-th chunk of p_arb, the circuit feeds the b-bit chunk 0...01 to the multiplexer; otherwise, the multiplexer receives an all-zero b-bit chunk. Hence, the multiplier always finishes the multiplication from Equation 2 in L_mult(t) clock cycles. Last, we add an additional input to the multiplier design that determines the number of non-zero indexes of the sparse polynomial for the two possible weights (t and w/2) of the sparse input polynomials.
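In software, the two modes can be mimicked as follows (our sketch): every index carries a leading mode bit, and the data path rotates either h or the constant polynomial 1, so both cases cost the same number of operations. The values reproduce the worked example above.

```python
def rotl(g: int, i: int, r: int) -> int:
    """Cyclic left rotation: g * X^i in F2[X]/(X^r - 1)."""
    mask = (1 << r) - 1
    return (((g << i) | (g >> (r - i))) & mask) if i else g

def encode(tagged_idxs, h: int, r: int, idx_bits: int) -> int:
    """c0 = e0 * 1 + e1 * h. Indexes with the extra MSB set belong to e0 and
    select the on-the-fly constant polynomial 1 instead of h."""
    c0 = 0
    for ti in tagged_idxs:
        operand = 1 if (ti >> idx_bits) & 1 else h
        c0 ^= rotl(operand, ti & ((1 << idx_bits) - 1), r)
    return c0

# r = 11, h = 101 1111 1001, e0 = X^5 (index 1 0101), e1 = X^7 (index 0 0111)
print(bin(encode([0b10101, 0b00111], 0b10111111001, 11, 4)))  # prints 0b10011111111
```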

Polynomial Inversion
We present our hardware design and optimization for the polynomial inversion in this section. In 2020, Marotzke [Mar20] reported an implementation of the polynomial inversion required in NTRU Prime, a post-quantum KEM. This inversion module utilizes Bernstein and Yang's extGCD algorithm [BY19], optimized to invert polynomials of degree 760 with coefficients in prime fields, where the arithmetic takes place in Digital Signal Processor (DSP) units. Since our design targets the inversion of polynomials in R with large degrees (i.e., ≥ 12 323), the two implementations pursue different purposes and are not directly comparable.
In the following, we first divide the computation of divstep into two subroutines. Then, we introduce the main framework of the inversion and the two subroutines followed by our hardware designs.
Performing the divstep. Recalling Section 2.4, an extGCD for polynomial inversion computes 2r − 1 divsteps. In [BY19], based on the shape of the transition matrix, Bernstein and Yang optimized the multiplication by the transition matrix in a single divstep into two simple functions: 1. a conditional swap: replacing (δ, f, g) with (−δ, g, f) if δ > 0 and g(0) = 1; 2. an elimination: replacing g with (g + g(0) · f)/X and incrementing δ.
Since the head coefficient f(0) is always one when computing the inversion in BIKE, we need only two information bits, deduced from (δ, g(0)), in each divstep as instructions for updating (f, g) and (v, w). The first bit indicates the swap operation and the second bit is g(0), used in the elimination operation. We refer to these two bits as the control bits of one divstep in this paper. Furthermore, we split one divstep into two operations: determining the control bits and updating the polynomials.

Main Framework. Algorithm 4 describes the main framework of the polynomial inversion. As introduced above, the algorithm uses four temporary polynomials f, g, v, and w, where g is initialized with the bit-reversed input polynomial g_in. The main part of the algorithm are the 2r − 1 divsteps, which are decoupled into series of get_control_bits() and update_fg_or_vw() subroutines. Last, the algorithm shifts v one bit to the right, reverses its coefficients, and returns v as the inverse of the input polynomial.
Algorithm 4: Main framework for the polynomial inversion.
Input: Input polynomial g_in and step size s.
Output: Inverse polynomial v.
(Lines 1-6 initialize f ← X^r − 1, g ← the bit-reversed g_in, (v, w) ← (0, 1), δ ← 1, and τ ← 2r − 1, the number of divsteps to be executed.)

In the algorithm, we introduce a parameter s to control the step size, allowing s divsteps to proceed in each iteration in parallel (cf. line 7). The subroutines get_control_bits() and update_fg_or_vw() take this parameter as well and proceed s steps accordingly. Therefore, get_control_bits() determines 2s control bits and updates δ based on the state of (δ, f[0], g[0]). Afterwards, a loop iterates over all four polynomials f, g, v, and w and updates them by update_fg_or_vw() for s steps in each call. Starting from line 22, the algorithm covers the remaining steps and updates only (v, w) accordingly.
Besides the step size s, the execution time of Algorithm 4 scales with the bandwidth parameter b as well. Enlarging b decreases the number of chunks N and, therefore, fewer iterations of the inner loop are executed since update_fg_or_vw() updates one chunk in each invocation. In our design, the choice of s is also limited by s ≤ b since get_control_bits() takes only one polynomial chunk as input. We describe the details of get_control_bits() and update_fg_or_vw() in the following paragraphs.
Determining Control Bits. Algorithm 5 details the process of get_control_bits(). The algorithm takes four inputs, which are the degree difference δ, one b-bit chunk of each of the polynomials (f, g), and the step size s. The algorithm outputs the updated δ and 2s control bits c for s divsteps. For generating the control bits of the s divsteps, the algorithm uses only s bits from each input polynomial instead of the full coefficients. Note, however, that the algorithm is a sequential process where the control bits of iteration i depend on the results of the previous iterations. Our hardware design for get_control_bits() incorporates this characteristic such that we aim to fully utilize the computational capacity and hence execute d iterations of the loop shown in Algorithm 5 in one clock cycle. Figure 5 shows a schematic draft of this approach, where one iteration is highlighted by the red dashed border. For larger step sizes s, however, unrolling the whole loop in a hardware implementation would result in a long critical path. Hence, we introduce a round-based circuit that is executed ⌈s/d⌉ times since d · ⌈s/d⌉ ≥ s. We store the generated control bits in registers to use them immediately for updating the polynomials (f, g) and (v, w) by update_fg_or_vw().
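To make the data flow concrete, the following Python sketch models get_control_bits() for a step size s, using only the s lowest coefficient bits of f and g (bit i holds the coefficient of x^i); representing each control-bit pair as (swap, g(0)) is our naming convention.

```python
def get_control_bits(delta, f_lo, g_lo, s):
    """Derive 2s control bits for s divsteps from delta and the s lowest
    bits of f and g. Returns the updated delta and a list of (swap, g0)
    pairs. Bits shifted in at the top of g_lo are never consulted within
    s steps, so s input bits per polynomial suffice."""
    bits = []
    for _ in range(s):
        swap = 1 if (delta > 0 and (g_lo & 1)) else 0
        if swap:
            delta = -delta
            f_lo, g_lo = g_lo, f_lo
        g0 = g_lo & 1
        g_lo = (g_lo ^ (f_lo if g0 else 0)) >> 1  # one elimination step
        delta += 1
        bits.append((swap, g0))
    return delta, bits
```

In hardware, d of these loop iterations are unrolled per clock cycle, so the circuit runs for ⌈s/d⌉ rounds.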
Updating Polynomials. We summarize the details of update_fg_or_vw() in Algorithm 6. The algorithm expects as inputs the control bits c, two 2b-bit chunks of the polynomials (f, g) or (v, w), the step size s, and one bit specifying whether the input chunks originate from the pair (f, g) or (v, w). The algorithm updates the given chunks for s divsteps according to the control bits c. Since (f, g) and (v, w) are multiplied by the same transition matrix in each divstep, the arithmetic for updating the polynomials is identical. The different formats of storing the polynomials (see Section 2.4) cause the difference between the two operating modes, which shift the polynomials in different directions and output different chunks of the polynomials. Figure 6 shows our hardware design for updating the polynomials (f, g). The basic block (highlighted by the red dashed border) updates the polynomials for one divstep and consists of simple shifts, an addition (xor), and multiplexing operations. The whole submodule could finish the computation with s consecutive basic blocks which, however, would result in a long critical path without further modifications. Therefore, to control the length of the critical path, we introduce pipeline registers after every u basic blocks. Hence, there are ⌈s/u⌉ pipeline stages in the module. Note, we implement a similar module to update (v, w).
Although Figure 6 depicts two full b-bit chunks for each input associated with the different polynomials, the algorithm actually requires only b + s bits of data from the input polynomials. The algorithm inputs the 2b-bit chunks because polynomials are accessed from memory in chunks of b bits. However, the module only instantiates logic for processing b + s bits such that no area overhead occurs.
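A word-level Python model of update_fg_or_vw() illustrates the two operating modes; the chunked memory access of the hardware is abstracted away, and the function names are ours.

```python
def update_fg(bits, f, g):
    """Apply s precomputed divsteps (control bits) to the pair (f, g):
    an optional swap, a conditional xor, and a right shift (division by x)."""
    for swap, g0 in bits:
        if swap:
            f, g = g, f
        g = (g ^ (f if g0 else 0)) >> 1
    return f, g

def update_vw(bits, v, w):
    """Apply the same transition matrix to (v, w); due to the different
    storage format, the shift goes in the opposite direction (times x)."""
    for swap, g0 in bits:
        if swap:
            v, w = w, v
        w ^= v if g0 else 0
        v <<= 1
    return v, w
```

Because the control bits are precomputed, both updates are straight-line rotate/xor/mux logic, which is what allows the hardware to chain s basic blocks with pipeline registers in between.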
Overall Design of the Polynomial Inversion. The entire polynomial inversion module consists of two counters controlling the reversal of the bits and the final right shift (cf. Algorithm 4). Additionally, we instantiate get_control_bits() and two versions of update_fg_or_vw() (updating (f, g) and (v, w) in parallel) as described above. Since the algorithm works on four temporary polynomials, the inversion module utilizes eight Block-RAMs (BRAMs), allowing intermediate results to be read and written in the same clock cycle. The latency of the proposed design depends on several parameters, i.e., r, b, s, d, and u, and is given by Equation 3. Note, our design for s = 1 does not follow Equation 3 since it is a handcrafted and optimized design which achieves a slightly smaller latency and requires only seven BRAMs instead of eight.

United Hardware Design
Given the optimized modules for the polynomial arithmetic and the modifications of the random oracles, we now present a united hardware design of BIKE consolidating the key generation, encapsulation, and decapsulation in one module. Such a design allows resources to be shared between the different KEM operations. For example, we instantiate only a single multiplier, one Keccak core with the corresponding wrappers described in Section 3.2, and a limited number of BRAM modules. The number of required BRAMs is determined by the decapsulation since its implementation utilizes the most memory (cf. [RBMG21]). However, this design decision implies that only one of the three KEM algorithms of BIKE can be executed at a time. Therefore, we implement a control interface that allows the desired algorithm to be enabled by a three-bit instruction, data (polynomials and 256-bit strings) to be loaded and read, and randomness to be requested as seed for the PRNG. A top-level draft of this implementation is shown in Figure 7. While all building blocks that are used by more than one KEM algorithm are marked by a green border, the black modules are required for a single KEM operation only (the inversion module and the sampler are used only in the key generation, and the BFIter module together with the Hamming weight and threshold computation only in the decapsulation).
The Finite-State Machine (FSM) on the right side manages all input/output operations and the control flow of the three KEM algorithms. The input interface expects a six-bit instruction identifying which data should be loaded. For the key generation, no initial data is required. The encapsulation requires the public key h, which needs to be loaded to a BRAM before the computation can be started. To perform a decapsulation, the implementation assumes that the user loads the two parts of the cryptogram (c_0, c_1), the two polynomials of the private key (h_0, h_1), and σ. The output interface returns the same data and additionally the shared key K. After the required data has been read out, all memories are reset by overwriting their content with zero.

Implementation Results
In this section, we evaluate the proposed optimizations and modifications for a hardware implementation of BIKE. First, we show that the modifications of the random oracles are beneficial for a hardware design of BIKE. Second, we report implementation results for the proposed sparse multipliers and compare them to designs from the literature. Third, we demonstrate the scalability of our inversion module by presenting implementation results for different configurations. Fourth, since both the multiplication and the inversion influence the footprint and performance of the key generation, we provide dedicated implementation results for a stand-alone key generation design. Fifth, we present the implementation results of the united hardware design and compare it to other implementations of code-based PQC schemes. We generate all results for an Artix-7 XC7A200T FPGA manufactured by Xilinx.

New Random Oracles
As described in Section 3.2, BIKE's new specification [ABB + 21] updates the random oracles from AES-256 and SHA2 to a unified Keccak core. To test how the design choice of cryptographic primitives affects the performance of hardware implementations, we compare the implementations of the original VHDL code from [RBMG21] with our adapted version applying the new specification with a replaced Keccak core. We performed no other optimizations for a fair comparison. Table 2 reports the comparisons for the encapsulation and decapsulation. For both KEM algorithms and all hardware configurations, the adapted versions achieve slightly better results in terms of area and latency. In particular, the number of required registers decreases by roughly 880 in the adapted implementation for all designs. Overall, these implementation results show that the modifications of the random oracles are indeed beneficial for hardware implementations of BIKE.

Table 3 shows the implementation results for our two multiplier designs configured for the lowest security level of BIKE, i.e., for r = 12 323. The first design is the general sparse multiplier where the sparse polynomial always has a fixed Hamming weight, i.e., the Hamming weight is determined before synthesis. In BIKE, such cases occur in the key generation and decapsulation where |p_sparse| = w/2. The second design reads the Hamming weight of the sparse polynomial via an input interface. Hence, it can be used for all multiplications required in BIKE. Additionally, the design performs the encoding in the encapsulation in constant time. Consequently, the hardware utilization is slightly higher than for the general sparse multiplier. Note, for the second multiplier design, we report performance numbers for the multiplication performed in the encapsulation, i.e., |p_sparse| = t = 134. The number of clock cycles for different Hamming weights follows Equation 1.
Table 3 also lists the results of the schoolbook-based (dense) multiplier from [RBMG21] and of the sparse multiplier design from [HWCW19]. Since the authors of [RBMG21] only reported implementation results for r = 10 163, we extracted the multiplier from their code and synthesized it for r = 12 323. As expected, the sparse multiplier clearly outperforms the schoolbook-based design with respect to area. For a fixed Hamming weight of 71, the sparse multiplier also achieves better performance results. However, for b = 128 the schoolbook multiplier achieves slightly better performance than the tailored sparse multiplier, at the cost of a considerably larger area footprint. Therefore, the sparse multiplier is clearly superior with respect to the Area-Time (AT) product.

Multiplier
Compared to the multiplier from [HWCW19], our design achieves a considerably lower latency albeit our results were generated for a larger parameter set. Our design differs from their implementation mainly in two aspects. First, we decided to instantiate two memories to store the intermediate results of the multiplication's product. This allows us to perform a read and a write access in the same clock cycle while the implementation by Hu et al. requires two clock cycles. Note, on Xilinx FPGAs one could exploit the read-then-write option, allowing a read and a write access to the same address in the same clock cycle and reducing the number of required BRAM modules. However, we decided not to use this option but rather instantiate two memories since this is a more generic approach which is universally applicable to other hardware devices as well. Second, our rotation unit performs the whole rotation within one clock cycle while the design from [HWCW19] requires log b clock cycles. Even though our multiplier architecture consumes slightly more slices, it clearly improves the AT product.
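The rotate-and-accumulate principle underlying both sparse multipliers can be sketched at word level in Python: each nonzero term of the sparse factor contributes one cyclic rotation of the dense polynomial (which our hardware performs in a single clock cycle), so the latency grows with the Hamming weight of the sparse factor. The word-serial chunking of the hardware is omitted here.

```python
def sparse_mul(dense, sparse_exponents, r):
    """Multiply a dense bit polynomial (bit i = coefficient of x^i) by a
    sparse polynomial given by its list of nonzero exponents, mod x^r - 1.
    One rotate-and-xor per set coefficient, hence the number of steps is
    proportional to the Hamming weight of the sparse factor."""
    mask = (1 << r) - 1
    acc = 0
    for e in sparse_exponents:
        # multiplying by x^e is a cyclic left rotation by e positions
        acc ^= ((dense << e) | (dense >> (r - e))) & mask
    return acc
```

For instance, with r = 7, multiplying 1 + x by 1 + x^2 yields 1 + x + x^2 + x^3, i.e., the xor of the unrotated and the twice-rotated operand.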
We also tried to compare our results to the design proposed in [BFG + 19] but we were not able to determine which value the authors used for the parameter BW (corresponding to our bandwidth parameter b), so a fair comparison is difficult. However, we assume that their design is similar to our multiplier design with fixed Hamming weights.

Inversion Module
In this section, we first evaluate the polynomial inversion module described in Section 3.4 for b ∈ B and r = 12 323, and afterwards compare our approach to the design from [RBMG21], which is based on Fermat's little theorem. Note, in all experiments we fix the maximum number of basic blocks instantiated between two register stages for the updating process of (f, g) and (v, w) to u = 8, achieving a critical path shorter than 10 ns. Additionally, we generate all results in this subsection for a target frequency of 100 MHz.

Figure 8a shows the number of required slices and the latency in clock cycles for b = 32, 1 ≤ s ≤ 32, and d = 2. The area footprint increases linearly with the step-size parameter s while the number of clock cycles follows Equation 3. Moreover, we include the configuration with the best AT product (slices × cycles/10^6), visualized by the green dashed line. The configuration for s = 23 achieves the best result with an AT product of 432. A more detailed evaluation of the implementations can be found in the appendix in Table 7.

Figure 8b shows the implementation results for different step sizes s for b = 64. The trends for the required clock cycles and the area utilization are very similar to the configurations for b = 32. The smallest configuration requires 4 880 299 clock cycles but consumes only 377 slices, while the fastest design performs one inversion within 91 678 clock cycles by consuming 5 457 slices. The design with the best AT product is obtained for s = 31 (a detailed evaluation can be found in the appendix in Table 8).

The results for b = 128 are shown in Figure 8c, where the best AT product is obtained for s = 16. To achieve reasonable critical paths (maximum possible frequency larger than 100 MHz), we reduce the number of unrolled rounds to compute the control bits c to d = 1. With s = 128 we can instantiate our fastest inversion module, which finishes one polynomial inversion in only 47 386 clock cycles.
However, the implementation costs drastically increase to 21 435 slices. Again, a detailed evaluation is given in the appendix in Table 9 and Table 10.

Detailed Evaluation of the Inversion Module
Comparison to Related Work. In Table 4, we compare our inversion module to the approach presented in [RBMG21], which is based on Fermat's little theorem. The corresponding numbers are extracted from their implementation of the key generation.
With Fermat's little theorem, given a g ∈ R, [RBMG21] computes the inverse as g^(2^(r−1)−2) = (g^(2^(r−2)−1))^2. To efficiently raise g to this power, they use a square-and-multiply chain based on the Itoh-Tsujii Algorithm (ITA) [IT88], achieving a latency (Equation 4) that depends on r_bin = r − 2 and L_school = ⌈r/b⌉ · (⌈r/b⌉ + 3) + 1. Note, Equation 4 describes only an approximation of the required clock cycles since the implementation from [RBMG21] is highly optimized for the use case of BIKE. However, compared to the dominant term ⌈(2r − 1)/s⌉ · ⌈r/b⌉ from Equation 3, our inversion module has an extra parameter s, allowing more optimized configurations.
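For contrast, the Fermat-based approach can be sketched in a few lines of Python with a plain left-to-right square-and-multiply (a simplification of the optimized ITA chain of [RBMG21]); as before, bit i holds the coefficient of x^i, and the reduction folds x^r back to 1.

```python
def mulmod(a, b, r):
    """Carry-less multiply of two bit polynomials, reduced mod x^r - 1."""
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    while prod >> r:  # fold x^r -> 1
        prod = (prod & ((1 << r) - 1)) ^ (prod >> r)
    return prod

def fermat_inverse(g, r):
    """Compute g^(2^(r-1) - 2), the inverse of a unit g in F2[x]/(x^r - 1),
    by left-to-right square-and-multiply."""
    e = (1 << (r - 1)) - 2
    result = 1
    for bit in bin(e)[2:]:
        result = mulmod(result, result, r)  # square
        if bit == '1':
            result = mulmod(result, g, r)   # multiply
    return result
```

The roughly r squarings and multiplications per inversion make clear why every instance of this method needs a full dense multiplier, which is the area cost noted above.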
In Table 4, we present results for the light-weight (s = 1) and high-speed (s = b) configurations as well as the design with the best area-time product. For a comparison of the area cost, we report a configuration targeting the number of clock cycles of the approach from [RBMG21]. While finishing the inversion within the same number of clock cycles, Table 4 shows that the inversion module based on the extGCD achieves a smaller footprint. This implies that the extGCD implementation results in a better area-time product. We note that the inversion based on Fermat's little theorem always requires a dense polynomial multiplier, which increases the area cost notably. For the design with the best area-time product, our approach consumes roughly twice the amount of logic but finishes the inversion in only one sixth of the clock cycles for b = 32.
While writing this article, Deshpande et al. [DdPM + 21] presented a hardware implementation of Bernstein and Yang's inversion algorithm for computing the modular inverse of integers. Their implementation targets integer sizes of 255 bits to 2 048 bits, which requires units for integer additions with carry logic. Since we compute the inverse of bit polynomials of at least 12 323 bits and perform carry-less additions, i.e., the XOR operation, the two implementations target different applications, and a comparison of performance numbers would be misleading.
Additionally, the sequential design of [DdPM + 21] always computes the control bits for one divstep and updates the integers by one divstep at a time. This corresponds to the configuration s = 1 of our design introduced in Section 3.4. Hence, our inversion module provides more configurations, allowing it to be finely adapted to various circumstances.

Key Generation
We report implementation results for stand-alone key generation modules in Table 5 and compare them to the key generation module from [RBMG21]. We evaluate our designs only on the key generation because the polynomial inversion module is used solely in this KEM operation. Since our design is based on the extGCD instead of Fermat's little theorem, we do not instantiate the dense polynomial multiplier required for the inversion with Fermat's little theorem. Instead, we use a sparse multiplier, which is far more efficient (in both area and latency) than the dense multiplier in the key generation (cf. Table 3). Although the key generation module consists of various components, including the PRNG based on SHAKE256, the main workload lies in the inversion module and the multiplier.
As described before, both designs perfectly scale with the bandwidth parameter b while the inversion module provides an additional configuration via the step size s. Nevertheless, for each b ∈ B, we only pick two configurations for the inversion: (1) setting s = b which results in the fastest configurations we can achieve, and (2) instantiating the inversion module with the lowest AT product determined in Section 4.3.
The fastest key generation we can implement with our approaches is obtained for b = s = 128. The key generation takes only 484 µs but requires over 25 000 slices. The maximum frequencies of the designs with b = 128 are slightly higher than for b = 64 because the parameter d is decreased to d = 1. We decided to synthesize these designs with d = 1 since otherwise the critical path for the computation of the control bits would drastically increase. Note, the designs for b = 64 and b = 128 adjusted to the best AT product achieve roughly the same performance because b is doubled while s is halved. Therefore, the design for b = 64 is more efficient due to the lower footprint. Since our proposed inversion module is highly scalable, there are many other possible configurations. An estimation of the expected footprint and clock cycles can be obtained from the results provided in the appendix.
Unfortunately, the authors of [RBMG21] did not implement a PRNG to provide randomness to the sampler, which makes a comparison more difficult. Therefore, we determined the hardware utilization of our Keccak core, which consumes roughly 800 slices. Considering these additional costs, our design adjusted to the best AT products of the inversion modules is roughly 5.5 times faster while consuming only 3.6 times the number of slices for b = 32.

United Design
We present the implementation results of the united hardware design of BIKE, introduced in Section 3.5, in Table 6 for the lowest security level. Results for Level 3 and Level 5 can be found in the appendix in Table 11. We created three different implementations where the first one is a light-weight design (b = 32), the second one is a design with a trade-off between hardware resources and performance (b = 64), and the last one is a high-speed design with b = 128. The instantiations of the inversion module are the designs with the best AT product identified in Section 4.3. Table 6 also contains the estimated implementation results for a united hardware design of BIKE from [RBMG21]. For the light-weight configuration, our design clearly outperforms the previous design with respect to the hardware resources and performance. This improvement is mainly due to the new multiplier design and inversion module.
For the high-speed design, our proposed implementation consumes only half the amount of slices while achieving comparable performance results. Particularly, the latency of the key generation is significantly improved due to the inversion module. However, the number of clock cycles for the encapsulation and decapsulation slightly increased. This slight increase is due to the sparse polynomial multiplier.
Since the latency of the sparse multiplier is proportional to the Hamming weight of the sparse polynomial (cf. Equation 1), the schoolbook multiplier achieves a better performance when the Hamming weight of the sparse polynomial exceeds a certain value. More precisely, the latency of the schoolbook multiplier from [RBMG21] is given by L_school = ⌈r/b⌉ · (⌈r/b⌉ + 3) + 1. In case L_mult(t_h) results in a larger latency than L_school for a Hamming weight t_h and a fixed ⌈r/b⌉, the schoolbook multiplier finishes the corresponding multiplication in fewer clock cycles. In BIKE, this phenomenon only appears for b = 128 and the parameter sets of security levels 1 and 3. However, especially for b = 128 the sparse multiplier achieves a considerably better AT product as shown in Table 3.

Besides implementation results for BIKE, Table 6 also provides implementation costs and performance values for other code-based cryptographic schemes submitted to the NIST standardization process. As already pointed out in [RBMG21], the comparison to the Classic McEliece implementation is difficult. On the one hand, the reported numbers cover only the Public-Key Encryption (PKE) scheme and not the KEM. On the other hand, the Classic McEliece design consumes a huge number of BRAMs, which requires larger and more expensive FPGAs.
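The crossover between the sparse and the schoolbook multiplier discussed above can be illustrated numerically. As a rough assumption standing in for Equation 1 (which is not reproduced here), let the sparse multiplier need about t · ⌈r/b⌉ cycles for a sparse factor of Hamming weight t; for Level 1 (r = 12 323) and b = 128 this already reproduces the reported behavior:

```python
import math

r, b = 12323, 128
chunks = math.ceil(r / b)             # ceil(r/b) = 97 memory chunks
L_school = chunks * (chunks + 3) + 1  # schoolbook latency from [RBMG21]

def L_mult(t, chunks=chunks):
    # assumed approximation: one pass over all chunks per nonzero term
    return t * chunks

# encapsulation (t = 134): the schoolbook multiplier is faster for b = 128 ...
print(L_mult(134) > L_school)  # True
# ... while for |p_sparse| = w/2 = 71 (key generation, decapsulation)
# the sparse multiplier wins
print(L_mult(71) < L_school)   # True
```

Under this assumption the break-even weight for b = 128 lies near ⌈r/b⌉ + 3 ≈ 100, between 71 and 134, matching the observation that only the encapsulation is affected.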
The hardware design for HQC was recently presented in the latest specification [MAB + 21] and is based on a high-level synthesis. While our hardware design of BIKE achieves similar performance results for the encapsulation and decapsulation, HQC has a faster key generation since no polynomial inversion is required.
Eventually, the last part of Table 6 reports recent hardware implementation results of other post-quantum schemes which were selected as finalists in the NIST standardization process. We list the corresponding implementation costs and performance numbers of lattice-based schemes including CRYSTALS-KYBER, LightSaber, and NTRU Prime. In general, the comparison shows that lattice-based schemes require less area and achieve lower latencies than the code-based KEMs.

Resistance against Side Channels
In this work, we present a constant-time hardware implementation of BIKE which prevents timing side-channel leakage. However, we did not apply any specific countermeasure against power Side-Channel Analysis (SCA). In [RMGS20], the authors briefly discussed the resistance of their BIKE hardware implementation against power side channels. They suggested that processing chunks of b = 128 bits in parallel makes it hard to identify single-bit dependencies in the power trace. Since our implementation also supports a 128-bit bandwidth, the same argumentation applies. Additionally, using BIKE with ephemeral keys (suggested as one operation mode in the BIKE specification [ABB + 21]) makes a side-channel attack even harder since the attacker can only use single traces.
Nevertheless, this is not a guarantee for resisting power side-channel attacks. For example, analyzing a power trace of our proposed multiplication engine from Section 3.3 would probably reveal whether an index of e_0 or e_1 is processed due to the Hamming weight difference of |1| and |h|. A multiplication with an index from e_0 probably generates different power traces than a multiplication with an index from e_1, such that the Hamming weights |e_0| and |e_1| are leaked. Further research is required to investigate the security impact of leaking |e_0| and |e_1|. The leakage can be avoided by using two sparse multipliers running in parallel, where one is dedicated to e_0 · 1 and the other to e_1 · h.

Transferability to Software
In this section, we discuss the possibility of transferring the presented approaches for the polynomial inversion and the sparse polynomial multiplication to software implementations targeting various platforms.
When considering the inversion algorithms for the key generation, given the latency of the extGCD inversion (Equation 3) and of Fermat's inversion (Equation 4), the key issue is the latency of the exponentiation and multiplication (L_school) operations of the ITA on the target platform. Although the multiplication requires complicated hardware circuits, its cost is low in software when the underlying platform supports corresponding instructions. Therefore, for platforms with native instructions for bit-polynomial multiplication, e.g., the pclmulqdq instruction on x86, we believe L_inv-Fermat is smaller than L_inv. For platforms without instructions for bit-polynomial multiplication, L_inv is likely to be smaller than L_inv-Fermat. However, besides the platform, the latency of the multiplication also depends on the implemented algorithm. Recently, Chen et al. [CCK21] reported an efficient FFT-based bit-polynomial multiplication on the 32-bit Arm Cortex-M4 platform. Hence, we expect the extGCD-based inversion to outperform Fermat's inversion on even smaller platforms without efficient multiplication implementations, e.g., 8-bit AVR microcontrollers.
Regarding the sparse polynomial multiplication in BIKE, we mainly consider the side-channel leakage of the degrees of the sparse terms. If a software implementation realizes the sparse-dense multiplication by accumulating the dense polynomial shifted by the degrees of the sparse terms, it might leak these degrees through a cache-timing attack. This is one reason why recent software implementations, e.g., [CCK21, DGK20a], implement the multiplication with algorithms for dense polynomial multiplication. Thus, we believe that the sparse polynomial multiplication will be useful mainly for small microcontrollers without a data cache.

Conclusion
In this work, we propose various optimization strategies and present an improved hardware design for BIKE, one of the NIST's alternate KEM candidates.
For the arithmetic optimizations, we implement a constant-time sparse polynomial multiplier for all three KEM algorithms of BIKE. Compared to a schoolbook implementation, our design improves the area-time product by at least a factor of five for all design parameters. Our implementation also achieves a better latency except for the high-speed design (i.e., b = 128) in the encapsulation and the decapsulation. Additionally, we propose a hardware implementation of the polynomial inversion based on the extended Euclidean algorithm. Compared to previous results based on Fermat's little theorem, our new design not only achieves a better latency but also provides smaller area-time products for the key generation in BIKE. Moreover, due to its scalable design, the instantiation of the inversion module can be tailored to various circumstances, providing higher throughput or a smaller area footprint.
Besides these arithmetic optimizations, we show that the random oracles based on a unified Keccak core in the new specification of BIKE indeed result in a more efficient hardware design compared to the design using both AES-256 and SHA2. Based on our improvements, we developed a united hardware design with shared resources and sub-modules, achieving a better latency with less area compared to previous BIKE implementations. All together, our high-speed implementation performs a key generation in 1 672 µs, an encapsulation in 132 µs, and a decapsulation in 1 802 µs on Xilinx Artix-7 FPGAs.

Table 8: Implementation results for the polynomial inversion for r = 12 323, b = 64, and d = 2. We fixed the frequency to 100 MHz and selected an Artix-7 XC7A200T FPGA as target platform.

Table 9: Implementation results for the polynomial inversion for r = 12 323, b = 128, and d = 1. We fixed the frequency to 100 MHz and selected an Artix-7 XC7A200T FPGA as target platform.

Table 10: Implementation results for the polynomial inversion for r = 12 323, b = 128, and d = 1. We fixed the frequency to 100 MHz and selected an Artix-7 XC7A200T FPGA as target platform.