Optimizing BIKE for the Intel Haswell and ARM Cortex-M4

BIKE is a key encapsulation mechanism that entered the third round of the NIST post-quantum cryptography standardization process. This paper presents two constant-time implementations of BIKE, one tailored for the Intel Haswell and one tailored for the ARM Cortex-M4. Our Haswell implementation is much faster than the avx2 implementation written by the BIKE team: for bikel1, the level-1 parameter set, we achieve a 1.39x speedup for decapsulation (the slowest operation) and a 1.33x speedup for the sum of all operations. For bikel3, the level-3 parameter set, we achieve a 1.5x speedup for decapsulation and a 1.46x speedup for the sum of all operations. Our M4 implementation is more than two times faster than the non-constant-time portable implementation written by the BIKE team. The speedups are achieved by both algorithm-level and instruction-level optimizations.


Introduction
BIKE [ABB + 17] is a key encapsulation mechanism (KEM) involved in the NIST post-quantum cryptography standardization process. It can be viewed as a variant of the code-based cryptosystem introduced in [MTSB13], which makes use of so-called "quasi-cyclic moderate-density parity-check" (QC-MDPC) codes. In July 2020, NIST announced that BIKE had been selected as one of 15 schemes advancing to the third round of the process, making BIKE one of the candidates considered for standardization.
During the standardization process, BIKE has maintained a strong security record: unlike many other candidates, the claimed security levels of BIKE have never been challenged. BIKE also features small public keys: the public key sizes are around 1.5 KB, 3 KB, and 5 KB for the level-1, level-3, and level-5 parameter sets, respectively. These features make BIKE a competitive candidate in the process. In fact, NIST commented that "NIST views BIKE as one of the most promising code-based candidates" in the status report for the second-round candidates [AASA + 20].
Despite these advantages, two factors arguably make BIKE look weaker than some other candidates. The first is that, although the BIKE team has identified conditions on the decoder that would make the scheme CCA-secure, they are not able to prove that the decoder actually satisfies these conditions. On the other hand, BIKE can still be used in applications that require only CPA security. The other factor is the relatively slow speeds of the operations. This speed issue is what this paper aims to address.
• As pointed out in [Cho16], many operations in the decoding algorithm can be viewed as multiplications in R_z. These multiplications are of the form g(y)f(y), where each coefficient g_i and f_i is either 0 or 1, and g(y) is a sparse polynomial. [Cho16] proposed to write g(y) as Σ_s y^{−s} and to compute the sum of all y^{−s}f(y) to obtain g(y)f(y). Each y^{−s}f(y) is computed using a circular shift of the coefficients of f by s positions. In order to make sure that the circular shift does not leak information through timing, [Cho16] proposed to make use of the concept of a Barrel shifter to build a "Barrel rotator".
• This idea of using a "Barrel rotator" would be very efficient if r were a multiple of the register size. However, as r is always an odd prime in BIKE, performing the circular shift directly introduces some overhead. In order to reduce the overhead, [DG19] proposed to represent f in a "duplicated form" such that y^{−s}f(y) can be obtained by performing a logical shift on the duplicated form. The logical shift is again implemented using the concept of a Barrel shifter.
• To compute the sum Σ_s (y^{−s}f(y)), [Cho16] proposed to perform the computation in a bitsliced fashion, making use of the fact that each coefficient of y^{−s}f(y) is either 0 or 1.
• There are many multiplications in R in key generation, encapsulation, and decapsulation. In order to compute the product of two elements f = Σ_i f_i x^i and g = Σ_j g_j x^j in R, the clmul implementation of [Cho16] represents f and g as Σ_i F_i z^i and Σ_j G_j z^j, where z = x^64, and uses the pclmulqdq instruction to compute every F_i G_j. [DG19] proposed to use Karatsuba recursively, where the bottom level of the recursion is handled by pclmulqdq.
• The public key is of the form h_1 · h_0^{−1}, where each h_i ∈ R. The inverse h_0^{−1} ∈ R can be computed using a multiplication chain as shown in [Cho16]. Inside the multiplication chain, it is often required to compute the 2^k-th power of an element f ∈ R for some constant k. [DGK20a] proposed to compute such an exponentiation by permuting the coefficients of f. When k is large, this is much faster than carrying out k squarings sequentially.
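The rotation-based multiplication described in the first bullet can be illustrated with a toy Python sketch. This is our own model (not the BIKE code), with a tiny r and hypothetical function names; it only demonstrates that the product in R_z equals the coefficient-wise sum of circular shifts.

```python
# Toy sketch of multiplication in R_z = Z[y]/(y^r - 1) as a sum of
# circular shifts, following the idea attributed to [Cho16] above.
# f has 0/1 coefficients; g is sparse and given by its support S,
# i.e., g(y) = sum over s in S of y^(-s).

def rotate(f_bits, s, r):
    """Coefficients of y^(-s) * f(y): coefficient i becomes f_{(i+s) mod r}."""
    return [f_bits[(i + s) % r] for i in range(r)]

def sparse_mul(f_bits, support, r):
    """g(y) * f(y) in R_z, computed as the coefficient-wise sum of rotations."""
    acc = [0] * r
    for s in support:
        rot = rotate(f_bits, s, r)
        for i in range(r):
            acc[i] += rot[i]
    return acc
```

For a schoolbook cross-check, note that g's coefficient vector has a 1 at position (r − s) mod r for each s in the support.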
Unfortunately, even with all these optimizations, the latest software speed of BIKE is still not very satisfying: the eBACS website [Be] reports that, on titan0, a machine with an Intel Haswell CPU, the level-1 parameter set bikel1 takes more than 138000 cycles and 2650000 cycles to perform encapsulation and decapsulation, respectively. For comparison, mceliece348864, one of the level-1 parameter sets of the third-round code-based candidate Classic McEliece [ABC + 17], takes only 44652 and 132360 cycles, respectively, on the same machine. Note that this means the situation of code-based candidates is quite different from that of lattice-based candidates, as structured lattice-based schemes are typically much faster than non-structured ones.
The BIKE team has some optimized implementations for x86 machines, whose speeds rely heavily on the usage of instructions for carryless multiplications such as pclmulqdq.
This raises the question of whether BIKE can still run with reasonable speed on embedded systems where such instructions are not available. To our knowledge, there has not been any BIKE implementation tailored for embedded systems.

Our Contribution
This paper presents two implementations of BIKE, one optimized for the Intel Haswell and one optimized for the Cortex-M4. Compared to the avx2 implementation written by the BIKE team, for bikel1, we achieve a 1.39x speedup for decapsulation and a 1.33x speedup for the sum of all operations on Haswell. For bikel3, we achieve a 1.5x speedup for decapsulation and a 1.46x speedup for the sum of all operations on Haswell. Compared to the non-constant-time portable implementation written by the BIKE team, our implementation is more than two times faster for all three operations on the M4. Both of our implementations are constant-time: there are no memory accesses using secret indices, no branches on secret conditions, and no uses of variable-time instructions.
We believe that our implementations will make BIKE a more interesting option for standardization. Even though we did not implement the largest parameter set bikel5, which was introduced in the third-round specification, we expect that our optimization techniques (see below) will be useful for optimizing bikel5 as well. We also expect the implementation techniques used for Haswell to be useful for optimizing BIKE for newer microarchitectures, e.g., those supporting AVX-512. In addition, our optimization techniques for polynomial multiplications in R can be useful for optimizing HQC [AMAB + 17], another third-round code-based candidate.

Optimization Techniques
The speed of our implementations is achieved by the following optimization techniques. a) When performing a logical shift on the duplicated form of f ∈ R_z, we make use of matrix transpositions to change the representation of f, so that a Barrel shifter is no longer required. As far as we can tell, this idea is completely original. This optimization is only used in our Haswell implementation.
b) We implement the Barrel shifter for shifting the duplicated form of f ∈ R_z using conditional execution and the powerful multiply-and-accumulate instructions supported by the M4. We consider this a purely instruction-level optimization.
c) The sum of all y^{−s}f is computed using a sequence of half adders in the BIKE team's implementation. We follow an algorithm explained by Boyar and Peralta [BP05] to use full adders whenever possible. The algorithm is more complex and requires more memory, but it reduces the number of logical operations required to compute the sum by a factor of more than 2.
d) The BIKE team's implementations use the Karatsuba algorithm recursively to optimize multiplications in R. Instead, we use a five-way recursive algorithm proposed by Bernstein in [Ber09a] for the top level of the recursion. This optimization is only applied in our Haswell implementation.
e) We use a "Frobenius additive FFT" (FAFFT) to optimize multiplications in R on the M4. We found that FAFFTs fit nicely with bitslicing, so that we are able to carry out multiplications in R efficiently even without instructions for carryless multiplications.
To summarize, a), b), and c) are used for multiplications in R_z, while d) and e) are used for multiplications in R. Table 1 summarizes where each of these techniques is used in our implementations.

Availability of Source Code
Our implementations are adapted from the BIKE team's implementations and hence will be distributed under the same Apache 2.0 license as the original implementations. We plan to submit our Haswell implementation to the eBACS project [Be] so that the source code can be included in SUPERCOP. The source code of our M4 implementation has been included in the pqm4 project [KRSS] under the directories crypto_kem/bikel1/m4f and crypto_kem/bikel3/m4f. We also plan to submit both of our implementations as the artifact of our paper for CHES 2021.

Organization of the paper
Section 2 reviews the latest specification of BIKE and shows why we need to perform multiplications in R_z and multiplications in R. Each of the remaining sections introduces one of our five optimization techniques; see Table 1. Section 8 shows our experimental results for multiplications in R_z, multiplications in R, and the three KEM operations.

BIKE
This section reviews the third-round specification of BIKE [ABB + 17]. For the sake of completeness, we repeat some contents from the specification in this section.

System Parameters.
BIKE uses the system parameters r, w, t, and ℓ. The following table shows the values of the system parameters and the security level for each of the three parameter sets of BIKE, which we denote as bikel1, bikel3, and bikel5.

Hash Functions.
The BIKE team defines the hash functions H, K, and L to have the following domains and ranges, where the spaces M, K, and E_t are defined as follows.
The concrete instantiation of H, K, L can be found in the specification and is beyond the scope of this paper.

Key Generation, Encapsulation, and Decapsulation.
BIKE's key generation, encapsulation, and decapsulation algorithms are depicted in Figure 1. As shown in the figure, the key generation algorithm generates a secret key (h_0, h_1, σ) and a public key h, where (h_0, h_1) lies in the set H_w. The encapsulation algorithm, on input a public key h, generates a session key K and a ciphertext c encapsulating the session key. The decapsulation algorithm, on input a secret key (h_0, h_1, σ) and a ciphertext c, outputs a session key K or ⊥. There are many multiplications in R, as one can see from the KEM operations and the decoding algorithm decoder defined in the next subsection.

The Black-Gray Decoder
The decoding algorithm specified in the third-round specification is the black-gray decoder, which was introduced in [DGK20b]. The algorithm is shown in Algorithm 1 of the specification. The BIKE team described it as an algorithm that can be used for any code with a low-weight parity-check matrix. However, as shown in Figure 1, BIKE is a scheme with a ring structure. To facilitate our discussion in the following sections, we rephrase the algorithm so that operations in rings are explicitly shown. We emphasize that "rephrase" means that we do not create a new algorithm but merely write the original algorithm in a different way. Our version of the black-gray decoder uses the following notations.
• Lift(•), which lifts an element from R to R_z.
• Tr(•), which converts an element of R to R_z.
The black-gray decoder is summarized in Algorithm 1. It should be easy to see where the operations in R come from with respect to the specification. Note that the multiplications of the form Tr(h_i) · Lift(s) are in R_z, and |Tr(h_i)| is always w/2. These multiplications are used to compute the numbers of unsatisfied parity-check equations. It has been explained in [Cho16] why the computation of the numbers of unsatisfied parity-check equations can be viewed as multiplications in R_z.
For completeness, we note that the black-gray decoder uses (in addition to r, w, t) two integers NbIter and τ and a function threshold as its parameters. The third-round specification defines NbIter = 5 for all parameter sets. How τ and threshold are defined is irrelevant to the purpose of this paper.

Circular Shifts with Matrix Transposition
As mentioned in the Introduction, in order to compute g · f, where g, f ∈ R_z and g_i, f_i ∈ {0, 1} for all i, we may write g as Σ_s y^{−s} with 0 ≤ s < r and compute gf as Σ_s (y^{−s}f). For each s, y^{−s}f can be computed by carrying out a circular shift on (f_0, . . ., f_{r−1}) by s positions, due to the structure of R_z. Below we review how such a circular shift can be implemented using a Barrel shifter and propose a way to avoid the Barrel shifter.

The Duplicated Form and the Barrel Shifter
In order to compute y^{−s}f, the BIKE team follows [Cho16] in using a Barrel shifter and follows [DG19] in using a duplicated form of f. The idea is to 1. lift f to Z[y] and represent it as f̄ = f + (f mod y^{r−1})y^r, and 2. compute y^{−s}f as ((f̄ − (f̄ mod y^s))/y^s) mod y^r.
The second step might look complicated, but computationally it simply means performing a logical shift (toward the direction of f_0) of s positions on the vector of 2r − 1 coefficients of f̄ and taking the first r coefficients. Note that making f̄ an even longer polynomial leads to the same result, as long as the extra coefficients continue the same periodic pattern. Let b be a power of 2. To implement this idea, f̄ is stored as an array of b-bit words, where word i holds the coefficients f̄_{ib}, . . ., f̄_{(i+1)b−1}. Then, f̄ is first shifted by s^{(1)} = s − (s mod b) positions and then by s^{(0)} = s mod b positions. As s^{(1)} is a multiple of b, the shift by s^{(1)} = Σ_j s^{(1)}_j 2^j positions is carried out as a sequence of conditional moves from word i + 2^j/b to word i for each j. The shift by s^{(0)} positions can be carried out by a sequence of logical shifts and ORs on the words, assuming that corresponding instructions exist.
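The duplicated-form trick can be sketched in a few lines of Python. This is a toy model of ours (hypothetical names, list-based instead of word-based): the duplicated form has 2r − 1 coefficients with f̄_i = f_{i mod r}, so a plain logical shift followed by truncation yields the circular shift.

```python
# Toy sketch of the duplicated form of f in R_z = Z[y]/(y^r - 1).
# fbar_i = f_{i mod r} for 0 <= i <= 2r-2, so taking fbar[s : s+r]
# gives the coefficients of y^(-s) * f for any 0 <= s < r.

def duplicated_form(f_bits, r):
    # coefficients of f + (f mod y^(r-1)) * y^r
    return f_bits + f_bits[:r - 1]

def shift_and_truncate(fbar, s, r):
    # ((fbar - (fbar mod y^s)) / y^s) mod y^r
    return fbar[s:s + r]
```

A real implementation performs the slice with word-level shifts; the point here is only that no wrap-around logic is needed.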
The avx2 implementation follows the strategy above, with b set to 256 to fit the size of the YMM registers. Note that there is no general shift instruction for YMM registers, so the avx2 implementation follows [GAB19] in using AVX2 intrinsics to carry out the shift by s^{(0)} positions.

Avoiding the Barrel Shifter with Matrix Transposition
We do better than the avx2 implementation by making use of a simple observation. In short, we observe that shifting f̄ by s^{(1)} positions can be carried out by shifting a set of words, each holding one column of a matrix view of f̄: arrange the coefficients of f̄ into a matrix M^{(0)} whose row i consists of the b coefficients of word i.
Our main observation is that shifting f̄ by s^{(1)} positions, in the matrix view, simply means replacing row i by row i + s^{(1)}/b (and if row i + s^{(1)}/b is undefined, row i is set to 0). If we can somehow make each column of M^{(0)} lie in one b-bit word, this can be carried out easily via b shift operations by s^{(1)}/b positions on the columns.
In order to make use of this observation, given f and s, we perform the following steps to compute y^{−s}f.
1. Compute f̄, which can be viewed as the matrix M^{(0)}, from f.
2. Transpose M^{(0)} to obtain N^{(0)}, so that each column of M^{(0)} lies in one b-bit word.
3. Shift each word of N^{(0)} by s^{(1)}/b positions to obtain N^{(1)}.
4. Transpose N^{(1)} back to obtain M^{(1)}, the matrix view of f̄ shifted by s^{(1)} positions.
5. Shift the coefficient vector corresponding to M^{(1)} by s^{(0)} positions. Then, the first r entries will be y^{−s}f.
Recall that what we actually want to compute is Σ_s (y^{−s}f). In fact, the first two steps above do not depend on s, so N^{(0)} is the same for any s. Therefore, one can save operations by computing N^{(0)} with the first two steps once, and then performing the last three steps for each s.
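To illustrate the observation, here is a toy Python sketch of ours (not the paper's AVX2 code; sizes far smaller than b = 256, names hypothetical). Rows are stored as integers; after transposing, each column lies in one word, and the whole row shift collapses to one bit-shift per word.

```python
# Toy sketch of the matrix-transposition observation. Bit c of rows[i]
# is entry (i, c). After transposing, "replace row i by row i+k"
# becomes a single right-shift by k bits on every transposed word.

def transpose(rows, nrows, ncols):
    cols = []
    for c in range(ncols):
        w = 0
        for i in range(nrows):
            w |= ((rows[i] >> c) & 1) << i
        cols.append(w)
    return cols

def shift_rows_up(rows, k):
    # row i <- row i+k; undefined source rows become 0
    return rows[k:] + [0] * k

def shift_rows_up_via_transpose(rows, nrows, ncols, k):
    cols = transpose(rows, nrows, ncols)
    cols = [w >> k for w in cols]           # the whole row shift, one shift per word
    return transpose(cols, ncols, nrows)    # transpose back
```

In the actual implementation the transpositions are done block-wise with a vectorized 64 × 64 transpose, and their cost is amortized because N^{(0)} is reused for every s.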

The Haswell Implementation for bikel1
For ease of discussion, we define the following operators that can be applied to any matrix.
• ⇐_j(•), meaning replacing column i by column i + j for all i.
• ⇒_j(•), meaning replacing column i by column i − j for all i.
• ⇑_j(•), meaning replacing row i by row i + j for all i.
• ⇓_j(•), meaning replacing row i by row i − j for all i.
For each operator, the destination row or column is set to 0 if the source is not well-defined.The parentheses will be omitted whenever it is convenient.
Our Haswell implementation for bikel1 uses b = 256 because it mainly works on YMM registers. bikel1 has r = 12323, so f̄ consists of at least ⌈(2r − 1)/256⌉ = 97 words, and s^{(1)}/b ≤ 48. For ease of implementation, we let f̄ be of length 32768, so M^{(0)} is a matrix with 128 rows, viewed as a grid of blocks M_{ij}, where each M_{ij} is a 64 × 64 matrix. Following the discussion in Section 3.2, we only need the first ⌈(12323 + 255)/256⌉ = 50 rows of ⇑_{s^{(1)}/b}(M^{(0)}) to perform the final shift by s^{(0)} positions. Let s′ = s^{(1)}/b; for ease of implementation, we compute the first 64 rows of ⇑_{s′}(M^{(0)}). Our implementation stores f̄ in the transposed form, i.e., as the matrix N^{(0)}. Then, for each s, we compute N^{(1)} and transpose each of the four 64 × 64 submatrices in N^{(1)} to obtain M^{(1)}.
We chose the implementation strategy above because it can be implemented easily using AVX2 intrinsics/instructions. Our source code for computing M^{(1)} from N^{(0)} is given in Figure 2. The variable in is a pointer to N^{(0)}. The variable ymm_num is set to s′. SRLI_I64 is defined (by the BIKE team) as _mm256_srli_epi64, and we use it to perform ⇐_{s′}. SLLI_I64 is defined (by the BIKE team) as _mm256_slli_epi64, and we use it to perform ⇒_{64−s′}. The discussion above suggests that we need to compute the XOR of the results of SRLI_I64 and SLLI_I64, but the source code uses OR because it has the same effect. The function transpose_64x256_sp_asm is an assembly function that computes the "64 × 64-block-wise" matrix transposition for obtaining M^{(1)} from N^{(1)}. The assembly function performs a vectorized version of the 64 × 64 matrix transposition mentioned in Section 3.2. We note that the assembly function is included in the source code of Classic McEliece. The source code of Classic McEliece, including the function, is in the public domain. In fact, we also compute N^{(0)} from M^{(0)} by calling the assembly function twice.

The Haswell Implementation for bikel3
bikel3 has r = 24659, so f̄ consists of at least ⌈(2r − 1)/256⌉ = 193 words, and s′ = s^{(1)}/b ≤ 96. For ease of implementation, we let f̄ be of length 65536, which means M^{(0)} is a matrix with 256 rows, viewed as a grid of 64 × 64 blocks M_{ij}. Following the discussion in Section 3.2, we only need the first ⌈(24659 + 255)/256⌉ = 98 rows of ⇑_{s′}(M^{(0)}) to perform the final shift by s^{(0)} positions. For ease of implementation, we instead compute 128 rows of ⇑_{s′}(M^{(0)}). Let M^{(0k)} ∈ F_2^{128×256} be the first 128 rows of ⇑_{64k}(M^{(0)}). We observed that there are two cases for the first 128 rows of ⇑_{s′}(M^{(0)}), which we denote as M^{(1)}, depending on whether 0 ≤ s′ < 64 or 64 ≤ s′ < 128. Our implementation stores f̄ in the transposed form, and depending on the case, N^{(1)} is computed from different parts of N^{(0)}. In either case, each of the eight 64 × 64 submatrices in N^{(1)} is transposed to obtain M^{(1)}. It might seem that the implementation strategy above results in branching on s′, and branching on s′ is not allowed for constant-time implementations. However, this can be avoided with the AVX intrinsics. Our source code for computing M^{(1)} from N^{(0)} is given in Figure 3. The variable in, again, is a pointer to N^{(0)}. The variable ymm_num, again, is set to s′. When 0 ≤ s′ < 64, the SRLI_I64 and SLLI_I64 in lines 12 and 13 give all-zero results because their shift amounts are out of range. Similarly, when 64 ≤ s′ < 128, the SRLI_I64 and SLLI_I64 in lines 10 and 11 give all-zero results because their shift amounts are out of range. In this way, we are able to avoid branching on s′. Finally, we compute M^{(1)} from N^{(1)} by calling the assembly function transpose_64x256_sp_asm twice. We also compute N^{(0)} from M^{(0)} by calling the assembly function four times.

Circular Shifts with Conditional Moves
Following the discussion in Section 3.1, our M4 implementation uses a Barrel shifter to shift the duplicated form of f by s bits. In the setting of the M4, we have s = s^{(1)} + s^{(0)}, where s^{(0)} = s mod 32, and the duplicated form of f is stored as an array of 32-bit words. Our M4 implementation first performs the shift by s^{(1)} bits and then the shift by s^{(0)} bits.

The Shift by s^{(1)} Bits
In the first phase of the Barrel shifter, each word i is conditionally replaced by word i + 2^j/32 for several j's with j ≥ 5, to carry out the shift by s^{(1)} bits. Note that j ∈ {13, 12, . . ., 5} for bikel1 and j ∈ {14, 13, . . ., 5} for bikel3. To complete the computation for each j, we can first create a 32-bit mask which is set to 0xFFFFFFFF if the shift by 2^j/32 words needs to be carried out or 0x0 otherwise. Then, for i = 0, 1, 2, . . ., the portable implementation replaces word i by (w[i] & ~mask) | (w[i + idx] & mask), where idx holds the value of 2^j/32. This is essentially the code of the portable implementation, except that portable actually works on 64-bit words.
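The mask-based conditional word move can be sketched as follows. This is a toy Python model of ours (hypothetical names, list of 32-bit words): the data flow is identical whether or not the secret bit is set, which is the point of the masking.

```python
# Toy sketch of one phase-1 step of the Barrel shifter: if a (secret)
# bit of the shift amount is set, every word i is replaced by word
# i + idx, where idx = 2^j / 32; out-of-range source words are 0.
# The mask keeps the memory-access pattern independent of the secret.

MASK32 = 0xFFFFFFFF

def cond_word_shift(words, idx, secret_bit):
    mask = (-secret_bit) & MASK32       # 0xFFFFFFFF if the shift is taken, else 0
    n = len(words)
    out = []
    for i in range(n):
        src = words[i + idx] if i + idx < n else 0
        out.append((words[i] & ~mask & MASK32) | (src & mask))
    return out
```

The same words are read and written in the same order for both values of secret_bit, so no timing information leaks.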
In our implementation, for each j in the first phase, the words are partitioned into 2^j/32 sets: each word i with k = i mod 2^j/32 is in the kth set. For each set, we use two general-purpose registers Rx and Ry, where Rx is initialized with word k. Then, we set i to k and perform the following loop.
1. Load word i + 2^j/32 into Ry.
2. Conditionally move Ry to Rx.
3. Store Rx to word i.
4. If word i + 2 · 2^j/32 exists, increase i by 2^j/32, move Ry to Rx, and go back to Step 1. Otherwise, break the loop.
To carry out the conditional moves, our M4 implementation uses the instruction SEL. As suggested by the name, SEL can move one of its two source registers into its destination register. Which source register is chosen is determined by the "GE" flags. We note that the SEL instruction always takes 1 cycle.
In order to reduce the average cost of each conditional move, our implementation actually processes n = min(2^j/32, 4) words in one iteration of the loop: we first load n words into Ry1, . . ., Ryn, use the SEL instruction n times to conditionally move Ry1, . . ., Ryn to the corresponding Rx1, . . ., Rxn, and store the selected values to the memory of the n words. We also partially unroll the loop so that Step 4 is omitted. A piece of code for the loop for j = 12 is given in Appendix A.

The Shift by s^{(0)} Bits
In the second phase, the Barrel shifter performs the shift of f by s^{(0)} < 32 bits. Suppose f is stored as an array of 32-bit words w[n]. In the portable implementation of the BIKE team, the shift is carried out by replacing w[i − 1] with (w[i − 1] >> s^{(0)}) | (w[i] << (32 − s^{(0)})) for i = 1, 2, . . ., and so on. This approach takes 2 logical shifts and 1 logical OR for each 32-bit word. We note that portable actually works on 64-bit words, but the idea is the same. In fact, we can also deal with the 32-bit words in the reversed order. Note that the logical OR can be replaced by an addition, and that both shifted values of a word can be obtained at once as the high and low halves of a widening multiplication by 2^{32−s^{(0)}}; this is how shift instructions can be replaced by multiplication instructions.
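The multiplication trick can be modeled in Python as follows. This is our own sketch (hypothetical names), using Python integers to stand in for the M4's widening multiply: one product per word yields both the right-shifted high half and the left-shifted low half, and the two halves of neighboring products occupy disjoint bits.

```python
# Toy sketch of the shift by s0 < 32 bits using one widening
# multiplication per word: w * 2^(32 - s0) is a 64-bit product whose
# high half is w >> s0 and whose low half is (w << (32 - s0)) mod 2^32.
# The combined bits are disjoint, so addition has the same effect as OR.

MASK32 = 0xFFFFFFFF

def shift_words_down(words, s0):
    m = 1 << (32 - s0)
    n = len(words)
    out = []
    for i in range(n):
        hi = (words[i] * m) >> 32                        # w[i] >> s0
        lo = ((words[i + 1] * m) & MASK32) if i + 1 < n else 0
        out.append(hi + lo)                              # addition == OR here
    return out
```

Note that the s0 = 0 edge case needs no special handling: the multiplier becomes 2^32, so the high half is the word itself and the low half of the neighbor vanishes.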
A previous version of our M4 implementation used IT blocks instead of SEL to carry out the conditional moves and performed the shift by s^{(0)} bits in the same way as the portable implementation. The ideas of using SEL and of replacing shift instructions by multiplication instructions were suggested by an anonymous reviewer of this paper. In the current version of our M4 implementation, which uses these ideas, one circular shift is about 5% faster than in the previous version.

Hamming Weight Computation with Full Adders
The implementations of the BIKE team follow [Cho16] in computing the sum Σ_s (y^{−s}f) in a bitsliced fashion to obtain gf. As a result, a sequence of bitwise logical operations is performed to obtain the sum. Let n be the cardinality of the set of s values, and let b_1 be the constant term of the first y^{−s}f, b_2 the constant term of the second y^{−s}f, and so on. It is easy to see that the problem of computing Σ_s (y^{−s}f) boils down to computing the Hamming weight of (b_1, b_2, . . ., b_n) ∈ {0, 1}^n using a sequence of gates. This section shows how previous implementations and our implementation compute the Hamming weight.

Simple Algorithms for Computing the Hamming Weight
The avx2 implementation of BIKE, following [Cho16], first prepares a counter of ⌊log₂ n⌋ + 1 zero bits. The counter is used to store the Hamming weight. Then, for each i ∈ {1, . . ., n}, b_i is added to the counter. For each addition, if n is a power of 2, each of the least significant log₂ n bits can be updated using one half adder, and the most significant bit can be set to the carry-out bit of the last half adder. If n is not a power of 2, each of the least significant ⌊log₂ n⌋ bits can be updated using one half adder, and the most significant bit can be updated using 1 XOR. Let T(n) be the number of bit operations it takes to compute the Hamming weight of n bits.
In [GAB19], the authors suggested a simple way to improve the approach above. The idea is that, since adding b_i only affects the first ⌊log₂ i⌋ + 1 bits of the counter, there is no need to update the remaining bits. In fact, adding b_i takes 2⌊log₂ i⌋ + 1 bit operations if i is not a power of 2, and 2⌊log₂ i⌋ bit operations if i is a power of 2. Therefore, we have T(n) = Σ_{i=2}^{n} c_i, where c_i = 2⌊log₂ i⌋ + 1 if i is not a power of 2, and c_i = 2⌊log₂ i⌋ if i is a power of 2.
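The bitsliced counter update can be sketched in Python. This is a toy model of ours (hypothetical names): each entry of counter is a machine word holding one counter bit for many lanes, and adding a word of incoming bits ripples a carry upward through half adders, one XOR and one AND per counter bit.

```python
# Toy sketch of the bitsliced half-adder counter. counter[k] holds
# bit k of the per-lane counters, packed across lanes; b is one word
# of incoming bits (one bit per lane).

def counter_add(counter, b):
    """Add bit-vector b into the bitsliced counter, in place."""
    carry = b
    for k in range(len(counter)):
        # one half adder per counter bit: sum = XOR, carry-out = AND
        counter[k], carry = counter[k] ^ carry, counter[k] & carry
    # counter must be wide enough that the final carry is always 0

def counter_read(counter, lane):
    """Recover the integer value of one lane."""
    return sum(((counter[k] >> lane) & 1) << k for k in range(len(counter)))
```

Every lane's Hamming weight is accumulated simultaneously, which is why the per-bit gate count of the adder network directly determines the cost.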

The Boyar-Peralta Algorithm
Our implementation uses a much faster algorithm which makes use of full adders. The algorithm was explained by Boyar and Peralta in [BP05]. We note that the technique is probably much older: [FS71] presents an algorithm that seems equivalent. However, as we are not able to tell who first proposed the algorithm, we simply call it the Boyar-Peralta algorithm. Boyar and Peralta focused on proving that the algorithm gives the minimal number of ANDs required to compute the Hamming weight. In the following discussion, we consider the number of XORs as well as the space requirement of the algorithm.
To explain the idea, consider the case when n = 2^k − 1. Boyar and Peralta described the following recursive algorithm: apply the algorithm recursively to compute the Hamming weight x_1 of the first 2^{k−1} − 1 bits, apply the algorithm recursively to compute the Hamming weight x_2 of the next 2^{k−1} − 1 bits, and use a standard adder to compute the sum x_1 + x_2 + b_n, where b_n serves as the carry-in bit. The sum x_1 + x_2 + b_n can be computed using k − 1 full adders, resulting in 5(k − 1) bit operations. Therefore, we have T(2^k − 1) = 2T(2^{k−1} − 1) + 5(k − 1) for k > 1, and obviously T(0) = T(1) = 0.
In order to handle general n, the Boyar-Peralta algorithm computes k, n_1, n_2 such that n = 2^{k−1} + n_2 = n_1 + n_2 + 1, where 0 ≤ n_2 < 2^{k−1}. Then, the algorithm is applied recursively to compute the Hamming weight x_1 of the first n_1 bits and the Hamming weight x_2 of the next n_2 bits, and finally a standard adder is used to compute the sum x_1 + x_2 + b_n. Here k′ is the number of bits required to represent n_2; k′ is set to 0 if n_2 = 0. The sum x_1 + x_2 + b_n can be computed using k′ full adders and k − 1 − k′ half adders, resulting in 5k′ + 2(k − 1 − k′) bit operations. Therefore, we have T(n) = T(n_1) + T(n_2) + 5k′ + 2(k − 1 − k′) for n ≥ 2, and T(0) = T(1) = 0.
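The operation counts can be tabulated directly from the recurrences in this section. The following is our own reading of them in Python (hypothetical function names; a sketch for checking the counts, not the paper's code generator): t_simple is the improved half-adder method of [GAB19], and t_bp is the Boyar-Peralta recurrence.

```python
# Bit-operation counts for computing the Hamming weight of n bits,
# per our reading of the recurrences in this section.

def t_simple(n):
    """Improved half-adder method of [GAB19]: sum the per-addition costs."""
    total = 0
    for i in range(2, n + 1):
        k = i.bit_length() - 1                  # floor(log2 i)
        total += 2 * k if i & (i - 1) == 0 else 2 * k + 1
    return total

def t_bp(n):
    """Boyar-Peralta recurrence with full adders."""
    if n <= 1:
        return 0
    k = n.bit_length()                          # 2^(k-1) <= n < 2^k
    n2 = n - (1 << (k - 1))
    n1 = (1 << (k - 1)) - 1                     # n = n1 + n2 + 1
    kp = n2.bit_length()                        # 0 if n2 == 0
    return t_bp(n1) + t_bp(n2) + 5 * kp + 2 * (k - 1 - kp)
```

For w/2 = 71 (the bikel1 case discussed below), the ratio between the two counts exceeds 2, consistent with the factor claimed in the Introduction.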
The reason why we need to perform multiplications of the form gf = Σ_s (y^{−s}f) in the first place is that there are multiplications between Tr(h_i) and Lift(s) in the decoder (see Section 2.4). As each Tr(h_i) is always of weight w/2, our goal is actually to compute the Hamming weight of w/2 bits. Now we can easily compute T(w/2) for the Boyar-Peralta algorithm and the simple algorithms in the previous subsection. The numbers are summarized in Table 2.

Code Generation for the Boyar-Peralta Algorithm
To implement the Boyar-Peralta algorithm, we wrote a code generator that produces the required sequence of additions. The code generator is included in Appendix B. It assumes that there is a buffer consisting of infinitely many bits that can be used for storing the results of the additions and the Hamming weight.
Our code generator is written as a Python function named bp. One can simply call bp(n, 0) to generate operations for n input bits. The second argument, the variable off in its signature, indicates the index of the first bit in the buffer used for the computation.

The function, on input n and off, first computes n_1 and n_2 as in the previous subsection. Then, the function recursively calls bp(n1, off); bp(n2, off + nbits(n1)); to generate operations for computing x_1 and x_2, where nbits(n1) means the number of bits required to represent n_1. Note that x_1 will be stored in the first nbits(n1) bits starting at index off. The function then outputs a call to adder_size_eq if nbits(n1) and nbits(n2) are equal, or a call to adder_size_neq if they are different. In either case, this means that a standard adder is used to compute x_1 + x_2 + b_n. The variable buf is a pointer to the buffer, and the offset of x_1 is indicated by off. The difference between the two cases is that adder_size_eq only requires one argument to indicate the bit length of x_1 or x_2, while adder_size_neq requires two arguments. If n = 1, the function outputs the line of rotate_right, increases count by 1, and outputs "adder_size_eq(buf, bn, {0}, 0);".format(off), where the last argument 0 indicates that b_n is simply copied into the buffer bit of index off. If n = 0, the function returns immediately. The only remaining task is to implement the C functions adder_size_eq and adder_size_neq.
By analyzing the output of the function, we found that buffers of 16, 17, and 22 bits are required for n = 71, n = 103, and n = 137, respectively. Note that the simple algorithms require buffers of only 7, 7, and 8 bits, respectively. In other words, using the Boyar-Peralta algorithm to compute the sum Σ_s (y^{−s}f) in a bitsliced fashion takes at least 16 · 12323/8 = 24646, 17 · 24659/8 = 52401, and 22 · 40973/8 = 112676 bytes for bikel1, bikel3, and bikel5, respectively.

Multiplications in F_2[x] with Bernstein's 5-way Recursive Algorithm
The implementations of the BIKE team use Karatsuba for multiplying polynomials in F_2[x]. Given F = F_0 + F_1 z and G = G_0 + G_1 z, where F_0, F_1, G_0, G_1 are of degree smaller than n and z = x^n, the algorithm computes FG as F_0G_0 + ((F_0 + F_1)(G_0 + G_1) + F_0G_0 + F_1G_1)z + F_1G_1 z^2. This means that Karatsuba is a 3-way recursive algorithm: it reduces the multiplication into 3 smaller ones, where the smaller multiplications can be handled recursively. The complexity of Karatsuba is O(n^{log₂ 3}).
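The 3-way recursion can be sketched in Python, with polynomials over F_2 stored as integers (bit i = coefficient of x^i). This is a toy model of ours (hypothetical names); the 64-bit base case stands in for pclmulqdq, and all additions and subtractions are XORs.

```python
# Toy sketch of recursive Karatsuba over F_2[x], with polynomials
# stored as Python integers (bit i = coefficient of x^i).

def clmul(a, b):
    """Schoolbook carryless multiplication (stand-in for pclmulqdq)."""
    r, i = 0, 0
    while b >> i:
        if (b >> i) & 1:
            r ^= a << i
        i += 1
    return r

def karatsuba(a, b, nbits):
    """Multiply a, b of degree < nbits; nbits a power of 2."""
    if nbits <= 64:
        return clmul(a, b)
    h = nbits // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = karatsuba(a0, b0, h)                      # F0 * G0
    hi = karatsuba(a1, b1, h)                      # F1 * G1
    mid = karatsuba(a0 ^ a1, b0 ^ b1, h)           # (F0+F1)(G0+G1)
    return lo ^ ((lo ^ hi ^ mid) << h) ^ (hi << (2 * h))
```

Each level replaces one size-nbits multiplication by three size-nbits/2 multiplications plus O(nbits) XORs, which is the source of the O(n^{log₂ 3}) complexity.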
Recall that bikel1 has r = 12323 and bikel3 has r = 24659. The BIKE team always picks n to be the smallest possible power of 2, which means at the top level of the recursion we have n = 8192 for bikel1 and n = 16384 for bikel3. They then apply Karatsuba recursively. At the bottom level of the recursion, the avx2 implementation uses pclmulqdq, and the portable implementation uses a non-constant-time function to handle multiplications of polynomials of degree smaller than 64.
As r seems to be large enough to make use of algorithms with lower complexities, our Haswell implementation uses a 5-way recursive algorithm by Bernstein.

Bernstein's 5-way Recursive Algorithm
In [Ber09a], Bernstein proposed a 5-way recursive algorithm that multiplies two polynomials in F_2[x] using five smaller multiplications. The complexity of Bernstein's algorithm is O(n^{log₃ 5}). Consider two polynomials F = F_0 + F_1 z + F_2 z^2 and G = G_0 + G_1 z + G_2 z^2, where the F_i's and G_i's are polynomials of degree smaller than n and z = x^n. The algorithm reduces the multiplication into five polynomial multiplications of polynomials of length about n, obtained by evaluating F and G at the points 0, 1, x, x + 1, and ∞. Bernstein showed that H = F · G can then be constructed by a formula involving U = H(0) + (H(0) + H(1))z and V = H(x) + (H(x) + H(x + 1))(x + z).

Implementing Bernstein's algorithm
Our Haswell implementation uses Bernstein's algorithm at the top level of the recursion and Karatsuba for the other levels. At the top level, we pick n = 4096 for bikel1 and n = 8192 for bikel3. Note that 12323 − 4096 · 3 = 35 and 24659 − 8192 · 3 = 83. These "leftover bits" are handled separately, and the cost of handling them is very small. In order to make the operands of the 5 multiplications fit into n = 4096 or n = 8192 bits, we also handle the top 2 bits of F_2 and G_2 and the top bit of F_1 and G_1 separately, together with the "leftover bits". At the bottom level of the recursion, we also use pclmulqdq to handle multiplications of polynomials of degree smaller than 64.
While most of the operations in Bernstein's algorithm are either multiplications or additions, there is one division by (x^2 + x) in Eq. 1. We carry out the division by computing each coefficient q_i of the quotient q = h/(x^2 + x) as

    q_i = h_{i+2} + h_{i+3} + h_{i+4} + · · · .

Note that instead of computing the coefficients sequentially, the computation can be easily parallelized: let h be (h − (h mod x^2))/x^2; we then update h to h + (h − (h mod x))/x, update h to h + (h − (h mod x^2))/x^2, update h to h + (h − (h mod x^4))/x^4, and so on. This results in a much faster implementation for the division.
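The two variants can be sketched in Python on bit-packed polynomials (bit i = coefficient of x^i); the function names are ours, and the sketch assumes the dividend is exactly divisible by x^2 + x, as it is in Eq. 1.

```python
# Sketch: exact division of h by x^2 + x in F_2[x], sequential vs. parallelized.
# Polynomials are bit-packed into Python integers (bit i = coefficient of x^i).

def div_sequential(h):
    """Coefficient by coefficient: q_i = h_{i+2} + h_{i+3} + ... over F_2."""
    q, acc = 0, 0
    for i in reversed(range(max(h.bit_length() - 2, 0))):
        acc ^= (h >> (i + 2)) & 1    # running XOR of the high coefficients
        q |= acc << i
    return q

def div_logstep(h):
    """Log-depth variant: shift down twice, then fold with doubling shifts."""
    q = h >> 2          # h <- (h - (h mod x^2)) / x^2
    s = 1
    while q >> s:
        q ^= q >> s     # h <- h + (h - (h mod x^s)) / x^s, for s = 1, 2, 4, ...
        s <<= 1
    return q
```

On a degree-d dividend the second variant needs only about log_2 d word-level shift-XOR steps, which is what makes the vectorized division fast.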

Polynomial Multiplications in F_2[x] with Frobenius Additive FFTs
This section presents how we carry out the polynomial multiplications on the Cortex-M4. Section 7.1 reviews how the Gao-Mateer additive FFT works. Section 7.2 reviews how to perform multiplications in F_2[x] with the additive FFT as in [LCK+18, CCK+]. Section 7.3 shows how we implement the algorithm for the Cortex-M4.

The Gao-Mateer Additive FFT
Below we introduce how the Gao-Mateer additive FFT [GM10], which we denote as AFFT, can be used for polynomials over F_{2^m}, where m is a power of 2.
We define W_ℓ ⊆ F_{2^m} as the span of {v_0, ..., v_{ℓ−1}} for any 1 ≤ ℓ ≤ m and W_0 as {0}. When ℓ is a power of 2, W_ℓ is the subfield of size 2^ℓ. In [Can89], Cantor defined a series of polynomials that are closely related to the v_i's and W_ℓ's:

    s_0(x) = x,   s_i(x) = s_{i−1}(x)^2 + s_{i−1}(x) for i ≥ 1.

In fact, each s_i is an F_2-linear map, the kernel of s_i is W_i, and s_i(W_{i+j}) = W_j.

Divide and Conquer
Given f(x) = f_0 + f_1 x + · · · + f_{n−1} x^{n−1} of degree smaller than n, where n = 2^ℓ, v ∈ F_{2^m}, and ℓ ≤ m, the AFFT computes S = {f(α) | α ∈ v + W_ℓ}. When ℓ = 0, the algorithm simply returns f(x) itself. When ℓ > 0, the algorithm first writes f(x) as

    f(x) = f^(0)(s_1(x)) + x · f^(1)(s_1(x)).

This is called "radix conversion" in [BC14]: the radix is changed from x to s_1(x) = x^2 + x. Note that the computation of f^(0) and f^(1) from f requires only field additions, which are much cheaper than field multiplications, and each f^(i) is of degree smaller than 2^{ℓ−1}. If we have

    S^(0) = {f^(0)(β) | β ∈ s_1(v) + W_{ℓ−1}},   S^(1) = {f^(1)(β) | β ∈ s_1(v) + W_{ℓ−1}},

then S can be computed with a few additional field operations. Observe that s_1(v + W_ℓ) = s_1(v) + W_{ℓ−1}, which is of size 2^{ℓ−1}. Gao and Mateer make use of this observation to compute S^(0) and S^(1) by recursively evaluating f^(0) and f^(1) at all elements of s_1(v) + W_{ℓ−1}. In addition, Gao and Mateer suggest to compute S by computing

    f(α) = f^(0)(s_1(α)) + α · f^(1)(s_1(α))

for each α ∈ v + W_ℓ.

[Figure 4: The three stages of butterflies in a size-8 AFFT. Each arrow going upwards is associated with a constant which is not shown, so it actually means "adding the product of the source and the constant to the destination" instead of "adding the source to the destination". This is one "butterfly" in an AFFT.]
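The radix conversion can be sketched in Python on bit-packed polynomials (bit i = coefficient of x^i). This is our own sketch of the Taylor-expansion-at-(x^2+x) approach, not the paper's Cortex-M4 code; the helper names are ours.

```python
# Sketch: radix conversion, i.e., splitting f (degree < n = 2^l, n a power of 2)
# into f0, f1 with f(x) = f0(x^2+x) + x*f1(x^2+x), using only XORs.

def taylor(f, n):
    """Taylor expansion of f (degree < n) at x^2+x: returns the degree-<2
    pieces t_i such that f = sum_i t_i * (x^2+x)^i."""
    if n <= 2:
        return [f & 3]
    q = n // 4
    A = f & ((1 << q) - 1)           # split f into four q-bit quarters
    B = (f >> q) & ((1 << q) - 1)
    C = (f >> (2 * q)) & ((1 << q) - 1)
    D = f >> (3 * q)
    c2 = C ^ D                       # f = g0 + (x^2+x)^q * g1, where
    b2 = B ^ c2                      # g0 = A + (B+C+D) x^q and g1 = (C+D) + D x^q
    g0 = A ^ (b2 << q)
    g1 = c2 ^ (D << q)
    return taylor(g0, n // 2) + taylor(g1, n // 2)

def radix_convert(f, n):
    """Return f0, f1 with f(x) = f0(x^2+x) + x*f1(x^2+x)."""
    t = taylor(f, n)
    f0 = f1 = 0
    for i, ti in enumerate(t):       # t_i = u_i + v_i*x; gather the u's and v's
        f0 |= (ti & 1) << i
        f1 |= (ti >> 1) << i
    return f0, f1
```

Only XORs and shifts appear, matching the remark above that radix conversion needs no field multiplications.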
To have a clearer picture of the pattern of computation in an AFFT, one can "unroll" the recursion. In this way, it can be seen that an AFFT always consists of two phases. The first phase consists of the field additions from the radix conversions in all levels of recursion. The second phase consists of several stages of butterflies, where the last stage consists of the butterflies in level 0 (i.e., the top level) of the recursion, the second-to-last stage consists of the butterflies in level 1 of the recursion, and so on. Figure 4 shows the three stages in the second phase of a size-8 (n = 8) AFFT.

Multiplication in F_2[x] with FAFFT
The AFFT allows us to perform multiplications in F_2[x]: given polynomials f, g ∈ F_2[x] such that fg is of degree smaller than n = 2^ℓ, we can lift f and g to F_{2^m}[x] such that m is a power of 2 and ℓ ≤ m, perform an AFFT for each of f and g to evaluate them at W_ℓ, perform the pointwise multiplication, and perform the inverse AFFT to obtain fg. Such an AFFT-based polynomial multiplication always works, but one can reduce the cost of the AFFT by carefully choosing the evaluation points for f and g.

Reducing the Number of Evaluation Points
In [LCK+18], Li et al. suggested to pick the set of evaluation points to be Σ = v_{ℓ+m/2} + W_ℓ, where n′ = 2^ℓ = n/m. As explained by Li et al., it is the property that Σ can reach n elements in F_{2^m} via the Frobenius map which guarantees that the function mapping f to f(Σ) is invertible, which means we can use the AFFT and the inverse AFFT to carry out multiplications in F_2[x]. This is why Li et al. call such an AFFT a Frobenius AFFT (FAFFT). With the FAFFT, the number of evaluation points is reduced by a factor of m. The way Li et al. evaluate f at Σ is to perform a size-n AFFT to evaluate f at a superset Σ̃ of Σ with |Σ̃| = n, but skip all the redundant operations; in other words, the approach of Li et al. is to carry out a truncated AFFT. The truncated AFFT consists of three phases. The first phase is the usual radix-conversion phase of a size-n AFFT. The second phase consists of the first log_2 m levels of butterflies in the truncated AFFT. This phase is called encoding, and the operation is defined below. The third phase consists of the remaining stages of butterflies in the truncated AFFT; the last stage is identical to the butterfly stage in a usual size-n′ AFFT for Σ.
As each of the three phases is invertible, one can carry out the inverse FAFFT by performing the inverses of the three phases in the reverse order. The FAFFT-based polynomial multiplication algorithm is shown in Algorithm 2.
Encoding Encoding performs the first log_2 m levels of butterflies in the truncated AFFT. With respect to Σ, it is an invertible linear map which maps the radix-converted coefficients of f(x), i.e., (f_0, f_1, ..., f_{n−1}) ∈ F_2^n, to (f̂_0, ..., f̂_{n′−1}) ∈ F_{2^m}^{n′}, where each f̂_i is a linear combination of f_i, f_{n′+i}, ..., f_{(m−1)n′+i} whose coefficients are determined by the binary representation (j_0, j_1, ..., j_{log_2 m−1}) ∈ {0, 1}^{log_2 m} of the index j, i.e., j = Σ_k j_k · 2^k.

Implementing the FAFFT-based Polynomial Multiplication
Below we show how we implemented the FAFFT-based multiplication for the M4.

Parameter Selection
The following table shows our parameters for Algorithm 2; we pick the parameters to minimize m satisfying Eq. (2).

             r       n      m     n′    ℓ    Σ
  bikel1   12323   32768   32   1024   10   v_26 + W_10
  bikel3   24659   65536   32   2048   11   v_27 + W_11
The field F_{2^32} is constructed as a tower of extension fields of F_2 as in [BC14, LCK+18]:

    F_{2^2}  = F_2[x_1]/(x_1^2 + x_1 + 1),
    F_{2^4}  = F_{2^2}[x_2]/(x_2^2 + x_2 + x_1),
    F_{2^8}  = F_{2^4}[x_4]/(x_4^2 + x_4 + x_2 x_1),
    F_{2^16} = F_{2^8}[x_8]/(x_8^2 + x_8 + x_4 x_2 x_1),
    F_{2^32} = F_{2^16}[x_16]/(x_16^2 + x_16 + x_8 x_4 x_2 x_1).

For completeness, we note that our Cantor basis is defined in terms of this tower.

The Pointwise Multiplication The pointwise multiplication consists of 1024 or 2048 independent multiplications in F_{2^32}. To carry out these multiplications, we follow [BC14, Cho17] to represent field elements in a bitsliced format. As the M4 is a 32-bit platform, bitslicing means that we always perform 32 multiplications in parallel.
To multiply a_0 + a_1 x_16 ∈ F_{2^32} and b_0 + b_1 x_16 ∈ F_{2^32} (in parallel with 31 other multiplications), we use the Karatsuba algorithm to reduce the task into the 3 multiplications a_0 b_0, a_1 b_1, and (a_0 + a_1)(b_0 + b_1) in F_{2^16}, and we apply Karatsuba recursively as in [BC14]. We note that the amount of data for a multiplication in F_{2^4} fits into the general-purpose registers, so there is no register spilling in this case. For larger fields, we use floating-point registers as a pool for register spilling.
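At the bottom of the tower, a bitsliced step looks as follows. This is a Python model of the idea, not the M4 assembly: 32 lanes of F_4 = F_2[u]/(u^2 + u + 1) (u playing the role of x_1) are multiplied with bitwise operations, using the Karatsuba trick to spend only 3 ANDs.

```python
# Sketch: 32 parallel multiplications in F_4 = F_2[u]/(u^2+u+1) with bitwise ops.
# Lane i of the 32-bit words a0/a1 holds the element a0_i + a1_i*u.

MASK = 0xFFFFFFFF

def f4_mul_bitsliced(a0, a1, b0, b1):
    """32 parallel F_4 multiplications; only 3 ANDs via the Karatsuba trick."""
    lo  = a0 & b0                    # a0*b0
    hi  = a1 & b1                    # a1*b1
    mid = (a0 ^ a1) & (b0 ^ b1)      # (a0+a1)(b0+b1) = a0b0 + a0b1 + a1b0 + a1b1
    # reduce with u^2 = u + 1: result = (a0b0 + a1b1) + (a0b1 + a1b0 + a1b1)*u
    c0 = (lo ^ hi) & MASK
    c1 = (mid ^ lo) & MASK
    return c0, c1
```

The same pattern, with the reduction adjusted to each defining polynomial, repeats at every level of the tower.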
The Butterflies Section 7.1 shows that, at level i of the recursion (level 0 represents the top level), the multiplications in the butterflies are always of the form β · f^(1)(s_1(β)), where β = v + w with v = v_{ℓ−i+m/2} and w ∈ W_{ℓ−i}. Observe that w is always in a proper subfield of size 2^{2^⌈log_2(ℓ−i)⌉}. Due to the tower field structure, we carry out the multiplication by w as 32/2^⌈log_2(ℓ−i)⌉ multiplications in the subfield. To carry out the multiplication by v, we consider the operation as an invertible linear map and use the circuit generator introduced in the ePrint version of [BC14] to generate a sequence of XORs for the multiplication. We note that m = 32 is actually too large for the circuit generator to handle: it usually stops making progress after several tens of iterations. When the circuit generator stops making progress, we generate another sequence of XORs reaching the target state in a naive way. Then, we combine the 2 sequences of XORs to complete the whole invertible linear operation. We considered Paar's circuit generator [Paa97], but it is hard to fit everything into the available registers since it does not generate in-place code. We also tried Bernstein's circuit generator [Ber09b], but the results (in terms of numbers of XORs) are worse than the ones from [BC14].

Figure 4 shows that the stride of the butterflies decreases by a factor of 2 in each stage. For the first few stages, the strides are always greater than 32, so 32 butterflies can be carried out in parallel with bitslicing. However, in the last 5 stages, there are stride-16, 8, 4, 2, 1 butterflies, which make it hard to carry out 32 butterflies in parallel. The same issue appeared in [Cho17] regarding Beneš networks, as Beneš networks have a structure similar to the butterflies. The author of [Cho17] solves the issue by carrying out matrix transpositions on the data. We make use of the same idea: one can imagine that our implementation performs a matrix transposition on the data right before the last 5 stages. The matrix transposition turns stride-16, 8, 4, 2, 1 butterflies into butterflies with strides greater than or equal to 32, so that we are able to carry out 32 butterflies in parallel. At the end of the forward FAFFTs, the order of the field elements is changed because of the matrix transposition, but this does not affect the pointwise multiplication since the multiplication can be carried out in any order. Similarly, in the inverse FAFFT, one can imagine that our implementation performs a matrix transposition on the data right after the first 5 stages.

We show the costs of various multiplications in Table 3.
We note that instead of carrying out whole matrix transpositions, we actually carry out only a part of the matrix transposition in each of the 5 stages. The part of the matrix transposition we carry out in a stage is just enough to make it possible to carry out 32 butterflies in parallel in that stage. This is a different way of implementing the idea in [Cho17].
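For reference, a full 32×32 bit-matrix transposition can be sketched with the classic word-level algorithm below. This is our own illustrative sketch, not the paper's staged partial transpositions: each pass swaps sub-blocks with shift/mask/XOR steps, which is also the building block of the per-stage partial transpositions described above.

```python
# Sketch: transposing a 32x32 bit matrix with shift/mask/XOR passes, the
# building block for regrouping stride-16/8/4/2/1 butterflies into 32 lanes.
# Row r is a 32-bit integer; bit 31-c of row r is matrix entry (r, c).

def transpose32(rows):
    a = list(rows)
    j, m = 16, 0x0000FFFF
    while j:
        k = 0
        while k < 32:                # visit every k whose bit j is clear
            t = (a[k] ^ (a[k + j] >> j)) & m
            a[k] ^= t                # swap the off-diagonal j x j sub-blocks
            a[k + j] = (a[k + j] ^ (t << j)) & 0xFFFFFFFF
            k = (k + j + 1) & ~j
        j >>= 1
        m = (m ^ (m << j)) & 0xFFFFFFFF
    return a
```

Five passes (j = 16, 8, 4, 2, 1) suffice, one per problematic stride.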
Encoding Consider the input f as a matrix M ∈ F_2^{32×n′}, where M_{j,i} = f_{j·n′+i}. One way to understand the encoding is to consider it as multiplying M by a 32 × 32 invertible matrix (from the left), where the invertible matrix is defined by v_{m/2}, ..., v_{m/2−4}. As multiplying by the invertible matrix is a linear map, we again make use of the circuit generator in [Cho17] to optimize it. We also use the circuit generator to optimize the inverse of the encoding, which is used in the inverse FAFFT.

Radix Conversions
Our implementation follows the algorithm presented in [LANH16], which seems to be implicitly shown in [GM10, Section IV]. As the implementation is quite straightforward, we do not describe it in detail.

Experiment Results and Discussions
We chose the STM32F407 Discovery board as the platform for benchmarking. It is a low-cost development board by STMicroelectronics featuring the STM32F407VG microcontroller, which can be clocked at up to 168 MHz, and it has 192 KB of SRAM and 1 MB of flash memory. All implementations for the M4 are benchmarked in the pqm4 [KRSS19, KRSS] benchmarking framework on the development board with the compiler arm-none-eabi-gcc-10.2.1. Note that pqm4 always sets the frequency to 24 MHz so that the CPU can be set to have zero wait states when fetching instructions from the flash memory.
All the Haswell cycle counts in this section, unless specified otherwise, are measured on one core of an Intel Xeon E3-1275 v3 CPU, with Turbo Boost and hyper-threading disabled.

Multiplications in R_z
The following table shows the cycle counts for carrying out one circular shift in our Haswell and M4 implementations with the techniques in Sections 3 and 4.

           Haswell      M4
  bikel1      1112   24856
  bikel3      2547   52722

The following table shows the numbers of cycles on the Haswell core to carry out a multiplication Tr(h_i) · Lift(s) in R_z. Our code uses the techniques in Sections 3 and 5.

           our code     avx2
  bikel1     113822   202176
  bikel3     369782   629952

The way the avx2 implementation carries out circular shifts is similar to the approach in Section 4, because it also makes use of conditional moves. The conditional moves are realized by using the intrinsic _mm256_blendv_epi8.
The following table shows the numbers of cycles on the Cortex-M4 to carry out a multiplication Tr(h_i) · Lift(s) in R_z. Our code uses the techniques in Sections 4 and 5. We actually tried to use the techniques in Section 3 for our M4 implementation but found that the approach in Section 4 is still faster. We believe this is due to the fact that the size of the YMM registers on Haswell is much larger than the size of the general-purpose registers on the Cortex-M4, and that there are powerful intrinsics/instructions (as shown in [GAB19]) that can be used to manipulate YMM registers, so that even a shift of s^(0) ∈ {0, ..., 255} bits can be carried out efficiently.

Multiplications in R
The following table shows the Haswell cycles for a multiplication between two elements in R. Our code uses the techniques in Section 6. We actually tried to implement the FAFFT-based multiplication described in Section 7: instead of bitslicing, our FAFFT-based implementation uses pclmulqdq for field multiplications. However, as shown in the last column of the table, this leads to a much slower implementation.

The following table shows the M4 cycles for a multiplication between two elements in R with respect to the methods in Section 7 and from the BIKE team. We can further reduce the cost of the FAFFT multiplication by reusing intermediate data when multiplying by the same elements. As shown in Algorithm 1, there are many multiplications in R of the form e_i h_i in the decoder. Since h_0 and h_1 do not change during decapsulation, our implementation computes two forward FAFFTs, one for h_0 and one for h_1, at the beginning of decapsulation. Then, each time we need to compute e_i h_i, one forward FAFFT can be saved. In this way, about 30% of the 1320940 cycles and 2929293 cycles can be saved for each e_i h_i. It is mentioned in [Cho16, Section 5.1] that one can use circular shifts to carry out each multiplication e_i h_i: write h_i as Σ_s x^{−s} and compute e_i h_i as Σ_s (x^{−s} e_i). The numbers in the first table of Section 8.1 show that the resulting cycle counts are expected to be much larger.
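The circular-shift method attributed to [Cho16] can be sketched in Python on bit-packed polynomials (bit i = coefficient of x^i). This is a minimal illustration with hypothetical small values of r and of the support of h_i; the function names are ours, and each rotation stands for one of the circular shifts benchmarked in Section 8.1.

```python
# Sketch: multiply e by a sparse h = sum_{s in S} x^s in R = F_2[x]/(x^r - 1)
# using one circular shift per element of the support S.

def rotl(e, s, r):
    """e * x^s mod x^r - 1: a circular left shift of the r-bit value e by s."""
    s %= r
    return ((e << s) | (e >> (r - s))) & ((1 << r) - 1)

def sparse_mul(e, support, r):
    """e * h in R, where h has 1-coefficients exactly at the positions in support."""
    acc = 0
    for s in support:
        acc ^= rotl(e, s, r)         # accumulate x^s * e
    return acc
```

With h_i of weight w/2, this costs w/2 circular shifts per product, which is why the per-shift cycle counts above make this approach unattractive on the M4.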

Key Generation, Encapsulation, and Decapsulation
The Haswell cycles for key generation, encapsulation, and decapsulation are shown in the following table. The cycle counts of the avx2 implementation are from [Be], and they are measured on a machine called titan0 with a Haswell CPU. [Table: key gen., encap., and decap. cycle counts.]

Table 1 :
Optimization techniques and where they are used: • means that the technique is used for the corresponding operation on the corresponding platform, and ◦ means that the technique is not used.

Table 2 :
The numbers of gates required to compute the Hamming weight for the three methods.

Table 3 :
Cost of multiplying an element in F_{2^32} by various field elements. Note that the cycle counts are for carrying out 32 such multiplications. For the environment of benchmarking, see Section 8.

Recall that each constant β in the level-i butterflies lies in v_{ℓ−i+m/2} + W_{ℓ−i}. Following [BC14, Cho17], we pre-compute all β's and store them (in the bitsliced format) in a static table. The multiplications can of course be carried out by our function for 32 multiplications in F_{2^32}. However, our implementation does better by exploiting the fact that β = v + w, where v = v_{ℓ−i+m/2} and w ∈ W_{ℓ−i}. To carry out a multiplication between β and γ, our implementation multiplies γ by v, multiplies γ by w, and adds the two results to obtain β · γ.
Table 3 evaluates our strategy of performing multiplications by β = v + w. As shown in the table, carrying out 32 multiplications by v takes at most 881 cycles, and carrying out 32 multiplications by 32 w's takes at most 1652 cycles. Carrying out 32 multiplications in F_{2^32}, however, takes 2849 > 881 + 1652 cycles.