Faster Constant-Time Decoder for MDPC Codes and Applications to BIKE KEM

BIKE is a code-based key encapsulation mechanism (KEM) that was recently selected as an alternate candidate by NIST's standardization process on post-quantum cryptography. This KEM is based on the Niederreiter scheme instantiated with QC-MDPC codes, and it uses the BGF decoder for key decapsulation. We discovered important limitations of BGF that we describe in detail, and then we propose a new decoding algorithm for QC-MDPC codes called PickyFix. Our decoder uses two auxiliary iterations that are significantly different from previous approaches and we show how they can be implemented efficiently. We analyze our decoder with respect to both its error correction capacity and its performance in practice. When compared to BGF, our constant-time implementation of PickyFix achieves speedups of 1.18, 1.29, and 1.47 for the security levels 128, 192 and 256, respectively.


Introduction
BIKE [ABB+21] is a code-based key encapsulation mechanism (KEM) selected as an alternate candidate for the NIST post-quantum standardization process. The scheme consists of a variant of the Niederreiter [Nie86] scheme using quasi-cyclic moderate-density parity-check (QC-MDPC) codes instead of Goppa codes. As such, BIKE can be seen as a refinement of the QC-MDPC McEliece scheme by Misoczki et al. [MTSB13].
The use of QC-MDPC [MTSB13] codes yields two advantages. The first one is that the public key is much smaller, since one needs only one row to represent a quasi-cyclic matrix in systematic form. The second is that matrix multiplication, and thus encoding, is much faster for quasi-cyclic matrices. However, QC-MDPC codes come with an important disadvantage: their decoding algorithms have a non-zero probability of failure. This fact was exploited in the famous GJS [GJS16] key-recovery reaction attack, which provided the ground for side-channel attacks against QC-MDPC [RHHM17] and further attacks against other code-based encryption schemes [SSPB19, FHS+17].
To deal with this problem, BIKE's original proposal [ABB+17] used ephemeral keys. However, recent approaches to obtaining a negligible decryption failure rate (DFR) [Til18,SV20a,Vas21], together with the CCA security conversions by Hofheinz et al. [HHK17] that account for decryption errors, motivated BIKE proponents to consider key reuse. In particular, Sendrier and Vasseur [SV20a,Vas21] propose a framework that, under reasonable assumptions, allows them to find parameters where the DFR should be negligible using experiments and statistical analysis. This framework was used in BIKE's last revision [ABB+21], which uses the state-of-the-art BGF decoder [DGK20c,DGK19] with parameters that supposedly achieve negligible DFR.
While trying to improve BGF's performance, we noticed two limitations. The first one is that its performance cannot be improved by considering a lower number of iterations,

Background
A binary $[n, k]$-linear code is a $k$-dimensional linear subspace of $\mathbb{F}_2^n$, where $\mathbb{F}_2$ denotes the binary field. If $C$ is a binary $[n, k]$-linear code spanned by the rows of a matrix $G \in \mathbb{F}_2^{k \times n}$, we say that $G$ is a generator matrix of $C$. Similarly, if $C$ is the kernel of a matrix $H \in \mathbb{F}_2^{r \times n}$, we say that $H$ is a parity-check matrix of $C$. The Hamming weight of a vector $v$, denoted by $|v|$, is the number of its non-zero entries. The syndrome $z$ of a vector $e$ with respect to a parity-check matrix $H$ is the vector $z = eH^\top$. If the vector $e$ is sufficiently sparse and the linear code defined by $H$ is sufficiently good, it may be possible to recover $e$ from the syndrome $z$ by using efficient decoding algorithms. The support of a binary vector $v$, denoted as $\mathrm{supp}(v)$, is the set $\mathrm{supp}(v) = \{i : v_i = 1\}$.
A moderate-density parity-check (MDPC) code [MTSB13] is a linear code that admits a moderately sparse parity-check matrix $H \in \mathbb{F}_2^{r \times n}$. All columns of $H$ have the same fixed weight $d$, and we require that $d = O(\sqrt{n})$. For applications in cryptography, it is particularly useful to consider quasi-cyclic MDPC (QC-MDPC) codes, because they allow for smaller keys and more efficient operations. BIKE [ABB+21] is defined over QC-MDPC codes with two circulant blocks, which are MDPC codes that admit a sparse parity-check matrix of the form $H = [H_0 | H_1]$, where each of the $r \times r$ binary matrices $H_0$ and $H_1$ is circulant.
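The remark that circulant blocks allow smaller keys and faster operations can be illustrated with a minimal Python sketch. It assumes the convention that a circulant matrix is generated by cyclic shifts of its first row, and all function names are hypothetical. Under that convention, a vector-matrix product over $\mathbb{F}_2$ equals a cyclic convolution with the first row, so only one row needs to be stored.

```python
def circulant(first_row):
    """Build the r x r circulant matrix M with M[i][j] = first_row[(j - i) mod r]."""
    r = len(first_row)
    return [[first_row[(j - i) % r] for j in range(r)] for i in range(r)]


def vec_mat_f2(v, M):
    """Row vector times matrix over F_2."""
    return [sum(v[i] & M[i][j] for i in range(len(v))) % 2
            for j in range(len(M[0]))]


def cyclic_mul_f2(a, b):
    """Cyclic convolution over F_2, i.e. the product a(X) b(X) mod (X^r - 1)."""
    r = len(a)
    out = [0] * r
    for i in range(r):
        if a[i]:
            for j in range(r):
                if b[j]:
                    out[(i + j) % r] ^= 1
    return out


# Toy example with r = 7 (far smaller than any cryptographic block size).
first_row = [1, 1, 0, 1, 0, 0, 0]
v = [0, 1, 0, 0, 1, 0, 1]
via_matrix = vec_mat_f2(v, circulant(first_row))
via_convolution = cyclic_mul_f2(v, first_row)
```

Both computations give the same vector, which is why implementations can work with length-$r$ vectors instead of full $r \times r$ matrices.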
MDPC codes admit very efficient decoders, which are called bit-flipping decoders [Gal62]. All variants of bit-flipping decoders work based on the following observations. Let $e$ be a sparse vector whose syndrome with respect to the sparse matrix $H$ is $z = eH^\top$. Suppose we do not know $e$ but want to recover it from $z$ using our knowledge of $H$. We know that $z = \sum_{i \in \mathrm{supp}(e)} H_i$, where $H_i$ denotes the transpose of the $i$-th column of $H$. Now, since $e$ and each column $H_i$ are sparse, we can estimate the likelihood that $e_i = 1$ by checking how closely $z$ matches column $i$ of $H$: the more similar they are, the higher the probability that $e_i = 1$. The similarity measure for each column $i$ is what is known as the unsatisfied parity-check (UPC) counter, denoted as $\mathrm{upc}_i$, and it is equal to the size of the intersection of $\mathrm{supp}(z)$ and $\mathrm{supp}(H_i)$. The name UPC comes from the fact that the set $\mathrm{supp}(z)$ is sometimes called the set of unsatisfied equations, and therefore $\mathrm{upc}_i$ counts the number of unsatisfied equations that are caught by $H_i$.
Algorithm 1: General bit-flipping decoding algorithm. (The algorithm returns ⊥ to indicate that the maximum number of iterations was reached.)
Algorithm 1 shows the steps that a general bit-flipping algorithm performs when trying to obtain $e$ from $z$ and $H$. The algorithm stops when it finds a vector $\hat{e}$ with the same syndrome as $e$, or if the number of iterations exceeds some limit. Notice that the partial syndrome defines the objective syndrome in each iteration, and in the ideal case, vector $\hat{e}$ gets closer and closer to $e$ after each iteration. Although most bit-flipping algorithms used in cryptography [Gal62, MTSB13, DGK19, SV19] can be framed in the general description above, they can vary significantly with respect to how the threshold for flipping bits is selected in each iteration.
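The general procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names, the toy code with $r = 7$, and the threshold rule (flip the bits whose UPC equals the maximum UPC) are hypothetical choices for the example, not the rule used by any decoder discussed in this paper.

```python
def circulant_block(support, r):
    """r x r binary circulant block whose column j has support {(i + j) mod r}."""
    return [[1 if (row - col) % r in support else 0 for col in range(r)]
            for row in range(r)]


def syndrome(H, e):
    """Syndrome of e: the XOR of the columns of H indexed by supp(e)."""
    return [sum(h & x for h, x in zip(row, e)) % 2 for row in H]


def upc_counters(H, s):
    """upc_j = size of the intersection of supp(s) and the support of column j."""
    return [sum(H[i][j] for i in range(len(H)) if s[i])
            for j in range(len(H[0]))]


def bit_flip_decode(H, z, threshold, max_iters=10):
    """Generic bit-flipping loop; returns None (standing for ⊥) on failure."""
    e_hat = [0] * len(H[0])
    for _ in range(max_iters):
        # Partial syndrome: what is still unexplained by the current guess.
        s = [a ^ b for a, b in zip(z, syndrome(H, e_hat))]
        if not any(s):
            return e_hat
        upc = upc_counters(H, s)
        tau = threshold(s, upc)
        e_hat = [bit ^ int(u >= tau) for bit, u in zip(e_hat, upc)]
    return e_hat if syndrome(H, e_hat) == z else None


# Toy parity-check matrix H = [H0 | H1] with r = 7 and column weight d = 3.
r = 7
H0 = circulant_block({0, 1, 3}, r)
H1 = circulant_block({0, 2, 3}, r)
H = [row0 + row1 for row0, row1 in zip(H0, H1)]

e = [0] * (2 * r)
e[9] = 1                      # a single error
z = syndrome(H, e)
decoded = bit_flip_decode(H, z, threshold=lambda s, upc: max(upc))
```

In this toy instance all columns are distinct and have the same weight, so only the erroneous position reaches the maximum UPC and the error is recovered in one iteration. Real decoders differ precisely in how `threshold` is chosen.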

BIKE
The purpose of a key encapsulation mechanism is to use public-key encryption algorithms to securely exchange a key between two parties. These parties can then use secret-key algorithms, which are much more efficient, to exchange large messages.
For a clearer presentation, we describe BIKE algorithms without the implicit-rejection Fujisaki-Okamoto transformation [HHK17], usually denoted by FO ⊥ . However, notice that when discussing the experimental performance of our algorithm in Section 7.3, we consider the full decapsulation with the FO ⊥ transformation applied.

Parameters and Algorithms
Setup. On input $1^\lambda$, where $\lambda$ is the security level, the setup algorithm returns parameters $r$, $w$ and $t$ taken from the parameter Table 1. Parameters $r$ and $w$ define the family of QC-MDPC codes to be used, while $t$ controls the weight of the error used for encryption, as will be detailed in the following sections. The table also provides the estimated decryption failure rates (DFR) for each parameter set according to Vasseur's framework [Vas21].
[...] are both parity checks of the same quasi-cyclic linear code. However, the sparsity of the first one allows for efficient syndrome decoding using bit-flipping algorithms.
Encapsulation. Select two random binary vectors $e_0$ and $e_1$ such that $|e_0| + |e_1| = t$. Then the key to be shared is $k_{\mathrm{Shared}} = \mathcal{H}([e_0 | e_1])$, for some cryptographic hash function $\mathcal{H}$. To encapsulate the key $k_{\mathrm{Shared}}$, compute the ciphertext $c = e_0 + e_1 H_{\mathrm{Pub}} \in \mathbb{F}_2^r$. Notice that ciphertext $c$ then corresponds to the syndrome of the low-weight vector $[e_0 | e_1]$ with respect to the public parity-check matrix $[I | H_{\mathrm{Pub}}]$.
Decapsulation. Given the ciphertext $c$, the receiver, who knows the sparse parity-check matrix $H$, first computes the secret syndrome $z = cH_0$. Notice that $z$ is the syndrome of the sparse vector $[e_0 | e_1]$ with respect to the secret matrix $H$. Therefore, as mentioned at the end of Section 2, the receiver can use some QC-MDPC bit-flipping decoding algorithm, together with their knowledge of the secret matrix $H$, to recover the sparse vector $[e_0 | e_1]$ and compute the shared key $k_{\mathrm{Shared}} = \mathcal{H}([e_0 | e_1])$.
In the last revision of BIKE [ABB+21], the authors recommend the BGF decoding algorithm [DGK20c], which is the state-of-the-art QC-MDPC decoder. Before introducing BGF, let us first discuss the security of BIKE and, in particular, why good decoders are very important to ensure BIKE's security.

Security and Negligible Decryption Failure Rate
The security of the scheme is based on three hypotheses. The first two are standard conjectures for quasi-cyclic codes, namely the hardness of the syndrome decoding problem and the hardness of finding codewords of a fixed low weight. This ensures that one can neither recover the secret sparse matrices $H_0$ and $H_1$ from $H$, nor the secret message $[e_0 | e_1]$ from ciphertext $c$. The third hypothesis is that the decryption failure rate (DFR) is negligible with respect to the security parameter. Although we cannot prove the third hypothesis, Vasseur [Vas21] proposed a framework that, under weaker hypotheses, allows one to gain confidence that some decoders achieve negligible DFR for selected parameter sets.
It has been shown [TS16,Sen11] that parameters $t$ and $w$ are the most important when determining the security level, since they control the weight of the sparse vectors. Intuitively, if $w$ or $t$ is too small, it is easy to find $h_0$ or the partial encryption error $e_0$ by enumerating low-weight vectors. But they must not be too large with respect to $r$, otherwise the probability of failing to decrypt a ciphertext would be too high. Therefore, to define parameters $(t, w, r)$, one typically fixes $(t, w)$ sufficiently large to achieve high security levels, and then defines $r$ such that the decryption failure rate is low enough for the desired application.
In 2016, Guo et al. [GJS16] showed that decryption failures could lead to a full key recovery attack against schemes based on QC-MDPC codes. To deal with the potential vulnerability faced by schemes within which decryption failures occur, Hofheinz et al. [HHK17] refined the Fujisaki-Okamoto [FO99] transformation, showing that a scheme whose decryption failure rate is below $2^{-\lambda}$ can be transformed into a CCA secure one.
Unlike for algebraic codes, such as Goppa or Reed-Solomon codes, whose decoders are guaranteed to decode all errors in vectors up to a given weight, we cannot yet give strong mathematical guarantees on the error correction capability of decoders for QC-MDPC codes. Recently, Sendrier and Vasseur [SV20a,Vas21] proposed a method that, under reasonable hypotheses, allows one to use simulations and simple statistical analysis to find parameters $(r, t, w)$ such that a QC-MDPC decoder fails with negligible probability with respect to some security parameter $\lambda$.
Let $t$ and $w$ be fixed positive integers and let us consider a hypothetical QC-MDPC decoder $D$. Let $\mathrm{DFR}_D(r)$ denote the decryption failure rate of $D$ when decrypting a ciphertext generated at random with respect to a random QC-MDPC key with parameters $(r, t, w)$. The main observation by Sendrier and Vasseur [SV20a] is that the curve $\log_2(\mathrm{DFR}_D(r))$ is typically concave for practical QC-MDPC decoders and for all values of $r$ such that $\mathrm{DFR}_D(r)$ is high enough so that failures can be observed in simulations. Vasseur's [Vas21] model then makes the following assumption: for a given decoder $D$ and security level $\lambda$, the curve $\log_2(\mathrm{DFR}_D(r))$ is concave in the region where $\mathrm{DFR}_D(r) \geq 2^{-\lambda}$. This assumption is somewhat consistent with Tillich's [Til18] asymptotic theoretical model for MDPC codes, which shows that the dominating term in $\log_2(\mathrm{DFR}_D(r))$ decreases linearly with $r$.
Figure 1 illustrates how Vasseur's [Vas21] model can be used to estimate the block parameter $r$ that allows for a negligible failure rate with respect to the security parameter $\lambda = 128$. First, one performs DFR simulations for increasing values of $r$ until no decoding failure can be observed. Then, one takes the last two points $(r_A, p_A)$ and $(r_B, p_B)$ in the $\log_2 \mathrm{DFR}$ plot for which a number of failures were observed and computes the line passing through them.
According to the extrapolation hypothesis, the decoder fails with negligible probability for $r = r_{\mathrm{ext}}$, the point where the line intercepts $\mathrm{DFR} = 2^{-\lambda}$. Finally, choose parameter $r$ to be the least prime $r \geq r_{\mathrm{ext}}$ such that 2 is primitive modulo $r$. This avoids both squaring attacks [LJS+16] and other potential attacks based on the factorization of the cyclic polynomial ring $\mathbb{F}_2[X]/(X^r - 1)$.
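The final selection step (the least prime $r \geq r_{\mathrm{ext}}$ for which 2 is primitive modulo $r$) can be sketched as below. The function names are hypothetical, and the trial-division primality test is only adequate for the small block sizes involved here.

```python
import math


def is_prime(n):
    """Trial-division primality test (sufficient for block-size-sized integers)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def prime_factors(n):
    """Set of distinct prime factors of n."""
    factors = set()
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors


def two_is_primitive(r):
    """For an odd prime r: 2 generates the multiplicative group modulo r."""
    return all(pow(2, (r - 1) // q, r) != 1 for q in prime_factors(r - 1))


def next_block_size(r_ext):
    """Least prime r >= r_ext such that 2 is primitive modulo r."""
    r = max(3, math.ceil(r_ext))
    while not (is_prime(r) and two_is_primitive(r)):
        r += 1
    return r
```

For example, `next_block_size(14)` skips 17, where 2 has order 8, and returns 19.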
Since there is always some error in the DFR estimates, Vasseur [Vas21] uses confidence intervals for the observed DFR and computes a conservative extrapolation for $r$ as follows. Let $p_A$ and $p_B$ be the DFRs for $r_A$ and $r_B$, respectively, where $r_A < r_B$. Consider $p_A^-$ and $p_B^+$ to be the lower limit for $p_A$ and the upper limit for $p_B$, respectively, according to Binomial confidence intervals for $p_A$ and $p_B$. Then a conservative extrapolation for $r_{\mathrm{ext}}$ is obtained by considering the line passing through $(r_A, p_A^-)$ and $(r_B, p_B^+)$. Vasseur [Vas21] uses the Clopper-Pearson confidence interval together with posterior probabilities to obtain a narrower interval, with confidence level $\alpha = 0.01$. In this work we use the same $\alpha$ with the Clopper-Pearson interval, but we do not use the posterior probabilities. Even though this tends to give slightly more conservative estimates, it is easier to compute.
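As a rough illustration of the extrapolation step, the following Python sketch computes where the line through two points of the $\log_2 \mathrm{DFR}$ plot reaches $2^{-\lambda}$. The function name and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow. The confidence limits $p_A^-$ and $p_B^+$ are taken as given inputs rather than computed with the Clopper-Pearson method.

```python
import math


def extrapolate_r(r_a, p_a, r_b, p_b, security_level):
    """Value of r where the line through (r_a, log2 p_a) and (r_b, log2 p_b)
    reaches log2(DFR) = -security_level."""
    y_a, y_b = math.log2(p_a), math.log2(p_b)
    slope = (y_b - y_a) / (r_b - r_a)
    if slope >= 0:
        raise ValueError("the DFR must decrease between the two points")
    return r_b + (-security_level - y_b) / slope


# Hypothetical observed DFRs at two block sizes.
nominal = extrapolate_r(10000, 2 ** -10, 11000, 2 ** -20, security_level=128)

# Conservative variant: lower limit for p_A and upper limit for p_B
# (hypothetical values standing in for the confidence interval bounds).
conservative = extrapolate_r(10000, 2 ** -10.5, 11000, 2 ** -19.5,
                             security_level=128)
```

Lowering $p_A$ and raising $p_B$ flattens the line, so the conservative extrapolation always yields a larger $r_{\mathrm{ext}}$ than the nominal one.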

BGF: State-of-the-art QC-MDPC Decoder
BGF [DGK20c], which stands for Black-Gray-Flip, is one of the most efficient known decoders for QC-MDPC codes. This decoder is an improvement of the Black-Gray decoder first proposed by Sendrier and Misoczki in a previous version of CAKE [BGG+17], a predecessor of BIKE.
As a decoding algorithm, BGF's goal is, given a syndrome ciphertext $c = e_0 + e_1 H_{\mathrm{Pub}}$, to recover the sparse error vector $e = [e_0 | e_1]$ using the secret sparse matrix $H$. The algorithm first computes the secret syndrome $z = cH_0$, then starts with $e \leftarrow 0$ and performs a sequence of iterations, each of which updates its knowledge on $e$, until either $z = eH^\top$ or the number of iterations exceeds the limit $N_{\mathrm{Iter}}$, in which case a decoding failure occurs. Before introducing BGF, let us first define its auxiliary procedures.
BGF Auxiliary Algorithms. BGF uses two bit-flipping auxiliary procedures: BitFlipIter and BitFlipMaskedIter, which are formally described in Algorithm 2. These procedures are very similar to other iterative decoders, such as Gallager's original bit-flipping algorithm [Gal62].
Both algorithms flip bits of the partial error vector $e$ when their corresponding UPC counters are above some threshold, $\tau_0$ for BitFlipIter and $\tau_1$ for BitFlipMaskedIter. However, they differ in some important points. First, BitFlipIter not only flips the bits, but it also marks the bits in either black or gray, using bit-masks BlackMask and GrayMask. Black bits are the ones that are flipped with a somewhat high confidence ($\mathrm{upc}_j \geq \tau_0$), while gray bits are the ones that were almost selected for flipping ($\tau_0 > \mathrm{upc}_j \geq \tau_0 - \delta$), but did not make it because of a minor difference $\delta$. On the other hand, BitFlipMaskedIter is a simple bit-flip iteration based on the UPC value, but it only flips bits that are marked 1 in a given mask Mask.
The BGF algorithm. BGF is defined as Algorithm 3. Intuitively, the first call to BitFlipIter flips the bits for which it has a high confidence that they are wrong, by using a selective threshold function Thresh. Then come the two regret steps: first the black and then the gray. In the black regret, all the 1 bits added in the previous step that have a UPC strictly greater than $(d + 1)/2$ will be flipped back to 0. The gray regret step is analogous, but now over the bits marked in GrayMask, which are called gray bits. These consist of 0 bits that were not flipped in the first step because their UPCs were smaller than, but somewhat close to, the selected threshold.
After the first and most costly iteration ensured a good start, hopefully with only a small number of errors left to be corrected, BGF continues with N Iter − 1 iterations of BitFlipIter that will try to correct the remaining errors. Notice that the masks are not needed after this point, and thus, are ignored.
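The structure of BGF described above can be sketched in Python as follows. This is a non-constant-time illustration written from the prose only: the helper names are hypothetical, `thresh` is a caller-supplied stand-in for the function Thresh, and the regret threshold is taken as $(d + 1)/2 + 1$ so that a UPC strictly greater than $(d + 1)/2$ triggers a flip. The toy parameters at the end are far from cryptographic sizes.

```python
def syndrome(H, e):
    """Syndrome of e: the XOR of the columns of H indexed by supp(e)."""
    return [sum(h & x for h, x in zip(row, e)) % 2 for row in H]


def upc_from(H, z, e):
    """UPC counters computed from the partial syndrome of the current guess e."""
    s = [a ^ b for a, b in zip(z, syndrome(H, e))]
    return [sum(H[i][j] for i in range(len(H)) if s[i])
            for j in range(len(H[0]))]


def bit_flip_iter(H, z, e, tau0, delta):
    """Flip bits with upc >= tau0 and report the black and gray masks."""
    upc = upc_from(H, z, e)
    black = [int(u >= tau0) for u in upc]
    gray = [int(tau0 - delta <= u < tau0) for u in upc]
    return [bit ^ b for bit, b in zip(e, black)], black, gray


def bit_flip_masked_iter(H, z, e, mask, tau1):
    """Flip only masked bits whose upc is at least tau1."""
    upc = upc_from(H, z, e)
    return [bit ^ (m & int(u >= tau1)) for bit, m, u in zip(e, mask, upc)]


def bgf(H, z, d, thresh, delta, n_iter):
    """Black-Gray-Flip: one black-gray iteration, then plain BitFlipIter calls."""
    e = [0] * len(H[0])
    tau1 = (d + 1) // 2 + 1          # "strictly greater than (d + 1)/2"
    for iteration in range(n_iter):
        partial = [a ^ b for a, b in zip(z, syndrome(H, e))]
        tau0 = thresh(sum(partial))  # threshold depends on the syndrome weight
        e, black, gray = bit_flip_iter(H, z, e, tau0, delta)
        if iteration == 0:
            e = bit_flip_masked_iter(H, z, e, black, tau1)  # black regret
            e = bit_flip_masked_iter(H, z, e, gray, tau1)   # gray regret
    return e if syndrome(H, e) == z else None


# Toy instance: two 7 x 7 circulant blocks with column weight d = 3.
r = 7
def block(support):
    return [[int((row - col) % r in support) for col in range(r)]
            for row in range(r)]
H = [a + b for a, b in zip(block({0, 1, 3}), block({0, 2, 3}))]
e_true = [0] * (2 * r)
e_true[9] = 1
z = syndrome(H, e_true)
decoded = bgf(H, z, d=3, thresh=lambda weight: 3, delta=1, n_iter=2)
```

The masks are produced in every call to `bit_flip_iter` but only consumed in the first iteration, matching the remark that they are ignored afterwards.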
BGF Parameters. Table 2 shows the parameters $\delta$, $N_{\mathrm{Iter}}$ and threshold function Thresh proposed for the different security levels, together with their performance under our platform. We considered the constant-time implementation provided in BIKE Additional Implementation [DGK20a].
Algorithm 2: Auxiliary iterations used by BGF.
Notice how $\delta$ and $N_{\mathrm{Iter}}$ are the same in all security levels. The threshold function is an increasing linear function of the syndrome weight, bounded below by the minimum value $(d + 1)/2$. Since the threshold function is used to determine when to flip a bit, this means that when the weight of the syndrome $s$ is large, fewer bits will be flipped.
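The shape of such a threshold function can be sketched as below. The slope and intercept used here are hypothetical placeholders, not the coefficients of Table 2, and the value $d = 71$ is chosen only so that the floor $(d + 1)/2$ equals 36.

```python
import math


def make_threshold(slope, intercept, d):
    """Linear threshold on the syndrome weight, never below (d + 1)/2."""
    floor_value = (d + 1) // 2

    def thresh(syndrome_weight):
        return max(math.floor(slope * syndrome_weight + intercept), floor_value)

    return thresh


# Hypothetical coefficients, for illustration only.
thresh = make_threshold(slope=1 / 128, intercept=10.0, d=71)
low_weight_threshold = thresh(1280)    # linear part is 20, so the floor 36 applies
high_weight_threshold = thresh(6400)   # linear part is 60
```

A heavier syndrome gives a larger threshold, so fewer UPC counters reach it and fewer bits are flipped.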

Critical Analysis of BGF
In this section, we dive a little deeper into the BGF decoding algorithm. This allows us to better understand why BGF is effective, but, more importantly, it will show some of BGF's weaknesses and lay the ground over which a better decoder can be designed. It is well-known to be difficult to provide a theoretical analysis for QC-MDPC iterative decoders, because of the inherent dependency caused by the circulant matrices involved. Therefore, our analysis is based on observations of BGF's behavior in practice.

BGF's First Iteration: The Black-Gray Step
Let us first discuss BGF's first iteration and its importance for the extrapolation framework. As described in the previous section, in the first iteration, BGF performs a sequence of 3 bit-flipping calls: one BitFlipIter followed by two BitFlipMaskedIter.
Algorithm 3: The BGF decoding algorithm.
We know that BitFlipIter flips all bits whose UPC counters are above a certain threshold. This makes it very sensitive to the selected threshold, as illustrated in Figure 2. For example, if the threshold $\tau_0 = 76$ were selected, the number of errors made after calling BitFlipIter, that is, the number of correct bits that would be incorrectly flipped, would be twice the number made if $\tau_0 = 77$ were selected.
This problem is particularly important under the extrapolation framework, where the algorithm needs not only to perform well, but also to improve its performance at a very fast rate for small, but increasing, values of $r$. Therefore, BGF uses a very conservative threshold in the first iteration BitFlipIter. Additionally, the black and gray regretting phases, corresponding to the two calls of BitFlipMaskedIter, also work by flipping a controlled number of bits: only those bits in the black or gray masks whose UPC is above $(d + 1)/2$. This makes the whole first iteration very conservative. Even though a conservative first iteration is important to ensure a fast DFR decay when $r$ increases, it may result in useless iterations when $r$ is close to the value where negligible DFR is reached. In particular, for the case presented in Figure 2, where $r = 40{,}973$, Thresh returned $\tau_0 = 86$. This would result in no error being made after BitFlipIter, but at the cost of flipping only a small number of bits, compared to the case where $\tau_0 = 80$, for example.
This suggests that removing the black regret step may be a good starting point for optimization. For example, we could merge both black and gray regret steps into one iteration in such a way that the black regret is critical for small $r$, but when $r$ gets larger, the gray regret step gets more important than the black one. This is the key idea behind our PickyFlip iteration that we introduce in Section 5.

The Number of Iterations and the Threshold Function
One straightforward method to improve the decapsulation performance would be to decrease the number of BGF iterations, at the cost of increasing the key sizes. Intuitively, one may think that there is a direct trade-off between the number of iterations and the block length parameter r: the DFR may not decay as fast when using a lower number of iterations, but one might be lucky to obtain a reasonable value of r after the extrapolation. However, as discussed in the previous section, since the thresholds are so conservative, if the number of iterations is too small, the decoder may not be able to fully correct the errors even for large values of r. Figure 3 shows how the number of iterations affects the decay of the DFR as a function of r. Notice how 2 iterations are not enough to allow for a complete decoding of errors of weight t = 134. Furthermore, the curve for 3 iterations does not appear to be concave, therefore it is not safe to use the extrapolation framework for this value. This odd behavior of the DFR curves for 2 and 3 iterations is caused by the following problem. On the one hand, increasing r should make it easier to correct more errors, since there is more redundancy, but, for large values of r, the threshold τ 0 used in the first iteration is so high that only very few errors are corrected in the first iteration.
Let us analyze the thresholds $\tau_0$ in more detail. The average values of the thresholds $\tau_0$ used in each of BGF's iterations are shown in Figure 4, considering 10,000 decapsulations under the BIKE Level 1 parameter set. We make three observations. First, notice how the first threshold increases as $r$ increases. This is a consequence of the linear dependency of $\tau_0$ on the syndrome weight $|s|$, which turns out to increase with $r$. The second observation is that the thresholds used in iterations 2 to 5 appear to converge to the floor $(d + 1)/2 = 36$. This happens because the first iteration, in general, is able to flip a sufficiently large number of errors, and leaves only fine adjustments for the next iterations to deal with. The third is that $\tau_0$, for the second iteration, starts increasing after $r = 11{,}500$. This is caused by the threshold in the first iteration being too high, which leaves a lot of errors to be corrected by the second iteration.

Impact of the Threshold on the Concavity Assumption
Back to the DFR curves, the non-concave behavior of the curves for 2 and 3 iterations raises a potentially deep problem with the BGF threshold: why should we expect the curves for 4 and 5 iterations to be concave as well? It is possible that we just cannot see an inflection point because it is located at a DFR smaller than what we can simulate.
To evaluate the concavity of the DFR curves for 5 iterations, we propose the following experiment. Consider BIKE Level 1 parameter set. Since we cannot see the inflection points for t = 134, we can exaggerate the error weight t so that we can see the DFR curve in the interval of interest. Ideally, it should be concave at least within all values of r < 12,323, since this is the extrapolated value of r for BIKE Level 1.
As we can see in Figure 5, this is not what happens for t = 151, 153 and 155, for BGF with 5 iterations. Therefore, considering our results regarding the non-concavity of BGF with 2 and 3 iterations, together with the non-concavity of BGF with 2 to 5 iterations when t = 151, we believe that it is not conservative to assume that the DFR curve for BGF is concave. We also tested BGF for levels 3 and 5, observing an analogous behavior for t = 220 and t = 300, respectively.
The main cause for this behavior appears to be the threshold function that depends on $|s|$. We conclude that it is not safe to use it for the first, and most important, iteration, but Figure 4 suggests that it might be used in further iterations, since it converges to $(d + 1)/2$. Initially, we thought that the threshold problem would be fixed by defining a maximum value for $\tau_0$. In our exploratory tests, this indeed makes the DFR curves concave for exaggerated values of $t$, but the error correction was negatively affected. Therefore, we leave the problem of finding better thresholds for future work. Our approach to deal with the first iteration is simple: we do not use a simple threshold to flip bits. Instead of starting with a BitFlip iteration, we propose to start with FixFlip, a new type of iteration that works by flipping a predetermined number of bits that have the largest corresponding UPC.

PickyFix
In this section, we describe a new BIKE decoder called PickyFix. Similar to other iterative decoders for LDPC codes, PickyFix works by performing a sequence of iterations that progressively increases the knowledge of the secret sparse error used for encrypting. However, it differs significantly in how it chooses which bits to flip in its iterations. We begin by defining two new types of auxiliary procedures: the FixFlip and PickyFlip iterations, that are the building blocks of our decoder.

The FixFlip Auxiliary Iteration
While the majority of previous bit-flip approaches are based on flipping all bits whose UPC counters are above a certain threshold, FixFlip flips a predetermined number of bits, denoted by $n_{\mathrm{Flips}}$, that have the highest UPC counters. The formal description of a full iteration of FixFlip is given as Algorithm 4.
Almost every step of the algorithm is standard for other bit-flipping algorithms. However, despite its simplicity, one has to be careful with line 3 when implementing the FixFlip iteration. In Section 7 we discuss this issue and show how this can be done efficiently in linear time on $r$ by using important observations on QC-MDPC parameters.
Algorithm 4: The FixFlip iteration.
This iteration is very useful at the start of the decoding process, when there is a lot of uncertainty about the correctness of the bits. We can point out two immediate advantages of using FixFlip. First, since the number of flips is fixed, the number of wrong flips done by this iteration is limited. This makes FixFlip useful for small values of $r$, which is an important property for decoders to be used in Vasseur's [Vas21] DFR extrapolation framework. Second, and most important, FixFlip is immune to the problem of BGF's first threshold that gets larger as $r$ grows, since it does not rely on a generic threshold function that depends only on $|s|$. In fact, the threshold function for FixFlip depends directly on the UPC values and the target number $n_{\mathrm{Flips}}$ of bits to flip.
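The core selection step of FixFlip can be illustrated with a short Python sketch that takes precomputed UPC counters as input. It is a plain sorting-based version for exposition only: it is neither constant-time nor linear-time, and the tie-breaking rule (lower index first) and the function name are assumptions made for this example.

```python
def fix_flip_iter(upc, e, n_flips):
    """Flip exactly n_flips bits of e: those with the highest UPC counters.

    Ties are broken in favor of the lower index (an arbitrary choice here).
    """
    order = sorted(range(len(upc)), key=lambda j: (-upc[j], j))
    flipped = list(e)
    for j in order[:n_flips]:
        flipped[j] ^= 1
    return flipped


upc = [3, 9, 1, 9, 5, 0]
e = [0, 1, 0, 0, 0, 0]
result = fix_flip_iter(upc, e, n_flips=3)   # flips positions 1, 3 and 4
```

Equivalently, the iteration behaves like a threshold iteration whose threshold is the $n_{\mathrm{Flips}}$-th largest UPC value, up to the handling of ties.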

The PickyFlip Auxiliary Iteration
PickyFlip is very similar to the BitFlip iteration, except that it uses two different thresholds: $\tau_{\mathrm{In}}$ is used to flip zeros to ones and $\tau_{\mathrm{Out}}$ to flip ones to zeros. In particular, PickyFlip requires that the threshold to flip a zero to a one is greater than or equal to the threshold to flip a one to a zero. This makes it picky with respect to the support of $e$ and explains why we use in and out to differentiate the thresholds. The iteration is formally described as Algorithm 5.
The power of this iteration is that the weight of e does not grow too much in one iteration because it is easier to give up on a 1 in the partial error vector e than to accept one more. Additionally, the effect of one PickyFlip iteration is similar to the sequence of black regret and gray regret steps, for a sufficiently high r. Luckily, because of its similarity with the BitFlip iteration, it can be easily implemented by small adjustments of the code by Drucker et al. [DGK20a] in BIKE Additional Implementation.
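A minimal Python sketch of the two-threshold rule, taking precomputed UPC counters as input, could look as follows. The function name and the numeric values are hypothetical, and the comparison uses "greater than or equal" to mirror the convention of BitFlipIter.

```python
def picky_flip_iter(upc, e, tau_in, tau_out):
    """Flip 0 -> 1 when upc >= tau_in and 1 -> 0 when upc >= tau_out."""
    if tau_in < tau_out:
        raise ValueError("PickyFlip requires tau_in >= tau_out")
    flipped = []
    for bit, counter in zip(e, upc):
        tau = tau_out if bit else tau_in
        flipped.append(bit ^ int(counter >= tau))
    return flipped


upc = [40, 50, 36, 35, 60]
e = [0, 0, 1, 1, 0]
result = picky_flip_iter(upc, e, tau_in=50, tau_out=36)
```

In this example, position 0 has a counter of 40, which would be enough to remove a one but is not enough to add a one, so the bit stays at zero.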

The PickyFix Decoder
We are now ready to define a full decoder, which is described as Algorithm 6. To allow for a direct comparison between PickyFix and BGF, we decided to define it in a similar fashion: the first iteration makes 3 calls of the auxiliary steps, which are then followed by single calls in the next N Iter − 1 iterations.
The threshold τ Out for PickyFix is fixed as (d + 1)/2 in every iteration, which is the value typically used as the minimum threshold for flipping bits. For the value of τ In , we decided to use the BGF's auxiliary function Thresh which was carefully built by the BIKE team and is sufficiently restrictive for our use case.
PickyFix depends on the following parameters: the number $n_{\mathrm{Flips}}$ of flips to be done by FixFlipIter and the number $N_{\mathrm{Iter}}$ of iterations. These parameters depend on the security level and significantly impact the decoder's performance. We analyze these parameters in the next section.

Analysis
The main problem when searching for good parameters $(n_{\mathrm{Flips}}, N_{\mathrm{Iter}})$ is that they are not independent. For example, if $n_{\mathrm{Flips}}$ is too small, we may need a large number $N_{\mathrm{Iter}}$ of iterations to compensate. To simplify our search, we take a greedy approach and break the search into two parts. In this section, we first find good values for $n_{\mathrm{Flips}}$ by focusing only on the first iteration and then show that these values indeed yield decoders with a concave DFR curve. Finally, we proceed to evaluate the decoder performance for different numbers $N_{\mathrm{Iter}}$ of iterations.

Choosing the FixFlip Parameter
Intuitively, the best value of n Flips is the one that minimizes the number of errors left to be corrected by further PickyFlip iterations. Ideally, one could see how each possible value of n Flips affects the DFR curves following the extrapolation framework, and choose the one that has the fastest decay. The problem of this approach is that these experiments are very expensive and could easily take months of computing power.
To deal with this problem, instead of counting decoding failures, we count the average number of uncorrected errors left, which can be estimated with a much smaller sample than what is needed for the DFR estimation. Consider the curves $\mathrm{PickyFix}_1^{n_{\mathrm{Flips}}}(r)$ that represent the average number of errors left after the first iteration of PickyFix when the FixFlip iteration performs $n_{\mathrm{Flips}}$ bit flips. Similarly, define the curve $\mathrm{BGF}_1(r)$ as the average number of errors left after the first iteration of BGF. Figure 6 shows selected curves, where the average number of errors left was obtained by simulations of 10,000 runs. Notice how each PickyFix curve eventually leaves about 0 errors after the first iteration. Furthermore, we can see that BGF appears to stall its error correction in its first iteration as $r$ increases. In Level 1, BGF even starts to leave more errors for sufficiently large values of $r$, which is a consequence of the very conservative threshold used in the first iteration, as discussed in Section 4.1.
To obtain the best value of $n_{\mathrm{Flips}}$ we used the following criterion: for each security level, select the value $n_{\mathrm{Flips}}$ such that [...] Table 3, where 10,000 tests were performed to estimate $\mathrm{PickyFix}_1^{n_{\mathrm{Flips}}}(r)$ for each $r$.
Let us now see how PickyFix behaves with respect to the concavity with an experiment similar to the one done in Section 4.3 for BGF. First notice that we could not use $t = 155$ because PickyFix was much better than BGF and its DFR quickly got to the point where no failure could be observed in our simulation. Therefore, we had to consider $t = 160$. Figure 7 shows our results for this experiment. We invite the reader to compare this figure with Figure 5 (Section 4.3) and see that not only does PickyFix's DFR appear to be concave in the same interval, but it also outperforms BGF with 5 iterations for a higher value of $t$. Furthermore, we also tested PickyFix for levels 3 and 5, using $t = 240$ and $t = 330$, respectively, and the DFR curves appear to be concave, unlike the ones for BGF.

Figure 7: The DFR curves for PickyFix when using 2 to 5 iterations considering the BIKE Level 1 parameter set with t = 160.

Achieving Negligible DFR
Now comes the most important evaluation of PickyFix, which consists of its decoding performance under the extrapolation framework. Our results are shown in Figure 8. The number of tests to determine each DFR estimate was selected to be enough to obtain at least approximately 1000 failures for each point and can be found in data/setup/dfr_experiment.csv. Table 5 shows the results for the DFR extrapolation of the curves considered in Figure 8, together with the performance of our constant-time implementations. The extrapolation was done for the last two points $(r_A, p_A)$ and $(r_B, p_B)$ where more than 1000 failures were observed, using $\alpha = 0.01$ for the Clopper-Pearson method to build the confidence intervals for $p_A$ and $p_B$.
We can see, from Table 5, that even with less than 5 iterations, the extrapolated parameter r for each security level does not differ by much from the parameters proposed by the BIKE team using BGF (Table 1). However, since PickyFix also works with a reduced number of iterations, its performance can be significantly better.
From the results presented in this section, PickyFix looks like a promising decoder for BIKE. However, remember that the FixFlip auxiliary iteration used by PickyFix is inherently more complex than those used by BGF. In the next section, we describe how to efficiently implement PickyFix in constant-time and show that our decoder provides a major speedup over BGF for all security levels.

Efficient Implementation in Constant Time
The efficient constant-time implementation proposed by the BIKE team is based on Chou's [Cho16] QcBits with further improvements by Guimarães et al. [GAB19] and Drucker et al. [DGK20b,DG19]. Using these ideas, Drucker et al. [DGK19,DGK20c] proposed the BGF implementation, which is the best performing decoder to date and is implemented in BIKE's Additional Implementation [DGK20a].
We based our PickyFix implementation on the code by Drucker et al. [DGK20a], which implements, in constant-time, most of the procedures required for both PickyFlip and FixFlip iterations. This includes, for example, the syndrome and UPC counters computations, and algorithms to flip bits given a threshold.
This section begins with a high-level description of how to adapt the implementation by Drucker et al. [DGK20a] to perform the PickyFlip iteration in constant-time. Then we give a more detailed explanation of how to implement the procedures needed by FixFlip that are significantly different from what is used by previous decoders. We end this section with a performance evaluation of our constant-time implementation, which is available at https://github.com/thalespaiva/pickyfix.

Implementing the PickyFlip Iteration
Remember that PickyFlip is similar to the BitFlip iteration, except that it uses a different threshold to flip zeros and ones. More specifically, consider the BitFlipIter described in Algorithm 2. Notice how, if upc j ≥ τ 0 , it inverts e j , but if τ 0 − δ ≤ upc j < τ 0 , it updates GrayMask j = 1. The behavior of BitFlip is then very similar to that of PickyFlip if we let τ 0 = τ In and δ = τ In − τ Out .
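The two-threshold idea can be sketched as follows. This is a plain, non-constant-time illustration that assumes zeros are flipped when their counter reaches τ In and ones are flipped when their counter reaches τ Out; the exact rule is the one given in the algorithm listing of the earlier sections, and the function name `picky_flip` is hypothetical.

```python
def picky_flip(e, upc, tau_in, tau_out):
    """Two-threshold flipping sketch (assumed rule, not constant-time).

    e       : list of 0/1 bits, the current error estimate
    upc     : list of UPC counters, one per bit of e
    tau_in  : threshold for turning a 0 into a 1 (assumption)
    tau_out : threshold for turning a 1 back into a 0 (assumption)
    """
    out = list(e)
    for j, (bit, counter) in enumerate(zip(e, upc)):
        flip_zero = bit == 0 and counter >= tau_in
        flip_one = bit == 1 and counter >= tau_out
        if flip_zero or flip_one:
            out[j] = bit ^ 1
    return out


print(picky_flip([0, 0, 1, 1], [5, 3, 3, 1], tau_in=5, tau_out=3))
```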
BIKE's efficient implementation of BitFlip is based on QcBits [Cho16], and we implemented PickyFix by reusing their implementation. Since the details of this implementation are already described by Chou [Cho16], we give here only a brief description of how it works.
Suppose we want to flip all bits in e whose UPC counters are above a threshold τ In . First, all UPC counters are computed in bitsliced form. Since the UPC counters are lower than or equal to d = w/2, then log 2 (d) slices are enough. Second, the implementation performs a bitsliced subtraction of τ In over all UPC counters. Therefore, the 0 bits in the last slice, which contains the most significant bits, indicate that the UPC was greater than or equal to τ In , and thus the corresponding bit in e should be flipped.
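The bitsliced comparison can be illustrated with Python integers standing in for wide registers. In this sketch, bit j of `slices[k]` is bit k of counter j, and the final borrow of the subtraction plays the role of the most significant (sign) slice mentioned above. The helper names are hypothetical, and the threshold is assumed to fit in the given number of slices.

```python
def bitslice(counters, n_slices):
    """Pack counters so that bit j of slices[k] is bit k of counters[j]."""
    slices = [0] * n_slices
    for j, c in enumerate(counters):
        for k in range(n_slices):
            slices[k] |= ((c >> k) & 1) << j
    return slices


def geq_threshold_mask(slices, n_counters, tau):
    """Return a mask whose bit j is 1 iff counter j >= tau.

    Performs a bitsliced subtraction counter - tau on all counters at once
    and only keeps the borrow chain: a final borrow of 1 means counter < tau.
    Assumes tau < 2**len(slices).
    """
    full = (1 << n_counters) - 1
    borrow = 0
    for k, a in enumerate(slices):
        t = full if (tau >> k) & 1 else 0  # bit k of tau, broadcast
        borrow = ((~a & (t | borrow)) | (t & borrow)) & full
    return ~borrow & full


slices = bitslice([3, 5, 2, 7], n_slices=3)
print(bin(geq_threshold_mask(slices, 4, tau=4)))
```

The same logical operations are applied to every counter regardless of its value, which is what makes this style of comparison attractive for constant-time code.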
Notice that PickyFlip performs the procedure above twice: once for τ In and once for τ Out . However, the computation of UPC counters, which is the most costly step, is only done once for the two thresholds. The cost of the call is then very similar to that of BitFlipIter.

Implementing the FixFlip Iteration
Most of the steps needed by the FixFlip algorithm are common to all variants of the original bit-flipping decoder proposed by Gallager [Gal62]. Therefore, we can base our implementation on the most efficient constant-time implementations of QC-MDPC decoders, provided that we can efficiently implement the sorting step of FixFlip, corresponding to line 3 of Algorithm 4.
Simply put, the main problem we need to solve is: given a list of UPC counters, flip the n Flips bits that have the largest counters. This motivates us to call the set of indexes of entries to be flipped a FixFlip set, which is formally defined below.
Definition 1 (FixFlip set). Consider a list of UPC counters U = (u 1 , . . . , u 2r ). A FixFlip set S with respect to U and n Flips is a set of n Flips indexes such that u s ≥ u i for all s ∈ S and for all i ∉ S.
Notice that, in general, there can be more than one FixFlip set for the same list of UPC counters. For example, for a list of UPC counters U = (3, 5, 2, 3, 7, 1, 3, 1) and n Flips = 4, both S 1 = {1, 2, 4, 5} and S 2 = {1, 2, 5, 7} are valid FixFlip sets. Furthermore, notice that any FixFlip set S for U can be constructed from the threshold τ = 3 and the integer n τ = 2 by taking every index i whose UPC is strictly greater than τ and also taking n τ indexes whose UPC is equal to τ . The pair (τ, n τ ) is then called a FixFlip threshold, and is formally defined next.

Definition 2 (FixFlip threshold). Let U = (u 1 , . . . , u 2r ) be a list of UPC counters. A pair (τ, n τ ) is a FixFlip threshold with respect to U and n Flips if any FixFlip set S can be partitioned into S = S >τ ∪ S =τ such that S >τ = {s ∈ S : u s > τ }, S =τ = {s ∈ S : u s = τ } and |S =τ | = n τ .
This notion helps us to reduce the problem of flipping the bits with the largest UPC values to finding a FixFlip threshold, as shown in Algorithm 7. The idea of the algorithm is to flip all bits whose UPC is above τ , and use the array FlipFlagsForThreshold to control which set of n τ bits should be flipped among all of the N τ bits whose UPC is τ .
Algorithm 7: Algorithm to flip the n Flips entries of e with largest UPC counters.
The conditionals in Algorithm 7 can be implemented in constant-time using condition masks. However, there are two aspects that are important to notice when converting the algorithm to a constant-time implementation. The first is that it is not trivial to implement FixFlipThreshold in constant-time. The second is that, to generate the random vector FlipFlagsForThreshold of fixed weight in line 4, and to hide the accesses to index η in line 11, we need a tight upper bound on N τ . In the next two sections, we describe how our constant-time implementation deals with these concerns.
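The flipping logic described above can be sketched in a few lines. This is a plain, non-constant-time illustration written from the prose description rather than from the algorithm listing: the function name `flip_with_threshold` is hypothetical, the threshold pair is taken as input, and Python's `random.shuffle` stands in for the fixed-weight vector generation discussed later.

```python
import random


def flip_with_threshold(e, upc, tau, n_tau):
    """Flip every bit with UPC > tau and n_tau random bits with UPC == tau."""
    ties = sum(1 for u in upc if u == tau)   # N_tau in the text
    flags = [1] * n_tau + [0] * (ties - n_tau)
    random.shuffle(flags)                    # stand-in for FlipFlagsForThreshold
    out = list(e)
    eta = 0                                  # index into the flags array
    for j, u in enumerate(upc):
        if u > tau:
            out[j] ^= 1
        elif u == tau:
            out[j] ^= flags[eta]
            eta += 1
    return out


upc = [3, 5, 2, 3, 7, 1, 3, 1]
flipped = flip_with_threshold([0] * 8, upc, tau=3, n_tau=2)
print(flipped, sum(flipped))
```

With the example list from the definitions, exactly four bits are flipped: the two with counters 5 and 7, plus two of the three positions whose counter equals 3.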

Computing the FixFlip Threshold
The straightforward solution is to use general sorting algorithms, such as quicksort, to sort the indexes based on the corresponding UPC counters' values, and then return the first n Flips indexes. There are two problems with this approach. The first is that the average complexity would be O(r log r) which would result in an iteration much costlier than that of BGF or BG. The second, and most problematic one, is that the algorithm would not be constant-time and timing attacks would be practical.
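For reference, the straightforward sorting approach can be written as below. It is neither constant-time nor efficient, which is exactly the problem pointed out above, but it is a convenient oracle against which faster threshold computations can be checked. The function name is hypothetical.

```python
def fixflip_threshold_by_sorting(upc, n_flips):
    """Reference (non-constant-time) computation of a FixFlip threshold.

    Returns (tau, n_tau): tau is the n_flips-th largest counter, and n_tau
    is how many counters equal to tau are needed to complete n_flips flips.
    """
    ordered = sorted(upc, reverse=True)
    tau = ordered[n_flips - 1]
    strictly_above = sum(1 for u in upc if u > tau)
    return tau, n_flips - strictly_above


print(fixflip_threshold_by_sorting([3, 5, 2, 3, 7, 1, 3, 1], 4))
```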
Notice that the values of the UPC counters are always in {0, . . . , d}, which is a relatively small range, and therefore counting sort is an interesting option that allows for linear sort. The problem with using counting sort in this cryptographic setting is that the constant-time implementation would not be efficient: for every counter, we need to touch all the d + 1 buckets to avoid cache timing attacks, resulting in O(wr) complexity.
We can do better by analyzing the context in which the FixFlip iteration is used. Since the weight t of the error vector is at most t = 264, considering security level 5, it is not necessary to allow for more than 264 flips in each FixFlip iteration. Furthermore, we already saw in Section 6.1 that, in practice, n Flips is typically much lower than t for all security levels, and we can safely assume n Flips < 256. This means that, when performing the counting sort, we only need to count up to 255, since we need only to return the indexes corresponding to the n Flips largest counters. Therefore, 8 bits suffice for each bucket.
Still, even if we can pack 8 buckets into one 64-bit register, we would need to touch all ⌈(d + 1)/8⌉ registers for each counting update. This would result in 9 × 2r and 18 × 2r operations, considering parameters for levels 1 and 5. But remember that we do not need to count all entries, and we can take what we call the reduced UPC counters approach, which is described next. Figure 9 shows how the algorithm works in a real decoding instance considering BIKE Level 5.
Suppose we are given a list U = (u 1 , . . . , u 2r ) of UPC counters and we want to find the FixFlip threshold for U and n Flips < 256. To show our concrete efficient implementation, we assume the following conditions, which hold for the real-world parameters.
2. The number of bits to flip is n Flips < 256.
The FixFlip threshold is found in 3 counting steps, and each step uses only 8 buckets. For the first step, each bucket i, where i goes from 0 to 7, corresponds to the UPC counters in the interval [64i, 64i + 63]. The algorithm then runs from u 1 to u 2r counting the occurrences into the buckets, but with the following rule: the counting is done only in 8 bits, and it should not overflow. That is, the maximum count is 255 for each bucket. Now suppose the resulting counts for the buckets are [255, 255, 0, 0, 0, 0, 0, 0], and consider the case n Flips = 40, just like in Figure 9. Then the bucket where the FixFlip threshold lives must be Bucket 1, since Buckets 2 to 7 do not have any entry, and there are more than n Flips entries in Bucket 1. Using Bucket b 1 = 1 selected in this step, the algorithm proceeds to the next step.
In the second step, the algorithm expands Bucket b 1 , and the 8 counting buckets are zeroed. Now, each bucket i will count the UPC counters in the interval [B 2 +8i, B 2 +8i+7], where B 2 = 64b 1 . Again, the algorithm runs through the counters in reversed order until it finds where the FixFlip threshold lives. In the case considered in Figure 9, Bucket 3 is not enough to contain the threshold since it separates at most 31 UPC counters from the rest. Therefore, the search continues using Bucket b 2 = 2.
In the third and last step, Bucket b 2 is expanded, and now each counting bucket will correspond to one UPC value. Formally, each Bucket i will count occurrences of the UPC counter B 3 + i, where B 3 = B 2 + 8b 2 . If we consider the search in Figure 9, we can see that it stops at τ = 86, since it has found 6 + 30 + 1 = 37 UPC values above τ and n τ = 3 UPC values equal to τ complete the n Flips = 40 bits to be flipped.
Now let us analyze why this algorithm is useful. Since each bucket uses only 8 bits, we can pack all the 8 buckets into a single 64-bit register. Therefore, each update on the counters updates a single register, which avoids cache-timing attacks. Since 3 rounds are necessary, the threshold is found in about 3 × 2r touches on the counting registers.
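The packing of eight 8-bit buckets into one 64-bit word, together with the non-overflowing count, can be sketched as follows. Python integers are masked to 64 bits to imitate a register; the helper names are hypothetical, and a real constant-time implementation would express the same arithmetic in C or with vector instructions.

```python
MASK64 = (1 << 64) - 1


def saturating_increment(register, bucket):
    """Add 1 to the 8-bit bucket inside a 64-bit word, stopping at 255."""
    shift = 8 * bucket
    current = (register >> shift) & 0xFF
    step = 1 - ((current + 1) >> 8)   # 0 when the bucket already holds 255
    return (register + (step << shift)) & MASK64


def bucket_count(register, bucket):
    """Read back the 8-bit count stored for a bucket."""
    return (register >> (8 * bucket)) & 0xFF


reg = 0
for _ in range(300):
    reg = saturating_increment(reg, 2)
print(bucket_count(reg, 2), bucket_count(reg, 3))
```

Because the increment is suppressed arithmetically instead of with a branch, a saturated bucket never carries into its neighbor.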
Furthermore, let us check that computing the corresponding bucket for a UPC counter is made using constant-time operations. Suppose we want to find the bucket b corresponding to the UPC counter u i on step ℓ. Then
$$ b = \begin{cases} \left\lfloor (u_i - B_\ell) / 8^{3-\ell} \right\rfloor & \text{if } u_i \ge B_\ell \text{ and } u_i < B_\ell + 8^{4-\ell}, \\ \bot & \text{otherwise,} \end{cases} $$
where B 1 = 0. Both conditions can be evaluated in constant time, since they involve simple unsigned integer comparisons and additions, and the computation of 8^{4−ℓ} does not involve any secrets. Now for the actual values, if we use 8 bits to represent the buckets, we can let 0xFF denote the symbol ⊥. Furthermore, since the denominator of the division involving secrets is a power of 8, we can compute (u i − B ℓ )/8^{3−ℓ} in constant time by using a right shift by 3(3 − ℓ) bits, assuming the processor uses a barrel shifter. This observation is particularly useful when considering the vectorized implementation using AVX512 instructions: the bucket computation can be done in parallel for multiple UPC counters, as they involve simple additions, comparisons and right shifts by a fixed amount.
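Putting the three counting steps together gives the following sketch. It mirrors the logic described above (eight saturating buckets per step, bucket index obtained by a right shift, buckets scanned from the largest down), but it is written for clarity in plain Python and is not constant-time. The function name is hypothetical, and the sketch assumes every counter is below 512, n Flips < 256, and at least n Flips counters are given.

```python
def fixflip_threshold_three_steps(upc, n_flips):
    """Find (tau, n_tau) with three rounds of 8 saturating 8-bit buckets."""
    assert 0 < n_flips < 256 and len(upc) >= n_flips and max(upc) < 512
    base = 0    # B_l: smallest counter value covered in the current step
    above = 0   # counters known to be strictly above the current range
    for step in (1, 2, 3):
        shift = 3 * (3 - step)   # bucket width is 8**(3 - step)
        span = 8 ** (4 - step)   # the 8 buckets cover [base, base + span)
        counts = [0] * 8
        for u in upc:
            if base <= u < base + span:
                b = (u - base) >> shift
                counts[b] = min(counts[b] + 1, 255)   # never overflow 8 bits
        acc = 0
        chosen = 0
        for b in range(7, -1, -1):   # scan from the largest bucket down
            if above + acc + counts[b] >= n_flips:
                chosen = b
                break
            acc += counts[b]
        above += acc
        base += chosen << shift
    return base, n_flips - above


print(fixflip_threshold_three_steps([3, 5, 2, 3, 7, 1, 3, 1], 4))
```

Saturation is harmless here: a bucket that reaches 255 necessarily contains the threshold because n Flips < 256, so every bucket above the chosen one holds an exact count.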

Generating FlipFlagsForThreshold and Accessing it in Constant Time
The generation of a random binary vector of a given weight appears frequently in code-based cryptography. For example, both HQC [MAB + 18] and BIKE [ABB + 21] itself require such a procedure when generating error vectors or secret keys. There is, however, a key difference between our setup and the constant-weight sampling algorithms used by BIKE: FixFlip must hide both the weight n τ and the size of the vector N τ . Let us first see, in Algorithm 8, how the naive Fisher-Yates shuffle works in our case, and then discuss how to make it run in constant-time. We start with a vector of N τ bits, in which the first n τ are set to 1 and the rest are set to 0. Then, the algorithm performs n τ random swaps to shuffle the first n τ bits of the array. By the end, if each random integer j generated for the swap is unbiased, then each vector of length N τ and weight n τ should be generated with uniform probability $1/\binom{N_\tau}{n_\tau}$.
To implement Algorithm 8 in constant-time, we need upper bounds on n τ , to limit the loops, and on N τ , to hide the accesses to vector FlipFlagsForThreshold when swapping bits in line 9. Notice that, when swapping bits, we only need to hide access to position j, since i is already known in each iteration. Furthermore, notice that we do not use rejection-sampling when selecting the index j because its rejection rate would depend on N τ . Instead, we use a constant-time modulo reduction of the λ-bit random number, where λ is equal to the security level, to achieve negligible bias.
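The naive procedure described above can be sketched as follows. The sketch uses 0-based indexes and Python's `secrets.randbits` as the source of λ-bit random numbers; the function name is hypothetical. It makes no attempt to hide n τ or N τ , so it corresponds to the naive variant only, not to the constant-time one.

```python
import secrets


def random_fixed_weight_flags(n_total, weight, lam=128):
    """Naive Fisher-Yates style generation of a weight-`weight` bit vector.

    n_total plays the role of N_tau and weight the role of n_tau.
    """
    assert 0 <= weight <= n_total
    flags = [1] * weight + [0] * (n_total - weight)
    for i in range(weight):
        # lam-bit random number reduced modulo the size of the range [i, n_total)
        j = i + secrets.randbits(lam) % (n_total - i)
        flags[i], flags[j] = flags[j], flags[i]
    return flags


flags = random_fixed_weight_flags(10, 3)
print(flags, sum(flags))
```

Since the random number has far more bits than the modulus, the bias introduced by the modulo reduction is negligible, which is the reason given above for avoiding rejection sampling.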
Algorithm 8: Generate a random vector of fixed weight using the Fisher-Yates algorithm.
...
8: j is a random integer in range i ≤ j ≤ N τ
9: Swap bits i and j of FlipFlagsForThreshold
10: return FlipFlagsForThreshold

A trivial upper bound on n τ is n τ ≤ n Flips . This allows us to run the loops in lines 3 and 6 in constant time by performing n Flips iterations and using condition masks. Now, to bound N τ we can focus on the distribution of UPC counters of the wrong bits, that is, those that should be flipped. Let U τ be the random variable that counts the number of UPC counters, among the wrong bits, that are equal to τ . Notice that, when N τ > 2U τ , flipping bits whose UPC is equal to τ is more likely to result in a wrong flip. Suppose that we find the smallest value κ, in the interval 0 ≤ κ ≤ t, such that Pr (U τ > κ) ≤ 2^{−λ} , where λ is the security level. Then we only care about flipping bits whose UPC is equal to τ in the case when N τ ≤ 2κ, as pointed out by the comment in line 4 of Algorithm 8.
To find this value κ for each parameter set, we can use Sendrier and Vasseur's [SV19] model for the distributions of UPC counters. Under their model, the UPC counters' distributions for the wrong and right bits are accurately modeled by Binomial distributions with different parameters that are easy to compute. Since we want to consider all possible values of τ , we can search for the smallest κ satisfying the rightmost inequality where the distribution of each U θ is computed using Sendrier and Vasseur's [SV19] model.
Table 4 shows the upper bounds on N τ that we found for each security level. Our implementation uses an array of 64-bit integers to represent FlipFlagsForThreshold, and the total number of 64-bit blocks required for 2κ bits is shown in the last column. Notice that, for security levels 128 and 192, it is possible to simultaneously compute N τ and the FixFlip threshold, since 2κ < 255. To compute κ, we consider the smallest values of r achieving each security level λ, which are taken from Table 5. This is a conservative approach, since κ gets smaller for higher r within a fixed security level.
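The search for κ described above has to hold for every candidate threshold at once. A minimal sketch of one condition that would achieve this, assuming a union bound over θ ∈ {0, ..., d} (this particular chain is an assumption made for illustration and may differ from the exact formulation intended in the text):

```latex
\Pr\left(\exists\, \theta \in \{0, \dots, d\} : U_\theta > \kappa\right)
  \;\le\; \sum_{\theta = 0}^{d} \Pr\left(U_\theta > \kappa\right)
  \;\le\; 2^{-\lambda}
```

Under this reading, the first inequality holds for any joint distribution of the U θ , so only the rightmost one needs to be checked numerically for each candidate κ.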

Performance Evaluation
We now evaluate the decoder with respect to the full decapsulation time⁵ when using PickyFix as a subroutine. For this test, we considered the constant-time implementation of BIKE decapsulation using BGF from the BIKE Additional Implementation [DGK20a] and our constant-time PickyFix implementation built over their code. The algorithms are implemented in two modes: the portable implementation and the accelerated one using AVX512 instructions. The testing platform consists of an Intel® Xeon™ Gold 5118 CPU at 2.30GHz. Notice that the decoding step is the most important part of the decapsulation. In our setup, the decoding step accounts for 90% of the decapsulation time for the portable implementation, and between 80% and 90% for the AVX512 implementation⁶.
Table 5 shows the performance of our constant-time implementation of PickyFix. The basis for the speedup comparison over BGF comes from Table 2, for the corresponding security levels. Notice how PickyFix provides major speedups with respect to BGF for all security levels for one very important reason: it can work with a smaller number of iterations. Even if parameter r suffers a slight increase when using only 2 iterations, between 1% (λ = 256) and 14% (λ = 128), this is compensated by speedups of 1.47 and 1.18, respectively.

Conclusion and Future Work
The evidence provided in this paper suggests that PickyFix outperforms BGF both with respect to security and performance. Moreover, we show how PickyFix can be efficiently implemented in constant-time. The only drawback appears to be that the implementation of FixFlip, one of PickyFix's auxiliary iterations, is more involved than that of simple bit-flipping algorithms.
There are several directions one may take to extend this work. It would be interesting to perform a broader exploration of the thresholds used by PickyFlip. For example, one could consider looser thresholds for rejecting or accepting ones. On the FixFlip side, notice that we tried to be as general as possible in our implementation. However, it may be possible to make it simpler and faster by using the fact that FixFlip is used only in the first iteration. Therefore, one could use statistical analysis to limit the range in which the FixFlip threshold should be searched.
It would be fascinating to see if our implementation of FixFlip can be used to compute better and more complex thresholds. For example, one could use the partial counting of UPC counters to compute thresholds based on the separation of the distributions of UPC for right and wrong bits. On the security side, it is important to understand how PickyFix compares with other decoders in corner cases, such as when using weak keys or decoding near-codeword error patterns [DGK19,SV20b,Vas21]. Finally, it may be interesting to evaluate PickyFix as a decoder for low-density parity-check (LDPC) codes [Gal62].