A Key-Recovery Side-Channel Attack on Classic McEliece Implementations

Abstract. In this paper, we propose the first key-recovery side-channel attack on Classic McEliece, a KEM finalist in the NIST Post-quantum Cryptography Standardization Project. Our novel idea is to design an attack algorithm where we submit special ciphertexts to the decryption oracle that correspond to cases of single errors. Decoding of such ciphertexts involves only a single entry in a large secret permutation, which is part of the secret key. Through an identified leakage in the additive FFT step used to evaluate the error locator polynomial, a single entry of the secret permutation can be determined. Iterating this for other entries leads to full secret key recovery. The attack is described using power analysis both on the FPGA reference implementation and on a software implementation running on an ARM Cortex-M4. We use a machine-learning-based classification algorithm to determine the error locator polynomial from a single trace. The attack is fully implemented and evaluated in the ChipWhisperer framework and is successful in practice. For the smallest parameter set, it uses about 300 traces for partial key recovery and fewer than 800 traces for full key recovery in the FPGA case. A similar number of traces is required for a successful attack on the ARM software implementation.


Introduction
The promise of quantum computing has rapidly changed the focus of research and industry in many areas. The growing interest in applications of quantum computing has led to rapid development of quantum computers in recent years. Experimental quantum computers are developed in the labs of companies such as IBM, Google and Microsoft.
The current solutions for information security are threatened by this progress. In particular, cryptographic primitives that base their security on the difficulty of factoring or the discrete log problem are no longer secure, since Shor's algorithm [Sho94] can be used to break these schemes in polynomial time. Even though a sufficiently large quantum computer may still be many years into the future, information processed today must remain secure 10 or 20 years from now. So the development of new security solutions that can withstand the threat of quantum computers is both urgent and of utmost importance.
As a major step in this direction, NIST initiated a few years ago the NIST Post-quantum Cryptography Standardization Project [NIS], here called the NIST PQ project. This is an ongoing evaluation and standardization project for two types of cryptographic primitives, KEMs (Key Encapsulation Mechanisms) and digital signatures. It will eventually set new world standards (technically, US Federal Government standards) for post-quantum secure primitives, in a similar manner as was previously done in the development of AES and SHA-3.
Post-quantum secure primitives are most commonly constructed based on either lattice problems or decoding problems in the Hamming metric, referred to as lattice-based crypto or code-based crypto (but there are also e.g. hash-based, multi-variate, isogeny-based). The NIST PQ project is now in its final round (round 3) before standardization and we can find one code-based KEM as finalist and two code-based KEMs as alternate candidates (classified roughly as promising candidates that need more study), BIKE [ABB + 20] and HQC [AAB + 20].
This paper is about Classic McEliece [ABC + 20], which is the code-based KEM finalist, together with three lattice-based KEM finalists, Saber, Kyber, and NTRU [DKR + 20, SAB + 20, CDH + 20]. The Classic McEliece KEM proposal is a modified version of the old McEliece PKC construction from the 70's, using the so-called Niederreiter PKC version and scrambled parity-check matrices from Goppa codes. The security is mainly related to the hardness of decoding random codes as well as distinguishing scrambled Goppa codes from random codes.
Classic McEliece is regarded as a conservative design based on a well-studied problem. It is less efficient compared to lattice-based schemes in implementation and key size but has high confidence in its security. The German Federal Office for Information Security (BSI) in [Ger] suggests to use Classic McEliece [ABC + 20] and FrodoKEM [NAB + 20] for "long-term confidentiality protection".
While the theoretical security of these post-quantum secure primitives is intensively investigated and small steps forward are continuously taken, the study of the implementation security of these schemes is of equal importance. From a practical perspective, it may even be more important, as information leakage from implementations often leads to actual practical attacks, whereas a successful theoretical attack on a proposed scheme may still be very far from an attack that can actually be done in practice.
Side-channel attacks on implementations of cryptographic primitives, initiated by Kocher [Koc96], contain a plethora of different approaches, such as timing attacks and power attacks, etc. There are also the related fault injection attacks. In a power attack, as used in this paper, the continuous power consumption of the target device with the crypto implementation is measured while the device is executing. The measured power consumption can provide information on secret values in the cryptographic scheme. A successful attack both needs to identify where in the execution to measure, i.e. identifying a useful leakage point, and then to describe an algorithm that uses the received side-channel information and determines secret information in the attacked crypto scheme.
The most powerful and common side-channel attack model is of profiling type, meaning that it is assumed that the adversary has access to the target device or some form of a copy of the target device. The adversary can then in an initial profiling step characterize and measure on the device to learn possible dependencies, etc. Whereas so-called template attacks have traditionally been the common approach to profiled attacks, a recently developed and now very common approach is to use machine-learning algorithms. In particular, side-channel attacks based on deep learning have recently gained a lot of attention [MPP16, ZBHV19, KPH + 19, NDGJ21, PPM + 21]. There are also non-profiled side-channel attacks with deep learning [Tim18,PCBP21].
All the lattice-based KEM finalists in the NIST PQ project have been subjects of side-channel attacks that can recover the secret key [RRCB20, XPRO20, NDGJ21, AR21, REB + 22, UXT + 22]. There are also attacks recovering the secret message [SKL + 20]. We now see a race between researchers trying to provide better-protected implementations both in hardware and in software, and researchers trying to find even more sophisticated ways of attacking protected implementations of the finalists [BDK + 20, NDGJ21, BDH + 21, BGR + 21, ABH + 22]. However, no key-recovery side-channel attack on Classic McEliece implementations is known, only message-recovery attacks. This paper proposes the first key-recovery side-channel attack on Classic McEliece implementations.

Related works
The first code-based cryptosystem was proposed by McEliece [McE78]. Classic McEliece is a modified version of this original scheme and its latest version is described in [ABC + 20]. The official submission of the proposal to round 3 of the NIST PQ project contains also implementations. Other published implementations of Classic McEliece can be found, e.g., FPGA implementations in [WSN17,WSN18] and an ARM Cortex-M4 implementation in [CC21].
Side-channel attacks on McEliece PKC with Goppa codes have previously appeared in [STM + 08], which presents an implemented timing attack as well as two other side-channel attacks and related countermeasures: a power attack on the construction of the parity check matrix during key generation and a cache attack on the permutation of code words during decryption. See also improvements in [AHPT11]. Power analysis of an implementation of McEliece PKC on an 8-bit AVR microprocessor was presented in [HMP10].
Side-channel attacks on Classic McEliece have previously appeared as a message-recovery attack using a type of reaction attack in [LNPS20]; the attack in [LNPS20] is based on [SSMS10]. There has also been a message-recovery laser fault-injection attack described in [CCD + 21].

Contributions
In this paper, we propose the first key-recovery side-channel attacks on Classic McEliece implementations. They are based on an identified general vulnerability caused by the current algorithm design, which can be exploited in a side-channel attack. The vulnerability comes from the fact that if the error locator polynomial is fixed, then the additive FFT evaluation procedure in the decoding step behaves the same. We list the main contributions of the paper as follows.
• We present the first key-recovery side-channel attacks on implementations of Classic McEliece both in hardware (FPGA) and in software (ARM Cortex-M4).
• We highlight an identified side-channel vulnerability in the constant-time (Goppa) decoding step that involves an FFT computation for the evaluation of the error locator polynomial. This has to be addressed when designing a protected implementation of Classic McEliece or similar schemes.
• We show the design of a detailed attack algorithm that finds ways of minimizing the number of required traces.
We have applied this attack to the FPGA reference implementation of Classic McEliece and fully implemented and evaluated the different steps. We also apply it to a third-party implementation for the ARM Cortex-M4 CPU [CC21] with full implementation and evaluation.

New techniques
The main idea of the attack is that if the error locator polynomial is fixed, then the later step of an additive FFT to evaluate the error locator polynomial over all the $2^m$ points ($m = 12$ or $13$) is a fixed process. In the FPGA implementation, it corresponds to 1095 clock cycles of computation. If we generate error vectors with only one position in error, say position $i$, then there exist only $2^m$ possible error locator polynomials, since each is a polynomial of the form $x - \alpha_i$, where $\alpha_i$ is an unknown value. We use a machine-learning-based classification algorithm to determine the error locator polynomial from power measurements. The error locator polynomial output by the Berlekamp-Massey algorithm is determined by the selected error location after the secret mapping, i.e., by the secret value $\alpha_i$. We can thus recover parts of the secret support by repeatedly submitting ciphertexts with a single error in different positions, and after a few hundred such submissions we can successfully recover the entire secret Goppa polynomial. To do full secret key recovery, we need additional traces.

Organization
The remainder of the paper is organized as follows. In Section 2, we give the necessary background in coding theory and code-based cryptography. We then describe the novel ideas in Section 3 and present the new attack in detail in Section 4. This is followed by Section 5 showing the experimental results. We conclude the paper and discuss possible improvements and future work in Section 6.

Background
In this section, we briefly introduce background information including basics in coding theory, code-based cryptography, and side-channel analysis.

Notations
We adopt some of the notations in the design document of round-3 Classic McEliece [ABC + 20]. We employ bold-face capital characters for matrices and bold-face lower-case characters for vectors throughout the paper. Let $q$ be a prime or a power of a prime. We denote by $\mathbb{F}_q$ the finite field of order $q$ and by $\mathbb{F}_q[x]$ the polynomial ring over $\mathbb{F}_q$. The notation $\#\{A\}$ means the number of elements in the set $A$. The Hamming weight of a vector $v$ (denoted $w_H(v)$) is defined as the number of non-zero coordinates of $v$. The Hamming distance of two vectors $v_1$ and $v_2$ (denoted $d_H(v_1, v_2)$) is defined to be the number of coordinates in which $v_1$ and $v_2$ differ. We use $|x|$ to denote the absolute value of $x$.

Coding theory
Linear codes. Let $C$ be a subspace of $\mathbb{F}_q^n$ with dimension $k$. Then $C$ is called an $[n, k]_q$ linear code of length $n$ and dimension $k$. The redundancy of $C$ is then $r = n - k$. We call a vector $c = (c_1, \ldots, c_n) \in C$ a codeword of $C$, and the support of a codeword $c$ is defined as the index set
$$I(c) = \{ i : c_i \neq 0 \}.$$
Thus, we have $w_H(c) = \#\{I(c)\}$. The minimum distance of a linear code $C$ is defined as the smallest Hamming distance between two distinct codewords. Let $G$ be a $k \times n$ matrix over $\mathbb{F}_q$ whose rows are the vectors of a basis of $C$. We call $G$ a generator matrix, and the linear code $C$ is generated as $C = \{ uG : u \in \mathbb{F}_q^k \}$. We can also define $C$ by an $r \times n$ matrix $H$, called a parity-check matrix, as
$$C = \{ c \in \mathbb{F}_q^n : H c^T = 0 \},$$
i.e., $C$ is the kernel of $H$. The syndrome of a vector $v \in \mathbb{F}_q^n$ is defined as $H v^T$.
Binary Goppa codes. Classic McEliece employs irreducible binary Goppa codes defined as follows.
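Definition 1 (Binary Goppa Codes). The binary Goppa code $C$ over $\mathbb{F}_{2^m}$ is defined by a support vector $p = (\alpha_1, \ldots, \alpha_n) \in \mathbb{F}_{2^m}^n$, where $\alpha_i \neq \alpha_j$ for $i \neq j$, and the Goppa polynomial $g(x) \in \mathbb{F}_{2^m}[x]$ of degree $t$ with $g(\alpha_i) \neq 0$ for all $i$, as
$$C = \Big\{ c \in \mathbb{F}_2^n : \sum_{i=1}^{n} \frac{c_i}{x - \alpha_i} \equiv 0 \bmod g(x) \Big\}.$$
We say that the code $C$ is defined by $\Gamma = (g, (\alpha_1, \ldots, \alpha_n))$. If the Goppa polynomial $g(x)$ is irreducible, then the Goppa code $C$ has minimum distance at least $2t + 1$ and is called an irreducible binary Goppa code.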
For more information on Goppa codes and their related decoding algorithms, we refer to any textbook on the subject, like [MS78].

Classic McEliece
The first code-based cryptosystem was proposed by McEliece in 1978 [McE78] using a randomly chosen irreducible binary Goppa code. Later in 1986, Niederreiter [Nie86] proposed a dual variant of the McEliece cryptosystem that uses a parity-check matrix for encryption (rather than using a generator matrix). His original version employing Reed-Solomon codes was attacked in [SS92], but the version with irreducible binary Goppa codes is still secure. Also, it was proven in [LDW94] that the McEliece and Niederreiter variants are equivalent in security. The parameter sets are listed in Table 1, where $m$ determines the size of the binary field, $t$ represents the number of correctable errors, and $n$ the length of the code. Next, we describe the IND-CPA-secure PKE.
Key generation. First choose a random irreducible polynomial $g(x) \in \mathbb{F}_{2^m}[x]$ of degree $t$ and a list of distinct elements $(\alpha_1, \ldots, \alpha_n) \in \mathbb{F}_{2^m}^n$. Thus, we have picked a random irreducible binary Goppa code, which serves as the private key of the PKE. We then compute a $t \times n$ parity-check matrix $\tilde{H}_{\mathrm{goppa}}$ over $\mathbb{F}_{2^m}$ and transform it to a $tm \times n$ binary matrix $H_{\mathrm{goppa}}$ by replacing each entry in $\tilde{H}_{\mathrm{goppa}}$ with an $m$-bit column over $\mathbb{F}_2$. We write the matrix $H_{\mathrm{goppa}}$ in the systematic form $H_{\mathrm{goppa}} = [I_{mt} \,|\, T_{mt \times (n - mt)}]$ and set the public key to be $T$. This step of systematizing $H_{\mathrm{goppa}}$ reduces the public key size since it is unnecessary to store or communicate the identity matrix.
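The systematization step can be sketched in a few lines of Python. This is a minimal illustration over toy matrices, not the routine used by any of the cited implementations; the function name `systematic_form` is hypothetical, and the sketch simply reports failure when the leftmost square block is singular instead of modelling how a real key generation reacts to that case.

```python
def systematic_form(H):
    """Row-reduce a binary matrix H (list of 0/1 rows) to the form [I | T].

    Returns T as a list of rows, or None when the leftmost r x r block is
    singular (this sketch attempts no column swaps).
    """
    rows = [row[:] for row in H]          # work on a copy
    r = len(rows)
    for col in range(r):
        # find a pivot row with a 1 in the current column
        pivot = next((i for i in range(col, r) if rows[i][col]), None)
        if pivot is None:
            return None
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # clear the column in every other row (addition over GF(2) is XOR)
        for i in range(r):
            if i != col and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[col])]
    return [row[r:] for row in rows]      # drop the identity block


T = systematic_form([[1, 1, 0, 1],
                     [0, 1, 1, 1]])
print(T)
```

Only the block T needs to be published, which is the size saving mentioned above.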
The private key of the Classic McEliece KEM contains an additional uniform random n-bit string, which is only used in the CCA transform in case the decapsulation fails.

Input: The Classic McEliece parameters $m$, $t$, and $n$
Output: The secret key $(g(x), (\alpha_1, \alpha_2, \ldots, \alpha_n))$ and public key $T$
1: Randomly choose a list of distinct elements $(\alpha_1, \ldots, \alpha_n) \in \mathbb{F}_{2^m}^n$ as support
2: Choose a random irreducible polynomial $g(x) \in \mathbb{F}_{2^m}[x]$ of degree $t$
3: Compute the $t \times n$ parity-check matrix

Decryption. Given a ciphertext $s$, we compute the double-size syndrome $s^{(2)} = H^{(2)}_{\mathrm{goppa}} [s|0]^T$, where we append $n - mt$ zeros to the syndrome $s$. We then use the constant-time BM algorithm to compute the error locator polynomial $\sigma(x) \in \mathbb{F}_{2^m}[x]$ of $s^{(2)}$ and evaluate $\sigma(x)$ at all elements of $\mathbb{F}_{2^m}$. This polynomial evaluation over the whole finite field $\mathbb{F}_{2^m}$ can be efficiently implemented through the additive FFT (Fast Fourier Transform) procedure. In the last step, we read the partial secret key $(\alpha_1, \ldots, \alpha_n)$ and check whether $\sigma(\alpha_i) = 0$. We set the $i$-th bit $e_i = 1$ if $\sigma(\alpha_i) = 0$ and $e_i = 0$ otherwise.

Algorithm 3 Decryption for the PKE
Input: Ciphertext $s$ and the secret key $(g(x), (\alpha_1, \alpha_2, \ldots, \alpha_n))$
Output: Plaintext $e$
1: Compute a double-size $2t \times n$ parity-check matrix $\tilde{H}^{(2)}_{\mathrm{goppa}}$ and transform it to a $2mt \times n$ binary parity-check matrix $H^{(2)}_{\mathrm{goppa}}$

The Berlekamp-Massey (BM) algorithm [Mas69] is employed for computing the error locator polynomial, whose roots are the error locations. The error correction capability is $t$ since the size of the double-size syndrome vector $s^{(2)}$ is $2t$. The BM algorithm can be made constant-time due to its simplicity. We compute the syndrome polynomial from $s^{(2)}$, which is the input to the BM algorithm. The algorithm initializes polynomials $\sigma(x) = 1$ and $\beta(x) = x$ in $\mathbb{F}_{2^m}[x]$, integers $l = 0$ and $\delta = 1 \in \mathbb{F}_{2^m}$, and updates the 4-tuple $(\sigma(x), \beta(x), l, \delta)$ during the $k$-th iteration for $0 \le k \le 2t - 1$, according to certain updating rules. The final output, i.e., the found error locator polynomial, is the updated polynomial $\sigma(x)$ after the $2t$ iterations.
Another important problem is to evaluate a polynomial at multiple points, which is solved by the additive FFT algorithm in Classic McEliece. We focus on the decryption algorithm, in which one needs to evaluate the error locator polynomial at the secret support $(\alpha_1, \ldots, \alpha_n)$. The additive FFT procedure includes two steps: the radix conversion and twisting step, which transforms the input polynomial $\sigma(x)$ into many constant (1-coefficient) polynomials, and the reduction step, which iteratively evaluates at the input points using these constant polynomials.

Relations between secret key parts in Classic McEliece
If you know the public key T and some part of the secret key (g(x), (α 1 , α 2 , . . . , α n )), can you then efficiently determine other parts of the secret key? Some brief facts are described.
The support splitting algorithm. The support splitting algorithm [Sen00] proposed by Sendrier is designed to solve the code equivalence problem of determining if a linear code $C_1$ can be obtained by an index permutation of another linear code $C_2$. The input to the support splitting algorithm is two generator matrices and the output is the found permutation. For random linear codes, the dominant cost of the support splitting algorithm is $O(n^3)$ with overwhelming probability, where $n$ is the length of the code.

Key recovery
The key recovery problem of Classic McEliece is the recovery of the Goppa polynomial $g(x)$ and the vector $p = (\alpha_1, \ldots, \alpha_n)$, since such information is sufficient for decrypting ciphertexts. The key recovery problem has been investigated in [Sen00, LS01, OS09]. We can determine the polynomial $g(x)$ from the vector $p$, or determine the vector $p$ from $g(x)$ and the set $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$. If $n = 2^m$, the set $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ is the whole finite field $\mathbb{F}_{2^m}$, and thus it is sufficient to recover $g(x)$. The whole secret key $(g(x), p = (\alpha_1, \ldots, \alpha_n))$ can then be recovered by the support splitting algorithm. We just construct a generator matrix $G_0$ of the Goppa code from $g(x)$ and an arbitrary support $p_0$ over the set $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$. Feeding $G_0$ and a generator matrix $G_{\mathrm{goppa}}$ derived from $H_{\mathrm{goppa}}$ to the support splitting algorithm, we can reconstruct the secret support.
The Classic McEliece submission proposed four parameter sets with $n < 2^m$, which can provide additional security against key-recovery attacks since it is non-trivial to recover the set $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ from $g(x)$ if $2^m - n$ is not small. We return to this problem in Section 4.2.

Information set decoding (ISD)
A fundamental problem in code-based cryptography is the syndrome decoding problem, where one needs to find an unknown $e_0$ with $w_H(e_0) = w$, assuming that a parity-check matrix $H$ and a syndrome $s = H e_0^T$ are given. Prange [Pra62] initiated the research line called information set decoding, where the basic idea is to find $k$ error-free coordinates that carry sufficient information to recover the full error vector $e_0$. These $k$ coordinates are called the information set. This algorithm was further improved by a number of algorithms (e.g., [LB88, Ste88, MMT11, BJMM12, MO15]). Since one part of the new attack is based on an ISD algorithm and we instantiate it with Stern's algorithm [Ste88] for simplicity, we include a description and complexity analysis of Stern's variant as follows.

Stern's algorithm [Ste88]
Stern first introduced the idea of using the birthday paradox in information set decoding. We start with a permutation to write the parity-check matrix $H$ in a systematic form $(H_0, I)$, so the first $k$ coordinates form an information set. We denote by $\hat{H}_0$ (or $\hat{s}$) the first $l$ rows of $H_0$ (or entries of $s$). We then enumerate vectors $e$ of dimension $k/2$ and weight $p$, compute $\hat{H}_0 (e, 0)^T$ and $\hat{s} - \hat{H}_0 (0, e)^T$, and search for collisions. Last, we check whether the weight of the remaining $(r - l)$ coordinates of the obtained error vector is $(w - 2p)$.
The list size is $\binom{k/2}{p}$. We denote by $W$ the complexity of one iteration of Stern's algorithm; it includes the cost $C_{\mathrm{Gauss}}$ of Gaussian elimination, which can be set to $0.5 \cdot (n - k) k^2$ if we use a basic schoolbook form of the algorithm. The complexity of Stern's algorithm can be written as $W/P$, where $P$ is the probability that we find one solution in one iteration. Since in our problem setting the weight $w$ is larger than a threshold called the GV bound, there exist many solutions, and the estimate of $P$ takes this into account.

Neural-network-aided profiled side-channel analysis
A profiled side-channel attack consists of a profiling stage and an attack stage. Typically, a device ideally identical to the intended target is used during profiling, where the attacker has full control and can set the inputs to the device, like the secret key and the ciphertext. A large number of traces are captured through side-channel leakage while the device performs a cryptographic operation with inputs picked by the attacker. Each trace is labeled with a piece of information that is related to the selected input. The set of traces and labels is then used to construct a model that, based on an observed trace, estimates the true label of the trace. At the attack stage, the model is used to classify traces captured from the device under attack, which could be the same as or different from the profiling device. For profiling, templates introduced by [CRR03] have been used to model the relation between observed traces and labels. For this type of profiling, the estimated noise of a trace is used to determine the most probable label. Machine learning techniques, such as support vector machines and random forests, have been used for profiling to get around some of the shortcomings of the template techniques [LPB + 15]. With the rapid development of deep learning, neural networks have shown promising results for profiled side-channel attacks [KPH + 19].
Common architectures for neural networks in the context of side-channel attacks are the convolutional neural network (CNN) and the multilayer perceptron (MLP). CNNs have been shown to be less sensitive to jitter, i.e., when traces are misaligned due to clock phase variation or intentional phase variations introduced as countermeasures [CDP17]. In the case of well-synchronized traces, the MLP has been shown to be effective for profiling [RJJ + 19, Mag19].
An MLP consists of several layers, where the first layer is called the input layer and the last is called the output layer. Layers in between are called hidden layers. Rather simple MLPs, i.e., shallow neural networks with only a few hidden layers, have successfully been used to conduct side-channel attacks [NDGJ21]. Each layer in an MLP consists of several neurons that are fully connected to neurons in the previous layer. During supervised training of an MLP, traces are fed to the input layer and the predicted labels at the output layer are compared to the true labels. An optimizer is then used to tune the connections between the neurons such that the MLP becomes more accurate in predicting labels of traces.

Test vector leakage assessment
During the NIST Non-Invasive Attack Testing Workshop in 2011, Goodwill et al. [GJJR11] presented the Test Vector Leakage Assessment (TVLA) as a metric to evaluate side-channel leakage. The TVLA has been used to evaluate implemented side-channel countermeasures [BGN + 14, BGG + 15], and to locate points of interest during attacks [RJJ + 19].
During a TVLA, side-channel measurements are divided into two sets and Welch's t-test is applied at each sampling point to determine if the two sets are different by evaluating the null hypothesis that the two sets have equal mean. To perform the t-test on the two sets of measurements $T_0$ and $T_1$, the test statistic $t_{\mathrm{obs}}$ of Welch's t-test is calculated as
$$t_{\mathrm{obs}} = \frac{\hat{\mu}_0 - \hat{\mu}_1}{\sqrt{\dfrac{s_0^2}{n_0} + \dfrac{s_1^2}{n_1}}},$$
where $\hat{\mu}_i$, $s_i$, and $n_i$ are the mean, standard deviation, and cardinality of $T_i$. If $|t_{\mathrm{obs}}| \ge 4.5$, the null hypothesis is rejected at a confidence level of 99.998%, provided that $s_0 \approx s_1$ and $n_0 \approx n_1 \ge 100$.
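A point-wise Welch t-test of this kind can be sketched with the Python standard library alone. The function names `welch_t` and `tvla`, and the representation of a trace as a list of samples, are assumptions made for this illustration; the sketch also assumes that at least one of the two sets has non-zero variance at every sample point.

```python
import math
import statistics


def welch_t(set0, set1):
    """Welch's t statistic for two collections of measurements at one sample point."""
    mu0, mu1 = statistics.mean(set0), statistics.mean(set1)
    # sample variances s_0^2 and s_1^2
    var0, var1 = statistics.variance(set0), statistics.variance(set1)
    return (mu0 - mu1) / math.sqrt(var0 / len(set0) + var1 / len(set1))


def tvla(traces0, traces1, threshold=4.5):
    """Return the sample indices where |t_obs| reaches the threshold.

    traces0 and traces1 are lists of equally long traces (lists of floats).
    """
    leaky = []
    columns = zip(zip(*traces0), zip(*traces1))   # iterate sample point by sample point
    for index, (col0, col1) in enumerate(columns):
        if abs(welch_t(col0, col1)) >= threshold:
            leaky.append(index)
    return leaky


# Toy data: the two sets agree at sample 0 and differ clearly at sample 1.
group0 = [[0.0, 1.0], [0.1, 1.1], [-0.1, 0.9]]
group1 = [[0.0, 5.0], [0.1, 5.1], [-0.1, 4.9]]
print(tvla(group0, group1))
```

With so few traces the toy data says nothing about statistical confidence; it only shows how the statistic singles out a sample point.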
In the context of side-channel analysis, a rejected null hypothesis suggests that the two sets of measurements are noticeably different and leak side-channel information that possibly could be exploited.

A basic description of the new attack idea
In this section, we briefly describe the new attacking idea. We start with the attack model and then give the essence of the new key-recovery attack.

The threat model
The Classic McEliece KEM is designed for IND-CCA2 security. In this paper, we further study its side-channel resilience when being implemented in low-end devices and assume that the adversary is capable of measuring the power traces during the decapsulation process. The adversary aims to recover the secret key via: 1. the adversary firstly selects ciphertexts satisfying certain properties and sends these ciphertexts to the Classic McEliece KEM decapsulation device; 2. the adversary could physically observe the power traces.
Furthermore, the adversary is assumed to have a similar device/environment running the Classic McEliece KEM, and thus the adversary could perform profiling activities. Note that we do NOT assume the same secret key is cloned to the profiling environment. The difficulty of the attack is to design ciphertext properties for profiling to facilitate the key recovery via side channels.

The essence of the attack
The secret key of the Classic McEliece KEM consists of an irreducible polynomial $g(x)$, a vector $p = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ where $\alpha_i \in \mathbb{F}_{2^m}$ and $\alpha_i \neq \alpha_j$ for $i \neq j$, and one uniform random $n$-bit string. As the random string is only used when the KEM fails, the key recovery problem is to recover the polynomial $g(x)$ and the secret support $p = (\alpha_1, \ldots, \alpha_n)$. We aim to first partially recover the secret support $p = (\alpha_1, \ldots, \alpha_n)$ and list the main observation below.
Listing 1: Part of the decryption function in [CC21]
In Line 3 of Listing 1, the Berlekamp-Massey (BM) algorithm computes the error locator polynomial $\sigma(x)$, stored in a variable called "locator". Then, the computations in Line 4 and Line 7 only depend on the value stored in "locator", i.e., $\sigma(x)$.
To evaluate $\sigma(x)$ at all elements of $\mathbb{F}_{2^m}$, both the FPGA and ARM Cortex-M4 implementations considered in this paper make use of an additive FFT to speed up the computation. The additive FFT algorithm takes as input the polynomial $\sigma(x)$ of degree at most $t$ and outputs the value of $\sigma(\beta)$ for all $\beta \in \mathbb{F}_{2^m}$. The main idea of the additive FFT is to exploit that $(\beta + 1)^2 + (\beta + 1) = \beta^2 + \beta$ for $\beta \in \mathbb{F}_{2^m}$. Thus, by doing a radix conversion and writing $\sigma(x) = \sigma^{(0)}(x^2 + x) + x\,\sigma^{(1)}(x^2 + x)$, where $\deg(\sigma^{(0)}) = \lfloor \deg(\sigma)/2 \rfloor$ and $\deg(\sigma^{(1)}) = \lfloor (\deg(\sigma) - 1)/2 \rfloor$, the value of $\sigma(\beta + 1)$ can be easily calculated by first calculating $\sigma(\beta) = \sigma^{(0)}(\beta^2 + \beta) + \beta\,\sigma^{(1)}(\beta^2 + \beta)$, and then using $\sigma(\beta + 1) = \sigma(\beta) + \sigma^{(1)}(\beta^2 + \beta)$. Thus, $\sigma(x)$ only needs to be evaluated at half of the elements of $\mathbb{F}_{2^m}$, since the other half, consisting of the elements of the form $\beta + 1$, can easily be calculated from the first half. By twisting the basis of $\sigma^{(0)}(x)$ and $\sigma^{(1)}(x)$, half of the elements where these polynomials should be evaluated again have the form "... + 1". Thus, the radix conversion can be performed again on $\sigma^{(0)}$ and $\sigma^{(1)}$. The additive FFT repeats the twisting and radix conversion recursively until we are left with constant polynomials. These constants are then read and we recursively evaluate $\sigma(x)$. For our attack, the important point is that the values of the constants, as well as the calculation of all intermediate polynomials, only depend on the $\sigma(x)$ that we feed to the additive FFT.
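The radix conversion and the identity $\sigma(\beta + 1) = \sigma(\beta) + \sigma^{(1)}(\beta^2 + \beta)$ can be checked numerically. The following Python sketch works over the toy field $\mathbb{F}_{2^4}$ with modulus $x^4 + x + 1$, which is an assumption made for readability (the schemes discussed here use $m = 12$ or $13$). All function names are illustrative, the twisting step is omitted, and nothing here is taken from the attacked implementations.

```python
M = 4            # toy field GF(2^4)
MOD = 0b10011    # x^4 + x + 1, irreducible over GF(2)


def gf_mul(a, b):
    """Multiply two elements of GF(2^M), given as integers below 2^M."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # reduce as soon as the degree reaches M
            a ^= MOD
    return result


def poly_eval(coeffs, x):
    """Horner evaluation; coeffs[i] is the coefficient of x^i."""
    acc = 0
    for c in reversed(coeffs):
        acc = gf_mul(acc, x) ^ c
    return acc


def radix_convert(coeffs):
    """Return (s0, s1) with sigma(x) = s0(x^2 + x) + x * s1(x^2 + x)."""
    if len(coeffs) <= 2:
        padded = list(coeffs) + [0] * (2 - len(coeffs))
        return [padded[0]], [padded[1]]
    rem = list(coeffs)
    quo = [0] * (len(coeffs) - 2)
    # divide by x^2 + x, using x^i = x^(i-2) * (x^2 + x) + x^(i-1)
    for i in range(len(rem) - 1, 1, -1):
        quo[i - 2] = rem[i]
        rem[i - 1] ^= rem[i]
        rem[i] = 0
    q0, q1 = radix_convert(quo)
    # sigma = (q0(y) + x*q1(y)) * y + rem[0] + rem[1]*x, with y = x^2 + x
    return [rem[0]] + q0, [rem[1]] + q1


def check_identity(sigma):
    """Verify both evaluation formulas at every element of the field."""
    s0, s1 = radix_convert(sigma)
    for beta in range(1 << M):
        y = gf_mul(beta, beta) ^ beta                 # beta^2 + beta
        low, high = poly_eval(s0, y), poly_eval(s1, y)
        if poly_eval(sigma, beta) != low ^ gf_mul(beta, high):
            return False
        if poly_eval(sigma, beta ^ 1) != poly_eval(sigma, beta) ^ high:
            return False
    return True


print(check_identity([3, 7, 1, 9, 5]))
print(radix_convert([5, 1]))   # degree-1 locator x + 5: constants 5 and 1
```

For a degree-1 locator $x + \alpha$ the conversion stops immediately with the constants $\alpha$ and $1$, so everything the procedure computes afterwards is a function of $\alpha$ alone.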
New idea for a profiled attack. Based on Observation 1, we create ciphertexts from plaintexts $e$ with $w_H(e) = 1$; for the decryption of such a ciphertext, the computed error locator polynomial $\sigma(x)$ is a monic polynomial of degree 1, and up to $q$ error locator polynomials are possible. We can then design a profiled attack to recover $\sigma(x)$. The basic idea of the attack is described in two phases as follows.
Profiling phase: We randomly sample secret supports $p_{\mathrm{pub}}$. We then sample error vectors $e_i$ such that all the entries are zero except for the $i$-th one. Then, the non-zero position will lead to an error locator polynomial $\sigma(x) = x - p_{\mathrm{pub}}(i)$. Thus, we have $q = 2^m$ different error locator polynomials and can allocate all the traces to $q$ categories according to the corresponding error locator polynomial. We then train a neural network to classify the traces from the $q$ different categories, which are expected to have distinctive traces as explained later.
Attacking phase: In each decryption oracle call, we send the ciphertext of an error vector $e_i$. The error locator polynomial computed by the device is $\sigma(x) = x - \alpha_i$. With the classification model built in the profiling phase, we can detect $\sigma(x)$ and therefore recover $\alpha_i$. After trying all the possible $i$'s, we can recover the secret support $p$, so the required number of traces in the attack phase is at least $n$. However, not all of the $\alpha_i$'s in the secret support are required to rebuild the irreducible polynomial $g(x)$. With this observation, we can reduce the required number of test traces.
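Producing the ciphertext for a weight-one error vector needs only the public key: since the syndrome of $e_i$ under $H = [I \,|\, T]$ is the $i$-th column of $H$, it can be read off directly. The Python sketch below illustrates this for the PKE syndrome only; the function name is hypothetical, indices are 0-based, and any further ciphertext components required by the KEM are outside its scope.

```python
def single_error_ciphertext(T, i):
    """Syndrome H * e_i^T for H = [I | T], with e_i the i-th unit vector (0-based).

    T is the public matrix as a list of r rows with n - r binary entries each.
    """
    r = len(T)
    if i < r:
        # column i of the identity block
        return [1 if row == i else 0 for row in range(r)]
    # column (i - r) of the public matrix T
    return [T[row][i - r] for row in range(r)]


T = [[1, 0],
     [1, 1]]
for position in range(4):
    print(position, single_error_ciphertext(T, position))
```

In the profiling phase the trace captured for position $i$ would be labeled with the known value $p_{\mathrm{pub}}(i)$; in the attack phase the classifier output for the same ciphertext is the guess for $\alpha_i$.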
Recover $g(x)$ by polynomial factorization. Next, we show how to determine the irreducible polynomial $g(x)$ once the partial secret key $p$ is recovered. We use one valid codeword $c = (c_1, c_2, \ldots, c_n)$ and compute
$$c(x) = \sum_{i \in I(c)} \; \prod_{j \in I(c) \setminus \{i\}} (x - \alpha_j), \qquad (2)$$
where $I(c)$ denotes the support of the codeword $c$. Since we know from the definition of Goppa codes that
$$\sum_{i \in I(c)} \frac{1}{x - \alpha_i} \equiv 0 \bmod g(x),$$
the polynomial $g(x)$ divides $c(x)$, and we can compute the irreducible polynomial $g(x)$ by factoring the polynomial $c(x)$ in $\mathbb{F}_{2^m}[x]$ and choosing an irreducible factor of degree $t$. Factoring a polynomial over a finite field into irreducibles is a well-studied problem [Sho05] and can be efficiently solved by the Cantor-Zassenhaus algorithm with expected complexity $O(l^{2+o(1)} \cdot m)$ or by the Berlekamp algorithm with expected complexity $O(l^3 + l^{1+o(1)} \cdot m)$. For the new attack targeting the parameter sets of the Classic McEliece KEM, the polynomial degree $l$ is bounded by a few hundred. Thus, the complexity of such a polynomial factorization is low. The factorization procedure is efficiently implemented, for example, in the open-source software SageMath [The20]. We show in the next section that the task of computing $c(x)$ and factoring it into irreducibles can be performed at a negligible cost compared with the main complexity cost.
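The divisibility of $c(x)$ by $g(x)$ can be spelled out in one step. The derivation below is a sketch under the assumption that $c(x)$ denotes the polynomial obtained by clearing denominators in the Goppa congruence for the codeword $c$; it uses only that $g(\alpha_j) \neq 0$, so that each $x - \alpha_j$ is invertible modulo $g(x)$.

```latex
\begin{align*}
  \sum_{i \in I(c)} \frac{1}{x - \alpha_i} &\equiv 0 \pmod{g(x)} \\
  \Longrightarrow \quad
  c(x) \;=\; \sum_{i \in I(c)} \prod_{j \in I(c) \setminus \{i\}} (x - \alpha_j)
       \;=\; \Bigl( \prod_{j \in I(c)} (x - \alpha_j) \Bigr)
             \sum_{i \in I(c)} \frac{1}{x - \alpha_i}
       &\equiv 0 \pmod{g(x)} .
\end{align*}
```

The leading term of $c(x)$ has degree at most $w_H(c) - 1$, which is why a low-weight codeword keeps both the factorization and the number of required $\alpha_i$ small.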

Note.
To recover the polynomial $c(x)$ in Equation (2), one only needs to determine $\alpha_i$ where the corresponding $c_i$ is non-zero. Thus, the weight of the codeword determines the sample complexity of recovering the irreducible polynomial $g(x)$. We therefore try to find a low-weight codeword as further explained in Section 4.3.

Detailed attacks
In this section, we describe a concrete instantiation of the new attack. We first propose a partial key-recovery attack to recover the irreducible polynomial g(x), which is sketched in Algorithm 4. We then extend the partial key recovery to full key recovery.

Partial key recovery
We now describe the new attack algorithm in steps. We start with producing one binary Goppa codeword $c$ with a small weight. The support of $c$ is denoted by $I(c)$. From the systematic parity-check matrix $H_{\mathrm{goppa}}$, we can easily construct a generator matrix $G_{\mathrm{goppa}}$ in systematic form. Thus, the easiest method to find a codeword is to randomly pick one row of the generator matrix $G_{\mathrm{goppa}}$; the expected weight of the chosen codeword is $r/2 + 1$. A more advanced approach is to use an Information Set Decoding algorithm. With Stern's algorithm as discussed in Section 2.5, we can obtain a codeword with a smaller weight.
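The simple method of reading a codeword off the public key can be sketched as follows. For a parity-check matrix $[I_r \,|\, T]$ over $\mathbb{F}_2$, one systematic generator matrix is $[T^T \,|\, I_k]$; the helper names below are hypothetical and the matrices are toy-sized.

```python
def generator_row(T, j):
    """Row j of G = [T^T | I_k] for the code with parity-check matrix [I_r | T]."""
    r, k = len(T), len(T[0])
    left = [T[row][j] for row in range(r)]               # column j of T
    right = [1 if col == j else 0 for col in range(k)]   # unit vector e_j
    return left + right


def is_codeword(T, c):
    """Check that [I_r | T] * c^T = 0 over GF(2)."""
    r = len(T)
    for row in range(r):
        parity = c[row]
        for col, bit in enumerate(c[r:]):
            parity ^= T[row][col] & bit
        if parity:
            return False
    return True


T = [[1, 0],
     [1, 1]]
c = generator_row(T, 0)
print(c, sum(c), is_codeword(T, c))   # codeword, its weight, membership check
```

Each such row has one non-zero entry in the identity part plus the weight of a column of $T$, which matches the expected weight of $r/2 + 1$ when the columns of $T$ look random.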
Recovering $\alpha_i$ for $i \in I(c)$. Let $p_c = \{(i, \alpha_i) : i \in I(c)\}$. We aim to recover $p_c$ from side-channel leakage. We choose an error vector $e_i$ that is zero in all but the $i$th position, prepare the corresponding ciphertext, and send it to the decryption oracle. With the trained deep-learning model, we obtain soft information on the corresponding $\alpha_i$, i.e., a normalized vector $(l_1, l_2, \ldots, l_q) \in \mathbb{R}^q$ with $\sum_{j} l_j = 1$. A simple approach, called the naive approach, is to pick the guess with the largest likelihood. A more advanced approach, called the threshold approach, picks a threshold $\tau$ for the normalized likelihood values. If the largest likelihood value is below $\tau$, we repeat the test for $e_i$ and obtain another normalized likelihood vector $(l_1, l_2, \ldots, l_q) \in \mathbb{R}^q$. We multiply the two vectors component-wise and normalize the result. We repeat until the largest likelihood value exceeds $\tau$. With this method, we can recover all the $\alpha_i$ for $i \in I(c)$. The advantage of the threshold approach is that the number of decryption attempts is increased adaptively, and only for the unreliable $\alpha_i$. Thus, if most positions are reliable, few additional traces are needed.
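The threshold approach can be sketched in a few lines. This is a minimal illustration under stated assumptions: `query_oracle` is a hypothetical callable that triggers one decryption of the ciphertext for $e_i$ and returns the classifier's likelihood vector, and `max_queries` is an added safety cap that the text does not mention.

```python
def normalize(v):
    """Scale a non-negative vector so that its entries sum to 1."""
    s = sum(v)
    return [x / s for x in v]

def combine(acc, new):
    """Component-wise product of two likelihood vectors, renormalized."""
    return normalize([a * b for a, b in zip(acc, new)])

def recover_alpha(query_oracle, tau, max_queries=10):
    """Repeat the query for one position until the best guess reaches tau.

    query_oracle is a hypothetical stand-in for "decrypt once, classify the
    trace"; max_queries is an assumed cap, not part of the described attack.
    Returns (index of the most likely candidate, its likelihood, queries used).
    """
    acc = normalize(query_oracle())
    queries = 1
    while max(acc) < tau and queries < max_queries:
        acc = combine(acc, query_oracle())
        queries += 1
    guess = max(range(len(acc)), key=acc.__getitem__)
    return guess, acc[guess], queries

# Toy usage: a noisy classifier that always returns the same soft output.
guess, confidence, used = recover_alpha(lambda: [0.5, 0.3, 0.2], tau=0.9)
print(guess, round(confidence, 3), used)
```

With $\tau$ at or below the classifier's typical top likelihood, the loop exits after one query and the procedure reduces to the naive approach. For long runs, summing log-likelihoods instead of multiplying would avoid underflow.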
Recovering the irreducible polynomial $g(x)$. As described in Section 3, we can compute the polynomial $c(x) \in \mathbb{F}_{2^m}[x]$ in Equation (2) once the codeword $c$ and the corresponding $\alpha_i$ for $i \in I(c)$ are all determined. We recover the irreducible polynomial by factoring $c(x)$ and choosing an irreducible factor of degree $t$.
Last, even if we cannot uniquely determine g(x), we can prepare a small list of g(x)'s and recover the full key for all the candidates in the list. We can find the correct one in the list since the wrong guess of g(x) can be easily detected and discarded with the encryption and decryption tests.

Full key recovery
Recovering the full secret support when $n = 2^m$. After recovering the irreducible polynomial $g(x)$, if $n = 2^m$, i.e., for the parameter set kem/mceliece8192128, the support splitting algorithm [Sen00] can be used to recover the full secret support. Since Goppa codes behave like random codes, the complexity of the support splitting algorithm is $O(n^3)$, which is a small cost. We refer interested readers to [OS09] for details.

Attack variant when n < 2 m
When $n = 2^m$, the key recovery problem is equivalent to the problem of recovering the irreducible polynomial $g(x)$, and it is beneficial to find a codeword of low weight. This no longer holds in the more general case $n < 2^m$, since it is difficult to guess the set $\{\alpha_1, \ldots, \alpha_n\}$; we need a new approach to recover the full secret support $p$.
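We first recover the matrix $P$ formed by the first $r = mt$ columns of $\tilde{H}_{\text{goppa}}$, the binary parity-check matrix before it is brought to systematic form. Assuming the partial support $(\alpha_1, \ldots, \alpha_r)$ and the irreducible polynomial $g(x)$ are known, we can compute $\hat{P}$, whose entry in the $i$th row and the $j$th column is $\alpha_j^{i-1}/g(\alpha_j) \in \mathbb{F}_{2^m}$. We then compute $P$ by replacing each entry in $\hat{P}$ with an $m$-bit column over $\mathbb{F}_2$. Then, since $H_{\text{goppa}} = P^{-1}\tilde{H}_{\text{goppa}}$, one can recover $\tilde{H}_{\text{goppa}}$ by computing $P H_{\text{goppa}}$.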
We then build a table storing pairs $(\alpha, \mathrm{TABLE}(\alpha))$, where $\alpha$ runs through all the $2^m$ elements of $\mathbb{F}_{2^m}$. For each $\alpha \in \mathbb{F}_{2^m}$, we compute a column vector $\hat{K} \in \mathbb{F}_{2^m}^{t}$, where the $i$th entry of $\hat{K}$ is $\alpha^{i-1}/g(\alpha)$. We obtain a new vector $K \in \mathbb{F}_2^{mt}$ by replacing each entry in $\hat{K}$ with an $m$-bit column over $\mathbb{F}_2$. We put $\mathrm{TABLE}(\alpha) = K$. For each column in $\tilde{H}_{\text{goppa}}$, we can look it up in the table and find the corresponding value of $\alpha$.
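The two steps above (undoing the systematization with $P$, then a table lookup per column) can be illustrated end to end on a toy code. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses $\mathbb{F}_{2^4}$ with modulus $x^4 + x + 1$, $t = 2$, $n = 12$, and stores binary matrices as lists of column integers. The key-generation side inverts $P$ by exhaustive search, which only works at toy size, and the table is indexed by the bit-expanded column so that a lookup returns $\alpha$ directly. All names are hypothetical.

```python
import random

M, MOD = 4, 0b10011        # toy field GF(2^4) with modulus x^4 + x + 1 (assumed)
Q = 1 << M
T = 2                      # toy degree of g(x)
R = M * T                  # r = mt rows of the binary parity-check matrix
N = 12                     # toy code length with n < 2^m

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & Q:
            a ^= MOD
    return r

def gf_inv(a):
    r = 1
    for _ in range(Q - 2):
        r = gf_mul(r, a)
    return r

def poly_eval(p, x):
    r = 0
    for coef in reversed(p):
        r = gf_mul(r, x) ^ coef
    return r

def toy_goppa_poly():
    """First monic quadratic with no root in GF(2^M), hence irreducible."""
    for b in range(1, Q):
        for a in range(1, Q):
            if all(poly_eval([b, a, 1], x) for x in range(Q)):
                return [b, a, 1]

def column(alpha, g):
    """Bit-expanded column (alpha^(i-1) / g(alpha)), i = 1..T, as an R-bit integer."""
    h = gf_inv(poly_eval(g, alpha))
    col, power = 0, 1
    for i in range(T):
        col |= gf_mul(power, h) << (M * i)
        power = gf_mul(power, alpha)
    return col

def mat_vec(cols, v):
    """Multiply a binary matrix (list of column integers) by the bit vector v."""
    out = 0
    for j, c in enumerate(cols):
        if (v >> j) & 1:
            out ^= c
    return out

def toy_keygen(g, rng):
    """Secret side: draw a support whose first R columns are invertible and
    return it with the systematic matrix H = P^-1 * H_tilde (as columns)."""
    while True:
        support = rng.sample(range(Q), N)
        h_tilde = [column(a, g) for a in support]
        solve = {mat_vec(h_tilde[:R], h): h for h in range(1 << R)}
        if len(solve) == 1 << R:                  # P is invertible
            return support, [solve[c] for c in h_tilde]

def recover_support(h_sys, g, known_prefix):
    """Attacker side: rebuild P, compute P * H, and look every column up."""
    p_cols = [column(a, g) for a in known_prefix]
    table = {column(a, g): a for a in range(Q)}   # inverse view of TABLE(alpha)
    return [table[mat_vec(p_cols, h)] for h in h_sys]

g = toy_goppa_poly()
secret_support, h_sys = toy_keygen(g, random.Random(1))
recovered = recover_support(h_sys, g, secret_support[:R])
print(recovered == secret_support)
```

The lookup is unambiguous because the first two entries of a column, $1/g(\alpha)$ and $\alpha/g(\alpha)$, already determine $\alpha$.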
Based on this new approach, we could slightly modify the attack described in Section 4.1 to make it more efficient for the parameter sets with n < 2 m .
First, instead of minimizing the weight of the codeword, we aim to find a codeword $c$ minimizing $\#\{I(c) \setminus \{1, \ldots, r\}\}$. An easy solution is to generate $G_{\text{goppa}} = [T^{\mathsf{T}} \mid I_k]$ and to select one row of this matrix. We select the index set $I = I(c) \cup \{1, \ldots, r\}$ and recover the $r + 1$ values $\alpha_i$ with $i \in I$ by observing the side-channel leakage from decrypting the chosen ciphertexts. Similar to the method described in Section 4.1, we compute $c(x)$ and factor it to obtain $g(x)$. Overall, the sample complexity is $r + 1$ and the computation cost is also small. The computation of $g(x)$ by factoring $c(x)$ takes seconds, and the table has only $2^m$ entries, which is small since $m$ is only 12 or 13. The rest is dominated by the matrix multiplication $P H_{\text{goppa}}$, which costs less than $2^{40}$ bit operations.

Complexity analysis and verification
We conclude this section with the complexity analysis of the new algorithm and present some experimental verification. Since the profiling stage is irrelevant to the targeted public-secret keypair, we treat it as a pre-computation step.
Verifying the ISD step. The weight of the found codeword $c$ is a key parameter for the partial key recovery of $g(x)$, since it determines the number of $\alpha$'s that need to be recovered in the profiled side-channel attacking step. For the Classic McEliece KEM parameters, we compute the expected weight of the low-weight codeword with the complexity formula presented in Section 2.5 for Stern's algorithm. We show the results in the "Estimation" columns in Table 2. The column "random" shows the expected weight computed as $r/2 + 1$, i.e., when the codeword is generated by randomly picking a row of a systematic generator matrix; the column "$\approx 2^{40}$" shows the expected weight when the computation cost is about $2^{40}$ bit operations; the column "$\approx 2^{50}$" shows the case with about $2^{50}$ bit operations.
The complexity analysis for Stern's algorithm is well established. One question is how to translate bit-operation counts into actual CPU clock cycles, since modern CPUs can perform more than one bit operation per clock cycle. Since binary Goppa codes behave like binary random linear codes, we generate parity-check matrices for random linear codes with the same dimensions as the Classic McEliece KEM parameters and test the actual performance of Stern's algorithm. We slightly modified a C implementation by Vasseur [Vas05] that employs the AVX2 instruction set. We use eight threads on a desktop with an Intel(R) Core(TM) i7-10700K CPU @ 3.8 GHz, run the ISD algorithm for about 2 minutes, and report the obtained weights under the column "$\approx$ 2 min" in Table 2. Thus, the numbers presented in the column "$\approx 2^{50}$" can be achieved easily on a normal desktop.
We have also run a large instance against a random public key generated by the round-3 reference implementation of kem/mceliece8192128 submitted to NIST. Using twelve threads of the same desktop, we found a codeword of weight 690 after 103 seconds of computation, which supports the numbers claimed in Table 2. After 22858 seconds (≈ 6.3 hours), we obtained a codeword of weight 679. For a public key of kem/mceliece348864, after 47855 seconds (≈ 13 hours), we obtained a codeword of weight 273.
Verifying the factorization method. Given a low-weight codeword $c$, we assume that the corresponding partial secret support $p_c = \{(i, \alpha_i) : i \in I(c)\}$ has been recovered through side channels. We have verified that computing the polynomial $c(x)$ and then factoring it to find $g(x)$ as the irreducible factor of degree $t$ can be done efficiently. We implemented this procedure in SageMath. For instance, using a single thread on the desktop with an Intel(R) Core(TM) i7-10700K CPU @ 3.8 GHz, we performed this procedure ten times, targeting a secret Goppa code for the largest parameter set kem/mceliece8192128. We found the correct irreducible polynomial $g(x)$ of degree 128 all ten times, so the empirical success probability is 100%. The average time per run is only 16.3 seconds.
The overall complexity. We verified in Section 5 that when kem/mceliece348864 is implemented on our real (FPGA and ARM Cortex-M4) platforms, one decryption oracle call in the attack phase is sufficient to recover one error locator polynomial (which is equivalent to one $\alpha_i$ value in our setting), due to the significant leakage detected. Thus, the attack stage is quite efficient. For partial key recovery of $g(x)$, one can make the computation dominated by the initial information set decoding (ISD) step. This ISD step can be seen as a time-sample trade-off, since heavier ISD computation leads to an attack that requires fewer traces (i.e., less access time) on the targeted device.
Thus, one can estimate the sample complexity of the new partial key recovery of $g(x)$ for the different Classic McEliece parameters by checking the weight predictions shown in Table 2. When aiming for full key recovery, for kem/mceliece8192128 the sample complexity is the same as that of the partial key recovery, i.e., we need 688 traces when the computation cost is bounded by $2^{50}$ bit operations with Stern's algorithm. For the parameter sets other than kem/mceliece8192128, we need $r + 1$ traces to perform the full key recovery. The values of $r + 1$ for the different parameter sets are shown in the column marked "r + 1" of Table 2.
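The closed-form quantities used in this section are easy to recompute. The sketch below is only a convenience check: the $(m, n, t)$ triples are taken from the Classic McEliece specification rather than from this section, and the function and key names are hypothetical. It evaluates $r = mt$, the expected weight $r/2 + 1$ of a random generator-matrix row, and the $r + 1$ traces of the full key-recovery variant; it does not reproduce the ISD-based weight estimates of Table 2.

```python
# (m, n, t) per parameter set, as listed in the Classic McEliece specification.
PARAMETER_SETS = {
    "kem/mceliece348864": (12, 3488, 64),
    "kem/mceliece460896": (13, 4608, 96),
    "kem/mceliece6688128": (13, 6688, 128),
    "kem/mceliece6960119": (13, 6960, 119),
    "kem/mceliece8192128": (13, 8192, 128),
}

def summary(name):
    """Return the simple sample-complexity quantities for one parameter set."""
    m, n, t = PARAMETER_SETS[name]
    r = m * t
    return {
        "r": r,
        "random_row_weight": r / 2 + 1,   # expected weight of a random row
        "variant_traces": r + 1,          # full key recovery when n < 2^m
        "full_length": n == 1 << m,       # True only when n = 2^m
    }

for name in PARAMETER_SETS:
    print(name, summary(name))
```

Only kem/mceliece8192128 has $n = 2^m$; for that set the relevant trace count is the codeword weight reached by ISD, not $r + 1$.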

Experimental results
In this section, we present the detailed results of our side-channel key-recovery attack against a hardware and a software implementation of Classic McEliece. To capture traces, we make use of the open-source Chipwhisperer system that consists of hardware modules and software specifically developed by NewAE for evaluating hardware security [New].

On the reference FPGA implementation
For the attack on the hardware implementation, we use the CW305 from NewAE that contains a XC7A100T2FTG256 Artix 7 FPGA. The implementation of [WSN18] is synthesized with the parameter set kem/mceliece348864 and implemented on the FPGA. To capture traces, we use the Chipwhisperer-lite (CWL) from NewAE that measures power consumption of the FPGA by the voltage drop over a shunt resistor placed inline with the supply to the FPGA. Both the FPGA and the CWL are driven by a common 5 MHz clock and the measured power is amplified by 46 dB before being digitized.

Leakage assessment
To evaluate whether the hardware implementation leaks sensitive side-channel information, we perform a fixed-vs-random TVLA. More specifically, we want to determine whether the power consumption while processing a specific $\sigma(x)$ of degree 1 differs from that of a random $\sigma(x)$ of degree 1. Initially, we randomly pick an element $\gamma \in \mathbb{F}_{2^m}$. Then, we construct a set of random keys $K_L$ such that, for each $k_L \in K_L$, the support $p = (\alpha_1, \alpha_2, \ldots, \alpha_n)$ satisfies $\gamma = p_i$ for some $i \in \{1, 2, \ldots, n\}$. During trace capture, we randomly decide to capture either a fixed or a random trace. For a fixed trace, we randomly select a $k_L \in K_L$ and determine $i$ such that $\gamma = p_i$. For a random trace, we randomly pick a $k_L \in K_L$ and an integer $i$ such that $\gamma \neq p_i$. In both cases, we create a plaintext $e$ where the $i$th bit $e_i = 1$ and all other bits are zero. The secret key and the ciphertext $s$ are then transferred to the FPGA. We then capture a trace while the FPGA runs the decryption algorithm.
Traces are then split into two sets $T_0$ and $T_1$, where $T_0$ contains all fixed traces and $T_1$ contains all random traces. Finally, we apply Welch's t-test to the two sets. We then repeat the same procedure for multiple $\gamma$'s. Fig. 1 shows the test statistic for one of the $\gamma$'s. It can be seen that possibly exploitable side-channel information leaks during the additive FFT step. Note that the double syndrome construction step was stripped from the traces before the TVLA, as the execution time of this step depends on $w_H(s)$. If this step had been kept, subsequent decryption steps would be misaligned in the captured traces, which could lead to falsely detected leakage.
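For readers unfamiliar with the test, the per-sample Welch statistic can be sketched as follows. This is a generic illustration on synthetic data, not the evaluation code used for Fig. 1: the trace generator, its leak position, and the set sizes are made up, and the threshold $|t| > 4.5$ is the value commonly used in TVLA practice rather than one stated in this text.

```python
import random
from math import sqrt
from statistics import mean, variance

def welch_t(set0, set1):
    """Welch's t statistic at every sample point of two sets of traces."""
    n0, n1 = len(set0), len(set1)
    t_values = []
    for col0, col1 in zip(zip(*set0), zip(*set1)):
        std_err = sqrt(variance(col0) / n0 + variance(col1) / n1)
        t_values.append((mean(col0) - mean(col1)) / std_err)
    return t_values

# Synthetic stand-in for captured traces: Gaussian noise, with a constant
# offset at sample 3 for the "fixed" set to mimic data-dependent leakage.
rng = random.Random(0)

def synthetic_trace(leak):
    trace = [rng.gauss(0.0, 1.0) for _ in range(6)]
    trace[3] += leak
    return trace

fixed_set = [synthetic_trace(2.0) for _ in range(200)]    # plays the role of T_0
random_set = [synthetic_trace(0.0) for _ in range(200)]   # plays the role of T_1

t_values = welch_t(fixed_set, random_set)
leaky_points = [i for i, t in enumerate(t_values) if abs(t) > 4.5]
print(leaky_points)
```

As the text notes, such a test is only meaningful on aligned traces, which is why the variable-time step is stripped beforehand.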

Profile phase
In the profile phase, we only capture traces of the additive FFT step, as this is the only step where we could detect leakage for our suggested attack. By focusing on the additive FFT step and using a sample rate of one sample per clock cycle, each trace contains 1095 samples. Initially, we create a set of random keys $K_p$. We capture a set of traces $T_p$ with cardinality $d = \#\{T_p\}$. For each $T_j \in T_p$, $j = 1, 2, \ldots, d$, we pick a random $k_p \in K_p$ and a random integer $i \in \{1, \ldots, n\}$. We capture $T_j$ while the FPGA performs decryption of the plaintext $e$ where the $i$th bit $e_i = 1$ and all other bits are zero. We label each $T_j$ with the support element at position $i$ of the key $k_p$, i.e., we label a trace with $\gamma = \alpha_i$.
We use $T_p$ to train an MLP that predicts $\gamma$, i.e., the root of $\sigma(x)$, for each trace. We split $T_p$ into a training set $T_{\text{train}}$ and a validation set $T_{\text{val}}$ with a ratio of 5:1. We preprocess all traces by removing the mean and scaling to unit variance before training the MLP. The preprocessing parameters are computed solely from $T_{\text{train}}$.
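The preprocessing described here can be sketched as below. It is a plain-Python illustration with hypothetical helper names and toy two-sample "traces"; the text does not say which library was used, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def split_train_val(traces, ratio=5):
    """Split into training and validation parts with a ratio:1 size ratio."""
    cut = len(traces) * ratio // (ratio + 1)
    return traces[:cut], traces[cut:]

def fit_scaler(train):
    """Per-sample mean and standard deviation, from the training set only."""
    columns = list(zip(*train))
    mu = [mean(c) for c in columns]
    sigma = [pstdev(c) or 1.0 for c in columns]   # guard against constant samples
    return mu, sigma

def apply_scaler(traces, mu, sigma):
    """Remove the mean and scale to unit variance with the given parameters."""
    return [[(x - m) / s for x, m, s in zip(t, mu, sigma)] for t in traces]

# Toy data: 12 traces of 2 samples each (real traces have 1095 samples).
traces = [[float(i), 10.0 + 2.0 * i] for i in range(12)]
train, val = split_train_val(traces)
mu, sigma = fit_scaler(train)
train_z = apply_scaler(train, mu, sigma)
val_z = apply_scaler(val, mu, sigma)      # reuses the training statistics
print(len(train), len(val), mu, sigma)
```

Fitting the scaler on the training part alone keeps validation and attack traces from influencing the preprocessing; in practice the traces would also be shuffled before splitting.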
The architecture of our MLP is presented in Table 3. All weights in the network are initialized by sampling from a uniform distribution and all biases are initially set to zero. We train the MLP using mini-batches of 150 traces each. To evaluate the performance of the MLP we use the cross-entropy loss, and we use the Nadam optimizer with a learning rate of $10^{-3}$ to tune the weights and biases of the network. We regularize the training by employing label smoothing with a value of 0.2. We let the training run for a maximum of 100 epochs, with an early stop condition if the validation loss is not reduced within 4 epochs. For profiling we use $\#\{K_p\} = 30$ and $d = 418560$.
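Two of these training settings have a simple arithmetic meaning that can be shown without a deep-learning framework. The sketch below makes two assumptions that the text does not confirm: label smoothing follows the common formulation $(1 - \varepsilon)\,y + \varepsilon/K$ for $K$ classes, and the early stop behaves like a patience counter on the validation loss. The function names are hypothetical and the network itself (Table 3) is not reproduced.

```python
from math import inf, log

def smooth_label(true_class, num_classes, eps=0.2):
    """One-hot target softened as (1 - eps) * onehot + eps / num_classes."""
    off = eps / num_classes
    return [off + (1.0 - eps) * (k == true_class) for k in range(num_classes)]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * log(p) for t, p in zip(target, predicted) if t > 0)

def stopping_epoch(val_losses, patience=4, max_epochs=100):
    """Epoch at which training stops: after `patience` epochs without a new
    best validation loss, or when the epoch budget is exhausted."""
    best, stale = inf, 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_losses), max_epochs)

target = smooth_label(true_class=1, num_classes=4)
print(target, cross_entropy(target, [0.1, 0.7, 0.1, 0.1]))
print(stopping_epoch([1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.5]))
```

With $q = 4096$ classes and $\varepsilon = 0.2$, this formulation would put roughly 0.8 on the labelled field element and spread the remaining mass evenly over all classes.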

Attack phase
To evaluate our approach, the trained model is used to attack a set of random keys $K_a$, where $K_a \cap K_p = \emptyset$ and $\#\{K_a\} = 30$. For each $k_a \in K_a$, we collect a trace corresponding to the decryption of every possible plaintext $e_i$ with only the $i$th bit set to 1. For each trace, we also record the value of $i$. In total, we collect 104640 traces. Next, we employ the previously trained classifier to predict the value of each $\gamma$. Since we recorded the value of $i$, we know the position of the predicted $\gamma$ in $p = (\alpha_1, \alpha_2, \ldots, \alpha_n)$. Thus, we can predict the value of $\alpha_i$. With our attack, we successfully recover the complete secret support for all attacked keys. This means that the experimental prediction rate was 100%.

On an ARM Cortex-M4 implementation
For the attack on the software implementation, we use the CW308T-STM32F from NewAE which contains a STM32F415RGT6 microcontroller that is programmed with the kem/mceliece348864 implementation of [CC21]. The software implementation is built using the arm-none-eabi-gcc compiler set with optimization level -O3. We only program the microcontroller with the decryption function which is the first function called during decapsulation in [CC21]. During trace capture, the microcontroller and the CWL are driven by a common 24 MHz clock, and the measured power is amplified by 32 dB.

Leakage assessment
To evaluate possible leakage points for the software implementation, we use an approach similar to the one used for the hardware implementation. However, as the software implementation takes roughly 340 times more clock cycles to execute, the buffer of the CWL is too small to capture the complete decryption with one sample per clock cycle. We solve this by performing a piece-wise TVLA of the complete decryption. In Fig. 2 we see that the software implementation shows a large number of leakage points. In particular, we see many leakage points during the additive FFT step, marked by a gray box in Fig. 2. The TVLA also shows leakage during other parts of the decryption. The leakage detected towards the end of the decryption is related to the re-encryption of the recovered plaintext. The re-encryption feature is not implemented in the hardware reference design.
The software implementation makes heavy use of bit-sliced operations to speed up the decryption process, with the coefficients of $\sigma(x)$ stored in an array. Each row of the array represents a coefficient of $\sigma(x)$ and the columns represent the bits of each coefficient. During a large part of the additive FFT, operations are carried out on columns, i.e., on bits of the coefficients rather than on whole coefficients. During our attack, we make sure that $\sigma(x)$ has a single root, and thus the individual bits of the root should affect the power consumption at different times during the FFT. To investigate this, we captured 34880 traces while the additive FFT was fed a random degree-1 polynomial. For each bit $i$ of the 12 bits representing elements of $\mathbb{F}_{2^m}$, we assigned the traces to one of two sets depending on the value of bit $i$. Fig. 3 shows the difference between the means of the two sets for each bit position $i$. The left part of Fig. 3 shows the difference for the complete additive FFT, where the black dashed lines separate the steps of the FFT corresponding to Lines 2-5 in Listing 2. The right part of Fig. 3 shows the second step of the FFT, which corresponds to the loops at Lines 3-11 in Listing 3. In this part, all the degree-1 polynomials from the radix conversion step are evaluated. As can be seen, the power consumption at different times depends on bits of the $\sigma(x)$ that we fed to the FFT. This can be explained by the assignment at Line 8 in Listing 3, where the variable vv is assigned the value 0 or 1 depending on a single bit of the input to the function.
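The per-bit comparison behind Fig. 3 amounts to a difference of mean traces. The sketch below illustrates the computation on synthetic data in which sample $s$ of a toy trace simply equals bit $s$ of the root; the data, the 4-bit width, and the function name are assumptions for illustration, not measurements or code from the experiment.

```python
from statistics import mean

def bit_mean_difference(traces, roots, bit):
    """Mean trace of the set with root bit = 1 minus that of the set with bit = 0."""
    ones = [t for t, r in zip(traces, roots) if (r >> bit) & 1]
    zeros = [t for t, r in zip(traces, roots) if not (r >> bit) & 1]
    return [mean(a) - mean(b) for a, b in zip(zip(*ones), zip(*zeros))]

# Synthetic stand-in: 4-bit roots, and sample s of each trace leaks bit s.
roots = list(range(16))
traces = [[float((r >> s) & 1) for s in range(4)] for r in roots]

diffs = {bit: bit_mean_difference(traces, roots, bit) for bit in range(4)}
for bit, d in diffs.items():
    print(bit, d)
```

In this toy setting each bit produces a single peak at its own sample, which mirrors the observation that different bits of the root influence the power consumption at different times.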

Profile phase
In the profile phase we focus only on the additive FFT step of the algorithm. As the additive FFT step is too long to be sampled with the CWL at every clock cycle, there are two options. We could capture the FFT step in intervals with one sample per clock cycle, or we could capture the complete FFT step with a lower sample rate. As the time and memory requirements of the subsequent neural network training would be higher for the first capture procedure, we decided to test whether the latter option works. Therefore, we reduce the sample rate to one sample every 12th microcontroller clock cycle. This results in traces consisting of 22988 sample points. Apart from the number of sample points, we use the same procedure to capture traces as for the hardware implementation. We train the MLP given in Table 4 with the same initialization, optimizer and regularization as for the profiling on the hardware implementation. We let the training run for a maximum of 100 epochs, with an early stop condition if the validation loss is not improved within 3 epochs. For profiling of the software classifier we use $\#\{K_p\} = 10$ and $d = 34880$.

Attack phase
We evaluate our classifier by attacking a set of random keys $K_a$, where $K_a \cap K_p = \emptyset$ and $\#\{K_a\} = 90$. We use the same procedure to capture traces as for the hardware implementation, i.e., for each $k_a \in K_a$ we capture 3488 traces, which are then fed to the classifier. Thus, in total we collect 313920 traces. With our attack against the software implementation, we successfully recover the complete secret support for all the attacked keys. Again, the experimental prediction rate was 100%.

Alternative approach
Since our neural network classifiers performed very well, we also tested if some simple statistical method could be used instead of the neural network. Here we describe the attack on the software implementation of Classic McEliece but we also performed the same type of attack on the hardware implementation.

Profile phase
We capture a set of traces $T_p$ corresponding to Line 3 in Listing 3, i.e., the evaluation of the degree-1 polynomials output by the radix conversion function. For each $T_j \in T_p$, $j = 1, 2, \ldots, d$, we pick a key $k_p$ from a set of random keys $K_p$ and a random integer $i \in \{1, \ldots, n\}$. We capture $T_j$ while the microcontroller performs decryption of the plaintext $e$ where the $i$th bit $e_i = 1$ and all other bits are zero. We label each $T_j$ with the known support element at position $i$ of the key $k_p$, i.e., we label a trace with the field element $\alpha_i$. Next, we split the set $T_p$ into sets $A_k$, $k = 0, 1, \ldots, 4095$, based on the labels of the traces. For each $A_k$, we calculate the mean $\mu_{A_k}$ of the traces in the set. During the profile phase, we use $\#\{T_p\} = 313920$ and $\#\{K_p\} = 90$.

Attack phase
First, we load a random key $k_a \in K_a$, where $K_a \cap K_p = \emptyset$. Next, we collect traces $T_i$, $i = 1, 2, \ldots, 3488$, corresponding to the decryption of every possible plaintext $e_i$ with only the $i$th bit set to 1. For each trace, we record the value of $i$, and for each key $k_a$ we collect 3488 traces. To predict the secret support, we compare each trace $T_i$ with all the $\mu_{A_k}$ from the profile phase and assign the prediction according to Equation (3). Since we recorded the value of $i$, this gives us a prediction of the secret support element at position $i$, i.e., $\alpha_i$. We evaluate our distinguisher on 10 keys and fully recover the secret support of each key. This means that the experimental success rate was 100%. We performed the same type of attack on the hardware implementation, but in this case the success rate was 0.05%, which is only slightly better than the random-guess success rate of 0.024%.
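A mean-template distinguisher of this kind can be sketched as follows. The comparison rule is an assumption: the sketch picks the label whose mean trace is closest in squared Euclidean distance, which may differ from the rule the authors used. The toy traces, labels, and function names are hypothetical.

```python
from collections import defaultdict

def build_templates(traces, labels):
    """Mean trace per label, i.e. one template per field element."""
    groups = defaultdict(list)
    for trace, label in zip(traces, labels):
        groups[label].append(trace)
    return {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in groups.items()
    }

def classify(trace, templates):
    """Label of the template closest to the trace (squared Euclidean distance,
    an assumed metric for this sketch)."""
    def distance(label):
        return sum((x - m) ** 2 for x, m in zip(trace, templates[label]))
    return min(templates, key=distance)

# Toy profiling data: three labels, two traces each, three samples per trace.
profile_traces = [
    [0.0, 1.0, 0.1], [0.2, 0.9, 0.0],     # label 0
    [1.0, 0.0, 0.1], [0.9, 0.1, 0.0],     # label 1
    [0.5, 0.5, 1.0], [0.4, 0.6, 0.9],     # label 2
]
profile_labels = [0, 0, 1, 1, 2, 2]

templates = build_templates(profile_traces, profile_labels)
print(classify([0.95, 0.05, 0.05], templates))
```

In the setting described above there would be 4096 templates, one per set $A_k$, and each of the 3488 attack traces per key would be classified independently.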

Discussions and concluding remarks
In this paper, we have presented the first key-recovery side-channel attack on Classic McEliece, a KEM finalist in the NIST Post-quantum Cryptography Standardization Project. We have identified a general vulnerability in the design of the decryption algorithm: the additive FFT procedure for evaluating a polynomial at many points is deterministic for a fixed input error locator polynomial. We then designed special ciphertexts generated by error vectors of Hamming weight 1 and captured traces of the decryption of such ciphertexts. Since the error weight is only 1, there are only $q = 2^m$ possible error locator polynomials. We then used machine-learning algorithms to recover the secret error locator polynomial, which reveals one entry in the secret support $p = (\alpha_1, \ldots, \alpha_n)$. We have also designed new algorithms to recover the partial key $g(x)$ and the full secret key, based on the side-channel leakage.
We implemented and measured the new attack on real FPGA and ARM Cortex-M4 platforms. We recover the secret $\alpha_i$ perfectly from one trace, i.e., the empirical success probability is 100% with the naive approach. Note that on a noisier platform, we can use the threshold approach to adaptively request more decryption traces for the unreliable $\alpha_i$.
The sample complexity of the partial key recovery attack depends on the computation limit of the first ISD step in the proposed attack. We could achieve 273 traces for kem/mceliece348864 and 679 traces for kem/mceliece8192128. For the case n = q, i.e. kem/mceliece8192128, a partial key recovery is equivalent to a full key recovery with the help of the support splitting algorithm.
The other four parameter sets, with $n < q$, provide more security against this side-channel attack. We designed a variant of the full key recovery attack with a sample complexity of $r + 1$ traces, which is 769 for kem/mceliece348864. Future work is to study better algorithms to efficiently recover the full secret support from the recovered $g(x)$ when $n < q$.
On the one hand, the Classic McEliece KEM still requires more traces compared with the recent results in attacking lattice-based primitives. For instance, in [NDGJ21], the secret key of a masked Saber version could be recovered with around 20 traces. On the other hand, this work shows that protections like masking are necessary, though they may hurt the performance of the scheme. Thus, research on efficiently protected implementations of Classic McEliece should be prioritized.
Our key-recovery attack requires slightly more traces than the message-recovery EM attack reported at Asiacrypt 2020 [LNPS20]. However, we highlight that key-recovery attacks are much stronger than message-recovery attacks. Also, the sample complexity could in theory be reduced considerably. One approach is to send a ciphertext obtained from a plaintext of higher weight (e.g., weight 2). For $e$ with $w_H(e) = 2$, there are $q(q - 1)/2$ possible error locator polynomials to profile. This number is about $2^{23}$ for kem/mceliece348864 and about $2^{25}$ for the other parameter sets. Then, assuming that one error locator polynomial can be recovered by one decryption oracle call, each call determines two $\alpha_i$. The overall sample complexity can thus be reduced by a factor of 2. This improved attack has a high implementation complexity but is still doable; we leave it for future work. Last, an interesting future direction is to improve the current attack without a significant increase in the profiling complexity.
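The class counts quoted for the weight-2 variant follow directly from $q(q-1)/2$ with $q = 2^m$. A short check, with a hypothetical function name:

```python
from math import comb, log2

def weight2_locators(m):
    """Number of unordered pairs of distinct elements of GF(2^m), q(q - 1)/2."""
    q = 1 << m
    return comb(q, 2)

for m in (12, 13):
    count = weight2_locators(m)
    print(f"m = {m}: {count} classes, about 2^{log2(count):.2f}")
```

Compared with the $q = 2^{12}$ or $2^{13}$ classes of the weight-1 attack, the classifier would have to separate several million classes, which is the source of the higher profiling cost mentioned above.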
The presented results are limited to the traditional setting of power analysis. Extensions to EM analysis seem plausible, since the accuracy of the measurements is very high, but they are not investigated in this work. Devices facing threats from power analysis would have to implement countermeasures. A low-order masked hardware implementation may be quite safe, but it can still severely hurt performance. We have not found such a protected implementation. Low-order masked software implementations may, on the other hand, be attacked similarly to the attacks on lattice-based and code-based KEMs. An adversary using machine learning can perform higher-order attacks, since the input traces contain information related to all the shares [BDK+20, NDGJ21, BDH+21, BGR+21, ABH+22].