Defeating Low-Cost Countermeasures against Side-Channel Attacks in Lattice-based Encryption: A Case Study on Crystals-Kyber

Abstract. In an effort to circumvent the high cost of standard countermeasures against side-channel attacks in post-quantum cryptography, some works have developed low-cost detection-based countermeasures. These countermeasures try to detect maliciously generated input ciphertexts and react to them by discarding the ciphertext or secret key. In this work, we take a look at two previously proposed low-cost countermeasures, the ciphertext sanity check and the decapsulation failure check, and demonstrate successful attacks on these schemes. We show that the first countermeasure can be broken with little to no overhead, while the second countermeasure requires a more elaborate attack strategy that relies on valid chosen ciphertexts. Thus, in this work, we propose the first chosen-ciphertext based side-channel attack that only relies on valid ciphertexts for key recovery. As part of this attack, a third contribution of our paper is an improved solver that retrieves the secret key from linear inequalities constructed using side-channel leakage from the decryption procedure. Our solver is an improvement over the state-of-the-art Belief Propagation solvers by Pessl and Prokop, and later Delvaux. Our method is simpler, easier to understand and has lower computational complexity, while needing less than half the inequalities compared to previous methods.


Introduction
In July 2022, a new Key Encapsulation Mechanism (KEM) and three digital signature schemes were selected as the first standardized post-quantum cryptography (PQC) algorithms [AAC+22], the result of an extensive NIST post-quantum standardization process. Kyber KEM [ABD+20], based on the well-known Module Learning With Errors (MLWE) problem, was the only KEM selected for standardization, and we will soon witness wide-scale adoption of Kyber on a variety of computing devices, including resource-constrained platforms such as embedded microcontrollers [KRSS19, AHKS22]. This naturally makes these implementations susceptible to side-channel attacks (SCA), which were also an important consideration during the NIST PQC standardization process [RCDB22]. Thus, several works have reported SCA on Kyber KEM and proposed concrete countermeasures [BGR+21, RRD+23, RRCB20].
[...] has been the de-facto approach in prior works [PP21, HPP21, Del22]. Our solver is easier to understand, more efficient to run, and can recover the secret key with fewer than half the linear inequalities required by the state of the art.
We perform experimental validation of our chosen-ciphertext (CC) attack using valid ciphertexts on both the unprotected and masked open-source implementations of Kyber KEM, taken from the pqm4 library [KRSS] and the mkm4 library [HKL+22], respectively. On the unprotected implementations, we can recover the secret key in ≈ 325 traces for the reference implementation and ≈ 7800 traces for the optimized implementation. We also show that our attack can be adapted to both the shuffled and masked implementations in a straightforward manner. While both shuffling and masking increase the attacker's complexity for key recovery, they do not concretely prevent the attack.
Our work therefore shows that low-cost detection countermeasures can be rendered completely ineffective and do not offer standalone protection against CC-based side-channel attacks. While these countermeasures are attractive for designers, our work encourages more study towards the development and analysis of new detection-based countermeasures against CC-based side-channel attacks.

Availability of Software
All the software used in this work is placed in the public domain at the following link: https://github.com/thalespaiva/sca_on_kyber_lcc.

Notations
For any prime $q$, we let $\mathbb{Z}_q$ denote the field of integers modulo $q$. Let $R_q$ be the polynomial ring $\mathbb{Z}_q[X]/(X^n + 1)$; then $R_q^k$ is a module of dimension $k$. Polynomials $a \in R_q$ are denoted using lower case letters. Vectors $\mathbf{a} \in R_q^k$ and matrices $\mathbf{A} \in R_q^{k \times k}$ are denoted in bold using lower and upper case, respectively. When $\mathbf{u}, \mathbf{v} \in R_q^k$, we let $\langle \mathbf{u}, \mathbf{v} \rangle \in R_q$ denote their dot product. The $i$th coefficient of a polynomial $a$ is denoted as $a[i]$. For any set $V$, we write $v \xleftarrow{\$} U(V)$ to denote that $v$ is chosen uniformly at random from $V$. We denote as $\chi$ the Centered Binomial Distribution (CBD) with range $[-\eta, \eta]$. In this case, we abuse notation and let $\mathbf{a} \xleftarrow{\$} \chi(R_q^k)$ mean that each coefficient of each polynomial of $\mathbf{a} \in R_q^k$ is drawn according to $\chi$. Rounding a coefficient of a polynomial $a \in R_q$ from modulus $q$ to modulus $p$ is denoted as $\lfloor a \rceil_{p \to q}$. Furthermore, rounding a floating point value $a \in \mathbb{R}$ to the nearest integer is denoted as $\lfloor a \rceil$, with ties being rounded up.
Let poly_to_vec be the function that, given a polynomial $a \in R_q$, returns the vector in $\mathbb{Z}_q^n$ consisting of the $n$ coefficients of $a$. Furthermore, let $\mathrm{negashift}_i$ be the function that returns a negacyclic shift of a vector by $i$ positions. That is, if $a = a_0 + a_1 x + \ldots + a_{n-1} x^{n-1}$, then $\mathrm{negashift}_i(a) = (a_i, \ldots, a_0, -a_{n-1}, \ldots, -a_{i+1})$. Using this notation, we can express the coefficients of the product of polynomials $a$ and $b$ in the negacyclic ring $R_q$ in vector form as

$$\mathrm{poly\_to\_vec}(ab)[i] = \langle \mathrm{negashift}_i(a), \mathrm{poly\_to\_vec}(b) \rangle.$$

If we extend these two notations to vectors of polynomials $\mathbf{u}, \mathbf{v} \in R_q^k$ by applying poly_to_vec and negashift to each of the $k$ polynomials, we get the analogous property

$$\mathrm{poly\_to\_vec}(\langle \mathbf{u}, \mathbf{v} \rangle)[i] = \langle \mathrm{negashift}_i(\mathbf{u}), \mathrm{poly\_to\_vec}(\mathbf{v}) \rangle. \qquad (1)$$

Algorithm 1: CPAPKE.KeyGen.
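The negacyclic-shift identity can be checked numerically. The following is a minimal sketch, assuming a toy degree n = 8 instead of Kyber's n = 256; the helper names `negacyclic_mul` and `dot` are illustrative and are not taken from the paper's software.

```python
import random

Q, N = 3329, 8  # Kyber modulus; toy degree (Kyber uses n = 256)

def negacyclic_mul(a, b):
    """Schoolbook product of a and b in Z_q[X]/(X^n + 1)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % Q
            else:  # X^n = -1, so wrapped terms change sign
                c[k - N] = (c[k - N] - a[i] * b[j]) % Q
    return c

def negashift(a, i):
    """Return (a_i, ..., a_0, -a_{n-1}, ..., -a_{i+1})."""
    return [a[i - j] if j <= i else -a[N + i - j] for j in range(N)]

def dot(x, y):
    """Inner product of two coefficient vectors, reduced modulo q."""
    return sum(xi * yi for xi, yi in zip(x, y)) % Q

rng = random.Random(1)
a = [rng.randrange(Q) for _ in range(N)]
b = [rng.randrange(Q) for _ in range(N)]
ab = negacyclic_mul(a, b)

# Coefficient i of the product equals <negashift_i(a), poly_to_vec(b)>.
identity_holds = all(ab[i] == dot(negashift(a, i), b) for i in range(N))
```

The same row construction is what later turns each coefficient of a polynomial product into one linear relation in the unknown secret coefficients.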

Kyber KEM
Kyber is a CCA-secure KEM based on the hardness of the Module Learning With Errors (MLWE) problem [ABD+20]. It offers parameter sets for the three NIST security levels 1, 3, and 5, named Kyber512, Kyber768, and Kyber1024, respectively. All instantiations operate over the same anti-cyclic polynomial ring $R_q = \mathbb{Z}_q[X]/(X^n + 1)$ with prime modulus $q = 3329$ and degree $n = 256$. This ring is used to build the module $R_q^k$, where $k = 2$, $3$ or $4$ for security levels 1, 3, and 5, respectively. The CCA-secure Kyber contains, at its core, an IND-CPA secure PKE scheme, which is reviewed next.

Key generation:
The key generation procedure shown in Algorithm 1 essentially involves the creation of an LWE instance. The coefficients of the secret key $\mathbf{s} \in R_q^k$ and error $\mathbf{e} \in R_q^k$ are sampled from the narrow centered binomial distribution $\chi$, while the coefficients of $\mathbf{A}$ are sampled from the uniform distribution $U$. The MLWE instance is computed as $\mathbf{b} = \mathbf{A} \cdot \mathbf{s} + \mathbf{e}$. The public key is the pair $(\mathbf{A}, \mathbf{b})$, while the secret key is $\mathbf{s}$.

Encryption:
The encryption procedure shown in Algorithm 2 first samples $\mathbf{r} \in R_q^k$, $\mathbf{e}_1 \in R_q^k$ and $e_2 \in R_q$ from $\chi$, which, together with the public key $pk = (\mathbf{A}, \mathbf{b})$, are used to compute two LWE components: $\mathbf{u} \in R_q^k$ and $v \in R_q$. The polynomial $v$ embeds the 256-bit message $m \in \{0, 1\}^{256}$, which is encoded as a polynomial in $R_q$ by simply multiplying each bit of the message by $\lfloor q/2 \rceil$. The pair $(\mathbf{u}, v)$ undergoes coefficient-wise compression, after which it serves as the ciphertext output.

Decryption:
The decryption procedure shown in Algorithm 3 (CPAPKE.Dec) involves the computation of the noisy message polynomial $m' = (v' - \mathbf{s}^T \cdot \mathbf{u}') \in R_q$, where $\mathbf{u}' \in R_q^k$ and $v' \in R_q$ are the decompressed versions of the values $(\mathbf{u}, v)$ computed during encryption. Compression and decompression introduce a certain error in $\mathbf{u}$ and $v$, which we denote as $\Delta \mathbf{u} = \mathbf{u}' - \mathbf{u}$ and $\Delta v = v' - v$, respectively. The noisy message polynomial $m'$ is then decoded to the correct message $m \in \{0, 1\}^{256}$ by decoding each of its coefficients $m'[i]$ to 0 or to 1, depending on whether $m'[i]$ is closer to 0 or to $q/2$.
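Let us review why the decryption works. Notice that the noisy message polynomial $m'$ can be represented as follows:

$$m' = v' - \langle \mathbf{s}, \mathbf{u}' \rangle = \left( \langle \mathbf{b}, \mathbf{r} \rangle + e_2 + \lfloor q/2 \rceil \cdot m + \Delta v \right) - \langle \mathbf{s}, \mathbf{A}^T \mathbf{r} + \mathbf{e}_1 + \Delta \mathbf{u} \rangle.$$

After expanding each term, substituting $\mathbf{b} = \mathbf{A} \cdot \mathbf{s} + \mathbf{e}$ and simplifying, we get that

$$m' = \lfloor q/2 \rceil \cdot m + \Delta m, \quad \text{where} \quad \Delta m = \langle \mathbf{e}, \mathbf{r} \rangle - \langle \mathbf{s}, \mathbf{e}_1 + \Delta \mathbf{u} \rangle + e_2 + \Delta v. \qquad (2)$$

Notice that $\Delta m$ is made of components sampled from the narrow distribution $\chi$ together with the small compression errors, and therefore $m'$ can be seen as a noisy version of the original message polynomial $\lfloor q/2 \rceil \cdot m$. The Kyber parameters are chosen to ensure that the coefficients of $\Delta m$ are small enough so that decryption errors occur only with negligible probability.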
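The noise decomposition can be verified on a toy instance. The sketch below assumes a toy degree n = 8, k = 2, and compression widths of 10 and 4 bits; all helper names (`mul`, `inner`, `cbd`, `lossy`) are illustrative and not part of any Kyber library. It checks that the decrypted polynomial equals the encoded message plus the noise term Δm = ⟨e, r⟩ − ⟨s, e1 + Δu⟩ + e2 + Δv.

```python
import random

Q, N, K, ETA = 3329, 8, 2, 2   # toy degree N = 8 (Kyber uses n = 256)
DU, DV = 10, 4                 # compression bit widths (assumed for this sketch)
HALF_Q = (Q + 1) // 2          # round(q / 2) = 1665
rng = random.Random(7)

def mul(a, b):
    """Negacyclic product in Z_q[X]/(X^n + 1)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            sign = 1 if i + j < N else -1
            c[(i + j) % N] = (c[(i + j) % N] + sign * a[i] * b[j]) % Q
    return c

def add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def sub(a, b):
    return [(x - y) % Q for x, y in zip(a, b)]

def inner(u, v):
    """Dot product of two vectors of polynomials."""
    acc = [0] * N
    for x, y in zip(u, v):
        acc = add(acc, mul(x, y))
    return acc

def cbd():
    """One polynomial with centered binomial coefficients in [-ETA, ETA]."""
    return [sum(rng.randint(0, 1) - rng.randint(0, 1) for _ in range(ETA)) % Q
            for _ in range(N)]

def lossy(poly, d):
    """Compress every coefficient to d bits, then decompress it again."""
    out = []
    for x in poly:
        y = (((x << d) + Q // 2) // Q) % (1 << d)   # round(2^d * x / q) mod 2^d
        out.append((y * Q + (1 << (d - 1))) >> d)   # round(q * y / 2^d)
    return out

# Key generation: b = A s + e
A = [[[rng.randrange(Q) for _ in range(N)] for _ in range(K)] for _ in range(K)]
s = [cbd() for _ in range(K)]
e = [cbd() for _ in range(K)]
b = [add(inner(A[i], s), e[i]) for i in range(K)]

# Encryption: u = A^T r + e1, v = <b, r> + e2 + round(q/2) * m
m = [rng.randint(0, 1) for _ in range(N)]
r = [cbd() for _ in range(K)]
e1 = [cbd() for _ in range(K)]
e2 = cbd()
At = [[A[j][i] for j in range(K)] for i in range(K)]
u = [add(inner(At[i], r), e1[i]) for i in range(K)]
encoded = [HALF_Q * bit for bit in m]
v = add(add(inner(b, r), e2), encoded)

# Compression errors: du = u' - u and dv = v' - v
u_dec = [lossy(p, DU) for p in u]
v_dec = lossy(v, DV)
du = [sub(x, y) for x, y in zip(u_dec, u)]
dv = sub(v_dec, v)

# Decryption and the predicted noise term
m_noisy = sub(v_dec, inner(s, u_dec))
e1_du = [add(x, y) for x, y in zip(e1, du)]
delta_m = add(sub(inner(e, r), inner(s, e1_du)), add(e2, dv))
expected = add(encoded, delta_m)
decoded = [1 if Q // 4 < c < 3 * Q // 4 else 0 for c in m_noisy]
```

With these toy sizes the noise stays far below q/4, so `decoded` equals `m`; the point of the sketch is the exact equality between `m_noisy` and `expected`.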

IND-CCA Secure Kyber KEM
The IND-CPA secure Kyber PKE can be transformed into an IND-CCA secure KEM using the well-known Fujisaki-Okamoto (FO) transform [FO99]. This involves instantiating the CPAPKE.Enc and CPAPKE.Dec procedures of the Kyber PKE scheme along with several instances of hash functions, resulting in the IND-CCA secure encapsulation (CCAKEM.Encaps) and decapsulation (CCAKEM.Decaps) procedures, respectively. Algorithms 4-6 supply the details. The main idea is that the randomness required for CPAPKE.Enc is made explicit through a seed $r$ derived from the message $m$ and the public key $pk$. This enables the decapsulation procedure to retrieve the message $m$ from the ciphertext $ct$, compute the seed $r$, and re-compute the ciphertext $ct'$.
Subsequently, the recomputed ciphertext $ct'$ is compared with the received ciphertext $ct$. This comparison succeeds only for valid ciphertexts and fails for invalid ciphertexts with very high probability. In theory, the FO transform thus checks the validity of ciphertexts through a re-encryption procedure after decryption, so that, with very high probability, the attacker only sees decapsulation failures for invalid ciphertexts. This provides strong theoretical security guarantees against chosen-ciphertext attacks.

Prior Works
The decapsulation procedure of Kyber KEM has been subjected to a wide variety of SCA for key recovery [BDH+21, RRCB20, HHP+21, RRD+23]. These attacks can be broadly classified into two categories, based on the attacker's knowledge of or control over the ciphertexts processed by the decapsulation procedure: (1) Known Ciphertext (KC) based SCA and (2) Chosen Ciphertext (CC) based SCA.

Known Ciphertext (KC) based SCA:
The dot product between the ciphertext component $\mathbf{u} \in R_q^k$ and the secret $\mathbf{s} \in R_q^k$ (Line 3 in the CPAPKE.Dec procedure in Alg. 6) has been the sole target for KC-based SCA, as it directly manipulates the secret key. There are single-trace template-style attacks targeting the INTT operation [PPM17, PP19] as well as the pointwise multiplication operation [YRZ+23, BBB+23]. There are also multiple-trace attacks [MWK+22] based on the well-known Correlation Power Analysis (CPA), which require ≈ 200 traces for key recovery. Since the polynomial multiplication operation has been heavily targeted by side-channel attacks, the designer might incorporate additional countermeasures to mitigate potential key recovery.

Chosen Ciphertext (CC) based SCA:
The modus operandi of side-channel attacks using chosen ciphertexts [BDH+21, RRCB20, RRD+23, UXT+22] is as follows. The attacker crafts specially structured malicious ciphertexts such that the corresponding decrypted message $m$ (unknown to the attacker) becomes related to the secret key $sk$ (or a part of it). This ensures that leakage from the entire FO transform can be utilized as an oracle for key recovery. These attacks can be broadly classified into three types: (1) Plaintext Checking (PC) Oracle-based SCA [RRCB20, UXT+22, TUX+23], (2) Full-Decryption (FD) Oracle-based SCA [XPRO20], and (3) Decryption-Failure (DF) Oracle-based SCA [BDH+21].

Comparing KC-based SCA and CC-based SCA
KC-based attacks exploit leakage for valid ciphertexts, and thus can only target leakage from the polynomial multiplication operation involving the secret key (Line 3 in the CPAPKE.Dec procedure in Alg. 6) for key recovery [PPM17, PP19, YRZ+23, MWK+22]. In contrast, CC-based side-channel attacks form the largest category of attacks on Kyber KEM and are more powerful than KC attacks, as the attacker can exploit leakage from the entire decapsulation procedure for key recovery. Thus, protection against CC-based attacks is naturally costlier than protection against KC-based attacks, and analyzing countermeasures against CC-based side-channel attacks remains the main focus of our work.
There exist concrete proposals for masking, as well as combined masking and shuffling, to protect against such CC-based attacks. However, they are typically expensive, incurring several factors of overhead, particularly in terms of runtime [BGR+21, HKL+22]. Thus, an alternative approach to countering such attacks has been the development of low-cost countermeasures that attempt to detect such malicious ciphertexts [XPRO20, RCDB22]. If a ciphertext is detected as malicious, the device under test (DUT) can choose to refresh the secret key, ensuring that further exposure to attacks is prevented.

Detection Countermeasures against CC-based SCA
In this section, we broadly discuss two concrete proposals for such detection-based countermeasures against CC-based side-channel attacks.

Ciphertext Sanity Check:
The malicious ciphertexts used for PC oracle and FD oracle-based attacks are typically very sparse and skewed, with most of the coefficients having a value of 0 [XPRO20, RCDB22]. However, valid ciphertexts are LWE instances whose coefficients are uniformly distributed in the range $[0, q)$. Thus, [RCDB22] proposed to test the entropy of input ciphertexts to detect PC oracle or FD oracle-based attacks. This countermeasure works as follows.
In the ciphertext sanity check, the mean and variance are used as measures for the entropy of the input ciphertext. We denote the mean and standard deviation of the coefficients of a polynomial $x \in R_q$ as $\mu(x)$ and $\sigma(x)$, respectively. The main idea of the countermeasure is to compute $\mu(x)$ and $\sigma(x)$ for all the polynomials in the input ciphertext (i.e., $c_1$ and $c_2$) and to reject the ciphertext if the variance of these inputs is too low. One main advantage of this approach is that the malicious ciphertexts are rejected even before they are processed by the decapsulation procedure. Thus, the attacker cannot observe any leakage corresponding to these malicious ciphertexts, which offers concrete protection against PC oracle and FD oracle attacks that utilize skewed ciphertexts. Ravi et al. [RCDB22] showed that this countermeasure only incurs about 5% overhead in runtime. However, we remark that this countermeasure does not protect against the DF oracle-based attacks [BDH+21], which already utilize uniformly random ciphertexts that are invalid.
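A minimal sketch of such a check is given below. The rejection rule and the threshold `MIN_STD` are assumptions made for illustration and are not the values used in [RCDB22]; the sparse ciphertext uses hypothetical coefficient values.

```python
import random
import statistics

Q, N = 3329, 256

def coefficient_stats(poly):
    """Mean and standard deviation of the coefficients of one polynomial."""
    return statistics.fmean(poly), statistics.pstdev(poly)

def passes_sanity_check(polys, min_std):
    """Accept only if every polynomial has a sufficiently large spread."""
    return all(coefficient_stats(p)[1] >= min_std for p in polys)

rng = random.Random(0)

# Uniform coefficients in [0, q) have a standard deviation close to q / sqrt(12).
MIN_STD = 0.8 * Q / 12 ** 0.5   # hypothetical threshold, not taken from [RCDB22]

# A ciphertext with uniformly random coefficients (3 polynomials of u, 1 of v).
uniform_ct = [[rng.randrange(Q) for _ in range(N)] for _ in range(4)]

# A sparse PC-oracle style ciphertext: u = [y, 0, 0], v = z (hypothetical y, z).
sparse_ct = [[416] + [0] * (N - 1), [0] * N, [0] * N, [624] + [0] * (N - 1)]

uniform_ok = passes_sanity_check(uniform_ct, MIN_STD)
sparse_ok = passes_sanity_check(sparse_ct, MIN_STD)
```

In this sketch the uniformly random ciphertext is accepted while the sparse one is rejected, which is the behaviour the countermeasure relies on.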

Decapsulation Failure Check:
A basic observation is that all CC-based side-channel attacks utilize invalid ciphertexts, which always induce a decapsulation failure. Thus, a simpler countermeasure at the protocol level would be to refresh the secret key immediately upon observing a decapsulation failure. In this case, the attacker is restricted to recovering the secret key with only a single trace. This countermeasure therefore offers concrete protection against all types of CC-based side-channel attacks, since most if not all CC-based side-channel attacks require multiple traces (a few tens to a few thousands) for key recovery. In this work, we perform a concrete analysis of both these countermeasures and demonstrate novel CC-based side-channel attacks that are capable of breaking both of them.

Analysis of Ciphertext Sanity Check Countermeasure
In this section, we demonstrate a novel attack technique that re-enables the PC oracle and FD oracle-based attacks, which are otherwise not possible on implementations protected with the ciphertext sanity check countermeasure, with only a small reduction in the accuracy of the attack. This is done by using the public key to mask a malicious attack ciphertext so that it looks uniformly random. The setup for our attack is exactly the same as that of standard PC oracle and FD oracle-based SCA [RRCB20, XPRO20, RRD+23], and our technique simply involves adding a mask to the chosen ciphertexts used for the PC and FD oracle-based SCA. We start by briefly describing the adversary model applicable to our attack.

Adversary model:
The attacker attempts to recover the long-term secret key $sk$ used by the target's decapsulation procedure of Kyber KEM. We assume physical access to the DUT performing decapsulation for power/EM measurements. Since our attack is a CC-based attack, we assume that the attacker can query the target decapsulation procedure with ciphertexts of their choice. We note that this is a standard adversarial model used in several CC-based side-channel attacks [RRCB20, XPRO20, BDH+21].

Attack Intuition:
For the intuitive explanation, we assume that the public key and ciphertext vectors consist of only one polynomial (i.e., $k = 1$) and that no ciphertext compression is used. The main idea behind the masking can be explained as follows: imagine a public key $(A, b = A \cdot s + e)$ and an attack ciphertext $(u_{atk}, v_{atk})$ used in a PC oracle or FD oracle attack that would not pass the ciphertext sanity check. The goal is to obtain the message polynomial

$$m' = v_{atk} - s \cdot u_{atk}. \qquad (3)$$

To this end, we adapt our attack ciphertext to $(u_{atk} + A, v_{atk} + b)$. The adapted ciphertext looks uniformly random and is accepted. The message in this case is

$$m' = (v_{atk} + b) - s \cdot (u_{atk} + A) = v_{atk} - s \cdot u_{atk} + e,$$

which is the original equation up to the small error $e$. One countermeasure against this attack might be to check whether the ciphertext is close to the public key, or more specifically whether $(u - A, v - b)$ is small. However, such a countermeasure can easily be circumvented by adding, instead of the public key itself, a small multiple of the public key (e.g., $(-A, -b)$ or $(2A, 2b)$) or a rotated version of the public key (e.g., $(X \cdot A, X \cdot b)$).
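The effect of the public-key mask in this simplified setting (k = 1, no compression) can be checked with a few lines. This is a sketch with a toy degree n = 8 and hypothetical attack values 416 and 624; the helper names are illustrative.

```python
import random

Q, N = 3329, 8   # toy degree (Kyber uses n = 256)
rng = random.Random(3)

def mul(a, b):
    """Negacyclic product in Z_q[X]/(X^n + 1)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            sign = 1 if i + j < N else -1
            c[(i + j) % N] = (c[(i + j) % N] + sign * a[i] * b[j]) % Q
    return c

def add(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]

def sub(a, b):
    return [(x - y) % Q for x, y in zip(a, b)]

def small():
    """Polynomial with small coefficients in [-2, 2], stored modulo q."""
    return [rng.randint(-2, 2) % Q for _ in range(N)]

# Public key (A, b = A * s + e) for k = 1.
A = [rng.randrange(Q) for _ in range(N)]
s, e = small(), small()
b = add(mul(A, s), e)

# Sparse attack ciphertext (hypothetical values).
u_atk = [416] + [0] * (N - 1)
v_atk = [624] + [0] * (N - 1)

unmasked = sub(v_atk, mul(s, u_atk))                   # v_atk - s * u_atk
masked = sub(add(v_atk, b), mul(s, add(u_atk, A)))     # decryption of the masked ciphertext
difference = sub(masked, unmasked)                     # should equal the key error e
```

The masked ciphertext decrypts to the same polynomial as the unmasked one, shifted only by the small error polynomial `e`.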

Attack on Kyber:
We now go into more detail on the practical implementation of our attack, where we show how to mask the ciphertexts used for the PC oracle-based side-channel attack. As in previous works, the goal of our attack is to input a chosen ciphertext $(u_{atk}, v_{atk})$ of the form $u_{atk} = [y, 0, 0]$ and $v_{atk} = z$ with $|y| < q/4$.
The first message bit $m[0]$ is then calculated as

$$m[0] = \mathrm{Decode}(z - y \cdot s[0][0]),$$

where Decode maps a coefficient to 0 or 1 depending on whether it is closer to 0 or to $q/2$. The bit $m[0]$ therefore depends only on the first coefficient of the secret, i.e., $s[0][0]$, while all other message bits are zero. The message thus has only two possible values, and the attacker can instantiate a practical PC oracle through side-channel leakage from the decapsulation procedure. The attacker chooses suitable values for the tuple $(y, z)$ such that the message can be used as a binary distinguisher for the first secret coefficient. By repeating this procedure for other values of $y$ and $z$, and for other coefficients of $\mathbf{s}$, one can retrieve the full secret key. Notice that this is a well-known attack procedure [RRCB20, RRD+23]. Table 1 shows the values of $y$ and $z$ used by Rajendran et al. [RRD+23] to mount an attack against Kyber768.
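To illustrate how one pair (y, z) acts as a binary distinguisher, the sketch below evaluates the decoded bit for every candidate value of the secret coefficient. The pair (416, 624) is a hypothetical choice made for this example and is not taken from Table 1; ciphertext compression is ignored.

```python
Q = 3329

def decode_bit(x):
    """Decode a coefficient: 1 if x mod q is closer to q/2 than to 0."""
    x %= Q
    return 1 if Q / 4 < x < 3 * Q / 4 else 0

def oracle_bit(y, z, s0):
    """Bit m[0] for the ciphertext u = [y, 0, 0], v = z and secret coefficient s0."""
    return decode_bit(z - y * s0)

Y, Z = 416, 624   # hypothetical pair, not a value from Table 1

# Response of the oracle for every candidate coefficient in [-2, 2].
responses = {s0: oracle_bit(Y, Z, s0) for s0 in range(-2, 3)}

# The remaining message bits only see -y * s[i] and therefore decode to zero.
other_bits = [decode_bit(-Y * si) for si in range(-2, 3)]
```

With this pair, the oracle answers 1 exactly when the targeted coefficient is negative; further pairs would be needed to separate the remaining candidates.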
The attack strategy described above does not pass the ciphertext sanity check, and thus we have to add the public key to mask the input ciphertext. A first practical problem is that the size of $\mathbf{u} \in R_q^{k \times 1}$ is not equal to the size of $\mathbf{A} \in R_q^{k \times k}$, and the same holds for $v \in R_q^{1 \times 1}$ and $\mathbf{b} \in R_q^{k \times 1}$. This can be solved by selecting only one row of the matrix $\mathbf{A}$, which we denote $\mathbf{a}^*$, and the corresponding polynomial element of the vector $\mathbf{b}$, which we denote $b^*$. The introduced error, denoted $e^*$, is the corresponding element of the vector $\mathbf{e}$. A second inconvenience is that the ciphertext is compressed, which means that we cannot input the ciphertext exactly. This introduces an error in the values of $\mathbf{a}^* + \mathbf{u}_{atk}$ and $b^* + v_{atk}$ that can be input in the attack ciphertext. One can calculate that this results in a message $m'$ defined as

$$m' = v_{atk} - \langle \mathbf{s}, \mathbf{u}_{atk} \rangle + E, \quad \text{where} \quad E = e^* - \langle \mathbf{s}, \Delta \mathbf{u} \rangle + \Delta v. \qquad (7)$$

In the above equation, the values of $\Delta \mathbf{u}$ and $\Delta v$ implicitly capture the additional rounding error that occurs due to the addition of the public key mask. This is essentially the noisy message equation from Section 2.2.1 where $r = 1$, $e_1 = 0$ and $e_2 = 0$ (interestingly, one can see the original unmasked attack as an instance where $r = 0$). This means that the message $m'$ calculated in the attack (see Equation 7) has an additional error term due to the compression and decompression.
For power-of-two moduli schemes (e.g., Frodo, Saber), there is typically an additional public key compression of the $\mathbf{b}$ term. The compression, in this case, works as follows: first, LSBs are discarded from $\log_2(q)$ bits to $\log_2(p)$ bits in public key compression, then from $\log_2(p)$ to $\log_2(T)$ bits in ciphertext compression. This means that the public key compression is essentially contained in the ciphertext compression, so its effect can be ignored.
For a given mask $(\mathbf{a}^*, b^*)$, the value of $E$ is nearly identical for the recovery of different coefficients. During an attack, $\Delta v$ and only one coefficient of $\Delta \mathbf{u}$ change slightly due to different rounding, while all other coefficients are unchanged. Thus, the value of $E$ changes only slightly, which we verified experimentally. This means that each mask $(\mathbf{a}^*, b^*)$ is tied to a certain unknown approximate error $E^*$. If this approximate error is too large, the attack will fail with high probability, but if it is low enough, the attack proceeds without the masking having any effect other than defeating the detection countermeasure. Note that one cannot predict the value of this error $E^*$ in advance. If the key recovery fails, one can restart the attack with a different mask, e.g., using a different row of $\mathbf{A}$ and $\mathbf{b}$, using a rotated version $(X \cdot \mathbf{a}^*, X \cdot b^*)$, or using a multiple of the mask $(c \cdot \mathbf{a}^*, c \cdot b^*)$.

Result:
We performed software simulations of our proposed ciphertext masking approach on the malicious chosen ciphertexts of the binary PC oracle attack proposed in [RRD+23]. We did not perform a side-channel evaluation, as the realization of the oracle has already been demonstrated in [RRD+23], and it applies in exactly the same manner to our proposed attack. In our experiments on Kyber768 to recover 200 secret keys, we were able to recover all of them. However, a single randomly selected mask $(\mathbf{a}^*, b^*)$ yields a success probability of 57.8% for full key recovery. Since the success of each run is independent, the expected number of runs for full key recovery is $1/0.578 \approx 1.73$. This means that one has to restart only 0.73 times on average before hitting a working mask and thus performing a successful attack. Our proposed ciphertext masking approach can be used for the binary and parallel PC oracle attacks as in [RRD+23], thereby only requiring a few hundred ciphertext queries for full key recovery. Our technique can also be used to mask ciphertexts for the more efficient FD oracle-based side-channel attacks [XPRO20], which have been shown to require ≈ 10 traces for full key recovery.
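The expected number of runs follows from the geometric distribution. As a short worked derivation, assuming independent attempts that each succeed with probability p = 0.578:

```latex
\mathbb{E}[\text{runs}]
  = \sum_{t \ge 1} t \, p \, (1 - p)^{t - 1}
  = \frac{1}{p}
  = \frac{1}{0.578}
  \approx 1.73,
\qquad
\mathbb{E}[\text{restarts}] = \frac{1}{p} - 1 \approx 0.73 .
```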

Analysis of Decapsulation Failure Check Countermeasure
The goal of this section is to develop attacks that can circumvent countermeasures that refresh the secret key when a decapsulation failure is observed. Defeating this countermeasure calls for a completely new approach that uses valid chosen ciphertexts for key recovery. This makes our approach significantly different from typical state-of-the-art attacks, which rely on invalid ciphertexts that fail the re-encryption check with overwhelming probability. We start by briefly describing the adversary model for our attack.

Adversary Model:
The standard adversarial model used for CC-based side-channel attacks (described in Section 3) is applicable to this attack, along with some additional assumptions. All the chosen ciphertexts used in our attack are valid ciphertexts. Since it is a profiled attack, the attacker has access to a clone device on which they can control the secret key, so as to build side-channel templates for any intermediate variable of their choice.

Targets for CC-based SCA with Valid Ciphertexts:
Since the attacker needs to rely on valid ciphertexts to defeat the decapsulation failure check, they are limited to targeting operations in the decryption procedure for key recovery. KC-based side-channel attacks [PPM17, PP19, MWK+22, YRZ+23, BBB+23] work with valid ciphertexts, and thus can defeat the decapsulation failure check countermeasure in multiple ways, through single-trace as well as multiple-trace attacks. However, these attacks solely target the polynomial multiplication operation. Thus, protecting this operation alone can prevent key recovery by KC-based SCA. This raises the question of whether there are other operations within the decryption procedure that can be targeted with valid ciphertexts, bypassing the decapsulation failure check countermeasure.
In this work, we show a novel CC-based attack with valid ciphertexts capable of full key recovery by targeting a different operation, namely the manipulation of the noisy message polynomial $m'$ within the decryption procedure. Exploiting this leakage for practical attacks is not trivial, for two main reasons. Firstly, it is not possible to mount CPA-style attacks that use a divide-and-conquer approach, since every coefficient of $m'$ depends on all coefficients of the secret key. Secondly, while the value of $m'$ depends on the secret key, operations manipulating $m'$ do not directly involve the secret key. This means that one will only obtain indirect information on the secret, and no direct information on individual coefficients as in typical CC attacks. Consequently, to obtain the key from the leakage of $m'$, one needs an additional key recovery step. In this work, we demonstrate how to overcome these obstacles and show that an attacker who queries only valid chosen ciphertexts can still exploit leakage of $m'$ for efficient key recovery attacks.

Intuitive Explanation of our Attack
We represent the noisy message polynomial $m'$ as $m' = \lfloor q/2 \rceil \cdot m + \Delta m$, where $\Delta m$ is the decryption noise component. We recall from Section 2.2.1 that $\Delta m$ is linearly dependent on the secret $\mathbf{s}$ and error $\mathbf{e}$ as

$$\Delta m = \langle \mathbf{e}, \mathbf{r} \rangle - \langle \mathbf{s}, \mathbf{e}_1 + \Delta \mathbf{u} \rangle + e_2 + \Delta v.$$

For a valid chosen ciphertext, i.e., a ciphertext generated by running the encapsulation procedure (Alg. 5), the attacker knows $\mathbf{r}$, $\mathbf{e}_1$ and $e_2$, as well as the extra noise factors $\Delta \mathbf{u}$ and $\Delta v$, which are caused by the compression and decompression of ciphertexts. Knowledge of $\Delta m \in R_q$ for $2 \cdot k$ ciphertexts results in direct recovery of the secret key $\mathbf{s}$ using linear algebra. The same applies if single coefficients of $\Delta m$ can be recovered across $2 \cdot k \cdot n$ ciphertexts. We show that leakage of the Hamming weight of coefficients of $m'$ for valid ciphertexts can be used to define bounds on $\Delta m$. From there, we can use solvers for systems of linear inequalities to retrieve the secret vector $\mathbf{s}$ and the error $\mathbf{e}$. Our chosen-ciphertext attack proceeds in three steps:

Step 1. Recovering HW($m'$) through Side-Channels: During decapsulation, the processing of the polynomial $m' \in R_q$ leaks the Hamming weight (HW) of each of its coefficients. In Section 5, we experimentally demonstrate how to recover the HW of the coefficients of $m'$ from different implementations of Kyber.

Step 2. Relating HW($m'$) with $\Delta m$: We observe that for a message bit $m_i = 0$, the distribution of HW($m'[i]$) has a very clear bias that indicates the sign of $\Delta m[i]$. In particular, when HW($m'[i]$) is very low, then $\Delta m[i] > 0$ with very high probability, while a very high value of HW($m'[i]$) signifies $\Delta m[i] < 0$ with very high probability. This information can be used to construct a linear inequality in the coefficients of $\mathbf{s}$ and $\mathbf{e}$. A sufficient number of such inequalities can be used to recover the secret key.

Step 3. Solving Linear Inequalities in $\mathbf{s}$ and $\mathbf{e}$ for Key Recovery: In this step, we solve the linear inequalities from Step 2 to retrieve the secret vectors $\mathbf{s}$ and $\mathbf{e}$. This approach is shown in Section 6. Pessl and Prokop [PP21], Hermelink et al. [HPP21] and later Delvaux [Del22] proposed efficient methods based on Belief Propagation techniques to solve these linear inequalities in the context of fault attacks. We propose a novel greedy solver that is simpler to understand, uses fewer computational resources and, at the same time, requires fewer than half the inequalities to extract the secret key.

In Section 5, we provide details of Step 1 of our attack, involving the recovery of HW($m'$) through side-channels. Subsequently, in Section 6, we demonstrate how this side-channel information can be fed into our novel solver for key recovery (Steps 2 and 3 of our attack).
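Steps 1 and 2 can be sketched on simulated, noise-free leakage. The code below rests on several assumptions made only for illustration: a toy degree n = 8 with k = 2, coefficients held as 16-bit two's-complement words (so small negative values have a high Hamming weight), hypothetical thresholds `LOW` and `HIGH`, and random stand-ins for the known terms e1 + Δu and e2 + Δv. It does not implement the solver of Step 3.

```python
import random

N, K, ETA = 8, 2, 2   # toy degree (Kyber uses n = 256)
LOW, HIGH = 4, 12     # hypothetical Hamming weight thresholds
rng = random.Random(5)

def small(bound=ETA):
    return [rng.randint(-bound, bound) for _ in range(N)]

def mul(a, b):
    """Negacyclic product over the integers (values stay far below q here)."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - N] -= a[i] * b[j]
    return c

def negashift(a, i):
    return [a[i - j] if j <= i else -a[N + i - j] for j in range(N)]

def hw16(x):
    """Hamming weight of x stored as a 16-bit two's-complement word."""
    return bin(x & 0xFFFF).count("1")

def delta_m(e, s, r, e1_du, e2_dv):
    """Noise polynomial <e, r> - <s, e1 + du> + (e2 + dv)."""
    out = list(e2_dv)
    for p in range(K):
        er, su = mul(e[p], r[p]), mul(s[p], e1_du[p])
        out = [o + x - y for o, x, y in zip(out, er, su)]
    return out

def inequality(r, e1_du, e2_dv, i):
    """Row and constant with delta_m[i] = <row, (e, s)> + const."""
    row = []
    for poly in r:
        row += negashift(poly, i)
    for poly in e1_du:
        row += [-c for c in negashift(poly, i)]
    return row, e2_dv[i]

s = [small() for _ in range(K)]
e = [small() for _ in range(K)]
secret = [c for poly in e + s for c in poly]   # unknown vector (e, s)

inequalities, consistent = [], True
for _ in range(200):                  # 200 simulated valid ciphertexts
    r = [small() for _ in range(K)]
    e1_du = [small(3) for _ in range(K)]   # stand-in for e1 + du (known to attacker)
    e2_dv = small(100)                     # stand-in for e2 + dv (known to attacker)
    dm = delta_m(e, s, r, e1_du, e2_dv)
    for i in range(N):
        row, const = inequality(r, e1_du, e2_dv, i)
        consistent &= dm[i] == sum(a * x for a, x in zip(row, secret)) + const
        hw = hw16(dm[i])              # simulated leakage for message bit m_i = 0
        if hw <= LOW:
            inequalities.append((row, const, +1))   # <row, x> + const >= 0
        elif hw >= HIGH:
            inequalities.append((row, const, -1))   # <row, x> + const < 0
```

Each retained tuple is one linear inequality in the unknown coefficients of (e, s); coefficients whose Hamming weight falls between the two thresholds are discarded as uninformative.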

Recovering HW(m ) through Side-Channels
We start by describing the experimental setup used in this work. We performed experiments on the reference and optimized implementations of Kyber KEM from the well-known pqm4 library [KRSS], running on an STM32F407 microcontroller clocked at 24 MHz. We obtained EM side-channel leakage using a Langer RF-U 5-2 near-field EM probe. The traces were collected using a Lecroy 610Zi oscilloscope at a sampling rate of 1.25 GSa/s, amplified by 30 dB with a pre-amplifier, and filtered using a 48 MHz low-pass filter.

Detecting Leakage from Coefficients of m
We now demonstrate leakage detection from the manipulation of single coefficients of $m'$ in two different implementations of Kyber KEM: (1) the reference implementation [ABD+20] and (2) the assembly-optimized implementation [HZZ+22].

Reference Implementation
Refer to Fig. 1 for the assembly code snippet of the subtraction operation between $\mathbf{s}^T \cdot \mathbf{u}$ and $v$ (Line 4 in the CPAPKE.Dec procedure in Alg. 3), when compiled with the O3 optimization level. The coefficients of $m'$ are computed and stored in memory sequentially, one coefficient at a time, which allows us to observe leakage from multiple coefficients simultaneously. To illustrate this, we simultaneously profile the leakage of the first 8 coefficients $m'[i]$ for $i \in \{0, \ldots, 7\}$.
We build 100k valid ciphertexts for random messages $m$, but choose $m_i = 0$ for $i \in \{0, \ldots, 7\}$. We perform CPA for the coefficients $m'[i]$ for $i \in \{0, \ldots, 7\}$; Fig. 3(a) shows the 8 resulting CPA plots, one for each coefficient, for the O3 optimized implementation. The peaks for each of the coefficients are easy to distinguish and sufficiently separated from one another, indicating clear leakage from individual coefficients.

Assembly Optimized Implementation
We studied the assembly-optimized implementation of Kyber KEM from Huang et al. [HZZ+22], and refer to Fig. 2 for the highly optimized hand-written assembly code snippet of the polynomial subtraction operation. We observe that 10 coefficients of its operands are simultaneously loaded into 5 registers using the vectorized ldmia instruction (Lines 3, 5). Each register is packed with two coefficients (upper and lower halves), and the subtraction operation for the 10 coefficients is carried out by five usub16 instructions (Lines 7-11). Then, the vectorized store instruction (stmia) is used to simultaneously store the 10 coefficients in memory. As a result, one can only observe simultaneous leakage from 10 coefficients, which results in significant noise when targeting the leakage of one coefficient. To detect leakage from a single coefficient, say $m'[0]$, we adopt the following noise reduction technique. From Fig. 4(a)-(b), we observe that the HW distribution of coefficients when $m_i = 1$ is much narrower than when $m_i = 0$. Thus, to reduce noise from the other coefficients $m'[i]$ for $i \in \{1, \ldots, 9\}$ that are stored along with $m'[0]$, we fix their corresponding message bits $m_i = 1$ for $i \in \{1, \ldots, 9\}$ in the encapsulation procedure. This approach can be extended to exploit simultaneous leakage from multiple coefficients in the following manner. To exploit leakage from, say, the 13 coefficients $m'[i \cdot 10]$ for $i \in [0, 12]$, we choose $m_{(i \cdot 10)} = 0$ for $i \in [0, 12]$, while the bits corresponding to the remaining coefficients that are stored along with each $m'[i \cdot 10]$ are fixed to 1. The remaining bits of the message $m$ are randomized. We build 100k ciphertexts for such random messages.
Figure 3(b) shows 13 CPA plots, corresponding to leakage from the 13 coefficients $m'[i \cdot 10]$ for $i \in \{0, \ldots, 12\}$. Notice that, even though 10 coefficients are simultaneously stored in memory, we are able to observe leakage from a single targeted coefficient by setting the other 9 message bits to 1. We can observe the CPA peaks corresponding to the targeted coefficients, and the peaks are identical and uniformly separated from one another. This demonstrates that simultaneous leakage from one in every ten coefficients can be exploited for key recovery. We have thus shown that leakage from single coefficients of $m'$ can be observed in both the reference and assembly-optimized implementations. In the following, we demonstrate how the detected leakage can be exploited to recover HW($m'[i]$).
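The message layout used for this noise reduction can be sketched as follows. The function name `profiling_message` is illustrative, and the sketch assumes that the store groups are aligned at multiples of 10 coefficients, which is an assumption about the implementation rather than a statement from the source.

```python
import random

N_BITS, GROUP, TARGETS = 256, 10, 13

def profiling_message(rng):
    """Message bits with 13 targeted coefficients m'[10 * g], g = 0..12."""
    bits = [rng.randint(0, 1) for _ in range(N_BITS)]   # untargeted groups stay random
    for g in range(TARGETS):
        base = g * GROUP
        bits[base] = 0                  # targeted coefficient: message bit 0
        for j in range(1, GROUP):
            bits[base + j] = 1          # neighbours in the same store: message bit 1
    return bits

msg = profiling_message(random.Random(11))
```

A valid ciphertext is then produced by running the regular encapsulation on such a message, so the re-encryption check still succeeds.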

Building a Classifier for HW(m [i])
The HW classifier is built in two phases: (1) Profiling phase and (2) Recovery phase. We illustrate the technique for a single coefficient m′[0]; the same can be repeated for other coefficients.

Profiling Phase
The profiling phase is intended to build templates for the HW of the individual coefficients of the noisy message polynomial m′. This phase is carried out using leakage from a clone device. From the CPA plot corresponding to m′[0] (Fig. 3), we select those features whose correlation value is above a certain threshold Th_0 as our points of interest (PoI), denoted as P_0. We stress that Th_0 is a parameter of the experimental setup, and can be experimentally determined. We use the selected features P_0 to build a reduced trace set RT_(0,i) for every possible value i of HW(m′[0]), that is, the HW of the message polynomial coefficient m′[0], which lies in the range [0, 16]. We can compute the mean and covariance matrix of each reduced trace set RT_(0,i) for i ∈ {0, ..., 16}, which we denote as µ_(0,i) ∈ R^|P_0| and Σ_(0,i) ∈ R^(|P_0|×|P_0|). Thus, the reduced template for HW(m′[0]) = i is denoted as tmp_(0,i) = (µ_(0,i), Σ_(0,i)) for i ∈ {0, ..., 16}. The same procedure can be repeated to build side-channel templates for multiple coefficients of m′.
To classify the HW classes, we decided to use Random Forest, because it has been shown to be successful in previous work [CDCG22] and it does not require the more complex hyperparameter tuning usually encountered in more complex models. Random Forest or RF [Bre01] is an ensemble learning algorithm, based on the construction of multiple decision trees. The predictions from these decision trees are combined (for example, through majority voting) in order to achieve a better prediction. An individual decision tree is usually sensitive to small changes in the training data and, when grown large enough, tends to overfit the training data. RF is designed to address this instability of the decision tree, and multiple trees are allowed to grow large, without the need for post-processing. These decision trees are trained on different subsets of the training data using bootstrapping (the training data is sampled uniformly with replacement), and the decision is then made by taking the average or the majority vote of the decisions given by the trees.
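As a rough sketch of the feature selection step, the following pure-Python code picks points of interest by thresholding the per-sample Pearson correlation with the known HW and then computes the mean part of each template. The helper names are hypothetical, the use of the absolute correlation is an assumption, and the covariance matrices are omitted for brevity.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def select_poi(traces, hws, threshold):
    """Indices of samples whose |correlation| with the HW exceeds threshold.
    Taking the absolute value is an assumption of this sketch."""
    n_samples = len(traces[0])
    return [t for t in range(n_samples)
            if abs(pearson([tr[t] for tr in traces], hws)) > threshold]

def mean_templates(traces, hws, poi):
    """Mean reduced trace per HW class (covariance omitted in this sketch)."""
    groups = {}
    for tr, hw in zip(traces, hws):
        groups.setdefault(hw, []).append([tr[t] for t in poi])
    return {hw: [sum(col) / len(rows) for col in zip(*rows)]
            for hw, rows in groups.items()}

# Toy profiling set: only sample index 1 depends on the HW.
hws = [0, 1, 2, 15, 16, 8]
alt = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
traces = [[0.5, 0.1 * hw, a] for hw, a in zip(hws, alt)]
poi = select_poi(traces, hws, threshold=0.5)
templates = mean_templates(traces, hws, poi)
```

In a real profiling set each class would contain many traces, so that a covariance matrix per class can be estimated as well.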
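The two aggregation steps mentioned above, bootstrap sampling and majority voting, can be illustrated with a small sketch. It is not a Random Forest: a nearest-class-mean learner on a single feature is used as a deliberately simple placeholder for a decision tree, and all names and the toy data are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) items uniformly with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the most common label among the predictions."""
    return Counter(predictions).most_common(1)[0][0]

def train_nearest_mean(sample):
    """Placeholder base learner (stands in for a decision tree)."""
    sums = {}
    for x, label in sample:
        s, c = sums.get(label, (0.0, 0))
        sums[label] = (s + x, c + 1)
    means = {label: s / c for label, (s, c) in sums.items()}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def bagged_predict(data, x, n_models=25, seed=0):
    """Train n_models learners on bootstrap samples and vote on x."""
    rng = random.Random(seed)
    models = [train_nearest_mean(bootstrap_sample(data, rng))
              for _ in range(n_models)]
    return majority_vote([model(x) for model in models])

# Toy data: (feature value, HW label) pairs.
data = [(0.10, 0), (0.20, 0), (0.15, 0), (1.60, 16), (1.50, 16), (1.55, 16)]
low = bagged_predict(data, 0.12)
high = bagged_predict(data, 1.58)
```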

Recovery Phase
In the recovery phase, we obtain a trace tr for a chosen ciphertext from the decryption procedure, and build a reduced trace tr_0, corresponding to P_0. We utilize the trained RF model for prediction, with the number of trees fixed to 1.5k, for which we observed good accuracy on the validation set. To check whether data imbalance affects the performance, we also evaluate other metrics, such as precision and recall, and obtain scores similar to the accuracy. We then feed the attack traces to the model to obtain the predictions, which are then forwarded to the solver to recover the key.

Experimental Results for HW Recovery
We performed experimental validation to recover the HW of single coefficients of m′ for both the reference and the assembly-optimized implementations of Kyber KEM. For the reference implementation compiled with the O3 optimization level, we obtain an accuracy of 91.1% in recovering the HW of single targeted coefficients. In the case of the assembly-optimized implementation, we obtain a much lower accuracy of 32.0% for recovering the HW of single targeted coefficients. This much lower accuracy of the HW classifier can be attributed to the noise due to the simultaneous storage of multiple coefficients. However, in the following section, we show that such imperfect HW classifiers with low accuracy do not deter key recovery using valid chosen ciphertexts.

Key recovery
We have shown the attacker's ability to recover HW(m′), that is, the HW of individual coefficients of m′. In this section, we demonstrate a key recovery attack exploiting this information. We start by explaining how HW(m′) can be used to derive information about the key-dependent noise component ∆m and thereby construct linear inequalities in the secret key coefficients. Subsequently, we explain our novel solver to perform full key recovery.

Recovering Information on ∆m using HW(m′)
In this section, we consider the type of information we can learn about the decryption noise ∆m from the leaked Hamming weight of the message HW(m′). The intuitive idea of our approach is to characterize the relation between this Hamming weight and the sign of ∆m. From this characterization, we can assign to each Hamming weight a probability of the sign being positive or negative.

A First Approach
It is known that the coefficients of m′ follow a narrow bell-shaped discrete distribution around q/2 or 0 in the following manner:

m′[i] = m[i] · ⌊q/2⌉ + ∆m[i] for i ∈ {0, ..., 255},

where the standard deviation of the noise ∆m[i] is small compared to q. In most practical implementations of Kyber, such as pqm4, coefficients of m′ are represented as signed 16-bit integers. Then, because of the narrow distribution of the noise ∆m[i], we hypothesized that it should be possible to learn something about the size of ∆m[i] from the observation of HW(m′[i]).
Figure 4 shows the distribution of HW(m′[i]) when ∆m[i] ≥ 0 and ∆m[i] < 0, considering m[i] = 0 and m[i] = 1. We can see that, when m[i] = 0, there is a clear distinction between positive and negative errors. This happens because a small negative error, when added to a zero message bit (i.e. m[i] · ⌊q/2⌉ = 0), results in a small negative number, which has a very high Hamming weight in the typically used two's complement representation. Similarly, small positive errors result in values around 0 on the positive side, which in turn have very small Hamming weights. On the other hand, when m[i] = 1, the HW distributions for positive and negative noise ∆m[i] are very close to each other, and therefore it is hard to obtain meaningful information on ∆m[i] from HW(m′[i]).
From this observation, if an attacker learns the HW of multiple m′[i] where m[i] = 0, they can build a series of inequalities on ∆m[i]. These types of inequalities can then be solved using Belief Propagation algorithms [PP21,Del22,HPP21]. While this approach is theoretically sound, it wastes some of the information that comes from the Hamming weight. In the following section, we discuss how to obtain more information than only the sign of ∆m[i] from HW(m′[i]).
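The effect of the two's complement representation for m[i] = 0 can be checked with a few lines of Python. The helper `hw16` is a hypothetical name, and the noise values are arbitrary small examples rather than samples from the actual noise distribution.

```python
def hw16(x):
    """Hamming weight of x stored as a signed 16-bit two's complement value."""
    return bin(x & 0xFFFF).count("1")

noise = (1, 2, 3, 64)
hw_positive = [hw16(d) for d in noise]    # m[i] = 0 with small positive noise
hw_negative = [hw16(-d) for d in noise]   # m[i] = 0 with small negative noise
```

For these values every positive error has a Hamming weight of at most 2, while every negative error has a Hamming weight of at least 10, which is the separation the sign inference relies on.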

Obtaining Tighter Inequalities
The Hamming weight of the message can be used to infer information not only about the sign of ∆m but also about the possible values that ∆m can take. That is, a certain Hamming weight can only be linked to a limited number of possible values of ∆m, and some Hamming weights can only occur within a limited range of values. Therefore, it is possible to compute both a lower bound and an upper bound on m′[i] based on the Hamming weight. There is one important caveat when implementing this idea in practice: efficient Kyber implementations perform lazy modular reduction, and therefore it is not guaranteed that −q/4 ≤ m′[i] < q/4 before the reduction. While it is not easy to give a theoretical estimate of the bounds on m′[i] for each Hamming weight from 0 to 16, we can run a simulation to compute these values.
Table 2 shows the maximum and minimum values of m′[i] observed for each possible Hamming weight, considering Kyber's reference implementation. The tightness of the inequalities is represented by the spread, which is simply the difference max m′[i] − min m′[i]. It is also possible to see the effect of the lazy reduction. One would expect max m′[i] to be −64 for the case HW(m′[i]) = 10, since its two's complement representation 0b1111111111000000 makes it the maximum value that can result from a narrow distribution around 0 with the given Hamming weight. However, we instead observe 253, because HW(253 + q) = HW(0b110111111110) = 10. Now, for each observed HW(m′[i]) with m[i] = 0, an attacker is able to construct two inequalities, one for the maximum and one for the minimum. This gives more information to the inequality solver, which we cover next.
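A simulation of this kind can be sketched as follows. The code is an illustration under simplifying assumptions: the helper names are hypothetical, the stored values are hand-picked rather than produced by a Kyber implementation, and the table is keyed by the Hamming weight of the stored 16-bit value while recording its centered representative modulo q.

```python
Q = 3329  # Kyber modulus

def hw16(x):
    """Hamming weight of x stored as a signed 16-bit two's complement value."""
    return bin(x & 0xFFFF).count("1")

def centered(x, q=Q):
    """Representative of x modulo q in the range [-(q // 2), q // 2]."""
    return (x + q // 2) % q - q // 2

def build_bounds(stored_values):
    """Map each observed Hamming weight to (min, max) of the centered values."""
    bounds = {}
    for x in stored_values:
        w, value = hw16(x), centered(x)
        lo, hi = bounds.get(w, (value, value))
        bounds[w] = (min(lo, value), max(hi, value))
    return bounds

# Toy observations: -64 is fully reduced, 253 + Q is a lazily reduced 253.
stored = [-64, 253 + Q, 5, -1, 3]
bounds = build_bounds(stored)
spread = {w: hi - lo for w, (lo, hi) in bounds.items()}
```

With these toy inputs the entry for Hamming weight 10 spans from −64 to 253, reproducing the lazy-reduction effect discussed above.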

Solving Inequalities to Find The Secret Key
In this section, we formalize how a system of linear inequalities related to the secret key is built from SCA information. We then construct a new solver to retrieve the secret key from these inequalities.

Defining the System of Linear Inequalities Related to the Key
In the previous section, we saw that when m[i] = 0, then m′[i] = ∆m[i], and HW(m′[i]) gives us information on ∆m[i]. Suppose an attacker asks the target device to decrypt a number κ of ciphertexts, whose corresponding messages are m_1, ..., m_κ. Similarly, let r_j, (e_1)_j, u_j, (e_2)_j and v_j be the values resulting from the encryption of the corresponding messages m_j. Notice that these values are therefore known to the attacker. Now, for any indexes (j, i) ∈ {1, ..., κ} × {0, ..., 255}, we know that […]. But, from Equation 1, we have

⟨(e_1)_j + ∆u_j, s⟩[i] = ⟨negashift_i((e_1)_j + ∆u_j), poly_to_vec(s)⟩.
Therefore, if we let
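The identity between a coefficient of a negacyclic product and an inner product with a shifted vector can be checked numerically. The sketch below works with a single polynomial in Z_q[X]/(X^n + 1) instead of a module vector, and the concrete definition of `negashift` used here is an assumption chosen so that the identity holds; the toy parameters n = 4 and q = 17 are arbitrary.

```python
def negacyclic_mul(a, s, q):
    """Schoolbook product of a and s in Z_q[X]/(X^n + 1)."""
    n = len(a)
    out = [0] * n
    for j, aj in enumerate(a):
        for k, sk in enumerate(s):
            idx = j + k
            if idx < n:
                out[idx] = (out[idx] + aj * sk) % q
            else:
                # X^n = -1, so wrapped terms change sign.
                out[idx - n] = (out[idx - n] - aj * sk) % q
    return out

def negashift(a, i):
    """Vector whose inner product with s gives coefficient i of a * s."""
    n = len(a)
    return [a[i - k] if k <= i else -a[n + i - k] for k in range(n)]

def inner(u, v, q):
    """Inner product of two vectors modulo q."""
    return sum(x * y for x, y in zip(u, v)) % q

q = 17
a = [1, 2, 3, 4]
s = [1, 0, -1, 1]
product = negacyclic_mul(a, s, q)
via_shift = [inner(negashift(a, i), s, q) for i in range(len(a))]
```

Each coefficient of the product is therefore a linear function of the secret coefficients with attacker-known weights, which is what allows bounds on ∆m[i] to be turned into linear inequalities in the key.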

Reference Implementation:
We remark that any randomly generated message m contains an average of 128 zero bits, and thus the attacker can simultaneously exploit leakage from all 128 of these bits. However, for our practical experiments, we only exploit leakage from the first 8 zero bits of the message for full key recovery, as this only requires constructing templates for 8 bits, which is easier to handle than building and managing templates for 128 zero bits. This is just to simplify our experiments and to demonstrate a proof of concept for key recovery while only exploiting 8 zero message bits. We were able to recover all 10 secret keys with about 5200 traces. We observed through experimental simulations that exploiting leakage from all zero bits in a randomly generated message (i.e. an average of 128 zero bits) enables full key recovery in just 325 traces, which correlates well with our attack simulations as presented in Figure 5.

Optimized Implementation:
We showed in Section 5.1.2 that simultaneously storing 10 coefficients in memory ensures that the attacker can only exploit leakage from one in every 10 coefficients of m′. For illustration, we exploited leakage from 13 coefficients of m′, namely m′[10·i] for i ∈ {0, ..., 12}, and were able to recover all secret keys with about 7800 traces. Thus, the number of traces is 24× higher than that needed to attack the reference implementation (325).
This can be attributed to the reduced number of exploited coefficients and the reduced accuracy of the HW classifier: ≈ 32% for the optimized implementation, compared to ≈ 91% for the reference implementation.While, intuitively, the low accuracy of 32% appears to be too low to allow key recovery, we notice that even if predictions are wrong but close to the actual HW, the inequalities generated may still be correct.Furthermore, if an inequality is wrong by a small margin, the score function ensures that the penalty is also small, thus still allowing the solver to converge.
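The score function itself is not reproduced in this part of the text. Purely to illustrate the property described above (a slightly wrong inequality costs little, a badly wrong one costs a lot), the following sketch uses a simple distance-based penalty. It is an assumed stand-in and not the authors' definition; the function names and the numeric bounds are hypothetical.

```python
def violation(value, lo, hi):
    """Distance by which value falls outside [lo, hi]; 0 if it lies inside."""
    if value < lo:
        return lo - value
    if value > hi:
        return value - hi
    return 0

def score(candidate_values, bounds):
    """Total violation of a candidate over all (lo, hi) inequality pairs."""
    return sum(violation(v, lo, hi)
               for v, (lo, hi) in zip(candidate_values, bounds))

true_value = 5
correct = violation(true_value, 3, 5)          # bounds from the right HW class
slightly_off = violation(true_value, 1, 4)     # bounds from a neighbouring class
badly_off = violation(true_value, -300, -100)  # bounds from a distant class
total = score([5, 5, 5], [(3, 5), (1, 4), (-300, -100)])
```

Under such a penalty, a misclassification into a neighbouring HW class barely changes the score of the correct key, whereas a gross misclassification would dominate it.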
In the optimized implementation, we exploited leakage of single coefficients from 13 vectorized store instructions. However, we remark that an attacker can exploit leakage from several more vectorized store instructions, out of a total of 26 vectorized store instructions, for key recovery. Thus, it is clear that leakage of m′ can be efficiently exploited by an attacker for key recovery with valid chosen ciphertexts. In the following, we demonstrate the applicability of our attack to the shuffling and masking countermeasures.

Attacking Shuffled Implementation
We now consider the case when the operations manipulating the coefficients of m′ are shuffled. In this case, the attacker does not know when the zero message bits are being processed, and therefore even if they know that m[i] = 0, they cannot tell whether m′[i] has a high or low Hamming weight. But notice that, since the leakage of the message polynomial coefficients still exists, the attacker can recover the HW of single coefficients of m′ in the same manner as for an unprotected implementation. This can be done by using templates created using leakage from a clone device, on which the attacker has knowledge of the random permutation used for every execution. However, we cannot directly use the same key recovery procedure without knowing the permutation used for processing these coefficients on the target device.
Attack Idea: Since the attacker can only exploit leakage from coefficients m′[i] corresponding to m_i = 0, we propose to choose a message m that has only a very small number of zero bits. We remark that the attacker can explicitly choose the value of the message for valid chosen ciphertexts. Suppose that the chosen message m has θ = 2 null bits at positions i and j, and that the attacker sees a sequence of Hamming weights W = (w_1, w_2, ..., w_n) being processed. Then, if this sequence contains two very small values w_a, w_b ≤ 1 or two very large values w_a, w_b ≥ 15, then the observations a and b must be associated with the zero positions i and j in some way. Once this is done, we can proceed using the same key recovery strategy presented in the previous section, considering either 1 or 15 as the HW for both m′[i] and m′[j], depending on the HW values that were observed.
The main difficulty in using this idea in practice is that we need really tight intervals on a Hamming weight w to associate it with a zero message bit m[i] = 0, since we need to be sure to exclude all possible valid HW values associated with m[i] = 1. As a result, observations within these intervals are very rare. Unfortunately, using θ = 2 as in the example only gives us (n choose 2) = 32640 different ciphertexts for n = 256, which is not enough to make a high number of such rare observations. However, we can relax this idea and use messages with θ = 4 null entries, but accept inequalities in case we see 2 extreme values. Then all indexes corresponding to m[i] = 0 are counted as having an extreme Hamming weight. Notice that this generates some invalid inequalities.
As a proof of concept, we tested this idea and obtained a successful recovery, assuming perfect classification of the HW of the coefficients m′[i]. With 300000 inequalities, out of which around 16% were wrong, we were able to recover the key using a little more than 8 million ciphertexts. Although the number of ciphertexts is very large, it does support the validity of the approach, and we leave improving this value for future work.
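The acceptance rule for one shuffled observation can be sketched as below. The function name is hypothetical, the thresholds 1 and 15 and the requirement of two extreme values come from the text, and rejecting sequences that contain both kinds of extremes is an assumption of this sketch.

```python
from math import comb

def extreme_class(hw_sequence, low=1, high=15, needed=2):
    """Return the HW (low or high) to assign to the zero-bit coefficients,
    or None if the observation is not usable."""
    n_low = sum(1 for w in hw_sequence if w <= low)
    n_high = sum(1 for w in hw_sequence if w >= high)
    if n_low >= needed and n_high >= needed:
        return None  # ambiguous: assumption of this sketch, not from the text
    if n_low >= needed:
        return low
    if n_high >= needed:
        return high
    return None

# Number of 256-bit messages with exactly two zero bits.
n_messages = comb(256, 2)

accepted_low = extreme_class([8, 7, 0, 9, 1, 6])
accepted_high = extreme_class([8, 16, 7, 15])
rejected = extreme_class([8, 7, 9, 1])
```

The same function covers the relaxed variant with θ = 4, since it only counts how many extreme values appear, which is also why some of the resulting inequalities are invalid.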

Attacking Masked Implementation
In the masked implementation of the decryption procedure, the message polynomial m′ is additively masked, and thus the attacker can observe leakage from the d independent arithmetic shares of each coefficient m′[i] of the message polynomial, where d − 1 is the order of masking. We performed our experiments on the first-order masked implementation of Kyber768 from the open-source mkm4 library [HKL+22].
Consider the MaskedPolySub operation responsible for the masked computation of the subtraction m′ = v − ⟨s, u⟩. Let w = ⟨s, u⟩ ∈ R_q, and consider its arithmetic shares w = w_0 + w_1 + ... + w_(d−1). The computation is done in two steps. First, it uses poly_sub to compute v − w_0. Then the other shares are simply negated. Notice that poly_sub is similar to the unprotected version.
While the mkm4 library utilizes an assembly-optimized implementation of the polynomial subtraction operation, for our analysis, we replaced this routine with a C-based implementation and compiled it with the O0 compiler optimization level for simplicity. Our analysis of the assembly code also reveals that the attacker can observe leakage of the individual coefficients of all the arithmetic shares of m′. We performed CPA to detect leakage from coefficients of all the arithmetic shares of m′ and refer to Figure 6 for the CPA plots, where we can observe leakage from coefficients of all shares of m′. For the classification of the HW, we used the same technique as for the unprotected implementation and were able to obtain an average accuracy of 94% for the HW classifier over all the targeted coefficients. We now show how the HW of coefficients of arithmetic shares of m′ can be plugged into our novel greedy search algorithm for key recovery.
Key Recovery: From the key recovery point of view, the main difference when attacking the masked implementation is the construction of inequalities. Remember that, in the unprotected case, we used tables of minimum and maximum values of m′[i] for each possible HW. This idea can be adapted for building inequalities for the masked implementation.
For the first-order masked implementation, we build two 17 × 17 tables. In these tables, rows and columns represent the Hamming weights of the first and second shares of m′[i]. The entry with index (ω_0, ω_1) of the first table is the maximum observed value of m′[i] when its shares have Hamming weights ω_0 and ω_1. The second table is constructed similarly, except that we take the minimum instead of the maximum. To build the tables of maximums and minimums, we used the code from the mkm4 implementation and observed 100000 decryptions. We remark that no SCA is needed for building these tables, as they depend only on the implementation, not on the device. The resulting inequalities can then be directly plugged into the key recovery algorithm. Figure 7 shows the performance of our key recovery algorithm in the masked case. Notice that we did not consider the Belief Propagation algorithm here because the required number of inequalities is too large for it to be efficient.
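The two-step share-wise subtraction can be illustrated with a short sketch. It is not the mkm4 code: the function names are hypothetical, all values are fully reduced modulo q for simplicity (real implementations work on lazily reduced 16-bit values), and the polynomials are shortened to four coefficients.

```python
import random

Q = 3329  # Kyber modulus

def share(poly, d, rng, q=Q):
    """Split poly into d additive shares modulo q."""
    shares = [[rng.randrange(q) for _ in poly] for _ in range(d - 1)]
    last = [(poly[k] - sum(s[k] for s in shares)) % q
            for k in range(len(poly))]
    return shares + [last]

def masked_poly_sub(v, w_shares, q=Q):
    """Shares of v - w: the first share is v - w_0, the others are negated."""
    first = [(a - b) % q for a, b in zip(v, w_shares[0])]
    rest = [[(-c) % q for c in s] for s in w_shares[1:]]
    return [first] + rest

def recombine(shares, q=Q):
    """Sum the shares coefficient-wise modulo q."""
    return [sum(col) % q for col in zip(*shares)]

rng = random.Random(1)
v = [5, 100, 3000, 0]
w = [7, 50, 3328, 1]
w_shares = share(w, 2, rng)                 # first-order masking: d = 2
out_shares = masked_poly_sub(v, w_shares)
expected = [(a - b) % Q for a, b in zip(v, w)]
```

Recombining `out_shares` yields v − w, while each individual share of the result is still written to memory separately, which is the leakage the attack targets.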
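The construction of the two tables can be sketched as follows. The code assumes that each observation is available as a triple (first share, second share, value of m′[i]); the function name and the three toy observations are hypothetical and do not come from mkm4.

```python
def hw16(x):
    """Hamming weight of x stored as a signed 16-bit two's complement value."""
    return bin(x & 0xFFFF).count("1")

def build_share_tables(observations):
    """Build 17 x 17 tables of the maximum and minimum observed value,
    indexed by the Hamming weights of the two shares. Unseen cells are None."""
    size = 17
    max_tab = [[None] * size for _ in range(size)]
    min_tab = [[None] * size for _ in range(size)]
    for share0, share1, value in observations:
        r, c = hw16(share0), hw16(share1)
        if max_tab[r][c] is None or value > max_tab[r][c]:
            max_tab[r][c] = value
        if min_tab[r][c] is None or value < min_tab[r][c]:
            min_tab[r][c] = value
    return max_tab, min_tab

# Toy observations: (share 0, share 1, value of the unmasked coefficient).
observations = [(3, 5, 8), (6, -3, 3), (5, 6, 11)]
max_tab, min_tab = build_share_tables(observations)
```

A pair of recovered Hamming weights (ω_0, ω_1) then indexes both tables to obtain the upper and lower bound used in the two inequalities for that coefficient.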

Conclusion
We performed the first security analysis of two detection-based SCA countermeasures against CC-based side-channel attacks: (1) Ciphertext Sanity Check and (2) Decapsulation Failure Check, demonstrating practical attacks against both countermeasures. We first report a novel attack to circumvent the ciphertext sanity check, by simply applying the public key as a mask to a maliciously crafted ciphertext. We circumvent the decapsulation failure check by proposing the first CC-based side-channel attack that relies on valid ciphertexts for key recovery. Our attack exemplarily exploits leakage from the noisy message polynomial in the decryption procedure for full key recovery. We also introduced a simpler and improved inequality solver that can recover the secret key with less than half the inequalities compared to previous methods based on Belief Propagation.
We performed experimental validation of our attack on reference and optimized implementations of Kyber KEM on the STM32F4 microcontroller, requiring between approximately 325 and 7800 traces for full key recovery. We show how our attack can be adapted to both shuffled and masked implementations, with an appropriate increase in the number of traces for key recovery. Our work therefore shows that low-cost detection countermeasures can be rendered completely ineffective, and do not offer standalone protection against CC-based side-channel attacks. While these countermeasures are attractive for designers, our work encourages more study towards the development and analysis of new detection-based countermeasures against CC-based side-channel attacks.

Figure 1: Assembly code snippet of the PolySub operation (polynomial subtraction) when compiled with the O3 optimization level. The target store operation that leaks the HW of the coefficients is highlighted in red.

Figure 2: Assembly code snippet of the PolySub operation (polynomial subtraction) for the highly optimized assembly implementation of Kyber KEM for the ARM Cortex-M4 microcontroller. The target store operation that leaks the HW of the coefficients is highlighted in red.

Figure 3: Correlation Power Analysis (CPA) plots corresponding to coefficients of the noisy message polynomial m′ for (a) the reference implementation compiled with O0, (b) the reference implementation compiled with O3 and (c) the assembly-optimized implementation.

Figure 6: Correlation Power Analysis (CPA) plot for four coefficients of the two arithmetic shares of m′, from the reference implementation of masked Kyber KEM, compiled with the O0 optimization level.

Figure 7: Number of ciphertexts required for key recovery considering different levels of Gaussian noise.

Table 1: Attack parameters used for the CC-based side-channel attack using skewed ciphertexts in [RRD+23].

Table 2: Maximum and minimum values of m′[i] for each possible value of its Hamming weight, as observed in 100,000 decryptions using Kyber's reference implementation.