Carry Your Fault: A Fault Propagation Attack on Side-Channel Protected LWE-based KEM

Post-quantum cryptographic (PQC) algorithms, especially those based on the learning with errors (LWE) problem, have been subjected to several physical attacks in the recent past. Although the attacks broadly belong to two classes - passive side-channel attacks and active fault attacks, the attack strategies vary significantly due to the inherent complexities of such algorithms. Exploring further attack surfaces is, therefore, an important step for eventually securing the deployment of these algorithms. Also, it is important to test the robustness of the already proposed countermeasures in this regard. In this work, we propose a new fault attack on side-channel secure masked implementation of LWE-based key-encapsulation mechanisms (KEMs) exploiting fault propagation. The attack typically originates due to an algorithmic modification widely used to enable masking, namely the Arithmetic-to-Boolean (A2B) conversion. We exploit the data dependency of the adder carry chain in A2B and extract sensitive information, albeit masking (of arbitrary order) being present. As a practical demonstration of the exploitability of this information leakage, we show key recovery attacks of Kyber, although the leakage also exists for other schemes like Saber. The attack on Kyber targets the decapsulation module and utilizes Belief Propagation (BP) for key recovery. To the best of our knowledge, it is the first attack exploiting an algorithmic component introduced to ease masking rather than only exploiting the randomness introduced by masking to obtain desired faults (as done by Delvaux). Finally, we performed both simulated and electromagnetic (EM) fault-based practical validation of the attack for an open-source first-order secure Kyber implementation running on an STM32 platform.


Introduction
Post-quantum cryptography (PQC) comprises cryptosystems that are designed using hard problems that an adversary cannot solve easily using a sizeable quantum computer.PQC has been developed as a contingency plan in anticipation of the invention of sizeable quantum computers that can break our prevalent classical cryptosystems in the near future.A remarkable step in this direction is the recently concluded post-quantum standardization procedure initiated by the National Institute of Standards and Technology (NIST).NIST proposed some standard Digital Signatures, Key-Encapsulation Mechanisms (KEM), and Public-Key Encryption schemes [AAC + 22], which can be used to replace our current cryptosystems.Therefore, we should put more focus on practical issues such as the physical security of PQC so that it generates enough confidence to be deployed in the near future.
Countermeasures for two different types of physical attacks i.e. side-channel attacks (SCA) and Fault Attacks (FA) differ greatly from each other.Usually, countermeasures for one type of physical attack do not guarantee any security against another type of physical attack.However, if an SCA countermeasure is not applied carefully, its application sometimes opens up new vulnerable points for SCA.One such attack is shown in [FRVD08], where the authors exploited carry leakage from the randomized exponent of RSA [RSA78].Masking [CJRR99] is a well-known and provably secure countermeasure against SCA.Masking alone provides no assurance of security against FAs, except in some special cases of Statistical Ineffective Fault Attack (SIFA) [SJR + 19]1 .However, protection against both SCA and FA is desired.Therefore, it is important to understand the robustness of each dedicated countermeasure against cross-attacks, such as FA on SCA-secure implementation or vice-versa.It is especially important to understand whether one such countermeasure results in some new vulnerabilities that are exploitable by attack vectors.Recent work on symmetric key ciphers has shown that the interaction between two countermeasure classes is tricky and often leads to new attacks [SBJ + 21, SRJB23].We find that there is a lack of study on such cross-attack strategies i.e.FAs on schemes with SCA countermeasures or vice versa for PQC schemes.Therefore, in this work, we ask: Does one countermeasure create new vulnerabilities for another class of attacks in the PQC context?Specifically, since there exist many SCA secure masking schemes [BDK + 20, BGR + 21, HKL + 22, KDVB + 22] for learning with errors (LWE) [Reg09]-based PQC keyencapsulation mechanisms (KEM), in this work, we will limit our focus to these schemes only.
Most of the LWE-based KEMs (e.g.Kyber [BDK + 17], Saber [DKRV18], NewHope [ADPS16], etc.) use a variant of the Fujisaki-Okamoto (FO) [FO99] transformation [JZC + 17] to achieve security against Chosen Ciphertext Attack (CCA) attacks.A CCA-secure KEM consists of three algorithms: Key generation, Encapsulation, and Decapsulation.Although the key generation and encapsulation algorithms can be attacked using fault or side-channel [RRB + 19, VOGR18], the decapsulation algorithm is especially exploitable as it contains the secret key [PP19].The secret key of KEM is non-ephemeral i.e. the same secret key is repeatedly used in the decapsulation for different ciphertexts from various sources.Therefore, an attacker can refine their attack and improve their success probability by observing multiple traces with different inputs or by applying the fault in different executions while the key remains fixed.The decapsulation operation includes decryption and re-encryption and only returns a session key if the given ciphertext and re-encrypted ciphertext are the same.Otherwise, it returns a random key indicating decryption failure.These steps of the FO transformation inherently resist DFA [OSPG18,TMA11], which require faulty outcomes to proceed.However, a fault to bypass the equality check (between the received ciphertext and the re-encrypted ciphertext) can enable DFA.While this attack is simple, it is not the only attack possible.More precisely, equality check is not the only place where faults can be injected.Generally, ineffective FAs, similar to SIFA [SJR + 19] and Fault Template Attacks (FTA) [SBBR + 20], have been quite successful so far [PP21,HPP21,Del22].Such attacks extract information from two events: a) in case there is decryption failure due to faults (effective fault), and b) the decryption succeeds even though a fault has been injected (ineffective fault).The "effectiveness" of a fault depends on the decryption noise, which depends on the long-term secret key s s s.Therefore, each decryption failure or success provides an inequality involving s s s.The attacker can accumulate a sufficient number of such inequalities by injecting several faults and can retrieve s s s by solving this system of inequalities.
Several works [BMR21, PP21, HPP21, Del22] have proposed efficient ineffective FAs on LWE-based KEM schemes.In [PP21], Pessl et al. used an instruction-skip fault in the Decode part (Figure 3) of the decapsulation of Kyber and then observed if this fault resulted in the correct session key or not.The attack proposed in [HPP21] is similar to [PP21] but uses a different location to inject fault, which also belongs to the category of ineffective fault attack.Later, Jeroen Delvaux extended the fault locations of [HPP21] and improved the inequality solver in [Del22].There is another recent class of attack (not ineffective faults), which injects faults in the constants of the Number-Theoretic-Transform (NTT) computation, reducing the entropy of the LWE instances leading to key recovery [RYB + 23].Notably, in all of these cases, the main difference comes from the fault location -finding out new attack location is always beneficial from a security assessment perspective.It also indicates where one should put the countermeasures and what kind of countermeasures are needed.In this regard, the contributions of this paper are as follows: • We propose an FA by injecting faults in a previously unexplored location present in many LWE-based KEMs.The attack falls in the category of ineffective FAs.However, the new location we explore (the Arithmetic-to Boolean mask conversion module aka.A2B) is an artifact of masking.Therefore, our attack can be seen as a "side-effect" of masking.There exists a previous attack due to [Del22] which is aided by masking.
The reason was that masking randomizes the variable to be faulted and increases the chance of realizing a desired (e.g.single-bit) fault model with different executions.
Our exploration is different -we show that an algorithmic change, which is essential for efficient masking, is problematic.Also, we show later in this paper that preventing these two attacks has very different requirements, and we anticipate that our attack would be difficult to prevent using standard techniques, such as duplication (in Sec. 6).
• The proposed vulnerability originates from a fault-induced information-leakage channel through the carry propagation logic of a masked adder.While we restrict our discussion mostly to Kyber in this paper, the channel exists for Saber (we will explain this briefly) and maybe for many other schemes that utilize such adders.Moreover, the leakage is oblivious to the masking order, and, therefore, the attack remains unaffected by the increase in SCA security.
• We realize the key recovery algorithm using the Belief Propagation algorithm provided in [Del22]; however, with important customizations needed for our purpose.While the exact version of the attack requires a single-bit fault at a specific bit, we show that the effectiveness remains unchanged for much relaxed multi-bit fault models -although at the cost of more observations.This interesting observation is due to the nature of carry propagation through the adder and the restricted range and distribution of values the message coefficients can take.
• Our final contribution is a practical realization of the attack on an STM32-based open-source first-order masked implementation of Kyber.We used electromagnetic (EM) faults for this purpose and effectively recovered the key with 1.9 million injections.
It is worth mentioning that there exist multiple works targeting the masked implementations of KEMs using SCA analysis [NDGJ21,NWDP22].Most of them, such as [NDGJ21], exploit the inherent weaknesses of masked software implementations of these algorithms due to low noise and microarchitectural leakages.The attack proposed by us depends on the algorithm and, therefore, does not depend much on such implementation-specific physical assumptions other than that of the fault.In other words, our attack remains unchanged for hardware implementations, provided one can realize the desired fault model, even if the masking is implemented very carefully with sufficient noise.

Preliminaries
Notations: The k-th bit of an integer u ∈ Z is denoted as u (k) .Z q is a ring of integer modulo q and R q is a polynomial ring Z q [X]/(X n +1).R l q is a ring containing vectors with l polynomials of R q and we use R l×l q to represent a ring with l ×l matrices over R q .We denote a single polynomial with lower case letter (e.g., x ∈ R q ), a vector of polynomials with bold lower case letter (e.g., y y y ∈ R l q ), and a matrix of polynomials with bold upper case letter (e.g., A A A ∈ R l×l q ).We denote the i-th (i ∈ {0, 1, ..., n−1}) coefficient of a polynomial x as x[i] and the i-th (i ∈ {0, 1, ..., n−1}) coefficient of j-th (j ∈ {0, 1, ..., l−1}) polynomial in a vector of polynomials y as y[j][i].If an element v is sampled from a set S according to a distribution χ, we present it by v ← χ(S).However, if the element v is generated from a seed v with the help of a pseudorandom number generator and follows the distribution χ over the set S, then we denote it by v ← χ(S;seed v ).We use U as uniform distribution, and β µ as Centered Binomial Distribution (CBD) with standard deviation µ/2.⊕ denotes bit-wise XOR operation and & denotes bitwise AND operation.w represents the rounding operation that outputs the closest integer of w and is rounded upwards in case of ties (i.e., if w = 1/2 then w = 1).w b and w b denotes logical shifting of w by b positions to left or right respectively.Here, w is an integer, and b is a natural number.These operations can be extended for a polynomial x by performing them coefficient-wise (e.g.,

LPR Public-Key Encryption
The Ring-LWE (RLWE) [LPR10] problem is a variant of the Regev's LWE [Reg09] problem introduced by Lyubashevsky et al.The decision version of RLWE problem states that if a ← U(R q ), s,e ← χ(R q ), b ← U(R q ), and b = as+e, then distinguishing (a, b) and (a, b ) is hard.The authors also proposed a public-key encryption scheme based on RLWE problem, which is commonly known as the LPR scheme.This scheme consists of three algorithms: (i) key generation (LPR.PKE.KeyGen), (ii) encryption (LPR.PKE.Enc), and (iii) decryption (LPR.PKE.Dec), which are presented in Figure 1.The key generation algorithm generates the public key and secret key pairs.In this algorithm, secret polynomial s and the noise polynomial e are sampled using a narrow distribution χ over the set R q .The public polynomial a's coefficients are drawn using a uniform distribution over the set Z q , and the other part of the public key is the polynomial b is computed by computing as + e.The Encryption algorithm takes input as the public key pk and message m and produces the ciphertext pair (u, v).Here, u is the key containing part of the ciphertext and generated like b in key generation.v is the message containing components of the ciphertext.To generate v, an n-bit message m is encoded to a message polynomial with mapping Encode.One such mapping can be Encode : R 2 → R q defined by Encode(m) = Here, es −e 1 s+e 2 is known as decryption noise.Also, e, s , e 1 , s, and e 2 are all sampled from a small distribution χ.So, the decryption noise can be removed with a very high probability.This is performed by the Decode operation, which takes input as m ∈ R q and outputs m ∈ R 2 (shown in Figure 3).The chosen plaintext attack (CPA)-secure LPR.PKE can be converted to a CCA (chosen ciphertext attack)-secure key encapsulation mechanism (KEM) using a post-quantum variant of the Fujisaki-Okamoto (FO) transformation proposed by Jiang et al. [JZC + 17].
We call this FO transformation integrated CCA secure KEM as LPR.KEM, and present in Figure 2. G, H, and KDF are three hash functions employed in this KEM as a part of FO transformation.This FO transformation is used in Kyber and Saber.

PQ KEM Schemes Kyber and Saber
Kyber is a LPR.KEM like CCA secure KEM, but its security is based on the Module Learning with Errors (MLWE) problem.Here, the public polynomial of LPR.PKE a is a matrix of polynomials A A A ∈ R l×l q , the secret polynomial s is a vector of polynomials s s s ∈ R l q , and the error polynomial e is a vector of polynomials e e e ∈ R l q .The MLWE problem states that if  (A A A, b b b ) are hard to distinguish.Kyber uses number-theoretic transformations to perform polynomial multiplication.Three security versions of Kyber named Kyber512, Kyber768, and Kyber1024, depending on different values of the parameter set, and are presented in Table 1.The parameter set of Kyber contains prime modulus q and two power-of-two moduli p and t, which form the rings R q , R p , R t .In the Kyber.PKE.Enc, the Compress function shortens each coefficient of the ciphertext c = (u, v).Each coefficient of u is reduced from R q to R p , and each coefficient of v is reduced from R q to R t .Conversely, the Decompress function in Kyber.PKE.Dec, extend the ciphertext bits and maps each element of R p or R t to R q .The Encode : {0,1} n → R q function converts n-bit message to a polynomial in R q , and the Decode : R q → {0,1} n function reverse the effect of the Encode function.This is shown in Figure 3. Kyber uses two CBD distributions β η1 and β η2 to sample secret and error vector.Here, we brief the Kyber scheme and recommend [BDK + 17] for additional details.
Saber is also a CCA secure KEM, and its security depends on the Module Learning with Rounding (MLWR) problem.The MLWR problem states that if   ) are hard to distinguish.Here the error vector of the MLWE problem is replaced by the rounding operation.The moduli q, p, t that are used to form the rings R q , R p , R t in Saber algorithms are all power-of-two.The number of bits in each coefficient of a polynomial of the corresponding ring can be calculated as log 2 (q) = q , log 2 (p) = p , and log 2 (t) = t .The encoding and decoding operations of Saber are similar to LPR.PKE, and these operations are realized by using the shift operation due to the power-of-two moduli.There are three security versions of Saber named LightSaber, Saber, and FireSaber, depending on different values of the parameter set, which is shown in Table 2.We suggest the original paper [DKRV18] for further specifics.

Masking Saber and Kyber
Due to many successful SCAs [PP19, KPP20, MBB + 22, RRCB20] on the decapsulation algorithm of the NIST KEM candidates, many masked implementations for those schemes have been proposed to prevent SCA [BDK + 20, HKL + 22].In this work, we also target the masked decapsulation procedure of an LWE-based KEM.Therefore, we briefly describe the first-order masked decapsulation algorithm below, which can prevent first-order SCA.The decapsulation of a general LWE-based KEM follows three steps: (i) decryption, (ii) re-encryption, and (iii) ciphertext comparison (Figure 2).We must mask each of these three steps to protect the decapsulation algorithm from SCA.In first-order masking, the sensitive variable w is split into two shares, w 1 and w 2 .The relation between w and (w 1 , w 2 ) for arithmetic masking is (w 1 +w 2 ) mod q = w and for Boolean masking it is w 1 ⊕w 2 = w.LWE-based KEMs primarily utilize linear polynomial arithmetic operations over a ring R q , such as ring multiplication with one masked and another unmasked input, ring addition, and ring subtraction.These operations are duplicated and performed on each share independently in masked settings.For masking them, arithmetic masking techniques are used.The decapsulation also has non-linear arithmetic over R q .Those are shift operations on polynomials, and hash functions H, G, the extension function used for sampling pseudorandom numbers XOF, the CBD β µ , and ciphertext equality checking operation =?.These operations are expressed in bit operations and masked with Boolean masking.Except for these operations, the masked decapsulation needs two share conversion algorithms i.e. arithmetic to Boolean (A2B) and Boolean to arithmetic (B2A) conversions.The B2A conversion occurs inside the masked CBD (β µ ).
In Saber's decapsulation operation, the ring is R p instead of R q .A pictorial version of this masked decapsulation operation of Saber is presented in Figure 4.In this work, we will focus on the part surrounded by the red rectangle in the figure, which includes the A2B conversion and shift operations.A version of A2B conversion used in Saber is shown in Algorithm 1.
In Kyber, the ring modulus q is prime, the share conversion algorithms A2B (A2B q ) and B2A (B2A q ) are different compared to Saber, which uses power-of-two modulus.In Kyber, the shares are first transferred from prime to a power-of-two modulus, and the aforementioned A2B conversion (Algorithm 1) or a B2A conversion algorithm with input in power-of-two modulus is performed.For this reason, a couple of extra masked modules, such as Decode, Encode, Compress, and transform-power-of-2 are required in masked Kyber.In Kyber, the A2B conversions are used inside the Decode and Comp components, and the B2A conversions are inside the Encode.
coefficient after the subtraction with q/4.The third figure expresses the distribution of that coefficient after the transformation from Z q to Z 2 k+1 , where k = log 2 (q) .The last figure displays the distribution of that coefficient after the subtraction with q/2.After this, the A2B conversion takes place and produces Boolean shares All other bits of ( ) act as decryption noise.As noted in Section 1, the decryption noise is linearly related to the secret key.

Fault Propagation through the Carry Chain
In this section, we present our main observation, which results in key recovery fault attacks on the masked decapsulation algorithms of certain CCA secure LWE-based KEMs.Broadly, our attack exploits the Decode component (ref.Figure 6) that removes the decryption noise and produces the message in the decapsulation algorithm.However, unlike [PP21], which skips the first subtraction operation (line 1. in Algorithm 3) in the Decode module, we target the A2B algorithm.The A2B adds the arithmetic shares using Boolean masking and generates Boolean shares amenable to the shift operation so that the message bits can be extracted easily.More precisely, we target one of the arithmetic shares input to this module with faults.For simplicity, we limit our discussion to two shares (ref.Algorithm 3) in the next few paragraphs, and later explain the many shares case.Moreover, for ease of explanation, we limit the discussion to the Decode module of Kyber only (described in the last section).The differences for the Saber case will be explained at the end.
The output of the Decode stage are the message bits (masked), which are inputs to the re-encryption stage.As described in the last section, these (masked) message bits are derived from the (masked) message coefficients containing (masked) noise.The faults induced by us at the input of the A2B (at a specific bit of any one of the shares) propagate to these decoded (masked) message bits, conditioned on a specific (unmasked) bit of the noise.In other words, whether or not the fault propagates to the (masked) message bits (and, therefore, causes a decryption failure) depends on the actual unmasked value of some noise bit.Now the noise is dependent on the secret key s s s.Therefore, based on this fault propagation property, we can construct inequalities (each corresponding to one ciphertext) involving the secret, and solve a system of such inequalities using the Belief Propagation (BP) algorithm.The BP algorithm finally returns the long-term secret key.Note that, in this attack, we always work with valid ciphertexts, i.e. the ciphertexts that will be generated using proper execution of the encapsulation mechanism of the scheme.So, the probability of decryption failure (without fault) during the decapsulation procedure is identical to the preassigned decryption failure probability of the scheme, which is very small (< 2 −128 ).Therefore, we can safely assume that all the observed decryption failures happen due to the induced faults.Next, we explain the information leakage that we exploit due to the fault propagation through carry chains of the A2B.

Fault Propagation through First-order Masked A2B
We begin our discussion with stuck-at faults.Without loss of generality, suppose we introduce stuck-at-1 bit-fault (on an input share) at the (k−1)th bit of the i-th input coefficient (i.e.bits of the message).Henceforth in this section, we will only focus on a single coefficient.So, for the sake of simplicity, we rename the following: • g a1 [i] as x, g a2 [i] as y, and g[i] as z.These are (k+1) bit registers.
and m[i] as m.These are 1 bit registers.
Let us assume after the application of the stuck-at-1 fault at x (k−1) , x becomes x * , z = (x+y) mod 2 (k+1) becomes z * , and the message bit m becomes m * .We introduce the following two events: • Fault activation: The injected stuck-at-1 fault at x (k−1) is active if x (k−1) = 0, else it is inactive.If the fault is active, then x * (k−1) = x (k−1) (the value of the bit changes from 0 to 1).
• Fault propagation: The injected stuck-at-1 fault at x (k−1) propagates to z (k)  if and only if z * (k) = z (k) .The injected fault propagates to z (k+1) if and only if z * (k+1) = z (k+1) .A fault propagation happens only if the fault is "active" The main operation of A2B is to compute z = x+y mod 2 (k+1) , and it performs this by utilizing Boolean masking.However, for our purpose, we just need to focus on the addition functionality and the arithmetic shares (x and y).We can write: where 1 ≤ j ≤ (k + 1), c (j−1) is the carry for (j − 1)-th bit, and c (0) = 0.The carry can be written further as: = ((y (j−1) ⊕c (j−2) )&x (j−1) )⊕(y (j−1) &c (j−2) ). ( The relation between x, y, z and m 1 , m 2 , m is given as: Next, we present the following lemmas: Lemma 1.If we introduce a stuck-at-1 fault at x (k−1) , the fault activates with probability Proof.The introduced stuck-at-1 fault at x (k−1) is active, i.e. x * (k−1) = x (k−1) , only if x (k−1) is equal to 0. Since x is a random arithmetic share, x (k−1) = 0 happens with probability 1 2 .If the fault is active then x * (k−1) = 1.From Equation 2, we get that the introduced fault at x (k−1) can only propagate to z (k) through c (k−1) .Let us assume c becomes c * after the fault injection.Now, from Equation 3, we get c * (k−1) = ((y . This is the unmasked value of the (k −1)th bit of the coefficient that we target.The fault injection undoes the masking for this specific bit, thanks to the fault propagation.Overall, the fault reaches to z (k) with probability 1  2 when z (k−1) = 1.
Example: Now, let us provide a very simple example to visualize Lemma 2. Let us consider k = 2 (therefore, we have a 3-bit register).In this case, we introduced a stuck-at-1 fault at x (1) .The fault is active only if x (1) is equal to 0. If the fault is active then z * = (z+1) mod 2 3 .Table 3 presents the values of z and z (3) before the stuck-at-1 fault at x (1) together with the values of z * and z * (3) after the fault, when the stuck-at-1 fault at x (1) is active.From this example, we can observe that the fault at x (1) only propagates to z (3) , if z (1) = 1 and z (2) = 1.

Table 3:
The value of z and z (3) before and after the stuck-at-1 fault at x (1)  Before the fault injection After the fault injection 1) activates with probability 1 2 and propagates to z (k+1) and causes decryption failure if and only if z (k−1) = 1 and corresponding message bit m = 1.
Proof.From Equation 1 and Equation 4, we have z (k) = z (k+1) = m.So, the stuck-at-1 fault at x (k−1) propagates to z (k+1) = m, only if z (k−1) = 1 and the corresponding message bit is 1 (according to the previous lemmas).As the fault propagates to m then m * = m.This event will cause decryption failure as the decoded message bit is different before and after the fault injection.Once again we remind that m will remain as two shares.However, by the correctness of the Boolean masking, the properties described in these lemmas will follow.
In a nutshell, fault propagation provides information about z (k−1) , and propagates till z (k+1) conditioned on z (k) .Further, since z (k) and z (k+1) are the message bits (and assumed known in our attacks), if corrupted, they lead to decryption failure.Overall, we obtain a (unmasked) bit (z (k−1) ) which exposes partial information about the secret-dependent noise.Notably, we receive similar results if we use stuck-at-0 fault.However, the fault propagation would happen when z (k−1) = 0, in that case.However, bit-flip faults do not provide information in this case.For a bit-flip fault at x (k−1) , it is active for both possible cases, when x (k−1) = 0 or when x (k−1) = 1.The injected fault will propagate to z (k) and cause successful decryption failure for 2 k−1 instances out of 2 k instances when z (k−1) = 1 and also when z (k−1) = 0. Hence, successful decryption failure does not provide knowledge regarding z (k−1) .

Fault Propagation for the Higher-Order Masking
We now discuss the effect of our stuck-at fault on the higher-order masked decapsulation algorithms.In t-th order masking, the input of A2B conversion g[i] has t + 1 arithmetic shares.For the sake of simplicity, let us assume z = g[i] and its (t + 1) arithmetic shares are (x 1 , x 2 , ..., x t+1 ).So, ( (t+1) j=1 x j ) mod 2 (k+1) = z.Now, this multi-share sum can be written as a two-share case z = (x 1 +y) mod 2 (k+1) , where y := (  .Therefore, by applying Lemma 3, the stuck-at-1 fault at x k−1 1 will propagate to z k+1 only if z (k−1) = 1 and z (k) = 1.The conclusion is that the fault propagation scenario we describe here does not get affected by the order of masking.

Application of Our Fault Attack on Kyber
In Kyber, q = 3329 and hence k = 12.So, each coefficient of m ∈ R q are of 12 bits.However, in order to enable message extraction in a masked setting, the shares are transferred from mod q to mod 2 (k+1) , making each share a 13-bit value.We inject stuck-at-1 bit-fault at the 11th bit of the i-th coefficient of an input share of the A2B of the masked Decode algorithm.Whether or not the fault propagates to the 13-th bit (which remains in a Boolean shared form) exposes the 11-th bit.
According to our nomenclature defined in the previous section, we denote the target 11-th bit (to be faulted) as g a1 [i] (11) , and the 11-th-bit value to be extracted as g[i] (11) , where g = (g a1 +g a2 ) mod 2 13 .From Lemma 3, we obtain that the stuck-at-1 fault at g a1 [i] (11)  propagates to the corresponding i-th message bit m[i] and causes decryption failure ( with probability 1/2) when g[i] (11) = 1 and m[i] = 1. Figure 7a presents the value ranges for g [i], where the stuck-at-1 fault will propagate and cause decryption failure, and in which cases the fault will not cause decryption failure for Kyber.The value range of g[i], which will cause decryption failure are {2 12 +2 11 +2 10 , 2 12 +2 11 +2 10 +1, ..., 2 13 −1} (coloured in red).The value range of g[i], which will not cause decryption failure are {2 13 − q/2 , 2 13 − q/2 + This picture also presents that the injected fault will corrupt the message value whenever it lies on the red part.
1, ..., 2 12 +2 11 +2 10 −1}(coloured in blue).One important point is that the distributions are bell-shaped, i.e. the mean values occur more often than any other value, and the values at the tails are extremely rare.This fact will be used later for relaxing the fault model to tolerate multi-bit faults.Similarly, the stuck-at-0 fault at g a1 [i] (11) propagates to m[i] and causes decryption failure when g[i] (11) = 0 and m[i] = 0 with probability 1/2. Figure 7b shows the value of g [i] where the stuck-at-0 fault will propagate and cause decryption failure and where the fault will not cause decryption failure.The values of g[i], which will cause decryption failure are {0, 1, ..., 2 10 −1} (coloured in red) and only when m[i] = 0 and the fault is active.There are some implementations of masked Kyber where 16 bit register is used for g[i] instead of 13 bit.One such implementation is [HKL + 22] (We used this implementation in our practical setup).We note that if the stuck-at-1 fault at g[i] (11) propagates to g[i] (13) then it will propagate to g[i] (16) .This can be proven by repeated application of Lemma 2. Therefore, working with the 13th or the 16th bit is the same for us, and we stick with the 13th bit case.Now, we recall that the 11th bit corresponds to the (secret dependent) noise.After the injection of the stuck-at-1 fault at g[i] (11) , it will cause a decryption failure if g[i] (11) = 1 and the corresponding message bit m[i] = 1.If m[i] = 1 and g[i] (11) = 1, then the associated decryption noise is ≥ q/4 −(2 10 − q/4 ).So, it can create a partition in the distribution of the decryption noise depending on the decryption failure.This partition leaks some information regarding the secret key, and we can eventually form inequalities involving the secret key (described in the next subsection).Note that the partition we received from our fault attack is different from the attacks available in the literature [PP21, HPP21, Del22].In the previous works, if the message bit m[i] = 1, then after the injection of the fault for almost half of the possible values of g[i] (around q/4 values out of q/2 values) the fault will cause decryption failure, and for the other half of the possible values of g[i] will not cause decryption failure.The same incident happens when the message bit m[i] = 0.However, in our stuck-at-1 attack, we observed decryption failures only when the corresponding message bit m[i] = 1.If m[i] = 1, then in our experiments, we have noticed 99.3% decryption failure and only 0.7% decryption success by applying stuck-at-1 fault at 11th bit of a share of g[i] for Kyber512.We provide more details in Section 5.

Recovery of the Secret Key for Kyber
This subsection will describe the final stage of our attack i.e. recovering the secret key for Kyber.As mentioned earlier, decryption noise forms a linear equation with the secret key s s s and error e e e.The decryption noise for LWE-based KEM is equal to e e e T s s s where ∆u u u and ∆v are the noise introduced because of the compression operation over u u u and v, respectively.In our case, a decryption failure is observed only when the message bit m[i] = 1, and the corresponding 11-th bit is also 1.If the fault is active and decryption failure happens, then the decryption noise satisfies the following inequality.
(e e e, s s s) T (s s s If the fault is active and decryption success happens, then the decryption noise satisfies the following inequality. (e e e, s s s) Note that the fault is active with probability 1 2 .As mentioned in Section 1, the attacker has access to the encapsulation oracle and works with self-generated ciphertexts.So, the attacker knows all the values in Equation 6 & 7, except (e e e, s s s).The attacker can find the polynomials (e e e, s s s) by solving these linear inequalities.
Recently, a few inequality solvers have been proposed [PP21, HPP21, Del22, HMS + 23].All these solvers are based on BP in order to tolerate errors and solve the system with a large number of inequalities in a reasonable amount of time.In this work, we have used the inequality solver proposed by Delvaux et al. in [Del22] to solve the inequalities obtained from the stuck-at-1 fault on Kyber.We always inject the fault at some fixed coefficient.If we consider ω numbers of inequalities, then the system of inequality contains ω inequalities with ψ unknowns.Here the value of ψ is the total number of coefficients of s s s and e e e, which is 2 * l * 256, l = 2 for Kyber-512.This inequality solver uses an iterative method to solve the system of ω inequalities by maintaining a probability mass function (PMF) for each of ψ unknowns.The PMF for each unknown initiated with the CBD on [−η, η] and updated in each iteration.The updation of PMF for each unknown is performed by calculating the probability to satisfy each of the ω inequalities for each possible value [−η, η] of the unknown, and then all the 2 * η * ω probabilities are combined.[Del22] accelerates the computation time of their solver by applying the central limit theorem.The iteration of the updation of PMF for each unknown continues until a sufficiently approximate one-point distribution on [−η, η] is found.

Application of Our stuck-at Fault Attack on Saber
In Saber, q = p = 2 p is a power-of-two moduli and k = p ( p = 10 for the medium security version of Saber).The decoding algorithm in Saber has only two steps.The input of the decoding algorithm of Saber is arithmetic shares of m ∈ R p .First, the algorithm performs the A2B conversion on the arithmetic shares of m to convert them into Boolean shares.Then the most significant bit ( p th bit) is extracted from the Boolean shares of m by using ( p −1) right shift operation.This final output is the message polynomial m, which is the output of the decoding in Saber.m [i] follows a bell-shaped distribution with a peak at p/4 when m [i] ( p ) is 0, i.e. for the corresponding message bit 0. The values of m [i] follow a bell-shaped distribution with a peak at 3p/4 when m [i] ( p ) is 1, i.e. for the corresponding message bit 1.From Lemma 1, we get that the injected stuck-at-1 fault at the ( p −1)-th bit (9th bit for the medium security version of Saber) of a share of m [i] propagates and causes decryption failure if it is active and m [i] ∈ {(p/4), (p/4)+1, ..., (p/2)−1}∪{3(p/4), 3(p/4)+1, ..., p−1} (coloured in red).The injected fault is active with 1/2 probability.Also, if m [i] ∈ {0, 1, ..., (p/4)−1}∪{(p/2), (p/2)+ 1, ..., 3(p/4)−1}, then the fault does not propagate and hence does not cause decryption failure (coloured in blue).We use Figure 8a to show the values of m [i] in which the stuck-at-1 fault

., p−1}
then decryption failure does not arise(coloured in blue).Figure 8b represents the values of m [i] for both cases when the stuck-at-0 fault propagation is successful and when fault propagation is not.Overall, we see that the fault propagation and the nature of the information leakage are quite similar to that of Kyber's.This gives enough evidence that our fault attack can be applied to any LWE-based KEM with some minor changes, as the leakage pattern from the masked decode algorithm is similar for all these schemes, such as we can use Kyber's analysis to NewHope [ADPS16] just by changing the parameters.However, we left this for the extended version of this work.Hereon, we shall mainly focus on the Kyber scenario in this paper and develop the BP-based key recovery algorithm for it.

Relaxation of the Fault Model
The attack described so far in this paper requires faults to be injected at a specific bit (the 11th bit) of a register.While such a precise injection has been demonstrated previously in literature [DBC + 18, BH22, SBBR + 20], injection at a specific bit is always probabilistic and depends on the precision of the injection mechanism.Therefore, we can only expect the fault to happen at the 11th bit with some probability p.On the other hand, research has shown that limiting the width of faults to one bit is feasible [GYTS14] even with low-cost injection mechanisms such as clock-glitch.However, the location of the faulted bits is not well-controllable in many situations.Furthermore, with low-cost setups, multi-bit faults also occur quite frequently.Therefore, it is practical to assume a fault model that injects single-bit faults at random locations, or even multi-bit in a register.Depending on the injection mechanism, the fault model may vary.However, stuck-at faults can be observed in several practical scenarios, such as clock-glitch or EM.Even in our practical experiments with EM injection described in Section 5, we observed single/multi-bit stuck-at-1 faults.
We now analyze the effect of such a relaxed fault model (single/multi-bit faults at different locations) assumption on the proposed attack.One should note that the proposed fault attack utilizes the inequalities constructed from the decryption failures/successes.There is no direct way to distinguish whether or not the fault has been injected at the desired location (11th bit for Kyber) in the case of a decryption failure.Similarly, in the case of a decryption success, it is hard to determine if it is due to the fault at 11-bit or some other fault.The injections at the undesired locations may generate noisy equations, hindering the convergence of the BP algorithm to the correct key value.Generally, BP algorithms can tolerate noise only up to a certain extent [HPP21,Del22].
Interestingly, we have observed that our attack is not significantly affected by such injection noise.The signal, in our case, is a single-bit injection (stuck-at-1) at the 11th bit.The main reason behind such observation is that due to the carry chain structure, many faults do not propagate till the most significant bit of the addition operation.More precisely, the only way to observe a fault is to observe it at the MSB bit, and a fault in some least significant bits can propagate to the MSB only through the carry chain.The carry chain contains several AND operations, which makes the fault propagation conditional on many bits, making it rare.It is also due to the normally distributed data that we are dealing with.The target variable assumes only a specific set of values with high probability among the entire range, which further hinders the fault propagation from most of the least significant bits.The probability of fault propagation is the highest for the two most significant bits, 12th and 13th bit for Kyber, and the second highest is for the next significant bit, 11th bit for Kyber.However, for the lower-order bits, the probability of fault propagation is negligible.In the next few paragraphs, we further elaborate on why the fault propagation probability is low for lower-order bits.
For the time being, let us further narrow down our discussion to Kyber-512, although the conclusions made would be similar for other variants.As already mentioned in Section 3.4, we extract a single-bit of the decryption noise via faults, which provides us with some information on the secret key.For Kyber512, the mean of the decryption noise is 0 associated with any of the message bits.Furthermore, decryption noise remains between (− q 4 , q 4 ) with probability (1 − 1 2 139 ) [BDK + 17] (as the failure probability for Kyber512 is 2 −139 ).The decryption noise associated stays in between (− q 8 , q 8 ) with probability (1 − 1 2 26 ).Concretely it means that obtaining any noise value out of this range would happen only once in 67 million executions.From a physical attack perspective, it is a quite low probability.These observations indicate that (due to the bell curve of noise distribution) the noise values are highly likely to belong within a small range 3 .
Let us now analyze the distribution of the message coefficients (g[i]) just before the extraction of the message using the A2B algorithm (i.e.before line 4 in Algorithm 3).These coefficients contain the message added with the decryption noise.At the input of A2B, g[i] ∈ (0, q 2 ) = (0,1664), if the corresponding message bit m[i] = 0, and g[i] ∈ (2 13 − q 2 , 2 13 −1) = (6528,8191), when the corresponding message bit m[i] = 1.These two ranges are clearly disjoint between 1664 and 6528.For the other end (i.e.0 and 8191), they are quite close as 0 = 8192 in the ring.The main observation at this point is that if a fault corrupts the message bit, then it must be strong enough to take the value of g[i] from one range to the other range (e.g. from the value range of message value 0 to the value range of the message value 1).As we can see, it is indeed possible if the value of g[i] is close to 0 or 8191.However, such values occur with extremely low probability.Now we consider the following scenarios, which model various single/multi-bit fault models on this variable g [i].For the sake of explanation, we consider the fault width to be 8 bit, in this case, limited to the least significant bits (in fact, we mostly observe such faults in our practical setup).However, the argument is similar for any bit width.We note that the faults are actually injected on one of the shares of g[i].However, the impact is realized on the unmasked g[i] as the shares are added with A2B.
• stuck-at-1 fault at least significant 8 bits in a share of g[i]: In this case, random bits in between 1 to 8 gets faulted to 1 if its value is 0. If the j-th bit is flipped from 0 to 1, then 2 j−1 gets added with g[i] after the fault.As we are introducing fault at a random share, the maximum added term to g [i], that can occur from stuck-at-1 fault at least significant 8 bits in a share of g[i], is (1 + 2 + 4 + ••• + 128) = 255.This happens when all of the 8 least significant bits in a share are 0, and all of them flip to 1.So the resulting value addition with g[i] is 255.
• stuck-at-0 fault at least significant 8 bits in a share of g[i]: Here, random bits in between 1 to 8 get changed from 1 to 0. If the j-th bit is faulted from 1 to 0, then 2 j−1 gets subtracted from g[i] after the flip (as we are introducing fault at a masking share).So, the maximum subtracted term, that can occur from stuck-at-0 fault at least significant 8 bits in a share of • bit-flip fault at least significant 8 bits in a share of g[i]: In this case, random bits in between 1 to 8 get flipped from either 0 to 1 or 1 to 0. If the j-th bit is flipped from 0 to 1, then 2 j−1 gets added with g[i] after the flip, and if the j-th bit is flipped from 1 to 0, then 2 j−1 gets subtracted from g[i] after the flip.So, the maximum added term that can occur in this case is 255, as well as it is the maximum subtracted term that can occur.
Let us first consider the scenario when the message bit m[i] associated with g[i] is 1.The decryption failure happens when we can change g [i] to some values that decode to 1, and we want to do it with 8-bit fault models described before.This can indeed happen for g[i] = 8191 or values closer to it if we add 255 with it.However, the least possible value for which this would be feasible is 7937, which occurs with a very low probability.If we consider values from the higher probability range, such as (6944, 7776) (happens with probability 1− 1 2 26 ), the chances that it will go to the ranges where the message value is 0 by adding (resp, subtracting) 255 is zero.In other words, the probability of changing a message value from 1 to 0 with an 8-bit fault is definitely < 1  2 26 , which we consider as quite low in terms of physical attacks.This is illustrated in Table.4. In conclusion, we point out that with multi-bit faults at LSB positions up to 8 bits, it is highly improbable to change the value of the message bit.In other words, the faults at lower order bits never reach the 12/13th bit and, therefore, do not cause noise in our observations.
It is worth mentioning that the probability of fault propagation to the message bit indeed increases if we consider multi-bit faults up to 9th bits.However, in simulation, we observed that the probability of such propagation is still low (e.g.no decryption failure for up to 9th-bit injection even if we simulate for 1 million cases).For faults up to 10/11 bits, the probability is significantly high for random bit-flip faults (nearly 50%), but lower for stuck-at faults (nearly 30%).The 11th bit is our signal and therefore, we always want a single bit fault individually occurring there.This also happens with some probability and in the experimental section, we show that it is indeed possible to increase this probability by careful adjustment of fault parameters.Finally, for 12th and 13th bits, the fault happens with 50% probability and carries no information.The takeaway of this section, therefore, is we have to carefully avoid injecting faults in these MSB bits and keep the width of multi-bit faults low.Interestingly, most of the singlebit faults, even while occurring at random locations other than the 11/12/13th bit, never propagate.Overall, we see that the fault models can be significantly relaxed for our case, and even with a few single bit stuck-at faults hitting the 11th bit, we can perform the desired fault propagation.There are many situations where the probability of noise is low (or, in fact, zero), and if the fault model can be tuned for any one of those situations, the attack would work seamlessly.

Experiments
In this section, we validate the attack on Kyber in simulation and also through practical fault injections.We begin with simulating the ideal scenario, i.e. injecting stuck-at-1 at the 11th bit of an arithmetic share of g [i] for every execution of the decapsulation algorithm.The outcomes are then fed to the BP algorithm to recover the secret key.The simulated experiment helps us to find the right parameter settings for the BP algorithm.Next, we move to our practical setup with EM-based fault injections on a software implementation.We further tune the BP algorithm to perform key recovery from these practical fault cases.

Noise Removal from Inequalities
Given a fault injected in the masked decapsulation algorithm with a ciphertext C, the decryption failure provides the information that g[i] (11) = 1.The decryption success indicates that g[i] (11) = 0.The fault must be active in both cases.However, in a single execution with C, it might happen that we observe successful decryption even though g[i] (11) = 1 due to the fact that the fault is not activated.Note that fault activation happens with probability 1 2 since we fault the masked values.In order to remove the uncertainty (which results in wrong information and is interpreted as noise in the BP algorithm) due to probabilistic fault activation, our strategy is to execute the decapsulation of C multiple times (of course, with fault injection).In our injection campaign, we repeat C, β times until we see a decryption failure for it.In case no decryption failure is observed after β executions, we decide g[i] (11) = 0. Once we see a decryption failure, we stop the execution with C, conclude g[i] (11) = 1, and move to the next ciphertext.
In our simulation, we begin with β = 20.This is an empirically chosen value.However, we checked (through simulation) that the probability of getting a wrong decision about the value of g[i] (11) is negligible for this choice of β.Moreover, as mentioned before, the probability of g[i] (11) = 1 is significantly higher (99.3%) than g[i] (11) = 0, by the nature of the decryption noise distribution.A result of such bias in the distribution is that we get a very low number of decryption "success" cases compared to the decryption "failure" cases.Since we stop execution for a given C upon seeing a decryption failure, the average number of repetitions we need is quite low compared to β. Empirically, we observed that, on average, we need to repeat our fault injection 2.67 times for each ciphertext (We tried with 100K ciphertext, and the total fault required there was 2.67K).The parameter setting for the ideal simulation is presented in Table 5.

Configuring and Running the BP Algorithm
As mentioned in Section 3.4 we have utilized the inequality solver proposed by [Del22].In our stuck-at-1 fault injection attack at g[i] (11) , we are only dealing with the inequalities that are generated from the message bit m[i] = 1.This follows from the Lemmas in Section 3.1, which say that the fault propagates to the 13th bit only when the 12th bit is 1.Since for Kyber, both the 12th and 13th bit are the same message bit, this means that the message bit must be 1.The corresponding inequalities for decryption success and failures are given in Equation 7and Equation 6, respectively, in Section 3.4.However, due to such a highly unbalanced value distribution for g[i] (11) , the BP solver might end up having a negligible number of inequalities corresponding to the correct cases compared to the faulty cases.The ratio is originally 0.7 : 99.3, which improves to 3 : 97 by applying a filtering strategy described in the next paragraph.This ratio, however, hinders the BP solving, making it converge to a wrong value quite often.One most probable reason for this would be that the highly-biased ratio makes the solver ignore the constraints due to correct cases, and impedes the solver converge to a single key.To stabilize this, we introduce a strategy called sample rejection.More precisely, we randomly reject 50% of the inequalities among the total inequalities generated, but these rejected inequalities are only chosen from the decryption failure cases.This improves the ratio between correct and faulty cases, making it 7 : 93.We found that this ratio is sufficient for consistently solving the system for several random secret key choices.With this new balanced system of inequalities, we need roughly 30k inequalities to recover the whole secret key (shown in Figure 11), and to obtain these 30k inequalities (i.e.60k different ciphertexts, as we reject 50% of the total inequalities and all are generated from decryption failures), we require to continue the fault simulation 160k times (= repetitions required for noise removal × ciphertexts required to enable 50% sample rejection = 2.67×60k).We present this result in Table 5.
Another important parameter that controls the BP solving is the filtering strategy originally presented in [Del22].Filtering basically chooses the smallest value among the message coefficients (i.e.g[i]) for fault injection in order to assist the BP algorithm.Also, in our case, it has some impact on improving the ratio of correct and faulty executions, as mentioned in the last paragraph.While directly implementing this strategy can be costly from the implementation perspective of the injection setup (as we need to change the fault location for every ciphertext), this is functionally similar to the following strategy: fixed the fault location at some coefficient index i, and then choose a ciphertext having the minimum value for this coefficient (among several randomly generated ciphertexts) for fault injection.This strategy, however, increases our simulation time, as we now need to generate several encapsulations to enable the minimum value selection.Owing to the similarity of the original filtering strategy, we, therefore, used the original filtering already implemented in the BP code.However, we take the minimum value for which the message bit is also 1. Also, it is worth mentioning that during the practical attack, we keep the fault location fixed, mimicking the second strategy.

Practical Setup
After establishing the simulated evidence for the attack, we now move to the practical fault injection experiments.For our experiments, we target the reference first-order masked implementation of the considered scheme built for the ARM Cortex-M4 family of microcontrollers [HKL + 22].We ported the reference implementation to the STM32F4-DISCOVERY board (DUT) housing the STM32F407 -an ARM Cortex-M4 microcontroller running at a clock frequency of 24 MHz.For the compilation of the reference implementation, we have used arm-none-eabi-gcc version 10.3.1 with compiler flags -O3 -std=gnu99 -mthumb -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 .We have used a generic USB 2.0 to TTL converter for USART communication with the DUT.EM Fault Injection (EMFI) was utilized to inject faults into the target device.
The most formidable challenge in executing the aforementioned attack is inducing the faults accurately at desired locations.We have established an injection setup to address this challenge, illustrated in Figure 9.This setup consists of several components, including an arbitrary waveform generator (Keysight 81160A), a constant-gain power amplifier (Teseq CBA 400M-260), a high-frequency near field H-probe (Rigol Near-field Probe 30MHz-3GHz), and an XYZ table (Thorlabs SMC100). Figure 10 shows the position of the near-field EM probe on the DUT.After being triggered by a signal from the evaluation board hosting the target implementation, the waveform generator produces a high-frequency pulse train.Following this, the amplifier boosts the intensity of this pulse train, and the H-probe generates a magnetic field above the target.It's important to highlight that the pulse train's amplitude, frequency, and burst count can be customized to meet particular needs.It was observed that by manipulating the position of the probe over the target using the XYZ table, with some necessary adjustments to the pulse parameters, fault can be induced in specific bits of the target register.However, this positioning process must rely on a trial-and-error approach due to the absence of internal register visibility.We have also utilized an oscilloscope to observe the execution timing of the target instruction so that the pulse delays can be adjusted accordingly and the trial-and-error can take the least amount of time.Initially, we analyzed the target object file to get an insight into the occurrence of a particular instruction where we can inject the desired fault.In the masked_poly_tomsg() function of the reference implementation, we have targeted a specific arithmetic instruction to induce the fault in the desired register, which corresponds to line 3 of Algorithm 3.

Attack with Practical Faults
We do not expect the practically injected faults to exactly follow the ideal distribution (i.e.stuck-at-1 fault at 11th bit with 100% probability).It is more judicious to assume that the desired fault (stuck-at-1 at g[i] (11) ) will occur with a certain probability p.If p is significant, then we can expect some signal (the desired fault event) to construct useful inequalities.Injections at undesired locations cause noise.However, we can handle such noise quite well, thanks to the two observations made in this paper: 1) A significant part of the induced faults on a share of g [i] does not propagate to the 12th or the 13th bit (ref. Section 4), and 2) The uncertainty (noise) due to undesired injections can be largely removed through repeated execution of the same ciphertext (Section 5.1).In Section 5.1, the second observation was utilized to remove the uncertainty arising due to fault activation.However, the same strategy can be utilized to remove the uncertainties for undesired fault locations.
We, therefore, follow the same strategy as before: Upon repeating injection several (β) times for the same ciphertext C, if we see a decapsulation failure, we decide g[i] 11 = 1.If we do not observe any decapsulation failure, g[i] 11 = 0.The choice of β is crucial here, as a very low choice might lead to several wrong inequalities being added to the system.A very high choice, on the other hand, may increase the attack time.We note that even after choosing a proper β, we may end up having some wrong equations.However, the BP algorithm inherently tolerates some (low) noise.Therefore, it does not create an issue for us.The most obvious (albeit slow) way or finding β is to do some trial-and-error.We did this for the first set of outcomes from our practical setup with the same β = 20 value chosen for the ideal simulation.This did not work for obvious reasons.In order to speed up our experiments, we, therefore, extract the fault model directly from the setup, assuming an offline profiling stage.Analyzing faults at the target location, we observed that multi-bit faults were also induced, specifically towards the LSB side of the target register.After tweaking the pulse and burst parameters from the waveform generator, we were able to obtain 10% single-bit stuck-at-1 faults at the 11 th bit and 90% random single-bit/multi-bit stuck-at-1 faults, mostly concentrated at the bit positions 1−8 (the LSB is at position 1).The pulse parameters are presented in Table 6.From this, we empirically estimate the value of β as 180.The interpretation is that if we repeat the fault injection β times for the decapsulation of the same ciphertext, then the probability of the fault getting activated at the 11th bit position at least once is significant.Interestingly, if the faults happen at any of the 1−8 LSB positions, it does not propagate to the message bit (13th bit of g[i]) 4 .Therefore, with this observed fault model, we can ensure that if a decryption failure is observed, it is always due to the propagation of our desired fault.The g[i] (11) = 0 case, though, may have noise.However, β = 180 successfully removes this noise for most of the cases, enabling key recovery.We notice that, on average, for a single ciphertext, we need to repeat the fault injection 26.53 times only.As shown in Figure 11, we could recover the secret key with 35k inequalities even with this fault model.To obtain 35k different balanced inequalities, we are required to continue the fault simulation 1,857k times (= repetitions required for noise removal × ciphertexts required to enable 50% sample rejection = 26.53×70k).We present this result in Table 5.It is interesting to observe that the number of inequalities is roughly the same for the ideal and the practical case.This is mainly attributed to the fault model we observe and the potentially removable noise by the structure of the attack.FA countermeasures use some form of redundant computation to detect/correct the fault [PM18].Considering the previous attacks on the LWE-based KEMs, many of them can be prevented using this strategy.For example, consider the attack due to [PP21], which skips a subtraction operation.Duplication of the subtraction followed by a check can prevent such attacks.Quite similarly, the attack due to [PP21] can be prevented by duplicating different parts of the re-encryption module as the fault is mainly injected at that part.The reason why duplication works for these cases is that the attacks strongly rely on the Decode or the final check operation of the FO transform.Unless the fault reaches these operations, it cannot be exploited.Therefore, preventing the fault from reaching these operations can stop the attack.In this regard, the proposed attack has some interesting properties.Although we still target the Decode, our injection happens much later than the fault location of the attack in [PP21].In some sense, we target the penultimate step of the decoding operation itself rather than sending some faulty value to decode as in [HPP21,Del22].The most important fact is that if duplication and detection are used to detect our fault injection, that would result in the same information leakage that we utilize.Error correction might be somewhat useful, however, as shown in recent work [SBJ + 21], error correction too leaks if the FA is combined with SCA.We anticipate that recent combined-attack secure gadgets [FGM + 23] can be a proper direction to prevent this attack.However, such gadgets are based on Boolean or polynomial masking, while we target arithmetic masking.Shuffling can be useful in this context, but, again, the adversary can perform a combined attack to extract the permutation used for shuffling and thereby remove the noise induced by it.Note that, we already consider masking (even higher-order) for our target implementations.So, extracting only the shuffling permutation would not result in a successful SCA attack.But such extraction would definitely help our FA.With all these observations, we leave a sound countermeasure development as an interesting future work.

Conclusions
In this work, we propose a new FA on the SCA-secure masked decapsulation algorithm for generic LWE-based KEMs and elaborate the attack for Kyber.Our attack exploits the A2B component used in the masked implementations and appears due to the presence of masking.So, it can be considered as a vulnerability introduced from the algorithmic changes introduced due to masking.The attack results due to fault propagation through the A2B component and leaks a secret-dependent noise bit by unmasking it due to faults.Eventually, we show a practical validation of the attack on ARM Cortex-M4 microcontrollers using EMFI.A direct consequence of the attack is that one must be careful while designing masking algorithms.One also has to be careful while introducing fault countermeasures to prevent this attack, as we anticipate that state-of-the-art duplication-based countermeasures would not be sufficient.Overall, our work encourages more study in this direction, in general, to eventually construct truly secure implementations.

Figure 2 :
Figure 2: CCA secure LPR.KEM [JZC + 17] and b b b = (A A As s s) + e e e, then (A A A, b b b) and

2 −174 5 b
b b ← U(R l p ), q > p and b b b = (q/p)(A A As s s) , then (A A A, b b b) and (A A A, b b b

Figure 4 :Algorithm 1 :
Figure 4: First-order masked Saber.KEM.Decaps algorithm [BDK + 20].The highlighted operations in color gray are influenced by the non-ephemeral secret-key s s s and use masking to prevent SCA.The component we focus on in this work is enveloped with the red rectangle.

Figure 5 2 :
is an illustrated version of the first-order masked decapsulation algorithm of Kyber.The function we exploit in this work is the A2B conversion in the Decode module, which is surrounded by the red rectangle in the figure.We present a first-order masked Decode procedure in Algorithm 3.According to this algorithm, (m a1 ,m a2 ) are arithmetically masked shares of the input of the Decode operation m .We use Figure5to illustrate the steps of the masked decoding algorithm for a coefficient of m , say m [i].Here, the first figure shows the distribution of a coefficient of m [i].The second figure presents the distribution of that Algorithm SecAdd [CGV14]

Figure 5 :
Figure 5: The steps of the masked decoding algorithm of Kyber presented in Algorithm 3[OSPG18].

Figure 6 :
Figure 6: First-order masked decapsulation algorithm of LW(E/R) based schemes.The highlighted operations in color gray are influenced by the non-ephemeral secret-key s s s = (s s s 1 , s s s 2 ).The function we exploit in our attack is enveloped with the red rectangle.

1 . 1 1
j=2 x j ) mod 2(k+1) .This brings back the two-share scenario explained in the previous subsection.More precisely, we introduce a stuck-at-1 fault at x k−1 The fault is active only if x k−1 1 = 0.The injected fault at x k−will not affect y, and can propagate to z (k+1) only through x k−1

Figure 7 :
Figure 7: Results of stuck-at fault attacks on Kyber.The solid curve represents the probability distribution of a coefficient that decoded to the message value m[i] = 0 and the dashed curve represents the probability distribution of a coefficient that decoded to the message value m[i] = 1.This picture also presents that the injected fault will corrupt the message value whenever it lies on the red part.

( a )Figure 8 :
Figure 8: Results of stuck-at fault attacks on Saber.The probability distribution of m [i] is shown in these figures.The solid curve represents all the values of m [i] that decoded to the message value 0, and the dashed curve represents all the values of m [i] that decoded to the message value 1.The inserted fault will corrupt the message value whenever the value of m [i] lies on the red part.

Figure 9 :Figure 10 :
Figure 9: Schematic of the attack platform

Figure 11 :
Figure 11: Number of inequalities required to recover partial or entire secret key for Kyber-512 with our simulated and practical fault model.Here, success probability indicates the number of coefficients recovered out of 1024(= 512 + 512) coefficients of the secret key.(Success probability equal to 1 means full secret key recovery.)The green coloured curve represents the simulated fault model, and the red coloured line represents the practical fault model.

Table 1 :
Parameters of Kyber with security and failure probability [BDK + 17]

Table 4 :
The effect of stuck-at-1, stuck-at-0, and bit-flip fault at the least significant 8 bits of g[i]Before the fault injection After the fault injection

Table 5 :
The parameter settings of the BP algorithm for recovering the whole secret key