Fault Attacks on CCA-secure Lattice KEMs

. NIST’s post-quantum standardization eﬀort very recently entered its ﬁnal round. This makes studying the implementation-security aspect of the remaining candidates an increasingly important task, as such analyses can aid in the ﬁnal selection process and enable appropriately secure wider deployment after standardization. However, lattice-based key-encapsulation mechanisms (KEMs), which are prominently represented among the ﬁnalists, have thus far received little attention when it comes to fault attacks. Interestingly, many of these KEMs exhibit structural similarities. They can be seen as variants of the encryption scheme of Lyubashevsky, Peikert, and Rosen, and employ the Fujisaki-Okamoto transform (FO) to achieve CCA2 security. The latter involves re-encrypting a decrypted plaintext and testing the ciphertexts for equivalence. This corresponds to the classic countermeasure of computing the inverse operation and hence prevents many fault attacks. In this work, we show that despite this inherent protection, practical fault attacks are still possible. We present an attack that requires a single instruction-skipping fault in the decoding process, which is run as part of the decapsulation. After observing if this fault actually changed the outcome (eﬀective fault) or if the correct result is still returned (ineﬀective fault), we can set up a linear inequality involving the key coeﬃcients. After gathering enough of these inequalities by faulting many decapsulations, we can solve for the key using a bespoke statistical solving approach. As our attack only requires distinguishing eﬀective from ineﬀective faults, various detection-based countermeasures, including many forms of double execution, can be bypassed. Weapply this attack to Kyber and NewHope, both of which belong to the afore-mentioned class of schemes. Using fault simulations, we show that, e.g., 6,500 faulty decapsulations are required for full key recovery on Kyber512. 
To demonstrate practicality, we use clock glitches to attack Kyber running on a Cortex M4. As we argue that other schemes of this class, such as Saber, might also be susceptible, the presented attack clearly shows that one cannot rely on the FO transform’s fault deterrence and that proper countermeasures are still needed.


Introduction
The search for quantum secure replacements of RSA and DLP-based cryptosystems is in full swing.This is demonstrated by NIST's ongoing Post-Quantum Cryptography Standardization process [NIS16], which very recently entered its third and final round.In this selection effort, a large number of submitted proposals base their security arguments on lattice problems.Such lattice-based key-encapsulation mechanisms (KEMs) and digital

Algorithm 1 LPR Key Generation
Output: Keypair (pk, sk) 1: s, e ∈ R q ← χ n 2: a ← U n (Coefficient-wise) sampling from a probability distribution is represented with ←, deterministic assignments with =, and equality tests with ==.This corresponds to the C programming language, which is used in all following source code excerpts.

LPR Public-Key Encryption
In 2010, Lyubashevsky, Peikert, and Regev [LPR10] extended Regev's Learning With Errors (LWE) problem [Reg05] to an algebraic variant called Ring-LWE (RLWE).They also defined a public-key encryption scheme, frequently dubbed the LPR scheme, which offers a reduction to RLWE.The LPR cryptosystem uses polynomials in the ring R q = Z q [x]/(x n + 1), with a prime q and n a power of two.The parameters q and n are typically chosen such that polynomial multiplication can be performed in O(n log n) using the Number Theoretic Transform (NTT).LPR requires sampling from a (narrow) error distribution χ.Early proposals, such as the original LPR scheme, used a discrete Gaussian distribution for χ.However, Gaussian samplers are hard to implement securely.For this reason, many newer proposals, including Kyber and NewHope, use a centered binomial distribution over the range [−η, η] for some small integer η.
The encryption scheme is comprised of three procedures.Algorithm 1 (Key Generation) generates a public/secret key pair.The coefficients of the secret key s and the noise polynomial e are drawn from χ, the coefficients of a are sampled from U q , i.e., from the uniform distribution over Z q .The public key b is computed as b = as + e.While only s is stored as part of the secret key, we note that e must also be considered a secret, as knowing its value would allow trivial recovery of s.Encryption is shown in Algorithm 2. In its first step, three polynomials r, e 1 , e 2 are sampled from the error distribution.Then, the two ciphertext components u, v are computed.For computing the second component v, the n-bit message m needs to be mapped to a polynomial in R q .The simplest way of doing that is by multiplying each bit of m with q/2 .Finally, Algorithm 3 specifies the decryption step.Its correctness can be verified by back-substitution: (1) As seen above, computing m = v − us yields a noisy version of the plaintext.We call the added terms, i.e., er + e 2 − e 1 s, the encryption noise d.As all terms involved in this encryption noise are sampled from the narrow error distribution χ, the original message bits can still be recovered.This can be seen in Figure 1, which shows an exemplary probability distribution for the coefficients of the noisy plaintext in two different ranges.The probability distribution of a 0 bit is shown with a solid line; the dashed line shows the distribution for a 1 bit.We use this mapping throughout the rest of this paper.
A dedicated decoding routine Decode is needed to recover the original message m from m .We defer the description of efficient constant-time decoder implementations to later sections.0 q/2 q/4 3q/4 3q/4 (a) Range: [−q/4, 3q/4) 0 q q/2 q/4 3q/4 (b) Range: [0, q) Figure 1: Typical probability distribution of the coefficients of the noise plaintext m .The solid line marks the distribution for a 0 bit, the dashed line for a 1 bit.

Fujisaki-Okamoto Transform
The LPR public-key encryption scheme only offers security against chosen-plaintext attacks (CPA).Thus, LPR key pairs must be ephemeral, as the plain scheme can be trivially broken in the chosen-ciphertext setting (CCA) if keys are reused (see, e.g., [Flu16]).For this reason, many LPR-based schemes make use of the Fujisaki-Okamoto (FO) transform [FO99] or one of its more recent versions [TU16,HHK17].This transform allows constructing a CCA-secure key-exchange mechanism (KEM) using a CPA-secure public-key encryption scheme.
The employed FO variants are, at least from a high-level perspective, very similar.To illustrate the basic principles of the transform, we now give more details on the FO variant used by NewHope [PAA + 19].NewHope refers to its key-encapsulation transformation as QFO ⊥ m .It embeds a public-key encryption scheme PKE featuring key generation, encryption, and decryption.Crucially, the randomness required for encryption is made explicit via a seed parameter coin.Furthermore, let F and G be two hash functions.Figure 2 shows the full QFO ⊥ m process.In Encaps, a randomly chosen message µ is encrypted using a seed derived via G.The shared secret ss is computed by hashing Keygen() Encaps(pk) Decaps((c, d), (sk, pk, s)) the PKE-ciphertext c together with the confirmation value d.Decapsulation recovers the message µ and then re-encrypts it using the recovered seed coin.Only if both the PKE-ciphertexts as well as the confirmation values d match, the correct shared secret is returned.Upon failure, a pseudorandom shared secret is computed.
While the intended purpose of the FO is to establish CCA security, it also averts many fault attacks by design.For instance, when targeting PKE.Decrypt in Decaps, i.e., the only part involving the private key sk, all faults leading to a corrupted µ are detected, as re-encryption will lead to a different c .This rules out differential fault attacks exploiting resulting differences in µ .

Kyber
Kyber [ABD + 19] is one of four finalists in the KEM category of the NIST PQC standardization process.It is an example of a scheme combining an LPR-like public-key encryption scheme with the FO to construct a CCA-secure KEM.The Kyber second-round specification includes three parameter sets (Kyber512, Kyber768, and Kyber1024), which are listed in Table 2 of Appendix A.

Structure and differences.
Unlike LPR, Kyber bases its security on the Learning with Errors problem in module lattices (Module-LWE).This means that, e.g., a is not a polynomial, but instead a quadratic matrix of polynomials A ∈ R k×k q , where k is the module rank.Other elements, such as the secret key s, the error e, as well as r and e 1 , are now (column) vectors containing polynomials, i.e., R k q .Kyber uses the same base ring R q for all its parameter sets, only the module rank k varies.All parameters are chosen such that polynomial multiplication can be efficiently implemented using a slightly modified NTT.
Another difference to plain LPR is the use of (lossy) ciphertext compression.In compression, a value x is divided by q/2 d , with d being the compression parameter, and then rounded to the nearest integer.Decompression reverses this process, but can of course only recover an approximation to x.In simplified terms, only the d most significant bits of x are kept, all lower bits are discarded.By doing that, the size of the ciphertext can be significantly decreased.Kyber compresses both ciphertext components (u, v), but to a different degree.The two compression parameters (d u , d v ) are specified in the parameter set.
Message encoding and decoding.For message encoding, i.e., mapping a message to an element in R q , Kyber follows the path of LPR and multiplies each message bit with q/2 .Recall that for decryption, a decoder is needed to recover the original message m from its noisy version m .This decoder is applied to each coefficient of m and returns 1 if the input value is in [q/4, 3q/4), and 0 otherwise.This must be done in a secure (read: constant time) manner, otherwise side-channel attackers might be able to infer information on m or s.
The decoding routine of Kyber's reference implementation1 is sketched in Figure 3. First, it assures that all coefficients of the input polynomial a are in [0, q).Then, it loops over all coefficients (byte index i, bit index j).Decoding as such is then done in a single line of code, where all values are interpreted as unsigned integers.After multiplying the coefficient of the input a by 2 and adding q/2 , an integer division by q is performed.Note that compilers turn such divisions with constants into multiplications2 , thereby avoiding typically non-constant-time division operations.Finally, the least-significant bit of the outcome gives the decoded message bit.After ensuring each coefficient of a lies within [0, q), the first line maps each coefficient to 0 or 1.Here, 0 ≤ i ≤ 31 and 0 ≤ j ≤ 7 yield all coefficient indices.
To further illustrate this process, in Figure 4, we show how each of the above operations affects the probability distribution of the intermediate value (also cf. Figure 1).The multiplication by 2 scales the x axis, the following addition shifts the distributions to the right.The integer division by q leads to values between 0 and 2, picking the LSB then gives the correct decoded bit.0 q q/2 q/4 3q/4 5q/2 0 2q q q/2 3q/2 0 2q q q/2 3q/2 Figure 4: Visualization of the decoding routine used in Kyber's reference implementation.

NewHope
NewHope [PAA + 19] is a key-encapsulation mechanism which bases its security on RLWE.While it did not advance to the final round of the PQC process, it is still of significance, as it has already undergone some first real-world evaluations [Lan16,Inf17].Just like Kyber, it uses an LPR-like public-key encryption scheme in combination with the FO described in Section 2.2 to construct a CCA-secure KEM.The NewHope NIST submission includes two parameter sets, with n = 512 and n = 1024, respectively.We give all other specified values in Table 3 of Appendix A.

Structure and differences.
NewHope is, compared to Kyber, arguably a more direct successor to LPR due to it also being based on RLWE.Thus, it does not operate with matrices and vectors of polynomials.NewHope also employs ciphertext compression, but unlike Kyber, it only compresses v.

Message encoding and decoding routine.
A distinguishing feature of NewHope is its message encoding routine.As encrypted messages are 256 bits long, whereas the used ring dimension n is much larger, it encodes each bit onto multiple coefficients and thereby increases resilience against decryption errors.Concretely, it repeats the message twice (for n = 512) or four times (for n = 1024) and then multiplies the coefficients of this extended message with q/2 .The decoding routine first applies the function flipabs(x) = |(x mod q) − q/2| to all coefficients.We show in Figure 5 how flipabs affects the coefficient-wise probability distribution of v − us; we also give its C code as included in NewHope's reference implementation3 in Figure 6.0 q q/2 q/4 3q/4 0 q/2 q/4 flipabs The function coeff_freeze reduces the input x into the base interval [0, q).The subsequent subtraction sets r's sign bit if x < q/2.Recognize that r is declared as a signed integer.Thus, the following line performs an arithmetic shift by 15 positions.If r is positive (sign bit is 0), this shift flushes out all set bits and yields 0. Shifting a negative r (sign bit is 1) leads to all bits being set to 1, i.e., 0xffff, which equals −1 for an int16_t.
This property can be used in line 8.If m is zero, then both the addition and the XOR have no effect.If m is 0xffff= −1, all bits are flipped after subtracting 1.This is exactly the process of computing the two's complement.Thus, lines 7 and 8 perform a conditional negation and thereby compute the absolute value of r.After computing flipabs, the decoding routine sums up the 2 (NewHope512) or 4 (NewHope1024) coefficients encoding the same message bit.If this sum is larger than either q/2 (for n = 512 and two coefficients per message bits) or q (for n = 1024), then the decoder returns 1, and 0 otherwise.This comparison can be performed in constant time by subtracting q/2 (or q, respectively) and then returning the MSB.

Masked Decoder
The previously mentioned applicability of lattice-based schemes to embedded systems raises the need for adequate protection against side-channel attacks.Due to the linearity of many involved operations, e.g., polynomial addition and multiplication in R q , the masking countermeasure appears to be a perfect fit.Extending masking to the nonlinear decoder, however, is a much more involved task.
One possible way of building such a masked decoder was presented by Oder et al. [OSPG18], who describe a fully masked implementation of an LPR-like lattice KEM.Similar to NewHope, their custom KEM encodes each message bit onto multiple polynomial coefficients.The accompanying masked decoder is designed to handle this fact.As our first target is Kyber, we now describe the simpler version geared towards the single-coefficient-per-bit encoding scheme.Also, we use the Kyber parameter set, with q = 3329 and log 2 q = 12, for all further illustrations.
The decoding routine is illustrated in Figure 7.This figure again shows how each operation performed by the decoding algorithm affects the intermediate value's probability distribution.As all operations are performed in a modular domain (either modulo q or modulo a power of two), we adapt the circular representation of Oder et al.Note that the figure shows the distribution of the unmasked value, but all operations are performed in the masked domain.That is, the decoder is fed each coefficient m in masked form, i.e., the two shares m 1 and m 2 satisfy m = m 1 + m 2 in Z q .All subsequent operations are also done in a masked manner.
As a first step, the decoding routine subtracts q/4 from the input coefficient.Since this is a linear operation in Z q , it suffices to perform this subtraction on just one of the two shares.Then, the shares are transformed, from arithmetically mod q to arithmetically mod 2 bits , where bits = log 2 q + 1.For the details of this transform, we refer to the original paper [OSPG18].After subtracting q/2 (mod 2 bits ) from one of the shares and performing a final arithmetic-to-boolean mask conversion, a masked representation of the decoded message bit can be retrieved by selecting the MSBs of the shares.

Previous Fault Attacks on Lattice-based KEMs
Thus far, the susceptibility of lattice-based KEMs against fault attacks is somewhat unexplored.Still, there exists some prior work.

Ravi et al. [RRB +
19] propose a fault attack that can target key generation and encapsulation, but not decapsulation.They exploit the fact that long secrets are generated by expanding a short seed, which is additionally used multiple times but with a different domain separator.When injecting a fault such that the domain separator is reused, two values, such as s and e, are identical.Then, one can, e.g., trivially solve the equation b = as + e = (a + 1)s for the key s.
Valencia et al. [VOGR18] performed a more general study of the susceptibility of LPRlike encryption schemes against fault attacks.They show attacks that target decryption and can recover the secret key.However, their attacks are applicable only to CPA-secure systems, such as plain LPR.Their techniques cannot be applied to, e.g., Kyber and NewHope, as the FO transform would detect all their injected faults, and the resulting pseudorandom key is of no use in a differential fault attack.Also, their attacks require faulting multiple decryptions that use the same key.This must be avoided in systems providing only CPA security, as they could otherwise be easily broken with a chosenciphertext attack.
Thus, to the best of our knowledge, the susceptibility of FO-secured decapsulation has thus far not been properly investigated.We now change this in the following by presenting a new fault attack on broad range of lattice-based KEMs.

Generic Attack Description
In this section, we describe the underlying ideas and techniques of our attack.First, we describe the attacker's capabilities and discuss some trivial fault attacks, which we argue might either not be very realistic to mount or already obvious.Then, in Section 3.1, we observe that the decryption noise linearly depends on the secret polynomials e and s.In Section 3.2, we show that by injecting skipping faults in the decoder and then observing if the decapsulation still returns the correct shared secret (ineffective fault), one can restrict the possible values of the encryption noise and hence learn information on the long-term secrets.After repeating the above for many decapsulations, the attacker can create a linear system of inequalities, which we solve using a statistical method described in Section 3.3.Finally, we describe the efficient implementation of said statistical technique in Section 3.4.
Attacker model and steps.We assume that the adversary has the following capabilities and performs the now described basic steps.The attacker is in possession of the public key and can thus execute an arbitrary number of encapsulations.He gathers and stores not only the returned ciphertexts and shared secrets but also some intermediates computed during encapsulations, such as the sampled noise polynomials and the encrypted message.The attacker can filter the recorded encapsulation results according to some values known to him, e.g., the value of certain bits of the embedded encrypted message.However, he does not alter the ciphertexts, as the FO would always detect these modifications.
The adversary can then send a selected ciphertext to the target device, which holds the private key.He injects a specific skipping fault in PKE.Decrypt inside the decapsulation (cf. Figure 2), which is the only operation involving the private key s.Note, however, that the adversary cannot directly observe the decryption output µ .Instead, the attacker can only infer if the device still derived the correct shared secret (the fault was ineffective) or if the fault attack caused an incorrect decryption result (the fault was effective).This can be done by, e.g., decrypting a response from the target using the shared key returned by the corresponding encapsulation.If the resulting plaintext conforms to the used protocol, then the targeted device still computed the correct shared secret.If, however, the plaintext appears to be random, then the FO detected the fault and thus returned a random shared secret.
Simple attacks.Before describing our attack, we now discuss some very basic fault attacks against FO-secured KEMs and argue why these attacks might be impractical or already obvious.
First, an attacker might want to skip the equality check during decapsulation, i.e., whether c == c .This would then re-enable chosen-ciphertext attacks, such as shown in, e.g., [BGRR19].However, fault injections might fail, and such attacks are typically not designed with uncertainties in mind.Also, we argue that the need to protect this obvious faulting target was already shown.In fact, skipping this check was previously proposed by Valencia et al. [VOGR18] in the context of their fault attacks.The need to protect the check was also noted by Oder et al. [OSPG18] and by Bauer et al. [BGRR19].
Another possible attack venue is exploiting ineffective faults in the stuck-at model [Cla07].When an attacker can reliably fault an intermediate to a known value, e.g., to zero (stuckat-zero), and observes that the faulted device still returns the correct shared secret, he can deduce that the intermediate was already zero.Such a fault-enabled probing capability allows trivial key recovery when targeting intermediates having, e.g., a linear dependency on the coefficients of s.However, such stuck-at faults are not trivial to achieve in practice, especially on 32-bit devices.Also, due to the large range of possible values, e.g., Kyber's modulus is 3329, it might be unlikely that the targeted intermediate takes on the injected value.When also considering the large key sizes, a vast number of highly reliable fault injections might be needed for a successful attack.

On the Linearity of the Decryption Noise
Instead of using one of the above approaches, our attack exploits the key dependency of the encryption noise.Recall from Section 2.1 that m = v − us = er − e 1 s + e 2 + m q/2 , where we call d = er − e 1 s + e 2 the encryption noise.As noted above, we assume that the adversary (honestly) generated the ciphertexts.Hence, he knows all intermediate values computed during encapsulation, including e 1 , e 2 , and m.Now observe that the noisy plaintext m and thus also the decryption noise d are linear combinations of known values and the secrets e and s.
Also, note that all components of d have small coefficients.This ensures that d is always in the range of [−q/4, q/4]; else decryption errors would occur.4Thus, one can essentially ignore the modular reductions and interpret the equation of d to be in R = Z[x]/(x n + 1).
If an adversary somehow recovers the entire noise d (or m ) for two decryptions, then he can trivially solve the resulting system of linear equations for the unknowns e and s.A similar statement can be made for the much more realistic case where the attacker can only recover a single coefficient of d, but for many decryption queries.
Here we can use the fact that polynomial multiplication in R q can be rewritten into a matrix-vector product.That is, for c = ab, we can also write c = Ab, where, with a slight abuse of notation, we equate polynomials with their coefficient vectors.The i-th column of A can be generated by computing a • x i in R q .This matrix-vector representation now allows to extract the computation of each individual coefficient of c. Concretely, we write c[i] = a (i) , b , where a (i) is the i-th row of A and •, • denotes an inner product.For the reduction polynomial x n + 1 used in LPR, this equates to The above can easily be extended to scalar products of vectors of polynomials, as used by Kyber.
When an attacker now recovers 1 , s , i.e., a linear equation with 2n unknowns (the coefficients of e and s).Gathering 2n of these equations again allows for key recovery using linear algebra.This technique of extracting single linear equations over many calls and then solving the entire system was already shown to be useful for side-channel attacks on lattice-based schemes [GBHLY16].

Fault Injection in Decoding
With a fault attack, it is not trivial to recover the true value of any d [i].Thus, we instead aim at using a relatively simple fault to acquire a hint on its value.Concretely, we inject a skipping fault in the decoder and then use the observed decapsulation outcome, i.e., whether the injected fault was effective or ineffective, as an oracle for d [i].We now explain this approach on the example of Kyber's decoding routine described in Section 2.3.
We attack the decoding routine for one selected coefficient, e.g., for m[i], and inject a fault such that the addition of q/2 (the second step in the visualization given in Figure 4) is skipped.Figure 8 now shows how this skipping fault influences the decoding process.Note that if the encryption noise d ≥ 0, which corresponds to the right side of the distributions, then the correct value is still recovered.This is the case no matter the value of the encoded bit.Hence, the fault injection was ineffective.If, however, d < 0, then the faulted decoding returns an incorrect result; the fault is effective.
Thus, by injecting this skipping fault and then testing whether the attacked device still derives the correct key, we can infer if d[i] is positive or negative.More formally, we have: For now, we assume that our skipping fault has perfect reliability.We explain methods to deal with unreliable faults in Section 6.

Solving a System of Linear Inequalities
Each such equation gives a small amount of information on e and s.Thus, the adversary has to fault many decapsulations, where he can set up one equation per faulted execution.When using r (ij ) j to denote the value of r (i) in the j-th decapsulation, and gathering a total of m equations, we can write: The index i j of the targeted coefficient can be different for each fault injection and thus each line in the system.We introduce dedicated symbols for the above system: we call the left matrix X, y = (e||s), and the right-hand side of the equation z.The attacker now wants to find the single y satisfying all these constraints.We note that similar problems need to be solved for attacks exploiting decryption errors, e.g., [DGJ + 19].When encountering or enforcing such an error, one can infer that |d| ≥ q/4 and use this information to construct a somewhat similar set of constraints.
Such a system of inequalities cannot be solved using straight-forward linear algebra.An approach that appears to be promising instead is linear programming (LP) since LP solvers already deal with such linear constraints.However, we found that the runtime of LP solvers grew very fast with the problem dimension, so much so that the required dimensions of more than 1000 unknowns appear to be out of reach.Also, LP solvers require hard constraints.However, fault injection might sometimes fail, which can then be mistaken as an ineffective fault.A single resulting erroneous inequality might already be enough to eliminate the correct solution.Hence, a more resilient approach is needed.
We now describe a statistical approach that can deal with the large dimension and is (somewhat) error-tolerant.For each of the 2n secret coefficients, we store a vector of length |χ| containing the probabilities for all its values (all key guesses).These vectors are initialized with the probabilities given by the error distribution χ.We then use an iterated method to update said probabilities given all constraints.
Assume we want to infer information on y[0] = e[0].For each of the 0 < k ≤ m equations, we compute the probability distribution of 2n−1 j=1 X[k, j]y[j], i.e., of the matrixvector multiplication using all but the targeted secret coefficient.In the first iteration, the probabilities for the y[j] are prescribed by χ, as stated above.In the next step, we enumerate all the guesses for y[0].For each guess y ∈ χ, we compute Pr(y + if fault injection k was effective, or 1 minus this probability if the fault was ineffective.We do this for all m faults.Finally, we use the above probabilities to perform Bayesian updating of the probability of each key guess.Thus, in case of an effective fault, for each equation, we have ) This process is also performed for all other secret coefficients in y.Importantly, we do not immediately use the updated probabilities.For instance, the update of y[1] still uses the original priors of y[0].
Performing this computation once for all coefficients, however, did not yield satisfactory results.For this reason, we repeat this entire process with the now updated probability vectors.After each such iteration, we pick the n most likely key values, which can then be plugged into the key relation b = as + e.The resulting linear system with n unknowns and n equations can be solved to recover the remaining n unknowns, given that the first n recovered coefficients are correct.We keep iterating the entire above process until either the correct key is recovered, or a defined maximum number of iterations-we set this threshold to 10-is reached.
We note that the described algorithm is related to the belief-propagation technique, which can be used for a plethora of inference tasks.It has seen some prior use in the context of side-channel analysis [VGS14], including attacks on lattice-based schemes [PPM17,PP19].
We also mention that the algorithm is not limited to working with inequalities, but can be adapted to exploit any information on d [i].For instance, faults in other locations followed by the distinction between effective and ineffective faults might instead reveal the LSB of d[i], or that |d[i]| is within a certain range.Such information can also be incorporated.

Efficient Attack Implementation
The above is a relatively high-level description of our key-recovery algorithm.We now describe some steps required to make the recovery process practical.
Clustering.First, we decrease the dimension of the system by clustering multiple unknowns into single variables.The cluster sizes are chosen such that the number of key guesses can still be manageable given the runtime and memory constraints.In Kyber, for instance, the error polynomials are sampled from the centered binomial distribution over the range [−2, 2], giving us 5 possible values per coefficient.We chose to cluster 4 coefficients, giving us 5 4 = 625 key guesses per cluster.Additionally, after each iteration of our algorithm, we discard key guesses having very low probability and then merge clusters such that their combined size is again enumerable.For ease of understanding, we do not consider clustering in the following explanations, but remember that it is still used.

Efficient summation of probability distributions.
A major runtime hurdle in the above algorithm is the computation of the probability distribution of j=0...2n−1\i X[k, j]y [j] for all equations k and secret coefficients i.A naive summation is prohibitively expensive, which is why we use the following optimization.
When given two discrete random variables A, B and their respective distributions as vectors of probabilities, the probability distribution of C = A + B can be computed by convolving the probability vectors of A and B. This convolution can be computed efficiently by applying the FFT and performing a pointwise multiplication of the transformed distributions. Thus, we compute FFT(Pr(X[k, j]y[j])), where the probability vector first has to be zero-padded such that it covers the full expected range of d. The pointwise product over the transformed vectors for all j ∈ {0,…,2n−1}\{i}, followed by an inverse FFT, then yields the probability vector of Σ_{j ∈ {0,…,2n−1}\{i}} X[k, j]y[j].
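The FFT-based convolution can be sketched in a few lines of Python with NumPy (a generic sketch, not the paper's Matlab implementation):

```python
import numpy as np

def convolve_dists(p_a, p_b):
    """Distribution of A + B for independent A, B, given as probability
    vectors over consecutive values; computed via FFT convolution."""
    n = len(p_a) + len(p_b) - 1           # support size of the sum
    size = 1 << (n - 1).bit_length()      # zero-pad to a power of two
    fa = np.fft.rfft(p_a, size)
    fb = np.fft.rfft(p_b, size)
    # Pointwise product in the frequency domain = convolution in the
    # value domain; the inverse transform recovers Pr(A + B).
    return np.fft.irfft(fa * fb, size)[:n]

# Example: the sum of two fair coins (values 0/1) takes the values
# 0, 1, 2 with probabilities 1/4, 1/2, 1/4.
coin = np.array([0.5, 0.5])
dist = convolve_dists(coin, coin)
```

Summing many terms then amounts to one pointwise product per term in the transformed domain, followed by a single inverse FFT at the end.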
Minimizing recomputations. The above needs to be computed for all coefficients i. Doing so via a simple loop over i would entail considerable recomputation of partial pointwise products; observe, for illustration, that the products for two different i differ only in a single factor. We use dynamic programming to reuse such common factors. We store partial products in a binary tree, whose 2n leaves are initialized to FFT(Pr(X[k, j]y[j])). We then move towards the root; for each node, we multiply the partial products of its children. We call these partial products the upward distributions.
After processing all nodes, we move back towards the leaves and compute the downward distributions. In the layer below the root, the downward distribution of a node is the upward distribution of its sibling. For the subsequent layers, the downward distribution is obtained by combining the downward distribution of the parent with the upward distribution of the sibling. Thus, for each node, the downward distribution describes the sum over all leaves that are not below that node. After fully traversing the tree in this manner, the required distributions are the downward distributions of the leaves.
By using this method, the runtime complexity can be reduced from O(n^2) (simple loop over i) to O(n), at the cost of higher memory requirements.
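The upward/downward tree traversal is easiest to see with scalars standing in for the FFT-transformed vectors (which also multiply pointwise). The following sketch computes, for every leaf, the product of all other leaves in O(n) multiplications; the function name and layout are ours:

```python
def leave_one_out_products(leaves):
    """For each index i, return the product of all leaves except
    leaves[i], via one upward and one downward tree pass.  The scalars
    stand in for FFT-transformed probability vectors; len(leaves) is
    assumed to be a power of two."""
    n = len(leaves)
    up = [None] * (2 * n)
    up[n:] = list(leaves)                 # leaves sit at indices n..2n-1
    for v in range(n - 1, 0, -1):         # upward: product of children
        up[v] = up[2 * v] * up[2 * v + 1]
    down = [None] * (2 * n)
    down[1] = 1                           # root: empty product
    for v in range(2, 2 * n):             # downward: parent x sibling
        down[v] = down[v // 2] * up[v ^ 1]
    return down[n:]                       # per leaf: product of all others

loo = leave_one_out_products([2, 3, 5, 7])   # [105, 70, 42, 30]
```

In the attack, the leaves hold FFT(Pr(X[k, j]y[j])) and an inverse FFT of each leaf's downward value yields the distribution of the sum over all j ≠ i.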

Application to Kyber and NewHope
The previous section gave a somewhat generic description, targeting a combination of plain LPR with the FO. We now describe the steps needed to apply our attack to concrete schemes and their decoder implementations. We start with Kyber in Section 4.1, then discuss how the attack applies to a masked Kyber implementation in Section 4.2, and finally, in Section 4.3, show that other LPR-like schemes, concretely NewHope, can also be susceptible.

Kyber
The previous section explained the attack on the decoder in generic terms, still targeting plain LPR. Some additional steps are needed to actually apply our attack to Kyber.
Recall that Kyber compresses both ciphertext components u and v via rounding, i.e., simply put, by dropping the low-order bits (cf. Section 2.3). This compression needs to be accounted for when setting up our system of inequalities. We write u′ = u + ∆u and v′ = v + ∆v, where u, v are the uncompressed values, u′, v′ are the values recovered from the compressed ciphertext, and ∆u, ∆v denote the additive rounding errors. We note that in our attack scenario, the adversary performs the encapsulation and thus knows both the uncompressed values and the rounding errors.
When now performing back-substitution in v′ − u′s = (v + ∆v) − (u + ∆u)s, similar to Equation (1), we arrive at d = er − (e1 + ∆u)s + e2 + ∆v. The value of ∆v can be somewhat large. Since our fault attack probes the sign of d, some inequalities would thus be fulfilled by (almost) all values of e and s. We therefore filter for small |e2 + ∆v|, i.e., we only send those ciphertexts to the target where, during encapsulation, this value was observed to be within some bound. Concretely, we only use the ciphertexts of encapsulations where |e2 + ∆v| ≤ 10. All other steps of the attack are performed as described in the previous section.
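Since the encapsulating attacker computes ∆v themselves, a simplified model of Kyber-style compression suffices to illustrate the filtering. The sketch below is our own approximation of the Compress/Decompress functions (assuming the round-3 Kyber512 value d_v = 4), not the reference code:

```python
Q = 3329  # Kyber modulus

def compress(x, d):
    """Simplified Kyber-style compression: round x in Z_q to d bits."""
    return ((x * (1 << d) + Q // 2) // Q) % (1 << d)

def decompress(c, d):
    """Map a d-bit value back to Z_q (rounding)."""
    return (c * Q + (1 << (d - 1))) >> d

# The rounding error Delta v = Decompress(Compress(v)) - v is known to
# the encapsulating party, who keeps only ciphertexts with small
# |e2 + Delta v| (<= 10 in the attack).
v = 1234
d = 4                               # assumed d_v for Kyber512
delta_v = decompress(compress(v, d), d) - v
```

The attacker simply repeats encapsulation with fresh seeds until the observed |e2 + ∆v| for the targeted coefficient falls within the bound.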

Kyber using the Masked Decoder
While masking is a powerful countermeasure against various side-channel attacks, it is, in general, not effective against faults. However, our attack probes the value of the (implicit) intermediate d, which appears only in masked form in a protected implementation. Hence, it might seem that side-channel protections, which are required anyway in a vulnerable environment, already thwart our attack. We now show that this is not the case. While we do not claim that all masked implementations are susceptible, we demonstrate that masking as such does not hamper our attack. We do so by adapting the attack to the masked decoder of Oder et al. [OSPG18], which we introduced in Section 2.5.
Recall that the first step of this decoder is to subtract q/4 (mod q) and that the input to the decoder is arithmetically shared over Z_q. As this subtraction is a linear operation in Z_q, it suffices to perform it on just one of the two shares. The effects of skipping this subtraction are shown in Figure 9. The transformation to arithmetic shares modulo a power of two now splits the distribution for a 1-bit into two parts. The final subtraction of q/2 shifts the distributions such that no decoding error occurs when d ≥ 0 (marked in green), but the wrong value is returned when d < 0 (marked in red). Thus, we observe the exact same effect as in the unmasked case, and the subsequent attack steps are identical.
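The key observation, that a linear operation on a single arithmetic share affects the shared value itself, can be sketched as follows (share/unshare are hypothetical helpers for 2-share arithmetic masking over Z_q):

```python
import random

Q = 3329  # modulus of the shared domain (Kyber's q, for illustration)

def share(x):
    """Arithmetic 2-sharing over Z_q: x = (x1 + x2) mod q."""
    x1 = random.randrange(Q)
    return x1, (x - x1) % Q

def unshare(x1, x2):
    return (x1 + x2) % Q

x = 1000
x1, x2 = share(x)
# Subtracting q/4 is linear in Z_q, so performing it on one share
# subtracts it from the shared value -- this is exactly the operation
# whose skip the attack targets.
x1 = (x1 - Q // 4) % Q
assert unshare(x1, x2) == (x - Q // 4) % Q
```

Because the subtraction touches only one share, a single instruction skip removes it from the shared value entirely, just as in the unmasked decoder.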

NewHope
Recall from Section 2.4 that NewHope encodes each message bit onto multiple coefficients.
For decoding, the flipabs function is first run on all n coefficients of m′. Then, the 2 (for n = 512) coefficients encoding the same message bit are added. Finally, after subtracting q/2, the sign bit is returned. We found that skipping a particular instruction in one call to the flipabs routine again allows us to learn the sign of d. Concretely, we skip the XOR with the bitmask m (line 8 of Figure 6). We now discuss the exact effect of this fault via a case study.
If the input x ≥ q/2, i.e., x is in the right half of Figure 1b, then r is positive and m = 0. XORing with 0 has no effect; the instruction skip is thus always ineffective, and the correct value is returned.
If x < q/2, then r < 0 and m = 0xffff, which equals −1 for the int16_t data type. Skipping the XOR thus results in returning x − q/2 − 1 instead of the desired value |x − q/2|. This incorrect value is then added to the outcome of flipabs for the second coefficient encoding the same bit. We denote by x1 the result of the faulted flipabs call and by x2 the value returned by the undisturbed call, where x2 follows the distribution shown in the right half of Figure 5. If the encoded message bit is 0, then we have x1 ∈ [−q/2 − 1, q/4 − 1] and x2 ∈ [q/4, q/2). The sum of these values is in [−q/4 − 1, q/4 − 1), which is always smaller than q/2 and will thus incorrectly decode to a 1. If the original message bit is 1, then x1 ∈ [−q/4, −1] and x2 ∈ [0, q/4], which leads to a sum in [−q/4, q/4). This will again be decoded to a 1, which is the correct value in this case.
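The case distinction can be checked with a small Python model of flipabs; the skip_xor flag modelling the instruction skip is our own addition for illustration:

```python
Q = 12289  # NewHope modulus

def flipabs(x, skip_xor=False):
    """Model of NewHope's flipabs(x) = |x - q/2|, computed branch-free
    as (r + m) ^ m with m the sign mask of r = x - q/2.  With
    skip_xor=True the final XOR is omitted, modelling the fault."""
    r = x - Q // 2
    m = -1 if r < 0 else 0      # corresponds to m = r >> 15 on int16_t
    out = r + m
    if not skip_xor:
        out ^= m
    return out

# x >= q/2: m = 0, so skipping the XOR changes nothing (ineffective).
assert flipabs(8000, skip_xor=True) == flipabs(8000)
# x < q/2: the fault returns x - q/2 - 1 instead of |x - q/2|.
assert flipabs(1000, skip_xor=True) == 1000 - Q // 2 - 1
assert flipabs(1000) == abs(1000 - Q // 2)
```

This reproduces the two cases above: for r ≥ 0 the mask is 0 and the skip is invisible; for r < 0 the two's-complement absolute value (r + m) ^ m degenerates to r − 1.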
For a message bit with value 0, d ≥ 0 leads to an ineffective fault, whereas d < 0 leads to a corrupted result. Hence, we can again set up a system of linear inequalities and solve for the key. For a message bit with value 1, the injected fault is always ineffective; the returned value is correct regardless of the value of d. We thus only send ciphertexts to the target where the attacked coefficient encodes a 0. The attacker cannot directly control the encrypted message, as it is generated by hashing a seed. However, the attacker can simply discard seeds that lead to a 1 in the targeted position.
Compression. Unlike Kyber, NewHope only compresses the second ciphertext component v. We again filter for ciphertexts where |e2 + ∆v| is small.

Evaluation
In this section, we put the previous descriptions into practice and evaluate the performance of our attack. Using fault simulations, we analyze how many faulted decapsulations are needed for key recovery. We also investigate the computational resources required for solving the system of inequalities with our algorithm.
Implementation. We implemented the statistical solving approach described in Section 3.3, including all optimizations mentioned in Section 3.4, in Matlab. All source code is available at https://github.com/latticekemfaults/latticekemfaults/.
For all evaluations presented in this section, we used fault simulations. That is, we modified the decapsulation of the Kyber and NewHope reference implementations such that the desired skipping fault is performed in software. These evaluations thus assume perfect faulting reliability; we analyze a scenario with unreliable faults in Section 6. We analyzed all proposed Kyber parameter sets (Kyber512, Kyber768, and Kyber1024) and the smaller NewHope parametrization (NewHope512).

Number of Fault Injections
Each injected fault allows us to set up a linear inequality that carries a small amount of information on the long-term secret key. We now analyze how many such inequalities, and thus fault injections, are needed for full key recovery.
We performed a sweep over the number of inequalities for each mentioned parameter set. For each analyzed quantity, we performed 20 experiments and determined the success rate. The outcome for Kyber is shown in Figure 10. For the smallest parameter set, Kyber512, approximately 6,500 fault injections are needed to achieve a success rate above 90%. As is to be expected, this number grows significantly for Kyber768 and Kyber1024: to achieve a similar success rate, we need 9,500 and 13,000 faults, respectively. The result for NewHope512 is shown in Figure 11. Compared to Kyber512, attacking NewHope512 requires a much larger number of faults, despite being in the same NIST security category. We can attribute this, at least in part, to the larger key space: in Kyber, key coefficients are drawn from a centered binomial distribution over [−2, 2], whereas in NewHope, they are sampled from [−8, 8]. We were not able to attack NewHope1024, potentially for these reasons.

Resource Requirements
Key recovery requires solving a very large system with at least 1024 unknowns and around 10,000 inequalities. Additionally, the algorithm does not assign a single value to each unknown but instead needs to keep track of its entire probability distribution. Thus, the resource use of the solving algorithm is non-negligible.
The runtime and memory consumption depend on the size of the system, i.e., on both the number of unknowns and the number of inequalities. The former is prescribed by the targeted parameter set: in Kyber, we have 2kn unknowns, which translates to 1024 for Kyber512. Successful attacks required at least 5,500 inequalities, but we used up to 18,000 during evaluations.
We give some exemplary resource requirements in Table 1. For each parameter set, we measured the runtime and memory requirements for one selected number of faults; we always picked the lowest number where the success rate, as per the analysis in Section 5.1, is at least 90%. All these measurements were done using 8 cores of a Xeon E5-4669 v4 running at 2.2 GHz. The average runtime ranges between 3 and 20 minutes. Memory might be more of a limiting factor, as we required up to 79 GB of RAM. However, we did not particularly optimize our implementation in this regard, as our testing machine still provided more than enough memory.

Experimental Verification on an M4
The above experiments use simulations to show that a single skipping fault per decapsulation can indeed suffice for key recovery. To demonstrate that the attack is also practical, we successfully attacked Kyber512 running on an ARM-based microcontroller. This section describes the experiment and discusses its outcome.

Setup
For our experiment, we targeted an STM32F405 microcontroller featuring an ARM Cortex-M4 core. The Cortex-M4, and the STM32F40X series in particular, is the de-facto standard platform for evaluating embedded software implementations of schemes in NIST's PQC process.
Our target device is mounted on the ChipWhisperer UFO board [Newb], which allows for relatively simple fault injection. Concretely, we use a ChipWhisperer-Lite board [Newa] to generate the 24 MHz base clock, which the target device then uses directly as its core clock. We added a trigger signal to mark the beginning of the decoding process. We then inject a clock glitch such that a single chosen instruction is skipped, as described in the previous section. The exact glitching parameters and the timing of the targeted instruction were determined using sweeps over the glitching parameter space on an extracted implementation of the decoder.
We run the unprotected and largely assembly-optimized Kyber512-90s implementation included in the pqm4 library [KRSS]; the C portion of the code is compiled at the O3 optimization level. The 90s version of Kyber replaces calls to Keccak with SHA2 and AES. We use this version simply because it runs slightly faster on our target; apart from that, the choice of symmetric primitives does not affect our attack. We note that the decoding is not manually optimized and instead uses the C code from the Kyber reference implementation. Still, as our attack requires only a single instruction skip, we do not expect worse attack performance with an assembly-optimized routine.

Attack Steps
We run encapsulation on a PC, which receives the public key directly from the device. The PC can perform an arbitrary number of encapsulations and then decide which of the generated ciphertexts to send to the device. As stated in Section 4.1, we only use the outcome of encapsulations where |e2 + ∆v| ≤ 10; all other ciphertexts are discarded.
After sending such a selected ciphertext to the target device, the device runs decapsulation, during which the described skipping fault is injected. For the sake of simplicity, the device directly replies with the computed shared secret in the clear. Thus, the attacker can directly test whether the received key matches the output of encapsulation (fault was ineffective) or differs from it (fault was effective). We note that in a real-world attack, the adversary does not directly receive the shared key, but can, e.g., decrypt follow-up messages with the shared secret and test whether they adhere to the used protocol (correct shared key) or appear to be random garbage (incorrect shared key).
Dealing with unreliable injections. The described attack needs to distinguish between effective and ineffective fault injections. However, failed fault injections, i.e., cases where no instruction is actually skipped, cannot be distinguished from ineffective faults by observing the shared secret alone. With our inexpensive setup, we were not able to achieve a sufficiently high fault-injection reliability; without further filtering, the attack failed due to the large number of incorrect inequalities. For this reason, we only use data from decapsulations returning an incorrect shared secret. In these cases, we can be certain that the fault injection worked and that the fault was effective. As we expect about half of the injected faults to be ineffective, we thus have to discard the data from at least half of the faulted decapsulation queries.
Despite this filtering, some of the generated inequalities turned out to be incorrect. This can happen when the injected clock glitch has some other (unknown) effect, such as general data corruption or skipping a different instruction. We call this scenario unintended faults. Assuming that the decoder returns a random bit in such a case, unintended faults will be misclassified in about 50% of all injections. This percentage likely differs in a real setup, as the outcome of unintended faults can still show a bias. As it turns out, the statistical solving approach presented in Section 3.3 can cope with some incorrect inequalities without further adaptation. We discuss potential methods to achieve higher robustness against unintended faults in Section 7.

Results
We ran 6 key-recovery experiments, each using a different key. For each experiment, we gathered 10,000 linear inequalities. With perfect faulting reliability, this would also be the number of required fault injections; as we filter for effective faults, we need to double that number. However, with our inexpensive setup, we did not achieve a very high faulting reliability: some injections led to crashing the device, while many others resulted in the correct shared key. As explained above, we need to discard such trials.
In 5 out of the 6 experiments, approximately 17% of the faulted decapsulations were exploitable and allowed the extraction of a linear inequality; for the sixth run, this number dipped to 8%. This corresponds to faulting roughly 60,000 and 125,000 decapsulations and required about 8.5 hours and 16 hours, respectively.
When feeding the gathered system of inequalities into our key-recovery algorithm, the correct private key was always successfully recovered. After plugging the recovered private key back into the inequalities, we found that between 0.4% and 1% of them were incorrect, presumably due to unintended faults. This shows that the recovery algorithm is somewhat resilient against such errors.
As our goal was to demonstrate the feasibility of the attack, and as instruction skips are a well-established fault model that has been shown to work in the past, we did not put further effort into improving the above numbers. Still, we note that a determined adversary using more sophisticated faulting equipment can likely cut down the number of faulted decapsulations, ideally to the fault quantities found in our simulations, and thereby drastically reduce the attack time.
Robustness against erroneous inequalities. To get a better grasp of the robustness of our approach, we ran additional simulations with the above settings (Kyber512, 10,000 faults) and artificially introduced errors into the inequalities. With an error rate of 0.5%, the success rate is roughly 70%. When 1% of the inequalities are incorrect, the success rate drops into single-digit percentages. Since all practical experiments succeeded, we believe that the errors that occurred still show some bias. We describe a method to deal with higher error rates in the following section.

Countermeasures and Future Work
While the FO inherently prevents many kinds of fault attacks, the presented method clearly shows that practical attacks are still possible. Our attack requires a single instruction skip per decapsulation and also shows resilience against erroneous fault injections. Thus, dedicated countermeasures are still needed. We now discuss several options and then conclude with possible future work.

Countermeasures
Protocol-level countermeasures. Our attack requires the targeted device to decapsulate arbitrary attacker-generated ciphertexts. Thus, if either the public key is only known to trusted parties or the target only decapsulates ciphertexts signed by such a trusted party, the attack is prevented. However, this might not be possible or practical in many applications.
Redundancy. Arguably the most popular method of protecting against faults is to introduce redundancy and use it for error detection. The FO provides such redundancy through the involved re-encryption, but, as shown, it does not prevent our attack.
The simplest form of redundancy is coarse-grained double execution, i.e., running decapsulation twice and only returning the shared secret if both results are equal. However, our attack does not require knowing the (faulty) shared secret; we only need to detect whether the injected fault changed the outcome. This is still possible with double execution in place, and possibly even easier if the device reacts differently upon detecting a fault.
Still, double execution can also be applied at a finer granularity. By using double computation to validate the integrity of the intermediates within the decoder, some instances of our attack can be prevented. For the attack on Kyber, for instance, we skip the addition of a constant; this will always be detected, no matter whether the fault changes the outcome of the decoder. However, in the attack on NewHope described in Section 4.3, we skip the XOR with the bitmask m. For a message bit 0 and d ≥ 0, m has a value of 0; as the XOR then does not change any value, the skipping fault is not detected, meaning it is still possible to differentiate between effective and ineffective faults.
Some lattice-based KEMs apply an error-correction code to the plaintext m and then run a decoder on the recovered plaintext. While this is done to increase the resilience against decryption errors, it also severely impedes our attack. Determining whether our attack can be adapted to this setting requires further study.

CFI.
A generic countermeasure that will always detect our attack is (fine-grained) control-flow integrity. Our attack relies on skipping faults; thus, an implementation that can ensure that the sequence of executed instructions is correct will not be vulnerable.
Shuffling. Finally, one very effective and easy-to-implement countermeasure appears to be shuffling. For setting up a linear inequality, we need to know which coefficient index i we faulted. Since the decoder is called for each coefficient independently, the order in which the coefficients are processed can easily be shuffled. While an attacker can still target the decoder and even differentiate between effective and ineffective faults, the uncertainty about i prevents deriving the inequality and thus hinders the attack. As an additional bonus, this shuffling countermeasure also hampers side-channel attacks.

Future Work
Improving the robustness. For scenarios where the algorithm fails due to unintended faults, it can be adapted to cope with larger error rates. One can estimate the expected error rate and incorporate this probability into the Bayesian update step. While this will inevitably increase the number of required fault injections, it will make the approach more robust.
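One way such an error-aware update could look is sketched below; this is our own illustration of the general idea (mixing the likelihood with an assumed error rate eps), not the paper's implementation:

```python
def robust_update(prior, likelihood, eps):
    """Unnormalized Bayesian update for one key guess when the observed
    inequality is wrong with probability eps (unintended faults).
    likelihood is Pr(observation | guess) for a correct inequality;
    mixing in eps flattens the update so that a single bad inequality
    can no longer eliminate the correct guess."""
    mixed = (1 - eps) * likelihood + eps * (1 - likelihood)
    return prior * mixed

# With eps = 0, a contradicting observation (likelihood 0) kills the
# guess outright; with eps = 0.01 it is merely down-weighted.
assert robust_update(0.5, 0.0, 0.0) == 0.0
assert robust_update(0.5, 0.0, 0.01) > 0.0
```

The posteriors would then be renormalized per cluster as in the original algorithm; larger eps makes each inequality less informative, which is why more fault injections are needed.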
Adaptation to other schemes. While we concretely apply the attack to the reference implementations of Kyber and NewHope, our descriptions in Section 3 target FO-protected plain LPR. Thus, many more LPR-based schemes (and their implementations) might be vulnerable to similar attacks. We now discuss some of these schemes.
• Saber [DKRV19] is a third-round finalist and thus a direct competitor to Kyber. Unlike Kyber and NewHope, it bases its security on the Learning with Rounding (LWR) problem. For, e.g., key generation, it samples a random s and then generates the public key b = ⌊as⌉, where ⌊·⌉ denotes rounding. That is, it does not sample a random e, but instead generates an error by rounding the product. Still, the above can be rewritten as b = as + e, with e = ⌊as⌉ − as. Another differentiating feature is its use of a power-of-two modulus instead of a prime. Still, as Saber is structurally very similar to Kyber, and as the error due to rounding can be made explicit, our attack might be applicable.
• Frodo [NAB+19], which advanced to the third round as an alternate candidate, uses an unstructured lattice and encodes multiple message bits into each coefficient. We did not study the susceptibility of this approach.
• NTRU Prime [BCLvV19], also an alternate candidate in the third round, includes a variant NTRU LPRime, which, as the name implies, can also be seen as being based on LPR. Hence, this variant might also be susceptible.
• LAC [LLJ+19] and Round5 [GMZB+19], both second-round candidates in NIST's PQC process, employ error-correction codes on the encrypted plaintext. As stated earlier, this severely impedes our attack. Still, we do not rule out that the attack can be adapted to this setting.
Incorporating lattice reduction. Our statistical algorithm returns probabilities for each coefficient of the secret key s and the error vector e. Thus far, we simply picked the n coefficients with the highest confidence and then solved the linear system b = as + e.
A small number of errors in the recovered coefficients can be corrected by enumerating likely error positions. Alternatively, one can recover fewer than n coefficients, plug these into b = as + e, and then perform key recovery using lattice-basis reduction. This requires a hard classification, i.e., information on the confidence cannot be incorporated further.
Recently, Dachman-Soled et al. [DDGR20] showed how such probabilities (soft information) can be incorporated into a lattice-reduction approach. There, however, it is important that the probabilities are somewhat reliable. We found that if our algorithm cannot recover the key, it latches onto the most likely values and boosts their probabilities close to 1. We suspect that this is due to a positive feedback loop, i.e., the probability vector of a coefficient influences itself after at least two iterations. A possible solution is to abort our algorithm after a lower number of iterations, i.e., before the positive feedback has too much of an adverse effect.

Algorithm 1: LPR Key Generation (excerpt)
3: b = as + e
4: return (pk = (a, b), sk = (a, s))

Algorithm 2: LPR Encryption
Input: public key pk = (a, b), n-bit message m
Output: ciphertext (u, v)
1: r, e1, e2 ∈ R_q ← χ_n
2: u = ar + e1
3: v = br + e2 + m · q/2
4: return (u, v)

Background notation. In the following, [a, b) represents a half-open interval including a but excluding b, whereas [a, b] includes both boundary values. We denote vectors and polynomials using lower-case characters such as s; individual coefficients of s are written as s[i]. Upper-case symbols such as A denote matrices. Multiplication of polynomials a, b in some ring R is denoted as a · b or simply ab.

Figure 2: IND-CCA transform QFO⊥m built using public-key encryption scheme PKE and hash functions F and G [PAA+19].

Figure 3: Kyber's poly_tomsg function converts a polynomial a to a 32-byte message msg. After ensuring each coefficient of a lies within [0, q), the first line maps each coefficient to 0 or 1. Here, 0 ≤ i ≤ 31 and 0 ≤ j ≤ 7 yield all coefficient indices.

Figure 6: C source for the flipabs function of NewHope's reference implementation. This function computes |(x mod q) − q/2|.

Figure 7: Visualization of the masked decoder of Oder et al. [OSPG18].

Figure 9: Visualization of the effect of skipping the first subtraction in the masked decoder of Oder et al. [OSPG18].

Figure 10: Attack success rate for all 3 Kyber parametrizations as a function of the number of faulted decapsulations.

Figure 11: Attack success rate for NewHope512 as a function of the number of faulted decapsulations.

Table 1: Resource requirements for running the attack for one selected number of faults per parameter set. All measurements used 8 cores of a Xeon E5-4669 v4 running at 2.2 GHz.