Roulette: A Diverse Family of Feasible Fault Attacks on Masked Kyber

At Indocrypt 2021, Hermelink, Pessl, and Pöppelmann presented a fault attack against Kyber in which a system of linear inequalities over the private key is generated and solved. The attack requires a laser and is, understandably, demonstrated with simulations rather than actual equipment. We facilitate and diversify the attack in four ways, thereby admitting cheaper and more forgiving fault-injection setups. Firstly, the attack surface is enlarged: originally, the two input operands of the ciphertext comparison are covered, and we additionally cover re-encryption modules such as binomial sampling and butterflies in the last layer of the inverse number-theoretic transform (INTT). This extra surface also allows an attacker to bypass the custom countermeasure that was proposed in the Indocrypt paper. Secondly, the fault model is relaxed: originally, precise bit flips are required, and we additionally support set-to-0 faults, random faults, arbitrary bit flips, and instruction skips. Thirdly, masking and blinding methods that randomize intermediate variables help our attack, whereas the IndoCrypt attack is, like most other fault attacks, either hindered or unaltered by countermeasures against passive side-channel analysis (SCA). Randomization helps because we randomly fault intermediate prime-field elements until a desired set of values is hit. If these prime-field elements are represented on a circle, which is a common visualization, our attack is analogous to spinning a roulette wheel until the ball lands in a desired set of pockets. Hence, the nickname. Fourthly, we accelerate and improve the error tolerance of solving the system of linear inequalities: run times of roughly 100 minutes are reduced to roughly one minute, and inequality error rates of roughly 1% are relaxed to roughly 25%. Benefiting from the four advances above, we use a reasonably priced ChipWhisperer® board to break a masked implementation of Kyber running on an ARM Cortex-M4 through clock glitching.


Introduction
Kyber [ABD+20] is a lattice-based key-encapsulation mechanism (KEM) and was selected as a post-quantum cryptography (PQC) standard by the United States' National Institute of Standards and Technology (NIST) in July 2022. We revisit a fault attack against Kyber proposed by Hermelink, Pessl, and Pöppelmann at Indocrypt 2021 [HPP21a], where a single ciphertext bit in either input operand of the ciphertext comparison must be flipped.
fault-injection setup is substantial but unaccounted for. Especially in scenarios where a single device instead of a batch of devices is targeted, the time spent on building the setup has no advantages of scale and likely surpasses the time needed for the actual key recovery. Presumably for the above reasons, the authors simulate the use of a laser in software.

Contributions
We improve the practicality of the above fault attack such that even a low-budget adversary has plenty of options. Four advances are made:
• Before the IndoCrypt paper [HPP21a], the ciphertext comparison was already identified as a prime target for fault attacks [OSPG18, XIU+21]. We forewarn secure-system designers that previously untargeted building blocks of the re-encryption should be protected against fault attacks too. This includes binomial sampling, butterflies in the last layer of the inverse number-theoretic transform (INTT), ciphertext compression, and its preceding modular reduction. By faulting any of these building blocks, an attacker can obtain inequalities over the private key while bypassing any potential countermeasures that guard the ciphertext comparison. One such countermeasure is proposed in the IndoCrypt paper.
• Whilst the IndoCrypt attack [HPP21a] requires a laser to precisely flip a bit, we support various equipment through various fault models, i.e., set-to-0 faults, set-to-1 faults, random faults, arbitrary bit-flip patterns, instruction skips, and instruction corruptions. The flip side is that more faults are needed for a key recovery: roughly speaking, thousands become tens or hundreds of thousands. Even so, for a Kyber implementation that is clocked at MHz rates, and depending on its efficiency and security level, the latter range typically equates to a few hours up to a few days and thus a feasible attack. Furthermore, the additional time needed for a key recovery can partially, if not completely, be recouped by not having to set up and calibrate a laser. This thought actually pertains to the entire field of study: in many papers that propose fault attacks using pure theory and no equipment, minimizing the number of faults is the exclusive focus [ASMM18], i.e., penalties encountered in practice and caused by strong theoretical assumptions are missing from the optimization model.
• Because most building blocks of Kyber have known weaknesses against side-channel analysis (SCA), such as power-consumption analysis, countermeasures should be in place [RCB22]. We lay out a peculiar case where masking and blinding methods that randomize intermediate variables facilitate a fault attack. Under normal circumstances, which include the IndoCrypt paper [HPP21a], the vulnerability to fault attacks either decreases or remains the same upon introducing these countermeasures. We fault otherwise input-defined prime-field elements such that they cover a wide range of values, ideally but not necessarily uniformly distributed, so data-randomizing countermeasures naturally help achieve a more uniform coverage. To succeed, an attacker needs to keep faulting the element until its value is contained in a specific subset of values. In related work, field elements are often represented on a circle [OSPG18], or in our analogy, a wheel from the casino game roulette. Every fault spins the wheel until, eventually, the ball lands in a winning set of pockets.
• The IndoCrypt paper [HPP21a] presents an algorithm based on belief propagation to solve systems of linear inequalities. Solving 7000 inequalities for Kyber768 takes approximately 100 minutes using a single thread. To get around this inconvenience, the authors parallelize their code: 32 threads on 16 cores result in circa 7 minutes. Instead, we deploy an accurate numerical approximation that reduces the execution time to roughly one minute using a single thread. Upscaling the hardware through threading remains possible but is no longer needed. A second, more acute problem with the solver from the IndoCrypt paper is that all inequalities are assumed to be correct, but fault-injection setups that supposedly provide these inequalities are not perfectly reliable. Based on a previous report by Pessl and Prokop [PP21a], a 1% error rate is yet to be exceeded. We alter the algorithm such that at least 25% of the inequalities can be incorrect. To tie the above two improvements together: higher error rates necessitate more inequalities and thus more computation time, causing our acceleration technique to pay off. Our solver is made open-source.
To demonstrate the above four advances, we break a masked implementation of Kyber running on an ARM Cortex-M4. A ChipWhisperer® board, which is affordable for individuals and not just organizations, is used to inject faults in the INTT through clock glitching, thereby providing inequalities that are mostly but not always correct.

Structure
The remainder of this paper is structured as follows. Sections 2 to 4 provide preliminaries on Kyber, SCA, and fault attacks respectively. Section 5 presents our roulette attacks from a theoretical perspective. Section 6 presents our solver. Section 7 presents ChipWhisperer experiments. Section 8 concludes this work. Proofs are given in the extended version of this paper that is placed in the Cryptology ePrint Archive: https://ia.cr/2021/1622.

Notation
Variables and constants are denoted by characters from the Latin and Greek alphabets respectively. Vectors and matrices are denoted by bold lowercase and bold uppercase characters respectively. Functions are printed in a sans-serif font. Operator $\lfloor \cdot \rceil$ denotes rounding to the nearest integer where ties, i.e., fractions of exactly 0.5, are rounded up.

Kyber
Kyber [ABD+20] starts from a public-key encryption (PKE) scheme that is secure against chosen-plaintext attacks (CPAs), as recapitulated in Section 2.1, and to which a variation of the Fujisaki-Okamoto (FO) transform is applied to additionally resist chosen-ciphertext attacks (CCAs), as summarized in Section 2.2. We abstain from comprehensive descriptions and only highlight aspects that are important for this work.

Public-Key Encryption
The PKE scheme consists of key generation, encryption, and decryption, as specified in Algorithms 1 to 3 respectively. For brevity, the use of binary encodings to efficiently transmit data is omitted. Parameters corresponding to three security levels are given in Table 1. The security of the scheme is based on the module learning with errors (MLWE) problem. Errors are drawn from a centered binomial distribution (CBD), i.e., $e \triangleq e_1 - e_2$ where $e_1$ and $e_2$ each follow a binomial distribution with success probability $1/2$ and a number of trials that is a scheme parameter.

Input: Private key $\hat{\mathbf{s}}$
The lossy compression is defined in Eq. (3).

Key-encapsulation mechanism
Kyber [ABD+20] uses a variation of the FO transform that is specified in Algorithms 4 to 6. Essentially, the ciphertext $c$ received by the decapsulation is re-encrypted after decryption and the result $c'$ is compared to $c$. If this comparison fails, the decapsulation returns a pseudorandom value instead of a failure symbol $\bot$, which is referred to as implicit rejection. Hash functions $H_1$ and $H_2$ are instantiated with SHA3-512 and SHA3-256 respectively; the key-derivation function (KDF) is instantiated with SHAKE-256. Kyber has a 90s variant with other symmetric primitives, which we do not use.

ARM Cortex-M4
Following a recommendation by NIST, the ARM Cortex-M4 is the primary reduced instruction set computer (RISC) processor for benchmarking the implementation efficiency of PQC schemes. This embedded processor features thirteen 32-bit general-purpose registers, each of which may pack two 16-bit signed integers. Instructions that perform multiplications, subtractions, and other operations on these halfwords are supported. Source code for Kyber is publicly available in the pqm4 library [KRSS]. Although the implementation is largely written in C, we exclusively analyze routines written in assembly. These routines were updated after our analysis, yet similar conclusions can be drawn from the latest version. Given that prime $\rho = 3329 < 2^{12}$, 16-bit halfwords can efficiently store polynomial coefficients whilst providing a margin for lazy reductions, i.e., reductions after additions and subtractions that do not cause overflow may be skipped. As pointed out by Alkim et al. [ABCG20, Algorithm 11], Montgomery reductions can be implemented using two instructions only. The realization from the pqm4 library is shown in Algorithm 7. The NTT and INTT exclusively rely on these Montgomery reductions, as evidenced by the double GS butterfly in Algorithm 8. Unfortunately, the Montgomery-reduced coefficients lie in the interval $[-\rho + 1, \rho - 1]$ instead of $[0, \rho - 1]$. To obtain coefficients in the interval $[0, \rho - 1]$ right before compression, a slower Barrett reduction is used.
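As an illustration of the signed Montgomery reduction discussed above, the following Python sketch models the arithmetic only. It is not the two-instruction assembly routine of Algorithm 7; the names `RHO_INV`, `to_int16`, and `montgomery_reduce` are hypothetical, and the constant $-3327$ is the inverse of 3329 modulo $2^{16}$ written as a signed halfword.

```python
RHO = 3329        # Kyber prime
BETA = 1 << 16    # Montgomery radix, matching beta = 2^16 in Algorithm 7
RHO_INV = -3327   # RHO^-1 mod BETA as a signed 16-bit value (assumed constant)


def to_int16(x):
    """Interpret the low 16 bits of x as a signed halfword."""
    x &= 0xFFFF
    return x - BETA if x >= 0x8000 else x


def montgomery_reduce(a):
    """Return r with r * BETA = a (mod RHO) and -RHO < r < RHO."""
    assert -(BETA // 2) * RHO <= a < (BETA // 2) * RHO
    t = to_int16(a * RHO_INV)      # t = a * RHO^-1 mod BETA, signed
    return (a - t * RHO) >> 16     # a - t*RHO is an exact multiple of BETA


if __name__ == "__main__":
    for a in (0, 1, -1, 3329 * 1234, 2 ** 26 + 17, -(2 ** 26) - 17):
        r = montgomery_reduce(a)
        print(a, r, (r * BETA - a) % RHO == 0)
```

The result lands in the signed interval $[-\rho + 1, \rho - 1]$, which is why an additional reduction is needed before compression.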

Side-Channel Analysis
As specified in Algorithm 6, the decryption is the only building block of Kyber's decapsulation that uses the private keyŝ and is thus the obvious target for SCA. However,

Algorithm 7 Montgomery [KRSS]
Input: Integer $a$ where $-(\beta/2) \cdot \rho \le a < (\beta/2) \cdot \rho$ and $\beta = 2^{16}$
To avoid the realization of a message-checking oracle through SCA, algorithms that process $m$ should be protected. This includes the hash function $H_1$ and the entire re-encryption. The academically preferred way of countering SCA is to randomize computations such that dependencies between internal secrets and measurable emissions are weakened. Below, we distinguish between masking methods, which are expensive and substantiated by a security proof in a probing model, and blinding methods, which are cheap and unsupported by a security proof.

Masking
In masked implementations, finite-ring elements $x \in X$ are randomly and uniformly split into $\lambda \ge 2$ shares according to Definition 1. According to Lemma 1, one way to meet Definition 1 is to first select
Definition 1 (Uniformity). A finite-ring element $x \in X$ is randomly and uniformly split
Lemma 1 (Subset of Shares). For a finite-ring element $x \in X$ that is randomly and uniformly split into $\lambda$ shares according to Definition 1, any tuple of $\lambda - 1$ shares is uniformly distributed on $X^{\lambda - 1}$ and thus independent of $x$. More generally, any tuple of $\alpha \in [1, \lambda - 1]$ shares is uniformly distributed on $X^{\alpha}$.
We distinguish between (i) Boolean masking, where $X = \{0, 1\}^{\sigma}$ and additions are defined by XORing, and (ii) arithmetic masking, where $x \in \mathbb{Z}_{\rho}$ and additions are performed modulo a prime $\rho$. For efficiency reasons, Boolean masking is typically used for symmetric-key algorithms, whereas arithmetic masking is used for polynomial operations. Hence, Boolean-to-arithmetic and arithmetic-to-Boolean conversions are commonplace.
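The two masking flavours can be sketched as follows. This is a minimal illustration that assumes the common construction of drawing $\lambda - 1$ shares uniformly at random and deriving the last share from the secret; the function names are hypothetical and the sketch makes no claim about side-channel security.

```python
import secrets
from functools import reduce

RHO = 3329  # Kyber prime used for arithmetic masking


def arithmetic_share(x, lam):
    """Split x in Z_RHO into lam additive shares (lam - 1 uniform, one derived)."""
    shares = [secrets.randbelow(RHO) for _ in range(lam - 1)]
    shares.append((x - sum(shares)) % RHO)
    return shares


def arithmetic_unshare(shares):
    """Recombine additive shares modulo RHO."""
    return sum(shares) % RHO


def boolean_share(x, lam, sigma):
    """Split a sigma-bit value x into lam XOR shares."""
    shares = [secrets.randbits(sigma) for _ in range(lam - 1)]
    shares.append(reduce(lambda a, b: a ^ b, shares, x))
    return shares


def boolean_unshare(shares):
    """Recombine XOR shares."""
    return reduce(lambda a, b: a ^ b, shares, 0)


if __name__ == "__main__":
    print(arithmetic_unshare(arithmetic_share(1234, 3)))   # recombines to 1234
    print(boolean_unshare(boolean_share(0xBEEF, 3, 16)))   # recombines to 0xBEEF
```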
A function G : X → Y must also be split such that shares of x ∈ X satisfying Definition 1 are mapped to shares of y = G(x) that again satisfy Definition 1. If G is linear, G is trivially split by defining ∀i ∈ [1, λ] :

Blinding
For blinding methods, we distinguish between randomization of data and randomization of time. The latter can be achieved by randomly permuting the order of parallelizable operations [Saa18, OSPG18, RPBC20, PP21a]. For example, the polynomial coefficients fed into Compress and Decompress in Eq. (3) can be permuted. Similarly, the butterfly operations within an NTT/INTT layer can be shuffled.
As an example of data randomization, consider a finite-field multiplication $y$ where $r_1$ and $r_2$ are chosen randomly, uniformly, and independently from $F \setminus \{0\}$.
where $r_1$ and $r_2$ are chosen randomly, uniformly, and independently from $[0, \eta - 1]$. At least, if a lookup table of the powers of $\zeta$ is available. Ravi et al. [RPBC20] applied the latter technique at various granularities: the extent to which $r_1$ and $r_2$ are reused across the multiplications within an NTT/INTT layer is a trade-off between cost and security. In its most generic form, the GS butterfly in Eq. (2) is realized as in Eq. (5), where blinding factors $\zeta^{r_1}$ and $\zeta^{r_2}$ cancel out to a factor 1 after the last layer.
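To make the idea of data-randomizing blinding concrete, the sketch below blinds both operands of a prime-field multiplication with random nonzero factors and removes the blinding afterwards. It is a generic illustration under our own assumptions, not a reproduction of Eqs. (4) and (5); in particular, it inverts the blinding factors directly instead of using powers of $\zeta$ from a lookup table, and `blinded_mul` is a hypothetical name.

```python
import secrets

RHO = 3329  # prime field size


def blinded_mul(x1, x2):
    """Compute x1 * x2 mod RHO while only multiplying blinded operands."""
    r1 = 1 + secrets.randbelow(RHO - 1)   # uniform on F \ {0}
    r2 = 1 + secrets.randbelow(RHO - 1)   # uniform on F \ {0}
    blinded = ((r1 * x1) % RHO) * ((r2 * x2) % RHO) % RHO
    unblind = pow(r1 * r2, -1, RHO)       # modular inverse (Python 3.8+)
    return blinded * unblind % RHO


if __name__ == "__main__":
    print(blinded_mul(17, 1234) == (17 * 1234) % RHO)
```

For nonzero operands, the intermediate products `r1 * x1` and `r2 * x2` are uniformly distributed over the nonzero field elements, which is the property that the roulette attacks later exploit.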

Fault Attacks
Although fault attacks on the key generation and the encapsulation exist [VOGR18, RRB+19], the decapsulation is once again particularly vulnerable. An attacker can fault this module a virtually unlimited number of times in order to retrieve the private key $\mathbf{s}$.

Differential fault analysis
As pointed out by Oder et al. [OSPG18], a positive side effect of using the FO transform is that many fault attacks on the decapsulation are inherently countered: by re-encrypting the decrypted message $m'$ and comparing the result $c'$ to the externally provided ciphertext $c$, secret-revealing faulted data is kept internal instead of forwarded to the output. This countermeasure, which also exists in a simpler form where an encryption or decryption is executed twice, has been well-established since the early 2000s, at which time Karri et al. [KWMK02] protected block ciphers such as the Advanced Encryption Standard (AES) against differential fault analysis (DFA).

Ineffective Faults
Another concern is that the inherent FO defense only counters DFA, or more generally, any attack that leverages faulted data. As already established in the 2000s, mere knowledge of whether the output of a keyed cryptographic algorithm is wrong or correct after introducing a fault might leak information about the secret. Faults that have the latter effect are often referred to as safe errors [YJ00] or ineffective faults [Cla07]. Because the attacker gains at most one bit of information per injected fault, a full key recovery typically requires more injections than with DFA. Bettale, Montoya, and Renault [BMR21] proposed an ineffective-fault attack against several lattice-based schemes, but Kyber is deemed secure. Attacks presented at CHES 2021 [PP21a] and IndoCrypt 2021 [HPP21a], the latter of which is based on the former, target Kyber and are recapitulated below.

CHES 2021
Pessl
From Eqs. (4) and (7), it follows that the observed effectiveness of the fault provides one inequality that is linear in the secret $\mathbf{x} \triangleq (\mathbf{s}, \mathbf{e})$, a vector of $\psi \triangleq 2 \kappa \eta$ small CBD-distributed integers, as given in Eq. (8). Both $\mathbf{a}$ and $b$ are entirely determined by the encapsulation and are thus not only known but also controllable by the attacker. Reductions modulo prime $\rho$ are omitted because (i) $b$ and the elements of $\mathbf{a}$ and $\mathbf{x}$ are small in absolute value, and (ii) opposite signs ensure that $\mathbf{a}^{\top} \mathbf{x} + b$ is small in absolute value too. Thanks to this omission, and by gathering $\omega$ inequalities where $\omega$ is several thousands, the system comprising a matrix $\mathbf{A}$ with size $\omega \times \psi$ and a vector $\mathbf{b}$ of length $\omega$ can be solved for the secret $\mathbf{x}$. In practice, the index $\iota$ is kept the same for all $\omega$ inequalities so that the fault-injection setup only needs to be aligned with a single point in space and time. From Eq. (8b) and the coin-based randomization of the encapsulation, it follows that making $\iota$ constant and selecting $\iota$ uniformly at random from $[0, \eta - 1]$ for each inequality would result in equally solvable systems anyway.
In practice, the obtained inequalities might be incorrect. Void injections where no fault is actually introduced can be misclassified as an ineffective fault. Similarly, injections that cause faults other than the intended instruction skip can be misclassified as an effective fault: the returned symmetric key is $k \leftarrow \mathrm{KDF}(z \,\|\, H_2(c))$ either way. In order to tolerate errors, and in order to keep execution times reasonable despite the large dimensions $(\omega, \psi)$, the authors abstain from using linear programming and base their solver on belief propagation. Their eventual algorithm, however, was unable to exceed a 1% error rate, and attempts to increase this number were deferred to future work.
The solver is an iterative method that maintains a probability mass function (PMF) for each unknown $x[j]$ in Eq. (8), where $j \in [0, \psi - 1]$. All $\psi$ PMFs are initialized with the CBD, and in each iteration, all $\psi$ PMFs are updated, until they sufficiently approximate one-point distributions to make further iterations pointless. To update the PMF of any given $x[j]$, the probability that $\mathbf{a}^{\top} \mathbf{x} + b \ge 0$ in Eq. (8) holds is computed for each possible outcome of $x[j]$ and for each of the $\omega$ inequalities according to Eq. (9), and these probabilities are then aggregated. Importantly, the PMF updates do not interfere with one another, i.e., each update only uses probability masses from the previous iteration.
Each probability in Eq. (9) involves a linear combination of $\psi - 1$ random variables $x[j]$, which do not have a special shape anymore after the first iteration, and generally necessitates linear (non-circular) convolution. Even though the fast Fourier transform (FFT) is used to accelerate these convolutions, similar to how the NTT is used in Kyber to accelerate polynomial multiplication according to Eq. (1), and even though binary trees improve the reuse of intermediate variables, the computational load remains heavy with $\omega \psi$ FFTs and $\omega \psi$ inverse FFTs per iteration.
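The role of linear convolution in such a PMF update can be illustrated with a deliberately naive sketch. It computes, for one inequality and one unknown $x[j]$, the probability that $\mathbf{a}^{\top} \mathbf{x} + b \ge 0$ holds for every candidate value of $x[j]$ by directly convolving the PMFs of the other scaled unknowns. This is our own simplification for exposition: it uses neither FFTs nor binary trees, it does not aggregate over inequalities, and the names `cbd_pmf`, `convolve`, and `prob_inequality_holds` are hypothetical.

```python
from math import comb


def cbd_pmf(n):
    """PMF of e1 - e2 with e1, e2 ~ B(n, 1/2), as a dict value -> probability."""
    return {k: comb(2 * n, n + k) / 4 ** n for k in range(-n, n + 1)}


def convolve(p, q):
    """Linear convolution of two PMFs given as dicts (PMF of the sum)."""
    out = {}
    for u, pu in p.items():
        for v, pv in q.items():
            out[u + v] = out.get(u + v, 0.0) + pu * pv
    return out


def prob_inequality_holds(pmfs, a, b, j):
    """For each candidate v of x[j], return P(a.x + b >= 0 | x[j] = v)."""
    rest = {0: 1.0}                       # PMF of the sum over all i != j
    for i, (pmf, coeff) in enumerate(zip(pmfs, a)):
        if i == j:
            continue
        scaled = {}
        for value, p in pmf.items():      # PMF of coeff * x[i]
            scaled[coeff * value] = scaled.get(coeff * value, 0.0) + p
        rest = convolve(rest, scaled)
    return {v: sum(p for s, p in rest.items() if s + a[j] * v + b >= 0)
            for v in pmfs[j]}


if __name__ == "__main__":
    prior = cbd_pmf(2)
    print(prob_inequality_holds([prior, prior, prior], [3, -1, 2], 1, 0))
```

Each call touches every other unknown once, which hints at why the full update over $\omega$ inequalities and $\psi$ unknowns is expensive without further acceleration.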

Indocrypt 2021
Hermelink, Pessl, and Pöppelmann [HPP21a] presented a similar attack: Eqs. (6) to (9) and Fig. 1 remain identical. The difference is that the increase of $m_\iota$ by $\rho/4$ in Eq. (6) is not realized by a fault but by manipulating the compressed ciphertext coefficient $v_\iota$ of an otherwise correctly computed encapsulation, as specified in Eq. (10) and enabled by $m \leftarrow v - \mathrm{INTT}(\hat{\mathbf{s}} \circ \hat{\mathbf{u}})$ in Line 5 in Algorithm 3. Similar manipulations were used in the aforementioned SCA-assisted CCAs [DTVV19, RRCB20, UXT+21], with the difference that the message-checking oracle is now realized by reintroducing a fault instead of SCA. The authors make the decapsulation succeed in case of a correctly decrypted message bit $m'_\iota = m_\iota$ by the suggested use of a laser to flip a single bit in either input operand of the ciphertext comparison. If multiple bits can reliably be flipped, the Hamming distance (HD) constraint can be removed. Depending on which operand is faulted, the symmetric key $k$ is either $\mathrm{KDF}(k \,\|\, H_2(c))$ or $\mathrm{KDF}(k \,\|\, H_2(c'))$; both values are known to the attacker. If $m'_\iota \ne m_\iota$, the coins $r'$ used by the re-encryption and thus also the produced ciphertext $c'$ are entirely corrupted, and any attempt for rectification is in vain: $k \leftarrow \mathrm{KDF}(z \,\|\, H_2(c'))$.
Unlike the attack at CHES 2021, the error rate of the obtained inequalities is inherently asymmetric: an observed decapsulation success cannot be misclassified, whereas an observed decapsulation failure can be misclassified due to void injections and unintended faults. To reduce the error rate in the latter case, $\beta > 1$ fault-injection attempts per inequality can be made. The solver is another variation of belief propagation making use of Eq. (9), FFTs, and binary trees, and is fed inequalities that are 100% correct, originating from perfect software-simulated faults. Around 6000, 7000, and 9000 faulted decapsulations suffice to recover the private key of Kyber512, Kyber768, and Kyber1024 respectively, with a success rate of nearly 100%.
To achieve an execution time under 10 minutes for Kyber768 with 7000 inequalities, 32 threads running on 16 cores are required.
The attack may be hindered by masking, shuffling, and/or double executions, but is not precluded, in part due to the error-rate asymmetry. Therefore, the authors proposed an additional countermeasure: instead of ciphertexts $c$, pairs $(c, H_3(c))$ where $H_3$ is another hash function are stored in random-access memory (RAM) and eventually compared. Although faulting $c$ while it is stored in RAM becomes pointless, the attack still succeeds by faulting $c$ before it is fed into the hash function, e.g., in the back end of $\mathrm{Compress}(v; \rho, \delta_v)$.

Roulette Attacks
Considering that our roulette attacks may be applicable to several KEMs, we first present a general methodology in Section 5.1, and then apply this methodology to Kyber's decapsulation in Section 5.2.

General Methodology
Consider a keyed cryptographic algorithm $A : S \times I \to O$ where $s \in S$ is keying material, $i \in I$ is the public input, and $o \in O$ is the output. Output $o$ is not necessarily public, but an attacker can observe whether or not $o$ is correct. We decompose $A$ into four parts, as shown in Fig. 2. To keep the execution time of the attack within bounds, we require that the cardinalities $|T_1|$, $|T_2|$, and $|T_3|$ are small.
Figure 2: Decomposition of cryptographic algorithm A.
For a constant input $(s, i)$, the attacker repeatedly faults either $t_1$ or $A_2$ or $t_2$ or $A_3$ or $t_3$ such that $t_3 \in T_3$ is not constant, i.e., $t_3$ does not follow a one-point distribution with respect to the infinite set of fault injections. If for the given distribution of $t_3$, the probability that $A$ fails to produce the correct output $o$ depends on the secret $s \in S$, then the attacker retrieves information on $s$. Although many distributions might enable an attack, we idealize the case where $t_3$ is uniformly distributed on $T_3$. In our casino analogy, this corresponds to spinning a fair roulette wheel, at least if we visualize $T_3$ through a circular representation. Our motivation for this idealization is that uniform distributions naturally support (i) a large attack surface, as shown by the fault-propagation properties in Section 5.1.1, and (ii) various fault models, as shown by the examples in Section 5.1.2.

Attack Surface
For a function that is balanced according to Definition 2, which extends existing definitions [DMB19], uniformly distributed faults propagate as uniformly distributed faults, as formalized in Lemma 2 and proven in the extended ePrint version. If the function $A_3 : T_2 \times W_2 \to T_3$ in Fig. 2 happens to be balanced with respect to $t_2$, an attacker who is able to fault either $A_2$ or $t_2$ such that the faulted value of $t_2$ is distributed as $U(T_2)$ indirectly achieves a faulted value of $t_3$ that is distributed as $U(T_3)$. If, additionally, $A_2$ is balanced with respect to $t_1$, then a faulted value of $t_1$ that is distributed as $U(T_1)$ has the same effect.

Lemma 2 (Fault Propagation by Balanced Functions). Let $G : X \to Y$ be a balanced function, as formalized in Definition 2. If $x \sim U(X)$, then $y \sim U(Y)$. Similarly, for a function $G : X_1 \times X_2 \to Y$ that is balanced with respect to input $x_1 \in X_1$, if $x_1 \sim U(X_1)$ is independent of $x_2 \in X_2$, then $y \sim U(Y)$.
Fortunately for the attacker, balanced functions are frequently used in cryptography. Bijections are a trivial example. Additions in a finite ring and multiplications in a finite field are two more examples, as formalized in Lemmas 3 and 4 respectively, and proven in the extended ePrint version. In fact, balancedness is merely the ideal case; imbalanced fault propagation might still enable an attack in practice.

Lemma 3 (Balancedness of Addition in Finite Rings). Let $R$ be a finite ring and let $G : R^2 \to R$ be defined as $y \triangleq G(x_1, x_2) \triangleq x_1 + x_2$. It holds that $G$ is fully balanced, i.e., Definition 2 is met with respect to both input $x_1 \in R$ and $x_2 \in R$.
Lemma 4 (Balancedness of Multiplication in Finite Fields). Let $F$ be a finite field and let $G : F^2 \to F$ be defined as $y \triangleq G(x_1, x_2) \triangleq x_1 \cdot x_2$, where $x_2 \ne 0$. It holds that $G$ is balanced, i.e., Definition 2 is met with respect to input $x_1 \in F$.
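Lemmas 3 and 4 can be checked by brute force on small instances. The sketch below assumes a working notion of balancedness, namely that for every fixed second input each output value has the same number of preimages in the first input; Definition 2 itself is not reproduced here, so this reading is our assumption, and `is_balanced_in_first_input` is a hypothetical helper.

```python
from collections import Counter


def is_balanced_in_first_input(g, dom1, dom2, codomain):
    """Check that, for each fixed x2, every output has equally many preimages x1."""
    dom1, codomain = list(dom1), list(codomain)
    if len(dom1) % len(codomain) != 0:
        return False
    expected = len(dom1) // len(codomain)
    for x2 in dom2:
        counts = Counter(g(x1, x2) for x1 in dom1)
        if any(counts[y] != expected for y in codomain):
            return False
    return True


if __name__ == "__main__":
    # Addition in the ring Z_12 (Lemma 3).
    print(is_balanced_in_first_input(
        lambda x1, x2: (x1 + x2) % 12, range(12), range(12), range(12)))
    # Multiplication in the field Z_13 with a nonzero second operand (Lemma 4).
    print(is_balanced_in_first_input(
        lambda x1, x2: (x1 * x2) % 13, range(13), range(1, 13), range(13)))
    # Counterexample: Z_12 is not a field, and multiplying by 2 is not balanced.
    print(is_balanced_in_first_input(
        lambda x1, x2: (x1 * x2) % 12, range(12), [2], range(12)))
```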

Fault Models
Examples 1 to 4 demonstrate that the ideal distribution, $t_3 \sim U(T_3)$, can be achieved for various fault models. Masking is a facilitator in Examples 2 and 3; data-randomizing blinding is a facilitator in Example 4. Depending on whether $A_2$ and $A_3$ are balanced, Examples 1 and 2 can be applied to $t_1$, $t_2$, and/or $t_3$, and Examples 3 and 4 can be applied to $A_2$ and/or $A_3$, at least in the ideal case; non-uniform distributions also enable attacks in practice.
Example 2 (Set-To-Constant Faults). Set-to-0 and set-to-1 faults are covered for masked implementations. Let $y$ be randomly and uniformly split into $\lambda \ge 2$ shares according to Definition 1, and without loss of generality, assume that the first share, $y^{(1)} \in Y$, is set to an arbitrary constant $\theta \in Y$, whereas shares $y^{(2)}, \ldots, y^{(\lambda)} \in Y$ are untouched. Considering that $y^{(1)} \sim U(Y)$ and $(y^{(2)}, \ldots, y^{(\lambda)}) \sim U(Y^{\lambda - 1})$ according to Lemma 1, it follows that the faulted value $y' = \theta + y^{(2)} + \cdots + y^{(\lambda)} = y - y^{(1)} + \theta \sim U(Y)$.
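The set-to-constant argument can be verified exhaustively for a small modulus. The sketch below enumerates every additive sharing of a fixed value, overwrites the first share with a constant, and counts the recombined results; the small modulus and the helper name `faulted_value_counts` are our own choices for illustration.

```python
from collections import Counter
from itertools import product


def faulted_value_counts(y, theta, lam, modulus):
    """Count recombined values over all sharings of y when share 1 is set to theta."""
    counts = Counter()
    for free in product(range(modulus), repeat=lam - 1):
        # lam - 1 shares are free; the last share makes the sum equal y.
        shares = list(free) + [(y - sum(free)) % modulus]
        shares[0] = theta                      # set-to-constant fault
        counts[sum(shares) % modulus] += 1
    return counts


if __name__ == "__main__":
    # Every residue modulo 7 is hit equally often, whatever y and theta are.
    print(sorted(faulted_value_counts(y=5, theta=0, lam=3, modulus=7).items()))
```

Because all sharings are equally likely, equal counts mean that the faulted value is uniformly distributed, independently of the unfaulted value.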

Example 3 (Instruction Skips and Corruptions). Let $G : X \to Y$ be realized through a masked software implementation. Without loss of generality, assume that an instruction in the first share function, $G^{(1)}$, is either skipped or corrupted such that the faulty output share $(y^{(1)})'$ is independent of the correct output share $y^{(1)}$. Hence, $y' = (y^{(1)})' + y^{(2)} + \cdots + y^{(\lambda)}$ is again uniformly distributed on $Y$.
Example 4 (Arbitrary Bit Flips). Let $G : X \to Y$ be an affine function over a finite field $X = Y = \{0, 1\}^{\phi}$ where addition is defined by XORing. Let $y \triangleq G(x)$ be realized through a blinded implementation. For any pattern of bit flips $e \in \{0, 1\}^{\phi} \setminus \{0\}$ applied to the input of $G$, it holds that the faulted output $y'$ is uniformly distributed on $\{0, 1\}^{\phi} \setminus \{y\}$. Strictly speaking, this distribution is nearly uniform, given that the case $y' = y$ is excluded. One could achieve $y' \sim U(\{0, 1\}^{\phi})$ by aborting the fault injection with probability $1/2^{\phi}$, but this would be pointless in an actual attack.

Comparisons
Table 2 compares our roulette attacks to well-known fault attacks, i.e., DFA, fault sensitivity analysis (FSA) [LOS12], and a statistical ineffective fault attack (SIFA) [DEK+18]. The standout property of roulette attacks is that masking is a facilitator. Although masking may not preclude DFA [BH08], FSA [MMP+11, Del20], or SIFA [DEG+18], it is not a facilitator here. Furthermore, note that the fault distributions of roulette attacks and SIFA are complementary to some extent.

Application to Kyber's Decapsulation
We now instantiate the generic cryptographic algorithm A from Section 5.1 with Kyber's decapsulation, as specified in Algorithm 6. Our first and foremost roulette attack is an extension of the IndoCrypt attack [HPP21a]; the private key s is recovered by faulting the re-encryption. A second roulette attack recovers the message m and the corresponding session key k by faulting the decryption. Considering that the second attack is far less practical while recovering the short-term and thus not the long-term secret, its specification is deferred to the extended ePrint version.

Attack Surface
The generic variable $t_3 \in T_3$ in Fig. 2 is instantiated with a compressed ciphertext coefficient $v_\iota \in \{0, 1\}^{\delta_v}$ that is output from the re-encryption, as specified in Algorithm 2. Following Hermelink et al. [HPP21a], the goal is to match a manipulated coefficient so that the polynomial comparison succeeds, at least if the preceding decryption is correct. If the faulted value of $v_\iota$ is uniformly distributed on $\{0, 1\}^{\delta_v}$, then the probability of a successful decapsulation is approximately 0 if $m'_\iota \ne m_\iota$ and $1/2^{\delta_v}$ otherwise. For Kyber512 and Kyber768, the latter probability is 1/16; for Kyber1024, the latter probability is 1/32. The attacker injects faults until a decapsulation success is observed. After $\beta$ unsuccessful injections, a decapsulation failure is assumed. Inequalities that correspond to an observed decapsulation success are always correct, whereas the error rate of inequalities that correspond to an observed decapsulation failure decreases with $\beta$.
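The trade-off governed by $\beta$ can be quantified with a short calculation. The sketch below assumes ideal, independent injections that each hit the desired coefficient with probability $1/2^{\delta_v}$ when the decryption is correct; it then gives the probability that all $\beta$ attempts miss, so that a failure is wrongly assumed. This miss probability is only one ingredient of the inequality error rate, which also depends on how often the decryption is correct; the function names are hypothetical.

```python
from fractions import Fraction


def hit_probability(delta_v):
    """Per-injection success probability for a uniformly faulted coefficient."""
    return Fraction(1, 2 ** delta_v)


def miss_probability(delta_v, beta):
    """Probability that beta independent injections all miss."""
    return (1 - hit_probability(delta_v)) ** beta


def attempts_for_miss_rate(delta_v, target):
    """Smallest beta whose miss probability does not exceed target."""
    beta = 1
    while miss_probability(delta_v, beta) > target:
        beta += 1
    return beta


if __name__ == "__main__":
    for name, delta_v in (("Kyber512/768", 4), ("Kyber1024", 5)):
        beta = attempts_for_miss_rate(delta_v, Fraction(1, 100))
        print(name, beta, float(miss_probability(delta_v, beta)))
```

Under these assumptions, the expected number of injections before a success is $2^{\delta_v}$, i.e., 16 or 32, which is in line with the stated increase of one to two orders of magnitude in the number of faults.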
Compared to the Indocrypt attack [HPP21a] in its original form, the number of fault injections increases by roughly one or two orders of magnitude, but we get a considerably larger attack surface and support for various fault models in return. As illustrated in Fig. 3 and in accordance with the C reference implementation of Kyber submitted to NIST [ABD+20], the function $A_3 \circ A_2$ that produces a coefficient $v_\iota \in \{0, 1\}^{\delta_v}$ comprises (i) one GS butterfly in the last layer of the INTT, which includes one Montgomery multiplication, (ii) another Montgomery multiplication for scaling purposes, (iii) the generation of one CBD sample, (iv) the decompression of one message bit, (v) one addition, (vi) one Barrett reduction, and (vii) one compression. Moreover, by faulting any of these building blocks, the countermeasure of Hermelink et al. [HPP21a] to store $(c, H_3(c))$ in RAM is bypassed.
Another godsend for the attacker is that the fault-propagation statistics are almost ideal. The modular addition is perfectly balanced according to Definition 2 with respect to all three inputs (this is a trivial generalization of Lemma 3). Ciphertext compression as defined in Eq. (3) is not perfectly balanced, but the deviation is too small to notably impact the attack. If we introduce faults such that the uncompressed coefficient is uniformly distributed on $[0, \rho - 1]$, then the compressed coefficient slightly deviates from uniform. For Kyber512 and Kyber768, the zero coefficient occurs with probability 209/3329, whereas all other coefficients occur with probability 208/3329. Similarly, for Kyber1024, this becomes 105/3329 for the zero coefficient and 104/3329 for all other coefficients.
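The stated bin sizes follow directly from counting the preimages of the compression function. The sketch below assumes the usual Kyber definition of compression, i.e., rounding $(2^{\delta}/\rho) \cdot x$ to the nearest integer and reducing modulo $2^{\delta}$, written in integer arithmetic; the function names are hypothetical.

```python
from collections import Counter

RHO = 3329  # Kyber prime


def compress(x, delta):
    """Round (2^delta / RHO) * x to the nearest integer, modulo 2^delta."""
    return (((x << delta) + RHO // 2) // RHO) % (1 << delta)


def compressed_distribution(delta):
    """Number of uncompressed values in [0, RHO - 1] per compressed value."""
    return Counter(compress(x, delta) for x in range(RHO))


if __name__ == "__main__":
    for delta in (4, 5):   # delta_v for Kyber512/768 and Kyber1024
        dist = compressed_distribution(delta)
        print(delta, dist[0], sorted(set(dist.values())))
```

Dividing the counts by 3329 gives the probabilities quoted above for a uniformly distributed uncompressed coefficient.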

Optional Hamming-Distance Constraint
The sole purpose of the HD constraint in Eq. (10) is to establish single bit flips as the fault model. In our extension of the attack, this constraint does not affect the feasibility of a fault injection and is thus entirely optional. To accommodate a potential omission, we replace Eqs. (7b) and (10). As a starting point, we summarize the behavior of Compress and Decompress in Eq. (3). For Kyber512 and Kyber768, where $\delta_v = 4$, our summary is contained in the first five columns of Table 3. The first and last elements of each bin are defined by Compress; the bin centers are defined by Decompress. For brevity, we do not discuss Kyber1024, where $\delta_v = 5$, but identical conclusions can be drawn as shown in the extended ePrint version.
Figure 3: The attack surface of the IndoCrypt paper [HPP21a] is colored blue; our extension is colored orange.
An evident anomaly is that bin 0 is 'oversized': it contains 209 elements, whereas 15 'ordinary' bins each contain 208 elements. The proposed manipulation in Eq. (10) is to add ρ/4 = 832 = 4 · 208 to the uncompressed coefficient v ι , which is a jump spanning exactly 4 'ordinary' bins. Unfortunately, the first element of bin 0 then maps to the last element of bin 3, given that (3225 + 832) mod 3329 = 728, and thus not to the first element of bin 4. In the absence of the HD constraint, the decryption would face an accumulated error ∆m ι = ∆m ι + 632, which significantly undershoots the desired effect ∆m ι = ∆m ι + 832 in Eq. (6). An easy fix is to replace Eq. (10) by a direct manipulation of the compressed coefficient v ι as specified in Eq. (11).
Furthermore, in cases where the HD is 2 instead of 1, the accumulated error ∆m ι happens to be increased by 833 instead of 832. Equation (12) extends Eq. (7b) accordingly.
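The anomaly at the first element of bin 0 can be reproduced with a few lines of arithmetic. This is a sketch under the assumption that Eq. (3) is the standard Kyber Compress with δ v = 4; the function names are hypothetical.

```python
Q = 3329  # Kyber modulus, denoted rho in the text


def compress(x, d=4):
    """Standard Kyber compression (assumed to match Eq. (3))."""
    return (((x << (d + 1)) + Q) // (2 * Q)) % (1 << d)


def shifted_bin(x, shift=832, d=4):
    """Bin reached after adding `shift` to the uncompressed coefficient."""
    return compress((x + shift) % Q, d)


first_of_bin_0 = 3225
print((first_of_bin_0 + 832) % Q)    # 728
print(compress(first_of_bin_0))      # 0: 3225 is indeed in bin 0
print(shifted_bin(first_of_bin_0))   # 3: the shift lands in bin 3, not bin 4
print(shifted_bin(0))                # 4: an interior element of bin 0 jumps 4 bins
```

The contrast between the last two lines shows why adding 832 to the uncompressed coefficient is not equivalent to advancing the compressed coefficient by 4 bins for every input.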

Masked Software on ARM Cortex-M4
To demonstrate how roulette attacks can defeat SCA countermeasures, theoretical examples are given. Due to the large attack surface in Fig. 3, where most building blocks come with a plethora of implementation strategies and masking schemes, we cannot possibly be exhaustive. Our first example is a segment of masked software on the ARM Cortex-M4. Although the Kyber implementations in the pqm4 library [KRSS] are unprotected, we focus on linear functions exclusively so that masking is realized merely by executing the corresponding code segments λ ≥ 2 times on their respective shares. More specifically, we focus on linear functions that are written in assembly so that differences among C compilers and build settings are irrelevant. We opted for the double GS butterfly in the last layer of the INTT, as implemented in Algorithm 8 and executed on λ ≥ 2 shares. For all nine instructions, Table 4 summarizes the effect of skipping that particular instruction for a single share.
Clearly, the attacker is in a privileged position: for five out of nine instruction skips, the faulted output coefficients are uniformly distributed, which is our ideal-case scenario. The uniformity proofs are all instances of Example 3 and deferred to the extended ePrint version. For the first two instruction skips though, two output coefficients are disturbed, which implies that the attacker must perform more fault injections. For instructions 5.1 to 6.2 in Table 4, a tractable closed-form expression for the distribution of the faulted coefficient d might not exist. However, we took an empirical approach by measuring the distribution of d on the ARM Cortex-M4, where an instruction skip is trivially realized by removing that particular instruction from the source code, and did not observe any non-uniformities that would hinder the attack.

Blinded Hardware
For attacks on hardware components, spatially localized fault-injection methods such as laser beams or electromagnetic waves are of particular interest. A potential target is, for example, a GS butterfly blinded according to Eq. (5) in the final INTT layer. As formalized in Eq. (13), if the attacker flips an arbitrary set of bits in multiplicand (a + b), then the faulted butterfly output c is uniformly distributed on a subset of Z ρ with cardinality η, given that ζ is the η-th root of unity. Contrary to Example 4, only η/ρ ≈ 7.7% of all possible values are covered, but the attack succeeds considering that one or more values around ∆c = ρ/4 suffice.
Similarly, bit flips in multiplicand (a − b) cause butterfly output d to be uniformly distributed on a subset of η elements in Z ρ . Flipping bits in either a or b is possible too, but then more injections must be performed because c and d are simultaneously faulted.

Countermeasures
As demonstrated in Sections 5.2.3 and 5.2.4, masking and blinding methods that randomize intermediate variables facilitate roulette attacks. Other off-the-shelf countermeasures slow down the attack, albeit at a significant cost. For example, against a re-encryption module in which polynomial coefficients are randomly permuted, the attacker must inject faults until the chosen time of the injection eventually coincides with the manipulated ciphertext coefficient v ι . Alternatively, if the re-encryption and ciphertext comparison are executed twice, the attacker needs two consecutive lucky spins of the roulette wheel.

Solving Systems of Linear Inequalities
Both Pessl and Prokop [PP21b] and Hermelink et al. [HPP21b] published source code for solving systems of linear inequalities on GitHub, but we implement our own solver from scratch in order to reduce the computation time and increase the error tolerance. Source code is available in the following GitHub repository: https://github.com/Crypto-TII/roulette.
The solver is entirely written in Python, but by mapping resource-intensive operations to large NumPy arrays, the heavy lifting is actually done in C on contiguous memory. Our code includes an implementation of Kyber, which uses symmetric primitives from the PyCryptodome library. Test routines compare the private key k priv , the public key k pub , the ciphertext c, and the shared secret k against those from the NIST reference implementation. To make all plots in this section reproducible, we include the methods that generated their data points, besides the solver itself.

Reduced Computation Time
The high computation time of previous solvers was already attributed to a single culprit, i.e., Eq. (9). We accelerate Eq. (9) by replacing the exact approach with an approximation. Considering that a large number of variables, i.e., ψ − 1, is being summed, the PMF of the sum can accurately be approximated by a normal distribution according to the central limit theorem (CLT). In later iterations, the binomial distributions evolve towards one-point distributions, and the approximation becomes less precise, but by then the algorithm has already homed in on the solution anyway. The resulting computation in Eq. (14) is light and straightforward. The summand 1/2 compensates for the fact that a discrete distribution with step size 1 is approximated by a continuous distribution.
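The role of the summand 1/2 can be illustrated numerically. The sketch below is not Eq. (14) itself: as a simplifying assumption, it sums identically distributed, unweighted CBD variables, whereas the solver sums weighted variables whose distributions change per iteration. All function names and the chosen parameters (`n_trials = 2`, `count = 200`) are illustrative.

```python
import math

import numpy as np


def cbd_pmf(n_trials):
    """PMF of e1 - e2 with e1, e2 ~ B(n_trials, 1/2); index i maps to value i - n_trials."""
    counts = [math.comb(2 * n_trials, i) for i in range(2 * n_trials + 1)]
    return np.array(counts) / 4.0 ** n_trials


def exact_cdf_of_sum(pmf, count):
    """Exact CDF of a sum of `count` i.i.d. variables, via repeated convolution."""
    total = np.array([1.0])
    for _ in range(count):
        total = np.convolve(total, pmf)
    return np.cumsum(total)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def clt_cdf_of_sum(n_trials, count, correction=0.5):
    """Normal approximation of P(sum <= t) for every t on the support of the sum."""
    sigma = math.sqrt(count * n_trials / 2.0)  # each variable has mean 0, variance n_trials / 2
    support = range(-count * n_trials, count * n_trials + 1)
    return np.array([normal_cdf((t + correction) / sigma) for t in support])


n_trials, count = 2, 200
exact = exact_cdf_of_sum(cbd_pmf(n_trials), count)
err_with = np.max(np.abs(exact - clt_cdf_of_sum(n_trials, count, 0.5)))
err_without = np.max(np.abs(exact - clt_cdf_of_sum(n_trials, count, 0.0)))
print(f"max CDF error with the summand 1/2:    {err_with:.2e}")
print(f"max CDF error without the summand 1/2: {err_without:.2e}")
```

In this toy setting, dropping the continuity correction increases the worst-case CDF error by more than an order of magnitude, which is consistent with the motivation given for the summand.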
Instead of the reported 15 minutes, a single-threaded Kyber768 iteration with ω = 7000 inequalities and ψ = 1536 unknowns now takes less than five seconds. These numbers are obtained from different computers, but as our number comes from a laptop with Python running in a virtual machine, we are unlikely to have a significant advantage. The need for parallelizing computations through threading is removed. In the next section on error tolerance, the benefit of the CLT-based acceleration increases, given that the required number of inequalities ω increases with the error rate.

Increased Error Tolerance
Our error-tolerant solver is represented in Algorithm 9. Whilst observed decapsulation successes are presumed to be 100% correct, observed decapsulation failures are only assumed to be correct up to a probability that is estimated in Line 4. Regarding the CBD in Line 3, we point out that the PMF of the difference e = e 1 − e 2 of two independent binomial samples e 1 and e 2 , each with success probability 1/2 and the same number of trials, can simply be evaluated as a single binomial PMF f bino with twice as many trials and success probability 1/2, evaluated at e plus the number of trials of one sample, as proven in the extended ePrint version. Not equally compact, Hermelink et al. [HPP21b] loop over all pairs (e 1 , e 2 ). As the probabilities P[i, j, k] might be small, the product in Line 13 is realized through a sum of logarithms to avoid underflow. Line 12 ensures that the logarithms do not receive inputs close to zero. The stop criterion in Line 6 is met if a maximum of 16 iterations is reached, or if a fitness value obtained from filling in x in the inequalities does not improve anymore.
As noted in the IndoCrypt paper [HPP21a], correctly guessing ψ/2 out of ψ unknowns suffices for key recovery, because the remaining half can be recovered via the public key k pub . The authors implemented several confidence measures to select ψ/2 coefficients in every iteration, but we do not, despite the reduction in the number of inequalities that this would bring.
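The closed-form evaluation of the CBD can be cross-checked against the pairwise loop. This is a minimal sketch; `n_trials` is a hypothetical name for the number of trials of each binomial sample, and none of the function names come from the published solver.

```python
import math


def binomial_pmf(k, n, p=0.5):
    """PMF of B(n, p) at k; zero outside the support [0, n]."""
    if k < 0 or k > n:
        return 0.0
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def cbd_pmf_pairs(e, n_trials):
    """P(e1 - e2 = e) by looping over all pairs (e1, e2)."""
    return sum(
        binomial_pmf(e1, n_trials) * binomial_pmf(e2, n_trials)
        for e1 in range(n_trials + 1)
        for e2 in range(n_trials + 1)
        if e1 - e2 == e
    )


def cbd_pmf_closed(e, n_trials):
    """P(e1 - e2 = e) as a single binomial PMF with 2 * n_trials trials."""
    return binomial_pmf(n_trials + e, 2 * n_trials)


for n_trials in (2, 3):
    for e in range(-n_trials, n_trials + 1):
        print(n_trials, e, cbd_pmf_pairs(e, n_trials), cbd_pmf_closed(e, n_trials))
```

The pairwise loop costs a quadratic number of terms per evaluation, whereas the closed form needs a single binomial coefficient.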
An alternative to our error-tolerant solver would be to leverage the asymmetric error rates of the IndoCrypt attack [HPP21a] and our roulette attacks by only retaining inequalities that correspond to an observed decapsulation success. However, discarding all other inequalities would be a waste of experimental data, and cannot as reliably be applied to the CHES attack [PP21a].

Experiments with Software-Simulated Faults
We perform three experiments where faults are simulated in software. Success is quantified by estimating the probability that any coefficient of the guessed solution x is correct as a function of the provided number of inequalities, ω. This estimate is an average over (i) all ψ unknowns and (ii) 10 systems of inequalities that correspond to different key pairs (k pub , k priv ). No runs are discarded, thereby demonstrating the stability of our solver.
In our first experiment, we revisit a filtering technique from Pessl and Prokop [PP21a] where inequalities are selected such that coefficient b is small in absolute value. This way, the probability of a decapsulation success (or failure) is approximately 50%. Hence, the information or Shannon entropy carried by the inequality is maximized, and fewer faults are needed for key recovery. Because the potential gains have not been quantified before, we do so in Fig. 4. For the unfiltered curve, the faulted ciphertext index ι ∈ [0, η − 1] is constant, and the result of a single encapsulation is unconditionally accepted. For the filtered curve, a single encapsulation is still performed, but the faulted index ι is variable and chosen such that |b| is minimized. Remark that in an attack with actual hardware, a similar effect could be obtained by fixing ι and performing η encapsulations. Considering that the gains are significant, we filter inequalities by default.
In our second experiment, all three security levels of Kyber are compared in Fig. 5. The curves lie relatively close to one another, especially Kyber512 and Kyber768. This is at least partially attributable to the following effects cancelling out: Kyber768 has more unknowns (1536 > 1024), whereas Kyber512 has more possible values per unknown (7 > 5). More rigorously, the Shannon entropy of the secret x in Eq. (16) is roughly 2389, 3119, and 4159 bits for Kyber512, Kyber768, and Kyber1024 respectively, and provides a lower bound on the number of inequalities needed for a 100% success rate.
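The entropy figures can be reproduced as follows. This sketch assumes that all unknowns are independent CBD samples; the numbers of unknowns for Kyber512 and Kyber768 and the value counts per unknown (7 and 5) come from the text, whereas the 2048 unknowns and the CBD parameter for Kyber1024 are taken from the Kyber specification. The names `PARAMS`, `cbd_entropy_bits`, and `secret_entropy_bits` are illustrative.

```python
import math

# (number of unknowns, number of trials per binomial sample of the CBD)
PARAMS = {
    "Kyber512": (1024, 3),   # 7 possible values per unknown
    "Kyber768": (1536, 2),   # 5 possible values per unknown
    "Kyber1024": (2048, 2),  # assumption: parameters from the Kyber specification
}


def cbd_entropy_bits(n_trials):
    """Shannon entropy of e1 - e2 with e1, e2 ~ B(n_trials, 1/2)."""
    n = 2 * n_trials
    pmf = [math.comb(n, k) / 2 ** n for k in range(n + 1)]
    return -sum(p * math.log2(p) for p in pmf)


def secret_entropy_bits(name):
    """Entropy of the whole secret, assuming independent coefficients."""
    unknowns, n_trials = PARAMS[name]
    return unknowns * cbd_entropy_bits(n_trials)


for name in PARAMS:
    print(f"{name}: {secret_entropy_bits(name):.1f} bits")
```

Because each inequality carries at most one bit of information, these values bound the number of inequalities from below.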
In our third experiment, inequalities are corrupted. Given ω otherwise correct inequalities, decapsulation successes are turned into decapsulation failures with probability p s2f ∈ [0, 0.6], whereas decapsulation failures are untouched, which is in line with the working principles of the attack. Figure 6 shows that even with p s2f = 50%, the entire secret can still be recovered. The overall error rate is approximately half of p s2f , resulting in an error tolerance of 25%. This is a considerable improvement upon the 1% reported by Pessl and Prokop [PP21a], and demands on the fault-injection setup are reduced accordingly.
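The relation between p s2f and the overall error rate can be illustrated with a short simulation. This is a toy sketch and not the corruption routine of the published code: it assumes that, owing to filtering, each inequality corresponds to a decapsulation success with probability 50%. The function name and its parameters are hypothetical.

```python
import numpy as np


def overall_error_rate(p_s2f, p_success=0.5, omega=100_000, seed=0):
    """Fraction of inequalities that end up with a wrong label.

    Successes are turned into failures with probability p_s2f;
    failures are left untouched.
    """
    rng = np.random.default_rng(seed)
    is_success = rng.random(omega) < p_success
    is_flipped = is_success & (rng.random(omega) < p_s2f)
    return float(is_flipped.mean())


for p_s2f in (0.0, 0.25, 0.5):
    print(f"p_s2f = {p_s2f:.2f} -> overall error rate ~ {overall_error_rate(p_s2f):.3f}")
```

Under this assumption, the expected overall error rate is the product of the success probability and p s2f , i.e., roughly p s2f /2.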

ChipWhisperer Experiments
We experiment with actual fault-injection equipment and target a masked software implementation of Kyber running on an ARM Cortex-M4. Upon discarding the pqm4 implementation [KRSS] [BC22] for storing the private key s and the symmetric key k in unmasked form at the time of writing this paper, we opted for the implementation of Coron et al. [CGMZ22]. Because the latter implementation is entirely written in plain C and thus unoptimized for the M4, it runs too slowly for bulk experiments yet fast enough to show that our attack works. We build Kyber768 with first-order masking using the GNU Compiler Collection (GCC) with O3 optimization.
We use a ChipWhisperer board from NewAE Technology Inc. [O'F] to generate and glitch a 24 MHz clock. Through a CW308 UFO Target Board, this clock is provided to the M4 that is contained in an STM32F405RGT6 chip from STMicroelectronics, and causes either instruction skips or instruction corruptions [TSW16]. The glitch is created by XORing a single short pulse with an otherwise proper clock signal, and is configured by three parameters: a global offset expressed as a number of clock cycles, a local offset with respect to the clock edge, and the width of the pulse. The latter two parameters jointly embody an intensity that must be carefully balanced for the given STM chip: if too low, no data is faulted, and if too high, our target crashes. The former parameter must be paired with a vulnerable spot of Kyber's re-encryption and the given ciphertext index ι ∈ [0, 255] that is manipulated, which can be considered as a fourth parameter. Considering that we focused on the last layer of the INTT earlier, we mark this section of the source code with a trigger signal. Through a series of grid searches within the trigger window, four parameter values are selected. The selected ciphertext index is ι = 130. Remark that in a typical closed-source commercial product, a trigger cannot simply be added to the source code but may be derived from SCA or communications with chip peripherals such as external memory.
Upon selecting parameters, key recovery would be possible in a few hours up to a day for a well-optimized implementation of Kyber, but as we had to settle for an unoptimized target, it would take approximately five days. Ideally, multiple recoveries should be performed as well. Therefore, our attack is showcased through faster but fairly equivalent means: the ability to generate correct inequalities is measured. Based on 500 inequalities, Fig. 7 shows the probability of assigning the wrong sign to an inequality as a function of the maximum number of fault injections, β. Recall that only decapsulation successes can be misclassified and thus negatively contribute to the error rate. If, guided by Fig. 6, we tolerate misclassifying approximately 50% of the decapsulation successes, it should roughly hold that β ≥ 20. To conclude: even with a cheap setup and an SCA-protected target, we can deliver a solvable system of inequalities.