Chosen Ciphertext k-Trace Attacks on Masked CCA2 Secure Kyber

Single-trace attacks are a considerable threat to implementations of classic public-key schemes, and their implications on newer lattice-based schemes are still not well understood. Two recent works have presented successful single-trace attacks targeting the Number Theoretic Transform (NTT), which is at the heart of many lattice-based schemes. However, these attacks either require a quite powerful side-channel adversary or are restricted to specific scenarios such as the encryption of ephemeral secrets. It is still an open question whether such attacks can be performed by simpler adversaries while targeting more common public-key scenarios. In this paper, we answer this question positively. First, we present a method for crafting ring/module-LWE ciphertexts that result in sparse polynomials at the input of inverse NTT computations, independent of the used private key. We then demonstrate how this sparseness can be incorporated into a side-channel attack, thereby significantly improving noise resistance of the attack compared to previous works. The effectiveness of our attack is shown on the use-case of CCA2 secure Kyber k-module-LWE, where k ∈ {2, 3, 4}. Our k-trace attack on the long-term secret can handle noise up to σ ≤ 1.2 in the noisy Hamming weight leakage model, also for masked implementations. A 2k-trace variant for Kyber1024 even allows noise σ ≤ 2.2 also in the masked case, with more traces allowing us to recover keys up to σ ≤ 2.7. Single-trace attack variants have a noise tolerance depending on the Kyber parameter set, ranging from σ ≤ 0.5 to σ ≤ 0.7. As a comparison, similar previous attacks in the masked setting were only successful with σ ≤ 0.5.


Introduction
Current public-key cryptographic schemes are based on the premise that the mathematical problems underlying them are hard to solve for the chosen parameters. With the advent of a quantum computer, however, these classical hard problems will be efficiently solvable by applying Shor's algorithm [Sho94]. As a result, there is rising interest in post-quantum cryptography (PQC) algorithms, which are based on mathematical problems conjectured
• We demonstrate how to choose the number of traces in our attack from 1 to k to increase the noise tolerance from σ ≤ 0.5−0.7 up to σ ≤ 1.2 in the Hamming weight leakage. Our simpler sparse vector generation based on the NTT structure can still handle noise up to σ ≤ 0.9 in a k-trace attack. For Kyber1024 we are even able to handle noise up to σ ≤ 2.2 in a 2k-trace attack. With repeating failed runs we can

Learning With Errors
The Learning With Errors (LWE) problem [Reg05] and its instantiation over rings [LPR10] or modules are the basis of multiple NIST PQC candidates. Let $\mathbb{Z}_q$ be the ring of integers modulo $q$ and, for a given degree $n$, define $R_q = \mathbb{Z}_q[x]/(x^n + 1)$ as the ring of polynomials modulo $x^n + 1$. Let also $\beta_\eta$ denote the centered binomial distribution with parameter $\eta$ and $U$ the uniform distribution over $\mathbb{Z}_q$. A module-LWE distribution now consists of tuples $(a, b = a^T s + e) \in \mathbb{Z}_q^k \times \mathbb{Z}_q$ (resp. $R_q^k \times R_q$), where the coefficients of $s$ are drawn from $\beta_\eta$ once, and for each sample $e$ is freshly drawn from $\beta_\eta$ and the coefficients of $a$ from $U$. The module-LWE based schemes rely on the hardness of recovering $s$ from several such tuples.
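The sampling procedure above can be sketched in a few lines of Python. This is only an illustration of the integer variant ($\mathbb{Z}_q^k \times \mathbb{Z}_q$), not of Kyber's polynomial arithmetic; the helper names `cbd` and `lwe_sample`, the dimension 4 and the fixed seed are assumptions made for the example.

```python
import random

Q = 3329  # Kyber's modulus, used here only as an example value


def cbd(eta, rng):
    """Sample from the centered binomial distribution with parameter eta."""
    ones_a = sum(rng.getrandbits(1) for _ in range(eta))
    ones_b = sum(rng.getrandbits(1) for _ in range(eta))
    return ones_a - ones_b


def lwe_sample(s, eta, rng, q=Q):
    """Return one tuple (a, b = <a, s> + e mod q) for a fixed secret vector s."""
    a = [rng.randrange(q) for _ in s]          # uniform coefficients
    e = cbd(eta, rng)                          # fresh error for every sample
    b = (sum(ai * si for ai, si in zip(a, s)) + e) % q
    return a, b


rng = random.Random(1)                         # fixed seed for reproducibility
secret = [cbd(2, rng) for _ in range(4)]       # drawn once
samples = [lwe_sample(secret, 2, rng) for _ in range(3)]
```

Each sample hides the inner product behind a small error; recovering `secret` from many such samples is the problem the scheme relies on.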

Kyber
Kyber [BDK+18] is a Key Encapsulation Mechanism (KEM) submitted to the NIST standardization process. It is among the 7 finalists of the 15 schemes in Round 3 [Nat]. Its security is based on the module-LWE problem. For the three parameter sets in the proposal, Kyber512, Kyber768, and Kyber1024, the parameters are all set to $n = 256$ and $q = 3329$. For most parameter sets $\eta = 2$ is used, except for Kyber512, where $\eta = 3$. The parameter sets differ in their module dimension $k = 2$, 3, and 4, respectively. Since our focus is on its NTT (see Section 2.3), a simplified version suffices, omitting details such as how coefficients are packed. Kyber also supports "90s" versions of each parameter set, which substitute AES and SHA2 for SHAKE and SHA3, but this distinction does not affect our attack because we target only the NTT. For further details on Kyber, we refer to [BDK+18]. Kyber's CCA2-KEM Key Generation, PKE- and CCA2-KEM-Encryption, and CCA2-KEM-Decryption are summarized in Algorithms 1, 2, 3 and 4.
The PKE-Encryption is shown in Algorithm 2. The seed $\tau$ used for the noise sampling is made explicit to allow the re-encryption required for the CCA2 transform in Algorithm 3. The ciphertext $c$ consists of two compressed parts, where the second component $c_2$ contains $m$ encoded as an element in $R_q$. The decryption process (Algorithm 4) requires the recipient to recover this $m$ from a noisy version.
[Algorithm 3, final steps: CPA encryption with fixed seed; 4: $K := \mathrm{KDF}(\bar{K} \,\|\, \mathrm{Hash}(c))$, derive shared key; 5: return $(c, K)$]
Kyber uses a variant of the Fujisaki-Okamoto transform [FO13] to build an IND-CCA2 secure key-encapsulation mechanism (KEM). This transform applies an additional re-encryption of the decrypted message (cf. Algorithm 4, l. 4), using the same randomness as used for the encryption of the received ciphertext. The decryption is only deemed valid if the re-computed ciphertext matches the received ciphertext. As our attack exploits leakage that occurs before the re-encryption, i.e., $\mathrm{NTT}^{-1}$ in line 2, the check does not mitigate our attack.

Number Theoretic Transform
For lattice-based schemes using polynomial rings, the polynomial multiplication is the most computationally expensive step. The Number Theoretic Transform (NTT) is a technique that enables efficient computation of this multiplication.
The NTT is similar to the Discrete Fourier Transform (DFT), but instead of over the field of complex numbers, it operates over a prime field $\mathbb{Z}_q$. It can be seen as a mapping between the coefficient representation of a polynomial from $R_q$ (the normal domain) to the evaluation of the polynomial at the $n$-th roots of unity (the NTT domain). This bijective mapping is typically referred to as forward transformation. The mapping from the NTT domain to the normal domain is referred to as backward transformation or inverse NTT. In the NTT domain, multiplication of polynomials can be achieved by point-wise multiplication, which is much cheaper than multiplication in the normal domain. Typically, one would perform the forward transformation, multiply the polynomials (point-wise) in the NTT domain, and go back using the backward transformation.
For $R_q$ with a $2n$-th primitive root of unity $\zeta$, the NTT transformation of a polynomial $f = \sum_{i=0}^{n-1} f_i x^i$ is $\hat{f} = \mathrm{NTT}(f) = \sum_{i=0}^{n-1} \hat{f}_i x^i$ with
$$\hat{f}_i = \sum_{j=0}^{n-1} f_j \zeta^{(2i+1) j}. \qquad (1)$$
Similarly, $f = \mathrm{NTT}^{-1}(\hat{f})$ with $f_i = n^{-1} \sum_{j=0}^{n-1} \hat{f}_j \zeta^{-i(2j+1)}$. The NTT transform (and its inverse) can be computed efficiently by chaining $\log_2 n$ layers of butterflies. It is a divide-and-conquer technique that splits the input in half in each step and solves two problems of size $n/2$. For $n = 2^k$, after $k$ steps the problems are of size 1 and can be trivially solved. The way the splitting is done is referred to as decimation, and in practice typically either the Cooley-Tukey [CT65] or the Gentleman-Sande [GS66] butterfly is used. The construction for an 8-coefficient NTT using the Cooley-Tukey butterfly with decimation in time is depicted in Figure 2.1 (cf. [CT65]), with the output being in bit-reversed order.
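The transform, multiply point-wise, transform back workflow can be checked on a toy ring. The sketch below assumes the illustrative parameters $n = 8$, $q = 17$, $\zeta = 3$ (a primitive 16th root of unity modulo 17) and evaluates the transform directly in $O(n^2)$ instead of using butterflies; it is not Kyber's NTT, and the output is in natural rather than bit-reversed order.

```python
Q, N, ZETA = 17, 8, 3  # toy parameters: 3 has order 16 = 2N modulo 17


def ntt(f):
    """Evaluate f at the odd powers zeta^(2i+1), i.e. at the roots of x^N + 1."""
    return [sum(fj * pow(ZETA, (2 * i + 1) * j, Q) for j, fj in enumerate(f)) % Q
            for i in range(N)]


def intt(f_hat):
    """Inverse of ntt(): interpolate the coefficients from the evaluations."""
    n_inv = pow(N, -1, Q)
    return [n_inv * sum(fi * pow(ZETA, -(2 * i + 1) * j, Q)
                        for i, fi in enumerate(f_hat)) % Q
            for j in range(N)]


def negacyclic_mul(f, g):
    """Schoolbook product in Z_Q[x]/(x^N + 1), used as a reference."""
    out = [0] * N
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            if i + j < N:
                out[i + j] = (out[i + j] + fi * gj) % Q
            else:  # x^N = -1 wraps around with a sign flip
                out[i + j - N] = (out[i + j - N] - fi * gj) % Q
    return out


f = [1, 2, 3, 4, 5, 6, 7, 8]
g = [8, 0, 1, 0, 2, 0, 3, 0]
via_ntt = intt([a * b % Q for a, b in zip(ntt(f), ntt(g))])
reference = negacyclic_mul(f, g)
```

Both routes give the same product; the butterfly network only changes how fast the evaluations are computed.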
Schemes like Kyber and Dilithium [DKL+18] use an NTT-friendly ring. But in Kyber, only $n$-th primitive roots of unity exist, therefore the modulus polynomial $X^n + 1$ only factors into polynomials of degree 2. Hence, the last layer of the NTT is skipped (nearest neighbors), and multiplication in the NTT domain is not purely pointwise but consists of multiplications of polynomials of degree one (pairwise-pointwise). That is, the Kyber ring is effectively $\mathbb{F}_{q^2}[y]/(y^{128} + 1)$, where $\mathbb{F}_{q^2}$ is the field $\mathbb{Z}_q[x]/(x^2 - \zeta)$. Also note that in Kyber, polynomials in the NTT domain are always considered in bit-reversed order (cf. Figure 2.1). Therefore,

Masked Implementations
Several previous works propose DPA-secured implementations of lattice-based schemes that use masking as their main protection mechanism. The first masking scheme for ring-LWE decryption was presented by Reparaz et al. For the purpose of this paper, we are only interested in the initial decryption of the ciphertext, containing the inverse NTT operation, as we do not exploit leakage of subsequent computation steps. Here, the application of classical masking is rather straightforward as the NTT itself is a linear operation. Hence, one can simply (1) split the secret key into two (or more) shares, (2) multiply the NTT-domain ciphertext by each share, (3) compute the inverse NTTs independently for each share, and (4) finally add the shares back again if needed [RRdC+16, OSPG18]. The same strategy is also applicable for module-LWE-based schemes like Kyber. We give a more detailed description of masked Kyber decryption in Section 5.1.
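Steps (1) to (4) can be illustrated on a toy ring. The following sketch assumes $n = 8$, $q = 17$, $\zeta = 3$ and a purely point-wise product (Kyber's product is pairwise-pointwise); the function name `masked_core` and the test vectors are made up for the example. It only shows that the linearity of the inverse NTT makes the recombined shares equal to the unmasked result.

```python
import random

Q, N, ZETA = 17, 8, 3  # toy parameters, not Kyber's


def intt(f_hat):
    """Direct O(N^2) inverse of the negacyclic NTT at the points zeta^(2i+1)."""
    n_inv = pow(N, -1, Q)
    return [n_inv * sum(fi * pow(ZETA, -(2 * i + 1) * j, Q)
                        for i, fi in enumerate(f_hat)) % Q
            for j in range(N)]


def masked_core(s_hat, u_hat, rng):
    # (1) split the key into two additive shares: s_hat = share0 + share1
    share1 = [rng.randrange(Q) for _ in range(N)]
    share0 = [(s - m) % Q for s, m in zip(s_hat, share1)]
    # (2) multiply the NTT-domain ciphertext by each share
    prod0 = [a * b % Q for a, b in zip(share0, u_hat)]
    prod1 = [a * b % Q for a, b in zip(share1, u_hat)]
    # (3) independent inverse NTTs, (4) recombine
    w0, w1 = intt(prod0), intt(prod1)
    return [(a + b) % Q for a, b in zip(w0, w1)]


s_hat = [3, 1, 4, 1, 5, 9, 2, 6]
u_hat = [2, 7, 1, 8, 2, 8, 1, 8]
w_masked = masked_core(s_hat, u_hat, random.Random(7))
w_plain = intt([a * b % Q for a, b in zip(s_hat, u_hat)])
```

An attacker observing both inverse NTTs sees each share separately, which is why the attack later treats the two shares as independent targets.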

Soft-Analytical Side-Channel Attacks (SASCA)
In this section, we first give a generic description of Soft-Analytical Side-Channel Attacks (SASCA) and the Belief Propagation algorithm (BP), which are frequently used in profiled power analysis attacks. We base our descriptions of BP on MacKay [Mac03, Chapter 26] and on previous works using SASCA [VGS14, PP19]. We then outline previous works that study profiled power analysis attacks on (masked) NTT computations and point out their limitations. We do not discuss DPA attacks, as masking is generally considered to be an effective countermeasure against these kinds of attacks.

Belief Propagation
Profiled power analysis attacks in the single/few-trace setting often face the problem that measurements alone are not sufficient to determine the exact values of an analyzed computation. A common way to reduce the remaining guessing entropy is to build systems of equations which relate multiple intermediate values, and to solve them using SAT solvers or sophisticated brute force algorithms. In 2014, this idea was extended by Veyrat-Charvillon et al. [VGS14], who proposed soft-analytical side-channel attacks (SASCA). This class of attacks treats the reduction of the remaining guessing entropy in combination with algorithm knowledge as a noisy decoding problem. Belief propagation is an inference algorithm which has proven to be very useful when decoding these kinds of problems. Knowledge about the algorithm is transformed into a so-called factor graph. This factor graph models intermediate values of the algorithm as variable nodes and the relations or transformations between the intermediate values as factor nodes. Each variable node represents a local marginal probability distribution of the global joint distribution of all intermediate values. The variable nodes hold the initial probabilities, also called beliefs, which were gained via classical template matching. Variable and factor nodes then alternately exchange messages about their beliefs. This iterative process rules out inconsistent candidate values and can disclose the true value of each intermediate of the secret. A more thorough and formal explanation of BP is given in Appendix A.
The application of BP in side-channel analysis has proven to be very powerful [VGS14, PPM17, KPP20, PP19, GRO18]. However, BP is only guaranteed to compute the correct marginals if the factor graph has no cycles. If BP is used in cyclic graphs, we refer to it as loopy Belief Propagation. The quality of the solutions of loopy BP generally improves with the length of the loops. Short loops can introduce overconfidence, which could lead to oscillations or incorrect solutions within local minima [PP19]. Therefore, it is beneficial to avoid short loops by clustering or remodeling the factor nodes which introduce them. Additional attention has to be paid to the number of iterations and the break conditions in loopy BP.

Prior Work
As shown in prior work [PPM17, PP19], attacks can recover sensitive NTT inputs after observing just a single trace. More concretely, they recover the inputs to the forward/inverse NTT during ring-LWE encryption/decryption, respectively. While decryption is generally the more interesting attack target since it involves the usage of a long-term private key, single-trace attacks on the encryption are also plausible attack scenarios for KEMs where ephemeral secrets are being encrypted. Fig. 3.1 illustrates some options to construct a factor graph for an NTT computation. Fig. 3.1a shows a single butterfly, which is equivalent to a length-2 NTT and transforms an input polynomial with coefficients $x_0, x_1$ into the corresponding output polynomial in the NTT domain with coefficients $\hat{x}_0, \hat{x}_1$. Fig. 3.1b depicts the corresponding factor graph as constructed in [PPM17]. In this graph, variable nodes (intermediate values) and factor nodes (computations) are represented by circles and squares, respectively.
The factor nodes can be further split into two groups. The first group of factors $f$ models the observed side-channel information, i.e., the outcome of the template matching $f(i) = \Pr(x = i \mid \ell)$, where $x$ is the matched intermediate and $\ell$ is the observed side-channel leakage. In [PPM17] the template matching was performed on the modular multiplication with $\omega$ (corresponding to the powers of $\zeta$ in the NTT, cf. Figure 2.1), and hence they receive information on $x_1$. In this case, modeling the effects of multiplication on the beliefs simply corresponds to shuffling their probabilities, since $\omega$ is publicly known.
The second group of factors, consisting of $f_{add}$ and $f_{sub}$, models the deterministic relationships between the variable nodes as specified by the NTT. E.g., the factor for the addition in the upper branch evaluates to 1 exactly when the values of its adjacent variable nodes satisfy the addition modulo $q$, and to 0 otherwise.
As later observed in [PP19], modeling these operations individually causes many small loops in the factor graph, which results in reduced BP convergence. Instead, they propose to merge the computations of a butterfly into a single factor node $f_{bf}$. When combining this optimization with an improved message schedule and a certain amount of message damping, the BP convergence performance can be significantly improved.
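To make the idea of a merged butterfly factor more concrete, the sketch below combines beliefs on both inputs and both outputs of one butterfly by exhaustive enumeration. It assumes a toy modulus $q = 17$ and a made-up twiddle factor, and it evaluates a single isolated factor only; it is not the message schedule or graph of [PP19].

```python
Q = 17      # toy modulus so that exhaustive enumeration stays tiny
OMEGA = 3   # publicly known twiddle factor (assumed value for the example)


def butterfly(x0, x1):
    """Cooley-Tukey style butterfly: (x0 + w*x1, x0 - w*x1) mod Q."""
    t = OMEGA * x1 % Q
    return (x0 + t) % Q, (x0 - t) % Q


def bf_posterior(prior0, prior1, like_out0, like_out1):
    """Joint posterior over (x0, x1) given beliefs on both inputs and outputs."""
    joint = {}
    for x0 in range(Q):
        for x1 in range(Q):
            y0, y1 = butterfly(x0, x1)
            joint[(x0, x1)] = (prior0[x0] * prior1[x1]
                               * like_out0[y0] * like_out1[y1])
    total = sum(joint.values())
    return {pair: p / total for pair, p in joint.items()}


# Uninformative beliefs on the inputs, exact knowledge of the outputs.
uniform = [1.0 / Q] * Q
true_y0, true_y1 = butterfly(5, 11)
exact0 = [1.0 if v == true_y0 else 0.0 for v in range(Q)]
exact1 = [1.0 if v == true_y1 else 0.0 for v in range(Q)]
posterior = bf_posterior(uniform, uniform, exact0, exact1)
best = max(posterior, key=posterior.get)
```

With noisy Hamming weight likelihoods in place of `exact0` and `exact1`, the same function yields a soft joint belief instead of a single candidate.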

Limitations of Prior Work
While [PPM17, PP19] demonstrate the possibility of side-channel attacks on the NTT, their presented attacks either fall somewhat short of being practical or are only applicable in certain scenarios. The attack in [PPM17] relies on template matching of modular multiplication operations, which requires close to a million different templates. Furthermore, although not strictly required, they also exploit a certain time-invariance of the multiplication operation since it has a data-dependent behavior on their target device. Finally, they analyze a fairly simple and generic NTT implementation that misses several performance optimizations, such as lazy reductions, that are nowadays commonly used.
The attack from [PP19] can be seen as a significant improvement over [PPM17] in terms of practicality, but it also comes with restrictions to certain scenarios. First and foremost, their presented BP improvements allow them to replace value-based templates with generic noisy Hamming weight templates. They also show a practical attack using power traces of an ARM Cortex M4 device. On the downside, they rely on additional information about the narrow support (value range) of polynomial coefficients at the NTT input, which is only present during ring-LWE encryption. Even in this case, their evaluation shows that they can only handle noise levels with σ ≤ 0.5 when considering masked implementations.

Chosen Ciphertext k-Trace Attack
In this section we present our new attack that overcomes the limitations described in Section 3.3. We first give an outline of the attack, then give details on the different aspects in the subsequent sections.

Attack Outline -Improving BP for the NTT
Our attack targets Kyber's decryption step, i.e., line 2 of Algorithm 4, with the aim of recovering the victim's long-term private key $\hat{s}$. Previous work has observed that SASCA can recover the coefficients used in the NTT if sufficient additional information is available [PP19]. We provide that information by way of a Chosen Ciphertext Attack (CCA). Our attack ensures that the inverse NTT ($\mathrm{NTT}^{-1}$) will be given sparse input, meaning that most of the NTT coefficients are known to have a value of zero. Our attack works by:
1. Creating a ciphertext $c = (c_1, c_2)$ such that the inverse NTT operations will be performed on sparse input involving the secret $\hat{s}$, which combined with the highly structured NTT gives much additional information.
2. Extracting the sparse, secret data from (a) SCA trace(s) on the inverse NTT.
3. Recovering the private key from the recovered information.
In NewHope and Kyber, to decrypt a message the recipient has a secret key $\hat{s}$ in the NTT domain and calculates $\mathrm{NTT}^{-1}(\hat{s}^T \bullet \hat{u})$, where $\hat{u}$ is the decompressed ciphertext in the NTT domain. Since $\hat{s}^T \bullet \hat{u}$ is taken pairwise-pointwise, it will be sparse if $\hat{u}$ is pairwise-sparse. For NewHope, this is easily achieved because the ciphertext is sent as $\hat{u}$, i.e., in the NTT domain. For Kyber, however, the ciphertext is sent compressed as $c_1$ in the standard domain (cf. Algorithm 2, l. 5), so it is necessary to find a ciphertext $c_1$ such that $\hat{u}$ is sparse and $u = \mathrm{Decompress}(\mathrm{Compress}(u))$. In the remainder, we will refer to a ciphertext meeting this condition as compressible.
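The compressibility condition can be tested coefficient by coefficient. The sketch below uses the rounding formulas of the Kyber specification as we understand them (round to nearest, reduce modulo $2^d$); the function names are chosen for this example and are not taken from any particular implementation.

```python
Q = 3329


def compress(x, d):
    """Kyber-style compression: round(2^d / q * x) mod 2^d."""
    return ((x << d) + Q // 2) // Q % (1 << d)


def decompress(c, d):
    """Kyber-style decompression: round(q / 2^d * c)."""
    return (c * Q + (1 << (d - 1))) >> d


def is_compressible(x, d):
    """True if x survives a compress/decompress round trip unchanged."""
    return decompress(compress(x, d), d) == x


# Number of values in Z_q that are fixed points of the round trip.
count_d10 = sum(is_compressible(x, 10) for x in range(Q))
count_d11 = sum(is_compressible(x, 11) for x in range(Q))
```

Exactly $2^d$ of the $q$ possible coefficient values are compressible, so a random coefficient passes the test with probability $2^d/q$.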

Creating Sparse NTT Values
We aim to find a compressible ciphertext $c = (c_1, c_2)$ where $\hat{u}$ is zero on some subset $S$ of the coefficients. The component $c_1$ consists of $k$ independent ring elements $c_{1,i} \in R_{2^d}$. We will explain two methods of forcing the NTT $\hat{u}_i$ of one such component's decompressed image, $u_i \in R_q$, to be sparse. In the following we will omit the subscript $i$ for clarity.
Note that due to Kyber's field $\mathbb{Z}_q$ lacking a 512th root of unity, its NTT has only 7 layers instead of 8, and the even and odd coefficients never mix. It can therefore be seen as two parallel "half-NTTs" each on 128 coefficients: one half-NTT on the even coefficients and the other half-NTT on the odd coefficients. We can aim for the output of one half-NTT to be sparse, and the other to be zero. Furthermore, the even and odd coefficients will be mixed during the pairwise-pointwise multiplication step, where a pair of one even and one odd coefficient is multiplied by the corresponding position of the NTT'd private key. So we don't need to set half the ciphertext to zero: the attack will work just as well if we set the two halves of the ciphertext so that their half-NTTs have the same sparse support. In any case, working with half-NTTs improves the speed of our methods.
The first method is to solve a short vector problem, since the set of uncompressed ciphertexts $u$ for which $\hat{u}$ is zero on $S$ forms a lattice. We have
$$\mathrm{Compress}(u) = \lceil u \cdot 2^d / q \rfloor \bmod 2^d, \qquad \mathrm{Decompress}(c) = \lceil c \cdot q / 2^d \rfloor,$$
where $d = 10$ for Kyber512 and Kyber768, and $d = 11$ for Kyber1024. We therefore want to construct pairs of vectors $(u, \mathrm{Compress}(u))$ where $\hat{u}$ is sparse and $\tilde{u} = u \cdot 2^d - \mathrm{Compress}(u) \cdot q$ is a short vector. If all the coefficients of $\tilde{u}$ are small then $u$ will be compressible. Using BKZ-2.0 [CN11] with block size 70, we were able to find ciphertexts $(c_1, c_2)$ where each half-NTT component of $\hat{c}_1$ is zero in all but 32 out of its 128 positions, and which are compressible with $d = 10$. BKZ is somewhat slow with such a large block size, but it only needs to be run once. For Kyber1024 with $d = 11$, it is possible to generate sparser vectors, with only 16 non-zero coefficients instead of 32. We were able to generate such vectors in some instances of BKZ-80 by shuffling the rows of the basis randomly. Note, however, that an attacker only needs to generate this once in advance, independent of the private key. For better performance of the belief propagation, we further distributed the non-zero coefficients within each vector and matched them pairwise in the second set of 128 coefficients, due to the pairwise-pointwise product.
Figure 4.1: The inputs of the NTT (left) need to be compressible, while the right side should be sparse. A sparse polynomial is found by iterating through a single intermediate value in layer $\ell$ (here: $\ell = 1$, highlighted by the dashed box) until a compressible input of the NTT is found. This automatically results in a sparse output in the NTT domain (right side). Note that in Kyber the same has to be performed independently for the odd-indexed coefficients.
We also developed a faster approach, which is depicted in Figure 4.1; it takes advantage of the layered structure of the NTT. To achieve a sparse half-NTT output, we set a single intermediate value $\hat{u}_{\ell,j}$ in layer $\ell$ to a non-zero value, and all other values in that layer to zero. This creates an input with $2^\ell$ non-zero coefficients whose half-NTT has $2^{7-\ell}$ non-zero coefficients. Setting $\ell = 2$ for Kyber512 or Kyber768, we find that the resulting 4 coefficients will be compressible with probability of approximately $(2^d/q)^4 \approx 1/112 \approx 30/q$. This estimate indicates that approximately 30 possible intermediates in each position result in compressible ciphertexts. In our experiments we were able to find 32 such possible intermediate values for every position. Each of these results in $2^{7-2} = 32$ out of 128 coefficients in the NTT domain which are non-zero. The positions of these non-zero coefficients are determined by the first $\ell$ bits of the index $j \in \{0, \ldots, 2^8 - 1\}$. E.g., with $\ell = 2$ and $j = 42 = 0101010_2$ the non-zero NTT coefficients are indexed $01xxxxxx_2$. As a result, we can set multiple intermediates within the same block to non-zero values, while still obtaining a sparse half-NTT result. The compressibility is assured, as the coefficients in the normal domain are disjoint. By using this technique, we can produce an exponentially large family of compressible ciphertexts whose NTT is non-zero in 32 of its 128 pairs of coefficients. Most of these ciphertexts have no zero coefficients in the normal domain. For Kyber1024, because $d = 11$ instead of 10, we can apply the same technique with $\ell = 3$ to generate compressible ciphertexts with as few as 16 out of 128 coefficients in the NTT domain which are non-zero.
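The layered construction can be reproduced on a toy ring for a single layer. The sketch below assumes $n = 8$, $q = 17$, $\zeta = 3$, a directly evaluated transform with output in natural order, and one layer only; the helper `sparse_input` is hypothetical. Two input coefficients are chosen so that one output of the first-layer butterfly vanishes, which zeroes half of the transform. The last line re-computes the probability estimate for four coefficients with $d = 10$.

```python
Q, N, ZETA = 17, 8, 3              # toy parameters, not Kyber's
I4 = pow(ZETA, N // 2, Q)          # zeta^(N/2): a primitive 4th root of unity


def ntt(f):
    """Direct evaluation at zeta^(2i+1); output in natural (not bit-reversed) order."""
    return [sum(fj * pow(ZETA, (2 * i + 1) * j, Q) for j, fj in enumerate(f)) % Q
            for i in range(N)]


def sparse_input(j, value):
    """Two non-zero coefficients such that one first-layer butterfly output is zero."""
    f = [0] * N
    f[j + N // 2] = value % Q
    f[j] = I4 * value % Q          # forces f[j] - I4 * f[j + N/2] = 0
    return f


f = sparse_input(1, 5)
f_hat = ntt(f)
nonzero_positions = [i for i, c in enumerate(f_hat) if c]

# Probability that 4 independent coefficients are all compressible (d = 10).
p_four = (2 ** 10 / 3329) ** 4
```

In natural order the surviving outputs sit at the even indices; in bit-reversed order, as used by Kyber, they form one contiguous block.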
This second approach can generate ciphertexts essentially instantaneously, but the non-zero output coefficients are all in a contiguous block.This leads to worse performance in our belief propagation step, because there are steps where the inputs to a butterfly are statically known to be zero, so the attacker gains no information from that step.However, it also allows faster reconstruction of the key from the partial information gained by the belief propagation step.We attempted to generalize this approach using different decimations of the NTT.But our attempts didn't work, for reasons described in Appendix C.
Either approach can be applied to each of the $k$ components $u_i$ of $c_1$, setting $\nu$ non-zero coefficients in each of their decompressed NTT-domain representations. In Kyber, the decryption step first takes the sum over the components as
$$\hat{w} = \hat{s}^T \bullet \hat{u} = \sum_{i=0}^{k-1} \hat{s}_i \bullet \hat{u}_i.$$
If the values $\hat{u}_0, \ldots, \hat{u}_{k-1}$ have non-zero values in the same $\nu$ pairs of positions, then $\hat{w}$ will have non-zero values in those positions, which will then contain linear combinations of the coefficients of $\hat{s}_0, \ldots, \hat{s}_{k-1}$. Therefore, if we recover the coefficients of $k$ different $\hat{w}$ values using a side-channel attack on the inverse NTT, we can solve for $\nu$ pairs of coefficients of each of $\hat{s}_0, \ldots, \hat{s}_{k-1}$.
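Solving for the key coefficients from $k$ recovered values is a small linear system per position. The sketch below is a simplification: it treats each position as a scalar in $\mathbb{Z}_q$ rather than as a degree-one polynomial, and the matrix of ciphertext values and the secret are made-up test data. It uses plain Gauss-Jordan elimination modulo $q$.

```python
Q = 3329


def solve_mod(A, b, q=Q):
    """Solve A x = b over Z_q by Gauss-Jordan elimination (A must be invertible)."""
    k = len(A)
    M = [[v % q for v in row] + [rhs % q] for row, rhs in zip(A, b)]
    for col in range(k):
        # find a row with an invertible pivot and move it into place
        piv = next(r for r in range(col, k) if M[r][col])
        M[col], M[piv] = M[piv], M[col]
        inv = pow(M[col][col], -1, q)
        M[col] = [v * inv % q for v in M[col]]
        for r in range(k):
            if r != col and M[r][col]:
                factor = M[r][col]
                M[r] = [(vr - factor * vc) % q for vr, vc in zip(M[r], M[col])]
    return [M[r][k] for r in range(k)]


# k = 3 traces: row t holds the chosen ciphertext values of trace t at one position.
U = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
s_true = [1234, 77, 3000]
w = [sum(u * s for u, s in zip(row, s_true)) % Q for row in U]
s_recovered = solve_mod(U, w)
```

The attacker controls `U`, so it can always be chosen invertible.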
Another approach is to set the coefficients of $\hat{u}_0, \ldots, \hat{u}_{k-1}$ to be non-zero in disjoint positions. In that case, $\hat{w}$ will have $k \cdot \nu$ pairs of non-zero coefficients, e.g., 96 pairs of non-zero coefficients for Kyber768. With this method the belief propagation will have worse noise immunity, but if successful it will recover $\nu$ coefficients from each of $\hat{s}_0, \ldots, \hat{s}_{k-1}$ in a single trace. This will lead to a single-trace attack when the signal-to-noise ratio is high enough.

Belief Propagation Details
After computing a compressible ciphertext $u$ that is sparse in the NTT domain, we replace the vector $c_1$ of an honestly generated ciphertext $(c_1, c_2)$ by the compression of $u$ and send it to the target device. The leakage in the butterfly operations of the inverse NTT is exploited by template matching, and the resulting probability distributions are used as input to the belief propagation. As $\hat{u}$ is pairwise-sparse and the multiplication is pairwise-pointwise, the input to the inverse NTT, $\hat{w}$, is sparse. This reduces the entropy of the variable nodes to zero at all positions where the coefficients of $\hat{u}$ are zero. If we are attacking a masked implementation, we perform those steps for both shares and recover the result, $\hat{s}^T \bullet \hat{u}$, from the recovered shares.
Apart from attacking the decryption instead of the encryption, our belief propagation is similar to that of Pessl and Primas [PP19], using the merged butterfly nodes but not applying message damping. Also, in the case of masking, we did not improve our belief propagation by adjoining the graphs for both shares. As the coefficients of the input to the inverse NTT, $\hat{w} = \hat{s}^T \bullet \hat{u}$, are not small and the coefficients of the output of the inverse NTT, $w = s^T \cdot u$, are given by $s$ convolved with $u$, factor nodes ensuring the consistency for each coefficient individually seem infeasible. Therefore, our belief propagation graphs for both shares are completely independent. As Kyber's NTT uses only 7 instead of 8 layers, each share consists of two separate connected graphs, resulting in four separate graphs for the masked case and two separate graphs for the unmasked case.

Recovering the Private Key from Secret Data
The last step of our algorithm is to recover the private key $s$ from the recovered secret data. This recovered data consists of the value $\hat{s}^T \bullet \hat{u}$, where $\hat{u}$ is sparse, having only $\nu$ non-zero pairs of values contained in some set $S$. Since all non-zero elements of $\mathbb{F}_{q^2}$ are invertible, we can divide by $\hat{u}$ to obtain the corresponding pairs of coefficients in $\hat{s}$.
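The division step can be sketched for a single pair of coefficients. The example assumes the quadratic factor $x^2 - 17$ (17 being Kyber's primitive 256th root of unity, hence a non-square modulo $q$); the other factors of $x^{256} + 1$ work the same way with a different constant. The function names and test values are chosen for this illustration.

```python
Q = 3329
GAMMA = 17  # x^2 - 17 is irreducible mod Q, so Z_Q[x]/(x^2 - 17) is a field


def mul(a, b, gamma=GAMMA):
    """(a0 + a1 x)(b0 + b1 x) reduced modulo x^2 - gamma and Q."""
    return ((a[0] * b[0] + gamma * a[1] * b[1]) % Q,
            (a[0] * b[1] + a[1] * b[0]) % Q)


def inv(a, gamma=GAMMA):
    """Inverse via the conjugate: 1/(a0 + a1 x) = (a0 - a1 x) / (a0^2 - gamma a1^2)."""
    norm = (a[0] * a[0] - gamma * a[1] * a[1]) % Q
    n_inv = pow(norm, -1, Q)   # raises ValueError only for a = (0, 0)
    return (a[0] * n_inv % Q, -a[1] * n_inv % Q)


s_pair = (Q - 1, 2)            # one unknown pair of NTT-domain key coefficients
u_pair = (1000, 2345)          # the attacker's known non-zero ciphertext pair
w_pair = mul(s_pair, u_pair)   # what the side channel reveals
recovered = mul(w_pair, inv(u_pair))
```

For the module case with several components contributing to the same position, this division is replaced by solving a small linear system over the same field.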
We therefore know $\nu$ coefficients of $\hat{s}$, and we want to recover $s$, which we know has small coefficients. Again, the even and odd coefficients don't interact, so we can consider a 128-element half-NTT. In Kyber768 and Kyber1024 the coefficients of $s$ are in $\{0, \pm 1, \pm 2\}$, so there are $5^{128}$ possibilities, and in Kyber512 they are in $\{0, \pm 1, \pm 2, \pm 3\}$, leading to $7^{128}$ possibilities. The solution is likely to be nearly unique if $q^\nu > 5^{128}$ or $q^\nu > 7^{128}$, meaning that at least 26 or 31 coefficients are recovered, respectively. As with finding sparse ciphertexts, this can be written as a shortest vector problem and solved with BKZ.
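The two thresholds follow from a simple counting argument and can be re-derived with exact integer arithmetic. The helper name `min_known` is chosen for this example.

```python
Q = 3329


def min_known(num_values, q=Q, half_len=128):
    """Smallest nu such that q**nu exceeds num_values**half_len."""
    nu = 0
    while q ** nu <= num_values ** half_len:
        nu += 1
    return nu


threshold_eta2 = min_known(5)   # coefficients in {0, ±1, ±2}
threshold_eta3 = min_known(7)   # coefficients in {0, ±1, ±2, ±3}
```

This counts all coefficient values as equally likely; since the binomial distribution is not uniform, the bound is a rough guide rather than an exact requirement.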
As before, there is also a more efficient approach when $S$ is a contiguous block. If the coefficients of $\hat{s}$ in a $2^{7-\ell}$-sized block are all known, then we can unwind the last $7 - \ell$ steps of the half-NTT and learn the intermediate values after the first $\ell$ steps, as depicted in Figure 4.2. Note that this is possible as blocks only depend on known coefficients for the last $7 - \ell$ layers. Each of these known intermediate values is a function of $2^\ell$ unknown coefficients of $s$, independent of the other intermediates.
The coefficients of $s$ are sampled from a small distribution, so they can be recovered by an exhaustive attack, or more simply a look-up table. We can use $\ell = 2$, where each intermediate depends on $2^2 = 4$ coefficients of $s$, and each coefficient is drawn from a set of size 5 for Kyber768 or Kyber1024, or 7 for Kyber512. So we can recover each coefficient of $s$ using a lookup table or search of size $5^4$ or $7^4$ for these two cases.
Figure 4.2: Recovering the private key from partial knowledge of $\hat{s}$. After inference of, e.g., the upper half of $\hat{s}$, the original key can be recovered as follows. First note that the last layers of the NTT can be reversed inside the upper half up to the intermediate polynomial $\tilde{s}$. From here, each coefficient can be independently brute forced by its original inputs, which are sampled from small binomial distributions, e.g., $\{-2, \ldots, 2\}$. Here $\tilde{s}_0 = s_0 + \zeta^4 s_8$, which results in only $5^2 = 25$ possible combinations.
For Kyber1024, it is possible to generate sparser vectors, with only 16 pairs of non-zero coefficients per polynomial instead of 32 (see Section 4.2). This allows an attack with an even lower signal-to-noise ratio. However, knowledge of only 16 pairs of coefficients of $\hat{s}$ does not yield a unique solution; instead there are more than 100 possibilities of the 8 coefficients of $s$ for each of the 16 intermediates. Even by sorting by likelihood of the SCA results, we could not reduce this to a computationally feasible level. Therefore, for this variant of the attack, we need 8 traces instead of 4, so that we recover 32 pairs of coefficients of $\hat{s}$ per polynomial.
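The look-up table for $\ell = 2$ can be sketched as follows. The weights stand in for the products of twiddle factors that combine four key coefficients into one intermediate; the values used here are hypothetical placeholders, and the real ones follow from the first two NTT layers.

```python
from itertools import product

Q = 3329


def build_table(weights, support):
    """Map each reachable intermediate value to the coefficient tuples producing it."""
    table = {}
    for coeffs in product(support, repeat=len(weights)):
        value = sum(w * c for w, c in zip(weights, coeffs)) % Q
        table.setdefault(value, []).append(coeffs)
    return table


weights = [1, 17, 289, 1584]      # hypothetical twiddle products (assumption)
support = range(-2, 3)            # {0, ±1, ±2} as in Kyber768 and Kyber1024
table = build_table(weights, support)

secret = (1, -2, 0, 2)
observed = sum(w * c for w, c in zip(weights, secret)) % Q
candidates = table[observed]      # all tuples consistent with the observation
table_entries = sum(len(v) for v in table.values())
```

The table holds $5^4 = 625$ tuples. With eight coefficients per intermediate, as in the sparser Kyber1024 variant, the $5^8$ tuples outnumber the $q$ possible values by a factor of more than 100, which matches the ambiguity described above.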

Results
First, Section 5.1 presents results showing the effectiveness and the tolerated noise levels of our attack variants. Additionally, we provide estimates for the remaining security after partial-key recovery in Section 5.2.

Key Recovery Attack
We evaluate the full attack strategy on Kyber512, Kyber768 and Kyber1024 via simulated leakage experiments. The choice of relying on simulated experiments is mainly motivated by (1) easy reproducibility and comparability of our results, and (2) the fact that practical attacks have already been shown in a similar attack setting. More precisely, the authors of [PP19] have shown that SPA attacks on the NTT operation are possible on a 32-bit STM32F405 microprocessor if the σ in a corresponding noisy Hamming weight leakage simulation is below 2 (unmasked scenario). Our attacks supersede these results both in terms of noise resistance and versatility. As we will show, our attacks can work up to a noise level of about σ = 3.1/2.7 in unmasked/masked scenarios and are, in contrast to [PP19], applicable in public-key decryption scenarios. The codebase itself is written in Rust and Python.

Leakage Model
In the noisy Hamming weight leakage model an attacker can observe the Hamming weight (HW), with additive Gaussian noise, of certain intermediate variables of an analyzed computation. More precisely, for an intermediate $a$ the simulated leakage results in $\mathrm{HW}(a) + \mathcal{N}(0, \sigma)$, with $\mathcal{N}$ being a normal distribution with a mean of zero and standard deviation $\sigma$. In a recent paper [KPP20] the authors performed actual power measurements to show that this leakage model is a quite suitable approximation for load/store instructions on current microcontrollers. For an 8-bit target (XMEGA 128D4) their measured leakage closely matches the noisy Hamming weight model with a σ of 0.5. In the 32-bit scenario (STM32F405) the authors measured a σ in a broader range between 0.4 in the best case and 3.0 in the worst case. These numbers can be lowered by averaging multiple measurement traces (10 traces have been averaged in their evaluation) to a σ in the range of 0.2 to 1.3. Please note that averaging is only possible in an unmasked setting.
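A minimal simulation of this leakage model, together with the matching Hamming weight likelihoods, might look as follows. The function names and the 16-bit word width are assumptions made for the example; `hw_likelihoods` requires σ > 0.

```python
import math
import random


def hamming_weight(a, bits=16):
    """Hamming weight of a, interpreted as a two's-complement word of `bits` bits."""
    return bin(a & ((1 << bits) - 1)).count("1")


def leak(a, sigma, rng):
    """Simulated leakage HW(a) + N(0, sigma)."""
    return hamming_weight(a) + rng.gauss(0.0, sigma)


def hw_likelihoods(observation, sigma, bits=16):
    """Normalised Gaussian likelihood of each Hamming weight 0..bits."""
    weights = [math.exp(-((observation - h) ** 2) / (2 * sigma ** 2))
               for h in range(bits + 1)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
sample = leak(3329, 0.5, rng)              # HW(3329) = 4 plus noise
beliefs = hw_likelihoods(4.2, 0.5)         # template matching for one observation
most_likely_hw = max(range(17), key=lambda h: beliefs[h])
```

The resulting distribution over Hamming weights is then spread over all values sharing that weight to initialise a variable node.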
Target Implementation Details
Our simulation is based on the current Kyber reference implementation [ABD+], which is very similar to the implementation targeted in [PP19], allowing for direct comparison. We re-implemented the relevant functions in Python/NumPy to generate simulated noisy HW leakages of the relevant 16-bit signed integer values. In practice this step would require building templates of the load/store instruction for the HWs from 0 to 16. The leakage is taken of the intermediate values between each layer, i.e., the input to each butterfly operation (cf. Figure 3.1c). Analogous to [PP19], we target the load (LDR) and store (STR) operations of the individual coefficients, which corresponds to the above-mentioned leakage model. Note that in contrast to [PP19] we target the inverse NTT in the Kyber decryption (cf. Algorithm 4, l. 2).
To the best of our knowledge, there is no published masked implementation specifically for Kyber. Therefore, we consider a masked implementation that follows the generic ring-LWE masking strategy from [RRdC+16, OSPG18], which is also summarized in Section 2.4. Hence, the secret key $s$ is assumed to be additively masked in two shares, $s - m$ and $m$, with the coefficients of $m$ sampled uniformly from $\{-(q-1)/2, \ldots, (q-1)/2\}$. A simplified depiction of masked Kyber decryption is shown in Figure 5.1.

Belief Propagation Instantiation
We built an optimized, multi-threaded implementation of belief propagation in Rust. We structured the graph according to our Python simulation of the inverse NTT, with the masking split in two independent shares. Further algorithmic details are described in Section 4.3. As discussed in Section 3.1, loopy belief propagation needs break conditions. We employ a number of conditions based on empirical experiments to allow for a reasonable trade-off between inference and runtime. We set the maximum number of iterations to 1000. Additionally, we abort if the Shannon entropy of all nodes is less than 0.1 bits, or the entropy change is less than 0.05 bits after 20 iterations. We further abort if after 200 iterations less than one more correct coefficient became the most probable or we found all correct coefficients, which requires knowledge of the secret and hence would not be available to an attacker. With these break conditions, a belief propagation run takes on average 20 minutes using two Intel Xeon E5-2650 v4 2.20 GHz CPUs with 24 cores and hyper-threading.
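The entropy-based break conditions can be sketched as a small helper. The sketch is written in Python for brevity and is not the Rust implementation; it assumes one particular reading of the conditions, namely that the tracked quantity is the largest per-node entropy and that the change is measured over a window of 20 iterations. The conditions that need knowledge of the secret are left out.

```python
import math


def entropy_bits(dist):
    """Shannon entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def should_stop(history, max_iter=1000, min_entropy=0.1,
                min_change=0.05, window=20):
    """history[i] is the largest per-node entropy (bits) after iteration i + 1."""
    if len(history) >= max_iter:
        return True                               # iteration budget exhausted
    if history and history[-1] < min_entropy:
        return True                               # all nodes (almost) decided
    if len(history) > window and \
            abs(history[-1 - window] - history[-1]) < min_change:
        return True                               # no progress over the window
    return False


stalled = [1.0] * 21
improving = [3.0 - 0.1 * i for i in range(21)]
```

A loop would append the current maximum entropy after each iteration and call `should_stop(history)` before starting the next one.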

Attack Results
The chosen ciphertext c_1 is generated according to Section 4.2, resulting in a sparse vector û generated according to the two approaches. First, the sparse vectors generated with BKZ allowed us to distribute the non-zero coefficients. Note, however, that due to the nature of the NTT in Kyber, the coefficients are only distributable pairwise. Second, we used the faster generation approach depicted in Figure 4.1, which results in the non-zero coefficients being aligned in contiguous blocks. The number of non-zero coefficients per polynomial û_i could be set to 256, 128, 64, or 32, with the last only applicable for Kyber1024. Combining these polynomials in the vector û results in sparse inputs to the inverse NTT ŵ = ŝ^T · û with 256, 192, 128, 64, and 32 non-zero coefficients. Note that for the final key recovery it is important in which vector components the non-zero coefficients are placed in order to reduce the required number of attack traces (see Final Key Recovery below). Also, further intermediate combinations with 32 non-zero coefficients (e.g., 96 non-zero coefficients) were omitted, as they are only applicable to Kyber1024 and would only marginally reduce the number of traces needed (e.g., 3 instead of 4) for the final key recovery.
We ran experiments for a range of σ from 0 up to 3.2 in steps of 0.1. For every value of σ in the relevant range where the observed probability of success was not 0 or 1, we repeated the experiments 25 times. We used the same strategy in both the masked and unmasked scenario and for the two different sparse vector sets. After reaching our abort condition, we compared the output of the belief propagation to the correct secret key, and counted the individual run as successful if all correct coefficients had the highest probability.
The results for the distributed non-zero coefficients are shown in Figure 5.2, with the legend highlighting the number of non-zero coefficients. The shaded area around each of the lines represents the confidence interval of the experiments for a confidence level of 95%. Since the number of experiments is not large, we use the Wilson score interval [Wil27], which performs better than the normal approximation interval in such cases.
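For reference, the Wilson score interval for a binomial proportion can be computed as in the following sketch. It uses the standard textbook formula with z = 1.96 for a 95% confidence level; the function name is hypothetical and the code is not taken from the evaluation scripts.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success probability."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# 25 runs per noise level, as in the experiments described above.
low, high = wilson_interval(24, 25)
```

With 25 runs, 24 successes give an interval of roughly 0.80 to 0.99 and 23 successes give roughly 0.75 to 0.98, which is the same order as the interval bounds reported for the experiments.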
We draw the following conclusions from the experiments. In the non-sparse case (256 non-zero coefficients), we observed a success rate of 1 only for σ ≤ 0.4. This agrees with the Hamming weight leakage model results of [PPM17], targeting a 256-coefficient ring-LWE inverse NTT. Applying our sparseness strategy to the input ciphertext c, we can increase the noise tolerance significantly. For example, with 64 distributed non-zero NTT coefficients, the noise level can go up to σ = 1.2, while the probability of success stays within a confidence interval of 0.75 to 0.97 for the masked and 0.80 to 0.99 for the unmasked version.
By setting even more coefficients to zero, the noise level can be further increased while keeping a solid success rate. For example, for Kyber1024, by setting all but an eighth (i.e., 32) of the coefficients to zero with BKZ, the confidence interval at σ = 2.2 is 0.75 to 0.98 for the masked and 0.80 to 0.99 for the unmasked version.
As the graphs show, we can further increase the achievable σ by allowing a lower success rate. Note, however, that as the success rate decreases, the required number of attack traces will increase, as we have to evaluate several runs until we find one where the belief propagation converges successfully.
Our attack shows similar results for the masked and unmasked case, because each masking share can be attacked individually with the same technique as in the unmasked case. In particular, as can be seen from Figure 5.1, masking the secret key as (m, s − m) does not influence the sparsity of u. Further, if u is sparse, then m · u and (s − m) · u are both sparse with the same support. We can separately recover them with belief propagation, and we add them together to obtain the (again sparse with the same support) s · u. From this step onwards, the attack continues as in the unmasked case. We then run our sparse key recovery on that value. Also note that the mask is thus removed in each attack trace, allowing us to repeat and combine the coefficients recovered in multiple traces as in the unmasked case.
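The argument that both shares stay sparse and recombine to the unmasked product can be checked with a small sketch. It is a simplification under assumptions: the product is modelled as a plain pointwise multiplication modulo q = 3329 on a 16-coefficient toy vector (Kyber actually multiplies pairs of coefficients), the mask is sampled from {0, ..., q − 1}, which covers the same residues as the centered range, and all names are hypothetical.

```python
import random

Q = 3329  # Kyber modulus

def split_shares(secret, rng):
    """Additively mask the secret as (s - m, m) modulo Q."""
    mask = [rng.randrange(Q) for _ in secret]
    return [(s - m) % Q for s, m in zip(secret, mask)], mask

def pointwise(a, b):
    """Simplified pointwise product modulo Q."""
    return [(x * y) % Q for x, y in zip(a, b)]

def support(vec):
    """Indices of the non-zero coefficients."""
    return {i for i, x in enumerate(vec) if x != 0}

rng = random.Random(1)
s_hat = [rng.randrange(1, Q) for _ in range(16)]
u_hat = [0] * 16
for i in (2, 3, 10, 11):                 # sparse chosen input
    u_hat[i] = rng.randrange(1, Q)

share0, share1 = split_shares(s_hat, rng)
w0, w1 = pointwise(share0, u_hat), pointwise(share1, u_hat)
w = [(a + b) % Q for a, b in zip(w0, w1)]  # mask cancels when shares are added
```

The support of each share product is contained in the support of the chosen input; it is exactly equal unless a share coefficient happens to be zero at a non-zero position, which occurs with probability 1/q per coefficient.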
The results for sparse vectors with non-zero coefficients in contiguous blocks (see Figure 4.1) are shown in Figure 5.3. Here, with 64 non-zero coefficients in a contiguous block, our approach still shows a non-zero success rate up to σ = 1.2, with a success rate confidence above 0.87 up to σ = 0.9. This can again be increased for Kyber1024 up to σ = 1.7 by reducing the number of non-zero coefficients to 32, with a success rate confidence between 0.80 and 0.99 for unmasked, and between 0.75 and 0.98 for masked.
For larger numbers of non-zero coefficients, the faster generation of the sparse vectors did not decrease the noise tolerance significantly. Note that for 256 non-zero coefficients the graphs are identical to those for the distributed sparse coefficients, as the location of non-zero coefficients is not relevant in this non-sparse case.
In [PP19] the unmasked case allowed a success rate above 0.9 up to σ = 1.5, slightly exceeding our achieved noise tolerance for Kyber512 and Kyber768. This is mainly because [PP19] targets the NTT in the encryption, whereas we target the inverse NTT in the decryption. In the former case, the input is sampled from a small binomial distribution, which acts as an additional constraint for belief propagation (cf. Section 3.3). However, our attack extracts the long-term secret key s, in contrast to the ephemeral secret r. Additionally, the advantage of [PP19] disappears in the masked setting, as here the inputs to the NTT are not small, dropping the success threshold to σ ≤ 0.4.

Table 5.1: Overview of the required number of traces for a full secret key recovery given the number of non-zero coefficients and the resulting noise tolerance level for a Success Rate (S.R.) > 0.7 and > 0 for all Kyber security levels with masking. The first number is with sparse vectors generated with BKZ, the second with the easier generation with the non-zero coefficients in contiguous blocks. As the final key recovery for Kyber512 using BKZ is computationally expensive, an attacker could opt for an extra attack trace (numbers in brackets) for a fast final key recovery. The k-trace attack (k ∈ {2, 3, 4}) is applicable with 64 non-zero coefficients and a noise level of 1.2|0.9 or 1.4|1.2.

Final Key Recovery
As a final attack step, we implemented key recovery according to Section 4.4, inverting the pairwise-pointwise scalar product and the half-NTT. For a simpler separation of the k dimensions, we opted for non-zero values in disjoint positions for each dimension. This allows us to directly recover the corresponding segment of coefficients of ŝ by dividing out the non-zero section of our sparse vector û.
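Dividing out a non-zero pair of û can be sketched as follows. The sketch assumes Kyber's pairwise multiplication in Z_q[X]/(X² − γ) and that, because of the disjoint positions, each recovered pair of ŵ depends on only one component of ŝ. The value γ = 17 is purely illustrative and not the twiddle factor of a specific coefficient pair; the function names are hypothetical.

```python
Q = 3329  # Kyber modulus

def basemul(a, b, gamma):
    """Multiply a0 + a1*X by b0 + b1*X modulo (X^2 - gamma, Q)."""
    a0, a1 = a
    b0, b1 = b
    return ((a0 * b0 + a1 * b1 * gamma) % Q, (a0 * b1 + a1 * b0) % Q)

def baseinv(u, gamma):
    """Inverse of u0 + u1*X modulo (X^2 - gamma, Q)."""
    u0, u1 = u
    norm = (u0 * u0 - gamma * u1 * u1) % Q   # (u0 + u1 X)(u0 - u1 X)
    if norm == 0:
        raise ZeroDivisionError("pair is not invertible")
    norm_inv = pow(norm, Q - 2, Q)           # Fermat inverse, Q is prime
    return (u0 * norm_inv % Q, (-u1 * norm_inv) % Q)

def recover_pair(w, u, gamma):
    """Recover the secret pair s from w = s * u by dividing out u."""
    return basemul(w, baseinv(u, gamma), gamma)

# Round trip with illustrative values.
gamma = 17
s_pair, u_pair = (123, 3000), (5, 77)
w_pair = basemul(s_pair, u_pair, gamma)
recovered = recover_pair(w_pair, u_pair, gamma)
```

Applying this to every non-zero pair of one trace yields the corresponding segment of ŝ for the dimension in which those pairs were placed.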

Sparseness
For the case of pairwise distributed non-zero coefficients in the sparse polynomials u_i, the key recovery can be written as a shortest vector problem and solved with BKZ. With block size 70, we were able to solve 92.7% of 590 attempts for a half-vector component of Kyber768/Kyber1024 with 32 out of 128 non-zero coefficients. By increasing the block size to 80, we were able to increase the recovery success rate to 100% of 100 attempts. For Kyber512 the task is more difficult due to the larger binomial distribution, but we still succeed in 54% of 100 attempts with block size 80. This can most likely be increased further by larger block sizes; an alternative, however, is to increase the number of coefficients recovered to 48 out of 128. With this, we could solve 100% of 100 attempts, employing only BKZ-40. Hence, for Kyber768 and Kyber1024, an attack yielding a total of 64 coefficients per vector component allows for a full key recovery. For Kyber512, the computational cost for key recovery was only feasible for us in half of the cases, but by increasing the number of coefficients recovered to 96 per vector component, only minimal effort is needed for a full key recovery.
In the case of the non-zero coefficients in a contiguous block, the faster approach highlighted in Figure 4.2 is possible. We ran 1000 experiments for each Kyber parameter set, in which we attempted to determine the key from a recovered ŵ = ŝ^T · û, with k · 64 non-zero coefficients in û. For Kyber768 and Kyber1024, the key was uniquely determined in all our tests. For Kyber512, a few coefficients of each key were not uniquely determined. This resulted in 2^12 possible values for the entire private key on average, up to a maximum of 2^22 in our experiments. However, the private key could be recovered by checking these possibilities against the known public key. In total, key recovery from k · 64 known coefficients of ŝ took at most a few seconds on a laptop.
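The brute-force step of the faster approach can be illustrated with a toy sketch. It assumes that, after reversing the last NTT layers, each known intermediate coefficient is a fixed linear combination (with known twiddle factors) of a few small secret coefficients in {−η, ..., η}. The twiddle values and names below are illustrative assumptions, not the constants of a specific Kyber layer.

```python
from itertools import product

Q = 3329  # Kyber modulus

def candidates(target, twiddles, eta=2):
    """All small input tuples x with sum(twiddles[i] * x[i]) == target mod Q."""
    small = range(-eta, eta + 1)
    return [combo for combo in product(small, repeat=len(twiddles))
            if sum(t * x for t, x in zip(twiddles, combo)) % Q == target % Q]

# Two inputs in {-2, ..., 2}: only 5^2 = 25 combinations to test.
secret = (1, -2)
twiddles = (1, 17)
observed = (secret[0] * twiddles[0] + secret[1] * twiddles[1]) % Q
matches = candidates(observed, twiddles)
```

If several tuples match an observed value, the corresponding coefficients are not uniquely determined, and the remaining candidates would have to be checked against the public key as described above.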
Note that for each dimension, 64 coefficients suffice to uniquely recover all coefficients of s. Here, a trade-off has to be made between the sparseness in the inverse NTT and the number of coefficients recovered with each trace. Table 5.1 summarizes the number of traces needed for the different security levels of Kyber, together with our achieved noise tolerance threshold on σ for a masked implementation. The first numbers are for the sparse vectors generated with BKZ, with the non-zero coefficients distributed pairwise to improve the belief propagation. The second numbers show the results with sparse vectors generated with the faster approach employing the butterfly structure of the NTT; here, the non-zero coefficients are in contiguous blocks. With the contiguous block sparseness, we are able to perform a successful single-trace attack on Kyber512 up to a σ of 0.6 and fully recover the secret key in a 2k-trace attack for Kyber1024 up to a σ of 1.7 (S.R. > 0.7). The k-trace attack with full key recovery is possible up to a σ of 0.9 for all Kyber security levels. By generating the sparse vectors with BKZ, this can be increased up to a σ of 1.2. Note, however, that for Kyber512 using BKZ for the final key recovery step, the number of traces might need to be increased by one (numbers in brackets) in order to allow for faster solving for the original key. By repeating failed runs (0 < S.R. < 0.7), we can further increase the noise level in the k-trace attack setting up to σ ≤ 1.4, and for Kyber1024 even to σ ≤ 2.6 in the 2k-trace attack setting. All results are given considering masking of the secret key s.

Security Estimates from Partial Key-Recovery
For full key recovery, the attack presented in Section 4 either requires a single chosen ciphertext trace to be measured with low noise (σ ≤ 0.5 to 0.7, depending on the parameter set) or k traces with a noise up to σ ≤ 1.2. We saw the effectiveness of such an attack in the previous section. However, it might not always be possible or practical to get the full k traces in the case of an assumed noise level of σ ≈ 1.2. Therefore, an estimate of the remaining security after fewer than k traces is shown in the following, assuming 64 coefficients recovered per trace.
To estimate the remaining security in these cases, we use the same methodology as was used for the security estimates of Kyber: core-SVP hardness for the primal attack. Note that core-SVP estimates are particularly conservative, and a b-bit security estimate is not equivalent to a b-bit remaining key entropy. As is also noted in [ABD+17], more refined estimates of the security can be made. Since this is not the purpose of this paper, we restrict ourselves to the estimates below.
In Table 5.2 it can be seen that even with a single trace recovered in the k-trace attack, the estimate for the remaining security shows a significant drop. We emphasize again that these numbers do not necessarily mean that 2 traces in the k-trace attack on Kyber768 lead to a practical break of the full Kyber key. Even for 2 traces, recovering a short vector of a 1023-dimensional lattice is currently still well out of range of sieving records [DSvW21]. Recovery of LWE keys from partial key recovery is out of scope of this paper, but is an active area of research. For example, in [DDGR20] it was shown that knowledge of a partial key can be incorporated into the u-SVP lattice of the LWE key recovery. We leave the adaptation of these techniques to the module-LWE setting as future work.
We do, however, see that the security estimates for classical and quantum hardness for 1 trace already drop well under the desired 128-bit security levels, and that 2 traces give an alarming drop in security. For Kyber512, similar estimates show that even 1 trace reduces the security estimates from 118 / 107 bits of security to 47 / 43 bits in the classical / quantum setting, respectively. For Kyber1024, 1 trace in the k-trace attack drops the security to the Kyber768 case. From there on, the security declines comparably to Kyber768. For completeness, the full table is included in Appendix B.1.

Discussion
In the previous sections, we demonstrated how simple power analysis attacks on lattice-based cryptographic schemes can be significantly improved over previous works, both in terms of noise resistance and wider applicability. We now briefly discuss how our attack could be applied to lattice-based schemes other than Kyber and NewHope, and options for countermeasures against our attack.
Application to other Schemes
Kyber and NewHope explicitly mention the usage of NTT computations in their specification, and therefore Kyber was an obvious focus of our presented attack. However, it has been shown in [CHK+21] that the NTT can be used to implement the polynomial multiplication of NIST finalists Saber [DKRV18] and NTRU [CDH+20]. The same was shown for NIST alternate NTRU Prime [BCLvV17] in [ACC+21b]. Even though these schemes operate in rings that are less "NTT-friendly", their polynomial multiplication can be lifted to a larger ring (in degree and/or modulus) that both allows the application of an NTT and ensures that reductions do not affect the correctness of the result.
We also conjecture that the belief propagation techniques presented in this paper could increase the effectiveness of simple power analysis attacks on other systems. Different multiplication algorithms that process secret coefficients in blocks (e.g., Karatsuba [KO63] or Toom-Cook [Too63, Coo66]) can likely be forced to have sub-blocks be "special", e.g., small or sparse. In this case, belief propagation can be used to learn more information from side channels, similarly to the attack presented in this paper. The structure of the underlying rings, such as existing automorphisms and subrings, might also help the effectiveness of belief propagation; however, this is left for future work.

Application to other Implementations
The choice of our target implementation (and leakage model) is mainly motivated by the fact that we want to allow accurate comparisons with previous works [PPM17, PP19, KPP20], and by the fact that the attack in [PP19] has already been reproduced on a real Cortex-M4 device. Nevertheless, it is natural to ask how our attack could be adapted to more recent, optimized Kyber implementations such as the one from pqm4 [KRSS]. This implementation mainly differs from the implementation that is considered in this paper in two aspects. First, the pqm4 implementation stores two NTT coefficients within one 32-bit word and then uses vectorized instructions such as uadd16 to perform two halfword additions concurrently. Second, the pqm4 implementation uses a register allocation strategy that reduces the number of load/store instructions so that they only occur every third NTT layer.
When adapting the factor graph to the situation where two NTT coefficients are stored within the same word, one could make use of a strategy that is already used in the single-trace attack on 32-bit implementations of Keccak in [KPP20]. There, the authors use a clustering approach to represent one 32-bit word as two halfwords in the factor graph, not because the algorithmic description of Keccak requires it, but because BP runs into serious runtime issues when performing message passing for 32-bit variable nodes. This will, however, come at the cost of reduced convergence performance.
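How a 32-bit word observation could be tied to two 16-bit variable nodes is sketched below. This is an assumption about how such a clustering might be modelled, not the construction of [KPP20]: the word-level HW is the sum of the two halfword HWs, so a single leakage factor can connect both halfword nodes. All names are hypothetical.

```python
import math

def hw(x):
    """Hamming weight of a non-negative integer."""
    return bin(x).count("1")

def pack(lo, hi):
    """Pack two 16-bit coefficients into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def split(word):
    """Split a 32-bit word into its (low, high) halfwords."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF

def word_factor(observed, lo, hi, sigma):
    """Unnormalised Gaussian likelihood of a word-level HW observation
    for a candidate pair of halfwords."""
    expected = hw(lo & 0xFFFF) + hw(hi & 0xFFFF)
    return math.exp(-((observed - expected) ** 2) / (2.0 * sigma ** 2))

word = pack(-1, 1664)
lo_half, hi_half = split(word)
```

Because the factor only sees the sum of the two Hamming weights, it constrains each halfword less tightly than a dedicated 16-bit observation would, which fits the reduced convergence performance mentioned above.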
To accommodate a potential lack of load/store instructions, one could instead opt for templating the multiplication with twiddle factors, as done in [PPM17]. This will likely increase the complexity of template generation to some extent, but would also have the advantage that multiplications need to be executed separately for each halfword, eliminating the need for clustering. We leave a more concrete evaluation of our attack approach against more optimized implementations for future work.

Masking of the Input
Masking is first and foremost a countermeasure against differential attacks, but it also somewhat affects the performance of profiled attacks. Most notably, masking randomizes data during computations, which effectively prevents an attacker from performing averaging. Consequently, side-channel attacks in this setting either need to work with single measurements or find ways to combine information from multiple computations using different masks. In our case, we are able to retrieve the unmasked intermediate values by individually attacking each masking share. Moreover, this is possible for each trace individually, and we are thus able to combine coefficients recovered from multiple traces also in the masked case. Hence, our attack shows similar results for the masked and unmasked case (cf. Figure 5.2). It is important to point out that our attack on Kyber is only applicable to masking schemes that mask the key but not the input (during decryption), which is the common case [RRdC+16, OSPG18]. With masking of the input, however, our assumption of a sparse input to the inverse NTT does not hold anymore. In this case, our noise tolerance is reduced to the non-sparse case (cf. Figure 5.2 with 256 non-zero coefficients).
Hiding
As mentioned in previous works that study belief propagation based side-channel attacks, a rather straightforward and effective protection against such attacks can be achieved by using hiding techniques such as shuffling [PPM17, PP19, KPP20, RPBC20]. By randomizing the order of executed operations within an NTT computation, leakage points cannot be trivially assigned to the correct variable nodes anymore. This then leads to contradictions during belief propagation and prevents convergence. In a similar spirit, the insertion of random dummy operations inside the NTT can increase the difficulty of attacks.
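The effect of shuffling on the assignment of leakage points can be illustrated with a toy sketch. It assumes noise-free HW leakage of eight coefficients within one layer and an attacker who maps the i-th sample to the i-th coefficient; the names and parameters are hypothetical.

```python
import random

def hw16(value):
    """Hamming weight of a signed 16-bit value in two's complement."""
    return bin(value & 0xFFFF).count("1")

def leak_layer(values, order):
    """HW samples in the order in which the coefficients are processed."""
    return [hw16(values[i]) for i in order]

rng = random.Random(7)
values = [rng.randrange(-1664, 1665) for _ in range(8)]

natural = list(range(len(values)))
shuffled = natural[:]
rng.shuffle(shuffled)                    # randomized execution order

trace = leak_layer(values, shuffled)

# The attacker assumes the natural order and attaches sample i to node i.
mismatches = sum(1 for pos, sample in enumerate(trace)
                 if sample != hw16(values[pos]))
```

Every mismatch attaches a wrong observation to a variable node, which is the source of the contradictions during belief propagation described above.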

Conclusion
We presented a method for crafting ring/module-LWE ciphertexts that result in sparse polynomials at the input of inverse NTT computations, and presented a novel attack that uses this sparseness to significantly improve side-channel attacks. Our attack shows that the side-channel security of lattice-based schemes cannot be neglected and that relying on (key) masking alone does not offer protection at a reasonable cost. While this work focuses on Kyber, variations of our attack are also applicable to other lattice-based schemes like NewHope, and potentially to implementations of NTRU, SABER, LAC, etc. that use Number Theoretic Transforms.

Figure 2.1: 8-coefficient Cooley-Tukey decimation-in-time NTT. The factor graph used in [PP19] and in our work.

Figure 4.1: Generating sparse NTT values (simplified for 8 even coefficients, with multiplications by ζ omitted). The inputs of the NTT (left) need to be compressible, while the right side should be sparse. A sparse polynomial is found by iterating through a single intermediate value in layer ℓ (here: ℓ = 1, highlighted by the dashed box) until a compressible input of the NTT is found. This automatically results in a sparse output in the NTT domain (right side). Note that in Kyber the same has to be performed independently for the odd-indexed coefficients.
Figure 4.2: Recovering the private key from partial knowledge of ŝ. After inference of, e.g., the upper half of ŝ, the original key can be recovered as follows. First note that the last layers of the NTT can be reversed inside the upper half up to the intermediate polynomial s̃. From here, each coefficient can be independently brute-forced by its original inputs, which are sampled from small binomial distributions, e.g., {−2, ..., 2}. Here s̃_0 = s_0 + ζ^4 s_8, which results in only 5^2 = 25 possible combinations.

Figure 5.1: Simplified depiction of a masked Kyber decryption operation. Parts that are unnecessary for our analysis are omitted. The side-channel leakage is taken from the NTT^−1, highlighted by the dashed orange box.

Figure 5.2: Attack results for different noise levels σ with distributed non-zero coefficients. The figures show the attack success rate for the masked (a) and unmasked (b) implementations, where the sparse vectors are generated with BKZ. Each data point within the relevant area is the average of 25 runs with a step size of 0.1 for σ. The shaded area marks the 95% confidence interval. It can be seen that with a decreasing number of non-zero coefficients (given in the legend) the achievable noise tolerance is significantly increased.

Figure 5.3: Attack results for different noise levels σ with non-zero coefficients in contiguous blocks. The figures show the attack success rate for the masked (a) and unmasked (b) implementations with the non-zero values in a contiguous block (see Figure 4.1). Each data point within the relevant area is the average of 25 runs with a step size of 0.1 for σ. The shaded area marks the 95% confidence interval. In comparison to Figure 5.2, the noise tolerance is mainly decreased for a low number of non-zero coefficients (i.e., 64 and 32). Note that the graph for 256 non-zero coefficients is identical to Figure 5.2, as this is the non-sparse case.
al. [RRdC+16] and was later extended to a CCA2-secure version by Oder et al. [OSPG18]. An alternative countermeasure, more in line with classical blinding techniques, utilizes the additively homomorphic nature of ring-LWE and is presented in [RdCR+16].

Table 5.2: Security estimates for the remaining security after partial k-trace key recovery for Kyber768 (k = 3), assuming 64 non-zero coefficients of ŝ recovered per trace.