New Bleichenbacher Records: Fault Attacks on qDSA Signatures

In this paper, we optimize Bleichenbacher's statistical attack technique against (EC)DSA and other Schnorr-like signature schemes with biased or partially exposed nonces. Previous approaches to Bleichenbacher's attack suffered from very large memory consumption during the so-called "range reduction" phase. Using a carefully analyzed and highly parallelizable approach to this range reduction based on the Schroeppel–Shamir algorithm for knapsacks, we manage to overcome the memory barrier of previous work while maintaining a practical level of efficiency in terms of time complexity. As a separate contribution, we present new fault attacks against the qDSA signature scheme of Renes and Smith (ASIACRYPT 2017) when instantiated over the Curve25519 Montgomery curve, and we validate some of them on the AVR microcontroller implementation of qDSA using actual fault experiments on the ChipWhisperer-Lite evaluation board. These fault attacks enable an adversary to generate signatures with 2 or 3 bits of the nonces known. Combining our two contributions, we are able to achieve a full secret key recovery on qDSA by applying our version of Bleichenbacher's attack to these faulty signatures. Using a hybrid parallelization model relying on both shared and distributed memory, we achieve a very efficient implementation of our highly scalable range reduction algorithm. This allows us to complete Bleichenbacher's attack in the 252-bit prime order subgroup of Curve25519 within a reasonable time frame and using relatively modest computational resources, both for 3-bit nonce exposure and for the much harder case of 2-bit nonce exposure. Both of these computations, and particularly the latter, set new records in the implementation of Bleichenbacher's attack.


Introduction

Attacks on Nonces in Schnorr-like Signatures
Attacks on the nonces of (EC)DSA [Gal13] and other Schnorr-like signature schemes [Sch91] have been of interest to cryptanalysts over the last couple of decades. Since knowledge of the nonces directly translates to knowledge of the secret key, it is well known that the nonces should never be revealed or repeated. However, the nonces in Schnorr-like signatures are even more sensitive than that; in fact, it is possible to recover the secret key using only partial information on the nonces. Perhaps the most famous example is the lattice attack initiated by Howgrave-Graham and Smart in [HGS01]. In a nutshell, the idea of lattice attacks is as follows: given d Schnorr-like signatures of different messages with some least significant bits (LSB) of the nonces exposed, preprocess signature pairs to make the nonces biased in their most significant bits (MSB) and construct a (d+1)-dimensional lattice L containing a hidden vector c depending on the secret key. The signatures themselves provide another vector v ∈ Z^{d+1} which is very close to c; under suitable conditions, c is highly likely to be the closest vector to v in L. As a result, if the dimension d+1 is small enough to make the closest vector problem in L tractable, it is possible to compute c and hence the secret key; the construction of the lattice is sketched in the code below. See, e.g., [NT12] for a more comprehensive description. The lattice attack is a very powerful technique because it requires relatively few signatures as input and works very efficiently in practice if many bits of the nonces are exposed. Since its first introduction, there have been a number of works on the lattice attack, such as [NS02, NS03, NNTW05, BvdPSY14, BFMT16]. The largest group size and the smallest nonce exposure broken by lattice attacks in published literature so far have been 160-bit DSA signatures with 2-bit nonce exposure, broken by Liu and Nguyen [LN13], and 256-bit SM2 signatures with 3-bit nonce exposure, attacked by Liu, Chen and Li [LCL13]. However, if the number of exposed bits and the resulting bias is small, lattice attacks are generally impractical, either due to the large lattice dimension, or because the hidden vector c no longer necessarily coincides with the closest vector.
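To make the lattice construction concrete, here is a minimal sketch (ours, not from the original sources) of the integer basis and CVP target for the standard HNP lattice, assuming signature pairs (h_i, s_i) with nonces k_i ≡ s_i + h_i·d (mod n) satisfying 0 ≤ k_i < n/2^b; the function name and layout are illustrative, and a real attack would feed the output to an LLL/CVP solver such as fpylll or Sage:

```python
def hnp_lattice(h, s, n, b):
    """Basis rows and CVP target for the HNP instance k_i = s_i + h_i*d (mod n)
    with 0 <= k_i < n/2^b. Under suitable conditions, the lattice vector
    closest to `target` is (2^b*(k_0 - s_0), ..., 2^b*(k_(d-1) - s_(d-1)), d),
    whose last coordinate reveals the secret d."""
    dim = len(h)
    rows = [[(n << b) if j == i else 0 for j in range(dim + 1)]
            for i in range(dim)]                  # rows 2^b * n * e_i
    rows.append([hi << b for hi in h] + [1])      # row (2^b*h_0, ..., 2^b*h_(d-1), 1)
    target = [(-si << b) % (n << b) for si in s] + [0]
    return rows, target   # feed to an LLL + CVP solver (e.g., fpylll or Sage)
```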
Prior to lattice attacks, Bleichenbacher presented a purely statistical attack technique against biased nonces at the IEEE P1363 meeting in 2000 [Ble00]. This approach had never been formally published until a few years ago, when it was revisited in a few papers [DMHMP14, AFG+14]. The main idea of Bleichenbacher's attack is to define a "bias function" based on a Fourier notion of bias, and to search for a candidate value of the secret key corresponding to the peak of this bias function. An advantage of Bleichenbacher's attack over lattice attacks is that it can in principle deal with arbitrarily small biases and even work with non-uniformly biased inputs. On the negative side, Bleichenbacher's method requires many signatures as input, and therefore suffers from a large space complexity due to its "range reduction" phase, where one has to find sufficiently many small and sparse linear combinations of signature values before computing the bias peak. For example, [AFG+14] took a very straightforward approach to range reduction, which they call sort-and-difference, and successfully carried out a full key recovery of ECDSA over a 160-bit curve using 1-bit bias. However, their approach needed 2^33 signatures as input and consumed 1 TB of memory, which remains an unusually large memory requirement for academic cryptanalytic experiments even to this day. Hence, Bleichenbacher's attack against groups of large order and small biases (e.g., a 256-bit curve and 2-bit bias) has appeared intractable.

Montgomery Curve, Curve25519, qDSA
Elliptic curve cryptography is widely deployed nowadays since it offers relatively short key lengths for a good security level. The best-known instances include signature schemes such as ECDSA. Most elliptic curve-based signature schemes operate in the group of rational points of an elliptic curve defined over a finite field, and their security relies on the hardness of the elliptic curve discrete logarithm problem (ECDLP). Moreover, elliptic curves are used to achieve efficient key exchange protocols; for example, X25519 is specified in RFC 7748 [LHT16] as a function that efficiently computes scalar multiplication in elliptic curve-based Diffie-Hellman key exchange (ECDHKE) [DH76]. The underlying curve used for X25519 is called Curve25519 [Ber06], which is one of the most famous instances of a Montgomery curve [Mon87]. Interestingly, Montgomery curves offer extremely fast scalar multiplication thanks to their x-only arithmetic; however, the x-line is not endowed with a group law in the usual sense, which is typically required in curve-based signature schemes. As a result, fast implementations of signature schemes using Curve25519 have usually avoided the x-only arithmetic, and relied on the twisted Edwards form of that curve instead (which has a fast group law in the usual sense, but does not benefit from the simplicity of the Montgomery ladder); this is in particular the approach taken by EdDSA [BDL+12]. It was not until the quotient Digital Signature Algorithm (qDSA) [RS17] was proposed by Renes and Smith (ASIACRYPT 2017) that one could reuse the scalar multiplication algorithm and the public key of X25519-based ECDHKE for signatures without modifying the format at all. qDSA is a high-speed, high-security signature scheme that relies on x-only arithmetic and can be instantiated with Montgomery curves (such as Curve25519) or Kummer surfaces. At a high level, it closely resembles Schnorr signatures and is proved secure in the random oracle model as well. Due to its efficiency and its compatibility with X25519, qDSA is expected to be deployed in real-world constrained embedded systems, such as IoT devices. Some improvements to the signature generation and verification of qDSA have recently been proposed by [FFAL17].

Our Contributions
In this work, the following main results are achieved:

• Sections 4 & 5. Our first contribution is an optimized range reduction algorithm for Bleichenbacher's attack, which overcomes the memory barrier of previous work while maintaining a practical level of efficiency in terms of time complexity. We designed the range reduction algorithm based on Howgrave-Graham and Joux's version [HGJ10] of the Schroeppel–Shamir algorithm [SS81], which was originally proposed as a knapsack problem solver. The idea of making use of Schroeppel–Shamir was mentioned by Bleichenbacher himself, but it had never been formally evaluated in the literature.

Our approach has two merits: first, it has a lower space complexity, and therefore requires fewer input signatures than previous methods in order to perform the same level of range reduction. Second, our algorithm can be parallelized in a very straightforward fashion with low communication overhead. Note that this contribution is independent of the second one, since Bleichenbacher's attack applies not only to qDSA, but to any Schnorr-like signatures generated from biased or partially exposed nonces.

We first recall Bleichenbacher's attack framework in Section 4; Section 5 describes our approach to range reduction in detail and presents theoretical results on the lower bound for the number of input signatures required for the algorithm to work correctly within Bleichenbacher's framework, including a performance comparison with previous nonce attack techniques.
• Section 3. As a separate contribution, we show that qDSA is yet another victim of attacks against nonces. We propose two fault attack techniques against qDSA instantiated with Curve25519, inducing 3-bit and 2-bit biases in its nonces. Our fault injection methods perturb the base point of Curve25519 into a point of non-prime order, so that its scalar multiplication by the nonce reveals the few least significant bits (LSB) of the nonce. The LSBs obtained through faults can then be exploited to create a bias in the most significant bits (MSB) of the nonces. We describe those two attacks and straightforward countermeasures in Section 3.
• Section 6. Combining our two contributions, we are able to achieve a full secret key recovery on qDSA by applying our version of Bleichenbacher's attack to these faulty signatures. Using a hybrid parallelization model relying on both shared and distributed memory, we achieve a very efficient implementation of our highly scalable range reduction algorithm. This allows us to complete Bleichenbacher's attack in the 252-bit prime order subgroup of Curve25519 within a reasonable time frame and using relatively modest computational resources, both for 3-bit nonce exposure and for the much harder case of 2-bit nonce exposure. To the best of our knowledge, an attack against a 252-bit group with such small exposures of the nonces has never been addressed before. Hence, both of these computations, and particularly the latter, set new records in the implementation of Bleichenbacher's attack. Section 6 describes those implementation techniques and provides our experimental results in detail.
We stress that the complete attack, especially in the 2-bit bias case, is not entirely practical, as it both requires a large number of faulty signatures and targets a slightly modified version of the qDSA reference implementation. Nevertheless, it showcases a number of interesting optimizations of Bleichenbacher's attack in a concrete setting, and also has the valuable takeaway that clearing cofactors in qDSA signature generation (or indeed, in any Schnorr-like signature using x-only arithmetic) is a simple and important security measure.

Related Work
Bleichenbacher's nonce attack against DSA was first proposed in [Ble00], and his own early experimental results include a full key recovery on 160-bit DSA given a nonce leakage of log_2 3 ≈ 1.58 bits for 2^22 signatures and 1-bit exposure for 2^24 signatures [Ble05]. De Mulder et al. revisited his method in [DMHMP14] and successfully performed a key recovery attack against ECDSA over NIST P-384 and brainpoolP384r1 using 4000 signatures with 5 bits of the nonces known. After that, Aranha et al. [AFG+14] utilized Bleichenbacher's method to attack ECDSA over SECG P160 R1 with 2^33 signatures with 1-bit biased nonces.
Recovering the secret key from signatures knowing partial bits of the nonces reduces to an instance of the hidden number problem (HNP) of Boneh and Venkatesan [BV96]. Howgrave-Graham and Smart first developed lattice attacks in [HGS01] to recover the DSA secret key over a 160-bit group using 30 signatures with 8 bits of the nonces known. Nguyen and Shparlinski in [NS02, NS03] later analyzed the lattice attacks in detail and presented experimental results of the attack against 160-bit DSA using 100 signatures with only 3 bits of the nonces known. The largest group size and the smallest nonce exposure broken by lattice attacks in published literature so far have been 160-bit DSA signatures with 2-bit nonce exposure, broken by Liu and Nguyen [LN13], and 256-bit SM2 signatures with 3-bit nonce exposure, attacked by Liu, Chen and Li [LCL13]. Side-channel analysis and fault attacks have often been utilized in conjunction with lattice attacks to obtain the partial information on nonces. Such concrete attacks appear e.g. in [NNTW05, BvdPSY14, BFMT16].
The first fault attack was discovered by Boneh, DeMillo and Lipton, and is often referred to as the Bellcore attack [BDL97]. This attack was against an implementation of RSA based on the Chinese Remainder Theorem. Various fault injection techniques and countermeasures are described in [BCN+06]. In [FLRV08], Fouque et al. proposed a fault attack targeting the base point on non-twist-secure Montgomery curves. The idea of exploiting the low order points on Curve25519, upon which one of our fault attacks relies, was recently explored by Genkin, Valenta and Yarom [GVY17] in the context of attacks against ECDH.

Notations
In order to avoid confusion, we denote an index by an italic i and the imaginary unit by a roman i. A variant Õ of big-O notation will be used, meaning that logarithmic factors are omitted.
We denote the b LSBs/MSBs of an integer k by LSB_b(k) and MSB_b(k), respectively, assuming that k is represented as a fixed-length binary string. (The bit-length is typically 252 bits in this paper.) In Section 5, we will often use the binary representation of a (τ+1)-bit integer η, which is denoted as follows:

η = (η_{τ+1} η_τ ⋯ η_1)_2 = Σ_{i=1}^{τ+1} η_i 2^{i−1},

where η_i ∈ {0, 1} for i = 1, . . ., τ+1. Moreover, we define a new notation η_{[a:b]} to represent the substring of η and its corresponding value:

η_{[a:b]} := (η_a η_{a−1} ⋯ η_b)_2 = Σ_{i=b}^{a} η_i 2^{i−b}.
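As a quick illustration of this notation (ours, not part of the original text), the value η_{[a:b]} can be computed as follows; the helper name is our own:

```python
def substr(eta: int, a: int, b: int) -> int:
    """Value of the bit substring eta_[a:b] = (eta_a ... eta_b)_2, where bit
    positions are indexed from 1 at the LSB, as in the notation above."""
    assert a >= b >= 1
    return (eta >> (b - 1)) & ((1 << (a - b + 1)) - 1)

# e.g., the top (alpha+1) bits of a (tau+1)-bit eta: substr(eta, tau+1, tau-alpha+1)
```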

The quotient Digital Signature Algorithm
The quotient Digital Signature Algorithm (qDSA) is a variant of the Schnorr signature scheme that operates on Kummer varieties and offers a key pair compatible with X25519-based Diffie-Hellman key exchange protocols [RS17]. We briefly recall the x-only arithmetic of Montgomery curves discovered in [Mon87] and the qDSA signature scheme instantiated with Curve25519 [Ber06], the most widely-known Montgomery curve. For a more comprehensive introduction to Montgomery curves and Montgomery's ladder, see, e.g., [CS17] or [BL17].

Montgomery Curves and Their Arithmetic
Let p be a prime. A Montgomery curve defined over the finite field F_p is an elliptic curve defined by an affine equation

E_{A,B} : B y^2 = x^3 + A x^2 + x,

where the coefficients A and B are in F_p such that A^2 ≠ 4 and B ≠ 0.
Using the projective representation (X : Y : Z), where x = X/Z and y = Y/Z, we have the projective model

E_{A,B} : B Y^2 Z = X^3 + A X^2 Z + X Z^2.

Note that the point at infinity O = (0 : 1 : 0) is the only point where Z = 0.
Montgomery observed that the arithmetic in the above model does not involve y-coordinates. Namely, let P = (X_P : Y_P : Z_P) and Q = (X_Q : Y_Q : Z_Q) be two distinct points on E_{A,B}; then point addition and doubling are defined as follows:

X_{P+Q} = Z_{P−Q} [(X_P − Z_P)(X_Q + Z_Q) + (X_P + Z_P)(X_Q − Z_Q)]^2,
Z_{P+Q} = X_{P−Q} [(X_P − Z_P)(X_Q + Z_Q) − (X_P + Z_P)(X_Q − Z_Q)]^2,

X_{[2]P} = (X_P + Z_P)^2 (X_P − Z_P)^2,
Z_{[2]P} = 4 X_P Z_P [(X_P − Z_P)^2 + ((A + 2)/4) · 4 X_P Z_P],   where 4 X_P Z_P = (X_P + Z_P)^2 − (X_P − Z_P)^2,

where X_{P+Q}/Z_{P+Q}, X_{P−Q}/Z_{P−Q} and X_{[2]P}/Z_{[2]P} are the x-coordinates of P + Q, P − Q and [2]P, respectively.
Montgomery also proposed an algorithm, known as Montgomery's ladder, which efficiently computes the x-coordinate of the scalar multiplication [k]P using only the point addition and doubling operations above. Therefore, it suffices to consider the points mapped into the one-dimensional projective space P^1(F_p), which is simply the x-line. Formally speaking, let E_{A,B}/⟨±1⟩ be the Kummer line of E_{A,B} and P = (X : Y : Z) an elliptic curve point; if the quotient map x : E_{A,B} → E_{A,B}/⟨±1⟩ ≅ P^1(F_p) is defined as

x(P) = (X : Z) for P ≠ O, and x(O) = (1 : 0),

then Montgomery's ladder efficiently computes the scalar multiplication on P^1, i.e. x([k]P) from x(P). We omit the details of the ladder itself. The value k at line 1 of Algorithm 1 is typically called the nonce. From line 5, the nonce obviously satisfies the following congruence relation:

k ≡ s + h·d (mod n).    (1)

Note that the other component of the secret key material is only used as a seed for deterministic nonce derivation and does not get involved in the verification at all. Hence, knowing d allows an attacker to generate a valid signature on arbitrary messages, even though the forged signatures are distinct from legitimate ones. In this paper, we will refer to d as the secret key for convenience.
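To make the x-only arithmetic concrete, here is a minimal Python sketch (ours, not the reference code; big-integer arithmetic, not constant-time) of the Montgomery ladder on Curve25519, combining the xADD/xDBL formulas above with the final projective-to-affine conversion:

```python
p = 2**255 - 19          # Curve25519 base field
A = 486662               # E_{A,B}: y^2 = x^3 + A*x^2 + x over F_p (B = 1)

def xadd(P, Q, diff):
    # Differential addition: x(P+Q) from x(P), x(Q) and x(P-Q), all projective (X : Z).
    (XP, ZP), (XQ, ZQ), (Xd, Zd) = P, Q, diff
    u = (XP - ZP) * (XQ + ZQ) % p
    v = (XP + ZP) * (XQ - ZQ) % p
    return (Zd * (u + v) ** 2 % p, Xd * (u - v) ** 2 % p)

def xdbl(P):
    # Doubling: x([2]P) from x(P).
    XP, ZP = P
    s, d = (XP + ZP) ** 2 % p, (XP - ZP) ** 2 % p
    t = s - d                                   # t = 4*XP*ZP
    return (s * d % p, t * (d + (A + 2) // 4 * t) % p)

def ladder(k, xP):
    # Montgomery ladder: affine x-coordinate of [k]P given the affine x(P).
    xP %= p
    R0, R1 = (1, 0), (xP, 1)                    # (1 : 0) is the image of O
    for i in reversed(range(k.bit_length())):
        if (k >> i) & 1:
            R0, R1 = xadd(R0, R1, (xP, 1)), xdbl(R1)
        else:
            R0, R1 = xdbl(R0), xadd(R0, R1, (xP, 1))
    X, Z = R0
    return X * pow(Z, p - 2, p) % p             # Fermat inversion; yields 0 if Z = 0
```

Note how the final conversion maps Z = 0 (the point at infinity) to the abscissa 0; this detail resurfaces in the combined attack of Section 3.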

Curve25519 Parameter Set
In the qDSA instance equipped with Curve25519, the parameters are specified as follows:

• p = 2^255 − 19, A = 486662 and B = 1;
• the group of rational points decomposes as E_{A,B}(F_p) ≅ Z/8Z × Z/nZ, where n = 2^252 + 27742317777372353535851937790883648493 is prime;
• the cofactor of E_{A,B} is 8, and the cofactor of the quadratic twist of E_{A,B} is 4.

Knapsack Problem
Although there exist several variants, we refer to the computational 0-1 knapsack problem simply as the knapsack problem. It can be stated as follows: given a set of S positive integers {h_0, . . ., h_{S−1}} and a target value T, find a set of coefficients ω_0, . . ., ω_{S−1} ∈ {0, 1} such that Σ_{i=0}^{S−1} ω_i h_i = T.

Fault Attacks on qDSA
In this section, we describe several variants of a fault attack targeting the base point of scalar multiplication in qDSA signatures.
Our basic attack strategy is as follows. The qDSA signing algorithm uses the Montgomery ladder to compute the scalar multiplication R = [k]P (up to sign), where k is the sensitive nonce value associated with the signed message; and the point R (or rather, its x-coordinate x_R) is output as part of the signature. Here, the correct base point P is a generator of the cyclic subgroup of order n in E_{A,B}(F_p) ≅ Z/8Z × Z/nZ.
Suppose that we can inject a fault into the device computing qDSA signatures so as to replace the point P by a different, faulty point P̃ still on E_{A,B}, but with a different order, say 8n. Then, even without knowing the exact value of P̃, one can deduce information on k from the signature element x_R. For example, if x_R corresponds to a point of exact order n, we can show that k must be a multiple of 8: in other words, we obtain leakage information on the 3 least significant bits of k. As discussed in Section 3.3 below, such a bias can be turned into a bias on the most significant bits, which is enough to apply Bleichenbacher's attack.
In the following sections, we describe several variants of this general approach, with a particular focus on how these attacks can be carried out against practical implementations of qDSA. We also describe concrete fault attack experiments against a barely modified version of Renes and Smith's 8-bit AVR implementation of qDSA, on the XMEGA128D4 microcontroller of the ChipWhisperer-Lite low-cost side-channel and glitch attack evaluation board [OC14]. Before delving into those details, however, two preliminary remarks are in order.
First, we point out that our attack is rather novel in the sense that it relies on the new and unique structure of the qDSA signature scheme.
• On the one hand, the attack depends in a crucial way on the use of x-only arithmetic.
Indeed, if we perturb a point P given by two coordinates, the resulting faulty point P̃ will end up with overwhelming probability on a completely different curve among many possible choices, and even in a setting where the scalar multiplication by k still makes sense (as in the differential fault attack of Biehl et al. [BMM00]), the information on the curve on which P̃ lies is lost in the signature, which contains only the x-coordinate of R̃ = [k]P̃. This makes our strategy inapplicable to those settings.
• On the other hand, implementations using x-only arithmetic roughly divide into two families. Older, careless ones tend to fall prey to the much simpler twist fault attack of Fouque et al. [FLRV08], in which case our strategy does apply, but is more complex and costly than necessary. Conversely, modern, careful implementations such as X25519 [Ber06] and other Diffie-Hellman implementations based on SafeCurves [BL] usually clear cofactors (see the clamping sketch below): in the description above, this means that the scalar k would be 8 times a uniformly random element of {0, . . ., n − 1}, and hence learning its 3 least significant bits would provide no information. That countermeasure thwarts our attack, even setting aside the fact that a few bits of leakage on Diffie-Hellman keys is much less of a security issue than nonce leakage in Schnorr-like signatures. Interestingly, the authors of qDSA apply that "clamping" technique to their secret keys [RS17, §3.3], but not to the nonces used in signature generation, which lets us carry out our attack.
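For reference, the standard X25519 scalar clamping just mentioned looks as follows (a minimal sketch of the well-known bit manipulation [Ber06], not code from qDSA); clearing the 3 low bits forces the scalar to be a multiple of the cofactor 8, annihilating any small-order component of a tampered base point:

```python
def clamp(scalar_bytes: bytes) -> int:
    """Standard X25519 clamping of a 32-byte little-endian scalar."""
    b = bytearray(scalar_bytes)
    b[0] &= 0b11111000      # clear the 3 LSBs: scalar = 0 (mod 8)
    b[31] &= 0b01111111     # clear the top bit
    b[31] |= 0b01000000     # set bit 254, fixing the scalar's bit-length
    return int.from_bytes(b, "little")
```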
Incidentally, the first point also explains why our attack applies to the genus 1 instantiation of qDSA (using Curve25519), but does not readily extend to the genus 2 instantiation (using the Gaudry-Schost Kummer surface). Indeed, the base point on the Kummer surface is represented by two coordinates, and injecting a fault will typically yield a point outside the surface, which prevents the attack for the same reason.
A second issue that should perhaps be stressed is that one can certainly consider much simpler fault attacks than our own on an unprotected implementation of qDSA: it is both easier and more effective to directly perturb the generation of the nonce k. For example, that generation typically ends with what essentially amounts to a copy of the final value into the array containing k (in the public qDSA implementations, this is done in the group_scalar_get64 function). That array copy is a loop, and exiting the loop early results in a nonce with most of its bits equal to zero. It is then possible to recover the full secret key with as few as two signatures generated with those highly biased nonces, using e.g. the lattice attack of Howgrave-Graham and Smart [HGS01]. Note that this applies regardless of whether nonces are generated deterministically as in qDSA or probabilistically as in ECDSA.
However, the sensitivity of the nonce in Schnorr-like signatures is very well known, and one therefore expects a serious implementation that may be exposed to fault attacks to take appropriate countermeasures to protect it (such as using double loop counters in the final array copy to check that the copy has completed successfully). On the contrary, our attack strategy is novel, and targets a part of the scalar multiplication that does not normally lead to serious attacks, as discussed above. It is thus much more likely to be left unprotected in real-world settings. We therefore think that pointing out the corresponding threat is important, especially as qDSA is a scheme geared towards embedded devices (the target platforms of the accompanying implementations are AVR ATmega and ARM Cortex M0 microcontrollers [RS17, §7]).

Random Semi-Permanent Fault on the Base Point
Turning now to our attacks, we first describe a simple fault attack in a model that closely follows the strategy sketched at the beginning of this section.
Attack model. We suppose that the fault attacker is able to modify the base point P (represented by its x-coordinate on the quotient E_{A,B}/⟨±1⟩ ≅ P^1) to a "somewhat random" faulty point P̃, and then obtain several signatures computed with that faulty base point. We do not assume that the attacker knows the faulty point P̃ once the fault is injected.
Realization of the model. Such a model can easily be realized in implementations where the representation of the base point is first loaded into memory (say at device startup) and then used directly whenever scalar multiplications are computed. This is a relatively common implementation pattern for embedded implementations of elliptic curve cryptography (for example, the micro-ecc library [Mac] works that way). It is then possible to induce a faulty base point either with faults on program flow at first load time (using e.g. clock or voltage glitches) so that some part of the corresponding array remains uninitialized/random, or with faults on memory (using e.g. optical attacks [SA02]) so as to change some bit patterns within the existing array for P.
We note however that the model is more difficult to realize against the microcontroller implementations described in the original qDSA paper [RS17], due to the fact that the base point is recomputed before each signature generation. It may be possible to achieve a similar effect as above by carrying out a fault attack on program memory, so that e.g. the instruction that writes the byte 0x09 into the lowest-order byte of the base point is modified to write another byte instead (the same every time), but this presumably requires a significantly higher level of precision in the targeting of laser beams or x-rays.
Description of the fault attack. Suppose for simplicity that the fault attack yields a faulty base point P̃ whose x-coordinate x̃ is uniformly random in F_p (we will see later on that the attack also works for values x̃ that are not anywhere close to uniform).
In that case, we first observe that with probability close to 1/2, x̃ is the abscissa of an actual point on the curve E_{A,B}, and it is otherwise the abscissa of a point on the quadratic twist of E_{A,B}. More precisely, excluding x̃ = 0 (which corresponds to the point of order 2 both on the curve and its twist), the first case happens with probability exactly (4n − 1)/p and the second one with probability (p − 4n)/p, both of which are very close to 1/2. From a signature generated with this faulty P̃, it is easy to distinguish between the two cases, since we get the x-coordinate of R̃ = [k]P̃, which will correspond to a point on E_{A,B} when P̃ itself is on the curve, and on the twist when P̃ itself is on the twist.
If we get a point on the twist, we reject it by injecting another fault on the base point (restarting the device if necessary), because the smaller cofactor of the twist (i.e. cofactor 4) would result in a less efficient attack. We also reject faulty base points P̃ that yield a value R̃ of order at most 8 in the signature (in which case P̃ itself must have been of order at most 8, since k < n); such exceptional points occur only with negligible probability anyway.
After this rejection, we know that P̃ is on E_{A,B}, and has order 8n, 4n, 2n or n; its abscissa x̃ is uniformly distributed among the 4n − 4 values in F_p corresponding to such points. Moreover, 2n − 2 of these values correspond to points of exact order 8n. Therefore, with probability 1/2, P̃ is of exact order 8n, and again, it is easy to check this from generated signatures: simply compute [4n](±R̃) = ±[4nk]P̃. If P̃ is of order less than 8n, this is always the point at infinity, whereas if it has order exactly 8n, this is the non-identity point of order 2 whenever k < n is odd.
We can thus carry out another rejection step by generating e.g. 4 signatures with the faulty base point P̃, and injecting another fault if [4n](±R̃) is the point at infinity for all of these signatures. This always rejects points of order at most 4n, and also rejects points of order 8n with probability 2^−4.
Overall, after M fault injections on average, where

M = (1 − 2^−4)^−1 · p/(2n − 2) ≈ 4.3,    (2)

we obtain a faulty base point P̃ of exact order 8n.
Once such a point P̃ is obtained, we claim that we can easily learn the 3 least significant bits of k for a constant fraction of the signatures generated with it.
Indeed, for each such signature, we can compute, up to sign, the point

±[n]R̃ = ±[nk]P̃,

which has order dividing 8. If it is the point at infinity or the point of exact order 2, both of which are equal to their own inverses, we can directly deduce that k ≡ 0 (mod 8) and k ≡ 4 (mod 8) respectively. In other words, if [n]R̃ is the point at infinity, we get LSB_3(k) = 000, and if [n]R̃ is the point of order 2, then LSB_3(k) = 100. However, the points of exact order 4 and 8 are not invariant under [±1], so if [n]R̃ is such a point, we cannot hope to learn 3 full bits of k; for example, if [n]R̃ is of order 4, we only obtain k ≡ 2 or 6 (mod 8), and it is not possible to distinguish between the two cases since we only get R̃ up to sign. To obtain many signatures for which the 3 least significant bits of k are known, it then suffices to generate signatures with the faulty base point P̃ and only keep those for which the point [n]R̃ above is either the point at infinity or the point of order 2. This is the case whenever k is divisible by 4; thus, we keep a quarter of the generated signatures, as summarized in the sketch below.
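The case analysis can be condensed into a few lines (our illustrative helper, not from the paper); it assumes the attacker can determine the order of ±[n]R̃, which is well defined up to sign:

```python
def lsb3_from_order(order_nR: int):
    """Nonce LSBs learned from the order of +/-[n]R~ (a divisor of 8).
    Orders 4 and 8 are ambiguous up to sign, so those signatures are dropped."""
    return {1: 0b000,                 # point at infinity: k = 0 (mod 8)
            2: 0b100}.get(order_nR)   # point of order 2:  k = 4 (mod 8); else None
```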
Once sufficiently many signatures have been collected, they can be used to carry out Bleichenbacher's attack as described in the following sections (see in particular Section 6.2 for concrete numbers of signatures, attack timings and memory consumption). A trivial but important point to note is that known LSBs by themselves do not translate into a significant bias in the sense used in Bleichenbacher's attack (i.e. a large value for the bias function defined in Section 4.1). To achieve a large bias, we first need to apply an affine transformation on signatures that maps the partially known nonces k to values with their MSBs equal to zero (in this case, the 3 MSBs, since we have knowledge of 3 bits of k). This simple but crucial preprocessing step is described in Section 3.3 below.
Attack with a non-uniform faulty point. We have described the attack in the case when the fault injection yields a point P̃ with uniformly random abscissa x̃ in F_p. However, uniformity is far from crucial. The only important condition is that the fault should result, with significant probability, in a point P̃ of exact order 8n.
Heuristically, this is expected to happen for essentially any "naturally occurring" subset of F_p of size much larger than 8. For example, consider the "fault on program memory" scenario alluded to above, in which the attacker is able to replace the correct base point of abscissa x = 9 by another base point P̃ whose abscissa x̃ is a random integer still contained in a single byte (i.e. uniform between 0 and 255). The distribution is then very far from uniform in F_p, but one can easily check that 119 such values x̃ correspond to a point on E_{A,B} (and not its twist) with order at least n, and among them, 65 correspond to a point of order exactly 8n. This means that the same attack can be carried out as above in that setting. The only change is the expected number of faults to inject, which instead of the estimate of Eq. (2) is slightly reduced to

M′ = (1 − 2^−4)^−1 · 256/65 ≈ 4.2.

It is a bit difficult to justify the heuristic above in a rigorous way, but arithmetic techniques can be used to prove partial results in that direction. It follows from the character sum estimates of Kohel and Shparlinski [KS00] that if x̃ is picked uniformly at random in an interval of length greater than p^{1/2+ε}, then it corresponds to a point on E_{A,B} of exact order 8n with probability 1/4 + O(p^{−ε}). As a result, a fault attack inducing a value x̃ of that form works identically to the one where x̃ is uniform over F_p, and the expected required number of faults is very close to the one given by Eq. (2).

Instruction Skipping Fault on Base Point Initialization
Although the fault model of the previous attack seems quite natural, it is difficult to realize against the implementations of qDSA described in the original paper [RS17], due to the fact that the representation of the base point P of Curve25519 is not stored in memory in a permanent way, but reconstructed every time a signature computation is carried out.
We now describe a fault attack that can easily be realized in practice on a very slightly modified version of the AVR ATmega implementation of qDSA distributed by the authors of the original paper.We also argue that the corresponding slight modification is plausible enough, and we mount the attack in practice on an XMEGA128 target using the ChipWhisperer-Lite low-cost side-channel and glitch attack evaluation board [OC14].
Attack model. The attack model is quite simple: the attacker injects a suitably synchronized fault upon signature generation that causes the reconstructed base point P̃ to be incorrectly computed. The abscissa is set to x̃ = 1 every time instead of the correct x = 9, and the signature is generated using the corresponding faulty base point P̃. Note that this point P̃ is of exact order 4.
In addition, we also assume that the attacker obtains a side-channel trace of the faulty execution of the signing algorithm. In that sense, the attack we will describe is a so-called combined attack, which uses both faults and side channels. We note however that this is not particularly restrictive: the synchronization of fault injection is typically carried out by waveform matching of side-channel traces anyway, so using the collected traces for additional purposes does not really strengthen the attack model.
Realization of the model. The entry point for the Montgomery ladder implementation used in the qDSA source code is the ladder_base function reproduced in Fig. 1(a). Its main goal is to initialize the base point P and then call the ladder proper. More precisely, P is represented by its image in E_{A,B}/⟨±1⟩ ≅ P^1, with projective coordinates (X : Z) = (9 : 1). To set P as such, the code first sets the X component (given by an array of 32 bytes) to 0 using the fe25519_setzero function, then the Z component to 1 with fe25519_setone, and finally modifies the least significant byte of X to 9.
The idea of our attack is to use glitches to skip the execution of that last step. On a platform like 8-bit AVR microcontrollers, this is relatively straightforward using clock glitches.
Doing so on the unmodified code of Fig. 1(a), however, results in a faulty base point P̃ which maps to (0 : 1) on P^1: this is the point of exact order 2 on E_{A,B}, instead of a point of order 4 as desired. We can still obtain nonce leakage using that faulty base point, but only on a single bit of the nonce. That leakage is not quite sufficient to deduce a practical attack.
Suppose however that the code were written as in Fig. 1(b). The only change is that the X component of the base point is first set using fe25519_setone instead of fe25519_setzero. Of course, when executed correctly, the modified code is functionally equivalent to the original one. However, skipping the instruction that changes the lowest order byte of X now results in a faulty base point P̃ which maps to (1 : 1) on P^1: this is a point of order 4, as required.
That change might seem artificial, but there are plausible reasons why one might want to make it in practice. Most importantly, the function fe25519_setzero is almost never used elsewhere in the qDSA library code (there is exactly one other occurrence of it). Since reducing code size is a major concern for embedded implementations, removing that rarely used function and replacing its two uses by fe25519_setone (and adapting the code accordingly) makes sense. When compiling with avr-gcc 4.8.2, the change results in a code size reduction of 33 bytes, which can certainly justify such a change when program memory is at a premium.

Description of the combined attack.
Since we are able to obtain signatures generated with the faulty base point P̃ of order 4, the attack proceeds mostly as before. According to the description of qDSA, signatures will then contain ±R̃ = ±[k]P̃, which is of order 4 when k is odd, of order 2 when k ≡ 2 (mod 4), and the point at infinity when k ≡ 0 (mod 4). In particular, we get LSB_2(k) = 10 when ±R̃ is of order 2, and LSB_2(k) = 00 when it is the point at infinity. This thus yields 2 LSBs of leakage on the nonce k whenever k is even (i.e. for half of the generated signatures). After collecting sufficiently many such signatures and applying the affine transformation of Section 3.3 to obtain biased MSBs, we can then apply Bleichenbacher's attack. Concrete parameters, timings and memory consumption are provided in Section 6.1.
That simple description omits an important implementation detail that slightly complicates the attack, however. Namely, the point ±R̃ ∈ P^1 in signatures is represented in "affine coordinates" by a single element x_R of F_p, and the point at infinity does not really have a well-defined representation in those terms. This is not an issue for correct executions of the qDSA algorithm, since the point at infinity occurs with negligible probability; however, it is crucial in our specific attack setting. We therefore need to examine how x_R is computed from the projective representation (X_R : Z_R) output by the Montgomery ladder.
In the qDSA implementation, x_R is computed by first inverting Z_R using Fermat's little theorem, and then multiplying the result by X_R. In other words, the code computes:

x_R = X_R · Z_R^{p−2} mod p.

In the case of our faulty point ±R̃, we have (X_R : Z_R) = (0 : L_k) when k ≡ 2 (mod 4), and (X_R : Z_R) = (L_k : 0) when k ≡ 0 (mod 4), where in both cases L_k ∈ F_p is a large, typically full-size value depending on k. In both cases, we therefore get x_R = 0, and as a result, it is not possible to distinguish between the two cases just from the value included in the signature. However, from an implementation perspective again, there is a clear difference between the two cases. When k ≡ 0 (mod 4), the value Z_R for which the device computes the base field exponentiation Z_R^{p−2} is 0, whereas in the other case, it is a large, random-looking element L_k of F_p. This difference should translate into a marked difference in power consumption and other side-channel emanations during the computation of this exponentiation! Using side-channel leakage in addition to the fault, we are therefore able to distinguish between the two cases, and carry out the attack as expected.
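The collapse to x_R = 0 in both cases is easy to check numerically; the following lines (our illustration, with an arbitrary stand-in value for L_k) mimic the implementation's Fermat-inversion conversion:

```python
p = 2**255 - 19

def to_affine(X, Z):
    # Affine conversion as in the implementation: x_R = X * Z^(p-2) mod p.
    # Fermat inversion silently maps Z = 0 to "inverse" 0 instead of failing.
    return X * pow(Z, p - 2, p) % p

L_k = 123456789                      # stand-in for a large value depending on k
assert to_affine(0, L_k) == 0        # k = 2 (mod 4): point of order 2 (x = 0)
assert to_affine(L_k, 0) == 0        # k = 0 (mod 4): point at infinity
```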
Concrete glitch attack experiments. We successfully carried out the attack above on the implementation of qDSA for the 8-bit AVR microcontroller platform [Ren17b], with the tweak of Fig. 1(b). The cryptographic code was otherwise left entirely untouched, except for the insertion of a trigger in the signing algorithm (before the call to the ladder_base function) in order to facilitate the synchronization of injected faults. That synchronization should be doable directly in hardware from the acquired waveform (using e.g. oscilloscope SAD triggers) when using a more costly setup, but a manual software trigger comes in handy in our low-cost setting. Note that the qDSA implementation itself does not claim security against faults or physical attacks in general; however, conducting the attack on a real-world target allows us to confirm the validity of the fault model.
The attack was conducted on the ChipWhisperer-Lite side-channel and glitch attack evaluation board [OC14], which comes with an AVR XMEGA128D4 microcontroller target (Fig. 2). In order to use the accompanying software, we wrapped the qDSA code into a program running on the XMEGA target that can sign messages using the SimpleSerial serial console protocol supported by ChipWhisperer-Capture. The program supports several single-character serial commands (followed by hexadecimal arguments), including in particular:

• k ⟨32-byte hex string⟩: generate a fresh key pair with the provided seed;
• p ⟨16-byte hex string⟩: sign the provided message and return the first 32 bytes of the signature (the rest of the 80-byte signature can be displayed with additional commands if necessary).

In a session transcript, the lines asking for signatures on the provided messages are answered by replies starting with r, which give the first 32 bytes of the computed signature (corresponding to the abscissa x_R); lines starting with z signal the end of the response (and 50 indicates that entire signatures are 0x50 = 80 bytes long).
We then use the glitch module of ChipWhisperer-Capture to generate clock glitches at selected positions during the execution of the program. After some trial and error, we find that XORed-in rectangular clock glitches of width equal to 5% of a clock period, inserted at an offset of 2.5% into the corresponding clock cycles, cause reliably reproducible misbehavior of the microcontroller. We then increment the position at which the glitch is inserted (as an offset from the trigger located right before the call to ladder_base in the signing algorithm), and observe the results on the serial console. At offset 202 clock cycles, we finally observe the required fault, namely the all-zero response

r0000000000000000000000000000000000000000000000000000000000000000

which we can confirm corresponds to skipping the assignment on step 9 of Fig. 1(b). The power trace corresponding to the first few hundred cycles after the trigger is reproduced in Fig. 3, both for the correct execution and for the faulty one. One can clearly see a spike on the faulty trace when the glitch is injected, and how the skipped instruction results in a shift to the left of the faulty execution's trace compared to the correct one after that point. The fault is very reliably reproducible: in several hundred attempts at injecting the glitch, the assignment instruction was skipped 100% of the time, resulting in the same all-zero response as expected.
To finish validating the combined attack, we then check that it is indeed easy to distinguish between the case when R̃ is of order 2 on the curve and when it is the point at infinity. To do so, we plot the corresponding power traces at a later point during the execution of the program, within the base field exponentiation used to compute the modular inverse of the coordinate Z_R.
Fig. 4 shows two sets of several traces corresponding to faulty signatures of random messages. The traces in blue all correspond to the case when R̃ is of order 2, and the traces in red to the case when it is the point at infinity. It is visually clear that the two sets of traces are easy to distinguish from each other, and that one can construct a very accurate distinguisher even from a small number of samples around that part of the execution.

Preprocessing Signatures for Bleichenbacher's Attack
Both of the attacks described above allow us to obtain multiple qDSA signatures for which a few LSBs of the nonces k are known. We would like to use those signatures with partial nonce exposure to retrieve the secret key d.
This problem can be seen as an instance of Boneh and Venkatesan's hidden number problem (HNP) [BV96]: given sufficiently many equations of the form (1), in which the pair (h, s) is known and partial information on k is also given, recover d. Note that in our setting, the pair (h, s) is indeed known, since s is directly part of the signature, and h can be recomputed as h = H(x_R ‖ x_Q ‖ M) (where x_R is again part of the signature; in particular, it is the faulty abscissa x_{R̃} in the case of faulty signatures).
The HNP algorithm used in this paper is essentially due to Bleichenbacher, and relies on a search for heavy Fourier coefficients. However, those heavy Fourier coefficients only reveal the secret key in an HNP instance where the most significant bits of the nonces k are constant (say identically zero). Thus, our instance with known LSBs of k needs to be preprocessed in order to be amenable to Bleichenbacher's attack. This preprocessing stage, which is folklore, proceeds as follows.
Suppose that in our setting, the b least significant bits of the nonces are known, i.e. r := k mod 2^b is known for each nonce k. Subtracting r from Eq. (1) and dividing by 2^b, we get

k′ ≡ s′ + h′·d (mod n), where k′ := (k − r)/2^b, s′ := (s − r)·2^{−b}, and h′ := h·2^{−b},

all computations being carried out in Z/nZ. In this rewritten equation, MSB_b(k′) is the all-zero bit string and k′ is uniformly distributed on [0, (n − 1)/2^b]. Hence, we get an equation of the correct form to apply Bleichenbacher's attack. To simplify the discussion in subsequent sections, we discard the signatures with 2^252 ≤ h′ < n; such an exceptional case occurs only with negligible probability anyway.
In the rest of this paper, we assume that S signatures are generated with either of the fault attacks, and preprocessed as above by the attacker. For simplicity, we omit the prime symbols and refer to {(h_i, s_i)}_{i=0}^{S−1} as the set of preprocessed signatures, and {k_i}_{i=0}^{S−1} as the biased nonces, satisfying k_i ≡ s_i + h_i·d (mod n) with MSB_b(k_i) = 0.
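Concretely, the preprocessing is a one-liner per signature; the sketch below (ours) assumes pairs (h, s) satisfying Eq. (1) together with the known LSBs r = k mod 2^b, and requires Python 3.8+ for the modular inverse via pow:

```python
def preprocess(h, s, r, b, n):
    """Turn known nonce LSBs into zero MSBs: returns (h', s') such that
    k' = s' + h'*d (mod n) with MSB_b(k') = 0 (Section 3.3)."""
    inv2b = pow(2, -b, n)            # 2^(-b) mod n (Python 3.8+)
    return h * inv2b % n, (s - r) * inv2b % n
```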

Possible Countermeasures
Before turning to the description of Bleichenbacher's attack and of our optimizations thereof, we first mention a few countermeasures that can be applied to qDSA implementations in order to thwart the attacks of this section.
Since our attacks all target the base point in the Montgomery ladder computation, using generic techniques to protect that value should prevent the attack. Concrete ways of doing so include:

• carrying out consistency checks of proper execution when copying the value into memory (e.g. double loop counters);
• writing the value twice if it is reconstructed every time, so that a single instruction skip fault cannot corrupt it;
• computing a CRC checksum of the base point and checking that it gives the expected result before releasing a generated signature, etc.
Rather than these generic countermeasures, however, one could instead recommend slightly modifying the signing algorithm in a way that completely prevents attacks based on the existence of points of small order. Namely, instead of carrying out scalar multiplication by the nonce k, use 8k (or, if using a curve E other than Curve25519, use α·k, where α is the least common multiple of the cofactors of E and its twist), and adjust the verification algorithm accordingly. This ensures that, even if the base point is tampered with somehow, the adversary will not be able to map the result of the scalar multiplication to a non-identity element of a subgroup of small order. This thwarts the attacks of this section in particular.

Bleichenbacher's Nonce Attack
In this part, we recall Bleichenbacher's attack method. We also formulate the conditions required for the range reduction phase, which is by far the most costly phase of the attack. Note that Bleichenbacher's attack applies in principle to any Schnorr-like signatures with arbitrarily biased nonces, including (EC)DSA [Gal13], EdDSA [BDL+12], and ElGamal [ElG85], as long as they provide publicly available pairs (h, s) such that Eq. (1) holds.
Algorithm 2 specifies the high-level procedure of the attack. A step-by-step guide is provided in the following subsections.

Input:
{(h_i, s_i)}_{i=0}^{S−1} – the set of preprocessed Schnorr-like signatures with b-bit biased nonces
S – number of input signatures
L – number of linear combinations to be found
Output: ℓ most significant bits of d
1: Range Reduction
2: Find L = 2^ℓ reduced signatures {(h′_j, s′_j)}_{j=0}^{L−1}, where (h′_j, s′_j) = (Σ_i ω_{j,i} h_i, Σ_i ω_{j,i} s_i) is a pair of linear combinations with coefficients ω_{j,i} ∈ {−1, 0, 1}, such that 0 ≤ h′_j < L, while keeping the combinations sparse enough that the bias of the corresponding nonces does not vanish
3: Bias Computation
4: Compute the sampled biases B_n(K_{w_m}) of the candidate nonce sets K_{w_m} = {s′_j + h′_j w_m}_{j=0}^{L−1} for all L candidates w_m = mn/L at once using the inverse FFT, and output the ℓ MSBs of the candidate w_m maximizing |B_n(K_{w_m})|

Bias Definition and Properties
We first formalize the bias of random variables in the form of a discrete Fourier transform. Let us recall the definition of the bias presented in [Ble00] and its basic properties.

Definition 1. Let X be a random variable over Z/nZ. The bias B_n(X) is defined as

B_n(X) = E(e^{2πiX/n}),

where E(·) represents the mean. Likewise, the sampled bias of a set of points K = {k_j}_{j=0}^{L−1} in Z/nZ is defined by

B_n(K) = (1/L) Σ_{j=0}^{L−1} e^{2πik_j/n}.

The bias as defined above satisfies the following properties. See [DMHMP14] for the proofs.

Lemma 1. Let X and Y be random variables over Z/nZ.

(a) If X follows the uniform distribution over the interval [0, n − 1], then B_n(X) = 0.
(b) If X and Y are independent, then B_n(X + Y) = B_n(X)·B_n(Y).
(c) B_n(−X) equals the complex conjugate of B_n(X).
(d) If X follows the uniform distribution over the interval [0, T − 1] for some integer 1 ≤ T ≤ n, then |B_n(X)| = sin(πT/n) / (T sin(π/n)).

The following claim is useful for approximating the bias value when the nonces are b-bit biased.

Corollary 1. Let K be a random variable. If K follows the uniform distribution over the integer interval [0, (n − 1)/2^b] ∩ Z for some positive integer b, then the bias value satisfies |B_n(K)| ≈ (2^b/π)·sin(π/2^b). Indeed, setting T = ⌊(n − 1)/2^b⌋ + 1 ≈ n/2^b, we obtain the following by applying Lemma 1-(d):

|B_n(K)| = sin(πT/n) / (T sin(π/n)) ≈ sin(π/2^b) / ((n/2^b)·(π/n)) = (2^b/π)·sin(π/2^b).

In particular, |B_n(K)| ≈ 0.90 for b = 2 and |B_n(K)| ≈ 0.97 for b = 3.
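As a sanity check of Definition 1 and Corollary 1, the sampled bias is straightforward to compute numerically (our sketch):

```python
import cmath, random

def sampled_bias(K, n):
    # B_n(K) = (1/|K|) * sum_j e^(2*pi*i*k_j/n)  (Definition 1)
    return sum(cmath.exp(2j * cmath.pi * k / n) for k in K) / len(K)

n, b = 2**252 + 27742317777372353535851937790883648493, 3
K = [random.randrange(n >> b) for _ in range(10000)]   # 3-bit biased nonces
print(abs(sampled_bias(K, n)))   # ~ (2^b/pi)*sin(pi/2^b) = 0.974 (Corollary 1)
```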

Range Reduction
The main idea of Bleichenbacher's attack is finding a secret key candidate that leads to the peak bias value: given a set of preprocessed pairs {(h_i, s_i)}_{i=0}^{S−1} with biased nonces, we would like to find the candidate w ∈ Z/nZ such that its corresponding set of nonce candidates K_w := {s_i + h_i w}_{i=0}^{S−1} shows a significant nonzero sampled bias. If w is equal to the true secret, i.e., w = d, we obtain the set of genuine biased nonces K = {k_i}_{i=0}^{S−1}, and its sampled bias |B_n(K)| is close to 1, which we call the peak; if the guess is wrong, i.e., w ≠ d, the sampled bias can be approximated by 1/√S, which we call noise. Since Schnorr-like signatures allow anyone to compute a pair (h, s) satisfying Eq. (1), we thus have a way to determine the secret value d by evaluating |B_n(K_w)| for all w ∈ Z/nZ in a brute-force way.
Of course, such an exhaustive search is intractable for cryptographically large n, and this is what the range reduction phase addresses: it constructs new pairs with much smaller h values, which broadens the bias peak to a width of roughly n/L, so that the peak can be detected on a grid of only L candidate values. Concretely, consider a reduced pair

(h′, s′) = (Σ_i ω_i h_i, Σ_i ω_i s_i),

where ω_{j,i} ∈ {−1, 0, 1} (we omit the index j for simplicity). Then its corresponding nonce becomes k′ = Σ_i ω_i k_i. Let K_i be a random variable (which corresponds to a nonce k_i) uniformly distributed on the interval [0, (n − 1)/2^b], and let us assume that K_{i_1} and K_{i_2} are independent if i_1 ≠ i_2. Then, applying (b) and (c) of Lemma 1,

B_n(Σ_i ω_i K_i) = Π_i B_n(ω_i K_i),

where each factor is B_n(K_i) if ω_i = 1 and its complex conjugate if ω_i = −1. Hence, taking absolute values, we obtain

|B_n(K′)| = |B_n(K_i)|^Ω,

where Ω := Σ_i |ω_i|. This means that the height of the peak diminishes as the sum of coefficients Ω of the linear combination increases. Since the noise is approximately 1/√L and the peak value needs to serve as a distinguisher, we obtain the condition

|B_n(K_i)|^Ω ≫ 1/√L    [C2]

for the peak not to vanish.
In summary, finding small and sparse linear combinations for sufficiently small L (i.e., small enough for the FFT to be tractable) is the key to performing Bleichenbacher's attack efficiently. Let us briefly review the previous range reduction algorithms.

Sort-and-difference
We present the sort-and-difference algorithm of [AFG+14] as the most straightforward instance of a range reduction algorithm. It simply works as shown below (see also the sketch that follows):

1. Sort the list {(h_i, s_i)}_{i=0}^{S−1} in ascending order by the h_i values.
2. Take the successive differences to create a new list {(h_{i+1} − h_i, s_{i+1} − s_i)}_{i=0}^{S−2}.
3. Repeat.
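One round of this procedure takes just a few lines (our sketch); each output pair has nonce k_{i+1} − k_i, i.e. a ±1-coefficient linear combination, and its h-part is roughly log_2(S) bits smaller than before:

```python
def sort_and_difference(sigs, n):
    """One round of [AFG+14]'s range reduction on (h, s) pairs."""
    sigs = sorted(sigs)                       # sort by h (first component)
    return [((h2 - h1) % n, (s2 - s1) % n)
            for (h1, s1), (h2, s2) in zip(sigs, sigs[1:])]
```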
With this approach, they successfully performed a key recovery attack against ECDSA on a 160-bit curve with 1-bit nonce bias. As a theoretical contribution, they analytically proved that approximately (1 − e^{−2^γ})·S signatures are obtained such that h′_j < 2^{log n − log S + γ} after the first application of sort-and-difference, where γ ∈ Z is a parameter. However, because the h values are no longer uniformly and independently distributed after the first round, their experimental results showed that the ratio (1 − e^{−2^γ}) does not hold from the second iteration onwards, and the actual ratio drops as the algorithm iterates, i.e., the number of reduced signatures such that h′_j < 2^{log n − ι(log S − γ)} after ι rounds is less than (1 − e^{−2^γ})^ι·S.
As a consequence, sort-and-difference required S = 2^33 input signatures to satisfy [C1] and [C2] in their attack setting. Their implementation consumed nearly 1 TB of RAM, and therefore attacking groups of larger order with small nonce biases was thought to be out of reach due to the huge memory consumption.

Lattice Reduction
De Mulder et al. in [DMHMP14] proposed to use lattice reduction to carry out the range reduction. They used the BKZ algorithm applied to lattices of dimension around 128 to mount Bleichenbacher's attack against 384-bit ECDSA with 5-bit nonce bias, using a total of about 4000 signatures as input.
The idea of using lattice reduction for range reduction may seem quite natural indeed: after all, range reduction is about finding very short and sparse linear combinations from a large list {h_i}_{i=0}^{S−1} of integers, which seems closely related to the problem of finding very short vectors in the lattice generated by the rows of the following matrix:

[ κ              h_0     ]
[    κ           h_1     ]
[       ⋱        ⋮       ]
[          κ     h_{S−1} ]
[                n       ]

for a suitable scaling constant κ. Indeed, any vector in that lattice is of the form (κω_0, . . ., κω_{S−1}, Σ_i ω_i h_i mod n), and it is thus short when all the ω_i's have a small absolute value and the linear combination Σ_i ω_i h_i is also short. However, two problems arise when trying to apply that approach to more demanding parameters than the ones considered by De Mulder et al., particularly when the bias is significantly smaller.
First, the conditions above do not really capture the sparsity of the linear combinations, which is of paramount importance for small biases, since the bias function decreases exponentially with the number of nonzero coefficients. To get acceptably sparse linear combinations, one is led to start with a lattice of small dimension, constructed from a random subset of the h_i's of size at most equal to the desired weight of the linear combination. This in turn makes short vectors in that lattice no longer very short.
Second, although the coefficients ω_i tend to be relatively small, they are not constrained to lie in {−1, 0, 1} as in the previous description, and as a result it is no longer true that the bias of linear combinations is given by |B_n(K)|^Ω, with Ω = Σ_i |ω_i|, when the original nonces have a b-bit bias. In fact, the bias can be computed explicitly, and it is smaller than this value in general. In particular, if one of the ω_i's is a multiple of 2^b, it is easy to check that the bias becomes exponentially small. Since for small b it is not usually feasible to avoid the appearance of such a coefficient, the linear combinations given by lattice reduction are typically not useful.

Bias Computation
Now let w_m = mn/L, with m ∈ [0, L − 1], be the L evenly-spaced secret key candidates in [0, n − 1] and K_{w_m} := {s′_j + h′_j w_m}_{j=0}^{L−1} be the corresponding sets of candidate nonces. Assuming that L reduced signatures with 0 ≤ h′_j < L have been obtained by a range reduction phase, the sampled bias is

B_n(K_{w_m}) = (1/L) Σ_{j=0}^{L−1} e^{2πi(s′_j + h′_j w_m)/n} = (1/L) Σ_{t=0}^{L−1} Z_t e^{2πitm/L}, where Z_t := Σ_{j : h′_j = t} e^{2πis′_j/n}.

Thus, by constructing the vector Z := (Z_0, . . ., Z_{L−1}), the sampled biases B_n(K_{w_m}) for all m ∈ [0, L − 1] can be computed at once using the inverse Fast Fourier Transform (iFFT), as sketched below. Note that the iFFT only takes Õ(L) time and O(L) space. Finally, recalling that the peak width has now broadened to n/L via range reduction, the algorithm picks the candidate w_m that leads to the largest sampled bias, so we can expect that w_m shares its ℓ = log_2 L most significant bits with the secret d.
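The bias computation step then amounts to a single inverse FFT; here is a compact sketch (ours) using numpy, assuming the reduced pairs already satisfy 0 ≤ h′ < L:

```python
import numpy as np

def bias_peak(reduced_sigs, n, L):
    """Evaluate B_n(K_{w_m}) for all w_m = m*n/L at once and return the best
    candidate for the top log2(L) bits of the secret d."""
    Z = np.zeros(L, dtype=complex)
    for h, s in reduced_sigs:                  # requires 0 <= h < L
        Z[h] += np.exp(2j * np.pi * (s / n))
    B = np.fft.ifft(Z)                         # B[m] = (1/L) sum_t Z_t e^(2*pi*i*t*m/L)
    m = int(np.argmax(np.abs(B)))
    return m * n // L                          # w_m shares its MSBs with d
```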

Recovering Remaining Bits
Assume the ℓ MSBs of d have been recovered as above, and write d = 2^{λ−ℓ}·d_Hi + d_Lo, where λ is the bit-length of n. Then Eq. (1) can be rewritten as follows:

k ≡ s + h·d_Hi·2^{λ−ℓ} + h·d_Lo (mod n).

Hence, defining s′_i := s_i + h_i·d_Hi·2^{λ−ℓ}, Algorithm 2 can proceed with the attack to recover the ℓ MSBs of d_Lo, except that this time the FFT table is constructed in the following way: let n′ := 2^{λ−ℓ} be the upper bound on d_Lo and w_m = mn′/L be the L evenly-spaced candidates in [0, n′ − 1]; the sampled bias B_n(K_{w_m}) is then computed as before with respect to these candidates. As such, we only need to reduce the h values so that 0 ≤ h′_j < Ln/n′ ≈ L^2, which should be much faster than in the first round. By repeating the above operations, we can iteratively recover ℓ bits of the secret key d per round.

Optimization and Parallelization of Bleichenbacher's Attack
As we discussed in the previous section, range reduction is the most costly phase in Bleichenbacher's attack framework, and the previous approaches to it are basically memory-bound. In this section, we present our approach to range reduction, which overcomes this memory barrier while maintaining a practical level of efficiency in terms of time complexity.

Our Approach: Using Schroeppel-Shamir Algorithm
We begin with an intuitive discussion of the nature of the problem of finding small and sparse linear combinations (which we call the range reduction problem for convenience). Interestingly, Bleichenbacher mentioned in [Ble00] that using the Schroeppel–Shamir algorithm, originally proposed as a knapsack problem solver in [SS81], would save memory in the range reduction phase, though no concrete evaluation of this idea had been made until now. Let us develop his idea more concretely. The range reduction problem can indeed be regarded as a variant of the knapsack problem (as defined in Section 2.3) in a broad sense; instead of searching for exact knapsack solutions, we would like to find sufficiently many sparse patterns of coefficients that lead to linear combinations smaller than a certain threshold value. With this in mind, we can transform the Schroeppel–Shamir knapsack solver into a range reduction algorithm. However, applying the original Schroeppel–Shamir algorithm introduces large priority queues (or min-heaps) to store partial linear combinations, which are not cache-friendly and moreover make it hard to optimize and parallelize the algorithm in practice. Hence, our approach is specifically inspired by the optimized version due to Howgrave-Graham and Joux, which replaces the priority queues with simple lists. Though their algorithm is intended for solving the knapsack problem, we observe that it happens to have two desirable characteristics in the context of Bleichenbacher's attack: modest space complexity and compatibility with large-scale parallelization. The interested reader is invited to refer to [HGJ10, §3] to become familiar with their approach in the knapsack-specific setting. Fig. 6 and Fig. 7 depict at a high level how the Schroeppel–Shamir algorithm and its variant by Howgrave-Graham and Joux serve as range reduction.
In a nutshell, the range reduction derived from Howgrave-Graham and Joux's algorithm works as follows:

1. Split a set of S = 2^{α+2} input signatures into 4 lists L^{(1)}, R^{(1)}, L^{(2)}, and R^{(2)} of size S/4 = 2^α.
2. For each r ∈ {1, 2}, create the list A^{(r)} consisting of linear combinations of two signatures (with h-parts η^{(r)}) such that the top consecutive (α + 1) bits of η^{(r)} coincide with a certain value c mod 2^α.
3. Sort A^{(1)} and A^{(2)} and search for short differences between elements from the two lists, namely differences that are β bits smaller than the original h values, where β is a parameter.
That is, the algorithm first collects linear combinations of two so that a collision occurs in the top consecutive bits when taking differences; the resulting linear combinations of four are then expected to be much smaller with good probability. We give the concrete procedure of our range reduction in Algorithm 3. Note that it invokes Algorithm 4 as a subroutine, which collects the linear combinations of two whose top consecutive (α + 1) bits coincide with a given value; a simplified sketch of one round follows.
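Here is a one-round sketch in Python (function names are ours; the inner subroutine replaces the paper's merge-style walk with a binary search for brevity, and the real implementation works on uint64_t values in C++, cf. Section 6):

```python
from bisect import bisect_left

def sums_with_top_bits(L_list, R_list, c, tau, alpha):
    """Algorithm 4, simplified: collect eta = h_l + h_r whose top (alpha + 1)
    bits equal c. R_list must be sorted ascending by h."""
    lo = c << (tau - alpha)        # smallest (tau+1)-bit eta with top bits c
    hi = (c + 1) << (tau - alpha)  # exclusive upper bound
    r_vals = [h for h, _ in R_list]
    out = []
    for h_l, s_l in L_list:
        i = bisect_left(r_vals, lo - h_l)
        while i < len(r_vals) and h_l + r_vals[i] < hi:
            out.append((h_l + r_vals[i], s_l + R_list[i][1]))
            i += 1
    return out

def reduce_round(sigs, tau, alpha, beta, n):
    """One round of the range reduction (sketch): keep linear combinations
    of four that are at least beta bits shorter than the tau-bit inputs."""
    q = len(sigs) // 4
    C = 1 << alpha
    L1, R1, L2, R2 = (sigs[i * q:(i + 1) * q] for i in range(4))
    R1, R2 = sorted(R1), sorted(R2)
    bound = 1 << (tau - beta)
    sols = []
    for c in range(C):
        A1 = sorted(sums_with_top_bits(L1, R1, c, tau, alpha)
                    + sums_with_top_bits(L1, R1, c + C, tau, alpha))
        A2 = sorted(sums_with_top_bits(L2, R2, c, tau, alpha)
                    + sums_with_top_bits(L2, R2, c + C, tau, alpha))
        i = j = 0
        while i < len(A1) and j < len(A2):   # scan the two sorted lists
            (e1, s1), (e2, s2) = A1[i], A2[j]
            if abs(e1 - e2) < bound:         # small combination of four
                sols.append((abs(e1 - e2),
                             (s1 - s2) % n if e1 >= e2 else (s2 - s1) % n))
            if e1 > e2:
                j += 1
            else:
                i += 1
    return sols
```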

Analysis
We first show how to choose the parameter β so that the resulting number of reduced signatures remains approximately S and the space usage stays stable in each round; we also evaluate the space and time complexity of Algorithm 3. The inputs of Algorithm 3 are sigs := {(h_i, s_i)}_{i=0}^{S−1}, the set of preprocessed Schnorr-like signatures with biased nonces; λ, the bit-length of the h values (e.g., λ = 252 for qDSA signatures); ι, the number of iterations; and β, the number of bits to be reduced per round. Writing C := 2^α and τ for the current bit-length upper bound of the h values, each round proceeds as follows:

1. Split sigs into 4 lists L^(1), R^(1), L^(2), and R^(2) of size L' = L/4; sort L^(1) and L^(2) in descending order and R^(1) and R^(2) in ascending order by h values, and create an empty list sols.
2. For each c ∈ [0, C − 1]:
(a) Call Algorithm 4 on (L^(1), R^(1), c) and on (L^(1), R^(1), c + C), pushing the results into a list A^(1); likewise call it on (L^(2), R^(2), c) and on (L^(2), R^(2), c + C), pushing the results into a list A^(2).
(b) Sort A^(1) and A^(2) in ascending order by η values and scan them with two indices i and j: while neither A^(1)[i] nor A^(2)[j] is at the end, push the reduced pair (h', s') to sols whenever |η^(1)[i] − η^(2)[j]| < 2^{τ−β}, and increment the index pointing to the smaller η value (i.e., increment j if η^(1)[i] > η^(2)[j], and i otherwise).

Algorithm 4, the subroutine, walks the two sorted input lists and pushes to its output every pair (h', s') such that the value of the top consecutive (α + 1) bits of η equals the given target c, i.e., η_{[τ+1:τ−α+1]} = c.

Theorem 1. Suppose β ≥ (1 + ε) · α for some ε > 0, so that in particular 2^{α−β} = o(1). If the h's are uniformly distributed in the interval [0, 2^λ − 1] (see Footnote 5), then, after the first round of Algorithm 3, the expected cardinality of sols, which we denote by L, satisfies
$$L = \left(\tfrac{4}{3} + o(1)\right)\cdot 2^{4\alpha-\beta}.$$

Proof. We first show that the expected cardinality of A^(1) and A^(2) is C = 2^α after step 2(a). Second, we evaluate the probability that a (τ − β)-bit-bounded linear combination of four, consisting of items in A^(1) and A^(2), can be found. Before the first round, τ = λ is the bit-length upper bound of the h's. Since the h's are uniformly distributed in [0, 2^τ − 1], the values corresponding to their top α bits, i.e., ⌊h/2^{τ−α}⌋, are uniformly distributed in [0, C − 1].
Let η = h_i + h_j be an integer represented as a (τ + 1)-bit string. Then the value corresponding to its top (α + 1) bits is ⌊η/2^{τ−α}⌋ = η_{[τ+1:τ−α+1]} (see Section 2.1 for the definition of the notation). Recalling that the sum of two uniform distributions follows a triangular distribution, η_{[τ+1:τ−α+1]} is triangularly distributed over [0, 2C − 1]. We can make the distribution uniform by considering it modulo C, i.e., by ignoring the top bit η_{τ+1}. We would like to compute the probability Pr[η_{[τ:τ−α+1]} = c] that a linear combination of two lands in the bucket associated with c.

This corresponds to calling Algorithm 4 twice, on c and on c + C: by the uniformity modulo C, Pr[η_{[τ:τ−α+1]} = c] = 1/C. There are L' × L' possible linear combinations of two between L^(r) and R^(r), and since L' = L/4 = S/4 = C when ρ = 1, the expected cardinality of the list A^(r) is |A^(r)| = L'² · Pr[η_{[τ:τ−α+1]} = c] = C. Now let us find the expected number of (τ − β)-bit-bounded linear combinations of four.
For notational simplicity, we will omit the conditioning event η^(r)_{[τ:τ−α+1]} = c in the rest of the proof.
First, we compute the probability that the top bits η^(1)_{τ+1} and η^(2)_{τ+1} coincide. By the triangular distribution above, the conditional probability that the top bit equals 1 is p_c := Pr[η_{τ+1} = 1 | η_{[τ:τ−α+1]} = c] ≈ 1 − c/C, and since the two sides are independent,
$$\Pr\left[\eta^{(1)}_{\tau+1} = \eta^{(2)}_{\tau+1}\right] = p_c^2 + (1 - p_c)^2.$$

Footnote 5: Although the assumption here indeed holds for plain Schnorr signatures, we remark that this is not actually the case for qDSA, since it ensures hash values are even (see [RS17, §2.4]). However, one can trivially make them uniformly distributed over a narrower range by using the filtering technique discussed in Section 5.5.
Second, we compute the probability that the difference is small given that the top bits coincide, i.e., Pr[|η^(1) − η^(2)| < 2^{τ−β} | η^(1)_{τ+1} = η^(2)_{τ+1}]. We can consider three cases for the above, which are visualized in Fig. 8.
Therefore, since the remaining τ − α low-order bits of η^(1) and η^(2) are close to uniform, it can be computed as follows:
$$\Pr\left[\,|\eta^{(1)} - \eta^{(2)}| < 2^{\tau-\beta} \,\middle|\, \eta^{(1)}_{\tau+1} = \eta^{(2)}_{\tau+1}\right] \approx \frac{2\cdot 2^{\tau-\beta}}{2^{\tau-\alpha}} = 2^{\alpha-\beta+1}.$$
Summing up, we obtain the probability κ_c:
$$\kappa_c \approx 2^{\alpha-\beta+1}\cdot\left(p_c^2 + (1-p_c)^2\right).$$
Note in particular that, since the second factor is bounded between 1/2 and 1, we have κ_c = Θ(2^{−(β−α)}) independently of c. Now there are L'⁴/C² = C² possible linear combinations between A^(1) and A^(2); for each c ∈ [0, C − 1], we thus obtain an expected L_c linear combinations of four that are (τ − β)-bit-bounded, where L_c = C² · κ_c. Not all of these linear combinations are necessarily found by Algorithm 3, however: the algorithm can miss such a linear combination when a sum on one side collides with two consecutive sums on the other side. Such a double collision happens with probability O(κ_c²), however, so the expected number L_c^{(found)} of small linear combinations found by the algorithm satisfies L_c^{(found)} = (1 − O(κ_c)) · L_c. As a result, the expected cardinality L of sols is given by L = Σ_{c=0}^{C−1} L_c^{(found)}, where the sum is easy to evaluate: writing x = c/C,
$$L \approx 2^{\alpha-\beta+1}\, C^2 \sum_{c=0}^{C-1} \left(p_c^2 + (1-p_c)^2\right) \approx 2^{\alpha-\beta+1}\, C^3 \int_0^1 \left((1-x)^2 + x^2\right)\mathrm{d}x = \tfrac{2}{3}\cdot 2^{4\alpha-\beta+1}.$$
As a result, we obtain L = (4/3 + o(1)) · 2^{4α−β} as required. □
Now we can directly derive the following claim.

Corollary 2. Set β₀ := 3α − log 3. If β = β₀, then L = (1 + o(1)) · S.

Proof. Indeed, with that choice of β, we have:
$$L = \left(\tfrac{4}{3} + o(1)\right)\cdot 2^{4\alpha-\beta_0} = \left(\tfrac{4}{3} + o(1)\right)\cdot 3\cdot 2^{\alpha} = (1 + o(1))\cdot 2^{\alpha+2} = (1 + o(1))\cdot S. \qquad\square$$

After the first round, the above result no longer holds strictly because the h's are no longer perfectly uniform. However, we empirically confirmed that approximately S reduced signatures are consistently obtained in practice when β is sufficiently close to β₀. We first generated 2^17 Schnorr signature pairs (h, s) over a group of 252-bit order, and then ran Algorithm 3 on them for 5 iterations; i.e., the parameters were S = 2^17, λ = 252, and ι = 5. Since 1.58 < log 3 < 1.59, we conducted the reduction experiments with β = 3α − 1.58 and β = 3α − 1.59, respectively, and measured the number of reduced signatures after each iteration. Table 1 gives the experimental results. We indeed obtained more than S signatures after every round when β = 3α − 1.59, which is slightly below β₀; on the other hand, the number of reduced signatures L decreased per iteration when β = 3α − 1.58 > β₀. These results show that choosing β ≤ β₀ is indeed sufficient to maintain L ≈ S even after the first round (if the choice of β yields more than S reduced signatures, we can simply interrupt the for loop as soon as the cardinality of sols reaches S, which of course makes the range reduction finish faster). In what follows, we will assume that β is equal to or slightly smaller than β₀ to keep the space usage stable.
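The effect of the choice of β is easy to check numerically against Theorem 1 (a sketch; the helper name expected_L is ours):

```python
from math import log2

def expected_L(alpha, beta):
    """Expected number of reduced signatures after one round (Theorem 1)."""
    return (4 / 3) * 2 ** (4 * alpha - beta)

alpha = 15                       # as in the experiment of Table 1
S = 2 ** (alpha + 2)
beta0 = 3 * alpha - log2(3)      # Corollary 2: beta0 ~ 43.415
print(expected_L(alpha, 3 * alpha - 1.59) / S)   # ~1.004 > 1: L grows
print(expected_L(alpha, 3 * alpha - 1.58) / S)   # ~0.997 < 1: L shrinks
```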

Lemma 2. The space complexity of Algorithm 3 is O(S).
Proof. The space usage of Algorithm 3 is bounded by the size of sigs, A^(1), A^(2), and sols, all of which have cardinality O(S) when β = β₀. □

Lemma 3. The time complexity of Algorithm 3 is O(S² log S).
Proof. We assume L ≈ S from Corollary 2. Sorting the four input lists takes O(S log S) time with a standard sorting algorithm such as quicksort. For each c, collecting the linear combinations of two with Algorithm 4 takes O(S); since A^(1) and A^(2) have expected cardinality L'²/C = S/4, sorting them takes O(S log S), and the scan in the while loop requires O(S) steps. Summing from c = 0 to c = C − 1 with C = S/4, we obtain O(S² log S) in total. □
Table 2 gives a performance comparison between Algorithm 3 (with β = β₀) and the sort-and-difference, assuming that both algorithms take the same input size S. Note that we evaluated 2 rounds of sort-and-difference for a fair comparison, since each of its iterations only constructs linear combinations of two, while our Schroeppel–Shamir-based algorithm constructs linear combinations of four per round. Our approach reduces more bits than the sort-and-difference per equivalent round using the same amount of input; in other words, to reduce the same number of bits, it has a lower space complexity and therefore requires fewer input signatures.
This means that each iteration approximately reduces the bias by raising it to the fourth power (see footnote 6). Therefore, the condition [C2] can be rewritten accordingly; applying Corollary 1 and putting everything together, we obtain the lower bound α_SS on α.

Data-(Time, Space) Trade-off
In practice, adversaries who can perform the fault attack are able to generate as many signatures as they want and filter out those with relatively large h. That is, let f be the number of bits to be filtered; then one can heuristically obtain S signatures such that h < 2^{λ−f} by generating 2^f · S faulty signatures, assuming that h is uniformly distributed in [0, 2^λ − 1]. With this setting, the condition [C1] is relaxed, since the range reduction now starts from (λ − f)-bit values rather than λ-bit ones. This clearly improves the lower bound obtained in Theorem 2 in exchange for spending more time on the initial signature generation. Let α'_SS be the new lower bound; then we only need to pass Bleichenbacher's attack at least S := 2^{α'_SS + 2} signatures. Let T_Gen be the time spent on signature generation, and let T_Atk and S_Atk be the time and space required for Bleichenbacher's attack with our range reduction (i.e., Algorithms 2 and 3), respectively. Then T_Gen grows by a factor of 2^f, while T_Atk and S_Atk follow from Lemmas 2 and 3 applied to the smaller input size S. Thus, the parameter f gives us flexibility, and it can be chosen depending on the precise context; for example, if we are allowed to generate a very large number of signatures but can only utilize relatively limited computational resources, then f should be increased so as to obtain an appropriately small lower bound α'_SS, and vice versa. We make use of this technique to attack the 2-bit bias in Section 6.1.
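In base-2 logarithms, the trade-off can be sketched as follows (function and key names are ours; the printed figures match the parameters of the experiment in Section 6.1):

```python
def tradeoff_estimates(f, alpha_ss):
    """Base-2 logs of the rough cost estimates for a filtering level f.

    alpha_ss: the lower bound on alpha after relaxing [C1]; constants and
    polylog factors are omitted, following Lemmas 2 and 3.
    """
    log_S = alpha_ss + 2             # S = 2^(alpha_ss + 2) input signatures
    return {
        "log T_Gen": f + log_S,      # generate 2^f * S faulty signatures
        "log T_Atk": 2 * log_S,      # quadratic-time range reduction
        "log S_Atk": log_S,          # linear space
    }

# Parameters actually used against the 2-bit bias in Section 6.1:
print(tradeoff_estimates(f=19, alpha_ss=24))
# -> {'log T_Gen': 45, 'log T_Atk': 52, 'log S_Atk': 26}
```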

Performance Comparison
We apply the settings of our attack (qDSA on Curve25519 with 2 or 3 LSB of the nonces known via fault attacks) to the bound obtained in Theorem 2 in order to give concrete performance estimates for our range reduction algorithm. We also found the optimal number of iterations ι for both cases such that α_SS is minimized. Table 3 summarizes the result. It includes a comparison with the sort-and-difference used by [AFG+14] and with a lattice attack in combination with the SVP algorithm of [BDGL16]. Note that the estimates for the sort-and-difference are too optimistic because they assume that the ratio (1 − e^{−2^γ})^ι holds even after the first iteration; unlike for our algorithm, this is not true in practice, as we reviewed in Section 4. We actually encountered such a situation and obtained fewer resulting signatures than theoretically estimated (see Section 6.2).

Implementation Results
We implemented Bleichenbacher's attack incorporating the reduction technique of Algorithm 3. In this section, we summarize the implementation details and our experimental results. The source code of the programs used in this section is publicly available [TT18].
Tools. We generated faulty qDSA signatures artificially (i.e., using the parallel computing facilities described below) by modifying the C reference implementation [Ren17a].
The attack program was written in C++, and the multiprecision integer arithmetic was mostly handled by the GMP library [Gt16], except that the reduction phase only used the built-in C integer type uint64_t for further optimization; in fact, we do not need to handle full-fledged big integers there, since our reduction algorithm only requires the evaluation of the top β bits and the following few bits, as Fig. 8 depicts. The bias was computed with FFTW [FJ05]. The large-scale parallelization was achieved with the combination of Open MPI [GFB+04] and OpenMP [Ope08].
Hybrid shared/distributed-memory parallelization. We now describe how the large-scale parallelization of Algorithm 3 was achieved in practice. We implemented the attack using a hybrid shared-memory and distributed-memory parallel programming technique; the former was handled by OpenMP and the latter by MPI. We utilized two parallel computing facilities during the experiments: a Xeon workstation, used as 2 MPI nodes of 28 threads each, and a cluster whose virtual machine instances were used as 16 MPI nodes of 16 threads each. In particular, the much larger second facility is a distributed-memory system consisting of a set of independent nodes, each of which has its own shared-memory multiprocessing environment. (And although the first system is a single workstation with a single memory space, MPI made it appear as though it consisted of two separate nodes running distinct multithreaded processes.)
As a parallel programming paradigm, we employed a simple master-worker scheme (see, e.g., [HW11, Chapter 5] for details). Let t be the number of available shared-memory threads within a node and N be the number of distributed-memory nodes, where N is a power of 2 for simplicity. Moreover, we assume that each node is assigned a unique identifier I ∈ [0, N − 1]. Then our parallelization strategy is summarized as follows: 1. Make the master process load and sort the input data.
2. Map one MPI worker process per node.
3. Broadcast the input data to the workers, each of which statically takes its share of the for-loop iterations according to its identifier I.
4. Make each worker spawn a team of t OpenMP threads and process the assigned jobs.
5. Gather the results (i.e., subsets of sols) into the master.
To achieve this, calling a few basic MPI collective communication routines (MPI_Bcast, MPI_Gather, and MPI_Gatherv) is sufficient. Each routine was called only once per round, before/after the for loop, and it only took a few minutes to broadcast and gather the data in both experiments below. Compared with the time spent on the whole range reduction, our implementation thus introduces negligible communication overhead due to parallelization.
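The round structure can be sketched as follows (Python with mpi4py for brevity, whereas the actual implementation uses C++ with Open MPI and OpenMP; the helper reduce_for is assumed, standing for the loop body of Algorithm 3):

```python
from mpi4py import MPI  # sketch only: the actual code uses C++ with Open MPI

def master_worker_round(sigs, C, reduce_for):
    """One round of the master-worker scheme: broadcast the input once,
    statically split the for loop over c across ranks, gather partial sols.

    reduce_for(sigs, cs) runs the loop body of Algorithm 3 for the indices
    cs; internally it may spawn a team of shared-memory threads, as OpenMP
    does in the real implementation.
    """
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    sigs = comm.bcast(sigs, root=0)          # cf. MPI_Bcast, once per round
    partial = reduce_for(sigs, range(rank, C, size))
    gathered = comm.gather(partial, root=0)  # cf. MPI_Gather(v), once per round
    if rank == 0:
        return [sol for part in gathered for sol in part]
    return None
```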
Scalability. Although our range reduction algorithm is highly space-efficient, multithreading in a shared-memory environment requires extra space for storing the lists A^(1) and A^(2), whose expected cardinalities are C = S/4, for each thread (see the proof of Theorem 1). On the other hand, the number N of distributed-memory nodes divides the cardinality of the portion of sols stored on each node. Therefore, the space needed per node can be roughly estimated as
$$\underbrace{S}_{L^{(1)},\,R^{(1)},\,L^{(2)},\,R^{(2)}} \;+\; \underbrace{2tC}_{A^{(1)},\,A^{(2)}} \;+\; \underbrace{S/N}_{\text{(partial) } \mathit{sols}}.$$
Recalling that our implementation broadcasts and gathers the data between nodes only once per round, it is advisable to scale the number of distributed-memory nodes rather than the number of shared-memory threads in order to save memory. In the era of cloud computing, it is safe to say that preparing many distributed nodes with moderate memory capacity is not very difficult for well-funded adversaries. Hence, our range reduction algorithm is highly scalable in practice. In the following subsection, we report the actual memory usage of the virtual distributed-memory nodes on the cluster machine (i.e., N = 16 and t = 16).
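For illustration, plugging the cluster configuration into this estimate (with an assumed 16 bytes per stored pair, our guess based on the uint64_t representation mentioned above, not a figure from the paper):

```python
def per_node_bytes(S, t, N, entry_bytes=16):
    """Evaluate the estimate S + 2tC + S/N (in entries), with C = S/4.

    entry_bytes = 16 is our illustrative assumption (one uint64_t each for
    the reduced h and the tracked s value), not a figure from the paper.
    """
    C = S // 4
    return (S + 2 * t * C + S // N) * entry_bytes

# Cluster configuration used in Section 6.1: N = 16 nodes, t = 16 threads.
print(per_node_bytes(2 ** 26, t=16, N=16) / 2 ** 30, "GiB")  # ~9.1 GiB
```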

Attack against 2-bit Bias
We first present our main result: the key recovery attack against qDSA instantiated with Curve25519 using 2-bit biased nonces. We artificially generated faulty qDSA signatures based on the fault attack described in Section 3.2; in addition, we preprocessed them to make the 2-MSB of the nonces biased, as described in Section 3.3. Due to the computational resources available to us, we had to filter the signature pairs by their h values, trading additional data complexity for lower time and space complexity, following the discussion in Section 5.5. More concretely, setting f = 19, we initially generated nearly 2^45 preprocessed signatures and only kept those with h < 2^{252−19}, so that we obtained S = 2^26 signatures to be processed by Bleichenbacher's attack. The whole signature generation phase took about 5 days using the cluster. Accordingly, we only had to reduce 252 − 19 − 26 = 207 bits in total during the range reduction phase, which allowed us to set the parameter β = 69, slightly below β₀ = 3 × 24 − log 3 ≈ 70.4. The recovery of the first MSBs was conducted on the virtual machine instances; the range reduction jobs were distributed to 16 distributed-memory MPI processes, each of which spawned 16 shared-memory OpenMP threads. The measured experimental results are summarized in Table 4. We observed that the detected bias peak after 3 rounds of reduction matches the theoretical estimate, i.e., |B_n(K)|^{4^3} ≈ 0.0012 from Corollary 1. The detected sampled biases are plotted in Fig. 9. (The figure only displays selected noise points for simplicity; we actually computed the sampled biases at L-evenly-spaced points in [0, n − 1], where L ≈ 2^26, and detected the single peak point showing a significant bias value.) The FFT table preparation and sampled bias computation finished within a few minutes.
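These parameter choices can be double-checked with a few lines of arithmetic (a sketch using only the values quoted in this paragraph):

```python
from math import log2

lam, f, first = 252, 19, 26          # filtered top bits, first-round MSBs
alpha = first - 2                    # S = 2^(alpha + 2) = 2^26 inputs
bits = lam - f - first               # 207 bits left to reduce
rounds = 3
beta = bits // rounds                # 69 bits per round
beta0 = 3 * alpha - log2(3)          # ~70.42, so beta = 69 is just below
print(bits, beta, round(beta0, 2))   # 207 69 70.42

# Expected peak height after 3 rounds: the 2-bit bias 0.9003 raised to 4^3.
print(0.9003 ** 4 ** 3)              # ~0.0012, matching Corollary 1
```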
Though the total wall clock time was over two weeks, we expect much better performance on a dedicated cluster. Due to the uneven resource allocation of the virtual instances, which are shared with other users and therefore out of our control, some nodes were significantly slower than others; the fastest nodes completed their jobs within only 7 days, which is equivalent to 4.8 CPU-years in total. We did not observe such a discrepancy when we parallelized the range reduction on the Xeon workstation. Thus, we stress that this synchronization overhead is not caused by our range reduction algorithm, but is rather a problem specific to virtual machines.
After the 26-MSB of the secret key was successfully recovered, we iteratively recovered the following bits as in Section 4.4, using the 2 nodes (i.e., 56 threads in total) of the Xeon workstation for the range reduction. The whole process below took less than 6 hours in total. We took a small security margin and only assumed that the 24-MSB had been recovered in the previous phase, following the advice of [AFG+14] and [DMHMP14]. We used Algorithm 3 until we recovered the 189-MSB and then used the sort-and-difference to recover the 216-MSB; at this stage, we no longer need to reduce many bits, and the sort-and-difference is more convenient since it only constructs linear combinations of two and therefore does not diminish the sampled bias peak very much, which allows us to detect the peak area more precisely. Finally, we directly computed the bias without range reduction and recovered the 241-MSB, after which a simple exhaustive search easily yielded the remaining unknown bits.
Performance estimate for better-equipped adversaries. Since we filtered signatures by the top 19 bits of h and only used S = 2^26 signatures as input, what we computed corresponds to T_Gen ≈ 2^45, T_Atk ≈ 2^52, and S_Atk ≈ 2^26. Thus, we can infer from the estimates in Table 3 that a better-equipped adversary, say one with access to 32 cores × 32 nodes with 96 GB of RAM each, could perform the key recovery within about 3 months even without any filtering. This would be a more favorable attack setting when the adversary is only allowed to generate fewer faulty signatures.

Attack against 3-bit Bias
Next, we describe the experimental results of the attack against qDSA signatures with 3-bit biased nonces. We artificially generated 2^23 faulty signatures (without filtering) based on the attack in Section 3.1 and preprocessed them to make the 3-MSB of the nonces biased, as described in Section 3.3. The program was executed on the Xeon workstation, and we parallelized the range reduction with 28 shared-memory OpenMP threads × 2 MPI nodes.
The measured experimental results are given in Table 5. We also performed the attack using the sort-and-difference, abbreviated as S&D. The attack completed much faster than in the 2-bit case, since we can now afford to iterate the range reduction 4 times, so that far fewer bits need to be reduced per round. Moreover, the CPU-time was almost 10 days, and the memory consumption was considerably lower than that of the sort-and-difference. This result implies that the attack against 3-bit bias would even be feasible on a small laptop for daily use. We omit the recovery of the remaining bits, since the procedure is the same as in the previous experiment on 2-bit bias.
It also turned out that the sort-and-difference (with γ = 1) is exploitable even against the 3-bit bias, and its CPU-time was much shorter than that of our algorithm, as expected. In a situation where an adversary is allowed to generate more than 1 billion 3-bit biased signatures, the sort-and-difference should thus be the better option. However, it should be pointed out that the resulting number of signatures after 8 rounds was only 2^25.8, significantly fewer than the estimated amount (1 − e^{−2^γ})^8 · 2^30 ≈ 2^28.3. This instability could be an obstacle when attacking signatures over a larger group, since it demands a higher γ or more input signatures than the theoretical bound, both of which would lead to more memory usage than expected.

Conclusion
In this paper, we have proposed fault attack techniques against the qDSA signature scheme that induce a few bits of bias in the nonces. Furthermore, we have designed a highly parallelizable and space-efficient range reduction algorithm for Bleichenbacher's nonce attack, based on Howgrave-Graham and Joux's variant of the Schroeppel–Shamir algorithm. We have presented the first complete experimental results on the full key recovery of a signature scheme implementation based on a 252-bit curve with 2-bit and 3-bit biased nonces, and have thus set new records in the application of Bleichenbacher's attack.

Figure 1: Initialization of the base point in qDSA's Montgomery ladder.

Figure 3: Power trace of the device starting from the call to ladder_base: correct execution (orange) and faulty execution with a glitch at offset 202 (red). The sampling rate is 4× the clock frequency.

Figure 4: Power traces of the device around 130 cycles after the call to compress: blue (resp. red) traces correspond to R of order 2 (resp. at infinity).
In this paper, we focus on the cases b = 2 and b = 3; if n is sufficiently large, Corollary 1 gives the approximate bias values |B_n(K)| ≈ 0.9003 for b = 2 and |B_n(K)| ≈ 0.9745 for b = 3, respectively.

Figure 5: The effect of range reduction.

Figure 6: Overview of the Schroeppel–Shamir-based range reduction algorithm directly transformed from the original version.
Figure 7: Overview of the range reduction based on Howgrave-Graham and Joux's variant.


Figure 9: Detected sampled biases after reducing the signatures with 2-bit biased nonces 3 times.

Table 1: Experimental results on the number of reduced signatures L after ρ rounds of range reduction by Algorithm 3, with α = 15 and S = 2^{α+2} = 131072.

Table 3: Estimates for the minimum required number of signatures and the optimal complexities of the reduction algorithms and of a lattice attack when λ = 252. The estimates omit subexponential factors; note, however, that those factors are the same for the sort-and-difference and our algorithm, and are worse for the lattice attack.

Table 4: Implementation results of the attack against qDSA signatures with nonces of 2-bit bias.

Table 5: Implementation results of the attack against qDSA signatures with nonces of 3-bit bias (columns include ι, wall clock time, CPU-time, memory, and the sampled bias B_n(K_w)).