Protecting Dilithium against Leakage: Revisited Sensitivity Analysis and Improved Implementations

CRYSTALS-Dilithium has been selected by the NIST as the new standard for post-quantum digital signatures. In this work, we revisit the side-channel countermeasures of Dilithium in three directions. First, we improve its sensitivity analysis by classifying intermediate computations according to their physical security requirements. Second, we provide improved gadgets dedicated to Dilithium, taking advantage of recent advances in masking conversion algorithms. Third, we combine these contributions and report performance for side-channel protected Dilithium implementations. Our benchmarking results additionally put forward that the randomized version of Dilithium can lead to significantly more efficient implementations (than its deterministic version) when side-channel attacks are a concern.


Introduction
The world's digital security infrastructure has always relied on a range of efficient and secure cryptographic primitives, including both symmetric and asymmetric solutions. In particular for asymmetric cryptography, RSA and ECC are the ubiquitous schemes in practice. However, with the anticipated advent of powerful and dedicated quantum computers, the established asymmetric cryptographic schemes, that we mainly use for key exchange and digital signatures, will no longer provide the desired security.
In 2016, the National Institute of Standards and Technology (NIST) launched a standardization effort for cryptographic schemes that can withstand quantum cryptanalysis [Nat]. In 2022, the NIST announced the first Post-Quantum Cryptography (PQC) schemes to be standardized. These include (CRYSTALS-)Kyber [ABD+19] as a Key Encapsulation Mechanism (KEM), and (CRYSTALS-)Dilithium [DLL+17] for digital signatures. Both Kyber and Dilithium are lattice-based schemes, and in recent years the analysis of lattice-based PQC schemes and their implementations has become a prominent area of research. This is not only due to their widely accepted strong security but also because of their implementation efficiency in comparison to other PQC schemes.
Although a PQC scheme can be secure against classical and quantum adversaries, this is not sufficient to provide practical security in the embedded context. The implementations of cryptographic schemes on constrained devices can be targeted by physical attacks, which include Side-Channel Analysis (SCA) and Fault Injection (FI) attacks. Over the last years, PQC KEMs have attracted most of the attention when it comes to SCA.
Indeed, most KEMs in the NIST competition, including Kyber, rely on the Fujisaki-Okamoto (FO) transformation [FO99], which is a simple and generic technique to achieve IND-CCA security. Unfortunately, the leakage of the re-encryption step in the FO transformation leads to very powerful side-channel attacks, demonstrated and analyzed in many recent works, including but not limited to [RRCB20, REB+22, UXT+22]. An adversary can also exploit leakage from the Number Theoretic Transform (NTT) or from the Key Derivation Function (KDF) in order to extract the long-term secret key or the shared secret key [RPBC20, HHP+21, KPP20, PPM17]. This variety of threats implies a large attack surface, leading to significant overheads when protecting PQC KEMs [ABH+22].
To the best of our knowledge, digital signatures, including Dilithium, have received less attention than KEMs with respect to SCA. The main results include a work by Ravi et al. [RJH+18] showing that, to achieve existential forgery, an attacker only requires knowledge of one part of the secret key in Dilithium, namely $s_1$. Marzougui et al. [MUTS22] exploit leakage of the zero coefficients in the secret signing nonce y for multiple signatures and recover the secret key by leveraging least squares regression and integer linear programming. Liu et al. [LZS+21] also present an SCA on Dilithium, which is able to recover the secret key from the leakage of a single bit of the secret signing nonce y for multiple signatures. The authors use this side-channel information to define a problem called the Fiat-Shamir Integer LWE, and show that it can be solved efficiently. This attack is very reminiscent of the well-known lattice reduction attacks on (EC)DSA (and other Schnorr-like signature schemes) with partial nonce leakage, originally due to Howgrave-Graham and Smart [HS01] and recently improved by Sun et al. [SETA22]. Liu et al. showed that their attack requires a relatively low number of signatures. This result, along with previous works and the fact that the side-channel analysis of Dilithium is quite a new research topic for the community, highlights the vital need to protect the future digital signature standard against these threats. Published work is even scarcer when it comes to protecting Dilithium against leakage. To the best of our knowledge, the main contribution comes from Migliore et al. [MGTF19], who present masked gadgets for Dilithium and a masked version of it with a power-of-two modulus.

Contributions.
In this work, we tackle the challenge of efficiently protecting Dilithium implementations on embedded devices. Our contributions are the following.
First, we revisit the sensitivity analysis of Migliore et al. [MGTF19]. Interestingly, we notice that the authors do not consider some intermediate computations as sensitive even though they can be explicitly used to recover the secret key. Conversely, others were unnecessarily protected since they can be computed from the signature and the public key. These observations lead to improved security and to more efficient signature generation. To the best of our knowledge, our work presents the first masked Dilithium design compliant with the third round submission document for all parameter sets.
Second, and following the security requirements of our sensitivity analysis, we propose new and improved masked gadgets for the main operations of Dilithium (namely the bound check, the secret sampling and the decomposition) and for all NIST security levels.
Finally, we provide a complete benchmark for an ARM Cortex-M4 microcontroller, which includes the evaluation of individual components, their comparison with the ones of Migliore et al., and performance results of full signature generation for deterministic and randomized versions of Dilithium. They highlight the advantages of randomized Dilithium compared to its deterministic variant in the context of physical attacks.

Cautionary note.
In an earlier presentation of our results, at the NIST's 4th PQC Standardization Conference, a finer-grain sensitivity analysis distinguishing security against Simple Power Analysis and Differential Power Analysis was proposed [ABC + 22]. This analysis was conjectured to enable strongly leveled implementations of Dilithium, where different parts of the implementation use different countermeasures (e.g., shuffling against SPA [VMKS12], masking against DPA [CJRR99,ISW03]). We clarify in Section 3.4 that the possibility to leverage a "hard physical learning problem" similar to [DMMS21] that would back up this conjecture does not hold. As a result, Dilithium has less potential for leveling and its sensitivity analysis can be simplified to a coarser-grain mix of sensitive operations that require DPA protections and non-sensitive ones that can leak in full.

Background
We next detail the notations used in the paper and the Dilithium signature scheme.

Polynomial arithmetic notations
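All arithmetic operations in the paper are performed over the polynomial ring $R = \mathbb{Z}_q[X]/(X^n + 1)$. We denote a polynomial with a lowercase letter such as $p \in R$, a vector of polynomials with a bold lowercase letter such as $x \in R^k$ and a matrix of polynomials with a bold capital letter such as $X \in R^{k \times k'}$. For Dilithium, the parameters of the ring are the prime $q = 2^{23} - 2^{13} + 1$ and the degree $n = 256$. For $z, \alpha \in \mathbb{Z}$ we write $z \bmod^{\pm} \alpha$ to mean the unique integer $z'$ in $]-\frac{\alpha}{2}, \frac{\alpha}{2}]$ (resp., $[-\frac{\alpha-1}{2}, \frac{\alpha-1}{2}]$) with $z \equiv z' \bmod \alpha$ if $\alpha$ is even (resp., odd). For a vector of polynomials z, the notation $z \bmod^{\pm} \alpha$ means that all the coefficients in z are reduced with $\bmod^{\pm} \alpha$. With this, we can define the following norms on $\mathbb{Z}_q$, $R$ and $R^k$ respectively:
$$\|w\|_\infty = |w \bmod^{\pm} q|, \qquad \|p\|_\infty = \max_i \|p_i\|_\infty, \qquad \|x\|_\infty = \max_i \|x_i\|_\infty.$$
We further define the sets $S_\eta = \{p \in R : \|p\|_\infty \le \eta\}$ and $\tilde{S}_\eta = \{p \bmod^{\pm} 2\eta : p \in R\}$. This means that the coefficients of an element in $S_\eta$ or $\tilde{S}_\eta$ are in the range $[-\eta, \eta]$ or $]-\eta, \eta]$, respectively. We use the notation $x \leftarrow X$ whenever we assign a uniformly random element of a set $X$ to a variable $x$. The symbol ∥ is used for the concatenation of two bit strings, and the function H is an extendable-output function (XOF).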
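As an illustration of the centered reduction, the following minimal Python sketch computes $z \bmod^{\pm} \alpha$ and the resulting infinity norm of a list of coefficients. The function names `mod_pm` and `inf_norm` are hypothetical and only serve this example; the norm follows the convention of the Dilithium specification, where the norm of an element of $\mathbb{Z}_q$ is the absolute value of its centered representative.

```python
Q = 8380417  # Dilithium prime q = 2^23 - 2^13 + 1


def mod_pm(z: int, alpha: int) -> int:
    """Centered representative of z modulo alpha.

    Even alpha: the result lies in ]-alpha/2, alpha/2].
    Odd alpha:  the result lies in [-(alpha-1)/2, (alpha-1)/2].
    """
    r = z % alpha          # Python's % returns a value in [0, alpha)
    if r > alpha // 2:
        r -= alpha
    return r


def inf_norm(coeffs, q: int = Q) -> int:
    """Infinity norm of a polynomial given as a list of coefficients mod q."""
    return max(abs(mod_pm(c, q)) for c in coeffs)
```

For instance, a coefficient stored as `Q - 3` has centered representative -3 and therefore contributes 3 to the norm.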

Dilithium
Dilithium is a digital signature scheme based on the MLWE (Module Learning With Errors) and the SelfTargetMSIS (Module Short Integer Solution) problems [LS15]. It is the primary algorithm selected by the NIST for quantum safe digital signatures. Its main features are: random sampling from a uniform distribution instead of a discrete Gaussian distribution, a focus on keeping the public key and the signature as small as possible in terms of their bit size, and being easy to adjust for different security levels by only changing the dimensions of the matrices and vectors involved. For a comprehensive description of the algorithm we refer to the proposal [DLL + 17]. Note that the pseudocode presented there and in the rest of the paper is a variation from the reference implementation described in [DLL + 17, p.17].
The key differences are highlighted in Section 3.3. In this paper we refer to the implemented version if not stated otherwise. We describe the key generation and signature generation algorithms in the following paragraphs. We do not consider the verification, which does not involve long-term secret variables (hence, does not leak sensitive information). Table 1 provides the Dilithium parameters for different NIST security levels.

Key generation.
The key generation is defined in [DLL+17, Fig. 4] and is recalled in Algorithm 1. Initially, a random bit string ζ is created and used to generate three seeds ρ, ς and K thanks to the hash function H. A public matrix A for which all coefficients are uniform in $\mathbb{Z}_q$ is generated from ρ. Two secret vectors $s_1 \in S_\eta^l$ and $s_2 \in S_\eta^k$ are derived from ς. Then, the vector $t = As_1 + s_2$ is calculated. This is an instance of MLWE, where $s_1$ and $s_2$ are hard to calculate given A and t. Next, the bit representation of t is split up into high order bits $t_1$ and low order bits $t_0$. Only $t_1$ will be part of the public key, to keep its size as small as possible. For the same reason the matrix seed ρ is part of the output, rather than the whole matrix A. Lastly, $\rho \| t_1$ gets hashed to tr. The output is the public key $pk = (\rho, t_1)$ and the secret key $sk = (\rho, K, tr, s_1, s_2, t_0)$.
Algorithm 1 KeyGen.

Signature.
Similarly, we now describe the signing procedure in Algorithm 2. We refer to [DLL+17, Fig. 4] for a more detailed description. The input is the secret key sk and a message M. The message is preprocessed with H into a bit string µ of fixed length. For deterministic signing, µ is used together with K to produce a seed ρ′. For the randomized version the seed ρ′ is generated randomly. This seed and a rejection counter κ (initially set to κ = 0) are used to sample the secret vector of polynomials $y \in \tilde{S}_{\gamma_1}^l$ with ExpandMask. Then, the product $w = Ay$ is decomposed via division with remainder into $w_1$ and $w_0$. The challenge $\tilde{c}$ is the hash of $\mu \| w_1$. For further calculations, $\tilde{c}$ is converted into a polynomial c that contains exactly τ coefficients set to ±1 and the others set to zero. This polynomial is then used to calculate $z = y + cs_1$ and $\tilde{r} = w_0 - cs_2$. To ensure the security and correctness of the scheme, two checks are performed:
$$\|z\|_\infty < \gamma_1 - \beta \quad \text{and} \quad \|\tilde{r}\|_\infty < \gamma_2 - \beta,$$
where $\beta = \eta \cdot \tau$. If any of the two conditions does not hold, κ is increased and the process starts over (beginning with the sampling of a new y). After successful checks, a hint h is calculated. This is needed in the verification step in order to make up for the "lost" information of $t_0$. Two more checks are performed on $ct_0$ and h. Again, if these conditions are not met, the signature is rejected and κ is increased. Otherwise, if all checks are successful, the signature σ = (c, z, h) can be output.

Sensitivity analysis
In this section, we analyze Dilithium's key generation and signature generation and discuss the sensitivity of all the variables and functions potentially leading to side-channel attacks. This sensitivity analysis indicates which operations/variables need to be protected against leakage. As mentioned in the introduction, we use a coarse-grain taxonomy for this purpose, which is reflected by the color-coded diagrams of Figure 1 and Figure 2, where red (resp., blue) denotes sensitive variables/operations that need security against DPA (resp., variables/operations that do not require side-channel attack protection). In doing so, we also compare our analysis to the one previously proposed in [MGTF19].
Starting with generalities, we first note that the public key can be leaked to the adversary over the whole scheme (since it is public). The public matrix A can also be leaked since it is deterministically derived from ρ. A similar status holds for some parts of the secret key $sk := (\rho, K, tr, s_1, s_2, t_0)$, since similar variables are contained in the public key. Concretely, tr does not need to be protected either since it is a hash of pk. We additionally note that the vector of polynomials $t_0$ can be leaked as well. Indeed, the Dilithium security proofs consider t (hence $t_0$ and $t_1$) to be public [DLL+17]. The message M does not need to be protected, but $s_1$, $s_2$ and K must be protected in order to avoid side-channel attacks leading to a signature forgery. Next, we detail which other variables must be protected in order to avoid the leakage of long-term sensitive secrets. We start with the sensitivity analysis for the key generation, followed by the signing procedure.

Key generation sensitivity
During key generation, the variable ζ has to be protected since it is the seed for all subsequent values (e.g., K). Similarly, ς has to be protected since it serves as a seed to deterministically generate the long-term secrets $s_1$ and $s_2$. All the other variables in the key generation can be leaked or are public, hence do not need side-channel protection.

Figure 1: Graphical representation of the key generation. Output: $pk = (\rho, t_1)$, $sk = (\rho, K, tr, s_1, s_2, t_0)$. Red: sensitive. Blue: non-sensitive.

Signature generation sensitivity
All the variables denoted in red in Figure 2 need to remain secret and hence must be secured against DPA. This naturally holds for both secret key components $s_1$ and $s_2$, in order to avoid trivial key recovery attacks leading to signature forgeries. Next, the vector of polynomials y is sensitive and must be protected. Indeed, given a valid signature σ = (c, z, h), the secret vector $s_1$ can be recovered from $z = y + cs_1$ given full or partial knowledge of y [MUTS22]. A similar analysis applies to $w_0$, which can lead to the recovery of $s_2$. As a result, the vector of polynomials w must be protected: $w_0$ is directly derived from w, and it is possible to solve the system of equations $Ay = w$ for known A and w to recover y in most cases (see Section 3.3 for details). This equation is similar to a Learning With Rounding (LWR) instance, where $w_1$ would be the public rounded value and $w_0$ would be the error (which cannot leak). For the same reason, ρ′ must be protected since it is used as a seed to obtain y. The same holds for K in the deterministic signing case. Next, when it comes to public or non-sensitive variables, both tr and $t_0$, the message M, the seed ρ, the hash µ and the matrix A are public. The sensitivity of the vector of polynomials $w_1$ in the signing procedure is more delicate to analyze, since it depends on the result of the boundary checks on z and $\tilde{r}$. If the boundary checks pass, for example when a signature is accepted, then the zero-knowledge proof of security of Dilithium shows that $w_1$ does not leak any information. Informally, $w_1$ can be reconstructed from a valid signature, which in turn can be simulated in zero knowledge, and hence $w_1$ contains no more information than the signature itself. As a result, Dilithium does not need an explicit LWR hardness assumption in this case, since it is at least as hard as its LWE assumption.
When the boundary checks do not pass, the reduction from LWR to LWE does not apply immediately since the distribution on $w_1$ changes slightly. Leaving $w_1$ unmasked therefore requires the additional explicit assumption that the corresponding LWR problem is hard. Since the number of rounded bits is significantly higher than the error that is added in the LWE problem, we conjecture that $w_1$ does not require side-channel countermeasures. The same expectation was also shared by Vadim Lyubashevsky in a personal communication and publicly during RWPQC23. To be conservative, we nevertheless study the option of additionally protecting $w_1$ in Appendix A. For the rest, the challenge c can be left unprotected since it is derived from a one-way hash of $w_1$ and public inputs. Once the bound checks on z and $\tilde{r}$ (discussed further in the next paragraph) have passed, the hint vector can be made public since it does not contain any sensitive information. Indeed, in the simplified version of Dilithium which does not involve the hints or the public key compression, all information that would be given by the hints is already contained in the returned valid signature and the public key. The checks on $ct_0$ and h are needed for correctness only.
Finally, regarding z and $\tilde{r}$, both must remain protected until the bound checks on both have passed. This implies the need for a secure bound check algorithm. After successful bound checks, they do not leak information about other sensitive values and can be leaked to the adversary. For z this is trivial, as it is part of the signature. In the case of $\tilde{r}$, this can be shown by the equation:
$$\tilde{r} = w_0 - cs_2 = Az - ct - \alpha \cdot w_1 \bmod q.$$
Indeed, for a valid signature, the values A, z, c, t, and $w_1$ are not sensitive and α is a known parameter of the algorithm. Therefore, $\tilde{r}$ can be computed using only public values, so there is no need to keep it protected after a successful signing process. A public $\tilde{r}$ is quite handy, because it allows us to compute the hint h completely on public data.
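The claim that $\tilde{r} = w_0 - cs_2$ can be recomputed as $Az - ct - \alpha w_1 \bmod q$ follows from $Az - ct = w - cs_2$ and $w = \alpha w_1 + w_0$, and can be checked numerically. The sketch below is a toy illustration only: the ring degree `N = 8`, the dimensions, the way the challenge is drawn and all helper names are assumptions made for this example, and no rejection checks are performed since the identity is purely algebraic.

```python
import random

Q = 8380417            # Dilithium modulus
N = 8                  # toy ring degree (Dilithium uses n = 256)
ALPHA = (Q - 1) // 16  # alpha for NIST levels 3 and 5


def mod_pm(z, a):
    """Centered representative of z modulo a."""
    r = z % a
    return r - a if r > a // 2 else r


def pmul(a, b):
    """Schoolbook product in Z_q[X]/(X^N + 1)."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            sign = 1 if i + j < N else -1          # X^N = -1
            out[(i + j) % N] = (out[(i + j) % N] + sign * ai * bj) % Q
    return out


def padd(a, b):
    return [(x + y) % Q for x, y in zip(a, b)]


def psub(a, b):
    return [(x - y) % Q for x, y in zip(a, b)]


def matvec(A, v):
    """Matrix-vector product over the toy ring."""
    rows = []
    for row in A:
        acc = [0] * N
        for a, x in zip(row, v):
            acc = padd(acc, pmul(a, x))
        rows.append(acc)
    return rows


def decompose(w):
    """Coefficient-wise Decompose: w = ALPHA*w1 + w0 mod q."""
    w1, w0 = [], []
    for c in w:
        lo = mod_pm(c, ALPHA)
        if c - lo == Q - 1:                        # corner case of the specification
            hi, lo = 0, lo - 1
        else:
            hi = (c - lo) // ALPHA
        w1.append(hi)
        w0.append(lo % Q)
    return w1, w0


def check_r_identity(seed, k=3, l=2):
    """Compare r~ computed from secrets with r~ computed from public values."""
    rng = random.Random(seed)
    uniform = lambda: [rng.randrange(Q) for _ in range(N)]
    small = lambda b: [rng.randint(-b, b) % Q for _ in range(N)]
    A = [[uniform() for _ in range(l)] for _ in range(k)]
    s1 = [small(4) for _ in range(l)]
    s2 = [small(4) for _ in range(k)]
    t = [padd(u, v) for u, v in zip(matvec(A, s1), s2)]    # t = A*s1 + s2
    y = [small(2**19 - 1) for _ in range(l)]
    c = small(1)                                            # toy challenge in {-1, 0, 1}
    w1, w0 = zip(*(decompose(p) for p in matvec(A, y)))     # w = A*y
    z = [padd(yi, pmul(c, si)) for yi, si in zip(y, s1)]    # z = y + c*s1
    r_secret = [psub(w0i, pmul(c, s2i)) for w0i, s2i in zip(w0, s2)]
    r_public = [
        psub(psub(azi, pmul(c, ti)), [(ALPHA * h) % Q for h in w1i])
        for azi, ti, w1i in zip(matvec(A, z), t, w1)
    ]
    return r_secret == r_public
```

Note that the identity holds for any signing attempt; the public status of $\tilde{r}$ only applies once the bound checks have passed, since z and $w_1$ are only released for accepted signatures.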

Differences with [MGTF19]
Most of the claims made above align with the ones made in [MGTF19]. However, our conclusions on w and $\tilde{r}$ slightly differ, which we discuss in the following.
Protecting w. First we look at w, in particular at the system of equations that produces it: $Ay = w$. It is possible to solve this system for y if the matrix A has one more row than columns. This is the case for NIST security levels 3 and 5, where A has dimensions 6 × 5 and 8 × 7 respectively. Even a simple solver is able to compute the sensitive y in less than two minutes on a laptop, with the original Dilithium parameters. For level 2, since the matrix A is square (of dimensions 4 × 4) and random, it is most likely invertible. Hence, with knowledge of w, y can be computed simply as $y = A^{-1} \cdot w$. This shows that w must be protected, contrary to the approach in [MGTF19].
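To make the recovery of y from w concrete, the following sketch solves a square system $Ay = w$ by Gauss-Jordan elimination modulo q. It is a deliberately simplified toy: the entries are scalars in $\mathbb{Z}_q$ rather than polynomials in $R$, and the names `solve_mod_q` and `demo` are hypothetical. Since $q \equiv 1 \bmod 2n$ for Dilithium, the NTT splits the polynomial system into independent scalar systems of this shape, so we expect the same approach to carry over.

```python
import random

Q = 8380417


def solve_mod_q(A, w, q=Q):
    """Solve A*y = w over Z_q (A square); return None if A is singular."""
    n = len(A)
    # Augmented matrix [A | w], with all entries reduced modulo q.
    M = [[v % q for v in row] + [wi % q] for row, wi in zip(A, w)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col]), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        inv = pow(M[col][col], -1, q)              # modular inverse (Python >= 3.8)
        M[col] = [(v * inv) % q for v in M[col]]
        for r in range(n):
            if r != col and M[r][col]:
                f = M[r][col]
                M[r] = [(a - f * b) % q for a, b in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def demo(seed=0, dim=4):
    """Draw a random A and a small secret y, leak w = A*y, and recover y."""
    rng = random.Random(seed)
    while True:
        A = [[rng.randrange(Q) for _ in range(dim)] for _ in range(dim)]
        y = [rng.randint(-(2**17) + 1, 2**17) % Q for _ in range(dim)]
        w = [sum(a * v for a, v in zip(row, y)) % Q for row in A]
        recovered = solve_mod_q(A, w)
        if recovered is not None:                  # a random A is almost never singular
            return recovered == y
```

For the overdetermined case (more rows than columns), any subset of rows forming an invertible square submatrix can be used in the same way.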
Unmasking $\tilde{r}$. Before we look at $\tilde{r}$ we need to address a variation of the signing procedure in Dilithium. The original pseudocode for the signing algorithm described in [DLL+17] only keeps the output $w_1$ from Decompose(w). Then, instead of $\tilde{r} = w_0 - cs_2$ it computes $r = w - cs_2$. The rejection check is done on $r_0$, which comes from Decompose(r). Also, the MakeHint function works slightly differently and takes r, c, $t_0$ as input (but produces exactly the same output h). In [MGTF19], this r-version is used while the $\tilde{r}$-version is never mentioned. However, considering the equation:
$$r = w - cs_2 = \alpha \cdot w_1 + w_0 - cs_2 = \alpha \cdot w_1 + \tilde{r},$$
we can see that r and $\tilde{r}$ can be calculated from each other using the public values $w_1$ and α. So any consideration regarding the sensitivity classification of one of these values automatically applies to the other one as well. In [MGTF19], the value r is never unmasked, which means that the calculation of the hint h must be protected against side-channel attacks. But as we explained above, $\tilde{r}$ can be recreated from public values after a valid signature output. So we consider $\tilde{r}$ as public after the checks on z and $\tilde{r}$.

Differences with [ABC + 22]
In this previous version of our results, a finer-grain sensitivity analysis was proposed, suggesting that the computation of $cs_1 + y = z$ might only require security against SPA. It was in particular conjectured that an intermediate attack path targeting this computation would be hard, and its investigation was left as an open problem. However, it appears that the analogy between the multiplication $cs_1$ and a variant of a "hard physical learning problem" similar to [DMMS21], which would back up this conjecture, does not hold. The problem, already glimpsed in [ABC+22], is that c and $s_1$ are not uniformly distributed and have small norm, while the hardness of hard physical learning problems leverages uniform secrets so that computing the multiplication leads to modular reductions. Combined with the fact that, when the signature is correct, the knowledge of z can be used to directly transfer information on y into information on the output of the multiplication, this implies that y and z actually need to be protected against DPA. As a result, Dilithium has fewer opportunities for leveling and its sensitivity analysis can be simplified into the mix of unprotected and DPA-protected operations that we now use.

Improved masked gadgets
In this section, we describe the techniques used for masking Dilithium. First, we recall some standard notions of masking along with the notations used in this paper. Then, we provide a set of new gadgets dedicated to Dilithium operations. For each of them, we justify their correctness and discuss their probing security. Finally, we discuss their instantiation in the case of the different parameter sets of Dilithium.
Concretely, these gadgets essentially rely on standard approaches tailored to the Dilithium use case, and most of our optimizations are obtained from carefully selecting the type of masking (i.e., Boolean or arithmetic). In particular, we rely on the recently improved masking conversions proposed in [BC22] for this purpose.

Masking background
Masking is a popular countermeasure against side-channel attacks. It consists in splitting any sensitive variable x into d shares [CJRR99]. Concretely, d − 1 shares are chosen uniformly at random. Hence, any subset of d − 1 shares remains independent of the secret x, forcing the adversary to exploit d shares simultaneously to extract sensitive information. This property must be maintained during the entire execution of the masked circuit. This is formalized in the probing model, ensuring that the adversary learns no information about the secret by having access to d − 1 intermediate variables [ISW03].
In lattice-based cryptography, two types of masking are used. The first one is Boolean masking. In such a case, the sharing of a k-bit Boolean variable x is written as $x^{B,k}$ and satisfies the property that
$$x = \bigoplus_{i=0}^{d-1} x_i^{B,k}.$$
The notation $x^{B,k}[j]$ denotes the sharing of the j-th bit of x. Boolean masking is typically used for protecting symmetric primitives such as hash functions. The second one is arithmetic masking. In such a case, the sharing of a variable $x \in \mathbb{Z}_q$ is expressed as $x^{A_q}$ such that
$$x = \sum_{i=0}^{d-1} x_i^{A_q} \bmod q.$$
Arithmetic masking is typically used to perform polynomial operations such as additions and multiplications. Since both arithmetic and Boolean masking are used to protect lattice-based cryptography, gadgets are required to convert masking from one type to another. To convert from arithmetic to Boolean masking, we use SecA2BModp$^d_q$. Similarly, to convert from Boolean to arithmetic masking, we use SecB2AModp$^d_q$. Finally, we also leverage the gadget SecAddModp$^d_q$, which performs a modular addition on inputs protected with Boolean masking. We refer to [BC22] for the implementation of these algorithms.
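The two encodings can be illustrated with a few lines of Python. This is a functional sketch only, assuming a standard XOR sharing for Boolean masking and an additive sharing modulo q for arithmetic masking; the helper names are hypothetical, and Python code of this kind offers no side-channel protection by itself.

```python
import secrets
from functools import reduce
from operator import xor

Q = 8380417


def bool_share(x, k, d):
    """Split a k-bit value x into d Boolean shares whose XOR equals x."""
    shares = [secrets.randbits(k) for _ in range(d - 1)]
    shares.append(reduce(xor, shares, x))      # last share fixes the XOR to x
    return shares


def bool_unshare(shares):
    return reduce(xor, shares, 0)


def arith_share(x, d, q=Q):
    """Split x in Z_q into d arithmetic shares whose sum is x modulo q."""
    shares = [secrets.randbelow(q) for _ in range(d - 1)]
    shares.append((x - sum(shares)) % q)       # last share fixes the sum to x
    return shares


def arith_unshare(shares, q=Q):
    return sum(shares) % q


def arith_add(a, b, q=Q):
    """Addition of two arithmetic sharings, computed share by share."""
    return [(x + y) % q for x, y in zip(a, b)]
```

The share-wise addition shows why arithmetic masking suits linear polynomial operations, whereas comparisons and bit extractions are more natural on Boolean shares, hence the need for conversions.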
In this work, probing security is ensured thanks to the Probe Isolating Non Interference (PINI) security notion [CS20]. Fulfilling PINI ensures probing security and the composition of PINI gadgets is PINI as well. This means that PINI gadgets can be composed (without refresh) and the resulting circuit is probing secure. Since the new gadgets we propose can be expressed as a composition of PINI gadgets previously proposed by Bronchain and Cassiers, it directly implies that they are PINI and therefore probing secure.

SecLeq
We first introduce SecLeq$^d_\psi(x^{B,k})$, described in Algorithm 4. It outputs a bit b equal to 1 if and only if the k-bit variable x encoded by the input Boolean sharing is less than or equal to a bound ψ.
Correctness. We next detail the correctness of Algorithm 4 for the case of $0 \le \psi < 2^k - 1$. The first step in SecLeq consists in adding to x the (k + 1)-bit two's complement representation of $-(\psi + 1)$ to obtain $x' = x - \psi - 1$. As a result, the output b must be set to 1 if and only if $x'$ is strictly negative. Because of the input conditions $0 \le x < 2^k$ and $0 \le \psi < 2^k - 1$, the resulting $x'$ satisfies $-2^k \le x' < 2^k$ and thus fits in a (k + 1)-bit two's complement representation, hence no overflow occurs in the subtraction. The second step consists in unmasking the (k + 1)-th bit of $x'$, which corresponds to the sign bit of the two's complement representation. Finally, the case of $\psi \ge 2^k - 1$ is trivial. Indeed, $x \le 2^k - 1$, hence x is always smaller than or equal to ψ.
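The arithmetic behind this correctness argument can be checked on unmasked values. The sketch below mirrors the two steps (a (k + 1)-bit addition with the two's complement of $-(\psi + 1)$, then extraction of the sign bit); the function name is hypothetical and no masking is performed. In the masked gadget, we would expect the addition to operate on Boolean shares and only the sign bit to be unmasked.

```python
def leq_via_sign_bit(x: int, psi: int, k: int) -> int:
    """Return 1 if x <= psi for a k-bit x, using a (k+1)-bit addition and its sign bit."""
    assert 0 <= x < (1 << k) and psi >= 0
    if psi >= (1 << k) - 1:
        return 1                           # trivial case: every k-bit x is <= psi
    mask = (1 << (k + 1)) - 1
    neg = (-(psi + 1)) & mask              # (k+1)-bit two's complement of -(psi + 1)
    x_prime = (x + neg) & mask             # x - psi - 1 on k+1 bits
    return (x_prime >> k) & 1              # sign bit: 1 iff x - psi - 1 < 0
```

An exhaustive comparison against the plain `x <= psi` test for small k is a quick way to gain confidence in the bit-level reasoning.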

Proposition 1. Algorithm 4 is PINI if b is public.
Proof. SecUnMask is PINI as a consequence of [CGMZ21, Lemma 2]. Therefore, if b is public, Algorithm 4 is a composition of PINI gadgets.

Usage in Dilithium.
SecLeq is not a high-level component of Dilithium but is instead used as a building block in Algorithm 5 and Algorithm 6. We note that the

SecBoundCheck
Algorithm 5 describes SecBoundCheck$^d_{q,\lambda_0,\lambda_1}(x^{A_q})$, which returns a bit b equal to 1 if and only if the value x encoded by the input arithmetic sharing $x^{A_q}$ satisfies $-\lambda_0 \le x \le \lambda_1 \bmod q$.

Algorithm 5 SecBoundCheck$^d_{q,\lambda_0,\lambda_1}(x^{A_q})$
Input: Arithmetic sharing $x^{A_q}$, integer $q < 2^k$ and $\lambda_0 + \lambda_1 < q$ with $\lambda_0 \ge 0$ and $\lambda_1 \ge 0$.
Output: Bit b such that b = 1 if and only if $-\lambda_0 \le x \le \lambda_1 \bmod q$.

Correctness. The first step in Algorithm 5 is to add $\lambda_0$ to the input sharing of x, resulting in a sharing of $x'$. As a result, the output bit b will be set to one if and only if $0 \le x' \le \lambda_0 + \lambda_1 \bmod q$. The second step is to check that condition thanks to SecLeq. To do so, the arithmetic sharing $x'^{A_q}$ of $x'$ is converted to a Boolean sharing $x'^{B,k}$ thanks to SecA2BModp. The resulting sharing fulfills the input conditions of SecLeq. Indeed, $(x' \bmod q) < q$ and $q < 2^k$ imply $x' < 2^k$. Additionally, since $\lambda_0$ and $\lambda_1$ are non-negative integers, we ensure that $\lambda_0 + \lambda_1 \ge 0$. The bit returned by SecBoundCheck is the one returned by SecLeq.
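The shift-then-compare idea can be illustrated on unmasked values as follows. The function `bound_check` is a hypothetical plain-integer counterpart of the gadget, and `z_coefficient_ok` shows one way it could be instantiated for the check $\|z\|_\infty < \gamma_1 - \beta$, under the assumption that a strict bound B translates into $\lambda_0 = \lambda_1 = B - 1$. The example values $\gamma_1 = 2^{19}$ and $\beta = 196$ used in the note below are taken from the level 3 parameters of the Dilithium specification.

```python
Q = 8380417


def bound_check(x: int, lam0: int, lam1: int, q: int = Q) -> int:
    """Return 1 iff -lam0 <= x <= lam1 (mod q), via one shift and one comparison."""
    assert lam0 >= 0 and lam1 >= 0 and lam0 + lam1 < q
    x_prime = (x + lam0) % q               # maps [-lam0, lam1] onto [0, lam0 + lam1]
    return int(x_prime <= lam0 + lam1)


def z_coefficient_ok(z: int, gamma1: int, beta: int, q: int = Q) -> int:
    """Check |z| < gamma1 - beta for one coefficient z stored modulo q."""
    bound = gamma1 - beta - 1              # strict bound turned into an inclusive one
    return bound_check(z, bound, bound, q)
```

With $\gamma_1 = 2^{19}$ and $\beta = 196$, the largest accepted magnitude is 524091, and a coefficient stored as `Q - 524091` (i.e., -524091) is accepted as well.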

Proposition 2. Algorithm 5 is PINI.
Proof. The first addition is applied only on the first share hence PINI. SecA2BModp d q is PINI by [BC22, Proposition 4], and Algorithm 4 is PINI by Proposition 1. Hence, Algorithm 5 is PINI since it is the composition of PINI gadgets.

SecSampleModp
Algorithm 6 describes SecSampleModp, which samples x uniformly over the range $-\phi_0 \le x \le \phi_1 \bmod q$ and outputs an arithmetic sharing of x when provided with a stream of masked uniform randomness $(x_0^{B,k}, x_1^{B,k}, \ldots)$.
Correctness. We first note that the output sharing should be uniform on a contiguous range $[-\phi_0, \phi_1]$, which contains $\phi_0 + \phi_1 + 1$ integers. This range can be represented with k bits such that $\phi_0 + \phi_1 + 1 \le 2^k$. The first step in Algorithm 6 is to consume sharings $x_i^{B,k}$ from the uniformly distributed input stream for as long as the obtained x is strictly larger than $\phi_0 + \phi_1$. This inequality is checked by leveraging SecLeq described in Algorithm 4. As soon as a value is accepted, the obtained x is uniformly distributed on the range $0 \le x \le \phi_0 + \phi_1$. The Boolean sharing of x is then converted into an arithmetic sharing $x^{A_q}$. Finally, $-\phi_0 \bmod q$ is added to this arithmetic sharing, resulting in x being uniformly distributed over $-\phi_0 \le x \le \phi_1 \bmod q$.
Proposition 3. Algorithm 6 is PINI assuming that whether each $x_i$ satisfies $x_i > \phi_0 + \phi_1$ is public information and that $x_{i^*} \le \phi_0 + \phi_1$ for some integer $i^*$.
Proof. The assumptions imply that the output value of the SecLeq gadget calls are public and that the gadget terminates with the number of iterations being public. Therefore, the gadget SecSampleModp can be viewed as a circuit composed of PINI gadgets, hence Algorithm 6 is itself PINI.

Usage in Dilithium.
SecSampleModp is used during both key generation and signing. First, during ExpandS in key generation, a secret key coefficient x in $s_1$ or $s_2$ is sampled such that $-\eta \le x \le \eta$, where $\eta \in \{2, 4\}$ depending on the Dilithium parameter set. This sampling can be masked with SecSampleModp$^d_{q,\eta,\eta}(\cdot)$. For these parameters, rejections can occur and a fresh x passes the SecLeq check with probability 5/8 and 9/16, respectively. Second, during ExpandMask in signature generation, a coefficient x of y is sampled such that $-\gamma_1 < x \le \gamma_1$, where $\gamma_1$ is a power of two such that $\gamma_1 \in \{2^{17}, 2^{19}\}$ depending on the parameter set. As a result, this sampling can be masked thanks to SecSampleModp$^d_{q,\gamma_1-1,\gamma_1}(\cdot)$. For these parameters, no resampling of x is required. Indeed, SecLeq$^d_{2\gamma_1-1}(\cdot)$ is used, which satisfies the trivial condition $\psi \ge 2^k - 1$ since $\psi = 2\gamma_1 - 1$. The value of k need not be evaluated at run time since it is directly derived from the Dilithium parameter set. As an example, for the ExpandMask execution during signature generation, $k \in \{18, 20\}$ depending on the parameter set. We note that in both ExpandS and ExpandMask, x is sampled from the output of a hash function, which is most efficiently protected using Boolean masking. This explains why we consider only sampling in the Boolean domain. Moreover, whether the samples have to be rejected is public information in the original security proof of Dilithium.
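The rejection loop and the acceptance probabilities quoted above can be reproduced with a short unmasked sketch. The functions `sample_mod` and `acceptance_probability` are hypothetical names for this illustration, and the choice of k as the smallest integer with $\phi_0 + \phi_1 + 1 \le 2^k$ is an assumption consistent with the values k ∈ {18, 20} mentioned for ExpandMask.

```python
import random
from fractions import Fraction

Q = 8380417


def bit_width(phi0: int, phi1: int) -> int:
    """Smallest k such that phi0 + phi1 + 1 <= 2^k."""
    return max(1, (phi0 + phi1).bit_length())


def sample_mod(phi0: int, phi1: int, rand_bits, q: int = Q) -> int:
    """Rejection-sample x uniformly in [-phi0, phi1], returned modulo q.

    rand_bits(k) must return a uniform k-bit integer.
    """
    k = bit_width(phi0, phi1)
    while True:
        x = rand_bits(k)
        if x <= phi0 + phi1:               # accept: x is uniform in [0, phi0 + phi1]
            return (x - phi0) % q          # shift to [-phi0, phi1] modulo q


def acceptance_probability(phi0: int, phi1: int) -> Fraction:
    """Probability that one fresh k-bit draw is accepted."""
    return Fraction(phi0 + phi1 + 1, 1 << bit_width(phi0, phi1))
```

For example, `sample_mod(2, 2, random.Random(7).getrandbits)` draws 3-bit values until one of them is at most 4, while for $\phi_0 = \gamma_1 - 1$ and $\phi_1 = \gamma_1$ the acceptance probability equals 1 and the loop always terminates after a single draw.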

SecDecompose
The SecDecompose gadget presented in Algorithm 7 computes the decomposition $(w_1, w_0)$ of a coefficient w such that $w = \alpha w_1 + w_0 \bmod q$ with $w_0 = w \bmod^{\pm} \alpha$. Concretely, we leverage the fact that $w_1$ can be leaked to the adversary since it is computed during signature verification, and hence need not be protected against side-channel attacks. The first step of our gadget is to derive $w_1$ from $w^{A_q}$. Then, $w_0^{A_q}$ is obtained by computing $w_0^{A_q} = w^{A_q} - \alpha \cdot w_1 \bmod q$. To the best of our knowledge, there is no generic and efficient method for a masked division of $w^{A_q}$ by α to get $w_1$. Hence, we next specialize the extraction of $w_1$ to the different parameter sets of Dilithium.
1: if NIST Level 3 or Level 5 then

Correctness Level 2. For the NIST level 2 parameters of Dilithium, we have $\alpha = (q-1)/44$. Hence, $\alpha^{-1} = -44 \bmod q$. For these parameters, $w_1$ can be extracted by performing a division with remainder, which in turn can be performed by using the Compress function, for which a masked version at any order is presented in [CGMZ21]. The Compress function can be used since $-\alpha^{-1} \ll q$. Hence, the error does not have an impact on the results. This fact has been checked exhaustively for all possible values of $w \bmod q$.$^{8}$

Correctness Level 3 & Level 5. Next, we check the correctness of SecDecompose for NIST level 3 and level 5 parameters. In such cases, $\alpha = (q-1)/16$. Hence, $\alpha^{-1} = -16 \bmod q$. The first steps in Algorithm 7 process w in order to derive $b'$ such that
$$b' = \alpha^{-1} \cdot \left(w - \frac{\alpha}{2}\right) \bmod q,$$
which can be alternatively expressed as
$$b' = 16 \cdot \left(\frac{\alpha}{2} - w_0\right) + w_1.$$
There, we note that $\frac{\alpha}{2} - w_0$ is non-negative thanks to the definition of Decompose. Indeed, it follows from $-\alpha/2 \le w_0 \le \alpha/2$. As a result, $w_1$ is simply contained in the 4 LSBs of the binary representation of $b'$. It is extracted thanks to SecA2BModp$^d_q$, keeping only the 4 LSBs of the output.$^{9}$

Proposition 4. Algorithm 7 is PINI if $w_1$ is public.
Proof. If w_1 is public, Algorithm 7 is a composition of PINI gadgets; hence, it is PINI.
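As a quick check of the two inverses used in the correctness arguments above, recall that q − 1 = 44α for Level 2 and q − 1 = 16α for Levels 3 and 5, so that:

\[
-44\,\alpha = -(q-1) \equiv 1 \pmod{q}, \qquad -16\,\alpha = -(q-1) \equiv 1 \pmod{q},
\]

where α denotes (q − 1)/44 in the first identity and (q − 1)/16 in the second.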
Finally, we note that Algorithm 7 can be adapted to keep w_1 masked by performing a SecB2AModp on its Boolean sharing (instead of unmasking). The computation of w_0 can then be applied share-wise, similarly to Algorithm 7, L-9. We discuss the impact at the gadget level in Figure 3 and on the full signature in Appendix A.
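The last step of the gadget, computing the sharing of w_0 once w_1 is public, can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this work: the function names (`decompose`, `share`, `masked_w0`) are hypothetical, w_1 is obtained here from an unmasked reference Decompose rather than from a masked conversion, and the Level 3 and Level 5 value of α is assumed.

```python
import random

Q = 8380417                 # Dilithium modulus q
ALPHA = (Q - 1) // 16       # alpha for NIST Level 3 and Level 5


def centered_mod(x, m):
    """Return x mod± m, the representative in (-m/2, m/2]."""
    r = x % m
    return r - m if r > m // 2 else r


def decompose(w, alpha=ALPHA):
    """Unmasked reference: (w1, w0) with w = alpha*w1 + w0 mod q."""
    w %= Q
    w0 = centered_mod(w, alpha)
    if w - w0 == Q - 1:      # corner case of the Dilithium specification
        return 0, w0 - 1
    return (w - w0) // alpha, w0


def share(x, d, rng):
    """Split x into d arithmetic shares modulo q (illustrative helper)."""
    shares = [rng.randrange(Q) for _ in range(d - 1)]
    shares.append((x - sum(shares)) % Q)
    return shares


def masked_w0(w_shares, w1, alpha=ALPHA):
    """Share-wise w0 = w - alpha*w1 mod q, with w1 treated as public."""
    out = list(w_shares)
    out[0] = (out[0] - alpha * w1) % Q   # the public constant touches one share only
    return out
```

Recombining the output of `masked_w0` modulo q gives w_0 mod q, and no operation in it combines two shares, which is why this step is cheap compared to the extraction of w_1.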

Implementation
We now discuss the different designs we compare later in Section 6. We describe the implementations for both the deterministic and the randomized versions of Dilithium. All our results are based on modified versions of the Dilithium implementations provided by the PQM4 project [KRSS]. These are C implementations with optimized assembly for polynomial arithmetic and hash functions. In order to prevent side-channel attacks, we follow our previous sensitivity analysis, which distinguishes sensitive and non-sensitive operations, and we mask all sensitive operations using the gadgets presented in Section 4 and the underlying masked addition and conversion gadgets such as SecA2BModp, SecAdd, SecAddModp or SecB2AModp. We rely on their state-of-the-art bitsliced implementations introduced by Bronchain and Cassiers, which offer (to the best of our knowledge) the best performance on Cortex-M4 [BC22]. We also use the same masked Keccak as the one provided in [BC22]. We additionally leverage arithmetic masking modulo q for all the polynomial operations and then apply the optimized polynomial arithmetic from the PQM4 implementations share-wise. Interestingly, the smaller modulus q′ approach for the NTT in s_1 · c proposed in [AHKS22] could also be used. However, it requires an arithmetic-to-arithmetic masking conversion from q′ to q to perform the addition with the masked y. We leave the study of such a trade-off for future work.
⁸ [CGMZ21] only considers δ equal to a power of two. We take δ as an arbitrary positive number.
⁹ We note that only the LSBs of the SecA2BModp^d_q output have to be explicitly computed. As a result, this can save several SecAnd when the SecA2BModp^d_q from [BC22] is used.
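The share-wise use of polynomial arithmetic described above relies on the linearity of arithmetic masking: multiplying every share by a public polynomial and recombining gives the same result as multiplying the unmasked secret. The following Python sketch illustrates this on a toy ring. The ring degree N = 8, the schoolbook multiplication (instead of an NTT) and all function names are assumptions made for illustration only.

```python
import random

Q = 8380417   # Dilithium modulus q
N = 8         # toy ring degree (Dilithium uses 256)


def poly_mul(a, b):
    """Schoolbook product in Z_q[X]/(X^N + 1)."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + ai * bj) % Q
            else:                      # X^N = -1: wrap around with a sign flip
                out[k - N] = (out[k - N] - ai * bj) % Q
    return out


def share_poly(p, d, rng):
    """Split polynomial p into d arithmetic shares modulo q."""
    shares = [[rng.randrange(Q) for _ in range(N)] for _ in range(d - 1)]
    last = list(p)
    for s in shares:
        last = [(x - y) % Q for x, y in zip(last, s)]
    return shares + [last]


def unmask(shares):
    """Recombine arithmetic shares coefficient-wise modulo q."""
    acc = [0] * N
    for s in shares:
        acc = [(x + y) % Q for x, y in zip(acc, s)]
    return acc


def sharewise_mul(public_poly, secret_shares):
    """Multiply each share independently by a public polynomial."""
    return [poly_mul(public_poly, s) for s in secret_shares]
```

Since each share is processed independently, the cost of such linear operations grows linearly with the number of shares, in contrast to the non-linear conversion gadgets.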

Deterministic Dilithium
We use a masked Keccak for H(K||µ) and within ExpandMask. In ExpandMask, the randomness generation in SecSampleModp (see Algorithm 6, Line 3) is performed with a call to the masked XOF. The multiplication Ay is performed on each of the shares of y independently by leveraging the optimized arithmetic operations in [KRSS]. For the decomposition of w, we leverage the new gadget SecDecompose from Algorithm 7 with the appropriate parameters given in Subsection 4.5. The protected rejection checks ∥z∥_∞ < γ_1 − β and ∥r∥_∞ < γ_2 − β are implemented with SecBoundCheck presented in Algorithm 5. Finally, similarly to SecLeq in Algorithm 5, we unmask the public signature values z and r using the SecUnMask gadget only once all the bound checks have passed, to maintain probing security.
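For reference, the predicate that the protected rejection checks evaluate can be written on unmasked values as follows. This Python sketch is only the unmasked specification of the check, not SecBoundCheck itself; the function names are hypothetical and the Level-3 constants γ_1 = 2^19 and β = 196 are assumed.

```python
Q = 8380417          # Dilithium modulus q
GAMMA1 = 1 << 19     # Level-3 gamma_1
BETA = 196           # Level-3 beta


def centered(x):
    """Map x mod q to its representative in [-(q-1)/2, (q-1)/2]."""
    x %= Q
    return x - Q if x > (Q - 1) // 2 else x


def inf_norm(coeffs):
    """Infinity norm over centered representatives."""
    return max(abs(centered(c)) for c in coeffs)


def bound_check(coeffs, bound):
    """True if every coefficient is strictly below the bound in absolute value."""
    return inf_norm(coeffs) < bound
```

In a masked implementation, this comparison has to be carried out on shares, and only the resulting accept or reject bit is unmasked.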

Randomized Dilithium
The implementation of the randomized version that we consider is similar to the deterministic one previously described. The main difference lies in the sampling of the randomness y within ExpandMask. The randomized version of Dilithium enables more freedom in the generation of the uniform polynomial vector y compared to the deterministic version. For the deterministic version, the randomness sampling in Algorithm 6 is based on a masked XOF (Keccak), which can be a performance bottleneck (as detailed in Table 2). For the randomized version, a first option (which follows the specifications but does not bring performance improvements) is to generate ρ′ with a TRNG and to use the masked XOF to derive y. Alternatively, one can directly generate the shares of y with the TRNG. This does not follow the specifications of Dilithium but saves the cost of a masked XOF and does not weaken the security of Dilithium as y remains uniform. We will next evaluate this option.¹⁰
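The second option can be sketched as follows: each Boolean share of a coefficient is drawn directly from the random source, so the masked value is uniform without any XOF call. This is a minimal Python illustration, not the evaluated implementation. `secrets.randbits` stands in for the on-board TRNG, the function names are hypothetical, and the 20-bit width assumes Level-3 parameters, where the coefficients of y span 2γ_1 = 2^20 values. The subsequent conversion to arithmetic masking (SecB2AModp) is not shown.

```python
import secrets
from functools import reduce
from operator import xor

GAMMA1_BITS = 20   # Level 3: y coefficients span 2*gamma_1 = 2**20 values


def sample_boolean_shares(d, bits=GAMMA1_BITS, randbits=secrets.randbits):
    """Draw d independent Boolean shares; their XOR is uniform on `bits` bits."""
    return [randbits(bits) for _ in range(d)]


def unmask_boolean(shares):
    """Recombine Boolean shares (for testing only, never done on-device)."""
    return reduce(xor, shares, 0)
```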

Benchmarks
In this section, we report the performance of the Dilithium implementations described in Section 5. We first detail the benchmarking setup used for this purpose. Second, we report the performance improvements provided by the new gadgets of Section 4 compared to the ones of [MGTF19]. Then, we evaluate the cost of each individual operation in Dilithium's signature generation (without considering the rejections). Based on this, we compare the performance of both deterministic and randomized versions when side-channel countermeasures are required. Performance is given for Dilithium with Level-3 parameters (see Table 1), but the general conclusions apply to all security levels.

Benchmarking setup
In order to evaluate the execution time of our implementations, we use a benchmarking setup similar to the one provided in [BC22], which itself is based on the PQM4 benchmarking initiative for PQC signatures and KEMs [KRSS]. More precisely, the benchmarks are performed with the NUCLEO-L4R5ZI demonstration board. The cycle counts are measured with the cycle-accurate counter DWT_CYCCNT. With the considered clock configuration, the TRNG of the microcontroller provides 32 fresh random bits every 53 cycles. This TRNG is used for the generation of the masking randomness as well as for the ExpandMask in the randomized version of Dilithium that we evaluate.
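To give an order of magnitude for the TRNG throughput quoted above, the following Python sketch estimates the number of cycles spent waiting for random bits. It is a back-of-the-envelope lower bound under stated assumptions, not a measurement: the function names are hypothetical, and the example assumes that the Boolean shares of y for Level 3 (5 polynomials of 256 coefficients, 20 bits each) are drawn without any overlap between TRNG generation and computation. The randomness consumed by the masked gadgets themselves is not counted.

```python
TRNG_BITS = 32     # fresh random bits per TRNG output
TRNG_CYCLES = 53   # cycles between two TRNG outputs


def trng_cycles(num_bits):
    """Cycles needed to obtain num_bits random bits from the TRNG."""
    words = -(-num_bits // TRNG_BITS)   # ceiling division
    return words * TRNG_CYCLES


def y_share_bits(d, l=5, n=256, bits_per_coeff=20):
    """Random bits needed for d Boolean shares of the vector y."""
    return d * l * n * bits_per_coeff


print(trng_cycles(y_share_bits(2)))   # prints 84800
```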

Gadgets improvements
We first compare the gadgets presented in Section 4 and the ones proposed by Migliore et al. in [MGTF19], and we report the results in Figure 3. To enable a fair comparison, we implemented the gadgets as described in [MGTF19] by leveraging the PINI property and the bitslice gadgets from [BC22] for SecAdd, SecAddModp, SecA2BModp and SecB2AModp. As a result, the implementation of [MGTF19] we consider does not contain extra refresh gadgets (as it is PINI). We note that [MGTF19] uses a parameter w reflecting the bus width of the target CPU, which implies that operations might be performed on more bits than necessary. In our implementation, we do not use that parameter w as the operations are performed on the exact necessary number of bits as allowed by bitslicing.
Similarly, [MGTF19] performed masked SecAnd with public values to isolate bits on secret variables. Individual bits are isolated in our implementations thanks to bitslicing.
SecSampleModp. Both versions of SecSampleModp as used in the signature generation are similar. The only difference is that the subtraction with ϕ_0 is performed with Boolean masking in [MGTF19] and with arithmetic masking in Algorithm 6. As a result, our new gadget saves the cost of one SecAdd by replacing it with a share-wise addition. This results in a speedup by a factor of approximately 1.2, as highlighted in Figure 3b.

SecBoundCheck. Our SecBoundCheck also simplifies the one proposed in [MGTF19], where a SecA2BModp is performed followed by two SecAdd's.¹¹ Our construction replaces one of these additions by one arithmetically masked addition, which is almost free. This leads to a performance improvement by a factor ≈ 1.1, as reported in Figure 3d.
SecDecompose. Finally, we compare the two implementations of SecDecompose. Interestingly, the main improvement comes from the fact that we first extract w_1 efficiently and then unmask it to compute w_0. This improvement relies on the fact that the higher-order bits (i.e., w_1) in the SecDecompose gadget are considered sensitive by Migliore et al., while this is not necessary according to the revisited sensitivity analysis of Section 3.3. In short, the implementation based on [MGTF19] starts with a SecA2BModp, continues with several (≈ 10) additions and finally performs a SecB2AModp to obtain the arithmetic sharing of w_0. The new gadget only requires a single SecA2BModp and some share-wise operations with arithmetic masking. Overall, the new gadget runs ≈ 3.8 times faster.
We note that for Level-2 parameters, the value of α changes and the gadget from Migliore et al. does not apply. Our implementation of SecDecompose for Level-2 parameters is slightly slower than for Level-3 and Level-5. Indeed, in the SecCompress, the SecA2BModp must be performed on a slightly larger modulus, increasing the cost by a factor ≈ 1.2.

Deterministic vs. Randomized performances
The performance of each operation within both versions of Dilithium is reported in Table 2. We observe that the randomized signature generation is more efficient than the deterministic one. For two shares, 24 005 kCycles are needed for the deterministic version vs. only 14 282 kCycles for the randomized one. Hence, randomization offers an improvement by a factor ≈ 1.68. Similarly, for 8 shares, the randomized version is ≈ 1.77× faster than the deterministic one. The run time of unprotected Dilithium3 signature generation is 3224 kCycles [KRSS]. Hence, the two-share version is 4.3× slower than the unprotected implementation in the randomized case and 7.32× slower in the deterministic case.

[Figure panels: (a) SecSampleModp comparison. (e) SecDecompose comparison.]

Most of the difference between the two Dilithium versions is due to ExpandMask, which is composed of two parts as detailed in Algorithm 6. The first one is sampling the uniform y in Boolean masking. This operation is performed with a masked XOF in the deterministic case, and with the on-board TRNG in the randomized case. The second part is to perform a SecB2AModp in order to produce an arithmetic sharing. This operation is similar for both cases. In the deterministic case, ExpandMask represents 56 % of the total run time, of which 74 % is due to the masked XOF. As the randomized Dilithium does not require this masked XOF, only 25 % of its run time is attributable to ExpandMask.

The cost of the other operations is similar for both versions. Concretely, the overall cost is dominated by the operations that have quadratic overheads in the number of shares (even for d = 2). These operations are H(K||µ), ExpandMask, SecDecompose, SecBoundCheck and UnMask. We note that for the deterministic version, the most expensive operation is ExpandMask, while it is SecBoundCheck for the randomized version. Additionally, the cost of polynomial arithmetic (NTT(s), y + s_1 c and w_0 − s_2 c) is limited.
As expected, the cost of the public matrix expansion ExpandA remains constant with the number of shares. Finally, we note that the SecB2AModp is slightly more expensive in the deterministic case, as it includes the linear overheads needed in order to map the output of the masked XOF into the correct bitslice representation. This operation is not needed in the randomized case as the output of the TRNG already has the appropriate layout.
More generally, we also stress that the randomized version does not allow the adversary to average traces for the same inputs, which is beneficial for security. The combination of these observations and performance gains naturally calls for considering the randomized Dilithium in application contexts where side-channel attacks are a concern.
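The two-share speedup quoted in this section can be recomputed from the reported cycle counts. The second value below is a derived estimate, assuming that the 74 % figure is expressed relative to the ExpandMask run time:

\[
\frac{24\,005}{14\,282} \approx 1.68, \qquad 0.56 \times 0.74 \approx 0.41 .
\]

Under that reading, the masked XOF used for sampling y alone accounts for roughly 41 % of the deterministic signing time.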

Conclusion and open problems
In this work, we analyzed side-channel protected implementations of Dilithium by mixing different contributions. First, we presented an updated sensitivity analysis for its key generation and signing algorithms. Our results show that a previous work in this direction was slightly flawed, with some parts leading to insecurities and other parts leading to inefficiencies. Second, our new masking gadgets improve over the state of the art, leading to performance gains of factors of up to 3.8. They also fill gaps for which it was previously unknown how to efficiently apply masking, and we propose the first masking gadgets that are compatible with all the Dilithium parameter sets. Overall, our analysis and benchmarks highlight that the randomized variant of Dilithium evaluated in this paper provides notably better performance, thanks to additional flexibility in the sampling of random values. In addition, it also offers a smaller side-channel attack surface as signatures cannot be repeated. We therefore believe that it should be the default variant for embedded devices when side-channel leakage needs to be taken into account.
These results lead to a number of natural open problems. First, they highlight that for now, the leveling concept (i.e., the idea of protecting different parts of an implementation with different countermeasures, in order to limit the overheads) cannot be fully exploited for Dilithium, as initially thought (see the cautionary note in Section 1). For example, most of the signature operations in our implementations need to be secure against DPA, which requires (expensive) masked gadgets. Hence, it is a natural open question to find out whether more leveled implementations could be obtained, which could be considered in different fashions. A light leveling option, directly applicable to Dilithium, would be to try exploiting that even when masking, all operations may not leak in a similar manner. For example, one could try leveraging the recent observation that prime masking is more resilient to low-noise leakages than Boolean masking, and study whether the number of shares in the Boolean and arithmetic masking used for protecting Dilithium against leakage could be leveled [MMMS22]. A more ambitious direction would be to study whether tweaking Dilithium could enable a stronger leveling (e.g., mixing operations that require security against SPA and operations that require security against DPA). One potential direction would be to rely on hard physical learning problems such as those introduced in [DMMS21], but as discussed in that paper, this would imply significant changes in the design of Dilithium, in order to deal with the challenges raised by the manipulation of non-uniform and low-norm secrets. In general, designing a PQ signature scheme with a better performance vs. side-channel security tradeoff appears as an interesting long-term goal.
Besides, our work only focuses on side-channel attacks and therefore raises the question of how to additionally protect Dilithium against fault attacks, and whether its randomized version also provides (security or performance) benefits in this context, which is yet another interesting direction for further research.

Table 3: Performance of the masked Dilithium Level-3 components for randomized and deterministic versions with masked w_1: number of clock cycles when running on an STM32L4R5 and using the TRNG for generating the masking randomness (32-bit randomness every 53 cycles). Reported numbers are in kCycles. The numbers are for a single execution of the component (does not consider repetitions due to rejections). Rand.*: The vector polynomial y is sampled from the TRNG and not from a XOF.