Masking Floating-Point Number Multiplication and Addition of Falcon: First- and Higher-Order Implementations and Evaluations

In this paper, we provide the first masking scheme for floating-point number multiplication and addition to defend against recent side-channel attacks on Falcon's pre-image vector computation. Our approach involves a masked nonzero check gadget that securely identifies whether a shared value is zero. This gadget can be utilized for various computations such as rounding the mantissa, computing the sticky bit, checking the equality of two values, and normalizing a number. To support the masked floating-point number addition, we also developed a masked shift and a masked normalization gadget. Our masking design provides both first- and higher-order mask protection, and we demonstrate the theoretical security by proving the (Strong)-Non-Interference properties in the probing model. To evaluate the performance of our approach, we implemented unmasked, first-order, and second-order algorithms on an Arm Cortex-M4 processor, providing cycle counts and the number of random bytes used. We also report the time for one complete signing process with our countermeasure on an Intel Core CPU. In addition, we assessed the practical security of our approach by conducting the test vector leakage assessment (TVLA) to validate the effectiveness of our protection. Specifically, our TVLA experiment results for second-order masking passed the test with 100,000 measured traces.


Introduction
The rapid development of quantum computers has posed a potential threat to public key cryptography systems. Due to Shor's algorithm [Sho97], public key cryptographic schemes based on the hardness of integer factorization and discrete logarithm, including RSA [RSA78], Diffie-Hellman key agreement [DH76], ElGamal encryption [Elg85], and ECDSA [JMV01], are vulnerable to large-scale quantum computing. To defend against such a menace, post-quantum cryptography, which studies algorithms that are considered secure against quantum computing, has been widely researched. In 2016, the National Institute of Standards and Technology (NIST) initiated a standardization process for post-quantum cryptography [oSTa]. Recently, the four selected algorithms, CRYSTALS-Kyber, CRYSTALS-Dilithium, SPHINCS+, and Falcon, became part of NIST's post-quantum cryptographic standards, which are expected to be finalized in about two years [oSTb].
Unfortunately, even with conjectured quantum resistance, cryptographic implementations may not be secure. Side-channel analysis considers the threat of information leakage from an electronic device running cryptographic algorithms through its physical hardware behavior. This could be the running time on different inputs or the electromagnetic emanation and power consumption during execution. Many works have been

Notation
In the rest of the paper, N = 2^κ for some integer κ, M > N, q is a prime number, and φ = x^N + 1 is the cyclotomic polynomial. For a polynomial f, we write f = Σ_{i=0}^{N−1} f_i x^i. We write a vector v in bold, and a matrix A is written in bold and uppercase. The adjoint of a vector v or a matrix A, which is its transpose with the adjoint applied to each of its coefficients, is written as v* or A*. For a polynomial f modulo φ, we may write it as an N-by-N matrix, with the ith row representing the coefficients of (x^i f mod φ) for 0 ≤ i ≤ N − 1. Matrix additions and multiplications then map to polynomial additions and multiplications in the ring of polynomials modulo φ.
For a variable x, the jth bit of x is written as x^(j). The LSB is the first bit, and the MSB is the kth bit if x is stored in a k-bit register. The ith to jth bits (j ≥ i) of x are represented by x^[j:i]. A sequence of n variables (x_1, x_2, ..., x_n) (e.g., shares of a variable x) is written as (x_i)_{1≤i≤n}, or simply (x_i) if the sequence length is not our point or is obvious from context. Unless specified otherwise, n is the length of the sequence (x_i).
We use the notations ⊕, ∧, ∨ to denote the bit-wise XOR, AND, and OR operations, respectively. For a variable x, we denote by ¬x the bit inversion of x. Also, for a negative integer in a register, we consider the two's complement representation and write −x = (¬x) + 1. The operators ≪ and ≫ are the unsigned left- and right-shifts of a variable. In addition, for a proposition P, we let [P] = 1 if P is true and 0 otherwise.

Falcon Signature Scheme
In this section, we briefly review the Falcon signature scheme, emphasizing the pre-image vector computation in the Fourier domain, which is strongly related to our work. For more details, we refer the reader to the NIST round-3 submission of Falcon [PFH+20].
The basis of Falcon is the GPV framework [GPV08]. At a high level, the framework uses a full-rank matrix A ∈ Z_q^{N×M} as the public key and B ∈ Z_q^{M×M} as the secret key, where BA^T = 0 mod q. To sign a message m, one first computes H(m) for some hash function H : {0, 1}* → Z_q^N, and the signature is a short vector s, derived with the trapdoor B, that satisfies sA^T = H(m). To verify, one simply checks that the received s is short and satisfies the above equation.
Falcon instantiates the GPV framework on NTRU lattices. The secret key is essentially 4 polynomials f, g, F, G in Z_q[x]/(φ) which satisfy the NTRU equation fG − gF = q mod φ, while the public key is now the polynomial h = g f^{−1} mod (φ, q). To sign a message securely, the coefficients of f and g are sampled from a discrete Gaussian distribution with a small standard deviation σ_{f,g} = 1.17·sqrt(q/2N), and F, G are generated from f, g by solving the NTRU equation. The matrices A, B are formed as A = (1, h*) and B = ((g, −f), (G, −F)).

The second part of Falcon key generation is the computation of the Falcon tree, which will be used in looking for the short vector in signing. In short, the Falcon tree is a binary tree where each node at height κ is a polynomial in Q[x]/(x^{2^κ} + 1). The Gram matrix G = BB* is first computed, and a Falcon tree is built by recursively calling the LDL decomposition on each element of the diagonal matrix and storing the lower-triangular matrix as a node value. The leaf nodes are then normalized by a constant σ. The whole key generation process is given in Algorithm 1.

Algorithm 1 FALCON.Keygen from [PFH+20]

Input: polynomial φ, prime modulus q
Output: Falcon secret key sk, public key pk
1: f, g, F, G ← NTRUGen(φ, q)
⋮ (generate Falcon tree T)
5: for each leaf leaf of T do
⋮

The signing process of Falcon in Algorithm 2 starts by sampling a random salt r and computing c = H(r ∥ m). Next it finds a short vector s = (s_1, s_2) such that sA* = s_1 + s_2 h = c. To do so, the vector t = (c, 0) · B^{−1} is first computed, which we call the pre-image vector. Then the fast Fourier nearest plane algorithm [DP16] (ffSampling in Algorithm 2) is applied to find some z where s = (t − z)B is close to zero in a secure way, in the sense that no information about the secret basis B is leaked. The signature consists of the salt r and the second coefficient s_2. The verification in Algorithm 3 is relatively easy: one first retrieves s_1 as H(r ∥ m) − s_2 h, and then checks whether the norm of the vector (s_1, s_2) is small.

9: (s_1, s_2) ← invFFT(s)
10: s ← Compress(s_2)
11: while s = ⊥
12: sig ← (r, s)
13: return sig

To use the fast Fourier sampler and increase the speed of the signing process, Falcon applies the fast Fourier transform to polynomials. Let Ω_φ be the set of all complex roots of φ; the fast Fourier transform of a polynomial f is (f(ω))_{ω∈Ω_φ}. In this representation, polynomial additions and multiplications can be done by performing the operations on each coefficient, which is a complex number. Therefore, the pre-image vector computation is composed of a complex number multiplication for each coefficient.
In practice, a complex number a + bi is represented by two floating-point numbers a, b as its real and imaginary parts. As pointed out in their document [PFH+20], it may be hard to implement Falcon on constrained devices without a floating-point unit.
Falcon thus provides an emulation of floating-point operations with 53 bits of precision in its reference implementation. It follows the IEEE-754 standard and stores a 64-bit floating-point number as a 1-bit sign, an 11-bit exponent, and a 53-bit mantissa, where the 53rd bit of the mantissa is always 1 and is omitted in storage. To make the comparison of values more straightforward, the exponent is biased by 1023 so that it is unsigned. For convenience, we will represent a 64-bit floating-point number x as a tuple (s, e, m) with x = (−1)^s · 2^{e−1075} · m. Compared to the stored floating-point format, s is the 64th bit, e is the 53rd to 63rd bits, and m is the 1st to 52nd bits of the floating-point number with the omitted one prepended. In this representation, m may be viewed as an integer in [2^52, 2^53).
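As a concrete illustration, here is a minimal C sketch (ours, not taken from the paper) that unpacks a double into the (s, e, m) tuple above, restoring the implicit leading 1 of the mantissa:

```c
#include <stdint.h>
#include <string.h>

/* Unpack an IEEE-754 double into the (s, e, m) tuple used in the text,
 * where the value equals (-1)^s * m * 2^(e-1075) and m lies in
 * [2^52, 2^53) for normal numbers. */
static void fpr_unpack(double d, uint64_t *s, uint64_t *e, uint64_t *m) {
    uint64_t x;
    memcpy(&x, &d, sizeof x);          /* bit-level view of the double */
    *s = x >> 63;                      /* sign: 64th bit */
    *e = (x >> 52) & 0x7FF;            /* biased exponent: bits 53..63 */
    *m = (x & 0xFFFFFFFFFFFFFULL) | ((uint64_t)1 << 52); /* hidden 1 restored */
}
```

For example, 1.0 unpacks to s = 0, e = 1023, m = 2^52, consistent with 2^52 · 2^{1023−1075} = 1.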

Floating-Point Number Multiplication and Addition
The pre-image vector computation, or coefficient-wise complex number multiplication, is performed by combinations of floating-point number multiplications and additions. We here briefly introduce how Falcon emulates them in its reference implementation.
We begin by introducing the last subroutine of both the multiplication and the addition: the floating-point number rounding and packing function FPR (Algorithm 4), which receives a 55-bit mantissa, an unbiased exponent, and a sign bit and outputs the floating-point number with a rounded mantissa and biased exponent. It starts by adding a constant 1076 to the exponent for the bias (1023) and the mantissa size (53). If the result is smaller than 0, the number is considered subnormal, and the mantissa is turned to zero. Then it zeros the exponent if the mantissa is zero. Next, the sign bit and mantissa are combined and added with the exponent. Note that the exponent is incremented by the top bit of the mantissa. Finally, the rounding is done by checking the three least significant bits of the mantissa. It follows the round-to-nearest strategy: if they are 011, 110, or 111, a carry 1 is added.

Now we introduce the floating-point number multiplication in Algorithm 5. The multiplication first XORs the sign bits and sums the exponents of both operands, and then performs the mantissa multiplication. Note that a constant −2100 is added to the exponent for the exponent bias (1023 × 2), the scaling of the mantissa (52 × 2), and the shifting of the later multiplication product (−50). The product is a 106-bit raw mantissa, and rounding is performed to reduce the mantissa to 55 bits in which the 55th bit is set. Notice that the sticky bit must be preserved after the rounding. If either of the exponents is zero, the computations above are invalid, and the mantissa is turned to zero. Finally, the sign bit, unbiased exponent, and 55-bit mantissa are packed into one floating-point number by FPR in Algorithm 4.
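The rounding rule above can be implemented branch-free. The sketch below is our own illustration (a similar constant-lookup trick appears in Falcon's reference floating-point emulation): the constant 0xC8 = 0b11001000 has exactly bits 3, 6, and 7 set, so indexing it by the three low bits yields the carry for patterns 011, 110, and 111, which is precisely round-to-nearest-even when two bits are dropped.

```c
#include <stdint.h>

/* Round a 55-bit mantissa to 53 bits: drop two bits, then add the carry
 * dictated by the three least significant bits (011, 110, 111 round up). */
static uint64_t round_mantissa(uint64_t z55) {
    uint32_t f = (uint32_t)z55 & 7;        /* three least significant bits */
    return (z55 >> 2) + ((0xC8u >> f) & 1); /* branch-free carry lookup */
}
```

For instance, a mantissa ending in 011 rounds up (more than half dropped), one ending in 110 rounds up to keep the result even, and one ending in 101 rounds down.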

Algorithm 5 FprMul
Input: floating-point numbers x = (sx, ex, mx) and y = (sy, ey, my)
Output: the floating-point number product of x and y
1: s ← sx ⊕ sy
2: e ← ex + ey − 2100
⋮ (multiply the mantissas and round to 55 bits with the sticky bit preserved)
9: e ← e + w (increment if the product carries to the 106th bit)
10: bx ← [ex ≠ 0], by ← [ey ≠ 0]
11: b ← bx ∧ by
12: z ← z ∧ (−b)
13: return FPR(s, e, z)

The floating-point number addition in Algorithm 6 first exchanges the operands to make the absolute value of the first one no less than that of the other. Since the exponent is biased, one can compare the absolute values of two floating-point numbers by comparing their 63 least significant bits. If both operands differ only in the sign bit, it lets the first operand be the positive one. Then it extracts both operands' sign bit, exponent, and mantissa, where the mantissa is scaled up to 55 bits for further rounding. The exponent is reduced by 1078 for the bias (1023) and the mantissa scaling (55). Next, it shifts the second operand according to the exponent difference while preserving the sticky bit and adds both mantissas. The sum is first normalized to [2^63, 2^64), that is, the 64th bit is set, and is then scaled down to [2^54, 2^55) with the sticky bit preserved. Finally, the packing returns a 64-bit floating-point number with a 53-bit mantissa, as it does in multiplication.
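Sticky-bit preservation appears in both routines. An unmasked sketch of the idea (ours): after shifting right, the "nonzero-ness" of the discarded bits is ORed into the least significant bit of the result so that later rounding still sees whether anything was lost.

```c
#include <stdint.h>

/* Right shift by c (0 <= c <= 63) preserving the sticky bit: the LSB of
 * the result records whether any discarded bit was set. */
static uint64_t ursh_sticky(uint64_t x, unsigned c) {
    uint64_t dropped = x & (((uint64_t)1 << c) - 1); /* bits shifted out */
    return (x >> c) | (uint64_t)(dropped != 0);      /* fold into LSB */
}
```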
It should be noticed that floating-point number multiplication and addition do not follow the associative and distributive laws. In other words, for some floating-point numbers a, b, and c, (a + b) + c ≠ a + (b + c) and a × (b + c) ≠ a × b + a × c. This makes it complicated to design a masked implementation of Falcon without constructing new ways to do the multiplication and addition. In this paper, we follow the Falcon reference implementation and rewrite the floating-point number multiplication and addition in a masked way. Our masked functions return the same values as the existing functions, and they apply to both security levels provided by Falcon (i.e., Falcon-512 and Falcon-1024) since both use the same floating-point arithmetic.
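The non-associativity is easy to witness with double precision (assuming standard IEEE-754 evaluation without excess precision):

```c
/* Returns 1 iff (a + b) + c == a + (b + c) for these doubles. Rounding
 * after each operation makes the grouping matter in general. */
static int add_is_associative(double a, double b, double c) {
    return (a + b) + c == a + (b + c);
}
```

For example, (0.1 + 0.2) + 0.3 and 0.1 + (0.2 + 0.3) differ in the last bit of the result.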

Masking
Side-channel attacks can extract secret information from cryptographic devices by measuring their physical behavior during the computation, and masking helps avoid leakage by randomizing secret variables in each operation. Essentially, it splits each sensitive value into several shares that are re-randomized every round. In this way, the attacker cannot gain any information if only a limited number of intermediate variables are seen.
Common masking methods include the Boolean mask and the arithmetic mask [MOP07]. The Boolean masking method hides a variable x by a length-n sequence (x_i) with x = x_1 ⊕ x_2 ⊕ · · · ⊕ x_n. Since the value x should only be recovered if all x_i's are known to the attacker, we also call the sequence (x_i) Boolean shares of x. The arithmetic masking method uses arithmetic shares (x_i) where x = x_1 + · · · + x_n. Note that the addition is taken modulo 2^k for a k-bit variable.
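A minimal 2-share sketch of both schemes (ours; the randomness r must come from a proper RNG in a real implementation):

```c
#include <stdint.h>

/* Boolean masking: x = sh[0] ^ sh[1]; each share alone is uniform. */
static void bool_mask(uint64_t x, uint64_t r, uint64_t sh[2]) {
    sh[0] = r;
    sh[1] = x ^ r;
}

/* Arithmetic masking: x = sh[0] + sh[1] mod 2^64. */
static void arith_mask(uint64_t x, uint64_t r, uint64_t sh[2]) {
    sh[0] = r;
    sh[1] = x - r;   /* subtraction is modulo 2^64 on uint64_t */
}
```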
To theoretically evaluate the power of the attacker and the security level provided by masking, Ishai, Sahai, and Wagner [ISW03] introduced the notion of the t-probing model, which assumes an adversary can probe up to t intermediate values during the cryptographic operations. A gadget is said to be secure against t-order attacks if any t intermediate values of its operations leak no information about the hidden secret. Their approach for proving this security is based on simulation; namely, to show that any t intermediate values of the gadget can be simulated without knowledge of the secret.
To prove the security of compositions of gadgets, the concept of (Strong)-Non-Interference was proposed in [BBD+16]. We recall the definitions in the version presented in [SPOG19].

Definition 1 (t-Non-Interference (t-NI) security). A gadget is t-Non-Interference (t-NI) secure if every set of t intermediate values can be simulated by no more than t shares of each of its inputs.
Definition 2 (t-Strong-Non-Interference (t-SNI) security). A gadget is t-Strong-Non-Interference (t-SNI) secure if, for every set of t_I internal intermediate values and t_O of its output shares with t_I + t_O ≤ t, they can be simulated by no more than t_I shares of each of its inputs.
If a gadget is itself t-NI secure, and if any set of t shares of its input is independent of the secret, then it can resist t-order attacks. However, compositions of t-NI secure gadgets may not be t-NI secure, and we can use t-SNI to avoid this problem. Note that a t-SNI secure gadget is, by definition, also t-NI, and it helps the simulation since it requires only as many input shares as internal probed values to simulate all the probes.
In Section 4, we will show that our algorithms with n = t + 1 shares are t-NI or t-SNI secure and provide formal proofs via simulation.

Test Vector Leakage Assessment
The Test Vector Leakage Assessment (TVLA) methodology [GGJR+11] is used to analyze whether a cryptographic device leaks information through its power/EM traces. The theory behind it is Welch's t-test. Here we introduce the non-specific version with the fixed-versus-random policy. The tester records a set of power traces where the device runs with a fixed input and another set of traces where the device runs with random inputs. In implementation, the tester records traces for the different sets in random order to avoid possible device bias over time. For each point of the trace, the statistic

t = (X̄_f − X̄_r) / sqrt(s_f^2/n_f + s_r^2/n_r)

is calculated, where X̄_f, s_f, n_f and X̄_r, s_r, n_r are the sample mean, standard deviation, and number of traces of the fixed and random sets, respectively. When the number of recorded traces is large, the t statistic helps recognize whether both sets are sampled from distributions with the same population mean, which is our null hypothesis. In our context, this implies adversaries cannot distinguish some particular input. In practice, we reject the hypothesis if the t statistic exceeds the standard threshold ±4.5, which is set to guarantee a p-value under 0.00001. However, since our measured traces contain many sampled points, we follow [DZD+17] and alter this threshold to avoid false positives.

Masked Floating-Point Number Multiplication and Addition
We now introduce our main algorithms: the masked floating-point number multiplication and addition in Falcon, which rewrite FprMul in Algorithm 5 and FprAdd in Algorithm 6 in a masked design. To support their complicated operations on shares, we design three masked gadgets as subroutines of our main algorithms:

• SecNonzero: the masked algorithm which receives input shares and outputs whether the input is zero, in bit shares.
• SecFprUrsh: the masked algorithm which receives input Boolean shares (x_i) of x and arithmetic shares (c_i) of c and returns shares of x ≫ c with its sticky bit preserved.
• SecFprNorm64: the masked algorithm which left-shifts 64-bit Boolean shares (x_i) to make the 64th bit set. It then adds the shift counts to the other input shares (e_i).
We also use several gadgets from previous works. Table 1 lists all the gadgets used in this work and their t-NI or t-SNI security. For those proposed in this work (Algorithms 7, 8, 9, and 10), we provide their details in the remainder of this section and prove their t-NI or t-SNI security in Section 4.

Masked Nonzero Check
We start by introducing our nonzero-check algorithm. Observe that a k-bit number x is zero exactly when bit-wise OR-ing all its bits results in zero; that is,

[x ≠ 0] = x^(1) ∨ x^(2) ∨ · · · ∨ x^(k).

Let the input be some Boolean shares. As each bit of the shares is independent, we can bit-slice the shares and use the SecOr gadget to OR all the bits securely. The detail of the SecOr gadget is given in Algorithm 7, which applies De Morgan's law x ∨ y = ¬(¬x ∧ ¬y) and calls the masked AND algorithm SecAnd of shares as a subroutine. To increase efficiency, we OR not one bit but half of the register each time, which reduces the complexity from O(n^2 k) to O(n^2 log k) for n-shared k-bit numbers.
For arithmetic-share input (x_i), since x = Σ_{i=1}^{n} x_i, the value x is zero exactly when Σ_{i=1}^{n/2} x_i = −Σ_{i=n/2+1}^{n} x_i. We take the last n/2 shares, turn them negative, and use two (n/2)-share arithmetic-to-Boolean conversion gadgets A2B to create two Boolean sharings, each representing half of the shares of the input. Their concatenation is an n-share Boolean sharing of a value that is zero exactly when x is zero, so applying the above nonzero check of Boolean shares, we end up with one-bit Boolean shares indicating whether the input is nonzero. The whole algorithm is given in Algorithm 8.
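For intuition, the unmasked skeleton of the folding nonzero check looks as follows (our sketch; SecNonzero performs the same folding with SecOr on shares):

```c
#include <stdint.h>

/* OR the upper half of the register into the lower half, halving the
 * width each round; after log2(64) = 6 rounds the LSB is 1 iff any bit
 * of x was set. */
static uint64_t nonzero(uint64_t x) {
    for (unsigned len = 32; len >= 1; len >>= 1)
        x |= x >> len;
    return x & 1;
}
```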

Algorithm 7 SecOr
Input: Boolean shares (x_i)_{1≤i≤n} of a value x, Boolean shares (y_i)_{1≤i≤n} of a value y
Output: Boolean shares (z_i)_{1≤i≤n} of the value z = x ∨ y
1: x_1 ← ¬x_1, y_1 ← ¬y_1
2: (z_i) ← SecAnd((x_i), (y_i))
3: z_1 ← ¬z_1
4: return (z_i)
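To make the gadget concrete, here is a first-order (2-share) sketch in C, using the standard ISW multiplication for SecAnd (our illustration, not the paper's exact code). Complementing a Boolean-shared value only flips its first share, which is how De Morgan's law is applied on shares.

```c
#include <stdint.h>

/* 2-share ISW AND: given Boolean shares of x and y and one fresh random
 * word r, outputs Boolean shares of x & y. */
static void sec_and(const uint64_t x[2], const uint64_t y[2], uint64_t r,
                    uint64_t z[2]) {
    z[0] = (x[0] & y[0]) ^ r;
    z[1] = (x[1] & y[1]) ^ (r ^ (x[0] & y[1]) ^ (x[1] & y[0]));
}

/* SecOr via De Morgan: x | y = ~(~x & ~y); only first shares are flipped. */
static void sec_or(const uint64_t x[2], const uint64_t y[2], uint64_t r,
                   uint64_t z[2]) {
    uint64_t nx[2] = { ~x[0], x[1] }, ny[2] = { ~y[0], y[1] };
    sec_and(nx, ny, r, z);
    z[0] = ~z[0];
}
```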

Masked Unsigned Right-Shift
One step of the floating-point number addition is to right-shift the mantissa of the second operand by the difference of the exponents (line 11 of FprAdd). This makes the exponents of the two operands equal so that their mantissas can be added directly. Since the exponent difference is derived from the secret and is represented in shares, we cannot simply unmask it. The SecFprUrsh in Algorithm 9 right-shifts a Boolean-masked number by a value in arithmetic shares while preserving the sticky bit.
The idea goes as follows. For a 64-bit number, rotating it by some value c is equivalent to rotating by c mod 64. This shows we can rotate by a 6-bit arithmetic-masked value by sequentially rotating by each share of it. To recover the shifted result from the rotated one, we also rotate the constant 1 ≪ 63 by the same value. As there is only a single 1 in the bit representation of 1 ≪ 63, we can sequentially XOR and right-shift this value to set all the valid bits of our desired shifted result. Moreover, the unset bits mark the bits to discard, which determine the sticky bit. We use a SecNonzero to find the desired sticky bit and replace the least significant bit of the shifted result with it.
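An unmasked sketch (ours) of this rotate-and-mask idea, with the shift count already reduced mod 64; in the masked version the rotations of both x and the marker are performed share-by-share on the arithmetic shares of c:

```c
#include <stdint.h>

static uint64_t rotr(uint64_t v, unsigned c) {
    c &= 63;
    return c ? (v >> c) | (v << (64 - c)) : v;
}

/* Shift right by a "hidden" count c: rotate x and the marker 1<<63 by c,
 * spread the marker's single 1 downward to mark the valid result bits,
 * and fold the discarded bits into a sticky bit. */
static uint64_t ursh_via_rot(uint64_t x, unsigned c) {
    uint64_t rot = rotr(x, c);
    uint64_t m = rotr((uint64_t)1 << 63, c);   /* single 1 at bit 63-c */
    for (unsigned len = 1; len <= 32; len <<= 1)
        m |= m >> len;                         /* bits 63-c .. 0 now set */
    uint64_t sticky = (rot & ~m) != 0;         /* wrapped = discarded bits */
    return (rot & m) | sticky;
}
```

The bits of x below position c wrap around above position 63 − c, so masking with m both recovers x ≫ c and isolates exactly the discarded bits.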
It is noteworthy that we add a t-NI secure RefreshMasks in each iteration of the rotation. This removes the dependency between shares to achieve t-SNI security. A formal proof based on simulation and the properties of the RefreshMasks gadget is given in Section 4.

Masked 64-bit Normalization
Another crucial part of the floating-point number addition is normalizing a number to the range [2^63, 2^64) (line 14 of FprAdd). This is used to set the correct exponent and left-shift the mantissa sum into a valid floating-point number. In the Falcon reference implementation, the code sequentially checks whether the high-order bits are all zero and left-shifts the mantissa by the corresponding value. We follow this implementation and use our SecNonzero gadget to compute the shift counts. In addition, to add the shift counts to the exponent, we use the one-bit Boolean-to-arithmetic conversion algorithm B2A_Bit to transform the result of SecNonzero into arithmetic shares. The whole algorithm is given in Algorithm 10.
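For intuition, the unmasked binary normalization looks as follows (our sketch; SecFprNorm64 replaces the zero tests with SecNonzero on shares and the conditional shifts with share-wise operations):

```c
#include <stdint.h>

/* Normalize x to [2^63, 2^64) for x != 0: test the top window for zero,
 * conditionally shift, and halve the window, adjusting the exponent so
 * that x * 2^e stays constant. */
static void norm64(uint64_t *x, int16_t *e) {
    for (unsigned len = 32; len >= 1; len >>= 1) {
        if ((*x >> (64 - len)) == 0) {   /* top len bits all zero? */
            *x <<= len;
            *e -= (int16_t)len;
        }
    }
}
```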

Masked Floating-Point Number Packing
Given the 1-bit Boolean-shared sign bit (s_i), the 16-bit arithmetic-shared exponent (e_i), and the 55-bit Boolean-shared mantissa (z_i), the SecFPR in Algorithm 11 packs them into one Boolean sharing and rounds the mantissa to 53 bits with the round-to-nearest strategy. Similar to Algorithm 4, the procedure starts by adding a constant 1076 to the exponent. Then we turn the mantissa into zero if the result is smaller than 0 (line 4). The comparison is made by a 16-bit conversion gadget A2B and checking the most significant bit of the result. Next, we zero the exponent if the mantissa is zero, which we check by the 55th bit of the mantissa (line 5). The 55th bit is also added to the exponent by the Boolean-masked addition gadget SecAdd. We then pack the sign bit, the exponent, and the mantissa shifted right by 2 bits (line 9). The Refresh gadgets for the sign bit and the exponent are used here to satisfy t-SNI security. Finally, to do the rounding securely, we add to the mantissa the value derived by OR-ing the first and third bits (line 10) and then AND-ing the second bit (line 11). In this way, the three least significant bits 011, 110, or 111 cause an increment.
Note that the value will be indeterminate if the input exponent is too large, as it overflows into the sign bit in Algorithm 4. We omit this check and leave the responsibility for not letting it happen to users, which is what the Falcon reference implementation also suggests.

Masked Floating-Point Number Multiplication
We now introduce the masked floating-point number multiplication SecFprMul in Algorithm 12. To begin with, we consider its input floating-point numbers to be split into three parts of arithmetic shares: one-bit sign-bit shares (s_i), 16-bit exponent shares (e_i), and 128-bit mantissa shares (m_i), whose sums (modulo 2, 2^16, and 2^128, respectively) give the sign bit, the exponent, and the mantissa. We use this form of input to make the operations more straightforward. It should be noted that the pre-image vector computation is the first operation that must be masked in the signing process, and hence this form of input can be derived directly from unmasked values.
To start with, the sign-bit XOR and the exponent addition can be done simply by adding the corresponding shares of the inputs (lines 1 and 2). As in Algorithm 5, a constant −2100 is also added. The mantissa multiplication is done by the SecMult gadget (line 3), which multiplies each share of both operands and adds them together carefully by inserting random mask values.
The next step is to round the mantissa to the range [2^54, 2^55) and preserve the sticky bit. We first convert the arithmetic-shared mantissa into Boolean shares (line 4). The product is in the range [2^104, 2^106), so we shift the product by 50 or 51 bits according to the 106th bit (lines 6 to 10). Note that for one-bit shares, the negative of the shared value can be obtained by turning each share negative, which is used in our conditional shift in lines 9 and 10. The 106th bit is then converted to 16-bit arithmetic shares by the one-bit B2A_Bit and added to the exponent (line 13). To preserve the sticky bit, consider that for a shift by 50, we need to OR the 51st bit of the mantissa with the nonzero result of the last 50 bits, while for a 51-bit shift, we OR the 52nd bit with the nonzero result of the last 51 bits. Both cases can be handled by applying our SecNonzero to the last 51 bits and OR-ing the result with the shifted mantissa, which is done in lines 5 and 11.
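The share-wise negation trick rests on the identity −b = (−b_1) ⊕ (−b_2) for a Boolean-shared bit b: since b ∈ {0, 1}, −b simply replicates b into every bit position, and replication commutes with XOR. A small C check (ours):

```c
#include <stdint.h>

/* -b for b in {0,1}: 0 stays 0, 1 becomes the all-ones mask. */
static uint64_t neg_bit(uint64_t bit) { return (uint64_t)0 - bit; }

/* Verify that negating each Boolean share of a bit yields valid
 * Boolean shares of its negative: (-b1) ^ (-b2) == -(b1 ^ b2). */
static int neg_shares_ok(uint64_t b1, uint64_t b2) {
    return (neg_bit(b1) ^ neg_bit(b2)) == neg_bit(b1 ^ b2);
}
```

The resulting mask is what makes conditional operations branch-free: z ∧ (−b) keeps z when b = 1 and zeroes it when b = 0.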
Finally, we turn the mantissa into zero if any exponent of the input is zero (lines 14 to 17), and we also do this with the arithmetic-shared version of our SecNonzero gadget. Now we have the sign bit, exponent, and 55-bit mantissa in Boolean, arithmetic, and Boolean shares, respectively. We pack them into one Boolean-shared floating-point number with SecFPR.

Masked Floating-Point Number Addition
The SecFprAdd in Algorithm 13 takes in two Boolean-shared floating-point values and adds them in a masked way. We use Boolean shares as its input since it follows floating-point number multiplications in the pre-image vector computation.
The algorithm first exchanges the operands to make the first one no less than the second (lines 1 to 9). Thanks to the biased exponent, we can make the comparison by a simple subtraction and a check of the sign bit. Originally, the subtraction of two Boolean-masked values could be done in three steps: (1) inverting all the bits of the second operand, (2) adding 1 to the inverted result, and (3) adding the first operand, which applies the equation x − y = x + (¬y) + 1. To avoid an additional call to the SecAdd gadget, we only invert the bits and consider the boundary conditions. To put it clearly, let u, v be two values stored in 64-bit registers; we use the relation

u + (¬v) = u + (−v − 1) = (u − v) − 1 mod 2^64,

together with [u = v] = ¬[u ⊕ v ≠ 0]. The first and second equalities come from the two's complement representation, where an increment of 2^63 − 1 results in −2^63 and ¬v = −v − 1. The third equality is evaluated by the output of the SecNonzero gadget. To evaluate the range check of the value u + (¬v), we also use both of these facts in our algorithm (lines 4 and 5).

After the conditional swap, we extract the sign bit, exponent, and mantissa from the input shares. As in Algorithm 6, the mantissa is scaled up 3 bits for further rounding precision, and the exponent is turned into arithmetic shares and reduced by 1078. Then we use the SecFprUrsh gadget (Algorithm 9) introduced in Section 3.2 to right-shift the Boolean-masked mantissa of the second operand with the sticky bit preserved, so that both operands have identical exponents. Note that we set the value to zero if the exponent difference is larger than 59 before shifting, as indicated in lines 9 and 10 of the unmasked Algorithm 6 and lines 15 and 16 of Algorithm 13.
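An unmasked C sketch (ours) of the comparison that the swap relies on: thanks to the biased exponent, the low 63 bits of the IEEE-754 encoding order doubles by absolute value, so |x| ≥ |y| reduces to one unsigned integer comparison on the bit patterns.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 iff |x| >= |y|, comparing the encodings with the sign bit
 * masked off (finite values; biased exponents make this monotone). */
static int abs_geq(double x, double y) {
    uint64_t bx, by;
    memcpy(&bx, &x, sizeof bx);
    memcpy(&by, &y, sizeof by);
    return (bx & ~(1ULL << 63)) >= (by & ~(1ULL << 63));
}
```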
After the proper shift, we add/subtract the result to/from the mantissa of the first operand (line 24). The sum has a wide range, so we normalize it to [2^63, 2^64) by the SecFprNorm64 gadget (Algorithm 10) from Section 3.3 and right-shift the result by 9 bits (line 27). Finally, we take the sign bit and exponent of the first operand and the shifted mantissa sum in the range [2^54, 2^55), and the resulting floating-point number is given by calling SecFPR, as in the case of multiplication.

Security Proof
In this section, we sequentially prove that our designs with n = t + 1 shares are secure in the sense of t-NI and t-SNI security.
Proof. This is a direct result of the facts that the SecAnd gadget is t-SNI secure and that the negation is operated share-by-share.
Proof. We first show that the loop of rotations is itself t-SNI secure. Note that since there are n iterations, at least one of them is not probed; let it be the iteration j = j*. Since any set of output shares of RefreshMasks of size ≤ n − 1 is uniformly distributed ([BCZ18], Lemma 1), all the probes after j*, including probes of outputs, can be simulated with fresh randomness. Thus we only need to show that one can simulate the probes before j* with no greater number of shares.
Since the rotation is done share-by-share, one can simulate probes of (x_i) and (m_i) with the same number of input shares. As for the simulation of c_j, if the rotation in some iteration j = j′ is probed, one adds c_{j′} to the simulation set. Also, if the RefreshMasks in consecutive iterations j = j′ − 1, j′ are probed, one adds c_{j′} to the simulation set. Note that if the RefreshMasks are not consecutively probed, one can simulate c_{j′} with fresh randomness thanks to the uniformity of the RefreshMasks outputs. In this way, the size of the simulation set of c_j is no more than the number of probes, and the remaining probes are simulated with sets of sizes no more than |P_3|. Finally, one can simulate the probing set P_4 in the XOR and S_3^2 with the output shares S_4 of the rotation of (m_i). All the probes are now simulated with the output shares S_2^1 ∪ S_3^1 of the rotation of (x_i) and the output shares S_4 of the rotation of (m_i).
They, along with the internal probes of the rotation loop, can be simulated by input shares due to the t-SNI security we showed first.

We show that all probes in iteration j can be simulated with no greater number of shares of (e_i) and (x_i) as the input of the iteration. If this is the case, all probes across different iterations can be simulated with no greater number of input shares.
First, since the addition is done share-by-share, one can use some sets S_1^1 and S_1^2 of shares of (e_i) and of the B2A_Bit output, respectively, to simulate the corresponding probes.

(Figure: the probing sets P_i for some i are colored in red; the simulation sets S_i and S_i^j for some i, j are colored in blue; gadgets with t-NI and t-SNI security are marked in black and green, respectively.)
From the figure, one can list all the inequalities similarly and check that, for each gadget, no more input shares than probes are used in the simulation. In particular, the shares of (x_i) used in the simulation include S_5^2, S_15, and S_18 ∪ S_17^1.

Implementation and Evaluation
In this section, we provide the performance and security evaluations of our masked implementation. Our experiments were mainly implemented in plain C, but we rewrote some segments of the 2-shared version in assembly to reduce some observed leakages in the security evaluation, as discussed in Sections 5.2 and 5.3. We first tested the performance of our design on an Arm Cortex-M4 processor, and then we used the program from the Falcon reference code [PFH+20] to test the speed of one complete signing process on an Intel Core i9-12900KF CPU, a general-purpose processor. For the security evaluation, we ran our experiments on the ChipWhisperer-Pro Level 3 Starter Kit [Inc], which includes a main control board and a target board to run the main program. The control board clocks the target board at 7.38 MHz and measures its power consumption during execution at the same frequency. The target board STM32F415 (CW308T-STM32F4) with an Arm Cortex-M4 MCU was used.
In our implementation, the 128-bit multiplication in Algorithm 12 was realized by combining four 32-bit registers in C. We generated the randomness for our masked implementation beforehand and filled it into a table to be read off. We list the number of random bytes used by each algorithm in the performance evaluation subsection.

Performance Evaluation
We first evaluate the performance of our masked implementation on the Arm Cortex-M4 processor and compare it with the unmasked version from the Falcon reference code in the NIST round-3 submission [PFH + 20], which is a re-implementation of floating-point arithmetic also written in plain C. The cycle counts of floating-point number multiplication and addition are given in Table 2. For the higher-order mask evaluation, we provide results for the second-order mask (3 shares). All our code was compiled with arm-none-eabi-gcc 10.3.1 at optimization level -O3.
The masked floating-point multiplication incurs roughly a 23× overhead over the unmasked version for 2 shares and 118× for 3 shares. This shows that our nonzero check gadget SecNonzero contributes only a small fraction of the overhead of the whole multiplication and addition algorithms. For the 2-shared design, the bottleneck is the packing SecFPR, in which a 64-bit SecAdd is used. For the 3-shared design, the heaviest cost comes from the 3-shared 128-bit A2B, which internally calls a 128-bit SecAdd.
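For intuition, the unmasked analogue of SecNonzero is a logarithmic OR-fold that collapses all 64 bits into the least significant bit; the masked gadget replaces each OR with a SecOr on shares together with refreshing. A minimal sketch (the function name is ours):

```c
#include <stdint.h>

/* Sketch: return 1 iff x != 0, via a log-depth OR-fold.  After the
 * six folds, bit 0 equals the OR of all 64 input bits. */
uint64_t nonzero64(uint64_t x)
{
    x |= x >> 32;
    x |= x >> 16;
    x |= x >> 8;
    x |= x >> 4;
    x |= x >> 2;
    x |= x >> 1;
    return x & 1;
}
```

The masked variant performs the same six folding steps share-wise, which is why its cost stays small relative to the full multiplication or addition.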
The masked floating-point addition takes about 35× as many cycles as the unmasked version for our first-order design and about 99× for our second-order design. The main overhead is caused by the four 64-bit SecAdd functions, as is also the case in multiplication. Although these gadgets are costly in our implementation, it seems difficult to avoid the Boolean masked addition gadgets or the mask conversion gadgets altogether, since the mantissa needs to be rounded at different stages and the sticky bit needs to be preserved.
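The sticky bit itself is conceptually simple in the unmasked setting: it is a nonzero check on the bits shifted out, folded back into the shifted mantissa. A hedged unmasked sketch (not Falcon's exact code; names are ours, and 0 ≤ c < 64 is assumed):

```c
#include <stdint.h>

/* Sketch: shift m right by c bits and OR all shifted-out bits into
 * the result's least significant bit as a sticky bit, so later
 * rounding still "sees" the discarded precision. */
uint64_t ursh_sticky(uint64_t m, unsigned c)
{
    /* Bits that fall off the right edge; guard c == 0 to avoid an
     * undefined 64-bit shift. */
    uint64_t dropped = (c == 0) ? 0 : (m << (64 - c));
    uint64_t sticky  = (dropped != 0);   /* the nonzero check */
    return (m >> c) | sticky;
}
```

In the masked setting, `(dropped != 0)` becomes a SecNonzero call on shares, and the shift by a shared amount is what our masked right-shift gadget provides.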
In Table 3, we provide the speed of signing one message on the general-purpose Intel Core i9-12900KF CPU with our masking countermeasure applied to the pre-image vector computation, showing the amortized performance over the whole of Falcon. For this, we first replaced the floating-point arithmetic in the pre-image vector computation (line 3) with our masked multiplication and addition, and then unmasked the result after all the computations were done. Compared with the unmasked design, one signing process takes about 7.7× as long for 2 shares (about 1.9 ms for Falcon-512) and about

Security Evaluation
For the practical security evaluation, we conducted leakage assessment via the TVLA methodology introduced in Section 2.5. We performed TVLA on our first-order (2 shares) and second-order (3 shares) masked floating-point multiplication and addition, comparing them with the unmasked versions. Figure 8 shows our results; from left to right are the t-value statistics for the unmasked, first-order, and second-order implementations. The threshold of ±4.5 is drawn as red dotted lines, while for second-order traces we also draw green dotted lines at the threshold recommended in [DZD + 17]. For multiplication, the traces have a length of 295,727 samples and we set this threshold to ±6.628; for addition, with a trace length of 387,764 samples, the threshold is set to ±6.668.
For the unmasked functions, we measured a total of 1,000 traces, and almost every point exceeds the threshold. The results improve for the first-order implementation, but some points still exceed the threshold. After rewriting part of the code in assembly, adding redundant operations, and rearranging the order of independent instructions, every point stays within the threshold over 10,000 traces. However, in tests with 100,000 traces, some values still crossed the thresholds, which suggests the device may implicitly leak first-order information in the 2-shared implementation; we discuss this problem more thoroughly in Section 5.3. For the second-order implementation, almost all points are within the threshold of ±4.5 over 100,000 traces, and all points pass the test with the adapted thresholds. The second-order implementation thus eliminates the leakage that appeared in the first-order design.

Discussion about the Leakage of the First-Order Design
The first-order implementation without further optimization still showed leakage with 10,000 power consumption traces in our experiment. Similar results were found in previous works [BGR + 21, BC22]. On the other hand, our second-order implementation passes the TVLA test with 100,000 traces, which indicates that the leakage in the first-order design might be caused by unexpected device behavior. As pointed out and organized in [GD23], probing security cannot capture the physical defaults of devices: glitches and transition-based leakage, which depend on the Hamming distance between two consecutive values written to a memory cell, can cause power consumption correlated with the unmasked secret. The leakage in [BGR + 21, BC22] was eliminated or mitigated through assembly optimization. In our experiment, we used a defensive approach similar to that of [BC22], such as adding a dummy load and store operation before and after each consecutive share-wise operation where leakage appeared in the first-order TVLA result. We also found that shift operations could induce leakage, so we inserted redundant shift operations around the true ones. In addition, we separated dependent instructions to avoid potential leakage from the Hamming distance or hidden buffers. With these revisions, we removed all the high t-values in the tests with 10,000 traces. Nevertheless, our first-order implementation failed to pass the test with 100,000 traces.
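The dummy load/store pattern can be sketched as follows (purely illustrative; real placement must follow the observed leakage points, and the accesses must be `volatile` so the compiler cannot reorder or remove them):

```c
#include <stdint.h>

/* Volatile scratch cell: accesses to it are real bus/memory traffic
 * the compiler must emit in order. */
volatile uint64_t dummy_cell;

/* Sketch: a share-wise XOR with a dummy store before and a dummy
 * load after each share access, so two different shares are not
 * written back-to-back through the same buffer or bus register. */
void sharewise_xor(uint64_t *z, const uint64_t *x,
                   const uint64_t *y, int nshares)
{
    for (int i = 0; i < nshares; i++) {
        dummy_cell = 0;                 /* dummy store before the share op */
        z[i] = x[i] ^ y[i];
        uint64_t t = dummy_cell;        /* dummy load after the share op */
        (void)t;
    }
}
```

Note that the dummy value is a constant independent of all shares; a dummy that mixes shares would itself become a leakage (or recombination) hazard.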
Two approaches could improve this result. The first is a thorough assembly rewrite: with programs written in assembly, one can control each register to avoid the transition-based leakage potentially introduced when compiling from a high-level language. However, hidden registers and other memory units in the processor [GOP22, MPW21] cannot be directly accessed; they may still induce transition-based leakage or even recombination of shares. Another concern with this strategy is that the design can vary across devices. For example, we found different leakage patterns and locations when applying the same optimization method on STM32F303 and STM32F415 target boards; by contrast, the TVLA results of the second-order implementation were similar on both. The second approach is a secure design in the robust probing model [FGP + 18, MMSS19], which captures typical physical defaults such as glitches. Unfortunately, to the best of our knowledge, a glitch-resistant design of the SecAdd, A2B, and B2A gadgets in this model is still unknown, and such a design may require more than two shares and reduce efficiency.

Conclusion
In this work, we provide a masking scheme for Falcon's floating-point number multiplication and addition. To round the mantissa and compute the sticky bit efficiently, we design a masked nonzero check algorithm that determines whether a shared value is nonzero; it can also be used to check the equality of two values and to normalize a number. In addition, a masked right-shift and a masked normalization algorithm are proposed to add two floating-point numbers securely. The former securely shifts a value by an arithmetically shared amount while preserving the sticky bit, and the latter normalizes a 64-bit number into the range [2^63, 2^64).
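As an unmasked reference for what SecFprNorm64 computes (the masked gadget instead uses constant-time conditional shifts by 32, 16, ..., 1 on shares), a sketch assuming a nonzero input:

```c
#include <stdint.h>

/* Sketch: normalize a nonzero 64-bit mantissa into [2^63, 2^64) by
 * shifting left until the top bit is set, decrementing the exponent
 * once per shift so the represented value is unchanged. */
void norm64(uint64_t *m, int32_t *e)
{
    while ((*m >> 63) == 0) {   /* assumes *m != 0 */
        *m <<= 1;
        *e -= 1;
    }
}
```

The data-dependent loop here is exactly what a masked design must avoid; replacing it with a fixed sequence of shift-by-2^k steps, each conditioned on a shared bit, yields a constant-time equivalent.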
We provide formal proofs showing that our design is secure in the t-probing model. Specifically, we apply the t-NI and t-SNI definitions and prove the security of our gadgets by simulation. In terms of practical leakage on the board, we conducted leakage assessment experiments via TVLA. The first-order countermeasure, with part of the functions rewritten in assembly, redundant operations added, and execution orders rearranged, passes the test with 10,000 traces. With second-order masking, there is no significant leakage over 100,000 measured traces.
For the performance evaluation, we compare cycle counts among the unmasked, first-order, and second-order implementations on an Arm Cortex-M4 core. To achieve complete masking of procedures containing both Boolean and arithmetic operations, our algorithms call the arithmetic-to-Boolean mask conversion gadget A2B and the Boolean masked addition gadget SecAdd several times; these turn out to cause the largest overhead in our design. We also tested the speed on an Intel Core CPU, which shows that a complete signing process with our countermeasure finishes within a few milliseconds.
Throughout the signing algorithm of Falcon, attacks on and defenses of the Gaussian sampler have been discussed. However, the pre-image vector computation and other parts of the fast Fourier sampler remain at risk, even though no attacks on the fast Fourier sampler have been published so far. A complete masking of Falcon can be constructed by combining our work with a masked design of the sampler. With our implementation and evaluation of the masking scheme, Falcon can resist the known attacks on the pre-image vector computation and thus be used more securely.

Algorithm 6 FprAdd
Input: floating-point numbers x and y
Output: the floating-point sum of x and y

Figure 1 :
Figure 1: An abstract diagram of each iteration in SecNonzero (Algorithm 8). The probing sets O, P_1, P_2 are colored in red, and the simulation sets S_1^1, S_1^2, S_2 are colored in blue. Gadgets with t-NI and t-SNI security are marked in black and green, respectively.

Figure 2 :
Figure 2: An abstract diagram of SecFprUrsh (Algorithm 9). The probing sets O and P_i for some i are colored in red, and the simulation sets S_i and S_i^j for some i, j are colored in blue. Gadgets with t-NI and t-SNI security are marked in black and green, respectively.

Lemma 4.
The gadget SecFprNorm64 (Algorithm 10) is t-NI secure.
Proof. An abstract diagram of each iteration in SecFprNorm64 is given in Figure 3. Let the adversary probe, in iteration j, the set of intermediate values P

Figure 6 :
Figure 6: An abstract diagram of the swap part in SecFprAdd (Algorithm 13). The probing sets P_i for some i are colored in red, and the simulation sets S_i and S_i^j for some i, j are colored in blue. Gadgets with t-NI and t-SNI security are marked in black and green, respectively.

Figure 7 :
Figure 7: An abstract diagram of the operations following the swap in SecFprAdd (Algorithm 13). The probing sets O and P_i for some i are colored in red, and the simulation sets S_i and S_i^j for some i, j are colored in blue. Gadgets with t-NI and t-SNI security are marked in black and green, respectively.

Table 1:
List of used gadgets in our work with n = t + 1 shares.

Since SecOr is t-SNI secure, for any probing set P_1 in the SecOr gadget and set O of its output shares, one can use some set S_1^1 of outputs of Refresh and some set S_1^2 of shares of (r_i) to simulate both P_1 and O with |S_1^1|, |S_1^2| ≤ |P_1|. Also, since Refresh is t-SNI secure, for any probing set P_2 in the Refresh gadget, one can use some set S_2 of shares of (t_i) to simulate both P_2 and S_1^1 with |S_2| ≤ |P_2|. In summary, for probed output shares O and internal values P_1, P_2, one can use S_1^2 and S_2 to simulate all of them with |S_1^2 ∪ S_2| ≤ |P_1| + |P_2|. This shows each iteration is t-SNI secure, and the whole loop is thus t-SNI secure.

Input: Boolean shares (x_i)_{1≤i≤n} and (y_i)_{1≤i≤n} representing floating-point numbers x and y
Output: Boolean shares representing the floating-point sum
1: (xm_i) ← (x

|S_15| ≤ |P_15|, |S_18 ∪ S_17^1| ≤ |P_18| + |P_17| + |S_7^1| ≤ |P_18| + |P_17| + |P_7|. For shares of (y_i), S_12^2, S_16, and S_17^2 are used, and

Table 2 :
Performance of each component in Algorithm 11, 12, and 13 on Arm Cortex-M4.We count the cycles for subroutines and the total random numbers used in bytes.

Table 3 :
Time (in microseconds) for signing a message on Intel-Core i9-12900KF CPU.