# Masking Floating-Point Number Multiplication and Addition of Falcon 

First- and Higher-order Implementations and Evaluations

Keng-Yu Chen ${ }^{1}$ and Jiun-Peng Chen ${ }^{1,2}$<br>${ }^{1}$ National Taiwan University, Taipei, Taiwan, r11921066@ntu.edu.tw<br>${ }^{2}$ Academia Sinica, Taipei, Taiwan, jpchen@ieee.org


#### Abstract

. In this paper, we provide the first masking scheme for floating-point number multiplication and addition to defend against recent side-channel attacks on FALCON's pre-image vector computation. Our approach involves a masked nonzero check gadget that securely identifies whether a shared value is zero. This gadget can be utilized for various computations such as rounding the mantissa, computing the sticky bit, checking the equality of two values, and normalizing a number. To support the masked floating-point number addition, we also developed a masked shift and a masked normalization gadget. Our masking design provides both first- and higherorder mask protection, and we demonstrate the theoretical security by proving the (Strong)-Non-Interference properties in the probing model. To evaluate the performance of our approach, we implemented unmasked, first-order, and second-order algorithms on an Arm Cortex-M4 processor, providing cycle counts and the number of random bytes used. We also report the time for one complete signing process with our countermeasure on an Intel-Core CPU. In addition, we assessed the practical security of our approach by conducting the test vector leakage assessment (TVLA) to validate the effectiveness of our protection. Specifically, our TVLA experiment results for second-order masking passed the test in 100,000 measured traces.


Keywords: Falcon, Floating-Point Arithmetic, Masking, Post-Quantum Cryptography, Side-Channel Analysis

## 1 Introduction

The rapid development of quantum computers has posed a potential threat to public key cryptography systems. Due to Shor's algorithm [Sho97], public key cryptographic schemes based on the hardness of integer factorization and discrete logarithm, including RSA [RSA78], Diffie-Hellman key agreement [DH76], ElGamal encryption [Elg85], and ECDSA [JMV01], are vulnerable to large-scale quantum computing. To defend against such a menace, post-quantum cryptography, which studies algorithms that are considered secure against quantum computing, has been widely researched. In 2016, the National Institute of Standards and Technology (NIST) initiated a standardization process for postquantum cryptography [oSTa]. Recently, the four selected algorithms, CRYSTALS-Kyber, CRYSTALS-Dilithium, SPHINCS ${ }^{+}$, and FALCON became part of NIST's post-quantum cryptographic standards, which are expected to be finalized in about two years [oSTb].

Unfortunately, even with conjectured quantum resistance, cryptographic implementations may not be secure. Side-channel analysis considers the threat of information leakage from an electronic device running cryptographic algorithms through its hardware physical behaviors. These could be the running time of different inputs or the electromagnetic emanation and power consumption during the execution. Many works have been
done on implementing such attacks against post-quantum algorithms. Therefore, some side-channel resistance in post-quantum cryptography has been proposed in recent years. For example, studies including $\left[\mathrm{BGR}^{+} 21, \mathrm{FBR}^{+} 22, \mathrm{HKL}^{+} 22\right]$ provided countermeasures for CRYSTALS-Kyber, and countermeasures for CRYSTALS-Dilithium was presented in [MGTF19]. However, the side-channel resistance of FALCON has lacked discussion.

Related Work The signing process of Falcon faces at least two side-channel vulnerabilities: the Gaussian sampler and the pre-image vector computation. The Gaussian sampler returns numbers following the discrete Gaussian distribution, and previous works [BHLY16, EFGT17, PBY17, MHS ${ }^{+}$19] have demonstrated the threat of timing attacks on the sampler. An isochronous design as a countermeasure was developed in [HPRR20]. Besides, a masked implementation of the Gaussian sampler is provided in [EFG $\left.{ }^{+} 22\right]$. Recently, Guerreau et al. [GMRR22] proposed a simple power analysis attack on Falcon and a related light countermeasure, and their work was further improved in [ZLYW23], which provides more potential sources of leakage and reduces the number of required traces by new algorithms.

The pre-image vector computation in Falcon is used for finding the short vector solution as the signature. In [KA21], Karabulut and Aysu first performed an EM attack on such computation in Falcon, whose method in their setting can recover the secret key with a few thousand measured traces. Their attack was later improved by Guerreau et al. [GMRR22], which reduces the guess space complexity from $2^{27}$ to only $2^{11}$. However, the protection of the pre-image vector computation in FALCON has so far seldom been discussed. At the same time, the authors of Falcon $\left[\mathrm{PFH}^{+} 20\right]$ suggest that the masking scheme should be considered to protect FALcon from side-channel attacks.

## Our Contributions

- We propose the first masking scheme on the floating-point number multiplication and addition in the pre-image vector computation of FALCON as a countermeasure. To support functions for our masked design, we devise masked gadgets for the nonzero checks, unsigned right-shift, and 64-bit normalization. We provide the details of our algorithms and show that our approach can be extended to high-order protections.
- We verify the high-order security of our design in the probing model by the concepts of Non-Interference security $\left[\mathrm{BBD}^{+} 16\right]$. We offer formal proofs via simulation to show they can resist high-order probing attacks.
- To test the practical leakage of our work, we implemented our algorithms on ChipWhisperer-Pro [Inc] with STM32F415 target. We conducted the Test Vector Leakage Assessment (TVLA) [GGJR ${ }^{+} 11$ ] experiments using 100,000 measured traces to demonstrate the effectiveness of our countermeasure.
- We tested the performance of our work on an Arm Cortex-M4 core and Intel-Core i9-12900KF CPU. The evaluation results are presented for our first- and second-order implementations and compared with the original unmasked design.

Outline We organize our work as follows. In Section 2, we give notations throughout the paper and review some contexts for our work. We introduce in Section 3 our masking algorithms of floating-point number multiplication and addition in detail. The security proofs in the probing model of our design are given in Section 4. The performance evaluation of our implementation and security assessment results of TVLA are given in Section 5.

## 2 Preliminaries

### 2.1 Notation

In the rest of the paper, $N=2^{\kappa}$ for some integer $\kappa, M>N, q$ is a prime number, and $\phi=x^{N}+1$ is the cyclotomic polynomial. For a polynomial $f=\sum_{i=0}^{N-1} f_{i} x^{i}$, its adjoint is $f^{*}=f_{0}-\sum_{i=1}^{N-1} f_{i} x^{N-i}$. We write a vector $\mathbf{v}$ in bold, and a matrix $\mathbf{A}$ is written in bold and uppercase. The adjoint of a vector $\mathbf{v}$ or a matrix $\mathbf{A}$, which is the transpose of adjoint of each of its coefficient, is written as $\mathbf{v}^{*}$ or $\mathbf{A}^{*}$. For a polynomial $f$ modulo $\phi$, we may write it as an $N$-by- $N$ matrix, with the $i$ th row representing the coefficients of $\left(x^{i} f \bmod \phi\right)$ for $0 \leq i \leq N-1$. Matrix additions and multiplications will map to polynomial additions and multiplications in the ring of polynomials modulo $\phi$ in this tradition.

For a variable $x$, the $j$ th bit of $x$ is written as $x^{(j)}$. The LSB is the first bit, and the MSB is the $k$ th bit if it is stored in a $k$-bit register. The $i$ th bit to $j$ th bit $(j \geq i)$ of $x$ is represented by $x^{[j: i]}$. A sequence of $n$ variables $\left(x_{1}, x_{2}, \cdots, x_{n}\right)$ (e.g. shares of variable $x)$ is written as $\left(x_{i}\right)_{1 \leq i \leq n}$, or simply $\left(x_{i}\right)$ if the sequence length is not our point, or it is obvious from contexts. Without specifying, $n$ will be the sequence length of $\left(x_{i}\right)$.

We use notations $\oplus, \wedge, \vee$ to denote the bit-wise XOR, AND and OR operations, respectively. For variable $x$, we denote by $\neg x$ the bit inversion of $x$. Also, for a negative integer in a register, we consider two's complement representation and write $-x=(\neg x)+1$. The $\ll$ and $\gg$ are unsigned left- and right-shift of a variable. In addition, for a proposition $P$, we let $\llbracket P \rrbracket=1$ if $P$ is true and 0 if otherwise.

### 2.2 Falcon Signature Scheme

In this section, we briefly review the FALCON signature scheme, emphasizing on the preimage vector computation in the Fourier domain which is strongly related to our work. For more details, we refer the readers to the NIST round-3 submission of FALCON [ $\mathrm{PFH}^{+}$20].

The basis of Falcon is the GPV framework [GPV08]. At a high level, the framework uses a full-rank matrix $\mathbf{A} \in \mathbb{Z}_{q}^{N \times M}$ as public key, and $\mathbf{B} \in \mathbb{Z}_{q}^{M \times M}$ as secret key where $\mathbf{B} \mathbf{A}^{T}=0 \bmod q$. To sign a message $m$, one first computes $H(m)$ for some hash function $H:\{0,1\}^{*} \rightarrow \mathbb{Z}_{q}^{N}$, and the signature is a short vector $\mathbf{s}$, which is derived by trapdoor $\mathbf{B}$, that satisfies $\mathbf{s} \mathbf{A}^{T}=H(\mathrm{~m})$. To verify, one can simply check that the received $\mathbf{s}$ is short and satisfies the above equation.

FALCON instantiates the GPV framework on NTRU lattices. The secret key is essentially 4 polynomials $f, g, F, G$ in $\mathbb{Z}_{q}[x] /(\phi)$ which satisfy the NTRU equation

$$
f G-g F=q \bmod \phi
$$

while the public key is now the polynomial $h=g f^{-1} \bmod (\phi, q)$. To sign a message securely, coefficients of $f$ and $g$ are sampled from a discrete Gaussian distribution with a small standard deviation $\sigma_{\{f, g\}}=1.17 \sqrt{q / 2 N}$, and $F, G$ are generated from $f, g$ by solving the NTRU equation. Matrices $\mathbf{A}, \mathbf{B}$ are formed as

$$
\mathbf{A}=\left[\begin{array}{l|l}
1 & h^{*}
\end{array}\right], \quad \mathbf{B}=\left[\begin{array}{c|c}
g & -f \\
\hline G & -F
\end{array}\right]
$$

The second part of Falcon key generation is the computation of the Falcon tree, which will be used in looking for the short vector in signing. In short, the Falcon tree is a binary tree where each node at height $\kappa$ is a polynomial in $\mathbb{Q}[x] /\left(x^{2^{\kappa}}+1\right)$. The Gram matrix $\mathbf{G}=\mathbf{B B} \mathbf{B}^{*}$ is first comptuted, and a FALCON tree is built by recursively calling the LDL decomposition on each element of the diagonal matrix and store the lower-triangular matrix as a node value. The leaf node is then normalized by a constant $\sigma$. The whole key generation process is given in Algorithm 1.

```
Algorithm 1 FALCON.Keygen from \(\left[\mathrm{PFH}^{+} 20\right]\)
Input: Polynomial \(\phi\), prime modulus \(q\)
Output: Falcon secret key sk, public key pk
    \(f, g, F, G \leftarrow \operatorname{NTRUGen}(\phi, q)\)
    \(\hat{\mathbf{B}} \leftarrow\left[\begin{array}{l|l}\mathrm{FFT}(g) & \mathrm{FFT}(-f) \\ \hline \mathrm{FFT}(G) & \operatorname{FFT}(-F)\end{array}\right]\)
    \(\mathbf{G} \leftarrow \hat{\mathbf{B}} \times \hat{\mathbf{B}}^{*}\)
    \(\mathrm{T} \leftarrow \mathrm{ffLDL}(\mathbf{G}) \quad \triangleright\) generate FALCON tree
    for each leaf leaf of T do
        leaf.value \(\leftarrow \sigma / \sqrt{\text { leaf.value }}\)
    sk \(\leftarrow(\hat{\mathbf{B}}, \mathrm{T})\)
    \(h \leftarrow g f^{-1} \bmod q\)
    \(\mathrm{pk} \leftarrow h\)
    return sk, pk
```

The signing process of Falcon in Algorithm 2 starts by sampling a random salt $\mathbf{r}$ and computing $c=H(\mathbf{r} \| \mathrm{m})$. Next it finds a short vector $\mathbf{s}=\left(s_{1}, s_{2}\right)$ such that $\mathbf{s A}^{*}=s_{1}+s_{2} h=c$. To do so, the vector $\mathbf{t}=(c, 0) \times \mathbf{B}^{-1}$ is first computed, which we call the pre-image vector. Then the fast Fourier nearest plane algorithm [DP16] (ffSampling in Algorithm 2) is applied to find some $\mathbf{z}$ where $\mathbf{s}=(\mathbf{t}-\mathbf{z}) \mathbf{B}$ is close to zero in a secure way, in the sense that no information about secret basis $\mathbf{B}$ is leaked. The signature consists of the salt $\mathbf{r}$ and the second coefficient $s_{2}$. The verification in Algorithm 3 is relatively easy. One first retrieves $s_{1}$ by $H(\mathrm{r} \| \mathrm{m})-s_{2} h$, and then check whether the norm of vector $\left(s_{1}, s_{2}\right)$ is small.

```
Algorithm 2 FALCON.Sign from \(\left[\mathrm{PFH}^{+} 20\right]\)
Input: Message m, secret key sk and a bound \(\left\lfloor\beta^{2}\right\rfloor\)
Output: Signature sig of \(m\)
    \(r \stackrel{\$}{\leftarrow}\{0,1\}^{320}\)
    \(c \leftarrow H(\mathrm{r} \| \mathrm{m})\)
    \(\mathbf{t} \leftarrow\left(-\frac{1}{q} \mathrm{FFT}(c) \odot \mathrm{FFT}(F), \frac{1}{q} \mathrm{FFT}(c) \odot \mathrm{FFT}(f)\right) \quad \triangleright\) pre-image vector computation
    do
        do
            \(\mathrm{z} \leftarrow \mathrm{ffSampling}(\mathrm{t}, \mathrm{T})\)
            \(\mathbf{s} \leftarrow(\mathbf{t}-\mathbf{z}) \hat{\mathbf{B}}\)
        while \(\|\mathbf{s}\|^{2}>\left\lfloor\beta^{2}\right\rfloor\)
        \(\left(s_{1}, s_{2}\right) \leftarrow \operatorname{invFFT}(\mathbf{s})\)
        \(\mathrm{s} \leftarrow\) Compress \(\left(s_{2}\right)\)
    while \(s=\perp\)
    \(\operatorname{sig} \leftarrow(r, s)\)
    return sig
```

To use the fast Fourier sampler and increase the speed in the signing process, Falcon applies the fast Fourier transform on polynomials. Let $\Omega_{\phi}$ be the sets of all complex roots of $\phi$, the fast Fourier transform of a polynomial $f$ is

$$
\operatorname{FFT}(f)=(f(\xi))_{\xi \in \Omega_{\phi}}, \quad \Omega_{\phi}=\left\{\left.\exp \left(\frac{i \pi(2 k+1)}{N}\right) \right\rvert\, 0 \leq k<N\right\}
$$

In this representation, polynomial additions and multiplications can be done by performing

```
Algorithm 3 FALCON.Verify from [ \(\left.\mathrm{PFH}^{+} 20\right]\)
Input: Message \(m\), signature sig \(=(r, s)\), public key pk, and a bound \(\left\lfloor\beta^{2}\right\rfloor\)
Output: Accept or reject
    \(c \leftarrow H(\mathrm{r} \| \mathrm{m})\)
    \(s_{2} \leftarrow\) Decompress(s)
    if \(s_{2}=\perp\) then
        reject
    \(s_{1} \leftarrow c-s_{2} h \bmod q\)
    if \(\left\|\left(s_{1}, s_{2}\right)\right\|^{2} \leq\left\lfloor\beta^{2}\right\rfloor\) then
        accept
    else
        reject
```

operations on each coefficient, which is a complex number. Therefore, the pre-image vector computation is composed of complex number multiplication of each coefficient.

In practice, a complex number $a+b i$ is represented by two floating-point numbers $a, b$ as its real and imaginary parts. As pointed out in their document $\left[\mathrm{PFH}^{+} 20\right]$, it may be hard to implement Falcon on constrained devices without a floating-point unit. Falcon thus provides an emulation of floating-point operations with 53 bits of precision in their reference implementation. They follow the IEEE-754 standard to store a 64 -bit floating-point number by 1-bit sign, 11-bit exponent, and 53 -bit mantissa, where the 53 rd bit is always 1 and omitted when storing. To make the comparison of values more straightforward, the exponent is biased with a value 1023 to make it unsigned. For convenience, we will represent a 64-bit floating-point number $x$ as a tuple $(s, e, m)$ where

$$
x=(-1)^{s} \cdot 2^{e-1023} \cdot\left(m \cdot 2^{-52}\right)
$$

Compared to the floating-point number architecture, $s$ is the 64 th bit, $e$ is the 53 rd to 63 rd bit, and $m$ is the 1 st to 52 nd bit of the floating-point number with the omitted one at the beginning. In this representation, $m$ may be viewed as an integer in $\left[2^{52}, 2^{53}\right.$ ).

### 2.3 Floating-Point Number Multiplication and Addition

The pre-image vector computation, or coefficient-wise complex number multiplication, is performed by combinations of floating-point number multiplications and additions. We here briefly introduce how FALCON emulates them in their given reference implementation.

We begin by introducing the last subroutine of both the multiplication and addition the floating-point number rounding and packing function FPR (Algorithm 4), which receives a 55-bit mantissa, an unbiased exponent, and a sign bit and outputs the floating-point number with a rounded mantissa and biased exponent. It starts by adding the exponent with a constant 1076 for the bias (1023) and mantissa size (53). If the result is smaller than 0 , it is considered subnormal, and the mantissa should be turned to zero. Then it zeros the exponent if the mantissa is zero. Next, the sign bit and mantissa are combined together and added with the exponent. Note the exponent will be incremented by the top bit of the mantissa. Finally, the rounding is done by checking the least three significant bits of the mantissa. It follows the round-to-nearest strategy: if they are 011,110 , or 111, a carry 1 is added.

Now we introduce the floating-point number multiplication in Algorithm 5. The multiplication first XORs the sign bits and sums the exponents of both operands, and then does the mantissa multiplication. Note that a constant -2100 is added to the exponent for exponent bias $(1023 \times 2)$, scaling of mantissa $(52 \times 2)$, and shifting of the later multiplication product $(-50)$. The product is a 106 -bit raw mantissa, and rounding is performed to

```
Algorithm 4 FPR
Input: Sign bit \(s\), exponent \(e\), and 55 -bit mantissa \(z\)
Output: Floating-Point number \(x\) packed by \(s, e, z\)
    \(e \leftarrow e+1076\)
    \(b \leftarrow \llbracket e<0 \rrbracket\)
    \(z \leftarrow z \wedge(b-1)\)
    \(b \leftarrow \llbracket z \neq 0 \rrbracket\)
    \(e \leftarrow e \wedge(-b)\)
    \(x \leftarrow((s \ll 63) \vee(z \gg 2))+e \ll 52\)
    \(f \leftarrow 0 \mathrm{XC8} \gg z^{[3: 1]}\)
    \(x \leftarrow x+f^{(1)} \quad \triangleright\) increment if \(z^{[3: 1]}\) is 011,110 or 111
    return \(x\)
```

reduce the mantissa to 55 bits in which the 55 th bit is set. Notice that we need to make the sticky bit preserved after the rounding. If either of the exponents is zero, the computations above are invalid, and the mantissa is turned to zero. Finally, the sign bit, unbiased exponent, and 55 -bit mantissa are packed into one floating-point number by the FPR in Algorithm 4.

```
Algorithm 5 FprMul
Input: Floating-Point numbers \(x=(s x, e x, m x)\) and \(y=(s y, e y, m y)\)
Output: Floating-Point numbers product of \(x\) and \(y\)
    \(s \leftarrow s x \oplus s y\)
    \(e \leftarrow e x+e y-2100\)
    \(z \leftarrow m x \times m y\)
    \(b \leftarrow \llbracket z^{[50: 1]} \neq 0 \rrbracket\)
    \(z \leftarrow z^{[106: 51]} \vee b\)
    \(z^{\prime} \leftarrow(z \gg 1) \vee z^{(1)}\)
    \(w \leftarrow z^{(106)}\)
    \(z \leftarrow z \oplus\left(z \oplus z^{\prime}\right) \wedge(-w) \quad \triangleright\) round to 55 bits with sticky bit preserved
    \(e \leftarrow e+w \quad \triangleright\) increment if the product carries to 106th bit
    \(b x \leftarrow \llbracket e x \neq 0 \rrbracket, b y \leftarrow \llbracket e y \neq 0 \rrbracket\)
    \(b \leftarrow b x \wedge b y\)
    \(z \leftarrow z \wedge(-b)\)
    return \(\operatorname{FPR}(s, e, z)\)
```

The floating-point number addition in Algorithm 6 first exchanges the operands to make the absolute value of the first one no less than that of the other. Since the exponent is biased, one can compare the absolute values of two floating-point numbers by comparing their least 63 bits. If both operands are only differed by the sign, it lets the first operand be the positive one. Then it extracts both operands' sign bit, exponent, and mantissa, where the mantissa is scaled up to 55 bits for further rounding. The exponent is subtracted by 1078 for the bias (1023) and mantissa scaling (55). Next, it shifts the second operand according to the exponent difference while preserving the sticky bit and adds both mantissa. The sum is first normalized to $\left[2^{63}, 2^{64}\right)$; that is, the 64 th bit is set and then scaled down to $\left[2^{54}, 2^{55}\right.$ ) with the sticky bit preserved. Finally, the packing helps to return a 64 -bit floating-point number with a 53 -bit mantissa, as it does in multiplication.

It should be noticed that the floating-point number multiplication and addition do not follow the associative law and distributive law. In other words, for some floating-point numbers $a, b$, and $c$,

$$
a+(b+c) \neq(a+b)+c \quad \text { or } \quad a \times(b+c) \neq a \times b+a \times c
$$

```
Algorithm 6 FprAdd
Input: Floating-Point numbers \(x\) and \(y\)
Output: Floating-Point numbers sum of \(x\) and \(y\)
    \(d \leftarrow x^{[63: 1]}-y^{[63: 1]}\)
    \(c s \leftarrow d^{(64)} \vee\left(\left(1-(-d)^{(64)}\right) \wedge x^{(64)}\right) \quad \triangleright \llbracket|x|<|y| \rrbracket \vee\left(\llbracket|x| \leq|y| \rrbracket \wedge \llbracket x^{(64)}=1 \rrbracket\right)\)
    \(m \leftarrow(x \oplus y) \wedge(-c s)\)
    \(x \leftarrow x \oplus m, y \leftarrow y \oplus m \quad \triangleright\) swap \(x\) and \(y\) if necessary
    Extract ( \(s x, e x, m x\) ) and \((s y, e y, m y)\) from \(x, y\), respectively.
    \(m x \leftarrow m x \ll 3, m y \leftarrow m y \ll 3\)
    \(e x \leftarrow e x-1078, e y \leftarrow e y-1078\)
    \(c \leftarrow e x-e y\)
    \(b \leftarrow \llbracket c<60 \rrbracket\)
    \(m y \leftarrow m y \wedge(-b)\)
    \(m y \leftarrow(m y \gg c) \vee \llbracket m y^{[c: 1]} \neq 0 \rrbracket\)
    \(s \leftarrow s x \oplus s y\)
    \(z \leftarrow m x+(-1)^{s} m y\)
    Normalize \(z, e x\) to make the 64th bit of \(z\) set
    \(z \leftarrow(z \gg 9) \vee \llbracket z^{[9: 1]} \neq 0 \rrbracket\)
    \(e x \leftarrow e x+9\)
    return \(\operatorname{FPR}(s x, e x, z)\)
```

This makes it complicated to design a masked implementation of Falcon without constructing new ways to do the multiplication and addition. In this paper, we follow Falcon reference implementation and rewrite the floating-point number multiplication and addition in a masked way. Our masked functions return the same values as the existing functions, and they can apply to both security levels provided by Falcon (i.e., Falcon-512 and FALCON-1024) since they use the same floating-point number arithmetic.

### 2.4 Masking

Side-channel attacks can extract secret information in cryptographic devices by measuring their physical behavior during the computation, and masking helps avoid leakage by randomizing secret variables in each operation. Essentially, it splits each sensitive value into several shares that are randomized every round. In this way, the attacker cannot gain any information if only a limited number of intermediate variables are seen.

Common masking methods include Boolean mask and arithmetic mask [MOP07]. The Boolean masking method hides a variable $x$ by a length $n$ sequence ( $x_{1}, x_{2}, \cdots, x_{n}$ ) where $x=x_{1} \oplus \cdots \oplus x_{n}$. Since the value $x$ should only be recovered if all $x_{i}$ 's are known to the attacker, we also call the sequence $\left(x_{i}\right)$ Boolean shares of $x$. The arithmetic masking method uses arithmetic shares $\left(x_{i}\right)$ where $x=x_{1}+\cdots+x_{n}$. Note the addition is considered to be modulo $2^{k}$ for a $k$-bit variable.

To theoretically evaluate the power of the attacker and the security level protected by masking, Ishai, Sahai, and Wagner [ISW03] introduced the notion of the $t$-probing model, which assumes an adversary can probe up to $t$ intermediate values during the cryptographic operations. A gadget is said to be secure against $t$-order attacks if any $t$ intermediate values in operations leak no information about the hidden secret. Their approach for proving this security is based on simulation; namely, to show that any $t$ intermediate values of the gadget can be simulated without the knowledge of the secret.

To prove the security of compositions of gadgets, the concept of (Strong)-Non-Interference was proposed in $\left[\mathrm{BBD}^{+} 16\right]$. We recall them in the version presented in [SPOG19].

Definition 1 ( $t$-Non-Interference ( $t$-NI) security). A gadget is $t$-Non-Interference ( $t$-NI)
secure if every set of $t$ intermediate values can be simulated by no more than $t$ shares of each of its inputs.

Definition 2 ( $t$-Strong-Non-Interference ( $t$-SNI) security). A gadget is $t$-Strong-NonInterference ( $t$-SNI) secure if for every set of $t_{I}$ internal intermediate values and $t_{O}$ of its output shares with $t_{I}+t_{O} \leq t$, they can be simulated by no more than $t_{I}$ shares of each of its inputs.

If a gadget is itself $t$-NI secure, and if any set of $t$ shares of its input is independent of the secret, then it can resist $t$-order attacks. However, compositions of $t$-NI secure gadgets may not be $t$-NI secure, and we can use $t$-SNI to avoid this problem. Note that a $t$-SNI secure gadget is, by definition, also $t$-NI, and it can help the simulation since it requires only the same number of input shares as the internal probed values to simulate all the probes.

In section 4, we will show that our algorithms with $n=t+1$ shares are $t$-NI or $t$-SNI secure and provide formal proofs via simulation.

### 2.5 Test Vector Leakage Assessment

The Test Vector Leakage Assessment (TVLA) methodology [GGJR ${ }^{+}$11] is used to analyze whether a cryptographic device leaks information from its power/EM trace. The theory behind this is Welch's t-test. Here we introduce the non-specific version with fixed-versusrandom policy. The tester records a set of power traces where the device runs with a fixed input and another set of traces where the device runs with random inputs. In implementation, the tester records traces for different sets in random order to avoid possible device bias through time. For each point of trace, the $t$ statistic

$$
t=\frac{\bar{X}_{f}-\bar{X}_{r}}{\sqrt{\frac{s_{f}^{2}}{n_{f}}+\frac{s_{r}^{2}}{n_{r}}}}
$$

is calculated, where $\bar{X}_{f}, s_{f}, n_{f}$ and $\bar{X}_{r}, s_{r}, n_{r}$ are the sample mean, standard deviation, and number of traces of set fixed and random, respectively. When the number of recorded traces is large, the $t$ statistic helps recognize whether both sets are sampled from distributions of the same population mean, which is our null hypothesis. In our contexts, this implies adversaries cannot distinguish some particular input. In practice, we reject the hypothesis if the $t$ statistic exceeds the standard threshold $\pm 4.5$, which is set to guarantee a $p$-value under 0.00001 . However, since our measured traces contain many sampled points, we refer to $\left[\mathrm{DZD}^{+} 17\right]$ and alter this threshold to avoid false positives.

## 3 Masked Floating-Point Number Multiplication and Addition

We now introduce our main algorithms - floating-point number multiplication and addition in Falcon, which rewrites FprMul in Algorithm 5 and FprAdd in Algorithm 6 in a masked design. To support their complicated operations in shares, we design three masked gadgets as subroutines in our main algorithms, including:

- SecNonzero: the masked algorithm which receives input shares and outputs whether the input is zero in bit shares.
- SecFprUrsh: the masked algorithm which receives input Boolean shares $\left(x_{i}\right)$ of $x$ and arithmetic shares $\left(c_{i}\right)$ of $c$ and returns shares of $x \gg c$ with its sticky bit preserved.

Table 1: List of used gadgets in our work with $n=t+1$ shares

| Algorithm | Description | Security | Reference |
| :---: | :---: | :---: | :---: |
| SecAnd | AND of Boolean shares | $t$-SNI | [ISW03, $\left.\mathrm{BBD}^{+} 16\right]$ |
| SecMult | Multiplication of arithmetic shares | $t$-SNI | [ISW03, $\mathrm{BBD}^{+} 16$ ] |
| SecAdd | Addition of Boolean shares | $t$-NI | [CGTV15, BBE ${ }^{+} 18$ ] |
| A2B | Arithmetic to Boolean conversion | $t$-SNI | [SPOG19] |
| B2A | Boolean to arithmetic conversion | $t$-SNI | [BCZ18] |
| $\mathrm{B}^{\text {2 }}$ Bit | One-bit B2A conversion | $t$-SNI | [SPOG19] |
| RefreshMasks | $t$-NI refresh of masks | $t$-NI | $\left[\mathrm{BBD}^{+} 16, \mathrm{BCZ1} 8\right]$ |
| Refresh | $t$-SNI refresh of masks | $t$-SNI | $\left[\mathrm{BBD}^{+} 16\right]$ |
| SecOr | OR of Boolean shares | $t$-SNI | Algorithm 7 |
| SecNonzero | Nonzero check of shares | $t$-SNI | Algorithm 8 |
| SecFprUrsh | Right-shift with sitcky-bit preserved | $t$-SNI | Algorithm 9 |
| SecFprNorm64 | Normalization to $\left[2{ }^{63}, 2^{64}\right.$ ) | $t$-NI | Algorithm 10 |

- SecFprNorm64: the masked algorithm which left-shifts 64 -bit Boolean shares $\left(x_{i}\right)$ to make its 64 th bit set. It then adds the shift counts to the other input shares $\left(e_{i}\right)$.

We also use several gadgets from previous works. Table 1 lists all the gadgets used in this work and their $t$-NI and $t$-SNI security. For those proposed in this work (Algorithm 7, $8,9,10$ ), we provide details of them in the following of this section and will prove their $t$-NI or $t$-SNI security in Section 4.

### 3.1 Masked Nonzero Check

We start by introducing our nonzero-check algorithm. Consider that a $k$-bit number $x$ is zero if bit-wise OR-ing all its bits results in a zero. That is,

$$
x=0 \Longleftrightarrow x^{(k)} \vee x^{(k-1)} \vee \cdots \vee x^{(1)}=0
$$

Let the input be some Boolean shares. As each bit of shares is independent, we can bitslice the shares and use the SecOr gadget to OR all the bits securely. The detail of the SecOr gadget is given in Algorithm 7, which applies De Morgan's law and calls the AND algorithm SecAnd of shares as a subroutine. That is,

$$
x \vee y=\neg[(\neg x) \wedge(\neg y)]
$$

To increase efficiency, we consider OR-ing not one bit but half of the register size each time, which reduces the complexity from $\mathcal{O}\left(n^{2} k\right)$ to $\mathcal{O}\left(n^{2} \log k\right)$ for $n$-shared $k$-bit numbers.

For arithmetic shares input $\left(x_{i}\right)$, since

$$
\sum_{i=1}^{n} x_{i}=0 \Longleftrightarrow \sum_{i=1}^{\frac{n}{2}} x_{i}=\sum_{i=\frac{n}{2}+1}^{n}\left(-x_{i}\right) \Longleftrightarrow \sum_{i=1}^{\frac{n}{2}} x_{i} \oplus \sum_{i=\frac{n}{2}+1}^{n}\left(-x_{i}\right)=0
$$

We take the last $\frac{n}{2}$ shares, turn them negative, and use two $\frac{n}{2}$-share arithmetic-to-Boolean conversion gadgets A2B to create two Boolean shares, each representing half shares of the input. Followed by the above nonzero check of Boolean shares, we end up getting one-bit Boolean shares indicating if the input is nonzero. The whole algorithm is given in Algorithm 8.

```
Algorithm 7 SecOr
Input: Boolean shares \(\left(x_{i}\right)_{1 \leq i \leq n}\) for value \(x\), Boolean shares \(\left(y_{i}\right)_{1 \leq i \leq n}\) for value \(y\)
Output: Boolean shares \(\left(z_{i}\right)_{1 \leq i \leq n}\) for value \(z=x \vee y\)
    \(\left(t_{i}\right)_{1 \leq i \leq n} \leftarrow\left(\neg x_{1}, x_{2}, \cdots, x_{n}\right)\)
    \(\left(s_{i}\right)_{1 \leq i \leq n} \leftarrow\left(\neg y_{1}, y_{2}, \cdots, y_{n}\right)\)
    \(\left(z_{i}\right) \leftarrow \operatorname{SecAnd}\left(\left(s_{i}\right),\left(t_{i}\right)\right)\)
    \(z_{1} \leftarrow \neg z_{1}\)
    return \(\left(z_{i}\right)\)
```

```
Algorithm 8 SecNonzero
Input: Shares \(\left(x_{i}\right)_{1 \leq i \leq n}\) for value \(x\)
Output: One-bit Boolean shares \(\left(b_{i}\right)_{1 \leq i \leq n}\) where \(\bigoplus_{i} b_{i}=0 \Longleftrightarrow x=0\)
    if input \(\left(x_{i}\right)\) are arithmetic shares then
        \(\left(t_{i}\right)_{1 \leq i \leq \frac{n}{2}} \leftarrow \mathrm{~A} 2 \mathrm{~B}\left(\left(x_{i}\right)_{1 \leq i \leq \frac{n}{2}}\right)\)
        \(\left(t_{i}\right)_{\frac{n}{2}+1 \leq i \leq n} \leftarrow \mathrm{~A} 2 \mathrm{~B}\left(\left(-x_{i}\right)_{\frac{n}{2}+1 \leq i \leq n}\right)\)
    else
        \(\left(t_{i}\right)_{1 \leq i \leq n} \leftarrow\left(x_{i}\right)_{1 \leq i \leq n}\)
    len \(\leftarrow\) bitsize \(/ 2\)
    while len \(\geq 1\) do
        \(\left(l_{i}\right) \leftarrow \overline{\operatorname{Refresh}}\left(\left(t_{i}^{[2 \operatorname{len}: l e n]}\right)\right.\), len \()\)
        \(\left(r_{i}\right) \leftarrow\left(t_{i}^{[\text {len: } 1]}\right)\)
        \(\left(t_{i}\right) \leftarrow \operatorname{SecOr}\left(\left(l_{i}\right),\left(r_{i}\right)\right)\)
        len \(\leftarrow\) len \(\gg 1\)
    return \(\left(t_{i}^{(1)}\right)\)
```


### 3.2 Masked Unsigned Right-Shift

One step in the floating-point number addition is to right-shift the mantissa of the second operand by the difference of exponents (in line 11 of FprAdd). This is used to make the exponents of the two operands equal, and thus we can directly add their mantissa together. While the exponent difference is part of the secret and is represented in shares, we cannot directly unmask them. The SecFprUrsh in Algorithm 9 right-shifts a Boolean-masked number by a value in arithmetic shares while preserving the sticky bit.

The idea goes as follows. For a 64 -bit number, rotating it by some value $c$ is equivalent to rotating by value $c \bmod 64$. This shows we can rotate by a 6 -bit arithmetic-masked value via sequentially rotating by each share of it. To recover the shifted result from the rotated one, we also rotate a constant $(1 \ll 63)$ by the same value. As there is only a single 1 in bit representation of $(1 \ll 63)$, we could sequentially XOR and right-shift the value to set all the valid bits for our desired shifted result. Moreover, the unset bits are the bits to discard, which determine the sticky bit. We use a SecNonzero to find our desired sticky bit and replace the least significant bit of the shifted result with it.

It is noteworthy that we add $t$-NI secure RefreshMasks in each iteration of the rotation. This is for removing the dependency between shares to achieve its $t$-SNI security. A formal proof based on simulation and properties of the RefreshMasks gadget will be given in Section 4.

### 3.3 Masked 64-bit Normalization

Another crucial part of the floating-point number addition is normalizing a number to range $\left[2^{63}, 2^{64}\right.$ ) (in line 14 of FprAdd). This is used to set the correct exponent and left-shift

```
Algorithm 9 SecFprUrsh
Input: 64 -bit Boolean shares \(\left(x_{i}\right)_{1 \leq i \leq n}\) for value \(x, 6\)-bit arithmetic shares \(\left(c_{i}\right)_{1 \leq i \leq n}\) for
    value \(c\)
Output: Boolean shares \(\left(z_{i}\right)_{1 \leq i \leq n}\) for value \(z=x \gg c\) with the sticky bit preserved
    \(\left(m_{i}\right)_{1 \leq i \leq n} \leftarrow((1 \ll 63), 0, \cdots, 0)\)
    for \(j=1\) to \(n\) do
        Right-rotate \(\left(x_{i}\right)\) by \(c_{j}\)
        \(\left(x_{i}\right) \leftarrow \operatorname{RefreshMasks}\left(\left(x_{i}\right)\right)\)
        Right-rotate \(\left(m_{i}\right)\) by \(c_{j}\)
        \(\left(m_{i}\right) \leftarrow \operatorname{RefreshMasks}\left(\left(m_{i}\right)\right)\)
    len \(\leftarrow 1\)
    while len \(\leq 32\) do
        \(\left(m_{i}\right) \leftarrow\left(m_{i} \oplus\left(m_{i} \gg\right.\right.\) len \(\left.)\right)\)
        len \(\leftarrow\) len \(\ll 1\)
    \(\left(y_{i}\right) \leftarrow \operatorname{SecAnd}\left(\left(x_{i}\right),\left(m_{i}\right)\right)\)
    \(\left(z_{i}\right) \leftarrow\left(y_{i} \oplus x_{i} \oplus y_{i}^{(1)}\right)\)
    \(\left(b_{i}\right) \leftarrow \operatorname{SecNonzero}\left(\left(z_{i}\right)\right)\)
    \(\left(z_{i}\right) \leftarrow\left(y_{i}^{[64: 2]} \vee b_{i}\right)\)
    return \(\left(z_{i}\right)\)
```

the mantissa sum for a valid floating-point number. In FALCON reference implementation, they sequentially check whether the high-order bits are all zero and left-shift the mantissa by the corresponding value. We follow their implementation and use our SecNonzero gadget to check the shift counts. In addition, to add the shift counts to the exponent, we use the one-bit Boolean-to-arithmetic conversion algorithm B2A Bit to transform the result of SecNonzero into arithmetic shares. The whole algorithm is given in Algorithm 10.

```
Algorithm 10 SecFprNorm64
Input: 64 -bit Boolean shares \(\left(x_{i}\right)_{1<i<n}\) for value \(x, 16\)-bit arithmetic shares \(\left(e_{i}\right)_{1<i \leq n}\) for
    value \(e\)
Output: Normalized \(\left(x_{i}\right)_{1 \leq i \leq n}\) in \(\left[2^{63}, 2^{64}\right)\) and \(\left(e_{i}\right)_{1 \leq i \leq n}\) with corresponding shift added
    \(e_{1} \leftarrow e_{1}-63\)
    for \(j=5\) to 0 do
        \(\left(t_{i}\right) \leftarrow\left(x_{i} \oplus\left(x_{i} \ll 2^{j}\right)\right)\)
        \(\left(n_{i}\right) \leftarrow\left(x_{i} \gg\left(64-2^{j}\right)\right)\)
        \(\left(b_{i}\right) \leftarrow \operatorname{SecNonzero}\left(\left(n_{i}\right)\right)\)
        \(\left(b_{i}^{\prime}\right) \leftarrow\left(-b_{i}\right)\)
        \(\left(t_{i}\right) \leftarrow \operatorname{SecAnd}\left(\left(t_{i}\right),\left(\neg b_{1}^{\prime}, b_{2}^{\prime}, \cdots, b_{n}^{\prime}\right)\right)\)
        \(\left(x_{i}\right) \leftarrow\left(x_{i} \oplus t_{i}\right)\)
        \(\left.\left(b_{i}\right) \leftarrow \mathrm{B}_{2} \mathrm{ABit}\left(b_{i}\right)\right)\)
        \(\left(e_{i}\right) \leftarrow\left(e_{i}+\left(b_{i} \ll j\right)\right)\)
    return \(\left(x_{i}\right),\left(e_{i}\right)\)
```


### 3.4 Masked Floating-Point Number Packing

Given 1-bit Boolean-shared sign bit $\left(s_{i}\right)$, 16-bit arithmetic-shared exponent $\left(e_{i}\right)$, and 55 -bit Boolean-shared mantissa $\left(z_{i}\right)$, the SecFPR in Algorithm 11 packs them into one Boolean shares and round the mantissa to 53 -bit with the round-to-nearest strategy.

Similar to Algorithm 4, the procedure starts by adding a constant 1076 to the exponent. Then we turn the mantissa into zero if the result is smaller than 0 (line 4). The comparison
is made by a 16 -bit conversion A2B gadget and checking the most significant bit of the result. Next, we zero the exponent if the mantissa is zero, and we check this by the 55 th bit of the mantissa (line 5). The 55th bit is also added to the exponent by the Boolean-masked addition gadget SecAdd. We then pack the sign bit, exponent, and mantissa that is shifted right by 2 bits (line 9). The Refresh gadgets for the sign bit and the exponent are used here to satisfy the $t$-SNI security. Finally, to do the rounding securely. We consider adding the mantissa by the value derived from OR-ing the first and third bit (line 10) and then AND-ing the second bit (line 11). In this way, the least 3 bits 011,110 , or 111 will cause an increment.

Note that the value will be indeterminate if the input exponent is too large, as it will overflow to the sign bit in Algorithm 4. We omit this check and leave the responsibility for not letting it happen to users, which is what FALCON reference implementation suggests also.

```
Algorithm 11 SecFPR
Input: 1-bit Boolean shares \(\left(s_{i}\right)_{1 \leq i \leq n}\) for value \(s, 16\)-bit arithmetic shares \(\left(e_{i}\right)_{1 \leq i \leq n}\) for
    value \(e, 55\)-bit Boolean shares \(\left(z_{i}\right)_{1 \leq i \leq n}\) for value \(z\)
Output: Boolean shares \(\left(x_{i}\right)_{1 \leq i \leq n}\) representing the floating-point numbers packed by
    \(s, e, z\)
    \(e_{1} \leftarrow e_{1}+1076\)
    \(\left(e_{i}\right) \leftarrow \mathrm{A} 2 \mathrm{~B}\left(\left(e_{i}\right)\right)\)
    \(\left(b_{i}\right) \leftarrow\left(-e_{i}^{(16)}\right)\)
    \(\left(z_{i}\right) \leftarrow \operatorname{SecAnd}\left(\left(z_{i}\right),\left(\neg b_{1}, b_{2}, \cdots, b_{n}\right)\right) \quad \triangleright\) set \(z=0\) if \(e<0\)
    \(\left(e_{i}\right) \leftarrow \operatorname{SecAnd}\left(\left(e_{i}\right),\left(-z_{i}^{(55)}\right)\right) \quad \triangleright \operatorname{set} e=0\) if \(z=0 \Leftrightarrow z^{(55)}=0\)
    \(\left(e_{i}\right) \leftarrow \operatorname{SecAdd}\left(\left(e_{i}\right),\left(z_{i}^{(55)}\right)\right)\)
    \(\left(e_{i}\right) \leftarrow \operatorname{Refresh}\left(\left(e_{i}\right)\right)\)
    \(\left(s_{i}\right) \leftarrow \operatorname{Refresh}\left(\left(s_{i}\right)\right)\)
    \(\left(x_{i}\right) \leftarrow\left(\left(s_{i}^{(1)} \ll 63\right) \vee\left(e_{i}^{[11: 1]} \ll 52\right) \vee\left(z_{i}^{[54: 3]}\right)\right.\)
    \(\left(f_{i}\right) \leftarrow \operatorname{Sec} \operatorname{Or}\left(\operatorname{Refresh}\left(z_{i}^{(1)}\right),\left(z_{i}^{(3)}\right)\right)\)
    \(\left(f_{i}\right) \leftarrow \operatorname{SecAnd}\left(\left(f_{i}\right),\left(z_{i}^{(2)}\right)\right)\)
    \(\left(x_{i}\right) \leftarrow \operatorname{SecAdd}\left(\left(x_{i}\right),\left(f_{i}\right)\right)\)
    return \(\left(x_{i}\right)\)
```


### 3.5 Masked Floating-Point Number Multiplication

We now introduce the masked floating-point number multiplication SecFprMul in Algorithm 12. To begin with, we consider its input floating-point numbers to be split into three parts of arithmetic shares: one-bit sign bit shares $\left(s_{i}\right), 16$-bit exponent shares $\left(e_{i}\right)$, and 128 -bit mantissa shares $\left(m_{i}\right)$ where

$$
s=\bigoplus_{i} s_{i}=\sum_{i} s_{i} \bmod 2, \quad e=\sum_{i} e_{i} \bmod 2^{16}, \quad m=\sum_{i} m_{i}^{[128: 1]} \bmod 2^{128}
$$

We use this form of input to make the operations more straightforward. It should be noted that the pre-image vector computation is the first operation that should be masked in the signing process, and hence this form of input can be derived directly from unmasked values.

To start with, the sign bit XOR and the exponent addition can be simply done by adding the corresponding share of the input (lines 1 and 2). As in Algorithm 5, a constant -2100 is also added. Mantissa multiplication is done by the SecMult gadget (line 3), which
multiplies each share of both operands and adds them together carefully by inserting random mask values.

The next step is to round the mantissa to range $\left[2^{54}, 2^{55}\right)$ and preserve the sticky bit. We first convert the arithmetic-shared mantissa into Boolean shares (line 4). The product is in the range $\left[2^{104}, 2^{106}\right)$, so we shift the product by 50 or 51 bits according to the 106 th bit (line 6 to line 10). Note that for one-bit shares, the negative of it can be achieved by turning each share negative, which is used in our conditional shift in lines 9 and 10. The 106 th bit is then converted to 16 -bit arithmetic shares by the one-bit $\mathrm{B}_{2} \mathrm{~A}_{\text {Bit }}$ and added to the exponent (line 13). To preserve the sticky bit, consider that if shifted by 50 , we need to OR the 51 st bit of mantissa with the nonzero result of the last 50 bits, while for a 51 -bit shift, we OR the 52 nd bit with the nonzero result of the last 51 bits. This can be done in both cases by using our SecNonzero on the last 51 bits and OR the result with the shifted mantissa, which is done in lines 5 and 11.

Finally, we turn the mantissa into zero if any exponent of the input is zero (line 14 to line 17), and we also make this by our SecNonzero gadget of the arithmetic-shared version. Now we get the sign bit, exponent, and 55-bit mantissa, which are in Boolean, arithmetic, and Boolean shares, respectively. We pack them into one Boolean-shared floating-point number in SecFPR.

```
Algorithm 12 SecFprMul
Input: Shares \(\left(s x_{i}\right)_{1 \leq i \leq n},\left(e x_{i}\right)_{1 \leq i \leq n},\left(m x_{i}\right)_{1 \leq i \leq n}\) for floating-point number \(x\), shares
    \(\left(s y_{i}\right)_{1 \leq i \leq n},\left(e y_{i}\right)_{1 \leq i \leq n},\left(m y_{i}\right)_{1 \leq i \leq n}\) for floating-point number \(y\)
Output: Boolean shares representing the floating-point numbers product.
    \(\left(s_{i}\right) \leftarrow\left(s x_{i} \oplus s y_{i}\right)\)
    \(\left(e_{i}\right) \leftarrow\left(e x_{1}+e y_{1}-2100, e x_{2}+e y_{2}, \cdots, e x_{n}+e y_{n}\right)\)
    \(\left(p_{i}\right) \leftarrow \operatorname{SecMult}\left(\left(m x_{i}\right),\left(m y_{i}\right)\right)\)
    \(\left(p_{i}\right) \leftarrow \mathrm{A} 2 \mathrm{~B}\left(\left(p_{i}\right)\right)\)
    \(\left(b_{i}\right) \leftarrow \operatorname{SecNonzero}\left(\left(p_{i}^{[51: 1]}\right)\right)\)
    \(\left(z_{i}\right) \leftarrow\left(p_{i}^{[105: 51]}\right)\)
    \(\left(z_{i}^{\prime}\right) \leftarrow\left(p_{i}^{[105: 51]} \oplus p_{i}^{[106: 52]}\right)\)
    \(\left(w_{i}\right) \leftarrow\left(p_{i}^{(106)}\right)\)
    \(\left(z_{i}^{\prime}\right) \leftarrow \operatorname{SecAnd}\left(\left(z_{i}^{\prime}\right), \operatorname{Refresh}\left(\left(-w_{i}\right)\right)\right)\)
    \(\left(z_{i}\right) \leftarrow\left(z_{i}^{\prime} \oplus z_{i}\right) \quad \triangleright\) conditional shift
    \(\left(z_{i}\right) \leftarrow \operatorname{Sec} \operatorname{Or}\left(\left(z_{i}\right),\left(b_{i}\right)\right) \quad \triangleright\) preserve the sticky bit
    \(\left(w_{i}\right) \leftarrow \mathrm{B}_{2} \mathrm{Bit}\left(\left(w_{i}\right)\right)\)
    \(\left(e_{i}\right) \leftarrow\left(e_{i}+w_{i}\right) \quad \triangleright\) add exponent by the 106 th bit
    \(\left(b x_{i}\right) \leftarrow\) SecNonzero \(\left(\left(e x_{i}\right)\right)\)
    \(\left(b y_{i}\right) \leftarrow \operatorname{SecNonzero}\left(\left(e y_{i}\right)\right)\)
    \(\left(d_{i}\right) \leftarrow \operatorname{SecAnd}\left(\left(b x_{i}\right),\left(b y_{i}\right)\right)\)
    \(\left(z_{i}\right) \leftarrow \operatorname{SecAnd}\left(\left(z_{i}\right),\left(-d_{i}^{(1)}\right)\right) \quad \triangleright\) set \(z=0\) if exponent of any operand is 0
    return \(\operatorname{SecFPR}\left(\left(s_{i}\right),\left(e_{i}\right),\left(z_{i}\right)\right)\)
```


### 3.6 Masked Floating-Point Number Addition

The SecFprAdd in Algorithm 13 takes in two Boolean-shared floating-point values and adds them in a masked way. We use Boolean shares as its input since it is followed by floating-point number multiplications in the pre-image vector computation.

The algorithm first exchanges the operands to make the first one no less than the second (line 1 to line 9). Thanks to the biased exponent, we can make the comparison by a simple subtraction and check the sign bit. Originally, the subtraction of two Boolean-masked
values could be done by three steps: (1) inverting all the bits of the second operand (2) adding the inverted result with 1 (3) adding the first and second operand, which applies the equation $x-y=x+(\neg y)+1$ To avoid an additional call to the SecAdd gadget, we only invert bits and consider the boundary conditions. To put it clearly, let $u, v$ be two values stored in 64 -bit registers; we use the relation

$$
\begin{aligned}
\llbracket u-v<0 \rrbracket & =\llbracket u-v-1<0 \rrbracket \oplus \llbracket u-v-1=-1 \rrbracket \oplus \llbracket u-v-1=2^{63}-1 \rrbracket \\
& =\llbracket u+(\neg v)<0 \rrbracket \oplus \llbracket u+(\neg v)=-1 \rrbracket \oplus \llbracket u+(\neg v)=2^{63}-1 \rrbracket \\
& =\llbracket u+(\neg v)<0 \rrbracket \oplus \llbracket u+(\neg v) \neq-1 \rrbracket \oplus \llbracket u+(\neg v) \neq 2^{63}-1 \rrbracket
\end{aligned}
$$

The first and second equalities come from the two's complement representation, where an increment of $2^{63}-1$ results in $-2^{63}$ and $\neg v=-v-1$. The third equality is for the output of the SecNonzero gadget. To evaluate the range check of the value $u+(\neg v)$, we also use both of the facts in our algorithm (lines 4 and 5)

$$
\begin{gathered}
u+(\neg v) \neq-1 \Longleftrightarrow \neg(u+(\neg v)) \neq 0 \\
u+(\neg v) \neq 2^{63}-1 \Leftrightarrow(u+(\neg v)) \oplus(1 \ll 63) \neq-1 \Leftrightarrow \neg((u+(\neg v)) \oplus(1 \ll 63)) \neq 0
\end{gathered}
$$

After the conditional swap, we extract the sign bit, exponent, and mantissa from the input shares. Like in Algorithm 6, the mantissa is scaled up 3 bits for further rounding precision, and the exponent is turned to arithmetic shares and subtracted by 1078. Then we use the SecFprUrsh gadget (Algorithm 9) we introduced in Section 3.2 to right-shift the Boolean-masked mantissa of the second operand with sticky bit preserved to make both operands have identical exponents. Note that we set the value to zero if the exponent difference is larger than 59 before shifting, as indicated in lines 9 and 10 of unmasked Algorithm 6 and lines 15 and 16 of Algorithm 13.

After the proper shift, we add/subtract the result to/from the mantissa of the first operand (line 24). The sum has a wide range, so we normalize it to $\left[2^{63}, 2^{64}\right.$ ) by the SecFprNorm64 gadget (Algorithm 10) in Section 3.3 and right-shift the result by 9 bits (line 27). Finally, we get the sign bit and exponent of the first operand and the shifted mantissa sum in the range $\left[2^{54}, 2^{55}\right.$ ), and the result floating-point number is given by calling SecFPR as the case of multiplication.

## 4 Security Proof

In this section, we sequentially prove that our design with $n=t+1$ shares is secure regarding $t$-NI and $t$-SNI security.

Lemma 1. The gadget SecOr (Algorithm 7) is $t$-SNI secure.
Proof. This is a direct result that the SecAnd gadget is $t$-SNI secure and the negation is operated share-by-share.
Lemma 2. The gadget SecNonzero (Algorithm 8) is t-SNI secure.
Proof. Since the A2B gadget is $t$-SNI secure, we only need to show that the following loop is itself $t$-SNI. An abstract diagram of each iteration is given in Figure 1. Since SecOr is $t$-SNI secure, for any probing set $\mathcal{P}_{1}$ in SecOr gadget and $\mathcal{O}$ of its output shares, one can use some set $\mathcal{S}_{1}^{1}$ of outputs of Refresh and set $\mathcal{S}_{1}^{2}$ of shares of $\left(r_{i}\right)$ to simulate both $\mathcal{P}_{1}$ and $\mathcal{O}$ with $\left|\mathcal{S}_{1}^{1}\right|,\left|\mathcal{S}_{1}^{2}\right| \leq \mathcal{P}_{1}$. Also, since Refresh is $t$-SNI secure, for any probing set $\mathcal{P}_{2}$ in Refresh gadget, one can use some set $\mathcal{S}_{2}$ of shares of $\left(t_{i}\right)$ to simulate both $\mathcal{P}_{2}$ and $\mathcal{S}_{1}^{1}$ with $\left|\mathcal{S}_{2}\right| \leq \mathcal{P}_{2}$. In summary, for probed output shares $\mathcal{O}$ and internal values $\mathcal{P}_{1}, \mathcal{P}_{2}$, one can use $\mathcal{S}_{1}^{2}$ and $\mathcal{S}_{2}$ to simulate all of them with $\left|\mathcal{S}_{1}^{2} \cup \mathcal{S}_{2}\right| \leq\left|\mathcal{P}_{1}\right|+\left|\mathcal{P}_{2}\right|$. This shows each iteration is $t$-SNI secure, and the whole loop is thus $t$-SNI secure.

```
Algorithm 13 SecFprAdd
Input: Boolean shares \(\left(x_{i}\right)_{1 \leq i \leq n}\) and \(\left(y_{i}\right)_{1 \leq i \leq n}\) representing floating-point numbers \(x\) and
    \(y\)
Output: Boolean shares representing the floating-point numbers sum
    \(:\left(x m_{i}\right) \leftarrow\left(x_{i}^{[63: 1]}\right)\)
    \(\left(y m_{i}\right) \leftarrow\left(\neg y_{1}^{[63: 1]}, y_{2}^{[63: 1]}, \cdots, y_{n}^{[63: 1]}\right)\)
    \(\left(d_{i}\right) \leftarrow \operatorname{SecAdd}\left(\left(x m_{i}\right),\left(y m_{i}\right)\right) \quad \triangleright d=x m-y m-1\)
    \(\left(b_{i}\right) \leftarrow \operatorname{SecNonzero}\left(\neg d_{1}, d_{2}, \cdots, d_{n}\right)\)
    \(\triangleright \llbracket d \neq-1 \rrbracket\)
    \(\left(b_{i}^{\prime}\right) \leftarrow \operatorname{SecNonzero}\left(\neg\left(d_{1} \oplus(1 \ll 63)\right), d_{2}, \cdots, d_{n}\right)\)
    \(\triangleright \llbracket d \neq 2^{63}-1 \rrbracket\)
    \(\left(c s_{i}\right) \leftarrow \operatorname{SecAnd}\left(\left(\neg b_{1}, b_{2}, \cdots, b_{n}\right),\left(x_{i}^{(64)}\right)\right)\)
    \(\left(c s_{i}\right) \leftarrow \operatorname{Sec} \operatorname{Or}\left(\left(c s_{i}\right),\left(d_{i}^{(64)} \oplus b_{i} \oplus b_{i}^{\prime}\right)\right)\)
    \(\left(m_{i}\right) \leftarrow \operatorname{Sec} \operatorname{And}\left(\left(x_{i} \oplus y_{i}\right),\left(-c s_{i}\right)\right)\)
    \(\left(x_{i}\right) \leftarrow\left(x_{i} \oplus m_{i}\right),\left(y_{i}\right) \leftarrow\left(y_{i} \oplus m_{i}\right) \quad \triangleright\) swap \(x\) and \(y\) if necessary
    Extract \(\left(s x_{i}\right),\left(e x_{i}\right),\left(m x_{i}\right)\) and \(\left(s y_{i}\right),\left(e y_{i}\right),\left(m y_{i}\right)\) from \(\left(x_{i}\right)\) and \(\left(y_{i}\right)\), respectively.
    \(\left(m x_{i}\right) \leftarrow\left(m x_{i} \ll 3\right),\left(m y_{i}\right) \leftarrow\left(m y_{i} \ll 3\right)\)
    \(\left(e x_{i}\right) \leftarrow \mathrm{B} 2 \mathrm{~A}\left(\left(e x_{i}\right)\right),\left(e y_{i}\right) \leftarrow \mathrm{B} 2 \mathrm{~A}\left(\left(e y_{i}\right)\right)\)
    \(e x_{1} \leftarrow e x_{1}-1078, e y_{1} \leftarrow e y_{1}-1078\).
    \(\left(c_{i}\right) \leftarrow\left(e x_{i}-e y_{i}\right)\)
    \(\left(c_{i}^{\prime}\right) \leftarrow \mathrm{A} 2 \mathrm{~B}\left(\left(c_{1}-60, c_{2}, \cdots, c_{n}\right)\right)\)
    \(\left(m y_{i}\right) \leftarrow \operatorname{Sec} \operatorname{And}\left(\left(m y_{i}\right),\left(-\left(c_{i}^{\prime(16)}\right)\right)\right) \quad \triangleright\) set \(m y\) to 0 if the exponent difference \(>60\)
    \(\left(m y_{i}\right) \leftarrow \operatorname{SecFprUrsh}\left(\left(m y_{i}\right),\left(c_{i}^{[6: 1]}\right)\right)\)
    \(\left(m y_{i}^{\prime}\right) \leftarrow\left(\neg m y_{1}, m y_{2}, \cdots m y_{n}\right)\)
    \(\left(m y_{i}^{\prime}\right) \leftarrow \operatorname{SecAdd}\left(\left(m y_{i}^{\prime}\right),(1,0, \cdots, 0)\right) \quad \triangleright-m y\)
    \(\left(s_{i}\right) \leftarrow\left(-\left(s x_{i} \oplus s y_{i}\right)\right)\)
    \(\left(m y_{i}\right) \leftarrow \operatorname{Refresh}\left(\left(m y_{i}\right)\right)\)
    \(\left(m y_{i}^{\prime}\right) \leftarrow \operatorname{SecAnd}\left(\left(m y_{i} \oplus m y_{i}^{\prime}\right),\left(s_{i}\right)\right)\)
    \(\left(m y_{i}\right) \leftarrow\left(m y_{i} \oplus m y_{i}^{\prime}\right) \quad \triangleright(-1)^{s x \oplus s y} m y\)
    \(\left(z_{i}\right) \leftarrow \operatorname{SecAdd}\left(\left(m x_{i}\right),\left(m y_{i}\right)\right) \quad \triangleright z=m x+(-1)^{s x \oplus s y} m y\)
    \(\left(z_{i}\right),\left(e x_{i}\right) \leftarrow\) SecFprNorm64 \(\left(\left(z_{i}\right),\left(e x_{i}\right)\right)\)
    \(\left(b_{i}\right) \leftarrow \operatorname{SecNonzero}\left(\left(z_{i}^{[10: 1]}\right)\right)\)
    \(\left(z_{i}\right) \leftarrow\left(z_{i} \gg 9\right)\)
    \(\left(z_{i}^{(1)}\right) \leftarrow\left(b_{i}\right) \quad \triangleright\) preserve the sticky bit
    \(e x_{1} \leftarrow e x_{1}+9\)
    return \(\operatorname{SecFPR}\left(\operatorname{Refresh}\left(\left(s x_{i}\right)\right),\left(e x_{i}\right),\left(z_{i}\right)\right)\)
```

Lemma 3. The gadget SecFprUrsh (Algorithm 9) is $t$-SNI secure.
Proof. We first show that the loop of rotation is itself $t$-SNI secure. Note that since there are $n$ iterations, at least one of them is not probed. Let it be the iteration when $j=j^{*}$. Since any set of output shares of RefreshMasks with size $\leq n-1$ is uniformly distributed ([BCZ18], Lemma 1), all the probes after $j^{*}$, including probes of outputs, can be simulated with fresh randomness. Thus we only need to show that one can simulate probes before $j^{*}$ with no more number of shares.

Since the rotation is done share-by-share, one can simulate probes of $\left(x_{i}\right)$ and ( $m_{i}$ ) with the same number of input shares. As for the simulation of $c_{j}$, if in some iteration $j=j^{\prime}$ the rotation is probed, one then adds $c_{j^{\prime}}$ into the simulation set. Also, if consecutive RefreshMasks in iterations $j=j^{\prime}-1, j^{\prime}$ are probed, one adds $c_{j^{\prime}}$ into the simulation set. Note that if RefreshMasks are not consecutively probed, one can simulate $c_{j}$ with fresh randomness thanks to the uniformity of RefreshMasks's outputs. In this way, the size of the simulation set of $c_{j}$ is no more than the number of probes.


Figure 1: An abstract diagram of each iteration in SecNonzero (Algorithm 8). The probing sets $\mathcal{O}, \mathcal{P}_{1}, \mathcal{P}_{2}$ are colored in red, and the simulations sets $\mathcal{S}_{1}^{1}, \mathcal{S}_{1}^{2}, \mathcal{S}_{2}$ are colored in blue. Gadgets with $t$-NI and $t$-SNI security are marked in black and green, respectively.


Figure 2: An abstract diagram of SecFprUrsh (Algorithm 9). The probing sets $\mathcal{O}$ and $\mathcal{P}_{i}$ for some $i$ are colored in red, and the simulations sets $\mathcal{S}_{i}$ and $\mathcal{S}_{i}^{j}$ for some $i, j$ are colored in blue. Gadgets with $t$-NI and $t$-SNI security are marked in black and green, respectively.

Now we show the operations following the rotation loop are $t$-SNI secure, and therefore the whole gadget is $t$-SNI secure. An abstract diagram of SecFprUrsh is given in Figure 2. Let the adversary probe the intermediate values sets $\mathcal{O}$ of the output shares, $\mathcal{P}_{1}$ of the SecNonzero, $\mathcal{P}_{2}$ of the XOR following SecAnd, $\mathcal{P}_{3}$ of the SecAnd, and $\mathcal{P}_{4}$ of the XOR before SecAnd. First, by the $t$-SNI security of SecNonzero, one can use some sets $\mathcal{S}_{1}$ of output shares of the XOR operation to simulate $\mathcal{P}_{1}$ and the first 63 bits of $\mathcal{O}$ with size no more than $\left|\mathcal{P}_{1}\right|$. The XOR operation is done share-by-share, so there are some sets $\mathcal{S}_{2}^{1}, \mathcal{S}_{2}^{2}$ of output shares of the rotation and SecAnd, respectively, that can simulate $\mathcal{P}_{2}$ and $\mathcal{S}_{1}$. Note only $\left|\mathcal{O} \cup \mathcal{S}_{2}^{2}\right| \leq|\mathcal{O}|+\left|\mathcal{P}_{2}\right|+\left|\mathcal{P}_{1}\right| \leq t$ output shares of SecAnd are used. Since SecAnd is $t$-SNI secure, one can use some sets $\mathcal{S}_{3}^{1}, \mathcal{S}_{3}^{2}$ to simulate $\mathcal{P}_{3}$ and $\{$ the last bit of $\mathcal{O}\} \cup \mathcal{S}_{2}^{2}$ with sizes no more than $\mathcal{P}_{3}$. Finally, one can simulate the probing set $\mathcal{P}_{4}$ in the XOR and $\mathcal{S}_{3}^{2}$ with output shares $\mathcal{S}_{4}$ of the rotation of $\left(m_{i}\right)$. All the probes are now simulated with output shares $\mathcal{S}_{2}^{1} \cup \mathcal{S}_{3}^{1}$ of the rotation of $\left(x_{i}\right)$ and output shares $\mathcal{S}_{4}$ of the rotation of $\left(m_{i}\right)$. $\left|\mathcal{S}_{2}^{1} \cup \mathcal{S}_{3}^{1}\right| \leq\left|\mathcal{P}_{2}\right|+\left|\mathcal{P}_{1}\right|+\left|\mathcal{P}_{3}\right|$ and $\left|\mathcal{S}_{4}\right| \leq\left|\mathcal{P}_{4}\right|+\left|\mathcal{P}_{3}\right|$. They, along with the internal probes into the rotation loop, can be simulated by input shares due to the $t$-SNI security we showed at first.

Lemma 4. The gadget SecFprNorm64 (Algorithm 10) is t-NI secure.
Proof. An abstract diagram of each iteration in SecFprNorm64 is given in Figure 3. Let the adversary probe in iteration $j$ the intermediate values set $\mathcal{P}_{1}^{(j)}$ of the addition, $\mathcal{P}_{2}^{(j)}$ of the XOR, $\mathcal{P}_{3}^{(j)}$ of the $\mathrm{B}^{2} \mathrm{~A}_{\mathrm{Bit}}, \mathcal{P}_{4}^{(j)}$ of the SecAnd, and $\mathcal{P}_{5}^{(j)}$ of the SecNonzero. We show that all probes in iteration $j$ can be simulated with no more number of shares of $\left(e_{i}\right)$ and $\left(x_{i}\right)$ as the input of the iteration. If this is the case, all probes across different iterations can be simulated with no more number of input shares.

First, since the addition is done share-by-share, one can use some sets $\mathcal{S}_{1}^{1}$ and $\mathcal{S}_{1}^{2}$ of shares of $\left(e_{i}\right)$ and $\mathrm{B}_{2} \mathrm{~A}_{\text {Bit }}$, respectively, to simulate $\mathcal{P}_{1}^{(j)}$. One can also use some sets $\mathcal{S}_{2}^{1}$ and $\mathcal{S}_{2}^{2}$ of shares of SecAnd and $\left(x_{i}\right)$, respectively, to simulate $\mathcal{P}_{2}^{(j)}$. The $t$-SNI security of


Figure 3: An abstract diagram of iteration $j$ in SecFprNorm64 (Algorithm 10). The probing sets $\mathcal{P}_{i}^{(j)}$ for some $i$ are colored in red, and the simulation sets $\mathcal{S}_{i}$ and $\mathcal{S}_{i}^{k}$ for some $i, k$ are colored in blue. Gadgets with $t$-NI and $t$-SNI security are marked in black and green, respectively.

B2A Bit assures that one can use some set $\mathcal{S}_{3}$ of the output of the SecNonzero with the size no more than $\left|\mathcal{P}_{3}^{(j)}\right|$ to simulate $\mathcal{P}_{3}^{(j)}$ and $\mathcal{S}_{1}^{2}$. Similarly, one can use some sets $\mathcal{S}_{4}^{1}, \mathcal{S}_{4}^{2}$ of the SecNonzero and $\left(x_{i}\right)$, respectively, to simulate $\mathcal{P}_{4}^{(j)}$ and $\mathcal{S}_{2}^{1}$ with sizes no more than $\left|\mathcal{P}_{4}^{(j)}\right|$. Note only $\left|\mathcal{S}_{3} \cup \mathcal{S}_{4}^{1}\right| \leq\left|\mathcal{P}_{3}^{(j)}\right|+\left|\mathcal{P}_{4}^{(j)}\right|$ shares of the SecNonzero are used, and they, along with $\mathcal{P}_{5}^{(j)}$, can be simulated by some set $\mathcal{S}_{5}$ of $\left(x_{i}\right)$ with size no more than $\left|\mathcal{P}_{5}^{(j)}\right|$. Now all the probes are simulated by shares $\mathcal{S}_{1}^{1}$ of $\left(e_{i}\right)$ and shares $\mathcal{S}_{2}^{2}, \mathcal{S}_{4}^{2}$ and $\mathcal{S}_{5}$ of $\left(x_{i}\right)$. As $\left|\mathcal{S}_{1}^{1}\right| \leq \mathcal{P}_{1}^{(j)}$ and $\left|\mathcal{S}_{2}^{2} \cup \mathcal{S}_{4}^{2} \cup \mathcal{S}_{5}\right| \leq\left|\mathcal{P}_{2}^{(j)}\right|+\left|\mathcal{P}_{4}^{(j)}\right|+\left|\mathcal{P}_{5}^{(j)}\right|$, we show that all the probes in iteration $j$ can be simulated with no more number of shares of $\left(e_{i}\right)$ and $\left(x_{i}\right)$.

We now give proofs that our main algorithms SecFPR (Algorithm 11), SecFprMul (Algorithm 12), and SecFprAdd (Algorithm 13) are $t$-SNI secure.

Theorem 1. The gadget SecFPR (Algorithm 11) is $t$-SNI secure.
Proof. We use an abstract diagram of SecFPR in Figure 4 to demonstrate our proof. Assume the adversary probes $t$ values, including output shares set $\mathcal{O}$ and sets $P_{i}$ in each gadget for $i=1,2, \cdots, 10$. Also, assume we use sets $S_{i}$ or $S_{i}^{j}$ for some $j$ to simulate the necessary values for the corresponding gadgets. For example, we use $\mathcal{S}_{1}^{1}, \mathcal{S}_{1}^{2}$ to simulate $\mathcal{P}_{1}$ and $\mathcal{O}$, and $\mathcal{S}_{2}$ to simulate $\mathcal{P}_{2}$ and $\mathcal{S}_{1}$, just to name a few. Our goal is to show that if the size of all the probing sets $P_{i}$ is $t_{I} \leq t$, and if the size of values we require to simulate in each gadget is smaller than $t$, then the simulation sets of input shares (that is, $\mathcal{S}_{8}, \mathcal{S}_{10}$, and $\mathcal{S}_{9}^{2}$ ) have sizes no more than $t_{I}$.

Since gadget SecAdd is $t$-NI, we have $\left|\mathcal{S}_{1}^{1}\right|,\left|\mathcal{S}_{1}^{2}\right| \leq\left|\mathcal{P}_{1}\right|+|\mathcal{O}|$. Due to the $t$-SNI security of Refresh and SecAnd, we have $\left|\mathcal{S}_{2}\right| \leq\left|\mathcal{P}_{2}\right|$ and $\left|\mathcal{S}_{3}^{1}\right|,\left|\mathcal{S}_{3}^{2}\right| \leq\left|\mathcal{P}_{3}\right|$. Similarly, we can sequentially derive

- $\left|\mathcal{S}_{4}^{1}\right|,\left|\mathcal{S}_{4}^{2}\right| \leq\left|\mathcal{P}_{4}\right|+\left|\mathcal{S}_{2}\right|$
- $\left|\mathcal{S}_{5}^{1}\right|,\left|\mathcal{S}_{5}^{2}\right| \leq\left|\mathcal{P}_{5}\right|$
- $\left|\mathcal{S}_{6}^{1}\right|,\left|\mathcal{S}_{6}^{2}\right| \leq\left|\mathcal{P}_{6}\right|$
- $\left|\mathcal{S}_{7}^{1}\right| \leq\left|\mathcal{P}_{7}\right|$

Based on the above inequalities, one can check that the number of values to simulate in each gadget is no more than $t_{I}+|\mathcal{O}|=t$. Finally we derive $\left|\mathcal{S}_{8}\right| \leq\left|\mathcal{P}_{8}\right|,\left|\mathcal{S}_{10}\right| \leq\left|\mathcal{P}_{10}\right|$, and $\left|\mathcal{S}_{9}^{2}\right| \leq\left|\mathcal{P}_{9}\right|$, which are all no more than $t_{I}$.

Theorem 2. The gadget SecFprMul (Algorithm 12) is $t$-SNI secure.


Figure 4: An abstract diagram of SecFPR (Algorithm 11). The probing sets $\mathcal{O}$ and $\mathcal{P}_{i}$ for some $i$ are colored in red, and the simulation sets $\mathcal{S}_{i}$ and $\mathcal{S}_{i}^{j}$ for some $i, j$ are colored in blue. Gadgets with $t$-NI and $t$-SNI security are marked in black and green, respectively.

Proof. We use an abstract diagram of SecFprMul in Figure 5 to demonstrate our proof. Assume the adversary probes $t$ values, including output shares set $\mathcal{O}$ and sets $P_{i}$ in each gadget for $i=1,2, \cdots, 15$. Also, assume we use sets $S_{i}$ or $S_{i}^{j}$ for some $j$ to simulate the necessary values for the corresponding gadgets. Similarly, we need to show that if the size of all the probing sets $P_{i}$ is $t_{I} \leq t$, and if the size of values we require to simulate in each gadget is smaller than $t$, then the simulation sets of input shares (that is, $\mathcal{S}_{9}^{1}, \mathcal{S}_{9}^{2}, \mathcal{S}_{11}^{1}, \mathcal{S}_{11}^{2}, \mathcal{S}_{15}^{1}$, and $\mathcal{S}_{15}^{2}$ ) have sizes no more than $t_{I}$.

Theorem 1 shows that the SecFPR gadget is $t$-SNI secure, so we have $\left|\mathcal{S}_{1}^{1}\right|,\left|\mathcal{S}_{1}^{2}\right|,\left|\mathcal{S}_{1}^{3}\right| \leq$ $\left|\mathcal{P}_{1}\right|$. We then sequentially write down all the inequalities based on the $t$-NI and $t$-SNI properties of the gadgets.

- $\left|\mathcal{S}_{2}^{1}\right|,\left|\mathcal{S}_{2}^{2}\right| \leq\left|\mathcal{P}_{2}\right|$
- $\left|\mathcal{S}_{3}^{1}\right|,\left|\mathcal{S}_{3}^{2}\right| \leq\left|\mathcal{P}_{3}\right|$
- $\left|\mathcal{S}_{4}^{1}\right|,\left|\mathcal{S}_{4}^{2}\right| \leq\left|\mathcal{P}_{4}\right|+\left|\mathcal{S}_{3}^{1}\right|$
- $\left|\mathcal{S}_{5}^{1}\right|,\left|\mathcal{S}_{5}^{2}\right| \leq\left|\mathcal{P}_{5}\right|$
- $\left|\mathcal{S}_{6}\right| \leq\left|\mathcal{P}_{6}\right|$
- $\left|\mathcal{S}_{7}\right| \leq\left|\mathcal{P}_{7}\right|$
- $\left|\mathcal{S}_{8}\right| \leq\left|\mathcal{P}_{8}\right|$
- $\left|\mathcal{S}_{9}^{1}\right|,\left|\mathcal{S}_{9}^{2}\right| \leq\left|\mathcal{P}_{9}\right|$
- $\left|\mathcal{S}_{10}^{1}\right|,\left|\mathcal{S}_{10}^{2}\right| \leq\left|\mathcal{P}_{10}\right|+\left|\mathcal{S}_{1}^{2}\right|$
- $\left|\mathcal{S}_{11}^{1}\right|,\left|\mathcal{S}_{11}^{2}\right| \leq\left|\mathcal{P}_{11}\right|+\left|\mathcal{S}_{10}^{1}\right|$
- $\left|\mathcal{S}_{12}^{1}\right|,\left|\mathcal{S}_{12}^{2}\right| \leq\left|\mathcal{P}_{12}\right|$
- $\left|\mathcal{S}_{13}\right| \leq\left|\mathcal{P}_{13}\right|$
- $\left|\mathcal{S}_{14}\right| \leq\left|\mathcal{P}_{14}\right|$
- $\left|\mathcal{S}_{15}^{1}\right|,\left|\mathcal{S}_{15}^{2}\right| \leq\left|\mathcal{P}_{15}\right|+\left|\mathcal{S}_{1}^{1}\right|$

Now we derive

$$
\begin{gathered}
\left|\mathcal{S}_{9}^{1}\right|,\left|\mathcal{S}_{9}^{2}\right| \leq\left|\mathcal{P}_{9}\right| \\
\left|\mathcal{S}_{11}^{1}\right|,\left|\mathcal{S}_{11}^{2}\right| \leq\left|\mathcal{P}_{11}\right|+\left|\mathcal{S}_{10}^{1}\right| \leq\left|\mathcal{P}_{11}\right|+\left|\mathcal{P}_{10}\right|+\left|\mathcal{S}_{1}^{2}\right| \leq\left|\mathcal{P}_{11}\right|+\left|\mathcal{P}_{10}\right|+\left|\mathcal{P}_{1}\right| \\
\left|\mathcal{S}_{15}^{1}\right|,\left|\mathcal{S}_{15}^{2}\right| \leq\left|\mathcal{P}_{15}\right|+\left|\mathcal{S}_{1}^{1}\right| \leq\left|\mathcal{P}_{15}\right|+\left|\mathcal{P}_{1}\right|
\end{gathered}
$$

Sizes of all of them are no more than $t_{I}$.
One can also show that the number of values to simulate in each gadget is no more than $t$. Here we show the case for the A2B gadget since it is the most complicated one. The simulation set $\mathcal{S}_{8}$ needs to simulate $\mathcal{P}_{8}$ and $\mathcal{S}_{10}^{2} \cup \mathcal{S}_{6} \cup \mathcal{S}_{5}^{2} \cup \mathcal{S}_{4}^{2} \cup \mathcal{S}_{7}$. Note we have $\left|\mathcal{S}_{10}^{2}\right| \leq\left|\mathcal{P}_{10}\right|+\left|\mathcal{S}_{1}^{2}\right| \leq\left|\mathcal{P}_{10}\right|+\left|\mathcal{P}_{1}\right|,\left|\mathcal{S}_{6}\right| \leq\left|\mathcal{P}_{6}\right|,\left|\mathcal{S}_{5}^{2}\right| \leq\left|\mathcal{P}_{5}\right|,\left|\mathcal{S}_{4}^{2}\right| \leq\left|\mathcal{P}_{4}\right|+\left|\mathcal{S}_{3}^{1}\right| \leq\left|\mathcal{P}_{4}\right|+\left|\mathcal{P}_{3}\right|$, and $\left|\mathcal{S}_{7}\right| \leq\left|\mathcal{P}_{7}\right|$. The number of values $\mathcal{S}_{8}$ needs to simulate is smaller than

$$
\left|\mathcal{P}_{8}\right|+\left|\mathcal{P}_{10}\right|+\left|\mathcal{P}_{1}\right|+\left|\mathcal{P}_{6}\right|+\left|\mathcal{P}_{5}\right|+\left|\mathcal{P}_{4}\right|+\left|\mathcal{P}_{3}\right|+\left|\mathcal{P}_{7}\right|
$$



Figure 5: An abstract diagram of SecFprMul (Algorithm 12). The probing sets $\mathcal{O}$ and $\mathcal{P}_{i}$ for some $i$ are colored in red, and the simulation sets $\mathcal{S}_{i}$ and $\mathcal{S}_{i}^{j}$ for some $i, j$ are colored in blue. Gadgets with $t$-NI and $t$-SNI security are marked in black and green, respectively.
which is guaranteed to be no more than $t_{I}$.

Theorem 3. The gadget SecFprAdd (Algorithm 13) is t-SNI secure.
Proof. We first show that for any probing sets in the swap part of the algorithm, one can use no more number of shares of each input to simulate. An abstract diagram of the swap part is given in Figure 6. Similar to our proof in Theorem 1 and 2, we write down the inequalities that the $t$-NI and $t$-SNI gadgets provide.

- $\left|\mathcal{S}_{1}^{1}\right|,\left|\mathcal{S}_{1}^{2}\right| \leq\left|\mathcal{P}_{1}\right|$
- $\left|\mathcal{S}_{2}^{1}\right|,\left|\mathcal{S}_{2}^{2}\right| \leq\left|\mathcal{P}_{2}\right|$
- $\left|\mathcal{S}_{3}^{1}\right|,\left|\mathcal{S}_{3}^{2}\right| \leq\left|\mathcal{P}_{3}\right|$
- $\left|\mathcal{S}_{4}^{1}\right|,\left|\mathcal{S}_{4}^{2}\right| \leq\left|\mathcal{P}_{4}\right|$
- $\left|\mathcal{S}_{5}^{1}\right|,\left|\mathcal{S}_{5}^{2}\right| \leq\left|\mathcal{P}_{5}\right|$
- $\left|\mathcal{S}_{10}^{1}\right|,\left|\mathcal{S}_{10}^{2}\right| \leq\left|\mathcal{P}_{10}\right|+\left|\mathcal{S}_{6}^{1}\right|+\left|\mathcal{S}_{8}\right|+\left|\mathcal{S}_{9}\right| \leq\left|\mathcal{P}_{10}\right|+\left|\mathcal{P}_{6}\right|+\left|\mathcal{P}_{4}\right|+\left|\mathcal{P}_{8}\right|+\left|\mathcal{P}_{9}\right|$
- $\left|\mathcal{S}_{11}^{1}\right|,\left|\mathcal{S}_{11}^{2}\right| \leq\left|\mathcal{P}_{11}\right|+\left|\mathcal{S}_{3}^{1}\right| \leq\left|\mathcal{P}_{11}\right|+\left|\mathcal{P}_{3}\right|$

Therefore we can use at most

$$
\left|\mathcal{S}_{2}^{2} \cup \mathcal{S}_{5}^{2} \cup \mathcal{S}_{10}^{2} \cup \mathcal{S}_{11}^{2}\right| \leq\left|\mathcal{P}_{2}\right|+\left|\mathcal{P}_{5}\right|+\left|\mathcal{P}_{10}\right|+\left|\mathcal{P}_{6}\right|+\left|\mathcal{P}_{4}\right|+\left|\mathcal{P}_{8}\right|+\left|\mathcal{P}_{9}\right|+\left|\mathcal{P}_{11}\right|+\left|\mathcal{P}_{3}\right|
$$

of shares of $\left(x_{i}\right)$, and

$$
\left|\mathcal{S}_{1}^{1} \cup \mathcal{S}_{10}^{1} \cup \mathcal{S}_{11}^{1}\right| \leq\left|\mathcal{P}_{1}\right|+\left|\mathcal{P}_{10}\right|+\left|\mathcal{P}_{6}\right|+\left|\mathcal{P}_{4}\right|+\left|\mathcal{P}_{8}\right|+\left|\mathcal{P}_{9}\right|+\left|\mathcal{P}_{11}\right|+\left|\mathcal{P}_{3}\right|
$$

of shares of $\left(y_{i}\right)$ to simulate all the probed values.
Now we show that the operations after the swap are $t$-SNI secure, implying that the whole algorithm is $t$-SNI secure. An abstract diagram of this part is given in Figure 7.


Figure 6: An abstract diagram of the swap part in SecFprAdd (Algorithm 13). The probing sets $\mathcal{P}_{i}$ for some $i$ are colored in red, and the simulation sets $\mathcal{S}_{i}$ and $\mathcal{S}_{i}^{j}$ for some $i, j$ are colored in blue. Gadgets with $t$-NI and $t$-SNI security are marked in black and green, respectively.

From the figure, one can list all the inequalities similarly and check that for each gadget, there are no more number of input shares than the probes used in the simulation. In particular, shares of $\left(x_{i}\right)$ used to simulate include $\mathcal{S}_{5}^{2}, \mathcal{S}_{15}$, and $\mathcal{S}_{18} \cup \mathcal{S}_{17}^{1}$, and

$$
\begin{gathered}
\left|\mathcal{S}_{5}^{2}\right| \leq\left|\mathcal{P}_{5}\right|+\left|\mathcal{S}_{4}^{1}\right| \leq\left|\mathcal{P}_{5}\right|+\left|\mathcal{P}_{4}\right|+\left|\mathcal{S}_{2}^{1} \cup \mathcal{S}_{3}\right| \leq\left|\mathcal{P}_{5}\right|+\left|\mathcal{P}_{4}\right|+\left|\mathcal{P}_{1}\right|+\left|\mathcal{P}_{2}\right|+\left|\mathcal{P}_{3}\right| \\
\left|\mathcal{S}_{15}\right| \leq\left|\mathcal{P}_{15}\right| \\
\left|\mathcal{S}_{18} \cup \mathcal{S}_{17}^{1}\right| \leq\left|\mathcal{P}_{18}\right|+\left|\mathcal{P}_{17}\right|+\left|\mathcal{S}_{7}^{1}\right| \leq\left|\mathcal{P}_{18}\right|+\left|\mathcal{P}_{17}\right|+\left|\mathcal{P}_{7}\right|
\end{gathered}
$$

For shares of $\left(y_{i}\right), \mathcal{S}_{12}^{2}, \mathcal{S}_{16}$, and $\mathcal{S}_{17}^{2}$ are used, and

$$
\begin{gathered}
\left|\mathcal{S}_{12}^{2}\right| \leq\left|\mathcal{P}_{12}\right| \\
\left|\mathcal{S}_{16}\right| \leq\left|\mathcal{P}_{16}\right| \\
\left|\mathcal{S}_{17}^{2}\right| \leq\left|\mathcal{P}_{17}\right|+\left|\mathcal{S}_{7}^{1}\right| \leq\left|\mathcal{P}_{17}\right|+\left|\mathcal{P}_{7}\right|
\end{gathered}
$$

Both sums are smaller than the number of internal probes.

## 5 Implementation and Evaluation

In this section, we provide our masked implementation's performance and security evaluations. Our experiments were mainly implemented in plain-C code, but we rewrote some segments of the 2 -shared version by assembly to reduce some observed leakages in security evaluation, which is discussed in Section 5.2 and 5.3. We first tested the performance of our design on an Arm Cortex-M4 processor, and then we used the program from Falcon reference code $\left[\mathrm{PFH}^{+} 20\right]$ to test the speed of one complete signing process on an Intel-Core i9-12900KF CPU, a general-purpose processor. For security evaluation, we ran our experiments on ChipWhisperer-Pro Level 3 Starter Kit [Inc], which includes a main control board and a target board to run the main program. The control board clocks the target board at 7.38 MHz and measures its power consumption during execution at the same frequency. The target board STM32F415 (CW308T-STM32F4) with an Arm Cortex-M4 MCU was used.

In our implementation, the 128 -bit multiplication in Algorithm 12 was realized by combining four 32 -bit registers in C. We generated the randomness for our masked implementation beforehand and fill them in a table to be read off. We list the number of used randomness in bytes for each algorithm in the performance evaluation subsection.


Figure 7: An abstract diagram of the operations following the swap in SecFprAdd (Algorithm 13). The probing sets $\mathcal{O}$ and $\mathcal{P}_{i}$ for some $i$ are colored in red, and the simulation sets $\mathcal{S}_{i}$ and $\mathcal{S}_{i}^{j}$ for some $i, j$ are colored in blue. Gadgets with $t$-NI and $t$-SNI security are marked in black and green, respectively.

### 5.1 Performance Evaluation

We first evaluate the performance of our masked implementation on the Arm Cortex-M4 processor and compare them with the unmasked version from Falcon reference code in NIST round- 3 submission $\left[\mathrm{PFH}^{+} 20\right]$, which is a re-implementation of floating-point arithmetic also written in plain-C. The cycle counts of floating-point number multiplication and addition are given in Table 2. For higher-order mask evaluation, we provide the results for the second-order mask ( 3 shares). All our code was compiled by arm-none-eabi-gcc 10.3.1 with optimization level -03.

The masked floating-point number multiplication takes about $23 \times$ overhead of the unmasked version for 2 shares and $118 \times$ overhead for 3 shares. It shows that our nonzero check gadget SecNonzero only causes a small amount of overhead compared to the whole multiplication and addition algorithm. For the 2 -shared design, the bottleneck is the packing SecFPR, in which a 64 -bit SecAdd is used. For the 3 -shared design, the heaviest overhead comes from the 3 -shared 128 -bit A2B, which internally calls a 128 -bit SecAdd.

The masked floating-point number addition takes about $35 \times$ cycles for our first-order masked version than the unmasked one and $99 \times$ cycles for our second-order version. It shows that the main overhead is caused by the four 64 -bit SecAdd functions, which is also the case in multiplication. Although it costs much in our implementation, it seems unlikely to avoid the Boolean masked addition gadgets or the mask conversion gadgets somewhere since the mantissa needs to be rounded in different stages and the sticky bit needs to be preserved.

In Table 3, we provide the speed for signing one message on the general-purpose Intel-Core i9-12900KF CPU with our masking countermeasure on the pre-image vector computation to show the amortized performance result in the whole Falcon. For this, we first replaced the floating-point arithmetic in the pre-image vector computation (line 3) with our masked multiplication and addition, and then we unmasked the result after all the computations were done. It shows that compared with the unmasked design, one signing process takes about $7.7 \times$ for 2 shares (about 1.9 ms for Falcon-512) and about

Table 2: Performance of each component in Algorithm 11, 12, and 13 on Arm Cortex-M4. We count the cycles for subroutines and the total random numbers used in bytes.

| Algorithm | Cycles |  |  |  |
| :--- | :--- | ---: | ---: | ---: |
|  | Total | Unmasked | 2 Shares | $\mathbf{3}$ Shares |
| SecFprMul | $\mathbf{3 0 8}$ | $\mathbf{7 1 3 4}$ | $\mathbf{3 6 3 8 8}$ |  |
|  | 128-bit A2B in line 4 | - | 1619 | 19253 |
|  | 64-bit SecNonzero in line 5 | - | 389 | 1350 |
|  | Two 16-bit SecNonzero in line 14, 15 | - | 662 | 2012 |
|  | SecFPR in Algorithm 11 | - | 3362 | 10813 |
|  | \#randombytes | - | 333 | 2005 |
|  | Total | $\mathbf{4 8 7}$ | $\mathbf{1 7 1 5 4}$ | $\mathbf{4 8 2 9 1}$ |
|  | Three 64-bit SecAdd in line 3, 19, 24 | - | 6990 | 16956 |
|  | Two 16-bit B2A in line 12 | - | 88 | 332 |
|  | SecFprAdd | - | 146 | 2267 |
|  | SecFprUrsh in Algorithm 9 | - | 1112 | 3214 |
|  | SecFprNorm64 in Algorithm 10 | - | 2846 | 7270 |
|  | SecFPR in Algorithm 11 | - | 3362 | 10813 |
|  | $\#$ randombytes | - | 849 | 2691 |

Table 3: Time (in microseconds) for signing a message on Intel-Core i9-12900KF CPU.

| Security Level | Unmasked | 2 Shares | 3 Shares |
| :---: | ---: | ---: | ---: |
| Falcon-512 | 246.56 | 1905.55 | 6137.25 |
| Falcon-1024 | 501.62 | 3819.76 | 12287.29 |

$24.9 \times$ for 3 shares (about 6.1 ms for FAlcon-512).

### 5.2 Security Evaluation

For practical security evaluation, we conducted the leakage assessment via the TVLA methodology, which we introduced in Section 2.5. We performed TVLA on our first-order (2 shares) and second-order (3 shares) masked floating-point number multiplication and addition, comparing them with the unmasked ones. Figure 8 shows our results. From left to right are the $t$-value statistics for unmasked, first-order, and second-order implementation. A threshold of $\pm 4.5$ is provided as red dotted lines, while for second-order traces, we offer green dotted lines as the recommended threshold in $\left[\mathrm{DZD}^{+} 17\right]$. For multiplication, the traces have a length of 295727 , and we set the threshold to $\pm 6.628$; while for addition with traces length of 387764 , the threshold is set to $\pm 6.668$.

For the unmasked function, we measured a total of 1,000 traces, and it turns out that almost every point exceeds the threshold. The results are improved for first-order implementation, but still some points exceed the threshold. By rewriting part of the code by assembly, adding redundant operations, and rearranging the order of independent instructions, every point is within the threshold in 10,000 traces. However, for tests with 100,000 traces, some values crossed the thresholds. It shows that the device may implicitly leak first-order information in the 2-shared implementation. We discuss this problem more thoroughly in Section 5.3. For second-order implementation, almost all the points are within the threshold $\pm 4.5$ in 100,000 traces, and all the points pass the test with the adapted thresholds. We see that the second-order implementation eliminates the leakages that appeared in the first-order design.


Figure 8: TVLA results of floating-point number multiplication (top row) and addition (bottom row) for (a)(d) unmasked implementation with 1,000 traces, (b)(e) first-order (2 shares) mask with 10,000 traces, and (c)(f) second-order ( 3 shares) mask with 100,000 traces

### 5.3 Discussion about the Leakage of the First-Order Design

The first-order implementation without further optimization still showed leakage with 10,000 power consumption traces in our experiment. Similar results were also found in previous works $\left[\mathrm{BGR}^{+} 21, \mathrm{BC} 22\right]$. On the other hand, our second-order implementation can pass the TVLA test of 100,000 traces, which indicates that leakage in the first-order design might be caused by unexpected equipment behavior. As pointed out and organized in [GD23], probing security cannot capture the physical defaults of devices. Glitches and transition-based leakage concerning the Hamming distance between two consecutive values written in a memory cell can cause power consumption related to the unmasked secret.

The leakage in $\left[\mathrm{BGR}^{+} 21, \mathrm{BC} 22\right]$ was eliminated or mitigated through assembly optimization. In our experiment, we used a defensive approach similar to that in [BC22], such as adding a dummy load and store operation before and after each consecutive share-wise operation where leakage appeared in the first-order TVLA result. We also found that shift operations could induce leakage, so we inserted redundant shift operations around the true ones. Besides, we separated dependent instructions to avoid potential leakage from Hamming distance or hidden buffer. With these revisions, we removed all the high values in the tests with 10,000 traces. Nevertheless, our first-order implementation failed to pass the test in 100,000 traces.

Two approaches can be taken to improve the result. The first one is a thorough assembly rewriting. With programs written in assembly, one can manipulate each register to avoid the potential transition-based leakage caused by compiling from a high-level language. However, hidden registers and other memory units in the processor [GOP22, MPW21] cannot be directly accessed. They may still induce transition-based leakage or even recombination of shares. Another concern with this strategy is that the design can vary for different devices. For example, we found different leakage patterns and locations when executing the same optimization method on STM32F303 and STM32F415 target boards; by contrast, the TVLA results of the second-order implementation were similar on both. The second approach is a secure design in the robust probing model [FGP ${ }^{+}$18, MMSS19],
which considers typical physical defaults like glitches. Unfortunately, a glitch-resistant model-based design of SecAdd, A2B, and B2A gadgets is still unknown to the best of our knowledge, and the procedure can require more than two shares and reduce efficiency.

## 6 Conclusion

In this work, we provide a masking scheme for Falcon's floating-point number multiplication and addition. To round the mantissa and compute the sticky bit efficiently, we design a masked nonzero check algorithm to find whether a shared value is nonzero, which can also be used to check the equality of two values and normalize a number. In addition, a masked right-shift and a masked normalization algorithm are proposed to add two floating-point numbers securely. The former can securely shift a value by some arithmetic shares while preserving the sticky bit, and the latter helps normalize a 64 -bit number to the specific range $\left[2^{63}, 2^{64}\right)$.

We provide formal proofs to show our design is secure in the $t$-probing model. Specifically, we apply the $t$-NI and $t$-SNI definitions and prove the security of our gadgets based on simulation. In terms of practical leakage on board, we conducted the leakage assessment experiments via TVLA. The first-order countermeasure with part of the functions rewritten by assembly, adding redundant operations, and rearrangement of execution orders can pass the test in 10,000 traces. With second-order masking, there is no significant leakage in 100,000 measured traces.

For performance evaluation, we compare cycle counts among unmasked, first-order, and second-order implementations on an Arm Cortex-M4 core. To achieve complete masking for the procedures containing both Boolean and arithmetic operations, our algorithms call arithmetic-to-Boolean mask conversion gadgets A2B and Boolean-masked addition gadgets SecAdd several times. It turns out that they cause the most considerable overhead in our design. We also tested the speed on an Intel-Core CPU, and it shows that a complete signing process with our countermeasure can be finished within a few milliseconds.

Throughout the signing algorithm of Falcon, the attack and defense of the Gaussian sampler have been discussed. However, the pre-image vector computation and other parts of the fast Fourier sampler are still at risk of being attacked, even though no works on the fast Fourier sampler have been proposed. A complete masking of Falcon can be constructed by combining our works with a masked design of the sampler. With our implementations and evaluations of the masking scheme for FALCON, it can resist known attacks on the pre-image vector computation, allowing FALCON to be used more securely.

## Acknowledgement

The authors would like to thank professor Ho-Lin Chen for his contributions through extensive discussions and insightful feedback on this work; Tzu-Hsien Chang, Yi-Lin Hung, and Yu-Cheng Su for their guidance in algorithm designs and implementations. The authors are also grateful to professor Bo-Yin Yang for the support of this research and the assistance in submission, as well as all the reviewers' efforts during the review process, which greatly improve this paper in all aspects.

## References

$\left[\mathrm{BBD}^{+} 16\right]$ Gilles Barthe, Sonia Belaïd, François Dupressoir, Pierre-Alain Fouque, Benjamin Grégoire, Pierre-Yves Strub, and Rébecca Zucchini. Strong noninterference and type-directed higher-order masking. In Edgar R. Weippl,

Stefan Katzenbeisser, Christopher Kruegel, Andrew C. Myers, and Shai Halevi, editors, ACM CCS 2016, pages 116-129. ACM Press, October 2016.
$\left[\mathrm{BBE}^{+} 18\right] \quad$ Gilles Barthe, Sonia Belaïd, Thomas Espitau, Pierre-Alain Fouque, Benjamin Grégoire, Mélissa Rossi, and Mehdi Tibouchi. Masking the GLP lattice-based signature scheme at any order. In Jesper Buus Nielsen and Vincent Rijmen, editors, EUROCRYPT 2018, Part II, volume 10821 of LNCS, pages 354-384. Springer, Heidelberg, April / May 2018.
[BC22] Olivier Bronchain and Gaëtan Cassiers. Bitslicing arithmetic/boolean masking conversions for fun and profit: with application to lattice-based kems. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2022(4):553-588, Aug. 2022.
[BCZ18] Luk Bettale, Jean-Sébastien Coron, and Rina Zeitoun. Improved high-order conversion from Boolean to arithmetic masking. IACR TCHES, 2018(2):22-45, 2018. https://tches.iacr.org/index.php/TCHES/article/view/873.
$\left[\mathrm{BGR}^{+} 21\right]$ Joppe W. Bos, Marc Gourjon, Joost Renes, Tobias Schneider, and Christine van Vredendaal. Masking kyber: First- and higher-order implementations. IACR TCHES, 2021(4):173-214, 2021. https://tches.iacr.org/index. php/TCHES/article/view/9064.
[BHLY16] Leon Groot Bruinderink, Andreas Hülsing, Tanja Lange, and Yuval Yarom. Flush, gauss, and reload - A cache attack on the BLISS lattice-based signature scheme. In Benedikt Gierlichs and Axel Y. Poschmann, editors, CHES 2016, volume 9813 of LNCS, pages 323-345. Springer, Heidelberg, August 2016.
[CGTV15] Jean-Sébastien Coron, Johann Großschädl, Mehdi Tibouchi, and Praveen Kumar Vadnala. Conversion from arithmetic to Boolean masking with logarithmic complexity. In Gregor Leander, editor, FSE 2015, volume 9054 of $L N C S$, pages 130-149. Springer, Heidelberg, March 2015.
[DH76] W. Diffie and M. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 22(6):644-654, 1976.
[DP16] Léo Ducas and Thomas Prest. Fast fourier orthogonalization. In Proceedings of the ACM on International Symposium on Symbolic and Algebraic Computation, pages 191-198, 2016.
$\left[\mathrm{DZD}^{+} 17\right]$ A. Adam Ding, Liwei Zhang, François Durvaux, François-Xavier Standaert, and Yunsi Fei. Towards sound and optimal leakage detection procedure. In Thomas Eisenbarth and Yannick Teglia, editors, Smart Card Research and Advanced Applications - 16th International Conference, CARDIS 2017, Lugano, Switzerland, November 13-15, 2017, Revised Selected Papers, volume 10728 of Lecture Notes in Computer Science, pages 105-122. Springer, 2017.
$\left[\mathrm{EFG}^{+} 22\right]$ Thomas Espitau, Pierre-Alain Fouque, François Gérard, Mélissa Rossi, Akira Takahashi, Mehdi Tibouchi, Alexandre Wallet, and Yang Yu. Mitaka: A simpler, parallelizable, maskable variant of falcon. In Orr Dunkelman and Stefan Dziembowski, editors, EUROCRYPT 2022, Part III, volume 13277 of $L N C S$, pages 222-253. Springer, Heidelberg, May / June 2022.
[EFGT17] Thomas Espitau, Pierre-Alain Fouque, Benoît Gérard, and Mehdi Tibouchi. Side-channel attacks on BLISS lattice-based signatures: Exploiting branch tracing against strongSwan and electromagnetic emanations in microcontrollers. In Bhavani M. Thuraisingham, David Evans, Tal Malkin, and

Dongyan Xu, editors, ACM CCS 2017, pages 1857-1874. ACM Press, October / November 2017.
[Elg85] T. Elgamal. A public key cryptosystem and a signature scheme based on discrete logarithms. IEEE Transactions on Information Theory, 31(4):469-472, 1985.
$\left[\mathrm{FBR}^{+} 22\right]$ Tim Fritzmann, Michiel Van Beirendonck, Debapriya Basu Roy, Patrick Karl, Thomas Schamberger, Ingrid Verbauwhede, and Georg Sigl. Masked accelerators and instruction set extensions for post-quantum cryptography. IACR TCHES, 2022(1):414-460, 2022.
[FGP ${ }^{+}$18] Sebastian Faust, Vincent Grosso, Santos Merino Del Pozo, Clara Paglialonga, and François-Xavier Standaert. Composable masking schemes in the presence of physical defaults \& the robust probing model. IACR TCHES, 2018(3):89 120, 2018. https://tches.iacr.org/index.php/TCHES/article/view/ 7270.
[GD23] John Gaspoz and Siemen Dhooghe. Threshold implementations in software: Micro-architectural leakages in algorithms. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2023(2):155-179, Mar. 2023.
[GGJR ${ }^{+}$11] Benjamin Jun Gilbert Goodwill, Josh Jaffe, Pankaj Rohatgi, et al. A testing methodology for side-channel resistance validation. In NIST non-invasive attack testing workshop, volume 7, pages 115-136, 2011.
[GMRR22] Morgane Guerreau, Ange Martinelli, Thomas Ricosset, and Mélissa Rossi. The hidden parallelepiped is back again: Power analysis attacks on falcon. IACR TCHES, 2022(3):141-164, 2022.
[GOP22] Si Gao, Elisabeth Oswald, and Dan Page. Towards micro-architectural leakage simulators: Reverse engineering micro-architectural leakage features is practical. In Advances in Cryptology - EUROCRYPT 2022: 41st Annual International Conference on the Theory and Applications of Cryptographic Techniques, Trondheim, Norway, May 30 - June 3, 2022, Proceedings, Part III, page 284-311, Berlin, Heidelberg, 2022. Springer-Verlag.
[GPV08] Craig Gentry, Chris Peikert, and Vinod Vaikuntanathan. Trapdoors for hard lattices and new cryptographic constructions. In Richard E. Ladner and Cynthia Dwork, editors, 40th ACM STOC, pages 197-206. ACM Press, May 2008.
$\left[\mathrm{HKL}^{+} 22\right]$ Daniel Heinz, Matthias J. Kannwischer, Georg Land, Thomas Pöppelmann, Peter Schwabe, and Amber Sprenkels. First-order masked kyber on ARM cortex-M4. Cryptology ePrint Archive, Report 2022/058, 2022. https: //eprint.iacr.org/2022/058.
[HPRR20] James Howe, Thomas Prest, Thomas Ricosset, and Mélissa Rossi. Isochronous gaussian sampling: From inception to implementation. In Jintai Ding and Jean-Pierre Tillich, editors, Post-Quantum Cryptography - 11th International Conference, PQCrypto 2020, pages 53-71. Springer, Heidelberg, 2020.
[Inc] NewAE Technology Inc. Chipwhisperer-pro (complete level 3 starter kit). https://store.newae.com/ chipwhisperer-pro-complete-level-3-starter-kit/.
[ISW03] Yuval Ishai, Amit Sahai, and David Wagner. Private circuits: Securing hardware against probing attacks. In Dan Boneh, editor, CRYPTO 2003, volume 2729 of $L N C S$, pages 463-481. Springer, Heidelberg, August 2003.
[JMV01] Don Johnson, Alfred Menezes, and Scott Vanstone. The elliptic curve digital signature algorithm (ecdsa). International journal of information security, 1(1):36-63, 2001.
[KA21] Emre Karabulut and Aydin Aysu. Falcon down: Breaking falcon post-quantum signature scheme through side-channel attacks. In 2021 58th ACM/IEEE Design Automation Conference (DAC), pages 691-696, 2021.
[MGTF19] Vincent Migliore, Benoît Gérard, Mehdi Tibouchi, and Pierre-Alain Fouque. Masking Dilithium - efficient implementation and side-channel evaluation. In Robert H. Deng, Valérie Gauthier-Umaña, Martín Ochoa, and Moti Yung, editors, ACNS 19, volume 11464 of $L N C S$, pages 344-362. Springer, Heidelberg, June 2019.
[MHS $\left.{ }^{+} 19\right]$ Sarah McCarthy, James Howe, Neil Smyth, Seamus Brannigan, and Máire O'Neill. BEARZ attack FALCON: Implementation attacks with countermeasures on the FALCON signature scheme. Cryptology ePrint Archive, Report 2019/478, 2019. https://eprint.iacr.org/2019/478.
[MMSS19] Thorben Moos, Amir Moradi, Tobias Schneider, and François-Xavier Standaert. Glitch-resistant masking revisited. IACR TCHES, 2019(2):256-292, 2019. https://tches.iacr.org/index.php/TCHES/article/view/7392.
[MOP07] Stefan Mangard, Elisabeth Oswald, and Thomas Popp. Power analysis attacks - revealing the secrets of smart cards. Springer, 2007.
[MPW21] Ben Marshall, Dan Page, and James Webb. Miracle: Micro-architectural leakage evaluation: A study of micro-architectural power leakage across many devices. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2022(1):175-220, Nov. 2021.
[oSTa] National Institute of Standards and Technology. Postquantum cryptography standardization. https:// csrc.nist.gov/Projects/Post-Quantum-Cryptography/ Post-Quantum-Cryptography-Standardization.
[oSTb] National Institute of Standards and Technology. Post-quantum cryptography standardization. https://csrc.nist.gov/Projects/ post-quantum-cryptography/selected-algorithms-2022.
[PBY17] Peter Pessl, Leon Groot Bruinderink, and Yuval Yarom. To BLISS-B or not to be: Attacking strongSwan's implementation of post-quantum signatures. In Bhavani M. Thuraisingham, David Evans, Tal Malkin, and Dongyan Xu, editors, ACM CCS 2017, pages 1843-1855. ACM Press, October / November 2017.
$\left[\mathrm{PFH}^{+} 20\right]$ Thomas Prest, Pierre-Alain Fouque, Jeffrey Hoffstein, Paul Kirchner, Vadim Lyubashevsky, Thomas Pornin, Thomas Ricosset, Gregor Seiler, William Whyte, and Zhenfei Zhang. FALCON. Technical report, National Institute of Standards and Technology, 2020. available at https://csrc.nist.gov/projects/post-quantum-cryptography/ post-quantum-cryptography-standardization/round-3-submissions.
[RSA78] Ronald L. Rivest, Adi Shamir, and Leonard M. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the Association for Computing Machinery, 21(2):120-126, February 1978.
[Sho97] Peter W. Shor. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Journal on Computing, 26(5):1484-1509, oct 1997.
[SPOG19] Tobias Schneider, Clara Paglialonga, Tobias Oder, and Tim Güneysu. Efficiently masking binomial sampling at arbitrary orders for lattice-based crypto. In Dongdai Lin and Kazue Sako, editors, PKC 2019, Part II, volume 11443 of LNCS, pages 534-564. Springer, Heidelberg, April 2019.
[ZLYW23] Shiduo Zhang, Xiuhan Lin, Yang Yu, and Weijia Wang. Improved power analysis attacks on falcon. In Carmit Hazay and Martijn Stam, editors, EUROCRYPT 2023, Part IV, volume 14007 of $L N C S$, pages 565-595. Springer, Heidelberg, April 2023.

