HAETAE: Shorter Lattice-Based Fiat-Shamir Signatures

We present HAETAE (Hyperball bimodAl modulE rejecTion signAture schemE), a new lattice-based signature scheme. Like the NIST-selected Dilithium signature scheme, HAETAE is based on the Fiat-Shamir with Aborts paradigm, but our design choices target an improved complexity/compactness compromise that is highly relevant for many space-limited application scenarios. We primarily focus on reducing signature and verification key sizes so that signatures fit into one TCP or UDP datagram while preserving a high level of security against a variety of attacks. As a result, our scheme has signature and verification key sizes up to 39% and 25% smaller, respectively, than those of Dilithium. We provide a portable, constant-time reference implementation together with an optimized implementation using AVX2 instructions and an implementation with reduced stack size for the Cortex-M4. Moreover, we describe how to efficiently protect HAETAE against implementation attacks such as side-channel analysis, making it an attractive candidate for use in IoT and other embedded systems.


Introduction
The rise of quantum computing has brought up, among others, the necessity of new, post-quantum digital signature schemes. In the standardization process of post-quantum cryptography by the American National Institute of Standards and Technology (NIST), the lattice-based schemes Falcon [FHK+18] and Dilithium [DKL+18] have already been announced as future standards, and 40 new candidates have entered an additional on-ramp process. The critical challenge in developing lattice-based digital signatures lies in finding a balance between security and practicality: while developing schemes that are secure against a wide range of attacks is essential, it is also vital to ensure that they are practical for real-world applications. This challenge becomes even more critical with the increasing prevalence of embedded devices and the Internet of Things (IoT). Both technologies have become ubiquitous, from home appliances to medical devices connected to the internet.
In particular, this leads to two practical requirements: 1. The verification key and signature sizes must be as small as possible, since both are frequently transmitted. Specifically, it is helpful if the signature is small enough to be sent in only one UDP or TCP datagram, as this minimizes the need for packet fragmentation. The importance of the signature and verification key sizes for communication protocols has already been highlighted in multiple evaluations [Wes21, PST20, GS23]. Paquin et al. [PST20] observe for TLS that fragmentation over many packets has a significant performance impact for network links with non-ideal packet loss rates. Benchmarking DNSSEC [GS23] revealed that the smaller signatures of Falcon lead to faster resolution times than Dilithium in most scenarios, although signature computation and verification are much faster with Dilithium than with Falcon.
2. The secret-dependent operations, such as key generation and message signing, must be easy to protect against implementation attacks. This is essential in embedded use cases like the IoT, where attackers have physical access and can measure power consumption or electromagnetic emanation [KA21], in addition to the timing behaviour [Sch00], which is also exploitable remotely.
In this context, Falcon fulfills the first requirement very well, but an effort to make it satisfy the second requirement, namely Mitaka [EFG+22], was recently broken [Pre23]. Dilithium, on the other hand, focuses on being easy to implement and to protect against side-channel attacks. However, this comes at the cost of larger signatures and verification keys, which, for example, do not allow a signature to fit in one UDP datagram. We summarize this discussion in Table 1 and compare the two with HAETAE.
Table 2: Relative comparison between HAETAE, Dilithium, and Falcon. The security levels are given in the parameter sets instead of their names. The percentages are the ratios of sizes and execution times. The execution time is measured as the median cycle count among 1000 executions, obtained on one core of an Intel Core i7-10700k, with TurboBoost and hyperthreading disabled.

Parameter set              Sig. size  vk size  KeyGen  Sign  Verify
HAETAE-2 / Dilithium-2     61%        76%      409%    507%  100%
HAETAE-3 / Dilithium-3     71%        75%      376%    444%  113%
HAETAE-5 / Dilithium-5     64%        80%      328%    454%  91%
HAETAE-2 / Falcon-1        221%       111%     3%      35%   365%
HAETAE-5 / Falcon-5        230%       116%     2%      30%   399%

HAETAE benefits from several novel improvements in the key generation algorithm. We introduce a new rejection procedure in the key generation algorithm to minimize the magnitude of the secret key when multiplied by the challenge. This facilitates rejection sampling in the signing algorithm and leads to smaller signatures. The key generation rejection is also designed to be efficient and simple to implement. It significantly improves over a procedure with a similar objective in the key generation of BLISS. Furthermore, we introduce to the bimodal setting a verification key truncation with the same objective as Dilithium's. A direct adaptation would lead to large bounds for the verification algorithm and degraded security. Instead, we compensate for the verification key truncation by correcting the signing key accordingly. This increases the magnitude of the signing key, but by a much smaller amount than the naive approach.
For the signing algorithm, we adapt Dilithium's signature compression so that it is compatible with our module-lattice key generation algorithm, by taking the residues modulo 2 into account. Further, we apply the signature encoding technique from [ETWY22] to hyperball-uniform distributions. The main novelty in the signing algorithm is a detailed description of a fixed-point arithmetic algorithm for sampling uniformly in a hyperball, which was left open in [DFPS22]. The discretization leads to numerical errors: we bound them and bound their effect on the security of the scheme.
Implementation and Performance. We propose three parameter sets with NIST security levels 2, 3 and 5. Each parameter set of HAETAE has a 20-25% smaller verification key and 29-39% shorter signatures than its counterpart in Dilithium. Based on our portable and constant-time reference implementation of HAETAE, the verification process is as fast as Dilithium's, while the key generation and signing algorithms are up to five times slower than Dilithium's. Up to 80% of the signing time is consumed by the hyperball sampling. Thus, any improvement to this sampling would contribute greatly to the efficiency of HAETAE, independently of further optimizations. Nonetheless, our benchmarks indicate that signing with HAETAE is still around three times faster than with Falcon (portable Falcon with emulated floating-point operations). We summarize the comparison results in Table 2.
We provide a detailed, implementation-oriented specification using the Chinese Remainder Theorem (CRT) and the Number-Theoretic Transform (NTT), which enables efficient implementation of HAETAE (Section 5). We additionally developed an optimized version using AVX2 instructions (Section 6), and an implementation for the Cortex-M4 (Section 7), where we explore stack reduction techniques.
Moreover, we observe that masking HAETAE against physical attacks is only slightly more complex than masking Dilithium, owing to the similarity of the scheme designs and the use of fixed-point arithmetic. One of the conceptual differences between HAETAE and Falcon (and their variants) regarding physical attacks is that HAETAE only needs Gaussian samples for secret-independent centers and standard deviations.
Finally, we note that, like other Fiat-Shamir signatures such as Schnorr signatures [Sch90], the randomized signing of HAETAE can take advantage of pre-computations. By sampling from the hyperball and pre-computing the message-independent components offline, the online signing phase of HAETAE is sped up by a factor of five.
Our code is publicly available.

Related Work

An alternative approach to avoiding leakage during the rejection step in lattice-based Fiat-Shamir signatures is to remove the rejection altogether. A first approach is to flood whatever depends on the signing key in signatures with a much larger quantity. As shown in [ASY22], relying on the Rényi divergence for the security analysis makes it possible to limit the amount of flooding. A concrete instantiation was recently proposed in [dPEK+]. This, however, results in signature sizes that are much higher than ours. A second approach was recently given in [DPS23], which uses Gaussian convolutions to obtain signatures that can be simulated, without flooding or rejection sampling. However, signing is more complex, as it relies on sampling from large-dimensional integer Gaussian distributions with non-diagonal covariance matrices. Also, the signature sizes provided in [DPS23] are worse than ours for the smallest parameter set, and only marginally smaller for larger parameter sets. An extensive comparison of recent lattice signatures can be found in Appendix A.

Preliminaries
Before introducing specific results adapted to the setting of HAETAE in Section 3 and the HAETAE scheme itself in Section 4, we start by defining the notations used throughout this paper and recapitulate relevant fundamental works.

Notations
Matrices are denoted by bold upper-case letters (e.g., A), while vectors are denoted by bold lower-case letters (e.g., y or z_1). The i-th component of a vector is denoted with subscript i (e.g., y_i for the i-th component of y).
Every vector is a column vector. We denote vertical concatenation of vectors by (u, v) and horizontal concatenation by (u|v). We naturally extend the latter notation to concatenations between matrices and vectors (e.g., (A|b) or (A|B)).
We let R = Z[x]/(x^n + 1) be a polynomial ring, where n is a power-of-2 integer, and, for any positive integer q, we let R_q = Z[x]/(q, x^n + 1) = Z_q[x]/(x^n + 1) be the corresponding quotient ring. We abuse notations and identify R_2 with the set of elements of R with binary coefficients. We also let R_R = R[x]/(x^n + 1) be the polynomial ring over the real numbers. For an integer η, we let S_η denote the set of polynomials of degree less than n with coefficients in [−η, η]. For a positive integer N, we let B_{(1/N)R,m}(r, c) denote the discretized hyperball with radius r > 0 and center c ∈ R^m in dimension m > 0 with respect to N. When c = 0, we omit it. Given a measurable set X ⊆ R^m of finite volume, we let U(X) denote the continuous uniform distribution over X. It admits x → χ_X(x)/Vol(X) as a probability density, where χ_X is the indicator function of X and Vol(X) is the volume of the set X. For the normal distribution over R centered at µ with standard deviation σ, we use the notation N(µ, σ).
For a positive integer α, we define r mod± α as the unique integer r′ in the range [−α/2, α/2) satisfying r′ = r mod α. We also define r mod+ α as the unique integer r′ in the range [0, α) that satisfies r′ = r mod α. We denote the least significant bit of an integer r by LSB(r). We naturally extend this to integer polynomials and to vectors of integer polynomials by applying it component-wise.
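These reductions are easy to get wrong at the interval boundaries; a direct Python transcription of the three definitions (a sketch for sanity-checking, not part of the specification) reads as follows:

```python
def mod_pm(r, alpha):
    """r mod± alpha: the unique r' in [-alpha/2, alpha/2) with r' = r (mod alpha)."""
    rp = r % alpha                          # representative in [0, alpha)
    return rp - alpha if rp >= (alpha + 1) // 2 else rp

def mod_plus(r, alpha):
    """r mod+ alpha: the unique r' in [0, alpha) with r' = r (mod alpha)."""
    return r % alpha

def lsb(r):
    """Least significant bit of the integer r."""
    return r & 1
```

Note that for even α (the only case used here, since α is a power of two), the boundary value α/2 is mapped to −α/2, as required by the half-open range.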

Signatures
We briefly recall the formalism of digital signatures.
Definition 1 (Digital Signature). A signature scheme is a tuple of PPT algorithms (KeyGen, Sign, Verify) with the following specifications:
• KeyGen : 1^λ → (vk, sk) outputs a verification key vk and a signing key sk;
• Sign : (sk, µ) → σ takes as inputs a signing key sk and a message µ, and outputs a signature σ;
• Verify : (vk, µ, σ) → b ∈ {0, 1} is a deterministic algorithm that takes as inputs a verification key vk, a message µ, and a signature σ, and outputs a bit b ∈ {0, 1}.
Let γ > 0. We say that the scheme is γ-correct if, for any pair (vk, sk) in the range of KeyGen and any message µ, we have Pr[Verify(vk, µ, Sign(sk, µ)) = 1] ≥ γ, where the probability is taken over the random coins of the signing algorithm. We say that it is correct in the (Q)ROM if the above holds when the probability is also taken over the randomness of the random oracle modeling the hash function used in the scheme.
We also give two security notions, namely the existential unforgeability under chosen message attacks, and under no-message attacks.
Definition 2 (Security). Let T, δ ≥ 0. A signature scheme sig = (KeyGen, Sign, Verify) is said to be (T, δ)-UF-CMA secure in the QROM if, for any quantum adversary A with runtime ≤ T given (classical) access to the signing oracle and (quantum) access to a random oracle H, it holds that Pr[Verify(vk, µ*, σ*) = 1 : (µ*, σ*) ← A^{Sign(sk,·),H}(vk)] ≤ δ, where the randomness is taken over the random coins of A and (vk, sk) ← KeyGen(1^λ). The adversary must not have issued a signing query for µ*. The above probability of forging a signature is called the advantage of A and is denoted by Adv^{UF-CMA}_{sig}(A). If A does not output anything, then it automatically fails.
Existential unforgeability against no-message attacks, denoted by UF-NMA, is defined similarly, except that the adversary is not allowed to make any signing queries.

Lattice Assumptions
We first recall the well-known lattice assumptions MLWE and MSIS on algebraic lattices.
Definition 3 (Decision-MLWE_{n,q,k,ℓ,η}). For positive integers q, k, ℓ, η and the dimension n of R, the advantage of an adversary A solving the decision-MLWE_{n,q,k,ℓ,η} problem is Adv_{MLWE}(A) = |Pr[A(A, As + e) = 1] − Pr[A(A, u) = 1]|, where A ← U(R_q^{k×ℓ}), (s, e) ← U(S_η^ℓ × S_η^k), and u ← U(R_q^k).
Definition 4 (Search-MSIS_{n,q,k,ℓ,β}). For positive integers q, k, ℓ, a positive real number β, and the dimension n of R, the advantage of an adversary A solving the search-MSIS_{n,q,k,ℓ,β} problem is Adv_{MSIS}(A) = Pr[0 < ∥y∥₂ ≤ β ∧ (A | Id_k) · y = 0 mod q : A ← U(R_q^{k×ℓ}), y ← A(A)].
Finally, we introduce a variant of the SelfTargetMSIS problem from Dilithium [DKL+18], which corresponds to our setting.
In the ROM (resp. QROM), the adversary is given classical (resp. quantum) access to H.
The following classical reduction from MSIS to BimodalSelfTargetMSIS is very similar to the reduction from MSIS to SelfTargetMSIS introduced in [DKL+18], and it is similarly non-tight. Like the latter reduction, it cannot be straightforwardly extended to a reduction in the QROM, since it relies on the forking lemma.
Theorem 1 (Classical Reduction from MSIS to BimodalSelfTargetMSIS). Let q > 0 be an odd modulus and H : {0, 1}* × M → R_2 a cryptographic hash function modeled as a random oracle, and assume that every polynomial-time classical algorithm has a negligible advantage against MSIS_{n,q,k,ℓ,β}. Then every polynomial-time classical algorithm has negligible advantage against BimodalSelfTargetMSIS_{n,q,k,ℓ,β/2}.
Proof sketch. Consider a BimodalSelfTargetMSIS_{n,q,k,ℓ,β/2} classical algorithm A that is polynomial-time and has classical access to H.
If A makes at most Q queries to H and outputs a solution (y, c, M_j) for some j ∈ [Q], then we can construct an adversary A′ for MSIS_{n,q,k,ℓ,β} as follows.
The adversary A′ can first rewind A to the point at which the j-th query was made and reprogram the hash as H(w_j, M_j) = c′ (≠ c). Then, with probability approximately 1/Q, algorithm A will produce another solution (y′, c′, M_j). We then have (b | A_0 | Id_k) · (y − y′) = q(c − c′)j mod 2q. As q is odd, we have A(y − y′) = (c − c′)j mod 2. The fact that c′ ≠ c implies that the latter is non-zero modulo 2, and hence so is y − y′ over the integers. As it also satisfies (b | A_0 | Id_k) · (y − y′) = 0 mod q and ∥y − y′∥ < β, it provides an MSIS_{n,q,k,ℓ,β} solution for the matrix (b | A_0 | Id_k), where the submatrix (−b | A_0) ∈ R_q^{k×ℓ} is uniform.

Sampling from the Continuous Hyperball-uniform
In order to sample in practice from the hyperball-uniform distribution, we rely on the following result.

Lemma 1 ([VGS17]). The distribution of the output of the algorithm in Figure 1 is the continuous uniform distribution over the hyperball.
Sampling from the continuous hyperball-uniform distribution can thus be done using the algorithm in Figure 1. However, to secure the HAETAE implementation, we sample from the discrete hyperball-uniform distribution. We delay to Section 3.2 the analysis of a discretized version, which turns discrete Gaussian samples into a discrete hyperball-uniform distribution.
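The classical Gaussian-based construction behind such samplers draws m + 2 standard normals, normalizes the resulting vector, and keeps the first m coordinates scaled by the radius. A continuous-variant sketch (the function name is ours, and this is only an illustration of the principle, not the fixed-point algorithm of Section 3.2):

```python
import numpy as np

def hyperball_uniform(m, r=1.0, rng=None):
    """Sample uniformly from the m-dimensional ball of radius r.

    Draw m+2 iid standard Gaussians; the normalized (m+2)-vector is uniform
    on the unit sphere of R^{m+2}, and its first m coordinates are uniform
    in the m-dimensional unit ball. Scale by r to get the desired radius.
    """
    rng = np.random.default_rng() if rng is None else rng
    g = rng.standard_normal(m + 2)
    return r * g[:m] / np.linalg.norm(g)
```

A quick statistical check: for the ball-uniform distribution, (∥x∥/r)^m is uniform on [0, 1], so its empirical mean over many samples should be close to 1/2.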

Signature Encoding via Range Asymmetric Numeral System
A HAETAE signature is essentially a vector z that is compressed into a vector z_2 of smaller dimension and a hint h, which are then encoded. Whereas Huffman coding is applied to one coordinate at a time, an arithmetic coding encodes all coordinates into a single number. In contrast to Huffman coding, arithmetic coding gets close to the entropy also for alphabets whose symbol probabilities are not powers of two. We recall a recent type of entropy coding, named range Asymmetric Numeral Systems (rANS) [Dud13], that encodes the state in a natural number and thus allows faster implementations. The rANS encoding technique was recently used in [ETWY22], and we adapt it to hyperball-uniform distributions. As a stream variant, rANS can be implemented with finite-precision integer arithmetic by using renormalization.
Then, we define the rANS encoding/decoding for the set S and frequencies g/2^t as in Figure 2.
Lemma 2 (Adapted from [Dud13]). The rANS coding is correct, and the size of the rANS code is asymptotically equal to the Shannon entropy of the symbols. Finally, the cost of encoding the first symbol is ≤ t, i.e., for any s ∈ S, we have log(C(0, s)) ≤ t.
We determine the frequencies of the symbols experimentally, by executing the signature computation and collecting several million samples. Finally, we apply a rounding strategy in order to heuristically minimize the empirical entropy −Σ_{s∈S} p(s) log(g(s)/2^t).
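To make the coding concrete, here is a minimal, non-streaming rANS encoder/decoder for a frequency table g with Σ_s g(s) = 2^t, relying on Python's arbitrary-precision integers (a sketch for illustration; the deployed coder of Figure 2 is a renormalized stream variant):

```python
def _cdf(freq):
    """Cumulative frequencies, in a fixed symbol order."""
    cdf, acc = {}, 0
    for s in sorted(freq):
        cdf[s] = acc
        acc += freq[s]
    return cdf

def rans_encode(symbols, freq, t):
    """Encode a symbol list into a single natural number (state)."""
    M, cdf, x = 1 << t, _cdf(freq), 0
    for s in reversed(symbols):          # encode backwards so decoding runs forwards
        x = (x // freq[s]) * M + cdf[s] + (x % freq[s])
    return x

def rans_decode(x, n, freq, t):
    """Recover n symbols from the state x."""
    M, cdf, out = 1 << t, _cdf(freq), []
    for _ in range(n):
        r = x % M
        for s in sorted(freq):           # find the symbol owning slot r
            if cdf[s] <= r < cdf[s] + freq[s]:
                break
        out.append(s)
        x = freq[s] * (x // M) + r - cdf[s]
    return out
```

One encoding step is exactly invertible, and the state after encoding only the first symbol is below 2^t, matching the first-symbol cost stated in Lemma 2.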

HAETAE-specific Results
While our scheme is reminiscent of Dilithium, the bimodal setting hinders the use of some of its base components. In this section, we describe the parts that are specifically adapted to HAETAE. First, the key generation algorithm departs from known key generation algorithms for BLISS, as we work in the module setting. Second, we study the precision needed when discretizing the hyperball sampler from Section 2.4 to enable fixed-point arithmetic. Then, we explain how challenges are computed in HAETAE. Next, we describe the rejection sampling procedure and estimate its expected number of iterations depending on the fixed-point arithmetic precision. Finally, we explain how to split the coordinates of a signature vector into high and low bits, allowing for signature compression by dropping the low bits. This order is consistent with the order in which those results are used during signing.

Key Generation
When using bimodal rejection sampling, the verification step relies on a specific key pair (A, s) ∈ R_p^{k×(k+ℓ)} × R_p^{k+ℓ} such that As = −As mod p. To generate such a pair, following [DDLL13], we choose p = 2q and aim at As = qj mod 2q for j = (1, 0, . . . , 0)^⊤.

Key Generation and Encoding
To build such a key pair (A, s), we proceed as follows. We first generate an MLWE sample b = A_gen s_gen + e_gen mod q, where A_gen ← U(R_q^{k×(ℓ−1)}) and (s_gen, e_gen) ← U(S_η^{ℓ−1} × S_η^k). We then define A = (−2b + qj | 2A_gen | 2Id_k) mod 2q as well as s^⊤ = (1 | s_gen^⊤ | e_gen^⊤). This is a valid key pair for HAETAE, but the choice of the even modulus 2q makes it hard to truncate the least significant bits of b as in Dilithium.
To enable the verification key truncation, we modify the key generation algorithm as follows. We use an extra randomness a_gen ← U(R_q^k) and let b − a_gen = A_gen s_gen + e_gen mod q. For any decomposition b = b_1 + b_0, we then define A = (2(a_gen − b_1) + qj | 2A_gen | 2Id_k) as well as s^⊤ = (1 | s_gen^⊤ | (e_gen − b_0)^⊤). One sees that As = qj mod 2q. In practice, the verification key is then comprised of b_1 and the seed that allows generating A_gen and a_gen. The secret key is the seed used to generate s and (A_gen, a_gen).
It remains to choose the decomposition of b, which we see as an nk-dimensional vector with coordinates in [0, q − 1]. We set the coordinates of b_1 as follows. If a coordinate of b is even, then we take the same value for the corresponding coordinate of b_1. Else, we take the rounding of this coordinate to the nearest multiple of 4 as the value for b_1. Next, we set b_0 = b − b_1 and note that the coordinates of b_0 lie in [−1, 1], i.e., b_0 ∈ S_1^k. We can then write b = b_0 + 2b′_1, where b′_1 is encoded using ⌈log_2(q)⌉ − 1 bits per coordinate, i.e., one bit less than b. This is computed coordinate-wise as b_0 = (−1)^{⌊b/2⌋ mod 2} · (b mod 2). In all of the following, we let (LowBits_vk(b), HighBits_vk(b)) denote (b_0, b_1).
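The coordinate-wise rule can be transcribed directly; the sketch below (our function names) recovers b = b_0 + b_1 with b_1 even, equal to b for even coordinates and to the nearest multiple of 4 for odd ones:

```python
def lowbits_vk(b):
    """b_0 = (-1)^(floor(b/2) mod 2) * (b mod 2), applied per coordinate."""
    return (1 if (b // 2) % 2 == 0 else -1) * (b % 2)

def highbits_vk(b):
    """b_1 = b - b_0: equal to b when b is even, else the nearest multiple of 4."""
    return b - lowbits_vk(b)
```

In particular, b_1/2 fits in one bit less than b, which is exactly what makes the verification key truncation possible.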
When b is uniform, we notice that the coordinates of b_0 roughly follow a (centered) binomial law with parameters (2, 1/2), which experimentally leads to smaller choices for the parameter γ introduced below.
Note that the truncation reduces each coefficient of b by 1 bit. The verification key thus becomes shorter, but not significantly. Therefore, we use the truncation for the lower security levels and keep the non-truncated version for the highest level. In the following, we refer to the truncated version as d = 1 and to the non-truncated version as d = 0, where d is the vk truncation bit.

Rejection Sampling on the Key
A critical step of our scheme is bounding ∥cs∥₂, where s is generated as before and c ∈ R is a polynomial with binary coefficients, at most τ of which are nonzero. The lower this bound, the smaller the signature, which in turn makes forging harder. In the key generation algorithm, we apply the rejection condition f(s) ≤ γ²n for some heuristic value γ, where f(s) = τ · Σ_{i=1}^{m} ∥s(ω_{σ(i)})∥₂² + r · ∥s(ω_{σ(m+1)})∥₂², which bounds ∥cs∥₂ ≤ γ√τ. Here m = ⌊n/τ⌋, r = n mod τ, the ω_j's are the primitive 2n-th roots of unity, σ is a permutation sorting the ∥s(ω_j)∥₂'s in decreasing order, and s(ω_j) is defined as (s_1(ω_j), . . . , s_{k+ℓ}(ω_j)). This is justified by the following lemma.
Lemma 3. Let s ∈ R^{k+ℓ} and c ∈ R_2 with wt(c) ≤ τ. Then ∥cs∥₂² ≤ (τ/n) · f(s), where m = ⌊n/τ⌋, r = n mod τ, and the ω_j's are the primitive 2n-th roots of unity.
Proof. We first rewrite ∥cs∥₂² as (1/n) Σ_{j=1}^{n} |c(ω_j)|² ∥s(ω_j)∥₂², where s(ω_j) = (s_1(ω_j), . . . , s_{k+ℓ}(ω_j)). We have Σ_j |c(ω_j)|² = n∥c∥₂² = nτ, and |c(ω_j)|² ≤ τ² for every j. Let m = ⌊n/τ⌋ and r = n mod τ. Then m is the maximum number of |c(ω_j)|²'s that can equal τ². Sorting the ∥s(ω_j)∥₂'s in decreasing order via a permutation σ of the indices, the sum is maximized when the m largest ∥s(ω_{σ(i)})∥₂²'s are multiplied by τ² and the remaining weight rτ is assigned to ∥s(ω_{σ(m+1)})∥₂², i.e., ∥cs∥₂² ≤ (1/n)(τ² Σ_{i=1}^{m} ∥s(ω_{σ(i)})∥₂² + rτ ∥s(ω_{σ(m+1)})∥₂²) = (τ/n) f(s). This concludes the proof.
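The bound can be checked numerically. The sketch below (our names, numpy-based) evaluates s at the primitive 2n-th roots of unity via the standard twist-then-FFT trick, computes f(s), and compares (τ/n)·f(s) against the actual ∥cs∥₂² for random sparse challenges:

```python
import numpy as np

def negacyclic_mul(a, b):
    """Multiply two integer polynomials modulo x^n + 1."""
    n = len(a)
    full = np.convolve(a, b)                 # plain product, degree <= 2n-2
    res = full[:n].astype(np.int64)
    res[: n - 1] -= full[n:]                 # fold back using x^n = -1
    return res

def f_bound(s_polys, tau):
    """f(s) = tau * (sum of the m largest ||s(w_j)||^2) + r * the next one."""
    n = len(s_polys[0])
    zeta = np.exp(1j * np.pi / n)            # primitive 2n-th root of unity
    norms = np.zeros(n)
    for p in s_polys:
        evals = np.fft.fft(p * zeta ** np.arange(n))   # p at odd powers of zeta
        norms += np.abs(evals) ** 2
    norms = np.sort(norms)[::-1]             # decreasing order
    m, r = n // tau, n % tau
    return tau * norms[:m].sum() + (r * norms[m] if r else 0.0)
```

The Parseval identity for the twisted FFT gives ∥cs∥₂² = (1/n) Σ_j |c(ω_j)|² ∥s(ω_j)∥₂², which is exactly what the test below bounds.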

Sampling in a Discrete Hyperball
In order to generate a hyperball-uniform sample y, we apply a rounding-and-reject strategy to the discretization of the continuous hyperball-uniform sampler from Figure 1, which allows us to generate correctly distributed samples. Our sampling approach avoids the use of floating-point arithmetic for two reasons. First, many microarchitectures do not provide floating-point units, and even when they do, the execution time of floating-point instructions may be data-dependent and thus unsuitable [AKM+15] for a constant-time implementation.
Floating-point computation would also prohibit a masked implementation, i.e., one protected against power side-channel attacks, because known masking techniques are only applicable to integers. Second, the required precision is higher than what is achievable even with IEEE doubles. We therefore replace the continuous Gaussian sampler from Lemma 1 with discrete Gaussian distributions, as we know that they approximate continuous Gaussians well for large standard deviations.
Discretizing the Output. Once we obtain a hyperball sample, we choose to round it. Then, if the resulting sample lies too close to the border of the hyperball, we reject it. This ensures that every possible sample has the same number of pre-rounding predecessors. Rounding also decreases the precision, but the output is now discrete, lying in a hyperball with a somewhat smaller radius; we simply increase the starting radius to compensate. We study in the following lemma the rejection probability of this step.
Lemma 4. Let n be the degree of R and M_0 ≥ 1, B, m, N > 0. At each iteration, the algorithm from Figure 3 succeeds with probability ≥ 1/M_0, and the distribution of the output is uniform over the discretized hyperball. The proof of this lemma can be found in Appendix B.
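A simplified model of the rounding-and-reject step is sketched below. For readability it samples the continuous ball with floating point, whereas the real implementation uses fixed-point arithmetic and discrete Gaussians; the rejection margin is a hypothetical parameter standing in for the border exclusion of Figure 3:

```python
import numpy as np

def discrete_hyperball_sample(m, B, N, margin, rng=None):
    """Rounding-and-reject sketch: sample in the ball of radius N*B,
    round to the integer lattice, and reject points near the border,
    so that every output has the same number of pre-rounding predecessors."""
    rng = np.random.default_rng() if rng is None else rng
    while True:
        g = rng.standard_normal(m + 2)
        y = (N * B) * g[:m] / np.linalg.norm(g)    # continuous sample, scaled by N
        y = np.rint(y).astype(np.int64)            # discretize to Z^m
        if np.dot(y, y) <= (N * B - margin) ** 2:  # reject too close to the border
            return y                               # y/N lies in the smaller ball
```

Since the excluded shell is thin, the per-iteration success probability stays close to 1, in line with the ≥ 1/M_0 guarantee of Lemma 4.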

Challenge Sampling
Challenges in HAETAE are polynomials c ∈ R with binary coefficients, exactly τ of which are nonzero. Since n = 256 across all three parameter sets, the challenge space has size (n choose τ), exceeding the required sizes of 2^192 and 2^225 for HAETAE-120 and HAETAE-180, respectively. To sample such challenges, we rely on the (binary version of the) SampleInBall algorithm from Dilithium, which we specify in the first half of Figure 4.
For HAETAE-260, however, we require 255 bits of entropy for the challenge space, which cannot be reached with a fixed Hamming weight for n = 256. To achieve it, we replace the challenge space with a set containing exactly half of the bitstrings of length 256. Specifically, we choose the set containing all elements of Hamming weight strictly less than 128 and half of the elements of Hamming weight 128, using the following algorithm. Given a 256-bit hash with Hamming weight w, do the following. If w < 128, we do nothing, and if w > 128, we flip all the bits. If w = 128, we decide whether to flip or not depending on the first bit. Exactly half of all binary polynomials are reachable this way, which means that the challenge set has size 2^255, as desired. The algorithm is specified in the second half of Figure 4.

Bimodal Hyperball Rejection Sampling
Recently, Devevey et al. [DFPS22] conducted a study of rejection sampling in the context of lattice-based Fiat-Shamir with Aborts signatures. They observe that (continuous) uniform distributions over hyperballs can be used to obtain compact signatures, with a relatively simple rejection procedure. To make masking easier, HAETAE uses (discretized) uniform distributions over hyperballs, in the bimodal context. The proof of the following lemma is available in Appendix B.
Lemma 5 (Bimodal Hyperball Rejection Sampling). Let n be the degree of R, c > 1, r, t, m > 0, and B ≥ √(B′² + t²). Define M = 2(B/B′)^{mn}, and let p : R^m → {0, 1/2, 1} be defined as illustrated below. Then there exists M′ ≤ cM such that the output distributions of the two algorithms from Figure 6 are identical.
In the illustration, the black empty circles have radii equal to B and the green circle has radius B′. We sample a vector z uniformly inside one of the black circles (with probability 1/2 for each) and keep z with probability p(z) = 1/2 if z lies in the blue zone, with probability p(z) = 1 if it lies in the green zone, and with probability p(z) = 0 everywhere else.
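The geometric picture can be simulated in a toy two-dimensional Monte-Carlo sketch (our names and parameters, not the HAETAE ones): sample y uniformly in a ball of radius B, shift by ±v, and accept z with probability 1/2 where the two shifted balls overlap (the doubled density) and 1 where only one covers z, restricted to the target ball of radius B′. The accepted output is then uniform over the ball of radius B′.

```python
import numpy as np

def bimodal_reject_sample(v, B_prime, rng):
    """Toy 2-D bimodal hyperball rejection: returns z uniform in the ball
    of radius B_prime, starting from y uniform in a ball of radius B
    shifted by +v or -v with probability 1/2 each."""
    t = np.linalg.norm(v)
    B = np.sqrt(B_prime ** 2 + t ** 2)        # B >= sqrt(B'^2 + t^2)
    while True:
        g = rng.standard_normal(4)            # 2-D ball sample via 4 Gaussians
        y = B * g[:2] / np.linalg.norm(g)
        z = y + (1 if rng.integers(2) else -1) * v
        if np.linalg.norm(z) > B_prime:
            continue                          # p(z) = 0 outside the target ball
        both = np.linalg.norm(z - v) <= B and np.linalg.norm(z + v) <= B
        if not both or rng.random() < 0.5:    # p = 1/2 where the balls overlap
            return z
```

The halved acceptance in the overlap exactly cancels the doubled density of the bimodal mixture there, which is why the output is flat over the target ball.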
We now have all the necessary ingredients in Figures 1, 3, 5, and 6 to make sure that the resulting distribution of z is indeed uniform over the discretized hyperball. Thanks to Lemmas 4 and 5, we already know the level of precision required for y to maintain the provable security of HAETAE.

High and Low Bits
Recall that a HAETAE signature is principally a vector z, whose lower part is replaced with a (smaller) hint. HAETAE makes use of two different high/low-bits decompositions: one helps encode a signature, while the other is used when computing the hint. Following [ETWY22], the first is helpful in the sense that, if we correctly choose the number of low bits, they are distributed almost uniformly and can then be excluded from the encoding step. The high bits, on the other hand, then follow a distribution with a very small variance, and we apply the rANS encoding to them only, making it much more efficient as the size of the alphabet shrinks greatly.
The second decomposition implements a trick that allows reducing the alphabet size of the resulting hint, and thus the size of its encoding.
We use the following base method for decomposing an element into high and low bits. We first recall the Euclidean division with a centered remainder.
Lemma 6. Let a ≥ 0 and b > 0. It holds that a = b⌊a/b⌉ + (a mod± b), where ⌊·⌉ denotes rounding to the nearest integer (ties rounded upwards), and this writing of a as a multiple of b plus a remainder in [−b/2, b/2) is unique. We define our decomposition for compressing the upper part of the signature accordingly.
We extend these definitions to vectors by applying them component-wise. We state in Lemma 7 that this decomposition lets us recover the original element, and we bound the components of the decomposition. The proof is available in Appendix B.

High and Low Bits for h
In order to produce the hint that we send instead of the lower part of z, we could use the previous bit decomposition. However, as noted in a preliminary version of [DKL+18, Appendix B], a slight modification allows to further reduce the entropy of the hint.
The idea is to pack the high bits into the range [0, 2(q − 1)/α_h). This is possible if we use the range [−α_h/2 − 2, 0) to represent the integers that are close to 2q − 1.
As before, we extend these definitions to vectors by applying them component-wise. We state that this decomposition lets us recover the original element, and we bound the decomposition components.
Lemma 8. Let r ∈ Z. Let q be a prime, α_h | 2(q − 1) a power of two, and define m = 2(q − 1)/α_h. It holds that r = α_h · HighBits_h(r) + LowBits_h(r) mod 2q, with HighBits_h(r) ∈ [0, m) and |LowBits_h(r)| ≤ α_h/2 + 2. The proof of Lemma 8 is available in Appendix B.
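A plausible instantiation of this decomposition, in the style of Dilithium's Decompose adapted to the modulus 2q (the exact corner-case handling in HAETAE may differ), satisfies the recovery identity and the range bounds of Lemma 8:

```python
def mod_pm(r, alpha):
    """Centered representative in [-alpha/2, alpha/2)."""
    rp = r % alpha
    return rp - alpha if rp >= (alpha + 1) // 2 else rp

def decompose_h(r, q, alpha_h):
    """Split r into (HighBits_h, LowBits_h) with r = alpha_h*hb + lb mod 2q,
    hb in [0, 2(q-1)/alpha_h) and |lb| <= alpha_h/2 + 2."""
    r = r % (2 * q)
    lb = mod_pm(r, alpha_h)
    hb = (r - lb) // alpha_h
    if r - lb == 2 * (q - 1):       # wrap the topmost slot around to 0,
        hb, lb = 0, lb - 2          # shifting lb by 2 since 2q = 2 mod alpha_h
    return hb, lb
```

The corner case is exactly the packing trick above: values close to 2q − 1 are represented with a high part of 0 and a low part in [−α_h/2 − 2, 0).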

The HAETAE Signature Scheme
In this section, we describe three different versions of HAETAE. As a warm-up, we give an uncompressed, un-truncated version of HAETAE, implementing the Fiat-Shamir with Aborts paradigm in the bimodal hyperball-uniform setting. We then give the full description of optimized and deterministic HAETAE as we implemented it. Finally, we discuss the parts of the signing algorithm that can be pre-computed.

Uncompressed Description
As a first approach, we give a high-level, uncompressed description of our signature scheme in Figure 7. In all of the following sections, we let j = (1, 0, . . . , 0) ∈ R^k, we let k, ℓ be two dimensions, N > 0 the fixed-point precision, and τ > 0 the challenge min-entropy parameter. The parameters B, B′, and B′′ refer to the radii of hyperballs. Let q be an odd prime such that α_h | 2(q − 1) is a power of two. Based on Lemma 3, we define the key rejection function f(s) = τ · Σ_{i=1}^{m} ∥s(ω_{σ(i)})∥₂² + r · ∥s(ω_{σ(m+1)})∥₂². With the parameter γ, we bound f(s) ≤ γ²n, which ensures that ∥cs∥₂ ≤ γ√τ for all c ∈ R_2 satisfying wt(c) ≤ τ. The key generation algorithm is a simplified version of the one from Section 3.1, which removes the verification key truncation for conceptual simplicity.

Specification of HAETAE
We now give the full description of the signature scheme HAETAE in Figure 8, with the following building blocks:
• a hash function H_gen for generating the seeds and hashing the messages,
• a hash function H for signing, returning a seed ρ for sampling a challenge,
• an extendable output function expandA for deriving a_gen and A_gen from seed_A,
• an extendable output function expandS for deriving (s_gen, e_gen) ∈ S_η^{ℓ−1} × S_η^k from seed_sk and counter_sk,
• an extendable output function expandYbb for deriving y, b and b′ from seed_ybb and a counter.
The above building blocks can be implemented with symmetric primitives. Note that at Step 3 of the Verify algorithm, the division by 2 is well-defined as the operand is even.
Lemma 9. We borrow the notations from Figure 8. If we run Verify(vk, M, σ) on the signature σ returned by Sign(sk, M) for an arbitrary message M and an arbitrary key pair (sk, vk) returned by KeyGen(1^λ), then the following relations hold:
Proof. Let m = 2(q − 1)/α_h. Let us prove the first statement. By definition of h, it holds that w_1 = HighBits_h(w) mod m. However, the latter part of the equality already lies in [0, m − 1] by Lemma 8. The first part lies in the same range, as we reduce mod+ m. Hence, the equality also holds over Z. We move on to the second statement. By considering only the first component of z = y + (−1)^b cs, we obtain the result modulo 2. Moreover, considering everywhere a 2 appears in the definition of A, we obtain the third relation. For the last statement, let us use the two preceding results. We note that the last two elements have the same parity, as the former one has the same parity as LowBits(w, α_h). By Lemma 8, their sum has infinite norm ≤ α_h/2 + 2. Hence, the claimed relation holds from its definition. Finally, it holds over the integers, as the right-hand side has infinite norm at most 2B′ + α_h/2 + 2 < q.

Theorem 2 (Completeness). Assume that
Then the signature scheme of Figure 8 is complete, i.e., for every message M and every key pair (sk, vk) returned by KeyGen(1^λ), we have Verify(vk, M, Sign(sk, M)) = 1. Proof. We use the notations of the algorithms. The first and second equations from Lemma 9 state that the recomputed seed equals ρ, and thus c = SampleBinaryChallenge_τ(ρ).
On the other hand, we use the last equation from the same lemma to bound the size of z.We have: The definition of B ′′ implies that the scheme is correct.

Security
When proving security in the Fiat-Shamir with aborts setting in the QROM, one typically relies on the generic reduction from [KLS18]. However, as pointed out in [DFPS23] and [BBD+23], this analysis is flawed. Both works give adaptations to Fiat-Shamir with aborts of the analysis from [GHHM21] of Fiat-Shamir (without aborts). Moreover, the reduction from [KLS18] assumes an a-priori bound on the number of restarts, which is not the case here. This restriction is waived in both [DFPS23] and [BBD+23].

UF-CMA Security
The security of HAETAE relies on the analysis of [DFPS23], which reduces UF-CMA security to UF-NMA security, where the adversary is not allowed to make signing queries. This analysis requires that the commitment min-entropy is high and that the underlying Σ-protocol is Honest-Verifier Zero-Knowledge (HVZK). The latter is proved by providing a simulator for non-aborting transcripts and proving that the distribution of ⌊y⌉ has sufficiently large min-entropy.
Commitment Min-entropy. We first claim that the underlying Σ-protocol has large commitment min-entropy. The underlying identification protocol has ε bits of min-entropy if, for any key-pair (pk, sk) ← KeyGen and for y ← U(B^{(1/N)}_{R,(k+ℓ)}(B)), the commitment takes any fixed value with probability at most 2^{−ε}. We note that LSB(⌊y₀⌉) is a binary vector of length n that is statistically close to uniform. Thus, the inner probability is (very loosely) bounded by 2^{−n} regardless of the choice of (pk, sk). Hence, we obtain at least 256 bits of min-entropy in all of our parameter sets.
HVZK. Next, we show that the underlying Σ-protocol satisfies the HVZK property. To do so, we follow the strategy from [DFPS23, Section 4.2], which studies the simulation of non-aborting transcripts and switches to a computational notion for aborting ones. We propose the simulator Sim(vk, c) in Figure 9, where p(z) is 1/M if ∥z∥ ≤ r and 0 everywhere else. We first remark that the resulting simulated distribution of (x, v, h) (possibly x = ⊥, v = ⊥, h = ⊥) is identical to the real distribution of (x, v, h).
(i) Simulating non-aborting transcripts. In both the real and simulated cases, a non-aborting transcript satisfies w = Az − qcj mod 2q for z recovered from (x, v, h). Hence, the statistical distance between the real and simulated transcripts is bounded by Lemma 5.
(ii) Simulating aborting transcripts. As argued in [DFPS23, Section 4.2], in this context, we can use a computational notion of HVZK rather than the usual statistical definition. We introduce an LWE-like assumption which states that it is hard to distinguish w = A⌊y⌉ mod 2q from a uniform element mod 2q. This LWE assumption is unusual only in its choice of distribution for the noise and the secret.
These two properties allow us to apply [DFPS23, Theorem 4] to reduce the SUF-CMA security to UF-NMA security.
If one wants to avoid this assumption, it is possible to use the reduction from [BBD+23] with A(0) as a simulator. The non-aborting transcripts produced by this simulator are at statistical distance 0 from real ones.
UF-NMA security. Finally, we note that the UF-NMA security game is exactly the problem defined in Definition 5, up to replacing the verification key by a uniform matrix (still in HNF form), which can be done under the MLWE assumption.

HAETAE with Pre-computation
We observe that in the randomized signing process of HAETAE, many operations do not depend on the message M, and some do not even depend on the signing key. This enables efficient "offline" procedures, i.e., precomputations that speed up the "online" phase.
Specifically, there are two levels of offline signing that can be applied to randomized HAETAE:
1. Generic. If neither the message M nor the signing key is known in advance, it is still possible to perform the hyperball sampling. This removes the most time-consuming operation from the online phase.
2. Designated signing key. Here, only the message M is unknown during offline signing, while the signing key is fixed. This allows us to perform even more precomputations by using only the verification key, as shown in Figure 10. Most notably, there is no longer a matrix-vector multiplication in the online phase.
We showcase the offline and online parts of the randomized version of HAETAE in Figure 10 (Randomized, online/offline signing). Note that List.append denotes the function that appends a tuple to the list.

Parameter Sets
For setting parameters, we estimated the costs of practical attacks, as in Dilithium, Falcon, and many other NIST-submitted schemes. We consider two kinds of attacks: 1) key recovery attacks, which amount to solving an MLWE instance; 2) signature forgery attacks, for which we rely on the BimodalSelfTargetMSIS instance that appears in the security analysis for UF-CMA. We then use the fact that the only known way to solve BimodalSelfTargetMSIS is to solve MSIS. Heuristically, the hash function is not aware of the algebraic structure of its input, and the random oracle assumption that c is uniform and independent from the input is sound. Thus, an adversary has no choice but to choose some w, hash its high and low bits along with some message, and try to compute a short preimage of w − qcj mod 2q, modulo the low bits truncation. If the adversary succeeds, the preimage becomes an MSIS solution.
We propose three different parameter sets with varying security levels, where we prioritize low signature and verification key sizes over faster execution time. The parameter choices are versatile and adaptable, and allow size vs. speed trade-offs at consistent security levels. For example, at the cost of larger signatures, a smaller repetition rate M is possible and thus a faster signing process. This versatility is a notable advantage over Falcon and Mitaka.
Like in Dilithium, our modulus q is constant over the parameter sets and allows an optimized NTT implementation shared by all sets. At only 16 bits in size, our modulus also allows storing coefficients memory-efficiently without compression.

Implementation Specification
In this section, we explain how to efficiently implement HAETAE. We furthermore specify implementation aspects that are relevant for compatibility or security. We start with an implementation-oriented specification: Figure 11 demonstrates how to implement the key generation, Figure 12 the signature generation, and Figure 13 the signature verification. These illustrate the use of the CRT and NTT for efficient polynomial arithmetic. Notably, b can be transmitted in the NTT domain if no rounding is applied. Most arithmetic is carried out modulo q; recovering the values modulo 2q is only required for computing the low and high bits.

Hyperball Sampler
One critical component of HAETAE is the hyperball sampling. Essentially, the hyperball sampling procedure consists of four steps:
1. Sample n(k + ℓ) + 2 discrete Gaussians with σ = 2^76, sum up their squares, and eventually drop two of the samples.
2. Compute the inverse square root of the sum of squares, and multiply the result by B₀ + √nm/(2N).
3. Multiply every sample from Step 1 by the result of Step 2.
4. Check the ℓ₂ norm of the resulting vector; restart from Step 1 if it is bigger than B₀N.
In the following, we explain how the Gaussian sampling and the inverse square root approximation can be implemented efficiently. Besides, we choose to generate each of the k + ℓ polynomials independently, which helps parallelize the randomness generation for vectorized software and hardware implementations. Then, for the first two polynomials, we generate one additional Gaussian sample each, which is never stored but included in the sum of squared samples.

Discrete Gaussian Sampling
As we will lose precision when computing the inverse square root of a Gaussian sample, we require a Gaussian sampler with high fixed-point precision. This is achieved by sampling over Z with a large standard deviation and then scaling the resulting sample to our convenience. We use [Ros20, Algorithm 12] to sample from a discrete Gaussian distribution with σ = 2^76 and k = 2^72.
In essence, we start by sampling a discrete Gaussian x with σ = 16 using a Cumulative Distribution Table (CDT) and a uniform y ∈ {0, . . . , 2^72 − 1}, and set the Gaussian sample candidate as r = x·2^72 + y. Subsequently, this candidate is accepted with probability exp(−y(y + x·2^73)/2^153). Fortunately, we achieve a very low rejection rate of less than 5%.
Specifically, the CDT we use has 64 entries and a precision of 16 bits. Then, to compute the sample candidate's square and the input to the exponential, we first compute r² and round the result to 76-bit precision; the square is accumulated later if the sample is accepted. Subsequently, r² − 2^76·x² yields the input to the exponential.
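The exponential's input stems from the identity r² − x²·K² = y(y + 2xK) for r = xK + y, which for K = 2^72 gives the acceptance exponent y(y + x·2^73) quoted above. The following check uses a scaled-down K = 2^20 so that all quantities fit into 64-bit integers; the identity itself is independent of the scale.

```c
#include <assert.h>
#include <stdint.h>

/* Exponential input computed directly from the candidate r = x*K + y. */
uint64_t exp_input_direct(uint64_t x, uint64_t y, uint64_t K) {
    uint64_t r = x * K + y;
    return r * r - x * x * K * K;
}

/* Same quantity in the factored form used by the rejection step. */
uint64_t exp_input_factored(uint64_t x, uint64_t y, uint64_t K) {
    return y * (y + 2 * x * K);
}
```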

Approximating the Exponential.
For this, we need to approximate the exponential function e^(−x) by a polynomial f(x) on the closed interval [c − w/2, c + w/2], with center c and width w. We first determine an upper bound for the polynomial order required to approximate e^(−x), given an upper bound for the absolute error. We obtain f(x) by truncating the expansion of e^(−x) into a series of Chebyshev polynomials of the first kind T_n(x) with linearly transformed input, as this is known to yield small absolute approximation errors for a given polynomial order. The truncation error is governed by the modified Bessel functions of the first kind I_n(z), which rapidly converge to zero for growing n. We recall that |T_n(x)| ≤ 1 for |x| ≤ 1. For intervals [0, w] with not too large widths, we find 2e^(−c)·I_(m+1)(w/2) to be a useful estimate of the maximum absolute error when truncating the series at order m > 1. This relation allows us to directly limit m according to the interval to cover and the maximum permissible error.

Algorithm 1: Deterministic hyperball sampling.
Listing 1: Fixed-point approximation of the exponential function with 48 bits of precision.
Finalization. If the sample is accepted eventually, it is (implicitly) scaled by the factor 2^(−76) to obtain a continuous sample from the standard normal distribution. Moreover, we only need to store the upper 64 bits of the sample and round off the rest.
In summary, each Gaussian sample candidate requires
• 72 bits of randomness for the lower part of the candidate (y),
• 16 bits of randomness for the CDT sampling, and
• 48 bits of randomness for conditionally rejecting the candidate according to the output of the exponential.
This results in a vast randomness demand per hyperball sample, and explains the dominance of hashing in the cycle counts.

Approximating the Inverse Square Root
To turn the vector of standard normal variates into a hyperball sample candidate, we must compute its norm. For this, we accumulate all squared samples and approximate the inverse square root of the accumulated value. The approximation result is then multiplied by the constant r′ + √nm/(2N), which yields the scaling factor applied to each Gaussian sample. For the inverse square root, we deploy Newton's method, a well-known technique for this purpose. However, Newton's method requires a starting approximation that is improved with each iteration. Fortunately, we know that the sum of nm + 2 independent squared standard normal variables follows a χ² distribution with expected value nm + 2. Hence, the starting approximation can be fixed and precomputed as 1/√(nm + 2). The number of iterations for a targeted precision can be determined experimentally: we performed the approximation for input values that have negligible probabilities either under the cumulative distribution function of χ²(nm + 2) or under its survival function, and checked how many iterations are required to still reach reasonable precision.
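The Newton iteration for 1/√s is t ← t(3 − s·t²)/2. A sketch in double precision (the real implementation works in fixed point, and the value 2050 for nm + 2 below is illustrative):

```c
#include <assert.h>
#include <math.h>

/* Newton's method for the inverse square root of s, starting from a
 * fixed approximation `start`. In the scheme, start = 1/sqrt(nm + 2),
 * the inverse square root of the expected value of the chi^2(nm + 2)
 * distributed input. */
double inv_sqrt_newton(double s, double start, int iters) {
    double t = start;
    for (int i = 0; i < iters; i++)
        t = 0.5 * t * (3.0 - s * t * t); /* t <- t(3 - s t^2)/2 */
    return t;
}
```

Because the iteration converges quadratically and the input concentrates around its expectation, a small fixed iteration count suffices.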

Signature Packing and Sizes
The last step of the signature generation is to compress and pack the elements of the signature. A packed HAETAE signature consists of the challenge c, the low bits of z1 (LN coefficients), and the high bits of z1 and h (KN coefficients). Because the distributions of the values in the high bits of z1 and of the coefficients in h are both very dense, we can compress both polynomial vectors with entropy encoding. Figure 14 displays the frequencies of the possible values for both vectors in HAETAE-120.
Before compressing the values, we map them to a smaller symbol space and thereby reject the very unlikely values and the corresponding signatures. For h, we cut out most of the values in the middle of the range; for HAETAE-120, this reduces the size of the symbol space from 252 to 13.
For the high bits of z1, we tail-cut the distribution left and right of the center at 0, and then shift the remaining values to the non-negative range beginning at 0. For HAETAE-120, this reduces the size of the symbol space from 37 to 13.
The parameters for these mappings are defined in Table 4. At signature verification, the mapping must be reverted after decoding the compressed symbols.
The reason for these mappings is mainly to obtain significantly smaller precomputation tables for the rANS encoding and decoding. Also, all symbols can now be represented with 8 bits, which simplifies the rANS implementation. Furthermore, for the high bits of z1, a mapping to non-negative values is necessary to be able to use rANS encoding. The effect on the resulting signature size is insignificant.
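To illustrate the principle behind this compression, the following is a toy range Asymmetric Numeral System (rANS) coder: a single 64-bit state without renormalization, handling only short inputs. The production coder (based on [Gie14]) streams bytes and renormalizes; the 3-symbol frequency table here is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SCALE_BITS 4
#define NSYM 3
static const uint32_t freq[NSYM] = {8, 5, 3};  /* sums to 1 << SCALE_BITS */
static const uint32_t cum[NSYM]  = {0, 8, 13}; /* cumulative frequencies */

/* Encode symbols back-to-front so decoding pops them front-to-back. */
uint64_t rans_encode(const uint8_t *sym, int n) {
    uint64_t x = 1; /* fixed start state, checked again after decoding */
    for (int i = n - 1; i >= 0; i--) {
        uint8_t s = sym[i];
        x = (x / freq[s] << SCALE_BITS) + x % freq[s] + cum[s];
    }
    return x;
}

/* Decodes n symbols; returns the final state, which must equal the
 * fixed start state (one of the sanity checks used at verification). */
uint64_t rans_decode(uint64_t x, uint8_t *sym, int n) {
    for (int i = 0; i < n; i++) {
        uint32_t slot = x & ((1u << SCALE_BITS) - 1);
        uint8_t s = 0;
        while (s + 1 < NSYM && cum[s + 1] <= slot) s++;
        sym[i] = s;
        x = freq[s] * (x >> SCALE_BITS) + slot - cum[s];
    }
    return x;
}
```

Frequent symbols grow the state by less per step, which is where the compression comes from; this is why the dense, heavily skewed distributions of the high bits and of h compress well.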
The size of the compressed high bits of z1 and of h varies and must be included in the signature to allow correct unpacking and decoding. The size of one compressed polynomial vector is often more than 255 bytes and can thus not be expressed in one byte. Its variance, however, is limited, and thus we encode the sizes of the compressed high bits of z1 and of h as positive offsets to a fixed base value. This unsigned offset value fits into one byte in most cases; if not, the signature gets rejected. The base values can be found in Table 4.
The final signature is then built as follows: the first 32 bytes contain the seed for the challenge polynomial c. Then follow LN bytes for the low bits of z1. The next byte consists of the offset to the base size for the encoding of the high bits of z1, and the byte after it is the offset for h. Then we have the encoding of the high bits of z1 and directly afterwards the encoding of h, both with varying sizes, which are indicated beforehand. Lastly, the signature is padded with zero bytes to reach the fixed signature size, if any bytes remain. Signatures that would exceed the fixed limit get rejected.
To prevent signature forgeries, multiple sanity checks have to be performed during signature unpacking and decoding: the zero padding must be correct, and the decoding must not fail and must decode the expected number of coefficients while using exactly the number of bytes indicated by the offset. Furthermore, the rANS decoding must end with the fixed predefined start value to be unique. Our rANS encoding is based on an implementation by Fabian Giesen [Gie14]. To set the fixed signature size as reported in Table 5, we evaluated the distribution empirically and determined a threshold that requires a rejection in less than 0.1% of the cases. Figure 15 displays the raw signature size distribution of 20000 executions (without the size-based rejection sampling).
In Table 5 we compare the signature and key sizes of HAETAE, Dilithium, and Falcon. The verification keys in HAETAE are 20% (HAETAE-260) to 25% (HAETAE-120 and HAETAE-180) smaller than their counterparts in Dilithium. The advantage of the hyperball sampling manifests itself in the signature sizes: HAETAE has 29% to 39% smaller signatures than Dilithium. Less relevant are the secret key sizes, which are almost half the size in HAETAE compared to Dilithium. A direct comparison to Falcon for the same claimed security level is only possible for the highest parameter set: Falcon-1024 has a signature of less than half the size of HAETAE-260, and its verification key is about 14% smaller.

Performance Reference Implementation
We developed an unoptimized, portable, and constant-time implementation in C for HAETAE, and report median and average cycle counts of one thousand executions for each parameter set in Table 6. Due to the key and signature rejection steps, the median and average values for key generation and signing, respectively, differ clearly, whereas the two values are much closer for the verification.
For a fair comparison, we also performed measurements on the same system with identical settings for the reference implementation of Dilithium and for the fully portable Falcon implementation with emulated floating-point operations, as given in Table 6. The performance of the signature verification for HAETAE is very close to Dilithium throughout the parameter sets: HAETAE-180 verification is 13% slower than its counterpart, while HAETAE-260, on the other hand, is 9% faster than the respective Dilithium parameter set. For key generation and signature computation, our current implementation of HAETAE is clearly slower than Dilithium; we measure a slowdown of factors three to five. In comparison to Falcon, however, HAETAE has 38-50 times faster key generation and around three times faster signing speed. For the verification, Falcon outperforms both Dilithium and HAETAE by roughly a factor of four. A closer look at the key generation reveals that the complex Fast Fourier Transform required for the rejection step is, at 53% of the run time, by far the most expensive operation and a sensible target for optimized implementations.
Profiling the signature computation reveals that the slowdown compared to Dilithium is mainly caused by the sampling from a hyperball, where about 80% of the computation time is spent. The hyperball sampling itself is dominated by the generation of randomness, which we derive from the extendable output function SHAKE256 [Dwo15], as also used in the Dilithium implementation. Almost 60% of the signature computation time is spent in SHAKE256.
Based on the profiling and benchmarking of subcomponents, we estimate the performance of a randomized HAETAE implementation with pre-computation. The generic version, which is independent of the key, would already achieve a speedup of a factor of five for its online signing, because the expensive hyperball sampling can be done offline. For the pre-computation variant with a designated signing key, additionally a lot of the matrix-vector multiplications, and therefore most of the transformations from and to the NTT domain, can be precomputed. In this case, we estimate the online signing to take about 12% of the full deterministic signing running time.

Optimized Implementation for AVX2
Advanced Vector Extensions 2 (AVX2) is an extension to the x86 instruction set architecture, available in processors since 2011. It provides Single Instruction Multiple Data (SIMD) operations on 256-bit registers, and thus allows to, e.g., perform an operation on eight 32-bit values in parallel. In this section, we explain how to exploit this parallelization.
The three major components that significantly determine the computation time of HAETAE are Keccak, the NTT, and the hyperball sampling. For the first two components, we can fall back on existing optimized code. For the NTT in particular, we can reuse the implementation in Dilithium with only slight adaptations with regard to constants. In the following, we demonstrate how to implement the third component, the hyperball sampling, efficiently using AVX2 instructions.

Vectorized Hyperball Sampling
After the parallel generation of the randomness, we generally have two options to parallelize the hyperball sampling: first, we can sample four different polynomials in parallel; second, we can generate the Gaussian samples within a polynomial in parallel, since they are generated independently. We opt for the first approach.
As the sampling process is relatively complex, we cannot simply load input vectors, generate samples from them, and store the samples. Instead, we pass several times over the internal memory state, dividing the procedure into seven separate steps:
1. parsing the input randomness: separating the three parts of each sample candidate into separate memory locations such that later steps can process them quickly,
2. CDT sampling,
3. constructing the sample candidate, its square, and the input to the exponential approximation,
4. approximating the exponential,
5. generating masks indicating which candidates to reject,
6. accumulating the squares of the non-rejected samples, and
7. storing only the accepted samples at the correct final memory positions.
In the following, we detail Steps 2, 3, and 4.

Parallel CDT Sampling
Although we perform 16-bit CDT sampling, we cannot use the 16-fold parallel vpcmpgtw comparison, since it is a signed comparison; we use vpcmpgtd instead, which operates on eight signed 32-bit integers. As we want to draw four samples in parallel, we store the CDT and the input randomness redundantly, such that we can perform the comparison with all 64 entries in 32 comparisons. However, since AVX2 only offers 16 vector registers, we have to perform this comparison in three chunks. vpcmpgtd c, a, b writes −1 to c if the respective vector element in a is greater than its counterpart in b, and 0 otherwise. Thus, we can use vpaddd to accumulate these results, but they will be negative. For the final chunk, we use vpsrlq to line up the two intermediate results, then add and negate them.
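The accumulate-and-negate pattern is easiest to see in scalar form. The sketch below mimics what the vpcmpgtd/vpaddd lanes compute: each comparison contributes −1 or 0, and the negated accumulator is the sample. The 4-entry table is illustrative only; the real CDT has 64 entries with 16-bit precision.

```c
#include <assert.h>
#include <stdint.h>

#define CDT_LEN 4
static const int32_t cdt[CDT_LEN] = {20000, 40000, 55000, 63000};

/* Branch-free CDT sampling: the sample is the number of table entries
 * strictly smaller than the 16-bit random input rnd16 in [0, 65535]. */
int32_t cdt_sample(int32_t rnd16) {
    int32_t acc = 0;
    for (int i = 0; i < CDT_LEN; i++)
        acc += -(rnd16 > cdt[i]); /* comparison yields -1 or 0 */
    return -acc;
}
```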

Multi-precision Squaring
Since the sample candidate is 72 bits long and AVX2 only supports 32-bit multiplication, we perform vectorized multi-precision arithmetic. To this end, we split the candidate into its low 32 bits, its middle 16 bits, and its upper 31 bits. For the schoolbook multiplication, we perform the six partial multiplications with vpmuludq consecutively, such that they can be executed pipelined and in parallel. The subsequent recombination and rounding can be performed with a sequence of 16 instructions: shifts (vpsrlq, vpsllq), additions (vpaddq), and ANDs (vpand).
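The schoolbook limb squaring can be sketched in scalar C. For brevity, this sketch squares a 64-bit value using two 32-bit limbs (the real code splits the 72-bit candidate into three parts); the partial products and the carry-aware recombination follow the same pattern as the vpmuludq-based AVX2 code.

```c
#include <assert.h>
#include <stdint.h>

/* Schoolbook squaring of a 64-bit value into a 128-bit result (hi, lo)
 * using only 32x32 -> 64-bit multiplications. */
void square64(uint64_t a, uint64_t *hi, uint64_t *lo) {
    uint64_t a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
    uint64_t p00 = a0 * a0, p01 = a0 * a1, p11 = a1 * a1;
    /* a^2 = p11*2^64 + 2*p01*2^32 + p00; recombine with carry */
    uint64_t mid = p01 + (p00 >> 32);  /* cannot overflow 64 bits */
    uint64_t mid2 = mid + p01;         /* may overflow: track carry */
    uint64_t carry = (mid2 < mid);
    *lo = (mid2 << 32) | (p00 & 0xFFFFFFFFu);
    *hi = p11 + (mid2 >> 32) + (carry << 32);
}
```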

Vectorized Approximation of the Exponential
The exponential approximation as explained in Section 5.1 consists of six signed 48-bit multiplications, which are not supported natively by AVX2. Consequently, we implement this operation with a vectorized multi-precision approach.
More specifically, we know that the first operand of this multiplication is signed and the second is not. Thus, splitting the second operand into a low and a high half is trivial, but for the signed operand, this requires a slightly more sophisticated approach: here, the upper half is obtained by an arithmetic right-shift by 24. By shifting this result left again by 24 (shifting in zeros) and subtracting it from the original value, we obtain the lower half.
Since AVX2 does not offer an arithmetic right shift on 64-bit elements, we generate a mask of sign bits and simulate the arithmetic right shift by performing a bitwise OR. Unfortunately, we require this operation three times during a single signed multiplication. Notably, AVX512 offers such a shift, which will speed up this operation considerably.
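Both tricks can be illustrated in scalar form: the arithmetic right shift emulated with a logical shift plus an OR-ed mask of sign bits, and the signed split a = ah·2^24 + al built on top of it. The scalar ternary stands in for the vector comparison that produces the sign mask; the left shift is written as a multiplication to stay within defined C behavior for negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic right shift on a 64-bit value, emulated as in the AVX2
 * code: logical shift, then OR in a mask of replicated sign bits.
 * Assumes 0 < shift < 64. */
int64_t asr64_emulated(int64_t a, int shift) {
    uint64_t u = (uint64_t)a >> shift;                  /* logical shift */
    uint64_t mask = a < 0 ? ~0ull << (64 - shift) : 0;  /* sign bits */
    return (int64_t)(u | mask);
}

/* Signed split a = ah * 2^24 + al with al in [0, 2^24). */
void split_signed24(int64_t a, int64_t *ah, int64_t *al) {
    *ah = asr64_emulated(a, 24);              /* upper half, signed */
    *al = a - *ah * ((int64_t)1 << 24);       /* lower half */
}
```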
Eventually, we perform a vectorized signed 48-bit multiplication using 32 instructions, out of which 17 are used for the emulation of a signed right shift.Moreover, we use seven variable and three constant vector registers (out of which one is the second input, cf.Listing 1), which leaves six registers for other constants.Apart from the multiplication, we only make use of addition and shifts (vpaddq, vpsrlq).

Performance and Comparison AVX2
The impact of the parallelized Keccak can be observed by looking at the cycle counts for unpacking the matrix A, which is five to seven times faster compared to the reference implementation. For HAETAE-120, e.g., the cost went down from around 132k cycles to 24k cycles. The picture is similar for the hyperball sampling, where we measure a speed-up of a factor of six to eight; the cycle counts for one function call in HAETAE-120 are around 1640k in the reference and 270k in the optimized implementation. The highly optimized NTT taken from Dilithium is almost 19 times faster than the one in our portable reference code.
Table 7 provides cycle counts for the AVX2-optimized implementations of HAETAE, Dilithium, and Falcon. Compared to our reference implementation, the signature generation is around five times faster in the optimized implementation. For the signature verification, we observe an acceleration of a factor of around three to four.
The comparison with Dilithium does not change distinctly with respect to the reference implementations, except for the key generation, where Dilithium experiences a much greater acceleration. Falcon, on the other hand, is considerably faster at the signature generation with its AVX2 implementation, compared to its portable reference code. Table 6 showed around three times faster signature generation for HAETAE compared to Falcon; optimized for AVX2 and also using the floating-point unit, Falcon becomes faster than HAETAE.
We note, however, that both Dilithium and Falcon went through a multi-year process of incremental implementation optimizations, whereas this process has just started for HAETAE. Moreover, when we apply the heuristic that sending one byte via the internet costs at least 1000 cycles [BBC+20, Sec. 5.4], HAETAE is remarkably already nearly as performant as Dilithium in terms of signing plus sending the signature.

Embedded Implementation on Cortex-M4
To evaluate the suitability of HAETAE for embedded environments, we developed an implementation for the STM32F4-Discovery board, featuring 128 KiB of RAM and a Cortex-M4F processor, which implements the ARMv7E-M ISA and operates on 32-bit words. We use the PQM4 framework [KRSS19] for development and evaluation, as it is the de facto standard for comparing post-quantum cryptography schemes on Cortex-M4 processors. The Cortex-M4F, in contrast to the Cortex-M4, features a floating-point unit; its floating-point registers can be used to store and load intermediate values within a single cycle to reduce the pressure on the 13 general-purpose registers. The profiling of the reference implementation already indicates that replacing the portable Keccak implementation with one optimized for the Cortex-M4 is an important and straightforward step towards fast execution times. The two other major components that are highly relevant are the polynomial arithmetic and the Gaussian sampler; both will be discussed in the following.

Polynomial Arithmetic
In this section, we address how to implement the required arithmetic operations on these rings, and the mappings between them, on a Cortex-M4 platform.

Modular Reductions
In modular arithmetic, the Barrett reduction [Bar87], the Montgomery modular multiplication [Mon85], and related techniques are indispensable for efficient computation: the first for reducing given numbers, the second for yielding the reduced result of a multiplication with a constant mod q. These algorithms avoid computationally expensive divisions by q and replace them with a multiplication by a suitably chosen number and a division by a power of two, which can be realized with shift operations. Both methods initially reduce the result to the interval [0, 2q] and perform the full reduction with a conditional subtraction, which can be done in constant time. In many cases, one can forgo the final reduction for intermediate results, an approach dubbed lazy reduction.
The prime chosen in the HAETAE scheme is q = 64513 = 0xFC01, the largest unsigned 16-bit prime with a 512th root of unity. Fully reduced elements of Z_q can be stored efficiently in 16 bits, in the bottom or top half of a 32-bit register.
Unfortunately, this does not carry over to arithmetic operations: a lazy reduction or an addition already requires 17 bits to store the result, and a combination of a lazily reduced multiplication followed by an addition requires 18 bits. The recent advance of Plantard multiplication [Pla21] is not useful within this work, as the prime in HAETAE is not compatible: Plantard multiplication requires q < 2R/(1 + √5) ≈ 0.618·R, i.e., q ≤ 40503 for R = 2^16. So the prime of HAETAE is too large for this use-case with the Cortex-M4 16-bit DSP instructions. The same goes for Seiler's variant [Sei18] of signed Montgomery modular multiplication, which is only well-defined for q < R/2.
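A Montgomery reduction for the HAETAE prime with R = 2^32, in the style of the Dilithium code referenced below, can be sketched as follows. As an illustrative twist, the constant q^(−1) mod 2^32 is derived at runtime via Newton iteration rather than hardcoded; a production implementation would precompute it.

```c
#include <assert.h>
#include <stdint.h>

#define Q 64513

/* Computes Q^-1 mod 2^32 for odd Q via Newton iteration: each step
 * doubles the number of correct low-order bits. */
static uint32_t qinv_mod_2_32(void) {
    uint32_t inv = Q; /* correct to 3 bits for any odd Q */
    for (int i = 0; i < 5; i++)
        inv *= 2 - Q * inv;
    return inv; /* Q * inv == 1 (mod 2^32) */
}

/* Returns r with r == a * 2^-32 (mod Q) and |r| < Q, for inputs
 * |a| <= Q * 2^31, avoiding any division by Q. */
int32_t montgomery_reduce(int64_t a) {
    int32_t t = (int32_t)((uint32_t)a * qinv_mod_2_32());
    return (int32_t)((a - (int64_t)t * Q) >> 32);
}
```

The low 32 bits of a − t·Q vanish by construction, so the shift is exact; a final conditional subtraction can bring the result into [0, Q).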

16-bit vs 32-bit
Quite a few post-quantum schemes use primes of 13 bits or smaller. In this case, one can both store and manipulate the coefficients graciously and efficiently as two signed 16-bit values packed into one 32-bit register, as the Cortex-M4 offers a wide range of instructions intended for Digital Signal Processor (DSP) applications, like the mixed multiplication of upper and lower halves of two registers, that can be used for this purpose.
In the case of HAETAE, trade-offs need to be found between the compactness of 16-bit storage and the doubled speed of access for consecutive coefficients on the one hand, and the overhead required to fully reduce the coefficients before writing them to memory on the other. If coefficients are written once and afterwards read repeatedly without alterations, the 16-bit representation can be worthwhile. When polynomials of the public key are expanded, the coefficients are sampled in a fully reduced state; we therefore store them in halfwords.
While it is feasible to use a modified Montgomery reduction with unsigned 16-bit integers as input and 17 bits of output (or a 16-bit value with an overflow flag), there are no corresponding instructions available to exploit this. In contrast, the Montgomery reduction in the Dilithium implementation for the Cortex-M4 uses R = 2^32 and takes only three instructions. We determined the overhead associated with the full reductions required to store coefficients as 16-bit values to be too large to outperform the 32-bit variant for the NTT. The same applies to the other polynomial arithmetic operations in HAETAE; apart from the expanded polynomials of the public key, we therefore operate on 32-bit coefficients.

NTT
HAETAE, like other lattice-based schemes, extensively employs polynomial multiplication. The NTT is a generalization of the Fast Fourier Transform (FFT) and is the state-of-the-art technique to perform this operation, speeding up the computation considerably compared to, e.g., schoolbook multiplication. The addition and multiplication of polynomials transformed into the NTT domain are carried out coefficient-wise, greatly reducing the cost of the latter operation. The overhead of performing the transform and inverse transform, where required, is usually outweighed by the performance gain in the multiplication. HAETAE is specified such that large parts of the public key are expanded directly in the NTT domain.
Fortunately, the closeness to Dilithium, which also uses polynomials of degree n = 256, allows us to reuse its highly optimized assembly code for the NTT developed for the Cortex-M4 by Abdulrahman et al. [AHKS22], which improves over previous work [GKOS18]. The code adjustments required for HAETAE are limited to adjusting constants like the prime being used, the root of unity, which is chosen as 426 (the primitive 512th root of unity of q = 64513), and the twiddle factors.
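As a sanity check of these constants, a compact negacyclic NTT multiplication in Z_q[x]/(x^256 + 1) can be sketched with q = 64513, n = 256, and ψ = 426. For clarity, this sketch uses the O(n²) evaluation form of the transform (â_i = Σ_j a_j·ψ^((2i+1)j) mod q) instead of the optimized butterfly network of [AHKS22].

```c
#include <assert.h>
#include <stdint.h>

#define Q 64513
#define N 256
#define PSI 426 /* primitive 512th root of unity mod Q */

static uint32_t pmul(uint32_t a, uint32_t b) { return (uint64_t)a * b % Q; }

static uint32_t ppow(uint32_t b, uint32_t e) {
    uint32_t r = 1;
    for (; e; e >>= 1, b = pmul(b, b))
        if (e & 1) r = pmul(r, b);
    return r;
}

/* Forward transform: evaluate a at the odd powers of PSI, which are
 * exactly the roots of x^N + 1. */
static void ntt(const uint32_t *a, uint32_t *out) {
    for (int i = 0; i < N; i++) {
        uint32_t w = ppow(PSI, 2 * i + 1), wj = 1, acc = 0;
        for (int j = 0; j < N; j++) {
            acc = (acc + pmul(a[j], wj)) % Q;
            wj = pmul(wj, w);
        }
        out[i] = acc;
    }
}

/* Inverse transform: a_j = N^-1 * psi^-j * sum_i ah[i] * psi^(-2ij). */
static void intt(const uint32_t *ah, uint32_t *a) {
    uint32_t ninv = ppow(N, Q - 2), psinv = ppow(PSI, Q - 2);
    for (int j = 0; j < N; j++) {
        uint32_t w = ppow(psinv, 2 * j), wi = 1, acc = 0;
        for (int i = 0; i < N; i++) {
            acc = (acc + pmul(ah[i], wi)) % Q;
            wi = pmul(wi, w);
        }
        a[j] = pmul(pmul(acc, ppow(psinv, j)), ninv);
    }
}

/* c = a * b mod (x^N + 1): transform, multiply pointwise, invert. */
void poly_mul_ntt(const uint32_t *a, const uint32_t *b, uint32_t *c) {
    uint32_t ah[N], bh[N], ch[N];
    ntt(a, ah);
    ntt(b, bh);
    for (int i = 0; i < N; i++) ch[i] = pmul(ah[i], bh[i]);
    intt(ch, c);
}
```

As a quick check, x · x^255 reduces to x^256 = −1 mod (x^256 + 1), i.e., the constant polynomial q − 1.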
Replacing the portable C code from the reference implementation with optimized assembly derived from the Dilithium implementation reduces the cycle count per invocation from 37506 cycles to 8047 cycles, a speed-up by a factor of 4.6.For the inverse NTT, the cycle count dropped from 43116 to 8369, a speed-up by a factor of 5.1.

Gaussian Sampler
The major numerical components of the Gaussian sampler are the CDT sampler for sampling the most significant bits and the fixed-point exponential function used in the rejection step. As both components are called repeatedly, both have been implemented in assembly code.
The CDT sampler accumulates ones and zeros, depending on comparisons of a uniformly sampled 16-bit random value against tabulated threshold values. We use the uadd16, usub16 and sel SIMD instructions from the Cortex-M4 instruction set to carry out two comparisons and accumulations in parallel. We furthermore optimize the memory access and unroll the loop. By doing so, we reduce the cycle count from 800 cycles for the reference implementation to 206 cycles, a speed-up by a factor of 3.9.
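A branch-free CDT sampler of this shape can be sketched in portable C as follows. The table values below are made up for illustration and are not HAETAE's actual distribution; the Cortex-M4 version additionally processes two table entries at once via uadd16/usub16/sel.

```c
#include <stdint.h>

/* Counts, in constant time, how many cumulative thresholds the uniform
 * 16-bit sample u reaches; this count is the sampled value. */
static int cdt_sample(uint16_t u, const uint16_t *cdt, int len) {
    int z = 0;
    for (int i = 0; i < len; i++)
        /* adds 1 iff u >= cdt[i], without a data-dependent branch */
        z += (int)((uint32_t)((int32_t)cdt[i] - (int32_t)u - 1) >> 31);
    return z;
}
```

Because every table entry is visited and the comparison is computed arithmetically, the running time is independent of the sampled value.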
The exponential function is approximated by a polynomial, which is evaluated using Horner's scheme. The reference implementation uses fixed-point arithmetic with 48 fraction bits. Values are embedded in 64-bit integers, and 64×64-bit to 128-bit multiplication is used. The latter operation has no native support on the Cortex-M4. To circumvent this limitation, the Cortex-M4 implementation splits each value into two signed components at the start of each multiplication, namely the most significant bits ah = a >> 24 and the least significant bits al = a − (ah << 24). For the accumulation of the results, a 64-bit integer is used again, taking advantage of the smlal instruction. This repeated switch between representations allows for efficient computation of the individual Horner iterations. Whereas the reference implementation of the exponential function takes 1658 cycles to execute, this is reduced to 563 cycles in the optimized Cortex-M4 code, a speed-up by a factor of 2.9.
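The split-multiplication idea can be illustrated in portable C roughly as follows. This is a hedged sketch, not the exact Cortex-M4 code, which keeps the split operands in registers and accumulates with smlal.

```c
#include <stdint.h>

/* Portable sketch of the 24-bit split multiplication: computes
 * (a * b) >> 48 for signed fixed-point values with 48 fraction bits,
 * using only 64-bit intermediates (with up to one unit of rounding
 * error versus the exact 128-bit product). */
static int64_t fixmul48(int64_t a, int64_t b) {
    int64_t ah = a >> 24;                 /* signed high part */
    int64_t al = a - ah * (1LL << 24);    /* low 24 bits, in [0, 2^24) */
    int64_t bh = b >> 24;
    int64_t bl = b - bh * (1LL << 24);
    /* (a*b) >> 48 = ah*bh + ((ah*bl + al*bh + ((al*bl) >> 24)) >> 24) */
    int64_t mid = ah * bl + al * bh + ((al * bl) >> 24);
    return ah * bh + (mid >> 24);
}
```

Each partial product fits comfortably in 64 bits for the magnitudes involved, which is exactly what makes the split worthwhile on a 32-bit core.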

Stack Optimization
Besides execution time, the memory footprint is an important metric for constrained devices. The target device in this work has 128 KiB of RAM available as stack memory. In this context, data structures typically encountered in lattice-based cryptography must be considered rather large. A single polynomial in HAETAE takes 512 B or 1 KiB of memory to store, depending on whether the data is represented as 16-bit or 32-bit values. Vectors or matrices of polynomials can therefore occupy a considerable share of the available RAM if stored in their entirety. In this section, we explore how to minimize memory use; in some cases, significant trade-offs between memory usage and execution speed can be made.
The reference implementation of HAETAE is designed with an emphasis on readability and close similarity to the mathematical specification. This results in top-level functions consisting of long monolithic blocks with many large data structures that do not necessarily have overlapping lifetimes, but nevertheless occupy stack space for the entire function lifecycle. Most stack memory is required for signature generation. Due to the unnecessarily high stack usage of the reference implementation, HAETAE-180 and HAETAE-260 do not run on the STM32F4-Discovery board without optimizations.
To reduce the memory footprint, we pursued two strategies. First, we carefully analyzed the liveness of each relevant variable and refactored the monolithic code into subroutines, reducing the scope of variables and thereby the total stack usage. This slightly impacts the readability of the code, but does not affect performance.
In a second variant, we additionally opted for a more aggressive memory reduction by recomputing polynomials on demand, which obviously comes at a performance cost. We adapt the data structures to be primarily polynomial-oriented instead of vector- and matrix-oriented. Recomputation occurs during public-key usage, where we generate each polynomial on demand, and during hyperball sampling, where we sample each polynomial twice: once for the evaluation of the normalization factor and a second time to sample the actual y values. Since hyperball sampling is computationally very expensive, this leads to a severe runtime overhead.
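The on-demand pattern for the public key can be sketched as follows, with toy dimensions and a stand-in expansion function in place of the scheme's SHAKE-based expansion; the point is only the stack shape, not the arithmetic.

```c
#include <stdint.h>
#include <string.h>

#define K 2
#define L 4
#define N 8 /* toy polynomial length; HAETAE uses n = 256 */

/* Stand-in for seed expansion: deterministically fills one polynomial
 * of A from (seed, row, col). Purely illustrative. */
static void expand_poly(int32_t p[N], uint32_t seed, int row, int col) {
    for (int i = 0; i < N; i++)
        p[i] = (int32_t)((seed + 31u * (uint32_t)row + 17u * (uint32_t)col + (uint32_t)i) % 64513u);
}

/* Computes w = A*y coefficient-wise (i.e., in the NTT domain), holding
 * only ONE polynomial of A at a time (N words of stack) instead of the
 * full K*L matrix (K*L*N words). */
static void matvec_ondemand(int32_t w[K][N], uint32_t seed, const int32_t y[L][N]) {
    int32_t a[N]; /* single reusable buffer */
    for (int r = 0; r < K; r++) {
        memset(w[r], 0, sizeof w[r]);
        for (int c = 0; c < L; c++) {
            expand_poly(a, seed, r, c); /* regenerated, never stored */
            for (int i = 0; i < N; i++)
                w[r][i] = (int32_t)((w[r][i] + (int64_t)a[i] * y[c][i]) % 64513);
        }
    }
}
```

The trade-off is visible directly: the matrix is expanded K·L times per matrix-vector product instead of once, in exchange for a stack footprint of one polynomial.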

Performance and Comparison Cortex-M4
Table 8 shows the maximum stack sizes of our two Cortex-M4 implementations of HAETAE, together with the values reported in the PQM4 framework for Dilithium and Falcon. With speed-opt, we refer to our implementation that is optimized for the Cortex-M4 and includes multiple stack-size optimizations, but does not trade speed for lower memory requirements. stack-opt refers to the version where we additionally exploit speed-versus-memory trade-offs. First, we observe that the memory requirements of HAETAE are small enough to run on the STM32F4-Discovery board for all parameter sets, even in the speed-opt version. Second, the stack sizes of HAETAE are of the same order of magnitude as those of Dilithium and Falcon. Compared to speed-opt HAETAE, Dilithium requires around two to three times more memory during key generation, with a similar overhead for signature verification. For signature generation the difference is at most 20%; for this operation, Dilithium requires less memory than HAETAE for the first two parameter sets.
Our stack-opt version reduces the stack size by up to 34% during signature generation and key generation, but makes no difference for verification. However, this comes at a higher cost in computation time.
Falcon stands out with a stack size below 10 KiB for both parameter sets during verification.
Table 9 shows the cycle counts of our two Cortex-M4 implementations of HAETAE and the values reported in the PQM4 framework for Dilithium and Falcon. Our version using aggressive stack-reduction techniques based on recomputation does not impact the signature verification time, but almost doubles the computation time for signature generation. The time overhead for key generation is up to 30%. As with our AVX2-optimized implementation, the relative performance of HAETAE compared to Dilithium and Falcon does not change drastically.

Security against Physical Attacks
Implementation security is a crucial aspect of making cryptosystems feasible for real-world applications. A significant advantage of HAETAE is that it can be protected against power side-channel attacks efficiently and with reasonable overhead. In this context, we emphasize the similarity of HAETAE to Dilithium. Hence, past works analyzing concrete attacks [BP18, MUTS22], but also countermeasures [MGTF19, ABC+22], largely apply to HAETAE as well.
While there is no known method to efficiently mask Falcon, Mitaka [EFG+22] was designed to be easy to protect against implementation attacks while still offering similarly small signatures as Falcon. For Mitaka, the crux regarding side-channel security is sampling Gaussian-distributed values. Together with Mitaka, an efficient masked algorithm for discrete Gaussian sampling was presented. However, Prest recently broke its security proof [Pre23]. In this respect, HAETAE has the strong advantage that Gaussian sampling only needs to be secured against the much stronger Simple Power Analysis (SPA) attacker model, which allows for simpler countermeasures, whereas Mitaka's side-channel security will always depend on a masked sampler. While a fully protected implementation of HAETAE is out of scope for this paper, we briefly sketch its feasibility.
Protecting the Arithmetic. Most notably, HAETAE does not deploy floating-point arithmetic at any point, and only a few secrets require fixed-point arithmetic. Remarkably, fixed-point addition can be masked relatively easily, and HAETAE never requires a multiplication of fixed-point values.
During signing, the most critical operation is multiplying the (public) challenge polynomial c with s and subsequently adding the result to y. Since this operation may leak information about the secret key statistically over many executions, implementers must protect it accordingly. Masking has been proven effective as a countermeasure against these so-called Differential Power Analysis (DPA) attacks.
This operation is straightforward to mask at arbitrary order by splitting the secret key polynomials into multiple additive shares in R_q. A masked implementation then stores the NTT of each share of s and multiplies it with c, obtaining a shared cs. Following this, the inverse NTT is applied share-wise. Since y is a polynomial vector in (1/N)R, it is not trivially possible to add our shares of cs ∈ R_q^{k+ℓ}. On the other hand, y is not a secret-key-dependent value. Therefore, it does not need to be protected against DPA, but only against the much stronger attacker model of SPA. In fact, coefficient-wise shuffling of the addition might be sufficient at this point.
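A first-order toy version of this share-wise multiplication might look as follows. This is an illustrative sketch with toy parameters: rand() stands in for a proper masked RNG, and the coefficient-wise product stands in for the NTT-domain multiplication.

```c
#include <stdint.h>
#include <stdlib.h>

#define Q 64513
#define N 8      /* toy length; the scheme uses n = 256 */
#define SHARES 2 /* first-order masking */

/* Splits s into SHARES additive shares mod Q. rand() is NOT
 * cryptographic randomness; a real implementation needs a secure RNG. */
static void mask(int32_t out[SHARES][N], const int32_t s[N]) {
    for (int i = 0; i < N; i++) {
        int32_t r = rand() % Q;
        out[0][i] = r;
        out[1][i] = ((s[i] - r) % Q + Q) % Q;
    }
}

/* Each share is multiplied with the public challenge c independently;
 * the shares of c*s recombine to the unmasked product, so the secret
 * s is never handled in one piece. */
static void masked_mul(int32_t out[SHARES][N], int32_t sh[SHARES][N], const int32_t c[N]) {
    for (int j = 0; j < SHARES; j++)
        for (int i = 0; i < N; i++)
            out[j][i] = (int32_t)(((int64_t)sh[j][i] * c[i]) % Q);
}
```

Because multiplication by the public c is linear, it commutes with the additive sharing, which is exactly why this step masks so cheaply.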
Independent of whether the addition is shuffled or masked, this involves a masking conversion from Z_q to Z_{2^32}. Subsequently, the computation of 2z − y and the bound checks can be shuffled without applying costly masking.
Protecting the Hyperball Sampler. The same idea applies to the whole hyperball sampling procedure. Since the order of the Gaussian samples is, in principle, irrelevant, they can be generated in random order. This is a particular advantage for randomized HAETAE.
For the deterministic version, a masked CDT sampler and a masked approximation of the exponential function are required. The former was recently shown to be feasible by Krausz et al. [KLS+23], while the latter is a sequence of multiplications, shifts by constant amounts, and additions of constants, which is expected to be costly but feasible.
It is noteworthy that the random oracle hash (which outputs the challenge) also only needs to be protected against SPA. Since the input order into the hash function cannot be randomized, the preceding values must still be protected by masking. Therefore, if no masked hyperball sampling has been performed, we propose to perform a shuffled point-wise multiplication of A and y, directly followed by freshly masking the resulting coefficients. Then, a share-wise inverse NTT and a masking conversion to the Boolean domain are performed, which enables a secure HighBits operation. For the LSBs of y_0, generating a fresh Boolean masking during the shuffled generation of the hyperball sample's coefficients is sufficient.

Conclusion
With HAETAE, we close an important gap between the two state-of-the-art digital signature schemes Dilithium and Falcon. Novel contributions to key generation and rejection sampling allow us to reach significantly smaller signature and verification key sizes, while still permitting physically side-channel-protected implementations for IoT use cases. Moreover, our first set of optimized implementations demonstrates that our proposed algorithms run in feasible time on both embedded and high-end platforms and compete with existing schemes when considering sending latency.
Algorithm 3 Mapping from (R_q^k, R_q) to R_2q^k
fromCRT(w, x)
1: parse w as a vector of integers w of size kn
2: parse x as a vector of integers x of size n
3: for i := 0 to n − 1 do
4:   if LSB(x_i) = LSB(w_i) then ▷ Implement in constant time.
       w′_{nj+i} := w_{nj+i} + q
14: arrange w′ to w′, an element in R_2q^k
15: return w′

Applications could adopt the offline approach with a designated signing key, including the multiplication of A and y, to further reduce the latency of the online phase.

Figure 8 :
Figure 8: Full description of deterministic HAETAE. The KeyGen algorithm is slightly different for d = 0 (HAETAE-260), which does not truncate b. See Section 3.1.1 for details.

Figure 14 :
Figure 14: Distribution of the coefficients of h and HB(z_1) in HAETAE-120.

Figure 15 :
Figure 15: Raw signature size distribution over 20000 executions. We set the bound for the size-based rejection to result in a rejection rate of less than 0.1%.

Table 1 :
NIST security level, signature size, verification key size, and implementation security, with respect to constant-time and masking of selected signature schemes

Table 3 :
HAETAE parameter sets. Hardness is measured with the Core-SVP methodology.

Table 5 :
NIST security level, signature and key sizes (bytes) of HAETAE, Dilithium, and Falcon.

Table 6 :
Reference implementation speeds. Median and average cycle counts of 1000 executions for HAETAE, Dilithium, and Falcon. Cycle counts were obtained on one core of an Intel Core i7-10700K, with TurboBoost and hyperthreading disabled.

Table 7 :
AVX2-optimized implementation speeds. Median and average cycle counts of 1000 executions for HAETAE, Dilithium, and Falcon. Cycle counts were obtained on one core of an Intel Core i7-10700K, with TurboBoost and hyperthreading disabled.

Table 8 :
Maximum stack size in bytes for Cortex-M4 implementations of HAETAE, Dilithium, and Falcon.