BAT: Small and Fast KEM over NTRU Lattices

We present BAT, an IND-CCA secure key encapsulation mechanism (KEM) that is based on NTRU but follows an encryption/decryption paradigm distinct from classical NTRU KEMs. It demonstrates a new approach to decrypting NTRU ciphertexts, 25 years after the introduction of NTRU. Instead of introducing an artificial masking parameter p to decrypt the ciphertext, we use 2 linear equations in 2 unknowns to recover the message and the error. The encryption process is therefore close to the GGH scheme. However, since the secret key is now a short basis (not a vector), we need to modify the decryption algorithm and we present a new NTRU decoder. Thanks to the improved decoder, our scheme works with a smaller modulus and yields shorter ciphertexts: for 128-bit classical security they are smaller than those of RSA-4096, with a comparable public-key size, and the scheme is much faster than RSA or even ECC. Meanwhile, encryption and decryption remain simple and fast in spite of the complicated key generation. Overall, our KEM has more compact parameters than all current lattice-based schemes and practical efficiency. Moreover, due to the similar key pair structure, BAT can be of special interest in applications using the Falcon signature, which is also the most compact signature in round 3 of the NIST post-quantum cryptography standardization. However, unlike Falcon, our KEM does not rely on floating-point arithmetic and can be fully implemented over the integers.


Introduction
Lattice-based schemes, especially when they have a polynomial structure, are a very strong contender for post-quantum cryptography. They can be faster than widely deployed cryptosystems based on RSA and ECDH. However, the sizes of public keys, signatures and ciphertexts are significantly larger than in RSA and even larger by an order of magnitude compared with ECDH cryptosystems. Such a large size is a major drawback of lattice schemes, and can be a crucial obstacle in the following situations:
- Real-world protocols may have a maximum length designed for classical cryptography. For a standard Ethernet connection, the maximum transmission unit (MTU) is 1500 bytes, and forward secrecy requires several objects including public keys (certificates), ciphertexts and signatures, for confidentiality and authentication.
- Large communication sizes increase the risk of lost packets and delays. Recent experiments on post-quantum TLS [54] show that communication sizes come to govern the performance when the packet loss rate is higher than 3%. Moreover, [61] examines how the initial TCP window size affects post-quantum TLS and SSH performance, and shows that even a small size increase can reduce the observed post-quantum slowdown by 50%. In addition, transmission energy can also be a significant part of the energy consumption of cryptography [58]. The size of signature schemes in TLS handshakes is also important, as analyzed in [62].
- In some lightweight applications, e.g. the Internet of Things (IoT), encryption and verification are done by constrained devices. These devices only have small on-board storage and modest processors, so they may not be compatible with large public keys, signatures and ciphertexts.
With the preparation and deployment of post-quantum cryptography underway, it is important to explore new lattice-based cryptosystems with smaller parameters.
In this light, a natural choice is NTRU [39], as its structure reduces the data to one ring element. There have been many high-performance NTRU-based schemes, ranging from Falcon [33] and BLISS [28] for signatures to NTRU-HRSS [42], NTTRU [49], NTRUEncrypt [18] and NTRU Prime [11] for encryption and KEM. In particular, Falcon is the most compact signature in round 3 of the NIST post-quantum cryptography standardization [53]. In addition, NTRU-based encryption and KEM schemes have some advantage in real-world use: the relevant patents have expired; by contrast, there could be some (controversial) intellectual property claims on the Ring/Module-LWE counterparts.
NTRU-based schemes are defined over some polynomial ring $R$; in this work $R = \mathbb{Z}[x]/(x^n+1)$ with $n$ a power of 2. The secret key of an NTRU cryptosystem is essentially a pair of short polynomials $(f, g) \in R^2$ while the public key is $h = f^{-1}g \bmod q$. All NTRU encryption schemes, ranging from the earliest proposal [39] to the round 3 NIST submission [17], have followed essentially the same design rationale for more than 20 years. Concretely, the ciphertext is $c = phr + m \bmod q$ where $p$ is the masking modulus, $r$ is the randomness and $m$ is the message. Correct decryption relies on $c' = pgr + fm$ being short, so that $c' = (fc \bmod q)$. The masking modulus $p$ is also necessary to decrypt: one needs to first clean out $pgr$ via reduction modulo $p$ and then to recover $m$ by multiplying by the inverse of $f$ modulo $p$. For typical NTRU KEMs, $(f, g)$ is sparse and of length about $C\sqrt{n}$ for a small constant $C$.
By contrast, the design rationale for NTRU-based signatures went through some significant changes. The first NTRU-based signature is NTRUSign [38], a hash-and-sign scheme. However, its signature transcripts leak some secret key information, so that NTRUSign and some variants were broken by statistical attacks [52,30]. Later, Ducas et al. made use of the GPV hash-and-sign framework [35] and proposed a provably secure NTRU-based signature [29] that further developed into Falcon [33]. The public key of Falcon is still $h = f^{-1}g \bmod q$ for some short $(f, g)$, while the actual secret key is a trapdoor basis $B_{f,g} = \begin{pmatrix} g & G \\ f & F \end{pmatrix} \in R^{2\times 2}$. The signing of Falcon is essentially Gaussian sampling with a trapdoor [35,31]. Consequently, the signature size depends on the maximal Gram-Schmidt norm of $B_{f,g}$. As analyzed in [29], Falcon chooses $(f, g)$ with $\|(f, g)\| \approx 1.17\sqrt{q}$ for optimal parameters. Therefore, $\|(f, g)\|$ in Falcon is independent of $n$, which is different from the case of NTRU KEMs. Some other NTRU-based signatures [28,25] indeed use one vector $(f, g)$ as the secret key, as in the case of NTRU KEMs. However, to the best of our knowledge, there is no practical NTRU-based KEM using a trapdoor basis as the secret key. Intuitively, we expect $(F, G)$ to yield one more equation in decryption, so that one can recover both the message and the encryption randomness via two equations. This in effect gets rid of the masking modulus $p$ of classical NTRU KEMs and thus hopefully allows smaller parameters. It is noteworthy that the improvement in compactness of NTRU-based KEMs lags far behind the improvement in speed of NTRU-based KEMs [49,32] and in compactness of LWE-based KEMs [9,47,6,21].
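As a point of reference for the classical design recalled above, the following is a minimal sketch of decryption with a masking modulus $p$. It is only a toy: the ring dimension n = 2, the moduli q = 61 and p = 3, and the key (f, g) are illustrative assumptions chosen so that everything can be checked by hand, and the helper names are hypothetical.

```python
# Toy classical NTRU decryption with a masking modulus p in Z[x]/(x^2 + 1).
# All parameters (n = 2, q = 61, p = 3, f, g) are illustrative assumptions.
from itertools import product

N, Q, P = 2, 61, 3
F_SK = [5, 2]    # f = 5 + 2x
G_SK = [2, -2]   # g = 2 - 2x


def mul(a, b):
    """Multiply two elements of Z[x]/(x^n + 1) (negacyclic convolution)."""
    res = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < N:
                res[i + j] += ai * bj
            else:
                res[i + j - N] -= ai * bj
    return res


def centered(a, mod):
    """Reduce coefficients modulo an odd modulus into the range around 0."""
    return [((x + mod // 2) % mod) - mod // 2 for x in a]


def inverse_n2(a, mod):
    """Inverse of a0 + a1*x modulo (x^2 + 1, mod): conj(a) / (a0^2 + a1^2)."""
    t = pow(a[0] * a[0] + a[1] * a[1], -1, mod)
    return [(a[0] * t) % mod, (-a[1] * t) % mod]


H = [x % Q for x in mul(inverse_n2(F_SK, Q), G_SK)]   # h = f^{-1} g mod q
F_INV_P = inverse_n2(F_SK, P)                          # f^{-1} mod p


def encrypt(m, r):
    """c = p*h*r + m mod q for ternary m and r."""
    hr = mul(H, r)
    return [(P * x + y) % Q for x, y in zip(hr, m)]


def decrypt(c):
    """Recover m: f*c mod q equals p*g*r + f*m over Z when that is short."""
    c_prime = centered(mul(F_SK, c), Q)
    # Reducing modulo p removes p*g*r; multiplying by f^{-1} mod p removes f.
    return centered(mul(F_INV_P, c_prime), P)
```

With these values every coefficient of $pgr + fm$ is at most 19 in absolute value, which is below q/2, so decryption is exact for all ternary m and r. The role of p is visible in `decrypt`: the randomness r is never recovered, only cancelled modulo p.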
Moreover, when we unify the trapdoor function for both signature and KEM, part of the code can be shared and we may also reduce some storage and communication. Therefore, it would be interesting to investigate the practicality of a trapdoor basis for KEM.
Indeed, the earliest lattice-based cryptosystem GGH (Goldreich-Goldwasser-Halevi [36]) uses a trapdoor basis as the secret key and implements both encryption and signature based on that trapdoor function. Later, Micciancio improved the GGH trapdoor function by using the Hermite normal form [50]. While GGH encryption has as long a history as NTRU, its practicality is far from well studied and there is so far no GGH-like encryption/KEM with concrete parameters and security analysis.
Our contributions. We present a new KEM based on NTRU, called BAT. Similar to the Falcon signature, BAT uses $h = f^{-1}g \bmod q$ as the public key and its secret key is a trapdoor basis $B_{f,g}$ with an additional ring element (for faster decapsulation). In addition, BAT shares the same leading design principle with Falcon, i.e. minimizing the communication size.
Our main improvement in BAT is a better decapsulation algorithm, which represents a major modification to the NTRU encryption scheme since its introduction 25 years ago. Instead of following the original NTRU, we modify it according to the GGH-Micciancio blueprint [37,50]. The message $m$ is now encapsulated as $c = hm + e \bmod q$ where $h$ is the NTRU public key and $e$ is a small error. The decapsulation corresponds to applying Babai's nearest plane algorithm with the secret basis to decode the closest lattice point, and therefore recovering the message. Compared with other NTRU KEMs, we do not need the masking modulus $p$ to extract the message. Instead of multiplying only by $f$, we also multiply by $F$ so that we get 2 linear equations in the 2 unknowns $e, m$.
However, Babai's nearest plane algorithm heavily relies on floating-point arithmetic, although the most expensive calculations can be done in a pre-computation phase. To avoid floating-point arithmetic in the decapsulation, we replace the high-precision Gram-Schmidt vectors with integral approximations. Additionally, notice that $m$ and $e$ do not necessarily follow the same distribution, hence we take into account their different sizes to optimize the decoding. We also use the Learning With Rounding (LWR) assumption [8,19] in order to further reduce the size of the ciphertext. Our improved decoding algorithm can be used with a smaller modulus and with more bits dropped, and thereby increases the security of the scheme and decreases the communication.
Overall, our KEM achieves very impressive performance. First, for the same NIST security level, BAT achieves the smallest communication size, namely "public key size + ciphertext size", among all current lattice cryptosystems and even RSA cryptosystems. Second, the complexity of the code as well as its running time is asymmetric: while key generation is complicated, the frequent key usage is quite efficient. Specifically, encryption is very simple (essentially a ring multiplication), and decryption also boils down to a few ring additions and multiplications. Cheap daily operations make BAT particularly compatible with small devices. Third, we can implement the whole scheme fully over the integers, unlike Falcon. Our implementation is constant-time and uses some AVX2 optimizations. BAT has performance comparable to Kyber while being more compact. Furthermore, it is comparable to SIKE p434 in size while being much more efficient. The SIKE timing we report is with x86 assembly optimization; at the same level of optimization as our implementation, the SIKE timing would be higher by one order of magnitude. We summarize the detailed comparisons with some well-known schemes in Table 2.
Finally, we explain in a simplified way why the new decryption algorithm leads to smaller parameters. To decrypt correctly, our KEM needs $(fc \bmod q) = gm + fe$, i.e. $\|gm + fe\|_\infty < \frac{q}{2}$, while previous NTRU KEMs need $\|pgr + fm\|_\infty < \frac{q}{2}$. We also compare with ring-LWE-based KEMs. For a typical ring-LWE-based KEM, the secret key is $(f, g) \in R^2$, the public key is $(a, b = af + g) \in (R/qR)^2$ and the ciphertext is $(c_1 = ae_0 + e_1, c_2 = be_0 + e_2 + \lfloor \frac{q}{2} \rfloor m) \in (R/qR)^2$. The requirement for correct decryption is $\|e_0 g - e_1 f + e_2\|_\infty < \frac{q}{4}$. Suppose that $m, e, r, e_i$ are drawn from a distribution of standard deviation $\sigma_e$ and $f, g$ from a distribution of standard deviation $\sigma_f$. The coefficients of $gm + fe$, $pgr + fm$ and $e_0 g - e_1 f + e_2$ are modeled as Gaussian. The comparison of parameter restrictions is summarized in Table 1. It can be seen that given $(n, \tau, \sigma_e, \sigma_f)$, BAT allows a smaller modulus $q$. Note that for fixed $(n, \sigma_e, \sigma_f)$, a smaller $q$ implies higher security.
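The comparison can be made concrete with a short calculation. The sketch below assumes independent, zero-mean coefficients, so that each coefficient of a product in $\mathbb{Z}[x]/(x^n+1)$ has variance $n\sigma_f^2\sigma_e^2$; the function names and the generic tail multiplier `t` are assumptions made for illustration and do not reproduce the exact constants of Table 1.

```python
# Per-coefficient standard deviations of the three decryption noise terms,
# under an independence and zero-mean assumption (illustrative model only).
import math


def std_bat(n, sigma_e, sigma_f):
    """Std of a coefficient of g*m + f*e (two ring products)."""
    return sigma_e * sigma_f * math.sqrt(2 * n)


def std_ntru(n, sigma_e, sigma_f, p):
    """Std of a coefficient of p*g*r + f*m (one product scaled by p)."""
    return sigma_e * sigma_f * math.sqrt(n * (p * p + 1))


def std_rlwe(n, sigma_e, sigma_f):
    """Std of a coefficient of e0*g - e1*f + e2."""
    return math.sqrt(2 * n * (sigma_e * sigma_f) ** 2 + sigma_e ** 2)


def modulus_lower_bound(std, t, divisor):
    """q must exceed divisor * t * std so that t * std < q / divisor."""
    return divisor * t * std


def compare(n, sigma_e, sigma_f, p, t):
    """Lower bounds on q for the three designs, with the same tail multiplier."""
    return {
        "BAT": modulus_lower_bound(std_bat(n, sigma_e, sigma_f), t, 2),
        "NTRU": modulus_lower_bound(std_ntru(n, sigma_e, sigma_f, p), t, 2),
        "RLWE": modulus_lower_bound(std_rlwe(n, sigma_e, sigma_f), t, 4),
    }
```

In this model the classical NTRU noise is larger than the BAT noise by a factor $\sqrt{(p^2+1)/2}$, and the ring-LWE design needs roughly twice the modulus because its threshold is $q/4$ rather than $q/2$.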

Table 1: The parameter restrictions for correct decryption. The parameter τ is the tail-bound parameter determining the decryption failure rate.
[Table 1 body lost in extraction. Surviving fragments: the column header "Requirement for correct decryption" and the start of the NTRU row, "$\tau \sigma_e \sigma_f$ (p ...".]

Table 2: Comparisons with other KEMs including NTRU-HRSS [42], NTTRU [49], Kyber [5], Saber [9], LAC [47], Round5 [6], ECC, RSA and SIKE [43]. Timings do not include generation of a random seed (from the operating system's RNG) or key derivation costs. Sizes for BAT and LW-BAT include an optional one-byte identifying header. The implementation of LW-BAT was not fully optimized with AVX2 opcodes. Measurements for Kyber, Saber and RSA-4096 were performed on the exact same system (x86 Coffee Lake) as BAT and LW-BAT. Measurements for NTRU-HRSS and NTTRU were given in [49]; those for LAC, Round5 and SIKE are as in their NIST documentation. Values for ECC are for the curve25519 implementation in eBACS [12].

Comparison with Falcon. BAT is similar in spirit to the Falcon signature: they both achieve good compactness by using a well-chosen NTRU trapdoor basis as the secret key. Nevertheless, some crucial distinctions exist between BAT and Falcon.
- At a high level, BAT and Falcon exploit their trapdoor to solve CVP (closest vector problem), but the CVP algorithms used are very different. Specifically, Falcon makes use of the KGPV Gaussian sampler [35], which is a randomized version of Babai's nearest plane algorithm. In contrast, BAT decrypts with a deterministic NTRU decoder that can be viewed as a hybrid of Babai's round-off and nearest plane algorithms.
- The algorithms of BAT are simpler than those of Falcon. On the one hand, the signing of Falcon relies on high-precision Gaussian sampling, but the encryption and decryption of BAT only need basic integer operations. On the other hand, Falcon includes many high-precision intermediate values along with the trapdoor for faster signing, but BAT just adds one integral polynomial for faster decryption.
- The NTRU trapdoors of BAT and Falcon are generated in different ways. In fact, Falcon chooses its trapdoor for smaller signatures, which is equivalent to minimizing the maximal Gram-Schmidt norm of the trapdoor basis. As for BAT, the trapdoor is generated to minimize the decryption failure rate, and according to our new decoder, the distributions of the message and error will also affect the trapdoor generation (see Section 3 for more details).
Related works. Chuengsatiansup et al. [22] propose some extensions of the Falcon signature and NTRU encryption over Module-NTRU lattices. This allows more flexible parameters for NTRU-based cryptosystems. Our techniques are likely to apply to Module-NTRU-based schemes as well.
In recent years, the performance of NTRU encryption has been greatly improved [49,32]. These new NTRU instantiations are mainly proposed for high efficiency and follow the classical design of the original NTRU. In contrast, BAT is driven by the quest for compactness and its design is different.
In order to improve parameters, some schemes [21,65] are built upon a variant of LWE in which the secret and error follow different distributions. Our work makes use of a similar idea. Yet the main difference is that our KEM follows a novel pattern which is essential to minimize the parameters.

Notations
We follow the setting [...] and $(a \bmod q) \in \mathbb{Z}_q$ for any $a \in \mathbb{Z}$. Let ln (resp. log) denote the logarithm with base e (resp. 2). For an integer $q > 0$, let $\lfloor a \rceil_q = \lfloor aq \rceil / q \in (1/q)\cdot\mathbb{Z}$ for $a \in \mathbb{R}$. For a real-valued function $f$ and a countable set $S$, we write $f(S) = \sum_{x \in S} f(x)$, assuming that this sum is absolutely convergent.

Linear algebra
[...] is a matrix with pairwise orthogonal columns. Let $R_n = \mathbb{Z}[x]/(x^n+1)$ with $n$ a power of 2 and $K_n = \mathbb{Q}[x]/(x^n+1)$. We denote by $(R_n \bmod q)$ the ring $R_n/qR_n$. When the context is clear, we may write $R_n$ (resp.
The symbol $\lfloor \cdot \rceil_q$ is naturally generalized to $K_n$ by applying it coefficient-wise.

Probability and statistics
Given a distribution χ, we write $z \leftarrow \chi$ when the random variable $z$ is drawn from χ. For $z \leftarrow \chi$, let $\mu[z]$ (resp. $\sigma[z]$) denote the expectation (resp. standard deviation) of $z$, and [...] For a distribution χ, we denote by Sample(χ) the procedure of generating a random sample of χ and by Sample(χ; seed) the sampling procedure with seed seed. For a finite set $S$, let $U(S)$ be the uniform distribution over $S$. In particular, for a positive integer $k$, [...] be the one-dimensional Gaussian function with standard deviation σ. The centered discrete Gaussian over the integers with standard deviation σ is defined by the probability function [...] Let $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2)\,dt$ be the error function. For a random variable $X$ following a normal distribution with mean 0 and variance 1/2, $\mathrm{erf}(x)$ is the probability of $X$ lying in the range $[-x, x]$.

NTRU
The NTRU lattice defined by $h$ is denoted by [...] [38,56]. Fixing $(f, g)$, there are infinitely many such bases, but these bases all have the same Gram-Schmidt norms. Hence, we simply write [...] While the public key of an NTRU-based scheme is $h$ itself, the secret key can have different forms. For most NTRU encryption schemes, the secret key is $(g, f)$ itself, i.e. one short vector of $L_{h,q}$. For some other applications, e.g. signature and IBE, the secret key is $B_{f,g}$, called an NTRU trapdoor basis. Falcon [33] is a representative example. Falcon is an NTRU-based signature following the GPV hash-and-sign framework [35]. To sign a message $m$, the signer computes a pair of short polynomials $(s_1, s_2)$ such that $s_1 + hs_2 = \mathrm{Hash}(m)$. This procedure is accomplished by lattice Gaussian sampling with $B_{f,g}$, and the length of the signature $(s_1, s_2)$ depends on the sampled Gaussian width. The Gaussian sampler of Falcon is a fast Fourier variant [31] of the KGPV sampler [35], hence the signature size is proportional to the maximal Gram-Schmidt norm of $B_{f,g}$. For optimal parameters, Falcon generates $(f, g)$ such that $\|(f, g)\| \approx 1.17\sqrt{q}$ as per [29].

A New NTRU Decoder
In this section we present a new NTRU decoding algorithm that is the key component of our KEM. In the context of NTRU, the code words are $hs + e \bmod q$ where $h$ is the public key and $s, e$ are small polynomials. The decoding process recovers $(s, e)$ with an NTRU trapdoor. An ideal decoder is supposed to satisfy:
1. All operations are simple and efficient; no high-precision arithmetic is needed.
2. The decoding distance is large, i.e. the decoder is able to recover large errors $(s, e)$. Note that for our KEM, larger errors correspond to a higher security level.
There are two well-known decoding algorithms due to Babai [7]: Babai's round-off algorithm (RO for short) and Babai's nearest plane algorithm (NP for short). They have respective pros and cons. The RO algorithm outperforms NP in efficiency and simplicity. In addition, RO is particularly compatible with q-ary lattices: all operations are over $\mathbb{Z}_q$. By contrast, NP is capable of decoding larger errors in both the worst and average cases [57]. Yet the principal drawback of NP is its reliance on high-precision arithmetic.
Our decoder overcomes the main shortcomings of RO and NP. First, it is able to tackle a larger decoding distance than RO. Second, while complicated computations are still required, all involved algorithms can be implemented using fixed-point arithmetic in practice, which outperforms NP. Meanwhile, these expensive computations can be done in the pre-computation and therefore do not affect the decoding efficiency. With an auxiliary integer vector, our algorithm achieves the same efficiency as RO and all operations are integer arithmetic. To optimize the decoding, our algorithm also takes into account the distributions of $s$ and $e$.

Babai's algorithms for NTRU
For better contrast, we first briefly recall RO and NP in the NTRU setting. Let $h \in (R \bmod q)$ be the public key and $B_{f,g} = \begin{pmatrix} g & G \\ f & F \end{pmatrix} \in R^{2\times 2}$ be the trapdoor basis. In later discussion, we shall treat $L_{h,q}$ as an $R$-module of rank 2 rather than a $\mathbb{Z}$-module of rank $2n$.
The application of RO related to NTRU dates back to NTRUSign [38]. Given $c = hs + e \in (R \bmod q)$, we have [...] In some applications [29,33], the trapdoor basis is optimal with respect to Gram-Schmidt norms: $\|(g, f)\| \approx \|(G^*, F^*)\|$. However, $\|(g, f)\|$ and $\|(G, F)\|$ are not so close: $\|(G, F)\| \approx \sqrt{n/12} \cdot \|(g, f)\|$ [38]. As a consequence, $\|(e, s)\|$ is dominated by the large $\|(G, F)\|$ in the RO algorithm but by the small $\|(g, f)\|$ in the NP algorithm, which leads to a gap of $O(\sqrt{n})$.
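To make the round-off approach concrete: since $fh = g$ and $Fh = G$ modulo $q$, multiplying $c$ by $f$ and by $F$ gives $gs + fe$ and $Gs + Fe$ over $R$ whenever both are smaller than $q/2$, and these two equations determine $(s, e)$ because $gF - fG = q$. The sketch below checks this on a toy instance; the dimension n = 2, the modulus q = 37 and the hand-picked basis are illustrative assumptions, not BAT parameters, and the helper names are hypothetical.

```python
# Toy round-off decoding in R = Z[x]/(x^2 + 1). The key below is a
# hand-picked illustrative assumption, not a BAT parameter set.
from itertools import product

N, Q = 2, 37
f, g = [5, 2], [2, -2]      # short secret pair (g, f)
F, G = [2, 2], [-5, 2]      # completes the basis: g*F - f*G = q


def mul(a, b):
    """Multiply two elements of Z[x]/(x^n + 1) (negacyclic convolution)."""
    res = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < N:
                res[i + j] += ai * bj
            else:
                res[i + j - N] -= ai * bj
    return res


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def centered(a, mod):
    """Reduce coefficients modulo an odd modulus into the range around 0."""
    return [((x + mod // 2) % mod) - mod // 2 for x in a]


def inverse_n2(a, mod):
    """Inverse of a0 + a1*x modulo (x^2 + 1, mod): conj(a) / (a0^2 + a1^2)."""
    t = pow(a[0] * a[0] + a[1] * a[1], -1, mod)
    return [(a[0] * t) % mod, (-a[1] * t) % mod]


h = [x % Q for x in mul(inverse_n2(f, Q), g)]   # public key h = f^{-1} g mod q


def encode(s, e):
    """Code word c = h*s + e mod q."""
    return [(x + y) % Q for x, y in zip(mul(h, s), e)]


def decode(c):
    """Recover (s, e) from two linear equations, without any masking modulus."""
    a = centered(mul(f, c), Q)        # a = g*s + f*e over R
    b = centered(mul(F, c), Q)        # b = G*s + F*e over R
    qs = sub(mul(F, a), mul(f, b))    # (g*F - f*G) * s = q * s
    qe = sub(mul(g, b), mul(G, a))    # (g*F - f*G) * e = q * e
    return [x // Q for x in qs], [x // Q for x in qe]
```

In this toy instance $(G, F)$ happens to be orthogonal to $(g, f)$, so round-off is already optimal; in realistic dimensions $(G, F)$ is much longer than $(g, f)$, which is exactly the gap discussed above.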

Our decoding algorithm for NTRU
As shown in Section 3.1, RO boils down to solving two linear equations over $R$ (without modular reduction). To enlarge its decoding range, we hope to replace the large $(G, F)$ with some small vector $(G', F')$ of size [...] However, if we want to work with $(G^*, F^*)$ directly, we have to resort to high-precision arithmetic. To overcome the precision issue, we choose $(G', F') = (G - g\lfloor v \rceil_{q'}, F - f\lfloor v \rceil_{q'}) \in (1/q')R^2$. When $q'$ is sufficiently large, $(G', F')$ converges to $(G^*, F^*)$, whose norm is about $\|(g, f)\|$. In practice, a moderate $q'$ suffices to significantly improve the decoding.
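The effect of the rounding precision $q'$ can be seen on a rank-2 integer toy example (n = 1). The sketch assumes that $v$ is the plain Gram-Schmidt coefficient of $(G, F)$ with respect to $(g, f)$, which is what the convergence to $(G^*, F^*)$ suggests; the actual ComputeVec additionally accounts for the distributions of $s$ and $e$. The values f, g, F, G below are hypothetical.

```python
# Approximating the Gram-Schmidt vector (G*, F*) by (G', F') with
# v rounded to a multiple of 1/q'. Toy integers (n = 1), illustrative only.
from fractions import Fraction
import math

f, g = 5, 3              # short vector (g, f)
F, G = 12, 1             # second basis vector, g*F - f*G = q
q = g * F - f * G        # = 31


def round_to(v, qp):
    """Round v to the nearest multiple of 1/qp (ties rounded up)."""
    return Fraction(math.floor(v * qp + Fraction(1, 2)), qp)


# Assumed definition: Gram-Schmidt coefficient of (G, F) along (g, f).
v = Fraction(F * f + G * g, f * f + g * g)
G_star, F_star = G - v * g, F - v * f


def approx_vector(qp):
    """(G', F') = (G - g*round(v), F - f*round(v)) with precision 1/qp."""
    vq = round_to(v, qp)
    return G - vq * g, F - vq * f


def sq_dist_to_gs(qp):
    """Squared distance between (G', F') and the exact (G*, F*)."""
    Gp, Fp = approx_vector(qp)
    return (Gp - G_star) ** 2 + (Fp - F_star) ** 2
```

Since $|v - \lfloor v \rceil_{q'}| \le 1/(2q')$, the distance between $(G', F')$ and $(G^*, F^*)$ is at most $\|(g, f)\|/(2q')$, so doubling $q'$ halves the worst-case approximation error while all stored values remain multiples of $1/q'$.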
We further refine our decoder as per the distributions of $s$ and $e$. We focus on the common case where both $s$ and $e$ are iid-random over some publicly known distributions $\chi_s$ and $\chi_e$. In practice, $\chi_s$ and $\chi_e$ are not necessarily the same or even close, which may cause a gap between the sizes of $s$ and $e$. For example, when $s$ and $e$ differ substantially in size, we expect a better decoding by using a basis [...] To this end, we introduce a parameter γ, by default $\gamma = \sigma_e/\sigma_s$, to compute the optimal decoding basis. Moreover, $\chi_s$ and $\chi_e$ do not have to be centered, e.g. $\chi_s = U(\mathbb{Z}_2)$. For given $(f, g)$, non-centered $s$ and $e$ lead to a non-zero average of $fe + gs$. Therefore, we also consider the impact of $\mu_s$ and $\mu_e$ during decoding. Here we assume $\mu_s, \mu_e \in \frac{1}{Q} \cdot \mathbb{Z}$ for some $Q \in \mathbb{N}$, which is indeed the case for our later schemes.
The decoding algorithm consists of two steps: (1) computing the auxiliary polynomial $w$ and (2) recovering $(s, e)$. They are illustrated in Algorithms 3.1 and 3.2 respectively. Notably, both algorithms can be fully implemented over the integers. In Algorithm 3.1, the computation of $v$ consists of one polynomial division, but the final output is actually an integral approximation of $q'v$, which can be computed with fixed-point values. More details are presented in Section 6.1.

Algorithm 3.1 ComputeVec
Claim 1 gives a heuristic estimate of the probability of correct decoding of Algorithm 3.2. The argument is in Appendix A.
Let $s, e \in R$ be iid-random over $\chi_s$ and $\chi_e$ respectively, and [...] Then the probability of $\mathrm{Decode}(B_{f,g}, w, q, q', Q, (hs + e \bmod q)) = (e, s)$ is heuristically estimated to be at least $1 - 2n \cdot (1 - \mathrm{erf}(\tau))$ over the randomness of $s$ and $e$.
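The bound $2n \cdot (1 - \mathrm{erf}(\tau))$ is easy to evaluate numerically. The following sketch computes its base-2 logarithm and, by bisection, the smallest τ reaching a target failure bound; the function names and the sample values of n and of the target are illustrative assumptions.

```python
# Evaluating the heuristic failure bound 2n * (1 - erf(tau)).
import math


def log2_failure_bound(n, tau):
    """log2 of 2n * (1 - erf(tau)); erfc avoids cancellation for large tau."""
    return math.log2(2 * n * math.erfc(tau))


def min_tau(n, target_log2, lo=0.0, hi=20.0):
    """Smallest tau in [lo, hi] with failure bound <= 2^target_log2."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if log2_failure_bound(n, mid) <= target_log2:
            hi = mid
        else:
            lo = mid
    return hi
```

For example, `min_tau(512, -128)` returns the tail-bound parameter needed for a heuristic failure bound of $2^{-128}$ at ring dimension 512 under this estimate.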

Decoding failure rate
Claim 1 gives a heuristic estimate for $\mu[P(B_{f,g}, w, \chi_s, \chi_e)]$ (over the randomness of $(f, g)$) where $P(B_{f,g}, w, \chi_s, \chi_e) = \Pr[\mathrm{Decode}(B_{f,g}, w, q, q', Q, (hs + e \bmod q)) \neq (e, s) \mid e \leftarrow \chi_e, s \leftarrow \chi_s]$ is the decoding failure rate for given $(B_{f,g}, w)$. In fact, it is hard to numerically compute $\mu[P(B_{f,g}, w, \chi_s, \chi_e)]$, since the distribution of $(G', F')$ (defined in Claim 4) is complicated. It is however easy to numerically compute $P(B_{f,g}, w, \chi_s, \chi_e)$ given $(B_{f,g}, w)$.
We experimentally calculate $P(B_{f,g}, w, \chi_s, \chi_e)$ for some $(B_{f,g}, w)$ generated with the suggested parameters (see Tables 3 and 4). The failure probabilities are smaller than for a naive Gaussian model, and significantly so for a ring dimension of 256.

BAT KEM
In this section, we present a KEM scheme, called BAT, constructed following the aforementioned encode/decode paradigm. Its secret key is an NTRU trapdoor basis as in the Falcon signature. As a consequence, some code from Falcon implementations can be reused.
BAT permits very compact parameters. Specifically, the modulus $q$ is greatly reduced in contrast to Falcon. More remarkably, the ciphertext is well compressed: each coefficient requires less than one byte of storage.

Algorithm description
Prior to the description of our KEM, we first present the underlying public key encryption. It is specified by the following parameters:
- $R = \mathbb{Z}[x]/(x^n + 1)$ with $n = 2^l$.
- $q = bk + 1$ with $b, k \in \mathbb{N}$. Note that $b$ determines the size of each ciphertext coefficient and $k$ determines the decoding distance.
- $q' \in \mathbb{N}$ is used to control the decryption failure rate.
At a high level, our idea is to build an encryption scheme upon a one-way trapdoor function. Indeed, for a pseudorandom public key $h$, the function $F(s, e) = hs + e \bmod q$ is one-way under the Ring-LWE assumption, but one can invert it with the trapdoor $B_{f,g}$ as shown in Section 3. In our scheme, the encryption is to compute $c = F(s, e)$ and the decryption is to recover $s$ by inverting $F(s, e)$.
The key generation is shown in Algorithm 4.1. The first step is to generate an NTRU trapdoor basis $B_{f,g}$ along with the public key $h$. This is similar to the Falcon key generation, but the size of the secret key is changed. The second step pre-computes an auxiliary vector $w$. As explained in Section 3, $w$ is used for decoding a larger error while avoiding floating-point arithmetic in the decapsulation. We include it as a part of the secret key. Note that Falcon key generation also pre-computes the Falcon tree for signing, but that computation is useless in our scheme.
The encryption algorithm is described in Algorithm 4.3. The message space is $M = \{0, 1\}^\lambda$ where λ denotes the claimed security level. The encryption is obtained by applying a simple average-case to worst-case correctness (ACWC for short) transform, introduced by a concurrent work [32], to a deterministic encryption (Algorithm 4.2). Thanks to the ACWC transform, the encryption scheme achieves IND-CPA security and its decryption failure rate is independent of the message. The base encryption (Algorithm 4.2) computes the trapdoor function on an ephemeral $s$. For better compactness, we replace $F(s, e)$ with $F'(s) = \lfloor (hs \bmod q)/k \rceil$ and thus use Ring-LWR as the hardness assumption. It is easy to see that the storage of a ciphertext is $n \log b + \lambda$ bits. The decryption stems from the decoding algorithm in Section 3. The formal description is provided in Algorithm 4.4.
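The λ-bit overhead of the simple ACWC transform comes from masking the message with a hash of the ephemeral secret $s$. A minimal sketch of that masking step and of the ciphertext size formula follows; SHAKE-256 is only a stand-in for $\mathrm{Hash}_m$, the byte encoding of $s$ is an assumption, and the sample values of n, b and λ are hypothetical rather than a recommended parameter set.

```python
# Sketch of the message-masking part of the ACWC transform and of the
# ciphertext size n*log2(b) + lambda. Hash choice and encodings are assumed.
import hashlib
import math


def hash_m(s_bits, num_bytes):
    """Stand-in for Hash_m: SHAKE-256 of the bits of s, one byte per bit."""
    return hashlib.shake_256(bytes(s_bits)).digest(num_bytes)


def mask(s_bits, data):
    """XOR data with Hash_m(s); applying it twice with the same s undoes it."""
    pad = hash_m(s_bits, len(data))
    return bytes(x ^ y for x, y in zip(pad, data))


def ciphertext_bits(n, b, lam):
    """Storage of (c1, c2): n coefficients of log2(b) bits plus lam bits."""
    return n * math.log2(b) + lam
```

The decryptor first recovers $s$ from $c_1$ with the decoder of Section 3 and then unmasks: `mask(s_bits, c2)` returns the message. With the hypothetical values n = 512, b = 128 and λ = 128, `ciphertext_bits` gives 3712 bits.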
Remark 1. The work [32] also introduces an ACWC transform avoiding the λ-bit overhead on the ciphertext. That transform is not so direct and the analysis is more complicated.

Algorithm 4.1 KeyGen
[...]
return ⊥
5: end if
6: $c_2 \leftarrow \mathrm{Hash}_m(s) \oplus m$, where $\mathrm{Hash}_m$ is some hash with range $M$
7: return $(c_1, c_2)$

By some standard techniques [34,26], an IND-CCA secure KEM immediately follows from our IND-CPA encryption. Algorithms 4.5 and 4.6 describe the encapsulation and decapsulation algorithms respectively, in which $S$ (resp. $K$) is the set of seeds (resp. shared keys). The detailed security arguments are given in Section 5.2.

Parameter selection
We keep the same ring $R$ as Falcon but choose a much smaller modulus $q$. A smaller modulus rules out a complete NTT; nevertheless, similar techniques [48,49] still allow a very fast polynomial multiplication. For security, a smaller $q$ implies a smaller standard deviation of the secret key distribution and hence less entropy of the secret key. However, such a loss does not reduce the concrete security level much.
A notable modification exists in key generation. In Falcon, $(f, g)$ is sampled to make $\|(g, f)\| \approx \|(G^*, F^*)\|$ where $(G^*, F^*)$ is the Gram-Schmidt orthogonalization of $(G, F)$ in $B_{f,g}$. However, in BAT, we choose $\sigma_f$ satisfying [...] This gives rise to a nearly optimal decryption failure rate according to Claim 1. In particular, when $\gamma = 1$, the $\sigma_f$ we use also makes $\|(g, f)\| \approx \|(G^*, F^*)\|$ as in the case of Falcon. Yet for this case, the $\sigma_f$ used by BAT is different from that used by Falcon, which is explained in Remark 5.
Let us recall that $c' = ck = hs - k\left(\frac{hs}{k} - \left\lfloor \frac{hs \bmod q}{k} \right\rceil\right) := hs + e \bmod q$. We model $e$ as drawn from $U(\mathbb{Z}_k^n)$ and thus $\sigma_e = \sqrt{\frac{k^2 - 1}{12}}$. As $\sigma_s = \frac{1}{2}$, it follows that $\gamma = \sigma_e/\sigma_s = \sqrt{\frac{k^2 - 1}{3}}$. Our KEM tolerates a small decryption failure rate for better performance, like many current lattice-based KEMs. The decryption failure rate is equal to the probability of incorrect decoding. According to Claim 1, it is heuristically bounded by $2n \cdot (1 - \mathrm{erf}(\tau))$ for some τ. We bound the size of the error used to 1.08 times its average to limit the impact of precomputed messages on the decryption failure rates: the exponent may be reduced by 20% in the worst case. As a contrast and verification, we also numerically computed the decryption failure rate for 100 keys in practice: the standard deviation of the logarithm of the rate over the secret key distribution is around 8, and the computed values are even smaller than their heuristic estimates. Therefore we present the numerical values (the average for randomly generated keys) for the decryption failure rate and the tail-bound parameter τ for the heuristic estimates.
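The values of $\sigma_e$ and γ above can be checked directly: the rounding error $e = k\lfloor x/k \rceil - x$ takes each of $k$ consecutive integer values exactly once as $x$ runs over $k$ consecutive integers, and the variance of such a uniform distribution is $(k^2-1)/12$. The sketch below verifies this with exact arithmetic; the helper names and the tie-breaking rule for even k are assumptions.

```python
# Rounding error of c = round(x / k) and the resulting sigma_e and gamma.
from fractions import Fraction
import math


def round_div(x, k):
    """Nearest integer to x / k, with ties rounded up (integer arithmetic)."""
    return (2 * x + k) // (2 * k)


def rounding_errors(k):
    """Errors e = k*round(x/k) - x over one full period of k consecutive x."""
    return [k * round_div(x, k) - x for x in range(k)]


def variance(values):
    """Exact variance of the uniform distribution over the given values."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / n


def gamma(k):
    """gamma = sigma_e / sigma_s with sigma_s = 1/2 (uniform binary s)."""
    return math.sqrt((k * k - 1) / 3)
```

For instance, k = 3 gives errors in {-1, 0, 1}, variance 2/3 and γ ≈ 1.63, so in that case the error is noticeably larger than the binary secret and the decoding basis is weighted accordingly.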
Table 3 shows the suggested parameters.
The parameter set for lightweight BAT. We further suggest one more parameter set particularly aiming at a lower security level (around 80 bits of security), which may be of interest for some lightweight use-cases. We call this lightweight variant LW-BAT. In LW-BAT, the degree $n$ is only 256; for better compactness the modulus $q$ does not support NTT anymore. We choose a relatively high decryption failure rate of $2^{-71.9}$, but it should be sufficient for lightweight applications, e.g. IoT: for the Round5 IoT parameters [6], the decryption failure rate is $2^{-41}$, even larger than ours. Table 4 summarizes the concrete parameter set.

Security
We now report on the security of BAT. First, we demonstrate the IND-CCA security of our KEM under some hardness assumptions. Then we estimate the concrete security according to the best known attacks.

Assumptions
The (decision) NTRU assumption. Let $R_q^\times$ be the set of invertible elements in $R/qR$. Let χ be some distribution over $R_q^\times$. The advantage of adversary $A$ in solving the decision NTRU problem $\mathrm{NTRU}_{R,q,\chi}$ is [...] and χ is the distribution of the secret key $f$ and $g$.
Remark 2. There is some research on the hardness of the decision NTRU assumption over $R = \mathbb{Z}[x]/(x^n + 1)$. Notably, as shown in [64], when χ is a discrete Gaussian of standard deviation $\sigma = \omega(n\sqrt{q})$, the ratio of $f$ and $g$ is statistically indistinguishable from uniform, which gives a firm theoretical grounding. The decision NTRU assumption with a narrow distribution χ is also closely related to Falcon [33] and sometimes referred to as the Decisional Small Polynomial Ratio (DSPR) assumption [46,14].
The (search) Ring-LWR assumption. Let χ be some distribution over $R$. The advantage of adversary $A$ in solving the search Ring-LWR problem $\mathrm{RLWR}_{R,q,k,\chi}$ is [...] In our case, $R = \mathbb{Z}[x]/(x^n + 1)$ and $\chi = U(R \bmod 2)$.
Remark 3. The theoretical foundation of the search Ring-LWR assumption is developed in [8,13,19]. There are also some practical schemes, e.g. Lizard [21] and Saber [9], using Ring-LWR over $\mathbb{Z}[x]/(x^n + 1)$ or its module variant as their hardness assumption. Indeed, the provable hardness of Ring-LWR with a binary secret $s$ remains open. Yet this would not weaken the concrete security as per state-of-the-art cryptanalysis results, especially when $q$ is relatively small (as in our case).

KEM security
The security notion we prove for BAT is IND-CCA security (indistinguishability against chosen-ciphertext attacks). To this end, we first note that the underlying encryption (Algorithms 4.3 and 4.4) is IND-CPA secure (indistinguishability against chosen-plaintext attacks) under the assumptions in Section 5.1.
[...] and message recovery based on the primal attack [4], which is a fundamental cryptanalysis method in lattice-based cryptography [2]. For BAT, the primal attack indeed leads to better security estimates than other attacks.

Cost of lattice reduction
We begin with a brief introduction to lattice reduction, which is heavily used by the primal attack. Currently, the most practical lattice reduction algorithms are BKZ [60] and BKZ 2.0 [20]. Let BKZ-β denote BKZ/BKZ 2.0 with blocksize β. The cost of running BKZ-β on a $d$-dimensional lattice is estimated by [...] where $t$ is the number of tours BKZ-β takes and $C_{\mathrm{SVP}\text{-}\beta}$ is the cost of solving SVP on a β-dimensional lattice.
We follow a typical setting: [...]
Remark 4. The BKZ cost model we use is not extremely conservative: some lattice-based schemes use the Core-SVP model in which $C_{\mathrm{BKZ}\text{-}\beta} = 2^{0.292\beta}$ (resp. $2^{0.257\beta}$) for the classical (resp. quantum) setting. Nevertheless, the SVP cost models used are quite conservative and provide a safe margin: they ignore the lower order term $o(\beta)$ in the exponent, which is substantial for concrete security. For a fair comparison, we shall also show the required blocksize β along with the estimated cost.
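For readers who want to translate between blocksize and bit security in the Core-SVP model mentioned in Remark 4, the conversion is a one-line formula. The sketch below uses the two exponents quoted there; the function names are hypothetical, and this is not the BKZ cost model used for the estimates in this paper.

```python
# Core-SVP conversion between BKZ blocksize and log2 cost (Remark 4 exponents).
import math

CLASSICAL_EXPONENT = 0.292
QUANTUM_EXPONENT = 0.257


def core_svp_log2_cost(beta, quantum=False):
    """log2 of the Core-SVP cost 2^(c * beta) for blocksize beta."""
    c = QUANTUM_EXPONENT if quantum else CLASSICAL_EXPONENT
    return c * beta


def min_blocksize(target_bits, quantum=False):
    """Smallest integer blocksize whose Core-SVP cost reaches target_bits."""
    c = QUANTUM_EXPONENT if quantum else CLASSICAL_EXPONENT
    return math.ceil(target_bits / c)
```

For example, a blocksize of 400 corresponds to about 116.8 bits classically in this model, and reaching 128 classical bits requires a blocksize of at least 439.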

Primal attack
The primal attack consists of constructing a uSVP (unique-SVP) instance and solving it by lattice reduction. We refer to [4,3,2] for details.
For key recovery, the uSVP instance is the matrix form of the public key h. The secret key pair (g, f) is a short vector of the uSVP instance. To optimize the primal attack, one can reduce the instance dimension by "forgetting" some equations and take homogeneity into account.
For message recovery, it suffices to recover s from the ciphertext. We construct the uSVP instance so that a vector formed from s and e is a short vector of it. Unlike the case of key recovery, the unknowns s and e have different distributions. Therefore, the primal attack can be improved by a re-scaling technique. Also, the strategy of "forgetting" some equations still works here. The primal attack with all the above optimizations is systematically discussed in [24]. We estimate the cost of the primal attack with the open-source script⁵ of [24]. Numbers are shown in Table 5.

Other attacks
We list some other known attacks; none of them outperforms the primal attack for the proposed parameter sets.
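Hybrid attack. This attack combines lattice reduction with meet-in-the-middle techniques. It was proposed as an improved attack against early NTRU [41] and further studied in [15, 51, …]. The hybrid attack is effective for the NTRU or LWE problem with particularly sparse secret or error vectors. The secret and error in BAT have enough entropy to resist this attack.
Dual attack. This attack was proposed to solve the decision LWE problem and thus does not apply to BAT, which is actually an NTRU-type scheme [2].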
Learning attacks. Attacks of this kind [52,30] were proposed to break insecure NTRU signatures [38], in which signature transcripts leak secret information about the NTRU trapdoor. While BAT uses an NTRU trapdoor as the secret key, the ciphertext is indistinguishable from uniform under the Ring-LWR assumption. Therefore, BAT resists learning attacks.
Overstretched NTRU attacks. These attacks [1,45] only work when the modulus q is significantly larger than the NTRU secret coefficients. This is not the case for BAT.
Algebraic attacks. There is a rich algebraic structure in BAT. While there are some results exploiting this algebraic structure [44] to speed up lattice reduction, the gains with respect to their general lattice equivalent are no more than polynomial factors.

Implementation Details
We implemented BAT with integer-only computations. We provide here some details on the techniques used. Our implementation is available at: https://github.com/pornin/BAT/

Key pair generation
Key pair generation starts with producing the short polynomials f and g, then solving the NTRU equation to obtain F and G. This specific step uses the algorithm described in [56]. Compared with the reference implementation of Falcon, the following differences are noteworthy:
- BAT polynomials have a lower norm than their Falcon equivalents. The polynomial resultants obtained at the deepest level of the recursive algorithm are then shorter, which improves performance.
- All floating-point operations (used for Babai's nearest-plane algorithm) have been replaced with fixed-point operations (over 64 bits, with 32 fractional bits), which removes all dependencies on the floating-point unit. Since fixed-point values have a limited range, the reduction may fail, leading to a restart of key pair generation. Failed cases can be efficiently filtered out early in the process by checking the current partial solution to the NTRU equation modulo a small prime integer; hence, the overhead implied by these restarts is low. At degree n = 512, about 30% of candidate (f, g) pairs lead to a restart.
- Some memory reorganization allowed for additional RAM savings, down to 12288 and 24576 bytes for n = 512 and 1024, respectively (compared to 14336 and 28672 bytes for Falcon).
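The fixed-point format mentioned in the list above (64-bit values with 32 fractional bits) can be illustrated with a small sketch. This is not the code of the BAT implementation: the helper names are hypothetical, and raising an exception on overflow is an assumption used here to model the "limited range leads to a restart" behaviour described in the text.

```python
FRAC_BITS = 32                       # 32 fractional bits, as stated in the text
INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


class FixedPointOverflow(Exception):
    """Value does not fit in a signed 64-bit fixed-point word (hypothetical)."""


def _check(v):
    # Model the limited range of a 64-bit word; a real implementation
    # would detect this differently (assumption).
    if not INT64_MIN <= v <= INT64_MAX:
        raise FixedPointOverflow(v)
    return v


def to_fix(x):
    """Convert a real number to fixed point (scaled by 2^32)."""
    return _check(round(x * (1 << FRAC_BITS)))


def fix_add(a, b):
    return _check(a + b)


def fix_mul(a, b):
    """Multiply two fixed-point values; the product is rescaled by 2^-32."""
    return _check((a * b) >> FRAC_BITS)


def fix_round(a):
    """Round a fixed-point value to the nearest integer (ties go up)."""
    return (a + (1 << (FRAC_BITS - 1))) >> FRAC_BITS
```

With this layout the integer part has only 31 bits plus a sign, so intermediate values above roughly 2^31 cannot be represented, which is the kind of failure that triggers a restart of key pair generation.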
Once the complete NTRU basis (f, g, F, G) has been obtained, ComputeVec is used to obtain w. The polynomials $\gamma^2 F\bar{f} + G\bar{g}$ and $\gamma^2 f\bar{f} + g\bar{g}$ are first computed modulo a small prime where the NTT can be applied for efficient computations (a 31-bit prime is used; since the basis coefficients are all small, it is easily seen that coefficients do not exceed $2^{19}$ in absolute value). The polynomial v is then obtained by performing the division in the FFT domain, using the same fixed-point code as the one used for solving the NTRU equation. The division itself is performed with a constant-time bit-by-bit routine.
Since fixed-point values are approximations of the real coefficients of v, the rounding step may occasionally be wrong by 1. Extensive tests show that this is a relatively uncommon occurrence (it happens in about 0.5% of keys at n = 512) and that it always happens when v is close to z + 1/2 for some integer z; over 30000 random key pairs, the largest observed deviation of v − 1/2 from the closest integer, for coefficients where our implementation rounds to w incorrectly, is lower than $2 \times 10^{-4}$. This means that in all observed cases, $|w_i - v_i| < 1/2 + 2 \times 10^{-4}$. Since the decoding process works as long as $|w_i - v_i| < 1$, the impact on the decryption failure rate is negligible.

Field operations
Efficient and secure (constant-time) operations in the small base fields (modulo q and q′) are implemented with Montgomery multiplication. Namely, a value x modulo q is represented by an integer y in the 1 to q range (inclusive), such that $y = 2^{32} x \bmod q$. Montgomery reduction can be implemented with two 16-bit multiplications, two shifts and one addition; reductions can moreover be shared across several operations, because analysis shows that reduction works properly for values up to close to $2^{32}$. For details on this technique, see [55].
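The representation described above can be illustrated with a textbook Montgomery reduction. The sketch below uses q = 769 and the constant 2^32 from the text, but it is the generic REDC algorithm with wide integers and branches, not the optimized two-multiplication, constant-time routine of [55]; all function names are hypothetical.

```python
Q = 769                               # one of the BAT moduli
R_BITS = 32
R = 1 << R_BITS                       # Montgomery constant 2^32
Q_NEG_INV = (-pow(Q, -1, R)) % R      # -1/q mod 2^32 (needs Python 3.8+)
R2 = (R * R) % Q                      # 2^64 mod q, used to enter the domain


def monty_reduce(x):
    """Return x / 2^32 mod Q in the range 1..Q, for 0 <= x < Q * 2^32."""
    m = (x * Q_NEG_INV) & (R - 1)
    t = (x + m * Q) >> R_BITS         # exact division; 0 <= t < 2Q
    if t > Q:                         # branches are for clarity only:
        t -= Q                        # this sketch is NOT constant-time
    if t == 0:
        t = Q                         # zero is represented by Q
    return t


def to_monty(x):
    """Map x to its representation y = 2^32 * x mod Q, with y in 1..Q."""
    return monty_reduce((x % Q) * R2)


def from_monty(y):
    """Map a representation back to the ordinary residue in 0..Q-1."""
    return monty_reduce(y) % Q


def monty_mul(a, b):
    """Multiply two values given in Montgomery representation."""
    return monty_reduce(a * b)
```

Keeping representatives in 1..Q rather than 0..Q-1 follows the convention stated in the text; the product of two representatives is at most Q^2, far below the Q * 2^32 bound that the reduction accepts.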
On recent x86 platforms, SIMD opcodes can be used to further optimize operations. AVX2 registers can store 16 values modulo q (or q′) and perform 16 Montgomery reductions in parallel. The _mm256_mullo_epi16() and _mm256_mulhi_epu16() intrinsics compute, respectively, the low and high halves of the product of two 16-bit values, with a very low reciprocal throughput (0.5 cycles). Computing 16 modular multiplications in parallel requires in total only 6 invocations of such intrinsics.

NTT multiplication
Since q − 1 and q′ − 1 are multiples of 256 for BAT, the NTT can be applied to speed up computations over polynomials modulo $X^n + 1$, when working with integers modulo either q or q′. For n a power of two up to $2^7 = 128$, the NTT representation of a polynomial f is the set of $f(\zeta^{2i+1})$ for $0 \le i \le n - 1$, where ζ is a primitive 2n-th root of 1 modulo q (or q′). In NTT representation, addition and multiplication of polynomials can be done coefficient-wise, hence with cost O(n) operations modulo q. Moreover, conversion to and from NTT representation can be done in O(n log n) steps.
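The NTT representation defined above can be checked on a toy example. The following sketch works modulo q = 257 and evaluates f at the odd powers of ζ directly, in O(n^2) operations, rather than with the fast O(n log n) transform; the choice of 3 as a generator modulo 257 and all function names are assumptions made for illustration.

```python
Q = 257   # q - 1 = 256, so 2n-th roots of unity exist for n up to 128


def _zeta(n):
    # 3 generates the multiplicative group modulo 257, so this value
    # has order exactly 2n.
    return pow(3, (Q - 1) // (2 * n), Q)


def ntt(f):
    """Naive NTT: the values f(zeta^(2i+1)) for 0 <= i < n."""
    n = len(f)
    z = _zeta(n)
    return [sum(c * pow(z, (2 * i + 1) * k, Q) for k, c in enumerate(f)) % Q
            for i in range(n)]


def intt(values):
    """Inverse of ntt(): interpolate the coefficients from the values."""
    n = len(values)
    z_inv = pow(_zeta(n), -1, Q)
    n_inv = pow(n, -1, Q)
    return [n_inv * sum(v * pow(z_inv, (2 * i + 1) * j, Q)
                        for i, v in enumerate(values)) % Q
            for j in range(n)]


def mul_ntt(f, g):
    """Multiply modulo X^n + 1 and Q, coefficient-wise in the NTT domain."""
    return intt([a * b % Q for a, b in zip(ntt(f), ntt(g))])


def mul_schoolbook(f, g):
    """Reference negacyclic product: X^n wraps around to -1."""
    n = len(f)
    out = [0] * n
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            if i + j < n:
                out[i + j] = (out[i + j] + a * b) % Q
            else:
                out[i + j - n] = (out[i + j - n] - a * b) % Q
    return out
```

Evaluating at odd powers of ζ (the roots of X^n + 1) is what makes the coefficient-wise product correspond to multiplication modulo X^n + 1 rather than modulo X^n − 1.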
For larger degrees, we cannot use full NTT representation, but we can still optimize operations by splitting polynomials as follows. Consider n = 512; the polynomial f modulo $X^{512} + 1$ can be split into four sub-polynomials as follows: $f = f_0(X^4) + X f_1(X^4) + X^2 f_2(X^4) + X^3 f_3(X^4)$, the polynomials $f_i$ being of degree up to 127, and operating modulo $X^{128} + 1$. Then, operations on such polynomials can be expressed as a relatively small number of operations on the sub-polynomials, which themselves can be implemented in the NTT domain, since the sub-polynomials are of degree less than 128.
In our implementation, the NTT representations of the sub-polynomials are interleaved, so as to maximize parallelization efficiency.

Polynomial splitting and Karatsuba multiplication
For LW-BAT, we use q = 128, which prevents us from using the NTT straightforwardly⁶. Instead, for polynomial multiplications, we use Karatsuba with an even/odd split, by writing a polynomial f as $f = f_0(X^2) + X f_1(X^2)$, with $f_0$ and $f_1$ being half-size polynomials (they operate modulo $X^{n/2} + 1$). We can then express the product of f and g as $fg = (f_0 g_0 + X^2 f_1 g_1) + X\,((f_0 + f_1)(g_0 + g_1) - f_0 g_0 - f_1 g_1)$, where every half-size polynomial is evaluated at $X^2$; i.e. we turn the multiplication of two polynomials modulo $X^n + 1$ into three multiplications of polynomials modulo $X^{n/2} + 1$. We use this reduction recursively, until polynomials have degree less than 4.
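The even/odd Karatsuba recursion described above can be sketched as follows, for coefficients modulo 128 and polynomials modulo X^n + 1 with n a power of two. The function names are hypothetical and the code favours readability over the optimizations a real implementation would use.

```python
Q = 128


def mul_schoolbook(f, g):
    """Reference product modulo X^n + 1 and Q (X^n wraps to -1)."""
    n = len(f)
    out = [0] * n
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            if i + j < n:
                out[i + j] = (out[i + j] + a * b) % Q
            else:
                out[i + j - n] = (out[i + j - n] - a * b) % Q
    return out


def mul_karatsuba(f, g):
    """Product modulo X^n + 1 using three half-size products per level."""
    n = len(f)
    if n <= 4:                            # degree less than 4: stop recursing
        return mul_schoolbook(f, g)
    f0, f1 = f[0::2], f[1::2]             # f = f0(X^2) + X * f1(X^2)
    g0, g1 = g[0::2], g[1::2]
    p00 = mul_karatsuba(f0, g0)
    p11 = mul_karatsuba(f1, g1)
    fs = [(a + b) % Q for a, b in zip(f0, f1)]
    gs = [(a + b) % Q for a, b in zip(g0, g1)]
    pss = mul_karatsuba(fs, gs)
    # Multiply p11 by Y = X^2 modulo Y^(n/2) + 1: shift up by one position,
    # the top coefficient wraps around with a sign change.
    y_p11 = [(-p11[-1]) % Q] + p11[:-1]
    even = [(a + b) % Q for a, b in zip(p00, y_p11)]
    odd = [(s - a - b) % Q for s, a, b in zip(pss, p00, p11)]
    out = [0] * n
    out[0::2] = even                      # coefficients of even powers of X
    out[1::2] = odd                       # coefficients of odd powers of X
    return out
```

The three recursive calls compute f0*g0, f1*g1 and (f0+f1)*(g0+g1); the cross term is then recovered by subtraction, which is what saves the fourth multiplication.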
The same split is used to compute polynomial divisions modulo $X^n + 1$: this is used to compute the public key h = g/f mod q, and also to rebuild G from f, g and F when the short format for private key storage was used. The even/odd split allows us to write $1/f = (f_0(X^2) - X f_1(X^2)) / (f_0^2 - X^2 f_1^2)$, where the denominator is a polynomial in $X^2$ only, which reduces inversion modulo $X^n + 1$ to a multiplication (modulo $X^n + 1$) and an inversion modulo $X^{n/2} + 1$. Applied recursively, this method leads us to the simple problem of inverting an integer modulo q = 128, which can be done in a few inexpensive multiplications.
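The base case of this recursion, inverting an odd integer modulo 128, can be sketched with Newton (Hensel) iteration. This is one standard way to invert modulo a power of two with a few multiplications; whether the BAT implementation uses exactly this iteration is an assumption, and `inv_mod_128` is a hypothetical name.

```python
def inv_mod_128(x):
    """Inverse of an odd integer modulo 128 by Newton iteration."""
    if x % 2 == 0:
        raise ValueError("even values are not invertible modulo 128")
    y = x % 128                       # x * x = 1 mod 8, so y inverts x mod 2^3
    y = (y * (2 - x * y)) % 128       # now correct modulo 2^6
    y = (y * (2 - x * y)) % 128       # now correct modulo 2^12, hence mod 2^7
    return y
```

Each step doubles the number of correct low-order bits, so two steps are enough to go from 3 bits to the 7 bits needed modulo 128.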

Decoding
Decoding involves computing polynomials with integer coefficients modulo q, q′ and Q. The final step requires solving for e′ and s′; we only need s′ in practice, since we can use encapsulation to verify the result. Moreover, there are only two possible values for each coefficient of s′ (1/2 and −1/2) and we merely need to disambiguate between these two values. To keep to integer values, we do not recover s′ but qq′Qs′; moreover, we perform computations modulo an additional small prime distinct from q and q′. In practice, when q = 257, we perform the last step by working modulo 769; when q = 128 or 769, we use computations modulo 257.

Encoding and storage
We defined compact encoding formats for public keys, private keys, and ciphertexts. Each format starts with a single header byte which identifies the object type and parameter set. Public keys are polynomials with coefficients modulo q. When q = 257, we encode coefficients by groups of eight, each group using 65 bits: each coefficient is split into a low half (4 bits) and a high half (value 0 to 16, inclusive); eight "high halves" are encoded over 33 bits in base 17. For q = 769, a similar mechanism is used, with 5 coefficients being encoded over 48 bits. All encoding and decoding operations can be implemented with only simple 32-bit multiplications, and can be done efficiently in a constant-time manner (this last property does not nominally matter for public keys, which are public).
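The 65-bit grouping for q = 257 can be sketched as follows. The split into 4-bit low halves and base-17 high halves follows the text, but the exact bit layout (low halves in the lowest 32 bits, base-17 value above them, least significant coefficient first) is an assumption, as are the function names; the actual format is defined by the BAT implementation.

```python
def encode_group(coeffs):
    """Pack eight coefficients in 0..256 into one 65-bit integer."""
    if len(coeffs) != 8 or any(not 0 <= c <= 256 for c in coeffs):
        raise ValueError("expected eight coefficients in 0..256")
    low = 0
    high = 0
    for i, c in enumerate(coeffs):
        low |= (c & 0xF) << (4 * i)   # low half: 4 bits per coefficient
        high += (c >> 4) * 17 ** i    # high half: one base-17 digit (0..16)
    return low | (high << 32)         # 17^8 < 2^33, so 32 + 33 = 65 bits


def decode_group(word):
    """Inverse of encode_group(); rejects encodings of values above 256."""
    low = word & 0xFFFFFFFF
    high = word >> 32
    if high >= 17 ** 8:
        raise ValueError("invalid high part")
    coeffs = []
    for i in range(8):
        high, digit = divmod(high, 17)
        c = (digit << 4) | ((low >> (4 * i)) & 0xF)
        if c > 256:
            raise ValueError("coefficient out of range")
        coeffs.append(c)
    return coeffs
```

Plain 9-bit coefficients would need 72 bits per group of eight; the base-17 packing saves 7 bits per group because the high half takes only 17 of the 32 values that 5 bits could hold.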
Ciphertexts are mainly polynomials with small, signed integer coefficients. When q = 257, coefficients of $c_1$ are in the −64 to +64 range; eight coefficients are encoded over 57 bits, in a way similar to public key encoding. For q = 769, coefficients of $c_1$ are in the −96 to +96 range, and five coefficients are encoded over 38 bits in base 193. The value $c_2$, which is a fixed-size binary value, is simply appended to the encoding of $c_1$.
Private keys have a "short" and a "long" format. The long format includes the 32-byte seed that was used to generate f, g, and the 32-byte value r (which is used when decapsulation fails). This seed is followed by a copy of r, then the polynomials f, g, F, G and w themselves, and the public key h. Coefficients of f and g are encoded over 4 bits each, in two's complement notation; for F and G, 6 bits are used per coefficient, and 17 bits for w. The public key h uses the same encoding as in the public key format. The short format only stores the 32-byte seed and the polynomial F: the value r and the polynomials f and g are regenerated with the same pseudorandom (deterministic) process that was used during key pair generation; G is recomputed using the NTRU equation (modulo q); and w and h are recomputed. While the short format is substantially shorter, decoding a private key stored in the short format has a non-negligible overhead, but is still much cheaper than key pair generation, since the most expensive part (solving the NTRU equation from f and g alone) is avoided.
The storage requirements are listed in Table 6.

Speed benchmarks
We provide two implementations: (1) a plain portable C version and (2) an AVX2 version. We measured the speed on an Intel i5-8259U CPU clocked at 2.3 GHz; TurboBoost is disabled.
The compiler is Clang-10.0, with optimization flag "-O3". The AVX2 implementation uses intrinsic functions, and an additional optimization flag "-march=native". For key pair generation, the reported value is an average over several hundred key pairs⁷. For encapsulation, the reported value includes decoding of the public key, the core encapsulation process with the FO transform, and ciphertext encoding to bytes; the cost of generating a random seed (of 10, 16 or 32 bytes, for the three BAT variants) from the operating system RNG is not included. For decapsulation, the value includes decoding of the ciphertext from bytes, and the core decapsulation process with the FO transform. All hashing operations in the FO transform and for the PRNG used in key pair generation rely on the BLAKE2 function as specified in RFC 7693 (BLAKE2s for LW-BAT and BAT-512, BLAKE2b for BAT-1024) [59].
The timing data for the two implementations are given in Tables 7 and 8, respectively.

Table 3: Suggested parameters for BAT.

Table 5: Concrete security estimate for BAT.
a) Around 80 bits of classical security. Due to the conservative SVP hardness and the heavy memory cost of sieving, LW-BAT arguably reaches this level.

Table 6: Storage requirements of BAT (in bytes, including the header byte).

Table 7: Performance of the plain C implementation of BAT (in clock cycles).

Table 8: Performance of the AVX2 implementation of BAT (in clock cycles).