Efficient Algorithms for Large Prime Characteristic Fields and Their Application to Bilinear Pairings

. We propose a novel approach that generalizes interleaved modular multiplication algorithms for the computation of sums of products over large prime fields. This operation has widespread use and is at the core of many cryptographic applications. The method reformulates the widely used lazy reduction technique, crucially avoiding the need for storage and computation of “double-precision” operations. Moreover, it can be easily adapted to the different methods that exist to compute modular multiplication, producing algorithms that are significantly more efficient and memory-friendly. We showcase the performance of the proposed approach in the computation of multiplication over an extension field F p k , and demonstrate its impact with record-breaking implementations of bilinear pairings. Specifically, we accomplish a full optimal ate pairing computation over the popular BLS12-381 curve, designed for the 128-bit security level, in under half a millisecond on a 3.2GHz Intel Coffee Lake processor, which is about 1 . 40 × faster than the state-of-the-art. Similarly, we perform the same computation over the BLS24-509 curve, targeting the 192-bit security level, in ∼ 2.6 milliseconds, achieving a speedup of more than 1 . 30 × . We also report a significant impact on other applications, including protocols based on supersingular isogenies.


Introduction
Take two sets of integers (a 0 , a 1 , . . . , a t−1 ) and (b 0 , b 1 , . . . , b t−1 ) all defined over a certain finite field F p with large prime characteristic p. The sum of their products, namely, the computation c = t−1 i=0 ±a i · b i mod p is a fundamental operation that is at the core of various types of arithmetic over F p , from matrix multiplication and multiplication over polynomial rings to elliptic curve arithmetic. Of special interest is that this operation has wide use in the form of multiplication over extension fields F p k , which is at the heart of several cryptographic schemes such as elliptic curves defined over extension fields [37], bilinear pairings on ordinary curves [71] and supersingular isogeny-based schemes [55].
In this work, we propose a new approach that computes a sum of products over F p by interleaving intermediate products with the modular reduction step, in similar fashion to classical interleaved modular multiplication algorithms ( §2.1). Crucially, it departs from algorithms using the well-known lazy reduction technique [74,78], eliminating the excessive growth of intermediate values and the need of computing "double-precision" arithmetic. The method can be easily adapted to existing algorithmic variants that fit different application profiles, for software and hardware platforms. We show that some of these variants are especially efficient for software implementation, exhibiting strong synergy Many works have exploited the technique (especially in the context of multiplication over F p 2 ) without changing the basic approach [3,7,19,21,22,38,76]. In fact, despite the availability of efficient interleaved modular multiplication algorithms [42], the combination of lazy reduction with these faster algorithms has been long seen as an impossibility. As consequence, lazy reduction has been strictly limited to use non-interleaved modular multiplications.
The main drawback of this traditional approach using lazy reduction is that it forces the storage and computation with intermediate results of double precision. In addition to the increased pressure on memory usage, the optimization of double-precision arithmetic may require specialized routines that significantly increase the implementation complexity [2,3]. In contrast, the proposed method limits the growth of intermediate results and gets rid of double-precision arithmetic. The simplest variants use schoolbook internally, which eliminates further the additional storage demanded by Karatsuba. For small and mediumsized primes, the significant reduction in memory friction pays off in terms of speed, even in the cases in which the use of multiplications is higher, as we show in §4.
Open-source software. We have implemented and integrated our algorithms to RELIC [2], a cryptographic library containing state-of-the-art implementations of pairings. Our implementations cover two pairing-friendly curves: BLS12-381 and BLS24-509, which are intended to match the 128-and 192-bit security levels, respectively. Both are instantiated with an optimal ate pairing. The software is available as open-source at: https://github.com/primefieldsfp/faster_Fpx.
The software stack also includes the code that we wrote to evaluate the speed performance and memory usage of the proposed method using the SIKE primes, as described in Appendix B.
Outline. We start by giving some preliminary background on algorithmic aspects of Montgomery multiplication and extension field arithmetic in §2. Then, we describe the details of our method in §3, together with an exhaustive classification of its different algorithmic variants. This section also includes a preliminary cost analysis. In §4, we present our case study targeting pairings, describe various implementation choices for different extension fields, and report efficient implementations of the BLS12-381 and BLS24-509 pairing-friendly curves using the RELIC library. Finally, we discuss potential future developments and the impact of this work, including the case of supersingular isogeny-based protocols, in §5.

Montgomery multiplication
A well-known and widely-used method to implement modular multiplication is due to Montgomery [69]. This method introduces a significant speedup in the computation of modular reduction by replacing expensive long divisions by simple divisions with powers of two.
To carry out a Montgomery multiplication, field elements are represented in the socalled Montgomery domain. Let R = 2 N and p = −p −1 mod R, where N = n · w, n = l/w , l = log 2 p and w is the computer wordsize. For two field elements a and b, their Montgomery representation is given byã = aR mod p andb = bR mod p, respectively. If it holds thatãb < pR, the Montgomery residue c =ãbR −1 mod p = abR mod p is then computed as c = (ãb + (ãbp mod 2 N ) · p)/2 N .
The final result a · b can then be easily obtained by dividing by the value R. If one assumes that conversions to/from the Montgomery domain are amortized by executing a long series of modular multiplications, the cost of a Montgomery reduction (i.e., without the integer multiplication part) is given by (n 2 + n) w-bit multiplications.
Interleaved and non-interleaved modular multiplication There are two general approaches for implementing modular multiplication, which are determined by how the integer multiplication and reduction parts are "glued" together. In an interleaved or nonseparated modular multiplication, both pieces are computed in an intertwined fashion, while the non-interleaved or separated version computes the full integer multiplication first and then proceeds to carry out the reduction part separately. In applications dealing with sums of products over large prime fields (e.g., for multiplying over extension fields like in pairings or supersingular isogeny-based protocols), the use of non-interleaved modular multiplication has become the de facto approach, as it enables a straightforward application of lazy reduction and other advanced implementation techniques [3,4,19,21,22,38,62,74,76,78].

Radix-r Montgomery multiplication.
Straight implementations of Eq. (1) demand the use of a high number of registers since the full inputs are processed in a single pass using the full modulus. Let r be a certain radix, typically a power of two, in which operands and the modulus are represented (e.g., an operand a is represented as (a t−1 , . . . , a 1 , a 0 ) r ). A more implementation-friendly approach proposed by Dussé and Kaliski Jr. [42] processes the computation one digit at a time reducing with r at each iteration, in what is called the radix-r Montgomery reduction. An interleaved computation of a radix-r Montgomery multiplication of two Montgomery elementsã andb then proceeds by fixing p = −p −1 mod r (assuming that the modulus is a prime p of bitlength l), initializing c to 0, and computing t = l/ log 2 r iterations doing for i = 0, . . . , t − 1.
In this work, we adopt a generalization of the original radix-r Montgomery multiplication by setting r = 2 Bw , where B ∈ Z and 0 < B ≤ n and, as before, n = l/w , l = log 2 p and w is the computer wordsize 2 . In this case, Eq. (2) runs for n/B iterations. This generalization lifts the restriction that the bitlength of the radix r should match the computer wordsize w of a given platform, as originally assumed in [42]. Note that the original radix-r Montgomery multiplication corresponds to the case B = 1.
As we will show in §3, the flexibility introduced by B in the definition of the radix allows for a comprehensive generalization that captures many implementation variants of Montgomery multiplication exploiting either the "operand-scanning" form (a.k.a. schoolbook method), the "product-scanning" form (a.k.a. Comba method [33]), the Karatsuba technique [57], and their different combinations. To the best of our knowledge, several of the arising variants are novel.

Sums of products over large prime fields
Let (a 0 , a 1 , . . . , a t−1 ) and (b 0 , b 1 , . . . , b t−1 ) be two sets of elements all belonging to a certain field F p of large prime characteristic p. We define a "sum of products" as a computation with the form This operation can be found at the core of many cryptologic computations, the most notable of which is perhaps multiplication over extension fields with large prime characteristic. Hence, we use this operation to illustrate the impact of the proposed algorithms (see §3).
Let's recall the simplest scenario for a multiplication of two elements a, b ∈ F p k modulo an irreducible polynomial with the form f = x k − ω, where ω is a primitive element in F * p and k|(p − 1). The polynomial multiplication is then given by It is a widespread practice to optimize the implementation for computing each c j using an "accumulation and reduction" strategy, most commonly known as lazy reduction. This technique effectively reduces the number of modular reductions to only one (or k, for the full polynomial multiplication). Note that, as in most practical scenarios, we assume that ω has small coefficients that make a multiplication by it relatively cheap.
As mentioned before, this use of lazy reduction has some drawbacks, the most critical of which being the need of extra storage and computing with intermediate results of double precision. As we show in §4 (see also Appendix B), this issue makes implementations slower and less memory-friendly for small and medium-sized primes.
The rise of fast multiplication over F p 2 . The most basic "sum of products" operation underlying several cryptographic schemes is multiplication over a quadratic extension field F p 2 . Modern examples of these schemes include bilinear pairings on ordinary elliptic curves over prime fields and supersingular isogeny-based protocols.
For illustrative purposes, let's use the common construction fixing F p 2 = F p (i) for i 2 − β = 0, where β is a small integer in absolute value. Two main approaches are known to realize the multiplication in this case.
Take two elements a = (a 0 + a 1 i) and b = (b 0 + b 1 i) ∈ F p 2 . The first method to compute the multiplication a · b in F p 2 is the straightforward schoolbook method which computes it as The second approach is Karatsuba multiplication, which computes the same operation as If we assume that β = −1 as in most efficient instantiations, the operation count for multiplication using schoolbook is of four modular multiplications and two modular additions/subtractions, while the one for Karatsuba is of three modular multiplications and five modular additions/subtractions. Efficient implementation of the arithmetic over F p 2 attracted lots of interest from the cryptographic community around the mid-2000's, contributing to a remarkable effort aimed at reducing the high cost of computing cryptographic pairings. In 2007, Scott [74] was the first to apply lazy reduction to Karatsuba multiplication in the context of pairings, changing the cost of one F p 2 multiplication to three integer multiplications, two modular reductions, two additions and three double-precision additions/subtractions. This approach was later perfectioned by Beuchat et al. [19] and Aranha et al. [3] with the use of some aggressive optimization techniques such as the avoidance of modular corrections and carry-handling elimination, all in the context of the computation of optimal ate pairings [77] over a 254-bit Barreto-Naehrig (BN) curve [15].
The algorithm for multiplication in F p 2 combining all these optimizations for the optimal case with β = −1 is shown in Algorithm 1. Integer operations without modular correction or reduction are represented as ×, + or −. The only operation that requires a modular correction is the subtraction in line 8 that is represented as . Double-precision operands are represented in uppercase, while single-precision operands are in lowercase.

Algorithm 1 Multiplication in F p 2 using Karatsuba and lazy reduction
The case of higher-degree field extensions. For pairings, high-degree extension field arithmetic represents the main building block and, therefore, its efficient implementation becomes crucial. To this end, it has been recommended to implement it as a tower of extensions built with suitable irreducible polynomials [60], following a similar development for optimal extension fields [8]. For example, Pereira et al. [50] recommended to set p e ≡ 1 (mod 6) and represent F p 6e as a tower extension of F p e in one of three ways: where ξ ∈ F p e is a non-square and a non-cube.
The use of a tower field allows to write modularized implementations, in which each layer can be easily optimized using algebraic transformations like Karatsuba to reduce the number of modular multiplications.
In 2011, Aranha et al. [3] pushed the performance limits further by extending the use of lazy reduction to the full tower scheme, in order to minimize the use of modular reductions. Concretely, they showed that this optimization, when applied above the F p 2 arithmetic up to the F p 12 layer, results in an 11%-17% speedup on a variety of x64 processors. Nevertheless, this extended lazy reduction technique comes at a price. It requires additional specialized routines to perform the double-precision arithmetic, which increase the complexity and memory footprint of the implementation.
In §4, we discuss how one can improve performance and memory use for prime sizes of practical relevance with a new approach that we present next.

The Proposed Method
We describe the proposed method in the context of Montgomery multiplication, which is arguably one of the most relevant scenarios. See §5 for a discussion on other potential applications.
Let (a 0 , a 1 , . . . , a t−1 ) and where R = 2 nw , n = l/w , l = log 2 p , and w is the computer wordsize. From now on, we make the assumption that inputs a i and b i are already in the Montgomery domain.
From a general perspective, the new approach essentially consists in performing a merged computation of the operation c = t−1 i=0 ±a i · b i mod p using interleaved radix-r Montgomery multiplication 3 , that is, initializing c = 0 and executing l/ log 2 r = n/B iterations doing for j = 0, . . . , n/B − 1, where p = −p −1 mod r, and each integer a i is represented in radix-r representation as (a i, n/B −1 , . . . , a i,1 , a i,0 ) r . As stated in §2.1, the radix r is adopted in the generalized form r = 2 Bw , where B ∈ Z and 0 < B ≤ n. In the following, we call each digit in this radix representation a "B-digit". We remark that Eq. (3) is presented in a general form for simplicity purposes. Next, we provide a more detailed description that covers the wide diversity of variants that can be derived from the approach.
At a high-level, we can classify the different variants by the method that is used to implement the top layer in the computation of the multiplications in Eq. (3). Thus, we can distinguish operand-scanning (or schoolbook), product-scanning (or Comba), and Karatsuba variants. In the remainder, we mostly focus on the first case which brings very fast computations to the software platform class that we target in this work. We comment that product-scanning and Karatsuba variants, such as those described in Appendix A, might be useful in other scenarios, e.g., for hardware implementations (see discussion in §5).
Remark 1. The result of a Montgomery reduction is upper bounded by 2p when its input is in the range [0, pR). Hence, a conditional subtraction is needed to bring the result to [0, p). However, this operation can be avoided if we perform arithmetic over a redundant representation (e.g., over Z 2p ). For example, if operands are kept in the range [0, 2p) such that the result of a multiplication is guaranteed to be c = a · b < 4p 2 ≤ pR (i.e., it should hold that R ≥ 4p), then the result of the Montgomery reduction is still going to be bounded by 2p but we will no longer require the modular correction. A simple correction is going to be required at the very end of the computations to bring the final result to the canonical range [0, p). In the following, all the algorithms assume the use of this redundant representation to avoid the final conditional subtraction.

Operand-scanning method
For this method, the computation flow at the top layer follows the operand-scanning or schoolbook form. That is, for each multiplication, a B-digit from the radix-r representation of a given multiplier is first multiplied with the full value of the multiplicand before proceeding to the next computation. For the remainder, we refer to this operation as B-digit × row multiplication.
We distinguish two main approaches, depending on whether the inner multiplications a i,j b i from Eq. (3) are interleaved with the multiplications with the prime p or not. We adopt the naming convention from [61] and call the approaches finely integrated if we do the former case (i.e., with interleaved inner multiplications), and coarsely integrated, otherwise.
Coarsely integrated variants. The merged "sum of products" algorithm using a coarsely integrated strategy is displayed in Algorithm 2. The construction of the algorithm easily follows from Eq. (3) when n mod B = 0. We still need to manage the cases in which n mod B = 0 (i.e., the digit-size of the most significant B-digit is strictly smaller than that of a B-digit). This is done in lines 6-9, where the computations are adjusted to the right digit size.
As can be seen, the B-digit × row multiplications corresponding to a i,j b i (line 3) are interleaved with those for the multiplications with p and p (lines 4 and 5) at each iteration of the for-loop.
There are multiple ways in which the inner multiply-and-accumulate operations a i,j ·b i and u + q · p can be realized. We classify these variants according to the chosen value B as follows: • Case with B = 1: one is setting r = 2 w and all the inner computations essentially become straight digit × row multiplications. This is the analogous version of "Improvement 2" from [42], called coarsely integrated operand scanning (CIOS) in [61].
• Case with B > 1, B = n: the inner computations work on "blocks" of digits and, hence, each B-digit × B-digit multiplication can be implemented in either schoolbook, Comba or Karatsuba style (or any combination of these in a multi-level fashion for sufficiently large primes).
• Case with B = n: this is essentially the original lazy reduction technique.
Algorithm 2 Merged sums of products using radix-r interleaved Montgomery multiplication in coarsely integrated form. Input: integers (a 0 , a 1 , . . . , a t−1 ) and where R = 2 nw , n = l/w , l = log 2 p , and w is the computer wordsize; the radix r = 2 Bw s.t. B ∈ Z and 0 < B ≤ n, and the Montgomery constant Finely integrated variants. The merged "sum of products" algorithm using a finely integrated strategy is displayed in Algorithm 3. In this case, note that multiplications are performed B-digit by B-digit, interleaving those corresponding to a i,k b i,j (lines 3 and 7) with those for the multiplications with p k (lines 5 and 8). Note that we manage the case with n mod B = 0 as described before.
Similar to the coarsely integrated form, there are multiple ways in which the inner multiply-and-accumulate operations can be realized. Again, we classify these variants according to the value B as follows: • Case with B = 1: one is setting r = 2 w and all the inner computations become simple digit × digit multiplications, where those corresponding to input operands are interleaved with those with the prime. This is the analogous version of the finely integrated operand scanning (FIOS) method from [61].
• Case with B > 1, B = n: the inner computations work on "blocks" of digits and, hence, each B-digit × B-digit multiplication can be implemented in either schoolbook, Comba or Karatsuba style (or any combination of these in a multi-level fashion for sufficiently large primes).
• Case with B = n: this is essentially the original lazy reduction technique.
Algorithm 3 Merged sums of products using radix-r interleaved Montgomery multiplication in finely integrated form.
and w is the computer wordsize; the radix r = 2 Bw s.t. B ∈ Z and 0 < B ≤ n, and the Montgomery constant Selecting a variant. Picking a specific variant depends on both the modulus size (see the next subsection) and the targeted platform. Generally speaking, the coarsely integrated variant (Algorithm 2) should be the preferred option in most software platforms in which schoolbook works well and the availability of general purpose registers (GPRs) is sufficient to support a full B-digit × row multiplication with minimal interaction with the memory.
On the other hand, the inner for-loop of the finely integrated variant (Algorithm 3) consists of a bunch of multiplications that are independent from each other and, hence, can be executed in parallel on, e.g., an FPGA. See §5 for an extended discussion on other uses for the algorithms.
Regarding the selection of the value B, setting B > 1 might make it easier to alleviate memory use for relatively large primes, especially in the case of Algorithm 3. For small and medium size primes 4 , it appears that setting B = 1 hits the right balance between the size of intermediate results and the number of GPRs available on x64 platforms.
Finally, we mentioned before that for the internal multiplications it is possible to use either schoolbook, Comba, Karatsuba, or any combination of these in a multi-layer implementation. To be efficient, Karatsuba would require a B-digit consisting of a sufficiently large number of limbs. And between schoolbook and Comba, the former is typically preferable when a given platform supports efficient carry-saving instructions such as mulx or versatile multiply-and-add (MULADD) instructions (see §5).
Primes of special form. The algorithms described in this section have been presented generically assuming primes with no special form, as typically found in, e.g., pairing computations ( §4). However, they can be easily adapted to settings using primes with some special shape. In particular, a relevant scenario is when p + 1 ≡ 0 (mod 2 zw ) for some integer z ≥ 1. As an example of this case, see the adaptation of Algorithm 2 to the case of the SIKE primes in Appendix B. It is straightforward to apply a similar optimization to primes such as the SQISign prime p 3923 [46], as discussed in §5.1.
We note that, in part, the proposed method gains in efficiency by eliminating modular reductions. Therefore, it is naturally expected that primes that support a very fast reduction (e.g., Mersenne or pseudo-Mersenne primes) do not gain a significant advantage with the approach.

Cost analysis
Except for the variants that could use Karatsuba at the lower levels of their computations (which would only be the case for relatively large primes), the complexity of the proposed algorithms is quadratic in terms of multiplication instructions. For t products, it is easy to see that they require n 2 (t + 1) + n digit multiplications, which is precisely the complexity of standard lazy reduction when the products are done in schoolbook or Comba-style. This means that lazy reduction in conjunction with a subquadratic multiplication like Karatsuba-schoolbook or Karatsuba-Comba (KCM) [51] is theoretically cheaper in terms of multiplications.
Nevertheless, we argue that the new method can achieve a superior performance in practice for small and medium-size primes since it enables streamlined implementations with much less friction with memory. That is, the computation is carried out over a reduced number of registers, greatly minimizing the memory accesses. In software, the schoolbook variants are particularly efficient due to the availability of carry-preserving instructions, which precisely contribute further to reduce memory use.
To see this, let's run a comparative analysis with one of the most promising variants for software platforms, namely, a merged sum of products using radix-r interleaved Montgomery multiplication in coarsely integrated form (Algorithm 2). For the remainder of this section, we focus on the case of primes with no special shape, as assumed in the description of the algorithms from the previous section. We assume B = 1 in the case of an F p 2 multiplication, and set an x64 processor as the target. We compare against state-of-the-art implementations of multiplication over F p 2 , which essentially use variants of Algorithm 1 [2,39].
If we compare costs (4) and (5), standard lazy reduction appears to beat the new method solidly for almost every operation type. However, this analysis ignores key practical considerations, as we discuss below.
Let's now perform a more practical analysis based on actual implementations of the F p 2 multiplication for primes up to 512 bits long on an x64 platform. In order to provide a generic cost formula (but without loss of precision), the costs given below are approximations of the actual operation counts. We consider the use of carry-preserving instructions like mulx and adx, which are supported by all modern Intel and AMD processors.
As can be seen, in practice the memory access costs are greatly reduced thanks to the use of the general purpose registers (GPRs). Likewise, the use of carry-preserving instructions reduces the number of addition instructions significantly. While this happens across both algorithms, the improvement is much more pronounced for the new method, especially in the case of memory writes. This highlights the streamlined nature of the proposed approach, which permits to eliminate many memory accesses.  (4) and (5). The practical counts, which use the cost formulas (6) and (7), are precise approximations of the actual implementation costs on an x64 platform with mulx/adx support for primes with different digit sizes n up to 512 bits (results for larger primes are still obtained by using the same formulas, so the measurement errors are larger in these cases, as discussed in the text).
The above analysis for primes up to 512 bits can be clearly observed in Figure 1, which displays the total number of instructions (multiplications, additions/subtractions and memory reads and writes) for different prime sizes, using the theoretical analysis and the analysis based on practical implementations. We remark that the costs for primes greater than 512 bits are still estimated using formulas (6) and (7). However, in practice we expect the method to become less advantageous as the prime size increases. We have verified that for larger primes the number of GPRs becomes insufficient and the implementations start to demand more storage and an increasing number of memory accesses. If we consider that from the 15 GPRs available on an x64 platform 3 are used for the input/output interface, and 3 or 4 are used to hold auxiliary values, then only 8 or 9 registers are available to hold the actual operands for the computations. Hence, it makes sense that for primes up to 512 bits, i.e., with n ≤ 8, the proposed method achieves optimality to perform a full B-digit × row multiplication (B = 1) without register spillage. This observation, conflated with the better performance of Karatsuba at larger prime sizes, makes the new method to reduce its speed advantage as the prime bitlength increases beyond 512 bits.
Experimentally, we have verified this analysis with the SIKE primes (see Appendix B). Concretely, the speed superiority of our method over standard lazy reduction starts to vanish around the 610-bit mark on an x64 platform. It is important to note, however, that the memory savings remain stable and can even slightly increase with the prime bitlength.

Case Study: bilinear pairings
Since the seminal papers by Sakai et al. [72] on identity-based non-interactive authentication key agreement and by Joux [56] on tripartite one-round key agreement, bilinear pairings have become a powerful tool in the design of a myriad of novel cryptographic schemes such as identity-based cryptosystems [20,24,29,30] and non-interactive zero-knowledge proof systems [52,53].
One critical drawback of pairings is their relatively expensive running time. This motivated an extraordinary effort by the research community to improve efficiency on several fronts, including the construction of pairing-friendly elliptic curves [12,25,49,75], the development and improvement of the algorithms for the Miller loop and final exponentiation [11,13,14,54,77], the optimization of the explicit formulas for the curve arithmetic [3,35,36], and the design and optimization of towering schemes of extension fields F p k [3,18,60]. Readers are referred to [1] for a modern take on the design and implementation of pairings.
Here, we focus on the optimization of the arithmetic over F p k using the proposed algorithms. Of practical interest are extension fields of degrees 2, 4, 6, 8, 12, 24 and 48, which are commonly used to construct the tower fields in many of the popular pairings used in practice and considered for standardization [73], such as BLS12-381, BN446, BLS12-446, BN462, BLS24-509, BLS48-581 and others; see [73, Table 1] for a summary of pairingfriendly curves adopted in practice. From this set, we study the pairings with embedding degrees 12 and 24, which cover the majority of the most popular options considered for the 128-and 192-bit security levels. Accordingly, we analyze next the implementation of multiplication over F p 2 , F p 4 , F p 6 , F p 8 and F p 12 , which are core operations in the pairing computation, and consider the following widely widespread towering scheme [1,3,19,50]: It is important to remark that, in practice, α and β are chosen such that they have very small absolute values (typically, absolute value 1 or 2), which help improve efficiency of the extension field arithmetic.
The case of multiplication over F p 2 . Given that for pairings, generic Montgomery multiplication (i.e., variants that do not exploit any special form in the prime) is known to provide the best performance in software, straight implementations of the variants discussed in §3 are relevant in this case. More specifically, variants that exploit the synergy between schoolbook algorithms and carry-preserving instructions are expected to outperform other approaches. Thus, we observed that the implementation of multiplication over F p 2 can be efficiently carried out using the interleaved radix-r Montgomery multiplication variant in coarsely integrated form (Algorithm 2), with B = 1 to make full use of schoolbook.
The case of multiplication over F p 4 . There are multiple choices to implement multiplication over F p 4 . For example, it can be implemented on top of the F p 2 arithmetic layer using our method for the multiplication over F p 2 and Karatsuba at the F p 4 level with or without lazy reduction. Or it could be implemented using the proposed method by seeing F p 4 as a direct extension of F p 2 and expressing the operations down to the base field, as discussed next.
After regrouping common coefficients and assuming that the two values (αb 1,0 − b 1,1 ) and (b 1,0 + αb 1,1 ) are pre-calculated, one can apply either Algorithm 2 or Algorithm 3 with a cost of 4 × 4 = 16 multiplications in the base field. Note that, in contrast to a generic sum of products, the polynomial multiplication modulo f offers opportunities to eliminate some multiplications using Karatsuba. For example, in the term c 0,1 one could compute a 0,0 b 0,1 + a 0,1 b 0,0 as (a 0,0 + a 0,1 )(b 0,0 + b 0,1 ) − a 0,0 b 0,0 − a 0,1 b 0,1 with only one base field multiplication, using intermediate values from c 0,0 . However, these replacements should be applied with care, since they break the algorithm's flow (recall that inner multiplications are interleaved with reduction computations) and increase memory usage, potentially neglecting any savings obtained by eliminating multiplications. Ultimately, the benefit of combining Karatsuba with the proposed algorithms might depend on the target platform (see Appendix A for details on another Karatsuba variant).
The case of multiplication over F p 6 . Similar to the case over F p 4 , there are multiple choices to implement multiplication over F p 6 . For example, it can be implemented on top of the F p 2 arithmetic layer using our method for the multiplication over F p 2 and Karatsuba at the F p 6 level with or without lazy reduction. Or it could be implemented using the proposed method by seeing F p 6 as a direct extension of F p 2 and expressing the operations down to the base field, as discussed next.
Note that similar comments to the case over F p 4 apply to the possibility of exploiting Karatsuba.
The case of multiplication over F p 12 . Similarly in this case, we can leverage the multiplications over F p 2 or over F p 6 discussed above, in combination with Karatsuba with or without lazy reduction at the F p 12 layer. But we can also do the computation by writing the full polynomial multiplication down to the base field level. Recall that F p 12 It is straightforward to determine that, in this case, we need to compute twelve terms each consisting of a sum of twelve products (assuming the pre-calculation of ten values, similarly to multiplication over F p 6 ). Similar comments apply to the possibility of exploiting Karatsuba to products in adjacent terms.
The case of multiplication over F p 24 and beyond. The same methodology applies to higher extension degrees. But as we observe in the experimental results, the speedup of the method decreases with the extension degree, as reducing multiplications with Karatsuba becomes increasingly more profitable than reducing memory accesses and other small operations. However, one can expect an improvement even for extension degrees as high as 24 and 48 by combining the use of the proposed technique for the lower layers (say, for degrees 2, 4, 6 and 8) while using standard Karatsuba, with or without lazy reduction, for the top layers.
Although it has not been discussed here, the approach can be easily extended to squaring and other similar operations (e.g., sparse multiplication) over extension fields.

Performance results
In order to cover different field sizes and towering schemes, we carry out the evaluation on two pairing-friendly curves, namely, BLS12-381 and BLS24-509, using an optimal ate pairing instantiation [77].
BLS12-381, proposed by Bowe [23], is an elliptic curve from the Barreto-Lynn-Scott (BLS) family [12] that targets the 128-bit security level and is undergoing a standardization effort in the IETF CFRG [73]. Most notoriously, this curve is widely used in zero-knowledge proofs and digital signatures in blockchain applications like Zcash [6] and Ethereum 2.0 [27] 5 . BLS12-381 is defined by the curve E(F p ) : y 2 = x 3 + 4, with embedding degree k = 12. Relevant to our analysis is that, in practice, the arithmetic implementation over F p 12 is realized via the towering representation: BLS24-509, proposed by Costello et al. [40], is also an elliptic curve from the BLS family that targets the 192-bit security level. BLS24-509 is defined by the curve E(F p ) : y 2 = x 3 +b, with embedding degree k = 24. The arithmetic implementation over F p 24 is realized via the towering representation: To evaluate the proposed algorithms, we have integrated our implementations to the RELIC cryptographic library, version 0.6.0 [2]. This library contains, to our knowledge, some of the most efficient open-source implementations of pairings. In particular, it applies the generalized lazy reduction to the full extension field and elliptic curve arithmetic, as proposed in [3].
In our experiments, we use a 3.4GHz Intel Core i7-6700 (Skylake) processor with TurboBoost disabled to follow standard practice. Compilation was carried out using clang v11.0.1 with the command clang -O3. Memory stack usage was obtained using valgrind 6 and massif-cherrypick 7 . Table 1 compares RELIC's implementation of the extension field multiplications for BLS12-381 against the various options that we discussed for our algorithms. Likewise, Table 2 includes the analogous results for BLS24-509. We use the following notation to specify a given strategy: first, we indicate up to which layer an algorithm is applied, followed by the approach taken for the upper layers (if any). For the latter, we have two options: for the upper layers, one can use either straight Karatsuba (called "Karat") or Karatsuba with lazy reduction (called "Karat + lazyr"). For example, if the table indicates "Alg. 2 over F p 6 . Karat + lazyr over F p 12 ", it means that we use Algorithm 2 to implement multiplication up to the F p 6 layer, with the upper layer over F p 12 implemented with a formula that exploits Karatsuba with lazy reduction. In all the cases, we set B = 1 for Algorithm 2, which gives optimal performance on the targeted x64 platform. As noted before, this schoolbook-like algorithmic variant implementing an interleaved modular multiplication in coarsely integrated form fully exploits the availability of carry-preserving instructions. We comment that, at least on the targeted processor, the algorithm should achieve similar performance for small values of B, as long as an increase in the radix size does not put additional pressure on the register usage. For example, in our experiments, we obtained similar results for B = 1 and B = 2.
In terms of speed, Tables 1 and 2 reveal that the full use of the new method solidly beats the state-of-the-art implementations up to the F p 6 layer in the case of BLS12-381. For the F p 12 multiplication, the fastest mark is achieved by using the implementation over F p 6 and implementing the upper layer over F p 12 using Karatsuba. This is due to the fact that at certain threshold Karatsuba starts to outperform schoolbook algorithms when multiplications get eliminated at a sufficiently faster rate. Interestingly enough, we do not require the use of lazy reduction because a basic implementation based on Karatsuba already achieves optimal performance. This is the consequence of minimizing the cost of reduction through the proposed approach, and this greatly reduces the complexity of the implementation. Note that we also obtain a significant gain in the computation of squaring over F p 2 . This is the result of replacing the non-interleaved Montgomery multiplication available in [2] by a faster interleaved version, given that we were not limited anymore to the old algorithmic selection that exploited lazy reduction.
In the case of BLS24-509, we experimented with a full use of the technique up to the F p 4 level. For the F p 8 and F p 24 multiplications, we use the straight implementation of the method over F p 4 and implemented the upper layers over F p 8 and over F p 24 using Karatsuba. In the latter case, the use of lazy reduction results in a slight improvement.
Another relevant point is that the experimental results confirm an important and consistent improvement in speed up to (at least) the 500-bit base field size. This can be clearly observed when comparing the speedups for the F p 2 multiplication and squaring in the 381-and 509-bit fields. Similarly, note the increasing speedup in the multiplication over F p k as k decreases for all the values of k considered in Tables 1 and 2 (this is clearly observed in the cases of full use of the method using Algorithm 2).
Additionally, we also carried out an analysis of memory usage for BLS12-381. Notably, we observe that the proposed method reduces significantly the use of memory, achieving savings in the range 43%-78% for different extension field operations and with increasing savings for higher extension degrees; see Table 1. The remarkable reduction in memory consumption is mainly due to the elimination of double-precision operations and the streamlined nature of our algorithms. Looking at the different options for F p 6 and F p 12 multiplication, one can see that those that avoid lazy reduction and implement the full arithmetic using Algorithm 2 minimize the use of memory. For example, the use of Algorithm 2 over F p 12 brings the most memory-friendly option for multiplication over F p 12 , at the expense of a slight reduction in speed. We expect that similar results extend to other extension degrees.
Finally, Table 1 also reports the cycle counts for the full pairing computation using an optimal ate instantiation: on a 3.4GHz Intel Core i7-6700 Skylake machine the computation of our fastest implementation option for BLS12-381 is performed in ∼ 674 µsec. As an additional data point, the same computation on a 3.2GHz Intel Core i7-8700 Coffee Lake machine is carried out in ∼ 491 µsec. 8 (compare to the 688 µsec. obtained by running the implementation from RELIC on the same platform). We remark that our implementation is much simpler and more memory-friendly, and still achieves a speedup of up to 1.40× over the state-of-the-art on an x64 processor. In addition, the table also includes another implementation option in which the full F p 12 multiplication uses Algorithm 2. This implementation saves up to 49% in memory usage in the pairing computation while almost achieving top speed performance.
In the case of BLS24-509, Table 2 also shows a significant improvement in the running time of the full pairing. On the Skylake platform, the computation is completed in ∼ 3.5 msec., which is 1.30× faster than the state-of-the-art implementation from RELIC. On the Coffee Lake platform, the computation is completed in ∼ 2.6 msec., which is 1.31× faster than the ∼ 3.4 msec. that is obtained with RELIC. As expected, when compared with the results for BLS12-381, the relatively smaller speedup obtained for BLS24-509 reflects well the use of a higher-degree tower field. Analysis for larger field sizes and higher extension degrees is left for future work.

Impact to Other Scenarios and Future Work
The simple but effective approach that we have proposed in this work changes the paradigm which the implementation of extension field arithmetic has long relied upon. This immediately impacts the software implementation of cryptographic schemes such as those based on bilinear pairings and supersingular isogenies. Moreover, we also expect the approach to influence the development of efficient techniques and implementations for other software platforms, constrained devices and hardware architectures, not only because of the potential speed gains but also (and maybe more critically) because of the significant savings in memory usage.
Next, we describe a few possibilities for some representative platforms, and end the section with a discussion on the impact for supersingular isogeny-based schemes.
Software platforms. In platforms with a limited number of registers there is a risk of high memory access costs. Hence, a streamlined, schoolbook-like algorithm like Algorithm 2 that minimizes memory friction and reduces the use of certain operations such as additions (e.g., when there is support for carry-preserving instructions) achieves high performance on modern x64 platforms. Alternative methods based on Karatsuba are expected to become attractive at relatively large prime sizes, when the reduction in multiplications compensates for the bumpier algorithmic flow with higher number of memory accesses and additions.
We expect a similar (if not better) situation with vectorized implementations using the recent AVX-512 vector instructions available in some Intel processors. For example, the optional extension "Integer Fused Multiply and Add" (IFMA) includes MULADD instructions that perform up to eight 52-bit multiplications followed by accumulations with 64-bit values [34]. Future work could involve studying the performance of the proposed Table 1: Comparison of the speed performance (in terms of clock cycles) and memory stack usage (in terms of bytes) between the state-of-the-art implementation of an optimal ate pairing over BLS12-381 and its extension field arithmetic [2] and the optimized implementation using the method proposed in this work. The target platform is a 3.4GHz Intel Core i7-6700 (Skylake) processor. method using the operand and product-scanning forms, in combination with different vectorization strategies. Similar comments apply to implementations using the ARM NEON vector engine [63]. In this case, there is access to powerful, high-throughput MULADD instructions that perform up to two 32-bit multiplications followed by accumulations with 64-bit values. Thus, these instructions would favor an algorithmic variant of Eq. (3) in product-scanning form.

Reference
In the case of multiplication over F p 2 , for example, the 2-way NEON execution naturally adapts to perform the two-term computation (i.e., the operations c 0 = a 0 b 0 + a 1 b 1 β and For the case of scalar implementations on 64-bit ARMv8 processors, the relatively high cost of multiplication instructions might make the case for standard Karatsuba with lazy reduction. However, memory accesses are also expensive, which would favor a more streamlined execution as in the proposed algorithms. This requires actual experimentation to determine which algorithm would be optimal. Constrained platforms. For this class of devices, the potential reduction in memory use is particularly relevant. A popular platform in this computing category is ARM Cortex-M4. For this case, one can exploit the powerful, one-cycle MULADD instructions available in the DSP extension. These instructions can perform a 32-bit multiplication plus 64-bit accumulation, or a 32-bit multiplication plus two 32-bit accumulations. The low cost of multiplication, added to the potential of eliminating the overhead from addition instructions in a schoolbook-like computation (e.g., see [64, §3.4]), makes our approach using Algorithm 2 a perfect fit. Table 2: Comparison of the speed performance (in terms of clock cycles) between the state-of-the-art implementation of an optimal ate pairing over BLS24-509 and its extension field arithmetic [2] and the optimized implementation using the method proposed in this work. The target platform is a 3.4GHz Intel Core i7-6700 (Skylake) processor. starting with [28] rendered insecure the insignia protocols of this family, i.e., SIDH and SIKE. Since then, some works have proposed countermeasures to circumvent the attacks [47,48,70]. For example, in a recent work, Fouotsa, Moriya and Petit [48] propose two SIDH variants using masking. Unfortunately, the lengths of the underlying base fields of the proposed parameters are rather large and run in the thousands of bits. In this scenario, the proposed technique is expected to have a small impact (if any). A more promising application includes the compact signature scheme SQISign [45], whose prime bitlengths belong to the ideal range of a few hundred bits. Recently, in joint work with De Feo, Leroux and Wesolowski, we adapted the proposed algorithms to a new, faster variant of SQISign [46]. Concretely, we used Algorithm 2 for p 6983 , corresponding to a 256-bit prime of generic form, and adapted Algorithm 5 to the 254-bit prime p 3923 , for which p + 1 ≡ 0 (mod 2 64 ). In the former case, we obtained a speedup of ∼ 1.45× for the combined cost of signing and verification, while for the latter we obtained a speedup of ∼ 1.68×. The difference in the speedups highlights the advantage of suitably chosen primes such as p 3923 that feature a special shape and have some room at the computer word boundaries. Future work includes the use of the algorithms for the implementation of the arithmetic over the primes for the 192-and 256-bit security levels proposed in [26]. We also comment that the recently proposed variant FastSQISignHD [41] is expected to greatly benefit from using algorithmic variants such as Algorithm 5 that are specially tailored for primes with the form 2 x · 3 y − 1, as in SIDH and SIKE.

Reference
Another interesting line of work in which our method is expected to have a significant impact corresponds to isogeny-based zero-knowledge protocols [17,32,44]. For example, to illustrate the impact in a recent zero-knowledge proof of isogeny knowledge [17], we modified the implementation by Basso et al. based on a 434-bit prime and observed a 23% speedup in the generation phase, and a 29% speedup in both the prove and verify stages.

A Other algorithmic variants
In this section, we discuss two variants using Karatsuba that could be amenable for certain applications looking to reduce the number of multiplies.
The first one was already mentioned in §4 and applies to polynomial multiplication in general: for certain sums of two products, it is possible to use products from adjacent terms to convert them to a Karatsuba multiplication and save a multiplication (see the subsection "The case of multiplication over F p 4 "). Since the proposed algorithms interleave integer multiplications and reduction products, some care has to be taken into account to avoid excessive use of storage.
Below, we propose another option that interleaves Karatsuba multiplication with the radix-r Montgomery reduction. The new algorithm is shown in Algorithm 4 using 2-way Karatsuba. Let's focus on the simple case with t = 1, i.e., a standard modular multiplication a · b mod p (the sum-of-products case easily follows). The basic idea of the algorithm is to first split operands in two halves (for a 2-way Karatsuba) using the generalized radix r = 2 Bw = 2 n/2 , such that operands are represented as (a 1 , a 0 ) 2 n/2 . Then, from Eq. (2) and proceeding in product-scanning form, we have that: Finally, we simply replace the intermediate computation (a 1 b 0 + a 0 b 1 ) by (a 0 + a 1 ) For B odd we proceed as other cases in this work and adjust the operations to the right digit size (lines 7 and 8 in Alg. 4).
We comment that other variants are possible and easily follow, such as for example an interleaved, 3-way Karatsuba-Montgomery multiplication that can be derived in a similar way with a splitting of operands in three parts. It is also possible to derive a SIKE-friendly version. In this case, one can conveniently set the radix to r = 2 e2 for a prime p = 2 e2 · 3 e3 − 1 and eliminate the multiplications by p .

B Effect of the prime size: analysis with the SIKE primes
In this section, we evaluate the improvements in speed performance and memory usage that can be obtained with the proposed technique for different prime sizes. For this, we use the SIKE primes [5], which cover a range of sizes of relevance for several cryptographic applications.
SIKE primes have the special form p = 2 e2 · 3 e3 − 1, where 2 e2 ≈ 3 e3 for integers e 2 and e 3 . For the analysis, we use Algorithm 5, which we derived from Algorithm 2 by adapting it to the special prime shape.

Cost analysis.
To evaluate the performance of the proposed approach, we implemented Algorithm 5 with the Round 3 SIKE parameters p434, p503 and p610 (the number in the parameter name denotes the bitlength of the corresponding prime [5]). We also evaluate the SIKE prime p377 proposed by Longa et al. [67]. We compare against the performance of the state-of-the-art implementations of the same SIKE primes from the SIDH library v3.4 [39] and from [67]. These libraries implement the integer multiplication in two layers using Karatsuba (upper layer) and schoolbook (lower layer). The reduction part is implemented using a non-interleaved radix-r Montgomery reduction with B > 1, specialized to SIDH/SIKE primes (see [43,Alg. 6]). As is standard, these libraries use lazy reduction for the multiplication over F p 2 , following Algorithm 1. Table 3 presents the performance comparison (in terms of clock cycles) for the F p 2 multiplication on an x64 processor, specifically, a 3.4GHz Intel Core i7-6700 (Skylake) processor. All the implementations in the comparison are written in assembly language, and were compiled and tested on the same platform using clang v6.0.1 with the command clang -O3. The table also includes a detailed instruction count of all the implementations, including multiplications, additions, subtractions and other instructions. The columns Algorithm 5 Merged sums of products using radix-r Montgomery reduction in coarsely integrated form for a prime with the form p = 2 e2 · 3 e3 − 1. Input: integers (a 0 , a 1 , . . . , a t−1 ) and (b 0 , b 1 , . . . , b t−1 ) s.t. a i , b i ∈ [0, 2p) for i = 0, . . . , (t− 1) and 0 ≤ t−1 i=0 a i b i < pR, where R = 2 nw , p = 2 e2 · 3 e3 − 1, n = l/w , l = log 2 p , and w is the computer wordsize; z = e 2 /w ,p = (p+1)/2 zw , and the radix r = 2 Bw s.t. B ∈ Z and 0 < B ≤ z. Integers are represented in radix r, e.g., a i = (a i, n/B −1 , . . . , a i,1 , a i,0 )  u ← u/2 (n mod B)w + 2 (z−n mod B)w q ·p 10: return c ← u "read" and "write" present counts of all the corresponding instructions that require a memory access operation. We remark that the total instruction counts are provided as an additional data point only and should not be considered to follow actual performance with high-precision. Especially in the case of the targeted platform, its superscalar, deeply pipelined microarchitecture makes extremely difficult to extract performance data from a straight instruction count. Nevertheless, it can provide relevant information for a first-order comparison of the different algorithms.
Firstly, we observe that the proposed method achieves much better speed performance in all the cases, even though it requires a higher number of multiplication instructions. This is due to the significant reduction of other operations, especially of those requiring read/write memory accesses. Another relevant aspect is that at certain threshold the operand sizes become too large and the lack of enough general purpose registers forces the use of many more memory access instructions. This can be observed for the largest prime under analysis, i.e., p610, which precisely returns the lowest speedup. In the rest of the cases, the speedup goes from ∼ 1.17× up to 1.31×, with the speedup increasing as the size of the prime decreases. This is consistent with the results from existing literature that show that Karatsuba becomes more profitable as sizes grow (see §3). Nevertheless, we demonstrate that, for the case of a quadratic extension field, schoolbook can still be much faster for primes up to around 500 bits 9 . Table 3 also summarizes the stack memory usage corresponding to each parameter set. The figures were obtained using valgrind and massif-cherrypick, as in §4. As can be seen, our approach achieves a significant reduction in memory consumption that is well above 35%. This is obtained consistently across the different parameter sets, with even a slight increase in the savings as the prime size goes up. This is consistent with the memory analysis in §4, although for pairings the savings are even greater for higher degree extension fields when the use of the towering-based approach is minimized. Table 3: Comparison of instruction counts, speed performance (in terms of clock cycles) and memory stack usage (in terms of bytes) between the proposed method (Algorithm 5, B = 1) and the state-of-the-art implementations of multiplication over F p 2 for the SIDH and SIKE protocols [5,39,67]. The target platform is a 3.4GHz Intel Core i7-6700 (Skylake) processor. The comparison covers the Round 3 SIKE primes p434, p503 and p610, and also the prime p377 proposed in [67]. The instruction counts cover all the instructions executing multiplications, additions, subtractions and memory accesses. The column "others" includes any other additional instructions such as logical and shift instructions.