Faster Montgomery multiplication and Multi-Scalar-Multiplication for SNARKs

Abstract. The bottleneck in the proving algorithm of most elliptic-curve-based SNARK proof systems is the Multi-Scalar-Multiplication (MSM) algorithm. In this paper we give an overview of a variant of the Pippenger MSM algorithm together with a set of optimizations tailored for curves that admit a twisted Edwards form. This is the case for SNARK-friendly chains and cycles of elliptic curves, which are useful for recursive constructions. Accelerating the MSM over these curves on mobile devices is critical for the deployment of recursive proof systems in mobile applications. This work is implemented in Go and uses hand-written arm64 assembly to accelerate the finite field arithmetic (bigint). It was carried out as part of a submission to the ZPrize competition in the open division "Accelerating MSM on Mobile" (https://www.zprize.io/). We achieved a 78% speedup over the ZPrize baseline implementation in Rust.


Introduction
A SNARK is a cryptographic primitive that enables a prover to prove to a verifier the knowledge of a satisfying witness to a non-deterministic (NP) statement by producing a proof π such that the size of π and the cost to verify it are both sub-linear in the size of the witness. Today, the most efficient SNARKs use elliptic curves to generate and verify the proof. A SNARK usually consists of three algorithms: Setup, Prove and Verify.
The Setup and Prove algorithms involve solving multiple large instances of tasks about polynomial arithmetic in F_r[X] (where r is a prime) and multi-scalar multiplication (MSM) over the points of an elliptic curve. Fast arithmetic in F_r[X], when manipulating large-degree polynomials, is best implemented using the Fast Fourier Transform (FFT) [Pol71], and MSMs of large sizes are best implemented using a variant of Pippenger's algorithm [BDLO12, Section 4]. For example, Table 1 reports the numbers of MSMs required in the Setup, Prove and Verify algorithms of the [Gro16] SNARK and the KZG-based PLONK universal SNARK [GWC19]. The sizes of the MSMs are given in terms of the number of gates in the arithmetic circuits defining the computation to be proved by the SNARK (notation in the Table caption). The report excludes the number of FFTs, as the dominating cost for such constructions is the MSM computation (∼80% of the overall time).
Table 1: Cost of Setup, Prove and Verify algorithms for [Gro16] and PLONK. m = number of wires, n = number of multiplication gates, a = number of addition gates and ℓ = number of public inputs. M_{G_i}, i ∈ {1, 2} = multiplication in G_i and P = pairing.

Given a set of n elements G_1, …, G_n (bases) in a cyclic group G (e.g. a cyclic subgroup of the group of points on an elliptic curve) whose order #G has b bits, and a set of n integers a_1, …, a_n (scalars) between 0 and #G, the goal is to compute efficiently the group element [a_1]G_1 + … + [a_n]G_n. In SNARK applications, we are interested in large instances of variable-base MSMs (n = 10^7, 10^8, 10^9), with random bases and random scalars, over the pairing groups G_1 and G_2.
The naive algorithm uses a double-and-add strategy to compute each [a_i]G_i and then adds them all up, costing on average 3/2 · b · n group operations (+). On the one hand, several algorithms optimize the total number of group operations as a function of n, such as the Strauss [Str64] and Pippenger [Pip76] algorithms. For large instances of a variable-base MSM, the fastest approach is a variant of Pippenger's algorithm [BDLO12, Sec. 4]. For simplicity, we call it the bucket method. On the other hand, an efficient implementation of finite field arithmetic directly impacts the performance of the group operation and therefore the performance of the whole MSM algorithm. In this paper we are interested in the bucket-method MSM on inner curves of 2-chains and 2-cycles of elliptic curves.
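To make the baseline concrete, here is a minimal sketch of the naive double-and-add MSM. We use int64 under addition as a toy stand-in for the elliptic-curve group (so [a]G is simply a·G and the result is easy to check); the function names are ours, not the paper's implementation.

```go
package main

import "fmt"

// Toy stand-in group: int64 under addition plays the role of the
// elliptic-curve group, so "add" is + and "double" is *2. The structure
// (b doublings and on average b/2 additions per scalar) is what gives
// the naive MSM its ~3/2*b*n group-operation cost.
func scalarMulDoubleAdd(a uint64, g int64) int64 {
	var acc int64 // neutral element
	for bit := 63; bit >= 0; bit-- {
		acc += acc // double
		if (a>>uint(bit))&1 == 1 {
			acc += g // add
		}
	}
	return acc
}

// naiveMSM computes [a_1]G_1 + ... + [a_n]G_n one term at a time.
func naiveMSM(scalars []uint64, bases []int64) int64 {
	var sum int64
	for i := range scalars {
		sum += scalarMulDoubleAdd(scalars[i], bases[i])
	}
	return sum
}

func main() {
	fmt.Println(naiveMSM([]uint64{2, 3}, []int64{5, 7})) // 2*5 + 3*7 = 31
}
```

The bucket method described in Section 4 replaces this per-scalar loop with a shared windowed computation over all scalars at once.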

Contributions.
Our contribution is twofold. First, we discovered an optimization that reduces the number of operations needed to compute the modular multiplication of two big integers for most (but not all) choices of modulus. To the best of our knowledge, there is no prior art describing this optimization. Our multiplication algorithm improves on the CIOS algorithm [Aca98] by saving 5N + 2 additions, where N is the number of 64-bit machine words in the modulus. When N = 6 for instance, this yields an 8% improvement. On arm64, our implementation achieves an additional 17% thanks to a collection of assembly optimizations explained in Sec. 6.
Second, we analyse the complexity of the bucket-method variant of the Pippenger MSM algorithm. We propose a new coordinate system for twisted Edwards curves tailored for this algorithm. We finally show how to use the algebraic structure of elliptic curves to further reduce the complexity of the algorithm.
As an example, we choose the widely used BLS12-377 elliptic curve [BCG+20] as an inner curve of a 2-chain. We implement both the finite field arithmetic using our algorithm and the bucket MSM algorithm using the new twisted Edwards coordinate system. Our implementation in Go outperforms the state-of-the-art implementation in Rust [aC22] by 40-47%.
Our implementation won first place in the ZPrize competition in the open division "Accelerating MSM on Mobile" (https://www.zprize.io/) and will be deployed in two real-world applications: the Linea zkEVM by ConsenSys (https://consensys.net/zkevm/) and probably the Celo network (https://celo.org/). The zkEVM use-case uses our CPU MSM implementation to generate a PLONK proof of a batch of transactions to scale the Ethereum blockchain, while the Celo network would use our techniques to reduce its Groth16 proof generation time on a mobile device from 3s to 400ms.
Organization of the paper. Section 2 provides preliminaries on CIOS modular multiplication and proofs of our new algorithm. Section 3 provides definitions of 2-chains and 2-cycles of elliptic curves and some results we prove. In Section 4, we explain the bucket method and provide its complexity analysis. Section 5 provides our optimizations to the bucket method for both generic elliptic curves and twisted Edwards curves. We prove that inner 2-chains and 2-cycles always fall into the second, more optimized case. Finally, Section 6 reports on our implementation of the bucket method alongside our optimizations. We choose to tailor the implementation to the widely used BLS12-377 curve and to benchmark our results on two different CPU architectures (x86 and arm64).

The Montgomery multiplication: theory
The modular multiplication problem. Given integers a, b and p, the modular multiplication problem is to compute the remainder of the product, ab mod p.
On computers, a division operation is much slower than other operations such as multiplication. Thus, a naive implementation of ab mod p using a division operation is prohibitively slow. In 1985, Montgomery introduced a method to avoid costly divisions [Mon85]. This method, now called the Montgomery multiplication, is among the fastest solutions to the problem and continues to enjoy widespread use in modern cryptography.

Overview of the solution: the Montgomery multiplication.
There are many good expositions of the Montgomery multiplication (e.g. [BM17]). As such, we do not go into detail on the mathematics of the Montgomery multiplication. Instead, this paragraph is intended to establish notation that is used throughout this section.
The Montgomery multiplication algorithm does not directly compute ab mod p. Instead, it computes abR^{-1} mod p for some carefully chosen number R called the Montgomery radix. Typically, R is set to the smallest power of two exceeding p that falls on a computer word boundary. For example, if p is 381 bits then R = 2^{6×64} = 2^{384} on a 64-bit architecture.
In order to make use of the Montgomery multiplication, the numbers a and b must be encoded into Montgomery form: instead of storing (a, b), we store the numbers (ã, b̃) given by ã = aR mod p and b̃ = bR mod p. A simple calculation shows that the Montgomery multiplication of ã and b̃ produces the product ab mod p, encoded in Montgomery form: (aR)(bR)R^{-1} = abR mod p. The idea is that numbers are always stored in Montgomery form so as to avoid costly conversions to and from it.
Other arithmetic operations such as addition and subtraction are unaffected by the Montgomery form encoding. But the modular inverse computation a^{-1} mod p must be adapted to account for the Montgomery form. We do not discuss modular inversion in this section (cf. [BY19] and [Por20]).

The Montgomery multiplication: implementation
For security purposes, cryptographic protocols use large moduli: a, b and p are stored on multiple machine words (multi-precision). In this section, we let D denote the base in which integers are represented (for example, D = 2^{64} if a word is 64 bits). A large number a can be represented by its base-D digits a_0, …, a_{N−1} stored in machine words (uint) such that a = Σ_{i=0}^{N−1} a_i·D^i.
There are several variations of multi-precision Montgomery multiplication. To compute c = ã·b̃·R^{-1}, we need to multiply the operands (P = ã·b̃) and then compute P·R^{-1} mod p. This second step (the Montgomery reduction) can be computed efficiently (see Algorithm 1 of [BM17]) when we precompute the number −p^{-1} mod R.
A popular choice is the Coarsely Integrated Operand Scanning (CIOS) variant [Aca98], which interleaves the operand multiplication and the reduction step. In some settings, factors such as modulus size, CPU cache management, optimization techniques, architecture and the available instruction set might favor other variants.
How fast is the CIOS method? Let N denote the number of machine words needed to store the modulus p. For example, if p is a 381-bit prime and the hardware has a 64-bit word size, then N = 6. The CIOS method solves modular multiplication using 4N^2 + 4N + 2 unsigned integer additions and 2N^2 + N unsigned integer multiplications.
Our optimization reduces the number of additions needed in the CIOS Montgomery multiplication to only 4N^2 − N, a saving of 5N + 2 additions. This optimization can be used whenever the highest bit of the modulus is zero (and not all of the remaining bits are set; see below for details).
The core of the state-of-the-art CIOS Montgomery multiplication is reproduced below. This listing is adapted from Section 2.3.2 of Tolga Acar's thesis [Aca98]. The symbols in this listing have the following meanings:
• N is the number of machine words needed to store the modulus p.
• D is the word size. For example, on a 64-bit architecture D is 2^{64}.

• ã[i], b̃[i], p[i] are the i-th words of the integers ã, b̃ and p.
• p'[0] is the lowest word of the number −p^{-1} mod R. This quantity is precomputed, as it does not depend on the inputs ã and b̃.
• t is an array of N + 2 words.
• C, S are machine words. A pair (C, S) refers to the (high-bits, low-bits) of a two-word number; for short we denote them (hi, lo).
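As a concrete companion to the listing, here is a word-level sketch of the textbook CIOS flow in Go using math/bits. The slice layout (least-significant word first), the toy two-word modulus and the helper names are our own choices, not the paper's code; the t[N]/t[N+1] carry handling is exactly what the optimization below removes.

```go
package main

import (
	"fmt"
	"math/bits"
)

// cios computes ã·b̃·R^{-1} mod p with R = 2^(64N), interleaving the
// operand multiplication and the Montgomery reduction in one outer loop.
// pInv is the lowest word of -p^{-1} mod R.
func cios(a, b, p []uint64, pInv uint64) []uint64 {
	N := len(p)
	t := make([]uint64, N+2)
	for i := 0; i < N; i++ {
		// t <- t + a[i]·b, with (C, S) := t[j] + a[i]·b[j] + C
		var C uint64
		for j := 0; j < N; j++ {
			hi, lo := bits.Mul64(a[i], b[j])
			var c uint64
			lo, c = bits.Add64(lo, t[j], 0)
			hi += c
			lo, c = bits.Add64(lo, C, 0)
			hi += c
			t[j], C = lo, hi
		}
		var c uint64
		t[N], c = bits.Add64(t[N], C, 0) // the listing's line 5
		t[N+1] = c

		// t <- (t + m·p)/D, with m chosen so the lowest word cancels
		m := t[0] * pInv
		hi, lo := bits.Mul64(m, p[0])
		_, c = bits.Add64(lo, t[0], 0) // low word is 0 by choice of m
		C = hi + c
		for j := 1; j < N; j++ {
			hi, lo := bits.Mul64(m, p[j])
			lo, c = bits.Add64(lo, t[j], 0)
			hi += c
			lo, c = bits.Add64(lo, C, 0)
			hi += c
			t[j-1], C = lo, hi
		}
		t[N-1], c = bits.Add64(t[N], C, 0) // the listing's line 11
		t[N] = t[N+1] + c                  // the listing's line 12
	}
	// final conditional subtraction so the result is fully reduced
	res := t[:N]
	if t[N] != 0 || geq(res, p) {
		var borrow uint64
		for i := 0; i < N; i++ {
			res[i], borrow = bits.Sub64(res[i], p[i], borrow)
		}
	}
	return res
}

func geq(a, b []uint64) bool {
	for i := len(a) - 1; i >= 0; i-- {
		if a[i] != b[i] {
			return a[i] > b[i]
		}
	}
	return true
}

func main() {
	// Toy modulus p = 2^127 - 1 (N = 2): p ≡ -1 mod 2^64 so pInv = 1,
	// and R = 2^128 ≡ 2 mod p. With ã = 2·3 and b̃ = 2·5 (Montgomery
	// forms of 3 and 5), the output is the Montgomery form of 15, i.e. 30.
	p := []uint64{0xFFFFFFFFFFFFFFFF, 0x7FFFFFFFFFFFFFFF}
	fmt.Println(cios([]uint64{6, 0}, []uint64{10, 0}, p, 1)) // [30 0]
}
```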
Next, we show that we can avoid the additions in lines 5 and 12 of Alg. 1 when the highest word of the modulus p is at most (D − 1)/2 − 1. This condition holds if and only if the highest bit of the modulus is zero and not all of the remaining bits are set. With 64-bit machine words (D = 2^{64}), the most significant word of the modulus should be at most 0x7FFFFFFFFFFFFFFE.
Our optimization. Observe that lines 4 and 10 have the form (hi, lo) := m_1 + m_2·m_3 + m_4, where hi, lo, m_1, m_2, m_3 and m_4 are machine words, i.e. each is at most D − 1. From this we derive the following Lemma:

Lemma 1. If (hi, lo) := m_1 + m_2·m_3 + m_4 with m_3 ≤ (D − 1)/2 − 1, then hi ≤ (D − 1)/2. Indeed, m_1 + m_2·m_3 + m_4 ≤ (D − 1) + (D − 1)·((D − 1)/2 − 1) + (D − 1) = ((D − 1)/2)·D + (D − 1)/2.

We use Lemma 1 to prove the following Proposition about Algorithm 1 (the CIOS Montgomery multiplication; Input: ã = aR and b̃ = bR with ã, b̃ < p; Output: abR mod p):

Proposition 1. t[N] and t[N + 1] always store the value 0 at the beginning of each iteration of the outer i-loop.
Proof. We prove this proposition by induction. The base case i = 0 is trivial, since the t array is initialized to 0. For the inductive step at iteration i, we suppose that t[N] = t[N + 1] = 0 and trace the execution through the iteration. Begin at the final iteration of the first inner loop (j = N − 1) on line 4. Because ã < p and because the highest word of p is smaller than (D − 1)/2, we may use Lemma 1 to see that the carry C is at most (D − 1)/2. Then line 5 sets t[N] to C without producing a carry, so t[N + 1] remains 0. A similar observation holds at the end of the second inner loop (j = N − 1) on line 10: Lemma 1 implies that the carry C is at most (D − 1)/2. We previously observed that t[N] is also at most (D − 1)/2, so t[N] + C is at most (D − 1)/2 + (D − 1)/2 = D − 1, which fits entirely into a single word. Then line 11 sets C to 0 and line 12 sets t[N] to 0. The proof by induction is now complete.
With this proposition, we no longer need the addition at line 5, and we can guarantee that the addition on line 11 fits in one machine word. The size of t is reduced to N + 1 words.
Performance. In practice (cf. Sec. 6), Algorithm 2 yields a 5-10% improvement over Algorithm 1, depending on the value of N. For N = 4, we measure a 5.9% improvement. For N = 6, which is the value that corresponds to a field over which a 128-bit-secure elliptic curve should be defined, our algorithm achieves an 8% improvement. The improvement peaks at N = 8 (10%) and decreases afterwards; we measure 5% for N = 10. This is expected, as the number of additions we save is linear in N whereas the total number of word operations grows quadratically.

Algorithm 2: Our optimized CIOS Montgomery multiplication. Input: ã = aR mod p, b̃ = bR mod p. Output: abR mod p.

The same reasoning applies to the squaring algorithm (cf. Alg. 5 in Appendix A). Note that the condition on the modulus p, i.e. p[N − 1] ≤ (D − 1)/2 − 1, is a relaxed condition compared to other techniques that impose a specific form on p, such as Montgomery-friendly primes [BD21] (e.g. p = 2^{e2}·α ± 1 where 2^{e2} is an upper bound for the reduction coefficient R). However, in our method, we impose the inputs ã, b̃ to be reduced mod p. This can limit lazy reduction techniques [Sco07] for multiplication over the extensions of F_p.

2-chains
Following [HG22], a 2-chain of elliptic curves is a set of two curves as in Definition 1.

Definition 1. A 2-chain of elliptic curves is a list of two distinct curves E_1/F_{p_1} and E_2/F_{p_2}, where p_1 and p_2 are large primes and p_1 | #E_2(F_{p_2}).

SNARK-friendly 2-chains are composed of two curves that have subgroups of highly 2-adic orders. In a 2-chain, the first curve is denoted the inner curve, while the second curve, whose order is divisible by the characteristic of the inner curve's field, is denoted the outer curve (cf. Fig. 1).
Inner curves from polynomial families. The best elliptic curves amenable to efficient implementations arise from polynomial-based families. These curves are obtained by parameterizing the Complex Multiplication (CM) equation with polynomials p(x), t(x), r(x) and y(x). The authors of [HG22] showed that the polynomial-based pairing-friendly Barreto-Lynn-Scott families of embedding degrees k = 12 (BLS12) and k = 24 (BLS24) [BLS02] are the most suitable to construct inner curves in the context of pairing-based SNARKs.
These curves require the seed x to satisfy x ≡ 1 mod 3·2^L in order to meet the 2-adicity requirement with respect to both r and p.
A particular example of an efficient 2-chain for SNARK applications is composed of the inner curve BLS12-377 [BCG + 20] and the outer curve BW6-761 [HG20].
We prove a useful result² that will be needed later to optimize the MSM computation.

Lemma 2. All inner BLS curves admit a twisted Edwards form a·x² + y² = 1 + d·x²·y².
Proof. Proposition 2 shows that all inner BLS curves are of the form W_{0,1}: y² = x³ + 1. This curve has the rational 2-torsion point (−1, 0), and the standard map through the Montgomery form defines the curve E_{a,d} whenever 3·(−1)² = 3 is a quadratic residue in F_p. We have p ≡ 1 mod 4 from the 2-adicity condition, so by quadratic reciprocity the Legendre symbol (3 | p) equals (p | 3), which is always equal to 1 for all BLS curves (x ≡ 1 mod 3 and x − 1 | p − 1). More generally, one can prove that when p = 2 or p ≡ 1 or 11 mod 12, then 3 is a quadratic residue in F_p. For inner BLS curves, we have p ≡ 1 mod 3·2^L with L ≥ 2.
A 2-cycle is a 2-chain where both curves are inner and outer curves with respect to each other (cf. Fig. 2). This means that both curves in a 2-cycle admit a twisted Edwards form, following the same reasoning as in subsection 3.1. In the sequel we focus on the case of BLS12 inner curves that form a 2-chain, but we stress that these results apply to 2-chain inner curves from other families (e.g. BLS24 and BN [AHG22]) and to 2-cycles as well.

The bucket method
The high-level strategy of the bucket-method MSM can be given in three steps:

Step 1: reduce the b-bit MSM to several c-bit MSMs. Choose a window c ≤ b, write each scalar in binary form and partition it into c-bit parts, then deduce b/c instances of c-bit MSMs from the partitioned scalars. The cost of Step 1 is negligible.

Step 2: solve each c-bit MSM T_j efficiently
1. For each T_j, accumulate the bases G_i inside buckets. Each digit a_{i,j} is in the set {0, 1, 2, …, 2^c − 1}. We initialize 2^c − 1 empty buckets (with points at infinity) and accumulate the bases G_i from each T_j inside the bucket corresponding to the digit a_{i,j}.

2. Combine the buckets to compute T_j.
This step is also a c-bit MSM of size 2^c − 1, but this time the scalars are ordered and known in advance: T_j = [1]S_1 + [2]S_2 + … + [2^c − 1]S_{2^c−1}, where S_k denotes the k-th bucket. Thus we can compute this instance efficiently with a running sum, iterating over the buckets from S_{2^c−1} down to S_1 and adding each running sum into the total.

Step 3: combine the c-bit MSMs into the final b-bit MSM
Algorithm 3 gives an iterative way to combine the small MSMs into the original MSM. Combining Steps 1, 2 and 3, the expected overall cost of the bucket method is roughly (b/c)·(n + 2^c) group operations. Remark 1 (On choosing c). The theoretical minimum occurs at c ≈ log n and the asymptotic scaling looks like O(b·n/log n). However, in practice, empirical choices of c yield better performance, because the memory usage scales with 2^c and there are fewer edge cases if c divides b. For example, with n = 10^7 and b = 256, we observed peak performance at c = 16 instead of c = log n ≈ 23.
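The three steps can be sketched end-to-end over a toy additive group (int64 under addition, so the expected MSM value is simply Σ a_i·G_i). The window size, constants and names below are illustrative choices, not the paper's implementation; only the group operation is simplified, the windowing, bucketing and running-sum logic is the real algorithm.

```go
package main

import "fmt"

const c = 4  // window size in bits
const b = 64 // scalar size in bits

// bucketMSM runs Steps 1-3 of the bucket method over the toy group.
func bucketMSM(scalars []uint64, bases []int64) int64 {
	windows := (b + c - 1) / c
	var total int64
	// Step 3 processes windows from the most significant one down,
	// doubling c times before adding each window's partial MSM.
	for w := windows - 1; w >= 0; w-- {
		for i := 0; i < c; i++ {
			total += total // doubling
		}
		// Step 2: accumulate bases into 2^c - 1 buckets for window w
		// (Step 1 is the digit extraction below).
		var buckets [1<<c - 1]int64
		for i, s := range scalars {
			digit := (s >> uint(w*c)) & (1<<c - 1)
			if digit != 0 {
				buckets[digit-1] += bases[i] // bucket k holds digit k+1
			}
		}
		// Combine buckets with a running sum:
		// T = sum_k (k+1)*buckets[k], two additions per bucket.
		var running, T int64
		for k := len(buckets) - 1; k >= 0; k-- {
			running += buckets[k]
			T += running
		}
		total += T
	}
	return total
}

func main() {
	fmt.Println(bucketMSM([]uint64{5, 11, 30000}, []int64{7, 13, 2})) // 5*7 + 11*13 + 30000*2 = 60178
}
```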

Parallelism
Since each c-bit MSM is independent of the others, we can compute each one (Step 2) on a separate core. This makes full use of up to b/c cores, but increases memory usage as each core needs its own 2^c − 1 buckets (points). If more than b/c cores are available, further parallelism does not help much, because m MSM instances of size n/m cost more than one MSM instance of size n.
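A minimal sketch of this window-level parallelism, reusing the same toy additive group as above (one goroutine per c-bit window, then a sequential Step-3 combine); all names and constants are ours, not the paper's code.

```go
package main

import (
	"fmt"
	"sync"
)

const c = 4  // window size in bits
const b = 64 // scalar size in bits

// windowSum computes the c-bit MSM T_w for window w (Step 2),
// over the toy group int64 under addition.
func windowSum(w int, scalars []uint64, bases []int64) int64 {
	var buckets [1<<c - 1]int64
	for i, s := range scalars {
		if d := (s >> uint(w*c)) & (1<<c - 1); d != 0 {
			buckets[d-1] += bases[i]
		}
	}
	var running, T int64
	for k := len(buckets) - 1; k >= 0; k-- {
		running += buckets[k]
		T += running
	}
	return T
}

// parallelMSM computes each window on its own goroutine (up to b/c
// cores are useful), then combines the partial sums sequentially.
func parallelMSM(scalars []uint64, bases []int64) int64 {
	windows := b / c
	partial := make([]int64, windows)
	var wg sync.WaitGroup
	for w := 0; w < windows; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			partial[w] = windowSum(w, scalars, bases)
		}(w)
	}
	wg.Wait()
	var total int64
	for w := windows - 1; w >= 0; w-- { // Step 3: sequential combine
		for i := 0; i < c; i++ {
			total += total
		}
		total += partial[w]
	}
	return total
}

func main() {
	fmt.Println(parallelMSM([]uint64{5, 11, 300}, []int64{7, 13, 2})) // 5*7 + 11*13 + 300*2 = 778
}
```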

Precomputation
When the bases G_1, …, G_n are known in advance, we can use a smooth trade-off between precomputed storage and run time. For each base G_i, choose k as big as the storage allows, precompute k multiples of G_i, and use the bucket method with only 2^c − 1 − k buckets instead of 2^c − 1. The total cost becomes ≈ (b/c)·(n + 2^c − k). However, large MSM instances already use most of the available memory. For example, when n = 10^8 our implementation needs 58GB to store enough BLS12-377 curve points to produce a Groth16 [Gro16] proof. Hence, the precomputation approach yields a negligible improvement in our case.

Algebraic structure
Since the bases G_1, …, G_n are points in G_1 (or G_2), we can use the algebraic structure of elliptic curves to further optimize the bucket method.

Non-Adjacent-Form (NAF).
Given a point G_i = (x, y) ∈ G_1 (or G_2), on a Weierstrass curve for instance, the negative −G_i is (x, −y). This observation is well known to speed up the scalar multiplication [s]G_i by encoding the scalar s in a signed binary form over {−1, 0, 1} (later called 2-NAF; its first usage might go back to 1989 [MO90]). However, this does not help in the bucket method, because the cost increases with the number of possible scalars regardless of their encodings: for a c-bit scalar, we always need 2^c − 1 buckets. That said, we can use the 2-NAF decomposition differently. Instead of writing the c-bit scalars in the set {0, …, 2^c − 1}, we write them in the signed set {−2^{c−1}, …, 2^{c−1} − 1} (cf. Alg. 4). If a digit a_{i,j} is strictly positive we add G_i to the bucket S_{a_{i,j}} as usual, and if a_{i,j} is strictly negative we add −G_i to the bucket S_{|a_{i,j}|}. This way we reduce the number of buckets by half.
Algorithm 4: Signed-digit decomposition. The signed-digit decomposition cost is negligible, but it works only if the bit size of #G_1 (and #G_2) is strictly bigger than b: we use the spare bits to avoid an overflow. This observation should be taken into account at the curve design level.
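A sketch of the signed-digit decomposition (our own rendering, not a quote of Alg. 4): any digit ≥ 2^{c−1} is replaced by d − 2^c with a carry of 1 into the next window, and one spare window absorbs the final carry, which is where the spare bits of the group order come in.

```go
package main

import "fmt"

// signedDigits rewrites a scalar in base 2^c with digits in
// [-2^(c-1), 2^(c-1)-1]. The extra window absorbs the final carry.
func signedDigits(s uint64, c, windows int) []int64 {
	digits := make([]int64, windows)
	carry := int64(0)
	for w := 0; w < windows; w++ {
		d := int64((s>>(uint(w)*uint(c)))&(1<<uint(c)-1)) + carry
		carry = 0
		if d >= 1<<(uint(c)-1) {
			d -= 1 << uint(c) // make the digit negative...
			carry = 1         // ...and push a carry up
		}
		digits[w] = d
	}
	return digits
}

func main() {
	c, windows := 4, 17 // 16 windows cover 64 bits, plus one for the carry
	s := uint64(0xDEADBEEF)
	digits := signedDigits(s, c, windows)
	// Recompose: sum of digits[w]·2^(c·w) must give s back.
	var back int64
	for w := windows - 1; w >= 0; w-- {
		back = back<<uint(c) + digits[w]
	}
	fmt.Println(uint64(back) == s) // true
}
```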

Curve forms and coordinate systems.
To minimize the overall cost of storage but also run time, one can store the bases G_i in affine coordinates. This way we only need the tuples (x_i, y_i) for storage (although we can batch-compress these following [Kos21]), and we can make use of mixed additions with different coordinate systems.
The overall cost of the bucket method is (b/c)·(n + 2^{c−1}) + (b − c + b/c − 1) group operations. This can be broken down explicitly into mixed additions (bucket accumulation), additions (bucket combination), and the additions and doublings of the final window combination. For large MSM instances, the dominating cost is in the mixed additions, as it scales with n. For this step, we use extended Jacobian coordinates {X, Y, ZZ, ZZZ} (x = X/ZZ, y = Y/ZZZ, ZZ³ = ZZZ²), trading off memory for run time compared to the usual Jacobian coordinates {X, Y, Z} (cf. Table 2). We work over fields of large prime characteristic (≠ 2, 3), so the elliptic curves in question always have a short Weierstrass (SW) form y² = x³ + ax + b.

Table 3: Cost of mixed addition in different elliptic curve forms and coordinate systems, assuming 1m = 1s. Formulas and references from [BL22].

On a short Weierstrass curve, the fastest mixed addition is achieved using extended Jacobian coordinates. However, there are other curve forms that enable even faster mixed additions (cf. Table 3). A twisted Edwards (tEd) form is appealing for the bucket method, since it has the lowest cost for the mixed addition in extended coordinates. Furthermore, the arithmetic on this form is complete, i.e. the addition formulas are defined for all inputs. This improves the run time by eliminating the need for branching when adding the neutral element or doubling, compared to a SW form. We showed in Lemma 2 that all inner BLS curves admit a tEd form.
• To combine the bucket sums we use unified additions (9m) to keep track of the running sum and unified re-additions (8m) to keep track of the total sum.We save 1m by caching the multiplication by 2d from the running sum.
• To accumulate the G_i in the c-bit MSMs, we use unified re-additions with some precomputation. Instead of storing G_i in affine coordinates, we store them in a custom coordinate system (X, Y, T) where X = y − x, Y = y + x and T = 2d·x·y. This saves 1m and 2a (additions) at each accumulation of G_i.
We note that although the dedicated addition (resp. the dedicated mixed addition) in [HWCD08, Sec. 3.2] saves the multiplication by 2d, it costs 4m (resp. 2m) to check the operand equality. This cost offset makes both the dedicated (mixed) addition and the dedicated doubling slower than the unified (mixed) addition in the MSM case. We also note that the conversion of all the G_i points given on a SW curve with affine coordinates to points on a tEd curve (with a = −1) with the custom coordinates (X, Y, T) is a one-time computation dominated by a single inverse, using the Montgomery batch trick. In SNARKs, since the G_i are points from the proving key, this computation can be part of the Setup algorithm and does not impact the Prove algorithm. If the Setup ceremony is yet to be conducted, it can be performed directly with points in twisted Edwards form.
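The custom-coordinate re-addition can be sketched with the unified a = −1 formulas of [HWCD08], here over a toy curve −x² + y² = 1 + 3x²y² over F_13 (parameters chosen only so the arithmetic is easy to check by hand; this is not a SNARK curve, and the names are ours).

```go
package main

import "fmt"

const p = 13 // toy field
const d = 3  // toy twisted Edwards coefficient, a = -1

func mod(a int64) int64 { return ((a % p) + p) % p }

func inv(a int64) int64 { // a^(p-2) mod p by Fermat, p prime
	r := int64(1)
	for i := 0; i < p-2; i++ {
		r = mod(r * a)
	}
	return r
}

type ext struct{ X, Y, Z, T int64 }   // accumulator, T = XY/Z
type pre struct{ ymx, ypx, td int64 } // stored base: y-x, y+x, 2d*x*y

// madd is the unified a = -1 addition with the multiplication by 2d
// folded into the precomputed base triple, as described above.
func madd(q ext, r pre) ext {
	A := mod((q.Y - q.X) * r.ymx)
	B := mod((q.Y + q.X) * r.ypx)
	C := mod(q.T * r.td)
	D := mod(2 * q.Z)
	E, F, G, H := mod(B-A), mod(D-C), mod(D+C), mod(B+A)
	return ext{X: mod(E * F), Y: mod(G * H), Z: mod(F * G), T: mod(E * H)}
}

func main() {
	// P = (2, 3) lies on the curve; re-add P to itself to get 2P.
	P := ext{X: 2, Y: 3, Z: 1, T: mod(2 * 3)}
	Pp := pre{ymx: mod(3 - 2), ypx: mod(3 + 2), td: mod(2 * d * 2 * 3)}
	R := madd(P, Pp)
	zi := inv(R.Z)
	fmt.Println(mod(R.X*zi), mod(R.Y*zi)) // affine 2P = (5, 0), prints: 5 0
}
```

Because the formulas are unified, the same routine handles the doubling case above and the addition of the neutral element (0, 1) without branching.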
Our implementation shows that an MSM instance of size 2^16 on the BLS12-377 curve is 30% faster when the G_i points are given on a tEd curve with the custom coordinates, compared to the extended-Jacobian version which takes points in affine coordinates on a SW curve.

Implementation
We implemented our algorithm in Go. We benchmarked the implementation against the arkworks Rust library [aC22], a library widely used in SNARK projects. We chose two different CPU architectures: an x86 z1d.large AWS machine (Intel Xeon Platinum 8151 CPU @ 3.40GHz) and an arm64 Samsung Galaxy A13 5G (model SM-A136ULGDXAA with the MediaTek Dimensity 700 (MT6833) SoC) running Android 12 (API level 32).
We achieved a speedup of 40-47% for MSM instances of sizes ranging from 2^8 to 2^18. The source code is available under the MIT or Apache2 licenses at https://github.com/gbotrel/zprize-mobile-harness. The speedup against arkworks comes from the algorithmic optimizations discussed in this paper together with the bigint arithmetic optimizations.
Finite field arithmetic implementation on arm64. We use a Montgomery CIOS variant to handle the field multiplication (details of the algorithms and proofs are in Section 2).
The two inner loops (line 3 and line 9 in Alg. 2) have the same form. They perform one word × word multiplication and two word + word additions. These additions can overflow, and the two distinct carry chains need to be propagated up to the last iteration.
For efficiency reasons, it is highly desirable to keep the carry in the CPU flags (i.e. avoid moving it to a register) between the additions.
On x86 architectures, we leverage the ADCX, ADOX and MULX instructions to efficiently handle the interleaved carry chains in the algorithm.ADCX and ADOX perform unsigned addition with carry using distinct CPU flags, while MULX performs unsigned multiply without affecting flags.
On arm64, we split the inner loops in two to ensure the carry propagations are uninterrupted. In the first part, we multiply and propagate the first carry chain from the lo word. The large number of available registers (in practice 28 for arm64 against 14 for x86) allows us to store this intermediary result in registers.
In the second part, we propagate the second carry chain from the hi word. The same technique is used for the squaring function, except that we have three carries to propagate since we double the intermediate product (line 5 in Alg. 5).
The impact of these optimizations is ∼17% for the F_p multiplication and ∼25% for the squaring. For an ext-Jac MSM instance of size 2^16, the timing was 821ms before these arm64 field arithmetic optimizations and 620ms after. For the tEd-custom version, the speedup comes only from the F_p multiplication, since there are no squarings in the mixed addition. For this same version, we stored (y − x, y + x) in the coordinate system instead of (x, y) and added ∼40 lines of arm64 assembly for a small function in F_p (Butterfly(a, b) → (a + b, a − b)). The butterfly's performance impact was ∼5%, as it speeds up the unified (mixed) addition in the tEd form.
We report in Figure 3 a comparison of our code to the arkworks baseline on the Samsung Galaxy A13, and in Figure 4 the comparison on the x86 AWS machine. We report timings of several MSM instances of different sizes (powers of 2) and with different curve parameterizations (SW in extended Jacobian vs. tEd (a = −1) in custom/extended coordinates). For sizes ranging from 2^8 to 2^18, the speedup is 40-47% with the tEd version and 20-35% with SW-extJac.

Conclusion
Multi-scalar multiplication dominates the proving cost in most elliptic-curve-based SNARKs. Inner curves such as BLS12-377 are optimized elliptic curves suitable both for proving general-purpose statements and, in particular, for proving composition and recursive statements. Hence, it is critical to aggressively optimize the computation of MSM instances on these curves. We showed that our work yields a very fast implementation both when the points are given on a short Weierstrass curve and, even more so, when the points are given on a twisted Edwards curve. We showed that the latter is always possible for inner curves such as BLS12-377 and that the conversion cost is a one-time computation that can be performed in the Setup phase. We note that, more generally, these tricks apply to any elliptic curve that admits a twisted Edwards form, particularly SNARK-friendly 2-cycles of elliptic curves. We suggest that this should be taken into account at the design level of SNARK-friendly curves.
Open question: for the Groth16 SNARK [Gro16], the same scalars a_i are used for two MSMs on two different elliptic curves (the G_1 and G_2 MSMs, where these are the pairing groups [Cos12, Chapter 2]). We ask whether it is possible to share as much computation as possible between these two instances. It seems that moving to a type-2 pairing [GPS08] would allow deducing the G_1 instance from the G_2 one using an efficient homomorphism applied to the resulting single point (the Trace map, cf. [Cos12, Section 2.3.1]). However, the G_2 computations would then be done over the much slower full extension F_{p^k} (instead of F_{p^{k/d}}, where d is the twist degree and k the embedding degree of the curve). The pairing, needed for proof verification, would also be slightly slower (using the anti-Trace map, cf. [Cos12, Section 2.3.1]).

A Our optimized Montgomery squaring
The condition on the modulus differs here. However, the reasoning is similar to Section 2 and we end up with Algorithm 5.
² The requirement is that 3 is a quadratic residue. This is always the case in the field F_p over which an inner BLS curve is defined.

Figure 2: A cycle of elliptic curves.

• Step 1: reduce the b-bit MSM to several c-bit MSMs for some fixed c ≤ b
• Step 2: solve each c-bit MSM efficiently
• Step 3: combine the c-bit MSMs into the final b-bit MSM

4.1 Step 1: reduce the b-bit MSM to several c-bit MSMs
1. Choose a window c ≤ b
2. Write each scalar a_1, …, a_n in binary form and partition each into c-bit parts

Cost of Step 3: (b/c − 1)(c + 1) = b − c + b/c − 1 group operations.

• Mixed additions: to accumulate the G_i in the c-bit MSM buckets, with cost (b/c)·(n − 2^{c−1} + 1)
• Additions: to combine the bucket sums, with cost (b/c)·(2^c − 3)
• Additions and doublings: to combine the c-bit MSMs into the b-bit MSM, with cost b − c + b/c − 1 (that is, b/c − 1 additions and b − c doublings)

Figure 3: Comparison of our MSM code and the arkworks one for different instances on the BLS12-377 G_1 group on the Samsung Galaxy A13.
Figure 4: The same comparison on the x86 AWS machine.

Table 2: Cost of arithmetic in Jacobian and extended Jacobian coordinate systems. m = multiplication and s = squaring in the field.

Table 4: Comparison of the arkworks MSM and ours on instances of 2^16 G_1 points on the BLS12-377 curve.