Speeding Up Multi-Scalar Multiplication over Fixed Points Towards Efficient zkSNARKs

Abstract. The arithmetic of computing multiple scalar multiplications in an elliptic curve group and then adding them together is called multi-scalar multiplication (MSM). MSM over fixed points dominates the time consumption in the pairing-based trusted setup zero-knowledge succinct non-interactive argument of knowledge (zkSNARK), so fast algorithms for computing it are desirable for practical applications. This paper proposes a bucket set construction that can be utilized in the context of Pippenger's bucket method to speed up MSM over fixed points with the help of precomputation. When instantiated over the BLS12-381 curve to compute n-scalar multiplications for n = 2^e (10 ≤ e ≤ 21), theoretical analysis indicates that the proposed construction saves more than 21% of the computational cost compared to Pippenger's bucket method, and 2.6% to 9.6% compared to the most popular variant of Pippenger's bucket method. Finally, our experimental results demonstrate the feasibility of accelerating the computation of MSM over fixed points using large precomputation tables, as well as the effectiveness of our new construction.


Introduction
In recent years, zero-knowledge succinct non-interactive argument of knowledge (zkSNARK) has gained tremendous interest, from theoretical development to practical implementation, because it provides an elegant privacy protection solution. Popular examples include anonymous transactions in Zcash [BCG + 14] and smart contract verification over private inputs [Ebe] in Ethereum. Many zkSNARKs with trusted setup rely on pairing-based cryptography, which is very efficient in general. Groth et al. [GOS06,Gro06,Gro09,Gro10,GOS12,GS12] first introduced pairing-based zero-knowledge proofs, leading to extensive research work in this area [Lip12, GGPR13, DFGK14, Gro16, MBKM19, GWC19, CHM + 19, BFS20, BDFG21].
All pairing-based trusted setup zkSNARKs in the literature follow a common paradigm, where the prover computes a proof consisting of several points in an elliptic curve group by generic group operations and the verifier checks the proof by a number of pairings in the verification equation. Basically, it requires the prover and verifier to conduct their computation by only using linear operations to the points built in the common reference string. This computation is indeed MSM over fixed points. MSM dominates the overall time for generating and verifying the proof. Thus, fast algorithms for computing MSM over fixed points are desirable and necessary. MSM in those applications shows the characteristic of having a large amount of points. For example, one of the most classical zkSNARK applications is to prove the knowledge of preimage for a cryptographic hash function. When using the traditional SHA-256, which is compiled to an arithmetic circuit with 22,272 AND gates when the preimage is 512 bit [CGGN17], it will lead to the computation of MSM with more than 22,272 points. When utilizing zkSNARK-friendly hash function Poseidon [GKR + 21], the MSM still has hundreds of points.

Related work
The most popular method for scalar multiplication in an elliptic curve group is the binary algorithm, known as the double-and-add method (also known as the square-and-multiply method in the exponentiation setting) [Knu97, Section 4.6.3]. The GLV method [GLV01] and GLS method [GLS11] decompose the scalar into dimensions 2, 4, 6 and 8, then compute the corresponding MSM. When the point for scalar multiplication is fixed, precomputation can be used to reduce the computational cost. Knuth's 5-window algorithm utilizes the precomputation of 16 points to speed up scalar multiplication [Knu97,BC89]. If a bigger window and more storage for precomputed points are used, the windowing method can be even faster. Pippenger's bucket method and its variants decompose the scalar, sort all points into buckets with respect to their scalars, and finally utilize an accumulation algorithm to add them together [Pip76,BDLO12]. Another line of research lies in constructing new number systems to represent the scalar, such as basic digit sets [Mat82,BGMW95] and multi-base number systems [DKS09,SIM12,YWLT13]. Researchers also try to make the addition arithmetic more efficient by using different curve representations, such as projective coordinates and Jacobian coordinates that eliminate the inversion operations, and the Montgomery form that utilizes only the x-coordinate [Mon87]. Differential addition chains (DACs) are used in conjunction with x-only coordinate systems, for example, PRAC chains [Mon92], DJB chains [Ber06] and other multi-dimensional DACs [Bro15,Rao15]. Most of the aforementioned techniques can be applied to MSM where the number of points is small.
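The double-and-add method mentioned above is the baseline that the later algorithms improve on. Below is a minimal sketch over a generic group; the group is simulated with integers under addition (so k·P is plain multiplication), and the names `double_and_add`, `add`, `double` are illustrative, not from any particular library.

```python
def double_and_add(k, P, add, double):
    """Left-to-right binary scalar multiplication: returns k*P.

    `add`/`double` are the group operations; on a real curve they would be
    point addition and point doubling."""
    assert k > 0
    bits = bin(k)[2:]          # binary expansion, most significant bit first
    R = P                      # the leading bit is always 1
    for bit in bits[1:]:
        R = double(R)          # one doubling per remaining bit
        if bit == '1':
            R = add(R, P)      # one addition per set bit
    return R

# Toy check over the additive group of integers, where k*P is just k·P.
print(double_and_add(0b1011011, 7, lambda a, b: a + b, lambda a: 2 * a))  # → 637
```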
When the number of points in MSM is big, which is the situation in pairing-based trusted setup zkSNARK applications, Pippenger's bucket method and its variants are the state-of-the-art algorithms that outperform other competitors. Bernstein et al. [BDLO12] investigated Bos-Coster method [DR94, Section 4], Straus method [Str64] and Pippenger's bucket method, then chose Pippenger's bucket method to implement batching forgery identification for elliptic curve signatures, which marks the beginning of extensive deployment of Pippenger's bucket method for computing MSM with a big number of points.

Our contribution
This paper proposes a new bucket set construction that yields an efficient algorithm to compute MSM over fixed points in the context of Pippenger's bucket method. Our construction targets n-scalar multiplication with 2^10 ≤ n ≤ 2^21, which is desirable for many pairing-based trusted setup zkSNARK applications. Our main contributions are summarized as follows.
• A new subsum accumulation algorithm. After sorting points into buckets with respect to their scalars, Pippenger's bucket method computes intermediate subsums and utilizes an accumulation algorithm to add those subsums together. The original subsum accumulation Algorithm 1, presented in Section 2.3, is applicable when the scalars in the bucket set are consecutive. When the scalars in the bucket set are inconsecutive, Algorithm 1 would be less efficient. This paper proposes a new subsum accumulation Algorithm 3 that accumulates m intermediate subsums using at most 2m + d − 3 additions, where d is the maximum difference between two neighboring elements in the bucket set.
• A construction of a bucket set that yields an efficient algorithm to compute MSM over fixed points. The proposed bucket set construction carefully selects integer elements from [0, q/2] so that for all t (0 ≤ t ≤ q), there exist an integer b in the bucket set and an integer m ∈ {1, 2, 3} such that t = mb or t = q − mb. When instantiating over the BLS12-381 curve [Bow17], this construction yields an algorithm that takes advantage of 3nh precomputed points to evaluate the n-scalar multiplication over fixed points, where all scalars are smaller than a 255-bit prime r, using at most approximately (nh + 0.21q) · A additions, where h = ⌈log_q r⌉. The theoretical analysis shows that for n = 2^e (10 ≤ e ≤ 21), the proposed algorithm saves more than 21% of the computational cost compared to Pippenger's bucket method, and 2.6% to 9.6% compared to the most popular variant of Pippenger's bucket method, which is reviewed in Section 2.3.2.
• The feasibility of accelerating the computation of MSM by taking advantage of large precomputation tables, and the effectiveness of our new construction, are demonstrated by our implementation. We implemented the popular variant of Pippenger's bucket method and our construction based on the BLS12-381 library blst [bls]. When computing n-scalar multiplication over fixed points in the BLS12-381 curve groups, the experimental results show that the proposed construction saves more than 17.7% of the computing cost compared to the Pippenger's bucket method implementation built into blst for n = 2^e (10 ≤ e ≤ 21), and that it saves 3.1% to 9.2% of the computing cost compared to the variant of Pippenger's bucket method for n = 2^e (10 ≤ e ≤ 21, e ≠ 16, 20).
The paper is organized as follows. In Section 2, several popular MSM algorithms, including Pippenger's bucket method and one of its popular variants, are reviewed. We then propose a new subsum accumulation algorithm in Section 3. In Section 4, we present a framework for computing MSM over fixed points that takes advantage of precomputation; this framework is used to derive our new MSM algorithm. Section 5 is dedicated to the construction of our new bucket set and multiplier set. We instantiate our construction over the BLS12-381 curve in Section 6 and give a theoretical time complexity analysis. Finally, we present the implementation and experimental results in Section 7.
Let us first introduce the notation used throughout the paper before diving into the content. Notations. Without special explanation hereinafter, let E be an elliptic curve group and r be its order. Let ⌊x⌋ be the largest integer that is equal to or smaller than x, and ⌈x⌉ be the smallest integer that is equal to or greater than x. Let || be bit string concatenation. Notation S_{n,r} represents the following MSM over fixed points,

S_{n,r} = a_1·P_1 + a_2·P_2 + · · · + a_n·P_n, (1)

where the a_i's are scalars such that 0 ≤ a_i < r and the P_i's are fixed points in E. Radix q = 2^c is an integer used to express a scalar in its radix q representation. Integer h is the length of a scalar in its radix q representation, i.e., h = ⌈log_q r⌉. The term addition refers to the point addition arithmetic in E. For simplicity, let us assume the computational cost of a doubling and that of an addition in E are the same, denoted as A; this is the norm in Pippenger-like algorithms, where the major operations are additions. The storage size of a point is denoted as P.

Recap of multi-scalar multiplication methods
In this section we review several widely used methods that compute S_{n,r} with large n, namely the trivial method, Straus method, Pippenger's bucket method, and one variant of Pippenger's bucket method.

Trivial method
In the trivial method, each a_i·P_i in (1) is computed separately by the double-and-add method, and the n intermediate results are added together to obtain the final result. In the worst case, each scalar multiplication costs about 2·(⌊log_2 r⌋ − 1)·A, so the total cost of computing S_{n,r} is about

(2·(⌊log_2 r⌋ − 1)·n + (n − 1)) · A ≈ 2·n·log_2 r · A. (2)

If the non-adjacent form is used to represent the scalars a_i (i = 1, 2, · · · , n), then, because every non-zero digit has to be adjacent to two 0s, in the worst case half of the digits of a_i are non-zero. The cost of each scalar multiplication then drops to about (3/2)·log_2 r · A. The time complexity of computing S_{n,r} in the worst case is about

((3/2)·log_2 r · n + (n − 1)) · A ≈ (3/2)·n·log_2 r · A. (3)
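The non-adjacent form recoding used in the estimate above can be sketched as follows; `naf` is an illustrative helper name, and the digit rule d = 2 − (k mod 4) for odd k is the standard NAF recurrence.

```python
def naf(k):
    """Non-adjacent form of k > 0: digits in {-1, 0, 1}, least significant
    first, with no two adjacent non-zero digits."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)    # 1 if k ≡ 1 (mod 4), -1 if k ≡ 3 (mod 4)
            k -= d             # now k is divisible by 4 or equals 0
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

d = naf(239)
print(d)                                          # → [-1, 0, 0, 0, -1, 0, 0, 0, 1]
print(sum(di * 2**i for i, di in enumerate(d)))   # → 239
```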
Straus method
S_{n,2^{hc}} is exactly what we aim to compute, i.e., S_{n,r}. Straus method is only suitable for small n, because when n grows the precomputation becomes exponentially large. One variant that can be used for a large number n is to store only n·(2^c − 1) precomputed values, where c is a small integer. At the j-th iteration of (5)-(7) (j = 0, 1, 2, · · · , h − 1) in Straus method, separately add together the precomputed points a_{1j}P_1, a_{2j}P_2, · · · , a_{nj}P_n with n − 1 additions to obtain a_{1j}P_1 + a_{2j}P_2 + · · · + a_{nj}P_n.
The storage size would drop from 2^{nc} · P to n·(2^c − 1) · P. This process repeats h times, and each round conducts n additions and c doublings (the last round does not require the doublings), so the computational cost is approximately

(h·(n + c) − c) · A ≈ h·(n + c) · A. (8)
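The storage-reduced Straus variant described above can be sketched as follows, again simulating the group with integers so the result can be checked by plain arithmetic; the function name and parameters are illustrative.

```python
def straus_msm(scalars, points, c, h):
    """Straus interleaving with an n·(2^c − 1)-entry table, simulated over
    the integers (point "addition" is +, so the result equals Σ a_i·P_i)."""
    # Per-point table: table[i][m - 1] = m * P_i for m = 1 .. 2^c − 1.
    table = [[m * P for m in range(1, 2**c)] for P in points]
    acc = 0
    for j in reversed(range(h)):           # most significant window first
        for _ in range(c):                 # c doublings per iteration
            acc = acc + acc
        for i, a in enumerate(scalars):
            digit = (a >> (c * j)) & (2**c - 1)
            if digit:
                acc = acc + table[i][digit - 1]
    return acc

scalars, points = [23, 5, 200], [3, 11, 7]
print(straus_msm(scalars, points, c=4, h=2))  # → 23*3 + 5*11 + 200*7 = 1524
```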

Pippenger's bucket method

Algorithm 1 Subsum accumulation algorithm I
The correctness of Algorithm 1 is ensured by the following equation,

Σ_{b=1}^{2^c−1} b·S_b = Σ_{k=1}^{2^c−1} (S_k + S_{k+1} + · · · + S_{2^c−1}). (9)

The computation of S_{n,2^c} costs n − (2^c − 1) + 2·(2^c − 2) ≈ n + 2^c additions. The computational cost of S_{n,r} is thus approximately h·(n + 2^c) · A. Compared to (8), at first glance Pippenger's bucket method seems less efficient than Straus method, but this may not hold for large n: since there is no precomputation requirement in Pippenger's bucket method, a bigger c can be selected to minimize the overall computational cost.
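The running-sum trick behind Algorithm 1 — accumulating Σ b·S_b for consecutive buckets 1..B in about 2(B − 1) additions — can be sketched as below (integer simulation; `accumulate_consecutive` is an illustrative name). The zero-initialized accumulators model the cost-free additions with addend 0.

```python
def accumulate_consecutive(subsums):
    """Subsum accumulation for consecutive buckets 1..B: returns Σ b·S_b
    using the running-sum trick, about 2(B − 1) group additions."""
    running = 0   # running = S_B + S_{B-1} + ... + S_b after step for bucket b
    total = 0     # total accumulates the running sums, giving Σ b·S_b
    for S in reversed(subsums):   # b = B down to 1
        running = running + S
        total = total + running
    return total

S = [5, 0, 7, 2]                  # subsums for buckets 1, 2, 3, 4
print(accumulate_consecutive(S))  # → 1*5 + 2*0 + 3*7 + 4*2 = 34
```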

The variant
In the aforementioned Pippenger's bucket method, one downside is that Algorithm 1 runs h times. If storage is available for precomputation, this shortcoming can be circumvented by the variant presented in [BGMW95]. Choose a radix q = 2^c and partition each a_i (i = 1, 2, · · · , n) into segments as follows,

a_i = a_{i,0} + a_{i,1}·q + · · · + a_{i,h−1}·q^{h−1}, 0 ≤ a_{ij} < q.

We precompute the points

P_{ij} = q^j·P_i (i = 1, · · · , n; j = 0, · · · , h − 1),

which requires storage of size nh · P. Then S_{n,r} = S_{nh,q} can be computed by using Algorithm 1 only once. The computational cost is approximately (nh + q) · A.
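A sketch of this variant under the same integer simulation: with the table P_ij = q^j·P_i precomputed, the h-digit MSM collapses to a single bucket pass over nh (digit, point) pairs. Names and parameters are illustrative.

```python
def pippenger_variant(scalars, points, c, h):
    """BGMW-style variant of Pippenger's bucket method, simulated over the
    integers so the result can be checked against Σ a_i·P_i."""
    q = 2**c
    # Precomputation: nh points P_ij = q^j * P_i (storage nh·P).
    pre = {(i, j): (q**j) * P for i, P in enumerate(points) for j in range(h)}
    buckets = [0] * q                      # buckets[b] = Σ of points whose digit is b
    for i, a in enumerate(scalars):
        for j in range(h):
            digit = (a >> (c * j)) & (q - 1)
            buckets[digit] += pre[(i, j)]
    # Accumulate Σ b·buckets[b] with the Algorithm 1 running-sum trick.
    running = total = 0
    for b in range(q - 1, 0, -1):
        running += buckets[b]
        total += running
    return total

print(pippenger_variant([23, 5, 200], [3, 11, 7], c=4, h=2))  # → 1524
```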

Further optimization
Pippenger's bucket method and the variant can be further optimized by halving the size of the bucket set. Let radix q = 2^c. Using the observation that in an elliptic curve group −P is obtained from P by negating its y coordinate at almost no cost, all the buckets can be restricted to scalars that are no more than q/2, provided that r ≤ (q/2)·q^{h−1}, where h = ⌈log_q r⌉. Algorithm 2 can be used to convert a scalar a (0 ≤ a < r) from its standard q-ary form to the representation where every digit is in the range [−q/2, q/2].

Algorithm 2 Scalar conversion I
The correctness of Algorithm 2 is straightforward. Notice that the assumption r ≤ (q/2)·q^{h−1} ensures that the most significant digit does not overflow. The time complexity of Pippenger's bucket method would thus drop to approximately

h·(n + q/2) · A, (13)

and the complexity of the variant to approximately

(nh + q/2) · A. (14)

Henceforward, when mentioning Pippenger's bucket method and Pippenger's variant, we refer to the algorithms whose time complexities are (13) and (14) respectively.
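One standard way to implement such a conversion (a sketch in the spirit of Algorithm 2, not necessarily its exact form) is to subtract q from any digit above q/2 and carry 1 into the next digit; the condition r ≤ (q/2)·q^{h−1} guarantees the final carry cannot overflow. The helper name is illustrative.

```python
def to_signed_digits(a, c, h):
    """Convert a to h base-q digits in [-q/2, q/2] (q = 2^c): whenever a raw
    digit exceeds q/2, replace it by (digit − q) and carry 1 upward.
    Valid when a ≤ (q/2)·q^(h−1), so the final carry is absorbed."""
    q = 2**c
    digits, carry = [], 0
    for _ in range(h):
        d = (a % q) + carry
        a //= q
        if d > q // 2:
            d -= q            # use −d; point negation is almost free on a curve
            carry = 1
        else:
            carry = 0
        digits.append(d)
    return digits

digs = to_signed_digits(0x7EEF, c=4, h=4)
print(digs)                                        # → [-1, -1, -1, 8]
print(sum(d * 16**j for j, d in enumerate(digs)))  # → 32495 (= 0x7EEF)
```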

Comparison of multi-scalar multiplication algorithms
We summarize in Table 1 the precomputation storage and the time complexity of computing S_{n,r} by the aforementioned methods together with our construction proposed in Section 5. Here q = 2^c and h = ⌈log_q r⌉, and radix q is selected to minimize the computational cost. The time complexities of Pippenger's bucket method and Pippenger's variant hold if r ≤ (q/2)·q^{h−1}. The time complexity of our construction holds when r/q^h is small.

Table 1: Comparison of MSM methods (q = 2^c, h = ⌈log_q r⌉).

Method | Precomputation storage | Time complexity
Trivial method | n · P | (3/2)·n·log_2 r · A
Straus method [Str64] | n·(2^c − 1) · P | h·(n + c) · A
Pippenger's bucket method [Pip76] | n · P | h·(n + q/2) · A
Pippenger's variant [BGMW95] | nh · P | (nh + q/2) · A
Our construction (Section 5) | 3nh · P | (nh + 0.21q) · A

A new subsum accumulation algorithm
During the computation of S_{n,r} by Pippenger's bucket method, after sorting every point into the bucket with respect to its scalar and computing the intermediate subsums S_i, what remains is invoking a subsum accumulation algorithm to compute S = b_1·S_1 + b_2·S_2 + · · · + b_m·S_m. When the bucket set {b_i}_{1≤i≤m} is not a sequence of consecutive integers, Algorithm 1 handles the case with less efficiency. One may utilize the Bos-Coster method [DR94, Section 4] to deal with this case, but it is a recursive algorithm and its complexity is not easy to analyze. Here we propose a straightforward algorithm to tackle this case.
Suppose 0 < b_1 < b_2 < · · · < b_m, and let d be the maximum difference between two neighboring elements (taking b_0 = 0); then S can be computed by Algorithm 3. If {b_i}_{1≤i≤m} is strictly increasing and k in line 4 goes through {1, 2, · · · , d}, then each iteration of the for loop (lines 2-6) executes exactly 2 additions. Since all d + 1 temp variables in tmp are initialized to 0, there are d + 1 additions with addend 0, which have no computational cost, so the for loop executes 2m − (d + 1) additions. Line 7 is computed by subsum accumulation Algorithm 1 with 2(d − 1) additions. In total, the cost of Algorithm 3 is 2m + d − 3 additions.

Algorithm 3 Subsum accumulation algorithm II
If {b_i}_{1≤i≤m} is not strictly increasing, meaning that k in line 4 sometimes equals 0, the corresponding for iteration executes only one addition, skipping the if branch.
If k in line 4 does not go through all integers in {1, 2, · · · , d}, there exists some tmp[k] (1 ≤ k ≤ d) that skips the for loop and stays at 0. In the for loop, the addition saved by the fact that tmp[k] is initialized to 0 is no longer saved. Meanwhile, when line 7 is executed, at least one addition is saved because tmp[k] = 0, so the total cost does not increase.
To sum up, the cost of Algorithm 3 in the worst case is (2m + d − 3) · A. When d = 1, Algorithm 3 degenerates to Algorithm 1.
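Under the assumptions above, the gap-bucketed accumulation of Algorithm 3 can be sketched as follows (integer simulation; `accumulate_gaps` is an illustrative name). The loop performs at most 2 additions per subsum, and the final Σ k·tmp[k] is finished with the Algorithm 1 running-sum trick.

```python
def accumulate_gaps(bucket_values, subsums):
    """Subsum accumulation over an inconsecutive, strictly increasing bucket
    set: returns Σ b_i·S_i in at most 2m + d − 3 additions, where d is the
    largest gap between neighboring bucket values (with b_0 = 0)."""
    m = len(bucket_values)
    d = max(b2 - b1 for b1, b2 in zip([0] + bucket_values, bucket_values))
    tmp = [0] * (d + 1)           # tmp[k] collects running sums whose gap is k
    running = 0
    for i in range(m - 1, -1, -1):            # largest bucket first
        running += subsums[i]                 # 1 addition
        k = bucket_values[i] - (bucket_values[i - 1] if i > 0 else 0)
        tmp[k] += running                     # 1 addition (skipped when k = 0)
    # Σ b_i·S_i = Σ_k k·tmp[k]: finish with Algorithm 1 on tmp[1..d].
    acc = total = 0
    for k in range(d, 0, -1):
        acc += tmp[k]
        total += acc
    return total

B, S = [1, 3, 4, 9], [5, 2, 7, 1]
print(accumulate_gaps(B, S))       # → 1*5 + 3*2 + 4*7 + 9*1 = 48
```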

A framework of computing multi-scalar multiplication over fixed points
The following framework is inspired by Brickell et al. [BGMW95], who presented a similar method to compute single scalar multiplication using the notion of basic digit sets. They did not consider the possible overflow of the most significant digit of a scalar, which is not a big issue in single scalar multiplication but matters in MSM S_{n,r}, because the overflow increases the computational cost by at most n additions. Here we give a straightforward illustration of the framework without involving basic digit sets. Suppose we are going to compute S_{n,r}. Let M be a set of integers, and B a set of non-negative integers with 0 ∈ B. Suppose every scalar a_i (0 ≤ a_i < r) in its radix q representation

a_i = Σ_{j=0}^{h−1} a_{ij}·q^j

has each digit a_{ij} expressible as the product of an element from M and an element from B, i.e., a_{ij} = m_{ij}·b_{ij} with m_{ij} ∈ M and b_{ij} ∈ B. Then S_{n,r} can be computed as follows,

S_{n,r} = Σ_{i=1}^{n} Σ_{j=0}^{h−1} m_{ij}·b_{ij}·q^j·P_i.

Denote P_{ij} = m_{ij}·q^j·P_i, then

S_{n,r} = Σ_{i=1}^{n} Σ_{j=0}^{h−1} b_{ij}·P_{ij}. (17)

Suppose those nh|M| points are precomputed, and define the intermediate subsums

S_k = Σ_{(i,j): b_{ij}=k} P_{ij}, k ∈ B.

Equation (17) can be evaluated by first computing all the S_k's (k ∈ B) with at most nh − (|B| − 1) additions; the reason is straightforward, since nh points are sorted into |B| − 1 subsums. The remainder is computed by Algorithm 3 with at most 2(|B| − 1) + d − 3 additions, where d is the maximum difference between two neighboring elements in B.
To sum up, the worst-case time complexity of computing S_{n,r} is

(nh − (|B| − 1) + 2(|B| − 1) + d − 3) · A = (nh + |B| + d − 4) · A, (19)

where h = ⌈log_q r⌉, with the help of nh|M| precomputed points. Set M is called a multiplier set, because the set of precomputed points contains the points multiplied by every element from M. Set B is called a bucket set, since all points are sorted into subsum buckets with respect to the scalars in B. This framework is translated into Algorithm 4.
Algorithm 4 Multi-scalar multiplication over fixed points
Input: scalars a_1, a_2, · · · , a_n, fixed points P_1, P_2, · · · , P_n, radix q, scalar length h, multiplier set M, bucket set B, precomputed points P_{ij}.
1: Precompute a hash table mindex to record the index of every multiplier, such that mindex[m_k] = k, and a hash table bindex to record the index of every bucket, such that bindex[b_k] = k.
2: Convert every a_i to its standard q-ary form, then convert it to a_i = Σ_{j=0}^{h−1} m_{ij}·b_{ij}·q^j.
3: Create a length-nh scalar array scalars recording the bucket values b_{ij}, and a length-nh array points recording the index of the corresponding precomputed points.
4: Sort the points into buckets: for every non-zero entry of scalars, add the corresponding precomputed point into the bucket indexed by bindex.
5: Accumulate the buckets by Algorithm 3 to obtain S_{n,r}.
If we denote the expected number of zero elements in the length-nh array scalars as f, and assume all elements in the length-|B| array buckets in Step 5 are non-zero, then the average cost can be estimated as

(nh − f + |B| + d − 4) · A. (21)

From (19) and (21) we can see that, given n and r, in order to reduce the time complexity of computing S_{n,r}, we can choose a larger radix q to make h smaller, or find a smaller bucket set B. These two alternatives are closely related.
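The whole framework can be sketched end to end with a deliberately simple multiplier/bucket pair: M = powers of two and B = {0} ∪ odd values below q, so every digit t factors as t = m·b. This is NOT the paper's construction, only the smallest valid pair illustrating the framework; the group is again simulated with integers.

```python
def framework_msm(scalars, points, c, h):
    """MSM framework of Section 4 with a toy pair: M = {1, 2, 4, ..., q/2},
    B = {0} ∪ odd values < q, so every digit t = m·b with b odd."""
    q = 2**c
    # Precomputed table: every point times every m ∈ M and every power q^j.
    pre = {(i, j, m): m * (q**j) * P
           for i, P in enumerate(points) for j in range(h)
           for m in (2**e for e in range(c))}
    buckets = {b: 0 for b in range(1, q, 2)}     # one subsum per odd bucket b
    for i, a in enumerate(scalars):
        for j in range(h):
            t = (a >> (c * j)) & (q - 1)
            if t == 0:
                continue                         # zero digits cost nothing
            m = t & -t                           # largest power of 2 dividing t
            b = t // m                           # odd cofactor: the bucket value
            buckets[b] += pre[(i, j, m)]
    # Final accumulation; in the paper this step is Algorithm 3, but plain
    # multiplication suffices for the integer simulation.
    return sum(b * S for b, S in buckets.items())

print(framework_msm([23, 5, 200], [3, 11, 7], c=4, h=2))  # → 1524
```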
Here are two examples of utilizing this framework. Example 2. For radix q = 2^c such that q^{h−1} < r ≤ (1/4)·q·q^{h−1}, denote λ = q mod 3, λ ∈ {1, 2}. The multiplier set and the corresponding bucket set can be picked so that this is a valid construction with |B| ≤ q/3 + 2; thus the pair (M, B) yields an algorithm to compute S_{n,r} using at most approximately (nh + q/3) · A additions.

A construction of multiplier set and bucket set
In this section, we construct a pair of multiplier set and bucket set (M, B) that can be utilized to speed up the computation of S_{n,r} under the framework presented in Section 4. The essential difficulty in reducing the size of B is to make sure that every scalar in [0, r − 1] can be converted to a radix q representation in which every digit is the product of an element from M and an element from B. Given a scalar a (0 ≤ a < r) in its standard q-ary representation

a = Σ_{j=0}^{h−1} a_j·q^j, 0 ≤ a_j < q, (24)

we will show that our construction enables the scalar conversion from the standard q-ary form to the required radix q representation, thus yielding an efficient S_{n,r} computation algorithm.

Our construction
For radix q = 2^c (10 ≤ c ≤ 31), the multiplier set is picked as

M = {±1, ±2, ±3}. (25)

Bucket set B is established by an algorithm. In order to determine B, let us first define three auxiliary sets B_0, B_1 and B_2. Let r_{h−1} = ⌈r/q^{h−1}⌉ be the maximum leading term of a scalar in its standard q-ary expression. The definitions use ω_2(b), the exponent of factor 2 in b, and ω_3(b), the exponent of factor 3 in b; for instance, if b = 2^e·k with 2 ∤ k, then ω_2(b) = e. From the definitions, B_0 (resp. B_2) has the property that for all 0 ≤ t ≤ q/2 (resp. 0 ≤ t ≤ r_{h−1} + 1), there exist an element b ∈ B_0 (resp. b ∈ B_2) and an integer m ∈ {1, 2, 3} such that t = mb.
Set B_0 itself is a valid bucket set construction, which was also mentioned in [BGMW95].
Since we can utilize negative elements in M, there are redundant elements to be removed from B_0. Set B_1 is defined by Algorithm 5, and the following Property 1 holds for B_1.

Algorithm 5 Construction of B_1
1: B_1 ← B_0
2: for i = q/4 to q/2 − 1 by 1 do
3:   if i is in B_0 and q − 2·i is in B_0 then
4:     B_1.remove(q − 2·i)
5: for i = q/6 to q/4 − 1 by 1 do
6:   if i is in B_0 and q − 3·i is in B_0 then
7:     B_1.remove(q − 3·i)

Property 1. Given q = 2^c (10 ≤ c ≤ 31), for all t (0 ≤ t ≤ q), there exist an element b ∈ B_1 and an integer m ∈ {1, 2, 3} such that t = mb or t = q − mb. This property is verified by computation using Algorithm 7. It is also asserted by computation that exchanging the two for loops in Algorithm 5 would construct the same B_1.
Finally the bucket set is proposed to be

B = B_1 ∪ B_2. (27)

Property 2. For the multiplier set M and the bucket set B defined in (25) and (27), every scalar a (0 ≤ a < r) can be expressed (not necessarily uniquely) as

a = Σ_{j=0}^{h−1} m_j·b_j·q^j, m_j ∈ M, b_j ∈ B. (28)

Proof. By Property 1 we know that an arbitrary integer t ∈ [0, q] can be expressed as t = mb + αq with m ∈ M, b ∈ B_1 and α ∈ {0, 1}, and by the definition of B_2 we know that an integer t ∈ [0, r_{h−1} + 1] can be expressed as t = mb with m ∈ {1, 2, 3} and b ∈ B_2. Back to Property 2, Algorithm 6 can be used to convert a from its standard q-ary representation (24) to its radix q representation (28).

Algorithm 6 Scalar conversion II
The correctness of Algorithm 6 comes from the fact that

a_j + α_{j−1} = m_j·b_j + α_j·q, j = 0, 1, · · · , h − 1 (with α_{−1} = 0).

For every t ∈ [0, q], a hash table H is precomputed to store its decomposition, i.e., H(t) = (m, b, α) such that t = mb + αq, m ∈ M, b ∈ B, α ∈ {0, 1}. Steps 2 and 4 of Algorithm 6 are executed by retrieving the decomposition from the hash table. For the proposed (M, B), hash table H can be realized by a length-(q + 1) array decomposition using Algorithm 7. The array decomposition is also utilized to verify Property 1, by checking whether any entry of decomposition has −1 as its last element.
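The roles of Algorithms 6 and 7 can be sketched together: build a decomposition table for every t ∈ [0, q], then convert a scalar digit by digit, pushing each carry into the next digit. For the table to be a self-contained demo we use the trivial bucket set B = [0, q/2] rather than the paper's B; all names are illustrative.

```python
def build_decomposition(q, B, M=(1, 2, 3, -1, -2, -3)):
    """Algorithm 7's role: for every t in [0, q], store one (m, b, alpha)
    with t = m*b + alpha*q, m in ±{1,2,3}, b in B, alpha in {0,1}."""
    H = [None] * (q + 1)
    for t in range(q + 1):
        for alpha in (0, 1):
            for m in M:
                if (t - alpha * q) % m == 0 and (t - alpha * q) // m in B:
                    H[t] = (m, (t - alpha * q) // m, alpha)
                    break
            if H[t] is not None:
                break
    return H

def convert_scalar(a, q, h, H):
    """Algorithm 6's role: rewrite a = Σ a_j·q^j into digits m_j·b_j, where
    the carry alpha_j from H(a_j + alpha_{j-1}) moves to the next digit."""
    out, carry = [], 0
    for _ in range(h):
        m, b, carry = H[(a % q) + carry]
        a //= q
        out.append((m, b))
    assert carry == 0, "leading digit overflowed"
    return out

q, h = 16, 3
B = set(range(q // 2 + 1))   # demo bucket set: all of [0, q/2], NOT the paper's B
H = build_decomposition(q, B)
digits = convert_scalar(0xACE, q, h, H)
print(sum(m * b * q**j for j, (m, b) in enumerate(digits)))  # → 2766 (= 0xACE)
```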
See Section 6 and Appendix A for the detailed parameters. It is checked that the maximum difference between two neighboring elements in B is d = 6. For a point P = (x, y) on an elliptic curve E in short Weierstrass form, −P = (x, −y) can be obtained at almost no cost, hence the points associated with negative elements in M can be excluded from the precomputation table. Correspondingly, in Step 3 of Algorithm 4, a length-nh boolean array is added to record the signs of the multipliers. In Step 4, if a multiplier is negative, the corresponding point is subtracted from the intermediate subsum.
By the computational cost estimation formula (19), we have the following proposition.
Proposition 1. Given the number of points n and group order r, suppose q = 2^c (10 ≤ c ≤ 31) and h = ⌈log_q r⌉. The multiplier set and bucket set defined in (25) and (27) yield an algorithm that computes MSM S_{n,r} over the BLS12-381 curve using at most approximately

(nh + 0.21q) · A, q = 2^c (10 ≤ c ≤ 31, c ≠ 15, 16, 17),

with the help of 3nh precomputed points.

Instantiation
In this section we instantiate our construction over the BLS12-381 curve [BLS02,Bow17], and present some theoretical analysis against Pippenger's bucket method and Pippenger's variant.

The BLS12-381 curve is defined by the equation y^2 = x^3 + 4 over F_p, where p = 0x1a0111ea397fe69a4b1ba7b6434bacd764774b84f38512bf6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab is the 381-bit field characteristic (in hexadecimal), and its embedding degree is 12. The two subgroups G_1 ⊂ E(F_p) and G_2 ⊂ E(F_{p^2}), over which bilinear pairings are defined, have the same 255-bit prime order r = 0x73eda753299d7d483339d80809a1d80553bda402fffe5bfeffffffff00000001.

Theoretical analysis
Radix q is called optimal if the number of additions required to compute S_{n,r} in the worst case is minimized. The optimal q and its corresponding scalar length h for the different methods are summarized in Table 2. The precomputation size presented in this table is in terms of points in G_1 with affine coordinates; the precomputation size over G_2 would be double its counterpart over G_1. Pippenger's bucket method and Pippenger's variant are the two methods introduced in Section 2.3.2. Our construction refers to the proposed construction presented in Section 5. The number of additions taken to compute S_{n,r} in the worst case and their comparison are summarized in Table 3, where
• Improv1 = ((Our construction) − Pippenger)/Pippenger,
• Improv2 = ((Our construction) − (Pippenger variant))/(Pippenger variant).
Table 3 shows that theoretically, when computing S_{n,r} over the BLS12-381 curve for n = 2^e (10 ≤ e ≤ 21), our construction saves 21% to 40% of the additions compared to Pippenger's bucket method, and 2.6% to 9.6% compared to Pippenger's variant. It is noted that the proposed bucket sets listed in Appendix A are sufficient to compute S_{n,r} over BLS12-381 for n = 2^e (22 ≤ e ≤ 28). Our method still shows a 2.8% ∼ 5.8% theoretical improvement over Pippenger's variant in those cases, but the drawback is that the precomputation would be too large.

Time complexity: worst case versus average case
We are going to show in this section that the difference between the worst-case time complexity and the average case is tiny, hence the worst case is used in this paper as the representative. The result relies on the group order r, which is why we do the average-case analysis after instantiation. It is done by estimating the expected number of zero elements, denoted as f, in the length-nh array scalars of Algorithm 4.
Suppose the group order in its standard q-ary form is r = Σ_{j=0}^{h−1} r_j·q^j. For every uniformly randomly picked scalar a (0 ≤ a < r), when a is converted to the standard q-ary form a = Σ_{j=0}^{h−1} a_j·q^j, for simplicity we assume that the digits a_j (0 ≤ j ≤ h − 2) are uniformly distributed in [0, q − 1] and that a_{h−1} is uniformly distributed in [0, r_{h−1}]. Let us first do the analysis for our construction. When scalar a is converted by Algorithm 6 from its standard q-ary form to the radix q representation a = Σ_{j=0}^{h−1} m_j·b_j·q^j, (31) we know that b_j = 0 (1 ≤ j ≤ h − 2) if and only if a_j = 0 and the carry bit from the previous digit α_{j−1} = 0, or a_j = q − 1 and the carry bit α_{j−1} = 1. Assume the probability of the carry bit being 0 is λ, which equals the probability of α = 0 in the array decomposition decided by Algorithm 7. Then the probability that b_j = 0 is λ·(1/q) + (1 − λ)·(1/q) = 1/q, so if a scalar is converted to the representation in (31), the expected number of j's such that b_j = 0 is approximately (h − 1)/q. When running Algorithm 4, the expected number of zeros in scalars is therefore f ≈ n(h − 1)/q. Define

I = (worst case time complexity − average case time complexity) / (worst case time complexity)

to measure the difference between the worst-case time complexity and the average-case time complexity. Our method utilizes radix q comparable to n, so n(h − 1)/q is a small number that can be ignored. It follows that

I ≈ f / (nh + 0.21q). (33)

For radix q = 2^c (10 ≤ c ≤ 22, c ≠ 15, 17) used in Table 2, the triple (q, h, r_{h−1}) can be found in Appendix A, and it is checked that λ < 0.7 for those radixes. It follows that I < 1%, which means the difference is small. A similar analysis also applies to Pippenger's bucket method and Pippenger's variant, and (33) still holds for them. Those two methods have λ ≈ 0.5. For q = 2^c (8 ≤ c ≤ 22), I is even smaller compared to our construction.

Implementation
In order to assess the cost of scalar conversion and the impact of the memory locality issue caused by the large precomputation size, we conducted the experiment on the basis of blst, a BLS12-381 signature library written in C and assembly [bls]. The blst library includes the addition/doubling arithmetic and an implementation of Pippenger's bucket method over G_1 and G_2. We implemented Pippenger's variant and our construction following Algorithm 4, and for Pippenger's bucket method we invoked the implementation built into blst.

Implementation analysis
In terms of the scalar conversion, which is Step 2 of Algorithm 4, a scalar is given as a length-8 uint32_t array. Both Pippenger's bucket method and Pippenger's variant need to first convert the scalar to its standard q-ary form, then convert it to the expression where the absolute value of every digit is no more than q/2 using Algorithm 2. Our construction first converts the scalar to its standard q-ary form, then converts it to the expression where every digit is the product of an element from the multiplier set and an element from the bucket set using Algorithm 6, with the help of the decomposition hash table. We utilize a length-(q + 1) array decomposition to realize this hash table, so the concern boils down to retrieving data from the array decomposition.
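The c-bit window extraction underlying these conversions can be sketched as below; `get_window` is a hypothetical helper, not blst's API, and handles windows that straddle a 32-bit limb boundary.

```python
def get_window(limbs, c, j):
    """Extract the j-th c-bit digit (c ≤ 32) from a little-endian uint32
    limb array, stitching in bits from the next limb when the window
    crosses a 32-bit boundary."""
    bit = c * j
    limb, off = bit // 32, bit % 32
    val = limbs[limb] >> off
    if off + c > 32 and limb + 1 < len(limbs):
        val |= limbs[limb + 1] << (32 - off)   # bits spilling into the next limb
    return val & ((1 << c) - 1)

# A 256-bit scalar split into 8 uint32 limbs, least significant limb first.
a = 0x0123456789ABCDEF_FEDCBA9876543210_0F0F0F0F0F0F0F0F_AAAA5555CCCC3333
limbs = [(a >> (32 * i)) & 0xFFFFFFFF for i in range(8)]
assert all(get_window(limbs, 13, j) == (a >> (13 * j)) & 0x1FFF
           for j in range(256 // 13))
```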
In terms of Step 4 in Algorithm 4, where all points are sorted into different buckets, each addition is done between a point fetched from the array precomputation and another point fetched from the array buckets. We treat the n fixed points as the length-n precomputation array for Pippenger's bucket method. We have the following observations.
• Pippenger's bucket method computes equation (9) h times, so in total it fetches data nh times from its length-n array precomputation and nh times from its length-(q/2 + 1) array buckets.
• Pippenger's variant fetches data nh times from its length-nh array precomputation and nh times from its length-(q/2 + 1) array buckets.
Pippenger's variant and our construction show some advantages here regarding the number of fetch operations, since their h's are usually smaller than that of Pippenger's bucket method. Their disadvantages are that the fetch operations are executed in larger precomputation and buckets arrays.
Step 4 of Algorithm 4 is a simple loop, so we utilize prefetching to mitigate the impact of memory accesses into large arrays. It is noted that, in terms of fetching data from the buckets array, our construction has some advantage over Pippenger's variant when the same radix q is used, because our construction uses a smaller buckets array in that case. Even when our radix q (q = 2^16) is twice as big, our construction still keeps this advantage.

Experimental result
Our experiment is done on an Apple 14-inch MacBook Pro with a 3.2 GHz M1 Pro chip and 16 GB of memory. The M1 Pro has advantages such as a large cache size and high memory bandwidth. Most importantly, its cache line is 128 bytes, which is sufficient to accommodate a BLS12-381 G_1 point, whose size is 96 bytes. These characteristics are expected to provide some benefit when fetching data from large arrays.
We keep blst's implementation intact, because on one hand our focus is on the comparison between Pippenger's variant and our construction, and on the other hand blst's implementation can serve as a performance benchmark. In Table 4, s.c. v. represents the time spent by Pippenger's variant on the scalar conversion for all n scalars in S_{n,r}, while s.c. c. represents that of our construction. In Table 5, Improv1 is the comparison between Pippenger's bucket method and our construction, and Improv2 is the comparison between Pippenger's variant and our construction. Both Pippenger's variant and our construction show a huge improvement over Pippenger's bucket method, which demonstrates the feasibility of speeding up the computation of S_{n,r} using large precomputation tables.
If we focus on the comparison between Pippenger's variant and our construction, we have the following observations when computing S_{n,r} in G_1 for n = 2^e (10 ≤ e ≤ 21), and in G_2 for n = 2^e (10 ≤ e ≤ 20). • In terms of the computation time of scalar conversion, in G_1 Pippenger's variant spends 0.9 ∼ 1.5% of its entire S_{n,r} computation time, while our construction spends 1.1 ∼ 2.8%. Because the addition arithmetic in G_2 takes relatively more time than in G_1, the percentages there are smaller: in G_2 Pippenger's variant spends 0.4 ∼ 0.6% of its whole S_{n,r} computation time, while our construction spends 0.4 ∼ 1.1%.
• Our construction does not perform well for n = 2^16, 2^20. For n = 2^16, our optimal radix is q = 2^19, which is 8 times larger than that of Pippenger's variant. For n = 2^20, our optimal radix is q = 2^22, which is 4 times larger than that of Pippenger's variant. Since the radix value is even larger than n, it has a negative impact on fetching data from the array buckets, as the analysis in the previous section indicates. When we change to smaller radixes, specifically q = 2^18 for n = 2^16 and q = 2^20 for n = 2^20, our construction outperforms Pippenger's variant again, as the results marked by an asterisk show, although theoretically those radixes are not optimal.