Load-Balanced Parallel Implementation on GPUs for Multi-Scalar Multiplication Algorithm

Multi-scalar multiplication (MSM) is an important building block in most elliptic-curve-based zero-knowledge proof systems, such as Groth16 and PLONK. Recently, Lu et al. proposed cuZK, a new parallel MSM algorithm on GPUs. In this paper, we revisit this scheme and present a new GPU-based implementation that further improves the performance of the MSM algorithm. First, we propose a novel method for mapping scalars into Pippenger's bucket indices, largely reducing the number of buckets compared to the original Pippenger algorithm. Second, in the case that memory is sufficient, we develop a new efficient algorithm based on homogeneous coordinates for the bucket accumulation phase. Moreover, our accumulation phase is load-balanced, meaning the parallel speedup ratio grows almost linearly as the number of device threads increases. Finally, we also propose a parallel layered reduction algorithm for the bucket aggregation phase, whose time complexity remains logarithmic in the number of buckets. The implementation results over the BLS12-381 curve on the V100 graphics card show that our proposed algorithm achieves up to 1.998×, 1.821× and 1.818× speedup compared to cuZK at scales of 2^21, 2^22, and 2^23, respectively.


Introduction
In recent years, there has been a growing emphasis on privacy concerns within the industrial sector. Zero-knowledge Succinct Non-interactive ARgument of Knowledge (zk-SNARK), an excellent cryptographic primitive, not only provides robust privacy protection but also allows for essential audits. It has been widely used in industrial-grade solutions such as anonymous transactions in Zerocash [BSCG+14] and flexible anonymous credentials zk-cred [RWGM23]. It allows the prover to generate a proof π for any (small) nondeterministic polynomial (NP) relation R = {(x; w) : P(x, w)}. Unfortunately, most elliptic-curve-based zk-SNARKs like Groth16 [Gro16] and PLONK [GWC19] still suffer from performance bottlenecks. Various acceleration solutions have already been published. For instance, Ni et al. [NZ23] took advantage of GPUs to enhance the efficiency of zk-SNARKs by accelerating the Number-Theoretic Transform (NTT) and Inverse Number-Theoretic Transform (INTT). We note that the computational cost of the Setup and Prove phases is significantly influenced by the number of circuit multiplication gates, resulting in a much longer time duration compared to the Verify phase. Furthermore, large-scale multi-scalar multiplication (MSM) operations occupy the majority of the computational cost during the proof generation phase. Thus, the practical deployment of zk-SNARKs urgently requires fast MSM algorithms.
In scenarios involving zk-SNARKs or large-scale BLS signature aggregation [BGLS03], the state-of-the-art serial algorithms for MSM are the Pippenger algorithm and its variants [Pip76, BDLO12]. These algorithms partition each λ-bit scalar into multiple c-bit windows. Subsequently, all points are sorted into buckets with respect to their sub-scalar values, and finally the values within the buckets are aggregated to derive the final result. Recently, a series of works were proposed to further improve the performance of the Pippenger algorithm and its variants. Botrel and Housni [BEH23] optimized the arithmetic of finite fields by improving on the Coarsely Integrated Operand Scanning (CIOS) modular multiplication and proposed a new coordinate system for twisted Edwards curves tailored for the Pippenger algorithm. Luo, Fu and Gong [LFG23] proposed a bucket set construction to speed up MSM over fixed points with the help of large precomputation tables.
Owing to the thriving development of modern GPU architectures, researchers have proposed several GPU-accelerated implementations based on the Pippenger algorithm. MatterLab and Yrrid [Mat22, Yrr22] are the winners of the ZPrize competition [Zpr22], a competition which focuses on accelerating MSM using GPUs. Both MatterLab and Yrrid utilized radix sort to process the scalars used in MSM. Recently, Lu et al. [LWY+23] proposed cuZK, a new GPU-accelerated implementation. In particular, the authors pointed out the possibility of load imbalance in the MatterLab and Yrrid implementations. For this reason, they converted the major operations of the Pippenger algorithm into a series of basic sparse matrix operations, including sparse matrix transpose and sparse matrix-vector multiplication. In fact, cuZK is well-suited to the high parallelism of GPU-based implementation and has nearly perfect linear speedup over the Pippenger algorithm.

Our contributions
In this paper, we propose a high-speed GPU-based implementation for MSM in scenarios with different memory sizes. The proposed implementations are applicable to most curve-based zk-SNARKs and achieve high performance on modern GPU architectures. Our contributions are summarized as follows:
• In Section 3.1, we propose a new method for mapping scalars into Pippenger's bucket indices, reducing the number of buckets to 1/4 of that in the original Pippenger algorithm. To be specific, we fix Pippenger's window size at c bits and convert each k_{i,j} ∈ [0, 2^c] into 0 or a unique odd bucket index k̂_{i,j} ∈ [1, 2^{c−1} − 1]. We recode the bucket indices along with the mapped additional information for point operations during the bucket accumulation phase. This custom encoding format does not increase the sorting cost when we use the radix sort algorithm [Pow90].
• In Section 3.2, we focus on the fundamental point addition operations of the bucket accumulation phase. We employ the mixed point addition formulas in cached Jacobian coordinates [CC86] when the GPU memory is of typical size and switch to a more efficient algorithm based on homogeneous coordinates when the memory is sufficient. Compared to the former, for every two points added to the same bucket, the accumulation algorithm based on homogeneous coordinates saves one finite field multiplication (about 5%). In addition, we apply the technique of lazy reduction [AKL+11, Sco07] to all point addition and doubling operations and employ certain CUDA assembly tricks to further enhance speed.
• In Section 3.3, we present our load-balanced bucket accumulation method, meaning the parallel speedup ratio grows almost linearly as the number of device threads increases. Differing from cuZK, our approach abstains from data structures like vectors (libstl-cuda/vector), which could potentially involve dynamic memory allocation. Instead, we design a compact static buffer and employ binary sieving to accumulate results from different threads.
• In Section 3.4, we provide a parallel layered reduction algorithm for the serially executed bucket aggregation phase. The time complexity of our algorithm remains logarithmic in the number of buckets.
• In Section 3.5, we discuss the computational cost for each phase within each thread.
In addition, we analyze why our implementation is not constant-time, and illustrate a new bucket accumulation algorithm that remedies this shortcoming in Section 5.
• On the BLS24-315 curve, our implementation based on homogeneous coordinates achieves 90.35 ms for a 2^23-point MSM on an RTX 4090. Furthermore, since the bucket accumulation algorithm in previous work assigns non-overlapping bucket indices to each thread, we also implement this previous method to demonstrate the superiority of our proposed load-balanced algorithm. The comparison reveals up to a 1.853× speedup.

Elliptic point groups and coordinates
Let F_p denote the finite field of prime order p, and |F_p| the byte length of elements in F_p. The group G is a subgroup of the elliptic curve point group E(F_p) over F_p with order r. In G, the identity element O is the point at infinity. The group laws PADD and PDBL refer to point addition for unequal points and point doubling for equal points, respectively. |G| denotes the byte length of elements in G. Let λ = ⌈log_2 r⌉ denote the bit length of the order r.
Different coordinates. To reduce storage overhead, elliptic curve points are commonly represented in affine coordinates, i.e., only two field elements {X, Y} are required to represent one point. To speed up computations, diverse coordinate systems and addition formulas have emerged, such as Jacobian coordinates {X, Y, Z} (x = X/Z^2, y = Y/Z^3), modified Jacobian coordinates {X, Y, Z, T} (x = X/Z^2, y = Y/Z^3, T = aZ^4), homogeneous coordinates {X, Y, Z} (x = X/Z, y = Y/Z), and cached Jacobian coordinates {X, Y, ZZ, ZZZ} (x = X/ZZ, y = Y/ZZZ, ZZ^3 = ZZZ^2). For subsequent optimisation work, we review the computational overhead under several coordinate systems in Table 1. Traditionally, single-scalar multiplication uses mixed point addition between affine and Jacobian coordinates and point doubling in Jacobian coordinates.
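As a small consistency check of the conversion rules above, the following Python sketch (ours, with an illustrative toy prime rather than a real curve field) represents one affine point in the Jacobian, homogeneous, and cached Jacobian systems and converts back:

```python
# A minimal sketch (not the paper's code) of the coordinate systems in Table 1,
# over a small toy prime field; real implementations use 315/381-bit fields.
p = 10007  # illustrative prime, an assumption for this sketch

def inv(t):
    return pow(t, p - 2, p)  # Fermat inversion in F_p

def from_jacobian(X, Y, Z):          # x = X/Z^2, y = Y/Z^3
    zi = inv(Z)
    return (X * zi * zi % p, Y * zi * zi * zi % p)

def from_homogeneous(X, Y, Z):       # x = X/Z, y = Y/Z
    zi = inv(Z)
    return (X * zi % p, Y * zi % p)

def from_cached_jacobian(X, Y, ZZ, ZZZ):   # x = X/ZZ, y = Y/ZZZ
    return (X * inv(ZZ) % p, Y * inv(ZZZ) % p)

# Represent the same affine point (x, y) in all three systems with Z = 5.
x, y, Z = 1234, 4321, 5
jac = (x * Z * Z % p, y * Z**3 % p, Z)
hom = (x * Z % p, y * Z % p, Z)
xyzz = (x * Z * Z % p, y * Z**3 % p, Z * Z % p, Z**3 % p)
assert from_jacobian(*jac) == (x, y)
assert from_homogeneous(*hom) == (x, y)
assert from_cached_jacobian(*xyzz) == (x, y)
assert (xyzz[2] ** 3 - xyzz[3] ** 2) % p == 0  # invariant ZZ^3 = ZZZ^2
```

The cached Jacobian invariant ZZ^3 = ZZZ^2 holds because ZZ = Z^2 and ZZZ = Z^3 for a common Z.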

Multi-Scalar Multiplication
Given n scalars {k_i}_{i∈[n]} and n points {P_i}_{i∈[n]}, MSM computes Q = Σ_{i=1}^{n} k_i · P_i. As shown in Alg. 1, the double-and-add method can be used as a straightforward approach to compute MSM; it performs nλ PADDs and λ − 1 PDBLs in the worst case. This would be prohibitively slow and thus unacceptable for real-world applications in zk-SNARKs.

Algorithm 1
The double-and-add algorithm for computing MSM
Input: The scalars {k_i}_{i∈[n]} and the points {P_i}_{i∈[n]}
Output: Q = Σ_{i=1}^{n} k_i · P_i
1: Q = O
2: for j = λ − 1 down to 0 do
3:   Q = PDBL(Q)
4:   for i = 1 to n do
5:     if the j-th bit of k_i is 1 then
6:       Q = PADD(Q, P_i)
7:     end if
8:   end for
9: end for
10: return Q
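The algorithm can be sketched in a few lines of Python (ours, not the paper's code), with plain integers standing in for curve points so that PADD and PDBL become ordinary addition and doubling:

```python
# A sketch of Alg. 1 with the group written additively over plain integers:
# PADD/PDBL become integer addition and doubling, and O becomes 0.
def msm_double_and_add(scalars, points, lam):
    Q = 0                                  # identity element O
    for j in range(lam - 1, -1, -1):       # bits lam-1 down to 0
        Q = Q + Q                          # PDBL
        for k, P in zip(scalars, points):
            if (k >> j) & 1:               # j-th bit of k_i is set
                Q = Q + P                  # PADD
    return Q

assert msm_double_and_add([5, 11, 250], [7, 13, 19], 8) == 5 * 7 + 11 * 13 + 250 * 19
```

The inner loop performs up to n PADDs per bit, matching the nλ PADD worst-case count above.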

The Pippenger Algorithm
For large-scale MSM, the Pippenger algorithm and its variants [Pip76, BDLO12] are the most popular serial algorithms. Our proposed GPU-accelerated MSM algorithm is also built on this foundation. The Pippenger algorithm is described as follows:
Step-1: Decompose the main task into multiple subtasks. The Pippenger algorithm first chooses an integer c ∈ [λ] as the window size and decomposes each λ-bit scalar k_i into ⌈λ/c⌉ windows of c bits each, i.e., k_i = Σ_{j=0}^{λ_c} k_{i,j} · 2^{cj}. Thus, the main task can be considered as computing ⌈λ/c⌉ subtasks, called smaller-scale MSMs, i.e., Q_j = Σ_{i=1}^{n} k_{i,j} · P_i, where 0 ≤ k_{i,j} < 2^c and λ_c = ⌈λ/c⌉ − 1.
Step-2: Compute the bucket points in each subtask. To complete each subtask, the Pippenger algorithm introduces buffer points called "buckets" and divides the subtask into two phases. The first phase for each subtask Q_j, called bucket accumulation, computes 2^c − 1 bucket points {B_t}_{1≤t≤2^c−1}, where B_t is the sum of all points whose sub-scalar equals t, i.e., B_t = SUM{P_i | k_{i,j} = t}, as shown in Figure 1. Note that B_0 = O since it is the sum of all points with zero sub-scalars. The second phase for each subtask Q_j, called bucket aggregation, sequentially computes the partial sums G_t = Σ_{u=t}^{2^c−1} B_u for t = 2^c − 1 down to 1. So, we have Q_j = Σ_{t=1}^{2^c−1} t · B_t = Σ_{t=1}^{2^c−1} G_t. After calculating all points {G_t}_{1≤t≤2^c−1}, it can accumulate them to obtain Q_j.
Step-3: Aggregate all subtask results into the main task result. Finally, the Pippenger algorithm sets Q = Q_{λ_c} as the initial status and computes Q = 2^c · Q + Q_j from j = λ_c − 1 down to 0. After λ_c rounds, it can get the final result Q.
Computational cost. It can be observed that the Pippenger algorithm converts all scalar multiplications into PADD and PDBL computations. The computational costs of scalar segmentation and sorting in the preceding steps are relatively small, with the primary time consumption occurring in the subsequent point operations. For each subtask, the bucket accumulation phase requires at most n PADDs to add base points to the buckets, and the bucket aggregation phase requires (2^{c+1} − 4) PADDs to add buckets to the subtask sum. Finally, the subtask aggregation requires c(⌈λ/c⌉ − 1) PDBLs and λ_c PADDs. Thus, the Pippenger algorithm needs to perform ⌈λ/c⌉(n + 2^{c+1} − 3) PADDs and c(⌈λ/c⌉ − 1) PDBLs in total.

The CUDA programming model

The CUDA programming model is a parallel computing platform and API developed by NVIDIA for GPUs. It allows developers to harness the computational power of GPUs for a wide range of tasks. In this model, programmers write kernels, which are parallel functions that can be executed by thousands of GPU threads. These kernels are launched from the CPU and executed on the GPU.
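As a sanity check, the three Pippenger steps can be modeled end to end in Python (ours, with integers standing in for curve points, so PADD/PDBL are ordinary additions):

```python
# A compact model of the serial Pippenger algorithm; no GPU or field arithmetic.
def pippenger(scalars, points, lam, c):
    num_windows = -(-lam // c)                    # ceil(lam/c) subtasks (Step-1)
    Q = 0
    for j in range(num_windows - 1, -1, -1):      # Step-3 loop, j = lambda_c .. 0
        buckets = [0] * (1 << c)                  # Step-2a: bucket accumulation
        for k, P in zip(scalars, points):
            buckets[(k >> (c * j)) & ((1 << c) - 1)] += P
        G, Qj = 0, 0                              # Step-2b: bucket aggregation
        for t in range((1 << c) - 1, 0, -1):
            G += buckets[t]                       # G_t = B_{2^c-1} + ... + B_t
            Qj += G                               # sum of G_t equals sum of t*B_t
        Q = (Q << c) + Qj                         # Q = 2^c * Q + Q_j
    return Q

scalars, points = [5, 11, 250], [7, 13, 19]
assert pippenger(scalars, points, 8, 3) == sum(k * P for k, P in zip(scalars, points))
```

The suffix-sum loop makes the identity Q_j = Σ_t t · B_t = Σ_t G_t concrete: each B_t contributes to exactly t of the running sums.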

Challenges in parallel implementations
For bucket accumulation. As shown in Figure 1, each subtask requires a buffer of 2^c − 1 elliptic curve points, which is entirely acceptable in any mode. However, for the GPU-accelerated mode, there still exists a critical influencing factor in the accumulation phase. Assuming the total number of threads that the GPU card can provide is N, the fundamental idea of the parallel implementation is that each thread processes a consecutive set of n/N sub-scalars along with their associated points. But when different threads process buckets stored at the same location in the buffer, write conflicts are inevitably encountered. Previous works like [6bl22, Mat22] either employ performance-degrading atomic functions or utilize Alg. 2. Before Alg. 2, they initialize an array of tuple pairs consisting of sub-scalars and point indices {(k_{i,j}, i)} and perform radix sorting on the keys k_{i,j} to obtain a new sorted array of tuple pairs {(a_i, p_i)} (in ascending order). Then, each thread no longer simply processes the data with indices ∈ [s, e) = [n/N · tid, n/N · (tid + 1)), but instead adjusts the left edge to s′ = min{s′ | s′ ≥ s ∧ a_{s′} ≠ a_{s′−1}} (excluding the case where s′ = 0) and similarly adjusts the right edge (exclusive) to e′ = min{e′ | e′ ≥ e ∧ a_{e′} ≠ a_{e′−1}}. This ensures that write operations on buckets do not conflict between threads. However, this approach brings about load imbalance and may even result in threads idling, leading to a certain degree of computing-power waste.
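The boundary adjustment, and the idling it can cause, can be illustrated with a short Python sketch (our own model; the function and variable names are not from [Mat22]):

```python
# After sorting (sub-scalar, index) pairs by key, each thread's range [s, e)
# is moved forward to the next key change, so no two threads share a bucket.
def adjusted_ranges(keys, num_threads):
    n = len(keys)
    def next_change(i):
        # smallest s' >= i with keys[s'] != keys[s'-1] (or n if none)
        while 0 < i < n and keys[i] == keys[i - 1]:
            i += 1
        return i
    return [(next_change(n * tid // num_threads),
             next_change(n * (tid + 1) // num_threads))
            for tid in range(num_threads)]

keys = sorted([3, 3, 3, 3, 5, 5, 7, 9])
ranges = adjusted_ranges(keys, 4)
# Ranges cover [0, n) contiguously, and no key value spans two ranges ...
assert ranges[0][0] == 0 and ranges[-1][1] == len(keys)
assert all(ranges[i][1] == ranges[i + 1][0] for i in range(3))
# ... but thread 1 gets an empty range: exactly the idling described above.
assert (4, 4) in ranges
```

With four equal keys at the front, one thread ends up with an empty range while another processes four points, showing the load imbalance at a glance.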
For bucket aggregation. The bucket aggregation phase can be divided into two stages: the bucket updates and the final summation. The final summation can be accomplished using well-known parallel reduction algorithms. However, it is evident that the bucket updating part (in Figure 2) is a serial computational process, for which no practical parallel algorithms have been proposed at present (in prior works, the parallel methods used for this part always involved small-scale scalar multiplications, which hurt performance).

Our GPU-Accelerated MSM Algorithm
In this section, we describe our specific implementation of an individual Pippenger subtask on GPU platforms. We use N to denote the maximum number of threads supported by the current GPU device.

Scalar processing and point precomputation
The first optimization reduces the number of buckets from 2^c − 1 to 2^{c−2} by processing all scalar slices {k_{i,j}}_{1≤i≤n, 0≤j≤λ_c} of each subtask. This makes the number of buckets a quarter of that in the original Pippenger algorithm. Mathematically, the scalar slices are transformed according to the following sequence: k_{i,j} · P_i → (−1)^{s_{i,j}} · k̂_{i,j} · 2^{h_{i,j}} · P_i = (−1)^{s_{i,j}} · k̂_{i,j} · P_{i,h_{i,j}}, where k̂_{i,j} is an odd number of at most c − 1 bits and P_{i,h_{i,j}} = 2^{h_{i,j}} · P_i is a precomputed point.

Algorithm 3 Scalar processing
Input: The bit-length λ, n integers {k_i}_{i∈[n]} and the window size c ∈ [λ]
Output: The integer tuples {(k̂_{i,j}, h_{i,j}, s_{i,j})}
1: h_{i,j} = max{η : 2^η | k̃_{i,j}}, k̂_{i,j} = k̃_{i,j} >> h_{i,j} ▷ odd factor of k̃_{i,j}
…
8: end for

• Convert scalars to the signed representation. Commonly, we can convert k_{i,j} into s_{i,j} = 1 and k̃_{i,j} = 2^c − k_{i,j} (with a carry of 1 into k_{i,j+1}) if k_{i,j} ≥ 2^{c−1}, or s_{i,j} = 0 and k̃_{i,j} = k_{i,j} otherwise. As long as the curve parameters are appropriate, here s_{i,λ_c} = 0. Then, the sign bit s_{i,j} can be stored for each scalar slice k_{i,j}. Note that if k_{i,j} = 2^{c−1}, we have s_{i,j} = 1 and k̃_{i,j} = 2^{c−1}.
• Convert scalars to the float representation. Inspired by the floating-point format, we can compute an odd number k̂_{i,j} ∈ [0, 2^{c−1}) and a small exponent h_{i,j} ∈ [0, c − 1] such that k̃_{i,j} = k̂_{i,j} · 2^{h_{i,j}}. Then, we can reorganize an integer k̄_{i,j} with the higher c − 1 bits storing k̂_{i,j}, the middle ⌈log_2(c − 1)⌉ bits storing h_{i,j}, and the least significant bit storing s_{i,j}. We provide an example of the format of k̄_{i,j} in Figure 3. At maximum, a 32-bit integer can store k̄_{i,j} for any window size c ≤ 26. Note that we set k̄_{i,j} = 0 if k̃_{i,j} = 0. For simplicity, we write k̄_{i,j}.ODD, k̄_{i,j}.EXP and k̄_{i,j}.SIGN to denote the odd number, exponent and sign bit of k̄_{i,j}, respectively.
• Create a scalar mapping table. Typically, we can compute the integer tuples {(k̂_{i,j}, h_{i,j}, s_{i,j})} by Alg. 3 and reorganize them into k̄_{i,j}. If c is not particularly large (e.g., no more than 16), we can instead create a table with input k_{i,j} and output k̄_{i,j}. This table costs 2^{c+2} bytes of storage but improves scalar processing performance.
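The signed-plus-float recoding above can be sketched as follows (our own Python, not Alg. 3 itself; c = 16 and the test scalar are illustrative assumptions). Reconstructing the scalar from its recoded slices must return the original value:

```python
# Sketch of the scalar mapping: each c-bit slice becomes (odd part, exponent,
# sign) with a carry into the next slice when the signed form is used.
c = 16

def recode_slice(k, carry_in):
    k += carry_in
    s = 1 if k >= (1 << (c - 1)) else 0      # signed representation
    kt = ((1 << c) - k) if s else k          # magnitude; kt = 2^(c-1) when k = 2^(c-1)
    if kt == 0:
        return (0, 0, 0), s
    h = (kt & -kt).bit_length() - 1          # exponent: 2-adic valuation of kt
    return (kt >> h, h, s), s                # kt = odd * 2^h

def recode(k, lam):
    out, carry = [], 0
    for j in range(-(-lam // c)):            # ceil(lam/c) slices
        t, carry = recode_slice((k >> (c * j)) & ((1 << c) - 1), carry)
        out.append(t)
    assert carry == 0                        # s_{i,lambda_c} = 0 for suitable parameters
    return out

k = 0x1EADBEEFCAFE                           # arbitrary 48-bit test scalar
val = sum((-1) ** s * odd * (1 << h) << (c * j)
          for j, (odd, h, s) in enumerate(recode(k, 48)))
assert val == k
```

Packing each tuple into a single 32-bit word (odd part in the high c − 1 bits, exponent in the middle, sign in the least significant bit) is a straightforward final step omitted here.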
Change the way to add points into the bucket. After scalar processing, we should add the point (−1)^{s_{i,j}} · P_{i,h_{i,j}} = (−1)^{s_{i,j}} · 2^{h_{i,j}} · P_i to the bucket B_{k̂_{i,j}} with index k̂_{i,j}. The point −P_{i,h_{i,j}} is easy to compute by negating the y-coordinate. But calculating the point 2^{h_{i,j}} · P_i from P_i requires h_{i,j} PDBLs. For efficiency, the points {2^1 P_i, 2^2 P_i, ..., 2^τ P_i} are precomputed with a threshold parameter τ ∈ [1, c − 1] and stored in GPU memory, totalling τ · n points. Obviously, h_{i,j} can be greater or less than the threshold τ. So,
• If h_{i,j} ≤ τ, P_{i,h_{i,j}} can be looked up directly in the precomputed point table.
• If h_{i,j} > τ, 2^τ P_i can be looked up directly in the precomputed point table. Then, it is also necessary to recalculate P_{i,h_{i,j}} = 2^{h_{i,j}−τ} · (2^τ P_i) with h_{i,j} − τ PDBLs.
Statistically, the values of k̃_{i,j} satisfying 2^{τ+1} | k̃_{i,j} (i.e., h_{i,j} > τ) constitute only a fraction of approximately 1/2^{τ+1} of the total, which is absolutely acceptable in the parallel program.
In total, this optimization phase needs 2^{c+2} bytes of memory for scalar processing and τ · n · |G| bytes of memory for the precomputed points, with τ · n PDBL operations. When τ equals its maximum value c − 1, the processing of each point consists of only one table lookup and at most one negation in F_p; it is thus constant-time for any tuple (k_{i,j}, P_i).
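The threshold-based table lookup can be modeled in a few lines (ours, with integers standing in for curve points, so doubling is integer doubling):

```python
# Table hits cover h <= tau; larger h falls back to h - tau doublings,
# which occurs for only ~1/2^(tau+1) of the slices.
tau = 6

def make_table(P):
    return [P << t for t in range(tau + 1)]   # [P, 2P, 4P, ..., 2^tau P]

def lookup(table, h):
    if h <= tau:
        return table[h]                       # direct table hit
    Q = table[tau]
    for _ in range(h - tau):                  # rare slow path
        Q += Q                                # PDBL
    return Q

table = make_table(7)
assert lookup(table, 4) == 7 << 4
assert lookup(table, 9) == 7 << 9
```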

Single-point and double-point addition optimization
The second optimization reduces the number of multiplications in the process of adding one or two original points into one bucket. The basic idea is still to represent buckets in a specific coordinate system and then accumulate the points in affine coordinates into the bucket. We mainly design two approaches:
• Cached Jacobian coordinates. When the GPU memory is of typical size, each affine point is added to the bucket with a mixed point addition in cached Jacobian coordinates (the single-point approach).
• Homogeneous coordinates. Fortuitously, we find that adding two points to the same bucket per round further reduces the number of multiplications. For instance, choosing homogeneous coordinates in Table 1, we add two affine points (P_j, P_{j+1}) into a buffer point PP_{j/2} (costing 5M + 2S) [CMO98], then add PP_{j/2} to the bucket via the point addition formula in homogeneous coordinates (costing 12M).
Remark that this double-point approach costs only 17M + 2S (≈ 19M) per two points, making it approximately 5% faster than the single-point approach (16M + 4S ≈ 20M per two points). However, the double-point approach is not constant-time, since the number of multiplications per point depends on the parity of the number of points added to the same bucket.
We also employ the lazy reduction technique [AKL+11, Sco07] in our point addition and point doubling operations. More specifically, we reduce the number of Montgomery reductions [Mon85, KKAK96] required for point addition (both points in projective form), mixed addition (one point in affine form), and doubling in the cached Jacobian coordinate system by one each. Similarly, for point addition (both points in projective form), addition (both points in affine form), and doubling in the homogeneous coordinate system, we reduce the number of Montgomery reductions by 3, 1, and 1, respectively.
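As a correctness check of the homogeneous-coordinate addition used above, the following toy sketch (ours; the tiny prime and curve y^2 = x^3 + 2x + 3 are illustrative assumptions, not the paper's BLS fields) compares the inversion-free projective formula in the style of [CMO98] against plain affine addition:

```python
# Projective (homogeneous) addition vs. affine addition on a toy curve.
p, a, b = 10007, 2, 3

def inv(t):
    return pow(t, p - 2, p)

def affine_add(P, Q):                        # distinct points with x1 != x2
    (x1, y1), (x2, y2) = P, Q
    lam = (y2 - y1) * inv(x2 - x1) % p
    x3 = (lam * lam - x1 - x2) % p
    return x3, (lam * (x1 - x3) - y1) % p

def proj_add(P, Q):                          # 12M + 2S, no field inversion
    X1, Y1, Z1 = P
    X2, Y2, Z2 = Q
    u = (Y2 * Z1 - Y1 * Z2) % p
    v = (X2 * Z1 - X1 * Z2) % p
    vv = v * v % p
    vvv = vv * v % p
    R = vv * X1 * Z2 % p
    A = (u * u * Z1 * Z2 - vvv - 2 * R) % p
    return v * A % p, (u * (R - A) - vvv * Y1 * Z2) % p, vvv * Z1 * Z2 % p

def to_affine(P):
    X, Y, Z = P
    zi = inv(Z)
    return X * zi % p, Y * zi % p

def sqrt_mod(t):                             # works since p = 3 (mod 4)
    r = pow(t, (p + 1) // 4, p)
    return r if r * r % p == t % p else None

pts = []                                     # brute-force two curve points
for x in range(2, 200):
    y = sqrt_mod((x ** 3 + a * x + b) % p)
    if y:
        pts.append((x, y))
    if len(pts) == 2:
        break
P, Q = pts
R_proj = proj_add((P[0], P[1], 1), (Q[0], Q[1], 1))
assert to_affine(R_proj) == affine_add(P, Q)
```

The lazy-reduction savings themselves are a Montgomery-arithmetic detail not visible at this level; the sketch only confirms the coordinate formulas agree.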

Load-balanced accumulation from original points to buckets
The third optimization accumulates all original points into buckets using parallel threads. Let tid denote each thread's index. For the maximum thread number N, we assign each thread to process n/N points with indices ∈ [s, e) = [n/N · tid, n/N · (tid + 1)). However, unordered and random scalars can lead to a conflict where different threads write to the same bucket. Alg. 2 aims to adjust the left and right boundaries (s, e) such that the buckets processed by each thread do not overlap. However, such a solution may lead to an imbalanced thread workload, resulting in a waste of computational resources. Therefore, we pre-allocate non-conflicting point buffers to solve this problem.
Instead of writing to buckets directly, we write to non-conflicting static buffers assigned to each thread. Intuitively, each thread handles data corresponding to at most n/N buckets, so we could allocate a buffer of n points for each thread; however, this approach would incur a memory penalty. Moreover, if we further parallelize over the λ_c subtasks, a memory size of λ_c · n points, which is sparse, becomes unacceptable. Therefore, we preallocate a static buffer for each of the λ_c subtasks, but set the starting offset at which each thread writes within the respective buffer as
offset_tid = tid + (â_s.ODD + 1)/2,    (2)
where (â_i.ODD + 1)/2 is the bucket offset corresponding to â_i. The correctness of offset_tid is evident because the previous threads with indices ∈ [0, tid) correspond to at least tid points. Additionally, whenever the boundary between different bucket offsets falls within a thread rather than between two threads, it results in an extra point. For clarity, we call such boundaries within threads "inner-boundaries" and those between threads "inter-boundaries", as shown in Figure 5. Since {â_i} are in ascending order, the minimum bucket offset in the current thread is exactly (â_s.ODD + 1)/2, which is also the maximum number of inner-boundaries existing in previous threads within the same Pippenger subtask. This ensures that the starting offset never exceeds offset_tid. In summary, we deduce that the total size of the secure static buffer is λ_c · (N + 2^{c−2}) points, and the points within it are compact (with unused memory occurring only as a low-probability event when boundaries happen to fall between threads). In large-scale MSM computations, such a buffer size is significantly smaller than λ_c · n and does not increase with the growth of n; it depends solely on the characteristics of the device.
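The offset formula can be checked by simulation (our own Python model; sorted odd integers stand in for the keys â_i.ODD): the write regions of consecutive threads never overlap, and the whole buffer fits in N + 2^{c−2} slots per subtask.

```python
import random
# Simulation of Eq. (2): offset_tid = tid + (key[s]+1)/2 for random odd keys.
c, N, n = 8, 16, 1000
random.seed(1)
keys = sorted(random.randrange(1, 1 << (c - 1), 2) for _ in range(n))

offsets, counts = [], []
for tid in range(N):
    s, e = n * tid // N, n * (tid + 1) // N
    offsets.append(tid + (keys[s] + 1) // 2)     # starting offset, Eq. (2)
    counts.append(len(set(keys[s:e])))           # buffered points this thread writes

# no thread's region spills into the next thread's starting offset
assert all(offsets[t] + counts[t] <= offsets[t + 1] for t in range(N - 1))
# the buffer stays within N + 2^(c-2) slots
assert offsets[-1] + counts[-1] <= N + (1 << (c - 2))
```

The no-overlap property holds for any sorted key sequence: distinct keys in one thread's range occupy at most as many slots as the gap between consecutive starting offsets.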
In addition to the buffer of elliptic curve points, we also allocate three auxiliary arrays: buffer_offset, buffer_index, and buffer_used. These respectively store the starting position at which each thread writes to the buffer (as in Eq. (2)), the bucket offsets corresponding to the points being written, and the actual buffer size used by each thread. Clearly, the sizes of these three arrays are λ_c · N, λ_c · (N + 2^{c−2}), and λ_c · N, respectively; they hold simple integer elements that are much smaller than curve points.
Then we can use a three-step process to complete the bucket accumulation phase:
1) Sorting the processed scalars. For each subtask, we initialize an array of tuples consisting of converted scalars and point indices {(k̄_{i,j}, i)} before bucket accumulation, and sort them in ascending order by the key k̄_{i,j}.ODD. For simplicity, we collectively refer to the ordered arrays obtained from the subtasks as {(â_i, p_i)}.
2) Accumulating parts of the buckets into buffers. With the non-conflicting static buffers, each thread accumulates parts of the buckets into its buffer and records the corresponding values in the three auxiliary arrays buffer_offset, buffer_index, and buffer_used.
3) Aggregating buffered points into buckets. After writing to the buffers, we need to aggregate the buffered points back into their respective unique buckets. It is worth noting that the bucket offsets corresponding to the points stored in the buffer are also sorted in ascending order. Therefore, we can utilize a variation of binary search [Wil76], shown in Alg. 5, to complete the entire bucket accumulation phase.

Algorithm 4 Accumulation of bucket parts into buffers using the shared memory
Input: thread index tid, intra-block thread index tid_inner
Output: buffer, buffer_offset, buffer_index, buffer_used
1: Obtain the boundaries (s, e)
2: pre_bucket_idx = 0x8000 ▷ 0x8000: non-existent bucket index
3: buffer_offset[tid] = offset = tid + (â_s.ODD + 1)/2
4: num = 0
… ▷ cached Jacobian-based accumulation loop
45: end for
46: if parity = 0 then …
For each subtask, we allocate 2^{c−2} threads, where the thread with index tid searches for all buffered points corresponding to the bucket B_{2·tid+1}. The accumulation of these points is done through the shared memory smem[NTHREADS]. The worst-case complexity of the binary search for a corresponding buffered point is O(log N), and all subtasks can run concurrently with λ_c · 2^{c−2} threads.
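The three-step process can be exercised end to end in a small Python model (ours, not Alg. 4/5; integer "points", one Python loop per "thread", and the standard bisect routine playing the role of the binary search):

```python
import bisect
# Steps 1-3 on one subtask; result must match direct bucket accumulation.
c, N = 6, 8
keys = [1, 1, 3, 3, 3, 5, 9, 9, 9, 9, 11, 15, 15, 21, 31, 31]   # sorted a_i.ODD
points = [(i + 1) * 10 for i in range(len(keys))]               # toy points
pairs = list(zip(keys, range(len(keys))))                       # step 1 output
n = len(pairs)

buffer, buffer_index = [], []                 # step 2: per-thread partial sums
for tid in range(N):
    s, e = n * tid // N, n * (tid + 1) // N
    for key in sorted(set(k for k, _ in pairs[s:e])):
        buffer.append(sum(points[i] for k, i in pairs[s:e] if k == key))
        buffer_index.append((key + 1) // 2)   # bucket offset of B_key

buckets = [0] * (1 << (c - 2))                # step 3: thread tid owns B_{2tid+1}
for tid in range(1 << (c - 2)):
    lo = bisect.bisect_left(buffer_index, tid + 1)    # binary search over the
    hi = bisect.bisect_right(buffer_index, tid + 1)   # sorted bucket offsets
    buckets[tid] = sum(buffer[lo:hi])

for t in range(1, 1 << (c - 1), 2):           # agrees with direct accumulation
    assert buckets[t // 2] == sum(points[i] for k, i in pairs if k == t)
```

Because the global key array is sorted, each thread's appended bucket offsets are nondecreasing across thread boundaries, which is what lets step 3 locate every fragment of a bucket with two binary searches.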

A layered parallel reduction algorithm for the bucket aggregation
After obtaining the values of {B_t}, where 1 ≤ t < 2^{c−1} and t is odd, we need to calculate the result of the bucket aggregation phase. Since the number of buckets in our proposed algorithm has been reduced to 1/4 of that in the original Pippenger algorithm, we no longer calculate the result of Eq. (1). Instead, we compute Q_j = Σ_{odd t, 1≤t<2^{c−1}} t · B_t for each subtask. But it is still necessary to update all the buckets to the partial sums G_t = Σ_{odd u, t≤u<2^{c−1}} B_u. For this reason, we propose a parallel layered reduction algorithm for this originally sequential process. For clarity, Figure 6 provides an example with only four buckets, divided into two rounds. In the first round, we concurrently update the bucket B_5 to B_5 = B_5 + B_7 and the bucket B_1 to B_1 = B_1 + B_3. In the second round, we concurrently update the bucket B_1 to B_1 = B_1 + B_5 and B_3 to B_3 = B_3 + B_5. When the number of buckets expands to 2^{c−2} in our scheme, the number of rounds expands to c − 2, i.e., log_2(2^{c−2}). In the best case, we can assign 2^{c−3} threads to each subtask with thread indices tid ∈ [0, 2^{c−3}). Then in the i-th round, each thread computes baseline = 2^i · ⌊tid / 2^{i−1}⌋ + 2^{i−1} and offset = (tid mod 2^{i−1}) + 1, and performs a single addition: it updates the (baseline + offset)-th-to-last bucket to be the sum of itself and the (baseline)-th-to-last bucket. Even if the total number of threads is not large enough, we can evenly distribute these small tasks, each involving only one addition, among the available threads, ensuring that the computational cost per thread remains quite low. Moreover, it is not necessary to launch a new kernel for each round; we directly use CUDA's cooperative groups to achieve thread synchronization across the entire grid. Therefore, the time complexity of our layered parallel reduction scheme remains logarithmic in the number of buckets.

Experimental results

We perform the experiments on three different GPUs: 1) V100, 2) RTX3090, and 3) RTX4090. Detailed hardware information, including the CPU configuration on the host side, is listed in Table 2. Since the main body of our MSM implementation runs on the GPU, the CPU's performance only affects the scalar transfer time between the host and the device. To address this, we employ multi-stream techniques to overlap data transfers with device computations, further reducing the latency overhead caused by transfers. The BLS12-377, BLS12-381, and BLS24-315 curves that we employ are defined over prime fields F_{p1}, F_{p2}, and F_{p3}, respectively. The primes are 377, 381 and 315 bits long, respectively. Since we store elements of F_{p1} (or F_{p2}) in multiple 32-bit unsigned integers, a single finite field element on both BLS12-377 and BLS12-381 is represented using 384 bits. Similarly, a single finite field element on BLS24-315 can be stored in only 320 bits.
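Before turning to the results, the layered reduction of Section 3.4 can be checked with a small sequential model (ours; integers stand in for bucket points, and the toy window size c = 6 is an assumption). Buk[m] stores B_{2m+1}; after c − 2 rounds, each Buk[m] must hold the suffix sum of all buckets with odd index ≥ 2m + 1:

```python
# Sequential model of the layered reduction with the baseline/offset formulas.
c = 6
M = 1 << (c - 2)                              # number of odd buckets
Buk = [(2 * m + 1) ** 2 for m in range(M)]    # arbitrary toy bucket contents
expect = [sum((2 * u + 1) ** 2 for u in range(m, M)) for m in range(M)]

for i in range(1, c - 1):                     # rounds i = 1 .. c-2
    for tid in range(1 << (c - 3)):           # one addition per "thread"
        baseline = (1 << i) * (tid >> (i - 1)) + (1 << (i - 1))
        offset = (tid % (1 << (i - 1))) + 1
        # (baseline+offset)-th-to-last bucket += baseline-th-to-last bucket
        Buk[M - baseline - offset] += Buk[M - baseline]

assert Buk == expect
```

Within a round, the read positions (multiples of the stride plus half a stride) never coincide with any write position, which is why the loop body is safe to run concurrently on the GPU.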
In terms of parameter selection, we set the window size c to 16, which avoids explicit sub-scalar segmentation by using a pointer of type uint16_t. We set τ to 6 for the cached Jacobian coordinate-based method and evaluate it on the BLS12-377 and BLS12-381 curves. For example, considering our tested MSM upper-limit size of 2^24, this requires 10.5 GB of memory to store 7 · n affine points (including the base points themselves), which is well within the capabilities of mainstream GPU devices. This parameter selection also ensures that the cases requiring doubling operations during the bucket accumulation phase account for only about 1/128 of the overall workload, effectively leveraging the precomputation table. For the method based on the homogeneous coordinate system, we generate a complete precomputation table (i.e., τ = 15) and evaluate it on the SNARK-friendly BLS24-315 curve. When the MSM size reaches 2^23, it requires 10 GB of memory to store 16 · n affine points (each point occupying 320 · 2 bits), which is still within an acceptable range.
The shared memory limit is 48 KB per block in CUDA; a block is launched only when there is enough free shared memory to accommodate the entire size required by that block. We set the block size NTHREADS to 64. The method based on cached Jacobian coordinates then occupies 24 KB of shared memory per block on the BLS12-381 curve, achieving full occupancy (2 · 24 KB). The method based on homogeneous coordinates occupies 15 KB per block on the BLS24-315 curve, achieving 93.75% occupancy (3 · 15 KB). In addition, we determine the maximum number of threads allocated based on the number of Streaming Multiprocessors (SMs) in the GPU; specifically, it is 256 times the number of SMs.

ZPrize [Zpr22] is an annual competition with a primary focus on promoting the utilization and advancement of zero-knowledge cryptography. One of its tracks focuses on accelerating MSM computations using GPUs, and our implementation is based on their testing framework. The source code is available at https://github.com/dunkirkturbo/wlc_msm. Since the number of uint32_t words used for element storage is the same on the BLS12-377 and BLS12-381 curves, the computational cost for operations like point addition is nearly identical (as confirmed in our practical testing). Table 3 presents our benchmark results of MSM on the BLS12-381 curve evaluated on different GPUs. This implies that our approach can be applied to the vast majority of MSM computation scales required by zk-SNARKs and significantly outperforms cuZK in terms of performance.
• Our load-balancing method requires a buffer of N_max + (λ/16) · 2^14 cached Jacobian points. Taking the RTX3090 as an example, we set the maximum thread number N_max to 82 · 256, and therefore the buffer size above does not exceed 52 MB (independent of the MSM size). Additionally, due to the similar number of SMs in the V100 and RTX3090, we compare the acceleration on the RTX3090 and RTX4090. cuZK is a load-balanced solution that utilizes sparse matrices, and the results show that our relative speedup over cuZK remains largely consistent on the RTX4090 as on the RTX3090. Hence, our performance also exhibits nearly linear growth with an increase in the total number of allocated threads.
To demonstrate the speedup of each optimization technique employed in Table 3, we first use only our method for mapping scalars (Section 3.1) and the parallel layered reduction algorithm (Section 3.4), comparing performance with cuZK. After incorporating the lazy reduction technique, we conduct another round of testing. Finally, with the application of the load-balancing algorithm (Section 3.3), we obtain the complete benchmark results presented in Table 3. The speedup ratio for each round of testing is shown in Table 4.
• Even in the "imbal" version, our implementation achieves up to 2.35×, 2.1×, and 2.3× speedup compared to cuZK on the V100, RTX3090, and RTX4090, respectively (at scales ranging from 2^19 to 2^24). After adding the lazy reduction technique, the "imbal+lazy" version achieves a further acceleration of up to 5.76% over the "imbal" version. Finally, the complete "bal+lazy" version is up to 21.48% faster than the "imbal+lazy" version. It should be noted that the "bal+lazy" version performs significantly better than the "imbal+lazy" version on devices with higher computational power.
Furthermore, we also evaluate the performance improvement brought by the load-balancing algorithm on the BLS24-315 curve. Beforehand, we switch to homogeneous coordinates on the basis of the "imbal+lazy" version, which yields a further performance improvement of approximately 3% (the improvement falls short of the theoretical value due to issues such as bank conflicts in the use of CUDA shared memory). This gives us a new control group with load imbalance. The comparative results between the imbalanced and balanced versions are shown in Figure 7. In the load-imbalanced version, we additionally sort the key â_i[0:30] during radix sort. Since each thread is responsible for sub-scalar values corresponding to non-overlapping buckets, warp synchronization is required for the operation of determining whether the y-coordinate should be negated, and warp synchronization is a critical factor affecting the efficiency of CUDA programs. Despite this, our load-balanced version is still 28.2% to 85.3% faster than the imbalanced version. Furthermore, as the MSM size increases, this advantage becomes even more pronounced.
In summary, our implementation on SNARK-friendly curves such as BLS12-381 and BLS24-315 significantly improves MSM calculation speed in G_1 compared to the latest implementations. It also achieves load balancing and adapts well to GPUs with varying computational power.

Constant-time algorithm for the bucket accumulation
This section revisits the cost analysis of Section 3.5, where the bucket accumulation phase based on the cached Jacobian coordinate is not constant time (here, "constant time" means the running time is independent of the secret input). In particular, for accumulating parts of the buckets into buffers, we can set the parameter τ to c − 1. This eliminates all doubling operations, and the cost per thread is fixed at n/N point additions. For aggregating the buffered points into buckets, we design a new algorithm that runs in constant time.
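To illustrate why setting τ = c − 1 leaves only additions, the following sketch models bucket accumulation with plain integers standing in for curve points (so `+` stands in for PADD). The function name and the precomputed table `base` are our own illustration, not the paper's implementation; with the table in place, every c-bit sub-scalar costs exactly one addition and no doublings.

```python
def accumulate(scalars, points, c, windows):
    """Illustrative Pippenger-style accumulation with full precomputation."""
    # Precompute 2^(w*c) * P_i for each point (done once, off the hot path).
    base = [[p << (w * c) for w in range(windows)] for p in points]
    buckets = [0] * (1 << c)          # one bucket per c-bit window value
    n_adds = 0
    for i, s in enumerate(scalars):
        for w in range(windows):
            m = (s >> (w * c)) & ((1 << c) - 1)   # c-bit sub-scalar
            if m:
                buckets[m] += base[i][w]          # one PADD, no PDBL
                n_adds += 1
    return buckets, n_adds
```

Correctness in this toy model means that the weighted bucket sum, Σ m · buckets[m], equals the MSM result Σ s_i · P_i.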
For each subtask, the size of the static buffer is N + 2^(c−2). Compared with the previous Alg. 5, we retain only one auxiliary array, buffer_index, which stores the bucket offsets of the points being written. By our definition of "inter-boundary" in Figure 5, each occurrence of such a boundary produces a redundant point. Excluding the bucket offsets of the redundant points, buffer_index is non-decreasing. Therefore, each thread can perform the operations OP_1, OP_2, OP_3, and OP_4 based on the four consecutive buffer_index values to be processed (as shown in Table 5).
Table 5: Operations (OP_1 → OP_2 → OP_3 → OP_4) corresponding to the relationships among four consecutive values in buffer_index. We write A, B, C, D for the points corresponding to these values. The operation OUT copies a point once. Note that before accumulating parts of the buckets, we initialize the buffer as {O} and buffer_index as {0}. We then increment the bucket offsets of all non-redundant points by 1, i.e., we prepend a redundant bucket at offset 0, which corresponds to the redundant points. After the buffered points are generated, the "inter-boundary" attribute guarantees that, in buffer_index, the value before an element 0 is less than the value after it. Thus, when determining which relation in Table 5 applies, an element 0 is treated as equal to its preceding element and smaller than its following element.
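A minimal sketch of the relation test (our own encoding, not the exact one used in the implementation): since buffer_index is non-decreasing once redundant points are excluded, four consecutive values induce a pattern of "=" and "<" comparisons, and that pattern is what selects the OP_1 → OP_2 → OP_3 → OP_4 sequence from Table 5.

```python
def rel(a, b):
    # buffer_index is non-decreasing (redundant points excluded),
    # so consecutive values compare as either "=" or "<".
    return "=" if a == b else "<"

def relation(idx, i):
    # Comparison pattern over four consecutive buffer_index values,
    # e.g. "=,=,<"; this is the key that would index Table 5.
    a, b, c, d = idx[i:i + 4]
    return ",".join((rel(a, b), rel(b, c), rel(c, d)))
```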

It is obvious that each thread requires at most 2 PADD and 2 OUT operations. Although some relationships do not require the complete execution of these operations, we still pad with operations such as adding O to ensure the constant-time property. More importantly, we avoid program branches through techniques such as address offsetting and table lookup. In detail, we unify the operations as follows. We allocate two arrays of length 6 in each thread, aux_1 and aux_2. The former stores the addresses of points A, B, C, D, followed by the address of O and a copy of the address of point D. The latter stores the addresses of the buckets corresponding to points A, B, C, D, followed by the address of O and a copy of the address of point C. Assuming the value of the relationship being processed by the current thread is i, each operation can be rewritten as a fixed lookup into aux_1 and aux_2 indexed by i (for example, OP_3 becomes an OUT whose operand address is read from the lookup). The required operand addresses are thus obtained through the same table lookup in all branches, and then exactly the same sequence OP_1 → OP_2 → OP_3 → OP_4 is executed. Afterwards, only the new points A and C are useful. However, we still need to update the corresponding values in buffer_index. Similarly, we allocate another array, aux_3, of length 5. It stores the bucket offsets corresponding to points A, B, C, D, followed by a zero. We then perform operations OP_5, OP_6, and OP_7, where OP_6 is designed for the "=, =, <" case. These operations are likewise performed on every branch.
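The branch-elimination idea can be sketched as follows. The slot table `SLOTS` below is purely hypothetical (the real operand pairs per relation come from Table 5), and integer `+` again stands in for PADD; the point is only that every relation value executes the identical instruction sequence, differing solely in which table slots it reads.

```python
O_SLOT = 4   # position of the point at infinity in the operand table

def make_aux1(A, B, C, D, O):
    # aux_1 as described above: A, B, C, D, O, then a copy of D.
    return [A, B, C, D, O, D]

# Hypothetical relation-to-slot table: for each relation value i, the two
# slots whose points feed one PADD.  Illustrative values only.
SLOTS = {0: (0, 1), 1: (1, 2), 2: (O_SLOT, 3)}

def uniform_padd(i, aux1):
    # Identical instruction sequence for every relation: one table lookup
    # followed by one addition.  No data-dependent branch is taken.
    a, b = SLOTS[i]
    return aux1[a] + aux1[b]
```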
Therefore, the complete set of operations is OP_1–OP_7. After each round of parallel processing, the number of buffered points and corresponding bucket offsets to be processed is halved, and the adjacent distance doubles. We can thus complete the aggregation of buffered points into buckets within log_2(N + 2^(c−2)) − 1 rounds, where the final round requires two more additions to handle the remaining two points. In summary, our new algorithm for aggregating buffered points requires 2 log_2(N + 2^(c−2)) point additions and 2(log_2(N + 2^(c−2)) − 1) copies per thread. OP_5–OP_7 involve only bitwise operations with relatively low overhead. All of the above operations are independent of the values of the input scalars, so we have theoretically achieved the constant-time property in the bucket accumulation phase.
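The halving rounds can be modeled as a stride-doubling tree reduction (integers stand in for points; the OUT copies and the OP_5–OP_7 bookkeeping are omitted from this sketch):

```python
def layered_reduce(vals):
    """Pairwise tree reduction: the stride doubles each round, so M
    elements are folded in about log2(M) rounds."""
    vals = list(vals)
    d = 1
    while d < len(vals):
        for i in range(0, len(vals), 2 * d):
            if i + d < len(vals):
                vals[i] += vals[i + d]    # one PADD per pair, per round
        d *= 2                            # adjacent distance doubles
    return vals[0]
```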
Note that if all scalars are identical, there will be only N useful buffered points (i.e., 2^(c−2) points at infinity). This results in many instances of OP_1 and OP_4 having O among their operands. After aggregating the buffered points into buckets, there are also 2^(c−2) − 1 buckets equal to O. Popular implementations of PADD usually begin by checking whether an operand is O and, if so, directly output the other operand. With such an implementation, the overall running time of the corner case (i.e., all scalars identical) would be less than that of a random instance. To solve this problem, we also patch the PADD operation: we first execute the addition formulas unconditionally, and only at the end check whether an operand is O. If not, we copy the temporary result to the output; otherwise, we copy the other operand. The copy cost of the two cases is exactly the same.
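A toy model of the patched PADD, with points represented as `(is_infinity, value)` pairs and integer addition standing in for the curve formulas: the formula always runs first, and each exit path performs exactly one copy, so the corner case costs the same as the generic case.

```python
O = (True, 0)   # the point at infinity in this toy representation

def padd_const(p, q):
    # Always execute the addition formula first, whatever the operands.
    tmp = (False, p[1] + q[1])
    # Only at the end decide what to copy out; every path copies exactly
    # one point, so an O operand does not shorten the running time.
    if p[0]:
        return q
    if q[0]:
        return p
    return tmp
```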
Experimental results. We benchmark our new algorithm under different scalar distributions, covering the corner and average cases. In the average case, all n scalars are sampled uniformly at random from F_r, where r is the order of G. In the corner case, the scalars are still random but all identical. Each reported figure is the average of 1000 executions. The results on the V100 are shown in Table 6.
Since the choice of the parameter τ directly determines the cost of doublings when generating buffered points, increasing τ to 15 eliminates all doubling operations. The performance improves significantly, but more GPU memory is required. For instance, an MSM of size 2^23 requires precomputing 16 · 2^23 affine points (including the base points themselves), occupying 12 GB of memory, whereas τ = 6 requires only 5.25 GB. This increase in memory usage remains acceptable and is applicable to popular GPUs such as the V100-SXM2-32GB. If we then apply the new algorithm for aggregating the buffered points into buckets, we obtain a constant-time version. In the average case, the performance of the constant-time version is close to that of the non-constant-time one. Although aggregating the buffered points requires more additions than Alg. 5, warp synchronization also contributes to faster parallel execution of operations such as PADD.
If the parameter τ is fixed at 15, the non-constant-time version suffers a performance drop in the corner case. As analyzed in Section 3.5, aggregating the buffered points then requires up to N point additions in one of the threads. However, the drop becomes less noticeable as n increases, since N is relatively small compared to n. For example, performance decreases by 50.2% when n = 2^19, but only by 6.8% when n = 2^23. In contrast, the running time of our constant-time version in the corner case is almost indistinguishable from the average case; for example, the difference is below 0.22% when n = 2^23. The slight remaining difference is due to CUDA's internal memory-access optimizations that we cannot adjust (e.g., coalesced memory access [NVI23]). When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more memory transactions, depending on the size of the word accessed by each thread and the distribution of memory addresses across the threads; a memory transaction covers up to 128 bytes. On the BLS12-381 curve, |F_p| = 48 bytes but |G| = 192 bytes > 128 bytes (in the cached Jacobian coordinate). In our implementation, different threads never access points at the same global memory location, so these point accesses are not coalesced.
The affected accesses are therefore only those to the scalar mapping table and op_index, where each element is 4 bytes. Global memory accesses are always cached in L2 (i.e., served by 32-byte memory transactions), so one memory transaction covers 8 elements of these two tables.
• For op_index: the total length of this array is 32 bytes, so these accesses are coalesced regardless of the input.
• For the scalar mapping table: when the table elements mapped by sub-scalars within the same warp fall into a consecutive 32-byte range, the accesses are coalesced. Even assuming an ideal attacker unaffected by benchmark noise, knowing that the running time of one set of n scalars is less than that of another only reveals that the difference between certain sub-scalars within the same warp does not exceed 7. Even if the attacker is given the small 32-byte ranges into which the 32 coalesced sub-scalar accesses fall, there remain at least (32/4)^32 = 8^32 possibilities. Moreover, these ranges are not determinable in practice.
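The arithmetic behind this bound can be checked directly (the constant names below are ours):

```python
TRANSACTION_BYTES = 32   # one L2-cached global memory transaction
ELEMENT_BYTES = 4        # one scalar-mapping-table entry
WARP_SIZE = 32           # threads (sub-scalars) per warp

# One transaction covers 32/4 = 8 table entries, so each sub-scalar is
# narrowed only to one of 8 values; across a warp that leaves at least
# 8^32 candidate assignments.
entries_per_txn = TRANSACTION_BYTES // ELEMENT_BYTES
candidates = entries_per_txn ** WARP_SIZE
```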
Based on the above analysis, and given the inherent benchmark noise, the information leakage introduced by CUDA's internal memory-access optimization is negligible.

Conclusion
In this work, we present a novel GPU-accelerated MSM algorithm that applies to the large-size scenarios required by zk-SNARKs and achieves high performance. First, we propose a new method for mapping scalars into Pippenger's bucket indices, reducing the number of buckets to 1/4 of that in the original Pippenger algorithm. Second, we focus on the fundamental point addition operations of the bucket accumulation phase: we employ mixed point addition formulas in cached Jacobian coordinates when GPU memory is of typical size, and switch to a more efficient algorithm based on homogeneous coordinates, which saves one finite field multiplication for every two additions, when memory is sufficient. We also apply the lazy reduction technique to all point addition and doubling operations. Third, we present our load-balanced bucket accumulation method using a compact static buffer, so that the parallel speedup ratio grows almost linearly with the number of device threads. Finally, we provide a parallel layered reduction algorithm for the previously serial bucket aggregation phase, whose time complexity remains logarithmic in the number of buckets.
Using the above techniques, our MSM achieves approximately 40.2% to 185.1% faster performance than cuZK on popular graphics cards (at scales ranging from 2^19 to 2^24). Our approach can be further optimized at a lower level. Longa [Lon23] proposed an approach that generalizes interleaved modular multiplication algorithms to compute sums of products over large prime fields; it avoids the penalty of double-precision operations and performs faster on popular SNARK-friendly curves. Adinets and Merrill [AM22] presented an LSD radix sorting algorithm for large GPU sorting problems residing in global memory; compared to CUB, which we use, it provides a speedup of about 1.5×. Applying these algorithms is left as future work.

Figure 1: An example of the bucket accumulation phase.

Figure 2: The original bucket aggregation phase (excluding the final summation).
e = n
while s ≠ 0 and s < n and a_s = a_{s−1} do
    s = s + 1
end while
while e < n and a_e = a_{e−1} do
    e = e + 1
end while
for i = s to e − 1 by 1 do
    ...

Figure 4: Two methods of adding affine points to the bucket.
Furthermore, we utilize shared memory, denoted smem[2 · NTHREADS] (or smem[3 · NTHREADS]), to speed up reads and writes of the original/bucket points. Here, NTHREADS is the (CUDA) block size. As shown in Alg. 4, each thread writes smem[2 · tid_inner + 1] back to global memory after processing each bucket in the cached Jacobian coordinates, while smem[3 · tid + 2] is written back to global memory in the homogeneous coordinate system.

Figure 7: Comparison of the load-balanced and imbalanced versions of our method based on the homogeneous coordinate system (BLS24-315 MSM on RTX4090).

Table 1: Costs of point addition in different coordinate systems. The notations M and S denote the costs of multiplication and squaring in F_p, respectively (M ≈ S).

Table 2: Hardware configuration of the testing environments.

Table 4: The speedup ratio relative to cuZK as our optimization techniques are added one by one. For simplicity, "imbal" uses only the scalar-mapping method and the parallel layered reduction algorithm, "imbal+lazy" additionally incorporates the lazy reduction technique, and "bal+lazy" further applies the load-balancing algorithm.

Table 6: Execution times (milliseconds) of BLS12-381 MSM on the V100 (based on the cached Jacobian coordinate). For simplicity, "const" denotes the version satisfying the constant-time property and "non-const" the version that does not; "avg" means scalars are randomly sampled, and "corner" means all scalar values are random but identical.