Time-Memory Analysis of Parallel Collision Search Algorithms

Abstract. Parallel versions of collision search algorithms require a significant amount of memory to store a proportion of the points computed by the pseudo-random walks. Implementations available in the literature use a hash table to store these points and allow fast memory access. We provide theoretical evidence that memory is an important factor in determining the runtime of this method. We propose to replace the traditional hash table by a simple structure, inspired by radix trees, which saves space and provides fast look-up and insertion. In the case of many-collision search algorithms, our variant has a constant-factor improved runtime. We give benchmarks that show the linear parallel performance of the attack on elliptic curve discrete logarithms and improved running times for meet-in-the-middle applications.


Introduction
Given a function f : S → S on a finite set S, we call a collision any pair a, b of elements in S such that f(a) = f(b). Collision search has a broad range of applications in the cryptanalysis of both symmetric and asymmetric ciphers: computing discrete logarithms, finding collisions on hash functions and meet-in-the-middle attacks. Pollard's rho method [Pol78], initially proposed for solving factoring and discrete logarithms, can be adapted to find collisions for any random mapping f. The parallel collision search algorithm, proposed by van Oorschot and Wiener [vW99], builds on Pollard's rho method and is expected to have a linear speedup compared to its sequential version. This algorithm computes several walks in parallel and stores some of their points, called distinguished points.
In this paper, we revisit the memory complexity of the parallel collision search algorithm, both for applications that need a small number of collisions (e.g. discrete logarithms) and for those needing a large number of collisions, such as meet-in-the-middle attacks. In the case of discrete logarithms, collision search methods are the fastest known attacks in a generic group. In elliptic curve cryptography, subexponential attacks are known for solving the discrete log on curves defined over extension fields, but only generic attacks are known to work in the prime field case. Evaluating the performance of collision search algorithms is thus essential for understanding the security of curve-based cryptosystems. Several record-breaking implementations of this algorithm are available in the literature. Over a prime field, we note the computation of a discrete log in a 112-bit group on a curve of the form y^2 = x^3 − 3x + b [BKK+12, BKM09]. This computation was performed on a Playstation 3. More recently, Bernstein, Lange and Schwabe [BLS11] reported on an implementation on the same platform and for the same curve, in which the use of the negation map gives a speed-up by a factor of √2. Over binary fields, the current record is an FPGA implementation breaking a discrete logarithm in a 117-bit group [BEL+]. As for the meet-in-the-middle attack, this generic technique is widely used in cryptanalysis to break block ciphers (double and triple DES, GOST [Iso11]), lattice-based cryptosystems (NTRU [HGSW03, vV16]) and isogeny-based cryptosystems [ACC+18].
Two models of computation can be considered for this algorithm. The first one follows the shared-memory paradigm, in which each thread computes distinguished points and stores them in the common memory. The second one is a message-passing model, where the threads computing points, called the clients, send the distinguished points to a separate process running on a different machine, called the server, which handles the memory and checks for collisions.
Our first contribution is to extend the analysis of the parallel collision search algorithm and present a formula for the expected runtime to find any given number of collisions, with and without a memory constraint. We show how to compute optimal values of θ, the proportion of distinguished points, that minimize the running time of collision search, both in the case of discrete logarithms and in that of many-collision attacks. In the case where the available memory is limited, we determine the optimal value of θ. Going further in the analysis, our formulae show that the actual running time of many-collision algorithms decreases significantly as the number of words w that can be stored in memory grows.
Secondly, we focus on the data structure used for the algorithm. To the best of our knowledge, all existing implementations of parallel collision search algorithms use hash tables to organize memory and allow fast lookup operations. In this paper, we introduce a new structure, called Packed Radix-Tree-List (PRTL), which is inspired by radix trees. We show that the use of this structure leads to better use of memory in implementations and thus yields improved running times for many-collision applications.
Using the PRTL structure, we have implemented the parallel collision search algorithm for discrete logarithms on elliptic curves defined over prime fields and experimented using a Shared-Memory Parallelism (SMP) system. Our benchmarks demonstrate the performance and scalability of this method. While in the case of a single discrete log, the PRTL variant implementation yields running times similar to those of a hash table approach, our experiments demonstrate that the new data structure gives faster limited-memory multi-collision attacks.
Organisation. Section 2 reviews algorithms for solving the discrete logarithm problem and for meet-in-the-middle attacks. In Section 3, we revisit the proof for the time complexity of the collision finding algorithm for a small and a large number of collisions. Section 4 describes our choice for the data structure, complexity estimates and comparison with hash tables. Finally, Section 5 presents our experimental results.

Parallel collision search
In this section, we briefly review Pollard's rho method and the parallel algorithm for searching collisions. Let S be a finite set of cardinality n. In order to look for collisions for a function f : S → S with Pollard's rho method, the idea is to compute a sequence of elements x_i = f(x_{i−1}) starting at some random element x_0. Since S is finite, this sequence eventually begins to cycle, and we therefore obtain the desired collision f(x_k) = f(x_{k+t}), where x_k is the last point in the sequence before the cycle begins and x_{k+t} is the last point on the cycle before getting back to x_{k+1} (hence f(x_k) = f(x_{k+t}) = x_{k+1}). One may show that the expected number of steps taken until the collision is found is √(πn/2), and therefore that the memory complexity of storing the whole sequence is also O(√(πn/2)). This algorithm can be further optimized to constant memory complexity by using Floyd's cycle-finding algorithm [Jou09, Bre80]. We do not further detail memory optimizations here, since they are inherently sequential in nature and there is currently no known way to exploit these ideas in a parallel algorithm.
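Floyd's cycle-finding idea can be sketched as follows; the iteration function f and its modulus are illustrative stand-ins for a random mapping, not the maps used later in the paper.

```c
#include <stdint.h>

/* Toy iteration function on a small finite set (illustrative only). */
static uint64_t f(uint64_t x) { return (x * x + 1) % 1000003; }

/* Floyd's tortoise-and-hare: computes the tail length mu and the cycle
 * length lam of the sequence x0, f(x0), f(f(x0)), ... in O(1) memory. */
static void floyd(uint64_t x0, uint64_t *mu, uint64_t *lam) {
    uint64_t tortoise = f(x0), hare = f(f(x0));
    /* Phase 1: the hare moves twice as fast, so they meet inside the cycle. */
    while (tortoise != hare) { tortoise = f(tortoise); hare = f(f(hare)); }
    /* Phase 2: restart the tortoise; they now meet at the cycle entrance. */
    *mu = 0;
    tortoise = x0;
    while (tortoise != hare) { tortoise = f(tortoise); hare = f(hare); (*mu)++; }
    /* Phase 3: measure the cycle length. */
    *lam = 1;
    hare = f(tortoise);
    while (tortoise != hare) { hare = f(hare); (*lam)++; }
}
```

When mu ≥ 1, the points x_{mu−1} and x_{mu+lam−1} are distinct and map to the same value, which is exactly the collision described above.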
The parallel algorithm for collision search proposed by van Oorschot and Wiener [vW99] assigns to each thread the computation of a trail given by points x_i = f(x_{i−1}), starting at some point x_0. Only points that belong to a certain subset, called the set of distinguished points, are stored. This set is defined by an easily testable property. Whenever a thread computes a distinguished point x_d, the tuple (x_0, x_d) is stored in a common list. If two walks collide, this is identified when they both reach a common distinguished point. We may then re-compute the two trails; the points preceding the common point are distinct points that map to the same value.
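The trail-and-locate mechanism described above can be illustrated by a small sequential simulation. The random map, the set size, the distinguished-point property (trailing zero bits) and the start generator are toy choices of ours, and a single loop plays the role of the parallel threads:

```c
#include <stdint.h>

#define NBITS 20
#define N ((uint32_t)1 << NBITS)            /* size of the toy set S */
#define DBITS 6                             /* theta = 1/2^DBITS */
#define DMASK (((uint32_t)1 << DBITS) - 1)
#define MAXLEN (20u << DBITS)               /* abandon overlong trails */

/* A pseudo-random, non-injective map on S built from a 64-bit mixer. */
static uint32_t f(uint32_t x) {
    uint64_t z = x + 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return (uint32_t)(z ^ (z >> 31)) & (N - 1);
}

static int is_distinguished(uint32_t x) { return (x & DMASK) == 0; }

/* One stored tuple per distinguished point: (start, trail length). */
typedef struct { uint32_t start, len; uint8_t used; } entry;
static entry table[N >> DBITS];

/* Run trails from pseudo-random starts until two of them reach the same
 * distinguished point, then rewalk both trails to locate a pair a != b
 * with f(a) == f(b). Returns 1 on success, 0 if no collision was found. */
static int search(uint32_t *ca, uint32_t *cb) {
    for (uint32_t seed = 1; seed < 100000; seed++) {
        uint32_t s = f(seed ^ 0xA5A5A5A5u);   /* pseudo-random start */
        uint32_t x = s, len = 0;
        while (!is_distinguished(x) && len < MAXLEN) { x = f(x); len++; }
        if (!is_distinguished(x)) continue;   /* trail abandoned */
        entry *e = &table[x >> DBITS];
        if (!e->used) { e->start = s; e->len = len; e->used = 1; continue; }
        if (e->start == s) continue;          /* same trail found twice */
        /* Align both trails at equal distance from the common point. */
        uint32_t x1 = e->start, l1 = e->len, x2 = s, l2 = len, t;
        if (l1 < l2) { t = x1; x1 = x2; x2 = t; t = l1; l1 = l2; l2 = t; }
        while (l1 > l2) { x1 = f(x1); l1--; }
        if (x1 == x2) continue;               /* one trail contains the other */
        for (;;) {
            uint32_t n1 = f(x1), n2 = f(x2);
            if (n1 == n2) { *ca = x1; *cb = x2; return 1; }
            x1 = n1; x2 = n2;
        }
    }
    return 0;
}
```

Storing the trail length alongside the start point is what makes the locate step cheap: the longer trail is advanced until both trails are at the same distance from the shared distinguished point, after which the two walks are stepped together until their successors meet.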
Solving discrete logarithms. In this subsection, S denotes a cyclic group of order n. We focus on the elliptic curve discrete logarithm problem (ECDLP) in a cyclic group S = ⟨P⟩, but the methods described in this paper apply to any finite cyclic group. We will assume that the curve E is defined over a finite field F_p, where p is a prime number. Let Q ∈ S and say we want to solve the discrete logarithm problem Q = xP, where x ∈ Z. To apply the ideas explained above, we define a map f : S → S which behaves randomly and such that each time we compute R_{i+1} = f(R_i) we can easily keep track of integers a_i and b_i such that f(R_i) = a_i P + b_i Q. Pollard's initial proposal for such a function was

    f(R) = R + P if R ∈ S_1,    f(R) = 2R if R ∈ S_2,    f(R) = R + Q if R ∈ S_3,    (1)

where the sets S_i, i ∈ {1, 2, 3}, are pairwise disjoint and give a partition of the group S. As a consequence, whenever a collision f(R_j) = f(R_k) occurs, we obtain an equality

    a_j P + b_j Q = a_k P + b_k Q.    (2)

This allows us to recover x = (a_j − a_k)/(b_k − b_j) (mod n), provided that b_k − b_j is not a multiple of n. Starting from R_0, a multiple of P, Pollard's rho method [Pol78] computes a sequence of points R_i where R_{i+1} = f(R_i). Since the group S is finite, this sequence will produce a collision after √(πn/2) iterations on average. In the parallel version, each thread computes a walk, and only distinguished points on this walk are stored in memory. These points are usually defined by an easily testable property, such as a certain number of trailing bits of their x-coordinate being zero. Whenever a thread computes such a point, it is stored in a common list, together with the corresponding a and b. When two walks collide, this cannot be identified until the common distinguished point is computed. Then the discrete logarithm can be recovered from an equation of type (2).

Many collision applications.
A first type of application of the van Oorschot and Wiener algorithm computing many collisions is the multi-user setting of both public and secret key schemes. In such a setting, it has been demonstrated that it is more efficient to recover individual keys one by one, using a growing common database of distinguished points, instead of running the algorithm for each key separately (see [KS01, FJM14]). A second type of application concerns meet-in-the-middle attacks, which require finding a collision of the type f_1(a) = f_2(b), where f_1 : D_1 → R and f_2 : D_2 → R are two functions with the same co-domain. As explained in [vW99], solving this equation may be formulated as a collision search problem on a single function f : S × {1, 2} → S × {1, 2}, where the solution we need is of the type

    f(a, 1) = f(b, 2),    (3)

and S is a set in bijection with D_1. This collision is called the golden collision. The number of unordered pairs of elements of S is approximately n²/2 and the probability that the two points in a pair map to the same value under f is 1/n. There are therefore n/2 expected collisions for f, and there may be several solutions to Equation (3). If one assumes that all collisions are equally likely to occur, then in the worst case all possible n/2 collisions for f are generated before finding the golden one. Because so many collisions are generated, memory complexity can be the bottleneck in meet-in-the-middle attacks, and the memory constraint becomes an important factor in determining the running time of the algorithm. We further explain this idea in Section 3.
Computational model and data structure. We consider a CPU implementation of the shared memory variant of the algorithm, where each thread involved in the process performs the same task of finding and storing distinguished points. In this case, the choice of a data structure for allowing efficient lookup and insertion is significant. The most common structure used in the literature is a hash table. In order to make parallel access to memory possible, van Oorschot and Wiener [vW99] propose the use of the most significant bits of distinguished points. Their idea is to divide the memory into segments, each corresponding to a pattern on the first few bits. Threads read off these first bits and are directed towards the right segment. Each segment is organized as a memory structure on its own.
In recent years, with the development of GPUs and programmable circuits, the client-server model has been widely used for implementing parallel collision search. In this setting, a large number of client chips communicate with a central memory server over the Internet. For computing discrete logarithms, [BBB+09] gives a comparison between implementations on different architectures in this model. Current record-breaking implementations of ECDLP also rely on this model [BKK+12, BKM09, BEL+].
Except for the need for a structure that allows efficient simultaneous access to memory, all results in this paper apply to both the client-server and the SMP versions of the PCS algorithm, even though our experimental results are obtained using a CPU implementation following the SMP paradigm.

Notation.
In the remainder of this paper, we denote by θ the proportion of distinguished points in a set S. We denote by n the number of elements of S. We denote by E an elliptic curve defined over a prime finite field F p and by E(F p ) the group of points on E defined over F p . Whenever the set S is the group E(F p ), n is the cardinality of this group. For simplicity, in this case, we assume that n is prime (which is the optimal case in implementations).

Time complexity
Van Oorschot and Wiener [vW99] gave formulae for the expected running time of parallel collision search algorithms. In this section, we revisit the steps of their proof and give a careful analysis of the running time, both for computing a single collision and for multiple-collision applications. Our refined formulae indicate that the actual running time of the algorithm depends on the proportion of distinguished points, and allow us to determine the optimal choice of θ for actual implementations.

Finding one collision: elliptic curve discrete logarithm
Van Oorschot and Wiener [vW99] showed that the runtime for finding one collision is approximately (1/L)·√(πn/2) + 2.5/θ iterations of f, with L the number of threads we use. This is obtained by finding the expected number of computed points before a collision occurs and then intuitively dividing the clock time by L when L processors are involved. The proof of the following theorem, given in Appendix A, provides a more rigorous argument for the linear scalability of the algorithm.
Theorem 1. Let S be a set with n elements and f : S → S be a random map. In the parallel collision search algorithm run on L threads, denote by θ the proportion of distinguished points, and by t_c and t_s the time for computing and for storing a point, respectively. The expected running time to find one collision is

    E(T) = (1/L)·√(πn/2)·(t_c + θ·t_s) + (2/θ)·t_c.    (4)

A corresponding worst-case bound, Equation (5), is derived in Appendix A.
Remark 1. In the client-server model, clients do not have access to the memory; they send distinguished points to the server, so t_s stands for the cost of communication on the client side. We suppose that all the client processors are dedicated to computing points. On the server side, however, the analysis is different. Theorem 1 and the means of finding the optimal value for θ apply both to the shared-memory implementation adopted in this paper and to the more common distributed client-server model.
As Equations (4) and (5) show, the proportion of distinguished points we choose influences the time complexity. The optimal value for θ is the one that gives the minimal running time. Most importantly, our analysis puts forward the idea that the optimal choice for θ depends essentially on the choices made for the implementation and for memory management. From this formula, we easily deduce that if the proportion of distinguished points is too small or too large, the running time of the algorithm increases significantly.
By estimating the ratio t_s/t_c for a given implementation, one can extrapolate the optimal value of θ by computing the zeros of the derivative of the function in Equation (4). Figure 1 gives timings for our implementation of the attack, using a hash table to store distinguished points. The timings shown in the figure are averaged over 100 runs on a 65-bit curve and support our theoretical findings.
Note that most recent implementations available in the literature choose the number of trailing bits giving the distinguished point property in a range between 0.178 log n and 0.256 log n (see [BEL + , BLS11, BKM09]). This value was determined by experimenting on curves defined over small size fields. Our theoretical findings confirm that these values were close to optimal, but we suggest that for future record-breaking implementations, the value of θ should be determined as explained above.

Finding many collisions
Using a simplified complexity analysis, van Oorschot and Wiener [vW99] put forward the following heuristic: with w words of memory available, the golden collision can be found in approximately (2.5/L)·√(n³/w) function evaluations. This heuristic suggests that in the case of many-collision attacks, a memory data structure that allows storing more distinguished points will yield a better time complexity. We give a more refined analysis of the running time of a parallel collision search for finding m collisions.
Theorem 2. Let S be a set with n elements and f : S → S a random map. We denote by θ the proportion of distinguished points in S, by w the number of memory words available and by L the number of threads. The expected running time to find m collisions for f is

    E(T) = (1/L)·( √(2mn)·(t_c + θ·t_s) + (2m/θ)·t_c )    if θ·√(2mn) ≤ w,

    E(T) = (1/L)·( (w/θ)·(t_c + θ·t_s) + (m − w²/(2θ²n))·(θn/w)·t_c + (2m/θ)·t_c )    otherwise.    (6)

Proof. Let T_1 be the expected number of distinguished points computed until the first collision is found and, for any i > 1, let T_i be the expected number of distinguished points stored in memory after the (i−1)th collision is found and before the ith collision is found. As shown in Theorem 1, the expected number of points stored before finding the first collision is T_1 = θ·√(πn/2). As in the proof of Theorem 1, the probability of not having found the second collision after each thread has found and stored T distinguished points can be approximated by an exponential function of T, and integrating this survival probability (the last equality being obtained by integration by parts) gives the expected number of distinguished points computed by one thread before the second collision. Denoting by U_k the expected number of points computed per thread until the kth collision and applying this formula repeatedly (neglecting the last integral), we obtain a sequence given by a recurrence formula. We use the Cesaro-Stolz criterion to prove the convergence of the rescaled sequence: it is increasing and tends to ∞, and examining consecutive differences shows that the total number of points computed until the kth collision is approximately √(2kn), hence the total number of stored points is θ·√(2kn). The memory therefore fills when θ·√(2kn) = w, which occurs after computing the first k_w = w²/(2θ²n) collisions; the expected time spent by one thread up to that point is w/(Lθ) point computations. When the memory is full, the expected number of steps to find a further collision is θn/w (see [vW99] for a detailed explanation). Finally, to actually locate each collision, we need to restart the two colliding trails from their starts, which requires 2/θ steps on average.
To sum up, the total number of steps performed per thread to find m collisions is

    (1/L)·√(2mn) + 2m/(Lθ)    if θ·√(2mn) ≤ w,

    w/(Lθ) + (m − w²/(2θ²n))·θn/(wL) + 2m/(Lθ)    otherwise,    (7)

which, weighted by the computation and storage costs t_c and t_s, gives the running time stated in the theorem.

Remark 2. According to the formula obtained in Equation (7), we see that if the memory is not filled when running the algorithm for finding n/2 collisions, as in meet-in-the-middle applications, then the threads compute about √(2·(n/2)·n) = n points in total, i.e. they store θn distinguished points, that is, all distinguished points in S.
Note that the proof of Theorem 2 relies strongly on our formula for the expected total number of computed distinguished points for finding m collisions, when m is sufficiently large and the memory is not limited: θ·√(2mn). We confirmed this asymptotic formula experimentally by running a multi-collision algorithm for a curve over a 55-bit prime field. The comparison of our formula with the experimental results is given in Table 1; each entry is an average of 100 runs, with θ = 1/2^13. Furthermore, our formula coincides with the estimated workload for computing k discrete logarithms in [KS01], which is obtained using a different analysis, valid when k < n^(1/4). By minimizing the time complexity function obtained in Theorem 2, we obtain an estimate for the optimal value of θ for computing n/2 collisions.
Corollary 1. Suppose w ≪ n memory words are available. The running time for finding n/2 collisions is minimized by taking θ ≈ √(2w/n), which yields an expected running time of approximately √(2n³/w)·t_c/L.

Proof. From Theorem 2, neglecting storage costs and lower-order terms, the runtime for m = n/2 collisions is proportional to

    f(θ) = w/(2θ) + n/θ + θn²/(2w).

By computing the zeros of the derivative

    f′(θ) = −(w/2 + n)/θ² + n²/(2w),

we obtain θ = √(w(w + 2n))/n, which is approximately √(2w/n) when w ≪ n; substituting this value gives the stated runtime. This eliminates some of the heuristics in [vW99] and confirms the asymptotic runtime. Most importantly, Corollary 1 suggests that in the case of applications that fill the available memory, the number of distinguished points we can store is an important factor in the running time complexity. More storage space yields a faster algorithm by a constant factor. We propose such an optimization in Section 4.
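The birthday-type estimate √(2mn) for the number of computed points needed to produce m collisions can be sanity-checked with a toy sampling experiment; this corresponds to storing every point (θ = 1), and the PRNG and parameters are arbitrary choices of ours:

```c
#include <stdint.h>
#include <stdlib.h>

/* Deterministic xorshift64 PRNG, for reproducibility. */
static uint64_t rng_state = 88172645463325252ULL;
static uint64_t xrand(void) {
    rng_state ^= rng_state << 13;
    rng_state ^= rng_state >> 7;
    rng_state ^= rng_state << 17;
    return rng_state;
}

/* Draw uniform values from a set of size n until m duplicates ("collisions")
 * have been observed; returns the total number of draws. */
static uint64_t draws_until_m_collisions(uint64_t n, uint64_t m) {
    unsigned char *seen = calloc((size_t)n, 1);
    uint64_t draws = 0, collisions = 0;
    while (collisions < m) {
        uint64_t x = xrand() % n;   /* n is a power of two below, so no bias */
        draws++;
        if (seen[x]) collisions++; else seen[x] = 1;
    }
    free(seen);
    return draws;
}
```

For n = 2^20 and m = 100, the count should land near √(2mn), about 1.4·10^4.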

Our approach for the data structure
In this section, we evaluate the memory complexity of parallel collision search algorithms. As explained in Section 2, van Oorschot and Wiener [vW99] proposed to divide the memory into segments to allow simultaneous access by threads. We revisit this construction, with the goal of minimizing memory consumption as well. Since Section 3 showed that the time complexity of collision search depends strongly on the available amount of memory, we propose an alternative structure called a Packed Radix-Tree-List, referred to as PRTL in this paper. We explain how to choose the densest implementation of this structure for collision search data storage in Section 5.
Since the PRTL is inspired by radix trees, we first describe the classic radix tree structure and then give a complexity analysis explaining why its straightforward implementation is not memory efficient. The PRTL retains the memory savings coming from the common prefixes of a radix tree, while avoiding the memory loss of manipulating pointers.

Radix tree structure
Each distinguished point from the collision search is represented as a number in a base of our choice, denoted by b. For example, in the case of attacks on elliptic curve discrete logarithms, we may represent a point by its x-coordinate. The first digit of this number in base b gives the root node in the tree, the next digit is a child, and so on. This leads to the construction of an acyclic graph which consists of b connected components (i.e. a forest).
In regard to memory consumption, we take advantage of common prefixes to have a more compact structure. Let c be the length of numbers written in base b that we store in the tree and K the number of distinguished points computed by our algorithm. To estimate the memory complexity of this approach, we give upper and lower bounds for the number of nodes that will be allocated in the radix tree before a collision is found.
Proposition 1. The expected number of nodes N in the radix tree verifies the following inequalities:

    (b/(b−1))·K ≤ E(N) ≤ (b/(b−1))·K + K·(c − log_b K).

The proof of these inequalities is detailed in Appendix B.
Traditionally, nodes in a tree are implemented as arrays of pointers to child nodes. This representation will lead to excessive memory consumption when the data to be stored follows a uniform random distribution, leading to sparsely populated branches and to the average distribution of nodes in the tree being closer to the worst case than to the best case.
The difference between the worst-case value and the best-case value can be approximated as Δ ∼ K(c − log_b K). Depending on the application, this value may be large. Let us consider the case where a single collision is required for solving the ECDLP. By a theorem of Hasse [Sil86], we know that the number of points on the curve is given by n = p + 1 − t, with |t| ≤ 2√p. Since we assume that n is prime, we approximate log n ∼ log p, and hence c ∼ log_b n. With K ≈ θ·√(πn/2) stored points, an approximation of Δ is

    Δ ≈ K·((1/2)·log_b n − log_b θ),

which implies that the tree is sparse. In the case of many-collision algorithms, c ∼ log_b K and this difference becomes negligible, resulting in a space-reduced data structure. We show how to handle sparse trees efficiently in Section 4.2.
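As a toy illustration of the node-count bounds above, the following pointer-based binary trie (b = 2) counts allocated nodes while inserting random c-bit keys; the key generator and parameters are arbitrary:

```c
#include <stdint.h>
#include <stdlib.h>

#define C 32  /* key length in bits (base b = 2) */

typedef struct node { struct node *child[2]; } node;

static size_t node_count = 0;

static node *new_node(void) {
    node_count++;
    return calloc(1, sizeof(node));
}

/* Insert a C-bit key, most significant bit first.
 * Returns 1 if the key was not present before. */
static int trie_insert(node *root, uint32_t key) {
    node *cur = root;
    int created = 0;
    for (int i = C - 1; i >= 0; i--) {
        int bit = (key >> i) & 1;
        if (!cur->child[bit]) { cur->child[bit] = new_node(); created = 1; }
        cur = cur->child[bit];
    }
    return created;
}
```

For K distinct keys, every key owns a distinct depth-C node and allocates at most C nodes, so the count always lies between K and K·C; with uniformly random keys it sits near the upper end, illustrating the sparseness discussed above.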

Packed Radix-Tree-List
Starting from the analysis in Section 4.1, we look to construct a more memory-efficient structure by avoiding the properties of the classic radix tree that make it costly for our purposes. Intuitively, the radix tree is dense at the upper levels and sparse at the lower ones. Hence it is more efficient to construct a radix tree up to a certain level and then to add the points to linked lists, each list starting from a leaf of the tree. We denote by l the level up to which we build the radix tree. We call this structure a Packed Radix-Tree-List. Figure 2 illustrates an example of an abstract Radix-Tree-List in base 4.

This idea was considered by Knuth [Knu98, Chapter 6.3] for improving on a table structure called a trie, introduced by Fredkin [Fre60]. Knuth considers a forest of radix trees that stop branching at a certain level, whose choice is a trade-off between space and fast access. Indeed, the more we branch, the faster the lookup is, but the more memory we require. He suggests that this mixed strategy yields a faster lookup when we build the tree up to a level where only a few keys are possible; starting from this level, a sequential search through a list of the remaining keys is fast.

In our use case, we favor memory optimization over fast lookup, and thus use a different technique to decide on the tree level. First, we estimate up to which level the tree is complete in our use case. The number of leaves in a complete radix tree of depth l is b^l. As per the coupon collector's problem, all the linked lists associated with a leaf will contain at least one point when the following inequality is verified:

    b^l · (l·ln b + 0.5772) ≤ K.    (11)

We consider the highest value of l which satisfies this inequality to be the optimal level, as it allows us to obtain the shortest linked lists while having a 100% rate of use of the memory structure. We verified this experimentally by inserting a given number of randomly obtained points of length 65, with b = 2, in the PRTL structure.
The results are in Table 2. We performed 100 runs for each value of K and counted the number of empty lists at the end of each run. None of the 300 runs finished with an empty list in the PRTL structure, which supports the claim that the obtained l is small enough to have at least one point per list. Then, to confirm that l is the highest possible value that achieves this, we reproduced the experiments by taking l + 1, the lowest value that does not satisfy Equation (11). The results show that l + 1 is not small enough to produce a 100% rate of use of the memory; therefore l is in fact the optimal level to choose.

The attribution of a point to a leaf is determined by its prefix, and we know in advance that all the leaves will be allocated. Therefore, in practice we do not actually have to construct the whole tree, but only the leaves. Hence, we allocate an array indexed by prefixes beforehand and then insert each point in the list for the corresponding prefix. The operation used to map a point to an index is faster than a hash table function: we perform a bitwise AND between the x-coordinate of the point and a precomputed mask to extract the prefix. Furthermore, the lists are kept sorted. Since we are doing a search-and-add operation, sorting the lists does not take additional time and proves to be more efficient than simply adding at the end of the list. Figure 3 illustrates the implementation of this structure.
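The choice of the level l and the mask-based indexing can be sketched as follows (base b = 2, as in our implementation; the constants come from the coupon collector bound above):

```c
#include <stdint.h>

/* Largest level l such that, by the coupon collector bound
 * 2^l * (l*ln 2 + 0.5772) <= K, every one of the 2^l leaf lists is
 * expected to be non-empty after K insertions (base b = 2). */
static int optimal_level(uint64_t K) {
    const double ln2 = 0.6931471805599453, euler_gamma = 0.5772;
    int l = 0;
    while ((double)(1ULL << (l + 1)) * ((l + 1) * ln2 + euler_gamma) <= (double)K)
        l++;
    return l;
}

/* Leaf-array index of a point: its l low-order bits, via one bitwise AND
 * with a precomputed mask. */
static uint64_t leaf_index(uint64_t x, int l) {
    return x & ((1ULL << l) - 1);
}
```

For instance, expecting K = 2^20 stored points yields l = 16, i.e. an array of 65536 leaf lists, each expected to hold roughly l·ln 2 + 0.5772 ≈ 11.7 points.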
Remark 3. When implementing the attack for curves defined over sparse primes, we advise taking an l-bit suffix instead of an l-bit prefix. Prefixes of numbers in sparse prime fields are not uniformly distributed and one might end up only with prefixes starting with the 0-bit, and therefore a half-empty array.
Remark 4. To experiment with this structure, we chose the example of the ECDLP. In this case, we store the starting point of the Pollard walk kP and the first distinguished point we find, represented by the coefficient k and the x-coordinate, respectively. Consequently, we store a pair (x-coordinate, k). However, the analysis and choices we made for constructing the PRTL are valid for every collision search application which needs to store pairs (key, data) and requires pairs to be efficiently looked up by key. For the ECDLP, Bailey et al. [BBB+09] propose, for example, to store a 64-bit seed on the server instead of the initial point, which makes the pair (x-coordinate, seed).
Remark 5. In our implementation, we always use b = 2, and thus, the parameter b will no longer be specified.

PRTL vs. hash table.
We experimented with the ElfHash function, which is used in the UNIX ELF format for object files. It is a very fast hash function, and thus comparable to the mask operation in our implementation. Small differences in efficiency are negligible, since insertion is the less significant part of the algorithm: recall that an insertion is performed only every 1/θ iterations of the random map f. As is the practice with parallel collision search, we allocate K indexes for the hash table, since we expect to store K points. Recall that this guarantees an average search time of O(1), but it does not avoid multi-collisions: according to [Jou09, Section 6.3.2], in order to avoid 3-multi-collisions, one should choose a hash table with K^(3/2) buckets. Consequently, we insert points in the linked lists corresponding to their hash keys, as we did with the PRTL. Every element in the list holds a pair (key, data) and a link to the next element. The PRTL is more efficient in this regard, as we only need to store the suffix of the key.
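For reference, the classic ELF hash (a PJW-family hash), as it is commonly given; a point would then be inserted into the list at index elf_hash(key) % K:

```c
/* The classic ELF object-file hash: shifts each byte into the state and
 * folds the top nibble back into the low bits to keep the value spread. */
static unsigned long elf_hash(const unsigned char *name) {
    unsigned long h = 0, g;
    while (*name) {
        h = (h << 4) + *name++;
        if ((g = h & 0xF0000000UL) != 0)
            h ^= g >> 24;
        h &= ~g;   /* clear the folded nibble */
    }
    return h;
}
```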
With this approach, we cannot be sure that 100% of the hash table indexes will have at least one element. We test this by inserting a given number of random points on a 65-bit curve and counting the number of empty lists at the end of each run, as we did to test the rate of use for the PRTL. We try two different table sizes: the recommended hash table size and, for comparison, a size that matches the number of leaves in the PRTL. All results are an average of 100 runs. The results in Table 3 show that when we choose a smaller table size, we have fewer empty lists, but the hash table is still not 100% full. Due to these results, when implementing a hash table we choose to allocate an array of pointers to slots, instead of an array of actual slots that would not all be filled. This is the optimal choice because we only waste 8 bytes for each empty slot, instead of 24 (the size of one slot).
Since the results in Table 2 show that the array in a PRTL will be filled completely, when using this structure we allocate an array of slots directly. This saves the PRTL 8K bytes (8 bytes per expected stored point) compared to a hash table.
To sum up, the PRTL structure is less space-consuming and has a memory rate of use of 1. Note, moreover, that by Equation (11), the average number of elements in a linked list corresponding to a prefix is K/b^l ≈ l·ln b + 0.5772. This shows that the search time in our structure is negligible, and our benchmarks in Section 5 confirm that memory access has no impact on the total running time of the algorithm.
It is clear that when one implements the PRTL, this structure takes the form of a hash table where the hash function is in fact reduction modulo a specific value calculated using Equation (11). It might seem counter-intuitive that the optimal choice of hash function is the modulo function. However, collision search algorithms do not require a memory structure with hash table properties, such as each key being assigned to a unique index.
Finally, a well-distributed hash function is useful when we look to avoid multi-collisions. With collision search algorithms, the number of stored elements is so vast that we cannot possibly allocate a hash table of the appropriate size, and thus we are certain to have longer-than-usual linked lists. Fortunately, this is not a problem, since the insertion time is in this case not significant compared to the 1/θ random-walk computations needed before each insertion. For example, 1/θ would be of order 2^32 for a 129-bit curve. On the other hand, as shown in Section 3, the available storage space is a significant factor in the time complexity, which makes the use of this alternative structure more appropriate for collision search.

PRTL vs the data structure in [CLN + 20].
It is worth noting here that the use of the modulo function to compute the index of a distinguished point in the array has already been proposed in the literature, for a meet-in-the-middle attack on isogeny-based cryptography [CLN+20]. The data structure in [CLN+20] is an array with w entries, where w is the total number of distinguished points that can be stored in memory. Since this particular attack uses several versions of the function f, the choice of the modulo-w operation to compute the indexes in the array makes it easy to detect whether two distinguished points in a collision were obtained using the same function. Moreover, the authors chose not to implement linked lists; hence, once a distinguished point having an index that was used before is found, the old point is thrown away. With respect to memory use, the proposed structure is similar to a hash table: since the size of the array is not computed in terms of the number of expected points, there will be empty slots in it when no distinguished points are computed for the corresponding indexes.

Implementation and benchmarks
To support our findings, we implemented the parallel collision search using both PRTLs and hash tables, for discrete logarithms on elliptic curves defined over prime fields. Our C implementation relies on the GNU Multiple Precision Arithmetic Library [MM11] for large-number arithmetic, and on the OpenMP (Open Multi-Processing) interface [OPE] for shared-memory multiprocessing. Our experiments were performed on a 28-core Intel Xeon E5-2640 processor with 128 GB of RAM, using between 1 and 28 threads. In this section, we first explain the implementation of the PRTL structure in detail and then show experimental results.

To obtain the most tightly packed structure, we avoid wasting space on addressing, structure memory alignment and unintended padding. Hence we store all relevant data in one byte-vector. Our compact slot has the following structure: the key-suffix and the data are bound together in a single vector, so that at most 7 bits are wasted due to alignment.
We designed functions to extract and set values in the vector. Our implementation of a PRTL yields a better memory occupation and, most importantly, manipulating this structure does not slow down the overall runtime of the attack. We show experimental results verifying this in Table 4, where we insert a given number of random points on a 65-bit curve, using both a hash table and the PRTL. To measure a runtime that does not depend on point computation time, we take θ = 1, meaning every point is a distinguished one. The key length is thus c = 65. All results are averages over 100 runs.
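The packed byte-vector and its accessors can be sketched as follows; the bit-field widths (a 44-bit key-suffix followed by 64-bit data) are illustrative choices of ours, not the paper's exact parameters:

```c
#include <stdint.h>
#include <stddef.h>

/* Write the `nbits` low-order bits of `value` into `vec`,
 * starting at bit offset `pos`. */
static void vec_set_bits(uint8_t *vec, size_t pos, unsigned nbits, uint64_t value) {
    for (unsigned i = 0; i < nbits; i++) {
        size_t bit = pos + i;
        uint8_t mask = (uint8_t)(1u << (bit % 8));
        if ((value >> i) & 1) vec[bit / 8] |= mask;
        else                  vec[bit / 8] &= (uint8_t)~mask;
    }
}

/* Read `nbits` bits from `vec` starting at bit offset `pos`. */
static uint64_t vec_get_bits(const uint8_t *vec, size_t pos, unsigned nbits) {
    uint64_t value = 0;
    for (unsigned i = 0; i < nbits; i++) {
        size_t bit = pos + i;
        value |= (uint64_t)((vec[bit / 8] >> (bit % 8)) & 1) << i;
    }
    return value;
}

/* Round-trip self-check: pack a 44-bit suffix and 64-bit data back to
 * back in one vector (108 bits, i.e. 14 bytes) and read them back. */
static int slot_roundtrip(uint64_t suffix, uint64_t data) {
    uint8_t vec[16] = {0};
    vec_set_bits(vec, 0, 44, suffix);
    vec_set_bits(vec, 44, 64, data);
    return vec_get_bits(vec, 0, 44) == suffix && vec_get_bits(vec, 44, 64) == data;
}
```

With these widths one slot fits in 14 bytes, the 4 unused bits being the (at most 7) bits lost to alignment mentioned above.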
We show similar experiments in Table 5. This time, we performed actual attacks on the discrete log over elliptic curves, instead of inserting random points. Since the number of stored points is now random and can differ between two sets of runs, the runtime per stored point and the memory per stored point are the more relevant metrics.
The results are an average of 100 runs and show that using a PRTL for the storage of distinguished points improves the memory use by a factor of 3.
Calculating the exact memory occupation. Let f be the size of the field in bits and t the number of trailing bits set to zero in a distinguished point. We keep the notation K for the expected number of stored points and l for the level of the PRTL structure. To calculate the expected memory occupation of the entire PRTL structure, we first calculate the size of a compact slot. Recall that a compact slot holds one byte-vector and a pointer to the next slot, because we use linked lists. Thus, one compact slot takes ((f − l − t) + f)/8 + 8 bytes (f − l − t bits for the key-suffix and f bits for the data). The size of the slot is multiplied by K, as K slots will be allocated. To make sure that accesses to the shared memory are synchronized, we use locks on the shared data structure. To minimize the time threads spend on locks, there is one lock for every entry in the array, i.e. 2^l locks in total. This adds 8 · 2^l bytes to the total memory occupation.

Let us compare this total cost to that of a hash table. As explained in Section 4, the entries in our hash table are linked lists as well. To store a pair (key, data), every element in the linked list holds a pointer to the key, a pointer to the data and a pointer to the next element, which need 24 bytes in total. To this we add (f − t)/8 bytes for storing the key and f/8 bytes for storing the data. All this is multiplied by K. Finally, we allocate 8K bytes for the array of pointers to slots and 8K bytes for the locks (one lock for each entry in the array). Hence, the total memory occupation of the hash table is (40 + (f − t)/8 + f/8)K bytes.

Table 6 shows examples of memory requirements of large ECDLP computations calculated in this way. The first two lines of this table concern the computation in [BEL + ] on the elliptic curve target117 over F_{2^127}. For this example, we suppose that a 64-bit seed is stored instead of the k-coefficient representing the data (see Remark 4).
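The two byte counts above can be transcribed directly into a small helper (our own naming; plain integer division stands in for the exact floor behaviour, and we assume l + t < f):

```c
#include <stdint.h>

/* Expected memory (in bytes) of the PRTL: K compact slots of
 * ((f - l - t) + f)/8 + 8 bytes each, plus 8 bytes for each of the
 * 2^l per-entry locks. */
uint64_t prtl_bytes(uint64_t K, unsigned f, unsigned l, unsigned t) {
    uint64_t slot = ((uint64_t)(f - l - t) + f) / 8 + 8;
    return K * slot + 8 * ((uint64_t)1 << l);
}

/* Memory of the hash-table variant: per stored point, three pointers
 * (24 bytes), (f - t)/8 bytes of key and f/8 bytes of data, plus 8
 * bytes of pointer array and 8 bytes of lock: (40 + (f-t)/8 + f/8) K. */
uint64_t hash_bytes(uint64_t K, unsigned f, unsigned t) {
    return K * (40 + (uint64_t)(f - t) / 8 + f / 8);
}
```

The helper makes it easy to compare the two structures for the parameter sets of Table 6.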
On the first line we give the amount of memory needed to store the expected number of distinguished points, while on the second line we consider the actual computation in [BEL + ], which finished after collecting 968531433 distinguished points. Note that, in the case of the actual computation, the parameter l was calculated with respect to the estimated value of K, since l always needs to be set beforehand. Similarly, the size of the hash table and the number of locks correspond to the estimated value of K, as the table is allocated beforehand. On the third line, we also give the memory requirements for a discrete log computation on a 160-bit curve, with an estimated number of stored distinguished points.

Teske [Tes01] showed experimentally that Pollard's original walk, described in Equation (1), performs on average slightly worse than a random walk. She proposed alternative mappings that lead to the same performance as expected in the random case: additive walks and mixed walks. In our implementation, we adopted additive walks and chose r = 20 as the number of sets S_i that give a partition of the group S; Teske showed experimentally that additive walks with r ≥ 20 are close to random walks.
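For intuition, here is a toy additive walk in the additive group Z_n rather than on an elliptic curve; in the real attack, the precomputed steps are random combinations of the group generators, and the set index of a point is derived from its x-coordinate. All names are ours:

```c
#include <stdint.h>
#include <stdlib.h>

#define R 20            /* number of sets S_i in the partition, r >= 20 */

/* Precomputed random steps c_0, ..., c_{r-1}. */
static uint64_t steps[R];

void walk_init(uint64_t n, unsigned seed) {
    srand(seed);
    for (int i = 0; i < R; i++)
        steps[i] = (((uint64_t)rand() << 31) ^ (uint64_t)rand()) % n;
}

/* One step of the additive walk in Z_n: the set index of x is x mod r,
 * and the next point is obtained by adding the corresponding step. */
uint64_t walk_step(uint64_t x, uint64_t n) {
    return (x + steps[x % R]) % n;
}

/* Deterministically run `iters` steps from x0 with a given seed. */
uint64_t walk_run(uint64_t x0, uint64_t n, unsigned seed, int iters) {
    walk_init(n, seed);
    for (int i = 0; i < iters; i++) x0 = walk_step(x0, n);
    return x0;
}
```

Since the step depends only on the current point, two walks that ever meet proceed identically afterwards, which is what makes distinguished-point collision detection work.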
In the theoretical model [vW99], parallel collision search is considered to have linear scalability, and our time complexity in Theorem 1 confirms this. To assess the parallel performance of our implementation, we experimented with L ∈ {1, 2, 7, 14, 28} threads, solving the discrete log over a 60-bit curve. Table 7 shows the wall-clock runtime and the parallel performance of the attack when we double the number of threads. The parallel performance indicates how the runtime of a program changes when the number of parallel processing elements increases. It is computed as (L_2 · t_2)/(L_1 · t_1), where t_i is the wall-clock runtime with L_i threads and L_1 > L_2. A program is considered to scale linearly if the speedup is equal to the number of threads used, i.e. if the parallel performance is equal to 1 (or very close to 1, in practice). From our results, we conclude that the parallel performance is not as good as expected for a small number of threads, but gets closer to linear as the number of threads grows.
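Under linear scaling the total work L · t stays constant, so the parallel performance reduces to a simple ratio; a trivial helper (our naming), assuming L_1 > L_2:

```c
/* Parallel performance when going from L2 to L1 threads (L1 > L2):
 * equals 1 exactly when t1 * L1 == t2 * L2, i.e. linear scaling,
 * and drops below 1 when the extra threads are partly wasted. */
double parallel_perf(double t1, unsigned L1, double t2, unsigned L2) {
    return (t2 * (double)L2) / (t1 * (double)L1);
}
```

For example, halving the runtime when doubling the thread count gives a performance of exactly 1.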

Multi-collision search computation.
To support our claims from Section 3 that more storage space yields a faster algorithm, we ran a multi-collision search while limiting the available memory. When the memory is full, each thread continues to search for collisions without adding new points. As a practical application of this computation, we chose the discrete logarithm in the multi-user setting [KS01, FJM14]. Hence, the data that we store for each distinguished point is the coefficient k, plus an integer identifying the user. The results in Table 8 show that the PRTL yields a better runtime than a classic hash table, due to its more efficient memory use.

Limitations of our approach and future work

Our discrete logarithm implementation was not conceived with the intent to break records and, as such, it does not include techniques for fast arithmetic operations on the curve. To quantify the impact of our data structure in that setting, we would need to combine it with existing arithmetic optimization techniques and adapt the implementation to a client-server model that takes the communication overhead into account. However, note that in current records for a single ECDLP computation [BEL + , BKM09] the memory requirements were quite small, and we do not think that the use of the PRTL would result in new records at the time of this writing. In any case, Table 6 shows that the PRTL brings the amount of RAM needed to run the attack at current security levels closer to figures that could be feasible in practice, especially if distributed memory were used. We do, however, expect the PRTL to have a visible impact on the running time of meet-in-the-middle applications, since these have significant memory requirements. To quantify this impact, we intend to adapt our analysis and experiments to the case of golden collision search for meet-in-the-middle attacks [ACC + 18, CLN + 20, HGSW03].
As shown by van Oorschot and Wiener, and verified experimentally in the literature, this attack requires several versions of the function f . Once a certain number of distinguished points has been produced, the function version needs to be changed and the distinguished points in memory discarded. All this suggests that, at the implementation level, we might need to adapt the PRTL data structure and the choice of the parameter l, which gives the size of the array in the PRTL.
Finally, another perspective is to check whether it is possible to use fewer (or even no) locks in the data structure, as they represent a non-negligible part of the memory requirements. Locks are necessary to avoid inconsistencies. However, in real-world implementations, the time to compute one distinguished point can be long enough for the probability of two threads writing a distinguished point to the same slot at the same time to be negligibly small. Our experimental results show that the algorithm can be parallelized linearly, which suggests that the threads do not spend time waiting on locks. Nevertheless, more experiments are needed to verify whether removing the locks is feasible.
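The one-lock-per-entry idea (lock striping) can be sketched as follows; we use POSIX mutexes here purely for illustration, whereas our implementation relies on OpenMP locks, and all names are ours:

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

#define ENTRIES (1u << 10)   /* 2^l array entries, one lock each */

typedef struct node { uint64_t key, data; struct node *next; } node_t;

static node_t *entry[ENTRIES];
static pthread_mutex_t locks[ENTRIES];

void store_init(void) {
    for (unsigned i = 0; i < ENTRIES; i++) {
        entry[i] = NULL;
        pthread_mutex_init(&locks[i], NULL);
    }
}

/* Insert a node at the head of its entry's list while holding only
 * that entry's lock: threads hitting different entries never contend. */
void store_insert(node_t *n) {
    unsigned i = (unsigned)(n->key % ENTRIES);
    pthread_mutex_lock(&locks[i]);
    n->next = entry[i];
    entry[i] = n;
    pthread_mutex_unlock(&locks[i]);
}

/* Single-threaded self-check: two keys mapping to the same entry
 * end up in the same linked list, newest first. */
int store_selfcheck(void) {
    static node_t a, b;
    store_init();
    a.key = 5;           a.data = 100;
    b.key = 5 + ENTRIES; b.data = 200;
    store_insert(&a);
    store_insert(&b);
    return entry[5] == &b && b.next == &a;
}
```

Removing the locks would amount to deleting the lock/unlock pair around the two-pointer update, at the cost of a (rare) race on simultaneous inserts into the same entry.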

Conclusion
We revisited the time complexity of the parallel collision search and explained how to choose the optimal value for the proportion of distinguished points when implementing this algorithm. We proposed an alternative memory structure for the parallel collision search algorithm of van Oorschot and Wiener [vW99]. We showed that this structure yields a better memory complexity than the hash table variant of the algorithm. Moreover, using the new memory structure, we obtained a better bound on the time complexity of the parallel collision search in the case where a large number of collisions is needed. The experiments in this paper are limited to the cryptanalysis of the discrete logarithm problem on elliptic curves. In future work, we will explore the applicability of our findings to other applications of multi-collision search, such as isogeny-based cryptography and lattice-based cryptography.
Proof. 1. We call a short path the chain of points computed by a thread between two consecutive distinguished points. The expected number of distinguished points produced after a clock time T is TLθ. The probability of not having a collision at T = 1, for one thread, is 1 − Lθ/(nθ).
Note that any of the L threads can cause a collision. Thus, at the moment T = 1, the probability that no thread finds a collision on any point of its short walk is (1 − Lθ/(nθ))^L. Let X be the number of points calculated per thread before duplication. Hence P(X > T) = ∏_{i=1}^{T} (1 − iL/n)^L. To evaluate this product, we take a shortcut: when x is close to 0, a coarse first-order Taylor approximation gives e^x ≈ 1 + x. This yields the probability P(X > T) ≈ e^(−T²L²/(2n)), and thus the expected number of points computed before duplication is E(X) = ∑_{T=0}^{∞} T · (P(X > T − 1) − P(X > T)) = ∑_{T=0}^{∞} P(X > T).
We approximate E(X) = ∑_{T=0}^{∞} P(X > T) ≈ ∫_0^∞ e^(−x²L²/(2n)) dx = (1/L)√(πn/2). Since the expected length of a short walk is 1/θ, the number of distinguished points produced before a collision occurs is (θ/L)√(πn/2).
However, a collision might occur on any point of the walk, and it will not be detected until the walk reaches a distinguished one. We therefore add 1/θ to the number of calculations needed for the discovery of the collision. Finally, the expected number of calculated points per thread is (1/L)√(πn/2) + 1/θ. The two main operations in our algorithm are computing the next point on the random walk and storing a distinguished point; counting both yields the time complexity stated in Theorem 1. 2. To compute the worst-case time complexity, we compute the standard deviation of the random variable X, σ(X) = √(E(X²) − E(X)²). Using an approximation similar to Equation (12), we obtain E(X²) ≈ ∫_0^∞ e^(−xL²/(2n)) dx ≈ 2n/L². Hence σ(X) ≈ √(2n/L² − πn/(2L²)) = (√n/L)·√(2 − π/2), and the worst-case runtime E(X) + σ(X) remains in O(√n/L). Remark 6. Note that the analysis above shows that the number of points computed by the algorithm is O(√(πn/2) + L/θ). This was proven by van Oorschot and Wiener in the first place.

Appendix B: Proof of Proposition 1
The lower and upper bounds in Equation (10) are given by the best-case and worst-case scenarios, respectively, for the number of nodes.
Worst-case scenario. In the worst-case scenario, each new word added to this structure creates as many nodes as possible. This means that the x-coordinates of the added points have the shortest possible common prefixes, as shown in Figure 4. The first b points use bc nodes. After that, all possibilities for the first letter in the string have been created, so the next distinguished point takes c − 1 nodes; this case is repeated (b − 1)b times, provided that K > b + (b − 1)b. More generally, let k = ⌊log_b K⌋ − 1. We build the tree by allocating nodes as follows: the first b points take c nodes each, and, for i ∈ {1, …, k}, the next (b − 1)b^i points take c − i nodes each. Each of the remaining K − (b + ∑_{i=1}^{k} (b − 1)b^i) points needs c − k − 1 nodes. To sum up, the total number of nodes bounding our worst-case scenario is bc + ∑_{i=1}^{k} (b − 1)b^i (c − i) + (K − b − ∑_{i=1}^{k} (b − 1)b^i)(c − k − 1). Simplifying the sums and approximating, using k = ⌊log_b K⌋ − 1, gives the upper bound of Equation (10). Best-case scenario. Let K be the number of distinguished points that we need to store and let k = ⌊log_b K⌋. In the best-case scenario, we may assume without loss of generality that each time a new point is added to the structure, the minimal number of nodes is used, i.e. the x-coordinate of the added point has the longest possible common prefix with some point that was previously stored. For example, the first point allocates c nodes, each of the next b − 1 points allocates one extra node, and so on, until all subtrees of depth 1, 2, etc. are filled one by one. Figure 5 gives an example of how 215 points are stored. If K > b^(c−1), we fill the first tree and start a new one. Let x_i, for i ∈ {0, 1, …, k}, denote the i-th digit of K in base b, from right to left. In full generality, since c > k, we use: • x_k complete subtrees of depth k, plus an (x_k + 1)-th incomplete subtree of depth k; • inside the (x_k + 1)-th subtree of depth k, x_{k−1} complete subtrees of depth k − 1, plus an (x_{k−1} + 1)-th incomplete subtree of depth k − 1, and so on; • c − k − 1 extra nodes. Summing up all these nodes yields the lower bound of Equation (10). We conclude that the bounds of Equation (10) hold.
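The worst-case count can be transcribed into a small function and checked against a hand count on tiny examples (our naming; integer arithmetic only, with the convention that K ≤ b keys simply get K disjoint paths of c nodes):

```c
#include <stdint.h>

/* Worst-case number of radix-tree nodes for K keys of length c over an
 * alphabet of size b: the first b points take c nodes each, the next
 * (b-1)b^i points take c-i nodes each for i = 1..k with
 * k = floor(log_b K) - 1, and each remaining point takes c-k-1 nodes. */
uint64_t worst_nodes(uint64_t K, unsigned b, unsigned c) {
    if (K <= b) return K * c;           /* every key gets a disjoint path */
    unsigned k = 0;
    uint64_t p = b;                     /* invariant: p = b^(k+1) */
    while (p * b <= K) { p *= b; k++; } /* k = floor(log_b K) - 1 */
    uint64_t nodes = (uint64_t)b * c;
    uint64_t covered = b;
    uint64_t bi = b;                    /* bi = b^i */
    for (unsigned i = 1; i <= k; i++) {
        nodes   += (uint64_t)(b - 1) * bi * (c - i);
        covered += (uint64_t)(b - 1) * bi;
        bi *= b;
    }
    nodes += (K - covered) * (c - k - 1);
    return nodes;
}
```

For b = 2, c = 4 and K = 8, the maximally spread binary trie has 2 + 4 + 8 + 8 = 22 nodes across its four levels, which matches the formula.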