Batching CSIDH Group Actions using AVX-512

Commutative Supersingular Isogeny Diffie-Hellman (or CSIDH for short) is a recently-proposed post-quantum key establishment scheme that belongs to the family of isogeny-based cryptosystems. The CSIDH protocol is based on the action of an ideal class group on a set of supersingular elliptic curves and comes with some very attractive features, e.g. the ability to serve as a “drop-in” replacement for the standard elliptic curve Diffie-Hellman protocol. Unfortunately, the execution time of CSIDH is prohibitively high for many real-world applications, mainly due to the enormous computational cost of the underlying group action. Consequently, there is a strong demand for optimizations that increase the efficiency of the class group action evaluation, which is not only important for CSIDH, but also for related cryptosystems like the signature schemes CSI-FiSh and SeaSign. In this paper, we explore how the AVX-512 vector extensions (incl. AVX-512F and AVX-512IFMA) can be utilized to optimize constant-time evaluation of the CSIDH-512 class group action with the goal of, respectively, maximizing throughput and minimizing latency. We introduce different approaches for batching group actions and computing them in SIMD fashion on modern Intel processors. In particular, we present a hybrid batching technique that, when combined with optimized (8 × 1)-way prime-field arithmetic, increases the throughput by a factor of 3.64 compared to a state-of-the-art (non-vectorized) x64 implementation. On the other hand, vectorization in a 2-way fashion aimed to reduce latency makes our AVX-512 implementation of the group action evaluation about 1.54 times faster than the state-of-the-art. To the best of our knowledge, this paper is the first to demonstrate the high potential of using vector instructions to increase the throughput (resp. decrease the latency) of constant-time CSIDH.


Introduction
Quantum computing exploits quantum-mechanical effects and phenomena, such as state superposition and entanglement, to efficiently solve certain computational problems, in particular optimization and search problems [KLM07]. However, quantum computing has also a destructive side since it is assumed that a quantum computer with a few thousand logical qubits would be capable to break essentially any public-key cryptosystem in use today [RNSL17]. The dawning era of quantum computing has spurred much research on Post-Quantum Cryptography (PQC), a sub-domain of cryptography concerned with the design, analysis and implementation of cryptosystems that are expected to resist attacks executed on both conventional and quantum computers [SL21]. Almost all of the to-date existing post-quantum key establishment and signature algorithms fall into one of five categories, which are lattice-based cryptography, multivariate cryptography, hash-based cryptography, code-based cryptography, and supersingular isogeny cryptography. These categories differ with respect to the hard mathematical problems their security is based on, but also in terms of computational cost, key lengths, and the length of ciphertexts (resp. signatures) [SL21]. The security of isogeny-based cryptosystems rests upon the intractability of the problem of finding an explicit isogeny between two (supersingular) elliptic curves over a finite field that are known to be isogenous [DeF17]. While isogenybased schemes are computation-intensive, their key sizes are among the smallest of the five categories and come even close to that of pre-quantum elliptic curve schemes.
Various isogeny-based cryptosystems have appeared in the literature in the past ten years. SIKE (short for Supersingular Isogeny Key Encapsulation) is a key encapsulation mechanism whose security relies upon the supersingular isogeny walk problem between two elliptic curves in the same isogeny class, which asks to find a path made of isogenies of small degree [Cos19]. A variant of SIKE is an alternative candidate in the third round of the PQC standardization project of the NIST [JAC + 20]. CSIDH (an abbreviation of Commutative Supersingular Isogeny Diffie-Hellman) is an "ECDH-like" key-exchange scheme based on a commutative group action of an ideal class group [CLM + 18]. Given an initial elliptic curve E, a secret key in CSIDH is an ideal class a in a class group (represented by its list of exponents), and the corresponding public key can be obtained by computing the group action E = a E. The security of CSIDH is based on the hard problem of finding an isogeny path from the isogenous curves E and E . CSIDH has received a lot of attention in recent years since it comes with highly attractive features like efficient validation of public keys, making it suitable for non-interactive (i.e. static) key exchange protocols. In fact, CSIDH can serve as "drop-in" replacement for classical ECDH key exchange and does even comply with the requirements of "0-RTT" protocols such as QUIC. Furthermore, class group actions provide a rich foundation for the design of various other cryptosystems, e.g. signature schemes [BKV19,DG19]. However, the downside of CSIDH is that the computation of group actions is very costly, which makes CSIDH extremely slow, not only in relation to X25519 [Ber06] and other pre-quantum ECDH variants, but also when compared to SIKE. For example, while an Intel Skylake processor can execute a variable-base scalar multiplication on Curve25519 in less than 100 k cycles [NS20] and a SIKEp434 encapsulation or decapsulation in about 10 M cycles [JAC + 20], the to-date best constant-time implementation of a CSIDH-512 group action evaluation and key validation requires close to 240 M clock cycles [HLKA20].
The lengthy computation time of CSIDH poses a major obstacle for its application in security protocols like TLS or HTTPS when taking into account that, for example, the web servers of large enterprises like Google or Facebook are confronted with thousands of HTTPS requests per second. In order to be able to cope with such extreme volumes of traffic, the server infrastructure of such enterprises often includes a so-called TLS termination proxy or TLS reverse proxy, which transparently translates HTTPS sessions to TCP sessions for back-end servers (e.g. web or database servers), see [JHH + 11]. This offloading of the TLS termination to a dedicated proxy frees the web server from having to execute computation-intensive TLS handshakes that involve public-key operations to authenticate the server to the client and establish a shared secret key using e.g. X25519 key exchange [Ber06]. A TLS termination proxy equipped with a high-end 64-bit Intel processor clocked at 4 GHz is (in theory) able to perform 40,000 X25519 key exchanges per second per core since, as mentioned before, a variable-base scalar multiplication on Curve25519 costs below 100 k cycles 1 . Replacing X25519 by SIKEp434 would decrease the (theoretical) upper bound of the number of key exchanges per second on one core to around 400. Even worse, when X25519 gets replaced by CSIDH-512, the number of key exchanges per core would go down to a mere 17 per second, which is more than three orders of magnitude below the (theoretical) throughput of X25519. Therefore, it is little surprising that techniques to speed up CSIDH are eagerly sought.
Contributions. The straightforward way of maximizing the throughput of CSIDH is to minimize the latency of the underlying group action. However, we demonstrate in this paper that the usual approach of maximizing throughput by minimizing latency leads to sub-optimal results on Intel processors that are equipped with recent vector (i.e. SIMD) extensions such as AVX-512. To be more concrete, we show that, when using AVX-512 instructions, minimizing the latency of one group action requires different optimization strategies than maximizing the throughput of several group actions that are executed in SIMD fashion. We explain how the "limb-slicing" method presented in [CGT + 21] can be applied to compute eight independent CSIDH group actions in parallel using AVX-512 instructions, whereby each group action uses a 64-bit element of a 512-bit vector. Limbslicing is somewhat related to the well-known "bit-slicing" technique used in symmetric cryptography since it increases throughput at the expense of latency. We discuss in detail the obstacles we had to overcome to efficiently batch group actions and execute them in a SIMD-parallel way. Further, we describe software optimization techniques that enable a highly-efficient (8 × 1)-way parallel execution of the prime-field arithmetic operations using AVX-512F and AVX-512IFMA instructions. We also present a latency-optimized implementation of the group action for AVX-512IFMA, which can be used to speed up client-side TLS processing (while our throughput-optimized implementation targets the server side 2 and can be used for TLS termination as described in [JHH + 11]). Our results for CSIDH-512 show that batch processing and limb-slicing achieve a throughput gain by a factor of 3.64 compared to an optimized (but non-vectorized) x64 implementation. In light of the recent debate about the post-quantum security of CSIDH-512, we emphasize that our optimizations can also be applied to parameter sets with larger primes, and we expect similar improvements in performance over non-vectorized implementations.

Preliminaries
In this section, we give a brief overview of the CSIDH protocol and the CSIDH class group action of Castryck, Lange, Martindale, Panny, and Renes [CLM + 18]. Further, we summarize the existing constant-time implementations of the CSIDH class group action. For a detailed analysis of the theory of elliptic curves that is relevant for isogeny-based cryptography, we refer the reader to the lecture notes of De Feo [DeF17].

The CSIDH Key Exchange Protocol
The CSIDH protocol works over a finite field F p , where p is a large prime of the special form p = 4 · 1 · · · n − 1 and 1 < . . . < n are small odd primes. In addition, it uses supersingular elliptic curves 3 E A , defined over F p and represented in Montgomery form E A : y 2 = x 3 + Ax 2 + x, with A 2 = 4, where the F p -endomorphism ring 4 of such curves is isomorphic to an order in the imaginary quadratic field Q( √ −p). Specifically, the authors in [CLM + 18] choose a supersingular Montgomery curve E 0 (i.e. splits as a product of prime ideals ( i ) = l i l i = i , π − 1 i , π + 1 , where π = √ −p is the Frobenius endomorphism 6 and since ( i ) is principal, we get . In CSIDH we are interested in computing the action of an ideal a = l e1 1 · · · l en n ∈ Cl(Z[ , where e 1 , . . . , e n are small exponents, chosen uniformly from some interval [−b, b]. This is done by computing in sequence the action of the ideal l i , if e i ≥ 0, or l i , if e i < 0, exactly |e i | times for every i ∈ {1, . . . , n}. Algorithm 1: Computing the class group action for CSIDH [CLM + 18] Input: A ∈ Fp and a list of integers (e1, . . . , en). Output: B ∈ Fp, such that (l e 1 1 · · · l en n ) EA = EB, where EB : For the action of the ideal l i we choose a point R ∈ E(F p ) of order i that lies in the kernel of π − 1 and compute the isogeny φ li : E → E/ R = l i E, with ker(φ li ) = R and deg(φ li ) = i . For the action of the ideal l i we choose a random point R ∈ E(F p 2 ) (i.e. the quadratic twist of E), of order i in the kernel of π + 1. Note that in this case, R = (x, iy), where x, y ∈ F p and i = √ −1. Then we compute the isogeny φ li : E → E/ R = l i E, with ker(φ li ) = R and deg(φ li ) = i . Both isogenies are computed using the Vélu formulae [Vél71], which require O( i log p 2 ) bit operations, hence they are efficiently computed for relatively small primes i . Iterating each isogeny computation |e i | times, depending on the sign of e i and composing the resulting isogenies in each step, yields the final codomain curve a E = (l e1 1 · · · l en n ) E (see Algorithm 1 [CLM + 18]).

CSIDH.
The public parameters are the prime p = 4 · 1 · · · n − 1, the starting curve E 0 : y 2 = x 3 + x and a positive integer b, such that (2b + 1) ≥ n #Cl(Z[ √ −p]) in order to maintain security. Alice's secret key is a list of exponents sk A = (e 1 , . . . , e n ) ∈ [−b, b] n , while her public key is derived from the action of the ideal l e1 1 · · · l en n on the curve E 0 , using Algorithm 1, i.e. pk A = E A = (l e1 1 · · · l en n ) E 0 , which is sent to Bob. In the same vein, Bob's secret key is sk B = (d 1 , . . . , d n ) ∈ [−b, b] n , and his public key pk B = E B = (l d1 1 · · · l dn n ) E 0 is sent to Alice. For the shared secret, Alice and Bob compute the codomain curves k A = (l e1 1 · · · l en n ) E B and k B = (l d1 1 · · · l dn n ) E A respectively, using Algorithm 1. The two curves are F p -isomorphic, because they are derived from the action of the ideal l e1+d1 1 · · · l en+dn n on the initial curve E 0 , as a result of the commutativity property of the ideal class group Cl(Z[ √ −p]). Note that the public keys and the shared secret, are elliptic curves in Montgomery form, hence they are represented as a single coefficient in F p .
Security of CSIDH. The security of CSIDH is based on the Group Action Inverse Problem (GAIP). That is, given two supersingular elliptic curves E and E , defined over F p , with the same F p -endomorphism ring Z[ √ −p], to find an ideal a = l e1 1 · · · l en n such that a E = E . The best known classical attack for solving GAIP is the meet-in-the-middle attack with fully exponential complexity O( In the quantum setting, Childs, Jao, and Soukharev [CJS14] have shown that solving the GAIP problem can be reduced to the abelian hidden-shift problem, for which the subexponential quantum algorithms of Regev [Reg04] and Kuperberg [Kup05] can be applied, where the latter has complexity O( √ log p) and the quantum space complexity O(log p). Based on the above, Castryck, Lange, Martindale, Panny, and Renes [CLM + 18] conjectured that CSIDH-512 would achieve NIST's post-quantum security level 1 based on the asymptotic complexity of Kuperberg's algorithm [Kup05,Kup13]. However, the concrete security of CSIDH-512 is under debate since the works of Peikert [Pei20], Bonnetain and Schrottenloher [BS20], and more recently Chávez-Saab, Chi-Domínguez, Jaques, and Rodríguez-Henríquez [CCJR20], estimate that the prime p should be significantly larger in order to meet NIST's security level 1. In particular, [CCJR20] suggests that p should be updated to 4096 bits.

Optimization Techniques for Constant-Time CSIDH
In practice, we require a constant-time implementation of Algorithm 1 to resist against side-channel attacks. Given a secret exponent list (e 1 , . . . , e n ), Algorithm 1 computes |e 1 | + . . . + |e n | isogenies, and thus its execution time fully depends on the secret key. Meyer, Campos, and Reith [MCR19] presented three leakage scenarios that appear when implementing Algorithm 1. Timing leakage occurs since different secret keys lead to different execution times of evaluating the class group action. Power analysis leaks information on the sign distribution of the secret key since unbalanced, in terms of the sign, secret exponents lead to scalar multiplications with larger scalars. Cache timing attacks are also possible and leak information based on branch conditions or array indices. The authors in [CLM + 18] argued that a constant-time implementation can be obtained when adding certain "dummy" operations, which will not be considered nor affect the final output of the group action. The first constant-time implementations of Algorithm 1 are due to Bernstein, Lange, Martindale, Panny [BLMP19] and Jalali, Azarderakhsh, Kermani, Jao [JAKJ19], which add a large amount of dummy operations and have a probability of failure in the class group action computation.
Meyer, Campos, and Reith. A constant-time implementation of the CSIDH class group action with significant optimizations is presented in [MCR19], and it is known as the "MCR-style". The algorithm uses only positive secret exponents e i , each sampled from its own space [0, b i ] where all b i are chosen such that security is maintained. This mitigates power attacks, while the different intervals allow to reduce the number of large degree isogenies. Meyer et al. use dummy isogenies so that the same number of isogenies is computed in each class group action. For each i their algorithm computes e i "real" and b i − e i "dummy" i -isogenies. Further, they use the Elligator 2 map [BHKL13] for sampling points on the curve. As observed in [MR18], they compute the class group action in descending order in terms of the primes i , which results in a speedup over the ascending order. The most significant optimization is the SIMBA-m-µ technique, where the idea is to partition the indices {1, . . . , n} into m disjoint subsets and evaluate the group action on each subset individually, which results in smaller scalars in step 7 of Algorithm 1. However, this should be done for a specific number of rounds µ, and the subsets should be merged back after this threshold. The authors argue that finding optimal values for m and µ, as well as for the upper bounds b i is a hard task. They present various choices, based on the CSIDH-512 instance, where the best example is SIMBA-5-11.
Onuki, Aikawa, Yamazaki, and Takagi. In [OAYT19] the authors present a constanttime implementation of Algorithm 1, known as the "OAYT-style", that improves on the MCR-style by 29.03%. In their algorithm each e i is also sampled from its own space, but in contrast to the MCR-style, each e i is allowed to be negative as well. The algorithm mitigates timing attacks by keeping track of two points P 0 ∈ E[π − 1] and P 1 ∈ E[π + 1] and picking the appropriate point, depending on the sign of e i , in order to create the kernel generator. Both points P 0 and P 1 are mapped through the isogeny, and the point not used to derive the kernel is multiplied by i . The algorithm also uses the Elligator 2 map for generating points on the curve and dummy isogenies as in the MCR-style.

Cervantes-Vázquez, Chenu, Chi-Domínguez, De Feo, Rodríguez-Henríquez, and Smith.
The work in [CCC + 19] provides significant improvements on both the MCR-and OAYTstyle algorithms. The authors present efficient formulas in the twisted Edwards model for performing isogeny computations, isogeny evaluations and curve operations (point addition/doubling). They also use differential addition chains which provide a 25% improvement compared to the classical Montgomery ladder, for computing scalar multiplications. Besides the optimizations for the MCR-and OAYT-style algorithms, the authors present a constant-time implementation that excludes the dummy operations, known as "dummyfree-style". Although this is less efficient compared to the MCR-and OAYT-style, it is resistant against fault-injection attacks, i.e. stronger attackers who can interfere in computations and determine whether a modification happened on a "real" or a "dummy" isogeny. Their optimized OAYT-style with SIMBA-3-8 is 39% faster than the MCR-style presented in [MCR19], and their dummy-free-style group action is two times slower compared to their OAYT-style implementation.
Optimal Strategies. In [HLKA20], Hutchinson, LeGrow, Koziel, and Azarderakhsh further studied problems of choosing the optimal bounds b i for sampling secret exponents, the optimal ordering of primes i , and the optimal partition m for SIMBA technique. Such selections are referred to as optimal strategies. Their optimal strategies offer 5.06% improvement for the OAYT-style implementation in [CCC + 19]. In [CR20], Chi-Domínguez and Rodríguez-Henríquez generalized the computational strategies that are used in the SIKE implementation [JAC + 20] for the case of CSIDH. Their new algorithms do not rely on the SIMBA approach and provide an improvement of 12.09%, 3.36%, and 10.58% compared to the MCR-, OAYT-, and dummy-free-style implementations of [CCC + 19], respectively. The OAYT-style algorithm of [HLKA20], the MCR-and dummy-free-style algorithms of [CR20] are to date the most efficient constant-time implementations of CSIDH in the literature. These algorithms are further optimized by Adj, Chi-Domínguez, and Rodríguez-Henríquez [ACR20] when applying certain tricks that reduce the computational cost of new Vélu formulae of [BDLS20].

AVX-512
AVX-512 is the latest incarnation of Intel Advanced Vector eXtension (AVX), which augments the execution environment of x64 by 32 registers of a length of 512 bits and various new instructions. AVX-512 contains multiple extensions, but a specific processor may support some but not all of them. From another perspective, all processors equipped with AVX-512 support the core extension named AVX-512 Foundation (AVX-512F).
AVX-512IFMA. Among the extensions of AVX-512, AVX-512IFMA (Integer Fused Multiply-Add), or simply IFMA, is very attractive for the public key cryptosystems whose underlying arithmetic is the large integer arithmetic, including isogeny-based cryptosystems. IFMA was first supported with Intel Cannon Lake and continued to be equipped in its successors such as Ice Lake and Tiger Lake processors. Intel described IFMA in [Int18] as "two new instructions for big number multiplication for acceleration of RSA vectorized SW and other Crypto algorithms (Public key) performance". Specifically, IFMA introduced two novel instructions vpmadd52luq and vpmadd52huq. An IFMA instruction first multiplies packed unsigned 52-bit integers in each 64-bit lane of two registers to produce a 104-bit intermediate product. It then adds either the low or the high 52-bit from the product with corresponding packed unsigned 64-bit integer (from the third register), and stores the final results in a destination register. Compared to vpmuludq or vpmuldq instruction in AVX-512F, IFMA not only offers wider multiplier (52-bit in IFMA vs 32-bit in AVX-512F) but also fuses the multiplication and the addition in a single instruction.
Target Platform. In this work, we target the Intel Ice Lake processor which supports IFMA extension. The specific processor we used in our experiments is Intel Core i3-1005G1 CPU clocked at 1.2 GHz. Table 1 lists the most relevant AVX-512 instructions used in this work, along with their latency and throughput data 7 on the Ice Lake processor which we obtained from Intel official document [Int20]; the throughput data is shown in Clock Per Instructions (CPI) ratio [Int18]; the instruction mnemonic is used to describe algorithms in this paper.

Methods for Batching CSIDH Group Actions
Recall from Section 2.2 that the to-date fastest constant-time CSIDH implementations are the two OAYT-style variants of [HLKA20] and [CR20]. According to our measurements of their group action evaluation (see Table 3), the former is 1% faster than the latter. The optimization techniques used in both variants improve the OAYT-style implementation of [CCC + 19] by 5%, and they are all considered in terms of x64 implementation. When the same optimization techniques are applied to AVX-512 software, the resulting effects may be different. In this work we focus on batching the OAYT-style implementation of [CCC + 19]. However, we argue that the optimization techniques of [HLKA20, CR20] as well as [ACR20] can also be applied in our batched implementation. The OAYT-style class group action algorithm of [CCC + 19] is described in Algorithm 2, which was originally presented in [OAYT19]. Algorithm 2 takes advantage of the SIMBA- For the CSIDH-512 instantiation, the best choices according to [OAYT19] are m = 3 and µ = 8. In this case, Algorithm 2 computes 404 "real" and "dummy" isogenies (i.e. the variable t max = n i=1 b i = 404, see Appendix A for the bound vector b and the ordering of the small primes ). Following [CCC + 19], we define the constant-time equality test and constant-time conditional swap functions isequal and cswap, respectively, as: Further, we consider a constant-time function sign(x), which returns 0 if x < 0 while returns 1 if x ≥ 0. More details on Algorithm 2 are given in [OAYT19, CCC + 19].

Obstacles to Batching CSIDH Group Actions
We conceive our batched software where eight CSIDH group action instances in the fashion of Algorithm 2 are to be computed simultaneously by AVX-512 instructions. Besides, each instance is computed in 64-bit lane, and instances are independent of each other. Since AVX-512 is in the paradigm of Single Instruction Multiple Data, from another perspective, this requires that the same instruction must be executed at the same time in eight instances. In other words, each of the eight instances in the execution of our batched software must process the same instruction sequence or, we say, the same operation sequence at a higher layer. This is a crucial requirement, in addition to having a constant-time implementation which is already accomplished by Algorithm 2. The sequence of operations in Algorithm 2, relies on specific conditional statements, which are (indirectly) affected by the value of the random curve points generated internally at line 7 to 8. Specifically, a closer look at Algorithm 2 reveals that the sequence of operations (and instructions) that are carried out depends on whether the kernel generator R at line 13 is infinity or not. If R = ∞, the algorithm computes either a "real" or a"dummy" isogeny (depending on whether e i is non-zero or not) in the "if"-branch, whereas it performs a scalar multiplication in the "else"-branch if R = ∞. In the scenario of eight parallel class group actions that we are considering, this is problematic, and especially, it is very likely that at some iterations the point R will be infinity at least in one of the parallel instances. In particular, the probability for a point of order i to be infinity is 1/ i , which is considerably high when i is small (e.g. 3, 5, 7, . . . in CSIDH). This will cause a mismatch between the simultaneous instances and will affect other variables as well, such as the update of b i and the isogeny counter t isog at line 21, leading to different instruction sequences for the different instances.
Clearly, in order to obtain a constant-time batched software where eight running instances follow the same instruction sequence, we need to mitigate the mismatch caused Algorithm 2: The OAYT-style algorithm with SIMBA-m-µ for computing the CSIDH group action.
by the infinity check at line 14. We explore three methodologies that achieve a batchingfriendly CSIDH group action and deal with this specific if-else statement. In our first method, the idea is to rewrite Algorithm 2 so that this if-else clause is no longer needed. This requires the computation of additional dummy isogenies in the case where the kernel point is infinity, and hence we refer to this method as extra-dummy (Section 3.2). In the second method, we still keep this if-else statement but we force all eight instances to always agree on the same branch. That is, if at least one kernel point in the eight instances is the point at infinity, then all instances will enter the "else"-branch at line 22. We refer to this methodology as extra-infinity method (Section 3.3). The third idea is based on the combination of the extra-dummy and extra-infinity methods, therefore we call it the combined method (Section 3.4).

Hybrid Mode.
Notably, all of our three methods are constructed in a hybrid mode which significantly improves the performance of the batched CSIDH implementation. In the context of this paper, hybrid mode means that the entire batched software is The three methods are based on the OAYT-style implementation of Algorithm 2 with SIMBA-m-µ. However, in Section 3.5, we show that all three methods can also be applied to batch the dummy-free-style algorithm of [CCC + 19], which is considered to be secure against fault-injection attacks.

Extra-Dummy Method
Our first batching method initially aims at making Algorithm 2 independent of all inputs as well as all randomness. In brief, we remove the if-else clause (line 14 and line 22) that checks whether R is infinity, at a cost of extra dummy isogeny computations. This idea was first presented in [BLMP19] and also adopted in [JAKJ19] for a constant-time CSIDH implementation on 64-bit ARM processors. For both implementations there exists a probability of failure in computing the correct codomain curve, and a large number of dummy isogeny computations are required to make this probability negligible (e.g. 2 −32 ).
In detail, according to [CCC + 19], given a point P = (Y : T ) represented in twisted Edwards Y T -coordinates 8 , we define a constant-time function for checking whether the point P is infinity as: This time we compute a "real" isogeny from the kernel point R, if R = ∞ and e i = 0; whereas we compute a "dummy" isogeny if either R = ∞ or e i = 0. Similarly, these dummy isogenies will not be considered in the final result, but they will cause the counter t isog to increment. Consequently, there is a possibility that for some indices i, the number of the computed "dummy" isogenies exceeds the value b i − |e i | in which case we lose "real" isogenies that should be computed. This implies that although the algorithm will terminate, the resulting codomain curve will be incorrect since it will not correspond to the secret exponents (e 1 , . . . , e n ). This probability of failure can be reduced by fixing the number of dummy isogenies to be computed, as done in [BLMP19,JAKJ19]. In other words, except for the dummy isogenies originally needed by Algorithm 2 to make the group action independent of the secret exponents (we call them initial dummy isogenies), we import extra dummy isogenies to make the group action independent of the randomness (we call them extra dummy isogenies). The modified group action has now the same operation sequence in each execution, which meets the requirement for batching. However, according to our calculation, it requires to import more than n i=1 b i extra dummy isogenies to make the failure probability below 2 −32 . Hence, this idea needs to compute more than two times the number of isogenies needed in Algorithm 2, which significantly reduces the efficiency of the algorithm, while the probability of failure still exists.
Based on the above discussion, we are looking for a way to drastically reduce the number of extra dummy isogenies and eliminate the probability of failure, but meanwhile retain this batching-friendly fashion of group action. This can be done using the hybrid mode. As introduced in the previous subsection, the hybrid mode consists of the batched component and the unbatched component. In the batched component, we still compute n i=1 b i ("real" and "dummy") isogenies, where in this case, dummy isogenies appear  whenever a secret exponent is zero, or the kernel point is infinity. In addition, we keep track of all the dummy isogenies that occurred from infinity kernel points. After the batched component terminates, we take advantage of the unbatched component to compute "compensatory" isogenies based on the previously recorded infinity cases. Since this method adds extra dummy isogenies for each instance that occurred from the infinity cases, we refer to it as the extra-dummy method. Algorithm 3 explains the batched component of our extra-dummy method. We first apply this batched component for computing eight group action instances in parallel. Hence, in our batched software, the input is composed of a secret exponent list and a starting curve: In Algorithm 3, apart from the bound list (b 1 , . . . , b n ), we also create an additional bound list for each instance to record the infinity cases:  (same as (b 1 , . . . , b n )).
When the batched component terminates, it outputs for each instance, a curve coefficient B (k) , the list (b Our experiments indicate that for each instance there are often around 10 ("real" and "dummy") isogenies remaining to be computed, i.e. current n i=1b (k) i ≈ 10. We compute the remaining isogenies by executing the unbatched component for each instance in sequence:

Extra-Infinity Method
We assume that eight different instances of Algorithm 2 are computed in parallel. In brief, the idea in our second method is that for each iteration of the inner loop (line 10 to line 26 of Algorithm 2), if the kernel generator is infinity in at least one of the eight instances, then we force all instances to execute the "else"-branch at line 22. In particular, we define the variable inf as: where R (k) denotes the kernel generator in the k th simultaneous instance. If inf = 0, then R (k) = ∞ for all k ∈ {1, . . . , 8} and all eight instances will enter the "if"-branch at line 14 in Algorithm 2, in order to compute a "real" or a "dummy" i -isogeny. On the other hand, if inf = 1, then R (k) = ∞ for at least one k ∈ {1, . . . , 8} and all instances will proceed to the "else"-branch. In this case, we need to execute the scalar multiplication T 1 . This is not needed in Algorithm 2, because the i -torsion part of the point T 0 has already vanished (since R = ∞). In our approach, when inf = 1, we are forcing all instances to proceed as if all R (k) were infinity, however the i -torsion parts of some points T in the "else"-branch, these infinity-related computations may include more point generation operations (using the Elligator map at line 8), which will result in more scalar multiplications for the full order points P (k) 0 , P (k) 1 (line 9) and the kernel generator R (k) (line 13). For this reason, we refer to this method as the extra-infinity method. Note also that the probability of inf = 1 is 1 − (1 − 1/ i ) 8 , which is considerably higher when i is small (e.g. 3, 5, and 7). As a result, an increased number of inf = 1 cases (and hence, an increased number of infinity-related computations) is expected, which affects the efficiency of the extra-infinity method.
We mitigate this problem by considering again the hybrid mode. More precisely, we divide the primes = ( 1 , . . . , n ) into two subsets, batch for the batched component and unbatch for the unbatched component. unbatch contains only the smaller primes, specifically 3, 5, 7, 11, 13, 17 and 19 in our implementation, whereas batch includes the remaining primes in . In the same way, the bound list b = (b 1 , . . . , b n )  Then we execute the unbatched component (such as Algorithm 2) sequentially in order to obtain the correct codomain curve for each instance:

The Combination of Extra-Dummy and Extra-Infinity Methods
Before we introduce the combined method, we give a few more details on the extra-dummy and extra-infinity methods. We consider an example where in an iteration of the inner loop, n inf of the eight kernel points R (k) = ∞, in the batched component of both methods. The extra-dummy method will complete the computations of this iteration (from line 14 to 25 in Algorithm 3), and later it will compute n inf "compensatory" isogenies with the unbatched component. On the other hand, the extra-infinity method will enter its "else"-branch to compute two scalar multiplications, for all eight instances, and it may later perform the other infinity-related computations, which are in theory needed by n inf instances. Based on the operations that are carried out in each method, we observe that the extra-dummy method handles the infinity cases more efficiently than the extra-infinity method when only few R (k) = ∞, hence when n inf is small. On the other hand, the extra-infinity method seems to be more efficient in handling the infinity cases, when most of the eight R (k) = ∞ (i.e. when n inf is close to 8).
Based on the above observation, our idea is to combine the two approaches, aiming at obtaining a more efficient method. In order to do this, we set the variable n inf as: n inf = isinfinity(R (1) ) + isinfinity(R (2) ) + . . . + isinfinity(R (8) ), so that n inf ∈ [0, 8]. Taking Algorithm 3 to describe this combined method, after the computation at line 13 (where the kernel generator R (k) is computed), we add an if-else statement to check if the variable n inf is within a predefined threshold n thld . If n inf ≤ n thld which means there are few R (k) = ∞, we do the same computations as in the extra-dummy method (line 14 to 22 of Algorithm 3). On the other hand, if n inf > n thld which means there are more R (k) = ∞, we proceed to the "else"-branch of the extra-infinity method and perform the two scalar multiplications T 0 . After this if-else statement, the operations at line 23 to line 25 will be performed. Additionally, the unbatched component of the combined method is the same as the one in the extra-dummy method. From our experiments, when the threshold n thld = 3, this combined method provides the best performance, and particularly, it is slightly faster than the extra-dummy method and quite faster than the extra-infinity method.

Batching Dummy-Free-Style Group Actions
The methods that we have considered in Sections 3.2, 3.3, and 3.4 for batching CSIDH group actions are all based on the OAYT-style algorithm of [OAYT19] and require the computation of dummy isogenies. More precisely, recall that Algorithm 2 computes |e i | "real" and b i − |e i | "dummy" isogenies. Such implementations are vulnerable to fault injection attacks. As observed in [CCC + 19], an attacker can modify the codomain curve or the images of the points T 0 , T 1 under the isogeny φ in Algorithm 2 (fault injections), and if the result is correct, he knows that a dummy isogeny is computed and thus e i = 0. This is also true in Algorithm 3. If the same modification produces the correct result, then the attacker knows that either e i = 0, or the current kernel generator R = ∞.
In We argue that the three methods that we have introduced in Sections 3.2, 3.3, and 3.4 can also be used to batch the dummy-free-style algorithm of [CCC + 19]. In particular, for the extra-dummy and the combined methods, we are still using dummy isogenies for the case where the kernel generator R = ∞, however, these dummy isogenies do not reveal any information about the secret exponent, since they depend only on the random kernel generator. For the extra-infinity method, we follow the same strategy as in Section 3.3, with the only difference being that the small prime list in the unbatched component is unbatch = (3, 5, 7, 11, 13, 17, 19, 23, 29). For the combined method in the dummy-free-style, the optimal threshold to achieve the best performance is n thld = 5.

(8 × 1)-Way Prime-Field Arithmetic
In the batched components of all batching methods, we use the same curve and isogeny arithmetic implementation that we developed, based on [CCC + 19], with minor optimizations to better fit the batched software. At a lower layer, we developed all the needed (8 × 1)-way 9 prime-field operations from scratch, using respectively AVX-512F and IFMA, by taking advantage of "limb-slicing" technique [CGT + 21]. This section only studies our IFMA vectorized implementation of prime-field operations. Compared to the IFMA version, the AVX-512F implementation has two fundamental differences; (i) it uses vpmuludq instead of IFMA instructions to perform vector multiplication; (ii) the field element is represented in radix-2 29 (with 18 limbs) due to the 32-bit multiplier.
The main data structure of our parallel software is a (8 × 1)-way limb vector set, which is composed of eight radix-2 52 elements. Given eight integers a, b, c, d, e, f, g, h ∈ F p , a (8 × 1)-way limb vector set V is defined as: a 1 , b 1 , c 1 , d 1 , e 1 , f 1 , g 1 , h 1 ] . . .
[a 9 , b 9 , c 9 , d 9 , e 9 , f 9 , g 9 , h 9 ] is called a limb vector. All the inputs and outputs of our (8 × 1)-way field operations are limb vector sets of which each limb is precisely 52 bits long. In terms of our field operations, we saved the final subtraction in Montgomery reduction, and our addition and subtraction perform reduction modulo 2p instead of p. This means all the integers inputted to or outputted from our field operations are in the range [0, 2p − 1]. We use P = p, p, p, p, p, p, p, p to denote an (8 × 1)-way limb vector set of prime p, and Q = 2 × P = 2p, 2p, 2p, 2p, 2p, 2p, 2p, 2p .

Field Addition and Subtraction
Field addition Z ← X + Y mod Q is performed in three steps. At first, we add X with Y and store their sum in Z. We then subtract Q from Z, so there might be negative results in some lanes of Z. Finally, we create a 512-bit mask vector where the 64-bit element is set to all-1 if the corresponding lane's integer in Z is negative, or to all-0 if non-negative. Through the bitwise AND between this mask vector and Q, we add 2p to the negative integers in Z whereas add 0 to the non-negative integers. There are only two steps in the field subtraction Z ← X − Y mod Q, which first subtracts Y from X and then carries out a same final step of field addition.

Field Multiplication
Field multiplication has a significant impact on the performance of any isogeny-based cryptosystem and deserves special care.  Table 4]. However, things become more complicated when developing on a processor with more computing power. Considering our case, Ice Lake processor is equipped with ten execution ports (and various execution units), so the processor can execute several instructions simultaneously. Excluding the number of needed instructions and instruction latency/throughput, we are supposed to take instruction-level parallelism into account. The memory consumption receives less attention in this case since an Ice Lake machine usually possesses a GB-level memory. Currently, most of the related AVX-512 implementations are designed for accelerating 1-, 2-or 4-way Montgomery multiplication. However, these optimization techniques are not ideally applicable to our 8-way software. We discuss these implementations in more detail in Section 5.1. Due to the "limb-slicing" pattern, our 8-way Montgomery multiplication essentially "duplicates" 1-way implementation to eight lanes by AVX-512 instructions. To the best of our knowledge, there are only two AVX-512 implementations of this type in the literature. Takahashi proposed both AVX-512F and IFMA implementation of 8-way Montgomery multiplication in [Tak20], but this software works on 62-bit and 52-bit operands, respectively, and not in the case of large integers. Buhrow, Gilbert, and Haider in [BGH21] presented a Block Product Scanning (BPS) variant of Montgomery multiplication, which is based on radix-2 32 representation. An 8-way 512-bit BPS variant implemented with AVX-512F takes 189 clock cycles for each instance, which translates to 1512 clock cycles for a whole 8-way implementation.
In order to find an optimal field multiplication for our software, the best way is to develop the corresponding 8-way AVX-512 implementation of various Montgomery multiplication variants and select the fastest one among them. From an algorithmic viewpoint, all the variants differ in three aspects: (i) different methods to implement integer multiplication, e.g. operand-scanning, product-scanning or the advanced technique such as Karatsuba algorithm [KO63]; (ii) different methods to implement Montgomery reduction, e.g. operand-scanning or product-scanning; (iii) whether Montgomery reduction is separated from or interleaved with integer multiplication (and how it is interleaved in the latter case). For our IFMA version, we conducted experiments in which we developed a dozen of implementation candidates of 8-way field multiplication based on various combinations from above three aspects. Notably, our 8-way implementation candidates are not straightforwardly "duplicating" the ordinary 1-way x64 implementation of different variants (or we say combinations). We rather concentrated on improving instructionlevel parallelism in each implementation candidate. In order to achieve this purpose, we tried to improve the ports utilization by optimizing dependency chains in the code. From our benchmarking results on Ice Lake processor, the implementation candidate with the lowest latency is shown in Algorithm 11 at Appendix C, which possesses a similar structure as Coarsely Integrated Hybrid Scanning (CIHS) [KAK96], and it serves as field multiplication in our 8-way IFMA software. Our field multiplication uses productscanning for integer multiplication and utilizes operand-scanning to handle Montgomery reduction, and reduction is interleaved with the second outer loop of product-scanning integer multiplication (line 7 to 15 in Algorithm 11). As for our AVX-512F version, we carried out a similar procedure to evaluate also a dozen of AVX-512F implementation candidates. The optimal field multiplication in AVX-512F version switches to Karatsuba algorithm for integer multiplication since there are 18 limbs of each element, and uses a product-scanning Montgomery reduction that is separated from integer multiplication.
The information and latency of field multiplication in both versions are shown in Table 2, which indicates that our Karatsuba-based AVX-512F implementation outperforms the BPS variant in [BGH21]. We herein emphasize on the importance of using an optimal field multiplication in such parallel AVX-512 software of an isogeny-based cryptosystem. In our experiments for IFMA version, the 8-way Separated Product Scanning (SPS) variant [LG14] or 8-way original FIPS variant [KAK96] (i.e. the one that has not been optimized for improving instruction-level parallelism) costs more than 700 clock cycles, i.e. taking 60% more CPU-cycles than Algorithm 11. From our experiments, using such unsuitable field multiplication and squaring implementation will finally result in up to 30% more CPU-cycles for CSIDH group action evaluation compared to the one using optimal variants.

Field Squaring
Most of the existing CSIDH implementations, e.g. [MCR19, OAYT19, CCC + 19, CR20, HLKA20], take advantage of a same x64 assembly implementation of field operations originally from [CLM + 18]. In this assembly implementation, a field squaring just invokes a field multiplication in which two operands are same. In other words, field squaring possesses the same latency as field multiplication. In this work, we developed a dedicated Montgomery squaring instead of directly using field multiplication. Specifically, compared to field multiplication, our field squaring utilizes a dedicated integer squaring instead of integer multiplication.
In essence, integer squaring is a special instance of multiplication, where all partial products in the form of f i · f j with i = j appear twice due to f i · f j = f j · f i . A classic technique for optimizing squaring is to just compute these partial products once and double them, thereby saving numerous multiplication instructions. We develop our integer squaring by this classic technique, and again we developed many squaring candidates to obtain an optimal implementation. The information of our field squaring implementation is also listed in Table 2 where it proves a dedicated field squaring saves at least 15% CPU-cycles than a field multiplication. The algorithmic description of our IFMA 8-way Montgomery squaring is shown in Algorithm 12 at Appendix C.

Low-Latency Implementation
In our hybrid mode which is introduced in Section 3.1, the low-latency implementation of a single group action evaluation serves as the unbatched component and is needed by each instance. More importantly, this low-latency implementation can also be used in more applications e.g. accelerating the CSIDH key exchange protocol on the client side.
In this section we describe our (2×4)-way IFMA implementation, which is developed for accelerating a single group action evaluation and used as the unbatched component in our IFMA throughput-optimized software. In the case of AVX-512F, our experiments showed that the (2 × 4)-way AVX-512F implementation is slower than the x64 implementation of [CCC + 19]. Hence, for our AVX-512F throughput-optimized software, we use the [CCC + 19] implementation as the unbatched component.

(2 × 4)-Way Field Operations
Neither the structure nor the radix of the (2 × 4)-way limb vector set is the same compared to the (8 × 1)-way set. To be specific, we take advantage of (2 × 4)-way interleaved vectors combined with radix-2 43 this time. The (2 × 4)-way limb vector set V = a, b is defined as follows: a 3 , a 6 , a 9 , b 0 , b 3 , b 6 , b 9 ]  [a 1 , a 4 , a 7 , a 10 , b 1 , b 4 , b 7 , b 10 ]  [a 2 , a 5 , a 8 , a 11 , b 2 , b 5 , b 8 (2) Each limb vector V i = [a i , a i+3 , a i+6 , a i+9 , b i , b i+3 , b i+6 , b i+9 ] contains four limbs from each integer a and b, and limbs are arranged in an interleaved pattern. The reason for using radix-2 43 instead of radix-2 52 is now easily inferred from the Equation (2). It is because there will still remain three limb vectors if using radix-2 52 (ten limbs for each integer), whereas radix-2 43 offers more headroom in each limb (a i or b i ) and is thus friendly for delaying the carry propagation. Similar to the (8 × 1)-way implementation, our (2 × 4)-way implementation also saves a final subtraction in Montgomery reduction and performs modulo 2p instead of p reduction in field addition and subtraction.
Mixed Addition and Subtraction. In curve and isogeny arithmetic, we can generally perform a pair of field addition and subtraction simultaneously, but not two additions or two subtractions due to sequential dependency. Therefore, it makes more sense to develop a parallel and mixed operation of addition and subtraction. We denote this mixed operation as "±". Formally, it works as r, s ← a, b ± c, d where r = a + c mod 2p and s = b − d mod 2p. In essence, this mixed operation executes the similar steps described in Section 4.2. At first, we construct two (2 × 4)-way limb vector sets c, 0 and 2p, d . We add a, b with c, 0 , and then subtract 2p, d from the sum to reach a + c − 2p, b − d . The final step is similar to Section 4.2 in order to ensure the results of this mixed operation are in [0, 2p − 1] by a mask vector.
Multiplication. As we mentioned in Section 4.3, some work has already been done for accelerating 1-, 2-or 4-way Montgomery multiplication or squaring with AVX-512. Several papers have been published which focus on using IFMA to accelerate 1-way large integer arithmetic such as integer multiplication [GK16,KG19] and Montgomery squaring [DG18]. Edamatsu and Takahashi in [ET20] presented an IFMA implementation of single large integer multiplication, which takes advantage of Karatsuba algorithm. Apart from the work on 1-way implementation acceleration, Orisaka, Aranha, and López presented a well-designed and fast (4 × 2)-way Montgomery multiplication for SIDH in [OAL18] by AVX-512F, and their approach is working on the 4-way interleaved vectors. We designed our (2 × 4)-way IFMA Montgomery multiplication based on the approach of [OAL18] with several modifications: (i) we use IFMA instructions to replace vpmuldq and save vpaddq; (ii) we apply our (2 × 4)-way limb vector set; (iii) we implement integer multiplication and reduction in interleaved fashion instead of separated one which is originally-used, because the interleaved fashion is measured to be faster than the separated one from our experiments. Our (2 × 4)-way field multiplication is described in Algorithm 4. Vector sets L and H respectively accumulate the partial products produced by vpmadd52lo and vpmadd52hi. Notably, excluding the computation at line 14, there is no dependency between L and H in the main loop (line 3 to 14), which benefits the efficient utilization of ports.
Squaring. Orisaka et al. did not present a dedicated integer squaring in [OAL18] but planned it as a future work. We herein propose a fast integer squaring by using the classic optimization technique that we described in Section 4.4. Our integer squaring can be slightly modified to fit any (2 × 4)-or (4 × 2)-way AVX-512 Montgomery squaring that uses interleaved vectors, e.g. the integer squaring needed in [OAL18]. Our method is described in Algorithm 5, which saves 24 IFMA instructions compared to an integer multiplication (corresponding to line 5 to 7 in Algorithm 4) which requires 72 IFMA instructions in total. We keep the output of Algorithm 5 in two sets L and H, since our Montgomery reduction is designed to directly work on them. Our complete Montgomery squaring just replaces the integer multiplication part of Algorithm 4 by Algorithm 5.

Curve and Isogeny Arithmetic
Following [CCC + 19], the curve arithmetic mainly includes y-coordinate point doubling, point addition and scalar multiplication (using addition chains) on twisted Edwards curve, whereas the isogeny operations contains y-coordinate isogeny computation and isogeny evaluation. Fortunately, all of the above five operations can be internally parallelized in 2-way, where the cost 10 switches from iM + jS to i 2 M 2 + j 2 S 2 .
The Elligator 2 map was originally introduced in [BHKL13] for generating random points on Montgomery curves and was modified in [CCC + 19] for the twisted Edwards case. The latter takes as input the values A 0 = a and A 1 = a − d where a, d ∈ F p are the coefficients of the curve E a,d in twisted Edwards form, a random u ∈ {2, . . . , (p − 1)/2} which is used to derive the random curve points. Then it outputs two points P 0 ∈ E a,d [π−1] and P 1 ∈ E a,d [π + 1]. The method of [CCC + 19] requires 8M + 3S + 16A plus one square test for the Legendre symbol, which we slightly improved by saving 2A. Our 2-way implementation of the Elligator 2 map is based on [CCC + 19], and it is presented in Algorithm 6, with total cost 5M 2 + 1S 2 + 9A 2 plus the square test for the Legendre symbol. Montgomery constant R = 2 516 mod p. Output: A pair of points P0 ∈ E a,d [π − 1] and P1 ∈ E a,d [π + 1].
cswap(T0, T1, 1 − m) 20 return P0 = (Y0 : T0) and P1 = (Y1 : T1) Specifically, the value u will be used to derive the random curve points, and the Montgomery constant R is used to map the random value u to the Montgomery domain. The algorithm first generates the two points using XZ-coordinate representation, namely P 0 = (X 0 : Z 0 ) and P 1 = (X 1 : Z 1 ) on the birationally equivalent Montgomery curve, C Y 2 Z 2 = C X 3 Z + A X 2 Z 2 + C XZ 3 , where A = 2(a + d) and C = a − d. More precisely, the two points are defined as: Then, the algorithm converts the two points in twisted Edwards form, using Y Tcoordinate representation, at lines 17 and 18. This is relatively cheap, since it requires only 2A 2 operations, namely P 0 = (Y 0 : T 0 ) = (X 0 − Z 0 : X 0 + Z 0 ) and P 1 = (Y 1 : T 1 ) = (X 1 − Z 1 : X 1 + Z 1 ).
At line 16, we use the constant time function issquare to check whether the value is a square in F p . If it is a square (m = 1), then the generated point lies on E a,d , otherwise (m = 0), the point lies on the quadratic twist of E a,d . At the final step (line 19), the two points are swapped according to m, so that the point P 0 ∈ E a,d [π − 1] and P 1 ∈ E a,d [π + 1].
In Appendix B we present our 2-way algorithms for point doubling and (differential) point addition, as well as the algorithms for the -isogeny computation and evaluation, where all algorithms use Y T -coordinate representation on twisted Edwards curves. At the top layer, we respectively implemented an OAYT-style group action and a dummy-free-style group action according to [CCC + 19].

Experimental Results
We downloaded the original CSIDH software [CLM + 18], all the OAYT-style and dummyfree-style constant-time CSIDH software including [OAYT19], [CCC + 19], [CR20] and [HLKA20]. All the source codes are publicly available. In particular, the source code of [CLM + 18] is available at CSIDH website 12 , while the authors of [CCC + 19], [CR20] and [HLKA20] provided their source code links in their articles. In addition, although the authors of [OAYT19] did not give the link of their source code in the article, we found the source code repository of the implementation in [OAYT19] on GitHub 13 .
In order to figure out the real improvement of our work, we benchmarked our software and the CSIDH group action evaluation of all the above implementations on an Intel Core i3-1005G1 Ice Lake CPU clocked at 1.2 GHz. All source codes were compiled with GCC version 9.3.0 and Turbo boost was disabled during the performance measurements. The results of the OAYT-style implementations are shown in Table 3, where the speedup ratio is defined by comparing the "CPU-cycles/#instances" between the baseline and the specific implementation, i.e. the normalized throughput. We use [CCC + 19] as baseline, because in this way we know precisely how much our vector processing techniques improve the results (note that [CCC + 19] also served as baseline in other papers, e.g. [CR20, HLKA20]). As shown in Table 3, our 2-way low-latency IFMA implementation has roughly the same latency as the original non-constant-time implementation in [CLM + 18], and it is about 1.5 times faster than the x64 implementation of [CCC + 19]. Our (8 × 1)-way IFMA implementation, when applied with the combined batching method, takes 446.9 M clock cycles for eight parallel instances, which represents a 3.64 times higher throughput compared to the x64 implementation in [CCC + 19]. An analysis of the execution times of our (8 × 1)-way software shows that all the IFMA implementations are nearly 1.9 times faster than the corresponding AVX-512F implementations, which confirms that the IFMA extension indeed significantly accelerates CSIDH compared to general AVX-512F. The benchmarking results of dummy-free-style implementations are summarized in Table 4. These results show that our proposed batching methods still work efficiently when applied to the dummy-free-style CSIDH group action and can yield an up to 3.63 times higher throughput compared to the x64 implementation in [CCC + 19].
Performance Evaluation and Analysis. Though AVX-512 can work on eight 64-bit elements simultaneously with a single instruction, the theoretical maximum speedup factor of an AVX-512 implementation (compared to x64 implementation) of isogeny-based crypto is actually far from eight. The main reason is the multiplier. An x64 implementation executed on an Ice Lake CPU has to use a single multiplier sequentially, but this multiplier can execute a full (64 × 64 → 128)-bit multiplication, which is very beneficial for the field arithmetic. On the other hand, AVX-512F can execute eight (64 × 64 → 64)-bit multiplications (vpmullq) or eight (32 × 32 → 64)-bit multiplications (vpmuludq/vpmuldq) in parallel, whereby the latter is typically used in multi-precision arithmetic. An AVX-512IFMA instruction can perform eight multiplications on 52-bit operands, but the result is either the lower half or the upper half of the eight 104-bit products, i.e. two IFMA instructions are necessary. Taking the multiplication of 512-bit integers using the schoolbook method as example, an x64 implementation needs 8 2 = 64 mul instructions for one instance, while AVX-512F needs at least 16 2 = 256 vectorized mul instructions for eight instances (a radix-2 29 representation would even need more instructions) and AVX-512IFMA requires 10 2 · 2 = 200 IFMA instructions for eight instances. Since the CPI of these mul instructions is same on Ice Lake CPU (see [Int20]), the approximate speed-up (compared to an x64 implementation) of AVX-512F and AVX-512IFMA is a factor of 2.0 and 2.56, respectively. This is also the case for the Montgomery reduction. As we mentioned before in Section 4.3, the field multiplication significantly affects the performance of CSIDH so that the theoretical maximum speedup factor of AVX-512 for CSIDH group action evaluation should be far from 8. Taking this analysis into account, our throughput-optimized AVX-512 implementations have the expected speed-ups.
As for the latency-optimized implementation, a 2-way IFMA latency-optimized implementation of SIKEp503 was presented by Kostic and Gueron in [KG19], which is 1.72 times faster than the x64 assembly implementation of SIKEp503. We can thus conclude that our 2-way IFMA low-latency implementations (which achieve speed-up factors of 1.54 and 1.71, respectively) also correspond to the expected acceleration. There are several reasons that make the 2-way latency-optimized implementation less efficient than the throughput-optimized implementation, including (i) the overheads caused by aligning and blending AVX-512 vectors in 2-way curve and isogeny operations; (ii) the fact that some point operations (e.g. y-coordinate doubling and Elligator 2) can not be parallelized in an ideal 14 2-way fashion due to the dependencies of internal field operations; (iii) some computations in the field operations (e.g. the complete carry propagation) cannot be parallelized in an ideal (2 × 4)-way fashion due to sequential dependencies of instructions; (iv) the instruction-level parallelism of (2 × 4)-way is lower than (8 × 1)-way since four limbs are stored in one vector. For all these reasons and because of the 32-bit multiplier in AVX-512F, the 2-way AVX-512F implementation is actually slower than the x64 implementation, which is confirmed by our experimental results.

Conclusions
Vector engines like Intel's AVX have become steadily more powerful from one generation to the next, not only because of the addition of new functionality, but also through the extension of the supported vector lengths. The expectation of this trend to continue in the coming years makes AVX an important platform for the implementation of PQC, in particular for computation-intensive isogeny-based cryptosystems. Although CSIDH has a couple of highly-desirable and unique features, the massive computational cost of the underlying class group action hampers its deployment in security protocols like TLS. In this paper we demonstrated how the enormous parallel processing power of AVX-512 can be exploited to, respectively, maximize the throughput of eight instances and minimize the latency of one instance of CSIDH-512 group action evaluation; the former alleviates the burden of server-side TLS processing, while the latter is beneficial on the client side. Our latency-optimized implementation makes CSIDH-512 group action evaluation roughly 1.5 times faster compared to a state-of-the-art non-vectorized x64 implementation that can resist timing attacks. On the other hand, by developing efficient batching methods for the class group action and combining them with highly-optimized (8 × 1)-way parallel field arithmetic based on the "limb-slicing" technique, we were able to achieve a 3.6-fold gain in throughput compared to a state-of-the-art x64 implementation of the CSIDH-512 group action evaluation. In light of this significant improvement, we expect that our batching methods are also highly beneficial for optimizing CSIDH-based digital signature schemes, such as CSI-FiSh [BKV19] and SeaSign [DG19], in which multiple independent class group actions are computed in the key generation, signing and verification processes.
The correct parameterization of CSIDH (including the order of the underlying prime field) to achieve NIST's security level 1 is currently still a topic of debate. It was suggested that, for level-1 security, the prime p should be much longer than 512 bits, e.g. 4096 bits [CCJR20]. Our CSIDH software was developed in a modular and parameterized way so as to reduce the effort when adapting it for other parameter sets since the point arithmetic (e.g. point addition, point doubling, scalar multiplication) and also certain parts of the field arithmetic can be re-used. 14 We define an ideal 2-way parallelized fashion of point or isogeny operation has the cost of i 2 M 2 + j 2 S 2 + k 2 A 2 when the corresponding 1-way implementation has the cost of iM + jS + kA.

B 2-Way Curve and Isogeny Operations
Let E a,d be an elliptic curve over a finite field F p , represented in twisted Edwards form. In this section we present the 2-way implementation for y-coordinate doubling, (differential) addition, isogeny computation and the evaluation of an isogeny at a specific point on E a,d . In all algorithms, any elliptic curve point is represented using projective Y T -coordinates according to [CCC + 19].

B.1 Point doubling
For a point P = (Y P : T P ) on the curve E a,d , the point R = [2]P is defined as: where e = a − d [CCC + 19]. Algorithm 7 describes our 2-way doubling process using Y T -coordinate arithmetic, with cost 2M 2 + 1S 2 + 3A 2 .

B.2 Point addition
For point addition, we use the formulas that are presented in [CCC + 19]. These formulas correspond to the differential addition using Y T -coordinates on twisted Edwards curves. In particular, let P = (Y P : T P ) and Q = (Y Q : T Q ) be two points on the curve and let P Q = P − Q = (Y P −Q : T P −Q ). The point R = P + Q is derived from the coordinates of the points P, Q and P Q, using the formulas: Algorithm 8 is the 2-way (differential) addition process using Y T -coordinate arithmetic with cost 2M 2 + 1S 2 + 3A 2 .

B.3 -isogeny computation
Algorithm 9 describes the procedure for computing an isogeny of some odd degree , using Y T -coordinate representation on twisted Edwards curves. The algorithm takes as input the values A 0 = a, A 1 = a − d, where E a,d is an elliptic curve in twisted Edwards form, a point P = (Y P : T P ) and the degree of the isogeny = 2k + 1. Then the algorithm computes the codomain curve E a ,d and the list of points {P 1 = (Y 1 : T 1 ), . . . , P k = (Y k : T k )}, where P i = [i]P , for each i ∈ {1, . . . , k}. Based on the work of Moody and Shumow [MS16], the coefficients of the codomain curve are defined as: Algorithm 9 outputs the values A 0 = a and A 1 = a − d , as well as the list of points {P 1 = (Y 1 : T 1 ), . . . , P k = (Y k : T k )}.