CTIDH: faster constant-time CSIDH

This paper introduces a new key space for CSIDH and a new algorithm for constant-time evaluation of the CSIDH group action. The key space is not useful with previous algorithms, and the algorithm is not useful with previous key spaces, but combining the new key space with the new algorithm produces speed records for constant-time CSIDH. For example, for CSIDH-512 with a 256-bit key space, the best previous constant-time results used 789000 multiplications and more than 200 million Skylake cycles; this paper uses 438006 multiplications and 125.53 million cycles.


Introduction
Isogeny-based cryptography, a relatively new area of post-quantum cryptography, has gained substantial attention in the past few years. Schemes like SIDH (Supersingular Isogeny Diffie-Hellman) [14] and CSIDH (Commutative Supersingular Isogeny Diffie-Hellman) [10] offer key-exchange protocols with the smallest key sizes among post-quantum systems. CSIDH is even a non-interactive key exchange, matching the data flow of traditional Diffie-Hellman key exchange, and it has received a considerable amount of attention related to constant-time algorithms [18,22,11,15,13,1,12].
Briefly, one can explain CSIDH as follows. Pick small odd primes ℓ_1 < ··· < ℓ_n such that p = 4ℓ_1···ℓ_n − 1 is also prime. A public key is a supersingular elliptic curve E_A/F_p : y^2 = x^3 + Ax^2 + x, specified by a single element A ∈ F_p. Given this curve one can efficiently compute two curves ℓ_i-isogenous to E_A, denoted l_i E_A and l_i^{-1} E_A, for any of the ℓ_i in the definition of p. Alice's private key is a list of exponents (e_1, ..., e_n) ∈ Z^n where e_i shows how often each l_i is used: Alice's public key is l_1^{e_1} ··· l_n^{e_n} E_0 = E_A, and if Bob's public key is E_B then the secret shared with Bob is l_1^{e_1} ··· l_n^{e_n} E_B. The key-exchange protocol works because this is a commutative group action: the ordering of the isogenies is not important.
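The shape of the CSIDH prime can be illustrated with a tiny parameter set; the primes 3, 5, 7 and the resulting p = 419 below are purely illustrative toy values, not the CSIDH-512 parameters.

```python
# Toy illustration of the CSIDH prime shape p = 4*l_1*...*l_n - 1,
# using the small odd primes 3, 5, 7 (illustrative only).
def is_prime(n):
    # naive trial division, adequate for toy sizes
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

ells = [3, 5, 7]                 # small odd primes l_1, ..., l_n
p = 4 * ells[0] * ells[1] * ells[2] - 1
assert p == 419 and is_prime(p)  # p is prime
assert p % 8 == 3                # CSIDH uses p ≡ 3 (mod 8)
```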
The first constant-time CSIDH paper [18] specified each exponent e_i as being between 0 and a public constant m_i, and always computed m_i iterations of l_i, secretly discarding the dummy operations beyond e_i iterations. The original CSIDH paper [10] had allowed e_i ∈ [−m_i, m_i]; in the constant-time context this might seem to require m_i iterations of l_i plus m_i iterations of l_i^{-1}, but [22] introduced a "2-point" algorithm with just m_i iterations, each iteration being only about 1/3 more expensive than before. All m_i were taken equal in [10], for example taking e_i ∈ [−5, 5] for CSIDH-512; subsequent papers did better by allowing m_i to depend on i (as suggested in [10, Remark 14]) and accounting for the costs of l_i. Further speedups in the literature come from various techniques to speed up each l_i computation and to merge work across sequences of l_i computations.

Contributions of this paper
This paper introduces a new key space for CSIDH, and a new constant-time algorithm to evaluate the CSIDH group action. The new key space is not useful by itself (it slows down previous constant-time algorithms), and similarly the new constant-time algorithm is not useful for previous key spaces; but there is a synergy between the key space and the algorithm, and using both of them together produces a large improvement in the performance of constant-time CSIDH. As a very small example of the new key space, assume that one is using just 6 primes and allows at most 6 isogeny computations, with each l_i exponent being nonnegative. The standard key space chooses (e_1, e_2, ..., e_6) ∈ {0, 1}^6, giving 2^6 = 64 keys. The new key space, with 2 batches of 3 primes each, chooses (e_1, e_2, ..., e_6) ∈ [0, 3]^6 with the condition that e_1 + e_2 + e_3 ≤ 3 and e_4 + e_5 + e_6 ≤ 3, giving 20^2 = 400 keys. Similar comments apply when negative exponents are allowed.
The extreme case of putting each prime into a size-1 batch is not new: it is the standard key space. The opposite extreme, putting all primes into 1 giant batch, is also not new: putting a bound on the 1-norm of the key vector was highlighted in [20] as allowing the best tradeoffs between the number of isogenies and the size of the key space. In the above example, 1 giant batch of 6 primes gives 924 keys for 6 isogenies, i.e., 0.61 isogenies per key bit, compared to 1 isogeny per key bit for the standard key space.
However, plugging 1 giant batch into constant-time algorithms takes 36 isogenies for 924 keys, since each of the 6 primes uses 6 computations. Our intermediate example, 2 batches of 3 primes each, uses 18 isogenies for 400 keys, which is still many more isogenies per key bit than 6 isogenies for 64 keys.
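The key-space sizes 64, 400, and 924 quoted above are small enough to verify by brute-force enumeration; the helper below is purely illustrative.

```python
from itertools import product

def count_keys(n, per_batch, bound):
    # Count nonnegative exponent vectors of length n, split into
    # consecutive batches of size per_batch, where each batch has
    # coordinate sum at most bound.
    count = 0
    for e in product(range(bound + 1), repeat=n):
        if all(sum(e[i:i + per_batch]) <= bound
               for i in range(0, n, per_batch)):
            count += 1
    return count

assert count_keys(6, 1, 1) == 64    # standard key space {0,1}^6
assert count_keys(6, 3, 3) == 400   # 2 batches of 3 primes: 20^2
assert count_keys(6, 6, 6) == 924   # 1 giant batch of 6 primes
```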
We do better by evaluating isogenies differently. The central challenge tackled in this paper is to develop an efficient constant-time algorithm for the new key space, computing any isogeny in a batch using the same sequence of operations. This raises several questions:
1. How does one optimally compute a sequence of isogenies, and handle probabilistic failures in standard algorithms to compute l_i, while at the same time hiding which isogeny is computed? See Section 4 for the introduction of atomic blocks for these computations, and Section 5 for how to compute them in constant time.
2. How does one optimally set batches, compute private keys, and determine the number of isogenies per batch to match a required size of the key space? See Section 3 for analysis of the key space, and Section 6 for how to minimize the multiplication count.
3. How does one minimize the cycle count for constant-time software? Section 7 describes our software and low-level ideas; Section 8 presents the speeds.
Our constant-time algorithm combines several old and new techniques. For example, as observed in [5], Vélu's formulas have a Matryoshka-doll structure; we constrain the more recent √élu formulas [3] in a way that creates a Matryoshka-doll structure. The cost of an isogeny computation exploiting this structure depends on the largest prime in the batch for the traditional formulas, but also on the smallest prime in the batch for √élu. Standard algorithms to compute l_i fail with probability 1/ℓ_i; to hide which ℓ_i in a batch is used, we arrange for failures to occur with probability matching the smallest prime in the batch. Our batches consist of primes of similar sizes, to obtain the optimal tradeoffs between the cost per batch and the size of the key space. Further constant-time optimizations are described throughout the paper.
For comparability we report CSIDH-512 speeds, setting records in multiplications and in cycles for complete constant-time software. Partial analyses [5,23,7,12] suggest that the post-quantum security level of CSIDH-512 is around 2^60 qubit operations; for applications that want higher security levels, our software also supports larger sizes.

Background
Sections 2.1, 2.2, 2.3, and 2.4 review the CSIDH group action, computations of individual ℓ-isogenies, strategies for computing sequences of isogenies, and previous constant-time algorithms.

The CSIDH group action
CSIDH [10] is a Diffie-Hellman-like key-exchange protocol based on isogenies of supersingular elliptic curves over a finite field F_p. For a prime p > 3, an elliptic curve E/F_p is supersingular if and only if #E(F_p) = p + 1, where E(F_p) is its group of points over F_p. CSIDH uses p ≡ 3 (mod 8), and uses supersingular elliptic curves over F_p that can be written in Montgomery form E_A : y^2 = x^3 + Ax^2 + x for A ∈ F_p \ {−2, 2}. We call A the Montgomery coefficient of E_A. We write E = {E_A : #E_A(F_p) = p + 1} for the set of CSIDH curves and M = {A : E_A ∈ E} for the set of corresponding Montgomery coefficients. Curves E_A for distinct A ∈ M are non-isomorphic by [10, Proposition 8], and each E_A(F_p) is cyclic. An isogeny is a nonzero map ϕ : E → E′ which is given by rational functions and is compatible with elliptic-curve addition. An ℓ-isogeny is an isogeny of degree ℓ (as a rational map). Isogenies are typically defined by their kernels, i.e., by the points they map to ∞. Computing an ℓ-isogeny with Vélu's formulas requires a point P of order ℓ; the isogeny has kernel ⟨P⟩, and any point in ⟨P⟩ of order ℓ leads to the same isogeny.
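The supersingularity criterion #E_A(F_p) = p + 1 can be checked naively for a toy prime; both the prime p = 419 = 4·3·5·7 − 1 and the quadratic-residue counting loop below are illustrative only.

```python
# Naive point count of E_A : y^2 = x^3 + A x^2 + x over F_419,
# checking #E_0(F_p) = p + 1 for the supersingular starting curve E_0.
p = 419  # toy CSIDH prime: 419 = 4*3*5*7 - 1, and 419 ≡ 3 (mod 8)

def num_points(A):
    squares = {(y * y) % p for y in range(p)}  # quadratic residues (and 0)
    total = 1  # the point at infinity
    for x in range(p):
        rhs = (x * x * x + A * x * x + x) % p
        if rhs == 0:
            total += 1          # one point with y = 0
        elif rhs in squares:
            total += 2          # two points (x, ±y)
    return total

assert num_points(0) == p + 1   # E_0 is supersingular: 420 points
```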
The CSIDH prime p is chosen as 4ℓ_1···ℓ_n − 1. If E_A is a curve in E then there are ℓ_j − 1 points of order ℓ_j in E_A(F_p). Each of these points of order ℓ_j generates the kernel of an ℓ_j-isogeny E_A → E_{A′}, which is the same isogeny for all of these points. The codomain E_{A′} of this ℓ_j-isogeny is written l_j E_A.
Fix i ∈ F_{p^2} with i^2 = −1. Define Ẽ_A(F_p) as the set of points (x, iy) ∈ E_A(F_{p^2}) where x, y ∈ F_p, along with the neutral element; equivalently, Ẽ_A(F_p) is the image of E_{−A}(F_p) under the isomorphism (x, y) → (−x, iy). For each ℓ_j, there are ℓ_j − 1 points of order ℓ_j in Ẽ_A(F_p). Each of these points generates the kernel of an ℓ_j-isogeny E_A → E_{A′}, the same isogeny for all of these points. The codomain E_{A′} of this ℓ_j-isogeny is written l_j^{-1} E_A. The isogeny from E_A to l_j^{-1} E_A maps the points of order ℓ_j in E_A(F_p) to points of order ℓ_j, while mapping the points of order ℓ_j in Ẽ_A(F_p) to ∞ on l_j^{-1} E_A. The isogeny from E_A to l_j E_A maps the points of order ℓ_j in Ẽ_A(F_p) to points of order ℓ_j, while mapping the points of order ℓ_j in E_A(F_p) to ∞ on l_j E_A.
Applying ℓ_i-isogenies induces a group action [5] of the commutative group Z^n on E. An exponent vector (e_1, ..., e_n) ∈ Z^n acts on the curve E_A to produce the curve (l_1^{e_1} ··· l_n^{e_n}) E_A, computed as a sequence having |e_j| many ℓ_j-isogenies for each j, each isogeny using l_j if e_j > 0 or l_j^{-1} if e_j < 0. The private key of each party is a (secret) vector (e_1, ..., e_n) sampled from a finite key space K ⊂ Z^n. To protect against meet-in-the-middle attacks, it is conventional to take #K ≥ 2^{2λ} for security 2^λ, but see [12] for arguments that smaller key spaces suffice. Beyond the size of K, the specific choice of K has an important impact on efficiency.
Most previous CSIDH implementations have used one of two types of key space. Given an exponent bound vector m ∈ Z^n_{≥0}, define K_m = {e ∈ Z^n : |e_i| ≤ m_i for all i} and K_m^+ = {e ∈ Z^n : 0 ≤ e_i ≤ m_i for all i}. The original CSIDH paper [10] and [22] use K_m with m = (5, ..., 5) for CSIDH-512. It was suggested in [10, Remark 14] and shown in [13] that allowing the m_i to vary improves speed. The space K_m^+ with m = (10, ..., 10) was used in [18] for CSIDH-512.

Computing isogenies
Let P ∈ E_A be a point of order ℓ with x-coordinate in F_p, and let ϕ : E_A → E_{A′} = E_A/⟨P⟩ be the ℓ-isogeny induced by P. The main computational task, called xISOG, is to compute (1) the Montgomery coefficient A′ of the target curve E_{A′} and (2) the images under ϕ of some specified points (normally zero, one, or two points).
Vélu and √élu. The main algorithms for xISOG are Vélu's formulas [24] and √élu [3]. The main computational task in both of these algorithms is to evaluate a polynomial h_S(X) = ∏_{s∈S} (X − x([s]P)) for some index set S. All the arithmetic is done using only x-coordinates.
For √élu, in contrast to the linear algorithm of Vélu, the main part of the product is evaluated using a baby-step-giant-step strategy. It is simplest to evaluate h_S(X) with S = {1, 3, 5, ..., ℓ − 2}; this set S is split into a "box" U × V and a leftover set W such that S ↔ (U × V) ∪ W. Then h_S(X) is computed as the product of h_W(X) with the resultant of h_U(X) and a polynomial related to h_V(X); see [3] for details. For each ℓ, one chooses U and V to minimize cost. Asymptotically, √élu uses Õ(√ℓ) field multiplications.
One can view Vélu's formulas as a special case of √élu in which U and V are empty. This special case is optimal for small primes. The exact cutoff depends on lower-level algorithmic details but is around ℓ = 89.
Sampling points of order ℓ. No efficient way is known to deterministically generate points of order ℓ in E(F_p). However, E(F_p) is cyclic of order p + 1, so if T is a uniform random point then P = [(p + 1)/ℓ]T will have order ℓ with probability 1 − 1/ℓ. This can fail, and often will, so the sampling needs to be repeated until it succeeds. Once P has order ℓ, one can use P with Vélu or √élu. Typically, one uses the Elligator 2 map [4] to sample points in E(F_p) or Ẽ(F_p). We discuss this approach in Appendix B.
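Since E(F_p) is cyclic of order p + 1, this rejection sampling can be sketched abstractly by modelling the group as Z/(p + 1); the toy parameters below (p = 419, ℓ = 7) are illustrative only.

```python
import random

# Model the cyclic group E(F_p) of order p + 1 as Z/(p + 1): a uniform
# random point T is a uniform random residue, and P = [(p+1)/l]T has
# order l exactly when P != 0, which happens with probability 1 - 1/l.
p_plus_1 = 420   # toy: p = 419 = 4*3*5*7 - 1
ell = 7

def sample_order_ell():
    while True:  # repeat until the candidate is nonzero
        T = random.randrange(p_plus_1)
        P = (p_plus_1 // ell) * T % p_plus_1
        if P != 0:                 # success with probability 1 - 1/ell
            return P

P = sample_order_ell()
# P lies in the unique order-ell subgroup (multiples of 60) and P != 0,
# so P has order exactly ell
assert P % (p_plus_1 // ell) == 0 and P != 0
```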

Strategies
Computing the multiple [(p + 1)/ℓ]T is very costly. If p has 512 bits then (p + 1)/ℓ has almost 512 bits and this scalar multiplication costs thousands of field multiplications. The cost of scalar multiplication is typically amortized by pushing points through isogenies. This approach aims to compute a series of isogenies after sampling only one point on the initial curve.
Following [14], we call a method that computes a given series of isogenies a strategy. Informally, a strategy determines the order of isogeny evaluations, and how to obtain suitable kernel generators through either scalar multiplications or point evaluations. In the context of SIDH, optimal strategies (with the minimum computational cost) can be found [14]. When adapting this to CSIDH, there are three main complications: the choice of isogenies to be combined into a sequence, the possibility of point rejections due to wrong orders, and the pairwise different degrees of the involved isogenies. Indeed, [13] showed that a direct adaptation of the method of [14] to CSIDH becomes infeasible when considering all possible permutations. Instead, several avenues for optimizing CSIDH strategies have been proposed, though none claims actual optimality.

Multiplicative strategy.
A simple multiplicative strategy was used in the algorithm of [10]. Let D, a divisor of p + 1, be the product of the degrees of the isogenies to be combined in one sequence. Sample a point T on the initial curve, and set T ← [(p + 1)/D]T; now the order of T divides D. Compute P ← [D/ℓ]T, where ℓ is the degree of the first isogeny. If P = ∞, skip this isogeny and continue with the next isogeny. If P ≠ ∞, then P has order ℓ, so we can compute the required ℓ-isogeny ϕ, and push T through to get T ← ϕ(T). Either way, the order of T now divides D/ℓ. For the next isogeny, say of degree ℓ′, compute P ← [D/(ℓℓ′)]T as a potential kernel generator. Continue in the same fashion, pushing one point through each isogeny, until T = ∞. Note that the scalar multiplications reduce in length at each step; as observed in [19], processing the isogenies in decreasing degree order reduces the total cost.
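The bookkeeping of this strategy can be sketched in a toy model where the curve group is the cyclic group Z/n and an ℓ-isogeny is modelled by the quotient map t ↦ t mod (n/ℓ), whose kernel is the order-ℓ subgroup; all parameters here are illustrative stand-ins, not curve arithmetic.

```python
import random

# Toy model of the multiplicative strategy over Z/n (n = p + 1 = 420).
n = 420
degrees = [7, 5, 3]          # process degrees in decreasing order [19]
D = 7 * 5 * 3                # product of the degrees in this sequence
T = random.randrange(n)
T = (n // D) * T % n         # T <- [(p+1)/D]T: order of T now divides D
computed = []
for ell in degrees:
    P = (D // ell) * T % n   # potential kernel generator
    if P != 0:               # P != infinity, so P has order ell
        computed.append(ell)
        n //= ell            # pass to the codomain group Z/(n/ell)
        T = T % n            # push T through the modelled isogeny
    D //= ell                # either way, ord(T) now divides D/ell
assert T == 0                # the loop ends with T = infinity
```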

SIMBA.
In [10], the product D was taken as large as possible at each step. The SIMBA (Splitting Isogenies into Multiple Batches) strategy proposed in [18] limits D for better performance. SIMBA-M partitions the set of isogeny degrees {ℓ_1, ..., ℓ_n} into M prides, where the i-th pride contains all isogeny degrees ℓ_j for which j ≡ i (mod M). In each step, SIMBA picks as many isogenies as possible for a single pride, processing them as described above. This typically results in a smaller total degree D, making the initial scalar multiplication T ← [(p + 1)/D]T more expensive, while the later scalar multiplications of the form P ← [D/ℓ_j]T become significantly cheaper. Overall, [18] reports that a significant speedup can be achieved for well-chosen values of M.
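The pride partition itself is straightforward; a minimal sketch (the helper name is ours):

```python
# SIMBA-M partition: the i-th pride contains the degrees l_j with
# index j ≡ i (mod M), using 0-based indices here.
def simba_prides(ells, M):
    return [[ells[j] for j in range(len(ells)) if j % M == i]
            for i in range(M)]

ells = [3, 5, 7, 11, 13, 17, 19, 23]
assert simba_prides(ells, 3) == [[3, 11, 19], [5, 13, 23], [7, 17]]
```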
Point-pushing strategies. One can also push additional points through isogenies. For instance, when computing the first kernel generator as described above via P ← [D/ℓ]T, one can save an intermediate point T′ of smaller order divisible by ℓ′, push both T and T′ through the first isogeny, and then use T′ to compute the next kernel generator via a smaller scalar multiplication. This may be more efficient than the multiplicative strategy, depending on the cost of evaluating ϕ at additional points compared to the savings due to cheaper scalar multiplications. However, except for very small isogeny degrees, the cost of evaluating additional points can be significantly higher than computing scalar multiplications. Thus, an optimal strategy is expected to be closer to the multiplicative strategy, only rarely pushing additional points through isogenies. In [13], optimal strategies are computed under the assumption of always choosing as many isogeny degrees as possible per sequence, and an increasing ordering of the involved degrees. It remains open whether other choices of primes per sequence, as in SIMBA, or different orderings of the degrees could yield a more optimized point-pushing strategy.
Remark 1. The SIMBA approach is generalized in [15] with point-pushing strategies within prides, more efficient partitions of SIMBA prides, and their permutations. However, all optimization attempts have required imposing certain assumptions in order for the optimization problem to be solvable, and thus only produce conditionally optimal strategies. The comparison in [13] shows that all of these approaches give roughly the same performance results for constant-time CSIDH-512 algorithms: that is, within a margin of 4%.

Previous constant-time algorithms
A private key (e_1, ..., e_n) requires us to compute |e_i| isogenies of degree ℓ_i (regardless of the strategy), so the running time of a naïve CSIDH algorithm depends directly on the key. Various constant-time approaches have been proposed to avoid this dependency.

What constant time means.
A deterministic algorithm computes a function from inputs to outputs. A randomized algorithm is more complicated: it computes a function from inputs to distributions over outputs, since each run will, in general, depend on random bits generated inside the algorithm. Similarly, the time taken by an algorithm is a function from inputs to distributions of times. "Constant time" means that this function is constant: the distribution of algorithm time for input i matches the distribution of algorithm time for input i′. In other words, the algorithm time provides no information about the input.
In particular, if the input is a CSIDH curve and a private key, and the output is the result of the CSIDH action, then the algorithm time provides no information about the private key, and provides no information about the output.
Avoiding data flow from inputs to branches and array indices is sufficient to ensure the constant-time property for many definitions of algorithm time, and is the main focus of work on constant-time algorithms for CSIDH, including the work in this paper.Beware, however, that this is not sufficient if the definition changes to allow, e.g., a variable-time division instruction, like the division instructions on most computers.
The constant-time property also does not mean that time is deterministic.The paper [5] aims for time to be constant and deterministic, so as to be able to run in superposition on a quantum computer, but this costs extra and is not necessary for the objective of stopping timing attacks.
Structurally, every claim of constant-time software in the literature relies on various CPU instructions taking constant time, and could be undermined by CPU manufacturers adding timing variations to those instructions.The literature on constant-time software generally assumes, for example, that multiplication instructions take constant time, and declares that CPUs with variable-time multiplication instructions are out of scope.Formally, the constant-time claims are in a model of "time" where various instructions, including multiplications, take constant time.
Dummy isogenies. [18] used dummy isogenies to obtain a fixed number of isogenies per group action evaluation. Essentially, if e_i is sampled such that |e_i| ≤ m_i, this amounts to computing m_i isogenies of degree ℓ_i, where m_i − |e_i| of these are dummy computations whose results are simply discarded. As noted in Section 2.2, this might require more than m_i attempts to sample a point of order ℓ_i, due to the point rejection probability of 1/ℓ_i. However, the number of attempts only depends on randomness and m_i, and is thus independent of the choice of e_i.
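The fixed iteration schedule can be sketched as follows; in real constant-time code the real/dummy decision is a bit mask rather than a Python boolean, and the helper name is ours.

```python
# Always run m_i rounds for degree l_i; the last m_i - |e_i| rounds are
# dummy computations whose results are discarded, so the round count
# is independent of the secret e_i.
def rounds_for_degree(m_i, e_i):
    schedule = []
    for k in range(m_i):
        real = k < abs(e_i)    # real isogeny for the first |e_i| rounds
        schedule.append(real)  # (a mask, not a branch, in real code)
    return schedule

assert rounds_for_degree(5, 2) == [True, True, False, False, False]
# the schedule length never depends on e_i:
assert len(rounds_for_degree(5, -4)) == len(rounds_for_degree(5, 0)) == 5
```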

1-point and 2-point approaches.
It is observed in [18] that if we compute multiple isogenies from a single sampled point, then the running time of the algorithm depends on the sign distribution of the private keys. Indeed, when a single point is sampled, only ℓ_i-isogenies with equal signs of the corresponding e_i can be combined in a strategy. Since this approach of combining isogenies is desirable for efficiency (see Section 2.3), [18] proposed eliminating this dependency by sampling e_i from [0, 2m_i] instead of [−m_i, m_i], although this requires computing twice as many isogenies per degree.
In order to mitigate this slowdown, [22] proposed sampling two points, T_0 ∈ E_A(F_p) and T_1 ∈ Ẽ_A(F_p). For each isogeny in the sequence, one picks the kernel generator according to the sign of the corresponding e_i. This approach combines isogeny computations independently of key signs, and thus goes back to sampling e_i from [−m_i, m_i], at the cost of pushing two points through each isogeny instead of one.
A dummy-free variant of the 2-point approach was proposed in [11]. This requires roughly twice as many isogenies, but may be useful in situations where fault-injection attacks play an important role. We return to this approach in Appendix A.

Batching and key spaces
The main conceptual novelty in CTIDH is the organization of primes and isogenies in batches. For this we define a new batch-oriented key space, which is slightly more complicated than the key spaces K_m and K_m^+ mentioned in Section 2.
Extreme batching choices correspond to well-known approaches to the group action evaluation: one prime per batch (B = n and N = (1, ..., 1)) was considered in [10]; one n-prime batch (B = 1 and N = (n)) is considered in [5] for the quantum oracle evaluation and in [20] as a speedup for CSIDH. The intermediate cases are new and, as we will show, faster.
The new key space. Let N ∈ Z^B_{>0} with N_1 + ··· + N_B = n, and let m ∈ Z^B_{≥0}. Partition the primes ℓ_1, ..., ℓ_n into B consecutive batches, the i-th batch containing N_i primes. Define K_{N,m} as the set of vectors (e_1, ..., e_n) ∈ Z^n such that, for each i, the subvector of exponents corresponding to the i-th batch has 1-norm at most m_i. We may see K_{N,m} as a generalization of K_m.
Lemma 1. We have #K_{N,m} = ∏_{i=1}^{B} Φ(N_i, m_i), where Φ(x, y) = Σ_k 2^k (x choose k)(y choose k) counts the vectors in Z^x with 1-norm at most y.
Proof. The size of the key space is the product of the sizes for each batch. In Φ(x, y), the number of nonzero entries in the x positions is k, and there are (x choose k) ways to determine which entries are nonzero. For each of the nonzero entries there are 2 ways to choose the sign. The vector of partial sums of the absolute values of these k nonzero entries consists of k different integers in [1, y], and each such vector uniquely matches one assignment of partial sums. There are (y choose k) ways to pick k different integers in [1, y].
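The count Φ(x, y) = Σ_k 2^k·C(x, k)·C(y, k) described by this argument can be cross-checked against brute-force enumeration for small parameters:

```python
from itertools import product
from math import comb

def phi(x, y):
    # Phi(x, y) = sum_k 2^k C(x, k) C(y, k): the number of vectors in
    # Z^x with 1-norm at most y, where k counts the nonzero entries
    return sum(2**k * comb(x, k) * comb(y, k)
               for k in range(min(x, y) + 1))

def brute(x, y):
    # direct enumeration of vectors in [-y, y]^x with 1-norm <= y
    return sum(1 for v in product(range(-y, y + 1), repeat=x)
               if sum(abs(c) for c in v) <= y)

assert phi(3, 3) == 63  # one size-3 batch with 1-norm at most 3
assert all(phi(x, y) == brute(x, y)
           for x in range(1, 5) for y in range(5))
```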

Isogeny atomic blocks
In this section we formalize the concept of isogeny atomic blocks (ABs), subroutines that have been widely used in constant-time CSIDH algorithms but never formalized before. The first step of an algorithm chooses a series of degrees for which isogenies still need to be computed, and then uses, for example, the multiplicative strategy (Section 2.3) to compute a sequence of isogenies of those degrees. The next step chooses a possibly different series of degrees, and computes another sequence of isogenies. Each step of the computation is the evaluation of not one isogeny, but a sequence of isogenies. Atomic blocks formalize these steps.
Square-free ABs generalize the approach we take when evaluating the CSIDH group action with the traditional key spaces K_m and K_m^+, as in Algorithm 1. Restricted square-free ABs are used to evaluate the group action using the batching idea with keys in K_{N,m}, with details in Algorithm 2. We postpone the explicit construction of ABs to Section 5.

Square-free atomic blocks
Definition 1 (Square-free ABs). Let R ⊆ {−1, 0, 1} and let I = (I_1, ..., I_k) be a vector of distinct indices. A square-free atomic block is a probabilistic algorithm α_{R,I} that, on input (A, ε) with A ∈ M and ε ∈ R^k, outputs (A′, f) with f ∈ {0, 1}^k and E_{A′} = (∏_{j=1}^{k} l_{I_j}^{ε_j f_j}) E_A, satisfying the following two properties: 1. there is a function σ such that, for each (A, ε), the distribution of f, given that (A′, f) is returned by α_{R,I} on input (A, ε), is σ(R, I); and 2. there is a function τ such that, for each (A, ε) and each f, the distribution of the time taken by α_{R,I}, given that (A′, f) is returned by α_{R,I} on input (A, ε), is τ(R, I, f).
Suppose an algorithm evaluates the group action on input e ∈ K and A ∈ M using a sequence of square-free AB calls (A′, f) ← α_{R,I}(A, ε). If in each step the choices of R and I are independent of e, the algorithm does not leak information about e through timing. This is illustrated by Algorithm 1, which expresses the constant-time group action from [22] using a sequence of square-free ABs with R = {−1, 0, 1} to evaluate the action for keys in K_m. The choices of R and I are independent of e for each AB α_{R,I}, and all other steps can easily be made constant-time. The choice of I in Step 3 may vary between different executions, due to the varying failure vectors f of previously evaluated ABs. However, this only depends on the initial choice of the m_i, and is independent of e.

Restricted square-free atomic blocks
In the language of Section 3, restricted square-free ABs are generalizations of square-free ABs that further do not leak information on which of the primes we have chosen from a batch.
Definition 2 (Restricted square-free ABs). A restricted square-free atomic block is a probabilistic algorithm β_{R,I} that, on input (A, ε, J), outputs (A′, f) with f ∈ {0, 1}^k and E_{A′} = (∏_{j=1}^{k} l_{I_j,J_j}^{ε_j f_j}) E_A, satisfying the following two properties: 1. there is a function σ such that, for each (A, ε, J), the distribution of f, given that (A′, f) is returned by β_{R,I} on input (A, ε, J), is σ(R, I); and 2. there is a function τ such that, for each (A, ε, J) and each f, the distribution of the time taken by β_{R,I}, given that (A′, f) is returned by β_{R,I} on input (A, ε, J), is τ(R, I, f).
Algorithm 2 uses restricted square-free ABs with R = {−1, 0, 1} to compute group actions for keys in K_{N,m}; it may be considered a generalization of Algorithm 1.
Algorithm 2: A constant-time group action for keys in K_{N,m} based on restricted square-free ABs with R = {−1, 0, 1}.
Parameters: N , m, B Input:

Evaluating atomic blocks in constant time
This section introduces the algorithm used in CTIDH to realize the restricted square-free atomic block β_{R,I} introduced in Section 4. Throughout this section, R is {−1, 0, 1}. As a warmup, Section 5.1 recasts the inner loop of [22, Algorithm 3] as a realization of the square-free atomic block α_{R,I}. We first present the algorithm in a simpler variable-time form (Algorithm 3) and then explain the small changes needed to eliminate timing leaks, obtaining α_{R,I}.
Section 5.2 presents our new algorithm to realize β_{R,I}. The extra challenge here is to hide which prime is being used within each batch. Again we begin by presenting a simpler variable-time algorithm (Algorithm 4) and then explain how to eliminate timing leaks.

Square-free atomic blocks for isogeny evaluation
Algorithm 3 translates the inner loop of [22, Algorithm 3] to the AB framework. The inputs are A ∈ M and ε ∈ {−1, 0, 1}^k. The goal is to compute k isogenies of degrees ℓ_{I_1}, ..., ℓ_{I_k}, but some of these computations may fail. The outputs are a vector f ∈ {0, 1}^k recording which of the computations succeeded, and A′ such that E_{A′} = (∏_{j=1}^{k} l_{I_j}^{ε_j f_j}) E_A. The algorithm uses the 2-point approach with dummy isogenies. It uses two subroutines: • UniformRandomPoints takes A ∈ M, and returns a uniform random pair of points (T_0, T_1), with T_0 ∈ E_A(F_p) and T_1 ∈ Ẽ_A(F_p); i.e., T_0 is a uniform random element of E_A(F_p), and T_1, independent of T_0, is a uniform random element of Ẽ_A(F_p).
See Appendix B for analysis of the Elligator alternative to UniformRandomPoints.
Remark 4. Algorithm 3 uses a multiplicative strategy, but it can easily be modified to use a SIMBA or point-pushing strategy, which is much more efficient in general [22,13]. The isogeny algorithm can be Vélu or √élu, whichever is more efficient for the given degree.

Modifying Algorithm 3 to eliminate timing leaks
The following standard modifications to Algorithm 3 produce an algorithm meeting Definition 1, the definition of a square-free atomic block.
Observe first that f_j = 1 if and only if the prime ℓ_{I_j} divides the order of the current T_s. This is equivalent to ℓ_{I_j} dividing the order of the initially sampled point T_s (since T_s has been modified only by multiplication by scalars that are not divisible by ℓ_{I_j}, and by isogenies of degrees not divisible by ℓ_{I_j}). This has probability 1 − 1/ℓ_{I_j}, since the initial T_s is a uniform random point in a cyclic group of size p + 1. These probabilities are independent across j, since (T_0, T_1) is a uniform random pair of points.
To summarize, the distribution of the f vector has position j set with probability 1 − 1/ℓ_{I_j}, independently across j. This distribution is a function purely of I, independent of (A, ε), as required. What follows are algorithm modifications to ensure that the time distribution is a function purely of (I, f); these modifications do not affect f.
Step 7, taking T_0 if s = 0 or T_1 if s = 1, is replaced with a constant-time point selection: e.g., taking the bitwise XOR T_0 ⊕ T_1, then ANDing each bit with s, and then XORing the result with T_0. Similar comments apply to the subsequent uses of T_s and T_{1−s}. It is simplest to merge all of these selections into a constant-time swap of T_0, T_1 when s = 1, followed by a constant-time swap back at the bottom of the loop. The adjacent swaps at the bottom of one loop and the top of the next loop can be merged; analogous merging is standard in constant-time versions of the Montgomery ladder for scalar multiplication.
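The selection-via-masking idiom looks as follows on fixed-width words; this is a generic constant-time sketch, not code from the CTIDH software.

```python
# Branchless constant-time swap on 64-bit words: the mask is all-ones
# when s = 1 and all-zeros when s = 0, so no branch depends on s.
MASK64 = (1 << 64) - 1

def ct_swap(s, a, b):
    mask = (-s) & MASK64      # 0x00...00 if s = 0, 0xFF...FF if s = 1
    t = (a ^ b) & mask        # t = a ^ b if swapping, else 0
    return a ^ t, b ^ t       # swapped iff s = 1

assert ct_swap(0, 10, 20) == (10, 20)
assert ct_swap(1, 10, 20) == (20, 10)
```

In the real software the same masking is applied limb by limb to field elements and point coordinates.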
Step 11 determines whether an actual isogeny or a dummy isogeny has to be computed. The conditional assignment to (A, T_0, T_1) in the first case is replaced with unconditional constant-time point selection. The conditional operation in the second case is replaced with an unconditional operation, multiplying T_s by ℓ_{I_j} in both cases. This changes the point T_s in the first case, but does not change the order of T_s (since the isogeny has already removed ℓ_{I_j} from the order of T_s in the first case), and all that matters for the algorithm is the order. See [18,22] for a slightly more efficient approach, merging the multiplication by ℓ_{I_j} into a dummy isogeny computation.
The branch in Step 8 is determined by public information f_j and does not need to be modified. The isogeny computation inside Isogeny takes constant time with standard algorithms; at a lower level, arithmetic in F_p is handled by constant-time subroutines, not by subroutines that try to save time by suppressing leading zero bits. The computation of UniformRandomPoints takes variable time with standard algorithms, but the time distribution is independent of the curve provided as input.
The total time is the sum of the time for initialization (UniformRandomPoints, computation of r and r′, initial scalar multiplication), the time for f_j computation (division, scalar multiplications, selection), and the time for computations when f_j = 1 (Isogeny, scalar multiplication, selection). This sum is a function purely of (I, f), independent of (A, ε), as required.

Restricted square-free atomic blocks
We now consider the more difficult goal of hiding which isogeny is being computed within each batch.We present first the high-level algorithm (Algorithm 4), then the PointAccept (Section 5.2.1) and MatryoshkaIsogeny (Section 5.2.2) subroutines, and finally the algorithm modifications (Section 5.2.3) to meet Definition 2.
The inputs to Algorithm 4 are A ∈ M, a vector ε ∈ {−1, 0, 1}^k, and a vector J recording the secret position of each chosen prime within its batch. Like Algorithm 3, Algorithm 4 uses a 2-point approach and dummy isogenies. It uses the following subroutines: • UniformRandomPoints is as before.
• PointAccept replaces the check P = ∞ to prevent timing leakage. It takes a point P and I_j, J_j ∈ Z such that P has order either ℓ_{I_j,J_j} or 1, and outputs either 0 or 1, under the condition that the output is 0 whenever P = ∞. • MatryoshkaIsogeny replaces Isogeny from Algorithm 3. There is an extra input J_j indicating the secret position within a batch.
Note that the output of PointAccept can be 0 even when P ≠ ∞, so we add a multiplication by ℓ_{I_j,J_j} in Step 14 to make sure we continue the loop with points of the expected order.

PointAccept
Step 8 of Algorithm 4 computes a potential kernel generator, P. The probability that P = ∞ is 1/ℓ_{I_j,J_j}, which depends on J_j. For the batch (ℓ_{I_j,1}, ..., ℓ_{I_j,N_{I_j}}), PointAccept artificially increases this failure probability to 1/ℓ_{I_j,1}, independent of ℓ_{I_j,J_j}, by tossing a coin with success probability γ = (1 − 1/ℓ_{I_j,1})/(1 − 1/ℓ_{I_j,J_j}) and only returning 1 if P ≠ ∞ and the coin toss is successful. The probability that the output is 1 is then (1 − 1/ℓ_{I_j,J_j}) · γ = 1 − 1/ℓ_{I_j,1}, which is independent of J_j. Thus the batch can fail publicly.
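With γ chosen as (1 − 1/ℓ_{I_j,1})/(1 − 1/ℓ_{I_j,J_j}), the acceptance probability is the same for every prime in the batch; an exact-arithmetic check with an illustrative batch of similar-size primes:

```python
from fractions import Fraction

# PointAccept equalizes acceptance probabilities across a batch:
# P != infinity with probability 1 - 1/l_j, and a coin with success
# probability gamma = (1 - 1/l_1) / (1 - 1/l_j) is tossed, so the
# overall acceptance probability is 1 - 1/l_1 for every degree.
batch = [103, 107, 109]   # illustrative batch, l_1 = 103 smallest
l1 = batch[0]
for lj in batch:
    gamma = (1 - Fraction(1, l1)) / (1 - Fraction(1, lj))
    accept = (1 - Fraction(1, lj)) * gamma
    assert gamma <= 1                        # a valid coin probability
    assert accept == 1 - Fraction(1, l1)     # independent of lj
```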

MatryoshkaIsogeny
MatryoshkaIsogeny replaces the Isogeny computation. It takes the Montgomery coefficient of a curve E_A, a batch (ℓ_{i,1}, . . ., ℓ_{i,N_i}), an isogeny index j within the batch, a point P of order ℓ_{i,j} generating the kernel of an isogeny ϕ : E_A → E_A/⟨P⟩ = E_{A′}, and a tuple of points (Q_1, . . ., Q_t), and returns A′ and (ϕ(Q_1), . . ., ϕ(Q_t)). MatryoshkaIsogeny is computed with cost independent of j. For Vélu's formulas, [5] showed how to compute any ℓ_i-isogeny for ℓ_i ≤ ℓ using the computation of an ℓ-isogeny and masking. The first step of computing (1) is to compute x(P), x([2]P), . . ., x([(ℓ − 1)/2]P). This includes the computation for smaller ℓ_i; [5] described this as a Matryoshka-doll property.
In this paper we specialize the √élu formulas so as to obtain a Matryoshka-doll structure. We define the sets U and V, introduced in Section 2.2, as the optimal choices for the smallest degree in the batch, i.e., ℓ_{i,1}. The leftover set W is chosen to make the formulas work even for the largest prime ℓ_{i,N_i} in the batch. The baby-step giant-step algorithm then stays unchanged; while we iterate through W we save the intermediate results corresponding to all degrees ℓ_{i,j} in the batch. In the final step, we select the result corresponding to the index j that we want to compute.
The sets U and V have size around √ℓ_{i,1}. If the primes in the batch are sufficiently close then the rounded values match or are marginally different, meaning that the Matryoshka-like formulas are at worst marginally slower than the optimal formulas for ℓ_{i,N_i}.
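The cost structure can be illustrated with a toy model (plain Python, not the actual √élu formulas; `matryoshka`, the modulus p, and the "point" P are all illustrative): the loop length is set by the largest prime in the batch, intermediate results are checkpointed at every degree, and the final masked selection touches every checkpoint.

```python
p = 65537          # toy modulus; "points" are integers modulo p
P = 12345          # toy kernel point

def matryoshka(batch, j):
    """Fold the multiples [1]P, ..., [(l_max-1)/2]P into an accumulator,
    checkpointing at every degree in the batch, then select the result
    for the secret index j by masking. batch must be sorted ascending."""
    l_max = batch[-1]
    saved = {}
    acc = 1
    for i in range(1, (l_max - 1) // 2 + 1):   # work depends only on l_max
        acc = acc * ((i * P) % p) % p
        if 2 * i + 1 in batch:                 # checkpoint for degree 2i+1
            saved[2 * i + 1] = acc
    result = 0
    for l, v in saved.items():                 # touch every checkpoint
        mask = -(l == batch[j])                # -1 if this is the secret degree
        result |= mask & v
    return result

batch = [11, 13, 17]
# degree-11 result, despite doing the loop work for 17:
assert matryoshka(batch, 0) == matryoshka([11], 0)
```

The amount of loop work is the same for every index j in the batch; only the final selection depends (through a mask, not a branch) on the secret.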

Modifying Algorithm 4 to eliminate timing leaks
We now indicate algorithm modifications to meet Definition 2, the definition of a restricted square-free atomic block.
As in Section 5.1.1, we begin with the distribution of f. For each input (A, e, J), the distribution has f_j set with probability 1 − 1/ℓ_{I_j,1} (not 1 − 1/ℓ_{I_j,J_j}; see Section 5.2.1), independently across j. This distribution is a function purely of I, independent of (A, e, J), as required. What remains is to ensure that the time distribution is a function purely of (I, f).
There are secret scalars r, r′, and ℓ_{I_j,J_j} used in various scalar multiplications in Steps 3, 8, and 14. Standard integer-arithmetic algorithms that dynamically suppress leading zero bits are replaced by constant-time algorithms that always use the maximum number of bits, and variable-time scalar-multiplication algorithms are replaced by a constant-time Montgomery ladder, as in [5]. It is straightforward to compute an upper bound on each scalar in Algorithm 4. See Section 7 for faster alternatives.
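As an illustration of the ladder's fixed iteration count, here is a minimal Python sketch in which "points" are plain integers under addition, standing in for x-only Montgomery arithmetic; the number of iterations depends only on the public bound maxbits, never on the secret scalar k:

```python
def cswap(bit, a, b):
    """Branchless conditional swap: swap iff bit == 1."""
    mask = -bit                    # 0 or -1 (all ones)
    t = mask & (a ^ b)
    return a ^ t, b ^ t

def ladder(k, P, maxbits):
    """Compute k*P with exactly maxbits iterations, independent of k.
    'Points' are plain integers under addition in this toy model."""
    R0, R1 = 0, P                  # 0 plays the role of the neutral element
    for i in reversed(range(maxbits)):
        bit = (k >> i) & 1
        R0, R1 = cswap(bit, R0, R1)
        R0, R1 = 2 * R0, R0 + R1   # "double" and "differential add"
        R0, R1 = cswap(bit, R0, R1)
    return R0

assert ladder(587, 7, 16) == 587 * 7
```

The invariant R1 = R0 + P holds throughout, which is what makes differential addition (where only the difference of the two operands is known) applicable in the real x-only setting.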
Section 5.2.2 explains how to compute MatryoshkaIsogeny in time that depends only on the batch, not on the selection of a prime within the batch. Everything else is as in Section 5.1.1: the distribution of UniformRandomPoints timings is independent of the inputs, Step 8 uses constant-time selection, the branch in Step 12 is replaced by constant-time selection, and the branch in Step 10 does not need to be modified.

Strategies and parameters for CTIDH
The optimization process for previous constant-time algorithms for CSIDH has two levels. The bottom level tries to minimize the cost of each AB, for example by optimizing √élu parameters and searching for a choice of strategy from Section 2.3. The top level searches for a choice of exponent bounds m = (m_1, . . ., m_n), trying to minimize the total AB cost subject to the key space reaching a specified size. A cost function that models the cost of an AB, taking account of the bottom-level search, is plugged into the top-level search.
Optimizing CTIDH is more complicated. There is a new top level, searching for a choice of batch sizes N = (N_1, . . ., N_B). These batch sizes influence the success chance and cost of an AB at the bottom level: see Sections 5.2.1 and 5.2.2. They also influence the total cost of any particular choice of 1-norm bounds m = (m_1, . . ., m_B) at the middle level. The size of the key space depends on both N and m; see Lemma 1.
This section describes a reasonably efficient method to search for CTIDH parameters.

Strategies for CTIDH.
We save time at the lowest level of the search by simply using multiplicative strategies. As in previous papers, it would be easy to adapt Algorithm 4 to use SIMBA or optimized point-pushing strategies or both, giving many further parameters that could be explored with more search time, but this is unlikely to produce large benefits.
Seen from a high level, evaluating ABs multiplicatively in CTIDH has a similar effect to SIMBA strategies for previous algorithms. For example, SIMBA-N_1 for traditional batch sizes (1, . . ., 1) limits each AB to at most n/N_1 isogenies (if n is divisible by N_1), in order to save multiplicative effort. Now consider CTIDH where all B batches have size N_1, i.e., N = (N_1, . . ., N_1). Each CTIDH AB then computes at most B = n/N_1 isogenies, saving multiplicative effort in the same way.
One could split a CTIDH AB into further SIMBA prides, but [18] already shows that most of the benefit of SIMBA comes from putting some cap on the number of isogenies in an AB; the exact choice of cap is relatively unimportant. One could also try to optimize point-pushing strategies as an alternative to multiplicative strategies, as an alternative to SIMBA, or within each SIMBA pride, but the searches in [15] and [13] suggest that optimizing these strategies saves at most a small percentage in the number of multiplications, while incurring overhead for managing additional points.
Cost functions for CTIDH. The search through various CTIDH batching configuration vectors N and 1-norm bound vectors m tries to minimize a function C(N, m), a model of the cost of a group-action evaluation. The numerical search examples later in this section use the following cost function: the average number of multiplications (counting squarings as multiplications) used by the CTIDH algorithms, including the speedups described in Section 7.
One way to compute this function is to approximate it statistically: run the software from Section 7 many times, inspect the multiplication counter built into the software, and take the average over many experiments. A more efficient way to compute the same function with the same accuracy is with a simulator that skips the multiplications but still counts how many there are. Our simulator, despite being written in Python, is about 50 times faster than the software from Section 7.
However, using a statistical approximation raises concerns about the impact of statistical variations. So, instead of using the software or the simulator, we directly compute the average cost of the first AB, the average cost of the second AB, etc., stopping when the probability of needing any further AB is below 10^−9.
Batch b, with smallest prime ℓ_{b,1}, has success probability 1 − 1/ℓ_{b,1} in each AB, so the chance q_b of reaching m_b successes within R ABs is the sum of the coefficients of x^{m_b}, x^{m_b+1}, . . . in the polynomial (1/ℓ_{b,1} + (1 − 1/ℓ_{b,1})x)^R. Batches are independent, so q_1 q_2 · · · q_B is the probability of not needing any further AB. Note that multiplying the polynomial by (1/ℓ_{b,1} + (1 − 1/ℓ_{b,1})x) for each increase in R is more efficient than computing binomial coefficients.
Computing the cost of an AB (times the probability that the AB occurs) is more complicated. Splitting the analysis into 2^B cases might be workable, since B is not very large and one can skip cases that occur with very low probability; one such case, occurring with probability (1 − q_1) · · · (1 − q_B), is that all B batches still remain to be done. We instead take the following approach. Fix b. The probability that batch b is in the AB is 1 − q_b; the probability that batch a is in the AB for exactly j values a < b is the coefficient of x^j in the polynomial ∏_{a<b}(q_a + (1 − q_a)x); and the probability that batch c is in the AB for exactly k values c > b is the coefficient of x^k in the polynomial ∏_{c>b}(q_c + (1 − q_c)x). There are O(B^2) possibilities for (j, k); each possibility determines the total number of batches in the AB and the position of b in the AB, assuming b is in the AB. For the AB algorithms considered here, this is enough information to determine the contribution of batch b to the cost of the AB. Our Python implementation of this approach has cost similar to 100 runs of the simulator, depending on B.
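The polynomial bookkeeping for q_b can be sketched in a few lines of Python (`q_batch` is an illustrative name): repeated multiplication by (1/ℓ_{b,1} + (1 − 1/ℓ_{b,1})x) builds the distribution of success counts, and the tail coefficients are summed.

```python
from fractions import Fraction
from math import comb

def q_batch(l_smallest, m_b, R):
    """Chance of >= m_b successes in R ABs, each succeeding with
    probability 1 - 1/l_smallest, via repeated polynomial multiplication."""
    fail = Fraction(1, l_smallest)
    succ = 1 - fail
    poly = [Fraction(1)]                  # distribution of success counts
    for _ in range(R):
        new = [Fraction(0)] * (len(poly) + 1)
        for d, c in enumerate(poly):
            new[d] += c * fail            # this AB fails for the batch
            new[d + 1] += c * succ        # this AB succeeds
        poly = new
    return sum(poly[m_b:])                # coefficients of x^m_b, x^(m_b+1), ...

# cross-check against the explicit binomial formula
l, m_b, R = 3, 2, 5
direct = sum(Fraction(comb(R, i)) * Fraction(2, 3)**i * Fraction(1, 3)**(R - i)
             for i in range(m_b, R + 1))
assert q_batch(l, m_b, R) == direct
```

Using exact rationals avoids any statistical noise; real parameters would use the actual smallest prime of each batch in place of the illustrative values here.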
We also explored various simpler possibilities for cost functions. A deterministic model of ABs is easier to compute and simulates real costs reasonably well, leading to parameters whose observed costs were consistently within 10% of the best costs we found via the cost function defined above.
Optimizing the 1-norm bounds. Given a fixed configuration N of B batches, we use a greedy algorithm as in [13] to search for a 1-norm bound vector m as follows:
1. Choose an initial m = (m_1, . . ., m_B) such that K_{N,m} is large enough, and set C_min ← C(N, m).
2. For each i in {1, . . ., B}, do the following: (c) Else, set m′ ← m, and for each j ≠ i in {1, . . ., B} do the following:
This algorithm applies small changes to the bound vector m at each step, reducing one entry while possibly increasing others. Obviously, this finds only a locally optimal m with respect to these changes and the initial choice of m in Step 1; different choices generally produce different results.
One way to choose an initial m is to try (1, 1, . . ., 1), then (2, 2, . . ., 2), etc., stopping when K_{N,m} is large enough. Another approach, in the context of the N search described below, is to start from the best m found for the parent N, and merely increase the first component of m until K_{N,m} is large enough; usually at most one increase is needed.
The algorithm involves at least B(B − 1) evaluations of the cost function for the final pass through Step 2. It can involve many more evaluations if there are many recursive calls or if there are many improvements to m, but usually these are small effects.
Optimizing the prime batches. We optimize N via a similar greedy algorithm, using the algorithm above as a subroutine. For a fixed number of batches B, we proceed as follows:
1. Choose an initial N = (N_1, . . ., N_B) with ∑_i N_i = n, and let (m, C_min) be the output of the algorithm above applied to N.
2. For each i ∈ {1, . . ., B}, do the following: ii. Let (m̃, C̃) be the output of the algorithm above applied to Ñ_{i,j}. iii. If C̃ < C_min, then update (N, m, C_min) ← (Ñ_{i,j}, m̃, C̃).
This algorithm also finds only a local optimum with respect to these changes, and with respect to the initial choice of N in Step 1; again, different choices may lead to different results. Our experiments took an initial choice for N. One can also omit one or more large primes ℓ_j by taking N_j = 1 and m_j = 0.
Within the full two-layer greedy algorithm, each N considered at the upper layer involves B(B − 1) calls to the lower layer, the optimization of 1-norm bounds. Recall that each call to the lower layer involves at least B(B − 1) evaluations of the cost function. Overall there are nearly B^4 evaluations of the cost function.
Numerical examples. Table 1 shows examples of outputs of the above search. For each B, the "N"/"m" column shows the final (N, m) found, and the "cost" column shows the cost function for that (N, m), to two digits after the decimal point.
We used a server with two 64-core AMD EPYC 7742 CPUs, but limited each search to 32 cores running in parallel. We parallelized only the upper layer of the search; often fewer than 32 cores were used, since some calls to the lower layer were slower than others. For each B, "wall" shows the seconds of real time used for the search, and "CPU" shows the total seconds of CPU time (across all cores, user time plus system time) used for the search.

Constant-time software for the action
We have built a self-contained high-performance software package, high-ctidh, that includes implementations of all of the operations needed by CSIDH users for whichever parameter set is selected: constant-time key generation, constant-time computation of the CSIDH action, and validation of (claimed) public keys. The package uses the CTIDH key space and CTIDH algorithms to set new cycle-count records for constant-time CSIDH.
The high-ctidh source code is in C, with assembly language for field arithmetic. Beyond the performance benefits, using low-level languages is helpful for ensuring constant-time behavior, as explained below. Measuring the performance of a full C implementation also resolves the concerns raised by using multiplications as a predictor of performance, such as concerns that some subroutines could be difficult to handle in constant time and that improved multiplication counts could be outweighed by overhead. The software is freely available at http://ctidh.isogeny.org/. This section describes the software. Section 8 reports the software speeds and compares to previous speeds.

Processor selection and field arithmetic
The original CSIDH paper reported clock cycles for variable-time CSIDH-512 software on an Intel Skylake CPU core. Skylake is also the most common CPU choice in followup papers on CSIDH software speed. We similarly focus on Skylake to maximize comparability.
The original csidh-20180826 software from [10] included a small assembly-language library for Intel chips (Broadwell and newer) to perform arithmetic modulo the CSIDH-512 prime. The same library has been copied, with minor tweaks and generalizations to other primes, into various subsequent software packages, including high-ctidh. Code above the field-arithmetic level, decomposing isogenies into multiplications etc., is written in C, so porting the software to another CPU is mainly a matter of writing an efficient Montgomery multiplier for that CPU. Beware that each CPU will have different cycle counts, and possibly a different ranking of algorithmic choices.
The velusqrt-asm software from [3] includes an adaptation of the same library to CSIDH-1024. The sqale-csidh-velusqrt software from [12] includes adaptations to larger sizes, all automatically generated by a code generator that takes p as input. The high-ctidh package includes a similar code generator, with some small improvements in the details: for example, we use less arithmetic for conditional subtraction, and we avoid cmov instructions with memory operands out of concern that they could have data-dependent timings.
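For readers unfamiliar with the multiplier being ported, here is a textbook Montgomery-reduction sketch in Python (a simplified single-chunk version with R = 2^k; the actual library works on 64-bit limbs in assembly):

```python
def redc(T, p, k, p_inv):
    """Montgomery reduction: return T * 2^(-k) mod p, for 0 <= T < p << k."""
    R = 1 << k
    m = (T * p_inv) % R            # m = T * (-p^(-1)) mod R
    t = (T + m * p) >> k           # T + m*p is exactly divisible by R
    return t - p if t >= p else t  # at most one conditional subtraction

p = 2**61 - 1                      # an illustrative odd modulus
k = 64
p_inv = pow(-p, -1, 1 << k)        # precomputed -p^(-1) mod 2^k
a, b = 123456789, 987654321
aR, bR = (a << k) % p, (b << k) % p           # map into the Montgomery domain
abR = redc(aR * bR, p, k, p_inv)              # Montgomery product of a and b
assert redc(abR, p, k, p_inv) == (a * b) % p  # map back out of the domain
```

Montgomery form replaces divisions by p with shifts by R, which is why the per-platform work reduces to one carefully scheduled multiply-and-reduce routine.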

Computing one isogeny
The middle layer of high-ctidh computes an ℓ-isogeny for one prime ℓ; it also includes auxiliary functions such as multiplying by the scalar ℓ. We built this layer as follows.
We started with the xISOG function in velusqrt-asm. As in csidh-20180826, this function takes a curve and a point P of order ℓ, and returns the corresponding ℓ-isogenous curve. It also takes a point T, and returns the image of that point under the isogeny.
We extended the function interface to take lower and upper bounds on ℓ (the smallest and largest prime in the batch containing ℓ), and we modified the software to take time depending only on these bounds, not on the secret ℓ. The Matryoshka-doll structure of the computation (see Section 5.2.2) meant that very little code had to change. Each loop to ℓ is replaced by a loop to the upper bound, with constant-time conditional selection of the results relevant to ℓ; and ℓ is replaced by the lower bound as input to the √élu parameter selection. An upper bound was used the same way in [5]; the use of the lower bound for a Matryoshka-doll √élu is new here. We reused the automatic √élu parameter-tuning mechanisms from velusqrt-asm. These mechanisms offer the option of tuning for multiplication counts or tuning for cycles. Since most CSIDH-related papers report multiplication counts while fewer report cycles, we chose to tune for multiplication counts for comparability, but this makes only a small difference: cycle counts and multiplication counts are highly correlated.
We made more changes to incorporate known optimizations, including an observation from [5] regarding the applicability of multiexponentiation, and an observation from [1] regarding reciprocal polynomials. Computing a 587-isogeny and pushing a point through it takes 2108 multiplications in this software (counting squarings as multiplications); for comparison, [1] took 3.4% more, and velusqrt-asm took 8.9% more.
More importantly for the high-level algorithms, we extended the interface to allow an array of points T to be pushed through the isogeny, e.g., two or zero points rather than one. We also incorporated shorter differential addition chains, as in [11], for scalar multiplications, and standard addition chains for the constant-time exponentiation inside Legendre-symbol computation.
There would be marginal speedups from tuning the √élu parameters separately for each number of points. Taking parameters (6, 3) for 0 points instead of (0, 0) saves 2 out of 328 multiplications for ℓ = 79; 2 out of 344 multiplications for ℓ = 83; and 8 out of 368 multiplications for ℓ = 89. Parameter adjustments also save 3 multiplications for 0 points for each ℓ ∈ {557, 587, 613}. However, we did not find such speedups for most primes, and we did not find such speedups for the much more common case of 2 points.

Computing the action
The top layer of high-ctidh is new, and includes the core CTIDH algorithms described earlier in this paper. The key space is K_{N,m}, allowing any vector with 1-norm at most m_1 for the first N_1 primes, 1-norm at most m_2 for the next N_2 primes, etc. Constant-time generation of a length-N_i vector of 1-norm at most m_i works as follows:
• Set the bottom bit of each of the first N_i integers, and clear the bottom bit of each of the last m_i integers.
• Sort the integers. (We reused existing constant-time sorting software from [2].)
• If any adjacent integers are the same outside the bottom bit, start over. (Otherwise the integers were distinct outside the bottom bit, so sorting them applies a uniform random permutation.)
• Extract the bottom bit at each position. (This is a uniform random bit string of length N_i + m_i with exactly N_i bits set.)
• Consider the entries as integers. Add the first entry to the second, then add the resulting second entry to the third, etc. (Now there are possibly some 0s, then at least one 1, then at least one 2, and so on through at least one N_i.)
• Count, in constant time, the number e_0 of 0s, the number e_1 of 1s, and so on through the number e_{N_i−1} of (N_i − 1)s. (These tallies add up to at most N_i + m_i − 1, since the number of N_i s was not included. Each of e_1, . . ., e_{N_i−1} is positive, and e_0 is nonnegative.)
• Subtract 1 from each of e_1, . . ., e_{N_i−1}. (Now e_0, . . ., e_{N_i−1} is a uniform random string of N_i nonnegative integers with sum at most m_i.)
• Generate a uniform random N_i-bit string s_0, . . ., s_{N_i−1}.
• Compute, in constant time, whether any j has s_j = 1 and e_j = 0. If so, start over.
• Replace each e_j with −e_j if s_j = 1.
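The steps above can be modeled in plain Python (rejection loops included, but without the constant-time implementation details; `sample_exponents` is an illustrative name):

```python
import random

def sample_exponents(N, m, b=32):
    """Rejection-sampling model of private-key generation for one batch:
    returns (e_0, ..., e_{N-1}) with sum |e_j| <= m."""
    while True:
        vals = [random.getrandbits(b) for _ in range(N + m)]
        for i in range(N + m):                 # bottom bit: 1 for first N
            vals[i] = (vals[i] & ~1) | (1 if i < N else 0)
        vals.sort()
        if any(vals[i] >> 1 == vals[i + 1] >> 1 for i in range(N + m - 1)):
            continue                           # ties would bias the shuffle
        bits = [v & 1 for v in vals]           # N ones among N+m positions
        for i in range(1, N + m):              # running prefix sums
            bits[i] += bits[i - 1]
        e = [bits.count(v) for v in range(N)]  # tallies e_0 .. e_{N-1}
        for j in range(1, N):
            e[j] -= 1
        s = [random.getrandbits(1) for _ in range(N)]
        if any(s[j] and e[j] == 0 for j in range(N)):
            continue                           # reject: -0 duplicates +0
        return [-e[j] if s[j] else e[j] for j in range(N)]

v = sample_exponents(4, 3)
assert len(v) == 4 and sum(abs(x) for x in v) <= 3
```

The second rejection step is what makes the distribution uniform over K_{N,m}: without it, vectors containing zeros would be over-weighted, since +0 and −0 are the same exponent.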
As required by the constant-time property, the two rejection steps in this algorithm are independent of the secrets produced as output. The first rejection step is very unlikely to occur when b is chosen so that 2^b is on a larger scale than (N_i + m_i)^2. The second rejection step occurs more frequently. Sign variations for vectors of Hamming weight k contribute a factor 2^k by Lemma 1, and thus the rejection correctly happens more frequently for smaller k.
In the case N_i > m_i, high-ctidh saves time by skipping (in constant time) the s_j = 1 rejection test for the first N_i − m_i values of j having e_j = 0. There are always at least N_i − m_i such values of j. This increases each acceptance chance by a factor 2^{N_i − m_i}, preserving uniformity of the final output.
Once a private key is generated, the action is computed by a series of restricted square-free ABs. As in Section 4, the first AB handles one prime from each batch, the next AB handles one prime from each batch that might have something left to do, etc. Within each AB, Elligator is used twice to generate two independent points; see Appendix B. Specifically, Elligator is used to generate a point on the first curve E_A: a point in Ẽ_A(F_p) if the first isogeny has negative sign, otherwise in E_A(F_p). This point is pushed through the first isogeny. Elligator is then used again to generate an independent point on the second curve E_{A′}: a point in Ẽ_{A′}(F_p) if the first isogeny had positive sign, otherwise in E_{A′}(F_p). Both choices are secret. These two points (T_0, T_1) are then pushed through subsequent isogenies as in Algorithm 4, except that no points are pushed through the last isogeny and only one point is pushed through the isogeny before that. The AB thus pushes 1, 2, 2, 2, . . ., 2, 2, 2, 1, 0 points through isogenies. The software permutes the b ≤ B batches in the AB. Each AB selects one prime from each batch in the block and tries to compute an isogeny of total degree D, the product of the selected primes; D = r in Algorithm 4. Each point is multiplied by 4 and then by all primes outside D immediately after the point is generated by Elligator, so that the order of the point divides D. There are two types of primes outside D (compare Steps 2 and 3 of Algorithm 4):
• The batches in the AB are public. Primes outside these batches are publicly outside D.
• Primes that are inside the batches in the AB, but that are not the secretly selected prime per batch, are secretly outside D.
For scalar multiplication by a product of secret primes, [5] uses a Montgomery ladder, with the number of ladder steps determined by the maximum possible product. For public primes, [11] does better using a precomputed differential addition chain for each prime.
Our high-ctidh software also uses these chains for secret primes, taking care to handle the incompleteness of differential-addition formulas and to do everything in constant time.
The primes in a batch usually vary slightly in chain length, so the software always runs to the maximum length.
Each ℓ-isogeny then clears ℓ from the order of the point that was used to compute the isogeny. As in line 14 of Algorithm 4, the software multiplies the point by ℓ anyway (again using a constant-time differential addition chain), just in case this was a dummy isogeny, i.e., there was secretly nothing left to do in the batch. This extra scalar multiplication could be merged with the isogeny computation, but the √élu structure seems to make this somewhat more complicated than in [18], and the extra scalar multiplication accounts for only about 3% of the CSIDH-512 computation. The other point is also multiplied by ℓ.
Recall that an AB successfully handling a batch is a public event, visible in timing: it means that a (real or dummy) ℓ-isogeny is computed now for some ℓ in the batch, publicly decreasing the maximum 1-norm of the batch. This event occurs with probability 1 − 1/ℓ_{i,1}, where ℓ_{i,1} is the smallest prime in the batch containing ℓ = ℓ_{i,j}. As in Section 5.2.1, the software creates this event exactly when there is a conjunction of a natural success and an artificial success. A natural success, with probability 1 − 1/ℓ, means that cofactor multiplication produces a point of order ℓ rather than order 1. An artificial success, with probability γ = (1 − 1/ℓ_{i,1})/(1 − 1/ℓ), is determined by a γ-biased coin toss.
One obvious way to generate a γ-biased coin is to (1) generate a uniform random integer modulo ℓ_{i,1}(ℓ − 1) and (2) compute whether the integer is smaller than (ℓ_{i,1} − 1)ℓ. The second step is easy to do in constant time. For the first step, the software generates a uniform random 256-bit integer and, in constant time, reduces it modulo ℓ_{i,1}(ℓ − 1); the resulting distribution is indistinguishable from uniform. One could instead use rejection sampling to compute a uniform random integer modulo ℓ_{i,1}M, where M is the least common multiple of ℓ − 1 across primes ℓ in the batch, and then reduce that integer modulo ℓ_{i,1}(ℓ − 1), to obtain an exactly uniform distribution; the reason to use M here rather than just one ℓ − 1 is to avoid having the secret ℓ influence the rejection probability.
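A quick sanity check of the threshold in step (2): with r uniform modulo ℓ_{i,1}(ℓ − 1), the chance that r < (ℓ_{i,1} − 1)ℓ is exactly γ. A Python sketch with illustrative primes:

```python
from fractions import Fraction

def coin_success_probability(l1, l):
    """Success chance of the coin: r uniform modulo l1*(l-1), success
    iff r < (l1 - 1)*l. Here l1 is the smallest prime in the batch."""
    return Fraction((l1 - 1) * l, l1 * (l - 1))

l1, l = 443, 457                     # illustrative batch primes, l1 <= l
gamma = (1 - Fraction(1, l1)) / (1 - Fraction(1, l))
assert coin_success_probability(l1, l) == gamma
assert (l1 - 1) * l < l1 * (l - 1)   # the threshold is below the modulus
```

The second assertion is equivalent to ℓ_{i,1} < ℓ, so the threshold is always valid for non-smallest primes in the batch (and equals the modulus when ℓ = ℓ_{i,1}, where the coin always succeeds).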

Automated constant-time verification
We designed and analyzed every step of the CTIDH algorithm to be constant time, leaking nothing about the input through timing; this is the basis for our claim that the algorithm is in fact constant time. We also designed and reviewed every new line of code in high-ctidh to be constant time, and reviewed every line of reused code for the same property; this is the basis for our claim that the software is in fact constant time. These analyses are complete, but, as in most papers on constant-time algorithms, they are entirely done by hand, raising the question of what protections there are against human error.
For extra assurance, we designated an internal auditor to use automated tools to verify the constant-time claims. This subsection is an audit report to support external auditing. The report describes what the tools verified, various limitations of this verification, and various steps that the auditor took to compensate for those limitations.
From a risk-management perspective, a timing leak in high-ctidh would have to be at the intersection of (1) human error in this paper's manual analysis and (2) limitations of the automated verification. One might hope for automated verification without any limitations, eliminating the need for manual analysis (assuming correctness of the verification tools), but one risk identified below is beyond the current state of the art in automated verification. The automated verification is nevertheless useful in reducing risks overall.

An automated test using valgrind.
There is a standard tool, valgrind [21], that runs a specified binary, watching each instruction for memory errors, in particular branches and array indices derived from undefined data. If secret data in cryptographic software is marked as undefined then simply running valgrind will automatically check whether there is any data flow from secrets to branches and array indices; see, e.g., [17]. See also [16] for a survey of related tools.
Because valgrind works at the binary level, this analysis includes any optimizations that might have been introduced by the compiler. A compiler change could generate a different binary with timing leaks, but valgrind is fast enough to be systematically run on all compiled cryptographic software before the software is deployed.
The auditor wrote a simple checkct program using high-ctidh to perform a full CSIDH key exchange; this program is included in the high-ctidh package. For example, running valgrind ./checkct512default, where checkct512default performs a full CSIDH-512 key exchange, takes under 30 seconds on a 3GHz Skylake core. The underlying randombytes function marks all of its output as undefined, so valgrind is checking for any possible data flow from randomness to branches or to array indices. For each size, valgrind completes successfully, indicating that there is no such data flow.
Limitations of the automated test, and steps to address the limitations. The following paragraphs ask, from the auditor's perspective, what could have been missed by this automated test; for example, the auditor asks what would happen if private keys were actually generated by OpenSSL's RAND_bytes rather than randombytes. Everything is covered by the paper's manual analysis (for example, we had already checked that all code was generating all randomness via randombytes), but the question addressed here is the level of extra assurance provided by the automated analysis.
If the code generates randomness such as private keys via RAND_bytes rather than randombytes, then private keys will not be marked as undefined, so valgrind will not track data flow from private keys to branches or to array indices. To address this, the auditor skimmed high-ctidh to check the (very limited) set of C library functions being used, and double-checked the list of functions output by nm checkct512default.
If private keys are actually deterministic then they will not be marked as undefined. To address this, the auditor added a step to checkct to mark Alice's private key as undefined before Alice handles Bob's public key. The auditor also checked examples of private keys and saw them varying.
The valgrind analysis is dynamic, tracing through one run of the code. Perhaps CSIDH-1024 triggers code paths that are not used by CSIDH-512 and that leak secret data. To address this, the auditor tried all sizes of interest.
Different runs could still follow different code paths because of the declassification described below. To address this, the auditor tried many runs of each size, but there is still a risk that all of the runs missed some code path. In theory it should be possible to combine valgrind with static code-coverage analysis for binaries that follow standard calling conventions, but as far as we know no tools are available for this. There are tools for static constant-time analysis of C code, and those tools could be applied to a modified version of high-ctidh that replaces the assembly-language portions with reference C code.
The valgrind analysis checks that all array indices and all branch conditions are defined, but does not check that division inputs are defined. Division instructions take variable time on most CPUs, and should not be modeled as taking constant time. To address this, the auditor skimmed the assembly-language code for any use of division instructions, and skimmed the C code for any operations likely to be compiled into division instructions. Patching valgrind to limit the set of acceptable instructions would reduce risks here.
Finally, the high-ctidh package has six crypto_declassify lines explicitly marking certain pieces of data as defined, meaning that valgrind allows branches and array indices derived from that data. Perhaps this declassification leaks secrets. This is the most important risk, the risk that would require advances in automated verification to address.
Five of the six lines are in rejection-sampling loops: one in generating uniform random integers modulo p (rejecting numbers ≥ p), two in Elligator (rejecting the random numbers 0, 1, −1), and two in the subroutine described above to generate vectors of bounded 1-norm. The sixth, and the most worrisome, declassifies the success of a batch in an AB. Analyzing the safety of this declassification requires analyzing everything that influences the success probability, including
• the logic concluding that the natural failure probability of generating a curve point of order ℓ_{i,j} is exactly 1/ℓ_{i,j},
• the coin toss artificially increasing the failure probability to exactly 1/ℓ_{i,1}, and
• the use of Elligator as a substitute for uniform random points, assuming (as previously conjectured; see Appendix B) indistinguishability of the point orders.
We again emphasize that all of this analysis is included in this paper. The challenge for the future is to automate the analysis.

Software speeds
This section reports various measurements of the high-ctidh software from Section 7, and compares the measurements to previous speeds for constant-time CSIDH.

Selecting a CSIDH size and collecting performance data
For comparability to previous speed reports, we focus here on CSIDH-512 with a key space of 2 256 vectors.After some searching we took the (N, m) shown for B = 14 in Figure 1.This (N, m) has approximately 2 256.009 keys.Our cost calculator claimed that this (N, m) would use approximately 437986 multiplications on average.We chose parameters a and c, and performed a action computations for each of c different private keys on a 3GHz Intel Xeon E3-1220 v5 (Skylake) CPU with Turbo Boost disabled.This CPU does not support hyperthreading, and to limit noise we used only one core.For each of the ac computations, we recorded a cycle count, a total multiplication count including squarings, a separate count of squarings, and a total addition count including subtractions.We also tracked, for each key and each batch of primes, the success probability of that batch in ABs for the computations for that key.
Choosing c = 65, as in [3], and a = 16383 meant that experiments completed quickly, in half a day on one core. We did not detect any deviations from the null hypothesis that the software performance is independent of the private key. For example, as discussed below, the per-key success probability of the batch having smallest prime ℓ_{i,1} was not statistically distinguishable from 1 − 1/ℓ_{i,1}.
One can easily justify spending further computer time on experiments. Larger a or c would make the total statistics more robust. Larger a would make the per-key statistics more robust. Larger c would be useful if there were a set of, say, 1 in every 1000 keys that somehow leaked information through timing. On the other hand, predictable CTIDH implementation errors, such as taking a coin with probability 1 − γ rather than γ, would have been caught by our experiments.

Performance results for the selected CSIDH size
We use the standard notation M for multiplications not including squarings, S for squarings, and a for additions including subtractions. One common metric in the literature is (M, S, a) = (1, 1, 0), counting the total number of multiplications while ignoring the cost of additions and ignoring possible squaring speedups. Another common metric is (M, S, a) = (1, 0.8, 0.05).
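These metrics are simple weighted sums of the operation counts; a minimal Python sketch (illustrative only, with made-up counts):

```python
def cost(M, S, a, weights):
    """Weighted operation count: weights = (w_M, w_S, w_a)."""
    w_M, w_S, w_a = weights
    return w_M * M + w_S * S + w_a * a

# (1, 1, 0) counts multiplications and squarings equally and ignores
# additions; (1, 0.8, 0.05) credits squaring speedups and charges a
# small cost per addition.
```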
Across all 1064895 experiments, the average cycle count was 125.53 million, standard deviation 3.01 million. The average M was 321207, standard deviation 6621. The average S was 116798, standard deviation 4336. The average a was 482311, standard deviation 9322. The average cost in the (M, S, a) = (1, 1, 0) metric was 438006. The average cost in the (M, S, a) = (1, 0.8, 0.05) metric was 438762.
To understand the performance results in more detail, we plotted the distribution of all ac multiplication counts as the red curve in Figure 1. We also computed, for each key, the distribution of the 16383 multiplication counts for that key; there are five blue curves in Figure 1, showing the minimum, first quartile, median, third quartile, and maximum of these 65 distributions. The green curves, with a larger spread, are like the blue curves but are limited to the last 255 multiplication counts for each key.
Each curve has a stair-step shape. Each step upwards reflects another AB in the computation, with (typically) two Elligator calls and two large scalar multiplications. Any number of ABs can appear (for example, ℓ = 3 can fail again and again), but only with exponentially low probability. One can extrapolate the budget needed for an application that needs to run for a fixed time (e.g., [5]) with a failure probability of, e.g., 2^−100; in this scenario the time would be lower if the smallest primes were left out of all batches.
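As a rough sketch of such an extrapolation, under the simplifying assumption that each AB of a batch with smallest prime ℓ fails independently with probability 1/ℓ, the number of ABs needed to push a batch's failure probability below 2^−100 is:

```python
import math

def abs_for_failure_budget(ell, target_log2=100):
    """Smallest k with (1/ell)^k <= 2^(-target_log2),
    i.e., k >= target_log2 / log2(ell)."""
    return math.ceil(target_log2 / math.log2(ell))

# A batch whose smallest prime is 3 dominates the tail: it needs many
# more repetitions than a batch whose smallest prime is, say, 347,
# matching the remark that leaving out the smallest primes would help.
```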
It is unsurprising to see that the green curves have a spread of step positioning on the scale of 10%, given that these curves consider only 255 experiments for each of the 65 keys. Similar comments apply to the blue curves, with more experiments and a narrower spread. As a visual illustration that the spread is what one would expect, Figure 2 replaces the green and blue curves from Figure 1 with random simulations based on the red curve.
To gain more confidence that the distributions match, one could run more experiments for each key and watch for a further narrowing from the blue curves towards the red curve. Note that graphing the complete distributions provides much more information than computing a single number such as a t-statistic. Similar comments apply to cost metrics beyond multiplications: for example, Figure 3 shows cycle counts, and one could similarly plot other statistics such as the time of the first failed batch.
We also measured the cost of validation of a public key: median 14680 multiplications (after some easy speedups), around 4.09 million cycles. Note that validation takes variable time: an invalid key fails much more quickly. Finally, we measured the cost of generating a private key: typically under 1 million cycles, a negligible cost compared to generating the corresponding public key.
We also ran the same experiments for CSIDH-1024. Across all 1064895 experiments, the average cycle count was 469.52 million, standard deviation 15.29 million. The average M was 287739, standard deviation 7420. The average S was 87944, standard deviation 4961. The average a was 486764, standard deviation 10525. The average cost in the (M, S, a) = (1, 1, 0) metric was 375683. The average cost in the (M, S, a) = (1, 0.8, 0.05) metric was 382432.
As in [1, Tables 1 and 2], the CSIDH-1024 prime uses fewer multiplications than the CSIDH-512 prime (although, unsurprisingly, the larger prime makes each multiplication slower). One might think that a larger list of primes ℓ_i needs more multiplications, since one needs to clear more cofactors; but the larger list also means that one can use smaller exponents for the same size of key space, and in our experiments this turns out to have the larger effect.

Comparisons
There have been several previous speed reports [18,22,11,15,13,1,12] for constant-time CSIDH. CSIDH-512 with a key space of 2^256 vectors is almost always included, and for this size the lowest multiplication count we have found in the literature is 789000: this is from [1, Table 1, "hvelu", "OAYT-style"], which reports 624000M + 165000S + 893000a. All Skylake cycle counts we have found are above 200 million. The high-ctidh speeds are much faster, and have the added feature of constant-time verification using valgrind.
Sometimes the latest software speeds are not fully reflected in the relevant papers, so we downloaded the latest versions of csidh_withstrategies (dated July 2020) from [13] and sqale-csidh-velusqrt (dated December 2020) from [12], and ran their benchmarking tools (with one modification to change the number of experiments, "its", from 1024 to 65536) on the same Skylake machine that we had used for measuring high-ctidh. Some care is required in comparisons, for at least three reasons: first, some tools report the time for an action plus key validation; second, different benchmarking frameworks could be measuring different things (e.g., our impression is that the costs of Elligator were omitted from the multiplication counts reported in [1]); third, the 512-bit parameters in [12] target a key space smaller than 2^256, so they should be compared to high-ctidh measurements for a matching key space. The sqale-csidh-velusqrt tools, using BITS=512 STYLE=wd2, reported averages of 190.921 million cycles (standard deviation 4.32 million), 626000a (standard deviation 13000), 128000S (standard deviation 5000), and 447000M (standard deviation 9000), i.e., 575000 multiplications. For comparison, high-ctidh takes 89.11 million cycles (310945 multiplications) as noted above, plus 4.09 million cycles for validation.

Table 2: Comparison of speed reports for constant-time CSIDH actions. The CSIDH size is specified by "pub" (512 for the CSIDH-512 prime, 1024 for the CSIDH-1024 prime) and "priv" (k where private keys are chosen from a space of approximately 2^k vectors). "DH" is the Diffie-Hellman stage: "1" for computing a public key (computing the CSIDH action), "2" for computing a shared secret (validating a public key and then computing the CSIDH action). "Mcyc" is millions of Skylake cycles (not shown for Python software); "M" is the number of multiplications not including squarings; "S" is the number of squarings; "a" is the number of additions including subtractions; "1, 1, 0" and "1, 0.8, 0.05" are combinations of M, S, and a. See text for measurement details and standard deviations.
Finally, Table 2 summarizes the measurements listed above for high-ctidh, for the software from [13], and for the software from [12]; the measurements stated in [22,11,15,1] for the software in those papers; and the measurements stated in [11] for the software in [18]. For [15] the reported processor is an Intel Core i7-7500k, which is Kaby Lake rather than Skylake, but Kaby Lake cycle counts generally match Skylake cycle counts. The table omits cycle counts for [1], which used Python, and for [22], which used C but had measurements affected by an unknown amount of Turbo Boost.
The cost of UniformRandomPoints. The obvious way to generate a point in E_A(F_p) is to generate a uniform random x ∈ F_p and compute y = ±√(x^3 + Ax^2 + x), trying again if x^3 + Ax^2 + x is not a square. The distribution is not exactly uniform, but one can easily adjust the procedure to correct this (see [5, Section 4.1]), or simply accept the distribution as being statistically indistinguishable from uniform.
The standard way to try to compute a square root, given that p ≡ 3 (mod 4), is to compute a (p + 1)/4 power. One more squaring then reveals whether the input was a square. Generating a point in E_A(F_p) in this way takes two exponentiations on average. Alternatively, before trying to compute y one can check the Legendre symbol of x^3 + Ax^2 + x modulo p; the square-root attempt will succeed if and only if the symbol is not −1. This reduces two exponentiations to two Legendre-symbol computations and one exponentiation, saving time if a Legendre-symbol computation is more than twice as fast as an exponentiation.
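A Python sketch of these two computations for p ≡ 3 (mod 4); big-integer arithmetic stands in here for the constant-time field arithmetic of the C implementation:

```python
def legendre(t, p):
    """Legendre symbol as a (p-1)/2 power: 1 for nonzero squares,
    p-1 (i.e., -1) for nonsquares, 0 for t = 0."""
    return pow(t % p, (p - 1) // 2, p)

def sqrt_candidate(t, p):
    """(p+1)/4 power; a square root of t exactly when t is a square
    mod p.  One extra squaring checks which case holds."""
    assert p % 4 == 3
    y = pow(t % p, (p + 1) // 4, p)
    return y if (y * y - t) % p == 0 else None
```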
Similar comments apply to Ẽ_A(F_p), producing an average UniformRandomPoints cost of four exponentiations, or four Legendre-symbol computations and two exponentiations. One can easily reduce the cost to three exponentiations, or two Legendre-symbol computations and two exponentiations, by taking the first x as generating a point in E_A(F_p) in half of the cases and generating a point in Ẽ_A(F_p) in the other half of the cases.
Conventional algorithms for Montgomery-curve computations, including the isogeny computations needed in CSIDH, work only with x and do not need to inspect y. One can thus reduce the cost of UniformRandomPoints to three Legendre-symbol computations on average.
Our high-ctidh software follows previous CSIDH work in computing a Legendre symbol as a (p − 1)/2 power, so the speed is the same as computing a square root, but it would be interesting to investigate faster algorithms. One can use blinding to guarantee constant-time Legendre-symbol computation:
• If x^3 + Ax^2 + x is 0, set a bit indicating this, and replace the 0 with 1.
• Multiply by ±r^2 where r is a uniform random nonzero element of F_p.
• Adjust the output according to the 0 bit and the ± bit.
It would also be interesting to investigate whether the techniques of [6] can be adapted to this context, avoiding the costs of blinding.
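The three blinding steps above can be sketched as follows (a Python sketch of the algebra only; Python integers are of course not constant-time, and for p ≡ 3 (mod 4) multiplying by −1 flips the symbol, which the last step undoes):

```python
import secrets

def blinded_legendre(t, p):
    """Legendre symbol of t mod p (p ≡ 3 (mod 4)), computed on a
    blinded input ±r²·t so that the exponentiation never sees t
    directly.  Returns 1, -1, or 0."""
    t %= p
    zero = (t == 0)
    if zero:
        t = 1                         # step 1: remember 0, replace with 1
    r = 1 + secrets.randbelow(p - 1)  # uniform random nonzero r
    neg = secrets.randbelow(2)        # step 2: multiply by +r² or -r²
    blinded = (t * r * r) % p
    if neg:
        blinded = (-blinded) % p
    sym = 1 if pow(blinded, (p - 1) // 2, p) == 1 else -1
    if neg:                           # step 3: undo the ± choice
        sym = -sym                    # (-1 is a nonsquare when p ≡ 3 mod 4)
    return 0 if zero else sym         # ... and the 0 bit
```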
Elligators everywhere. The literature generally takes a different approach, using the Elligator 2 map [4]. This approach has the advantage of using just one Legendre-symbol computation to generate a point in E_A(F_p) and, at no extra cost, a point in Ẽ_A(F_p). The disadvantage is that each point produced is distinguishable from uniform, covering only (p − 3)/2 out of the p + 1 possible points. Perhaps the orders of these points are distinguishable from the orders of uniform random points.
Elligator was first used in the CSIDH context in [5], which analyzed algorithms to compute CSIDH in superposition as a subroutine inside quantum attacks. That paper mentioned experiments suggesting that Elligator outputs have "failure chance almost exactly 1/ℓ" and that the higher-level algorithms in [5] performed as predicted. However, the security question for constructive CSIDH applications, namely the order-indistinguishability question, did not arise in [5]. A measurable deviation in orders could easily have avoided detection by the experiments in [5].
Elligator was first used for constructive CSIDH applications in [18], to generate an element of E_A(F_p). It was then used in [22] to generate an element of E_A(F_p) × Ẽ_A(F_p). Subsequent CSIDH software has also used Elligator. It is conceivable, however, that information about A is leaked via the distribution of orders of the points that are generated by Elligator.
For each p, we enumerated all A ∈ M. For each (p, A), we enumerated all Elligator outputs T_0 ∈ E_A(F_p) and computed the exact distribution of the order of [4]T_0. We then compared this to the uniform model: the exact distribution of orders for uniform random elements of Z/((p + 1)/4)Z. Specifically, we computed the total-variation distance between these two distributions; recall that the total-variation distance between D and E, the conventional form of statistical distance, is (1/2) Σ_x |D(x) − E(x)|. For example, for k = 1 and p = 11 = 4·3 − 1, the Elligator order distribution for each A ∈ {0, 5, 6} is 100% order 3: if T_0 is output by Elligator then [4]T_0 always has order 3. The uniform model is that orders 3 and 1 appear with probability 2/3 and 1/3 respectively. The total-variation distance is (|1 − 2/3| + |0 − 1/3|)/2 = 1/3.
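The p = 11 computation can be reproduced in a few lines of Python: the uniform model is the order distribution in Z/((p+1)/4)Z = Z/3Z, where the order of k is 3/gcd(k, 3):

```python
from math import gcd

def uniform_order_dist(m):
    """Order distribution of a uniform random element of Z/mZ."""
    dist = {}
    for k in range(m):
        d = m // gcd(k, m)
        dist[d] = dist.get(d, 0) + 1 / m
    return dist

def tvd(D, E):
    """Total-variation distance (1/2) * sum_x |D(x) - E(x)|."""
    return sum(abs(D.get(x, 0) - E.get(x, 0)) for x in set(D) | set(E)) / 2

elligator_p11 = {3: 1.0}             # every [4]T_0 has order 3 (see text)
uniform_p11 = uniform_order_dist(3)  # {1: 1/3, 3: 2/3}
```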
Seeing two different values of A with different distributions shows that the result of replacing UniformRandomPoints with Elligator is not exactly an atomic block. This does not end the security analysis: for security it is enough to have something indistinguishable from an atomic block. If the total-variation distance drops quickly enough to reach, e.g., 2^−128 for p ≈ 2^512, then the Elligator orders are indistinguishable from uniform-point orders for every A, and are thus indistinguishable from one A to another. To see more information regarding the distributions, we inspected, for each A with p = 419, the full distribution of orders of (4 times) Elligator points in E_A(F_p). The green curves in Figure 4 show the minimum, quartiles, and maximum of the per-A distributions; for comparison, the red curve shows the uniform model. Figure 5 is for p = 12011. The green curves are closer to the red curve for p = 12011 than for p = 419.
Elligator simulators. Consider the following simulator, resampling from the uniform model: for each A ∈ M, generate a uniform random sequence of (p − 3)/2 elements of [4]E_A(F_p) ≅ Z/((p + 1)/4)Z, and compute the distribution of orders of these elements.
However, within the range of our experiments, this simulator produces results similar to Elligator's; compare Figure 5. A heuristic analysis of this simulator for arbitrary sizes of p proceeds as follows. For each positive integer d dividing (p + 1)/4, there are φ(d) elements of order d in Z/((p + 1)/4)Z, where φ is Euler's phi function. For example, there is 1 element of order 1, and there are (ℓ_1 − 1) ··· (ℓ_n − 1) elements of order (p + 1)/4 = ℓ_1 ··· ℓ_n. Order d thus occurs with probability q_d = 4φ(d)/(p + 1).
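For squarefree (p+1)/4 these counts are easy to enumerate; a Python sketch for p = 419, where (p+1)/4 = 105 = 3·5·7:

```python
from itertools import combinations
from math import prod

def order_counts(primes):
    """For m the product of the given distinct primes, map each
    divisor d of m to phi(d), the number of elements of order d in
    Z/mZ.  For squarefree d, phi(d) is the product of ell - 1."""
    counts = {}
    for r in range(len(primes) + 1):
        for subset in combinations(primes, r):
            counts[prod(subset)] = prod(ell - 1 for ell in subset)
    return counts

counts_419 = order_counts([3, 5, 7])   # p = 419, (p+1)/4 = 105
```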
In general, if a trial succeeds with probability q, then the number S of successes in N independent trials has average qN and standard deviation √(q(1 − q)N), so the ratio (S − qN)/√(q(1 − q)N) (assuming 0 < q < 1) has average 0 and standard deviation 1. The distribution of (S − qN)/√(q(1 − q)N) rapidly approaches a normal distribution as q(1 − q)N increases. If the distribution were exactly normal then the half absolute value |S − qN|/(2√(q(1 − q)N)) would have average 1/√(2π). One thus expects the variation |S/N − q|/2 to be approximately √(q(1 − q)/(2Nπ)) on average.
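A quick numerical check of this heuristic (a simulation with a fixed seed; the choices N = 2000, q = 0.3, and 2000 trials are arbitrary):

```python
import random
from math import pi, sqrt

random.seed(1)
N, q, trials = 2000, 0.3, 2000

def variation():
    """One sample of |S/N - q| / 2 for S binomial(N, q)."""
    S = sum(random.random() < q for _ in range(N))
    return abs(S / N - q) / 2

observed = sum(variation() for _ in range(trials)) / trials
predicted = sqrt(q * (1 - q) / (2 * N * pi))   # heuristic average
```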
Summing over d says that the total-variation distance is, on average, approximately Σ_d √(q_d(1 − q_d)/((p − 3)π)). This is at most Σ_d √(q_d/((p − 3)π)) = X/√((p − 3)π), where X = Σ_d √q_d. Notice that X factors as ∏_j (√(1 − 1/ℓ_j) + 1/√ℓ_j), which is easy to compute even when p is large. For example, for the CSIDH-512 prime p, this product X is below 2^11, while 1/√((p − 3)π) is around 2^−256.
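The factorization of X can be checked directly: expanding the product over the primes ℓ_j reproduces the sum of √q_d over the divisors d. A Python sketch for (p+1)/4 = 105 = 3·5·7:

```python
from itertools import combinations
from math import prod, sqrt

def X_as_sum(primes):
    """X = sum over divisors d of m = prod(primes) of sqrt(q_d),
    with q_d = phi(d)/m for squarefree m."""
    m = prod(primes)
    return sum(
        sqrt(prod(ell - 1 for ell in subset) / m)
        for r in range(len(primes) + 1)
        for subset in combinations(primes, r))

def X_as_product(primes):
    """X = prod over ell of (sqrt(1 - 1/ell) + 1/sqrt(ell))."""
    return prod(sqrt(1 - 1 / ell) + 1 / sqrt(ell) for ell in primes)
```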
There are roughly √ p choices of A, and they usually vary in total-variation distance.
For any particular d, one expects that there exists some A with approximately √(log p) standard deviations in the probability that d occurs: e.g., 19 standard deviations for the CSIDH-512 prime p (although finding just 10 deviations would be a large computation). This effect is on a smaller scale than X when (p + 1)/4 has many small prime factors: i.e., the typical variation accumulated by many large divisors d is larger than occasional variation for the largest divisor. One does not expect several standard deviations to occur for several d simultaneously.
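The quoted number of standard deviations comes from √(log p) with the natural logarithm, an extreme-value estimate over the roughly √p curves; a two-line check for the CSIDH-512 prime p ≈ 2^512:

```python
from math import log, sqrt

deviations = sqrt(512 * log(2))   # sqrt(log p) for p ≈ 2^512
```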
To summarize, one expects all total-variation distances from the simulator to be close to X/√((p − 3)π). The ratios in Table 3 show that the actual Elligator total-variation distances are close to X/√((p − 3)π) for p ∈ {11, 59, 419, 12011, 78539, 1021019}.
Zero hazards. The original Elligator paper [4] did not define Elligator 2 for A = 0. The application of Elligator to CSIDH attacks in [5] suggested handling A = 0 by precomputing a point of full order, or, alternatively, replacing the initial A/(r^2 − 1) with r when A = 0.
Our CTIDH software (see Section 7) replaces the initial A/(r^2 − 1) with 1/(r^2 − 1) when A = 0. For constant-time projective computations this seems slightly more efficient than replacing A/(r^2 − 1) with r. We included this handling of A = 0 in the computations described above of the total-variation distance.
Table 3: Heuristic analysis of the Elligator simulator (see text) compared to actual Elligator order distributions. "#E": number of choices of A. "lg X": logarithm base 2 of X = ∏_j (√(1 − 1/ℓ_j) + 1/√ℓ_j). "lg H": logarithm base 2 of H = Σ_d √(q_d/((p − 3)π)) = X/√((p − 3)π). "lg H_1": logarithm base 2 of H_1 = Σ_d √(q_d(1 − q_d)/((p − 3)π)). "min" and "avg" and "max": logarithm base 2 of the minimum and average and maximum, over A, of the total-variation distance between Elligator and the uniform model. "ratios": H_1 and minimum and average and maximum, divided by H, without logarithms.

Beware that the alternative of precomputing a point of full order would not generally be safe in constructive applications. This precomputation eliminates failure cases for A = 0, making A = 0 easily distinguishable from other values of A via timing. An attacker could exploit this distinguishability from a timing perspective. (Note that if Legendre-symbol computations have low enough cost then it is easy to argue for incurring the slowdown of using two Legendre-symbol computations on each curve to generate uniform random points, skipping Elligator and further simplifying the security analysis.) On the other hand, a single Elligator call could be best for speed, and is used in several previous papers. So we also studied the joint distribution of E_A(F_p) × Ẽ_A(F_p) orders.
For A = 0, the Elligator extensions mentioned above produce outputs of the form ((x, y), (−x, iy)). The "distortion map" from (x, y) to (−x, iy) is compatible with elliptic-curve addition, so it preserves the order of points. This reduces the security question for E_0(F_p) × Ẽ_0(F_p) to the security question for E_0(F_p), which was addressed above.
It is not surprising that the distance from the joint distribution to the uniform-pair model is generally larger than the distance from the single-point distribution to the uniform model: there are many more possibilities for a pair of orders than for a single order.
To quantify this, consider a joint-distribution simulator that resamples from the uniform-pair model. If d_1 and d_2 are positive integers dividing (p + 1)/4 then there are φ(d_1)φ(d_2) pairs of elements T_1, T_2 of orders d_1, d_2 respectively in Z/((p + 1)/4)Z. A heuristic analysis proceeds as before, with q_d replaced by q_{d_1} q_{d_2}, and X replaced by X^2, giving the estimate X^2/√((p − 3)π). This estimate is ≈0.263654 for p = 59, ≈0.164434 for p = 419, ≈0.0410512 for p = 12011, and ≈0.0277182 for p = 78539. If the actual joint-Elligator distances remain close to X^2/√((p − 3)π) for all CSIDH primes p then the distances are acceptably small for CSIDH-512.
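These estimates are easy to recompute from the factorization of X; a Python sketch (the prime lists are the odd primes ℓ with 4·∏ℓ = p + 1):

```python
from math import pi, prod, sqrt

def joint_estimate(p, primes):
    """Heuristic joint-Elligator distance X^2 / sqrt((p-3)*pi),
    with X = prod (sqrt(1 - 1/ell) + 1/sqrt(ell))."""
    X = prod(sqrt(1 - 1 / ell) + 1 / sqrt(ell) for ell in primes)
    return X * X / sqrt((p - 3) * pi)
```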

3. If C_min was updated in Step 2, then repeat Step 2.
4. Return (m, C_min).

3. If C_min was updated in Step 2, then repeat Step 2.
4. Return N, m, and C_min.

Figure 1: Distributions of costs in the (M, S, a) = (1, 1, 0) metric, i.e., total multiplication counts. Five green curves: minimum, first quartile, median, third quartile, and maximum of 65 per-key distributions of multiplication counts in the last 255 experiments. Five blue curves: minimum, first quartile, median, third quartile, and maximum of 65 per-key distributions of multiplication counts in all 16383 experiments. Red curve: distribution of multiplication counts in all experiments across all keys.

Figure 2: Simulation of Figure 1. The red curve is copied from Figure 1. The green and blue curves replace the underlying per-key data with random samples from the red curve.

Figure 3: Same as Figure 1, but replacing multiplication counts with cycle counts scaled by 10^8.

Figure 4: For p = 419, distributions of orders of points [4]T_0 in E_A(F_p). Red curve: the uniform model, choosing T_0 uniformly at random from E_A(F_p). Five green curves: minimum, first quartile, median, third quartile, and maximum of the per-A distributions when T_0 is output by Elligator.

Table 1: Results of searches, for various choices of B, for CTIDH parameters with at least 2^256 keys for the CSIDH-512 prime. See text for description.