Faster Montgomery and double-add ladders for short Weierstrass curves

The Montgomery ladder and Joye ladder are well-known algorithms for elliptic curve scalar multiplication with a regular structure. The Montgomery ladder is best known for its implementation on Montgomery curves, which requires 5M + 4S + 1m + 8A per scalar bit, and 6 field registers. Here (M, S, m, A) represent respectively field Multiplications, Squarings, multiplications by a curve constant, and Additions or subtractions. This ladder is also complete, meaning that it works on all input points and all scalars. Many protocols do not use Montgomery curves, but instead use prime-order curves in short Weierstrass form. These have historically been much slower, with ladders costing at least 14 multiplications or squarings per bit: 8M + 6S + 27A for the Montgomery ladder and 8M + 6S + 30A for the Joye ladder. In 2017, Kim et al. improved the Montgomery ladder to 8M + 4S + 12A + 1H per bit using 9 registers, where the H represents a halving. Hamburg simplified Kim et al.'s formulas to 8M + 4S + 8A + 1H per bit using 6 registers. Here we present improved formulas which compute the Montgomery ladder on short Weierstrass curves using 8M + 3S + 7A per bit, and requiring 6 registers. We also give formulas for the Joye ladder that use 9M + 3S + 7A per bit, requiring 5 registers. One of our new formulas supports very efficient 4-way vectorization. We also discuss curve invariants, exceptional points, side-channel protection and how to set up and finish these ladder operations. Finally, we show a novel technique to make these ladders complete when the curve order is not divisible by 2 or 3, at a modest increase in cost.


Introduction and related work
The core operation of most elliptic curve cryptography algorithms is scalar multiplication, in which an element P_0 of an elliptic curve group is multiplied by an integer ("scalar") k. Considerable study has been devoted to optimizing scalar multiplication algorithms.
This paper is mainly concerned with variable-base scalar multiplication algorithms, meaning that P_0 is not known ahead of time, so no precomputation has been done on it. Typically k is secret. To avoid side-channel attacks, the algorithm should be regular, meaning that its timing and control flow should not depend on k.

The Montgomery and Joye ladders
The Montgomery ladder [Mon87] is a general algorithm for computing a power or scalar multiple of a group element. This algorithm's regular structure is conducive to implementations that resist side-channel attack. This technique is fastest on elliptic curves in the Montgomery form By^2 = x(x^2 + Ax + 1). But it can be efficiently applied on any elliptic curve, in particular those in the short Weierstrass form y^2 = x^3 + ax + b.
Joye's double-add ladder [Joy07] (hereafter called the Joye ladder) is a similar algorithm, which is also used in efficient implementations of elliptic curve scalar multiplication. The Joye ladder can be viewed as the dual of the Montgomery ladder [Wal17]. Both ladders apply a sequence of linear transformations over the group. The transformations applied by the Joye ladder are adjoint to those in the Montgomery ladder, and are applied in the opposite order.
Each algorithm takes as input a group element P_0 and a scalar k, and computes k · P_0. Let k have a binary representation k := Σ_{i=0}^{n} 2^i k_i where each k_i ∈ {0, 1}.
The two ladder algorithms are typically presented as starting with the state (O, P_0), where O is the identity element (or neutral point) of the curve. However, the neutral point is at infinity, which causes problems for some formulas. So in this paper we use a variant which rewrites k into an equivalent scalar, such that the first bit (the most- or least-significant bit for the Montgomery or Joye ladder, respectively) is 1. For the Joye ladder, this requires that the order q of the curve (or at least the order of P_0) is odd. The ladder then starts after the first step, when the state no longer contains the neutral point. These algorithms are shown in Figure 1.
While the Montgomery ladder can be implemented using only Q and R, many implementations also use the base point P := R − Q = P_0. Likewise, implementations of the Joye ladder often track a third point R := P + Q = 2^i · P_0 in the ith iteration. In these cases, the Montgomery and Joye ladder update steps are rearrangements of each other: both map triples of points (P, Q, R) to (P, P + Q, 2 · R) in some order. This often makes it possible to convert between formulas for the Montgomery ladder and those for the Joye ladder.
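As a rough illustration of the two update rules, the following Python sketch runs both ladders over an arbitrary group supplied as an `add` function and an `identity` element. It is only a sketch under stated assumptions: it uses the textbook starting state (O, P_0) rather than the rewritten-scalar variant of Figure 1, it branches on the scalar bits and is therefore not regular, and the names `montgomery_ladder`, `joye_ladder`, `add_mod` and the toy group of integers modulo 101 are illustrative choices rather than anything defined in this paper.

```python
def montgomery_ladder(k, P, add, identity):
    """Left-to-right ladder; keeps the invariant R - Q == P."""
    Q, R = identity, P
    for bit in bin(k)[2:]:                 # most-significant bit first
        if bit == "1":
            Q, R = add(Q, R), add(R, R)    # (Q, R) -> (Q + R, 2R)
        else:
            Q, R = add(Q, Q), add(Q, R)    # (Q, R) -> (2Q, Q + R)
    return Q


def joye_ladder(k, P, add, identity):
    """Right-to-left double-add ladder; after i steps R[0] + R[1] == 2**i * P."""
    R = [identity, P]
    for i in range(k.bit_length()):        # least-significant bit first
        b = (k >> i) & 1
        # R[1-b] <- 2*R[1-b] + R[b]
        R[1 - b] = add(add(R[1 - b], R[1 - b]), R[b])
    return R[0]


# Toy group (an assumption for illustration): integers modulo 101 under addition.
N = 101


def add_mod(u, v):
    return (u + v) % N


if __name__ == "__main__":
    for k in (0, 1, 2, 37, 100, 255):
        assert montgomery_ladder(k, 5, add_mod, 0) == (5 * k) % N
        assert joye_ladder(k, 5, add_mod, 0) == (5 * k) % N
```

A real implementation would replace the `if` on the scalar bit with conditional swaps so that control flow does not depend on k.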
However, the converted formulas aren't always equally performant or side-channel resistant. They also may need adjustment if the three points don't use the same representation in the ladder state. For example, the base point never changes in the Montgomery ladder, so it might be stored in affine coordinates; or the ladder might track the y-coordinates of some points but not others.
For most ladder formulas, the fastest version has a state containing enough information to recover the x-coordinate of each point, but not the y-coordinate. As a consequence, the formulas work in x-only protocols such as elliptic curve Diffie-Hellman, where only the x-coordinate of P is given and only the x-coordinate of the output is required.
Additional work is required to recover y. For the Montgomery ladder, the state contains points (Q, R) such that R − Q = P_0, and possibly a representation of P_0 itself. If P_0 is given as (x, y) then enough information is present to efficiently solve for the output y-coordinate as well [LD99]. For an n-iteration Joye ladder, the final state instead contains R = 2^{n+1} P_0. If R is known (e.g. if P_0 is a fixed generator of the curve and R has been precomputed) then y can be recovered in the same way. Otherwise, in order to recover y, the ladder must track additional information such as a Z-coordinate, typically at a cost of 1-2 registers and 1M per bit.
This pattern holds for the ladder formulas we present here. For the Montgomery ladder, minimal additional work is required to recover y; only the y-coordinate of P_0 must be remembered. For the Joye ladder, if y is needed and R hasn't been precomputed, then additional work is required to track the Z-coordinate.

Co-Z coordinates
Elliptic curve implementations typically calculate using a projective version of the elliptic curve for efficiency. Instead of storing (x, y), points are represented in either projective coordinates as (xZ : yZ : Z) or in Jacobian coordinates as (xZ^2 : yZ^3 : Z), where Z is an arbitrary scaling factor. This avoids costly finite-field divisions except at the end of the computation. When storing multiple points in a ladder, the straightforward way is to use a separate Z-coordinate for each point.
Co-Z formulas [Mel07, GJM10, GJM+11] instead use the same Z-coordinate for all points in the ladder state. This reduces memory usage. It can also improve performance, because often the first step in a point addition is to rescale both points to have the same Z-coordinate, and that step is not needed for co-Z representations. In some cases, the common Z-coordinate doesn't need to be calculated or stored, since having the scaled X and Y coordinates for multiple points is enough information to solve for Z.
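To make the scaling concrete, here is a small Python sketch of Jacobian coordinates and of the rescaling step that co-Z representations avoid. The prime 97, the sample coordinates and all function names are assumptions chosen for illustration; the points are treated as bare coordinate pairs and are not required to lie on any particular curve.

```python
P_MOD = 97  # toy prime field, chosen only for illustration


def to_jacobian(x, y, z):
    """Represent the affine point (x, y) as (x*Z^2 : y*Z^3 : Z)."""
    return (x * z * z % P_MOD, y * z ** 3 % P_MOD, z % P_MOD)


def to_affine(X, Y, Z):
    """Undo the scaling with a single field inversion."""
    z_inv = pow(Z, -1, P_MOD)
    return (X * z_inv ** 2 % P_MOD, Y * z_inv ** 3 % P_MOD)


def rescale(point, lam):
    """Multiply Z by lam without changing the affine point it represents."""
    X, Y, Z = point
    return (X * lam ** 2 % P_MOD, Y * lam ** 3 % P_MOD, Z * lam % P_MOD)


def make_co_z(p1, p2):
    """Bring two Jacobian points to the common denominator Z1*Z2."""
    return rescale(p1, p2[2]), rescale(p2, p1[2])


if __name__ == "__main__":
    a = to_jacobian(3, 6, 5)
    b = to_jacobian(10, 20, 7)
    a2, b2 = make_co_z(a, b)
    assert a2[2] == b2[2]              # shared Z-coordinate
    assert to_affine(*a2) == (3, 6)    # affine values are unchanged
```

A co-Z ladder keeps its points in the form returned by `make_co_z` at all times, so the rescaling cost is never paid inside the loop.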
Most formulas on short Weierstrass curves need special cases around the curve's neutral point O, which is written with Z = 0. This is especially true for co-Z formulas, because if one point has Z = 0 then they all do; the finite points would then be represented as (0 : 0 : 0), which is indeterminate. Thus, co-Z ladders always start with a nonzero state, as shown in Figure 1.

The Kim et al. formulas
In 2017, Kim et al. published a variant of the Montgomery ladder with "on-the-fly adaptive coordinates" [KCK+17] defined by (X_1, X_2, K, L, A, S, T) and a bit b, where: […]. Note that these coordinates are scaled by powers of a Z-coordinate, but do not include Z itself. The Kim et al. formulas improved the previous state of the art of 9M + 5S + 18A per bit [GJM+11] to 8M + 4S + 12A + 1H per bit, using 9 registers which each hold one field element. The formulas are quite complex, with each operation depending on two bits of the key instead of one.
Hamburg presented a simplified and optimized version of Kim et al.'s main ladder formula at the CHES 2017 rump session [Ham17]. These formulas use the fact that the points in a ladder state or their negations lie on a line y = m(x − x_0) + y_0. They use modified Jacobian co-Z coordinates of the form (3X_0, 2Y_0, X_1 − X_0, X_2 − X_0, 2M) where […]. Hamburg's formulas require 8M + 4S + 8A + 1H per bit, and 6 field registers.

Our contribution
In Section 2.1 we give two new formulas for the Montgomery ladder. Both use 8M + 3S + 7A and 6 field registers. The formula in Figure 3 is closely related to Hamburg's rump session formula. It keeps a ladder state of (X_QP, X_RP, Y_P, M) where X_QP = (x_Q − x_P) · Z^2 and likewise for X_RP. Thus, it drops the X_P coordinate from [Ham17], which is called X_0 in that presentation. In addition to improving performance, this change gives an improvement in resistance to side-channel attacks, as discussed in Section 4. An alternative formula in Figure 4 (see footnote 1) instead tracks (X_QP, X_RP, X_RQ^2, Y_Q, Y_R). This formula has the same performance as the one in Figure 3 except that it requires an extra conditional swap for Y_Q and Y_R. It can be parallelized efficiently over four multiplication units instead of three, making it comparable to recent vectorized implementations of the Montgomery ladder on Montgomery curves [HEY20, NS20].
In Section 2.2 we show a Joye ladder formula, based on our first Montgomery ladder formula, using 9M + 3S + 7A and 5 registers. We are not aware of any previous Montgomery or Joye ladder formula that requires only 5 field registers.
Counting registers is somewhat tricky. Our register counts are given for x-only scalar multiplication; recovering the y-coordinate requires one more register, and for the Joye ladder, also additional computation. They do not count the scalar itself, the curve constants, or small constants such as 2 and 3. They also assume that the processor supports "multiplication in place" (X ← X · Y) and, for the complete Joye formulas, "reverse subtraction in place" (X ← Y − X). If it does not support these operations, an additional register is required.
In Sections 2.3 and 2.4 we show how to set up and finalize the ladder state for either x-only or (x, y) calculations. For the Joye ladder, (x, y) calculations generally require an extra M per bit to track the Z coordinate. We also give invariants on the ladder state and a discussion of side-channel protection.
See Figure 2 for a comparison of our new formulas to past work. They are still not as fast as the Montgomery ladder on Montgomery curves [Mon87]: approximating 1S ≈ 0.75M, 1m ≈ 0.25M, 1A ≈ 1H ≈ 0.1M (see footnote 2) gives an estimate of 21% more compute time per scalar bit, excluding the final division. This is an improvement from [Ham17] and [Riv11], which use respectively 31% and 61% more compute time per bit than [Mon87]. Our new formulas also support the usual S − M tradeoffs at the cost of extra additions and registers, but we present them in their simplest and most compact form.
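The percentages quoted above can be reproduced with a few lines of arithmetic. The Python sketch below applies the stated weights (1S ≈ 0.75M, 1m ≈ 0.25M, 1A ≈ 1H ≈ 0.1M) to the per-bit operation counts given in the abstract; the dictionary layout and function names are illustrative assumptions, and [Riv11] is omitted because its operation counts are not stated in the text.

```python
# Weights relative to one field multiplication, as stated in the text.
WEIGHTS = {"M": 1.0, "S": 0.75, "m": 0.25, "A": 0.1, "H": 0.1}

# Per-bit operation counts quoted in the abstract.
LADDERS = {
    "Montgomery curve [Mon87]": {"M": 5, "S": 4, "m": 1, "A": 8},
    "Kim et al. [KCK+17]": {"M": 8, "S": 4, "A": 12, "H": 1},
    "Hamburg [Ham17]": {"M": 8, "S": 4, "A": 8, "H": 1},
    "new Montgomery ladder": {"M": 8, "S": 3, "A": 7},
    "new Joye ladder": {"M": 9, "S": 3, "A": 7},
}


def cost_in_m(ops):
    """Weighted cost of one ladder step, in units of field multiplications."""
    return sum(WEIGHTS[op] * count for op, count in ops.items())


def overhead_percent(name, baseline="Montgomery curve [Mon87]"):
    """Extra compute time per bit relative to the baseline, in percent."""
    return 100 * (cost_in_m(LADDERS[name]) / cost_in_m(LADDERS[baseline]) - 1)


if __name__ == "__main__":
    for name in LADDERS:
        print(f"{name:28s} {cost_in_m(LADDERS[name]):6.2f} M  "
              f"{overhead_percent(name):+6.1f} %")
```

Under these weights the baseline costs 9.05M per bit and the new Montgomery ladder 10.95M, which is where the 21% figure comes from.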
Our new Montgomery ladder is faster even than a variable-time non-adjacent form (NAF) algorithm with a = −3 [HMV06], by about 4% using the above estimates. More S − M tradeoffs are known for Jacobian operations, so NAF would still be faster with sufficiently high M : S : A ratios.
Footnote 1. A preprint of this paper presents the formulas in Figure 4 as tracking (X_QP, X_RP, M, M') where M' is an additional slope variable. The present version performs the same calculations, but has a different boundary between the end of one iteration and the beginning of the next. We have chosen to use the present version because it is more similar to the calculations in the rest of the paper.
Footnote 2. More or less arbitrarily, but this is typical of an IoT implementation, such as a 256-bit curve on a 32-bit processor. This is where the M : S : A ratio is most relevant: for lightweight implementations, memory usage is more important, and for high-performance implementations, parallelism is more important. Note that with these cost estimates, the typical tradeoff […]

Figure 2 (comparison of our new formulas to past work). Notes:
• † These ladders are complete, at least for a subset of curves.
• * The Kim et al. ladder is not written using conditional swaps, and we did not attempt to convert it.
• ‡ Binary NAF is included for comparison purposes only, since unlike the other algorithms in the table it isn't regular or constant-time. The expected costs per bit are listed. Here we assume that the curve constant a = −3, and that the doubling formula is improved beyond [HMV06] to 4M + 4S + 7A + 1H using the identity (3/2)x = x + (1/2)x.
• The costs are given in Multiplications, Squarings, Additions or subtractions, multiplications by a curve constant, Halvings and Comparisons. Multiplication by numbers ≤ 4 is decomposed into additions.
• M_n, S_n and A_n mean the cost of at most n multiplications, squarings or additions in parallel. Parallel register counts assume that these operations can be done in place.
• The costs do not include setup or finalization. Setup may include an on-curve check and finalization always includes a division. Division typically costs between 1S/bit and (1S + 1M)/bit, depending on the modulus and on memory constraints.
• The register counts for our new formulas are also sufficient for setup and finalization, using the technique from Section 2.4.2.
• These costs are for x-only ladders; recovering y requires extra storage. It also costs an extra 1M/bit for our Joye ladders but not our Montgomery ladders.

The formulas in Section 2.1 and Section 2.2 are not complete: they break down if the neutral point appears in the ladder state, but not in any other case. In Section 3 we give an analysis of this problem and a novel solution, which is not specific to our formulas: it potentially allows other ladders to implement complete scalar multiplication at a modest performance cost. These formulas are given in the supplementary material.

Ladder Formulas
Our ladder formulas work on short Weierstrass curves over large-characteristic fields. They are derived from the following theorem:
Theorem 1 (Ladder formulas with differences of x-coordinates). Let (P, Q, R) be the state of the Montgomery ladder on an elliptic curve y^2 = x^3 + ax + b defined over a field of characteristic other than 2. The three points (P, Q, −R) lie on a line with slope m := (y_Q − y_P)/(x_Q − x_P). Let (P, S, T) be the state after a ladder operation, where (P, S, −T) lie on a line of slope m'. Let […]
Proof. Deferred to Appendix A.
Theorem 1 may seem somewhat unintuitive, and in fact was reverse engineered from the new formulas, rather than being the motivation for them. The formulas are improvements of [Ham17], which are derived from [KCK+17]. However, the new formulas are quite different from [KCK+17], so the motivation for that work likely does not apply here.
The formulas in Theorem 1 are compatible with Jacobian coordinates: if (m, x, y) are replaced by (mZ, xZ^2, yZ^3) then the outputs will also be in that form, with the same Z. Note that to avoid extra additions and subtractions, in some cases it is more efficient to calculate with the variables m and y multiplied by a small constant, such as −1, 2 or −2.

Formulas for the Montgomery ladder
Theorem 1 gives a straightforward strategy to implement the Montgomery ladder. We begin with Jacobian versions of x_Q − x_P, x_R − x_P, y_P and m. Because −R lies on the line through P with slope m, we can calculate −y_R = y_P + m · (x_R − x_P), and then follow Theorem 1 to get x_S − x_P, x_T − x_P and m'. Finally, y_P stays the same, but with a new Jacobian denominator Z.
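The two geometric facts used by this strategy, namely that −R lies on the line through P and Q and that x_P + x_Q + x_R = m^2, can be checked numerically on a small example. The Python sketch below does so in affine coordinates; the toy curve y^2 = x^3 + 2x + 3 over F_97 and the helper names `ec_add` and `check_line` are assumptions made for illustration and do not appear in this paper.

```python
P_MOD, A, B = 97, 2, 3   # toy curve y^2 = x^3 + 2x + 3 over F_97 (illustrative)


def ec_add(P, Q):
    """Affine addition; None stands for the neutral point O."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None
    if P == Q:
        m = (3 * x1 * x1 + A) * pow(2 * y1, -1, P_MOD) % P_MOD
    else:
        m = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (m * m - x1 - x2) % P_MOD
    return (x3, (m * (x1 - x3) - y1) % P_MOD)


# All finite points of the toy curve, found by brute force.
POINTS = [(x, y) for x in range(P_MOD) for y in range(P_MOD)
          if (y * y - (x ** 3 + A * x + B)) % P_MOD == 0]


def check_line(P, Q):
    """For R = P + Q with x_P != x_Q, test the two identities used by the ladder."""
    (xP, yP), (xQ, yQ) = P, Q
    xR, yR = ec_add(P, Q)
    m = (yQ - yP) * pow((xQ - xP) % P_MOD, -1, P_MOD) % P_MOD
    on_line = (-yR - (yP + m * (xR - xP))) % P_MOD == 0   # -R lies on the line
    x_sum = (xP + xQ + xR - m * m) % P_MOD == 0           # x_P + x_Q + x_R = m^2
    return on_line and x_sum


if __name__ == "__main__":
    pairs = [(P, Q) for P in POINTS[:10] for Q in POINTS[:10] if P[0] != Q[0]]
    assert all(check_line(P, Q) for P, Q in pairs)
```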
In Jacobian coordinates, the state variables will be

X_QP = (x_Q − x_P) Z^2, X_RP = (x_R − x_P) Z^2, Y_P = 2 y_P Z^3, M = mZ.

For most curves, the value of Z need not be represented; see Section 2.4. On each step, we calculate the negated y-coordinate Ȳ_R := −2 y_R Z^3 = Y_P + 2M X_RP. Then Z will be multiplied by a local denominator z := Ȳ_R · (X_QP − X_RP). We can then easily compute rz, sz, tz and mz, from which the rest of the terms follow homogeneously. An optimized implementation is shown in Figure 3.
It is also possible to perform a Montgomery ladder whose state incorporates Y_Q and Y_R, instead of Y_P and M. The ladder state comprises (X_QP, X_RP, G, Y_Q, Y_R), where G = X_RQ^2. Here G is included in the ladder state just to avoid an extra subtraction from recomputing X_RP − X_QP at the beginning of the ladder step. This formula also requires 8M + 3S + 7A per bit, and is shown in Figure 4.
This ladder's multiplications can be parallelized 4 ways, which improves performance on vector processors. With the minor changes shown in Figure 4's notes, the cost rises to 9M + 2S + 8A, but the additions can also be parallelized. Therefore on a parallel machine it can be implemented in 3M_4 + 3A_3 per bit, where M_4 and A_3 are the cost of 4 parallel multiplications and 3 parallel additions, respectively.

Formulas for the Joye ladder
For the Joye ladder, the same outline works, but we are conditionally swapping (x_P, y_P) ↔ (x_Q, y_Q) instead of (x_Q, y_Q) ↔ (x_R, y_R). The x-coordinates are easily rearranged to support this by tracking X_RP := X_R − X_P and X_RQ := X_R − X_Q. For y-coordinates, we now need to track both y_P and y_Q. Conveniently, we can use both coordinates to compute x_S − x_P = −tu from Theorem 1. The Joye ladder state is (X_RP, X_RQ, Y_P, Y_Q, M). An optimized Joye ladder is shown in Figure 5.

Ladder setup
Figure 3: Montgomery ladder. Input: ladder state (X_QP, X_RP, M, Y_P).
Notes:
• This formula uses 8M + 3S + 7A and two temporary registers for a total of 6. Here M, S, A represent the costs of field multiplication, squaring, and addition or subtraction, respectively.
• […], where H is the cost of a halving, and costs no extra registers.
• The multiplications can easily be parallelized over 2 or 3 units.
• If Z is to be tracked, it should be multiplied by F.

Figure 4: Montgomery ladder. Input: ladder state (X_QP, X_RP, G, Y_Q, Y_R). Output: ladder state (X_SP, X_TP, G', Y_S, Y_T).
Notes:
• This formula uses 8M + 3S + 7A and 6 field registers. Here M, S, A represent the costs of field multiplication, squaring, and addition or subtraction, respectively.
• The formula can be parallelized over 2, 3 or 4 multiplication units.
• This formula may be rearranged to compute […] instead of lines 8-11. This uses 9M + 2S + 8A, and allows the multiplications to be parallelized 4 ways and the additions 3 ways, for a total latency of 3M_4 + 3A_3 per bit.

Figure 5: Joye ladder. Input: ladder state (X_RP, X_RQ, Y_P, Y_Q, M).
Notes:
• This formula uses 9M + 3S + 7A, and no temporary registers for a total of 5. Here M, S, A represent the costs of field multiplication, squaring, and addition or subtraction, respectively.
• […] This changes the cost to 8M + 4S + 10A + 1H, and requires one extra register for a total of 6.
• If Z is to be tracked, it should be multiplied by X_RQ · Y_R.
• The calculation of Y_R can be moved to the end of the round, so that Y_R is a state variable instead of M.
• The multiplications can easily be parallelized over 2 or 3 units.

The initial state of the ladder encodes the points P = (x_P, y_P) and R = 2P = (x_R, y_R). Let (x, y) lie on the elliptic curve y^2 = x^3 + ax + b. We need to compute (x_R − x_P)Z^2, 2y_P Z^3 and mZ, where m = (3x_P^2 + a)/(2y_P) is the slope of the tangent at P. Since x_R + 2x_P = m^2, we have (x_R − x_P)Z^2 = (mZ)^2 − 3x_P Z^2. Setting Z = 2y_P, we get

M = mZ = 3x_P^2 + a,
X_RP = (x_R − x_P)Z^2 = M^2 − 12 x_P y_P^2,
Y_P = 2y_P Z^3 = 16 y_P^4,

where y_P^2 = x_P^3 + a x_P + b.
Since these formulas depend on y_P^2 rather than y_P, they still work if y_P is not given, which is common for elliptic curve Diffie-Hellman protocols. However, if the elliptic curve is not twist-secure, then the implementation must check that the putative Z^2 is actually square. Otherwise the ladder will still work, but with arithmetic on the curve's quadratic twist. If power analysis is not a concern, or if the twist has no small subgroups, then the check can be implemented in a batch with the final division [Ham12].
The ladder setup can easily be accomplished in 5 registers, including the check that Z^2 is actually square. So the setup routine doesn't increase the memory footprint of either implementation.
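The setup relations can be sanity-checked against ordinary affine doubling. The following Python sketch computes (x_R − x_P)Z^2, 2y_P Z^3 and mZ from x_P alone, using Z = 2y_P and the curve equation to supply y_P^2. The toy curve over F_97 and the names `ladder_setup` and `affine_double_x` are illustrative assumptions, and the sketch omits the squareness check on Z^2.

```python
P_MOD, A, B = 97, 2, 3   # toy curve y^2 = x^3 + 2x + 3 over F_97 (illustrative)


def ladder_setup(x):
    """Initial (X_RP, Y_P, M) for P = (x, y), R = 2P, with Z = 2y; uses only y^2."""
    y2 = (x ** 3 + A * x + B) % P_MOD
    M = (3 * x * x + A) % P_MOD            # m*Z, since m = (3x^2 + a)/(2y)
    Z2 = 4 * y2 % P_MOD                    # Z^2 = (2y)^2
    X_RP = (M * M - 3 * x * Z2) % P_MOD    # (x_R - x_P) * Z^2
    Y_P = 16 * y2 * y2 % P_MOD             # 2*y*Z^3 = 16*y^4
    return X_RP, Y_P, M


def affine_double_x(x, y):
    """Reference computation: x-coordinate of 2P and the tangent slope m."""
    m = (3 * x * x + A) * pow(2 * y, -1, P_MOD) % P_MOD
    return (m * m - 2 * x) % P_MOD, m


if __name__ == "__main__":
    # Pick the first curve point with y != 0 by brute force.
    x, y = next((u, v) for u in range(P_MOD) for v in range(1, P_MOD)
                if (v * v - (u ** 3 + A * u + B)) % P_MOD == 0)
    X_RP, Y_P, M = ladder_setup(x)
    xR, m = affine_double_x(x, y)
    Z = 2 * y % P_MOD
    assert X_RP == (xR - x) * Z * Z % P_MOD
    assert M == m * Z % P_MOD
    assert Y_P == 2 * y * Z ** 3 % P_MOD
```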

Simple technique
To complete the ladder, we must recover the final x_Q, and possibly also y_Q, from the ladder state. In the Montgomery ladder, if the original coordinates (x_P, y_P) are retained and nonzero, this is easy. We have both y_P Z^3 = Y_P/2 and […]. We can then recover the final point […]. Likewise, if only the original x_P is retained, we can calculate x_Q using 1/Z^2 = x_P/(x_P Z^2). This technique naturally works for the Montgomery ladder, but not the Joye ladder because the initial point P_0 isn't part of the ladder state. However, it can be applied to the Joye ladder if the final point R = 2^{n+1} P_0 is precomputed, where n is the number of ladder steps. This is convenient to do if P_0 is a standard generator on the curve, or perhaps a frequently-used static public key.
However, the simple finalization technique requires remembering x_P. It also doesn't work if x_P = 0, which can happen on certain curves [AT03], so we will propose an improved technique. The improved technique doesn't work with curves of j-invariant 0 or 1728. The most popular curve with j-invariant 0, the SECG curve secp256k1, has no points with x = 0 or y = 0, so the simple technique can be used for the Montgomery ladder on that curve.

Improved technique
We want to avoid incompleteness when the starting point has xy = 0. We also need an alternative technique for the Joye ladder: the point P in the ladder isn't the base point P_0, but instead on the nth step the ladder has R = 2^{n+1} P_0. Unless that point has been precomputed, we cannot take advantage of a known point to determine Z. Likewise, low-memory implementations of the Montgomery ladder may wish to discard the base point.
If the curve doesn't have j-invariant 0 (meaning that a = 0) or 1728 (meaning that b = 0), then we have an improved technique to recover Z^2, which allows us to recover x_Q but not y_Q. Let c := y_P − m x_P be the y-intercept of the line connecting (P, Q, −R). Then (x_P, x_Q, x_R) are the roots of

x^3 + ax + b − (mx + c)^2 = 0.

Rearranging, we get

a = x_P x_Q + x_P x_R + x_Q x_R + 2mc,
b = c^2 − x_P x_Q x_R.

Note that in the ladder state, the values of m, x_i and c are scaled by Z, Z^2 and Z^3 respectively, so these formulas compute A := aZ^4 and B := bZ^6 instead. This allows us to calculate 1/Z^2 = Ab/(aB), which is enough to calculate x_Q but not y_Q. The division will be 0/0 if and only if a = 0, b = 0 or Z = 0.
For the Montgomery ladder, if y_P is available we can compute 1/Z = (a B y_P)/(A b · y_P Z^3) and thus recover the y-coordinate of the output. This would fail if y_P = 0, but in that case even the starting state of the ladder contains the point at infinity.
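The recovery of 1/Z^2 from scaled quantities can be illustrated numerically. The Python sketch below scales three collinear points by an arbitrary Z, recomputes A = aZ^4 and B = bZ^6 from the scaled values via the coefficient relations for the cubic x^3 + ax + b − (mx + c)^2, and then recovers x_Q. It assumes a toy curve over F_97 and works directly with scaled x-coordinates; an actual ladder state stores differences of x-coordinates, so obtaining these inputs from it is not shown. All function names are hypothetical.

```python
P_MOD, A, B = 97, 2, 3   # toy curve y^2 = x^3 + 2x + 3 over F_97 (illustrative)


def chord_add(P, Q):
    """Affine P + Q for x_P != x_Q; also returns the chord slope m."""
    (x1, y1), (x2, y2) = P, Q
    m = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (m * m - x1 - x2) % P_MOD
    return (x3, (m * (x1 - x3) - y1) % P_MOD), m


def scaled_state(P, Q, Z):
    """Jacobian-scaled values (x_i*Z^2, m*Z, c*Z^3); Z itself is not kept."""
    R, m = chord_add(P, Q)
    c = (P[1] - m * P[0]) % P_MOD                 # y-intercept of the line
    xs = [pt[0] * Z * Z % P_MOD for pt in (P, Q, R)]
    return xs, m * Z % P_MOD, c * Z ** 3 % P_MOD


def recover_inv_z2(xs, mZ, cZ3):
    """Return (a*Z^4, b*Z^6, 1/Z^2) computed from scaled values only."""
    XP, XQ, XR = xs
    A_s = (XP * XQ + XP * XR + XQ * XR + 2 * mZ * cZ3) % P_MOD
    B_s = (cZ3 * cZ3 - XP * XQ * XR) % P_MOD
    inv_z2 = A_s * B * pow(A * B_s, -1, P_MOD) % P_MOD   # A*b / (a*B)
    return A_s, B_s, inv_z2


if __name__ == "__main__":
    pts = [(x, y) for x in range(P_MOD) for y in range(1, P_MOD)
           if (y * y - (x ** 3 + A * x + B)) % P_MOD == 0]
    P = pts[0]
    Q = next(pt for pt in pts if pt[0] != P[0])
    xs, mZ, cZ3 = scaled_state(P, Q, Z=5)
    A_s, B_s, inv_z2 = recover_inv_z2(xs, mZ, cZ3)
    assert xs[1] * inv_z2 % P_MOD == Q[0]         # x_Q recovered without knowing Z
```

Because A_s = aZ^4 and B_s = bZ^6, the same values also satisfy A_s^3 b^2 = a^3 B_s^2 for every Z, which can serve as a consistency check on the state.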
If only the x-coordinate of the input point is given, it is possible that the initial y isn't in the base field F but instead in a quadratic extension. In this case, y and Z are in the extension, but the values actually computed by the setup routine (y^2, Z^2 and yZ^3) are still in F. Because Y = 2yZ^3, the true value of y is in F if and only if this putative Z^2 is actually square. Its Jacobi symbol can be checked during the inversion at little extra cost using the batching technique in [Ham12]. This means that if power analysis is not a concern, or if the curve is twist-secure, then the initial on-curve check can be deferred until finalization.
With careful use of the identity x_P + x_Q + x_R = m^2, the improved finalization step can also be calculated in 5 registers, including the invariant and Jacobi symbol checks. This is shown in the supplementary material. As a result, using our Joye ladder formulas it is possible to calculate an entire x-only variable-base scalar multiplication using only 5 mutable field registers, plus the scalar and the curve constants a and b.

Tracking Z
It is also possible to track the Z coordinate through the ladder steps, and then to invert it in the finalization step. The starting Z-coordinate is 2y_P, which is known if the initial point is given with a y-coordinate. This is simplest for the Montgomery ladder formula in Figure 3. In each ladder step Z is multiplied by the intermediate value F, so this technique costs 1M per bit and one register to store Z.
Likewise, for the Joye ladder formulas in Section 2.2, in each ladder step Z is multiplied by X_RQ · Y_R. This can be computed as an intermediate value, but doing so requires 6 registers instead of 5, plus one register to hold Z for a total of 7. So tracking Z costs either 1M and 2 registers, or 2M and one register.
For the parallel Montgomery ladder formulas in Figure 4, Z is multiplied by X_TS · Y_R, which does not appear as an intermediate. Therefore tracking Z would cost 2M and one register. However, the formulas can be rewritten to track Z^2 instead at a cost of 1M and one register; see the notes in that figure.
If the initial y-coordinate is not given, then in the starting state Z is not known, so we could instead use two registers to track Ẑ := Z/y_P (which begins at 2), and to remember the starting y_P^2. At the end, we can calculate Z^2 = Ẑ^2 · y_P^2; this suffices to recover x_Q but not y_Q. Or, we could track Z^2 directly. This saves a register but costs an additional 1S per bit, except for the parallel Montgomery ladder where tracking Z^2 is cheaper.
Overall, tracking Z is more expensive, but it should not be necessary for the Montgomery ladder unless the curve has j ∈ {0, 1728} and also has a point with x = 0. Tracking Z is necessary for the Joye ladder if the y-coordinate of the output is required, or if j ∈ {0, 1728}, unless x_R and/or y_R have been precomputed.

Ladder state invariants
The improved technique's formulas for A = aZ^4 and B = bZ^6 also serve as a ladder state invariant: we must have

A^3 b^2 = a^3 B^2.

This equation can be checked during finalization, or periodically between ladder steps, as a fault attack countermeasure. If Z or Z^2 is tracked, then the stronger conditions A = aZ^4 and B = bZ^6 can be checked instead.

Completeness of the Montgomery and Joye formulas
Our new ladder formulas are complete unless they encounter the curve's neutral point O.
Recall that the ladder operation computes […]. The Montgomery and Joye ladders will resolve to 0/0 if the (possibly untracked) Z-coordinate ever becomes 0. In each iteration, […]. This condition cannot begin with Q = R, because then P = R − Q would already be the neutral point and we would already have Z = 0.
If the curve has odd prime order q and P_0 ≠ O, and if the minimum number of steps (⌈log_2 q⌉) is used, then the neutral point cannot be reached until the last two ladder steps. This is because in the Montgomery ladder, the ladder state is (P, Q, R) = (P_0, k_i P_0, (k_i + 1)P_0) for k_i an initial segment of k, which has i + 1 bits after i iterations. Likewise in the Joye ladder, the ladder state is […], where k_i and 2^{i+1} − k_i also have i + 1 bits after i iterations. So for any of these values to be a multiple of q, at least ⌈log_2 q⌉ − 1 iterations (i.e. the last or second-last iteration) must have passed. A trivial case analysis on k_i shows that the dangerous scalars for the Montgomery ladder are {−2, −1, 0, 1} mod q, and for the Joye ladder they are {−2^n, 0, 2^n, 2^{n+1}} mod q. So a complete implementation in this case can focus on either the last two ladder steps, or on avoiding those 4 problematic scalars.
Overall, the entire operation will be correct when the ladder never reaches the neutral point, and one of the following finalization techniques is used:
• The simple technique in Section 2.4.1 for the Montgomery ladder on curves with no point of the form (0, y).
• The improved technique in Section 2.4.2 on curves with j-invariant neither 0 nor 1728.
• The Z-tracking technique in Section 2.4.3.
In turn, the probability that the neutral point is reached is negligible if all of the following conditions are met:
• The curve's order q is a large prime.
• The initial point P_0 isn't the neutral point.
• The scalar is uniformly random mod q, or at least has sufficiently high min-entropy.
• If the scalar is represented using more bits than the curve order, then the representation's initial segments of length at least log_2 q − 2 must also have high min-entropy.
• Either the twist also has prime order, or the calculation begins with an on-curve check.
If power analysis isn't a concern, then reaching the neutral point on the twist may not matter, so long as the result is rejected by a timing-invariant check at the end of the calculation.

Avoiding the neutral point
Some applications do not meet the above criteria. For example, the scalar might not be random, or the curve's order might not be prime. In this section, we will show a technique which avoids the neutral point when the curve's order (and its twist's order, if there is no on-curve check) is not divisible by 2 or 3. This technique is generally applicable, and for some ladders (e.g. our Joye ladder) it adds no extra multiplications. However, it does require extra additions and conditional swaps, and it doesn't prevent an attacker from using power analysis to determine whether the neutral point was reached.

Notation change: ladder state sums to O
For both the Joye and Montgomery ladder, the state (P, Q, R) normally satisfies R = P + Q. However, the state is typically permuted between ladder steps, and we will further permute it here. To ensure that the ladder state satisfies a consistent invariant without introducing additional cases, we will write it as (P, Q, R̄) where R̄ = −R, so that no matter how the state is permuted, P + Q + R̄ = O. This changes the ladder operation to (P, Q, R̄) → (P, Q − R̄, 2R̄).

Entering and leaving the neutral zone
We call the set of states where O ∈ {P, Q, R̄} the "neutral zone".
Our key observation is that there are only a few, highly constrained paths for the ladder state to pass through the neutral zone. We can recognize whether the ladder state is on one of these paths, and in what state S it will exit the neutral zone. In that case, we can reach S by a different sequence of ladder steps, of the same length, which doesn't pass through the neutral zone.
This idea might work on curves of any order. But for simplicity, we will only consider curves of order not divisible by 2 or by 3. Even-order curves are less interesting anyway, because most cryptographic examples are Montgomery curves, and the Montgomery ladder on these curves is already complete [BL17].
On odd-order curves, if the ladder does not start in the neutral zone (i.e. if P_0 ≠ O) then the only way to enter it is by the ladder step (−2Q, Q, Q) → (−2Q, O, 2Q). On the next step, the Montgomery ladder either stays in this state or exits the neutral zone to (−2Q, −2Q, 4Q). Likewise, the Joye ladder either doubles to (−4Q, O, 4Q) or exits to (−2Q, −2Q, 4Q).

The shadow state
We propose to avoid this set of transitions by recognizing that Q = R̄ and instead using an equivalent, equally long sequence of transitions that avoids the neutral zone. Specifically, we permute the state (−2Q, Q, Q) to (Q, −2Q, Q), so that the ladder operation moves it to the state (Q, −3Q, 2Q). Since we intend to follow the correct ladder state at a distance ("shadow" it), we will call this state, and others which are proportional to permutations of (1 : 2 : −3), a "shadow state". The shadow state is outside the neutral zone if Q ≠ O and the curve's order is not divisible by 2 or 3.
Via a permutation and a ladder step, a shadow state can transition to (2Q, −4Q, 2Q), which can be negated and permuted into the state (−2Q, −2Q, 4Q) at which the ordinary Montgomery or Joye ladder would leave the neutral zone. Via different permutations and ladder steps, the shadow state can remain the same (as the Montgomery ladder does in the neutral zone) or double (as the Joye ladder does). States proportional to permutations of (1, 2, −3) are the only states outside the neutral zone with either of these properties. If the ladder should end in the neutral zone, then the final output is either the neutral point or ±2Q, which can also be extracted from the shadow state.
Our full technique can be described at a high level as follows:
• Before each ladder step, check whether Q = R. If not, then perform the ladder step (P, Q, R) → (P, Q − R, 2R) as normal. It won't go to the neutral zone if q is odd.
• If Q = R, then a ladder step would move the state into the neutral zone. Instead permute the state to (Q, −2Q, Q). Then apply the ladder step, which moves to the shadow state (Q, −3Q, 2Q) instead of the neutral zone.
• On the next step, determine from the key bits whether the ladder state will exit the neutral zone, which it does to (−2Q, −2Q, 4Q). If so, permute the shadow state to (2Q, −3Q, Q) and then apply a ladder step to reach (2Q, −4Q, 2Q). Permute and negate this result into the correct exit state.
• If instead the Montgomery ladder would remain in the neutral zone, permute the shadow state to (−3Q, 2Q, Q); after the ladder step it will be (−3Q, Q, 2Q).
• If the Joye ladder would remain in the neutral zone, permute the shadow state to (2Q, Q, −3Q); after the ladder step it will be the doubled shadow state (2Q, 4Q, −6Q).
• If the ladder completes while in the neutral zone, the correct answer is either O or ±2Q, which may be extracted from the shadow state.
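The transitions in the list above can be checked on a toy model. The following is a minimal sketch, not the paper's algorithm: it assumes every point in the state is a multiple kQ of a single point Q, stores only the integer k modulo a hypothetical order N coprime to 6, and uses the ladder step (P, Q, R) → (P, Q − R, 2R) from the text. The helper names (`ladder_step`, `permute`, `negate`, `in_neutral_zone`) are illustrative only.

```python
# Toy model: each point is stored as its integer multiple k of a base point Q,
# reduced modulo N, where N stands in for the order of Q.
N = 35  # hypothetical order, chosen so that it is not divisible by 2 or 3


def norm(state, n=N):
    """Reduce every multiplier modulo n."""
    return tuple(c % n for c in state)


def ladder_step(state, n=N):
    """The ladder step (P, Q, R) -> (P, Q - R, 2R) on multipliers."""
    p, q, r = state
    return norm((p, q - r, 2 * r), n)


def in_neutral_zone(state, n=N):
    """A state is in the neutral zone if any component is the neutral point."""
    return any(c % n == 0 for c in state)


def negate(state, n=N):
    return norm(tuple(-c for c in state), n)


def permute(state, order):
    """Return (state[order[0]], state[order[1]], state[order[2]])."""
    return tuple(state[i] for i in order)


entry = (-2, 1, 1)                 # the state (-2Q, Q, Q), where Q = R
naive = ladder_step(entry)         # (-2Q, O, 2Q): inside the neutral zone

# Entering the shadow state: (-2Q, Q, Q) -> (Q, -2Q, Q) -> (Q, -3Q, 2Q).
shadow = ladder_step(permute(entry, (1, 0, 2)))

# Exit: (2Q, -3Q, Q) -> (2Q, -4Q, 2Q), then negate and permute.
exit_raw = ladder_step(permute(shadow, (2, 1, 0)))
exit_state = permute(negate(exit_raw), (0, 2, 1))

# Montgomery ladder staying put: (-3Q, 2Q, Q) -> (-3Q, Q, 2Q).
mont_stay = ladder_step(permute(shadow, (1, 2, 0)))

# Joye ladder doubling: (2Q, Q, -3Q) -> (2Q, 4Q, -6Q).
joye_double = ladder_step(permute(shadow, (2, 0, 1)))
```

In this model the shadow path reaches the same exit state as the ordinary ladder step applied to the neutral-zone state, while none of the intermediate shadow states contains the neutral point.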
For the Joye ladder, it is slightly preferable to negate the state on entering the shadow state instead of on leaving it, so that the shadow state contains the correct output −2Q instead of +2Q. State diagrams of the ladder avoidance technique are shown in Figure 6. The diagrams show paths that pass through the neutral zone on the left, and equivalent paths that do not pass through the neutral zone on the right. Each thick arrow denotes a ladder step, and the state is permuted between ladder steps. For example, consider the following sequence of Montgomery ladder steps, which passes through the neutral zone: This sequence can be traced out on the left side of the Montgomery ladder state diagram. We can avoid the neutral point by using different permutations between the steps, as shown on the right side of the upper state diagram: This ending state on the second path can then be permuted and negated to reach the same ending state as the first path.

Implementation
The detailed algorithm requires a large number of conditional swaps, and is given in the supplementary material. This technique is simplest to implement using our Joye ladder formulas with a state of and optionally Z. This allows permutations of the Y-coordinates to be done directly.
Negating the state requires only negating Z, if it is present, and is free if Z is not present. The only tricky part is how to permute the X-coordinates. The simple case is condswap The other swaps require arithmetic on the X-coordinates, and it is preferable not to perform that conditionally. But we can build these swaps using only unconditional arithmetic and conditional swaps. For example, we can implement condswap(P, R) as swap(Q, R); condswap(P, Q); swap(Q, R).
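The conjugation trick in the last sentence can be illustrated on abstract registers. This is a minimal sketch under simplifying assumptions: the three registers hold opaque integer labels rather than the X- and Y-coordinates of the real state, so the unconditional `swap` here is a plain exchange instead of the coordinate arithmetic it would need in practice. The masked `condswap` only mimics the shape of a constant-time swap; Python integers are not constant-time.

```python
P, Q, R = 0, 1, 2  # register indices (illustrative names)


def swap(regs, i, j):
    """Unconditional swap; stands in for the arithmetic-based swap."""
    regs[i], regs[j] = regs[j], regs[i]


def condswap(regs, i, j, bit):
    """Swap regs[i] and regs[j] if bit == 1, using a mask instead of a branch."""
    mask = -bit                      # 0 when bit == 0, all ones when bit == 1
    t = mask & (regs[i] ^ regs[j])
    regs[i] ^= t
    regs[j] ^= t


def condswap_pr(regs, bit):
    """condswap(P, R) built as swap(Q, R); condswap(P, Q); swap(Q, R)."""
    swap(regs, Q, R)
    condswap(regs, P, Q, bit)
    swap(regs, Q, R)


def run(bit):
    """Return (composed result, direct result) for comparison."""
    composed = [10, 20, 30]
    condswap_pr(composed, bit)
    direct = [10, 20, 30]
    condswap(direct, P, R, bit)
    return composed, direct
```

Only the middle operation depends on the secret bit, so the arithmetic hidden inside the two outer swaps runs unconditionally.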
Though they are described as Joye ladder formulas, these formulas could also be used for the Montgomery ladder, except that we would track the x-coordinates X_QP and X_RP. These complete formulas do not require any extra registers, at least if a "reverse subtraction" operation X ← Y − X is available.

Side-Channel Protections
The present work is not primarily focused on side-channel attacks or defenses. However, we can offer a few brief observations.
Simple power analysis: In a straightforward implementation of the Montgomery or Joye ladder, the ladder steps and states depend deterministically on the initial point P_0, and on the initial segment of the scalar which has been processed to that point. This leads to simple power analysis (SPA) attacks [KJJ99]. These attacks can be mitigated in part by projectively blinding the starting state. That is, we can choose a random nonzero r from the field and multiply (X, Y, Z, M) by (r², r³, r, r) respectively, so that they are a different, random representation of the same points. This defense is included in the supplementary material.
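The blinding step can be sketched as follows. This is an illustration rather than the code from the supplementary material: the prime `p` is an arbitrary stand-in field, and the function `affine` assumes that the represented values are X/Z², Y/Z³ and M/Z, which is the weighting implied by the scaling factors (r², r³, r, r). The names `blind`, `affine` and `random_nonzero` are hypothetical.

```python
import secrets

p = 2**255 - 19  # stand-in prime field for the sketch (an assumption)


def random_nonzero():
    """Uniformly random nonzero field element."""
    return 1 + secrets.randbelow(p - 1)


def blind(X, Y, Z, M, r):
    """Multiply (X, Y, Z, M) by (r^2, r^3, r, r) respectively."""
    r2 = r * r % p
    r3 = r2 * r % p
    return (X * r2 % p, Y * r3 % p, Z * r % p, M * r % p)


def affine(X, Y, Z, M):
    """Values represented by the state, assuming weights (2, 3, 1) against Z."""
    zi = pow(Z, -1, p)          # modular inverse (Python 3.8+)
    zi2 = zi * zi % p
    return (X * zi2 % p, Y * zi2 * zi % p, M * zi % p)


state = (5, 7, 11, 13)          # arbitrary example values, not a real ladder state
blinded = blind(*state, random_nonzero())
```

Under these assumptions `affine(*blinded)` equals `affine(*state)`, while the stored field elements differ from run to run.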
Horizontal attacks: Horizontal attacks are power analysis attacks which look for correlations between power consumption at different steps in the algorithm. This enables the attacker to determine whether the points have been swapped between rounds or not, which determines the key bit. These attacks are easiest if a single value is used in two different ladder steps with a conditional swap in between, because the two instances of that value being read from memory or operated on may have highly-correlated power traces [BJPW14]. Otherwise, correlations may occur between reading and writing a value, but these correlations are typically smaller, which makes the attack more difficult.
The new Montgomery ladder in Figure 3 should resist horizontal attacks better than our other new formulas, because it does not use any of the round output variables within the round. If the signal-to-noise ratio is low enough, then this property might be enough to thwart a horizontal collision attack. Otherwise, additional countermeasures might be required, such as re-randomizing the projective blinding between rounds, or blinding the secret scalar.
Fault attacks: Fault attacks work by causing computation errors in a device that is computing using some secret value. Comprehensive defenses against fault attacks are very complex, and far beyond the scope of this work. However, one possible component of these defenses is to check invariants of the algorithm's state. The integrity of the ladder state can be checked using the invariants in Section 2.5.

Conclusions and Future Work
We have presented new, optimized formulas for the Montgomery and Joye ladders on short Weierstrass curves. Our Montgomery ladder formulas are faster and slightly more side-channel-resistant than our Joye ladder formula, and we hope that future work can improve the Joye formula to match. We have not attempted a computer search for more optimal formulas with this representation or similar representations, which might well be fruitful. It may also be interesting to search for further S − M tradeoffs.
For typical applications such as the NIST and Brainpool curves, our ladders are complete except with negligible probability, and even that negligible probability can be avoided at a reasonable additional cost.However, the avoidance technique would be more useful if it could be simplified.
Our new formulas are amenable to parallel implementations, which would be an interesting line of future work.It is not immediately clear whether we can combine the full benefits of parallelism with zero avoidance and/or Z-coordinate tracking.
be the state of the Montgomery ladder on an elliptic curve $y^2 = x^3 + ax + b$ defined over a field of characteristic other than 2. The three points $(P, Q, -R)$ lie on a line with slope $m := (y_Q + y_R)/(x_Q - x_R)$. Let $P = (x_P, y_P)$, $S := Q + R = (x_S, y_S)$, $T := 2R = (x_T, y_T)$ be the state after a ladder operation, where $(P, S, -T)$ lie on a line of slope $m'$. Let Then
Proof. The three points $P, Q, -P - Q$ lie on the intersection for some $c$. The solutions to this equation are roots of a polynomial of the form $x^3 - m^2 x^2 + O(x)$, so that

Figure 1: The Montgomery and Joye ladders, modified for nonzero initial state.

Figure 2: Comparison to selected previous work.
Let $\hat m := (y_Q - y_R)/(x_Q - x_R)$ be the slope of the line connecting $Q$, $R$ and $-(Q + R) = (x_S, -y_S)$. We have
$$\hat m^2 = x_Q + x_R + x_S; \qquad \hat m - m = -t; \qquad \hat m + m = u.$$
Therefore
$$x_S - x_P = \hat m^2 - x_Q - x_R - x_P = \hat m^2 - m^2 = (\hat m + m)\cdot(\hat m - m) = -tu.$$
We also have
$$-u = -(\hat m + m) = (m - \hat m) - 2m = t - 2m.$$
Next, we have
$$\begin{aligned}
y_S - y_P &= -(-y_S - y_R) + (-y_R - y_P) \\
&= -\hat m\cdot(x_S - x_R) + m\cdot(x_R - x_P) \\
&= -\hat m\cdot(x_S - x_P) + (\hat m + m)\cdot(x_R - x_P) \\
&= -\hat m\cdot(x_S - x_P) + u\cdot(x_R - x_P),
\end{aligned}$$
whence the output slope from $P$ to $Q + R$ is
$$m_{\mathrm{out}} := \frac{y_S - y_P}{x_S - x_P} = -\hat m + \frac{u(x_R - x_P)}{-tu} = -\hat m - s = t - m - s.$$
Finally, let's calculate $x_T - x_P$. We have $m_{\mathrm{out}} = -\hat m - s$. In calculating $x_T - x_P = m_{\mathrm{out}}^2 - 2x_P - x_S$, we may expand
$$m_{\mathrm{out}}^2 = \hat m^2 + 2\hat m s + s^2 = x_Q + x_R + x_S + 2\hat m s + s^2$$
help performance if xy is needed, but it does help if instead 2xy is needed.
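The two slope identities in the derivation above, x_S − x_P = −tu and m_out = t − m − s, can be checked numerically on a toy curve. The sketch below is only a sanity check under stated assumptions: the curve y² = x³ + 2x + 3 over the field of 1009 elements is an arbitrary example, the state convention is P = R − Q (so that P, Q, −R are collinear), and t, u, s are taken as t = m − m̂, u = m̂ + m, s = (x_R − x_P)/t, as read off from the relations in the derivation. All function names are illustrative.

```python
p, a, b = 1009, 2, 3   # toy curve y^2 = x^3 + a*x + b over F_p (an assumption)


def inv(v):
    return pow(v % p, -1, p)   # modular inverse (Python 3.8+)


def neg(A):
    return None if A is None else (A[0], -A[1] % p)


def add(A, B):
    """Affine point addition; None represents the neutral point."""
    if A is None:
        return B
    if B is None:
        return A
    (x1, y1), (x2, y2) = A, B
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if A == B:
        lam = (3 * x1 * x1 + a) * inv(2 * y1) % p
    else:
        lam = (y2 - y1) * inv(x2 - x1) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)


# One point per x-coordinate, all with y != 0, found via a square-root table.
sqrt_of = {}
for y in range(1, p):
    sqrt_of.setdefault(y * y % p, y)
pts = [(x, sqrt_of[(x**3 + a * x + b) % p])
       for x in range(p) if (x**3 + a * x + b) % p in sqrt_of]


def check(Q, R):
    """Verify x_S - x_P = -t*u and m_out = t - m - s for P = R - Q, S = Q + R."""
    (xP, yP), (xS, yS) = add(R, neg(Q)), add(Q, R)
    (xQ, yQ), (xR, yR) = Q, R
    d = inv(xQ - xR)
    m = (yQ + yR) * d % p        # slope through P, Q, -R
    m_hat = (yQ - yR) * d % p    # slope through Q, R, -(Q + R)
    t, u = (m - m_hat) % p, (m_hat + m) % p
    s = (xR - xP) * inv(t) % p
    m_out = (yS - yP) * inv(xS - xP) % p
    return (xS - xP) % p == (-t * u) % p and m_out == (t - m - s) % p
```

Calling `check(pts[i], pts[j])` for distinct indices exercises the identities on pairs of points with different x-coordinates and nonzero y-coordinates, which is what the divisions in the derivation require.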

• The Z value is multiplied by XT S • YR in each iteration, so the value of Z² is multiplied by G • H. Thus Z² can be tracked with only 1M extra per iteration by rewriting X QP • H = XQP • (G • H), but the resulting formula does not appear to parallelize 4 ways.