Concrete quantum cryptanalysis of binary elliptic curves

This paper analyzes and optimizes quantum circuits for computing discrete logarithms on binary elliptic curves, including reversible circuits for fixed-base-point scalar multiplication and the full stack of relevant subroutines. The main optimization target is the size of the quantum computer, i.e., the number of logical qubits required, as this appears to be the main obstacle to implementing Shor's polynomial-time discrete-logarithm algorithm. The secondary optimization target is the number of logical Toffoli gates. For an elliptic curve over a field of $2^n$ elements, this paper reduces the number of qubits to $7n + \lfloor \log_2(n) \rfloor + 9$. At the same time this paper reduces the number of Toffoli gates to $48n^3 + 8n^{\log_2(3)+1} + 352n^2 \log_2(n) + 512n^2 + O(n^{\log_2(3)})$ with double-and-add scalar multiplication, and a logarithmic factor smaller with fixed-window scalar multiplication. The number of CNOT gates is also $O(n^3)$. Exact gate counts are given for various sizes of elliptic curves currently used for cryptography.


Introduction
Current cryptographic systems used on the Internet rely on the Diffie-Hellman key exchange, a way to create shared secret keys over a public channel. One of the most common Diffie-Hellman variants uses elliptic-curve cryptography (ECC). The key-exchange schemes rely on problems that are hard to solve with a classical computer. However, a quantum computer has advantages against these problems and can solve them exponentially faster.
Current quantum computers are very small compared to classical computers. However, a time will soon come when quantum computers can threaten computer security. This paper looks at a specific instance of a currently used cryptographic system and analyzes how large a quantum computer would have to be to quickly break it.
Optimizing quantum algorithms for concrete cryptanalysis has a lot in common with hardware design. The extra challenge is that quantum algorithms are required to be reversible. Reversible circuits are composed of a fixed set of reversible gates (NOT, CNOT, and Toffoli), which match the functionality of NOT, XOR, and AND with the extra condition that they return enough of the inputs to make the operations reversible. This creates an additional challenge for space-efficient algorithms, as trivial applications of the gate translation would amass a lot of qubits.

When will RSA and ECC be broken?
The number of years left for RSA and ECC depends on advances in building quantum computers, but also on advances in optimizing Shor's algorithm, and on the selected key sizes. Normally RSA and ECC key sizes are chosen to provide equal strength against non-quantum attacks, but this does not mean that they have equal strength against quantum attacks. Overheads in quantum elliptic-curve arithmetic make Shor's algorithm more challenging to optimize for ECC, but, as pre-quantum security levels increase, RSA chooses relatively large key sizes to protect against subexponential-time non-quantum factorization attacks. This creates a cross-over point in pre-quantum security levels, below which Shor's algorithm is faster for RSA than for ECC and above which Shor's algorithm is faster for ECC than for RSA.
At Asiacrypt 2017, Rötteler, Naehrig, Svore and Lauter [RNSL17] presented concrete quantum cryptanalysis of elliptic curve cryptography over prime fields. Their paper was the first to give a detailed study of this problem for prime fields and found a cross-over point much smaller than previously thought. Last year, Gidney and Ekerå [GE19] improved the cost of breaking RSA, leading again to a later cross-over point between RSA and ECC.
For binary elliptic curves, several papers have studied different curve shapes and approaches to the arithmetic, generally pointing to a later cross-over point than [RNSL17]. The most recent paper in that sequence of publications is [ARS13] by Amento, Rötteler and Steinwandt. That paper uses depth as its sole metric, sacrificing space to improve latency, whereas [RNSL17] emphasized space and gate count, so the results are not directly comparable. Furthermore, [ARS13] does not specify the entirety of Shor's algorithm. For an elliptic curve over a field of $2^n$ elements, with double-and-add scalar multiplication, this paper uses $7n + \lfloor \log_2(n) \rfloor + 9$ qubits, $48n^3 + 8n^{\log_2(3)+1} + 352n^2 \log_2(n) + 512n^2 + O(n^{\log_2(3)})$ Toffoli gates, and $O(n^3)$ CNOT gates. The costs with windowing are more complicated but smaller by a logarithmic factor. We present exact gate counts for standard ECC sizes from 163 bits through 571 bits in Tables 5 and 6 (considering windows).
A preliminary form of this paper was included in the third author's master's thesis in 2019 and achieved the same $7n + \lfloor \log_2(n) \rfloor + 9$ qubits for binary-field ECDLP. An independent paper [HJN+20] achieved about $8n + 10.2 \log_2(n) - 1$ qubits for prime-field ECDLP. The previous paper [RNSL17] used $9n + 2 \log_2(n) + 10$ qubits for prime-field ECDLP. See Section 9 for a more detailed comparison of our work to other work.

Organization of the paper
Sections 2 and 3 consist of background on elliptic curves and quantum computing respectively, while clarifying notation and goals. Section 4 details Shor's algorithm, the general quantum algorithm we use to solve discrete logarithm problems. Section 5 introduces basic finite-field operations like addition and constant multiplication. Section 6 details and compares two methods to do division: a new algorithm using extended greatest common divisor and an algorithm using Fermat's little theorem. In Section 7 we put this together to achieve point addition on binary elliptic curves. Section 8 presents a quantum version of scalar multiplication using windowing. For both approaches, the resulting resource count and a comparison to other work is given in Section 9. Finally, Section 10 draws a conclusion and details future work.

Binary elliptic curve discrete logarithm
This section contains a very brief introduction to binary elliptic curve cryptography, the primary application of this paper. For more background on elliptic curves see, e.g., [ACD+05].

Binary elliptic curves
Binary elliptic curves are elliptic curves defined over a binary field $\mathbb{F}_{2^n}$. We use a polynomial representation for $\mathbb{F}_{2^n}$, i.e., the elements are represented as polynomials of degree less than $n$ with coefficients in $\mathbb{F}_2$. Computations use that $\mathbb{F}_{2^n} \cong \mathbb{F}_2[z]/(m(z))$, where $m(z)$ is an irreducible polynomial of degree $n$, i.e., all computations are done modulo $m(z)$. Binary elliptic curves are standardized in [KG13]; for the defining polynomials $m(z)$ used for those curves see Table 1.
We consider only ordinary binary elliptic curves, as the supersingular ones have stronger attacks. An ordinary binary elliptic curve is given by $y^2 + xy = x^3 + ax^2 + b$, where $a \in \mathbb{F}_2$ and $b \in \mathbb{F}_{2^n}^*$. Points on this curve are tuples $P = (x, y) \in \mathbb{F}_{2^n}^2$ satisfying the curve equation, along with a special point $O$ called the "point at infinity".
The set of points on an elliptic curve forms a group under point addition defined as follows. The neutral element is $O$. The negative of a point $P_1 = (x_1, y_1)$ is $-P_1 = (x_1, y_1 + x_1)$, so that $P_1 + (-P_1) = O$. Two points $P_1 = (x_1, y_1)$ and $P_2 = (x_2, y_2) \neq \pm P_1$ are added to produce $P_1 + P_2 = P_3 = (x_3, y_3)$ as $x_3 = \lambda^2 + \lambda + x_1 + x_2 + a$, $y_3 = \lambda(x_2 + x_3) + x_3 + y_2$ with $\lambda = (y_1 + y_2)/(x_1 + x_2)$, and $P_1 \neq -P_1$ is doubled to produce $[2]P_1 = (x_3, y_3)$ as $x_3 = \lambda^2 + \lambda + a$, $y_3 = x_1^2 + (\lambda + 1)x_3$ with $\lambda = x_1 + y_1/x_1$. By Hasse's theorem [Has36] the number of points on an elliptic curve over $\mathbb{F}_{2^n}$ is at most $2^n + 2^{n/2+1} + 1$; this is less than $2^{n+1}$ for $n > 2$. The order $\mathrm{ord}(P)$ of a point $P$ is the smallest positive integer such that $[\mathrm{ord}(P)]P = O$. The order of a point divides the number of points on the elliptic curve.
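The group law can be checked on a toy example. The following is a minimal classical sketch, not a reversible circuit; the field $\mathbb{F}_{2^4}$ with $m(z) = z^4 + z + 1$, the curve parameters $a = b = 1$, and all function names are assumptions chosen only to keep the example small. It uses the standard affine addition and doubling formulas for ordinary binary curves.

```python
# Toy parameters (assumptions for illustration only):
# field F_{2^4} with m(z) = z^4 + z + 1, curve y^2 + xy = x^3 + a*x^2 + b, a = b = 1.
N = 4
M = 0b10011
A, B = 1, 1

def gf_mul(u, v):
    """Multiply field elements (ints, bit i = coefficient of z^i) modulo m(z)."""
    r = 0
    for i in range(N):
        if (v >> i) & 1:
            r ^= u
        u <<= 1
        if (u >> N) & 1:
            u ^= M
    return r

def gf_inv(u):
    """Invert a non-zero element as u^(2^N - 2)."""
    r = 1
    for _ in range(2**N - 2):
        r = gf_mul(r, u)
    return r

def on_curve(P):
    x, y = P
    x2 = gf_mul(x, x)
    lhs = gf_mul(y, y) ^ gf_mul(x, y)
    rhs = gf_mul(x2, x) ^ gf_mul(A, x2) ^ B
    return lhs == rhs

def add(P, Q):
    """Add affine points with P != +-Q, i.e. with different x-coordinates."""
    (x1, y1), (x2, y2) = P, Q
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = gf_mul(lam, x2 ^ x3) ^ x3 ^ y2
    return (x3, y3)

def double(P):
    """Double an affine point with P != -P, i.e. with x1 != 0."""
    x1, y1 = P
    lam = x1 ^ gf_mul(y1, gf_inv(x1))
    x3 = gf_mul(lam, lam) ^ lam ^ A
    y3 = gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3)
    return (x3, y3)

# All affine points of the toy curve, found by exhaustive search.
POINTS = [(x, y) for x in range(2**N) for y in range(2**N) if on_curve((x, y))]
```

Field addition is bitwise XOR in this representation, which is why the formulas contain no subtraction.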

Elliptic curve Diffie-Hellman
Elliptic curve Diffie-Hellman, the primary key-exchange mechanism using elliptic curves, works as follows: Alice and Bob want to privately agree on a secret point on a public curve while communicating in a public space. To do this, they take secret integers $\alpha$ and $\beta$ respectively. Publicly, they agree on a point $P$ with a large prime order. Then, they calculate and tell each other $P_\alpha = [\alpha]P$ and $P_\beta = [\beta]P$. Finally, they calculate their shared point $[\alpha]P_\beta = [\beta]P_\alpha = [\alpha\beta]P$. The problem of computing $\alpha$ from $P_\alpha$ and $P$ is called the elliptic curve discrete logarithm problem (ECDLP). The best non-quantum attacks on the ECDLP take time exponential in $\log(\mathrm{ord}(P))$. Shor's algorithm [Sho94] computes $\alpha$ in time polynomial in $\log(\mathrm{ord}(P))$ with a quantum computer.

Quantum background
This section contains a brief overview of quantum computing. For more details we refer to Ronald de Wolf's lecture notes available online [Wol19].

Qubits
A classical bit can take 2 values, 0 or 1; measuring that bit does nothing to it, and using transistors we can build gates like AND or OR. In the quantum case we have quantum bits (qubits), for which these things are not true. A qubit can take a superposition of values, meaning that the qubit can be in two states at once, and measuring a qubit changes its value by collapsing it to take value 0 or 1. The base states of a qubit are written in ket notation as $|0\rangle$ and $|1\rangle$, and a superposition is a weighted sum of these two base states $\alpha|0\rangle + \beta|1\rangle$, where $\alpha, \beta \in \mathbb{C}$ and $|\alpha|^2 + |\beta|^2 = 1$. The chance to observe 0 in the measurement equals $|\alpha|^2$. A qubit with $|\alpha| = |\beta|$, such as $\frac{1}{\sqrt{2}}(|0\rangle + |1\rangle)$, is said to be in uniform superposition; it has equal chance of being measured as 0 or 1.
Combining $n$ qubits provides a superposition over $2^n$ states $\sum_{i=0}^{2^n-1} \alpha_i |q_{n-1,i} q_{n-2,i} \ldots q_{1,i} q_{0,i}\rangle$, where $i = (q_{n-1,i} q_{n-2,i} \ldots q_{1,i} q_{0,i})_2$ is the representation of $i$ in base 2. Measuring outputs $i$ with probability $|\alpha_i|^2$. For simplicity we write $\sum_{i=0}^{2^n-1} \alpha_i |i\rangle$.

Quantum Gates
Quantum computing requires reversible gates. Unlike classical gates like AND or XOR, reversible gates are bijective (every input state corresponds to exactly one output state) and require an equal number of input and output qubits. In the following sections we state our algorithms only in terms of these gates applied to classical states, but the gates we use can be applied to superpositions of qubits in states $|1\rangle$ and $|0\rangle$. Each state then behaves as expected individually: applying a NOT gate to $\alpha|0\rangle + \beta|1\rangle$ turns it into $\alpha|1\rangle + \beta|0\rangle$. For elliptic-curve computations we need the following gates (see also Circuit 1):
• The NOT gate. It has one input and one output: if the input is $|0\rangle$, the output is $|1\rangle$ and vice versa. It is its own inverse.
• The CNOT (controlled NOT), or Feynman, gate is the reversible equivalent of XOR. This gate has 2 qubits as inputs and adds one of the qubits to the other qubit: $(a, b) \mapsto (a, a \oplus b)$.
• The Toffoli (TOF) gate is the reversible equivalent of AND. This gate has 3 qubits as inputs and adds the AND of two of the qubits to the third qubit: $(a, b, c) \mapsto (a, b, c \oplus (a \cdot b))$. Circuit 1b has an example. We write this as c ← TOF(a, b, c) in algorithms.
• The SWAP operation swaps two qubits a and b; after a swap we refer to qubit a as "b" and qubit b as "a". This is free in the cost metrics we use.
Circuit 1: Basic quantum gates used in this paper, beyond NOT.
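The action of these gates on classical basis states can be sketched as follows. This is an illustrative model on bits only (superpositions are not simulated), and the function names are assumptions of this sketch.

```python
from itertools import product

def CNOT(a, b):
    """Return (a, b XOR a): the control a is kept, the target b is updated."""
    return a, b ^ a

def TOF(a, b, c):
    """Return (a, b, c XOR (a AND b)): both controls are kept."""
    return a, b, c ^ (a & b)

def is_bijection(gate, width):
    """Check that the gate maps the 2^width basis states to distinct outputs."""
    inputs = list(product((0, 1), repeat=width))
    outputs = {gate(*bits) for bits in inputs}
    return len(outputs) == len(inputs)

# AND and XOR are recovered by reading the target qubit when it starts at 0.
and_of_ones = TOF(1, 1, 0)[2]
xor_of_one_zero = CNOT(1, 0)[1]
```

Both gates are their own inverse, so applying either one twice restores the input.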
Quantum computing has other gates and actions, which are purely quantum and not available in classical reversible computing. In Shor's algorithm, the following gates are necessary: the Hadamard gate ($H$), the phase shift gate ($R_\varphi$), and measurement, indicated by a meter symbol. Shor's algorithm is described in the next section. Our results can use this algorithm as a black box so we do not describe these gates.
Quantum mechanics has a unique property called entanglement that is not present in the classical world. When 2 qubits interact, they become entangled. Using this entanglement we can make quantum algorithms.

Quantum Algorithms
Quantum algorithms consist of operations on registers of qubits. We divide those qubits into two types:
• Input and output qubits. These qubits contain the input and will contain the output after running the algorithm, potentially with some qubits being in the same state as before. For example, a Toffoli gate has 3 input and output qubits, but only 1 of them changes.
• Ancillary qubits. These qubits are used by the algorithm, but do not contain the input and output. For this paper we restrict ancillary qubits to always start and end in a fixed state of $|0\rangle$.

Efficiency
There are several methods to measure the efficiency of algorithms:
• On the most basic level, we can compare the number of gates. However, quantum Toffoli gates are expected to be much more expensive than CNOT gates, with the exact difference depending on the physical realization of the quantum computer. As such, minimizing the number of Toffoli gates alone can be considered a better goal. The number of Toffoli gates will be an important concern in this paper.
• Furthermore, the number of qubits an algorithm uses is very relevant to implementations today. Actual quantum computers are slowly increasing their number of qubits. As such the space, or width, of an algorithm is also relevant. The lower this space, the sooner the algorithm can be implemented on a real quantum computer. Space will be the primary concern in this work.
• In addition to this, we can parallelize quantum circuits well: applying a circuit once on a set of qubits and once on a different set of qubits can be done twice as fast as applying that circuit twice on some of the same qubits. For example, CNOT(a, b) and CNOT(b, c) have to be done sequentially in 2 steps, while CNOT(a, b) and CNOT(c, d) can be done in one step. This measure of how many gates we need sequentially is called depth. In this work, depth will not be explored in detail, but will be reported, with optimization left to future work.
• Finally, all of the above assumes quantum computers will not have errors. Precise quantum states are difficult to maintain and errors come quickly. Error correction has to be implemented to create what are called logical qubits, qubits on which operations can be performed with a reasonable degree of certainty. Error correction is not considered in this work and any mention of qubits refers to logical qubits.
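The depth measure from the CNOT example above can be illustrated with a small greedy layering. This is a sketch under the assumption that gates are given in program order and that each gate is simply identified by the tuple of qubit names it touches; it is not the tool used for the reported numbers.

```python
def circuit_depth(gates):
    """Greedy depth: each gate is placed one layer after the latest
    layer already occupied by any qubit it touches."""
    busy_until = {}  # qubit name -> last layer in which that qubit is used
    depth = 0
    for qubits in gates:
        layer = 1 + max((busy_until.get(q, 0) for q in qubits), default=0)
        for q in qubits:
            busy_until[q] = layer
        depth = max(depth, layer)
    return depth

# CNOT(a, b) then CNOT(b, c) share qubit b, so they need two layers.
sequential = [("a", "b"), ("b", "c")]
# CNOT(a, b) and CNOT(c, d) act on disjoint qubits, so one layer suffices.
parallel = [("a", "b"), ("c", "d")]
```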
An ideal analysis would give a parameterized algorithm in all of the above. However, users of cryptography need a concrete number to see how close to broken binary ECC is. Thus we prioritize the number of qubits and the number of Toffoli gates as those have been used in previous work [RNSL17]. We explore adding a small (constant or logarithmic) number of qubits to reduce the gate count, but minimize the number of qubits down to contributions linear in $n$.

Shor's algorithm
In 1994, Peter Shor described how to use quantum computers to break traditional asymmetric cryptography [Sho94]. While his primary example detailed how to break RSA by factoring integers in polynomial time on a quantum computer, he also showed how to extend his algorithm to any discrete logarithm problem, which includes the ECDLP. We use the same version as used in [RNSL17] to show the basics of Shor's algorithm. Shor's algorithm for solving discrete logarithms works as follows: we have two points $P$ and $Q = [\alpha]P$. We want to find $\alpha$. Take 2 registers $k$ and $\ell$ of size $n + 1$ each in uniform superposition $\frac{1}{2^{n+1}} \sum_{k,\ell=0}^{2^{n+1}-1} |k, \ell\rangle$. Take another $2n$ qubits in a state representing $|O\rangle$. Conditional on the first 2 registers, add classically precomputed multiples of $P$ and $Q$ to these $2n$ qubits, so that they hold $[k]P + [\ell]Q$. A quantum Fourier transform is then applied to the first 2 registers. Those two registers are then measured, and the measurement result can be used to compute $\alpha$ classically [Sho97]. Measuring the last $2n$ qubits gives a point $R$, for which $k, \ell$ exist such that $R = [k]P + [\ell]Q$. Circuit 2 shows the general algorithm. Note that it does not matter when the final $2n$ qubits are measured, so these can be measured when measuring the entire state or even after the result of the quantum Fourier transform is measured. By taking measurements after every step, we can compress the quantum Fourier transform on the first $2n + 2$ qubits into a single qubit [GN96]. The phase shift after every step depends on the previous measurement outcomes $\mu_0, \ldots, \mu_{2n+1}$. Circuit 3 shows the resulting algorithm.
Circuit 2: Shor's algorithm for finding elliptic curve discrete logarithms.
Circuit 3: Shor's algorithm for finding elliptic curve discrete logarithms with a semiclassical Fourier transform.
What matters for our analysis is that Shor's algorithm requires the conditional addition of precomputed classical points to an intermediate point given in superposition, where the condition is also given in superposition. This requires computations in $\mathbb{F}_{2^n}$ and elliptic curve operations that fit this data flow and are reversible.

Basic arithmetic
In this section we discuss reversible in-place algorithms for the basic arithmetic of binary polynomials modulo a field polynomial $m(z)$, i.e., elements of $\mathbb{F}_{2^n}$.

Addition and binary shift
The first operation we consider, addition, can easily be implemented for binary polynomials. Each addition in $\mathbb{F}_2$ takes one CNOT gate. The addition of two polynomials of degree at most $n - 1$ takes $n$ CNOT gates with depth 1. This operation uses no ancillary qubits and the result of the addition replaces either of the inputs. Since addition is component-wise, addition for polynomials over $\mathbb{F}_2$ is the same as addition for elements of the field $\mathbb{F}_{2^n}$.
For polynomials in $\mathbb{F}_2[z]$, multiplication by $z$ is a shift of the coefficient vector. This requires no quantum gates, only a series of swaps. In a finite field, we want to multiply a polynomial $g(z)$ of degree at most $n - 1$ by $z$ and then reduce modulo a fixed irreducible weight-$\omega$ degree-$n$ polynomial $m(z)$. For our purposes $\omega$ will always be 3 or 5. We represent $m(z)$ as $M$, where $M$ is an ordered list of length $\omega$ that contains the degrees of the nonzero terms in descending order; for example, if $m(z) = 1 + z^3 + z^{10}$ then $M = [10, 3, 0]$.
• Step 1: For every qubit $g_i$ change its index so that it represents the coefficient of $z^{(i+1) \bmod n}$. Let $h_i$ be the coefficients of the relabeled polynomial, i.e., $h_{(i+1) \bmod n} = g_i$.
• Step 2: Apply CNOT controlled by the $z^0$ term $h_0$ ($g_{n-1}$ before Step 1) to $h_j$, with $j = M_1, \ldots, M_{\omega-2}$. In the example of $1 + z^3 + z^{10}$ we would apply 1 CNOT to $h_3$ controlled by $h_0$.
See Circuit 4 for an example. After a multiplication by $z$ without reduction the coefficient of $z^0$ is always 0. As $m(z)$ is irreducible, it always has coefficient 1 for $z^0$, so after a reduction by $m(z)$ that qubit will be 1 and if no reduction takes place that qubit will be 0, which means our modular shift algorithm is reversible. This results in a total of $\omega - 2$ CNOT gates for a modular reduction, with depth $\omega - 2$, and we do not use ancillary qubits. Running this circuit in reverse corresponds to dividing by $z$ modulo $m(z)$.
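The two steps of the modular shift can be sketched classically as follows. This is an illustrative model acting on bit lists rather than qubits; the function names and the list-based representation are assumptions of this sketch, and `M` follows the degree-list convention described above.

```python
def mul_by_z(g, M):
    """Multiply g(z) by z modulo m(z).

    g is a list of n bits with g[i] the coefficient of z^i.
    M lists the degrees of the non-zero terms of m(z) in descending order,
    e.g. [10, 3, 0] for 1 + z^3 + z^10.
    Returns the new coefficient list and the number of CNOT gates used.
    """
    # Step 1: relabel (free swaps), h[(i + 1) mod n] = g[i].
    h = [g[-1]] + g[:-1]
    # Step 2: CNOTs controlled by h[0] onto the middle terms of m(z).
    cnots = 0
    for j in M[1:-1]:
        h[j] ^= h[0]
        cnots += 1
    return h, cnots

def div_by_z(h, M):
    """Inverse circuit: the same CNOTs in reverse order, then undo the relabeling."""
    h = list(h)
    for j in reversed(M[1:-1]):
        h[j] ^= h[0]
    return h[1:] + [h[0]]
```

For a trinomial such as $1 + z^3 + z^{10}$ the loop in Step 2 runs once, matching the $\omega - 2 = 1$ CNOT gate stated in the text.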

Multiplication
For multiplication we use a space-efficient Karatsuba algorithm by Van Hoof [Hoo20] which uses $O(n^2)$ CNOT gates, $O(n^{\log_2 3})$ Toffoli gates and $3n$ total qubits: $2n$ qubits for the input, $f, g$, and $n$ separate qubits for the output, $h$. The algorithm is detailed in the full version of this paper [BBvHL20]. It adds the multiplication result to the output qubits, $h + f \cdot g$.

Squaring
Squaring in $\mathbb{F}_{2^n}$ is a lot easier than in the general case since $\left(\sum_{i=0}^{n-1} a_i z^i\right)^2 = \sum_{i=0}^{n-1} a_i z^{2i}$. If we do not consider the mod operation, this would be 'free,' as we just need to shuffle zeroes between our registers. We can see two approaches for squaring in $\mathbb{F}_{2^n}$: a circuit that takes the result of squaring a polynomial of degree at most $n - 1$ and stores it in $n$ separate qubits, or a circuit that replaces the input with the result. The second approach is possible only because squaring is bijective in finite fields with $2^n$ elements.

Squaring and replacing the input
To square and replace the input, we make use of the fact that squaring is a linear map and we can write that map as an $n$ by $n$ matrix. Using an LUP-decomposition, we get a lower triangular, upper triangular and permutation matrix, which can be translated into a circuit consisting of at most $n^2 - n$ CNOT gates and a number of swaps (see footnote 2).

Squaring and storing the result separately
For this approach, we can take schoolbook squaring mod $m(z)$: for every $i$ from 0 to $n - 1$ add $a_i z^{2i} \bmod m(z)$ to the output qubits, which start in state $|0\rangle$. For fixed $m(z)$ we can compute the exact number of CNOT gates required. For example, squaring modulo $1 + z^3 + z^{10}$ requires 16 CNOT gates.
There are families of polynomials $m(z)$ where this algorithm uses a quadratic number of CNOT gates. However, the attacker can move to an isomorphic field, for example replacing $z^{127} + z^{126} + 1$ with $z^{127} + z + 1$. Standard conjectures imply that every $n \ge 2$ has an irreducible degree-$n$ trinomial or pentanomial with the second non-zero term having degree at most $n/2$, and then this algorithm uses $O(n)$ CNOT gates.
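The CNOT count for schoolbook squaring can be reproduced by counting the terms of each $z^{2i} \bmod m(z)$, since every such term costs one CNOT controlled by $a_i$. The following is a minimal sketch; the function names are assumptions, and `M` is the descending degree list of $m(z)$ as in Section 5.1.

```python
def reduce_monomial(e, M):
    """Return the set of exponents of z^e mod m(z), with m(z) given by degree list M."""
    n = M[0]
    terms = {e}
    while max(terms) >= n:
        top = max(terms)
        terms.remove(top)
        # Over F_2, z^top = z^(top - n) * (sum of the lower terms of m(z)).
        for d in M[1:]:
            terms ^= {top - n + d}
    return terms

def squaring_cnot_count(M):
    """One CNOT per term of a_i * z^(2i) mod m(z), summed over all i."""
    n = M[0]
    return sum(len(reduce_monomial(2 * i, M)) for i in range(n))

count_example = squaring_cnot_count([10, 3, 0])  # m(z) = 1 + z^3 + z^10
```

For $1 + z^3 + z^{10}$ this gives 16, in agreement with the figure quoted above.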

Inversion and division in binary finite fields
The most computationally intensive step is the division step.For the purpose of this paper we treat division by a field element as multiplication by the inverse of that element.There are two different ways these inverses are calculated, which we compare in this section.

Inversion using extended GCD
The first variant we look at uses the extended greatest common divisor (GCD), or Euclid's algorithm. Rötteler, Naehrig, Svore and Lauter [RNSL17] propose a straightforward variant using Kaliski's binary GCD algorithm for inversion in $\mathbb{F}_p$. In the quantum setting this has a problem because Kaliski's algorithm terminates in a number of steps dependent on the input polynomial. To circumvent this, a qubit stores whether the algorithm has terminated and $\log(n)$ qubits store how long ago the algorithm terminated. This ends up introducing a rather large number of conditional CNOT and conditional Toffoli gates at each step, which balloons the total Toffoli gate cost. This algorithm ends up having $32n^2 \log(n)$ Toffoli gates while using only $8n + 2\log(n) + 9$ qubits.
Recently Bernstein and Yang [BY19] introduced streamlined constant-time inversion algorithms for integers and polynomials. We introduce a reversible variant of the polynomial algorithm in [BY19]. We have chosen notation to help the reader see how the steps in the optimized reversible computation here correspond to the steps in the optimized nonreversible algorithm in [BY19, Section 7.1]: in particular, the arrays $f, g, v, r$ here are the arrays of coefficients of the polynomials $f, g, v, r$ in [BY19]. To minimize the number of qubits and, secondarily, the number of Toffoli gates, we carefully track the sizes of intermediate results and of inputs that need to be recorded for reversibility.
Footnote 2: Muñoz-Coreas and Thapliyal [MCT17] propose a design which uses a small number of gates for reversible squaring by shuffling the qubits cleverly. The number of CNOT gates saved for their squaring compared to squaring with separate output is equal to $n$, and they use no ancillary qubits. Their algorithm as proposed, however, does not take into account cases where qubits in the upper $n/2$ registers have to interact. For example, if $n = 8$ and $m(z) = z^8 + z^4 + z^3 + z + 1$, we have $z^{6 \cdot 2} = z^7 + z^5 + z^3 + z + 1$. This means the qubit corresponding to $z^6$ in the input has to be added to qubits that also have to add themselves to the qubit corresponding to $z^6$ in the input, regardless of which output qubit you use to represent input qubits $z^4, z^5, z^7$. This does not obviously translate into a quantum algorithm and their code is not publicly accessible.

Quantum input :
• A non-zero binary polynomial $R_1(z)$ of degree up to $n - 1$ stored in array $g$ of size $n$ to invert.
• A binary polynomial $R_2(z)$ of degree up to $n - 1$ to multiply with the inverse stored in array $B$.
• A binary polynomial $R_3(z)$ of degree up to $n - 1$ for the result stored in array $C$.

Result:
Everything except $C$ the same as their input, $C$ as $R_3 + R_2 / R_1$.
Using these strategies, we arrive at Algorithm 1. The loop is repeated $2n - 1$ times; each round uses the following actions:
• RIGHTSHIFT and LEFTSHIFT shift the contents using only swap gates.
• $a$ is the qubit used to decide whether to swap or not. Since $v$ is always odd after a swap takes place and even if no swap has taken place, we can uncompute it directly. Unfortunately, $v$ is always even before the swap takes place and whether $r$ is odd depends on $g$, so keeping track of the sign is necessary.
• $\delta$ is almost the integer $\delta$ in [BY19], but offset by $2^{\log(n)+1} - 1$ so that the $\delta > 0$ test in [BY19] turns into a single-bit test, checking the bit at position $\log(n) + 1$. The series of CNOT gates to negate $\delta$ also increments $\delta$, which is why $\delta$ is only incremented with the incrementer circuit if $a$ is 0.
• CSWAP is a conditional swap using 2 CNOT and 1 TOF gate to swap 2 qubits based on $a$.
• It is not possible to uncompute $g_0$ within a single step. In [RNSL17] a similar value, called $m_i$, is stored. We reduce some of the space by observing that $f$ and $g$ start to decrease in size after $n$ steps, but at step $n$ the registers $v, r, f, g, g_0$ all need mostly full $n + 1$ qubit arrays. This means the number of qubits for these arrays is at least $5n + O(1)$.
• INC$_{1+a}$ is a controlled incrementing algorithm. Using the $n$ borrowed bits design from [Gid15] (we easily have $\log(n)$ qubits lying around for borrowing), we turn the CNOT gates into TOF and TOF into 3 TOF gates using ancillary qubit $g_0[\ell]$ at step $\ell$, which is still zero at this point. This leads to $22\log(n) + 26$ TOF gates and $2\log(n) + 3$ CNOT gates.
• In total we get $2(\Lambda + \lambda) + 5$ TOF gates at step $\ell$ and $4(\Lambda + \lambda) + 3$ CNOT gates in addition to the gates from INC, with $\Lambda = \min(2n - 2 - \ell, n)$ and $\lambda = \min(\ell + 1, n)$.
By keeping track of the maximum sizes of $f, g, v, r$ we get two distinct benefits: the CSWAP and TOF steps take fewer gates and we free up some space to store some of the decisional qubits. On average, both $\Lambda$ and $\lambda$ have size $3n/4 + O(1)$ since we have $n - 1$ steps of size $n$ and $n$ steps where the size is increasing or decreasing by 1 per step. We need to do the loop $4n - 2$ times in total: $2n - 1$ for computing and $2n - 1$ for uncomputing. Not including the multiplication (step 21 of Algorithm 1), this gives us $12n^2 + (88n - 44)\log(n) + 116n - 62$ TOF gates and $24n^2 + 8n\log(n) + O(n)$ CNOT gates while using $4n + \log(n) + 8$ ancillary qubits plus $3n$ qubits for the input and output qubits.
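The closed-form Toffoli count can be cross-checked by summing the per-step counts directly. The sketch below assumes that steps are indexed from 0 to $2n - 2$ and treats the value written as $\log(n)$ as an integer parameter `log_n`, since the rounding convention is not fixed here; the function names are illustrative.

```python
def gcd_div_toffoli_sum(n, log_n):
    """Sum the per-step Toffoli counts over the 2n - 1 rounds,
    then double the result to account for uncomputation."""
    total = 0
    for step in range(2 * n - 1):
        Lam = min(2 * n - 2 - step, n)
        lam = min(step + 1, n)
        swaps_and_adds = 2 * (Lam + lam) + 5   # CSWAP and TOF steps
        incrementer = 22 * log_n + 26          # INC with borrowed qubits
        total += swaps_and_adds + incrementer
    return 2 * total

def gcd_div_toffoli_closed_form(n, log_n):
    """Closed form stated in the text, excluding the final multiplication."""
    return 12 * n**2 + (88 * n - 44) * log_n + 116 * n - 62
```

Under these assumptions the two functions agree for every $n \ge 1$, which confirms that the $12n^2$ term comes from the CSWAP and TOF steps and the $\log(n)$ term from the incrementer.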

Inversion using FLT
Fermat's little theorem (FLT) states $x^p = x \bmod p$. This can be extended for binary finite fields to $x^{2^n} = x \bmod m(z)$, so that $x^{-1} = x^{2^n - 2} \bmod m(z)$ for $x \neq 0$, where $n$ is the degree of $m(z)$. By using squarings we can compute this in $n$ multiplications and $n - 1$ squarings. However, improvements to this straightforward method exist. Itoh and Tsujii [IT88] give three improved variants. We use the second variant (Theorem 2 in [IT88]) since the third variant, despite giving better results, requires $n$ to be a product of two integers, meaning it cannot be used for prime $n$ like the NIST curves [KG13] use.

Algorithm 2: FLT_DIV. Reversible algorithm for division using inversion with Fermat's little theorem.
Fixed input:
• A constant field polynomial $m(z)$ of degree $n > 0$.
Quantum input:
• A non-zero binary polynomial $R_1(z)$ of degree up to $n - 1$ stored in array $f_0$ of size $n$ to invert.
• A binary polynomial $R_2(z)$ of degree $n - 1$ to multiply with the inverse stored in array $B$.
• A binary polynomial $R_3(z)$ of degree $n - 1$ for the result stored in array $C$.
• $k$ zero arrays of size $n$ initialized to an all-$|0\rangle$ state: $f_1, \ldots, f_k$.
18 UNCOMPUTE lines 1-16
This algorithm uses two observations to reduce the cost to below $2\log(n)$ multiplications and to $n - 1$ squarings. Let $t$ be the Hamming weight of $n - 1$ in binary, so that $t \le \lfloor \log(n - 1) \rfloor + 1$, and let $k_1 = \lfloor \log(n - 1) \rfloor$. The algorithm works as follows:
1. Calculate $f^{2^{2^{k_1}} - 1}$ with $k_1$ multiplications using the second observation, save the
Circuit 6: Steps 1-3 of Algorithm 2 for $n = 10$. K is the squaring circuit using a LUP-decomposition and
In total we have $k_1 + t - 1$ multiplications, which in the quantum case translates to $2n^{\log(3)}(k_1 + t - \frac{1}{2})$ TOF gates and $n \cdot \max(k_1 + t - 1, k_1 + 1)$ ancillary qubits. The classic algorithm uses $n - 1$ squarings, while we have to use up to $4n - 4$. We use $O(n^2)$ CNOT gates per squaring as explained in Section 5.3, but we cannot be more accurate about the number of CNOT gates for general $n$ due to the variance in the squaring algorithm. We can get the exact number of CNOT gates using an LUP-decomposition. A full division algorithm is given in Algorithm 2. We can save up to $n(k_1 - t)$ qubits by doing additional multiplications to uncompute intermediate results, at the cost of a significant number of Toffoli gates. We leave to future work how many qubits we can save for specific fields.
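The multiplication count $k_1 + t - 1$ can be observed in a classical, non-reversible inversion using the usual addition chain on the bits of $n - 1$. The following is a minimal sketch of that textbook formulation, not of the reversible Algorithm 2; the function names and the integer encoding of polynomials (bit $i$ is the coefficient of $z^i$) are assumptions of this sketch.

```python
def gf2_mul(a, b, m, n):
    """Multiply a*b in F_2[z]/(m(z)); m includes the z^n term."""
    result = 0
    for i in range(n):
        if (b >> i) & 1:
            result ^= a
        a <<= 1
        if (a >> n) & 1:
            a ^= m
    return result

def gf2_sqr_k(a, k, m, n):
    """Square a field element k times."""
    for _ in range(k):
        a = gf2_mul(a, a, m, n)
    return a

def itoh_tsujii_inverse(x, m, n):
    """Return (x^(2^n - 2), number of multiplications that are not squarings)."""
    mults = 0
    beta, k = x, 1                       # invariant: beta = x^(2^k - 1)
    for bit in bin(n - 1)[3:]:           # bits of n - 1 after the leading 1
        beta = gf2_mul(gf2_sqr_k(beta, k, m, n), beta, m, n)
        mults += 1
        k *= 2                           # now beta = x^(2^(2k) - 1)
        if bit == "1":
            beta = gf2_mul(gf2_sqr_k(beta, 1, m, n), x, m, n)
            mults += 1
            k += 1
    # Here k = n - 1, and one more squaring gives x^(2^n - 2) = x^(-1).
    return gf2_sqr_k(beta, 1, m, n), mults
```

For $n = 10$ we have $n - 1 = 1001_2$, so $k_1 = 3$ and $t = 2$, and the chain uses $3 + 2 - 1 = 4$ multiplications.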

Comparison of the two division algorithms
We implement both division algorithms for the purpose of comparison. As can be seen in Table 2, the algorithms have different strengths. For small $n$ ($n < 12$ or $n = 13$) the FLT-based algorithm performs better in both number of qubits and Toffoli gate count; for larger $n$ the GCD-based algorithm performs better in number of qubits. For any $n$ the GCD-based algorithm performs better in CNOT gate count, with roughly half the gate count of the FLT-based algorithm. The FLT-based algorithm uses roughly a fifth of the Toffoli gates used by the GCD-based algorithm while using roughly twice the number of qubits. Due to the lower space requirement of the GCD-based algorithm we use it in the remainder of the work despite the larger Toffoli gate cost.

Quantum random access memory
Such a lookup naturally uses quantum random access memory (qRAM). We define LOOKUP$(i, a, b)$ as returning $(i, a + ([i]P_2)_x, b + ([i]P_2)_y)$, and we define LOOKUP$_x(i, a)$ as returning $(i, a + ([i]P_2)_x)$.
The maximum possible cost of these operations comes from implementing qRAM using Toffoli gates. Below we report Toffoli-gate counts using the qRAM implementation from [BGB+18]. One can also consider a magical implementation of qRAM that reduces the cost of each operation to just 1, or one can consider intermediate possibilities.
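On classical basis states the two lookup operations simply XOR a table entry into the target registers, which makes them their own inverse. The sketch below models only this action; the table contents are hypothetical placeholder integers rather than real curve points, and the function names are assumptions.

```python
def lookup(table, i, a, b):
    """Model of LOOKUP(i, a, b): add the x- and y-coordinate of table[i]
    to a and b (field addition is bitwise XOR)."""
    tx, ty = table[i]
    return i, a ^ tx, b ^ ty

def lookup_x(table, i, a):
    """Model of LOOKUP_x(i, a): add only the x-coordinate of table[i] to a."""
    return i, a ^ table[i][0]

# Hypothetical table of (x, y) pairs standing in for precomputed multiples.
example_table = [(0b0000, 0b0001), (0b0110, 0b1011), (0b1101, 0b0010), (0b0111, 0b1110)]
```

Applying the same lookup twice returns the registers to their original values, which is how a lookup is uncomputed.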

New special cases
Each addition of $[i]P_2$ has a significant chance $1/2^\ell$ of being an addition of $[0]P_2$, the point at infinity. Recall that the point at infinity is a failure case in the generic addition formulas: the point at infinity is not even expressible as $(x, y)$. Other failure cases have negligible chance of occurring (see Section 7.2), but $1/2^\ell$ is not negligible.
Algorithm 3 eliminated $[0]P_2$ by using controlled additions instead of additions. One could similarly design an algorithm using precomputed points to perform controlled additions, where the control bit is computed as $[i = 0]$. However, it is simple to avoid this failure by changing the table of $2^\ell - 1$ precomputed multiples $[1]P_2, \ldots, [2^\ell - 1]P_2$ to a table of $2^\ell$ precomputed points $T, T + [1]P_2, \ldots, T + [2^\ell - 1]P_2$ avoiding infinity. This also adds $T$ to the output of each $P$ step, but one can cancel out this contribution by adding the opposite offset $-T$ to the tables for the $Q$ steps. Shor's algorithm uses the same number of additions of multiples of $P$ as of $Q$. One can also use a separate offset for each step.

Point addition algorithm with precomputed points
Using these lookup actions instead of the regular $x_2$ and $y_2$ additions, we get Algorithm 4. Note that aside from the addition by $a$, all additions have become regular CNOT additions. Otherwise, nothing has changed besides the 6 lookups.

Window size
In order to know the ideal window size, we need to know the cost of a qRAM lookup compared to the cost of a Toffoli gate. This information is currently not available. In the unlikely case that this turns out to be very inexpensive, the cost of pre-computation compared to the cost of quantum computation also matters.
We summarize the cost of different window sizes in Table 4. Note that the qubit cost for the semi-classical Fourier transform increases linearly with the window size; for example, $\ell = 7$ requires 6 more qubits than $\ell = 1$.
Using the estimate of $2(2^\ell - 1)$ TOF gates per lookup from [BGB+18], after optimizing for $\ell$, at $\ell = 14$ Algorithm 4 uses 58,401,000 TOF gates, which is approximately 10 times less than the case without windowing.

Results
The only step requiring ancillary qubits is division, which needs 4n + log(n) + 8 ancillary qubits.Point addition needs 3n qubits for input and output and 1 control qubit on which We do not give an exact number of CNOT gates due to our upper bound of the cost of multiplication, leaving the total count of CNOT gates at O(n 3 ).In Table 5 several numerical examples are given.We used java to calculate an LUP-decomposition and then calculate the number of gates.The total number of TOF gates is simply the number of TOF gates for a single step multiplied by 2n + 2. The depth upper bound is calculated by keeping track of whether 2 or more gates can be executed at the same time, increasing the counter if they cannot.These algorithms are not optimized for depth, as such the depth is of the same order as the number of TOF gates.We can see that the number of Toffoli gates is strongly dependent on the number of Toffoli gates in the division: 48n 3 + 352n 2 log(n) + 512n 2 is purely from the division, with the log(n) term coming specifically from the incrementer circuit.If we apply the windowing variant, we get Table 6.
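The division share of the total can be checked against the per-division count from Section 6.1. The sketch below assumes two divisions per point-addition step, an assumption chosen because it reproduces the stated leading terms, and again treats $\log(n)$ as an integer parameter `log_n`; the function names are illustrative.

```python
def division_toffolis(n, log_n):
    """Toffoli count of one GCD-based division, excluding its multiplication."""
    return 12 * n**2 + (88 * n - 44) * log_n + 116 * n - 62

def division_share_of_total(n, log_n, divisions_per_addition=2):
    """Division Toffolis over all 2n + 2 point-addition steps of Shor's algorithm."""
    steps = 2 * n + 2
    return steps * divisions_per_addition * division_toffolis(n, log_n)

def leading_terms(n, log_n):
    """The terms quoted in the text as coming purely from the division."""
    return 48 * n**3 + 352 * n**2 * log_n + 512 * n**2

def lower_order_terms(n, log_n):
    """Remaining division terms, all of lower order than n^2."""
    return 176 * n * log_n - 176 * log_n + 216 * n - 248
```

Under this assumption the product expands exactly into the quoted leading terms plus terms of order $n\log(n)$ and below.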
The rest of this section compares our results to previous results and to the independent paper [HJN+20]. Some of these comparisons are to algorithms for the prime-field case. One would expect carries in the prime-field case to use extra Toffoli gates, but it is not obvious what overall impact to expect, and it is not obvious that the prime-field case should require any extra qubits. As noted in Section 1, we use $7n + \lfloor\log_2 n\rfloor + 9$ qubits for binary-field ECDLP, while the independent paper [HJN+20] uses $8n + 10.2\lfloor\log_2 n\rfloor - 1$ qubits for prime-field ECDLP, and [RNSL17] used $9n + 2\lceil\log_2(n)\rceil + 10$ qubits for prime-field ECDLP.
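As a quick numerical check of these qubit formulas, the sketch below evaluates the binary-field count and the [RNSL17] prime-field count. The function names are hypothetical. The rounding is an assumption: a floor for our formula, as in the abstract, and a ceiling for [RNSL17]. With that rounding the sketch reproduces the 4015 and 4719 qubits quoted later in this section.

```python
def qubits_binary_field(n: int) -> int:
    """7n + floor(log2 n) + 9, the binary-field count from this paper."""
    floor_log2 = n.bit_length() - 1
    return 7 * n + floor_log2 + 9


def qubits_rnsl17(n: int) -> int:
    """9n + 2*ceil(log2 n) + 10, the prime-field count from [RNSL17]
    (the ceiling is an assumption about the rounding)."""
    ceil_log2 = (n - 1).bit_length()
    return 9 * n + 2 * ceil_log2 + 10


print(qubits_binary_field(571))  # prints 4015
print(qubits_rnsl17(521))        # prints 4719
```

Integer bit lengths are used instead of floating-point logarithms so that the rounding is exact for every n.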

Comparison to other gcd-based inversion algorithms
The algorithm we used for inversion and division is an improvement over the inversion algorithm in [RNSL17], which is based on Kaliski's algorithm. That algorithm uses a large number of controlled Toffoli and controlled CNOT gates, which are translated into 3 Toffoli gates and 1 Toffoli gate respectively. This causes a large increase in Toffoli gate count, with the prime-field case using $32n^2\log(n)$ Toffoli gates. An adaptation of Kaliski's algorithm to the binary case would replace some integer additions with binary polynomial additions but would still involve some integer comparisons.
As for space, we save $n + \log(n) + 2$ ancillary qubits compared to [RNSL17]. We get this benefit by using part of the input to store decision qubits, saving $n$ qubits; using an incrementer circuit that uses dirty qubits rather than clean ones, saving $\log(n)$ qubits; and using just one extra control qubit, compared to the three required by [RNSL17].
We commented in a preliminary version of this paper that starting from the integer algorithms in [BY19] and building a prime-field variant of our binary-field division algorithm is likely to reduce cost in the prime-field case. The independent paper [HJN+20, Appendix A.3] says that its approach to prime-field inversions is "nearly identical" to one of the algorithms in [BY19].

Comparison to prime-field point-addition algorithms
Our approach to addition on the curve $y^2 + xy = x^3 + ax^2 + b$ is conceptually the same as the approach in [RNSL17] to addition on the curve $y^2 = x^3 + ax + b$. However, [RNSL17] uses extra space for inversion output, as its division algorithm requires separate steps for inversion and multiplication.
For field multiplications, we benefit from the recent space-efficient reversible Karatsuba algorithm from Van Hoof [Hoo20] for multiplying polynomials.There is also a recent space-efficient reversible Karatsuba algorithm from Gidney [Gid19] for multiplying integers, but the algorithm from [Hoo20] includes reduction modulo a polynomial, and it is not clear whether this is possible for the algorithm from [Gid19] without extra space.

Comparison to previous binary-field point-addition algorithms
Amento, Rötteler and Steinwandt [ARS13] use projective coordinates to avoid divisions. They need only 13 multiplications per step, which would result in $26n^{\log_2(3)+1}$ as the leading term in their Toffoli gate count if the multiplications were implemented using [Hoo20].
However, this use of projective coordinates has two disadvantages. First, the formulas in [ARS13] use many ancillary qubits and separate input and output qubits, leading to $10n$ qubits in one point-addition step even with space-efficient multiplications. This already exceeds the $7n + \log(n) + 9$ qubits we use.
Second, projective coordinates have a much larger space disadvantage not pointed out in [ARS13]. What is easy to calculate in projective coordinates is an addition that writes its output into fresh registers initialized to 0, 0, 0 and that keeps a copy of its input. Composing a series of these additions consumes more and more qubits for intermediate results, increasing the number of qubits to the scale of $n^2/\log n$ if the window size is on the scale of $\log n$. In reversible computations there is a generic checkpointing technique by Bennett and Tompa [Ben89] that somewhat reduces space for a long series of operations, but this requires computing each operation many times and still needs superlinear space.
In non-quantum computations, there is no requirement of reversibility, and one can simply throw the intermediate results away. However, Shor's algorithm requires a quantum computation that produces only the final result $[k]P + [\ell]Q$. Uncomputing intermediate results is easy in affine coordinates (after adding $P_2$ to $P_1$, simply add $-P_2$ to the output $P_1 + P_2$ to uncompute $P_1$) but not in projective coordinates, because naively adding $-P_2 = (x_2, y_2 + x_2)$ to the point $(X_3, Y_3, Z_3)$ results in a representation of $P_1$ that is not always equal to $(X_1, Y_1, Z_1)$.
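The affine identity used here, that adding $-P_2$ to $P_1 + P_2$ returns exactly the coordinates of $P_1$, can be checked classically on a toy curve. The sketch below is not the reversible circuit from this paper; it is a plain implementation of the standard affine group law on $y^2 + xy = x^3 + ax^2 + b$. The field $\mathbb{F}_{2^4}$, the modulus $z^4 + z + 1$, the coefficients $a = b = 1$, and all function names are hypothetical choices made only for illustration.

```python
# Toy parameters (hypothetical): F_2[z]/(z^4 + z + 1), curve coefficients A, B.
M = 4
MODULUS = 0b10011
A, B = 1, 1


def gf_mul(u, v):
    """Multiply two field elements stored as bit masks of polynomials."""
    result = 0
    while v:
        if v & 1:
            result ^= u
        v >>= 1
        u <<= 1
        if u >> M:  # degree reached M: reduce modulo z^4 + z + 1
            u ^= MODULUS
    return result


def gf_inv(u):
    """Inverse of a nonzero element, computed as u^(2^M - 2)."""
    result = 1
    for _ in range(2**M - 2):
        result = gf_mul(result, u)
    return result


def on_curve(x, y):
    x_sq = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x_sq, x) ^ gf_mul(A, x_sq) ^ B


def neg(P):
    """-(x, y) = (x, y + x); None stands for the point at infinity."""
    return None if P is None else (P[0], P[1] ^ P[0])


def add(P, Q):
    """Affine group law, including doubling and the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y1 ^ y2 == x1:  # Q = -P
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))  # doubling slope
        x3 = gf_mul(lam, lam) ^ lam ^ A
        return (x3, gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3))
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))  # chord slope
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    return (x3, gf_mul(lam, x1 ^ x3) ^ x3 ^ y1)


POINTS = [(x, y) for x in range(2**M) for y in range(2**M) if on_curve(x, y)]

# Adding P2 and then adding -P2 restores the coordinates of P1 for every pair.
assert all(add(add(P1, P2), neg(P2)) == P1 for P1 in POINTS for P2 in POINTS)
```

Because an affine point has a unique representation, the round trip restores the same bit pattern, which is what allows the intermediate register to be uncomputed. A projective triple has many representations of the same point, so the analogous check on raw coordinates can fail.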
For the same reasons, the projective Montgomery ladder, commonly used to improve the efficiency of arithmetic in non-quantum variable-base-point scalar multiplication, requires much more space in a quantum setting.The Montgomery ladder is also not as efficient as windowing for fixed base points.

Comparison of Toffoli gates, T-gates, and depth for ECDLP
Detailed proposals for quantum computers usually implement a Toffoli gate as a series of 7 "T-gates" and 9 "Clifford gates". The Clifford gates are expected to be much less expensive than the T-gates. Various algorithms have been optimized at the Clifford+T level, often reducing a Toffoli gate to fewer than 7 T-gates. One can also consider parallel gates: a Toffoli gate can be reduced to T-depth 4, or T-depth 3 with an extra Clifford gate, or T-depth 1 with several extra Clifford gates and a few ancillary qubits.
However, developing algorithms at the Toffoli level has advantages for testability on today's computers, as explained in [RNSL17]. The algorithms in [RNSL17] for prime-field ECDLP are thus developed at the Toffoli level, and are reported to use at most $9n + 2\lceil\log_2(n)\rceil + 10$ qubits and at most $448n^3\log_2(n) + 4090n^3$ Toffoli gates, with a slightly smaller Toffoli depth. For example, [RNSL17] reports 4719 qubits, $2^{40.1}$ Toffoli gates, and Toffoli depth $2^{39.9}$ for a 521-bit prime field.
We also work at the Toffoli level, using $7n + \lfloor\log_2(n)\rfloor + 9$ qubits and just $48n^3 + 8n^{\log_2(3)+1} + 352n^2\log_2(n) + 512n^2 + O(n^{\log_2(3)})$ Toffoli gates: for example, 4015 qubits and $2^{33.3}$ Toffoli gates for a 571-bit binary field. Our upper bounds on depth in Table 5 are somewhat above our gate counts because we consider the depth of all gates, not just Toffoli gates. Note that we have not optimized depth.
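The explicit terms of this Toffoli formula can be evaluated directly as a sanity check. The sketch below is only an estimate under stated assumptions: it drops the $O(n^{\log_2 3})$ term, uses the real-valued $\log_2(n)$ without rounding, and uses a hypothetical function name. The exact counts are the ones in Table 5.

```python
import math


def toffoli_explicit_terms(n: int) -> float:
    """48n^3 + 8n^(log2(3)+1) + 352n^2 log2(n) + 512n^2, lower-order term omitted."""
    log_n = math.log2(n)
    return (48 * n**3
            + 8 * n ** (math.log2(3) + 1)
            + 352 * n**2 * log_n
            + 512 * n**2)


estimate = toffoli_explicit_terms(571)
print(f"about 2^{math.log2(estimate):.1f} Toffoli gates")  # about 2^33.3
```

For n = 571 the $48n^3$ term alone is about $2^{33.1}$, so the cubic term from the division accounts for most of the total.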
Windowing then saves a logarithmic factor, reducing $2^{33.3}$ Toffoli gates to $2^{29.4}$ Toffoli gates in Table 6 for a 571-bit binary field. This implies an upper bound of $2^{32.2}$ T-gates, and an upper bound on T-depth of $2^{31.0}$ while still using only slightly over 4000 qubits.
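The step from Toffoli gates to T-gates is a multiplication by a constant, which becomes an addition in the exponent. The short sketch below redoes this arithmetic, assuming 7 T-gates per Toffoli gate and, as an inference from the quoted bound rather than something stated explicitly, T-depth 3 per Toffoli gate. The variable names are hypothetical.

```python
import math

toffoli_log2 = 29.4        # windowed Toffoli count for n = 571, from Table 6
T_GATES_PER_TOFFOLI = 7    # standard decomposition mentioned in the text
T_DEPTH_PER_TOFFOLI = 3    # assumed: the T-depth-3 variant

t_gates_log2 = toffoli_log2 + math.log2(T_GATES_PER_TOFFOLI)
t_depth_log2 = toffoli_log2 + math.log2(T_DEPTH_PER_TOFFOLI)

print(f"T-gates: 2^{t_gates_log2:.1f}")  # T-gates: 2^32.2
print(f"T-depth: 2^{t_depth_log2:.1f}")  # T-depth: 2^31.0
```

Both values are upper bounds in the sense that every Toffoli gate is charged the full decomposition cost and the Toffoli gates are treated as sequential.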

Conclusion
The results in Table 5 show concrete numbers of logical qubits required to perform Shor's algorithm to solve the discrete logarithm problem on binary elliptic curves. We obtained these results by optimizing the multiplication and division circuits. The number of Toffoli gates is high because we chose algorithms optimized for space. Using the alternative division Algorithm 2 with cryptographic field sizes, the number of Toffoli gates for division could be cut by about 80% at the cost of doubling the number of qubits. Furthermore, optimizing for depth might give a depth below the upper bounds reported here without changing the number of gates. Additionally, Table 6 shows a reduction in the number of Toffoli gates when the windowing method is applied.
The depth figures given so far are only upper bounds: both the multiplication and the division algorithm could benefit from a further look at how to optimize their depth. The division algorithm specifically can also benefit from a better incrementer circuit. Finally, we suspect that a better algorithm exists for multiplication by $z^{\lceil n/2 \rceil} + 1$ modulo $m(z)$.
(a, b) → (a ⊕ b, b). It is its own inverse: applying a CNOT to (a ⊕ b, b) results in (a ⊕ b ⊕ b, b) = (a, b). By abuse of notation we write this as a ← CNOT(a, b) in algorithms to highlight the position that changes.
• The Toffoli (TOF) gate is the reversible equivalent of AND. This gate has 3 qubits as inputs and adds the first qubit multiplied with the second qubit to the third qubit: (a, b, c) → (a, b, c ⊕ (a • b)). It is also its own inverse: (a, b, c ⊕ (a • b) ⊕ (a • b)) = (a, b,

(a) The CNOT gate. (b) The TOF gate. (c) The swap operation.

Table 1: List of irreducible polynomials for binary finite fields used in this paper.
Fourier transform (QFT), consisting of specific phase shift gates and Hadamard
Binary shift circuit for $\mathbb{F}_{2^{10}}$ with $g_0 + \cdots + g_9 z^9$ as the input and

Table 2: Comparison of various instances of division Algorithms 1 and 2. Field polynomials from Table 1. Depths and gate counts are upper bounds since a generic algorithm is used rather than optimizing for specific fields.
in cryptography and have been used in recent variants of Shor's algorithm [Gid19, HJN+20].

Table 4: Comparison of window sizes for n = 233.

Table 5: Qubit and gate count for Shor's algorithm for binary elliptic curves. Field polynomials from Table 1. CNOT count given is an upper bound.