Quantum Period Finding against Symmetric Primitives in Practice

We present the first complete implementation of the offline Simon's algorithm, and estimate its cost to attack the MAC Chaskey, the block cipher PRINCE and the NIST lightweight candidate AEAD scheme Elephant. These attacks require a reasonable number of qubits, comparable to the number of qubits required to break RSA-2048. They are faster than other collision algorithms, and the attacks against PRINCE and Chaskey are the most efficient known to date. As Elephant has a key smaller than its state size, the algorithm is less efficient and ends up more expensive than exhaustive search. We also propose an optimized quantum circuit for boolean linear algebra as well as complete reversible implementations of PRINCE, Chaskey, spongent and Keccak, which are of independent interest for quantum cryptanalysis. We stress that our attacks could be applied in the future against today's communications, and recommend caution when choosing symmetric constructions for cases where long-term security is expected.


Introduction
Due to Shor's algorithm [58], quantum computing has significantly changed cryptography, despite its currently theoretical nature.
In public-key cryptography, this has led to the thriving field of quantum-safe cryptography and an ongoing competition organized by the NIST [53] will propose new standards for key exchange and signatures. In the meantime, quantum circuits for Shor's algorithm have been proposed and improved over time [34,31,3,35], leading to a better understanding of the precise resources needed for a quantum computer to be threatening.
In symmetric cryptography, it has long been thought that the only threat was the quantum acceleration on exhaustive search. This has changed with works on dedicated cryptanalysis of block ciphers [14], hash functions [38], and the many cryptanalyses that rely on Simon's algorithm [46,44,10,49,13,12]. Nevertheless, work on quantum circuits focuses mainly on exhaustive key search, and specifically on AES key search [40,23,48,1,33]. Hence, many quantum attacks in symmetric cryptography are either only known asymptotically, or only with rough estimates.
Our Contributions. We present the first quantum circuits that implement the offline Simon's algorithm [12], and propose cost estimates for the attack against the MAC Chaskey, the block cipher PRINCE, and the NIST lightweight candidate AEAD scheme Elephant.
We stress that these attacks, like Shor's algorithm, could be applied against today's communications: a patient attacker could gather the required data now and wait until a powerful enough quantum computer is available to run the attack.
Using Q#, we designed and implemented multiple quantum circuits of independent interest: an efficient reversible circuit to solve boolean linear equations, and optimized quantum circuits for Chaskey, PRINCE and the two permutations used in Elephant, spongent and Keccak.
We find that PRINCE and Chaskey are especially vulnerable to this attack, requiring only $2^{65}$ qubit operations to recover the key. For comparison, Shor's algorithm requires $2^{31}$ similar operations to break RSA-2048. Elephant suffers much less: it has a larger state size, with the same data limitation and key size. This makes the Elephant cryptanalysis slightly more costly than exhaustive search.
Outline. Section 2 presents the basics of quantum computing, the constructions we will attack and the generic quantum attacks against them. Section 3 presents the offline Simon's algorithm, the quantum algorithm we implement. Section 4 presents Simon-based cryptanalysis and details for each construction the attack model and principles. In Section 5, we propose a new optimized quantum circuit to solve boolean linear equations reversibly. Section 6 presents our design of quantum circuits for the constructions we attack, as well as our optimization strategies. Section 7 details the cost estimates of our attacks.

Quantum computing
For our purposes a quantum computer is a collection of qubits, objects with a joint quantum state represented by a projective complex vector space of dimension $2^n$, for $n$ qubits. We model the quantum computer as a peripheral of some classical controller [41], which alters the quantum state by applying gates. These interventions apply to one or more qubits, and the controller is free to apply gates simultaneously to disjoint sets of qubits. The cost of a quantum algorithm is then measured in the number of interventions applied. For quantum computers today, and for surface codes in the future, "gates" are not distinct physical objects, but an operation that we perform on the quantum computer. Hence, $2^{65}$ gates does not imply $2^{65}$ physical components, but it does imply performing some process $2^{65}$ times, and so we focus on the total cost of these processes. For this reason, we will often refer to gates as "operations" or "qubit operations".
The algorithms we analyze definitely belong to a fault-tolerant era of quantum computing, where quantum error correction enables large computations. As surface codes are the most promising error correction candidate today [29], we focus on costs relevant to surface codes. We pay special attention to the number of T-gates, which are the most expensive gates on surface codes, and we do not give any extra cost to measurements.
While the attack depends on quantum interference, the most expensive subroutines are quantum emulations of classical algorithms: block ciphers, linear algebra, and memory access. Thus, we can design and test these subroutines even at cryptographic sizes. We use the Q# programming language for this [61].
We use the Clifford+T gate set with measurements, though we design circuits using only X, CNOT, Toffoli, and AND operations. These operations act like classical bit operations on bitstrings, hence they are efficient to simulate. The Toffolis and ANDs are further decomposed into Clifford+T operations, and only Toffoli and AND require T operations. Figure 1 summarizes the quantum gates we use to implement reversible classical circuits. We did not explore any fully quantum techniques (such as measurement-based uncomputation) for these classical tasks, beyond atomic operations present in Q#, such as measurement-based ANDs.
NIST's security levels for post-quantum cryptography emphasize the maximum circuit depth available to an adversary [53]. Since Grover-like algorithms parallelize badly [62], attacks that finish quickly cost much more than attacks that are allowed to take a long time. While this also affects our attack, our goal is to demonstrate another aspect of post-quantum security, rather than to compare to post-quantum asymmetric cryptography, so we do not account for depth limits.

Generic designs
Even-Mansour. The Even-Mansour construction [27], presented in Figure 2, is a very minimal block cipher, with provable classical security: assuming P has been chosen randomly, any key recovery requires an amount of time T and data D that satisfies $TD \ge 2^n$.
Fig. 2: The Even-Mansour construction (input x, permutation P, keys K1 and K2).
FX construction. The FX construction [45] is a simple way to extend the key length of a block cipher: it adds two whitening keys, at the input and the output of the cipher, as presented in Figure 3.
Fig. 3: The FX construction. $E_K$ is a block cipher.

Target constructions
Chaskey. Chaskey [52] is a lightweight MAC oriented to 32-bit architectures. It uses a mode that can be seen as a combination of Even-Mansour and CBC-MAC, described in Figure 4, with a 128-bit ARX permutation π. It uses a 128-bit key $K$, from which the key $K_1$ is derived: $K_1 = 2K$, with a multiplication in the finite field $\mathbb{F}_2[X]/(X^{128} + X^7 + X^2 + X + 1)$.
It outputs a t-bit tag, with $t \le 128$ specified by the user. In the original design, the permutation contained 8 rounds. As the 7-round permutation was broken [50], Chaskey with a 12-round permutation is included in the standard ISO/IEC 29192-6 [39].
Chaskey has a data limitation of $2^{48}$ message blocks with the same key, which corresponds to $2^{55}$ bits.
Classical security. Because of the Even-Mansour construct, Chaskey can be attacked with a time-data tradeoff that satisfies $TD \ge 2^{128}$, which is why the data is limited to $2^{48}$ blocks.
PRINCE. PRINCE [15] is a low-latency block cipher, with a 64-bit block size and a 128-bit key, split into two 64-bit keys, $K_0$ and $K_1$. It follows the FX construction, as presented in Figure 5.
Notably, some microcontrollers use PRINCE to encrypt memory [56].
Fig. 5: PRINCE, an FX construction around PRINCE-core.
Classical security. PRINCE claims a data-time tradeoff of $TD \ge 2^{126}$. It has been analyzed extensively [42,60,28,21,25,24,57,32], and so far the claim holds. Very recently, a new version of PRINCE, PRINCEv2 [17], was proposed. While this new version is very close to PRINCE, it does not have the FX structure, and its rounds alternate between $K_0$ and $K_1$. This makes PRINCEv2 immune to the attack we present here.

Elephant. Elephant [6] is an authenticated encryption with associated data (AEAD) scheme, and a 2nd-round candidate in the NIST lightweight authenticated encryption competition [54]. It is a block-oriented construction whose encryption shares some similarities with the counter mode, with an encrypt-then-MAC authentication.
Elephant uses a 128-bit key $K$ and a 96-bit nonce $N$. It comes in 3 variants, with a different permutation $P$ and a different security level:
• Elephant-160 uses the 160-bit permutation spongent-π[160] [9]. Its expected classical security is $2^{112}$ with data limited to $2^{53}$ bits processed.
• Elephant-176 uses the 176-bit permutation spongent-π[176] [9]. Its expected classical security is $2^{127}$ with data limited to $2^{53}$ bits processed.
• Elephant-200 uses the 200-bit permutation Keccak-f[200] [5]. Its expected classical security is $2^{127}$ with data limited to $2^{77}$ bits processed.
The encryption of a message is presented in Figure 6. The mask values are computed from the expanded key $K' = P(K \| 0)$ and two LFSRs, $\varphi_a$ and $\varphi_b$. For encryption, only $j = 0$ is used. Masks with $j = 1$ and $j = 2$ are used to compute the tag.
A new version of Elephant, Elephant v2 [7], has been proposed for the third round of the NIST lightweight competition. There are only two differences between the versions: the encryption uses masks with $j = 1$, and the tag computation is different. This does not affect our attack.

Generic attacks
There are two types of attacks that can always be applied to the structures we are attacking.
Key search. As the constructions contain some secret material, it is possible to brute-force it. Classically, this will cost $2^k$ computations of the construction.
Its quantum equivalent uses amplitude amplification [18] to recover the key, and requires $\frac{\pi}{2} 2^{k/2}$ computations of the construction, assuming one computation can uniquely identify the key.
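As a rough numerical check of these counts, the following sketch evaluates the textbook number of amplitude-amplification iterations for a $k$-bit key. It assumes the usual iteration count $\lfloor \pi / (4 \arcsin 2^{-k/2}) \rfloor$ and charges two computations of the construction per iteration (one to compute, one to uncompute). The function names and this accounting are illustrative assumptions, not taken from our implementation.

```python
import math


def grover_iterations(k: int) -> int:
    """Textbook iteration count to find one marked key among 2**k candidates."""
    theta = math.asin(2.0 ** (-k / 2))
    return math.floor(math.pi / (4 * theta))


def success_probability(k: int, iterations: int) -> float:
    """Probability of measuring the marked key after the given number of iterations."""
    theta = math.asin(2.0 ** (-k / 2))
    return math.sin((2 * iterations + 1) * theta) ** 2


def log2_computations(k: int) -> float:
    """log2 of (pi/2) * 2**(k/2), assuming two computations per iteration."""
    return math.log2(math.pi / 2) + k / 2


if __name__ == "__main__":
    print(grover_iterations(4), success_probability(4, grover_iterations(4)))
    print(log2_computations(128))
```

For a 128-bit key, this accounting gives a little under $2^{65}$ computations of the construction.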
Collision finding. The Even-Mansour construction can be attacked by looking for collisions [26]: suppose that we have queried $2^d$ Even-Mansour encryptions $E(x)$. For any δ, we can compute a list of elements of the form $E(x) \oplus P(x \oplus \delta)$. If the list happens to contain two messages $x, y$ such that $x \oplus y \oplus \delta = K_1$, then we have $P(x \oplus \delta) = P(y \oplus K_1)$ and conversely $P(y \oplus \delta) = P(x \oplus K_1)$. Hence, the list will contain a collision.
As the list is of size $2^d$, this will occur with probability $2^{2d-n}$, which means we need to try $2^{n-2d}$ distinct δ. Overall, as one try costs $2^d$, the total time cost is $T = 2^n/2^d$, with $2^d$ data, for a tradeoff of $DT = 2^n$.
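The collision search can be illustrated on a toy instance. The sketch below assumes that the list elements are $E(x) \oplus P(x \oplus \delta)$, which is the form implied by the two equalities above, and uses a small random permutation as a hypothetical stand-in for $P$. Every candidate collision is checked against all the queried data, since random collisions occur at a similar rate to the useful ones.

```python
import random
from collections import defaultdict


def random_permutation(n_bits, seed):
    """Toy public permutation P on n_bits bits (hypothetical stand-in)."""
    rng = random.Random(seed)
    table = list(range(1 << n_bits))
    rng.shuffle(table)
    return table


def even_mansour(P, k1, k2, x):
    return P[x ^ k1] ^ k2


def collision_attack(P, data, n_bits):
    """data maps each queried x to E(x); returns (k1, k2) or None."""
    for delta in range(1 << n_bits):
        buckets = defaultdict(list)
        for x, ex in data.items():
            value = ex ^ P[x ^ delta]
            for y in buckets[value]:
                # A collision suggests k1 = x ^ y ^ delta; verify it on all data.
                k1 = x ^ y ^ delta
                k2 = ex ^ P[x ^ k1]
                if all(P[z ^ k1] ^ k2 == ez for z, ez in data.items()):
                    return k1, k2
            buckets[value].append(x)
    return None


N_BITS, D_BITS = 10, 4
P = random_permutation(N_BITS, seed=1)
rng = random.Random(2)
K1, K2 = rng.randrange(1 << N_BITS), rng.randrange(1 << N_BITS)
queries = rng.sample(range(1 << N_BITS), 1 << D_BITS)
data = {x: even_mansour(P, K1, K2, x) for x in queries}
recovered = collision_attack(P, data, N_BITS)
```

Here the loop tries δ in increasing order rather than at random, which is enough for a toy example since some δ of the form $K_1 \oplus x \oplus y$ is always reached.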
Quantum version. There are multiple quantum algorithms to compute collisions. The most well known matches the query lower bound of $\Omega(2^{n/3})$ [19]. However, it requires the QRAM model, and there is no known time-efficient implementation of this algorithm.
More recently, a quantum algorithm based on distinguished points has been proposed [22], with a time cost in $O(2^{2n/5})$ or $O(2^{3n/7})$, depending on whether one of the colliding functions can be queried quantumly or not. This algorithm was used in [37] to propose quantum attacks on Even-Mansour with the tradeoff $DT^6 = 2^{3n}$.
Collisions for FX. The FX construction can be attacked simply by checking whether or not the Even-Mansour attack works given an inner key guess. This changes the tradeoffs, replacing $n$ with $n + k$.
Remark 1. One may consider that searching for the key will always be more expensive than looking for collisions. This is not always the case: collision-finding depends on the state size, and key search on the key size (though the two are often equal).

The offline Simon's algorithm
The following sections present the algorithmic core of our attacks, which amounts to finding a periodic function.
From an abstract point of view, our attacks can be seen as instances of Problem 1, the offline Simon's problem, which asks for an index $i_0$ and a value $s$ given two functions $E$ and $f$. Solving this problem reduces to finding a periodic function, as the function $E(x) \oplus f(i_0, x)$ has period $s$. Here, $E$ will be a secret function (a block cipher, for example) that we can only query classically, and $f$ will be computable quantumly.

Simon's algorithm
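Simon's algorithm [59] solves the following problem in polynomial time:
Problem 2 (Simon's Problem). Let $n$ be an integer and $X$ a set. Let $f : \{0,1\}^n \to X$ be a function such that there exists $s \in \{0,1\}^n \setminus \{0\}$ with, for all $x \ne y$, $f(x) = f(y) \Leftrightarrow x \oplus y = s$. Given oracle access to $f$, find $s$.
It does so using Circuit 1, which is described as Algorithm 1.
Algorithm 1
Output: $j$ with $j \cdot s = 0$
1: Initialize two n-bit registers: $|0\rangle|0\rangle$
2: Apply H gates on the first register, to compute $\sum_{x=0}^{2^n-1} |x\rangle|0\rangle$
3: Apply $O_f$, to compute $\sum_{x=0}^{2^n-1} |x\rangle|f(x)\rangle$
4: Reapply H gates on the first register, to compute $\sum_{j=0}^{2^n-1} \sum_{x=0}^{2^n-1} (-1)^{x \cdot j} |j\rangle|f(x)\rangle$
We can factor the $x$ that have the same $f(x)$, and rewrite the state as $\sum_{j=0}^{2^n-1} \sum_{x \in \{0,1\}^n/(s)} \left((-1)^{x \cdot j} + (-1)^{(x \oplus s) \cdot j}\right) |j\rangle|f(x)\rangle$. Now, from Algorithm 1, we see that the $j$ we can measure must fulfill $(-1)^{x \cdot j} + (-1)^{(x \oplus s) \cdot j} \ne 0$, that is, $s \cdot j = 0$. Hence, this routine can only produce values orthogonal to the secret.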

Remark 3. If the function is not periodic, then random values will be measured, and the set of values can be of rank $n$.
Full algorithm. From this circuit, we recover the complete value of $s$ by making $O(n)$ queries and solving the resulting linear system classically.
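To make the orthogonality property concrete, the sketch below computes, for a small periodic function, the exact distribution of the value $j$ measured at the end of the routine, and then recovers $s$ as the only nonzero vector orthogonal to every possible outcome. The toy function `f`, the 4-bit size and the brute-force solver are assumptions made for illustration; a real attack would sample $O(n)$ outcomes and use Gaussian elimination.

```python
def dot(a: int, b: int) -> int:
    """Inner product over GF(2) of two bit vectors stored as integers."""
    return bin(a & b).count("1") & 1


def measurement_distribution(f, n_bits):
    """Exact probability of each outcome j after the second layer of H gates."""
    size = 1 << n_bits
    preimages = {}
    for x in range(size):
        preimages.setdefault(f(x), []).append(x)
    probabilities = []
    for j in range(size):
        weight = 0
        for xs in preimages.values():
            # Amplitude of |j>|v> is the signed sum over the preimages of v.
            amplitude = sum(-1 if dot(x, j) else 1 for x in xs)
            weight += amplitude * amplitude
        probabilities.append(weight / (size * size))
    return probabilities


def orthogonal_candidates(samples, n_bits):
    """Nonzero vectors orthogonal to every sample (brute force, toy sizes only)."""
    return [c for c in range(1, 1 << n_bits)
            if all(dot(c, j) == 0 for j in samples)]


N_BITS = 4
SECRET = 0b1010


def f(x):
    """Toy two-to-one function with f(x) = f(x ^ SECRET)."""
    return min(x, x ^ SECRET)


probs = measurement_distribution(f, N_BITS)
support = [j for j, p in enumerate(probs) if p > 0]
candidates = orthogonal_candidates(support, N_BITS)
```

In this ideal two-to-one case the outcomes are uniform over the $2^{n-1}$ vectors orthogonal to $s$.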
Reversible implementations of Simon's algorithm. Without the final measurement, Algorithm 1 becomes a reversible quantum circuit that computes in its first register the uniform superposition of values orthogonal to s. Hence, if we apply it multiple times in parallel, we can reversibly compute the value of s, assuming we also have a quantum circuit for the linear algebra. We present such a circuit in Section 5.
Simon's algorithm as a distinguisher. As Simon's algorithm can compute a period, it can also determine whether a given function is periodic or not. With enough sampled vectors, their rank will be at most $n - 1$ if the function is periodic, and will likely be $n$ if the function is not. This principle can be used in quantum distinguishers.

Grover-meets-Simon
The Grover-meets-Simon algorithm [49] performs a quantum search that uses Simon's algorithm to identify the correct guess. This is possible as Simon's algorithm can be implemented reversibly. Grover-meets-Simon solves the following problem:
Problem 3 (Search for a periodic function). Let $n$ be an integer and $X$ a set. Let $f : \{0,1\}^k \times \{0,1\}^n \to X$ be a function such that there exists a unique $i_0$ such that $f(i_0, \cdot)$ is periodic. Find $i_0$ and the period of $f(i_0, \cdot)$.
Algorithm 2 solves this problem by simply testing whether or not the function $f(i, \cdot)$ is periodic, using Simon's algorithm as in Circuit 2.
This algorithm has a cost of $O(n 2^{k/2})$ queries and $O(n^3 2^{k/2})$ time, as each iteration of the quantum search requires an application of Simon's algorithm, which needs $O(n)$ queries plus $O(n^3)$ for the linear algebra.

The offline Simon's algorithm
We can see Problem 1, the Offline Simon's problem, as a special case of Problem 3, a search for a periodic function, and solve it with Algorithm 2. Indeed, the function $x \mapsto E(x) \oplus f(i, x)$ will be periodic if and only if $i = i_0$, and its period will be $s$. The main limitation of this approach is that we need quantum query access to the periodic function, which is not possible if the function $E$ is only accessible classically. The offline Simon's algorithm [12] proposes two improvements over the Grover-meets-Simon algorithm to overcome this restriction.
Reusing quantum queries. The first improvement comes from the fact that the periodic function, $E(x) \oplus f(i, x)$, has a very specific two-part structure, where the function $E(x)$ is independent of $i$. This means each occurrence of the Simon test makes the exact same query to $E$. This allows a slightly different approach for the Simon test: the queries to $E$ are done once at the beginning of the procedure, and then reused for each test, as shown in Algorithm 3, which uses Circuit 3 instead of Circuit 2.
This new approach reduces the number of quantum queries to E from exponential to polynomial.
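Using classical queries. The second improvement computes the states $\sum_x |x\rangle|E(x)\rangle$ from classical queries to $E$ over its whole domain, instead of quantum queries.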

Simon's algorithm with additional collisions and concrete estimates
In practice, the promise of Simon's algorithm is only partially fulfilled: for the periodic functions we consider, we can have $f(x) = f(y)$ with $x \ne y$ and $x \ne y \oplus s$. This impacts Simon's algorithm, but [11] shows that for almost all functions, the cost overhead is negligible, via the following theorem:
Theorem 1. [...] $\{0,1\}^m$ such that the offline Simon's algorithm, repeating $\frac{\pi}{4 \arcsin \sqrt{2^{-k}}}$ iterations with $n + k + \alpha + 1$ queries per iteration, succeeds with probability lower than [...]
Theorem 1 tells us that Simon's algorithm needs only $(n + k + \alpha + 1)$ queries, and it allows us to use functions with a small output size, which roughly halves the required number of qubits and slightly reduces the computational cost of $f$. This approach shares some similarities with the oracle compression technique from [51]. However, we do not consider a random set of functions applied to the output, but a carefully chosen function such that the overall computational cost is minimized.

Algorithm 3 The Offline Simon's algorithm [12]
1: Query $m$ times $E$, to compute $|\psi_m\rangle = \bigotimes_{j=1}^{m} \sum_x |x\rangle|E(x)\rangle$
2: amplify over $i$ with the following test:
3: From $|\psi_m\rangle$, compute $m$ times $\sum_x |x\rangle|E(x) \oplus f(i, x)\rangle$
4: Apply H on the input registers
5: Compute the rank of the values in the input registers
6: if the rank is lower than $n$ then
7: Do a phase shift
8: end if
9: Uncompute steps 5 to 3.
10: end amplify

Quantum Simon-based attacks
Since the seminal Simon-based distinguisher on the 3-round Feistel construction of Kuwakado and Morii [46], many attacks that use Simon's algorithm have been proposed. We present here the Simon-based attacks on the Even-Mansour and FX constructions, and detail how we instantiate them for the primitives presented in subsection 2.3.

Attack on Even-Mansour
For Even-Mansour constructions, we can consider the function $f(x) = \mathrm{EM}_{K_1,K_2}(x) \oplus P(x) = P(x \oplus K_1) \oplus K_2 \oplus P(x)$, which has period $K_1$. Hence, with access to quantum queries, Simon's algorithm can recover $K_1$ in polynomial time, from which it is trivial to recover $K_2$. This was proposed in [47].
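The periodicity can be checked exhaustively on a toy instance. The sketch below assumes the usual Even-Mansour definition $P(x \oplus K_1) \oplus K_2$ and uses a random 8-bit permutation as a hypothetical stand-in for $P$; it only illustrates the algebra and does not run Simon's algorithm.

```python
import random

N_BITS = 8
rng = random.Random(0)
P = list(range(1 << N_BITS))
rng.shuffle(P)  # toy public permutation
K1 = rng.randrange(1, 1 << N_BITS)
K2 = rng.randrange(1 << N_BITS)


def even_mansour(x):
    return P[x ^ K1] ^ K2


def f(x):
    """Function given to Simon's algorithm: cipher output XOR public permutation."""
    return even_mansour(x) ^ P[x]


def has_period(func, s, n_bits):
    return all(func(x) == func(x ^ s) for x in range(1 << n_bits))


period_holds = has_period(f, K1, N_BITS)
# Once K1 is known, one known plaintext gives K2.
recovered_k2 = even_mansour(5) ^ P[5 ^ K1]
```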

Attack on the FX construction
The quantum attack against the FX construction proposed in [49] is based on a simple idea: if the inner key is known, then the construction reduces to an Even-Mansour, and the previous attack applies. In more detail, the function $f(i, x) = \mathrm{FX}_{K_1,K,K_2}(x) \oplus E_i(x) = E_K(x \oplus K_1) \oplus K_2 \oplus E_i(x)$ has period $K_1$ if and only if $i = K$. Hence, with quantum query access, we can apply the Grover-meets-Simon algorithm to recover $K$ and $K_1$ in time $O(2^{k/2})$ if $|K| = k$.
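The "periodic only for the right guess" property can be observed on a toy FX instance. In the sketch below, the family of random 8-bit permutations indexed by a 4-bit key is a hypothetical stand-in for the block cipher $E_i$, and the test uses the known period for simplicity, whereas the attack would run Simon's test without knowing it.

```python
import random

N_BITS, K_BITS = 8, 4


def make_toy_cipher(n_bits, k_bits, seed):
    """One random permutation table per key (hypothetical stand-in for E_i)."""
    rng = random.Random(seed)
    tables = []
    for _ in range(1 << k_bits):
        table = list(range(1 << n_bits))
        rng.shuffle(table)
        tables.append(table)
    return tables


E = make_toy_cipher(N_BITS, K_BITS, seed=5)
K, K1, K2 = 11, 0x5A, 0xC3


def fx(x):
    """FX construction: inner cipher under K, whitened with K1 and K2."""
    return E[K][x ^ K1] ^ K2


def f(i, x):
    return fx(x) ^ E[i][x]


def is_periodic(i, s):
    return all(f(i, x) == f(i, x ^ s) for x in range(1 << N_BITS))


periodic_guesses = [i for i in range(1 << K_BITS) if is_periodic(i, K1)]
```

Only the correct inner key should survive the test; a wrong guess would need 128 independent 8-bit coincidences.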

Offline version
The previous attacks can be adapted to classical-query attacks thanks to the offline Simon's algorithm, as proposed in [12].
Offline attack on the FX construction. The periodic function of the FX construction directly fits the structure of Problem 1, with $E = \mathrm{FX}_{K_1,K,K_2}$ and $f(i, x) = E_i(x)$. Hence, we can attack the FX construction on a block cipher of $n$ bits with a $k$-bit key in $2^n$ classical queries and time $O(\max(2^n, 2^{k/2}))$.
Offline attack on Even-Mansour. We cannot directly apply the previous attack, as it would require $2^n$ classical queries. However, if we fix $n - u$ bits in the input of the cipher, we can still obtain a periodic function: $f(y, x) = \mathrm{EM}_{K_1,K_2}(x \| 0^{n-u}) \oplus P(x \| y) = P((x \oplus K_1^1) \| K_1^2) \oplus K_2 \oplus P(x \| y)$, with $K_1^1$ the first $u$ bits of $K_1$, and $K_1^2$ its last $n - u$ bits. This function is periodic if and only if $y = K_1^2$. Hence, we can apply the offline Simon's algorithm, at a cost of $O(2^u)$ classical queries, and $O(\max(2^u, 2^{(n-u)/2}))$ quantum time. In this case we can choose $u$, and the cost will be minimal for $u \sim n/3$.
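The truncated variant can be checked in the same way on a toy instance. The sketch below fixes its own convention, which is an assumption of the example: the $u$ varying bits are the most significant ones, the other $n - u$ bits are set to zero in the classical queries, and `y` is the guess for the corresponding $n - u$ key bits. The permutation is a hypothetical random table, and the known period is used in the test for simplicity.

```python
import random

N_BITS, U_BITS = 12, 4
Y_BITS = N_BITS - U_BITS

rng = random.Random(7)
P = list(range(1 << N_BITS))
rng.shuffle(P)  # toy public permutation
K1, K2 = 0xA5C, 0x3E1
K1_X_PART = K1 >> Y_BITS               # key bits aligned with the varying input
K1_Y_PART = K1 & ((1 << Y_BITS) - 1)   # key bits aligned with the fixed input


def concat(x, y):
    return (x << Y_BITS) | y


def even_mansour(z):
    return P[z ^ K1] ^ K2


# The only classical data needed: 2**U_BITS chosen-plaintext queries.
CLASSICAL_DATA = [even_mansour(concat(x, 0)) for x in range(1 << U_BITS)]


def f(y, x):
    return CLASSICAL_DATA[x] ^ P[concat(x, y)]


def is_periodic(y, s):
    return all(f(y, x) == f(y, x ^ s) for x in range(1 << U_BITS))


periodic_guesses = [y for y in range(1 << Y_BITS) if is_periodic(y, K1_X_PART)]
```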

Remark 4 (Truncation, affine spaces). Technically, the input is not required to be of the form $(x \| 0^{n-u})$. The attack can work with any $u$-dimensional affine space. In particular, for any fixed $c$, we can take all the inputs of the form $(x \| c)$.
Remark 5 (Truncation for the FX attack). We can also apply this input truncation technique to the FX attack. This can balance the costs if n > k/2.
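The balancing of the two costs can be sketched numerically. The helper below minimizes the exponent $\max(u, (n + k - u)/2)$ over $0 \le u \le n$, ignoring all polynomial factors. Treating the truncated FX attack as a search over $k + n - u$ bits is an assumption of this sketch, as is the function name.

```python
def balanced_truncation(n: int, k: int = 0):
    """Return (u, exponent) minimizing max(u, (n + k - u) / 2) for 0 <= u <= n.

    u is the log2 of the number of classical queries, and (n + k - u) / 2 the
    log2 of the number of quantum search iterations; k = 0 is Even-Mansour.
    """
    exponent, u = min((max(u, (n + k - u) / 2), u) for u in range(n + 1))
    return u, exponent


if __name__ == "__main__":
    print(balanced_truncation(128))      # Even-Mansour with a 128-bit state
    print(balanced_truncation(64, 64))   # FX with 64-bit block and inner key
    print(balanced_truncation(64, 256))  # inner key too long to balance
```

When $k > 2n$, the minimum is reached at $u = n$ and the quantum search dominates, which matches the condition of Remark 5.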
Concrete estimates. We rely on Theorem 1 for concrete query estimates. We chose α = 9, as this will ensure a success probability of around 99%. In all the instances we consider, we have n + k ≤ 200. Hence, an output size of m = 11 bits will be sufficient for our purposes.

Attack on Chaskey
We attack Chaskey with a one-block message, which degenerates into a truncated Even-Mansour: $\mathrm{Chaskey}_K(m) = \mathrm{trunc}_t(\pi(m \oplus K \oplus K_1) \oplus K_1)$. From Theorem 1 the attack does not require the full output, so the truncation is not an issue. However, for some of the circuit optimizations in subsection 6.3, we assume $t \ge 96$.
We can directly apply the Even-Mansour offline attack. We do a chosen-plaintext attack, and query classically the MAC of the $2^u$ 128-bit messages of the form $0^{n-u}\ast$.
Then the quantum attack recovers the value of $K \oplus K_1$. As $K_1 = 2K$, we have $K \oplus K_1 = 3K$. Thus, we can divide by 3 in the finite field to recover the key $K$, which is the master key.
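The final division by 3 is a plain field inversion. The sketch below represents elements of $\mathbb{F}_2[X]/(X^{128} + X^7 + X^2 + X + 1)$ as Python integers, with bit $i$ holding the coefficient of $X^i$; this encoding, the example key and the function names are assumptions of the example, and the byte ordering used by Chaskey implementations is not addressed.

```python
MODULUS_LOW = 0x87          # X^7 + X^2 + X + 1
MASK = (1 << 128) - 1


def gf_double(a: int) -> int:
    """Multiply by X (that is, by 2) and reduce."""
    a <<= 1
    if a >> 128:
        a = (a & MASK) ^ MODULUS_LOW
    return a


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication in the field."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = gf_double(a)
        b >>= 1
    return result


def gf_inverse(a: int) -> int:
    """Inverse as a^(2^128 - 2), by square-and-multiply."""
    result, exponent = 1, (1 << 128) - 2
    while exponent:
        if exponent & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        exponent >>= 1
    return result


def recover_master_key(three_k: int) -> int:
    """Divide the recovered value 3K = K xor 2K by 3."""
    return gf_mul(three_k, gf_inverse(3))


K = 0x0123456789ABCDEF0FEDCBA987654321  # arbitrary example key
K1 = gf_double(K)
recovered = recover_master_key(K ^ K1)
```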

Attack on PRINCE
We can directly apply the FX attack to PRINCE. We do a chosen-plaintext attack, and classically query the encryption of $2^u$ 64-bit messages of the form $0^{n-u}\ast$. Then the quantum attack recovers $K_0$ and $K_1$, which correspond to the full PRINCE key.

Attack on Elephant
To attack Elephant, we consider the encryption of a single-block message: $C = M \oplus P((N \| 0) \oplus \mathrm{mask}_K^{0,0}) \oplus \mathrm{mask}_K^{0,0}$. This is an Even-Mansour, but the input is the nonce, not the message. Hence, with only known plaintexts, we can gain access to the values we need. To make the attack work, we need to have a set of $2^u$ nonces that form an affine space. This is no obstacle to the attack, since Elephant's security proofs assume the adversary can choose nonces as long as they do not repeat. Interestingly, if the adversary has no control of the nonces but the nonce is incremented between each query, then the nonces will still form an affine space and the attack will go through.
As we have an Even-Mansour construction, we can apply the offline Simon attack, which will recover the value of $\mathrm{mask}_K^{0,0} = K' = P(K \| 0)$. This expanded key is sufficient to compute all the masks in Elephant. Moreover, as $P$ is a permutation, we can also recover the 128-bit master key $K$.

A quantum circuit to solve boolean linear equations
In this section, we present a quantum algorithm that can compute, given $m$ $n$-bit vectors as input, the rank of their span or a basis of its dual. At its core, it uses Algorithm 4, which computes a basis of the span in triangular form. From this we can easily compute the rank or any orthogonal vector. Figure 7 represents the qubits in the algorithm.
Definition 2. We let $(i, j)$ denote the $j$th iteration of the inner loop in the $i$th iteration of the outer loop. We use the partial order $(i, j) \le (k, l) \Leftrightarrow i \le k \wedge j \le l$, and assume that $(i, j)$ occurred before $(k, l)$ if $(i, j) < (k, l)$. We do not enforce a total order on the iterations. Here, we only need that each $(i, j)$ is computed atomically; that is, we cannot have parallel iterations with the same $i$ or $j$, and we enforce that $(i, j)$ occurs after $(k, l)$ for all $(k, l) < (i, j)$.
Proof. We prove this by induction over $(i, j)$. At the beginning of $(0, 0)$, $av_i = 1$ and $used_j = 0$, hence the lemma holds. Assume that at the beginning of $(i, j)$, the lemma holds. We now want to prove that it will still hold at the end.
• If $av_i = 0$, $used_j$ is not updated, hence $av_i$ is also not updated.
Hence, if we sequentially apply the previous lemma, we get one $\beta_i$ at each outer for loop, if any such vector exists. In the end, either the vectors are put in $b_i$ or fully reduced to 0. Hence, the theorem holds. □
Remark 6 (Parallel computation). For the correctness of the algorithm, we only need that if $(i, j) < (k, l)$, then $(i, j)$ must be computed before $(k, l)$. This allows us to compute in parallel the steps $(i, j)$ with $i + j$ constant, as they are independent.
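The schedule suggested by Remark 6 can be written down explicitly. The sketch below groups the iterations $(i, j)$ into wavefronts of constant $i + j$, assuming the outer index $i$ ranges over the $n$ basis positions and the inner index $j$ over the $m$ input vectors; this indexing and the helper name are assumptions of the example.

```python
def wavefronts(n: int, m: int):
    """Group iterations (i, j), 0 <= i < n and 0 <= j < m, by the value of i + j."""
    fronts = [[] for _ in range(n + m - 1)]
    for i in range(n):
        for j in range(m):
            fronts[i + j].append((i, j))
    return fronts


if __name__ == "__main__":
    for step, front in enumerate(wavefronts(3, 4)):
        print(step, front)
```

Each wavefront contains iterations with pairwise distinct $i$ and pairwise distinct $j$, and every iteration appears strictly after the iterations it depends on, so the number of sequential steps grows as $m + n$ rather than $mn$.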

Cost analysis
Qubits. The circuit modifies in-place its m × n qubit input, though it needs m + n(n + 1)/2 auxiliary qubits for b, used, and av. We also use another n(n − 1) auxiliary qubits to reduce the depth of row reductions, as detailed below.
Gate count. Steps 3 and 4 require just one Toffoli gate and are repeated $mn$ times. Inserting $x_j$ at Step 5 requires $n - i$ Toffoli gates, as does Step 8. Summed over all $i$, and repeated $m$ times, gives a total of $mn^2 + mn$ Toffoli gates to compute the triangular basis.
Depth. As Remark 6 indicates, we can compute two iterations $(i, j)$ and $(i', j')$ in parallel if $i + j = i' + j'$. Hence, we only need to perform $m + n$ iterations sequentially. Iteration $(i, j)$ has a naive depth of $2(n - i + 1) + 2$, as inserting and reducing $x_j$ are controlled by single qubits, so we must apply each Toffoli sequentially. However, we can fan out the control to apply the Toffolis simultaneously. This means a depth of $\lceil \log_2(n - i + 1) \rceil + 4$, though this is what requires the extra $n(n - 1)$ auxiliary qubits.
When reducing $x_j$, once we have modified $x_j[i + 1]$, we can begin the next iteration with $(i + 1, j)$, and reduce $x_j[(i + 2) \ldots n]$ simultaneously. However, the same logic does not apply to inserting $x_j$ into the basis; we need to finish with $used_j$ before the next iteration modifies it. This gives us a total circuit depth of $O((m + n) \lg(n))$. The specific constants will depend on our cost model, the structure of the fanout, and the choice of Toffoli gate. We used linear regression on the results from Q# to estimate the concrete constants.

Final steps
Rank computation. Once we have the triangular basis, we only need to check if the basis has a full rank, which only requires testing whether all $av_i$ bits are set to 0.
Computing orthogonal vectors. While this is not directly useful here, given the triangular basis we could easily compute a vector orthogonal to it, at a cost of $n$ CNOT and $n^2 - n$ Toffoli. The idea is to choose the bit $i$, beginning with the last bit, such that the vector we compute is orthogonal to the basis vectors $i$ to $n$. As the basis is in triangular form, we can sequentially compute the vector.
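A classical, non-reversible sketch of these steps may help fix ideas. It keeps a triangular basis in which the vector stored at position $i$ has its lowest set bit at position $i$, counts the filled positions to get the rank, and builds an orthogonal vector from the last bit down, setting the bit to 1 wherever the basis has no vector. The data layout and function names are assumptions of this example and do not reproduce the quantum circuit.

```python
def dot(a: int, b: int) -> int:
    """Inner product over GF(2) of two bit vectors stored as integers."""
    return bin(a & b).count("1") & 1


def insert(basis, v: int) -> bool:
    """Reduce v against the basis; store it if it is independent."""
    for i in range(len(basis)):
        if not (v >> i) & 1:
            continue
        if basis[i] is None:
            basis[i] = v  # lowest set bit of v is i
            return True
        v ^= basis[i]     # clears bit i, leaves lower bits at 0
    return False


def rank(basis) -> int:
    return sum(b is not None for b in basis)


def orthogonal_vector(basis) -> int:
    """Choose bits from last to first so that v is orthogonal to every basis vector."""
    v = 0
    for i in reversed(range(len(basis))):
        b = basis[i]
        if b is None:
            v |= 1 << i   # free choice: put 1
        elif dot(b >> (i + 1), v >> (i + 1)):
            v |= 1 << i   # cancel the contribution of the higher bits
    return v


N_BITS = 4
vectors = [0b0001, 0b0100, 0b1010, 0b0101]
basis = [None] * N_BITS
inserted = [insert(basis, v) for v in vectors]
w = orthogonal_vector(basis)
```

With these four samples, which are all orthogonal to `0b1010`, the rank is 3 and the computed vector is `0b1010`; a full-rank basis would give the zero vector.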
The only freedom we have is on the values we put when the vector $i$ is missing in the basis. If we only need one vector, we can simply put 1 in that case. This is Algorithm 5.

Reversible implementations of quantum primitives

Design Philosophy
To apply our attack, we implement an operator with the following general shape: $|x\rangle|0\rangle \mapsto |x\rangle|f(x)\rangle$. Thus, there is little reason for us to prefer an in-place encryption algorithm, since we need to preserve the input for proper interference in Simon's algorithm. However, the permutations we consider are all iterated designs containing multiple rounds of some simpler permutation. If a single round is out-of-place, we either need to double our computational cost to uncompute as we proceed, or allocate fresh qubits for every round; hence, we tried to find in-place circuits.
Some permutations use small S-boxes of 4 to 5 bits. We could use a table look-up, but this is out-of-place and has cost linear in the table size (e.g., 16 AND operations for 4 bits). Instead we found optimized in-place circuits, inspired by masked implementations of block ciphers, which also use a model in which XOR is cheap and AND is expensive.
In depth-limited Grover-like algorithms, the most efficient oracle design makes strong trade-offs of depth against width. However, the Q# resource estimator will not reuse qubits when optimizing for depth. That is, if each permutation round needed to borrow and release 10 qubits, and a cipher ran for 80 rounds, Q# would count 800 extra qubits. To avoid this issue, we used a width-optimizing compiler, which always prefers to reuse qubits, even if that means delaying other operations. Thanks to our in-place implementations, neither issue has a large effect on our results.

Simon-specific optimizations
The primitive circuits we implement have some relaxed constraints, which allows us to compute slightly different (and cheaper) functions.
Shorter output. From Theorem 1, we can afford to have a short output, which will be in practice of 11 bits. This allows us to not compute some of the output bits, and in general we can at least avoid the computation of most of the final non-linear layer.
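Linear combination. For our attacks, we have the general property $E(x) \oplus f(i_0, x) = E(x \oplus s) \oplus f(i_0, x \oplus s)$. We can remark that for any affine function φ, $\varphi \circ f$ and $\varphi \circ E$ will have the same general property: $\varphi(E(x)) \oplus \varphi(f(i_0, x)) = \varphi(E(x \oplus s)) \oplus \varphi(f(i_0, x \oplus s))$. Hence, we can apply any affine function to the output of our function (as long as its output is long enough). This actually generalizes the previous property, as truncation is linear.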
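This property is easy to confirm on a toy Even-Mansour instance. In the sketch below, the affine map `phi` is a randomly chosen 8-bit to 3-bit map (binary matrix rows plus a constant), and the permutation and keys are arbitrary; all of these are assumptions made for illustration.

```python
import random

N_BITS, OUT_BITS = 8, 3
rng = random.Random(3)
P = list(range(1 << N_BITS))
rng.shuffle(P)  # toy public permutation
K1, K2 = 0x3C, 0xA7
ROWS = [rng.randrange(1, 1 << N_BITS) for _ in range(OUT_BITS)]  # matrix rows
CONST = rng.randrange(1 << OUT_BITS)                             # affine constant


def parity(v: int) -> int:
    return bin(v).count("1") & 1


def phi(v: int) -> int:
    """Affine map: output bit r is the parity of (ROWS[r] & v), XOR a constant bit."""
    out = 0
    for r, row in enumerate(ROWS):
        out |= parity(row & v) << r
    return out ^ CONST


def cipher(x: int) -> int:
    return P[x ^ K1] ^ K2


def g(x: int) -> int:
    """Compressed periodic function: phi applied to both halves before the XOR."""
    return phi(cipher(x)) ^ phi(P[x])


still_periodic = all(g(x) == g(x ^ K1) for x in range(1 << N_BITS))
```

The affine constant cancels in the XOR, and the linear part maps the periodic value $E(x) \oplus f(i_0, x)$ to a shorter value with the same period.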
Overall, we can remove many operations in the last rounds: the ones that either do not influence the bits we're interested in, or only act linearly on them.
Partially fixed input. We can split the variable i on which we do a quantum search into two: y, which corresponds to the part of the message which is fixed, and k, which is a secret we must guess completely. For Even-Mansour, k is empty, and for the FX construction, y can be empty. The general shape is presented on Figure 8a.
Moreover, the design of the function transforms the input in-place and bijectively. This means we can decompose the full function f into f (k, x, y) = f ′ (k, x, g(k, y)), as in Figure 8b. With this specific structure, the output of g will be identical for all the parallel computations of f . As y is guessed by the quantum search, we can afford to only compute g once for all the parallel computations of f . This saves us some computation, depending on how fast the input bits diffuse. We found ways to save part of the first linear layer and a few S-boxes.
We go further and remark that in many cases, the mapping $y \mapsto g(k, y)$ will be a permutation. Hence, instead of applying the quantum search to $f$ to find $k$ and $y$, we search $f'$ to find $k$ and $g(k, y)$. Once we find $g(k, y)$ and $k$, it is easy to invert and find $y$. This allows us to completely remove all the operations that only operate on the bits of $y$ from the quantum circuit.
Summary. We can leverage the specific structure of the problem to reduce the computational cost of $f$. These optimizations rely on the limits of the diffusion in some iterated constructions. In practice, for the constructions we considered, they save a cost equivalent to 1 to 2 rounds, which becomes completely negligible for constructions with a very large number of rounds. Nevertheless, these optimizations are independent of the actual implementation of the quantum circuit, and can always be applied.

Chaskey
The Chaskey permutation has an ARX structure: it uses only XOR, bit rotation, and modular addition. All of these can be implemented in-place on a quantum computer, and efficient circuits for them are already available [34]. We use the adder with the fewest T operations [30]. The quantum circuit for the permutation is practically identical to the classical circuit. Optimizations from Section 6.2 for a shorter output are particularly effective, as detailed in Circuits 4 and 5. We save a fourth of the operations in the first round thanks to the partially fixed input, shown in Circuit 4. Circuit 5 presents the last two rounds of the truncated permutation. Once it is computed, we copy out bits from 5 to 15 and from 37 to 47 into the output register before uncomputing. This has the same effect as the CNOT highlighted in green in Circuit 5, but saves uncomputation. The total effect is a saving of 18% in depth and operations for 8 rounds and 12.5% for 12 rounds.

PRINCE
Internally, PRINCE uses a keyed permutation of 12 rounds, where each round XORs round constants, applies an S-box to each nibble, multiplies the state by a binary matrix, and XORs the key (Circuit 6).
We implemented PRINCE in-place with the S-box decomposition from [16], which only requires 6 Toffoli operations per S-box (Circuit 7).
We perform a PLU decomposition for the linear layer as well as the affine layers in the S-box decomposition, as in [40]. Round 9 only needs to apply the S-box to nibbles 3, 6, 9, and 12. Then in round 10, we only need to use the corresponding bits of the key and the round constant. We only apply the part of the linear layer necessary to compute these nibbles, and then the row shift puts these nibbles in the first 16 bits. We finish with an S-box on these bits. This saves us 13.5% of all operations, though it provides a negligible depth reduction.
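The role of the PLU decomposition can be sketched classically. The Python code below is an illustrative assumption, not our Q# implementation: it factors an invertible binary matrix A so that the row-permuted A equals L·U over GF(2), then applies A to a bit vector in place, counting one CNOT per off-diagonal 1 in L and U. The 4 × 4 matrix is a hypothetical example, not PRINCE's linear layer.

```python
from itertools import product


def plu_gf2(A):
    # Returns (perm, L, U) with A[perm[i]] == row i of L*U over GF(2).
    n = len(A)
    U = [row[:] for row in A]
    L = [[int(i == j) for j in range(n)] for i in range(n)]
    perm = list(range(n))
    for c in range(n):
        pivot = next((r for r in range(c, n) if U[r][c]), None)
        if pivot is None:
            raise ValueError("matrix is singular over GF(2)")
        if pivot != c:
            U[c], U[pivot] = U[pivot], U[c]
            perm[c], perm[pivot] = perm[pivot], perm[c]
            for j in range(c):  # swap the part of L already computed
                L[c][j], L[pivot][j] = L[pivot][j], L[c][j]
        for r in range(c + 1, n):
            if U[r][c]:
                L[r][c] = 1
                U[r] = [a ^ b for a, b in zip(U[r], U[c])]
    return perm, L, U


def apply_in_place(perm, L, U, x):
    # Computes A*x using only updates x[i] ^= x[j] (one CNOT each).
    x = x[:]
    n = len(x)
    cnots = 0
    for i in range(n):  # U: ascending order reads only untouched x[j], j > i
        for j in range(i + 1, n):
            if U[i][j]:
                x[i] ^= x[j]
                cnots += 1
    for i in reversed(range(n)):  # L: descending order, j < i
        for j in range(i):
            if L[i][j]:
                x[i] ^= x[j]
                cnots += 1
    out = [0] * n
    for i in range(n):  # the permutation is a relabeling of wires
        out[perm[i]] = x[i]
    return out, cnots


def matvec(A, x):
    # Reference out-of-place product over GF(2).
    return [sum(a & b for a, b in zip(row, x)) % 2 for row in A]


A = [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 1]]
perm, L, U = plu_gf2(A)
```

The triangular factors are what make the in-place evaluation possible: each update reads only wires that have not yet been overwritten, and the permutation costs no quantum operations.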

[Circuit 6: PRINCE's keyed permutation, with round blocks for i = 1 to 5 and i = 6 to 10, and final round constant RC11.]

Elephant-160/176
Elephant-160 and 176 use the spongent permutation [9], with respectively 80 and 96 rounds (Circuit 8). The first step of each round is an XOR with a fixed sequence of strings $C_i$, which requires only a series of X operations. The next step is an S-box layer. We implemented it in-place using a masking-friendly decomposition that requires only 4 Toffoli operations (Circuit 9), using the fact that 4-bit S-boxes are fully classified and their decomposition as a composition of quadratic functions is known [20,8,55]. The final step is a permutation, which can be done by the classical computer with no extra quantum operations.
[Circuit 7: PRINCE's S-box, applied to 4 qubits.]
Input and output optimizations are less effective here because Elephant repeats so many rounds. We still limit the final layer of the S-box to only the bits we use in the output, resulting in 1.8% and 1.7% operation savings for Elephant-160 and 176, respectively, with no depth improvement.

Each Keccak round starts with three linear functions, θ, ρ, and π. We used a PLU decomposition of all three functions to perform them in-place. These are followed by the non-linear function χ. We adapted the circuit from the Keccak implementation; however, it is out-of-place, so we also adapted a circuit for $\chi^{-1}$ from [36] (Circuit 10). We apply the adjoint of this circuit to uncompute the input to χ, then release these qubits. Since $\chi^{-1}$ consists mostly of AND operations, their adjoints can be done cheaply using measurements [43,30]. The final function is ι, which simply XORs a constant onto the state and requires only X operations.
[Circuit 10: Keccak's χ function.]
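The out-of-place computation followed by uncomputation through $\chi^{-1}$ relies on χ being a bijection on each 5-bit row. As a minimal classical sketch in Python (the function and table names are ours for illustration, and the inverse is a brute-force table, not the circuit from [36]), the code below implements χ on one row as $a_i \leftarrow a_i \oplus (\lnot a_{i+1} \wedge a_{i+2})$, with indices modulo 5, and tabulates its inverse.

```python
from itertools import product


def chi_row(row):
    # Keccak's chi on one row of 5 bits, computed out of place:
    # each output bit needs the old values of two neighbouring bits.
    return tuple(
        row[i] ^ ((row[(i + 1) % 5] ^ 1) & row[(i + 2) % 5])
        for i in range(5)
    )


# Tabulate chi over all 32 rows and build the inverse by lookup.
ALL_ROWS = list(product((0, 1), repeat=5))
CHI_TABLE = {row: chi_row(row) for row in ALL_ROWS}
CHI_INV_TABLE = {image: row for row, image in CHI_TABLE.items()}


def chi_inv_row(row):
    # Inverse obtained from the table; valid because chi is a bijection.
    return CHI_INV_TABLE[tuple(row)]


# One AND per output bit, so 5 AND operations per row in the forward direction.
ANDS_PER_ROW = 5
```

Because every output bit of χ depends on the old values of its neighbours, a direct in-place update would overwrite inputs that are still needed, which is why the output is computed into fresh qubits and the input is then cleared with the adjoint of a $\chi^{-1}$ circuit.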
Here we can also limit the non-linear χ in the last round, for 5% T-operation savings and 1.6% savings over all operations.

Table 1: Quantum circuit costs for the circuits we analyze. "1QC" are single-qubit Clifford operations and "M" are measurements.

Quantum Lookups
Constructing the initial database from our offline queries requires a QROM circuit. We do not assume special, cheap QROM operations (i.e., the QRAM model), but rather give the cost in terms of a Clifford+T simulation of QROM.
With no depth restriction, the cheapest circuit (in total operation count) is due to Babbush et al. [2]. Berry et al. [4] give a version that is cheaper in T-operations and parallelizes smoothly, but since we have no need to parallelize and we consider the full operation count, we use only the Babbush et al. QROM circuit.

Attack circuits and estimates
Offline Simon attack. To estimate the total cost of the attack, we estimated the cost at each value of u and chose the minimum cost, up to some specified limit on u. The value of u determines the size of the quantum look-up, which is computed once. We used Theorem 1 to determine the necessary linear system size m and computed the cost to repeat the cipher m times in parallel, based on the cost of a single cipher computation from Q#. For PRINCE, which is an FX construction, each parallel repetition needs a copy of the permutation key. However, the permutation key is only infrequently XORed onto the state. With CNOTs, this has depth 1, and can be pipelined efficiently, so we assume the repetitions share the permutation key. This increases the depth by m CNOTs, which is negligible compared to the overall depth of the cipher.
We then estimated the cost of solving an m × n linear system, using the costs from Section 5.1. Once we found the optimal m, we used Q# to get an exact cost of solving the linear system. The code for this estimation is available at https://github.com/sam-jaques/offline-quantum-period-finding/.
Our results are in Table 2 and Table 3. We include results for Shor's algorithm to attack RSA-2048 and an exhaustive quantum key search on AES-128 for comparison.
Exhaustive Key Search. We also estimated the cost of performing an exhaustive quantum key search on the ciphers, summarized in Table 4. The circuits for these are slightly different, as we need to attack the full encryption, rather than just the permutation. Chaskey and Elephant modify the key slightly before using it. Elephant transforms the key from 128 bits to the block size, so it is much more efficient to modify the key as part of the search oracle and search a 128-bit space, rather than search a key space as large as the full block size.
To ensure a unique key, we need 2 blocks for Chaskey and 3 blocks for PRINCE. We follow the STO approach of [23], so that we only need to infrequently check blocks besides the first. This also keeps the qubit requirements low; PRINCE only needs 257 qubits, half of which are only needed as auxiliary qubits for the multi-controlled NOT.
Generic collision attacks. We remark that in all cases, the total number of quantum gates for the offline Simon's algorithm is close to $2^{n/2 - d/6}$, with $2^d$ classical queries, that is, the query cost of the generic offline collision attack. This means the offline Simon's algorithm outperforms the generic attack, since its larger polynomial factor is not an issue for cryptographic parameter sizes.
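As a worked instance of the exponent $n/2 - d/6$, take the purely illustrative values $n = 128$ and $d = 48$ (chosen to make the arithmetic concrete, not quoted from our tables), and compare with the case $d = 0$:

```latex
\[
  2^{n/2 - d/6}\,\Big|_{n = 128,\ d = 48} = 2^{64 - 8} = 2^{56},
  \qquad
  2^{n/2 - d/6}\,\Big|_{n = 128,\ d = 0} = 2^{64}.
\]
```

Under these assumed values, $2^{48}$ classical queries lower the exponent by only 8 bits, so every additional bit of security lost costs a factor of $2^6$ in data.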

Conclusion
A new kind of attack. Quantum exhaustive key search may not be a real threat to symmetric cryptography because of its poor parallelization [62,40] and the expected overheads of error correction. However, we showed that there are other avenues of quantum attack that may be more feasible. For example, Chaskey and PRINCE have "only" 33 more bits of quantum security than RSA-2048, widely believed to be completely broken in a post-quantum setting.
Comparing the security of RSA-2048 to Chaskey and PRINCE, we point out that our attack requires less than 4 times as many logical qubits, but many more quantum operations. This means breaking these ciphers will take much longer and require much more coherence than breaking RSA. However, adding more coherence to an already-coherent quantum computer is relatively easy. For surface code error correction, coherence grows exponentially with code distance, and the qubit overhead grows only quadratically [29]. Moreover, our attacks tend to have a lower depth than quantum search, which may also help their implementation. Thus, we believe that these attacks could be an interesting milestone for quantum computers, much harder than RSA-2048 factoring, but much easier than AES-128 key recovery.
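The trade-off between coherence and qubit overhead can be made concrete with a rough Python sketch. It assumes a commonly used heuristic for the surface code, a logical error rate per operation of about $0.1\,(p/p_{th})^{(d+1)/2}$ with threshold $p_{th} = 10^{-2}$ and about $2d^2$ physical qubits per logical qubit at distance $d$. These constants, the physical error rate $10^{-3}$ and the two target error rates are illustrative assumptions, not figures from our estimates.

```python
def logical_error_rate(d, p, p_th=1e-2, prefactor=0.1):
    # Heuristic: error suppression is exponential in the code distance d.
    # The prefactor and threshold are assumed values.
    return prefactor * (p / p_th) ** ((d + 1) / 2)


def min_distance(target, p):
    # Smallest odd distance whose heuristic logical error rate meets the target.
    d = 3
    while logical_error_rate(d, p) > target:
        d += 2
    return d


def physical_qubits_per_logical(d):
    # Assumed footprint: data plus measurement qubits, roughly 2 * d^2.
    return 2 * d * d


P_PHYS = 1e-3  # assumed physical error rate

# Two hypothetical computations: the second tolerates a per-operation
# error rate 10^10 times smaller, i.e. runs 10^10 times more operations.
d_short = min_distance(1e-15, P_PHYS)
d_long = min_distance(1e-25, P_PHYS)

overhead_ratio = (physical_qubits_per_logical(d_long)
                  / physical_qubits_per_logical(d_short))
```

Under these assumptions, a computation with ten orders of magnitude more operations needs only a small constant factor more physical qubits per logical qubit, which is the sense in which adding coherence is relatively cheap.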
On quantum-safe symmetric cryptography. We found that Chaskey (independently of its number of rounds) and PRINCE have almost identical quantum security. Moreover, the data limitation of Chaskey has a negligible impact on the attack cost and our attacks end up being almost a million times cheaper than the corresponding quantum key search.
Our attack on Elephant is less competitive and requires more quantum operations than the direct key search. This is mainly because our attack targets the state size, and Elephant's key size is smaller. The data limitation also slows our attack, but the cost increase is much smaller than the cost increase of the classical attack. Moreover, this attack shows that to make an Elephant instance with significantly more quantum security than $2^{64}$ queries would require an increase in both the key and the state length. One of Elephant's features compared to other lightweight cryptography candidates is its small state size, so such a change would make it less competitive.
To counteract the offline Simon attack and to achieve quantum security, we recommend:
- Using a large state size, not just a large key size.
- Not relying on data limits, as these have limited impact on quantum attacks.
- Avoiding the Even-Mansour and FX constructions altogether.
For an example of the last idea, the design of the recent PRINCE v2 [17] is very close to the original PRINCE, but with a simple key schedule that replaces the FX construction.
Immediate implications. We stress that, like quantum exhaustive key search or factoring, a patient attacker could apply this attack to today's communications, as it is an offline attack: the data can be collected before any quantum computation. This is especially important for lightweight cryptography, which is intended for use in embedded systems, RFID chips or sensor networks, where an update is either impractical or downright impossible.