Low Trace-Count Template Attacks on 32-bit Implementations of ASCON AEAD

The Ascon standard recently adopted by NIST offers a lightweight authenticated encryption algorithm for use in resource-constrained cryptographic devices. To help assess side-channel attack risks of Ascon implementations, we present the first template attack based on analyzing power traces, recorded from an STM32F303 microcontroller board running Weatherley’s 32-bit implementations of Ascon-128. Our analysis combines a fragment template attack with belief-propagation and key-enumeration techniques. The main results are three-fold: (1) we reached a 100% success rate from a single trace if the C compiler optimized the unmasked implementation for space, (2) the success rate was about 95% after three traces if the compiler optimized instead for time, and (3) we also attacked a masked version, where the success rate was over 90% with 20 traces of executions with the same key, all after enumerating up to $2^{24}$ key candidates. These results show that suitably designed template attacks can pose a real threat to Ascon implementations, even if protected by first-order masking, but we also learnt how some differences in programming style, and even compiler optimization settings, can significantly affect the result.


Introduction
Ascon [DEMS21], a family of algorithms for authenticated encryption with associated data (AEAD) and secure hashing, designed for resource-constrained devices, was in 2019 selected as a prime choice for lightweight applications in the CAESAR competition. In 2023, after multiple review rounds over five years, the National Institute of Standards and Technology (NIST) finally chose Ascon as the winner of NIST's lightweight cryptography standardisation process [LWC]. One may expect that Ascon could soon be implemented on millions of authentication chips, RFID tags and radio-controlled devices.
As Ascon becomes a new NIST standard, it is important to understand not only its theoretical properties but also potential implementation challenges, such as side-channel attacks (SCA). The designers of Ascon have already carefully considered side-channel protection. For example, Ascon does not use conditional branches or require any look-up table, which naturally prevents many timing attacks [DEMS21]. Furthermore, Ascon's permutation uses S-boxes of degree 2, which facilitates threshold implementations and masking as efficient countermeasures against some side-channel attacks [SD17]. And Ascon's mode of operation supports a so-called leveled implementation [BBC+20, VCS23], where countermeasures against Differential Power Analysis (DPA) are only required for its initial and final steps, and not for the processing of each message block, which reduces their performance impact.
Compared to the extensive published cryptanalysis work on Ascon, practical experiments on side-channel leakage have received less attention so far. Samwel and Daemen [SD17] presented Correlation Power Analysis (CPA) attacks and a threshold implementation of a toy-sized 20-bit version of Ascon. Gross et al. [GWDE17] showed that Ascon could resist first-order DPA based on simulated leakage. Abdulgadir et al. [ADK19] presented threshold implementations of Ascon on an Artix7 FPGA and demonstrated that they were effective in preventing DPA attacks. Diehl et al. [DAF+18] compared the cost of threshold implementations of Ascon and other selected authenticated ciphers against DPA. Recently, Luo et al. [LWL+22] presented a Soft Analytical Side-Channel Attack (SASCA) on Ascon based on simulation, using Hamming weights (HW) of 8-bit values with independent, identically distributed, added Gaussian noise as a leakage model. While attacks using simulated HWs can provide useful insights, real power traces can provide more information in the form of likelihoods for specific values. But real traces also require additional processing, such as interesting-point selection and dimensionality reduction, to deal with the much larger amount of leakage data, the longer word size, and potentially correlated switching noise from a real processing pipeline.
In this paper, we present template attacks using power traces from a 32-bit microcontroller STM32F303, with ARM Cortex-M4 core, running several Ascon-128 AEAD implementations. Our attack strategy combines three techniques. Firstly, we use a fragment template attack [YK22], which uses a form of Linear Discriminant Analysis (LDA) [SA08, CK14a] modified for observing larger register sizes (Sec. 2.2), to obtain side-channel information leaked from this 32-bit device. Secondly, we use belief propagation (SASCA) [VCGS14] (Sec. 2.3) to improve the likelihood tables obtained from the template matches, by considering algebraic dependencies between the intermediate values observed. Finally, we use key enumeration [VCGRS12] (Sec. 2.4) as an optimized brute-force search technique, to deal with residual errors. Combined, these measures optimize our analysis for the design of Ascon and the targeted 32-bit implementations.
We attack both unmasked (Sec. 3 and 4) and masked implementations (Sec. 6), and also consider the effect of two different compiler optimization settings on the success rate of key recovery (Sec. 5). We achieved fast key-recovery success rates of over 95% with fewer than 10, in some cases even single, power traces (Figure 9) based on the unmasked implementation [Wea21, ASCON/]. For the masked implementation [Wea21, ASCON_masked/], single traces did not help much; however, we succeeded in fast key recovery with 10 to 20 traces (Figure 13).

Ascon
Ascon AEAD is based on a sponge mode, similar to MonkeyDuplex [BDPA12], but with stronger keyed initialization and finalization phases. The underlying permutations, denoted $p^a$ and $p^b$, are obtained by iterating a 320-bit round function $p$ for $a$ or $b$ times, respectively. Ascon AEAD first takes as inputs an initial vector IV (to identify the algorithm), a key K, and a nonce N, which it then combines with permutation $p^a$ applied as a non-invertible key derivation function (KDF). It then invokes permutation $p^b$ for blocks of associated data A and plaintext P, to absorb their content and generate the key stream for producing the blocks of ciphertext C. A final invocation of $p^a$ serves as a tag generating function (TGF) to produce a message-authentication tag T. The encryption process of Ascon is illustrated in the following figure.
During the execution of the AEAD mode, the output of the permutation is divided into two parts called rate ($S_r$) and capacity ($S_c$). The size of the rate is equal to the maximum number of data bits that an invocation of the permutation will process. The two Ascon variants Ascon-128 and Ascon-128a differ according to their rate, capacity and number of rounds $a$ and $b$ (see Table 1). In this paper, we focus on Ascon-128 since it is the primary recommendation by the Ascon designers [DEMS21].

Ascon permutation
The permutation operates on a 320-bit state ($S$) that is divided into five 64-bit words (or five lanes), as in $S = L_0 \| L_1 \| L_2 \| L_3 \| L_4$. The round function $p$ follows the substitution and permutation network (SPN) design principle, and consists of three operations: constant addition $p_C$, substitution $p_S$ and linear diffusion $p_L$, as $p = p_L \circ p_S \circ p_C$.

Constant Addition:
The operation $p_C$ updates the state by XORing an 8-bit round constant to the least significant byte of $L_2$, where the round constant in each round of $p^{12}$ is

Round:     0    1    2    3    4    5    6    7    8    9    10   11
Constant:  0xf0 0xe1 0xd2 0xc3 0xb4 0xa5 0x96 0x87 0x78 0x69 0x5a 0x4b

Substitution operation:
Step $p_S$ is a nonlinear operation applying a 5-bit S-box, which operates on each of the 64 columns of the state, i.e. on the five bits found at the same bit position in the lanes $L_0, \ldots, L_4$. This S-box has the following cryptographic properties: maximum differential and linear probability $1/4$, differential and linear branch number 3, and algebraic degree 2. This substitution operation allows an efficient bit-sliced implementation and its low degree also facilitates efficient threshold implementations and masking.

Linear Diffusion operation:
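Step $p_L$ provides diffusion by applying five different linear operations $\Sigma_i$ for $0 \le i \le 4$, where each $\Sigma_i$ performs XOR ($\oplus$) and right rotations ($\ggg$) on the word $L_i$, as in
$$\begin{aligned}
\Sigma_0(L_0) &= L_0 \oplus (L_0 \ggg 19) \oplus (L_0 \ggg 28)\\
\Sigma_1(L_1) &= L_1 \oplus (L_1 \ggg 61) \oplus (L_1 \ggg 39)\\
\Sigma_2(L_2) &= L_2 \oplus (L_2 \ggg 1) \oplus (L_2 \ggg 6)\\
\Sigma_3(L_3) &= L_3 \oplus (L_3 \ggg 10) \oplus (L_3 \ggg 17)\\
\Sigma_4(L_4) &= L_4 \oplus (L_4 \ggg 7) \oplus (L_4 \ggg 41)
\end{aligned} \qquad (2)$$
In later sections, we refer to the internal states of the Ascon $p^{12}$ permutation as
$$\beta_{-1} \xrightarrow{\,p_S \circ p_C\,} \alpha_0 \xrightarrow{\,p_L\,} \beta_0 \xrightarrow{\,p_S \circ p_C\,} \alpha_1 \xrightarrow{\,p_L\,} \cdots \xrightarrow{\,p_S \circ p_C\,} \alpha_{11} \xrightarrow{\,p_L\,} \beta_{11},$$
where $\beta_{-1}$ is the input and $\beta_{11}$ the output of the permutation. We will use symbols $L_0, L_1, L_2, L_3, L_4$ to represent the five lanes in intermediate states $\alpha_\Omega$ or $\beta_\Omega$, where $\Omega$ represents the round index in the $p^{12}$ permutation.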
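To make the three steps of the round function concrete, the following Python sketch applies $p_C$, $p_S$ and $p_L$ to a state held as a list of five 64-bit integers. It is an illustration only: the function names (`p_c`, `p_s`, `p_l`, `ascon_round`, `sbox_lookup`) are our own, the bit-sliced S-box formula and rotation amounts are taken from the Ascon specification [DEMS21], and the code is unrelated to Weatherley's optimized assembly implementation.

```python
MASK64 = (1 << 64) - 1

# Round constants of p^12, indexed by round number 0..11.
ROUND_CONSTANTS = [0xf0, 0xe1, 0xd2, 0xc3, 0xb4, 0xa5,
                   0x96, 0x87, 0x78, 0x69, 0x5a, 0x4b]
# Rotation amounts of the linear operations Sigma_0 .. Sigma_4.
ROTATIONS = [(19, 28), (61, 39), (1, 6), (10, 17), (7, 41)]


def rotr(x, r):
    """Rotate a 64-bit word right by r bits."""
    return ((x >> r) | (x << (64 - r))) & MASK64


def p_c(lanes, rnd):
    """Constant addition: XOR the round constant into the low byte of L2."""
    out = list(lanes)
    out[2] ^= ROUND_CONSTANTS[rnd]
    return out


def p_s(lanes):
    """Substitution: bit-sliced 5-bit S-box applied to all 64 columns."""
    x = list(lanes)
    x[0] ^= x[4]
    x[4] ^= x[3]
    x[2] ^= x[1]
    t = [(x[i] ^ MASK64) & x[(i + 1) % 5] for i in range(5)]
    x = [x[i] ^ t[(i + 1) % 5] for i in range(5)]
    x[1] ^= x[0]
    x[0] ^= x[4]
    x[3] ^= x[2]
    x[2] ^= MASK64
    return x


def p_l(lanes):
    """Linear diffusion: L_i <- L_i ^ (L_i >>> a_i) ^ (L_i >>> b_i)."""
    return [l ^ rotr(l, a) ^ rotr(l, b) for l, (a, b) in zip(lanes, ROTATIONS)]


def ascon_round(lanes, rnd):
    """One round p = p_L o p_S o p_C, mapping beta_(rnd-1) to beta_rnd."""
    return p_l(p_s(p_c(lanes, rnd)))


def sbox_lookup(v):
    """5-bit S-box value for input v (L0 holds the most significant bit)."""
    column = [(v >> (4 - i)) & 1 for i in range(5)]
    out = p_s(column)
    return sum((out[i] & 1) << (4 - i) for i in range(5))
```

Calling `ascon_round` twelve times with `rnd = 0, ..., 11` gives one invocation of $p^{12}$. The helper `sbox_lookup` derives the 32-entry table that a constraint factor for $p_S$ would use.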

Template attack
The Template Attack (TA), introduced by Chari et al. [CRR03], is a powerful profiled side-channel exploitation technique. The attacker first profiles the target device, while operating it in a training mode where they know the data being processed. Traces are recorded at this stage to build Gaussian multivariate trace templates that model the leakage of each of the different known values being processed. Then, during the attack stage, the attacker records an attack trace while an unknown secret is being processed, and compares that trace against each template. The unknown secret is obtained based on the candidate template that is most similar to the attack trace.
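The two stages can be illustrated with a deliberately simplified Python sketch. It assumes a single sample point per trace, so each template is a univariate Gaussian with a pooled variance rather than the multivariate model described above, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def build_templates(values, samples):
    """Profiling stage: one mean per known value, plus a pooled variance."""
    groups = defaultdict(list)
    for v, s in zip(values, samples):
        groups[v].append(s)
    means = {v: sum(g) / len(g) for v, g in groups.items()}
    residual = sum((s - means[v]) ** 2 for v, s in zip(values, samples))
    pooled_var = residual / (len(samples) - len(means))
    return means, pooled_var


def log_likelihoods(sample, means, pooled_var):
    """Attack stage: Gaussian log-likelihood of one sample per candidate."""
    norm = -0.5 * math.log(2 * math.pi * pooled_var)
    return {v: norm - (sample - mu) ** 2 / (2 * pooled_var)
            for v, mu in means.items()}


def best_candidate(sample, means, pooled_var):
    """Return the candidate value whose template matches the sample best."""
    scores = log_likelihoods(sample, means, pooled_var)
    return max(scores, key=scores.get)
```

For example, after profiling with `build_templates([0, 0, 1, 1], [1.0, 1.2, 3.0, 3.2])`, an attack sample of 2.9 is assigned to candidate 1. A real attack would keep the full table of likelihoods instead of only the top candidate, since later stages consume these tables.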
Several variants of the template attack have been proposed to improve the efficiency and accuracy of the profiling procedure. Schindler et al. [SLP05] introduced their $\mathcal{F}_9$ "stochastic model", where each bit in a targeted byte is treated as an independent variable of a multivariate linear-regression model, to predict the expected values of single points on a trace. Standaert and Archambeau [SA08] proposed using Fisher's Linear Discriminant Analysis (LDA) in template attacks for dimensionality reduction. Choudary and Kuhn [CK14a] combined both techniques to provide a predicted probability for each possible value for a target byte, instead of only the probability for the HW of possible values. Such LDA-based dimensionality reduction has several benefits: the projection can increase the signal-to-noise ratio, the reduced dimensionality leads to better covariance estimates, and LDA-based templates have shown better portability across different devices [CK14b].
Next, we perform LDA dimensionality reduction to project longer traces to shorter vectors. To prepare this step, we first build, from the profiling traces $x_{b,t}$ (trace $t$ recorded for byte value $b$) and the expected traces $\bar{x}_b$, two covariance matrices, $B$ and $\Sigma$, which represent the signal and noise, respectively, as (up to normalization constants, which do not change the eigenvectors used for the projection)
$$B = \sum_{b} (\bar{x}_b - \bar{x})(\bar{x}_b - \bar{x})^{\mathsf{T}}, \qquad \Sigma = \sum_{b} \sum_{t} (x_{b,t} - \bar{x}_b)(x_{b,t} - \bar{x}_b)^{\mathsf{T}},$$

Template attack with LDA
where $\bar{x}$ denotes the average of all 256 vectors $\bar{x}_b$. We then project all the $m$-sample traces, including profiling traces $x_{b,t}$, expected traces $\bar{x}_b$, and the attack trace $x_a$, onto the $m'$ ($m' \ll m$) eigenvectors with the largest eigenvalues of $\Sigma^{-1} B$, to obtain $m'$-sample traces $x_{b,t,\mathrm{proj}}, \bar{x}_{b,\mathrm{proj}}, x_{a,\mathrm{proj}} \in \mathbb{R}^{m'}$.
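In this new subspace, where the signal-to-noise ratio is larger, we can now build a pooled covariance matrix $S$ [CK14a] such that the probability density of the attack trace $x_{a,\mathrm{proj}}$ can be modelled as the multivariate Gaussian
$$L(x_{a,\mathrm{proj}} \mid \bar{x}_{b,\mathrm{proj}}, S) = \frac{1}{\sqrt{(2\pi)^{m'} |S|}} \exp\left(-\frac{1}{2} (x_{a,\mathrm{proj}} - \bar{x}_{b,\mathrm{proj}})^{\mathsf{T}} S^{-1} (x_{a,\mathrm{proj}} - \bar{x}_{b,\mathrm{proj}})\right).$$
For the fragment template attack on a 32-bit word, we build the two matrices $B_f$ and $\Sigma_f$ separately for each fragment $f$, grouping the profiling traces by the value $v$ of that fragment, where $n_v$ represents the number of traces in group $v$. $B_f$ only contains signals from fragment number $f$, and signals from the other three bytes no longer count here, but instead contribute to $\Sigma_f$. In other words, they are considered to be switching noise in this model.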
After projecting the profiling traces and attack traces to the $m'$-dimensional subspace (e.g., we used $m' = 8$ for byte and $m' = 16$ for 16-bit fragments) via these two matrices, we can (as before) calculate the pooled covariance matrix and combine it with the projected expected traces into the template for this byte fragment of the target word.
Due to the noise sensitivity of the template attack, the correct candidate will not always top the ranking table. We can use two additional steps to make the attack more resilient to noise: 1) belief propagation, which takes algorithmic dependencies between the targeted values into account, and 2) key enumeration, which tests more combinations of target values than just the top-ranked ones. We elaborate on these techniques in the following sections.

Belief propagation and SASCA
Veyrat-Charvillon et al. [VCGS14] introduced Soft Analytical Side-Channel Analysis (SASCA), an inference technique for template attacks on cryptographic algorithms based on the belief-propagation algorithm [Mac03, Chapter 26]. The idea behind SASCA is that all the probability information available to the attacker is represented as a factor graph, where there are two types of nodes: 1) variables, which represent the intermediate states of the cryptographic algorithm, and 2) factors, which represent how these intermediate states depend on each other and the observed traces. Information can flow bidirectionally through edges that connect variables with factors. We choose a factor graph to reflect the mathematical relations between the target parts in the cryptographic algorithm.
The variable nodes represent the intermediate values in the cryptographic algorithm. You and Kuhn distinguish two types of factor nodes [YK22]. Observation factors $f_m(x_n)$ represent observed probabilities of the values of their only connected variable $x_n$, here usually from a template-based likelihood. Constraint factors $f_m(\mathbf{x}_m)$ are connected to more than one variable $(x_{n_1}, \ldots, x_{n_{k_m}}) = \mathbf{x}_m$ (where $N(m) = \{n_1, \ldots, n_{k_m}\}$ denotes the set of indices of these variables) with a mathematical equation as the constraint (see example in Figure 1).
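The information flow can be thought of as messages passed between variable nodes $x_n$ and factor nodes $f_m$, which in practice are stored in a table, and from which the marginal probabilities of all the candidate values of each variable can be calculated. A message from a variable $x_n$ to a factor $f_m$ is denoted as $q_{n \to m}$, and a message from a factor $f_m$ to a variable $x_n$ as $r_{m \to n}$. Each of these messages is a function of a value $\xi$ of $x_n$. The probability of a candidate $x_n = \xi$ in message $q_{n \to m}$ is
$$q_{n \to m}(\xi) = \prod_{m' \in M(n) \setminus \{m\}} r_{m' \to n}(\xi),$$
where $M(n)$ denotes the set of indices of the factors connected to $x_n$. Meanwhile, the probability of a candidate $x_n = \xi$ in the message $r_{m \to n}$ is
$$r_{m \to n}(\xi) = \sum_{\mathbf{x}_m :\, x_n = \xi} f_m(\mathbf{x}_m) \prod_{n' \in N(m) \setminus \{n\}} q_{n' \to m}(x_{n'}),$$
where $f_m(\mathbf{x}_m)$ is 1 if the values in $\mathbf{x}_m$ satisfy the constraint and 0 otherwise. For the special case of an observation factor, this reduces to
$$r_{m \to n}(\xi) = f_m(x_n = \xi),$$
where $f_m(x_n)$ is the probability table observed from the templates, instead of a constraint function. Finally, we obtain the probability $P_n$ of candidates $x_n = \xi$ by normalizing, $P_n(\xi) = Z_n(\xi) / \sum_{\xi'} Z_n(\xi')$, where $Z_n(\xi) = \prod_{m \in M(n)} r_{m \to n}(\xi)$ is the product of the probabilities of $\xi$ in all the messages $r$ passed to the same variable $x_n$.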
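The message rules can be tried out on a single constraint factor. The Python sketch below is a brute-force illustration under our own assumptions: variables take values in small integer ranges, a constraint is passed as a Boolean function, and the names `factor_to_variable` and `variable_marginal` are hypothetical. It enumerates all value combinations, which is only feasible for factors with few, small variables.

```python
from itertools import product


def normalize(table):
    """Scale a table of non-negative weights so that it sums to one."""
    total = sum(table)
    return [p / total for p in table]


def factor_to_variable(constraint, incoming, target):
    """Message r from a constraint factor to the variable at index `target`.

    `incoming[i]` is the message q from variable i to this factor; the
    entry at index `target` is only used for its length (the domain size).
    `constraint(values)` returns True if the value tuple is allowed.
    """
    out = [0.0] * len(incoming[target])
    for values in product(*(range(len(q)) for q in incoming)):
        if not constraint(values):
            continue
        weight = 1.0
        for i, v in enumerate(values):
            if i != target:
                weight *= incoming[i][v]
        out[values[target]] += weight
    return out


def variable_marginal(messages):
    """Marginal P_n: normalized product of all messages r into a variable."""
    result = [1.0] * len(messages[0])
    for msg in messages:
        result = [a * b for a, b in zip(result, msg)]
    return normalize(result)


def xor_constraint(values):
    """Constraint x2 = x0 XOR x1 on three bits."""
    return values[2] == (values[0] ^ values[1])
```

With bit tables `[0.9, 0.1]` for $x_0$ and `[0.2, 0.8]` for $x_1$, the message sent to $x_2$ is approximately `[0.26, 0.74]`, which matches the probability that the XOR of the two bits is 0 or 1.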
If the graph is a tree structure, all the $r$ and $q$ probabilities can be calculated by traversing the tree recursively, visiting each edge once. However, factor graphs for cryptographic algorithms often feature loops, where this recursion would not terminate.
MacKay [Mac03, Chapter 26] describes a solution called loopy belief propagation (loopy BP). The main idea is to initialize all the values in the table for all messages $q$ with one, then alternately update all the messages in the table for $r$ and then $q$, with renormalization to prevent the probability values from becoming too small. The procedure terminates when it reaches a steady state. We count one update of all $r$ followed by one update of all $q$ as one iteration.

Figure 2: The key enumeration array. (b) The frontier F, blocks labeled in red, when the gray blocks have been enumerated.

Key enumeration
While the belief-propagation step reduces the impact of noise on the ranking tables, the correct candidates may still not top the tables, and more candidates further down will have to be tested as well. Veyrat-Charvillon et al. [VCGRS12] introduced an optimal key enumeration algorithm to search for the correct key across independent ranked likelihood tables.
Assume that there are two independent secret variables $s_0$ and $s_1$ with $M$ and $N$ possible values, respectively. Through some side-channel analysis, we have already obtained their ranked (sorted) probability tables. Here the $m$th and $n$th most likely candidates of these two variables are denoted as $s_{0,m}$ and $s_{1,n}$, respectively, and their corresponding probabilities are denoted as $p_{0,m}$ and $p_{1,n}$. Figure 2a shows an $M \times N$ joint ranked probability table, where each block represents the joint probability, $p_{0,m} \times p_{1,n}$, of a combination $(s_{0,m}, s_{1,n})$.
Since $p_{0,m}$ and $p_{1,n}$ are from sorted tables, we have the following partial order of the joint probabilities $p_{0,m} \times p_{1,n}$ of $(s_{0,m}, s_{1,n})$:
$$p_{0,m} \times p_{1,n} \ge p_{0,m'} \times p_{1,n'} \quad \text{for all } m' \ge m \text{ and } n' \ge n.$$
Therefore, the top-leftmost block, representing $(s_{0,1}, s_{1,1})$, is the combination with the largest joint probability, and the combination with the second largest joint probability will be one of $(s_{0,2}, s_{1,1})$ or $(s_{0,1}, s_{1,2})$. They will both be added into a set called the frontier, F, which includes all possible candidates for the combination with the next largest joint probability. Therefore, we only need to compare values from this set to find the next value pair to be enumerated. In Figure 2b, once all the combinations marked in gray have been enumerated, the frontier F, marked in red, will be the set of all the pairs at the concave corners. While a combination $(s_{0,m}, s_{1,n})$ is being enumerated, we need to update F by removing $(s_{0,m}, s_{1,n})$, and then considering whether $(s_{0,m+1}, s_{1,n})$ or $(s_{0,m}, s_{1,n+1})$ or both shall be added to F, by checking whether each occupies a concave corner in the already enumerated gray part of the array.
Following this algorithm, the probability $p_{1,n}$ will only be queried after the combination $(s_{0,1}, s_{1,n-1})$ has been enumerated and then we need to add $(s_{0,1}, s_{1,n})$ into F; similar operations apply to $p_{0,m}$. This means that we do not need all the values in the probability tables initially, and therefore we can build a tree of iterators recursively to search combinations of candidates from more than two tables.
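A compact way to experiment with this idea for two tables is a best-first search with a priority queue, sketched below in Python. This is a simplified variant and not the algorithm of [VCGRS12] itself: instead of admitting only concave corners to the frontier, it pushes both neighbours of every enumerated pair and uses a `seen` set to avoid duplicates, so the frontier may temporarily hold more candidates than necessary. The function name and the 0-based indices are our own choices.

```python
import heapq


def enumerate_pairs(p0, p1):
    """Yield (m, n, probability) in order of non-increasing joint probability.

    p0 and p1 are probability tables sorted in descending order; m and n
    are 0-based ranks in these tables.
    """
    frontier = [(-p0[0] * p1[0], 0, 0)]  # max-heap via negated probabilities
    seen = {(0, 0)}
    while frontier:
        neg_p, m, n = heapq.heappop(frontier)
        yield m, n, -neg_p
        # The two successors of (m, n) are the only new candidates.
        for mm, nn in ((m + 1, n), (m, n + 1)):
            if mm < len(p0) and nn < len(p1) and (mm, nn) not in seen:
                seen.add((mm, nn))
                heapq.heappush(frontier, (-p0[mm] * p1[nn], mm, nn))
```

Because it is a generator, a caller can stop as soon as a candidate key verifies correctly, for instance by recomputing the tag, without ever materializing the full $M \times N$ array.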

Key rank estimation
In experiments aimed at evaluating the effectiveness of attacks, we know the correct key and merely need to determine its rank. Faster techniques, such as the histogram-based method by Glowacz et al. [GGP+15], can estimate the rank of a correct key in cases where enumeration would be too time consuming.

Building templates for Ascon AEAD
Our attack strategy consists of three main steps: fragment template attack, belief propagation, and key enumeration. In this section, we focus on the fragment template attack.

General experimental assumptions
We demonstrate a profiled fixed-length known-plaintext attack, only targeting the secret key K. In the profiling stage, our attacker can provide varying K, N, A, P, and can observe the corresponding C and T along with recorded power traces. In the attack stage, they can obtain values of N, A, P, C, T, and recorded power traces, to recover the secret key K. We demonstrate our attack by targeting Ascon-128. Note that while Ascon AEAD allows arbitrary-length associated data and plaintexts, in this attack demonstration, we used empty associated data and 7-byte plaintexts, to keep the traces aligned and minimize their length when covering the entire encryption process. In other words, we focus entirely on the two invocations of permutation $p^{12}$ in the Initialization (KDF) and Finalization (TGF) phases, which process K. Figure 3 depicts this target encryption procedure.
Thanks to the lightweight structure of Ascon and our choice of short input size (Figure 3), it is practical to record power traces covering the full AEAD mode. Therefore, we built templates for target fragments of all the states $\alpha_0, \ldots, \alpha_{11}$ and $\beta_{-1}, \ldots, \beta_{11}$ of permutation $p^{12}$ in both Initialization and Finalization, except for fragments of known values: for IV, N, P, C, and T, we assign the probability of the actual fragment value to be 1 and all others 0. The two lanes $L_1$ and $L_2$ of the Initialization input contain the key fragments, which are our main targets.
For each experiment, we separated the recorded traces into categories, by purpose. We recorded the same numbers of traces per category for each of three experiments, which we refer to as U-Os, U-O3, and M-Os, respectively. The first two ran the unmasked implementation [Wea21, ASCON/], compiled either optimized for space (gcc option -Os) or for time (option -O3), whereas the third ran the masked implementation [Wea21, ASCON_masked/] optimized for space (option -Os).
We recorded 10 attack traces each from encryptions with the same key K, which we used for our experiments combining traces from multiple encryptions (Sec. 4.1). We varied nonce N and plaintext P randomly. For the M-Os experiment, we increased the number of recorded attack traces per key K to 100, i.e. there we recorded 100 000 attack traces in total. For the other categories, we varied the inputs K, N, and P randomly for each encryption recorded. In the rest of this section, we show data from the U-Os experiment.

Detecting interesting clock cycles
As the raw traces are very long, it is difficult to directly derive a distribution that describes the power traces for our target intermediate values. As not all the clock cycles are relevant to our target intermediate values, we need only consider the sample points of the clock cycles that are clearly correlated to these intermediate values. We refer to these as the interesting clock cycles.
We followed the method of finding interesting clock cycles for 32-bit key fragments from [YK22]. We divided all the intermediate states ($\beta_{-1}, \alpha_0, \ldots, \beta_{11}$) into 32-bit words and further divided these into four byte fragments (numbered 0, 1, 2 and 3), and then applied multivariate linear regression to model the correlation between the samples on power traces and each byte. After we built the regression model, we calculated its coefficient of determination $R^2 \in [0, 1]$, to see how well the samples fit the model. As we now have four values $R^2_f$, one for each fragment $f$, we used their sum $\sum_f R^2_f$ to estimate the correlation between the model and the power traces for a 32-bit word, and selected the interesting clock cycles based on this sum.
The targeted implementation uses a sliced data representation, where a 64-bit lane is not just separated into a high and a low 32-bit word, but also sliced into its odd and even parts during the permutation, such that one 64-bit rotation becomes two 32-bit rotations. Therefore, data bits, especially the input and output, can be stored separated into both high and low words (H/L words), as well as sliced into even-bit and odd-bit words (E/O words). We decided to detect the interesting clock cycles for both the H/L and E/O words for a lane, and use their union as the set of interesting clock cycles for this lane, to consider both situations.
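The remark that one 64-bit rotation becomes two 32-bit rotations can be checked with a short Python sketch. It assumes that the even word collects bits 0, 2, 4, ... of the lane and the odd word bits 1, 3, 5, ..., counted from the least significant bit; the exact bit order in Weatherley's code may differ, and the function names are hypothetical.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def split_even_odd(lane):
    """Slice a 64-bit lane into its even-bit and odd-bit 32-bit words."""
    even = odd = 0
    for k in range(32):
        even |= ((lane >> (2 * k)) & 1) << k
        odd |= ((lane >> (2 * k + 1)) & 1) << k
    return even, odd


def rotr32(x, r):
    """Rotate a 32-bit word right by r bits."""
    r %= 32
    return ((x >> r) | (x << (32 - r))) & MASK32


def rotr64(x, r):
    """Rotate a 64-bit word right by r bits."""
    r %= 64
    return ((x >> r) | (x << (64 - r))) & MASK64


def rotr64_sliced(even, odd, r):
    """Right-rotate a sliced 64-bit lane by r using two 32-bit rotations."""
    if r % 2 == 0:
        return rotr32(even, r // 2), rotr32(odd, r // 2)
    # For odd r the two words swap roles: the new even word comes from the
    # old odd word, and the new odd word from the old even word, which is
    # rotated by one position more.
    return rotr32(odd, r // 2), rotr32(even, r // 2 + 1)
```

For any lane value and rotation amount, `split_even_odd(rotr64(lane, r))` equals `rotr64_sliced(*split_even_odd(lane), r)`, which is why an implementation on a 32-bit core can keep the state in E/O form throughout the permutation.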
Table 2 shows the number of detected interesting clock cycles for each target 32-bit H/L word of the intermediate states for the full AEAD process. We can see that there were more interesting clock cycles detected for those words closer to input or output (i.e., $\beta_{-1}$ or $\beta_{11}$), as some of their interesting clock cycles were related to operations outside of the Ascon permutation, such as loading the initial states, XORing with P or K, or calculating T.
Among all the words, we detected the highest number of interesting clock cycles for $L_1$ and $L_2$ in $\beta_{-1}$ of Initialization, since these two lanes are loaded with K, which is used four times in the full encryption. Figure 4 shows the $\sum_f R^2_f$ values for the H/L words from $L_1$ and $L_2$ in $\beta_{-1}$ of Initialization with the corresponding clock cycles. We found that the spikes were mainly located in four separate regions, indicating the clock cycles related to the four times when Ascon AEAD uses the key K.
We downsampled the selected interesting clock cycles from 500 to 10 points per clock cycle (PPC) by replacing each run of 50 consecutive samples with their average, and concatenated them to form the traces $x$ used for LDA-based template building (see Sec. 2.2).
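This averaging step is straightforward to express in code. The following Python sketch assumes that a clock cycle is given as a plain list of samples; the function name `downsample` and the decision to drop an incomplete trailing block are our own.

```python
def downsample(samples, factor=50):
    """Replace each block of `factor` consecutive samples by its average.

    Samples left over at the end that do not fill a whole block are dropped.
    """
    usable = len(samples) - len(samples) % factor
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, usable, factor)]
```

With the default factor of 50, a clock cycle recorded at 500 samples is reduced to 10 values, and the reduced cycles can then be concatenated into one trace vector.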

Results of fragment template profiling
We evaluate the quality of our templates using two metrics defined by Standaert et al. [SMY09]: the $n$th-order success rate ($n$-SR), which is the fraction of trials where the correct candidate has rank not larger than $n$, and the guessing entropy (GE), which is the expected value of the rank of the correct candidate (1 being the top rank). The logarithmic guessing entropy (LGE) is the arithmetic mean of the base-2 logarithm of those ranks. After first shortening the traces according to the results of the interesting-clock-cycle detection and then building the LDA-based template parameters ($S$, $\bar{x}_{b,\mathrm{proj}}$ for all 256 values $b$ of a fragment, etc.), we determined for the 1000 traces in our validation set both the LGE and 1-SR value, the latter being the fraction of those traces where the correct fragment (byte) candidate tops the ranking table of all 256 likelihood values $L(x_{a,\mathrm{proj}} \mid \bar{x}_{b,\mathrm{proj}}, S)$.
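The three metrics follow directly from the list of ranks obtained over all trials, as the Python sketch below shows. The function names are hypothetical, and `rank_of` resolves ties in favour of the correct candidate, which is one of several possible conventions.

```python
import math


def rank_of(correct, likelihoods):
    """Rank of the correct candidate (1 = top) in a table of likelihoods."""
    return 1 + sum(l > likelihoods[correct] for l in likelihoods)


def success_rate(ranks, n=1):
    """n-th order success rate: fraction of trials with rank <= n."""
    return sum(r <= n for r in ranks) / len(ranks)


def guessing_entropy(ranks):
    """Guessing entropy: mean rank of the correct candidate."""
    return sum(ranks) / len(ranks)


def log_guessing_entropy(ranks):
    """Logarithmic guessing entropy: mean of log2(rank)."""
    return sum(math.log2(r) for r in ranks) / len(ranks)
```

For the ranks `[1, 1, 2, 4]`, this gives a 1-SR of 0.5, a GE of 2.0 and an LGE of 0.75. The LGE is less dominated by a few very large ranks than the GE.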
Table 3 shows these 1-SR and LGE values for a few example templates, while Figure 5 plots the results for all the target templates.Note that we built the H/L templates for the key, but E/O templates for the other intermediate values, to better match the implementation.
As we had expected, templates for the key fragments had the best quality among all the templates, as K fragments were built from the highest numbers of clock cycles. The results for templates of fragments in the last two lanes in state $\beta_{11}$ in Finalization were also satisfactory, considering that these two lanes are part of the permutation output and then XORed with the key for the tag T, leading to more interesting clock cycles detected. The 1-SRs for fragments from the middle rounds, $\alpha_6$ in Initialization for example, were much lower, while the corresponding LGEs were much higher than those values from either K or $\beta_{11}$ in Finalization, as the optimized implementation of Weatherley appears to reduce the clock cycles that operate on a single intermediate value inside the permutation, whereas the input and output of a permutation will be involved in more operations across the permutations for AEAD mode.
We also show the results of quality evaluation when we built the templates for the key fragments with only one of the four regions of interesting clock cycles in Table 4. These results provide evidence that using the same key more than once in an Ascon AEAD significantly helps the attackers to build better templates.

Belief propagation and key enumeration for Ascon AEAD
Once we have obtained probability tables from the normalized likelihoods of attack traces matched against our fragment templates, we then apply belief propagation and key enumeration.

Factor graph for bitwise loopy BP across all intermediate states
In a factor graph covering the full Ascon AEAD mode with our assumptions, the connections within a single permutation and the connections across permutations form a loopy structure, so we apply a loopy-BP procedure to update the probability tables for K.
With such a complicated loopy structure, we have to marginalize all the probability tables estimated by fragment templates into bitwise tables and perform belief propagation on individual bits as variables for efficiency. As for the tables from E/O templates, we only need a simple transposition step to move the bits to their original places in the H/L words after marginalization.
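Marginalizing a fragment table into bitwise tables amounts to summing, for each bit position, the probabilities of all candidates in which that bit is 0 or 1. The Python sketch below illustrates this for a table indexed by the fragment value; the function name and the (p0, p1) output layout are our own.

```python
def marginalize_to_bits(table, width=8):
    """Turn a probability table over 2**width values into per-bit tables.

    Returns a list of (p0, p1) pairs, one per bit, with bit 0 being the
    least significant bit of the fragment value.
    """
    total = sum(table)
    bits = []
    for j in range(width):
        p1 = sum(p for v, p in enumerate(table) if (v >> j) & 1)
        p0 = sum(p for v, p in enumerate(table) if not (v >> j) & 1)
        bits.append((p0 / total, p1 / total))
    return bits
```

The step discards any dependence between the bits of a fragment: a byte table with 256 entries is reduced to eight pairs of numbers. This is the information loss that the tree-shaped alternative later avoids by working on whole fragments.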

Graph for single encryptions
We first build the factor graph covering a single round of the Ascon permutation. This small factor graph (Figure 6) includes three variable states and their corresponding observation factors: $\beta_{\Omega-1}$, $\alpha_\Omega$, $\beta_\Omega$, and two types of constraint factors: $f_S$ and $f_L$. Since $\alpha_\Omega = p_S(p_C(\beta_{\Omega-1}))$, the $f_S$ constraint factors should update the information following the mathematical relations in functions $p_C$ (constant addition) and $p_S$ (non-linear substitution). We followed how Kannwischer et al. designed their constraint factors for the non-linear step and the constant addition step in Keccak [KPP20, Sec. 4.1]: for the factors of $p_S$, we connect their five input bits and five output bits and use the S-box table as the mathematical constraint, while for $p_C$, we swap the probability values of the two candidates (0 and 1) when the value of the corresponding constant bit is 1. On the other hand, constraint factors $f_L$ update the information following the mathematical relations in the linear function $p_L$, which are all XOR functions with three inputs and one output at the bitwise level (see eq. (2)). For example, $L_0 \leftarrow \Sigma_0(L_0)$ can be realized by the constraint
$$\beta_\Omega[0, j] = \alpha_\Omega[0, j] \oplus \alpha_\Omega[0, (j + 19) \bmod 64] \oplus \alpha_\Omega[0, (j + 28) \bmod 64],$$
where $\alpha_\Omega[i, j]$ means bit $j$ in lane $i$, counting from the least significant bit. Having built the factor graph for the first round of the $p^{12}$ permutation, we can simply repeat the same construction for the remaining eleven rounds, only with different round constants, to cover all the states in one invocation of this permutation.
Considering that Ascon AEAD applies a sequence of $p^a$ and $p^b$ permutations, the overall factor graph will be a concatenation of multiple factor graphs, one for each permutation, connected by constraint factors $f_\oplus$ for XOR functions, and variables for the inputs or outputs processed in between. Figure 7 shows the factor graph covering all the target states in our experiment. According to the encryption shown in Figure 3, the input state of the $p^{12}$ in Finalization will be the output state (or state $\beta_{11}$) of $p^{12}$ in Initialization, XORed with the state $(P \| \texttt{0x80}) \| K \| K'$, where $K'$ is the key K with the least significant bit flipped. Therefore, via a constraint factor $f_\oplus$, the two variables, respectively representing the bit in the first lane $L_0$ of the input state of Finalization and its counterpart in the output state of Initialization, will be connected with the corresponding variable for the bit in the padded plaintext $P \| \texttt{0x80}$. Similarly, bits from $L_1 \| L_2$ of the two states will be connected to the variables for K via constraint factor $f_\oplus$. Bits from $L_3 \| L_4$ are also linked to the variables for K via $f_\oplus$, but the probability values for the least significant bit need to be swapped when the message is exchanged between the K variables and the $f_\oplus$. Likewise, the variables for the last 128 bits of the Finalization output are connected with the variables for K and T via $f_\oplus$.
Graph for multiple encryptions with the same key. Ascon allows reuse of a key with different nonces. So for multiple encryptions with the same key, we introduce another external constraint factor $f_{\mathrm{mext}}$, where the constraint is $K_1 = K_2 = \cdots = K_N$, to connect each key variable from a separate factor graph for a single encryption.
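The effect of such an equality constraint on the final marginal of a key variable can be sketched in a few lines of Python. If each encryption contributes an independent probability table for the same key bit or fragment, then the combined table is the normalized element-wise product of the individual tables. The function name `combine_same_key` is hypothetical, and the sketch ignores the iterative message schedule of the full loopy graph.

```python
def combine_same_key(tables):
    """Combine tables for the same key variable from several encryptions.

    Each table lists the probabilities of all candidate values. Under the
    constraint K_1 = K_2 = ... = K_N, the combined belief is the normalized
    element-wise product.
    """
    combined = [1.0] * len(tables[0])
    for table in tables:
        combined = [a * b for a, b in zip(combined, table)]
    total = sum(combined)
    return [c / total for c in combined]
```

For instance, two weak observations `[0.6, 0.4]` and `[0.7, 0.3]` of the same key bit combine to roughly `[0.78, 0.22]`, which shows how evidence accumulates over several traces.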
Key enumeration. Finally, we apply the key enumeration algorithm [VCGRS12], as discussed in Sec. 2.4, to find the correct bit combination for the key, given the bit probability tables obtained from the belief propagation procedure. We can simply check the correctness of the enumerated key combinations since we know N, A, P, C, and T according to our assumptions in Section 3.1.

Loop-free alternative factor graph
As we can see from the results of the template evaluation in Figure 5, the templates for fragments in the middle states of both the permutations in Initialization and Finalization provide only little information. It may not be worthwhile to perform belief propagation with a large factor graph covering all the middle states. Therefore, we also tried an alternative factor graph, where we removed the nodes for those middle states from the loopy factor graph, and only kept the nodes related to the XOR operation of the key K and the last 128 bits of $\beta_{11}$ in Finalization to calculate the tag T. Figure 8 shows the new smaller factor graph for single encryptions and its expanded version for multiple encryptions with the same key. These smaller factor graphs will similarly output updated probability tables for key enumeration.
There are two advantages of this smaller graph design. The first is that it is no longer a loopy structure, but a tree, so we can update the information recursively by accessing each node only once. The second is that, thanks to the simplicity of the XOR operation, as well as the assumption that the tag T is already known by attackers, it will still be practical to perform belief propagation on byte tables or tables for even larger fragments, and therefore avoid the information loss caused by marginalization to bit tables. In these cases, the belief propagation procedure will output the updated probability tables for fragments instead of bits for enumeration.
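For a single fragment, this tree-shaped update can be sketched as follows. The sketch assumes that a tag fragment $t$ is the XOR of a key fragment $k$ and the corresponding fragment of $\beta_{11}$, so that a candidate $k$ implies the state value $k \oplus t$. The function name and argument layout are hypothetical.

```python
def combine_key_and_state(p_key, p_state, tag_fragment):
    """Update a key-fragment table using the XOR relation with a known tag.

    p_key[k] is the template probability of key fragment value k, and
    p_state[s] that of the matching beta_11 fragment value s. Since
    tag = s XOR k with the tag known, candidate k implies s = k XOR tag.
    """
    combined = [p_key[k] * p_state[k ^ tag_fragment]
                for k in range(len(p_key))]
    total = sum(combined)
    return [c / total for c in combined]
```

Because the known tag turns the XOR into a fixed relabelling of candidates, the update costs one pass over the table, which remains affordable even for 16-bit fragments with 65536 entries.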

Results for belief propagation and key enumeration
In the following, we report both the logarithmic guessing entropy and the $n$th-order success rate as the evaluation metrics for our ability to recover the full key. For key ranks up to $2^{24}$, we determined these through actual key enumeration, whereas higher key ranks we estimated using the histogram-based method of [GGP+15, Alg. 1]. In this way, we determined or estimated the rank of each of the 1000 keys used in our attack traces, and then derived both metrics from these ranks.
Table 5: Base-2 logarithmic guessing entropy (LGE) achieved for our key-recovery attacks in the U-Os experiment with between 1 and 10 attack traces, using either no SASCA stage (using byte fragments), or with a SASCA stage using either the loopy belief propagation (using byte fragments marginalized into bits), or the tree-based belief propagation (using either 8-bit or 16-bit fragments, not marginalized).
Table 5 shows the guessing entropy we achieved for all of our U-Os experiments, using either no belief propagation step, or after applying SASCA using either the loopy or tree-shaped factor graphs. Figure 9 plots for the latter experiments with SASCA the success rates after key enumeration with a given search depth. The results show that with loopy belief propagation on a single attack trace, a $2^{32}$ key enumeration will achieve a success rate of nearly 100%, and applying the tree-shaped belief propagation reduces the cost of the key enumeration to well below $2^{20}$ steps. (With two or more attack traces, hardly any key enumeration is needed.) To see whether a larger template-fragment size collects more information, we also repeated the tree-BP experiment with 16-bit fragments instead of bytes. The bottom two rows in Table 5 show that the guessing entropy drops very roughly by a factor of two. Due to the high quality of the templates for our key fragments, belief propagation was not actually essential for extracting the key from the U-Os target.

Effect of compiler optimization on template attack
In the previous U-Os experiments, we left the gcc optimization set to option -Os (optimize for space), which was the default for the ChipWhisperer platform software. To see whether the compiler's code generation affects the performance of our attack, we decided to repeat the experiment with gcc option -O3 (optimize for time), resulting in the U-O3 recordings. Note that the different optimization options will not affect the execution of Weatherley's Ascon permutation [Wea21, ASCON/internal-ascon-armv7m.S], since its source code is manually optimized assembly code, which bypasses the optimizer. However, they do affect the AEAD code outside the permutations, such as XORing the key K or calculating the tag T, as these mode-level operations are written in C. (Thus, we focus here only on the tree-BP experiment, as the middle rounds of the permutation will not be affected.)
Table 6 and Figure 10 show the same information as Table 5 and Figure 9, respectively, but for compiler option -O3 instead of -Os. Compared to the very successful single-trace attack in the U-Os experiment, the performance here is clearly worse: after our tree BP with 16-bit fragments, we need to search about $2^{27}$ key candidates to achieve a success rate higher than 50% (compared to previously $2^{2}$), and we need to search about $2^{36}$ candidates with 8-bit fragments (compared to previously $2^{4}$). In other words, the U-O3 attack would hardly be practical without both BP and key enumeration.
Table 6: Base-2 logarithmic guessing entropy achieved for our key-recovery attacks in the U-O3 experiment with between 1 and 10 attack traces, using either no SASCA stage (using byte fragments), or with a SASCA stage using the tree-based belief propagation (using either 8-bit or 16-bit fragments, not marginalized).
A look into both the C source code of Weatherley's unmasked implementation, and the assembler listing produced by the compiler (with option -Wa,-adhlns=file.lst), revealed the reason. Although the handwritten assembler code for the permutation uses 32-bit registers, the surrounding C code XORs the key K with the state of the duplex construct. We can also observe this difference on the recorded traces. Figures 11a and 11b show the results of the interesting clock-cycle detection for the high word of the first lane ($L_1$) of K during the calculation of T, when the code was compiled with options -Os and -O3, respectively. For U-Os, the peaks of the $R^2_f$ values of each 8-bit fragment are located in four different clock cycles, indicating that their operations were not executed simultaneously, whereas for U-O3, the peaks are located in the same clock cycle.
Table 7 indicates how much the four-time use of the key in the Ascon AEAD mode helped our single-trace attacks against the unmasked version to succeed. Building the templates for key fragments with clock cycles related to only a single use of the key at a time, compared to those from all four uses, increased the key enumeration cost by a factor of more than $2^{32}$ for both our unmasked experiments.

Attack strategy
Following our experiments above on the unmasked Ascon implementation (recordings U-Os and U-O3), we also tried to apply the combination of fragment template attack, belief propagation, and key enumeration to an implementation with masking (recording M-Os). Our target masked implementation of Ascon AEAD was from the same package by Weatherley [Wea21, ASCON_masked/]. This offers a C implementation of the permutation and protects the inputs (key, nonce, plaintext, etc.) with first-order Boolean masking [CJRR99], separating each of these values into two shares: one is the mask, varying per encryption, provided by a pseudo-random generator based on ChaCha [Ber08], and the other share is the XOR of the input value and the mask. Throughout the encryption process, the intermediate values all likewise remain split into two shares, to randomize all the register values during execution. Compared to the unmasked version, this implementation also avoids some problems that may help side-channel attacks on the former. For example, it no longer XORs 8-bit values when calculating the tag T, and the two shares of the key are only sliced once, rather than three times.
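The two-share representation can be illustrated with the following minimal sketch. It is not taken from the target implementation: the function names are hypothetical, and Python's `secrets` module stands in for the ChaCha-based pseudo-random generator of the masked code.

```python
import secrets


def mask_word(value, bits=32):
    """Split a word into two Boolean shares with value == a XOR b."""
    share_b = secrets.randbits(bits)              # fresh random mask
    share_a = (value ^ share_b) & ((1 << bits) - 1)
    return share_a, share_b


def unmask_word(share_a, share_b):
    """Recombine the two shares into the original word."""
    return share_a ^ share_b


def masked_xor(x_shares, y_shares):
    """XOR is linear, so it can be applied to each share separately
    without ever recombining the unmasked values."""
    return x_shares[0] ^ y_shares[0], x_shares[1] ^ y_shares[1]


key_word, state_word = 0x01234567, 0x89ABCDEF
k = mask_word(key_word)
s = mask_word(state_word)
assert unmask_word(*masked_xor(k, s)) == key_word ^ state_word
```

Only the linear XOR is shown here; the nonlinear operations of the permutation need additional care in a masked implementation and are outside the scope of this sketch.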
Bronchain and Standaert [BS21] attacked Boolean masked implementations of AES and Clyde by extending the factor graph for the unmasked algorithm with nodes representing the original values connected to their shares in the masked version via an $f_\oplus$ factor. Following this idea, we introduce a multi-trace attack derived from the previous tree-shaped one, where the factor graph (Figure 12) will also cover the two shares of the original target states. Similar to the setting of the previous unmasked version, we use the empty associated data A and fix the size of the plaintext P to seven bytes. In the profiling stage, we assume that attackers can access all the input and output values (K, N, P, C, and T) as well as the seed of the pseudo-random generator, so that they can produce fragment templates for the key, its two shares, and all the other intermediate states in the factor graph. In the attack stage, we only use the probability tables obtained from the templates, and the known T values, to perform belief propagation and key enumeration, without knowledge of the seed.
Note that Figure 12 reflects the mathematical relations among the original values and their shares, not the actual steps in this masked implementation to calculate T. The implementation first calculates the two shares $T_A$ and $T_B$, and finally $T := T_A \oplus T_B$. Therefore, we cannot build templates for the fragments of $\beta_{11}$, since this value never appears. Instead we assign them a probability table with a uniform distribution (i.e., no information update). Besides, our assumption was that the attacker knows T, so we do not need the templates or probability tables of $T_A$ and $T_B$, given that they will not affect the belief propagation.
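The role of the XOR factor between a value and its two shares, and the accumulation over several traces with the same key, can be sketched as follows. This is a simplified illustration that covers only one key fragment and ignores the rest of the factor graph in Figure 12; the function names and the toy 2-bit tables are hypothetical, and multiplying the per-trace tables assumes that the traces are independent given the key.

```python
import math


def combine_shares(p_a, p_b):
    """Probability table of k = a XOR b from the tables of both shares."""
    size = len(p_a)
    out = [0.0] * size
    for a in range(size):
        for b in range(size):
            out[a ^ b] += p_a[a] * p_b[b]
    return out


def accumulate_traces(per_trace_tables):
    """Multiply per-trace tables for the same key (in the log domain)
    and normalize the result."""
    size = len(per_trace_tables[0])
    log_scores = [0.0] * size
    for table in per_trace_tables:
        for k in range(size):
            log_scores[k] += math.log(max(table[k], 1e-300))
    peak = max(log_scores)
    weights = [math.exp(s - peak) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example with 2-bit fragments: the key fragment is 2 in both traces,
# but the shares differ because a fresh mask is used per encryption.
trace1 = combine_shares([0.1, 0.1, 0.2, 0.6], [0.1, 0.6, 0.2, 0.1])
trace2 = combine_shares([0.2, 0.1, 0.6, 0.1], [0.6, 0.2, 0.1, 0.1])
posterior = accumulate_traces([trace1, trace2])
print(max(range(4), key=posterior.__getitem__))  # prints 2
```

Since the masks change with every encryption, the share tables of different traces cannot be merged directly; only the tables for the unmasked key fragment can be combined across traces.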

Experiments
As mentioned in Sec. 3.2, while recording the M-Os traces from the masked version, we kept the setup the same as for the U-Os traces, except that we recorded a larger number of attack traces, to have 100 encryptions for each key.
Table 8 shows the number of interesting clock cycles detected in the M-Os experiment, while Table 9 shows the results of the quality evaluation of the fragment templates. Here we built fragment templates for sliced registers (E/O words), since that is how the implementation represents most of our target states. We can see that the masking does protect the key K to some extent, as fewer interesting clock cycles (37 and 35 for the two lanes, respectively) were detected compared to the unmasked experiments (see Table 2, $\beta_{-1}$ in Init.), leading to lower-quality templates, as evident from the higher guessing-entropy values for these fragments. However, for the two shares $K_A$ and $K_B$, we still detected a large number of interesting clock cycles (144 and 139 for the two lanes of $K_A$, 240 and 259 for $K_B$), and therefore the quality of their templates is still promising once attackers can calculate the random numbers for masking in the profiling stage. Note that there are more interesting clock cycles for $K_B$, the random mask, than for $K_A$, because for the former we can also detect leaks from where the masks are generated.
For the belief propagation and key enumeration, Figure 13 and Table 10 show the key-recovery success rates and guessing entropy, respectively, achieved with different numbers of attack traces. Single-trace attacks did not succeed; however, starting from around five attack traces recorded with the same key, a $2^{36}$ key enumeration is likely to succeed.

Conclusion
Ascon's design supports a leveled implementation of side-channel countermeasures, meaning that only its Initialization and Finalization phases (two key applications and one permutation $p^a$ each) need to be protected against DPA attacks, whereas for the remaining operations (message XORs and permutations $p^b$), SPA countermeasures are sufficient to ensure confidentiality of the encryption process, and no further countermeasures are required to ensure integrity [BBC+20, VCS23]. This was achieved by applying the key four times to the capacity, both before and after $p^a$ during both Initialization and Finalization.
However, our results demonstrate that if such countermeasures are either absent (U-Os, U-O3), or not entirely effective against a template attack (M-Os), then this quadruple use of the key in Ascon AEAD mode actually increases the exposure of the key in profiled side-channel attacks (see Table 4), which may enable full key recovery (see Table 7), compromising both confidentiality and integrity. That higher exposure of the key, which in our loopy factor graph is directly connected to four different locations (Figure 7), enables the belief propagation algorithm to pass messages between Initialization and Finalization. Previous attack simulations by Luo et al. [LWL+22] did not exploit this higher key exposure and used only the mathematical relations around the first use of the key, at the start of Initialization.
Our successful single-trace attack (U-Os) benefited from some remaining 8-bit instructions in an open-source 32-bit adaptation of the algorithm. Yet, even once these were fully converted to 32-bit instructions (U-O3), we could still recover the key used in this unmasked Ascon AEAD implementation, by belief propagation and key enumeration, with high success rates, from no more than 10 traces. Our attack procedure should also be applicable to Ascon-128a, with only minor modifications.
Our successful multi-trace attack on the more carefully written first-order Boolean masked Ascon AEAD implementation demonstrates how such protection, originally designed against CPA/DPA-style attacks, can be overcome by an appropriately designed template attack.However, a real-world application of such a profiled attack may still pose challenges considering our assumptions about the access to inputs that the attacker needs during the profiling phase.
An additional outstanding challenge remains to recover complete Ascon hashing inputs from a single trace, as was accomplished in [YK22] for SHA-3 (Keccak), another sponge-based hash function. This will likely require better templates (e.g., directly built for 32-bit values, as recently proposed by Cassiers et al. [CDSU23]) for the internal states of the Ascon permutation. Our templates for these (e.g., Init. $\alpha_6$ in Table 3 and Figure 5) were less effective than those reported for the Keccak permutation in [YK22, Table 2]. However, even with the very similar hardware setting we used, such direct comparisons are still complicated by the fact that the Keccak and Ascon target implementations came from different authors and had different programming styles. The former was entirely portable C code that left the 64-bit to 32-bit conversion to the compiler, whereas the latter offered a handcrafted assembler implementation of the permutation. That, but also the fact that Ascon's permutation is significantly simpler (for example, it lacks an equivalent of Keccak's complex θ step), overall appears to have resulted in less information leaking from the fewer instructions needed by Ascon to process its intermediate values.
We hope that our attack methodology can serve as a benchmark for the design of stronger masking protections, and other implementation guidance, specifically for protecting against profiled attacks on software implementations. Data and source code used are available at: https://www.cl.cam.ac.uk/research/security/datasets/ascon/
In this template building approach, we first group the profiling traces according to the targeted byte value $b \in \{0, \ldots, 255\}$. A profiling trace observing target value $b$ is denoted as $\mathbf{x}_{b,t}$, where $t \in \{1, \ldots, n_b\}$ enumerates the traces in group $b$. With the $F_9$ stochastic model, we treat each member bit ($b[0]$ to $b[7]$) as an independent variable in a multivariate linear regression to calculate coefficients $c_0$ to $c_7 \in \mathbb{R}$ and a constant $c_8 \in \mathbb{R}$ to predict the expected values of samples as $\bar{x}_b = \sum_{l=0}^{7} (b[l] \cdot c_l) + c_8$ for each possible value $b$. To represent the expected vector of an entire $m$-sample trace, we write $\bar{\mathbf{x}}_b = \sum_{l=0}^{7} (b[l] \cdot \mathbf{c}_l) + \mathbf{c}_8$, where $\mathbf{c}_0, \ldots, \mathbf{c}_8 \in \mathbb{R}^m$.
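The prediction step of this linear bit model can be sketched as follows for a single sample point. The function name and the coefficient values are hypothetical and chosen only for illustration (they are not fitted to any recorded traces), and we assume that b[0] denotes the least significant bit of the byte.

```python
def expected_sample(b, coeffs):
    """Expected sample value for byte value b under a nine-coefficient
    linear bit model: coeffs[0..7] weight the bits b[0..7] and
    coeffs[8] is the constant term."""
    return sum(((b >> l) & 1) * coeffs[l] for l in range(8)) + coeffs[8]


# Illustrative coefficients: every bit contributes equally, which
# reduces the model to a Hamming-weight model with an offset.
coeffs = [1.0] * 8 + [0.5]
table = [expected_sample(b, coeffs) for b in range(256)]
print(table[0], table[0b101], table[255])  # prints 0.5 2.5 8.5
```

In a real profiling stage the nine coefficients would be estimated per sample point by linear regression over the profiling traces, so that all 256 expected values follow from nine parameters rather than from 256 separate group means.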
Let $V_{f,b}$ be the set of all 32-bit values where fragment number $f$ has value $b$. For each $f$, we apply the $F_9$ stochastic model to obtain the 256 expected trace vectors $\bar{\mathbf{x}}_{f,b} = \sum_{l=0}^{7} b[l] \cdot \mathbf{c}_{f,l} + \mathbf{c}_{f,8}$, from the traces $\mathbf{x}_{v,t}$ with $v \in V_{f,b}$, respectively. For the LDA procedure, we calculate B and Σ separately for each fragment:

Figure 1: An example of the factor graph for the operation $x_c = x_a \oplus x_b$. Circular nodes represent variables and square nodes represent factors. The three observation factors $f_1$, $f_2$, $f_3$ represent the template likelihoods and one constraint factor $f_\oplus$ represents the operation.

Figure 3: Ascon-128 with empty associated data A and seven-byte plaintext P.

Figure 4: The top plot shows the $\sum_f R^2_f$ results for each 32-bit word of the 128-bit K for U-Os. The spikes lie in the marked regions corresponding to the four uses of K. The bottom plot shows the maximum of the SNR values among the four fragments.

Figure 5: First-order success rate (left) and logarithmic guessing entropy (right) of all target fragment templates from U-Os. Each row represents a 40-byte state, e.g. state 0 is $\beta_{-1}$, state 1 is $\alpha_0$, state 2 is $\beta_0$, etc., in chronological order. The red blocks represent bytes of the known values IV, N, for which no templates were needed.

Figure 9: Key-recovery success as a function of the search depth ($2^n$-SR) for the U-Os experiment when performing SASCA with either our loopy or tree-shaped factor graphs.

Figure 10: Key-recovery success rates on Ascon-128 with both 8-bit and 16-bit fragments in the U-O3 experiment. The little circle marks the success rate if the search depth is limited to the respective guessing entropy from Table 6. Cf. Figure 9.

Figure 11: The $\sum_f R^2_f$ results and the $R^2_f$ values for each byte fragment ($f = 0, 1, 2, 3$) of the high word of $L_1$ in K

Figure 12: Factor graph for our M-Os experiment. (Each variable node also connects to an observation factor node, which is omitted in this graph.)

Figure 13: Key-recovery success rates in the M-Os experiments on the masked implementation as a function of the key-enumeration search depth. The little circle marks the success rate if the search depth is limited to the respective guessing entropy from Table 10.

Table 2: Number of interesting clock cycles detected for 32-bit intermediate words in U-Os. The detection for IV and N is not needed since they are known values.
Our target [Wea21, ASCON/] implements a bit-interleaved [DEMS21, Sec. 4.1.1] version of Ascon,

Table 3: Quality evaluation of selected templates from the U-Os experiment, using both the top-rank success rate (1-SR) and the base-2 logarithm of the guessing entropy (LGE).

Table 4: Quality evaluation of fragment templates for the key of Ascon AEAD with either all or only one part of the interesting clock cycles (U-Os experiment).

Table 7: This LGE table compares the single-trace attack performance between including in the selected clock cycles only one of the four key-use regions 1, 2, 3, or 4 (from Figure 4), as opposed to all four together, all using tree BP with byte fragments.
For example, the C code XORs two lanes of the Finalization output ($\beta_{11}$) with K, to generate the tag T, using the macro in [Wea21, ASCON/internal-util.h], which is actually a loop processing individual bytes. When compiled to optimize code space (i.e., minimize the size of the executable) with gcc option -Os, the resulting ARMv7-M assembler code looks almost exactly as the source code suggests, i.e., a loop over 16 bytes, which loads one byte from K and one from $\beta_{11}$ into 32-bit registers, XORs them, and stores one byte of T per iteration. In contrast, if we instead ask the compiler to optimize for time (-O3), it not only unrolls that loop, but also converts it into a sequence of just four repetitions of the operations for loading, XORing and storing 32-bit words. In other words, the optimizer here converted an 8-bit implementation of the key XOR operation into a 32-bit implementation.
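The two instruction patterns described above can be mimicked by the following sketch, which is only a Python analogy and not the C macro or the generated ARMv7-M code. The function names are hypothetical and the little-endian word order is an assumption made for the illustration.

```python
def xor_bytewise(key, state):
    """16 separate byte operations, analogous to the byte loop."""
    return bytes(k ^ s for k, s in zip(key, state))


def xor_wordwise(key, state):
    """Four 32-bit word operations, analogous to the word sequence."""
    out = bytearray()
    for i in range(0, 16, 4):
        key_word = int.from_bytes(key[i:i + 4], "little")
        state_word = int.from_bytes(state[i:i + 4], "little")
        out += (key_word ^ state_word).to_bytes(4, "little")
    return bytes(out)


key = bytes(range(16))
state = bytes(range(100, 116))
assert xor_bytewise(key, state) == xor_wordwise(key, state)
```

Both variants produce the same 16 output bytes, but in the first each key byte is handled by its own operation, whereas in the second four key bytes share one operation, which is consistent with the clock-cycle peaks moving from four separate cycles to a single cycle.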

Table 8: Number of interesting clock cycles detected for the M-Os experiment.

Table 10: Logarithmic guessing entropy achieved by our key-recovery attack with 8-bit fragment templates, using 1-100 encryption traces, respectively (tree BP, M-Os experiment). (The no-BP control experiment, using only templates for K, found hardly any exploitable first-order leakage of K from the targeted implementation.)