GE vs GM: Efficient side-channel security evaluations on full cryptographic keys

Security evaluations for full cryptographic keys have been a very important research topic for the past decade. An efficient rank estimation algorithm was proposed at FSE 2015 to approximate the empirical guessing entropy remaining after a side-channel attack on a full AES key, by combining information from attacks on each byte of the key independently. However, this approach could not easily scale to very large keys, over 1024 bits. Hence, at CHES 2017, a new approach for scalable security evaluations was proposed, based on Massey's guessing entropy, which was shown to be tight and scalable to very large keys, even beyond 8192 bits. Then, at CHES 2020, a new method was proposed for estimating the empirical guessing entropy in full-key evaluations, which also showed important divergences between the empirical guessing entropy and Massey's guessing entropy. However, there has been some confusion in recent publications of side-channel evaluation methods relying on these two variants of the guessing entropy. Furthermore, it remained an open problem to decide which of these methods should be used and in which context, particularly given the wide acceptance of the empirical guessing entropy in the side-channel community and the relatively little use of the other. In this paper, we tackle this open problem through several contributions. First of all, we provide a unitary presentation of both versions of the guessing entropy, allowing an easy comparison of the two metrics. Secondly, we compare the two metrics using a set of common and relevant indicators, as well as three different datasets for side-channel evaluations (simulated, AVR XMEGA 8-bit microcontroller and a 32-bit device). We also used these indicators and datasets to compare the three full-key evaluation methods from FSE 2015, CHES 2017 and CHES 2020, allowing us to provide a clear overview of the usefulness and limitations of each method.
Furthermore, our analysis has enabled us to find a new method for verifying the soundness of a leakage model, by comparing both versions of the guessing entropy. This method can be easily extended to full-key evaluations, hence leading to a new useful method for side-channel evaluations.


Introduction
Power and electromagnetic analysis attacks are powerful tools to extract secret information from hardware devices, such as modern System-on-Chip processors used in smartphones or the cryptographic microcontrollers used in banking smartcards. These side-channel attacks apply a divide-and-conquer strategy, targeting each subkey byte of a cryptographic algorithm independently. This may allow an attacker to mount a practical attack on a block cipher such as AES, when used with a key of 128 or 256 bits (16 or 32 bytes, respectively), by targeting each of the 16 or 32 key bytes independently, whereas a purely brute-force search attack on the full key is computationally infeasible.
While classic side-channel attacks focused on attacking a single key byte, recent advances in side-channel attacks have focused on the problem of estimating the rank of the full key of a cryptographic algorithm, after obtaining sorted lists of probabilities for the different subkeys that compose the full key (e.g. lists for the 16 subkey bytes of AES, when used with a 128-bit key). These algorithms represent very useful tools for security evaluators that need to estimate the security of a given device.
One of the first approaches towards estimating the rank of a full 128-bit key was proposed by Veyrat-Charvillon et al. [CNG+13], albeit with a considerable error margin. Afterwards, other algorithms [GCG+15, MPO+15, BJL+15] have reduced the bounds of this estimation to within one bit for 128-bit keys and can run within seconds of computation, after being given a list of sorted probabilities for the individual subkeys. One of the most efficient such algorithms is the one by Glowacz et al. presented at FSE 2015 [GCG+15], which allows estimating the empirical guessing entropy introduced by Standaert et al. [SMY09] in the full-key scenario. However, these algorithms cannot easily scale to large keys composed of more than 128 bytes (e.g. a 2048 or 4096-bit RSA key) while at the same time providing tight bounds.
In this context, Choudary and Popescu [CP17] presented at CHES 2017 a new approach, based on mathematical bounds of Massey's Guessing Entropy [Mas94], to bound the guessing entropy remaining after a side-channel attack for very large cryptographic keys (or other secret data). They showed that their method works for keys of 8192 bits (1024 bytes) and beyond, in almost constant time and memory, which none of the other methods could do.
Then, at CHES 2020, Zhang et al. [ZZD+20] presented a new approach for estimating the empirical guessing entropy in full-key evaluations, that may provide advantages over previous approaches such as the FSE 2015 method.
Concurrent to these publications and also within the paper of Zhang et al., there have been several observations regarding the limitations of each method and in particular the observation that the empirical guessing entropy and Massey's guessing entropy may diverge considerably, leading to substantially different results [GV18, DW19, AMP+19, ZZD+20]. Nevertheless, it remained an open problem to decide which of the two guessing entropies that are estimated by the full-key evaluation methods cited above (FSE 2015, CHES 2017, CHES 2020) is better suited for side-channel evaluations.
In this paper, we give an answer to this open problem, by attentively presenting and comparing the two guessing entropy metrics as well as these three full-key estimation algorithms. We show both analytically and experimentally the advantages and weaknesses of each method, by using a set of common indicators and three side-channel datasets: a simulated dataset, traces from the hardware AES engine of the AVR XMEGA microcontroller and traces from a software bitsliced AES implementation on a 32-bit device.
This analysis enables us to provide a guide that can help us decide which method to use for various side-channel security evaluations. Furthermore, our analysis has enabled us to develop a new method for deciding whether a leakage model is sound, by actually combining, rather than choosing, the two variants of the guessing entropy.
In the next section we provide some background on full-key side-channel evaluations. Then, in Section 3 we provide a detailed presentation of the two guessing entropy variants used in the side-channel literature: Massey's guessing entropy and the empirical guessing entropy. Then, in Section 4 we present the three full-key evaluation methods analysed in this paper (FSE 2015, CHES 2017, CHES 2020). In Section 5 we present our evaluation context: the key indicators used for comparing implementations as well as our datasets and the method of Template Attacks used to implement our side-channel attacks. Sections 6 and 7 present our analysis for the guessing entropy metrics as well as for the full-key evaluation methods, respectively.
Background: security evaluations for side-channel attacks on full cryptographic keys

Given a physical device (e.g. a smartcard) that implements a cryptographic algorithm, such as AES, we may record side-channel traces (power consumption or electromagnetic emissions) using an oscilloscope. In this case, for each encryption of a plaintext p_i with a key k, we can obtain a leakage trace x_i that contains some information about the encryption operation.
For the particular case of AES and other similar block ciphers that use a substitution box (S-box), a common target for side-channel attacks is the S-box operation v = S-box(k ⊕ p) from the first round of the block cipher. Since this operation is performed for each subkey k separately (for AES each subkey only has 8 bits), we can attack each of the subkeys independently. By using information from the leakage traces, a side-channel attack such as DPA [KJJ99], CPA [BEC+04] or Template Attacks [CRR02] can assign higher probabilities to the correct subkeys, turning the search for each subkey into a very effective guided brute-force search.
After obtaining the lists of probabilities for each subkey, we may need to combine these lists in some way in order to determine the most likely values for the full cryptographic key. One important motivation for this is that secure devices, such as the microcontrollers used in EMV cards, need to obtain a Common Criteria [Com] or EMVCo [EMV] certification at some assurance level (e.g. EAL4+). To provide such certification, evaluation laboratories may need to verify the security of devices against side-channel attacks also for the case of full-key recovery attacks, in particular when some subkeys may leak considerably differently from others.
For the particular case of AES, we need to combine the lists for 16 subkey bytes (128-bit key) up to 32 subkey bytes (256-bit key). If the target device leaks enough information and sufficient measurements are taken, then the attack may assign a probability close to one to the correct subkey value, while assigning very small probabilities to the other candidate subkey values. In this case the combination is trivial, as we only need to use the most likely value for each subkey. However, in practice, due to noise in the measurements and various security measures in secured devices, the correct value of each subkey may be ranked anywhere between the first and the last position. In this case, a trivial direct combination of all the lists of probabilities is not computationally feasible. Note that this problem arises in any scenario where we need to combine multiple lists of probabilities.
To deal with this combination problem in the context of side-channel attacks, two kinds of algorithms have emerged in recent years: key enumeration and rank estimation algorithms. Key enumeration algorithms [VGR+12,PRS+16] provide a method to output full keys in decreasing order of likelihood, such that we can minimize the number of keys we try until finding the correct one (which is typically verified by comparing the encryption of a known plaintext/ciphertext pair).
The other kind of algorithms, which are the main focus of this paper, are guesswork estimation algorithms. These algorithms provide an estimate of the level of security remaining after a side-channel attack targeting a full cryptographic key, i.e. the estimated number of keys we should try (guess) until finding the correct one if we were to apply an approach similar to key enumeration. The great advantage of such estimation algorithms is that we can estimate the guesswork even if it is very high (e.g. 2^80 or larger), whereas enumerating such a large number of keys is computationally infeasible.
Veyrat-Charvillon et al. [CNG+13] proposed the first practical guesswork estimation algorithm for 128-bit keys, known as a rank estimation algorithm because it estimates the key rank [SMY09] (empirical guessing entropy). This algorithm could run in between 5 and 900 seconds. The main drawbacks of this algorithm are that the bounds of the rank estimation can be up to 20-30 bits apart from the real key rank and that the time required to tighten the bounds increases exponentially. Soon afterwards, many other algorithms, including those analysed in this paper, provided more efficient and scalable methods to perform guesswork estimation [GCG+15, MPO+15, BJL+15, CP17, GV18, DW19, ZZD+20, DW21].
Among these, we shall focus on three remarkable methods: a) the rank estimation from FSE 2015 [GCG+15], one of the first efficient implementations of rank estimation and still among the best-performing methods to this day; b) the scalable bounds on Massey's guessing entropy from CHES 2017 [CP17], the most scalable solution to date; c) the rank estimation method from CHES 2020 [ZZD+20], which introduces a new approach to rank estimation, based on estimating the distribution of scores.
We start below with a presentation of the two guesswork metrics (Massey's guessing entropy and the empirical guessing entropy), on which the three full-key evaluation methods mentioned above are based.

Security metrics based on expected guesswork
The methods analysed in this paper estimate or bound two main security metrics: the key rank (or empirical guessing entropy), which we shall refer to as GE in this paper, and Massey's guessing entropy, which we shall refer to as GM. Unfortunately, over the past decade these metrics have been conflated in several ways, causing some confusion. Hence, in this section we aim to clarify each metric in the context of side-channel attacks. This will be useful and necessary to better understand the differences between the three methods analysed afterwards, in the remainder of the paper. We start with a brief historical overview of these metrics and then provide more details on each of them.
The first description of what we term Massey's guessing entropy was most probably in the paper by James Massey in 1994 [Mas94], where he defined a value G as "the number of guesses used in the guessing strategy that minimizes E[G]". A few years later, under his supervision, Christian Cachin published his PhD thesis on "Entropy measures and unconditional security in cryptography" [Cac97]. In his thesis, Cachin termed the measure E[G] (rewritten as E[G(X)]) the "guessing entropy of X", where X is the random variable to be guessed (e.g. the key of a cryptographic algorithm). Also in this thesis, Cachin presented the "conditional guessing entropy of X given Y", E[G(X|Y)], for "the case of guessing X with knowledge of a correlated random variable Y". This conditional guessing entropy was then used in the context of side-channel attacks, probably for the first time, by Köpf and Basin [KB07]. Shortly afterwards, Standaert et al. [SMY09] defined an empirical version of the guessing entropy that they also termed "guessing entropy". This definition and measure was then commonly used within the side-channel research community to evaluate the success of side-channel attacks. However, this also introduced some confusion, for several reasons: (a) the measure is actually a conditional guessing entropy, as it depends on the side-channel observations; (b) the same "guessing entropy" term is used for a measure that is computed differently from the previously introduced conditional guessing entropy, defined by Cachin [Cac97] and used in the side-channel context by Köpf and Basin [KB07]; (c) this measure has been called either "guessing entropy" or "rank" or "key rank" across different publications. As a result, it is not entirely clear which metric should be used: whether we should prefer one metric over the other, or in which context we should use each of them.
Throughout this paper, we aim to bring some light into this issue, given the importance of these metrics for the scalable methods evaluated here.

Massey's guessing entropy (GM)
As detailed earlier, James L. Massey proposed in 1994 a metric used to capture the "number of guesses used in the guessing strategy that minimizes E[G]" [Mas94] (until we find the desired value, e.g. some cryptographic key). Cachin rewrote this value as E[G(X)] and termed it the "guessing entropy of X" [Cac97], specifying the random variable X to be guessed. This is computed as:

E[G(X)] = \sum_{i=1}^{N} i \cdot p_i, \qquad (1)

where p_1 ≥ p_2 ≥ . . . ≥ p_N are the probabilities for the different values of the random variable X, in descending order, according to the guessing strategy.
Cachin then also defined G(X|Y), for the case of guessing X with knowledge of a correlated random variable Y, as the "guessing function for X given Y when G(X|y) is a guessing function for the probability distribution P_{X|Y=y}". In this case, we can compute the average number of guesses needed to determine X, with knowledge of a correlated random variable Y, obtaining the conditional guessing entropy of X given Y as:

E[G(X|Y)] = \sum_{y} P(Y = y) \cdot E[G(X|Y = y)]. \qquad (2)

Köpf and Basin used this conditional guessing entropy in the context of side-channel attacks (where the correlated random variable Y is the side-channel leakage), stating that E[G(X|Y)] is "a lower bound on the expected number of off-line guesses that an attacker must perform for key recovery after having carried out a side-channel attack" [KB07].
For our side-channel evaluation context, we are interested in the guessing entropy (or conditional guessing entropy) of a secret key K given the leakage X, i.e. E[G(K|X)]. For this, we can first compute E[G(K|X = X)] as:

E[G(K|X = X)] = \sum_{i=1}^{|S|} i \cdot P(k_i | X = X), \qquad (3)

where |S| represents the number of possible values of K and P(k_1|X = X) ≥ P(k_2|X = X) ≥ . . . ≥ P(k_{|S|}|X = X) represent the conditional probabilities obtained after a side-channel attack with traces X (see acquisition details in Section 5), sorted in descending order (according to the guessing strategy). Then, we can compute E[G(K|X)] as:

E[G(K|X)] = \sum_{X} P(X) \cdot E[G(K|X = X)]. \qquad (4)

Since it is often not possible to iterate over all the possibilities of the leakage space, we often approximate the above expectation using enough experiments (N), obtaining the conditional guessing entropy as follows:

E[G(K|X)] \approx \frac{1}{N} \sum_{q=1}^{N} E[G(K|X = X_q)]. \qquad (5)

We shall refer to this value as Massey's guessing entropy (or GM for short) in the remainder of the paper. Hence, we have:

GM = \frac{1}{N} \sum_{q=1}^{N} \sum_{i=1}^{|S|} i \cdot P(k_i | X = X_q). \qquad (6)
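As a concrete illustration, the computation of Massey's guessing entropy for one experiment, and its average over N experiments, takes only a few lines. This is our own minimal sketch (the function names are ours), not code from any of the cited papers:

```python
import numpy as np

def massey_ge(probs):
    """Massey's guessing entropy for one experiment: the sum of
    i * p_i over the descendingly sorted probabilities p_1 >= ... >= p_N."""
    p = np.sort(np.asarray(probs, dtype=float))[::-1]
    return float(np.sum(np.arange(1, len(p) + 1) * p))

def gm(prob_matrix):
    """GM: the average of Massey's guessing entropy over N experiments
    (one row of conditional probabilities per experiment)."""
    return float(np.mean([massey_ge(row) for row in prob_matrix]))

# A key recovered with certainty needs exactly one guess:
assert massey_ge([1.0, 0.0, 0.0, 0.0]) == 1.0
# A uniform distribution over N = 4 keys needs (N + 1) / 2 = 2.5 guesses on average:
assert massey_ge([0.25, 0.25, 0.25, 0.25]) == 2.5
```

Note that the sum depends only on the sorted probability values, not on which candidate is the correct key; this observation will matter later when comparing the precision of GM and GE.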

Empirical guessing entropy (GE)
From a more empirical perspective, Standaert et al. presented in 2009 another guessing entropy, based on the position of the correct key in the sorted vector of conditional probabilities [SMY09].
Let P(k_1|X = X) ≥ P(k_2|X = X) ≥ . . . ≥ P(k_{|S|}|X = X) be again the conditional probabilities obtained after a side-channel attack with traces X, sorted in descending order (according to the guessing strategy), for each of the |S| possible values of the key. Then, we can define this guessing entropy for a single experiment (typically known as the rank) as:

rank(K|X = X) = i, \text{ such that } k_i = k, \qquad (7)

where k represents the correct key and X the side-channel traces. That is, rank(K|X = X) provides the actual position of the correct key k in the list of candidate values, sorted according to the guessing strategy (typically after a side-channel attack, in our context).
Similarly to the case of Massey's guessing entropy GM (see above), we are also interested here in obtaining an expected or average value for this rank. We can compute this as:

\overline{rank}(K|X) = \frac{1}{N} \sum_{q=1}^{N} rank(K|X = X_q), \qquad (8)

where N is again the number of experiments used for a sufficient approximation of the rank. We shall refer to this last value as the empirical guessing entropy (or GE for short). Hence, we have:

GE = \overline{rank}(K|X) \approx E[rank(K|X)]. \qquad (9)

We can also compute the expected value of the rank, rank(K|X), based on the probability that the correct key k is ranked at a given position (pos_k) in the descendingly sorted list of possible key values:

E[rank(K|X)] = \sum_{i=1}^{|S|} i \cdot P[pos_k = i | X], \qquad (10)

although the probabilities P[pos_k = i|X] are difficult to use in scalable security estimation methods. However, there is another form of the above equation, easier to estimate and used by one of the methods analysed in this paper (see Section 4.3), based on the probabilities that other keys (k_i) are more (or less) likely than the correct one (k):

E[rank(K|X)] = 1 + \sum_{k_i \neq k} P[P(k_i|X) > P(k|X)]. \qquad (11)

This last equation can be more helpful than Equation 10, because the summation here can be estimated using the normal cumulative distribution function, as shown in Section 4.3.
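The rank for one experiment and its average over experiments can be sketched as follows (our own illustrative code, with hypothetical function names):

```python
import numpy as np

def rank_of_key(probs, k):
    """Rank: the 1-based position of the correct key k in the
    descendingly sorted list of candidate probabilities."""
    order = np.argsort(np.asarray(probs, dtype=float))[::-1]
    return int(np.where(order == k)[0][0]) + 1

def ge(prob_matrix, k):
    """GE: the empirical guessing entropy, i.e. the average rank
    of the correct key k over N experiments."""
    return float(np.mean([rank_of_key(row, k) for row in prob_matrix]))

probs = [0.1, 0.6, 0.3]        # candidate 1 has the highest probability
assert rank_of_key(probs, 1) == 1
assert rank_of_key(probs, 0) == 3
```

In contrast to the Massey sum, the rank depends only on the position of the correct key, not on the magnitudes of the probabilities.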

Efficient full-key security evaluation methods
In this section we present the three main guesswork estimators analysed in this paper: the rank estimation from FSE 2015 [GCG+15], the scalable bounds from CHES 2017 [CP17] and the recent rank estimation method from CHES 2020 [ZZD+20]. We may slightly rewrite the original notations in order to harmonize the presentation in this paper.

FSE 2015 estimator for GE
The rank estimation method of Glowacz et al. [GCG+15], to which we refer as the FSE 2015 estimator in this paper, is one of the fastest such algorithms and scales well for keys up to 128 bytes. To use this algorithm, we must first compute the logarithms of the conditional probabilities, log P(k^j_i |X = X), for all the n_s chunks (1 ≤ i ≤ |S|, 1 ≤ j ≤ n_s) of a full cryptographic key (e.g. n_s = 16 byte chunks for a 16-byte AES key). Then, we can compute the histograms of the log-probabilities for each key chunk:

H_j = \mathrm{Hist}\big(\{\log P(k^j_i | X = X)\}_{1 \le i \le |S|}\big), \quad 1 \le j \le n_s, \qquad (12)

and finally compute a large histogram for the entire key, by combining the individual histograms through convolution:

H = H_1 * H_2 * \dots * H_{n_s}. \qquad (13)

This is typically computed by first convolving the first two histograms, then the result with the third histogram, and so on.
Having computed the convolution of all the histograms, and with knowledge of the correct key (and hence of the bins containing the correct key chunk in each histogram), we can estimate the full-key rank by adding the values of the bins in the full histogram, starting with the bin that should contain the correct full key until the last one:

rank_{FSE}(K|X = X) = \sum_{b \ge b_k} H(b), \qquad (14)

where b_k is the bin of H corresponding to the correct full key (the sum of the bin indices of the correct chunks in the individual histograms). This basically estimates the number of keys with a log-probability larger than or equal to that of the correct key, which essentially estimates the rank of the correct key. Similar to the empirical guessing entropy (see previous section), we can approximate the expectation of this full-key rank using N experiments, obtaining:

GE^{full}_{FSE} = \frac{1}{N} \sum_{q=1}^{N} rank_{FSE}(K|X = X_q). \qquad (15)

There have been some speed improvements on this method, such as the work of Grosso at CARDIS 2018 [GV18], which also compares the execution time of the FSE 2015 estimator and the CHES 2017 bounds [GV18, Fig. 6, p.14]. However, as mentioned by the same author, the performance improvements come at the cost of tightness. Similarly, the recently published method of David and Wool [DW21] can slightly improve the performance at the cost of memory consumption, but is also more complicated to implement. Hence, for generality and simplicity, we shall use the FSE 2015 method in the analysis of this work, given that it remains one of the representative algorithms for estimating the empirical guessing entropy GE when targeting full cryptographic keys.
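The histogram-convolution idea above can be sketched in a few lines. This is our own simplified illustration (a single common binning range, hypothetical function name and parameters), not the authors' implementation:

```python
import numpy as np

def fse2015_rank(log_probs, correct_chunks, n_bins=512):
    """Estimate the full-key rank from per-chunk log-probabilities.

    log_probs:      list of n_s arrays, each with log P(k_i^j | X) for
                    all |S| candidates of chunk j.
    correct_chunks: the correct value of each chunk (known to the evaluator).
    """
    lo = min(lp.min() for lp in log_probs)
    hi = max(lp.max() for lp in log_probs)
    edges = np.linspace(lo, hi, n_bins + 1)

    hists, correct_bins = [], []
    for lp, k in zip(log_probs, correct_chunks):
        h, _ = np.histogram(lp, bins=edges)
        hists.append(h.astype(float))
        b = np.searchsorted(edges, lp[k], side="right") - 1
        correct_bins.append(min(b, n_bins - 1))  # clamp the top edge

    # Convolve the per-chunk histograms into one full-key histogram:
    full = hists[0]
    for h in hists[1:]:
        full = np.convolve(full, h)

    # The correct key's log-probability falls in the bin whose index is the
    # sum of the per-chunk bin indices; count the keys at or above that bin.
    return float(np.sum(full[sum(correct_bins):]))

lp = np.log(np.array([0.4, 0.3, 0.2, 0.1]))
# Best chunk values everywhere -> estimated rank 1; worst everywhere -> rank 16:
assert fse2015_rank([lp, lp], [0, 0]) == 1.0
assert fse2015_rank([lp, lp], [3, 3]) == 16.0
```

The key property used here is that the bin index of a product of probabilities is the sum of the per-chunk bin indices, which is exactly what the convolution of the histograms tracks.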

CHES 2017 bounds for GM
At CHES 2017, Choudary and Popescu [CP17] presented efficient and scalable lower and upper bounds, LB_GM ≤ E[G(K|X = X)] ≤ UB_GM, on Massey's guessing entropy, computed directly from the conditional probabilities p_i = P(k_i|X = X) obtained after a side-channel attack, which need not be sorted. This allows a faster evaluation, as we do not need to sort the probabilities, as is necessary e.g. for computing Massey's guessing entropy or the empirical guessing entropy directly (see Section 3). By averaging the terms of these bounds over many experiments, with p_{i,q} = P(k_i|X = X_q) the conditional probability for key value k_i in experiment q, we obtain bounds for the approximation of the expected value of Massey's guessing entropy, which was termed GM.
In the same CHES 2017 paper, the authors also showed that these bounds extend to full-key security evaluations [CP17, Theorem 1], using the conditional probabilities p^j_i = P(k^j_i|X = X) for the j-th key chunk values k^j_i (1 ≤ j ≤ n_s). We may again average each term over N experiments, using p^j_{i,q} = P(k^j_i|X = X_q), the conditional probabilities for the j-th key chunk values k^j_i in experiment q, to obtain bounds for Massey's guessing entropy GM on the full key (GM^{full}). This is the fastest and most scalable method for full-key evaluation known to date. We shall refer to these lower and upper bounds on GM as LB_GM and UB_GM, respectively, regardless of whether we use them for single-byte or full-key evaluations.

CHES 2020 estimator for GE
At CHES 2020, Zhang et al. [ZZD+20] proposed a new method, termed 'GEEA' (guessing entropy estimation algorithm), to estimate the empirical guessing entropy (GE). Their method relies on the observations made by Rivain et al. [Riv08, LPR+14] that the success rate can be computed from the multivariate Gaussian distribution of the ranking score vectors. Zhang et al. have used this distribution to produce the GEEA estimator, which is expected to approximate the expected rank E[rank(K|X)] better than the average-based estimator GE (see Section 3.2 for details).
Given a list of scores s = {s_1, s_2, . . . , s_{|S|}}, obtained after a side-channel attack (e.g. the probabilities P(k_i|X = X) or their logarithms log P(k_i|X = X)), we can compute a comparison vector ∆(K|X = X) having (|S| − 1) elements:

\Delta(K|X = X) = (s_1 - s_k, \dots, s_{k-1} - s_k, s_{k+1} - s_k, \dots, s_{|S|} - s_k), \qquad (20)

where the values ∆_i = s_i − s_k are computed for all the score values s_i, except for s_k, the score of the correct key. Then, Rivain et al. [Riv08, LPR+14] have shown that if the scores from multiple experiments can be combined through addition, then this comparison vector follows a multivariate distribution N(µ_∆, (1/n_a) Σ_∆), where µ_∆ and Σ_∆ are the mean vector and covariance matrix of the comparison vector, respectively, and n_a represents the data available to an attacker, for which we wish to estimate the comparison vector. Note that we need to obtain the side-channel traces over a sufficient number N of experiments to estimate the mean (µ_∆) and covariance (Σ_∆) of the comparison vector. Then, we use these multivariate parameters to estimate the comparison vector for a given value n_a, with the goal of estimating the guessing entropy obtained by an adversary that has access to n_a attack traces.
From a particular observation of the comparison vector ∆(K|X = X), we can compute the rank as:

rank(K|X = X) = N_{pos}(\Delta(K|X = X)) + 1, \qquad (21)

where N_{pos}(∆(K|X = X)) represents the number of positive components of ∆(K|X = X), i.e. the number of components for which s_i > s_k. Thus, we can obtain the expected value of the rank as:

E[rank(K|X)] = 1 + \sum_{i: k_i \neq k} P[\Delta_i(K|X) > 0]. \qquad (22)

Based on the above results, Zhang et al. provide the following formula for their GEEA estimator of the rank:

GE_{GEEA} = 1 + \sum_{i: k_i \neq k} \Phi\left( \bar{\Delta}_i(K|X) \Big/ \sqrt{S_{ii}(K|X) / n_a} \right), \qquad (23)

where

\bar{\Delta}_i(K|X) = \frac{1}{N} \sum_{q=1}^{N} \Delta_i(K|X = X_q), \quad S_{ii}(K|X) = \frac{1}{N-1} \sum_{q=1}^{N} \left( \Delta_i(K|X = X_q) - \bar{\Delta}_i(K|X) \right)^2 \qquad (24)

are the estimated mean and variance of the component ∆_i(K|X = X) of ∆(K|X = X), Φ(·) represents the normal cumulative distribution function and n_a is the assumed number of attack traces available to an attacker for the estimation of the empirical guessing entropy using GEEA.
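A simplified sketch of this estimator follows (our own illustrative implementation of the idea, assuming additive scores such as log-probabilities; zero-variance components are skipped for simplicity, and the function names are ours):

```python
import numpy as np
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def geea(score_matrix, k, n_a):
    """GEEA-style rank estimate from N experiments of scores.

    score_matrix: N x |S| additive scores (e.g. log-probabilities),
                  one row per experiment.
    k:            index of the correct key.
    n_a:          assumed number of attack traces for the adversary.
    """
    s = np.asarray(score_matrix, dtype=float)
    # Comparison-vector components s_i - s_k for every wrong candidate i:
    delta = np.delete(s - s[:, [k]], k, axis=1)
    mu = delta.mean(axis=0)            # estimated means of the components
    var = delta.var(axis=0, ddof=1)    # estimated variances S_ii
    # One plus the sum of P(Delta_i > 0) under N(mu_i, S_ii / n_a):
    return 1.0 + sum(phi(m / sqrt(v / n_a))
                     for m, v in zip(mu, var) if v > 0)
```

For scores where the correct key clearly dominates, the estimate converges to 1; when the correct key is clearly the worst of |S| candidates, it converges to |S|.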
The authors of GEEA also provide an extension of their estimator for full-key evaluation. Given the set of comparison vectors {∆^1, ∆^2, . . . , ∆^{n_s}} for the n_s target bytes (e.g. subkey values), with their respective estimated mean and variance components \bar{∆}^j_i(K|X) and S^j_{ii}(K|X) for each ∆^j, Zhang et al. [ZZD+20] sum these values, assuming that the full-key comparison score is the sum of the byte comparison scores, to obtain the mean and variance of the comparison score for each possible full-key candidate k_f = (k_{i_1}|k_{i_2}| . . . |k_{i_{n_s}}):

\bar{\Delta}(k_f) = \sum_{j=1}^{n_s} \bar{\Delta}^j_{i_j}(K|X), \quad S(k_f) = \sum_{j=1}^{n_s} S^j_{i_j i_j}(K|X), \qquad (25)

with the observation that each of the values \bar{∆}^j_{i_j}(K|X) and S^j_{i_j i_j}(K|X) is particular to each subkey byte j (1 ≤ j ≤ n_s). With these values, Zhang et al. compute the estimation of the full-key rank from a (random) set of M full-key values S_f = {k_{f_1}, k_{f_2}, . . . , k_{f_M}} as:

GE^{full}_{GEEA} = 1 + \frac{|S|^{n_s} - 1}{M} \sum_{m=1}^{M} \Phi\left( \bar{\Delta}(k_{f_m}) \Big/ \sqrt{S(k_{f_m}) / n_a} \right). \qquad (26)

We note again that, while for the FSE 2015 and CHES 2017 methods the approximation of the expected value of the rank (GE^{full}_{FSE}) or Massey's conditional guessing entropy (GM^{full}) was obtained through averaging over N individual experiments (see Equations 15 and 19), for GEEA the approximation of the expected value of the rank is obtained directly from the estimated mean and covariance parameters (see Equations 23 and 26); but in turn these parameters are also obtained by averaging over N individual experiments (see Equation 24). Hence, in all cases we need to iterate over N experiments to estimate the desired metrics acceptably.

Evaluation context
To analyse and compare the methods discussed in this paper, we have used the following indicators, which we think allow a fair and useful comparison of the methods from different perspectives:
• Precision: through this indicator, we want to check how well and how fast a given metric approximates the expected value that it is trying to estimate. This will typically be done by measuring the standard deviation of the given method and by observing this deviation over a varying number of experiments.
• Resource complexity: with this indicator we shall compare the time and memory complexity of the methods, either from the estimated theoretical bounds or from practical results.
• Scalability: this indicator is particularly useful for the full-key evaluations. Here, we aim to assess the ability of each method to cope with evaluations on large cryptographic keys. In particular, we shall compare how fast the time/memory complexity of each method grows as a function of the key length.
• Relevance for side-channel evaluations: this is a (possibly subjective) indicator, which assesses the usefulness of a method for side-channel evaluations. This will be done by going through possible scenarios in which a method is of interest for such security evaluations.
Furthermore, in order to provide a more comprehensive analysis of the evaluation methods, we used three distinct datasets: one from MATLAB simulated data (simulated dataset), one from the hardware AES co-processor of an AVR XMEGA device (XMEGA dataset) and one from a 32-bit ARM device (SoC dataset). We provide some more details below.

Simulated dataset
For this dataset, we simply implemented AES, added Gaussian noise to the output of the sub-bytes operation and then applied Template Attacks [CK13] to obtain lists of probabilities for each byte of the AES key (16 key bytes in total).
The data contains unidimensional leakage samples x_i produced as the Hamming weight of the AES S-box output value mixed with Gaussian noise, i.e.

x_i = HW(\text{S-box}(p_i \oplus k)) + r_i,

where p_i is the plaintext byte corresponding to this trace, and r_i represents the Gaussian noise (variance 10). We shall refer to this as the simulated dataset.
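A sketch of how such a dataset can be generated is shown below. This is our own illustration, with hypothetical function names; we use a fixed random permutation as a stand-in where the real 256-entry AES S-box table would be used:

```python
import numpy as np

# Stand-in for the 256-entry AES S-box (a fixed random permutation);
# a real simulation would use the actual AES S-box table.
SBOX = np.random.default_rng(0).permutation(256)

def hamming_weight(v):
    """Number of set bits of an 8-bit value."""
    return bin(int(v)).count("1")

def simulate_traces(key_byte, n, noise_var=10.0, seed=1):
    """Unidimensional leakage x_i = HW(S-box(p_i XOR k)) + r_i,
    with Gaussian noise r_i of the given variance."""
    rng = np.random.default_rng(seed)
    plaintexts = rng.integers(0, 256, size=n)
    hw = np.array([hamming_weight(SBOX[p ^ key_byte]) for p in plaintexts])
    traces = hw + rng.normal(0.0, np.sqrt(noise_var), size=n)
    return plaintexts, traces

p, x = simulate_traces(key_byte=0x2A, n=1000)
assert len(p) == 1000 and len(x) == 1000
```

The variance of the noise directly controls the difficulty of the attack, which makes such simulated data convenient for studying the estimators under a known ground truth.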

XMEGA dataset
This dataset consists of 2^20 ≈ 1M power-supply traces of the AES engine inside an AVR XMEGA microcontroller, obtained while the cryptographic engine was encrypting different uniformly distributed plaintexts. The traces correspond to the S-box lookup of the first round. Each trace contains m = 5000 oscilloscope samples recorded at 500 MS/s, using a Tektronix TDS7054 oscilloscope configured at 250 MHz bandwidth in HIRES mode with Fastframe and 10 mV/div vertical resolution, using DC coupling. The XMEGA microcontroller was powered at 3.3 V from batteries and was run by a 2 MHz sinewave clock. We shall refer to this as the XMEGA dataset.

SoC dataset
The third dataset consists of 100,000 power traces acquired for a bitsliced variant of the AES-128 algorithm, using the unprotected 32-bit implementation described in [BJG+15]. The acquisition campaign was conducted entirely on a ChipWhisperer-Lite [Lit22], using its integrated STM32F303 32-bit ARM target and the attached capture instrument. Each trace consists of 5000 samples recorded at ≈ 30 MS/s (for a sampling clock of 29.48 MHz), also covering the processing of the first S-box operation. We shall refer to this as the SoC dataset.

Template attacks
To use our datasets with the methods evaluated in this paper, we need to obtain lists of probabilities for the possible values of the 16 subkeys used with our AES implementations. For this, we use Template Attacks (TA) [CRR02,CK13] on each subkey during the S-box lookup of the first AES round, thus obtaining the desired lists of probabilities. The Template Attacks work in two steps: a profiling and an attack step. In the profiling step, we first compute a set of profiling parameters that typically estimate a multivariate distribution for each possible candidate value (e.g. key byte). Then, during the attack step we compute the likelihood of the attack traces given the template parameters, hence obtaining (typically via Bayes) probabilities for each possible candidate value.
For each dataset we have between a few hundred thousand and one million traces. We split the data randomly into profiling and attack sets, and we do this for many experiments (typically we create over 100 such pairs of sets). The traces are randomised prior to the separation into profiling and attack sets, so that we remove unwanted effects such as temperature influencing consecutive traces. Typically we use many more traces for profiling, so that the profiling parameters are well estimated (e.g. for the XMEGA dataset we used around 200 traces per byte value in each set). For the attack set we typically select between a few hundred and a few thousand traces per set. In the case of the simulated dataset, the templates were obtained simply by computing the mean and covariance parameters of the simulated samples (one leakage sample per trace) and then using these template parameters on the set of attack traces. For the XMEGA dataset, we first applied a sample selection method equivalent to the SNR and SOST methods [CK18] in order to compress the traces down to only a few samples per trace, and then applied the Template Attack on these compressed traces. Finally, for the SoC dataset, we combined stochastic models with Principal Component Analysis [COK14] in order to improve the profiling step and to reduce the size of the traces.
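The two TA steps can be sketched as follows for unidimensional leakage (a simplified illustration with univariate Gaussian templates and hypothetical function names, not our actual attack code; in a real attack the candidate value would be v = S-box(k ⊕ p) for each key hypothesis k):

```python
import numpy as np

def profile_templates(traces, values, n_vals=256):
    """Profiling step: estimate the mean and variance of the (1-D)
    leakage for each candidate intermediate value."""
    means = np.empty(n_vals)
    variances = np.empty(n_vals)
    for v in range(n_vals):
        x = traces[values == v]
        means[v] = x.mean()
        variances[v] = x.var() + 1e-12   # guard against zero variance
    return means, variances

def log_likelihoods(sample, means, variances):
    """Attack step: Gaussian log-likelihood of one attack sample under
    each template; the most likely candidate maximises this value."""
    return -0.5 * (np.log(2 * np.pi * variances)
                   + (sample - means) ** 2 / variances)
```

Summing the log-likelihoods over several attack traces and normalising (Bayes' rule with a uniform prior) yields the probabilities P(k|X) used throughout the paper.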
After executing a side-channel attack using a vector X of leakage traces (e.g. the real or simulated traces in our case), we obtain a vector of scores or probabilities d(k|X) ∈ ℝ^|S| for each possible key byte value k ∈ {1, . . . , |S|}, where |S| is the number of possible values (typically |S| = 256 for one AES subkey byte). In the case of Template Attacks, we obtain probabilities and we shall often write P(k|X) = d(k|X).^6 After obtaining the probabilities P(k|X) for each subkey byte k, we can compute the security metrics and rank estimation methods presented in this paper.

Using GE or GM for security evaluations
Since the FSE 2015 and CHES 2020 methods approximate the expected value of the rank (i.e. they compute a variant of GE) and the CHES 2017 bounds are obtained for Massey's conditional guessing entropy (GM), we start by comparing these two variants of the guessing entropy. Then, in the next section we analyse the three methods used for full-key evaluation.

Precision of GE and GM
It was recently mentioned that GM might have poor precision [AMP+19], requiring the results of many experiments to be averaged in order to obtain good estimates. Similarly, Zhang et al. [ZZD+20, Section 1, p. 27] stated that the GM calculation suffers "the same practical problem of needing to average over many data sets" as the GE.
To verify the extent of this situation, we have used our datasets and computed both the GE and the GM metrics, with the formulas from Section 3, using q = 100 experiments and a varying number of attack traces n_a (1 ≤ n_a ≤ 100). Using this data we have also computed the standard deviation of each metric, hence comparing their accuracies. The results are shown in Figure 1. We also show in Table 1 the 5th, 50th (median) and 95th percentile of the standard deviation (across the different values of n_a attack traces) for each method in each experiment. From these results we observe that GE has indeed a large standard deviation, generally one order of magnitude larger than that of GM across our experiments. Therefore, methods based on GM may allow security evaluations even when having access to few attack traces. The large differences between GE and GM can be explained by the fact that the calculation of GE relies on the exact (actual) position of the correct key, which may fluctuate greatly, while the calculation of GM only depends on the relative magnitude of the probabilities, regardless of the position of the correct key in the sorted vector of probabilities.

Footnote 6: Unprofiled side-channel attacks such as CPA often return a score vector, e.g. based on the correlation coefficient ρ_k ∈ [−1, 1] for each possible candidate value k, which might not work very well with rank estimation methods. However, even in the unprofiled setting it is possible to use other methods, such as linear regression on the fly [COP+16], to obtain pseudo-probabilities that work well with rank estimation algorithms.
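The two metrics can be computed directly from their definitions in Section 3. The sketch below uses synthetic scores (a fixed boost for the correct key plus uniform noise; `rank_of` and `massey_g` are illustrative helpers, not our evaluation code) and is only meant to reproduce the precision gap: the rank of the correct key fluctuates between experiments, while the sorted-probability sum does not.

```python
import random
import statistics

def rank_of(probs, k_star):
    """Empirical rank of the correct key: 1 + number of candidates
    whose probability is strictly larger."""
    return 1 + sum(1 for p in probs if p > probs[k_star])

def massey_g(probs):
    """Massey's guessing entropy of one attack result: expected number
    of guesses when trying candidates in decreasing probability order."""
    return sum(i * p for i, p in enumerate(sorted(probs, reverse=True), start=1))

# Toy experiments: the correct key (index 0) gets a score boost of 0.3,
# so its rank fluctuates from one experiment to the next.
rng = random.Random(1)
S = 16
ge_samples, gm_samples = [], []
for _ in range(100):
    scores = [rng.random() + (0.3 if k == 0 else 0.0) for k in range(S)]
    z = sum(scores)
    probs = [s / z for s in scores]
    ge_samples.append(rank_of(probs, 0))
    gm_samples.append(massey_g(probs))

# GE averages the fluctuating rank; GM averages a quantity that depends
# only on the sorted probabilities, so its standard deviation is smaller.
sd_ge = statistics.stdev(ge_samples)
sd_gm = statistics.stdev(gm_samples)
```

In this toy setting, `sd_ge` comes out noticeably larger than `sd_gm`, matching the behaviour observed on our datasets.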

Resource complexity of GE and GM
As can be seen from their definitions (see Equations 6 and 9), both GE and GM first require sorting the probabilities, for the computation of the rank (Equation 7) and of the conditional guessing entropy (Equation 3), respectively. This has computational complexity O(|S| log |S|), where |S| is again the number of possible values of the target candidate k (e.g. a key byte, or the entire key for full-key evaluations). If we also consider the number of experiments N required for a good estimation, then the complexity is O(N |S| log |S|), although in general the parameter N can be kept relatively small even when dealing with large key sizes; hence the important factor here is |S|.
In any case, it is clear that both metrics have the same computational requirements and that they become impractical for full cryptographic keys (e.g. 128-bit keys), due to the impossibility of performing the sorting in acceptable time. Hence the need for the full-key evaluation methods evaluated in the next section.
In terms of memory, both methods need to store all the probabilities P(k_i | X = x) in order to perform the sorting, hence they require memory linear in |S|. This can again become impractical for large keys.

Relevance of GE and GM for side-channel security evaluations
After analysing the definitions, accuracies and performance of GM and GE, we need to evaluate the usefulness of each measure for security evaluation purposes. For this task, we shall return to their original introduction and scope, as presented in Section 3.
As seen, GM approximates the conditional guessing entropy of the key K given the leakage X. Based on its original definition by Massey [Mas94] and its further development by Cachin [Cac97], GM estimates the average number of guesses needed to determine the correct value of the key variable K given the side-channel leakage X. This measure was then applied in the side-channel context by Köpf and Basin [KB07], who made the important observation that this conditional guessing entropy is a lower bound on the expected number of off-line guesses that an attacker must perform for key recovery after having carried out a side-channel attack. Hence, we can expect the actual number of off-line guesses needed for key recovery after a side-channel attack to be higher than GM.
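For reference, the standard form of these definitions can be written as follows (a sketch in generic notation; the paper's own Equations 3 and 9 may use slightly different symbols). With the probabilities sorted in decreasing order, Massey's guessing entropy and its conditional version are:

```latex
% Massey's guessing entropy of K, with p_{(1)} \ge p_{(2)} \ge \dots \ge p_{(|S|)}:
G(K) = \sum_{i=1}^{|S|} i \, p_{(i)}

% Conditional version, averaged over the observed leakage X, which
% lower-bounds the expected number of off-line guesses [KB07]:
G(K \mid X) = \sum_{x} \Pr[X = x] \sum_{i=1}^{|S|} i \, p_{(i) \mid x}
```

The inner sum is the expected number of guesses when candidates are tried in decreasing order of their posterior probability given the observed leakage x.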
On the other hand, GE was introduced in the side-channel community by Standaert et al. [SMY09] precisely as an empirical measure that provides the exact number of guesses an attacker would need to determine the correct value of the key variable K when given the leakage X.
Given this situation, we may expect that in general GM provides a lower value than GE. Hence, an important question is which of them should be used and in which scenario, given that GE has been used predominantly so far in the side-channel community, while GM has been reintroduced only recently, with the development of efficient bounds for full-key evaluations. To answer this question, we start by evaluating both metrics on our datasets, when targeting one and two bytes (for full-key evaluations see the next section). Our results are shown in Figures 2 and 3 for evaluations on one and two bytes, respectively.
As we can see, GM is indeed consistently below GE, as expected from the discussion above. Furthermore, Zhang et al. [ZZD+20, see Appendix] also showed that in general GM will be lower than GE and that values of GM above GE are generally not observable in numerical studies. Given these results and observations, we can state the following conclusions when targeting a small key chunk (one or two bytes):
• GM will likely provide a lower bound on the actual guessing entropy, i.e. an attacker will generally require more effort than shown by GM.
• GE provides a good approximation of the expected effort needed by a real attacker, at the cost of some more experiments to obtain a smooth approximation.
Therefore, we may choose one or the other depending on our evaluation requirements. If we must provide an estimate of the difficulty of attacking a device that is as close as possible to the attacker's perspective, then we should use GE. However, if all we need is to check whether the security of a device is above a certain threshold, then GM may also be of use. This will become much more relevant when we focus on the full-key scenario, in the next section.
Besides the above conclusions regarding the usefulness of each metric individually, we can observe another very important aspect in the results shown above: the difference between GE and GM is related to the quality of the leakage model, as also noticed by Zhang et al. [ZZD+20, Appendix] in their analysis. Examining the figures above, for evaluations on a single byte as well as on two bytes, we can see that the two metrics almost overlap for the simulated dataset, diverge slightly (GM below GE) for the XMEGA dataset and diverge greatly for the SoC dataset.
Such variations in the difference between GE and GM can be explained as follows. For the simulated dataset, we have used the exact leakage model that was used to generate the traces (since they are simulated from the leakage model) and in this case, as noted also by Zhang et al., the correct key is very well distinguished and hence the order of probabilities generally matches well with the order of the correct key, leading to GE being very close to GM. For the XMEGA dataset, we have a leakage model that is not entirely accurate, due to the noise of the various components influencing the leakage traces used to attack the hardware AES implementation in the XMEGA device. Finally, for the SoC dataset, we have a very weak leakage model due to targeting one or two bytes at a time in the bitsliced AES implementation on a 32-bit device. In this case, only 8 or 16 bits from the 32 processed by the device are actually relevant (e.g. the first bit of the first 8 key bytes), while the remaining bits produce substantial noise, hence leading to the weak leakage model and to the large difference between GE and GM.
Hence, rather than using only one metric or the other, these results show that it can be very useful to compute both and use their difference as a method to verify the quality of the leakage model used during security evaluations: if the two metrics are close to each other, then the model is good, otherwise the model may suffer from estimation or assumption errors [DFS+14].
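This model-quality check can be illustrated with a small simulation (synthetic scores, not our actual datasets): a strong signal term stands in for an accurate leakage model, and a weak signal with large noise stands in for a poor one. The `run` helper and its parameters are illustrative assumptions.

```python
import math
import random
import statistics

def run(signal, noise, n=200, S=16, seed=7):
    """Simulate n attacks: the correct key (index 0) receives `signal`
    on top of Gaussian model noise; return (mean GE, mean GM)."""
    rng = random.Random(seed)
    ge, gm = [], []
    for _ in range(n):
        score = [math.exp((signal if k == 0 else 0.0) + rng.gauss(0, noise))
                 for k in range(S)]
        z = sum(score)
        p = [s / z for s in score]
        ge.append(1 + sum(1 for q in p if q > p[0]))               # rank of key 0
        gm.append(sum(i * q for i, q in enumerate(sorted(p, reverse=True), 1)))
    return statistics.mean(ge), statistics.mean(gm)

ge_good, gm_good = run(signal=5.0, noise=1.0)   # accurate model
ge_bad, gm_bad = run(signal=0.2, noise=2.0)     # weak model

gap_good = ge_good - gm_good   # small: the model distinguishes the key well
gap_bad = ge_bad - gm_bad      # large: confident but often-wrong model
```

The weak model still produces peaked probability vectors (keeping GM low), while the correct key's actual rank grows, so the GE-GM gap widens, mirroring the behaviour of the SoC dataset.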

Comparative analysis of full-key evaluation methods
In this section, we analyse the three full-key security evaluation methods presented earlier: the rank estimation algorithm from FSE 2015 [GCG+15], the scalable bounds for GM from CHES 2017 [CP17] and the newer rank estimation algorithm from CHES 2020 [ZZD+20], using the indicators and datasets presented earlier (see Section 5).

Precision of full-key estimators: the case for low-data-complexity tools
As Azouaoui et al. [AMP+19] have mentioned, it is very useful to have a tool that can quickly (e.g. using very few attack traces) determine whether an attack has some chance of revealing the key within practical computation time, and hence whether we should perform key enumeration (which is not trivial for full cryptographic keys). In this context, evaluating the precision of each full-key evaluation method seems very relevant. For this task, we have applied the three full-key evaluation methods to our datasets, comparing their standard deviations (similarly to what we did in Section 6.1). Our results are shown in Figure 4. The 5th, 50th and 95th percentile of the standard deviation across the n_a attack traces, in bits (i.e. taking the difference of the logarithms of the measure with and without one standard deviation), for each method and dataset is shown in Table 2.
For the GEEA method, the computation of the mean and standard deviation needs some explanation, since the GEEA method already produces an estimation based on a prior use of mean and standard deviation parameters. In order to provide a reasonable comparison between metrics, we have decided to allow the same amount of data to each metric. Hence, since the GEEA method estimates the guessing entropy of an attack with n_a traces based on the previously computed parameters, we have used a total of n_a traces for each computation of the GEEA method and then computed the mean and standard deviation of GEEA over N such computations. Furthermore, in order to provide a fair comparison also in terms of computing power, we have limited the number of random keys used by the full-key GEEA estimator (see Equation 26) to M = 10^4, which results in a similar computation time as for the computation of the mean and standard deviation values for the FSE 2015 and CHES 2017 methods. These results show that both the CHES 2017 and CHES 2020 methods have good precision overall, while the FSE 2015 method has a somewhat larger standard deviation. However, we should also notice that unfortunately the CHES 2020 method provides results quite far from those of the FSE 2015 method, most likely due to the selected number of random keys used for its approximation, as detailed above.
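To make the role of the M random keys concrete, the following is a schematic Monte Carlo rank estimator in the spirit of sampling random keys against the correct key's score (cf. Equation 26). This is a hypothetical sketch: `mc_rank` is our own illustrative helper and not the GEEA algorithm itself, which additionally fits mean and standard-deviation parameters; the point is only that precision and cost both scale with M.

```python
import random

def mc_rank(logp_bytes, true_key, M=10_000, seed=0):
    """Monte Carlo full-key rank estimate: sample M random keys, count
    those scoring strictly better than the true key, and scale the
    sampled fraction up to the full key space of |S|^nb keys."""
    rng = random.Random(seed)
    nb = len(logp_bytes)
    S = len(logp_bytes[0])
    true_score = sum(logp_bytes[i][true_key[i]] for i in range(nb))
    better = 0
    for _ in range(M):
        if sum(logp_bytes[i][rng.randrange(S)] for i in range(nb)) > true_score:
            better += 1
    return 1 + better / M * S ** nb

# Toy check: 4 "bytes" of 8 values; the true key (0,0,0,0) is the most
# likely, so the estimated rank should stay near the top of the 8^4 space.
rng = random.Random(3)
logp = [[(2.0 if v == 0 else 0.0) + rng.gauss(0, 0.3) for v in range(8)]
        for _ in range(4)]
est = mc_rank(logp, [0, 0, 0, 0], M=5000)
```

Because the sampled fraction is scaled up by |S|^nb, a small M yields a coarse estimate for deep ranks, which is consistent with the divergence we observe between the CHES 2020 and FSE 2015 results under limited computation.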

Time/complexity analysis of full-key estimators
To continue our analysis, we show in Table 3 the time required to compute each method for 16-byte (128-bit), 128-byte (1024-bit) and 1024-byte (8192-bit) keys across all the datasets. The time includes the computation for all numbers of attack traces and all iterations. As mentioned before, for the CHES 2020 method it is difficult to set a clear reference point, as its computation time also depends on the number of random keys that we want to use for its approximation. The more we use, the better its estimation should be, but increasing it too much becomes impractical. For 16 bytes we have used M = 10^4 and for 128 bytes we used M = 10^6 values, resulting in the times shown. Nevertheless, the memory requirements remain the same, since we only need to keep the same list of score vectors, regardless of the number of random keys that we use.
These results confirm that the CHES 2017 method is faster than the others by one or more orders of magnitude. Nevertheless, one may still prefer one of the other methods if the requirement is to precisely estimate the empirical guessing entropy GE.

Scalability of full-key estimators
In this section we explore the scalability of each method by comparing them using 128-byte keys across all our datasets. The results are shown in Figure 5. We make the following observations:
• All methods can be computed relatively well for 128-byte keys. However, when trying to compute the results for even larger keys, e.g. 1024-byte keys, we could not obtain them for the FSE 2015 and CHES 2020 methods, due to their computational and memory limitations. This was somewhat expected for the FSE 2015 method, but now it was also seen for the CHES 2020 method. Hence, for such larger keys, the CHES 2017 method might be the only viable solution.
• We see again the difference between the results of the FSE 2015 and CHES 2020 methods, confirming that they do not lead to the same results when dealing with large keys and a moderate amount of computation. From previous publications, such as our CHES 2017 paper [CP17, Figure 5], we observe that the FSE 2015 method follows the empirical guessing entropy GE closely. Hence, our results imply that the CHES 2020 method cannot reliably approximate the empirical guessing entropy using moderate computation.
• These figures also show that the FSE 2015 and CHES 2017 methods are close for the simulated and XMEGA datasets, but differ substantially for the SoC dataset, i.e. where the leakage model is not accurate, as described earlier. Hence, this confirms that we may combine the FSE 2015 and CHES 2017 methods to determine whether a leakage model is sound.

Relevance of full-key estimators for security evaluations
As seen by previous results, each of the full-key estimation methods explored in this paper has its advantages and limitations. We make a summary of our observations in Table 4.

Conclusion
In this paper we have explored the differences between two versions of the guessing entropy, as used for security evaluations of side-channel attacks: Massey's guessing entropy and the empirical guessing entropy. Our analysis has clarified previous confusion on the difference between these two measures when used for side-channel evaluations, allowing security evaluators to have a better understanding of each metric and its potential use. Furthermore, we have analysed three representative full-key estimation methods for these security metrics, namely the methods presented at FSE 2015, CHES 2017 and CHES 2020. Our analysis presents a clear overview of the advantages and limitations of each method, which should be of great utility for any security evaluator.
In addition, we have discovered a new method for verifying the soundness of a leakage model used in a side-channel attack, by combining the empirical guessing entropy and Massey's guessing entropy. This may provide a useful tool for side-channel evaluations of both small and large keys.