A Comprehensive Study of Deep Learning for Side-Channel Analysis

Recently, several studies have been published on the application of deep learning to enhance Side-Channel Attacks (SCA). These seminal works have practically validated the soundness of the approach, especially against implementations protected by masking or by jittering. Concurrently, important open issues have emerged. Among them, the relevance of machine (and thereby deep) learning based SCA has been questioned in several papers, based on the lack of relation between the accuracy, a typical performance metric used in machine learning, and common SCA metrics like the Guessing Entropy or the key-discrimination success rate. Also, the impact of the classical side-channel counter-measures on the efficiency of deep learning has been questioned, in particular by the semi-conductor industry. Both questions highlight the importance of studying the theoretical soundness of deep learning in the context of side-channel analysis and of developing means to quantify its efficiency, especially with respect to the optimality bounds published so far in the literature for side-channel leakage exploitation. The first main contribution of this paper directly concerns the latter point. It is indeed proved that minimizing the Negative Log Likelihood (NLL for short) loss function during the training of deep neural networks is asymptotically equivalent to maximizing the Perceived Information (PI) introduced by Renauld et al. at EUROCRYPT 2011 as a lower bound of the Mutual Information (MI) between the leakage and the target secret. Hence, such a training can be considered as an efficient and effective estimation of the PI, and thereby of the MI (known to be complex to accurately estimate in the context of secure implementations). As a second direct consequence of our main contribution, it is argued that, in a side-channel exploitation context, choosing the NLL loss function to drive the training is sound from an information theory point of view.
As a third contribution, classical counter-measures like Boolean masking or execution flow shuffling, initially dedicated to classical SCA, are proved to stay sound against deep learning based attacks.


Context
Side-channel analysis is a class of attacks against cryptographic primitives that exploit weaknesses of their physical implementation. During the execution of the latter implementation, some sensitive variables are indeed processed that depend on both a piece of public data (e.g. a plaintext) and on some chunk of a secret value (e.g. a key). Hence, combining information about a sensitive variable with the knowledge of the public data enables an attacker to reduce the key chunk search space. By repeating this attack several times, implementations of secure cryptographic algorithms such as the Advanced Encryption Standard (AES) can then be attacked by recovering each byte of the secret key separately thanks to a divide-and-conquer strategy, thereby breaking the high complexity usually required to defeat such an algorithm. The information on sensitive variables is usually gathered thanks to physical leakages such as the power consumption or the electromagnetic emanations measured on the target device.
In an almost optimal attack scenario, the adversary runs a so-called profiling phase to learn the statistical dependency between the manipulated sensitive variables and the leakage. An attack phase is subsequently launched, during which the learned information is used to distinguish the secret. The first example of such a modus operandi, called a profiling attack, was published in the early 2000s under the name of Gaussian Template Attacks (GTA for short) [CRR02]. To circumvent it, many counter-measures were developed, including de-synchronization and masking. Both have been shown to be practically effective [SVO+10, VMKS12], and their use in industrial implementations is common today. [...] metrics such as Guessing Entropy (GE) or Success Rate (SR) [SMY09], which has been later empirically confirmed by Picek et al.

Problem Addressed in this Paper
In view of the current state of the art, we are today in an uncomfortable situation where the replacement of the target SCA Optimization Problem by the Supervised Classification Problem shows promising efficiency gains, while several recent papers question the theoretical soundness of this replacement [CDP17, PHJ+19]. This situation prevents the SCA community from getting a clear picture of the potential impact of Deep Learning, especially from the developer's perspective. Indeed, whereas an attacker only needs to know an efficient practical approach to train a DNN, a developer needs a theoretically grounded approach to be able to give the best security bounds on the complexity of mounting a profiling attack, especially when the implementation is protected by counter-measures.
Our paper aims at grounding the use of DNNs in the SCA context, especially when classical counter-measures like masking, de-synchronization and shuffling are involved. It starts from the important observation that questioning the relevance of the accuracy metric is actually ill-posed in the specific case of Deep Learning based SCA, since the accuracy is never directly optimized in Machine Learning. Indeed, the latter optimization is not feasible in practice (see footnote 1), and DNNs are typically trained by minimizing a surrogate loss function (with the hope that the accuracy will be maximized as a side effect). This observation leads us to investigate to what extent the SCA Optimization Problem is related to the Supervised Classification Problem when the latter is considered with respect to the surrogate loss function that is minimized (instead of the accuracy). This is the main problem addressed in this paper.

Our contribution
In the literature, mostly two surrogate loss functions have been used in the SCA context: the Negative Log Likelihood (NLL) [CDP17, PSB+18, KPH+19] and the Mean Square Error (see footnote 2) [MPP16, Tim19, WMM19]. As a first contribution, we propose a theoretical study of the NLL loss, highlighting the fact that this function is strongly linked to a side-channel information theoretic quantity called Perceived Information (PI), which was formally introduced by Renauld et al. at EUROCRYPT 2011 [RSV+11] and recently studied by Bronchain et al. [BHM+19]. As a direct consequence, the PI can be straightforwardly computed from the NLL loss. More interestingly, this implies that the training phase of a Deep Learning model, through the minimization of the NLL loss, is actually equivalent to giving the PI estimation that is the closest to the Mutual Information (MI) between the leakage and the target sensitive variable. This result, combined with the recent works in [BHM+19] and [dCGRP19], proves that a lower bound of the minimal number of queries needed in a successful attack, which depends on the MI, can be accurately estimated. This justifies the soundness of addressing Supervised Classification when the latter is solved by training DNNs through the minimization of the NLL loss.
As we shall show in this paper, the latter result has many direct impacts. First, the training of DNNs with the NLL loss can be considered as an efficient and effective estimation of the PI, and thereby of the MI (known to be complex to accurately estimate in the context of secure implementations [PR10, BGP+11]). Secondly, it implies that in an SCA context, choosing the NLL loss function to drive the training is sound when it comes to addressing the SCA Optimization Problem. Thirdly, it makes it possible to study the impact of classical SCA counter-measures on the efficiency of Deep Learning based SCA and to formally prove that they stay sound.
The second part of the paper is dedicated to the validation of our theoretical results through several experiments and simulations in the context of implementations secured by masking, shuffling and de-synchronization.
Footnote 1: Directly maximizing the accuracy of DNNs is an NP-hard problem. This also holds for Random Forests and Support Vector Machines, which are investigated by Picek et al. in [PHJ+19] to discuss the link between the accuracy and SCA metrics. See details in Subsection 3.2.
Footnote 2: From a purely optimization point of view, the Mean Square Error might suffer from problems [Nie18]. From an SCA evaluation point of view, the relevance of MSE is an open question [vdVP19], beyond the scope of this paper.

Organization of the paper
The paper is organized as follows. Notations are first introduced in Section 2. Section 3 introduces the profiling attack scenario; it also discusses the relevance of the Supervised Classification Problem in a side-channel context, and proposes another way to tackle the evaluation. Section 4 states the soundness of minimizing the NLL loss, since it is nothing but maximizing the Perceived Information. This is then verified by simulations in Section 5, and illustrated on experimental examples in Section 6.

Notations
Throughout the paper we use calligraphic letters such as $\mathcal{X}$ to denote sets, the corresponding upper-case letter $X$ to denote random variables (resp. random vectors $\mathbf{X}$) over $\mathcal{X}$, and the corresponding lower-case letter $x$ (resp. $\mathbf{x}$) to denote realizations of $X$ (resp. $\mathbf{X}$). The $i$-th entry of a vector $\mathbf{x}$ is denoted by $\mathbf{x}[i]$. We denote the probability space of a set $\mathcal{X}$ by $\mathcal{P}(\mathcal{X})$. If $\mathcal{X}$ is discrete, it corresponds to the set of vectors in $[0, 1]^{|\mathcal{X}|}$ whose coordinates sum to 1. If a random variable $X$ is drawn from a distribution $\mathcal{D}$, then $\mathcal{D}^N$ denotes the joint distribution over a sequence of $N$ i.i.d. random variables with the same probability distribution as $X$. The symbol $\mathbb{E}$ denotes the expected value, and might be subscripted by a random variable, $\mathbb{E}_X$, or by a probability distribution, $\mathbb{E}_{X \sim \mathcal{D}}$, to specify under which probability distribution it is computed. Likewise, $\mathbb{V}$ denotes the variance of a random variable.
The output of a cryptographic primitive $\mathrm{C}$ is considered as the target sensitive variable $Z = \mathrm{C}(P, K)$, where $P$ denotes some public variable (e.g. a plaintext chunk), $K$ denotes the part of the secret key the attacker aims to retrieve, and $Z$ takes values in $\mathcal{Z} = \{s_1, \ldots, s_{|\mathcal{Z}|}\}$. Among all the possible values $K$ may take, $k^\star$ will denote the right key hypothesis. Side-channel traces will be viewed as discrete realizations of a random column vector $\mathbf{X}$ with values in $\mathcal{X} = [0, 2^\omega - 1]^D$, where $\omega$ depends on the vertical resolution of the oscilloscope used for the acquisitions (usually, $\omega \in \{8, 10, 12\}$).
Let $(A_n)_n$ be a sequence of random variables and let $A$ be another random variable. We say that $A_n$ converges in probability towards $A$, denoted by $A_n \xrightarrow[n \to \infty]{P} A$, when the following property holds:
$$\forall \epsilon > 0, \quad \lim_{n \to \infty} \Pr\left[|A_n - A| > \epsilon\right] = 0.$$
We now define some information-theoretic quantities taken from [CT06]. Let $Z \in \mathcal{Z}$ be a discrete random variable. The entropy of $Z$, denoted by $\mathrm{H}(Z)$, describes the uncertainty in guessing the value of a realization of $Z$. It is formally defined by:
$$\mathrm{H}(Z) \triangleq -\sum_{s \in \mathcal{Z}} \Pr[Z = s] \log_2 \Pr[Z = s].$$
Likewise, the conditional entropy of a discrete random variable $Z$ given another random variable $\mathbf{X}$ quantifies the remaining uncertainty on the guess of $Z$ once $\mathbf{X}$ is known. It is formally defined as:
$$\mathrm{H}(Z \mid \mathbf{X}) \triangleq \mathbb{E}_{\mathbf{X}}\left[-\sum_{s \in \mathcal{Z}} \Pr[Z = s \mid \mathbf{X}] \log_2 \Pr[Z = s \mid \mathbf{X}]\right].$$
If $\mathcal{D}$ and $\mathcal{D}'$ are two probability distributions on $\mathcal{Z}$, we define the Kullback-Leibler divergence (or KL divergence) as:
$$D_{\mathrm{KL}}(\mathcal{D} \,\|\, \mathcal{D}') \triangleq \sum_{s \in \mathcal{Z}} \mathcal{D}(s) \log_2 \frac{\mathcal{D}(s)}{\mathcal{D}'(s)}.$$
This quantity is typically used to measure the difference between two discrete probability distributions, since it is always non-negative and equals zero if and only if $\mathcal{D} = \mathcal{D}'$. Thanks to the previous definitions, we can introduce the Mutual Information (MI) between two variables $Z$ and $\mathbf{X}$ as:
$$\mathrm{MI}(Z; \mathbf{X}) \triangleq \mathrm{H}(Z) - \mathrm{H}(Z \mid \mathbf{X}).$$
This characterizes how much information can be obtained about $Z$ by observing $\mathbf{X}$.
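To make the quantities above concrete, the following minimal sketch evaluates them on a small discrete joint distribution. The function names and the toy numbers are illustrative assumptions, not taken from the paper, and the leakage is reduced to a single bit for readability.

```python
import math

def entropy(pmf):
    """Shannon entropy H(Z) in bits of a pmf given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def kl_divergence(d, d_prime):
    """D_KL(d || d_prime) in bits; assumes d_prime[s] > 0 wherever d[s] > 0."""
    return sum(p * math.log2(p / q) for p, q in zip(d, d_prime) if p > 0)

def mutual_information(joint):
    """MI(Z; X) = H(Z) - H(Z | X) for a joint pmf stored as joint[x][z]."""
    n_z = len(joint[0])
    p_z = [sum(row[z] for row in joint) for z in range(n_z)]
    h_z_given_x = 0.0
    for row in joint:
        p_x = sum(row)
        if p_x > 0:
            # Weight the entropy of Pr[Z | X = x] by Pr[X = x].
            h_z_given_x += p_x * entropy([p / p_x for p in row])
    return entropy(p_z) - h_z_given_x

# Toy example (hypothetical numbers): a 1-bit secret Z seen through a noisy 1-bit leakage X.
joint = [[0.4, 0.1],   # Pr[X=0, Z=0], Pr[X=0, Z=1]
         [0.1, 0.4]]   # Pr[X=1, Z=0], Pr[X=1, Z=1]
print(entropy([0.5, 0.5]))
print(mutual_information(joint))
```

In this toy setting the MI is strictly between 0 and 1 bit, which reflects a leakage that is informative but noisy.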

Profiling Attacks and their Evaluation
This section presents the framework we will consider when attacking a device through a profiling attack. Once presented, we will set the goal of an evaluator. The considered scenario is made of the following steps:
• Profiling acquisition: a dataset of $N_p$ profiling traces is acquired on the prototype device. It will be seen as a realization of the random variable $S_p \triangleq \{(\mathbf{x}_1, z_1), \ldots, (\mathbf{x}_{N_p}, z_{N_p})\} \sim \Pr[\mathbf{X}, Z]^{N_p}$, where all the $\mathbf{x}_i$ (resp. all the $z_i$) are i.i.d. realizations of $\mathbf{X}$ (resp. $Z$).
• Profiling phase: based on $S_p$, a model $F : \mathcal{X} \to \mathcal{P}(\mathcal{Z})$ is built that returns a set of scores for each hypothetical value of $Z$, which can be assimilated to a pmf (possibly after normalization).
• Attack acquisition: a dataset of $N_a$ attack traces is acquired on the target device. It will be seen as a realization of $S_a \triangleq (k^\star, \{(\mathbf{x}_1, p_1), \ldots, (\mathbf{x}_{N_a}, p_{N_a})\})$ such that $k^\star \in \mathcal{K}$ and, for all $i \in [\![1, N_a]\!]$, $p_i \sim \Pr[P]$ and $\mathbf{x}_i \sim \Pr[\mathbf{X} \mid Z = \mathrm{C}(p_i, k^\star)]$.
• Predictions: a prediction vector $\mathbf{y}_i \triangleq F(\mathbf{x}_i)$ is computed on each attack trace, based on the previously built model. For each trace, it assigns a score to each key hypothesis: for every $j \in [\![1, |\mathcal{Z}|]\!]$, the value of the $j$-th coordinate of $\mathbf{y}_i$ corresponds to the score assigned by the model to the hypothesis $Z = s_j$ when observing $\mathbf{x}_i$.
• Guessing: the scores are combined over all the attack traces to output a likelihood for each key hypothesis; the candidate with the highest likelihood is predicted to be the right key.
A maximum likelihood score can be used for the guessing. For every key hypothesis $k \in \mathcal{K}$, this score is defined as:
$$d_{S_a}[k] \triangleq \sum_{i=1}^{N_a} \log\left(\mathbf{y}_i[z_i]\right), \quad \text{where } z_i = \mathrm{C}(p_i, k). \tag{1}$$
Based on the scores in Equation 1, the key hypotheses are ranked in decreasing order. Finally, the attacker chooses the key that is ranked first. More generally, the rank $g_{S_a}(k^\star)$ of the correct key hypothesis $k^\star$ is defined as:
$$g_{S_a}(k^\star) \triangleq \sum_{k \in \mathcal{K}} \mathbb{1}_{d_{S_a}[k] \geq d_{S_a}[k^\star]}.$$
If $g_{S_a}(k^\star) = 1$, then the attack is considered successful.
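The prediction and guessing steps above can be sketched as follows. This is only an illustration under stated assumptions: `target` is a hypothetical 2-bit XOR standing in for the primitive C(P, K), the model outputs are hand-written toy pmfs rather than the output of a trained network, and the rank counts the hypotheses scoring at least as high as the right key, so that a rank of 1 means success.

```python
import math

def target(p, k):
    # Hypothetical stand-in for the primitive C(P, K): a 2-bit XOR.
    return p ^ k

def key_scores(predictions, plaintexts, keys):
    """Maximum likelihood score of each key hypothesis (sum of log-probabilities)."""
    scores = {}
    for k in keys:
        scores[k] = sum(math.log(y[target(p, k)])
                        for y, p in zip(predictions, plaintexts))
    return scores

def rank_of(scores, k_star):
    """Rank of k_star: number of hypotheses scoring at least as high (1 = ranked first)."""
    return sum(1 for k in scores if scores[k] >= scores[k_star])

keys = range(4)
k_star = 2                      # right key, known only to the evaluator
plaintexts = [0, 1, 2, 3]

# Toy predictions: probability 0.4 on the true value of Z, 0.2 on each other value.
predictions = []
for p in plaintexts:
    y = [0.2] * 4
    y[target(p, k_star)] = 0.4
    predictions.append(y)

scores = key_scores(predictions, plaintexts, keys)
print(rank_of(scores, k_star))
```

Summing log-probabilities rather than multiplying probabilities avoids numerical underflow when the number of attack traces grows.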
To assess the difficulty of attacking a target device with profiling attacks (which is assumed to be the worst-case scenario for the attacked device), it has initially been suggested to measure or estimate the minimum number of traces required to get a successful attack [Man04]. Observing that many random factors may be involved during the attack, the latter measure has been refined to study the probability that the right key is ranked first. This metric is called the Success Rate [SMY09]:
$$\mathrm{SR}(N_a) \triangleq \Pr\left[g_{S_a}(k^\star) = 1\right].$$
Within this framework, it is common to formulate the evaluator's goal in the worst-case scenario as follows [LPB+15, HGM+11, PHJ+19]:
Problem 1 (SCA Optimization). Given a profiling set $S_p$, find a model $F$ that minimizes $N_a$ such that $\mathrm{SR}(N_a) \geq \beta$, where $\beta$ is a threshold defined by the evaluator. We denote by $N_a^\star$ the corresponding minimum number of attack traces.
For convenience, we will denote by $N_a(F)$ the minimal number of attack traces needed to verify the condition $\mathrm{SR}(N_a) \geq \beta$ for a given fixed model $F$. An analytical optimal solution to Problem 1 is given by the conditional pmf of the leakage, as stated in Proposition 1. This proposition tells us that the conditional pmf is the best model we can build for a profiling attack. Yet, such a solution is purely analytical and remains unknown to the evaluator, which makes it necessary to find alternatives to this theoretical (optimal) solution.
To find sub-optimal solutions to the SCA Optimization Problem, a natural approach is to look for accurate and efficient estimators of the pmf $\Pr[Z \mid \mathbf{X}]$. Solving this new problem is typically the purpose of GTAs, which approximate $\Pr[\mathbf{X} \mid Z]$ by a Gaussian distribution and apply Bayes' Theorem to deduce an estimator of the targeted pmf. Unfortunately, when the underlying Gaussian assumption does not hold, estimating $\Pr[\mathbf{X} \mid Z]$ is known to be hard in practice (especially in the presence of counter-measures) [BGH+17, BGHR14]. This has led the SCA community to look for alternatives. The machine (and especially deep) learning paradigm aims at generalizing the approach taken by GTAs, by considering wider sets of models to approach the optimal solution. This alternative is discussed in the next section.

Problems with Supervised Classification
This section presents the concept of Supervised Classification, which is commonly applied to solve the SCA Optimization Problem with machine learning. Its soundness is discussed in the context of SCA. Formally, Supervised Classification is defined as the problem of finding an estimator of the true pmf that maximizes the accuracy, namely the rate of good predictions of the value of the target sensitive variable $Z$ over the joint distribution of $(Z, \mathbf{X})$. For any function $F : \mathcal{X} \to \mathcal{P}(\mathcal{Z})$, this accuracy is denoted by $\mathrm{Acc}(F)$ and is defined as:
$$\mathrm{Acc}(F) \triangleq \Pr\left[\operatorname*{argmax}_{s \in \mathcal{Z}} F(\mathbf{X})[s] = Z\right].$$
The estimator is taken from a parameterized hypothesis class previously defined by the evaluator. The class may be seen as a collection of models of the form $\mathcal{H} \triangleq \{F(\cdot, \theta) : \mathcal{X} \to \mathcal{P}(\mathcal{Z}) \mid \theta \in \Theta\}$, where $\theta \in \Theta \subseteq \mathbb{R}^q$ denotes the $q$-dimensional vector gathering all the parameters. It turns out that the SCA Optimization and Supervised Classification Problems are linked, thanks to the following proposition, which essentially states that no estimator can have a better accuracy than the one defined by the true conditional pmf:
Proposition 2 (Bayes Error for Supervised Classification [SSBD14]). Let $F^\star \triangleq \Pr[Z \mid \mathbf{X}]$. Then, for any function $F : \mathcal{X} \to \mathcal{P}(\mathcal{Z})$ we have:
$$\mathrm{Acc}(F) \leq \mathrm{Acc}(F^\star).$$
When $F^\star$ belongs to the considered hypothesis class $\mathcal{H}$, Propositions 1 and 2 tell us that model selection through accuracy maximization is the optimal strategy for the SCA Optimization Problem. When the latter condition is not satisfied, the accuracy maximization strategy outputs another model, which has sub-optimal accuracy compared to $F^\star$, as stated in Proposition 2. This questions the soundness of accuracy maximization for the SCA Optimization Problem or, in other words, whether the accuracy properly quantifies the quality of a solution to Problem 1. This question was first pointed out by Cagli et al., who mentioned in [CDP17] that maximizing the accuracy only corresponds to finding the model that maximizes $\mathrm{SR}(1)$, which is different from the criterion we consider in the SCA Optimization Problem.
Likewise, Picek et al. empirically verified in [PHJ+19] that the accuracy of some Machine Learning models such as Support Vector Machines (SVM) and Random Forests (RF) was not always related to the Guessing Entropy. More precisely, they argue that a high accuracy is a clue for good performance in SCA Optimization, though the converse does not empirically hold. Altogether, these findings question the Supervised Classification approach in an SCA context. This paper aims at addressing a slightly rephrased version of the question raised in [CDP17] and [PHJ+19]. Indeed, machine learning algorithms do not directly maximize the accuracy in practice, since doing so is known to be computationally intractable. This also holds for the specific case of DNNs [SSBD14, Thm. 20.7], where a surrogate loss function is minimized instead, with the hope that this will also optimize the target loss function (e.g. maximize the Supervised Classification accuracy). A commonly used surrogate loss is the Negative Log Likelihood (NLL) [CDP17, PSB+18, KPH+19]. We hereafter recall its definition:
Definition 1 (Negative Log Likelihood). Given a profiling set $S_p \sim \Pr[\mathbf{X}, Z]^{N_p}$, and a DNN model defined by $\theta$ from a hypothesis class $\mathcal{H}$, the Negative Log Likelihood is defined as:
$$\mathcal{L}_{S_p}(\theta) \triangleq \frac{1}{N_p} \sum_{i=1}^{N_p} -\log_2 F(\mathbf{x}_i, \theta)[z_i]. \tag{5}$$
Furthermore, we define the Maximum Likelihood Estimator $\hat{\theta}$ as the parameter vector from $\Theta$ that minimizes the NLL loss computed over the profiling set $S_p$: $\hat{\theta} \triangleq \operatorname*{argmin}_{\theta \in \Theta} \mathcal{L}_{S_p}(\theta)$.
Since DNNs are trained by minimizing the NLL instead of the accuracy, we argue here that, in the specific case where one considers neural networks as a hypothesis class, the question raised by [CDP17] and [PHJ+19] should be rephrased as questioning the equivalence between NLL minimization for DNNs and the SCA Optimization Problem (i.e. Problem 1). As a side effect, this questioning does not extend to algorithms that do not minimize this loss, such as Random Forests or (kernel based) SVMs. In order to address this question, we first need to substitute, in the next section, the SCA Optimization Problem with an intermediate problem.
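The difference between the accuracy and the NLL loss can be illustrated with a minimal sketch. The two hand-written "models" below are hypothetical lists of predicted pmfs, not trained networks; they reach the same accuracy while their NLL (in bits, consistent with the base-2 logarithm used for the entropy) differs, which is the kind of information the accuracy alone does not capture.

```python
import math

def nll_loss(probs, labels):
    """Average negative log2-likelihood of the true labels, in bits."""
    return -sum(math.log2(y[z]) for y, z in zip(probs, labels)) / len(labels)

def accuracy(probs, labels):
    """Fraction of samples whose most probable class equals the true label."""
    hits = sum(1 for y, z in zip(probs, labels)
               if max(range(len(y)), key=lambda s: y[s]) == z)
    return hits / len(labels)

labels = [0, 0, 1]
# Two hypothetical models that always rank the right class first,
# but with different confidence.
confident = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]
hesitant = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6]]

print(accuracy(confident, labels), nll_loss(confident, labels))
print(accuracy(hesitant, labels), nll_loss(hesitant, labels))
```

Both models are indistinguishable in terms of accuracy, whereas the NLL separates them; in a multi-trace attack, the scores accumulated from the more confident model would discriminate the key faster.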

Model Training for Leakage Assessment
The problem substitution presented in this section comes as a direct consequence of the recent work [dCGRP19], which has stated that $N_a^\star$, namely the number of traces required for a successful attack when the involved model corresponds to the optimal solution to Problem 1, is linked to the MI between the target sensitive variable and the leakage through the following inequality:
$$\frac{f(\beta)}{\mathrm{MI}(Z; \mathbf{X})} \leq N_a^\star, \tag{6}$$
where $f$ is a known, invertible, strictly increasing function defined in [dCGRP19], and $\beta$ is the threshold defined in Problem 1. De Chérisey et al. argue that the lower $N_a^\star$, the tighter Inequality (6). Nevertheless, from the point of view of conservative security evaluations, it remains interesting to compute the value of the left-hand side of Inequality (6), no matter the value of $N_a^\star$. Unfortunately, computing the MI in the denominator also requires perfect knowledge of the true pmf $\Pr[Z \mid \mathbf{X}]$. As with the SCA Optimization and the Supervised Classification Problems, this cannot be assumed in practice. To circumvent this issue, we can fortunately use the notion of Perceived Information (PI), which extends the MI to accept estimated pmfs [RSV+11].
Definition 2 (Perceived Information [BHM+19]). Let $\Theta$ be the parameter space of a parameterized hypothesis class $\mathcal{H}$ and let $\theta$ be an element of $\Theta$. The Perceived Information between $Z$ and $\mathbf{X}$ for the model $F(\cdot, \theta) \in \mathcal{H}$ is denoted by $\mathrm{PI}(Z; \mathbf{X}; \theta)$ and defined as:
$$\mathrm{PI}(Z; \mathbf{X}; \theta) \triangleq \mathrm{H}(Z) + \sum_{s \in \mathcal{Z}} \Pr[Z = s] \sum_{\mathbf{x} \in \mathcal{X}} \Pr[\mathbf{X} = \mathbf{x} \mid Z = s] \log_2 F(\mathbf{x}, \theta)[s].$$
Intuitively, when the pmf $\Pr[Z \mid \mathbf{X}]$ is perfectly learned, the PI equals the MI; otherwise the former is always lower than the latter [BHM+19]. This is of great interest here since, whenever the PI is positive, it yields an upper bound of the left-hand side of Inequality (6), namely
$$\frac{f(\beta)}{\mathrm{MI}(Z; \mathbf{X})} \leq \frac{f(\beta)}{\mathrm{PI}(Z; \mathbf{X}; \theta)} \triangleq N_{a,\theta}.$$
Moreover, we can then compare different models in terms of their PI: the higher the PI, the lower the distance to the MI, and thereby the better the estimation of $f(\beta) / \mathrm{MI}(Z; \mathbf{X})$ by $N_{a,\theta}$. This leads us to introduce a new intermediate problem, named Leakage Assessment.
Problem 2 (Leakage Assessment). Given a profiling set $S_p \sim \Pr[\mathbf{X}, Z]^{N_p}$, find the model with the highest PI.
At this point, we have argued that addressing the Leakage Assessment Problem is sound for the SCA Optimization Problem, in the sense that it makes it possible to estimate a lower bound of the optimal solution $N_a^\star$ of the latter problem. The following section studies Problem 2 in depth. We will show that training deep learning models with the NLL loss is asymptotically equivalent to solving this problem, which implies that conducting profiled SCA with deep learning can be argued to be relevant within this framework.

NLL Minimization is PI Maximization
This section is devoted to showing that a deep learning model trained by minimizing the NLL loss fits with Problem 2. Subsection 4.1 studies the link between the NLL loss and an information theoretic quantity called Cross Entropy, which is defined hereafter. Then, Subsection 4.2 makes the link between cross entropy and PI. Finally, Subsection 4.3 discusses the gap between the MI and a PI estimated by training deep learning based models. Eventually, it will be concluded that the MI can be accurately estimated thanks to this approach.

The Consistency of the NLL Loss with Cross Entropy
This subsection recalls, for the unfamiliar reader, an important machine learning notion that will be used afterwards in Subsection 4.2, namely the property of consistency. Briefly, it states that NLL loss minimization is asymptotically equivalent to the minimization of an information theoretic quantity called Cross Entropy. We start by recalling the latter notion hereafter.
Definition 3 (Cross Entropy). Given a joint probability distribution of a target sensitive variable $Z$ and its leakage $\mathbf{X}$, denoted by $\Pr[\mathbf{X}, Z]$, we define the Cross Entropy as the expected value of each term in Equation 5:
$$\mathcal{L}_{\Pr[\mathbf{X}, Z]}(\theta) \triangleq \mathbb{E}_{\mathbf{X}, Z}\left[-\log_2 F(\mathbf{X}, \theta)[Z]\right].$$
The cross entropy is actually nothing but the expected value of the NLL loss computed over the profiling set of traces. Besides, according to the law of large numbers, for any fixed $\theta$ the NLL loss converges in probability towards the cross entropy [SSBD14]. However, since the true joint distribution of $Z$ and $\mathbf{X}$ is actually unknown, one cannot exactly compute the cross entropy. The hope behind NLL minimization is that, for a high enough number $N_p$ of profiling traces, the obtained parameter vector $\hat{\theta}$ will be a good candidate to minimize the cross entropy.
It is not trivial, though, that $\mathcal{L}_{S_p}(\hat{\theta})$ converges in probability towards $\min_{\theta \in \Theta} \mathcal{L}_{\Pr[\mathbf{X}, Z]}(\theta)$, as $\hat{\theta}$ varies with each value of $N_p$. Thankfully, a fundamental result of machine learning called consistency proves the soundness of the approach.
Theorem 1 (Consistency of Maximum Likelihood Estimation [Vap99, SSBD14]). Let $N_p \in \mathbb{N}$ and let $S_p$ be a profiling set of size $N_p$. Assume that $\mathcal{H}$ is a hypothesis class of finite VC-dimension, with parameter space $\Theta$. Then the following convergences hold:
$$\mathcal{L}_{S_p}(\hat{\theta}) \xrightarrow[N_p \to \infty]{P} \min_{\theta \in \Theta} \mathcal{L}_{\Pr[\mathbf{X}, Z]}(\theta) \quad \text{and} \quad \mathcal{L}_{\Pr[\mathbf{X}, Z]}(\hat{\theta}) \xrightarrow[N_p \to \infty]{P} \min_{\theta \in \Theta} \mathcal{L}_{\Pr[\mathbf{X}, Z]}(\theta).$$
In particular, this applies when $\mathcal{H}$ is the set of Multi-Layer Perceptrons (MLP). In other words, the solution $\hat{\theta}$ given by the minimization of the NLL loss converges towards the best possible solution for the cross entropy, and the NLL loss of $\hat{\theta}$ is a good approximation of the generalization loss of $\hat{\theta}$.
Proof. The Fundamental Theorem of Statistical Learning states that consistency holds if and only if the VC-dimension of $\mathcal{H}$ is finite [Vap99]. In parallel, if $\mathcal{H}$ is a hypothesis class trained by minimizing a real-valued loss function, its VC-dimension equals the VC-dimension of the same hypothesis class where each model has a binary output [Vap95, p. 76]. For the specific class of Multi-Layer Perceptrons (MLP) with a binary output, the VC-dimension is indeed finite [SSBD14, Theorem 20.6, p. 274], as it can be bounded by a function of the number of neurons.
As mentioned in Theorem 1, the latter result also holds for any hypotheses class with finite VC-dimension. This includes for example (kernel based) softmax classifiers that are beyond the scope of this paper.
As a consequence of Theorem 1, any property verified by the cross entropy is also asymptotically verified by the NLL loss (i.e. when the number of profiling traces N p converges towards infinity). Therefore we can substitute the analysis of the NLL loss with that of the cross entropy. It remains now to draw the link between cross entropy and PI, in order to address the Leakage Assessment Problem.
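The law-of-large-numbers statement used in this subsection (for a fixed θ, the NLL loss approaches the cross entropy as the profiling set grows) can be observed on a toy simulation. The sketch below assumes a hypothetical 1-bit leakage and a fixed, hand-written model; it illustrates the fixed-θ convergence only, not the consistency of the minimizer itself.

```python
import math
import random

def model(x):
    # Hypothetical fixed model F(x, theta): estimated pmf of a 1-bit Z given a 1-bit leakage x.
    return [0.7, 0.3] if x == 0 else [0.3, 0.7]

def sample(rng):
    # Toy leakage: Z is a uniform bit and X equals Z, flipped with probability 0.2.
    z = rng.randrange(2)
    x = z if rng.random() < 0.8 else 1 - z
    return x, z

def nll(samples):
    # Empirical NLL in bits over a list of (x, z) pairs.
    return -sum(math.log2(model(x)[z]) for x, z in samples) / len(samples)

# Cross entropy of the same model: exact expectation under the toy distribution
# (the model assigns 0.7 to the true Z when X = Z, and 0.3 otherwise).
cross_entropy = -(0.8 * math.log2(0.7) + 0.2 * math.log2(0.3))

rng = random.Random(0)
for n_p in (100, 1000, 20000):
    profiling_set = [sample(rng) for _ in range(n_p)]
    print(n_p, round(nll(profiling_set), 4), round(cross_entropy, 4))
```

The printed NLL values should get closer to the cross entropy as the number of simulated profiling traces increases, up to statistical fluctuations.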

The Link between Cross Entropy and Perceived Information
This section aims at explaining to what extent the PI and the cross entropy introduced in the previous section are linked. It is argued here that the PI actually equals the cross entropy up to a sign and an additive constant. Such a link and the reduction argued in Subsection 4.1 will allow us to guarantee that minimizing the NLL loss is a consistent approach for solving the Leakage Assessment Problem. Recall that the PI has been formally defined in Subsection 3.3. We also introduce hereafter the Empirical Perceived Information, as given in [BHM+19].
Definition 4 (Empirical Perceived Information [BHM+19]). Let $\Theta$ be the parameter space of a parameterized hypothesis class $\mathcal{H}$. Let $\theta \in \Theta$. The Empirical Perceived Information, denoted by $\mathrm{PI}_{N_p}(Z; \mathbf{X}; \theta)$, is defined from a profiling set $S_p$ as follows: [...]
Informally, the PI is defined the same way as the MI, but by substituting the uncertainty of the true pmf, namely $\log_2 \Pr[Z \mid \mathbf{X} = \mathbf{x}]$, with the uncertainty of the approximating pmf, namely $\log_2 F(\mathbf{X}, \theta)$. Interestingly, this substitution is exactly what defines the cross entropy.
Proposition 3 (Our contribution). Let $Z$ be a random variable with uniform distribution over $\mathcal{Z} = \mathbb{F}_2^n$ for some $n \in \mathbb{N}$. Then, the cross entropy and the NLL loss are respectively linked to the Perceived Information and its empirical estimation as follows:
$$\mathrm{PI}(Z; \mathbf{X}; \theta) = n - \mathcal{L}_{\Pr[\mathbf{X}, Z]}(\theta), \qquad \mathrm{PI}_{N_p}(Z; \mathbf{X}; \theta) = n - \mathcal{L}_{S_p}(\theta).$$
Proof. The assumption about $Z$ implies that $\mathrm{H}(Z) = n$. Injecting the latter result into the definition of the PI, and using the law of total expectation, we have:
$$\mathrm{PI}(Z; \mathbf{X}; \theta) = n + \sum_{s \in \mathcal{Z}} \Pr[Z = s] \sum_{\mathbf{x} \in \mathcal{X}} \Pr[\mathbf{X} = \mathbf{x} \mid Z = s] \log_2 F(\mathbf{x}, \theta)[s] = n + \mathbb{E}_{\mathbf{X}, Z}\left[\log_2 F(\mathbf{X}, \theta)[Z]\right] = n - \mathcal{L}_{\Pr[\mathbf{X}, Z]}(\theta).$$
The proof for the empirical PI follows exactly the same reasoning, substituting expected values with averages.
Proposition 3 tells us that the cross entropy and the Perceived Information are, up to a sign and an additive constant, the same quantity. As already pointed out in [BHM+19, Thm. 6], we have for all $\theta \in \Theta$: $\mathrm{PI}(Z; \mathbf{X}; \theta) \leq \mathrm{MI}(Z; \mathbf{X})$. In other words, computing the cross entropy of any deep learning model yields a lower bound of the MI. This tells nothing about the tightness of such a bound though. Fortunately, based on the previous results stated in this section, we now know how to tighten this inequality, as stated by the following proposition.
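Combining the consistency result of Theorem 1 with the identity of Proposition 3 suggests the following chain, written here as a sketch. The notation is an assumption of this sketch ($\mathrm{PI}_{N_p}$ for the empirical PI, $\hat{\theta}$ for the minimizer of the NLL loss, $\mathcal{L}_{\Pr[\mathbf{X}, Z]}$ for the cross entropy), and $Z$ is taken uniform over $n$-bit values as in Proposition 3.

```latex
\begin{align*}
\mathrm{PI}_{N_p}(Z; \mathbf{X}; \hat{\theta})
  &= n - \mathcal{L}_{S_p}(\hat{\theta})
  && \text{(Proposition 3, empirical form)} \\
  &\xrightarrow[N_p \to \infty]{P} n - \min_{\theta \in \Theta} \mathcal{L}_{\Pr[\mathbf{X}, Z]}(\theta)
  && \text{(Theorem 1)} \\
  &= \max_{\theta \in \Theta} \mathrm{PI}(Z; \mathbf{X}; \theta)
  && \text{(Proposition 3)} \\
  &\leq \mathrm{MI}(Z; \mathbf{X})
  && \text{([BHM+19, Thm. 6])}
\end{align*}
```

Read this way, the model selected by NLL minimization asymptotically attains the highest PI reachable within the hypothesis class, which is the tightest lower bound on the MI that this class can provide.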
Roughly speaking, Proposition 4 states that NLL loss minimization is asymptotically equivalent to the PI maximization mentioned in the Leakage Assessment Problem (i.e. Problem 2). Therefore, on the one hand, Proposition 4 gives us a theoretically grounded method to address the Leakage Assessment Problem, namely minimizing the NLL loss. On the other hand, since it has been argued in Subsection 3.3 that solving the Leakage Assessment Problem is sound in order to address the SCA Optimization Problem, the main result of this paper follows from Proposition 4 and is given hereafter as Corollary 1.
Proof. By applying Proposition 4, we get [...]
In other words, Corollary 1 tells us that minimizing the NLL loss is sound for the SCA Optimization Problem (i.e. Problem 1), in the sense argued in Subsection 3.3, and that the term $N_{a,\hat{\theta}}$ might be a good approximation in view of estimating the lower bound of Inequality (6). However, this also emphasizes that, in the pursuit of estimating $N_a^\star$ through NLL minimization, some weaknesses must be discussed.
First, as recalled in Subsection 3.3, the higher $N_a^\star$, the looser Inequality (6). It is therefore of natural interest to verify to what extent the tightness of the latter inequality holds, in view of estimating $N_a^\star$ by $f(\beta) / \mathrm{MI}(Z; \mathbf{X})$. This must be at least empirically verified. Second, the tightness of Inequality (17) is another possible source of imprecision when one wants to substitute the MI with the PI. This will be discussed in the next section, and will eventually be verified through simulations and experiments.
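The relation between cross entropy, PI and MI discussed in this subsection can be checked numerically on a toy discrete channel. The sketch below assumes a uniform 1-bit secret (so that H(Z) = n = 1) and three hypothetical hand-written models; it relies on the identity PI = n - cross entropy, and the MI is obtained by plugging in the true conditional pmf.

```python
import math

def h(pmf):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def cross_entropy(joint, model):
    """E[-log2 F(X)[Z]] under joint[x][z], where model[x] is the estimated pmf of Z given x."""
    return -sum(joint[x][z] * math.log2(model[x][z])
                for x in range(len(joint))
                for z in range(len(joint[x]))
                if joint[x][z] > 0)

def perceived_information(joint, model, n_bits):
    """PI = n - cross entropy, valid when Z is uniform over n-bit values."""
    return n_bits - cross_entropy(joint, model)

# Toy joint pmf (hypothetical numbers): Z is a uniform bit, X a noisy copy of Z.
joint = [[0.4, 0.1], [0.1, 0.4]]
true_model = [[0.8, 0.2], [0.2, 0.8]]    # exact Pr[Z | X = x]
rough_model = [[0.6, 0.4], [0.4, 0.6]]   # right ordering, under-confident
wrong_model = [[0.1, 0.9], [0.9, 0.1]]   # confidently wrong

mi = perceived_information(joint, true_model, 1)   # PI of the true pmf equals the MI
pi_rough = perceived_information(joint, rough_model, 1)
pi_wrong = perceived_information(joint, wrong_model, 1)
print(mi, pi_rough, pi_wrong)
```

In this example the under-confident model still extracts a positive share of the available information, whereas the confidently wrong model has a negative PI: a model can be worse than no model at all, even though every PI value remains a valid lower bound of the MI.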

To what Extent the Obtained Bound is Tight?
So far we have argued that minimizing the NLL loss is a sound approach to tackle Problem 2: it is indeed consistent with minimizing the cross entropy (cf Equation 9), thereby consistent with maximizing the PI (cf Equation 7). In the particular case where the hypothesis class H is a set of neural networks, it becomes now of natural interest to study the gap between the MI and the NLL loss we are minimizing (or equivalently the empirical PI we are maximizing) to assess the quality of the built solution.
Such a minimization is typically done with a Stochastic Gradient Descent (SGD) algorithm. The obtained model is denoted by the parameter vector $\theta_{SGD}$. Thus, one can decompose the gap between the solution found with SGD and the MI into three parts:
$$\begin{aligned}
\mathrm{MI}(Z; \mathbf{X}) - \mathrm{PI}_{N_p}(Z; \mathbf{X}; \theta_{SGD}) ={} & \mathrm{PI}_{N_p}(Z; \mathbf{X}; \hat{\theta}) - \mathrm{PI}_{N_p}(Z; \mathbf{X}; \theta_{SGD}) && (18) \\
& + \sup_{\theta \in \Theta} \mathrm{PI}(Z; \mathbf{X}; \theta) - \mathrm{PI}_{N_p}(Z; \mathbf{X}; \hat{\theta}) && (19) \\
& + \mathrm{MI}(Z; \mathbf{X}) - \sup_{\theta \in \Theta} \mathrm{PI}(Z; \mathbf{X}; \theta) && (20)
\end{aligned}$$
The term (20) corresponds to the approximation error: this error is due to the choice of a restricted hypothesis class $\mathcal{H}$ from which we select our model. This error is of particular interest as it gives a computational security bound. Proposition 4 shows that no model from $\mathcal{H}$ can give a tighter lower bound on the MI. A remarkable result specific to Multi-Layer Perceptrons, a simple type of DNN, known as the Universal Approximation Theorem, states that when considering an $L^2$ error as a loss function, such an approximation error converges towards 0 when the number of neurons in the layers increases [Pet98]. Unfortunately, to the best of our knowledge, no similar result has been stated when considering the cross entropy as an approximation error.
The term (19) corresponds to the estimation error. It is the error due to the fact that we do not maximize the PI (as the true pmf is unknown) but rather its empirical estimation, since we only have a finite set of profiling traces. An upper bound of this error, based on the value of the VC-dimension, can be derived in the context of the NLL loss minimization [Vap99], thereby extending the consistency result recalled in Theorem 1 by providing convergence rates. Unfortunately, the recent deep learning literature has shown that such a bound is very conservative, regarding the potentially high value of the VC-dimension [Har].
The term (18) corresponds to the optimization error. This is the error made by the SGD algorithm, since it has not yet been proved to converge towards the solution of the NLL loss minimization. Indeed, for convex functions, SGD is shown to converge towards the minimum [SSBD14]. Unfortunately, this convexity assumption does not hold for the NLL loss applied to DNNs [GBC16]. Fortunately, the nature of the loss landscape implies that SGD still remains a good heuristic to approach the minimum [LBH15, Bot12, KLY18, CS18].
We remark that each error term refers to a restriction in the capacity of an evaluator (finite hypothesis class, finite profiling set, heuristic for MLE instead of an exact solution). That is why, in order to practically assess the quality of the estimation of the MI, it is interesting to emulate cases where such restrictions can be ignored, so that each error term can be evaluated separately. The experiments conducted in Sections 5 and 6 assess each error term.
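One generic way to write such a three-term decomposition is sketched below. It is only an illustrative telescoping sum under stated assumptions, and not necessarily the exact form of the terms (18) to (20): θ̂ is a hypothetical notation for an exact maximizer of the empirical PI over H, and the supremum is taken over the parameter space Θ of H.

```latex
\begin{align*}
\mathrm{MI}(Z;X) - \mathrm{PI}(Z;X;\theta_{\mathrm{SGD}})
  ={}& \underbrace{\mathrm{PI}(Z;X;\hat{\theta}) - \mathrm{PI}(Z;X;\theta_{\mathrm{SGD}})}_{\text{optimization error}} \\
  &+ \underbrace{\sup_{\theta \in \Theta} \mathrm{PI}(Z;X;\theta) - \mathrm{PI}(Z;X;\hat{\theta})}_{\text{estimation error}} \\
  &+ \underbrace{\mathrm{MI}(Z;X) - \sup_{\theta \in \Theta} \mathrm{PI}(Z;X;\theta)}_{\text{approximation error}}
\end{align*}
```

The intermediate terms cancel pairwise, so the identity holds whatever the choice of θ̂. Each bracket vanishes when the corresponding restriction on the evaluator is lifted (exact optimizer, infinite profiling set, unrestricted hypothesis class).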

Partial Conclusions
The results we have stated so far are threefold.
First, it has been argued that addressing the SCA Optimization Problem may be done by considering another problem, called Leakage Assessment, which aims at finding a model that extracts the most perceived information, rather than choosing the model maximizing the accuracy.
Second, the loss function we are usually minimizing, namely the NLL loss, can be interpreted as a perceived information to be maximized. That is why, in Section 5 and Section 6, we will plot the PI, as computed with Equation 15, since it can replace the accuracy when comparing and evaluating the efficiency of trained models.
Third, to discuss the tightness of Inequality (16), we can decompose the gap into three terms, namely the approximation error, the estimation error and the optimization error. Each error term refers to a restriction in the capacity of an evaluator. The experiments conducted in Sections 5 and 6 study the practical impact of each term.
Finally, the whole discussion conducted in this section, which aims at emphasizing the links between machine learning concepts and metrics and the ones used in SCA, is synthesized in Table 1.

Settings of the experiments
To verify the tightness of the bounds, we simulate simple D-dimensional leakages from an n-bit sensitive variable Z. For every t ∈ ⟦1, D⟧, the t-th coordinate of the i-th trace is the sum of a Gaussian noise B_i and of either hw(z_t,i), for the coordinates carrying a share, or U_i, for the uninformative coordinates. Here (U_i)_i, (B_i)_i and all (z_t,i)_i are independent, U_i ∼ B(n, 0.5) (i.e. U_i is drawn from a binomial law of parameters n and 0.5), B_i ∼ N(0, σ²), hw denotes the Hamming weight function and (z_1,i, . . . , z_d+1,i) is a (d + 1)-sharing of z_i for the bit-wise addition law. 11
This example corresponds to a situation where the leakages of the shares are hidden among values that have no relation with the target, but have the same marginal pmf. Since the z_t,i are drawn uniformly, hw(z_t,i) follows a binomial marginal pmf, so the two kinds of coordinates are indistinguishable without prior knowledge. Hence the choice of a binomial law for U_i when emulating non-informative components. Every possible combination of the (d + 1)-sharing has been generated and replicated a given number of times (denoted by q) before adding the noise, in order to have an exhaustive dataset. Therefore, it contains q × 2^((d+1)n) simulated traces.
Once the data were generated, we trained an MLP with one hidden layer made of r = 1,000 neurons. The training loss is naturally the NLL loss. 12 The training lasts T = 200 epochs 13, with a Stochastic Gradient Descent and a learning rate of 10^−3. Our simulations comprise three main campaigns.
Experiment 1 (masking only): in this experiment, we set D = d + 1 in order to avoid considering irrelevant input features. The simulations are done over n = 4 bits, d ∈ {0, 1, 2, 3} and σ ∈ {0.01, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2}. We also generate enough data so that the training set is "exhaustive", i.e. the number of replicas is q = 2,000. With such a generated dataset, we expect to make the estimation error (19) negligible. The gap between the MI and the PI should therefore only be composed of the optimization error (18) and the approximation error (20).
11 A masking scheme of order d consists in a (d + 1)-sharing of the sensitive target variable.
12 Beware that in Pytorch and Tensorflow, the NLL loss is computed with natural logarithms, whereas one must consider the logarithm in base 2.
13 One epoch refers to the number of iterations needed to process the whole dataset through the SGD algorithm.
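As an illustration of the exhaustive dataset of Experiment 1, the following minimal sketch enumerates every (d + 1)-sharing, replicates it q times and adds Gaussian noise to the Hamming weight of each share. The function names are hypothetical, and the sketch assumes D = d + 1 (no uninformative component) and a noise drawn independently for each coordinate; it is not the simulation code used for the experiments.

```python
import itertools
import random


def hamming_weight(value: int) -> int:
    """Number of bits set to 1 in a non-negative integer."""
    return bin(value).count("1")


def exhaustive_masked_dataset(n, d, q, sigma, seed=0):
    """Return (traces, labels) covering every (d+1)-sharing of an n-bit value.

    Assumption: each coordinate is hw(share) plus independent N(0, sigma^2) noise.
    """
    rng = random.Random(seed)
    traces, labels = [], []
    for shares in itertools.product(range(2 ** n), repeat=d + 1):
        z = 0
        for share in shares:
            z ^= share  # the sensitive value is the bit-wise sum of its shares
        for _ in range(q):  # q noisy replicas of the same sharing
            traces.append([hamming_weight(s) + rng.gauss(0.0, sigma) for s in shares])
            labels.append(z)
    return traces, labels


# Toy parameters, much smaller than n = 4 and q = 2,000 used in the text.
traces, labels = exhaustive_masked_dataset(n=2, d=1, q=3, sigma=0.1)
```

With these toy parameters the dataset holds q × 2^((d+1)n) = 48 traces of dimension 2, and every value of the sensitive variable appears equally often.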
Experiment 2 (masking, with uninformative components): in a second experiment, we have D = 40, including the uninformative components. Since all components share the same marginal law, we recall that they cannot be distinguished without knowing Z. Compared to Experiment 1, we might expect the optimization error to be larger because of the potential difficulty induced by the presence of uninformative components.
Experiment 3 (shuffling, no masking): in a third experiment, we set d = 0, D ∈ {2, 4, 16}; in other words, D − 1 uninformative components are added like in Experiment 2, but this time they are randomly shuffled with the only informative component. Note that the shuffling is different for each simulated trace so that one cannot guess in which position the informative leakage lies. Therefore, we expect the information perceived by the model to be lower than without shuffling [VMKS12]. Besides, σ ∈ {0.04, 0.2, 0.4, 0.8, 1.6, 3.2} here.
From those experiments, the Perceived Information PI(Z; X; θ_SGD) is estimated thanks to a hold-out dataset whose size is one fifth of that of the training set. For the sake of comparison, we estimate the MI between the target sensitive 4-bit variable and its simulated leakage model with a Monte Carlo sampling of the leakage pmf Pr[X|Z].
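Both estimations can be sketched as follows. The sketch assumes a uniform n-bit target, so that the PI equals n minus the cross entropy expressed in bits, and it restricts the Monte Carlo estimation of the MI to the unprotected one-dimensional case (d = 0, D = 1). Function names and the number of samples are hypothetical choices.

```python
import math
import random


def perceived_information(probs, labels, n_bits):
    """PI in bits from hold-out predictions; probs[i][z] estimates Pr[Z = z | x_i]."""
    nll_nats = -sum(math.log(p[z]) for p, z in zip(probs, labels)) / len(labels)
    return n_bits - nll_nats / math.log(2)  # natural logarithm converted to base 2


def monte_carlo_mi(n_bits, sigma, n_samples=5000, seed=0):
    """MI(Z; X) in bits for X = hw(Z) + N(0, sigma^2), with Z uniform on n_bits bits."""
    rng = random.Random(seed)
    weights = [bin(z).count("1") for z in range(2 ** n_bits)]
    acc = 0.0
    for _ in range(n_samples):
        z = rng.randrange(2 ** n_bits)
        x = weights[z] + rng.gauss(0.0, sigma)
        # Gaussian log-likelihood of x for every candidate, up to a common constant.
        logl = [-((x - w) ** 2) / (2 * sigma ** 2) for w in weights]
        top = max(logl)
        log_norm = top + math.log(sum(math.exp(l - top) for l in logl))
        acc += (logl[z] - log_norm) / math.log(2)  # log2 Pr[Z = z | X = x]
    return n_bits + acc / n_samples  # H(Z) - H(Z | X)
```

A model that outputs the uniform distribution gets a PI of 0 bit. With a negligible noise, the MI of the 4-bit example tends to the entropy of the Hamming weight of Z, that is about 2.03 bits.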

Analysis of the Results
In this section we analyze the results obtained by running Experiments 1 (Figure 1, left), 2 (Figure 1, right) and 3 (Figure 1, bottom). On each figure, the plain lines correspond to the estimated MI and the crosses correspond to the information perceived by the trained MLP, as computed from the NLL loss with Equation 14. Based on these results, several observations can be made.
First, on each result the crosses are always below the lines, which is in line with the results given in the literature: the estimated PI is a lower bound of the MI. More interestingly, the crosses are always close to the lines, whatever the MI magnitude. In the case of Experiment 1, we argued that the error was composed of the approximation and optimization errors. Since it turns out that the sum of those errors is negligible, we conclude that even for a simple MLP with one hidden layer and 1,000 neurons, both errors can be ignored. This is of particular interest concerning the approximation error, as it decreases with the number of layers and the number of neurons inside each layer of the MLP. Therefore, in the case of a Hamming weight leakage model with additive Gaussian noise, any more sophisticated MLP (i.e. with more layers or more neurons per layer) will also have a negligible approximation error.
Secondly, the PI plotted in Figure 1 (right) shows that the presence of uninformative components in Experiment 2 does not annihilate the capacity of the MLP to optimally extract information about the target variable, provided that these components are not shuffled with informative ones. This shows that the optimization error, which was expected to increase compared to Experiment 1, remains stable.
Finally, the preceding observations hold when considering masking (Figure 1, left) or shuffling (Figure 1, bottom). This can be interpreted as the fact that the MLP trained through the NLL loss minimization is able to give a model optimally extracting the remaining informative leakage, while being "agnostic" concerning the presence or not of such counter-measures. Nevertheless, since both counter-measures have been shown to decrease the MI (exponentially with the level of noise for masking [PR13,DFS15], or linearly for shuffling [VMKS12]), they remain sound against Deep Learning.
At this stage, we have argued thanks to our simulations that the approximation error is negligible, whatever the considered counter-measure or MLP architecture, while the optimization error is likely to remain negligible as well. Therefore, our MI estimation obtained by PI maximization seems accurate. This provides an empirical validation of Proposition 4. As another consequence, we are fairly confident that in the case of such simple leakage models, which often occur in real use cases, replacing an optimal architecture by another should not degrade too much the MI estimation. These observations must be challenged by tests on experimental traces, where one cannot have an exhaustive dataset. This will naturally lead to discussions regarding the estimation error, which has not been investigated here.

Application on Experimental Data
So far, we have seen that deep neural networks could reach the informational security bounds of a leakage in simulated experiments, thereby giving useful estimations for the developer. This success did not rely on any prior knowledge on the leakage, but was achieved thanks to a simple MLP with one hidden layer. To confirm these observations, we propose to complete the investigations by considering experimental leakage traces. Subsection 6.1 presents the acquisition of the dataset used for the experiments, Subsection 6.2 presents the methodology of our experiments, and Subsection 6.3 discusses their results. Besides, details on the DNN architectures used for these experiments can be found in Appendix D.

Presentation of the Dataset
The leakage traces represent the power consumption of an XMEGA128D4 chip mounted on a Chip Whisperer Lite board [OC14]. The program run on the chip aims at simulating the leakage of several shares that may be processed by a protected implementation of a cryptographic primitive. The firmware is directly written in assembly code and consists in loading each byte of an input plaintext array to a register, setting it to zero and then storing it back to the input array. Some details of the code are given in Appendix B. 500,000 traces of 2,500 time samples each have been acquired, along with the corresponding bytes array denoted by plain[i], i ∈ ⟦0, 15⟧. The complete acquisition has been done within 15 hours.
To reproduce conditions similar to the simulations, we only target the n = 4 most significant bits of the target variable. In other words, |Z| = 2^n = 16. Finally, we verified that the traces did not contain unexpected leakages that might help targeting masked variables (see Figure 4 and the corresponding discussion in Appendix C).

Methodology
Common settings. The trainings have been done with a variant of the SGD algorithm called Adam [KB15] through a number of epochs denoted by T, i.e. each trace has been processed T times by the Adam algorithm. Out of the 500,000 acquired traces, a portion α is used for the training, and the remaining ones are used as a hold-out set for computing an unbiased estimate of the perceived information. In other words, the profiling set is made of N_p = α × 500,000 traces while the hold-out set is made of N_v = (1 − α) × 500,000 traces. We fix the limit α ≤ 4/5 so that the quality of the estimation over the hold-out set remains satisfying: the error margin will be at most 10^−2 with a confidence of at least 90% in the worst case, according to Chebychev's inequality (see Appendix A).
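The worst-case figures above can be checked with a short computation. The sketch below assumes the variance bound of 1 stated in Lemma 1 (Appendix A) and the standard form of Chebychev's inequality for an empirical mean; the function name is hypothetical.

```python
import math


def chebychev_margin(n_holdout, delta, variance_bound=1.0):
    """Error margin holding with probability at least 1 - delta for an empirical
    mean of n_holdout i.i.d. terms whose variance is at most variance_bound."""
    return math.sqrt(variance_bound / (n_holdout * delta))


# Worst case of the text: alpha = 4/5, hence 100,000 hold-out traces.
n_v = round((1 - 4 / 5) * 500_000)
margin = chebychev_margin(n_v, delta=0.10)  # confidence of 90%
```

Here `margin` evaluates to 10^−2 (up to floating-point rounding). Any smaller α yields a larger hold-out set and therefore a smaller margin.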

Experiment 4 on masking
When considering masking, the generated target values are Z = ⊕_{i ∈ ⟦0, d⟧} plain[i] for d ∈ {0, 1, 2}, where ⊕ denotes the xor operation between two bytes. This way, leakages of order d can be simulated.
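For illustration, the labels of Experiment 4 could be computed as in the following sketch. The helper name is hypothetical, and the reduction to the n = 4 most significant bits is applied after the xor, which is equivalent to applying it before since the xor is bit-wise.

```python
from functools import reduce
from operator import xor


def masked_target(plain, d, n_bits=4):
    """Xor of the first d + 1 plaintext bytes, reduced to its n_bits most significant bits."""
    combined = reduce(xor, plain[: d + 1])
    return combined >> (8 - n_bits)


plain = [0xA5, 0x3C, 0xFF] + [0x00] * 13  # hypothetical 16-byte plaintext array
targets = [masked_target(plain, d) for d in (0, 1, 2)]  # [10, 9, 6]
```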
Provided with these target values, we selected Points of Interest (PoIs) based on the magnitude of the Signal-to-Noise Ratio [MOP07]: between 4 and 6 PoIs are selected in decreasing order of magnitude of SNR from each of the three first bytes of the plaintext array. 15 The time coordinates 13 to 16, 25 to 30 and 37 to 41 respectively correspond to the PoIs of the manipulation of these bytes. This gives an input dimension of D = 15. This way, we hoped to reduce the quantity of irrelevant components, which would have made the optimization with SGD harder, and therefore hoped to get a good estimate that corresponds as closely as possible to the approximation error (20). Details of the trained MLP can be found in Appendix D. We set T = 200 and let α vary so that N_p ∈ ⟦1,000; 400,000⟧. This way, we will be able to plot the so-called learning curve, namely the values of PI(Z; X; θ_SGD) and PI_Np(Z; X; θ_SGD) depending on N_p. This is a classical representation in machine learning that makes it possible to discuss the estimation error (19) according to the size of the profiling set. 16

Experiment 5 on shuffling
When considering shuffling, the generated target values are Z = plain[i] where i is randomly drawn from a subset of ⟦0, 15⟧ of size c, c denoting the number of shuffled bytes.
Contrary to the experiments on masking, we did not select PoIs but only restricted the target window to the D = 250 first time samples of the traces, which was sufficient to cover the leakages of every shuffled plaintext byte (see Appendix C). Afterwards, a CNN with a VGG-like architecture has been used for those trainings. Details of the trained CNN can be found in Appendix D.
We set α = 4/5, T = 100, and c ∈ {1, 2, 4, 16}. The aim of this experiment is to empirically verify the trend observed in Experiment 3 (Figure 1, bottom), namely a linear decrease of the PI with the number of shuffled bytes.

Results and Discussions
Figure 2 (left) presents the learning curves of Experiment 4, when targeting respectively 1, 2 or 3 shares among the considered ones. The dotted curves are the estimated PI over the N_p profiling traces whereas the plain curves denote the PI estimated with the N_v validation traces.
It may first be observed that the amount of information leaking on the sensitive un-split variable seems to decrease at an exponential rate in the number of shares, as expected from both theory [PR13,DFS15] and our simulations (see Section 5). More interestingly, the gap between dotted curves and their corresponding plain ones exactly corresponds to the estimation error term (19). It appears then that the latter becomes negligible relatively to the PI when the profiling set size exceeds respectively a few thousand traces when targeting one share, or one hundred thousand when targeting two shares. When targeting three shares, the estimation error is not completely negligible, even with 400,000 profiling traces. It is furthermore particularly noticeable that when profiling the three-share masking scheme with less than 100,000 traces, the learning phase completely failed since the PI was null. This indicates that, in addition to the effect on the MI predicted by theoretical works [PR13,DFS15], the masking counter-measure also has an effect on the PI through an increasing estimation error, making the MI estimation poorer.
15 See Figure 4 in Appendix C.
16 On a learning curve, it is expected that the empirical PI decreases with N_p while the true PI increases, and both converge towards the supremum of the PI [Vap95].
Figure 2 (right) presents the results of Experiment 5 on shuffling. It is recalled that contrary to Experiment 4 where PoIs were extracted, here 250-dimensional traces have been processed through a CNN. The gap in Figure 1 (top right) between each curve remains observable when considering experimental traces. However, the PI obtained when the attack target is shuffled among 16 random values seems to decrease from the 20-th epoch onwards, while the empirical PI (in dotted curves) keeps increasing. This is a sign of over-fitting, denoting a high estimation error, probably due to the high dimensionality of the traces. Therefore, the PI reached in the graph is not necessarily optimal: more profiling traces might be required to improve the CNN training.
Altogether, our experiments show that similarly to the approximation and optimization errors discussed in Section 5, the estimation error is also negligible relatively to the MI, when considering unprotected scenarios where the profiling set size is reasonably high (i.e. 10,000 traces or above). This therefore leads to a tight estimation of the MI through the maximization of the PI (i.e. the minimization of the NLL loss). When considering protected devices, the investigated counter-measures impact the estimation error, and thereby the tightness of the lower bound computed through PI maximization. Nevertheless, this can be controlled by increasing the size of the profiling set. More precisely, the harder the counter-measure (i.e. the higher the masking order, or the more shuffled bytes), the larger the required profiling set.
Another way to decrease the estimation error would be to decrease the capacity of the hypothesis class, i.e. its VC-dimension, by decreasing the number of layers or the number of neurons on each layer. Since we have argued in Subsection 5.2 that the approximation error was negligible even for a simple architecture, we are quite confident that this would not strongly affect the quality of the MI estimation.

Application on Public Datasets
So far, we have considered our experimental investigations through the view of the Leakage Assessment Problem (i.e. Problem 2). However, we recall that the final task given to an evaluator is the SCA Optimization (i.e. Problem 1), namely to find N_a. It is recalled that Corollary 1 argued that by solving the Leakage Assessment Problem, one could get an accurate estimation N_a,θ of the quantity f(β)/MI(Z; X), known to be a lower bound of the optimal solution of the SCA Optimization Problem, namely N_a.
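The following sketch shows how such a prediction could be computed from a PI expressed in bits. The definition of f is not restated in this section: the Fano-type expression used below, f(β) = n − (1 − β) log2(2^n − 1) − h2(β) for a uniform n-bit target, with h2 the binary entropy function, is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def f_beta(beta, n_bits):
    """Assumed Fano-type numerator for a uniform n_bits-bit target, with 0 < beta < 1."""
    h2 = -beta * math.log2(beta) - (1 - beta) * math.log2(1 - beta)
    return n_bits - (1 - beta) * math.log2(2 ** n_bits - 1) - h2


def predicted_queries(pi_bits, beta=0.9, n_bits=8):
    """Predicted number of queries, the PI being used in place of the MI."""
    return f_beta(beta, n_bits) / pi_bits
```

Under this assumption, f(0.9) is about 6.73 bits for n = 8, so that a model with a PI of 2.95 bits leads to a prediction of about 2.3 queries.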
One could wonder whether this inequality still holds for any model, possibly sub-optimal, i.e. when estimating the minimal number of queries N_a(θ) to the target device for such a model with the quantity N_a,θ. A formal proof would be a promising further work, though beyond the scope of this paper. Nevertheless, we propose here to empirically verify this hypothesis by training a CNN on three public datasets and by implementing the key enumeration in order to evaluate the smallest N_a such that SR(N_a) ≥ β, as defined in the SCA Optimization Problem. In the following, we will restrict to β = 0.9.
To this end, we considered three public datasets. The first one is the Random Delay Counter-Measure Dataset (AES-RD) released by Coron and Kizhvatov [CK09]. The target smart-card is an 8-bit Atmel AVR micro-controller, protected by a Random Delay counter-measure, which misaligns the traces, making some attacks like Gaussian Templates much harder but possibly having no effect on deep learning based attacks [CDP17]. The targeted variable is the output of the first S-Box. 50,000 traces of D = 3,500 time samples each are given in this dataset. We use N_p = 40,000 traces for profiling and the remaining N_v = 10,000 for validation.
The second one is the ASCAD dataset [PSB+18]. The target platform is an 8-bit ATMEGA8515 running a masked AES-128 implementation and measurements are made using electromagnetic emanation. The targeted variable is the output of the third S-Box. The dataset provides 60,000 traces of D = 700 time samples each, where N_p = 50,000 traces are used for profiling and N_v = 10,000 for validation.
The third one is the AES-HD dataset, gathering traces measured on an unprotected AES-128 on FPGA. The dataset contains 100,000 traces of 1,250 time samples each. 80,000 traces have been used for profiling and the remaining 20,000 have been used for the key recovery phase.
For each training, a VGG-like CNN architecture has been used. Specific details about the parameters used can be found in Appendix D. The trainings have been run for 200 epochs on the AES-RD, 50 epochs on the AES-HD, and stopped after 30 epochs on the ASCAD since the model started over-fitting. After each epoch, an estimation of N_a(θ_SGD) (i.e. for the current model given by the Adam optimizer) is computed thanks to a key enumeration, according to the procedure detailed in Appendix E.
The results are given in Figure 3. On each graph, N_a,θ is denoted in green, whereas the key enumeration estimation N_a(θ_SGD) is denoted by the orange curve. On Figure 3 (left), we can first remark that the first epochs of the profiling of the AES-RD dataset show a chaotic behavior. This is explained by the fact that the NLL loss is initially close to n = 8 bits, or in other words, the PI is close to zero, leading to unstable estimations of N_a(θ_SGD). Once the model has started extracting some information, i.e. after approximately 20 epochs, the PI starts to be higher than 0 and the instability vanishes. We can then observe that N_a,θ is always lower than the key enumeration estimation, while remaining tight through the epochs: the average relative error, computed from the 20-th epoch onwards, is 0.16. The final model is able to recover the secret key in 3 traces, and has a PI of 2.95 bits.
Likewise, for the ASCAD, the results are presented in Figure 3 (center). We can observe the same instability at the beginning of the training, though the quantity N_a,θ remains lower than the key enumeration estimation afterwards, while staying quite tight. The average relative error is here 0.16, and the final PI is 0.065.
Finally, for the AES-HD, the results are presented in Figure 3 (right). Similarly to the two other experiments, a tight estimation is obtained, since the relative error is 0.18, while the final PI is 0.020.
As a consequence, those three experiments confirm that the quantities N_a,θ and N_a(θ) are effectively related, at least for β = 0.9. This is of great interest in the evaluation of the security of a device: it not only empirically shows the relevance of minimizing the NLL loss, but also provides a relevant tool to predict the number of queries required for a successful key recovery, or at least to give a lower bound on such a number, which is still useful since we look for a worst-case scenario in an SCA evaluation.

Conclusion
In this paper, we have given some theoretical and experimental reasons why the deep learning paradigm is suitable for evaluating implementations against SCA from a worst-case scenario point of view, regardless of the nature of the counter-measures.
Contrary to what was commonly believed until the works of Picek et al. [PHJ+19], the supervised classification approach is not theoretically grounded generally speaking. Yet, deep learning based attacks still worked. The reason is that in the specific case where the NLL is used as a surrogate loss function, it turns out that the latter is actually consistent with maximizing the PI, solving the so-called Leakage Assessment Problem. Since the latter problem was argued to be sound with the SCA Optimization Problem, we conclude that the choice of the NLL as a surrogate loss function is sound from an evaluation point of view, in the sense that it makes it possible to accurately estimate a lower bound of the minimal number of queries required by an attacker provided with an optimal leakage model in order to successfully recover the secret key.
Simulations and experiments verified that the PI maximization via NLL minimization was an efficient method in order to estimate the MI in several configurations, i.e. on different architectures and with different types of counter-measures, including higher order masking, shuffling or de-synchronization through random delays.
This leads to the takeaway message of this paper: the minimization of the NLL loss via a neural network model gives relevant estimations of the mutual information between a sensitive variable and the corresponding side-channel traces, thereby quantitatively measuring the impact of counter-measures (and their implementations) so that an evaluator can precisely assess whether they stay sound or not.

A Confidence Interval with a Hold-Out Set
The bounds on the estimation error discussed in Subsection 4.3 might be too high in practical uses with Deep Neural Nets. This is why the estimation of the Cross Entropy is usually done otherwise. In cases where the evaluator has lots of data, he can take a so-called hold-out set, distinct from the profiling set S_p used for the minimization of the NLL loss. These fresh data will therefore give a more reliable estimation of the Cross Entropy.
Lemma 1 (Chebychev's inequality [SSBD14]). Let H be a parametrized hypothesis class and Θ its corresponding parameter space. Let S_v = {(x_1, z_1), . . . , (x_Nv, z_Nv)} be a hold-out set of N_v i.i.d. leakages and the corresponding values of the sensitive target variable. Assume that V(− log2 F(X, θ)[Z]) ≤ 1. Then for all θ ∈ Θ, it holds with probability at least 1 − δ that the Cross Entropy of F(·, θ) and its empirical estimate over S_v differ by at most √(1/(δ N_v)). (22)
Equation 22 will be used in Section 5 and Section 6 to provide a conservative confidence interval of the Cross Entropy and thereby of the information between a sensitive target variable and a leakage perceived by a Neural Net (PI).

B Source Code for the Acquisitions
Algorithm 1 loadData
LD r0, X     ; loads the first byte in r0
CLR r0       ; clears the register
ST X, r0     ; stores 0 in the plaintext array
LD r0, X     ; do it again to clear the bus
CLR r0
ST X, r0
LD r0, X     ; one more time to be sure
CLR r0
ST X+, r0    ; last store, X is post-incremented to the next byte

C The Experimental Traces
To verify that there is no leakage involving a combination of different bytes, we have also computed an SNR of order 2. That is to say that for each combination of 2 among the 16 bytes, the xor has been computed and used as a target variable in order to compute the SNR. The absence of peaks confirms that there is no undesirable leakage. An example of a trace and the SNRs of order 1 and 2 can be found in Figure 4.
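This verification can be sketched as follows, assuming the usual definition of the SNR as the variance of the class-wise means divided by the mean of the class-wise variances, computed independently for each time sample. The function names and the toy data are hypothetical; the toy traces leak the Hamming weight of each byte separately, so no second-order peak is expected.

```python
import random
from collections import defaultdict
from itertools import combinations
from statistics import fmean, pvariance


def snr(traces, labels):
    """Per-sample SNR: variance of the class means over the mean of the class variances."""
    groups = defaultdict(list)
    for trace, z in zip(traces, labels):
        groups[z].append(trace)
    result = []
    for t in range(len(traces[0])):
        columns = [[trace[t] for trace in group] for group in groups.values()]
        means = [fmean(col) for col in columns]
        variances = [pvariance(col) for col in columns]
        result.append(pvariance(means) / fmean(variances))
    return result


def second_order_snr(traces, plaintexts):
    """SNR obtained when targeting the xor of every pair of plaintext bytes."""
    n_bytes = len(plaintexts[0])
    return {
        (i, j): snr(traces, [p[i] ^ p[j] for p in plaintexts])
        for i, j in combinations(range(n_bytes), 2)
    }


def hamming_weight(value):
    return bin(value).count("1")


# Toy data: two 2-bit "bytes" leaking at samples 0 and 1, pure noise at sample 2.
rng = random.Random(0)
plaintexts = [[rng.randrange(4), rng.randrange(4)] for _ in range(2000)]
traces = [
    [hamming_weight(p[0]) + rng.gauss(0, 0.5),
     hamming_weight(p[1]) + rng.gauss(0, 0.5),
     rng.gauss(0, 0.5)]
    for p in plaintexts
]
snr_byte0 = snr(traces, [p[0] for p in plaintexts])
snr_pairs = second_order_snr(traces, plaintexts)
```

On such data, the first-order SNR of byte 0 shows a peak at sample 0 only, while the second-order SNR stays close to zero at every sample.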

D The hypothesis class H
The hypothesis class H that will be used for the experiment has been defined as the set of MLPs with one hidden layer and r = 500 hidden neurons. In other words, there are two linear layers: the hidden one, denoted by λ_θ1, and the output one, denoted by λ_θ2. Symbols θ_1 and θ_2 denote the associated real parameters of the hidden layer and the output layer respectively. Between these linear layers, an activation layer called ReLU and denoted by σ is added. This is a non-linear real-valued function that is responsible for the high capacity of MLPs to approximate any pmf [GBC16]. In addition, two batch normalization layers [IS15] have been applied, one at the input of each linear layer. Batch Normalization (BN) layers simply normalize the input features to a mean and a deviation that are automatically set by the SGD algorithm. BN has been shown to make the loss function smoother, making the optimization easier and faster [STIM18]. Finally, a dropout layer δ_p [SHK+14, GBC16] has been added on the input of the softmax classifier. Dropout is known to prevent DNNs from overfitting (i.e. to prevent the estimation error from exploding), which is useful when one lacks data. The dropout parameter has been set to p = 0.1, i.e. each neuron of the hidden layer is randomly set to 0 with probability p each time an output F(x, θ) is computed during the optimization.
All together, we can sum up the architecture of our MLP as follows:
For the experiment on the ASCAD dataset, we have considered the same architecture as proposed in [CDP17, PSB+18], with the same notations: where γ denotes a convolutional layer, σ denotes an activation function, i.e. a non-linear function applied elementwise, µ denotes a batch-normalization layer, δ denotes an average pooling layer, λ denotes a dense layer and s denotes the softmax layer. Furthermore, n_1 denotes the number of dense blocks, namely the composition [λ ∘ σ]. Likewise, n_3 denotes the number of convolutional blocks, namely [δ ∘ σ ∘ µ ∘ γ]. A global pooling layer δ_G has been added at the top of the last block. Its pooling size equals the width of the feature maps in the last convolutional layer, so that each feature map is reduced to one point. More specifically, the following parameters have been used: n_1 = 2, n_3 = 7, and the convolutional filters are of length 11. There are 10 filters in the first layer, and their number is doubled at each convolutional layer. The dense layers contain 1,000 intermediate neurons. The same architecture has been used in the experiments on the AES-HD dataset.
For experiments on the AES-RD dataset, the same VGG-like architecture as the one presented in [KPH+19] has been used. More specifically, n_3 is set to 9 so that there are enough pooling layers to get feature maps on the last convolutional layer whose width equals one. Besides, n_1 = 0, i.e. there is no intermediate dense layer, except softmax.

E Success Rate Estimation
In practice, to compute SR(N_a), sampling many attack sets may be prohibitive in an evaluation context, especially if we need to reproduce the estimations for many values of N_a until we find the smallest value such that SR(N_a) ≥ β. One solution to circumvent this problem is, given a validation set S_v of N_v traces, to sample some attack sets by permuting the order of the traces in the validation set (e.g. 500 times in our experiments). d_Sa can then be computed with a cumulative sum to get a score for each N_a ∈ ⟦1, N_v⟧, and so is g_Sa(k). For each value of N_a, the success rate is estimated by the occurrence frequency of the event g_Sa(k) = 1. 21
21 While this trick gives good estimations for N_a ≪ N_v, one has to keep in mind that the estimates become biased when N_a → N_v. Fortunately, in our experiments, the validation set size remains much higher than N_a.
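The permutation trick can be sketched as follows. The sketch assumes that the per-trace scores are log-likelihoods of each key hypothesis, that the scores of an attack set are obtained by summing them, and that the attack succeeds when the correct key is ranked first; the function names and the toy scores are hypothetical.

```python
import random


def success_rate_curve(log_probs, true_key, n_permutations=500, seed=0):
    """Estimate SR(N_a) for N_a = 1..N_v; log_probs[i][k] is the score of key k for trace i."""
    rng = random.Random(seed)
    n_v, n_keys = len(log_probs), len(log_probs[0])
    successes = [0] * n_v
    order = list(range(n_v))
    for _ in range(n_permutations):
        rng.shuffle(order)  # one re-sampled attack set
        scores = [0.0] * n_keys
        for n_a, i in enumerate(order):
            for k in range(n_keys):
                scores[k] += log_probs[i][k]  # cumulative sum over the first n_a + 1 traces
            best = max(range(n_keys), key=scores.__getitem__)
            successes[n_a] += best == true_key
    return [s / n_permutations for s in successes]


def min_queries(sr_curve, beta=0.9):
    """Smallest N_a such that the estimated SR(N_a) >= beta, or None if never reached."""
    for n_a, sr in enumerate(sr_curve, start=1):
        if sr >= beta:
            return n_a
    return None


# Toy scores: 10 traces, 4 key hypotheses, key 2 always slightly favoured.
toy_scores = [[-2.0, -2.0, -1.0, -2.0] for _ in range(10)]
sr_curve = success_rate_curve(toy_scores, true_key=2, n_permutations=20)
n_a_min = min_queries(sr_curve)
```

On these toy scores the correct key is ranked first from the very first trace, so the estimated success rate equals 1 everywhere and `n_a_min` is 1.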