Information Bounds and Convergence Rates for Side-Channel Security Evaluators

Abstract. Current side-channel evaluation methodologies exhibit a gap between inefficient tools offering strong theoretical guarantees and efficient tools only offering heuristic (sometimes case-specific) guarantees. Profiled attacks based on the empirical leakage distribution correspond to the first category. Bronchain et al. showed at Crypto 2019 that they allow bounding the worst-case security level of an implementation, but the bounds become loose as the leakage dimensionality increases. Template attacks and machine learning models are examples of the second category. In view of the increasing popularity of such parametric tools in the literature, a natural question is whether the information they can extract can be bounded. In this paper, we first show that a metric conjectured to be useful for this purpose, the hypothetical information, does not offer such a general bound. It only does when the assumptions exploited by a parametric model match the true leakage distribution. We therefore introduce a new metric, the training information, that provides the guarantees that were conjectured for the hypothetical information for practically-relevant models. We next initiate a study of the convergence rates of profiled side-channel distinguishers which clarifies, to the best of our knowledge for the first time, the parameters that influence the complexity of a profiling. On the one hand, the latter has practical consequences for evaluators as it can guide them in choosing the appropriate modeling tool depending on the implementation (e.g., protected or not) and contexts (e.g., granting them access to the countermeasures' randomness or not). It also allows anticipating the amount of measurements needed to guarantee a sufficient model quality. On the other hand, our results connect and exhibit differences between side-channel analysis and statistical learning theory.


Introduction
Evaluating the security of a cryptographic implementation against side-channel attacks is a complex problem. Since their introduction by Kocher et al. in the late nineties [KJJ99], a broad literature has focused on analyzing physical leakage in order to perform concrete attacks efficiently and to assess physical security on theoretically sound bases.
A first step towards such sound bases is the separation between non-profiled and profiled attacks. While Kocher's seminal work and early variants like Brier et al.'s Correlation Power Analysis (CPA) exploit an a-priori leakage model [BCO04], it has been shown that profiling the target device (i.e., leveraging an open sample to estimate a leakage model) can significantly improve the attacks' efficiency. Chari et al. introduced profiled attacks, and stated that such attacks are "the strongest form of side-channel attack possible in an information theoretic sense" [CRR02]. This statement seeded a line of works on worst-case side-channel security, i.e., the security level reached when universally quantifying over the adversary. Standaert et al. observed that profiled attacks are critical to estimate the worst-case security of an implementation [SMY09]. Whitnall et al. extended this observation and proved that profiling is in general necessary for this purpose (i.e., there is no generic attack strategy enabling us to recover secret information from a physically observable device's leakage without any a priori knowledge about the device's leakage distribution) [WOS14]. Heuser et al. finally proved that a generalized version of Chari et al.'s strategy, namely distinguishing thanks to the probability distribution of the leakage conditioned on the targeted secret, is indeed optimal in an information theoretic sense [HRG14].
A second step towards sound side-channel security evaluations is the acknowledgment that even in the profiled evaluation setting, performing an optimal attack in the sense of Heuser et al. is a highly non-trivial task. The main reason is that the true leakage distribution of a device is in general unknown and can be quite complex to estimate, especially in the presence of countermeasures like masking [CJRR99]. As a result, one can summarize the evaluation problem in two questions:
1. What is the data complexity of the attack using an optimal profiled model?
2. What is the profiling data complexity to estimate this optimal model?
Here, both data complexities are defined in terms of number of measured traces.
The first question is standard in the cryptographic setting. It aims at determining the level of security that can be guaranteed against an informed adversary. Since running an attack to evaluate its complexity for highly secure cryptographic implementations can be prohibitively expensive, an increasingly standard evaluation approach consists in using information theoretic metrics for this purpose. In particular, the Mutual Information (MI) can be used to bound the data complexity of worst-case attacks [DFS15, dCGRP19, MRS22, BCG+23]. The difficulty of estimating the MI [Pan03], which we elaborate later in this paper, has led Renauld et al. to identify the Perceived Information (PI) as a metric capturing the amount of information that can be extracted from physical leakage thanks to the adversary/evaluator's (parametric) model, possibly biased by estimation or assumption errors [RSV+11]. Durvaux et al. therefore formalized leakage certification as the problem of assessing the distance between the PI and the MI [DSV14]. Bronchain et al. showed that the PI is in general (i.e., for any leakage distribution, including for masked implementations) a lower bound for the MI and that an upper bound is obtained by estimating the empirical Hypothetical Information (eHI), which is the amount of information that would be extractable from a device if the true distribution was identical to the one of a measured evaluation dataset [BHM+19]. They additionally showed that, when increasing the dataset size, the expected value of the eHI asymptotically converges towards the MI. Unfortunately, the practical impact of these results is limited since the required dataset size grows with the number of points in the leakage traces, becoming very quickly impractical. The informal workaround proposed by Bronchain et al. is to use the HI estimated with a parametric model in such cases. Informally, and while the non-empirical HI loosens the formal link with the MI, the goal is to use the parametric HI as an upper bound for the complexity of the evaluator's best attack. They conjectured that this HI is an upper bound of the PI estimated with the same model.
The second question is less standard in the cryptographic setting. It rather aims at determining whether a worst-case attack is somewhat "practical". In other words, although profiling a leakage model is a one-time effort, could it be so complex that estimating an accurate model becomes unrealistic? To the best of our knowledge, investigations in this direction have been less formal so far. Numerous profiling techniques have been introduced and evaluated based on specific case studies. These include extensions of Chari et al.'s Template Attacks (TA) [CRR02, SLP05, GLP06, APSQ06, SA08, SKS09, CK13, CK14] and a steadily increasing (and not exhaustive) list of works leveraging machine (and deep) learning [HGM+11, HZ12, LMBM13, LBM14, LPB+15, MPP16, CDP17, CCC+19, ZBHV20, WAGP20, ZBD+21]. Recently, Masure et al. showed that these profiling strategies are not disconnected: by optimizing the appropriate loss function, evaluation approaches based on machine learning and deep learning actually target the same goal as TA, namely maximizing the PI [MDP20]. However, a systematic characterization of the parameters that influence the profiling phase of a side-channel attack, which would answer the practicality question, is still missing. For example, how does the convergence of a machine learning model depend on the physical leakage characteristics (noise level, number of dimensions, security order), number of classes and number of profiling traces? And are some statistical tools better suited depending on the contexts?
Our contributions regarding these two main questions are twofold.
Regarding the first question, we falsify and fix the conjecture of Bronchain et al. Precisely, we show that the parametric HI is not always an upper bound of the parametric PI. Since our counterexample corresponds to realistic leakage distributions (namely, mixture distributions that happen with masked implementations), we then propose a new metric, the Training Information ($\mathrm{TI}_N$), that eliminates this limitation. While the HI can be viewed as a measure of a parametric model tested against itself, the $\mathrm{TI}_N$ is a measure of a parametric model tested against (the empirical distribution of) its training samples. We show that for parametric leakage models that optimize the appropriate loss function, the $\mathrm{TI}_N$ upper bounds the "learnable information" (LI) defined as the supremum of the PI over a parametric class of models, and that for $N \to \infty$, the PI and $\mathrm{TI}_N$ converge towards the LI. Like the HI, the $\mathrm{TI}_N$ does not offer guarantees against assumption errors when it is computed for parametric models: the LI may be smaller than the MI. But it offers an easy way to bound estimation errors (i.e., LI − PI) for practically relevant classes of distinguishers. Besides, it can be used for both generative and discriminative models (while the HI was limited to the first ones). This allows evaluators to gauge how much their attacks can be improved by collecting more profiling traces, and to stop their measurement campaigns when the gain becomes small. In other words, this new metric answers the question: how much information can be learned with my leakage model?
Regarding the second question, we initiate a study of the convergence rate of the $\mathrm{TI}_N$ and PI metrics for practically-relevant profiling techniques. Namely, we consider simple representatives of two widely-used profiled attack families. For the Gaussian templates, we consider the original attack of Chari et al. [CRR02], denoted in this paper as gTA, and its variant with pooled covariance matrix estimation [CK13], denoted as p-gTA. For the deep learning attacks, we analyze a Multi-Layer Perceptron (MLP) with $L$ layers and $W$ weights to fit, trained with a negative log-likelihood loss function. Although less common in side-channel attacks, we also consider the $k$-th order logistic regression, denoted as $\mathrm{LR}_k$, which is interesting since this model is similar to Gaussian templates but its training process is closer to the one of the MLP. Our results are synthesized in Table 1.
Table 1: Convergence of the PI of different profiling tools (the $\tilde{O}(\cdot)$ notation ignores log terms). Columns: Model, Fast Regime, General Bound. The "Fast regime" column assumes that, for some ideally chosen values of the parameters, the model can perfectly match the true leakage distribution. $Q$ denotes the number of profiled classes, $D$ the dimensionality of the traces, and $N$ the number of traces acquired for profiling (i.e., the sample complexity of profiling).

On the one hand, this table positively answers our question regarding the practicality of the profiling phase in a security evaluation. It shows that there are profiling tools for which the estimation error is inversely proportional to $\sqrt{N}$ ($N$ being the number of profiling traces) for any (even protected) implementation (e.g., MLP and $\mathrm{LR}_k$). It also shows that the convergence rate of the models depends on their hyperparameters but not on the physical leakage characteristics (i.e., the true leakage distribution), and consolidates the general intuition that side-channel security evaluations are a trade-off between the genericity and the efficiency of the profiling.
On the other hand, the table shows that there are statistical tools that are better suited depending on the evaluation contexts. For example, the convergence rate of $\mathrm{LR}_k$ for a security order $k$ leads the modeling error to scale in $O(D^k)$. By contrast, for a circuit of complexity $k$ (e.g., the masking of a sensitive variable that would leak $D = k$ samples corresponding to the shares), it is always possible to build an MLP whose complexity $W \cdot L$ scales as $\mathrm{poly}(D = k)$ [SB14, Thm. 20.3]. So if an evaluator has to profile higher-order leakages, leveraging MLPs leads to a more efficient profiling than trying to profile moments of the leakage distribution with $\mathrm{LR}_k$.
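To give a feel for the $O(D^k)$ scaling mentioned above, the following minimal sketch counts the features a degree-$k$ polynomial logistic regression would have to fit. It assumes that the basis contains every monomial of total degree at most $k$ in the $D$ trace samples; this choice of basis and the helper name `n_polynomial_features` are illustrative assumptions, not taken from the paper.

```python
from math import comb

def n_polynomial_features(dim, degree):
    """Number of monomials of total degree <= degree in dim variables.

    Assumption: the polynomial basis is the full monomial basis
    (constant term included), so the count is C(dim + degree, degree).
    """
    return comb(dim + degree, degree)

# For a fixed degree k, the count grows like D^k / k! as D increases.
for k in (1, 2, 3, 4):
    counts = [n_polynomial_features(d, k) for d in (10, 100, 1000)]
    print(f"degree {k}: {counts}")
```

Under this assumption, raising the degree to match a higher security order quickly inflates the number of parameters to estimate, which is consistent with the qualitative trade-off described in the text.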
As discussed in Section 7, we hope these theoretical results can help evaluators operating within a limited time frame towards finding the best trade-off in their model selection, by anticipating and optimizing the models' profiling complexity.

Related Works
The use of information theoretic metrics to guide/compare profiled attacks dates back to [SKS09]. In a work from COSADE 2021 [PBP21], Picek et al. show that this intuition does not only hold for the number of profiling traces but also for the number of epochs used in the training phase of a machine learning model. Ito et al. show that the direct optimization of security metrics such as the Success Rate (SR) or Guessing Entropy (GE) [SMY09] can slightly improve an optimization guided by information theoretic metrics in some contexts, at the cost of some computational overheads [IUH22]. It follows previous observations that security metrics and information theoretic metrics can sometimes lead to comparatively different outcomes (e.g., for low noise levels or small number of attack traces) [SPAQ06, PHJ+19]. Yet, since information theoretic metrics are inversely proportional to the asymptotic complexity of a side-channel attack phase, the concrete impact of such an observation is also limited. For example, the experiments performed in [IUH22] show some gains for attacks that succeed in 400 traces, but these gains already vanish for attacks succeeding in more than 1,000 traces. So while such results are interesting to push the optimization of concrete attacks in specific contexts, they do not contradict the general relevance of information theoretic metrics for side-channel security evaluations. Finally, Cristiani et al. investigate the so-called Neural-based MI estimation (MINE) [CLM20]. They leverage the variational formulation of the MI, which allows training an MLP to maximize a lower bound of the MI, similarly to the PI [CT12, Eq. (8.93)]. This research follows the observation of Mather et al. [MOBW13] that an evaluator may estimate the complexity of her best attack without having to mount it. Analyzing whether this complementary approach could be used to upper bound the information leakage like the $\mathrm{TI}_N$ and assessing its convergence rate are interesting scopes for further investigation.

Background
Notations. In the following, we denote random variables (resp., random vectors) by upper-case (resp., bold upper-case) letters $X$ (resp., $\mathbf{X}$). We denote by the same calligraphic letter $\mathcal{X}$ the observation domain of the corresponding random variable (resp., random vector). We denote observations of a random variable (resp., random vector) by the corresponding lower-case roman letter $x$ (resp., $\mathbf{x}$). If a random variable $X$ is discrete, we denote by $\Pr(X = x)$ its probability mass function (pmf), for which we will use the shortcut notation $p(x)$. We note $\mathcal{P}(\mathcal{V})$ the set of probability distributions over a random variable of domain $\mathcal{V}$. If $p$ and $m$ denote two distributions over the same support, the Kullback-Leibler (KL) divergence is denoted by $D_{\mathrm{KL}}(p \,\|\, m) = \mathbb{E}_{X \sim p}\left[\log \frac{p(X)}{m(X)}\right]$. We use the notation $O(f(n))$ to hide constant factors in $n$, and the notation $\tilde{O}(f(n))$ to additionally hide log factors in $n$. For a square matrix $A$, we denote by $\|A\|_*$ its spectral norm (i.e., the greatest of its eigenvalues in absolute value) and by $\|A\|_F$ its Frobenius norm.

Information Theoretic Metrics
Let $Y$ be a discrete uniform random variable over a domain $\mathcal{Y}$, denoting the sensitive intermediate computation targeted by the attacker/evaluator, and $\mathbf{L}$ be a discrete random vector over a domain $\mathcal{L}$, denoting the corresponding physical measurement of the leakage of $Y$. During its attack, the adversary/evaluator, who knows the distribution of $Y$, acquires a profiling set $S_N$ made of $N$ observations $(y, \mathbf{l})$ of the joint probability distribution of $(Y, \mathbf{L})$. We consider the problem of estimating a discriminative model $m(y \mid \mathbf{l})$ for the true conditional Probability Mass Function (PMF) $\Pr(Y = y \mid \mathbf{L} = \mathbf{l})$, for which we will use the shortcut notation $p(y \mid \mathbf{l})$. In some cases, we also care about a generative model $m(\mathbf{l} \mid y)$ for the true PMF $\Pr(\mathbf{L} = \mathbf{l} \mid Y = y)$, denoted for short as $p(\mathbf{l} \mid y)$. We note that, since the distribution of $Y$ is known, a generative model naturally induces a discriminative model (using Bayes' rule). We further define a distance metric $\Delta$ between a generative model $m$ and a discriminative model $m'$ (a probability distribution $p$ may also be used in place of one (or two) of the models):

$$\Delta^{m'}_{m} = H(Y) + \sum_{y \in \mathcal{Y}} p(y) \sum_{\mathbf{l} \in \mathcal{L}} m(\mathbf{l} \mid y) \log_2 m'(y \mid \mathbf{l}), \qquad (1)$$

where $H(Y)$ is the entropy of $Y$. Thanks to this notation, we can express the Mutual Information (MI) between the random variables $Y$ and $\mathbf{L}$ as

$$\mathrm{MI}(Y; \mathbf{L}) = \Delta^{p}_{p}.$$

The MI is a relevant evaluation metric for side-channel attacks since the (measurement) complexity of a worst-case side-channel attack targeting a secret key, e.g., $y = S(x \oplus k)$ where $x$ denotes a plaintext, $k$ denotes a secret key chunk, and $S$ denotes an S-box, is inversely proportional to $\mathrm{MI}(Y; \mathbf{L})$ [DFS19, dCGRP19]. However, this metric cannot be computed directly since the true leakage distribution (i.e., $p(\mathbf{l} \mid y)$) is in general unknown. One solution is to estimate it, which is known to be a difficult problem [Pan03]. Alternatively, the amount of information that can be extracted from the leakages thanks to a model can be quantified by the Perceived Information (PI) given by

$$\mathrm{PI}(Y; \mathbf{L}; m) = \Delta^{m}_{p}.$$

The authors in [BHM+19] additionally considered the Hypothetical Information (HI):

$$\mathrm{HI}(Y; \mathbf{L}; m) = \Delta^{m}_{m},$$

and the empirical Hypothetical Information (eHI) defined as

$$\mathrm{eHI}_N(Y; \mathbf{L}) = \Delta^{\tilde{e}_{S_N}}_{\tilde{e}_{S_N}},$$

where $\tilde{e}$ denotes the operator that maps a profiling set $S_N$ to the corresponding empirical distribution, i.e., $\tilde{e}_{S_N}(y, \mathbf{l}) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{(y, \mathbf{l}) = (y_i, \mathbf{l}_i)}$. Whenever there is no ambiguity, we will replace the notation $\tilde{e}_{S_N}$ by $\tilde{e}_N$. Based on these quantities, their main result is twofold. First, the PI is always upper bounded by the MI regardless of the tested model $m$, with equality if and only if $m$ coincides with the true leakage distribution $p$. Second, the eHI may be used to bound the MI as follows:

$$\mathbb{E}_{S_N}\left[\mathrm{eHI}_N(Y; \mathbf{L})\right] \geq \mathrm{MI}(Y; \mathbf{L}). \qquad (2)$$

Note that the bound is for the expectation of the HI over the model estimations. It only holds for the empirical distribution $\tilde{e}_N$, and the authors also show that this expectation decreases monotonically with $N$ and converges towards the MI:

$$\mathbb{E}\left[\mathrm{eHI}_{N}(Y; \mathbf{L})\right] \geq \mathbb{E}\left[\mathrm{eHI}_{N+1}(Y; \mathbf{L})\right], \qquad \lim_{N \to \infty} \mathbb{E}\left[\mathrm{eHI}_N(Y; \mathbf{L})\right] = \mathrm{MI}(Y; \mathbf{L}). \qquad (3)$$

By contrast, the PI bound is true for any model.
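The four metrics above differ only in which distribution is used to weight the samples and which model is scored. The sketch below makes this concrete on a toy discrete example. It assumes the $\Delta$ operator takes the form $H(Y)$ plus the expected $\log_2$ posterior of the scored model under the weighting distribution, with $Y$ uniform; the pmfs `p_l_given_y` and `m_l_given_y`, the helper names, and the balanced sampling are all illustrative assumptions.

```python
import numpy as np

def posterior(joint):
    """Bayes' rule: turn a joint pmf (Q classes x B bins) into m(y | l)."""
    col = joint.sum(axis=0, keepdims=True)
    return np.divide(joint, col, out=np.zeros_like(joint), where=col > 0)

def delta(joint_eval, model_post):
    """H(Y) + sum_{y,l} joint_eval[y, l] * log2(model_post[y, l]), Y uniform."""
    mask = joint_eval > 0  # cells without evaluation mass contribute nothing
    n_classes = joint_eval.shape[0]
    return np.log2(n_classes) + np.sum(joint_eval[mask] * np.log2(model_post[mask]))

# Hypothetical true leakage pmf p(l | y) and a deliberately overconfident model.
p_l_given_y = np.array([[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]])
m_l_given_y = np.array([[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]])
p_joint, m_joint = p_l_given_y / 2, m_l_given_y / 2

# Empirical distribution from a small balanced profiling set (20 traces per class).
rng = np.random.default_rng(0)
counts = np.stack([
    np.bincount(rng.choice(4, size=20, p=p_l_given_y[y]), minlength=4)
    for y in range(2)
])
e_joint = counts / counts.sum()

mi = delta(p_joint, posterior(p_joint))    # true distribution scored on itself
pi_m = delta(p_joint, posterior(m_joint))  # model scored on the true distribution
hi_m = delta(m_joint, posterior(m_joint))  # model scored on itself
ehi = delta(e_joint, posterior(e_joint))   # empirical distribution scored on itself
print(f"MI={mi:.4f}  PI={pi_m:.4f}  HI={hi_m:.4f}  eHI={ehi:.4f}")
```

In this toy setting the PI cannot exceed the MI, while the HI of a mismatched model is not constrained by either quantity.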

Limitations of the HI
One important question left open by Bronchain et al. is whether the properties of the HI generalize to parametric leakage models. This question is important since, as experimentally observed in [BHM+19], assessing the security of an implementation with an empirical model (and the corresponding bounds) rapidly becomes too expensive. In this section, we consolidate this HI proposal in two directions. First, we give a counter-example contradicting that the HI is in general (i.e., for any model) an upper bound for the PI.
In our example, it appears that this conjecture only holds when the parametric model used in the bound corresponds to the true leakage function to a sufficient extent. This will lead us to introduce a new metric to fix this issue in Section 4. Second, we formalize the observation that empirical models converge too slowly to be a practical alternative in (multivariate) side-channel security evaluations. For this purpose, we reconsider the convergence of the eHI towards the MI. Bronchain et al. proved a monotone convergence of the expectation. However, in practice the profiling dataset acquisition is usually performed a single time by the evaluators. Accordingly, stronger notions of convergence (e.g., in probability) are better suited to argue about the profiling phase of a side-channel attack. We give such a stronger result in Section 3.2, while also showing that an evaluation based on the eHI suffers from very slow convergence rates. In particular, it suffers from a bias that grows exponentially with the trace dimensionality.

Inconsistency with Non-Empirical Models
In [BHM+19], the authors proposed the gHI (i.e., the HI computed for a Gaussian model) as a surrogate of the eHI enabling a faster convergence. We next show empirically that we can actually observe all three possible cases for the convergence of the PI and HI in a quite realistic context: either they both converge to the same asymptotic value, or the HI converges strictly above the PI, or the HI converges strictly below the PI.
We illustrate the three cases by measuring the gHI against true distributions that are not Gaussian. In particular, we use discretized univariate Gaussian mixture models which are relevant in the context of masked implementations. Concretely, the leakage is the sum of a Gaussian noise and the Hamming weight of the sharing $(x \oplus r, r)$ for the $n$-bit word $x$, masked with a uniformly random $n$-bit word $r$. The model, for each leakage class (i.e., $x = 0$ and $x = 1$), is a Gaussian fitted using maximum likelihood estimators. In Figure 1, we show the leakage (continuous lines) and the models (dashed lines) for two distinct values of the SNR, computed as the ratio between the variance of the Hamming weight of an $n$-bit uniformly random variable, and the variance of the Gaussian noise [Man04].
In Figure 2, we show the corresponding gPI, gHI and MI. In addition to the observation of the aforementioned three cases, we can look at the relationship between the gPI/gHI and the MI. When the true distribution is close to Gaussian (Figure 1a), both gPI and gHI converge to the MI, as conjectured. However, in the other cases, the gPI and gHI are below the MI. This is explained by the inability of the Gaussian model to accurately represent the distinctive features of the classes, and thus to exhibit good class discrimination. Visually, the greater the dissimilarity between the true leakage and the model (i.e., from left to right in Figure 1), the wider the gap between HI and MI (from left to right in Figure 2).
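The kind of comparison described above can be reproduced numerically in a simplified setting. The sketch below is not the authors' experiment: it assumes a 1-bit secret $x$ masked with a uniform bit $r$, a continuous (not discretized) leakage $\mathrm{HW}(x \oplus r) + \mathrm{HW}(r)$ plus Gaussian noise, and Gaussian models using the infinite-sample maximum likelihood parameters (the exact mean and variance of each class). Integrals are approximated on a grid, and all names are illustrative.

```python
import numpy as np

LN2 = np.log(2.0)

def log_gauss(l, mu, var):
    """Log-density of a univariate Gaussian, kept in the log domain for stability."""
    return -0.5 * (l - mu) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

def delta(log_eval, log_model, dl):
    """H(Y) + sum_y p(y) * integral eval(l|y) log2 model(y|l) dl, two uniform classes."""
    log_post = log_model - np.logaddexp(log_model[0], log_model[1])
    return 1.0 + 0.5 * np.sum(np.exp(log_eval) * log_post / LN2) * dl

def metrics(sigma, n_grid=40001):
    var = sigma ** 2
    half_width = 10.0 * np.sqrt(var + 1.0)
    grid = np.linspace(1.0 - half_width, 1.0 + half_width, n_grid)
    dl = grid[1] - grid[0]
    # True leakage: x = 0 gives HW sums in {0, 2} (a mixture), x = 1 always gives 1.
    log_p = np.stack([
        np.logaddexp(log_gauss(grid, 0.0, var), log_gauss(grid, 2.0, var)) + np.log(0.5),
        log_gauss(grid, 1.0, var),
    ])
    # Gaussian model: both classes have mean 1; class 0 has variance sigma^2 + 1.
    log_m = np.stack([log_gauss(grid, 1.0, var + 1.0), log_gauss(grid, 1.0, var)])
    return {
        "MI": delta(log_p, log_p, dl),
        "gPI": delta(log_p, log_m, dl),
        "gHI": delta(log_m, log_m, dl),
    }

for sigma in (0.3, 1.0, 3.0):
    print(sigma, {name: round(value, 4) for name, value in metrics(sigma).items()})
```

In this toy setting with low noise, the Gaussian HI ends up below the Gaussian PI, which is one way the conjectured ordering can fail when the model does not match the mixture.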

Slow Convergence of the Empirical Model
We now formalize the observation that empirical models converge too slowly for being a practical alternative in side-channel security evaluations.

Convergence of the Expectation.
We first state that the bias of eHI scales exponentially in the dimensionality of the traces $D$ and linearly in $Q/N$, with $Q$ the number of classes and $N$ the number of profiling traces.
Theorem 1. Consider an evaluator sampling $N$ traces from a $D$-dimensional leakage with an $\omega$-bit resolution, related to a sensitive intermediate computation over $Q$ classes, assumed to be uniformly distributed. Then, the eHI satisfies the following inequalities: where $B$ denotes the number of bins in the empirical distribution. In particular, here $B = 2^{\omega D}$. Moreover,
The proof of this statement is directly inspired from Paninski's work [Pan03], and is detailed in Appendix A. Note that as a consequence of Equation 5, the upper bound of Equation 4 is asymptotically tight, thereby meaning that the lower bound is asymptotically loose. Since there is no unbiased estimator of the MI [Pan03, Prop. 8], this is unavoidable (otherwise removing the right term of Equation 4 would have given an unbiased estimator of the MI). We illustrate this result with the auxiliary source code released by Bronchain et al. with the paper [BHM+19]. Figure 3 depicts the absolute difference between $\mathrm{eHI}_N$ and MI with respect to the number $N$ of profiling traces, simulated according to a "Hamming weight + Gaussian noise" leakage model, with a trace dimensionality ranging from 1 to 4. We can see that every curve has the same slope of roughly −1 with a constant offset between each other, which confirms the theoretical expectations of Theorem 1.
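The dependence of the eHI bias on the number of bins $B = 2^{\omega D}$ and on $N$ can be observed in a small simulation. The sketch below is a simplification rather than the experiment of Figure 3: it assumes a pure-noise leakage that is uniform over the bins and independent of $Y$, so that the true MI is 0 and the average eHI is itself the bias. The function names, the resolution $\omega = 2$ and the balanced sampling are illustrative assumptions.

```python
import numpy as np

def ehi(counts):
    """Plug-in eHI in bits from a (Q, B) table of per-class bin counts."""
    joint = counts / counts.sum()
    col = joint.sum(axis=0, keepdims=True)
    post = np.divide(joint, col, out=np.ones_like(joint), where=col > 0)
    mask = joint > 0
    return np.log2(counts.shape[0]) + np.sum(joint[mask] * np.log2(post[mask]))

def mean_ehi_pure_noise(n_traces, omega, dim, n_classes=2, runs=50, seed=0):
    """Average eHI when the leakage carries no information (true MI = 0)."""
    rng = np.random.default_rng(seed)
    n_bins = 2 ** (omega * dim)  # B = 2^(omega * D)
    per_class = n_traces // n_classes
    values = []
    for _ in range(runs):
        counts = np.stack([
            np.bincount(rng.integers(n_bins, size=per_class), minlength=n_bins)
            for _ in range(n_classes)
        ]).astype(float)
        values.append(ehi(counts))
    return float(np.mean(values))

for dim in (1, 2, 3):
    for n_traces in (400, 4000):
        bias = mean_ehi_pure_noise(n_traces, omega=2, dim=dim)
        print(f"D={dim} B={2 ** (2 * dim)} N={n_traces}: mean eHI = {bias:.5f} bits")
```

Even though nothing leaks in this setting, the average eHI stays strictly positive, grows when a dimension is added and shrinks when more traces are collected.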

Convergence in Probability.
So far we provided a speed of convergence of the expectation of the eHI towards the MI. As already mentioned, such a result is not directly representative of an evaluation context where the profiling phase is (ideally) performed once. For example, the results shown in Figure 3 depict the convergence of eHI for one simulation, whereas Theorem 1 only ensures that the shapes of the curves observed in Figure 3 are the ones that are expected on average, i.e., over several simulations. It might however be possible that by (lack of) chance, one could observe different results for one particular eHI computation. We next eliminate this limitation by discussing/proving a stronger notion of convergence, namely the convergence in probability. Incidentally, Bronchain et al. already proved the convergence in probability, in the proof of [BHM+19, Lemma 2, p. 10], although not claimed as a theoretical result in their paper. In this section, we additionally provide upper bounds on the rate of convergence in probability. We state hereafter that the deviation between the eHI and its expected value converges towards 0 at a speed $O\left(\log(N)/\sqrt{N}\right)$.
Theorem 2. For all $\delta > 0$, the inequality holds with probability at least $1 - \delta$, and furthermore
The proof of Theorem 2 is provided in Appendix A and is also directly inspired by Paninski's work [Pan03]. Interestingly, the convergence rate of Equation 6 does not depend on $D$, while the bias increases exponentially with $D$. When the number of dimensions is large, the bias will therefore dominate for practical $N$, despite the faster convergence rate of the bias with respect to $N$. In that case, the eHI is thus an upper bound of the MI with high probability, although so loose that it is of little interest. Overall we conclude that the eHI converges too slowly for many practical use-cases, which calls for a better solution (which is not provided by the non-empirical HI, as discussed in Section 3.1).

Introducing the Training Information
The previous section showed the limitations of the HI metric, both in terms of its ability to bound the information that can be extracted with parametric models and in terms of the convergence rate that its instantiation with the empirical distribution leads to. In this section, we introduce a new metric to circumvent these limitations, which we call the Training Information ($\mathrm{TI}_N$). Like the eHI, it upper-bounds the PI while also having much better quantitative convergence properties. To explain the intuition behind the $\mathrm{TI}_N$, we recall that the eHI is the quantity $\Delta^{\tilde{e}_N}_{\tilde{e}_N}$, where $\Delta$ is the operator defined in Equation 1, whereas the HI, in its general form (i.e., defined for an arbitrary model $m$), is given by $\Delta^{m}_{m}$, and the PI is given by $\Delta^{m}_{p}$, where $p$ denotes the true (unknown) leakage distribution. The main goal of the $\mathrm{TI}_N$ is to base the metric on a parametric model (enabling faster convergence), while keeping an upper bound for the PI. For this purpose, observe that the eHI upper-bounds the MI by overfitting: it builds an ideal discriminative model $\tilde{e}_N$ (in the superscript) based on some samples, then evaluates it on the same samples (in the subscript). We define the $\mathrm{TI}_N$ as $\Delta^{m}_{\tilde{e}_N}$, where $m$ is trained on the same sample set as the one used to compute $\tilde{e}_N$. Since the $\mathrm{TI}_N$ is based on a model instead of the empirical distribution, it carries the possible biases induced by the choice of possible models (e.g., Gaussian distributions). Hence it cannot upper-bound the MI in general (e.g., if the true distribution is not Gaussian). However, we can still relate the $\mathrm{TI}_N$ and the PI to a meaningful quantity that we name the Learnable Information (LI for short). The LI is the maximum amount of information that can be extracted from a given leakage distribution using a family of models, and the gap between the LI and the MI corresponds to the "assumption error" of the evaluator/attacker's model [DSV14]. Informally, we have the following inequalities: $\mathrm{PI} \leq \mathrm{LI} \leq \mathrm{TI}_N$. We next formalize the concepts of LI and $\mathrm{TI}_N$ in Section 4.1, then prove the above inequalities and prove that the expectation of the $\mathrm{TI}_N$ converges in Section 4.2.

Definition and Rationale
We first formalize the notion of "family of models" as follows.
Definition 1 (Hypothesis class). A hypothesis class $\mathcal{H}$ is a (possibly infinite) collection of discriminative models $m : \mathcal{L} \to \mathcal{P}(\mathcal{Y})$, where $\mathcal{L}$ denotes the input space of the random vector $\mathbf{L}$ of the side-channel trace, and $\mathcal{Y}$ denotes the finite set of all hypothetical values of the target discrete random variable $Y$.
The output of $m$ can be seen as a possible discrete probability distribution of the target random variable $Y$, while a hypothesis class can be understood as "a model where the parameters are not yet fixed" (e.g., the set of MLPs with a given structure is a hypothesis class). Using this notion of hypothesis class, we next define the LI.
Definition 2 (Learnable Information). Let $\mathcal{H}$ be a hypothesis class. The learnable information on $Y$ from leakage $\mathbf{L}$ using a model from $\mathcal{H}$ is defined as the quantity:

$$\mathrm{LI}(Y; \mathbf{L}; \mathcal{H}) = \sup_{m \in \mathcal{H}} \mathrm{PI}(Y; \mathbf{L}; m).$$

In order to introduce the training information, we need two more definitions.

Definition 3 (Learning Algorithm). A learning algorithm $\mathcal{A}$ for a hypothesis class $\mathcal{H}$ is a function $\mathcal{A} : (\mathcal{Y} \times \mathcal{L})^N \to \mathcal{H}$ taking as an input a set $S_N$ of $N$ acquisitions drawn from the (unknown) joint probability distribution of $(Y, \mathbf{L})$ and returning a model $m = \mathcal{A}(S_N)$ from the hypothesis class $\mathcal{H}$.
It is worth noticing that in a profiled attack scenario, the adversary can be defined by its underlying learning algorithm. Hence, in this paper, we denote interchangeably by $\mathcal{A}$ either an adversary, or its corresponding learning algorithm. The following definition states how we compare different learning attackers, i.e., learning algorithms.
Definition 4 (Regret). Let $\mathcal{A}$ be an attacker, i.e., a learning algorithm. The regret of $\mathcal{A}$ is the following quantity:
By definition, the regret is always non-negative, and equals 0 if and only if the learning algorithm outputs the exact leakage model, i.e., $\mathcal{A}(S_N) = p$. We can now give the formal definition of $\mathrm{TI}_N$, based on the $\Delta$ operator.
Definition 5 (Training Information). Let $S_N$ be a set of $N$ samples drawn from a distribution over $(Y, \mathbf{L})$. The training information by $\mathcal{A}$ with $N$ traces is defined as the following quantity:

$$\mathrm{TI}_N(Y; \mathbf{L}; \mathcal{A}) = \Delta^{\mathcal{A}(S_N)}_{\tilde{e}_{S_N}}.$$

Since $\mathrm{TI}_N$ is defined for any learning algorithm, regardless of its performance, there is no prior reason why $\mathrm{TI}_N$ should be an upper bound of the MI or of the PI. Nevertheless, this is possible by adding a few more assumptions, in particular assuming that the learning algorithm is a $\mathrm{TI}_N$ maximizer, as we next formalize.
Definition 6 ($\mathrm{TI}_N$ maximizer). Let $\mathcal{H}$ be a hypothesis class and let $S_N$ be the dataset of $N$ traces. The $\mathrm{TI}_N$ maximizer for the hypothesis class $\mathcal{H}$ is the learning algorithm $\mathcal{A}_{\mathcal{H}}$ such that $\mathcal{A}_{\mathcal{H}}(S_N) = m_{S_N}$, where $m_{S_N}$ is defined as

$$m_{S_N} = \operatorname*{argmax}_{m \in \mathcal{H}} \Delta^{m}_{\tilde{e}_{S_N}}.$$

For conciseness, we will replace the notation $m_{S_N}$ by $m_N$ in the remainder of this paper.
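The interplay between the LI, the $\mathrm{TI}_N$ and the PI of a $\mathrm{TI}_N$ maximizer can be checked on a toy hypothesis class for which the maximizer has a closed form. In the sketch below, the class contains every discriminative model that only sees a coarse feature $g(\mathbf{l})$ of a 4-bin leakage, so the maximizer is simply the empirical posterior given $g(\mathbf{l})$. The pmf `P_L_GIVEN_Y`, the feature `GROUP`, and the sampling parameters are illustrative assumptions, not taken from the paper.

```python
import numpy as np

P_L_GIVEN_Y = np.array([[0.4, 0.3, 0.2, 0.1],
                        [0.1, 0.2, 0.3, 0.4]])  # hypothetical true leakage pmf
GROUP = np.array([0, 0, 1, 1])                   # hypothetical coarse feature g(l)

def delta(joint_eval, model_post):
    """H(Y) + sum_{y,l} joint_eval[y, l] * log2(model_post[y, l]), Y uniform."""
    mask = joint_eval > 0
    return np.log2(joint_eval.shape[0]) + np.sum(joint_eval[mask] * np.log2(model_post[mask]))

def coarse_posterior(joint):
    """Best model of the class for `joint`: m(y | l) = joint(y | g(l))."""
    grouped = np.stack([joint[:, GROUP == g].sum(axis=1) for g in (0, 1)], axis=1)
    post_g = grouped / grouped.sum(axis=0, keepdims=True)
    return post_g[:, GROUP]  # broadcast back to one column per leakage bin

def one_run(n_traces, rng):
    per_class = n_traces // 2
    counts = np.stack([
        np.bincount(rng.choice(4, size=per_class, p=P_L_GIVEN_Y[y]), minlength=4)
        for y in range(2)
    ])
    e_joint = counts / counts.sum()
    m_n = coarse_posterior(e_joint)          # output of the TI_N maximizer
    ti = delta(e_joint, m_n)                 # model scored on its training set
    pi_val = delta(P_L_GIVEN_Y / 2, m_n)     # same model scored on the truth
    return ti, pi_val

p_joint = P_L_GIVEN_Y / 2
li = delta(p_joint, coarse_posterior(p_joint))  # best achievable PI in the class
rng = np.random.default_rng(1)
results = np.array([one_run(100, rng) for _ in range(2000)])
mean_ti, mean_pi = results.mean(axis=0)
print(f"LI={li:.4f}  mean TI_N={mean_ti:.4f}  mean PI={mean_pi:.4f}")
```

Here the PI of every trained model stays below the LI, while the $\mathrm{TI}_N$ averaged over many profiling sets sits above the average PI; a single run may deviate from the average.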

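Bound and Convergence of the $\mathrm{TI}_N$
Provided with the $\mathrm{TI}_N$ maximizer of a hypothesis class, it is possible to derive properties similar to the ones conjectured for the gHI by Bronchain et al.
The first one that we give hereafter tells that the maximum $\mathrm{TI}_N$ over a hypothesis class is an upper bound in expectation of the LI for the same hypothesis class. The second one tells that, for a $\mathrm{TI}_N$ maximizer, the expectation of the $\mathrm{TI}_N$ is monotonically decreasing. These two results imply that the expectation of the $\mathrm{TI}_N$ converges to an upper bound of the LI.
Proposition 1. Let $\mathcal{H}$ be a hypothesis class, and $N$ be a positive integer. Then

$$\mathbb{E}_{S_N}\left[\mathrm{TI}_N(Y; \mathbf{L}; \mathcal{A}_{\mathcal{H}})\right] \geq \mathrm{LI}(Y; \mathbf{L}; \mathcal{H}), \qquad (12)$$

where the expectation is taken over the profiling set $S_N$ of size $N$.
Proof. According to Definition 5 and Definition 6, for any model $m \in \mathcal{H}$, if $m_N$ denotes the maximum likelihood model for $\mathcal{H}$, it holds that

$$\Delta^{m_N}_{\tilde{e}_N} \geq \Delta^{m}_{\tilde{e}_N}.$$

Since the expectation is monotone, non-decreasing, it follows that

$$\mathbb{E}_{S_N}\left[\Delta^{m_N}_{\tilde{e}_N}\right] \geq \mathbb{E}_{S_N}\left[\Delta^{m}_{\tilde{e}_N}\right].$$

Since the $\Delta^{b}_{a}$ operator is linear with respect to $a$, and since the expectation of the empirical distribution $\tilde{e}_N$ is the true distribution $p$, it follows that

$$\mathbb{E}_{S_N}\left[\Delta^{m_N}_{\tilde{e}_N}\right] \geq \Delta^{m}_{\mathbb{E}_{S_N}[\tilde{e}_N]} = \Delta^{m}_{p} = \mathrm{PI}(Y; \mathbf{L}; m).$$

Since the latter holds regardless of the choice for $m$, we may arbitrarily take the model that maximizes the PI, which gives Equation 12.
Proposition 2. Let $\mathcal{H}$ be a hypothesis class, and $N$ be a positive integer. Then

$$\mathbb{E}_{S_{N+1}}\left[\mathrm{TI}_{N+1}(Y; \mathbf{L}; \mathcal{A}_{\mathcal{H}})\right] \leq \mathbb{E}_{S_N}\left[\mathrm{TI}_N(Y; \mathbf{L}; \mathcal{A}_{\mathcal{H}})\right], \qquad (16)$$

where the expectation is taken over the profiling set $S_N$ of size $N$.
Proof. We first remark that we can extend the definition of the $\mathrm{TI}_N$-maximizer to learn from an empirical distribution: let $e \in \mathcal{P}(\mathcal{Y}, \mathcal{L})$, we define

$$m_e = \operatorname*{argmax}_{m \in \mathcal{H}} \Delta^{m}_{e}.$$

We shall show that the function $\gamma : \tilde{e}_N \mapsto \Delta$
Proposition 1 and Proposition 2 together show that the $\mathrm{TI}_N$ satisfies the same monotone convergence of its expectation as the one satisfied by the eHI, previously shown by Bronchain et al. [BHM+19]. Moreover, Proposition 1 tells us that the asymptotic $\mathrm{TI}_N$ is an upper bound of LI. It is therefore interesting to discuss whether, like in Bronchain et al.'s works, it is possible to get stronger notions of convergence, with the hope to get faster convergence rates than the one satisfied by eHI. Section 5 will be devoted to this question.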

Convergence Rate of TI-Maximizing Distinguishers
So far, the metrics for a $\mathrm{TI}_N$-maximizer operating on a hypothesis class $\mathcal{H}$ follow
$$\mathrm{PI}(Y; L; m_N) \leq \mathrm{LI} \leq \mathrm{TI}_{N+1}(Y; L; m_{N+1}) \leq \mathrm{TI}_N(Y; L; m_N),$$
where the first inequality is unconditionally true [BHM+19], whereas the last two inequalities hold in expectation only (see Equations (12), (16)). In this section, we are interested in whether both the $\mathrm{TI}_N$ and the PI converge towards the quantity of interest, namely the LI, and if so, what convergence rate we can expect for the gaps between those metrics. At a very high level, the answer to both questions depends on the combination of three factors: the richness of the hypothesis class $\mathcal{H}$, how well it is likely to depict the true leakage model, and how smooth the metric we aim to optimize (i.e., the $\mathrm{TI}_N$ here) is. Depending on those factors, we may observe a fast convergence (i.e., at a rate $O(1/N)$), a slow convergence (i.e., at a rate $O(1/\sqrt{N})$), or no convergence at all. Which case fits our problem? This section aims at addressing this question. To this end, we first need to formally introduce in Section 5.1 the hypothesis classes that we will consider in this paper. Then, we will have the necessary material to state in Section 5.2 the convergence rates.

Definition of our Problem
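For the remainder of Section 5, we consider a hypothesis class $\mathcal{H}$ that is the family of concatenations of real-valued functions belonging to a given set $\mathcal{F}$ (that we will describe thereafter), composed with a softmax function
$$\sigma(x_1, \ldots, x_Q)_i = \frac{e^{x_i}}{\sum_{j=1}^{Q} e^{x_j}}, \quad 1 \leq i \leq Q.$$
We assume that each real-valued function $f \in \mathcal{F}$ can be fully described by a parameter vector $\theta$. In other words, each function $m \in \mathcal{H}$ can be written as
$$m_\Theta(l) = \sigma\bigl(f(l; \theta_1), \ldots, f(l; \theta_Q)\bigr),$$
where $\Theta$ is the concatenation of $\theta_1, \ldots, \theta_Q$. We denote by H the space $\Theta$ belongs to.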
Remark 1. The softmax function $\sigma$ remains invariant when the same shift is applied to all its entries. It follows that if the elementary class $\mathcal{F}$ is a group, one may fix one of the $f(l; \theta_i)$ to the constant function 1, without changing the resulting hypothesis class $\mathcal{H}$.
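The shift invariance used in Remark 1 is easy to verify numerically. The following minimal sketch uses hypothetical helper names (`softmax`, `pin_last_score`) and arbitrary example scores; it shifts all scores so that the last one equals a chosen constant and checks that the softmax output is unchanged.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    top = max(scores)  # subtracting the maximum is itself a legal shift
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def pin_last_score(scores, constant=1.0):
    """Apply one common shift so that the last score equals `constant`."""
    shift = constant - scores[-1]
    return [s + shift for s in scores]


if __name__ == "__main__":
    scores = [0.3, -1.2, 2.0, 0.7]  # arbitrary example values
    print(softmax(scores))
    print(softmax(pin_last_score(scores)))  # same probabilities
```

Pinning one score removes a redundant degree of freedom from the parameterization without changing the set of reachable probability vectors.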
This definition covers a broad family of models, such as Logistic Regression models with polynomial basis of degree $k$ ($\mathrm{LR}_k$ for short) and deep neural networks, among which we particularly focus on MLPs (without loss of generality).
In the case of an $\mathrm{LR}_k$-attacker, the elementary class $\mathcal{F}$ is the set of all polynomial transformations of degree at most $k$ over the leakage space $\mathcal{L} \subset \mathbb{R}^D$. As an example, in the case of $\mathrm{LR}_1$, the mapping is an affine form, $f(l; \theta_i) = B_i^\top \tilde{l}$, where $B_i \in \mathbb{R}^{D+1}$ and $\tilde{l} = (l, 1)$. Here, $\theta_i$ corresponds to $B_i$. In the case of $\mathrm{LR}_2$, the mapping is a polynomial of degree at most 2 in the coordinates of $l$. Finally, in the case of MLPs, the mapping is a composition of $L$ layers $\varphi_i$, each being the composition of a linear mapping, defined by the weight matrix $\Theta_i^{(j)}$, with an element-wise non-linear function (a.k.a. activation), except the $L$-th layer, which is not composed with any activation function, since this role will be played by the whole softmax function. In the rest of the paper, we assume that the total number of entries in the weight matrices equals $W$.
Whereas MLPs are now widely used for profiled side-channel analysis, LR models have not been considered so far in the literature, to the best of our knowledge.² However, LR models may be of great interest thanks to their connection to Gaussian templates. Indeed, we claim that the hypothesis class of Gaussian templates (resp., pooled Gaussian templates [CK13]) is included in $\mathrm{LR}_2$ (resp., $\mathrm{LR}_1$). This will be shown in Section 6. A similar correspondence could be investigated for the inclusion of so-called side-channel attacks of order $k$ [SM16, MS16] in $\mathrm{LR}_k$. We discuss in Section 6 the main difference between the LR and Gaussian template approaches, which is the nature of the underlying learning algorithm $A$ used to find the right model from $\mathcal{H} = \mathrm{LR}_k$ (for $k = 1, 2$).

Convergence Rates for $\mathrm{TI}_N$-Maximizers
As briefly stated in introduction of Section 5, the convergence rate of the TI N and the PI towards the LI depends on three factors, namely the richness of H, how it depicts well the true leakage distribution, and the smoothness of the metrics to optimize.When considering only the first and the last criteria, it is possible to prove the convergence in probability of the PI and the TI N to the LI, with rate O P N , where P is a constant depicting the richness of H.However, formalizing the concept of richness in this case requires some involved discussion, that the interested reader may find in Appendix B.
Instead, we propose to introduce some assumption about the second criterion, as it will allow us to derive much more intuitive, and much more efficient results.Indeed, some recent advances in statistical learning theory have seen the emergence of proofs of convergence under the so-called central condition [vEGM + 15], a rather general requirement that allows us to derive fast convergence rates.Here as well, we will not elaborate much about the exact meaning of this assumption.Instead, and for readability purpose, we provide hereafter a stronger assumption which is significantly easier to grasp.[MG22]) constant factors in the bounds.That is why we will assume in this section that the hypothesis of Lemma 1 holds true.

Fast convergence of PI towards LI
We now state the fast convergence rates for the different hypothesis classes that we consider in this section.The following corollaries 1 and 2, are proven in Appendix C. Corollary 1.Let LR k for k = 1, 2 be a TI N -maximizer attack using logistic regression for profiling.Suppose that If LR k verifies the assumption of Lemma 1, and N ≥ 5, the gap LI − PI is bounded by In other words, the regret of an LR k attacker is bounded by if we assume that every real parameter and every leakage value is bounded by a constant.
Corollary 2. Let $A$ be a $\mathrm{TI}_N$-maximizer attacker using an MLP as defined in Equation 21 with ReLU activation function for profiling. Suppose that If MLP verifies the assumption of Lemma 1, and $N \geq 5$, where In other words, the regret of an MLP attacker is bounded by $O\left(\frac{L W^{2L+3} D Q^{5/2}}{N}\right)$ if we assume that every real parameter and every leakage value is bounded by a constant.

Fast convergence of $\mathrm{TI}_N$ towards LI
So far we have shown that under the central condition (Lemma 1), in other words under the assumption that LI = MI, the regret of a $\mathrm{TI}_N$-maximizer, i.e., the gap between the MI and the PI, enjoys a fast convergence rate towards 0 with high probability. Since we have shown in Section 4 that for this learning algorithm, the $\mathrm{TI}_N$ is monotonically decreasing and converges to the LI, we may wonder what its convergence rate is. We show in Appendix B that the $\mathrm{TI}_N$ converges in probability towards the LI at a rate $O(1/\sqrt{N})$, and that a faster convergence rate cannot hold in general. To see why, let us take a counter-example in which the hypothesis class $\mathcal{H}$ contains only the true leakage model $p$, so we trivially have the equality PI = LI = MI. Yet, since $\mathcal{H}$ is a singleton, the $\mathrm{TI}_N$-maximizer is constant, so the $\mathrm{TI}_N$ can be expressed as an empirical mean. According to the well-known central limit theorem, the rate of convergence in probability cannot be faster than $O(1/\sqrt{N})$.

Nevertheless, the latter theoretical counter-example does not reflect what an evaluator can observe in practice. Indeed, the slow convergence rate comes from the variance in the $\mathrm{TI}_N$: its deviation converges slowly (as a consequence of the central limit theorem), regardless of whether the $\mathrm{TI}_N$-maximizer is good or not. On the other hand, similarly to the conclusion of Section 3, the gap between the $\mathrm{TI}_N$ and the LI is dominated by its statistical bias, which converges towards 0 at a fast rate. More precisely, Proposition 3 (Appendix C) analyzes the training gap where $h$ depends on the richness of the hypothesis class $\mathcal{H}$. Proposition 3 also bounds the deviation of the training gap: In most practical cases, similarly to Section 3, we observe that h N, hence the dominant term in the deviation is proportional to the bias.
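To make the singleton counter-example concrete, here is a small worked computation. It is a sketch under assumptions that are not taken from the paper: $Y$ is a uniform bit, the leakage equals $Y$ flipped with probability 0.2, and the $\mathrm{TI}_N$ of the singleton class is assumed to be $\mathrm{H}(Y)$ plus the empirical mean of the log-likelihoods in bits.

```latex
% Assumed toy setting: p(y | l) = 0.8 if l = y and 0.2 otherwise.
% Assumed form of the training information for the singleton class H = {p}:
\mathrm{TI}_N = \mathrm{H}(Y) + \frac{1}{N} \sum_{i=1}^{N} \log_2 p(y_i \mid l_i).

% Each summand equals \log_2 0.8 with probability 0.8 and \log_2 0.2 with
% probability 0.2, so its standard deviation is
\sigma = \sqrt{0.8 \cdot 0.2}\,\bigl(\log_2 0.8 - \log_2 0.2\bigr) = 0.4 \cdot 2 = 0.8 \text{ bits},

% and the i.i.d. average therefore satisfies
\operatorname{Std}\bigl[\mathrm{TI}_N\bigr] = \frac{\sigma}{\sqrt{N}} = \frac{0.8}{\sqrt{N}}.
```

In this setting the $\mathrm{TI}_N$ is unbiased, yet dividing its standard deviation by 10 requires 100 times more traces, which is the $O(1/\sqrt{N})$ behaviour of the deviation discussed above.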
The overall picture.To summarize, combining the results of Section 4.2, Equation 4.2, and this section, we come to the following picture for the TI N -maximizer regarding the convergence w.r.t N :

PI(Y ;
denotes an inequality that holds with high probability, and ≤ E denotes an inequality verified by the expectations of both hand-sides.

Gaussian Templates
The assumption $p \in \mathcal{H}$, which is key to obtaining the fast convergence rate of the previous section, is actually a fairly common assumption made in side-channel security evaluations. One of the most popular models is the Gaussian template, where $\mathcal{H}$ is the set of multivariate Gaussian distributions.³ The Gaussian template attack (gTA for short), however, is not a $\mathrm{TI}_N$ maximizer, since the parameters (mean and covariance) of the templates are chosen as the empirical average and covariance. This raises the question of whether we can still derive bounds similar to those of Section 5. In this section, we compute the convergence rates of gTA, first for the original and most generic template attack [CRR02], then in the particular case where the covariance matrix is known to be diagonal, a.k.a. the so-called naive Bayes classifier [PHG17, PSK+18], and finally for the pooled gTA (i.e., the covariance is the same for all values of $y$) [CK13].

Formally, we assume that the leakage distribution $f_y(\cdot)$ for each of the $Q$ different classes $y$ is Gaussian with mean $\mu_y$ and covariance $\Sigma_y$. For each class $y$, the adversary estimates a $D$-dimensional Gaussian generative model $\hat{f}_y(\cdot)$ (the template) according to the empirical mean vector $\hat{\mu}_y$ and the empirical covariance matrix $\hat{\Sigma}_y$. Without loss of generality, we assume that for each class, the adversary has acquired $N/Q$ traces during the profiling phase in order to build each template $\hat{f}_y(\cdot)$. The discriminative model derived from this Gaussian model, computed thanks to the Bayes rule, is used to mount a key recovery attack.
One may then remark that $\mathrm{LR}_2$ covers the set of discriminative models derived from gTA. To see this, define each elementary function $f(l; \theta_i) = -\frac{1}{2}(l - \mu_i)^\top \Sigma_i^{-1} (l - \mu_i) - \frac{1}{2}\log\det(\Sigma_i)$. Thus, the corresponding $\mathrm{LR}_2$ model $m_\Theta$ coincides with the Gaussian template. Likewise, if we further assume that the covariance matrix is the same for all classes, the quadratic term $-\frac{1}{2} l^\top \Sigma_i^{-1} l$ is common to all functions $f(l; \theta_i)$ and can be subtracted without changing the model $m_\Theta$. We deduce that the set of pooled Gaussian templates is equal to the hypothesis class of $\mathrm{LR}_1$.⁴ In other words, although a gTA (resp., p-gTA) adversary differs from an $\mathrm{LR}_2$ (resp., $\mathrm{LR}_1$) adversary, since they do not use the same learning algorithm, the hypothesis class of the former lies in the hypothesis class of the latter. It is therefore interesting to compare their convergence rates, e.g., by comparing their respective regrets (i.e., the gap between the LI and the PI, since it follows from the Gaussian assumption that LI = MI). This is the aim of this section.
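The correspondence between Gaussian templates and softmax models can be checked numerically in one dimension. The sketch below is illustrative only: it assumes uniform class priors, univariate leakage, and made-up means and variances, and the function names are hypothetical. It compares the Bayes posterior of the templates with the softmax of degree-2 scores, and does the same with degree-1 scores when the variance is pooled.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def template_posterior(l, means, variances):
    """Bayes rule on univariate Gaussian templates with uniform priors."""
    dens = [math.exp(-(l - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
            for mu, var in zip(means, variances)]
    norm = sum(dens)
    return [d / norm for d in dens]


def quadratic_scores(l, means, variances):
    """LR_2-style scores: one degree-2 polynomial in l per class."""
    return [-l * l / (2.0 * var) + mu * l / var
            - mu * mu / (2.0 * var) - 0.5 * math.log(var)
            for mu, var in zip(means, variances)]


def linear_scores(l, means, pooled_var):
    """LR_1-style scores: terms shared by all classes have been dropped."""
    return [mu * l / pooled_var - mu * mu / (2.0 * pooled_var) for mu in means]


if __name__ == "__main__":
    means, variances = [0.0, 1.0, 2.5], [0.5, 1.0, 2.0]  # made-up templates
    print(template_posterior(0.3, means, variances))
    print(softmax(quadratic_scores(0.3, means, variances)))
```

The two printed vectors agree up to floating-point error, because the class-independent factor of the Gaussian density cancels in the softmax.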
Remark 2. The Gaussian TA (resp., pooled TA) is identical to the quadratic (resp., linear) discriminant analysis (QDA/LDA), which are well-known machine learning models. However, most of the literature focuses on the success rate metric (e.g., [Efr75, HTF09]), and is not directly adaptable to information theoretic metrics. To the best of our knowledge, there is no existing bound on the convergence of the LDA/QDA that applies to the PI.

gTA convergence
Let us start with a convergence bound for the gTA, which is the most general Gaussian templates model.The proof of the following corollary is given in Section D.1.
Corollary 3. For any $\delta > 0$, the regret $R^{(\mathrm{gTA})}$ of an attacker instantiating a Gaussian template attack is upper-bounded by $O\left(\frac{QD^2}{N}\log\frac{1}{\delta}\right)$ with probability at least $1 - \delta$.
In other words, to be able to control the estimation error of the MI when profiling with a gTA, the attacker/evaluator must ensure that the number of profiling traces scales with the squared dimensionality of the traces times the number of classes.

On the tightness of the bound
So far, we have emphasized an upper bound on the regret of a gTA attacker. It is then interesting to assess whether this upper bound is tight. Namely, can we derive tighter bounds on the regret for any actual multivariate Gaussian leakage? We argue that without further assumptions regarding the knowledge of the attacker, we cannot get better bounds. The convergence rate emphasized in Corollary 3 essentially comes from the error terms due to the estimation of the empirical covariance matrix, namely $\log\det\Sigma$ and $\mathrm{Tr}(\Sigma^{-1}) - D$, and the sum of both error terms scales with $\Theta\left(\frac{QD^2}{N}\right)$ in expectation (the proof is given in Section D.1.1). Despite this negative argument, it is still possible to obtain faster convergence, provided that the attacker has more prior knowledge concerning the leakage, and more particularly concerning the shape of the covariance matrix. We next emphasize two particular cases that are often considered in side-channel analysis.

The Covariance Matrix is Diagonal: Naive Bayes
The Naive Bayes model has sometimes been used in SCA [PHG17, PSK + 18].It assumes a Gaussian multivariate distribution with diagonal covariance matrix for the leakage function.This reduces the covariance estimation to the estimation of the variance in each dimension, leading to a faster convergence, as stated by the next corollary, proven in Section D.2.
Corollary 4. The regret of an attacker instantiating a Gaussian template attack knowing that the covariance matrices are all diagonal is upper-bounded by $O\left(\frac{QD}{N}\log\frac{1}{\delta}\right)$.

Choudary and Kuhn's Pooled Template Attacks.
For gTA-based side-channel attacks, the bottleneck task is the estimation of the covariance matrices. Choudary and Kuhn considered this problem at Cardis'13 and emphasized that if $N/Q \leq D$, the empirical covariance matrices admit some zero singular values, so they are not invertible [CK13]. To circumvent this numerical issue, they proposed to pool all the covariance matrices into one common matrix for all the classes, leading to the pooled Gaussian template attack (p-gTA). This assumption is also known as homoscedasticity, and it leads to mounting a Linear Discriminant Analysis (LDA) classification in the statistical learning terminology. Despite its popular success in SCA [SA08, LPB+15, CDP15, CDP16, BS20], little has been done regarding the analysis of this approach since Choudary and Kuhn's paper. Yet, using a p-gTA addresses the necessary condition emphasized by Choudary and Kuhn for the attack to work, but does not ensure any sufficient condition. Can we find another explanation for the success of p-gTA? At first glance, using $Q$ times more traces to estimate the pooled covariance matrix would induce an $O(D^2/N)$ convergence for the estimation of the covariance, while keeping an $O(QD/N)$ convergence for the estimation of the means. This would result in an $O(\max(D^2/N, QD/N))$ bound in Corollary 3 for the ultimate regret of pooled template attacks. However, we conjecture that the latter upper bound can even be tightened to $O(QD/N)$, becoming fully linear in the trace dimensionality, despite the $D^2$ matrix coefficients to estimate.
Our conjecture is grounded on the similarity with the LR 1 model and on a proof in the particular case where Q = 2, stated next and proven in Section D.3.

Corollary 5. The regret of an attacker instantiating
denotes the Mahalanobis distance between the two centroids.
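To compare the scalings discussed in this section, the following sketch tabulates the leading-order terms of the regret bounds. It is only indicative: the hidden constants and the $\log(1/\delta)$ factors are dropped, so only ratios between models are meaningful, and the function name and the example values of $Q$ and $D$ are hypothetical.

```python
def profiling_cost(model, num_classes, dim):
    """Leading-order term of N * regret (constants and log(1/delta) dropped)."""
    q, d = num_classes, dim
    orders = {
        "gTA": q * d * d,                           # Corollary 3: Q D^2 / N
        "diagonal gTA": q * d,                      # Corollary 4: Q D / N
        "p-gTA (first glance)": max(d * d, q * d),  # max(D^2 / N, Q D / N)
        "p-gTA (conjectured)": q * d,               # conjectured Q D / N
    }
    return orders[model]


if __name__ == "__main__":
    q, d = 256, 100  # example: one byte, 100 points of interest
    reference = profiling_cost("gTA", q, d)
    for name in ("gTA", "diagonal gTA", "p-gTA (first glance)", "p-gTA (conjectured)"):
        cost = profiling_cost(name, q, d)
        print(f"{name:22s} ~ {cost:>9d}  (x{reference / cost:.0f} fewer traces than gTA)")
```

With these example values, the pooled and diagonal variants need on the order of $D$ times fewer profiling traces than the generic gTA to reach the same regret, up to the dropped constants.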

Case Study and Practical Use
So far, we have studied the PI and TI N for different classes of models.We finally discuss the impact of these results for the SCA practitioner.First, we briefly explain in Section 7.1 how the theoretical bounds could be used by an evaluator.Then, we illustrate in Section 7.2 our bounds and their use on simulated and experimental data.

Discussion on the practical use
Let us illustrate the properties of the $\mathrm{TI}_N$ and discuss its practical usage in a side-channel evaluation context. Suppose that an evaluator has a target security level claim to verify, e.g., expressed in bits leaked per trace.⁵ If the evaluator wants to verify this claim, she can run a profiling with a TI maximizer as a learning algorithm. Figure 4 sketches the different situations that an evaluator may face after acquiring a profiling dataset (with a given amount of traces) and a validation dataset, then running the attack.
In the first case (left of the figure), the PI is higher than the target security level.Therefore, the evaluator can conclude that the device under evaluation does not satisfy the security requirement.Furthermore, the gap between the PI and the TI captures the potential improvement of the attack that beats the target security level.
In the third case (right of the figure), the opposite situation holds. The TI is below the target and measures the guaranteed security level. Furthermore, the gap between the PI and the TI captures the potential improvement of the guaranteed security level. It is remarkable that this conclusion holds even if the PI of the model trained by the evaluator is negative, that is, independently of whether this model is useful to mount an attack.

In the remaining case (middle plot), the target security level lies between the PI and the TI, for the given amount of profiling traces. While it is in general less conclusive, our tools also allow interesting statements in this case. Indeed, we know that the actual security level is also between the PI and the TI. Let us denote the target security level by $T$ and let $\varepsilon = \mathrm{TI}_N - \mathrm{PI}$. We then know that the actual security level belongs to the interval $[\mathrm{PI}, \mathrm{TI}_N] \subset [T - \varepsilon, T + \varepsilon]$. Let us moreover assume that $\varepsilon \leq \alpha T$ for some $\alpha$ chosen by the evaluator. We can then claim that the security level of the implementation belongs to $[(1 - \alpha)T, (1 + \alpha)T]$, i.e., it equals $T$ up to a relative error margin of $100\alpha\,\%$.

This brings us to the relevance of knowing the convergence rate of the PI and the $\mathrm{TI}_N$. Indeed, this approach is practical only if the evaluator can easily make $\varepsilon$ small. Thanks to the bounds given in Section 5 and Section 6, this requirement is satisfied: $\varepsilon$ converges at a fast $O(1/N)$ rate, where $N$ is the number of profiling traces. Moreover, our quantitative bounds in these sections (see, e.g., Corollary 2) show that the constants behind the $O(\cdot)$ notation are reasonably small. Therefore, a practical use for the convergence rates is to extrapolate the guarantees that can be obtained with a number of profiling traces: from a given target security level $T$ and an uncertainty $\alpha$, the evaluator can have a bound on the number of profiling traces she will need to conclude her experiments with confidence.
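The three situations can be summarized as a small decision helper. This is a hedged sketch rather than the authors' tooling: the function names and returned labels are hypothetical, the inputs are assumed to be point estimates in bits per trace, and the sketch ignores that the bound provided by the $\mathrm{TI}_N$ only holds in expectation or with high probability.

```python
def assess(pi, ti, target):
    """Classify an evaluation outcome from the PI, the TI_N and the target T."""
    if pi > target:
        return "target violated"   # first case: an attack already beats T
    if ti < target:
        return "target met"        # third case: the TI_N bound lies below T
    return "inconclusive"          # middle case: PI <= T <= TI_N


def relative_margin(pi, ti, target):
    """Smallest alpha such that eps = TI_N - PI <= alpha * T."""
    return (ti - pi) / target


if __name__ == "__main__":
    # Hypothetical estimates in bits per trace.
    print(assess(0.15, 0.30, 0.10))
    print(assess(-0.02, 0.05, 0.10))
    print(assess(0.08, 0.12, 0.10), relative_margin(0.08, 0.12, 0.10))
```

In the inconclusive case, the returned margin $\alpha$ tells the evaluator how far the estimate is from a claim of the form "the security level is $T$ up to $100\alpha\,\%$".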

Illustration on simulated & experimental data
It now remains to illustrate our bounds with concrete data.For this purpose, we consider both simulated leakages and a public dataset of real measurements.

Setup & Models
Simulation setup. For our simulated experiments, we consider the Hamming weight leakage of an 8-bit secret in two settings. The first one (denoted as "hardware") corresponds to a typical hardware implementation: no masking and low SNR. The second one (denoted as "software") corresponds to a protected software implementation: 2-share Boolean masking and high SNR (each share independently leaking its Hamming weight). These simulations have 1 and 2 points in the leakage traces, respectively, and the noise is Gaussian.

Experimental setup. Our experiments use the AES-HD dataset [BJP20], which is an unprotected AES implemented on FPGA. The dataset is made of 500,000 traces of 1250 time samples, of which 450,000 traces are used for the training, i.e., maximizing the TI, whereas the remaining ones are used for validation, i.e., estimating the PI. The target intermediate value is the first byte of the AES state before the AddRoundKey operation of the last round, for which the full dataset exhibits an SNR peak up to 0.016 [ZBHV20, Fig. 18]. Since the last AES round is clearly identifiable on the raw traces, we assume the evaluator/adversary to be able to restrict its target window to 100 Points of Interest (PoIs) around the SNR peak.

Models.
In the "hardware" setting, we evaluate the linear models, $\mathrm{LR}_1$ and p-gTA, as well as an MLP (single hidden layer with 100 neurons in the simulations, and 10 neurons in the experiments). The p-gTA is done using the LDA from SCALib⁶, and we also consider (for the experimental dataset) a variant of the p-gTA with reduction to a 10-dimensional linear subspace (also known as LDA [SA08]). The logistic regression is done with the implementation in scikit-learn⁷, and for the experimental dataset, we apply a Principal Component Analysis (PCA) to reduce it to 20 dimensions, which simplifies the optimization [APSQ06]. The TI maximization of the MLP is done thanks to the Adam optimizer [KB15] implemented in the PyTorch framework [PGM+19] with a $10^{-4}$ learning rate, without weight decay and with a full batch, for 10,000 epochs (i.e., a high number, in order to best maximize the TI). In the "software" setting, since the leakage function is non-linear, we evaluate the $\mathrm{LR}_2$, gTA and MLP models (with the same hyper-parameters).

Results
The $\mathrm{TI}_N$ and PI of these models for a varying number of training traces are shown in Figure 5 for the simulations (the training is repeated for 5 different training sets) and in Figure 6 for the experiments on AES-HD. Additionally, for the simulations, since the true distribution is known, the MI is also shown. These figures lead to the following observations. In the upper part of Figure 5, we see that the variance of the $\mathrm{TI}_N$ is quite small compared to its bias (w.r.t. the LI). This is a consequence of the $\log(1/\delta)$ terms in Corollaries 1, 2 and 3.⁸ Next, considering the lower part of Figure 5, which depicts the gap between the TI and the PI, we see that the slope in the logarithmic plots is close to $-1$, which means that the gap is inversely proportional to $N$, as proven in Section 5 and Section 6.⁹ Interestingly, this holds true even when the PI and the TI are not yet close to their limit, and over a wide range of training set sizes (more than two orders of magnitude), which confirms the practical interest of the extrapolation proposed in Section 7.1.
The same observation can be made on Figure 6b, depicting the gap between the TI and the PI on the AES-HD dataset. So concretely, an evaluator could estimate how many traces are needed for her profiling from the beginning of a learning curve (i.e., when reaching the linear regime), which we illustrate with a concrete example. If the evaluator (who does not know the MI) wants to assess whether the target leaks less than 0.1 bit/trace when profiled with a linear model such as the p-gTA or the $\mathrm{LR}_1$, Figure 6a tells us that she can stop the acquisition campaign and conclude after 100,000 traces. Furthermore, she can estimate this number with a much smaller dataset of ≈ 10,000 traces, by extrapolating the gap $\varepsilon = \mathrm{TI}_N - \mathrm{PI}$, knowing that it is inversely proportional to $N$.
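The extrapolation described above can be sketched in a few lines. The code is illustrative and makes two assumptions that should be checked on real data: the learning curve has reached the regime where the gap behaves as $c/N$, and a single measured point is enough to fix $c$. The helper names and the numbers in the usage example are hypothetical.

```python
import math


def loglog_slope(ns, gaps):
    """Least-squares slope of log(gap) against log(N); close to -1 in the 1/N regime."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(g) for g in gaps]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def traces_for_gap(n_measured, gap_measured, gap_target):
    """Extrapolate with gap(N) = c / N, where c is fixed by one measured point."""
    c = n_measured * gap_measured
    return math.ceil(c / gap_target)


if __name__ == "__main__":
    # Hypothetical learning curve: gap between the TI_N and the PI, in bits.
    ns = [1000, 2000, 4000, 8000]
    gaps = [0.5, 0.25, 0.125, 0.0625]
    print("slope:", loglog_slope(ns, gaps))
    print("traces needed for a gap of 0.01 bit:", traces_for_gap(ns[-1], gaps[-1], 0.01))
```

Checking that the fitted slope is close to $-1$ before extrapolating guards against applying the $1/N$ law too early on the learning curve.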
We finally remark that for the software simulation, the MLP model has a higher LI than the LR 2 and gTA models, meaning that it better models the true distribution.This increased versatility comes at a cost: training it requires at least two orders of magnitude more traces than the simpler models (roughly matching the bounds given in Table 1).

Concluding Remarks
This paper provides new information theoretic metrics and bounds together with a study of the convergence rates for practically-relevant profiled attacks. Besides their interest for helping side-channel security evaluators in selecting the profiling tools that best match their target device and time constraints, our results also show connections and differences between statistical learning theory and side-channel analysis. For example, in order to obtain convergence rates, we observed that the evaluator's goal, namely maximizing the PI to estimate the highest lower bound on the MI, could be rephrased as a machine learning problem, using information theoretic metrics as loss functions. Accordingly, the $\mathrm{TI}_N$ metric is nothing but the empirical risk studied in learning theory, and the $\mathrm{TI}_N$-maximizer in the profiling SCA view coincides with the Empirical Risk Minimizer (ERM), one of the most studied algorithms in machine learning. Yet, and somewhat surprisingly, the IT metrics that are most relevant for side-channel security evaluations are less investigated optimization goals than security metrics (like the accuracy) in the machine learning literature. So our study puts forward both the interest of leveraging the broad scope of theoretical results established in statistical learning theory over the past few years, and the need to adapt them to needs that are somewhat specific to security evaluations. Eventually, an interesting meta-conclusion of our results is that the profiling data complexity to estimate a model does not fundamentally differ from the attack data complexity using this model, since the profiling error we need to reach is proportional to the security level. This motivates shortcut approaches to profiling as proposed in [ABB+20], and suggests that making security claims based on the profiling complexity of an implementation (i.e., contradicting the relevance of such shortcuts) could only be sound if showing that the model estimation problem is computationally hard, which is an interesting open problem.

A Proofs of Section 3
where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ denotes the KL divergence. This re-statement is of great interest, since the first sum is unbiased, because $\tilde{e}_N(y, l)$ admits $p(y, l)$ as expected value, whereas the second sum is positively biased, because each of its terms is non-negative thanks to the KL divergence. Hence the first inequality of Equation 4. It now remains to upper bound the second sum in expectation in order to get the upper bound on the bias of the eHI. To this end, as suggested by Paninski [Pan03, Proposition 1], we use the fact that Finally, we have We conclude the proof by observing that $|\mathcal{L}|$ is the number of bins. In addition, Equation 5 is a direct consequence of [Pan03, Thm. 5].
Proof of Theorem 2. Notice that where H(L) = − l∈L ẽN (l) log(ẽ N (l)), and likewise for H(Y, L).Subtracting the expected value of the eHI, we get Denoting by δ the right hand-side of Equation 31, we get the main result.Finally, the property

A.0.1 On the Effect of Discretization.
It is worth emphasizing that the latter analysis has been done assuming discrete probability distributions for the leakage. Thereby, one may wonder whether those results extend to the case where the leakage is modeled by continuous probability distributions. At first sight, the latter result would become useless, as it would imply the oscilloscope resolution $\omega$ to tend towards infinity. Unfortunately, it is unlikely that tight convergence bounds can be obtained in this case, because of the so-called curse of dimensionality, which, informally, states that the convergence rate of non-parametric density estimation methods would slow down at least exponentially with $D$ [Sto82, Sto83]. Moreover, with non-parametric density estimation methods, there is a risk that, depending on the choice of the kernel, the HI no longer upper-bounds the MI.

B Proofs of Section B.2

B.1 Characterizing the Complexity of H: the Pseudo-Dimension
In the next section, we will present several upper bounds on the TI N towards the LI.It is expected that those bounds will depend on the complexity -or the richness -of the underlying hypothesis class H. Intuitively, the more parameters in Θ to fit, the slower the convergence.It turns out that it is possible to characterize this complexity.This characterization, named Pseudo-Dimension, is defined in this section, and we provide some examples of pseudo-dimensions for several classes of interest for this study.We will therefore be able to provide some convergence rates in the next sections that depend on the pseudo-dimension.
We first need an intermediate definition of a pseudo-shattering.
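Definition 7 (Pseudo-shattering [AB02, Def. 11.1]). Let $\mathcal{F}$ be a set of functions mapping from a domain $\mathcal{L}$ to $\mathbb{R}$ and suppose that $S_N = \{l_1, \ldots, l_N\} \subset \mathcal{L}$ for some positive integer $N$. Then, $S_N$ is pseudo-shattered by $\mathcal{F}$ if there are real numbers $r_1, \ldots, r_N$ such that for all $b \in \{0, 1\}^N$ there is a function $f_b \in \mathcal{F}$ with
$$\operatorname{sgn}\bigl(f_b(l_i) - r_i\bigr) = b_i \quad \text{for } 1 \leq i \leq N. \tag{32}$$
We say that $r = (r_1, \ldots, r_N)$ witnesses the shattering.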
An example of pseudo-shattering is depicted in Figure 7. We consider $\mathcal{F}$ as the set of affine functions in $\mathbb{R}$. When $S_N = \{l_1, l_2\}$, we can exhibit a function from $\mathcal{F}$ satisfying Equation 32 for any 2-bit vector $b \in \{0, 1\}^2$. However, we can notice that when adding $l_3$ to $S_N$, the new profiling set cannot be shattered anymore, since the binary vector $b = (0, 0, 1)$ provides a counter-example where Equation 32 is not satisfied. It can be verified that no matter the choice of $r_3$, one will always find such a binary vector $b$ breaking the condition of Equation 32. Intuitively, this states that $\mathcal{F}$ is not rich enough to shatter any set of 3 leakages or more. Hence the choice of quantifying the richness of $\mathcal{F}$ by the maximum number of leakages that can be shattered by $\mathcal{F}$, as formalized hereafter.

Definition 8 (Pseudo-dimension [AB02, Def. 11.2]). Suppose that $\mathcal{F}$ is a set of functions from a domain $\mathcal{L}$ to $\mathbb{R}$. Then, $\mathcal{F}$ has pseudo-dimension $N$ if $N$ is the maximum cardinality of a subset $S_N$ of $\mathcal{L}$ that is pseudo-shattered by $\mathcal{F}$. If no such maximum exists, we say that $\mathcal{F}$ has infinite pseudo-dimension. The pseudo-dimension of $\mathcal{F}$ is denoted $P_{\dim}(\mathcal{F})$.
As an example, it is known that if $\mathcal{F}$ is a finite-dimensional vector space of functions from an input space $\mathcal{L}$ onto $\mathbb{R}$, then $P_{\dim}(\mathcal{F})$ is the dimension of $\mathcal{F}$ [AB02, Thm. 11.4]. We give hereafter the pseudo-dimension of the two classes considered in this work, namely the logistic regression and the MLP.
Theorem 3 (Pseudo-dimension of $\mathrm{LR}_k$ [AB02, Thm. 11.8]). Let $\mathcal{F}$ be the class of all polynomial transformations on $\mathbb{R}^D$ of degree at most $k$. Then
$$P_{\dim}(\mathcal{F}) = \binom{D + k}{k}.$$

Theorem 4 (Pseudo-dimension of MLP [BHLM19]). Let $\mathcal{F}$ be the class of MLPs with real-valued output, piece-wise linear activation function, $W$ parameters and $L$ layers. Then, there exist two constants $c > 0$, $C > 0$ such that
$$c \cdot W L \log(W/L) \leq P_{\dim}(\mathcal{F}) \leq C \cdot W L \log(W).$$

Put another way, this means that the pseudo-dimension of parametric models is roughly proportional to the number of real-valued parameters to fit.¹⁰
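For the polynomial classes, the pseudo-dimension can be computed directly. The sketch below relies on the vector-space property recalled above ([AB02, Thm. 11.4]) and on the standard fact that the polynomials of degree at most $k$ in $D$ variables are spanned by $\binom{D+k}{k}$ monomials; the function names and the example dimensions are illustrative choices, not values taken from the paper.

```python
import math
from itertools import combinations_with_replacement


def pdim_polynomials(dim, degree):
    """Dimension of the space of polynomials of degree <= `degree` in `dim` variables."""
    return math.comb(dim + degree, degree)


def count_monomials(dim, degree):
    """Brute-force cross-check: enumerate monomials as multisets of variable indices."""
    return sum(1 for k in range(degree + 1)
               for _ in combinations_with_replacement(range(dim), k))


if __name__ == "__main__":
    for d in (1, 2, 100):
        print(f"D = {d:3d}: LR_1 -> {pdim_polynomials(d, 1)}, LR_2 -> {pdim_polynomials(d, 2)}")
```

Under this count, moving from degree 1 to degree 2 on a 100-dimensional leakage raises the number of parameters per elementary function from 101 to 5151, which matches the intuition that the quadratic models need many more profiling traces.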

B.2 Convergence Rate for TI Maximizers
We are now ready to present our main result for TI N maximizers.

Theorem 5. Let H be a hypothesis class to model the leakage of an intermediate computation of Q hypothetical values, such that the corresponding elementary class F of functions
2 ) has pseudo-dimension P dim .Define the following quantities: where N denotes the number of profiling traces.Define also the following quantity Then, for all 0 < δ ≤ 1, the inequality holds with probability at least 1 − δ.
We prove Theorem 5 in Appendix B. Corollary 6 follows from this result.
Corollary 6.Let A H be a TI N -maximizer adversary that profiles with N traces and considers a hypothesis class H such that the corresponding elementary class F has pseudodimension P dim .The following inequalities hold with probability 1 − δ (except the first one that always holds), and the slack Proof.The first inequality is a direct consequence of the definition of LI.The second one is a direct consequence of Theorem 5 and Theorem 6 (proven in Appendix), while the two last ones follow from Corollary 7.
Putting the pseudo-dimensions of our models of interest in this Corollary gives our generic convergence results (∀ p in Table 1).

B.3 Proof of Theorem 5
In this section, we prove Theorem 5.The proof is done in several steps that we briefly describe hereafter before diving into the details.
1. We bound the gap between TI N (Y ; L; m N ) and PI(Y ; L; m N ) with a uniform bound, i.e., not specific to any m ∈ H.We are now reduced to show that the gap uniformly converges towards 0.
2. We invoke a theorem stating that the uniform convergence rate is upper bounded by a quantity depending on the so-called covering numbers that we will define.
3. We will then introduce some properties of covering numbers in order to reduce the problem to bounding the covering number of the different F i .
4. The covering numbers can actually be bounded by the pseudo-dimension introduced in Section B.1.
5. We now have all the ingredients to state the theorem and its corollary.

B.4 Uniform Convergence
Definition 9 (Uniform Convergence). Let H be a hypothesis class. We say that H has the uniform convergence property if for any probability distribution over (Y, L), and for any ε, δ > 0, the following inequality is satisfied:

Theorem 6 (Uniform Convergence implies Learnability). With the same notations as in Definition 9, the inequality is satisfied.

Proof. Let m ∈ H be fixed, and let us denote $m_N = A_H(\tilde{e}_N)$. By Definition 5, we have Since the right-hand side does not depend on the fixed m, taking the supremum of the left-hand side with respect to m concludes the proof.
In other words, it suffices to prove the uniform convergence for our hypothesis class H to show that the PI converges towards its supremum. Interestingly, the uniform convergence of H is also a necessary condition [ABCH97, Thm. 4.2].¹¹

Proof. We first prove the first inequality:

B.5 Bounding Uniform Convergence with Covering Numbers
We now turn to uniform bounds which, thanks to Corollary 7, will enable us to bound the gap between $TI_N$ and LI. The main idea of the results presented in this section is to reduce uniform convergence for infinite hypothesis classes to uniform convergence for finite hypothesis classes, under further assumptions. To this end, we need to introduce the concept of covering numbers.
Definition 10 (Covering of a set [SB14, Def. 27.1]). Let A be a normed vector space with respect to the $\|\cdot\|_1$ norm, and ε > 0. We say that A is ε-covered by a set A′, with respect to the $\|\cdot\|_1$ norm, if for all a ∈ A, there exists a vector a′ ∈ A′ such that $\|a - a'\|_1 \le \epsilon$. We denote by $N_1(\epsilon, A)$ the cardinality of the smallest A′ that ε-covers A.
In a nutshell, an ε-covering of a set A can be seen as a representative finite sample of A, in the sense that any point from A is ε-close to at least one element of the covering. Therefore, any analysis that is done over the covering is likely to still hold (up to an error margin depending at most on ε) over the whole set A.
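The notion of ε-covering can be illustrated with a minimal sketch that greedily builds a cover of a finite point set for the $\|\cdot\|_1$ norm. The greedy strategy and the names `l1_distance` and `greedy_cover` are illustrative assumptions, not part of the original proof; the greedy cover is valid but not necessarily minimal, so its size only upper-bounds the covering number.

```python
from typing import List, Sequence

Point = Sequence[float]


def l1_distance(a: Point, b: Point) -> float:
    """L1 distance between two vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def greedy_cover(points: List[Point], eps: float) -> List[Point]:
    """Return a subset of `points` such that every point is eps-close to it.

    A point becomes a new center only if no existing center is within eps,
    so every non-center point has a center at L1 distance at most eps.
    """
    centers: List[Point] = []
    for p in points:
        if all(l1_distance(p, c) > eps for c in centers):
            centers.append(p)
    return centers


# A regular grid on [0, 1]^2 and one of its 0.5-covers.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
cover = greedy_cover(grid, eps=0.5)
print(len(grid), len(cover))
```

Any statement checked on `cover` transfers to `grid` up to an error controlled by ε, which is the intuition used in the remainder of this section.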
Beyond metric spaces, covering numbers can also be defined for functional spaces, such as the ones we consider here.The following definition formally states this idea.
Definition 11 (Covering number of a hypothesis class [AB02, Sec. 10.4]). Let H be a set of functions from an input space L to a subset of $R^Q$. Given a sequence $S_N = (l_1, \ldots, l_N) \in L^N$ of input data, we let $H_{S_N}$ be the following set: $H_{S_N} = \{(h(l_1), \ldots, h(l_N)) : h \in H\}$. For a positive number ε, we define the covering number of H for accuracy ε and number of data N as the quantity $N_1(\epsilon, H, N) = \max_{S_N \in L^N} N_1(\epsilon, H_{S_N})$.

Covering numbers are crucial in statistical learning theory. This is formally stated by Theorem 7 hereafter.
Theorem 7 ([Hau92, Thm.3]).Let H be a permissible12 hypothesis class of functions from L to P(Y), such that for all m ∈ H, and y, l ∈ Y × L, 0 ≤ − log(m[y]) ≤ B. Assume N ≥ 1. Suppose that S N is generated by N independent random draws according to any joint probability distribution on Y × L. Then where log •H denotes the set of functions {y, l → − log(m[y]) : m ∈ H}.
It now remains to see when Theorem 7 provides non-trivial bounds. Indeed, assuming that $(\log \circ H)_{S_N}$ is a subset of $[0, B]^N$, for some B > 0, the covering number $N_1(\epsilon, \log \circ H, N)$ can itself be trivially bounded by $\left(\frac{BN}{\epsilon}\right)^N$. Unfortunately, in that case, the right-hand side of Equation 40 tends to infinity as N → ∞, if ε is small enough. In other words, without further assumptions, Theorem 7 is a rather tautological result, and further conditions on H must be set to obtain sound bounds.
Fortunately, we will see in Section B.7 that for some classes of functions, we can get tighter bounds on the covering numbers, yielding non-trivial worst-case uniform convergence rates. Before going further, we need a few technical lemmas concerning covering numbers, which will be helpful to derive the desired bounds.

B.6 A Few Properties about Covering Numbers
In this section, we introduce some technical lemmas that will be helpful for bounding the covering numbers.We start with the contraction lemma that leverages the Lipschitz property of a function.
Lemma 2 (Contraction). Let A, B be two sets, and φ : A → B be a ρ-Lipschitz function for a given norm $\|\cdot\|$ respectively induced on A, B. That is, for a, b ∈ A, the following inequality holds: $\|\phi(a) - \phi(b)\| \le \rho \|a - b\|$. Then, if N denotes the covering number with respect to the considered norm, the inequality $N(\rho\epsilon, \phi \circ A) \le N(\epsilon, A)$ is valid.
Lemma 2 is inspired by the proof given by Shalev-Shwartz and Ben-David [SB14, Lemma 27.2], who showed the result for the $\|\cdot\|_2$ norm. We observe however that the result can be generalized to any norm.
Proof. By definition, there exists a minimal ε-covering A′ of A of size $N_1(\epsilon, A)$. Then, for any a ∈ A, there exists a′ from the covering A′ such that the following inequality holds: $\|a - a'\| \le \epsilon$. Define B = φ ∘ A and B′ = φ ∘ A′. It follows from the Lipschitz property of φ that $\|\phi(a) - \phi(a')\| \le \rho \|a - a'\| \le \rho\epsilon$. Hence, B′ is a (ρε)-cover of B.
Corollary 8 (Contraction). Using the same notations as in Lemma 2, if φ is a ρ-Lipschitz function (with respect to a given norm), then for any set of functions F, one can bound the covering numbers of φ ∘ F as follows: $N_1(\rho\epsilon, \phi \circ F, N) \le N_1(\epsilon, F, N)$.

Proof. Recalling that $N_1(\epsilon, F, N)$ is by definition the maximum value of $N_1(\epsilon, A)$ over all the sets A of size N in the image set of F, the result follows straightforwardly from Lemma 2.
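The contraction argument can be checked numerically. The sketch below is only an illustration under stated assumptions: the map `phi` (a coordinate-wise ρ·tanh, which is ρ-Lipschitz for the L1 norm), the greedy cover and all function names are hypothetical choices. It verifies that the image of an ε-cover of A is a (ρε)-cover of the image of A.

```python
import math


def l1(a, b):
    """L1 distance between two vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


def greedy_cover(points, eps):
    """Greedy (not necessarily minimal) eps-cover of a finite point set."""
    centers = []
    for p in points:
        if all(l1(p, c) > eps for c in centers):
            centers.append(p)
    return centers


def phi(x, rho):
    """Coordinate-wise rho * tanh: a rho-Lipschitz map for the L1 norm."""
    return tuple(rho * math.tanh(v) for v in x)


def is_cover(points, centers, eps, tol=1e-12):
    """Check that every point lies within eps (plus rounding slack) of a center."""
    return all(min(l1(p, c) for c in centers) <= eps + tol for p in points)


eps, rho = 0.4, 0.5
A = [(i / 5 - 1, j / 5 - 1) for i in range(11) for j in range(11)]
A_cover = greedy_cover(A, eps)
B = [phi(a, rho) for a in A]               # image set
B_cover = [phi(a, rho) for a in A_cover]   # image of the cover
print(is_cover(A, A_cover, eps), is_cover(B, B_cover, rho * eps))
```

The image cover has at most as many elements as the original one, which is the inequality stated by the contraction lemma.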
Informally, Corollary 8 tells us that the smoother the function φ, in the sense that the lower its Lipschitz constant ρ, the fewer points are needed to get an ε-cover of the image set by considering the image of the ε-cover of the input space. Therefore, it is useful for reducing the computation of the covering numbers of a hypothesis class when the latter is a set of composed smooth functions.

The direct application of Corollary 8 is to bound the covering number of $\log \circ H$ with the covering number of $F^Q$, defined as the set $\{h : L \to R^Q : \sigma \circ h \in H\}$, i.e., such that $\sigma \circ F^Q = H$. Let us first observe that the Lipschitz constant of the composed function $\log \circ \sigma$ is bounded by the square root of the number of its entries, as stated by Lemma 3.
Proof. Denote by φ the considered function. Since $\|x\|_2 \le \|x\|_1$, it suffices to show that φ is $\sqrt{Q}$-Lipschitz in the $\|\cdot\|_2$ norm. Moreover, it is known that the Lipschitz constant in the latter norm is bounded by the supremum, over the range of x, of the $\|\cdot\|_2$ norm of the gradient of φ. For 1 ≤ j ≤ Q, the partial derivative of φ with respect to $x_j$ is $\delta_{i,j} - \sigma(x)_j$, where $\delta_{i,j}$ denotes the Kronecker symbol. Since both $\delta_{i,j}$ and $\sigma(x)_j$ lie in [0, 1], each partial derivative has absolute value at most 1, which implies that the Lipschitz constant is bounded by $\sqrt{Q}$.
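As a sanity check of this gradient argument, the following sketch evaluates the gradient of $x \mapsto \log \sigma(x)_i$, whose entries are $\delta_{i,j} - \sigma(x)_j$, at random points and reports the largest Euclidean norm observed. The sampling range, the number of trials and the function names are arbitrary illustrative assumptions.

```python
import math
import random


def softmax(x):
    """Numerically stable softmax of a list of reals."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(x, i):
    """Gradient of x -> log softmax(x)[i]: entries delta_{i,j} - softmax(x)[j]."""
    s = softmax(x)
    return [(1.0 if j == i else 0.0) - s[j] for j in range(len(x))]


def max_grad_norm(Q, trials=1000, seed=0):
    """Largest Euclidean gradient norm observed over random inputs in [-10, 10]^Q."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x = [rng.uniform(-10.0, 10.0) for _ in range(Q)]
        i = rng.randrange(Q)
        g = grad_log_softmax(x, i)
        worst = max(worst, math.sqrt(sum(v * v for v in g)))
    return worst


for Q in (2, 4, 16, 256):
    print(Q, max_grad_norm(Q, trials=200), math.sqrt(Q))
```

In this experiment the observed norms stay well below $\sqrt{Q}$ for large Q, which suggests that the bound of the proof is conservative but sufficient for the covering-number argument.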
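Corollary 9. For all ε > 0, and for all N ≥ 1, the following inequality holds:

Thanks to Corollary 9, it remains to bound the covering number of the set $F^Q$, which we now address. We start by defining the set of functions $F^Q$ previously introduced as a free product of Q elementary sets of functions.
Definition 12 (Free product). Let $A = A_1 \times \ldots \times A_Q$ be the Cartesian product of Q metric spaces (for the $L_1$ distance). Let $F_i$ be a family of functions from L into $A_i$. The free product of the $F_i$ is the class of functions $F^Q = \{(f_1, \ldots, f_Q) : f_i \in F_i, 1 \le i \le Q\}$, where $(f_1, \ldots, f_Q)(l) = (f_1(l), \ldots, f_Q(l))$.

We may now properly bound the covering number of $F^Q$ in terms of the covering numbers of the $F_i$, thanks to Lemma 4.
Lemma 4 ([Hau92, Lemma 7]). If $F_1, \ldots, F_Q$ are defined as above, then $N_1(\epsilon, F^Q, N) \le \prod_{i=1}^{Q} N_1\left(\frac{\epsilon}{Q}, F_i, N\right)$.

Proof. For all 1 ≤ i ≤ Q, let $U_i$ be an $\frac{\epsilon}{Q}$-cover of $F_i$, and let $U = U_1 \times \ldots \times U_Q$. Let us show that U is an ε-cover for $F^Q$. That is, let $g = (g_1, \ldots, g_Q) \in F^Q$, and let us show that there exists f ∈ U such that $\|g - f\|_1 \le \epsilon$. For all 1 ≤ i ≤ Q, since $U_i$ is an $\frac{\epsilon}{Q}$-cover of $F_i$, we know that there exists $f_i \in U_i$ such that $\|g_i - f_i\|_1 \le \frac{\epsilon}{Q}$. Summing over i, the element $f = (f_1, \ldots, f_Q) \in U$ satisfies $\|g - f\|_1 \le \sum_{i=1}^{Q} \|g_i - f_i\|_1 \le \epsilon$. Hence, U is an ε-cover for $F^Q$. It now remains to notice that the cardinality of U is the product of the cardinalities of the $U_i$, 1 ≤ i ≤ Q.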
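The product construction used in this proof can be sketched on finite sets of real values rather than function classes, which is a simplifying assumption made for illustration only. Each coordinate set is covered at accuracy ε/Q, and the Cartesian product of these covers is then checked to be an ε-cover of the product set for the L1 distance. All names are hypothetical.

```python
from itertools import product
from math import prod


def greedy_cover_1d(values, eps):
    """Greedy (not necessarily minimal) eps-cover of a finite set of reals."""
    centers = []
    for v in values:
        if all(abs(v - c) > eps for c in centers):
            centers.append(v)
    return centers


def l1(a, b):
    """L1 distance between two tuples of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))


Q, eps = 3, 0.75
coordinate_sets = [[i / 10 for i in range(11)] for _ in range(Q)]
# One (eps / Q)-cover per coordinate, then their Cartesian product.
covers = [greedy_cover_1d(s, eps / Q) for s in coordinate_sets]
U = list(product(*covers))

# Worst-case L1 distance from a point of the product set to the product cover.
worst = max(min(l1(g, f) for f in U) for g in product(*coordinate_sets))
print(len(U), worst)
```

The size of `U` is the product of the sizes of the coordinate covers, mirroring the right-hand side of the lemma.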

B.7 Bounding the Covering Numbers of F with P dim (F )
We finally come to the link between covering numbers and pseudo-dimensions, thanks to the following results.
Corollary 10. Let F be a non-empty set of real functions mapping from a domain L to the real interval [0, B] and suppose that F has finite pseudo-dimension $P_{dim}(F)$. Then $N_1(\epsilon, F, N) \le e\,(P_{dim}(F) + 1) \left(\frac{2eB}{\epsilon}\right)^{P_{dim}(F)}$ for all ε > 0.
Compared with the trivial bound $\left(\frac{BN}{\epsilon}\right)^N$ discussed before, Corollary 10 provides a much tighter bound since it no longer depends on the number N of profiling traces. This noticeable property is the cornerstone of statistical learning theory, in the sense that it makes the result of Theorem 7 much more useful.
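To make the comparison concrete, the sketch below evaluates both bounds on a logarithmic scale. It assumes that the trivial bound reads $(BN/\epsilon)^N$ and that the pseudo-dimension bound reads $e\,(P_{dim} + 1)(2eB/\epsilon)^{P_{dim}}$; the numerical values and function names are purely illustrative.

```python
import math


def log_trivial_bound(B: float, eps: float, N: int) -> float:
    """Natural log of the assumed trivial bound (B * N / eps) ** N."""
    return N * math.log(B * N / eps)


def log_pdim_bound(B: float, eps: float, pdim: int) -> float:
    """Natural log of the assumed bound e * (pdim + 1) * (2 e B / eps) ** pdim."""
    return 1.0 + math.log(pdim + 1) + pdim * math.log(2.0 * math.e * B / eps)


B, eps, pdim = 1.0, 0.1, 66
for N in (10**3, 10**4, 10**5):
    # The first value grows with N, the second one stays constant.
    print(N, log_trivial_bound(B, eps, N), log_pdim_bound(B, eps, pdim))
```

Under these assumptions, the logarithm of the trivial bound grows faster than linearly in N, whereas the pseudo-dimension bound is constant in N.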

B.8 Putting all Together
Now we have characterized every element in the upper bound of Theorem 7 in terms of pseudo-dimension of F, we may gather all those results to come back to a concrete bound.Let us denote P = Pr sup m∈H PI(Y ; L; m) − ∆ m ẽN > .Applying Theorem 7, it comes that ≤ 2 (e P dim (F) + 1) the latter inequality can be rephrased as Let δ > 0. We would like to find a sufficient condition such that P ≤ δ.It suffices to find a sufficient condition such that Let we shall show that Equation 53 is satisfied.Using the above definitions, we have Moreover, since 2 ≥ 2 0 , it holds that β α log 2 ≥ β α log 2 0 .Finally, summing the two above equations gives Equation 53.
It now remains to replace the bound B of the loss function by a more practical bound on the output range of each elementary class F. This is stated by the following lemma. Proof.
− log(σ(x)) = log

This result allows us to replace B with 2V + log(Q) in the definitions of α and β, which, along with the hypothesis $V \ge \frac{1}{2}$, allows us to observe that γ ≥ 1. Hence, we can remove the max in the definition of $\epsilon_0$: $\epsilon_0^2 = \left(\gamma + \log\frac{1}{\delta}\right)/\alpha$. Finally, taking the complementary probability in Equation 52 and making the expression of ε explicit gives Theorem 5.

We introduce hereafter a few technical lemmas that will be useful to derive the proofs.

C Proofs of fast rate
Lemma 6. Let l ∈ L be such that $\|l\|_2 \le R$. Let Θ be a parameter vector such that $m_\Theta \in H$, where H denotes the hypothesis class of an LR$_2$ attacker. Then, for all y ∈ Y and for all l ∈ L, the mapping $\Theta \mapsto \log \sigma(m_\Theta(l))_y$ is ρ-Lipschitz for the norm

Proof. Using Lemma 3, we get that for all (y, l), Since m is an LR$_2$ model, $m_\Theta(l)_i = \tilde{l}^\top A_i \tilde{l}$ where $\tilde{l} = (l, 1)$. Therefore, using the Cauchy-Schwarz inequality, we get Injecting this bound into Equation 56 gives the desired result.

Lemma 7. With the same notations as before, if we now consider an LR$_1$ attacker, then the resulting mapping becomes ρ-Lipschitz with $\rho \le Q(R^2 + 1)$.
Proof. We now have $m_\Theta(l)_i = B_i \tilde{l}$ (still with $\tilde{l} = (l, 1)$), and thus Injecting this bound into Equation 56 concludes the proof.
Restatement of Theorem 9. The original version of Mehta's theorem [Meh17, Thm. 1] required the loss function to be exp-concave,¹⁴ instead of the true leakage model p belonging to H. Nevertheless, Mehta's proof relies on another, more general assumption, the so-called η-central condition. This central condition is implied either by assuming the loss function to be η-exp-concave or, in the particular case where the loss function is the log-loss, by assuming that the true leakage distribution p belongs to H [vEGM+15, Example 2.2]. In the latter case, the parameter η is set to 1. Besides, the supremum of PI can be replaced by MI, since we assume p ∈ H. The remainder of Mehta's proof is unchanged.
Proof of Corollary 1 for LR 1 .This is a direct application of Theorem 9, by properly setting the parameters of the theorem.First, observe that H ⊂ R (D+1)×Q so P = (D + 1)Q, and taking Next, the condition log m Finally, using Lemma 7, we get that the Lipschitz constant L is upper bounded by Q(R 2 + 1).Putting all together into Equation 55 gives the desired result.
Proof of Corollary 1 for LR 2 .This is a direct application of Theorem 9, by properly setting the parameters of the theorem.As previously, we have P = (D + 1)Q and T = 2S √ Q.Furthermore, using the same reasoning as before, but using the bound |B i l | ≤ (R 2 + 1)S, we get B = 2(R 2 + 1)S + log(Q).Finally, using Lemma 6, we get that L ≤ √ Q(R 2 + 1).Putting all together into Equation 55 gives the desired result.
Proof of Corollary 2. This is a direct application of Theorem 9, by properly setting the parameters of the theorem to fit the different assumptions.
First, recall from Section 5.1 that our class of models is composed of Q MLPs, each being made of W real parameters by assumption.Hence, H ⊂ R W ×Q so P = W Q.
Second, we bound $\sup_{\theta, \theta'} \|\theta - \theta'\|$. Notice that for each MLP $\phi_y$ plugged to the entries of the softmax, $\|\theta_i\| \le LS$ (we use $\ell_2$ norms in this proof), so using the triangle inequality, we get that for all θ, θ′,

Third, we show the Lipschitzness of MLPs. Using Lemma 3, we get that for all (y, l), It then remains to bound the Lipschitz constant of each entry model $m_\theta(l)_i$ of the softmax. Then, we may notice that since the ReLU activation function is 1-Lipschitz, each layer φ x (j) , Θ x (j) ), hence ) Let us now prove by induction that where and x (0) = x (0) = l. The base case j = 1 is a direct consequence of Equation 59, since $\|l\| \le R$ and S ≥ 1. For j = 1, we observe that x (j+1) ≤ Θ (j) i x (j) ≤ S j l ≤ S j R. Then, injecting this observation in the second term of Equation 59 and using the induction hypothesis in the first term gives the desired result. Finally, we apply Equation 60 to the full MLP, giving

Proof. Using successively the definitions of the PI and the MI, and the linearity of the expectation, we get Since the KL divergence is always non-negative, we get the desired result.
Note that Theorem 10 is not particular to Gaussian templates, and may be applied to any generative model.Next, we remark that the KL divergence remains invariant by affine transformation, as stated hereafter.

Lemma 8. Let
By applying the change of variable x′ = Ax + b in the definition of the KL divergence, it follows that Hence, we identify the right-hand side of Equation 64.
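The invariance of the KL divergence under affine transformations can be checked in the one-dimensional Gaussian case, which is a simplification of the multivariate statement made for illustration. The sketch uses the standard closed form of the KL divergence between two univariate Gaussians; the function names and numerical values are hypothetical.

```python
import math


def kl_gauss_1d(mu0: float, s0: float, mu1: float, s1: float) -> float:
    """KL( N(mu0, s0^2) || N(mu1, s1^2) ) for standard deviations s0, s1 > 0."""
    return math.log(s1 / s0) + (s0**2 + (mu0 - mu1) ** 2) / (2.0 * s1**2) - 0.5


def affine(mu: float, s: float, a: float, b: float):
    """Mean and standard deviation of a*X + b when X ~ N(mu, s^2), with a != 0."""
    return a * mu + b, abs(a) * s


# True distribution and model before the change of variable.
mu_true, s_true = 0.3, 1.2
mu_model, s_model = -0.5, 0.8
before = kl_gauss_1d(mu_true, s_true, mu_model, s_model)

# Same pair after applying x -> a*x + b to both distributions.
a, b = -2.5, 4.0
after = kl_gauss_1d(*affine(mu_true, s_true, a, b), *affine(mu_model, s_model, a, b))
print(before, after)
```

Both values coincide up to rounding, which is why the study of Gaussian templates can be restricted to a normalized true distribution.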
For Gaussian templates, we can therefore reduce the study of the KL divergence of Theorem 10 to the particular case where the true covariance matrix Σ is the identity using Lemma 8. Furthermore, in the case of gTA with Σ = I, the following lemma gives an algebraic formulation of the upper bound.
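Lemma 9. For a Gaussian distribution with Σ = I, the KL divergence is given by:

$D_{KL}(f \,\|\, \hat{f}) = \frac{1}{2}\left(\log\det\hat{\Sigma} + \mathrm{Tr}(\hat{\Sigma}^{-1}) - D\right)$ (65)
$\qquad + \frac{1}{2}(\hat{\mu} - \mu)^\top \hat{\Sigma}^{-1} (\hat{\mu} - \mu)$. (66)

Proof. By definition, Substituting both $f(\cdot)$ and $\hat{f}(\cdot)$ with their respective densities, it follows that Using [PP+08, Lemma 8.2.2], it follows that the second term inside the brackets has D as expected value, whereas the first term inside the brackets has $(\hat{\mu} - \mu)^\top \hat{\Sigma}^{-1} (\hat{\mu} - \mu) + \mathrm{Tr}(\hat{\Sigma}^{-1}\Sigma)$ as expected value, hence the result.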
We now bound each term of Lemma 9. First, we bound Equation 66. The term (66) is the well-known Hotelling's $T^2$ statistic, as recalled by the following lemma.
Accordingly, as the Fisher distribution converges towards a $\chi^2$ distribution with D degrees of freedom, it follows that the quantity (66) belongs to $O\left(\frac{DQ}{N}\right)$. Second, we bound Equation 65. The terms of Equation 65 are upper bounded in the following theorem.
Theorem 11. Suppose that the leakage follows a Gaussian distribution with Σ = I, and that $\|\hat{\Sigma} - I\|_* \le 1/2$. Then the first following inequality always holds true, and there exists a constant C such that for all δ > 0 and for all $N \ge 4C^2 \log\left(\frac{2}{\delta}\right) D$ the second following inequality holds with probability at least 1 − δ:

The proof of this theorem relies on the following technical lemmas.
Lemma 11 (Basic linear algebra). Let $A, B \in R^{D \times D}$ be symmetric matrices. Then,

Lemma 12. For all x ∈ (−1, 1), we have $0 \le x - \log(1 + x) \le \frac{x^2}{1+x}$.

Proof. It is widely known that $\frac{x}{1+x} \le \log(1 + x) \le x$. Multiplying by −1 and adding x, we get the result.
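The elementary logarithm inequality used in this proof, and the bound obtained from it after multiplying by −1 and adding x, can be checked on a grid of points of (−1, 1). This is a numerical sanity check only; the grid, the tolerance and the function name are arbitrary assumptions.

```python
import math


def check_log_inequalities(n_points: int = 199, tol: float = 1e-12) -> bool:
    """Check x/(1+x) <= log(1+x) <= x and 0 <= x - log(1+x) <= x^2/(1+x).

    The inequalities are tested on n_points equally spaced values of x
    strictly inside (-1, 1), with a small tolerance for rounding errors.
    """
    for k in range(1, n_points + 1):
        x = -1.0 + 2.0 * k / (n_points + 1)
        log_term = math.log1p(x)
        gap = x - log_term
        if not (x / (1.0 + x) <= log_term + tol and log_term <= x + tol):
            return False
        if not (-tol <= gap <= x * x / (1.0 + x) + tol):
            return False
    return True


print(check_log_inequalities())
```

The upper bound $x^2/(1+x)$ degrades as x approaches −1, which is consistent with the need for an assumption keeping the eigenvalues of the estimated covariance matrix away from 0.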
We are now ready to demonstrate the desired result. The proof comes in two parts. First, in Lemma 13, we upper bound the quantity of interest in terms of the spectral norm of the estimation error of the covariance matrix. Then, we invoke Theorem 12 to upper bound the latter spectral norm in terms of the parameters N/Q and D of our problem.
Lemma 13.Let Σ be an empirical covariance matrix estimated from samples following the D-dimensional normal distribution with zero mean and the identity I as a covariance matrix.Then, if Proof.First, we rephrase the first two terms of the KL divergence in terms of eigenvalues λ 1 ≥ . . .≥ λ D of Σ.Since Σ is a positive symmetric matrix, we know that λ D is non-negative.Moreover, by assuming that N/Q ≥ D, we know that λ D > 0 with high probability.Furthermore, Besides, using Lemma 11, Hence, we may rephrase the quantity to upper bound as follows: Using Lemma 12, the right hand-side of the latter equation is upper-bounded as follows: We then remark that if λ i is an eigenvalue of Σ, then λ i − 1 is an eigenvalue of Σ − I, where I ∈ R D×D denotes the identity matrix.As a consequence, for all 1 ≤ i ≤ D, Therefore, since by assumption Σ − I * ≤ 1/2 we have for all i Finally, combining Equation 71 with Equation 70 gives the result.
We are now reduced to bound Σ − I * , which is the purpose of the following theorem.
Proof of Theorem 11. The theorem is a direct combination of the bounds of Theorem 12 and Lemma 13.
Proof of Corollary 3. Let us now combine all the previous results.
Proof. Starting from the KL divergence of Theorem 10, we restrict ourselves to the case Σ = I using Lemma 8. Then, we get a bound on the KL divergence with Lemma 9, whose terms are themselves bounded in Lemma 10 and Theorem 11. Finally, we can see that Hotelling's $T^2$ statistic can be neglected.

Hence, the left-hand side of Equation 74 is non-negative.
Therefore, the latter bias cannot compensate the former one, which proves the tightness of our KL divergence bound (Lemma 9) in the general case.

D.2 Proofs for the Naive Bayes bound
Theorem 14. Assume that Σ = I and that $\hat{\Sigma}$ is a diagonal matrix. Then, for all δ > 0 the following inequality holds:

Proof. Since $\hat{\Sigma}$ is diagonal, $\log\det\hat{\Sigma}$ exactly coincides with the sum of the empirical log-variances estimated for each of the D time samples of the traces. Likewise, $\mathrm{Tr}(\hat{\Sigma}^{-1})$ coincides with the sum of the inverse empirical variances. Estimating the error term in Equation 75 can thus be reduced to estimating the sum of D error terms, each for a one-dimensional covariance matrix. Therefore, using Equation 68 in the particular case where D = 1, and multiplying by the true dimensionality D gives the result.
Proof of Corollary 4. The proof is almost identical to the proof of Corollary 3, using Theorem 14 instead of Theorem 11.Finally, we can see that Hotelling's T 2 statistic has the same convergence rate as Theorem 14.
Proof.Using the expression of the regret given in Lemma 15 and taking the Taylor expansion (with the notation R (β, γ) = R (p-gTA)), we have We shall prove that 1.All zero-th and first-order terms are zero and, 2. The second-order terms are bounded by constant independent of D. Applying the same change of variable as previously, we get that Since f 0 is a multivariate Gaussian with diagonal covariance matrix, L i is independent of L 1 for all 1 < i ≤ D, and furthermore the mean of L i is zero.Therefore, for such i = 1, e ∆e 1 L 1 + e ∆e 1 L = 0 .For the remaining case where i = 1, observe that /2 e ∆x/2 + e −∆x/2 dx for some constant K. Since the latter integrand is an even function of R, the integral equals 0.
Using the change of variable f 1 (l) = f 0 (−l), this gives For 1 ≤ i < j ≤ D, the right hand-side of Equation 82 is zero since L j is independent of L i and L 1 , and furthermore the mean of L j is zero.For 1 < i = j ≤ D the right hand-side is positive and can be upper bounded by E L∼f0 L 2 i = 1.In the last case i = j = 1, the second derivative of the regret is also positive and reduces to where the last integral is equal to 1 (it is the variance of a standard normal distribution).Therefore, the following bounds hold: Similarly to Equation 82, it can be shown that for 1 ≤ j ≤ D we have Lje ∆L 1 (1 + e ∆L 1 ) 2 . (84) For j > 1 the latter partial derivative equals zero since L j is independent of L 1 and has zero mean.For j = 1, using a reasonning similar to Equation 83, we get that ∂ 2 ∂γ∂β1 R (0, 0) ≤ 0. Let us now look for a lower bound: Finally, we have that We deduce from Equation 85 that ∂ 2 ∂γ 2 R (0, 0) ≤ 1.

Figure 1: True distributions (continuous lines) and models (dashed lines) trained with 20 samples for each of the 4 classes (i.e. n = 2 bits). The X axis is the value of the leakage and the Y axis is its probability density.

Figure 2: gPI, gHI and MI (Y axis, in bits) for a 2-bit masked variable as a function of the number of traces used to train the Gaussian model (X axis).

Figure 4: Illustration of security evaluation results.

Figure 5: Convergence of information metrics. In the upper part of the figure, the dotted lines represent the TI while the solid lines represent the PI.
Gap trend vs. profiling complexity.

Proof of Theorem 1.
It is worth reminding that the left inequality of Equation 4 has already been shown by Bronchain et al. [BHM + 19, Thm.5].Nevertheless, we provide here a simpler alternative proof, by taking inspiration from the work of Paninski [Pan03, Prop.1] with slight modifications adapted to our context, thereby showing the right inequality.First, we note that the eHI can be restated as follows: ) Now, using McDiarmid's inequality [AK01, Thm.1], we have that for all > 0 Pr H(L) − E H(L) very same inequality holds to upper bound H(Y, L) − E H(Y, L) .Hence, for all > 0 Pr

Figure 7: Illustration of the pseudo-shattering by the set F of affine functions of L = R. The tuples denote the different values of b. $\{l_1, l_2\}$ is pseudo-shattered by F, while $\{l_1, l_2, l_3\}$ is not.
m∈H PI(Y ; L; m) − ∆ m ẽN + sup m∈H PI(Y ; L; m) − ∆ m ẽN where the bound on the first term comes from Theorem 6 and the bound on the second term follows from the definition of TI N (Y ; L; A H ). Next, we prove the second inequality TI N (Y ; L; A H ) − LI(Y ; L; H) = (TI N (Y ; L; A H ) − PI(Y ; L; m N )) − (LI(Y ; L; H) − PI(Y ; L; m N )) ≤ sup m∈H PI(Y ; L; m) − ∆ m ẽN − 0 where the bound on the second term follows from the definition of the LI.

C.1 Convergence of the PI

Theorem 9 ([Meh17, Thm. 1], restated). Let H = {m θ : θ ∈ H } such that θ ∈ H ⊂ $R^P$ is a convex set satisfying $\sup_{\theta, \theta'} \|\theta - \theta'\|_2 \le T$. Suppose, for all (y, l) ∈ Y × L, that the mapping θ ↦ log(m(y | l)) is U-Lipschitz. Suppose that the true leakage model p belongs to H and that for all y ∈ Y, l ∈ L, m ∈ H, $\log\frac{m(y|l)}{p(y|l)} \le B$. Then, if N ≥ 5, with probability at least 1 − δ, the $TI_N$-maximizer returns a model $m_N$ such that MI(Y ; L) − PI(Y ; L; m N ) ≤ 1 N 8B P log(16U T N )

Remark 3. In Theorem 9, we assumed that the true leakage model belongs to the hypothesis class. Such a requirement can often be relaxed [vEGM+15, Example 2.2], up to a multiplicative constant in the convergence rates.

Injecting the right-hand side of Equation 61 into that of Equation 58, we get that the Lipschitz constant is upper bounded by $U = \sqrt{Q}\,R\,S^L$. Finally, since p ∈ H, we may combine Equation 57, Equation 58 and Equation 61 to get that $\log\frac{m(y|l)}{p(y|l)} \le B = 2Q^{3/2} R L S^{L+1}$. Putting all together into Equation 55 gives the desired result.

C.2 Convergence of the $TI_N$

Proposition 3. Let H be a finite hypothesis class such that any model m ∈ H returns a probability distribution such that for any secret hypothesis y, − log m[y] ≤ B, for some positive B. Assume that the true model p belongs to H. Then E S N [TG N (Y ; L

D.1.1 Proof of Tightness

Theorem 13 ([CLZ15, Cor. 1]). For all $\Sigma \in R^{D \times D}$, the log determinant of Σ, estimated for N samples drawn from a multivariate Gaussian distribution of covariance matrix Σ,

Theorem 13 is an analogue of the Central Limit Theorem for the log-det term with a $\Theta\left(\frac{QD^2}{N}\right)$ positive bias. The following lemma shows that the bias from the trace of the inverse covariance matrix is positive.

Lemma 14. The trace of the inverse empirical covariance matrix is positively biased: $E\left[\mathrm{Tr}(\hat{\Sigma}^{-1})\right] - D \ge 0$. (74)

Proof. For any symmetric positive matrix such as $\hat{\Sigma}$, the mapping $\hat{\Sigma} \mapsto \mathrm{Tr}(\hat{\Sigma}^{-1})$ is convex [BV14, Ex. 3.18]. Using Jensen's inequality, we get $E\left[\mathrm{Tr}(\hat{\Sigma}^{-1})\right] \ge \mathrm{Tr}\left(E[\hat{\Sigma}]^{-1}\right) = \mathrm{Tr}(I_D) = D$.
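The Jensen argument can be illustrated in the one-dimensional case D = 1, where the trace of the inverse empirical covariance matrix reduces to the inverse of the empirical variance. The sketch below is a Monte Carlo illustration under assumed parameters (standard normal leakage, 10 samples per estimate, fixed seed); the function name is hypothetical.

```python
import random
from statistics import mean, variance


def inverse_variance_bias(n_samples: int, n_trials: int, seed: int = 0):
    """Compare the average of 1/var_hat with 1/(average of var_hat).

    Each trial draws n_samples standard normal values and computes the
    unbiased sample variance, whose expectation is 1 (the case D = 1).
    """
    rng = random.Random(seed)
    variances = []
    for _ in range(n_trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        variances.append(variance(xs))
    mean_of_inverse = mean([1.0 / v for v in variances])
    inverse_of_mean = 1.0 / mean(variances)
    return mean_of_inverse, inverse_of_mean


mean_of_inverse, inverse_of_mean = inverse_variance_bias(10, 2000)
print(mean_of_inverse, inverse_of_mean)
```

The first value is never smaller than the second, since Jensen's inequality also holds for the empirical distribution of the estimated variances.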

e
All first-order terms are zero.First, observe that for β = 0, γ = 0, the model corresponds to the true distribution: m(y | l) = p(y | l) and thus R (0, 0) = 0. Second, let us express∂ ∂γ R (0, 0): ∆e 1 L 1 + e ∆e 1 L = 0,where we used the same change of variable as in the proof of Lemma 15 in the last line.Now, let us express ∇ β R (0, 0) β.Similarly to the derivation of ∂ ∂γ R (0, 0), we get that Lemma 1 ([vEGM + 15, Example 2.2]).Let H be a hypothesis class and let p be the true leakage model to be estimated.If p ∈ H, then the central condition holds.Van Erven et al. argue that even if p / ∈ H, this condition is often verified [vEGM + 15, Example 2.2], up to some (possibly high We are then reduced to bound the expected value of Γ.To this end, as recalled in Lemma 1, the assumption p ∈ H implies that the central condition is verified.Van Erven et al. show that this implies that the so-called Bernstein's condition is verified [vEGM + 15, p. 1829]. LI(Y ; L; H) .Notice that by definition, E N and PI(Y ; L; A H ) ≤ LI(Y ; L; H), Γ ≥ 0.