Efficiency through Diversity in Ensemble Models applied to Side-Channel Attacks

Deep Learning based Side-Channel Attacks (DL-SCA) are considered fundamental threats against secure cryptographic implementations. Side-channel attacks aim to recover a secret key using the smallest possible number of leakage traces. In DL-SCA, this often translates into building a model with the highest possible accuracy. Increasing an attack's accuracy is particularly important when an attacker targets public-key cryptographic implementations, where the recovery of each secret key bit is directly related to the model's accuracy. Commonly used in the deep learning field, ensemble models are a well-suited method that combines the predictions of multiple models to increase the ensemble accuracy by reducing the correlation between their errors. Linked to this correlation, diversity is considered an indicator of the ensemble model's performance. In this paper, we propose a new loss, namely the Ensembling Loss (EL), that generates an ensemble model with increased diversity between its members. Based on the mutual information between the ensemble model and its related label, we theoretically demonstrate how the ensemble members interact during the training process. We also study how an attack's accuracy gain translates into a drastic reduction of the remaining time complexity of a side-channel attack through multiple scenarios on public-key implementations. Finally, we experimentally evaluate the benefits of our new learning metric on RSA and ECC secure implementations. The Ensembling Loss increases the performance of the ensemble model by up to 6.8%, while the remaining brute-force effort is reduced by up to 2^22 operations depending on the attack scenario.


Introduction
Side-channel analysis (SCA) is a class of cryptographic attacks in which an attacker exploits the vulnerabilities of a system by analyzing its physical properties, such as power consumption [KJJ99] or electromagnetic emissions [AARR03], to reveal secret information. Profiled attacks are among the most powerful types of SCA. In this scenario, the attacker has access to a test device whose target intermediate values are known. Because deep learning follows a very similar profiling approach, its application was naturally explored in the side-channel context [MPP16, CDP17, PHJ+19, MDP19, ZBHV19].
While most works published on Deep Learning Side-Channel Analysis (DL-SCA) target symmetric cryptographic implementations, some investigate the effectiveness of neural networks for defeating secure RSA [CCC+19] and elliptic curve implementations [WPB19, ZS19, PCBP20]. Due to a careful combination of countermeasures (e.g. message blinding, modulus randomization, exponent/scalar blinding, point blinding), the attacker must be able to recover more than 90% of the secret bits from a single trace [Cor99, Gir06]. Attacking a public-key implementation requires recovering each secret bit by repeating the attack. Hence, the accuracy of the attack is crucial to lower the number of remaining operations required to find the entire secret key. This focus on attack accuracy is specific to the public-key case: for symmetric implementations, the attacker aggregates the output probabilities of the model over multiple traces. Moreover, as public keys are much larger than symmetric keys, a small gain in attack accuracy drastically improves the remaining attack complexity.
In this paper, we consider the two main types of exploitation scenarios for profiled attacks on public key implementations:
• N traces exploitation - The attacker has access to N leakage traces in order to recover the secret exponent d. This use case corresponds to ECDH (Elliptic Curve Diffie-Hellman) or RSA signature computations when the exponent/scalar blinding countermeasure is applied.
• 1 trace exploitation - The attacker has access to only 1 leakage trace in order to recover the secret exponent d. This use case corresponds to ECDSA (Elliptic Curve Digital Signature Algorithm) targeting the scalar multiplication with a random nonce.
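To make the role of per-bit accuracy concrete, the following minimal sketch assumes that every secret bit is classified independently with the same accuracy. This independence assumption, the function names and the 256-bit length are illustrative choices and are not taken from the experiments of this paper.

```python
def expected_wrong_bits(n_bits: int, accuracy: float) -> float:
    """Expected number of misclassified bits under independent errors."""
    return n_bits * (1.0 - accuracy)


def prob_all_bits_correct(n_bits: int, accuracy: float) -> float:
    """Probability that every bit is recovered under independent errors."""
    return accuracy ** n_bits


if __name__ == "__main__":
    # Hypothetical 256-bit secret and three illustrative accuracies.
    for acc in (0.90, 0.99, 0.999):
        print(acc, expected_wrong_bits(256, acc), prob_all_bits_correct(256, acc))
```

Under this simplified model, a 256-bit secret attacked with 99% per-bit accuracy is fully recovered with a probability of roughly 8%, against roughly 77% at 99.9%, which illustrates why a small accuracy gain matters for large keys.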
In machine learning, ensemble methods combine individual predictions from all members of a pool via a consensus method (e.g. majority vote, average, ...) [HS90, Kun04, Zho12]. These approaches are useful when the members of the committee make uncorrelated errors: a simple consensus method can then efficiently reduce the global error of the system. In practice, however, the errors induced by the committee members are correlated, which makes reducing the overall ensemble error hard. One solution to reduce this correlation is to promote diversity among the members in order to reduce the global error and increase, to some extent, the ensemble performance [Die00b]. Following Liu et al. [LWC+19], three ways exist to create diversity in ensemble learning. Type I diversity corresponds to the variety of the committee members' structure (e.g. network architecture, optimizer hyperparameters, ...). This classical diversity was studied by Perin et al. in the side-channel context [PCP20]. The authors provide experimental results on symmetric algorithm implementations and show that combining the predictions of multiple neural networks improves performance. Type II diversity carefully chooses the networks to promote error independence between the classifiers in the ensemble. Finally, Type III diversity captures the posterior probability distribution during the training process by maximizing the diversity between the learners to encourage convergence towards different hypotheses. Combining these types of diversity can help generate a powerful ensemble model.
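The benefit of uncorrelated errors can be illustrated with a small sketch. It assumes a binary problem, an odd number of members and fully independent members that are each correct with the same probability; these are idealized assumptions, and the function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_members: int, p: float) -> float:
    """Accuracy of a majority vote over n_members independent binary
    classifiers, each correct with probability p (n_members must be odd)."""
    if n_members % 2 == 0:
        raise ValueError("use an odd number of members to avoid ties")
    threshold = n_members // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_members, k) * p**k * (1.0 - p) ** (n_members - k)
        for k in range(threshold, n_members + 1)
    )
```

With independent members of accuracy 0.8, three voters already reach 0.896. When the errors are correlated, as is usually the case in practice, the gain is smaller, which motivates the diversity types discussed above.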

Contributions.
Our paper extends the preliminary results of ensemble methods in the side-channel context [PCP20] by providing theoretical observations and new propositions to increase the impact of ensembling in SCA. More precisely, our work mainly focuses on the type III diversity which has not been studied in the SCA literature to the best of our knowledge.
First, we propose a new loss, namely the Ensembling Loss (EL), that maximizes the mutual information between the ensemble model and the sensitive information. This contribution, derived from [Bro09] and [ZBD+20], tends to maximize the type III diversity between the committee members during the training process in order to ensure an ensemble of diverse members. Hence, the Ensembling Loss can be used in addition to types I and II diversity to increase the performance of an ensemble model.
Our theoretical observations are validated on two public datasets: a secure RSA implementation with exponent blinding [CCC+19] and a protected ECSM (Elliptic Curve Scalar Multiplication) implementation [NCOS17, Chm20] with scalar randomization. Each of these datasets corresponds to one of the exploitation scenarios detailed above.
While the goal of this paper is not to compare the benefits of the Ensembling Loss with the other diversity types, we combine the proposed loss (i.e. type III diversity) with types I and II diversity to evaluate its impact on the ensemble model's diversity.
Finally, ensemble methods are well known to increase the performance of a model regardless of the training process. Hence, using these techniques could have a large impact on the training time. To support the relevance of the Ensembling Loss, we evaluate the impact of the accuracy gain on the remaining complexity of a side-channel attack and on the resulting training time. We study different remaining complexity methods for public keys: the naive complexity, the 2^n-complexity and the complexity of the Alternate Attack [SW14, SW17]. These wide-ranging scenarios illustrate that the increase in training time is negligible compared to the major improvement in attack complexity. To evaluate the practicability of our result, we consider the European SOG-IS scheme as a reference to support the benefits of the Ensembling Loss.
The loss proposed in this paper can also be applied to symmetric cryptographic implementations and, more generally, to all types of machine learning problems where a gain in accuracy is crucial.
Paper Organization. The paper is organized as follows. Section 2 recalls the learning metrics introduced in the side-channel context. It also explains the relationship between ensemble models and diversity. Section 3 proposes a new loss, called Ensembling Loss (EL), which generates an ensemble model converging towards the mutual information between a pool of classifiers and a set of labels. Section 4 presents the datasets used to validate the theoretical observations and the side-channel complexity measures. Section 5 illustrates the benefits of the ensembling loss through experimental results. Finally, Section 6 extends the ensembling loss to a binary classification problem and discusses traditional methods used in ensemble learning (combination methods, ensemble methods, impact of the number of committee members, ...).

Notation and terminology
Let calligraphic letters $\mathcal{X}$ denote sets, the corresponding capital letters $X$ (resp. bold capital letters $\mathbf{T}$) denote random variables (resp. random vectors) and the lowercase letters $x$ (resp. $\mathbf{t}$) denote their realizations. The $i$-th entry of a vector $\mathbf{t}$ is defined as $\mathbf{t}[i]$. Side-channel traces are constructed as a random vector $\mathbf{T} \in \mathbb{R}^{1 \times D}$, where $D$ defines the dimension of each trace. The targeted sensitive variable is $Z = f(P, K)$, where $f$ denotes a cryptographic primitive, $P \in \mathcal{P}$ denotes a public variable (e.g. plaintext or ciphertext) and $K \in \mathcal{K}$ denotes a part of the key (e.g. byte) that an adversary tries to retrieve. $Z$ takes values in $\mathcal{Z} = \{s_1, ..., s_{|\mathcal{Z}|}\}$ such that $s_j$ denotes a score associated with the $j$-th sensitive variable. Let $k^*$ denote the secret key used by the cryptographic algorithm.
We define the following information theory quantities needed in the rest of the paper [CT91]. The entropy of a random vector $X$, denoted $H(X)$, measures the unpredictability of a realization $x$ of $X$. It is defined by:
\[ H(X) = -\sum_{x \in \mathcal{X}} \Pr[X = x] \log_2 \Pr[X = x]. \]
The conditional entropy of a random variable $X$ knowing $Y$ is defined by:
\[ H(X|Y) = -\sum_{y \in \mathcal{Y}} \Pr[Y = y] \sum_{x \in \mathcal{X}} \Pr[X = x | Y = y] \log_2 \Pr[X = x | Y = y]. \]
The Mutual Information (MI) between two random variables $X$ and $Y$ quantifies how much information can be extracted about $Y$ by observing $X$ and is defined as:
\[ MI(X; Y) = H(Y) - H(Y|X). \qquad (1) \]
Introduced by McGill [McG54], interaction information is a multivariate generalization of mutual information for measuring dependence among multiple variables. The interaction information $MI(\{X_0, X_1, \cdots, X_n\})$ between $n + 1$ random variables $\{X_0, X_1, \cdots, X_n\}$, denoted as $\{X_{0:n}\}$ in the following sections, and the conditional interaction information $MI(\{X_{0:n}\}|Y)$ are respectively defined as: [...]
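As a small numerical companion to these definitions, the sketch below computes the entropy, the conditional entropy and the mutual information of two discrete variables from a joint probability table. It assumes base-2 logarithms (quantities in bits) and a joint distribution given as a dictionary; the function names are hypothetical.

```python
from math import log2


def entropy(pmf):
    """Shannon entropy (in bits) of a probability mass function given as a dict."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)


def marginals(joint):
    """Marginal pmfs of X and Y from a joint pmf {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py


def conditional_entropy(joint):
    """H(X|Y) computed as H(X, Y) - H(Y), in bits."""
    _, py = marginals(joint)
    return entropy(joint) - entropy(py)


def mutual_information(joint):
    """MI(X; Y) computed as H(X) - H(X|Y), in bits."""
    px, _ = marginals(joint)
    return entropy(px) - conditional_entropy(joint)
```

For two independent fair bits the mutual information is 0, whereas for two identical fair bits it equals the full entropy of 1 bit.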

Learning Losses in Side-Channel Analysis
Profiled SCA can be formulated as a classification problem. Given an input, a neural network constructs a function $F_\theta : \mathbb{R}^D \to \mathbb{R}^{|\mathcal{Z}|}$ that computes an output called a prediction. During the training process, a set of parameters θ, called trainable parameters, is updated in order to generate the model. To solve a classification problem, the function $F_\theta$ must find the right prediction $z \in \mathcal{Z}$ associated with the input $\mathbf{t}$ with high confidence. To find the optimized solution, a neural network has to be trained using a profiling set of $N_p$ pairs $(\mathbf{t}_i^p, z_i^p)$, where $\mathbf{t}_i^p$ is the $i$-th profiling input and $z_i^p$ is the associated label. In SCA, the input of a neural network is a side-channel measurement and the related label is defined by the corresponding sensitive value. The input goes through the network, which returns a distribution that quantifies the probability of observing each hypothetical sensitive value. As in a classical profiling attack, we can use the resulting probability distribution to compute the score for each key hypothesis and predict the correct targeted secret. To quantify the classification error of $F_\theta$ over the profiling set, a loss function has to be configured. Minimizing this function reduces the error of the model in order to optimize the prediction. For that purpose, backward propagation [GBC16] is applied to update the trainable parameters (e.g. weights) and minimize the loss function. The classical loss function used in side-channel analysis is based on cross-entropy.
Definition 1 (Cross-Entropy). Given a joint probability distribution of a sensitive cryptographic primitive $Z$ and corresponding leakage $\mathbf{T}$ denoted as $\Pr[\mathbf{T}, Z]$, we define the Cross-Entropy of a deep learning model $F_\theta$ as:
\[ \mathrm{CE}(F_\theta) = \mathbb{E}_{\mathbf{T}, Z}\left[ -\log_2 F_\theta(\mathbf{T})[Z] \right]. \]
Given a profiling set $\mathcal{T}$ of $N_p$ pairs $(\mathbf{t}_i^p, z_i^p)_{1 \le i \le N_p}$ and a classifier $F_\theta$ with parameter θ, the Categorical Cross-Entropy (CCE) loss function is an estimation of the cross-entropy such that:
\[ \mathcal{L}_{CCE}(F_\theta, \mathcal{T}) = -\frac{1}{N_p} \sum_{i=1}^{N_p} \log_2 F_\theta(\mathbf{t}_i^p)[z_i^p]. \]
In other words, minimizing the categorical cross-entropy reduces the dissimilarity between the right distributions and the predicted distributions for a set of inputs. According to the Law of Large Numbers, the categorical cross-entropy loss function converges in probability towards the cross-entropy for any θ [SSBD14].
In [ZBD+20], Zaid et al. propose the ranking loss, where [...] defines the output score of the hypothesis $k \in \mathcal{K}$ for given plaintexts $(p_j)_{1 \le j \le N_a}$, while α denotes a hyperparameter of the sigmoid, which approximates the identity function needed for estimating the success rate. For convenience, $k^*$ is used to denote the class related to the correct label. Hence, $k^*$ is also used to define the class associated with $f(p, k^*)$ such that $f$ is a cryptographic primitive and $p$ characterizes a plaintext value.
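The categorical cross-entropy of Definition 1 can be sketched as the average negative log-probability that the model assigns to the true label. The sketch below assumes base-2 logarithms and plain Python lists standing in for the network outputs; the function name and the clipping constant are hypothetical implementation choices.

```python
from math import log2


def categorical_cross_entropy(predictions, labels):
    """Average of -log2 of the probability assigned to the true label.

    predictions: list of probability vectors, one per profiling trace,
                 standing in for the outputs of the classifier.
    labels:      list of integer class indices (the true labels).
    """
    eps = 1e-12  # guard against log2(0); an implementation choice
    total = 0.0
    for probs, z in zip(predictions, labels):
        total -= log2(max(probs[z], eps))
    return total / len(labels)
```

A model that assigns probability 1 to every true label reaches a loss of 0, and the loss grows as probability mass moves to wrong classes.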
As illustrated in [ZBD+20], the selection of α greatly impacts the training process. Typically, α ∈ {0.001, 0.01, 0.1, 1}. The authors demonstrate that the ranking loss maximizes the success rate for a given number of attack traces. Thus, this learning metric generates a model converging towards the optimal distinguisher introduced in [HRG14, BGH+17]. In the worst case, a model using the ranking loss function is as efficient as a model trained with the categorical cross-entropy [ZBD+20].
In [PCP20], Perin et al. use the categorical cross-entropy to evaluate the benefits of ensemble models applied to side-channel attacks against symmetric cryptographic implementations. In the next section, we explain the theoretical reasons why ensembling techniques are effective and introduce the diversity term that is essential to generate a powerful and efficient ensemble model.

Ensemble Models: A Source of Diversity
Reduction of the Global Error. In [TG96a, TG96b], Tumer and Ghosh provide theoretical observations for analyzing the interest of ensembling to solve a classification problem. They analyze the classification errors that are added to the Bayes error (i.e. the lowest possible error rate for any classifier of a random outcome) for an ensemble committee. Let $\mathcal{F} = \{F_{\theta_0}, F_{\theta_1}, \cdots, F_{\theta_{N_c-1}}\}$ be a set (or committee) of $N_c$ classifiers (or members) with trainable parameters $(\theta_n)_{0 \le n < N_c}$ and $E_{add}$ be the expected added error of the individual classifiers included in $\mathcal{F}$. In the following, $F_{\theta_n}$ will be denoted as $F_n$. The classifiers are assumed to have the same error. Tumer and Ghosh show that the expected added error of the ensemble committee, denoted $E_{add,ens}$, is:
\[ E_{add,ens} = E_{add} \cdot \frac{1 + \delta (N_c - 1)}{N_c}, \qquad (3) \]
where δ is a correlation factor that quantifies the error dependence among the classifiers and $N_c$ is the number of classifiers (or members) in $\mathcal{F}$.
Footnote (output score in the ranking loss): In [ZBD+20], the output score denotes the value before the softmax function. This choice is made so that the training process is driven by the relative order of the key hypotheses' relevance instead of the normalized probability distribution.
From Equation 3, we can easily evaluate the benefits of using ensemble methods to reduce the global error. If δ is 0, then the errors induced by the classifiers are independent and the ensemble expected added error is divided by $N_c$. Therefore, the global error will be $N_c$ times smaller than the individual error provided by each classifier included in $\mathcal{F}$. On the other hand, if δ is 1, the errors induced by the classifiers are fully correlated and $E_{add,ens}$ equals the average error of each classifier. To ensure uncorrelated errors, the classifiers included in the ensemble model must be diverse [Die00b].
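The two limit cases discussed above can be checked numerically. The sketch below uses the closed form usually attributed to Tumer and Ghosh, E_add (1 + δ (N_c − 1)) / N_c, which matches both limits described in the text; the function name and the numerical values are illustrative.

```python
def ensemble_added_error(e_add: float, delta: float, n_members: int) -> float:
    """Expected added error of an averaging ensemble of n_members classifiers
    that share the same individual added error e_add and whose errors have
    a correlation factor delta between 0 and 1."""
    return e_add * (1.0 + delta * (n_members - 1)) / n_members


if __name__ == "__main__":
    # Hypothetical individual added error of 0.10 with 5 members.
    for delta in (0.0, 0.5, 1.0):
        print(delta, ensemble_added_error(0.10, delta, 5))
```

Intermediate values of δ interpolate linearly between the two extremes, so any reduction of the error correlation directly lowers the ensemble error.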
Ensemble Diversity Definitions. Diversity has been recognized as a very important concept in classifier combination [CC00, Lam00]. However, in the machine learning literature, there is no strict common definition of what is perceived as diversity [Kun04]. For example, bagging [Bre96] and boosting [FS96] manipulate input data to promote diversity by choosing different subsets of inputs during the training process. In our paper, we define diversity as follows:
Definition 3 (Diversity). Given an ensemble model $\mathcal{F}$ composed of $N_c$ committee members $(F_n)_{0 \le n < N_c}$, we define the diversity as the quantity measuring the difference in terms of prediction among the committee members.
This definition is not new and was already considered by the machine learning community (e.g. majority vote [MHA14], PAC-Bayesian theory [GMGA17], ...). From Definition 3, increasing the diversity reduces the overall ensemble error by distributing the wrong hypotheses uniformly once the combination of individual predictions is performed. In [FR05], Fumera and Roli found that the performance of ensembles depends on the performance of individual classifiers and their correlation. To efficiently promote the ensemble diversity, the output of the ensemble model $\mathcal{F}$ can be decomposed into three categories [XKS92, Kun04]. Let $\mathcal{F} = \{F_0, F_1, \cdots, F_{N_c-1}\}$ be a set (or committee) of $N_c$ classifiers (or members) and $\mathcal{C} = \{c_0, c_1, \cdots, c_{|\mathcal{K}|-1}\}$ be a set of $|\mathcal{K}|$ labels (or classes). For a given input $\mathbf{t}$, we can define these categories as follows:
• Abstract level: the output of each classifier $F_n(\mathbf{t})$, denoted $s_n$, is included in $\mathcal{C}$. Thus, the $N_c$ classifier outputs define a vector $\mathbf{s} = [s_0, s_1, \cdots, s_{N_c-1}]^T \in \mathcal{C}^{N_c}$ that characterizes the output of $\mathcal{F}$.
• Measurement level: the output of each classifier $F_n(\mathbf{t})$ is a vector of $|\mathcal{K}|$ posterior probabilities, one for each class of $\mathcal{C}$.
• Oracle level: the output of $F_n(\mathbf{t})$ is 1 if $\mathbf{t}$ is correctly classified by $F_n$, and $F_n(\mathbf{t}) = 0$ otherwise. This representation is called oracle because we have to know the label for each input in order to configure the output.
The measurement level contains the highest amount of information while the abstract level contains the lowest [XKS92]. In this paper, we want to precisely measure the diversity between each classifier of the committee. For that purpose, we focus only on the posterior probability representation to evaluate the performance and the diversity of an ensemble model $\mathcal{F}$. These probabilities will be combined following the Average Method [XKS92] to define the overall performance of $\mathcal{F}$, but a comparison will also be provided with Voting in Section 6.
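The difference between these output levels, and between the Average Method and voting, can be sketched on a toy example. The three posterior vectors below are invented for illustration, and the helper names are hypothetical.

```python
def abstract_level(probabilities):
    """Abstract level: index of the most likely class for one classifier."""
    return max(range(len(probabilities)), key=lambda k: probabilities[k])


def oracle_level(probabilities, true_label):
    """Oracle level: 1 if the classifier is correct on this input, else 0."""
    return 1 if abstract_level(probabilities) == true_label else 0


def average_method(member_probabilities):
    """Average the posterior probabilities of all members, class by class."""
    n_members = len(member_probabilities)
    n_classes = len(member_probabilities[0])
    return [
        sum(member[k] for member in member_probabilities) / n_members
        for k in range(n_classes)
    ]


def majority_vote(member_probabilities):
    """Vote on abstract-level outputs; ties go to the smallest class index."""
    votes = [abstract_level(p) for p in member_probabilities]
    return max(sorted(set(votes)), key=votes.count)


# Toy posteriors of three members over three classes (illustrative values).
members = [[0.40, 0.35, 0.25], [0.45, 0.40, 0.15], [0.05, 0.90, 0.05]]
```

On this example the vote selects class 0, because two members narrowly prefer it, while the average selects class 1, because the third member is much more confident. This is one way to see why the posterior probability representation carries more information than the abstract level.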
Diversity methods are numerous and can be hard to categorize. In [LWC+19], Liu et al. decompose diversity into three categories:
• Type I diversity characterizes the variety of the committee members' structure, such as network architecture (e.g. MLP, CNN, RNN, ResNets, ...), weight initialization, training dataset and optimizer hyperparameters (e.g. optimizer algorithm, learning rate, number of epochs, ...).
• Type II diversity selects, from a pool of learners, a subset of members that minimizes their error correlation. Hence, the resulting ensemble model promotes independence between the members and tends to reduce the overall error.
• Type III diversity forces the set of learners $\mathcal{F}$ to decorrelate the errors generated by each committee member during the training process. Hence, an error decorrelation penalty term is incorporated in the loss function to create complementary members that reduce the overall error.
The type II and type III diversities are both defined and quantified based on the disagreement among ensemble members. While the type II diversity captures the disagreement measure of each committee member after the training process in order to select a subset of learners, the type III diversity considers the posterior probability representation to create and promote interactions during the profiling phase. Hence, even if an ensemble model is composed of learners with a high disagreement measure, applying the type III diversity is useful to penalize the remaining error correlation. In this paper, we propose a new loss promoting diversity during the training process (i.e. the type III diversity). This metric is based on the mutual information between an ensemble model and its related labels. The next section introduces the concept of mutual information ensemble diversity as the foundation of our proposition. In addition, to efficiently evaluate the overall benefits of using ensemble methods, we combine all types of diversity in Section 5.3.
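A type II selection step can be sketched with a simple pairwise disagreement measure computed after training. The exhaustive search and the choice of mean pairwise disagreement as the selection criterion are assumptions made for illustration; the function names are hypothetical and the sketch ignores the individual accuracy of the members.

```python
from itertools import combinations


def disagreement(labels_a, labels_b):
    """Fraction of inputs on which two classifiers predict different labels."""
    differing = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return differing / len(labels_a)


def most_diverse_subset(member_labels, subset_size):
    """Pick the subset (size >= 2) with the highest mean pairwise disagreement.

    Exhaustive search over all subsets: only practical for small pools.
    """
    def mean_disagreement(subset):
        pairs = list(combinations(subset, 2))
        return sum(disagreement(member_labels[i], member_labels[j])
                   for i, j in pairs) / len(pairs)

    candidates = combinations(range(len(member_labels)), subset_size)
    return max(candidates, key=mean_disagreement)
```

Such a selection only acts once training is finished, whereas the type III diversity studied in this paper shapes the members while they are being trained.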

Mutual Information Ensemble Diversity
Type III diversity can be characterized by the application of a specific loss function promoting the diversity between committee members. Unlike the correlation that is classically employed to measure the similarity between two entities, the mutual information captures non-linear statistical dependencies between variables. Hence, this measurement can be used as a real source of dependence information [KA14]. In [Bro09], Brown evaluates the benefits of using mutual information to improve ensemble models. He rewrites the ensemble problem as a communication channel problem. From an information theoretical point of view, let $Y$ be a message sent through a communication channel and $X$ be the received value such that $X$ should be decoded to recover the input message $Y$. For that purpose, a decoding function $g(\cdot)$ is defined such that an estimation of the message can be written as $\hat{Y} = g(X)$. From a machine learning perspective, $X$ is the set of features characterizing the input of a learner $g(\cdot)$ and $Y$ is the true unknown label. During the training process, we want to minimize the error probability $\Pr[g(X) \neq Y]$. For any classifier $g$, [Fan61, HR70] provide theoretical bounds for $\Pr[g(X) \neq Y]$ such that:
\[ \frac{H(Y) - MI(X; Y) - 1}{\log_2 |\mathcal{Y}|} \le \Pr[g(X) \neq Y] \le \frac{1}{2} H(Y|X). \qquad (4) \]
Since $H(Y|X) = H(Y) - MI(X; Y)$, both bounds decrease when the mutual information grows. Hence, to minimize $\Pr[g(X) \neq Y]$, we have to maximize the mutual information between $X$ and $Y$. In [Bro09], Brown proposes a solution to compute the mutual information between an ensemble model $\mathcal{F}$ and a set of true unknown labels $Y$.
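The effect of the mutual information on these bounds can be illustrated numerically. The sketch below assumes the usual statement of the Fano lower bound and of the Hellman-Raviv upper bound with entropies expressed in bits; the function name and the three-class setting are illustrative.

```python
from math import log2


def error_bounds(h_y: float, mi_xy: float, n_classes: int):
    """Lower (Fano) and upper (Hellman-Raviv) bounds on the error probability.

    h_y and mi_xy are expressed in bits; n_classes is the number of labels
    and must be at least 2. The lower bound is clipped at 0 because an
    error probability cannot be negative.
    """
    lower = (h_y - mi_xy - 1.0) / log2(n_classes)
    upper = (h_y - mi_xy) / 2.0  # H(Y|X) = H(Y) - MI(X; Y)
    return max(lower, 0.0), upper
```

For three equiprobable classes, increasing the mutual information from 0 to 1 bit lowers the upper bound from about 0.79 to about 0.29. The upper bound is usually stated for the Bayes-optimal decision rule, so it should be read as a guarantee on the best achievable classifier.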
Definition 4 (Mutual Information Ensemble Diversity [Bro09, ZL10]). Given an ensemble model $\mathcal{F}$ composed of $N_c$ committee members $(F_n)_{0 \le n < N_c}$ and a sensitive cryptographic primitive $Z$, we define the mutual information ensemble diversity as:
\[ MI(F_{0:N_c-1}; Z) = \sum_{n=0}^{N_c-1} MI(F_n; Z) - \sum_{n=1}^{N_c-1} MI(F_n; F_{0:n-1}) + \sum_{n=1}^{N_c-1} MI(F_n; F_{0:n-1} | Z), \qquad (5) \]
where $MI(F_n; Z)$ is called relevancy, $MI(F_n; F_{0:n-1})$ defines the redundancy and $MI(F_n; F_{0:n-1} | Z)$ characterizes the conditional redundancy.
The relevancy computes the mutual information between the $n$-th classifier of $\mathcal{F}$ and the target $Z$. The redundancy is independent of the class label $Z$ and measures the interactions between all the classifiers. Hence, a large $\sum_{n=1}^{N_c-1} MI(F_n; F_{0:n-1})$ indicates strong correlations between the classifiers. Finally, $\sum_{n=1}^{N_c-1} MI(F_n; F_{0:n-1} | Z)$ indicates that a strong class-conditional correlation is needed to obtain an efficient ensemble model. However, from a practical perspective, it is quite difficult to estimate higher-order interaction information, and there is currently no effective computational approach in the literature. Hence, Brown proposes to simplify Equation 5 by considering only pairwise components as follows [Bro09]:
\[ MI(F_{0:N_c-1}; Z) \approx \sum_{n=0}^{N_c-1} MI(F_n; Z) - \sum_{n=0}^{N_c-1} \sum_{m=n+1}^{N_c-1} MI(F_n; F_m) + \sum_{n=0}^{N_c-1} \sum_{m=n+1}^{N_c-1} MI(F_n; F_m | Z), \qquad (6) \]
where $MI(F_n; Z)$ computes the mutual information between the $n$-th classifier $F_n$ and the target $Z$, $MI(F_n; F_m)$ measures the mutual information between two models $F_n$ and $F_m$, and $MI(F_n; F_m | Z)$ measures the redundancy between two models $F_n$ and $F_m$ knowing $Z$.
Based on the pairwise approach, Equation 6 omits higher-order components. In the next section, we propose a loss that maximizes the pairwise mutual information between a committee F and a set of labels Z during the training process.
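The pairwise decomposition can be explored empirically on discrete classifier outputs. The sketch below estimates relevancy, redundancy and conditional redundancy from label sequences with plug-in entropy estimates; it assumes that each unordered pair of members is counted once, and all function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy (bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())


def mi(a, b):
    """Empirical MI(A; B) = H(A) + H(B) - H(A, B)."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def cond_mi(a, b, z):
    """Empirical MI(A; B | Z) = H(A, Z) + H(B, Z) - H(A, B, Z) - H(Z)."""
    return (entropy(list(zip(a, z))) + entropy(list(zip(b, z)))
            - entropy(list(zip(a, b, z))) - entropy(z))


def pairwise_diversity(members, z):
    """Relevancy - redundancy + conditional redundancy over member pairs."""
    relevancy = sum(mi(f, z) for f in members)
    redundancy = 0.0
    cond_redundancy = 0.0
    for n in range(len(members)):
        for m in range(n + 1, len(members)):
            redundancy += mi(members[n], members[m])
            cond_redundancy += cond_mi(members[n], members[m], z)
    return relevancy - redundancy + cond_redundancy
```

With only two members no higher-order term exists, so the pairwise expression coincides with the exact mutual information between the pair of outputs and the label. With more members it becomes an approximation.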

Ensembling Loss: A Pairwise Ensemble Diversity Metric
This section presents our main contribution: the Ensembling Loss (EL). In Section 3.1, we first define three sub-losses, namely Relevance Loss, Conditional Redundancy Loss and Redundancy Loss, derived from the mutual information ensemble diversity. This decomposition allows us to define the Ensembling Loss as a diversity learning metric. Then, Section 3.2 validates the theoretical aspects of the ensembling loss through visualization techniques.

Mutual Information Ensemble Diversity Estimation
This section proposes a loss derived from Equation 6 in order to maximize the pairwise mutual information and the diversity between the committee classifiers. To this end, we propose three losses namely Relevance loss, Conditional Redundancy loss and Redundancy loss. In order to achieve a general-purpose estimator, we base our propositions on the characterization of the mutual information as the Kullback-Leibler (KL-) divergence [KL51] between the joint distribution and the product of the marginals.

Relevance Loss. In Equation 6, the relevance $MI(F_n; Z)$ highlights the dependence between a learner $F_n \in \mathcal{F}$ and a label $Z$. Following [ZL10, Zho12], this term gives a bound on the accuracy of the individual classifiers. Hence, a large relevance is preferred to maximize the performance of the ensemble model. In [ZBD+20], Zaid et al. propose the Ranking Loss to maximize a classical side-channel performance metric, namely the Success Rate [SMY09]. Minimizing the ranking loss is asymptotically equivalent to maximizing the mutual information between a model and its related labels, which is exactly what the relevance quantifies in [Bro09]. Thus, given a set of $N_p$ profiling traces, denoted $\mathcal{T}$, and a number $N_a$ of attack traces such that $N_a \mid N_p$, we define the Relevance Loss as:
[...] (7)
where $s$ defines the score related to the class $k$ given a set of $N_a$ traces, a classifier $F_n$ and $k^*$ the correct class. Finally, α denotes the hyperparameter of the sigmoid function that should be configured.
Minimizing Equation 7 tends to maximize the mutual information $MI(F_n; Z)$ through the minimization of the error induced by $\Pr[Z|\mathbf{t}]$. In other words, we want to penalize a model $F_n$ when the correct label $Z$ is not ranked as the highest hypothetical class. This penalization term depends on the distance between the score associated with the correct label $Z$ and the scores of the other hypotheses. From a machine learning perspective, the maximization of $MI(F_n; Z)$ tends to generate compact clusters, one for each class. If False-Positives (FP) or False-Negatives (FN) appear during the training process, the ensemble model will be overconfident in its predictions and the resulting errors could be persistent. To reduce this effect, a solution is to provide diversity in order to limit the impact of these FP and FN examples. Hence, the other losses defined below bring more diversity during the training process.
Remark 1. The relevance loss is actually the same as the ranking loss defined in [ZBD+20]. We reformulate it to facilitate the comprehension of the ensembling loss and the comparison with the mutual information ensemble diversity (see Definition 4) introduced by Brown [Bro09].
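The behaviour described for the relevance loss, a penalty that grows when wrong hypotheses score close to or above the correct one, can be sketched with a pairwise logistic penalty. The exact form used below, log2(1 + exp(−α (s(k*) − s(k)))) summed over the wrong hypotheses, is an assumption made for illustration and is not taken from the text; the function name and the default α are hypothetical.

```python
from math import exp, log2


def ranking_penalty(scores, correct_class, alpha=0.1):
    """Logistic penalty summed over every wrong hypothesis k != k*.

    Assumed form: log2(1 + exp(-alpha * (s(k*) - s(k)))), where scores are
    the pre-softmax outputs of one classifier for one input.
    """
    s_star = scores[correct_class]
    return sum(
        log2(1.0 + exp(-alpha * (s_star - s)))
        for k, s in enumerate(scores)
        if k != correct_class
    )
```

In this sketch, a larger α makes the penalty react more sharply to the score gap, which is consistent with the remark that the choice of α strongly influences training.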
Conditional Redundancy Loss. The conditional redundancy $MI(F_n; F_m | Z)$ quantifies the dependence between $F_n$ and $F_m$ given a set of labels $Z$. This mutual information helps the committee members to converge towards the correct label hypothesis with the same confidence. Maximizing $MI(F_n; F_m | Z)$ is asymptotically equivalent to minimizing the error on $\Pr[F_m | F_n, Z]$, which defines the output probability of the model $F_m$ given $F_n$ and $Z$. In other words, we want to minimize the distance between the scores of $F_n$ and $F_m$ given the correct class. Thus, for a set of $N_a$ traces, we introduce the Conditional Redundancy Loss as:
[...] (8)
where the β parameter of the sigmoid function should be configured and $s^{(n)}_{N_a,i}(k^*)$ defines the score related to the class $k^*$ given a set of $N_a$ traces and a classifier $F_n$.
Through Equation 8, we want to penalize the learning process when the score s [...]
Redundancy Loss. The redundancy $MI(F_n; F_m)$ measures the pairwise dependence between all the committee members without considering the ground truth. A large mutual information induces a strong correlation among the pairwise classifiers and promotes similarities, which is not desired when we want to construct an efficient ensemble model. Hence, we want to minimize this mutual information to improve the ensemble performance. The redundancy loss maximizes the distance between the score distributions of the models $F_n$ and $F_m$. Therefore, we propose a loss penalizing the training process when this condition does not hold. We introduce the Redundancy Loss as:
[...] (9)
where the γ parameter of the sigmoid function should be configured.
Consequently, we want to increase the uncertainty of $F_m$ given $F_n$. Through the minimization of Equation 9, we promote cluster scattering and reduce the global confidence of the committee members on the False-Positives and False-Negatives to decrease their persistence.
Ensembling Loss. We integrate the mutual information ensemble diversity during the training process to promote the diversity between the committee members. Through our individual losses provided in Equation 7, Equation 8 and Equation 9, we formulate an Ensembling Loss (EL) that maximizes an estimation of the mutual information between an ensemble F and a label Z.
Definition 5 (Ensembling Loss - Our contribution). Given a profiling set $\mathcal{T}$ of $N_p$ pairs $(\mathbf{t}_i^p, z_i^p)_{1 \le i \le N_p}$, a set of classifiers $\mathcal{F} = \{F_0, F_1, \cdots, F_{N_c-1}\}$ and a number of attack traces $N_a$ such that $N_a \mid N_p$, we define the Ensembling Loss (EL) function as:
[...]
where µ quantifies the impact of the diversity term during the training process and α (resp. β, γ) is a hyperparameter that configures the effect of the relevance loss (resp. conditional redundancy and redundancy losses).
We normalize each term of the ensembling loss to reduce the impact of exploding gradients. Appendix A highlights the benefits of each individual loss from a training perspective. Through this study, the reader can understand how the network would train if the conditional redundancy loss or the redundancy loss were used individually. Finally, in the following sections, the number of attack traces $N_a$ is set to 1 during the profiling phase, as in [ZBD+20].
[...] for all types of implementations because it highly depends on the number of classes $|\mathcal{K}|$, the noise induced in each trace, the implemented countermeasures, the targeted algorithm (e.g. RSA, ECC), etc. However, during our experiments, the tuning process was not a pitfall. Indeed, in the following section, the α, β, γ values follow the strategy defined in [ZBD+20]. Hence, they are configured in [0.001, 0.1]. In contrast, µ is not optimized in this work and always equals 1.
Remark 3. As our framework is generic, we argue that it is adequate to target private-key implementations, in particular AES [DR02] and DES [Des77] (i.e. $|\mathcal{K}| = 256$). However, the training time increases exponentially with the number of output classes $|\mathcal{K}|$. Hence, from a practical perspective, the application of the Ensembling Loss seems more suitable for low multiclass problems (i.e. $|\mathcal{K}| \le 5$), such as attacks against asymmetric implementations. Finally, even if this work only focuses on the side-channel context, the Ensembling Loss can be used to solve any machine learning problem (e.g. image classification, image recognition, fraud detection, ...).

Visual Validation of the Ensemble Diversity
Diversity among the committee members is deemed to be a key issue in ensemble learning and should reduce the global error (see Section 2.3). In this section, we want to validate the theoretical observations provided in Section 3.1. Hence, we analyze the diversity evolution depending on the loss used during the training process. The ensemble model can be trained following one of the next three processes:
• Independent learning strategy - There is no interaction among the classifiers. For example, each classifier could be trained on a different training set to reduce the features' correlation [Bre96];
• Sequential training - This process induces a set of learners that are trained sequentially on data sets with entirely different distributions;
• Simultaneous ensemble learning - A set of committee members are trained interactively to promote uncorrelation and diversity.
In this paper, we focus on the simultaneous training strategy because it allows interaction between the committee members during the training process. This strategy fits perfectly with the ensembling loss. Furthermore, it helps to promote the diversity between the members even if similar architectures are used.
Dataset setup for visualization. The benefits of the ensembling loss can be illustrated through the t-SNE visualization [vdMH08] and diversity measures [KW03]. For that purpose, we use a secure RSA dataset with three classes such that each input is associated with one of these labels (see Section 4.1 for more details on the dataset). The ensemble model is configured with 5 members (i.e. $F = \{F_0, F_1, F_2, F_3, F_4\}$) such that each of them has the same architecture. Generating 5 committee members with the same architecture is helpful to efficiently evaluate the suitability of the ensembling loss in contrast with the categorical cross-entropy and the ranking loss. These members are CNN architectures with 1 convolutional block based on 2 filters of size 1, a BatchNormalization layer [IS15] and an average pooling layer with stride 2. Then, a flatten layer is applied to reduce the space dimension of the convolutional part. Finally, a predictive layer is applied with a softmax function. Each network is trained for 40 epochs with the Adam optimizer [KB15], a batch size of 128 and a learning rate of 0.001.
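To give an idea of how small each committee member is, the following sketch counts the parameters of the architecture described above. It is only an illustration under stated assumptions: the input is taken to be a single-channel trace of 13,000 samples (the trace length reported in Section 4.1), the pooling window is assumed to be 2 (the text only gives the stride), and the batch normalization layer is counted Keras-style with four values per channel, two of which are non-trainable moving statistics. All function names are hypothetical.

```python
def conv1d_params(in_channels, filters, kernel_size):
    """Weights plus one bias per filter."""
    return filters * (in_channels * kernel_size + 1)


def batchnorm_params(channels, count_moving_stats=True):
    """gamma and beta per channel, plus moving mean/variance if requested."""
    return channels * (4 if count_moving_stats else 2)


def dense_params(inputs, outputs):
    """Fully connected layer: one weight per input plus one bias, per output."""
    return outputs * (inputs + 1)


def member_param_count(trace_len=13_000, filters=2, kernel_size=1,
                       pool=2, classes=3, count_moving_stats=True):
    """Parameter count of one committee member (assumed layout, see lead-in)."""
    conv = conv1d_params(1, filters, kernel_size)        # single input channel (assumption)
    bn = batchnorm_params(filters, count_moving_stats)
    flattened = (trace_len // pool) * filters            # average pooling halves the length
    dense = dense_params(flattened, classes)             # softmax predictive layer
    return conv + bn + dense


total = member_param_count()
trainable_only = member_param_count(count_moving_stats=False)
print(total, trainable_only)
```

Under these assumptions the total is 39,015, which coincides with the parameter count reported in Section 4.1; the strictly trainable part would be 39,011. Almost all parameters sit in the predictive layer.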
Remark 4. In the following sections, we only consider 5 committee members because this configuration provides the best trade-off between training time and network performance. A deeper investigation is performed in Section 6.1 to evaluate the impact of the number of committee members on the ensemble accuracy.
t-distributed Stochastic Neighbor Embedding (t-SNE) visualization. Introduced in [vdMH08], the t-SNE visualization tool maps high-dimensional data into a two- or three-dimensional space while preserving local structure and also revealing important global structure (e.g. clusters). t-SNE employs a nonlinear and iterative process to convert similarities between data points into joint probabilities and tries to minimize the KL-divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data. This representation is helpful to evaluate the network capacity to distinguish each class and to validate the theoretical approach presented in Section 3.1.
Figure 1 illustrates the t-SNE visualizations depending on the loss used during the training process. When the cross-entropy is considered, we estimate that the network is not trained enough to efficiently discriminate each class. Indeed, there are many connections between the classes, so that many false positives (FP) and false negatives (FN) can degrade the global performance of the model. Through this visualization, we can question the relevance of the cross-entropy in our context.
On the other hand, the ranking loss [ZBD+20] generates three separate clusters. As mentioned in Section 3.1, the ranking loss can be formulated as the relevance loss (see Equation 7). Through the minimization of this function, we minimize the conditional entropy $H(Z|F_n)$, which promotes the generation of three compact clusters. Hence, Figure 1 confirms the theoretical results of the previous section. The ensemble model is overconfident in the features captured during the training process. Consequently, it detects discriminative patterns to avoid connections between the clusters. However, following the t-SNE illustration, the FP and FN induced by the ranking loss are persistent and seem difficult to detect, because these errors are fully included in a wrong cluster. This phenomenon can be explained by the overfitting effect. Using more training examples could reduce this effect and the error rate. However, when the number of profiling traces is limited (as is often the case in practice), another solution has to be found to improve the ensemble model.
The best solution should create three separate clusters when the ensemble model is confident in its prediction, while the errors or the uncertain predictions should converge towards the point equidistant from the centroids of the clusters. These examples are referred to as data uncertainty. Introduced in [MMG20], data uncertainty is the irreducible uncertainty in predictions which arises due to the complexity or noise in the data. The ensembling loss converges towards this solution. Indeed, in Figure 1, the combination of the relevance loss, the conditional redundancy loss and the redundancy loss creates three separate clusters (see Appendix A for deeper details). When the network is confident in its predictions, it assigns the related examples to the correct class. However, the ensembling loss creates some connections between the clusters, which seem to be defined by the data uncertainty. This result tends to reduce the number of consistent FP and FN, such that few errors can be detected in each cluster in contrast with the cross-entropy or the ranking loss. However, the t-SNE does not provide information related to the diversity growth. To validate the suitability of the ensembling loss, we evaluate the diversity of its model against the cross-entropy and the ranking loss.
One advantage of pairwise measures is that they can be easily visualized and interpreted. Choosing a specific pairwise measure does not make a significant difference in our experiments, so we chose the fraction of disagreement for simplicity.
Let $N^{ab}_{n,m}$ be the joint counts between two learners $F_n$ and $F_m$. We denote a = 0 (resp. b = 0) if $F_n$ (resp. $F_m$) wrongly predicts a value and a = 1 (resp. b = 1) otherwise. For example, $N^{01}_{n,m}$ defines the number of elements such that $F_n$ obtains an incorrect value for a given input while $F_m$ correctly predicts the related class for the same input.
This metric is 0 when two functions make identical predictions, and 1 when they differ on every single example in the test set. Hence, the larger the value, the larger the diversity. From an ensembling perspective, we want to generate a set of classifiers F maximizing the disagreement measure such that each individual learner keeps a high performance for classifying unseen examples. In [FHL19], Fort et al. propose to plot a normalized disagreement measure with respect to the accuracy of each classifier. The diversity measure is normalized by the error rate to prevent the case where random predictions provide the best diversity. From the set of classifiers F, one member is randomly picked to be used as the base model. This model is denoted as $F_n$. Then, we calculate the diversity measure of the other ensemble members against the base model.
Figure 2 illustrates the diversity of each model of F against $F_n$. In this figure, the y-axis characterizes the fraction of labels, returned by each model of F, that differ from those of $F_n$, while the x-axis defines their validation accuracy. Consequently, the sample with a y-axis value of 0 defines $F_n$. Three ensemble models are generated with the three losses in order to investigate the benefits of the ensembling loss. In [FHL19], Fort et al. propose a theoretical approach to explain the results obtained in Figure 2. Let $F_n$ and $F_m$ be two committee members from F. If $F_n$ and $F_m$ have identical validation accuracy and high diversity, then they converge towards different local optima with identical depth. In contrast, if $F_n$ and $F_m$ have identical validation accuracy and low diversity, then they converge towards the same local optimum. Consequently, from a loss landscape perspective, it seems beneficial to construct an ensemble model with highly diverse members such that their prediction distributions and their selected features differ from each other.
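The diversity measures discussed above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used for Figure 2. It assumes the usual oracle-output definition of the fraction of disagreement from [KW03], i.e. $(N^{01}_{n,m} + N^{10}_{n,m})$ divided by the total number of examples, and it assumes that the normalized measure divides the fraction of differing predictions by the error rate of the compared member. All function names are hypothetical.

```python
def joint_counts(pred_n, pred_m, labels):
    """Return the counts N^{ab}: a (resp. b) is 1 if F_n (resp. F_m) is correct."""
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for p_n, p_m, y in zip(pred_n, pred_m, labels):
        counts[(int(p_n == y), int(p_m == y))] += 1
    return counts


def disagreement(pred_n, pred_m, labels):
    """Fraction of examples on which exactly one of the two learners is correct."""
    counts = joint_counts(pred_n, pred_m, labels)
    return (counts[(0, 1)] + counts[(1, 0)]) / sum(counts.values())


def prediction_difference(pred_base, pred_other):
    """Fraction of predicted labels of a member that differ from the base model."""
    differing = sum(a != b for a, b in zip(pred_base, pred_other))
    return differing / len(pred_base)


def normalized_diversity(pred_base, pred_other, labels):
    """Prediction difference divided by the member's error rate (assumption)."""
    errors = sum(p != y for p, y in zip(pred_other, labels))
    if errors == 0:
        raise ValueError("error rate is zero, normalization undefined")
    return prediction_difference(pred_base, pred_other) / (errors / len(labels))


# Toy example with three classes and six validation traces.
labels = [0, 1, 2, 0, 1, 2]
pred_n = [0, 1, 2, 0, 1, 0]   # base model, wrong on the last trace
pred_m = [0, 1, 2, 1, 1, 2]   # other member, wrong on the fourth trace
print(disagreement(pred_n, pred_m, labels))
print(normalized_diversity(pred_n, pred_m, labels))
```

In a multi-class setting the two quantities can differ: two members that are both wrong but predict different labels count as differing predictions, yet they do not contribute to $N^{01}_{n,m}$ or $N^{10}_{n,m}$.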
In contrast, the accuracy of individual models does not reflect the performance of the ensemble committee. Indeed, the combination of weak but complementary classifiers can generate a very effective ensemble model, while the same configuration with effective but correlated classifiers is not beneficial for an ensembling approach. Consequently, an ensemble model composed of weak classifiers can outperform a combination of effective individual models. The ensemble model trained with the ranking loss provides the worst diversity scenario. Even if its individual classifiers are more efficient than most of the other learners (i.e. validation accuracy > 94%), the lack of diversity prevents the development of uncorrelated members. Indeed, following [FHL19], all the committee members converge towards the same local optimum. Hence, a lack of complementarity appears when the adversary only considers the ranking loss. Consequently, the performance of the resulting ensemble model should be equal to the average accuracy of its members. For the cross-entropy loss function, the members are more diverse than those of the ranking loss ensemble model. Consequently, the resulting learners are less correlated and the combination of their probabilities should reduce the overall error. Finally, in comparison with the categorical cross-entropy and the ranking loss, the ensembling loss provides the most diverse models. Indeed, in Figure 2, the normalized diversity measure is the highest for the ensembling loss model. This observation is confirmed with the κ-statistic measure in Appendix B Figure 7. Interestingly, even if the committee members have the same architecture, the ensembling loss provides a clear diversity benefit. Hence, from a loss landscape perspective, the ensembling loss helps the committee members to converge towards independent local optima with different depths.
These observations validate the theoretical observations introduced in Section 3.1. Indeed, using the ensembling loss increases the diversity between ensemble members to reduce the correlation between the errors made in order to propose an efficient ensemble model. In the next section, we evaluate this diversity gain on the practical ensembling performance.

Experimental Settings
The experiments are implemented in Python using the Keras library [C+15] and are run on a workstation equipped with 32GB of RAM and an NVIDIA GTX 1080 Ti with 11GB of memory.

Dataset: Secure RSA Implementation
Target presentation. Introduced in [CCC+19], the targeted RSA implementation is based on a Left-to-Right Square & Multiply Always exponentiation algorithm [Cor99] combined with three countermeasures: input randomization, modulus randomization and exponent randomization. The software part of the targeted RSA implementation does not provide specific security mechanisms to defeat horizontal or address-bit side-channel attacks. This choice was made deliberately by CryptoExperts' team b, who was responsible for the development of the RSA software part. This paper highlights that the application of advanced deep learning-based side-channel attacks makes security mechanisms against horizontal and address-bit attacks mandatory to reduce the adversary's scope.
For two 512-bit primes p and q, the combination of the three masking countermeasures corresponds to the following equation:
Neural Network Architecture. While the original network performs very well (= 99.91%), we decide to reduce its complexity while preserving the related performance. The network we use is composed of one convolutional block with 2 filters of size 1, one batch normalization layer [IS15] and an average pooling layer. Then, a flatten layer is applied to connect the detected features to a predictive layer configured with 3 outputs defining the value of $seg_{free}$. Optimization is done using the Adam optimizer [KB15] with a batch size of 128 and a learning rate of $10^{-3}$. The batch size and the learning rate follow the values provided in [CCC+19]. The optimization of these hyperparameters is not considered in this paper. We use the SeLU activation function to avoid vanishing and exploding gradient problems [KUMH17]. In the following sections, we only keep the model achieving its best performance (e.g. accuracy) over 100 epochs. This new model has a similar performance to the architecture proposed in [CCC+19] (= 99.89%) while being much more efficient computationally (i.e. 39,015 trainable parameters against 1,950,323 for the original network).
In this paper, we want to evaluate the suitability of the ensemble models when the number of profiling traces is limited (as is often the case in practice). Hence, we only use 30,000 profiling traces and 3,000 validation traces instead of the 750,000 traces considered by [CCC+19]. However, when an adversary trains a model with the 30,000 raw profiling traces of 13,000 samples, he already generates a classifier with very high performance (= 98.30%). Hence, to efficiently evaluate the suitability of the ensembling loss, we add Gaussian noise $N \sim \mathcal{B}(0, \sigma^2)$, where σ defines the standard deviation of the noise. Table 1 shows the evolution of the accuracy depending on the added noise on the secure RSA dataset.
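The noise-addition step can be sketched as follows. This is a minimal illustration using only the Python standard library, not the preprocessing code used for the experiments; the function name `add_gaussian_noise` and the toy trace are hypothetical, and the noise is assumed to be drawn independently for each sample of each trace.

```python
import random


def add_gaussian_noise(trace, sigma, rng=None):
    """Return a copy of the trace with zero-mean Gaussian noise of standard
    deviation sigma added independently to each sample (assumption)."""
    rng = rng or random.Random()
    return [sample + rng.gauss(0.0, sigma) for sample in trace]


# Toy usage: a flat trace of 13,000 samples degraded with sigma = 6.
rng = random.Random(0)          # fixed seed only to make the sketch repeatable
clean_trace = [0.0] * 13_000
noisy_trace = add_gaussian_noise(clean_trace, sigma=6.0, rng=rng)

# Empirical standard deviation of the added noise, expected to be close to 6.
mean = sum(noisy_trace) / len(noisy_trace)
std = (sum((s - mean) ** 2 for s in noisy_trace) / len(noisy_trace)) ** 0.5
print(round(std, 2))
```

In practice the same operation would be applied to every profiling, validation and attack trace, with σ chosen so that the single-model accuracy leaves room for improvement.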
In the following sections, σ is set to 6 in order to evaluate the benefits of the ensembling loss against the categorical cross-entropy and the ranking loss. The model trained with the categorical cross-entropy is considered as the state-of-the-art result because it uses the classical learning metric in the side-channel context. This result is our reference to highlight the performance provided by the ensembling loss.
Evaluation metrics. While the accuracy is questioned when symmetric cryptographic implementations are considered [PHJ+19], it is fully relevant to assess the training process on asymmetric datasets. As previously mentioned, an adversary exploits the index $seg_{free}$, which can take 3 values. We denote by $Acc_{label}$ the accuracy expressing the capacity of the network to retrieve the correct value of $seg_{free}$ for a given leakage trace. This metric is used to mitigate the underfitting and overfitting issues. However, an attacker wants to retrieve secret key bits. Hence, we have to convert the balanced ternary representation (i.e. {0, 1, 2}) into a binary representation (i.e. {0, 1}). We denote by $Acc_{bit}$ the accuracy expressing the proportion of correctly retrieved bit values. Its related error rate is denoted $\epsilon_{bit}$.

Practicability and Remaining Brute-force Complexity
If the resulting $Acc_{bit}$ is less than 100%, the adversary has to perform additional operations to retrieve the full secret key. No theoretical result links the accuracy to the number of remaining operations; this is an open problem in the literature. We therefore experimentally evaluate how the accuracy impacts the final attack complexity.
Given a number $N_{op}$ of remaining operations, we define the brute-force complexity as $\log_2(N_{op})$. The European SOG-IS scheme c considers that a maximum brute-force complexity of around $2^{100}$ operations is practical. Hence, we consider this threshold to evaluate if an attack becomes feasible. Note that the notion of time complexity is independent of the computational power available to the attacker. In the following sections, we consider three complexity measures depending on the attack scenario:
• Naive Complexity - Given a secret exponent of K bits, a blinding scalar of bit-length R and an error rate $\epsilon_{bit}$, the Naive Complexity, denoted $C_{NC}$, is defined as the worst-case scenario such that: In this scenario, the adversary cannot locate the wrongly predicted bits induced by the attack. Hence, he has to compute all the combinations for each wrong assumption in order to correct the remaining errors.
• $2^n$-Complexity - Given a secret exponent of K bits, a blinding scalar of bit-length R and an error rate $\epsilon_{bit}$, the $2^n$-Complexity, denoted $C_{2^n}$, is defined as the best-case scenario such that: $C_{2^n} = (K + R) \times \epsilon_{bit}$. In this scenario, the adversary perfectly knows the location of each potential error. Thus, each assumption error has 2 possible values and the resulting number of remaining operations is $2^{(K+R) \times \epsilon_{bit}}$. In the following, an attack that can be performed with $2^n$-Complexity is called a $2^n$-Attack.
• Alternate Attack Complexity - Introduced by Schindler and Wiemers in [SW14], the Alternate Attack (AA) targets RSA modular exponentiation protected with exponent blinding. From this attack, we propose a complexity measure to estimate its practicability. Our proposition is developed in Appendix C. Given a blinding scalar of bit-length R, a secret exponent of K bits, an error rate $\epsilon_{bit}$ and a number of attack traces $N_a$, the Alternate Attack Complexity, denoted $C_{AA}$, is defined as: In [SW17], Schindler and Wiemers set $s = K − R + 2$ and $t_0 = 2$ for R = 32. Even if using the same parameters is restrictive when R = 64, these conditions are respected in Appendix C. In the following, $N_a$ characterizes the number of attack traces that are needed for retrieving all the bits of φ(N) (see Appendix C). We consider an alternate attack as ineffective if the success rate related to φ(N) is less than 100% when 300 successive alternate attacks are performed.
c The Senior Officials Group Information Systems Security (SOG-IS) agreement defines a set of requirements and evaluation procedures related to cryptographic aspects of Common Criteria security evaluations of IT products and mutually agreed by SOG-IS participants. Participants in this Agreement are government organisations or government agencies from countries of the European Union or EFTA (European Free Trade Association). The interested readers may find useful information in https://www.sogis.eu/index_en.html.
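As a worked illustration of the best-case measure, the sketch below evaluates the $2^n$-Complexity as $(K + R) \times \epsilon_{bit}$, which follows from the statement that $2^{(K+R) \times \epsilon_{bit}}$ operations remain and from the definition of the complexity as $\log_2(N_{op})$. The values K = 1024 and R = 64 are assumptions inferred from the two 512-bit primes and the 1,088-bit traces mentioned in this paper, and the function names are hypothetical. The naive and alternate attack measures are not covered here.

```python
SOGIS_THRESHOLD_BITS = 100  # practical brute-force limit of around 2^100 operations


def complexity_2n(k_bits, r_bits, bit_error_rate):
    """log2 of the remaining operations when every erroneous bit position is
    known: each of the (K + R) * eps wrong bits has 2 candidate values."""
    return (k_bits + r_bits) * bit_error_rate


def is_practical(complexity_bits, threshold=SOGIS_THRESHOLD_BITS):
    """True if the remaining brute force stays within the threshold."""
    return complexity_bits <= threshold


# Assumed setting: 1024-bit exponent blinded with a 64-bit scalar.
K, R = 1024, 64

for eps in (0.12, 0.09, 0.05):  # purely illustrative error rates
    c = complexity_2n(K, R, eps)
    print(f"eps_bit = {eps:.2f} -> C_2n = {c:.2f} bits, practical: {is_practical(c)}")

# Because C_2n is linear in eps_bit, an accuracy gain translates directly
# into saved bits: a gain of 6.8 points removes 1088 * 0.068 bits.
print(complexity_2n(K, R, 0.068))
```

The last line gives about 74 bits, which is consistent with the reduction of $2^{74}$ operations reported for the $2^n$-attack when $Acc_{bit}$ increases by 6.8%.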
All these complexity measures are helpful to evaluate the efficiency of an attack. These tools are suited to highlight the benefits of the ensembling loss on different attack scenarios and prove the negligible impact of the growth of the training time.

Experimental Results
This section proposes an experimental comparison between the categorical cross-entropy, the ranking loss and the ensembling loss when ensemble models are considered. In Section 5.1, we evaluate the complementarity of committee members depending on the loss used. Then, in Section 5.2, we combine the type I diversity with the different learning metrics to illustrate its impact on the resulting performance. In Section 5.3, all the diversity types are combined to exploit the entire benefits of the ensemble methods and highlight the improvement in the resulting side-channel attack complexity.
In the following sections, $CCE_{i,j}$ (resp. $RkL_{i,j}$, $EL_{i,j}$) denotes an ensemble model trained with the categorical cross-entropy (resp. the ranking loss, the ensembling loss), composed of i committee members such that the type j diversity is performed. Due to the interactions between the committee members during the training process, the ensembling loss can be considered as the only learning metric promoting the type III diversity.
Remark 6. In [DDFP20], Destouet et al. investigate a solution that consists of the aggregation of multiple models targeting different sensitive values (i.e. Hamming weight, first big-endian bit, identity). In this paper, we assume that all the learners are trained on the same single label.

Learning Ensemble Diversity
This section evaluates the benefits of using the ensembling loss instead of the categorical cross-entropy or the ranking loss when ensemble models are considered. To assess the diversity growth, we generate an ensemble model composed of 5 committee members with the same architecture (see Section 4.1). Consequently, the diversity provided by the following ensemble models only depends on the loss used. Table 3 illustrates the performance evolution depending on the diversity type and the learning metric applied. If the adversary only considers the state-of-the-art result, he trains a unique model with the categorical cross-entropy (i.e. $CCE_1$) to perform his attack. Recently, Zaid et al. proposed the Ranking Loss for the side-channel context [ZBD+20]. However, their work only focused on symmetric implementations. Here, we extend this work by investigating the benefits of using this loss to evaluate asymmetric implementations. In our scenario, the ranking loss can be considered as more effective than the categorical cross-entropy (see Table 3). While a classifier trained with the categorical cross-entropy loss function does not provide powerful models (i.e. $C_{\{NC, 2^n, AA\}} \geq 100$), an attacker using the ranking loss can potentially break the RSA implementation (i.e. $C_{2^n} \leq 100$).
When 5 committee members are considered in the ensemble model, we can observe a meaningful improvement. Even if the training time is multiplied by 9 in the worst case, it stays reasonable from a practical perspective. In return, $Acc_{label}$ is increased by up to 2.69% and an adversary can extend his attack scenario. Following the SOG-IS recommendations, an adversary can successfully perform an alternate attack if he applies the ensembling loss to train his ensemble model, while the state-of-the-art result (i.e. $CCE_1$) cannot. This result highlights the benefit of using the ensembling loss in terms of ensemble performance. In addition, considering the ensembling loss reduces $C_{2^n}$ by 25. Hence, the theoretical features of the ensembling loss, which are validated through the visualizations of Section 3.2, translate into an actual gain in model accuracy as well as a realistic improvement for a full attack scenario. The ensembling loss increases the overall diversity and reduces the global error rate of the ensemble model. Hence, the ensembling loss is helpful to promote the complementarity between the committee members.

Ensembling Loss Combined with Type I Diversity
As mentioned in Section 2.3, the type I diversity refers to the heterogeneity between the committee members' structure. This diversity is employed by Perin et al. [PCP20] to argue the generalization improvement induced by this ensemble method. In this section, we propose to combine the type I diversity with the different loss functions to evaluate the resulting gain in attack complexity. For that purpose, we randomly generate 5 networks with a wide range of hyperparameters (details are provided in Appendix D Table 9). In [Zho12], Zhou recommends the configuration of heterogeneous networks with high individual performance. From a bias-variance trade-off perspective, this procedure is powerful to reduce the bias as well as the variance by aggregation. Even if this solution is intuitive, it is not necessarily the best one, as discussed in Section 3.2.
From a diversity perspective, using efficient heterogeneous networks seems to increase the number of uncorrelated errors. As shown in Figure 3, combining the type I diversity with the ensembling loss reduces the overall κ-statistic measure. Following Appendix B Definition 7, this observation confirms the gain in diversity. This result can also be verified with the disagreement measure (see Appendix B Figure 8).
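For readers who want to reproduce this kind of measurement, the sketch below computes a pairwise κ-statistic from the correctness of two members. It assumes the pairwise interrater agreement of [KW03], $\kappa = 2(N^{11}N^{00} - N^{01}N^{10}) / ((N^{11}+N^{10})(N^{01}+N^{00}) + (N^{11}+N^{01})(N^{10}+N^{00}))$, which may differ from Definition 7 of Appendix B. The function name is hypothetical.

```python
def pairwise_kappa(pred_n, pred_m, labels):
    """Pairwise kappa between two members, computed on their oracle outputs
    (1 if the prediction is correct, 0 otherwise). Assumed [KW03] form."""
    n11 = n10 = n01 = n00 = 0
    for p_n, p_m, y in zip(pred_n, pred_m, labels):
        ok_n, ok_m = p_n == y, p_m == y
        if ok_n and ok_m:
            n11 += 1
        elif ok_n:
            n10 += 1
        elif ok_m:
            n01 += 1
        else:
            n00 += 1
    denominator = (n11 + n10) * (n01 + n00) + (n11 + n01) * (n10 + n00)
    if denominator == 0:
        raise ValueError("kappa is undefined when no member ever errs (or ever succeeds)")
    return 2 * (n11 * n00 - n01 * n10) / denominator


labels = [0, 0, 0, 0]
same = [0, 0, 1, 1]            # correct on the first two traces only
independent = [0, 1, 0, 1]     # errors unrelated to those of `same`
complementary = [1, 1, 0, 0]   # correct exactly where `same` is wrong

print(pairwise_kappa(same, same, labels))           # identical errors
print(pairwise_kappa(same, independent, labels))    # independent errors
print(pairwise_kappa(same, complementary, labels))  # complementary errors
```

With this definition, κ equals 1 for members that fail on exactly the same traces, 0 for statistically independent errors and becomes negative for complementary members, so a lower value indicates more diversity.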
From a performance perspective, the individual committee members do not exceed 94.33% for retrieving the bits of the blinded exponent when the ranking loss is considered (see Appendix D Table 9). However, applying the ensembling loss adjusts the efficiency of each learner to increase their complementarity. Indeed, the most powerful member finds 95.84% of all bits while the weakest one finds only 88.21% of all bits. The interaction between the committee members during the training process tends to accentuate this discrepancy in order to force the gain in diversity. Table 3 illustrates the benefits of combining the type I diversity with the ensembling loss from a performance perspective. Adding type I diversity reduces the remaining attack complexity regardless of the adversary's capacity. Finally, even if the resulting training time increases, it stays marginal relative to the gain in attack complexity. Indeed, depending on the scenario, the attack can be performed with $2^{50}$ operations. In comparison with the previous state-of-the-art result (i.e. $CCE_1$), the number of operations is reduced by $2^{58}$ while the training time is only increased by 10.

Combining All Types of Diversity
The type I+II diversity consists of the selection of members from a pool such that the diversity measure is maximized between all the learners. The pool members are generated by randomly picking the hyperparameters from the ranges defined in Table 2. The resulting pool is composed of 100 members. As recommended in [LWC+19], we retain a set of classifiers with high performance (i.e. $Acc_{label}$ > 85%) such that their disagreement measure is maximized. The 5 selected architectures are identified in Appendix D Table 10.
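One possible way to implement such a selection is sketched below. It is not the procedure used in this paper, whose exact search strategy is not detailed here: the sketch assumes a greedy heuristic that first discards the members below the accuracy threshold, starts from the most disagreeing pair and then repeatedly adds the member with the largest total disagreement with the members already selected. The function names and the toy pool are hypothetical.

```python
from itertools import combinations


def pairwise_disagreement(correct_a, correct_b):
    """Fraction of examples on which exactly one of the two members is correct.
    Inputs are oracle outputs (1 = correct prediction, 0 = error)."""
    return sum(a != b for a, b in zip(correct_a, correct_b)) / len(correct_a)


def select_diverse_members(correctness, size=5, min_accuracy=0.85):
    """Greedy selection (assumed heuristic) of `size` accurate, diverse members.
    `correctness[i]` is the oracle-output vector of pool member i."""
    if size < 2:
        raise ValueError("at least two members are required")
    eligible = [i for i, c in enumerate(correctness)
                if sum(c) / len(c) > min_accuracy]
    if len(eligible) < size:
        raise ValueError("not enough members above the accuracy threshold")

    # Seed with the pair of eligible members that disagree the most.
    selected = list(max(
        combinations(eligible, 2),
        key=lambda pair: pairwise_disagreement(correctness[pair[0]],
                                               correctness[pair[1]])))
    # Add the member maximizing its total disagreement with the current set.
    while len(selected) < size:
        remaining = [i for i in eligible if i not in selected]
        best = max(remaining, key=lambda i: sum(
            pairwise_disagreement(correctness[i], correctness[j])
            for j in selected))
        selected.append(best)
    return sorted(selected)


# Toy pool of four members evaluated on eight validation traces.
pool = [
    [1, 1, 1, 1, 1, 1, 1, 0],   # member 0, accuracy 87.5%
    [1, 1, 1, 1, 1, 1, 1, 0],   # member 1, same errors as member 0
    [0, 1, 1, 1, 1, 1, 1, 1],   # member 2, accuracy 87.5%, different error
    [1, 0, 0, 0, 1, 1, 1, 1],   # member 3, accuracy 62.5%, discarded
]
print(select_diverse_members(pool, size=2))
```

A greedy search is used because an exhaustive search over all subsets of 5 members out of 100 would require evaluating more than 75 million candidate sets.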
Remark 7. In some cases (e.g. boosting [CG16]), weak learners (i.e. models that are only slightly better than random guessing) can be helpful to increase the performance of the ensemble model. The benefit of these strategies is considered out of our scope.
The type I+II diversity promotes the decorrelation of the errors between the individual committee members. Following the κ-statistic measure (see Figure 3), the diversity brought by the type I+II is significant: the overall κ-statistic measure is reduced in comparison with the previous experiments. This observation can also be made with the disagreement measure (see Appendix B Figure 8). When the ranking loss or the categorical cross-entropy is considered, even if there is no interaction between the committee members during the training process, using the type I+II diversity is useful to bring more diversity into the ensemble model. However, combining the type I+II diversity with the ensembling loss accentuates the gain in diversity in order to generate a more powerful model. From Table 3, we observe a significant improvement when the ensembling loss is used in comparison to the ranking loss and the categorical cross-entropy. Generating interactions between the committee members provides more consistency during the training process. As mentioned in Section 3.2, the ensembling loss makes the uncertain predictions converge towards the point equidistant from the centroids of the clusters. Thus, the impact of the FP and FN is reduced when the ensembling loss is used. Combining all the diversity techniques provides the most effective model in terms of performance. While an ensemble model trained with the ranking loss needs $2^{56}$ operations to retrieve the remaining bits in the best-case scenario, the combination of the ensembling loss with the type I+II diversity needs only $2^{34}$ operations.
Even if the resulting training time is multiplied by 3, the gain in performance is significant enough to justify the use of the ensembling loss.
Table 3: Performance evaluation depending on the diversity type (average over 10 physical traces of 1,088 bits each). Green (resp. red) cells are considered as practicable (resp. impracticable) following the SOG-IS recommendations. Columns: Section, Model, $Acc_{label}$, $Acc_{bit}$, $\epsilon_{bit}$, $C_{NC}$, $C_{2^n}$, $C_{AA}$, Training Time.
As a conclusion, combining all the diversity techniques provides a clear advantage from a side-channel point of view. Indeed, when the type I+II diversity techniques are combined with the ensembling loss (i.e. type III diversity), we promote the diversity between the classifiers in order to reduce the global error. In comparison with the previous state-of-the-art result (i.e. $CCE_1$), $Acc_{bit}$ is increased by 6.8% and the number of remaining operations is reduced by $2^{290.56}$ (resp. $2^{74}$ and $2^{22.03}$) when the adversary wants to perform a naive attack (resp. a $2^n$-attack and an alternate attack). Even if the training time is increased by up to 39.44, it stays negligible with regard to the gain for the full attack. Indeed, following the SOG-IS recommendation, the previous state-of-the-art result considers the RSA implementation as secure, while combining the different diversity techniques allows an adversary to retrieve the secret exponent. Hence, the combination of type I+II with the ensembling loss should be considered during the evaluation of asymmetric implementations to generate more powerful attacks.

Discussion
This discussion evaluates the classical ensemble methods (i.e. Bagging [Bre96], Boosting [FS96, CG16]), the classifier fusion techniques (i.e. averaging, voting) and the impact of the number of committee members. Then, we evaluate the benefits of the ensembling loss for a binary classification problem. The results provided in this paper could be further improved by using additional techniques that are known to be suitable in the side-channel context [CDP17, PHJ+19, WJB20, Mag20, PCBP20].

Classical Ensemble Methods
Ensemble Methods. Traditionally, the methods considered in ensembling are Bootstrap Aggregating [Bre96], also known as Bagging, and the Boosting [FS96, CG16] techniques. Through this discussion, we evaluate the benefits of these techniques in addition to the current ensemble models.
The bagging and boosting approaches are not new in the side-channel context [MPP16, PSK+18, PCP20]. While these algorithms are essentially used with Random Forest (RF) [Bre01], they can also be applied to neural networks. The details on the hyperparameter selection are provided in Appendix E Table 11 for bagging and in Appendix E Table 12 for eXtreme Gradient Boosting (XGBoost) [CG16] and CNN-XGBoost [RGL+17].
The best results for all the models are reported in Table 4. In our experiments, this table illustrates that bagging and XGBoost do not provide a clear advantage when they are added to the standard proposition introduced in Table 3. However, if an improvement is observed, these algorithms can be combined with those introduced in this paper (i.e. Type I+II+III diversity) in order to generate a more powerful ensemble model.
Remark 8. The ensembling loss cannot be considered when the bagging technique is applied. Indeed, given a profiling set T, the $N_p$ pairs $(t^p_i, y^p_i)_{0 \leq i < N_p}$ should be the same for all the committee members when the ensembling loss is computed. This condition is a limitation regarding the application of the bagging algorithm.
Combination Methods. One major issue when an ensemble model is considered is to find the best way to combine the posterior probabilities of each committee member. There are several consensus methods for combining the outputs of multiple learners. We compare the two most common combining methods:
• Averaging - This consensus is considered as a linear combining method. The average prediction returned by the committee members is computed. An advanced combination technique consists of weighting the average of each classifier to promote the order of the classes. However, this method stays out of our scope.
• Voting - This method is considered as a non-linear decision-making process based on ranked information. The majority voting process predicts the value with the highest number of occurrences. Hence, the collective decision has a major impact on the final prediction.
The results shown in Table 5 are closely correlated with those reported in the previous sections. Hence, for the experiments investigated in this paper, these aggregating functions do not impact the performance of the ensemble model.
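The two consensus methods compared above can be sketched as follows. This is a minimal illustration on posterior probabilities given as plain Python lists; the function names are hypothetical, and ties in the majority vote are resolved in favour of the class that first reaches the maximal count, which is an arbitrary choice.

```python
from collections import Counter


def average_combination(member_probs):
    """Linear combination: average the posterior probabilities of all
    committee members and return the index of the most likely class."""
    n_classes = len(member_probs[0])
    mean = [sum(probs[c] for probs in member_probs) / len(member_probs)
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


def majority_vote(member_probs):
    """Non-linear combination: each member votes for its most likely class
    and the class with the highest number of votes is returned."""
    votes = [max(range(len(probs)), key=probs.__getitem__)
             for probs in member_probs]
    return Counter(votes).most_common(1)[0][0]


# Toy example with 3 members and 3 classes: one very confident member
# outweighs two moderately confident ones under averaging, not under voting.
posteriors = [
    [0.9, 0.1, 0.0],
    [0.4, 0.6, 0.0],
    [0.4, 0.6, 0.0],
]
print(average_combination(posteriors))  # class 0 (mean probabilities 0.57 / 0.43 / 0.00)
print(majority_vote(posteriors))        # class 1 (two votes against one)
```

The example is deliberately built so that the two rules disagree; when the members are confident and mostly agree, both rules return the same class, which is in line with the similar results observed for the two methods.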

Number of Committee Members.
The number of committee members can also be considered as an issue in ensemble methods. Indeed, no method defines a priori the number of committee members that maximizes the ensemble model performance. In the following, we explore this variable in order to identify its impact on the ensembling loss performance. For that purpose, we increase the number of committee members up to 32 in order to evaluate the gain in performance and the impact on the training time. Through Table 6, we can estimate the best trade-off between the training time and the ensemble performance. For the RSA implementation, the best $Acc_{bit}$ value is obtained for N = 10 committee members. While increasing the number of members seems helpful to improve the ensemble model's accuracy, in our context we seem to reach the maximal possible performance. Adding too many learners can reduce the diversity effect because some committee members can share the same errors and promote irrelevant outputs. The best number of committee members should be defined for each case study.

Binary Classification Problem: Attacking an ECC implementation
To emphasize the benefits of the ensembling loss, we evaluate its suitability on a classical binary classification problem. While the secure RSA dataset can be defined as a multi-class classification task (3 outputs), we perform the same experimental process on a protected ECSM (Elliptic Curve Scalar Multiplication) implementation [NCOS17,Chm20] where each trace corresponds to a multiplication with a random scalar. This scenario is a 1-trace exploitation which is considered when targeting the scalar multiplication of ECDSA. Note that remaining brute-force attacks that require $N_a$ exploitation traces, such as [SW14], cannot be used in this context. Proposed in [NCOS17,Chm20], the secured ECSM implementation employs the Montgomery Ladder with randomized projective coordinates and a conditional swap (cswap) (see [NCOS17, Algorithm 1]). Starting from two (or more) curve points, the cswap countermeasure performs the scalar multiplication algorithm on one of these points depending on a mask value. Hence, if an adversary learns all the cswap condition bits from one side-channel trace, he retrieves the secret key (i.e. 256 bits) [NCOS17]. To be successful, the secret bits have to be recovered from a single side-channel trace. In the dataset, each trace represents a single iteration of the Montgomery Ladder scalar multiplication and the related label corresponds to the cswap condition bit value. For further details on the device under test, we refer the reader to [NCOS17,Chm20].
Similarly to the secure RSA dataset, we have to add Gaussian noise $N \sim \mathcal{B}(0, \sigma^2)$ to characterize the benefits of the ensembling loss. Table 7 shows the evolution of the accuracy depending on the added noise and the loss used when 20,000 profiling traces are used. To evaluate the suitability of each network, 2,000 validation traces are considered and the evolution of the accuracy is used to limit the overfitting/underfitting effect. For our analysis, we set the added noise to σ = 30. Once again, we clearly observe the benefits of the ranking loss over the cross-entropy loss function when the added noise is high. Indeed, the resulting performance increases by up to 3.5%.
First, we validate the theoretical observations provided in Section 3.1 for the binary classification problem. For that purpose, we visualize the t-SNE maps for the models trained with the different losses. Figure 4 confirms all the theoretical results introduced in this paper. The cross-entropy representation does not seem relevant to efficiently discriminate each cluster. The resulting ensemble model seems to select joint patterns such that many false positives and false negatives can deteriorate the overall performance. In contrast, the model trained with the ranking loss tends to overfit such that the false positives and false negatives can be considered as consistent. Finally, from a theoretical perspective, the ensembling loss seems the most suitable. Indeed, the data uncertainty seems to converge towards the point equidistant from the centroids of the clusters. Hence, the resulting ensemble model tends to gather the uncertain examples towards a uniform probability distribution. Furthermore, using the ensembling loss provides a clear benefit from a diversity perspective (see Figure 5).
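The noise addition used at the start of this experiment can be sketched as follows. This is only an illustration: it assumes zero-mean additive Gaussian noise with standard deviation σ applied independently to every sample, and the function name, the trace dimensions and the all-zero placeholder traces are hypothetical.

```python
import numpy as np

def add_gaussian_noise(traces, sigma, seed=0):
    """Return a copy of `traces` with i.i.d. zero-mean Gaussian noise of std `sigma` added."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=sigma, size=traces.shape)
    return traces + noise

# Placeholder data: 2,000 traces of 50 samples each (sizes chosen for illustration).
traces = np.zeros((2000, 50))
noisy_traces = add_gaussian_noise(traces, sigma=30.0)
```

Fixing the seed keeps the noisy dataset reproducible across the different losses being compared.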
Through Table 8, we confirm the benefits of the ensembling loss for increasing the performance of the ensemble model. In comparison with the previous state-of-the-art result (i.e. $CCE_1$), the accuracy expressing the performance to retrieve the cswap bit value is increased by 6.5% when the ensembling loss is combined with the type I and II diversities. From a side-channel attack perspective, we reduce the overall number of remaining operations by $2^{58.41}$ (resp. $2^{16}$) for the naive attack (resp. the $2^n$-attack). Hence, using the ensembling loss against a binary classification problem still performs well. From a naive attack perspective, an adversary using the previous state-of-the-art result (i.e. $CCE_1$) considers the ECC implementation as secure following the SOG-IS's recommendations ($C_{NC} > 100$). However, if the adversary combines all the diversity types (including the ensembling loss), he can reconsider the security of the targeted device.
Remark 9. During our experiments, we have noticed that increasing the diversity is more difficult for binary classification problems than for the multi-class classification problem with 3 outputs. Indeed, we had to fine-tune the hyperparameters more precisely for all types of diversity. For a binary classification problem, this phenomenon can be explained by the lack of error distribution.

Conclusion
This paper presents a new loss, namely the Ensembling Loss, that increases the performance of ensemble models. By promoting the interactions between the committee members during the training process, this loss increases the resulting diversity to reduce the correlation between the errors induced by the members. First, we link this new learning metric with the mutual information between the ensemble model and its related label introduced by Brown in [Bro09]. Then, through the disagreement measure and the t-SNE visualization, we show that ensemble models trained with the Ensembling Loss increase the diversity between the committee members.
To assess the benefits from a side-channel perspective, we evaluate the impact of the accuracy growth on the remaining attack complexity through multiple attack scenarios. This investigation shows that deep learning-based side-channel attacks can be ill-suited to defeat secure RSA/ECC implementations if the previous state-of-the-art is considered (i.e. a single model trained with the cross-entropy loss function). Following the SOG-IS security guidance, the improvement provided by the combination of different types of diversity leads to a reconsideration of the targeted system's security.
Furthermore, the Ensembling Loss outperforms all the current learning metrics classically used in side-channel analysis. Hence, this loss could be considered for generating efficient ensemble models.
Future work could extend this proposition to ensemble models with diverse architectures (Multi-Layer Perceptrons, Recurrent Neural Networks [SP97,HS97], Residual Neural Networks [HZRS16], U-Nets [RFB15], etc.) and additional countermeasures for Public-Key Algorithms (e.g. address masking). Moreover, while our work mainly focuses on the gain in the attack accuracy brought by the diversity, future work can evaluate the benefits of the ensembling loss to ease the detection of a threshold for performing a $2^n$-attack. Finally, we can also consider its application to any broad machine learning problem that requires high accuracy.

A t-SNE Ensembling Loss
Figure 6 illustrates the evolution of the t-SNE visualizations [vdMH08] in order to evaluate the impact of the Conditional Redundancy Loss and the Redundancy Loss during the training process. First, as mentioned in Section 3.1, the ranking loss [ZBD+20] can be formulated as the relevance loss (see Equation 7). Through its minimization, we minimize the conditional entropy $H(Z \mid F_n)$ which promotes the generation of three compact clusters. Figure 6 confirms this observation. The ensemble model is overconfident in the features captured during the training process. Consequently, it detects discriminative patterns to avoid connections between each cluster. However, following the t-SNE illustration, the False Positives (FP) and the False Negatives (FN) induced by the ranking loss are persistent and seem difficult to detect. Indeed, these errors are fully included in a wrong cluster. For a given number of profiling traces, a solution is to promote the interaction between the committee members in order to reduce this overconfidence and enhance the ensemble model.
In Equation 8, the Conditional Redundancy Loss function minimizes $(1 - \Pr[F_m \mid F_n, Z = z])$, where $\Pr[F_m \mid F_n, Z = z]$ defines the output probability of the model $F_m$ given $F_n$ and $Z$. Hence, maximizing $\Pr[F_m \mid F_n, Z = z]$ is asymptotically equivalent to minimizing $H(F_m \mid F_n, Z = z)$. Therefore, we force the network to generate three compact clusters given the correct label. This loss tends to increase the confidence of the network on the True Positives (TP) and the True Negatives (TN) while reducing the impact of the FP and the FN. This observation can be made on Figure 6. Indeed, adding the conditional redundancy to the ranking loss is helpful to distinguish TPs and FPs for each cluster. Hence, each cluster is divided into two parts: a part with a high level of confidence in prediction and a part with uncertain predictions. This phenomenon highlights the benefits of the conditional redundancy loss function, which reduces the intra-class variance and eases the distinction between confident and uncertain predictions. However, as illustrated in Figure 6, the conditional redundancy loss function does not clearly separate the confident and uncertain predictions into different clusters. Hence, an additional partial loss should be considered in order to increase the dissociation between these samples. This is provided by the Redundancy Loss function.
In Equation 9, the Redundancy Loss function minimizes $\Pr[F_m \mid F_n]$ which defines the output probability of the model $F_m$ given $F_n$. From an information theory perspective, this can be considered as a maximization of $H(F_m \mid F_n)$. In other words, we want to maximize the inter-class variance between the models $F_m$ and $F_n$. Hence, adding the redundancy loss to the ranking loss should increase the distance between each cluster by diversifying the feature representation of each cluster. This observation can be validated thanks to Figure 6. Indeed, the third t-SNE visualization illustrates a model trained with the ranking and the redundancy losses. In comparison with the first t-SNE visualization, we can highlight the benefits of the redundancy loss to increase the distance between each cluster and make the FN and FP less persistent. However, to some extent, this approach generates a sparse representation of a given cluster and also reduces the confidence of the networks on some TP. Hence a good trade-off has to be found between maximizing the confidence of the TP (i.e. conditional redundancy loss) and minimizing the persistence of the FP (i.e. redundancy loss). The Ensembling Loss aims at finding this solution for given values of α, β, γ and µ (see Equation 10).
In Figure 6, the combination of the relevance loss, the conditional redundancy loss and the redundancy loss creates three separate clusters. When the network is confident in its predictions, it assigns the related examples to the correct cluster. Thanks to the conditional redundancy loss, we know that the predictions with a high level of confidence are assigned to the same compact cluster. However, the ensembling loss also creates some connections between the clusters which seem defined by the data uncertainty. This result tends to reduce the number of consistent FP and FN such that few errors can be detected on each cluster in contrast with the ranking loss. This observation highlights the benefits of the redundancy loss during the training process. In Figure 6, the ensembling loss finds a good trade-off between maximizing the confidence of the TP and minimizing the persistence of the FP.

B Diversity Measures
Let $N^{ab}_{n,m}$ be the joint counts between two learners $F_n$ and $F_m$. We denote a = 0 (resp. b = 0) if $F_n$ (resp. $F_m$) wrongly predicts a value and a = 1 (resp. b = 1) otherwise. For example, $N^{01}_{n,m}$ defines the number of elements such that $F_n$ obtains an incorrect value for a given input while $F_m$ correctly predicts the related class.
Definition 7 (κ-statistic [Coh60]). The κ-statistic measures the diversity between two classifiers $F_n$, $F_m$ as follows:
Figure 7 illustrates the evolution of the diversity depending on the loss used. Indeed, the ensembling loss reduces the overall κ-statistic measure in comparison with the cross-entropy or the ranking loss. This figure confirms that the ensembling loss decorrelates the errors between the committee members. Moreover, combining different types of diversity is helpful to improve this effect (see Figure 3). These observations are in agreement with those introduced in Section 3.2.
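The joint counts and the two pairwise diversity measures used in this paper can be sketched as follows. The sketch makes two assumptions that should be read as such: the disagreement measure is taken as the fraction of inputs on which exactly one learner is correct, and κ is computed as Cohen's chance-corrected agreement [Coh60] on the correct/incorrect indicators of both learners, which may differ in form from the expression of Definition 7. All function names and the toy indicator vectors are hypothetical.

```python
import numpy as np

def joint_counts(correct_n, correct_m):
    """Return (N11, N10, N01, N00) from boolean 'prediction is correct' vectors."""
    cn = np.asarray(correct_n, dtype=bool)
    cm = np.asarray(correct_m, dtype=bool)
    n11 = int(np.sum(cn & cm))      # both learners correct
    n10 = int(np.sum(cn & ~cm))     # F_n correct, F_m wrong
    n01 = int(np.sum(~cn & cm))     # F_n wrong, F_m correct
    n00 = int(np.sum(~cn & ~cm))    # both learners wrong
    return n11, n10, n01, n00

def disagreement(n11, n10, n01, n00):
    """Fraction of inputs on which exactly one of the two learners is correct."""
    return (n10 + n01) / (n11 + n10 + n01 + n00)

def cohen_kappa(n11, n10, n01, n00):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement).

    Undefined (division by zero) when the chance agreement equals 1.
    """
    total = n11 + n10 + n01 + n00
    p_obs = (n11 + n00) / total
    p_chance = ((n11 + n10) * (n11 + n01) + (n01 + n00) * (n10 + n00)) / total**2
    return (p_obs - p_chance) / (1.0 - p_chance)

# Toy correctness indicators for two committee members over 8 inputs.
correct_n = [1, 1, 1, 1, 0, 0, 1, 1]
correct_m = [1, 1, 0, 1, 1, 0, 1, 0]
counts = joint_counts(correct_n, correct_m)
dis = disagreement(*counts)
kappa = cohen_kappa(*counts)
```

A lower κ (or a higher disagreement) indicates less correlated errors between the two members, which is the effect reported for the ensembling loss.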

C Alternate Attack on RSA without CRT
Introduced in [SW14], the Alternate Attack (AA) targets RSA modular exponentiation protected with exponent blinding. Based on the Basic Attack and the Enhanced Attack [SI11], the alternate attack retrieves the secret exponent bits from multiple traces. This attack can be extended to Elliptic Curves [SW14] and RSA with CRT [SW17]. However, some tricks are specific to each case study. In this paper, we only focus on the application of the alternate attack on RSA without CRT. In particular, we formulate a complexity equation for this alternate attack that is missing from the original paper.
In [SW14], Schindler and Wiemers define the blinded exponent $d'$ with a blinding scalar $r$ as:
$$d' = d + r \cdot \varphi(N),$$
where $d$ is the secret exponent and $\varphi(N)$ defines the Euler totient function of the modulus $N$.
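As a quick illustration of why exponent blinding leaves the RSA result unchanged, the following sketch blinds a toy exponent with a random multiple of φ(N) and checks that decryption still succeeds. The textbook-sized parameters, the 8-bit blinding length and the function name are assumptions chosen for readability; they are far too small for real use.

```python
import secrets

def blind_exponent(d, phi_n, r_bits):
    """Return (d', r) with d' = d + r * phi(N) and r a random r_bits-bit scalar."""
    r = secrets.randbits(r_bits)
    return d + r * phi_n, r

# Toy RSA parameters (illustrative only).
p, q = 61, 53
N = p * q                      # modulus
phi_n = (p - 1) * (q - 1)      # Euler totient of N
e, d = 17, 2753                # e * d = 1 mod phi(N)

message = 65
ciphertext = pow(message, e, N)

d_blinded, r = blind_exponent(d, phi_n, r_bits=8)
plain_unblinded = pow(ciphertext, d, N)
plain_blinded = pow(ciphertext, d_blinded, N)
```

Each execution draws a fresh r, so the exponent bits observed through the side channel change from trace to trace while the result stays the same.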
Algorithm [SW14]. In the alternate attack scenario against RSA without CRT, it is assumed that the attacker knows the upper half of the binary representation of $\varphi(N)$ because it is similar to that of $N$. Let $K$ be the bit-length of the secret exponent, let $R$ be the bit-length of the blinding scalar, and let $d \gg s = \lfloor d/2^s \rfloor$ denote the bits of $d$ shifted to the right by $s$ places. If $s \geq K/2 + R + 6$, then $\lfloor d/2^s \rfloor$ depends on the upper half of the bits of $\varphi(N)$. Given a secret blinded exponent $d'$, Schindler and Wiemers introduce $\alpha = \lfloor d'/2^{K-1} \rfloor$ and $\beta$ such that $0 \leq \beta < 2^{K-1} < \varphi(N)$ to rewrite $d'$ in such a way that the $(R + 1)$ most significant bits influence $\alpha$ while the $(K - 1)$ least significant bits influence $\beta$. Then, the authors define $(d' \gg s)$ as:
$$(d' \gg s) = \left\lfloor \frac{d + r \cdot \varphi(N)}{2^s} \right\rfloor = \left\lfloor \frac{\alpha 2^{K-1} + \beta}{2^s} \right\rfloor,$$
and,
$$(d \gg s) = \left\lfloor \frac{(\alpha 2^{K-1} \bmod \varphi(N)) + \beta - \omega \varphi(N)}{2^s} \right\rfloor = \left\lfloor \frac{(\alpha 2^{K-1} \bmod N) + \beta - \omega N}{2^s} \right\rfloor,$$
with high probability for an unknown $\omega \in \{0, 1\}$ and $s \geq K/2 + R + 6$. When an adversary captures the leakage traces, he guesses the randomized exponent to obtain an estimation $\hat{d}'$ of the true blinded exponent $d'$: where $e$ expresses the guessing error induced by the exponent $\hat{d}'$, '⊕' denotes the bitwise XOR operation, and $\hat{\alpha}$ (resp. $\hat{\beta}$, $\hat{\omega}$) is an estimation of $\alpha$ (resp. $\beta$, $\omega$).
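The decomposition of the blinded exponent into α and β, and the reason why N is a good stand-in for φ(N) on the upper bits, can be checked numerically. The sketch below uses hypothetical toy parameters and a fixed blinding scalar; it only verifies the exact identities (the split of d' and the reduction modulo φ(N) with ω in {0, 1}), not the "with high probability" substitution of φ(N) by N.

```python
def split_blinded_exponent(d_blinded, K):
    """Split d' into alpha (upper bits) and beta (K-1 lower bits): d' = alpha * 2^(K-1) + beta."""
    alpha = d_blinded >> (K - 1)
    beta = d_blinded & ((1 << (K - 1)) - 1)
    return alpha, beta

# Toy parameters (illustrative only, far too small for real use).
p, q = 61, 53
N = p * q
phi_n = (p - 1) * (q - 1)
d = 2753
K = d.bit_length()             # bit-length of the secret exponent (12 here)
r = 201                        # fixed 8-bit blinding scalar for reproducibility
d_blinded = d + r * phi_n

alpha, beta = split_blinded_exponent(d_blinded, K)

# Reducing alpha * 2^(K-1) modulo phi(N) and adding beta gives d up to one subtraction of phi(N).
t = ((alpha << (K - 1)) % phi_n) + beta
omega = (t - d) // phi_n

# N and phi(N) differ only by p + q - 1, i.e. roughly in their lower half.
gap = N - phi_n
```

Since the gap p + q − 1 only spans about K/2 bits, replacing φ(N) by N perturbs the low-order part of the quotient, which is why the shift s has to leave a safety margin above K/2.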
Given a per-bit error rate, an adversary can estimate the number of erroneous bits in $\hat{\alpha}$. The idea of the alternate attack against RSA without CRT is to generate all candidates for $\alpha$ (denoted $\hat{\alpha}_c$) and compute the resulting blinding factor candidates as $\hat{r}_c = (\hat{d}' - \hat{d})/N = \lfloor \hat{\alpha}_c 2^{K-1}/N \rfloor + \omega$ with $\omega \in \{0, 1\}$. Then, for each candidate $\hat{\alpha}_c$ and $\hat{r}_c$, the adversary can compute an estimation of the resulting error $\hat{e}$ based on a guess $\hat{d}$ of the secret exponent $d$ such that:
$$\hat{e} = \left(\hat{r}_c N + \left\lfloor \frac{\hat{d}}{2^s} \right\rfloor 2^s\right) \oplus \left(\hat{\alpha}_c 2^{K-1} + \hat{\beta}\right).$$
If $\lfloor \hat{d}/2^s \rfloor = \lfloor d/2^s \rfloor$, a blinding factor estimation $\hat{r}_c$ is defined as a candidate for $r$ if $HW(\lfloor \hat{e}/2^s \rfloor) \leq t_0$ with $t_0$ a threshold configured by the attacker. A smaller $t_0$ value induces a more restrictive candidate selection. The threshold $t_0$ should be selected such that no false candidates for $r$ are kept. More details on the alternate attack algorithm are provided in [SW14, Algorithm 4]. However, it is acceptable that some of the $\lfloor \hat{d}/2^s \rfloor$ candidates are wrongly guessed. Then, to retrieve the remaining bits of $\varphi(N)$, the adversary has to perform Step 3 of the Enhanced Attack introduced by Schindler and Itoh [SI11]. Of course, for a number $N_a$ of attack traces, we expect $q_{n_0,t_0} N_a$ candidates for $\lfloor \hat{d}/2^s \rfloor$ where, such that, the two brackets quantify the probabilities that $\hat{\alpha}$ and the relevant bits of $\hat{\beta}$ contain at most $n_0$ or $t_0$ guessing errors, respectively [SW17]. Through all these components, we can estimate the complexity of the resulting alternate attack for given $s$ and $t_0$ values.

Complexity.
First, the adversary has to configure the $s$, $t_0$ and $N_a$ values to perform successful attacks. Then, for a given $\hat{d}' = \hat{\alpha} 2^{K-1} + \hat{\beta}$, the adversary has to generate all $\hat{\alpha}_c$ candidates that differ from $\hat{\alpha}$ by at most $n_0$ bits. Hence, there are $M_0 = \sum_{i \leq n_0} \binom{R+1}{i}$ candidates for $\alpha$.
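The candidate count M_0, a sum of binomial coefficients over at most n_0 bit flips among the R + 1 bits of the estimated α, can be checked by direct enumeration. The sketch below is illustrative: the function names, the values R = 7 and n_0 = 2, and the guessed 8-bit value are assumptions for this example.

```python
from itertools import combinations
from math import comb

def count_alpha_candidates(R, n0):
    """M_0 = sum over i <= n0 of C(R + 1, i)."""
    return sum(comb(R + 1, i) for i in range(n0 + 1))

def alpha_candidates(alpha_hat, R, n0):
    """Yield every (R + 1)-bit value within Hamming distance n0 of alpha_hat."""
    for i in range(n0 + 1):
        for positions in combinations(range(R + 1), i):
            candidate = alpha_hat
            for pos in positions:
                candidate ^= 1 << pos   # flip one guessed bit
            yield candidate

R, n0 = 7, 2                     # toy blinding length and error budget
alpha_hat = 0b10110010           # hypothetical guess of the (R + 1) upper bits
m0 = count_alpha_candidates(R, n0)
candidates = list(alpha_candidates(alpha_hat, R, n0))
```

The count grows quickly with n_0, which is why a lower per-bit error rate directly reduces the remaining search effort.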
The computation of each candidate $\hat{r}_c$ and $\hat{e}_c$ depends on the number of $\hat{\alpha}_c$ elements. Therefore, there are $2 \cdot M_0$ candidates for $r$ and $2 \cdot M_0 \cdot 2^{K-s}$ candidates for $e$ in the worst case (i.e. if $\lfloor \hat{d}/2^s \rfloor \neq \lfloor d/2^s \rfloor$) and $2 \cdot M_0$ candidates for $e$ otherwise (i.e. if $\lfloor \hat{d}/2^s \rfloor = \lfloor d/2^s \rfloor$). In the following, we only consider the worst case scenario for the complexity estimation.
Given a secret exponent $d$ of bit-length $K$, a blinding scalar of bit-length $R$, a per-bit error rate and a number of attack traces $N_a$, the Alternate Attack Complexity $C_{AA}$ is defined as: with $t_0$ configured such that no false candidates for $r$ are selected.
Remark 10. To consider the Alternate Attack a success, the adversary has to define the number of attack traces $N_a$ that are needed for recovering all the bits of $\varphi(N)$. Hence, to correctly estimate $C_{AA}$, the adversary has to perform Step 3 of the Enhanced Attack [SI11] in order to find a correct assumption about $N_a$.

D Architectures for All Types of Diversity
The architectures used for the type I diversity are randomly selected such that the number of convolutional layers (with a BatchNormalization (BN) layer [IS15]) and the number of fully-connected (FC) layers do not exceed 2. Hence, we evaluate the type I diversity with the restriction of small network complexity. We select 5 architectures with a high individual $Acc_{label}$ value (i.e. ≥ 85%) to limit the impact of the outliers and preserve an overall good performance. All the architectures used for the type I diversity investigation are detailed in Table 9. For the type I + II diversity study, we randomly generate 100 models from the range of hyperparameters introduced in Table 2. From the resulting pool of classifiers, we pick out those with a high individual performance (i.e. $Acc_{label}$ ≥ 85%) such that their pairwise diversity measure (i.e. disagreement measure or κ-statistic) is maximized. The resulting architectures are detailed in Table 10.