Ranking Loss: Maximizing the Success Rate in Deep Learning Side-Channel Analysis

Abstract. The side-channel community recently investigated a new approach, based on deep learning, to significantly improve profiled attacks against embedded systems. Compared to template attacks, deep learning techniques can deal with protected implementations, such as masking or desynchronization, without substantial pre-processing. However, important issues are still open. One challenging problem is to adapt the methods classically used in the machine learning field (e.g. loss function, performance metrics) to the specific side-channel context in order to obtain optimal results. We propose a new loss function derived from the learning to rank approach that helps prevent the approximation and estimation errors induced by the classical cross-entropy loss. We theoretically demonstrate that this new function, called Ranking Loss (RkL), maximizes the success rate by minimizing the ranking error of the secret key in comparison with all other hypotheses. The resulting model converges towards the optimal distinguisher when considering the mutual information between the secret and the leakage. Consequently, the approximation error is prevented. Furthermore, the estimation error, induced by the cross-entropy, is reduced by up to 23%. When the ranking loss is used, the convergence towards the best solution is up to 23% faster than a model using the cross-entropy loss function. We validate our theoretical propositions on public datasets.


Introduction
Side-channel analysis (SCA) is a class of cryptographic attacks in which an evaluator tries to exploit the vulnerabilities of a system by analyzing its physical properties, including power consumption [KJJ99] or electromagnetic emissions [AARR03], to reveal secret information. During its execution, a cryptographic implementation manipulates sensitive variables that depend directly on the secret. Through the attack, evaluators try to recover this information by finding some leakage related to the secret. One of the most powerful types of SCA attacks are profiled attacks. In this scenario, the evaluators have access to a test device whose target intermediate values are known. Then, they can estimate the conditional distribution associated with each sensitive variable. They can predict the right sensitive value on a target device containing a secret they wish to retrieve by using multiple traces. In 2002, the first profiled attack was introduced by [CRR03], but this proposal was limited by its computational complexity. Very similar to profiled attacks, the application of machine learning algorithms was inevitably explored in the side-channel context [HGM+11, BL12, LBM14]. In this paper, we show that our new loss, derived from the "Learning to Rank" approach, is linked with the mutual information between a label and an input.
Finally, we confirm the relevance of our loss by applying it to the main public datasets and we compare the results with the classical cross-entropy and the recently introduced cross-entropy ratio [ZZN+20]. All these experiments can be reproduced through the following GitHub repository: https://github.com/gabzai/Ranking-Loss-SCA.
Paper Organization. The paper is organized as follows. Section 2 explains the similarity between profiled side-channel attacks and the machine learning approach. It also introduces the learning to rank approach applied to SCA. Section 3 proposes a new loss, called Ranking Loss (RkL), that generates a model converging towards the optimal distinguisher. In Section 4, we theoretically link the ranking loss with the mutual information. Finally, Section 5 presents an experimental validation of the ranking loss on public datasets, showing its efficiency.

Notation and terminology
Let calligraphic letters X denote sets, the corresponding capital letters X (resp. bold capital letters) denote random variables (resp. random vectors T) and the lowercase x (resp. t) denote their realizations. The i-th entry of a vector t is denoted t[i]. Side-channel traces are modeled as a random vector T ∈ R^{1×D} where D defines the dimension of each trace. The targeted sensitive variable is Z = f(P, K) where f denotes a cryptographic primitive, P (∈ P) denotes a public variable (e.g. plaintext or ciphertext) and K (∈ K) denotes the part of the key (e.g. a byte) that an adversary tries to retrieve. Z takes values in Z = {s_1, ..., s_{|Z|}} such that s_j denotes a score associated with the j-th sensitive variable. Let us denote by k* the secret key used by the cryptographic algorithm.

We define the following information-theoretic quantities needed in the rest of the paper [CT91]. The entropy of a random variable X, denoted H(X), measures the unpredictability of a realization x of X. It is defined by:

H(X) = − Σ_{x∈X} Pr[X = x] · log2(Pr[X = x]).

The conditional entropy of a random variable X knowing Y is defined by:

H(X|Y) = − Σ_{y∈Y} Pr[Y = y] Σ_{x∈X} Pr[X = x | Y = y] · log2(Pr[X = x | Y = y]).

The Mutual Information (MI) between two random variables X and Y is defined as:

MI(X; Y) = H(X) − H(X|Y). (1)

This quantifies how much information can be extracted about Y by observing X.
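These quantities can be checked numerically. The following sketch (assuming NumPy; the helper names are ours, not from the paper) estimates H and MI from a discrete joint distribution, using the equivalent identity MI(X; Y) = H(X) + H(Y) − H(X, Y):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(X) in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                      # 0 * log2(0) = 0 by convention
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """MI(X; Y) = H(X) + H(Y) - H(X, Y), computed from a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

# X a fair coin and Y an exact copy of X: observing Y reveals H(X) = 1 bit.
joint = np.array([[0.5, 0.0],
                  [0.0, 0.5]])
print(mutual_information(joint))  # 1.0
```

For independent variables the same function returns 0, matching the intuition that no information about one can be extracted from the other.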

Profiled Side-Channel Attacks
When attacking a device using a profiled attack, two stages must be considered: a building phase and a matching phase. During the first phase, evaluators have access to a test device where they control the input and the secret key of the cryptographic algorithm. They use this knowledge to locate the relevant leakages depending on Z. To characterize the points of interest, evaluators generate a model F : R^D → R^{|Z|} that estimates the probability Pr[T|Z = z] from a profiling set T = {(t_0, z_0), ..., (t_{N_p−1}, z_{N_p−1})} of size N_p. Once the leakage model is generated, evaluators estimate which intermediate value is processed using the prediction function F(·). By predicting this sensitive variable and knowing the input used during the encryption, an evaluator can use a set of attack traces of size N_a and compute a score vector, based on F(t_i), i ∈ {0, 1, ..., N_a − 1}, for each key hypothesis. Indeed, for each k ∈ K, this score is defined as:

s_Na(k) = Σ_{i=0}^{N_a−1} log(F(t_i)[z_i]), (2)

where z_i = f(p_i, k) and f denotes a cryptographic primitive.
Based on the scores introduced in Equation 2, we can sort all the key candidates into a vector of size |K|, denoted g_Na = (g_Na^1, g_Na^2, ..., g_Na^{|K|}), such that:

rank(g_Na, k*) = i such that g_Na^i = k* (3)

defines the position of the secret key k*, in g_Na, amongst all hypotheses. We consider g_Na^1 as the most likely candidate and g_Na^{|K|} as the least likely one. Commonly, this position is called the rank. The rank of the correct key gives us an insight into how well our model performs. Given a number of traces N_a, the Success Rate (SR) is a metric that defines the probability that an attack succeeds in recovering the secret key k* amongst all hypotheses. A success rate of β% means that β attacks out of 100 realizations succeed in retrieving k*. In [SMY09], Standaert et al. propose to extend the notion of success rate to an arbitrary order d such that:

SR^d(N_a) = Pr[rank(g_Na, k*) ≤ d]. (4)

In other words, the d-th order success rate is defined as the probability that the target secret k* is ranked amongst the d first key guesses in the score vector g_Na. In profiling attacks, an evaluator wants to find a model F such that the condition SR^d(N_a) > β is verified with the minimum number of attack traces N_a.
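As an illustration of Equations 2-4, the following sketch (a toy simulation with a hypothetical XOR primitive and randomly generated model outputs, not the paper's datasets) aggregates log-likelihood scores over N_a traces and computes the rank of the secret key:

```python
import numpy as np

rng = np.random.default_rng(0)
NB_KEYS = 256          # one key byte
NB_TRACES = 50         # N_a

def score_vector(log_probs, plaintexts, f):
    """Aggregated log-likelihood scores s_Na(k) for every key hypothesis (Equation 2)."""
    scores = np.zeros(NB_KEYS)
    for lp, p in zip(log_probs, plaintexts):
        for k in range(NB_KEYS):
            scores[k] += lp[f(p, k)]   # z_i = f(p_i, k)
    return scores

def rank_of(scores, k_star):
    """Rank of k* when candidates are sorted by decreasing score (Equation 3); 0 = best."""
    return int(np.sum(scores > scores[k_star]))

f = lambda p, k: p ^ k                 # toy primitive (hypothetical, no S-box)
k_star = 42
plaintexts = rng.integers(0, NB_KEYS, size=NB_TRACES)

# Simulated model output: the true class gets a slightly higher log-probability.
log_probs = rng.normal(0.0, 1.0, size=(NB_TRACES, NB_KEYS))
log_probs[np.arange(NB_TRACES), [f(p, k_star) for p in plaintexts]] += 3.0

scores = score_vector(log_probs, plaintexts, f)
print(rank_of(scores, k_star))  # 0: the secret key is the most likely candidate
```

Even a small per-trace advantage on the true class accumulates over N_a traces, which is exactly why the success rate improves as more attack traces are aggregated.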

Neural Networks in Side-Channel Analysis
Profiled SCA can be formulated as a classification problem. Given an input, a neural network constructs a function F_θ : R^D → R^{|Z|} that computes an output called a prediction. During the training process, a set of parameters θ, called trainable parameters, is updated in order to generate the model. To solve a classification problem, the function F_θ must find the right prediction y ∈ Z associated with the input t with high confidence. To find the optimized solution, a neural network has to be trained using a profiling set of N_p pairs (t_i^p, y_i^p) where t_i^p is the i-th profiling input and y_i^p is the associated label. In SCA, the input of a neural network is a side-channel measurement and the related label is defined by the corresponding sensitive value z. The input goes through the network in order to estimate the corresponding probability vector ŷ_i^p such that ŷ_i^p[z] approximates Pr[Z = z | t_i^p]. This probability is computed with the softmax function [GBC16]. This function maps the outputs of each class to [0; 1]. Due to the exponential terms, the softmax helps to easily discriminate the classes with high confidence. As in a classical profiling attack, we can use the resulting values ŷ_i^p to compute the score for each key hypothesis and then estimate the d-th order success rate. To quantify the classification error of F_θ over the profiling set, a loss function has to be configured. Indeed, this function reduces the error of the model in order to optimize the prediction. For that purpose, backward propagation [GBC16] is applied to update the trainable parameters (e.g. weights) and minimize the loss function. The classical loss function used in side-channel analysis is based on the cross-entropy.
Definition 1 (Cross-Entropy). Given a joint probability distribution of a sensitive cryptographic primitive Z and corresponding leakage T, denoted as Pr[T, Z], we define the cross-entropy of a deep learning model F_θ as:

CE(F_θ) = −E_{(t,z)∼Pr[T,Z]} [log2(F_θ(t)[z])].

Given a profiling set T of N_p pairs (t_i^p, y_i^p)_{0≤i<N_p} and a classifier F_θ with parameter θ, the Categorical Cross-Entropy (CCE) loss function is an estimation of the cross-entropy such that:

L_CCE(F_θ) = −(1/N_p) Σ_{i=0}^{N_p−1} log2(F_θ(t_i^p)[y_i^p]).

In other words, minimizing the categorical cross-entropy reduces the dissimilarity between the right distributions and the predicted distributions for a set of inputs. According to the Law of Large Numbers, the categorical cross-entropy loss function converges in probability towards the cross-entropy for any θ [SSBD14]. However, no information about the gap between the empirically estimated error and its true unknown value is given for any finite profiling set T. In [MDP19b], Masure et al. study the theoretical soundness of the categorical cross-entropy, denoted as Negative Log Likelihood (NLL), in side-channel analysis to quantify its relevance in the leakage exploitation. They demonstrate that minimizing the categorical cross-entropy loss function is equivalent to maximizing an estimation of the Perceived Information (PI) [RSVC+11], which is defined as a lower bound of the Mutual Information (MI) between the leakage and the target secret. Consequently, when the categorical cross-entropy is used as a loss function, the number of traces needed to reach a 1st order success rate is defined as an upper bound of the optimal solution [dCGRP19]. Because the PI is substituted for the MI, some sources of imprecision could affect the quality of the model F_θ. As mentioned in [DSVC14, MDP19b], the gap between the estimation of the PI, defined by the categorical cross-entropy, and the MI can be decomposed into three errors:

• Approximation error – it defines the deviation between the empirical estimation of the PI and the MI. In theory, the Kullback-Leibler divergence [KL51] can be computed in order to evaluate this deviation. However, evaluators face the problem that the leakage Probability Density Function (PDF) is unknown.
• Estimation error – the minimization of the categorical cross-entropy maximizes the empirical estimation of the PI rather than the real value of the PI. Therefore, the finite set of profiling traces may be too small to estimate the perceived information properly. Consequently, this error can be quantified for a given number of profiling traces.
These errors caused by the categorical cross-entropy could impact the training process. Section 5 emphasizes the impact of these error terms on different datasets. In addition, the more secure the system, the larger the inequality between the empirical PI and the MI. As mentioned in [DFS15, BHM+19], a higher MI implies a more powerful maximum likelihood attack where the secret key k* can be extracted more efficiently. In other words, the probability of success is linked with the MI. Therefore, a new loss derived from the success rate could minimize the approximation and estimation errors, unlike the categorical cross-entropy. In addition, such a loss could help to converge towards the optimal distinguisher for a given number of traces.
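For concreteness, here is a minimal NumPy sketch of the softmax and the empirical categorical cross-entropy of Definition 1 (base-2 logarithm, so the loss is expressed in bits; the helper names are ours):

```python
import numpy as np

def softmax(logits):
    """Map raw network outputs to a probability distribution over |Z| classes."""
    z = logits - np.max(logits, axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cce_loss(logits, labels):
    """Empirical categorical cross-entropy of Definition 1, in bits."""
    probs = softmax(logits)
    n = len(labels)
    return float(-np.mean(np.log2(probs[np.arange(n), labels])))

# A confident, correct prediction costs almost nothing; a uniform one costs log2(|Z|).
print(cce_loss(np.array([[10.0, 0.0, 0.0]]), np.array([0])))  # close to 0
print(cce_loss(np.array([[0.0, 0.0, 0.0]]), np.array([0])))   # log2(3), about 1.585
```

Note how the loss only looks at the probability of the correct class: no comparison with the wrong hypotheses is made, which is precisely the limitation the ranking loss addresses.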

Learning To Rank Approach in Side-Channel Analysis
The "Learning to Rank" approach refers to machine learning techniques for training a model to perform a ranking task. This approach is useful for many applications in Natural Language Processing, Data Mining and Information Retrieval [Liu09, Bur10, Li11a, Li11b]. Learning to rank is a supervised learning task composed of a set of documents D and a query q. In document retrieval, the ranking task is performed by using a ranking model F_θ(q, d) to sort a document d ∈ D depending on its relevance with respect to the query q. The relevance of the documents with respect to the query is represented by several grades defined as labels. The higher the grade, the more relevant the document. In the side-channel context, the application of learning to rank can be useful to efficiently evaluate the rank of k*. Contrary to the classical learning to rank approach, which compares the relevance of inputs, we propose to adapt the "Learning to Rank" approach to the side-channel context through the comparison of the score related to the sensitive information with those of the other classes. Consequently, for a given input, this approach penalizes the training process when the score related to k* is not considered as the most relevant. Classically, three approaches can be considered in the learning to rank field. We adapt these approaches to the side-channel context as follows:

• Pointwise approach can be seen as a regression or a classification problem. In a classification task, given a trace t, the ranking function F_θ is trained such that F_θ(t)[c] = s_c defines the relevance of a specific class c given the trace t. Then, the final phase consists in sorting the classes depending on their scores. This entire process is exactly what the evaluator does in the classical deep learning side-channel approach.
• Pairwise approach predicts the order between a pair of scores, such that F_θ(t, (c_i, c_j)) defines the probability that the i-th class c_i has a better rank than c_j given a trace t [BSR+05, BRL07, WBSG10]. The comparison between each pair of scores builds the ranking of the whole model.
• Listwise approach directly sorts the entire list of sensitive information and tries to come up with the optimal ordering. During the training process, it assumes that the evaluator can predetermine the relevance of each class depending on a given input. Consequently, given a trace t, the evaluator should be able to define the rank of each irrelevant class.
The first approach is classically used when the evaluator wants to perform a classical deep learning side-channel attack. The listwise approach seems difficult to apply in our context because an evaluator cannot precisely define the rank of the irrelevant classes given a trace t. Finally, the pairwise approach can be useful in order to optimize the rank of a specific output in comparison with the others. In the following sections, we demonstrate how the pairwise approach can be used to discriminate the score of the relevant output. The other learning to rank approaches are beyond the scope of this paper.

Ranking Loss: A Learning Metric Adapted For SCA
This section presents our main contribution: the Ranking Loss (RkL). Section 3.1 explains to what extent the pairwise approach and the success rate are linked in order to propose the ranking loss. Then, Section 3.2 demonstrates the theoretical bounds of the ranking loss regarding the success rate. Finally, Section 3.3 presents the impact of the ranking loss on the scores during the training process.

Ranking Loss Maximizes the Success Rate
Pairwise approach and success rate. For a given query q, each pair of documents (d_i, d_j) is presented to the ranking model which computes the order of relevance between d_i and d_j. We denote by d_i ≻ d_j the event that d_i should be ranked higher than d_j. The corresponding loss function maps the scores associated with d_i and d_j and penalizes the training process when the relation d_i ≻ d_j is not respected. This penalization is exactly what we want to optimize when the 1st order success rate is considered.
Let N_a be the number of traces needed to perform an attack. From Equation 3 and Equation 4, the following relation can be deduced:

SR^1(N_a) = Pr[s_Na(k*) > s_Na(k), ∀k ∈ K \ {k*}]. (5)

In other words, measuring the 1st order success rate is equivalent to computing the probability that the score related to the secret key k* is higher than that of every other key hypothesis. This means defining the probability that the class c_{k*} is ranked higher than c_k for all k ∈ K \ {k*}. This approach is equivalent to the pairwise approach defined earlier. Given a key hypothesis k and the secret key k*, the probability that c_{k*} ≻ c_k can be estimated via a sigmoid function [BZBN19] such that:

P_{k*,k} = Pr[c_{k*} ≻ c_k] = 1 / (1 + e^{−α(s_Na(k*) − s_Na(k))}), (6)

where α denotes the parameter of the sigmoid function. The value of α greatly impacts the training process. We evaluate its impact in Appendix A. In the following, we assume that α is well configured.
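Equation 6 can be sketched directly (hypothetical helper name, assuming NumPy):

```python
import numpy as np

def pairwise_prob(s_kstar, s_k, alpha=1.0):
    """Estimated Pr[c_k* is ranked above c_k]: a sigmoid of the score difference (Equation 6)."""
    return float(1.0 / (1.0 + np.exp(-alpha * (s_kstar - s_k))))

print(pairwise_prob(2.0, 2.0))            # tied scores -> 0.5
print(pairwise_prob(5.0, 1.0))            # k* clearly ahead -> close to 1
print(pairwise_prob(5.0, 1.0, alpha=5.0)) # a larger alpha sharpens the decision
```

A larger α makes the sigmoid steeper, so the same score difference is judged more confidently; this is why α must be tuned carefully (see Appendix A).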
Definition of the Ranking Loss. We apply the cross-entropy loss function in order to penalize the deviation of the model probabilities from the desired prediction. In other words, we want to penalize the loss function when the expected relation c_{k*} ≻ c_k is not observed. Thus, we define a partial loss function l_Na(c_{k*}, c_k), for a given hypothesis k, as:

l_Na(c_{k*}, c_k) = −P̄_{k*,k} · log2(P_{k*,k}) − (1 − P̄_{k*,k}) · log2(1 − P_{k*,k}), (7)

where P_{k*,k} is defined in Equation 6 and P̄_{k*,k} defines the true unknown probability that k* is ranked higher than k.
In the rest of the paper, we assume that the ranking value is deterministically known such that P̄_{k*,k} = rel_{k*,k}, where rel_{k*,k} ∈ {0, 1} denotes the relevance of k* with respect to k. We assume that rel_{k*,k} is always equal to 1. In the side-channel context, this approximation is reliable

a For convenience, c_{k*} is used to denote the class related to the correct label. Hence, c_{k*} is also used to define the class associated with f(p, k*), where f is a cryptographic primitive and p characterizes a plaintext value.
because we want to maximize the score related to k* compared with the other hypotheses. From Equation 6 and Equation 7, we can deduce the following partial loss function:

l_Na(c_{k*}, c_k) = log2(1 + e^{−α(s_Na(k*) − s_Na(k))}). (8)

Equation 8 gives us an insight into how the cost function penalizes the training process when the relation c_{k*} ≻ c_k is not the expected result. Therefore, maximizing the success rate tends to minimize the ranking error between the secret key k* and a hypothesis k. As a reminder, the cost function presented in Equation 8 is only applied to a single key hypothesis. In order to efficiently train a side-channel model, we have to apply this cost function to each key hypothesis in order to maximize the rank of the secret key.
Definition 2 (Ranking loss – Our contribution). Given a profiling set T of N_p pairs (t_i^p, y_i^p)_{0≤i<N_p}, a classifier F_θ with parameter θ and a number of attack traces N_a, we define the Ranking Loss (RkL) function as:

L_RkL(F_θ) = Σ_{k ∈ K\{k*}} log2(1 + e^{−α(s_Na(k*) − s_Na(k))}), (9)

where s_Na(k) denotes the output score (see footnote b) of the hypothesis k aggregated over N_a traces (see Equation 2).

Remark 1. To discriminate the right output and normalize the scores into a probability distribution, we compute the softmax function of each score during the attack phase. The softmax function converts negative scores to very low probabilities. This is essential to perform a side-channel attack (see Equation 2).
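A minimal sketch of Definition 2 on a single vector of aggregated scores (our own helper, not the repository's Keras implementation; base-2 logarithm as in Equation 9):

```python
import numpy as np

def ranking_loss(scores, k_star, alpha=1.0):
    """Ranking Loss of Definition 2 on one vector of aggregated scores s_Na(k).

    Sums log2(1 + exp(-alpha * (s(k*) - s(k)))) over every wrong hypothesis k;
    np.logaddexp is used for numerical stability on large score gaps.
    """
    scores = np.asarray(scores, dtype=float)
    diff = scores[k_star] - np.delete(scores, k_star)
    return float(np.sum(np.logaddexp(0.0, -alpha * diff)) / np.log(2))

# The loss shrinks as the score of the secret key pulls ahead of the other hypotheses.
print(ranking_loss([1.0, 1.0, 1.0], k_star=0))  # tied scores: 2 * log2(2), i.e. 2.0
print(ranking_loss([9.0, 1.0, 1.0], k_star=0))  # well separated: close to 0
```

Unlike the categorical cross-entropy, every term of the sum explicitly compares the secret key's score with that of a wrong hypothesis.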
With the ranking loss, we are more concerned with the relative order of the relevance of the key hypotheses than with their absolute values (i.e. the categorical cross-entropy). Consequently, maximizing the success rate is equivalent to minimizing the ranking error for each pair (k*, k)_{k∈K}. Furthermore, Definition 2 takes into account the number of attack traces needed to perform a successful side-channel attack. The ranking loss penalizes the network depending on the number N_a of training scores that an adversary aggregates before iterating the training process. The ranking loss tends to maximize the success rate for a given number N_a of traces and converges towards the optimal distinguisher introduced in [BGH+17].
Remark 2. In classic information retrieval tasks, (d_i, d_j) and (d'_i, d'_j) characterize two different pairs of documents such that the following relation is defined: s(d_i) − s(d_j) = s(d'_i) − s(d'_j). For each pair of documents, if the difference of their scores is equal, then the final loss will be the same regardless of the rank of the documents. In this specific case, swapping the rank of d_i (resp. d_j) and d'_i (resp. d'_j) does not impact the loss function. The loss only cares about the total number of pairwise rankings it gets wrong. This can be particularly problematic if we are interested in the top-ranked items. To solve this issue, the pairwise approach applies some information retrieval (IR) measures (e.g. Discounted Cumulative Gain [JK02], Normalized Discounted Cumulative Gain [JK00], Expected Reciprocal Rank [CMZG09], Mean Average Precision, etc.) to compute the loss. However, in our context, d_i = d'_i; therefore the difference between the scores of each pair (d_i, d_j) and (d'_i, d'_j) gives us enough information on the position of d_i relative to d_j and d'_j. Consequently, the addition of IR metrics is not relevant. Moreover, IR metrics can be either discontinuous or flat, so gradient descent appears to be problematic (i.e. gradient equal to 0 or not defined) unless some appropriate approximation is used.
b Here, the output score denotes the value before the softmax function. This choice is made to impact the training process according to the relative order of the key hypotheses' relevance instead of the normalized probability distribution. However, the classical side-channel score (see Equation 2) can also be applied.

Theoretical Bounds of the Ranking Loss
In this section, we show that the ranking loss is an upper bound of the measure-based ranking error. In the learning to rank research area, information retrieval measures are used to evaluate the network performance. In most cases, two categories of metrics can be used: those designed for binary relevance levels (e.g. Mean Average Precision (MAP), Mean Reciprocal Rank (MRR) [Cra09]) and those designed for multiple levels of relevance (e.g. discounted cumulative gain, normalized discounted cumulative gain [JK00]). In the side-channel context, there are only two levels of relevance, such that 1 is associated with the correct output class and 0 otherwise. Thus, the metrics for binary relevance levels have to be considered. The Mean Average Precision (MAP) defines the average precision of the secret key k* over the |K| hypotheses. Let d be a threshold and MAP@d the average precision of the secret key k* in the top d relevant positions. In particular, MAP@1 can be seen as a 1st order success rate (see Appendix B). As mentioned in [CLL+09a], the standard pairwise loss is considered as an upper bound of the measure-based ranking error defined by 1 − MAP@|K| (justifications are provided in Appendix C).
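With a single relevant candidate, MAP@d reduces to a simple closed form; the following sketch (our own helper) illustrates that MAP@1 coincides with the 1st order success rate:

```python
import numpy as np

def map_at_d(scores, k_star, d):
    """MAP@d with binary relevance and a single relevant item (the secret key k*).

    With one relevant candidate, the average precision truncated at rank d is
    1/rank(k*) if k* appears in the top d positions, and 0 otherwise.
    """
    rank = int(np.sum(np.asarray(scores) > scores[k_star])) + 1  # 1-indexed rank
    return 1.0 / rank if rank <= d else 0.0

scores = [0.1, 0.7, 0.2, 0.5]
print(map_at_d(scores, k_star=1, d=1))  # k* ranked first -> MAP@1 = 1 = SR^1
print(map_at_d(scores, k_star=3, d=1))  # k* ranked second -> MAP@1 = 0
```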
Theorem 1 ([CLL+09a]). Given 2-level rating data with n_1 objects having label 1 and n_1 > 0, the following inequality holds:

1 − MAP ≤ (1/n_1) Σ_{(i,j) : gr(j) < gr(i)} l(s_i, s_j),

where gr(i) defines the label (or grade) associated with the i-th key hypothesis (i.e. 0 or 1) and l denotes the pairwise logistic loss applied to the scores s_i and s_j.
In our context, 2-level rating data means that each class c_k has a grade gr(c_k) ∈ {0, 1} such that gr(c_k) = 1 iff k = k*. In side-channel analysis, there is only one key candidate with a label equal to 1 in the one-hot encoding representation (i.e. the label corresponding to the sensitive output). Thus, Theorem 1 can easily be rewritten in terms of the ranking loss.
Proposition 1 (Our contribution). Given 2-level rating data with n_1 objects having label 1 and n_1 > 0, a profiling set T of N_p pairs (t_i^p, y_i^p)_{0≤i<N_p}, a classifier F_θ with parameter θ and a number of attack traces N_a, the following inequality holds:

1 − SR^1(N_a) ≤ L_RkL(F_θ).

Proof. Following [CLL+09a, Theorem 2] and [CLL+09b, Lemma 1], given 2-level rating data, it can be proved that:

n_1 − i_0 + 1 ≤ Σ_{(i,j) : gr(j) < gr(i)} l(s_i, s_j), (10)

where n_1 denotes the number of elements having label 1 and i_0 defines the position of the first object with label 0 in the ranking list.
If i_0 > n_1, the first element with label 0 is ranked after position n_1. In SCA, there is only one candidate with label 1 (i.e. c_{k*}). Hence, the correct candidate is ranked at the first position and 1 − SR^1(N_a) = 0. Similarly, the first element with label 0 is ranked at the second position and n_1 − i_0 + 1 = 0. If i_0 ≤ n_1, the first element with label 0 is ranked before the correct class c_{k*}. Consequently, the correct candidate is ranked at the second position at best and 1 − SR^1(N_a) = 1. Similarly, the first element with label 0 is ranked at the first position and n_1 − i_0 + 1 = 1. Thus,

1 − SR^1(N_a) = n_1 − i_0 + 1.

Finally, from Equation 10, we can easily rewrite the right part of the inequality as:

Σ_{(i,j) : gr(j) < gr(i)} l(s_i, s_j) = Σ_{k ∈ K\{k*}} l_Na(c_{k*}, c_k) = L_RkL(F_θ).

Indeed, the condition gr(j) < gr(i) holds for all j ∈ K \ {k*} iff i corresponds to c_{k*}. This result implies:

1 − SR^1(N_a) ≤ L_RkL(F_θ).

From Proposition 1, we can deduce that minimizing the ranking loss is equivalent to maximizing the 1st order success rate. This inequality implies a maximization of the success rate through the minimization of the ranking loss. Therefore, the value of the loss function gives us an insight into how well our network performs with respect to the success rate.
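Proposition 1 can also be checked empirically. The sketch below (our own code, random score vectors with |K| = 16) verifies that 1 − SR^1(N_a) never exceeds the ranking loss:

```python
import numpy as np

rng = np.random.default_rng(1)

def ranking_loss(scores, k_star, alpha=1.0):
    """Ranking loss of Definition 2 (base-2 logarithm) on one score vector."""
    diff = scores[k_star] - np.delete(scores, k_star)
    return float(np.sum(np.logaddexp(0.0, -alpha * diff)) / np.log(2))

def sr1(scores, k_star):
    """1st order success rate of a single attack: 1 iff k* has the highest score."""
    return float(np.argmax(scores) == k_star)

# Proposition 1: 1 - SR^1(Na) <= L_RkL, checked on 1000 random score vectors.
ok = all(
    1.0 - sr1(s, 0) <= ranking_loss(s, 0) + 1e-12
    for s in rng.normal(size=(1000, 16))
)
print(ok)  # True
```

Intuitively, whenever k* is not top-ranked, at least one wrong hypothesis outscores it, and that pair alone contributes at least log2(2) = 1 to the loss.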

Impact of the Ranking Loss during the Training Process
This subsection theoretically explains how the training process precisely orders the secret key k* amongst all the hypotheses.
The training process aims at optimizing the loss function in order to minimize the error made by the network. This process can be decomposed into two phases: the forward propagation and the backward propagation [GBC16]. Given an input, the goal of the forward propagation is to feed training examples to the network in the forward direction by processing successive linear and non-linear transformations in order to predict a value related to the input. Once this process is done, the backward propagation measures the error between the predictions and the correct output and tries to reduce it by updating the parameters θ that compose the network. Let l_Na(c_{k*}, c_k) be the partial loss function defined in Equation 8, where k* is the secret key and k ∈ K is a key hypothesis. Let w_i ∈ θ be the i-th trainable parameter. The backpropagation updates the weights as follows:

w_i ← w_i − η · (∂l_Na(c_{k*}, c_k) / ∂w_i), (12)

where η denotes the learning rate and (s_k)_{k∈K} defines the output scores related to the classes (c_k)_{k∈K}.
From Equation 8 and Equation 12, we can deduce the following equation:

∂l_Na(c_{k*}, c_k) / ∂w_i = γ_α(s_{k*}, s_k) · (∂s_Na(k)/∂w_i − ∂s_Na(k*)/∂w_i), with γ_α(s_{k*}, s_k) = α / (1 + e^{α(s_Na(k*) − s_Na(k))}), (13)

up to the constant 1/ln(2) induced by the base-2 logarithm. This derivative can be decomposed into two parts. First, computing the gradient of the ranking loss is equivalent to computing a gradient ascent on the score s_{k*} and a gradient descent on s_k. As mentioned in Section 3.1, the score value is defined by the prediction before the softmax function. The training process updates the weights to increase the score related to the secret key and to reduce the score related to the hypothesis k. Secondly, the norm of the gradient vectors is scaled by γ_α(s_{k*}, s_k). Depending on the difference between s_Na(k*) and s_Na(k), the resulting norm varies as follows:

• If s_Na(k) ≫ s_Na(k*), γ_α(s_{k*}, s_k) tends to converge towards α; thus, the norm of the gradient vector related to each score is maximized.
• If s_Na(k) = s_Na(k*), γ_α(s_{k*}, s_k) = α/2; thus, the norm of the gradient vector related to each score is divided by 2.
• If s_Na(k) ≪ s_Na(k*), γ_α(s_{k*}, s_k) tends to converge towards 0; thus, the norm of the gradient vector related to each score is minimized.
The gradient of the ranking loss defined in Equation 9 can be derived as:

∂L_RkL(F_θ) / ∂w_i = Σ_{k ∈ K\{k*}} γ_α(s_{k*}, s_k) · (∂s_Na(k)/∂w_i − ∂s_Na(k*)/∂w_i).

Therefore, the ranking loss proposed in Equation 9 pushes the score of the secret key up and the scores of the other key hypotheses down via gradient ascent/descent on pairs of items. This is equivalent to maximizing the success rate. For each pair (k*, k)_{k∈K}, there are two "forces" at play. The force that each pair exerts is proportional to the difference of their scores multiplied by α. Consequently, α should be carefully configured during the training process. The force applied on the secret key k* is equal to the sum of the forces exerted on each pair. Consequently, using the ranking loss tends to order the secret key at the highest position, which is equivalent to maximizing the success rate.
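The three regimes of the scaling factor γ_α can be checked numerically (our own helper; the constant 1/ln(2) coming from the base-2 logarithm is ignored here, as in the discussion above):

```python
import numpy as np

def gamma(s_kstar, s_k, alpha=1.0):
    """Scaling factor gamma_alpha(s_k*, s_k) = alpha / (1 + exp(alpha * (s_k* - s_k)))."""
    return float(alpha / (1.0 + np.exp(alpha * (s_kstar - s_k))))

print(gamma(0.0, 10.0))  # badly ranked pair: close to alpha (strong update)
print(gamma(0.0, 0.0))   # tied scores: exactly alpha / 2
print(gamma(10.0, 0.0))  # well ranked pair: close to 0 (vanishing update)
```

This is the self-regulating behavior described above: pairs that are already correctly ordered barely contribute to the weight update, while badly ordered pairs drive it.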

Mutual Information Analysis of the Ranking Loss
This section studies the ranking loss from a mutual information point of view. Section 4.1 demonstrates that the ranking loss is an approximation of the optimal distinguisher. Then, Section 4.2 shows that the categorical cross-entropy is only a lower bound of the ranking loss. Finally, Section 4.3 analyzes the reduction of the different error terms when the ranking loss is used.

An Approximation of the Optimal Distinguisher
Let D be a distinguisher that maps a set of traces T and a sensitive variable Z to an estimation of the secret key k*.

Definition 3 (Optimal Distinguisher [HRG14]). Given a conditional probability distribution of a sensitive cryptographic primitive Z given a leakage T, denoted as Pr[Z|T], we define the optimal distinguisher as the one maximizing the success rate of an attack:

D(t, z) = argmax_{k∈K} Pr[Z = f(p, k) | T = t],

where f(p, k) = z is the sensitive information computed from a cryptographic primitive f, a plaintext p ∈ P and a key k ∈ K.
The main issue with Definition 3 is that the optimal distinguisher can only be computed if the leakage model is perfectly known [HRG14, BGH+15]. Therefore, one solution is to find an adequate estimation of Pr[Z = z|t]. Through the maximization of the success rate, we want to find an estimation D̂(t, z) of the optimal distinguisher that converges towards D(t, z) [BCG+17, BGH+17].
Definition 4 (Estimation of the Optimal Distinguisher). Given a conditional probability distribution of a sensitive cryptographic primitive Z given a leakage T and a parameter θ, denoted as Pr[Z|T, θ], we define the estimation of the optimal distinguisher as:

D̂(t, z) = argmax_{k∈K} Pr[Z = f(p, k) | T = t, θ],

where f(p, k) = z is the sensitive information computed from a cryptographic primitive f, a plaintext p ∈ P and a key k ∈ K.
To converge towards the true distinguisher D, an optimization algorithm shall be run in order to maximize the estimation Pr[Z = z|t, θ] and find a local maximum. However, the computation needed to reach the global optimum is computationally expensive due to a matrix inversion [SKS09, BCG+17]. Using the deep learning approach can be helpful in order to automatically reach the global optimum. When θ is optimal, finding a model F_θ that maximizes the success rate is equivalent to generating a distinguisher D̂(t, z) that converges towards the true D. Through Equation 1, we can assume that maximizing the success rate is equivalent to optimizing an estimation of the mutual information between the sensitive information and the leakage:

argmax_θ SR^1(N_a) = argmax_θ M̂I(Z; T; θ). (17)

Hence, finding a model F_θ that maximizes the success rate is equivalent to maximizing an estimation of the mutual information. When the hyperparameter configuration θ is optimal and N_p → ∞, this estimation converges towards the real mutual information. Therefore, the ranking loss can be considered as an upper bound of the actual categorical cross-entropy.

The Cross Entropy as a Lower Bound of the Ranking Loss
Classically, given a set of traces T, deep learning side-channel analysis tries to minimize the categorical cross-entropy loss function in order to maximize the score related to the true sensitive information. When the categorical cross-entropy loss function is used, the training process increases the score related to the true sensitive information in order to boost its rank. However, this loss function is not optimized for the IR measures (e.g. NDCG, MAP@d, SR^d) and no comparison is made with the irrelevant classes. Consequently, the loss function may emphasize irrelevant sensitive information [Liu09]. This is exactly what the pointwise approach does in learning to rank [Li11b].
As a consequence, a model F_{θ̂,CCE} using the categorical cross-entropy as a loss function can be considered as a lower bound estimator of the MI introduced in Equation 17:

P̂I(Z; T; θ̂_CCE) ≤ M̂I(Z; T; θ̂_RkL) ≤ MI(Z; T). (18)

Thus,

SR^1_{F_{θ̂,CCE}}(N_a) ≤ SR^1_{F_{θ̂,RkL}}(N_a). (19)

Therefore, the number of traces needed to perform a successful attack on F_{θ̂,CCE} is defined as an upper bound of the number of traces needed to perform a successful attack on a model F_{θ̂,RkL} that maximizes the estimation of the MI [BHM+19, dCGRP19, MDP19b]. This implies that:

N_{θ̂,RkL} ≤ N_{θ̂,CCE}. (20)

As mentioned in Section 4.1, the maximization of the 1st order success rate is equivalent to maximizing an estimation of the MI. Equation 20 illustrates that the ranking loss is more efficient than the usual loss used in the side-channel context. Indeed, when the ranking loss is used, the number of traces needed to reach a constant guessing entropy of 1 is defined as a lower bound of N_{θ̂,CCE}. Following [MDP19b], this inequality is due to three forms of errors that can be decomposed into approximation, estimation and optimization errors. In the next section, we analyze the errors made by the categorical cross-entropy and we compare them with the ranking loss.
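As a concrete illustration of the PI that the categorical cross-entropy implicitly maximizes, the following sketch (our own helper, assuming a uniform sensitive variable Z) computes PI = H(Z) − CCE from model probabilities; a model no better than random guessing yields PI = 0, and the MI upper-bounds any achievable PI:

```python
import numpy as np

def perceived_information(probs, labels):
    """Empirical PI estimate, assuming a uniform sensitive variable Z:
    PI = H(Z) - CCE = log2(|Z|) + (1/N) * sum_i log2(Pr_model[z_i | t_i])."""
    n, n_classes = probs.shape
    return float(np.log2(n_classes) + np.mean(np.log2(probs[np.arange(n), labels])))

# A model no better than random guessing extracts 0 bits of perceived information.
uniform = np.full((100, 4), 0.25)
print(perceived_information(uniform, np.zeros(100, dtype=int)))  # 0.0

# A model that puts 0.9 on the right class perceives about 0.85 bits.
good = np.full((100, 2), 0.1)
good[:, 0] = 0.9
print(perceived_information(good, np.zeros(100, dtype=int)))
```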

Error Analysis
First, this section recalls the gap between the categorical cross-entropy and the MI introduced in [MDP19b]. Then, we explain the theoretical benefits of the ranking loss through an error analysis between the ranking loss, the categorical cross-entropy and the MI. To assess the quality of the ranking loss, we have to evaluate the tightness of the inequalities defined in Equation 18.
Error Analysis of the Categorical Cross-Entropy. Proposed in [MDP19b], this error decomposition establishes the gap between the MI and the PI that we maximize with the categorical cross-entropy. To facilitate the comparison of our results with [MDP19b], we use the same notations. Let θ̂ denote an estimation of the parameters when a classical optimizer is used (i.e. SGD, Adam, RMSprop, Nadam, ...) and θ* the optimal hyperparameter vector obtained when a global minimum of the loss function is reached. In their paper, Masure et al. decompose the gap into three errors (i.e. approximation error, estimation error, optimization error). They also provide some simulations to assess the impact of the maximization of the PI on these errors, and argue that the approximation errors are negligible, no matter the countermeasures considered. However, Section 5 will show that, even in the simplest case, the approximation error can have a huge impact on the training process, such that some irrelevant features could be selected as points of interest. Consequently, these errors could highly impact the performance of the network. Equation 24 defines the error related to the optimization of the model trained with the loss proposed in Equation 9, while Equation 25 characterizes the estimation error, which can be reduced when the number of profiling traces converges towards infinity [BHM+19]. In Section 5, we show that this error is reduced by up to 23% when the ranking loss is used compared to the classical categorical cross-entropy.

Error Analysis of the Ranking Loss.
Error Gap between the Categorical Cross-Entropy and the Ranking Loss. Finally, the gap between the categorical cross-entropy and the ranking loss can be divided into different error terms. Let F θ,CCE (resp. F θ,RkL) be a model trained with the categorical cross-entropy loss (resp. the ranking loss). Here, we assume that the optimization error generated by both models is approximately the same. This strong assumption is useful to simplify our analysis and to focus on the benefits of using the ranking loss compared to the categorical cross-entropy. Consequently, Equation 26 shows that the difference between the models lies in the approximation error between PI(Z; T; θ) and MI(Z; T; θ). Indeed, the approximation error that defines the distance between the PI and the MI is removed when the success rate is maximized. One of the most challenging issues, induced by this approximation error, is thus prevented when the ranking loss is considered. In the next section, we validate all the theoretical observations on unprotected and protected implementations.

Experimental Results
To confirm our theoretical propositions, we complete the analysis of the ranking loss with experimental results on various public datasets.

Settings
The experiments are implemented in Python using the Keras library [C+15] and run on a workstation equipped with 16GB of RAM and an NVIDIA GTX 1080 Ti with 11GB of memory. All of the following architectures and hyperparameters are based on the best state-of-the-art results [ZBHV19]. Table 1 summarizes the choices made by Zaid et al. We define N t GE as the number of traces needed to reach a constant guessing entropy of 1. For a good estimation of N t GE, the attack traces are randomly shuffled and 100 N t GE values are computed to give the average value. In the next sections, the N a value needed to compute the ranking loss is set to 1. Hence, the ranking loss tends to maximize the success rate when only one attack trace is considered (see Definition 2).
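The N t GE estimation procedure above can be sketched as follows. This is a simplified NumPy version with hypothetical log-probability inputs, not the evaluation code used in the paper:

```python
import numpy as np

def guessing_entropy_curve(log_probs, true_key, n_shuffles=100, rng=None):
    """Average rank of the true key after accumulating per-trace
    log-likelihoods, re-shuffling the attack traces n_shuffles times.
    N t GE is the smallest trace count where the curve stays at 1."""
    rng = rng or np.random.default_rng(0)
    n_traces, _ = log_probs.shape
    ranks = np.zeros(n_traces)
    for _ in range(n_shuffles):
        order = rng.permutation(n_traces)
        cum = np.cumsum(log_probs[order], axis=0)
        # rank 1 means the true key has the highest cumulative score
        ranks += [1 + np.sum(c > c[true_key]) for c in cum]
    return ranks / n_shuffles

# Toy attack: key 3 leaks slightly more than 15 other hypotheses.
rng = np.random.default_rng(1)
lp = rng.normal(0.0, 1.0, (200, 16))
lp[:, 3] += 0.5
ge = guessing_entropy_curve(lp, true_key=3, rng=rng)
print(ge[-1])  # close to 1 once enough traces are accumulated
```

Averaging over shuffles smooths out the dependence of the rank curve on the order in which attack traces are consumed.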

Comparison with the Cross-Entropy Ratio
In the following, we compare the categorical cross-entropy and the cross-entropy ratio (CER) loss functions with the ranking loss on different publicly available datasets. Introduced in [ZZN+20], the cross-entropy ratio is defined as the ratio between the cross-entropy computed on the profiling set T and the cross-entropy computed on T r, where T r identifies the profiling set T with shuffled labels while keeping the traces unchanged.
Remark 3. In [ZZN+20], Zhang et al. construct the cross-entropy ratio as a loss specialized for imbalanced labels (e.g. when a particular leakage model is considered). The ranking loss proposed in this paper is generic: no leakage model is chosen (i.e. the identity function is used). A future work could study the suitability of the ranking loss on imbalanced labels.
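For reference, a minimal NumPy sketch of the cross-entropy ratio idea follows. It is simplified to an average over a few label shuffles and serves as an approximation of the formulation in [ZZN+20], not its exact training loss; helper names are ours:

```python
import numpy as np

def cross_entropy(probs, labels):
    eps = 1e-12  # numerical guard against log(0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

def cer_loss(probs, labels, n_shuffles=10, rng=None):
    """Cross-entropy ratio: CE on the true labels divided by the average
    CE on randomly shuffled labels (same traces, broken label link)."""
    rng = rng or np.random.default_rng(0)
    ce_shuffled = np.mean([cross_entropy(probs, rng.permutation(labels))
                           for _ in range(n_shuffles)])
    return cross_entropy(probs, labels) / ce_shuffled

probs = np.full((4, 4), 0.1)
probs[np.arange(4), np.arange(4)] = 0.7  # model favours the true class
labels = np.arange(4)
print(cer_loss(probs, labels))  # <= 1: the model beats (or ties) random labels
```

A ratio below 1 indicates that the model extracts label-dependent information; minimizing the ratio therefore normalizes the cross-entropy by what a label-agnostic model would achieve.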

Presentation of the Datasets
We use three different datasets for our experiments. All the datasets correspond to implementations of the Advanced Encryption Standard (AES) [DR02]. The datasets offer a wide range of use cases: a high-SNR unprotected implementation on a smart card, a low-SNR unprotected implementation on an FPGA, and a low-SNR protected implementation with first-order masking [SPQ05] and a random-delay effect.
• ChipWhisperer is an unprotected implementation of AES-128 (8-bit XMEGA target). Due to the lack of countermeasures, we can recover the secret directly. In this experiment, we attack the first-round S-box operation and identify each trace with the sensitive variable Sbox[P_1^(i) ⊕ k*], where P_1^(i) is the first byte of the i-th plaintext. The measured SNR equals 7.87 (see Appendix D, Figure 5). Our experiment is conducted with 45,000 power traces of 800 points for the training phase and 5,000 power traces for the validation. Finally, 50,000 power traces are used for the attack phase.
• AES_HD is an unprotected AES-128 implemented on an FPGA. Introduced in [PHJ+19], the authors attack the register writing in the last round, such that the label of the i-th trace is Sbox^{-1}[C_j^(i) ⊕ k*] ⊕ C_{j'}^(i), where C_j^(i) and C_{j'}^(i) are two ciphertext bytes associated with the i-th trace, and the relation between j and j' is given by the ShiftRows operation of AES. The authors use j = 12 and j' = 8. The measured SNR equals 0.01554 (see Appendix D, Figure 6). We use 75,000 measurements: 50,000 are randomly selected for the training process (45,000 for the training and 5,000 for the validation) and 25,000 traces are used for the attack phase.
• ASCAD is the first open database specified to serve as a common basis for further works on the application of deep learning techniques in the side-channel context [BPS+19]. The target platform is an 8-bit AVR microcontroller (ATmega8515) on which a masked AES-128 with different levels of random delay (i.e. 0, 50, 100) is implemented, and the leakage model is the first-round S-box operation. As explained in [BPS+19], the third byte is exploited. The measured SNR equals 0.007 (see Appendix D, Figure 7). To generate our network, we divide the ASCAD dataset into three subsets: 45,000 traces for the training set, 5,000 for the validation set and 10,000 for the attack phase.
Remark 4. To efficiently evaluate the performance of our networks, we apply some visualization tools provided in [MDP19a,ZBHV19].Indeed, through the weight and the gradient visualizations, we are able to identify the relevant features retained by the network to classify the traces.

Evaluation of the Ranking Loss on Public Datasets
From a practical perspective, the generation of suitable architectures is known to be a difficult task. Hence, two kinds of models are considered. In Section 5.4.1, models that exploit a partial set of PoIs in the leakage traces are evaluated. In Section 5.4.2, models that exploit all the relevant information in the leakage traces are considered. This subsection evaluates the efficiency of the ranking loss compared to the categorical cross-entropy and the cross-entropy ratio in both cases, through various scenarios, notably in the presence of high noise, masking and desynchronization.

A Partial Exploitation of the Leakages
For simplicity, we evaluate this case study with the ChipWhisperer dataset. The model implemented is a CNN architecture with one convolutional block of 2 filters of size 1 and one fully-connected layer with 2 nodes. When considering an unprotected implementation with low noise, all the models trained with the different losses provide the same N t GE value (see Table 2 for small σ values). In [ZBHV19], Zaid et al. propose to visualize the weights of the flatten layer in order to evaluate the capacity of the network to extract the relevant features. Through this visualization, an evaluator is able to retrieve the points of interest selected by a network. However, due to the effect of the convolutional block, the number of weighted samples is divided by the value of the pooling stride [ZBHV19]. Thus, comparing these visualizations with the SNR computation can be difficult. For ease of visualization, we pad the weight representation in order to get the same x-axis on each figure. In Figure 1, we compare the features retained by the categorical cross-entropy (see Figure 1a), the cross-entropy ratio (see Figure 1b) and the ranking loss (see Figure 1c) with the classical SNR (see Figure 1d). Interestingly, depending on the loss, the models do not select the same relevant features: Figures 1a, 1b and 1c do not show the same points of interest. While the SNR computation reveals 4 high peaks between samples 0 and 200, the models trained with the categorical cross-entropy and the cross-entropy ratio losses detect only 2 high peaks in the same area. Hence, only a partial set of leakages is exploited by these cross-entropy losses. In comparison, the ranking loss extracts most of the sensitive information. Moreover, the categorical cross-entropy loss identifies a false-positive leakage, while no irrelevant peak occurs when the ranking loss is applied. This error underlines an important issue when the categorical cross-entropy loss is used in side-channel analysis: the approximation error is non-negligible and some false-positive leakages can occur. If the evaluator cannot find a more suitable neural network architecture, these noisy points (i.e. irrelevant features) could dramatically impact the performance of the network. As mentioned in [DSVC14], the approximation (or assumption) error can be dramatic if the model, characterizing the perceived information, does not converge towards the right distribution Pr[Z = z|t] defined by the mutual information MI(Z; T). In Section 4.3, we have shown that the ranking loss prevents the approximation error compared to the categorical cross-entropy. Hence, when the ranking loss is used, the related performance should be at least as good as that of a model trained with the categorical cross-entropy.
If the amplitude of the PoIs is low compared to the noise, the performance gap between models trained with the categorical cross-entropy, the cross-entropy ratio and the ranking loss can increase. To illustrate this phenomenon, we add Gaussian noise N(0, σ²), where σ defines the standard deviation of the noise. Table 2 shows the evolution of the N t GE value depending on the noise added to the ChipWhisperer dataset. When the additional noise level is low (i.e. σ ≤ 10^-2), the feature detection is effective regardless of the loss function and the performance gap is low (i.e. less than 9 traces). However, for a high noise level (i.e. σ ≥ 10^-1), the performance gap increases dramatically and reaches 1,031 traces between the categorical cross-entropy and the ranking loss, and 885 traces between the cross-entropy ratio and the ranking loss. The ranking loss is clearly the most efficient loss function, even in the presence of high noise levels. As a conclusion, if an evaluator generates a model that does not exploit the entire set of leakages, they shall use the ranking loss in order to obtain a model mitigating the approximation error. Indeed, depending on the level of the SNR peaks, this error can dramatically impact the performance of a network. From a practical perspective, a model trained with the ranking loss can also extract false-positive leakages due to the optimization error, but its overall error rate remains a lower bound of the error rate generated by a model trained with a cross-entropy loss function (see Section 4.3). The evaluation of the approximation error was also made on the AES_HD and ASCAD datasets, but the architectures proposed in [ZBHV19] already give the same best solution for all the losses. For these datasets, we can assume that the approximation error is negligible and that all losses exploit the entire set of relevant information. The next section evaluates the benefits of the ranking loss against the categorical cross-entropy and the cross-entropy ratio when all leakages are exploited.

Remark 5. In this experiment, we noticed that when the noise level is high, the best value of α used by the ranking loss decreases. Consequently, α is configured to obtain the most powerful model when the noise level is high. Even if the resulting performance is similar for many values of α, this observation illustrates that α should be correctly configured depending on the characteristics of the traces (i.e. level of noise, number of profiling traces, ...).
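The noise-augmentation step used in this experiment is straightforward to reproduce. The sketch below adds zero-mean Gaussian noise of standard deviation σ to a batch of traces; array shapes are illustrative, not the paper's exact setup:

```python
import numpy as np

def add_gaussian_noise(traces, sigma, rng=None):
    """Augment side-channel traces with zero-mean Gaussian noise N(0, sigma^2),
    as in the ChipWhisperer noise experiment summarized in Table 2."""
    rng = rng or np.random.default_rng(0)
    return traces + rng.normal(0.0, sigma, size=traces.shape)

traces = np.zeros((1000, 800))        # hypothetical 800-sample traces
noisy = add_gaussian_noise(traces, sigma=0.1)
print(noisy.std())                    # close to the requested sigma
```

Sweeping σ over several orders of magnitude (e.g. 10^-3 to 1) reproduces the degradation regime studied in Table 2.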

A Total Exploitation of the Leakages
As previously mentioned, given the architectures provided in [ZBHV19], the entire set of leakages is detected on the AES_HD and ASCAD datasets. Hence, we can assume that the approximation error does not impact the overall performance of the model, regardless of the loss function. When all the losses converge towards the same best solution, a comparison method consists in evaluating the number of profiling traces needed to reach this performance. From an evaluator's point of view, it is more interesting to converge faster towards the best solution because it is difficult to estimate a priori the number of profiling traces needed to reach the best performance. To highlight the benefits of each loss, we decompose this experimental study into an Estimation Error Gap (EEG) evaluation and a performance gap evaluation. The EEG characterizes the difference between the numbers of profiling traces N p needed by different losses to reach a given N t GE. We note EEG(L i, L j) the EEG value between models trained with the loss functions L i and L j. For each dataset, we report the performance results given by 10 models converging towards a constant guessing entropy of 1 and display the evolution of the average N t GE values for different levels of N p. When the number of profiling traces is low (i.e. ≤ 30,000), some models do not retrieve the sensitive information and the resulting N t GE value cannot be estimated. To allow a fair comparison between the losses, we only consider the models for which the N t GE value can be computed for all the learning metrics.
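The EEG can be read off the measured N t GE-versus-N p curves. A possible sketch interpolates each curve to find the N p needed to reach a target N t GE; the curve values below are invented for illustration, not taken from Table 3:

```python
import numpy as np

def estimation_error_gap(np_grid, ntge_i, ntge_j, target_ntge):
    """EEG(L_i, L_j): extra profiling traces loss L_j needs, relative to L_i,
    to reach the same target N t GE. Curves are assumed decreasing in N_p,
    so we interpolate on the reversed (increasing) curves."""
    np_i = np.interp(target_ntge, ntge_i[::-1], np_grid[::-1])
    np_j = np.interp(target_ntge, ntge_j[::-1], np_grid[::-1])
    return np_j - np_i

np_grid = np.array([10000., 20000., 30000., 40000.])
rkl = np.array([5000., 2000., 800., 300.])   # hypothetical N t GE curves
cce = np.array([8000., 4000., 2000., 900.])
print(estimation_error_gap(np_grid, rkl, cce, target_ntge=2000.))  # 10000.0
```

Here the hypothetical cross-entropy model needs 10,000 more profiling traces than the ranking-loss model to reach N t GE = 2,000.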

AES_HD.
In Figure 2a, we compare the convergence capacity of each model depending on the loss used. When the model is trained with the ranking loss, only 20,000 profiling traces are needed to perform a successful attack such that N t GE = 2,000. To reach the same performance, a model trained with the categorical cross-entropy needs 24,870 profiling traces. Thus, when N t GE = 2,000, EEG(L RkL, L CCE) = 4,870. Similarly, if the evaluator chooses the cross-entropy ratio as loss function, they need to increase their training set by 4,950 traces to perform similar attacks. When the ranking loss is used, the number of profiling traces needed to reach a constant N t GE solution is, in the worst case, similar to the cross-entropy propositions (i.e. categorical cross-entropy, cross-entropy ratio). Through Table 3, we compare the performance of each loss for a given number of profiling traces. When N p is low (i.e. ≤ 30,000), the performance gap is relatively high (up to 8,293 traces) between the ranking loss and the cross-entropy losses.
Hence, when the number of profiling traces is limited (as is often the case in practice), the ranking loss is the most efficient loss function. However, as defined in [MDP19b], if the number of profiling traces is large enough and no approximation error occurs, the performance gap is reduced and the categorical cross-entropy loss function generates a model that converges towards the same best solution (see Table 3). The same observation can be made for the cross-entropy ratio loss function. These experimental results confirm the theoretical propositions of Section 4.2: a model trained with the ranking loss is at least as efficient as a model trained with a cross-entropy loss function.

Remark 6. The value α of the ranking loss needs to be adapted depending on the number of profiling traces. For example, when N p is low, the risk of overfitting is a major issue. One solution to limit the overfitting effect is to fix a higher learning rate [HHS17, ST17]. Hence, following Equation 12 and Equation 13, α can be monitored like the learning rate in order to optimize the training process. For the AES_HD dataset, increasing α to 10 generates a more powerful model than α equal to 1 (see Appendix A, Figure 4c) when the number of profiling traces equals 20,000.

ASCAD.
In contrast with the previous datasets, ASCAD is a protected implementation with first-order masking and random-delay countermeasures. Figure 2b, Figure 3a and Figure 3b provide a comparison between models trained with the different losses for synchronized and desynchronized traces. In Figure 2b, when the model is trained with the ranking loss, only 15,000 profiling traces are needed to perform a successful attack, while 18,500 (resp. 20,000) are needed to reach the same performance when the categorical cross-entropy (resp. cross-entropy ratio) loss is used for the training process. Consequently, if the evaluator chooses the categorical cross-entropy (resp. cross-entropy ratio) as loss function, they need to increase their training set by 3,500 (resp. 5,000) profiling traces on average. Thus, when N t GE = 1,700, EEG(L RkL, L CCE) = 3,500. Furthermore, when no desynchronization occurs, the model converges faster towards the average best solution (i.e. N t GE ≈ 260) when the ranking loss is used (i.e. 35,000 profiling traces) compared to the categorical cross-entropy or the cross-entropy ratio losses (i.e. about 45,000). The resulting EEG value equals 10,000. This estimation error gap is up to 6,000 profiling traces when desynchronization occurs (see Figure 3a and Figure 3b). Hence, for the ASCAD dataset, the EEG is not increased by the desynchronization effect. Indeed, in comparison with synchronized traces, this countermeasure only impacts the exploitation of the relevant information. Finding suitable CNN architectures reduces the desynchronization effect [CDP17, ZBHV19] while preserving the same performance as a model trained with synchronized traces. Through Table 4, we confirm the observations made on the AES_HD dataset. In our experiment, for a small number of profiling traces N p (i.e. ≤ 25,000), a model trained with the ranking loss is, on average, more efficient than one trained with the categorical cross-entropy or the cross-entropy ratio. For synchronized and desynchronized traces, an evaluator with a limited number of profiling traces shall use the ranking loss.
When the entire set of leakages is detected by the network, a model trained with the ranking loss converges faster towards the best solution compared to the categorical cross-entropy and cross-entropy ratio losses. From a theoretical perspective, we can assume that the estimation error is reduced when the ranking loss is considered. As discussed in Section 2.3, the estimation error defines the gap between the empirical estimation of the PI (resp. of the MI), computed with the categorical cross-entropy (resp. the ranking loss), and the real value of the PI (resp. MI). When the number of profiling traces N p is large enough, we validate that the impact of the estimation error can be negligible. However, in practice, the number of profiling traces is limited. For that purpose, the ranking loss function seems more appropriate under the assumption that an attacker does not have an infinite number of traces in the profiling phase [PHG19]. Hence, the ranking loss is a solid alternative to the cross-entropy losses for side-channel attacks.

Remark 7. As mentioned earlier, the value α of the ranking loss needs to be adapted depending on the number of profiling traces. For example, when N p = 15,000 traces, increasing α to 5 generates a more powerful model than α = 0.5 (see Appendix A, Figure 4b) if the desynchronization effect equals 100. Finally, as we can see in Figure 3b and Table 4, a model trained with the cross-entropy ratio loss function does not converge towards a constant GE of 1 when the random-delay effect equals 100. However, the cross-entropy ratio aims at reducing the imbalanced-labels effect [ZZN+20], which is not considered in this paper.

Conclusion
We extend the work done by Masure et al. [MDP19b] on the interpretation and the explainability of the loss in the side-channel context. We use the learning-to-rank approach in order to propose a new loss, called the Ranking Loss. We theoretically show that this new loss is derived from the success rate. Indeed, we demonstrate that maximizing the success rate is equivalent to minimizing the ranking error of the secret key compared to all other hypotheses. Hence, the ranking loss tends to maximize the success rate for a given number N a of traces and converges towards the optimal model introduced in [BGH+17]. Through this new proposition, we are more concerned with the relative order of the relevance of the key hypotheses than with their absolute values. From the side-channel perspective, the ranking loss generates a distinguisher that converges towards the mutual information between the sensitive information and the leakage. Hence, the approximation and estimation errors [MDP19b] are reduced when our approach is applied. All these observations are experimentally validated through two scenarios. First, if an evaluator does not generate a model exploiting all the sensitive information from a leakage trace, using the ranking loss prevents the approximation error and provides the most efficient model. Otherwise, if an evaluator generates a model that exploits the entire set of leakages, the model trained with the ranking loss converges faster towards the best solution compared to the cross-entropy losses. Hence, if an evaluator deals with a limited number of traces, using the ranking loss should provide the most efficient model. Consequently, in all situations, the evaluator shall consider the ranking loss as a clear alternative to the cross-entropy.
While the cross-entropy ratio loss function was introduced to reduce the imbalanced data effect [ZZN + 20], a further investigation should be made to evaluate the suitability of the ranking loss in this context.Finally, this paper only looks at the pairwise approach for ranking loss.A future work could theoretically evaluate the benefits and the limitations of using the listwise approach in the side-channel context.

A Impact of α
We select a wide range of α values in order to efficiently evaluate the impact of this parameter on the training process. Figure 4 illustrates the impact of α on the loss function depending on the dataset used for training the model. The architectures are the same as in Section 5.4. For the ChipWhisperer dataset, if α is small (e.g. less than 10), the sigmoid approximates the indicator function less accurately [BZBN19]. Consequently, the loss function optimizes a model that is far from the original ranking loss. However, this property is not necessarily verified in practice: depending on the score value, the impact of α can be scaled. When α is too large, the gradient tends to vanish (see Equation 13) and the resulting training process provides poor performance. It appears that a relatively small value of α provides a good trade-off between the optimization, through the gradient computation, and the approximation of the indicator function. The same observation can be made when ASCAD with synchronized traces (see Figure 4b) and AES_HD (see Figure 4c) are considered. However, we have to note that α should be carefully configured depending on the dataset.
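The trade-off described above can be checked numerically: σ(α·m) approaches the indicator 1[m > 0] as α grows, while its gradient α·σ·(1−σ) collapses for margins m the model already classifies well. A small sketch, with helper names ours and m and α as in the text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smoothed_indicator(margin, alpha):
    """sigma(alpha * margin): smooth surrogate for the indicator 1[margin > 0]."""
    return sigmoid(alpha * margin)

def surrogate_gradient(margin, alpha):
    """d/d(margin) of the surrogate: alpha * s * (1 - s)."""
    s = smoothed_indicator(margin, alpha)
    return alpha * s * (1.0 - s)

m = 2.0  # a score margin already in favour of k*
for alpha in (0.5, 5.0, 50.0):
    print(alpha, smoothed_indicator(m, alpha), surrogate_gradient(m, alpha))
```

For α = 50 the surrogate is essentially the true indicator, but the gradient at m = 2 is numerically zero, which illustrates why an overly large α stalls training.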

B Link between MAP and SR
The Mean Average Precision (MAP) defines the average precision of the secret key k* over the |K| hypotheses. Let g = (g_1, g_2, ..., g_{|K|}) be a vector that defines the rank of each key hypothesis in K, as introduced in Equation 3. We consider g_1 as the most likely candidate and g_{|K|} as the least likely one. Let d be a threshold and MAP@d be the average precision of the secret key k* over the top d positions (i.e. g_1, ..., g_d), such that

MAP@d = (1/N) \sum_{i=1}^{N} AP_i@d,

where N denotes the number of queries (e.g. the batch size) and AP_i@d is the average precision of the query i over the top d positions:

AP_i@d = \frac{\sum_{j=1}^{d} (\text{Positives seen at rank } j) \times rel(j) / j}{\text{number of True Positives}},

where rel(j) is an indicator that equals 1 if the element at rank j is a relevant item and 0 otherwise. In the SCA context, this means that rel(j) = 1 when j defines the rank related to k*.
In side-channel analysis, we want to recover the single secret key k*; consequently, the total number of True Positives equals 1 and the previous equation simplifies to

AP_i@d = 1/rank(k*) if rank(k*) ≤ d, and 0 otherwise.
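This simplification is easy to check numerically. In the sketch below (helper names ours), AP@d collapses to 1/rank(k*) when the rank of k* is within the top d, and 0 otherwise:

```python
import numpy as np

def average_precision_at_d(rank_of_key, d):
    """AP@d when only the secret key k* is relevant (one true positive):
    the sum over ranks collapses to 1/rank(k*) when rank(k*) <= d, else 0."""
    return 1.0 / rank_of_key if rank_of_key <= d else 0.0

def map_at_d(ranks, d):
    """MAP@d over N queries (e.g. a batch of attack traces)."""
    return float(np.mean([average_precision_at_d(r, d) for r in ranks]))

print(map_at_d([1, 1, 2, 4], d=3))  # (1 + 1 + 0.5 + 0) / 4 = 0.625
```

Note that for d = 1, MAP@1 counts only the queries where k* is ranked first, which is exactly the empirical first-order success rate.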

D Signal-to-Noise Ratio of the Experimental Datasets
In [MDP19b], Masure et al. provide a complete interpretation of the categorical cross-entropy, denoted Negative Log-Likelihood (NLL), in the side-channel context. The authors theoretically show that minimizing the NLL is asymptotically equivalent to minimizing the categorical cross-entropy, which maximizes the perceived information (PI) introduced by Renauld et al. [RSVC+11]. The PI is the amount of information that can be extracted from data with the help of an estimated model. It can be seen as a lower bound of the MI [dCGRP19, BHM+19].
Ranking Loss. In order to complete the work made by Masure et al. [MDP19b], we estimate the errors generated when $\widehat{MI}$ and $MI$ are taken into consideration. The gap between the estimated $\widehat{MI}$ and the true mutual information can be decomposed into an optimization error and an estimation error as below:

$\widehat{MI}(Z; T; \hat{\theta}) - MI(Z; T) = \underbrace{\widehat{MI}(Z; T; \hat{\theta}) - \widehat{MI}(Z; T; \theta^{*})}_{(24)} + \underbrace{\widehat{MI}(Z; T; \theta^{*}) - MI(Z; T)}_{(25)}.$

Subtracting this decomposition from the corresponding one for the PI, and using the assumption of Section 4.3 that both optimization errors are approximately the same, the gap of Equation 26 reduces to the approximation error:

$\widehat{PI}(Z; T; \hat{\theta}) - \widehat{MI}(Z; T; \hat{\theta}) = \left[\widehat{PI}(Z; T; \hat{\theta}) - PI(Z; T; \theta^{*})\right] - \left[\widehat{MI}(Z; T; \hat{\theta}) - MI(Z; T; \theta^{*})\right] + \left[PI(Z; T; \theta^{*}) - MI(Z; T; \theta^{*})\right] = PI(Z; T; \theta^{*}) - MI(Z; T; \theta^{*}).$

Figure 2: Evaluation of the EEG value on synchronized datasets (average over 10 converging models)

Figure 3: Evaluation of the EEG value on desynchronized datasets (average over 10 converging models)

Figure 4: Impact of α on the loss function during the training phase (45,000 profiling traces)

Table 1: Hyperparameters of the architectures [ZBHV19]

Table 2 :
Evolution of N t GE depending on σ (average over 10 converging models)

Table 3 :
Evolution of N t GE depending on the number of profiling traces N p (AES_HD, average over 10 converging models)

Table 4 :
Evolution of N t GE depending on the number of profiling traces N p (ASCAD, average over 10 converging models)