Conditional Variational AutoEncoder based on Stochastic Attacks

Over recent years, the cryptanalysis community has leveraged the potential of Deep Learning research to enhance attacks. In particular, several studies have recently highlighted the benefits of Deep Learning based Side-Channel Attacks (DLSCA) for targeting real-world cryptographic implementations. While this new research area in applied cryptography provides impressive results in recovering a secret key even when countermeasures are implemented (e.g. desynchronization, masking schemes), the lack of theoretical results makes the construction of appropriate and powerful models a notoriously hard problem. This can be problematic during an evaluation process where a security bound is required. In this work, we propose the first solution that bridges DL and SCA in order to ease the use of DL techniques in an evaluation process. Based on theoretical results, we develop the first Machine Learning generative model, called Conditional Variational AutoEncoder based on Stochastic Attacks (cVAE-SA), designed from the well-known stochastic attacks introduced by Schindler et al. in 2005. This model reduces the black-box property of DL and eases the architecture design for every real-world crypto-system, as we define theoretical complexity bounds which only depend on the dimension of the (reduced) trace and the targeted variable over $\mathbb{F}_2^n$. We validate our theoretical proposition through simulations and public datasets on a wide range of use cases, including multi-task learning, the curse of dimensionality and masking schemes.

Explainability of classical SCA can be transferred to DLSCA, which has been considered more or less as a black-box tool (footnote a).
• We propose the Conditional Variational AutoEncoder based on Stochastic Attacks (cVAE-SA) as a new neural network architecture that lies between stochastic attacks and DLSCA. Our models benefit from the theoretical aspects of stochastic attacks, as well as their ability to estimate and reconstruct the targeted leakage models. This analogy eases both the construction of the neural network and its interpretation.
• We propose a full contextualization of the cVAE optimization process in the SCA field.
• Thanks to its analogy with the stochastic attacks, we define some theoretical bounds related to the neural network complexity. This suggests that shallow neural networks can be sufficient to exploit the sensitive information contained in a trace. This result is in accordance with the Universal Approximation Theorem [Pin99].
• We develop a new key recovery strategy based on a similarity measure that allows an evaluator to specifically choose which samples the model should target to retrieve the sensitive information. This results in a more flexible solution than classical profiled side-channel attacks.
• We validate all our theoretical results through a wide range of use cases covering the following challenges in the SCA context: multi-task learning, the curse of dimensionality, and masking schemes.
• Through a detailed experimental comparison of our cVAE-SA proposition with classical profiled attacks (i.e. template and stochastic attacks) as well as multiple DLSCA models, we highlight the benefits and the drawbacks of cVAE-SA. This results in a perspective about a new typology of models specific to the SCA context.
This proposition opens up further research directions where improvements from both fields could be further combined to enhance the attack efficiency as well as the explainability of the results. All these experiments can be reproduced through a GitHub repository (footnote b).
Paper Organization. This work is organized as follows: Sec.2 contrasts the related works in DLSCA, which are based on the discriminative approach, with the generative approach we introduce in this paper. This section concludes with a general overview of the main results of this work. In Sec.3, a new neural network architecture based on stochastic attacks is proposed, and a detailed description of the optimization process as well as the key recovery phase is provided. Then, Sec.4 investigates the benefits of the cVAE-SA from an interpretability and explainability perspective while validating all the theoretical observations. Sec.5 illustrates the benefits and the limitations of the cVAE-SA in comparison with traditional approaches through experimental results. Finally, Sec.6 discusses the benefits and the limitations of the contribution and introduces some new perspectives to consider as future work.
Footnote a: Some works reduce this gap from an optimization perspective [MDP19, ZZN+20, ZBD+20, ISUH21, IUH22].
Footnote b: https://github.com/gabzai/Conditional-Variational-Autoencoder-based-Stochastic-Attacks

Usually, $p_X$ denotes the measured probability distribution while $q_X$ defines the theoretical model. The KL-divergence is always non-negative and equals zero if and only if $p_X = q_X$.
SCA terminology. SCAs usually apply a divide-and-conquer strategy, which consists in separately recovering different parts of the $N$-bit (global) secret key $k^* = (k^*_1, \dots, k^*_{N/n})$, considering $n$-bit subkeys $k^*_i \in \mathcal{K}$. In the rest of this paper, we only consider attacking a single subkey (with $n = 8$); hence we write $k^*$ instead of $k^*_i$ and refer to the subkey simply as the key. Given a variable $Y$ and an independent noise $Z$, a trace $T$ is a $D$-dimensional vector defined as

$$T = \psi(Y) + Z \qquad (1)$$

where $\psi : \mathbb{F}_2^n \to \mathbb{R}$ is a pseudo-boolean function [Car10] mapping an $n$-bit intermediate value $Y$ which is generated from a cryptographic primitive $f : \mathcal{X} \times \mathcal{K} \to \mathbb{F}_2^n$. The term $\psi(Y)$ corresponds to the deterministic part of the trace. Let $Z \sim \mathcal{N}_D(\mu, \Sigma)$ correspond to the noise, which is characterized by a multivariate Gaussian distribution parameterized by an unknown pair $(\mu, \Sigma)$.
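As a minimal sketch of the additive trace model described above, the following Python snippet simulates a trace in which the points of interest carry a deterministic leakage and every time sample carries Gaussian noise. The Hamming-weight leakage, the plain XOR standing in for the primitive $f$, the PoI positions and all function names are illustrative assumptions and are not taken from the paper.

```python
import random

def hamming_weight(value: int) -> int:
    """Number of bits set in `value` (an assumed, commonly used leakage model)."""
    return bin(value).count("1")

def simulate_trace(y, pois, dim, sigma=1.0, rng=random):
    """Return a `dim`-sample trace: leakage psi(y) at the PoIs plus Gaussian noise."""
    trace = []
    for i in range(dim):
        # Deterministic part: only the points of interest depend on y.
        deterministic = hamming_weight(y) if i in pois else 0.0
        # Random part: independent Gaussian noise on every time sample.
        trace.append(deterministic + rng.gauss(0.0, sigma))
    return trace

rng = random.Random(0)
x, k_star = 0x3A, 0x5C
y = x ^ k_star  # toy f(X, k*): a plain XOR stands in for the real primitive
trace = simulate_trace(y, pois={2, 5}, dim=8, sigma=0.5, rng=rng)
noise_free = simulate_trace(y, pois={2, 5}, dim=8, sigma=0.0, rng=rng)
```

Setting `sigma=0.0` isolates the deterministic part, which is the quantity a profiling phase tries to estimate.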

Explainability & Interpretability.
In this work, interpretation refers to the ability of the evaluator to clearly identify each operation induced by the layers of the generative/discriminative model during the decision-making process. For example, following the classical profiled SCA scenarios (i.e. stochastic attacks, template attacks), the evaluator may wonder which leakage model is extracted for each point of interest, how the noisy part of a trace is characterized, how the dimensionality reduction is performed, or even how the statistical model should be designed. This paper tackles those problems from a DLSCA perspective in order to ease the use of those techniques in the side-channel context.

Related works & limitations of discriminative models
Related works. Typically, to perform a DLSCA attack, the evaluator considers a discriminative approach which models the conditional posterior probabilities $\Pr[Y|T]$ in order to discriminate and pick the most likely hypothetical candidate $Y$ (i.e. sensitive information) given a trace $T$. A discriminative model estimates a $\phi$-parametric conditional probability distribution $\Pr[Y|T, \phi]$ that is as similar as possible to the true unknown conditional probability distribution $\Pr[Y|T]$. This approach is beneficial for directly solving a classification problem without modeling unnecessary information, thus mitigating the impact of some countermeasures such as desynchronization [CDP17a, ZBHV19, Mag19, HSAM22]. This led the side-channel community to investigate DL approaches, relying on the discriminative approach, to improve profiled SCA [CDP17a, KPH+19, ZBHV19, BPS+20]. While [WPP22] combines a DL dimensionality reduction method with template attacks as an alternative to Principal Component Analysis [APSQ06], Linear Discriminant Analysis [BGH+15] or Kernel Discriminant Analysis [CDP17b], all the end-to-end DLSCA models proposed in the state of the art are based on the discriminative approach (e.g. fully-connected neural networks [MZ13, MHM14, Wei20], ResNets [ZS19, JZHY20, GJS20, MS21], RNNs [LLY+20], transformer neural networks [HSAM22], attention mechanisms [LZC+21]). However, due to the lack of theoretical results, discriminative models can be seen as black-box tools, and the design of models can be a real challenge even against unprotected cryptographic implementations. To mitigate this issue, some solutions which automatically tune model hyperparameters have been investigated [MPP16, BPS+20, WPP20, PRA20, RWPP21, YAGF21], but the related process is time-consuming, and the range of the hyperparameters' values is arbitrarily bounded, such that a poorly designed model can be highly impacted by underfitting/overfitting issues.
This paper mitigates this issue by providing the first DL model based on theoretical SCA results in order to make the construction phase easier.
Generative approach. An alternative solution consists in considering a probabilistic generative approach which captures the interactions between all the variables considered by the resulting learning algorithm. To this end, this strategy builds a model that estimates the probability distribution of the traces. To fit the SCA context (footnote d), the conditional probability distribution $\Pr[T|Y]$ has to be estimated such that, afterwards, Bayes' theorem can be applied in order to retrieve the conditional posterior probabilities $\Pr[Y|T]$ and pick the most likely label $Y$. More concretely, a generative model can be viewed as an estimation of a $\Theta$-parametric conditional distribution $\Pr[T|Y, \Theta]$ that is as similar as possible to the true unknown conditional distribution $\Pr[T|Y]$. The classical profiled SCAs, such as the stochastic attacks [SLP05], follow this generative approach. One benefit of this method is the ability to explain and interpret the result provided by the model. The following section summarizes the stochastic attacks [SLP05] introduced by Schindler et al. as well as our contribution, which we detail in Sec.3.

General description of this work
A short description of stochastic attacks [SLP05]. Given a trace $T$ such that its $i$-th time sample can be defined as $T[i] = \psi_i(f(X, k^*)) + Z[i]$, the goal of the stochastic attack is to find an approximation of the leakage model, denoted $\hat{\psi}_i$, as close as possible to the true unknown $\psi_i$. As $\psi_i$ is assumed to be a pseudo-boolean function, $\psi_i$ can be viewed as a linear combination of monomials indexed by the vectors $u \in \mathbb{F}_2^n$ [Car10]. Hence, there exists a set of real coefficients $(\alpha_u)_{u \in \mathbb{F}_2^n}$ such that, for a sensitive intermediate value $Y \in \mathbb{F}_2^n$, the leakage model (see Eq.1) is redefined as:

$$\hat{\psi}_{i,\alpha}(Y) = \sum_{u \in \mathbb{F}_2^n} \alpha_u[i] \cdot Y^u \qquad (2)$$

where $Y^u$ denotes the monomial basis and characterizes the conjunction of the bits of $Y$ selected by $u$, i.e. $Y^u = \prod_{j=0}^{n-1} Y[j]^{u[j]}$, where $Y[j] \in \mathbb{F}_2$ defines the $j$-th bit of $Y$ and the power notation is simply $Y[j]^0 = 1$ and $Y[j]^1 = Y[j]$. In other words, $\psi_i$ can be approximated as a multivariate polynomial in the bit-coordinates $Y[j]$ with coefficients in $\mathbb{R}$. The degree $d$ (with $d \le n$) of this polynomial is defined as the maximal number of interacting bits in $\hat{\psi}_{i,\alpha}(Y)$. In particular, these bit interactions can be viewed as logical operators (e.g. AND or XOR). The related subspace is denoted by $\mathcal{F}_{d+1}$. For the profiling phase, the stochastic attack consists first in choosing the degree $d$ of the pseudo-boolean function $\hat{\psi}_\alpha$, and then in estimating the leakage model related to the targeted device. Given a set of $N_p$ labeled traces $\mathcal{I}_p = \{(t_0, y_0), \dots, (t_{N_p-1}, y_{N_p-1})\}$, the evaluator estimates the leakage model $(\hat{\psi}_{i,\alpha}(Y))_{Y \in \mathbb{F}_2^n}$ by finding the best set of coefficients $(\alpha_u[i])_{u \in \mathbb{F}_2^n}$ through the ordinary least squares (OLS) method. The set of coefficients that minimizes the sum of squared residuals is called the OLS estimator of $\psi$. More details on how to practically implement stochastic attacks can be found in [CK15]. While the basis choice is essential for an efficient profiling phase [MOW17], i.e. for having a good approximation of the leakage model, applying gradient descent to minimize the least-squares objective is an interesting alternative to the classical approach (see [SLP05, Eq.13]) and will be explored in the next sections. This gives a better intuition of how a DL model should be designed in order to extract the sensitive information, and results in a more flexible solution during the exploitation phase.
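The profiling step described above can be sketched as follows for a single time sample, fitting the coefficients by full-batch gradient descent on the least-squares objective rather than by the closed-form OLS solution. The helper names (`monomial_basis`, `fit_leakage_model`), the learning rate, the number of epochs and the noise-free Hamming-weight toy leakage are hypothetical choices made for illustration only.

```python
from itertools import combinations

def monomial_basis(y: int, n_bits: int, degree: int) -> list:
    """Evaluate every monomial Y^u with at most `degree` interacting bits.

    The first entry is the constant monomial, followed by the single bits,
    then the products (logical ANDs) of 2 bits, and so on up to `degree`.
    """
    bits = [(y >> j) & 1 for j in range(n_bits)]
    features = []
    for d in range(degree + 1):
        for idx in combinations(range(n_bits), d):
            prod = 1
            for j in idx:
                prod *= bits[j]
            features.append(float(prod))
    return features

def fit_leakage_model(samples, labels, n_bits, degree, lr=0.2, epochs=2000):
    """Fit the coefficients of one time sample by gradient descent on the
    mean squared residual between the model output and the measured samples."""
    rows = [monomial_basis(y, n_bits, degree) for y in labels]
    alpha = [0.0] * len(rows[0])
    for _ in range(epochs):
        grad = [0.0] * len(alpha)
        for row, t in zip(rows, samples):
            residual = sum(a * f for a, f in zip(alpha, row)) - t
            for j, f in enumerate(row):
                grad[j] += 2.0 * residual * f / len(rows)
        alpha = [a - lr * g for a, g in zip(alpha, grad)]
    return alpha

# Toy, noise-free check: a Hamming-weight leakage on 4 bits is a degree-1 polynomial,
# so the fit should approach an intercept of 0 and a weight of 1 on every bit.
labels = list(range(16))
samples = [float(bin(y).count("1")) for y in labels]
alpha = fit_leakage_model(samples, labels, n_bits=4, degree=1)
```

With a degree-1 basis each coefficient corresponds to one bit of $Y$, which is what makes the fitted parameters directly readable as a leakage model.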
A new generative strategy in DLSCA. In the SCA context, we want to explicitly compute an approximation of the true unknown conditional probability distribution $\Pr[T|Y]$ in order to retrieve the secret key that is manipulated by the targeted real-world crypto-system. In 2014, Kingma and Welling introduced the Variational AutoEncoder (VAE) [KW14] as a solution to this issue outside of the SCA context. As this seminal work has since been widely applied in various fields (e.g. face generation [KW14, KWKT15], handwritten digits [KW14], objects [KWKT15]), we propose to contextualize the conditional variational autoencoder in side-channel analysis in order to give a new perspective on generative models. In this paper, we develop a new usage of variational autoencoders for DLSCA and we present our main contribution: the Conditional Variational AutoEncoder based on Stochastic Attacks (cVAE-SA). This work can be decomposed into three parts (see Fig.1):
1. First, a description of the cVAE-SA structure is proposed. In particular, a theoretical link with the stochastic attacks is highlighted in order to model a $\Theta$-parametric conditional distribution $\Pr[T|Y, \Theta]$ through the design of two distinct parts, referred to as the encoder and the decoder. In outline, the encoder approximates the parameters $\mu$ and $\Sigma$ which characterize the noise $Z$ included in a trace (see Eq.1). Then, the decoder is defined to generate a synthetic trace from a variable which follows $\mathcal{N}_D(\mu, \Sigma)$ and an approximation of the deterministic part $\psi(Y)$ defined in Eq.1. The description of these entities is detailed in Sec.3.2. This part is helpful from an evaluation perspective to reduce the explainability/interpretability issues mentioned in Sec.1. In particular, it clarifies the operations induced in each layer of the cVAE-SA model.
2. Once the cVAE-SA is designed, it is automatically configured over a set of training traces in order to estimate the $\Theta$-parametric conditional distribution $\Pr[T|Y, \Theta]$, which should be as similar as possible to the true unknown conditional distribution $\Pr[T|Y]$. To obtain such a model, a combination of a reconstruction loss and a KL-divergence loss is minimized in order to find the trainable parameters that best fit the true unknown solution. While the reconstruction loss measures the similarity error (in terms of Euclidean distance) between a synthetic and a real trace, the KL-divergence loss penalizes the cVAE-SA if the parameters $\mu$ and $\Sigma$ do not fit the expected noise distribution. The combination of those losses is widely known as the ELBO loss [KW14]. The justification for the use of these losses is provided in Sec.3.3.
3. Finally, based on this configured and trained model, the evaluator can compute the maximum likelihood over a set of attack traces in order to retrieve the most likely subkey candidate over $\mathbb{F}_2^n$ with $n = 8$. A detailed modus operandi is provided in Sec.3.4. In addition, multiple visualization techniques can be considered in order to better understand the extracted leakage model as well as the latent representation. Those visualization tools are introduced in Sec.4 in order to validate the stated theoretical results. Further investigations have also been conducted in App.A and App.B to verify the theoretical statements.

Conditional Variational AutoEncoder based on Stochastic Attacks
Through this section, we explain the link between generative DL models and classical profiled SCA by building a new type of VAE. Sec.3.1 introduces the problem we want to solve and proposes a first link with SCA. Sec.3.2 explains our architecture and the theoretical link with the work provided by Schindler et al. [SLP05], known as the stochastic attacks. Then, Sec.3.3 describes the training process of cVAE-SA and the relation with similarity measures. Finally, Sec.3.4 describes the attack phase and the theoretical architecture complexity bounds.

Generative latent variable models
After a general introduction of the Conditional VAE (cVAE) [SLY15], we contextualize this solution into SCA in order to give a new perspective for DL generative models. Supported by theoretical aspects of stochastic attacks, this new approach can be considered as an alternative to classical discriminative models often used in DLSCA.
Problem statement. The cVAE aims at modeling a $\Theta$-parametric conditional distribution $\Pr[T|Y, \Theta]$ from two random variables $T \in \mathbb{R}^D$ and $Y \in \mathbb{F}_2^n$. Suppose that a trace $T \in \mathbb{R}^D$ is acquired by assuming that all the time samples are sequentially generated such that its assigned label only depends on a small set of time samples (i.e. PoIs). As the cVAE is a latent variable model, which suggests that the variability in the traces given a label $Y$ can be captured by a small finite set of time samples, it fits the SCA context well. By designing such models for performing SCA, we thus want to capture the interactions between the time samples via the characterization of a latent space $\mathcal{V}$. In particular, a $\Theta$-parametric latent variable model $F_\Theta$, providing a $\Theta$-parametric conditional distribution $\Pr[T|Y, \Theta]$, is representative of the true unknown conditional distribution $\Pr[T|Y]$, for every trace $T$ and every given sensitive variable $Y$, if there is a representation of compressed data $V \in \mathcal{V}$, also known as latent space representation, such that the marginal distribution is given by:

$$\Pr[T|Y, \Theta] = \int_{\mathcal{V}} \Pr[T|Y, V = v, \Theta] \cdot \Pr[V = v|Y] \, dv \qquad (3)$$

where $v$ is the realization of a random variable $V$ in a $D'$-dimensional space $\mathcal{V}$, with a probability $\Pr[V = v]$ defined over $\mathcal{V}$, and $\Pr[V = v|Y]$ denotes the probability of observing $v$ over the latent space $\mathcal{V}$ knowing $Y$.
Intractability. However, Eq.3 is unfortunately intractable, as it would have to be computed for every latent representation induced by the latent space $\mathcal{V}$. Thus, the following part of the section proposes solutions to circumvent this issue. Fortunately, $\Pr[T|Y, \Theta]$ may still be efficiently approximated thanks to the Monte-Carlo method. Hence, for a large number of samples $\{v_0, \dots, v_{N_v}\}$, a trace $T \in \mathbb{R}^D$ and a label $Y \in \mathbb{F}_2^n$, we can compute an estimation of $\Pr[T|Y]$. As a consequence, for a given label $Y$ and a latent variable $V \in \mathbb{R}^{D'}$, we can build a neural network that computes $\Pr[T|Y, V, \Theta]$. This model, denoted F and it is also intractable due to Eq.3. Consequently, a solution is to find a parametric model that approximates the true unknown posterior $\Pr[V|T, Y]$. In statistics, variational inference techniques can approximate such complex distributions. Given a trace $T \in \mathbb{R}^D$ and a label $Y \in \mathbb{F}_2^n$, a $\Theta$-parametric model can be constructed to estimate the latent space $\mathcal{V}$ such that the KL-divergence between the approximation and the targeted probability distribution , is called the encoder. In the rest of this paper, we denote as $F_{\Theta,\phi}$ the resulting cVAE-SA such that, for a given trace $T$, a given label $Y$ and a function g : . Furthermore, as the aim of this paper is to bridge DL and classical profiled SCA, no particular focus is placed on dimensionality reduction techniques. Thus, the following part assumes that $D' = D$.
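The Monte-Carlo approximation mentioned above can be illustrated on a one-dimensional toy problem: the likelihood of a trace sample given a label is estimated by averaging the likelihood over latent draws. The Gaussian observation model, the standard normal prior on the latent variable and the names `gaussian_pdf` and `monte_carlo_marginal` are assumptions made for this sketch; in this toy case a closed form exists, which is used only to check the estimate.

```python
import math
import random

def gaussian_pdf(x: float, mean: float, var: float) -> float:
    """Density of a univariate Gaussian N(mean, var) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def monte_carlo_marginal(t, psi_y, obs_var, n_samples, rng):
    """Estimate Pr[t | y] as the average of Pr[t | y, v_j] over draws v_j ~ N(0, 1)."""
    total = 0.0
    for _ in range(n_samples):
        v = rng.gauss(0.0, 1.0)                       # latent sample from the prior
        total += gaussian_pdf(t, psi_y + v, obs_var)  # likelihood given (y, v)
    return total / n_samples

rng = random.Random(1)
t, psi_y, obs_var = 4.3, 4.0, 0.25
estimate = monte_carlo_marginal(t, psi_y, obs_var, n_samples=20000, rng=rng)
# Closed form available only because this toy model is fully Gaussian.
exact = gaussian_pdf(t, psi_y, 1.0 + obs_var)
```

The estimate becomes more accurate as the number of latent samples grows, which is why a poor match between sampling distribution and true posterior requires more samples.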

Latent space estimation and instances' generation
Through the description of the stochastic attack (see Sec.2.3), the evaluator can construct a conditional variational autoencoder adapted for the SCA context.

Encoder.
As mentioned in Sec.3.1, the encoder is modeled by a neural network $F^{(enc)}_\Theta$ which returns an element in the latent space $\mathcal{V}$ of dimension $D$. This element should describe the behavior of the targeted crypto-system. In this regard, we design the encoder $F^{(enc)}_\Theta$ such that it characterizes the leakage model $\psi(Y)$ and the random part $Z$ of a trace $T$ in order to fit the stochastic attack process. To build a suitable encoder, the related neural network should follow the structure defined in Sec.2.3 in order to extract the maximum amount of relevant information from $T$. First, the evaluator has to estimate the deterministic part of a trace $T$ (i.e. the leakage model $\psi$) that is defined by Eq.2. This model can be estimated by a fully-connected layer of $D$ neurons such that each of them is linked with all elements of the monomial basis $(Y^u)_{u \in \mathbb{F}_2^n}$. Let $Y \in \mathbb{F}_2^n$ and $(Y^u)_{u \in \mathbb{F}_2^n}$ (resp. $\hat{\psi}_{i,\Theta}(Y)$) be the input (resp. output) of the $i$-th neuron such that: where (.) is a function (linear or non-linear) and $\Theta \in \mathcal{M}_{1+\sum_{i=0}^{d}\binom{n}{i},\,D}(\mathbb{R})$ denotes the set of trainable parameters for a given degree $d$ that characterizes the space $\mathcal{F}_{d+1}$. As the goal of our work is to reduce the gap between deep learning and classical profiled SCA, we define (.) as the identity function (footnote e) in order to satisfy $\hat{\psi}_\Theta[i] = \hat{\psi}_\alpha[i]$ and consider that the deterministic part of a trace at time sample $i$ can be approximated by a single neuron (see Fig.2a). In the rest of this paper, this layer will be denoted as $\hat{\psi}_\Theta$ (see Fig.2b).
Once the noise-free part $\hat{\psi}_\Theta$ is estimated, the next step is to characterize the noise part $Z$ using the traces and the neurons of the $\hat{\psi}_\Theta$ layer. In the cVAE-SA, we deliberately force the subtraction of the $i$-th neuron of the $\hat{\psi}_\Theta$ layer from the trace at time sample $i$ in order to fit Eq.1. Then, the encoder $F^{(enc)}_\Theta$ is trained to return a $\Theta$-parametric mean vector $\mu_{V,\Theta} \in \mathbb{R}^D$ and a $\Theta$-parametric covariance matrix $\Sigma_{V,\Theta} \in \mathcal{M}_{D,D}(\mathbb{R})$ that describe the multivariate Gaussian noise for a given trace $T$. Those approximations respectively estimate $\mu_V$ and $\Sigma_V$, which characterize the latent space $\mathcal{V}$. Thus, from these Note that a latent variable $V \in \mathbb{R}^D$ is initially sampled from the prior distribution $\Pr[V]$ such that the dimension of $V$ corresponds to the dimension of the latent space estimated by the encoder. However, performing the training process in such a configuration can be arduous. Indeed, during the training process, backpropagation cannot be performed because the evaluator has to compute the gradient of the loss function with respect to samples (i.e. latent variable $V \in \mathcal{V}$), which is inherently non-differentiable. To circumvent this issue, the reparametrization trick [KW14] proposes to rewrite $V$ such that the derivative can be computed with respect to the parametric distributions (i.e. Once $V$ is constructed, the evaluator has to approximate the deterministic part of the leakage model, namely $\hat{\psi}_\phi$. As already mentioned for the encoder, its estimation can be made with a fully-connected layer such that the input of size $D$ is characterized by $(Y^u)_{u \in \mathbb{F}_2^n}$ for a given $Y \in \mathbb{F}_2^n$. Because the evaluator wants to characterize all the input time samples, the number of nodes in the $\hat{\psi}_\phi$ layer depends on the dimensionality of the latent space, i.e. the dimension of $V$ (see Fig.2b). Based on $\hat{\psi}_\phi$ and $V$, the evaluator can then build a new trace $\tilde{T}$ following Eq.4.
A discussion and some visualization methods are proposed in App.A in order to ease the understanding of the encoder and the decoder. Then, to adequately find the trainable parameters $\Theta$ and $\phi$, the evaluator has to consider some learning metrics that aim at approximating $\Pr[T|Y]$.
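The reparametrization trick of [KW14] referred to above is commonly written as $v = \mu + \sigma \odot \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, I)$. The following minimal sketch assumes a diagonal covariance given as a vector of variances; the function name `reparameterize` and its optional `eps` argument are illustrative and not part of the paper.

```python
import math
import random

def reparameterize(mu, var, eps=None, rng=random):
    """Draw v = mu + sqrt(var) * eps with eps ~ N(0, I) (diagonal covariance assumed).

    All the randomness lives in `eps`, so v is a deterministic function of the
    encoder outputs `mu` and `var`, which is what allows gradients to flow
    through the sampling step.
    """
    if eps is None:
        eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + math.sqrt(s2) * e for m, s2, e in zip(mu, var, eps)]

# Fixing eps makes the deterministic dependence on (mu, var) explicit.
v = reparameterize(mu=[1.0, 2.0], var=[4.0, 9.0], eps=[1.0, -1.0])
```

For a fixed `eps`, the partial derivative of each output with respect to its mean is 1 and with respect to its variance is `eps / (2 * sqrt(var))`, so both encoder outputs receive a well-defined gradient.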

Similarity maximization
This section describes the optimization process from a side-channel perspective and introduces some simplifications that can be conducted thanks to the side-channel literature.
Introduction of the optimization problem. As defined in Sec.3.2, our generative model has to optimize a set of parameters φ and Θ in order to maximize the marginal log-likelihood log(Pr[T|Y, φ]).
Unfortunately, due to the intractability of $\Pr[V|T, Y, \phi]$ (see Sec.3.1), Eq.5 cannot be solved in practice. Hence, we have to define a function such that $\log(\Pr[T|Y, \phi])$ can be approximated through an optimization algorithm. In [KW14], Kingma

Proof. [SLY15] log (Pr

As mentioned in Sec.3.2, the prior distribution $\Pr[V|Y, \phi]$ can be reduced to $\Pr[V]$ because $V$ is independent of the label $Y$ and of $\phi$. The equality between Eq.5 and Eq.6 holds if and only if the encoder $F^{(enc)}_\Theta$, which approximates the parameters $\mu_{V,\Theta}$ and $\Sigma_{V,\Theta}$ that are needed to compute $\Pr[V|T, Y, \Theta]$, is able to perfectly predict $\Pr[V|T, Y, \phi]$. In such a configuration, the latent space exactly captures the random part induced in a trace $T$. Based on Eq.6, we define the empirical risk that we minimize to train the cVAE-SA.
Definition 1 [KW14]. Given a latent space $\mathcal{V}$ and a set of $N_p$ labeled traces $\mathcal{I}_p = \{(t_0, y_0), \dots, (t_{N_p-1}, y_{N_p-1})\}$, we define the empirical risk optimizing $F_{\Theta,\phi}$, which approximates the generative distribution $\Pr[T|Y]$, as follows: Reconstruction Loss , such that $(\Pr[V|t_i, y_i, \Theta])_{0 \le i < N_p}$ is computed from $\mu_{V,\Theta}$ and $\Sigma_{V,\Theta}$ provided by the encoder

Sampling $v$ from the learned posterior $\Pr[V|T, Y, \Theta]$ knowing the trace $T$, the related label $Y$ and the multivariate Gaussian distribution $\mathcal{N}_D(\mu_{V,\Theta}, \Sigma_{V,\Theta})$, can be seen as seeks to reconstruct $T$ from $v$. Classically used for training a variational autoencoder [KW14, KWKT15, GDG+15, SSB17], the loss function defined in Def.1 can be decomposed into two terms: the reconstruction and the KL-divergence terms. From a general perspective, to minimize the reconstruction loss, the embedding means $\mu_{V,\Theta}$, for various $Y$, are pushed far away from each other and the embedding standard deviations $\Sigma_{V,\Theta}$ are pulled toward zero. On the other hand, to get a smaller $D_{KL}(\Pr[V|T, Y, \Theta] \,\|\, \Pr[V])$, the embedding means are pulled toward zero and the embedding standard deviations are increased. As the KL-divergence term is opposed to the reconstruction loss, it can be seen as a regularization term. Indeed, putting a lot of information about $T$ in $V$ makes reconstruction trivial, but the penalization induced by the regularization term is then non-negligible. Therefore, the regularization term acts as an information bottleneck, and a balance between both terms must be found in order to only keep the informative and generic features. If necessary, the KL-divergence loss can be weighted by a hyperparameter $\beta$.
In the state of the art, these models are called $\beta$-Variational AutoEncoders [HMP+17]. However, as this paper bridges the stochastic attacks with the cVAE model, the impact of the $\beta$-parameter on the resulting learning algorithm is considered out of the scope of this paper.
Remark 1. Readers might notice that the minimization process is conducted on the empirical risk combined with the ELBO loss. Therefore, the reconstruction and the KL-divergence losses are simultaneously computed to train the encoder and the decoder of the cVAE-SA.

Reconstruction loss. This term, denoted by
It defines the probability of constructing $T \in \mathbb{R}^D$ given the label $Y \in \mathbb{F}_2^n$ and a sample $v \in \mathbb{R}^D$ of the latent space $\mathcal{V}$. Hence, the reconstruction loss tends to maximize the log-likelihood in order to construct traces that are correlated with the true unknown leakage model $\psi$ and the noise $Z$ related to $T$. Thus, it encourages the decoder to learn how a trace can be reconstructed from a given noise representation defined by a latent variable $V \sim \mathcal{N}_D(\mu_{V,\Theta}, \Sigma_{V,\Theta})$ (see Eq.4). The reconstruction loss optimizes the parameters $\phi$ to retrieve the correct coefficients associated with each vector of the monomial basis $(Y^u)_{u \in \mathbb{F}_2^n}$. Typically, if we only consider the case where no interaction between the time samples of $T$ occurs, then the covariance matrix $\Sigma_{V,\Theta}$ can be simplified to a diagonal matrix such that its vector representation can be described as In such configuration, we do not expect to capture the time samples' interaction related to the constructed trace $\tilde{T} \in \mathbb{R}^D$. Thus, the reconstruction loss can be computed as follows: where $\mu_{\tilde{T}}[i]$ (resp. $\sigma^2_{\tilde{T}}[i]$) indicates the $i$-th element of the mean (resp. variance) vector of generated traces $\tilde{T}$ given a set of latent representations and a deterministic part $\hat{\psi}_\phi$ which depends on $Y = f(X, k^*)$ (see Eq.4). However, assuming that $\Sigma_{V,\Theta}$ can be simplified to a diagonal matrix affects the ability of the generated trace $\tilde{T}$ to capture the interaction between the time samples of $T$. While this choice can be problematic from a performance perspective (footnote g), the computational gain is non-negligible as no matrix inversion has to be computed in order to evaluate the reconstruction loss.
Then, we assume that the output distribution of the conditional variational autoencoder is an isotropic Gaussian (footnote h) (i.e. for all $v \sim \mathcal{N}_D(\mu_{V,\Theta}, \mathrm{diag}(\Sigma_{V,\Theta}))$, we can define $\Sigma_{\tilde{T}} = \sigma^2 \cdot I_D$ where $\sigma^2$ is a scalar). As the Mean Squared Error (MSE) loss function can be written as $\mathbb{E}_{v \sim F^{(enc)}_\Theta} \|T - \mu_{\tilde{T}}\|^2$, Eq.7 can be simplified as follows: Note that this solution is minimized if the scalar $\sigma^2 = \mathbb{E}_{v \sim F^{(enc)}_\Theta} \|T - \mu_{\tilde{T}}\|^2 = \mathrm{MSE}(T, \mu_{\tilde{T}})$ [Yu20]. This loss is approximated via Monte-Carlo sampling; however, due to computation constraints, we consider only one sample $v$ for computing Eq.8 during the training process. Consequently, for an estimated trace $\tilde{T}$, we minimize its $L_2$ distance to the related true trace $T$ in order to find the best parameters $\phi$. In other words, through this solution, we attempt to find an estimated trace $\tilde{T}$ as similar as possible to the real one $T$. Thus, the decoder $F^{(dec)}_\phi$ is only affected by the reconstruction loss and seeks to suitably reconstruct $\tilde{T}$ based on a latent representation $V$ and a deterministic part $\hat{\psi}_\phi$.
KL-divergence loss. However, to reduce the overfitting issue, a regularization term is added. In addition to the optimization of $\phi$, the cVAE concurrently optimizes $\Theta$ to minimize the KL-divergence of the approximation $\Pr[V|T, Y, \Theta]$ from the prior $\Pr[V] = \mathcal{N}_D(0, I_D)$. The latter distribution is assumed because the traces are standardized, i.e. zero mean and unit variance, and such that no interactions are captured between the time samples. As $\Pr[V]$ characterizes the random part of $\tilde{T}$ (see Eq.4), it has to follow the same distribution as the random part of the real trace $T$, which is $\mathcal{N}(0, 1)$ for each non-informative time sample. Through this configuration, the KL-divergence can be computed as follows:

$$D_{KL}(\Pr[V|T, Y, \Theta] \,\|\, \Pr[V]) = \frac{1}{2}\left( \mathrm{tr}(\Sigma_{V,\Theta}) + \mu_{V,\Theta}^\top \mu_{V,\Theta} - D - \log\det(\Sigma_{V,\Theta}) \right) \qquad (9)$$

As $\Sigma_{V,\Theta}$ can be rewritten as a vector $\sigma^2_{V,\Theta}$ such that each element of $(\sigma^2_{V,\Theta}[i])_{0 \le i < D}$ defines the $i$-th diagonal element of $\Sigma_{V,\Theta}$, Eq.9 can be expressed as follows:

$$D_{KL}(\Pr[V|T, Y, \Theta] \,\|\, \Pr[V]) = \frac{1}{2} \sum_{i=0}^{D-1} \left( \sigma^2_{V,\Theta}[i] + \mu_{V,\Theta}[i]^2 - 1 - \log \sigma^2_{V,\Theta}[i] \right) \qquad (10)$$

As a reminder, to correctly deal with the stochastic attack scenario, the deterministic part (i.e. $\psi(f(X, k^*))$) as well as the random part (i.e. $Z$) should be correctly characterized by the cVAE-SA model. While the deterministic part is approximated by the $\hat{\psi}$ layer, the random part is modeled by the latent space $\mathcal{V}$ (see Sec.3.2). Therefore, a well-trained cVAE-SA should provide a latent space that is representative of the random part $Z$. Through the use of the KL-divergence loss, we force a latent variable $V$ to follow $\mathcal{N}_D(0, I_D)$. To clearly explain the impact of the KL-divergence loss on the trainable parameters $\Theta$, let us denote by $T$ a $D$-dimensional trace that has been standardized at each sample. Let $\{l_0, \dots, l_{s-1}\}$ define a set of indices where the sensitive information leaks (i.e. PoIs) such that, In this setting, we assume that the interactions between trace samples are negligible. On the other hand, ). However, due to the KL-divergence loss function involved during the training process, we force the latent variable $V$ to follow $\mathcal{N}_D(0, I_D)$.
As defined in Sec.3.2, this latent variable characterizes an estimation of the noise $Z$ induced in the trace $T$. Thus, during the training process of the cVAE-SA, we penalize the model to tend towards 1 such that this solution is reached if and only if $\hat{\psi}_\Theta = \psi$. Consequently, when the KL-divergence loss is computed, the cVAE-SA optimizes the trainable parameters $\Theta$ of the encoder $F^{(enc)}_\Theta$ such that the regularization term equals 0 if and only if $\Theta$ is optimal. In addition, to fully assess the suitability of the training process, the evaluator can visualize the trainable parameters $\Theta$: if the correct leakage model appears, then the cVAE-SA model is well trained. A discussion related to this visualization technique is provided in Sec.4.2.
This justification suggests that the latent space should be composed only of PoIs. When the input traces are standardized (i.e. zero mean, unit variance), considering the KL-divergence loss helps reduce the impact of irrelevant time samples. However, when the Gaussian noise increases, the dependence between $T[i]$ and $\psi[i]$ decreases. In this configuration, differentiating the sensitive information from the noise can be difficult, as $Z[i]$ approximately follows $\mathcal{N}(0, 1)$ regardless of the information included in time sample $i$. This is consistent with the fact that noise reduces the efficiency of DLSCA approaches, as confirmed in Sec.4 and in App.B.
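For a diagonal Gaussian compared against a standard normal prior, the KL-divergence has a well-known closed form, which the following minimal sketch evaluates. The function name `kl_to_standard_normal` and the example values are illustrative assumptions; the three calls show that the term vanishes when the latent distribution matches the prior and grows when a mean moves away from 0 or a variance moves away from 1.

```python
import math

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) for a diagonal Gaussian, in nats."""
    return 0.5 * sum(s2 + m * m - 1.0 - math.log(s2) for m, s2 in zip(mu, var))

kl_matched = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])   # latent equals the prior
kl_shifted = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])   # one mean moved away from 0
kl_narrow = kl_to_standard_normal([0.0, 0.0], [0.25, 1.0])   # one variance shrunk below 1
```

Because each coordinate contributes independently, the penalty can be inspected per time sample to see which latent dimensions deviate from the assumed noise distribution.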
From a practical perspective, even if the ELBO loss function is composed of two sub-losses, namely the reconstruction and KL-divergence losses, a single optimization process is performed in order to minimize the ELBO loss. Once the generative model $F_{\Theta,\phi}$ is trained, the evaluator has to make a decision following the approximation of $\Pr[T|Y]$ in order to fit with the stochastic attack approach. The following section describes this strategy.
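The two sub-losses can be sketched as follows. This is a minimal illustration and not the authors' Keras implementation: the squared-error reconstruction term, the unweighted sum of both terms and all function names are assumptions, while the regularization term is the standard closed form of the KL-divergence between a diagonal Gaussian and $\mathcal{N}_D(0, I_D)$.

```python
import math


def kl_to_standard_normal(mu, sigma2):
    """KL( N(mu, diag(sigma2)) || N(0, I) ) for one latent vector V."""
    return 0.5 * sum(s2 + m * m - 1.0 - math.log(s2) for m, s2 in zip(mu, sigma2))


def reconstruction_loss(trace, generated):
    """Squared error between a real trace and a generated one (assumed form)."""
    return sum((t - g) ** 2 for t, g in zip(trace, generated))


def elbo_loss(trace, generated, mu, sigma2):
    """Single scalar minimized during training: reconstruction + regularization."""
    return reconstruction_loss(trace, generated) + kl_to_standard_normal(mu, sigma2)


# The regularization term vanishes only when the latent variable matches N(0, I).
print(kl_to_standard_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0
print(kl_to_standard_normal([0.0, 0.0], [0.5, 1.0]) > 0.0)  # True
```

Because both terms are summed into one scalar, a single gradient-based optimizer updates the encoder and decoder parameters at the same time.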

Decision rule & network complexity
Typically, in the Machine Learning community, the inference phase of a cVAE consists in generating a new set of data based on an input and a known conditional label. In the SCA context, our goal is different: we aim to find the unknown conditional label $Y$ that best fits a given trace $T$. The following part describes a new solution to retrieve the secret key $k^*$ from the model previously defined.
Key recovery phase. During the training phase, we defined a function $F_{\Theta,\phi}$ that approximates $\log(\Pr[T|Y, \phi])$ through an optimization algorithm (i.e. gradient descent-based algorithms) such that the generated trace $\hat{T}$, defined by the output of the decoder $F^{(dec)}_\phi$, is close to the real one $T$ captured for a given label. Once the encoder and the decoder are successfully trained simultaneously to optimize $F_{\Theta,\phi}$, the evaluator can dissociate them in order to extract the unknown secret key from a targeted device. As mentioned in Sec.3.1, the encoder and the decoder are defined by $F^{(enc)}_\Theta$ and $F^{(dec)}_\phi$ respectively. The key recovery phase uses these functions independently in order to retrieve the targeted secret key. To fully understand this strategy, a modus operandi is suggested for a given key hypothesis $k \in \mathcal{K}$:
1. First, the evaluator generates a new set of traces from the targeted device$^i$ with a fixed unknown secret key $k^*$. Let $I_a$ be the set of $N_a$ attack traces.
(a) For each attack trace $t \in I_a$, the evaluator computes the label $Y = f(X, k)$ related to $t$ by mixing the known plaintext $X \in \mathcal{X}$ and the key hypothesis $k$.
When the inferred posterior $\Pr[V|T, Y, \Theta]$ deviates from the true unknown posterior $\Pr[V|T, Y, \phi]$, the number of samples $N_v$ has to be increased in order to obtain an accurate approximation of $\Pr[T|Y, \phi]$. If the profiling phase has been performed successfully, then $\left(t_i - \frac{1}{N_v}\sum_{j=0}^{N_v-1}\hat{t}_j\right)^2$ should be minimized when $k = k^*$. Hence, the most likely candidate is defined through the maximum likelihood rule of Eq.10.

i As this paper is dedicated to the profiled attack scenario, the readers must consider the targeted device as identical to the open device used during the training process.
Following Eq.4, $\hat{T} \sim \mathcal{N}_D(\mu_{\hat{T}}, \Sigma_{\hat{T}})$ such that $\mu_{\hat{T}} = \hat{\psi}_\phi$ and $\Sigma_{\hat{T}} = \Sigma_{V,\Theta}$. As a consequence, through this key recovery phase, the evaluator aims at identifying the hypothetical leakage model $\hat{\psi}_\phi(f(X, k))$ that best fits $T$. This process therefore exploits the first-order moment to recover information about the secret key. This observation is confirmed in App.A.
To enhance the key extraction phase, the evaluator can precisely identify the indices of the PoIs via a leakage assessment once the profiling phase is performed. Indeed, if $\Theta$ and $\phi$ are correctly learned, the evaluator can visualize them in order to properly select the PoIs (see Sec.4.2). Thus, during the attack phase, instead of parsing all the time samples, the evaluator can compute Eq.10 only on the samples that are considered relevant by the cVAE-SA.
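The key recovery procedure described above can be sketched on simulated data. Everything below is a toy under stated assumptions: `toy_decoder` is a hypothetical stand-in for the trained decoder (a Hamming-weight leakage plus latent noise), `f(X, k)` is reduced to a plain XOR on 4-bit values, and the latent samples are drawn from a fixed Gaussian instead of the encoder output. Only the overall flow (label computation, $N_v$ generated traces, averaged squared error on selected samples, best-scoring hypothesis) follows the text.

```python
import random


def hamming_weight(value):
    return bin(value).count("1")


def toy_decoder(label, latent):
    """Hypothetical stand-in for the trained decoder: leakage model + latent noise."""
    return [hamming_weight(label) + v for v in latent]


def recover_key(traces, plaintexts, n_keys, n_v, pois, rng, sigma_v=0.1):
    """Return the key hypothesis minimizing the accumulated squared error."""
    scores = []
    for k in range(n_keys):
        score = 0.0
        for t, x in zip(traces, plaintexts):
            label = x ^ k  # toy f(X, k); a real attack would use e.g. an Sbox output
            generated = [
                toy_decoder(label, [rng.gauss(0.0, sigma_v) for _ in t])
                for _ in range(n_v)
            ]
            for d in pois:  # only the time samples considered relevant
                mean_d = sum(g[d] for g in generated) / n_v
                score += (t[d] - mean_d) ** 2
        scores.append(score)
    best = min(range(n_keys), key=scores.__getitem__)
    return best, scores


# Simulated attack set: 4-bit values, 2 time samples, secret key 11.
rng = random.Random(0)
secret = 11
plaintexts = [rng.randrange(16) for _ in range(100)]
traces = [
    [hamming_weight(x ^ secret) + rng.gauss(0.0, 0.1) for _ in range(2)]
    for x in plaintexts
]
best, _ = recover_key(traces, plaintexts, n_keys=16, n_v=10, pois=[0, 1], rng=rng)
print(best)  # 11
```

Restricting `pois` to the informative indices is the code-level counterpart of computing the decision rule only on the samples deemed relevant.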
Theoretical network complexity bounds. Based on the previous sections, we can efficiently find an architecture for a given implementation. Consequently, some theoretical network complexity bounds can be expressed following the evaluator's knowledge. Indeed, our generative neural network (i.e. cVAE-SA) can be easily built for a given $Y \in \mathbb{F}_2^n$, a degree $d$ of bits' interaction and a $D$-dimensional trace. In this setting, the encoder (resp. decoder) needs to optimize $\Theta$ (resp. $\phi$) in order to retrieve the correct leakage model. Hence, for a given $Y \in \mathbb{F}_2^n$, the number of weights that have to be optimized in each of them is $D \cdot \left(1 + \sum_{i=0}^{d}\binom{n}{i}\right)$. Here, we decide to follow the classical stochastic attacks in order to easily extract the related noise. Hence, no weights are needed for this operation. Finally, to approximate $\mu_{V,\Theta}$ (resp. $\Sigma_{V,\Theta}$), we need $D \cdot (D + 1)$ (resp. $D^2 \cdot (D + 1)$) neurons. For the simplified diagonal case, $\Sigma_{V,\Theta}$ can be reduced to $\sigma^2_{V,\Theta}$; thus, only $D \cdot (D + 1)$ neurons are needed in this configuration. To sum up the complexity metrics, the evaluator needs to construct a generative model with $D \cdot \left((D + 1)^2 + 2 \cdot \left(1 + \sum_{i=0}^{d}\binom{n}{i}\right)\right)$ weights (resp. $2D \cdot \left((D + 1) + 1 + \sum_{i=0}^{d}\binom{n}{i}\right)$ weights if $\Sigma_{V,\Theta}$ is reduced to $\sigma^2_{V,\Theta}$). Following those metrics, it can be noticed that the trace dimension $D$ has the largest influence on the network complexity.
However, a solution can be considered to reduce the network complexity without altering the cVAE-SA performance. Indeed, following Sec.3.3, if the evaluator detects $s$ PoIs, he can construct a vector $\{l_0, \dots, l_{s-1}\}$ of $s$ indices such that $l_i$ denotes the index related to the $i$-th point of interest. Based on this knowledge, he can build a cVAE-SA with lower complexity such that most of the relevant information, dedicated to the $s$ PoIs, can be extracted from a trace. Instead of considering all the samples of the $D$-dimensional trace (s.t. $D \gg s$), he can construct a neural network with $s \cdot \left((s + 1)^2 + 2 \cdot \left(1 + \sum_{i=0}^{d}\binom{n}{i}\right)\right)$ weights (resp. $2s \cdot \left((s + 1) + 1 + \sum_{i=0}^{d}\binom{n}{i}\right)$ weights if $\Sigma_{V,\Theta}$ is reduced to $\sigma^2_{V,\Theta}$). As a consequence, we drastically reduce the network complexity without altering the ability of the generative model to retrieve the secret key as suggested in Sec.3.3. For example, the network complexity of Fig.2b is about 1,040 weights if all bits' interactions are considered (i.e. $d = 8$, $s = 2$ and $n = 8$). When black-box models (e.g. discriminative models) are considered, finding such complexity bounds is known to be an arduous task as no connection is provided with classical profiled SCA.
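The complexity bounds above are easy to evaluate numerically. The helper below is a small sketch (function and parameter names are ours, not the paper's) that counts the weights for a trace of dimension `D`, an `n`-bit label and an interaction degree `d`, for both the full and the diagonal covariance cases. With $s = 2$, $n = 8$ and $d = 8$ it reproduces the figure of about 1,040 weights quoted for Fig.2b in the diagonal case.

```python
from math import comb


def basis_size(n, d):
    """Number of monomials of degree at most d over n bits."""
    return sum(comb(n, i) for i in range(d + 1))


def cvae_sa_weights(D, n, d, diagonal=True):
    """Weight count of the cVAE-SA following the bounds given in the text."""
    # Leakage-model layers of the encoder and of the decoder.
    leakage_layers = 2 * D * (1 + basis_size(n, d))
    if diagonal:
        # mu and sigma^2 each need D * (D + 1) parameters.
        latent_layers = 2 * D * (D + 1)
    else:
        # mu needs D * (D + 1), the full covariance D^2 * (D + 1).
        latent_layers = D * (D + 1) + D ** 2 * (D + 1)
    return leakage_layers + latent_layers


print(cvae_sa_weights(2, 8, 8, diagonal=True))   # 1040
print(cvae_sa_weights(2, 8, 8, diagonal=False))  # 1046
```

Evaluating the same function for a large `D` and for a small number `s` of PoIs makes the gain from restricting the network to the PoIs directly visible.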
One of the main benefits of the proposed variational autoencoder is its explainability and its interpretability regarding the side-channel context. In addition, our theoretical results suggest that its width does not have to be large no matter the dimension of the traces. This result is consistent with the Universal Approximation Theorem [Pin99]. In the following section, we validate these properties and broaden the attacks' spectrum to protected implementations considering the Boolean masking scheme.

Settings
Hyperparameter selection. While classical DLSCA models need to tune a lot of hyperparameters (e.g. type of neural network, number of layers, number of nodes per layer, activation function, optimizer algorithm, learning rate, number of epochs, batch size), the configuration of the proposed cVAE-SA only deals with the optimizer algorithm, the batch size, the learning rate and the number of epochs. In this section, optimization is done using the Adam optimizer with a batch size in $\{8, 16, 32, 64, 128\}$ and a learning rate in $\{10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}\}$. We train each model for at most 40 epochs and select the hyperparameters that provide the best ranking value. Finally, in this section, $N_{t_{rank}}$ denotes the number of attack traces that are needed to reach a constant rank of 1. These traces are randomly shuffled and picked from a set of attack traces $I_a$ which is characterized by the simulations described in the following. For a good estimation of $N_{t_{rank}}$, an average over 10 simulations, denoted $\bar{N}_{t_{rank}}$, is computed.
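As an illustration of this metric, the sketch below derives the number of traces needed to reach a constant rank of 1 from a rank curve, then averages it over several shuffled experiments. The input convention (`ranks[i]` is the rank of the correct key after `i + 1` attack traces, with 1 as the best rank) and the function names are assumptions made for this example.

```python
def n_t_rank(ranks):
    """Smallest number of traces from which the rank stays at 1, else None."""
    needed = None
    # Walk backwards while the rank is 1; stop at the first deviation.
    for i in range(len(ranks) - 1, -1, -1):
        if ranks[i] != 1:
            break
        needed = i + 1
    return needed


def average_n_t_rank(rank_curves):
    """Average over shuffled experiments (all curves are assumed to converge)."""
    values = [n_t_rank(curve) for curve in rank_curves]
    return sum(values) / len(values)


print(n_t_rank([5, 3, 1, 2, 1, 1, 1]))  # 5
print(average_n_t_rank([[2, 1, 1], [1, 1, 1], [3, 2, 1]]))  # 2.0
```

Note that a rank of 1 reached early but lost afterwards does not count, which is why the first example returns 5 rather than 3.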

Simulations.
To verify the benefits of the cVAE-SA, we simulate $D$-dimensional traces from an 8-bit sensitive variable $Y$. In this section, the simulated traces are built following two scenarios:
• Scenario 1 - We assume the leakage model induces the maximum amount of interactions between bits (i.e. $F_9$), such that all bits influencing the leakage model have the same weights. The $i$-th time sample of the simulated trace $T$ is defined by Eq.11, where $\{l_0, \dots, l_{s-1}\}$ denotes the PoIs, $Y[b]$ denotes the $b$-th bit of the output of the Sbox, and $Z[i]$ is a Gaussian noise following $\mathcal{N}(0, \sigma^2)$ such that $\sigma^2 = 1$. The SNR result is provided in App.B.
• Scenario 2 - We assume that the leakage model induces interactions of degree 2 between bits (i.e. $F_3$) but differs by the location of the PoIs. The $i$-th time sample of the trace $T$ is defined by Eq.12, in which the $b$-th bit of the output of the Sbox is computed from a plaintext $X$ and the secret key $k^*$, and $Z[i]$ is a Gaussian noise following $\mathcal{N}(0, \sigma^2)$ such that $\sigma^2 = 1$. The SNR result is provided in App.B.
A set of 10,000 traces (9,000 for the profiling phase and 1,000 for the validation phase) is simulated for each scenario. The choice of these scenarios has been motivated by the need to assess the ability of the cVAE-SA to capture the interactions between bits as well as to simultaneously target multiple sensitive variables. Further experiments with other scenarios are investigated in App.A and App.B considering additional Gaussian noise parameters.

Through the application of the well-known Gram-Schmidt orthonormalization on the monomial basis, Guilley et al. [GHMR17] introduce a new orthonormal monomial basis that decorrelates the basis vectors and preserves the degree of bits' interaction. Hence, constructing the cVAE-SA on this orthonormal monomial basis is beneficial to evaluate the ability of the neural network to retrieve the leakage model and maintain its interpretability. This approach will be considered in the rest of the paper. Using the orthonormal monomial basis has a major benefit. As shown by Kasper et al. [KSS10], when the basis is able to describe the switching activity of the circuit, the estimated basis coefficients highlight specific exploitable security flaws in the studied implementation. Hence, visualizing the basis coefficients that characterize the cVAE-SA model $F_{\Theta,\phi}$, namely $\Theta$ and $\phi$, is useful to get deeper information on the exploitable security flaws. The next section proposes to visualize the trainable parameters $\Theta$ and $\phi$ in order to assess the suitability of the cVAE-SA to extract the expected leakage model $\psi$.
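To make the notion of a monomial basis of bounded interaction degree concrete, the following sketch enumerates the plain (non-orthonormalized) monomials over the bits of a label and evaluates them. It assumes that a monomial is the product of the selected bits and that bit 0 is the least significant bit; the Gram-Schmidt step of [GHMR17] is deliberately left out, and the function names are illustrative.

```python
from itertools import combinations


def monomial_basis(n, d):
    """All subsets of at most d bit indices among n bits (empty subset = constant)."""
    return [subset for degree in range(d + 1) for subset in combinations(range(n), degree)]


def evaluate_basis(y, basis):
    """Value of each monomial (product of the selected bits) for the label y."""
    return [int(all((y >> b) & 1 for b in subset)) for subset in basis]


print(len(monomial_basis(8, 8)))  # 256
print(len(monomial_basis(8, 2)))  # 37
print(evaluate_basis(0b101, monomial_basis(3, 2)))  # [1, 1, 0, 1, 0, 1, 0]
```

The two sizes correspond to the bases used in the two scenarios for an 8-bit variable: all interactions ($d = 8$) and interactions of degree at most 2.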

Leakage model estimation & multi-task learning
Single-sensitive variable attacks. As mentioned in Sec.3.2, the encoder (resp. decoder) is trained to retrieve the trainable parameters $\Theta$ (resp. $\phi$) in order to maximize their correlation with the targeted leakage model. Once the cVAE-SA is correctly trained, the evaluator can visualize these trainable parameters (i.e. $\Theta$ and $\phi$) in order to find the security flaws induced in the studied implementation. In the considered scenario (see Fig.3a), the weight visualization can be used to assess the ability of the encoder (resp. decoder) to retrieve the leakage function defined in Eq.11. Indeed, these figures illustrate the coefficients associated with each vector of the orthonormal monomial basis. The first coefficients of each figure define the lowest bits' interaction induced in the leakage model. For example, the first element included in the interaction of degree 5 area is characterized by $\bigoplus_{b=0}^{4} Y[b]$. As the related weight is non-negligible, the cVAE-SA identifies that the interaction $\bigoplus_{b=0}^{4} Y[b]$ influences the leakage model. This observation can be confirmed with Eq.11. Carrying out this analysis for the entire set of non-negligible weights helps to evaluate the ability of the cVAE-SA to retrieve the leakage model. Indeed, if we compare the real simulated leakage model defined in Eq.11 with the non-negligible weights depicted in Fig.3a, we can see that all the peaks are associated with the correct basis vector. In addition, each coefficient associated with the sensitive interactions has approximately the same impact, which corresponds to the real leakage function defined in Eq.11. Consequently, if the cVAE-SA is correctly trained, it is helpful to retrieve complex leakage models as well as the related security flaws. Moreover, through the visualization provided in Fig.3a, the evaluator can also identify the time samples where the sensitive information leaks.
Indeed, this figure highlights that only $T[1]$ is useful to extract the leakage model. Hence, once the cVAE-SA is correctly trained, the evaluator can easily retrieve the PoIs. Then, during the attack phase, the evaluator can decide to focus his attack by computing Eq.10 only on $T[1]$ instead of the entire trace dimension as mentioned in Sec.3.4. Contrary to the classical profiled SCA approach, the cVAE-SA provides a more flexible solution during the exploitation phase.
One advantage of the stochastic model is to approximate the data that depends on the secret key. Through this process, the evaluator directly obtains a score related to the key manipulated by the crypto-system. Hence, the evaluator can adapt the orthonormal monomial basis to simultaneously target multiple cryptographic primitives (e.g. input and output of the Sbox). Therefore, the cVAE-SA can be adapted to perform multi-tasking attacks. Additional experiments are proposed in App.B in order to assess the impact of the Gaussian noise on this visualization technique.
Multi-task learning attacks. This approach has been proposed in [Mag20] for enhancing the discriminative approach. While this learning strategy is beneficial to improve the performance of the key recovery phase, the design of such discriminative models remains an open question. Through this paragraph, we illustrate the flexibility of the cVAE-SA to deal with such a solution. To exploit all the bits' interactions for each sensitive variable, we set $d = 8$ such that Fig.3b illustrates the impact of each vector of the basis $F_9$ once the cVAE-SA is well trained. When the time sample $T[1]$ is considered, we can see:
• An interaction of degree 1 and 2 that corresponds to the bit 5.
• An interaction between the bits 3 and 7 of the input of the Sbox.
Then, through Fig.3b, we can see that $T[2]$ extracts a leakage model with two interactions of degree 1 associated with the 3rd and the 6th bit of the output of the Sbox. This result is consistent with Eq.12. Hence, through this simulation, we can validate the ability of the cVAE-SA to correctly retrieve the leakage model of multiple sensitive variables simultaneously. This approach is highly beneficial in the SCA context as it gathers more information about the targeted secret key and results in a better performance. However, as mentioned in Sec.3.4, the degree $d$ of the orthonormal monomial basis $F_{d+1}$ directly affects the complexity of the cVAE-SA. Hence, attacking multiple sensitive variables increases the network complexity by $2D \cdot (1 + \dots)$ weights, and the evaluator has to define the most suitable structure to defeat the targeted crypto-system. However, in the SCA context, the evaluator mainly deals with a non-negligible number of uninformative samples. The following section assesses the ability of the cVAE-SA to mitigate this constraint.

Curse of dimensionality
When the evaluator performs a side-channel attack, he wants to precisely find the relevant key-dependent time samples even if a large part of the trace contains uninformative time samples. Usually, the number of PoIs $s$ is far lower than the trace dimension $D$ (i.e. $s \ll D$). Thus, to assess the benefits of the cVAE-SA, we have to understand the ability of this new model to retrieve the PoIs when a lot of samples are irrelevant. In order to evaluate it under this restriction, we consider Scenario 1 (see Sec.4.1) and construct 6 sub-scenarios where $D \in \{3, 10, 50, 100, 500, 1000\}$ and $s = 1$ such that the related Signal-to-Noise Ratio (SNR) equals 0.549. Hence, for each case study, only a single PoI is configured while the dimension of the simulated traces increases. In Tab.1, we denote by $P_{RI} = \frac{s}{D}$ the fraction of relevant information in each sub-scenario and evaluate the impact of this variable on other parameters, namely the batch size and the learning rate. Finally, $N_v$ denotes the number of samples $V$ used to perform Eq.10.
As suggested in Sec.3.4, the attack process is performed only on the time samples defined as relevant$^k$ by the cVAE-SA. Hence, the weight visualization applied on $\phi$ and $\Theta$ is very helpful to define which samples can be considered as PoIs. Through Tab.1, we can see that increasing $D$ does not significantly impact the resulting performance of the cVAE-SA (i.e. $\bar{N}_{t_{rank}}$). Indeed, if the evaluator adequately finds the correct hyperparameters, namely batch size and learning rate, he can expect to get similar results for high values of $D$. However, as detailed in Sec.3.4, increasing the input dimension highly impacts the complexity of the cVAE-SA. Finding a way to focus the interest of the model only on the relevant time samples can drastically reduce the network complexity without altering its resulting performance. Such investigations could be part of a future work to highlight even more the benefits of DL in the SCA context.
Once the evaluator has validated the ability of the generative model to deal with a low percentage of relevant information, he can question the benefits of the cVAE-SA to defeat Boolean masking implementations. The next section investigates this protection in depth against this new model.

Generalization on boolean masking implementation
Typically, the discriminative models are built to automatically extract the relevant information from a trace without providing a clear interpretability of their decision-making. In a Boolean masking scheme of order $o$, the sensitive variable $Y$ is split into $(o + 1)$ shares whose XOR equals $Y$: $o$ shares are randomly generated and the last one is computed in order to satisfy the latter relation. Consequently, to perform a successful high-order attack, the evaluator has to find the $(o + 1)$ shares in order to retrieve the sensitive information $Y$. Typically, classical approaches considered in profiled SCA use some recombination techniques$^l$ as preprocessing [CJRR99, Mes00, PRB09]. This approach involves the combination of $(o + 1)$ shares in order to "demask" the masked values and perform the attacks on the unmasked value. To apply this proposition, various recombination techniques are introduced, namely product combining [CJRR99], absolute difference combining [Mes00] and optimal product combining [PRB09]. If the evaluator wants to apply one of these techniques, he has to recombine the samples related to each of the $(o + 1)$ shares and then target the unmasked sensitive value $Y$.

k Here, the relevance of a time sample is characterized by its coefficients $\phi$ and $\Theta$ such that the most relevant time samples have the highest coefficient values.
To evaluate the suitability of the cVAE-SA in such scenarios, we decide to simulate a 5-dimensional trace with different levels of masking order $o \in \{0, 1, 2, 3\}$. For each case study, we apply the absolute difference, the product and the optimal product combining functions and list the best result we obtained in Tab.2. Through this table, we demonstrate the ability of the proposed cVAE-SA to defeat a high-order Boolean masking implementation. Surprisingly, the number of shares does not highly impact the hyperparameters' values$^m$, namely the learning rate and the batch size, unlike the network complexity. Indeed, for a given set of $D$-dimensional traces, the combining methods multiply the number of time samples by $D$ for each mask reduction. Hence, to perform an $o$-th order attack, the evaluator has to deal with traces of $D^{o+1}$ samples. As the dimension of the traces impacts the network complexity (see Sec.3.4), the evaluator has to exponentially increase his computational ability with the attack order.
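The growth of the trace dimension with the attack order can be illustrated with a short sketch of two combining functions. It assumes that every tuple of $(o + 1)$ time samples is combined, which yields $D^{o+1}$ values as stated above, and that the centered product (each sample minus its mean, then multiplied) stands for the normalized product usually referred to as optimal product combining; the function names and the per-sample `means` argument are illustrative.

```python
from itertools import product
from math import prod


def centered_product_combining(trace, means, order):
    """Product of (order + 1) centered samples, for every tuple of time indices."""
    centered = [t - m for t, m in zip(trace, means)]
    indices = product(range(len(trace)), repeat=order + 1)
    return [prod(centered[i] for i in idx) for idx in indices]


def absolute_difference_combining(trace):
    """Absolute difference of every pair of samples (first-order masking only)."""
    return [abs(a - b) for a in trace for b in trace]


print(centered_product_combining([1.0, 2.0], [0.0, 0.0], order=1))  # [1.0, 2.0, 2.0, 4.0]
print(absolute_difference_combining([1.0, 2.0]))  # [0.0, 1.0, 1.0, 0.0]
print(len(centered_product_combining([0.0] * 5, [0.0] * 5, order=2)))  # 125
```

For the 5-dimensional traces considered here, the combined trace therefore has 25, 125 and 625 samples for $o = 1, 2, 3$, which is what drives the network complexity.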
Now that these simulations have validated the theoretical observations provided in Sec.3, we compare the benefits of the new cVAE-SA with the classical profiled side-channel attacks on real unprotected and protected implementations.

Experimental results
The experiments are implemented in Python using the Keras library and are run on a workstation equipped with 128GB RAM and an NVIDIA GTX1080Ti with 11GB memory. In the following section, the discriminative models are based on the CNN architectures provided by [ZBHV19] and then, a global benchmark is provided with other types of discriminative models (see Tab.4). For the generative models, the configurable hyperparameters, namely the batch size and the learning rate, are respectively chosen in $\{8, 16, 32, 64\}$ and $\{10^{-1}, 10^{-2}, 10^{-3}\}$. We train each model with a number of epochs in $\{10, 20, 30, 40, 50, 75, 100\}$ and select the value that provides the best rank. As mentioned in Sec.4, we denote by $\bar{N}_{t_{rank}}$ the average value of $N_{t_{rank}}$ over 10 shuffled experiments. In the following, we always capture the maximum amount of interactions (i.e. $F_9$). This choice was made because an evaluator does not have a priori knowledge on the bits' interactions$^n$. Finally, as suggested through the analysis of the KL-divergence loss (see Sec.3.3), the latent space dimension should be set according to the number of PoIs. As the goal of our paper is to provide a fair comparison with the state-of-the-art results, the same number of PoIs as in [KPH+19, BPS+20] will be considered.

l An alternative consists in applying a Bayes classification approach [OM06] in order to retrieve the targeted value.
m The readers must be aware that this observation cannot be generalized on all implementations and would benefit from further investigations.

Presentation of the datasets
We used three different datasets for our experiments. All the datasets correspond to implementations of the Advanced Encryption Standard (AES). The datasets offer a wide range of use cases: a high-SNR unprotected implementation on a smart card, a low-SNR unprotected implementation on an FPGA, and a low-SNR protected implementation with first-order masking.
• DPA contest-v4$^o$ is an AES software implementation with a first-order masking. Knowing the mask value, we can consider this implementation as unprotected and recover the secret key directly. In this experiment, we attack the first round Sbox operation. We identify each trace with the sensitive variable $Y = \mathrm{Sbox}[X[0] \oplus k^*] \oplus M$, where $M$ denotes the known mask and $X[0]$ the first byte of the plaintext.
• AES_HD$^p$ is an unprotected AES-128 implemented on FPGA. The attack targets the register writing in the last round such that the label of the $i$-th trace is $Y = \mathrm{Sbox}^{-1}[C[j] \oplus k^*] \oplus C[j']$, where $C[j]$ and $C[j']$ are two ciphertext bytes such that $j = 12$ and $j' = 8$.
• ASCAD-v1$^q$ is introduced in [BPS+20]. The target platform is an 8-bit AVR microcontroller (ATmega8515) where an AES-128 protected with a Boolean masking scheme is implemented. The targeted sensitive variable is the first round Sbox operation such that $Y = \mathrm{Sbox}[X[3] \oplus k^*]$. Currently, there are two versions of the ASCAD dataset. The distinction between these versions relies on the randomness of the secret key for the profiling traces. In particular, the ASCAD-v1-F version has a fixed secret key for the 50,000 profiling traces and the 10,000 attack traces. Each trace of this dataset is composed of 700 samples. On the other hand, the ASCAD-v1-R version has random keys for the 200,000 profiling traces and a fixed key for the 100,000 attack traces. In the ASCAD-v1-R version, each trace is composed of 1,400 samples.
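As an illustration of how such labels are computed, the sketch below builds the AES Sbox from its standard definition (inversion in GF($2^8$) followed by the affine transformation) and derives the ASCAD-v1 label $\mathrm{Sbox}[X[3] \oplus k]$ for a key-byte hypothesis $k$. The function names are ours; the datasets ship their own labels and metadata, so this is only a reference computation.

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """AES Sbox: multiplicative inverse followed by the affine transformation."""
    sbox = []
    for value in range(256):
        # Brute-force inverse; 0 has no inverse and is mapped to 0.
        inv = next((c for c in range(1, 256) if gf_mul(value, c) == 1), 0)
        out = inv
        for shift in range(1, 5):
            out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(out ^ 0x63)
    return sbox


SBOX = build_sbox()


def ascad_label(plaintext, key_byte):
    """Label of one trace for a given key-byte hypothesis (third plaintext byte)."""
    return SBOX[plaintext[3] ^ key_byte]


print(hex(SBOX[0x00]), hex(SBOX[0x53]))  # 0x63 0xed
print(hex(ascad_label([0x00, 0x00, 0x00, 0x53], 0x00)))  # 0xed
```

During the attack phase the same function is evaluated for each of the 256 hypotheses of the key byte.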
Remark 2. While this work bridges DL with SCA, no investigation has been conducted on the desynchronization effect. Indeed, as the Machine Learning community has already demonstrated the benefits of shift-invariant layers (e.g. convolutional layers) to mitigate the desynchronization effect [ITLW20], further theoretical investigations should be provided to clearly explain how those layers should be configured.

n In order to find the best trade-off between bits' interactions and the statistical model, one solution consists in evaluating the model quality of a linear regression model.

A comparison with state-of-the-art SCA
In this section, we evaluate the benefits of the cVAE-SA against the classical side-channel attacks (i.e. template attacks, stochastic attacks) by respecting the same experimental conditions as the state-of-the-art results.
DPA contest-v4. Once the cVAE-SA is trained, the evaluator can observe the coefficients related to each time sample as illustrated in Fig.4a. Through this visualization tool, the evaluator is able to identify the leakage model extracted by the cVAE-SA. In particular, it can be observed that the leakage model is only influenced by the bits of $Y$. Therefore, the bits' interactions do not have any impact on the leakages extracted from the DPA contest-v4 dataset. Once this analysis is conducted, the evaluator can select the time samples with the highest trainable parameters (i.e. $\Theta$ and $\phi$) and perform his attack on this subset. This post-selection is beneficial to reduce even more the impact of noisy samples (i.e. time samples where the related weights are close to 0) during the key-recovery phase. This new feature can be proposed and explained thanks to the interpretability of the cVAE-SA architecture (see Sec.3). For this dataset, we compute Eq.10 on the 50 time samples previously extracted. When a high-SNR unprotected implementation is considered, we observe that our generative model has the same performance as classical profiled side-channel attacks (see Tab.3). Hence, for this implementation, similar results can be obtained whatever the attack performed. Consequently, in this configuration, considering the cVAE-SA is equivalent to classical profiled side-channel attacks.
AES_HD. Through Fig.4b, it can be observed that the key-recovery phase is only impacted by the leakages related to some bits of $Y$. Accordingly, while the training process was performed on traces with 50 samples, the computation of Eq.10 was made on the 14 time samples complying with the configured restriction. This processing tremendously increases the performance of the resulting attack.
Indeed, the cVAE-SA model divides by 83 (resp. 15) the number of attack traces that are needed to perform a template attack (resp. stochastic attack).

ASCAD-v1.
As mentioned in Sec.4.4, we perform high-order attacks with the help of combining functions as preprocessing (i.e. product combining, optimal product combining, absolute difference combining) for both ASCAD-v1 datasets. Then, we profile the generative models on the unmasked value in order to extract the relevant information. In Tab.3, the optimal product combining provides the best performance on the ASCAD-v1-F and ASCAD-v1-R datasets. Through the experiment on the ASCAD-v1-F dataset, we observe that the cVAE-SA performs better than template or stochastic attacks. While 351 (resp. 290) attack traces are needed to reach a constant rank of 1 when the template attack (resp. stochastic attack) is considered, our generative model retrieves the secret key within 194 attack traces. The same observation holds for the ASCAD-v1-R dataset, where the cVAE-SA model retrieves the secret key within 250 attack traces. As previously mentioned for the AES_HD dataset, those results can be explained by the ability of the cVAE-SA to target a specific range of relevant combined time samples during the attack phase. For both ASCAD-v1 datasets, only the time samples with $\Theta$ and $\phi$ coefficients greater than 1 are kept for the key recovery phase. On the contrary, the classical profiled SCA have to consider the 64 time samples (i.e. 8 time samples related to the masks and the masked value) used to perform the related attacks. Hence, the resulting noisy time samples can highly influence the performance of these attacks. A detailed discussion on the ASCAD-v1-R dataset with explainability/interpretability results is provided in App.C.
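The post-selection of time samples can be sketched as a simple threshold on the learned coefficients. The text does not specify how the coefficients of one time sample are aggregated, so the rule below (largest absolute coefficient of both the encoder and the decoder above the threshold) is an assumption, as are the function name and the made-up coefficient values.

```python
def select_pois(theta, phi, threshold=1.0):
    """Indices of the time samples kept for the key recovery phase.

    theta[i] and phi[i] hold the basis coefficients learned for time sample i.
    """
    selected = []
    for i, (enc_coeffs, dec_coeffs) in enumerate(zip(theta, phi)):
        enc_peak = max(abs(c) for c in enc_coeffs)
        dec_peak = max(abs(c) for c in dec_coeffs)
        if enc_peak > threshold and dec_peak > threshold:
            selected.append(i)
    return selected


# Made-up coefficients for three time samples and two basis vectors.
theta = [[0.02, -0.01], [1.40, 0.30], [0.05, 0.90]]
phi = [[0.01, 0.03], [-1.25, 0.10], [0.02, 1.10]]
print(select_pois(theta, phi))  # [1]
```

The returned indices are the only ones on which the decision rule is then evaluated, whereas a classical profiled attack would keep all time samples used during profiling.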
In conclusion, when a classical profiled SCA is trained on $D$-dimensional traces, the evaluator has to perform the exploitation phase on the same trace dimension. Unfortunately, the evaluator does not know a priori which time samples are considered as relevant once the profiling phase is applied. Hence, performing the exploitation phase on the $D$-dimensional traces could be impacted by the uninformative time samples. On the other hand, once the profiling phase is performed on the $D$-dimensional traces, the cVAE-SA makes it possible to select a subset of $s$ time samples, such that $s \ll D$, in order to compute Eq.10 only on the informative time samples. Hence, this new proposition is more flexible than classical profiled SCA and results in a better attack perspective as we are less impacted by uninformative time samples. However, the evaluator can question the benefits of the generative approach with respect to discriminative models. The following section highlights the benefits and the limitations of both approaches in DLSCA.

A comparison with state-of-the-art DLSCA
When the discriminative approaches are considered, a major drawback can be highlighted regarding the architecture configuration. Indeed, the resulting models have a plethora of hyperparameters to tune. The more effort we spend on the hyperparameter tuning of the network architecture, the more efficient the resulting attack is expected to be. In addition, due to their black-box property, the discriminative models are difficult to interpret. However, the main benefit of this approach is that it automatically combines the points of interest to limit the masking effect. While the discriminative model considers all the samples of the trace, we focus the interest of the generative model only on the most relevant samples. As highlighted in Sec.4.3, increasing the number of irrelevant samples highly impacts the network complexity and the training time without altering the related performance.
DPA contest-v4. As illustrated in Fig.5a, if $N_v = 1$, two attack traces are needed to retrieve the secret key. However, a poor rank stabilization is observed. To rectify this point, increasing the $N_v$ value preserves a constant rank convergence towards 1.

AES_HD.
The same observation can be made when the AES_HD dataset is considered. Indeed, when the $N_v$ value increases, a rank stabilization is observed when the number of attack traces grows. In addition, Fig.5b highlights a better model when the generative approach is considered in comparison with the discriminative state-of-the-art result (see Tab.4). Indeed, for $N_v \in \{100; 1,000\}$, the resulting model converges towards a constant rank of 1 with 300 attack traces. Even if the discriminative approach directly estimates $\Pr[Y|T]$, the state-of-the-art results indicate a lower performance for most classical DLSCA models. As illustrated by Ng and Jordan [NJ02], this result suggests that a better discriminative model can be found on this dataset. Indeed, an optimal discriminative model should be at least as efficient as a generative approach. However, finding the best discriminative model can be difficult due to the broad hyperparameter selection. This result highlights the benefits of the cVAE-SA in comparison with the classical DLSCA models from an evaluation perspective as it provides a suitable security bound related to the targeted device.
ASCAD-v1. One benefit of the discriminative approach is to automatically recombine the points of interest. In contrast, the generative approach does not take advantage of this property (see Sec.4.4). Through Tab.4, we can visualize the benefits of automatically combining the points of interest. Indeed, the discriminative approach reaches better performance for both datasets (i.e. ASCAD-v1-F and ASCAD-v1-R). While the cVAE-SA retrieves the secret key within 194 traces (resp. 250 traces) with $N_v \in \{100; 1,000\}$ when ASCAD-v1-F (resp. ASCAD-v1-R) is considered, the best discriminative model recovers this sensitive variable within 87 traces (resp. 78 traces). This result could be explained by the ability of the discriminative models to find a custom combining function that maximizes the posterior probabilities $\Pr[Y|T]$. Hence, this custom unknown function can be more adapted to the targeted dataset. On the other hand, the cVAE-SA model is trained on combined traces that are constructed from classical approaches (i.e. optimal product combining). Consequently, when masking implementations are considered, a discriminative approach is beneficial to reduce the preprocessing phase and can provide better results than the cVAE-SA. However, applying the discriminative approach can be limited from an interpretation point of view. In addition, the discriminative approach requires a plethora of additional settings (i.e. architecture, activation function, weight initialization, etc.) that do not have to be considered when the cVAE-SA is constructed. Hence, the configuration of discriminative models can be an issue from a practical perspective.
The results provided in this section highlight the benefits and the limitations of the cVAE-SA against classical DLSCA models. In particular, Tab.4 lists the main models introduced in the DLSCA literature. In this benchmark, the cVAE-SA is the only model that follows the generative approach, and it performs similarly to, or even better than, classical DLSCA models. As suggested by Ng and Jordan [NJ02], the asymptotic error of the discriminative model is lower than or equal to the one related to the generative approach. Therefore, discriminative models should be at least as efficient as a cVAE-SA. However, as their construction phase is not deterministic, an irrelevant model can be designed to solve a given classification task, and the resulting performance can be worse than that of the cVAE-SA due to a poor approximation of the true unknown leakage model. This observation is confirmed by the results provided in Tab.4.
To conclude, for the SCA community, the cVAE-SA can be helpful to evaluate the feasibility of an attack and get a security bound of a device. However, while the configuration of such a neural network is simple in comparison to DLSCA models, this new proposition can perform worse under some conditions. Indeed, when masking countermeasures are implemented, an evaluator using generative models (e.g. cVAE-SA) may use sub-optimal combining functions, while a discriminative approach finds a custom unknown combining function which can be better adapted to the targeted implementation. In addition, while the cVAE-SA assumes that the true leakage distribution is Gaussian, a discriminative approach is not restricted to such an assumption. As a perspective, we suggest considering a hybrid approach combining the discriminative and the generative models in order to keep the explainability and the interpretability while preserving the benefits of the automatic recombination.

Discussion
Through this paper, we have demonstrated that a derivation of the conditional variational autoencoders (cVAE) can be considered in the side-channel context in order to perform physical attacks. From an evaluation perspective, this new neural network architecture is suitable as it respects the following requirements:
1. Theoretical similarities with classical profiled side-channel attacks - As illustrated in Sec.3, the cVAE can be configured to fit the stochastic attacks paradigm introduced by Schindler et al. [SLP05] and briefly recalled in Sec.2.3. From the evaluator's point of view, this approach is useful to ease the configuration of the neural network and get a clear overview of the decision-making process. Indeed, as the cVAE-SA is designed on a well-known theoretical attack strategy, the evaluator can be confident in the employed neural network structure and thus expects to get a resulting predictive model as efficient as classical profiled side-channel attacks, namely template attacks [CRR03] and stochastic attacks [SLP05]. Based on the solution provided in [Mag20], the cVAE-SA can be easily adapted to deal with the multi-task learning process that consists in targeting multiple sensitive variables simultaneously. From this new bridge, the evaluator can better understand the future improvements that can be provided in the DLSCA field in order to fully exploit the automation process proposed by the ML/DL community.
2. Explainability & Interpretability - One major benefit of the proposed cVAE-SA is to preserve the interpretability and the explainability of the results provided by the learning algorithm. As our contribution is constructed from the classical profiled side-channel attacks, the evaluator can adapt existing interpretation tools (e.g. visualization) in order to explain the results provided by the model in depth. As suggested in Sec.4.2, in App.A and in App.B, the evaluator can visualize the trainable parameters of the conditional variational autoencoder in order to assess the ability of the encoder and the decoder to retrieve a hypothetical leakage model as similar as possible to the true unknown ψ. Once the evaluator retrieves an approximation of ψ, he can highlight the security flaws induced by the targeted implementation and thus alert the developer to potential vulnerabilities and ease the development of countermeasures. An example on the ASCAD-v1-R dataset is provided in App.C.
3. Hiding countermeasures - Even if this paper does not assess the robustness of the cVAE-SA against the desynchronization effect, an intuitive solution consists in adding convolutional layers to the encoder [ITLW20]. However, it should be validated in practice. While this intuition could be a suitable solution to mitigate the desynchronization effect, it also helps the network to automatically select the points of interest and prevent the effect of uninformative time samples. Indeed, as defined in Sec.3.3 and in Sec.3.4, the empirical risk as well as the decision process are only affected by the points of interest. Hence, this dimensionality reduction technique can also be useful to quadratically reduce the network complexity. However, to maintain the interpretability/explainability result provided by the cVAE-SA proposal, further investigations should clarify how those convolutional layers should be designed to fit with the side-channel state-of-the-art result (e.g. [BGH + 15]).
However, the cVAE-SA also has some limitations that are listed below:
1. Combining function - As the generative approach captures the conditional distribution Pr[T|Y ], it cannot handle masked implementations as the targeted unmasked sensitive variable Y is not directly observable through the leakage trace T. Thus, the evaluator has to consider combining functions in order to reveal the dependence between T and Y . Unfortunately, this implies a preprocessing phase which is not necessarily optimal from an attack perspective. Indeed, as this combining function is not automatically learned by the generative model (contrary to the discriminative approach), the evaluator may not converge towards the optimal statistical model defined in [HRG14].
2. Performance - While the goal of a side-channel attack is to optimize a learning algorithm which approximates Pr[Y |T] in order to discriminate a sensitive variable Y from a set Y, the application of generative models can be considered as suboptimal [NJ02]. In particular, the cVAE-SA assumes that the leakage noise follows a Gaussian law, which is not the case for classical DLSCA discriminative models [MPP16, CDP17a, CCC + 19, ZBHV19, BPS + 20, MS21].
However, regarding the latter issue, the experiment provided in Sec.5.3 illustrates that similar (or even better) performance results can be obtained regardless of the approach (i.e. discriminative vs. generative) considered. Indeed, the currently available public datasets seem too easy to target (i.e. the number of traces needed to retrieve the secret key is low) to fully assess the performance gain of a discriminative approach over a generative one. A slight performance gain of a few traces, or even dozens of traces, cannot be considered as a huge improvement, and the benefits of each new DLSCA tool can be difficult to interpret. Through this analysis, we highlight the theoretical benefits/limitations of using the cVAE-SA and we define this solution as a concrete and generic alternative to the classical discriminative DLSCA models. Consequently, using the generative approach can give a first insight about the security bound of the targeted system.

Conclusion
This paper proposes to reduce the gap between historical SCA (i.e. generative models) and classical DLSCA (i.e. discriminative models). To that end, we introduce the first DLSCA model based on a generative approach. From the stochastic attack introduced by Schindler et al. [SLP05], we first design an explainable and interpretable architecture that aims at retrieving the real unknown leakage model. Based on stochastic attack modeling, this new model can be easily constructed whatever the implementation an evaluator has to deal with. Furthermore, this analogy helps us to define theoretical bounds on the network complexity (e.g. number of trainable parameters) as well as to identify shared issues and perspectives (e.g. dimensionality reduction, multi-task learning). Then, we theoretically explain the impact of each individual loss in SCA: the reconstruction loss penalizes the model in order to estimate a trace as similar as possible to the real one. On the other hand, we demonstrate that the KL-divergence loss is beneficial to correctly estimate the latent space. Compared with historical profiled SCA, the cVAE-SA is beneficial by providing the ability to carefully select the samples the evaluator wants to focus on during the exploitation phase. Hence, by providing a more flexible generative approach, we drastically reduce the impact of uninformative samples on the attack performance. This observation was confirmed on a real case study.
To bridge the gap between generative and discriminative approaches, we conduct experiments on simulations and public datasets on a wide range of use cases and observe that the generative approach does not perform worse than a discriminative one. As suggested by Ng and Jordan [NJ02], the discriminative models should be at least as efficient as the generative ones. However, as their construction phase is time consuming and not deterministic, an irrelevant model can be designed by an evaluator, and the resulting performance can be worse than that of the cVAE-SA due to a poor approximation of the true unknown expected leakage model. Therefore, considering the cVAE-SA is a good starting point to define a security bound related to the targeted device. On the other hand, using the discriminative approach seems beneficial when masked implementations are targeted because a more appropriate unknown combining function can be automatically retrieved by the related model. This solution cannot be considered with generative models. Thus, depending on the time he wants to spend on the construction phase, the evaluator has to select the best way to mount his supervised attacks.
All these results suggest several directions for future work. First, as the cVAE-SA is derived from the Stochastic Attacks, further investigations can be conducted on this new model in order to extend the work provided by Choudary et al. [CK15] which consists in performing profiled attacks beyond 8 bits. Then, as the generative approach aims at approximating a joint distribution between two random variables, we have to assess the suitability of using the cVAE-SA as a model to perform non-profiled SCA, or even more generally, blind SCA. In addition, while the limitations of the discriminative (resp. generative) approach seem solved by the generative (resp. discriminative) approach, one solution could be to consider a hybrid model that preserves the automatic sample recombination property (i.e. discriminative approach) while keeping the explainability/interpretability and reducing the hyperparameter selection (i.e. generative approach). Finally, as the discriminative approach does not make any assumption on the noise distribution, its application is more generic. A solution to enhance the cVAE-SA approach consists in configuring other latent spaces, possibly more generic than the Gaussian one proposed in this paper. Those suggestions can be considered as an additional step towards the use of generative machine learning models in the side-channel context.

[...] a set of 2-dimensional leakage traces is simulated such that the leakage model does not induce interactions between bits. In detail, the i th time sample of the simulated trace T is defined as follows: [...] Based on those three leakage distributions, we want to illustrate the ability of the cVAE-SA to capture the mutual dependency between the leakage traces and the targeted variable Y . Before the application of the cVAE-SA (see the left plot in Fig.6), it can be observed that, depending on the informative value, an evaluator can retrieve some information regarding the label processed.
Consequently, as the cVAE-SA constructs synthetic traces that should be similar to the input, the output leakage distribution should be similar to the one before the application of the encoder, if the cVAE-SA is well trained. Based on the leakage trace T, the cVAE-SA isolates the deterministic part, i.e. ψ, from the noise Z (see Sec.3.2). In particular, if the encoder is well configured during the training process, the ψ Θ layer approximates the real unknown ψ function. One solution to empirically verify such an approximation is to visualize the weights that compose the ψ Θ layer (see Sec.4.2). Therefore, the latent space representation should be only characterized by the noise part Z if the cVAE-SA is well trained. This observation is validated by the middle plot in Fig.6. As no distinction can be made regarding the label value, it can be assumed that the latent space behaves similarly whatever the underlying secret value. This empirical result confirms the theoretical ones provided in Sec.3.2 and Sec.3.3 (see the analysis related to the KL-divergence loss). Finally, an evaluator constructs a new set of synthetic traces based on the latent space and the ψ φ layer. This construction is performed by the decoder and the resulting distributions are illustrated in the right plot of Fig.6. Through this figure, it can be observed that the cVAE-SA discriminates each label following the mean of the related conditional distribution. This confirms the ability of the model to identify the mutual dependency between the initial leakage traces and the targeted variable Y . If the cVAE-SA is perfectly trained, the output distributions should be similar to those provided as input. In Fig.6, this statement can be confirmed. First, through the visualization of the latent space, the evaluator can assess that the noise part of T is well approximated.
Indeed, as the informative and non-informative samples look similar, it can be assumed that the latent representation of the cVAE-SA is successfully trained. Then, through the visualization of the synthetic leakage traces, the evaluator observes that the informative sample introduces relevant information regarding the targeted variable (i.e. ψ φ (Y )). In particular, thanks to Eq.4, it can be assumed that the synthetic traces follow the Gaussian distribution N D (µT, ΣT) such that µT = ψ φ and ΣT = Σ V,Θ . Therefore, when the evaluator conducts a key recovery phase, he exploits the first-order moment in order to recover information about the secret key. This confirms the statement provided in Sec.3.4. However, if F 9 is configured to approximate the leakage model when Eq.13 is considered, the model quality can be badly impacted by a poor estimation of the high-degree bit interactions [MOW17]. A cVAE-SA inducing a better quality model should consider F 2 , which alleviates the leakage model complexity by setting aside bit interactions. This statement is illustrated in Fig.7. No large differences can be highlighted between Fig.6 and Fig.7. However, using such a visualization tool can give a first insight to the evaluator in order to assess the impact of a poor basis choice. Before performing the key recovery phase introduced in Sec.3.4, an evaluator may want to assess the suitability of the cVAE-SA training process. In Fig.8, we visualize all the distributions related to the latent space and the synthetic traces. While the latent space suggests a good approximation of the noise part, the distributions related to the synthetic leakage traces illustrate that the informative sample does not perform any discrimination regarding the targeted variable Y . This observation is consistent with Sec.3.4, which defines the key recovery phase as successful if the first-order moment of the synthetic leakage traces depends on Y .
Consequently, plotting the distributions of each part of the cVAE-SA (i.e. input data, latent space, output data) can be beneficial to have a better understanding of the model quality as well as the impact of the leakage model against the noise.
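The distribution check described in this appendix can be sketched numerically. The snippet below is a minimal illustration and not the cVAE-SA itself: it assumes a hypothetical linear leakage model without bit interactions (the `weights` values are made up), and it replaces the trained encoder by an ideal one that subtracts the exact deterministic part ψ(Y). Under those assumptions, the per-label means of the input sample spread widely, while the "latent" residual behaves like the uninformative sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n_traces, n_bits, sigma = 10_000, 8, 1.0

# Hypothetical per-bit weights of the informative sample (no bit interaction).
weights = np.array([1.0, 0.5, 0.0, 0.8, 0.0, 0.3, 0.9, 0.0])

y = rng.integers(0, 2**n_bits, size=n_traces)            # targeted variable Y
bits = (y[:, None] >> np.arange(n_bits)) & 1              # bit decomposition of Y
psi = bits @ weights                                      # deterministic part psi(Y)
informative = psi + rng.normal(0.0, sigma, size=n_traces)
uninformative = rng.normal(0.0, sigma, size=n_traces)     # pure-noise sample

# Idealized encoder (assumption): the latent value is the trace minus psi(Y).
latent = informative - psi


def label_mean_spread(values, labels):
    """Standard deviation of the per-label means (small means label-independent)."""
    means = [values[labels == v].mean() for v in np.unique(labels)]
    return float(np.std(means))


spread_input = label_mean_spread(informative, y)
spread_latent = label_mean_spread(latent, y)
spread_noise = label_mean_spread(uninformative, y)
print(spread_input, spread_latent, spread_noise)
```

The residual spread of the latent and uninformative samples is not exactly zero because each label only gathers a few dozen traces; it shrinks as more traces are simulated.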

B Impact of the Noise on Leakage Model Estimation
To verify the benefits of the cVAE-SA, we simulate 10,000 D-dimensional traces from an 8-bit sensitive variable Y and assess the ability of this new architecture to extract leakage models. As mentioned in Sec.4, the weight visualization is a suitable tool to identify the leakage model extracted by the cVAE-SA. Therefore, an evaluator is able to retrieve the impact of each bit independently as well as all the bit interactions. This tool has been confirmed on different use cases in Sec.5.2. In this appendix, some simulated traces are built following three scenarios with different amounts of noise:
• Scenario 1 - We assume that each leakage trace is composed of 3 time samples such that the leakage model induces the maximum amount of interactions between bits (i.e. F 9 ). In this scenario, all bits influencing the leakage model have the same weights. Hence, the i th time sample of the simulated trace T is defined as follows: [...] denotes the b th bit of the output of the Sbox, and Z[i] is a Gaussian noise following N (0, σ 2 ) such that σ 2 ∈ {0.1, 1, 10}.
• Scenario 2 - We assume that each leakage trace is composed of 4 time samples. The leakage model does not induce interactions between bits but differs depending on the location of the points of interest. Hence, the i th time sample of the simulated trace T is defined as follows: [...] and Z[i] is a Gaussian noise following N (0, σ 2 ) such that σ 2 ∈ {0.1, 1, 10}.
• Scenario 3 - We assume that each leakage trace is composed of 3 time samples. The leakage model does not induce interactions between bits, and all bits influencing the leakage model have different weights. Hence, the i th time sample of the simulated trace T is defined as follows: [...] and Z[i] is a Gaussian noise following N (0, σ 2 ) such that σ 2 ∈ {0.1, 1, 10}.
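To make the simulation setting concrete, the following sketch generates traces in the spirit of Scenario 3 (no bit interaction, a different weight per bit, Gaussian noise with σ² ∈ {0.1, 1, 10}) and estimates the resulting SNR per time sample. The exact leakage equations are not reproduced in this text, so the weight values and the helper names `simulate_traces` and `snr` are assumptions made for illustration only.

```python
import numpy as np


def simulate_traces(n_traces, weights, sigma2, rng):
    """Simulate T[i] = sum_b weights[i, b] * Y[b] + Z[i], with Z[i] ~ N(0, sigma2)."""
    n_samples, n_bits = weights.shape
    y = rng.integers(0, 2**n_bits, size=n_traces)
    bits = (y[:, None] >> np.arange(n_bits)) & 1
    noise = rng.normal(0.0, np.sqrt(sigma2), size=(n_traces, n_samples))
    return y, bits @ weights.T + noise


def snr(traces, labels):
    """Per-sample SNR: variance of the per-label means over the mean per-label variance.

    This plain estimator is slightly biased upwards when few traces share a label.
    """
    classes = np.unique(labels)
    means = np.array([traces[labels == c].mean(axis=0) for c in classes])
    variances = np.array([traces[labels == c].var(axis=0) for c in classes])
    return means.var(axis=0) / variances.mean(axis=0)


rng = np.random.default_rng(0)
# Hypothetical weights: 3 time samples, 8 bits, one distinct weight per bit.
weights = np.tile(np.array([0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6]), (3, 1))

results = {}
for sigma2 in (0.1, 1.0, 10.0):
    labels, traces = simulate_traces(10_000, weights, sigma2, rng)
    results[sigma2] = snr(traces, labels)
    print(sigma2, results[sigma2])
```

With these made-up weights the SNR values differ from those reported in the figure captions; only the trend (SNR decreasing as σ² grows) is meant to carry over.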
Based on the results obtained in Fig.9, Fig.10 and Fig.11, it can be observed that the cVAE-SA retrieves all the leakage models when a moderate SNR level is considered (i.e. SNR ≥ 10^-1). Hence, the cVAE-SA can be used to evaluate the security flaws when large bit interactions are observed, when the deterministic part differs between PoIs and when a non-uniform weight distribution occurs between bits. As a consequence, a wide range of use cases can be considered when the cVAE-SA is applied. However, if a low SNR level is defined, the extraction of the leakage model becomes more difficult. Indeed, while an evaluator retrieves some information on the leakage model related to Scenario 1 (e.g. the highest peaks of degree 1 indicate that the bits Y [1], Y [3] and Y [6] influence the leakage model, see Fig.9c), this interpretation can be more difficult when the SNR is lower (see Fig.10c and Fig.11c). This result is in accordance with the theoretical ones introduced in Sec.3.3, which suggest that increasing the noise in the traces makes the extraction of the deterministic part more difficult. One solution to mitigate this lack of leakage characterization consists in acquiring a larger number of traces [DSVC14,MCHS22] in order to find the trainable weights Θ and φ that best fit the true unknown leakage model.

Figure 9: Weight visualization of the ψ Θ layer (encoder) and the ψ φ layer (decoder) for Scenario 1. Panel (c): σ 2 = 10 / SNR = 0.0577.

Figure 11: Weight visualization of the ψ Θ layer (encoder) and the ψ φ layer (decoder) for Scenario 3. Panel (c): σ 2 = 10 / SNR = 0.0262.
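The effect of the number of traces on leakage model estimation can be illustrated with a small experiment. The sketch below does not train a cVAE-SA: as an assumption, it uses the ordinary least-squares fit of classical stochastic attacks on a degree-1 bit basis, with made-up weights and σ² = 10. It only shows that, at a fixed noise level, the per-bit coefficients are recovered more accurately when more traces are available.

```python
import numpy as np


def fit_bit_coefficients(traces, labels, n_bits=8):
    """Least-squares fit of one time sample on the basis [1, Y[0], ..., Y[n_bits-1]]."""
    bits = (labels[:, None] >> np.arange(n_bits)) & 1
    design = np.column_stack([np.ones(len(labels)), bits])
    coeffs, *_ = np.linalg.lstsq(design, traces, rcond=None)
    return coeffs  # coeffs[0] is the offset, coeffs[1:] are the per-bit weights


rng = np.random.default_rng(0)
true_weights = np.array([0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6])  # hypothetical
sigma2 = 10.0  # highest noise level of the appendix

errors = {}
for n_traces in (1_000, 100_000):
    labels = rng.integers(0, 256, size=n_traces)
    bits = (labels[:, None] >> np.arange(8)) & 1
    traces = bits @ true_weights + rng.normal(0.0, np.sqrt(sigma2), size=n_traces)
    estimate = fit_bit_coefficients(traces, labels)[1:]
    # Largest absolute deviation between estimated and true per-bit weights.
    errors[n_traces] = float(np.max(np.abs(estimate - true_weights)))
    print(n_traces, errors[n_traces])
```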

C Explainability on ASCAD-v1
Through this section, we assess the benefits of the cVAE-SA to better explain and interpret the decision-making of this new statistical model. As mentioned in Sec.2.1, the interpretation refers to the ability of the evaluator to clearly identify each operation induced in the model in order to exploit the sensitive information. This includes the construction of a statistical model, namely cVAE-SA, where the extraction of the leakage model related to each PoI is fully explainable in order to identify security flaws. Through this section, a focus is proposed on the ASCAD-v1-R dataset which is introduced in Sec.5.1. This choice has been motivated because it can be considered as the most challenging targeted dataset (i.e. protected implementation with first-order masking). Due to the implemented countermeasure, two scenarios can be considered. The first one suggests that the evaluator wants to target independently the mask r 3 and the masked values with r 3 (see [BPS + 20] for deeper details on the implementation). This approach is beneficial to assess the robustness of the targeted implementation and identify the security flaws without any preprocessing phase. This scenario will be denoted as the naive approach. The second solution consists in combining the time samples related to the mask r 3 and the masked values with r 3 in order to target the unmasked values (see Sec.4.4). This latter solution is beneficial to identify the dependence generated by the combining function between the unmasked variable and a set of traces. This scenario will be denoted as the combining approach. 
To address the explainability and interpretability issue, this section will be decomposed into three parts: the construction of the cVAE-SA based on the theoretical results described in Sec.3.2, the detection and the extraction of the leakage models once the cVAE-SA is trained, and, the ability of the cVAE-SA to correctly characterize the first-order moment of the traces related to the ASCAD-v1-R dataset.
Model construction. As introduced in Sec.3.2, the cVAE-SA structure is adapted to capture the dependencies between a leakage trace T ∈ R D and a label Y ∈ F n 2 . Therefore, the cVAE-SA can be used to capture how the mask r 3 and the masked values with r 3 influence the physical trace T when the naive approach is considered. The evaluator can construct two distinct cVAE-SA models (i.e. one for r 3 and one for the masked values) based on the recommendations defined in Sec.3.2. To construct the cVAE-SA architecture, the same configuration as in Sec.5.2 is considered. Indeed, we select the 8 most relevant samples related to the mask r 3 and the masked values with r 3 and then construct each cVAE-SA model. The only difference between those models relies on the input provided to the related cVAE-SA model, namely the trace T and the orthonormal monomial basis used. As mentioned in Sec.3.4, the network complexity can be defined following the number of samples s included in the traces, the degree of interaction d between the time samples and the dimension n of the targeted variable such that it equals (2s · ((s + 1) [...]. Through this section, we define s = 8, d = 8 and n = 8. Thus, the complexity of the cVAE-SA model targeting the mask r 3 , or the masked values, equals 4,256. This is confirmed in Tab.5. Another solution consists in constructing a single cVAE-SA model by considering the concatenation of the orthonormal monomial basis of r 3 and the one related to the masked value with r 3 as label Y . This configuration can be defined as a multi-task learning strategy and has already been studied in Sec.4.2. Therefore, this section will only focus on the construction of two distinct cVAE-SA models.
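The closed-form complexity expression is truncated in this text, so it cannot be quoted here. As a sanity check only, the sketch below evaluates one candidate expression, 2s · ((s + 1) + M + 1), where M is the number of monomials of degree at most d over n bits. This candidate is an assumption chosen because it starts with the same terms as the truncated expression and reproduces both parameter counts reported in this appendix (4,256 for s = 8 and 41,216 for s = 64); the expression actually given in Sec.3.4 may be organized differently.

```python
from math import comb


def n_monomials(n, d):
    """Number of monomials of degree at most d over n bits (constant term included)."""
    return sum(comb(n, i) for i in range(d + 1))


def hypothesized_param_count(s, d, n):
    """ASSUMPTION: candidate expression matching the reported counts, not the paper's formula."""
    return 2 * s * ((s + 1) + n_monomials(n, d) + 1)


count_naive = hypothesized_param_count(s=8, d=8, n=8)       # naive approach, 8 samples
count_combined = hypothesized_param_count(s=64, d=8, n=8)   # combining approach, 64 samples
print(count_naive, count_combined)
```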
For the combining approach, the time samples related to the mask r 3 have to be combined with those related to the masked values in order to create a dependency between the trace T and the targeted unmasked variable. Therefore, a preprocessing step is needed where the evaluator has to choose a combining function among the solutions introduced in the state-of-the-art [CJRR99,Mes00,PRB09]. Once this preprocessing is conducted, the evaluator constructs the cVAE-SA model (see Sec.3.2) such that the inputs of the encoder are defined by the combined traces and the orthonormal monomial basis related to the targeted unmasked variable. For this approach, the number of time samples to target equals 64. Therefore, by applying the same proposition as previously, the evaluator can easily configure the cVAE-SA with a complexity of 41,216 trainable parameters. All the information related to each cVAE-SA model is provided in Tab.5.
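The preprocessing step can be sketched with the centered product, one classical combining function for first-order masking. The simulation below is an illustration under explicit assumptions: both shares leak their Hamming weight with Gaussian noise, each share is observed on 8 time samples, and every mask sample is paired with every masked-value sample, which yields 8 × 8 = 64 combined samples (consistent with the 64 samples mentioned above, although the actual pairing used for ASCAD-v1-R is not detailed here).

```python
import numpy as np

rng = np.random.default_rng(1)
n_traces, n_poi, sigma = 20_000, 8, 0.5


def hamming_weight(values):
    """Hamming weight of each 8-bit value."""
    return ((values[:, None] >> np.arange(8)) & 1).sum(axis=1)


def correlation(target, columns):
    """Pearson correlation between a target vector and each column of a matrix."""
    t = (target - target.mean()) / target.std()
    c = (columns - columns.mean(axis=0)) / columns.std(axis=0)
    return (c * t[:, None]).mean(axis=0)


secret = rng.integers(0, 256, size=n_traces)   # unmasked sensitive variable
mask = rng.integers(0, 256, size=n_traces)     # random mask (plays the role of r3)
masked = secret ^ mask                         # masked value

# Hypothetical leakage: each share leaks its Hamming weight on 8 noisy samples.
leak_mask = hamming_weight(mask)[:, None] + rng.normal(0, sigma, (n_traces, n_poi))
leak_masked = hamming_weight(masked)[:, None] + rng.normal(0, sigma, (n_traces, n_poi))

# Centered product of every (mask sample, masked sample) pair: 8 x 8 = 64 samples.
centered_mask = leak_mask - leak_mask.mean(axis=0)
centered_masked = leak_masked - leak_masked.mean(axis=0)
combined = (centered_mask[:, :, None] * centered_masked[:, None, :]).reshape(n_traces, -1)

hw_secret = hamming_weight(secret)
corr_single = correlation(hw_secret, leak_mask)   # close to 0: one share alone is independent
corr_combined = correlation(hw_secret, combined)  # clearly negative for all 64 samples
print(np.abs(corr_single).max(), corr_combined.mean())
```

A single share carries no first-order information on the unmasked value, whereas every combined sample is correlated with it; this is the dependency that the generative model is then trained on.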

Leakage model extraction.
Once all the cVAE-SA models are trained, the evaluators can take advantage of the explainability property of this new contribution to get a better insight into the exploited leakage models. As mentioned in Sec.4.2, the evaluator can visualize the trainable parameters Θ (resp. φ) that compose the ψ Θ (resp. ψ φ ) layer of the encoder (resp. the decoder) in order to detect which part of the targeted variable leaks. In other words, this analysis is helpful to identify the bits, and the interactions, that influence the physical consumption of the targeted implementation. This is highly beneficial to explain the information that is extracted by the cVAE-SA. Through Fig.12, it can be observed that depending on the targeted variable (i.e. the mask r 3 , the masked values or the unmasked values), the leakage model as well as the coefficient values differ. Indeed, the highest absolute value is observed when the cVAE-SA targets the mask r 3 while the lowest absolute value is observed for the masked variable. Therefore, the approximation of the first-order moment varies depending on the targeted variable.
When the mask r 3 is considered, the visualization in Fig.12a helps to recover the leakage model extracted by the cVAE-SA. A first observation is that all the time samples have a similar leakage model. Even if the coefficient values related to each bit of r 3 differ, the same bits leak for all PoIs. Through Fig.12a, the evaluator can retrieve which bits influence the physical trace by highlighting the ones with discriminative coefficients Θ and φ. For the mask r 3 , the following bits {0, 1, 2, 3, 4, 6, 7} have a coefficient that differs from the non-informative interaction, i.e. when Θ and φ are greater than 1. This analysis allows the evaluator to define an ascending leakage order to highlight the bit which leaks the most (in absolute value). In this configuration, the following order is observed: r 3 [6] < r 3 [7] < r 3 [4] < r 3 [2] < r 3 [3] < r 3 [0] < r 3 [1] such that r 3 [i] denotes the (i + 1) th bit of r 3 . Therefore, the bit that leaks the most is r 3 [1]. The same process can be conducted for the masked variable (see Fig.12b) and the unmasked variable (see Fig.12c). When the masked value with r 3 is targeted, the leakage model approximated by the encoder and the decoder identifies the 1 st bit and the 5 th bit as the only sources of information that can be extracted from cVAE-SA msked . All the PoIs share the same leaking bits. Finally, once the optimal recombination is conducted, the leakage model that is retrieved by the cVAE-SA unmsked network is influenced by the following bits {0, 1, 2, 3, 4} such that the following leakage order is observed (in absolute value): [...] Those results are beneficial for the evaluator to explain and interpret the decision-making process of each cVAE-SA model. If this basis characterizes the switching activity of the circuit, this analysis highlights specific exploitable security flaws in ASCAD-v1-R.
None of these observations can be made when classical (discriminative) DLSCA models are considered. Those interpretable results can only be provided because the cVAE-SA is designed from the fully explainable stochastic attack [SLP05].

Model quality. Once the leakage models extracted by the cVAE-SA are fully interpreted, the evaluator may wonder if additional information could be extracted. To conduct such a verification, the evaluator can plot the evolution of the distribution over the cVAE-SA (see Sec.A) to assess the estimation of the first-order moment that is required to retrieve the secret key (see Sec.3.4). Therefore, he can visualize if the input and the output distributions of the cVAE-SA are similar. If this result is positive, the estimation of the first-order moment induced in the cVAE-SA can be considered as effective. Otherwise, some refinements can be made to the hyperparameter values (e.g. learning rate, batch size, number of epochs). While App.A proposes to observe the distribution related to the leakage traces, the latent space and the synthetic leakage traces, an evaluator only requires the distribution of the latent space and the estimation of the leakage model of the encoder and the decoder to assess if the cVAE-SA correctly approximates the first-order moment induced in the leakage traces. Indeed, following Sec.3.2 and Sec.3.3, we can note that the latent space should be representative of the noise distribution such that it is forced to follow N D (0, I D ) by the KL-divergence loss. Therefore, if the encoder of the cVAE-SA model is correctly trained, the latent representation does not depend on the deterministic part of the leakage trace. Similar latent representations should be obtained whatever the targeted variable. This observation is confirmed in Fig.13 where the visualization of the distribution is proposed on two time samples in order to ease the readability of the experimental results.
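The role of the KL-divergence term can be made explicit with the standard closed form used in variational autoencoders. The sketch below assumes a latent posterior with diagonal covariance, parameterized by a mean `mu` and a log-variance `log_var`, and a mean-squared-error reconstruction term; these names and this parameterization are assumptions for illustration, and the exact losses of the cVAE-SA are those defined in Sec.3.3.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)


def reconstruction_loss(trace, synthetic):
    """Mean squared error between an input trace and its reconstruction."""
    return np.mean((trace - synthetic) ** 2, axis=-1)


# The KL term vanishes only when the latent posterior is exactly N(0, I).
kl_matched = kl_to_standard_normal(np.zeros((1, 2)), np.zeros((1, 2)))
# A latent mean that still depends on the label (here shifted by 1) is penalized.
kl_shifted = kl_to_standard_normal(np.array([[1.0, 0.0]]), np.zeros((1, 2)))
print(kl_matched, kl_shifted)
```

Minimizing this term pushes the latent mean towards 0 and the latent variance towards 1 for every label, which is why a label-dependent latent space indicates an encoder that has not removed the deterministic part.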
Therefore, the encoder is effectively trained and the related trainable parameters, namely Θ, correctly retrieve the targeted unknown leakage model. Then, via the visualization of the trainable parameters φ in Fig.12, it can be observed that the decoder approximates the targeted unknown leakage model because it is similar to the one extracted by the encoder. The extraction of the leakage model is consequently effective for both samples analyzed in Fig.13. The analysis of the model quality has also been conducted on other time samples in order to verify the extraction of each leakage model. The obtained results help us to verify that the cVAE-SA effectively extracts the first-order moment that is needed to retrieve the secret key. As mentioned in Sec.3.4, if the latent space distribution follows N D (0, I D ), the first-order moment is only defined by the extracted leakage model ψ φ . The justification provided in this section helps the evaluator to validate the suitability of the training process as well as justify the ability of an adversary to extract the secret information (see Sec.5).
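To illustrate why a correctly estimated first-order moment is sufficient for key recovery, the sketch below ranks key hypotheses with a Gaussian log-likelihood whose mean is given by a leakage model and whose covariance is the identity. It is a generic profiled-attack sketch under stated assumptions and not the exact procedure of Sec.3.4: `SBOX` is a random permutation standing in for a real S-box, the leakage weights are made up, and the leakage model is assumed to be perfectly known rather than learned.

```python
import numpy as np

rng = np.random.default_rng(2)

SBOX = rng.permutation(256)  # hypothetical stand-in, NOT the AES S-box
weights = np.array([0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6])  # hypothetical model


def psi(values):
    """First-order moment of the leakage: weighted sum of the bits of each value."""
    bits = (values[..., None] >> np.arange(8)) & 1
    return bits @ weights


true_key, n_attack, sigma = 0x3C, 300, 1.0
plaintexts = rng.integers(0, 256, size=n_attack)
traces = psi(SBOX[plaintexts ^ true_key]) + rng.normal(0.0, sigma, size=n_attack)

# Predicted mean for every key hypothesis: array of shape (256, n_attack).
keys = np.arange(256)
predicted = psi(SBOX[plaintexts[None, :] ^ keys[:, None]])

# Gaussian log-likelihood up to a constant, accumulated over the attack traces.
scores = -0.5 * np.sum((traces[None, :] - predicted) ** 2, axis=1) / sigma**2
best_key = int(np.argmax(scores))
key_rank = int(np.sum(scores > scores[true_key]))  # 0 means the true key is ranked first
print(hex(best_key), key_rank)
```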
Based on the ASCAD-v1-R dataset, we identify the benefits of using the cVAE-SA model in order to enhance the explainability and the interpretability of the neural network in the DLSCA field. Through this study, we demonstrate that the construction of cVAE-SA models can be easily conducted whatever the targeted variable. This is highly beneficial from an evaluation perspective because the time for the hyperparameters search becomes negligible. Then, because all the operations induced in the cVAE-SA are known (see Sec.3.2), the evaluator can take advantage of this benefit in order to identify which leakage model is extracted in each PoI of the traces. Based on each leakage model, he can identify the security flaws induced in the circuit if the chosen basis characterizes the switching activity. Finally, because the trainable parameters of the cVAE-SA model are interpretable, the evaluator can assess the suitability of the training process by identifying if the first-order moment of the synthetic traces is representative of the first-order moment of the true leakage traces. All those observations cannot be conducted with classical DLSCA models due to the black-box property of the discriminative approach.