Cross-Device Profiled Side-Channel Attack with Unsupervised Domain Adaptation

Deep learning (DL)-based techniques have recently proven to be very successful when applied to profiled side-channel attacks (SCA). In a real-world profiled SCA scenario, attackers gain knowledge about the target device by getting access to a similar device prior to the attack. However, most state-of-the-art literature performs only proof-of-concept attacks, where the traces intended for profiling and attacking are acquired consecutively on the same fully-controlled device. This paper recalls that even a small discrepancy between the profiling and attack traces (regarded as domain discrepancy) can cause a successful single-device attack to fail completely. To address the issue of domain discrepancy, we propose a Cross-Device Profiled Attack (CDPA), which introduces an additional fine-tuning phase after establishing a pre-trained model. The fine-tuning phase is designed to adjust the pre-trained network such that it can learn a hidden representation that is not only discriminative but also domain-invariant. In order to obtain domain-invariance, we adopt a maximum mean discrepancy (MMD) loss as a constraint term of the classic cross-entropy loss function. We show that the MMD loss can be easily calculated and embedded in a standard convolutional neural network. We evaluate our strategy on both publicly available datasets and multiple devices (eight Atmel XMEGA 8-bit microcontrollers and three SAKURA-G evaluation boards). The results demonstrate that CDPA can improve the performance of the classic DL-based SCA by orders of magnitude, significantly reducing the impact of domain discrepancy caused by different devices.


Overview
Side-channel attacks (SCAs) have drawn a significant amount of attention since Kocher proposed the timing attack [Koc96]. They aim at retrieving the secret values of cryptographic algorithms from a device or a system through the measurement and analysis of physical information.
Among all kinds of SCAs, profiled attacks play an essential role. They are considered among the most powerful SCAs, at least from an information-theoretic point of view [CRR02]. In such a context, the attacker is able to characterize the device leakage by means of full-knowledge (plaintexts/ciphertexts and keys) access to a device that is similar to the one under attack. The first profiled attack is the template attack (TA) [CRR02], which builds models with means and covariances. Since then, several works have made the TA more realistic in practice [RO04, CK13]. As profiled SCAs can be formulated as classification problems, the application of machine learning techniques (e.g.,
In this paper, we propose the use of unsupervised domain adaptation as a powerful tool to enable cross-device profiled attacks. Although unsupervised domain adaptation has proved successful in transferring knowledge from a source domain to an unlabeled target domain [LCWJ15, LZWJ16, RMH+19], as far as we are aware, this technique has not been explored in the SCA community to cope with the domain discrepancy between the profiling and attack traces. We demonstrate that a limitation of the classic two-phase profiled attack is that it cannot utilize the discrepancy information, which is simply neglected. Therefore, we introduce an additional fine-tuning phase to enhance the pre-trained model. To quantify the domain discrepancy, we adopt the maximum mean discrepancy (MMD), a standard distribution distance metric, as a penalty term added to the classification loss function. We demonstrate the effectiveness of our strategy in dealing with three types of domain discrepancies, i.e., the variations in devices, the addition of countermeasures/noise, and different acquisition settings. The results show that fine-tuning with the MMD loss is efficient in removing the effect of domain discrepancy in all investigated situations.

Related Work
The gap between the experimental setting and reality when performing profiled attacks has already been noticed in the past few years but is still an open topic. The authors in [RSV+11] performed a study on 20 different devices, showing that the TA may not work at all when the profiling phase and attack phase are conducted on different devices. Elaabid et al. [EG12] showed that variations in the measurement setup could also lead to worse TA results even when using the same device. Recently, in [KKKR18], the authors reported only a 28% success rate on a different keyboard, compared to 100% when profiling and attacking the same keyboard. In [WdHG+20], the authors conducted profiled power analysis on the key loading procedure of multiple DST transponders. They concluded that their cross-device attacks could hardly succeed if only a single device was used for profiling. The authors in [DGD+19, GDD+19, BCH+20] reminded us that the portability issue still exists and should never be neglected in the context of DL-based SCA. Although solutions to make profiled attacks work on different devices have been proposed, e.g., using waveform realignment and acquisition campaign normalization [EG12], choosing suitable compression methods [CK14], and using multiple devices for profiling [CK14, DGD+19, GDD+19, BCH+20, WdHG+20], these methods mainly require target-specific preprocessing or are based on a multiple-profiling-devices assumption.
The SCA community recently noticed that transfer learning might be a feasible way to transfer the knowledge learned from the profiling device to the target device [GGH20,TAM20]. However, their focus was on reducing the number of profiling traces. Furthermore, they still used labeled traces acquired from the target domain, which was not a cross-device scenario from the perspective of attackers.

Our Contributions
Herein, we consider how to transfer the pre-trained model to fit the target device with unsupervised domain adaptation, which, to the best of our knowledge, has not been explored before in the SCA community. Specifically, we introduce a new approach to remove the effect of domain discrepancy between profiling and attack traces, which makes it possible or easier to recover the key of a different device. Our main contributions are as follows:

1. A cross-device profiled attack (CDPA) strategy. CDPA is an extension of DL-based profiled attacks, which introduces an additional fine-tuning phase to adjust the pre-trained model for improving the performance when attacking different devices. We emphasize that no labeled attack traces are required since the designed fine-tuning phase focuses on the domain discrepancy instead of the classification task itself.
2. Introducing a new loss function to DL-based SCA. With the MMD loss, our network is able to focus on the device discrepancy directly without using multiple profiling devices or task-specific preprocessing. We show that by minimizing the MMD loss and classification loss simultaneously, the fine-tuned model can learn a hidden representation that is not only discriminative but also domain-invariant.
3. A benchmark of cross-device SCA with satisfying results. We evaluate the effect of our strategy on eight Atmel XMEGA 8-bit microcontrollers and three SAKURA-G evaluation boards. We show that CDPA can significantly improve the performance of the attacks on different devices, and can even turn an impossible attack into a reality. Besides, we also show the potential of CDPA in removing the effect of adding (simulated) countermeasures/noise and overcoming human error (electromagnetic probe placement). These experiments can be reproduced through the following GitHub repository: https://github.com/CDPA-SCA/Cross-Device-Profiled-Attack.

Organization
The rest of this paper is organized as follows. In Section 2, we provide some background about the DL-based profiled attacks. In Section 3 we propose a cross-device profiled attack strategy and explore the methodology for removing the effect of domain discrepancy. Section 4 introduces the datasets used in our experiments. Section 5 presents the experimental results on multiple investigated situations. In Section 6, we provide a discussion around hyperparameter selection and its effect on our models. Then we give a brief comparison between CDPA and other promising techniques. Finally, we conclude this paper in Section 7.

Notations
Throughout this paper we use calligraphic letters such as $\mathcal{X}$ to denote sets, the corresponding capital letter $X$ to denote random variables (resp. random vectors $\mathbf{X}$), and the lowercase $x$ (resp. $\mathbf{x}$) to denote their realizations. The probability space of a set $\mathcal{X}$ is denoted by $\mathcal{P}(\mathcal{X})$. If $\mathcal{X}$ is discrete, $\mathcal{P}(\mathcal{X})$ corresponds to the set of vectors in $[0, 1]^{|\mathcal{X}|}$ whose coordinates sum to 1. Let $\Pr[X]$ denote the distribution of $X$ and $\Pr[X = x]$ the probability that $X$ equals $x$. We use $\mathbb{E}$ to denote the expected value, which may be subscripted by a random variable, $\mathbb{E}_X$, or by a probability distribution, $\mathbb{E}_{X \sim \Pr[X]}$, to specify under which probability it is computed. For profiled attacks, the target sensitive variable is $V = f(P, K)$, where $f$ denotes a cryptographic primitive (e.g., the SubBytes operation), $P$ denotes a known variable (e.g., plaintext or ciphertext), and $K$ denotes a part of the secret key (e.g., a byte) that an attacker tries to recover. Among all the possible values $K$ may take, $k^*$ denotes the right key hypothesis. Side-channel traces are viewed as discrete realizations of a random vector, $\mathbf{x} = (x_1, \ldots, x_D)$, with $D$ being the number of features. We use $y$ to denote the label of a trace, which can be derived from the value or the Hamming weight (HW) of the sensitive variable $V$. In particular, we denote by $\mathcal{D}_s = \{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{n_s}$ the source domain with $n_s$ labeled traces measured from the profiling device, and by $\mathcal{D}_t = \{\mathbf{x}_i^t\}_{i=1}^{n_t}$ the target domain with $n_t$ unlabeled traces measured from the target device.

Profiled Side-Channel Attacks
As shown in Figure 1, a profiled attack is generally composed of two phases: a profiling phase and an attack phase. During the profiling phase, the attacker first estimates the distribution
$$\Pr[\mathbf{X} \mid Y = y] \tag{1}$$
using the training set $\mathcal{D}_s = \{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{n_s}$. For example, one of the most popular ways to estimate the conditional probability is to use a mean vector $\mu_y$ and a covariance matrix $\Sigma_y$, which is based on the assumption that $(\mathbf{X} \mid Y = y)$ has a multivariate Gaussian distribution. Then, given a trace $\mathbf{x}_i$, the attacker can compute the likelihood for each possible $y$ using the estimated probability density function. In other words, the attacker eventually gets a model $F(\cdot): \mathcal{X} \to \mathcal{P}(\mathcal{Y})$ that can be assimilated to a probability mass function (possibly after normalization).
In the attack phase, the attacker tries to recover the fixed unknown key $k^*$ with the trace set $\mathcal{D}_t = \{\mathbf{x}_i^t\}_{i=1}^{n_t}$ measured from the target device. Specifically, he can calculate the log-likelihood score over all the attack traces for each $k \in \mathcal{K}$:
$$d_k = \sum_{i=1}^{n_t} \log F(\mathbf{x}_i^t)\big[f(p_i, k)\big], \tag{2}$$
where $p_i$ is the known variable associated with the $i$-th attack trace. The attacker then selects the subkey $k_{guess}$ leading to the highest log-likelihood score:
$$k_{guess} = \arg\max_{k \in \mathcal{K}} d_k.$$
The attack is successful if $k_{guess} = k^*$.
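The key-ranking step of Equation 2 can be illustrated with a minimal Python sketch. It is not the implementation used in this paper: the 4-bit S-box (borrowed from the PRESENT cipher), the helper names, and the simulated model outputs are all assumptions chosen to keep the example short.

```python
import math

# Toy 4-bit S-box (the PRESENT S-box), used only to keep the example small.
TOY_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
            0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def sensitive_value(p, k):
    """Hypothetical primitive f(p, k): S-box output of plaintext XOR key."""
    return TOY_SBOX[p ^ k]


def rank_keys(prob_vectors, plaintexts, n_keys=16, eps=1e-12):
    """Accumulate the log-likelihood score d_k over all attack traces.

    prob_vectors[i][y] plays the role of F(x_i)[y] for attack trace i.
    Returns the score list and the best-scoring key candidate.
    """
    scores = [0.0] * n_keys
    for probs, p in zip(prob_vectors, plaintexts):
        for k in range(n_keys):
            y = sensitive_value(p, k)
            # eps avoids log(0) when the model assigns zero probability.
            scores[k] += math.log(probs[y] + eps)
    k_guess = max(range(n_keys), key=lambda k: scores[k])
    return scores, k_guess


def simulated_model_output(p, true_key, confidence=0.4, n_classes=16):
    """Stand-in for a trained model: a mildly peaked probability vector."""
    y_true = sensitive_value(p, true_key)
    rest = (1.0 - confidence) / (n_classes - 1)
    return [confidence if y == y_true else rest for y in range(n_classes)]


true_key = 0xA
plaintexts = [(7 * i + 3) % 16 for i in range(20)]
outputs = [simulated_model_output(p, true_key) for p in plaintexts]
scores, k_guess = rank_keys(outputs, plaintexts)
print(k_guess == true_key)
```

Summing logarithms rather than multiplying probabilities keeps the scores numerically stable as the number of attack traces grows.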

Neural Networks
Neural networks are nowadays the privileged tool to address classification problems. In such a context, a classification task can also be performed in two phases: a training phase and a testing phase. In the training phase, the neural network aims to construct a function $F(\cdot): \mathbb{R}^D \to \mathbb{R}^{|\mathcal{Y}|}$ that takes an input $\mathbf{x} \in \mathbb{R}^D$ and outputs a vector $\mathbf{p} \in \mathbb{R}^{|\mathcal{Y}|}$ of scores. To construct the function $F(\cdot)$, a loss function is computed that quantifies the classification error of $F(\cdot)$ over the training batch. Each trainable parameter is then updated to reduce the loss, using gradients obtained by backward propagation. After training, the classification is done by choosing the label $\hat{y} = \arg\max_{y} \mathbf{p}[y]$. In general, a neural network consists of three blocks: an input layer, several hidden layers, and an output layer, which are all composed of multiple neurons. There are many kinds of neural networks, and their different behavior is mainly determined by how these neurons are connected within (and between) layers.
In this paper, we focus on the family of CNNs because of their potential in breaking cryptographic implementations protected with countermeasures [MPP16, CDP17, PSB+18, ZBHV20]. CNNs use three main types of layers: convolutional layers, pooling layers, and fully-connected layers. Convolutional layers are linear layers that share weights across space, whose trainable parameters are several small column vectors called convolutional filters. Each filter slides over the trace in steps of a fixed number of units (called the stride) and is expected to extract one kind of characteristic. As inputs pass through successive convolutional layers, higher-level abstract features are expected to be extracted. To avoid a complexity explosion due to the increase of convolutional layers, pooling layers are introduced. The most common pooling functions are max-pooling and average-pooling. Max-pooling outputs the maximum value within a window (called a pooling filter), while average-pooling outputs the average value within the pooling filter. Finally, fully-connected blocks are layers where every neuron is connected with all the neurons in the neighboring layers.
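To make the roles of the convolutional filter, the stride, and the pooling window concrete, the following pure-Python sketch applies one filter and both pooling functions to a short trace. The function names, the filter values, and the trace are illustrative assumptions; a real model would rely on a deep learning library instead.

```python
def conv1d(trace, kernel, stride=1):
    """Slide one convolutional filter over a trace (no padding)."""
    width = len(kernel)
    return [
        sum(w * trace[start + j] for j, w in enumerate(kernel))
        for start in range(0, len(trace) - width + 1, stride)
    ]


def pool1d(features, window, mode="max"):
    """Non-overlapping pooling: reduce each window to its max or its average."""
    if mode == "max":
        reduce = max
    else:
        reduce = lambda chunk: sum(chunk) / len(chunk)
    return [
        reduce(features[start:start + window])
        for start in range(0, len(features) - window + 1, window)
    ]


trace = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 4.0, 1.0]
edge_filter = [-1.0, 1.0]  # responds to the difference of adjacent samples

features = conv1d(trace, edge_filter)
pooled_max = pool1d(features, 2, "max")
pooled_avg = pool1d(features, 2, "avg")
print(features)
print(pooled_max)
print(pooled_avg)
```

Pooling with a window of two roughly halves the feature length, which is how the encoder keeps the number of parameters in the later fully-connected layers manageable.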
In this paper, we simplify the presentation of a CNN by dividing it into two parts: an encoder part and a classification part. The encoder aims to extract high-level features from the input to help the decision-making. To achieve this, the main block of the encoder is a convolutional layer $\gamma$ followed by an activation function $\sigma$. The former locally extracts information from the input, and the latter provides non-linearity to the learned classification function. After several $(\sigma \circ \gamma)$ blocks, a pooling layer $\delta$ is added to reduce the complexity of the network. The above block is repeated several times until abstract features of reasonable size are obtained. Finally, the classification part contains some fully-connected layers $\lambda$ that combine the features at different locations to obtain a global result that depends on the entire input. To sum up, a common architecture of CNNs can be characterized by the following formula:
$$s \circ [\lambda]^{n_1} \circ \big[\delta \circ [\sigma \circ \gamma]^{n_2}\big]^{n_3}, \tag{3}$$
where $n_1$, $n_2$, and $n_3$ denote the number of repetitions of the respective blocks and $s$ denotes a softmax layer over the $|\mathcal{Y}|$ classes.

Figure 2: An example scheme of the CNN architecture (convolutional blocks, followed by fully-connected layers, outputting Pr[Y|X=x] for each label).

DL-based Profiled Attacks
Deep learning fits naturally into profiled SCA because the two settings are closely related. More specifically, the profiling phase of the profiled SCA corresponds to the training of the neural network, and the attack phase can be seen as a series of classification tasks. For each input trace, the pre-trained model outputs a renormalized score vector $\mathbf{p}$ that defines a probability distribution $\mathbf{p} \approx \Pr[Y \mid \mathbf{X} = \mathbf{x}]$. We remark that the computed output provides not only the most likely label to solve the classification task, but also the likelihood of all the other labels. This form of output allows the attacker to rank the key candidates using multiple traces, as shown in Equation 2. In fact, the whole network in Equation 3 can be viewed as an approximation of the probability density function in Equation 1.

Metrics in Profiled Attacks.
To evaluate the performance of the profiled attacks, a common option is to use Guessing Entropy (GE) or Success Rate (SR). GE is defined as the average rank of the correct subkey after the attack. More specifically, given $n_t$ attack traces, the attack outputs an accumulated score vector $\mathbf{d} = [d_1, d_2, \ldots, d_{|\mathcal{K}|}]$, each element of which denotes the estimated probability of a candidate subkey. If we sort the score vector in decreasing order of score to get $\mathbf{d}'$, the GE can be defined as the average position of $d_{k^*}$ in $\mathbf{d}'$ over multiple experiments. Similarly, the SR is defined as the average empirical probability that $\mathbf{d}'[1] = d_{k^*}$.
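As a small illustration of these two metrics, the sketch below computes GE and SR from a handful of made-up score vectors. The helper names and the scores are assumptions, and ranks are counted from 1 (some works count from 0 instead).

```python
def key_rank(scores, correct_key):
    """Position of the correct key (1 = best) in decreasing order of score."""
    correct = scores[correct_key]
    # Count the candidates that score strictly higher than the correct key.
    return 1 + sum(1 for s in scores if s > correct)


def guessing_entropy(score_vectors, correct_key):
    """Average rank of the correct key over several independent experiments."""
    ranks = [key_rank(s, correct_key) for s in score_vectors]
    return sum(ranks) / len(ranks)


def success_rate(score_vectors, correct_key):
    """Fraction of experiments in which the correct key is ranked first."""
    hits = sum(1 for s in score_vectors if key_rank(s, correct_key) == 1)
    return hits / len(score_vectors)


# Four hypothetical experiments with four key candidates; the correct key is 1.
experiments = [
    [-9.0, -2.0, -5.0, -7.0],  # correct key ranked first
    [-3.0, -4.0, -8.0, -6.0],  # correct key ranked second
    [-6.0, -1.0, -2.0, -9.0],  # correct key ranked first
    [-2.0, -5.0, -3.0, -4.0],  # correct key ranked fourth
]
ge = guessing_entropy(experiments, correct_key=1)
sr = success_rate(experiments, correct_key=1)
print(ge, sr)
```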

Preprocessing techniques
Data preprocessing is very important when building a network model and can often determine the training results. In the context of DL-based SCA, the efficiency of preprocessing techniques should never be neglected because they a) make it easier to learn a classification model that generalizes [MBTL13, CK18], and b) can greatly increase the convergence rate of the training process [MOM12]. Here we introduce the four preprocessing methods that are most commonly used:
1. Feature Scaling readjusts the value of each dimension of the data (these dimensions may be independent of each other), so that the final data vector lies within the interval $[a, b]$. For each dimension, the general formula for scaling is given as:
$$x' = a + \frac{(x - \min(x))(b - a)}{\max(x) - \min(x)},$$
where $x \in \mathbb{R}^{n_s}$ is an original value, and $x'$ is the scaled value.
2. Feature Standardization aims to make the features of each dimension in the dataset have zero mean and unit variance, which is widely used in many machine learning algorithms. The general method is to determine the mean $\bar{x}$ and standard deviation $\sigma$. Then we subtract the mean from each feature and divide the result by its standard deviation:
$$x' = \frac{x - \bar{x}}{\sigma}.$$
3. Horizontal Scaling is a horizontal version of feature scaling introduced by Wouters et al. [WAGP20]. Unlike feature scaling, which normalizes the side-channel traces per sample, it is applied on a per-trace basis.
4. Horizontal Standardization is a horizontal version of feature standardization. It normalizes each trace using the mean and standard deviation calculated per trace over all feature dimensions.
Herein, we note that the extreme values in feature scaling (similarly, the mean and standard deviation in feature standardization) might be improper to transform the attack traces, especially in the context of cross-device SCA [MBTL13]. Furthermore, the feature scaling and feature standardization normalize traces on each dimension of features, which implies that the side-channel traces should be well-aligned. Otherwise, they could confuse and change the original features of side-channel traces. Therefore, we mainly use the horizontal version of normalization methods in the rest of this paper.
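The following sketch illustrates why the horizontal variants are attractive in a cross-device setting: they use only statistics of the trace being normalized, so nothing estimated on the profiling set has to be transferred. The function names are ours, and the "attack" trace is a hypothetical copy of the profiling trace with a different gain and offset.

```python
import math


def horizontal_standardize(trace):
    """Standardize one trace with its own mean and standard deviation."""
    mean = sum(trace) / len(trace)
    std = math.sqrt(sum((v - mean) ** 2 for v in trace) / len(trace)) or 1.0
    return [(v - mean) / std for v in trace]


def horizontal_scale(trace, a=0.0, b=1.0):
    """Rescale one trace to [a, b] using its own minimum and maximum."""
    lo, hi = min(trace), max(trace)
    span = (hi - lo) or 1.0  # guard against a constant trace
    return [a + (v - lo) * (b - a) / span for v in trace]


profiling_trace = [1.0, 2.0, 4.0, 2.0, 1.0]
# Hypothetical target-device trace: same shape, different gain and offset.
attack_trace = [3.0 * v + 10.0 for v in profiling_trace]

std_prof = horizontal_standardize(profiling_trace)
std_att = horizontal_standardize(attack_trace)
same_shape = all(abs(u - v) < 1e-9 for u, v in zip(std_prof, std_att))
print(same_shape)
```

A gain or offset difference between devices is removed by either horizontal method, whereas misalignment or changes in the leakage shape are not; those remain the job of the network.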

Cross-Device Profiled Attack
Although deep learning techniques seem to be quite suitable for performing profiled SCA, the domain discrepancy between the profiling and attack traces is still a bottleneck restricting the application of profiled attacks in practice. In fact, an implicit hypothesis of deep learning techniques is that the training data must be independent and identically distributed (i.i.d.) with the test data. However, when we adopt deep learning in the context of profiled SCA, this i.i.d. hypothesis is too strong since attack traces are often acquired from a different device without control. In such a context, various settings can easily break the hypothesis and lead to poor performance when we try to attack the target device. For example,
1. Different Devices. Although the structural changes introduced during the manufacturing process are relatively small and the devices produced meet the required specifications, no two chips are exactly the same. Even for devices of the same type, the leakage of the side-channel information is inevitably different, which is likely due to random process variations introduced during fabrication and packaging [MT10]. As a result, attacking a different device may cause a successful single-device-model attack to fail completely.
Keeping these in mind, in Section 3.1, we propose a generic attack strategy aiming at eliminating the effect of domain discrepancy. Section 3.2 elaborates the MMD loss that enables the networks to learn domain-invariant representations. Finally, in Section 3.3, we show how to minimize and embed the MMD in a standard CNN model.
As described in Section 2.2, a profiled attack is composed of two phases: a profiling phase and an attack phase. We note that a limitation of the two-phase attack is that it cannot utilize the information brought by domain discrepancy, which is simply neglected. Therefore, we propose extending the classic profiled attacks by introducing an additional fine-tuning phase before the final classification (see Figure 3). Fine-tuning is a widely adopted technique in transfer learning for deep neural networks, where a few epochs of training are applied to a pre-trained model's parameters to adapt them to a new task. An implicit assumption behind fine-tuning is that the source and target tasks should be related, which holds in profiled SCA since the classification task is the same in the profiling and attack phases. Thus, we can expect that the network parameters are not far from the optimal values for the target device.

A Cross-Device Profiled Attack Strategy
One straightforward approach for fine-tuning is to take a pre-trained network and then train (part of) its parameters using the data from the target domain. However, in a realistic SCA scenario, there is no labeled trace measured from the target device. In our strategy, therefore, the inputs of the fine-tuning phase are the original profiling traces with known labels, and a limited number of unlabeled traces measured from the target device. Our network should then capture the discrepancy information of two domains, based on which we can adjust the trainable parameters to minimize the domain discrepancy and the classification error simultaneously. Thus, we can expect to obtain a classification model that is capable of extracting domain-invariant features.

Methodology for Minimizing the Domain Discrepancy
More formally, the source domain consists of labeled traces $\mathcal{D}_s = \{(\mathbf{x}_i^s, y_i^s)\}_{i=1}^{n_s}$ measured from the profiling device, and the target domain has only unlabeled traces $\mathcal{D}_t = \{\mathbf{x}_i^t\}_{i=1}^{n_t}$ from the target device. The trace $\mathbf{x}_i^*$ belongs to the topological space $\mathcal{X}$. The corresponding label is represented by $y_i^s \in \mathcal{Y}$, where $|\mathcal{Y}|$ is 9 for the HW and 256 for a byte. Then our goal is to train a classifier $F(\cdot)$ that can predict the labels $\{\hat{y}_i^t\}_{i=1}^{n_t}$ of the attack traces, where the data distributions of the source and target domains are different, i.e.,
$$\Pr[X_s] \neq \Pr[X_t].$$
We note that minimizing the domain discrepancy can be considered equivalent to the task of finding a representation that makes the domains appear as similar as possible. In fact, this problem is called (unsupervised) domain adaptation, which is a branch of transfer learning and has been well studied in the last few years. Inspired by these state-of-the-art works [THZ+14, LCWJ15, RMH+19], we introduce the Maximum Mean Discrepancy (MMD) [GBR+12], a standard distribution distance metric, to measure the similarity between the source and target domains in a reproducing kernel Hilbert space (RKHS). We hereafter recall its definition.
Definition. Let $X_s$ and $X_t$ be random variables defined on a topological space $\mathcal{X}$, with respective Borel probability measures $p$ and $q$. Let $\mathcal{F}$ be a class of functions $f: \mathcal{X} \to \mathbb{R}$. The MMD is defined as:
$$\mathrm{MMD}(\mathcal{F}, p, q) = \sup_{f \in \mathcal{F}} \Big( \mathbb{E}_{X_s \sim p}[f(X_s)] - \mathbb{E}_{X_t \sim q}[f(X_t)] \Big).$$
It has been shown that a unit ball $\mathcal{F}$ in a universal RKHS $\mathcal{H}$ is rich enough to distinguish any two distributions, and the MMD can be expressed as the distance in $\mathcal{H}$ between their mean embeddings:
$$\mathrm{MMD}^2(\mathcal{F}, p, q) = \big\| \mu_p - \mu_q \big\|_{\mathcal{H}}^2,$$
where $\mu_p$ and $\mu_q$ denote the mean embeddings of $p$ and $q$ in $\mathcal{H}$. The MMD can eventually be calculated using kernel methods. Specifically, for a nonlinear mapping $\phi(\cdot)$ associated with the RKHS $\mathcal{H}_{ker}$ and kernel $\ker(\mathbf{x}^s, \mathbf{x}^t) = \langle \phi(\mathbf{x}^s), \phi(\mathbf{x}^t) \rangle$, an empirical estimate of the MMD can be obtained as:
$$\mathrm{MMD}^2(\mathcal{F}, X_s, X_t) = \Big\| \frac{1}{n_s} \sum_{i=1}^{n_s} \phi(\mathbf{x}_i^s) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi(\mathbf{x}_j^t) \Big\|_{\mathcal{H}_{ker}}^2 = \frac{1}{n_s^2} \sum_{i=1}^{n_s} \sum_{j=1}^{n_s} \ker(\mathbf{x}_i^s, \mathbf{x}_j^s) + \frac{1}{n_t^2} \sum_{i=1}^{n_t} \sum_{j=1}^{n_t} \ker(\mathbf{x}_i^t, \mathbf{x}_j^t) - \frac{2}{n_s n_t} \sum_{i=1}^{n_s} \sum_{j=1}^{n_t} \ker(\mathbf{x}_i^s, \mathbf{x}_j^t). \tag{7}$$
Having the empirical estimate of the MMD, we consider bounding the target error by the source classification error plus the MMD between the source and target domains:
$$\mathcal{L} = \mathcal{L}_C(X_s, Y_s) + \lambda \, \mathrm{MMD}^2(\mathcal{F}, X_s^l, X_t^l), \tag{8}$$
where $\mathcal{L}_C(X_s, Y_s)$ denotes the classification loss calculated on the available labeled traces, $\mathrm{MMD}^2(\mathcal{F}, X_s^l, X_t^l)$ denotes the distance between the source domain $X_s^l$ and the target domain $X_t^l$, and $\lambda > 0$ is a penalty parameter. Note that $X_*^l$ is the $l$-th layer hidden representation of the side-channel traces. In this way, we can expect the high-level features in the $l$-th layer to be both discriminative and domain-invariant (see Figure 4).
Note that the MMD focuses on the difference in distribution between the profiling and attack datasets, regardless of the labeling information. Therefore, there are no restrictions on the mask/plaintext/key of the attack traces and no need to know whether the labels are identical between the profiling and attack traces. This unsupervised property of the MMD exactly fits the realistic profiled SCA scenarios where the labeling information of the attack traces is never known before the attack. We can notice that the behavior of Equation 8 is very similar to what L1 and L2 regularizations do when training a classification model. However, the main purpose of L1 and L2 regularizations is to prevent overfitting by controlling the complexity of the model, while MMD regularization aims to minimize the domain discrepancy to make the domains appear as similar as possible.
As mean embedding matching is sensitive to the kernel choice, instead of using a single kernel function in Equation 7, we consider the multiple kernel variant of MMD (MK-MMD) proposed in [GSS+12], which leads to a principled method for optimal kernel selection. Specifically, the characteristic kernel $\ker(\cdot, \cdot)$ is determined as a convex combination of $m$ radial basis function (RBF) kernels $\ker_j(\mathbf{x}^s, \mathbf{x}^t) = \exp\!\big(-\|\mathbf{x}^s - \mathbf{x}^t\|^2 / \gamma_j\big)$, by varying the bandwidth $\gamma_j$ between $2^{-m/2}\gamma_0$ and $2^{m/2}\gamma_0$ with a multiplicative step-size of 2. We set $\gamma_0$ to be the median distance between points in the aggregate sample (the median heuristic [GBR+12]). Thus, the kernel is finally denoted as:
$$\ker(\mathbf{x}^s, \mathbf{x}^t) = \sum_{j=1}^{m} \beta_j \ker_j(\mathbf{x}^s, \mathbf{x}^t), \tag{9}$$
where $\sum_{j=1}^{m} \beta_j = 1$ and $\beta_j \geq 0$. In our work, we set $\beta_j = 1/m$ according to [LZWJ16], and it works well in practice.
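As an illustration of the empirical estimate with a multi-kernel RBF family, the following pure-Python sketch computes a biased MK-MMD estimate between two small batches. It rests on several assumptions of ours: the median heuristic is applied to squared distances, the $m$ bandwidths are centered on $\gamma_0$ with each twice the previous one, and all function names are hypothetical.

```python
import math


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def median_sq_dist(points):
    """Median heuristic on the pooled sample (squared distances, an assumption)."""
    dists = sorted(
        sq_dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )
    return dists[len(dists) // 2] or 1.0  # fall back to 1.0 if the median is 0


def mk_kernel(u, v, bandwidths):
    """Equal-weight convex combination of RBF kernels (beta_j = 1/m)."""
    d2 = sq_dist(u, v)
    return sum(math.exp(-d2 / g) for g in bandwidths) / len(bandwidths)


def mk_mmd2(source, target, m=5):
    """Biased empirical estimate of MMD^2 between two batches of features."""
    gamma0 = median_sq_dist(source + target)
    # m bandwidths centered on gamma0, each twice the previous one.
    bandwidths = [gamma0 * 2.0 ** (j - (m - 1) / 2) for j in range(m)]

    def mean_kernel(a, b):
        total = sum(mk_kernel(u, v, bandwidths) for u in a for v in b)
        return total / (len(a) * len(b))

    return (mean_kernel(source, source) + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


source = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2]]
shifted = [[x + 1.5, y + 1.5] for x, y in source]

mmd_same = mk_mmd2(source, source)
mmd_shifted = mk_mmd2(source, shifted)
print(mmd_same, mmd_shifted)
```

The estimate is essentially zero when both batches are identical and grows once one batch is shifted, which is the signal the fine-tuning phase tries to suppress in the hidden representations.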

Methodology for Embedding MMD in CNN-based SCA
Then the question arises: how to minimize the new loss function in Equation 8 during the fine-tuning phase? For one thing, the original network architecture receives only labeled traces for training, which cannot be directly used to calculate the MMD loss. For another, we have not decided where to calculate the MMD loss in our network.
First, we have to modify the network architecture such that the MMD loss can be easily calculated. Herein, we consider extending the pre-trained network as depicted in Figure 5, which enables us to optimize the classification and MMD loss simultaneously. Our architecture is composed of a source and a target CNN, with shared trainable weights. The trainable weights are initialized to be the same as the pre-trained model. For each fine-tuning iteration, the extended network receives two batches of input traces. One batch of traces is from the labeled source domain, and the other batch is from the unlabeled target domain. The two batches of traces are fed to the source CNN and target CNN, respectively. In particular, the batch of labeled profiling traces is used to compute the classification loss L C as before, while the MMD loss is computed over two batches of hidden representations of both the profiling and attack traces. Then, all the trainable parameters of the network are updated by minimizing the total loss (see Equation 8) in the backpropagation algorithm.
Besides, we must determine where to calculate the MMD loss in the network. As revealed in [YCBL14], deep features transition from generic to task-specific as one goes up the layers of a deep CNN. In other words, the transferability of the hidden representation tends to drop significantly in higher layers as the domain discrepancy increases. Therefore, we decide to minimize the MMD loss on the classifier part (fully-connected layers). Note that the encoder part (convolutional blocks) of the network is still trainable during the fine-tuning phase to further adapt to the target domain. We keep the encoder part trainable mainly because we expect the convolutional blocks to learn shift-invariant features in case the target domain is not well aligned. In [LCWJ15], the authors show that instead of considering a single layer for adaptation, another approach is to sum up the MMD loss over multiple fully-connected layers. Thus, Equation 8 can be rewritten as:
$$\mathcal{L} = \mathcal{L}_C(X_s, Y_s) + \lambda \sum_{l \in S} \mathrm{MMD}^2(\mathcal{F}, X_s^l, X_t^l), \tag{10}$$
where $S$ is the set of target fully-connected layers. During our tests, we observe that this approach usually leads to better results. Therefore, we adopt Equation 10 as the loss function of the fine-tuning phase in the rest of this paper.
To summarize, an end-to-end cross-device SCA consists of the following three steps:
1. The attacker obtains a set of labeled profiling traces from a profiling device. He can train a classification model solely with the profiling traces by minimizing the cross-entropy loss.
2. The attacker then obtains a limited set of unlabeled attack traces from the target device. The attack traces together with the profiling traces are fed to the fine-tuning network, with the new loss function defined in Equation 10. He can minimize the cross-entropy loss with the labeled profiling traces, and minimize the MMD loss with the additional unlabeled attack traces.
3. The attacker finally uses the fine-tuned model for attack instead of the pre-trained model.
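The loss minimized in step 2 can be sketched as follows. For brevity, this example swaps the multi-kernel MMD for a linear-kernel stand-in (the squared distance between batch means), so it only illustrates how the two terms are combined; the layer names, feature values, and function names are hypothetical.

```python
import math


def cross_entropy(prob_vectors, labels, eps=1e-12):
    """Mean negative log-likelihood of the true labels (source batch only)."""
    total = sum(math.log(p[y] + eps) for p, y in zip(prob_vectors, labels))
    return -total / len(labels)


def linear_mmd2(source_feats, target_feats):
    """Squared distance between batch means, i.e. MMD^2 with a linear kernel.

    A simplification for brevity, not the multi-kernel RBF MMD of the paper.
    """
    dim = len(source_feats[0])
    mean_s = [sum(f[d] for f in source_feats) / len(source_feats) for d in range(dim)]
    mean_t = [sum(f[d] for f in target_feats) / len(target_feats) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_s, mean_t))


def fine_tuning_loss(src_probs, src_labels, src_hidden, tgt_hidden, lam):
    """Classification loss plus lam times the discrepancy summed over layers.

    src_hidden and tgt_hidden map a layer name to that layer's batch of hidden
    representations. Labels of the target traces are never used.
    """
    penalty = sum(
        linear_mmd2(src_hidden[layer], tgt_hidden[layer]) for layer in src_hidden
    )
    return cross_entropy(src_probs, src_labels) + lam * penalty


src_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
src_labels = [0, 1]
src_hidden = {"fc1": [[1.0, 0.0], [3.0, 2.0]], "fc2": [[0.5], [1.5]]}
tgt_hidden = {"fc1": [[2.0, 2.0], [4.0, 2.0]], "fc2": [[1.0], [3.0]]}

loss_plain = fine_tuning_loss(src_probs, src_labels, src_hidden, tgt_hidden, lam=0.0)
loss_adapt = fine_tuning_loss(src_probs, src_labels, src_hidden, tgt_hidden, lam=0.5)
print(loss_plain, loss_adapt)
```

Setting the penalty parameter to zero recovers plain training on the profiling traces; a positive value additionally pulls the source and target representations together in every adapted layer.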

Datasets
Different types of domain discrepancy must be investigated to evaluate the performance of CDPA. First, we investigate the cross-device scenario with eight Atmel XMEGA 128A1U 8-bit microcontrollers and three SAKURA-G evaluation boards. Based on these devices, we build two datasets (we refer to the datasets as XMEGA and SAKURA_AES hereafter) covering the main types of SCA scenarios. Second, we examine whether CDPA can deal with the domain discrepancy caused by the addition of countermeasures/noise. To this end, we use the ASCAD dataset [PSB+18] and simulate two types of countermeasures/noise: Gaussian noise and clock jitter. After the simulation, we train the CNN model on the original dataset but test its performance on the protected/noisy datasets. Finally, we explore the portability issue when considering the electromagnetic (EM) side-channel and probe placement by human operators. Therefore, we build another dataset named XMEGA_EM with the eight XMEGA boards.
• XMEGA provides measurements of an unprotected AES-128 software implementation written in C. To build a realistic cross-device dataset, we initialize the devices with different secret keys (fixed inside the device, see Figure 6a). To perform the acquisition, we insert a 10 Ω resistor between the microcontroller and GND. Measuring the voltage drop across the resistor then allows side-channel measurement in terms of power consumption. During measurement, the microcontroller is clocked at 2MHz and is connected to a Pico 3203D oscilloscope with a sampling rate of 125MS/s. The power traces are synchronized with a board-generated trigger, and a computer is used to control the whole measurement setup. For each execution, the computer generates a random 16-byte plaintext and transmits it to the microcontroller via UART. Upon receiving the corresponding ciphertext, the software retrieves the waveform samples from the oscilloscope and saves them to disk. For each device, we acquired 25000 power traces for profiling and 5000 traces for the attack. Each trace consists of 500 points of interest (PoIs), corresponding to the SubBytes operation of the AES-128 algorithm in the first round. To highlight the difference in leakage of different devices, we calculate the SNR characterizing the first byte using 20000 traces labeled by the HW of the Sbox output. As clearly shown in Figure 6b, the leakage differs, and each board has its own leakage characteristics. The apparent time shift of Device 4 can be explained by an imprecise clock: expecting 625 points in 10 clock cycles, we observed only 616 points for Device 4. We use the HW model in the experiments; thus the leakage model is:
$$Y(k^*) = \mathrm{HW}(\mathrm{Sbox}[p_i \oplus k^*]), \tag{11}$$
where $p_i$ is a plaintext byte and we chose $i = 1$. The maximum SNR of this dataset reaches 2.6231.
• SAKURA_AES is an unprotected AES-128 implemented on a Xilinx Spartan-6 FPGA. We acquire the side-channel traces from three SAKURA-G evaluation boards, which use different secret keys. The AES-128 core is written in a round-based architecture, which takes 11 clock cycles to perform each encryption. Side-channel traces are measured by monitoring power waveforms on the core voltage of the main FPGA. Measurements are sampled using a Teledyne LeCroy Waverunner 610zi oscilloscope with a sampling rate of 500MS/s. A suitable and commonly used leakage model is the Hamming Distance (HD) model, which is related to the register writing in the last round, i.e.,
$$Y(k^*) = \mathrm{HD}(\mathrm{Sbox}^{-1}[c_2 \oplus k^*], c_6),$$
where $c_2$ and $c_6$ correspond to the second and sixth bytes of the ciphertext, respectively. The relation between $c_2$ and $c_6$ is given by the ShiftRows operation of AES. For each device, the profiling set contains 90000 traces and the test set contains 10000 traces, each trace with 1000 features. The measurements are relatively noisy and the measured SNR reaches 0.0248 (see Figure 7a).
• ASCAD was introduced in [PSB+18] to provide a benchmark for evaluating deep learning techniques in the context of SCA. The target platform is an 8-bit AVR microcontroller (ATmega8515) with an operating frequency of 4 MHz, on which a masked AES-128 is implemented. All traces are acquired with a sensor attached to an oscilloscope sampling at 2 GS/s. This dataset contains three HDF5 files with different desynchronization levels. Each file contains 50000 traces for profiling and 10000 traces for the attack. Each EM trace consists of 700 PoIs corresponding to the masked Sbox for i = 3. Since the mask is unknown, we use the unmasked Sbox output as the leakage model: $Y(k) = \mathrm{Sbox}(p_3 \oplus k)$. The SNR for the ASCAD dataset is around 0.8 under the assumption that the mask is known [PSB+18], and it is 0.0061 if the mask is unknown (see Figure 7b).
• XMEGA_EM provides measurements of the previously mentioned XMEGA chips that run the unprotected software AES-128 encryption. Instead of measuring power consumption as before, we acquire side-channel traces by measuring the EM radiation emitted from the chips. This dataset is captured using a Langer LF-U 5 near-field probe, each time at a similar position but with human error. The side-channel traces are taken around the target Sbox computation in the first round, with the Teledyne LeCroy Waverunner 610zi oscilloscope sampling at 250MS/s. The dataset has 35000 traces (for each device) where each trace has 1500 features. We also use the HW of the Sbox output as the label. Thus the leakage model is the same as Equation 11. The maximum SNR of this dataset reaches 0.2464 (see Figure 7c).
We summarize the datasets used in our experiments as shown in Table 1.
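The SNR values quoted for each dataset can be illustrated with a small, dependency-free sketch. It assumes the common definition of SNR as the variance of the per-class mean traces divided by the mean of the per-class variances, with classes given by the leakage label (here the HW of an intermediate value). The function names and the toy data are hypothetical and are not taken from the paper's code.

```python
from collections import defaultdict
from statistics import fmean, pvariance
import random


def hamming_weight(value: int) -> int:
    """Number of set bits in an integer."""
    return bin(value).count("1")


def snr_per_sample(traces, labels):
    """Per-sample SNR: variance of the class means over the mean of the class variances."""
    groups = defaultdict(list)
    for trace, label in zip(traces, labels):
        groups[label].append(trace)
    snr = []
    for t in range(len(traces[0])):
        # One column of measurements per class at time sample t.
        columns = [[trace[t] for trace in group] for group in groups.values()]
        signal = pvariance([fmean(col) for col in columns])
        noise = fmean([pvariance(col) for col in columns])
        snr.append(signal / noise if noise > 0 else float("inf"))
    return snr


# Toy data (hypothetical): sample 0 leaks HW(v) plus noise, sample 1 is pure noise.
rng = random.Random(0)
values = [v for v in range(256) for _ in range(8)]
labels = [hamming_weight(v) for v in values]
traces = [[hamming_weight(v) + rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for v in values]
snr = snr_per_sample(traces, labels)
```

In this toy setting the leaking sample has a much larger SNR than the noise-only sample, which is the same effect that makes the PoIs stand out in the SNR plots.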

Settings
All the experiments are implemented in Python using the PyTorch library, and are run on a GPU server equipped with 128 GB RAM and an NVIDIA GeForce RTX 3090 with 24 GB memory. Since the focus of this paper is cross-device SCA rather than optimizing the network architectures, we do not tune hyperparameters extensively, but use similar CNN architectures and hyperparameters based on the state-of-the-art results [ZBHV20]. Table 2 summarizes the training parameters used in our experiments. The main difference from previous works is that we introduce a new hyperparameter λ, defined in Equation 10. The sensitivity of this hyperparameter is discussed in Section 6.2.
Throughout the experiments, we use $N_{tGE}$ to denote the number of traces needed to reach a constant GE of 1. To get a good estimate of $N_{tGE}$, we perform the attack 100 times with randomly selected attack traces, and $\bar{N}_{tGE}$ is the average value.
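The estimation procedure can be sketched as follows. This is one possible reading, not the authors' code: the GE curve is taken as the average key rank over the repeated attacks, and the reported trace count is the smallest number of traces after which that average stays at 1. The function names, the log-likelihood accumulation, and the toy classifier are assumptions made for illustration.

```python
import math
import random


def rank_curve(log_probs, hypotheses, true_key):
    """Rank of true_key (1 = best guess) after each additional attack trace.

    log_probs[i][c]  : model log-probability of class c for trace i.
    hypotheses[i][k] : class the leakage model predicts for trace i under key guess k.
    """
    scores = [0.0] * len(hypotheses[0])
    ranks = []
    for lp, hyp in zip(log_probs, hypotheses):
        for k, cls in enumerate(hyp):
            scores[k] += lp[cls]
        ranks.append(1 + sum(s > scores[true_key] for s in scores))
    return ranks


def estimate_ntge(log_probs, hypotheses, true_key, n_attack, n_repeats=100, seed=0):
    """Smallest trace count after which the average rank (GE) stays at 1, or None."""
    rng = random.Random(seed)
    rank_sums = [0] * n_attack
    for _ in range(n_repeats):
        pick = rng.sample(range(len(log_probs)), n_attack)
        ranks = rank_curve([log_probs[i] for i in pick],
                           [hypotheses[i] for i in pick], true_key)
        rank_sums = [a + b for a, b in zip(rank_sums, ranks)]
    # GE equals 1 exactly when every repetition ranks the true key first.
    ntge = None
    for n in range(n_attack, 0, -1):
        if rank_sums[n - 1] != n_repeats:
            break
        ntge = n
    return ntge


# Toy attack (hypothetical): 4-bit values, identity leakage p ^ k, noisy classifier.
rng = random.Random(1)
true_key, n_classes = 5, 16
hi, lo = math.log(0.5), math.log(0.5 / 15)
log_probs, hypotheses = [], []
for _ in range(200):
    p = rng.randrange(16)
    # The model peaks on the right class 60% of the time, otherwise on a random class.
    peak = p ^ true_key if rng.random() < 0.6 else rng.randrange(16)
    log_probs.append([hi if c == peak else lo for c in range(n_classes)])
    hypotheses.append([p ^ k for k in range(16)])
ntge = estimate_ntge(log_probs, hypotheses, true_key, n_attack=40)
```

Comparing integer rank sums against the number of repetitions avoids floating-point issues when testing whether the average rank is exactly 1.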
Remark 1. We note that the fine-tuning phase requires a set of unlabeled traces that come from the attack dataset. Therefore, when attacking a different device for the first time, the number of traces used for fine-tuning should be counted as part of the cost of the entire attack. Fortunately, this cost is affordable, since the fine-tuned model still outperforms the pre-trained model even when the fine-tuning cost is taken into consideration. Besides, this is a one-time cost, since the fine-tuned model can still be used for attacks even if the secret key is changed. Note that the number of traces used for fine-tuning can be further reduced, and there is a trade-off between the number of fine-tuning traces and the quality of the MMD estimate. In the rest of this paper, we report the performance ($N_{tGE}$) of the models without counting the traces used for fine-tuning.

XMEGA
We first evaluate our methodology on the XMEGA dataset. We use 20000 traces for training, 5000 traces for validation, 100 traces for fine-tuning, and 5000 traces for attacking. Since our focus is not to optimize the network architecture, we adopt the 9-class HW model and use a simple CNN structure with three convolutional blocks followed by one fully-connected layer (see Appendix B Table 4). The hyperparameter λ is set to 0.1, and the learning rate is set to 0.001. All traces are preprocessed using horizontal standardization. We train the model for 100 epochs in the training phase, and for another 15 epochs during fine-tuning. We set the batch size to the number of traces used for fine-tuning. After training or fine-tuning, we save the model that achieves the lowest validation loss. For conciseness, we use Device x-y to denote that we train the model on device x and then try to recover the key of device y.
We show the performance of the pre-trained model on different devices in Figure 8a. The devices behave differently when attacked. Specifically, although the value of $\bar{N}_{tGE}$ for the single-device attacks on the diagonal is very small, it varies widely for cross-device attacks. The matrix in Figure 8a is not perfectly symmetric, which implies that the difficulty of task Device x-y can differ from that of task Device y-x. We then present, as an example, the GE curves of the pre-trained model (Device 1-y) in Figure 8b. The GE for Device 1-2 and Device 1-3 is almost random using 500 attack traces, yet Device 1-4 performs slightly better even though the SNR of Device 4 is also shifted in time relative to Device 1 (see Figure 6b). This is not surprising, since SNR only reveals the PoIs and the signal quality and may not fully characterize the real differences between devices.
We can also infer from the results that, due to the shift-invariant nature of CNNs, misalignment may not be the main problem in cross-device scenarios. Since the poor performance of cross-device attacks may also be explained by overfitting [BCH+20], we draw the learning curves of the training phase (task Device 1-4) in Figure 8c. We can see that the validation loss is highly consistent with the training loss when traces are from the same profiling device. In other words, overfitting is not observed when we use the same device to train and attack. However, the test loss increases noticeably when we attack a different device (Device 4). From these results, we infer that the main reason for the poor performance is that the pre-trained model has learned device-specific features that cannot be safely generalized to other devices.
By applying CDPA, the impact of device discrepancy is dramatically reduced. All the fine-tuned models can stably recover the key of a different device within 100 attack traces (see Figure 8d and Figure 8e). To further understand how the model learns during fine-tuning, we show the evolution of the test loss (cross-entropy loss on the target device) and the validation MMD loss in Figure 8f. We can observe that minimizing the MMD loss effectively reduces the test loss on the target device, which confirms the necessity and effectiveness of our methodology. Besides, the test loss seems to converge faster (in 30 iterations) than the MMD loss. In other words, fine-tuning for a small number of iterations can be sufficient to obtain a well-performing cross-device model, while further minimizing the MMD loss may not significantly improve the results.
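To make the role of the MMD term concrete, the following dependency-free sketch computes a (biased) squared MMD estimate between two batches of hidden features and combines it with a classification loss. Several points are assumptions rather than details taken from this section: the Gaussian kernel and its bandwidth `sigma`, the summation over adaptation layers, and the form `cross_entropy + lam * MMD` of the objective, since Equation 10 is not reproduced here. In practice the same computation would be written with PyTorch tensors so that gradients flow through it.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2(source, target, sigma=1.0):
    """Biased estimate of the squared MMD between two sets of feature vectors."""
    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))
    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


def fine_tune_loss(cross_entropy, mmd_per_layer, lam):
    """Assumed objective: source-domain cross-entropy plus lam times the summed MMD terms."""
    return cross_entropy + lam * sum(mmd_per_layer)


# Toy features (hypothetical): identical batches versus a shifted target batch.
source = [[0.0], [1.0]]
same_domain = [[0.0], [1.0]]
shifted_domain = [[5.0], [6.0]]
mmd_same = mmd2(source, same_domain)
mmd_shifted = mmd2(source, shifted_domain)
loss = fine_tune_loss(1.0, [0.2, 0.3], lam=0.1)
```

The estimate is close to zero when both batches come from the same distribution and grows when the target features are shifted, which is the discrepancy that fine-tuning drives down.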

Impact of Preprocessing.
To further understand how preprocessing methods affect the performance of the pre-trained and fine-tuned models, we present more results on the XMEGA dataset by varying the preprocessing techniques. As depicted in Figure 9, the portability issue is very obvious when no preprocessing is applied. Although preprocessing improves the pre-trained models in several cases, the effect of device discrepancy still cannot be eradicated. As we expected, CDPA improves the cross-device attacks significantly in all investigated situations.
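As a point of reference, horizontal standardization can be sketched as below. The sketch assumes that "horizontal" means each trace is normalized individually to zero mean and unit variance across its own samples (as opposed to normalizing each time sample across traces); the function name is hypothetical.

```python
from statistics import fmean, pstdev


def horizontal_standardize(trace):
    """Scale one trace to zero mean and unit variance across its own samples."""
    mu = fmean(trace)
    sigma = pstdev(trace)
    if sigma == 0:
        return [0.0 for _ in trace]
    return [(x - mu) / sigma for x in trace]


# A device-level gain and offset change (hypothetical values) is removed by the scaling.
trace = [0.1, 0.4, 0.9, 0.3, 0.2]
scaled = [2.0 * x + 3.0 for x in trace]
z_a = horizontal_standardize(trace)
z_b = horizontal_standardize(scaled)
```

Because a per-trace affine change cancels out, this preprocessing can absorb simple gain and offset differences between devices, but not changes in the shape of the leakage, which is consistent with the observation that preprocessing alone does not eradicate the device discrepancy.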

Results with Different Numbers of Profiling Traces.
Increasing the amount of training data is an effective way to prevent overfitting and can help us obtain a more precise model [PW17]. Therefore, we investigate whether using more profiling traces can improve the performance of the pre-trained models in cross-device attack scenarios. We test with different numbers of profiling traces, and show the attack results of the pre-trained models in Figure 10. Interestingly, increasing the number of training traces does not lead to better generalization when targeting different devices. Similar results were reported in [BCH+20]. This is reasonable, since the additional profiling traces are acquired from the same device and therefore cannot guarantee improved performance on a target device with a different distribution.

SAKURA_AES
Unlike the dataset described above, the SAKURA_AES dataset provides measurements of an unprotected hardware implementation of AES-128 on FPGA. We use 85000 traces for training, 5000 traces for validation, 200 traces for fine-tuning, and 10000 traces for attacking. As before, the learning rate is set to 0.001, and traces are preprocessed using horizontal standardization. The profiling phase runs for 200 epochs with a batch size of 200. The fine-tuning phase runs for only 15 epochs with λ set to 0.05. The MMD loss is computed on the flattened layer and the fully-connected layer. Since the SNR of this dataset is relatively small, our pre-trained models require around 1000 traces to successfully recover the key of the same device (see Figure 11a and Figure 11b). When we apply the pre-trained models to other devices, the required number of attack traces is likely to double. As before, no obvious overfitting is detected on the same device, whereas the test loss (task Device 1-2) increases rapidly as the model learns on the profiling traces (see Figure 11c). The results after fine-tuning are shown in Figure 11d and Figure 11e. All the cross-device experiments improve after applying CDPA, and most fine-tuned models achieve performance close to that of attacking the same device. Consequently, CDPA is also suitable and efficient for deep learning-based SCA on hardware implementations.
Remark 2. For the experiments on SAKURA_AES and ASCAD, we adopt batch normalization [IS15] after each convolutional block to make the optimization easier and faster [STIM18]. In general, batch normalization contains two non-trainable weights that get updated during the training phase: the variables tracking the mean and variance of the inputs. During the fine-tuning phase, however, the batch normalization layers should be kept frozen (in inference mode). Otherwise, the updates applied to the non-trainable weights will suddenly destroy what the model has learned [AAB+15]. In our experiments, we freeze the batch normalization layers using the model.eval() method provided by the PyTorch library.

Extended Applications to Other Portability Issues
As mentioned in Section 3, the portability issue is not limited to different devices. Variations in the implementation or in the acquisition settings can also lead to poor attack performance. To this end, we investigate two other scenarios that are very common in practice. Our first study simulates different implementations by adding artificial countermeasures/noise to the original dataset. After the simulation, we train the CNN model on the original dataset (source domain) and evaluate its performance on the deformed datasets (target domain). These experiments simulate a complex attack scenario in which the target device is treated as a black box that can turn on side-channel countermeasures. Finally, we explore the portability issue in EM analysis, where the measurements are very sensitive to probe placement (position, distance, and orientation).

Addition of Countermeasures/Noise
Herein, we consider two types of countermeasures/noise including Gaussian noise and clock jitters. All the experiments are performed based on the ASCAD dataset.
• Gaussian Noise is the most common type of noise in side-channel measurements. It can come from data buses, transistors, oscilloscopes, or even the work environment. To demonstrate the influence of added noise, we build the target domain by adding a normally distributed random value r ∼ N (0, var) to each point of the trace. As a result, Gaussian noise distorts the shape of the original traces in the amplitude domain (see Figure 12a (top)).
• Clock Jitters is a classical hardware countermeasure implemented by introducing instability in the clock [CDP17]. Herein, we simulate the clock jitters following the work of [WP20] by randomly adding and removing points within a pre-defined range. Specifically, when scanning each point in the trace, we draw n ∈ [−r, r]: if n is larger than zero, n points are added to the trace; otherwise, the following |n| points are removed. An example of the zoomed-in trace after deformation is given in Figure 12b (top).
We use a network architecture similar to the one used in [ZBHV20] for the ASCAD dataset. We use 45000 traces for training, 5000 traces for validation, 200 traces for fine-tuning, and 10000 traces for attacking. The profiling phase runs for 100 epochs on the original dataset, while the fine-tuning phase runs for 15 epochs, with a batch size of 200. The attack results are summarized in Figure 12 (bottom). As we can see, although the pre-trained model performs well ($N_{tGE}$ reaches 345) on the original ASCAD dataset, more than 5000 traces are required to reach a GE of 0 when the variance of the Gaussian noise is set to 8. After applying CDPA, the attack performance on the noisy target domain is significantly improved. We can infer from the results that a CNN may not generalize well if only clean traces are fed to the network. However, fine-tuning using a small number of (unlabeled) noisy traces can unleash the power of CNNs and drive the network to learn domain-invariant features. The pre-trained model does not work after adding the clock jitters, which is not surprising given how much randomness is introduced. Although we still cannot recover the key within 5000 traces after fine-tuning, the GE curves decrease with more attack traces.
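The two deformations used to build the target domains can be sketched as follows. This is only an illustration of the description above, not the simulation code of [WP20]: the value of the inserted points (the average of the two neighbouring samples) and the truncation or zero-padding back to the original trace length are assumptions, and all function names are hypothetical.

```python
import random


def add_gaussian_noise(trace, var, rng):
    """Add an independent N(0, var) value to every point of the trace."""
    std = var ** 0.5
    return [x + rng.gauss(0.0, std) for x in trace]


def add_clock_jitter(trace, r, rng):
    """Scan the trace; at each point draw n in [-r, r], then insert n points or skip |n| points."""
    out = []
    i = 0
    while i < len(trace):
        out.append(trace[i])
        n = rng.randint(-r, r)
        if n > 0:
            # Assumption: inserted points take the average of the two neighbouring samples.
            nxt = trace[min(i + 1, len(trace) - 1)]
            out.extend([(trace[i] + nxt) / 2.0] * n)
        else:
            i += -n  # skip the following |n| points
        i += 1
    # Assumption: keep the original length by truncating or zero-padding.
    out = out[:len(trace)]
    out += [0.0] * (len(trace) - len(out))
    return out


rng = random.Random(0)
trace = [float(i) for i in range(50)]
noisy = add_gaussian_noise(trace, 8.0, rng)
jittered = add_clock_jitter(trace, 2, rng)
```

Noise changes the amplitude of every sample while leaving its position intact, whereas jitter moves samples in time, which is consistent with the jittered target domain being the harder case in the results above.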

Portability of Electromagnetic Probe Placement
Apart from power analysis, EM-based SCA is becoming increasingly popular due to its non-invasive and spatially flexible nature. EM measurements are, however, very sensitive to probe placement, and in a realistic profiled attack scenario the probe must be moved from the profiling device to the target device. Hence, human error always introduces a slight difference in the position, distance, and orientation of the probe.
To investigate the impact of human error, we perform more cross-device experiments on the XMEGA_EM dataset. This dataset is captured from eight different devices with different keys, each time at a similar probe position but with human error. We use the same CNN architecture and training parameters as for the XMEGA dataset. The performance of the pre-trained model is depicted in Figure 13a and Figure 13b. We can observe that the result matrix is very similar to the attack result of the XMEGA dataset, but not exactly the same. This difference is inevitable since the EM traces have different features and signal quality from the power traces. As we expected, the fine-tuned models outperform the pre-trained models significantly, and can stably recover the correct key within 80 traces (see Figure 13d and Figure 13e). In Figure 13f, we see that the evolution of the MMD loss is again highly consistent with the test loss (task Device 1-2), which confirms our previous results.

Computation Cost
We present the computation cost of training and fine-tuning in Table 3. The learning time of each epoch is mainly determined by the size of training sets, the batch size, and the length of raw traces. We can observe that the epoch time for fine-tuning is approximately twice that of training. This is reasonable since more traces are processed and an additional MMD loss is calculated in the fine-tuning phase. In addition, the time cost is still affordable. For example, if we run the fine-tuning phase for 15 epochs, this process can be completed within two minutes for all considered datasets.

Effect of Adaptation Layers
In order to further understand how the location of the adaptation layers affects the output, we conduct a series of experiments on the XMEGA dataset (task Device 1-2) with different adaptation layers. We use a CNN whose classifier part has three fully-connected layers (fc1-fc3). We first fine-tune the network using only a single layer, and then compare it with the result of using all three layers. As before, our network is fine-tuned for 15 epochs. The results are shown in Figure 14. An obvious observation is that CDPA still works even when only a single layer is used for minimizing the MMD loss. Another observation is that the deeper the layer, the more difficult it seems to learn domain-invariant features. This is reasonable since the features obtained in higher layers depend strongly on the specific dataset and are not safely transferable to novel domains. Still, using all the layers of the classifier part is a good trade-off, which usually brings better results than using a single adaptation layer.

Effect of the Penalty Parameter λ
The hyperparameter λ in Equation 10 determines how strongly we would like to confuse the source and target domains. Intuitively, setting λ too small can cause the MMD regularizer to have no effect on the learned representation, while setting λ too large will regularize too heavily and may result in a degenerate representation in which all features are too close together. To further understand the sensitivity of the parameter λ, we illustrate the variation of the GE for λ ∈ {0, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100} on XMEGA (task Device 1-4) in Figure 15. The network is fine-tuned for 15 epochs with a batch size of 200.
We can observe that setting λ too small (λ = 1e-4) or too large (λ = 100) may not improve the attack, which is consistent with our analysis. For all other values of λ, the fine-tuned models improve the results significantly. Although there is usually a wide range of λ for which the pre-trained models are improved, a good empirical choice is to start with a relatively small value (e.g., 1e-2), especially when the SNR of the dataset is small. A smaller value of λ means that the optimizer puts more effort into the difficult classification task. If the reduction of the MMD loss is not significant or too slow, we can gradually increase the value of λ to speed up the fine-tuning process. In practice, we can automatically select the parameter λ on a validation set (consisting of labeled source traces and unlabeled target traces) by jointly assessing the cross-entropy loss and the MMD loss. For example, a small classification loss and a large MMD loss may indicate that λ is relatively small. Conversely, a large classification loss and a very small MMD loss may result from a large value of λ.

Effect of the Number of Traces for Fine-tuning
Although fine-tuning with the MMD loss helps us obtain a robust model, we need a set of attack traces to estimate the MMD. Although acquiring multiple unlabeled traces from the target device is not a strong assumption, it is still meaningful to determine how many traces are appropriate in practice when we fine-tune the network. Therefore, we conduct a series of experiments with the number of traces varying in {100, 300, 500, 700, 900} with a batch size of 100. We fine-tune the networks for 15 epochs to ensure that the model can learn the domain discrepancy well. The results on XMEGA (Device 1-4), SAKURA_AES (Device 1-2), and ASCAD (with Gaussian noise) are depicted in Figure 16. It can be observed that 100 traces (as small as the batch size) are sufficient for the fine-tuning phase.
We remark that MMD focuses on the domain discrepancy of the profiling and attack traces instead of the classification task itself. So, the results are not surprising since 100 unlabeled traces can provide sufficient information that is distinguishable from the source domain. From the results on SAKURA_AES and ASCAD, we can conclude that using more traces could lead to a more stable and robust fine-tuned model. This is reasonable since more traces help us to obtain a more precise estimate of MMD. The result on XMEGA seems to be less affected since an unprotected implementation is definitely easier to learn and transfer.

Comparison with Other Promising Techniques
Data augmentation has been proven successful in enhancing the robustness of CNN models [CDP17, KPH+19]. For example, artificially generating new profiling traces by deforming (shifting) those previously acquired can help the network to learn shift-invariant features. However, data augmentation focuses on enlarging the dataset without considering the real difference between the profiling and attack traces. So, it is unclear what kind of augmentation is optimal to introduce, from the perspective of attackers. The recently introduced denoising convolutional autoencoder (CAE) is a promising way to remove noise (e.g., Gaussian noise and desynchronization) from raw traces [WP20]. By treating the variance between different devices as noise, the CAE might be able to remove it. However, training a CAE requires noisy-clean trace pairs, which are challenging to obtain in realistic settings. As with data augmentation, simulated noise may not be optimal for characterizing the differences between devices.
Since the poor performance of the pre-trained model could also be explained by overfitting, several intuitive techniques that prevent overfitting may also help to improve the performance. These techniques typically include restricting model complexity and early stopping. We therefore test two very simple CNN architectures (details in Figure 17) that differ only in the number of fully-connected layers, and investigate whether a less complex architecture leads to better generalization. As shown in Figure 17, the test loss (on the attack device) increases during training for both CNN architectures. In other words, using a less complex architecture does not significantly improve the performance of the pre-trained model. Besides, in the cross-device scenario, it is difficult to estimate a proper network setting and to identify overfitting without observing labeled traces from the target device. As we have shown in the learning curves, the validation loss and the test loss may move in opposite directions. Therefore, observing the validation loss may be insufficient to identify this kind of overfitting, which means early stopping may not be suitable for cross-device scenarios.
Note that we are not the first to utilize additional unlabeled traces to build a stronger model. The authors of [PHJ+18] investigate the profiled attack in the semi-supervised learning scenario, where unlabeled traces are classified and then added to the training set to enhance the pre-trained model. They show that semi-supervised learning significantly helps TA and its pooled version. However, semi-supervised learning relies on strong assumptions (e.g., the smoothness, cluster, and manifold assumptions [vEH20]). If we consider a target domain with a different distribution, the unlabeled traces are very likely to be assigned wrong labels.
Consequently, these traces with incorrect labels may not yield an improvement but instead destroy what the model has learned.
Apart from the above DL techniques, some special tricks, such as zero-mean unit-variance normalization [MBTL13] and multi-device profiling [CK14, DGD+19, GDD+19, BCH+20, WdHG+20], are also helpful for attacking a different device. Zero-mean unit-variance normalization is a preprocessing technique, but it requires the traces to be well-aligned. Multi-device profiling is currently one of the most popular methods to overcome the portability issue. It assumes that a powerful attacker possesses multiple profiling devices and can capture as many side-channel traces as necessary during the profiling stage. The end goal is to use the model generated during the profiling phase (with multiple devices) to recover the secret key from an unseen device. This method is effective since the network sees more training data with different distributions. Therefore, the more profiling devices used, the more robust the model becomes. However, it still cannot guarantee that any profiling device is close to the target one. Besides, in realistic settings, the assumption of possessing multiple profiling devices cannot always be satisfied. To relax this assumption, the proposed CDPA takes advantage of the domain information of the profiling and attack traces. Instead of introducing more labeled traces to the profiling set, CDPA focuses directly on the variance between the profiling and target devices by adopting the MMD loss. The cost of CDPA is a few additional epochs of fine-tuning with a small number of unlabeled attack traces, which is affordable compared with the cost of using multiple profiling devices.

Conclusion and Future Work
This paper addresses the open question of portability in profiled SCA using transfer learning techniques (specifically, unsupervised domain adaptation). We consider the issue of portability as domain discrepancy, and we propose a new attack strategy called CDPA to eliminate it. CDPA introduces a fine-tuning phase before the traditional attack phase. The core idea is to adjust the pre-trained model such that it eventually learns a representation that is not only discriminative but also domain-invariant. To achieve this, we adopt the MMD loss as a penalty term added to the cross-entropy loss function, which can be easily calculated and embedded in a common CNN architecture.
We evaluate the performance of our strategy on eight XMEGA chips and three SAKURA-G boards with AES-128 implementations. The results show that CDPA can improve the attack performance by more than 20× and can even turn an impossible attack into a feasible one. Besides considering different devices, we also explore the ability of CDPA to remove the impact of added countermeasures/noise. Our results on the ASCAD dataset show that a CNN may not generalize well if only clean traces are fed. However, fine-tuning using a limited number of (unlabeled) desynchronized/noisy traces can unleash the power of CNNs and drive the network to learn domain-invariant features. Finally, we show how portability issues also arise when considering the EM side-channel and probe placement by human operators, and we demonstrate how CDPA improves the performance in such a scenario.
For future work, it would be interesting to see how CDPA performs in a cross-family attack scenario, which is a more challenging task. Another direction is to explore other transfer learning techniques that are more appropriate in the context of SCA. We believe that this work will pave the way for realistic studies of cross-device profiled SCA.