Pay Attention to Raw Traces: A Deep Learning Architecture for End-to-End Profiling Attacks

With the renaissance of deep learning, the side-channel community has also noticed the potential of this technology, which relates closely to profiling attacks in the side-channel context. Many recent papers have investigated the ability of deep learning to profile traces, and some of them also target countermeasures (e.g., masking) at the same time. Nevertheless, all of these papers so far work under an (implicit) assumption that the number of time samples in raw traces can be reduced before profiling, i.e., that the positions of the points of interest (PoIs) can be located manually. This is arguably the most challenging part of a practical black-box analysis targeting an implementation protected by masking. Therefore, we argue that to fully utilize the potential of deep learning and get rid of any manual intervention, end-to-end profiling that directly maps raw traces to target intermediate values is required. In this paper, we propose a neural network architecture that consists of encoders, attention mechanisms and a classifier to conduct end-to-end profiling. Networks built with our architecture can directly classify traces that contain a large number of time samples (i.e., raw traces without manual feature extraction) even when the underlying implementation is protected by masking. We validate our networks on several public datasets, i.e., DPA contest v4 and ASCAD, where over 100,000 time samples are used directly in profiling. To the best of our knowledge, we are the first to successfully carry out end-to-end profiling attacks. The results on these datasets indicate that our networks remove the need for tricky manual feature extraction. Moreover, our networks perform systematically better (w.r.t. the number of traces needed in attacks) than those trained on the reduced traces. These validations imply that our approach is not only a first but also a concrete step towards end-to-end profiling attacks in the side-channel context.


Introduction
Side-channel analysis (SCA), introduced in [Koc96] for the first time, takes advantage of the fact that cryptographic algorithms are not entirely black-box when implemented in the real world. It exploits the possible leakages of the sensitive internals via some side-channels, e.g., execution time, power consumption, electromagnetic radiation, etc., to recover the secret of a cryptographic device.
As one branch of SCA, profiled SCA has been continually explored and studied since the first method, the template attack (TA) [CRR02], was proposed. Following the seminal TA, many machine learning techniques [SA08, HGM+11, BL12, HZ12, LBM14] have gradually been introduced and adopted in the side-channel field. Essentially, the differential analysis in the attack phase of profiled SCA is based on distinguishing between different classes.
Hence, in theory, any machine learning technique that helps to distinguish samples (whether it is performed directly as classification or as regression) can be used in profiling.
To distort the distinction among classes, countermeasures that reduce the relation between the internals and the leakages [KJJ99, AG01] were proposed at almost the same time as Differential Power Analysis (DPA). The presence of countermeasures, particularly masking, is also an important reason why some machine learning techniques are preferred, since the Support Vector Machine (SVM) [HGM+11, HZ12, BL12], Random Forest (RF) [LBM14, LPB+15], and deep learning can naturally handle non-linear problems and deal with high-dimensional input better than classical methods like TA. Moreover, many results indicate that deep learning can directly cope with masking countermeasures [MPP16], even with some additional noise (i.e., misalignment) [CDP17, PSB+18].
We remark that one of the most critical advantages of deep learning is that networks can learn to extract and combine features automatically and thus find a better representation of the data. However, in the recent literature applying neural networks to SCA, there is always an (implicit) assumption that the adversary can carry out a manual feature extraction (i.e., selecting PoIs or locating a short region containing PoIs) to significantly reduce the length of raw traces before profiling. This assumption is usually too strong or even invalid in a practical analysis targeting a masked implementation. So far, no work in this direction has discussed this issue in depth, which limits the generality of current research. As a consequence, the power of these methods is weakened by their reliance on impractical manual feature extraction.
The primary motivation of our paper is to conduct successful end-to-end profiling. In particular, we want to omit the tricky manual feature extraction on long raw traces when prior knowledge of the masking countermeasure is lacking. To the best of our knowledge, no approach in the current literature handles these requirements in a straightforward way. Therefore, in this paper, we propose a new neural network architecture consisting of three components, i.e., encoder, attention mechanism, and classifier, which conduct the operations of encoding, attending, and classifying, respectively. Our architecture is designed to automatically extract and combine the leakages of sensitive intermediate values on raw traces that are too long to be handled by the commonly studied convolutional neural networks (CNNs) or multilayer perceptrons (MLPs). In our architecture, the junior encoder first encodes the raw data and yields fine-grained features. The senior encoder then non-linearly combines these features along the time sequence with a recurrent layer. Following the encoders, an attention block that focuses on the important time steps is added to reduce the difficulty of training a recurrent layer on long time sequences. Finally, a classifier generates the output probabilities.

Related Works
Many methods aim to attack masked implementations. This kind of attack is usually called a high-order attack, and it covers both the profiled [Sch08, LP07, MDM16] and non-profiled [Mes00] settings. In profiled high-order attacks, methods like Gaussian mixture models [LP07] and neural networks [MPP16, CDP17, KPH+19, ZBHV20] can carry out successful attacks without knowing the mask, but we argue that they still face the feature extraction issue (which is also discussed at the end of [WAGP20]) in more practical scenarios. None of these methods can be regarded as end-to-end profiling unless more evidence on raw traces is given. We note that there is another route to high-order attacks, in which the adversary jointly detects the PoIs of all the sensitive intermediate values with statistical tools and then combines them. Positive results have been reported in [DSV+15, DS16]. In this paper, in contrast, we focus on conducting end-to-end profiling in practical scenarios (i.e., raw traces with masking), which naturally includes the selection task. In other words, our architecture can be regarded as including a learnable built-in PoI selector.
Our work is also related to a series of papers focusing on the attention mechanism and its variations [BCB15, HKG+15, RLL+17, VSP+17, LJD19, KKL20]. The attention mechanism has been widely used in Natural Language Processing (NLP) and Automatic Speech Recognition (ASR). Usually, it is used as a built-in component of the decoder agents in sequence-to-sequence (seq2seq) problems. We use it in our paper to further combine the leakage features and accelerate the convergence of networks.

Our Contributions
The contributions of our work can be summarised as follows:
1. An architecture for practical end-to-end profiling. We propose a new neural network architecture in the context of SCA, which gets rid of manual feature extraction when handling extremely long traces (e.g., raw traces) whose underlying implementation is protected by masking. With this architecture, we can mount end-to-end profiling without any prior knowledge about the implementation.
2. Introducing new structures into SCA. We point out that a locally-connected (LC) layer can be used as a junior encoder to extract useful time samples in a narrow region and overcome the drawbacks of convolutional layers. Surprisingly, we have not found any discussion of this very simple kind of layer in the side-channel community. Moreover, while the attention mechanism is prevalent in NLP and ASR in the deep learning community today, we are, to the best of our knowledge, the first to introduce this mechanism on top of a recurrent layer into SCA.
3. Satisfying results on public datasets. To the best of our knowledge, we are the first to achieve end-to-end profiling in SCA. In the profiling phase, we can handle raw traces with over 100,000 time samples each. Concretely, we directly train networks on the raw traces of DPA contest v4.1 and two ASCAD datasets. For DPA contest v4.2, we use the first 300,000 time samples in profiling to save time. All of these networks converge and carry out successful attacks. Moreover, our networks perform systematically better (w.r.t. the number of traces needed in attacks) than those trained on the length-reduced traces.
4. Explore the attention mechanism in the SCA context. In addition to the public datasets, we also apply the architecture to our own dataset acquired from the power consumption of ATmega128A running our simulated masking demo. We utilize this dataset as an additional baseline to further explore how the fundamental structures work in our network.
The rest of this paper is organized as follows. In Section 2, we provide the basics of profiled SCA, high-order attacks and some necessary backgrounds about the neural networks. In Section 3, we propose the new architecture and describe each component of it in detail. We also explain the motivations of such a design and why our architecture works. In order to validate our statements of end-to-end profiling, in Section 4, we test the network instances of our architecture on different datasets. We also dive into the trained networks in Section 5, showing the behaviors and the interactions of components in practice. Finally, in Section 6, we conclude and discuss some potential research directions in the future.

Background
In this section, we give the notations of variables and functions used in the context of SCA. For completeness, we describe the route of carrying out a profiled SCA. We also introduce the basics of neural networks which are used in our architecture.

Notations
We use an upper-case letter $X$ to denote a random variable and the bold format $\mathbf{X}$ to denote the corresponding random vector. Their realizations are denoted by the lower-case letters $x$ and $\mathbf{x}$, respectively. Throughout this paper, we use $\mathbf{L}$ to denote the trace of a device, $X$ the plaintext, $K$ the secret key and $Z$ the sensitive target. The variable $Z$ is related to $X$ and $K$ (e.g., $Z = \mathrm{Sbox}(X \oplus K)$). An acquired trace set is a set of realizations of the variable $\mathbf{L}$ and can be denoted as $\mathcal{L} = \{\mathbf{l}_1, \mathbf{l}_2, \dots, \mathbf{l}_N\}$, where $N$ is the number of traces. Naturally, the corresponding sets of plaintexts, keys and sensitive targets are $\mathcal{X}$, $\mathcal{K}$ and $\mathcal{Z}$, which consist of the entries $x_{i,j}$, $k_{i,j}$ and $z_{i,j}$, respectively, where $j$ is the byte index. We also use sans-serif upper-case (e.g., $\mathsf{H}$) and lower-case (e.g., $\mathsf{h}$) letters to represent matrices and vectors, respectively, when describing the architecture of neural networks. We will omit the subscripts for conciseness whenever there is no ambiguity.

Profiled Side-Channel Analysis
A profiled attack consists of two phases: an offline profiling (training) phase and an online attack (testing) phase. In the profiling phase, the adversary collects traces with known labels (i.e., targets) as $\{(\mathbf{l}_i, z_i)\}_{i=1,\dots,N}$. The goal of profiling is usually to estimate a probability density function (e.g., $P[\mathbf{L} \mid Z = z]$ in TA) that can be used to classify traces which do not occur in the profiling set. In the attack phase, the adversary acquires a new set of traces $\{\mathbf{l}_i\}_{i=1,\dots,N_a}$ with a fixed unknown key $k^*$, where $N_a$ is the number of traces for the attack. The probability $P[\mathbf{L} \mid K = k]$ for each key hypothesis can be obtained from $P[\mathbf{L} \mid Z = z]$ since the plaintext $x$ is known. The classification is then carried out with the posterior given by Bayes' theorem:

$$P[K = k \mid \mathbf{L} = \mathbf{l}] = \frac{P[\mathbf{L} = \mathbf{l} \mid K = k] \, P[K = k]}{\sum_{k'} P[\mathbf{L} = \mathbf{l} \mid K = k'] \, P[K = k']} \tag{1}$$

As we want to use all of the information in the attack set rather than a single trace, the eventual posterior $p_k$ after observing the whole trace set is calculated as the score to reveal the underlying key (under the hypothesis that the $\mathbf{l}_i$ are independent):

$$p_k = \prod_{i=1}^{N_a} P[K = k \mid \mathbf{L} = \mathbf{l}_i] \tag{2}$$

To prevent numerical underflow, $p_k$ can equivalently be calculated as the sum of the log-posteriors. Finally, the key candidate with the maximum score is chosen as the key guess.
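The attack-phase scoring described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: it assumes that the profiled model already outputs a probability for every target value of each attack trace, that the key prior is uniform, and it uses a hypothetical `toy_target` (a plain XOR) in place of the real Sbox.

```python
import math

def toy_target(x, k):
    # Hypothetical stand-in for Sbox(x XOR k); a real attack would use the cipher's Sbox.
    return x ^ k

def key_scores(model_probs, plaintexts, target=toy_target):
    """Sum log-probabilities over all attack traces for each of the 256 key hypotheses.

    model_probs[i][z] is assumed to be the profiled model's probability that
    trace i carries target value z.
    """
    scores = [0.0] * 256
    for probs, x in zip(model_probs, plaintexts):
        for k in range(256):
            # Clamp to avoid log(0) for values the model rules out entirely.
            scores[k] += math.log(max(probs[target(x, k)], 1e-40))
    return scores

def best_key(scores):
    # The key guess is the hypothesis with the maximum accumulated score.
    return max(range(256), key=lambda k: scores[k])
```

Summing logarithms instead of multiplying probabilities keeps the scores in a safe numeric range even for thousands of attack traces.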

High-Order Attacks
One of the most common countermeasures to protect cryptographic implementations from SCA is masking, which hides the sensitive intermediate values with randomness. The (Boolean) masking applied to the Sbox can be represented as a series of XOR operations among $d + 1$ shares, namely, $z = \mathrm{Sbox}(x \oplus k) = z_m \oplus m_1 \oplus \cdots \oplus m_d$, where $m_1, \dots, m_d$ are the masks and $z_m$ is the masked value.
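As a small illustration of Boolean masking with $d + 1$ shares, the following sketch splits a byte into $d$ random masks plus one masked value and recombines them. The function names are hypothetical and the code only illustrates the sharing itself, not a protected Sbox implementation.

```python
import secrets
from functools import reduce
from operator import xor

def mask_value(z, d):
    """Split byte z into d + 1 Boolean shares: d random masks and one masked value."""
    masks = [secrets.randbelow(256) for _ in range(d)]
    z_m = reduce(xor, masks, z)  # z_m = z XOR m_1 XOR ... XOR m_d
    return z_m, masks

def unmask(z_m, masks):
    """Recombine all d + 1 shares to recover the sensitive value."""
    return reduce(xor, masks, z_m)
```

Any proper subset of the shares is uniformly distributed and independent of the sensitive value, which is why an attacker has to combine the leakages of all $d + 1$ shares.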
The fundamental idea of high-order attacks is to combine the leakages of the shares. A first-order DPA with different distinguishers can then be carried out on these combinations. However, with raw traces, the $d + 1$ shares make this combination very complicated. In the worst case, the adversary has to pre-compute all possible combinations of $d + 1$ time samples, which yields new traces of length $|\mathbf{l}|^{d+1}$, where $|\mathbf{l}|$ is the length of the original traces. In recent research, MLPs and CNNs have been used to automatically combine leakages without knowing the implementation details. We note that besides the combination complexity, the architectures of MLPs and CNNs themselves also restrict profiling on extremely long traces. We will discuss this in detail in Section 3.1.
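To give a feeling for this growth, the sketch below computes the worst-case length of the combined trace for a hypothetical trace length. It returns both the power stated above and, as an additional point of comparison that is our own assumption, the number of unordered sets of $d + 1$ distinct samples; both grow with the same exponent.

```python
import math

def combination_counts(trace_len, d):
    """Return (trace_len^(d+1), C(trace_len, d+1)) for a masking scheme with d masks."""
    order = d + 1
    return trace_len ** order, math.comb(trace_len, order)
```

For a trace of 100,000 samples and a single mask ($d = 1$), even the smaller of the two counts is close to five billion combined samples per trace, which makes exhaustive pre-computation impractical.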

Neural networks
Neural networks are usually composed of multiple layers, and each layer is composed of multiple neurons. The different behavior of networks is mainly affected by how these neurons are connected within and between layers. For the convenience of describing our architecture, we briefly introduce some basics of neural networks, specifically, the useful layers and the attention mechanism.

Convolutional layer
The convolutional layer is one of the best-known neural network layers because of its great success, together with pooling layers, in the computer vision field. It applies convolution operations to the input with filters that act as a group of sliding windows. The filter weights are shared among the spatial steps, so the same patterns are detected at different positions. The left-hand side of Figure 1 shows a convolution operation with 2 filters of size 3 and a maximum pooling with a window of size 2. As mentioned in many other papers, the outputs of stacked convolutional layers are usually called feature maps or abstract features, which are distinct from the raw data. Meanwhile, these abstract features also have some natural properties like shift-invariance, and thus CNNs are good candidates to handle desynchronization in the side-channel context [MPP16, CDP17].

Locally-connected layer
We use the term locally-connected (LC) in this paper to specifically refer to the layer in Figure 1b. The only difference between it and the convolutional layer is that the LC layer does not share the weights among steps when sliding along the input. This means that the filters do not learn to extract common features and each element of the output is only related to a small, limited region of the input. That is, the LC layer has more freedom to focus on details in a particular region. This property is quite useful when we desire exhaustive features that are independent of each other at different positions. This kind of layer has been used in face recognition [TYRW14] and speaker recognition [CLS+15] to extract local features.
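The difference between the two layers can be made concrete with a minimal single-filter sketch in plain Python. It assumes no padding and no bias, and the function names are ours: the convolution reuses one weight vector at every step, while the LC layer holds a separate weight vector per step.

```python
def conv1d(x, w, stride):
    """Convolutional layer with one filter: the same weights w are used at every step."""
    n_steps = (len(x) - len(w)) // stride + 1
    return [sum(w[j] * x[s * stride + j] for j in range(len(w)))
            for s in range(n_steps)]

def locally_connected1d(x, ws, stride):
    """LC layer with one filter per step: ws[s] is used only at step s."""
    size = len(ws[0])
    n_steps = (len(x) - size) // stride + 1
    if len(ws) != n_steps:
        raise ValueError("need exactly one weight vector per step")
    return [sum(ws[s][j] * x[s * stride + j] for j in range(size))
            for s in range(n_steps)]
```

When every entry of `ws` is the same vector, the LC layer reduces to the convolution; with distinct vectors, each output element is free to extract a different feature from its own window.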

Recurrent layer
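The recurrent layer is designed to exhibit temporal dynamic behavior, and thus it can learn the dependence between time samples along the time sequence. To generate the output at time step $t$ from the input $x_t$, an LSTM layer in its standard formulation computes

$$\begin{aligned} i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\ f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\ o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\ C_t &= f_t \odot C_{t-1} + i_t \odot \tanh(W_C x_t + U_C h_{t-1} + b_C), \\ h_t &= o_t \odot \tanh(C_t), \end{aligned}$$

where $i$, $f$ and $o$ are the input, forget and output gates, respectively, $C$ is the memory state, $h$ is the hidden state, $W$, $U$ and $b$ are the trainable weights and biases, $\sigma$ is the sigmoid function and $\odot$ is the element-wise product.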
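How the gates, the memory state and the hidden state interact can be sketched with a single-unit LSTM in plain Python. This follows the standard LSTM formulation with scalar weights; the parameter layout `p[name] = (w, u, b)` and the gate name `"g"` for the candidate memory are our own conventions for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM; p maps a gate name to (w, u, b)."""
    def gate(name, act):
        w, u, b = p[name]
        return act(w * x + u * h_prev + b)
    i = gate("i", sigmoid)    # input gate
    f = gate("f", sigmoid)    # forget gate
    o = gate("o", sigmoid)    # output gate
    g = gate("g", math.tanh)  # candidate memory content
    c = f * c_prev + i * g    # new memory state
    h = o * math.tanh(c)      # new hidden state
    return h, c

def lstm_sequence(xs, p):
    """Run the cell over a sequence and return the hidden state of every step."""
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs
```

Returning the hidden state of every step, rather than only the last one, corresponds to the sequential output that an attention mechanism can later select from.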
In seq2seq models, the basic idea of attention is to allow the decoder to refer back to any of the encoder's outputs, thereby finding the related parts that contribute most to generating the expected output at a specific decoder step. Therefore, the attention mechanism can be interpreted as a function which maps an encoded input sequence to an output with a variable query.
In the context of seq2seq problems, the attention mechanism creates an implicit soft alignment between entries in the output sequence and entries in the input sequence, which can give useful insight into the model's behavior. In the context of side-channel analysis, this insight can be interpreted as the regions of interest where the sensitive values leak. We give more details about the attention mechanism we use in Section 3.

Our architecture
In this section, we evaluate the shortcomings of MLPs and CNNs by explaining the causes of the high dimension issue in the side-channel context. To overcome these shortcomings, we propose concrete design principles for our new architecture. We then introduce three components (i.e., encoder, attention mechanism, and classifier) guided by these principles and explain why the overall architecture works in the side-channel context. Moreover, we discuss some different choices of connections to build up variants of our architecture.

Motivation and design principles
Although full control of a profiling device is a common assumption of profiled SCA, the situation can differ in practice. From the security analysts' point of view, the implementations in the device under test are usually open-source and can even be modified. Therefore, the operations at every clock cycle are known to the analysts. From the adversaries' point of view (which is more practical), the most common situation is that they can only control the input (i.e., the plaintext) and reset the key of a profiling device. Intuitively, under the latter assumption, the complexity of attacking a masked implementation is expected to increase rapidly. In the worst case, to ensure that the sensitive leakages are captured, the adversary has to measure a rather long operation sequence and use all of the time samples to conduct an attack. MLPs and CNNs seem to be good choices at first glance, since several references have shown their ability to profile traces without knowing the mask (on dimension-reduced datasets). However, a more practical scenario can easily push the attack into the worst case, breaking the implicit assumption that the trace dimension can be easily reduced in advance. Meanwhile, a trace containing hundreds of thousands of time samples is quite usual (especially for software implementations), e.g., the DPA contest v4.2 (over 1,600,000 samples) and ASCAD (250,000 samples) datasets. This is the actual magnitude of time samples that the adversary has to deal with in more practical scenarios. To this end, considering end-to-end profiling is clearly meaningful.

Shortcomings of typical MLPs and CNNs.
The shortcomings of MLPs become apparent as the dimension of the input increases (i.e., as the raw traces get longer). As all of the layers in MLPs are fully-connected (FC) layers, each neuron of the current layer receives contributions from all of the neurons of the previous one. That means that, with increasing input dimension, each neuron of the hidden layer is tied to more weights that correspond to noise in the input, which makes training harder. From another perspective, full connection also restricts the number of neurons in the first hidden layer, since a weight matrix between a high-dimensional input and a large number of neurons can quickly exhaust the memory of off-the-shelf GPU cards. The problem of CNNs is that they are not good at reducing the overall dimension if we want good performance. This can be seen from VGG-, ResNet- and GoogLeNet-like CNN architectures, all of which increase (usually double) the number of channels (i.e., the number of convolutional filters) when reducing the spatial dimension. As a result, the dimension of the final feature map is not reduced significantly compared to the input, or can even be increased (as in the 50-layer ResNet). Therefore, CNNs meet a dimension issue similar to that of MLPs when connecting the final feature map to an FC layer (an additional FC layer for further feature combination or the final classifier). Moreover, from the perspective of memory usage, it is even trickier to build a proper CNN instance when the input dimension is high, since in general the feature maps generated by the convolutions in the first several layers are already quite memory-consuming.
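The memory pressure of a fully-connected first layer can be estimated with simple arithmetic. The sketch below is only illustrative: the 1,000-neuron hidden layer and the 4-byte (single-precision) weights are assumptions, and biases, activations, gradients and optimizer states are ignored.

```python
def fc_weight_bytes(n_inputs, n_neurons, bytes_per_weight=4):
    """Memory taken by the weight matrix of one fully-connected layer (biases ignored)."""
    return n_inputs * n_neurons * bytes_per_weight
```

Under these assumptions, connecting a 250,000-sample trace to 1,000 neurons already needs 1 GB for the weights alone, and a 1,600,000-sample trace needs 6.4 GB, before any gradients or optimizer states are stored.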
We also note the popularity of using global average pooling at the end of CNNs, since this kind of pooling avoids the massive connection to FC layers. Nevertheless, a high-dimensional input means that even at the end of a network, the global average pooling still has to average the features over too many spatial positions. Intuitively, this introduces too much noise into the final feature vector. Some results in Section 4.3 imply that this kind of architecture does not really fit the end-to-end context. It is also important to point out that the state-of-the-art (w.r.t. a specific dataset) architectures in SCA do not entirely follow the rules of thumb of the image classification field. Yet, as mentioned in [WAGP20], results for these architectures on long traces are still lacking.
Design principles. According to the analysis above, we present the principles that we follow in our design:
1. Avoid heavy (full) connections between high-dimensional layers.
2. Avoid expanding the feature dimension while the spatial (temporal) dimension is still high.
3. Extract and compress fine-grained features with lightweight network structures in the first several layers.
4. Give the layers before the final classifier the ability to select and combine these fine-grained features.
Of course, with the richness of today's neural networks, there could be many other potential architectures solving the problems in practical SCA. In this paper, we propose one possibility and demonstrate that it is an actual step towards end-to-end profiling.

Encoder component
The first component of our architecture is the encoder component. It extracts and combines the leakages to obtain abstract features, so that the upper components can recover the underlying intermediate values. With the principles in mind, we use the LC layer and the recurrent layer (i.e., LSTM) as the junior and senior encoder, respectively, for synchronized traces. For desynchronized traces, we use stacked convolutional layers as the junior encoder instead, in order to extract a shift-invariant feature map. We emphasize that, when designed together with the upper components of our architecture, lightweight stacked convolutional layers do not raise the problems seen in typical CNNs when handling long raw traces.

Junior encoder
By default (in the synchronized situation), we use the LC layer as the junior encoder. In this paper, we always use a single LC layer with one filter per step. The filter size and stride are chosen by trial and error, with the guideline that the filter size should cover an integer number of clock cycles and be divisible by the stride. Empirically, the filter size is chosen as the length of one or two clock cycles, while the stride is half of the filter size so that neighboring windows overlap. The underlying idea is to compress the time samples of several clock cycles into a single feature sample. The overlap smooths this compression. As a result, one or two clock cycles are represented by several feature samples. The non-linear activation function is also omitted to keep the junior encoder a simple sequence of affine transformations.
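The effect of this guideline on the layer's shape can be worked out with a short calculation. The sketch assumes no padding, ignores biases, and uses a hypothetical trace of 100,000 samples with 50 samples per clock cycle; none of these numbers are taken from the datasets in this paper.

```python
def lc_layer_shape(trace_len, samples_per_cycle, cycles_per_filter=2):
    """Return (filter size, stride, output steps, weights) for the LC junior encoder."""
    size = samples_per_cycle * cycles_per_filter  # filter covers whole clock cycles
    stride = size // 2                            # half the filter size, so windows overlap
    steps = (trace_len - size) // stride + 1      # length of the output feature sequence
    weights = steps * size                        # one filter per step, no weight sharing
    return size, stride, steps, weights
```

With these hypothetical values, the 100,000 input samples are compressed to 1,999 feature samples using fewer than 200,000 weights, far fewer than a fully-connected layer of the same output width would need.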
One reason for using the LC layer is that it significantly mitigates the effect of high dimension, since each neuron only focuses on a narrow region of the input. Compared to the convolutional layer, the filter weights associated with a specific position can be fine-tuned independently and the output of each neuron is not affected by noise far from it, which leaves more freedom to extract independent and varied features. Similar conclusions can be drawn from a comparison with the FC layer. In short, the LC layer decouples each neuron from the whole high-dimensional input, which is crucial to generate high-quality fine-grained features. By its nature, the LC layer also reduces the dimension of the input data (if we set the stride to a proper value), and thus does not yield a high-dimensional internal output. From another angle, these local connections can be regarded as a sequence of transformations analogous to those in linear discriminant analysis (LDA). That is, ideally, each small region is transformed to be more distinguishable. Note that although the number of weights in the LC layer increases linearly with the input dimension, the increase in computation time of this layer is negligible since it can be well parallelized.
Why can the convolutional layer not be used in the same way? So far, it seems that, compared to the convolutional layer, the advantage of the LC layer in significantly reducing the dimension comes not from the layer itself but from the way we use it (i.e., one filter per step, the size of the filter and the value of the stride). Figure 3 gives some evidence that the convolutional layer cannot be used in the same way because of its different connection structure. The fundamental problem is that, in the convolutional layer, the start of the filter and the start of a clock cycle cannot always be aligned during sliding. This problem is caused by the desynchronization between the clock frequency of the device and the sampling rate of the oscilloscope (i.e., the sampling rate is not an integer multiple of the clock frequency), so that the length of each clock cycle is not exactly an integer number of samples (while the size of the filter and the stride must be integers). Figure 3b shows that clock cycle 101 actually starts at sample 14133, while the sliding filter expects it at sample 14128. Furthermore, this deviation accumulates with sliding (Figure 3c). Note that one purpose of sharing weights between steps in the convolutional layer is to reach an agreement that each filter extracts one particular pattern. The changing deviation means that the pattern in the fragment covered by the only filter is continuously shifting (this shifting is non-periodic, unlike when the stride is 1), which significantly harms the agreement among steps. In contrast, the steps in the LC layer do not try to reach an agreement with each other and thus are not affected by the deviation. As a result, the convolutional layer can hardly be used in the same way as our LC layer in practical scenarios.
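The accumulation of this deviation follows directly from the non-integer cycle length. The sketch below uses hypothetical values (a cycle of 141.05 samples and an integer stride of 141) chosen only for illustration; it assumes that the filter and the first clock cycle start at the same sample and that the stride is meant to advance by exactly one cycle.

```python
from fractions import Fraction

def filter_drift(cycle_len, stride, n_cycles):
    """Samples by which the true start of cycle n runs ahead of the n-th filter position."""
    return [n * cycle_len - n * stride for n in range(n_cycles + 1)]
```

With these values the filter is one sample late after 20 cycles and five samples late after 100 cycles, so a shared filter sees a slowly shifting fragment of the clock cycle, whereas an LC layer can adapt the weights of each step to its own offset.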
The convolutional layer is still preferred in desynchronized cases. Despite the shortcomings of typical CNNs, the convolutional layer itself is a powerful tool to extract shift-invariant features when the stride is set to 1 (which is the most common case). Therefore, we still use stacked convolutional layers to profile desynchronized traces. Thanks to the upper components of our architecture, we only need some lightweight (w.r.t. the number of filters) convolutional layers and do not really care about the final spatial dimension of the feature maps.

Senior encoder
The recurrent layer, specifically the LSTM, is used as the senior encoder in our architecture to further combine the fine-grained features extracted by the junior encoder. In our use case, the LSTM works in a seq2seq mode in which, at every time step, it takes a feature vector in and yields a feature vector out. Another basic setting of the LSTM is the number of units, also known as the dimension of the internal memory (and, by default, the dimension of the output vector at each step). In the practical scenarios we are concerned with, the number of units in the LSTM is set to 128 or 256 to balance the memory needed for the complex inner operations against the yielded feature dimension. The activation functions of the gates and of the internal memory are set to sigmoid and tanh, respectively.
One important reason for choosing the LSTM is to obey the design principles. Although the compression applied by the LC layer reduces the input dimension, the actual internal dimension of a network still depends on the specific dataset. From the perspective of designing a stable architecture, the risk of using an FC layer to combine features should be avoided. Moreover, once we change the junior encoder to stacked convolutional layers to deal with desynchronized traces, the LSTM is much less sensitive to the overall dimension because it avoids full connection, which makes our architecture more pluggable to different junior encoders. In short, the senior encoder should be able to combine the fine-grained features at any positions in a sequence of any length, while consuming as little memory as possible (which is important for profiling raw traces). In our opinion, the recurrent layer is the best choice among the layers available today.
Another reason is that the LSTM could learn to control the data flow in itself, and thus it could potentially learn very complex operations among the time steps. Specifically, it learns the weight matrices related to the input, forget and output gates that directly control the behaviors at each time step. That is, the LSTM could learn to remember useful information and forget (or just never remember) the useless after going through enough sequences. For the output at each time step, it is essentially a mapping of the sub-sequence until the current step, representing a potential useful abstract feature vector. This property helps to further extract and combine the fine-grained features which are (potentially) long-time dependent on each other (i.e., leakages spread among many clock cycles and/or the shares leak at different positions).
The purpose of using the seq2seq mode is that encoding the very long sequences of our target scenario into a single feature vector would make training very hard (the LSTM still faces the gradient issue in practice). As a result, we need some mechanism to reduce the difficulty of training. Ideally, if we can automatically choose some critical time steps, or at least shorten the time sequences, training will be much easier. Therefore, we select the seq2seq mode, which exposes the hidden state at each time step and thereby satisfies the necessary precondition.
Bi-directional LSTM. To better profile the time-dependency among the leakages, we utilize two LSTMs that go through the input sequence in the forward and backward directions, as depicted in Figure 4. The outputs of the two LSTMs can be concatenated along the feature axis, as in Figure 4a (corresponding to variant 1 of our architecture). Consequently, the output at each time step gets information not only from the past but also from the future. This concatenation intuitively makes the time steps between two leakages more informative. Besides, since one of the LSTMs reads the input backward, the order in which it accesses the sensitive leakages is reversed with respect to the forward one, so the two LSTMs may learn different kinds of feature combinations. Sometimes, we want neither the outputs of the LSTMs to interact with each other through the merging operation nor to limit the representational flexibility of the higher layers. Therefore, we can also keep the outputs of the two LSTMs independent, as shown in Figure 4b (corresponding to variant 2 of our architecture). In this case, we also use independent attention components to focus on the specific time steps of each direction. Concretely, we use $\overrightarrow{\mathsf{h}}_t$ and $\overleftarrow{\mathsf{h}}_t$ to represent the forward output and the backward output at time step $t$. For the former case, the output matrix $\mathsf{H}$ is

$$\mathsf{H} = \begin{bmatrix} \overrightarrow{\mathsf{h}}_1 & \overrightarrow{\mathsf{h}}_2 & \cdots & \overrightarrow{\mathsf{h}}_S \\ \overleftarrow{\mathsf{h}}_1 & \overleftarrow{\mathsf{h}}_2 & \cdots & \overleftarrow{\mathsf{h}}_S \end{bmatrix};$$

for the latter case, the two matrices are

$$\overrightarrow{\mathsf{H}} = \begin{bmatrix} \overrightarrow{\mathsf{h}}_1 & \overrightarrow{\mathsf{h}}_2 & \cdots & \overrightarrow{\mathsf{h}}_S \end{bmatrix} \quad \text{and} \quad \overleftarrow{\mathsf{H}} = \begin{bmatrix} \overleftarrow{\mathsf{h}}_1 & \overleftarrow{\mathsf{h}}_2 & \cdots & \overleftarrow{\mathsf{h}}_S \end{bmatrix},$$

respectively, where $S$ is the length of the input sequence.

Attention mechanism
In this section, we introduce the attention mechanism in our architecture, explaining why we need such a mechanism and how it works. We recall that the LSTM with sequential output is chosen to build our senior encoder in order to avoid encoding the whole sequence into a single feature vector. According to the properties of the LSTM, the underlying data flow at a specific time step indicates what information has been remembered, combined and yielded from the beginning up to the current step. Unless the LSTM has visited the leakages of all shares, it cannot generate an informative abstract feature that helps to recover the key. In contrast, once all clock cycles that leak sensitive information have been passed through, the rest of the sequence is just noise that disturbs the learning. This reasoning indicates that some time steps (at least one) are more informative than others. Once the most informative steps are found, both the time dimension and the training difficulty can be reduced simultaneously. The attention mechanism does exactly what we need.
In this paper, we create an analogue of the attention mechanism proposed by Bahdanau et al. [BCB15]. Our attention mechanism first evaluates the importance of each time step with a time-distributed FC layer that contains only one perceptron without bias. This layer shares its weights among the time steps (like a convolutional layer), and thus it scores each time step by the same standard. A batch normalization layer is appended to the output of this time-distributed FC layer to give each step additional flexibility to adjust its attention score independently. Finally, the scores are normalized (by a softmax function) to generate the probabilities used in the weighted sum. Concretely, our attention mechanism works as follows:
\[ \mathbf{a} = \mathrm{BN}\left(\mathbf{v}^{T} H\right), \qquad \mathbf{p} = \mathrm{softmax}(\mathbf{a}), \qquad \mathbf{c} = \sum_{t=1}^{S} p_t h_t, \]
where $\mathbf{v}$ is the trainable parameter vector of the single perceptron, whose dimension is the same as that of $h$, $\mathbf{v}^{T}$ denotes its transpose, $\mathbf{a}$ is the vector of attention scores, $\mathbf{p}$ is the vector of attention probabilities and $\mathbf{c}$ is the resulting weighted sum of the outputs $h_t$ (the columns of $H$).
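The scoring, normalization and weighted-sum steps described above can be sketched in NumPy as follows. This is a minimal illustration under assumptions, not the authors' code: the function name `attention_pool`, the use of training-mode batch statistics, and the optional per-step scale `gamma` and offset `beta` are choices made for this sketch.

```python
import numpy as np


def attention_pool(H, v, gamma=None, beta=None, eps=1e-3):
    """Attention over time steps.

    H: (B, S, d) sequence outputs for a batch of B traces.
    v: (d,) weights of the single bias-free perceptron, shared by all time steps.
    gamma, beta: optional (S,) per-step scale and offset of the batch normalization.
    Returns the weighted sum (B, d) and the attention probabilities (B, S).
    """
    scores = H @ v  # (B, S): every step is scored with the same weights
    # Batch normalization: statistics over the batch, separately for each time step.
    mean = scores.mean(axis=0, keepdims=True)
    var = scores.var(axis=0, keepdims=True)
    scores = (scores - mean) / np.sqrt(var + eps)
    if gamma is not None:
        scores = scores * gamma
    if beta is not None:
        scores = scores + beta
    # Numerically stable softmax over the time axis.
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Weighted sum of the step outputs.
    context = np.einsum("bs,bsd->bd", probs, H)
    return context, probs
```

Because the perceptron weights are shared across steps, the per-step offset `beta` is the only part of this sketch that can favor a particular position in the trace independently of its content.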
Here we give two detailed interpretations of why our attention mechanism helps from the training viewpoint. 1) The attention mechanism essentially computes a weighted sum over all time steps and generates a single vector as the representation. In other words, through the attention probabilities it controls the proportion of the gradient that is backpropagated through each step of the LSTM. The network is therefore mainly updated by the gradients associated with high attention probabilities, which automatically resolves the problem of not knowing which time steps are most informative. 2) The attention mechanism can also be regarded as a kind of soft time truncation. The time truncating method [Jae02,Sut13] uses two parameters s_1 and s_2 to control the number of forward-pass time steps between updates and the number of time steps to which backpropagation through time is applied, respectively. From this point of view, s_1 in our method is set by the attention probabilities, but its value is variable and given by the interval between the steps where the probability is large enough. Meanwhile, s_2 is also variable and always equals the current step count. Furthermore, unlike an actual time truncating method, our attention mechanism can also be regarded as truncating the steps after the last significantly attended step, and thus it reduces the influence of the noise that follows once all necessary leakages have been accessed.
Why not utilize the attention mechanism after the junior encoder? To answer this question, we recall that a high-order DPA requires combining the leakages of the shares. According to [DZFL14], the centered product is the best possible combination function in noisy situations. If the attention mechanism is applied directly to the fine-grained features, however, these features are combined by a weighted addition, i.e., a sub-optimal operation is chosen to combine the leakages. From our point of view, the gated recurrent layer is a better candidate to learn the best possible combination. More evidence is presented in Section 4.3, where the network with the attention mechanism on the junior encoder (i.e., stacked convolutional layers) performs much worse than those with the attention mechanism on the senior encoder (i.e., LSTM), even when the junior encoder is made more complex.
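The gap between a weighted addition and the centered product can be checked on a toy model. The sketch below assumes noiseless Hamming-weight leakage of two Boolean shares of an 8-bit value and enumerates all value/mask pairs; the leakage model and all names are assumptions of this example, not a description of the measured devices.

```python
import numpy as np

# Hamming weight of every byte value.
HW = np.array([bin(v).count("1") for v in range(256)], dtype=float)

# Enumerate every (sensitive value, mask) pair exactly once.
values, masks = np.meshgrid(np.arange(256), np.arange(256), indexing="ij")
values = values.ravel()
masks = masks.ravel()

leak_mask = HW[masks]             # leakage of the first share (the mask)
leak_masked = HW[values ^ masks]  # leakage of the second share (the masked value)
target = HW[values]               # what a second-order attack wants to predict


def centered_product(a, b):
    """Centered product combining function."""
    return (a - a.mean()) * (b - b.mean())


def weighted_sum(a, b, w=0.5):
    """Weighted addition, i.e. what a plain weighted sum over two features computes."""
    return w * a + (1.0 - w) * b


corr_product = np.corrcoef(centered_product(leak_mask, leak_masked), target)[0, 1]
corr_sum = np.corrcoef(weighted_sum(leak_mask, leak_masked), target)[0, 1]
print(f"centered product: {corr_product:.4f}, weighted sum: {corr_sum:.4f}")
```

In this model, each share taken alone is statistically independent of the sensitive value, so any weighted sum of the two leakages is uncorrelated with the target, whereas the centered product retains a correlation of magnitude 1/sqrt(8).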

Classifier component
To recover the targeted intermediate value, we need a component at the end of our architecture that classifies the abstract features obtained so far. In general, an FC layer with softmax as the activation function is the most common output layer in classification tasks. The softmax normalizes the logits and makes them more interpretable (i.e., as class probabilities). In the context of SCA, these probabilities are necessary to mount the subsequent attack, and thus we adopt this format for our output layer. That is, our classifier component is simply an FC layer with a softmax activation function.

Experimental results of profiling and attack
In this section, we provide experimental results to further explore the properties of our architecture. We show the profiling and attack results on several datasets to evaluate the feasibility of the architecture when it faces raw traces without any feature extraction. We carry out these experiments on public datasets (e.g., DPA contest, ASCAD) and on traces collected from an ATmega128A microcontroller (we refer to this dataset as AT128 hereafter). Before the detailed results of each dataset, we briefly introduce the basics of the implementation. Nevertheless, we again emphasize that none of our experiments uses these basics as necessary prior knowledge. In line with the 'no-free-lunch' theorem, we build the networks and configure the training differently for each dataset. On each dataset, we present the results of the best network instance we obtained. For conciseness, the detailed topologies of the networks are given in Appendixes A to C, while a brief index of experiments is given in

Results on synchronized traces
In this section, we present the results of profiling and attack on synchronized datasets. We use the unmasked Sbox output in the first round as the label (i.e., Sbox(x ⊕ k)). The guessing entropy is calculated from 100 random attacks and is plotted as a function of the number of traces used in the attack to evaluate efficiency. The only pre-processing applied to the traces is standardization (zero mean, unit variance).
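The guessing entropy metric used throughout this section can be sketched as follows: for each random attack, the log-probabilities that the network assigns to Sbox(x ⊕ k) are accumulated for every key guess k, and the rank of the true key is averaged over the attacks. The code is a minimal sketch under assumptions: the function names are hypothetical, a random permutation stands in for the AES Sbox, and the "network output" in the demonstration is an idealized one.

```python
import numpy as np


def key_rank(log_probs, plaintexts, true_key, sbox):
    """Rank of the true key byte (0 = best) given per-trace class log-probabilities.

    log_probs: (N, 256) log-probabilities over the Sbox output.
    plaintexts: (N,) integer plaintext bytes. sbox: (256,) integer lookup table.
    """
    keys = np.arange(256)
    # hypotheses[n, k] = Sbox(x_n XOR k), the label predicted under key guess k.
    hypotheses = sbox[plaintexts[:, None] ^ keys[None, :]]
    scores = np.take_along_axis(log_probs, hypotheses, axis=1).sum(axis=0)
    return int(np.sum(scores > scores[true_key]))


def guessing_entropy(log_probs, plaintexts, true_key, sbox,
                     n_traces, n_attacks=100, seed=0):
    """Average rank of the true key over n_attacks random subsets of n_traces traces."""
    rng = np.random.default_rng(seed)
    ranks = []
    for _ in range(n_attacks):
        idx = rng.choice(len(plaintexts), size=n_traces, replace=False)
        ranks.append(key_rank(log_probs[idx], plaintexts[idx], true_key, sbox))
    return float(np.mean(ranks))


# Demonstration with an idealized classifier (assumption: stand-in Sbox, perfect outputs).
rng = np.random.default_rng(42)
demo_sbox = rng.permutation(256)          # stand-in bijection, NOT the real AES Sbox
demo_key = 0x2B
demo_pts = rng.integers(0, 256, size=200)
demo_log_probs = np.full((200, 256), -20.0)
demo_log_probs[np.arange(200), demo_sbox[demo_pts ^ demo_key]] = 0.0
demo_ge = guessing_entropy(demo_log_probs, demo_pts, demo_key, demo_sbox, n_traces=5)
```

With a real network, `log_probs` would be the logarithm of the softmax outputs on the attack traces, and the curve in the figures corresponds to evaluating `guessing_entropy` for increasing `n_traces`.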

DPA contest v4
The DPA contest v4 provides measurements of two protected implementations of AES. The first version, v4.1, implements a masked AES-256 with Rotating Sbox Masking (RSM), while the second version, v4.2, implements a masked AES-128 with the same masking scheme in assembly code. Both are implemented on an ATmega163 smart card and measured through electromagnetic radiation. We refer to [NSGD12] and [BBD+14] for more details about these two implementations.
Training setups for v4.1 set. The v4.1 set provides 100,000 traces. We choose the first 39,000 traces for profiling and the following 1,000 traces for validation and attacks. We remark that the organizers of the DPA contest used a fixed key to acquire the traces of v4.1, which could be a problem in the end-to-end scenario. Since the key is fixed, the bijection from the plaintext to the output of the Sbox is also fixed. In this case, from the viewpoint of the neural network, classifying the traces according to Sbox(x ⊕ k) is equivalent to classifying them according to x. The network might thus learn to classify the traces by the leakages of the plaintext, which is not intended. Consequently, the first 10,000 time samples are removed to exclude the disturbance of the plaintext⁷. Since the plaintext is always known, this can easily be carried out. The number of time samples is then reduced to about 420,000. In this dataset, the first byte is chosen as the target. The network is built according to our architecture variant 1 in Figure 5a. Finally, it is trained with the Adam optimizer, batch size 32 and learning rate 0.0001.
Training setups for v4.2 set. We use all 16 subsets, which are associated with different keys, to train networks on this dataset. For each subset, we choose 4,500 traces for profiling and 500 traces for validation and attacks. The raw traces contain over 1,600,000 time samples, which is too many for practical profiling within limited time, and thus we use the first 300,000 time samples (covering almost the entire first round of AES). We also focus on the first byte in this dataset and build the network according to our architecture variant 1. Finally, the network is trained with the Adam optimizer, batch size 64 and learning rate 0.0001.
The results on the v4.1 dataset are given in Figure 6. As the figure shows, the validation loss saturates after about 50 epochs, while the corresponding validation accuracy keeps increasing to about 20%. The network with the lowest validation loss is used to conduct 100 random attacks. The result with respect to the number of traces in the attack is depicted on the right-hand side of the figure. This result is very similar to the best performances reported in recent papers (see Appendix F)⁸, although we neither use any prior knowledge of the mask value nor select the PoIs. For further comparison, the result obtained when the mask value is known (which is the case in the referred papers) is shown in the second row of Figure 6. In this case, the key can be recovered with only a single trace. The results on this dataset verify that even though the raw traces are extremely long, our network can extract and classify the leakages very efficiently without manual selection of PoIs.
The results on the v4.2 dataset are given in Figure 7, where we show the loss, accuracy and guessing entropy for the first byte in each subset. The loss and accuracy curves suggest that the 7th subset (with partial key 0x2F) is harder to attack, which is confirmed by the guessing entropy in the attack phase. One intuitive explanation of this phenomenon is that specific mask candidates hinder the profiling for some key values. However, since the organizers do not provide more traces, further investigation of this observation is left to future work. Apart from the unbalanced performance among different subsets, all of the attacks reduce the guessing entropy to 0 within a reasonable number of traces.

ASCAD
Our architecture mounts successful attacks on the trace sets of DPA contest v4. However, both implementations use the lightweight RSM scheme. Therefore, the ASCAD dataset [PSB+18] is selected to validate attacks on a Boolean masking scheme in which the mask value is completely random. Moreover, the most recently released version of this dataset is a well-suited baseline to test our architecture, as the raw traces are long enough and the key in the training set is variable. We refer to the ASCAD dataset with fixed (resp., variable) key as ASCAD v1 (resp., v2) hereafter. In ASCAD, a masked AES-128 algorithm is implemented on an 8-bit AVR microcontroller (ATmega8515). The measurements are collected through electromagnetic radiation. The masking scheme makes use of the classical table re-computation method introduced in [AG01, PR07].
Training setups for v1 set. For this trace set, we use 50,000 traces for profiling and 10,000 traces for validations and attacks. In each raw trace, there are 100,000 time samples, which are directly used in experiments. Since the masks of the first two bytes are set to 0 by the authors for elementary attack tests, the third byte is selected as the target. The network for this dataset is built by our variant 2 in Figure 5b. Finally, it is trained with Adam optimizer, batch size 8 and learning rate 0.0001.
Training setups for v2 set. For the v2 trace set, we use all of the 200,000 traces with a variable key to train the network and 10,000 traces with a fixed key to validate and attack. The raw traces of ASCAD v2 contain 250,000 time samples, which are directly used in our profiling. For the same reason as for the v1 set, the third byte is selected as the target. The network for this dataset is also built according to our architecture variant 2. Finally, it is trained with the Adam optimizer, batch size 200 and learning rate 0.0001.
We give the results on the ASCAD v1 dataset in Figure 8, observing that our trained network reduces the guessing entropy to 0 very efficiently (within 10 traces), even though the raw traces are used. To the best of our knowledge, this result is state-of-the-art, even compared to the results based on selected PoIs (see Appendix F). In addition, we note that there is a step at about 380 epochs in the curves of both validation loss and accuracy. We infer that the learning rate is too large when training reaches epoch 380, and thus the network degrades. In other words, we could further optimize the network by using a lower learning rate after 380 epochs. We do not dive into this learning rate tuning, as our main target is end-to-end profiling rather than finding an optimal network instance.
The results on the ASCAD v2 dataset are presented in Figure 9. On this dataset, again, the partial key is correctly recovered within 10 traces. To the best of our knowledge, this is a state-of-the-art attack result on this dataset, and there is no earlier report of directly profiling and attacking the raw traces of this dataset. We observe that the validation loss increases from epoch 25 onward, which usually indicates overfitting in profiling. However, the validation accuracy is also increasing, and we verify that the network is still improving (w.r.t. guessing entropy). We infer that the reason for this unusual observation is that the network capacity is not tuned to be optimal and/or the amount of data is still not quite enough for end-to-end profiling. As a result, the network becomes less confident about the overall classification result while classifying more test traces near the decision margins correctly. Despite this, our network instance is already good enough for a practical end-to-end attack.
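That the validation loss and the validation accuracy can rise together is easy to reproduce on a toy example. The sketch below uses a hypothetical two-class problem with four validation traces (the real task has 256 classes); the numbers are invented purely for illustration.

```python
import numpy as np


def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))


def accuracy(probs, labels):
    """Fraction of samples whose most probable class is the true label."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))


labels = np.array([0, 0, 0, 0])

# Earlier epoch: one trace classified with high confidence, three narrowly wrong.
early = np.array([[0.99, 0.01],
                  [0.49, 0.51],
                  [0.49, 0.51],
                  [0.49, 0.51]])

# Later epoch: all four traces correct, but every decision sits close to the margin.
late = np.array([[0.52, 0.48],
                 [0.52, 0.48],
                 [0.52, 0.48],
                 [0.52, 0.48]])

for name, probs in (("early", early), ("late", late)):
    print(name, "loss:", round(cross_entropy(probs, labels), 4),
          "accuracy:", accuracy(probs, labels))
```

The later epoch has the higher loss because it gives up the one confident prediction, yet its accuracy is higher because the marginal cases moved to the correct side. Since a key-recovery attack accumulates evidence over several traces, such a network can still perform well in terms of guessing entropy.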

AT128
For a more thorough study of our network architecture, we acquire the power consumption of a device that we fully control (an 8-bit AVR microcontroller, ATmega128A). To control the leakages better, we do not actually implement a protected AES. Instead, we simulate the leakages of sensitive internals by processing an input array (of 9 bytes) with assembly instructions such as LD and ST, which is very similar to what Masure et al. presented in [MDP20]. We conduct two simulations, where the only difference is that we insert many more redundant NOP instructions in the second one to move the leakages of the individual bytes further apart. We thereby simulate a more extreme leakage situation in which end-to-end profiling is more difficult. We refer to the datasets acquired from these two implementations as AT128-N (N for near) and AT128-F (F for far). Both are collected with the Pico 3203D oscilloscope at a sampling rate of 125 MS/s while the chip runs at 11 MHz. With our acquisition setup, each dataset contains 200,000 traces and each trace contains 47,000 time samples.
Training setups. We carry out experiments on both datasets using 190,000 traces for profiling and 10,000 traces for validation and attacks. On these two datasets, we use the simulated output of the Sbox as the label (i.e., the XOR of two input bytes, which simulate m and Sbox(x ⊕ k) ⊕ m). The network for AT128-N is built according to our architecture variant 1. It is trained with the Adam optimizer, batch size 128 and learning rate 0.001 with decay rate 0.8 per 30 epochs. The target of AT128-N is x_4 ⊕ x_6. The network for AT128-F is built according to our architecture variant 2, and its target is x_2 ⊕ x_7. The training setups are the same as for AT128-N, except that the batch size is 16. To further compare the universality of these two variants, the best networks for each dataset are exchanged and retrained from scratch (with their original training setups).
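The learning rate schedule mentioned in the training setups (initial rate 0.001, decay rate 0.8 per 30 epochs) can be written as a one-line function. The sketch assumes a staircase decay, i.e., that the rate is multiplied by 0.8 once every 30 epochs; the text does not state whether the decay is stepwise or continuous, and the function name is hypothetical.

```python
def step_decay_lr(epoch, base_lr=1e-3, decay_rate=0.8, step=30):
    """Learning rate at a given epoch (0-indexed), assuming staircase decay."""
    return base_lr * decay_rate ** (epoch // step)


# Learning rate at the start of each of the first four 30-epoch stages.
schedule = [step_decay_lr(e) for e in (0, 30, 60, 90)]
```

Under this assumption the rate stays at 0.001 for epochs 0 to 29 and drops to 0.0008 at epoch 30.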
We first show the results on the AT128-N dataset in Figure 10. The networks built from the two variants perform similarly in terms of guessing entropy: both need only a few traces to recover the key. In contrast, the network built from variant 2 takes about 200 epochs to reach an accuracy of 50%, which is much better than the network built from variant 1. The results on AT128-F are given in Figure 11. On this dataset, the trend of the differences between the two architecture variants is similar to that on AT128-N: the network built from variant 2 still performs better.
According to the results, the network based on variant 2 handles both leakage conditions better. Recalling that the batch size of variant 2 is 16 compared to 128 for variant 1, the advantage of variant 1 is that it can converge with a larger batch size when the number of training samples is the same, and thus yields a sound network for the attack faster. For completeness, we point out that variant 1 cannot reach the same accuracy as variant 2 (about 3% lower on AT128) even when its batch size is set to 16. Furthermore, we could not find an instance of variant 1 that successfully attacks the ASCAD datasets, which indicates that variant 2 is potentially more suitable for complicated leakages. We remark that although these are empirical conclusions, they help to simulate a more knowledgeable adversary. In contrast, on the public datasets we still choose the better architecture variant by trial and error.

Choices of the filter size and stride in the LC layer
Finally, we present more results on the influence of the filter size and stride in the LC layer. Although we have stated in Section 3.2.1 that setting the filter size to the length of one or two clock cycles and the stride to half of the filter size is a good empirical choice for general cases, these two hyper-parameters are by no means limited to such values. Choosing them is a trade-off between the local resolution (related to the quality of the fine-grained features) and the number of time steps (related to the training time). There is usually a wide range of filter sizes and strides for which the networks converge. To demonstrate this, we show several search tests on AT128-N and ASCAD v1 (both tested with architecture variant 2) in Figure 12.
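The trade-off described above can be quantified by counting the time steps that the LC layer passes on to the LSTM. The helper below is a sketch that assumes no padding (each filter must fit entirely inside the trace); the function name is hypothetical. The example uses the ASCAD v1 figures quoted in this paper: 100,000 samples per raw trace and 52 samples per clock cycle.

```python
def lc_output_steps(n_samples, filter_size, stride):
    """Number of output time steps of a 1-D locally connected layer without padding."""
    if filter_size > n_samples:
        raise ValueError("filter is longer than the trace")
    return (n_samples - filter_size) // stride + 1


# Filter of one clock cycle, stride of half a cycle.
steps_one_cycle = lc_output_steps(100_000, filter_size=52, stride=26)
# Filter of two clock cycles, stride of one cycle.
steps_two_cycles = lc_output_steps(100_000, filter_size=104, stride=52)
print(steps_one_cycle, steps_two_cycles)
```

Doubling both the filter size and the stride roughly halves the sequence length seen by the recurrent layer, at the cost of a coarser local resolution.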

Results on desynchronized traces
In this section, each trace of the original raw dataset is randomly shifted within an interval (the shift is fixed during profiling and attack) to simulate random delay. The networks are similar to the ones trained on synchronized traces, except that the junior encoder is changed from an LC layer to stacked convolutional layers. Since all of the datasets we use are roughly centered, we approximately rescale the traces to the interval from -1 to 1 by dividing them by an integer. To make the networks converge stably in the desynchronized scenario, we also apply data augmentation, i.e., additional shifts, to expand the profiling set. We remark that the value of the additional shift for augmentation varies during profiling. We also point out that, because of the data augmentation, the guessing entropies shown in this section are not directly comparable to the ones given in Section 4.1.
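The desynchronization and augmentation procedure can be sketched as two applications of a random-shift operation: one drawn once per trace to build the desynchronized dataset, and one redrawn whenever a batch is formed during profiling. The sketch below implements the shift as a cropped window, which is an assumption of this example (the text does not specify how the shift is realized); all names and the trace dimensions are hypothetical.

```python
import numpy as np


def random_shift(traces, max_shift, out_len, rng):
    """Crop out_len samples from each trace at a random offset in [0, max_shift)."""
    if traces.shape[1] < out_len + max_shift - 1:
        raise ValueError("traces are too short for the requested shift range")
    shifts = rng.integers(0, max_shift, size=len(traces))
    shifted = np.stack([t[s:s + out_len] for t, s in zip(traces, shifts)])
    return shifted, shifts


def rescale(traces, divisor):
    """Divide by an integer so that roughly centered traces fall near [-1, 1]."""
    return traces.astype(np.float32) / divisor


rng = np.random.default_rng(0)
raw = rng.integers(-64, 64, size=(4, 1000))   # stand-in for raw traces

# Desynchronization: drawn once, then fixed for profiling and attack.
desync, fixed_shifts = random_shift(raw, max_shift=52, out_len=900, rng=rng)

# Augmentation: an additional shift, redrawn every time a profiling batch is built.
batch, aug_shifts = random_shift(desync, max_shift=80, out_len=800, rng=rng)
batch = rescale(batch, 64)
```

Calling the augmentation step again on the same `desync` traces yields differently shifted copies, which is how the profiling set is effectively expanded.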

Desynchronized ASCAD
Settings on v1 set. On this dataset, the length of the delay interval is set to 52 (the length of one clock cycle) and 100 to evaluate our network under different degrees of random delay. The length of the augmentation interval is set to 80 for both cases. The time samples are divided by 64. The training setups are the same as the synchronized dataset.
Settings on v2 set. On this dataset, the length of the delay interval is set to 126 (the length of one clock cycle) and 200. The length of the augmentation interval is set to 20 for both cases. The time samples are divided by 128. The batch size is reduced to 40, as even the lightweight stacked convolutional layers require more GPU memory than the LC layer.
The results on ASCAD v1 and v2 are presented in Figure 13. As the traces in the ASCAD v1 dataset are too few to support end-to-end profiling in the desynchronized cases, a large augmentation interval is chosen to expand the profiling set 80 times. With such an expansion, our network needs about 20 traces to recover the key. Although, considering the data augmentation, our result is not strictly better than the state-of-the-art results [ZBHV20,WAGP20] obtained on the PoIs, it is still the first end-to-end result in the desynchronized context for ASCAD. Moreover, we show in Section 4.3 that the networks designed for PoIs do not necessarily fit the end-to-end scenario (even with the same level of augmentation). For the v2 dataset, thanks to the better measurement quality and the larger number of profiling traces, our network reduces the guessing entropy to 0 within 10 traces with less augmentation.

Desynchronized AT128
Settings on AT128-N. We set the length of the delay interval to 44 (the length of 4 clock cycles) and 100 on this dataset. The length of the augmentation interval is set to 6 and the batch size is reduced to 100. The values of the time samples are divided by 16384 (2^14).
Settings on AT128-F. We also set the length of the delay interval to 44 and 100 on this dataset, since the length of one clock cycle is the same for the two datasets. The length of the augmentation interval is set to 6 and 20 for the delay intervals 44 and 100, respectively. The batch size is also reduced to 100.
The results on AT128-N and AT128-F are presented in Figure 14. As can be observed, in all four cases we reduce the guessing entropy to 0 within 4 traces. We apply more augmentation on AT128-F when the delay interval is set to 100. Otherwise, the guessing entropy needs over 100 traces to reach 0, which indicates that AT128-F is a more difficult dataset for end-to-end profiling, especially when desynchronization occurs.

More experiments for comparison
In this section, we conduct more experiments to compare our architecture with others that are potentially effective in end-to-end profiling.

Global average pooling and simplifications of our architecture
First, we explore CNNs with global average pooling and a simplification of our architecture (i.e., applying attention directly to the junior encoder without the LSTM, as mentioned in Section 3.3). Concretely, we build the CNNs with global average pooling by modifying the networks proposed for desynchronized traces. We simply increase the number of filters in each convolutional layer to make sure that the final feature dimension is the same as that of the LSTM output. For the simplification of our architecture, we also increase the number of filters in each convolutional layer and remove the LSTM layers. To provide a baseline, we consider the datasets AT128-N and ASCAD v1 with the same setups (i.e., delay and augmentation intervals) as in the desynchronized scenarios. We also keep the training setups the same unless we have to reduce the batch size (as the number of filters in the convolutional layers is increased). More details of the networks are given in Appendix C.
As depicted in Figure 15, both the CNNs with global average pooling and the simplification of our architecture fail to carry out a successful attack on these two datasets. The results of global average pooling indicate that it can mix the features up when the final spatial dimension is still high. Meanwhile, the results of the simplification of our architecture imply that the recurrent layer is critical for combining features. A simple attention structure (weighted sum) is far from enough.

Applying the state-of-the-arts to raw traces
In addition, we apply the networks proposed in [ZBHV20] and [ZS20] to the raw traces of the ASCAD v1 dataset. For the network proposed by Zaid et al. [ZBHV20], we use the one cycle policy with more epochs (100/200), since the raw traces are more difficult to profile. As the training is conducted on raw traces, we have to reduce the batch size to 50 instead of the 100 used in the original paper. For the ResNet proposed by Zhou et al. [ZS20], besides testing the original network, we also replace the global average pooling with the attention mechanism. We reduce the batch size to 8 (as the ResNet is quite memory-consuming) and train both networks for 200 epochs. We illustrate the results in Figure 16.
The above experiments are mounted on the desynchronized traces, where our architecture does not use the LC layer to reduce the dimension either. From the viewpoint of just comparing different architectures, the results are clear enough. Nevertheless, the networks proposed in [ZBHV20] and [ZS20] are not designed for raw traces. To make the comparison as fair as possible, we also insert an additional LC layer into these two networks as the first layer to reduce the dimension beforehand. To this end, we use the synchronized traces, as the LC layer cannot handle the desynchronized scenario. The further results, given in Figure 17, show that the modified networks still do not carry out a successful attack. This negative observation indicates that the LC layer does not essentially help CNN architectures to profile raw traces. With all these investigations, we find that transferring even the state-of-the-art architectures from PoIs to raw traces is not trivial.

Investigations of attention mechanism
Although the networks built from our architecture converge on raw traces and carry out successful attacks, we are more interested in how the networks actually achieve this, specifically, whether the attention mechanism works as we designed. To this end, we investigate the attention probabilities in the trained networks together with the results of leakage detection, the gradients and the states of the LSTMs. Through these investigations, we reveal how our architecture accomplishes the feature extraction and combination. In this part of the paper, we focus on the networks trained on the datasets AT128-N and ASCAD v1.

Leakages, gradients and attention probabilities
For a deep learning process conducted on raw traces, one of the most important issues is whether the network can extract the PoIs automatically. In this section, we first give the results of leakage detection, which indicate the positions of the leakages. Then we show the absolute gradient of the network (w.r.t. the inputs), which corresponds to the positions to which the network is sensitive. If the positions indicated by the leakage detection and the gradient are highly consistent, we can reasonably infer that the networks extract the PoIs successfully. In addition, by showing the attention probabilities, we can disclose whether the attention mechanism helps to locate the PoIs. We illustrate the results of a network (based on architecture variant 1) trained on AT128-N in Figure 18. We first depict the SNRs as the results of the leakage detection of x_4 and x_6⁹ in subfigure (a). Comparing them to the absolute gradient in subfigure (b), we see that the positions of the distinct peaks in the gradient and the SNRs are highly consistent with each other, and thus our network extracts the PoIs accurately from raw traces. Besides, when comparing to the attention probabilities, we find that the peaks of attention lie to the left of those in the gradient and SNRs, but at a very close distance. This observation can be explained by the network being dominated by the backward LSTM. As we use the attention mechanism with an LSTM, a distinct peak in attention does not only indicate that the current time step of the LSTM is important, but also implies that the corresponding interval on the raw trace before this time step (according to the direction) contains information related to the classification target. Consequently, it is reasonable that the attention mechanism pays particular attention to the time steps after the backward LSTM has accessed the leakages of both x_4 and x_6.
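The SNR used for leakage detection here can be computed per time sample as the variance of the class-conditional means divided by the mean of the class-conditional variances. The sketch below uses this common definition, which we assume matches the one used for the figures; the function name and the synthetic Hamming-weight leakage in the demonstration are assumptions of this example.

```python
import numpy as np


def snr(traces, labels, n_classes=256):
    """Signal-to-noise ratio per time sample.

    traces: (N, T) float array, labels: (N,) integer values of the intermediate.
    """
    means, variances = [], []
    for c in range(n_classes):
        group = traces[labels == c]
        if len(group) == 0:
            continue  # skip values that never occur in the trace set
        means.append(group.mean(axis=0))
        variances.append(group.var(axis=0))
    signal = np.var(np.stack(means), axis=0)    # variance of the class means
    noise = np.mean(np.stack(variances), axis=0)  # average within-class variance
    return signal / noise


# Synthetic check: Gaussian noise everywhere, Hamming-weight leakage at sample 17.
rng = np.random.default_rng(0)
hw = np.array([bin(v).count("1") for v in range(256)], dtype=float)
demo_labels = rng.integers(0, 256, size=5000)
demo_traces = rng.normal(size=(5000, 50))
demo_traces[:, 17] += hw[demo_labels]
demo_snr = snr(demo_traces, demo_labels)
print("peak at sample", int(np.argmax(demo_snr)))
```

On real traces, the peaks of this curve for each share mark the positions against which the gradient and attention plots are compared.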
Similarly, in Figure 19, we show the results of a network (based on architecture variant 2) trained on ASCAD v1. To thoroughly identify the leakages, we calculate SNRs on two pairs of intermediate values, namely (Sbox(x_3 ⊕ k_3) ⊕ m_3, m_3) and (Sbox(x_3 ⊕ k_3) ⊕ m_out, m_out), which are presented in subfigures (a) and (b), respectively. We then give the absolute gradient (w.r.t. the inputs) of this network, showing that the network is sensitive to the positions where the mask and the masked value leak. It seems that the leakages of Sbox(x_3 ⊕ k_3) ⊕ m_3 and m_3 are preferred by our network. Note that since we use architecture variant 2 this time, there are two attention instances in the trained network. The attention probabilities of the forward and backward attention instances are simultaneously shown in subfigure (f), and compared to the SNRs in subfigure (d) and (e) respectively. According to Figure 19(

Interactions between the gate state and attention
Since the data flow in the LSTM is controlled by its internal gates, quantitatively evaluating how the attention affects it is quite tricky. Nevertheless, we can gain some insight into the internal mechanisms of the LSTM by studying the gate activations when the networks process test data. We first select some active units in the LSTM of a network trained on AT128-N. We judge the activeness of a unit based on the gradient of the network (w.r.t. the output of the LSTM), where the most active unit is the one that accumulates the largest absolute gradient over all time steps. In Figure 20, we plot the activations of the input gate in units 10, 71 and 121. We observe that the input gates are highly activated shortly before the attended time steps, which is concrete evidence that the LSTM does extract information before the attention asks it to yield. The activations of the input gates are also highly related to the SNRs, where the two peaks correspond to the leakages of x_4 and x_6, which indicates that the input gates let more information in when it is needed. These observations are consistent with the results in the last subsection that the attention mechanism is highly related to the feature extraction and combination. Although such a clear behavior of the input gate cannot be observed in all units of the LSTM (it is visible in about 20 units of the network we explored), this result verifies that the attention mechanism does help the LSTM to shorten the span of time steps during which the memory must be kept.
Then we explore the statistics of the input, forget and output gates to show the behavior of the LSTM from a more macroscopic view. We are particularly interested in the distributions of the saturation regimes of the gates, where we define a gate to be left or right saturated if its activation is less than 0.1 or more than 0.9, respectively, and unsaturated otherwise.
To illustrate these distributions, we calculate the fraction of times for each unit that the gates get into left or right saturated. The results are depicted in Figure 21 (with an unconverged network for comparison), where the units are collected from 10 test samples, and thus there are 1280 dots in each subfigure.
In the context of side-channel analysis, the number of PoIs is usually limited, and hence ideally the input gate of a unit should be right saturated (letting the information in) in a few time steps and left saturated in the rest. That is, units located at the upper left of the plot (but not on the y axis, since the fraction of right-saturated steps should not be 0) represent the optimal scenario. The analysis of the output gate follows the same logic. As for the forget gate, how long the memory should be kept really depends on the implementation. Generally speaking, in the side-channel field, the LSTM needs to be trained to find the differences (informative or not) among time samples, so that gates showing both kinds of saturation (i.e., behaving more differently among steps) are expected to perform better than those that are never saturated or are saturated in only one direction (i.e., units on the axes).
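The saturation statistics discussed above reduce to counting, for each unit, the fraction of time steps in which a gate activation falls below 0.1 or above 0.9. A minimal sketch, with a hypothetical function name and toy activations invented for illustration:

```python
import numpy as np


def saturation_fractions(gate_acts, low=0.1, high=0.9):
    """Fraction of time steps in which each unit's gate is left or right saturated.

    gate_acts: (S, units) gate activations in [0, 1] for one test trace.
    Returns two arrays of shape (units,): left fractions and right fractions.
    """
    left = np.mean(gate_acts < low, axis=0)
    right = np.mean(gate_acts > high, axis=0)
    return left, right


# Toy input-gate activations: 4 time steps, 2 units.
acts = np.array([[0.05, 0.50],
                 [0.95, 0.50],
                 [0.02, 0.95],
                 [0.01, 0.20]])
left_frac, right_frac = saturation_fractions(acts)
# Unit 0 is closed most of the time and opens briefly, the pattern described as ideal
# for an input gate; unit 1 is mostly unsaturated.
```

Each (left, right) pair corresponds to one dot in a scatter plot of the kind shown in Figure 21; 1280 dots from 10 test samples correspond to 128 units per sample.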

Conclusion
In this paper, we propose an end-to-end profiling approach by introducing a new neural network architecture that consists of encoders, attention mechanisms and a classifier. Compared to the architectures currently popular in side-channel analysis, such as CNNs and MLPs, our architecture can directly profile traces that are significantly longer (e.g., raw traces). This property makes our architecture more suitable for end-to-end profiling of implementations protected by masking, since selecting PoIs under this condition is quite challenging.
We build the networks guided by our architecture and conduct the attacks on several datasets. To the best of our knowledge, we are the first to successfully carry out end-to-end profiling directly on raw traces that each contain over 100,000 time samples. With these trained networks, we can break the implementations in the datasets, such as DPA contest v4 and ASCAD, with very few traces, in some cases even fewer than networks trained on length-reduced datasets require. By replacing the LC layer used as the junior encoder with stacked convolutional layers, we can also handle the desynchronized cases. To further explore how our architecture works, we investigate the attention mechanism and find that it is highly related to the gradient and the behaviors of the LSTM. These investigations indicate how the attention mechanism helps to accomplish feature extraction and combination. Finally, we believe our approach is a first step towards end-to-end profiling in the side-channel context.
Despite the successful attacks, there is still room to improve our architecture. One possible direction is replacing the recurrent layers, since recurrent structures cannot be parallelized and thus slow down the training process. The self-attention proposed in [VSP+17], which can be parallelized, is a promising candidate, as it has set off a revolution of abandoning LSTMs in the NLP field. We leave this as future work. Considering the richness of today's neural networks, it would be interesting to explore more new architectures and analyze their effectiveness for SCA.

size of training sets, and the corresponding batch size. The number of epochs for the first successful attack is also affected by the additional random delay.

E Experiment on hardware implementation
Our architecture, which focuses on extracting and combining features from long raw traces, is mainly designed for software implementations, where the masking scheme costs more clock cycles and the leakages of the shares are spread over a long interval. However, we are also interested in how our approach performs in hardware scenarios. In hardware implementations, the raw traces are usually much shorter, and thus we do not expect a superior result compared to CNNs and MLPs. We test our architecture on the AES_HD dataset used in [PHJ+19] and illustrate the results in Figure 35. Our approach takes about 800 traces to recover the key, which is close to the state of the art (about 700 traces) presented in [ZBHV20]. We refer to the source code in our GitHub repository for more details of this attack.

F Comparisons with networks trained on reduced traces
To keep the plots of our results easy to read, we do not directly show the results of previous works in Figures 6 and 8 but illustrate the comparisons in Figure 36.