EstraNet: An Efficient Shift-Invariant Transformer Network for Side-Channel Analysis

Abstract. Deep Learning (DL) based Side-Channel Analysis (SCA) has become extremely popular in recent years. DL-based SCA can easily break implementations protected by masking countermeasures. It has also been highly successful against implementations protected by trace desynchronization-based countermeasures such as random delay, clock jitter and shuffling. Over the years, many DL models have been explored for SCA. Recently, a Transformer Network (TN) based model has also been introduced for SCA. Although the previously introduced TN-based model is successful against implementations jointly protected by masking and random delay countermeasures, it does not scale to long traces (longer than a few thousand sample points) because of its quadratic time and memory complexity. This work proposes a novel shift-invariant TN-based model with linear time and memory complexity. The contributions of the work are two-fold. First, we introduce a novel TN-based model called EstraNet for SCA. EstraNet has linear time and memory complexity in the trace length, significantly improving over the quadratic time and memory cost of the previously proposed TN-based model. EstraNet is also shift-invariant, making it highly effective against countermeasures like random delay and clock jitter. Second, we evaluate EstraNet on three SCA datasets of masked implementations with random delay and clock jitter effects. Our experimental results show that EstraNet significantly outperforms several benchmark models, demonstrating up to an order of magnitude reduction in the number of attack traces required to reach guessing entropy 1.


Introduction
The power consumption or electromagnetic emission of a CMOS device depends on the data being processed within the device. Side-Channel Analysis (SCA) exploits this dependency to recover the secret key of a cryptographic device. Non-profiling SCA such as differential power analysis (DPA) [KJJ99], correlation power analysis (CPA) [BCO04], and mutual information analysis [GBTP08] perform attacks on a target device without prior characterization of its leakage behavior. In contrast, profiling SCA assumes that the adversary possesses a clone of the target device under his control. The attacker utilizes the clone device to characterize the leakage behavior, learning an approximate leakage model for the target device. This learned model is then used to perform actual attacks. Examples of profiling attacks include template attack [CRR02] and stochastic attack [SLP05].
Profiling attacks have received significant attention in the SCA literature due to their ability to provide worst-case evaluations of cryptographic devices against SCA. Classical
The results indicate that EstraNet performs similarly to or significantly better than the benchmark models. More precisely, it requires up to 90% fewer attack traces to reach guessing entropy 1 compared to the benchmark models.
(b) We conducted additional comparisons between EstraNet and the benchmark models on the datasets after adding the clock jitter effect. The experiments demonstrate that EstraNet can reach guessing entropy 1 using fewer than 100 attack traces most of the time, while the benchmark models struggle to reach the same using as many as 5K traces. Even in cases where the benchmark models perform relatively well, they still require an order of magnitude more attack traces than EstraNet.
(c) We conducted several studies to assess the influence of several hyperparameters on the performance of EstraNet. Additionally, we performed an ablation study to analyze the impact of different design choices and the training setup.
The organization of the paper is as follows. In Section 2, we introduce the necessary notations and SCA background. Section 3 briefly describes vanilla TN along with its prime component, the self-attention operation. The section also briefly outlines an approach to make the self-attention and, thus, the TN models linear in time and memory complexity. In Section 4, we propose a novel self-attention operation with relative positional encoding and linear time and memory cost. Section 5 introduces the overall architecture of EstraNet. In Section 6, we provide the experimental results. Section 7 discusses the limitations of EstraNet and outlines some future work directions. Finally, in Section 8, we conclude the work.

Notations
We use the following notational conventions throughout the paper. We use a capital letter (like $X$) to represent a random variable. The corresponding small letter (like $x$) and calligraphic letter (like $\mathcal{X}$) are respectively used to represent an instantiation and the domain of the random variable. Similarly, we use a capital letter in bold (like $\mathbf{X}$) to represent a random vector and the corresponding small letter in bold (like $\mathbf{x}$) to represent an instantiation of the random vector. A matrix is represented by a capital letter in Roman style (like $\mathrm{M}$). We represent the $i$-th element of a vector $\mathbf{x}$ by $\mathbf{x}[i]$ and the element in the $i$-th row and $j$-th column of a matrix $\mathrm{M}$ by $\mathrm{M}[i, j]$. We use the notation $P[\cdot]$ to represent the probability mass/density function and $E[\cdot]$ to represent expectation.

Side-Channel Analysis
The power consumption or electromagnetic (EM) emission of a semiconductor device depends on the values being manipulated within the device. SCA exploits this behavior of semiconductor devices to gain information about some intermediate sensitive variables of a cryptographic implementation and, hence, the device's secret key. More precisely, in an SCA, an adversary takes control of the target device, also known as Device Under Test (DUT), and collects power or EM measurements, referred to as traces, by executing the encryption (resp. decryption) algorithm multiple times with different plaintexts (resp. ciphertexts). Then the adversary performs a statistical test to infer the device's secret key.
SCA can be of two types: profiling SCA and non-profiling SCA. In a profiling SCA, the adversary is assumed to possess a clone of the DUT under his control. Using the clone device, he can build a profile of the DUT's power consumption or EM emission characteristic and use that profile for performing the actual attack. On the other hand, in a non-profiling SCA, the adversary does not possess any clone device and, thus, cannot build any power/EM profile of the DUT. Instead, he tries to recover the secret key from the traces of the DUT only. In this paper, we consider profiling SCA only.

Profiling SCA
A profiling SCA is performed in two phases. In the first phase, known as the profiling phase, the adversary sets some known key in the clone device and collects a large number of traces by executing the encryption (resp. decryption) operations on some known plaintexts (resp. ciphertexts) using the device. For each trace, the adversary computes the value of an intermediate secret variable $Z = F(X, K)$, where $X$ represents a component of the random plaintext (or ciphertext), $K$ represents a component of the (possibly random) key, and $F(\cdot, \cdot)$ is a cryptographic primitive. Then the adversary uses the traces to build a model for
$$P[\mathbf{L} \mid Z] \qquad (1)$$
where $\mathbf{L}$ represents a random vector corresponding to the traces. The conditional probabilities $P[\mathbf{L} \mid Z]$ serve as the leakage templates in the second phase.
In the second phase, also known as the attack phase, the adversary collects several trace-plaintext pairs $\{(\mathbf{l}_i, p_i)\}_{i=0}^{T_a-1}$ by executing the DUT for varying plaintexts, where $\mathbf{l}_i$ and $p_i$ are the $i$-th trace and plaintext (or ciphertext) respectively, and $T_a$ is the total number of attack traces. For all the traces, the secret key $k^*$ is unknown but fixed. Finally, the adversary computes the score for each possible key as
$$\delta_k = \sum_{i=0}^{T_a-1} \log P[\mathbf{L} = \mathbf{l}_i \mid Z = F(p_i, k)].$$
The key $\hat{k} = \operatorname{argmax}_k \delta_k$ is chosen as the predicted key. If $\hat{k} = k^*$ holds, the prediction is said to be correct. The rank of the correct key in the list of all the possible keys sorted by their scores $\delta_k$ is used as a metric for the degree of success of the attack. This attack is also called a template attack, as the estimated $P[\mathbf{L} \mid Z]$ in Eq. (1) can be considered as the leakage template for different values of the sensitive variable $Z$.

Deep Learning based Profiling SCA
In Deep Learning (DL) based profiling SCA, the adversary trains a DL model which takes a trace $\mathbf{L}$ as input and generates a probability distribution over all possible values of the sensitive variable $Z$. More precisely, let $f(\cdot; \theta^*)$ be the trained DL model, where $\theta^*$ denotes the model parameters learned during training. Thus, the output of the DL model for a trace $\mathbf{l}$ can be written as
$$\mathbf{p} = f(\mathbf{l}; \theta^*)$$
where $\mathbf{p} \in \mathbb{R}^{|\mathcal{Z}|}$ such that $\mathbf{p}[i]$, for $i = 0, \ldots, |\mathcal{Z}| - 1$, represents the predicted probability for the intermediate variable $Z = i$. During the attack phase, given the set of attack trace-plaintext pairs $\{(\mathbf{l}_i, p_i)\}_{i=0}^{T_a-1}$, the score of each key $k \in \mathcal{K}$ is computed as
$$\delta_k = \sum_{i=0}^{T_a-1} \log \mathbf{p}_i[F(p_i, k)]$$
where $\mathbf{p}_i = f(\mathbf{l}_i; \theta^*)$ is the predicted probability vector for the $i$-th trace. As in the template attack, $\hat{k} = \operatorname{argmax}_k \delta_k$ is chosen as the guessed key. Alternatively, the rank of the correct key in the list of all possible keys sorted by their scores $\delta_k$ can be considered as a metric for the degree of the attack's success.
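The key-scoring step of the attack phase can be illustrated with a small sketch. It assumes the score of a key is the sum of log-probabilities that the model assigns to the hypothesised intermediate value. The names `key_scores`, `key_rank` and `xor_target` are hypothetical, the XOR target is a toy stand-in for $F(X, K)$, and the 1-based rank is a convention chosen here so that rank 1 corresponds to a correct guess.

```python
import math

def key_scores(prob_vectors, plaintexts, num_keys, target_fn):
    """Sum log-probabilities of the hypothesised intermediate value for each key."""
    scores = [0.0] * num_keys
    for probs, p in zip(prob_vectors, plaintexts):
        for k in range(num_keys):
            z = target_fn(p, k)
            # The small floor avoids log(0) when the model assigns zero probability.
            scores[k] += math.log(max(probs[z], 1e-40))
    return scores

def key_rank(scores, correct_key):
    """1-based rank of the correct key; rank 1 means it has the highest score."""
    return 1 + sum(1 for s in scores if s > scores[correct_key])

def xor_target(p, k):
    # Toy stand-in for F(X, K); an AES attack would typically use Sbox[p ^ k].
    return p ^ k

# Toy 4-bit example: an idealised model puts probability 0.5 on the true class.
true_key = 11
plaintexts = [3, 7, 12, 0, 9]
prob_vectors = []
for p in plaintexts:
    probs = [0.5 / 15] * 16
    probs[xor_target(p, true_key)] = 0.5
    prob_vectors.append(probs)

scores = key_scores(prob_vectors, plaintexts, 16, xor_target)
guessed_key = max(range(16), key=lambda k: scores[k])
print(guessed_key, key_rank(scores, true_key))
```

Accumulating logarithms instead of multiplying probabilities keeps the scores numerically stable when many attack traces are combined.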
Various DL models (Feed Forward Network [MZ13, MHM13, MPP16], Convolutional Neural Network [MPP16, CDP17, BPS+20, ZS20, PSK+18], Recurrent Neural Network [MPP16, Mag19, LZC+21]) have been used for SCA. Recently, [HSAM22] introduced a shift-invariant TN model, TransNet, for SCA. However, their model only applies to short traces (shorter than a few thousand sample points) as its time and memory complexity is quadratic in the trace length. This work introduces a shift-invariant TN with linear time and memory complexity. Thus, in the next section, we describe the architecture of TN. We also briefly outline the existing methods that make the self-attention operation, and thus the TN models themselves, shift-invariant and linear in computational cost. Section 4 proposes a novel shift-invariant self-attention operation for SCA with linear cost.

Transformer Network
Like all other DL models, TN also has a layered structure. It consists of multiple transformer layers stacked one after another. Each transformer layer takes a sequence $X = [x_0, \ldots, x_{n-1}]^T \in \mathbb{R}^{n \times d}$ of $n$ feature vectors as input and transforms it into another sequence $Y = [y_0, \ldots, y_{n-1}]^T \in \mathbb{R}^{n \times d}$, where $n$ corresponds to the trace length or sequence length and $d$ is the dimension of each feature vector. The transformer layer consists of the self-attention layer and the position-wise feed-forward layer. More precisely, if we denote the self-attention layer by the function $f_{SA}: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ and the position-wise feed-forward layer by $f_{PFF}: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$, then the output $Y$ of a transformer layer can be computed as
$$Y = f_{PFF}(f_{SA}(X)).$$
The function $f_{PFF}$ independently transforms each feature vector using a feed-forward network, thereby enhancing the non-linear characteristics of the model. In contrast, the self-attention layer captures the interdependencies between input features by transforming each feature based on its relation to other features. As a result, the self-attention layer plays a critical role in TN's ability to capture the dependency among distant features.
In the context of SCA, the input $X$ corresponds to the input of an intermediate layer, where $n$ represents the input length (which is equal to the trace length for the first layer), and $d$ represents the feature dimension of the preceding layer. Similarly, $Y$ corresponds to the output of the layer. The self-attention layer possesses the ability to combine leakage information from multiple POIs, leading to higher SNR outputs. Specifically, it can combine the leakages of different shares from a masked implementation to reconstruct the unmasked secret in some output feature vectors.
In the following section, we present a brief description of the self-attention layer.

Self-attention Layer
Given the input sequence $X = [x_0, \ldots, x_{n-1}]^T \in \mathbb{R}^{n \times d}$, the self-attention layer first computes the query, key and value vectors $q_i = W_q x_i$, $k_i = W_k x_i$ and $v_i = W_v x_i$ for $i = 0, \ldots, n-1$, where $W_q, W_k \in \mathbb{R}^{d_k \times d}$ and $W_v \in \mathbb{R}^{d_v \times d}$. It then computes
$$\bar{v}_i = \sum_{j=0}^{n-1} \mathrm{softmax}_j\!\left(\frac{q_i^T k_j}{\sqrt{d}}\right) v_j = \sum_{j=0}^{n-1} \frac{\exp(q_i^T k_j / \sqrt{d})}{\sum_{l=0}^{n-1} \exp(q_i^T k_l / \sqrt{d})}\, v_j. \qquad (5)$$
With $\bar{V} = [\bar{v}_0, \ldots, \bar{v}_{n-1}]^T \in \mathbb{R}^{n \times d_v}$, the final output of the self-attention layer is computed as $\hat{Y} = \bar{V} W_o^T + X$, where $W_o \in \mathbb{R}^{d \times d_v}$ is the projection matrix which projects the $d_v$-dimensional $\bar{v}_j$s back into the $d$-dimensional vector space. The matrices $W_q$, $W_k$, $W_v$ and $W_o$ are the parameters of the DL model which are learned during training, and $d_k$ and $d_v$ are two hyper-parameters known as the key and value dimension respectively. The scalar $\mathrm{softmax}_j(q_i^T k_j / \sqrt{d})$ can be thought of as the attention $\bar{v}_i$ pays to the input feature $x_j$. This attention mechanism plays the major role in TN's ability to capture long-distance dependency. If there exists some dependency between two input features, say $x_i$ and $x_j$, the attention from $\bar{v}_i$ to $x_j$ can be large, making the $i$-th output feature $\hat{y}_i = W_o \bar{v}_i + x_i$ dependent on both $x_i$ and $x_j$, thus capturing the interrelations between them. Moreover, unlike CNN and RNN, the self-attention layer can capture the dependency between $x_i$ and $x_j$ in a constant number of steps even when the distance between $i$ and $j$ is large. Indeed, in [VSP+17], it has been argued that TN is better than CNN and RNN at capturing long-distance dependency. [HSAM22] has demonstrated that TN's ability to capture long-distance dependency makes it highly effective in attacking software implementations of masking countermeasures, in which the leakages of multiple shares (that can be far apart in the time dimension) need to be combined for a successful attack.
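The single-head softmax self-attention described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`matvec`, `softmax`, `self_attention`) are hypothetical, the scaling constant is passed explicitly as `scale`, and identity matrices are used for all projections to keep the toy example readable.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) with vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv, Wo, scale):
    """Single-head softmax self-attention with a residual connection."""
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    outputs, attention = [], []
    for i, q in enumerate(Q):
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(logits)          # attention paid to each x_j
        v_bar = [sum(w * v[c] for w, v in zip(weights, V))
                 for c in range(len(V[0]))]
        projected = matvec(Wo, v_bar)      # project back to d dimensions
        outputs.append([o + x for o, x in zip(projected, X[i])])
        attention.append(weights)
    return outputs, attention

# Toy input: n = 3 positions, d = d_k = d_v = 2, identity projections.
I2 = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y_hat, A = self_attention(X, I2, I2, I2, I2, scale=math.sqrt(2))
```

The nested loop over all pairs of positions makes the quadratic cost in the sequence length explicit.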

Multihead Self-Attention
One self-attention operation with a set of parameters $W_q$, $W_k$ and $W_v$ is called one attention head. In practice, several attention heads are used in parallel in the self-attention layer. Thus, an $H$-head self-attention layer is computed as
$$\bar{V} = \mathrm{concat}(\bar{V}^{(0)}, \ldots, \bar{V}^{(H-1)}), \qquad \hat{Y} = \bar{V} W_o^T + X$$
where $\bar{V}^{(h)}$ is the output of the $h$-th attention head, computed with its own parameters, and the operation $\mathrm{concat}(\bar{V}^{(0)}, \ldots, \bar{V}^{(H-1)})$ denotes the row-wise concatenation of the matrices $\bar{V}^{(0)}, \ldots, \bar{V}^{(H-1)}$. Thus, in this new setting, $\bar{V}^{(i)} \in \mathbb{R}^{n \times d_v}$, $\bar{V} \in \mathbb{R}^{n \times H d_v}$ and $W_o \in \mathbb{R}^{d \times H d_v}$. For simplicity, by self-attention, we will imply the single-head self-attention layer only, though our observations can be easily extended to multihead self-attention.
One main drawback of the vanilla self-attention operation is that any parallel implementation of self-attention has quadratic memory and computation cost with respect to the input length. In SCA, the trace length can be very large (in the order of $10^5$). The quadratic complexity of the self-attention layer prevents TN from being applied to very long traces [HSAM22]. Several variations of the self-attention operation have been introduced which operate in linear time and memory. In the next section, we describe one such approach.

Self-attention with Linear Complexity
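Rewriting the last term of Eq. (5), the self-attention operation can be given by
$$\bar{v}_i = \frac{\sum_{j=0}^{n-1} \exp(q_i^T k_j / \sqrt{d})\, v_j}{\sum_{j=0}^{n-1} \exp(q_i^T k_j / \sqrt{d})}. \qquad (6)$$
In [KVPF20], Katharopoulos et al. have replaced the exponential function of the form $\exp(q^T k / \sqrt{d})$ by a positive function $k(q, k)$ such that $k(q, k)$ is factorizable as $\phi(q)^T \phi(k)$ for some feature map $\phi: \mathbb{R}^{d_k} \to \mathbb{R}^{d'_k}$ and $k(q, k) \geq 0$ for all $k, q \in \mathbb{R}^{d_k}$. A function $k(q, k)$ which is factorizable as $\phi(q)^T \phi(k)$ is known as a (positive semi-definite) kernel function [Wik22]. Thus, replacing the exponential function in Eq. (6) by a kernel $k(\cdot, \cdot)$ such that $k(\cdot, \cdot) \geq 0$, the vector $\bar{v}_i$ can be computed as
$$\bar{v}_i^T = \frac{\sum_{j=0}^{n-1} \phi(q_i)^T \phi(k_j)\, v_j^T}{\sum_{j=0}^{n-1} \phi(q_i)^T \phi(k_j)} = \frac{\phi(q_i)^T \sum_{j=0}^{n-1} \phi(k_j)\, v_j^T}{\phi(q_i)^T \sum_{j=0}^{n-1} \phi(k_j)}. \qquad (7)$$
Since all $\bar{v}_i$s share the terms $\sum_{j=0}^{n-1} \phi(k_j) v_j^T$ and $\sum_{j=0}^{n-1} \phi(k_j)$ in Eq. (7), those can be computed using linear time and memory². However, since they have used a different kernel $k(q, k)$ than $\exp(q^T k)$, the resultant self-attention operation differs from the vanilla softmax self-attention (i.e. Eq. (5)).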
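The reordering that yields linear cost can be checked numerically. The sketch below compares a direct (quadratic) evaluation of kernel attention with the factorized (linear) one. It uses the feature map elu(x) + 1 from Katharopoulos et al.; the function names and the random toy data are illustrative assumptions.

```python
import math
import random

def phi(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so it is always positive.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def kernel_attention_quadratic(Q, K, V):
    """Evaluate every pairwise kernel score: O(n^2) in the sequence length."""
    out = []
    for q in Q:
        w = [dot(phi(q), phi(k)) for k in K]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / z
                    for c in range(len(V[0]))])
    return out

def kernel_attention_linear(Q, K, V):
    """Accumulate the shared sums once, then reuse them for every query: O(n)."""
    d_phi, d_v = len(phi(K[0])), len(V[0])
    S = [[0.0] * d_v for _ in range(d_phi)]  # sum_j phi(k_j) v_j^T
    z = [0.0] * d_phi                        # sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d_phi):
            z[a] += fk[a]
            for c in range(d_v):
                S[a][c] += fk[a] * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        denom = dot(fq, z)
        out.append([sum(fq[a] * S[a][c] for a in range(d_phi)) / denom
                    for c in range(d_v)])
    return out

rng = random.Random(0)
n, d_k, d_v = 8, 3, 2
Q = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]
K = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in range(n)]
V = [[rng.uniform(-1, 1) for _ in range(d_v)] for _ in range(n)]

slow = kernel_attention_quadratic(Q, K, V)
fast = kernel_attention_linear(Q, K, V)
max_diff = max(abs(a - b) for r1, r2 in zip(slow, fast) for a, b in zip(r1, r2))
```

Both functions compute the same quantity; only the order of the summations differs, which is what removes the dependence on all n × n pairs.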

Feature Map for Softmax Self-attention
Recently, several works [PPY+21, CLD+21] have proposed self-attentions which approximate the softmax self-attention and work in linear time and memory. The theoretical foundation of these works lies in the approximation of the Gaussian kernel, i.e., the kernel of the form $\exp(-\|q - k\|_2^2 / 2\sigma^2)$, using random Fourier features [RR07, YSC+16, CRW17]. More precisely, the Fourier feature map defined as
$$\phi_{fr}(x) = \sqrt{\tfrac{1}{d_e}}\, [\sin(w_0^T x), \ldots, \sin(w_{d_e-1}^T x), \cos(w_0^T x), \ldots, \cos(w_{d_e-1}^T x)]^T \qquad (8)$$
where $w_0, \ldots, w_{d_e-1}$ are i.i.d. (independent and identically distributed) samples from a $d$-dimensional Gaussian distribution with zero mean and identity covariance matrix, can approximate the Gaussian kernel as $\exp(-\|q - k\|_2^2 / 2) \approx \phi_{fr}(q)^T \phi_{fr}(k)$. Thus, $\phi_{fr}(x)$ is a feature map for the Gaussian kernel. Peng et al. [PPY+21] have used the above feature map for the Gaussian kernel to obtain a feature map for the kernel $\exp(q^T k)$. Concretely, $\exp(q^T k)$ can be written as
$$\exp(q^T k) = \exp(\|q\|_2^2 / 2)\, \exp(-\|q - k\|_2^2 / 2)\, \exp(\|k\|_2^2 / 2).$$
Thus, using the feature map
$$\phi_{tri}(x) = \exp(\|x\|_2^2 / 2)\, \phi_{fr}(x) \qquad (9)$$
we can approximate the kernel $\exp(q^T k)$ as $\phi_{tri}(q)^T \phi_{tri}(k)$, where $d_e$ is a hyper-parameter known as the dimension of the kernel feature map. However, in [CLD+21], Choromanski et al. pointed out that $\phi_{tri}$ might lead to unstable behavior of self-attention due to potentially negative components like the $\sin(w_j^T x)$s and $\cos(w_j^T x)$s in $\phi_{tri}$. They resolved the issue by proposing the positive random feature map
$$\phi_{pos}(x) = \sqrt{\tfrac{1}{d_e}}\, \exp(-\|x\|_2^2 / 2)\, [\exp(w_0^T x), \ldots, \exp(w_{d_e-1}^T x)]^T \qquad (10)$$
where $w_0, \ldots, w_{d_e-1}$ are as defined in Eq. (8). Since $\phi_{pos}(x)$ can have only positive components, it solves the unstable behavior of $\phi_{tri}$.
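A quick numerical check makes the two random feature maps concrete. The sketch below draws Gaussian projection vectors and compares the feature-map inner products with the exact kernel values. The function names, the seed, the test vectors and the choice of 4000 random features are illustrative assumptions. The formulas follow the standard random Fourier feature and positive random feature constructions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sample_projections(d, d_e, seed=0):
    """Draw d_e i.i.d. vectors from a d-dimensional standard Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d_e)]

def phi_fr(x, W):
    """Random Fourier features for the Gaussian kernel exp(-||q - k||^2 / 2)."""
    c = math.sqrt(1.0 / len(W))
    proj = [dot(w, x) for w in W]
    return [c * math.sin(t) for t in proj] + [c * math.cos(t) for t in proj]

def phi_pos(x, W):
    """Positive random features for the kernel exp(q^T k)."""
    c = math.sqrt(1.0 / len(W)) * math.exp(-dot(x, x) / 2.0)
    return [c * math.exp(dot(w, x)) for w in W]

q, k = [0.3, -0.2], [0.1, 0.4]
W = sample_projections(d=2, d_e=4000)

diff = [a - b for a, b in zip(q, k)]
gauss_exact = math.exp(-dot(diff, diff) / 2.0)
gauss_approx = dot(phi_fr(q, W), phi_fr(k, W))

exp_exact = math.exp(dot(q, k))
exp_approx = dot(phi_pos(q, W), phi_pos(k, W))
print(gauss_exact, gauss_approx, exp_exact, exp_approx)
```

Both estimates are Monte Carlo averages, so their error shrinks as the number of random features grows. Every entry of `phi_pos` is strictly positive, whereas the sine and cosine entries of `phi_fr` can be negative.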

Relative Positional Self-attention with Linear Complexity
In self-attention of the form of Eq. (5), the attention an output feature $\bar{v}_i$ pays to an input feature $x_j$ does not depend on their positions, i.e. the indices $i$ and $j$. More precisely, the attention paid by the output feature $\bar{v}_i$ to the input feature $x_j$ is proportional to $k(q_i, k_j)$, which is a function of the vectors $q_i$ and $k_j$, not of their positions $i$ and $j$. Thus, if the input sequence is permuted, the output sequence of the self-attention layer will also be permuted similarly. However, in SCA, we want the attention to be concentrated on the POIs rather than spread equally over all sample points. Moreover, in the presence of countermeasures like random delay and clock jitter, the distances between the POIs remain approximately the same though their absolute positions (the indices at which they appear) vary from trace to trace. Thus, we want the attention an output feature $\bar{v}_i$ pays to an input feature $x_j$ to depend on their relative positions, i.e. on $i - j$. Such modeling of attention probabilities is referred to as relative positional encoding. In self-attention with relative positional encoding, the terms of the form $\exp(q_i^T k_j)$ in Eq. (5) are generalized to a positive function of the form $f(q_i, k_j, i - j)$ [SUV18, DYY+19].
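The claim that purely content-based attention ignores positions can be checked directly: permuting the input permutes the output in the same way. The sketch below uses identity projections and hypothetical names. It only illustrates this permutation property and is not part of the proposed model.

```python
import math
import random

def content_attention(X):
    """Softmax attention with identity projections: scores depend only on content."""
    out = []
    for q in X:
        logits = [sum(a * b for a, b in zip(q, k)) for k in X]
        m = max(logits)
        w = [math.exp(v - m) for v in logits]
        z = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, X)) / z
                    for c in range(len(X[0]))])
    return out

rng = random.Random(1)
X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(6)]
perm = [4, 0, 5, 2, 1, 3]

Y = content_attention(X)
Y_of_permuted = content_attention([X[p] for p in perm])  # attend over shuffled input
permuted_Y = [Y[p] for p in perm]                        # shuffle the original output
max_gap = max(abs(a - b) for r1, r2 in zip(Y_of_permuted, permuted_Y)
              for a, b in zip(r1, r2))
```

The two results agree up to floating-point rounding. This is why some form of positional encoding is needed before the attention can favour specific relative distances.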
In self-attention with relative positional encoding and linear complexity, the kernel $k(q_i, k_j)$ in Eq. (7) is replaced by another kernel of the form $k_r(q_i, k_j, i - j)$ such that the new kernel is (approximately) factorizable as $k_r(q_i, k_j, i - j) = \phi_q(q_i, i)^T \phi_k(k_j, j)$. Thus, by replacing the terms $k(q_i, k_j)$ in Eq. (7) by the relative position aware kernel, we get self-attention with relative positional encoding, which can be performed in linear time and memory as follows:
$$\bar{v}_i^T = \frac{\sum_{j=0}^{n-1} \phi_q(q_i, i)^T \phi_k(k_j, j)\, v_j^T}{\sum_{j=0}^{n-1} \phi_q(q_i, i)^T \phi_k(k_j, j)} = \frac{\phi_q(q_i, i)^T \sum_{j=0}^{n-1} \phi_k(k_j, j)\, v_j^T}{\phi_q(q_i, i)^T \sum_{j=0}^{n-1} \phi_k(k_j, j)}.$$
In the literature, several self-attentions with relative positional encoding which work in linear time and memory have been introduced. For example, [LCW+21, LSL+21, SLP+21] have used feature maps of the form $\phi_q(q_i, i) = \phi(M_i q_i)$ and $\phi_k(k_j, j) = \phi(N_j k_j)$, where the matrix $M_i^T N_j$ is a function of $i - j$ and $\phi(\cdot)$ is given by Eq. (9) or (10). [LCW+21, LSL+21] have further generalized the matrices $M_i$ and $N_j$ by some small DL models. In [Che21], Chen et al. have used feature maps of the form $\phi_q(q_i, i) = M_i \varphi_q(q_i)$ and $\phi_k(k_j, j) = N_j \varphi_q(k_j)$ such that $M_i^T N_j$ becomes a function of $i - j$. [LLC+21] has used a kernel of the form $k(q_i, k_j, i - j) = \exp(q_i^T k_j + b_{j-i})$. Here $b_{-(n-1)}, \ldots, b_0, \ldots, b_{n-1}$ are relative positional biases whose values are also learned during training. They have further shown that the self-attention operation with the above kernel can be performed in $O(n \log n)$ time using the Fast Fourier Transform (FFT). [HSGB21] has used a kernel of the form $k(q_i, k_j, i - j) = \exp(q_i^T k_j + \phi_r(i)^T \phi_r(j))$, where $\phi_r(\cdot)$ is a feature map for the position indices $i$ and $j$ such that $\phi_r(i)^T \phi_r(j)$ is a function of $i - j$.
In the next section, we propose a novel attention for SCA.

Self-attention in SCA
The informative sample points in power or EM traces are sparse. In other words, only a few sample points in the traces are high SNR sample points, and the rest of those are noisy. To see this, we plot the SNR of four secret shares on two very widely used SCA datasets, namely the ASCAD fixed key³ and ASCAD random key⁴ datasets, in Figure 1. A peak or a high value in the plots indicates the informative sample points. It can be seen in the figures that the SNR at most of the sample points is close to zero, implying the informativeness of only a few sample points. Thus, the self-attention operation should be able to put high attention on only a few sample points while putting close to zero attention on the rest. However, the existing self-attention schemes with linear complexity like [LCW+21, LSL+21, HSGB21] put significantly non-zero attention on most of the input feature vectors, making the output feature vectors influenced by the noisy (low SNR) sample points. Figure 2 plots the histogram of the attention scores of the randomly initialized self-attention scheme of [HSGB21] for three different values of the scale hyper-parameter. As depicted in the plots, the attention scores are centered around 1, indicating that the self-attention scheme assigns attention close to 1 to the majority of input feature vectors. Similar observations have been made for other self-attention schemes, such as the one proposed by Li et al. [LSL+21]. Another disadvantage of those schemes is that their attention scores result from complicated interactions among many parameters. Consequently, it is difficult to train the network to generate sparse attention scores.
To address the aforementioned limitations of existing self-attention mechanisms, this section introduces a novel attention that employs a Gaussian kernel on the relative positions of input features to generate attention scores. The proposed self-attention produces sparse attention scores, as discussed in Section 4.2. By utilizing relative positional encoding, it achieves shift invariance. Additionally, the proposed method exhibits linear time and memory complexity with respect to the trace length, enabling scalability to longer traces.

GaussiP: Gaussian Positional Attention
This section introduces the proposed Gaussian Positional attention, also called GaussiP attention. The GaussiP attention exhibits linear time and memory complexity with respect to the input length, making it computationally efficient. The degree of sharpness and sparseness in the attention scores can be controlled by adjusting a suitable parameter. Furthermore, the attention mechanism enables high attention to be assigned to distant features, facilitating the flow of information over long distances.
The subsequent sections delve into the details of the GaussiP attention. Firstly, we describe the kernel function utilized in the GaussiP attention. Next, we introduce the feature map employed for factorizing the kernel function, enabling efficient computation. Lastly, we address a significant drawback of the proposed kernel function and present our solution, which involves utilizing multiple heads in the attention.

Deciding the Kernel Function
Unlike most of the existing self-attention schemes, we use a Gaussian kernel for our proposed attention:
$$k_{GPA}(q_i, k_j, i - j) = \exp\!\left(-\|\phi_q(q_i, i) - \phi_k(k_j, j)\|_2^2 / 2\right)$$
for $i, j = 0, 1, \ldots, n - 1$ where $\beta_1, \beta_2 \in [0, +\infty)$ are two hyperparameters, $p, b \in \mathbb{R}^{d_p}$ are two predefined constants, $W_p \in \mathbb{R}^{d_p \times d}$ is a matrix with entries drawn from a uniform random distribution, and $s_p \in (0, +\infty)$, $c_p \in [-1, 1]$ are two trainable parameters. Thus, with the above defined $\phi_q$ and $\phi_k$, the proposed kernel takes the form
$$k_{GPA}(q_i, k_j, i - j) = \exp\!\left(-\frac{\beta_1^2 \|q_i - k_j\|_2^2 + \beta_2^2 s_p^2 (i - j - c_p n)^2 \|W_p p\|_2^2}{2}\right).$$
The part $\beta_1^2 \|q_i - k_j\|_2^2$ in the above equation influences the kernel scores based on the contents of the input feature vectors, while the part $\beta_2^2 s_p^2 (i - j - c_p n)^2 \|W_p p\|_2^2$ influences the scores based on the relative positions, i.e., $i - j$, of the feature vectors. In our initial set of experiments, we found that the first part does not positively affect the performance of EstraNet. Thus, we set $\beta_1 = 0$ in Eq. (13), which simplifies $\phi_q(q, i)$ and $\phi_k(k, i)$ and results in the following simplified kernel:
$$k_{GPA}(i - j) = \exp\!\left(-\frac{\beta_2^2 s_p^2 (i - j - c_p n)^2 \|W_p p\|_2^2}{2}\right). \qquad (16)$$
The above equation shows several important properties of the proposed kernel function. Firstly, it can be observed that the maximum value of the kernel output occurs when $i - c_p n = j$. In simpler terms, the $i$-th output feature vector assigns maximum attention to the $(i - c_p n)$-th input feature vector. Thus, the attention mechanism facilitates the flow of information from the $(i - c_p n)$-th index to the $i$-th index, enabling the learning of long-distance dependencies. Secondly, the precision of the attention can be controlled by appropriately setting the hyper-parameter $\beta_2$ or learning an appropriate value for the parameter $s_p$ during the training process. Let us denote $s = \beta_2 s_p$. If the value of $s$ is large, the attention will be concentrated within a small region of the input traces. Conversely, if the value of $s$ is small, the attention will be spread over a larger region. To illustrate this, Figure 3 displays the kernel scores for various values of $s$. It can be observed that as the value of $s$ increases, the kernel scores become more concentrated in smaller regions.
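The two properties discussed above, the location of the peak and the effect of the scale $s$, can be illustrated with a small sketch. It assumes a position-only kernel of the form exp(-(s · norm · (i - j - c_p · n))² / 2), where `norm` stands in for the constant factor involving $W_p$ and $p$ and is set to 1 here. The function name `gaussip_scores` and all numeric values are illustrative assumptions, not values from the paper.

```python
import math

def gaussip_scores(i, n, s, c_p, norm=1.0):
    """Unnormalised kernel scores of output index i against all n input indices."""
    return [math.exp(-0.5 * (s * norm * (i - j - c_p * n)) ** 2) for j in range(n)]

n, i, c_p = 100, 60, 0.25
sharp = gaussip_scores(i, n, s=0.5, c_p=c_p)
wide = gaussip_scores(i, n, s=0.05, c_p=c_p)

# The maximum sits at j = i - c_p * n, independent of the scale s.
peak_index = max(range(n), key=lambda j: sharp[j])

# Count how many input positions receive a non-negligible score.
support_sharp = sum(1 for v in sharp if v > 0.01)
support_wide = sum(1 for v in wide if v > 0.01)
```

In this toy setting the peak lies 25 positions before the output index for both scales, while the larger scale leaves far fewer positions with a noticeable score.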

Deciding the Feature Map for the Kernel Function
Since we are using a Gaussian kernel, following the work of [CLD+21], a feature map with only positive entries can be given for the kernel as follows:
$$\phi'(x) = \sqrt{\tfrac{1}{d_e}}\, \exp(-\|x\|_2^2)\, [\exp(w_0^T x), \ldots, \exp(w_{d_e-1}^T x)]^T$$
However, we have observed that the approximation error of $\phi'$ significantly increases as the norms of the input features $x$ become larger. It should be noted that in order to concentrate the attention scores on a smaller segment of the input traces, the norms of the input features need to be sufficiently large. As an alternative, the Fourier features $\phi_{fr}$ defined in Eq. (8) can be used as the feature map for the kernel. However, as pointed out in [CLD+21], the kernel scores $k_{GPA}(i - j)$ approximated by $\phi_{fr}(\phi_q(i))^T \phi_{fr}(\phi_k(j))$ can be potentially negative, causing an increase in the variance of the normalized kernel score $k_{GPA}(i - j) / \sum_{j'} k_{GPA}(i - j')$, which, in turn, causes the unstable behavior of the self-attention. To address this issue, we propose to approximate the normalizing factor of the kernel given in Eq. (16) in a closed form as follows: The justification for the approximation can be found in Appendix A. Using the above approximation in the proposed attention, the expression of the output feature vectors $\bar{v}_i$ can be given as: where $\phi_{fr}(\cdot)$ is given by Eq. (8), and

Making the Attention Distribution Multi-modal
The major problem with the Gaussian kernel is that its attention scores follow a uni-modal distribution. In other words, it puts high attention on a small contiguous segment of the traces and assigns small attention scores to the rest (as seen in Figure 3). The uni-modal attention distribution limits the flow of information from one region to a distant region. To make the attention distribution multi-modal, we use multi-head attention (see Section 3.1 for a description of multi-head attention) with different heads having different $s_p$ and $c_p$ parameters. Since different heads have different $c_p$ values, they put high attention on different parts of the input traces, allowing the flow of information to an output feature vector from different regions of the input traces. Similarly, a separate $s_p$ for each head enables the heads to scale the attention independently. We found that the proper initialization of $c_p$ and $s_p$ for each head is crucial for the successful training of EstraNet. We initialize $s_p$ for all heads to the same value 1. However, we initialize $c_p$ for the $h$-th head to $(1 + 2h)/(2H)$ for $h = 0, 1, \ldots, H - 1$. In other words, the $c_p$s are initialized so that different attention heads focus on different parts of the input sequence. Appendix B plots the attention probabilities learned at the attention heads of an EstraNet model. The set of hyper-parameters of the proposed GaussiP attention is shown in Table 1.
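The per-head initialization can be written out as a short sketch. It reads the initial value as (1 + 2h)/(2H), which is the reading that keeps every value inside (0, 1). The helper names and the example sizes (4 heads, sequence length 800) are hypothetical.

```python
def init_head_centers(num_heads):
    """Initial c_p for each head: evenly spaced midpoints of num_heads equal bins in (0, 1)."""
    return [(1 + 2 * h) / (2 * num_heads) for h in range(num_heads)]

def peak_positions(i, n, centers):
    """Input index that each head attends to most strongly for output index i."""
    return [i - c * n for c in centers]

centers = init_head_centers(4)
peaks = peak_positions(799, 800, centers)
print(centers)  # [0.125, 0.375, 0.625, 0.875]
print(peaks)    # [699.0, 499.0, 299.0, 99.0]
```

With this spacing, the heads initially look back by one eighth, three eighths, five eighths and seven eighths of the sequence length, so together they cover the whole input range. Peaks that fall outside the valid index range for small output indices are not handled in this sketch.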

Difference with the Existing Alternatives
The functional distinctions between the proposed GaussiP attention and the foremost existing alternatives are summarized as follows. Kernel feature map: $\phi_{tri}$, $\phi_{pos}$ or $\phi_{fr}$ in the existing alternatives versus $\phi_{fr}$ in GaussiP attention. Positional encoding: linear or trigonometric in the existing alternatives versus linear in GaussiP attention. The self-attention proposed by Guo et al. [GZL19] employs a Gaussian prior over input features to bias the attention scores based on relative distances. However, unlike our scheme, they utilize a softmax-based self-attention mechanism. Additionally, their scheme emphasizes high attention to nearby features, whereas our scheme can put high attention on distant features. Although the self-attention approach presented by Liutkus et al. [LCW+21] can be represented as a Gaussian kernel with a Fourier feature map for certain hyper-parameter configurations, their attention scores are maximized for $i - j = 0$. In other words, in their scheme, each position or sample point assigns maximum attention to itself, making the propagation of information from one region to a distant region challenging. Conversely, in our proposed scheme, for sufficiently large $\beta_2$, the attention scores are maximized when $i - j \approx n c_p$, where $n$ denotes the sequence length and $c_p \in [0, 1]$ represents a trainable parameter. This allows each output feature to allocate high attention to a distant region, enabling the flow of information over long distances.
In the subsequent section, we present the architectural design of EstraNet.

EstraNet Architecture
This section provides a detailed overview of the EstraNet architecture. In Section 5.1, we introduce a layer-centering normalization technique. Subsequently, in Section 5.2, we present the structure of a single layer of EstraNet. Finally, in Section 5.3, we describe the complete architecture of EstraNet.

Layer-Centering
In the conventional TN architecture, layer normalization is often employed to ensure the stability of training [XYH+20]. However, it has been observed in [HSAM22] that applying layer normalization to TN layers makes the network untrainable for the SCA datasets. Our experiments also found that incorporating either layer normalization or batch normalization in EstraNet layers degrades its performance. Consequently, we introduce a novel "layer-centering" operation as an alternative approach.
To understand the layer-centering operation, let us assume that $[x_0, \ldots, x_{n-1}]^T \in \mathbb{R}^{n \times d}$ represents the input to the layer. Given the input, we start by computing the mean of each vector in the input sequence:
$$\mu_i = \frac{1}{d} \sum_{k=0}^{d-1} x_i[k], \qquad i = 0, \ldots, n - 1.$$
The output is then computed as
$$\tilde{x}_i = x_i - \mu_i \mathbf{1} + c_{lc}$$
where $\mathbf{1}$ is the all-ones vector and $c_{lc} \in \mathbb{R}^d$ is a trainable parameter. The resulting sequence $[\tilde{x}_0, \ldots, \tilde{x}_{n-1}]$ is the output of the layer-centering operation. Specifically, each element of the input sequence is first centered to 0 and then re-centered to $c_{lc}$, which is a parameter learned during the training process. It is worth noting that while layer normalization involves re-centering and re-scaling each vector in the input sequence, layer-centering involves only re-centering the elements without any re-scaling.
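A minimal sketch of the layer-centering operation as described above: each feature vector has its own mean subtracted and a shared offset added, with no rescaling. The function name `layer_centering` and the toy values are illustrative, and in a real model `c_lc` would be a trainable parameter rather than a fixed list.

```python
def layer_centering(X, c_lc):
    """Subtract each vector's own mean, then add the shared offset c_lc."""
    centered = []
    for x in X:
        mean = sum(x) / len(x)
        centered.append([xi - mean + c for xi, c in zip(x, c_lc)])
    return centered

X = [[1.0, 2.0, 3.0], [10.0, 10.0, 40.0]]
out = layer_centering(X, c_lc=[0.5, 0.5, 0.5])
print(out)  # [[-0.5, 0.5, 1.5], [-9.5, -9.5, 20.5]]
```

The spread of each vector is unchanged (the second vector still spans 30 units), which is the difference from layer normalization, where each vector would also be divided by its standard deviation.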

Single Layer of EstraNet
The EstraNet layer, as illustrated in Figure 5c, is similar to a standard TN layer depicted in Figure 5a. However, it incorporates the proposed multi-head GaussiP attention from Section 4.1 instead of the vanilla multi-head self-attention, and layer-centering instead of layer normalization. Compared to the TransNet layer shown in Figure 5b

EstraNet Architecture
EstraNet follows the general multilayer architecture of TN models with some notable exceptions. The architecture is shown in Figure 6. The input to the EstraNet model is a one-dimensional trace denoted as $t = [t_0, \ldots, t_{n-1}] \in \mathbb{R}^n$. However, the EstraNet layer assumes the input to be two-dimensional. Thus, we pass each input trace through several, say $L_{conv}$, convolutional and average-pooling layers, which convert the input trace $t$ into a sequence of vectors $X = [x_0, \ldots, x_{m-1}]^T \in \mathbb{R}^{m \times d}$, where $m$ is the (possibly reduced) length of the output sequence $X$ and $d$ is the dimension of its feature vectors. Note that by setting a large value for the pool size, denoted as $ps$, of the average-pooling layers, we can make $m \ll n$, enabling the efficient processing of the sequence by the following layers of EstraNet. Then the output $X$ of the final convolutional layer is passed through several EstraNet layers, resulting in the output $Y = [y_0, \ldots, y_{m-1}]^T \in \mathbb{R}^{m \times d}$. The output sequence $Y$ of $m$ $d$-dimensional feature vectors is then reduced into a single vector $\bar{y} \in \mathbb{R}^d$ using a (multi-head) softmax-attention layer. The vector $\bar{y}$ is then passed to the classification layer to generate prediction probabilities.
In the following section, we present the results of our experiments.

Experimental Results
This section presents the experimental evaluation of EstraNet. We provide an overview of the datasets used for the evaluation in Section 6.1. The methodology for selecting the attack window is described in Section 6.2. We provide detailed information about the benchmark models used in this study in Section 6.3. The training process of EstraNet is explained in Section 6.4. Section 6.5 elaborates on the experimental setup and evaluation methods. A comparative study between EstraNet and the benchmark models in the presence of a combination of masking, random delay, and clock jitter countermeasures is provided in Sections 6.6 to 6.8. In Section 6.9, we perform an ablation study to investigate the impact of various design choices in EstraNet. Section 6.10 discusses the impact of several hyper-parameter choices on EstraNet's performance. The training time of EstraNet is compared with that of the benchmark models in Section 6.11. Finally, Section 6.12 investigates the impact of data augmentation on the shift-invariance of EstraNet.

Dataset Details
For evaluating EstraNet, we select three datasets of software implementations of ciphers protected by masking countermeasures. Since, in the software implementation of masking countermeasures, different shares of the intermediate secret variable leak at different regions of the power/EM traces, they are ideal candidates for evaluating EstraNet's ability to capture long-distance dependency. Additionally, we have added random delay and the clock jitter effect [WP20] to the traces to evaluate the shift-invariance of EstraNet. This section provides the details of the datasets.
ASCAD Fixed Key (ASCADf)
The ASCAD fixed key dataset⁵ is a collection of 60K traces of a first-order masked implementation of AES running on an 8-bit ATMega8515 microcontroller. We divided the entire dataset into three splits: profiling, validation, and test, containing 50K, 5K, and 5K traces, respectively. Following the common practice in the literature [BPS+20], we attack the third S-box operation of the first round of the cipher. We use the identity leakage model to generate the labels for the profiling traces, as it was found to perform better than the Hamming weight leakage model in previous studies [BPS+20].

ASCAD Random Key (ASCADr)
Like the ASCADf dataset, the ASCAD random key dataset⁶ is a collection of traces of a first-order masked implementation of AES running on an 8-bit ATMega8515 microcontroller. However, in the ASCADf dataset, the secret key is fixed for all profiling traces, whereas in the ASCADr dataset, the secret key varies randomly for each profiling trace. The profiling split of the dataset contains 200K traces, while the attack split contains 100K traces. We created validation and test splits by selecting 10K traces for each split from the 100K attack traces. As with the ASCADf dataset, we attack the third S-box operation of the first round of the cipher. We also use the identity leakage model to generate the labels for the profiling traces on this dataset.

CHES 2020 CTF SW3 (CHES20)
Clyde-128 is a tweakable block cipher that supports side-channel resilient and efficient bit-slicing implementation on 32-bit microprocessors [BBB+20]. Spook SCA CTF⁷ is an SCA challenge for masked implementations of Clyde-128. The challenge consists of multiple datasets for different implementations of the cipher. We select the dataset collected from a second-order masked implementation of the cipher running on an ARM Cortex-M0 microcontroller. The dataset contains 200K profiling traces and 500K attack traces. We select all 200K profiling traces as the training set and 10K traces each for the validation and test sets from the attack traces. Since Clyde-128 is an LS design, it works on a (4 × 32)-bit state, with 4 being the size of the non-linear S-box and 32 being the size of the linear L-box. We target the four bits of the 17th column from the left of the first-round S-box operation. We chose the 17th column as we found the SNR for these bits to be high. Since each bit of an S-box is processed separately in the bit-slicing implementation of the cipher, we use a multilabel loss having a sigmoid function for each output bit of the S-box (as introduced in [ZXF+19, ZXF+21]) to attack the four target bits.
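A multilabel loss with one sigmoid per output bit can be sketched as a sum of four binary cross-entropy terms. The snippet below is a minimal illustration, not the formulation of [ZXF+19, ZXF+21]: summing (rather than averaging) over the bits, the LSB-first bit order, and the function names are all assumptions made for the example.

```python
import math
from typing import List, Sequence


def nibble_to_bits(value: int) -> List[int]:
    """Split a 4-bit S-box output into its bits (LSB first; order is an assumption)."""
    return [(value >> k) & 1 for k in range(4)]


def multilabel_sigmoid_loss(logits: Sequence[float], bits: Sequence[int]) -> float:
    """Sum of per-bit binary cross-entropies, one sigmoid output per target bit.

    Uses the numerically stable form max(z, 0) - z*b + log(1 + exp(-|z|)),
    which equals -[b*log(sigmoid(z)) + (1-b)*log(1 - sigmoid(z))].
    """
    loss = 0.0
    for z, b in zip(logits, bits):
        loss += max(z, 0.0) - z * b + math.log1p(math.exp(-abs(z)))
    return loss


# An uninformative model (all logits 0) pays log(2) per bit, i.e. 4*log(2) in total.
uninformed = multilabel_sigmoid_loss([0.0, 0.0, 0.0, 0.0], nibble_to_bits(0b1010))
# Confident, correct logits give a loss close to 0.
confident = multilabel_sigmoid_loss([-20.0, 20.0, -20.0, 20.0], nibble_to_bits(0b1010))
```

Treating the four bits as separate binary labels matches the bit-sliced setting, where each bit of the S-box output is processed, and therefore leaks, separately.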
The statistics of all three datasets are summarized in Table 3.

Attack Window Selection
The trace length of the above three datasets varies from 62500 (for the CHES20 dataset) to 250K (for the ASCADr dataset). Instead of performing the evaluation on the full-length traces, we perform it on a selected attack window from the full-length traces. We consider two sizes for the attack window: 10K and 40K. Thus, for the first set of experiments, we select an attack window of size 10K for each of the three datasets and perform attacks on the window. For the second set of experiments, we use an attack window of size 40K. The selected attack windows for the three datasets are given in Table 4. The attack windows are selected using an SNR-based method: we calculated the SNR at each sample point and selected a window of the respective size such that the window contains the largest number of high-SNR sample points. In many scenarios (e.g., in the presence of countermeasures like masking, desynchronization, and clock jitter), the SNR-based method may not work. However, note that one can still identify the attack window (with size in the order of 10K, which is significantly large) based on a combination of knowledge of the implementation and intelligent guesses (such as in the scheme-aware threat model [MCLS23]). In the worst case, an adversary can iteratively repeat the attack by selecting different attack windows at different segments of the traces. For example, if the first round of the cipher spans over 100K sample points, one can repeat the attack on multiple windows of size 40K, such as [0, 40K], [20K, 60K], [40K, 80K], and [60K, 100K]. Note that recent research [LZC+21] has demonstrated the possibility of directly performing attacks on traces with a length in the order of 100K and achieving good results. However, the results presented in Section 6.8 reveal that the performance of these models significantly deteriorates in challenging scenarios, such as in the presence of clock jitter. Therefore, EstraNet can be a better alternative for the worst-case security evaluation in such situations. From now on, we will use the terms "window size" and "trace length" interchangeably to refer to the size of the attack window.
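The window-selection procedure and the worst-case fallback described above can be sketched as follows. This is an illustrative reading, not the authors' script: the per-sample SNR values are taken as given, the `threshold` that defines a "high SNR" point is an assumption, and both function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def best_snr_window(snr: Sequence[float], window: int,
                    threshold: float) -> Tuple[int, int]:
    """Return (start, count) of the window holding the most high-SNR points.

    A point counts as "high SNR" if snr[i] >= threshold (assumed criterion).
    Uses a sliding count so the scan is linear in the trace length.
    """
    if window > len(snr):
        raise ValueError("window is longer than the trace")
    high = [1 if s >= threshold else 0 for s in snr]
    count = sum(high[:window])
    best_start, best_count = 0, count
    for start in range(1, len(snr) - window + 1):
        # Slide by one sample: add the entering point, drop the leaving one.
        count += high[start + window - 1] - high[start - 1]
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count


def overlapping_windows(total: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Enumerate [start, end) windows for repeating the attack over a long segment."""
    return [(s, s + window) for s in range(0, total - window + 1, stride)]


start, count = best_snr_window([0.1, 0.2, 0.9, 0.8, 0.7, 0.1], window=3, threshold=0.5)
windows = overlapping_windows(100_000, window=40_000, stride=20_000)
```

With a stride of half the window size, `overlapping_windows` reproduces the four windows given in the example for a 100K-sample first round.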

Benchmark Models
This section briefly describes the (existing) DL models used as the benchmark models in our experiments. Many DL models have been introduced in the SCA literature over the last few years. Among all those models, we have selected three models which have been introduced to deal with long traces and/or large trace desynchronizations. The models are briefly described below. The detailed architectures of the models are given in Appendix C.

PolyCNN [MBC+20]
In [MBC+20], Masure et al. proposed a CNN model to attack an AES implementation protected by code polymorphism. Using their proposed model, they successfully recovered the secret key using fewer than 20 attack traces. Since their model is suitable for long traces, we use it as a benchmark model. We trained the model on the three datasets for 10K epochs using the Adam optimizer and a constant learning rate of 1e-5.

EffCNN [ZBHV20]
In [ZBHV20], Zaid et al. proposed a methodology for creating CNN models that are effective against desynchronized traces. They further demonstrated that their methodology can be used to construct CNN models that perform successful attacks on several datasets. Thus, their models are good candidates for benchmark models. We constructed three models for the three datasets and used those as the benchmark models. The models have been trained for 2K epochs using the Adam optimizer and a constant learning rate of 2.5e-5.

LSTMNet [LZC+21]
In [LZC+21], Lu et al. proposed to use LSTM-based models to perform attacks on full-length traces. Their experimental study demonstrated that the LSTM-based models could be used to conduct successful attacks on both synchronized and desynchronized datasets. Thus, we use their models as benchmark models. On the ASCADf and ASCADr datasets, we took the respective models from their online repository⁸ and trained the models. Since no LSTM-based model is available for the CHES20 dataset, we use their model for the ASCADr dataset to train on the CHES20 dataset. The models were trained for 4K epochs using the Adam optimizer and a constant learning rate of 1e-4.

Training details
We use the cross-entropy loss and the Adam optimizer to train EstraNet. For the learning rate schedule, cosine decay with a linear warmup schedule [ZLLS20] has been used. More precisely, we increase the learning rate linearly from 0 to 2.5e-4 over the first $t_{warmup}$ gradient-update steps and then gradually decrease it to 0.004 × 2.5e-4 following a cosine curve for the remaining $t_{max} - t_{warmup}$ steps, where $t_{max}$ is the maximum number of training steps and $t_{warmup}$ (satisfying $t_{warmup} < t_{max}$) is the number of warmup steps. In Appendix D, we describe the learning rate schedule in more detail.
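The schedule described above can be sketched as a small function. The exact formula is given in Appendix D, which is not reproduced here, so the half-cosine interpolation below is the standard form and should be read as an assumption; only the peak rate (2.5e-4) and the final ratio (0.004) come from the text.

```python
import math


def learning_rate(step: int, t_warmup: int, t_max: int,
                  peak: float = 2.5e-4, final_ratio: float = 0.004) -> float:
    """Linear warmup from 0 to peak, then cosine decay to final_ratio * peak.

    Assumption: standard half-cosine interpolation between peak and the floor.
    """
    if step < t_warmup:
        return peak * step / t_warmup
    progress = min(1.0, (step - t_warmup) / (t_max - t_warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return peak * (final_ratio + (1.0 - final_ratio) * cosine)


# Small step counts for illustration only.
schedule = [learning_rate(s, t_warmup=100, t_max=400) for s in (0, 50, 100, 250, 400)]
```

The rate peaks exactly at the end of warmup and then decreases monotonically until it reaches the floor at `t_max`.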

Hyper-parameter settings
We adopted the common hyper-parameters, such as the number of EstraNet layers $L$ and the model dimension $d$, from [HSAM22]. The value dimension $d_v$ and the number of heads $H$ in GaussiP attention have been set to 32 and 8, respectively. Following [BPS+20] and [HSAM22], the kernel width of the first convolutional layer has been set to 11. The kernel width of the subsequent convolutional layers has been set to 3 (as in [LZC+21]). Unless stated otherwise, we set the number of convolutional layers and the pool size of each average-pooling layer to 2 and 10, respectively. However, we found that, on the ASCADf dataset, a pool size of 10 leads to poor performance as the sampling rate of the dataset is comparatively low. Thus, we set it to a slightly smaller value, 8. The dimension of the feature map of the GaussiP attention kernel (denoted as $d_e$) has been set to 512. We set $t_{max}$, the maximum number of training steps, and $t_{warmup}$, the number of warmup steps, to 4M and 1M, respectively, for the ASCADf and ASCADr datasets. However, using 1M warmup steps leads to unstable training on the CHES20 dataset. Thus, we set $t_{warmup}$ to 2M for that dataset. We found that the scaling hyper-parameter β 2 strongly influences the performance of EstraNet. Therefore, we tuned it over three values, 10, 50, and 150, for each dataset. The rest of the hyper-parameter values have been found based on some initial experiments. Section 6.10 discusses the influence of several important hyper-parameters on EstraNet's performance.

Experimental Setup
Since we aim to evaluate the robustness of the models to misalignments in the traces, we trained each model on desynchronized profiling traces. Data augmentation techniques were also applied by introducing random displacements to each profiling trace on the fly during training. Thus, in different epochs, the same profiling trace is shifted by different values.
It should be noted that this data augmentation was also implemented when training the benchmark models. To reduce training time, early stopping of training was employed. In other words, we evaluated the trained model at regular intervals during the progress of the training and stopped the training when no significant improvement was observed on the validation dataset. The intermediate model that exhibited the best performance on the validation dataset was selected for the final evaluation on the test dataset.
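The on-the-fly shift augmentation can be illustrated with a minimal sketch. How the displacement is realized is not specified in the text, so this example assumes it is implemented by cropping a fixed-size window at a random offset from a slightly longer raw trace; the function name and this cropping strategy are hypothetical.

```python
import random
from typing import List, Sequence


def random_shift_crop(trace: Sequence[float], window: int, max_shift: int,
                      rng: random.Random) -> List[float]:
    """Return a window-sized crop of the trace starting at a random offset.

    Assumption: the raw trace has at least window + max_shift samples, so a
    shift in [0, max_shift] never runs past the end of the trace.
    """
    if len(trace) < window + max_shift:
        raise ValueError("trace too short for the requested window and shift")
    shift = rng.randint(0, max_shift)  # inclusive on both ends
    return list(trace[shift:shift + window])


rng = random.Random(0)  # seeded only to make the example reproducible
raw_trace = list(range(1000))
# Calling this once per epoch gives the same trace a different shift each time.
crops = [random_shift_crop(raw_trace, window=600, max_shift=400, rng=rng)
         for _ in range(3)]
```

Because a new offset is drawn on every call, the model sees each profiling trace at many different alignments over the course of training without storing extra copies of the data.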
The guessing entropy of each model on a dataset is computed by repeating the attack 100 times on randomly permuted traces of the dataset. For each experiment, we repeat each model's training (starting from the random initialization) three times. We report the best, median, and average results of the repeated experiments.
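The guessing-entropy computation described above can be sketched as follows. The sketch assumes that the model output has already been converted into a per-trace log-likelihood for every key candidate (`log_probs[i][k]`), that ranks are 1-indexed with ties resolved in favor of the true key, and that the number of traces needed is the first point at which the average rank reaches 1. These conventions and the function names are illustrative assumptions.

```python
import random
from typing import List, Optional, Sequence


def guessing_entropy(log_probs: Sequence[Sequence[float]], true_key: int,
                     repeats: int, rng: random.Random) -> List[float]:
    """Average rank of the true key after 1, 2, ..., n attack traces.

    Each repetition shuffles the traces and accumulates per-candidate
    log-likelihoods; rank 1 means the true key has the highest score.
    """
    n, num_keys = len(log_probs), len(log_probs[0])
    rank_sums = [0.0] * n
    order = list(range(n))
    for _ in range(repeats):
        rng.shuffle(order)
        acc = [0.0] * num_keys
        for t, idx in enumerate(order):
            for k in range(num_keys):
                acc[k] += log_probs[idx][k]
            rank = 1 + sum(1 for k in range(num_keys) if acc[k] > acc[true_key])
            rank_sums[t] += rank
    return [s / repeats for s in rank_sums]


def traces_to_ge1(ge: Sequence[float]) -> Optional[int]:
    """Smallest number of traces at which the guessing entropy reaches 1."""
    for t, value in enumerate(ge, start=1):
        if value <= 1.0:
            return t
    return None  # guessing entropy 1 was not reached with the given traces


rng = random.Random(0)
toy_scores = [[-1.0, -2.0, -3.0]] * 5  # candidate 0 is always the most likely
ge_good = guessing_entropy(toy_scores, true_key=0, repeats=100, rng=rng)
ge_bad = guessing_entropy(toy_scores, true_key=2, repeats=100, rng=rng)
```

In the toy data the first candidate always scores highest, so the guessing entropy is 1 from the first trace when it is the true key and stays at 3 when the true key is the least likely candidate.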

Experimental Results for Trace Length 10K
This section presents a comparative analysis of EstraNet with the benchmark models for the attack window size of 10K (refer to Section 6.2 for details on attack window selection). To assess the robustness of the deep learning (DL) models against random delay countermeasures, we applied a profiling desynchronization of 200. In other words, we independently desynchronized each profiling trace by a maximum displacement of 200. During training, we used data augmentation. More precisely, we randomly desynchronized each profiling trace by a maximum displacement of 200 on the fly (refer to Section 6.5 for a detailed description). The trained models were then evaluated on the attack set with attack desynchronizations of 0, 200, and 400. Each DL model was trained independently three times. Table 5 provides the best, median, and average values of $T_{GE1}$ (the minimum number of traces required to achieve guessing entropy 1) obtained from the three trained models for the three attack desynchronization scenarios.
Based on the observations from the table, it can be noted that on the ASCADf dataset, EstraNet consistently outperforms the other three methods. It demonstrates an improvement of more than 50% in the majority of cases in terms of all three metrics and across all three attack desynchronization scenarios. On the ASCADr dataset, EstraNet performs significantly better than EffCNN and similar to LSTMNet with respect to all three metrics. On this dataset, while PolyCNN shows similar performance to EstraNet for attack desynchronizations of 200 and 400, it performs poorly compared to EstraNet for an attack desynchronization of 0. Finally, on the CHES20 dataset, EstraNet showcases more than 60% improvement over PolyCNN and EffCNN in terms of all three metrics. Although LSTMNet has comparable results to EstraNet in terms of the best $T_{GE1}$ on that dataset, it requires 6 to 12 times more traces to achieve guessing entropy 1 according to the median or average values. In summary, it can be concluded that EstraNet performs significantly better than the benchmark models, with improvements of over 50%, in the majority of cases, though the gains are sometimes marginal. Figure 8 plots the median guessing entropy of the DL models with respect to the number of attack traces on the three datasets for attack desync 400. The plots also support the above observations.

Experimental Results for Trace Length 40K
This section presents a comparative analysis with the benchmark models for the attack window size of 40K. In contrast to Section 6.6, where the trained models were evaluated against smaller attack desynchronizations (0, 200, and 400), this section evaluates the models in the presence of larger attack desynchronizations (600 and 1000). For the experiments, a profiling desynchronization of 600 was used, resulting in the random shifting of each profiling trace by a maximum displacement of 600. Furthermore, data augmentation was incorporated during training by additionally desynchronizing each profiling trace on the fly with a maximum displacement of 400. For each dataset, we independently trained each DL model three times. Table 6 provides the best, median, and average values of $T_{GE1}$ (the minimum number of attack traces required to achieve guessing entropy 1) obtained from the three trained models. In this scenario, some trained models failed to reach guessing entropy 1 using 5K attack traces. Entries with a dash symbol ('-') in Table 6 indicate that the average $T_{GE1}$ is not available as some of the independently trained models failed to reach guessing entropy 1 using 5K attack traces.

Table 6: The minimum number of traces (fewer is better) required to reach guessing entropy 1 by EstraNet and the benchmark models for trace length 40K. The models have been evaluated on attack traces with attack desyncs 600 and 1000. The columns titled Best, Med., and Avg. respectively show the best, median, and average results of three independently trained models. The '-' entries in the table indicate that the average value is not available as some of the independently trained models failed to reach guessing entropy 1 using 5K attack traces.

Comparing the results of EstraNet with those of the benchmark models on the ASCADf dataset, EstraNet demonstrates a significant improvement of at least 90% over PolyCNN and EffCNN. It also showcases improvements ranging from 26 to 90% over LSTMNet. Moreover, in terms of training stability, EstraNet performs better than the other models, as PolyCNN and EffCNN fail to reach guessing entropy 1 in one of the three training trials, and LSTMNet performs poorly in terms of the average $T_{GE1}$. On the ASCADr dataset, EstraNet exhibits substantial improvements of 85 to 90% over PolyCNN and LSTMNet for all three metrics. It also demonstrates an improvement of 10 to 60% over EffCNN. On the CHES20 dataset, none of the PolyCNN and EffCNN models could bring $T_{GE1}$ below 5K. Although EstraNet performs slightly worse than LSTMNet in terms of the best and median $T_{GE1}$ values on this dataset, it demonstrates more stability, with an average $T_{GE1}$ of around 20. In contrast, in one of the three training trials, LSTMNet failed to bring $T_{GE1}$ below 5K. In summary, it can be concluded that EstraNet offers significant improvements (up to 90%) over the benchmark models on the ASCADf and ASCADr datasets. Though LSTMNet performs slightly better than EstraNet in terms of the best and median $T_{GE1}$ on the CHES20 dataset, its performance is considerably less stable than that of EstraNet. Figure 10 visually depicts the median guessing entropy of the different methods with respect to the number of attack traces for attack desync 1000, further confirming the observations of Table 6.

Experimental Results in the Presence of Clock Jitter Effect
This section presents a comparative analysis of EstraNet with the benchmark models on three datasets with added clock jitter effect⁹. We pre-processed the traces using the approach proposed in [CDP17, WP20] to introduce the clock jitter effect into the datasets. Appendix E provides the details of the algorithm employed to add the clock jitter effect. Figure 11 illustrates several sample traces after adding the clock jitter effect. As in the previous experiments, the DL models in this section were trained with data augmentation, where each profiling trace was independently desynchronized by a maximum displacement of 200 during training. Also, as in the previous experiments, we independently trained each DL model three times. Table 7 presents the best, median, and average values of $T_{GE1}$ (the minimum number of attack traces required to achieve guessing entropy 1) obtained in the three training trials. It should be noted that the elastic alignment technique [vWWB11] is a well-known method used to align traces affected by clock jitter. Therefore, for each dataset, we trained the benchmark models on the misaligned traces (i.e., traces obtained after adding the clock jitter effect) and on the traces obtained after aligning the misaligned traces using elastic alignment. Both sets of results are presented in Table 7.
Analysis of the table reveals that on the ASCADf dataset, both PolyCNN and EffCNN perform poorly on the misaligned traces, although their performance improves significantly after elastic alignment. However, they still require 5 to 12 times more traces to reach guessing entropy 1 compared to EstraNet. In contrast, none of the LSTMNet models is able to reach guessing entropy 1 using 5K traces, even after elastic alignment. On the ASCADr dataset, both PolyCNN and LSTMNet exhibit poor performance. With one exception, none of these models can reach guessing entropy 1 using 5K traces. Though EstraNet fails to reach guessing entropy 1 in one of the three training trials on the dataset, it shows an improvement of almost 70% over EffCNN in terms of the best $T_{GE1}$. For the CHES20 dataset, both PolyCNN and LSTMNet perform poorly, as all of their models require either more than or close to 5K traces to reach guessing entropy 1. Although the best EffCNN model requires 576 traces to reach guessing entropy 1, its other models require either close to or more than 5K traces for the same. In contrast, all EstraNet models require only 25 to 40 traces to reach guessing entropy 1. In summary, the benchmark models fail to reach guessing entropy 1 using 5K attack traces in the majority of cases, and even in the cases where they perform well, their performance is an order of magnitude worse than that of EstraNet.

Table 7: The minimum number of attack traces (fewer is better) required to reach guessing entropy 1 by EstraNet and the benchmark models on the datasets with the added clock jitter effect. The column titled Elastic Alignment indicates whether the traces have been aligned using elastic alignment [vWWB11] prior to performing the attack. The columns titled Best, Med., and Avg. respectively show the best, median, and average results of three independently trained models. The '-' entries in the table indicate that the average value is not available as some of the independently trained models failed to reach guessing entropy 1 using 5K attack traces.
[Table 7 body not recovered; surviving row fragment: Elastic Alignment: No; Best: 26; Med.: 28; Avg.: 32.0]

Finally, we would like to highlight that in the presence of clock jitter, the relative distances between the POIs fluctuate significantly across different traces. Although the design of GaussiP attention assumes constant relative distances between the POIs in all traces, EstraNet demonstrates better robustness to such fluctuations than the benchmark models. This robustness can be attributed to two key factors. Firstly, GaussiP attention uses Gaussian attention, which is resilient to minor variations in relative distances. For instance, if an attention head assigns significant attention to a trace segment [s, t], the attention output can remain almost the same even if some POIs shift their positions within the segment. Secondly, in EstraNet, the input traces propagate through multiple convolutional and average-pooling layers before reaching the first GaussiP attention layer. As a result, some of the fluctuations in relative distances are absorbed during the propagation through the average-pooling layers. Indeed, the above experimental results indicate that EstraNet achieves superior robustness to fluctuations in the relative distances introduced by the clock jitter effect compared to the benchmark models.

Table 8: Comparison of the use of softmax-attention and global average-pooling in EstraNet. Both models have been trained thrice on the ASCADf dataset using the setup of Section 6.7. We report the number of attack traces required to reach guessing entropy 1 in the three training runs of the two models.

⁹ [...] adversary possesses a clean trace alongside each noisy trace in the dataset. In contrast, we consider a weaker adversary with no clean traces, as in [CDP17].

Ablation Study
The EstraNet model incorporates two novel layers: the GaussiP attention layer and the layer-centering normalization layer. Additionally, EstraNet utilizes a softmax-attention layer. This section aims to explore the impact of integrating these layers into EstraNet and determine their contribution to its improved performance.

Ablation study of softmax-attention
In this section, we assess the performance of EstraNet without any softmax-attention. To this end, we replace the softmax-attention with a global average-pooling layer, keeping all other hyper-parameters the same. Table 8 presents a comparison between EstraNet with global average-pooling and the vanilla EstraNet (i.e., EstraNet with softmax-attention) on the ASCADf dataset using an attack window size of 40K. Note that, as before, both models were trained three times. The table displays the minimum number of attack traces required to reach guessing entropy 1 (denoted as $T_{GE1}$) obtained from the three training runs for each model.
From the table, we observe that the $T_{GE1}$ for EstraNet with softmax-attention ranges from 21 to 28. In contrast, for EstraNet with global average-pooling, it reaches as high as 1734 in some training runs. This indicates that employing softmax-attention instead of global average-pooling in EstraNet leads to a significant improvement in its performance.

Ablation study of layer-centering layer
This section focuses on assessing the impact of the newly introduced layer-centering layer in EstraNet. Toward that goal, we conducted experiments using two other normalization methods, namely layer normalization and batch normalization. We also included the results of EstraNet without any normalization, as [HSAM22] reported its effectiveness in TransNet. The pool size hyper-parameter was varied for each normalization method, and the attack performance was evaluated. Each experiment was repeated three times, and the results are presented in Table 9.
From the table, it can be observed that EstraNet with layer-centering maintains consistent performance across different pool sizes. Conversely, for the other normalization methods or without any normalization, the performance of EstraNet deteriorates significantly as the pool size decreases from 8 to 4. Moreover, in the case of batch normalization, EstraNet does not reach guessing entropy 1 with considerably fewer than 5K attack traces for any pool size. These results highlight the robustness of EstraNet when using layer-centering compared to the other alternatives.

Table 9: Attack results of EstraNet using various normalization methods. All models have been independently trained thrice on the ASCADf dataset using the setup of Section 6.7. We report the number of attack traces required to reach guessing entropy 1 in the three training runs of each model.

Figure 12: Training loss of the modified EstraNet models, with the GaussiP attention replaced by the self-attention of [HSGB21] (Figure 12a) and [LCW+21] (Figure 12b).

Ablation study of the proposed GaussiP attention layer
This section assesses the performance of EstraNet when the proposed GaussiP attention layer is substituted with alternative self-attention layers that incorporate relative positional encoding and have linear complexity. In the TN literature, several such self-attention layers have been introduced (as discussed in Section 3.3). Due to computational constraints, it is not feasible to verify all of these options. Hence, we select the self-attentions proposed in [LCW+21] and [HSGB21] as alternatives to GaussiP attention. We employ the same training setup to train the modified EstraNet models with the GaussiP attention replaced by the ones proposed in [LCW+21] and [HSGB21]. Figure 12 displays the training loss of the modified EstraNet models over the course of the initial 1.2 million training steps. The figures demonstrate that the training loss of the EstraNet models with alternative self-attention layers does not start to decrease even after 1.2 million training steps, indicating the challenge of training these models effectively. It is worth noting that it might be possible to adopt the existing self-attention layers in the context of SCA through non-trivial modifications. Nonetheless, the results presented in Figure 12 suggest that such adaptations remain a subject for further research.

Choice of the Hyper-parameters
In our experiments, we found that setting most of the hyper-parameters of EstraNet to their default values provides good results. However, selecting the appropriate value for a few hyper-parameters is crucial for its good performance. This section illustrates the choice of those hyper-parameters.

Choice of number of heads in GaussiP attention layer
In all our previous experiments, we found that setting the number of heads, denoted as H, in the GaussiP attention layer to the default value of 8 resulted in good performance. In this section, we aim to evaluate the sensitivity of EstraNet's performance to the choice of H. To accomplish this, we trained EstraNet models using different values of H on the ASCADf dataset while keeping the remaining hyper-parameters at their default values. Similar to earlier experiments, we trained three independent EstraNet models for each value of H and report the $T_{GE1}$ (the number of attack traces required to reach guessing entropy 1) in Table 10.
The table shows that the performance of EstraNet is significantly unstable for smaller values of H, such as 4 and 6. More precisely, for H = 4 and H = 6, EstraNet requires up to 367 and over 5K attack traces, respectively, to reach guessing entropy 1, whereas, for the default H = 8, it requires fewer than 30 traces in all three training trials. This unstable behavior of EstraNet can be attributed to the limited flow of information to distant sample points when using smaller values of H. Indeed, as explained in Section 4.1.3, increasing the number of heads in the GaussiP attention layer allows for more information flow from one sample point to distant sample points. Therefore, EstraNet performs very well with larger values of H, such as 8 and 10. However, the performance slightly deteriorates for H = 12. We attribute this deterioration in performance to the increased number of parameters in the GaussiP attention layer when using a very large value of H. In conclusion, these experiments reveal that there exists an optimal range for the choice of H in EstraNet. In our experiments, we have found that an H value within the range of 8 to 10 consistently performs well across the datasets.
Table 10: Attack results of EstraNet for different values of the hyper-parameter H (the number of heads in GaussiP attention). As before, for each value of H, the model has been independently trained thrice on the ASCADf dataset using the setup of Section 6.7. We report the number of attack traces required to reach guessing entropy 1 in the three training runs.

[...] fails to attain a guessing entropy of 1 even with as many as 5K traces, whereas it achieves its optimal performance with β 2 = 150. Notably, with a trace length (attack window size) of 10K, EstraNet performs most effectively with β 2 = 150 on the ASCADf and CHES20 datasets. However, for the ASCADr dataset, the optimal performance is observed with β 2 = 50. In general, for the trace length 10K, we observed that EstraNet exhibits favorable performance within the range of β 2 values between 50 and 150. Furthermore, our investigations reveal that as the trace length increases from 10K to 40K, the range of β 2 values associated with EstraNet's optimal performance increases. For instance, for the attack window size of 40K, the preferred β 2 values on the ASCADr and CHES20 datasets are 200 and 450, respectively. Notably, these values are respectively 4 and 3 times the optimal values obtained for the attack window size of 10K. In conclusion, we recommend tuning the β 2 hyper-parameter based on some validation data. We also want to mention that the performance of EstraNet generally improves as the value of β 2 gets larger. However, setting β 2 to too large a value makes the EstraNet model untrainable. In other words, while training an EstraNet model with a very large β 2, the training loss does not fall below the level of random loss. Therefore, we suggest gradually increasing the value of β 2 during the tuning process until the model becomes untrainable.

Choice of the feature map dimension of GaussiP attention kernel
In all previous experiments, we have utilized a feature map dimension of 512 for the GaussiP attention kernel (referred to as $d_e$ in Table 1), which is significantly large. However, employing a large value of $d_e$ significantly increases the memory and compute requirements for training the EstraNet model. Therefore, this section aims to assess the performance of EstraNet when using smaller values of $d_e$.

Shift-invariance: Dependence on Data Augmentation
The shift-invariance of a DL model is defined for infinite-length input [HSGB21].However, when dealing with inputs of finite length, the model's shift-invariance tends to break down near the input boundaries.Additionally, incorporating subsampling layers (e.g., averagepooling layer) into a DL model further diminishes its shift-invariance [Zha19,HSAM22].The reduced shift invariance could lead to overfitting during the model's training.Moreover, a model that exhibits greater shift-invariance is inherently less reliant on desynchronization in profiling traces and data augmentation [HSAM22].This section evaluates the impact of the employed data augmentation on the shift-invariance of EstraNet.Toward that goal, we repeated the experiments of Section 6.6.However, this time, the models are trained without any data augmentation.We compare the performance of EstraNet while trained with and without data augmentation in Table 14.The findings of Table 14 reveal that, on the ASCADr and CHES20 datasets, the performance of EstraNet remains almost identical whether or not data augmentation is used during training.Conversely, on the ASCADf dataset, the model's performance significantly declines when trained without data augmentation.This observed drop in performance on the ASCADf dataset can be attributed to the relatively lesser number of the profiling traces it contains, amounting to only 50K as opposed to the 200K profiling traces present in both the ASCADr and CHES20 datasets.The limited training samples cause EstraNet to overfit while trained without data augmentation, resulting in significantly poorer performance than the model trained with data augmentation.
Table 17 of Appendix F compares the shift-invariance of EstraNet with that of the benchmark models. Like EstraNet, the benchmark models deteriorate vastly on the ASCADf dataset, while their deterioration is marginal on the ASCADr and CHES20 datasets. In the absence of data augmentation, EstraNet performs significantly better than the benchmark models on the ASCADr and CHES20 datasets. On the ASCADf dataset, PolyCNN and LSTMNet fail to reach guessing entropy 1 using 5K attack traces, while EstraNet and EffCNN show similar performance.

Limitations and Future Works
Comparing the results for the trace length of 40K (Table 6) with those for the trace length of 10K (Table 5), it is evident that the performance of EstraNet deteriorates slightly as the trace length increases beyond 10K, though the deterioration is significantly smaller than for the other DL models. To investigate this decline further, we conducted additional experiments on the ASCADf dataset with a trace length of 60K sample points, using the attack setup of Section 6.7. The results are provided in Table 15. They show that the performance of EstraNet deteriorates significantly compared to the trace lengths of 10K (Table 5) and 40K (Table 6), though it remains significantly better than that of the benchmark DL models. Therefore, further improving EstraNet's performance on longer traces (e.g., with a trace length > 40K) is an important research direction to explore.

The resilience of a DL model to low-SNR traces is a pivotal requirement for ensuring the model's efficacy across a wide spectrum of datasets. To assess the robustness of EstraNet on low-SNR datasets, we conducted further experiments on the ASCADf dataset by adding Gaussian noise with standard deviations (std.) of 2 and 4. The addition of Gaussian noise reduces the peak SNR by approximately 30% and 60%, respectively. These experiments were conducted on traces of length 10K, employing the experimental setup of Section 6.6. The resulting findings are presented in Table 16.

The EstraNet architecture incorporates two major contributions. First, we propose a novel attention operation called GaussiP attention, which has linear time and memory costs. By incorporating relative positional encoding within the attention operation, EstraNet achieves shift-invariance. Additionally, the sparsity of attention probabilities in GaussiP attention makes it particularly suitable for SCA. Second, because standard normalization techniques such as batch normalization and layer normalization have limitations when used in EstraNet, we introduce a novel normalization method called layer-centering. The proposed normalization method significantly improves the stability of EstraNet training.
We conducted extensive experimental evaluations of EstraNet on three datasets consisting of masked implementations. We introduced random displacements to the datasets to assess EstraNet's robustness against the random delay countermeasure. The results demonstrate that EstraNet achieves up to a 90% decrease in the number of attack traces required to reduce the guessing entropy to 1 compared to three benchmark models. On datasets incorporating clock jitter effects, EstraNet provides up to an order of magnitude reduction in the number of attack traces required to reach guessing entropy 1 compared to the benchmark models. Additionally, we investigated the impact of various hyperparameters on EstraNet's performance. Overall, the experimental findings highlight EstraNet's potential as a promising avenue for advancing SCA research.

implying that the attention scores are a function of the relative distance between i and j. Additionally, the attention scores at each attention head are high for a small range of relative distances and close to zero for the rest, showing the sparsity of the attention scores.

C Detailed Architecture of the Benchmark Models
PolyCNN The PolyCNN model is a CNN model proposed in [MBC+20]. It has six convolutional blocks followed by a global average-pooling layer. Each convolutional block is composed of a convolutional layer, a batch-normalization layer, a ReLU activation, and an average-pooling layer. The numbers of features of the convolutional layers are respectively set to 10, 20, 40, 40, 80, and 100. The kernel width of the first convolutional layer is set to 10 and to 11 for the rest. The pool size and stride of the average-pooling layer are set to 25 in the first convolutional block and to 5 for the rest. We have used this model for trace length 40K. However, this model was not applicable to the smaller trace length 10K, as the trace length becomes 1 after the first 5 convolutional blocks. Thus, we removed one convolutional block (the third) from the original PolyCNN model to use it for trace length 10K.
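The length argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the helper name `lengths_after_pooling` is hypothetical, and it assumes length-preserving convolutions and pooling that rounds the output length up (as with "same" padding). Under a different padding convention the numbers would differ.

```python
def lengths_after_pooling(trace_len, pool_sizes):
    """Return the sequence length after each average-pooling layer.

    Assumptions of this sketch (not stated in the paper): convolutions keep
    the length unchanged, and pooling with stride == pool size rounds the
    output length up.
    """
    lengths = []
    n = trace_len
    for p in pool_sizes:
        n = -(-n // p)  # integer ceiling division
        lengths.append(n)
    return lengths


# Pool sizes of the six PolyCNN blocks: 25 in the first block, 5 in the rest.
POLYCNN_POOLS = [25, 5, 5, 5, 5, 5]
# Variant with one pool-5 block removed, as used for trace length 10K.
REDUCED_POOLS = [25, 5, 5, 5, 5]

print(lengths_after_pooling(40_000, POLYCNN_POOLS))  # [1600, 320, 64, 13, 3, 1]
print(lengths_after_pooling(10_000, POLYCNN_POOLS))  # [400, 80, 16, 4, 1, 1]
print(lengths_after_pooling(10_000, REDUCED_POOLS))  # [400, 80, 16, 4, 1]
```

Under these assumptions, a 10K trace already collapses to a single time step after five blocks, so a sixth block would operate on a length-1 sequence, which is consistent with the decision to drop one block.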
EffCNN In [ZBHV20], Zaid et al. proposed a methodology for constructing CNN models robust to desynchronizations in the traces. Their models have three convolutional blocks followed by a flattening layer and three fully-connected layers. Each convolutional block is composed of a convolutional layer, a SELU activation, a batch-normalization layer, and an average-pooling layer. The numbers of features in the convolutional layers are respectively set to 32, 64, and 128. The kernel widths of the layers are respectively set to $1$, $T/2$, and $n/I$, where $n$ is the trace length, $T$ is the maximum assumed amount of desynchronization, and $I$ is the assumed number of POIs in the traces. Similarly, the pool sizes and the strides in the average-pooling layers are respectively set to $2$, $T/2$, and $n/I$. For all experiments, we set $I$ to 10. For the experiments of Sections 6.6 and 6.8, we set $T = 400$, and for the experiments of Section 6.7, we set $T = 1000$.
LSTMNet In [LZC+21], Lu et al. proposed LSTM-based models. The models for the desynchronized ASCADf and ASCADr datasets have six convolutional blocks, followed by bidirectional LSTM layers, followed by softmax-attention. Each convolutional block is

D Learning Rate Schedule for EstraNet
In cosine decay with linear warmup learning rate scheduling, the learning rate is first linearly increased from 0 to a maximum value $l_{max}$ over the first $t_{warmup}$ steps. The learning rate is then gradually decreased to a minimum value $l_{min}$ over the remaining $t_{max} - t_{warmup}$ steps, where $l_{max}$ and $l_{min}$ are respectively the maximum and minimum learning rates, $t_{max}$ is the total number of training steps, and $t_{warmup}$ is the number of warmup steps. Thus, the learning rate $l_t$ at the $t$-th training step is a piecewise function of $t$: it grows linearly during the warmup phase and follows a cosine decay afterwards.

E Adding Clock Jitter Effect
To add the clock jitter effect, we pre-process each trace of the datasets as follows. For each sample point in the original trace, we perform one of the following three actions, each with probability one-third: a) we do not add it to the new trace, b) we simply add it, or c) we add it along with one additional sample point whose magnitude equals the average of the sample point and the next sample point. Note that the above pre-processing is equivalent to adding the clock jitter effect using the algorithm of [WP20] with clock_jitters_level set to 1. However, after this pre-processing, none of the DL models performed well, as many informative sample points were removed from the traces, making the informative sample points sparser in the pre-processed traces. We circumvented this loss of informative sample points as follows. Prior to adding the clock jitter effect, we double the number of sample points in the traces by repeating each sample point twice. Note that such pre-processing is equivalent to doubling the sampling rate of the trace acquisition setup. Since, during the addition of the clock jitter effect, at most one consecutive sample point of the original traces gets removed, repeating each sample point twice beforehand ensures that each sample point of the original traces appears at least once in the pre-processed traces while the traces still exhibit the clock jitter effect.
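The pre-processing described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the [WP20] implementation: the function name `add_clock_jitter` and its parameters are hypothetical. One assumption is made explicit in the code: a sample that directly follows a dropped sample is never dropped itself, so that at most one consecutive sample is removed, as the text states. With fully independent draws, two neighbouring samples could both be dropped.

```python
import numpy as np


def add_clock_jitter(trace, rng, upsample=True):
    """Simulate a clock jitter effect on a single trace (illustrative sketch)."""
    trace = np.asarray(trace, dtype=float)
    if upsample:
        # Repeat every sample twice (equivalent to doubling the sampling rate).
        trace = np.repeat(trace, 2)

    out = []
    prev_dropped = False
    for i, x in enumerate(trace):
        # Actions: 0 = drop, 1 = keep, 2 = keep and insert an averaged sample.
        # Assumption of this sketch: no drop directly after a drop.
        action = rng.integers(1, 3) if prev_dropped else rng.integers(0, 3)
        if action == 0:
            prev_dropped = True
            continue
        prev_dropped = False
        out.append(x)
        if action == 2:
            nxt = trace[i + 1] if i + 1 < len(trace) else x
            out.append(0.5 * (x + nxt))  # average of this and the next sample
    return np.asarray(out)


rng = np.random.default_rng(0)
original = np.arange(50, dtype=float)
jittered = add_clock_jitter(original, rng)
# Every original sample value survives at least once after upsampling.
print(set(original.tolist()).issubset(set(jittered.tolist())))
```

Because the upsampled trace holds two adjacent copies of every original sample and no two adjacent samples are dropped, at least one copy of each original sample always survives.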

F Ablation Study of Data Augmentation
In this section, we investigate the impact of data augmentation on the shift-invariance of the DL models. Toward that goal, we performed the experiments of Section 6.6 without any data augmentation. The results are detailed in Table 17. The findings show a significant decline in the performance of all models on the ASCADf dataset when the models are trained without data augmentation. Specifically, PolyCNN and LSTMNet were unable to reach guessing entropy 1 even using 5K attack traces. The performance of EstraNet is similar to or significantly better than that of EffCNN on this dataset. The LSTMNet model fails to reach guessing entropy 1 on the ASCADr dataset, although its performance was close to that of EstraNet on the CHES20 dataset. On the ASCADr and CHES20 datasets, EstraNet demonstrated slightly superior performance compared to PolyCNN and EffCNN. Overall, the better performance of EstraNet compared to the benchmark models suggests better shift-invariance of the EstraNet architecture.
[Figure 1 panels. Legend: $Sbox(p_3 \oplus k_3) \oplus m_3$, $Sbox(p_3 \oplus k_3) \oplus m_{out}$, $m_3$, $m_{out}$. (a) Plots of the SNR of four secret shares on the ASCAD fixed-key dataset. (b) Plots of the SNR of four secret shares on the ASCAD random-key dataset.]

Figure 1: Plots of the informative sample points on two widely used SCA datasets.

Figure 2: Plots of the histogram of the attention values of the randomly initialized self-attention scheme of [HSGB21] for three different scales.

Figure 3: Plots of $k_{GPA}(i - j)$ vs. $(i - j - c_p n)$ for different values of $s$ ($= \beta_2 s_p$).

Figure 4: Plots of the histogram of the unnormalized kernel scores approximated by the Fourier feature map $\phi_{tri}$. The histogram is plotted for three different values of $s$ (i.e., $\beta_2 s_p$).

Figure 5: Figure 5a shows a single layer of a standard TN with pre-layer normalization [XYH+20]. Figure 5b shows a single layer of TransNet proposed in [HSAM22]. Figure 5c depicts a single layer of the proposed EstraNet.

Figure 8: Plots of the guessing entropy vs. the number of attack traces for the experiments with trace length 10K. The attack traces are desynchronized by a maximum displacement of 400.

Figure 10: Plots of the guessing entropy vs. the number of attack traces for the experiments with trace length 40K. The attack traces are desynchronized by a maximum displacement of 1000.

Figure 13: The contour plots of the attention scores learned at the eight different heads of the first layer's GaussiP attention operations of EstraNet on the ASCADf dataset. The attention probabilities are represented as a 4K × 4K image $A$, where $A[i, j]$ represents the attention from $q_i$ to $k_j$.

Figure 14: Plots of learning rate vs. training step for cosine decay with linear warmup learning rate schedule.

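$$
l_t =
\begin{cases}
\dfrac{t}{t_{warmup}} \times l_{max} & \text{for } t \le t_{warmup}, \\[2ex]
l_{min} + \dfrac{l_{max} - l_{min}}{2} \times \left(1 + \cos\left(\pi \, \dfrac{t - t_{warmup}}{t_{max} - t_{warmup}}\right)\right) & \text{otherwise.}
\end{cases}
$$

Figure 14 plots the learning rate vs. the training step for the cosine decay with linear warmup learning rate schedule with $t_{max} = 100$, $t_{warmup} = 10$, $l_{max} = 1$, and $l_{min} = 0.004$.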
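For readers who want to reproduce the curve, the schedule can be sketched in plain Python. This is a minimal sketch, not the authors' training code: the function name is hypothetical, and the decay branch assumes the standard half-cosine form, which matches the description of a decrease from $l_{max}$ to $l_{min}$ over the remaining $t_{max} - t_{warmup}$ steps.

```python
import math


def cosine_decay_with_warmup(t, t_max, t_warmup, l_max, l_min):
    """Learning rate at training step t (illustrative sketch)."""
    if t <= t_warmup:
        # Linear warmup from 0 to l_max.
        return t / t_warmup * l_max
    # Fraction of the decay phase completed, in (0, 1].
    progress = (t - t_warmup) / (t_max - t_warmup)
    # Assumed standard half-cosine decay from l_max down to l_min.
    return l_min + 0.5 * (l_max - l_min) * (1.0 + math.cos(math.pi * progress))


# Parameter values quoted for Figure 14.
schedule = [
    cosine_decay_with_warmup(t, t_max=100, t_warmup=10, l_max=1.0, l_min=0.004)
    for t in range(101)
]
print(schedule[0], schedule[10], schedule[100])
```

With these values the rate starts at 0, peaks at $l_{max}$ at step 10, and ends at $l_{min}$ at step 100.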

Table 1: The set of hyper-parameters of the proposed GaussiP attention.

Table 2 .
Notably, the attention scores in the GaussiP attention exhibit sparsity. Figure 4 illustrates that, for significantly large values of $s = \beta_2 s_p$, the majority of attention scores are concentrated around zero, while only a few scores are substantially greater than zero. This characteristic aligns well with the nature of side-channel traces, where informative sample points are typically sparse. While methods such as Luo et al. [LLC+21] offer greater flexibility in learning arbitrary probability distributions, they require complex operations such as Fast Fourier Transform (FFT) for efficient

Table 2: Differences between the proposed GaussiP attention and the foremost self-attentions (with linear complexity and relative positional encoding) in TN literature.
, EstraNet incorporates the novel multi-head GaussiP attention.The attention operation has a linear time and memory complexity with respect to the input length.Conversely, the multi-head self-attention employed in the TransNet layer exhibits quadratic memory and time complexity.Furthermore, the TransNet layer lacks any normalization layer, whereas the EstraNet layer utilizes layer-centering for input normalization.

Table 4: The selected attack windows on the three datasets.

Table 5: The minimum number of traces (lower is better) required to reach guessing entropy 1 by EstraNet and the benchmark models for trace length 10K. The models have been evaluated on attack traces with attack desync 0 (no desync), 200, and 400. The columns titled Best, Med., and Avg. respectively show the best, median, and average results of three independently trained models.

Choice of distance based scaling in GaussiP attention
The selection of the distance-based scaling hyper-parameter, denoted as $\beta_2$ in Table 1, plays a critical role in achieving satisfactory performance with EstraNet. Table 11 presents the experimental results of attacking the ASCADf dataset using an attack window size of 10K for four distinct values of $\beta_2$ in EstraNet. The findings demonstrate the sensitivity of EstraNet's performance to the choice of $\beta_2$. Specifically, for $\beta_2 = 10$ and 300, EstraNet

Table 11: The minimum number of attack traces (lower is better) required to reach guessing entropy 1 on the ASCADf dataset with attack window size 10K by EstraNet for different values of $\beta_2$. For each experiment, the average result of three independently trained models is reported.

Table 12: The minimum number of attack traces (lower is better) required to reach guessing entropy 1 on the ASCADf dataset by EstraNet for different values of $d_e$. The columns titled Best, Med., and Avg. respectively show the best, median, and average results of three independently trained models.

Table 12 presents a performance comparison between EstraNet with $d_e = 512$ and EstraNet with reduced feature map dimensions: 256 and 128. The experimental results indicate that the performance of EstraNet with reduced feature map dimensions is comparable to that achieved with $d_e = 512$. These findings suggest that the performance of EstraNet remains stable for a wide range of $d_e$, indicating the possibility of making EstraNet more memory and compute efficient using a smaller $d_e$.

Table 13: Training time of the benchmark models compared to EstraNet.

This section presents a comparative analysis of the training times of EstraNet and the benchmark models. Table 13 provides the comparison for the trace length 10K. The table shows that the training time of PolyCNN is three to six times longer than that of EstraNet. Similarly, the training time of EffCNN is six to thirteen times longer than that of EstraNet. On the other hand, the training time of LSTMNet is approximately five to six times longer on the ASCADf and ASCADr datasets, while it is almost the same on the CHES20 dataset. Hence, with few exceptions, the training of EstraNet is approximately three to thirteen times faster than that of the benchmark models. Note that training EstraNet may involve tuning certain hyperparameters ($\beta_2$ in particular). However, given that the training time of a single EstraNet model is significantly smaller than that of the benchmark models, the hyperparameter tuning process can be completed in a time comparable to the training time of the benchmark models.

Table 14: EstraNet's performance with and without data augmentation. The rest of the experimental setup is the same as in Section 6.6.

Table 15: The minimum number of traces (lower is better) required to reach guessing entropy 1 by the DL models on the ASCADf dataset for trace length 60K. The columns titled Best, Med., and Avg. respectively show the best, median, and average results of the three independently trained models. The '-' entries in the table indicate that the average value is not available, as some of the independently trained models failed to reach guessing entropy 1 using 5K attack traces.

Table 16 .
By comparing the results of EstraNet in Table 16 with those in Table 5, we observe that the performance

8 Conclusions

Deep learning (DL) models have shown great success in SCA; however, selecting an appropriate model architecture remains crucial for achieving good performance. This work introduces a novel DL model called EstraNet for SCA. EstraNet exhibits linear time and memory complexity, significantly improving on the quadratic time and memory complexity of the previously proposed TN-based model, TransNet. This linear complexity makes EstraNet applicable to traces with lengths exceeding 10K. Moreover, EstraNet is shift-invariant, making it resilient to misalignments in the traces.

Table 17: The minimum number of traces (lower is better) required to reach guessing entropy 1 by EstraNet and the benchmark models for trace length 10K. The models have been evaluated on attack traces with attack desync 0 (no desync), 200, and 400. The columns titled Best, Med., and Avg. respectively show the best, median, and average results of three independently trained models. During the training of the models, no data augmentation is used.