Peek into the Black-Box: Interpretable Neural Network using SAT Equations in Side-Channel Analysis

Abstract. Deep neural networks (DNNs) have become a significant threat to the security of cryptographic implementations with regard to side-channel analysis (SCA), as they automatically combine the leakages without any preprocessing needed, leading to a more efficient attack. However, these DNNs for SCA remain mostly black-box algorithms that are very difficult to interpret. Benamira et al. recently proposed an interpretable neural network called Truth Table Deep Convolutional Neural Network (TT-DCNN), which is both expressive and easier to interpret. In particular, a TT-DCNN has a transparent inner structure that can entirely be transformed into SAT equations after training. In this work, we analyze the SAT equations extracted from a TT-DCNN when applied in the SCA context, eventually obtaining the rules and decisions that the neural network learned when retrieving the secret key from the cryptographic primitive (i.e., an exact formula). As a result, we can pinpoint the critical rules that the neural network uses to locate the exact Points of Interest (PoIs). We validate our approach first on simulated traces for higher-order masking. However, applying TT-DCNN on real traces is not straightforward. We propose a method to adapt TT-DCNN for application on real SCA traces containing thousands of sample points. Experimental validation is performed on the software-based ASCADv1 and hardware-based AES_HD_ext datasets. In addition, TT-DCNN is shown to be able to learn the exact countermeasure in a best-case setting.


Introduction
The increased usage of Internet-of-Things (IoT) devices [Lue21] has led to many applications where the data manipulated by the devices is sensitive, and such devices might be placed in a hostile environment, leading to the need to evaluate their security capabilities. Side-channel analysis (SCA) is one of those crucial threats that must be evaluated; ever since its first appearance in 1999 [KJJ99], it has become a widely studied research area in cryptography. Physical properties such as timing delay [Koc96], power consumption [KJJ99], and electromagnetic emanation [AARR03] may reveal information on the secret data. SCA focuses on finding and exploiting these leakages to retrieve the secret data. Over the years, the field of SCA has evolved from classical techniques like template attacks [CRR03] to machine learning algorithms [BLR12, HZ12, HGDM+11, LBM14, LPB+15, GHO15, LBM15], and recently deep learning-based SCA [PPM+21].
Due to the ever-increasing computing power over the last decades, deep neural networks (DNNs) have gained much recognition in various fields like image recognition [HZRS16] and natural language processing [YHPC17]. In 2016, Maghrebi et al. [MPP16] succeeded in retrieving the secret key of an unprotected AES implementation using various types of DNNs, which has drawn much attention over the past few years [PPM+21]. One of the advantages of deep learning-based SCA is that, in the presence of countermeasures like masking [PR13] or hiding, it requires little to no preprocessing to obtain a successful attack. This is unlike other classical techniques, which require a tremendous amount of preprocessing before the attack can be mounted [BPS+20, CDP17]. After understanding what DNNs have to offer in SCA, one may want to interpret the black-box algorithm and understand what these DNNs learn from the side-channel traces. Currently, if an attack is unsuccessful, we have no idea why that is so. Furthermore, it is difficult to pinpoint whether the unsuccessful attack is due to the countermeasures or because the hyperparameters and architecture are not correctly tuned for the traces obtained. On the other hand, if the attack is successful, evaluators are not able to locate the origin of the leakages, even though the designer would want this information to improve the security of the device. Thus, there is a need for better explainability and interpretability of these DNNs.
In order for us to understand what the DNNs learn, we have to understand where the leakages of the secret information are found on the traces. Masking [PR13] is a very common countermeasure used against SCA and is theoretically proven to be secure up to a given level [PR13]. It operates by splitting the secret information into multiple shares. Assuming sequential execution, each share leaks in a different part of the trace. The sample points on the trace that leak are better known as Points of Interest (PoIs). The adversary needs to observe PoIs from all the shares and combine them to retrieve the secret information. As of today, there has not been any work where exact formulae are extracted from a DNN to show the network using the PoIs from different shares for key recovery on both seen and unseen traces (i.e., global interpretation).
Prior Works. Various works have explored the explainability of DNNs in their own ways. The first work incorporating SCA and DNN explainability uses Gradient Visualization (GV) [MDP19], which is also used in other studies like [Tim19]. It calculates the gradient of the output of the DNN with respect to the input data. The idea is to observe how a slight change in the sample points of the traces affects the DNN's prediction. The authors of [MDP19] observed that the PoIs obtained through GV for unprotected traces are very similar to the PoIs found by a classical PoI technique like the Signal-to-Noise Ratio (SNR). They further considered traces with masking order 1 and found that GV can locate the two PoIs if the DNN is trained with early stopping. In other words, GV can only pinpoint the PoIs of higher-order masking if the trained DNN did not overfit the dataset. Furthermore, GV assumes that the neural network is differentiable over the set of all traces (i.e., a non-exact formulation), which may not be the case. Hettwer et al. extended this work by comparing other explainability techniques, for example Layer-wise Relevance Propagation (LRP) [BBM+15] and Occlusion [ZF13], and visualized them on heatmaps [HGG19]. [GBR22] further compared techniques like Integrated Gradients [STY17] and SmoothGrad [STK+17] with classical techniques for selecting the PoIs like the Difference of Means (DOM) [KJJ99] or Correlation Power Analysis (CPA) [BCO04]. Although all these techniques are powerful and essential, they only give an interpretation based on the traces given (i.e., a local interpretation).
In [vdVPB21], van der Valk et al. focus on a different aspect of explainability. They use Singular Vector Canonical Correlation Analysis (SVCCA) on two DNNs with the same architecture but trained on different datasets to compare how correlated the weights of the same layers are. Interestingly, a DNN trained on an SCA dataset is more correlated to a DNN trained on an image recognition dataset than to one trained on another SCA dataset with a different countermeasure. However, their work requires a significant computational effort to calculate the correlation between convolutional layers [GBC16]; therefore, they only compared the fully connected layers [GBC16] of the DNNs. Another technique, called ablation, was explored in [WWJ+21] to gain insights into and better understand the trained DNN. Ablation proceeds by randomly removing some weights or channels in a particular layer of the DNN. The authors then test the effect of various types of hiding countermeasures on the ablated DNN. They concluded that simpler countermeasures, like adding Gaussian noise to the traces, are processed in the early layers, while more complex countermeasures, like desynchronization, are processed in the deeper layers. In terms of interpreting the training of a DNN in SCA, Perin et al. [PBP20] created a metric for determining when to stop the training phase based on the work by Shwartz-Ziv and Tishby [ST17]. They use information theory concepts to visualize and interpret the information that the DNN is learning. Recently, [ZBC+22] combined deep learning with a stochastic attack using an autoencoder. Instead of learning the usual discriminative model for key recovery (which easily retrieves the keys), their model learns a generative model. They show that the weights of the neural network give an equation of the traces corresponding to the leakage. However, for higher-order implementations, the generative model they proposed requires the usage of a classical recombination technique [PRB09], which increases the time complexity and the length of the traces to analyze. This suggests that analyzing a discriminative model's internal structure remains an open question. Although the many works stated above have tried to interpret DNNs, there are still gaps in understanding them, especially for traces that have not been analyzed or in providing exact formulae for what they learn.
Our Contributions. Our work tries to bridge the gap between explainability and interpretability via SAT equations by using the interpretable neural network called the Truth Table Deep Convolutional Neural Network (TT-DCNN) [BPH21]. The TT-DCNN can be used as a discriminative model and provides a transparent inner structure by converting part of the network into SAT equations. The SAT equations offer a representation of the TT-DCNN for us to interpret on both seen and unseen traces, providing us with an exact and global interpretation of the TT-DCNN. To the best of our knowledge, this is the first work interpreting neural networks using SAT equations in the context of side-channel analysis.
The contributions of this paper can be summarized as follows: 1. We provide a general methodology to analyze the SAT equations extracted from a TT-DCNN in the SCA context.
2. We propose a TT-DCNN-based architecture, which we call TT_SCA_small, and show that TT_SCA_small can learn the exact locations of the PoIs in simulated traces of different masking orders.
3. TT_SCA_small cannot be directly applied to real traces with hundreds to thousands of sample points. We propose a method to adapt TT-DCNN to overcome the computational limitation due to the patch size (the task of simplifying the SAT equations relies on an NP-complete problem) so that it can be used on traces of extended length. We call this adapted architecture TT_SCA_big. We tested this architecture on both real software-implemented traces (ASCADv1) and hardware-implemented traces with low SNR (AES_HD_ext). It outperforms previous approaches like GV [MDP19, HGG19].
4. Our analysis shows that our proposed TT-DCNN-based architecture TT_SCA_big finds the positions of the leakages and learns a function based on these leakages to retrieve the key on both ASCADv1 and AES_HD_ext. The exact formula extracted from our proposed TT-DCNN gives us a global interpretation of what the network learns. In the best case, a modified TT_SCA_big is able to learn the exact masking countermeasure through SAT equations.
The source code of our models and training can be publicly accessed via the weblinks in the footnotes.1,2

Paper Organization. The paper is organized as follows. Section 2 provides the necessary background on side-channel analysis, deep learning, and TT-DCNN. In Section 3, we give a methodology to analyze the SAT equations acquired from the TT-DCNN with regard to side-channel attacks. In Section 4, we present TT_SCA_small and TT_SCA_big together with the datasets on which they are tested. Subsequently, we present the results and interpretability of TT_SCA_small and TT_SCA_big on each dataset and discuss their limitations. Lastly, in Section 5, we conclude the paper and outline some future works.

Notation and Terminology
We denote sets by calligraphic letters X. The corresponding capital letter X defines a random variable, and the bold capital letter X denotes a random vector. We use the corresponding lowercase letters x and x to represent the realizations of X and X, respectively. We let x[i] stand for the i-th entry of a vector x. A side-channel trace is defined as a vector t ∈ R^D, where D is the number of sample points in a trace. Let C represent a cryptographic primitive, with P denoting some public variable (e.g., plaintext or ciphertext) and K representing a part of the key. The targeted sensitive variable is the output of the cryptographic primitive, Z = C(P, K), with Z taking values in Z = {s_1, s_2, ..., s_|Z|}. We denote by k a key byte candidate taking its value from the keyspace K and by k* the correct key.
Masking is a countermeasure that is proven to be secure against side-channel analysis up to a given level of security [PR13]. It splits the targeted sensitive variable Z into several shares. Formally, we say that the cryptographic primitive is of masking order d if Z is split into d + 1 shares m_1, ..., m_{d+1} such that Z can be obtained back by a generic operator γ, i.e., Z = γ(m_1, ..., m_{d+1}) (e.g., for Boolean masking, γ is defined as the XOR of the shares m_i for all i ∈ {1, ..., d+1}). Throughout this paper, we shall focus on the SAT equation representation known as Disjunctive Normal Form (DNF), because it provides the most intuitive interpretation of the filters in the neural network, as will be explained in Section 2.3. A DNF over Boolean variables {x_1, x_2, ..., x_h}, denoted dnf, is a disjunction of disjuncts, dnf = (l_{1,1} ∧ ... ∧ l_{1,j_1}) ∨ ... ∨ (l_{i,1} ∧ ... ∧ l_{i,j_i}), where each literal l_{i,j} is a variable x_k ∈ {x_1, x_2, ..., x_h} or its negation ¬x_k.

Profiling Attacks
Profiling attacks assume a worst-case scenario where the adversary has access to two similar devices: a prototype or clone device and a target device. For the prototype device, the adversary can manipulate or knows the device's key, while the key of the target device is unknown to him. Furthermore, the adversary is able to collect several traces from a known set of random plaintexts (or ciphertexts) from both devices. The adversary's goal is to retrieve the unknown key from the target device. Profiling attacks can be divided into two stages: the profiling phase and the attack phase. In the profiling phase, the adversary builds a distinguisher F that takes in a set of profiling traces from the prototype device and returns an estimate of the conditional probability mass function Pr(Z = z | T = t). During the attack phase, the distinguisher outputs a vector of probability scores over the hypothetical sensitive values, y_i = F(t_i), for each attack trace t_i acquired from the target device. For every key k ∈ K, the log-likelihood score is defined as

s_{N_a}(k) = Σ_{i=1}^{N_a} log(y_i[z_{i,k}]),

where N_a is the number of attack traces used and z_{i,k} = C(p_i, k) are the hypothetical sensitive values based on the key k, with p_i being the public variable corresponding to the trace t_i. The adversary or evaluator can rank the keys by their log-likelihood scores in decreasing order and collect them in a guess vector G = [G_0, G_1, ..., G_{|K|-1}]. The key corresponding to G_0 is the most likely candidate, and the key corresponding to G_{|K|-1} is the least likely candidate. The index of a key in the guess vector G is called its rank. The metric called guessing entropy (GE) is defined as the average rank of the correct key k* [SMY09]. If GE = 0 when using N_a attack traces, the attack is considered successful.
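As a concrete illustration, the attack-phase scoring and guessing entropy computation described above can be sketched as follows. This is a minimal Python sketch, not the authors' code: `probs` (the distinguisher outputs), `plaintexts`, `true_key`, and the SBOX table are assumed placeholders.

```python
# Minimal sketch of the log-likelihood score and guessing entropy.
import numpy as np

SBOX = np.arange(256)  # placeholder; substitute the real AES S-box table here

def guessing_entropy(probs, plaintexts, true_key, n_attack, n_experiments=100):
    """probs: (N, 256) array of predicted Pr(Z = z | T = t_i).
    plaintexts: (N,) integer public bytes; true_key: correct key byte k*."""
    ranks = []
    for _ in range(n_experiments):
        idx = np.random.choice(len(probs), size=n_attack, replace=False)
        scores = np.zeros(256)
        for k in range(256):
            z = SBOX[plaintexts[idx] ^ k]          # hypothetical values z_{i,k} = C(p_i, k)
            scores[k] = np.sum(np.log(probs[idx, z] + 1e-36))
        guess_vector = np.argsort(scores)[::-1]    # G_0 = most likely key candidate
        ranks.append(int(np.where(guess_vector == true_key)[0][0]))
    return float(np.mean(ranks))                   # GE = average rank of k*
```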
In deep learning-based SCA, we train a DNN f_θ as the distinguisher, i.e., F = f_θ. The DNNs most commonly used in SCA are Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs), but they are not interpretable. Therefore, in the next section, we describe the interpretable DNN that we will be exploring.

Truth Table Deep Convolutional Neural Network
In this section, we present an interpretable neural network called the Truth Table Deep Convolutional Neural Network (TT-DCNN), proposed by Benamira et al. [BPH21]. The TT-DCNN is built upon small truth tables that allow the conversion of the neural network into DNF equations simply by using the Heaviside step function [Bra78], denoted bin_act, to binarize the features (i.e., output 1 when the input value x > 0 and output 0 when x ≤ 0) while still having real-valued weights. In order to train efficiently through the Heaviside step function, Benamira et al. adopted the Straight-Through Estimator (STE) proposed by [HCS+16].
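For illustration, bin_act with an STE backward pass can be written as the following minimal PyTorch sketch, under our reading of [HCS+16]; note that some STE variants additionally clip the gradient, which we omit here.

```python
# Heaviside binarization with a Straight-Through Estimator:
# the forward pass is the hard step, the backward pass lets gradients through.
import torch

class BinAct(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return (x > 0).float()      # bin_act: 1 if x > 0, else 0

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output          # STE: identity gradient estimate

bin_act = BinAct.apply              # usage: y = bin_act(conv_output)
```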
Since traces are one-dimensional, unlike images, which are two-dimensional, we apply 1D-convolutional layers in our TT-DCNN instead; the 2D-convolutional TT-DCNN is described in [BPH21]. Formally, a filter F of a 1D-convolutional layer Φ with kernel size n computes, for each input patch (x_0, ..., x_{n-1}), a weighted sum Σ_{k=0}^{n-1} w_k · x_k + b with real-valued weights w_k and bias b, followed by the layer's activation. If x_k is a binary value (i.e., in {0, 1}) for all k = 0, ..., n-1, we can formulate a truth table based on the filter F by enumerating all 2^n possible input combinations. The truth table, when not too large, can then be converted into a simplified DNF formula using the Quine-McCluskey algorithm [Bla38] for interpretation. Figure 1 illustrates the conversion of a neural network's filter into a truth table. Furthermore, for a multi-layer network, similar truth tables can be built for the deeper layers. In that case, the input size of the truth table will be n = pc, where p is the patch size instead of the kernel size ker, and c is the number of input channels. The patch is the region of the input that produces the feature, commonly referred to as the receptive field [ANS19].
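The following sketch illustrates the enumeration-and-simplification idea on a toy filter with hypothetical weights; sympy's SOPform performs a Quine-McCluskey-style minimization and returns a DNF.

```python
# Enumerate all 2^n binary inputs of a (hypothetical) trained filter,
# record where bin_act fires, and simplify the truth table into a DNF.
import itertools
import numpy as np
from sympy import symbols
from sympy.logic import SOPform

n = 4                                           # truth-table input size (n <= 12 in this paper)
w = np.array([0.7, -1.2, 0.4, 0.9])             # assumed real-valued filter weights
b = -0.5                                        # assumed bias

minterms = [list(x) for x in itertools.product([0, 1], repeat=n)
            if float(np.dot(w, x) + b) > 0]     # inputs where bin_act outputs 1
xs = symbols(f'x0:{n}')
print(SOPform(xs, minterms))                    # simplified DNF, e.g. (x0 & x3) | ...
```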
There is a computational limitation when using the Quine-McCluskey algorithm because the algorithm solves an NP-complete problem [UVSV06]. Therefore, we limit ourselves to n ≤ 12 in this paper. Increasing the number of input channels greatly increases the number of input variables of the truth table, so in order to keep this number within a reasonable range, which in our case is n ≤ 12, we leverage the group parameter g [DV16]. The idea is to decompose the convolution into g groups and apply separate convolutions to each individual group, thereby decreasing the number of input variables.
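As a short illustration of the effect of grouping (the channel counts below are illustrative, not values from the paper's tables):

```python
# Grouped 1D convolution: each group sees only c_in / g input channels,
# shrinking the truth-table input size from p * c_in to p * (c_in / g).
import torch

c_in, c_out, ker, g = 8, 8, 3, 4
conv = torch.nn.Conv1d(c_in, c_out, kernel_size=ker, groups=g)
y = conv(torch.randn(1, c_in, 20))
# Without grouping, a second-layer truth table over this patch would have
# n = ker * c_in = 24 inputs; with g = 4 it has n = ker * c_in / g = 6.
```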
Figure 1: Converting a 1D-convolution filter into a truth table. The above example has two layers. The first layer has input channels = 1, output channels = 4, kernel size = 4 with stride = 2, and the second layer has input channels = 4, output channels = 1, kernel size = 2 with stride = 2. The filter converted into the truth table is denoted by the light blue box.
With the purpose of increasing the learning capacity of the neural network, a so-called Learning Truth Table (LTT) block is introduced in [BPH21]. The LTT block is built upon a layer known as the amplification layer, which simply adds a new convolution layer with kernel size 1 after a convolution layer. Doing so does not increase the patch size; instead, it gives the neural network more freedom to update the weights due to the new convolution layer. With this amplification layer, the LTT block can be used in a TT-DCNN to increase its learning capacity. Overall, the LTT block comprises two 1D-convolution layers, denoted Conv1D, with an amplification parameter τ, which is the ratio between the number of channels of the first layer and the number of channels of the second layer. The second layer is the aforementioned amplification layer. Each layer is followed by a batch normalization [IS15] and a non-linear activation function; in our case, the first non-linear activation function is SeLU [KUMH17] and the second is bin_act. Figure 2 shows the internal working and an overview of the LTT block.

Why TT-DCNN in SCA?
There are other works that use SAT equations for their neural networks, namely the Binarized Neural Network (BNN) [HCS+16], the Concept Rule Sets (CRS) [WZLW19], and the Rule-based Representation Learner (RRL) [WZLW21]. The BNN and TT-DCNN are the only CNN-based networks that use SAT equations, while the CRS and RRL are MLP-based networks that comprise SAT formulae. We want to analyze CNN-based DNNs, which have found great success among the other DNNs [MPP16] due to their shift-invariant nature; therefore, the only two candidates are the BNN and the TT-DCNN. However, a BNN loses its interpretability when its binarized convolution block is converted into an inequality for pseudo-Boolean constraints [RM21, CNHR18, JR20], which is consequently mapped into SAT formulae.
Moreover, the SAT formulae of a BNN contain a large number of disjuncts/clauses compared to a TT-DCNN, which makes them intractable to analyze. Furthermore, the TT-DCNN is more expressive than the BNN because it contains real-valued weights, unlike the BNN, which only has binarized weights. Therefore, the TT-DCNN is the preferred choice.
Although the original design of TT-DCNN targets adversarial attacks on image datasets [BPH21], the TT-DCNN's transparent inner structure allows us to interpret the neural network easily by describing the rules/decisions made by the network through its DNF equations. Therefore, we can use them in the SCA context to pinpoint the location of the PoIs used. The DNF tells us which sample points in the traces are used by the network in retrieving the secret key. In our proposed TT-DCNN-based architectures described in Section 4, one sample point of the traces or a window of sample points corresponds to one literal. Moreover, the AND operations ∧ in a disjunct show which literals are jointly used together. These disjuncts give us the positions of the leakage points that are exploited by our TT-DCNN-based neural network. The disjuncts of the DNF, which are the exact and interpretable formulation of the TT-DCNN-based neural network, also provide us with a global interpretation by giving us the decision of the network even on traces that the network has not encountered, unlike the usual CNN or BNN.

Methodology of Analyzing DNF Equations for Trained TT-DCNN in SCA
After converting the filters of a trained TT-DCNN into DNFs and simplifying them with the Quine-McCluskey algorithm, we found that many disjuncts or literals are unnecessary for retrieving the secret key. This is likely because the neural network overfits [BPH21] or underfits the dataset. Therefore, we propose three steps to remove these unnecessary disjuncts or literals and find the smallest set of rules that the neural network needs for key recovery:
1. Sieving disjuncts based on their size,
2. Separating disjuncts based on their combinations of literals (CoLs),
3. Trimming disjuncts based on the literals.
We remark that the three steps stated are heuristic approaches, and finding all sets of rules or the optimal rules remains an open problem. In addition, after each step, we set a channel to 0 when it has no disjuncts left.
Sieving Disjuncts Based on Their Size. As described in [BPH21], a large number of literals in a disjunct is possibly due to overfitting. On the other hand, we found that some disjuncts with a small number of literals are irrelevant for retrieving the keys, which probably corresponds to the neural network underfitting the dataset. Therefore, we define the size of a disjunct as the number of literals it contains and sieve the disjuncts based on their size. We replace the original disjuncts of the TT-DCNN with the disjuncts of a given size and compute the resulting guessing entropy. We repeat this process individually for each disjunct size. Then we compare the guessing entropy of each disjunct size and search for the minimum size such that GE = 0 (this criterion may be adapted for other use cases). Although there might be another set of disjuncts among the other disjunct sizes that can retrieve the key, considering all the disjunct sizes that give GE = 0 would become intractable. Moreover, taking the smallest disjunct size allows us to observe the least number of literals and, therefore, the fewest sample points (or windows of sample points) needed by the TT-DCNN to successfully retrieve the secret key.
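A minimal sketch of this sieving step is given below; `set_disjuncts` and `compute_ge` are hypothetical helpers that, respectively, load a set of disjuncts back into the converted TT-DCNN and evaluate its guessing entropy.

```python
# Step 1: keep only disjuncts of one size at a time, check GE,
# and return the smallest size that still reaches GE = 0.
def sieve_by_size(all_disjuncts, model, attack_set):
    sizes = sorted({len(d) for d in all_disjuncts})   # size = number of literals
    ge_per_size = {}
    for s in sizes:
        subset = [d for d in all_disjuncts if len(d) == s]
        set_disjuncts(model, subset)                  # hypothetical helper
        ge_per_size[s] = compute_ge(model, attack_set)  # hypothetical helper
    return min((s for s, ge in ge_per_size.items() if ge == 0), default=None)
```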

Separating Disjuncts Based on Their Combinations of Literals (CoLs).
After separating the disjuncts based on their sizes and obtaining the minimum disjunct size with GE = 0, there might remain a considerable number of disjuncts to interpret (e.g., 198 disjuncts of size 4 for the ASCADv1 dataset and 190 disjuncts of size 3 for AES_HD_ext). Therefore, we separate them further based on the disjuncts' combinations of literals (CoLs), which we define below.
After separating the disjuncts by size, we generate the list of unique CoLs from the disjuncts with the minimum size and denote it Lst_unique. Among the different unique CoLs, we want to obtain a set of CoLs such that replacing the original set of disjuncts acquired from the trained TT-DCNN with it can still successfully recover the secret key. We call these CoLs that are crucial for key recovery the critical CoLs. If the number of unique CoLs is small (≤ 5), we can check all their combinations. However, if the trained TT-DCNN has many CoLs, we require a viable approach to find the critical CoLs. Firstly, we replace the original disjuncts of the TT-DCNN with the disjuncts of a given CoL and compute the guessing entropy, repeating this process independently for each CoL found in Lst_unique; sometimes the trained TT-DCNN only requires one CoL to retrieve the key successfully. However, some trained networks require more than one CoL for key recovery; therefore, we propose an algorithm to acquire the critical CoLs for further analysis (see Algorithm 1).
The main idea of Algorithm 1 is to first set Lst_in to the list of all unique CoLs, Lst_unique, then remove one CoL from Lst_in and check whether the guessing entropy of the correct key increases above a certain threshold λ. Throughout the paper, we set λ = 1. If the guessing entropy increases above the threshold λ, the removed CoL is crucial for recovering the key, so we put it back into Lst_in. On the contrary, if the guessing entropy does not increase above the threshold λ, we remove the CoL from Lst_in and place it into Lst_out, as it is currently not required for retrieving the secret key. Note that the algorithm only finds one set of critical CoLs; Algorithm 1 depends on the order of the list Lst_unique, and therefore another set of critical CoLs may exist.
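A sketch of Algorithm 1 as described above, reusing the hypothetical helpers from the sieving sketch (`disjuncts_with_cols` is likewise assumed):

```python
# Greedily drop CoLs whose removal keeps the guessing entropy at or below lambda.
def find_critical_cols(lst_unique, model, attack_set, lam=1):
    lst_in, lst_out = list(lst_unique), []
    for col in list(lst_in):                  # iterate in the order of Lst_unique
        lst_in.remove(col)
        set_disjuncts(model, disjuncts_with_cols(lst_in))   # hypothetical helper
        if compute_ge(model, attack_set) > lam:
            lst_in.append(col)                # CoL is crucial for key recovery: keep it
        else:
            lst_out.append(col)               # not currently needed: move to Lst_out
    return lst_in                             # one set of critical CoLs (order-dependent)
```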
Trimming Disjuncts Based on the Literals. Some of the disjuncts might contain literals carrying information that is not useful. We introduce an operation called trimming: we trim a literal (x_i) from a disjunct δ simply by removing x_i from δ if it appears there, regardless of whether it appears negated. For example, if we trim the literal (x_2) from (x_1 ∧ ¬x_2 ∧ x_4 ∧ ¬x_6) and (¬x_1 ∧ x_2 ∧ x_4 ∧ x_6), we obtain (x_1 ∧ x_4 ∧ ¬x_6) and (¬x_1 ∧ x_4 ∧ x_6), respectively.
When the number of literals found is small, it is possible to exhaust all possible trimming scenarios and find the set of literals that has the highest impact on the guessing entropy. However, the maximum number of possible cases is 2^n (i.e., when every single literal appears), which can be too computationally intensive, especially since computing the guessing entropy is itself expensive. As a result, we want to trim away the literals that have minimal impact on the guessing entropy, keeping the literals that lead to key recovery, without enumerating all possible cases. Therefore, we propose the following trimming algorithm: given a set of disjuncts S, we trim x_i from S and check the guessing entropy, repeating this for all i = 0, ..., n-1. We place this step after the previous two steps so as not to miss crucial literals required for key recovery (see Appendix A.2 for examples).
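The trimming operation and the per-literal trimming pass can be sketched as follows; a disjunct is modeled here as a set of signed literals, e.g. {(1, True), (2, False)} for (x_1 ∧ ¬x_2), and the helpers are again hypothetical.

```python
# Trimming: remove x_i from every disjunct, whether or not it appears negated.
def trim(disjuncts, i):
    return [{(v, sign) for (v, sign) in d if v != i} for d in disjuncts]

# Per-literal pass: trim each x_i in turn and check the guessing entropy.
def trimming_pass(disjuncts, model, attack_set, n):
    for i in range(n):
        set_disjuncts(model, trim(disjuncts, i))       # hypothetical helper
        ge = compute_ge(model, attack_set)             # hypothetical helper
        print(f"trim x_{i}: GE = {ge}")  # GE != 0 => x_i matters for key recovery
```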

Experimental Results
In this section, we propose two TT-DCNN-based neural networks called TT_SCA_small and TT_SCA_big. We train TT_SCA_small on a small-sized simulated dataset and TT_SCA_big on the real-measurement traces of ASCADv1 and AES_HD_ext. We also analyze the acquired SAT equations with the methodology proposed in Section 3.

The TT-DCNN-based Neural Network, TT_SCA_small
The TT-DCNN-based architecture TT_SCA_small is illustrated in Figure 3. TT_SCA_small applies a batch normalization layer on the traces followed by a Heaviside step function bin_act. Three 1D-convolution layers are applied thereafter, each consisting of a convolution operation, a SeLU activation function, and lastly a batch normalization. The parameters of each 1D-convolution layer can be found in Table 1; the patch size of TT_SCA_small is 9, equal to the kernel size of the filter in Conv1D layer 2. Moreover, instead of using MLP layers after the Flatten operation, we use a linear regression layer to make the neural network fully interpretable.
We notice that applying bin_act before the Flatten operation only after training, together with applying a bin_act after Conv1D layer 2, is necessary for recovering the key. If we apply the bin_act before the Flatten operation during training, or remove the bin_act after Conv1D layer 2, the network does not successfully recover the secret key. This is possibly due to the simplicity of the dataset, where some loss of information after Conv1D layer 2 is required, but training with a bin_act before the Flatten operation would result in too much information loss. Furthermore, applying bin_act before the Flatten operation after training is required to convert the three 1D-convolution layers into truth tables.
We employ the Glorot weight initialization [GB10], the One Cycle Policy [ST18] with a learning rate from 0.0025 to 0.005, L2 regularization with factor 0.00125, and the Adam optimizer [KB17] when training TT_SCA_small. We train TT_SCA_small over 9 epochs.
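A minimal PyTorch sketch of this training recipe (with `model` and `train_loader` assumed, and Adam's weight_decay used as a stand-in for the L2 regularization factor) could look as follows:

```python
# Glorot init, Adam with L2 factor 0.00125, One Cycle Policy 0.0025 -> 0.005, 9 epochs.
import torch

for m in model.modules():
    if isinstance(m, (torch.nn.Conv1d, torch.nn.Linear)):
        torch.nn.init.xavier_uniform_(m.weight)          # Glorot initialization

opt = torch.optim.Adam(model.parameters(), lr=0.0025, weight_decay=0.00125)
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=0.005, div_factor=2.0,                   # initial lr = 0.005 / 2 = 0.0025
    epochs=9, steps_per_epoch=len(train_loader))
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(9):
    for traces, labels in train_loader:
        opt.zero_grad()
        loss_fn(model(traces), labels).backward()
        opt.step()
        sched.step()
```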

Simulated Data
We generated our own simulated dataset using Python code similar to [Tim19]. Each generated trace has 20 sample points. We denote each trace as an array trace[0...19] and set d+1 of its points as leakage points, where d ∈ {0, 1, 2, 3} is the masking order of the generated dataset. The remaining 20 − (d + 1) sample points are randomly generated bytes. Then we add noise drawn from a Gaussian distribution. Table 2 lists the leakage points of the generated traces for each masking order d, where m_1, m_2 and m_3 are randomly generated mask bytes. We use 14k traces for the profiling phase and 5000 traces for the attack phase.
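A minimal sketch of such a generator is given below; the leakage positions and noise standard deviation are illustrative assumptions, not the exact values of Table 2.

```python
# Simulated traces: 20 points, d+1 leakage points carrying the Boolean shares
# of Z, remaining points random bytes, plus Gaussian noise.
import numpy as np

def simulate_traces(n_traces, d=1, sigma=1.0, rng=np.random.default_rng(0)):
    Z = rng.integers(0, 256, n_traces)                  # sensitive values
    traces = rng.integers(0, 256, (n_traces, 20)).astype(float)
    positions = [1, 6, 10, 14][: d + 1]                 # assumed leakage points
    masked = Z.copy()
    for j in range(d):                                  # d random masks m_1..m_d
        m = rng.integers(0, 256, n_traces)
        traces[:, positions[j]] = m                     # share m_{j+1} leaks here
        masked ^= m
    traces[:, positions[d]] = masked                    # last share Z ^ m_1 ^ ... ^ m_d
    traces += rng.normal(0.0, sigma, traces.shape)      # additive Gaussian noise
    return traces, Z
```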

Interpretation of the DNF Equations of TT_SCA_small on Simulated Data:
After training TT_SCA_small on the simulated traces, we convert the three 1D-convolution layers into DNF equations by considering the truth tables obtained through enumerating all 2^n inputs. Since the results are similar for all masking orders 0 to 3, we shall only focus on masking order 1 here. We first sieve the disjuncts based on their sizes (see Figure 4a). Figure 4a shows that the disjuncts of sizes 2 to 5 all have GE = 0. We focus on the smallest size with GE = 0, as stated in Section 3; in this case, size 2 (orange line in Figure 4a). For our trained TT_SCA_small, the disjuncts of size 2 are just (x_1 ∧ ¬x_6), (x_6 ∧ ¬x_1) and (¬x_6 ∧ ¬x_1). Since (x_1, x_6) is the only CoL, we simply try to trim the literals (x_1) and (x_6) individually. We observe from Figure 4 that GE ≠ 0 when we trim (x_1) or (x_6). This shows that both x_1 and x_6 are literals that the trained TT_SCA_small needs for key recovery.

Figure 5b illustrates the neural network's patch as it slides through a trace. At the second timestamp of this sliding patch (i.e., the light blue box in Figure 5b), the literals that correspond to the PoIs of the shares m_1 and L = Z ⊕ m_1 are x_1 and x_6, respectively. Therefore, we can conclude that our TT_SCA_small has learned a function of m_1 and Z ⊕ m_1 based on x_1 and x_6 that retrieves the secret key. The results for masking orders 0, 2 and 3 also show that TT_SCA_small learns a function for key recovery based on the literals to which the leakage points correspond. For each masking order 1, 2 and 3, an example of a disjunct acquired from TT_SCA_small that is necessary for key recovery is shown in Figure 5a. We conclude that TT_SCA_small can pinpoint the leakage's position and use it to retrieve the key.

Next, we show that TT_SCA_small is unable to recover the secret key if there exists a share whose PoIs are never within the patch at the same time as those of the other shares. We show this by training TT_SCA_small on new simulated traces of masking order 1, with m_1 and L = Z ⊕ m_1 placed at trace[1] and trace[10], respectively (see Figure 6b). Observe that at no point in time are m_1 and L = Z ⊕ m_1 within the patch together. We observe that GE ≠ 0 when TT_SCA_small is trained on these new simulated traces (see Figure 6a). Therefore, at some point in time, both m_1 and L = Z ⊕ m_1 need to lie within the same patch for TT_SCA_small to retrieve the key successfully. This solidifies the claim that capturing long-distance dependencies, a known issue in the deep learning literature, is also an issue in the context of SCA.

The TT-DCNN Neural Network, TT_SCA_big
As discussed in Section 4.1.1, TT_SCA_small is unable to successfully recover the key if there exists a share whose PoIs are never within the patch. Meanwhile, real traces contain hundreds to thousands of sample points, far more than our small-sized simulated traces of 20 sample points, and PoIs corresponding to different shares might be hundreds of sample points away from each other. The most obvious way to apply TT_SCA_small to longer traces is to increase its patch size, but increasing the patch size above 12 becomes intractable to compute, as the Quine-McCluskey algorithm solves an NP-complete problem whenever it simplifies a DNF equation. Therefore, applying TT_SCA_small directly to traces of extended length is impossible. Nonetheless, we want a TT-DCNN-based neural network with a patch size of at most 12 whose DNF equations can still be interpreted with respect to the sample points, without losing the ability to retrieve the secret key when trained on longer traces. We propose a new TT-DCNN-based architecture, called TT_SCA_big, which achieves all of that.
TT_SCA_big first consists of a 1D-convolution layer, followed by a batch normalization and subsequently an average pooling layer. Then, a bin_act is applied before the LTT block. The LTT block is the part meant to be converted into SAT equations. Thereafter, we apply the Flatten operation before utilizing three linear regressions. We use linear regressions instead of MLPs since they are interpretable. Figure 7 shows the TT_SCA_big architecture, and Table 3 indicates the parameters used.
To overcome the limitation on the patch size, the first 1D-convolution layer, batch normalization, and average pooling [GBC16] are considered as a preprocessing block. This preprocessing block converts each window of sample points into one literal, allowing the patch size to stay within the acceptable range. Furthermore, the average pooling is tuned so that the windows of sample points do not overlap, which allows for easier interpretation.
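A sketch of this preprocessing idea for ASCADv1-sized traces is shown below; the concrete kernel size and channel count are assumptions, not the values of Table 3.

```python
# Preprocessing block: with non-overlapping average pooling, each literal x_q
# summarizes one window of raw samples.
import torch

D, window = 700, 100                    # e.g. ASCADv1: 700 samples -> 7 literals
pre = torch.nn.Sequential(
    torch.nn.Conv1d(1, 1, kernel_size=11, padding=5),        # assumed Conv1D
    torch.nn.BatchNorm1d(1),
    torch.nn.AvgPool1d(kernel_size=window, stride=window),   # non-overlapping windows
)
feat = pre(torch.randn(1, 1, D))        # shape (1, 1, 7): one value per window
# literal x_q <- bin_act(feat[..., q]) covers samples q*window .. (q+1)*window - 1
```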
Our training methodology is as follows: we apply horizontal standardization to the traces in the same way as [WAGP20], and employ the He initialization [HZRS15], the One Cycle Policy with a learning rate of 0.005, and the Adam optimizer [KB17] to train TT_SCA_big over 50 epochs.

ASCADv1
ASCADv1 is a first-order masked software AES implementation running on an 8-bit AVR architecture (ATMega8515) with an operating frequency of 4 MHz [BPS+20]. We focus on the fixed-key, synchronized-traces dataset and target the third byte of the first-round Sbox output, which we denote SBox(pt ⊕ k*), where pt is the third plaintext byte and k* is the third byte of the first-round secret key. Since it is a first-order implementation, the common output mask of the Sbox is denoted r_out. The authors of [BPS+20] pre-selected from the raw traces 700 sample points containing the information regarding the third byte of the first-round Sbox. The dataset consists of 60k traces, with 50k traces used for profiling and 10k for attacking.
Interpretation of the DNF Equations of TT_SCA_big on ASCADv1: Since the first 1D-convolution layer, batch normalization, and average pooling layer are considered a preprocessing block, the only portion of TT_SCA_big that we convert into DNF equations is the LTT block. Due to the preprocessing block, we obtain a patch size of 7, where each literal represents a window of sample points in the trace. Table 4 shows which sample points each literal represents, which can be derived from Figure 15 in Appendix A.1.
Table 4: Sample points for each literal of TT_SCA_big on ASCADv1.

Literal        x_0      x_1         x_2         x_3         x_4         x_5         x_6
Sample points  0 to 99  100 to 199  200 to 299  300 to 399  400 to 499  500 to 599  600 to 699

We proceed with the steps proposed in Section 3.
1. Firstly, we sieve the disjuncts based on their sizes (see Figure 8a). We observe that the only disjunct size with GE = 0 is 4 (see the red line in Figure 8a).

2. Next, we generate the list of unique CoLs among all the size-4 disjuncts, denoted Lst_unique^(ASCAD). There are 35 such CoLs. We replace the original disjuncts of the TT-DCNN with the disjuncts of a given CoL and compute the guessing entropy, repeating this process for each CoL found in Lst_unique^(ASCAD). This is illustrated in Figure 17 of Appendix A.3, and we observe that every CoL in Lst_unique^(ASCAD) has GE ≠ 0. This suggests that our trained TT_SCA_big requires more than one CoL to retrieve the secret key. Therefore, we use Algorithm 1 to obtain a list of the critical CoLs for key recovery. Figure 8b presents the critical CoLs acquired from Algorithm 1.
3. Finally, we apply the trimming algorithm from Section 3 (see Figure 9a). Notice that when the literals (x_1), (x_2), (x_4), (x_5) and (x_6) are trimmed independently, they all have GE ≥ 1 (see the orange, green, purple, brown, and pink lines, respectively). When trimming (x_1) and (x_4) individually, the GE values stay relatively close to 0 (GE = 4.62 when trimming (x_1) and GE = 1.92 when trimming (x_4)). This suggests that both may merely require more traces to attain GE = 0 and thus might not be relevant for recovering the key.
Thus, we first verify whether x_4 is necessary for key recovery by trimming the literals (x_1, x_2, x_5, x_6) and (x_0, x_3, x_4) separately from the disjuncts with critical CoLs (see the orange and blue lines in Figure 9b). We see that GE = 0 when (x_0, x_3, x_4) are trimmed (i.e., the remaining disjuncts have literals x_1, x_2, x_5 and x_6), suggesting that the literal x_4 is not necessary for key recovery. Similarly, we trim (x_2, x_5, x_6) and (x_0, x_1, x_3, x_4) separately to check whether x_1 is required for key recovery (see the red and green lines in Figure 9b). We observe that trimming (x_2, x_5, x_6) results in GE ≈ 9, implying that x_1 is indeed relevant for retrieving the secret key. In a nutshell, the literals required for key recovery are x_1, x_2, x_5 and x_6.
We observe that these 4 literals represent the positions of the PoIs in ASCADv1. From Table 4, the literal x_1 represents sample points 100 to 199 and x_2 represents sample points 200 to 299, which is where the PoIs of the share r_out are located (see the orange line in Figure 10a), while x_5 and x_6 represent sample points 500 to 599 and 600 to 699, respectively, where the PoIs of the share SBox(pt ⊕ k*) ⊕ r_out are located (see the blue line in Figure 10a). Therefore, we conclude that TT_SCA_big indeed learned the positions of the leakages as well as a function of the shares r_out and SBox(pt ⊕ k*) ⊕ r_out based on the literals x_1, x_2, x_5 and x_6. While Figure 10a is obtained with knowledge of the key, the neural network learns the PoIs without knowledge of the key.

Figure 8b lists the critical CoLs obtained after Algorithm 1: (x_0, x_3, x_4, x_5), (x_0, x_1, x_2, x_3), (x_2, x_3, x_5, x_6), (x_0, x_1, x_2, x_5), (x_0, x_3, x_5, x_6), (x_1, x_2, x_4, x_6), (x_0, x_3, x_4, x_6), (x_0, x_4, x_5, x_6), (x_0, x_1, x_, x_4), (x_3, x_4, x_5, x_6), (x_1, x_2, x_3, x_4), (x_1, x_2, x_4, x_5).

To compare with previous works, we apply the explainability technique GV to TT_SCA_big (see Figure 10b) and observe that it is unable to tell us which points TT_SCA_big learns from, as it differs vastly from CPA. This is because TT_SCA_big overfits the dataset, since we did not use any early stopping, unlike [MDP19], which managed to differentiate the PoIs of ASCADv1 after using early stopping when training their DNN. Nonetheless, our work shows the exact PoIs that TT_SCA_big used to retrieve the secret key despite not using early stopping, and even provides the DNF formulae that TT_SCA_big used for key recovery without assuming that the neural network is differentiable, unlike GV. The literals x_1, x_2, x_5 and x_6 also reveal the critical decision that TT_SCA_big uses to retrieve the secret key on both seen and unseen traces, providing us a global interpretation, unlike other explainability methods used in the SCA context.

The optimal number of literals that a TT_SCA_big can learn from the sequential leakage found in ASCADv1 is two: one literal representing the leakage of SBox(pt ⊕ k*) ⊕ r_out and the other representing the leakage of r_out. Therefore, the literals of TT_SCA_big shown in this section may not be optimal, as four literals are required to obtain the secret key instead of two. In the following paragraph, we report a modified miniature TT_SCA_big that uses only an XOR gate of two literals for key recovery. One reason why the literals we found are not optimal may be the heuristic nature of our methods; finding the miniature set of disjuncts for key recovery is still an open question.
An Interesting Result: By adapting the architecture from Wouters et al.'s work [WAGP20], we managed to train a modified TT_SCA_big (with a padding of 25 in the first Conv1D) that gives us a miniature network. Each literal represents the sample points of the traces stated in Table 5. We proceed by sieving the disjuncts based on their size and observe that the smallest disjunct size with GE = 0 is 4 (red line of Figure 11a). Next, we found that there are 35 unique CoLs of size 4, and observed that the only critical CoL is (x_1, x_2, x_3, x_5) (red line of Figure 11b). There are only five disjuncts with the CoL (x_1, x_2, x_3, x_5) (filter 66 has two of them), presented in the middle column of Table 6. We notice that when (x_1) or (x_5) is trimmed, GE ≠ 0 (see the blue and red lines in Figure 12a), but when trimming (x_2, x_3), we attain GE = 0 (see the purple line in Figure 12a). The disjuncts after trimming (x_2, x_3) are presented in the last column of Table 6. By continuing to sieve the filters, we observe that filter 66 is the only filter for which GE = 0 (see Figure 12b). The two disjuncts in filter 66 correspond to an XOR gate between x_1 and ¬x_5: x_1 ⊕ ¬x_5 = (x_1 ∧ x_5) ∨ (¬x_1 ∧ ¬x_5), which is exactly the masking countermeasure protecting the underlying Sbox execution. However, it is not trivial to always learn the exact function of the countermeasure; this is the best-case result we achieved. As the weights are initialized randomly, neural networks trained on the same dataset may differ, and thus the function learned would differ.
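The XOR identity above is easy to verify mechanically; a quick sympy check (not part of the attack) confirms that the two remaining disjuncts of filter 66 encode x_1 ⊕ ¬x_5:

```python
# Verify that (x1 ∧ x5) ∨ (¬x1 ∧ ¬x5) is logically equivalent to x1 ⊕ ¬x5.
from sympy import symbols
from sympy.logic.boolalg import And, Or, Not, Xor, Equivalent, simplify_logic

x1, x5 = symbols('x1 x5')
dnf = Or(And(x1, x5), And(Not(x1), Not(x5)))
print(simplify_logic(Equivalent(dnf, Xor(x1, Not(x5)))))  # -> True
```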
To summarize, the modified TT_SCA_big managed to learn a miniature set of rules with just one preprocessing block, one XOR gate (i.e., the masking function), and one linear regression to retrieve the key.

Figure 12: Guessing entropy of the CoL (x_1, x_2, x_3, x_5) obtained from the modified TT_SCA_big (i.e., before and after trimming (x_2, x_3)).

AES_HD_ext
Next, we validate our approach in a low-SNR setting; hence we target traces from an FPGA. We focus on the AES_HD_ext dataset, which is an extension of the AES_HD dataset. It consists of 500k traces, each with 1250 sample points. The main leakage comes from the register writing in the last round, SBox^{-1}(ct_11 ⊕ k_15) ⊕ ct_11, where ct_j is the j-th byte of the ciphertext, k_15 is the 15th byte of the last-round key, and SBox^{-1} is the AES inverse Sbox. The SNR observed in [ZBHV19] is 0.01554. We use 50k traces for the profiling phase and another 50k for the attack phase.

Interpretation of the DNF Equations of TT_SCA_big on AES_HD_ext:
Since there are 1250 sample points, the TT_SCA_big used here has a patch size of 12, with each literal x_q representing the sample points between q·100 and (q+1)·100 − 1 for q = 0, ..., 11.
As before, we first sieve the disjuncts based on their size (see Figure 13a) and observe that 3 is the smallest disjunct size (see the blue line in Figure 13a). We proceed with the next step by generating the list of unique CoLs from the disjuncts of size 3, denoted Lst_unique^(AES_HD). There are a total of 127 CoLs of size 3. Then we replace the original disjuncts of TT_SCA_big with those of a given CoL and compute the guessing entropy, repeating this process independently for each CoL found in Lst_unique^(AES_HD). The CoLs with GE = 0 are (x_3, x_10, x_11), (x_4, x_10, x_11) and (x_7, x_10, x_11), as depicted in Figure 13b. All three CoLs provide similar results; thus we focus only on (x_7, x_10, x_11). Since there are only 3 literals in the CoL (x_7, x_10, x_11), we exhaust all possible trimming scenarios (excluding trimming all literals). This is illustrated in Figure 14a. Whenever x_11 is trimmed, we attain GE ≠ 0. In fact, if x_11 is used on its own (i.e., trimming the literals (x_7, x_10)), it results in GE = 0. This reveals that the literal x_11 itself is necessary for recovering the secret key. We validated our results using CPA on the AES_HD_ext dataset: the literal x_11 corresponds to the PoIs found in sample points 1100 to 1200 of the traces (see Figure 14b). This implies that our technique also works in a low-SNR, hardware-implemented setting.

Limitation
The first limitation of using TT_SCA_big is that it requires more traces to retrieve the secret key compared to a general CNN; this is the price of interpreting what the neural network learns. Secondly, our method may not always find the most miniature set of disjuncts for key recovery, as it is heuristic in nature; finding the optimal set of disjuncts that TT_SCA_big learns is still an open question. Furthermore, TT_SCA_big does not work on jitter/desynchronization countermeasures in its current form. This is likely due to the non-overlapping filters. Moreover, as suggested in [WWJ+21], such countermeasures require deeper architectures than the current depth of TT_SCA_big; we leave this extension as future work. Lastly, other explainability techniques can be used on any DNN, whereas our methods are only applicable to the family of TT-DCNNs. Nevertheless, our proposed methodology gives us a glimpse of what this family of networks learns in the SCA context.

Conclusion
In this work, we apply the interpretable neural network called the Truth Table Deep Convolutional Neural Network (TT-DCNN) [BPH21] in the context of SCA. The TT-DCNN can be converted into SAT equations (in the form of DNFs), which allows us to interpret what it has learned. We proposed two TT-DCNN-based architectures, namely TT_SCA_small and TT_SCA_big, where special adjustments are made to TT_SCA_big to work with real traces. We proposed a methodology to analyze the DNF equations in the context of SCA. Our experiments show that both TT_SCA_small and TT_SCA_big indeed use the PoIs to retrieve the secret key based on their SAT formulae. These formulae retrieve the device's secret key even on traces that have not been observed, giving us a global interpretation of the proposed TT-DCNN-based neural networks. Application to desynchronization/jitter is left for future work.

A.2 Trimming Individual Literals Before Any Preprocessing or Before Separating the Disjuncts Based on Their CoL on ASCADv1

Figure 16 shows the guessing entropy on ASCADv1 when trimming individual literals before sieving the disjuncts based on their size or their CoL. In Section 4.2.1, it was concluded that the important literals are x_1, x_2, x_5 and x_6. However, when trimming the literals (x_1), (x_2) and (x_6) individually before the first two steps, the guessing entropy remains relatively close to 0, which could just mean that more traces are required to attack (see Figures 16a and 16b). Therefore, if we applied the trimming operation before either of the first two steps, we would miss the literals x_1, x_2 and x_6 as the critical literals for retrieving the secret key.

A.3 Guessing Entropy of Each Individual CoL of TT_SCA_big on ASCADv1

Figure 17 shows the guessing entropy of each CoL as described in step 2 of Section 3. We observe that none of the CoLs managed to recover the key with 15k traces; the smallest guessing entropy obtained is ≈ 10. This means that TT_SCA_big combines CoLs to recover the key. Therefore, we use Algorithm 1 to obtain the list of critical CoLs.
Figure 2: The Learning Truth Table (LTT) block. (a) Internal working: the first Conv1D layer has kernel size 2, stride 1 and group g = 2; the amplification layer (second Conv1D layer) has amplification parameter τ = 3, kernel size 1 and stride 1. (b) Overview: the Conv1D with kernel size = 1 is the amplification layer.

Figure 3: Overview of the TT_SCA_small architecture.

Figure 4: Guessing entropy of TT_SCA_small for simulated data with masking order 1.
(a) Example of disjuncts of TT_SCA_small retrieved which allow key recovery. (b) The orange, light blue and light green boxes show the patch at timestamps 1, 2 and 3, respectively; L denotes Z ⊕ m_1.

Figure 5: Example of disjuncts of TT_SCA_small necessary for key recovery and a visualization of a simulated trace.
(a) Guessing entropy of TT_SCA_small trained on the new simulated traces of masking order 1, with m_1 and L = Z ⊕ m_1 placed at trace[1] and trace[10], respectively. (b) The orange, light blue and light green boxes show the patch at timestamps 1, 2 and 3, respectively; L denotes Z ⊕ m_1.

Figure 6: Guessing entropy of TT_SCA_small trained on the new simulated traces and a visualization of the new simulated trace with masking order 1.

Figure 7: The TT_SCA_big architecture.
(a) Guessing entropy of the different disjunct sizes.

Figure 8: Guessing entropy for step 1 and the list of critical CoLs obtained after Algorithm 1 for TT_SCA_big on ASCADv1.

(a) Trimming each individual literal (as described by the trimming algorithm in Section 3). (b) Trimming to verify the relevance of x_1 and x_4.

Figure 9: Guessing entropy for TT_SCA_big on ASCADv1 for step 3.
(a) CPA on the attack traces of ASCADv1. (b) GV of TT_SCA_big on ASCADv1.

Figure 10: CPA of ASCADv1 and GV of TT_SCA_big.

Figure 11: Guessing entropy obtained when applying steps 1 and 2 to the modified TT_SCA_big trained on ASCADv1.

(a) Guessing entropy of the different disjunct sizes. (b) Guessing entropy of each individual CoL of size 3.

Figure 13: Guessing entropy after sieving the disjuncts based on their sizes, and of each individual CoL of size 3, for AES_HD_ext.

Figure 14: Guessing entropy after trimming, and CPA of AES_HD_ext.
(a) Trimming before sieving the disjuncts based on their size (before step 1). (b) Trimming before separating the disjuncts based on their CoLs (right before step 2).

Figure 16: Guessing entropy for TT_SCA_big on ASCADv1 before step 1 and right before step 2 of the proposed methodology.

Figure 17: Guessing entropy of each individual CoL of TT_SCA_big from the disjuncts of size 4.

Table 5: Sample points for each literal of the TT-DCNN trained with a padding of 25.

Table 6: Disjuncts with the CoL (x_1, x_2, x_3, x_5) before and after trimming (x_2, x_3), using the modified TT_SCA_big.