Single-Trace Side-Channel Attacks on the Toom-Cook: The Case Study of Saber

The Toom-Cook method is a well-known strategy for building algorithms to multiply polynomials efficiently. Along with NTT-based polynomial multiplication, Toom-Cook-based and Karatsuba-based polynomial multiplication algorithms have regained attention since the start of NIST's post-quantum standardization process. Compared to the comprehensive analysis done for the NTT, the leakage characteristics of Toom-Cook have not been discussed. We analyze the vulnerabilities of Toom-Cook in the reference implementation of Saber, a third-round finalist of NIST's post-quantum standardization process. In this work, we present the first single-trace attack based on the soft-analytical side-channel attack (SASCA) targeting Toom-Cook. Deep learning-based power analysis is combined with SASCA to decrease the number of templates, since the Toom-Cook contains a large number of similar operations. Moreover, we describe an optimized factor graph and an improved belief propagation to make the attack more practical. The feasibility of the attack is verified by evaluation experiments. We also discuss possible countermeasures to prevent the attack.


Introduction
The impending threat of Shor's algorithm [Sho99] to conventional public-key cryptographic algorithms has prompted interest in alternative algorithms that are resistant to quantum computers. The National Institute of Standards and Technology (NIST) started the Post-Quantum Cryptography Standardization Project in 2016 [NIS]. The process is currently in its third round. The remaining candidates are the seven finalists and the eight alternates for Public Key Encryption (PKE), Key-Encapsulation Mechanism (KEM) and Digital Signature (DS) schemes [AASA+20]. Among the finalists for KEMs, 3 out of 4 are lattice-based.
Lattice-based schemes are split into Learning With Errors (LWE)-based schemes and NTRU-based schemes. Their security relies on ideal lattices and the learning with errors problem. However, their implementations have shown vulnerability to side-channel attacks (SCAs) in the context of PKE, KEM or DS [ATT+18, EFGT17, OSPG18]. Side-channel attacks aim to establish the relationship between the detectable leakages from physical devices and sensitive information. NIST has clarified that resistance against SCAs should be considered an essential criterion for the Post-Quantum Cryptography Standardization Project [AASA+20].
There has been a lot of work focusing on the risks of lattice-based cryptographic systems to side-channel attacks. Several works exploit vulnerabilities of different operations, including but not limited to polynomial multiplication [PPM17, HCY19, XPSR+21, AKJ+18] and message encoding. The rest of the paper is organized as follows. Section 2 provides background on Toom-Cook, Karatsuba and Saber. Section 3 introduces the vulnerability analysis of Toom-Cook multiplication in Saber. We then describe our attack on Toom-Cook in detail and provide various optimizations to decrease the complexity of the attack in Section 4. The evaluations and experiments are provided in Section 5. Section 6 provides the discussion, and Section 7 concludes the paper.

Saber key-encapsulation mechanism
Saber is a third round finalist post-quantum key-encapsulation mechanism candidate [VBDK + 21]. The security is based on the hardness of Module Learning with Rounding problem (MLWR).
Let Z_q be the ring of integers modulo a positive integer q, and the quotient polynomial ring R_q is defined as Z_q[x]/(x^n + 1). The symbol l determines the dimension of the underlying lattice problem. The positive integers q, p, and T are the moduli involved in the scheme and are chosen to be powers of 2, in particular q = 2^{ε_q}, p = 2^{ε_p} and T = 2^{ε_T} with ε_q > ε_p > ε_T. Setting the parameters p and T to higher values results in lower security but a lower failure probability. The implementation resists timing side-channel attacks [GBHLY16, KRVV19]. The symbol ≫ represents the bitwise right shift operation; this operation extends coefficient-wise to polynomials. Three constants (h, h_1, h_2) are used to replace rounding operations with a simple bit shift.
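As an illustration of this rounding-by-shift trick, the sketch below (Python, with the Saber parameters ε_q = 13, ε_p = 10, and a rounding constant of the assumed form 2^{ε_q − ε_p − 1}; the function name is ours) shows that adding the constant and then shifting reproduces rounding to the nearest multiple:

```python
EPS_Q, EPS_P = 13, 10            # Saber moduli: q = 2^13, p = 2^10
D = EPS_Q - EPS_P                # number of bits dropped by the rounding
H = 1 << (D - 1)                 # assumed rounding constant 2^(eps_q - eps_p - 1) = 4

def round_shift(x):
    # adding the constant and shifting right replaces the rounding of
    # (p/q)*x to the nearest integer (ties round up), reduced mod p
    return ((x + H) >> D) & ((1 << EPS_P) - 1)
```

For example, round_shift(11) == 1 while round_shift(12) == 2, matching rounding of x/8 to the nearest integer with ties rounded up; no explicit modular reduction or division is needed.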
The decryption of the PKE is shown in Algorithm 1 [DKSRV18]. Saber.KEM is obtained from Saber.PKE using the Fujisaki-Okamoto (FO) transform. It would be harmful to security if p ∤ q, as this would introduce bias into the generated keys. To avoid this bias, the Saber designers choose p and q as powers of two, i.e. 2^10 and 2^13 respectively.

Toom-Cook & Karatsuba multiplication
The most straightforward method to compute the product of two n-degree polynomials is schoolbook multiplication [NG21]. The Karatsuba algorithm is a divide-and-conquer approach to polynomial multiplication. Suppose there are two n-degree polynomials A(x) and B(x). The polynomial A(x) can be split into two n/2-degree polynomials a_h(x) and a_l(x), where A(x) = a_h(x) · x^{n/2} + a_l(x).
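A minimal recursive sketch of this split (Python; coefficient lists with the low-degree term first, power-of-two lengths, and a function name of our choosing):

```python
def karatsuba(a, b):
    # recursive Karatsuba polynomial multiplication: three half-size
    # multiplications replace the four of the naive split
    n = len(a)
    if n == 1:
        return [a[0] * b[0]]
    h = n // 2
    al, ah = a[:h], a[h:]            # A = a_l + a_h * x^h
    bl, bh = b[:h], b[h:]
    low = karatsuba(al, bl)          # a_l * b_l
    high = karatsuba(ah, bh)         # a_h * b_h
    asum = [x + y for x, y in zip(al, ah)]
    bsum = [x + y for x, y in zip(bl, bh)]
    mid = karatsuba(asum, bsum)      # (a_l + a_h)(b_l + b_h)
    mid = [m - lo - hi for m, lo, hi in zip(mid, low, high)]
    res = [0] * (2 * n - 1)
    for i, v in enumerate(low):
        res[i] += v
    for i, v in enumerate(mid):
        res[i + h] += v
    for i, v in enumerate(high):
        res[i + n] += v
    return res
```

Each recursion level trades one multiplication for a few additions, which is the same cost trade-off exploited by the Toom-Cook generalization discussed below.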

Soft-Analytical Side-Channel Attacks
Soft-Analytical Side-Channel Attacks (SASCA) aim at exploiting more leakage than the single attack point used in classical divide-and-conquer attacks on AES [VCGS14]. SASCA combines template matching with belief propagation on a factor graph of the cryptographic implementation. It first performs a template attack on the target intermediate variables and retrieves the probabilities. For an intermediate variable x_i, one gets probabilities conditioned on the observable leakage ℓ, i.e., Pr(x_i = x | ℓ), where x runs through all possible values of x_i and ℓ is the observed side-channel leakage.
The adversary can construct a factor graph of the cryptographic algorithm and its implementation. The factor graph models the relationships among the intermediate variables. After adding the probabilities to the graph, the adversary performs the belief propagation algorithm to determine marginal probability distributions for the subkey. Next, we explain the factor graph and belief propagation more thoroughly.
Definition 1 (Factor graphs [KFL01]). A factor graph is a bipartite graph representing the factorization of a function. It is comprised of a variable node for each variable x_i, a factor node for each function f_j, and an edge connecting variable node x_i to factor node f_j if x_i is an argument of f_j. The set of factor nodes can be separated into two subsets. The first type of factor reflects the relationships among the variables in the implementation. For example, for a cryptographic operation x_out = OP(x_i1, x_i2), the factor can be represented as f(x_i1, x_i2, x_out) = 1 if x_out = OP(x_i1, x_i2) and 0 otherwise. The second subset describes the probabilities of the variables given the observable side-channel leakage ℓ. These factors are non-deterministic and can be represented as f_L(x_i) = Pr(x_i | ℓ). Based on these rules, the common arithmetic circuit of a cryptographic implementation can be constructed by the adversary. The variables correspond to the variable nodes and the factors describe the relationships among the variables. The original SASCA work showed how to construct a factor graph for AES (http://point-at-infinity.org/avraes) [VCGS14].
The belief propagation algorithm is a message-passing algorithm on graphical models such as factor graphs. It was initially proposed to compute the marginalization of a function efficiently.
Each variable node represents one variable x_n in the factor graph, and each factor node represents one factor f_m. We denote n, m as variable and factor indices, respectively. Let x_m be the variables that factor f_m depends on, x_{m\n} denote the set of variables in x_m without x_n, and v_n represent a value in the domain of x_n. We denote M(x_n) as the neighbors of the variable node x_n, and N(f_m) denotes the neighbor variables that the factor f_m depends on. The same notation applies to the indices n′ and m′.
The messages passed are of two kinds: from variables to factors (u_{x_n→f_m}) and from factors to variables (u_{f_m→x_n}).

Messages from variable to factor:
u_{x_n→f_m}(v_n) = ∏_{f_{m′} ∈ M(x_n)\{f_m}} u_{f_{m′}→x_n}(v_n)

Messages from factor to variable:
u_{f_m→x_n}(v_n) = Σ_{x_{m\n}} f_m(x_m) ∏_{x_{n′} ∈ N(f_m)\{x_n}} u_{x_{n′}→f_m}(v_{n′})

In a factor graph of a cryptographic implementation, there are two types of factors. One corresponds to the operations of the implementation, while the other describes the prior knowledge of the variables acquired through template attacks on side-channel leakages. BP applies the above message rules to all nodes and factors. Finally, the probabilities of the sensitive nodes (i.e. the key) are obtained through iterated propagation.
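To make the message-passing rules concrete, the toy sketch below (Python; the 4-valued variables, the deterministic operation y = (x + 1) mod 4, and all names are ours) computes an exact sum-product marginal on a tiny tree-shaped graph with one operation factor and two leakage factors:

```python
def bp_marginal_x(prior_x, prior_y):
    # exact sum-product on the tree:  f_L(x) -- x -- f -- y -- f_L(y)
    # with the deterministic operation factor f(x, y) = 1 iff y == (x+1) mod n.
    # prior_x, prior_y: leakage-factor distributions over {0, .., n-1}
    n = len(prior_x)
    # factor-to-variable message: sum over y of f(x, y) * u_{y->f}(y);
    # since f is deterministic, the sum collapses to a single term
    msg_f_to_x = [prior_y[(x + 1) % n] for x in range(n)]
    # the marginal of x is the normalized product of its incoming messages
    marg = [px * m for px, m in zip(prior_x, msg_f_to_x)]
    s = sum(marg)
    return [m / s for m in marg]
```

With a flat prior on x and a leakage prior on y that favours y = 0, the marginal of x concentrates on x = 3, since only x = 3 maps to y = 0 through the factor.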

Vulnerabilities Analysis of Toom-Cook Multiplication
In this section, we analyze the leakage characteristics of the Toom-Cook adopted by Saber with respect to divide-and-conquer attacks. Firstly, we describe the implementation of Toom-Cook multiplication. Then we exploit the vulnerabilities in the implementation and illustrate the challenges for traditional attacks.

Polynomial Multiplication in Saber
During the multiplication, the polynomials have 256 coefficients. To perform a 256 × 256 polynomial multiplication C(x) = A(x) · B(x), the implementation adopts the Toom-Cook-4-way split of A(x) and B(x). This step transforms the multiplication of two 256-coefficient polynomials into 7 multiplications of 64-coefficient polynomials. These are then split into 9 multiplications of 16-coefficient polynomials each through 2 levels of Karatsuba multiplication. Finally, the lowest-level polynomials are multiplied with schoolbook multiplication.
Currently, the polynomial multiplication in the Saber reference C implementation is based on Toom-Cook. The full implementation is a complex process; we simplify it for clarity, as shown in Figure 1. The IND-CPA decryption receives a ciphertext and the secret key sk. The polynomial multiplication is performed between b and s, which are converted from the ciphertext and the secret key respectively by the data conversion algorithms BS2POLVECp/BS2POLVECq. The data conversion algorithms map a byte string to a vector in the ring.
After the calls to the InnerProd and poly_mul_acc functions, toom_cook_4way is invoked to implement the multiplication, which is the main target of our analysis. The toom_cook_4way function takes as input a1, b1 (256 coefficients each) and outputs the multiplication result. The 256-coefficient polynomial is split into 4 polynomials B3, B2, B1, B0 of degree 64. During the evaluation, bw1, · · · , bw7 are obtained by substituting the point values y = p_0, · · · , y = p_6 into Equation (4), as shown in Equation (9).
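The evaluate / pointwise-multiply / interpolate flow of Toom-Cook-4-way can be sketched end to end as follows (Python; for clarity we use seven distinct finite evaluation points and exact rational arithmetic, not the specific points, including ∞, and modular arithmetic chosen by the reference implementation, and all function names are ours):

```python
from fractions import Fraction

def schoolbook(a, b):
    # plain O(n^2) polynomial multiplication (low-degree coefficient first)
    res = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            res[i + j] += ai * bj
    return res

def solve(M, rhs):
    # Gaussian elimination over the rationals: returns x with M x = rhs
    n = len(rhs)
    A = [row[:] + [rhs[r]] for r, row in enumerate(M)]
    for c in range(n):
        p = next(r for r in range(c, n) if A[r][c] != 0)
        A[c], A[p] = A[p], A[c]
        A[c] = [x / A[c][c] for x in A[c]]
        for r in range(n):
            if r != c and A[r][c] != 0:
                f = A[r][c]
                A[r] = [x - f * y for x, y in zip(A[r], A[c])]
    return [A[r][n] for r in range(n)]

def toom4(A, B):
    # Toom-Cook-4-way: split, evaluate at 7 points, 7 small multiplications,
    # interpolate the degree-6 result, recombine with y = x^k
    n = len(A)
    k = n // 4
    Ab = [A[i*k:(i+1)*k] for i in range(4)]   # A = A0 + A1*y + A2*y^2 + A3*y^3
    Bb = [B[i*k:(i+1)*k] for i in range(4)]
    pts = [0, 1, -1, 2, -2, 3, -3]            # any 7 distinct points work

    def ev(blocks, y):
        # evaluate the block decomposition at y: a k-coefficient polynomial
        out = [Fraction(0)] * k
        for i, blk in enumerate(blocks):
            for j, c in enumerate(blk):
                out[j] += Fraction(c) * y**i
        return out

    ws = [schoolbook(ev(Ab, p), ev(Bb, p)) for p in pts]   # the 7 products
    V = [[Fraction(p)**i for i in range(7)] for p in pts]  # Vandermonde matrix
    m = 2 * k - 1
    C = [[Fraction(0)] * m for _ in range(7)]
    for j in range(m):                        # interpolate coefficient-wise
        sol = solve(V, [w[j] for w in ws])
        for i in range(7):
            C[i][j] = sol[i]
    res = [Fraction(0)] * (2 * n - 1)
    for i in range(7):
        for j in range(m):
            res[i*k + j] += C[i][j]
    return res
```

Any seven distinct points determine the degree-6 polynomial C(y) uniquely; the reference implementation picks points that make the interpolation cheap in fixed-width integer arithmetic.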

Vulnerability Analysis
Side-channel attacks threaten cryptographic algorithms through physical information such as timing, power, and electromagnetic emissions. For timing leakage, a powerful source code analyzer [FGL+18] can perform static analysis to track the sensitive information in the code. The goal of the tool is to detect cache-timing leakages at the code level. The candidates submitted to the first round of the PQC project have been analyzed by this tool, which showed Saber was one of the submissions correctly protected against timing attacks [FGL+18]. In Saber, all moduli are powers of 2, which allows explicit modular reduction to be removed. The lack of modular reduction also implies that Saber is naturally constant time, making it resistant to cache timing attacks.
There are many leakage assessment methodologies to evaluate the detectable leakage in power/electromagnetic analysis. The general method of Welch's t-test evaluates the leakage independently of any power model and sensitive value [GGJR+11]; it was improved by Becker et al. and renamed Test Vector Leakage Assessment (TVLA) [BCDM+13]. It has been applied to detecting the leakage of lattice-based algorithms such as Kyber, Saber, NewHope and LAC. The polynomials transformed from the secret key are marked as the sensitive variables, shown as dotted arrows in Figure 2. The sensitive variables obtained from the input secret key are eventually multiplied with the ciphertext coefficients, which inevitably leads to side-channel leakage. Next, we analyze the challenges and limitations that the Toom-Cook in Saber poses to classical divide-and-conquer attacks from the attacker's perspective.

Figure 2:
The dataflow of Toom-Cook multiplication in Saber.
Incomplete key recovery. The schoolbook multiplication is vulnerable to SCA [ATT+18, SMS19] since its intermediate values depend on the known ciphertext and the unknown secret key. Huang et al. demonstrated private-key recovery from the Karatsuba multiplication in NTRU Prime [HCY19]. Karatsuba's method itself cannot resist vertical power analysis attacks on the lowest-level multiplications. However, the application of Toom-Cook makes these attacks fail to recover the full private key, as considered in [HCY19]: "Unfortunately, if the optimized version uses Toom-k as the first layer, the approaches can only reveal the first and last 1/k of private-key coefficients. How to adapt them to a fully optimized NTRU Prime in pursuit of full private-key recovery is worth further investigation". According to Equation (9), the lowest-level multiplications bw1 × aw1 and bw7 × aw7 can be used to recover B3 and B0, while the other coefficients are hardly recovered straightforwardly.
Indistinguishable guessing keys. A critical factor for successful power/electromagnetic analysis is distinguishing the statistical analysis results of the correct key from those of the wrong keys. In general, divide-and-conquer attacks can use the correlation, differential, or mutual information between the power/electromagnetic samples and the Hamming weight of an intermediate value in the cryptographic algorithm. Xu et al. focused on the output of multiplication in Kyber and mounted a chosen-ciphertext SPA attack on polynomial multiplication using few traces [XPSR+21]. Kyber executes a Montgomery reduction (mod ± q) by fqmul() after the multiplication. The coefficients of the secret key range from −2 to 2 and are drawn from a central binomial distribution. The adversary can distinguish the coefficient values (−2, −1, 0, 1, 2) by choosing different ciphertexts, since the Hamming weights of the output of fqmul() fall into different classes. However, in Saber, the integer moduli are powers of 2, so there is no need for explicit modular reduction. The last-level multiplication is implemented by the OVERFLOWING_MUL function:

r = OVERFLOWING_MUL(s_coeff, b_coeff) = s_coeff · b_coeff mod 2^16,

where s_coeff denotes a coefficient of the secret key and b_coeff denotes a coefficient of the ciphertext b. It can be seen that there are many similar correlation coefficients, even completely equal values. For example, the output of OVERFLOWING_MUL for an arbitrary b_coeff and s_coeff = 2 can be viewed as a one-bit left shift of b_coeff. Without modular reduction, the intermediate values of different key guesses therefore exhibit very similar Hamming-weight statistics, which makes the correct key hard to distinguish.
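This effect can be checked directly: for ciphertext coefficients below 2^15, multiplying by 2 is exactly a one-bit left shift, so the Hamming weight of the product is identical for the key guesses 1 and 2 (a quick Python sketch; the helper names are ours):

```python
import random

def overflowing_mul(s, b):
    # 16-bit product without modular reduction (wraps mod 2^16),
    # mimicking the uint16_t multiplication in the reference code
    return (s * b) & 0xFFFF

def hw(x):
    return bin(x).count("1")

random.seed(0)
bs = [random.randrange(1 << 13) for _ in range(2000)]  # coefficients < 2^13
hw_s1 = [hw(overflowing_mul(1, b)) for b in bs]
hw_s2 = [hw(overflowing_mul(2, b)) for b in bs]
# for every sampled b, HW(2*b) == HW(b): the two key guesses are
# indistinguishable by Hamming-weight leakage of the product
```

By contrast, an explicit reduction modulo a prime (as in Kyber's fqmul) scrambles these weights, which is what the chosen-ciphertext SPA of [XPSR+21] exploits.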

SASCA on the Toom-Cook
In this section, we describe how to construct a factor graph targeting such generic Toom-Cook software implementations. In a factor graph, each variable node (represented by a circle) is connected by an edge to each factor node (represented by a rectangle) that depends on it. As seen in Section 3, the variables are interconnected after the Toom-Cook & Karatsuba transformation. The secret key s is split into bw1, bw2, . . . , bw7 of degree 64, multiplied by aw1, aw2, . . . , aw7 obtained from the ciphertext. Each degree-64 polynomial is further split into four polynomials of degree 16 and transformed to decrease the complexity. For example, aw1_1, . . . , aw1_9 and bw1_1, . . . , bw1_9, deduced from aw1 and bw1, take part in the 16 × 16 schoolbook polynomial multiplications.
The factor graph corresponding to the example of the schoolbook polynomial multiplication of bw1_i and aw1_i is illustrated in Figure 3 (a), named SFG for simplicity. The factor graph represents variable and factor nodes by circles and squares, respectively. The factor nodes are further split into two groups. The first group of factors is characterized by side-channel information; its purpose is to add the side-channel information, i.e., the results of the template matching. During the schoolbook multiplication of bw1_i and aw1_i, the value of aw1_i is known from the known ciphertext. The attacker can observe the side-channel information during the execution of the multiplication operation to obtain the posterior distribution of the multiplication result r1_i. Thus the first type of factor corresponds to the a priori knowledge of the multiplication results obtained from side-channel leakage, for example f_L(r1_i[j]) = Pr(r1_i[j] | ℓ). The second group of factors models the relationships between the variable nodes in the schoolbook. We add the multiplications as factor nodes f_mul, which can be defined as f_mul(bw1_i, r1_i[j]) = 1 if r1_i[j] = OVERFLOWING_MUL(bw1_i, aw1_i[j]) and 0 otherwise, where OVERFLOWING_MUL is defined in Equation (11).
The last step is to formalize the implementation of the Toom-Cook evaluation. During the Toom-Cook-4-way evaluation, bw1, · · · , bw7 are obtained by substituting the seven point values, as in Equation (9). The nodes bw1, · · · , bw7 are connected to their corresponding KFGs, i.e. KFG_bw1, · · · , KFG_bw7. We define the factors f_1, · · · , f_7 to describe the relationships among the variables during the evaluation at the different points, as shown in Figure 5.
The various factor graphs above are combined into the complete factor graph of the Toom-Cook implementation. The relationships among the different factor graphs are described in Figure 6. There are 16 · 2 + 1 = 33 variable nodes in one SFG and 9 in one KFG. The overall graph includes 33 · 9 · 7 + 4 = 2083 variable nodes. The 16 · 9 = 144 leakage factors f_L per branch model the observed side-channel leakage of the multiplications in a trace. This requires close to ten million templates (2^16 · 144) overall. Moreover, the BP algorithm is used to determine the marginal probabilities of the secret key B3, B2, B1, B0 by iteratively executing the message propagation computation. Many short loops in the factor graph degrade the BP performance.

Improving Practical Single-Trace Attacks
This section shows how to address these problems and make single-trace attacks on the Toom-Cook more practical. Firstly, we decrease the number of templates by using deep neural networks. Secondly, we show that the factor graph can be optimized by a merging trick based on Bayes' rule. Then, the update rules are improved to handle the short loops during belief propagation.

Decreasing the Number of Templates
One straightforward way to decrease the number of required templates is to switch to Hamming weight templates. Because of the large bit width of the intermediate values in the algorithm, we profile templates for Hamming weights instead of every possible value. This performs a template matching and, therefore, gives a vector of probabilities conditioned on the leakage ℓ. Take the node r1_i[0] in the SFG as an example: only the 17 Hamming weight classes Pr(HW(r1_i[0]) = h | ℓ), h ∈ {0, . . . , 16}, need to be profiled instead of 2^16 values. Deep learning can further simplify the profiling. Zaid et al. proposed an efficient CNN architecture and showed that the networks do not need to be very complex to perform well in the side-channel context [ZBHV19]. We aim to use a model with a good balance between training time and effectiveness. Our MLP architecture is based on the model proposed in [RAD20], and the hyperparameters are tuned to fit the requirements of the target implementation.
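A minimal sketch of matching one leakage sample against the 17 Hamming-weight classes under a Gaussian noise model (Python; this is a simplification of real template matching, which would use per-class means and a pooled covariance estimated from profiling traces, and the function name is ours):

```python
import math

def hw_class_probs(leak, sigma):
    # Pr(HW = h | leak) for h = 0..16, assuming leak = HW + N(0, sigma)
    # and a uniform prior over the 17 classes
    lik = [math.exp(-((leak - h) ** 2) / (2 * sigma ** 2)) for h in range(17)]
    s = sum(lik)
    return [v / s for v in lik]
```

Only 17 classes need profiling per leaking multiplication, instead of 2^16 value templates, which is what makes the overall template count tractable.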
In the profiling phase, we train the network using the Adam optimizer [KB14] and a learning rate of 10^−2. The inputs of our network are the power traces, and it outputs a distribution over the class labels. The training label is set to the Hamming weight of the intermediate value in the Toom-Cook multiplication. The activation function for the hidden layers is SeLU, and that for the output layer is Softmax. Softmax is a nonlinear function, mainly used for the classifier output of multi-class classification, which produces a distribution over the class labels. In some settings, this output is used as a probability since it reflects the differences among the classes [KSH12, SWT14, TYRW14]. In the deep learning-based side-channel context, the output of the NN is also used in the multi-trace Bayes probability computation to distinguish the secret [ZBHV19]. The outputs are assigned to the factor nodes f_L instead of the probabilities in Equation (15). The MLP model is shown in Figure 7. The input layer contains inputs corresponding to the sample points of each schoolbook multiplication, and the output layer has the same number of neurons as there are Hamming weight classes to predict the target value y.
The network output can be written as

y_l = f_s( Σ_{j=1}^{N_2} w_{jl} · f_s( Σ_i w_{ij} x_i − σ_j ) − σ_l ), l = 1, . . . , N_1,

where f_s denotes the nonlinear activation function, j indexes the j-th hidden-layer neuron, l indexes the l-th output-layer neuron, N_1 denotes the number of neurons in the output layer, N_2 the number of neurons in a hidden layer, σ denotes the threshold of a neuron, w_ij denotes the weight between the i-th and j-th neurons and x_i denotes a neuron's output. The result of this classification is a probability vector of the Hamming weight prediction. The samples of all schoolbook multiplications in one trace are sent to the input of the network. Each trace segment in the training set is assigned the Hamming weight of the corresponding multiplication result as its label to train the network. The traces are normalized before being input to the network. Besides, we select the same number of training traces for each Hamming weight class to avoid overfitting.
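The Softmax output layer simply normalizes the final-layer activations into a distribution over the 17 Hamming-weight classes, which is what gets assigned to the f_L factors (a generic sketch of the standard function, not the exact layer of our model):

```python
import math

def softmax(z):
    # turn a vector of activations into a probability distribution;
    # subtracting the max is the usual numerical-stability trick
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]
```

The output always sums to 1, and the argmax of the activations is preserved, so the most likely Hamming-weight class is unchanged by the normalization.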

Factor Graph Optimization
The memory cost of a factor graph is influenced by the number of nodes and edges; generally, a larger graph requires more memory. The factor graph of a schoolbook multiplication contains 16 identical factors f_mul, as in Figure 3 (a). Based on the sum-product message-update rules, the marginal probability of the variable bw1_i is

p(bw1_i) ∝ ∏_{j=0}^{15} Σ_{r1_i[j]} f_mul(bw1_i, r1_i[j]) · f_L(r1_i[j]).
In template attacks, a single trace is usually not enough to recover the key with high confidence in practice. The posterior probabilities of the key candidates are calculated from multiple traces based on Bayes' theorem [OM06].
The sum of the probabilities of the candidates is equal to 1, i.e. Σ_l p(bw1_i = l) = 1. Based on Bayes' rule, the marginal probability can be represented as

p(bw1_i | ℓ_0, . . . , ℓ_15) = ( ∏_{j=0}^{15} p(ℓ_j | bw1_i) ) · p(bw1_i) / p(ℓ_0, . . . , ℓ_15).

Since the initial probability of bw1_i is uniform, the denominator in the above formula does not influence the ranking among the key candidates. The Bayes probability can be converted into the marginal probability by normalization.
Figure 3 (b) shows the simplified graph obtained with this merging trick. In contrast with the factor graph representation in Figure 3 (a), which includes 33 variable nodes, the new one consists of only 2 variable nodes per SFG. Moreover, the number of variable nodes in the overall graph drops from 2083 to 130, reducing the time and memory complexity of the subsequent belief propagation.
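The merging trick can be sketched end to end on a toy example (Python; a 4-bit coefficient space, noise-free Hamming-weight leakage, and all names are hypothetical, chosen only to keep the demonstration deterministic): the per-multiplication posteriors are multiplied together and normalized, exactly as a uniform-prior Bayes update.

```python
import math

def hw(x):
    return bin(x).count("1")

CAND = list(range(16))               # toy 4-bit key-coefficient space
AW = [1, 3, 5, 7, 9, 11, 13, 15]     # known "ciphertext" coefficients
SECRET = 11
SIGMA = 1.0

def posterior(leak, aw):
    # Pr(bw = v | leak) for each candidate v, Gaussian HW model of v*aw mod 16
    lik = [math.exp(-((leak - hw((v * aw) & 0xF)) ** 2) / (2 * SIGMA ** 2))
           for v in CAND]
    s = sum(lik)
    return [x / s for x in lik]

# noise-free leakage of each multiplication result, for determinism
leaks = [hw((SECRET * a) & 0xF) for a in AW]
posts = [posterior(l, a) for l, a in zip(leaks, AW)]

# Bayes merging: with a uniform prior, the merged marginal is the
# normalized product of the per-multiplication posteriors
merged = [1.0] * len(CAND)
for p in posts:
    merged = [m * q for m, q in zip(merged, p)]
norm = sum(merged)
merged = [m / norm for m in merged]
```

No message passing over 16 separate f_mul factors is needed; the product of the 8 posteriors already is the sum-product marginal of the single unknown coefficient.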

Improving Belief Propagation
In this section, we analyze how the factor graph affects the performance of belief propagation. Based on this analysis, we improve the belief propagation steps for this special case of the attack.
The way SASCA describes the implementation and leakage is similar to decoding factor graphs with a BP algorithm for Low-Density Parity Check (LDPC) codes [GGSB20]. LDPC error correction decoders are widely used in communication systems for their strong performance. The factor graph can be transformed into a Tanner graph, which is a graphical representation of LDPC codes; its nodes correspond to the variable nodes and factor nodes. The KFG in Figure 4 can be represented as the Tanner graph in Figure 8.
In LDPC codes, short cycles, especially cycles of length 4, degrade the performance of the BP algorithm [CH06]. This Tanner graph also contains many cycles of length 4, for example the red lines in Figure 8. To deal with these short cycles, we first detect them with the parity-check matrix, which is used to check the errors in a received codeword. The parity-check matrix H of the above graph equals:

1 1 1 0 0 0 0 0 0
1 0 0 1 0 0 1 0 0
0 1 0 0 1 0 0 1 0
0 0 0 1 1 1 0 0 0
1 1 0 1 1 0 0 0 1

The cycles of length 4 can be identified directly in the matrix: a cycle of length 4 occurs when two rows have ones in the same two column locations. It can be seen that all four cycles of length 4 are connected to the factor f_5^add, as shown by the bold entries in the parity-check matrix H. To avoid these shortest cycles of length 4, we split the BP on the KFG into two steps. Removing the factor f_5^add from the parity-check matrix H eliminates all cycles of length 4. Thus we first perform the standard BP algorithm on the subgraph, as shown in Figure 9 (a). Note that cycles also exist in this subgraph.
However, the influence of large cycles on BP is slight. Then, the marginal probabilities p(bw1_1), p(bw1_2), p(bw1_4), p(bw1_5) are used as the initial distributions in the second subgraph, shown in Figure 9 (b). In this step, the distribution of bw1_9 is computed as

p(bw1_9) = Σ_{bw1_1, bw1_2, bw1_4, bw1_5} f_5^add(bw1_1, bw1_2, bw1_4, bw1_5, bw1_9) · p(bw1_1) p(bw1_2) p(bw1_4) p(bw1_5).

Finally, the marginal probabilities of bw1_1, bw1_2, bw1_4, bw1_5 and bw1_9 are updated. Compared to the original BP on the KFG, this method splits one loopy BP into two steps executed sequentially, which mitigates the performance degradation.
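The length-4 cycle detection described above amounts to looking for pairs of rows of H that share ones in at least two columns. A small Python sketch (with the last row of H written under the assumption, consistent with the text, that f_5^add connects bw1_1, bw1_2, bw1_4, bw1_5 and bw1_9):

```python
# parity-check matrix of the KFG Tanner graph:
# rows = factors f_1..f_5, columns = variables bw1_1..bw1_9
H = [
    [1, 1, 1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 0, 1],
]

def four_cycles(H):
    # a cycle of length 4 exists whenever two factor rows have ones
    # in two (or more) common variable columns
    cycles = []
    for r1 in range(len(H)):
        for r2 in range(r1 + 1, len(H)):
            shared = [c for c in range(len(H[0])) if H[r1][c] and H[r2][c]]
            if len(shared) >= 2:
                cycles.append((r1, r2, tuple(shared)))
    return cycles
```

On this H, every length-4 cycle involves the last row, so removing the corresponding factor f_5^add breaks all of them at once.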

Attacks on the Saber case
In this section, we apply the SASCA to the Saber case. In particular, we use performance and success rate to show the efficiency of the optimization methods, including the factor graph optimization, the improved belief propagation and the decreased number of templates.

Evaluation
We evaluate our attack method using leakage simulations in various settings. The target is the C reference implementation of Saber as submitted to the NIST contest. The experiments are run on an Intel i7-10510U (2.3 GHz). The simulated traces are generated with a Hamming weight (HW) plus additive Gaussian noise model. The simulated leakage is L = HW(v) + N(0, σ), where N(0, σ) represents a Gaussian distribution with zero mean and standard deviation σ. In this section, we analyze the improvement of our methods under different noise levels (σ = 2, 5, 10).
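A sketch of this leakage model (Python; the function name and the choice of 30 samples per intermediate value follow the simulation setup described in this section):

```python
import random

def hw(x):
    return bin(x).count("1")

def simulate_leakage(value, sigma, n_samples=30, seed=None):
    # n_samples noisy samples L = HW(value) + N(0, sigma)
    # for one intermediate variable
    rng = random.Random(seed)
    h = hw(value)
    return [h + rng.gauss(0, sigma) for _ in range(n_samples)]
```

Concatenating the samples of all leaking multiplications yields one simulated trace; the noise level σ is the knob varied in the experiments below.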
For each simulation, we generate 30 samples for each intermediate variable, and there are 144 intermediate multiplication variables. The 30 samples simulate the continuous leakage over a period of time in a practical setting; using identical sample counts does not influence the comparative evaluation of the different methods. Thus there are 4320 samples in one trace. We use 50,000 traces for training and perform the template matching to obtain the corresponding conditional probabilities. For each noise level, we perform 100 experiments and compute the averaged results. We first evaluate the effects of the factor graph optimization. The element b1 of the secret key is split into bw1, . . . , bw7 during the Toom-Cook evaluation. Each element splits into 9 coefficients during the Karatsuba evaluation; for example, bw1 is split into bw1_1, . . . , bw1_9. These coefficients are our targets during the attack. Based on Figure 3, we perform BP on the original SFG and obtain the marginal probabilities of bw1_1, . . . , bw1_9. Then we perform BP on the Bayes-based SFG. 100 independent traces are used to compute the success rate. The success rates under the original SFG and the Bayes-based SFG are the same, as shown in Figure 10. The success rates tend to decrease with increasing noise level.
We also evaluate the whole factor graph under the two types of SFG. The execution time of BP on the factor graph is recorded to compare the efficiency of the two SFGs. We average the results of 100 experiments to decrease the influence of noise. Table 2 shows the performance metrics of the Bayes-based SFG and the original SFG under a fixed noise level (σ = 2). The execution time difference among the subgraphs is due to the varying sizes of the distributions, corresponding to the spaces of possible values. Based on Equation (9), the space of each variable can be calculated, denoted as N_bw1 = 8, N_bw2 = 624, N_bw3 = 65, N_bw4 = 65, N_bw5 = 624, N_bw6 = 624, N_bw7 = 8. A larger distribution costs more time during BP. From Table 2, the Bayes-based SFG requires less time than the original SFG while maintaining the same success rate, as expected. The time needed to reach the same success rate is reduced to less than half of that of the original attack. In order to evaluate the efficiency of the improved BP, we perform the original BP and the improved BP on the same KFG. Note that there exist cycles in the KFG, which degrade the performance of the BP algorithm. The attack targets are the coefficients bw1_3, bw1_2, bw1_1, bw1_0, corresponding to the nodes bw1_1, bw1_2, bw1_4, bw1_5 in the KFG, as in Equation (10). We first perform the original BP to calculate the marginal probabilities. The number of iterations is set to 5 since the depth of the graph is not large. We attack 100 traces with independent keys to compute the success rate. The same setting is used for the improved BP algorithm. A summary of the results obtained with the two BP algorithms is given in Table 3. As predicted above, the cycles in the factor graph significantly impact the attack performance. The attack's success rate with the improved BP outperforms the original BP while taking less time.

Analyzing a real device
The practical attack is performed with electromagnetic radiation measurements of the STM32F103RB (ARM Cortex-M3) micro-controller on the STM32 Nucleo-64 board, which has also been used for many other implementations of LWE encryption. The target is the C reference implementation submitted to the NIST contest. The STM32CubeIDE and the ST programmer from STMicroelectronics are used to compile and program the device. A Langer RF2 near-field probe is placed in proximity to the chip. The electromagnetic signals are sampled with a KEYSIGHT DSOX3024T digital oscilloscope, and a PA303 amplifier is used to pre-process the signals. The setup is depicted in Figure 11 (a). We record the leakage traces at a sampling rate of 100 MSa/s with a bandwidth limit of 200 MHz. Figure 11 (b) presents the EM patterns of the multiplications on the target device. The leakage is located at line 3 of the toom_cook_4way() procedure (Figure 1), where the decoded ciphertexts are multiplied by the secret key. Each operation contains 144 schoolbook multiplications, implemented by the MUL instruction, whose leakage is exploited to perform the SASCA. To magnify the leakage of the multiplication results, we choose 16 ciphertexts that maximize the HW differences among the key classes, similar to the chosen ciphertexts used in attacking Kyber [XPSR+21]. The training set consists of 90,000 traces captured for the chosen ciphertexts. We use a unified deep learning model to profile all the multiplication leakages. The MLP architecture is shown in Figure 7. Each trace segment in the training set is assigned the Hamming weight of the multiplication result as its label to train the network. We set the learning rate to 10^−2 and the number of epochs to 1000 with a batch size of 4096. The traces are normalized before being input to the network.
After training the network, we use the profiled model to attack the target trace. The target trace segment is fed into the network to predict the Hamming weight of the intermediate value, which yields the probability distribution over 17 classes (0~16). These probabilities are assigned to the side-channel leakage nodes f_L in the SFG factor graph. In this type of factor graph, aw1_i[0] ∼ aw1_i[15] are the ciphertext values, while bw1_i is the unknown node. Since its efficiency has already been evaluated, we perform BP on the Bayes-based SFG.
To compare with the traditional template attack, we also perform a template attack on the same profiling measurements. We use the trace set to build Hamming weight templates for each multiplication. The points of interest (POIs) used for template building are determined with correlation coefficients. We also set different numbers of POIs in our attacks, i.e. 50, 80, 100. The obtained probabilities are then used to perform BP on our factor graph. We perform 100 experiments with independent attack traces. The success rates of attacking bw1_3, bw1_2, bw1_1, bw1_0 are shown as the grey histograms in Figure 12. For the deep learning-based attack, the parameters are the same as for the template attacks. After BP, the coefficients of the secret key can be recovered; the success rates are shown as the black histograms in Figure 12. We observe that the deep learning-assisted SASCA achieves a higher success rate than the template attack. Meanwhile, it simplifies the profiling procedure, avoiding the large number of templates required by template attacks.

Discussions
We have established single-trace attacks on the Toom-Cook multiplication. The method can be applied to other cryptographic algorithms that use Toom-Cook for polynomial multiplication. We now discuss the attack scenario and some possible countermeasures.

Scenario and limitation
A straightforward application of KEMs (NTRU, Saber) is the authenticated key exchange protocol. New protocol standards, including TLS 1.3, increasingly advocate or mandate PKE/KEMs that use ephemeral key pairs to achieve perfect forward secrecy [SFG20].
In post-quantum TLS key exchange schemes, the key exchange at the beginning of a session involves one key generation, one encapsulation, and one decapsulation [SSW20, BBCT22]. This makes session-key recovery attractive for recovering the real-time encrypted information [RBRC20]. Some implementations reuse keys at the cost of limited forward secrecy, usually under restricted update rules; for example, the Microsoft Windows TLS library Schannel caches keys for two hours [CNE + 14]. Such short key lifetimes limit the number of traces available to DPA/CPA and make the single-trace attack more threatening.
Like template attacks, SASCA requires profiling and the ability to configure keys during the profiling phase. Building the profiles is more time-consuming than DPA/CPA, as many traces are needed to create good templates. Aligning the POIs of each attack point with the observable leakage is also necessary for a successful attack. In addition, one needs to know the implementation to construct the factor graph. We limit our attack to a single device; a more threatening attack could be extended to the cross-device setting, which requires more complicated profiling than a single device. Many works address device variation, including training on multi-device traces [DGD + 19], principal component analysis [GDD + 19], and frequency transformation [ZSX + 20]. In the future, we will extend our method to the cross-device context based on these techniques.

Potential defenses
Masking implementations already exist for lattice schemes [VBDK + 21, BGR + 21]. Masking was proposed to resist DPA, and it also interferes with single-trace attacks: randomizing the intermediate variables destroys the relationship between the sensitive information and the side-channel leakage.
However, SASCA on the NTT can also attack masked implementations [PPM17, PP19]. Masking splits the secret key into two shares before decryption, and the attack simply recovers each share individually and adds them up to obtain the unmasked input. Due to the linearity of polynomial multiplication and addition, masking is a natural fit for lattice-based schemes [RSRVV15, VBDK + 21]. In these schemes, the private key s is split into two shares s_1, s_2 satisfying s = s_1 + s_2 mod q. The polynomial multiplications, additions, and other operations are then computed on each share individually, and in the last step the decoding module is carefully designed to output two shares of the message. The targets are the polynomial multiplications between the two shares and the ciphertext b at the beginning of the first stage: the attacker performs SASCA on the NTT computations of b^T s_1 and b^T s_2 to recover s_1 and s_2 individually [PPM17, PP19]. The same principle applies to the single-trace attack on Toom-Cook.
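The share splitting and the linearity that the attack exploits can be sketched as follows (function names hypothetical; q is Saber's modulus):

```python
import random

Q = 8192  # Saber's modulus q

def share_secret(s, q=Q):
    """Arithmetic masking: split each coefficient c into c = c1 + c2 mod q,
    with c1 drawn uniformly at random."""
    s1 = [random.randrange(q) for _ in s]
    s2 = [(c - r) % q for c, r in zip(s, s1)]
    return s1, s2

def inner_product(b, s, q=Q):
    """b^T s mod q, the operation targeted at the start of decryption."""
    return sum(x * y for x, y in zip(b, s)) % q
```

By linearity, b^T s = b^T s_1 + b^T s_2 mod q, so recovering each share individually and adding the results yields the unmasked secret.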
Shuffling is a hiding countermeasure that is viewed as effective against algebraic side-channel attacks. The Fisher-Yates algorithm [FY53] can be used to randomize the order of operations and thereby decrease the SNR during execution. For higher security, shuffling can be combined with other countermeasures, for example clock jitter, instruction shuffling, and dummy operations [CK09, VCMKS12, BPS + 20]. Protecting the implementation with lightweight countermeasures is left as future work.
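A minimal sketch of the Fisher-Yates shuffle, which could randomize the execution order of the 144 schoolbook multiplications (the permutation being uniformly distributed is what makes the countermeasure sound):

```python
import random

def fisher_yates(n, rng=random):
    """Return a uniformly random permutation of range(n); each index i is
    swapped with a uniformly chosen index j <= i, back to front."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i + 1)   # 0 <= j <= i
        perm[i], perm[j] = perm[j], perm[i]
    return perm
```

The multiplications would then be executed in the order given by the permutation, decorrelating each trace segment from a fixed coefficient index.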

Conclusion
In this work, we investigate the security of the Toom-Cook multiplication against single-trace attacks. We show how to apply SASCA to the Toom-Cook in Saber and additionally adapt deep learning-based power attacks to reduce the tremendous number of multivariate templates required by traditional SASCA. More concretely, we prove that the marginal probabilities computed by the Sum-Product message-update rule equal the posterior probabilities given by Bayes' theorem in the factor graph of schoolbook multiplications, so the nodes representing the schoolbook results in Toom-Cook can be merged via Bayes' theorem. We also improve the BP algorithm to eliminate the influence of the short cycles that frequently appear in the factor graph of Toom-Cook. These techniques are also applicable to other cryptographic schemes that use the Toom-Cook algorithm.