A Side-Channel Attack on a Masked IND-CCA Secure Saber KEM

In this paper, we present the first side-channel attack on a first-order masked implementation of IND-CCA secure Saber KEM. We show how to recover both the session key and the long-term secret key from 16 traces by deep learning-based power analysis without explicitly extracting the random mask at each execution. Since the presented method is not dependent on the mask, we can improve success probability by combining score vectors of multiple traces captured for the same ciphertext. This is an important advantage over previous attacks on LWE/LWR-based KEMs, which must rely on a single trace. Another advantage is that the presented method does not require a profiling device with deactivated countermeasure, or known secret key. Thus, if a device under attack is accessible, it can be used for profiling. This typically maximizes the classification accuracy of deep learning models. In addition, we discovered a leakage point in the primitive for masked logical shifting on arithmetic shares which has not been known before. We also present a new approach for secret key recovery, using maps from error-correcting codes. This approach can compensate for some errors in the recovered message.


Introduction
Public-key cryptographic schemes in current use depend on the intractability of specific mathematical problems such as integer factorization or the discrete logarithm problem. However, it is known that when large-scale quantum computers become a reality, factoring and discrete log can be efficiently solved using the Shor algorithm [Sho99]. Even if it will take many years until large-scale quantum computers are available, the need for long-term security (what we protect today must remain secure also in 10 years from now) makes this an issue that needs immediate attention.
In response to this situation, the National Institute of Standards and Technology (NIST) started a project a few years ago for standardizing post-quantum cryptographic primitives (NIST PQ standardization project). The candidate primitives in this project rely on problems that are not known to be solvable by a quantum computer. The two most common areas for such problems are lattice problems and decoding problems for error-correcting codes. In round 1, security was the main focus in evaluation, whereas round 2 considered implementation aspects to a larger extent. The project recently entered round 3, where it is expected that security in relation to side-channel attacks will have a larger focus.
As mentioned, lattice-based cryptography is perhaps the most promising area in post-quantum cryptography. The remaining candidates in round 3 are split into two subsets, the finalists and the alternates. Among the four finalists for the key encapsulation mechanism (KEM) primitive, three are lattice-based (and one more lattice-based scheme is among the alternates).
Lattice-based schemes can be further split into several categories: NTRU-based schemes with finalist NTRU [C+20]; Learning With Errors (LWE)-based schemes with finalist Kyber [S+20]; and Learning With Rounding (LWR)-based schemes with finalist Saber [D+20]. The hardness in these problems comes from inserting unknown noise into otherwise linear equations.
Side-channel attacks were introduced by Kocher [KJJ99] and are today considered as the main threat against implementations of cryptographic algorithms, in particular for applications in embedded devices. Side-channel attacks exploit information obtained from physically measurable alternative channels and the most common ones are timing measurements and the measured power consumption of a device. Side-channel attacks and the corresponding countermeasures have been a major area of research for many years now, often targeting cryptographic standards. A more recent sub-area is the investigation of side-channel attacks for post-quantum cryptography. This is getting increasing attention in the research community, in particular in connection with the NIST PQ standardization project. The analysis and protection against side-channel attacks for the round 3 finalist candidates is an urgent area to explore.
The first and most basic form of side-channel analysis and protection is obtained by considering the timing channel, simply measuring the execution time of software implementations of the cryptographic algorithms. The general protection method is to write implementations that run in constant time. This is today a standard assumption for software implementations. The timing channel can be extended to consider cache-timing attacks, where time variation due to memory management in the executing device is considered. A typical example of an exploit is the use of look-up tables.
Even with constant-time implementations and avoiding implementation weaknesses such as the use of look-up tables, a software implementation is still vulnerable to attacks if power measurements from the CPU can be used. Additional protection measures need to be considered and the main tools are such techniques as masking and shuffling.
A fully side-channel protected implementation of a lattice-based cryptosystem was first proposed in [RRVV15], followed by [RdCR+16], based on masking. It should be noted that masking involves doing linear operations twice, whereas non-linear operations call for more complex solutions which decrease the speed even more. The masked implementation in [RRVV15] increases the number of CPU cycles on an ARM Cortex-M4 by a factor of more than 5 compared to a non-protected implementation.
Whereas these protection attempts consider Chosen-Plaintext Attack (CPA)-secure lattice schemes, it is more interesting to consider securing primitives designed to withstand Chosen-Ciphertext Attacks (CCA). CCA-secure primitives are usually obtained through a transform and a CPA-secure primitive. The most common transformation is the Fujisaki-Okamoto (FO) transform or some variation of it [HHK17]. The CCA-transform is itself susceptible to side-channel attacks and should be masked [RRCB20]. Examples of recent masked implementations are: [OSPG18] of a KEM similar to NewHope; and [BBE+18, MGTF19, GR19] on different lattice-based signature schemes.
Narrowing in on the NIST round 3 finalists, only the candidate Saber has an associated protected software implementation available [BDK+20]. Saber is a Module-LWR-based KEM that is a finalist in the third round of the NIST PQ standardization project. LWR means that noise is added through rounding instead of adding explicit error terms as for LWE.
In [BDK+20] the authors construct a first-order masked implementation of the Saber CCA-secure decapsulation algorithm that comes with an overhead factor of only 2.5 compared to the unmasked implementation. It is claimed that this side-channel secure version can be built with relatively simple building blocks compared to other candidates, resulting in a small overhead for side-channel protection. The masked implementation of Saber is based on masked logical shifting on arithmetic shares and a masked binomial sampler. The work includes experimental validation of the implementation to confirm suppression of side-channel leakage on the Cortex-M4 general-purpose processor.
Side-channel attacks on the unprotected implementations of NIST PQ standardization project candidates have been considered in some recent papers. In [SKL+20] a message recovery attack (session key recovery) was described using a single trace on the unprotected encapsulation part of some of the round 3 candidates. In [RRCB20] side-channel attacks on several round 2 candidates were described. In [XPRO] unprotected Kyber was attacked as a case study, using an EM side-channel approach. In particular, a mechanism for turning a message recovery attack into a secret key recovery attack was proposed, giving a secret key recovery using, e.g., 184 traces for a 98% success rate. In [GJN20] similar ideas for timing attacks were considered.
In the very recently posted paper [RBRC20], the authors improve the key recovery attacks on unprotected implementations of three NIST PQ finalists, including Saber. They also discuss how to attack masked implementations by attacking shares individually. However, no actual attack on masked Saber is performed and because only a single trace is available for each unknown mask value, the success rate in message bit recovery for such a two-step approach may be far from 100%.
Contributions: In this paper, we present the first side-channel attack on a masked implementation of IND-CCA secure Saber KEM. It does not require a profiling device with deactivated countermeasure as in previous attacks on masked implementations of LWE/LWR-based KEMs [RBRC20, SKL+20]. We show how to recover both the session key and the long-term secret key, by deep learning-based power analysis using a small number of traces without explicitly extracting the random mask at each execution. Since the presented method is independent of the mask, we can do error correction by capturing multiple traces for the same input. This is an important advantage over the attacks in [RBRC20, SKL+20] which must succeed from a single trace. The independence of the mask also makes it possible to use the device under attack for capturing traces for the profiling stage, if it is accessible for a sufficiently long time. This typically maximizes classification accuracy of deep learning models created at the profiling stage. We also present a new approach for secret key recovery, using maps from error-correcting codes. This approach can compensate for some errors in the recovered message.
The remainder of this paper is organized as follows. In Section 2 we give the necessary background both on Saber and on the use of deep learning in side-channel attacks. In Section 3 we describe the main part of the work, which is a message recovery attack on the decryption/decapsulation algorithm. In Section 4 we subsequently show how an attack recovering the long-term secret key can be done, using the message recovery attack from the previous section. Section 5 concludes the paper and describes future work.

Background
This section provides background information on the Saber algorithm, the masked implementation of Saber from [BDK + 20], profiled side-channel attacks, and Test Vector Leakage Assessment (TVLA).

SABER algorithm
Saber [D+20] is a finalist candidate in the NIST PQ standardization project, where the security is based on the hardness of the Module Learning with Rounding problem (MLWR). It starts with an IND-CPA secure encryption scheme, Saber.PKE, and then presents an IND-CCA secure key encapsulation mechanism (KEM), Saber.KEM, which is transformed from Saber.PKE through a version of the FO transform. Algorithms Saber.PKE and Saber.KEM are described in Figs. 1 and 2, respectively.

We now introduce some notation used in the description of Saber. Let $\mathbb{Z}_q$ denote the ring of integers modulo a positive integer q and $R_q$ the quotient ring $\mathbb{Z}_q[X]/(X^n + 1)$. Saber sets n = 256. The rank of the module is denoted by l and it increases for a higher security level.
In Saber, the positive integers q, p, and T are chosen to be powers of 2, i.e., $q = 2^{\epsilon_q}$, $p = 2^{\epsilon_p}$, and $T = 2^{\epsilon_T}$, respectively. We use x ← χ(S) to denote sampling from χ, if χ is a distribution over a set S. The notation S can be omitted, i.e., we write x ← χ, if there is no ambiguity.
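Let U denote the uniform distribution and $\beta_\mu$ the centered binomial distribution with parameter µ, where µ is an even positive integer. Thus, the samples of $\beta_\mu$ lie in the interval [−µ/2, µ/2] and its probability mass function is
\[ P[x \mid x \leftarrow \beta_\mu] = \frac{\mu!}{(\mu/2 + x)!\,(\mu/2 - x)!}\, 2^{-\mu}. \]
We write $\beta_\mu(R_q^{l \times k}; r)$ for generating a matrix in $R_q^{l \times k}$ where the coefficients of polynomials in $R_q$ are sampled in a deterministic manner from $\beta_\mu$ using seed r.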
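As an illustration of the centered binomial distribution, the following minimal Python sketch draws a sample as the difference of the Hamming weights of two (µ/2)-bit random strings and evaluates the corresponding probability mass function. The function names are hypothetical, and the sketch uses Python's general-purpose random generator rather than the seeded, deterministic sampling used in Saber.

```python
import random
from math import comb

def sample_centered_binomial(mu, rng=random):
    """Draw one sample from beta_mu as HW(a) - HW(b) for two (mu/2)-bit strings."""
    assert mu > 0 and mu % 2 == 0
    half = mu // 2
    a = sum(rng.getrandbits(1) for _ in range(half))
    b = sum(rng.getrandbits(1) for _ in range(half))
    return a - b  # always lies in [-mu/2, mu/2]

def centered_binomial_pmf(mu, x):
    """P[x] = mu! / ((mu/2 + x)! (mu/2 - x)!) * 2^(-mu), and 0 outside the support."""
    half = mu // 2
    if abs(x) > half:
        return 0.0
    return comb(mu, half + x) / 2**mu
```

The probability mass function follows from the fact that HW(a) − HW(b) has the same distribution as a Binomial(µ, 1/2) variable shifted by µ/2.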
The functions F, G, and H are hash functions, where F and H are implemented using SHA3-256, and G is implemented using SHA3-512. The algorithms also employ an extendable output function gen to generate a pseudorandom matrix $A \in R_q^{l \times l}$ from a seed $seed_A$. This extendable output function is implemented using SHAKE-128.
The bitwise right shift operation is denoted by ≫ and can be extended to polynomials and matrices by applying it coefficient-wise. Saber also includes three constants to efficiently implement rounding operations by a simple bit shift, i.e., two constant polynomials $h_1 \in R_q$ and $h_2 \in R_q$ with all coefficients set to $2^{\epsilon_q - \epsilon_p - 1}$ and $2^{\epsilon_p - 2} - 2^{\epsilon_p - \epsilon_T - 1} + 2^{\epsilon_q - \epsilon_p - 1}$, respectively (with $\epsilon_q = \log_2 q$, $\epsilon_p = \log_2 p$, and $\epsilon_T = \log_2 T$), and one constant vector $h \in R_q^{l \times 1}$ with each polynomial set equal to $h_1$. Three parameter sets (see Table 1) are proposed in the round 3 Saber document, i.e., LightSaber, Saber, and FireSaber, aiming for the security levels of NIST-I, NIST-III, and NIST-V, respectively. These parameter sets achieve decryption failure probabilities bounded by $2^{-120}$, $2^{-136}$, and $2^{-165}$, respectively. For a more detailed description of the different parts of Saber, we refer to the design document [D+20].
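To make the role of the rounding constants concrete, the sketch below shows how adding a constant and shifting implements scaling with rounding from modulus q to modulus p. It assumes q = 2^13 and p = 2^10, the moduli of the Saber specification; the function name is hypothetical and only a single coefficient is handled.

```python
EPS_Q, EPS_P = 13, 10            # assumed: q = 2**13, p = 2**10
Q, P = 1 << EPS_Q, 1 << EPS_P
H1 = 1 << (EPS_Q - EPS_P - 1)    # rounding constant 2**(eps_q - eps_p - 1) = 4

def round_q_to_p(x):
    """Scale a coefficient x in Z_q to Z_p with rounding: add H1, reduce mod q, shift."""
    return ((x + H1) % Q) >> (EPS_Q - EPS_P)
```

Because q and p are powers of two, the reduction modulo q is a bit mask and the division by q/p is a shift, so no explicit rounding or modular arithmetic routine is needed.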

Masked Saber KEM
Masking is a well-known countermeasure against power/EM analysis [CJRR99].
First-order masking protects against attacks leveraging information in the first-order statistical moment. A first-order masking partitions any sensitive variable x into two shares, $x_1$ and $x_2$, such that $x = x_1 \bullet x_2$, and executes all operations separately on the shares. The operator "$\bullet$" depends on the type of masking, e.g. it is "+" in arithmetic masking and "⊕" in Boolean masking.
Carrying out operations on the shares $x_1$ and $x_2$ prevents leakage of side-channel information related to x as computations do not explicitly involve x. Instead, $x_1$ and $x_2$ are linked to the leakage. Since the shares are randomized at each execution of the algorithm, they are not expected to contain exploitable information about x. The randomization is usually done by assigning a random mask r to one share and computing the other share as x − r for arithmetic masking or x ⊕ r for Boolean masking.
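The two masking types can be sketched in a few lines of Python. This is an illustrative toy, not the masked Saber implementation: the modulus 2^13 and the helper names are assumptions, and Python integers offer no protection against leakage.

```python
import secrets

Q = 1 << 13  # assumed power-of-two modulus for arithmetic masking

def arith_mask(x, q=Q):
    """Split x into shares (r, x - r mod q) with a fresh random mask r."""
    r = secrets.randbelow(q)
    return r, (x - r) % q

def arith_unmask(x1, x2, q=Q):
    """Recombine arithmetic shares: x = x1 + x2 mod q."""
    return (x1 + x2) % q

def bool_mask(x, bits=8):
    """Split x into shares (r, x XOR r) with a fresh random mask r."""
    r = secrets.randbits(bits)
    return r, x ^ r

def bool_unmask(x1, x2):
    """Recombine Boolean shares: x = x1 XOR x2."""
    return x1 ^ x2
```

A linear operation such as addition modulo q can be applied share by share: adding the first shares and the second shares of two masked values yields a valid sharing of their sum, which is why linear operations are simply duplicated under masking.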
A challenge in masking lattice-based cryptosystems is the integration of bit-wise operations with arithmetic masking, which requires methods for secure conversion between masked representations. Saber can be efficiently masked due to specific features of its design: power-of-two moduli q, p and T, and limited noise sampling of LWR. Due to the former, modular reductions are basically free. The latter implies that only the secret key s has to be sampled securely. In contrast, LWE-based schemes also need to securely sample two additional error vectors.
Masking duplicates most linear operations, but requires more complex routines for nonlinear operations. The first-order masked implementation of Saber presented in [BDK+20] uses a custom primitive for masked logical shifting on arithmetic shares and an adapted masked binomial sampler from [SPOG19]. Particular attention is devoted in [BDK+20] to the protection of the decapsulation algorithm, Saber.KEM.Decaps(), since it involves operations with the long-term secret key s.
If artificial neural networks are used, then at the profiling stage a network is trained to learn the leakage "profile" of the target device for all possible values of the sensitive variable. The training is done using a large number of traces captured from the profiling device, which are labelled according to the selected leakage model (e.g. Hamming weight, Hamming distance, identity, etc). Afterwards, at the attack stage, the trained network is used to classify traces captured from the device under attack (which may be the same or different from the profiling device).

Test vector leakage assessment
The Test Vector Leakage Assessment (TVLA) introduced by Goodwill et al. [GJJR11] is a popular statistical technique which is used as a metric for evaluating side-channel leakage and as a tool for feature extraction from side-channel measurements [RJJ+18, RRCB20, SKL+20].
TVLA applies Welch's t-test to find differences between two sets of side-channel measurements. The t-test takes a sample from each of the two sets and establishes whether they differ by assuming a null hypothesis that the means of the two sets are equal.
The TVLA of two sets of measurements $T_0$ and $T_1$ is carried out as follows:
\[ t = \frac{\mu_0 - \mu_1}{\sqrt{\sigma_0^2/n_0 + \sigma_1^2/n_1}}, \]
where $\mu_i$, $\sigma_i$ and $n_i$ stand for mean, standard deviation and cardinality of the set $T_i$, for i ∈ {0, 1}. The null hypothesis is rejected with a confidence of 99.9999% only if the absolute value of the t-test score is greater than 4.5 [GJJR11]. A rejected null hypothesis means that the two data sets are noticeably different and thus might leak some information.
In this work, we use TVLA for a posteriori analysis of side-channel measurements.
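A point-wise Welch's t-test over two sets of traces, with score t = (µ0 − µ1) / sqrt(σ0²/n0 + σ1²/n1) and the 4.5 threshold of [GJJR11], can be sketched as follows. The function names are hypothetical, traces are assumed to be equal-length lists of samples, and the sample variance (n − 1 denominator) is an assumption of this sketch.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample0, sample1):
    """Welch's t-score of two samples with possibly unequal variances and sizes."""
    m0, m1 = mean(sample0), mean(sample1)
    v0, v1 = variance(sample0), variance(sample1)  # sample variances
    return (m0 - m1) / sqrt(v0 / len(sample0) + v1 / len(sample1))

def tvla(traces0, traces1, threshold=4.5):
    """Return the t-score of every trace point and the indices with |t| > threshold."""
    columns0, columns1 = zip(*traces0), zip(*traces1)  # group samples by trace point
    scores = [welch_t(c0, c1) for c0, c1 in zip(columns0, columns1)]
    leaky = [i for i, t in enumerate(scores) if abs(t) > threshold]
    return scores, leaky
```

The sketch raises a division-by-zero error if a trace point has zero variance in both sets, which does not happen with noisy measurements but should be handled in practice.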

Message recovery attack
First, we present an attack which recovers a message from traces captured during the execution of Saber.KEM.Decaps() by the device under attack. Later, in Section 4.2, we show how the long-term secret key can be extracted from a few recovered messages.

Main idea
Side-channel attacks aiming to extract a secret S from a set of side-channel measurements T captured from a masked implementation of an algorithm A face two problems:
1. How to find points in T which leak information about S?
2. How to recover S without knowing the value of the mask at each execution of A?

Finding leakage points
The attacks on non-masked implementations [RRCB20, SKL+20] solve the problem (1) by identifying leakage points in side-channel measurements using, for example, TVLA [GJJR11], or Correlation Power Analysis (CPA) [BCO04]. However, such an approach does not apply to masked implementations because the value of the random mask at each execution of A is unknown. Previous works addressing masked LWE/LWR-based KEMs [RBRC20, SKL+20] suggest solving the problem (1) by first deactivating the masking countermeasure, or fixing the mask to a constant, and then finding leakage points as in the non-masked case. However, in order to deactivate the countermeasure, or fix the mask, one requires the implementation source code of the algorithm under attack. The source code may be proprietary. In addition, a modified source code may be optimized differently by the compiler due to the changes made to deactivate the countermeasure. This might change the shape of power traces, as we show in Section 3.5.4.
We solve the problem (1) using a deep learning method which works without explicitly extracting the random mask at each execution. We first hypothesize an approximate location of the leakage point in a trace based on knowledge of the algorithm under attack and through experience gained in power analysis of its non-masked implementations. Then, we verify each hypothesis by training a deep learning model on an interval of the trace covering the selected point. If the model learns with a high accuracy, the point is assumed to leak. Otherwise, we shift the interval window and repeat the training. If all shift attempts fail, the hypothesis is rejected.
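The hypothesize-train-shift procedure can be sketched as a simple search loop. Everything named here is hypothetical: `train_and_validate` stands for any routine that trains a model on the trace interval [lo, hi) and returns its validation accuracy, and the acceptance threshold `min_accuracy` is an assumed parameter, since no numeric criterion for "high accuracy" is stated.

```python
def verify_leakage_hypothesis(start, width, shifts, train_and_validate,
                              min_accuracy=0.9):
    """Try shifted windows around a hypothesized leakage point.

    Returns ((lo, hi), accuracy) for the first window on which the model
    learns well enough, or None if every shift attempt fails.
    """
    for shift in shifts:
        lo = start + shift
        if lo < 0:
            continue  # window would start before the trace
        accuracy = train_and_validate(lo, lo + width)
        if accuracy >= min_accuracy:
            return (lo, lo + width), accuracy  # the point is assumed to leak
    return None  # hypothesis rejected
```

A caller would pass the hypothesized start position, a window width, and a list of shifts such as `[0, 300, -300, 600, -600]`, where each call to `train_and_validate` corresponds to one full training run.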

Recovering message without knowing the mask
In the previous work on LWE/LWR-based KEMs, the problem (2) is addressed by either constructing a template [RBRC20], or training a deep learning model [SKL+20] on power/EM traces from a profiling device in which the masking countermeasure is deactivated, or the mask r is fixed to a constant. Profiling aims to distinguish the case when a message bit value of "0" is processed at the leakage point from the case when a message bit takes the value of "1".
During the attack, traces captured from a device under attack are given as input to the template/model to separately recover the jth bit of the shares m ⊕ r and r, for all j ∈ {0, 1, ..., 255}. Finally, the message is computed as m = r ⊕ (m ⊕ r). Note that such two-phase attacks must recover the message from a single power/EM trace because a new random mask r is generated for each execution of the algorithm.
We show that it is possible to train an accurate deep learning model capable of recovering message bits from a masked LWE/LWR-based KEM directly, without explicitly knowing or manipulating the value of the mask. At the profiling stage, a model for the bit j, $N_j$, is trained on traces containing the jth bits of both shares, m[j] ⊕ r[j] and r[j], and labelled by the value of the message bit m[j] (see Fig. 3). At the attack stage, the model $N_j$ takes an interval of the trace containing m[j] ⊕ r[j] and r[j], and performs processing equivalent to recognizing their values from the shape of power traces and doing a logic operation, XOR, on them. A similar strategy is used in [MPP16] for extracting the secret key from a masked implementation of AES except that we use the message bit values as a leakage model, while in [MPP16] the Hamming weight of the S-Box output is used as a leakage model.
Another difference is that, since public-key encryption is performed using the public key, for LWE/LWR-based KEMs we can pre-compute a set of ciphertexts corresponding to any set of messages (random or chosen). Therefore, if the device under attack is accessible, we can use it to capture training traces for the profiling stage. This is clearly not possible in the case of symmetric encryption algorithms since they use the secret key for both encryption and decryption. Using the device under attack for profiling is advantageous because the deep learning model's classification accuracy does not deteriorate due to differences in training and test traces caused by manufacturing process variation [WBFD19].
An advantage of our attack over the two-phase attacks on masked LWE/LWR-based KEMs [RBRC20, SKL+20] is that we can improve message recovery probability by combining score vectors of multiple traces captured for the same ciphertext. The two-phase attacks must rely on a single trace.

Assumptions
We assume that the adversary knows that the device under attack implements the Saber algorithm and has physical access to the device under attack to capture power traces. We consider two scenarios:
1. The access time is sufficient to capture both training traces for the profiling stage and test traces for the attack stage. In this case, the device under attack can be used for profiling.
2. The access time is sufficient to capture test traces for the attack stage only. In this case, a different device, identical to the device under attack, is required for profiling.

Trace acquisition
Next, we describe equipment for trace acquisition and how leakage points are located.

Equipment
Our measurement setup is shown in Fig. 4. It consists of the ChipWhisperer-Lite board, the CW308 UFO board and the CW308T-STM32F4 target board. The ChipWhisperer is a hardware security evaluation toolkit based on a low-cost open hardware platform and open-source software [New]. The ChipWhisperer-Lite board can be used to measure power consumption and control the communication between the target device and the computer. The power is measured over the shunt resistor placed between the power supply and the target device. ChipWhisperer-Lite uses a synchronous capture

Locating leakage points
In previous work, a number of vulnerabilities were discovered in the non-masked LWE/LWR-based PKE/KEMs [ACLZ20, SKL+20, RRCB20, RBRC20]. One is the Incremental-Storage vulnerability resulting from an incremental update of the decrypted message in memory during message decoding [RBRC20]. The decoding function (line 2 of Saber.PKE.Decryt() at Fig. 1) iteratively maps each polynomial coefficient into a corresponding message bit, thus computing the decrypted message one bit at a time.
It was observed in [RBRC20] that, in non-masked implementations of the decoding function (see indcpa_kem_dec() at Fig. 6), there are two points containing an exploitable Incremental-Storage vulnerability. The first one is at line 8 of indcpa_kem_dec() where the message bits m[j] are computed and stored in a 16-bit memory location v[i] in an unpacked fashion. Since v[i] can take only two possible values, 0 or 1, an attacker can recover the message bit m[j] by distinguishing between 0 and 1. The second point is located at line 4 of the POL2MSG() procedure, where the decoded message bits are packed into a byte array in memory.
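The incremental packing that creates the second leakage point can be illustrated with a short Python model. This is a hypothetical sketch rather than the code of Fig. 6: it assumes that bit j of byte i holds message bit 8i + j (least significant bit first), and it records every intermediate byte value written to memory.

```python
def pack_message_bits(bits):
    """Pack message bits into bytes one bit at a time.

    Returns the packed bytes and the list of intermediate byte values,
    i.e. the data an incremental-storage leak would expose.
    """
    assert len(bits) % 8 == 0
    packed = bytearray(len(bits) // 8)
    written = []
    for i in range(len(packed)):
        for j in range(8):
            packed[i] |= bits[8 * i + j] << j  # OR the new bit into the byte
            written.append(packed[i])          # value stored after this step
    return bytes(packed), written
```

Each stored value differs from the previous one only if the newly processed bit is 1, so the sequence of memory writes reveals the message bit by bit.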
First, we explain how we located the position of POL2MSG() in traces. Fig. 5(a) shows a trace representing the initial part of Saber.KEM.Decaps(). The trace is obtained by averaging 50,000 traces captured for random ciphertexts. Since POL2MSG() packs the message bits into a byte array, we expect its trace to look like a block of repeating, similar patterns. The segment of Fig. 5 [...] Since poly_A2A() is executed immediately before POL2MSG(), we can hypothesize that it is located in the interval between the points 4,000 and 15,000 in Fig. 5(b).
Next we describe how we used deep learning to check if these two intervals contain exploitable leakage points.

Profiling and attack stages
Let $m = \{m_1, m_2, \ldots, m_t\}$, where $m_i \in \{0, 1\}^{256}$, for i ∈ {1, ..., t}, be a set of messages selected at random. Let $c = \{c_1, c_2, \ldots, c_t\}$ be the set of corresponding ciphertexts $c_i$ = Saber.PKE.Enc(pk, $m_i$; $r_i$). Let $T_i \in \mathbb{R}^l$ be a trace captured from the profiling device $D_{pro}$ during the execution of Saber.KEM.Decaps() with $c_i$ as input, where l is the trace size.
The pseudocode in Fig. 7 describes the profiling and attack stages of the presented message recovery attack. At the profiling stage, a neural network model $N_j : \mathbb{R}^l \to I^2$, where $I = \{x \in \mathbb{R} \mid 0 \le x \le 1\}$, is trained for a selected message bit j. A network $N_j$ maps a trace $T_i \in \mathbb{R}^l$ into a score vector $S_{j,i} = N_j(T_i) \in I^2$ whose elements $s_{j,i,k}$ represent [...]
[Fig. 7: pseudocode of ProfilingStage($D_{pro}$, t, l, j), where t is the number of training traces, l is the trace size, and j is the bit position.]
To train a network $N_j$, each trace $T_i$ in the training set T is assigned a label $l_j(T_i) = m_i[j]$, for i ∈ {1, ..., t}. We use 70% of the set T for training and 30% for validation. Categorical cross-entropy is used as a loss function. No input normalization is applied. The Nadam optimizer (an extension of RMSprop with Nesterov momentum) with the learning rate 0.0001 and numerical stability constant epsilon=1e-08 is used. The training is carried out for a maximum of 100 epochs with batch size 10 and early stopping.
Since the MLP architecture in Table 2 is very simple, the training time is short, e.g. for |T| = 100,000 and trace size l = 7,000, training a model takes less than 5 minutes on average on a quad-core 4.0 GHz PC.
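The labelling rule and the 70%/30% split used for training can be sketched as below. The helper names are hypothetical, messages are assumed to be 32-byte strings with bit j stored least significant bit first within byte j // 8, and the split is taken in capture order, which is an assumption of this sketch.

```python
def message_bit(message, j):
    """Return bit j of a byte-string message (assumed LSB-first within each byte)."""
    return (message[j // 8] >> (j % 8)) & 1

def label_and_split(traces, messages, j, train_percent=70):
    """Label each trace with m_i[j] and split into training and validation sets."""
    labelled = [(trace, message_bit(msg, j)) for trace, msg in zip(traces, messages)]
    cut = len(labelled) * train_percent // 100
    return labelled[:cut], labelled[cut:]
```

One such labelled set is built per bit position j, so that a separate model can be trained for each bit.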
At the attack stage, d traces $T = \{T_1, \ldots, T_d\}$ are captured from the device under attack $D_{att}$ for the same ciphertext c. To recover a message bit m[j], the model $N_j$ is used to compute score vectors $S_{j,i} = N_j(T_i)$ for every i ∈ {1, ..., d}. The most likely value of

Experimental results
In the experiments, we use three CW303 ARM devices shown in Fig. 8. We use the device $D_1$ for profiling and all three devices for testing. One can see that $D_1$ and $D_2$ look similar. They are acquired from the same chip vendor. The device $D_3$ is visually different from the others; it is acquired from a different chip vendor. We test on multiple boards to investigate how the classification accuracy of the models trained on $D_1$ is affected by manufacturing process variation. It is known to be an issue for advanced technologies [WBFD19].
Using the equipment and the method described in Section 3.3, we captured from D 1 a large set of traces T of size 100,000 for training the MLP models.We also captured from each of D 1 , D 2 and D 3 a smaller set of traces T of size 1,000 for testing the models.
Throughout the section, we use $p_j$ and $p_{j,d}$ to denote the probability to recover a specific message bit j ∈ {0, ..., 255} from a single trace and d traces, respectively. Similarly, we use $p_m$ and $p_{m,d}$ to denote the probability to recover a complete message from a single trace and d traces, respectively.

Single-trace attack
1. Using POL2MSG() leakage point
Table 3 lists expected probabilities to recover a message bit m[j] from a single trace from a test set T using the MLP model $N_j$ trained on the 7,000-point interval shown in Fig. 9, for j ∈ {0, ..., 7} and |T| = 1,000. The size l = 7,000 is chosen by dividing the trace in Fig. 9 by the number of shares, dividing the result by the number of bytes in one share, and adding to the share size the byte size plus some safety margin.
As expected, the models $N_j$ best classify the test traces captured from the same device as the one used for profiling, $D_1$. For $D_2$ the difference in the classification accuracy is 1.2% on average. For $D_3$ it is 2.7% on average.
We can also see that the expected message recovery probabilities differ for different bits. The bit m[7] is the most difficult. To understand the reasons for this, we perform a posteriori TVLA analysis described in Section 3.5.3.
2. Using poly_A2A() leakage point
Table 4 lists expected probabilities to recover a message bit m[j] from a single trace from a test set T using the MLP model $N_j$ trained on the 600-point interval shown in Fig. 10, for j ∈ {0, ..., 7} and |T| = 1,000. The size l = 600 is selected by dividing the interval shown in Fig. 10 by the number of bytes and adding to the result some safety margin. We can see that poly_A2A() leaks less than POL2MSG(). Another difference is that, for poly_A2A(), the expected probabilities for all bits except m[0] are similar. The bit 0 is more difficult to recover. However, later in Section 3.5.3 we show that m[0] is a special case for the byte 0. For other bytes, m[0] is not different from the rest.
Clearly, it is possible to exploit both leakage points simultaneously by combining score vectors of the corresponding models.
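One possible way to combine score vectors, whether they come from several traces of the same ciphertext or from the models of both leakage points, is to sum their logarithms per class and pick the class with the larger total. This combining rule is an assumption of the sketch below and is not necessarily the rule used in the experiments; the function name and the clamping constant `eps` are likewise hypothetical.

```python
from math import log

def combine_scores(score_vectors, eps=1e-12):
    """Decide a message bit from several score vectors (s0, s1).

    Scores are treated as independent class probabilities, so their
    logarithms are accumulated per class; eps avoids log(0).
    """
    totals = [0.0, 0.0]
    for scores in score_vectors:
        for k in (0, 1):
            totals[k] += log(max(scores[k], eps))
    return 0 if totals[0] >= totals[1] else 1
```

For example, `combine_scores([(0.6, 0.4), (0.45, 0.55), (0.7, 0.3)])` returns 0 even though the second trace alone would be classified as 1.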

Multiple-trace attack
If classification errors are mutually independent, the probability to recover a message bit j from d traces captured for the same ciphertext can be estimated as [Dub13, p. 64]:
\[ p_{j,d} = \sum_{i=(d+1)/2}^{d} \binom{d}{i} p_j^{i} (1 - p_j)^{d-i}, \]
where $p_j$ is the probability to recover a message bit j from a single trace and d is odd. Assuming that the expected probabilities for the other message bytes are the same as the ones for the first byte, the probability to recover a 256-bit message from d traces can be computed as $p_{m,d} = \left(\prod_{j=0}^{7} p_{j,d}\right)^{32}$. Table 5 lists the minimal number of traces d required to achieve $p_{m,d}$ > 99.9% for $D_1$, $D_2$ and $D_3$, respectively, for the case when both leakage points are used simultaneously.
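The estimate can be evaluated numerically with the sketch below. It assumes that the per-bit estimate is the standard majority-vote (binomial tail) expression for an odd number d of independent trials, and that the eight per-bit probabilities of the first byte apply to all 32 bytes; the function names and the example probabilities are hypothetical.

```python
from math import comb

def p_bit_majority(p, d):
    """Probability that more than half of d independent trials (d odd) are correct."""
    assert d % 2 == 1
    return sum(comb(d, i) * p**i * (1 - p)**(d - i)
               for i in range((d + 1) // 2, d + 1))

def p_message(p_bits, d, n_bytes=32):
    """Probability to recover the whole message from d traces.

    p_bits holds the single-trace probabilities of the 8 bits of one byte.
    """
    product = 1.0
    for p in p_bits:
        product *= p_bit_majority(p, d)
    return product ** n_bytes

def min_traces(p_bits, target=0.999, d_max=99):
    """Smallest odd d with p_message > target, or None if d_max is too small."""
    for d in range(1, d_max + 1, 2):
        if p_message(p_bits, d) > target:
            return d
    return None
```

Calling `min_traces` with the eight single-trace probabilities measured for a device gives the smallest odd d meeting the 99.9% target under these independence assumptions.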

A posteriori TVLA analysis
To understand why some bits are more difficult to extract than others, we perform a posteriori TVLA analysis of side-channel measurements.This section shows results for the first byte of the share m ⊕ r.
For each j ∈ {0, ..., 7}, we partition the training set T into two sets containing traces in which $m_i[j] \oplus r_i[j] = k$ when $c_i$ is applied as input:
\[ T_k = \{T_i \in T \mid m_i[j] \oplus r_i[j] = k\}, \]
for k ∈ {0, 1} and i ∈ {1, ..., t}, where $r_i[j]$ denotes the jth bit of the mask $r_i$.
From the t-test results we can see that overall POL2MSG() leaks more than poly_A2A(). This is also apparent from Tables 3 and 4. For POL2MSG(), the shape of t-test plots is different for every bit of a byte (see Fig. 13). Possibly the peaks represent the storage of the decoded message bit m[j] into a byte array in memory and the addition (OR) of the current content of the byte array with the following 7 − j bits, j ∈ {0, ..., 7}. Note that our method of labelling training traces is "memoryless": a label for $N_j$ is determined by the bit m[j] only and not by the bits m[0], ..., m[j − 1] already stored in the byte array by the time j is processed, as in [RBRC20]. While we might be losing in classification accuracy for m[7], we need to train only eight models to recover all bits of a byte. In the Hamming weight-based method [RBRC20], 44 templates are required to recover all bits of a byte.
Tables 6 and 7 compare sum of squared pairwise t-differences (SOST) values of the first four bytes of the share m ⊕ r for POL2MSG() and poly_A2A() leakage points. We can see that, on average, SOST of different bytes do not differ significantly. The byte 0 is more difficult than other bytes for POL2MSG(). The bit 0 of byte 0 is an especially difficult case for poly_A2A().

Effect of masking countermeasure deactivation
In this section, we show that a recompiled reference source code of masked Saber, in which the masking countermeasure is deactivated, may produce very different power traces from the original source code. The trace shown in Fig. 9 represents two consecutive executions of POL2MSG() in indcpa_kem_dec_masked() without any modifications. The trace in Fig. 14 represents two consecutive executions of POL2MSG() in indcpa_kem_dec_masked() with the mask fixed to 0. We can see that, in the latter case, one execution of POL2MSG() takes half as long as in the former. As a result, a model trained on traces in Fig. 14 is likely to fail in recovering the message from the traces in Fig. 9.
Traces in the two cases differ because the compiler applied a different optimization strategy due to the changes made to deactivate the countermeasure.

Recovering the long-term secret key
The session key can be derived directly from the recovered message and the public key. In this section, we show how to recover the long-term secret key using maps from error-correcting codes.
For IND-CCA-secure lattice-based schemes, [RRCB20] proposes a secret key recovery attack through the analysis of recovered messages; this approach, however, assumes that the message is perfectly recovered. Perfect message recovery is not trivial. Previous solutions, which achieve a very high message recovery probability by using multiple traces, might require a large number of traces in combination with masking.
We next describe a new secret key recovery attack that compensates for the errors in the recovered message. We start with a basic version that is optimized in terms of the sample complexity, i.e., the required number of traces. This basic version may encounter a large failure probability if the success probability of the message-recovery attack is low. We thus propose novel improvements using error-correcting codes and other observations to reduce the noise occurring in the process of recovering the secret bits.
We focus on LightSaber, observing that the other two parameter sets (see Table 1) can be attacked in a similar manner.

The basic version of secret key recovery
If we use the message-recovery attack described in the previous section to recover the message bit m[i], then partial information about s[i] is learned. Using the decision table shown in Table 8, we could recover the first 256 positions of s with four queries, when perfect message recovery is assumed.
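To illustrate why a handful of chosen ciphertexts can pin down a secret coefficient, the sketch below searches greedily for pairs $(k_1, k_0)$ whose decrypted bits separate all candidate values of s[i]. It rests on assumptions taken from the Saber specification rather than from this section: the LightSaber parameters $\epsilon_q = 13$, $\epsilon_p = 10$, $\epsilon_T = 3$, coefficients in [−5, 5], the constant $h_2$, and the decryption rule $m = ((v - 2^{\epsilon_p - \epsilon_T} c_m + h_2) \bmod p) \gg (\epsilon_p - 1)$ with $v = k_1 s[i]$ for the crafted ciphertext. The function names are hypothetical, and the greedy search need not reproduce the pairs of Table 8.

```python
# Assumed LightSaber parameters (Saber specification, not this section).
EPS_Q, EPS_P, EPS_T = 13, 10, 3
P = 1 << EPS_P
H2 = (1 << (EPS_P - 2)) - (1 << (EPS_P - EPS_T - 1)) + (1 << (EPS_Q - EPS_P - 1))
SECRET_RANGE = range(-5, 6)  # assumed support of one coefficient of s


def message_bit(s, k1, k0):
    """Bit m[i] decrypted for secret coefficient s and the pair (k1, k0)."""
    return ((k1 * s - (k0 << (EPS_P - EPS_T)) + H2) % P) >> (EPS_P - 1)


def split(groups, query):
    """Refine a partition of candidate values by the bit a query returns."""
    refined = []
    for group in groups:
        for bit in (0, 1):
            part = [s for s in group if message_bit(s, *query) == bit]
            if part:
                refined.append(part)
    return refined


def greedy_queries():
    """Pick queries until every candidate value has a unique bit pattern."""
    candidates = [(k1, k0) for k1 in range(P) for k0 in range(1 << EPS_T)]
    groups, queries = [list(SECRET_RANGE)], []
    while len(groups) < len(SECRET_RANGE):
        def cost(query):
            sizes = [len(g) for g in split(groups, query)]
            # Prefer small largest group; ties broken by sum of squares,
            # so a query that splits something always beats one that does not.
            return (max(sizes), sum(n * n for n in sizes))
        best = min(candidates, key=cost)
        queries.append(best)
        groups = split(groups, best)
    return queries


queries = greedy_queries()
table = {s: tuple(message_bit(s, k1, k0) for k1, k0 in queries)
         for s in SECRET_RANGE}
```

Each entry of `table` plays the role of one row of a decision table: the tuple of recovered message bits identifies the coefficient. Since 11 candidate values must be told apart, at least four binary answers are needed, which is consistent with the four queries stated above.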
This attack version works only in the most optimistic setting. In a real attack, the success probability could be very low if the success probability of the message-recovery attack is insufficient. For instance, as shown in Table 4, the lowest probability of recovering one message bit for a $D_3$ device is only 0.947, resulting in a probability of $0.947^4 \approx 0.804$ for recovering a single position in the secret key, since all four queried bits must be correct. Thus, the chance of getting all the secret coefficients correct is negligible.
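The arithmetic behind this estimate can be checked with a few lines. The sketch makes two simplifying assumptions that are not stated in the text: bit errors are independent, and every bit is recovered with the worst-case probability 0.947. The count of 512 coefficients is the LightSaber secret size used later in this section.

```python
P_BIT_WORST = 0.947     # lowest single-trace bit recovery probability
QUERIES_PER_COEFF = 4   # queries needed per coefficient in the basic attack
N_COEFFS = 512          # secret coefficients of LightSaber

# A coefficient is correct only if all four queried bits are correct.
p_coeff = P_BIT_WORST ** QUERIES_PER_COEFF

# The full key is correct only if every coefficient is correct.
p_all = p_coeff ** N_COEFFS

print(f"per-coefficient success: {p_coeff:.3f}")
print(f"full-key success:        {p_all:.3e}")
```

Under these assumptions the full-key success probability is vanishingly small, which motivates the error-tolerant techniques of the next subsection.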

New improvements
We now describe the improved key-recovery attack, consisting of three novel techniques.
Mapping a secret coefficient to various message bits. Since the leakage quality varies for different bits in a message byte (see Table 6), we could map a secret coefficient to several message bits to reduce the largest error probability for recovering such a coefficient. We [...] Thus, the decision table in [...]
Using lattice reduction algorithms for post-processing. Since the connection between the public key and the secret key, i.e., $b = ((A^T s + h) \bmod q) \gg (\epsilon_q - \epsilon_p)$, is publicly known, one could employ a post-processing step with lattice reduction algorithms to fully recover the secret key s.
We now assume a restricted adversary model with a $D_3$ device (see Section 3.5), in which the adversary has limited access to the targeted device. For the j-th bit in a message byte, we use the leakage point, POL2MSG() or poly_A2A(), that maximizes the probability $p_j$ of recovering the bit from a single trace, for j ∈ {0, 1, . . ., 7}. The notation $p_j$ is defined in Section 3.5. We know from Tables 3 and 4 that $(p_0, p_1, \ldots, p_7)$ = (0.984, 0.985, 0.988, 0.963, 0.972, 0.991, 0.975, 0.947).
Similar to the assumption in Section 3.5.2, we assume that the expected probabilities for the other message bytes are not worse than the ones for the first byte. Since we connect each s[i] to the recovery of eight consecutive message bits with different indices modulo 8, the probabilities of having at most one error, exactly two errors, and at least three errors are 0.9856, 0.0138, and 0.0006, respectively. With the $[8, 4, 4]_2$ extended Hamming code, one error can be corrected and two errors can be detected, in which case the corresponding secret coefficient is marked as an erasure; only the last case will introduce an erroneous secret coefficient. We only need to recover 512 − 14 = 498 positions, so the expected number of failures in recovering s[i] is about 498 × 0.0006 ≈ 0.3. We also have about 14 + 498 × 0.0138 < 21 positions marked as erasures, but we know that the coefficients are small since they are generated from the centered binomial distribution $\beta_\mu$. The resulting lattice problem is easy to solve since the dimension of the new lattice is quite small. In summary, one could recover the long-term key using only 16 traces, even from a $D_3$ device which is different from the profiling one.
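The error-count probabilities quoted above can be reproduced from $(p_0, \ldots, p_7)$ under the assumption that the eight bit errors are independent. The sketch below computes the distribution of the number of errors by dynamic programming; the function name is hypothetical, and small deviations in the last digit from the quoted values are due to rounding.

```python
P_BITS = (0.984, 0.985, 0.988, 0.963, 0.972, 0.991, 0.975, 0.947)


def error_count_distribution(p_correct):
    """dist[e] = probability of exactly e bit errors (independent bits)."""
    dist = [1.0]
    for p in p_correct:
        nxt = [0.0] * (len(dist) + 1)
        for errors, prob in enumerate(dist):
            nxt[errors] += prob * p            # this bit is correct
            nxt[errors + 1] += prob * (1 - p)  # this bit is wrong
        dist = nxt
    return dist


dist = error_count_distribution(P_BITS)
p_at_most_one = dist[0] + dist[1]  # corrected by the code
p_two = dist[2]                    # detected, marked as an erasure
p_three_plus = sum(dist[3:])       # may yield a wrong coefficient

positions = 512 - 14
expected_wrong = positions * p_three_plus
expected_erasures = 14 + positions * p_two

print(f"P(<=1 error) = {p_at_most_one:.4f}")
print(f"P(2 errors)  = {p_two:.4f}")
print(f"P(>=3 errors) = {p_three_plus:.4f}")
```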
Note that the improved key-recovery attack from noisy messages has significance beyond evaluating the security of masked Saber. We leave the investigation of its wider applications for future work.

Conclusion
We presented a side-channel attack on a masked implementation of Saber which recovers the session key and the long-term secret key using 16 traces for LightSaber. Our approach has several advantages over previously proposed ones, including the possibility of increasing the success probability by analyzing multiple traces and of using the device under attack for profiling. We discovered a previously unknown leakage point in the primitive for masked logical shifting on arithmetic shares. We also described a new approach for secret key recovery which can compensate for some errors in the recovered message.
Future work includes assessing higher-order masking schemes and combined ones, as well as designing deep learning-resistant countermeasures for LWE/LWR-based PKE/KEMs.

Figure 3: Profiling and attack stages of the presented message recovery attack.
(a) marked by two red lines is a possible candidate. Fig. 5(b) shows its zoomed version. By further zooming into the interval of Fig. 5(b) marked by the red lines, we can distinguish 64 repeating patterns representing the processing of bytes. Fig. 5(d) gives a more detailed view of the processing of one byte; it takes 196 samples.

Figure 7: Description of the profiling and attack stages of the presented message recovery attack on masked Saber PKE; $D_{pro}$ is the profiling device and $D_{att}$ is the device under attack (can be the same instance).

Figure 9: The 7,000-point trace used as an input to $N_j$ for all j ∈ {0, . . ., 7} in the attack using the POL2MSG() leakage point, sampled at 4 pt/clock cycle.

Figure 11: The t-test results for the first byte of the share m ⊕ r.

Figure 13: The t-test results for POL2MSG() leakage point separately for each bit.

Figure 14: The execution of POL2MSG() with the mask fixed to 0, sampled at 4 pt/clock cycle.

Following the basic idea in [RRCB20], we prepare ciphertexts $(c_m, b')$ where $c_m = k_0 \sum_{i=0}^{255} x^i \in R_T$ and $b' = (k_1, 0) \in R_p^{2 \times 1}$. Then, the decryption algorithm computes $m = ((b'^T \ldots$ [...] Then we could prepare ciphertexts $(c_m, b')$ where $c_m = k_0 \sum_{i=0}^{255} x^i \in R_T$ and $b' = (0, k_1) \in R_p^{2 \times 1}$ to recover the remaining 256 positions of s. In summary, one needs eight traces for LightSaber.

Table 2: The MLP architecture used in the message recovery attack; l = 7,000 and 600 for the POL2MSG() and poly_A2A() leakage points, respectively.
the probability that the jth bit of $m_i$, $m_i[j]$, is equal to k ∈ {0, 1} in trace $T_i$. We use the multilayer perceptron (MLP) architecture listed in Table 2.

Table 3: Probability $p_j$ to recover m[j] from a single trace using the POL2MSG() leakage point.

Table 4: Expected probability to recover a message bit from a single trace using the poly_A2A() leakage point.

Table 5: Probability $p_{m,d}$ to recover a 256-bit message from d traces captured for the same ciphertext.

Table 6: SOST of the first 4 message bytes for POL2MSG().

Table 7: SOST of the first 4 message bytes for poly_A2A().

Table 8: Chosen pairs of $(k_1, k_0)$ to determine s[i] based on m[i] for LightSaber without error correction. (X: m

Table 9: Chosen pairs of $(k_1, k_0)$ to determine s[i] based on m[i] for LightSaber with $[8, 4, 4]_2$ extended Hamming codes. (X: m
Employing extended Hamming codes. The $[8, 4, 4]_2$ extended Hamming code, with code length 8, dimension 4, and minimum distance 4, can correct one error and detect the event that two errors occur. We design a new decision table, shown in Table 9, mapping the secret coefficient s[i] to a codeword of the $[8, 4, 4]_2$ extended Hamming code. We will show that the error-correcting capability of this $[8, 4, 4]_2$ extended Hamming code is sufficient for our attack; one may need to employ low-rate codes with larger minimum distance if a less accurate message recovery is assumed.
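The behaviour of the $[8, 4, 4]_2$ extended Hamming code described above can be sketched as follows. The bit ordering of the codewords, the brute-force nearest-codeword decoder, and the function names are illustrative assumptions; the actual assignment of codewords to values of s[i] is the one given by Table 9 and is not reproduced here.

```python
from itertools import product


def encode(data):
    """Encode 4 data bits with Hamming(7,4) plus an overall parity bit."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p3, d2, d3, d4]
    word.append(sum(word) % 2)  # extension bit: even overall parity
    return tuple(word)


# All 16 codewords, mapped back to their data bits.
CODEBOOK = {encode(d): d for d in product((0, 1), repeat=4)}


def distance(a, b):
    """Hamming distance between two equal-length bit tuples."""
    return sum(x != y for x, y in zip(a, b))


def decode(received):
    """Return the data bits, or None to signal an erasure.

    Zero or one error is corrected; a word at distance two from the
    nearest codeword is reported as an erasure.
    """
    best = min(CODEBOOK, key=lambda c: distance(c, received))
    if distance(best, received) <= 1:
        return CODEBOOK[best]
    return None


sent = encode((1, 0, 1, 1))
one_error = list(sent)
one_error[3] ^= 1
two_errors = list(sent)
two_errors[0] ^= 1
two_errors[5] ^= 1
```

Here `decode(tuple(one_error))` returns the original data bits, while `decode(tuple(two_errors))` returns `None`, corresponding to a coefficient marked as an erasure. Three errors place the received word at distance one from a different codeword, so they are miscorrected, which is the only case that introduces an erroneous coefficient.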