A Faster Third-Order Masking of Lookup Tables

Abstract. Masking of S-boxes using lookup tables is an effective countermeasure to thwart side-channel attacks on block ciphers implemented in software. At first and second orders, Table-based Masking (TBM) schemes can be very efficient and even faster than circuit-based masking schemes. Ever since the customised second-order TBM schemes were proposed, the focus has been on designing and optimising Higher-Order Table-based Masking (HO-TBM) schemes that facilitate masking at arbitrary order. One of the reasons for this trend is that, at large orders, HO-TBM schemes are significantly slower and consume a prohibitive amount of RAM compared to circuit-based masking schemes such as bit-sliced masking, and hence efforts were targeted in this direction. However, a recent work by Valiveti and Vivek (TCHES 2021) has demonstrated that the HO-TBM scheme of Coron et al. (TCHES 2018) can feasibly be implemented on memory-constrained devices, with pre-processing capability and a competitive online execution time. Yet there are currently no customised designs for third-order TBM that are more efficient than instantiating an HO-TBM scheme at third order. In this work, we propose a third-order TBM scheme for arbitrary S-boxes that is secure in the probing model and under compositions, i.e., 3-SNI secure. It is very efficient in terms of the overall running time compared to the third-order instantiations of state-of-the-art HO-TBM schemes, and it also supports pre-processing. For example, the overall running time of a single execution of the third-order masked AES-128 on a 32-bit ARM Cortex-M4 microcontroller is reduced by about 80% without any overhead on the online execution time. The online execution time of the proposed scheme is approximately eight times faster than the bit-sliced masked implementation at third order, and it is comparable to the recent scheme of Wang et al. (TCHES 2022) that reuses shares. We also present the implementation results for the third-order masked PRESENT cipher. Our work suggests that there is significant scope for tuning the performance of HO-TBM schemes at lower orders.


Introduction
Side-channel attacks are a major security threat for cryptographic implementations [Koc96, KJJ99]. Masking is an effective countermeasure against side-channel attacks, in particular, differential power/electromagnetic attacks. The popularity of the masking countermeasure is due to its simplicity, which paves the way for formal security analysis in the probing leakage model [ISW03, Cor14, BBD+16], and its connection, rather equivalence, to the more realistic noisy leakage model [CJRR99, PR13, DDF14]. Despite the shortcomings of the probing leakage model, for instance, the independent leakage assumption, it continues to attract considerable attention from the research community.
Masking countermeasures implemented in software, particularly those based on the probing leakage model, can be categorised into two types: circuit-based masking schemes and Table-based Masking (TBM) schemes. Circuit-based masking schemes, which include the schemes proposed in [ISW03, RP10, CGP+12, RV13, CRV14, PV16, CGPZ16, GR16, GR17, GRVV17, WGY+22] and many more, are based on representing the computation, say, of a block cipher, as a Boolean or an arithmetic circuit. In contrast, the TBM schemes, which include the schemes proposed in [CJRR99, SP06, RDP08, Cor14, CRZ18, VV20, VV21], represent the non-linear computations as a lookup table. For SPN-based and Feistel-based block ciphers, the S-box is represented as a lookup table that is then masked using TBM schemes. Note, however, that the linear layers of block ciphers are still represented as a circuit, as the masking of these layers is relatively efficient.
At first and second orders, the TBM schemes, particularly [CJRR99, RDP08], are more efficient than circuit-based masking schemes [Vad17]. A particular advantage of most of the TBM schemes is that they support pre-processing. Currently, this feature is not available in most circuit-based schemes. However, a recent work by Wang et al. [WGY+22] has made significant progress in this direction. The authors claim that they can reuse all but one share across masked multiplications. Also, their approach facilitates pre-processing of the computation. In general, the pre-processing phase (a.k.a. offline phase) and the post-processing phase (a.k.a. online phase) refer to the computation that happens before and after the availability of the secret input(s), respectively. The goal of the offline/online paradigm is to achieve a faster online phase with the help of the pre-computed results from the offline phase.
In particular, pre-processing in a TBM scheme corresponds to the shifting of a temporary table by all the independently sampled shares, that is, all shares except the final one. Post-processing involves the lookup of the shifted table with the final share as the index, and its associated computations. However, at higher orders, the TBM schemes such as [Cor14] and its successor [CRZ18] need an amount of RAM that is proportional to the masking order for full pre-processing and, hence, become infeasible to implement on resource-constrained devices. One way to overcome the memory requirement of Higher-Order TBM (HO-TBM) schemes is to process the masked S-boxes sequentially, but this leads to significantly poorer performance compared to the circuit-based masking schemes, particularly bit-sliced masked implementations such as in [GR17]. A recent work [VV21] demonstrated how the RAM requirement can be made essentially constant for the HO-TBM scheme from [CRZ18]. They implemented a fully pre-processed single execution of the AES-128 block cipher with 10 shares (i.e., at ninth order) on an Arm Cortex-M4 microcontroller and showed that the online execution time is competitive with that of bit-sliced masking, though the overall execution time was still very high compared to bit-sliced masking.
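The offline/online split is easiest to see at first order. The following is a minimal functional sketch, assuming a two-share input and the classical table re-computation idea; the function names and the use of Python's `secrets` module are illustrative choices and not part of the schemes cited above. It only checks functional correctness and makes no claim about side-channel security.

```python
import secrets

# Toy 4-bit S-box for illustration (the PRESENT S-box); any 16-entry table works.
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def offline_first_order(sbox, n_bits, m_bits, x1):
    """Offline phase: build T(a) = S(a XOR x1) XOR y1.

    Only the independently sampled share x1 is needed, so this can run
    before the secret-dependent final share is available.
    """
    y1 = secrets.randbelow(1 << m_bits)          # output mask
    table = [sbox[a ^ x1] ^ y1 for a in range(1 << n_bits)]
    return table, y1

def online_first_order(table, x2):
    """Online phase: a single lookup with the final share as the index."""
    return table[x2]

if __name__ == "__main__":
    x = 0x7                                       # secret input (illustrative)
    x1 = secrets.randbelow(16)                    # random share, known offline
    table, y1 = offline_first_order(PRESENT_SBOX, 4, 4, x1)
    x2 = x ^ x1                                   # final share, known online
    y2 = online_first_order(table, x2)
    assert y1 ^ y2 == PRESENT_SBOX[x]
```

The online cost is one table lookup, which is why a fully pre-processed table-based scheme can have a short online phase even when the offline phase is expensive.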
The HO-TBM schemes such as [Cor14, CRZ18, VV21] shift a temporary table by one input share at a time and perform a mask refresh for each row of the table after every shift. More concretely, consider an $(n, m)$ S-box $S$ that needs to be looked up with a masked input $x = x_1 \oplus x_2 \oplus \cdots \oplus x_k$. A temporary table $T$ is first initialised to $(S, 0, 0, \ldots, 0)$, where all the $k$ columns except the first are set to zero. The rows of the table $T$ are shifted by each $x_i$, by making use of an auxiliary table. After each shift, every row of the table $T$ is independently refreshed and, hence, these schemes use a lot of computing time and (pseudo) randomness. While the complete lack of mask refresh can lead to security flaws as noted in [Cor14], an important consequence of these mask refreshes is that the security proofs in the probing leakage model and under compositions, i.e., $(k-1)$-SNI security proofs [BBD+16], can be written elegantly.
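The shift-then-refresh structure described above can be sketched as follows. This is a simplified functional model, assuming a basic mask refresh that re-randomises each column against the first one; the helper names are hypothetical, and the sketch is meant only to illustrate the data flow and to count the refresh calls, not to serve as a secure implementation.

```python
import secrets

# Toy 4-bit S-box for illustration (the PRESENT S-box).
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def refresh_masks(row, m_bits):
    """Re-randomise a row of k shares without changing their XOR."""
    row = list(row)
    for j in range(1, len(row)):
        r = secrets.randbelow(1 << m_bits)
        row[0] ^= r
        row[j] ^= r
    return row

def ho_tbm_lookup(sbox, n_bits, m_bits, x_shares):
    """Shift the table by x_1, ..., x_{k-1}, refreshing every row after each shift."""
    k = len(x_shares)
    size = 1 << n_bits
    refresh_calls = 0
    # T is initialised to (S, 0, ..., 0): k columns per row.
    table = [[sbox[u]] + [0] * (k - 1) for u in range(size)]
    for x_i in x_shares[:-1]:
        shifted = []                               # auxiliary table
        for u in range(size):
            shifted.append(refresh_masks(table[u ^ x_i], m_bits))
            refresh_calls += 1
        table = shifted
    # Online part: lookup with the final share, followed by one more refresh.
    y_shares = refresh_masks(table[x_shares[-1]], m_bits)
    refresh_calls += 1
    return y_shares, refresh_calls
```

With $k = 4$ shares and $n = 4$, the counter returned by this sketch equals $3 \cdot 2^4 + 1 = 49$, which is the per-S-box refresh count that a customised third-order scheme aims to reduce.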
In this work, motivated by the designs of [RDP08,VV20] that make a sparing use of mask refreshes in the second-order security context, we investigate the question of designing a customised third-order secure TBM scheme that significantly reduces the number of mask refreshes and, hence, reduces the computation time and randomness usage compared to instantiating the current HO-TBM schemes at third order.

Our Contribution:
We propose an efficient third-order secure TBM scheme for arbitrary S-boxes (see Algorithm 3). Our scheme is secure in the probing leakage model and under compositions; more specifically, it is 3-SNI secure [BBD+16]. We design the scheme in two steps. In Step 1, we propose a 3-NI randomised lookup table scheme without any explicit mask refresh. In Step 2, we refresh the output obtained from Step 1 using a 3-SNI mask refresh, the 3-RB procedure from [BDF+17, BBD+20] (see Algorithm 2). Hence, the composition of these two steps is 3-SNI secure. Our approach requires an (explicit) mask refresh only once, in the final step, which reduces the number of calls to mask refresh per $(n, m)$ S-box from $3 \cdot 2^n + 1$ in the third-order instantiation of [Cor14, CRZ18] to one. We would like to note that the speedup of the pre-processing step for our scheme comes only from our 3-NI Algorithm 1. We chose the 3-SNI refresh algorithm 3-RB from [BDF+17, BBD+20] instead of the 3-SNI full refresh mainly to make the online time competitive with the state-of-the-art table-based and circuit-based masking schemes.
Altogether, our proposed third-order randomised lookup table scheme is very efficient compared to the third-order instantiation of the most efficient (in terms of the overall computation time) HO-TBM scheme from [CRZ18]. In the experiments section (Section 3), we demonstrate that our scheme reduces the total running time of a single execution of third-order masked AES-128 by 78.87%, and also facilitates full pre-processing. The experiments were run on a 32-bit Arm Cortex-M4 microcontroller. In Table 4, we provide a detailed comparison of our work with the circuit-based implementations of [RP10, GR17, WGY+22]. The online execution time for Algorithm 3 is approximately 8 times faster than the bit-sliced masked implementation of AES-128 at third order, and it is comparable with the online execution times for [Cor14, CRZ18, WGY+22] (see Tables 3 and 4).
To further improve the overall execution time, one may consider an implementation of the proposed scheme on processors with large registers. However, it turns out that the extension of Algorithm 3 to a large register variant (LRV) suffers from a second-order attack (see Appendix D). We would like to stress that our basic scheme still beats the overall running time of the increasing-shares LRV of [CRZ18] (see the third row of Table 3) without making use of large registers. For completeness, Table 1 provides estimates for the number of bit operations, RAM and randomness usage per $(n, m)$ S-box for Algorithm 3, and the HO-TBM schemes from [CRZ18] and [VV21].
Table 1: Comparison of our proposal with the HO-TBM schemes [CRZ18, VV21] instantiated at third order to mask a single $(n, m)$ S-box. The schemes are compared in terms of RAM (in bits), true random values (in bits), and the total running time (in number of bit operations). The RAM figure includes the number of bits required for the randomised lookup table along with the auxiliary table.

[Table 1 columns: RAM, #True random, Time]
The rest of the paper is organised as follows. In Section 2, we present our third-order TBM scheme (Algorithm 3) along with its 3-SNI security proof. The experimental results are presented in Section 3, and the paper concludes with Section 4.

Proposed Third-Order TBM Scheme
This section describes our proposal for a third-order TBM scheme. As mentioned previously, the motivation for our scheme comes from the resource-efficient second-order TBM schemes of [RDP08, VV20]. Our goal is to securely compute $y = S(x)$, where the $(n, m)$ S-box $S$ is stored in the form of a lookup table, and the input $x$ is in the secret-shared form $x = x_1 \oplus x_2 \oplus x_3 \oplus x_4$. Needless to say, a naïve extension of the second-order scheme from [VV20] (a variant of [RDP08] that supports pre-processing) to the third-order scenario will be insecure, as demonstrated in Appendix A. Also, the third-order attack from [CPR07] on the HO-TBM scheme from [SP06] is now well known. To prevent these sorts of third-order attacks, we opted for the shift of the table by $x_1$ in addition to the shift by a combination of two shares as done in [RDP08, VV20]. This brings down the total number of shifts of the randomised table from 3 in [Cor14, CRZ18, VV21] to 2. More importantly, we avoid any explicit mask refresh used in the current HO-TBM schemes. But a mere construction of the randomised table in two shifts cannot make the scheme secure, since it leads to an attack as described in Appendix B. Hence, we opt to use a vector of random masks to protect the randomised table in the second shift. Moreover, the order of random masks has to be chosen with caution; otherwise, it will lead to a tuple of intermediate variables whose simulation demands the balanced S-box property (see Section 2.2 (Para 1) and Appendix C).
This approach of constructing the randomised table in two shifts with a sufficient amount of random values will only result in a 3-NI secure scheme (see Remark 1). One way to achieve a 3-SNI secure instantiation that assures composability is to refresh the outputs of the 3-NI scheme using a 3-SNI mask refresh [BBD+16, Proposition 4, Section 5]. A natural choice for the 3-SNI mask refresh from the literature is the full refresh algorithm from [BBD+16] instantiated at third order. This procedure is nothing but multiplying the secret by one using the ISW multiplication over $\mathbb{F}_{2^n}$ [ISW03]. Since the computation and randomness complexity of this algorithm is quadratic in the number of shares, there will be an additional overhead on the online execution time of the resulting scheme. To reduce the overhead associated with the SNI mask refresh, we make use of the RefreshBlock gadget by Barthe et al. [BDF+17], which was later proven to be SNI secure in [BBD+20]. We have instantiated their scheme at third order and call it 3-RB. For convenience, we slightly modified their notation and present their scheme in Algorithm 2. Concretely, 3-RB requires 4 random values and 8 XORs, whereas the 3-SNI full refresh requires 6 random values and 12 XORs.
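The randomness and XOR counts quoted above can be reproduced with a small counting sketch. It assumes that the block refresh adds one fresh random value to each pair of cyclically adjacent shares, and that the full refresh adds one fresh random value to every pair of shares (the effect of an ISW multiplication by one). Whether the loop order matches Algorithm 2 exactly is an assumption; the function names are illustrative.

```python
import secrets

def block_refresh(shares, m_bits):
    """Circular refresh: share i and share (i+1) mod k absorb the same random value."""
    out = list(shares)
    k = len(out)
    n_random = n_xor = 0
    for i in range(k):
        r = secrets.randbelow(1 << m_bits)
        n_random += 1
        out[i] ^= r
        out[(i + 1) % k] ^= r
        n_xor += 2
    return out, n_random, n_xor

def full_refresh(shares, m_bits):
    """Pairwise refresh: every unordered pair (i, j) absorbs one random value."""
    out = list(shares)
    k = len(out)
    n_random = n_xor = 0
    for i in range(k):
        for j in range(i + 1, k):
            r = secrets.randbelow(1 << m_bits)
            n_random += 1
            out[i] ^= r
            out[j] ^= r
            n_xor += 2
    return out, n_random, n_xor

def xor_all(values):
    acc = 0
    for v in values:
        acc ^= v
    return acc
```

For four shares, `block_refresh` uses 4 random values and 8 XORs, while `full_refresh` uses 6 random values and 12 XORs; both leave the XOR of the shares unchanged.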
Overall, our scheme consists of two sub-procedures. The first procedure is presented in Algorithm 1, and Algorithm 2 describes the second procedure. Algorithm 1 begins with an offline phase that deals with the computation of the randomised lookup table. This phase consists of constructing an auxiliary table $T_{aux}$ to hold the result of the shift by $x_1$, whose entries are shifted further with $x_2$ and $x_3$ in one step to build the final table $T$. The offline phase is followed by an online phase, where a lookup of the table $T$ using $x_4$ results in the secret sharing of $S(x)$. The online phase then continues in Algorithm 2, which receives the outputs of Algorithm 1 and refreshes their masks to generate the final output sharing of $S(x)$. We would like to stress that this final mask refresh is crucial to prove the 3-SNI security of our scheme. We summarise the steps of our proposed third-order lookup table scheme in Algorithm 3. We have also explicitly marked the offline and online computation phases in the description of the methods. Note that for table indexing and access we use the parenthesis notation (like $T(a)$), and for access to the vector $Y$ we use the bracket notation (like $Y[i]$). Table 1 provides estimates for the number of bit operations, RAM and randomness usage per $(n, m)$ S-box for Algorithm 3.
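The data flow of the two-shift construction can be illustrated with a purely functional sketch. It is pieced together from the description in this section ($T_{aux}(a) = S(x_1 \oplus \upsilon \oplus a) \oplus y_1$, a second shift by $d = (x_2 \oplus \upsilon) \oplus x_3$, a vector $Y$ of random masks, and $y_3 = Y[x_4]$). The order in which the masks are applied inside the loop is an assumption, and that order matters for security, so this sketch should be read as a correctness check only, not as Algorithm 1 itself.

```python
import secrets

# Toy 4-bit S-box for illustration (the PRESENT S-box).
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def offline_phase(sbox, n_bits, m_bits, x1, x2, x3):
    """Build the randomised table T from the first three input shares."""
    size = 1 << n_bits
    v = secrets.randbelow(size)                    # upsilon
    y1 = secrets.randbelow(1 << m_bits)
    y2 = secrets.randbelow(1 << m_bits)
    Y = [secrets.randbelow(1 << m_bits) for _ in range(size)]
    # First shift: by x1 (blinded with upsilon), protected by y1.
    t_aux = [sbox[x1 ^ v ^ a] ^ y1 for a in range(size)]
    # Second shift: by the combination d of x2 and x3.
    d = (x2 ^ v) ^ x3
    table = [0] * size
    for a in range(size):
        b = a ^ d
        # Assumed mask order; the secure order is fixed by Algorithm 1.
        table[b] = t_aux[a] ^ Y[b] ^ y2
    return table, Y, y1, y2

def online_phase(table, Y, y1, y2, x4):
    """Lookup with the final share; returns the four output shares."""
    return [y1, y2, Y[x4], table[x4]]
```

In the complete scheme, the four shares returned by `online_phase` would still be passed through the 3-SNI mask refresh before being used by other gadgets.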
Remark 1. The scheme presented in Algorithm 1 cannot be proved 3-SNI for the following reason. Suppose that the probed triplet is $\big(x_2 \oplus \upsilon \oplus x_3,\; S(x_1 \oplus \upsilon \oplus a) \oplus y_1,\; y_1\big)$. The simulation of this triplet requires the knowledge of three input shares. However, as per the definition of SNI, we are allowed to use only two input shares, since there are only two intermediate variables in this triplet and the other probed variable is an output share.

Correctness
The following equations provide the proof of correctness of Algorithm 1.
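One way to write out the correctness argument is sketched below. It is a reconstruction from the description of Algorithm 1 in this section, assuming that the table entry at index $a \oplus d$ is obtained from $T_{aux}(a)$ by adding $y_2$ and $Y[a \oplus d]$; the order in which these two masks are added does not affect the final equality.

```latex
\begin{align*}
T_{aux}(a) &= S(x_1 \oplus \upsilon \oplus a) \oplus y_1, \\
d &= (x_2 \oplus \upsilon) \oplus x_3, \\
T(a \oplus d) &= T_{aux}(a) \oplus y_2 \oplus Y[a \oplus d], \\
y_3 &= Y[x_4], \\
y_4 = T(x_4) &= T_{aux}(x_4 \oplus d) \oplus y_2 \oplus Y[x_4] \\
  &= S(x_1 \oplus \upsilon \oplus x_4 \oplus x_2 \oplus \upsilon \oplus x_3) \oplus y_1 \oplus y_2 \oplus y_3 \\
  &= S(x) \oplus y_1 \oplus y_2 \oplus y_3 .
\end{align*}
```

Hence $y_1 \oplus y_2 \oplus y_3 \oplus y_4 = S(x)$, which agrees with the output sharing used in the proof of Lemma 1.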
Algorithm 3: Third-order secure masked S-box computation with pre-processing
Input: shares $x_1, x_2, x_3, x_4$ of $x$
1. Process the input shares of $x$ using Algorithm 1 to obtain the shares of $S(x)$
2. Refresh the shares of $S(x)$ using Algorithm 2

Security Proof
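Before proceeding with the security proof, we first recollect the $t$-NI and $t$-SNI security notions from [BBD+16].

Definition 1. $t$-Non-Interference ($t$-NI) [BBD+16]. Let $G$ be a gadget that takes $(x_1, x_2, \ldots, x_k)$ as the input shares of $x$, and let the output shares be $(y_1, y_2, \ldots, y_k)$. Let the adversary observe $t_I$ many input and intermediate shares, and $t_O$ many output shares such that $t_I + t_O \le t$. Then, $G$ is said to be $t$-NI secure if the set of $t$ observations can be perfectly simulated using $t_I + t_O$ many input shares of $x$.

Definition 2. $t$-Strong Non-Interference ($t$-SNI) [BBD+16]. Let $G$ be a gadget that takes secret input shares $(x_1, x_2, \ldots, x_k)$, and let the output shares be $(y_1, y_2, \ldots, y_k)$. Let the adversary observe $t_I$ many input and intermediate shares, and $t_O$ many output shares such that $t_I + t_O \le t$. Then, $G$ is said to be $t$-SNI secure if the set of $t$ observations can be simulated using only $t_I$ input shares of $x$, where $t_I \le t$.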
From the given definitions, it can be observed that the bound on the number of input shares used for the simulation depends only on the observations of the input/intermediate variables in the SNI notion, whereas it depends on the total number of observations in the NI notion. We first prove Lemma 1, showing that Algorithm 1 (the table shifts followed by the lookup using the final share $x_4$) is 3-NI. The output shares from Algorithm 1 are refreshed using 3-RB (Algorithm 2). Finally, we conclude by showing that the composed construction presented in Algorithm 3 is 3-SNI. We would like to stress that, unlike the second-order scheme of [RDP08], our scheme does not require the S-box balancedness property for the simulation. To recall, an $(n, m)$ S-box $S$, $m \le n$, is balanced if every output word is the image under $S$ of exactly $2^{n-m}$ input words.
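The balancedness condition recalled above is straightforward to test for a concrete table. The following sketch is a direct transcription of the definition; the function name and the example tables are illustrative.

```python
from collections import Counter

def is_balanced(sbox, n_bits, m_bits):
    """Return True if every m-bit output has exactly 2^(n-m) preimages under sbox."""
    if m_bits > n_bits or len(sbox) != (1 << n_bits):
        return False
    counts = Counter(sbox)
    expected = 1 << (n_bits - m_bits)
    return all(counts.get(y, 0) == expected for y in range(1 << m_bits))

# A bijective (4, 4) table (the PRESENT S-box) is balanced.
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

# A (4, 2) table that keeps the two low bits of the input is balanced as well.
LOW_BITS = [a & 0x3 for a in range(16)]

# A (2, 1) table with three zeros and a single one is not balanced.
UNBALANCED = [0, 0, 0, 1]
```

Any bijective S-box is balanced, so the condition is restrictive only for compressing or otherwise non-uniform tables.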
We now proceed with the security proof of Algorithm 1. The intuition behind the security proof presented in Lemma 1 is as follows. Since the offline phase in this algorithm uses three input shares $x_1, x_2, x_3$ to build the randomised lookup table, it is trivial to see that the simulation of any 3-tuple of variables from the offline phase is possible using at most three input shares. This leaves us with the online phase, where the final table lookup with $x_4$ outputs $y_4 = T(x_4)$. Since $x_4$ and $y_4$ are not combined with any other variables of Algorithm 1, the adversary has to probe either $x_4$ or $y_4$ individually to obtain $x$. With the remaining two probes, he can observe at most two input/intermediate variables. Hence, the task is to prove that the simulation of the observed variables from the offline and the online phases together depends on at most three input shares. For ease of reference, we list all the input, output, and intermediate variables of Algorithm 1 in Table 2.
Proof.The gadget here is the S-box S that takes as input x in the form of four input shares x 1 , x 2 , x 3 , x 4 = x ⊕ x 1 ⊕ x 2 ⊕ x 3 , and outputs shares y 1 , y 2 , y 3 , y 4 = S(x) ⊕ y 1 ⊕ y 2 ⊕ y 3 .
To demonstrate that the construction is 3-NI, we need to prove that the simulation of any three variables of Algorithm 1 (as listed in Table 2) requires the knowledge of a maximum of three input shares.Let I be the set of probes the adversary chooses to observe in the gadget.We construct an index set J that holds the set of input share indices required for simulating the observed probes from I. The goal is to show that |J| ≤ 3.
2. Probing x i or y i (except y 3 ) results in J = J ∪ {i}.The output variable y 3 is an exception because it depends on x 4 as y 3 = Y [x 4 ].Hence, update the index set as J = J ∪ {4} when y 3 is probed.This covers the inputs and outputs of the gadget.
4. When a pair of variables from the subset {I 4 , . . ., I 11 } are probed, then update J = J ∪ {2, 3}.Note that the above pair of variables have the random variable υ in common.Even though I 6 is a random mask chosen independent of the secret, the index of this mask still depends on υ.Note that this case covers probing the same variable at distinct values of index a, 0 ≤ a < 2 n .
It can be observed from the above index set construction that we add at most one input share index per probed variable, thus |J| ≤ 3. Note that the indices {2, 3} are added only by probing a pair of variables.Now we are going to discuss the simulation of the set of probed variables I using the input shares x |J .
1. It is trivial to simulate any of the probed output share(s) $y_1$, $y_2$, or $y_3$ by assigning them uniform and independent random value(s), since the same would have happened in the actual implementation. If the final output share $y_4$ is probed, we can still assign a uniform random value since there always exists an unprobed output mask $y_i$, $i \neq 4$, that randomises $y_4$.
4. The simulation of the intermediates $I_6$ or $I_7$ is as follows: if no other variable involving $\upsilon$, i.e., from the subset $\{I_4, \ldots, I_{11}\}$, is probed, then assign a random value. Otherwise, compute the probed variable using $x_2$ and $x_3$.

Any probed random variable like
5. Similarly, depending on whether the output mask y 1 is probed or not, simulate I 9 either with a randomly chosen value or calculate the observed value using x 1 .
6. The simulation of the variables $I_{10}$ or $I_{11}$, which appear in the construction of $T$ and involve more than one input share, is as follows:
(d) The complex case in this setting is probing $(y_3, y_4)$ along with either $I_{10}$ or $I_{11}$. Assigning random values to the output shares $y_3$ and $y_4$ fixes the value of $y_1 \oplus y_2$, since $y_1 \oplus y_2 = S(x) \oplus y_3 \oplus y_4$. So, we cannot use the fact that both $y_1$ and $y_2$ are unprobed. This is where we carefully design the scheme such that the index of $Y$ (in $I_{10}$ or $I_{11}$) is randomised with the help of $b$. So, sample $b$ at random, thanks to the unprobed $\upsilon$. We would have added 4 to $J$ due to the probed $y_4$. If the sampled $b$ equals $x_4$, assign $I_{11} = y_4$. Similarly, for the case of $I_{10}$ being probed, we can assign $I_{10} = y_4 \oplus y_3$. If the index $b \neq x_4$, the table entry can be assigned a uniform random value due to the unprobed $Y[b]$. In [RDP08], this case calls for the S-box to be balanced. But we do not require the balanced S-box property for the simulation in our scheme (see Remark 2).
Thus, we can conclude that any triple of probed variables can be simulated with the knowledge of at most three input shares.
Remark 2. As explained in the security proof of the second-order scheme with pre-processing from [VV20, Theorem 1], simulating the pair y 3 = S(x) ⊕ y 1 ⊕ y 2 , S(υ ⊕ a) ⊕ y 1 ⊕ y 2 in their 2-SNI security proof requires the S-box to be balanced.This is because the output masks y 1 and y 2 are reused for the entire table and there is no additional random mask left.So, the variable S(υ ⊕ a) can only be assigned a random value provided the S-box, S is balanced.But, in our case, thanks to the vector of randomness Y , the tuple can be simulated in Step 2(d) for any S-box.
Proof.Because Algorithm 1 is 3-NI and Algorithm 2 is 3-SNI, their composition is known to be 3-SNI [BBD + 16, Proposition 4, Section 5].Since the detailed proof is not available in [BBD + 16], for the sake of completeness we are giving a formal proof below.
Let $G_1$ and $G_2$ be the gadgets corresponding to Algorithm 1 and Algorithm 2, respectively. To prove that Algorithm 3 is 3-SNI, we need to show that the gadget $G$ obtained by the composition of the sub-gadgets $G_1$ and $G_2$ is indeed 3-SNI. The graphical representation of the composition of gadgets is presented in Figure 1. Let the number of input/intermediate and output probes on the gadget $G$ be $t_I$ and $t_O$, respectively, such that $t_I + t_O \le 3$. Let $t_{I_1}$ and $t_{O_1}$ be the number of probes on input/intermediate and output variables of the sub-gadget $G_1$, respectively. Similarly, let $t_{I_2}$ and $t_{O_2}$ be the number of probes on the gadget $G_2$, such that $t_I = t_{I_1} + t_{O_1} + t_{I_2}$ and $t_O = t_{O_2}$. Let $J$, $J_1$ and $J_2$ be the sets of input shares required for the simulation of the gadgets $G$, $G_1$ and $G_2$, respectively. Since the gadget $G_1$ is 3-NI (from Lemma 1), $|J_1| \le t_{I_1} + t_{O_1} + |J_2|$, because the shares in $J_2$ are outputs of $G_1$ that must be simulated along with the probes on $G_1$, whereas for the 3-SNI gadget $G_2$, we have $|J_2| \le t_{I_2}$. Note that simulating the observations in the gadget $G$ is nothing but simulating the observations in the sub-gadgets $G_1$ and $G_2$, so $|J| = |J_1| \le t_{I_1} + t_{O_1} + t_{I_2} = t_I$. This shows that the set of input shares required for the simulation of the gadget $G$ is bounded by $t_I$, i.e., the number of probes on input/intermediate shares. Hence, we can conclude that the gadget $G$ is 3-SNI.

Implementation
This section presents the implementation details of our scheme presented in Algorithm 3. We provide a detailed comparison of our work with the state-of-the-art masking schemes instantiated at third order.The schemes are compared in terms of the overall RAM memory, computation time, and the number of TRNG calls.Our approach achieves an improved overall execution time compared to the higher-order scheme of [CRZ18] (at third order) while maintaining the online time.The latter is achieved by making use of the full pre-processing advantage of lookup table-based masking schemes.We ran our third-order scheme for the block ciphers AES-128 and PRESENT (80-bit key variant).The source code for our third-order scheme implementation is available at [AV].
The target embedded device is NXP's FRDM-K64F, which has 256 KB of RAM, 1 MB of flash memory, and a processor clock speed of 120 MHz. The device comes with an in-built hardware random-number generator clocked at 48 MHz. The RNGA module requires approximately 300 clock cycles to generate a 32-bit random word. We compile our implementations using the -O1 flag. Even though this setting increases the overall clock cycle count compared to higher optimisation levels, further compiler optimisations may impact the side-channel security of the implementation, as reported in [BWG+22]. Our implementation includes the third-order masked full block cipher implementations of AES-128 [FIP01] and 80-bit key PRESENT [BKL+07]. While implementing the block cipher AES-128, we have used the publicly available code from [Cor] and [VV]. For the PRESENT cipher, we referred to the unmasked implementation from the public repository [Klo]. The code size is 26.5 KB for the masked implementation of AES-128 using Algorithm 3, whereas for the full cipher masked implementation of PRESENT, the code size is 25.8 KB. We would like to stress that our implementation code is only for the purpose of benchmarking, and it requires additional hardening countermeasures to resist real-world side-channel attacks [BWG+22].
Since the computation in our schemes is divided into two phases, offline (pre-processing) and online (post-processing), the total number of clock cycles for the execution is the sum of the offline and online computations. By offline computation, we mean the total number of clock cycles used for the processing that is independent of the input secret. By online computation, we refer to the computation that is performed after the availability of the secret input. The total RAM includes the amount of space required for the pre-processed randomised lookup tables and the associated variables. This amounts to the pre-processing of the lookup tables of 160 (10 rounds × 16) and 496 (31 rounds × 16) S-box invocations for the AES-128 and PRESENT ciphers, respectively. The true random values and the input seed to PRGs are generated using the in-built RNGA module.
Table 3 presents the comparison of our implementation results for the 3-SNI secure implementation of AES-128 with other 3-SNI lookup table-based schemes [CRZ18, VV21]. To maximise the speed of computation, multiple entries of the randomised table can be packed and processed in parallel using the large register variant (LRV) approach presented in [CRZ18]. It can be observed from Table 3 that the RAM usage of Algorithm 3 is lower by approximately a factor of two when compared to the increasing-shares variant of [CRZ18], whereas there is a 78.87% reduction in the overall running time of Algorithm 3 when compared to the LRV with the increasing-shares approach of [CRZ18]. One may be tempted to adopt a similar LRV strategy for Algorithm 3 to improve the overall running time of our scheme. However, it turns out that packing the entries using the LRV approach is insecure for our generic scheme; the details can be found in Appendix D.
We present the implementation results for the circuit-based schemes [RP10, GR17, WGY+22] in Table 4. A recent work of Wang et al. [WGY+22] demonstrated a scheme that facilitates reusing $t = k - 1$ shares across the ISW multiplication gadgets. Similar to the strategy followed in the recent LUT-based schemes [VV20, VV21], the authors of [WGY+22] have divided the computation into offline and online phases. They have provided their source code, optimised in C and assembly, for a single round of masked AES-128 [Wan]. Also, their implementation pre-processes the linear layers. Since their code is not a full AES-128 cipher implementation, we could only estimate the clock cycles needed for a single complete execution of AES-128 instantiated at third order. Our online execution time (without the pre-processing of the linear layers) is comparable to that of Wang et al.'s scheme [WGY+22] with full pre-processing (including the linear layers) (see Row 1 of Table 3 and Row 3 of Table 4). The offline phase of [WGY+22] is much faster than ours, and it also uses less randomness. We think that it would be an interesting research direction to explore whether the share reuse technique of [WGY+22] can be adapted to the table-based masking schemes, in particular, to our scheme.
In Table 4, we compare the online execution time of our scheme with that of bitslicing and [RP10]. It can be observed that our scheme (Algorithm 3) requires approximately eight times fewer clock cycles than the 3-SNI instantiation of the masked AES-128 implementation using bitslicing (optimised with 32-bit ISW-AND followed by mask refresh using full refresh [DDF14]), whereas the online time of Rivain and Prouff's third-order instantiation [RP10] is 33.5 times slower than our proposed third-order scheme. We also implemented the lightweight cipher PRESENT using the proposed scheme. The interesting observation here is that even though the overall execution time of the third-order masked PRESENT cipher is less than that of AES-128, its online time is higher than that of AES-128.

Conclusion
In this paper, we proposed a third-order secure table-based masking scheme that significantly outperforms the state-of-the-art table-based masked software implementations in terms of time, RAM, and randomness usage. A future research direction on this topic is to design table-based schemes at fourth and higher orders that are significantly more efficient than the fourth-order and slightly higher-order instantiations of current higher-order table-based masking schemes. While nearly eliminating the mask refresh in our proposals leads to efficient schemes, the proofs become more involved. As we saw, there is a fine line separating security and efficiency, and one needs to be vigilant about it. The lack of formal verification tools that can directly verify the security of table-based masking schemes makes the effort to design new schemes more involved. Developing such verification tools would be a beneficial contribution.

B Constructing the Randomised Table in Two Shifts
To avoid the straightforward attack discussed in Appendix A, one can construct the randomised table $T$ in two steps. In Step 1, construct $T_{aux}$ by shifting the table by $x_1$ and protecting the shift with the output mask $y_1$. In Step 2, shift $T_{aux}$ by $(x_2 \oplus \upsilon) \oplus x_3$, followed by masking with $y_2$ and $y_3$. The steps are summarised in Algorithm 5.

C Scheme which works only for Balanced S-boxes
One natural way to thwart the attack described in Appendix B is to use a vector of random masks to protect $T$, since repeating the same set of output masks across the table results in an insecure scheme. This idea is presented in Algorithm 6.
The order of the masks has to be carefully chosen, since a poor choice may lead to a simulation that requires the balanced S-box property. Consider the tuple $\big(y_3,\; y_4,\; (T_{aux}(a) \oplus y_2)|_{a=c}\big)$, where $c$ is a constant. Then, for $c = 0$, $y_3 \oplus y_4 \oplus (T_{aux}(a) \oplus y_2)|_{a=0} = y_3 \oplus \big(S(x) \oplus y_1 \oplus y_2 \oplus y_3\big) \oplus \big(S(x_1 \oplus \upsilon) \oplus y_1 \oplus y_2\big) = S(x) \oplus S(x_1 \oplus \upsilon)$.
One can simulate the above expression with a random value only when the S-box is assumed to be balanced. Otherwise, the simulation fails.
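The role of balancedness in this expression can be checked numerically. The sketch below tabulates the distribution of $S(x) \oplus S(u)$ for a uniformly distributed $u$ (standing in for $x_1 \oplus \upsilon$ with $\upsilon$ unprobed) and a fixed secret $x$. The function name and the small unbalanced table are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def xor_distribution(sbox, x):
    """Counts of S(x) XOR S(u) over all equally likely values of u."""
    return Counter(sbox[x] ^ sbox[u] for u in range(len(sbox)))

# Balanced example: the bijective 4-bit PRESENT S-box.
PRESENT_SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

# Unbalanced example: a (2, 1) table with three zeros and a single one.
UNBALANCED = [0, 0, 0, 1]

if __name__ == "__main__":
    # For the balanced table the distribution is uniform for every x.
    print(xor_distribution(PRESENT_SBOX, 0x3))
    # For the unbalanced table the distribution changes with S(x).
    print(xor_distribution(UNBALANCED, 0), xor_distribution(UNBALANCED, 3))
```

For the balanced table the distribution does not depend on $x$, so a simulator can output a uniform random value. For the unbalanced table the distribution shifts with $S(x)$, so no simulator that ignores $x$ can reproduce it.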
$I_4$ and $I_8$ can be simulated with a random value, as this would have happened in the actual implementation. Recollect that $I_8 = Y[a \oplus d]$, where $d = (x_2 \oplus \upsilon) \oplus x_3$ and $a$ is the loop counter. But the case of the variable $Y[b]$ at $a = c$ requires careful attention when probed along with $y_3$, since $y_3$ and $Y[b]$ may be the same or distinct. The simulation in this case uses the input share $x_4$. Recall that probing $y_3$ resulted in $J = J \cup \{4\}$. If $\upsilon$ remains unprobed, then sample a random value for $d$ and compute $b = c \oplus d$; otherwise, compute $d$ using the input shares $x_2$ and $x_3$. To simulate the probed variable, compare the value of $b$ with $x_4$. If $b = x_4$, then generate a random value and return the same value for both $y_3$ and $Y[b]$; otherwise, return two independent uniform random values.
3. Needless to say, any intermediate variable depending on at most one input share (including constants, sampled randomness, and input shares) is straightforward to simulate. The intermediate variables $\{I_1, \ldots, I_5\}$ fall under this category, which also covers the variables in the construction of $T_{aux}$.

(a) At least one unprobed random mask: if either the output share $y_1$ or the random mask $Y[d]$ remains unprobed, the simulation does not require the knowledge of any input share, since the unprobed value acts as a one-time pad.
(b) Otherwise, this would belong to Case 4 described above. Then we would have $1 \in J$ due to $y_1$ and $2, 3 \in J$ due to probing the pair ($I_{10}$ (or $I_{11}$), $Y[d]$). So, compute the observed variables using the input shares $x_1, x_2, x_3$.
(c) Even the pair ($I_{11}$ at $a = c_1$, $I_{11}$ at $a = c_2$), with $c_1, c_2$ being constants (or $I_{10}$ at two distinct indices), can be assigned values, thanks to the vector of random values $Y$.

Figure 1: Our scheme resulting from the composition of gadgets 3-NI G 1 and 3-SNI G 2 .
[BBD+16]. Let $G$ be a gadget that takes $(x_1, x_2, \ldots, x_k)$ as the input shares of $x$. Let the adversary observe $t_I$ many input and intermediate shares, and $t_O$ output shares such that $t_I + t_O \le t$. Then, $G$ is said to be $t$-NI secure if the set of $t$ observations can be perfectly simulated using $t_I + t_O$ many input shares of $x$.

Table 3: Comparison of third-order masked implementation of AES-128 using our scheme (Algorithm 3) vs. [CRZ18]. Also, the results for third-order masked implementation of PRESENT using Algorithm 3. Total memory and true random values are in KB and the clock cycles are represented in millions (M).

Table 4: Comparison of third-order circuit-based masked implementations of AES-128. Total memory and true random values are in KB and the clock cycles are represented in millions (M).