Bitslice Masking and Improved Shuffling: How and When to Mix Them in Software?

Abstract. We revisit the popular adage that side-channel countermeasures must be combined to be efficient, and study its application to bitslice masking and shuffling. Our main contributions are twofold. First, we improve this combination: by shuffling the shares of a masked implementation rather than its tuples, we can amplify the impact of the shuffling exponentially in the number of shares, while this impact was independent of the masking security order in previous works. Second, we evaluate the masking and shuffling combination's performance vs. security tradeoff under sufficient noise conditions: we show that the best approach is to mask first (i.e., fill the registers with as many shares as possible) and shuffle the independent operations that remain. We conclude that with moderate but sufficient noise, the "bitslice masking + shuffling" combination of countermeasures is practically relevant, and its interest increases when randomness is expensive and many independent operations are available for shuffling. When these conditions are not met, masking only is the best option. As additional side results, we improve the best known attack against the shuffling countermeasure from ASIACRYPT 2012. We also recall that algorithmic countermeasures like masking and shuffling, and therefore their combination, cannot be implemented securely without a minimum level of physical noise.


Introduction
Ever since the introduction of Differential Power Analysis (DPA) by Kocher et al. [KJJ99], the idea that side-channel countermeasures must be combined to be effective has become a mantra. For example, one general conclusion of the DPA book is that "implementing a combination of several cheap countermeasures typically leads to much better protection than one expensive countermeasure" [MOP07]. By "much better protection", one implicitly means that the complexity of an attack against a combination of countermeasures should be higher than the sum of the complexities to attack each countermeasure separately.
Following this intuition, any pair of countermeasures could potentially be combined, raising the question of whether they indeed lead to concrete benefits in terms of security vs. performance tradeoff. In this respect, it was also put forward that the most promising countermeasures to combine are the ones that provide complementary improvements. One popular example of complementary countermeasures is hiding and masking [MOP07].
Hiding aims at reducing the attack's Signal-to-Noise Ratio (SNR) [Man04], by increasing the measurement noise or reducing the side-channel signal, as for example happens when shuffling an implementation [HOM06]. Masking aims at reducing the data dependencies of the leakages by randomizing the intermediate computations of a cryptographic implementation thanks to secret sharing, so that side-channel analysis becomes hard [CJRR99]. The statistical confirmation that this combination is sound was then put forward in an important paper by Rivain et al. [RPD09]. Assuming a shuffled execution of $\eta$ independent operations (e.g., S-boxes), each of them masked with $d$ shares, they showed that for a sufficient noise variance $\sigma^2$ in the leakages, the complexity to attack the corresponding implementation grows in $O(\eta \cdot (\sigma^2)^d)$. Based on this state-of-the-art, the main contribution of this paper is to improve the security, the applicability, and the evaluation of this important combination of countermeasures, by following two complementary tracks:

1. In terms of security under a sufficient noise regime, the proposal of Rivain et al. increases the complexity of attacks against masked implementations by a factor $\eta$. Given that shuffling can be viewed as noise emulation and masking as noise amplification, a natural question is whether one could amplify the noise emulated by the shuffled operations thanks to masking and improve this gain to a factor $\eta^d$.
We answer the question positively and describe an efficient solution for this purpose, with a systematic investigation of the options to combine masking and shuffling. It is obtained by shuffling noisy shares rather than noisy tuples of shares.
2. In terms of applicability, several recent advances in the analysis and design of software masking exploit the bitslicing concept [GR17, BGR18, BDM+20]. The latter has not been studied by Rivain et al. and raises new design questions. Namely, say you want to securely implement 128 independent AND gates in a 32-bit (ARM Cortex) device. What is the best combination of bitslice masking and shuffling given that both of them "utilize" the independent operations differently? We show that the good strategy is to mask first and shuffle next. In other words, fill the 32-bit bus with shares and shuffle what remains (i.e., four 32-bit operations in this example).
We combine these results with a technical/consolidating contribution. Namely, we describe an improvement of the multivariate attacks against shuffled implementations proposed in [VMKS12] that we use in our worst-case security evaluations and comparisons. Finally, we note that the security evaluations of Rivain et al. rely on simple (correlation-based) side-channel attacks [PRB09]. This led them to the conclusion that shuffling remains useful even in low-noise regimes. Yet, recent results showed that such evaluations can significantly overestimate the worst-case security level of an implementation [BCS21, LZC+21]. We confirm experimentally that such overestimations are also observed when combining (e.g., the masking and shuffling) countermeasures.

Cautionary notes.
As for any side-channel countermeasure, the security gains that can be expected when combining masking and shuffling are not unconditional. So the question we tackle is not "should I combine masking and shuffling" but "when should I do it". We therefore propose a systematic evaluation that includes all the parameters influencing its answer (e.g., the physical noise level, but also the cost of the randomness and the amount of parallelism that can be leveraged). It allows our conclusions to apply to a wide range of algorithms and to identify the ones for which the "masking + shuffling" combination will be the most interesting. Our analyses are also based on a bitslice parameter which, when set to its minimum value, models a non-bitslice implementation. It allows us to observe that exploiting bitslicing is always beneficial, justifying the bitslice focus of our title.
Related works. Somewhat surprisingly, and to the best of our knowledge, there are not many papers focused on combining countermeasures. As far as the masking + shuffling combination is concerned, the work of Bruneau et al. studies the (different) context of masking based on shuffled table recomputations [BGNT18], which builds on the observation that recomputed tables are attractive targets for side-channel analysis [TWO13]. The work of Patranabis et al. implements a (hardware) combination of masking and shuffling in a Rivain et al. fashion (so their shuffling is not amplified by masking as we propose) [PRC+19]. More specifically related to shuffling, another work of Patranabis et al. shows the interest of shuffling over larger sets of operations [PRV+16], which is in line with our conclusions regarding when shuffling gains interest. More general studies like [Man04, GM11] evaluate other (hardware) combinations of countermeasures. Finally, a recent paper by Coron et al. initiates a theoretical analysis of shuffling in the probing model [CS21]. Their results are asymptotic but provide an alternative view on the shuffling countermeasure and raise the question of whether our information theoretic evaluations and the more optimistic conclusions they lead to could be further formalized.

Background
In this section, we introduce the notations used in the paper, the information theoretic tools needed for our evaluations and the two side-channel countermeasures we investigate, together with a discussion of their quantitative impact on the leakages.

Notations
Random variables are denoted with capital letters $Y$ and their realizations with lower cases $y$. We use the notation $y$ for the vector of inputs that are sent to the independent (possibly shuffled) operations. Concretely, $y$ is a vector of size $\eta$ and each element of the vector is denoted as $y_i$. When additionally masking the implementation, we use the notation $y^j$ for the $j$-th share of the vector $y$, such that the element-wise addition $\sum_{j=0}^{d-1} y^j = y$. Operations on vectors are always element-wise. We further denote $y_i^j$ as the $j$-th share of element $i$ such that $\sum_{j=0}^{d-1} y_i^j = y_i$. Without loss of generality, we assume that the leakage of a vector is a vector and the leakage of a variable is a variable. The leakage is always denoted with $l$ with the target random variable in superscript. For example:
• $l^{\theta}$ is the leakage (vector) on the full permutation $\theta$ used for shuffling.
• $l^{\theta_c}$ is the leakage on the permutation at index $c$.
• $l^{y_i^j}$ is the leakage on the $j$-th share of the $i$-th element of $y$.
• $l^{y^j_{\theta_c}}$ is the leakage obtained when accessing the $j$-th share of the element indexed at cycle $c$ when shuffling with the permutation $\theta$.

Information theoretic metrics
Next, we introduce Information Theoretic (IT) metrics used to evaluate the effectiveness of side-channel countermeasures (and, as we will see, of adversarial strategies as well). The rationale behind this choice is that the number of traces $N$ required to perform a (worst-case) statistical attack against a leaking implementation is inversely proportional to the Mutual Information $\mathrm{MI}(Y;L)$ between the secret vector $Y$ and the leakage $L$ [DFS15]:
$$N \geq \frac{\mathrm{cst}}{\mathrm{MI}(Y;L)}. \quad (1)$$
Since the cardinality of $Y$ is generally too large to be exhausted, $\mathrm{MI}(Y;L)$ is usually computed for each element of the vector independently, in a divide-and-conquer manner.
In this case, one focuses on the complexity to recover one such element, which is worth:
$$N \geq \frac{\mathrm{cst}}{\mathrm{MI}(Y_i;L)}. \quad (2)$$
The relation is given for a small constant cst (that depends on the entropy of $Y_i$, denoted $H(Y_i)$, and the target success rate of the attack [dCGRP19]). In the context of attacks against implementations where the $Y_i$ variables and their leakages are independent, the divide-and-conquer approach does not imply information losses [GS18]. As will be discussed next, the situation is different when shuffling, since it creates dependencies between the leakages of different $Y_i$ variables.

In general, computing $\mathrm{MI}(Y;L)$ requires the knowledge of the (true) Probability Density Function (PDF) of the leakage conditioned on $y$, that we denote as $f(L = l \mid Y = y)$. Such knowledge is only available in simulated evaluation contexts, where the leakage function is defined by the evaluator. In this paper, and unless mentioned otherwise, we will assume that the leakages follow a (possibly multivariate) Gaussian distribution with noise covariance $\Sigma^2$, so that:
$$f(L = l \mid Y = y) = \mathcal{N}(l \mid m_y, \Sigma^2), \quad (3)$$
where $m_y$ is the (noise-free) mean leakage associated with $y$. Thanks to this PDF, the conditional probability of a sensitive variable $Y$ given the leakage, denoted as $\Pr[Y = y \mid L = l] := p(y \mid l)$, can be computed via Bayes. Assuming that $Y$ is uniformly distributed (which is the case for the cryptographic secrets), it is expressed as:
$$p(y \mid l) = \frac{f(l \mid y)}{\sum_{y^\star} f(l \mid y^\star)}. \quad (4)$$
The MI can then be estimated by sampling, as described in [BHM+19], Equation 6:
$$\widehat{\mathrm{MI}}(Y;L) = H(Y) + \sum_{y} \Pr[y] \cdot \frac{1}{n_t(y)} \sum_{i=1}^{n_t(y)} \log_2 p(y \mid l_y^{(i)}), \quad (5)$$
where $n_t(y)$ is the number of leakage samples $l_y^{(i)}$ against which the conditional probability distribution is "tested" in the estimation of the metric (i.e., the larger $n_t(y)$ is, the more accurate the sampled estimate of $\widehat{\mathrm{MI}}(Y;L)$). When moving from simulated evaluations to concrete evaluations of actual leakages, the true leakage distribution is generally unknown. The best option for an adversary is to approximate the leakage distribution with a model, that we next denote as $\hat{m}(\cdot \mid \cdot)$. This model is typically obtained by profiling the target device [CRR02].
The amount of information that can be extracted thanks to this model is captured by the so-called Perceived Information (PI). As the MI, the PI can be estimated by sampling, by replacing $p(\cdot \mid \cdot)$ with $\hat{m}(\cdot \mid \cdot)$ in Equation 5 [BHM+19], leading to:
$$\widehat{\mathrm{PI}}(Y;L) = H(Y) + \sum_{y} \Pr[y] \cdot \frac{1}{n_t(y)} \sum_{i=1}^{n_t(y)} \log_2 \hat{m}(y \mid l_y^{(i)}). \quad (6)$$
The PI is smaller than the MI unless a perfect model is used by the adversary. This metric can be used to capture estimation and assumption errors caused by imperfect leakage profiling. In the following, we additionally use it to capture suboptimal adversarial strategies designed to be computationally efficient against shuffled implementations.
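To make the sampling recipe of Equations 5 and 6 concrete, the following sketch estimates the MI for a small illustrative target. It is not the paper's evaluation code: the 4-bit variable, the Hamming-weight leakage function and the noise level are placeholder assumptions chosen only so the example runs quickly.

```python
# Sketch (illustrative, not the paper's code): sampled MI estimation for a
# 4-bit target with Hamming-weight + Gaussian leakage, following Equation 5.
import math
import random

BITS = 4
VALS = range(2 ** BITS)
SIGMA = 1.0  # noise standard deviation (sigma^2 is the variance)

def hw(x):
    return bin(x).count("1")

def f(l, y, sigma=SIGMA):
    """Gaussian leakage PDF f(L=l | Y=y) with mean HW(y) (Equation 3)."""
    return math.exp(-((l - hw(y)) ** 2) / (2 * sigma ** 2))

def bayes(l, y):
    """p(y|l) for uniform Y (Equation 4); normalization constants cancel."""
    return f(l, y) / sum(f(l, y_star) for y_star in VALS)

def sampled_mi(n_t=2000, seed=0):
    """MI-hat(Y;L) = H(Y) + average of log2 p(y|l) over samples (Equation 5)."""
    rng = random.Random(seed)
    acc = 0.0
    for y in VALS:                       # uniform prior: Pr[y] = 1/|VALS|
        for _ in range(n_t):
            l = hw(y) + rng.gauss(0, SIGMA)
            acc += math.log2(bayes(l, y))
    return BITS + acc / (len(VALS) * n_t)

mi = sampled_mi()
print(f"MI-hat(Y;L) ~ {mi:.3f} bits (between 0 and {BITS})")
```

Replacing `bayes` (the true PDF) by an imperfect model would turn the same loop into the PI estimate of Equation 6.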

Masking
General principle. Masking is a popular countermeasure against side-channel attacks. It consists in representing a variable $a$ as an encoding (tuple), which is a vector $a$ of $d$ uniformly distributed elements (or shares) fulfilling $a = \sum_i a^i$ [CJRR99]. By doing so, all the sets of $d - 1$ shares remain independent of the secret, which was formalised as $d$-probing security [ISW03]. Masked implementations then aim to maintain this property throughout the computations. For this purpose, the generic solution is to perform linear operations share-by-share and to use masked multiplication gadgets for the non-linear operations: they allow multiplying two encodings while ensuring probing security. A prominent example is the ISW multiplication introduced by Ishai, Sahai and Wagner [ISW03], which is recalled in Algorithm 1, where $\otimes$ denotes the multiplication in GF(2).
Concrete security. Duc et al. showed in [DDF14] that probing security reduces to security in the more realistic noisy leakage model, where the adversary has access to the noisy leakages of all the intermediate variables. The (noise and independence) conditions needed for this result to hold in practice and its connection with the previous information theoretic metrics have then been made explicit in [DFS15], leading to the following bound for the data complexity of a side-channel attack against a masked implementation:
$$N \geq \frac{\mathrm{cst}}{\mathrm{MI}(Y^j;L)^{d}}. \quad (7)$$
Precisely, the exponential complexity increase is only relevant if $\mathrm{MI}(Y^j;L)$ is small enough, and the security order $d$ is maintained as long as the leakage function is a noisy linear combination of the shares. Typical defects that can contradict this second assumption are glitches in hardware [NRS11] and transition-based leakages in software [BGG+14]; both of them can be prevented thanks to algorithmic tweaks. As will be clear next, these defects are not critical in our evaluations (i.e., they do not change our main conclusions).
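The ISW multiplication recalled in Algorithm 1 can be sketched as follows for single-bit shares. This is an illustrative model of the standard gadget, not the paper's bitsliced implementation; the bracketing of the refresh value `z_ji` matters for probing security.

```python
# Minimal single-bit sketch of the ISW multiplication gadget [ISW03].
import random

def xor_all(xs):
    """Decode an encoding: XOR all shares together."""
    out = 0
    for v in xs:
        out ^= v
    return out

def share(x, d, rng):
    """Encode bit x into d uniformly random shares XOR-summing to x."""
    shares = [rng.randint(0, 1) for _ in range(d - 1)]
    shares.append(x ^ xor_all(shares))
    return shares

def isw_mult(a, b, rng):
    """Multiply (AND in GF(2)) two d-share encodings."""
    d = len(a)
    c = [a[i] & b[i] for i in range(d)]
    for i in range(d):
        for j in range(i + 1, d):
            z_ij = rng.randint(0, 1)                       # fresh randomness
            z_ji = (z_ij ^ (a[i] & b[j])) ^ (a[j] & b[i])  # order of XORs matters
            c[i] ^= z_ij
            c[j] ^= z_ji
    return c

rng = random.Random(1)
for x in (0, 1):
    for y in (0, 1):
        c = isw_mult(share(x, 3, rng), share(y, 3, rng), rng)
        assert xor_all(c) == x & y  # the encoding decodes to the AND of the secrets
```

All the `z_ij`/`z_ji` pairs cancel under XOR, so the output encoding always decodes to $a \otimes b$.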

Shuffling
General principle. Shuffling is another side-channel countermeasure that leverages independent operations within a circuit (e.g., the 16 AES S-boxes). It consists in performing these operations in a random order, with the goal of confusing the side-channel adversary. Algorithm 2 is an example of shuffling that applies an arbitrary function $\mathrm{op}(\cdot)$ independently to all the elements of the input vector $y$. The first step is to generate a permutation $\theta$ which is uniformly selected among all the permutations of the set $\{0, \ldots, |y| - 1\}$, that we denote $\Theta$. The size of the permutation is next denoted by $\pi$. For now, it corresponds to $\eta$, which is the number of independent operations on which we shuffle and corresponds to the size of $y$. As will be seen next, there are also cases where $\pi > \eta$. The algorithm iterates deterministically over the elements of this permutation. On the $c$-th iteration, the index $\theta_c$ of the element of the input vector to process during that iteration (denoted $s$) is loaded. The output is finally updated such that $z_s \leftarrow \mathrm{op}(y_s)$. In a side-channel attack, all these operations generate leakage on the processed data. Namely, at iteration $c$, the $c$-th element in the permutation generates some leakage denoted as $l^{\theta_c}$. We shall refer to it as the permutation leakage. The manipulation of the input vector also generates some leakage that we next call data leakage. We use the notation $l^{y_{\theta_c}}$ to represent the leakage generated when processing the $\theta_c$-th index in the vector $y$ during cycle $c$.
Each iteration of Algorithm 2 leaks $l^{y_{\theta_c}}$ and $l^{z_{\theta_c}}$.

Concrete security. Under the assumption that the noise is sufficient to hide the permutation indexes, shuffling offers an increase of a side-channel attack's data complexity that is linear in the number of independent operations $\eta$ on which the permutation is applied [HOM06, VMKS12]. Using the previous information theoretic notations, it gives:
$$N \geq \frac{\mathrm{cst} \cdot \eta}{\mathrm{MI}_u(Y_i;L)}, \quad (8)$$
where the denominator is the $\mathrm{MI}_u$ of a similar un-shuffled implementation.
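The control flow of Algorithm 2 can be modeled with a few lines of Python. The operation `op` and the vector contents are placeholders; the point is the deterministic iteration over a uniformly drawn permutation.

```python
# Sketch of Algorithm 2: apply op() to all elements of y in a random order.
import random

def shuffled_apply(y, op, rng):
    """Process the eta independent operations under a uniform permutation."""
    pi = len(y)                      # permutation size (here pi = eta)
    theta = list(range(pi))
    rng.shuffle(theta)               # uniform permutation over {0, ..., pi-1}
    z = [None] * pi
    for c in range(pi):              # deterministic iteration over theta
        s = theta[c]                 # loading s leaks l^{theta_c}
        z[s] = op(y[s])              # leaks l^{y_{theta_c}} and l^{z_{theta_c}}
    return z

rng = random.Random(0)
y = [3, 1, 4, 1, 5, 9, 2, 6]
assert shuffled_apply(y, lambda v: v ^ 0xFF, rng) == [v ^ 0xFF for v in y]
```

Functionally the output is identical to the unshuffled computation; only the order of the leaking accesses changes.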
Computing the PDF. The optimal way to compute the leakage PDF of a shuffled implementation is given by the next equation:
$$f(l \mid y) = \frac{1}{\pi!} \sum_{\theta^\star \in \Theta} f(l^{\theta} \mid \theta^\star) \cdot f(l^{y} \mid y, \theta^\star), \quad (9)$$
where $l$ is the concatenation of the permutation and data leakage vectors. Summing over all the permutations rapidly turns out to be too computationally intensive (e.g., for $\pi = 16$, the number of modes of this mixture is already $16! \approx 2^{44.2}$). As a result, more computationally efficient approaches have been proposed, at the cost of a possible information loss.

The AC12 attack. At ASIACRYPT 2012, an approach was proposed to recover $y$ by exploiting the leakage on the permutation indexes [VMKS12], using the following equation:
$$\hat{f}(l \mid y_s) = \sum_{c=0}^{\pi-1} w_{AC12}(\theta_c = s \mid l^{\theta}) \cdot f(l^{y_{\theta_c}} \mid y_s), \quad (10)$$
where $w_{AC12}(\theta_c = s \mid l^{\theta})$ is the weight assigned to a cycle $c$. Its goal is to indicate the probability that the targeted value $y_s$ is manipulated at the cycle $c$. They propose many solutions to derive $w_{AC12}$ with different time and data complexities. We focus on the so-called "direct permutation leakages" (DPLeak), for which this weight is computed as:
$$w_{AC12}(\theta_c = s \mid l^{\theta}) = \frac{f(l^{\theta_c} \mid s)}{\sum_{c^\star=0}^{\pi-1} f(l^{\theta_{c^\star}} \mid s)}. \quad (11)$$
Putting things together, the AC12 attack recovers the full vector $y$ by applying Bayes on each of the elements of the vector independently, using a model:
$$\hat{m}_{AC12}(y_s \mid l) = \frac{\hat{f}(l \mid y_s)}{\sum_{y^\star} \hat{f}(l \mid y^\star)}. \quad (12)$$
Note that since this model is imperfect, the adversary using it exploits some perceived information rather than the whole mutual information. Note also that this attack is not a divide-and-conquer one, since the leakages of the permutation are exploited jointly.
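The cost of the exact mixture of Equation 9 can be illustrated by enumerating all permutations for a tiny $\pi$. The Hamming-weight leakage on indexes and data, and the noise level, are placeholder assumptions for this sketch; the number of modes is $\pi!$, so this is only feasible for very small permutations.

```python
# Sketch of Equation 9's exact mixture over all pi! permutations, for a
# hypothetical Hamming-weight + Gaussian leakage model.
import itertools
import math

SIGMA = 1.0

def hw(x):
    return bin(x).count("1")

def gauss(l, mean):
    """Unnormalized Gaussian density (constants cancel after Bayes)."""
    return math.exp(-((l - mean) ** 2) / (2 * SIGMA ** 2))

def mixture_pdf(l_perm, l_data, y):
    """f(l|y): average over all permutations theta* of the joint density."""
    pi = len(y)
    total = 0.0
    for theta in itertools.permutations(range(pi)):
        term = 1.0
        for c in range(pi):
            term *= gauss(l_perm[c], hw(theta[c]))     # permutation leakage
            term *= gauss(l_data[c], hw(y[theta[c]]))  # data leakage
        total += term
    return total / math.factorial(pi)

# pi = 3: only 6 modes; at pi = 16 there would already be 16! ~ 2^44 of them.
y = [2, 5, 7]
val = mixture_pdf([0.9, 1.1, 2.0], [1.2, 2.1, 2.9], y)
assert val > 0.0
```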

Improving the AC12 attack strategy
We next propose an improved strategy to attack shuffled implementations, and use simulated information theoretic evaluations to demonstrate its gains over the AC12 one.

Attack specification
In principle, the attack we propose is similar to the AC12 DPLeak one. Yet, it uses a slightly different model that is expressed by the following equation:
$$\hat{m}_{New}(y_s \mid l) = \sum_{c=0}^{\pi-1} \Pr[\theta_c = s \mid l^{\theta_c}] \cdot p(y_s \mid l^{y_{\theta_c}}), \quad (13)$$
where the first term processes the permutation leakages and the second term processes the data leakages. Our modifications compared to Equation 12 are twofold. First, the weighted sum over $c$ is not performed on continuous PDFs but on the probabilities obtained after the application of Bayes (as reflected by the right term in the above equation). Second, we notice that the weights in the sum estimated with Equation 11 depend on the full permutation leakage. Since each term in the sum corresponds to the leakage at cycle $c$, we modified the weights such that they give the probability that the index manipulated at cycle $c$ is equal to $s$. So Equation 12 can be viewed as a heuristic alternative to the standard approach taken for analytical attacks such as [VGS14], and Equation 13 corresponds to the correct factorization of the probability distribution that the attacker tries to estimate. Formally, this leads to:
$$\Pr[\theta_c = s \mid l^{\theta_c}] = \frac{f(l^{\theta_c} \mid s)}{\sum_{s^\star=0}^{\pi-1} f(l^{\theta_c} \mid s^\star)}. \quad (14)$$
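The per-cycle weights can be sketched as follows, again under a hypothetical Hamming-weight + Gaussian leakage model on the permutation indexes (the index encoding and the noise level are assumptions of this example, not of the attack itself).

```python
# Sketch of the per-cycle weights: apply Bayes on the single cycle-c
# permutation leakage to get a distribution over the manipulated index s.
import math

SIGMA = 0.5

def hw(x):
    return bin(x).count("1")

def gauss(l, mean):
    """Unnormalized Gaussian density (constants cancel in the ratio)."""
    return math.exp(-((l - mean) ** 2) / (2 * SIGMA ** 2))

def cycle_weights(l_theta_c, pi):
    """Pr[theta_c = s | l^{theta_c}] for all s, assuming a uniform prior."""
    dens = [gauss(l_theta_c, hw(s)) for s in range(pi)]
    norm = sum(dens)
    return [d / norm for d in dens]

w = cycle_weights(l_theta_c=1.05, pi=4)
assert abs(sum(w) - 1.0) < 1e-9      # a proper distribution over s
assert w[1] > w[0] and w[1] > w[3]   # HW(1) = 1 best matches l ~ 1.05
```

Unlike the AC12 weights, each weight vector here only depends on the leakage of its own cycle, and it normalizes to one over the candidate indexes.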

Models comparison
Methodology. Next, we simulate a shuffled implementation (i.e., Algorithm 2) where the leakages are distributed according to Equation 9. From these leakages and the knowledge of the true PDF, we first estimate the MI according to Equation 5, which represents the best attack possible against the implementation. We do that for small permutation sizes (since for large $\pi$ values the direct computation of the MI is computationally hard). For the same implementations, we extract the PI thanks to Equation 6 for both models $\hat{m}_{AC12}(\cdot \mid \cdot)$ (Equation 12) and $\hat{m}_{New}(\cdot \mid \cdot)$ (Equation 13). It allows discussing which model is the best and how far it is from the optimal attack enumerating all the permutations. Practically, the simulations take two parameters. The first one is the noise variance $\sigma^2$ in Equation 3. It represents the amount of noise that is intrinsic to the implementation. The second one is the number of independent operations on which we shuffle, which corresponds to the size $\eta$ of the secret vector $y$. As a result, we have $\pi = \eta$. Due to the aforementioned computational limitations, we take values $\eta \in \{2, 4, 6\}$. However, we note that the evaluation of the AC12 attack and our improvement do not suffer from such a limitation, meaning that the PI could also be evaluated for a larger $\eta$.

Results and discussion. The results of these simulations are reported in Figure 1. On the left, we observe the resulting MI and PIs according to the noise parameter $\sigma^2$: DPLeak AC12 is the label of the model $\hat{m}_{AC12}(\cdot \mid \cdot)$ and DPLeak New is the label of the model $\hat{m}_{New}(\cdot \mid \cdot)$. On the right, we report the ratio between the MI and the PIs. This part of the plot is used to highlight the gap between the efficient adversaries' models and the worst-case attack. Based on these simulations, we first observe that our new model improves over the AC12 one. Indeed, the PI for the DPLeak AC12 adversary is always (sometimes significantly) lower than for DPLeak New.
Second, we observe that for high noise levels, our new model offers a good approximation of the MI while the AC12 one suffers from a bias. We note that DPLeak New could possibly be further improved with analytical attacks such as [VGS14]. However, Figure 1 shows that for the high noise levels on which we will focus, DPLeak New is already close to the worst-case attack, which is in line with the observation in [ADP+20] that such analytical attacks only lead to minor improvements when aiming at recovering ephemeral secrets (which is the case of shuffling permutations).
In this paper, such small-scale information theoretic evaluations (based on the worst-case MI) will be quite systematically used to evaluate and compare different combinations of masking and shuffling. On the negative side, (i) they only correspond to attacks considering a representative leakage function, which is not equivalent to proving security in general: the gap between provable analyses and worst-case attacks for shuffling is highlighted in [CS21] and tightening this gap is an important open problem; and (ii) these small-scale examples do not directly apply to the typical sizes of concrete implementations (e.g., $H(Y_i) = 8$ and $\eta = 16$ for the AES). On the positive side, (i) information theoretic evaluations as we propose have quite systematically been shown to be excellent indicators of the bounds that can be obtained in masking security proofs: see for example the sequence of papers [SMY09, SVO+10, PR13, DDF14, DFS15] for an illustration; and (ii) the same holds (up to constant factors) for the extrapolation from small variables to larger ones, and the concrete attacks we propose do not suffer from computational limitations. Since all our observations are confirmed for growing values of $d$ and $\pi$, we believe they provide a necessary first step towards a better understanding of the masking + shuffling combination of countermeasures which improves over the one of Rivain et al. [RPD09], both in terms of the security levels we claim, and in terms of the considered attacks' coverage.

Systematic information theoretic analysis
As discussed in Subsection 2.3, a masked $d$-probing secure circuit is composed of two types of operations: linear ones can be applied share-by-share, while non-linear ones require to mix the shares securely. We now analyze different approaches to combine masking and shuffling for these two types of operations. We start with paper-and-pencil intuitions to express the security gains such combinations provide in simple terms and/or to rule out some possible options. Then, we evaluate some relevant combinations with an information theoretic analysis. For now, we focus on the high-noise regime, where exploiting the permutation leakages does not improve the attacks [VMKS12]. Simulations taking permutation leakages into account are reported in the extended (ePrint) version of this work.

Linear operations
Based on the notations of Subsection 2.1, applying a linear operation $\mathrm{op}_l(\cdot)$ to an encoding of the secret vector $a$ of size $\eta$ consists in applying $\mathrm{op}_l(\cdot)$ independently to all the elements of its share vectors $a^i$. Namely, the output encoding of $b$ is derived such that $b_s^i = \mathrm{op}_l(a_s^i)$ for all $0 \le i < d$ and $0 \le s < \eta$. The challenge when combining masking and shuffling for linear layers is to identify in which order the pairs of indexes $(i, s)$ should be used to load $a_s^i$. We consider three possible shuffling configurations that can be applied to a linear layer, as summarized in Figure 2, where a color denotes a permutation and a box the pair(s) $(i, s)$ that are accessed deterministically at each cycle of that permutation.

Shuffling tuples
A first straightforward possibility is to shuffle tuples of shares. Only the indexes $s$ of the vectors are shuffled, and the shares $i$ of that index are accessed sequentially and deterministically. Algorithm 3 (masking and shuffling tuples) is an instantiation of such a combination and a graphical representation is given in Figure 2a. In terms of security, this shuffling and masking combination is similar to the case where we only shuffle an implementation with Algorithm 2. Namely, because the shares are accessed in order, at cycle $c$ information on all the shares $a^i_{\theta_c}$ can be combined to get information on $a_{\theta_c}$, without being impacted by shuffling. So similarly to a shuffled-only implementation, the security of this combination is linear in the number of operations $\eta$ on which we shuffle:
$$N_{\mathrm{tuples}} \geq \frac{\mathrm{cst} \cdot \eta}{\mathrm{MI}_m(A;L)}. \quad (15)$$
This equation is confirmed by the IT analysis of Algorithm 3 given in Figure 3. On the left, we report the MI of a masked and shuffled implementation $\mathrm{MI}_{m+s}(A;L)$, for various $d$ and $\eta$. On the right, we report the ratio between the MI of a masked-only implementation $\mathrm{MI}_m(A;L)$ and $\mathrm{MI}_{m+s}(A;L)$. Based on Equation 1, this ratio is the increase of the attack data complexity that shuffling and masking provide compared to masking only. We observe that, as expected from Equation 15, the gain equals the permutation size $\pi = \eta$ for sufficiently large noise variance. We also observe that this gain is independent of $d$, which confirms that there is no non-trivial interaction between shuffling and masking in this case. Indeed, for $\eta = 2$, the gain equals two for both the 2- and 3-share implementations.

Shuffling shares.
A natural option to improve the interaction between masking and shuffling is to shuffle the shares of independent variables instead of their tuples. As illustrated in Figure 2b and described formally in Algorithm 4 (masking and shuffling shares), it consists in processing the shares sequentially and deterministically, and in picking a random permutation for each vector of shares $a^i$. For each share index, the operations $\mathrm{op}_l(a_s^i)$ are performed with $s$ selected according to a fresh permutation. As a result, the permutation is always applied to independent values.
Such an approach is beneficial to side-channel security since, to obtain information about a secret element $a_s$, an adversary now has to retrieve at which cycle $c$ the $d$ shares $a_s^i$ are manipulated. Without knowledge of the permutations (e.g., because of sufficiently noisy leakages, as we assume for now), she succeeds for a share with probability $\frac{1}{\eta}$, and so with probability $\frac{1}{\eta^d}$ for the $d$ shares. As a result, the shuffling shares method can be interpreted as providing an increase of the noise on each share by a factor equal to the permutation size $\pi = \eta$. Masking then amplifies this emulated noise exponentially. The impact of this combination on the (worst-case) attack data complexity is given by:
$$N_{\mathrm{shares}} \geq \frac{\mathrm{cst} \cdot \eta^d}{\mathrm{MI}_m(A;L)}. \quad (16)$$
This equation is confirmed by the IT analysis of Algorithm 4 in Figure 4. For large noise, the $\mathrm{MI}_{m+s}(A;L)$ of the shuffled shares implementation is $\eta^d$ times lower than the $\mathrm{MI}_m(A;L)$ of the masked-only implementation. For example, for $\eta = 4$ and $d = 2$, this ratio is $4^2 = 16$. For $\eta = 3$ and $d = 3$, this ratio is $3^3 = 27$. Therefore, by using $d$ permutations of size $\pi = \eta$, this solution provides an exponential amplification of the noise emulated thanks to shuffling. For completeness, we report similar results assuming a leaking permutation in the extended version of this work. As expected, slightly more noise is required to hide information on the permutation indexes, but for the rest, conclusions remain unchanged and we keep the same asymptotic improvement.
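The shuffling-shares control flow (Algorithm 4) can be sketched as follows. The byte-level linear operation and the sizes are placeholder assumptions; the sketch checks that functionality is preserved, i.e., that a GF(2)-linear operation applied share-by-share under d independent permutations still decodes correctly.

```python
# Sketch of Algorithm 4: process the shares in order, but draw a fresh
# permutation of the eta elements for each share index i.
import random

def shuffle_shares_linear(a, op_l, rng):
    """a[i][s]: share i of element s; op_l is applied share-by-share."""
    d, eta = len(a), len(a[0])
    b = [[None] * eta for _ in range(d)]
    for i in range(d):                   # shares accessed deterministically
        theta = list(range(eta))
        rng.shuffle(theta)               # fresh permutation per share index
        for c in range(eta):
            s = theta[c]                 # leaks on a_s^i at cycle c
            b[i][s] = op_l(a[i][s])
    return b

# A GF(2)-linear op on bytes (rotate left by 1): it commutes with XOR,
# so op applied to each share decodes to op applied to the secret.
rot1 = lambda x: ((x << 1) | (x >> 7)) & 0xFF

rng = random.Random(2)
secrets = [0x53, 0xCA, 0x01, 0xFE]
d = 3
a = [[rng.randrange(256) for _ in secrets] for _ in range(d - 1)]
a.append([secrets[s] ^ a[0][s] ^ a[1][s] for s in range(len(secrets))])

b = shuffle_shares_linear(a, rot1, rng)
for s, sec in enumerate(secrets):
    assert b[0][s] ^ b[1][s] ^ b[2][s] == rot1(sec)
```

Each element's $d$ shares are now processed at $d$ independently drawn cycles, which is what yields the $\eta^d$ gain of Equation 16.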

Shuffling everything.
The two previous options shuffled either over the shares or over the tuples. A last solution is to shuffle jointly all the shares corresponding to all the pairs $(i, s)$ by using a single permutation on $\pi = d \cdot \eta$ elements. This combination is illustrated in Figure 2c and formally defined in Algorithm 5 (masking and shuffling everything). In terms of side-channel security, recovering information about a secret element $a_s$ without knowledge of the permutation (e.g., because of sufficiently noisy leakages) now requires that the adversary exploits the leakage of all the $d$ shares $a_s^i$. This implies finding the cycles $c$ where the $d$ pairs $(i, s)$ with $0 \le i < d$ are used. To do so, she chooses the first share out of the set of $d \cdot \eta$ shuffled values. Since $d$ of these values correspond to a share $a_s^i$, she succeeds with probability $\frac{d}{d \cdot \eta}$. The second share can be guessed with probability $\frac{d-1}{d \cdot \eta - 1}$ by excluding the first selected share. The adversary can then obtain information on $a_s$ if she selects correctly all the $d$ shares, which happens with probability $\prod_{i=0}^{d-1} \frac{d-i}{d \cdot \eta - i} = \binom{d \cdot \eta}{d}^{-1}$. Hence, the attack data complexity grows as:
$$N_{\mathrm{everything}} \geq \frac{\mathrm{cst} \cdot \binom{d \cdot \eta}{d}}{\mathrm{MI}_m(A;L)}. \quad (17)$$
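The success probability above can be checked symbolically: the product of the successive guessing probabilities collapses to the inverse of a binomial coefficient.

```python
# Sanity check of the shuffling-everything guessing probability:
# prod_{i=0}^{d-1} (d - i) / (d*eta - i) equals 1 / C(d*eta, d).
from fractions import Fraction
from math import comb

def guess_probability(d, eta):
    """Exact probability of selecting the d correct shares in sequence."""
    p = Fraction(1)
    for i in range(d):
        p *= Fraction(d - i, d * eta - i)
    return p

for d in (2, 3, 4):
    for eta in (2, 4, 8):
        assert guess_probability(d, eta) == Fraction(1, comb(d * eta, d))
print("probability product matches 1 / C(d*eta, d)")
```

Using exact `Fraction` arithmetic avoids any floating-point tolerance in the comparison.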

Discussion.
We conclude from these analyses that $N_{\mathrm{tuples}} \leq N_{\mathrm{shares}} \leq N_{\mathrm{everything}}$. However, these successive improvements come at the cost of more or larger permutations. Hence, they raise the question of which option provides the best cost vs. security tradeoff, which we discuss in Section 5. For this purpose, a preliminary is to generalize our results from linear operations to non-linear ones, which we tackle in the next section. We note that Rivain et al. used the shuffling everything method for their linear layers (and derived a correlation-based bound for this purpose), but they only used the shuffling tuples one for the AES S-boxes [RPD09]. To the best of our knowledge, the shuffling shares method is new. As will be shown next, it is the one leading to the best cost vs. security tradeoff. It allows avoiding the imbalance between the security levels of linear and non-linear operations, which is caused by the use of two types of shuffling, and which forced Rivain et al. to artificially increase the permutation size of the non-linear layers by using dummy operations.

Non-linear operations
The second building block for $d$-probing secure circuits are non-linear operations. We will consider the standard ISW multiplication for this purpose. In this case, all the pairs $(i, j)$ have to be accessed to compute the cross products $a^i \otimes b^j$. We assume a setting where $\eta$ independent ISW multiplications $c_s = a_s \otimes b_s$, with $0 \le s < \eta$, have to be computed. To perform such operations, all the triplets $(s, i, j)$ have to be accessed to compute all the $a_s^i \otimes b_s^j$ cross-products. Next, we list different ways to shuffle the computation of these triplets and discuss the different security levels they lead to. We note that such combinations of masking and shuffling for non-linear layers are more complex than for linear ones: there are more possibilities to be considered and their security is sometimes hard to assess. We therefore start by presenting the simplest solution of shuffling tuples used by Rivain et al. We then introduce a generic taxonomy of shuffling configurations which allows describing the design space of the masking + shuffling combinations, and we illustrate this taxonomy with the shuffling tuples approach. Finally, we prune this design space and focus on a number of solutions that can be viewed as the counterparts of the shuffling shares and shuffling everything approaches previously described for linear operations. We also argue why this pruning is practically relevant. As for linear operations, we perform simulations in order to confirm the effect of the countermeasures on the MI, which requires modeling the full leakage distribution of the gadgets. In the case of non-linear operations, the simulations include a fresh leakage (i.e., bit value + noise) each time an intermediate value is used or produced. Hence, they include all the $d$ manipulations of the input shares, all intermediate cross-products and all the randomness used in the gadget.

Shuffling tuples.
This first method is similar to the shuffling tuples for linear layers. It is presented in Algorithm 6, where only the accesses to the tuples a_s and b_s are shuffled with a permutation of size π = η. The pairs (i, j) are then accessed deterministically within these tuples. Similarly to the shuffling tuples for linear layers, this allows an adversary to obtain information on the unshared secrets a_s, b_s and c_s with probability 1/η. The resulting security guarantee is then linear in the size of the permutation, similarly to Equation 15. This is confirmed by our IT analysis of Algorithm 6 reported in Figure 6. There, the ratio between the MI of a masked-only ISW multiplication MI_m(A; L) and the one of a shuffled and masked ISW multiplication MI_{m+s}(A; L) is always equal to η.
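This strategy can be sketched as follows (names are ours, and `isw_mult` stands for any d-share multiplication gadget): a single Fisher-Yates permutation of size η fixes the visiting order of the tuples, while each multiplication itself runs deterministically.

```python
import secrets

def shuffled_tuples_isw(A, B, isw_mult):
    """A, B: lists of eta sharings; returns C such that C[s] shares A[s]*B[s],
    with the eta multiplications executed in a uniformly shuffled order."""
    eta = len(A)
    perm = list(range(eta))
    # Fisher-Yates shuffle driven by a cryptographic RNG
    for k in range(eta - 1, 0, -1):
        t = secrets.randbelow(k + 1)
        perm[k], perm[t] = perm[t], perm[k]
    C = [None] * eta
    for s in perm:          # tuples a_s, b_s are visited in shuffled order
        C[s] = isw_mult(A[s], B[s])
    return C
```

Since the permutation only reorders the η independent multiplications, the outputs are identical to the unshuffled computation; only the execution order (and hence the leakage order) changes.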

Shuffling more.
As in the previous subsection, the next step is to shuffle over more operations to reach a better security gain. For example, one natural goal is to amplify the effect of shuffling with masking so that a gain factor of η^d can be maintained for full implementations. For this purpose, we first describe all the possible shuffling and masking combinations that handle the indexes s, i and j independently, with the next shuffling configurations.

Algorithm 6: Shuffling tuples ISW (configuration (1^s, 0^i, 0^j)).
Input: inputs {a_0, a_1, . . . , a_{η−1}} and {b_0, b_1, . . . , b_{η−1}} and randomness r_s^{i,j} defined as: ∀s, ∀i, ∀j such that i < j, r_s^{i,j} ← {0, 1}, r_s^{j,i} = r_s^{i,j} and r_s^{i,i} = 0.
Output: outputs {c_0, c_1, . . . , c_{η−1}} such that ∀s ∈ {0, 1, . . . , η − 1}, c_s = a_s ⊗ b_s.
Shuffling + masking configurations. We describe all the possible combinations of masking and shuffling with Algorithm 7, along with a configuration of the form (x^α, x^β, x^γ). The algorithm is composed of three nested loops, where each loop is responsible for shuffling (or not) one of the indexes s, i or j. The superscripts correspond to the index manipulated by the loop, and the position in the triplet corresponds to the order of the loops. For example, the first superscript designates the outermost loop and the third one the innermost loop. Additionally, the value x is a bit set to 1 if the loop is shuffled and 0 otherwise. We note that swapping the indexes i and j leads to the same gadget since the multiplication is commutative. Hence, the configuration (0^s, 0^i, 0^j) denotes an unshuffled implementation, and the previous shuffling tuples solution given in Algorithm 6 corresponds to the configuration (1^s, 0^i, 0^j), where only the outer loop on s is shuffled. The Greek letters in Algorithm 7, namely α ∈ A, β ∈ B and γ ∈ Γ, are uniquely set to s, i or j. The corresponding capital letter is the set of values that the index can take.

Pruning the configuration space. Based on the previous configurations, there are 6 possible ways to order the loops and, for each of these orderings, there are 2^3 possibilities of shuffling. This leads to a total of 48 cases. To reduce the number of configurations to investigate in detail, we first notice that in Algorithm 7, shuffling the index i or j never leads to a security improvement. Indeed, shuffling on i (resp., j) only means shuffling the accesses to the shares a_s^i (resp., b_s^j). Because in an encoding a_s of a_s, the position of the shares does not affect security, an adversary can obtain information on a_s by observing leakages on all the shuffled shares a_s^i without being impacted by the shuffling of i (resp., j).
We further note that in the high-noise regime (that we assume for now), this observation also holds when we shuffle both i and j. In this case, the cross-products are shuffled, and information on each of the a_s^i involved in these cross-products is obtained from the η observations of the shuffled a_{θ(s)}^i. But information on a_s can still be recovered from the input/output tuples of the multiplication, without being impacted by the shuffling of i. So this configuration makes the information of the cross-products harder to exploit, while it is known that the information provided by these cross-products is dominated by the information of the multiplications' input/output tuples when the noise is large [CS19].

Algorithm 7: Generic shuffled & masked ISW multiplications.
Input: inputs {a_0, a_1, . . . , a_{η−1}} and {b_0, b_1, . . . , b_{η−1}} and shuffling configuration (x^α, x^β, x^γ).
Output: outputs {c_0, c_1, . . . , c_{η−1}} such that ∀s ∈ {0, 1, . . . , η − 1}, c_s = a_s ⊗ b_s.

Therefore, we next limit our investigations to configurations with 0^i and 0^j. This reduces the set of combinations to 3 possibilities. We next list them and discuss their security. The first one is (1^s, 0^i, 0^j) and corresponds to the aforementioned shuffling tuples (Algorithm 6), for which the security impact is only linear in the size of the permutation. The second configuration is (0^i, 1^s, 0^j) (resp., (0^j, 1^s, 0^i)). For this configuration, the security of the inner loop variable b^j (resp., a^i) differs from the one of the outer loop variable a^i (resp., b^j), and the security of the inner loop variable is similar to the shuffling tuples option, which is not desirable. Finally, the third configuration is (0^i, 0^j, 1^s). It is similar to the shuffling shares of linear layers, where the permutation is applied to π = η independent elements. In this case, every loading of the input shares a_s^i and b_s^j as well as every update of c_s^i is shuffled among the η independent ISW multiplications. As a result, the information on each of these operations is reduced by a factor η that is later amplified by masking. Therefore, it provides an exponential gain factor η^d as in Equation 16. This gain is confirmed by the IT analysis depicted in Figure 7 for d = 2 and η = 2, where the ratio between MI_m(A; L) and MI_{m+s}(A; L) for high enough noise is of 2^2 = 4.
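A sketch of this (0^i, 0^j, 1^s) configuration is given below (helper names are ours; the symmetric randomness R follows the ISW variant used in the algorithms above): each of the d^2 (i, j) pairs draws a fresh permutation and visits the η independent multiplications in shuffled order.

```python
import secrets

def shuffled_shares_isw(A, B, R):
    """A, B: eta sharings of d bits each; R[s][i][j]: symmetric randomness with
    R[s][i][i] = 0. Returns the eta output sharings, with the inner loop on s
    shuffled by a fresh permutation for every (i, j) pair."""
    eta, d = len(A), len(A[0])
    C = [[0] * d for _ in range(eta)]
    for i in range(d):
        for j in range(d):
            perm = list(range(eta))
            for k in range(eta - 1, 0, -1):     # fresh Fisher-Yates shuffle
                t = secrets.randbelow(k + 1)
                perm[k], perm[t] = perm[t], perm[k]
            for s in perm:                      # inner loop on s is shuffled
                C[s][i] ^= (A[s][i] & B[s][j]) ^ R[s][i][j]
    return C
```

Note that the shuffling only permutes which multiplication's (i, j) cross-product is processed at a given time; every triplet (s, i, j) is still visited exactly once, so the outputs are unchanged.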

Shuffling everything-like.
A last question is to know whether it is possible to obtain a better security level, similar to the one obtained for the shuffling everything of linear layers. For these operations, the improvement was obtained by shuffling all the pairs (i, j) jointly instead of independently. For the ISW multiplications, this can be done by merging some of (or all) the loops in Algorithm 7. To list the possible combinations when two loops are merged, we use Algorithm 8 along with notations of the form (x^{α,β}, x^γ). For example, (x^{i,j}, x^s) means that the outer loop is a merge of the loops on i and j, and hence runs over d^2 elements. More precisely, in Algorithm 8 the operator × denotes the Cartesian product and θ^{α,β} is a permutation over the set A × B. Alternatively, all loops can be merged together into a single one operating on d^2 · η elements. We denote this configuration as (x^{s,i,j}) and detail it in the extended version of this work.

Algorithm 8: Generic shuffled & masked ISW multiplications with 2 loops merged.
Input: inputs {a_0, a_1, . . . , a_{η−1}} and {b_0, b_1, . . . , b_{η−1}}, shuffling configuration (x^{α,β}, x^γ) with α, β, γ ∈ {s, i, j} and randomness r_s^{i,j} defined as: ∀s, ∀i, ∀j such that i < j, r_s^{i,j} ← {0, 1}, r_s^{j,i} = r_s^{i,j} and r_s^{i,i} = 0.
Output: outputs {c_0, c_1, . . . , c_{η−1}} such that ∀s ∈ {0, 1, . . . , η − 1}, c_s = a_s ⊗ b_s.

Based on these configurations, first, the loops on i and j must be merged to avoid the same asymmetry issues as in the shuffling shares approach. This reduces the possibilities of merging loops to three combinations: (1^s, 1^{i,j}), (1^{i,j}, 1^s) and (1^{s,i,j}). However, all these options induce a non-uniform permutation of the output shares, since the output share c_0^0 is only valid once it has been updated d times in Algorithm 8. Therefore, during the first few iterations it is unlikely that the current operation is the d-th update of c_0^0, while this becomes likely for the last few iterations. This effect is illustrated in the extended version of this work, where we show that the probability to generate c_0^0 increases with t, with some differences depending on the shuffling configurations. Such non-uniform permutations of the output shares may offer some level of security. Yet, assessing it requires analyzing complex permutation biases that are specific to the configurations and to the parameters d and η. We therefore rule out these options in our investigations. Table 1 contains a summary of the different combinations of masking and shuffling we considered, when the secret vectors are of size η. The table reports the security gain factor compared to a masked-only implementation (i.e., MI_m / MI_{m+s}), the size of the permutation(s) used (i.e., π) and the number of fresh permutation(s) needed (i.e., # perm.). As mentioned above, the shuffling everything approach cannot be straightforwardly applied to non-linear layers. Hence, we next focus on the shuffling shares solution, which allows the same exponential security gain for both linear and non-linear layers, enabling balanced designs (and evaluate the shuffling tuples option for comparison purposes).

Time versus security for shuffled ISW
In the previous section, we have discussed how to amplify the impact of shuffling by combining it with masking, for both linear and non-linear operations. In this section, we focus on the practical instantiation of the shuffling tuples and shuffling shares methods, with a focus on 32-bit software platforms, and we compare them to a masked-only implementation. That is, we study Algorithm 7 with configurations (1^s, 0^i, 0^j), (0^i, 0^j, 1^s) and (0^s, 0^i, 0^j) in the context of bitslice software implementations. We first detail the randomness that is required by such implementations. Then, we describe the parameters that influence their execution time and their security level. Finally, we propose concrete performance evaluations and leverage them to discuss general guidelines for combinations of masking and shuffling that best trade performance for security. As a preliminary remark, we mention that when protecting a cryptographic implementation with shuffling, a first (cipher-specific) challenge is to find independent operations that can be executed in parallel. To keep the following discussions independent of the primitive to protect, we therefore consider a general use case where we aim to implement #AND ISW multiplications in parallel. As will be clear next, this allows us to draw general conclusions about how masking and shuffling should be combined (i.e., which of the countermeasures should use the available parallelism in priority), which can then be directly translated into concrete guidelines for implementing actual ciphers (or possibly modes of operation, in case they offer additional levels of parallelism).

Randomness requirements
The randomness required to combine masking and shuffling is composed of two terms. The first one is due to masking and remains independent of the shuffling parameter η. For the masked (only) ISW multiplication from Algorithm 1, d · (d − 1)/2 random bits per bitwise multiplication are needed. This means that #AND times more random bits are needed in our context. The second term is due to shuffling and corresponds to the randomness needed to generate the permutation(s) θ. Precisely, a permutation on η elements can be generated with η · log_2(η) random bits [VMKS12]. When combined with masking by shuffling tuples, a single such permutation must be generated, as reported in Table 1. When combined with masking by shuffling shares, d^2 such permutations must be generated. Hence, the shuffling shares strategy requires a total amount of random bits given by:

#AND · d · (d − 1)/2 + d^2 · η · log_2(η).
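These counts can be tabulated with a small helper (the function and parameter names are ours): masking costs #AND · d(d − 1)/2 bits, one permutation on η elements costs η · log_2(η) bits, and shuffling shares uses d^2 fresh permutations.

```python
from math import log2

def randomness_bits(n_and, d, eta, strategy):
    """Total random bits for n_and bitwise ISW multiplications with d shares,
    combined (or not) with shuffling over eta independent operations."""
    masking = n_and * d * (d - 1) // 2
    perm = eta * log2(eta)            # bits per permutation [VMKS12]
    if strategy == "tuples":          # a single permutation
        return masking + perm
    if strategy == "shares":          # d^2 fresh permutations
        return masking + d * d * perm
    return masking                    # masking only
```

For instance, with #AND = 128, d = 4 and η = 4, masking alone needs 768 bits, shuffling tuples adds a single 8-bit permutation, and shuffling shares adds 16 of them (128 bits).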

The bitslice masking + shuffling design space
We now introduce the different parameters that influence the execution time as well as the security level of masked and shuffled bitslice implementations. These parameters are summarized in Table 2, where the first block contains parameters that are under the control of the implementers and the second one contains parameters that depend (partially) on the software platform used. We next detail these parameters and their interactions. First, we recall that bitslicing is a software programming technique that represents an algorithm (e.g., a block cipher) with Boolean operations [Bih97]. It takes advantage of the parallelism enabled by bitwise instructions, which are available in most (if not all) modern micro-controllers (MCUs). For example, it is possible to place 32 bits in a register a, 32 bits in a register b, and to obtain the 32 bits corresponding to a ⊕ b in a single instruction. This strategy is particularly appealing in the context of masking, where multiple bitwise ISW multiplications can be applied in parallel. Several works have shown how it can be efficiently used to protect block cipher implementations at arbitrary security orders [GR17, BGR18, BDM+20]. Next, we observe that in contrast with bitslice implementations that favor parallelism, shuffling rather applies to serialized independent operations: the more such operations, the larger the size of the permutation and therefore the impact of shuffling. As a result, the main question when combining bitslice masking and shuffling is whether one should favor serialization or parallelism. To answer this question, we will analyze the impact of the "effective size" of the register, denoted as bs, which is the parameter reflecting the tradeoff between serialization and parallelism. It is defined as the number of useful bits in a single register, and therefore corresponds to the number of Boolean operations that are parallelized.
The maximum value of bs is given by the physical register size. Therefore, on a 32-bit software platform, we have bs ≤ 32. Yet, the designer can also reduce bs to increase the permutation size η, which is equal to #AND/bs. For example, Figure 8 illustrates two options in the masking + shuffling design space for #AND = 64. The first option (Figure 8a) is to maximize the parallelism with bs = 32 and thus η = 2. A second option (Figure 8b) is to decrease the parallelism with bs = 16 and to increase the permutation size up to η = 4. The other parameters that we need to evaluate the performance vs. security tradeoff of masked and shuffled implementations are the randomness cost and the noise level. Precisely, and as described in Subsection 5.1, both masking and shuffling require randomness. Hence, the throughput at which random bits are generated has an impact on the cycle count. We introduce the parameter r, which is the latency in cycles needed to generate 32 random bits once requested. As for the noise level, we quantify it with the amount of information that is available on a single share of an unprotected implementation, MI_u(A_i; L).
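The serialization vs. parallelism tradeoff can be enumerated with a toy helper (ours): it halves the effective register width and derives the matching permutation size η = #AND/bs at each step.

```python
def design_points(n_and, register_bits=32):
    """Enumerate (bs, eta) design points for n_and parallel multiplications,
    halving the effective register width bs while eta = n_and / bs grows."""
    points = []
    bs = register_bits
    while bs >= 1 and n_and % bs == 0:
        points.append({"bs": bs, "eta": n_and // bs})
        bs //= 2
    return points
```

For #AND = 64 this reproduces the two options of Figure 8: (bs = 32, η = 2) and (bs = 16, η = 4), followed by further serialized points down to bs = 1.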

Performance evaluation and discussion
To evaluate the security vs. performance tradeoff of the shuffling tuples and shuffling shares options, we complement the previous security evaluations by measuring the total cycle count of protected implementations running on a 32-bit ARM Cortex-M3 with the design parameters from Table 2. Precisely, when only masking, we use a bitslice instantiation of Algorithm 1 that is repeated η times to operate #AND bitwise secure multiplications. The security level is then given by Equation 7. For the shuffling tuples strategy, a similar approach is used, but the η ISW multiplications are performed out of order thanks to the generation of a single permutation. The security level is then given by Equation 15. For the shuffling shares strategy, we use an efficient implementation that does not require pre-computing randomness but rather generates it on-the-fly. Its security level is given by Equation 16. Regarding the randomness cost, we use a (software) PRNG and set its latency according to the parameter r. The resulting execution time versus security curves are reported in Figure 9 for #AND = 128 and in Figure 10 for #AND = 512. The x-axis is the number of cycles to perform the secure multiplications normalized by #AND. The y-axis is the data complexity of the worst-case attack. Each data point is for a different masking order d, with the leftmost being d = 2 and d increasing with steps of one when moving to the right. Red curves are for masked-only implementations, green ones for shuffling tuples and blue ones for shuffling shares. Continuous lines are for bs = 32, hence using the full physical register. Dashed curves are for bs = 16, hence doubling the serialization. Overall, these figures allow comparing all the proposed combinations of countermeasures in terms of their security level and performance, for different masking security orders. We draw two general conclusions regarding the combination of masking and shuffling from these plots.
First, the choice of bs = 32 is always better than bs = 16. That is, for a fixed execution time (x-axis), the resulting attack data complexity is always larger for bs = 32 than for bs = 16, and this trend holds for lower bs values. It implies that a designer should first favor the parallelization of the masked computations to perform, and shuffle what is left to serialize. It also confirms the relevance of bitslicing when combining masking and shuffling. Second, the combination of masking and shuffling benefits from a larger #AND. For example, if 80 cycles are spent per bitwise multiplication, the resulting security is about 2^40 for #AND = 128 for shuffling shares according to Figure 9b. For the same number of clock cycles per bitwise multiplication, having #AND = 512 provides a security around 2^50 according to Figure 10b. We note that our results show that large bs values are beneficial if the permutation is leak-free (concretely: if the noise is large enough). The case where significant information can be recovered from the permutation leakage makes shuffling useless. But there may exist corner cases where the permutation is close to being recovered, in which decreasing the bs parameter may marginally affect our conclusions by making the recovery of a larger permutation more difficult. We doubt this corner case will be practically useful (since for this approach to be relevant, it would have to compensate for the performance increase that the increased serialization causes) and leave it as an open problem.
Masking only vs. shuffling tuples. By comparing the masked-only implementation with the shuffling tuples strategy, we observe that shuffling tuples does not bring a significant gain. That is, for a fixed performance level (i.e., value on the x-axis), shuffling tuples at best brings a marginal improvement. This can be explained by the fact that shuffling tuples is always slower than masking only due to the permutation generation, while only bringing a gain of η in attack complexity. Increasing this security gain would require reducing bs, and thus favoring serialization. But as shown above, this degrades the time vs. security tradeoff, which rather pushes for masking with maximum parallelism.
Masking only vs. shuffling shares. Interestingly, the conclusion is more nuanced when considering the shuffling shares approach. In this case, the interest of combining countermeasures essentially depends on the randomness cost r, the noise level MI_u(A_i; L) and the amount of independent operations available #AND. That is, for expensive randomness, relatively low noise (still sufficient for the countermeasures to be effective) and high #AND, shuffling shares is the best solution. Whenever decreasing #AND or the randomness cost, or increasing the noise level, this advantage vanishes and masking only can become the best option (e.g., in Figure 9c). We summarize this design space with Figure 11, where the x-axis is the randomness cost and the y-axis is the noise level MI_u(A_i; L). Black areas represent contexts where masking alone is more efficient than shuffling shares to reach a given security target (here, 64 bits). White areas represent contexts where shuffling shares is more efficient. Black areas correspond to cheap randomness and high noise levels (and lie in the bottom left). By increasing the number of independent multiplications #AND (and accordingly η), this area is reduced and the masking + shuffling approach becomes beneficial. Concretely, this makes it more practically relevant for a primitive like Keccak with its 1600-bit state than for the AES block cipher with its 128-bit state.

Real-world (low-noise) case study
Based on the evaluations of Section 5, we can conclude that the new (shuffling shares) combination of masking and shuffling that we propose in this paper can indeed be an interesting asset for the design of side-channel secure implementations. It could in particular be quite effective to protect bitslice designs based on large permutations (e.g., Keccak [BDPA13]), where a lot of parallel operations can be identified. Yet, these evaluations have been performed under the assumption of a sufficient noise level that may not be observed in practice. In this section, we therefore complement these analyses with a real-world evaluation of the masking + shuffling countermeasure implemented in a commercial MCU. For this purpose, the following experiments aim to quantify the information reduction per share that this combination leads to. Concretely, we look at the information reduction for the first share a_c^0 of a 4-share shuffled and masked implementation. After discussing the PI on the permutation, we study the ratio between the shares' PI for this implementation (i.e., PI_{m+s}(A_c^0; L)) and the one of the masked-only implementation (i.e., PI_m(A_c^0; L)). The motivation for this investigation comes from the intuition that shuffling could be an option to protect devices where the physical noise is not sufficient for masking to be directly effective. To clarify whether this expectation is founded, we propose a worst-case evaluation where the adversary has full knowledge and control of her target implementation during a profiling phase. In particular, she has knowledge of the randomness used to generate the permutations, enabling profiled attacks against the shuffling as in Section 3. We conclude this section by briefly discussing how our observations evolve when relaxing the adversarial capabilities, in the spirit of a backwards evaluation [ABB+20].

Target and measurement setup
We investigate the implementation of ISW multiplications enhanced with the shuffling shares solution, on the 32-bit ARM Cortex-M3 of the STM32 VLDISCOVERY board, where the scheme is implemented with 4 shares and with permutation sizes of η = 4 and η = 16. We modified the board and removed the decoupling inductances and capacitors. We used the available slot to add an 8 [MHz] external crystal and derived the maximum 24 [MHz] internal clock from it. To measure the side-channel leakage L, we placed a current probe (a Tektronix CT1, with 1 [GHz] bandwidth) on the dedicated jumper between the on-board power regulator and the MCU power pins. Finally, we sampled this signal with a 12-bit resolution at a sampling rate of 500 [MSamples/s] thanks to a PicoScope 5244D.

Worst-case adversary
Attack description. We evaluated this implementation with the adversary of Section 3, using the model m_New(y|l) from Equation 13. We extract information on the secret variables using a template attack in a principal subspace, according to the methodology of [BS20, BS21]. Precisely, we estimate the leakage PDFs with Gaussian templates after dimensionality reduction using linear discriminant analysis [APSQ06]. The same modeling is used both for the permutation indexes' and the shares' leakage.
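As a rough, self-contained stand-in for this profiling step (the evaluation uses Gaussian templates after linear discriminant analysis; the toy below replaces LDA with a simple mean-difference projection and a pooled variance, so it only conveys the flavor of the methodology, and all names are ours):

```python
import numpy as np

def profile(traces, labels):
    """Fit one pooled-variance Gaussian template per label in a 1-D subspace."""
    classes = np.unique(labels)
    means = np.stack([traces[labels == c].mean(axis=0) for c in classes])
    w = means[-1] - means[0]          # crude stand-in for an LDA direction
    w /= np.linalg.norm(w)
    proj = traces @ w
    mu = np.array([proj[labels == c].mean() for c in classes])
    var = proj.var()                  # shared (pooled) variance
    return w, mu, var, classes

def classify(trace, model):
    """Return the label with the highest Gaussian log-likelihood."""
    w, mu, var, classes = model
    p = trace @ w
    ll = -(p - mu) ** 2 / (2 * var)
    return classes[int(np.argmax(ll))]
```

In the actual evaluation, the projection is a multi-dimensional LDA subspace and a template is built for every permutation index and every share, but the profile-then-classify structure is the same.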
Permutation leakage. The permutation leakage analysis is reported in Figure 12. On the left figures, the x-axis is the time and the y-axis is the SNR (in log-scale) of the indexes of the permutation θ_c. We observe that the maximum value for each of them is around 1, and that many other dimensions lead to an SNR two orders of magnitude lower (especially for η = 16). These values suggest a strongly leaking permutation, which is confirmed by the graph on the right, where the x-axis reports the number of dimensions of the leakage vector exploited by the adversary |L| and the y-axis gives (in log-scale) the information on the permutation indices θ_i. For most of them (especially for η = 4), the PI is close to H(θ_i), meaning that the adversary can strongly reduce the entropy.

Information loss on shares. The impact of the shuffling on the shares' leakage is reported in Figure 13. The green dashed curves represent the expected impact of the shuffling in case the permutation indexes do not leak (as assumed in Section 4), namely 1/η. The red dashed curve equals 1 and corresponds to a level of leakage such that shuffling is completely ineffective. As the number of dimensions exploited by the adversary increases, this ratio gets closer to one. For η = 4, it is stuck at 1, meaning that the adversary directly gains full knowledge of the permutation and is therefore not impacted by the shuffling. This is expected from the results in Figure 12. For η = 16, a similar behavior is observed, but a larger number of dimensions must be exploited. For |L| = 2,000, the average ratio is equal to 0.92, meaning that the information per share is only reduced by 8%. For example, it implies that the gain factor of this combination of countermeasures with 8 shares is around 2 (it would be 16^8 with sufficient noise). We refer to [BS21], Figure 16, for details about the impact of such an information reduction factor on the worst-case security of bitslice masked implementations.
We conclude that this implementation lies in the low-noise region, where a limited gain is obtained by combining masking & shuffling.

Discussion. We note that these conclusions are based on a worst-case attack strategy. Non-profiled attacks unable to characterize (and therefore exploit) the permutation leakages may lead to a more positive conclusion (i.e., force the adversary to sum the noise variances of her shuffled operations). Yet, assuming that only such weaker attacks are possible appears as a risky strategy given recent progress (e.g., using deep learning), where a worst-case information extraction is approached with limited knowledge of the target implementation [MDP20]. These experiments also confirm the interest of applying shuffling to large designs (e.g., permutation-based), since large η values make the attack more difficult. Finally, we insist that these experiments do not show that masking and shuffling cannot be implemented securely (the previous sections showed the opposite). What they show is that when the noise level provided by a leaking device is too low for masking to be effective, it is in general too low for shuffling to be effective as well. Indeed, bitslice masking on similar MCUs generally requires a large number of shares [BS21]. It may even be more challenging to implement shuffling securely on low-cost embedded devices, since the memory accesses it relies on can be leakier than (for example) bitslice computations. So overall, it remains an important challenge to ensure a sufficient level of noise on embedded MCUs, so that masking, shuffling and their combination become effective. Reaching this goal with existing low-cost devices (similar to the ARM Cortex-M3 we analyzed) is an interesting research direction.
Yet, the recurrent difficulties caused by low physical noise for the implementation of side-channel countermeasures also suggest that solving the issue at the technological level by guaranteeing a minimum level of intrinsic noise on security MCUs could be highly beneficial in terms of security vs. performance tradeoff.