PROLEAD

Even today, Side-Channel Analysis attacks pose a serious threat to the security of cryptographic implementations fabricated with low-power and nanoscale feature technologies. Fortunately, the masking countermeasures offer reliable protection against such attacks based on simple security assumptions. However, the practical application of masking to a cryptographic algorithm is not trivial, and the designer may overlook possible security flaws, especially when masking a complex circuit. Moreover, abstract models like probing security allow formal verification tools to evaluate masked implementations. However, this is computationally too expensive when dealing with circuits that are not based on composable gadgets. Unfortunately, using composable gadgets comes at some area overhead. As a result, such tools can only evaluate subcircuits, not their compositions, which can become the Achilles’ heel of such masked implementations. In this work, we apply logic simulations to evaluate the security of masked implementations which are not necessarily based on composable gadgets. We developed PROLEAD, an automated tool analyzing the statistical independence of simulated intermediates probed by a robust probing adversary. Compared to the state of the art, our approach (1) does not require any power model as only the state of a gate-level netlist is simulated, (2) can handle masked full cipher implementations, and (3) can detect flaws related to the combined occurrence of glitches and transitions as well as higher-order multivariate leakages. With PROLEAD, we can evaluate masked implementations that are too complex for existing formal verification tools while being in line with the robust probing model. Through PROLEAD, we have detected security flaws in several publicly-available masked implementations, which have been claimed to be robust probing secure.


Introduction
Since Kocher et al. reported the first Side-Channel Analysis (SCA) attacks as a threat to the security of cryptographic hardware [Koc96,KJJ99], protecting concrete implementations of cryptographic algorithms has been attracting the researchers' attention. Based on this groundbreaking work, the past twenty years of research have shown successful SCA attacks [KJJ99,BCO04,GBTP08] based on traces obtained by measuring one out of potentially various physical characteristics of a device [Koc96,KJJ99,GMO01,HS13,GST14]. In particular, if a designer does not consider SCA as a serious attack vector, an adversary may exploit the dependency between a physical characteristic of the device and the processed data revealing sensitive information efficiently.
As there has been an urgent need to mitigate information leakage, numerous countermeasures that aim to protect cryptographic devices against SCA have been proposed during the past decades. So far, masking, based on the concept of secret sharing [Sha79], is the most popular proposal [CJRR99]. Masking owes its popularity to its basic security assumptions, which simplify the design and verification of concrete masking schemes such as the ones given in [ISW03,Tri03,NRS11,RBN+15,GMK16,GMK17,GIB18,GM18]. If all input sharings of a masking scheme are drawn uniformly and the noise level is sufficient, conceptually simple adversary models abstract the SCA security of the scheme [DDF14]. A basic and efficient, and therefore widely used, adversary model is the d-probing model [ISW03]. Due to its simplicity, verification under the d-probing model is efficient but does not cover physical shortcomings. Consequently, the more advanced robust probing model has been built on top of the d-probing model to cover different types of physical defaults [FGMDP+18], including glitches, transitions, and couplings.
However, designing, implementing, and verifying a masked implementation of a cryptographic algorithm is often a manual and error-prone task. Consequently, some masking schemes are shown to be insecure because of, so far, unnoticed design flaws or inaccurate formal modeling of the adversary [MMSS19].
A common approach to verify the experimental security of a masked implementation is to collect power or electromagnetic traces from a prototype and perform a leakage assessment based on statistical hypothesis testing, e.g., the t-test [GJJR11] or the χ²-test [MRSS18]. Leakage assessment takes place after fabricating a prototype of the device. If the leakage assessment reports a leakage, the design needs to be analyzed and the flaw removed, which leads to the fabrication of a new device under test. This procedure is repeated until no further leakage is detected.
The fabrication can be a new design configured on an FPGA with a significantly short time to market. However, in the case of ASIC design procedures, refabrication is time-consuming and expensive; hence, leakage evaluation techniques at early design stages, i.e., before fabrication of the chip, are getting more and more popular. In particular, formal verification and leakage simulation have become promising research fields. Nevertheless, their concrete instantiations, given as automated verification frameworks, come with restrictions. Formal verification tools can efficiently verify the robust probing security of a tiny circuit, e.g., a small S-Box, but the complete formal verification of a larger design, like a masked round-based implementation of a cipher, becomes computationally infeasible. While leakage simulation tools can handle larger circuits, they mainly check resistance against a particular Differential Power Analysis (DPA) or Correlation Power Analysis (CPA) attack, where assumptions about the leakage and power models as well as targeted intermediate values are required [HSZ13,FGBR20,NPH+20,ZPTF21]. More formally, they cannot evaluate the given circuits based on the robust probing model. As a shortcoming, they may report an implementation as secure, while an undetected design flaw (caused by not being probing secure) is exploitable by another attack not covered by the evaluations.
To build provably-secure masked implementations of large designs, small and securely composable masked circuits, so-called gadgets, are applied as building blocks of larger circuits. The security properties of such gadgets can then be individually verified by formal verification tools, e.g., [BBC+19,KSM20]. However, formal verification of a masked full cipher is still not feasible if the implementation is built from gadgets that are not necessarily composable. These implementations are potentially beneficial and usually more efficient compared to their gadget-based variants. For example, [KMMS22] provides a comparison of different masked byte-serial implementations of the Advanced Encryption Standard (AES). However, their probing security remains unproven. We must rely on experimental leakage assessments that might be inaccurate, incomplete, setup-dependent, and therefore not always trustworthy. In short, we can evaluate the leakage of large designs in an experimental setting, but we cannot see this as a security proof. Even if a leakage assessment detects no weaknesses, we can never be sure that no security flaw exists.

Glitch-Extension.
Glitches are unexpected signal transitions occurring in combinational circuits. Due to imbalanced delay paths and switching delays, signals may arrive asynchronously at a gate, resulting in a different but temporary output before the output signal reaches its intended state. To cover glitches within the robust probing model, glitch-extended probes replace all probes, allowing an adversary to access all stable intermediates (either register outputs or primary inputs) that contribute to the probed wire.

Transition-Extension.
Transitions potentially recombine the contents of the same memory element in two consecutive invocations. Hence, by overwriting a memory element, i.e., a register or a flip-flop, an adversary gains information about the old and the new values. To model transitions in the robust probing model, all probes are replaced by transition-extended probes recording the signal during two consecutive invocations.

Coupling-Extension.
Couplings lead to unintentional and undesired recombinations of values on adjacent wires. To model couplings in the robust probing model, we need to replace all probes with coupling-extended probes that observe multiple neighboring wires.

Security Notions
According to the probing model, a circuit achieves (g, t, c)-robust d-probing security if a d-probing adversary cannot learn anything about the processed secret. More formally, the joint distribution of any observation set is statistically independent of the distribution of any secret. Unfortunately, for the verification of d-probing security, the statistical independence of each observation set must be checked. As the number of probe combinations grows with d and the complexity of the circuit, verification of d-probing security becomes infeasible for large circuits and particularly at higher orders. To bypass the verification of d-probing security, small and secure building blocks, so-called gadgets, are composed to construct more complex circuits. Hence, the d-probing security of a large circuit is satisfied if all involved gadgets are composable without violating d-probing security. To formally verify the composability of a gadget, Barthe et al. introduced the notions of d-Non-Interference (NI) and d-Strong Non-Interference (SNI). While gadgets satisfying d-SNI are indeed composable, it turned out that d-SNI is an over-conservative security notion in practice and leads to significant area and randomness overheads [CS20]. To relax the requirements in terms of area and randomness, Cassiers et al. introduced d-Probe-Isolating Non-Interference (PINI) as a less conservative and, therefore, more efficient security notion [CS20].

Boolean Masking
Boolean masking is a common and well-studied approach to protect hardware implementations against SCA attacks [CJRR99]. According to the concept of secret sharing [Sha79], Boolean masking splits a sensitive variable $X \in \mathbb{F}_2^n$ into $s > 1$ independently and uniformly distributed shares $\{X_0, \ldots, X_{s-1}\}$ such that $X = \bigoplus_{i=0}^{s-1} X_i$. To generate a sharing of $X$, we sample $\{X_0, \ldots, X_{s-2}\}$ uniformly and randomly from $\mathbb{F}_2^n$. For the remaining share, we compute $X_{s-1} = X \oplus \bigoplus_{i=0}^{s-2} X_i$. To avoid SCA leakages about $X$, all operations of the cipher need to be performed on $\{X_0, \ldots, X_{s-1}\}$, i.e., the shared representation of $X$. To achieve security under the d-probing model (cf. Section 2.1), it must hold that $s \geq d + 1$.
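The sharing procedure can be illustrated with a minimal Python sketch. The function names `share` and `unshare`, and the choice to represent elements of F_2^n as Python integers of n bits, are assumptions made for illustration only and are not part of PROLEAD.

```python
import secrets
from functools import reduce
from operator import xor
from typing import List


def share(x: int, n_bits: int, s: int) -> List[int]:
    """Split the n_bits-wide value x into s Boolean shares (illustrative helper)."""
    if s < 2:
        raise ValueError("Boolean masking needs s > 1 shares")
    # X_0 .. X_{s-2} are sampled uniformly at random from F_2^n.
    shares = [secrets.randbits(n_bits) for _ in range(s - 1)]
    # The remaining share X_{s-1} is chosen so that the XOR of all shares equals x.
    shares.append(reduce(xor, shares, x))
    return shares


def unshare(shares: List[int]) -> int:
    """Recombine a Boolean sharing by XOR-ing all shares."""
    return reduce(xor, shares, 0)


# Usage: a 3-share encoding of an 8-bit value, sufficient for d = 2 since s >= d + 1.
example_shares = share(0xA5, 8, 3)
recovered = unshare(example_shares)
```

Any proper subset of the shares is uniformly distributed here, which is the property the d-probing model relies on when s ≥ d + 1.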

Statistical Hypothesis Tests
Hypothesis tests apply statistical procedures to data to assess the strength of evidence against the underlying null hypothesis $H_0$. Usually, $H_0$ denotes a general statement that there is no relation between two groups of samples. To accept or reject $H_0$, hypothesis tests provide a quantitative value in the form of a significance level.
For our purposes, we evaluate whether there is a significant dependency between two categorical variables R with r different categories and C with c different categories, both from a single population. Hence, the corresponding $H_0$ states that R and C are statistically independent. The frequencies of observed samples of R and C are stored in a two-way contingency table with r and c categories. For better understanding, we depict a generic two-way contingency table in Table 1.
The associated p-value, i.e., the probability of observing a test statistic at least as extreme as the computed one under the assumption that $H_0$ holds, is finally computed based on the probability density function f and the gamma function Γ.
G-test. The G-test [MoD09] is an alternative to the χ²-test of independence based on a likelihood ratio test. The method uses the multinomial distribution. It tests the goodness of fit of the observed to the expected frequencies, under the assumption that $H_0$ holds, by estimating the G-statistic x as follows.

$$x = 2 \sum_{i,j} F_{i,j} \ln\left(\frac{F_{i,j}}{E_{i,j}}\right)$$

where the expected frequencies $E_{i,j}$ are computed as for the χ²-test, i.e., Equation (1).
Since x is approximately χ²-distributed with the same number of degrees of freedom v, the p-value is computed in the same way as for the χ²-test, i.e., Equation (2).
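As an illustration of the two test statistics, the following sketch computes expected frequencies, Pearson's χ² statistic, and the G-statistic for a contingency table given as a list of rows. It uses the standard textbook definitions (expected frequency = row sum · column sum / total, degrees of freedom = (r − 1)(c − 1)); the function names are hypothetical and the code is not taken from PROLEAD.

```python
import math


def expected_frequencies(table):
    """Expected cell counts under independence: row_sum * col_sum / total."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


def chi2_statistic(table):
    """Pearson's chi-squared statistic; cells with expected count 0 are skipped."""
    exp = expected_frequencies(table)
    return sum((f - e) ** 2 / e
               for row, erow in zip(table, exp)
               for f, e in zip(row, erow) if e > 0)


def g_statistic(table):
    """G-statistic 2 * sum F * ln(F / E); empty cells contribute 0."""
    exp = expected_frequencies(table)
    return 2.0 * sum(f * math.log(f / e)
                     for row, erow in zip(table, exp)
                     for f, e in zip(row, erow) if f > 0)


def degrees_of_freedom(table):
    """(r - 1) * (c - 1) for an r x c contingency table."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Usage: a dependent and an independent 2 x 2 table.
dependent = [[10, 20], [20, 10]]
independent = [[10, 20], [20, 40]]
```

For the independent table, observed and expected frequencies coincide, so both statistics vanish. For the dependent table, the two statistics are close but not identical, which reflects the approximation relation between them.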

Hypothesis Testing for Leakage Detection
Due to its simplicity and efficiency, the χ²-test is more common than the G-test. If the contingency table is small, e.g., 2 × 2, the χ²-test can be easily calculated by hand. Moreover, the χ²-test only requires a squaring, while the G-test evaluates the more expensive logarithm function. In the field of leakage detection, the χ²-test is a complement to Welch's t-test, as it detects some flaws which the t-test cannot detect. For a detailed comparison of the χ²-test and the t-test, we refer to [MRSS18]. However, as the χ²-test is an approximation of the G-test based on a Taylor expansion, it is not as accurate as the G-test. If $F_{i,j}$ and $E_{i,j}$ differ considerably, the χ²-test approximation overestimates the outlier, which may lead to erroneous results, while the G-test computes correctly [Sok95,Hoe12]. We remark that these inaccuracies occur if we deal with a sparse contingency table, i.e., the amount of data is low while the contingency table is large. As contingency tables in the domain of SCA might often be sparse, we choose the G-test as the underlying hypothesis test of our tool PROLEAD.
For the sake of completeness, we note that, in contrast to the G-test, exact hypothesis tests such as Fisher's exact test [Fis22,Agr92] do not approximate the p-value. Hence, they provide an accurate significance level. Exact tests are of great relevance when dealing with small data sets, as the approximation error of the χ²-test increases the more expected frequencies are smaller than 5 [Yat34].
On the other hand, evaluating larger contingency tables with an exact hypothesis test is not feasible. Usually, Fisher's exact test is applied to contingency tables of size up to 2 × 4 or combined with a probabilistic heuristic such as Monte Carlo sampling [M+87].

Statistical Power Analysis
Whenever we apply statistical hypothesis tests, we must ensure that we can trust their results. In particular, our experiment must guarantee that the underlying hypothesis test detects even a small dependency. The question of whether a hypothesis test reliably detects a dependency is closely related to the metric of statistical power. To understand the statistical power, we first define the estimation errors that can occur while performing hypothesis tests in the context of SCA.
Definition 1 (False Positive). False positives occur if we reject $H_0$ despite it being true. It means that PROLEAD wrongly classifies a secure design as insecure.
Definition 2 (False Negative). False negatives occur if we accept $H_0$ despite it being false. It means that PROLEAD cannot detect existing leakage and wrongly classifies an insecure design as secure.
While the p-value gives the probability of a false positive, the statistical power is related to the false-negative probability $\beta$. As the power of a test defines the probability that it rejects $H_0$ correctly, it is computed as:

$$1 - \beta = 1 - F(x_{crit}; v, \lambda), \qquad F(x; v, \lambda) = \sum_{j=0}^{\infty} e^{-\lambda/2} \frac{(\lambda/2)^j}{j!} \, Q(x; v + 2j), \qquad Q(x; k) = \frac{\gamma(k/2, x/2)}{\Gamma(k/2)}$$

Here, F denotes the cumulative distribution function of the noncentral χ²-distribution, Q denotes the cumulative distribution function of the central χ²-distribution, and γ denotes the lower incomplete gamma function. Moreover, $x_{crit}$ defines the critical value of the underlying distribution for a given confidence level α. Hence, $x_{crit}$ is the α-quantile of the χ²-distribution. Last, λ is the noncentrality parameter depending on φ and $F_{*,*}$ (given in Table 1) and is computed as:

$$\lambda = \varphi^2 \cdot F_{*,*}$$

As we want to detect even small effects, we choose φ = 0.1 following the proposal of Cohen [Coh88], which classifies φ = 0.1 as a small effect, φ = 0.3 as a medium effect, and φ = 0.5 as a large effect. Consequently, we numerically estimate $F_{*,*}$ for fixed φ = 0.1 and $\beta \leq 10^{-5}$.
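The numerical estimation of the required sample size can be sketched in pure Python as follows. The sketch assumes the standard Poisson-mixture series for the noncentral χ² distribution function and the relation λ = φ² · N for N samples. The function names, the bisection strategy, and in particular the false-positive rate `alpha = 1e-5` are assumptions chosen for illustration; the text above only fixes φ = 0.1 and β ≤ 10⁻⁵.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function via its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-17 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def chi2_cdf(x, k):
    """CDF of the central chi-squared distribution with k degrees of freedom."""
    return reg_lower_gamma(k / 2.0, x / 2.0)


def ncx2_cdf(x, k, lam):
    """CDF of the noncentral chi-squared distribution as a Poisson mixture."""
    if lam == 0.0:
        return chi2_cdf(x, k)
    half = lam / 2.0
    j_max = int(half + 20.0 * math.sqrt(half) + 50.0)  # truncate the series
    total = 0.0
    for j in range(j_max + 1):
        log_weight = -half + j * math.log(half) - math.lgamma(j + 1.0)
        total += math.exp(log_weight) * chi2_cdf(x, k + 2 * j)
    return total


def chi2_quantile(p, k):
    """Smallest x with chi2_cdf(x, k) >= p, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, k) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return hi


def min_samples(k, phi=0.1, alpha=1e-5, beta=1e-5):
    """Smallest N whose false-negative probability is at most beta (assumed alpha)."""
    x_crit = chi2_quantile(1.0 - alpha, k)

    def false_negative(n):
        return ncx2_cdf(x_crit, k, n * phi * phi)

    hi = 1
    while false_negative(hi) > beta:
        hi *= 2
    lo = hi // 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if false_negative(mid) > beta:
            lo = mid
        else:
            hi = mid
    return hi


# Usage: required number of samples for a table with one degree of freedom.
n_required = min_samples(1)
```

The power series is adequate for the moderate arguments occurring here. For very large critical values, a library routine such as a noncentral χ² implementation from a numerical package would be the more robust choice.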

Related Works
The evaluation of masked implementations has a pivotal role in protecting devices against SCA attacks. Therefore, many researchers contribute to this topic by presenting automated tools that reveal flaws in protected implementations. So far, research concentrates on two different approaches, namely formal verification and leakage simulation.
Formal Verification. Automated tools for formal verification can prove the security of masked implementations with respect to predefined adversary models. As formal verification is complete, no false negatives occur and the result can be seen as a security proof under the chosen model. Sometimes, however, the verification is too conservative, leading to false positives [PMK+11,KSM20]. The reduction of masking's security properties to simple and abstract adversary models allows the evaluation of complex circuits, i.e., masked implementations operating on many shares. Nevertheless, verification under more sophisticated adversary models, i.e., models that cover physical defaults, is computationally too complex for larger circuits. That is why formal verification tools are mostly applied only to gadgets. Among the relevant security notions is d-PINI [CS20], which allows composability as well as a more efficient implementation compared to d-SNI. All security notions can be verified under the (0, 0, 0)- and (1, 0, 0)-robust d-probing model, i.e., without or with glitches. Concerning efficiency, SILVER achieves fast verification of gadgets at higher orders as well as small S-Boxes at low orders. On the negative side, SILVER is not able to analyze even a medium-sized masked circuit. With a recent extension [MKSM22], SILVER is also able to evaluate a certain form of circuits under the (1, 1, 0)-robust d-probing model. Hence, SILVER was the first formal verification tool that covers glitches and transitions simultaneously. While the current versions of maskVerif and SILVER can verify a design under various conditions, the overhead in terms of verification time, especially for verification with SILVER, is impractical for large circuits.
Leakage Simulation. Since the formal verification of larger circuits becomes infeasible, another research branch focuses on evaluating these circuits by simulating the power consumption of a particular prototype running the implementation. For efficiency reasons, only a fixed set of input vectors is simulated, resulting in an incomplete evaluation. Hence, the high performance of leakage simulation comes at the cost of accuracy and false negatives. We remark that [BBYS22] presents a wide range of leakage simulators. Therefore, we focus on how simulators abstract the leakage. The accuracy of a simulator mainly depends on the abstraction level of the simulation. Simulations at the Register-Transfer Level (RTL) are helpful to verify the security of a hardware implementation during the earliest design stage. Since a high-level description is the only available source, the simulation cannot take any hardware-specific internals into account. Usually, the simulation of a trace at RTL is done by simulating the internal logic given in the high-level description and applying a leakage function like Hamming weight (HW) [AMM+06,Rep16,FGBR20].

Technique
The procedure PROLEAD follows to verify d-probing security consists of two steps. Initially, PROLEAD generates all probing sets that must be considered for the desired leakage verification. Later, a verification step goes through all relevant probing sets and tests their information leakage. During verification, a simulator generates the inputs for the circuit (based on the given settings) and simulates the circuit to obtain intermediate values.
Afterwards, a statistical hypothesis test evaluates the independence of the intermediate's underlying distributions. In this section, we present both steps in detail. We start by reviewing the leakage models PROLEAD supports and give some information about the specification of the settings.

Notation
We denote functions by sans-serif lower-case characters, e.g., f(.). We use an upper-case and bold character, like X, to denote a list of elements. Further, we use subscripts to refer to a specific element of a list. For example, $x_i \in X$ denotes the element with index i in the list.

Circuit Model
It holds that every sequential circuit that contains no combinational loop (i.e., not covering circuits with an asynchronous design architecture) can be uniquely modeled as a Mealy machine [Mea55] following the schematic of Figure 1. The circuit model combines combinational logic with a single register stage containing all synchronization elements. Hence, even if a circuit encompasses multiple register stages, it is modeled as it is depicted in Figure 1. The combinational logic processes the primary inputs I and the register state S, and returns all register inputs and primary outputs O.
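To make the circuit model concrete, the following sketch simulates a sequential circuit as a Mealy machine: one combinational function maps the primary inputs and the register state to the next register state and the primary outputs. The names `run_mealy` and `toy_logic` and the one-bit toy circuit are hypothetical and serve only as an illustration.

```python
def run_mealy(comb_logic, initial_state, input_sequence):
    """Clock a Mealy machine: comb_logic(inputs, state) -> (next_state, outputs)."""
    state = initial_state
    outputs = []
    for inputs in input_sequence:
        # The combinational logic sees the primary inputs and the register state ...
        next_state, out = comb_logic(inputs, state)
        outputs.append(out)
        # ... and the single register stage stores the register inputs at the clock edge.
        state = next_state
    return state, outputs


def toy_logic(inp, state):
    """Hypothetical 1-bit circuit: next state = state XOR inp, output = state AND inp."""
    return state ^ inp, state & inp


# Usage: four clock cycles of the toy circuit starting from state 0.
final_state, observed = run_mealy(toy_logic, 0, [1, 1, 0, 1])
```

Circuits with several register stages fit the same interface if all registers are collected into one state value.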

[Figure 1: Circuit model with the blocks Combinational Logic and Register Stage.]

Each vertex $v \in V$ models a combinational gate or a register cell, while we model each connection between the gates, i.e., wires, as an edge $e \in E$.
• The underlying Boolean function $f_v$ computed by $v$.
We remark that $e \in E$ only specifies the signal itself (i.e., the output value of a gate) but not the connection between multiple gates. Hence, a connection between two gates $v_0$ and $v_1$ exists if $e \in E^{out}_{v_0}$ and $e \in E^{in}_{v_1}$. Further, $e \in E$ can be the input of multiple gates, as the same signal can be the input of multiple gates. Additionally, we create a list

Leakage Models
To verify d-probing security, PROLEAD can be instructed to consider a specification of the (g, t, c)-robust d-probing model [FGMDP+18].
Definition 4 (Probe). Let (V, E) be the representation of a sequential circuit and T be a list of considered time instances, i.e. clock cycles. A probe p = (e, t) with e ∈ E and t ∈ T records a signal on wire e during clock cycle t.
Definition 5 (Probing Set). A probing set $P = \{p_0, \ldots, p_{|P|-1}\}$ defines a list of probes. We denote the extensions applied to a probing set $P$ with superscripts. For example, if $P$ contains standard probes, we refer to the same list when extended by glitches by $P^g$.
By default, we perform the verification step under the (1, 0, 0)-robust d-probing model. This so-called glitch-extended d-probing model is a widely-used adversary model for hardware implementations. It considers physical defaults in terms of glitches. To formalize the behavior of glitches, i.e., their propagation through combinational logic, we utilize the conservative glitch-extension procedure from [FGMDP+18] given in Definition 6.
Definition 6 ((1, 0, 0)-Robust d-Probing Model). Consider a list of d-probing sets $P$. We transform all d-probing sets $P_i = \{p_0, \ldots, p_{d-1}\} \in P$ into glitch-extended probing sets $P^g_i \in P^g$ by substituting each probe $p \in P_i$ individually. We substitute $p = (e, t)$ by all probes placed on the input wires of combinational gates that contribute to $e$ and record during clock cycle $t$.
Then, the verification step goes through all $P^g_i \in P^g$ and analyzes each glitch-extended probing set. We give the details of the probe extension procedure in Section 4.4. Further, we can expand the verification by considering the joint occurrence of glitches and transitions under the (1, 1, 0)-robust d-probing model. Usually, hardware implementations are analyzed concerning glitches only. However, transitions can lead to security flaws, especially in iterative circuits, e.g., round-based cipher designs [HSS12,CS21]. Formalizing the influence of transitions, i.e., value changes, whose leakage depends on the previous as well as the new value, is done by extending all $p \in P^g$ by transitions according to Definition 7.
We should highlight that PROLEAD always considers glitches and does not support the non-robust (0, 0, 0) case. As PROLEAD is supposed to evaluate hardware circuits, we cannot ignore glitches. For a correct evaluation, the user should also consider transitions. However, we made the coverage of transitions an optional feature to allow the designer to identify the source of leakage. More precisely, if PROLEAD reports the detection of leakage for glitches and transitions, the designer can turn off the transitional effect and re-evaluate the circuit to find out whether the detected leakage is due to the glitches or not. Further, PROLEAD does not cover couplings, as this would require information about the layout, i.e., placement of cells and routing of signals. Since PROLEAD works with the gate-level netlist as the result of a synthesis process, such information is not available. Moreover, the user can restrict the verification of higher-order probing security to univariate leakages or cover multivariate leakages as well.
Univariate Leakage. Univariate attacks exploit leakage during a single point in time. Hence, we formalize the verification of univariate d-order probing security by an attacker who places up to d probes at the same clock cycle. Therefore, an attacker can spot up to d intermediates during a single but arbitrary clock cycle. We model the attacker's capabilities by verifying all relevant probing sets whose probes record during the same clock cycle. More formally, it holds that all probes p ∈ P i record during the same clock cycle.
Multivariate Leakage. In contrast to the univariate setting, a v-variate attack combines information of v different points in time [MM12]. Consequently, a multivariate attacker can place each probe at an arbitrary clock cycle. To formalize the adversary's behavior, we make no restrictions on the probing sets concerning the clock cycles. Hence, a probing set can contain probes at any relevant gate and record at any clock cycle. Note that by 'any relevant gate' we refer to the list $E^*$ defined below, and by 'any clock cycle' we still stay with the targeted clock cycles (given as a list $T$) defined in the configuration file by the user.

Configuration
PROLEAD receives three input files.
1. A gate-level netlist written in Verilog, as used in digital circuit design to abstract the circuit. Such a netlist is produced by synthesizing the circuit's behavioral description (e.g., VHDL or Verilog) using a hardware synthesizer, e.g., Design Compiler [Inc] or Yosys [Wol].
2. A custom library file that defines the functional behavior of each cell of the standard cell library. Any ASIC standard cell library can be used for the synthesis, but the simulator of PROLEAD requires this file to understand how to simulate the cells used in the given netlist. We integrated the functional behavior of most of the cells of the NanGate 45 nm open-cell library.
3. A custom configuration file, allowing the users to specify their requirements. All settings regarding simulation and verification take place in this file. Primarily, the user adjusts the simulator by defining the total number of simulations and the simulation time frame in terms of clock cycles. To start a simulation, the user should ensure a correct initialization of the primary inputs to the circuit by defining an input sequence. The input sequence formalizes the state of all primary inputs during an arbitrary number of initial cycles. For example, in the case of a cryptographic core, it specifies how and when plaintext and key are given to the circuit and how handshaking signals (like reset) are controlled.
For verification, PROLEAD supports a customized list of wires, which is also defined by the user. Formally, a wire is either considered in the verification or ignored. The user specifies this by including the considered wire in the list $E^*$. To ease the definition of $E^*$ by the user, we include all wires in the list by default. Hence, the default case is to perform a complete evaluation in terms of wires. Furthermore, the user specifies the leakage verification by setting an appropriate security order d and choosing a proper leakage model. The configuration file contains more fine-grained settings, which we omit here for the sake of brevity.

Generation of Probing Sets
After reading the given design and configuration files and building the graph (V, E), PROLEAD first generates all d-probing sets that fit the desired leakage evaluation specified in the configuration file. Depending on the leakage model, the probing set generation specifies either $P^g$ or $P^{g,t}$. Both lists encompass all probing sets (either $P^g_i$ or $P^{g,t}_i$) that are considered for verification. In the following, we algorithmically describe all steps required to generate the probing sets.

Extraction of Relevant Wires
We start by determining which $e \in E$ are relevant for the verification procedure. For efficiency reasons, we generate a small list of relevant wires $H$ that fit the defined configuration. The generation of $H$ is presented in Algorithm 1. As we, at least, consider glitches, every probe $p$ gets extended by probes on the whole combinational circuit that contributes to the intermediate signal probed by $p$. According to the circuit model (cf. Figure 1), a probe $p$ on an intermediate signal of the combinational logic leads to additional probes on a subset of primary inputs and register outputs. We consider the n-bit output of the combinational circuit, with $n = |S| + |O|$, as a set of $n$ coordinate functions $f_0(I_0, S_0), \ldots, f_{n-1}(I_{n-1}, S_{n-1})$, where each $f_j$ operates on an individual set of primary inputs $I_j \subseteq I$ and register outputs $S_j \subseteq S$. For illustration, we consider the following example based on a single coordinate function $f_j(I_j, S_j)$. Since the subcircuit computing $f_j(I_j, S_j)$ is fully combinational, a single probe on the output of $f_j(I_j, S_j)$ expands to probes on all signals in $I_j$ and $S_j$. Hence, placing additional probes on intermediate signals of a coordinate function is not necessary, as all inputs of $f_j$ are already covered. Therefore, we do not have to consider a probe on every intermediate signal of the circuit. It is enough to place probes on all output wires of the combinational circuit $\{S \cup O\}$. Consequently, we add the output wires of all coordinate functions to $H$. To this end, we need to be in conformity with the user-defined configuration regarding allowed and ignored signals. More precisely, it should hold that $H \subseteq E^*$. We satisfy these properties for each $e \in H$ by analyzing only elements in $E^*$ (cf. Line 2 of Algorithm 1). To examine if a signal is the output of a coordinate function, we determine whether the circuit propagates the underlying signal to another combinational gate. This is done in Lines 4-8.
Considering the circuit model, given in Section 4.2.1, each e ∈ H can only be either an input of the registers or a primary output which is additionally not an input of any combinational gate.
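The selection rule for relevant wires can be sketched as follows. The netlist representation (a list of dictionaries with the keys `is_register`, `inputs`, and `output`) and the function name `relevant_wires` are hypothetical; the sketch only mirrors the criterion stated above and is not a transcription of Algorithm 1.

```python
def relevant_wires(gates, primary_outputs, allowed_wires):
    """Keep allowed wires that are register inputs or primary outputs
    and that do not feed any further combinational gate."""
    comb_inputs, reg_inputs = set(), set()
    for gate in gates:
        target = reg_inputs if gate["is_register"] else comb_inputs
        target.update(gate["inputs"])
    outputs = set(primary_outputs)
    relevant = []
    for wire in allowed_wires:          # only wires from the user-defined list E*
        if wire in comb_inputs:         # still propagated to a combinational gate
            continue
        if wire in reg_inputs or wire in outputs:
            relevant.append(wire)
    return relevant


# Usage on a hypothetical toy netlist with inputs a, b and one register s0.
toy_gates = [
    {"is_register": False, "inputs": ["a", "s0"], "output": "n1"},
    {"is_register": False, "inputs": ["n1", "b"], "output": "n2"},
    {"is_register": True, "inputs": ["n2"], "output": "s0"},
    {"is_register": False, "inputs": ["a", "b"], "output": "o"},
]
all_wires = ["a", "b", "s0", "n1", "n2", "o"]
toy_relevant = relevant_wires(toy_gates, ["o"], all_wires)
```

In the toy netlist, the intermediate wire `n1` is skipped because a glitch-extended probe on `n2` already covers all of its inputs.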

Combination of Probes
After inserting all relevant wires of the circuit into $H$, we place and combine the probes based on the user-defined configuration, resulting in multiple (yet non-extended) d-probing sets $P_i \in P$. For efficiency reasons, we avoid duplicates in all $P_i \in P$, i.e., every probe occurs at most once in a probing set. Furthermore, the order of probes inside the probing set does not affect the verification. We abstract the generation of all $P_i \in P$ as finding all possible d-combinations $L'_i \in L'$ with $|L'_i| = d$ of elements in a predefined list $L = \{l_0, \ldots, l_{|L|-1}\}$, which is a common problem in combinatorics. We show the generation of all possible d-combinations in $L'$ in Algorithm 2. For simplicity, we operate on a bit-vector of $|L|$ indices $M = \langle m_0, \ldots, m_{|L|-1} \rangle$ with $m_i \in \mathbb{F}_2$. Initially, it holds that $m_i = 1$ if $i < d$ and $m_i = 0$ otherwise to start with the first combination, i.e., Line 3 of Algorithm 2. In Lines 15-32, $M$ is modified to represent the next possible d-combination of element indices. Based on the current combination of indices shown by $M$, we store the corresponding d-combination $L'_i = \{l_{m_0}, \ldots, l_{m_{d-1}}\}$ before the indices are updated (cf. Line 13). Our algorithm terminates if the last d-combination is reached. This is the case if it holds that $m_i = 0$ for all $i < |L| - d$. We capture the last combination in Line 6. However, to compute the probing sets, we have to specify $L$ and add the relevant timing information. As the relevant timing combinations differ for the univariate and multivariate cases, we present each procedure separately.
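A bit-vector enumeration in the spirit of this description can be sketched as follows. The initial vector and the termination condition follow the text; the concrete update rule (move the lowest movable 1 one position up and reset all lower 1s to the lowest positions) is one standard way to obtain the next combination and is an assumption, not a transcription of Algorithm 2. The name `d_combinations` is hypothetical.

```python
def d_combinations(elements, d):
    """Yield all d-combinations of `elements` using a bit-vector of selected indices."""
    n = len(elements)
    if d < 0 or d > n:
        return
    # Initially m_i = 1 for i < d and m_i = 0 otherwise.
    m = [1] * d + [0] * (n - d)
    while True:
        yield [elements[i] for i in range(n) if m[i]]
        # Last combination reached: m_i = 0 for all i < n - d.
        if not any(m[: n - d]):
            return
        ones_below = 0  # number of 1s at indices smaller than i
        for i in range(n - 1):
            if m[i] == 1 and m[i + 1] == 0:
                # Move this 1 one position up ...
                m[i], m[i + 1] = 0, 1
                # ... and pack all lower 1s into the lowest positions.
                for k in range(i):
                    m[k] = 1 if k < ones_below else 0
                break
            ones_below += m[i]


# Usage: all 2-combinations of four elements.
pairs = list(d_combinations(["a", "b", "c", "d"], 2))
```

In Python, `itertools.combinations` produces the same combinations (in a different order) and would be the idiomatic choice when the bit-vector itself is not needed.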

Algorithm 2 Make d-combinations of elements
Univariate Probe Generation. Covering only univariate leakages leads to a straightforward probe generation approach shown in Algorithm 3. As all probes must be annotated with the same clock cycle, we generate the set H′ of d-combinations according to Algorithm 2 applied on H. Afterwards, we annotate every d-combination of wires H′_i ∈ H′ with each targeted clock cycle t ∈ T. This procedure results in |H′| · |T| probing sets. For example, suppose that d = 2 and H′_0 = {e_0, e_1} and three targeted clock cycles T = {t_0, t_1, t_2}. To consider {t_0, t_1, t_2}, we generate |T| probing sets P_i = {(e_0, t_i), (e_1, t_i)} with i ∈ {0, 1, 2}. In other words, P_i covers all probes from H′_0 recording at clock cycle t_i.

Multivariate Probe Generation. In this case, each probe p ∈ P can be annotated with every targeted clock cycle t ∈ T. Hence, each possible d-combination of T is a possible annotation for P. Before we apply Algorithm 2, we generate a list of considered probes P′ by storing probes on all relevant wires e ∈ H and at every time instance t ∈ T. Hence, the resulting set P′ encompasses |H| · |T| probes. Afterwards, we apply Algorithm 2 on P′ in order to generate the d-probing sets. We again explain this by an example. Suppose d = 2 and two relevant wires H = {e_0, e_1} and the targeted clock cycles {t_0, t_1, t_2}. This results in six relevant probes P′ = {p_0, p_1, p_2, p_3, p_4, p_5}, one for each pair of a wire in H and a clock cycle in T.
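The difference between the two generation modes can be illustrated with a short sketch. It uses `itertools.combinations` as a stand-in for Algorithm 2; the function names and the representation of a probe as a `(wire, cycle)` tuple are assumptions made for this example.

```python
from itertools import combinations


def univariate_probing_sets(wires, cycles, d):
    """All probes of a set share one clock cycle: combine wires first, then annotate."""
    sets = []
    for combo in combinations(wires, d):
        for t in cycles:
            sets.append(frozenset((e, t) for e in combo))
    return sets


def multivariate_probing_sets(wires, cycles, d):
    """Every probe may carry its own clock cycle: combine (wire, cycle) pairs directly."""
    probes = [(e, t) for e in wires for t in cycles]   # |H| * |T| probes
    return [frozenset(c) for c in combinations(probes, d)]


H = ["e0", "e1"]
T = [0, 1, 2]
uni = univariate_probing_sets(H, T, 2)
multi = multivariate_probing_sets(H, T, 2)
```

With the values of the running example (d = 2, two wires, three clock cycles), the univariate case yields 3 probing sets, while the multivariate case combines the six probes into 15 probing sets, which include the 3 univariate ones.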

Probe Extension
Up to now, P contains all probing sets relevant for verification under the (0, 0, 0)-robust d-probing model, i.e., still non-extended. However, as we take glitches and, if specified, transitions into account, we should extend P. Both supported models cover glitches; hence, we start by transforming P into P^g containing all glitch-extended probing sets. For the sake of efficiency, we process every e ∈ H and precompute its set of wires after glitch-extension e^g ∈ H^g_e. To this end, we use a recursive backpropagation procedure given in Algorithm 5. In short, for the given e ∈ H, the procedure checks if the probe extension stops, i.e., whether e is a register output or a primary input (see Line 6). If so, e is added to H^g_e. Otherwise, the same procedure is repeated for all inputs of the gate whose output is e (Lines 10-13). Having the lists H^g_e for all e ∈ H, it is enough to substitute every probe p in all d-probing sets P_i ∈ P with the corresponding glitch-extended probing set. This is done by replacing p = (e, t) with p^g = (e^g, t) for all e^g ∈ H^g_e. This way, we obtain P^g, which is equivalent to P under the (1, 0, 0)-robust d-probing model.

Extension based on Transitions. After the generation of P^g and possible optimizations (explained below), the extension for transitions is done straightforwardly. Namely, every probe p = (e, c) ∈ P^g_i ∈ P^g is substituted by a tuple of two probes recording the same signal but at two consecutive clock cycles, i.e., {p, p′} with p = (e, c) and p′ = (e, c − 1). Hence, P^{g,t} is constructed.
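Both extensions can be sketched in a few lines. The netlist representation below (a dictionary `gate_inputs` mapping the output wire of each combinational gate to its input wires, with every other wire treated as a register output or primary input) is an assumption made for illustration and not PROLEAD's internal data structure.

```python
def glitch_extend(wire, gate_inputs, cache=None):
    """Backpropagate `wire` to the register outputs / primary inputs it depends on."""
    if cache is None:
        cache = {}
    if wire in cache:
        return cache[wire]
    if wire not in gate_inputs:
        # Not driven by a combinational gate: the extension stops here.
        result = frozenset([wire])
    else:
        # Repeat the procedure for all inputs of the gate whose output is `wire`.
        result = frozenset().union(
            *(glitch_extend(w, gate_inputs, cache) for w in gate_inputs[wire])
        )
    cache[wire] = result
    return result


def glitch_extend_set(probing_set, gate_inputs):
    """Replace every probe (e, t) by the probes (e_g, t) on its glitch-extended wires."""
    return frozenset(
        (eg, t) for (e, t) in probing_set for eg in glitch_extend(e, gate_inputs)
    )


def transition_extend_set(probing_set):
    """Replace every probe (e, c) by the pair (e, c) and (e, c - 1)."""
    return frozenset(q for (e, c) in probing_set for q in ((e, c), (e, c - 1)))


# Toy netlist: n1 = AND(a, r0), n2 = XOR(n1, b); n2 feeds a register.
gate_inputs = {"n1": ("a", "r0"), "n2": ("n1", "b")}
P_g = glitch_extend_set({("n2", 3)}, gate_inputs)
P_gt = transition_extend_set(P_g)
```

In the toy netlist, a single probe on `n2` at clock cycle 3 expands to probes on `a`, `r0`, and `b` at cycle 3, and the transition extension adds the same three wires at cycle 2.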

Optimizations
The verification step can evaluate the list of probing sets P g or P g,t to verify d-probing security. However, avoiding unnecessary probes and probing sets accelerates the verification procedure. In particular, we remove probes and probing sets if they fulfill one of the properties defined below.
Definition 8 (Duplicate). Consider a probing set P and two probes p i , p j ∈ P. We refer to the tuple (p i , p j ) as a duplicate if p i = p j and i ̸ = j. Hence, P contains the same probe twice.
Definition 9 (Subsequence). Consider two probing sets P, R. We refer to P as a subsequence of R if P ⊆ R. Hence, P is fully covered by R.
Due to the construction of P (cf. Algorithm 3 and Algorithm 4), neither duplicates nor subsequences can occur in the set of standard probes. However, duplicates as well as subsequences are introduced during the glitch extension, in particular if multiple different probes lead to overlapping probing sets after glitch extension. Therefore, we remove duplicates and subsequences after the glitch extension. As the transition extension is a bijection (two different probes never share the same transition-extended probe), extending P g with transitions does not introduce new duplicates or subsequences.
Removing Duplicated Probes. We remark that detecting duplicates in a sorted probing set P g i has a complexity of O(|P g i |). Therefore, we sort each duplicate-prone probing set P g i ∈ P g with IntroSort [Mus97]. As IntroSort has a complexity of O(|P g i | log |P g i |) the sorting is very fast.
Removing Subsequences. First, we remove duplicated probing sets, i.e., every P^g_i ∈ P^g for which an identical P^g_j with j ≠ i exists, which are easy to identify. To this end, we apply IntroSort to sort the probing sets of P^g. Afterwards, we go through all probing sets in P^g and remove every P^g_i ∈ P^g with P^g_i = P^g_{i−1}. This means that we remove all duplicates of P^g_i except its first occurrence. Second, we search for all tuples (P^g_i, P^g_j) with j ≠ i and P^g_i ⊂ P^g_j. If a tuple is found, we mark P^g_i as a probing set to be removed and ignore it in further searches. Finally, all marked probing sets are removed from P^g. As the search for tuples has a complexity of O(n^2), we can increase the efficiency through parallelization. Each thread can compare a dedicated set of probing sets P^g_i ∈ P^g with all other probing sets in P^g and mark P^g_i in a shared memory if P^g_i should be removed. After the termination of all threads, all marked probing sets can be removed. Since this process may take a very long time if the circuit is very large and a high security order d is defined, it can be deactivated through the user-defined configuration.
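A single-threaded sketch of both pruning steps is given below. It relies on Python's set semantics instead of sorting with IntroSort, so it illustrates the effect of the optimization rather than its implementation; the function name `prune_probing_sets` is hypothetical.

```python
def prune_probing_sets(probing_sets):
    """Drop duplicated probing sets and sets that are fully covered by another set."""
    # Step 1: a frozenset has no repeated probes, and a set of frozensets
    # keeps only one copy of identical probing sets.
    unique = set(frozenset(p) for p in probing_sets)
    # Step 2: quadratic search for proper subsets (P_i strictly contained in P_j).
    kept = [p for p in unique if not any(p < q for q in unique)]
    # Sort only to make the output order deterministic.
    return sorted(kept, key=lambda s: sorted(s))


sets_in = [
    {("a", 1)},
    {("a", 1), ("b", 1)},
    {("b", 1), ("a", 1)},      # duplicate of the previous set
    {("c", 1)},
]
pruned = prune_probing_sets(sets_in)
```

Here the singleton on `a` is removed because it is covered by the set on `a` and `b`, and only one copy of that set survives.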

Verification
Given a list of probing sets P g , the verification step analyzes every P g i ∈ P g by simulating intermediates recorded by the probes in P g i . The same holds for P g,t . We give a high-level overview of the verification approach in Algorithm 6. Upon completion, the procedure returns a list G encompassing the p-value p i for each P g i ∈ P g . The verification approach can be divided into three steps, presented as follows. Note that p-values refer to the result of statistical hypothesis tests explained in Section 2.4.

Simulation
The Simulation procedure in Line 5 of Algorithm 6 emulates the circuit for an arbitrary input sequence to compute all probed intermediates. Through the configuration (see Section 4.3), the user specifies the number of groups n g for which the statistical hypothesis test should be evaluated. Traditional tests in the context of SCA are either fixed versus random or fixed 1 versus fixed 2 , i.e., two groups. For each group, the user should naturally define the fixed value(s) or the random value (i.e., how many random bits are required). This should not be necessarily two groups, and PROLEAD supports any arbitrary number of groups, e.g., multiple fixed values.
During the definition of the initial primary input sequences (also stated in Section 4.3), the user defines which primary inputs at which clock cycles take a value assigned to one of the above-defined groups. This includes their masking as well. For example, two groups are defined as 64'h0000000000000000 and 64'h$$$$$$$$$$$$$$$$, referring to a fixed 64-bit vector fully filled with 0 and a 64-bit random vector, respectively. As an example, suppose that the circuit is an encryption function of a cipher masked with 3 shares, i.e., the 64-bit plaintext should be given by means of 3 shares P 0 , P 1 , and P 2 assigned to SelectedGroup 0 , SelectedGroup 1 , and SelectedGroup 2 , respectively. For every simulation, the simulator selects one of the aforementioned groups randomly, and generates a 64-bit random vector if the random group is selected. Then, the masking (with 3 shares) of the selected 64-bit vector, the so-called SelectedGroup, is constructed using 2 other 64-bit random vectors, i.e., selecting SelectedGroup 0 and SelectedGroup 1 at random and setting SelectedGroup 2 = SelectedGroup ⊕ SelectedGroup 0 ⊕ SelectedGroup 1 .
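The group selection and the construction of the Boolean sharing can be sketched as follows. The function names, the use of `None` to mark the random group, and the use of Python's `secrets` module as the randomness source are assumptions for this example.

```python
import secrets


def select_group(groups, width):
    """Pick a group at random; `None` marks the random group (assumed encoding)."""
    index = secrets.randbelow(len(groups))
    value = secrets.randbits(width) if groups[index] is None else groups[index]
    return index, value


def boolean_sharing(value, n_shares, width):
    """Split `value` into n_shares shares whose XOR equals `value`."""
    shares = [secrets.randbits(width) for _ in range(n_shares - 1)]
    last = value
    for s in shares:
        last ^= s                 # last = value XOR share_0 XOR ... XOR share_{n-2}
    return shares + [last]


groups = [0x0000000000000000, None]       # fixed all-zero vector vs. random vector
group_index, selected = select_group(groups, 64)
s0, s1, s2 = boolean_sharing(selected, 3, 64)
```

The first two shares are drawn at random and the third one is derived from them, so XORing all three shares always recovers the selected 64-bit vector.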
The user further defines the maximum simulation length in number of clock cycles while the simulator emulates the circuit's state after each clock cycle iteratively. The user can also define an end condition to be checked during each simulated cycle. If the circuit state fulfills the end condition, the simulation terminates. A possible end condition is to stop if one or multiple primary outputs, e.g., a done signal, reached a specified value.
As given in Section 4.2.1, the given circuit is modeled as a Mealy machine consisting of a register stage storing the output of a combinational circuit, whose input is provided by a combination of the primary inputs and the registers' output. Therefore, once the masked primary inputs are prepared, the simulator starts iterating through the clock cycles. To this end, the following operations are performed per clock cycle until the end condition is met or the maximum number of clock cycles is reached.
1. The primary inputs are updated. More precisely, it is checked if for the current clock cycle a new value for the primary inputs is defined through the initial input sequence (see Section 4.3).
2. Register outputs are updated, i.e., the signals connected to registers' outputs reflect the corresponding values stored in the registers.
3. Now, the input of the combinational circuit is fully defined; hence, it can be evaluated. This is done following the concept of event-driven simulation. Since we consider no delay for the gates, this can be simplified by processing the combinational circuit in the order of the logical depth. At the end of this step, the outputs of the combinational circuit (which are the circuit's primary outputs and/or the registers' input) are provided.
4. As the last step of every clock cycle, the registers store the values appearing at their inputs based on the status of their corresponding control signals, e.g., clock, enable, reset, etc.
Note that the above procedure is valid only for circuits where all registers are synchronized, e.g., all of them see the positive or negative edge of the clock signal. PROLEAD can optionally handle other circuits, e.g., with latches and clock gating, but this requires evaluating the combinational circuit twice per clock cycle, once when the clock signal is high and once more when it is low (i.e., lower efficiency).
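The four operations per clock cycle can be sketched as a small zero-delay simulator. The netlist format (a list of gates already sorted by logical depth and a list of register pairs), the all-zero initial register state, and the omission of enable and reset signals are simplifying assumptions of this example.

```python
import operator

# Supported single-bit gate functions (illustrative subset).
OPS = {"and": operator.and_, "xor": operator.xor, "not": lambda a: a ^ 1}


def simulate(gates, registers, input_schedule, max_cycles, end_condition=None):
    """gates: [(out, op, inputs)] in order of logical depth; registers: [(q, d)]."""
    state = {q: 0 for q, _ in registers}       # assumed initial register contents
    wires = {}
    trace = []
    for cycle in range(max_cycles):
        wires.update(input_schedule.get(cycle, {}))          # 1. primary inputs
        wires.update(state)                                  # 2. register outputs
        for out, op, ins in gates:                           # 3. combinational logic
            wires[out] = OPS[op](*(wires[w] for w in ins))
        trace.append(dict(wires))
        state = {q: wires[d] for q, d in registers}          # 4. registers store inputs
        if end_condition is not None and end_condition(wires):
            break
    return trace


# Toy circuit: d = x XOR q, with q being the register fed by d.
gates = [("d", "xor", ("x", "q"))]
registers = [("q", "d")]
trace = simulate(gates, registers, {0: {"x": 1}, 2: {"x": 0}}, max_cycles=4)
```

Primary inputs keep their last value until the input sequence defines a new one, which mirrors the check performed in the first step.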
To decrease the runtime, the simulator of PROLEAD works on 64-bit variables, i.e., handling 64 independent simulations in parallel, which indeed follows the concept behind bit-slicing. We should highlight that only the above-given operations per clock cycle are performed on 64-bit variables. Almost all other operations, e.g., the group selection and preparation of masked primary inputs, are done ordinarily (not bit-sliced). Further, since simulations are fully independent of each other, several simulations are performed in parallel by means of multi-threading, which is also adjusted by the user through the configuration file. As shown in Algorithm 6, the total number of simulations is denoted by n total which is divided by n step (also defined by the user based on the available memory) denoting the size of each simulation set which should be performed before updating the evaluation results. This allows the user to observe the evaluation results after each n step simulations.
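The bit-slicing idea can be illustrated with a toy circuit: bit i of every 64-bit word belongs to simulation i, so one bitwise operation evaluates a gate for all 64 simulations at once. The helper names and the example circuit are assumptions made for this sketch.

```python
import random

MASK64 = (1 << 64) - 1


def pack(bits):
    """Place the single-bit value of simulation i at bit position i of one word."""
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    return word


def unpack(word, n=64):
    """Recover the per-simulation bits from a bit-sliced word."""
    return [(word >> i) & 1 for i in range(n)]


def toy_circuit(a, b, c):
    """y = (a AND b) XOR c; works on single bits and on 64-bit slices alike."""
    return ((a & b) ^ c) & MASK64


rng = random.Random(1)
a_bits = [rng.getrandbits(1) for _ in range(64)]
b_bits = [rng.getrandbits(1) for _ in range(64)]
c_bits = [rng.getrandbits(1) for _ in range(64)]

sliced = unpack(toy_circuit(pack(a_bits), pack(b_bits), pack(c_bits)))
scalar = [toy_circuit(a, b, c) for a, b, c in zip(a_bits, b_bits, c_bits)]
```

Both evaluation styles give the same 64 results; the bit-sliced version needs one AND and one XOR in total instead of 64 of each.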

Update Distributions
Before the evaluation takes place, we convert the simulation results into one individual and independent distribution table per probing set. This is essential since the statistical hypothesis test requires such distributions to estimate p-values. For this, we go through all simulations and concatenate the recorded bits of all p ∈ P i into a value with at least n probe bits, where n probe = |P i |. In the following, we refer to these concatenated n probe -bit values as keys. Each entry of a distribution table stores an individual key and how often the key occurs per group, i.e., n g individual numbers.
For each set of n step simulations, we compute the keys of each probing set and update the corresponding distribution table. To efficiently search for a key in the distribution table, we keep the distribution tables sorted by their keys. Hence, searching for keys has logarithmic complexity O(log n) in the number of observed keys. If a key is found in the table, the occurrence of the corresponding group is increased by one. Otherwise, an empty table entry with the new key (for all n g groups) is added into the sorted distribution table before incrementing the corresponding occurrence. As we searched for the key before, the insertion position is already known, so the lookup cost is the same whether or not the key exists in the table. This process is also parallelized through multi-threading as the distribution tables of different probing sets can be updated independently of each other. Hence, multiple distribution tables are updated in parallel and each table is modified only by a single thread.
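A sorted distribution table with binary search can be sketched as follows. The class and function names are hypothetical, and Python's `bisect` module stands in for the search routine. Note that inserting into a Python list shifts the following entries, so only the lookup, not the insertion itself, is logarithmic in this sketch.

```python
import bisect


def make_key(bits):
    """Concatenate the recorded probe bits of one simulation into an integer key."""
    key = 0
    for b in bits:
        key = (key << 1) | (b & 1)
    return key


class DistributionTable:
    """Per-probing-set table: sorted keys, each with one counter per group."""

    def __init__(self, n_groups):
        self.n_groups = n_groups
        self.keys = []          # kept sorted
        self.counts = []        # counts[i][g] = occurrences of keys[i] in group g

    def update(self, key, group):
        pos = bisect.bisect_left(self.keys, key)        # O(log n) search
        if pos == len(self.keys) or self.keys[pos] != key:
            # Key not seen yet: insert an empty entry at the position just found.
            self.keys.insert(pos, key)
            self.counts.insert(pos, [0] * self.n_groups)
        self.counts[pos][group] += 1


table = DistributionTable(n_groups=2)
for bits, group in [((1, 0, 1), 0), ((0, 0, 1), 1), ((1, 0, 1), 1), ((1, 0, 1), 0)]:
    table.update(make_key(bits), group)
```

After the four updates, the table holds the keys 1 and 5 in sorted order, with per-group counts `[0, 1]` and `[2, 1]`.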

Leakage Evaluation
As given in Section 2.4, we make use of the G-test as the statistical hypothesis test. More precisely, we apply the G-test on the n g distribution tables of each probing set individually to achieve a measure for the independence of the distributions. Hence, we obtain the corresponding p-value p i for each probing set, based on which we report the detectability of a leakage. This is the case if −log10(p) exceeds the predefined threshold, i.e., the null hypothesis H 0 is rejected. Line 8 of Algorithm 6 performs this operation and stores the p-values in G, which can be shown, printed, or stored in a file after each n step simulations.
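As an illustration, the G-test on a contingency table (one row per key, one column per group) can be sketched with the standard library only. This follows the textbook definition of the G statistic, G = 2 Σ O ln(O/E), and is not taken from PROLEAD's source code; the helper `chi2_sf` uses the closed-form expressions of the chi-square upper tail for integer degrees of freedom, which are sufficient for small tables.

```python
import math


def chi2_sf(x, dof):
    """Upper tail probability of the chi-square distribution, integer dof >= 1."""
    if x <= 0:
        return 1.0
    if dof % 2 == 0:
        term, total = 1.0, 1.0
        for r in range(1, dof // 2):
            term *= (x / 2) / r                 # (x/2)^r / r!
            total += term
        return math.exp(-x / 2) * total
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (dof - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)             # chi^(2r-1) / (1*3*...*(2r-1))
        total += term
    return math.erfc(chi / math.sqrt(2)) + 2 * math.exp(-x / 2) / math.sqrt(2 * math.pi) * total


def g_test(table):
    """table: list of rows (keys), each a list of per-group counts. Returns (G, dof, p)."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    g = 0.0
    for row, r in zip(table, row_sums):
        for observed, c in zip(row, col_sums):
            if observed > 0:
                g += observed * math.log(observed * n / (r * c))
    g *= 2
    dof = (len(table) - 1) * (len(col_sums) - 1)
    p = chi2_sf(g, dof) if dof > 0 else 1.0
    return g, dof, p


g_bal, dof_bal, p_bal = g_test([[50, 50], [50, 50]])   # identical distributions
g_skw, dof_skw, p_skw = g_test([[90, 10], [10, 90]])   # clearly different distributions
leak_detected = -math.log10(p_skw) > 5
```

For the balanced table the statistic is zero and the p-value is 1, whereas the skewed table yields −log10(p) far above a threshold of 5.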

Statistical Confidence
We refer to our results as statistically confident as soon as the error probabilities become acceptable. For the false positive probability, we predefine a threshold probability and reject H_0 if the computed p-value becomes smaller than the threshold. As p < 10^-5 is a common threshold for leakage assessment, e.g., t-test [SM15] and χ²-test [MRSS18], we decide to set the threshold probability to 10^-5. Hence, the chance to falsely reject H_0, i.e., to report a secure design as insecure, is smaller than 10^-5 for each probing set. However, false positives can become very likely if the number of probing sets exceeds 10^5. If this happens, e.g., if the experiment considers millions of probing sets, we recommend decreasing the threshold to an acceptable level. However, the decisive factor is not that the threshold is exceeded but that the p-value decreases continuously with an increasing number of simulations. To bring the false negative probability to an acceptable level, we apply the power analysis techniques introduced in Section 2.4. During the verification procedure, we continually monitor the sample size needed to satisfy statistical confidence for a predefined tuple (β, φ). In particular, we numerically estimate the number of simulations required to satisfy an error probability of β for an effect size φ. To estimate the number of required simulations, we define a range in which we try to approximate the necessary number of simulations. In practice, we define a very large range from one to one billion to be sure that the number of required simulations lies in the range. Then, we apply a trial-and-improve strategy which is one-simulation accurate and achieves logarithmic complexity in the upper bound of the range. For the case studies in Section 5, β equals the threshold of 10^-5 for experiments with fewer than 10^5 probing sets, while we set the effect size to φ = 0.1. Hence, by applying these parameters, we detect all small effects with an error rate of 10^-5.
While we suggest using these parameters as the default settings, we remark that the user can choose arbitrary parameters according to the desired security guarantees and the evaluated design. The estimation of required simulations takes place after each leakage evaluation step. As the computation of the statistical power depends on the degree of freedom, we estimate the number of required simulations for the probing set resulting in the highest degree of freedom. Hence, we can be sure that the estimated number of simulations is enough for all considered probing sets. Naturally, the degree of freedom grows with every new spotted intermediate that we store in the contingency table. Hence, the number of required simulations can only be estimated during the evaluation and grows if the degree of freedom grows. Nevertheless, the growth slows down or stops if all possible probed values of a set are considered in the contingency table. Hence, we stop the evaluation as soon as the number of performed simulations reaches the number of required simulations. In Section 5.3, we visualize the progression of the different parameters based on two practical case studies in Figure 2.
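One way to realize a one-simulation-accurate search with logarithmic complexity is plain bisection over the range from one to one billion. The sketch below assumes a monotone predicate `is_enough(n)` that reports whether n simulations reach the desired error probability β for the effect size φ; this predicate is a hypothetical placeholder for the power computation of Section 2.4 and is replaced by a simple comparison in the example.

```python
def required_simulations(is_enough, low=1, high=10**9):
    """Smallest n in [low, high] with is_enough(n) true, assuming monotonicity."""
    if not is_enough(high):
        return None                     # the requirement cannot be met within the range
    while low < high:
        mid = (low + high) // 2
        if is_enough(mid):
            high = mid                  # mid suffices: the answer is mid or smaller
        else:
            low = mid + 1               # mid is too small
    return low


calls = 0


def toy_predicate(n):
    """Stand-in for the real power computation (assumption for this example)."""
    global calls
    calls += 1
    return n >= 12345


needed = required_simulations(toy_predicate)
```

The search returns the exact boundary and evaluates the predicate roughly log2(10^9) ≈ 30 times.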

Case Studies
In order to examine the capabilities as well as the performance of PROLEAD, we evaluated several designs, most of which are available through public repositories such as GitHub. In other cases, we either received the design from the corresponding authors or constructed the designs ourselves. In short, PROLEAD can rapidly identify leakage in unmasked designs and in those whose security flaws we were aware of. Further, we found several mistakes and shortcomings in publicly-available masked designs which are supposed to be probing secure. In the following, we elaborate on each case study, but we keep the description of each one short and mainly refer to the original article.

Setup.
We made use of Synopsys Design Compiler and the NanGate 45 ASIC standard cell library to synthesize each case study and generate its netlist. We made sure to avoid optimization across different modules (i.e., keeping the design hierarchy) to not violate any assumptions of the original designer, e.g., non-completeness [NRS11]. We further provided the functional description of most of the cells in the NanGate 45 library so that PROLEAD can understand and simulate the given netlist (see Section 4.3). We ran the evaluations on a machine with an AMD EPYC 7352 (48 hyper-threading cores) and 128 GB of memory.
For all evaluations, we considered two groups (n g = 2), one group fully random and the other one fully zero. In other words, we performed a fixed versus random G-test where the fixed input is {0}^t and the random input is $ ← F_2^t, where t stands for the size of the input vector. For all case studies, we kept the key of the design (if any) at the zero vector. Note that this does not have any effect on the result of our evaluations. However, if the design receives the key in a masked form, we gave the masked representation of the zero vector, updated at the start of each circuit simulation.
For all case studies, we first conducted the evaluations by only covering glitches, i.e., no transitions. If we found no leakage, we then extended the evaluation by additionally covering the transitions. Therefore, unless stated otherwise, when we report the vulnerability of a circuit, we refer to the case where only glitches are taken into account. Likewise, when we report the security of a circuit, we refer to the case where both glitches and transitions are covered. For all case studies, we set the effect size to φ = 0.1. Therefore, if we report a design as secure, we mean that no effect with an effect size of φ ≥ 0.1 was detected by PROLEAD. Only when we discovered strong leakage did we increase φ to improve the readability of our results. The summary of all conducted evaluations is shown in Table 2.

Unmasked Designs
The vulnerability of unmasked designs is expected. Hence, just as a sanity check, we evaluated a round-based implementation of SKINNY-64 [BJK + 16], which is available online. PROLEAD reports first-order leakage at all clock cycles in less than a second using less than 100 simulations.
Hiding Countermeasures. When evaluating the probing security of unmasked implementations, adding either amplitude or temporal noise does not have any effect on their vulnerability. Examples include adding an independent noise module, adding jitter to the clock, or randomizing the clock source [MOP07]. The same holds if the circuit is realized by a dual-rail pre-charge logic as a power-equalization technique. Placing a single probe on one of the rails would lead to leakage about the actual signal. The same applies even to Masked Dual-rail Pre-charge Logic (MDPL) [PM05], as one probe on a gate output rail propagates to, e.g., a_m = a ⊕ m, b_m = b ⊕ m, and m, which clearly leaks information about a and b. Here, we would like to stress that such dual-rail pre-charge logics are free of glitches, but the glitch-extended probes should still be propagated to the input of the combinational circuit. The notion of "glitch-extended" probes is not limited to glitches. The propagation delay of CMOS gates depends on the given input; for example, there is a small delay difference when the input of an AND gate changes from 11 to 01 or to 10. This makes the power consumption of the gate depend on the given data, although no glitch happens in the circuit. A similar concept with larger granularity is known as "data-dependent time of evaluation" [MOP07].

Nevertheless, via several case studies elaborated below, we show that many implementations are not probing secure, while the authors have not seen leakage in practice, often using 100 million measurements. This obviously depends on the quality of the measurement setup and many other factors involved in the experimental analyses. The leakage might be detected using another setup running in another environment, or if the gates of the underlying circuits are realized differently. This highlights the relevance of the (robust) probing security model.
If a circuit is probing secure when both glitches and transitions are taken into account, its security in practice is independent of how the gates are realized and how precise the measurement setup is. Of course, this statement does not include the coupling effects yet. SILVER [KSM20] is able to evaluate the (robust)-probing security of small masked circuits, e.g., S-Boxes with low input size (including the masked inputs and fresh masks). For large circuits, either the tool cannot build the Binary Decision Diagram (BDD) of the given circuit or the evaluation is not accomplished in a reasonable time.

Small Masked Circuits
As the first masked circuit, we have taken the TI design of the S-Box of the PRESENT cipher [BKL + 07] presented in [PMK + 11], where the S-Box is decomposed to two quadratic bijections, and each of which is masked by three shares following classical TI scheme [NRS11]. Similar to SILVER, PROLEAD detects no leakage with glitches and transitions. As stated in Section 3, this is one of the cases in which maskVerif is too conservative and reports leakage.
As a flawed design, we took the PRESENT TI S-Box of [EGMP17], without correction terms (i.e., with non-uniform output sharing) which is supposed to be insecure. Both SILVER and PROLEAD find probes with first-order leakage.
The first AES TI S-Box with 3 shares has been proposed in [MPL + 11]. Due to the size of the circuit and its number of inputs (including 52 fresh masks), it is out of the capacity of SILVER. However, PROLEAD confirms its first-order security (glitches and transitions).
Other d + 1 masked AES S-Boxes at arbitrary order have been introduced in [GMK16, GMK17, CRB + 16]. The first-order designs have been successfully evaluated by SILVER in [KSM20], which PROLEAD confirms as well. The higher-order designs, however, are too large for SILVER, while we are able to confirm their security by PROLEAD.

Masked Full Ciphers
Serialized PRESENT. The first complete cipher design, which we have evaluated, is the nibble-serialized PRESENT encryption function of [PMK + 11], where the above-discussed  TI S-Box is instantiated. PROLEAD has confirmed the first-order security of the design in less than two minutes while SILVER cannot handle such designs. By exchanging the S-Box with the flawed one, and evaluating the encryption function in whole with PROLEAD, we also detected first-order leakage rapidly.
To visualize the results and the statistical confidence, we show the progression of p and the number of required simulations for both serialized PRESENT designs in Figure 2. Each plot encompasses the p-value as − log 10 (p) which is drawn in black, while the horizontal line (red) indicates the 10 −5 threshold of the p-values. Moreover, the grey line shows the relationship between processed simulations and required simulations to achieve β < 10 −5 . Hence, our results become statistically confident if as many simulations as needed are processed. This is visualized by another threshold at 1.0 (drawn in blue) which has to be passed by the grey line. For the uniform design, shown in Figure 2(a), we plot the results for 512 000 simulated encryptions. 512 000 turns out to be twice as much as needed since the result is confident after 225 567 simulations. We can conclude that the design is secure since p stays under the threshold for more than 225 567 simulations. For the non-uniform S-Box, shown in Figure 2(b), we detect leakage after thousands of simulations. Hence, we conclude that the corresponding effect size is higher than 0.1 and set φ = 0.3 to detect only moderate effect sizes. We simulate 25 600 encryptions while β falls under 10 −5 after around 16 000 simulations. Although we can only detect medium effect sizes reliably, the leakage is visible as the p-value is far above the threshold.
Serialized AES. We further evaluated the masked full cipher designs of [MPL + 11, GMK16, CRB + 16], whose underlying masked S-Boxes we have already evaluated. In summary, we have not detected any first-order leakage when evaluating the first-order designs. For each design, PROLEAD required between 13 and 50 minutes to accomplish the full evaluation process, i.e., glitches and transition with an effect size of φ = 0.1.

Registers with enable.
It is common practice to use registers with enable in designs where some registers should not store their inputs at every clock cycle. We noticed that the NanGate 45 cell library does not contain any such register cells, which are then realized by a normal register and a multiplexer to re-store the register's output when the enable signal is low. Although this does not pose any problem to either the functionality or the security of the design, placing a probe on the input of such a register would propagate to its output as well (through the aforementioned multiplexer). Therefore, even without considering transitions, the evaluation would observe the input and output of those registers, hence inherently covering transitions.
Null Fresh. The authors of [SM21a] have introduced a technique to realize first-order secure implementation of (up to) cubic functions with two shares and without any fresh masks. They applied the technique on the S-Box of several ciphers and provided the encryption function of the full ciphers in GitHub 7 . We evaluated all designs explained below.

Midori-64 [BBI + 15].
It is a round-based implementation supporting both encryption and decryption. Considering glitches, we did not find any first-order leakage. The S-Box has an internal register stage; hence, the full cipher implementation forms a pipeline with two stages, i.e., two clock cycles per cipher round. Hence, two consecutive plaintexts (for encryption) can be given to the circuit. If the input (plaintext) is the same for both cycles, i.e., the reset signal is high for two clock cycles, PROLEAD detects first-order transitional leakage. This is actually a general problem and not specific to this design. Suppose that the first pipeline stage realizes the function f(.) and the second pipeline stage the function g(.). Let us denote the input of the cipher by A, which is given to the circuit in two consecutive clock cycles. After two clock cycles, the first pipeline stage has computed f(A) and the second one g(f(A)), which is equal to the application of one cipher round on A. In the next clock cycle, the input of the first pipeline stage changes from A to g(f(A)). Therefore, placing a transition-extended probe on the first pipeline stage would observe some bits of A and the corresponding bits of g(f(A)). It means that by one probe, the input and output of a cipher round are recorded, which most of the time leads to detectable leakage. As a general rule, in pipeline designs, consecutive inputs should not be the same, i.e., should not have the same masking (initial sharing). By filling one pipeline stage with another independently-masked input or a zero vector right after giving the desired input, i.e., inserting a bubble into the pipeline as suggested in [CS21], the observed first-order transitional leakage vanished.

PRESENT-80.
It is the encryption function based on a nibble-serial design architecture. In short, we found that the plugged masked S-Box is first-order secure, but not the encryption function. Considering only glitches, we have not observed any first-order leakage, but once transitions are taken into account, we rapidly (i.e., using less than 10 000 simulations) detect first-order leakage. The reason for such a leakage lies in its serialized architecture. In such a design, at every clock cycle some multiplexers decide whether the register stores the S-Box output or the P-Layer output, or loads in parallel (or serial). Hence, placing a probe at a register input propagates to many other register outputs including some of those belonging to the masked S-Box.
There is no problem during the serialized computation of the S-Box, after which the P-Layer is applied. Since the S-Box has an internal register stage, during the application of the P-Layer, one state nibble, denoted by X (on which the S-Box is already applied), appears at the input of the S-Box module, and the internal register is filled with, say, f(X). The P-Layer stores one bit of X at the first nibble of the state register, which is again given to the S-Box module at the next clock cycle. Hence, a probe placed at the output of the S-Box internal register observes some information about shares of X in two consecutive clock cycles. Together with other glitch-extended probes explained above, information about X is revealed. This can be avoided by disabling the S-Box internal register when performing the P-Layer.

PRINCE [BCG + 12].
Similar to that of Midori-64, it is a round-based implementation of both encryption and decryption, and forms a pipeline of two stages. The authors have themselves stated that their masked S-Box does not provide a uniform output sharing. However, they claimed that the diffusion layer makes the masked input of every S-Box in the next round uniform. By PROLEAD we observed first-order leakages at clock cycles 7, 9, and 13, i.e., in rounds 4, 5 and 7 of the encryption. It is indeed confirmed that the diffusion layer helps since no leakage during the second encryption round is observed, but after some rounds the sharing of the S-Box inputs becomes non-uniform. However, we should note that this leakage might not be exploitable since an attack conducted in the cipher's middle rounds would require guessing many key bits.
AES-128. The authors have provided two masked designs for inversion in GF((2^4)^2), one with one fresh mask and another one without any. As they stated, these masked S-Boxes do not have uniform output sharing. Hence, similar to that of PRINCE, they used the diffusion layer (MixColumns) to make the input sharing of the S-Boxes in the next round uniform. PROLEAD rapidly detected first-order leakage in these implementations. Through inspecting the reason, we found a design flaw in their scheme, which we explain as follows.
Placing an extra register at the output of $A_O(I')$ would solve this issue, but the output of MixColumns is directly given to the input affine of the S-Box module (for the next round) without any register (see Figure 6 of [SM21a]). Therefore, a probe placed on the input affine combined with the input isomorphism of the S-Box would again propagate to the output of several registers storing $A_O(I')$, which are again not jointly uniform. The only solution that we found to mitigate this leakage is to add one more register at the output of MixColumns, i.e., one register to store $I'$, one to store $A_O(I')$, and one more to store the MixColumns output $O$. However, the problem that we reported above on the non-uniformity of the S-Box inputs in the middle rounds of PRINCE holds true here as well, since this technique would avoid the leakage at the second cipher round, but not necessarily at all rounds.
Null Fresh 2. In a follow-up work [SM21b], the authors extended their technique to the second order with three shares and presented second-order glitch-extended probing secure implementations of quadratic functions without any fresh masks. Composing such functions still necessitates the insertion of fresh masks, whose number can be kept to a minimum. The authors applied their proposed scheme to the S-Boxes of several ciphers and provided the HDL code of the masked full cipher implementations on GitHub, which we have taken and evaluated as given below.
Keccak [BDPA]. The authors provide both first- and second-order designs realizing a round-based implementation of Keccak-f[200] with three pipeline stages and without any fresh masks. Following the principle given for the Null Fresh Midori-64 design with respect to bubbles in the pipeline (on page 24), we detected no leakage.
Midori-64, PRINCE, SKINNY-64. These round-based implementations mostly have four pipeline stages (seven stages for PRINCE) and require 128 fresh mask bits per clock cycle. Our analyses did not find any first- or second-order leakage.
PRESENT-80. Similar to the first-order design [SM21a], it is a nibble-serial implementation of the encryption function. We detected first- and second-order univariate leakage at the second cipher round using around 5 000 simulations. The reason is that the output sharing of the second stage of the decomposed S-Box is not uniform, although the authors claimed its uniformity. Hence, after the application of the masked S-Box to all state nibbles and the P-Layer, the probes placed on the S-Box in the second cipher round exhibit leakage. If the key is also masked, the leakage is restricted to the second order only. Otherwise, first-order leakage is detected. This can be solved either by exchanging the S-Box design with another one that has a uniform output sharing or by introducing an 8-bit fresh mask at the end of each S-Box to refresh its output sharing.
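The notion of a uniform output sharing used above can be checked exhaustively for small gadgets. The following is a minimal sketch, not the PRESENT S-Box of [SM21b]: it assumes a textbook three-share non-complete AND (`ti_and`) and a share-wise XOR (`share_xor`) as stand-ins, and all function names are hypothetical. For every unshared input, it enumerates all input sharings and tests whether each valid sharing of the output occurs equally often.

```python
from collections import Counter
from itertools import product

def sharings(value, n_shares):
    """All n-share Boolean sharings of a single bit `value`."""
    return [s for s in product((0, 1), repeat=n_shares)
            if sum(s) % 2 == value]

def ti_and(a, b):
    """Textbook 3-share non-complete AND (illustrative stand-in only)."""
    c0 = (a[1] & b[1]) ^ (a[1] & b[2]) ^ (a[2] & b[1])
    c1 = (a[2] & b[2]) ^ (a[0] & b[2]) ^ (a[2] & b[0])
    c2 = (a[0] & b[0]) ^ (a[0] & b[1]) ^ (a[1] & b[0])
    return (c0, c1, c2)

def share_xor(a, b):
    """Share-wise XOR of two sharings (illustrative stand-in only)."""
    return tuple(x ^ y for x, y in zip(a, b))

def is_uniform(gadget, n_shares=3):
    """True if, for every unshared input, all valid output sharings
    occur equally often over all input sharings."""
    for a, b in product((0, 1), repeat=2):
        outputs = [gadget(sa, sb)
                   for sa in sharings(a, n_shares)
                   for sb in sharings(b, n_shares)]
        counts = Counter(outputs)
        out_value = sum(outputs[0]) % 2        # unshared output bit
        valid = sharings(out_value, n_shares)  # sharings a uniform gadget must hit
        expected = len(outputs) // len(valid)
        if any(counts[s] != expected for s in valid):
            return False
    return True

print("share-wise XOR uniform:", is_uniform(share_xor))
print("3-share AND uniform:   ", is_uniform(ti_and))
```

Such an exhaustive check only scales to small components; for the full cipher, the consequence of a non-uniform sharing shows up as leakage in later rounds, as reported above.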
Low-Latency Keccak. In [ZSS+21], the authors introduced a technique to combine the masked χ and θ functions of Keccak without placing any register in between. This allowed them to construct a generic round-based design with one clock cycle per round, supporting all variants of Keccak at any arbitrary order. This comes at the cost of a relatively high demand for fresh masks.
Since the design is generic, we evaluated the smallest variant, i.e., Keccak-f[25], which is available on GitHub. Focusing on the first-order design and considering only glitches, we did not find any first-order leakage, confirming the authors' security arguments. However, we observed strong first-order leakage when transitions are also considered in the evaluation, using 2 to 3 million simulations. The reason for this leakage is that the design performs one masked Keccak round in every clock cycle, i.e., the round-based implementation has a single register stage. A probe at the input of a single state register propagates to the outputs of several state registers, due to the composed combinational circuit of χ and θ. Taking transitions into account, these probes record those state registers at two consecutive Keccak rounds. Although this is not a general rule, transitions usually lead to leakage when the output of a masked circuit is written back to its input. For example, this holds true for most of the cases studied in [MKSM22], and it also holds true for this Keccak implementation. To be more precise, the leakage is detected only during the first and the last clock cycles, i.e., the first and last Keccak rounds, which can potentially lead to exploitable leakage. This is due to some multiplexers placed to load the input at the first clock cycle and other multiplexers that bypass the θ function at the last Keccak round.
The same holds for higher-order designs. However, many more simulations are required to detect such higher-order transitional leakages. This is due to the size of the combinational circuit and the dependency of each register input on several other register outputs. For example, two probes placed on register inputs of a second-order design propagate to 328 other probes. This hardens the detection of leakages, since the distribution tables need many samples to be filled, i.e., to be better estimated. For the second-order design, we required around 500 million simulations to detect the aforementioned transitional second-order leakage. Detection becomes even harder for higher-order designs, and the leakage is hence very unlikely to be exploitable.
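To illustrate why a weak dependency needs many samples before it becomes statistically visible, the following minimal sketch applies a G-test of independence to a 2x2 contingency table. The choice of the G-test, the helper name `g_test_2x2`, and the example counts are assumptions made for illustration and are not taken from the evaluations above; the p-value formula is only valid for one degree of freedom, i.e., for 2x2 tables.

```python
import math

def g_test_2x2(table):
    """G statistic and p-value for independence in a non-empty 2x2 table.

    Rows could be two input classes and columns the two values of a
    probed bit (a hypothetical setup for this sketch).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    g = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if observed > 0:  # empty cells contribute nothing
                g += 2.0 * observed * math.log(observed / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding errors
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value

# Same relative bias (51% vs 49%), different sample counts.
few_samples = [[102, 98], [98, 102]]
many_samples = [[10200, 9800], [9800, 10200]]

for name, table in (("400 samples", few_samples), ("40000 samples", many_samples)):
    g, p = g_test_2x2(table)
    print(f"{name}: G = {g:.3f}, p = {p:.2e}")
```

With identical proportions, the G statistic grows linearly with the number of samples (about 0.16 versus about 16 here), so the same bias is missed with few samples and flagged with many. Larger tables, as obtained from many propagated probes, spread the samples over more cells and push the required sample count up further.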
Low-Random Masking. In order to reduce the number of fresh masks in designs that achieve second-order security, a technique has been introduced in [BDMS22] which allows reusing the fresh masks at every cipher round. More precisely, the fresh masks should be updated only together with the given input (plaintext and key). The authors applied the underlying technique to several ciphers and provided full cipher designs on GitHub, which we fully evaluated. All designs have a round-based architecture, and two designs are provided for each cipher. One has a higher number of pipeline stages and requires the lowest number of fresh masks, while the other has a lower latency and necessitates a higher number of fresh masks. We did not find any first-order leakage in any of the provided designs. Hence, the evaluations given below focus only on second-order leakages.

LED-128 [GPPR11].
We did not find any leakage in the design with 5 pipeline stages. However, we detected multivariate second-order leakage in the 3-stage design using a few hundred thousand simulations. In such designs, the authors coupled two masked S-Boxes and used their shares to blind each other's computations in order to fulfill non-completeness. Each S-Box is decomposed into two quadratic functions F and G. The second-order leakage is detected when one probe is placed on an output of F of one of the coupled S-Boxes and the second probe on an output of G of the other S-Box.

Midori-64.
Evaluating the 4-stage encryption/decryption design, we detected multivariate second-order leakage using around 1 million simulations. Since the same fresh masks are used in all clock cycles of an encryption, placing a probe on a part of the S-Box, e.g., at the 4th clock cycle, yields information about some fresh masks. When the second probe is placed on another part of the S-Box in the next clock cycle, where the same fresh masks are used, we observe detectable leakage. This leakage has been detected at the 4th and 5th clock cycles, i.e., at the border of the first and second cipher rounds; hence, it is expected to be exploitable.
Note that it is a pipelined design, and 4 consecutive plaintexts can be given to the encryption function. After giving the target plaintext, the 3 other pipeline stages can be filled with zero or random inputs, i.e., the bubble strategy. We examined both scenarios and saw the same leakage. Potential solutions might be (1) to set the fresh masks to zero in clock cycles when the pipeline does not contain meaningful data, or (2) to alternate between 4 different sets of fresh masks corresponding to the 4 given consecutive plaintexts. No such hints are given in the original paper [BDMS22], nor do the implementations on GitHub consider or suggest such scenarios.
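The effect of reusing a fresh mask in consecutive clock cycles can be sketched with a toy model. This is a deliberately simplified illustration and not the Midori-64 netlist: it assumes that the probe in one cycle observes a sensitive bit `s1` blinded by a mask bit and that the probe in the next cycle observes another sensitive bit `s2` blinded by either the same or an independent mask bit. All names are hypothetical.

```python
from collections import Counter
from itertools import product

def joint_distribution(s1, s2, reuse_mask):
    """Exact joint distribution of two probed bits for fixed sensitive bits."""
    counts = Counter()
    for r_a, r_b in product((0, 1), repeat=2):
        mask_first_cycle = r_a
        mask_next_cycle = r_a if reuse_mask else r_b
        counts[(s1 ^ mask_first_cycle, s2 ^ mask_next_cycle)] += 1
    return counts

def marginal(dist, index):
    """Distribution seen by only one of the two probes."""
    counts = Counter()
    for observation, count in dist.items():
        counts[observation[index]] += count
    return counts

def pair_leaks(reuse_mask):
    """True if the joint distribution depends on the sensitive bits."""
    dists = [joint_distribution(s1, s2, reuse_mask)
             for s1, s2 in product((0, 1), repeat=2)]
    return any(d != dists[0] for d in dists)

def single_probe_leaks(reuse_mask, index):
    """True if one probe alone already depends on the sensitive bits."""
    dists = [marginal(joint_distribution(s1, s2, reuse_mask), index)
             for s1, s2 in product((0, 1), repeat=2)]
    return any(d != dists[0] for d in dists)

print("reused mask, two probes leak:  ", pair_leaks(True))
print("fresh mask, two probes leak:   ", pair_leaks(False))
print("reused mask, first probe alone:", single_probe_leaks(True, 0))
```

In this toy model each probe on its own is uniformly distributed, while the pair reveals `s1 ^ s2` as soon as the mask is reused, which mirrors the multivariate nature of the leakage described above.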
The 3-stage design has a univariate second-order leakage which does not originate from the aforementioned source. In this design, the S-Box is decomposed into quadratic functions F and G. We detected the leakage using around 1 million simulations when two probes are placed at different outputs of the G function, starting at the third clock cycle.

PRINCE.
When evaluating the 6-stage design, we observed univariate second-order leakage using around 37 million simulations when two probes are placed at the input affine function of the decomposed S-Box in the second cipher round, i.e., at the 7th clock cycle. The origin of this leakage is not the reuse of fresh masks across clock cycles, as we observed the same leakage when the fresh masks change at every clock cycle. Instead, the leakage is due to the way the fresh mask bits are reused by different S-Boxes within a round. Suppose that an S-Box module receives the fresh masks $\langle r_0, r_1, \ldots, r_{37} \rangle$. The same fresh masks in the same order are given to all other S-Boxes in a cipher round. Changing this order may avoid the observed leakage. The 4-stage design also exhibits second-order (but multivariate) leakage using 30 million simulations. The reason is similar to what we have explained for the 4-stage Midori-64 design, i.e., two probes in consecutive clock cycles.

SKINNY-64.
The 4-stage design shows univariate second-order leakage similar to the 6-stage PRINCE design. We detected the leakage using 1 million simulations when two probes are placed on the input affine function of the decomposed S-Box in the second cipher round, i.e., at the 5th clock cycle. We further detected multivariate second-order leakage (similar to several other cases) when two probes are placed in consecutive clock cycles: the first probe obtains information about the fresh mask in one clock cycle, and the second probe reveals some leakage about the secrets in the next clock cycle.
The 3-stage design also has univariate second-order leakage (detected using 1 million simulations), similar to the 4-stage design, but at the 4th clock cycle, as the design has one stage less.

GHPC
There are a few recent developments in the area of composable security, i.e., constructing hardware gadgets (e.g., a 2-input AND gate) whose security is guaranteed when composed.

This is highly beneficial to construct a masked circuit at arbitrary order. To the best of our knowledge, the most efficient composable 2-input AND gadget extendable to any arbitrary order is known as Hardware Private Circuits (HPC2) [CGLS21], following the security notion PINI [CS20]. Generic Hardware Private Circuits (GHPC) [KSM22] extends HPC2 to construct arbitrarily large gadgets (of any input and output size), but is limited to first order.
Let us consider an exemplary circuit made of two cascaded GHPC gadgets realizing two functions, i.e., $d = f(g(a, b), c)$ with the intermediate value $t = g(a, b)$, where $a = a_0 \oplus a_1$ (resp. for $b$, $c$, $d$, and $t$) and $r_0$, $r_1$ denote the fresh masks.
Each GHPC gadget has two register stages, i.e., a latency of two clock cycles. When the input $\langle a, b, c \rangle$ is given, it takes a couple of clock cycles until the output $\langle d_0, d_1 \rangle$ is ready. When the input $\langle c_0, c_1 \rangle$ is not synchronized with the intermediate value $\langle t_0, t_1 \rangle$ by means of extra registers, the circuit does not form a pipeline, and the inputs $a$, $b$, $c$ must stay stable until the output is ready. In this non-pipeline scenario, the authors suggested removing some optional internal registers of the GHPC gadgets for the sake of area efficiency. This is based on the assumption that not only the given input but also the fresh masks stay stable until the entire circuit is evaluated. In other words, all inputs, including the fresh masks, are given to the circuit and stay unchanged for a couple of clock cycles until the output is ready. The same concept has been considered in the design of non-pipeline DOM multipliers [GMK16]. Theoretically, this does not pose any issue. The authors have confirmed the security of their designs with SILVER. Since SILVER evaluates the circuit in the steady state, removing such optional internal registers does not affect its evaluation result, i.e., with and without such internal registers, SILVER reports the designs made of GHPC gadgets as secure.
However, PROLEAD reports first-order leakage when transitions are taken into account. The second GHPC gadget initially processes the given input $c$ using $r_1$ and the output of the first gadget $\langle t_0, t_1 \rangle$, which is not yet ready. After two clock cycles, when the output of the first gadget is valid, the second gadget calculates its output with the same fresh mask $r_1$. The transitional leakage related to the consecutive values stored in the registers of the second gadget cancels the effect of the fresh masks, leading to first-order leakage. Unfortunately, this cannot be solved by updating the fresh masks at every clock cycle, since the circuit without the optional internal registers would then not generate the correct output. More precisely, if the optional internal registers are removed, the fresh masks must stay stable until the circuit is fully evaluated. The only way to operate GHPC circuits both securely and correctly is to keep the optional internal registers and update the fresh masks at every clock cycle. Note that such optional internal registers are independent of the extra registers which may be added to the circuit (outside of the gadgets) to synchronize the input of every gadget and construct a fully pipelined design.
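The cancellation of the fresh mask by a transition can be sketched with a toy model. It is only an illustration under strong assumptions and not the GHPC construction of [KSM22]: the probed register is assumed to hold `f(t, c) ^ r1`, the function `f` is assumed to be an AND, the stale value of `t` is assumed to be 0, and `t` and `c` are treated as unshared bits. The model covers only the leakage, not the functional correctness issue discussed above.

```python
from collections import Counter
from itertools import product

def f(t, c):
    """Hypothetical second-gadget function (an AND, chosen for illustration)."""
    return t & c

def register_values(t_valid, c, r_first, r_second, t_stale=0):
    """Consecutive register contents: first with the stale t, then the valid t."""
    return (f(t_stale, c) ^ r_first, f(t_valid, c) ^ r_second)

def transition_distribution(t_valid, c, stable_mask):
    """Distribution seen by one transition-extended probe on the register."""
    counts = Counter()
    for r_a, r_b in product((0, 1), repeat=2):
        r_second = r_a if stable_mask else r_b
        counts[register_values(t_valid, c, r_a, r_second)] += 1
    return counts

def transition_leaks(stable_mask):
    """True if the observed pair of values depends on t and c."""
    dists = [transition_distribution(t, c, stable_mask)
             for t, c in product((0, 1), repeat=2)]
    return any(d != dists[0] for d in dists)

# With a stable mask, the XOR of both register values is mask-free.
for t, c, r in product((0, 1), repeat=3):
    old, new = register_values(t, c, r, r)
    assert old ^ new == f(0, c) ^ f(t, c)

print("stable mask leaks:   ", transition_leaks(True))
print("refreshed mask leaks:", transition_leaks(False))
```

In this simplified setting, a single probe that records two consecutive register values sees a distribution that depends on the processed data whenever the mask stays stable, whereas an independently refreshed mask makes the pair uniform.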

Summary
We have provided several case studies in which the security of only small components (e.g., S-Boxes) has been analyzed, e.g., by SILVER. However, when such modules are plugged into larger designs, the security of the final constructions could only be evaluated by experimental analysis, which can naturally be erroneous. The authors of those designs have not seen the leakages which we observed with PROLEAD, but this does not mean that the same implementations, evaluated by means of a different setup in another environment (with lower noise), show the same level of robustness. This indeed re-highlights the application of composable secure gadgets and the necessity of having proofs for the implementations instead of being dependent on manually-crafted optimized designs and experimental analyses.
We would further like to stress that the leakages which we found with PROLEAD in the aforementioned case studies are not necessarily exploitable. However, such designs cannot be considered probing secure, which is often in contradiction with the authors' claims. A natural question is whether (robust) probing security is important in practice. It is true that the (robust) probing model is relatively conservative, but it captures any leakage independent of how the actual circuit is realized in hardware and which timing characteristics the cells of the underlying library have. In short, we believe that if a circuit is (robust) probing secure (guaranteed, for example, through composable gadgets), its physical realization is very likely secure in practice as long as the specifications of the circuit are not changed, e.g., the netlist stays unchanged. This statement stays valid even if the underlying ASIC library changes, which just alters the power and timing characteristics of the instantiated gates.

Limitations
While PROLEAD can evaluate the probing security of larger designs whose probing security cannot be evaluated by formal verification tools, it turns out that some designs that satisfy higher-order probing security are too large even for a complete evaluation with PROLEAD. For example, consider the DOM AES S-Box [GMK16] with different security orders evaluated in Section 5. In Figure 3, we see that the number of probing sets grows exponentially with the increasing number of shares, leading to an exponentially increased runtime of PROLEAD. We can conclude from Table 2 that we can analyze a DOM AES S-Box up to the third order, while a fourth-order evaluation possibly takes months. Besides the runtime, the memory requirements grow exponentially and may become the limiting factor on some devices. On the one hand, the number of contingency tables grows. On the other hand, each contingency table may grow exponentially, as the number of probes in a probing set increases with the desired security order. Moreover, we remark that an increased number of probes per set and the resulting larger contingency tables only lead to statistically confident results if more simulations are considered. Hence, the number of simulations required to reliably detect an effect also grows with the desired security order. Nevertheless, we remark that a partial evaluation of larger designs is still possible, as the user can limit the evaluation in terms of considered probes and clock cycles. Thus, even when a complete evaluation is not feasible, PROLEAD can help the user to find potential vulnerabilities in a design.
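The growth described above can be made concrete with a short counting sketch. The numbers are hypothetical and do not reproduce Figure 3 or Table 2: the sketch assumes 1 000 candidate probe positions, 8 observed bits per glitch-extended probe, and probing sets of exactly d probes for security order d.

```python
from math import comb

def probing_sets(num_probe_positions, order):
    """Number of probing sets with exactly `order` probes."""
    return comb(num_probe_positions, order)

def max_table_columns(bits_per_probe, order):
    """Upper bound on distinct joint observations of one probing set."""
    return 2 ** (bits_per_probe * order)

positions = 1000     # hypothetical number of probe positions
bits_per_probe = 8   # hypothetical number of bits observed per extended probe

for order in (1, 2, 3, 4):
    print(f"order {order}: "
          f"{probing_sets(positions, order):>14d} probing sets, "
          f"up to {max_table_columns(bits_per_probe, order)} observations per set")
```

Both quantities multiply: more probing sets mean more contingency tables, and each table can have more cells, so more simulations are needed to populate every table adequately.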

Conclusions
In this work, we introduced PROLEAD, a new simulation-based approach to evaluate the probing security of masked implementations. Although it depends on simulations, in contrast to the state of the art, PROLEAD is free of any leakage/power model and directly examines the robust probing security of the given implementations. Thanks to gate-level simulations, bit slicing, and parallelism, PROLEAD achieves high performance and is able to evaluate masked full cipher implementations in a reasonable time, e.g., the first-order security of a masked AES-128 encryption function in a couple of hours. This is certainly beyond the capacity of formal verification tools (like SILVER), which can at most evaluate subcircuits, e.g., gadgets or small S-Boxes. We should also mention that since PROLEAD is based on simulations and statistical hypothesis tests, its evaluation results cannot be considered a proof when it reports the robustness of the given design. However, if PROLEAD detects a leakage, the found probing set can confidently be considered a counterexample violating the desired security. Furthermore, PROLEAD can estimate the reliability of the results by reporting the false-negative probability. Users can adapt the required statistical confidence level to their needs, while PROLEAD computes the minimum number of simulations needed to satisfy the required level. Due to this estimation of confidence, it is always clear what a user can expect from a result given by PROLEAD and how reliable the results are. Naturally, the results of PROLEAD are more reliable when a higher number of simulations is considered in the evaluations. Hence, similar to any statistical hypothesis test, there is a trade-off between the confidence level and the number of samples involved in the evaluation.
Nevertheless, we believe that PROLEAD is a highly helpful tool to rapidly examine the probing security of masked implementations prior to experimental analyses and/or fabrication. The tool can also be used by practitioners and engineers without access to any SCA measurement setup. Through several case studies, we have shown the ability of PROLEAD to find design flaws in implementations which are claimed to be robust probing secure. At the moment, PROLEAD supports glitch- and transition-extended probing security. A natural follow-up work would be to extend its features to cover security evaluations based on the random probing model.