Improved Leakage-Resistant Authenticated Encryption based on Hardware AES Coprocessors

We revisit Unterstein et al.’s leakage-resilient authenticated encryption scheme from CHES 2020. Its main goal is to enable secure software updates by leveraging unprotected (e.g., AES, SHA256) coprocessors available on low-end microcontrollers. We show that the design of this scheme ignores an important attack vector that can significantly reduce its security claims, and that the evaluation of its leakage-resilient PRF is quite sensitive to minor variations of its measurements, which can easily lead to security overstatements. We then describe and analyze a new mode of operation for which we propose more conservative security parameters and show that it competes with the CHES 2020 one in terms of performance. As an additional bonus, our solution relies only on AES-128 coprocessors, and it halves the amount of key material needed in order to encrypt and authenticate.


Introduction
In a recent work from CHES 2020, Unterstein et al. propose a leakage-resilient Authenticated Encryption (AE) mode of operation that is targeted for implementation in low-end microcontrollers empowered with hardware (e.g., AES, SHA256) coprocessors [USS + 20]. Their main target application is secure firmware update. The ingredients they use for this purpose are the leakage-resilient Pseudo-Random Function (PRF) of Medwed et al. for generating an ephemeral key and the tag [MSJ12], and a mode of operation adapted from Degabriele et al. to process the message [DJS19,KS20]. In this paper, we first exhibit some limitations of this work, both in the proposed design and in its security evaluation. We then describe mitigations that we analyze theoretically and validate experimentally.
The first limitation of [USS + 20] relates to its motivation of enabling secure firmware updates by leveraging hardware coprocessors rather than countermeasures like masking. While this appears as a strong motivation given the challenge of implementing masking securely on low-end devices [BS20], we observe that the main security property that is required for this purpose is ciphertext integrity with leakage in decryption. As discussed in [BPPS17], this property can be reached with two calls to a strongly protected block cipher and letting most of the other parts of the implementation leak in an unbounded manner. Yet, it requires a careful instantiation of the tag verification. Namely, either this tag verification is protected at the implementation level (e.g., via masking), which contradicts the goal of relying only on hardware coprocessors, or it has to be designed such that leaking the verified value cannot lead to forgeries. It is shown in [BPPS17] that an inverse-based verification can remain secure with unbounded leakages. The variant in [DEM + 20] achieves a similar result without inverse, but based on a slightly stronger assumption, which translates into security against Simple Power Analysis (SPA).
By contrast, the proposal of Unterstein et al. does not include such a careful design. Hence, we first show that a standard Differential Power Analysis (DPA) against its tag verification can lead to forgeries, completely breaking the integrity of a leaking software update using this scheme. We note that such an attack was already exhibited in [BBC + 20], Section 4.3, for an ARM Cortex-M0. We repeat it here for slightly more noisy targets embedding hardware coprocessors. We next propose an adaptation of the tricks in [BPPS17,DEM + 20] that we incorporate into an improved mode as a remedy.
A second limitation of [USS + 20] relates to its security analysis of the leakage-resilient PRF used for ephemeral key and tag generation. The security of this construction is based on the use of "carefully chosen plaintexts" which are aimed to generate key-dependent algorithmic noise. As mentioned in [MSJ12], this (heuristic) security relies on parallelism and on the fact that (i) exploiting leakage after the AES MixColumns operation is difficult (e.g., because only transition-based leakages that are intensive to guess provide significant leakage), and (ii) the leakage models of different S-boxes are sufficiently similar.
We revisit these requirements by investigating two hardware coprocessors available on ARM Cortex-M33 and M4 devices, with respectively 32-bit and 128-bit architectures, that are similar (yet not identical) to the ones analyzed by Unterstein et al. We observe that while the leakage of the AES MixColumns operation is indeed difficult to exploit, a profiling using a linear subspace and up to $4 \cdot 10^9$ traces allows us to exploit the different leakage models of different S-boxes, leading to significant reductions of the data complexities reported at CHES 2020. Since the complexity of these attacks directly impacts the number of bits that can be "absorbed" per block cipher call in the tree-based PRF of Medwed et al., these results imply important updates on the efficiency of such constructions. More conceptually, these investigations put forward that evaluating leakage-resilient PRFs is a quite delicate task. That is, in contrast with standard countermeasures like masking and shuffling, which mostly depend on a sufficiently high level of noise in the measurements (that may be hard to obtain on low-end devices, but is at least easy to evaluate), leakage-resilient PRFs enable attacks with averaged observations so that they rather require a sufficiently low side-channel signal in the measurements. Our experiments show that evaluating such a signal is a quite sensitive process and that slight changes in the measurement setup or profiling methods can significantly modify the perceived level of security of such primitives. Hence, they suggest that considering sufficient security margins is needed in order to mitigate risks of overstated security. We insist that we do not claim leakage-resilient PRFs cannot be implemented securely: our only claim is that their security evaluation has been less thoroughly studied than for countermeasures like masking.
So while such PRFs remain interesting candidates for secure implementation in low-end devices with hardware coprocessors, our results suggest a deeper understanding of their security assumptions is needed, which is an interesting research direction.
Based on these security analyses and keeping the motivation of providing efficient and secure firmware updates, a third observation is that the AE mode in [USS + 20] requires two passes in decryption. While the use of two passes is needed to provide strong confidentiality guarantees in the presence of decryption leakage (as for example witnessed by the TEDT mode of operation [BGP + 20]), optimal integrity guarantees in the presence of leakage can already be obtained with a single-pass and online mode, as reflected by the so-called Grade-2 designs in [BBC + 20]. We therefore provide a new mode of operation that is exclusively based on block ciphers and satisfies such integrity guarantees (footnote 1). An obvious direction to achieve this goal would be to use the TET mode of operation described in [BGP + 20], which relies on Tweakable Block Ciphers (TBC). Yet, this solution would require instantiating the TBC with block cipher calls. Using generic constructions like [Men15] would then require arguing about security in the presence of leakage during the intermediate computations (especially since our integrity guarantees are aimed to hold with unbounded leakages) (footnote 2). Besides, the tag verification of TET is based on inverting a TBC while our goal is to leverage a leakage-resilient PRF for this part.
We propose a solution avoiding these caveats which instantiates a Grade-2 design by leveraging a block cipher based hash function introduced by Mennink [Men17], slightly tailored for our purposes. It requires four block cipher calls per message block. We also propose a careful instantiation of the tag verification that does not rely on masking. We show that this design is competitive with the CHES 2020 proposal and that it even improves it when a SHA256 coprocessor is not available (which is sometimes the case in low-end embedded devices). For completeness, we additionally describe solutions to ensure confidentiality in the presence of decryption leakage, which can be necessary in contexts where Intellectual Property (IP) must be protected [BPPS17,USS + 20]. One heuristic option is to leverage a multi-user variant of our Grade-2 mode. Another option is to use a block cipher based variant of TEDT leveraging Mennink's hash function. As a bonus, our solutions also halve the amount of key material compared to [USS + 20].
In summary, we show that (i) the AE scheme in [USS + 20] does not enable secure firmware updates unless its tag verification is protected against side-channel attacks with implementation-level countermeasures (which contradicts its design goals); (ii) the evaluation of the LR-PRF used in [USS + 20] is a sensitive task that hardly transfers from one device (or measurement setup) to another, hence suggesting the use of conservative security parameters when instantiating it; (iii) block cipher based modes of operation can be designed in order to mitigate the aforementioned integrity issue at the algorithmic level, in one or two passes depending on the type of protection against leakage-based IP theft required, and avoiding the use of SHA256, which can lead to performance gains.
We finally mention an application note from NXP on leakage-resilient primitives as evidence that there is industry demand for such concrete security solutions.
Related works and terminology. The idea of limiting the power of side-channel attacks thanks to adapted cryptographic designs is an old one, already present in Kocher's first patents [Koc05]. Dziembowski and Pietrzak formalized this idea and proposed the first leakage-resilient primitive proven based on standard reductionist arguments [DP08]. Their work has then been the source of many leakage-resilient designs for various cryptographic primitives such as Pseudo-Random Number Generators (PRNGs) [Pie09,YSPY10], PRFs [SPY + 10, FPS12] or Pseudo-Random Permutations [DP10]. These seed results on basic cryptographic primitives next triggered analyses of complete functionalities like encryption and authentication [BPS15,PSV15], and rapidly shifted the attention of designers to Authenticated Encryption (AE) schemes mixing both integrity and confidentiality guarantees [BMOS17, BKP + 18, BPPS17, GPPS19]. Two definitional frameworks exist for this purpose. Barwell [...]
Footnote 1: [...] key evolution mechanism that provides some confidentiality guarantees comes at very limited price and we therefore have limited incentives to design a Grade-1b mode in the context of this paper.
Footnote 2: A more efficient alternative would be to assume the AES-256 to provide a tweakable block cipher (e.g., as done in [BÖS11]). Yet, AES-256 coprocessors are less standard than AES-128 ones, and such an assumption is at least theoretically disputable in view of the related-key attacks described in [BK09]. In particular, such attacks show that formally, the AES-256 cannot instantiate an ideal cipher.
[...] is well-suited to analyze integrity and confidentiality guarantees separately and to take advantage of the different physical requirements they may lead to. We will use the term leakage-resilient for the PRF of Medwed et al. [MSJ12], the term leakage-resilient for the AE mode of [UHSS17] which is based on the definitions of Barwell [...] [GGM86], the leakage-resilience of which has been analyzed by Faust et al. [FPS12]. Its originality mainly lies in a careful selection of the plaintexts to encrypt in order to generate "key-dependent" algorithmic noise, which has only been analyzed heuristically so far. We finally mention that the issue of secure tag comparison is not a new one in cryptography. It has been frequently discussed in the context of authenticated encryption and has for example been put forward by both Barwell et al. [BPS15] and Berti et al. [BPPS17].

Background
In the following, we first recall the side-channel attack we use in the rest of the paper. Then, we continue with a description of the leakage-resilient PRF from [MSJ12].

Linear subspace template attacks
Profiled side-channel attacks can be performed by building templates in a linear subspace [APSQ06]. To do so, the adversary estimates the conditional probability density function of the leakage according to:
$$\hat{f}[l \mid x] = \frac{1}{\sqrt{(2\pi)^{n_d} \, |\hat{\Sigma}|}} \exp\left( -\frac{1}{2} (W l - \hat{\mu}_x)^{\top} \hat{\Sigma}^{-1} (W l - \hat{\mu}_x) \right),$$
where $l$ is a leakage vector of length $n_p$ and $x$ a sensitive variable. The matrix $W$ with dimensions $n_d \times n_p$ is the linear projection from the leakage domain (directly sampled from a scope) to its linear subspace with $n_d$ dimensions. The vector $\hat{\mu}_x$ of dimension $n_d$ is the estimated mean of the leakage in the linear subspace conditioned on $x$. Finally, the matrix $\hat{\Sigma}$ is the estimated pooled covariance matrix within the linear subspace. Thanks to this conditional distribution, the adversary can use Bayes' theorem and apply maximum likelihood to recover the secret, just like in a standard template attack [CRR02].
When dealing with subspace-based template attacks, a few parameters have to be carefully chosen, such as $n_p$, $n_d$ and a method to derive the projection $W$. In this work, the dimensions of interest are selected to be the $n_p$ most informative time samples according to the Signal-to-Noise Ratio (SNR) [Man04]. The values of $n_p$ and $n_d$ are chosen in order to maximize the Perceived Information (PI) after projection [BHM + 19]. Finally, the projection is built based on a Linear Discriminant Analysis (LDA) [SA08].
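As a worked illustration of the maximum-likelihood step, the following sketch writes the score of a key candidate, assuming Gaussian templates with a pooled covariance matrix and a uniform prior on the key. The symbols $k^*$ (key candidate), $q$ (number of attack traces), $l_i$ (the $i$-th trace) and $x_i(k^*)$ (the value of the sensitive variable predicted for trace $i$ under candidate $k^*$) are notation introduced here for illustration.

```latex
\tilde{k}
  = \arg\max_{k^*} \sum_{i=1}^{q} \log \hat{f}\,[\,l_i \mid x_i(k^*)\,]
  = \arg\min_{k^*} \sum_{i=1}^{q}
      \bigl(W l_i - \hat{\mu}_{x_i(k^*)}\bigr)^{\top}
      \hat{\Sigma}^{-1}
      \bigl(W l_i - \hat{\mu}_{x_i(k^*)}\bigr)
```

Because the covariance is pooled, the Gaussian normalization constant is identical for all candidates and drops out, so ranking candidates reduces to comparing sums of squared Mahalanobis distances in the $n_d$-dimensional subspace.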

Medwed et al.'s leakage-resilient PRF
The LR-PRF introduced in [MSJ12] and used in [USS + 20] aims at reducing the number of different plaintexts that can be encrypted with a single key, in effect reducing DPA security to SPA security. As illustrated in Figure 1, it processes an input $x$ with a key $k$ using a tree-based construction: at each stage, a plaintext depending on $n_b$ bits of $x$ is encrypted with a key taken as the output (ciphertext) of the previous stage. The last stage of the LR-PRF is a whitening that encrypts a public plaintext. Therefore, the total number of calls to the block cipher (or stages) to run the LR-PRF is $1 + n/n_b$, where $n$ is the size of the LR-PRF input. In order to mount a DPA against $k$, the adversary has access to $2^{n_b}$ different plaintexts. Since she has control of $x$, she can observe multiple leakages for the same input and so obtain averaged leakages with minimum noise.
In [MSJ12], the authors format the plaintext based on the $n_b$ bits of $x$ such that all the bytes have the same value (e.g., set to these $n_b$ bits). As a result, for a byte-oriented cipher like the AES, the bytes of the plaintext that enter the key addition plus the S-box in the first cipher round will be equal. This makes the guesses performed by the DPA adversary equal for all the bytes. In a parallel implementation and assuming that the leakage model of each S-box is identical, an adversary will not be able to distinguish the leakage from the S-boxes. In case this attack vector is the best one (i.e., if attacking after the MixColumns operation is hard), the adversary will be forced to enumerate the permutations of all the key bytes. We will further elaborate on these hypotheses in Subsection 4.1.
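The tree walk described above can be sketched as follows. This is a minimal illustration and not the implementation of [MSJ12] or [USS + 20]: `toy_block_cipher` is a hash-based placeholder standing in for AES-128 (the Python standard library has no AES), and the order in which bits of the input are consumed, the all-zero whitening plaintext and the handling of a last, shorter chunk are assumptions made for the example.

```python
import hashlib

BLOCK_BYTES = 16  # AES-128 block and key size


def toy_block_cipher(key: bytes, plaintext: bytes) -> bytes:
    """Placeholder for AES-128 encryption (NOT a real block cipher).

    A truncated SHA-256 keeps the sketch self-contained; a real device
    would call its AES coprocessor here.
    """
    return hashlib.sha256(key + plaintext).digest()[:BLOCK_BYTES]


def format_plaintext(chunk: int) -> bytes:
    """Carefully chosen plaintext: 16 identical bytes holding the chunk."""
    return bytes([chunk]) * BLOCK_BYTES


def lr_prf(key: bytes, x: int, n: int = 128, n_b: int = 4,
           whitening_plaintext: bytes = bytes(BLOCK_BYTES)):
    """Walk the tree and return (output, number of block cipher calls)."""
    if not 1 <= n_b <= 8:
        raise ValueError("n_b must fit in one byte")
    if not 0 <= x < (1 << n):
        raise ValueError("x must be an n-bit value")
    state, calls, remaining = key, 0, n
    while remaining > 0:
        # Assumption: consume x from its most significant bits; the last
        # chunk is shorter when n_b does not divide n.
        take = min(n_b, remaining)
        remaining -= take
        chunk = (x >> remaining) & ((1 << take) - 1)
        # The previous ciphertext becomes the key of the next stage.
        state = toy_block_cipher(state, format_plaintext(chunk))
        calls += 1
    # Final whitening stage: encrypt a public plaintext.
    state = toy_block_cipher(state, whitening_plaintext)
    return state, calls + 1
```

With a 128-bit input, `n_b = 4` gives 33 block cipher calls and `n_b = 8` gives 17, which matches the $1 + n/n_b$ count; each stage key only ever encrypts at most $2^{n_b}$ distinct plaintexts.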

Attack against the integrity of the [USS + 20] LR-AE
In this section, we first describe the experimental setup used in the rest of the document. Then, we demonstrate a DPA against the tag verification that can be used to mount forgeries against the leakage-resilient AE proposed in [USS + 20].

Targets and measurement setup
We perform experiments on two commercial MCUs with AES coprocessors. The first one is an ARM Cortex-M4 STM32F439ZI mounted on an (unmodified) demonstration board Nucleo-144. The coprocessor has a 128-bit architecture, meaning that all the S-boxes are processed in parallel. The device is running at its maximum clock frequency of 180 [MHz].
The second target is an ARM Cortex-M33 STM32L562QEI6QU with ARM TrustZone and a dual core. The MCU is also mounted on an (unmodified) demonstration board STM32L562E-DK (footnote 5). The coprocessor has a 32-bit architecture, meaning that a single AES column is processed in parallel. This target is also running at its maximum clock frequency of 110 [MHz]. On both targets, the supply voltage is lowered to 1.8[V], which is the minimal specified value, in order to reduce the noise generated by the on-chip voltage regulator. The coprocessors of the two investigated targets are similar, but not identical, to the ones investigated in [USS + 20], which are respectively based on 32-bit and 128-bit architectures, and are running on ARM Cortex-M4 and ARM Cortex-M3 devices. We evaluate the coprocessor of a more recent ARM Cortex-M33 in place of the ARM Cortex-M3.
Similarly to [USS + 20], the side-channel leakage is measured with an electromagnetic (EM) probe. The EM signal is sampled with a PicoScope 6424E at 5[GSamples/sec] (with a 500[MHz] analog bandwidth), using a 10-bit resolution. This signal is amplified thanks to a PA 306 from Langer that offers 30[dB] of gain on the bandwidth 100[kHz] to 6[GHz]. For the ARM Cortex-M33, the probe RF-B 0.3-3 from Langer is used, while for the ARM Cortex-M4 the probe RS H2.5-2 from Rohde&Schwarz is selected. The quality of the equipment is similar to the one used in [USS + 20], except that we positioned the probe automatically rather than manually. For both targets, the probe was positioned thanks to an XYZ table used to scan the top of the (untouched) package. The SNR is estimated for all the S-boxes with 200,000 traces at each point of a grid. For the Cortex-M33, the grid is of size 16x16. For the Cortex-M4, we use a 64x64 grid (footnote 6). The heatmaps obtained after the grid scan of the Cortex-M4 are reported in Appendix A, Figure 15. For both targets, the heatmap for each S-box suggests placing the probe around the same area. Therefore, the probe is moved to the position maximizing the averaged SNR across the 16 S-boxes and we do not exploit spatial resolution as investigated in [UHSS17, UHS + 18].
Overall, we were able to capture the traces at a rate larger than 30,000[measurements/sec] for the two investigated targets. Similarly to [SM16], the AES coprocessor is used as a PRG to generate the random plaintexts and keys, which saves most of the cost of communication with a control desktop. Namely, it encrypts in CBC mode a zero string with a known random IV. The SNRs are then computed on-the-fly thanks to a one-pass algorithm similar to [SM16], which entirely avoids the cost of disk writing. The estimations of SNR as well as most of the side-channel analyses have been performed with the open-source library SCALib (footnote 7).
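To illustrate how an SNR can be estimated in one pass without storing traces, the sketch below keeps running per-class means and variances with Welford's update for a single time sample. It is a generic illustration and does not reproduce the SCALib API or the exact algorithm of [SM16]; the class name `OnePassSNR` and its methods are hypothetical.

```python
from collections import defaultdict


class OnePassSNR:
    """Streaming SNR estimate for one time sample (Welford's updates)."""

    def __init__(self):
        # Per-class running state: [count, mean, sum of squared deviations].
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])

    def update(self, label: int, sample: float) -> None:
        """Absorb one measurement observed for the intermediate value `label`."""
        s = self.stats[label]
        s[0] += 1
        delta = sample - s[1]
        s[1] += delta / s[0]
        s[2] += delta * (sample - s[1])

    def snr(self) -> float:
        """Variance of the class means divided by the mean class variance."""
        means = [s[1] for s in self.stats.values()]
        variances = [s[2] / s[0] for s in self.stats.values()]
        grand_mean = sum(means) / len(means)
        signal = sum((m - grand_mean) ** 2 for m in means) / len(means)
        noise = sum(variances) / len(variances)
        return signal / noise
```

Each trace is folded into the accumulators as soon as it is captured, so memory use depends only on the number of classes (256 for one S-box byte) and not on the number of measurements.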

Tag recovery attack
In [BBC + 20], the authors present a practical side-channel attack against a tag verification implemented on an ARM Cortex-M0. It can directly be used to forge valid messages under adversarial control [BPPS17], and therefore to break the integrity of a software update. We repeat that experiment with two differences: (i) we use an ARM Cortex-M4, which is noisier, and (ii) we perform template attacks in a linear subspace instead of directly in the leakage domain (footnote 8). The target code is described in Listing 1, where a 128-bit tag candidate S is compared with the correct tag T by using 4 XOR instructions. After using 3 OR and 1 comparison, the returned value is 1 if all the bits of S and T are equal. We report the corresponding assembly code obtained with objdump in Appendix A, Listing 2.
More precisely, considering the scheme from [USS + 20], Algorithm 3, an adversary can try to forge a valid tag T for a ciphertext of her choice and a nonce. By having access to a decryption oracle, she can attempt decryption with multiple random tags S. The oracle will first compute the valid tag T (lines 12 and 13) and then compare it with S (line 14), allowing one to mount a DPA on the XOR instructions in Listing 1. In Figure 2a, we report the median rank of the 128-bit tag estimated with [PSG16], as a function of the number of calls to the decryption oracle. In Figure 2b, we similarly report the success probability if no enumeration is possible. After 3,000 calls, the adversary can recover the valid tag in full without enumeration with a probability almost equal to one. After 200 calls, she is already able to reduce the rank of the valid tag below $2^{32}$. This would allow her to still succeed in the attack by enumerating these $2^{32}$ tags.
Listing 1 (fragment):
    uint32_t tag_verif(uint32_t *S, uint32_t *T) {
        uint32_t flag;
        [...]
Figure 2 (a: median tag rank; b: success probability without enumeration). Metrics estimated on 100 independent random keys. $2 \cdot 10^6$ profiling traces used.
Footnote 5: https://www.st.com/en/evaluation-tools/stm32l562e-dk.html
Footnote 6: The Cortex-M33 package is smaller than the Cortex-M4, which explains the different resolutions.
Footnote 7: https://github.com/simple-crypto/SCALib
Footnote 8: The maximum SNR on the ARM Cortex-M4 we investigate is approximately 0.15 while it is approximately 0.8 on a similar ARM Cortex-M0 as investigated in [BBC + 20].
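The principle of the attack on the XOR between a chosen candidate and the secret tag can be illustrated with a small simulation. Note that this is not the profiled template attack used in the paper: it is a simple correlation-based DPA on one tag byte, with simulated leakage that is assumed to follow a Hamming-weight model plus Gaussian noise. The function names, the noise level and the number of queries are illustrative assumptions.

```python
import random


def hamming_weight(value: int) -> int:
    return bin(value).count("1")


def pearson(xs, ys) -> float:
    """Sample Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate_tag_byte_recovery(secret_byte: int, n_queries: int = 2000,
                               noise_std: float = 1.0, seed: int = 0) -> int:
    """Recover one byte of T from simulated leakage of S xor T."""
    rng = random.Random(seed)
    # Random tag candidates S submitted to the (simulated) decryption oracle.
    candidates = [rng.randrange(256) for _ in range(n_queries)]
    # Assumed leakage model: Hamming weight of the XOR result plus noise.
    leakages = [hamming_weight(s ^ secret_byte) + rng.gauss(0.0, noise_std)
                for s in candidates]

    def score(guess: int) -> float:
        model = [hamming_weight(s ^ guess) for s in candidates]
        return pearson(model, leakages)

    # The guess whose predicted leakage correlates best is kept.
    return max(range(256), key=score)
```

For example, `simulate_tag_byte_recovery(0xA7)` returns the secret byte in this noise regime. The point of the sketch is that the adversary controls S and the device always XORs it with the same secret T, which is exactly the setting in which a DPA applies.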
In the context of firmware updates investigated in [USS + 20], this tag recovery could be exploited as follows. First, an adversary can select a (garbage) ciphertext C and a nonce N of her choice. By targeting the verification, she is able to obtain the valid tag T and to decrypt the (garbage) plaintext M corresponding to the ciphertext C. This already results in a valid authenticated (garbage) firmware update, which can lead to practical issues (e.g., denial-of-service) if no additional protection mechanisms are implemented. Second, if we additionally assume that the adversary can observe the garbage plaintext, she can then also recover the random string R output by the LR-PRG, which depends only on N and a secret key, by XORing C with M. In this case, she can choose a (malicious) plaintext M' and compute the corresponding ciphertext C' = R ⊕ M', which is valid for the nonce N. Afterwards, she only needs to repeat the attack against the tag verification to recover the corresponding valid tag T'. This attack therefore allows the adversary to forge an authenticated firmware update (C', T', N) that she controls without recovering the long-term key. We note that it admittedly assumes that the adversary can observe the garbage plaintext, which may require some additional (possibly side-channel based) reverse engineering. But on the one hand, our following investigations will show that for a well instantiated mode, integrity can hold even with an unbounded leakage of ephemeral secrets (e.g., plaintexts), and on the other hand, this attack shows that if one single correct plaintext/ciphertext pair is ever recovered, then forging any malicious software update of at most this size becomes possible.
Based on the above, a natural question is: where does this weakness arise from? That is, does this tag recovery attack result from problems with the model, the security proofs or a wrong instantiation of the construction? In this respect, we note that the model used to analyze [DJS19,KS20] is the one of Barwell et al., which considers leakage during verification [BMOS17]. They observe that a MAC whose verification algorithm recomputes the tag and checks for equality with the candidate tag cannot be strongly unforgeable with leakage. Therefore, they proposed an instance of leakage-resilient MAC based on pairings to mitigate this issue. The work of Degabriele et al. then suggested that such expensive pairings could be avoided in the (quite realistic) non-adaptive leakage setting, enabling simple MAC verifications that work by recomputing the tag on an input pair (nonce, ciphertext) and checking that it is identical to the given tag. Yet, it turns out their analysis ignores the leakage of the tag comparison. We show in this section that a simple DPA with non-adaptive leakages is sufficient to forge MACs in this case, as discussed in [BPPS17]. Such an issue was for example recognized in the ISAP v2 design [DEM + 20], which comes with a discussion of secure tag comparison that was missing in ISAP v1 [DEM + 17]. But it seems that it was ignored in the theoretical treatment of [DJS19,KS20], an error that was then propagated in [USS + 20], as made explicit by the sentence: "this does not add any additional attack vectors, as all sensitive operations are located within the LR-PRF and LR-PRG". Clearly, the comparison of the tags is another sensitive operation.
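The keystream-reuse step of this forgery can be made concrete with a few lines of code. The sketch only models the XOR arithmetic (recovering R from one known plaintext/ciphertext pair and re-encrypting a chosen plaintext under the same nonce); the tag recovery itself is not modelled, and the function names are illustrative.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def forge_ciphertext(garbage_c: bytes, garbage_m: bytes,
                     malicious_m: bytes) -> bytes:
    """Build a ciphertext for a chosen plaintext under the same nonce."""
    keystream = xor_bytes(garbage_c, garbage_m)  # R = C xor M
    if len(malicious_m) > len(keystream):
        raise ValueError("forged message cannot exceed the recovered keystream")
    # Forged ciphertext: R xor chosen plaintext, truncated to its length.
    return xor_bytes(keystream[:len(malicious_m)], malicious_m)
```

As a usage example, if the device would decrypt `garbage_c` to `garbage_m` under nonce N, then the output of `forge_ciphertext(garbage_c, garbage_m, b"malicious update")` decrypts to `b"malicious update"` under the same nonce, which is why a single leaked plaintext/ciphertext pair suffices for any message of at most that size.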

Evaluation of the [MSJ12, USS + 20] LR-PRF
In this section, we detail our side-channel analysis of the LR-PRF. We first give a qualitative analysis aimed to highlight the relevant sources of leakage to target such a construction. Next, we describe our attack results quantitatively and conclude with a more general discussion on the evaluation of low-complexity DPA/SPA.

LR-PRF: qualitative leakage analysis
The heuristic security provided by Medwed et al.'s LR-PRF is based on two main working principles. On the one hand, it bounds the number of averaged leakage traces that an adversary can exploit to perform a DPA. Namely, she can only access $2^{n_b}$ traces to attack each stage of the LR-PRF. On the other hand, the plaintexts that are encrypted are not random since they are structured as vectors with 16 times the same value. This aims to generate "key-dependent algorithmic noise" as described in Subsection 2.2. For this purpose, it is additionally required that the implementation executes the S-boxes in parallel, that their leakage models are sufficiently similar and that targeting the MixColumns operation is hard. If these conditions are satisfied, it implies that the adversary will not be able to distinguish the leakage from the 16 S-boxes (due to their identical input), at least leading to the requirement to enumerate all the possible permutations of key bytes, which has a cost of $16! \approx 2^{44.25}$ for the AES (see [MSJ12], Section 5). As a result, the main qualitative questions when analyzing the physical security of such a primitive are whether attacking MixColumns is indeed hard, and whether the S-boxes' leakage models are sufficiently similar. In the rest of this section, we analyze these two points in detail.
Regarding the first question, Figure 3 reports the SNR observed on the S-boxes and MixColumns output bytes, for our two targets. On the ARM Cortex-M33, we first note that the 4 columns are processed serially, confirming that the underlying architecture is 32-bit. For the ARM Cortex-M4, all the SNR peaks occur simultaneously, hinting towards a 128-bit architecture. We observe that the SNR on the MixColumns operation is at least two orders of magnitude lower than the one on the S-boxes. We assume the latter is due to MixColumns being implemented by harder-to-target combinatorial logic and/or its leakages requiring wider key guesses to predict transitions. Hence, we conclude from this experiment that Medwed et al.'s first requirement is sufficiently satisfied.
Regarding the second question, Figure 4 depicts the leakage distributions in a linear subspace and in the direct leakage space for a constant value at the input of each of the first AES column's S-boxes. The distributions are shown for the two first dimensions and the linear subspace is the one of the first S-box. Looking at the distributions on the ARM Cortex-M33 (resp., Cortex-M4), we observe in Figure 4a (resp., Figure 4c) that they are already different in the original leakage space. These differences in the first dimensions are enhanced by moving to the linear subspace, as depicted in Figure 4b (resp., Figure 4d).
In Figure 5, we then study the difference between the leakage distributions of the first vs. the other S-boxes, based on Hotelling's $T^2$-test [Hot31]. On these plots, the X-axis is the amount of averaging performed by the adversary and the Y-axis is the significance of the difference. First, we observe that by increasing the amount of averaging, the significance increases. This is expected since it reduces the noise on the distributions, making their means easier to estimate. Additionally, for the ARM Cortex-M33 (resp., Cortex-M4), by comparing Figure 5a (resp., Figure 5b) with Figure 5c (resp., Figure 5d) we observe that the differences are of the same magnitude in the linear subspace and in the direct leakage space but that more dimensions are needed in the second case (200 dimensions instead of 6). We notice that the advantage of using additional dimensions is larger on the ARM Cortex-M33. The difference between the distributions on the ARM Cortex-M33 is also larger than on the ARM Cortex-M4, presumably due to a more serial implementation.
While these results remain qualitative, they highlight that the leakage models of different S-boxes are different for our two targets. Hence, they contradict one of Medwed et al.'s assumptions. Informally, our results suggest that with a precise enough profiling, it should be possible to exploit these differences, enabling the adversary to escape the enumeration of a 16-permutation as expected from the LR-PRF design. In that context, projecting the leakages in a linear subspace allows "concentrating" these differences in a few dimensions, which makes them easier to exploit by an adversary/evaluator. In Appendix B, we illustrate that the amount of information extracted in the original leakage space and in the linear subspace is similar. But since the second solution uses fewer dimensions, it requires fewer profiling traces to accurately profile a model. We note that [UHSS17,UHS + 18] already pointed out that linear subspaces and spatial resolution can be used to observe different leakage models for different S-boxes. Our experiments show that a similar effect can be obtained without spatial resolution for the targets we look at.
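The two quantities used in this argument, the enumeration cost over the 16 key-byte positions and the number of distinct plaintexts per stage, can be checked with a short computation. The function names are illustrative.

```python
import math


def permutation_enumeration_bits(n_sboxes: int = 16) -> float:
    """log2 of the number of orderings of the key bytes."""
    return math.log2(math.factorial(n_sboxes))


def distinct_plaintexts(n_b: int) -> int:
    """Number of different plaintexts available per LR-PRF stage."""
    return 2 ** n_b
```

`permutation_enumeration_bits()` evaluates to roughly 44.25, in line with the figure quoted from [MSJ12], and `distinct_plaintexts(8)` gives the 256 averaged traces an adversary can use per stage when 8 bits are absorbed per call.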
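For reference, a common form of the two-sample Hotelling statistic is recalled below. The text does not specify which variant is used, so this is given under the assumption of two samples of sizes $n_1$ and $n_2$ in $p$ dimensions with a shared covariance; the symbols $\bar{l}_1$, $\bar{l}_2$, $S_1$, $S_2$ and $S_p$ are notation introduced here.

```latex
T^2 = \frac{n_1 n_2}{n_1 + n_2}\,
      (\bar{l}_1 - \bar{l}_2)^{\top} S_p^{-1} (\bar{l}_1 - \bar{l}_2),
\qquad
S_p = \frac{(n_1 - 1)\, S_1 + (n_2 - 1)\, S_2}{n_1 + n_2 - 2},
\qquad
\frac{n_1 + n_2 - p - 1}{p\,(n_1 + n_2 - 2)}\; T^2 \sim F_{p,\; n_1 + n_2 - p - 1}
\ \text{ under } H_0 .
```

Under this form, averaging shrinks the within-class covariance $S_p$ while leaving the difference of means unchanged, which is consistent with the significance growing with the amount of averaging.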

LR-PRF: quantitative attacks
We now report the results of attacks against the LR-PRF. We used template attacks in linear subspace, as described in Subsection 2.1, since they provided the best results. For the ARM Cortex-M33 (resp., Cortex-M4), the profiling is performed on $500 \cdot 10^3$ (resp., $2 \cdot 10^6$) averaged traces. For example, for 256 (resp., 2048) averaging, we collected a total of $256 \cdot 500 \cdot 10^3$ (resp., $2048 \cdot 2 \cdot 10^6$) measurements. We note that the profiling was performed using random keys and plaintexts in order to avoid biasing our models with key-dependent features that would not be exploitable during the online attack phase.
The median key rank estimated according to [PSG16] is reported in Figure 6, where the X-axis is the LR-PRF parameter $n_b$ (see Subsection 2.2). We observe that by increasing $n_b$, and so the number of different averaged traces available to the adversary, the key rank significantly decreases. The fact that the key rank decreases below $16! \approx 2^{44.25}$ confirms that we are able to exploit the different leakage models of the different S-boxes. Averaging also allows decreasing the key rank but its impact saturates at some point, similarly to what can be observed in [USS + 20], Figure 10. On the ARM Cortex-M33 (resp., Cortex-M4), starting from 64 (resp., 512), averaging does not improve the attacks.
Concretely, for the ARM Cortex-M33 our adversary reduces the median key rank down to $2^{9}$ if $n_b = 8$, and down to $2^{50}$ if $n_b = 4$. For keeping it larger than $2^{90}$, we need $n_b \leq 2$. The ARM Cortex-M4 is more resistant thanks to its parallel architecture: with $n_b = 8$, the median key rank is equal to $2^{25}$ but with $n_b \leq 5$ it is already above $2^{96}$.
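The measurement counts quoted above can be worked out explicitly, together with a rough acquisition time. The time estimate assumes the capture rate of 30,000 measurements per second mentioned in the setup description; since that rate is stated as a lower bound, the resulting durations are upper bounds. Function names are illustrative.

```python
def total_measurements(averaged_traces: int, averaging: int) -> int:
    """Raw measurements needed to build the averaged profiling traces."""
    return averaged_traces * averaging


def acquisition_hours(measurements: int,
                      rate_per_second: float = 30_000) -> float:
    """Capture time in hours at a given measurement rate."""
    return measurements / rate_per_second / 3600
```

For the Cortex-M33 example this gives 128,000,000 measurements (a bit more than one hour of capture at that rate), and for the Cortex-M4 example 4,096,000,000 measurements (about 38 hours), which corresponds to the order of $4 \cdot 10^9$ traces.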

LR-PRF: the evaluation challenge
Our analyses imply that smaller $n_b$ values than reported in [USS + 20] should be used for our similar targets and measurement setup. For example, Figure 11 in this reference suggests that on both targets, $n_b = 6$ leads to more than 100 bits of security. In our evaluations, this value is reduced to 20 bits (resp., 70 bits) on the ARM Cortex-M33 (resp., Cortex-M4).
Technically, these results are not in strong contradiction with previous works. For example, as illustrated in [UHSS17, UHS + 18], the resistance of a LR-PRF highly depends on the adversary's measurement capabilities and the exploitable signal within the target. On an FPGA implementation of the LR-PRF, it has been shown that such schemes can be insecure even for $n_b = 1$ depending on the floorplanning. To some extent, this is typical for any side-channel countermeasure: eventually, their security depends on physical assumptions. Fulfilling such assumptions is always a challenge, whether it is to ensure a sufficient noise level or a sufficiently low side-channel signal. Yet, one important conclusion of our experiments is that the "low signal" requirement of LR-PRF constructions, which ensures that low-complexity DPA (or even SPA) is hard, is also more challenging to evaluate. The main reason is that such constructions generally enable averaging the physical noise. In this context, evaluating the signal becomes very dependent on the shape (i.e., the deterministic part) of the leakage function, while "noise amplification" countermeasures such as masking or shuffling only require that the ratio between signal and noise (SNR) is sufficiently low, independently of the signal shape. Since the shape of a leakage function is quite sensitive to (even minor) modifications of the target designs and measurement setups, constructions relying on limited signal (if implemented without additional countermeasures) inherently carry a higher risk of suboptimal (i.e., non worst-case) evaluation. Such a risk is illustrated by the comparison between our analysis and the one of [USS + 20], where mild variations of the measurement setup and a collection of 1,000 times more profiling traces lead to different security parameters for similar targets (i.e., same architecture but not exactly the same MCU).
These risks could be amplified if the adversary obtains a better a-priori knowledge of her target, or performs larger key guesses than the evaluator. Hence, we conclude that designers should be conservative when selecting the parameter $n_b$ of a LR-PRF. As will be detailed in Section 6, low $n_b$ values are anyway not overly detrimental from a performance viewpoint, since they can be amortized for long messages.

The LR-BC-2 mode of operation
In order to design a one-pass AE mode with Grade-2 leakage security, we need to combine an ephemeral key evolution scheme and a compression function. On the one hand, the ephemeral key evolution scheme allows iteratively processing each message block with a fresh key, which is reminiscent of designs conferring confidentiality guarantees in the presence of encryption leakage [BBC + 20]. On the other hand, the compression function is used to progressively absorb the blocks to make the computation more and more dependent on the already processed blocks, leading to a kind of digest that can then be authenticated using a fixed-length leakage-resilient Message Authentication Code (LR-MAC). While relying on a TBC is not only a convenient solution to implement the above principle but also a way to achieve high security bounds (e.g., with 2 TBC calls per block as in TET [BGP + 20]), we do not have the same degree of freedom when only relying on a block cipher, as explained in the introduction. In this section, we show how we can still reach high security guarantees in the presence of decryption leakages when only block ciphers like the AES are available. Our new mode, coined LR-BC-2, only requires four BC calls per block, which is equivalent in performance to a variant of TET where each TBC call would be implemented by 2 BC calls as in [Men15], even if such a composition would be insecure in our (unbounded) leakage setting. We start with a general description of our design and its rationale. We follow with a formal quantitative analysis of its integrity in the presence of leakage, which is the most important property for our intended (secure firmware update) application. We finally provide a shorter (heuristic) discussion of confidentiality in the presence of leakage, for which LR-BC-2 does not exhibit particular innovations.

Specification and rationale
Our mode follows the general 3-step blueprint suggested in [BBC + 20] for Grade-2. In the initialization step, we compute a random $2n$-bit state $(K_1, L_1)$ from a key derivation function $\mathsf{KDF}_K(N)$, where $K_1$ is an ephemeral encryption key. For this first step, we borrow the BC-based bit-by-bit PRF of the previous section, which we apply to the nonce $N$ to get a seed $K_0$ from which our initial state $(K_1, L_1)$ is produced. As a second step, the bulk of the computation is made as a one-time encryption of $M_1 \cdots M_\ell$ where, each time a message block $M_i$ is processed, the corresponding ciphertext block $C_i$ is created and absorbed, and the state is refreshed. Eventually, in the finalization step, we use an LR-MAC, aka Tag Generation Function (TGF), to authenticate the final state with an appropriate verification mechanism. We now focus on the second step.
Our starting point is the BC-based compression function family due to Mennink [Men17], which achieves optimal collision security using only 3 calls to the underlying $n$-bit block cipher. In the sequel, we denote this function by MenH. Given a $2n$-bit state $(K_i, L_i)$ and an $n$-bit block $C_i$ to absorb, $\mathsf{MenH}(K_i, L_i, C_i)$ compresses the $3n$-bit input into a $2n$-bit output state $(K_{i+1}, L_{i+1})$. The reason why we rely on MenH is twofold: (i) in the ideal-cipher model, the collision bound holds even if all the intermediate values are exposed, which comes in handy for our integrity with leakage purpose; (ii) by carefully choosing MenH in the family, we can use it to turn $K_i$ into an ephemeral secret key allowing encrypting $M_i$, which only requires a single additional call to the BC. More precisely, if $E$ is the underlying BC, given an $i$-th $2n$-bit state in such a way that $K_{i+1}$ can, in turn, be used as the next ephemeral key. The value $\theta$ is a non-zero constant, e.g. $\theta = 10^*$.
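To make the data flow of the bulk step more concrete, the following minimal sketch mimics how each block is encrypted under the current ephemeral key and then absorbed into the state. It is only an illustration under strong assumptions: `toy_E` and `toy_compress` are hypothetical SHA-256-based stand-ins (they are neither the AES nor MenH), the keystream block $E_{K_i}(L_i \oplus \theta)$ is read off from the confidentiality discussion later in this section, and $\theta = 10^*$ is encoded as the byte string `80 00 ... 00`.

```python
import hashlib

N_BYTES = 16  # n = 128 bits, as for AES-128
THETA = bytes([0x80]) + bytes(N_BYTES - 1)  # theta = 10*, a non-zero constant


def xor(a, b):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def toy_E(key, block):
    """Hypothetical stand-in for the forward call E_key(block). NOT the AES."""
    return hashlib.sha256(b"E" + key + block).digest()[:N_BYTES]


def toy_compress(K, L, C):
    """Hypothetical stand-in for MenH: 3n-bit input, 2n-bit output state."""
    d = hashlib.sha256(b"H" + K + L + C).digest()
    return d[:N_BYTES], d[N_BYTES:]


def process_blocks(state, blocks, decrypt=False):
    """Bulk step: return the output blocks and the final state (A, B)."""
    K, L = state
    out = []
    for blk in blocks:
        pad = toy_E(K, xor(L, THETA))   # extra BC call under the ephemeral key K_i
        res = xor(blk, pad)             # C_i = M_i xor pad (or M_i = C_i xor pad)
        C = blk if decrypt else res     # the ciphertext block is what gets absorbed
        out.append(res)
        K, L = toy_compress(K, L, C)    # state refresh: (K_{i+1}, L_{i+1})
    return out, (K, L)
```

Since decryption absorbs the received ciphertext blocks, both directions end in the same final state $(A, B)$, which is the value that the TGF then authenticates.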
In Figure 7, we depict the encryption algorithm of our new mode, where $(K_1, L_1)$ is the initial state. The 4 BC calls per block $M_i$ can easily be parsed from the picture. For each message block processed, the 3 BC calls in black as well as the black wires and the linear map $L$ (computed over $\mathbb{F}_{2^n}$) represent MenH. In red, the additional BC call, $M_i$, and the wires correspond to our plug-in that turns MenH into a block encryption. We note that many $L$ transforms are possible for the collision/preimage-resistant compression function. Our specific choice is illustrated in Figure 8(a). We additionally assume that the bit-length of any message and any ciphertext is a multiple of $n$ since our goal is to show the leakage security of our design, which is orthogonal to (standard) padding issues.
We now picture the design of KDF in Figure 8(b). It first uses the key $K$ to set up an IV $(K_0, L_0)$ for a constant cst so that $L_0 := \mathsf{cst}$. The preimage resistance of the half state $L$-value as the image of a single-length compression function ensures that no internal state can collide with $(K_0, L_0)$ because of this constant cst. To avoid IV-collisions between encryption and decryption, we simply apply MenH once with the nonce: we make the call $\mathsf{MenH}(K_0, L_0, N)$. This first call inside KDF forces the initial state $(K_1, L_1)$ to diverge for distinct nonces. We also add a separation bit before the nonce. This is for reusing the same secret key both in the initialization and the finalization of our mode.
It remains to explain the design of TGF, which we would like to build without relying on hardware-level countermeasures such as masking, and to argue why the verification of maliciously chosen tags does not reveal enough information on a valid tag $T$ during the decryption of fresh ciphertexts. A natural starting point for this purpose is the recent work of Guo et al. [GSWY19], which describes different such constructions with good bounds. We now argue why they cannot be used in our context and how to tweak them so that they lead to secure implementations based on unprotected hardware coprocessors.
For this purpose, let us start with the MAC illustrated in the red box of Figure 9(a), and integrated in the LR-BC-2 mode of operation. Guo et al. showed that it ensures beyond-birthday security if implemented with two strongly protected block ciphers (e.g., thanks to masking) and if the verification is done by inverting the upper block cipher. The problem in our context is that we do not want to rely on masking. Hence, typically, the lower PRF will be strongly protected thanks to the leakage-resilience features discussed in Section 4, and the upper block cipher call will rely on an unprotected hardware AES implementation. As a result, if we apply this idea in our context, we cannot assume anymore that the intermediate key $X$ will be leak-free. In fact, a simple DPA against LR-BC-2's decryption is possible. By querying decryption on $(N, C_1, \ldots, C_\ell, S)$ for many candidate tags $S$ and fixed nonce and ciphertext blocks, the last state $(A, B)$ and $X$ are fixed. The DPA is then run against $X$ in the verification when $E_X^{-1}(S)$ is performed on distinct $S$.
A first step towards solving this issue, illustrated in Figure 9(a), is to use a tag verification similar to the one proposed for ISAP v2 [DEM + 20]. That is, rather than performing the verification by inverting the block cipher (which, as discussed in [BPPS17], enables the tag comparison to remain secure with unbounded leakages), we use an additional call to the block cipher to encrypt a constant (e.g., zero) plaintext and compare the resulting $Z$ values during verification. Such a tweak mitigates the previous attack since this time, the adversary cannot perform a DPA against $\mathsf{BC}_X(A)$ by decrypting multiple tag candidates $S$. She rather has to attack this block cipher in the forward direction. But in order to fix the key $X$ for $\alpha$ different $A$ values so that a DPA can be performed with data complexity $\alpha$, she will have to find multi-collisions on $B$.
For reasonable values of $\alpha$, this task becomes computationally intensive, as the probability of finding an $\alpha$-collision for a random function is roughly $q^{\alpha} 2^{-n(\alpha-1)}$, which remains safe for around $q \approx 2^{n(\alpha-1)/\alpha}$ queries. Therefore, to extract the internal $X$ the adversary is only left with an SPA. Yet, it remains that the tag $T$ and the value $Z$ are only 128-bit wide and a collision on $Z$ indicates a collision on $T$ with high probability, so that this tag verification is only birthday secure in the black box world. Indeed, by making $2^{n/2}$ encryption queries as well as $2^{n/2}$ decryption queries, with good chance, among the many tags $T$ returned in the ciphertexts and their corresponding $Z$ values (that we can compute), one will collide with a $Z$ returned from a decryption query. For that identified decryption query, we simply guess that the valid tag is the one from which the colliding $Z$ was generated.
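The query threshold quoted above for multi-collisions can be recovered by asking when the approximate multi-collision probability becomes constant. The short derivation below is a back-of-the-envelope sketch that ignores constant factors, instantiated with $n = 128$:

```latex
q^{\alpha}\, 2^{-n(\alpha-1)} \lesssim 1
\quad\Longleftrightarrow\quad
q \lesssim 2^{\,n(\alpha-1)/\alpha},
\qquad
n = 128:\;\;
\alpha = 4 \;\Rightarrow\; q \approx 2^{96},
\quad
\alpha = 8 \;\Rightarrow\; q \approx 2^{112}.
```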
Therefore, a second step to keep our TGF/TVF beyond-birthday secure is to ensure that $Z$ depends on a wider state, which we achieve by generating an additional value $Y$ that we use as a plaintext to produce $Z$, as illustrated in Figure 9(b). We note again that, unlike the decryption-based proposal in [BPPS17], the security of this proposal cannot hold with unbounded leakages. In particular, it requires SPA security for the intermediate key $X$ and the tag $T$, which is internally computed during the verification. Yet, the latter is natural in our case, since we assume it for the leakage-resilient PRF anyway.
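The inverse-free verification described above can be sketched as follows. This is a hedged illustration rather than the specified algorithm: `toy_E` is a hypothetical forward-only stand-in for the block cipher (truncated SHA-256, not the AES), the ephemeral key $X$ is simply taken as an input, and the derivation $Y = E_{X \oplus 0^*1}(A)$ is borrowed from the leakage assumption and the proof of the next section, so it is an assumption as far as Figure 9(b) is concerned.

```python
import hashlib

N_BYTES = 16  # n = 128 bits


def toy_E(key, block):
    """Hypothetical forward-only stand-in for E_key(block). NOT the AES."""
    return hashlib.sha256(b"E" + key + block).digest()[:N_BYTES]


def flip_last_bit(x):
    """Return x xor 0*1 (only the last bit is flipped)."""
    return x[:-1] + bytes([x[-1] ^ 1])


def tag_generate(X, A):
    """T = E_X(A), computed under the ephemeral key X."""
    return toy_E(X, A)


def tag_verify(X, A, S):
    """Check the candidate tag S without inverting the block cipher."""
    T = toy_E(X, A)                  # honest tag, kept internal
    Y = toy_E(flip_last_bit(X), A)   # additional value widening the state
    Z = toy_E(T, Y)                  # Z = E_T(Y)
    Z_candidate = toy_E(S, Y)        # recomputed with S as the key
    # Y and Z are the values the analysis allows to leak; T is not compared directly.
    return Z == Z_candidate
```

The comparison is performed on $Z$ rather than on $T$, so that a candidate tag $S$ only ever appears as a key of a forward call.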
While this solution may already look strong, it remains that a repetition of $B$ implies a repetition of the ephemeral key $X$, while we have to reduce the number of times $\mathsf{PRF}[E](1\|\cdot)$ (our notation for the leakage-resilient PRF using $E$ as underlying block cipher with a bit of separation at its input) is run on the same input, in order to avoid SPA with a (constant but) larger number of leakages against the tag verification. Since the MenH outputs are not random over $\{0, 1\}^{2n}$ (Mennink showed they are up to a tight birthday bound, but we aim for beyond-birthday integrity), there may be up to $\alpha^2$ repetitions of a $B$-value among all the final states $(A, B)$ if there are some $\alpha$-multi-collisions on the underlying $n$-bit output compression functions inside the computation of MenH. Indeed, even if there are at most $\alpha$ multi-collisions on a single value of $B$, each of these collisions may itself come from previous $\alpha$ multi-collisions on the $2L+D$ value, which is the key-input value of the last lower BC call in MenH (see Figure 8(a)).
Since the adversary controls $C_\ell$, the corresponding plaintext-input of that BC call can be chosen equal between key-input collisions. We would thus have to deal with $\alpha^2$ possible repetitions of $B$ since $B$ is the output of two successive iterations of some compression functions. In turn, the key $X$ may be manipulated up to $2\alpha^2$ times in Figure 9(b). Concretely, this essentially implies that in order to guarantee security with up to $q \approx 2^{n(\alpha-1)/\alpha}$ queries, one would also require that the block cipher implementation used in the construction of Figure 9(b) resists SPA with $q_{spa} = 2\alpha^2$ queries. For 96 bits of security, this gives $q_{spa} = 32$ (which corresponds to $n_b = 5$ in our empirical evaluation of Section 4.2). For 112 bits of security, this gives $q_{spa} = 128$ (which corresponds to $n_b = 7$). Those values correspond to SPA requirements that are not satisfied by the devices we evaluated.
As a result, the third and last step to build our TGF/TVF is to carefully add one more call to an $n$-bit output compression function on $(A, B)$ before going into $\mathsf{PRF}[E](1\|\cdot)$. While this obviously reduces the number of repetitions from $\alpha^2$ to $\alpha$ since the final states are pairwise distinct, we also have to introduce one more feed-forward in order to avoid collisions on $(A, X)$. We illustrate our proposed TGF/TVF functions in Figure 10. In this case, we get ≈96-bit security with $q_{spa} = 8$ (which corresponds to $n_b = 3$) and ≈112-bit security with $q_{spa} = 16$ (which corresponds to $n_b = 4$). Those values better match what our implementations offer and (we believe) correspond to an interesting security vs. performance tradeoff. More precise evaluations are given after the proof in the next section.
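The SPA requirements quoted for the two variants follow from simple arithmetic, which the sketch below reproduces. The function name and its boolean flag are hypothetical; the only relations used are $q_{spa} = 2\alpha^2$ without the extra compression call, $q_{spa} = 2\alpha$ with it, and $q_{spa} = 2^{n_b}$ to map a query budget to the LR-PRF parameter.

```python
def spa_budget(alpha, extra_compression):
    """Return (q_spa, n_b) for a multi-collision parameter alpha.

    extra_compression=False models the construction of Figure 9(b),
    extra_compression=True the final TGF/TVF of Figure 10.
    """
    q_spa = 2 * alpha if extra_compression else 2 * alpha ** 2
    # Smallest n_b such that 2**n_b >= q_spa (exact integer arithmetic).
    n_b = (q_spa - 1).bit_length()
    return q_spa, n_b


# alpha = 4 and alpha = 8 are the values matching ~96 and ~112 bits above.
for alpha in (4, 8):
    print(alpha, spa_budget(alpha, False), spa_budget(alpha, True))
```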
We finally note that a simple but expensive solution to remove this multi-collision SPA is to directly send A and B to a leakage-resilient PRF with 2n input bits.

Integrity in the presence of leakage: formal analysis
Definition 1 (Nonce-Based AEAD [Rog02]). A nonce-based authenticated encryption scheme with associated data is a tuple AEAD = (Gen, Enc, Dec) such that, for any security parameter $n$, and keys in $\mathcal{K}$ generated from $\mathsf{Gen}(1^n)$:
• $\mathsf{Enc} : \mathcal{K} \times \mathcal{N} \times \mathcal{AD} \times \mathcal{M} \to \mathcal{C}$ deterministically maps a key in $\mathcal{K}$, a nonce in $\mathcal{N}$, some blocks of associated data in $\mathcal{AD}$, and a message in $\mathcal{M}$ to a ciphertext in $\mathcal{C}$.
Definition 2 (CIML2). An authenticated encryption AEAD = (Gen, Enc, Dec) with leakage function pair $\mathsf{L} = (\mathsf{L}_{Enc}, \mathsf{L}_{Dec})$ provides $(q_e, q_d, q_l, t, \varepsilon)$-ciphertext integrity with nonce misuse and leakages (both in encryption and decryption) given a security parameter $n$ if, for all $(q_e, q_d, q_l, t)$-bounded adversaries $\mathcal{A}^{\mathsf{L}}$, the probability that $\mathcal{A}^{\mathsf{L}}$ wins the $\mathsf{PrivK}^{\mathsf{CIML2}}_{\mathcal{A},\mathsf{AEAD},\mathsf{L}}$ game of Figure 12 is at most $\varepsilon$, where $\mathcal{A}^{\mathsf{L}}$ makes at most $q_e$ leaking encryption queries, $q_d$ leaking decryption queries and $q_l$ leakage evaluation queries.
For the sake of simplicity we do not deal with associated data. Therefore, when the adversary makes an encryption (resp., decryption) query for a nonce N and a plaintext M (resp., ciphertext C) we implicitly set the associated data to the empty string.

Idealized leakage assumptions.
As a usual first step in the analysis of leakage-resistant modes of operation, we assume that the computation of the ideal cipher is leak-free as long as it is called a small constant number of times on each ephemeral secret key [PSV15]. This small number corresponds to the $2^{n_b}$ queries of the leakage-resilient PRF and to the $\alpha$-multi-collisions. More precisely, we make the following simplifying assumption.
Given a block cipher $E : \{0, 1\}^n \times \{0, 1\}^n \to \{0, 1\}^n$, we say that $E$ satisfies $(\alpha, u)$-leakage independence if for any adversary $\mathcal{A}$ the leakage contains no information about the $u$ ephemeral keys in the next experiment where they are explicitly used at most $2\alpha$ times. Let $U_1, \ldots, U_u$ be random values over $\{0, 1\}^n$, and $Q_1, \ldots, Q_u$ and $R$ be initially empty sets. In the experiment the adversary $\mathcal{A}$ can make the following query: on input and finally (v) compute and return $E_{U_i \oplus B_{ij}}(A_{ij})$ and $E_{U_i \oplus B_{ij} \oplus 0^*1}(A_{ij})$ as well as the leakage of these computations.
As is clear from the previous discussions, the smaller $\alpha$ is, the weaker the assumption. Depending on the number $u$ of explicit random ephemeral keys, there might be many collisions among $U$-values. However, since no computation is repeated in this game, the adversary has no clue about which keys collide if she halts before making a forbidden query (i.e., one creating an abort). As a result, the only information the adversary can get from the experiment is many pairs of plaintext-input blocks and ciphertext-output blocks corresponding to unknown keys. Hence, the leakage independence ensures that the leakage does not help her to guess collisions, for instance. In our mode of operation, we manage to create these non-aborting conditions in TGF and TVF.
Leakage function. By repeating nonces for different values of $M_1$ (and thus $C_1$) it is easy to mount a DPA on $L_1$ due to the computation of $L_1 + C_1$ in (the lower half of) the compression function of MenH (see Figure 8(a)). By learning $L_1$, the adversary has the two plaintext-inputs of the last two BC calls in MenH and she can still change them at will. In turn, the key-inputs $K_1 + L_1 + 2D_1$ and $2L_1 + D_1$ will be leaked. We thus directly give the seed $K_0$ to the adversary in any leaking encryption or decryption query. For KDF and TGF, we simply give the leakage associated to the computation of the block cipher (and not the intermediate values). We already argued about the leakage of $\mathsf{PRF}[E]$ used in KDF and TGF in the previous sections and we also commented on why computing $T$ does not leak it (except in encryption where $T$ is the tag which is explicitly returned by the algorithm) since the ephemeral key $X$ is refreshed given any final state $(A, B)$, except a small constant number of times related to a maximum multi-collision parameter $\alpha$ that we control in the proof. We would have the same property for $Y$ if it was not involved in the last verification $Z = E_S(Y)$ in decryption, where $S$ is the candidate tag and $Z = E_T(Y)$. However, with many distinct $S$ for a single prefix ciphertext, $Z$ will leak by a DPA on the comparison check and thus $Y = E_S^{-1}(Z)$ can be discovered. Based on these premises, we describe the unbounded leakage function pair $\mathsf{L} = (\mathsf{L}_{Enc}, \mathsf{L}_{Dec})$ used in our proofs as:
• $\mathsf{L}_{Enc}(K, N, M)$: $\ell_0 = \mathsf{L}_{prf}(K, 0\|N)$ to compute $K_0$ and $\ell_1 = \mathsf{L}_{prf}(K, 1\|B)$ to compute $X$, the seed $K_0$ allowing to compute MenH offline on IV $(K_0, L_0)$, where $L_0 = \mathsf{cst}$ is a constant, and $\ell_{tgf} = \mathsf{L}_{tgf}(X, A)$ to compute $T$. Then, $\ell_{Enc} = (\ell_0, \ell_1, K_0, \ell_{tgf})$.
• $\mathsf{L}_{Dec}(K, N, C\|S)$: $\ell_0 = \mathsf{L}_{prf}(K, 0\|N)$ to compute $K_0$ and $\ell_1 = \mathsf{L}_{prf}(K, 1\|B)$ to compute $X$, the seed $K_0$ allowing to compute MenH offline on IV $(K_0, L_0)$, and $\ell_{tvf} = \mathsf{L}_{tvf}(X, A)$ to compute $T$, $Y$ and which includes $Y$, $Z$, where $E_T(Y) = Z$, to implicitly check whether $T = S$ without leaking $T$ by verifying if $Z = E_S(Y)$, which can be done by the adversary. Then, $\ell_{Dec} = (\ell_0, \ell_1, K_0, \ell_{tvf}, Y, Z)$.

The idea behind our integrity result with leakage is that the final state $(A, B)$ is fresh throughout all the online queries, and so are the pairs $(Y, Z)$. By showing that $(Y, Z)$ is independent of all the other queries, the adversary is thus left with brute-forcing a compatible key-input $S$ of the block cipher matching a plaintext-ciphertext pair $(Y, Z)$.

Theorem 1 (CIML2). LR-BC2[E] with the leakage function pair
where $\alpha$ is a maximal multi-collision parameter, $q$ is the number of offline ideal cipher queries, $q_e$ is the number of leaking encryption queries, $q_d$ is the number of leaking decryption queries, $q_{on} = q_e + q_d + 1$ is the total number of online queries, $\sigma$ is the total number of message blocks involved in the online queries, and $Q = q + 7q_e + 9q_d + 4\sigma$ is the total number of ideal cipher calls excluding those implicitly computed in $\mathsf{PRF}[E]$.
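As a small aid for instantiating the bound, the derived quantities $q_{on}$ and $Q$ can be computed from the query counts as below. The helper name is hypothetical and the numbers in the usage line are arbitrary; only the two formulas are taken from the theorem statement.

```python
def query_budget(q, q_e, q_d, sigma):
    """Return (q_on, Q) from the theorem's parameters.

    q     : offline ideal cipher queries
    q_e   : leaking encryption queries
    q_d   : leaking decryption queries
    sigma : message blocks involved in the online queries
    """
    q_on = q_e + q_d + 1                    # online queries
    Q = q + 7 * q_e + 9 * q_d + 4 * sigma   # ideal cipher calls, PRF[E] excluded
    return q_on, Q


# Arbitrary illustrative numbers, not taken from the paper.
print(query_budget(q=1000, q_e=10, q_d=5, sigma=200))
```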
To simplify the proof we give more power to the adversary by additionally returning Y and Z in any leaking encryption query. That is, even if the leakage differs in encryption and decryption, we decide to give both in the case of an encryption as the adversary may anyway make a leaking decryption query for the received valid ciphertext. That way, we can assume that the adversary never asks the decryption of a received valid ciphertext, and, conversely, the adversary never asks the encryption of the underlying message related to a ciphertext involved in a decryption query (since K 0 is equal in those cases with the same nonce, and it defines a bijection between the message blocks and the ciphertext blocks no matter what is the candidate tag); we will say the adversary never makes a cross query. As usual, we also assume that the adversary never repeats a query because it is useless (the leakage functions directly give all the information she can get with repetitions).
Proof. We use a sequence of modified CIML2 games where Game 0 is the proper CIML2 game and the last game is a dummy game where the adversary loses. In each Game $i$, $W_i$ is the event that the adversary wins in that game.
In the state of MenH we call $K$-value (resp., $L$-value) the upper (resp., lower) half state value appearing in Figure 7. We also refer to $K+L+2D$-value (resp., $2L+D$-value) to denote the upper (resp., lower) key-input of the block cipher call made in MenH as shown in Figure 8(a).
Game 1: This is the CIML2 game except that we bring the following modification in the winning condition. Now, we also consider the adversary $\mathcal{A}$ to be unsuccessful if there is at least one $(\alpha + 1)$-multi-collision on any of the PGV-like ($n$-bit output) compression functions or in a (forward) BC output. If we call $F_1$ the event that the adversary wins in Game 0 but fails in this game we have $\Pr[W_0] \le \Pr[W_1] + \Pr[F_1]$. Clearly, $F_1$ implies that there are some $(\alpha+1)$-multi-collisions in any underlying single-length compression function or the block cipher in forward calls. Since we have the following compression functions and block cipher (forward) calls: . We note that we do not consider cross-function collisions here. Also, an $A$-value (resp., a $B$-value) is a particular case of a $K$-value (resp., an $L$-value). In the following games we have at most $\alpha$-multi-collisions.
we have a MenH-collision. As a second case, we deal with the probability that an internal state $(K, L)$ collides with an IV $(K_0, \mathsf{cst})$. This can only happen if $L = \mathsf{cst}$, which implies finding a preimage of cst for the compression function $f_1$. We thus have the term $Q/(2^n - Q)$, where $Q \le 2^{n-c}$ is the total number of calls to the ideal cipher. It remains to bound the probability that internal states collide, but this is directly related to MenH-collision. Therefore, $\Pr[F_2] \le \varepsilon_{\mathsf{MenH}} + Q/(2^n - Q)$ since $\varepsilon_{\mathsf{MenH}} \le \varepsilon_{\mathsf{MenH}} + \varepsilon_{\text{mu-coll}}$, and the multi-collisions of Mennink's proof imply $F_1$.
Game 4: In this game we bring yet another modification to the winning conditions. To be successful, the adversary must fulfil the winning conditions of Game 3 but she fails if she creates some final states $(A, B)$ and $(A, B')$ such that their respective $X$-values, denoted $X$ and $X'$, are such that $X + X' \in \{0, 1\}$ (using the finite field notation). That is, the adversary does not win if she forces the re-use of some computation $E_X(A)$ or $E_{X+1}(A)$ in an online query.
$(A, B_1), \ldots, (A, B_a)$ with that $A$-value. Since $A$ is the output of a consecutive computation of two compression functions $f_1(f_2(K_\ell, L_\ell), C_\ell)$ we know that $a \le \alpha^2$. Indeed, since there is no state collision from Game 2, the previous states are all distinct except if the corresponding queries only differ on the last block $C_\ell$ (the prefix must be equal). In the latter case, that means that we must have a (proper) collision on $f_1$. In the former case, i.e., when the previous states differ, we must have a (proper) collision on $f_2$. That means that we have no more than $\alpha^2$ repetitions of $A$ (since the adversary does not repeat a query and never makes a cross query).
With this in mind we can now consider several cases.
First, we deal with the case where some of these pairs collide on a $W$-value. Consequently, those $W$-colliding pairs lead to a common $U$-value (the PRF output). But then the $X$-value differs since $X_i = U + B_i$ with necessarily distinct $B_i$ for the appropriate index. So, between the pairs $(A, B_i)$ and $(A, B_j)$ that collide on a same $W$, $X_i + X_j \neq 0$. Now it looks like we are stuck if $B_i + B_j = 1$ as it implies $X_i + X_j = 1$, but we next show that $W$-collisions are very unlikely in that case.
Second, we bound the probability that there is a special f 1 collision of the form Assuming that the BC triple (h, m, E h (m)) is already defined in the view, the probability that a forward ideal cipher query (h, m + 1, ·) or a backward query (h, ·, E h (m) + 1) defines the triple (h, m + 1, E h (m) + 1) is bounded by 1/(2 n − Q). By taking the union bound on all the possible such queries, we find at most Q/2 · 1/(2 n − Q) (since each query defines a single triple which defines a single try).
Third, we show that among the respective $X$-values $X_1, \ldots, X_a$ associated to the pairs $(A, B_1), \ldots, (A, B_a)$, those related to non $W$-colliding pairs imply $X_i + X_j \notin \{0, 1\}$, except with a very low probability. To show this, we bound the probability that Given the index $i$, such a probability is lower than $2i/(2^n - Q)$ since we have at most $2i$ targets of the form $(1\|W_{i+1})$. Given $A$, by summing on the index $i$ we get the upper bound $\sum_{i=1}^{a-1} 2i/(2^n - Q) = 2/(2^n - Q) \cdot a(a - 1)/2$. Now, considering all the final upper half state $A$-values, we will show that the union probability is lower than $q_{on}/\alpha^2 \cdot \alpha^2(\alpha^2 - 1)/(2^n - Q)$ as the probability is maximal when all the $A$-values are repeated $\alpha^2$ times with exactly $q_{on}/\alpha^2$ distinct $A$-values. Let $A_1, \ldots, A_\nu$ be all the distinct $A$-values that are respectively repeated $a_1, \ldots, a_\nu$ times in a final state. We can compute the upper bound as $\sum_{k=1}^{\nu} a_k(a_k - 1)/(2^n - Q) \le q_{on}(\alpha^2 - 1)/(2^n - Q)$. From now on we can assume that the computations of $E_X(A)$, $E_{X+1}(A)$ are fresh in the online queries.
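The union bound of this third case can be spelled out as the following worked computation. It is a reconstruction from the surrounding text, assuming only that each multiplicity satisfies $a_k \le \alpha^2$ and that the multiplicities sum to at most $q_{on}$ (one final state per online query):

```latex
\sum_{i=1}^{a-1} \frac{2i}{2^n - Q} = \frac{a(a-1)}{2^n - Q},
\qquad
\sum_{k=1}^{\nu} \frac{a_k (a_k - 1)}{2^n - Q}
\;\le\; \frac{\alpha^2 - 1}{2^n - Q} \sum_{k=1}^{\nu} a_k
\;\le\; \frac{q_{on}\,(\alpha^2 - 1)}{2^n - Q}
\;=\; \frac{q_{on}}{\alpha^2} \cdot \frac{\alpha^2 (\alpha^2 - 1)}{2^n - Q}.
```

The last expression is the extremal case mentioned in the text, reached when $q_{on}/\alpha^2$ distinct $A$-values are each repeated exactly $\alpha^2$ times.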
Game 5: This game is as Game 4 except that we add one more failure event $F_5$ which causes the adversary $\mathcal{A}$ to lose the game if she wins in Game 4 but manages to make an (offline) ideal cipher query that (re)produces a triple $(X, A, T)$ or $(X + 1, A, Y)$ for some pair $(A, X)$ involved in an online query. We want to discard the case where $\mathcal{A}$ brute-forces $X$ from offline queries and we have $\Pr[W_4] \le \Pr[W_5] + \Pr[F_5]$.
Bounding $\Pr[F_5]$: First, we note that the leakage on $U$, $X$ and $X + 1$ is small so that it does not reveal anything useful to the adversary. Indeed, as there are at most $\alpha$ multi-collisions on a $W$-value, each $U = \mathsf{PRF}[E]_K(1\|W)$ is only explicitly (but secretly) used twice in the online computation of $E_{U+B_i}(A_i)$ and $E_{U+B_i+1}(A_i)$ (where $X_i = U + B_i$) and for at most $\alpha$ indexes of the final states. Each $U$-value is thus only explicitly called $2\alpha$ times in the online queries. Then, the $(\alpha, q_{on})$-leakage independence of $E$ ensures that the leakage does not give more information than $E$ does.
Second, considering (again) all the $A$-values $A_1, \ldots, A_\nu$ with their respective multiplicity $a_1, \ldots, a_\nu$, we can show that all the $2q_{on}$ pairs $(A_i, T_{ij})$, $(A_i, Y_{ij})$, where $T_{ij}$, $Y_{ij}$ denote all the $T$-values and the $Y$-values coming from $A_i$, for $1 \le j \le a_i$, are pairwise distinct except with probability $q_{on}(2\alpha^2 - 1)/(2^n - q_{on})$. Note that the adversary $\mathcal{A}$ only gets both pairs $(A_i, T_{ij})$, $(A_i, Y_{ij})$ if the corresponding query is a leaking encryption query. The leakage in decryption only reveals the pair $(A_i, Y_{ij})$. Now, if $F_5$ occurs, $\mathcal{A}$ manages to find a key $X_{ij}$ (even if it is $X_{ij} + 1$, in the first place) that might be involved in a decryption query and $\mathcal{A}$ then knows $E_{X_{ij}}(A_i) = T_{ij}$, which leads to a ciphertext forgery. To increase her advantage in finding $X_{ij}$, $\mathcal{A}$ should focus on $(A_i, Y_{ij})$ since for distinct $i$-indexes she must be lucky to find an $X$-collision (or more exactly, figure out the collision but the offline queries are independent from $X$-values before that time): it is more efficient to put all her power on the corresponding $i$-index. Now, we show that $\mathcal{A}$ should focus on the $A_i$'s with the highest multiplicities. Her goal is to make an offline query $(X + 1, A_i)$ for many keys $X$ and hope to fall on $Y_{ij}$, for some $1 \le j \le a_i \le \alpha^2$. In that case she can guess that the $X$ "hitting" $Y_{ij}$ is actually $X_{ij}$. For each try where she has to bind the $A$-value in each ideal cipher query before getting the result, the probability of such a hit is maximal when $a_i$ is maximal since a single try has more chance to hit more targets. More precisely, if $q_i$ is the number of times $\mathcal{A}$ makes a query with $A_i$, then if $a_i \le a_{i'}$, for a fixed value of Since $E$ is an ideal cipher, $\mathcal{A}$'s success is independent of the chosen $i$-index among those for which $a_i$ is maximal because she does not know how many times the $X$-key she is looking for has already been used. Anyway, we assume that $a_1 = \alpha^2$ and that the keys $X_{1j}$, for $1 \le j \le \alpha^2$, are those that are already the most involved in the ideal cipher queries. Hence, the probability to find a good key $X_{1j}$ is bounded by $q\alpha^2/(2^n - Q)$.
We stress that making inverse ideal cipher call is less efficient for A as there are at most α multi-collisions on some Y -values. Finding an X-key is thus more likely in forward calls where A-values can be repeated up to α 2 times. From now on, forging a ciphertext cannot be made by guessing an X-key and thus by computing explicitly an honest tag T for the decryption of an invalid ciphertext.
Game 6: Game 6 is like Game 5 except that a winning adversary in Game 5 is no longer considered successful if she finds a forgery by (purely) guessing a valid candidate tag $S$. More precisely, guessing $S$ in that case means that $\mathcal{A}$ defines a block cipher triple In Game 6, the probability that the adversary computes a ciphertext forgery is 0. That is, $\Pr[W_6] = 0$, and we have: Eventually, $q_{on}(\alpha^2 - 1) + q\alpha(\alpha + 1) \le 2\alpha^2 Q$. Hence, the bound.
And for $\alpha \ge 5$, the above equation is bounded by $7\alpha^2 2^{\beta n - n}$ and the multi-collision term remains dominant. Therefore, the bit security for $5 \le \alpha \le 15$ is up to $\frac{\alpha}{\alpha + 1} n$.
So overall, and as already outlined in the previous section, the integrity of our construction increases with the multi-collision parameter $\alpha$, but increasing $\alpha$ also implies a stronger hardware assumption since we assume SPA security with $q_{SPA} = 2\alpha$ queries. Interestingly, our detailed analysis is slightly better than the estimates of Section 5.1. For example, we reach 96-bit security with $\alpha = 3$ rather than $\alpha = 4$ (and for $\alpha = 4$ we reach 102-bit security). This is because we actually tolerate $\alpha$ multi-collisions: we "forbid" $(\alpha + 1)$-multi-collisions, and bound the probability that one eventually happens. These results therefore show that for actual security parameters as available on low-cost embedded devices, LR-BC-2 can provide strong concrete integrity guarantees.
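The trade-off between integrity level and hardware assumption can be tabulated with the short sketch below. The function name is hypothetical. It applies the $\frac{\alpha}{\alpha+1} n$ expression (stated above for $5 \le \alpha \le 15$, and consistent with the quoted values for $\alpha = 3, 4$) together with $q_{SPA} = 2\alpha$, rounding the bit security down.

```python
def integrity_level(alpha, n=128):
    """Return (bit_security, q_spa) for the multi-collision parameter alpha."""
    bits = (alpha * n) // (alpha + 1)  # floor of alpha/(alpha+1) * n
    q_spa = 2 * alpha                  # SPA queries assumed harmless
    return bits, q_spa


for alpha in (3, 4, 7, 15):
    print(alpha, integrity_level(alpha))
```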
Incidentally, we note that the LR-MAC of Figure 10 also improves the results of Guo et al. at FSE 2020 [GSWY19], which require two strongly protected block cipher calls where we only use one call to the LR-PRF (which could be replaced by a strongly protected block cipher, if not aiming for exploiting hardware co-processors as in this work).

Confidentiality in the presence of leakage: heuristic analysis
The confidentiality with leakage of LR-BC-2 follows the standard approach of previous leakage-resistant modes of operation. As a heuristic argument, it is easy to see that ignoring decryption leakages and assuming fresh nonces, every message block is encrypted with a fresh key. So at high-level, and using the simplified assumptions of [BBC + 20], the security of our mode reduces to the SPA security of a single message block.
A bit more formally, to prove the CCAmL1 security of our mode, which stands for CCA security with misuse-resilience and leakage-resistance [GPPS19], we can rely on the hard-to-invert leakage assumption, which follows [YSPY10] and was used in [BGP + 20] for TEDT. In the CCAmL1 game, the adversary is not granted access to a leaking decryption oracle but only to a black-box decryption. We briefly sketch the leakage function in encryption queries. For a fresh nonce in encryption, we can rely on Mennink's indifferentiability result for MenH, recalled in Appendix D, in order to establish that the initial state $(K_1, X_1)$ is independent and random up to the birthday bound (which is unavoidable in leakage confidentiality with the current state of cryptographic knowledge). We stress that a linear map $L$ which satisfies the collision-resistance condition automatically satisfies the indifferentiability condition. Now, we also have to show that the initial state is kept secret and we will explain why any internal state will remain sufficiently hidden when nonces are not repeated (as required only for the computation of the challenge ciphertext).
For the first ephemeral key $K_1$, a first part of the leakage comes from the first call to MenH inside KDF as shown in Figure 8(b). The second part of the leakage comes from the next call to MenH and to $E_{K_1}(L_1 \oplus \theta)$ in the bulk of the computation. Assuming that the first part remains safe (since $K_1 = E_{K_0+L_0+2D_0}(N)$ is computed only once with $K_0$), the leakage reduces to the second part, and more precisely to $E_{K_1}(L_1)$, $E_{K_1}(L_1 \oplus \theta)$ and the key-input $K_1 + L_1 + 2D_1$. So $K_1$ is only used a very small constant number of times. Therefore, the protection of the internal ephemeral key material can be propagated and the $E_K(L \oplus \theta)$ value remains hidden enough and random. Now, each block of message, i.e., $M$-value, is only XORed with that previous value. In the hard-to-invert leakage model, we can iterate the argument until the final state. Finally, given a final state, TGF is also made independent of the message processing as argued in the integrity proof while the adversary here has much less information on the internal states. Therefore, a single execution on a final state $(A, B) := (K_{\ell+1}, X_{\ell+1})$ does not leak much about it and the intermediate computation in TGF also remains safe as already argued in the proof of CIML2.
We finally note that while the LR-BC-2 mode of operation ensures the strongest integrity in front of decryption leakages, formalized by the CIML2 notion, its confidentiality guarantees are limited to encryption leakages (i.e., CCAmL1). In order to obtain CCAmL2 security (where the 2 stands for decryption leakages), one option is to leverage a second pass, and to adapt the TEDT mode of operation to the block cipher case. We represent the resulting LR-BC-3 mode of operation in Figure 13: it works by replacing Hirose's hash function [Hir06] by Mennink's one [Men17].

Alternatively, we observe that a simple trick to provide some heuristic confidentiality with decryption leakages is to limit the number of incorrect verifications a chip can perform at the protocol level (and to freeze it afterwards). In this case, the number of measurements a side-channel adversary can perform will be limited by this number of incorrect verifications for each target chip. Yet, in case multiple chips encrypt with the same key, it may still happen that a sufficient number of traces can theoretically be collected. In this respect, we observe that multi-user security could be of interest to limit such multi-chip attacks without requiring the distribution of more secret key material: by using a public key $P$ in combination with the nonce $0\|N$ in Figure 8(b), one could send firmware updates encrypted based on different public keys for different chips.
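The protocol-level limit on incorrect verifications can be illustrated with a minimal sketch. The class name, the `max_failures` parameter and the `verify_tag` callback are hypothetical and only serve the illustration; an actual device would additionally have to keep the counter in tamper-resistant non-volatile memory.

```python
from typing import Callable


class FailureLimitedVerifier:
    """Refuse further verifications after too many invalid tags.

    Hypothetical helper: `verify_tag` stands for the tag verification of
    the mode and returns True when the ciphertext is authentic.
    """

    def __init__(self, verify_tag: Callable[[bytes], bool], max_failures: int) -> None:
        self._verify_tag = verify_tag
        self._remaining = max_failures  # assumed to live in non-volatile memory

    @property
    def frozen(self) -> bool:
        return self._remaining <= 0

    def verify(self, ciphertext: bytes) -> bool:
        if self.frozen:
            raise RuntimeError("device frozen: too many incorrect verifications")
        # Charge the attempt before the leaking computation, so that an
        # interrupted verification still counts as one measurement.
        self._remaining -= 1
        ok = self._verify_tag(ciphertext)
        if ok:
            self._remaining += 1  # refund: only incorrect verifications count
        return ok
```

With such a counter, the number of traces an adversary can collect on invalid ciphertexts is bounded by `max_failures` per chip, which is the quantity the heuristic argument relies on.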
In the following section, we conclude the paper by analyzing the implementation cost of these different solutions and compare them with the benchmarking of [USS + 20].

Performance evaluation
We next detail the performances obtained by our two main LR-BC modes of operation. We first compare them in terms of block cipher calls. Next, we detail the results obtained on the two previously mentioned targets and compare them with [USS + 20].

BC calls count for LR-BC variants
In Table 1, we detail the number of AES calls required to encrypt l-block messages with LR-BC-2, LR-BC-3 and [USS + 20]. For the first two modes, decryption requires 3 additional block cipher calls for the tag verification. We separate these figures for the three main steps of the modes, namely the KDF, bulk computation and TGF.

The main observations from this table are twofold. First, the constant part of the cost of these modes is dominated by the execution of the LR-PRF. In this respect, the LR-BC-2 and LR-BC-3 modes are similar since they only use the LR-PRF on 258 bits. By contrast, the CHES 2020 construction has a larger constant cost due to its more expensive finalization. Indeed, it uses the LR-PRF on 384 bits [USS + 20]. Second, our one-pass modes asymptotically require 4 · l calls to the block cipher and LR-BC-3 requires 5 · l such calls, which directly reflects their security grade. Evaluating [USS + 20] is more difficult for this part, since it requires 2 · l calls to a block cipher and a call to a hash function on the whole message (which can be implemented in software or hardware as discussed next).

LR-AES encryption throughput
We start by describing the two optimizations we used in order to implement the LR-BC mode variants based on AES coprocessors, which we next denote as LR-AES. Note that the CHES 2020 mode was instantiated with both the AES and SHA-256, as in [USS + 20] (see footnote 18). Since these schemes heavily rely on calls to the AES, the first guideline we followed is to use the ECB mode (available on most coprocessors) whenever multiple (e.g., 2 in our case) calls to the AES with the same key are needed. This trick allows saving at least the cost of writing the key to the coprocessor, as well as its initialization. Therefore, it significantly speeds up the bulk computation of all the variants of LR-BC as well as [USS + 20]. In addition, if the coprocessor has direct memory access (DMA), it will be able to run the two encryptions without relying on any software instruction. The second optimization is to work on explicitly aligned words (i.e., uint32_t): this ensures that the compiler will operate on machine words instead of bytes. Roughly, the combination of these optimizations leads to a ≈ 50% improvement of the throughputs reported in [USS + 20].
The curves in Figure 14 depict the encryption throughput ([cycles/byte]) for variants of LR-AES and [USS + 20]. By comparing Figure 14a with Figure 14b, we notice that the Cortex-M4 systematically offers a better throughput than the Cortex-M33 thanks to its faster AES coprocessor (with a 128-bit architecture). For the rest, the comparative results are similar on both targets and follow the trends outlined in Table 1.
More into the details, for short messages (e.g., 16 bytes), the cost of the KDF and TGF dominates the cycle count, and the LR-AES variants reach the better throughput. Indeed, they require ≈ 2 · 130 calls to the AES for these steps, while [USS + 20] requires ≈ 128 additional calls. For long messages, the costs of the KDF and TGF are amortized and the bulk computation dominates the cycle count. Therefore, the Grade-2 variants of LR-AES give the best throughputs. The Grade-3 variant pays one additional call to the block cipher per message block. As for the comparison with [USS + 20], it highly depends on the presence of a hardware accelerator for SHA-256. When available, its throughput is ≈ 2.3 times better than our Grade-2 designs (but without strong integrity guarantees in the presence of leakage due to the attack in Subsection 3.2). If there is no hardware accelerator for the hash function, then LR-AES-2 is better by a factor ≈ 2.94 (resp., ≈ 2.17) on the ARM Cortex-M33 (resp., Cortex-M4). We note that because LR-AES-3 and [USS + 20] are two-pass, the entire ciphertext must be stored in memory when decrypting (this is not the case for LR-AES-2, which is online but only offers CCAmL1 confidentiality).

18 In the C implementation proposed by [USS + 20] and accessible at https://github.com/Fraunhofer-AISEC/leakres-aead-microcontroller/blob/master/host/src/, the 256-bit digest provided by the hash function is truncated to 128 bits that are then fed to the LR-PRF. This leads to a possibility of forgery by looking for a collision on these 128 bits, which has a cost of $2^{64}$ offline queries to SHA-256. Hence, we report the results of a more secure version that does not truncate SHA-256's output. In the same implementation, the tag comparison is done with memcmp, whose execution time depends on the inputs to compare. This leads to a direct timing side-channel leakage that can be exploited by an adversary to forge tags. Listing 1 is constant time and avoids this source of leakage.
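The timing issue of an early-exit tag comparison can be made concrete with a small sketch. It is not the authors' Listing 1: the function names are hypothetical, and Python is used only to show the logic (on a microcontroller the same idea would be written in C). The first function mimics a memcmp-like loop and reports how many bytes it inspected, which is the quantity an adversary learns through timing; the second accumulates all differences before deciding.

```python
import hmac
from typing import Tuple


def early_exit_compare(a: bytes, b: bytes) -> Tuple[bool, int]:
    """memcmp-like comparison; also returns the number of bytes inspected."""
    steps = 0
    for x, y in zip(a, b):
        steps += 1
        if x != y:
            return False, steps  # running time reveals the mismatch position
    return len(a) == len(b), steps


def accumulate_compare(a: bytes, b: bytes) -> bool:
    """Inspect every byte regardless of where the first mismatch occurs."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y
    return diff == 0


tag = bytes(range(16))
guess = tag[:5] + b"\xff" + tag[6:]        # first five bytes are correct
print(early_exit_compare(tag, guess))      # (False, 6)
print(accumulate_compare(tag, guess))      # False
print(hmac.compare_digest(tag, tag))       # True, standard-library comparison
```

The step count returned by the early-exit variant grows with the number of correct leading bytes, which is what enables a byte-by-byte forgery. Removing this timing dependency does not address power or electromagnetic leakage of the compared values, which is the separate concern discussed for the tag verification of the modes.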
Overall, these results suggest that the high cost of the LR-PRF will be quite well amortized for actual software updates, and that hardware coprocessors enable making the 4 · l / 5 · l block cipher calls per message block of our modes acceptable in practice. For example, the firmware of our ARM Cortex-M33 has a size of 21,184 bytes and the one of our ARM Cortex-M4 has a size of 19,064 bytes. Even with the most conservative security parameter for the LR-PRF (n b = 1), encrypting this amount of data is sufficient to reach a good amortization (as shown in Figure 14). Also, considering the maximum clock frequencies of our two target devices, decrypting their ciphertext with LR-AES-2 will take 0.014 seconds (resp., 0.0043 seconds) on the ARM Cortex M33 (resp., M4).
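The amortization argument can be checked with a back-of-the-envelope count of AES calls. The sketch below assumes 16-byte blocks, takes the ≈ 2 · 130 calls quoted above as the fixed KDF and TGF cost, and uses 4 (resp., 5) calls per block for the Grade-2 (resp., Grade-3) variant; the function and constant names are illustrative, and cycle counts on real hardware also depend on non-AES overheads that are not modeled here.

```python
import math
from typing import Tuple

BLOCK_BYTES = 16                    # AES-128 block size
FIXED_CALLS = 2 * 130               # approximate KDF + TGF cost quoted in the text
BULK_CALLS_PER_BLOCK = {"LR-AES-2": 4, "LR-AES-3": 5}


def aes_calls(mode: str, message_bytes: int) -> Tuple[int, int]:
    """Return (fixed, bulk) AES call counts for encrypting a message."""
    blocks = math.ceil(message_bytes / BLOCK_BYTES)
    return FIXED_CALLS, BULK_CALLS_PER_BLOCK[mode] * blocks


def fixed_share(mode: str, message_bytes: int) -> float:
    """Fraction of the AES calls spent in the KDF and TGF."""
    fixed, bulk = aes_calls(mode, message_bytes)
    return fixed / (fixed + bulk)


for size in (16, 21184):
    print(size, aes_calls("LR-AES-2", size), round(fixed_share("LR-AES-2", size), 3))
# 16    (260, 4)    0.985
# 21184 (260, 5296) 0.047
```

Under these assumptions, the fixed cost accounts for almost all AES calls on a single block but for less than 5% on the 21,184-byte firmware.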

Conclusions
Securing low-cost embedded devices against side-channel attacks is a challenging task. It is for example well-known that implementing standard countermeasures like masking and shuffling hardly gives strong guarantees without a sufficient noise level, that may not be readily available in such devices. This fact naturally suggests the use of leakage-resilient modes of operation as an interesting alternative to these countermeasures, especially when hardware coprocessors with bounded leakage are available. A previous work of Unterstein et al. proposed one construction in this direction, taking secure firmware update as an exemplary motivation. In this work, we first showed that this proposal ignores one important attack vector against the integrity of their scheme (namely the leakage of the tag verification) and as a result, it does not offer the claimed guarantees. We therefore propose new modes of operation (LR-BC-2 and LR-BC-3) that fix this issue. 19 Next, we revisit the concrete side-channel security evaluation of the leakage-resilient PRF that is leveraged by our modes (and the one of Unterstein et al.) and show that it is quite sensitive to variations of target device and measurement setup. We conclude that such modes are interesting candidates for leveraging the existing ecosystem of embedded micro-controllers, but that their security parameters should be selected conservatively.

B Perceived Information Analysis
Next, we compare the Perceived Information (PI) for profiling methods using Gaussian templates in the original leakage space (GT) and in a linear subspace (LDA+GT). The number of dimensions of the Gaussian templates is denoted as $n_d$. If applicable, the number of dimensions taken into account to project in the linear subspace is denoted as $n_p$. See Subsection 2.1 for additional details. Figure 16, which reports the PI according to the profiling set size, leads to two main observations. First, both types of templates converge to similar values, meaning that a similar amount of information can be extracted with GT and LDA+GT. Second, we observe that the LDA+GT approach converges faster, meaning that fewer profiling samples are needed in order to fit the model. The gap between LDA+GT and GT depends on the number of profiling samples.

Figure 16: PI for the best models we found with and without using linear subspace.

C Compression functions and multi-collision
The Merkle-Damgård (MD) paradigm for hash functions is based on the iteration of a compression function. Here we give a multi-collision probability upper-bound for general compression functions that rely on a single call to a block cipher. Our result can be viewed as a generalization of the collision bound for PGV-like compression functions [BRSS10], for $\alpha$-multi-collisions from $\alpha = 2$ to $\alpha \ge 2$.

, y); as well as (iv) $g_{mid}(k, x, y) \to g_{out}(g_{in}^{-1}(k, x), y) = z$ with the following properties: Good compression functions are the compression functions satisfying the so-called T1-property in [BRSS10] of generalized PGV-like compression functions [PGV93]. The 12 secure PGV compression functions of the so-called group-1 satisfy T1.
In this paper, we use many compression functions. We will rely on Mennink's double-length compression function, which makes 3 calls to the underlying block cipher. This compression function has a double-length state and it is easy to see that it implicitly contains 3 different types of (single-length) good compression functions. This is why a multi-collision bound on good compression functions can be useful. While it is easy to compute exponentially many collisions on MD($f$) from several 2-collisions when $f$ is a good compression function (due to a result of Joux [Jou04]), multi-collisions on $f$ are still interesting to bound. We believe this to be of independent interest, especially for other modes based on Hirose's compression function or Tandem.
Although the probability of multi-collisions has been studied in the case of random hash functions [STKT08], we are not aware of a general lower-bound for multi-collisions on PGV-like compression functions. The following theorem fills this gap. We note that the definition of collision resistance of [PGV93,BRSS10] includes a target point as in the definition of preimage resistance. Here, there is no target point, which explains the difference between $(q + 1)q$ in the original bounds and $q(q - 1)$.
Theorem 2 (Multi-Collision). Let $E$ be an $n$-bit block cipher modeled as an ideal cipher and let $f[E]$ be a good compression function. Then, by making at most $q$ (forward and backward) queries to $E$, the probability of finding an $s$-multi-collision is bounded by:
$$\frac{2^n}{(2^n - q)^s} \binom{q}{s}.$$
Since $f[E]$ is a good compression function, any forward query $(k, x)$ defines a single $f$-query $g_{in}^{-1}(k, x)$. Likewise, any backward query $(k, y)$ defines a single $f$-query $g_{in}^{-1}(k, E_k^{-1}(y))$. Therefore, we do not distinguish $E$-queries from $f$-queries.
Proof. Let $C_s(q)$ be the event that there is at least one $s$-multi-collision for the good compression function $f[E]$ after $q$ queries. We partition this event twice as follows by considering the events $C_s(q, z)$ and $C_s(q, z, i)$, for any $z \in \{0, 1\}^n$ and any $s \le i \le q$. $C_s(q, z)$ is the event that the first time an $s$-multi-collision occurs it is on the output $z$. $C_s(q, z, i)$ is the same as $C_s(q, z)$ with the additional condition that the first time the $s$-multi-collision occurs is at the $i$-th query. We thus have: where $C^*(z, i_1, \ldots, i_s)$ is the event that all the $i_j$-th queries $(h_{i_j}, m_{i_j})$ are exactly those which hit $z$, for $1 \le j \le s$. That is, $f(h_{i_j}, m_{i_j}) = z$, while each of the other $i'$-th queries for $i' \le i_s$ and $i' \ne i_j$, for $1 \le j \le s$, does not collide on $z$ (no matter whether other $s$-multi-collisions occur or not before the $i_s$-th query). We find: where $K_i(h_i, m_i)$ is the number of times the key $k_i$ from $g_{in}(h_i, m_i) = (k_i, x_i)$ has already been used in some computation $E_{k_i}(\cdot)$ in a previous query before the $i$-th query. Assuming that $f(h_i, m_i) = z$, we argue why the factor $1/(2^n - K_i(h_i, m_i))$ is indeed the right piecewise probability: (i) in the case of a forward query, there is a single possible output $y$ of $E$ given $z$ and the $i$-th query $(h_i, m_i)$ (or equivalently $(k_i, x_i)$) since $g_{out}(h_i, m_i, \cdot) = z$ has exactly one solution by assumption on $g_{out}$; (ii) in the case of a backward query, there is a single possible input $x$ of $E$ given $z$ and $(k_i, y_i)$ since $g_{mid}(k_i, \cdot, y_i) = z$ has exactly one solution by assumption on $g_{mid}$. In both cases, the number of remaining possible outputs of forward or backward evaluations of the random permutation $E_{k_i}$ is exactly $2^n$ minus the number of previous calls to it, which is $K_i(h_i, m_i)$ by definition. Let $q \le 2^{n-2}$. For $s = 2$, we match the bound $\Pr[C_2(q)] \le q(q - 1)/2^n$ of [BRSS10]. Under the same condition on $q$, $\Pr[C_3(q)] \le q^3/2^{2n}$.
More generally, for $s \ge 4$ and $q \le 2^{n-1}$, we have $\Pr[C_s(q)] \le q^s/2^{n(s-1)}$. To summarize, good compression functions resist $s$-multi-collisions at least up to $2^{\frac{s-1}{s} n}$ queries, for $s \le n$.
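The simplified bounds can be checked numerically on toy parameters. The sketch below assumes that the bound of Theorem 2 reads $2^n \binom{q}{s} / (2^n - q)^s$ and compares it, with exact rational arithmetic, to the simplified expressions stated in the proof; the function names are illustrative. The simplification rests on $\binom{q}{s} \le q^s/s!$ together with $(2^n - q)^s \ge (3/4)^s 2^{ns}$ for $q \le 2^{n-2}$ and $(2^n - q)^s \ge 2^{(n-1)s}$ for $q \le 2^{n-1}$, the latter combined with $2^s/s! \le 1$ for $s \ge 4$.

```python
from fractions import Fraction
from math import comb


def multicollision_bound(n: int, q: int, s: int) -> Fraction:
    """Assumed form of the Theorem 2 bound: 2^n * C(q, s) / (2^n - q)^s."""
    return Fraction(2**n * comb(q, s), (2**n - q) ** s)


def simplified_bound(n: int, q: int, s: int) -> Fraction:
    """q(q-1)/2^n for s = 2, and q^s / 2^(n(s-1)) for s >= 3."""
    if s == 2:
        return Fraction(q * (q - 1), 2**n)
    return Fraction(q**s, 2 ** (n * (s - 1)))


def security_exponent(n: int, s: int) -> Fraction:
    """log2 of the query count at which q^s / 2^(n(s-1)) reaches 1."""
    return Fraction((s - 1) * n, s)


n = 16  # toy block size so that the exact fractions stay small
for s, q in [(2, 2**14), (3, 2**14), (4, 2**15), (5, 2**15)]:
    assert multicollision_bound(n, q, s) <= simplified_bound(n, q, s)

print(security_exponent(128, 2), security_exponent(128, 4))  # 64 96
```

For $n = 128$, the exponent $(s-1)n/s$ gives 64 bits for plain collisions and 96 bits for 4-multi-collisions.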

D Mennink's Double-Length Compression Function
Mennink's double-length compression function provides optimal security based on 3 calls to the underlying BC modeled as an ideal cipher. To be collision resistant, we also must have: $a_{12}, a_{13}, a_{24}, a_{32}, a_{33}, a_{44} \neq 0$, and $a_{12} \neq a_{32}$, $a_{13} \neq a_{33}$. In that case, the collision advantage for $n = 128$ allows $q \approx 2^{118}$ ideal cipher queries to have a collision probability of 1/2. Mennink's main bound depends on two parameters $t_1$ and $t_2$ that control the maximum number of some bad events in the security proof. To give a concrete non-optimal simple bound, if we set $t_1 = q/t_2$, for $t_2 \ge 9$, we have: where $\varepsilon_{MenH}$ is the collision-resistance advantage. A closer look at Mennink's proof shows $\varepsilon_{MenH} \le \tilde{\varepsilon}_{MenH} + \varepsilon_{mu\text{-}coll}$, where $\varepsilon_{mu\text{-}coll}$ is thrice the advantage of finding a $(t_2 + 1)$-multi-collision for any of the following 3 compression functions $a_1(K, L, D, C)$, $a_3(K, L, D, C)$ and $f_1(h, m) = E_h(m) + m$ involved in the last 2 BC calls of Figure 17. Since in our mode we separately deal with multi-collisions (which implies those of Mennink's proof), we are only interested in: $\tilde{\varepsilon}_{MenH} \le \frac{(2t_2^2 + 7t_2 + 18)\, q}{2^n - q}$, for any $t_2 \ge 1$, as long as our multi-collision parameter equals $t_2$.
Preimage resistance. With more constraints on the linear map, the everywhere preimage resistance advantage of the compression function is roughly $\le q^{2/3}/2^n$, but asymptotically in $n$ [Men17]. For $n = 128$, we can actually have 180 bits of security (instead of 191 bits, for $q \approx 2^{3n/2}$). Yet, this is a strong (and tight) result. However, we do not need such a bound in the security of our mode. The only place where we need to rely on the preimage resistance is when we argue about state collision resistance. Although most of the state collisions can be avoided by directly relying on the collision resistance of the compression function itself, the initial phase of our mode derives many IV's that serve to initialize MenH in the MD construction. This forces us to show that none of these IV's can appear in some internal (non-initial) state, as we want to maintain a fresh final state to have independent tags. A natural strategy would be to generate random IV's from the nonces in the initialization and to directly apply Mennink's result. However, this does not lead to a strong security in our case: if we have $q_{on}$ online queries (adding up encryption and decryption), we will have an asymptotic bound of $q_{on}\, q^{2/3}/2^n$, allowing "only" $q_{on} \approx 2^{n/2}$ and $q \approx 2^{3n/4}$, or $q_{on} \approx q \approx 2^{3n/5}$, for instance. And the bound would be even worse for $n = 128$.
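The query trade-offs quoted above follow from setting the asymptotic bound, read here as $q_{on}\, q^{2/3}/2^n$, equal to 1 and solving for the exponents. The short sketch below reproduces this arithmetic; the function names are illustrative, and the outputs are asymptotic exponents only, since the text notes that the concrete bound for $n = 128$ is worse.

```python
from fractions import Fraction


def max_offline_log2(n: int, online_log2) -> Fraction:
    """Solve q_on * q^(2/3) = 2^n for log2(q), given log2(q_on)."""
    return Fraction(3, 2) * (n - Fraction(online_log2))


def balanced_log2(n: int) -> Fraction:
    """Solve q^(5/3) = 2^n, i.e. the case q_on = q."""
    return Fraction(3 * n, 5)


n = 128
print(max_offline_log2(n, n // 2))   # 96, i.e. 3n/4
print(float(balanced_log2(n)))       # 76.8, i.e. 3n/5
```

Both exponents are far below the $3n/2$ offered by the preimage bound alone, which is the gap the following strategy avoids.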
Instead, our strategy is to generate the IV's as $(K_0, 1)$ for a random $n$-bit half state $K_0$ depending on the nonces to solve the above preimage issue. That way, we only need to rely on preimage resistance for a single half-state value. Collisions between initial states and internal states thus imply computing a preimage of 1 for a PGV-like good compression function. IV collisions are now more likely, as it is enough to collide on the $n$-bit value $K_0$. Still, it is easy to avoid causing collisions in the next state by first absorbing the nonce in a compression function call, since the first evaluation input $(K_0, 1, N)$ is always fresh. This costs us one more compression function call but it improves the bound drastically.
Since we do not have to comply with the more restrictive choice for the linear map to be preimage resistant, we have more freedom to choose the coefficients. In our choice, we avoid using $K$ in the plaintext input to give more confidence in confidentiality, since it leaves open the possibility of proving our mode based on a related-key secure block cipher in the standard model as future work.
Indifferentiability. Mennink's compression function is also indifferentiable from a random function with domain $\{0, 1\}^{3n}$ and range $\{0, 1\}^{2n}$ if the linear map satisfies the requirement for collision resistance. The indifferentiability advantage is of birthday type in $n$.
This result is very convenient to argue about the confidentiality with leakage of our mode: the internal leakage does not help to distinguish our key-stream from a truly random stream up to the birthday bound. As long as the internal computation remains hidden enough, the security will follow easily by other standard techniques.