MIRACLE : MIcRo-ArChitectural Leakage Evaluation A study of micro-architectural power leakage across many devices

. In this paper, we describe an extensible experimental infrastructure for evaluating the micro-architectural leakage, based on power consumption, that stems from a physical device. Building on existing literature, we use it to systematically study 14 diﬀerent devices, which span 4 diﬀerent instruction set architectures and 4 diﬀerent vendors. The study allows a characterisation of each device with respect to any leakage eﬀects stemming from sources within the micro-architectural implementation. We use it, for example, to identify and document several novel leakage eﬀects (e.g., due to speculative instruction execution), and scenarios where an assumption about leakage is non-portable between diﬀerent yet compatible devices. Ours is the widest study of its kind we are aware of, and highlights a range of challenges with respect to 1) the design, implementation, and evaluation of, e.g., masking schemes, 2) construction of accurate leakage models, and 3) selection of suitable devices for experimental research. For example, in relation to 1), we cast further doubt on whether a given device upholds the assumptions required by a given masking scheme; in relation to 2), we conclude that (statistical or formal) device leakage models must include information about the micro-architecture being modelled; in relation to 3), we claim the near mono-culture of devices that dominates existing literature is insuﬃcient to support general claims regarding leakage. This is particularly important in the context of the FIPS 140-3 standard for non-invasive side-channel evaluation.


Introduction
(Micro-)architecture as a concept. In the context of processor design, the term architecture 1 describes the interface between hardware and software. It defines how hardware and software interact, i.e., what is "visible" to the programmer, and typically includes a definition of 1) state, 2) instructions that act on said state, and 3) an execution model for said instructions. The term Instruction Set Architecture (ISA) is often used synonymously, with micro-architecture 2 describing an implementation of the associated ISA, i.e., as a specific processor core. As such, the ISA represents a logical abstraction of an underlying, physical micro-architectural implementation.
Beyond definitional precision, this abstraction enables behavioural diversity while ensuring functional compatibility. Or, put another way, maximising flexibility with respect to implementation while retaining consistency with respect to usage: a by-design disconnection between behavioural and functional semantics of instruction execution means different micro-architectures can realise the same ISA, but employ different techniques or optimisations in their design and/or implementation. By harnessing this fact, specific 3 micro-architectures can optimise instruction execution to suit a given market or use-case, without placing a burden on the programmer. Doing so often acts within a broader strategy to address limitations on scaling (e.g., clock frequency) that stem from Moore's Law.
(Micro-)architecture as an attack vector. From a different perspective, however, the same property can be problematic. For example, development of high-assurance software typically requires detailed knowledge of and control over both functional and behavioural semantics of instruction execution. When met, such requirements permit 1) formal reasoning and guarantees about functional correctness, and 2) management of the associated (implementation) attack surface, e.g., by instrumenting suitable countermeasures. The abstraction of a micro-architecture by an ISA limits the degree to which this is true, meaning the requirements are often not met (or at least not sufficiently so).
The way micro-architectural side-channel attack techniques (see, e.g., [Sze19, Section 4] and [GYCH18, Section 4]) are enabled and/or exacerbated is a good example of this problem. At a high level, such techniques exploit leakage coming from sources in a particular micro-architecture: one can classify leakage as either 1. discrete (or digital), meaning it relates to logical, or functional characteristics, e.g., data-dependent instruction execution latency (i.e., number of cycles) caused by microarchitectural state and execution model, or 2. analogue, meaning it relates to physical, or behavioural characteristics, e.g., datadependant power consumption [KJJ99,MOP07] or EM [GMO01,AARR02] emission caused by the behaviour of CMOS transistors which constitute the micro-architecture.
In a sense, the latter acts as a superset of the former: because analogue leakage can capture fine-grained, potentially sub-cycle features, discrete forms of leakage will typically be captured by it indirectly. Either way, micro-architectural abstraction implies 1) the security properties of software are difficult to reason about, and may even differ depending on the micro-architecture it is executed on, and therefore 2) development of robust software-based countermeasures is a significant challenge. Such implications have led to arguments (see, e.g., [GYH18]) for migration of traditionally opaque ISAs toward more (semi-)transparent alternatives in which (selected) micro-architectural features are visible. Likewise, they have motivated hardened micro-architectural designs (see, e.g., [KGBR19,MGH19]) which mitigate the lack of such transparency.
Problem statement. Focusing on analogue forms of leakage, Rivain and Prouff [RP10, Section 1] consider that, in their design and implementation, existing countermeasures against relevant side-channel attacks "start from the assumption that a cryptographic device can keep at least some secrets and that only computation leaks". In practice, such countermeasures, plus associated proofs of their security, are only as robust as the assumptions they are based on; this is evidenced by attacks, such as those of Balasch et al. [BGG + 14], which highlight instances where they are invalid. Having an accurate device leakage model 4 , i.e., a model of how and why a device leaks, is therefore imperative. Although some aspects of a device leakage model can be reasoned about in an analytical manner, others, particularly if the underlying design is unknown, demand experimental analysis using a physical instance of the device (which is akin to reverse engineering, cf. [GOP21]). Set within this context, we addresses a 2-part problem statement. First, there is a limit on the current state-of-the-art, with respect to both 1) which devices have associated leakage models, and 2) how accurate those leakage models are. Second, expansion or improvement with respect to either limit demands both hardware and software infrastructure which is flexible while also allowing accurate, reproducible, and therefore robust conclusions; such infrastructure does not currently exist.
Remit. Concretely, this paper addresses the problem statement above by first introducing and then using experimental infrastructure we dub MIRACLE (a backronym for MIcRo-ArChitectural Leakage Evaluation). On one hand, we position the paper as contributing in the following ways. First, versus the Rosita framework of Shelton et al. [SSB + 20], for example, we deliberately attempt to broaden the scope by increasing the diversity and complexity of devices and hence micro-architectural features considered. Second, we place explicit value on increasing the extent to which "folk-law" observations are explainable.
To a greater degree than previously, doing so provides a formal basis for the topic. Third, we place explicit value on the reusability of associated artefacts. We posit, for example, that the infrastructure and data sets stemming from our work can support forms of leakage-aware verification, such as that of Barthe et al. [BGG + 20], which demand accurate, fine-grained device leakage models. On the other hand, however, we carefully limit the remit in the following ways. First, we focus exclusively on analogue micro-architectural leakage related to power consumption. We use the acronym Micro-architectural Power Leakage (MPL) to fix this detail from here on, although stress that most results are more general, e.g., apply to EM as a replacement for power consumption. Second, we focus exclusively on evaluating (i.e., identifying, characterising, and documenting) MPL; this means we deem exploitation of, and countermeasures addressing MPL out of scope.
Organisation. The paper is organised as follows: • Section 2 surveys existing literature related to MPL. Based on this, Section 3 then attempts to define a precise, unified terminology. This allows development of a structured classification for leakage sources and effects, and clearer discussion of associated work. • Section 4 describes MIRACLE, our experimental infrastructure for evaluating the MPL which stems from a given device. • Section 5 analyses specific data sets produced by the infrastructure, in order to document several novel, low-level leakage effects, and to explore some overarching high-level hypotheses. These include 1) to what extent implicit or explicit assumptions, such as Only Computation Leaks (OCL) [MR04] and Independent Leakage Assumption (ILA) [RSVC + 11, Section 2.2], made during design or implementation, actually hold in practice, and 2) whether and how identical implementations leak on different but compatible devices, and thus how "portable" the countermeasures they use are. • Section 6 then, finally, attempts to summarise the implications for situations when consideration of MPL is important; we pitch this as a limited set of design or implementation guidelines for digital design and cryptographic engineers.

Background
In this section, we survey existing literature related to MPL. The goal is to 1) organise and summarise relevant work, which spans different research fields and focuses, and 2) clarify the relationship between this paper, and relevant work which we either build on or produce implications for. Due to space constraints we focus on a minimal survey of the most recent and/or relevant work only. A more detailed survey can be found in [MPW21][Appendix A].
(Micro-)architectural power leakage effects. There have been many efforts to study particular sources of MPL, and how they can undermine masking based countermeasures.
In [BGG + 14], the authors discuss how physical effects in a hardware device ("glitches and transition-based leakages") can halve the security order of a masked implementation. This was followed by [PV17], which details several specific leakage effects which can be repeatedly demonstrated as undermining a masked software implementation. This includes the "overwrite" effect where registers are overwritten with sensitive values, the "memory remnant" effect where values written or read from memory are buffered unexpectedly, and the "neighbour-leakage" effect where explicit accesses to one register can cause implicit accesses to an adjacent register. Subsequent works such as [CGD18] and [SSB + 20] replicate some or all of these effects in different devices, and introduce either new variants or more specific cases of known variants. In [MMT20], the authors survey four devices and explicitly build on the work of [PV17]. We consider [MMT20] the work most similar to ours, in that it also surveys a range of devices and effects, confirming (as discussed later) that leakage effects vary widely between devices.
Other works have examined particular processor cores in great detail from a microarchitectural perspective [CGMA + 15, BP18, DAK19], but have not looked more widely at families of cores, such as all ARM Cortex-M devices.
(Micro-)architectural power leakage modelling and tooling. In [MWO16], the authors build a statistical device leakage model of ARM Cortex-M0 and Cortex-M4 cores by examining tuples of adjacent instructions, and using statistical regression to model the resulting information leakage. This successfully captures even very esoteric kinds of leakage in instructions which are executed adjacently. However, as explored in [SSB + 20], it can fail to capture cases where instructions which are executed far apart yet still interact and hence induce leakage. Load and store instructions are a good example, as discussed in [SSB + 20, Section IV(C)] and examined in detail in Section 5.1. Another limitation of statistical modelling techniques is ensuring that a representative sample of instruction executions and sequences are used to build the model. The authors of [MWO16] were able to to group instructions with similar leakage and so reduce the problem space, but the problem of generating large amounts of initial stimulus remains.
Complementing statistical approaches to device leakage modelling are formal modelling approaches. These have mostly been applied to masked hardware implementations, as with the maskVerif [BBC + 19] and REBECCA [BGI + 18] tools, or more recently, SIL-VER [KSM20]. Such techniques have also been applied to software masking. In [BGG + 20], the authors use a Domain Specific Language (DSL) to describe the individual instructions of a processor core, and the state elements they update. This model of the micro-architecture is supplied by the user, and may need to be reverse engineered from an existing device. Assuming the accuracy of the micro-architecture model, this is a compelling approach. The TORNADO tool [BDM + 20] is another similar approach, which combines the Usuba bit-slicing compiler [MDLG18] and the tightProve [BGR18] tools to generate C programs with formal side-channel security assurance in the register probing model. Gigerl

Terminology
Per Section 2, study of MPL exists at the intersection of several research fields. This has led to terminology which is sometimes inconsistent or imprecise, making it difficult to clearly discuss associated work. Before any technical contributions we attempt to address this issue, by first addressing architectural then micro-architectural concepts. In each such step, we use a model for instruction execution to classify associated leakage sources. Doing so allows one to reason about how and why leakage occurs, and thus how it may be avoided (or not). It is important to note that none of this will be deemed innovative to digital design engineer. As well as supporting the remainder of this paper, however, we posit that shared terminology and understanding is a necessary starting point to enable 1) digital design engineers (cf. [BMT16]) to reason about the impact of MPL from their (hardware) designs and implementations, and 2) cryptographic engineers to develop useful device leakage models, and thus leakage-free (software) designs and implementations.

Notation
Let MEM[i] denote the i-th element of memory: MEM[i] j is used to specify an access granularity of j bytes where appropriate, with j = 1 (implying memory is byte addressed) assumed if/when omitted. Let GPR[i] denote the i-th General Purpose Register (GPR); we refer to a given Special Purpose Register (SPR) by name, including an optional field where appropriate. Within the specific context of ARMv7-M, for example, PC ≡ GPR[15] might denote the Program Counter (PC), whereas CPSR[C] ≡ CPSR 29 might be used to denote the carry flag within the Current Program Status Register (CPSR).
Let E i and C i denote some i-th execution and clock cycle respectively. We describe E i and E j (resp. C i and C j ) as being separated by a distance of n if |i − j| = n, noting that the specific case where n = 1 implies they are consecutive.

Model
We assume an ISA will include a definition of (at least) • a set of architectural state, namely storage elements such as GPRs, SPRs, and memory, • a set of architectural instruction semantics: any given instruction may read values from state elements, performs computation, and write values to state elements, and • an instruction execution model. An ISA will usually adopt a in-order execution model, wherein each execution cycle will atomically and independently fetch, decode, then execute a single instruction; only architectural state is guaranteed to be preserved between execution cycles.

Leakage
Concept 1. Architectural Power Leakage (APL) can be inferred from the architectural definition of instruction execution, meaning it stems purely from architecturally visible detail (and is thus micro-architecture agnostic).

Concept 2. Intra-instruction leakage
can be reasoned about with within the context of 1 execution cycle (i.e., execution of 1 instruction). This contrasts with inter-instruction leakage, which cannot: it occurs, and thus must be reasoned about over n > 1 execution cycles (i.e., execution of n instructions).
Note that inter-instruction leakage is not limited to adjacent instructions (as executed in E i and E i±1 ). As we explore later, interaction between and thus leakage from non-adjacent instructions (as executed in E i and E i±j for j > 1) is also plausible.
Given an instruction sequence, the concepts outlined above imply associated APL can be inferred abstractly, i.e., without using a concrete implementation of the ISA: viable approaches include the use of an instruction set simulator, or even static analysis. For example, consider the following 3-instruction sequence: 1 add a0, a1, a2 // HW ( a1 ) + HW ( a2 ) + HW ( a1 + a2 ) + HD ( a0, a1 + a2 ) 2 add t0, t0, t2 // HW ( t0 ) + HW ( t2 ) + HW ( t0 + t2 ) + HD ( t0, t0 + t2 ) 3 lsl t0, t0, # 4 // HW ( t0 ) + HW ( t0 << 4 ) + HD ( t0, t0 << 4 ) Based on an assumed power leakage model for register access and bus activity, the annotation captures inferred architectural power leakage, i.e., 1) Hamming weight leakage for each operand related to each read from the GPRs, 2) Hamming weight leakage related to each result computed, and 3) Hamming distance leakage related to each write to the GPRs. It is trivial to identify instances of inter-and intra-instruction leakage: the former occurs when an instruction reads operands from the GPRs or performs computation, whereas the latter occurs when an instruction writes results to the GPRs and thereby overwrites a value already there (as written by a previous instruction, so implying the required interaction

Model
Existing literature (see, e.g., [PV17, CGD18, SSB + 20, MMT20]) has demonstrated that APL cannot accurately capture every pertinent leakage source or, therefore, effect: to do so, one must also consider leakage which stems from the micro-architecture. We address this fact by extending the model in Section 3.2: specifically, we • extend the architectural state with a set of micro-architectural state plus a mapping function between architectural and micro-architectural state elements, and • extend the architectural instruction semantics with a set of micro-architectural instruction semantics.  In the left design, only combinatorial leakage is evident, based on how the signals t 0 and t 1 change on each clock cycle; subsequent signals, e.g., t 2 and t 3 , change only once per clock cycle. Note this ignores the effect of wire delay on t 0 and t 1 . In the right design (a multiplexer), if the sel, x, and y signals all change, then the delayed sel signal will cause Hamming distance leakage on r: this causes it to glitch, i.e., change multiple times per clock cycle.
One could define micro-architectural state elements as being those implicitly supporting execution of instructions (by maintaining any associated values while execution occurs); this contrasts with architectural state elements, which are those explicitly used by instructions. Likewise, micro-architectural semantics describe execution of an instruction in terms of micro-architectural state elements; this contrasts with architectural semantics, which do so in terms of architectural state elements.
In contrast with architectural state, micro-architectural state may or may not be preserved across either execution or clock cycles. As such, the micro-architectural state mapping function and the instruction semantics must be considered, to some extent, with respect to an execution context which includes 1) the current micro-architectural state, 2) the "current" instruction (i.e., that executed in E i ), and 3) the "surrounding" instructions (i.e., those executed in E i±j for 1 ≤ j < m given some m). For example: where pipelining is employed, instructions following a branch may or may not be (completely) executed depending whether the branch is taken or not; where pipelining is employed, the (concrete) state element corresponding to an (abstract, architectural) register may be part of the forwarding logic versus the register file; where register renaming is employed, the (concrete) state element corresponding to an (abstract, architectural) register may differ for each write to that register.

Leakage
Concept 3. Micro-architectural Power Leakage (MPL) cannot be inferred from the architectural definition of instruction execution, meaning it stems purely from architecturally invisible detail (and is thus micro-architecture specific).

Concept 4. Intra-cycle leakage
can be reasoned about with within the context of 1 clock cycle. This contrasts with inter-cycle leakage, which cannot: it occurs, and thus must be reasoned about over n > 1 clock cycles.
Concept 5. Sequential MPL occurs when the value of a state element changes.
We specifically use the term "sequential", because it stems from the sequential logic elements, flip-flops and registers constructed from them, used within a design. A specific instance of sequential MPL is necessarily classified as inter-cycle, because it occurs at the boundaries of clock cycles. However, it may be classified as either intra-instruction, because it can stem from isolated execution of 1 instruction (e.g., during iterative computation of a multiplication), or inter-instruction, because it can stem from interaction between execution of n > 1 instructions (e.g., a specific state element updated by different instructions executed in E i and E i±j for j = 0). Concept 6. Non-glitching combinatorial MPL occurs when the output of a combinatorial logic element toggles; glitching combinatorial MPL occurs when the output of a combinatorial logic element toggles more than once per clock cycle.
In existing literature, it is common for the catch-all term "glitching" to be used for either case. We carefully distinguish between the cases, observing that all leakage caused by glitching is combinatorial leakage but not all combinatorial leakage is caused by glitching. The designs in Figure 1 represent (simple) instances of the non-glitching and glitching cases respectively. Various other instances are arguably more subtle, in the sense they relate more to the physical properties of an implementation than logical properties of a design. For example, it is common to observe combinatorial leakage stemming from an inverter chain: these structures are used, for example, to mitigate issues due to 1) the standard cell library (e.g., to allow a high degree of fan-out), and/or 2) place-and-route (e.g., to drive signals over long wires).
Beyond some niche exceptions (e.g., multi-cycle paths) we deem out of scope, a specific instance of combinatorial MPL is necessarily classified as intra-cycle and therefore intrainstruction. This is because, within the context of a sequential logic design, the value produced by a combinatorial logic element must settle before the next clock cycle (otherwise a violation of the critical path occurs).

Summary
Concept 7. A given power leakage source can be classified as being either 1) architectural 2) sequential micro-architectural 3) non-glitching combinatorial micro-architectural 4) glitching combinatorial micro-architectural and further qualified as manifesting on an a) intra-instruction, b) inter-instruction, c) intra-cycle, or d) inter-cycle, basis. A leakage source is therefore defined as the design or implementation feature that enables leakage to occur.
We stress that, due to our focus on the architectural and micro-architectural levels within what is a larger stack of abstractions, this classification should be viewed as necessary but not sufficient: it cannot capture every leakage source. A pertinent example is that of capacitive coupling between wires (see, e.g., [CBG + 17, CEM18, LBS19]), which, although important to model, we regard as a separate problem. Our justification is that there is no architecture or micro-architecture at that level of abstraction, only standard cells or wires: capacitive coupling leakage can therefore be regarded as at best indirectly related to any particular instruction sequence and execution of it, versus architectural or MPL where the same relationship is more direct.

Concept 8.
A power leakage effect is the form of information observable whenever a leakage source causes leakage due to execution of an instruction sequence we term a leakage trigger.
The term micro-benchmark 5 refers to a well established concept: it is a short, self-contained instruction sequence specifically designed to analyse some feature in the processor core that executes it. For example, the performance counter based nanoBench framework of Abel and Reineke [AR19] was used to determine (or reverse engineer) the latency, throughput, and port usage of x86 instructions, and cache architecture of x86 processor cores. We adapt the term as follows: Concept 9. A leakage micro-benchmark is a specific instruction sequence constructed to prove or disprove a hypothesis about, e.g., the existence of a leakage source and any associated leakage effect.

Infrastructure
Fundamentally, we are interested in evaluating the MPL stemming from a given processor core. In this section we describe MIRACLE, our experimental infrastructure and methodology for doing so. We start by introducing high-level terminology related to the components and processes involved, then, in subsequent subsections, offer more, lower-level detail. We refer to the physical integrated circuit containing some System on Chip (SoC) as a device; such a device will contain one or more cores 6 . Most devices cannot be used in a stand-alone manner, because, for example, they require surrounding infrastructure for power delivery. We refer to this infrastructure as the host platform, noting that both general-purpose (i.e., support multiple devices), and special-purpose (i.e., specific to, and even integrated within a device) instances are possible. As a result, the target of evaluation is defined by a 3-tuple of core, device, and platform. An evaluation, which we refer to as an experiment, is coordinated by a controller (i.e., a workstation) in two steps: it 1) acquires a set of power consumption traces during execution of some micro-benchmark by the target, then 2) subjects the traces to some form(s) of analysis, attempting to prove or disprove an associated hypothesis. We refer to the power consumption traces as the trace data set and results of analysis as the analysis data set.
Note that we avoid direct comparison of data sets stemming from different targets and therefore devices. For example, a conclusion such as "leakage is stronger in target X than target Y" is not possible. However, we do compare targets with respect to the associated hypothesis with a particular micro-benchmark. For example, a conclusion such as "target X and Y both exhibit leakage effect Z" is possible.

Devices
Let f d i denote a device where f is the family identifier (e.g., ARM), d is the core identifier (i.e., the specific processor core), and i is the instance number (where more than one exits). Where necessary, we permit further annotation such that =⇒ a device operating at a specific clock frequency (of xMHz)

Platforms
Each device is supported by and so situated within a specific platform. MIRACLE (currently) supports 3 different platforms, described in detail by the following:     • SCALE describes a platform based on the SCALE 7 host board; it supports a range of interchangeable target boards, and thus devices. The trace acquisition pipeline includes an on-board NXP BGA2801 amplifier (with 22 dB gain), and an on-board 2.6 MHz low-pass filter. • CW308 describes a platform based on the ChipWhisperer CW308 (or UFO) 8 host board; it supports a range of interchangeable target boards, and thus devices. The trace acquisition pipeline includes an off-board Agilent 8447D amplifier (with 25 dB gain), and an off-board MiniCircuits SLP-30+ 32 MHz low-pass filter. • SASEBO-GIII describes a platform based on the SASEBO-GIII [HKSS12] side-channel analysis platform; it houses two FPGAs, a Xilinx Kintex-7 (model xc7k160tfbg676) target FPGA, and a Xilinx Spartan-6 (model xc6slx45) support FPGA, and thus can be reconfigured to support a range of devices. The trace acquisition pipeline includes an off-board MiniCircuits BLK+89 D/C blocker, an off-board Agilent 8447D amplifier (with 25 dB gain), and an off-board MiniCircuits SLP-30+ 32 MHz low-pass filter.
The trace acquisition process is coordinated by a controller, which communicates with the platform and a PicoScope 5000 series oscilloscope, which terminates the trace acquisition pipeline. Although a platform-and/or device-specific approach to configuration and programming is required (e.g., via a Xilinx Platform Cable USB II for the SASEBO-GIII platform, or a Segger J-Link and OpenOCD for ARM-based devices), communication between controller and platform is more uniform: each platform uses an FTDI-based FT232 USB-to-UART bridge then exposed to the device (on-board for the SCALE platform, offboard for the CW308 and SASEBO-GIII platforms).

Micro-benchmarks
Notation. MIRACLE (currently) supports 23 different micro-benchmarks, which we refer to using major/minor as a short-hand: major is the major identifier (or class), and minor is the minor identifier (i.e., instance within said class). Where clear from the context, we use the minor identifier alone. Each micro-benchmark must be implemented for each device, or, rather, for each unique ISA. The 7 different cores (currently) supported by MIRACLE span 4 different ISAs, however, which presents a challenge with respect to how to describe them. We use one of two approaches, depending on who or what the user of such a description is. First, our human-readable description uses the combination of 1) a written explanation of the underlying goal, plus 2) an illustrative, pseudo-code example implementation modelled on the use of ARMv7-M; in some instances, we employ stylistic alterations 9 to improve their readability. On one hand, we use ARMv7-M because we expect this to be the most familiar of those ISAs support (and so add most value with respect to illustrating the underlying goal). On the other hand, although concrete ARMv7-M assembly language syntax and instruction mnemonics are used, for example, we opt for abstract, symbolic notation for values, register identifiers, etc. More specifically, we use the following notation: • A through H represent variables; note these are not hexadecimal literals, which would be prefixed by #. Unless otherwise noted or self-evident, any two variables, say A and B, are always allocated different architectural registers. • rA, for example, denotes a register which contains some variable A which is relevant to the associated experiment. • rX and rY denote registers which contain an address or variable which is irrelevant to the associated experiment. • rZ denotes a register which contains zero.
Second, our machine-readable description is then a functionally equivalent but ISA-specific realisation of the pseudo-code; in doing so we must manage differences between ISAs (e.g., the availability of a specific instruction type, or addressing mode), but, in all cases, we carefully avoid impact on the underlying hypothesis. Beyond formulation of the microbenchmark goal, we found this to be the most fragile and therefore challenging aspect of the development process.
Structure. Each micro-benchmark is implemented using the 2-part structure described by Figure 2: the goal-agnostic, high-level micro-benchmark driver ( Figure 2a) is implemented using C, whereas the goal-specific, low-level micro-benchmark payload ( Figure 2b) is implemented using assembly language for the appropriate ISA. The driver part is identical for all devices, but specific to a given micro-benchmark: it is responsible, e.g., for 1) receiving a set of inputs, which are generated uniformly at random then communicated by the controller, 2) managing aspects of trace acquisition (e.g., the trigger signal), and 3) invoking the associated payload. Note that word_t is typedef'ed to reflect the word size, e.g., to uint32_t or uint64_t. The payload part is responsible for realising the micro-benchmark itself; the micro-benchmark kernel is surrounded by additional instructions, which, for example, 1) clear callee-save registers before execution to ensure a fixed architectural state and prevent interaction with the input, and 2) clear all registers used by the kernel after one execution to prevent interaction with subsequent executions.
Acquisition. For a given device, we execute each micro-benchmark n times and so acquire n associated traces of power consumption. Without loss of generality we fix n = 20, 000 for ASIC-based devices and n = 30, 000 for FPGA-based devices, and assume that each trace is comprised of m samples (the larger n used for FPGA-based devices stems from their lower signal-to-noise ratio). Let T j i denote the j-th sample in the i-th trace, and V k i ∈ {A, B, . . . , H} denote the k-th input variable manipulated by the micro-benchmark during acquisition of the i-th trace: T and V constitute the trace data set for a given experiment. MIRACLE (currently) supports 14 devices and 23 micro-benchmarks: these lead to acquisition of 322 such trace data sets in total, one per experiment.

Analysis.
Having acquired the trace data set for a given experiment, we then apply an analysis step in order address the associated hypothesis. In concrete terms, we apply Correlation Power Analysis (CPA) [BCO04] with a Hamming weight or Hamming distance leakage model. We opt for this approach, versus techniques such as Test Vector Leakage Assessment (TVLA) [GJJR11] or Welch's t-test [Wel47], because it allows more confidence in a qualitative assessment of 1) whether or not inputs variables were manipulated or interacted, and 2) exactly which variables were involved if so; our choice aligns with Arora et al. [ABPP21], for example, who warn that use of TVLA (alone) "should be treated with caution as it is sensitive to both false positives and negatives".
Per our remit of evaluating MPL, the "attack" component 10 of CPA is irrelevant: we simply use CPA in order to "detect" if and when input variables are manipulated or interact. In concrete terms, doing so is captured by two cases: if H(x) denotes the Hamming weight of x, then 1. to perform Hamming weight analysis of variable V x , i.e., test for unexpectedly access or manipulation of one input variable, let H i = H(V x i ) for 0 ≤ i < n, whereas 2. to perform Hamming distance analyses of variables V x and V y where x = y, i.e., test for unexpected interactions between two input variables, let Then, if T j denotes the mean of all traces at sample j and H denotes the mean of all values in H, we compute an analysis data set represented by the correlation coefficient for 0 ≤ j < m (per, e.g., [MOP07, Chapter 6]). Note that not all experiments require use of both Hamming weight and Hamming distance analysis, meaning some experiments in Section 5 may omit one or other analysis. For example, in an experiment designed to test for unexpected interactions between two input variables, Hamming weight analysis is omitted; in an experiment designed to test for unexpectedly access or manipulation of one input variable, Hamming distance analysis is omitted. After MIRACLE has automatically acquired the trace data sets and applied the analysis required for a given experiment, we then manually interpret the resulting analysis data set: in either case, a significant peak (or spike) in R is suggestive of the hypothesised manipulation or interaction having occurred.

Artifacts
A concrete goal of MIRACLE is to make access to artifacts and the reproduction or improvement of our results easy. Hence, all of our benchmarks and experiment infrastructure is Table 2: A summary of results stemming from the micro-benchmarks in Figure 3, i.e., cases which explore Hamming distance leakage from combinations of ldr and str instructions. Note that AC, for example, indicates that the Hamming distance between A and C was leaked. Device available online 11 . We also developed a web-based interface 12 which offers a straightforward way to 1) inspect the binary or disassembled form of each micro-benchmark, and 2) view and compare the associated analysis data sets 13 .

Case studies
Note that all devices (currently) supported by MIRACLE, per Table 1, have a 32-bit data-path and could be described as micro-controller class. Although MIRACLE is generalpurpose, this (initial) selection was intended to facilitate an exploration of MPL stemming from ostensibly similar (i.e., from the same class, and so designed to satisfy similar use-cases) devices commonly used in existing literature.
Using the MIRACLE infrastructure outlined in Section 4, this section presents said exploration via a set of case studies: each has the same structure, in the sense it 1) describes the set of micro-benchmarks used, 2) summarises the resulting analysis data sets, then 3) discusses those results, e.g., attempting to explain their occurrence, relevance, and/or implication. Some case studies replicate and generalise results in existing literature, while others, to the best of our knowledge, are novel (either in terms of the leakage effect and/or associated source). Note that in order to support 3), we stress the summarisation of results in 2): rather than exhaustively present the analysis results for all devices for all case studies, even though they exist, we select a subset which most easily support points in the high-level discussion.

Memory: hidden state
This case study focuses on the so-called memory remnant effect, as observed, for example, by Papagiannopoulos    to "leakage originating from consecutive SRAM accesses". In short, it captures the fact consecutive memory accesses may interact even if those accesses involve different architectural state This can be a challenging effect to identify and resolve, because 1) intermediate (e.g., ALU) instructions may not prevent leakage, and thus 2) leakage may occur due to instructions which occur far apart.

Results
Each micro-benchmark in this case study is functionally equivalent across the set of devices used, and, where permitted by the ISA, identical. Despite this fact, we observe markedly different leakage behaviour across those devices. For example, from Some experiments show very consistent behaviour across architectures. For example, all of the ARM cores behave identically for the memory-bus/ld-ld experiment. This result is widely reported in the literature. However, when looking at interactions between load and store instructions (such as might occur when spilling registers to the stack) separated by an ALU instruction (memory-bus/ld-st and memory-bus/st-ld) we see very different results not only between different cores, but even between the same core implemented by the same manufacturer in different devices. For example, loaded and stored values interact in the ST-based ARM S1 and ARM S3 devices. However, the ARM S2 and ARM S4 devices, when running exactly the same code, do not leak in the same way despite the underlying core and ISA being identical. Closer inspection of the data-sheets for these devices reveals that the ARM S2 and ARM S4 are high performance variants, with higher maximum operating frequencies. A similar comparison can be made between the Xilinx MicroBlaze cores, where the 3-stage MB X0 had clear interactions between loaded and stored values, but in the longer pipelined MB X1 and MB X2 , the loaded and stored values do not interact. For the store-store experiments, we see no Hamming distance leakage between the memory-bus/st-st-1 and memory-bus/st-st-3 variants. From this, we conclude that (for the number of traces we collected), leakage originates exclusively from registers in either the core or the memory hierarchy, not from inside the SRAM. Indeed, only the memory-bus/st-st-2 experiment causes Hamming distance leakage between the values being stored. This suggests that in some cases, so long as there is an intervening store to another address (thus flushing the store data-path), values with the same mask may be safely overwritten in memory without causing leakage. This avoids the need for expensive countermeasures (some of which are well described in [SSB + 20]), but we urge developers to verify that this holds on their own platforms themselves.

Discussion
In explaining some of the differences between devices with the same core but different leakage behaviours (ARM S2 and ARM S4 ), we hypothesise that the need to meet tighter timing requirements drove various design decisions regarding the core and memory interconnect, which apparently have led to totally separate load and store data-paths. Regardless of the actual reason for this difference in behaviour, from a leakage perspective, this is a critical difference between cores which must be accounted for.
Based on our observations, we extend the notion of adjacent instructions to describe different types of instruction: Concept 10. Two distinct instructions can be described as It should be clear that store-adjacent and load-adjacent instructions are mutually exclusive, and are both subsets of memory-adjacent instructions. We can now say that for some cores, memory-adjacent instructions will leak the Hamming distance between values written or read from memory, e.g., the ARM N2 . However, in the case of the ARM S2 , only load-adjacent and store-adjacent instructions will leak as such. We believe that tagging instructions as such will make formal modelling of memory hierarchy related leakage much easier to reason about. There are also obvious similarities between memory consistency models and ordering constraints or fences in various ISAs. We hope that these rules can be easily added to static program checkers, and be used as extra information when looking for interactions between variables. Based on the leakage taxonomy developed in Section 3, we classify these effects as sequential, inter-instruction MPL. Per device, we can now explain concisely whether, e.g., load-adjacent instructions suffer from inter-instruction sequential MPL or not.
For the ARM devices, we believe that the range of observed behaviours are much less surprising given a thorough reading of the AMBA-AHB bus standard [ARMe]. This bus standard is used by all of the ARM micro-controllers in the study. It is explicitly described as a pipelined bus architecture. Hence, the existence of some micro-architectural state should be expected. Observing [ARMe, Figure 1-1], the AHB block diagram, it is clear that there are opportunities for registers to be placed in several places, with only a finite number of sensible design choices for engineers to follow. This is shown in our results, because although different cores from the same manufacturer do differ, given that they differ, they differ consistently. This suggests it is possible to separate the MPL modelling of CPUs from the modelling of the memory interconnect.

Memory: data bus widths
This case study focuses on the interaction between data-type width (e.g., uint8_t, uint16_t, and uint32_t) and memory bus width. The central question is how does the memory sub-system satisfy an n-byte memory access, e.g.     : Pseudo-code for micro-benchmarks described in Section 5.2.1, i.e., those related to the case study on width of the memory access path. Note that for both of these experiments I is a parameter rather than a variable: the micro-benchmark is executed 8 separate times (i.e., for I ∈ {0, 1, . . . , 7}) for the byte cases, and 4 separate times (i.e., for I ∈ {0, 1, . . . , 3}) for the half-word cases. Table 3: A summary of results stemming from the micro-benchmarks in Figure 4a and Figure 4c i.e., cases which explore Hamming distance leakage from 8-bit, byte load (ldrb) instructions and 16-bit, half-word load (ldrh) instructions respectively. Note that each numbered column refers to the offset I into the array X, with a bold number used to highlight the offset for A; a in the i-th column indicates that the Hamming weight of A was leaked.
Device ldrb ldrh 0 1 2 3 4 5 6 7 0 1 2 3  Figure 4b and Figure 4d, i.e., cases which explore Hamming distance leakage from 8-bit, byte store (strb) instructions and 16-bit, half-word store (strh) instructions respectively. Note that each numbered column refers to the offset I into the array X, with a bold number used to highlight the offset for A; a in the i-th column indicates that the Hamming weight of A was leaked.
Device strb strh 0 1 2 3 4 5 6 7 0 1 2 3 on an ST-based ARM Cortex-M0 and in the ELMO [MWO16] power model: based on their observations as a starting point, our aim is to then 1) evaluate whether the same leakage effect is evident on other devices, and 2) explain the underlying leakage source(s). Figure 4 includes pseudo-code for the micro-benchmarks used; both cases uses an 8-byte, word-aligned array X. Each element of X is initialised to zero, bar one which is instead initialised to a security-critical value A. For the load (Figure 4a) case, the micro-benchmark loads an 8-bit (ldrb) or 16-bit (ldrh) value from a given offset I within X. If only those bytes required are accessed, we expect leakage only for an offset which implies access to the security-critical value; if leakage is observed at other offsets, we infer that bytes other than those required are also accessed. For the store (Figure 4b) case, the micro-benchmark stores an 8-bit (strb) or 16-bit (strh) zero value at a given offset I within X. Again, the presence (resp. absence) of leakage for a given offset allows us to infer which bytes are accessed. Table 3 shows consistent sub-word load behaviour for all of the ARM cores in the case study. Again, this effect has been noted in the literature, and we are unsurprised to confirm it across multiple devices and manufacturers. We note however the mixed results for MicroBlaze devices.

Results
For stores, Table 4 shows differing behaviour between different cores, and even the same core implemented in different devices when adjacent bytes in memory are overwritten with zeros. We note that results for ARM S1 , ARM S3 , ARM S2 , and ARM S4 appear to mirror the results observed in Table 2. Comparatively few devices showed any leakage in this case and it is not immediately clear why Hamming weight leakage should be visible for a byte in memory when the word is being written too, but that exact byte is not changing value.
For the store experiments, we also note the clear divide between ST-based ARM devices which show MPL, and NXP-based ARM devices which show no MPL for the same experiments. Recall again that where possible, the binary code running on each device is identical, yet still yields very different MPL behaviour.

Discussion
We investigated the inconsistencies in the Microblaze results by analysing the implemented designs. We found that in all cases, Vivado had combined four BRAM primitives (RAMB36E1) to create a single logical 32-bit word memory. The "Enable" signal for each BRAM primitive was driven by the same LUT, meaning that any sub-word access to the logical memory would implicitly enable access to all of the constituent BRAM primitives. However, the write enable bits were all driven independently (which is expected, or sub-word writes would not work). We believe this explains why for the load-byte case in Table 3, we see leakage for indices other than where the variable is stored. We cannot confidently explain the results for the load-halfword case however, but suggest it may be down to glitching or very low level signal timing issues which cannot be verified simply by inspecting the implemented design. This might explain why we see leakage in both halfword cases for the MB X0 , but not for the MB X1 and MB X2 . We hope that by making our artifacts public, it will be easier for other researchers to investigate the inconsistencies. Generally, we believe that the behaviour of the Vivado synthesis and implementation tools as they relate to leakage is a substantial area of interest unto itself.
For the inconsistencies in the stores case, we hypothesise the behaviour here is heavily dependant on the behaviour of the RAMs. For example, it is possible that even when only a single byte is being explicitly accessed, the entire word is implicitly accessed (i.e. it's value is read out of the cell array) within the RAM, whether it is for a load or a store. We note that we never see Hamming weight leakage when the value in memory is explicitly overwritten with zeros. This might imply forwarding behaviour inside the RAMs, where the RAM always reads the word being accessed (regardless of whether it is a load or store), and where the word being read is being written in the same cycle, the written value is forwarded to the RAMs read data register.
As to the difference in stored value leakage behaviour between manufacturers, again we hypothesise this is down to differences in manufacturing approach for the RAMs. We do not believe it is reasonable to conclude that NXP-based devices are on the whole less leaky, but note that this starkly illustrates how families of devices can behave very differently under leakage analysis. Again, we classify these effects as load/store-adjacent, inter-instruction, sequential micro-architectural.

Memory: sequential accesses
This case study focuses on the behaviour of sequential accesses (i.e., loads or stores) to memory. Such an access patter can arise, for example, when the nature data-type is uint8_t, or when accessing regions of (e.g., an array in) memory with unknown alignment. The central question is if one loads (resp. stores) bytes from (resp. into) different addresses in memory into (resp. from) different architectural registers, is it possible they interact? From an architectural perspective the answer should clearly be no, and so any leakage must stem from (micro-architectural) state within the memory sub-system. Figure 5 includes pseudo-code for the micro-benchmarks used; both cases uses an 8-byte, word-aligned array X. For the load (Figure 5a) case, the micro-benchmark loads a sequence of bytes from different, consecutive addresses in memory into different architectural Table 5: A summary of results stemming from the micro-benchmarks in Figure 5a, i.e., cases which explore Hamming distance leakage from 8-bit, byte load (ldrb) instructions using different, consecutive addresses and different architectural registers; the results focus on ST-based devices only. Note that a in the i-th column and j-th row indicates that the Hamming distance between the i-th and j-th bytes loaded was leaked.    registers: note that ldr rA, [rX, #0] should be read as "load the 0-th element of X, i.e., the variable A, into register rA". Any Hamming distance leakage between the bytes loaded allows us to infer the presence of shared state, e.g., within 1) the memory sub-system, and/or 2) the pipeline stages used by the core (i.e., between a value being received from memory by the core, and being written into a GPR). For the load (Figure 5a) case, the micro-benchmark stores a sequence of bytes from different architectural registers into different, consecutive addresses in memory: note that str A, [rX, #0] should be read as "store register rA, i.e., the variable A, into the 0-th element of X". Any Hamming distance leakage allows similar inferences to the load case, but could also indicate that additional pipeline register(s) exist between the register file and the memory write-port.

Results
The results for a representative subset of the memory-bus/seq-ld experiment can be found in Table 5. We can see two main effects, namely 1) bytes within a word can interact but may not, and 2) the i-th byte of the array will often interact with the (i + 4)-th byte of the array.

Discussion
We believe the first effect is due to the necessary multiplexing to select any byte of a loaded memory word, and to place it in the least significant byte of an architectural register. Clearly this multiplexing does not always cause visible MPL for the number of traces used in these experiments.
For the second effect, we believe this is because (as established in prior experiments), even when a byte of memory is requested, in reality, an entire word is loaded. Hence, when we load the first four bytes of the array, we are really repeatedly loading the entire first word. When we load byte 4 of the array, an entire new word is loaded into some micro-architectural state, and we see Hamming distance leakage between all corresponding bytes. Again, none of the instructions shared architectural destination registers, meaning all of the interactions are due to the MPL.
We believe the inconsistency of our results is down to two major factors. First, that if     we collected more traces we would see more consistent results. Second, that for a given number of traces, not all intra-cycle MPL (e.g., in multiplexer trees) can manifest, due to subtle differences in the final post-layout silicon design.
For this particular effect, we note the existence of two distinct sources of MPL according to our classification. First, we see inter-instruction sequential MPL, as we have with other memory bus experiments. However, we also see intra-instruction, glitching combinatorial MPL, where bytes of the loaded word interact with each other through the multiplexers, which select which byte of the word is written back to a general purpose register. Shelton et al. [SSB + 20] focus on an ST-based ARM Cortex-M0, which has a 3-stage pipeline and thus 2 sets of pipeline registers. We extend this remit, applying a similar methodology (i.e., that of using a set of micro-benchmarks) to devices which have more diverse micro-architectures and hence different, more complex pipeline structures. In doing so, we demonstrate how to answer an important question, namely in a scalar pipeline, do consecutive instructions that use different destination registers cause Hamming distance leakage between instruction results? From an architectural perspective the answer should clearly be no, and so any leakage must stem from (micro-architectural) state within the pipeline structure.

Pipeline register overwrites
The information generated by this methodology is useful, for example, to 1) reverse engineer details of the (unknown) pipeline structure, and, therefore, 2) model how in-flight instructions proceed through each stage of execution, e.g., whether and how they, and associated intermediate values, interact with each other. Of course, with white-box access to the micro-architectural design, e.g., the HDL, per the MAPS simulator of Le Corre Table 6: A summary of results stemming from the micro-benchmarks in Figure 6, i.e., during execution of a given instruction pair, was there Hamming distance leakage; N/A indicates the instruction pair (e.g., due to use of ror) is not supported by the ISA associated with that device.
Device eor-eor eor-add eor-lsl eor-ror eor-ldr eor-str eor-nop Table 7: A summary of results stemming from the micro-benchmarks in Figure 6, i.e., during execution of a given instruction pair, which operands caused Hamming distance leakage; N/A indicates the instruction pair (e.g., due to use of ror) is not supported by the ISA associated with that device. Note that AC, for example, indicates that the Hamming distance between variables A and C was leaked. The Hamming distance between variables A and B was leaked in all cases, so have been omitted.  [CGD18] for ARM Cortex-M3, one can obtain similar information more directly. However, we argue that a grey-box approach is more scalable: it can cater for cases when said design is unknown and/or inaccessible. Figure 6 includes pseudo-code for the micro-benchmarks used:

Micro-benchmarks
1. pipeline/eor-eor (Figure 6a): an eor instruction followed by another eor instruction, neither of which access the same architectural state (i.e., general-purpose registers). The aim is to answer the questions is there Hamming distance leakage between the operands, and is there Hamming distance leakage between the results? 2. pipeline/eor-add (Figure 6b Table 6 shows which devices leak the Hamming distance between the results of adjacent instructions. Interactions between certain pairs of instructions consistently leak the Hamming distance between instruction results across all devices (e.g., the eor-ldr) pair. We were surprised not to see more clear evidence of Hamming distance leakage between consecutive ALU type instructions; particularly the add, lsl and ror instructions. We offer two possible explanations for this. 1) that the interactions are too weak to detect for the number of traces we use, 2) that the results of the instructions make poor targets for Hamming distance analysis anyway. It is also possible that some interactions are purely combinatorial glitching leakage, which we would expect to be much weaker than sequential leakage, which involves state elements. Table 7 shows which instruction operands of the experiments in Figure 6 leak their Hamming distance. We find generally very consistent behaviour for interactions between the first and second operand registers of consecutive instructions. As has been reported in the literature we find that two adjacent instructions will usually leak the Hamming distance between their respective rs1 and rs2 operands. This was a key finding of [PV17] and [CGD18], and we have replicated their results across many other CPU cores. This is is visible in Table 7 as AC and BD interactions.

Results
While some very consistent behaviour is easy to see in Table 7, we note two unexpected effects. First, that for some cores, there are unexpected Hamming distance interactions between operands which do not obviously interact. For example, the ARM Cortex-M4 base devices leak the Hamming distance of rs1 of the eor, and rs2 of the add. We were also surprised to consistently see Hamming distance leakage between both register operands of the eor, and the register operand of the right shift by immediate instruction. This occurred consistently in the ARM Cortex-M3 and Cortex-M4 devices, and the larger MicroBlaze cores.

Discussion
We were able to discern, in the case of instructions with a register and immediate operand (like the load / store word instructions) exactly which operands from the previous instruction interact with the register operand, and which with the immediate operand. For example, all of the ARM cores consistently leak the Hamming distance between rs2 of an eor instruction followed by rs1 of a store word instruction. It is also useful to see that in the larger MicroBlaze cores, it is actually rs1 of the eor instruction which collides with the store word instruction. Knowing which operand register of the eor instruction interacts with the store instruction is essential for MPL modelling.
Comparing Table 6 and Table 7, we can see that generally speaking, Hamming distance leakage between consecutive instruction operands is much more consistently detectable than Hamming distance leakage between the results of consecutive instructions. Given that the result of one instruction will almost always become the operand of another, we believe that operands are inevitably more effective attack targets. That said, we note that leakage between the results of consecutive instructions, even when the instructions share no architectural registers between them, is likely to be an unexpected source of MPL for some implementations. Even in cores with very short pipelines.
In cores with longer pipelines (MB X1 , MB X2 ) we see consistent MPL between the results of adjacent instructions. This is expected, as there are likely multiple pipeline stages between the result being calculated and the write-back of the result to the register file. In general, we note the very different MPL behaviour across all of the devices (even devices with the same CPU core) mirrors our results for the memory bus related experiments. This would suggest that MPL models, even when they only look to model intra-core effects, cannot be transferable between devices; even if the devices are based on the same CPU core. Indeed, it raises the possibility that a device with multiple identical CPU cores 14 , could find that core 0 leaks differently to core 1, despite being implemented in the same piece of silicon.
We note the differing results for whether nop acts as a barrier between ALU type instruction operands. Even two ARM Cortex-M4 based devices (ARM S3 and ARM S4 ) behave differently. In the former a nop does not alter the pipeline registers which store instruction operands, causing Hamming distance leakage. In the later, the nop does act as a MPL barrier. Borrowing the notions introduced in Section 5.1, we can say that even though the eor instructions are not program-adjacent, they still interact. This behaviour, while not exactly expected, is certainly understandable given that the ARMv7-M architecture only guarantees the behaviour of a nop in terms of memory alignment. Researchers who use nop as a makeshift MPL barrier should beware that this is not portable. This has implications for some of the experiments in [SSB + 20] (e.g., Listings 1 and 2) where nop is used to separate instructions in time. We do not believe this had an adverse effect on the experiments in [SSB + 20], as our own results confirm nop is a reasonable leakage barrier in the ARM Cortex-M0 core. However, this cannot be expected on other, e.g., ARM Cortex-M4 cores. This lends some evidence to the utility of dedicated leakage barrier instructions, such as FENL [GMPP20], which guarantee some level of leakage barrier like behaviour.
Again, it is important to note the differences between different devices (even devices with the same underlying CPU core). That such basic instruction sequences leak visibly   on some devices and not on others points to a real challenge for formal and statistical modelling approaches.
In terms of classifying these sources of MPL, we find that program-adjacent, interinstruction, sequential MPL is the dominant source. This is expected, since we specifically went searching for MPL related to pipeline registers.

Control-flow instructions
This case study focuses on a relatively complex case for pipelined instruction execution, namely that of conditional and unconditional changes in control-flow. Note that we use the terms branch and jump to refer to the conditional and unconditional cases respectively, mirroring similar terminology used, e.g., in RISC-V and MicroBlaze. The central question is how does a change in control-flow effect micro-architectural state, such as pipeline registers, and so, e.g., does a change in control-flow prevent interaction between instructions before and after it?   Figure 7, i.e., cases which explore the impact a change in control-flow has on leakage. Note that AC, for example, indicates that the Hamming distance between A and C was leaked; R indicates that the Hamming distance between results produced by the two eor instructions was leaked. The Hamming distance between variables A and B and variables C and D was leaked in all cases, so have been omitted. Device branch-pre branch-post jump-pre jump-post

Micro-benchmarks
2. pipeline/branch-post (Figure 7b): as pipeline/branch-pre, except with the first eor instruction moved immediately after the beq instruction rather than before it. Note that if the branch is not taken, said eor instruction should not be executed therefore. 3. pipeline/jump-pre (Figure 7c): as pipeline/branch-pre, except with the beq instruction replaced with a b instruction (i.e., an unconditional branch, or jump) and the associated cmp instruction removed. 4. pipeline/jump-post (Figure 7d): as pipeline/branch-post, except with the beq instruction replaced with a b instruction (i.e., an unconditional branch, or jump). and the associated cmp instruction removed.

Results
There are three broad outcomes from analysing these sequences: 1. There is no interaction between the ALU instructions, whether they are separated by a branch/jump or not.

The ALU instructions interact if the branch/jump instruction is absent, but do not inter-
act if it is present. This tells us that any state which the ALU instruction interacts with is somehow cleared or overwritten by the branch/jump instruction. The branch/jump can then be thought of as a weak MPL barrier which prevents values manipulated in one loop iteration bleeding through the pipeline into the next iteration. 3. The ALU instructions always interact, regardless of whether a branch/jump is placed between them. In this case, extra care must be taken with masking schemes employing loops, as the "loop branch" instruction cannot be relied upon to stop micro-architectural state from one loop iteration bleeding into the next.
This approach is similar to the dominating instruction approach taken in [SSB + 20] to identify whether instructions share micro-architectural state. From the results in Table 8, we can see that while no instructions leak in the branchpre case, we get scattered Hamming distance leakage in some cases for the other experi-ments.
For the MicroBlaze based targets (MB X0 , MB X1 , MB X2 ) We see MPL between the operands and results of XOR instructions following the control-flow change, even when the first XOR instruction is not executed.
On all of the ARM cores, we find MPL between one set of operands in the jump-pre case. However, we note that the ARM S3 also leaks the Hamming distance of one set of operands in the same way, even though ARM has no delay-slot mechanism in the architecture. We investigate this further in Section 5.6.

Discussion
In the case of the MicroBlaze, we believe the leakage is explained by referring to the MicroBlaze Architecture and its use of branch delay slots. This is a technique where the instruction immediately following a branch or jump can be executed as a way of hiding the latency of re-filling the pipeline after a control-flow change. MicroBlaze has variants of branch and jump instructions with and without delay slots, and our experiments used the non-delay-slot variants. We believe that on a micro-architectural level, the delay-slot instruction is always executed, but its result is only thrown away at the last pipeline stage. This is clearly the case for the MB X0 and MB X1 cores, which leak operands and results. The MB X2 does not leak its result, suggesting the delay-slot instruction is squashed before its result is computed.
For the ARM jump-pre case, we suggest this is caused by the jump instruction only over-writing a single pipeline register, and leaving the other unmodified from the instruction preceding the jump. The unmodified value will then collide with whatever the next corresponding instruction operand is.
We note the lack of attention paid to branch and jump instructions with respect to MPL behaviours in the literature. We believe this sort of analysis is important because as we have shown, just because control-flow instructions don't normally read secret or sensitive variables, doesn't mean they can't cause or allow them to interact. Knowing if a "loop back" instruction sequence clears pipeline registers can reduce the need to explicitly clear certain micro-architectural state, or show that it is necessary.
We classify this effect as inter-instruction, sequential MPL between program-nonadjacent instructions. The non-adjacency of instructions is an important distinction, since it is not the branch or jump instructions which cause the interaction, rather, they fail to prevent interactions between other instructions.

Speculative execution in short pipelines
Despite being associated with larger, out-of-order and super-scalar cores with deep pipelines, speculative execution still takes place even in very shallow pipelines. In an n stage scalar pipeline, there are in principle n instructions in flight. Depending on the stages where control-flow changes occur, several instructions which are never executed from an architectural point of view, still enter the pipeline and interact with architectural and micro-architectural state elements. Further, different types of control-flow change can occur at different points in the pipeline. For example, a conditional branch can only occur after reading and comparing some architectural state, in a decode or execute pipeline stage. An unconditional jump however can occur at the decode or fetch stage, since the destination address can be computed from just the program counter and the instruction. All of this can occur in a scalar, pipelined CPU which may or may not have a branch predictor.
These scenarios are all micro-architectural reasons for speculative execution to occur. There is also an architectural reason: branch delay slots. These are a somewhat out-offashion idea, where the instruction immediately following a control-flow change may also : Pseudo-code for micro-benchmarks described in Section 5.6.1, i.e., those related to the case study on speculative execution. Table 9: A summary of results stemming from the micro-benchmarks in Figure 8, i.e., cases which explore the impact speculative execution has on leakage.
Device jump-fwd jump-bwd branch-fwd branch-bwd Device jump-fwd jump-bwd branch-fwd branch-bwd Instances of Hamming weight leakage: A, for example, indicates that the Hamming weight of A was leaked, noting that only cases that involve speculatively executed instructions are shown. be executed, even when the control-flow change is taken. This was a way of mitigating some of the performance impact of control-flow changes in pipelined CPUs, since one instruction may be executed for free while the new instruction stream is fetched. The MIPS architecture made use of branch delay slots, as does the MicroBlaze architecture analysed in this work. They are absent from most new and modern architectures now, as they significantly complicate out-of-order and super-scalar core design.
Here, we present (to our knowledge) the first study of how speculative (and from a software developers point of view, unexecuted) instructions can be identified from power traces. We show how to very simply identify the depth of any speculative behaviour, which guides software developers as to how much this issue might affect a side-channel secure implementation. This is particularly important given the results in Section 5.5, which shows that control-flow changes do not always act as a barrier between ALU instructions. Figure 8 includes pseudo-code for the micro-benchmarks used. Although not obvious, each one uses the same structure in which 1 initial eor instruction separated from n other eor instructions by a branch; the branch is always taken, with the resulting change of control-flow meaning the latter eor instructions are not executed from an architectural perspective. The underlying question is does the initial eor instruction before the branch interact with the unexecuted latter eor instructions after the branch, i.e., are the latter eor instructions speculatively executed, and therefore does speculatively execution cause MPL in a manner similar to, but more subtle than conventional execution? More specifically:

Micro-benchmarks
• speculation/branch-fwd (Figure 8a): an eor instruction (line 3) separated from two other eor instructions (lines 5 and 6) by a beq instruction (i.e., a conditional branch): this causes a forward (versus backward) change in control-flow, i.e., using a target which is at a higher (versus lower) address. • speculation/jump-fwd (Figure 8b): as branch-fwd, except with the beq instruction replaced with a b instruction (i.e., an unconditional branch, or jump) and the associated cmp instruction removed. • speculation/branch-bwd (Figure 8c): an eor instruction (line 11) separated from two other eor instructions (lines 13 and 14) by a beq instruction (i.e., a conditional branch): this causes a backward (versus forward) change in control-flow, i.e., using a target which is at a lower (versus higher) address. • speculation/jump-bwd (Figure 8d): as branch-bwd, except with the beq instruction replaced with a b instruction (i.e., an unconditional branch, or jump) and the associated cmp instruction removed. • speculation/loop-0 ( Figure 8e): a for loop structure in which an eor instruction (line 3: at the beginning of the loop body) separated from three other eor instructions (lines 8, 9, and 10: after the end of the loop terminates) by a beq instruction (i.e., a conditional branch): this causes a backward (versus forward) change in control-flow, i.e., using a target which is at a lower (versus higher) address.
Note that we carefully distinguish between conditional and unconditional, and forward and backward changes in control-flow. This allows us to deal with different static branch prediction strategies, e.g., predict taken for a conditional backward branch (which is indicative of a loop). Table 9 shows the results of our experiments. Each row represents a single device and shows which variables leaked, and how, for each speculation experiment listed in Figure 8. We identify this by using correlation analysis to search for the Hamming weight of operands, and the Hamming distance between consecutively accessed operands, and consecutive instruction results. In terms of the Hamming distance results (Table 9a), there are a number of devices which exhibit this effect. We believe that the MicroBlaze targets (MB X0 , MB X1 and MB X2 ) exhibit speculative leakage mainly because of the architectural branch delay slot. This is explained in the MicroBlaze Processor Reference Manual [Xil,Page 44]:

Results
When executing a taken branch with delay slot, only the fetch pipeline stage in MicroBlaze is flushed. The instruction in the decode stage (branch delay slot) is allowed to complete. This technique effectively reduces the branch penalty from two clock cycles to one.
Looking at this part of the manual, we see clearly that this behaviour is built into the micro-architecture. It explains why for the MicroBlaze targets, we clearly see Hamming weight leakage in the XOR instruction operands immediately following all taken controlflow changes and Hamming distance leakage between the operands, i.e., the result of the XOR.
We also see this behaviour in the ARM N2 , ARM S2 , ARM S3 , and ARM S4 devices. Here, we consistently see Hamming weight leakage from the operands of the instruction following a control-flow change, and the Hamming weight of the instruction result. This is a purely micro-architectural effect, since the ARM architecture has no notion of a branch delay slot. This result was not expected, given the shallow pipeline of the ARM cores. We suggest two possible reasons for seeing these effects in some cores and not others. First, there may be implementer visible design choices around configuring the core for high performance, or low power. Some of these choices many affect when and if different registers are cleared, or how multiplexer trees are implemented, all of which will affect the MPL. Second, it is possible that EDA tooling plays a part, and that these effects are simply unintentional side effects of synthesis and place and route tools optimising for various metrics other than side-channel resilience or energy efficiency. For example EDA tooling quirks might explain Hamming distance leakage in the ARM N1 device for the jump-bwd experiment, since no other ARM Cortex-M0 based core leaks this way.

Discussion
Ultimately, one may view the architectural and micro-architectural speculation shown here as essentially caused by the same thing: pipelined instruction execution. The key difference for a software developer being that it is only documented (however partially) for the MicroBlaze architecture. In the ARM cores, that MPL effects are visible for unexecuted instructions surprised us greatly, and to our knowledge, this is the first study to detail the effect.
For first-order masking schemes, clearly these effects represent significant dangers that software developers and MPL model builders need to be aware of. For higher order schemes, they may naturally be more difficult to exploit owing to the small number of share re-combinations they result in. That said, there is clearly potential for unexpected combinations of all of the effects detailed in this and other works. In any case, they must still be accurately modelled if the absence of leakage must be demonstrated or proved. We believe simply knowing these effects exist is useful, as they are often unexpected. Who would guess that unexecuted instructions after the end of a loop body could interact with the first instruction of the next loop iteration?
We belive this effect significantly complicates how formal and statistical leakage models for countermeasure development are built. Specifically, we are unaware of any models or tools which explictly handle partially executed instructions (i.e., those which appear in the shadow of a control-flow change). Failing to model these effects will lead to false negatives in terms of leakage detection by the model. This is of most concern to formal models, which must explicitly capture such interactions. Statistical models can capture these effects implicitly, but still require that these interactions are captured during the profiling step of the model building process. Our results show that the exact MPL behaviour of partially executed instructions varies substantially between different devices, meaning that accurate models must necessarily be device specific.
In terms of classification, we see these effects as program-non-adjacent, inter-instruction, sequential MPL. To help with further classification, we suggest using inter or intra-block (as in, basic block) as an additional distinguisher. Previous effects have all been intra-block, whereas this effect shows how leakage can occur across basic blocks. We can also tag this as speculative MPL, as opposed to non-speculative MPL, which is the case for all prior experiments.

Discussion
This section frames the findings in Section 5 at a higher level, and in a more directly usable form. We aim to give readers an intuition for MPL in general; we then discuss the implications of our work for two specific fields: 1) the design, implementation, and evaluation of masking schemes, and 2) construction of accurate fine-grained power leakage models.

A mental model of micro-architectural power leakage
We use a contrived CPU and memory sub-system Figure 9 block diagram, and describe the journey of instructions through the pipeline. At each stage, we note how instructions could be implemented in a reasonable micro-architecture, and link to examples of these options in our case study results.
Intuitively, we can investigate if the micro-architecture will cause share combinations by looking at convergence and divergence points in the data-flow of the system. Convergence points are where data from multiple places meet, and the micro-architecture selects one of the data items to progress. Divergence points are the opposite. Each can be further classified using the scheme developed in Section 3.4 as ISA, sequential or glitching-combinatorial.
We discuss types of MPL which can occur, some of which we have observed in this work, some of which are described in other works, and some of which are hypothetical. Our intention is to provide a starting point for engineers and researchers to identify and isolate causes of MPL in their own systems.

ALU-type instructions
An XOR (or similar) instruction might have two operand registers, or an operand register and an immediate. Decoding the immediate or register variant may cause glitches in the register address decode logic, causing unexpected toggling between registers which are not explicitly addressed by the instruction. This is described in [GHP + 20] and can be mitigated by structuring register file data ports as AND-OR trees, rather than multiplexer trees. The same effect can occur in the forwarding network, where results from uncommitted instructions are routed back to previous stages to prevent stalls. Hence the decode stage is a convergence point for instruction operands and results, which may be in-flight or at rest in the GPRs. The most common kind of MPL observed here is the sequential MPL, where instruction operands are loaded into pipeline registers P0 and P1, as seen in Section 5.4. Knowing which operands are loaded to which pipeline register for each instruction is essential for leakage modelling. Convergences can also be glitching-combinatorial, due to multiplexer switching, though in our experience, this is much harder to detect. One concrete example of this is the results in Section 5.1. Due to the nature of glitches, their occurrence can even be data dependant.  Figure 9: A contrived 5-stage, pipelined processor micro-architecture with architectural (blue), micro-architectural (red), and memory sub-system (yellow) components annotated as appropriate. Note that, we deliberately 1) exclude some components related to the control path, because they are usually irrelevant with respect to MPL, and 2) include some components related to the memory sub-system, despite normally being outside the micro-architectural remit, because they are highly relevant with respect to MPL. LSU: Load/Store Unit, ALU: Arithmetic Logic Unit.
In the execute stage, we can see combinatorial and glitching-combinatorial leakage depending on how the ALU is structured. For example, if different functions in the ALU are not gated, we might see Hamming Weight leakage for the result of an XOR, the result of and ADD, and the result of a shift. Barrel shifters can also undermine share-slicing techniques, as described in [GMPO19]. We may even see Hamming Distance leakage between these results, as they converge through a multiplexer tree to the P2 register. Loading of the P2 pipeline register will also cause sequential Hamming Distance leakage between the results of the current instruction, and the previous instruction which loaded P2. Note this is not always the previous instruction in program order.
Registers P2 and P3 force results of adjacent instructions to interact before being written to a GPR. This implies that adjacent instructions which share no architectural registers as operands or destinations will still interact from a leakage perspective. This is well documented in Section 5.4, [SSB + 20, ZBPF18,BP18,CGD18]. In most CPUs the results of subsequent ALU instructions do interact, for ARM Cortex-M0+ and PicoRV32 based systems, which have only 2 and 0 pipeline stages respectively, there is no such MPL.
Between P2 and P3, there is a convergence point for ALU instruction results and data loaded from memory. We observed this interaction between loaded data and ALU results in the 5 and 8 stage Xilinx MicroBlaze cores in Section 5.4, but not in the shorter pipelined cores. Finally, there is a divergence point where the result is written to the GPRs, where there will be Hamming Distance leakage between the old and new GPR values. This is the only form of architectural leakage which occurs.

Control-flow instructions
When entering the pipeline, jump instructions can change the program counter immediately, as they aren't dependant on any architectural state other than the program counter. Hence in many cases, it need not alter any of the data-carrying pipeline registers P0, P1, P2, and P3. We found in Section 5.5 that for all of the ARM cores, a jump instruction placed between two XOR instructions does not prevent Hamming distance leakage between some of the operands of the XOR instructions. This suggests that one of the pipeline registers is updated (likely with the branch target), but the other is left unchanged as an energy efficiency optimisation. We also found this was the case in the MB X2 and RV PRV based systems.
A programmer must consider if they rely on control-flow changes to act as implicit barriers between instructions to prevent share combinations, and whether that assumption holds for their system. This effect is also apparent in the loop speculation experiment in Section 5.6.
Branch instructions, being conditional on architectural state, tend to act as barriers between ALU-type instructions in our experiments in Section 5.5. In our hypothetical system, this would mean loading P0 and P1 with the operands to the comparison. Note this does not necessarily change the values of P2 and P3, meaning that we could potentially see Hamming distance leakage between the results of ALU instructions which are separated by a conditional branch. This is observed the case of the MicroBlaze based MB x0 and MB x1 systems.
On ARM systems, conditional branches are evaluated over two instructions: one does the comparison (cmp) and sets a condition flag, another tests a condition flag and performs the control flow change. Our experiments in Section 5.5 deliberately placed an ALU instruction between the cmp and branch instructions. It would be reasonable to reorder the cmp instruction to act as a barrier for the operands of the ALU instructions if the branch does not already provide this. However, we did not a have access to an ARM core with a longer pipeline (e.g., Cortex-M7), so we couldn't check if, as with the MicroBlaze cores, we also get Hamming distance leakage from pipeline registers which only store instruction results.
For branch and jump instructions, the actual control flow change may occur further down the pipeline. In our hypothetical system, branches feed the target address to the PC from the execute stage. This allows for subsequent instructions in the decode stage to speculatively read architectural state, either purely combinatorially, or sequentially into the operand registers P0 and P1. Indeed, we observed exactly this behaviour in Section 5.6, where the Hamming weight of operands were often detectable in unexecuted instructions. The potential for this is noted, but not investigated from a leakage perspective, in [CGD18]. In some cases, we even observed the Hamming weight of the instruction result, even in ARM cores with a short pipeline.
A different approach would be to prevent updates to P0 and P1 whenever a branch or jump is occurring in the execute stage. This is a recommendation supported by our results, since it results in fewer register updates (thus saving energy) and make reasoning about MPL much easier.

Load and store instructions
As in Section 5.1 and other works, memory accesses can have a large and counter-intuitive effect on MPL. Here, we focus on explanations for differences in behaviour shown in Table 2.
A load instruction with a base address operand register, an immediate offset and destination register enters the pipeline in the decode stage. It is subject to the same sources of MPL described for the ALU type instructions. Note that unless the address (either base or offset) is dependant on a secret value, a load instruction is unlikely to explicitly cause leakage in the decode stage of our hypothetical system.
In the execute stage, the load address is sent to the memory, which on the next cycle will load the M2 register with the requested data. The relevant bytes of M2 are then selected (load word/half/byte) and converge with ALU results in the P3 register, before being written back to the GPRs. Memory load data could have been put into P2, and observations in Table 2 show that this is sometimes the case. Note that the same CPU core, implemented by the same vendor in different devices, can exhibit different behaviours here. We believe that some cores have a separate load-data register as part of the memory bus for performance or timing reasons, while some merge this with a pipeline register to reduce area. This design choice has implications for leakage behaviour. If the M2 and P2 are separate, then loads which are potentially many instructions apart can interact. If they are represented by the same physical register, then ALU type instructions can act as a barrier between load data instructions.
As with ALU instructions, the existence of P3 in our hypothetical system can mean that load instructions which don't have the same architectural destination register can still leak the Hamming Distance between their loaded values, and the result of an adjacent ALU instruction. This is because the two streams of data converge before reaching the architectural register file. We see this in many devices even with short pipelines, in Table 6. For short pipelines, where P3 would not exist, this suggests that loaded data is actually routed to P2 in our hypothetical system.
The bytes select block, between M2 and P3 is the reason we see the kind of leakage patterns shown in Figure 5. Here, 1, 2 or 4 contiguous bytes of data are selected from the loaded 4-byte memory word. This multiplexing can cause potentially all of the bytes to leak their Hamming distance. This sort of MPL is expected in systems using a word-orientated memory interconnect, such as AHB-Lite or AXI-Lite. This also explains the effect noted in past works, where requesting a byte from memory can actually load the entire word which contains the byte, leading to unexpected MPL. Our results in table Table 3 show that this effect is common across most of the devices studies here.
For store instructions, the potential for MPL exists in the decode stage, as it must read an extra architectural register as the data to be stored. Knowing which pipeline register (P0 or P1 in our system) is overwritten with the data-to-be-stored, rather than the address, is important to avoid share collisions or the need to add barrier instructions which reduce performance. This need for barrier instructions to flush shared state is detailed in [SSB + 20]. Our results in Table 2 repeat similar experiments for a range of cores, showing how diverse the actual behaviour of cores can be in terms of memory hierarchy. A programmer who implemented all of the [SSB + 20] countermeasures on a core which does not need them would have an unnecessarily inefficient program.
In the execute stage of our hypothetical system, the data to be stored and the address must progress to the final stage of the pipeline before being sent to memory. Any earlier than this would cause incorrect program behaviour, as the store could be seen to have occurred before the instruction retires. In our design, the store data travels down a separate pipeline (a dedicated store buffer, M1 and M3) rather than the normal ALU pipeline. This separation or sharing of load, store and ALU data pipelines is a design trade-off between performance and area, and we believe explains why in Table 2, adjacently loaded and stored data interact in some cases but not others.

Implications for evaluation of masking schemes
Differences in MPL behaviour between devices have a considerable impact on the complexity, performance and code size of leakage resistant code. For example, the ARM S2 and ARM S4 devices do not cause collisions between consecutively loaded and stored values (see Table 2), while other similar devices do. Adding "flush" instructions for more leaky devices adds a significant performance and energy efficiency penalty, which is unnecessary on the nonleaking devices. Likewise, code which is secure on the less leaky devices may not be secure on others.
We did not survey the literature to discover which devices were used to derive overhead claims for different masking schemes and implementations, but suspect our results imply considerable margin for variance across devices. If such devices are chosen as benchmarking platforms for standardisation processes, it may be impossible to get a clear picture of the overhead required to implement certain masking schemes due to the underlying micro-architecture of the device.
As a single example, in [BDM + 20], the authors apply their tool to several NIST Lightweight Cryptography candidates, using an ARM Cortex-M4 based Nucleo STM32F401RE device. Our experiments clearly show that one ARM Cortex-M4 does not leak like another, posing the question of how portable the code generated by TORNADO or similar tools is. This is especially important, given its reliance on GCC to perform register assignment and low level instruction ordering, which may leave it open to some of the effects described in this and other works. Indeed, [SSG17] shows how generated bit-sliced code can fail to be secure on real-world targets despite formal guarantees as design time.
We suggest evaluations of software masked algorithms be evaluated on at least two devices which have different MPL characteristics. This will enable researchers to separate the overhead from a masking scheme, and the overhead from a particular micro-architecture. This has particular relevance for ongoing standardisation processes, such as the NIST Lightweight Cryptography project. We hope this work can inform the choice of devices.
Our results suggest tools which generate side-channel secure code will benefit considerably from being more micro-architecture-aware. Indeed, we believe they must be micro-architecture-aware in many cases in order to avoid generating leaking code. However, knowing which mitigations they must apply, or when they can omit them because a particular micro-architectural effect is not present in the target device, will likely lead to considerable performance and security improvements. Adding more shares to a masking scheme may remove the risk of the interactions seen in this and prior works, but this is unsatisfactory where performance and code size are an issue. We still want to be able to use the minimal number of shares possible for a given security requirement.
Finally, while the portability of side-channel resilient code has always been dubious, our results emphasise just how challenging efficient and portable side-channel resilient code is to build.

Implications for power leakage modelling
Capturing types of leakage. Using our classification of MPL variants in Section 3.4 and based on past work, we know both formal and statistical models cope well with capturing architectural leakage. Clearly architectural information alone can often be enough for developing an attack, and micro-architectural modelling or knowledge may add information to increase the effectiveness or efficiency of an attack.
However, experiments in past works which we replicate and extend here, show that only considering architectural sources of leakage is woefully insufficient for developing leakage models with the aim of showing the absence of leakage. That is, leakage models which only utilise architectural information about devices cannot capture enough detail for effective countermeasure development or real world proofs of security. Considering sequential MPL modelling, we believe there's no inherent difficulty in capturing this extra information for formal or statistical approaches, but the relative merits of each approach quickly become apparent. Formal models must have perfect information about the target device micro-architecture. Based on Section 5.6, we also know a trace of retired instructions is not enough, one must also model the pipeline of the target device explicitly. This is considerably more effort to build for a formal model than a statistical model, since a statistical model like ELMO will capture all of this information as a side effect of the initial device profiling with comparatively little white-box information about the micro-architecture. Formal models however must have access to this information, requiring either an NDA with the device manufacturer, or extensive reverse engineering of the kind we describe in this work. [BGG + 20] is a good example of this, where the quality of the security guarantee is heavily dependant on the user-supplied micro-architectural model of the target device. We believe the kinds of leakage-benchmarks developed in this work can play an important role in validating and building formal models of the kind described in [BGG + 20].
Once we move to combinatorial and glitching combinatorial leakage however, we see formal modelling approaches to software masking begin to struggle, since the number of variables and possible interactions explodes, even in the presence of perfect information about the system. Separating methodology, models, and devices. Given the wildly different leakage behaviours even for devices with the same CPU core, we believe the community must focus on building replicable methodologies for MPL modelling, rather than very accurate leakage models for particular devices.
For formal models in particular, a machine readable description of micro-architectural state elements, and when they are updated by instructions, will be an essential part of their methods. This allows for manufacturers to re-use the model building tool without sharing sensitive details about their device. In this respect, we believe the Domain Specific Language (DSL) based approach to describing device micro-architectures in [BGG + 20] is very promising. Indeed, the authors even demonstrate the necessity of their tool being micro-architecture aware by showing how a secure implementation in one ARM Cortex-M0+ based device is not secure in a ARM Cortex-M4 based device, which is well supported by our empirical results.
Temporal locality. In Section 5.1, Section 5.2 and Section 5.3, we saw how loads and stores in many devices are the only instructions to interact with certain micro-architectural state elements. As discussed in [SSB + 20], this is a problem for statistical models which only consider N adjacent instructions in a sliding window as contributing to leakage at any one moment. If a load instruction outside the current window altered a piece of state, then this will be missed if there are greater than N instructions before the next load.
One solution might be hybrid models. Specifically, ALU style instructions (including their interactions with memory accesses, as documented in Section 5.4) could be modelled statistically. This allows the most complex part of the target to be treated as a black-box, and inevitably captures combinatorial and glitching leakage inherently with the statistical profiling step.
For temporally distant interactions, which affect pieces of micro-architectural state which no other instructions interact with (e.g., loads and stores) formal modelling (particularly as described in [BGG + 20]) is a good approach. One would need only to model ISA state, architectural data values and a small subset of micro-architectural state elements -perhaps only two additional registers according to our results. The update rules for these pieces of micro-architectural state are easier to reverse engineer, as only a small number (of the order of 6 or 12, depending on the addressing modes supported by the ISA) of instructions alter them.
Benchmarking for model quality. Anyone building, buying or using a leakage model must verify its accuracy, which historically has been hard to do outside the results of a paper. We believe the set of leakage micro-benchmarks in this paper provide an important way for developers to gain confidence a model which they did not personally develop. By isolating critical corner case behaviours into individual experiments, we can realise something similar to unit testing for leakage models. By publishing all of our micro-benchmarks and the analysis flow, we hope they can be used to augment the development of new models.
We recognise it may always be possible to identify pathological cases of MPL which are not identified by leakage models, and these tools should focus on cases which make them the most useful to software engineers.

Conclusion
Summary. We have presented and used MIRACLE to explore MPL stemming from 32-bit micro-controllers spanning a range of different architectures (or ISAs), micro-architectures, and vendors. Although we expect the infrastructure presented in Section 4 to be more generally useful in the long term, the case studies in Section 5 already provide evidence that supports some specific conclusions: 1. Some assumptions made, e.g., to support a proof of security for a given countermeasure design or implementation, can be invalid: • The assumption that only computation leaks can be invalid. For example, Section 5.5 demonstrates that even instructions which are not retired, and thus not executed from an architectural perspective, can cause MPL as a result of speculation: the attack surface enabled by transient execution, which has been so well explored from the perspective of discrete side-channels (see, e.g., [CBS + 19]), also extends analogue side-channels. • The assumption that instructions leak independently can be invalid. For example, Section 5.4 replicates known leakage effects stemming from adjacent instructions; Section 5.1 extends this, proposing some more formal notions of adjacency, and demonstrates that even architecturally independent instructions can interact under some circumstances.
2. Given the above, claims regarding the portability of security properties between different devices can be invalid. For example, Section 5.4 demonstrates differences in efficacy of ad hoc leakage barriers based on nop instructions; this was also noted in [GMPP20, Section 4.1] and [CGD18]. A corollary of this fact is that it is insufficient to make a security claim based simply on an architecture (e.g., "ARMv7-M") or even a core (e.g., "ARM Cortex-M3"): throughout Section 5, we have demonstrated that it one must quote and evaluate a given instruction sequence on a specific device (e.g., "STM32F100RBT6B"). This fact is reinforced by Arora et al. [ABPP21], who conclude that MPL can even differ between instances of the same device. In our opinion, there are (at least) two resulting implications: 1) authors should adopt this approach in writing, alongside publication of associated artefacts to support reproducibility, and 2) care must be taken when selecting standard evaluation platforms, e.g., to support standardisation.
3. Protecting IP is clearly imperative in commercial contexts, and, as a result, many proprietary micro-architectures lack definitive, public documentation. MIRACLE is specifically design to cater for this by employing a grey-versus white-box approach. However, access to such documentation clearly allows more precise analysis with respect to MPL; taken to an extreme where the HDL is available, approaches such as that of Gigerl [GHP + 20] become viable. In our opinion, this trade-off motivates a shift in attitude. If vendors of security-related IP maximise access to pertinent details of the micro-architectural design while maintaining appropriate level of IP protection, this will give a higher level of assurance with respect to security; modulo a limit on precision, MIRACLE demonstrates such details can often be reverse engineered anyway. This fact is reinforced by Gao et al. [GOP21] who perform such reverse engineering for an ARM Cortex-M3.
Future work. We believe a more formal and quantitative way of reasoning about the types of leakage described in Section 3, and how they contribute to leakage models would be of clear benefit. This includes insights into when certain kinds of leakage may be safely ignored. We also believe investigation into building hybrid device leakage models, as described in Section 6, is a pragmatic way to get the best of both worlds from different modelling approaches. Finally, we have only investigated micro-architecture effects in terms of documenting their existence: further research into effective countermeasures (as described in [SSB + 20]) and exploitation is needed.