DAPA: Differential Analysis aided Power Attack on (Non-)Linear Feedback Shift Registers (Extended version)

Differential power analysis (DPA) is a form of side-channel analysis (SCA) that performs statistical analysis on the power traces of cryptographic computations. DPA is applicable to many cryptographic primitives, including block ciphers, stream ciphers and even hash-based message authentication codes (HMAC). At COSADE 2017, Dobraunig et al. presented a DPA on the fresh re-keying scheme Keymill to extract the bit relations of neighbouring bits in its shift registers, reducing the internal state guessing space from 128 to 4 bits. In this work, we generalise their methodology and combine it with differential analysis, which we call differential analysis aided power attack (DAPA), to uncover more bit relations and to take into account the linear or non-linear functions that feed back into the shift registers (i.e. LFSRs or NLFSRs). Next, we apply our DAPA to LR-Keymill, the improved version of Keymill designed to resist the aforementioned DPA, and break its 67.9-bit security claim with a 4-bit internal state guess. We experimentally verified our analysis. In addition, we improve the previous DPA on Keymill by halving the amount of data resources needed for the attack. We also apply our DAPA to Trivium, a hardware-oriented stream cipher from the eSTREAM portfolio, and reduce the key guessing space from 80 to 14 bits.


Introduction
There are two major families of cryptanalysis: mathematical attacks and physical attacks, the latter including SCA and fault attacks. Mathematical attacks study the structure of a cryptographic primitive to find exploitable mathematical properties and use them to recover sensitive information from the primitive, for example differential cryptanalysis [BS90] and linear cryptanalysis [Mat93]. Physical attacks, on the other hand, study the hardware or software implementation of a primitive and tackle it through physical means, for example by observing the timing of the computation [Koc96] (timing attack) or its power consumption [KJJ99] (power analysis), or by injecting faults into the implementation [BS97] (fault attack).
Resource-constrained or low-cost devices such as Radio-Frequency IDentification (RFID) tags, wireless sensor nodes and smart cards are in ever-increasing demand and use in this information era. These devices may operate in hostile environments and are especially susceptible to SCA, in particular differential power analysis [CLK+03, MAK15].
First proposed by Kocher et al. [KJJ99] in 1999, DPA involves statistical analysis of the power traces of cryptographic computations obtained using devices like oscilloscopes. It has been used to target cryptographic algorithms that handle sensitive information, including block ciphers [ÖGOP04, SSA14], stream ciphers [FGKV07, QGGL13] and even hash-based message authentication codes (HMAC) [BBD+13], and has proven to be practical with a high success rate, thus posing a serious threat to embedded implementations of cryptographic primitives.
DPA typically involves power modelling and key hypotheses to recover secret information, for instance the DPA on linear feedback shift register (LFSR) based stream ciphers [QGGL13, FGKV07]. In 2017, Dobraunig et al. [DEKM17] showed that DPA can also be used on shift registers to extract the bit relations of neighbouring bits, allowing an attacker to significantly reduce the internal state guessing space. More specifically, with the knowledge that two neighbouring bits have the same or different values, guessing the value of one of them determines the value of the other. In other words, every known bit relation reduces the entropy by 1 bit. Inspired by their work, we generalise their analysis methodology and combine it with differential analysis, which we call Differential Analysis aided Power Attack (DAPA), to uncover more bit relations from a shift register while also taking into account the linear or non-linear feedback function.
In the rest of the paper, we use power attack and power analysis interchangeably. Moreover, as shown in our experiments, the leakage can also be captured through the electromagnetic channel. For simplicity, we still refer to it as a power attack, as the exploited leakage arises from power consumption activity.

Related Work
The vulnerability of LFSRs to side-channel attacks was first reported by Burman et al. [BMV07]. They exploited the fact that the power consumption difference between two consecutive clock cycles, which is observable by simple power analysis (SPA), reveals information about the LFSR. They used this vulnerability to recover the secret-key-dependent internal state, and a few countermeasures were later proposed [BPMV16]. This attack was further extended to NLFSRs by Zadeh et al. [ZH14], by exploiting the relations between neighbouring bits in the internal state using simple power analysis. Chakraborty et al. [CMM14] studied the susceptibility of the Galois and Fibonacci constructions of NLFSRs to power attacks and showed that Galois NLFSRs are more vulnerable. These observations were then further extended to attack the GRAINv1 cipher [CMM17], which was an improvement over a chosen-IV power attack on GRAINv1 by Fischer et al. [FGKV07]. Compared to previous approaches, the power attack in [CMM17] was enhanced by machine learning based classifiers. NLFSRs were previously exploited in real-world attacks on the KEELOQ code hopping scheme [EKM+08], widely used for access control purposes such as garage openers or car door systems.
In 2017, Dobraunig et al. [DEKM17] proposed an attack on shift registers that inserts controlled differences into the IV and observes the power differences of the shift register update when the difference is introduced. This attack was then extended into a practical attack on Keymill, where the NLFSRs are treated as black boxes with the small assumption that the newest bit position is not one of the feedback bits.
The main difference between the SPA on shift registers, like [BMV07, ZH14], and the DPA on shift registers by [DEKM17] is the number of consecutive power differences to be collected. The former requires a collection of n consecutive power differences, where n is the length of the shift register, to recover the entire internal state in one shot. A single missing or erroneous power difference could cause the attack to fail. On the other hand, [DEKM17] introduces a difference into the internal state to recover one pair of neighbouring bit relations at a time. This removes the need for a (relatively) long period of high-precision measurements. In addition, even if full internal state recovery is unsuccessful, the guess space for the entire internal state is significantly reduced by the partial bit relations obtained. In this work, we take a step further by incorporating differential patterns into the power analysis.
To the best of our knowledge, none of the aforementioned attacks are applicable to LR-Keymill because multiple NLFSRs are updated in parallel, while our DAPA practically breaks LR-Keymill.

Our Contributions
In this work, our main results are summarised as follows:
• We extend the [DEKM17] observation beyond analysing the first clock cycle when the difference is injected, and we present a complete analysis of the power consumption changes of shift registers for various cases.
• We propose DAPA on (N)LFSRs, taking into account the feedback function.
• We present a DAPA on LR-Keymill, an improved version of Keymill designed to resist the [DEKM17] attack, breaking its 67.9-bit side-channel security claim with 4-bit internal state guessing.
• We reduce the attack complexity on Keymill by halving the amount of data resources needed to perform the key-recovery attack.
• We conduct experiments and verify our analysis on LR-Keymill.
• We present a DAPA on the lightweight stream cipher Trivium, recovering the 80-bit key with just 14-bit key guessing.

Structure of this paper
We describe the generic analysis on (N)LFSRs in Section 2, followed by some toy examples to illustrate our attack strategy in Section 3. Next, we give the specification of LR-Keymill and Keymill in Section 4, present the DAPA on LR-Keymill and Keymill in Section 5, and present our experimental results in Section 5.5. Lastly, we apply our DAPA on Trivium in Section 6 and conclude our work in Section 7.

Preliminary
In [ZH14], Zadeh and Heys exploited the well-known fact that, at the rising edge of a clock, a D flip-flop consumes more power when there is a state change, either 0 → 1 or 1 → 0. In a nutshell, they analysed a D flip-flop constructed from 6 NAND gates and showed that 3 of the gates change when the D flip-flop changes its state, compared to 1 gate when its state does not change. By the nature of a shift register, say a left-shift, the state of a register bit (current bit value) is updated to the state of the register bit on its right (succeeding bit value) at the rising edge of a clock. In other words, the power consumption of a register bit in a shift register depends on the current and succeeding bit values. More precisely, if the succeeding bit value is the same as the current bit value, the register bit consumes less power than when the values differ and it has to change its state.
As there are many other activities happening concurrently with the updating of a register bit at the rising edge of a clock, it is difficult to identify and isolate power consumption of a particular register bit to uncover the relation between the current and succeeding bit values from a single power trace. However, if we can introduce a bit difference to that targeted register bit while keeping all other computations constant, we can gain information about some bit relations by comparing the power consumption differences between the original computation and the instance with the bit difference.
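The leakage model described above can be sketched in a few lines of Python. This is a minimal sketch under the assumption that the dynamic power at a clock edge is proportional to the number of register bits that toggle; the function names `shift_left` and `toggles` are illustrative and do not come from the original work.

```python
def shift_left(state, new_bit):
    """Left-shift: drop the leftmost bit and append new_bit on the right."""
    return state[1:] + [new_bit]


def toggles(state, new_bit):
    """Number of register bits that change value at the clock edge.

    Assumption: each toggling flip-flop adds one unit of power, so this
    count stands in for the power consumed by the shift register.
    """
    nxt = shift_left(state, new_bit)
    return sum(a ^ b for a, b in zip(state, nxt))


# A register bit toggles exactly when it differs from its right neighbour;
# the incoming bit acts as the neighbour of the last position.
state = [1, 0, 0, 1]
print(toggles(state, 1))  # pairs (1,0), (0,0), (0,1), (1,1) -> 2 toggles
```

A single trace only exposes the total count, which is why the analysis compares two runs that differ in a chosen bit.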

Power Consumption Differences and Bit Relations
Shift registers are often part of a linear feedback shift register (LFSR) or non-linear feedback shift register (NLFSR). We will address the feedback function in Section 2.3. For the moment, let us focus only on the shift registers.
Let [x]y denote a register bit of interest, written in square brackets, with bit value x, where y is the succeeding bit value. A bar symbol, as in $\bar{x}$, denotes a difference, which is simply a flip of the bit value.
We define the power consumption difference as the power trace with some differences minus the original power trace. If a register bit has an increase in power consumption, we denote it as +1; if it has a decrease, −1; and if there is no change, 0. In practice, the power trace is the summation of the power consumption of all the register bits. Hence, we can apply simple arithmetic to compute the combined power consumption difference. On the other hand, if $x \neq y$, then $\bar{x} \neq \bar{y}$ as well, and both instances consume more power to change state. Thus, regardless of the relation between x and y, both instances have the same power consumption.
• For both $x = y$ and $x \neq y$, there is no change in power consumption. Difference: 0.
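The single-bit cases can be checked by exhaustive enumeration. The sketch below assumes a unit-power model (a register bit costs 1 when it toggles and 0 otherwise); `bit_power` and `diff_single` are hypothetical helper names, and `dx`, `dy` mark whether the current or succeeding bit carries a difference.

```python
from itertools import product


def bit_power(cur, succ):
    """Unit power of one register bit: 1 if it must toggle, else 0 (assumed model)."""
    return cur ^ succ


def diff_single(x, y, dx, dy):
    """Power with the difference minus power without it, for register bit [x]y."""
    return bit_power(x ^ dx, y ^ dy) - bit_power(x, y)


# Enumerate: difference in x only, in y only, and in both.
for dx, dy in [(1, 0), (0, 1), (1, 1)]:
    for x, y in product([0, 1], repeat=2):
        print(f"dx={dx} dy={dy} x={x} y={y} -> {diff_single(x, y, dx, dy):+d}")
```

When exactly one of the two bits is flipped, the sign is +1 for x = y and −1 for x ≠ y; when both are flipped, the result is always 0.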

Power consumption difference of multiple register bits
Using Case 1 as the building block, we look at the combined power consumption of multiple register bits in Table 1. The middle two columns are the sub-cases considering individual register bits, and the right column denotes the obtained relation and the expected power consumption difference.
From Table 1, we make the following observations.
Observation 2.1: The change in power consumption in Case 2.1 is always a multiple of 2, i.e. one of {−2, 0, 2}.
Observation 2.2: Observing power level +2 or −2 gives a clear indication of the relations of (x, y) and (y, z). When the observed power level is 0, there is an ambiguity; nevertheless, knowing one of the relations determines the relation of the other pair to be the opposite.
Observation 2.3: As seen in Case 3, if there are consecutive register bits with a difference, the intermediate register bits do not contribute to the power consumption difference, and the analysis can be reduced to the leading and ending register bits.
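These observations can be reproduced by brute force. The following is a sketch that assumes Case 2.1 refers to a single active bit y between two inactive neighbours x and z, and that power is counted as the number of toggling bits; `window_diff` is an illustrative helper, not part of the original analysis.

```python
from itertools import product


def window_diff(bits, delta):
    """Power difference summed over all neighbouring pairs of `bits`.

    bits[i] is updated to bits[i + 1] at the clock edge; delta[i] == 1
    marks a bit that is flipped in the second run.
    """
    flipped = [b ^ d for b, d in zip(bits, delta)]

    def hd(s):
        return sum(a ^ b for a, b in zip(s, s[1:]))

    return hd(flipped) - hd(bits)


# Observations 2.1 and 2.2: one active bit y between inactive x and z.
levels = {(x, y, z): window_diff([x, y, z], [0, 1, 0])
          for x, y, z in product([0, 1], repeat=3)}
print(sorted(set(levels.values())))  # [-2, 0, 2]

# Observation 2.3: a run of active bits only matters at its two ends.
for x, a, b, c, z in product([0, 1], repeat=5):
    run = window_diff([x, a, b, c, z], [0, 1, 1, 1, 0])
    ends = window_diff([x, a], [0, 1]) + window_diff([c, z], [1, 0])
    assert run == ends
```

In this model, +2 occurs only when x = y = z and −2 only when y differs from both neighbours, which is why those two levels are unambiguous.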

Summary Table for Power Consumption Differences and Bit Relations
We define a power consumption difference as the power trace with the injected difference minus the original power trace. Table 2 summarises the necessary and sufficient conditions relating the power consumption differences to the deduced bit relations.
Although a bit relation does not reveal the actual value of the related bits, it reduces the guessing space by 1 bit because guessing the value for one register bit determines the value of the related bit too.
Observation of the power consumption differences. A natural question is whether such power consumption differences {−2, −1, 0, +1, +2} (which we refer to as 5-class differences in the following) can be observed in practice. Observing multi-class differences has been practically demonstrated in previous works. For instance, Saha et al. [SJB+18] were able to report 5- and 9-class differences in a different context to break fault-hardened implementations of PRESENT and AES respectively on an 8-bit microcontroller. Our analysis typically only requires observing a rise, no change or a drop in the power consumption difference, as seen in Section 3. Intuitively, this might seem easier than what is reported in [SJB+18], where the magnitude of the change was required for the analysis. However, [SJB+18] does not need to know the sign (or polarity) of the change, which is crucial for the success of our attack. In Section 5.5, we conduct a practical analysis on a 32-bit ARM Cortex-M3 microcontroller to confirm our hypothesis.

On (non-)linear feedback functions
When targeting (N)LFSRs, we need to consider the actual specification of the feedback function to determine how the difference propagates. A feedback function typically consists of 6 basic binary operations: AND ($\wedge$), NAND ($\overline{\wedge}$), OR ($\vee$), NOR ($\overline{\vee}$), XOR ($\oplus$) and NXOR ($\overline{\oplus}$). The truth table and differential table of these operations are listed in Table 3.
Linear feedback functions.
The linear operations (XOR and NXOR) are rather simple: we can trace the differential trail trivially and know that it holds with probability 1. Hence, observing a rise, no change or a drop in the power consumption difference directly reveals some bit relation, or reaffirms our knowledge about some known bit relation and the differential propagation.

Non-linear feedback functions.
For the non-linear operations (AND, NAND, OR and NOR), the difference propagates half of the time. Despite that, using DPA we can tell whether the difference has propagated, which in turn reveals some information about the internal state. Consider [x]y, where the state of y is uncertain (it has a difference with probability one half). As seen in Case 1 of Section 2.2, if the current bit x has no difference, then we expect a rise or drop in the power consumption difference if y has a difference; if there is no change in the power consumption, we know that y has no difference either. On the other hand, if x has a difference, then observing no change in the power consumption indicates that y has a difference too; otherwise, we observe a rise or drop in the power consumption difference and we know that y has no difference.
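The propagation behaviour of the six basic operations can be tabulated directly. The sketch below is an illustration that computes, for a given input difference, the fraction of input values for which the output flips; the dictionary `GATES` and the function `propagation_rate` are names introduced here, and the output is not a reproduction of Table 3.

```python
from itertools import product

GATES = {
    "AND":  lambda a, b: a & b,
    "NAND": lambda a, b: 1 - (a & b),
    "OR":   lambda a, b: a | b,
    "NOR":  lambda a, b: 1 - (a | b),
    "XOR":  lambda a, b: a ^ b,
    "NXOR": lambda a, b: 1 - (a ^ b),
}


def propagation_rate(gate, da, db):
    """Fraction of inputs (a, b) for which input difference (da, db) flips the output."""
    hits = sum(gate(a, b) != gate(a ^ da, b ^ db)
               for a, b in product([0, 1], repeat=2))
    return hits / 4


for name, gate in GATES.items():
    rates = {d: propagation_rate(gate, *d) for d in [(1, 0), (0, 1), (1, 1)]}
    print(name, rates)
```

For the linear gates the rate is always 0 or 1, whereas for the non-linear gates every non-zero input difference propagates for exactly half of the input values.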
In addition to revealing whether y has a difference, this can also reveal some information about the values of other bits. For example, let $y = (x_0 \wedge x_2) \oplus x_1 \oplus x_2$ and suppose we know that only $x_2$ has a difference. Because of the non-linear AND operation, we are uncertain whether y has a difference. Flipping $x_2$ changes y by $x_0 \oplus 1$, so y has a difference if and only if $x_0 = 0$. Hence, if the differential power analysis shows that y has a difference, we learn that $x_0 = 0$; otherwise, we know that $x_0 = 1$.
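The claim about $x_0$ in this example can be confirmed by enumerating all inputs. This is a small check written for illustration; `y_bit` and `y_has_difference` are hypothetical names.

```python
from itertools import product


def y_bit(x0, x1, x2):
    """The example feedback bit y = (x0 AND x2) XOR x1 XOR x2."""
    return (x0 & x2) ^ x1 ^ x2


def y_has_difference(x0, x1, x2):
    """True if flipping only x2 flips y."""
    return y_bit(x0, x1, x2) != y_bit(x0, x1, x2 ^ 1)


# y carries a difference exactly when x0 == 0, whatever x1 and x2 are.
for x0, x1, x2 in product([0, 1], repeat=3):
    assert y_has_difference(x0, x1, x2) == (x0 == 0)
```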

High-level Description of DAPA on (N)LFSRs
Our attack methodology can be broken down into the following three steps:
1. Determine the differential patterns
2. Perform the power measurements
3. Recover the internal state
Step 1 (Offline): Determine the differential patterns. In this preparation phase, the goal is to choose a differential pattern that we want to have in the shift register. Although the choice is highly dependent on the target algorithm, there is a general strategy.
One obvious entry point is the IV, which essentially every (N)LFSR-based algorithm should have. The main idea is to introduce some difference in the IV and analyse how it propagates throughout the internal state.
Notice from Table 2 that we can deduce a bit relation when exactly one of two consecutive bits has a difference (the so-called active bit) and the other is constant (the inactive bit). To reduce the number of instances to run and measure, we can choose to introduce differences such that the differential pattern alternates between active and inactive bits as much as possible.
As seen in Table 3, non-linear operations can cause ambiguity in the differential pattern. Fortunately, observing the power consumption difference gives a clear distinction between having a difference or not. This is explained further in Step 3.
This step can take up a significant amount of time, as the attack complexity depends heavily on the selected differential patterns. Generally, there is no need to find optimal differential patterns: so long as the execution, say introducing the different IVs, is feasible and the attack complexity is practical, we can move on to the next step.
Step 2 (Online): Perform the power measurements. This is the only online phase of the attack and is rather straightforward: collect the power traces of the various computations and take their differences to obtain the power consumption differences.
Step 3 (Offline): Recover the internal state. In this step, we try to gather as many pieces of bit relations to link the internal state bits together. From the rise, drop or no change in power consumption (collected in Step 2) of the differential patterns (determined in Step 1), we can deduce the bit relations.
When non-linear operations are involved, our first goal is to determine if that bit has a difference. Recall that if both the consecutive bits have difference or both are constants, there is no change in the power consumption, otherwise there is a rise or drop in power consumption. Thus, depending on the preceding bit and the power consumption behaviour, we can deduce whether there is a difference. In addition, as explained in the previous section, it could also reveal some information of the actual value of some bits.
Finally, after gathering as many bit relations as we can, we enumerate the possible values for the leading bit in each chain of bit relations; the other bits within the chain are then determined by the bit relations. The true internal state will be one of these candidates.
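The final enumeration can be sketched as follows. This is a brute-force illustration suitable only for small registers; the function `candidates` and the encoding of relations as `{(i, j): r}` (with `r = 0` for equal and `r = 1` for different) are assumptions made for this example.

```python
from itertools import product


def candidates(n, relations):
    """All n-bit states consistent with the given bit relations.

    relations maps a pair of bit positions (i, j) to 0 if the two bits
    are equal and to 1 if they differ.
    """
    return [s for s in product([0, 1], repeat=n)
            if all(s[i] ^ s[j] == r for (i, j), r in relations.items())]


# Hypothetical 6-bit example: two chains of relations, (0-1) and (2-3-4-5).
relations = {(0, 1): 0, (2, 3): 0, (3, 4): 1, (4, 5): 1}
states = candidates(6, relations)
print(len(states))  # 4 candidates instead of 2**6 = 64
```

Each independent relation halves the candidate set, so 4 relations on 6 bits leave 2 free bits, one per chain. For realistic register sizes one would instead guess the leading bit of each chain and fill in the rest.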
On overcoming algorithmic noise: The (dynamic) power consumption of an (N)LFSR-based algorithm is typically influenced by the simultaneous toggling of various components. A conventional remedy is to collect more traces to filter the noise. In addition to that, our DAPA methodology has two advantages in overcoming the algorithmic noise.
Firstly, we can allocate more resources to noise filtering. This is possible because our attack could reduce the number of chosen IV or nonce to launch an attack, and effectively the number of different power traces, needed to recover the same amount of bit relations (see Section 5.4). To give a numerical example, instead of collecting 100 traces for noise filtering for each of the 10 differential patterns (a total of 1000 traces), we could collect 200 traces for noise filtering for each of the 5 differential patterns (still a total of 1000 traces) and get the same set of bit relations.
Secondly, it is possible to choose different differential patterns that recover the same bit relation. This gives us alternative ways to recover the bit relation, or to confirm that our deduction is correct, should one of the attempts be inconclusive.
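The trace-budget example in the first point can be quantified under a simple noise model. The sketch below assumes independent, zero-mean noise of standard deviation `sigma` on each trace, in which case averaging n traces divides the noise by the square root of n; this model and the function name `noise_after_averaging` are assumptions, not measurements from the experiments.

```python
import math


def noise_after_averaging(sigma, traces_per_pattern):
    """Std. dev. of the averaged trace, assuming independent zero-mean noise."""
    return sigma / math.sqrt(traces_per_pattern)


budget = 1000  # total number of traces, as in the numerical example
for patterns in (10, 5):
    per_pattern = budget // patterns
    residual = noise_after_averaging(1.0, per_pattern)
    print(f"{patterns} patterns x {per_pattern} traces: residual noise {residual:.4f}")
```

Under this assumption, halving the number of differential patterns at a fixed budget lowers the residual noise per pattern by a factor of about 1.41.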

Toy Shift Register
For the moment, let us omit the details of the feedback function and assume that we know exactly when a difference is introduced into the shift register.
We use a simple toy example to illustrate how we can recover the bit relations. Suppose we have a 6-bit shift register containing values $c_i$, and let $x_j$ be the incoming bits in the next 5 clock cycles. In another instance, there are bit differences in the incoming bits $x_0$, $x_1$ and $x_3$.
After executing both computations and collecting their power traces, we can compare the power traces and deduce 4 bit relations, as seen in Table 4.
Table 4: Toy shift register example: Power consumption difference and bit relations obtained. Second and third columns are the register state of the original and with some difference, "Orig. dist." and "∆ dist." indicate the Hamming distance between the previous and current state, "Power diff." indicates the numerical power consumption differences, "Rise/Drop" is the observation of the power consumption difference at the rising edge of the clock, and the last column is the bit relation obtained.
Suppose that the attacker's goal is to recover the internal state at any clock cycle. The values of $c_i$ and $x_j$ are unknown to the attacker, but he knows the differential positions in $x_j$. From there, he is able to deduce the following relations c 5 = x 0 , x 1 = x 2 = x 3 = x 4 , and guess the shift register state at clock cycle 5 as one of the following:
To summarise, if the attacker is able to obtain noiseless measurements for these 2 computation instances, he is able to reduce the guessing complexity from the naive $2^6 = 64$ to just 4 guesses (third combination is the correct internal state).
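The toy attack can be simulated end to end. The sketch below assumes the toggle-count power model and uses hypothetical values for $c_i$ and $x_j$ (they are not the values of Table 4); `power_diffs` plays the role of the measurement and `recover_relations` is the attacker, who only sees the power differences and the known difference positions.

```python
def power_diffs(seq, delta, n):
    """Per-edge power difference of an n-bit left-shift register.

    seq   : initial register content followed by the incoming bits
    delta : 1 where the second run flips the corresponding bit of seq
    Assumed model: power at an edge = number of toggling register bits.
    """
    flipped = [b ^ d for b, d in zip(seq, delta)]

    def hd(s, t):
        # At edge t, register bit k takes the value of its right neighbour.
        return sum(s[k] ^ s[k + 1] for k in range(t, t + n))

    return [hd(flipped, t) - hd(seq, t) for t in range(len(seq) - n)]


def recover_relations(diffs, delta, n):
    """Deduce {k: r}: seq[k] == seq[k+1] if r == 0, else they differ."""
    rel = {}
    for t, pd in enumerate(diffs):
        # A pair contributes only if exactly one of its two bits is active.
        active = [k for k in range(t, t + n) if delta[k] ^ delta[k + 1]]
        unknown = [k for k in active if k not in rel]
        known = sum(1 - 2 * rel[k] for k in active if k in rel)
        if len(unknown) == 1:
            # The remaining contribution is +1 (equal) or -1 (different).
            rel[unknown[0]] = 0 if pd - known == 1 else 1
    return rel


# Hypothetical values, not those of Table 4.
c = [1, 0, 1, 1, 0, 0]             # c_0 .. c_5
x = [0, 1, 1, 0, 1]                # x_0 .. x_4
delta = [0] * 6 + [1, 1, 0, 1, 0]  # differences in x_0, x_1, x_3

diffs = power_diffs(c + x, delta, n=6)
relations = recover_relations(diffs, delta, n=6)
print(diffs)      # [1, 1, 2, 1, 0]
print(relations)  # {5: 0, 7: 0, 8: 1, 9: 1}
```

Indices 5, 7, 8 and 9 correspond to the pairs $(c_5, x_0)$, $(x_1, x_2)$, $(x_2, x_3)$ and $(x_3, x_4)$. The pair $(x_0, x_1)$ stays unknown because both bits are active, which leaves two chains and hence 4 candidate states.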

Toy Non-linear Feedback Shift Register
Here, we use another toy example to illustrate how an analysis can be performed on an NLFSR. Suppose we have a 4-bit maximum period NLFSR (from [Dub12]) defined as follows: [ . After executing both computations and collecting their power traces, we can compare the power traces and deduce 4 bit relations, as seen in Table 5.
Table 5: Toy NLFSR example: Power consumption difference and bit relations obtained. Second and third columns are the register state of the original and with some difference, "Orig. dist." and "∆ dist." indicate the Hamming distance between the previous and current state, "Power diff." indicates the numerical power consumption differences, "Rise/Drop" is the observation of the power consumption difference at the rising edge of the clock, and the last column is the information obtained.
Starting from a difference $X^0 = (0, 0, 0, \Delta)$, we know the difference in the next cycle is $X^1 = (0, 0, \Delta, 0)$. Here, we observe a big drop$^5$ in the power consumption difference. As seen in Case 2.1 of Section 2.2, this indicates that the differential bit is different from both its neighbours, thus we have $x^1_1 \neq x^1_2$ and $x^1_2 \neq x^1_3$. For the next update (to clock cycle 2), the difference could be $X^2 = (0, \Delta, 0, 0)$ or $X^2 = (0, \Delta, 0, \Delta)$. Since we observe no change in the power consumption difference, we know that the succeeding bit $x^2_3$ has no difference. In addition, from Equation 1, this implies that $x^1_1 = 1$. From $X^2 = (0, \Delta, 0, 0)$, the difference could propagate to $X^3 = (\Delta, 0, 0, 0)$ or $X^3 = (\Delta, 0, 0, \Delta)$. Since we again observe no change in the power consumption difference, we know that $x^3_3$ has no difference and deduce that $x^2_2 = 1$ $^6$. Combining all this information, we reduce the possible states of $X^2$ from $2^4 = 16$ to just 2 states. In fact, the information obtained up to clock cycle 2 is already sufficient to arrive at this same conclusion.
5 This drop in power consumption difference, which is due to a difference in two register bits, is relatively "big" compared to a drop caused by a single register bit.
6 From the knowledge that $x^1_1 \neq x^1_2 \neq x^1_3$ and $x^1_1 = 1$, it is already sufficient to deduce that $x^2_2 = x^1_3 = 1$. This observation reassures us that our gathered information and observations are accurate.

Fresh re-keying scheme
Although there are countermeasures [BLGT05, CPM06, NRR06, PR11] like masking, hiding and threshold implementation to protect against DPA, they are generally costly to implement on encryption algorithms, unless the primitives are designed for efficient side-channel protection, for example Pyjamask [GJK+] and CRAFT [BLMR19].
Fresh re-keying schemes were proposed by Medwed et al. [MSGR10] as a countermeasure against side-channel analysis for low-cost devices. Low-cost devices like RFID tags have very constrained physical space in which to implement cryptographic algorithms, so there may not be enough resources to implement effective side-channel protections like masking or threshold implementation. Instead, the idea of a fresh re-keying scheme is to have a lightweight function g that derives session keys SK from a given secret master key MK and nonce IV, denoted as $g(IV, MK) = SK_{IV}$, and to use the fresh session keys for the block cipher encryption E.
Under the nonce-respecting scenario, a fresh re-keying scheme helps to protect the block cipher encryption against DPA since each encryption uses a different encryption key. However, the re-keying scheme now becomes the target of SCA. One can view a re-keying scheme as an encryption cipher that encrypts different nonces (plaintexts) under the same master key. While a re-keying scheme does not need mathematical properties as strong as those of block ciphers, it should have the following 6 properties given by [MSGR10]:
1. Good diffusion of the master key MK.
2. No synchronization between parties. Hence, g should be stateless.
3. No need for additional key material.
4. Little hardware overhead. Total costs lower than protecting E alone.
5. Easy protection against side-channel attacks.
6. Regularity.
Keymill [TRS16] is an NLFSR-based re-keying scheme designed by Taha et al. to be side-channel resilient at the algorithmic level, without relying on side-channel countermeasures to protect against SCA. However, Dobraunig et al. [DEKM17] found a DPA using the Case 1.1 analysis (Section 2.2), breaking the scheme with 4-bit internal state guessing and 128 chosen nonces.
LR-Keymill [TRS17] is an improved version of Keymill by the same designers. The idea is to update all 4 NLFSRs simultaneously with the same IV bits, making it non-trivial for the attacker to deduce which NLFSR consumes a higher or lower amount of power, thus increasing the search space. Based on this argument, the designers claimed 67.9-bit security against DPA.

Remark on LR-Keymill implementation:
We note that the designers of LR-Keymill have recommended the implementation should update all 4 NLFSRs in parallel to generate algorithmic noise. A basic serial implementation of LR-Keymill where only one of the NLFSRs is updated in each clock cycle will defeat its purpose and be vulnerable to the original [DEKM17] attack.
In this section, we give the specification of LR-Keymill and Keymill. In Section 5, we show that by exploiting the feedback functions of LR-Keymill, we can still recover the secret internal state with just 4-bit internal state guessing. Also, we can halve the number of nonces (hence, power traces) needed to attack Keymill (Section 5.4).
At the initialisation phase, a 128-bit master key MK is loaded into these registers. Next, a 128-bit initialisation vector IV is introduced to update the internal state. After a preprocessing phase, the scheme starts releasing an arbitrary number of keystream bits as the session keys.

Feedback functions
The bits in the registers are indexed from 0 in ascending order, $Rx = s_0 s_1 \ldots s_{|Rx|-1}$. When the clock cycle matters, we denote by $Rx^c[i]$ the i-th leftmost bit in register Rx at clock cycle c, where c = 0 denotes the initial state right after loading the master key. For a range of register bits from the i-th to the j-th bit inclusive, we write $Rx^c[i \sim j]$.
During an update, the feedback function Fx draws information from Rx and feeds back to Ry, where $y = x + c \bmod 4$. Each register does a left-shift, dropping the leftmost bit $s_0$ and taking in the new bit at the rightmost position of the register.
The feedback functions are defined as follows:
The key observation here is that when a new bit is introduced to R0 (resp. R1, R2, R3), it is at position $s_{30}$ (resp. $s_{31}$, $s_{31}$, $s_{32}$). After 5 (resp. 3, 3, 2) clock cycles, this new bit is extracted by its feedback function for the first time in the monomial term $s_{25}$ (resp. $s_{28}$, $s_{28}$, $s_{30}$). Although a new bit is introduced to all 4 NLFSRs at the same clock, the clock cycle at which the new bits are fed back differs.
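The delays quoted above follow from the shift direction alone: with one left-shift per clock cycle, a bit inserted at position p reaches tap position q after p − q cycles. The sketch below only encodes the positions stated in the text; `POSITIONS` and `first_feedback_delay` are names introduced for illustration.

```python
# (insert position, first feedback tap) per register, as stated in the text.
POSITIONS = {"R0": (30, 25), "R1": (31, 28), "R2": (31, 28), "R3": (32, 30)}


def first_feedback_delay(insert_pos, tap_pos):
    """Clock cycles until a newly inserted bit reaches the tap (one shift per cycle)."""
    return insert_pos - tap_pos


delays = {reg: first_feedback_delay(*pos) for reg, pos in POSITIONS.items()}
print(delays)  # {'R0': 5, 'R1': 3, 'R2': 3, 'R3': 2}
```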

LR-Keymill internal state update
For the first 128 updates, all 4 registers are updated with a nonce bit IV [c].
After the IV is completely absorbed, it is clocked for another 33 updates.
where $y = x + c \bmod 4$ and c ∈ {128, . . . , 160}. So far, no keystream bit has been output. Lastly, in each clock cycle, the leftmost bit from each register is XORed to form the output keystream bits KS.

Keymill internal state update
For the first 32 updates, each register is updated with a nonce bit IV [4c + x].
After the IV is completely absorbed, it is clocked for another 33 updates.
where $y = x + c \bmod 4$ and c ∈ {32, . . . , 64}. So far, no keystream bit has been output. Lastly, in each clock cycle, the leftmost bit from each register is output as keystream bits KS (4 keystream bits per cycle).
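The difference between the two nonce absorption schedules can be made explicit. This is a sketch based only on the indexing given in the text (IV[4c + x] for Keymill and IV[c] for LR-Keymill); the function names are hypothetical and the feedback functions themselves are not modelled.

```python
def keymill_entry(bit):
    """(clock cycle c, register index x) at which IV[bit] is absorbed in Keymill.

    Keymill feeds IV[4c + x] to register x at clock cycle c.
    """
    return divmod(bit, 4)


def lr_keymill_entry(bit):
    """LR-Keymill feeds IV[c] to all four registers at clock cycle c."""
    return bit, (0, 1, 2, 3)


print(keymill_entry(5))      # (1, 1): clock cycle 1, register R1
print(lr_keymill_entry(5))   # (5, (0, 1, 2, 3)): clock cycle 5, all registers
```

In Keymill a single-bit nonce difference enters exactly one register, whereas in LR-Keymill it enters all four at once, which is the source of the algorithmic noise the designers rely on.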

DAPA on LR-Keymill
In a nutshell, we analyse how a differential propagates through the internal state and extract sufficient information for us to reduce the internal state guessing complexity to the minimum of 4 bits.

Difference introduced at IV [4i].
Let the only difference in the nonce be at bit position 4i. The differential propagation can be seen in Figures 1-9, where differential bits are coloured in blue.
This is where things start to get interesting. Notice that the difference is fed back to R2, introducing a new difference, which is like Case 1.1. By observing the rise or drop of the power consumption difference, we can unambiguously determine the bit relation $R2^{4i+4}[(30, 31)]$.
$R2^{4i+6}[(30, 31)]$. If we are lucky and observe a rise or drop in the power consumption difference, we know the bit relation for both registers. Otherwise, we know that exactly one relation is equal while the other is not equal, but not which is which. Nevertheless, this information is still useful to us when we consider a nonce difference at 4i + 1 (see the analysis of Figure 14).

Difference introduced at IV [4i + 1].
We repeat the analysis with a difference in IV [4i + 1] and observe how the differential pattern propagates. See Figures 10-15. No difference is introduced at clock cycle 4i since the difference in the nonce is at IV [4i + 1]. From clock cycle 4i + 1 to 4i + 3, the same differential pattern and similar bit relation information are obtained as seen in Figures 1-3.

Difference introduced at IV [4i + 2] and IV [4i + 3].
The analysis for the other 2 bit positions is largely the same, thus for brevity we present the differential propagations (see

Key-recovery on LR-Keymill
Recall that besides the aforementioned new bit relations, we can obtain the combined relations for all 4 NLFSRs. Thus, if we know 3 of the 4 relations, we can fully determine all the bit relations for a particular column of register bits.
[Panel label: Clock cycle c = 4i + 10]
When we extend the analysis for another period of 4 cycles, we can determine the bit relations between 7 consecutive bits in all 4 NLFSRs, see Table 7.
When we perform the analysis for j = {0, . . . , 35} and a fixed i ∈ {0, . . . , 22}, it is sufficient for us to obtain the bit relations of 33 consecutive bits for all 4 NLFSRs. This gives us all the bit relations within each shift register in the internal state at a particular clock cycle. All that is left is to guess one key bit from each shift register, and we can roll back the updates to recover the master key.
Table 7: Bit relations learnt from introducing differences at IV [4i + j] where j = {0, . . . , 7}. The first column denotes the indices of each shift register. (j) denotes where the relation is obtained; (!), (?) and (*) denote relations derived from multiple sources. The "Combined" column denotes the difference position in the IV used to obtain the 4 combined bit relations.
[Panel label: Clock cycle c = 4i + 14]
In summary, we need noiseless measurements of 36 chosen nonces and 4-bit key guessing to recover the master key.
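The final guessing step can be illustrated with a minimal sketch. It assumes that each recovered bit relation is encoded as the XOR of two neighbouring register bits (0 for equal, 1 for not equal); this encoding and the function names are hypothetical and only serve the illustration. Once all relations along a register are known, a single guessed bit fixes every other bit of that register.

```python
def bits_from_relations(relations, guessed_first_bit):
    """Rebuild a run of register bits from neighbouring-bit relations.

    Assumption (not from the paper): relations[k] = bit[k] XOR bit[k+1],
    i.e. 0 means "equal" and 1 means "not equal".
    """
    bits = [guessed_first_bit]
    for relation in relations:
        # The next bit equals the previous one iff the relation is 0.
        bits.append(bits[-1] ^ relation)
    return bits


def candidates(relations):
    """Return both candidate bit strings, one per guess of the first bit."""
    return [bits_from_relations(relations, guess) for guess in (0, 1)]
```

With one such guess per shift register, the 4 registers give 2^4 = 16 candidate internal states, which matches the 4-bit guessing stated above.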

Remark on filtering the noise
From a practical perspective, we expect some noise during the computation and power measurement. To filter the noise, one could vary the latter bits of the nonce to collect and average out the traces. This is because the internal state of LR-Keymill only depends on the nonce bits that are already absorbed. For i = 0, we only need the first 47 nonce bits to be fixed, so there are still 81 bits of freedom to generate different nonces that give similar power traces for filtering the noise.
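The noise filtering idea can be sketched as follows, assuming a 128-bit nonce (47 fixed bits plus 81 free bits) and traces given as plain lists of samples. The helper names are hypothetical, and only the part of each trace recorded before the free bits are absorbed is expected to coincide across nonces.

```python
import random

NONCE_BITS = 128   # 47 fixed + 81 free bits, as in the text for i = 0
FIXED_BITS = 47


def nonces_with_fixed_prefix(prefix, count, rng):
    """Generate nonces that share the first FIXED_BITS bits.

    The remaining bits are drawn at random, so the traces agree only
    up to the point where the free bits start being absorbed.
    """
    assert len(prefix) == FIXED_BITS
    free = NONCE_BITS - FIXED_BITS
    return [prefix + [rng.getrandbits(1) for _ in range(free)]
            for _ in range(count)]


def average_traces(traces):
    """Sample-wise mean of equally long traces."""
    n = len(traces)
    return [sum(samples) / n for samples in zip(*traces)]
```

A typical use would be to measure one trace per generated nonce and pass the common trace window to `average_traces`.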

Improved attack on Keymill
In [DEKM17], the authors recover 1 bit relation (x, y) per chosen nonce using the information gained from Case 1.1. In fact, as seen in Case 2.1, if one observes the power consumption difference in the next cycle, one could deduce another bit relation (y, z) because (x, y) is already known. Thus, every chosen nonce with a single bit input can actually recover 2 bit relations, effectively halving the number of nonces (and corresponding power traces) needed to launch the attack. There are better choices of IV differences that could further reduce the number of nonces needed, but we do not pursue further improvements as the attack complexity is already very low.

Experimental Results
The leakage and attack on NLFSRs were previously validated on an FPGA platform by Dobraunig et al. [DEKM17], targeting Keymill. In this paper, we validate our proposed attack on LR-Keymill. The target design is implemented on an ARM Cortex-M3 microcontroller mounted on the Arduino Due board. The microcontroller has 512KB flash, 96KB SRAM and an 84 MHz operating frequency. The measurements are captured using an RF-U 5-2 near-field electromagnetic (EM) probe from Langer on a LeCroy WaveRunner 610zi oscilloscope at a sampling frequency of 1.25 GSamples/s. A 30 dB pre-amplifier is also used for better measurement quality.
For the implementation itself, we adopted a bitslice approach, written in assembly. In this scenario, four bits, one from every NLFSR, are updated simultaneously. The four bits are grouped together from the Least Significant Bits (LSBs) to allow a more efficient state update. Updating multiple registers together in the bitslice implementation also results in algorithmic noise from all NLFSRs, as suggested by the designers of LR-Keymill [TRS17].
As described, the target of the attack is the Hamming Distance (HD) of the internal states as the IV difference propagates. We therefore target the operation in which the register is updated, which is typically implemented as MOV Rd, Rs, where Rd and Rs are the destination and source registers respectively. In this case, if we target the register storing the LSBs, the leakage is expected to be the HD between the previous state and the updated state, which matches the theoretical model.
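The leakage model described above can be written down as a minimal sketch. Register contents are represented as small integers (for instance the 4 bitsliced LSBs); the function names are illustrative only.

```python
def hamming_distance(prev_state, new_state):
    """Number of bit positions that toggle when a register is overwritten."""
    return bin(prev_state ^ new_state).count("1")


def leakage_difference(prev_a, new_a, prev_b, new_b):
    """Expected leakage difference between two executions.

    Each execution overwrites a register (prev -> new); under the HD
    model the observable is the difference of the two toggle counts.
    """
    return hamming_distance(prev_a, new_a) - hamming_distance(prev_b, new_b)
```

For example, if one execution toggles a single bit while the other leaves the register unchanged, the expected difference is +1; swapping the two executions gives -1.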

Profiling and Results
We first conducted profiling to identify the area and points of interest. For the profiling, we implement a simple single state update. We send 2 random inputs, for which we can calculate the HD between them. In total, we measure 20,000 traces, each averaged 100× to minimize the effect of noise. The data is then split in two and subtracted to obtain 10,000 traces of ∆HD. From these traces, we can build a profile of the 9 different ∆HD classes {−4, −3, ..., 3, 4}, as plotted in Figure 26. The ∆HD classes can be clearly distinguished.
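The profiling step can be sketched as follows, assuming a 4-bit bitsliced update (so each HD lies in 0..4) and difference traces that are already labelled with their ∆HD. The function names and the data layout are assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import product

WIDTH = 4  # four bits are updated together in the bitslice implementation


def delta_hd_classes(width=WIDTH):
    """All values that HD1 - HD2 can take when both HDs lie in 0..width."""
    hds = range(width + 1)
    return sorted({h1 - h2 for h1, h2 in product(hds, repeat=2)})


def build_profile(labelled_traces):
    """Average the difference traces of each class.

    labelled_traces: iterable of (delta_hd, trace) pairs, where trace
    is a list of samples (hypothetical layout).
    """
    groups = defaultdict(list)
    for label, trace in labelled_traces:
        groups[label].append(trace)
    return {label: [sum(col) / len(traces) for col in zip(*traces)]
            for label, traces in groups.items()}
```

Calling `delta_hd_classes()` enumerates the 9 classes from −4 to 4 mentioned above.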

Identifying Differences
Previous results show that the classes have distinct power consumption and can be identified. However, the same difference with opposite polarity (like −1 and 1) can be hard to distinguish in a real measurement. To this end, we investigate whether ∆HD values of opposite polarity (sign) can be recognised in a captured EM measurement. The experiment is done on the previously described bitslice implementation of LR-Keymill running on the ARM Cortex-M3. We executed two IV sequences so as to have a sequence of ∆HD = 0, −1, +2, +1 over 4 clock cycles and measured two (averaged) EM traces. The difference of the two EM traces is then checked to recognise ∆HD. The results are plotted in Figure 27. While ∆HD values of 0, 1 and 2 are easily distinguishable, it is expected to be hard to distinguish 1 from −1. Nevertheless, one can check the polarity and magnitude of the peak to distinguish the two cases, which also matches the previous profiles. The peaks also match the timing: according to the operating frequency and the sampling frequency of the scope, they must be separated by around 208 samples. Moreover, we chose the pair (−1, 1) for this experiment because, being a low-consumption pair, it is the worst affected by noise; still, we are able to distinguish the two values.
Finally, we experiment with recognising, on a real measurement, a longer sequence generated from an LR-Keymill simulation. The initial state of LR-Keymill was fixed to a random value, a pair of IVs with a difference in the first bit was executed on the ARM Cortex-M3, and the two corresponding (averaged) EM traces were measured. The averaging was done by repeating each IV 1000 times, which remains well within the adversary model and noise filtering capability. The results are shown in Figure 28. The top half of the figure (generated from a Python simulation) shows the power traces generated by the two IVs and their difference as a dotted blue line. The corresponding EM trace difference is shown in the bottom half of the figure; the observed differences perfectly match the expected values and pattern discussed in Section 5.1.1. The peaks again match the expected separation of 208 samples.
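The 208-sample separation can be related to the two frequencies with a short calculation. The sketch below only uses the 84 MHz clock and the 1.25 GSamples/s sampling rate given earlier; the reading that 208 samples correspond to roughly 14 clock cycles per state update is inferred from these numbers and is not stated explicitly in the text.

```python
SAMPLING_RATE_HZ = 1.25e9  # oscilloscope sampling frequency
CLOCK_HZ = 84e6            # Cortex-M3 operating frequency

# Number of oscilloscope samples recorded during one CPU clock cycle.
SAMPLES_PER_CYCLE = SAMPLING_RATE_HZ / CLOCK_HZ


def cycles_for_separation(samples):
    """CPU clock cycles spanned by a given number of samples."""
    return samples / SAMPLES_PER_CYCLE


def samples_for_cycles(cycles):
    """Number of samples spanned by a given number of CPU clock cycles."""
    return cycles * SAMPLES_PER_CYCLE
```

Under this inferred reading, `samples_for_cycles(14)` is a little above 208, consistent with peaks that are "around 208 samples" apart.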

Practical Challenges:
The practical recovery of the value and sign of an observed difference is demonstrated in Figure 28. However, the difference recovery can often be erroneous in value, in sign, or in both due to its high sensitivity to noise. In our experiments, we observed that the sign has lower resistance to noise in the traces. In the event that the sign of a particular difference bit is inconclusive, the adversary can still continue to perform the attack, with slightly higher complexity, by any of the following techniques: 1. discard the noisy traces and repeat the experiment, say with a new pair of nonces with the same differential bit, 2. try to recover the bit relation from other power analysis instances, as there could be overlapping bit relations that can be recovered, or 3. guess the sign, which increases the final key-recovery complexity by 1 bit for each inconclusive sign.
Moreover, our attack is not limited in the length of the sequence to recover at once. Shorter sequences can be recovered independently. Consecutive sequences with overlapping difference bits can also help in error correction for wrongly recovered differences. The attack can be trivially extended from difference recovery of shorter sequences to full key recovery by repeating the analysis on different IV differences. Note that the analyses for different IV bit positions are independent of each other, and there is no snowball effect even if any of the iterations obtains inconclusive results. Thus, the key complexity increases by only 1 bit for each inconclusive difference bit recovered. As the attack on Trivium, presented in the following section, also requires the recovery of differences with sign, the same techniques are directly applicable.

Application to Trivium
Trivium is one of the stream ciphers selected for the eSTREAM portfolio [CP08], listed in Profile 2, which is particularly suitable for hardware applications with restricted resources. It is a synchronous stream cipher designed to generate up to 2^64 keystream bits from an 80-bit secret key K and an 80-bit initial value IV. In this section, we illustrate how our DAPA can be applied to Trivium to reduce the key guessing from 80 to 14 bits.

Specification of Trivium
Its internal state consists of 3 shift registers, denoted as A^c, B^c, C^c, and some non-linear feedback functions connecting the shift registers, where c is the clock cycle and c = 0 is the initial state after loading K and IV. The bits in the registers are indexed from 1 in ascending order. For a range of register bits from the i-th to the j-th bit inclusive, we denote it as X^c[i ∼ j].
Here, we only describe the key and IV setup as that is all we need to know for our DAPA. We also use a different notation to describe the algorithm for the ease of describing our attack. Note that, in contrast with LR-Keymill, the indexing starts from 1.

Loading key and initial value
First, the 80-bit secret key K = K_80 K_79 . . . K_1 is loaded flush left into the 93-bit shift register A, while the other bits are set to zero. Next, the 80-bit initial value IV = IV_80 IV_79 . . . IV_1 is loaded flush left into the 84-bit shift register B, and the other bits are set to zero too. Lastly, the 111-bit shift register C is initialised to all zeroes except for the last 3 bits, which are set to one. The initial registers are denoted as

Internal state update
After loading K and IV , the internal state is updated 4 × 288 times before any keystream output. The updating is as follows where c ∈ {0, . . . , 1151}. Notice that the shift registers are now a right shift.
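As an illustration, the loading and update steps can be sketched in Python. The feedback taps below are those of the standard Trivium specification, rewritten for three separate registers A (93 bits), B (84 bits) and C (111 bits) with 1-based indices; this rewriting is an assumption of the sketch rather than a reproduction of the notation used here. The mapping of IV_k to bit 81 − k of B follows the correspondence between IV_3 and bit 78 of B used in the attack description, and the analogous mapping for the key is assumed.

```python
def load_state(key_bits, iv_bits):
    """Initial registers; key_bits[k-1] = K_k and iv_bits[k-1] = IV_k.

    Index 0 of each list is a dummy so that A[i] is register bit i.
    Assumed mapping: A[i] = K_(81-i) and B[i] = IV_(81-i) for i = 1..80.
    """
    assert len(key_bits) == 80 and len(iv_bits) == 80
    A = [None] + [key_bits[80 - i] for i in range(1, 81)] + [0] * 13
    B = [None] + [iv_bits[80 - i] for i in range(1, 81)] + [0] * 4
    C = [None] + [0] * 108 + [1, 1, 1]
    return A, B, C


def clock(A, B, C):
    """One state update; every register shifts right by one position."""
    t1 = A[66] ^ A[93] ^ (A[91] & A[92]) ^ B[78]    # fed into B
    t2 = B[69] ^ B[84] ^ (B[82] & B[83]) ^ C[87]    # fed into C
    t3 = C[66] ^ C[111] ^ (C[109] & C[110]) ^ A[69]  # fed into A
    return ([None, t3] + A[1:93],
            [None, t1] + B[1:84],
            [None, t2] + C[1:111])
```

Running `clock` 4 × 288 = 1152 times on the output of `load_state` corresponds to the initialisation phase described above.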

DAPA on Trivium
We consider the attack model of resynchronization attacks, where the adversary is allowed to manipulate the value of IV. If repeating an IV is permissible, the adversary can collect as many power traces of the same computation as he needs to filter the noise. Hence, we consider a more constrained scenario where the adversary is not allowed to repeat an IV. Nevertheless, we show that the adversary will still be able to collect thousands of power traces of the same computation (see Section 6.3).
At clock cycle 0, the adversary knows the value of all but 80 register bits, namely A^0[1 ∼ 80], which contain the secret key K. His goal is to find the correct K with significantly less than 80-bit key guessing.

Introducing difference in IV
Suppose we introduce a difference at IV_3, which corresponds to a difference at B^0[78]. At the rising edge of a clock, it results in some power consumption differences in the shift register bits B^1[78] and B^1[79]. Since the value of IV is known, we know the expected power consumption difference from these register bits. To simplify our analysis, we know from the 2nd point of Case 2.1 that if we choose IV_2 = IV_4, the combined effect of these 2 register bits is that there is no change in the power consumption difference. Thus, we can focus on the feedback function, in particular, the value of t_1^0. By specification,
The simplest way to overcome that is to guess the key bits (14 bits) at these positions. For each key guess of A^0[67 ∼ 80], we can continue the aforementioned strategy for i = 0, . . . , 32 and recover the rest of the key bits A^0[1 ∼ 66].
In summary, we need noiseless measurements of 33 chosen initial values IV and 14-bit key guessing to recover the 80-bit secret key K.

Remark on filtering the noise
From a practical perspective, we expect some noise during the computation and power measurement. To filter the noise, we could vary the leftmost IV bits that are not used to introduce differences. More specifically, the smallest index at which we introduce a difference in the shift register is B^0[14], which corresponds to IV_67. We can enumerate the values of IV_80, IV_79, . . . , IV_69 to give us 2^12 power traces for each chosen IV difference. If having 2^12 power traces is insufficient, we can always double the number of traces in exchange for recovering one fewer key bit.
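The enumeration of the unused IV bits can be sketched as follows. The list layout (iv[k-1] holding IV_k) and the function name are assumptions made for this illustration; the 12 free positions IV_69 to IV_80 are taken from the text.

```python
from itertools import product


def iv_variants(base_iv, free_indices=range(69, 81)):
    """Yield every IV that differs from base_iv only at the free positions.

    base_iv[k-1] holds IV_k (hypothetical layout). With the default
    free positions IV_69..IV_80 this yields 2^12 distinct IVs, so no
    IV is ever repeated.
    """
    free = list(free_indices)
    for values in product((0, 1), repeat=len(free)):
        iv = list(base_iv)
        for k, value in zip(free, values):
            iv[k - 1] = value
        yield iv
```

Each yielded IV keeps IV_1 to IV_68 unchanged, so the bits used to introduce and control the differences are the same across all 2^12 traces.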

Conclusion and Future work
We presented a general DPA strategy to extract bit relation information from shift registers through the power consumption difference. This methodology can be applied to both LFSRs and NLFSRs. Combined with differential analysis, we applied our DAPA methodology to break the LR-Keymill security claim with 4-bit internal state guessing and halved the resources needed to perform DPA on Keymill. We experimentally verified our attack on LR-Keymill. Besides fresh re-keying schemes, we showed that our methodology can also be applied to shift register based stream ciphers like Trivium, reducing the key recovery to 14-bit key guessing.
The main issue with the LR-Keymill security claim is that it is an upper bound that treats the NLFSRs as black boxes. It would be interesting to find a framework to analyse and find the lower bound for Keymill-like structures, taking the NLFSRs into account. Having all NLFSRs identical in LR-Keymill could resist our DAPA but might introduce potential mathematical attacks due to the symmetrical structure.