Cache vs. Key-Dependency: Side Channeling an Implementation of Pilsung

Over the past two decades, cache attacks have been identified as a threat to the security of cipher implementations. These attacks recover secret information by combining observations of the victim cache accesses with the knowledge of the internal structure of the cipher. So far, cache attacks have been applied to ciphers that have fixed state transformations, leaving open the question of whether using secret, key-dependent transformations enhances the security against such attacks. In this paper we investigate this question. We look at an implementation of the North Korean cipher Pilsung, as reverse-engineered by Kryptos Logic. Like AES, Pilsung is a permutation-substitution cipher, but unlike AES, both the substitution and the permutation steps in Pilsung depend on the key, and are not known to the attacker. We analyze Pilsung and design a cache-based attack. We improve the state of the art by developing techniques for reversing secret-dependent transformations. Our attack, which requires an average of eight minutes on a typical laptop computer, demonstrates that secret transformations do not necessarily protect ciphers against side channel attacks.

To maintain security, block ciphers must use non-linear components such as substitution boxes (S-Boxes) and repeatedly apply them to a secret intermediate state across all the cipher's rounds. However, the non-linear nature of these components also means that they are typically implemented using tables (either holding S-Boxes or their optimized and precomputed version, such as T-Tables for the AES) which are then accessed in a way that highly depends on the cipher's internal state. From a side channel perspective, the implications of such an implementation are disastrous: an attacker monitoring the cache can deduce the target's memory access patterns, which directly reveal the cipher's state. From this state, the attacker can simply reverse the cipher, deducing the key.
While the attack outlined above works in theory, in practice it is typically impossible to completely deduce the cipher's state by merely observing the cache after a single table application. The main limitation is that the cache stores information in cache lines, which are larger than the table entries. As an access somewhere in the line results in the CPU caching the entire line, the attacker is unable to distinguish nearby table entries. This results in the attacker recovering only partial information about the cipher's internal state during a single round, which is insufficient for key recovery. Consequently, most cache attacks recover information from multiple rounds, and achieve key extraction by combining it. More specifically, the attacker makes a hypothesis on some missing key bits, and then uses the cipher's structure to determine the output of the cipher's first round. The output of the first round is then combined with side channel leakage from the cipher's second round to extract the key. Such an approach, however, inherently relies on the attacker's ability to predict the cipher's data propagation. While this is the case for popular ciphers, such as AES, which have a fixed key-independent structure, much less is known about side channel leakage from ciphers whose basic structures (e.g. S-Boxes and other transformations) are also key dependent.
For power and electromagnetic analysis, a negative indication about the protection provided by key-dependent transformations was recently given [GOC16; MCS17]. However, to the best of our knowledge, no such investigation has ever been undertaken with microarchitectural side channels. As an example of a cipher implementation that uses key-dependent transformations, in this paper we focus on an implementation of the North Korean cipher Pilsung [Kry18a] as reverse-engineered by Kryptos Logic [Kry18b]. We thus investigate the following question:

Do Pilsung's key-dependent transformations protect it from cache-based attacks?
To tackle this question, we mount the first cache attack on an implementation of Pilsung.

Overview of Pilsung.
Pilsung is an AES-like cipher, which uses a different S-Box for each byte at each round. These S-Boxes, as well as the affine permutation in each of the rounds, are generated using information from the secret key. Consequently, in the absence of knowledge about the secret key, an attacker has to find the specific S-Boxes and permutations used. Furthermore, because each S-Box entry is a single byte, the amount of information leaking via cache access is very limited, as a cache access only reveals two bits of information about the index.
Attacking Pilsung. Despite these challenges, we show two separate attacks. The first attack completely recovers the first round key by exploiting the lack of cache alignment in the public Pilsung implementation. We extend the works on misaligned tables [SP13; ZW10] and show how the alignment of the tables affects the amount of information leaked.
Noting that the misaligned tables may be an artifact of the reverse engineering of Pilsung rather than a flaw in the original implementation, we also present an attack on Pilsung when the tables are properly aligned with cache lines.
Attack Model. To decouple the analysis of the cipher from the intricacies of cache attacks, we model the cache attack as an oracle. The oracle observes the behavior of the cipher and returns the cache-line index of a specific S-Box access during the encryption process. We then show how to implement the oracle through a Prime+Probe [OST06] cache attack via the Mastik [Yar16] framework.
Attack Overview. Our attack first reverses the pseudo-random permutation used in Pilsung. In a nutshell, like AES, Pilsung uses a two-step permutation. It first permutes the bytes of the state, and then mixes the values of groups of bytes. Following the standard AES notation, we use the names Shift-Rows and Mix-Columns. However, unlike AES, the Shift-Rows step in Pilsung uses a key-dependent permutation of the bytes of the state.
To reverse this permutation, we first brute force the mapping of state bytes to columns. For that we select multiple plaintexts in which four of the bytes are fixed, i.e. have the same value in all plaintexts. Fixing bytes fixes the corresponding indices accessed in the first round. Furthermore, if the permutation moves all four fixed bytes to the same column, fixing the bytes fixes the indices accessed in the second round. By counting the number of fixed indices we can identify the four plaintext bytes that map to each column.
After reversing the permutation, we proceed to recover the key. Our attack exploits the limited dependency of the first round S-Boxes on the second round key. Specifically, the S-Box used for the n-th byte in the first round solely depends on the corresponding byte of the second round key. Thus, for the attack, we guess the two corresponding bytes in the first and second round keys, and verify the guess by checking the propagation of the plaintext values of the same byte to second round cache accesses. Overall, our attack requires 19472 oracle queries to fully break the cipher.
Overall Attack Performance. To perform the attack, we use the Mastik [Yar16] implementation of the Prime+Probe attack [OST06]. Because the attack only achieves a noisy version of the oracle, we need to encrypt multiple plaintexts for each oracle query. Overall, we achieve complete key recovery by monitoring $3.52 \times 10^7$ encryptions.

Summary of Contributions.
In this paper we make the following contributions:
• We perform the first cache attack on an implementation of Pilsung, a cipher that uses key-dependent transformations. We show that in Pilsung's case, key-dependent transformations do not provide an inherent protection against side channel attacks (Sections 4 and 5).
• We show how to use a cache attack to reverse engineer secret permutations in AES-like ciphers (Section 4.3).
• We show that the large number of S-Boxes supported in the Pilsung implementation, as reverse-engineered by Kryptos Logic [Kry18b], actually facilitates efficient cache-based attacks (Section 4.4).
• As an independent contribution, we investigate the effects of misaligned tables on cache attacks and provide a formula for calculating the amount of leakage due to misalignment (Section 3).

Cache Attacks
To bridge the speed gap between the fast processor and the slower memory, processor designers introduce small banks of fast memory called caches. Caches store recently accessed cache lines (blocks of memory of a fixed size, typically 64 bytes), exploiting the temporal and spatial patterns of memory accesses often found in software.
Observing Cache Timing Difference. While caches are functionally transparent to the application software, the speed difference between accessing data in the cache (cache hit) and data not in the cache (cache miss) can reveal information on the contents of the cache. Moreover, because the contents of the cache depend on prior computation, recovering information on the contents of the cache may disclose secret information on prior computation.

Prime+Probe.
Prime+Probe [OST06] is a cache attack technique that exploits the structure of the cache to leak information. Modern caches are typically set associative, meaning that the cache is divided into multiple sets, such that each set can only store data from a subset of the physical memory. The Prime+Probe attack first primes one or more cache sets, filling them with the attacker's data. The attack then allows the victim to execute, before finally probing the cache, i.e. measuring the time it takes to access the previously cached data. Slow access indicates that the victim execution replaced some of the previously cached data, causing memory access in the probe stage. Thus, through monitoring cache sets in which the victim execution replaces data, the attacker can learn that the victim accessed data in memory locations that map to these cache sets.
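The prime, victim and probe steps can be illustrated on a toy model of a single cache set. The sketch below assumes an 8-way set with LRU replacement and replaces timing with explicit miss counting; the class and function names are hypothetical and are not taken from Mastik or from any real attack code.

```python
from collections import OrderedDict


class CacheSet:
    """Toy model of one set of a set-associative cache with LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # least recently used line first

    def access(self, tag):
        """Access a line; return True on a hit and False on a miss."""
        if tag in self.lines:
            self.lines.move_to_end(tag)
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)  # evict the least recently used line
        self.lines[tag] = None
        return False


def prime_probe(victim_accesses_set, ways=8):
    """Return the number of probe misses the attacker observes in one set."""
    cache_set = CacheSet(ways)
    attacker_lines = [("attacker", w) for w in range(ways)]
    for tag in attacker_lines:              # prime: fill the set
        cache_set.access(tag)
    if victim_accesses_set:                 # victim runs
        cache_set.access(("victim", 0))
    # probe in reverse order so that one eviction causes one miss under LRU
    return sum(not cache_set.access(tag) for tag in reversed(attacker_lines))


print(prime_probe(False), prime_probe(True))
```

In a real attack the miss count is not visible directly; the attacker infers it from the probe time, which is why the signal is noisy.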

Prime+Probe on Block Ciphers
Overview of Block Ciphers. Block ciphers are typically implemented as a sequence of rounds that operate on an internal state. For encryption, the state is initialized with the plaintext. Each round applies a function that mixes key material, i.e. some bits that depend on the key, with the current state to produce the next state. The final state is the ciphertext. The process is reversed for decryption.
Security via Non-Linearity. Because linear functions can be reverted trivially, the round function of a cipher includes non-linear components. Many ciphers use S-Boxes, which map short sequences of bits to other sequences of bits. S-Boxes can be implemented as lookup tables, such that the n-th table entry for S-Box S contains the value S(n). To speed up the round function, some ciphers store precomputed values in T-Tables.

Extracting Table Indices. Cache attacks aim to find information on the indices used for specific table accesses. For example, in AES the indices of first round accesses to S-Boxes or T-Tables are the exclusive-or (XOR) of each plaintext byte with the corresponding key byte. Hence, finding the index used at the first round, combined with a known plaintext, allows the attacker to recover the key. The Prime+Probe attack, described above, can find the cache set that a victim program accesses. As cache lines are typically much larger than the lookup table entries, Prime+Probe typically recovers only partial information on the table index used. Specifically, when the S-Box input and output are both 8 bits, and the cache line is 64 bytes, identifying the cache line accessed for a specific plaintext discloses as little as two bits of information on the corresponding key byte.¹
Synchronous vs. Non-Synchronous Attacks. Cache attacks tend to have limited spatial and temporal resolution. The literature distinguishes between two attack settings. In a synchronous attack, the attacker is able to synchronize the attack with the victim. Typically, the attacker performs the prime stage before the victim encrypts or decrypts a block of data, and the probe step after the encryption or decryption completes. In an asynchronous attack, the attacker executes concurrently with the victim, but cannot a priori control the relative ordering of the operations.
In this work we are interested in the weaknesses of Kryptos Logic's implementation of Pilsung rather than in advancing the techniques for cache attacks. Hence, we focus on the weaker attack model of synchronous attacks.
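The partial leak described under "Extracting Table Indices" can be made concrete with a small sketch. It assumes a cache-aligned table of one-byte entries and 64-byte lines, as in the text; the function names and the example key byte are illustrative only.

```python
LINE_SIZE = 64  # bytes per cache line, the typical value cited in the text


def cache_line(p, k, line_size=LINE_SIZE):
    """Line index, within a cache-aligned table of 1-byte entries, of entry p ^ k."""
    return (p ^ k) // line_size


def candidate_keys(p, observed_line, line_size=LINE_SIZE):
    """All key bytes consistent with one observed cache line for plaintext byte p."""
    return [k for k in range(256) if cache_line(p, k, line_size) == observed_line]


secret = 0xA7                     # hypothetical key byte
plaintext = 0x3C
candidates = candidate_keys(plaintext, cache_line(plaintext, secret))
print(len(candidates), {k >> 6 for k in candidates})
```

A single observation leaves 64 candidates that all share the two most significant bits of the key byte, and further plaintexts do not narrow the set for an aligned table.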
Handling Partial Information. In a typical attack, the attacker aims to identify a specific access to an S-Box. That is, the attacker knows that the input to the S-Box comprises a known value p combined with an unknown key material k, e.g. by calculating the exclusive or of the two. The attacker then wants to find the cache set accessed in this scenario, and from this find information on the index p ⊕ k and recover information on k.
The signal from Prime+Probe tends to be noisy, in part due to the limited temporal resolution, but more so because the victim accesses multiple cache sets during the cipher operation. To distinguish the targeted accesses, the attacker performs multiple Prime+Probe rounds, with different plaintexts. The attacker then averages the access times to each cache set over all of the plaintexts, and over the set of plaintexts that result in the desired state at the input of the S-Box. Because in the latter case the cipher always has the same input to the target S-Box, the affected cache set will show a longer average access time when the desired state is achieved.
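The averaging step can be sketched on simulated data. Everything in the model below is an assumption made for illustration: 64 monitored sets, Gaussian timing noise, 16 table accesses per encryption that each slow one set by a fixed amount, and a target access that always hits the same set when the plaintext byte takes the chosen value.

```python
import random

NUM_SETS = 64


def simulate_probe(rng, plaintext_byte, target_value, target_set):
    """Simulated probe time of every cache set after one encryption (toy model)."""
    times = [100.0 + rng.gauss(0, 5) for _ in range(NUM_SETS)]
    for _ in range(15):                       # unrelated table accesses
        times[rng.randrange(NUM_SETS)] += 30.0
    if plaintext_byte == target_value:        # the access we want to locate
        times[target_set] += 30.0
    else:
        times[rng.randrange(NUM_SETS)] += 30.0
    return times


def find_target_set(samples, target_value):
    """Set whose mean probe time rises most when the plaintext byte is target_value."""
    total = [0.0] * NUM_SETS
    selected = [0.0] * NUM_SETS
    n_selected = 0
    for plaintext_byte, times in samples:
        for s, t in enumerate(times):
            total[s] += t
        if plaintext_byte == target_value:
            n_selected += 1
            for s, t in enumerate(times):
                selected[s] += t
    gain = [selected[s] / n_selected - total[s] / len(samples)
            for s in range(NUM_SETS)]
    return max(range(NUM_SETS), key=gain.__getitem__)


rng = random.Random(2020)
TARGET_VALUE, TARGET_SET = 0x5A, 17
samples = []
for _ in range(8000):
    p = rng.randrange(256)
    samples.append((p, simulate_probe(rng, p, TARGET_VALUE, TARGET_SET)))
recovered = find_target_set(samples, TARGET_VALUE)
print(recovered)
```

The difference between the two averages cancels the accesses that do not depend on the chosen plaintext byte, which is why the comparison is made against the mean over all plaintexts rather than against a fixed threshold.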

Encryption Function
Pilsung is a substitution-permutation network based on AES [Fip]. It manipulates a 128-bit block divided into 16 bytes in 10 rounds, using 11 128-bit round keys. Each 128-bit value is viewed as a byte-oriented 4 × 4 matrix. We denote the 128-bit plaintext by $M = M[i, j]$, and the 128-bit $r$-th round key by $RK_r = RK_r[i, j]$, $i, j \in [0, 3]$, $r \in [0, 10]$. As for AES, the round function consists of the application of four functions:
• Add-Round-Key. Each byte of the state is XORed with the corresponding byte of the round key.
• Sub-Bytes. Each byte of the state is passed through a non-linear S-Box layer. For each round $r$ and each state byte $i, j$, Pilsung uses a different S-Box in a key-dependent manner, thus leading to a total of 160 different S-Boxes (16 per round, across 10 rounds). We denote by $SB^r_{i,j}$ the S-Box used in round $r$ on the state byte in location $i, j$.
• Shift-Rows. The 16 bytes of the state are rearranged according to a permutation. This permutation is key dependent and is different for each round. We denote by $SR_r$ the permutation used at round $r$.
• Mix-Columns. Each column-wise group of four bytes of the state is multiplied by a 4 × 4 matrix. This operation is the same as in AES.
To ease the explanation, we view the round function as the successive application of Add-Round-Key, Sub-Bytes, Shift-Rows and Mix-Columns. The last round omits the Mix-Columns and performs a final whitening before the output of the ciphertext, using the last round key $RK_{10}$. This is slightly different from the traditional view of AES as consisting of an initial key whitening, followed by rounds of Sub-Bytes, Shift-Rows, Mix-Columns and Add-Round-Key.
The only differences between Pilsung and the standard AES-128 encryption function lie in the key-dependent Sub-Bytes and Shift-Rows operations. For each round $r$, these are computed using the round key $r + 1$ in the following manner:
Setting the Shift-Rows Permutation: Using the 128 bits of $RK_{r+1}$, a pseudo-random permutation $SR_r = SR_r[i, j]$, $i, j \in [0, 3]$ of 16 elements is computed, where $SR_r[i, j] = (i', j')$, with $i'$ (resp. $j'$) denoting the new row (resp. column) index of the permutation for the input $i, j$. The 16 bytes of the state are then permuted according to $SR_r$. The algorithm for computing $SR_r$ from $RK_{r+1}$ is known [Kry18a], but we do not exploit it for the attack and assume that $SR_r$ is chosen at random.

Setting the S-Boxes:
Using the 8-bit value $RK_{r+1}[i, j]$, a pseudo-random permutation $P$ of eight elements is computed. $SB^r_{i,j}(x)$ is then computed by applying the permutation $P$ to the output bits of the standard AES S-Box. The way $P$ (and thus $SB^r_{i,j}(x)$) is computed from $RK_{r+1}[i, j]$ is known and can be found in [Kry18a].
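The construction of a key-dependent S-Box from a bit permutation can be sketched as follows. The derivation of $P$ from the round key byte is specified in [Kry18a] and is not reproduced here, so the permutation is passed in as a plain list, and the convention that input bit `src` moves to output bit `perm[src]` is an assumption of this sketch.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return result


def aes_sbox():
    """Standard AES S-Box: field inverse followed by the AES affine map."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):  # XOR with the four left rotations of inv
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def permute_bits(value, perm):
    """Move bit src of value to bit position perm[src]."""
    out = 0
    for src, dst in enumerate(perm):
        out |= ((value >> src) & 1) << dst
    return out


def keyed_sbox(perm):
    """S-Box obtained by permuting the output bits of the AES S-Box."""
    return [permute_bits(y, perm) for y in aes_sbox()]


reversed_bits = keyed_sbox([7, 6, 5, 4, 3, 2, 1, 0])  # example permutation only
print(hex(reversed_bits[0]))
```

Because only the output bits are permuted, every such S-Box is still a bijection on bytes and is indexed by the same input as the AES S-Box, which is why the cache line of an access leaks the same information regardless of $P$.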

Key Scheduling
The key scheduling is again similar to that of AES, with a few differences. First, a 256-bit master key $MK$ is passed through a SHA1-based function, generating 256 bits of key material $K$. Next, 176 bytes of round keys $RK_r$ are generated from $K$ using the AES key scheduling. However, instead of using $N_k = 4$ (reusing the same notation as in the AES FIPS document [Fip]) as would be expected for a 10-round AES-like encryption function, Pilsung uses the key scheduling function with $N_k = 5$ and $N_r = 10$. This means that only the first 160 bits of $K$ are used to generate the round keys.
We make two observations about Pilsung's key scheduling. First, the knowledge of $K$ is enough to completely break the cipher. That is, the 256-bit key $MK$ and the hash function can be ignored. Second, only 160 bits of $K$ are used to generate the round keys. Of these 160 bits, the first 128 are used in the first Add-Round-Key as $RK_0$, and the next 32 are used in the second Add-Round-Key, as the first 32 bits of $RK_1$.
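The two observations can be checked against a sketch of the AES key expansion run with $N_k = 5$ and $N_r = 10$. This is the standard FIPS-197 expansion with a non-standard word count, written under the assumption that Pilsung applies it unchanged to the first 20 bytes of $K$; the SHA1-based derivation of $K$ from $MK$ is not modelled, and the example value of `K` is arbitrary.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = (a << 1) ^ (0x11B if a & 0x80 else 0)
        b >>= 1
    return result


def build_aes_sbox():
    """Standard AES S-Box, computed rather than hard-coded."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


def expand_key(k, nk=5, nr=10):
    """AES key expansion with nk key words; returns nr + 1 round keys of 16 bytes."""
    sbox = build_aes_sbox()
    words = [list(k[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nr + 1)):
        temp = list(words[i - 1])
        if i % nk == 0:
            temp = temp[1:] + temp[:1]            # RotWord
            temp = [sbox[b] for b in temp]        # SubWord
            temp[0] ^= rcon
            rcon = gf_mul(rcon, 2)
        words.append([a ^ b for a, b in zip(words[i - nk], temp)])
    flat = [b for w in words for b in w]
    return [flat[16 * r:16 * r + 16] for r in range(nr + 1)]


K = bytes(range(32))              # hypothetical 256-bit key material
round_keys = expand_key(K)
print(round_keys[0], round_keys[1][:4])
```

The first five expanded words are copied directly from the key material, so $RK_0$ and the first four bytes of $RK_1$ are raw bytes of $K$, and bytes 20 to 31 of $K$ never enter the expansion.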

Implementation
The application of the S-Boxes is done through memory accesses. That is, the 10 × 16 S-Boxes are all stored in memory. With each S-Box occupying 256 bytes, the 16 S-Boxes of each round require 4 KB of memory, fitting exactly across a typical 8-way L1 of 32 KB while avoiding set collisions. This absence of collisions allows us to differentiate between the S-Boxes of the round and facilitates associating cache information with the S-Boxes.
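The absence of set collisions follows from simple arithmetic, sketched below for the cache geometry named in the text (32 KB, 8 ways) with 64-byte lines. The base addresses are arbitrary examples, and the set index is taken as the line number modulo the number of sets.

```python
LINE_SIZE = 64
WAYS = 8
L1_SIZE = 32 * 1024
NUM_SETS = L1_SIZE // (LINE_SIZE * WAYS)   # 64 sets

SBOX_SIZE = 256
SBOXES_PER_ROUND = 16


def sets_per_sbox(base_addr):
    """Cache sets touched by each of the 16 S-Boxes of one round."""
    mapping = []
    for s in range(SBOXES_PER_ROUND):
        first = (base_addr + s * SBOX_SIZE) // LINE_SIZE
        last = (base_addr + (s + 1) * SBOX_SIZE - 1) // LINE_SIZE
        mapping.append([line % NUM_SETS for line in range(first, last + 1)])
    return mapping


aligned = sets_per_sbox(0x10000)       # line-aligned example address
misaligned = sets_per_sbox(0x10001)    # same tables shifted by one byte
print(NUM_SETS, aligned[0], misaligned[0])
```

With an aligned base, the 16 tables cover each of the 64 sets exactly once, so an observed set identifies both the S-Box and the line within it. With a shifted base, each table spans five lines and shares a line with its neighbour.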

Information in Misaligned Tables
We look at the typical first-round attack, where an input byte $p$ is combined with a secret key byte $k$ to form an index $p \oplus k$ into an S-Box.² The attacker can observe the cache line of the access to the S-Box, and tries to use such observations to learn information on $k$. In the standard scenario, we assume that the S-Box is cache-aligned, i.e. that the address of entry 0 of the S-Box is at the start of a cache line. When cache lines are $2^l$ bytes long, this implies that entry $p \oplus k$ of the S-Box falls in cache line $\lfloor (p \oplus k)/2^l \rfloor$ of the S-Box. Recovering this reveals the $8 - l$ most significant bits (MSBs) of $k$. Thus, on Intel machines, where cache lines are $64 = 2^6$ bytes long, we can expect to recover $8 - 6 = 2$ bits of information on each key byte in a first-round attack.
Due to the way that the S-Boxes are allocated in the reverse-engineered version of Pilsung [Kry18b], the S-Boxes are not cache-aligned. Instead, they are offset by some number of bytes.
Several works consider the case that the tables are not cache aligned. Osvik, Shamir and Tromer [OST06] suggest that misaligned tables add one bit of information, but do not elaborate. Zhao and Wang [ZW10] and later Spreitzer and Plos [SP13] further investigate the case of misaligned tables. Both show that in some cases of misaligned tables, the adversary can recover the secret completely, but stop short of analyzing the relationship between the misaligned tables and the number of bits that can be recovered.
We now show that if the S-Box is offset by $0 < \delta' = 2^l - \delta < 2^l$ bytes, for $\delta = (2\alpha + 1)2^\beta$, the attacker can recover all but the $\beta$ least significant bits (LSBs) of $k$.

Recovering the Most Significant Bits
We begin by observing that each group of $2^l$ values of $p \oplus k$ falls within two consecutive cache lines instead of a single cache line. Specifically, $\delta$ of them fall in the cache line with the smaller index, and the rest in the other. We now show how we can use this observation to recover the $8 - l$ MSBs of $k$.
As the S-Box is offset by $\delta'$, we have that the table entry $p \oplus k$ is located at cache line $A = \lfloor ((p \oplus k) + \delta')/2^l \rfloor$. Let $p$ be such that $p \oplus k$ is a multiple of $2^l$, i.e. $p \oplus k$ has the $l$ LSBs set to zero. For such a $p$, it holds that
$$A = \left\lfloor \frac{(p \oplus k) + \delta'}{2^l} \right\rfloor = \frac{p \oplus k}{2^l}.$$
This is because the addition of $\delta' < 2^l$ to $p \oplus k$ can only affect the lowest $l$ bits of $p \oplus k$, and no carry is generated because these bits are zero. Thus, assuming the attacker knows a value of $p$ such that $p \oplus k$ is a multiple of $2^l$, the $8 - l$ MSBs of $k$ can be recovered using
$$\left\lfloor \frac{k}{2^l} \right\rfloor = A \oplus \left\lfloor \frac{p}{2^l} \right\rfloor.$$
For an unknown $k$, the attacker cannot generally obtain a $p$ such that $p \oplus k$ is a multiple of $2^l$. However, there exists one such $p$ in the range $0, 1, \ldots, 2^l - 1$. As discussed above, this range maps to two cache lines, and we know that for this value of $p$, the access will fall in the lower indexed cache line. Hence, observing an access to the lower line reveals the value of $A$, and solving the equation above reveals the $8 - l$ MSBs of $k$.

Recovering Other Bits
We now proceed to show how we use the observation that the values of $p \oplus k$ fall within two cache lines to recover more bits of $k$. First, as $\delta$ values of $p \oplus k$ fall in the cache line with the smaller index, we can obtain a set $P = \{p_0, p_1, \ldots, p_{\delta-1}\}$ such that $(p_i \oplus k) \bmod 2^l < \delta$ for all $0 \le i < \delta$. We now show that given a $k'$ such that $(p_i \oplus k') \bmod 2^l < \delta$ for all $0 \le i < \delta$ we have $\lfloor k/2^\beta \rfloor \bmod 2^{l-\beta} = \lfloor k'/2^\beta \rfloor \bmod 2^{l-\beta}$, that is, $k$ and $k'$ differ only in the $\beta$ LSBs. Having established this, the adversary can test all $2^l$ possible values of $k'$ to find at least one such that $(p_i \oplus k') \bmod 2^l < \delta$ for all $0 \le i < \delta$, and from this $k'$ get all but the $\beta$ LSBs of $k$.
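We first note that because $\delta = (2\alpha + 1)2^\beta$, for all $0 \le x < 2^l$ we have
$$x < \delta \iff 2^\beta \lfloor x/2^\beta \rfloor < \delta. \qquad (1)$$
There exists $p_i \in P$ such that $(p_i \oplus k) \bmod 2^l = \delta - 1$. We therefore obtain that $((\delta - 1) \oplus k) \bmod 2^l = p_i \bmod 2^l$ with $p_i \in P$. Thus, it holds that
$$((\delta - 1) \oplus (k \oplus k')) \bmod 2^l = (p_i \oplus k') \bmod 2^l < \delta, \qquad (2)$$
where the transitions follow from the associativity of exclusive-or and from our assumption regarding $k'$. Because $\delta = (2\alpha + 1)2^\beta$, we have that $\delta - 1 = \delta \oplus (2^{\beta+1} - 1)$. Combining this with Equation 2, we obtain that
$$(\delta \oplus (2^{\beta+1} - 1) \oplus (k \oplus k')) \bmod 2^l < \delta.$$
Next, applying Equation 1 we obtain that
$$(\delta \oplus 2^\beta \oplus 2^\beta \lfloor (k \oplus k')/2^\beta \rfloor) \bmod 2^l < \delta. \qquad (3)$$
Conversely, as for all $\delta \le x < 2^l$ we have $(x \oplus k \oplus k') \bmod 2^l \ge \delta$ (exclusive-or with $k \oplus k'$ permutes the values below $2^l$ and, by our assumption regarding $k'$, maps the values below $\delta$ among themselves), setting $x = \delta$ and applying Equation 1 gives
$$(\delta \oplus 2^\beta \lfloor (k \oplus k')/2^\beta \rfloor) \bmod 2^l \ge \delta. \qquad (4)$$
However, Equations (3) and (4) can both be true only if $\lfloor (k \oplus k')/2^\beta \rfloor \bmod 2^{l-\beta} = 0$. Thus all key hypotheses that agree with the set $P$ agree on all but the $\beta$ least significant bits.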
Specifically, due to the layout of the Pilsung context, the S-Boxes are all at odd offsets, i.e. $\beta = 0$. Hence, a first round attack can recover all of the key bits.
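The relationship between the offset and the number of unrecoverable bits can be checked by exhaustive search. The sketch below assumes 64-byte lines ($l = 6$), one-byte table entries, and an attacker who learns the cache line of the access for every plaintext byte; the function names and the example key byte are illustrative.

```python
L_BITS = 6  # cache lines of 2^6 = 64 bytes


def line_trace(k, offset):
    """Cache line (relative to the table's first line) accessed for every p."""
    return [((p ^ k) + offset) >> L_BITS for p in range(256)]


def consistent_keys(k, offset):
    """Key hypotheses that produce exactly the line trace of the true key k."""
    observed = line_trace(k, offset)
    return [c for c in range(256) if line_trace(c, offset) == observed]


secret = 0xA7  # hypothetical key byte
for offset in (0, 1, 2, 3, 4, 32, 48):
    print(offset, len(consistent_keys(secret, offset)))
```

For an odd offset the true key byte is the only consistent hypothesis, while an offset that is an odd multiple of $2^\beta$ leaves $2^\beta$ hypotheses that differ from the key only in the $\beta$ LSBs. An offset of 0 is the aligned case and leaves 64 hypotheses.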

A Theoretical Attack on Pilsung
We now investigate the more complicated case where the tables are aligned in memory. For the theoretical attack, we assume that the attacker has access to an oracle $O(M, K, r, i, j)$ that, given the plaintext $M$ and the key $K$, returns the two most significant bits (MSBs) of the S-Box input at round $r$ for the state byte $i, j$. We show how the attacker can use the oracle to recover the full 160 bits of the secret key. We start by introducing new notations. We then show how much information can be obtained by attacking the first round. We then show how to use the oracle to reverse the key-dependent Shift-Rows permutation of the first round. Finally, we target the second round, recovering the secret key.

Notations
To simplify the explanations of the different attack steps, we introduce additional notations for Pilsung's state. We use $A_r = A_r[i, j]$, $i, j \in [0, 3]$ to denote the state matrix before the Add-Round-Key (AK) operation of round $r$. Similarly, $B_r$, $C_r$ and $D_r$ respectively denote the state matrix at round $r$ before the Sub-Bytes (SB), Shift-Rows (SR) and Mix-Columns (MC) operations. The state is initialized with $A_0 = M$. Figure 1 illustrates these notations. Using these notations, we can write the output of the oracle as $O(M, K, r, i, j) = (B_r[i, j] \gg 6)$, where $x \gg r$ is the value $x$ after a right-shift of $r$ bits, i.e. $x \gg r = \lfloor x/2^r \rfloor$.

First Round Attack
The first round attack essentially follows the same procedure as the standard Prime+Probe attack described in [OST06]. That is, for a given target byte $i, j$, the attacker simply needs to query the oracle once in the first round. Such a query gives the attacker the two MSBs of $B_0[i, j]$, the input of the S-Box $i, j$ at round 0. As this value only depends on one byte of the plaintext and of the first round key, the attacker straightforwardly obtains the two most significant bits of $RK_0[i, j]$. More formally, Algorithm 1 describes the procedure to attack the first round from the oracle.
Algorithm 1 First round attack.
By repeating Algorithm 1 for each of the 16 byte indices, the attacker obtains the two MSBs of each of the first round key bytes. As a result, the entropy is reduced from 160 bits to 128. Note that no further information can be obtained by analyzing the first round.
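The first round step can be sketched against a simulated oracle. The code below is not the paper's Algorithm 1; it is a minimal illustration that assumes an oracle restricted to round 0, which returns $(M[i, j] \oplus RK_0[i, j]) \gg 6$, and all names are hypothetical.

```python
import random


def make_first_round_oracle(rk0):
    """Simulated oracle for round 0: two MSBs of the first-round S-Box input."""
    def oracle(m, i, j):
        return (m[i][j] ^ rk0[i][j]) >> 6
    return oracle


def first_round_attack(oracle, rng):
    """Recover the two MSBs of every byte of RK_0 with one query per byte."""
    m = [[rng.randrange(256) for _ in range(4)] for _ in range(4)]
    # XOR acts bit-wise, so the MSBs of the key are the answer XOR the MSBs of m
    return [[oracle(m, i, j) ^ (m[i][j] >> 6) for j in range(4)] for i in range(4)]


rng = random.Random(1)
rk0 = [[rng.randrange(256) for _ in range(4)] for _ in range(4)]  # unknown to the attacker
msbs = first_round_attack(make_first_round_oracle(rk0), rng)
print(msbs)
```

Sixteen queries yield 32 key bits, which matches the reduction from 160 to 128 bits of entropy stated above.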

Reversing the Shift-Rows Permutation
Attacking the first round does not give enough information to break the cipher. To obtain information on the remaining key bits, the attacker has to go deeper in the cipher and target the second round. However, as the Shift-Rows operation is key dependent, the attacker cannot directly predict which bytes of the plaintext are going to affect which bytes of the second round S-Boxes. Consequently, the attacker first needs to recover the key-dependent permutation used during the first Shift-Rows.
We use a two-step procedure for recovering the key-dependent permutation. We first find which set of four bytes is mapped to which column after the Shift-Rows operation, without recovering the rows they are mapped to in the column. In the second step we find the ordering of these four bytes within this column. For the attack we do not make any assumption about how the permutation is generated. Thus, the procedure is also applicable to random permutations.

Column Mapping
As previously mentioned, the first step consists of finding which group of four bytes before Shift-Rows will be mapped to the same column after the application of the permutation. More formally, for each $j \in [0, 3]$ we aim to find four indices $(i_k, j_k)$, $k \in [0, 3]$, such that $SR_0[i_k, j_k] = (i'_k, j)$. See Figure 2 for a graphical representation of the goal of this step. Each color represents a group of four bytes that are mapped to the same column after the unknown SR. We want to find the color-wise mapping. However, we do not try to recover the ordering within a given color. More formally, we do not (yet) find the values of $i'_k$.
To perform this column mapping, we query the oracle at the second round, revealing information on the MSBs of each byte of $B_1$, the inputs of the second round S-Boxes. We start with the following observation: if (a) at least one byte of each column in $D_0$ is random, then the Mix-Columns operation will propagate this randomness among the bytes in $A_1$, and thus $B_1$. However, if (b) a given column index $j$ of $D_0$ is fixed, then the corresponding column in $B_1$ will remain fixed as well. These two cases are illustrated in Figure 3. The top (resp. bottom) part of the figure shows the case a (resp. b). Colored bytes correspond to fixed values, while white bytes correspond to randomly varying bytes.
The knowledge of which situation, (a) or (b), happened can be used to reverse the column mapping of the permutation. To do this, we first choose a set of four bytes of the plaintext $M$. We then fix the values of these four bytes and set the 12 others as random for several cipher executions. If case (b) occurs, we know that the indices of the four fixed bytes are mapped to the same column $j$ after the Shift-Rows operation. However, if case (a) occurs, this means that at least one of the four fixed bytes' indices was not mapped within the same column as the other three after the Shift-Rows operation.
To find which case occurs, we query the oracle in the second round with $n > 1$ different plaintexts $M_i$, such that four randomly chosen bytes are identical for all $i$, and with the remaining 12 bytes set as random. For each plaintext, the attacker queries the oracle for the entire state matrix. If case (b) occurs, at least one column will have the same four oracle answers for all plaintexts. The oracle queries of the other columns will result in different answers with high probability. More precisely, as the oracle answer is two bits, the probability that another column shows the same answer over all $n$ plaintexts for all the four bytes is $2^{-8(n-1)}$. Thus, if case (a) occurs, it is likely that the oracle queries will show differences for all columns. Increasing the number of plaintexts $n$ reduces the likelihood of false positives.
Putting Things Together: The overall strategy to perform the column mapping consists of randomly fixing four bytes and then querying the oracle to detect if case (a) or (b) happened. If case (b) is detected, then the adversary knows which column the corresponding indices of the fixed bytes will map to after Shift-Rows. If case (a) is detected, the adversary repeats the operation by fixing different byte indices until case (b) is detected. The probability to randomly find a group of four bytes mapping to a given column is $\binom{16}{4}^{-1} = \frac{1}{1820}$. As there are four columns, the probability to randomly find a column mapping is thus $\frac{4}{1820} = \frac{1}{455}$. Overall, as we are querying the full state for $n$ plaintexts, the attacker will have (roughly) a 50% chance to find a column mapping in $n \times 16 \times 223 = 3568n$ queries. As there are 4 columns, and as each column might produce a false positive with probability $2^{-8(n-1)}$ when using $n$ plaintexts, the overall probability of not getting a false positive is $p = (1 - 2^{-8(n-1)})^{4 \times 223}$. As a result, the probability that at least one column will exhibit a false positive among all the executions is $1 - p$. While with only two plaintexts the probability to get a false positive among all the executions is as high as 0.97, it quickly drops to $2 \times 10^{-7}$ for five plaintexts. Finding the next column mapping is simpler since we already found the previous ones: the probabilities increase to $\binom{12}{4}^{-1} = \frac{1}{495}$ for the second one, $\binom{8}{4}^{-1} = \frac{1}{70}$ for the third, and finally 1 for the last one. More formally, Algorithm 2 describes the procedure to find the first column mapping. In the algorithm, $a \leftarrow$ random sets $a$ to a random value between 0 and 255.
Algorithm 2 Column mapping procedure. Input: Oracle $O$, $n$ the number of plaintexts.
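The column mapping strategy can be exercised on a toy model of the first round and a half. The sketch below is not Algorithm 2: it enumerates all four-byte subsets instead of sampling them at random, it uses random bijections as stand-ins for the key-dependent S-Boxes and a random byte permutation as Shift-Rows, and the state is a flat list in which index `4 * column + row` is an arbitrary layout choice. All names are hypothetical.

```python
import random
from itertools import combinations


def xtime(a):
    """Multiply by 2 in GF(2^8) modulo the AES polynomial."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


def mix_column(c):
    """AES Mix-Columns applied to one column of four bytes."""
    a0, a1, a2, a3 = c
    t = a0 ^ a1 ^ a2 ^ a3
    return [a0 ^ t ^ xtime(a0 ^ a1), a1 ^ t ^ xtime(a1 ^ a2),
            a2 ^ t ^ xtime(a2 ^ a3), a3 ^ t ^ xtime(a3 ^ a0)]


class ToyCipher:
    """First round plus second Add-Round-Key, with secret S-Boxes and Shift-Rows."""

    def __init__(self, rng):
        self.rk0 = [rng.randrange(256) for _ in range(16)]
        self.rk1 = [rng.randrange(256) for _ in range(16)]
        self.sboxes = [rng.sample(range(256), 256) for _ in range(16)]
        self.perm = rng.sample(range(16), 16)  # byte s of C_0 moves to perm[s] in D_0

    def oracle_round2(self, m):
        """Two MSBs of every byte of B_1 for the plaintext m (16 bytes)."""
        c = [self.sboxes[s][m[s] ^ self.rk0[s]] for s in range(16)]
        d = [0] * 16
        for s in range(16):
            d[self.perm[s]] = c[s]
        a1 = []
        for col in range(4):
            a1 += mix_column(d[4 * col:4 * col + 4])
        return [(a1[s] ^ self.rk1[s]) >> 6 for s in range(16)]


def fixed_column(cipher, subset, n, rng):
    """Column of B_1 whose answers stay constant over n plaintexts, or None."""
    fixed = {s: rng.randrange(256) for s in subset}
    reference, candidates = None, {0, 1, 2, 3}
    for _ in range(n):
        m = [fixed[s] if s in fixed else rng.randrange(256) for s in range(16)]
        answers = cipher.oracle_round2(m)
        if reference is None:
            reference = answers
            continue
        candidates = {col for col in candidates
                      if answers[4 * col:4 * col + 4] == reference[4 * col:4 * col + 4]}
        if not candidates:
            return None
    return min(candidates)


def column_mapping(cipher, n=12, seed=1):
    """Map each destination column to the set of source bytes that land in it."""
    # n is chosen generously: in this toy model a column with a single varying
    # byte exposes only a few bits per query, so small n risks false positives
    rng = random.Random(seed)
    mapping = {}
    for subset in combinations(range(16), 4):
        col = fixed_column(cipher, subset, n, rng)
        if col is not None:
            mapping[col] = set(subset)
    return mapping


cipher = ToyCipher(random.Random(7))
mapping = column_mapping(cipher)
print(mapping)
```

Most subsets are rejected after the second plaintext, so the exhaustive enumeration stays cheap; the random sampling described in the text trades this determinism for a lower expected number of queries.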

Column Ordering
From the previous method, we now know how to find which group of four bytes is mapped to a single column after the key-dependent Shift-Rows operation. However, we still do not know the byte ordering within this column. Without loss of generality, we will now assume, for simplicity, that the Shift-Rows operation maps the first column to itself. That is, we know that $SR_0[i_k, 0] = (i'_k, 0)$, $k \in [0, 3]$. We now aim to find the corresponding mappings $i_k \to i'_k$. For more clarity, Figure 4 represents the goal of this step. That is, we aim at finding which indices map to which before and after the Shift-Rows operation for the first column. Naturally, we will repeat this process for the remaining three columns. For that purpose, we again look at the memory accesses from the second round S-Boxes. We now show that when fixing three out of these four bytes to some constant, the four corresponding cache accesses in the second round allow us to reveal the exact position of the fourth varying byte after the Shift-Rows operation.

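Still assuming that Shift-Rows maps the first column to itself, Equation 5 shows the values of $B_1[i, 0]$, $i \in [0, 3]$, which are the four inputs used to access the S-Boxes during the second round. The operator $\times$ denotes the multiplication over $\mathbb{F}_{256}$.
$$\begin{aligned}
B_1[0, 0] &= 2 \times D_0[0, 0] \oplus 3 \times D_0[1, 0] \oplus D_0[2, 0] \oplus D_0[3, 0] \oplus RK_1[0, 0] \\
B_1[1, 0] &= D_0[0, 0] \oplus 2 \times D_0[1, 0] \oplus 3 \times D_0[2, 0] \oplus D_0[3, 0] \oplus RK_1[1, 0] \\
B_1[2, 0] &= D_0[0, 0] \oplus D_0[1, 0] \oplus 2 \times D_0[2, 0] \oplus 3 \times D_0[3, 0] \oplus RK_1[2, 0] \\
B_1[3, 0] &= 3 \times D_0[0, 0] \oplus D_0[1, 0] \oplus D_0[2, 0] \oplus 2 \times D_0[3, 0] \oplus RK_1[3, 0]
\end{aligned} \qquad (5)$$
When $D_0[0, 0]$ is the only varying byte of the column, we have
$$\begin{aligned}
B_1[0, 0] &= 2 \times D_0[0, 0] \oplus cst_0 \\
B_1[1, 0] &= D_0[0, 0] \oplus cst_1 \\
B_1[2, 0] &= D_0[0, 0] \oplus cst_2 \\
B_1[3, 0] &= 3 \times D_0[0, 0] \oplus cst_3
\end{aligned}$$
where $cst_i$ denote unknown constants that depend on the fixed bytes of the column, and hence on $RK_0[i, 0]$, and on $RK_1[i, 0]$. In particular, $B_1[1, 0]$ and $B_1[2, 0]$ depend on the varying byte in the same way, up to a constant. Such dependency can be observed when looking at the cache access patterns. That is, the cache accesses from $SB^1_{1,0}$ and $SB^1_{2,0}$ will be exactly the same up to a permutation given by the two MSBs of $cst_1 \oplus cst_2$. More formally, we have $O(M, K, 1, 1, 0) = O(M, K, 1, 2, 0) \oplus ((cst_1 \oplus cst_2) \gg 6)$. However, due to the non-linearity of the multiplication by 2 and 3, the cache accesses from $SB^1_{0,0}$ and $SB^1_{3,0}$ will be different from the ones from $SB^1_{1,0}$ and $SB^1_{2,0}$.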
Table 1 illustrates this phenomenon, where 0x and 0b denote hexadecimal and binary notations, respectively. The variable $x$ represents $C_0[SR_0[0, 0]]$. We arbitrarily set the constants $cst_0$ = 0xb8, $cst_1$ = 0x85, $cst_2$ = 0x32 and $cst_3$ = 0xdf. As a result, each column corresponds to the value computed in $B_1[i, 0]$. For four different arbitrarily chosen values of $x$, the table shows the two MSBs of the given operation, which corresponds to the oracle response. That is, the column $i$ corresponds to $O(M, K, 1, i, 0)$. As we can see, we have $O(M, K, 1, 1, 0) = O(M, K, 1, 2, 0) \oplus$ 0b10, and 0b10 = (0x85 ⊕ 0x32) $\gg 6$. Interestingly, we can detect other specific patterns if the only varying byte is at another position than $D_0[0, 0]$. For example, if $C_0[SR_0[1, 0]]$ is the only varying value, we would observe a similar cache access pattern for both $SB^1_{2,0}$ and $SB^1_{3,0}$. This means that when fixing three out of the four bytes within this column, the cache access patterns in the second round reveal information about which bytes are fixed after the Shift-Rows operation, thus revealing the mapping of the one byte that is left as varying.
Algorithm 3 Column ordering procedure.
Input: Oracle $O$, byte index $x$ to map, $n$ number of plaintexts.
Output: Index $y$ such that $C_0[x, 0]$ maps to $D_0[y, 0]$.

If we reuse the example of Figure 4, the attacker can decide to fix the bytes of $C_0$ with indices 1, 2 and 3 and leave the byte of index 0 varying over several plaintexts. After the Shift-Rows operation, only the byte $D_0[2, 0]$ in the first column will vary, while the others are fixed. This information is not yet known to the attacker. They would then query the oracle for accesses to all four S-Boxes $SB^1_{0,0}$, $SB^1_{1,0}$, $SB^1_{2,0}$ and $SB^1_{3,0}$. With the same reasoning as in the equations above, they will observe that $SB^1_{0,0}$ and $SB^1_{3,0}$ trigger the same cache access pattern up to some constant, while the other two S-Boxes show completely different patterns. Such a pattern is the footprint of $D_0[2, 0]$ being the only varying byte. The adversary would then know that $C_0[0, 0]$ maps to $D_0[2, 0]$. Note that several plaintexts are necessary to identify the different patterns of $SB^1_{1,0}$ and $SB^1_{2,0}$. That is, given $n$ plaintexts, each of these S-Boxes has a probability of $2^{-2n}$ of incorrectly exhibiting the same pattern as $SB^1_{0,0}$ and $SB^1_{3,0}$. This probability can be made arbitrarily small by increasing the number of plaintexts. Repeating this procedure with the other three bytes of $C_0$ completely reveals the ordering of the first column.
More formally, the procedure to recover the column ordering is given by Algorithm 3. Note that the algorithm is written assuming (without loss of generality) that the first column of $C_0$ maps to the first column of $D_0$. It can be trivially adapted to the needed column mapping. Array($n$) denotes a function that returns an array of size $n$, and $a \leftarrow$ random sets $a$ to a random value between 0 and 255.
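As an illustration of the column ordering idea, the following Python sketch runs it against a simulated column. It is not a transcription of Algorithm 3: the simulated victim (`make_oracle`), the random stand-in S-Boxes, and the AES Mix-Columns coefficients are all assumptions made for the example. The attacker code only uses the two-bit oracle outputs.

```python
import random

MC = ((2, 3, 1, 1), (1, 2, 3, 1), (1, 1, 2, 3), (3, 1, 1, 2))  # assumed AES Mix-Columns rows

def gmul(a, c):
    """Multiply a by c in {1, 2, 3} over F_256 (AES polynomial assumed)."""
    d = ((a << 1) ^ (0x11B if a & 0x80 else 0)) & 0xFF
    return a if c == 1 else d if c == 2 else d ^ a

def make_oracle(order, seed=1):
    """Simulated victim column; order[x] = y means C_0[x, 0] moves to D_0[y, 0]."""
    rng = random.Random(seed)
    sboxes = [rng.sample(range(256), 256) for _ in range(4)]  # secret stand-in S-Boxes
    rk0 = [rng.randrange(256) for _ in range(4)]
    rk1 = [rng.randrange(256) for _ in range(4)]

    def oracle(m):
        c = [sboxes[i][m[i] ^ rk0[i]] for i in range(4)]
        d = [0] * 4
        for x, y in enumerate(order):
            d[y] = c[x]
        b1 = [gmul(d[0], MC[i][0]) ^ gmul(d[1], MC[i][1]) ^ gmul(d[2], MC[i][2])
              ^ gmul(d[3], MC[i][3]) ^ rk1[i] for i in range(4)]
        return [v >> 6 for v in b1]  # two MSBs of each second-round S-Box input
    return oracle

# Pair of output rows with coefficient 1 -> row of the varying byte after Shift-Rows.
PAIR_TO_ROW = {(1, 2): 0, (2, 3): 1, (0, 3): 2, (0, 1): 3}

def column_ordering(oracle, x, values):
    """Return y such that C_0[x, 0] maps to D_0[y, 0], or None if ambiguous."""
    observations = []
    for v in values:
        m = [0, 0, 0, 0]  # the three other bytes stay fixed to a constant
        m[x] = v
        observations.append(oracle(m))
    matches = [pair for pair in PAIR_TO_ROW
               if len({o[pair[0]] ^ o[pair[1]] for o in observations}) == 1]
    return PAIR_TO_ROW[matches[0]] if len(matches) == 1 else None

secret_order = [2, 0, 3, 1]
oracle = make_oracle(secret_order)
recovered = [column_ordering(oracle, x, range(256)) for x in range(4)]
print(recovered)
```

The example queries all 256 values of the varying byte so that the result is deterministic; with $n$ random plaintexts, as in the text, a wrong pair survives only with small probability.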

Second Round Attack
The previous section shows how to fully recover the unknown permutation used during the first Shift-Rows operation. With that knowledge, we can now exploit the memory accesses of the second-round S-Boxes to recover information about the first and second round keys. More specifically, we show how to use the oracle in the second round to recover the key bytes of the first and second rounds in a divide-and-conquer manner. For simplicity, and without loss of generality, we assume that the first Shift-Rows operation is the identity mapping, and thus that $C_0 = D_0$.
We start by showing how to recover both $RK_0[0, 0]$ and $RK_1[0, 0]$. For that purpose, Equation 8 shows the input value $B_1[0, 0]$ of the first S-Box $SB^1_{0,0}$ in the second round.

Fixing the plaintext bytes $M[1, 0]$, $M[2, 0]$ and $M[3, 0]$, we can rewrite the equation in terms of a single unknown constant $cst$ that depends on $RK_0[i, 0]$ and $SB^0_{i,0}$, $i \in [1, 3]$. As this value is constant, the cache access of $SB^1_{0,0}$, which depends on $B_1[0, 0]$, will show the same pattern as the value

$$B' = B_1[0, 0] \oplus cst = 2 \times SB^0_{0,0}(M[0, 0] \oplus RK_0[0, 0]).$$

We note that $B'$ only depends on $M[0, 0]$, $RK_0[0, 0]$ and $SB^0_{0,0}$, which in turn depends on $RK_1[0, 0]$. By trying all possible values of $RK_0[0, 0]$ and $RK_1[0, 0]$, we can thus predict the corresponding value of $B'$. For the correct subkey guesses, the predicted $B'$ and the observed cache accesses of $SB^1_{0,0}$ exhibit the same pattern. For wrong subkey guesses, however, the non-linearity of $SB^0_{0,0}$ produces a difference between the predicted $B'$ and the observed cache accesses. Assuming that $n$ plaintexts are used, each candidate pair for $RK_0[0, 0]$ and $RK_1[0, 0]$ (among the $2^{16}$ pairs, or $2^{14}$ if the first round attack has been performed) has a $2^{-2n}$ chance of incorrectly exhibiting the same cache pattern as $B'$, and thus of being incorrectly classified as the correct key. The number of false positives can be made arbitrarily small by increasing the number of plaintexts $n$. Naturally, the same procedure can be repeated to recover the remaining pairs of key bytes.

The procedure to attack the second round is given by Algorithm 4. $B'_{x,y,z} = 2 \times SB(x \oplus y)$ denotes a guess on the value $B'$ when the message byte is equal to $x$, the first round key byte is equal to $y$, and the second round key byte affecting the S-Box $SB$ is equal to $z$. $\mathcal{K}_0$ and $\mathcal{K}_1$ respectively denote the spaces of the key bytes $i, j$ of the first and second rounds. Note that the algorithm is written assuming that the first Shift-Rows is the identity mapping. It can, however, trivially be adapted to any known permutation.
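The divide-and-conquer search over pairs of key bytes can be sketched in Python as follows. This is a toy model under explicit assumptions: the table `SBOX` is a hypothetical stand-in for the key-dependent S-Boxes (one random permutation per key byte $z$, not Pilsung's real derivation), `observe` simulates the oracle, and the key values and constant are arbitrary. The search keeps every guess whose predicted two MSBs match the observations up to a constant.

```python
import random

def xtime(b):
    """Multiply b by 2 in F_256 (AES polynomial assumed)."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

# Hypothetical stand-in: one S-Box per second-round key byte z.
SBOX = [random.Random(z).sample(range(256), 256) for z in range(256)]

def observe(m, rk0, rk1, cst):
    """Simulated oracle: two MSBs of B_1[0, 0] for plaintext byte m."""
    return (xtime(SBOX[rk1][m ^ rk0]) ^ cst) >> 6

def second_round_attack(plaintexts, observations):
    """Return all guesses (y, z) consistent with the observations up to a constant."""
    candidates = []
    for y in range(256):
        for z in range(256):
            sb = SBOX[z]
            # Constant offset implied by the first observation.
            delta = (xtime(sb[plaintexts[0] ^ y]) >> 6) ^ observations[0]
            if all((xtime(sb[m ^ y]) >> 6) ^ o == delta
                   for m, o in zip(plaintexts, observations)):
                candidates.append((y, z))
    return candidates

rk0, rk1, cst = 0x3A, 0xC5, 0x6E            # arbitrary secrets for the simulation
plaintexts = random.Random(2024).sample(range(256), 30)
observations = [observe(m, rk0, rk1, cst) for m in plaintexts]
candidates = second_round_attack(plaintexts, observations)
print(len(candidates), (rk0, rk1) in candidates)
```

The correct pair always survives the filter; how many wrong pairs survive depends on the number of plaintexts and on the structure of the S-Box family.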

Complexity Summary
In this section, we briefly summarize the complexity of each step in terms of oracle queries. The variable $n$ represents the (tunable) number of plaintexts used for a given attack step.
• First round attack: To recover the 2 MSBs of a given byte $RK_0[i, j]$, the first round attack queries the oracle once in the first round. Repeating this procedure for each byte completes the attack in 16 oracle queries.
• Column mapping: The first step of the recovery of $SR_0$ requires randomly finding sets of 4 bytes that are mapped to the same column after the permutation. It additionally requires repeating the procedure for $n$ plaintexts in order to reduce the probability of false positives. The first column mapping needs $(16 \times 223)n$ oracle queries. Similarly, the second and the third ones require $(12 \times 83)n$ and $(8 \times 18)n$ queries. We assume $n = 4$, as it already produces a very low false-positive probability. Overall, this step requires $(16 \times 223 + 12 \times 83 + 8 \times 18)n = 4708n = 18832$ oracle queries.
• Column ordering: The second step of the recovery of $SR_0$ requires fixing all but one of the bytes of a column, and then querying the oracle for the entire column. This procedure needs to be repeated for $n$ plaintexts in order to avoid false positives. After the first byte ordering is found, the second one only needs to query the oracle for the 3 remaining bytes of the column. As a result, reversing one column requires $(4 + 3 + 2)n$ queries. We again assume $n = 4$ to keep the probability of false positives small. As there are four columns to recover, the overall complexity of this step is $4(4 + 3 + 2)n = 36n = 144$ oracle queries.
• Second round attack: Finally, the second round attack recovers a pair of bytes of $RK_0$ and $RK_1$ by querying the oracle for a fixed byte index in the second round. As shown in Section 5, $n = 30$ is sufficient to recover the correct pair (note that a lower value could also be used, potentially at the cost of some enumeration). As this procedure has to be repeated for each byte, the complexity is $16 \times 30 = 480$ oracle queries.
Overall, fully breaking the cipher requires 16 + 18832 + 144 + 480 = 19472 oracle queries.
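The totals above can be checked with a short computation. The variable names are ours; the figures are those given in the list.

```python
n = 4                                       # plaintexts per test in the Shift-Rows steps
first_round = 16                            # one query per key byte
column_mapping = (16 * 223 + 12 * 83 + 8 * 18) * n
column_ordering = 4 * (4 + 3 + 2) * n       # four columns
second_round = 16 * 30                      # n = 30 for this step
total = first_round + column_mapping + column_ordering + second_round
print(first_round, column_mapping, column_ordering, second_round, total)
# prints: 16 18832 144 480 19472
```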

Experimental Results
We empirically confirm our analysis in Section 4 via a cache attack. As our attack is aimed at highlighting the weaknesses of Pilsung's implementation, we assume the strong attack model where the attacker has sufficient control over the cipher to perform a synchronous attack. To obtain cache usage information, we use the Prime+Probe attack from the Mastik toolkit [Yar16]. All experiments were performed on a MacBook Air (13-inch, Mid 2013 model) with an Intel Core i5-4250U processor at 1.30GHz and 4GB of 1600MHz LPDDR3 memory. The attack targets the L1 data cache of the machine, which, as is typical of Intel processors of this generation, is a 32 KiB, 8-way set-associative cache with 64 sets and 64-byte cache lines.

S-Boxes Memory Layout
The Pilsung implementation [Kry18b] initializes an array of S-Boxes when setting the key. Although the cipher only uses 10 rounds, the array has room for 30. For each round, the array consists of a 4 × 4 matrix of S-Boxes, where each S-Box is a 256-byte array.
Due to the way that the implementation allocates the context for the cipher, the S-Boxes array is not aligned in the cache, allowing a trivial key-recovery attack (see Section 3). We correct this, ensuring that each S-Box starts at a cache line boundary. With cache lines of 64 bytes, each S-Box occupies four cache lines, and each round occupies 16 × 4 = 64 cache lines. Because the S-Boxes are consecutive in memory, and due to the linear mapping of memory addresses to cache sets, S-Boxes for the same byte of different rounds overlap in the cache.
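The overlap between rounds follows directly from the sizes involved, as the following Python sketch illustrates. It assumes that the array starts at an address mapping to cache set 0 and that the 4 × 4 matrix is stored row by row; both are assumptions made for the example, and `cache_set` is a hypothetical helper.

```python
LINE_SIZE = 64    # bytes per cache line
NUM_SETS = 64     # sets in the L1 data cache
SBOX_SIZE = 256   # bytes per S-Box

def cache_set(rnd, i, j, index):
    """Cache set touched when S-Box (i, j) of round rnd is read at the given index.

    Assumes the array is aligned to set 0 and stored row by row (hypothetical layout).
    """
    offset = (rnd * 16 + 4 * i + j) * SBOX_SIZE + index
    return (offset // LINE_SIZE) % NUM_SETS

# Same byte, different rounds: identical cache set.
print(cache_set(0, 1, 2, 0x80), cache_set(1, 1, 2, 0x80), cache_set(9, 1, 2, 0x80))
# The two MSBs of the index select one of the four lines of an S-Box.
print(sorted({cache_set(0, 0, 0, v) for v in range(256)}))
```

Each round spans 64 cache lines, which is exactly the number of sets, so the set index depends only on the byte position and on the two most significant bits of the S-Box input.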

First Round Attack
For the first round attack, we generate $10^6$ random plaintexts and perform a synchronous Prime+Probe attack while encrypting each. We calculate the average probe time for each cache set to get the baseline cache activity level. To target a key byte, we split the results based on the value of the corresponding plaintext byte, and average the probe times for each of the targeted byte's values. Following Osvik, Shamir and Tromer [OST06], we normalize the results by subtracting the baseline probe time from the per-value average. As Figure 5 shows, the activity in each of the cache sets is mostly uniform, except for four cache sets. In these cache sets, which cover the S-Box of the targeted byte, one cache set shows more activity (i.e., longer probes) than the others. The cache set that shows more activity provides us with the first round oracle for the targeted byte. For example, if the first cache set is active, we have $O(M, K, 0, i, j) = 0$.
Observing the results for the byte [0, 0], we see that when the value of the plaintext is between 0 and 63 the third cache set in the S-Box is active. This implies that for $0 \le M \le 63$, we have $O(M, K, 0, 0, 0) = 1$, indicating that the top two bits of the corresponding key byte are 01. Similarly, looking at the right half of the figure, we can determine that the top two bits of the byte [1, 3] of the key are 11.
Note that the $10^6$ plaintexts used to generate Figure 5 are not all needed to complete the attack. Indeed, instead of monitoring the cache accesses for all 256 values that the target byte can take, one could focus on a single line of the figure to recover the two MSBs of the key, with only $10^6 / 256$ plaintexts. Repeating this procedure for all 16 bytes allows the recovery of the two MSBs of $RK_0[i, j]$, $i, j \in [0, 3]$, which concludes the first round attack.
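The normalization and the recovery of the two MSBs can be sketched in Python as follows. The function and the synthetic, noise-free trace are illustrative assumptions; they model the oracle as the two MSBs of the plaintext byte XORed with the key byte, and do not reproduce the Mastik-based measurement code.

```python
def first_round_msbs(samples):
    """Recover the two MSBs of a key byte from probe measurements.

    samples: list of (plaintext_byte, probe_times), where probe_times holds the
    probe time of the four cache sets covering the targeted S-Box.
    """
    sums = [[0.0] * 4 for _ in range(256)]
    counts = [0] * 256
    for m, times in samples:
        counts[m] += 1
        for s in range(4):
            sums[m][s] += times[s]
    total = sum(counts)
    baseline = [sum(sums[m][s] for m in range(256)) / total for s in range(4)]

    votes = [0, 0, 0, 0]
    for m in range(256):
        if counts[m] == 0:
            continue
        normalized = [sums[m][s] / counts[m] - baseline[s] for s in range(4)]
        active = max(range(4), key=lambda s: normalized[s])  # oracle output for value m
        votes[active ^ (m >> 6)] += 1                         # MSBs(K) = oracle xor MSBs(M)
    return max(range(4), key=lambda k: votes[k])

# Synthetic trace: the accessed set is 20 time units slower than the others.
key = 0x5C
samples = [(m, [100 + (20 if s == (m ^ key) >> 6 else 0) for s in range(4)])
           for m in range(256)]
print(f"0b{first_round_msbs(samples):02b}")
```

With real measurements, noise and accesses from other rounds are averaged out by using many samples per plaintext value.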

Reversing the Shift-Rows Permutation
As Section 4.3 describes, reversing the Shift-Rows permutation proceeds in two steps. We first brute force the mapping of state bytes to columns and then determine the new order of the bytes in the columns. Figure 6 shows the actual permutation $SR_0$ that is used in our experiments. This information is unknown to the adversary, who aims at recovering it.

Column mapping
For the first step, we rely on the observation that if the attacker fixes the four bytes that map to the same column, the state values in the column are fixed in the second round, and consequently, the same cache sets will be accessed in the second-round S-Boxes of the column. To perform the attack we iterate over all of the $\binom{16}{4}$ combinations of four state bytes. For each such combination we select $10^4$ random plaintexts while fixing the values of the four state bytes we test, i.e., ensuring that the values of each of the four bytes do not change between the selected plaintexts. Note that while this high number of plaintexts originally aims at cancelling the measurement noise, it also directly prevents false positives, as shown in Section 4.3.1. We perform Prime+Probe attacks with the selected plaintexts and average the probe times for each of the cache sets. Figure 7 shows the normalized probe times in two cases. The left part of the figure shows the case where all four bytes map to the same column. More precisely, the plaintext bytes of indices 15, 10, 4 and 0 are fixed. The right part shows the case where three bytes are in the same column and the fourth in another, obtained by fixing the bytes of indices 14, 10, 4 and 0.
When fixing a byte in the plaintext, we expect the first round to always access the same cache set of the corresponding S-Box. Thus, when fixing four bytes, we expect to see four cache sets that have higher than average activity. Indeed, both parts of the figure show some peaks that indicate increased activity. In the right part of the figure we only see the four peaks that correspond to the bytes of indices 14, 10, 4 and 0. This corresponds to the case where the if condition at line 18 of Algorithm 2 is false, indicating that no column is fixed after $SR_0$. Conversely, in the left part, we see increased activity in 8 cache sets. Four of these correspond to the first-round S-Box accesses (with indices 15, 10, 4 and 0) and the other four correspond to the second-round accesses. This corresponds to the if condition being true, thus detecting a fixed column after $SR_0$. As a result, counting the number of cache sets with increased activity identifies the case where all four selected bytes map to the same column. Moreover, the corresponding sets give the position of the fixed column, which is the first one in our case. We have thus recovered the column mapping from the bytes 15, 10, 4 and 0 to the first column after $SR_0$. Repeating this operation for the other columns finalizes the column mapping procedure.
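The decision rule of this step, counting the cache sets with increased activity, can be sketched in Python. The threshold value and the synthetic traces are hypothetical and would need tuning on real measurements; the rule tests for more than four active sets because a second-round access may fall on a set that is already active in the first round.

```python
from itertools import combinations

CANDIDATES = list(combinations(range(16), 4))  # all sets of four state bytes

def active_sets(normalized, threshold):
    """Indices of the cache sets whose normalized probe time exceeds the threshold."""
    return [s for s, t in enumerate(normalized) if t > threshold]

def column_is_fixed(normalized, threshold=5.0):
    """True when more peaks are visible than the four first-round ones."""
    return len(active_sets(normalized, threshold)) > 4

# Synthetic normalized traces over the 64 cache sets (illustrative values only).
same_column = [0.0] * 64
for s in (0, 1, 6, 10, 17, 41, 44, 61):
    same_column[s] = 12.0
other_case = [0.0] * 64
for s in (1, 17, 41, 58):
    other_case[s] = 12.0

print(len(CANDIDATES), column_is_fixed(same_column), column_is_fixed(other_case))
```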

Column ordering
After determining the state bytes that map to a column, we need to find the order of the bytes within the column. As noted in Section 4.3, we rely on the construction of the Mix-Columns. Specifically, we aim at recovering the ordering within the first column. To that end, we fix the three plaintext bytes of indices 15, 10 and 4 and leave byte 0 varying. To identify the new position of this plaintext byte, we perform $10^6$ Prime+Probe attacks and then split the results based on the targeted byte. Figure 8 shows the results, with full results on the left side and highlighted details on the right. When the plaintext byte changes, the values of all four bytes in the column change, and with them the cache sets accessed in the second round. However, as noted in Section 4.3, due to the construction of the Mix-Columns matrix, the effect of the change on two of the bytes will be identical up to exclusive or with a constant, whereas the other two bytes will show different changes.
Observing the cache sets of the first and second S-Boxes, highlighted in blue in the right side of Figure 8, we can see that whenever the first cache set is active in the first S-Box, the third cache set is active in the second, and vice versa. The same holds for the second and fourth cache sets. That is, the oracle for the first S-Box is the same as for the second S-Box up to exclusive or with 2. This property holds for all plaintext values of the byte, indicating that in Equation 5 the value we vary is multiplied by 1 for the first and second outputs of the Mix-Columns step. Note the presence of two active cache sets for the first S-Box. This is because the plaintext byte 0 triggers the cache sets associated with the first S-Box in both the first and second rounds of the cipher. However, the corresponding set can be ignored, as it is known from the first round attack. Thus, we can conclude that byte 0 is in the last row of the first column after the application of $SR_0$, as shown in Algorithm 3. Repeating the process for each of the plaintext bytes allows us to completely recover the Shift-Rows permutation.

Second Round Attack
Recall (Section 4.4) that for the second round attack, we only need an oracle for the second round access to one of the S-Boxes of the column the targeted byte maps to. We note that Figure 8 already provides the information we require, and is enough to validate the attack. Yet, for completeness, we adapt the formula of Section 4.4 to the actual permutation $SR_0$ used in our practical experiment. That is, Equation 11 adapts Equation 8, allowing us to recover both $RK_0[0, 0]$ and $RK_1[3, 0]$. As for the column ordering procedure, we fix the plaintext bytes $M[3, 3]$, $M[2, 2]$ and $M[0, 1]$ and rewrite the equation in terms of an unknown constant $cst$ that depends on $RK_0[i, 0]$, $i \in [1, 3]$, $SB^0_{i,0}$, $i \in [0, 2]$, and $RK_1[0, 0]$. Again, as this value is constant, the observed cache access of $SB^1_{0,0}$ will have the same pattern as the value $B' = B_1[0, 0] \oplus cst$. As explained for Equation 11, this relation can be used to (in this case) simultaneously recover both $RK_0[0, 0]$ and $RK_1[3, 0]$. That is, by trying all possible values of the remaining 6 bits of $RK_0[0, 0]$ and the 8 bits of $RK_1[3, 0]$, we predict the corresponding cache access of $B'$. The correct pair $(RK_0[0, 0], RK_1[3, 0])$ is found by comparing the predictions to the actual cache accesses, which are already validated by Figure 8. We use this information to validate that with 30 values of the targeted byte we can always recover the corresponding key bytes for the first and second rounds.

Discussion and Conclusions
Countermeasures. Multiple approaches for addressing cache attacks have been suggested. Randomization-based approaches [LWML16; WUG+19] decouple memory lines from cache sets. While such approaches may prevent an attacker from finding the cache sets that correspond to memory locations, they do not protect against attacks that rely on cache capacity [SKH+19], and vulnerabilities may remain.
Approaches that partition the cache [DJL+12; KPM12; LGY+16; SSCZ11; ZRZ16] prevent adversaries from manipulating the cache state of the victim's cache partition. These approaches often rely on the control the operating system has over the mapping of virtual addresses, and do not apply to the L1 cache we attack. To protect the L1 cache, it may be possible to prevent concurrent access to it and to flush its contents upon context switch [GYCH19]. However, as Ge et al. [GYH18] note, it may be impossible to completely eliminate hardware leakage in current processors.
Ensuring that cache lines are accessed in a secret-independent order [BGNS06] can prevent our attack, but may still be susceptible to attacks that have sub-cache-line resolution [IMB+19a; YGH16]. Preventing secret-dependent memory access using constant-time programming [BLS12] is considered a secure approach for protecting cryptographic implementations. When applied to Pilsung, this approach would require reading all of the values in an S-Box whenever it is applied. Fortunately for Pilsung, each encryption accesses each of the 160 S-Boxes exactly once. Thus, scanning each of these tables results in a sequential memory access pattern throughout the memory allocated for the S-Boxes. Such access patterns tend to be highly optimized in modern processors, and the optimization is likely to mask some of the added overhead of scanning the whole S-Box. We leave the task of implementing this countermeasure and benchmarking its performance to future work.
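A full-table scan of the kind described above can be sketched as follows. Python gives no constant-time guarantees, so this only illustrates the access pattern (every entry is read, and the wanted one is selected with a mask); a real countermeasure would be written in C or assembly and verified on the target platform. The function name is hypothetical.

```python
def scan_lookup(table, index):
    """Read every entry of a 256-byte table and keep only table[index]."""
    result = 0
    for i, value in enumerate(table):
        diff = i ^ index
        mask = ((diff - 1) >> 8) & 0xFF  # 0xFF when i == index, 0x00 otherwise
        result |= value & mask
    return result

table = list(range(255, -1, -1))  # any 256-byte table
print(scan_lookup(table, 0x2A) == table[0x2A])
```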
On the use of key-dependent S-Boxes. Interestingly, the use of key-dependent S-Boxes weakens the security of the overall implementation against cache-based side-channel attacks. This is because, in the implementation analyzed in this paper, each of the 160 S-Boxes is stored in a different part of the memory, where accesses to them trigger different cache sets. Thus, by observing a specific cache set the attacker gets more precise information about specific bytes of the secret key. As each Pilsung S-Box consists of the original AES S-Box with an additional key-dependent permutation on the output bits, the implementation could store only one copy of the AES S-Box and compute the permutations online. This would avoid leaking which S-Box is accessed, but at a cost to performance. Moreover, even the single copy of the AES S-Box can be calculated online using the algebraic equations used for its construction, effectively eliminating any microarchitectural leakage.
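For reference, computing the AES S-Box algebraically (inversion in $\mathbb{F}_{256}$ followed by the affine transformation) can be sketched in Python as below. The code avoids table lookups and secret-dependent branches, but Python itself offers no timing guarantees, so it illustrates the computation rather than a hardened implementation. The key-dependent output permutation of Pilsung is not modeled.

```python
def gf_mul(a, b):
    """Multiply a and b in F_256 modulo x^8 + x^4 + x^3 + x + 1, without branches."""
    p = 0
    for _ in range(8):
        p ^= a & -(b & 1)                       # add a when the low bit of b is set
        hi = (a >> 7) & 1
        a = ((a << 1) & 0xFF) ^ (0x1B & -hi)    # multiply a by x and reduce
        b >>= 1
    return p

def gf_inv(x):
    """Multiplicative inverse as x^254 (maps 0 to 0)."""
    r = 1
    for _ in range(254):
        r = gf_mul(r, x)
    return r

def rotl8(b, n):
    return ((b << n) | (b >> (8 - n))) & 0xFF

def aes_sbox(x):
    """AES S-Box: field inversion followed by the affine transformation."""
    b = gf_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63

print(hex(aes_sbox(0x00)), hex(aes_sbox(0x01)), hex(aes_sbox(0x53)))
# prints: 0x63 0x7c 0xed
```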
Limitations. The main focus of this work is to demonstrate and to analyze the weakness of the Pilsung implementation as reverse-engineered in [Kry18b]. This results in a few limitations of the attack presented in this paper. First, we trust the correctness of the reverse-engineering effort and that the source indeed matches the original Pilsung code and specifications. While we have no reason to doubt the correctness of the published code, in case of a mismatch between the published code and the original cipher our attack only applies to the code of [Kry18b].
The cache attack we demonstrate makes several assumptions that render it impractical for a real attack scenario. First, our attack is synchronous, implying that the attacker has the ability to trigger cipher invocations at precise times (probing the cache between invocations). While synchronous attacks are common in the literature [IAES15; OST06; WHS12; ZW10], we acknowledge that they represent a weaker attack scenario. However, we note that the gap between weak attack scenarios, such as ours, and more realistic attacks has been researched in the past [OST06]. As such, we are confident that the attack could apply in more realistic scenarios, albeit with more noise and consequently more samples.
Finally, our attack targets the L1 cache, which implies that the attacker and victim must be located on the same core. We note that performing Prime+Probe attacks on the last-level cache (LLC) is a solved problem [IAES15; LYG+15]. Furthermore, we note that these attacks offer new opportunities for attacks on Pilsung. The S-Boxes of the different rounds overlap in the sets of the L1 cache that we target. Consequently, we need to take multiple samples to identify the access in the round we target. The number of cache sets of the LLC is significantly larger, and thus the likelihood that S-Boxes overlap is much smaller, potentially allowing the attacker to distinguish between accesses in different rounds. We leave the investigation of this scenario for future work.

Conclusions.
In this work we investigate the reverse-engineered implementation of the North Korean cipher Pilsung. We show that using key-dependent S-Boxes and permutations does not provide adequate protection against cache attacks. We analyze the cipher and demonstrate complete key recovery after observing 35 million encryptions. The complete attack takes an average of eight minutes. Thus, our work demonstrates once again the need for careful, constant-time implementation as a defense against side-channel attacks.

Figure 1: Pilsung states after the different operations and rounds. AK, SB, SR, and MC stand for Add-Round-Key, Sub-Bytes, Shift-Rows, and Mix-Columns, respectively.

Figure 2: Illustration of the column mapping step.

Figure 3: Illustration of the two propagation cases with four fixed bytes.

Figure 4: Illustration of the column ordering step.

Figure 6: Permutation $SR_0$ used in the practical experiments.

Figure 7: First step in reversing the Shift-Rows permutation. Normalized probe results with four fixed plaintext bytes. Eight cache sets show increased activity when the four bytes map to the same column (left), compared with four when the bytes do not map to the same column (right).

Figure 8: Second step in reversing the Shift-Rows permutation. Full information (left) and marked detail (right). The first and second S-Boxes in the right figure (marked in blue) show the same pattern up to exclusive or with 2.

Algorithm 4 Second round attack.Input: Oracle O, target byte index i, j , n number of plaintexts Output: Key candidates couples for RK 0