Fixslicing: A New GIFT Representation
Fast Constant-Time Implementations of GIFT and GIFT-COFB on ARM Cortex-M

Abstract. The GIFT family of lightweight block ciphers, published at CHES 2017, offers excellent hardware performance figures and has been used, in full or in part, in several candidates of the ongoing NIST lightweight cryptography competition. However, implementing GIFT in software seems complex and inefficient due to the bit permutation composing its linear layer (a feature shared with the PRESENT cipher). In this article, we exhibit a new non-trivial representation of the GIFT family of block ciphers over several rounds. This new representation, which we call fixslicing, allows extremely efficient software bitsliced implementations of GIFT, using only a few rotations, surprisingly placing GIFT as a very efficient candidate on micro-controllers. Our constant-time implementations show that, on ARM Cortex-M3, 128-bit data can be ciphered with only about 800 cycles for GIFT-64 and about 1300 cycles for GIFT-128 (assuming pre-computed round keys). In particular, this is much faster than the impressive PRESENT implementation published at CHES 2017 that requires 2116 cycles in the same setting, or the current best reported constant-time AES implementation that requires 1617 cycles. This work impacts GIFT, but also improves software implementations of all other cryptographic primitives directly based on it or strongly related to it.


Introduction
In parallel to the rise of pervasive computing and IoT, lightweight cryptography has naturally been a very hot topic in the past decade. Many new primitives have been proposed, from block ciphers to hash functions and authenticated encryption schemes, for various goals such as minimization of area, energy or power consumption, latency, etc. One can remark that there is no single algorithm that is more efficient than all others on every possible platform. Even though designers try to produce a primitive aiming at a particular class of platforms while maintaining good performance otherwise, we can generally observe that hardware-oriented ciphers tend to be less efficient in software and vice versa. For example, the NSA did not propose a single lightweight block cipher, but two of them [BSS+15]: one oriented towards constrained hardware platforms (SIMON) and one oriented towards constrained software platforms (SPECK).
In hardware, it seems that the community is reaching a limit in terms of performance, with recent schemes [BSS+15, BJK+16, BPP+17] that can be implemented efficiently using a very small data-path (minimizing area and power), while also allowing efficient trade-offs for fast and low-energy implementations. Yet, constrained software platforms such as small micro-controllers will play a very important role in the future. Even though hardware-oriented designs use a very small total number of bitwise operations when compared to classical designs such as AES, their situation in software is not so bright: many of these ciphers use hardware-friendly diffusion layers and a significant number of cycles will be required to move these bits around, without much possibility to benefit from vectorization. This is especially true for ciphers using a bit permutation such as PRESENT [BKL+07] or GIFT [BPP+17]. Since this bit permutation is basically free in hardware (it consists of simple wirings), designers concentrated on how to maximize security when choosing this permutation layer. For example, the GIFT permutation layer has been chosen with security as the only criterion (more precisely, maximizing its resistance against differential and linear attacks).
When high parallelism can be achieved in the operating mode where the primitive will be placed, one can always use highly bitsliced implementations (see the performance of SIMON, SKINNY and GIFT on recent Intel processors with AVX2 instructions [BPP+17]) that can lead to excellent performance: these ciphers again use a very small number of bitwise operations, and the high parallelism strongly reduces the cycles wasted in moving bits around by unrolling the implementation. However, this strategy will not be applicable in the case of constrained micro-controllers, as these devices will not offer enough registers to perform such highly bitsliced implementations efficiently. These highly bitsliced implementations will also not be possible for serial operating modes, which are quite widespread in practice and are even more relevant for lightweight cryptography as they can save some area.
It remains rather unexplored how efficient hardware-oriented ciphers can be in software. Yet, this topic is quite important with the ongoing NIST LightWeight Cryptography (NIST LWC) competition, which started in 2018, with the goal of selecting the future authenticated encryption standard(s) for constrained environments. A first answer was given at CHES 2017, with a new very efficient implementation of the PRESENT cipher on various micro-controllers [RAL17]. It is based on a decomposition of the permutation layer over two consecutive rounds, resulting in a more software-friendly representation.
However, PRESENT has a rather low security margin with regard to linear cryptanalysis and its advanced extensions. It also has the disadvantage of only coming in a 64-bit block version, which is to be avoided [BL16] unless a Beyond-Birthday-Bound (BBB) operating mode can be used (generally much more costly). Actually, one can observe that none of the NIST LWC candidates use PRESENT as internal primitive, even though it is widely considered as one of the first lightweight ciphers. Recently, at CHES 2017, the GIFT family of block ciphers was proposed to correct these two issues with PRESENT. GIFT has a 128-bit version and provides a much stronger resistance against linear cryptanalysis than PRESENT, thanks to a careful choice of its S-box, its diffusion layer and how they operate together. It has actually been used as a building block for several NIST LWC candidates, such as GIFT-COFB.
The problem is that the software performance of GIFT is believed to be poor on micro-controllers because, even using table-based implementations, moving the bits around for the diffusion layer will cost many expensive rotations, shifts, masks, exclusive-ORs, etc. To the best of our knowledge, no micro-controller implementation has been previously reported for GIFT.
Our Contributions. In this article, we propose a new non-trivial representation of both versions of the GIFT cipher over several rounds. More precisely, we show how the seemingly complex bit permutation of GIFT-64 can be rewritten over 4 consecutive rounds, using only a few simple operations. This new very clean representation, which we named fixslicing, allows an efficient bitsliced implementation of GIFT-64 on ARM Cortex-M3, requiring only about 800 cycles to cipher two 64-bit input blocks. Our setting assumes pre-computed round keys.

Initialization. The 64-bit (or 128-bit) plaintext is loaded into the cipher state S, which is expressed as 4 16-bit (or 32-bit) segments. In the perspective of a 2-dimensional array, the bit ordering is from top-down, then right to left. Namely, for GIFT-64, we have:
\[
S = \begin{bmatrix} S_0 \\ S_1 \\ S_2 \\ S_3 \end{bmatrix}
  = \begin{bmatrix}
      b_{60} & \cdots & b_8 & b_4 & b_0 \\
      b_{61} & \cdots & b_9 & b_5 & b_1 \\
      b_{62} & \cdots & b_{10} & b_6 & b_2 \\
      b_{63} & \cdots & b_{11} & b_7 & b_3
    \end{bmatrix}
\]
while for GIFT-128 we have:
\[
S = \begin{bmatrix} S_0 \\ S_1 \\ S_2 \\ S_3 \end{bmatrix}
  = \begin{bmatrix}
      b_{124} & \cdots & b_8 & b_4 & b_0 \\
      b_{125} & \cdots & b_9 & b_5 & b_1 \\
      b_{126} & \cdots & b_{10} & b_6 & b_2 \\
      b_{127} & \cdots & b_{11} & b_7 & b_3
    \end{bmatrix}
\]
The 128-bit secret key is loaded into the key state KS, partitioned into 8 16-bit words. In the perspective of a 2-dimensional array, the bit ordering is from right to left, then bottom-up.
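SubCells. The substitution layer of 16 (or 32) identical 4-bit S-boxes can be applied in parallel with the following operations:
\begin{align*}
S_1 &\leftarrow S_1 \oplus (S_0 \wedge S_2) \\
S_0 &\leftarrow S_0 \oplus (S_1 \wedge S_3) \\
S_2 &\leftarrow S_2 \oplus (S_0 \vee S_1) \\
S_3 &\leftarrow S_3 \oplus S_2 \\
S_1 &\leftarrow S_1 \oplus S_3 \\
S_3 &\leftarrow \neg S_3 \\
S_2 &\leftarrow S_2 \oplus (S_0 \wedge S_1) \\
\{S_0, S_1, S_2, S_3\} &\leftarrow \{S_3, S_1, S_2, S_0\}
\end{align*}
where ∧, ∨ and ¬ are the logical AND, OR and NOT operations respectively.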
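To make the bitsliced S-box concrete, here is a minimal Python sketch. It assumes the operation sequence of the GIFT-COFB specification as we recall it (three ANDs, one OR, one NOT, XORs and a final swap of S_0 and S_3) and the GIFT S-box lookup table 1, a, 4, c, 6, f, 3, 9, 2, d, b, 7, 5, 0, 8, e from the GIFT design paper. The names `subcells` and `sbox_via_slices` are ours, not the authors'.

```python
# GIFT S-box as a lookup table (hexadecimal), used only as a reference.
GIFT_SBOX = [0x1, 0xA, 0x4, 0xC, 0x6, 0xF, 0x3, 0x9,
             0x2, 0xD, 0xB, 0x7, 0x5, 0x0, 0x8, 0xE]

def subcells(s0, s1, s2, s3, width=16):
    """Apply the S-box to `width` nibbles at once; slice i holds bit i of each nibble."""
    mask = (1 << width) - 1
    s1 ^= s0 & s2
    s0 ^= s1 & s3
    s2 ^= s0 | s1
    s3 ^= s2
    s1 ^= s3
    s3 ^= mask              # logical NOT, restricted to `width` bits
    s2 ^= s0 & s1
    return s3, s1, s2, s0   # final swap of S0 and S3

def sbox_via_slices(x):
    """Evaluate a single S-box by using 1-bit slices for the nibble x."""
    bits = [(x >> i) & 1 for i in range(4)]
    out = subcells(*bits, width=1)
    return sum(bit << i for i, bit in enumerate(out))

def parallel_check():
    """Run the 16 possible nibbles through 16-bit slices in one call."""
    slices = [sum(((n >> i) & 1) << n for n in range(16)) for i in range(4)]
    out = subcells(*slices, width=16)
    return [sum(((out[i] >> n) & 1) << i for i in range(4)) for n in range(16)]
```

Evaluating the 16 possible inputs with `sbox_via_slices`, or all of them at once with `parallel_check`, can be compared against `GIFT_SBOX` to check that the bitwise sequence and the table agree.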
PermBits. The bit permutation of GIFT has the special property that each bit located in a slice i remains in the same slice through this permutation. Different 16-bit (or 32-bit) permutations are applied to each S_i independently: they map a bit located at position j in slice i to position P_i(j) in the same slice i. We provide in Tables 1 and 2 the P_i(j) values for GIFT-64 and GIFT-128 respectively.
AddRoundKey. This step consists of adding the round key and the round constant. Two 16-bit (or 32-bit) segments U, V are extracted from the key state as the round key: RK = U ‖ V. Then, for the addition of the round key, U and V are XORed to S_1 and S_0 of the cipher state respectively for GIFT-64, or to S_2 and S_1 of the cipher state respectively for GIFT-128:
\[
\text{GIFT-64: } S_1 \leftarrow S_1 \oplus U,\; S_0 \leftarrow S_0 \oplus V
\qquad
\text{GIFT-128: } S_2 \leftarrow S_2 \oplus U,\; S_1 \leftarrow S_1 \oplus V
\]
For the addition of the round constant, S_3 is updated as follows:
\[
S_3 \leftarrow S_3 \oplus \mathtt{0x80XY} \text{ for GIFT-64},
\qquad
S_3 \leftarrow S_3 \oplus \mathtt{0x800000XY} \text{ for GIFT-128},
\]
where the byte XY = 00c_5c_4c_3c_2c_1c_0.
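The slice-preserving property of PermBits can be checked mechanically. The sketch below assumes the closed-form expression of the GIFT-64 bit permutation from the GIFT design paper, as we recall it; this formula is not stated in the text above, and the helper names are ours. It derives the per-slice permutations P_i(j) from the 64-bit permutation.

```python
def gift64_perm(i):
    """Closed form of the GIFT-64 bit permutation (assumed from the GIFT specification)."""
    return 4 * (i // 16) + 16 * ((3 * ((i % 16) // 4) + i % 4) % 4) + i % 4

# A bit at position i sits in slice i % 4; the permutation never changes that value.
slice_preserved = all(gift64_perm(i) % 4 == i % 4 for i in range(64))

def slice_table(c):
    """P_c(j): destination, inside slice c, of the bit located at position j of slice c."""
    return [gift64_perm(4 * j + c) // 4 for j in range(16)]

if __name__ == "__main__":
    for c in range(4):
        print("P_%d:" % c, slice_table(c))
```

Each `slice_table(c)` is a permutation of 0..15, which is what allows the four slices to be processed independently in a bitsliced implementation.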

Key schedule and round constants
The key schedule and round constants are the same for both versions of GIFT; the only difference is the round key extraction. A round key is first extracted from the key state before the key state update. For GIFT-64, two 16-bit words of the key state are extracted as the round key, while for GIFT-128, four 16-bit words of the key state are extracted as the round key. The key state is then updated; the update relies on rotations ≫ i, where ≫ i denotes an i-bit right rotation within a 16-bit word.
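As an illustration of the key state update, the following sketch assumes the word-level rule of the GIFT design paper as we recall it: writing the key state as k7 ‖ k6 ‖ ... ‖ k0, the update is k7 ‖ ... ‖ k0 ← (k1 ≫ 2) ‖ (k0 ≫ 12) ‖ k7 ‖ ... ‖ k2. The word naming and the helper names are assumptions made for this example.

```python
def ror16(x, n):
    """Rotate the 16-bit word x right by n bits."""
    return ((x >> n) | (x << (16 - n))) & 0xFFFF

def update_key_state(k):
    """k is a list of eight 16-bit words, k[0] being the least significant word k0."""
    # New k0..k5 are the old k2..k7; new k6 and k7 are the rotated old k0 and k1.
    return k[2:] + [ror16(k[0], 12), ror16(k[1], 2)]

def after_rounds(k, rounds):
    """Key state after `rounds` successive updates."""
    for _ in range(rounds):
        k = update_key_state(k)
    return k

key = [0x0123, 0x4567, 0x89AB, 0xCDEF, 0x1357, 0x9BDF, 0x0246, 0x8ACE]
four_later = after_rounds(key, 4)
```

With this rule, every word is back at its original index after 4 updates, rotated by 12 bits (even indices) or by 2 bits (odd indices), which is the periodicity exploited when adapting the key schedule to a new representation.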
The round constants are generated using a 6-bit affine LFSR, whose state is denoted as c_5c_4c_3c_2c_1c_0. The six bits are initialized to zero, and updated before being used in a given round. The values of the constants for each round are given in the table below, encoded to byte values for each round, with c_0 being the least significant bit.
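The constant generation can be sketched as follows, assuming the update function of the GIFT design paper as we recall it, namely (c5, c4, c3, c2, c1, c0) ← (c4, c3, c2, c1, c0, c5 ⊕ c4 ⊕ 1). The function name is ours.

```python
def round_constants(rounds):
    """Return the first `rounds` constants, each encoded as a byte 00c5c4c3c2c1c0."""
    state, constants = 0, []
    for _ in range(rounds):
        feedback = ((state >> 5) ^ (state >> 4) ^ 1) & 1   # c5 xor c4 xor 1
        state = ((state << 1) | feedback) & 0x3F           # shift in the new c0
        constants.append(state)                            # updated before being used
    return constants

if __name__ == "__main__":
    print(" ".join("%02X" % c for c in round_constants(28)))  # GIFT-64 has 28 rounds
```

Since the constants do not depend on the key, an implementation can store them in a precomputed table.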

Naive bitsliced implementation of GIFT
Naive bitsliced implementations of the GIFT family of block ciphers can be achieved by straightforwardly following the specifications. First, for both GIFT-64 and GIFT-128, one has to rearrange the inputs in their bitsliced representation. This can be done using the SWAPMOVE technique [MPC00], SWAPMOVE(A, B, M, n), which consists in swapping the bits in B masked by M with the bits in A masked by (M ≪ n). Regarding the substitution layer, the 4-bit S-boxes can be computed in parallel in only 13 operations as described in Section 2. The main difficulty lies in the diffusion layer, as it is the least bitslice-friendly operation. For the sake of clarity, let us consider the case of GIFT-64. In order to apply the 16-bit permutation P_0 to S_0, a basic approach would be to move the bits using masks and shifts, resulting in a sequence of operations which requires about 27 cycles on ARM Cortex-M processors. In the same way, P_1, P_2 and P_3 can be implemented in approximately 14, 27 and 18 cycles, respectively. Therefore, the diffusion layer requires about 100 cycles for a single round. This highlights why ciphers using a bit permutation are generally considered inappropriate for software implementations on micro-controllers.
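The SWAPMOVE primitive mentioned above is commonly written with three XORs, two shifts and one AND. The sketch below uses that usual formulation; the Python function returns the updated pair (A, B), and its name and signature are ours.

```python
def swapmove(a, b, mask, n):
    """Swap the bits of b selected by mask with the bits of a selected by (mask << n)."""
    t = (b ^ (a >> n)) & mask   # positions where the two selected bit groups differ
    return a ^ (t << n), b ^ t  # flip exactly those positions in both words

# Example: exchange the high nibble of a with the low nibble of b.
a, b = swapmove(0xF0, 0x0A, 0x0F, 4)
```

Applying the same call twice restores the original words, which is why the same sequence of calls, in reverse order, can be used to unpack the output.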
Still, it is possible to minimize the impact on performance by operating on several blocks in parallel on 32-bit (and above) architectures. In order to give some insight into how GIFT performs on ARM Cortex-M3 and M4 using the naive bitsliced implementation, we benchmarked code fully written in C, compiled by arm-none-eabi-gcc 9.2.1 using the flag -O3 for optimized speed results, on the STM32L100C and STM32F407VG development boards. Note that our benchmark simply measures the execution time to expand the key and to encrypt 128-bit data, without any operating mode. Implementation results are listed in Table 3. For encryption functions, the data in ROM refers to precomputed round constants while, under RAM usage, I/O refers to the amount of memory needed to store the input and output plus the temporary variables (excluding the round keys). As expected, the result is that GIFT is not well suited for software bitsliced implementations on micro-controllers. While our C implementation requires about 4 000 cycles to encrypt 128-bit data using GIFT-64, twice as many are required when using GIFT-128. This gap is due to the fact that, on top of having more rounds than GIFT-64, the slice permutations P_0, ..., P_3 of GIFT-128 operate on 32 bits instead of 16, increasing the number of masks and shifts to compute. However, the next section introduces a new GIFT representation which challenges this conclusion.

GIFT-64
Let us consider a bitsliced representation of the cipher state: for each nibble, bit 0 is placed in slice 0, bit 1 in slice 1, bit 2 in slice 2 and bit 3 in slice 3. For ease of description, a slice can be placed in matrix form, as shown in the top row of Figure 3. During the SubCells application, when each slice is stored in independent words, all the 16 S-boxes are implemented in parallel in a bitsliced manner, as seen in Figure 2. Then, according to the GIFT designers [BPP+17], the bit permutation can be implemented as follows:
• Take the transpose of each individual slice matrix
• Apply the following row swaps:
  - Slice 0 matrix: swap row 1 with 3
  - Slice 1 matrix: swap row 0 with 1, and swap row 2 with 3
  - Slice 2 matrix: swap row 0 with 2
  - Slice 3 matrix: swap row 0 with 3, and swap row 1 with 2
We give a graphical representation of 4 rounds of this process in Figure 3. As explained in Section 3, the diffusion layer requires bits to be moved around individually in the slice (and not entire chunks of the slice), resulting in a significant overhead.
In order to avoid these issues, we propose a new way to represent GIFT-64. The idea is to fix the first slice matrix so that it never moves, and to find the easiest operations that keep the bits of the other slice matrices synchronised after application of the linear layer (so that the S-box computation that comes after will indeed involve the proper bits). This representation is given in Figure 4, and one can see that even though the bit positions are different, each S-box will have exactly the same bit indexes involved when compared to the classical representation given in Figure 3. For example, after one round, the classical representation will have bits 16/21/26/31 in row 0 and column 1, and we can see that the exact same quartet will appear as well in the new representation, but in row 1 and column 0 instead. The fact that this quartet appears in a different row/column has no impact on the actual computation of the S-box right after, since the computation is bitsliced.
The very nice property of this new representation is that it requires very few operations: each round, we only apply a row or column rotation to the last three slice matrices, while the first slice matrix is never moved. More precisely, for a round i:
• if i mod 4 = 0, rotate slice j matrix by j columns to the left
• if i mod 4 = 1, rotate slice j matrix by j rows to the top
• if i mod 4 = 2, rotate slice j matrix by j columns to the right
• if i mod 4 = 3, rotate slice j matrix by j rows to the bottom
This entire process, which applies a different function in each of 4 consecutive rounds, will be much less costly in software than having to transpose and then swap rows around. Even better: the new and the classical representations are naturally fully synchronised again after applying these 4 rounds, which avoids any representation correction to be applied at the end of the cipher (since GIFT-64 has 28 rounds, which is a multiple of 4). This is due to the fact that P_i^4 = Id for all i. Therefore, no matter which slice matrix is fixed, the new and the classical representations will be fully synchronised after 4 rounds anyway.
In the figures, each cell represents a bit, and the number in a cell denotes the actual index of that particular bit in the state. Slice 0 (resp. 1/2/3), depicted in red (resp. yellow/green/blue), represents all the bits at position 0 (resp. 1/2/3) of the S-boxes of the cipher state.
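The four rotation rules can be checked against the classical representation with a short script. The sketch below tracks bit indexes only (no S-box, no key) and assumes the closed-form GIFT-64 bit permutation from the GIFT design paper as we recall it. It also assumes a layout where nibble n sits at row n // 4 and column n % 4 of each slice matrix, which matches the 16/21/26/31 example; all function names are ours.

```python
def p64(i):
    """Closed form of the GIFT-64 bit permutation (assumed from the GIFT specification)."""
    return 4 * (i // 16) + 16 * ((3 * ((i % 16) // 4) + i % 4) % 4) + i % 4

def classical_round(state):
    """state[p] is the original index of the bit currently at position p."""
    new = [0] * 64
    for p in range(64):
        new[p64(p)] = state[p]
    return new

def fixsliced_round(state, r):
    """Round r (counted from 0): rotate slice c by c columns or rows; slice 0 never moves."""
    new = [0] * 64
    for p in range(64):
        c, n = p % 4, p // 4          # slice index and nibble index
        row, col = n // 4, n % 4
        if r % 4 == 0:
            col = (col - c) % 4       # c columns to the left
        elif r % 4 == 1:
            row = (row - c) % 4       # c rows to the top
        elif r % 4 == 2:
            col = (col + c) % 4       # c columns to the right
        else:
            row = (row + c) % 4       # c rows to the bottom
        new[4 * (4 * row + col) + c] = state[p]
    return new

def quartets(state):
    """Set of S-box inputs, each as the tuple (bit 0, bit 1, bit 2, bit 3) of bit indexes."""
    return {tuple(state[4 * n:4 * n + 4]) for n in range(16)}

def compare(rounds):
    """Per-round equality of S-box inputs, and final equality of both representations."""
    classical = fixsliced = list(range(64))
    same_inputs = []
    for r in range(rounds):
        classical = classical_round(classical)
        fixsliced = fixsliced_round(fixsliced, r)
        same_inputs.append(quartets(classical) == quartets(fixsliced))
    return same_inputs, classical == fixsliced
```

`compare(4)` reports, for each round, whether both representations feed the same quartets to the S-boxes, and whether the two states coincide again after the fourth round.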
We call this technique fixslicing. Note that it is close to the software optimization of PRESENT in [RAL17], which consists in decomposing the permutation over 2 rounds, as our new representation can be seen as a decomposition of P_0, ..., P_3 over 4 rounds. Actually, the fixslicing technique applies to the particular case of permutations which ensure that, from a bitsliced perspective, all bits within a slice remain in the same slice through the permutation. Therefore, it can be applied to all permutations that verify this property, and the number of rounds to consider for the decomposition equals the minimum of order(P_i) over all i.
The other side of the coin of this new representation is that the round keys and round constants have to be adapted to fit the new way the bits are positioned. While this is not an issue for the round constants, which can come from a precomputed look-up table, adapting the key schedule might result in some computational overhead. The naive approach would be to run the key schedule using the classical representation, before rearranging bits for all round keys. However, one can take advantage of the fact that after 4 rounds all key words are back in the same position within the key state (yet the words themselves will be rotated because of the rotation operations in the key schedule). In other terms, each key word has to go through the same bit reordering every 4 rounds. Therefore, a more efficient approach is to rearrange bits for the first 4 round keys only, and to adapt the key schedule accordingly. More details on how to compute the key schedule in the fixsliced representation are given in Appendix A.1.

GIFT-128
As for GIFT-64, we consider a bitsliced representation of the cipher state. For ease of description, a slice i can be represented as a pair of matrices i_L and i_R, as shown in the top row of Figure 6. During the SubCells application, when each slice is stored in independent words, all the 32 S-boxes are implemented in parallel in a bitsliced manner, as seen in Figure 5. Then, according to the GIFT designers [BPP+17], the bit permutation can be implemented as follows:
• Take the transpose of each individual slice matrix
• Shuffle the left and right matrices of each slice (i.e. shuffle i_L and i_R for all i)
• Apply the following row swaps:
  - Slice 0: swap the 2 bottom halves
  - Slice 1: swap the top and bottom halves of the slices independently
  - Slice 2: swap the 2 top halves
  - Slice 3: cross swap the top and bottom halves
We give a graphical representation of 5 rounds of this process in Figure 6.
As for GIFT-64, one can see that the process will be very costly in software, with lots of transpositions, shuffles and swaps. We therefore propose a new way to represent GIFT-128, thanks to the fixslicing technique. However, unlike GIFT-64, note that the classical and the new representations will not be synchronised anymore after 4 rounds, since P_i^4 ≠ Id for all i. For GIFT-128 we have P_0^31 = P_1^10 = P_2^31 = P_3^5 = Id. In other terms, by fixing the fourth slice so that it never moves, we can define a routine such that the classical and new representations are naturally synchronised after 5 rounds. Since GIFT-128 has 40 rounds (which is a multiple of 5), this avoids any correction to be applied at the end of the cipher. This representation is depicted in Figure 7.
In the figures, each cell represents a bit, and the number in a cell denotes the actual index of that particular bit in the state. Slice 0 (resp. 1/2/3), depicted in red (resp. yellow/green/blue), represents all the bits at position 0 (resp. 1/2/3) of the S-boxes of the cipher state.
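The orders of the slice permutations quoted above can be recomputed with a few lines of Python. The sketch assumes the closed-form bit permutations of GIFT-64 and GIFT-128 from the GIFT design paper as we recall them (they differ only in the multiplier, a quarter of the block size); the helper names are ours.

```python
def gift_perm(i, block_bits):
    """Closed form of the GIFT bit permutation (assumed from the GIFT specification)."""
    quarter = block_bits // 4
    return 4 * (i // 16) + quarter * ((3 * ((i % 16) // 4) + i % 4) % 4) + i % 4

def slice_perm(c, block_bits):
    """Permutation P_c induced on the positions j of slice c."""
    return [gift_perm(4 * j + c, block_bits) // 4 for j in range(block_bits // 4)]

def order(perm):
    """Smallest k >= 1 such that applying perm k times gives the identity."""
    identity = list(range(len(perm)))
    current, k = perm[:], 1
    while current != identity:
        current = [perm[x] for x in current]   # compose with perm once more
        k += 1
    return k

orders_64 = [order(slice_perm(c, 64)) for c in range(4)]
orders_128 = [order(slice_perm(c, 128)) for c in range(4)]
```

The smallest order, reached by slice 3 for GIFT-128, gives the number of rounds after which the representation obtained by fixing that slice is synchronised with the classical one.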
One can again see that even though the bit positions are different, each S-box will have exactly the same bit indexes involved when compared to the classical representation given in Figure 6. We recall that this representation implies that the key schedule and constant addition have to be adapted to fit the new way the bits are positioned.
The first 2 rounds are similar to the ones used for GIFT-64. Namely, in the first round, we simply rotate each matrix of each slice i (thus i_L and i_R for all i) by i columns to the left. In the second round, we simply rotate each matrix of each slice i (thus i_L and i_R for all i) by i rows to the top. For the third round, we swap the matrices i_L and i_R for i ∈ {0, 2} before swapping the first and third columns with the second and fourth ones respectively, for matrices 0_R, 1_L, 1_R and 2_L. During the fourth round, we swap the first and third rows with the second and fourth ones respectively, for each matrix of slice 1. Then, for each matrix of slice 0 (resp. slice 2), we rotate by 2 columns to the left before swapping rows of the left-half block (resp. right-half block). Finally, the fifth round consists in swapping 1_L with 1_R, rotating i_L and i_R by 2 rows to the top for i ∈ {0, 2}, and swapping the first and second rows of each matrix for slice 0, while swapping the third and fourth rows of each matrix for slice 2. All these operations are illustrated in Figure 7 for greater clarity.
The above-mentioned method to adapt the key schedule for GIFT-64 cannot be straightforwardly applied to GIFT-128. Indeed, the new and the classical representations of the state are synchronised after 5 rounds, but the key schedule part is almost synchronised after 4 rounds (the key word will return to its original position after 4 rounds, albeit rotated). Thus, it looks like the synchronisation will happen only every 4 × 5 = 20 rounds. However, one can remark that twice as much subkey material is used for GIFT-128 compared to GIFT-64, and that the key words used every two rounds are the same (albeit rotated, and for a different part of the internal state). Thus, we have an almost-synchronisation that will happen every 2 × 5 = 10 rounds instead. In other terms, each key word has to match every new representation of the state at some point. Instead of applying the naive approach for all round keys, which consists in running the key schedule using the classical representation and then rearranging bits, we suggest applying it only for the first 10 round keys. At this stage, all key words will be expressed in each representation, which allows adapting the key schedule for each of them, without reordering bits. More details on how to compute the key schedule in the fixsliced representation are given in Appendix A.2.

Efficient software implementations of GIFT
This section shows how to take advantage of the fixslicing technique to achieve efficient implementations of GIFT on ARM Cortex-M processors. We also briefly discuss the gap for other platforms that do not come with an inline barrel shifter or rotate instruction.

GIFT-64
In the case of GIFT-64, thanks to our new fixsliced representation, the linear layer consists in rotating either rows or columns depending on the round number. Depending on how the bits are arranged within the slices (i.e. row-wise or column-wise bitsliced representation), these operations correspond to either half-word (16-bit) or nibble (4-bit) rotations. In the rest of this section we consider a row-wise bitsliced representation. The ARM Cortex-M being a 32-bit architecture (and since we have 4 slices in GIFT-64), two 64-bit blocks B and B' can be processed at a time. Instead of simply concatenating the 16-bit slices of both blocks within 32-bit words, we suggest interleaving the nibbles of both blocks, so that 16-bit rotations become 32-bit rotations, which can be implemented in a single cycle using the ror instruction. Actually, they can be computed for free by taking advantage of the inline barrel shifter, since instructions can shift or rotate one of their operands without any additional cost. Therefore, the implementation cost of the linear layer is now equivalent to 42 nibble rotations (3 have to be computed every 2 rounds). Such rotations can be computed in 3 cycles on ARM Cortex-M processors assuming that the required masks are already loaded in some general purpose registers, resulting in a total of 42 × 3 = 126 cycles.
A sequence of 16 calls to the SWAPMOVE routine leads to the above-mentioned row-wise nibble-interleaved bitsliced representation. Although a bitsliced representation without interleaving the nibbles could be built with 12 SWAPMOVE calls instead of 16, each half-word rotation would require at least 3 cycles, therefore doubling the cost of the linear layer to at least 252 cycles.
Regarding the non-linear layer, it is possible to save 1 instruction by omitting the NOT operation. Indeed, this operation applies to a slice that will then be exclusive-ORed with the round key. Therefore, we suggest computing the NOT on the corresponding round keys. Moreover, because the key schedule is completely linear, one can simply apply the logical negation to the right chunks of the key before computing the key schedule. Note that this can be done once, when the encryption key is being derived and/or stored on the device, therefore saving 28 cycles per 128-bit data encryption.
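The NOT-folding argument rests on the identity ¬x ⊕ k = x ⊕ ¬k. The sketch below illustrates it on one GIFT-64 round, without the bit permutation: PermBits only reorders bits inside a slice, so a full-slice complement passes through it unchanged. It assumes the S-box operation sequence of the GIFT-COFB specification as we recall it and the GIFT-64 key addition (U to S_1, V to S_0); the function names are ours.

```python
import random

MASK = 0xFFFF  # 16-bit slices of a single GIFT-64 block, for simplicity

def subcells(s0, s1, s2, s3, with_not=True):
    """Bitsliced S-box layer; the NOT can be skipped when it is folded into the key."""
    s1 ^= s0 & s2
    s0 ^= s1 & s3
    s2 ^= s0 | s1
    s3 ^= s2
    s1 ^= s3
    if with_not:
        s3 ^= MASK          # logical NOT on the slice that becomes S0 after the swap
    s2 ^= s0 & s1
    return s3, s1, s2, s0   # final swap of S0 and S3

def add_round_key(state, u, v):
    """GIFT-64 key addition: U is XORed to S1 and V to S0."""
    s0, s1, s2, s3 = state
    return s0 ^ v, s1 ^ u, s2, s3

def equivalent(trials=100, seed=1):
    """Compare the reference round with the NOT-free round using a complemented V."""
    rng = random.Random(seed)
    for _ in range(trials):
        state = [rng.getrandbits(16) for _ in range(4)]
        u, v = rng.getrandbits(16), rng.getrandbits(16)
        reference = add_round_key(subcells(*state), u, v)
        folded = add_round_key(subcells(*state, with_not=False), u, v ^ MASK)
        if reference != folded:
            return False
    return True
```

Since the key state update only moves and rotates 16-bit words, complementing the words that end up as V once, before the key schedule, yields complemented V values in every round.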
On the other hand, a nibble-interleaved bitsliced representation requires twice as much memory to store the round keys and constants in order to avoid extra computations on the fly. It would still be possible to store these variables as 16-bit words, but one would have to pay extra cycles to expand them into 32-bit words, nibble-interleaved with themselves. For the sake of efficiency, we did not consider this option for our implementations. The round keys and constants are stored in 32-bit words, leading to a memory requirement of 112 and 224 bytes for all the round constants and the round keys, respectively.

GIFT-128
Regarding GIFT-128, because only a single block can be processed at a time on 32-bit processors, we consider a row-wise bitsliced representation without any interleaving. Unlike GIFT-64, where only 2 kinds of operations are needed, 5 kinds must be distinguished here since each step of the new representation requires different slice transformations. At steps 1, 2, 4 and 5, these transformations can be implemented by means of nibble, half-word, byte and full-word rotations, respectively. The third step does not clearly correspond to any n-bit rotation but can simply be computed using the SWAPMOVE process. Again, full-word rotations can be implemented for free on ARM thanks to the inline barrel shifter. Even though the nibble, byte and half-word rotations can be implemented in as few as 3 cycles, our implementation requires 5 cycles, as 2 additional cycles are spent in loading the appropriate masks into registers. This is due to the fact that, unlike GIFT-64, it is not possible to keep all the masks in registers during the entire encryption routine, as 12 different ones are needed. The same statement also applies to SWAPMOVE calculations, leading to a cost of 5 cycles per process. As a result, the linear layer of GIFT-128 can be implemented in about 12 × 5 × 8 = 480 cycles in total, according to our new representation.
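Nibble, byte and half-word rotations all follow the same pattern: two masked shifts combined with an OR. On ARM the shifts can be absorbed by the inline barrel shifter, leaving about three instructions once the masks are in registers. The sketch below is a generic Python model of such lane-wise rotations on a 32-bit word; the function name and the way masks are built are ours.

```python
def lane_ror(x, lane_bits, n):
    """Rotate each lane_bits-wide lane of the 32-bit word x right by n (0 < n < lane_bits)."""
    lanes = 32 // lane_bits
    ones = sum(1 << (lane_bits * i) for i in range(lanes))  # 0x11111111 for nibbles
    low = ((1 << n) - 1) * ones                  # the n low bits of every lane
    high = ((1 << (lane_bits - n)) - 1) * ones   # where the remaining bits land
    # Bits that stay in their lane move down; the n low bits wrap to the top of the lane.
    return ((x >> n) & high) | ((x & low) << (lane_bits - n))

if __name__ == "__main__":
    print(hex(lane_ror(0x12345678, 4, 1)))   # nibble rotation
    print(hex(lane_ror(0x12345678, 8, 4)))   # byte rotation
    print(hex(lane_ror(0x12345678, 16, 8)))  # half-word rotation
```

In an assembly implementation the two masks would be constants; the number of distinct masks needed is what decides whether they can all stay in registers.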
Note that row ordering matters to match this interpretation of the new representation. Our GIFT-128 implementations use a row ordering from top-down, which can be achieved using 14 calls to the SWAPMOVE process. Regarding the non-linear layer, contrary to GIFT-64, it is not possible to get rid of the NOT operation within the S-box computation, as the round keys are not exclusive-ORed to S_0. Therefore, our implementation of the non-linear layer straightforwardly follows the specification and requires 13 × 40 = 520 cycles in total.

Without rotate instruction
Thanks to the inline barrel shifter, our fixsliced implementations fit the ARM architecture very well, since the linear layer can be computed for free every 2 and 5 rounds for GIFTb-64 and GIFTb-128, respectively. However, one could ask how they would perform on platforms that do not come with an inline barrel shifter and/or rotate instructions. For instance, RISC-V has no rotate instruction without an appropriate extension (e.g., Bitmanip [Wol20]). In this case, one rotation can be computed by means of 2 shifts and 1 OR, resulting in at least 3 cycles. Therefore, instead of having the linear layer for free every 2 and 5 rounds, it would require at least 4 × 3 = 12 cycles, leading to a minimum overhead of 12 × 14 = 168 and 12 × 8 = 96 cycles for GIFTb-64 and GIFTb-128, respectively. Moreover, nibble, byte and half-word rotations on RISC-V cannot be computed in 3 cycles but in 5, because the barrel shifter is not inlined, resulting in an additional overhead of 2 × 4 × 14 = 112 cycles for GIFT-64. On the other hand, this should not affect GIFT-128, since our implementation already spends 5 cycles for all these rotations because 2 additional cycles are spent to load the appropriate masks in registers. While ARM Cortex-M processors only have 14 general purpose registers, RISC-V has 32 such registers, so all the masks can be kept in registers during the entire encryption process. Finally, the SWAPMOVE process would require 6 cycles instead of 4, increasing the cost to pack the input and unpack the output by 16 × 4 = 64 and 14 × 4 = 56 cycles for GIFT-64 and GIFT-128, respectively (2 extra cycles per call, for both packing and unpacking). Note that it would also add (6 − 5) × 3 × 8 = 24 cycles to GIFTb-128, since it relies on 3 SWAPMOVE calls in order to compute the linear layer every 5 rounds.
As a result, on platforms without inline barrel shifter or rotate instruction, we estimate a total overhead of 168 + 112 = 280 cycles (i.e. 140 per block) and 96 + 24 = 120 cycles for our fixsliced implementations of GIFTb-64 and GIFTb-128, respectively. Taking into account the overhead to pack/unpack the data would lead to a total overhead of 280 + 64 = 344 cycles (i.e. 172 per block) and 120 + 56 = 176 cycles for GIFT-64 and GIFT-128, respectively. Overall, this means a penalty of around 40% in cycles for GIFT-64 and 15% for GIFT-128. Therefore, fixslicing is still of interest on such platforms compared to the classical representation, although the ARM architecture boosts its performance.
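The overhead estimate above is plain arithmetic and can be replayed as follows. All cycle counts are the ones quoted in the text; the variable names are ours. The relative penalty for GIFT-64 uses the 838-cycle figure reported for GIFT-64 in the results, and the GIFT-128 penalty is left out because only an approximate cycle count is quoted in the text.

```python
ROTATION = 3                          # one rotation emulated with 2 shifts and 1 OR
full_word = 4 * ROTATION              # 12 cycles each time the linear layer was free on ARM

linear_64 = full_word * 14            # GIFTb-64: every 2 rounds over 28 rounds
linear_128 = full_word * 8            # GIFTb-128: every 5 rounds over 40 rounds
lane_rot_64 = 2 * 4 * 14              # 2 extra cycles per nibble rotation (GIFT-64)
swapmove_128 = (6 - 5) * 3 * 8        # 3 SWAPMOVE calls every 5 rounds (GIFT-128)

core_64 = linear_64 + lane_rot_64     # GIFTb-64 overhead for two blocks
core_128 = linear_128 + swapmove_128  # GIFTb-128 overhead

pack_64, pack_128 = 16 * 4, 14 * 4    # extra packing and unpacking cost
total_64 = core_64 + pack_64
total_128 = core_128 + pack_128

penalty_64 = total_64 / 838           # relative to the reported GIFT-64 cycle count

if __name__ == "__main__":
    print(core_64, core_128, total_64, total_128, round(penalty_64, 2))
```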

The GIFT block ciphers
Our GIFT implementations, which are written in ARM assembly, are put into the public domain and available at https://github.com/aadomn/gift. Results for various lightweight block ciphers including GIFT are provided in Table 4.
The implementations of RECTANGLE-64/128, SIMON-64/128 and SPECK-64/128 are the ones from scenario 2 (Best execution time) of the FELICS framework [DCK+19]. In this scenario, the key schedule is not taken into account, as the round keys are assumed to be precomputed and stored in RAM. The benchmark consists in measuring the time required to encrypt 128-bit data using the CTR mode. We followed the same approach for our GIFT implementations to ensure a fair comparison. The results for PRESENT-64/128 are taken from [RAL17] and were obtained using the same methodology. Regarding the key schedule, results from the FELICS framework were extracted from scenario 0, which consists in a simple benchmark of the key schedule and a block encryption/decryption. Except for RECTANGLE, for which implementations are written in ARM assembly, note that the results for the other above-mentioned ciphers come from C code. Therefore, better results can be expected for these algorithms by considering assembly implementations. Table 4 also includes results for the current best AES constant-time implementation from [SS16]. Note that, as in Table 3, RAM usage for encryption functions does not take into account the memory required for the round keys, to be compliant with the results from the FELICS framework.
As expected, our new GIFT fixsliced representation allows extremely efficient software bitsliced implementations, requiring at best 766 and 838 cycles to encrypt 128-bit data for GIFTb-64 and GIFT-64, respectively. Note that this is about 8 times more efficient than our naive bitsliced implementations written in C reported in Table 3. On the other hand, the amount of memory to store the round keys is increased by a factor of 2. GIFT-64 outperforms all other 64-bit ciphers listed in Table 4, except SPECK-64/128, which is well known for its outstanding performance thanks to its ARX structure. In particular, our implementation of the GIFT-64 key schedule according to the new representation outperforms all the other ones. "GIFT-64 key exp." refers to the key schedule including the rearrangement of the encryption key to match the fixsliced representation, while "GIFTb-64 key exp." assumes a key already in the right representation as input. Note that rearranging the encryption key can be done only once, when the key is being derived and/or stored on the device, at the same time as the S-box optimization described in Section 5.
Regarding GIFT-128, we observe that it is slower than GIFT-64 by a factor of 1.6. Considering that the ratio in terms of number of rounds is about 1.4, this is a remarkable result since its new representation is slightly more complex. However, the cost of the key schedule more than doubles because the optimization for GIFT-64 does not apply to GIFT-128, as stated in Section 5. Still, it performs slightly better than the AES key schedule. Note that, unlike for GIFT-64, we do not make a distinction between "GIFTb-128 key exp." and "GIFT-128 key exp.", as our adapted key schedule starts from the key in its classical representation anyway. For encryption routines, our GIFT-128 implementations largely outperform the current best AES one.
Our GIFT implementations only require four 32-bit random words to mask the internal state at the beginning of the algorithm. Regarding the key schedules, the same amount of randomness is required to mask the initial key. For both GIFT-64 and GIFT-128, the internal state fits in 4 registers. Therefore, it is possible to handle the state and the masks in 8 registers, avoiding any additional memory access during the encryption routine.
When taking first-order masking into consideration, the advantage of GIFT-128 over AES-128 is even more significant, since the number of nonlinear operations to secure is smaller. However, note that the reported results for AES do not take advantage of the optimized AND gate from [BDLCU18] and therefore bear the cost of additional operations and randomness generation. Compared to our unmasked implementation results reported in Table 4, we observe a penalty factor of about 2.5 in terms of execution time, showing that GIFT is well suited for software masked implementations thanks to our fixsliced representation.
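To make the masking discussion more concrete, here is a minimal Python sketch of first-order Boolean masking on a state of four 32-bit words. It is an illustration under stated assumptions, not the implementation benchmarked here: the function names are hypothetical, the AND gadget is the classic ISW-style construction with one fresh random word rather than the optimized gate of [BDLCU18], and Python code offers no constant-time or side-channel guarantees.

```python
import secrets


def mask_state(state):
    """Split each 32-bit word into two shares using one random word per state word."""
    masks = [secrets.randbits(32) for _ in state]
    masked = [word ^ mask for word, mask in zip(state, masks)]
    return masked, masks


def unmask(shares):
    """Recombine a pair of shares into the value they encode."""
    return shares[0] ^ shares[1]


def masked_xor(a, b):
    """Linear operations are applied share by share, without fresh randomness."""
    return (a[0] ^ b[0], a[1] ^ b[1])


def masked_and(a, b, r=None):
    """ISW-style first-order AND: only nonlinear gates need a fresh random word."""
    if r is None:
        r = secrets.randbits(32)
    a0, a1 = a
    b0, b1 = b
    c0 = (a0 & b0) ^ r
    # The random word is added before the cross terms are combined.
    c1 = ((r ^ (a0 & b1)) ^ (a1 & b0)) ^ (a1 & b1)
    return (c0, c1)


state = [0x01234567, 0x89ABCDEF, 0xDEADBEEF, 0x0BADF00D]
masked, masks = mask_state(state)   # four state shares and four mask shares
a = (masked[0], masks[0])
b = (masked[1], masks[1])
```

With four state words and four mask words, the eight values correspond to the eight registers mentioned above, and only the AND gates of the S-box layer consume additional randomness.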

The GIFT-COFB authenticated cipher
Since GIFTb-128 is the underlying block cipher of GIFT-COFB, we can easily assess the benefits of our fixsliced representation when applied to this authenticated cipher. To do so, our GIFT-COFB implementation computes the COFB mode in C, while calls to the GIFTb-128 primitive are handled by our assembly implementation. Tables 6 and 7 summarize our implementation results for GIFT-COFB and Ascon [DEMS19], another submission to the NIST LWC competition. For both versions of Ascon, namely Ascon-128 and Ascon-128a, we consider the ARM-optimized implementations bi32_arm, available online at https://github.com/ascon. We believe this is a fair comparison since the core function is written in assembly in a fully unrolled manner, while the rest of the algorithm is handled by C code, just like our GIFT-COFB implementation. According to our benchmark, fixslicing makes GIFT-COFB a very efficient authenticated cipher, running at 79 cycles per byte for long messages, versus 58 and 42 cycles per byte in the same setting for Ascon-128 and Ascon-128a, respectively. However, because the considered Ascon implementations are highly speed-optimized, their code sizes are larger than that of our fully unrolled implementation by factors of 1.2 and 1.5 for Ascon-128 and Ascon-128a, respectively. We observe that our first-order masked implementation of GIFT-COFB requires about three times as many cycles as the unmasked Ascon-128, taking into account the randomness generation on the STM32F407VG micro-controller. Although it is unclear how Ascon-128 would perform compared to our fixsliced implementations when taking first-order masking into account, we expect it to be more efficient for messages composed of several blocks, since masking can be restricted to the initialization and finalization phases, as done in [AFM18].

Conclusion
In this article, we proposed a new representation for the GIFT family of lightweight block ciphers, called fixslicing, and showed how it can be used to obtain extremely fast implementations on micro-controllers, making GIFT a very efficient candidate on these platforms. In particular, our fixsliced representation fits the ARM architecture very well, as the inline barrel shifter allows the linear layer to be computed for free every 2 and 5 rounds for GIFT-64 and GIFT-128, respectively. Our implementations, available online at https://github.com/aadomn/gift to validate our overall strategy, run in constant time since they are bitsliced in essence. This result directly provides efficient implementations of GIFT-COFB, a submission to the NIST LWC competition, placing it as a very promising candidate on micro-controllers.
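The role of the barrel shifter can be sketched in Python as follows. On ARM, a data-processing instruction such as `EOR Rd, Rn, Rm, ROR #n` rotates its second operand as part of the same instruction, so a rotation that is folded into the next logical operation costs no extra instruction. The helper names below are hypothetical and the code only emulates the arithmetic; it does not model cycle counts.

```python
MASK32 = 0xFFFFFFFF


def ror32(x, n):
    """Rotate a 32-bit word right by n positions."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32


def eor_ror(rn, rm, n):
    """Emulate `EOR Rd, Rn, Rm, ROR #n`: XOR with a rotated second operand."""
    return rn ^ ror32(rm, n)


slice_a = 0x0F0F00FF
slice_b = 0x12345678

# Rotating first and XORing afterwards takes two steps ...
two_steps = slice_a ^ ror32(slice_b, 8)
# ... while the fused form expresses the same result as one operation.
fused = eor_ror(slice_a, slice_b, 8)
```

Because rotations commute with XOR, a rotation can also be postponed and applied once to the combined word instead of to each operand.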
We also report implementation results for GIFT and GIFT-COFB when adding first-order masking, and observe penalty factors of about 2.5 and 2.1, respectively. According to our benchmark, GIFT-COFB masked at first order requires about three times as many cycles as Ascon-128 without masking. Further work should be conducted to draw a clear picture when comparing both algorithms regarding masked implementations.
More generally, we believe that the approach of not following the classical cipher representation for a few rounds might be applicable to other designs. In particular, bitsliced implementations can take advantage of the fixslicing technique as long as each bit located in a slice remains in the same slice through the linear layer, as is the case for GIFT. From a design point of view, considering a permutation with a low order for the linear layer might be of interest, since it allows a compact routine to be defined to resynchronize the slices. Furthermore, the key schedule should be designed accordingly, to avoid any additional calculations due to round key adjustments.
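Both design criteria can be checked mechanically for a candidate bit permutation. The Python sketch below does so for GIFT-64, assuming the closed-form bit permutation given in the GIFT specification (it is reproduced here from that specification, not from this section) and assuming that slice j collects the bits whose index is congruent to j modulo 4. The helper names are hypothetical.

```python
def gift64_perm(i):
    """GIFT-64 bit permutation, assuming the formula from the GIFT specification."""
    return 4 * (i // 16) + 16 * ((3 * ((i % 16) // 4) + (i % 4)) % 4) + (i % 4)


def preserves_slices(perm, n_bits, n_slices=4):
    """True if every bit stays in its slice (same index modulo n_slices)."""
    return all(perm(i) % n_slices == i % n_slices for i in range(n_bits))


def permutation_order(perm, n_bits):
    """Smallest k >= 1 such that applying perm k times is the identity."""
    identity = list(range(n_bits))
    current = [perm(i) for i in range(n_bits)]
    order = 1
    while current != identity:
        current = [perm(j) for j in current]
        order += 1
    return order


slices_ok = preserves_slices(gift64_perm, 64)
order = permutation_order(gift64_perm, 64)
```

Under these assumptions the check reports that slices are preserved and that the permutation has order 4, which is consistent with the classical and fixsliced representations of GIFT-64 coinciding every 4 rounds.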

A.2 GIFT-128
In the case of GIFT-128, adjusting the key schedule according to fixslicing is trickier, since the new and the classical representations of the state are synchronised after 5 rounds, while the key words return to their original positions after 4 rounds. We suggest computing the key schedule in the classical representation for the first 10 rounds, then rearranging the resulting round keys to match the fixsliced representation of the state. At this stage, all key words will be expressed in each representation, which allows the key schedule to be adapted for each of them without reordering bits. As stated in Section 4.2, each key word will be exclusive-ORed to the state in the same representation every 10 rounds. After 10 rounds, 2 out of 4 key words will have been updated thrice while the two others will have been updated twice, as detailed in Table 8. Therefore, our adapted key schedule relies on double and triple update functions for each representation, which are illustrated in Figure 9.
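The update counts quoted above can be reproduced with a small Python sketch of the classical key state update, assuming the rule from the GIFT specification in which the 128-bit key state k7 || ... || k0 of eight 16-bit words becomes (k1 >>> 2) || (k0 >>> 12) || k7 || ... || k2 at every round. The helper names are hypothetical, and reading the four "key words" as the pairs (k7, k6), (k5, k4), (k3, k2), (k1, k0) is an assumption about the terminology.

```python
def ror16(x, n):
    """Rotate a 16-bit word right by n positions."""
    return ((x >> n) | (x << (16 - n))) & 0xFFFF


def make_key_state(values):
    """`values` lists k7 down to k0; each entry becomes (name, value, updates)."""
    return [(f"k{7 - pos}", value, 0) for pos, value in enumerate(values)]


def update_key_state(state):
    """One classical update: (k1 >>> 2, k0 >>> 12) move to the front."""
    (name1, val1, cnt1), (name0, val0, cnt0) = state[6], state[7]
    rotated = [(name1, ror16(val1, 2), cnt1 + 1),
               (name0, ror16(val0, 12), cnt0 + 1)]
    return rotated + state[:6]


def run_updates(state, rounds):
    """Apply the key state update `rounds` times."""
    for _ in range(rounds):
        state = update_key_state(state)
    return state


start = make_key_state([0x7777, 0x6666, 0x5555, 0x4444,
                        0x3333, 0x2222, 0x1111, 0x0001])
after_4 = run_updates(start, 4)    # every word back in place, updated once
after_10 = run_updates(start, 10)  # k3..k0 updated 3 times, k7..k4 twice
```

After 4 updates every word is back at its original position, and after 10 updates two of the four pairs have been updated three times and the other two twice, which matches the need for double and triple update functions.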
Table 8: Round keys' representations depending on the round number. Exponents refer to the number of times the key words have been updated. Blue and red arrows refer to double and triple key updates, respectively.
Representation # Round # Round key

Figure 2: Cubic representation of the main state of GIFT-64. Each color refers to a slice matrix, while the black cuboid is where an S-box is applied.

Figure 3: Classical representation of the GIFT-64 round function during 4 rounds. Each cell represents a bit, and the numbers in the cells denote the actual index of that particular bit in the state. Slice 0 (resp. 1/2/3), depicted in red (resp. yellow/green/blue), represents all the bits at position 0 (resp. 1/2/3) of the S-boxes of the cipher state.

Figure 4: New representation of the GIFT-64 round function during 4 rounds. Each cell represents a bit, and the numbers in the cells denote the actual index of that particular bit in the state. Slice 0 (resp. 1/2/3), depicted in red (resp. yellow/green/blue), represents all the bits at position 0 (resp. 1/2/3) of the S-boxes of the cipher state.

Figure 5: Cubic representation of the main state of GIFT-128.The black cuboid is where an S-box is applied for both matrices.

Figure 6: Classical representation of the GIFT-128 round function during 5 rounds. Each cell represents a bit, and the numbers in the cells denote the actual index of that particular bit in the state. Slice 0 (resp. 1/2/3), depicted in red (resp. yellow/green/blue), represents all the bits at position 0 (resp. 1/2/3) of the S-boxes of the cipher state.

Figure 8: GIFT-64 key update functions from round i to i + 4, according to the different fixsliced representations over 4 rounds. Each cell represents a bit, and the numbers in the cells denote the actual index of that particular bit in the 16-bit key word. Note that i mod 4 = 3 refers to the classical representation. The cell colors match the corresponding slice for the add round key operation.

Figure 9: GIFT-128 double/triple key update functions from round i to i + 10, according to the different fixsliced representations over 5 rounds. Each cell represents a bit, and the numbers in the cells denote the actual index of that particular bit in the 16-bit key word. Note that i mod 5 = 4 refers to the classical representation. The cell colors match the corresponding slice for the add round key operation.

Table 3: Naive bitsliced implementation results on ARM Cortex-M3 and M4 for various versions of GIFT.

Table 4: Constant-time implementation results on ARM Cortex-M3 and M4 for various versions and representations of GIFT, as well as other lightweight block ciphers. For encryption routines, speed is expressed in cycles per block. Emboldened (resp. italic) results refer to speed (resp. code size) oriented implementations.

Table 6: Constant-time implementation results on ARM Cortex-M3 and M4 for GIFT-COFB and Ascon to secure 16 bytes of message along with 16 bytes of additional data. Emboldened (resp. italic) results refer to speed (resp. code size) oriented implementations.

Table 7: Running time (cycles) of constant-time speed-oriented implementations of GIFT-COFB and Ascon on ARM Cortex-M4 for different message sizes along with 16 bytes of additional data.