The SPEEDY Family of Block Ciphers: Engineering an Ultra Low-Latency Cipher from Gate Level for Secure Processor Architectures

Abstract. We introduce SPEEDY, a family of ultra low-latency block ciphers. We mix engineering expertise into each step of the cipher's design process in order to create a secure encryption primitive with an extremely low latency in CMOS hardware. The centerpiece of our constructions is a high-speed 6-bit substitution box whose coordinate functions are realized as two-level NAND trees. In contrast to other low-latency block ciphers such as PRINCE, PRINCEv2, MANTIS and QARMA, we neither constrain ourselves by demanding decryption at low overhead, nor by requiring a super low area or energy. This freedom, together with our gate- and transistor-level considerations, allows us to create an ultra low-latency cipher which outperforms all known solutions in single-cycle encryption speed. Our main result, SPEEDY-6-192, is a 6-round cipher with a 192-bit block and a 192-bit key which can be executed faster in hardware than any other known encryption primitive (including Gimli in an Even-Mansour scheme and the Orthros pseudorandom function) and offers 128-bit security. One round more, i.e., SPEEDY-7-192, provides full 192-bit security. SPEEDY primarily targets hardware security solutions embedded in high-end CPUs, where area and energy restrictions are secondary while high performance is the number one priority.


Introduction
In this paper we revisit the following fundamental problem: How do we design a secure encryption algorithm whose hardware implementation is fast? Specifically, we care about the entire latency of the hardware circuit from the point where the inputs are provided to the point where the final outputs are ready and stable, i.e., the latency of a fully-unrolled hardware implementation entirely made from combinatorial logic. Previous approaches, which led to the design of established low-latency constructions like PRINCE [BCG+12], PRINCEv2 [BEK+20], MANTIS [BJK+16] and QARMA [Ava17], considered a low number of rounds and, to some extent, a small gate depth as design criteria. While both are obviously important factors to achieve a low latency, there are further aspects which have been ignored at the design level in the past, first and foremost the latency characteristics of the underlying hardware. At first sight it may appear to be of limited interest to tailor a cryptographic primitive towards one specific device technology due to the potential loss of generality. However, in the hardware world there has been only one de-facto standard for integrated circuit fabrication since the 1980s, namely Complementary Metal-Oxide-Semiconductor (CMOS) technology. The construction of CMOS logic gates, i.e., the arrangement of p- and n-channel MOSFETs (Metal-Oxide-Semiconductor Field-Effect Transistors) to create a certain functionality, has remained largely unchanged since its original proposal in 1963. In other words, CMOS logic gates, the essential building blocks for the vast majority of our computing technology today, have not experienced any fundamental redesign in almost 6 decades. Merely their size has seen a progressive decrease according to Moore's famous law [Moo65]. Notably, there are some operations which can be constructed more naturally from complementary logic. In particular, complementary gates in silicon hardware are naturally inverting, and non-inverting Boolean
functions cannot be realized in a single stage (i.e., they require more than one pull-up and pull-down network) [RCN04]. Among the naturally inverting logic gates, some can be realized using only the minimum (lower bound) of 2n transistors, where n is the number of inputs the gate receives. These 2n transistors are then arranged in the classical layout of one pull-up network, built from p-channel MOSFETs (PMOS), and one pull-down network, built from n-channel MOSFETs (NMOS). The simple Boolean functions NAND, NOR and INV/NOT are constructed this way, but also the compound or complex logic gates AND-OR-INV (AOI) and OR-AND-INV (OAI). We argue that logic cells with these properties are immensely beneficial for low-latency constructions as they produce outputs much faster than their counterparts, independent of the particular specifications or the minimum feature size of the fabrication process. When diving deeper into the physical characteristics of hardware circuits built from silicon, it is possible to make even further distinctions. In particular, we point out that cell layouts which require PMOS transistors to be connected in series (stacked) suffer more significantly from the lower mobility of PMOS compared to NMOS transistors. In consequence, a noticeable negative impact on the latency of such gates can be observed, and larger transistor widths are required to partially offset this performance loss at the price of an increased area [RCN04]. Among the previously listed cells, only NAND and INV/NOT gates do not classically require PMOS transistors to be stacked. NOR gates with more than two inputs suffer most severely from the mobility mismatch due to the larger PMOS stacks. To clarify the impact of such observations on the performance of gates in common standard cell libraries, we present latency figures for individual logic gates exemplarily for the NanGate 45 nm and 15 nm Open Cell Libraries (OCLs) in Section 2.
All gate- and transistor-level considerations described above are universally applicable to CMOS standard cells, independent of the particular foundry, manufacturing process and minimum feature size. Hence, it makes sense to take such characteristics into account when attempting to implement a certain function, like an encryption algorithm, as a hardware circuit with minimum latency. When revisiting previous latency-driven constructions in cryptography, it is clear that such low-level observations have not been considered in the past. We provide first contributions towards hardware-aware low-latency design and construct a family of ultra low-latency block ciphers based on the underlying principles.

Motivation
Approaches to secure the internals of modern Central Processing Units (CPUs) have received significant attention in the last few years as microarchitectural attacks, notably Meltdown [LSG+18] and Spectre [KHF+19], revealed serious shortcomings in the security architectures of widely deployed high-end processors. Hardware-based mitigations for such attacks are proposed "en masse". Many of them call for a higher level of encrypted communication inside of CPUs as well as between CPUs and their surrounding hardware components. Among the former are proposals for secure caches such as ScatterCache [WUG+19] and CEASER [Qur18]. Both of them are compared to a number of further cache architectures in [DXS19]. To implement new features of this kind in the next generations of mainstream processors without causing a large performance penalty, high-speed encryption primitives are among the most important building blocks.
Secure caches are only one example of security applications in CPU environments that require high-speed encryption. Dedicated hardware instructions, memory encryption, pointer authentication (as prominently implemented using QARMA in ARM processors) and similar hardware-assisted mechanisms against software exploitation fall into this category as well. We expect to see a lot more of such features implemented in future generations of secure processor architectures, especially when more highly-optimized cryptographic primitives become available. SPEEDY is meant as a general-purpose high-speed encryption primitive for all these applications and is not limited or tailored to a subset of them. Most low-latency ciphers published in the literature so far, such as PRINCE [BCG+12], PRINCEv2 [BEK+20], MANTIS [BJK+16] and QARMA [Ava17], try to meet tight area and energy requirements in addition to low latency. These properties make them particularly suitable for highly-constrained microcontrollers in the Internet of Things (IoT). However, keeping the primitives suited for battery-powered devices requires sacrifices with respect to maximum performance. High-end CPUs do not impose the same kind of restrictions on area and energy, yet they require even higher performance in terms of latency and throughput. SPEEDY is able to outperform the state of the art by focusing on maximum encryption speed and high security only.

Related Work
Designing cryptographic primitives with minimum execution time in hardware is still a young and emergent research discipline. At CHES 2012, the authors of [KNR12] delivered first results in that area by comparing the latency properties of multiple (lightweight) block ciphers. It was concluded that, among other factors, the use of cryptographically-strong 4-bit (or even 3-bit) S-boxes should be favored over larger substitutions and that a low number of rounds should be maintained, even at the price of a heavier linear layer, when designing a low-latency primitive. These demands were immediately met by the first dedicated low-latency block cipher, called PRINCE, which was presented at ASIACRYPT 2012. PRINCE is a 64-bit block cipher with a 128-bit key and 12 cipher rounds which features an innovative reflection property that allows encrypting and decrypting data with essentially the same circuit. Recently, an updated version called PRINCEv2 has been proposed which claims to increase the security level of PRINCE by making small modifications to the key schedule and the middle rounds [BEK+20]. This work also provides a comparison of multiple low-latency block ciphers which confirms that PRINCE and PRINCEv2 are still the fastest such primitives in the public literature [BEK+20]. The comparison also includes the tweakable block ciphers MANTIS [BJK+16] and QARMA [Ava17] as well as the low-energy block cipher Midori [BBI+15] and demonstrates that all three of them come at a latency overhead between 22 % and 42 % (considering the encryption-only variants) compared to PRINCE in open-source NanGate libraries. This result may not come as a surprise, since tweakable block ciphers such as MANTIS and QARMA are expected to require a larger circuit depth due to the additional tweak input, and since Midori has not been designed with low latency as the primary design goal, although its substitution layer has been chosen particularly to offer a small delay. However, two recent works claim
that cryptographic primitives aside from traditional block ciphers are able to outperform PRINCE in terms of latency. First, the high-performance cross-platform permutation Gimli introduced in [BKL+17] is claimed in [GKD20] to enable encryption with a 1.7 times smaller latency than PRINCE, while the low-latency pseudorandom function (PRF) Orthros introduced in [BIL+21] claims to achieve a latency about 7 % below PRINCE's. We analyze both claims in our comparison in Section 7 and conclude that the latter is consistent with our results, while the former is clearly not. Orthros is able to achieve a lower latency than PRINCE by computing the sum of two keyed permutations [BIL+21], which makes the resulting primitive non-invertible (in contrast to block ciphers like SPEEDY). Apart from the full cryptographic primitives discussed above, there are also some works focusing on particular cryptographic building blocks only. For instance, in [LSL+19] it is shown how to construct involutory low-latency Maximal Distance Separable (MDS) matrices. The authors of [BFP19] present techniques for finding small low-depth circuits for cryptographic functions. In [BMD+20] the main goal is to construct S-boxes whose masked variants (i.e., their side-channel protected versions) have a low latency in hardware, which conceptually requires a low AND depth and AND gate complexity. Low-latency hardware masking in general, used to protect cryptographic primitives against side-channel attacks, has received significant attention in the last few years, as demonstrated in [MS16, GIB18, ABP+18, BKN19, SBHM20]. However, this field is not directly related to the development of low-latency symmetric primitives in general, as the requirements are vastly different and sometimes even direct opposites.

Our Contribution
We introduce SPEEDY, a family of ultra low-latency block ciphers dedicated to semi-custom, i.e., standard-cell-based, integrated circuit design. In order to tailor this cryptographic primitive towards maximum execution speed in hardware, we first analyze which types of logic gates and circuit topologies are particularly suited for ultra low-latency encryption. Our considerations in this regard are novel and have, to the best of our knowledge, not been applied in previous designs of symmetric cryptographic primitives. SPEEDY can be instantiated with different block and key sizes and varying numbers of rounds. However, due to our S-box width of 6 bits and our main target application of 64-bit high-end CPUs, we decided to use the least common multiple of 6 and 64, namely 192, as the default block and key size and call this instance SPEEDY-r-192. We claim that SPEEDY-r-192 achieves 128-bit security when iterated over r = 6 rounds and full 192-bit security when iterated over r = 7 rounds, while the r = 5 round variant already provides a decent security level that is sufficient for many practical applications. Our extensive evaluation of hardware implementations in 6 different standard cell libraries shows that both SPEEDY-5-192 and SPEEDY-6-192 achieve a lower latency in hardware than any other known encryption primitive, while SPEEDY-7-192 is only marginally slower than PRINCE. Considering the provided security levels, this is a significant improvement over the state of the art in the area of (ultra) low-latency cryptography.

Background
In this section we revisit the necessary concepts which build the foundation for SPEEDY and analyze the primary traits that make certain CMOS standard cells and circuit topologies particularly useful for high-speed cryptography.

Natural CMOS Gates (NCGs)
A static CMOS gate is constructed by combining a pull-up with a pull-down network. The pull-up network, as the name suggests, is responsible for pulling the output of the gate up to VDD whenever the Boolean function should result in a logical '1'. The pull-down network, analogously, is responsible for pulling the output down to GND whenever the Boolean function should output a logical '0'. The networks are built in a mutually exclusive manner such that only one of them is conductive for each combination of input signals [RCN04]. While the pull-up networks are exclusively built from PMOS devices, the pull-down networks are built from NMOS devices. PMOS devices can be understood as switches that conduct current between their drain and source terminals whenever their gate voltage is low; NMOS devices conduct current between the terminals whenever their gate voltage is high. For the opposite gate voltages the transistors are in a high-resistance state. The assignment of PMOS transistors to pull-up networks and NMOS to pull-down networks originates from the fact that PMOS devices cannot produce so-called strong zeros, while NMOS devices cannot produce strong ones [RCN04]. In consequence, static CMOS gates with a single stage are naturally inverting by design. Non-inverting Boolean functions require at least two stages of pull-up and pull-down networks. Thus, as already discussed in Section 1, certain logic functions are a more natural fit for technologies that are based on complementary metal-oxide-semiconductor logic. Inverting Boolean functions include for instance the common logic gates INV/NOT, NAND, NOR, XNOR, AOI and OAI. Most of them (all except XNOR) can be realized as static gates by using only the lower bound of 2n devices, namely n PMOS and n NMOS transistors. We call all inverting logic gates which require only one stage and 2n transistors for their implementation Natural CMOS Gates (NCGs). All NCGs commonly found in standard cell libraries with 1 ≤ n ≤ 4 inputs are
depicted in Appendix A, Figure 4. Such logic cells are not only interesting from a hardware design perspective because they require a lower number of transistors and therefore have a smaller area footprint; they are also faster than their non-natural counterparts and therefore beneficial for low-latency constructions.
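As a small sanity check, the natural gates discussed above can be modeled as Boolean functions; the sketch below (gate names follow the NanGate convention, the models are simply truth-table definitions) verifies that each NCG is inverting, i.e., the complement of a plain AND/OR combination of its inputs.

```python
# Truth-table models of a few Natural CMOS Gates (NCGs). Each is inverting:
# the complement of a plain AND/OR combination of its inputs.
def INV(a):          return 1 - a
def NAND2(a, b):     return 1 - (a & b)
def NOR2(a, b):      return 1 - (a | b)
def AOI21(a, b, c):  return 1 - ((a & b) | c)   # AND-OR-INVERT
def OAI21(a, b, c):  return 1 - ((a | b) & c)   # OR-AND-INVERT

# every single-stage inverting gate outputs 1 on the all-zero input,
# because then only the PMOS pull-up network conducts
for gate, fan_in in [(INV, 1), (NAND2, 2), (NOR2, 2), (AOI21, 3), (OAI21, 3)]:
    assert gate(*([0] * fan_in)) == 1
print("all NCG models are inverting at the all-zero input")
```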

Latency of CMOS Logic Gates
The time that a physical instance of a logic gate requires to respond to a change in its input signals by updating its output signal is called the delay or the latency of a cell. Considering CMOS hardware, the latency of a physical instance of a logic cell depends on a number of factors. Besides environmental influences like the temperature and the supply voltage, also the transition time of the input signals and the capacitance that needs to be driven at its output play a significant role. In this subsection, however, we want to compare the base latencies of static CMOS gates when all outside factors are equal. Tables 1 and 2 list the latencies of common logic gates in two open-source standard cell libraries, namely the NanGate 45 nm and 15 nm Open Cell Libraries (OCLs), respectively. The latency values are given in picoseconds and have been obtained by analyzing a netlist containing only the individual logic gate enclosed between standard D-flip-flop cells for typical operating conditions (25 °C, nominal voltage) with the Electronic Design Automation (EDA) software Synopsys Design Compiler Version O-2018.06-SP4, using Composite Current Source (CCS) models of the standard cells. Please note that for simplicity only the logic gates with the minimum drive strength (denoted by the suffix "_X1" in NanGate libraries) are shown here. However, the following arguments and considerations also apply to the higher drive strength variants. As expected, the natural CMOS gates, defined in the previous subsection, produce their outputs significantly faster than the competition. Interestingly, though, some significant differences between analogous natural gates such as NAND and NOR can be observed. In NanGate 45 nm technology, for example, the NAND4_X1 cell is more than twice as fast as the NOR4_X1 cell. This is due to the different physical behavior of p-type and n-type MOSFETs realized in silicon as semiconductor material. In n-type MOSFETs the majority carriers are electrons, which are
negatively charged. In p-type MOSFETs, on the other hand, the majority carriers are positively charged holes [RCN04]. Holes are less mobile than electrons, which means they move slower. Therefore, simply speaking, PMOS transistors operate slower than NMOS transistors of the same size. This situation is even amplified when connecting PMOS devices in series (stacking) and leads to a significant performance degradation and an increased area demand, due to the larger widths required to partially offset the performance penalty and achieve balanced rise and fall times. Classic CMOS NOR gates require stacks of n PMOS transistors and are therefore among the logic functions which suffer the most from the lower mobility of holes as majority carriers. For n = 3 and n = 4 the situation depends on the exact sizing of the transistors chosen by the cell designer for each particular gate. This choice determines the trade-off between area and latency of the logic cells. Typically, either NAND3 and NAND4 or OAI21 and OAI22 are the fastest gates for n = 3 and n = 4, respectively. In NanGate 45 nm technology, OAI21 (n = 3) and NAND4 (n = 4) are the fastest cells for their respective number of inputs, while in 15 nm technology the NAND3 (n = 3) and OAI22 (n = 4) cells are the fastest, as apparent in Tables 1 and 2.

Suitability for High-Speed Encryption
There are several factors to be considered when determining which cells in a standard gate library are most suitable for low-latency encryption. Building a low-latency encryption primitive in hardware is essentially the task of creating a circuit that, as quickly as possible, establishes an, as highly as possible, non-linear relationship between the plaintext and, as many as possible, independent key bits. Of course, this is an extreme oversimplification of the large number of requirements that symmetric cryptographic primitives need to fulfill in order to parry all known attacks. Yet, when following this simplified idea, the design process for an ultra low-latency cipher should start at the gate level. In particular, we are interested in logic gates that are capable of establishing a Boolean relationship between as many inputs as possible in a short period of time. In that regard, we introduce a new metric, which we call the Fan-in-to-Latency Ratio (FLR). Essentially, we divide the fan-in n of each gate (i.e., the number of inputs it receives) by its latency. Let f : F_2^n → F_2 be the Boolean function of a logic gate and n the number of inputs it receives (i.e., the fan-in); then the Fan-in-to-Latency Ratio (FLR) of f can be expressed as Equation 1:

FLR(f) := n / latency(f).    (1)
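As a concrete illustration of the metric, the short sketch below ranks a handful of common gates by FLR; the (fan-in, latency) values are placeholder numbers for illustration only, not the exact NanGate figures from Tables 1 and 2.

```python
# Ranking gates by their Fan-in-to-Latency Ratio, FLR(f) = n / latency(f).
# The (fan-in, latency in ps) pairs below are illustrative placeholders.
gates = {
    "INV":   (1,  6.0),
    "NAND2": (2,  8.0),
    "NOR2":  (2, 12.0),
    "OAI21": (3, 13.0),
    "NAND4": (4, 15.0),
    "NOR4":  (4, 33.0),
    "XOR2":  (2, 25.0),
}
flr = {g: n / lat for g, (n, lat) in gates.items()}
for g in sorted(flr, key=flr.get, reverse=True):
    print(f"{g:6s} FLR = {flr[g]:.3f}")
```

With any realistic latency values, NAND gates dominate the ranking while XOR ends up at the bottom, which is exactly the ordering the text argues for.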
By calculating the FLR for each logic gate in a standard cell library, one can rank the gates by their suitability for ultra low-latency encryption. Tables 1 and 2 list the FLR scores for all standard logic gates with n inputs for 1 ≤ n ≤ 4. The FLR score reflects the ability of a logic gate to rapidly evaluate a Boolean function on multiple inputs. Hence, the higher the value in the FLR column for a logic gate, the higher is its potential to be suitable for ultra low-latency encryption. NAND and OAI gates are among the logic cells with the highest FLR scores, while XOR and XNOR gates are among the worst performers. Thus, despite the importance of XOR (and XNOR) gates in symmetric cryptography (mostly for key addition and strong linear layers), it is prudent to limit their occurrence to a minimum. Obviously, the kind of Boolean logic function that is evaluated plays a significant role in determining its suitability for high-speed encryption as well. In that regard, a further important aspect is the linearity of a function. Lin(f) denotes the linearity of the Boolean function f, defined by Equation 2, where f̂ : F_2^n → Z is the Fourier transform of f given by Equation 3.
Lin(f) := max_{α ∈ F_2^n} |f̂(α)|    (2)

f̂(α) := Σ_{x ∈ F_2^n} (−1)^{f(x) ⊕ ⟨α, x⟩}    (3)

Tables 1 and 2 provide the linearity of all listed logic gates. The linearity of a Boolean function f : F_2^n → F_2 is lower bounded by 2^{n/2} and upper bounded by 2^n. Whenever Lin(f) = 2^n, f is an affine function, i.e., Equation 4 holds with α ∈ F_2^n, c ∈ F_2:

f(x) = ⟨α, x⟩ ⊕ c.    (4)
In our tables, the logic functions INV/NOT, BUF, XOR and XNOR have maximum linearity (2^n) and can be expressed as constant or affine functions, while the logic gates AND2, NAND2, NOR2 and OR2 reach the lower bound for the linearity of 2^{n/2}. While both linear and non-linear functions are useful for the construction of secure encryption algorithms, they are typically used in different layers or round operations. The non-linear layer in block cipher design is typically the substitution layer, while all other operations tend to be linear. Often the substitution boxes, in short S-boxes, are among the most resource-consuming elements in terms of area, energy and latency. Therefore, it is particularly interesting to optimize this building block towards the desired design goal when developing and implementing a cipher. In that regard, non-linear gates with a high FLR score, like NAND and OAI, are the prime candidates for building strong and fast S-boxes.
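For small gates, the linearity values reported in the tables can be reproduced by directly evaluating the Fourier transform; the brute-force sketch below confirms that NAND2 reaches the lower bound 2^{n/2} = 2 while the affine XOR2 reaches the upper bound 2^n = 4.

```python
# Brute-force linearity Lin(f) = max_alpha |f_hat(alpha)|, where
# f_hat(alpha) = sum_x (-1)^(f(x) XOR <alpha, x>) (Equations 2 and 3).
# Inputs x and masks alpha are encoded as integers with n bits.
def lin(f, n):
    best = 0
    for alpha in range(2 ** n):
        acc = sum((-1) ** (f(x) ^ (bin(alpha & x).count("1") & 1))
                  for x in range(2 ** n))
        best = max(best, abs(acc))
    return best

NAND2 = lambda x: 1 - ((x & 1) & ((x >> 1) & 1))
XOR2  = lambda x: (x & 1) ^ ((x >> 1) & 1)

print(lin(NAND2, 2))  # 2 = 2^(n/2), the lower bound (bent function)
print(lin(XOR2, 2))   # 4 = 2^n, the upper bound (affine function)
```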

Latency of Logic Circuits
It is insufficient to consider only the latencies of individual logic elements in order to determine the resulting total latency of a combinatorial circuit or path. When connecting logic gates to logic circuits, the individual propagation delays of the gates depend significantly on their direct electrical environment. Merely summing up the base latencies of the gates in a path (e.g., the values given in Tables 1 and 2) may give a very incorrect idea about the path's total latency. Despite the fact that some obvious correlation between these quantities can be observed, the gate depth of a path is not always directly proportional to its latency. Therefore, it is important to also consider adequate circuit topologies which minimize the latency of combinatorial circuits when designing a low-latency cipher. In this regard, we first want to dispel two common myths about the latency of CMOS circuits:

• Myth 1: Each CMOS standard cell has a fixed delay and each instantiation of the same exact standard cell adds (approximately) the same latency to a path.
Truth: This is false. The propagation delay of a CMOS cell is always a function of the transition time of its input signals, which is influenced by the drive strength of preceding cells and the capacitance of the nets they need to drive, as well as the capacitive load that the CMOS cell itself needs to drive at its output. The variations of the delay of a cell instance depending on its electrical environment can easily be in the range of 200-300 %. Therefore, it is not uncommon that two instances of the same cell in different positions of a logic circuit have delays associated with them (e.g., in a timing report) that differ by a factor of 3 or 4.
• Myth 2: Adding a gate to a path of a circuit and not making any other changes to the path will always increase the path's latency.
Truth: This is also false. Often, adding a well-placed buffer or inverter (where logically applicable) to a path in order to charge a significant capacitive load faster can decrease the overall latency of the path. Hence, the mere gate depth is not always indicative of the latency of a circuit. Generally, the topology of a circuit, primarily the fan-out of the logic cells, is similarly important as the number and type of gates in its critical path when determining the maximum latency.
In the following we provide an example which demonstrates the incorrectness of the two myths. We consider a simple circuit in Figure 1(a) where the output signal of a single XOR gate feeds 8 further XOR gates. Comparing the latencies of the gates in this circuit to the base latencies given in Table 2, it is obvious that the actual latencies of the gates in this circuit are significantly larger. The first XOR gate in particular, which feeds the other 8 gates, requires a latency which is more than 4 times as large as its base latency due to the significant capacitive load it needs to drive. The XOR gates in the second stage do not drive any large loads, but their latency is increased because their input signals have a large transition time. It is noteworthy that this is a synthesis result, which means that the actual capacitances and resistances of the routing (i.e., wiring) are not even considered yet. After placing and routing this circuit in a chip design, the latencies would likely be even larger. Figure 1(b) shows a circuit with the same logic functionality and the same 9 total XOR gates, but here the output of the first-stage XOR is buffered by a higher drive strength buffer (BUF_X4). Although this change increases the gate depth of the circuit, it decreases its overall latency. The first-stage XOR now only needs to drive a small load and the last-stage XORs are driven by input signals with a short transition time. As a result, the buffered circuit has a total latency of 18.675571 ps (Fig. 1(b)) while the circuit without a buffer has a total latency of 29.169073 ps (Fig.
1(a)). Hence, the buffered circuit is more than 35 % faster. Please note that the NanGate 15 nm library does not provide XOR gates with a higher drive strength; thus, up-sizing the first-stage XOR itself is not an option here and buffering the high fan-out net is inevitable if the latency is to be reduced. Of course, this is done automatically by the synthesis tool. Our point here is simply that, regardless of how the large fan-out is addressed by the tool or the designer, e.g., by up-sizing the gate or inserting a buffer, the large fan-out assuredly causes an increased latency compared to a circuit with the same depth and the same gates in both levels, but with smaller fan-outs. Thus, we conclude that dedicated low-latency circuits should use topologies where the fan-outs of the logic gates are as small as possible (e.g., tree-based).
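A rough logical-effort style estimate (stage delay ≈ g·h + p, in units of an inverter delay) illustrates why the deeper, buffered path can win; the effort and parasitic values used here are coarse textbook-style assumptions for illustration, not NanGate characterization data.

```python
# Logical-effort sketch: stage delay ~ g*h + p, where g is the logical
# effort, h the electrical effort (fan-out) and p the parasitic delay,
# all normalized to an ideal inverter. XOR/BUF values are assumptions.
def stage_delay(g, h, p):
    return g * h + p

G_XOR, P_XOR = 4.0, 4.0   # assumed effort/parasitic of an XOR2 cell
G_BUF, P_BUF = 1.0, 2.0   # assumed effort/parasitic of a buffer

# (a) one XOR drives the inputs of 8 identical XOR gates directly
direct = stage_delay(G_XOR, 8, P_XOR)
# (b) the XOR drives a single buffer, which then drives the 8 XOR inputs
buffered = stage_delay(G_XOR, 1, P_XOR) + stage_delay(G_BUF, 8, P_BUF)
print(direct, buffered)   # the deeper, buffered path has the smaller delay
```

Even this crude model reproduces the qualitative effect from Figure 1: adding a stage reduces the total delay because the slow XOR no longer drives the large fan-out.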

Finding Circuits with Minimum Latency
We would like to caution against the common perception that professional synthesis tools can readily be used to find and generate a netlist with the minimum achievable latency for a simple Boolean function like an S-box coordinate function. First of all, the complexity of checking every possible circuit representation of a Boolean function composed from a finite (but usually large) set of standard cells is remarkably high, and market-leading EDA tools are built for time efficiency (especially the synthesis routines). Furthermore, the proprietary synthesis algorithms may not be sufficiently configurable to consider latency as the only or primary design goal. The tools may rather take area and energy into account as well and not consider latency optimizations that come at a harsh penalty for the other two optimization goals. In our own experience, the thresholds for such decisions cannot be adjusted sufficiently by the designer. Thus, we have found that constructing optimal building blocks for ultra low-latency cryptography needs to be done from scratch (by hand or via heuristics) instead of analyzing many different variants with a synthesis tool and selecting the ones that delivered the best performance. In our evaluations, the synthesis algorithms usually produced the best results with respect to low latency when the underlying gate structure was already given and only incremental performance optimizations were required.

Ultra Low-Latency 6-bit S-box
In this section, we describe the technique we have used to build an ultra low-latency S-box from the gate level. In order to design an S-box which is extremely fast in CMOS hardware while at the same time providing good cryptographic properties, we used the following criteria:

• Ultra low latency: As explained in Subsection 2.2, NAND and OAI gates are among the best-suited logic gates for low-latency S-box design. Thus, we search for S-boxes that can be realized with as few levels as possible of only NAND and OAI gates. Furthermore, as discussed in Subsection 2.3, we try to make sure that in as many stages as possible the logic gates have a minimum fan-out.
• Bijective mapping with fully-dependent outputs: Since we aim for an SPN cipher, we need the S-box to be a bijective mapping.Moreover, we restrict the search to the S-boxes with fully-dependent outputs.In more detail, this means that all input bits are involved in the computation of each output bit.
• Small linearity and uniformity: To provide strong resistance against differential and linear attacks, we are only interested in S-boxes with small uniformity u and small linearity l, defined as

u := max_{a ∈ F_2^n \ {0}, b ∈ F_2^n} #{x ∈ F_2^n : S(x ⊕ a) ⊕ S(x) = b},
l := max_{β ∈ F_2^n \ {0}, α ∈ F_2^n} |Σ_{x ∈ F_2^n} (−1)^{⟨β, S(x)⟩ ⊕ ⟨α, x⟩}|.

By definition, the latency of a vectorial Boolean function, e.g., an S-box, is the maximum of the latencies of its coordinate Boolean functions. Besides, to have a bijective fully-dependent S-box with a small linearity, all of its coordinate functions must be balanced, fully-dependent and have a small linearity. Hence, our strategy was to first find low-latency Boolean functions and in a second step try to combine those into an S-box. It is noteworthy that the S-boxes within the same class of extended bit-permutation equivalence have roughly the same latency cost (with a small margin of tolerance). Moreover, those functions will have the same uniformity and linearity. We recall from [LP07] that two n-bit to m-bit vectorial Boolean functions F and G of the form F_2^n → F_2^m are called extended bit-permutation equivalent if there exist a ∈ F_2^n, b ∈ F_2^m, a bit permutation function P_in of n bits and a bit permutation function P_out of m bits such that

G(x) = P_out(F(P_in(x) ⊕ a)) ⊕ b.

Therefore, it is sufficient to consider S-boxes only up to this equivalence.
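Both quantities can be computed by brute force for any small S-box. As an illustrative sketch, the code below evaluates u and l for the well-known 4-bit PRESENT S-box (an optimal 4-bit S-box); a 6-bit candidate would be checked in exactly the same way.

```python
# Brute-force uniformity u and linearity l of an S-box, illustrated on
# the 4-bit PRESENT S-box (known to be optimal, with u = 4 and l = 8).
PRESENT = [0xC, 5, 6, 0xB, 9, 0, 0xA, 0xD, 3, 0xE, 0xF, 8, 4, 7, 1, 2]

def uniformity(S, n):
    # max over nonzero input differences a and all output differences b
    return max(sum(S[x ^ a] ^ S[x] == b for x in range(2 ** n))
               for a in range(1, 2 ** n) for b in range(2 ** n))

def linearity(S, n):
    best = 0
    for b in range(1, 2 ** n):        # nonzero output mask beta
        for a in range(2 ** n):       # input mask alpha
            acc = sum((-1) ** ((bin(b & S[x]).count("1")
                                ^ bin(a & x).count("1")) & 1)
                      for x in range(2 ** n))
            best = max(best, abs(acc))
    return best

print(uniformity(PRESENT, 4), linearity(PRESENT, 4))
```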

Suitable Boolean Functions
To achieve a minimal latency, we searched for coordinate functions that can be realized in only two levels of NAND and OAI gates, or more specifically NAND2, NAND3, NAND4, OAI21 and OAI22 gates, while the larger and slower NAND4 and OAI22 gates should only be used in one of the two levels. Additionally, each gate in the first stage of NAND and OAI gates should have a fan-out of 1. With this restriction, we are able to find Boolean functions with an extremely low latency in CMOS hardware. We empirically found that Boolean functions based exclusively on NAND gates achieve the best cryptographic properties and latencies with only two levels, and in greater numbers; therefore, in the following we limit ourselves to S-boxes built only from NAND gates. However, using the same process described in the following, we have also created S-boxes based exclusively on OAI gates (functions based on a mix between NAND and OAI gates have shown to be less promising) and compare them to the NAND-based boxes at the end of this section. By considering all the possibilities for the inputs of the NAND gates at the first level, we aim at building all the n-bit Boolean functions f(x_0, . . ., x_{n−1}); i.e., for each input of a NAND gate we test 2n possible inputs: either x_i or its inverted value ¬x_i with 0 ≤ i < n.
We then filter the Boolean functions with respect to the aforementioned criteria, that is, balancedness and low linearity. Please note that selecting the inverted inputs requires additional inverter gates before the first stage of NAND gates. Yet, since each of the S-box inputs feeds multiple coordinate Boolean functions at the same time, it is prudent to instantiate buffers to drive those nets anyway, and an inverter can serve the same purpose. Following this argument, the inverted inputs do not cause any significant extra cost. The first step is to find all Boolean functions f : F_2^n → F_2 which are: 1) constructible from two levels of NAND gates as explained previously, 2) balanced, 3) fully dependent on all input bits, and 4) of linearity at most l. It is important to mention that the order of checking these features matters considerably for the computational cost. We save all those Boolean functions in a set named F. Note that if there is a function f ∈ F, then all of its extended bit-permutation equivalent functions f(P(x) ⊕ a) ⊕ b, with a ∈ F_2^n, b ∈ F_2 and P a bit-permutation function of n bits, are also included in F. Next, we reduce the Boolean functions within F by the extended bit-permutation equivalence and keep only one representative of each equivalence class in another set F*. Note that if there are N*_f Boolean functions in F*, then there are about N*_f · n! · 2^{n+1} functions in F. This reduction corresponds to the n! permutations of the input bits, the 2^n constants we can add to the input and the single bit we can add to the output.
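A toy version of this enumeration (again our own illustration, not the authors' C++ search code) can be written for n = 3 with two first-level NAND2 gates feeding one second-level NAND2 gate; the candidates are then filtered for balancedness and full dependency exactly as described:

```python
from itertools import product

n = 3
SIZE = 1 << n

def truth_table(f):
    return tuple(f(x) for x in range(SIZE))

def literal(i, inv):
    # x_i or its inverted value ~x_i
    return lambda x: ((x >> i) & 1) ^ inv

def nand(a, b):
    return lambda x: 1 - (a(x) & b(x))

def balanced(tt):
    return sum(tt) == SIZE // 2

def fully_dependent(tt):
    # f depends on bit i iff flipping x_i changes the output for some x
    return all(any(tt[x] != tt[x ^ (1 << i)] for x in range(SIZE))
               for i in range(n))

lits = [literal(i, inv) for i in range(n) for inv in (0, 1)]  # 2n literals
candidates = set()
for a, b, c, d in product(lits, repeat=4):
    f = nand(nand(a, b), nand(c, d))       # two-level NAND tree
    tt = truth_table(f)
    if balanced(tt) and fully_dependent(tt):
        candidates.add(tt)
print(len(candidates))
```

The real search works analogously, but with n = 6, the full gate repertoire (NAND2/3/4, OAI21/22) and the additional linearity filter.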

Building S-boxes
To find all bijective S-boxes S = (f_0, ..., f_{n−1}) such that each coordinate function is in F, we could simply choose n of those N_f functions and then check the necessary criteria, but this requires about (N_f)^n steps of checking all criteria, which for n > 4 is a prohibitive computational cost. The two main options to reduce this cost are (i) exploiting the extended bit-permutation equivalence and (ii) selecting the coordinate functions step by step and filtering after each additional choice. Since it is sufficient to find the bijective S-boxes up to extended bit-permutation equivalence, we can restrict the first coordinate function f_0 to be chosen from F*; this is due to the freedom of choosing the constant and the bit-permutation at the input of the S-box. Besides, for all other coordinate functions f_1, ..., f_{n−1}, we can fix one output value to a constant, e.g., f_i(0) = 0, because of the freedom in the output constant of the S-box. Note that since f_0 is chosen from F* and is a representative function, we already have f_0(0) = 0. Moreover, since we are still left with the freedom of the output bit-permutation of the S-box, we can fix the order of the coordinate functions. In other words, if we consider the elements of F to be indexed, we can require the index of f_1 to be smaller than the index of f_2, both smaller than the index of f_3, and so on. This way, we reduce the number of choices to build an S-box to about 2^41, which is still infeasible to search exhaustively. The other main technique to reduce the computational cost is that, instead of choosing all coordinate functions at once and then checking the criteria, we choose them one by one and, after each choice of a coordinate function, check all criteria that can already be evaluated. In more detail, in step 1 we choose f_0 ∈ F*, then in step 2 we choose f_1 ∈ F.
Before going to step 3, we can check balancedness and linearity of the component function f_0 ⊕ f_1. We proceed to the next step if the criteria for f_0 ⊕ f_1 are met; otherwise, we stay in step 2 and choose another function as f_1. In step 3, after choosing f_2 ∈ F, we again check balancedness and linearity of the component functions f_0 ⊕ f_2, f_1 ⊕ f_2 and f_0 ⊕ f_1 ⊕ f_2, and go to step 4 if all these criteria are met. In this way, we choose all n coordinate functions to build the S-box and only then check the uniformity criterion. This technique, together with several other low-level techniques for speeding up the search, reduces the computational cost significantly. Our search algorithm is written in C++ and we ran it on an Intel Core i7 CPU with 8 threads for about 10 days to exhaustively search all possible 6-bit S-boxes. Finding all 5-bit S-boxes only requires about two hours. We also constructed 7- and 8-bit S-boxes, but due to their larger linearity or uniformity values, they would not have been beneficial over the 6-bit S-box.
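The incremental filtering can be illustrated on a toy scale (our sketch, not the paper's code), using the fact that an S-box is bijective exactly when every non-zero component function is balanced. Here 3-bit S-boxes are assembled coordinate by coordinate from a pool of balanced functions, discarding a partial choice as soon as one component combination is unbalanced:

```python
from itertools import product

n = 3
SIZE = 1 << n

def balanced(tt):
    return sum(tt) == SIZE // 2

def xor_tt(a, b):
    return tuple(x ^ y for x, y in zip(a, b))

# Toy pool F: all balanced 3-bit Boolean functions, given as truth tables.
pool = [tt for tt in product((0, 1), repeat=SIZE) if balanced(tt)]

sboxes = []
for i0, f0 in enumerate(pool):
    for i1 in range(i0 + 1, len(pool)):          # fixed coordinate ordering
        f1 = pool[i1]
        if not balanced(xor_tt(f0, f1)):
            continue                             # early filter after choice 2
        for i2 in range(i1 + 1, len(pool)):
            f2 = pool[i2]
            comps = (xor_tt(f0, f2), xor_tt(f1, f2),
                     xor_tt(f0, xor_tt(f1, f2)))
            if all(balanced(c) for c in comps):  # all components balanced
                # => (f0, f1, f2) defines a bijective 3-bit S-box
                sboxes.append((f0, f1, f2))

f0, f1, f2 = sboxes[0]
S = [f0[x] << 2 | f1[x] << 1 | f2[x] for x in range(SIZE)]
print(len(sboxes), sorted(S) == list(range(SIZE)))
```

The early rejection after the second choice is what makes the real 6-bit search tractable; the final uniformity check is only run on the few surviving candidates.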

Results
In the case of 6-bit S-boxes, the minimum linearity and the minimum uniformity over all S-boxes that can be built this way are 24 and 8, respectively. For these properties, up to extended bit-permutation equivalence, there are only two classes of such S-boxes. We chose the S-box class equivalent to the one shown in Figure 2 and given in Table 3 because of its higher algebraic degree. For the chosen S-box class, we have the freedom to choose the input/output constants a and b as well as the bit-permutation functions P_in and P_out. We choose the output constant b such that there is no need to insert an inverter at the output of the NAND gates of the second gate level. Even though it is only a tiny improvement, the input constant a is chosen to minimize the latency of the whole structure. Finally, we choose the bit-permutations such that they improve the cryptographic properties of the SPEEDY round function, which is explained in more detail in Section 6. Note that the optimal choice of these bit-permutations can differ for the round functions of different primitives. Altogether, we end up with the S-box presented in Table 3. Its corresponding implementation is depicted in Figure 2. Furthermore, the disjunctive normal form (DNF) of the S-box, which is equivalent to the representation by the 2-level NAND trees, is presented below.

S-box Latency Comparison
We benchmark our chosen S-box with respect to minimum latency in hardware and compare it to a number of other S-boxes from the literature, including the Advanced Encryption Standard (AES) 8-bit S-box [oST01]. Under the abbreviation OAIU8L24 we list a 6-bit S-box built from two levels of OAI22 gates with uniformity 8 and linearity 24 (the same properties as the SPEEDY S-box). By min(RU8L24) we denote the minimum latency achieved among 10 randomly generated 6-bit S-boxes with uniformity 8 and linearity 24 (without focusing on a particularly efficient implementation). Finally, the inverse of the SPEEDY S-box is included. However, this inverse is not required for SPEEDY encryption and is therefore only relevant for the latency of decryption. Minimizing the decryption latency is not a focus of this work.
From the comparison it becomes clear that the SPEEDY S-box is impressively fast in hardware. It is much faster than any other S-box with more than 4 input bits (#ib), especially when considering the optimized version with direct instantiation of standard cells in the code based on Figure 2. Additionally, it even outperforms several of the 4-bit low-latency S-boxes (including Midori Sb_1, QARMA σ_1 and QARMA σ_2). This is a strong result, since the SPEEDY S-box not only provides better diffusion in general but also offers stronger protection against linear and differential attacks than any 4-bit S-box possibly could. Thus, we are confident in our S-box choice as the centerpiece for an ultra low-latency cipher.

Specification of SPEEDY
SPEEDY is a family of ultra low-latency block ciphers with different block and key sizes and varying numbers of rounds. Precisely, SPEEDY-r-6ℓ is the instance of this family with block and key size 6ℓ bits that iterates over r rounds. The internal state is viewed as an ℓ × 6 rectangular array of bits. We use the notation x_[i,j] to denote the bit located at row i and column j of the state x, with 0 ≤ i < ℓ and 0 ≤ j < 6.
It is important to emphasize that in the remainder of this paper all indices start from zero and the zero-th bit or word is always considered the most significant one. Besides, note that any addition or subtraction in the indices of the state is always taken modulo ℓ for the first (row) index and modulo 6 for the second (column) index.

Initialization:
The cipher receives a 6ℓ-bit plaintext and initializes the internal state with it using the same order used for indexing bits, i.e., it first fills x_[0,0], then x_[0,1], and so on.
Then, r round functions R_i (with 0 ≤ i < r) are applied to the internal state, the first r − 1 of which (up to the round keys and round constants) are identical. Each round function is composed of the following five different operations: (2×) SubBox, (2×) ShiftColumns, MixColumns, AddRoundConstant and AddRoundKey. Considering x ∈ F_2^{ℓ×6} as the input and y ∈ F_2^{ℓ×6} as the output of an operation, with 0 ≤ i < ℓ and 0 ≤ j < 6, the round operations are defined as follows: • SubBox (SB): The 6-bit S-box S is applied to each row of the state, i.e., (y_[i,0], ..., y_[i,5]) = S(x_[i,0], ..., x_[i,5]). The table for the S-box (in hexadecimal notation) is given in Table 3 and its implementation based on two-level NAND trees is shown in Figure 2.
• ShiftColumns (SC): The j-th column of the state is rotated upwards by j bits, i.e., y_[i,j] = x_[i+j,j].
• MixColumns (MC): A cyclic binary matrix M is multiplied with each column of the state.
For simplicity, we identify the applied matrix by α = (α_1, ..., α_6), which is parameterized with a different value for each version of the cipher.
• AddRoundKey (A_{k_r}): The 6ℓ-bit round key k_r is XORed to the whole state.
• AddRoundConstant (A_{c_r}): The 6ℓ-bit constant c_r is XORed to the whole state.
Similar to PRINCE, the round constants are chosen as the binary digits of the number π − 3 = 0.1415.... Table 5 presents the first 100 × 64 bits of this constant. We use the first 6ℓ bits as c_0, the second 6ℓ bits as c_1, and so on.
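As a sketch of this derivation (our illustration): the fractional part of π in hexadecimal begins 0x243F6A8885A308D3..., a well-known expansion; slicing it into consecutive 6ℓ-bit chunks yields the round constants, shown here for ℓ = 32, i.e., 192-bit constants:

```python
# First 256 bits of pi - 3 = 0.243F6A88... in hexadecimal (well-known
# expansion; extend the string to obtain constants for more rounds).
PI_FRAC_HEX = "243F6A8885A308D313198A2E03707344A4093822299F31D0082EFA98EC4E6C89"
PI_FRAC_BITS = bin(int(PI_FRAC_HEX, 16))[2:].zfill(len(PI_FRAC_HEX) * 4)

def round_constant(r, ell=32):
    """Return c_r as an integer: the r-th consecutive 6*ell-bit chunk."""
    width = 6 * ell
    chunk = PI_FRAC_BITS[r * width:(r + 1) * width]
    return int(chunk, 2)

print(hex(round_constant(0)))
```

For ℓ = 32 this gives c_0 = 0x243F6A8885A308D313198A2E03707344A4093822299F31D0, i.e., simply the first 192 bits of π − 3.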
Round Function: Using the above-mentioned round operations, the first r − 1 round functions (with 0 ≤ i ≤ r − 2) are defined as

R_i = A_{c_i} • MC • SC • SB • SC • SB • A_{k_i},

while in the last round the linear layer and the constant addition are omitted and an extra key addition is applied instead, i.e.,

R_{r−1} = A_{k_r} • SB • SC • SB • A_{k_{r−1}}.
Instantiation: As already mentioned, SPEEDY is a family of block ciphers that allows instantiation with a wide range of block sizes and security levels. One may choose the block size (6ℓ) according to the type of data blocks that need to be encrypted and select the number of rounds (r) based on the necessary security level. By applying an appropriate α = (α_1, ..., α_6) value with regard to the rationale explained in Section 5, SPEEDY-r-6ℓ is ready to use.
To provide encryption of 64-bit blocks, which is the common instruction and data width in modern CPUs, we suggest instantiating SPEEDY-r-192 with α = (1, 5, 9, 15, 21, 26) as the linear layer's parameter. We leave the number of rounds to be chosen based on the required security level. That is, for the 128- and 192-bit security levels, we recommend using r ≥ 6 and r ≥ 7 rounds, respectively. More details about our security claims are provided below. The security analysis and the implementation of this instance are discussed in Section 6 and Section 7, respectively. Furthermore, for this instance we suggest using β = 7 and γ = 1 as the parameters of the key schedule permutation P (given in Table 6). We provide several test vectors for SPEEDY-r-192 encryption in Appendix G.
Security Claim While SPEEDY can be instantiated with different block and key sizes, the default is 192 bits, as this constitutes the least common multiple of 6 (our S-box width) and 64 (the instruction width in high-end CPUs). We expect that SPEEDY-r-192 achieves 128-bit security when iterated over r = 6 rounds and full 192-bit security when iterated over r = 7 rounds, while the r = 5 round variant already provides a decent security level that is sufficient for many practical applications (≥ 2^128 time complexity when the data complexity is limited to ≤ 2^64). Compared to the security claims made, for example, for PRINCE (≥ 2^{127−n} time complexity when the data complexity is limited to ≤ 2^n) or PRINCEv2 (≥ 2^112 time complexity when the data complexity is limited to ≤ 2^50), the security level claimed for SPEEDY-5-192 is already superior.

Design Rationale
The primary design criterion for SPEEDY is to use round operations with a low latency that still provide good enough cryptographic properties to obtain a secure encryption within a small number of rounds. To achieve this goal, we applied the ultra low-latency S-box found in Section 3. While the design approach for the S-box is described in Section 3, all details regarding the design choices for the other round operations are explained in the following.

MixColumns:
It is clear that the latency cost (in terms of XOR gate depth) of XORing n bits, i.e., x_0 ⊕ ... ⊕ x_{n−1}, is equal to d = ⌈log_2 n⌉. This means that XORing n bits, for any n with 2^{d−1} < n ≤ 2^d, has the same cost with respect to the latency of the circuit (considering identical topology). Therefore, to use the maximum capacity of the given latency, it is prudent to choose n = 2^d. In the design of SPEEDY, since the A_{k_{r+1}} operation of round r + 1 occurs right after the A_{c_r} and MC operations of round r, it is possible to merge all three operations. Considering x, y ∈ F_2^{ℓ×6} as the input and output of the merged A_{k_{r+1}} • A_{c_r} • MC operation, each output bit can be calculated as

y_[i,j] = x_[i,j] ⊕ x_[i+α_1,j] ⊕ ... ⊕ x_[i+α_6,j] ⊕ c_r[i,j] ⊕ k_{r+1}[i,j].

Hence, it is possible to implement the whole A_{k_{r+1}} • A_{c_r} • MC as a merged function within three XOR gate levels. Note that since the input k_{r+1}[i,j] is not on the critical path of the circuit, k_{r+1}[i,j] and c_r[i,j] can be combined with each other beforehand. Depending on the value of the round constant bit, we then actually only need to use k_{r+1}[i,j] itself or its inverted value ¬k_{r+1}[i,j]. Figure 3 depicts the corresponding circuit implementing each output bit of the merged function. Please note that the fan-out of each XOR gate in this circuit is 1. It is important to consider that for CMOS technologies where the XNOR gate is significantly faster than the XOR gate (such as NanGate 45 nm), it is easily possible to implement this linear layer with only XNOR gates instead of XORs and simply exchange the buffers and inverters of the next S-box stage to revert the inverted output.
For the MC operation, we decided to use the same binary cyclic matrix, with polynomial representation 1 + z^{α_1} + ... + z^{α_{w−1}}, for every column of the state. Therefore, each output bit of the MC operation is the XOR of w input bits. As explained above, the optimal choices for w are 3, 7, 15 and so on, so that the merged function mentioned above can be implemented with 2, 3 and 4 XOR gate levels, respectively. While the PRINCE, MIDORI and QARMA block ciphers use this merging technique with cyclic matrices of w = 3 repeated after each S-box layer, we found it a good trade-off to use cyclic matrices with w = 7, but only after every second S-box layer, which is effectively cheaper from a latency perspective. For each SPEEDY-r-6ℓ version of the cipher, we need to find a bijective ℓ × ℓ binary cyclic matrix M with polynomial representation 1 + z^{α_1} + ... + z^{α_6}. Finding an appropriate bijective cyclic matrix with w = 7, an odd integer, is quite possible for a wide range of ℓ. But since the value of α = (α_1, ..., α_6) always depends on the value of ℓ, we leave it as a parameter of the cipher's instantiation. Since the probability of M being non-singular is high, we can add extra criteria regarding the choice of the α parameter.
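Bitwise, the merged operation can be sketched as follows (our illustration with ℓ = 32 and the recommended α = (1, 5, 9, 15, 21, 26); the key and constant bits are assumed to be pre-combined into a single matrix kc, as described above, giving eight XOR leaves and hence three XOR gate levels per output bit):

```python
ELL = 32
ALPHA = (1, 5, 9, 15, 21, 26)  # recommended alpha for SPEEDY-r-192

def merged_mc_ac_ak(state, kc):
    """Merged A_k . A_c . MC: y[i][j] = x[i][j] XOR x[i+a][j] for a in ALPHA,
    XORed with the precomputed (k XOR c) bit kc[i][j].

    Eight XOR leaves per output bit -> depth ceil(log2(8)) = 3 in hardware."""
    y = [[0] * 6 for _ in range(ELL)]
    for i in range(ELL):
        for j in range(6):
            bit = state[i][j] ^ kc[i][j]
            for a in ALPHA:
                bit ^= state[(i + a) % ELL][j]
            y[i][j] = bit
    return y
```

With the key/constant term set to zero, flipping a single input bit flips exactly w = 7 output bits, reflecting the seven taps of the cyclic matrix.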
• We restrict α_1, α_2 − α_1, α_3 − α_2, α_4 − α_3, α_5 − α_4, α_6 − α_5 and ℓ − α_6 to be smaller than or equal to 6. The reason for this criterion is explained later, in the corresponding paragraph on ShiftColumns. Note that this criterion can only be satisfied for ℓ ≤ 42.
• Maximum branch number: The branch number of a matrix M is defined as

bn := min_{x ∈ F_2^ℓ \ {0}} ( hw(x) + hw(M × x^T) ),

where hw denotes the Hamming weight of a binary array. In the case of a bijective ℓ × ℓ binary cyclic matrix M with polynomial representation 1 + z^{α_1} + ... + z^{α_{w−1}}, the branch number cannot be higher than w + 1. In our case, we restrict the choice of the α parameter to those values which provide the maximum branch number, i.e., 8.
• For the matrix M corresponding to parameter α = (α_1, ..., α_6), we build a binary table H such that the element at position (i, j) is 1 if and only if there is an x ∈ F_2^ℓ \ {0} with hw(x) = i and hw(M × x^T) = j. Then, we compute the higher-order branch numbers

bn_r := min { i_1 + i_2 + ... + i_r : H[i_1, i_2] = H[i_2, i_3] = ... = H[i_{r−1}, i_r] = 1 }.    (5)

As explained later in Section 6, larger values of bn_r lead to a stronger resistance of r-round SPEEDY against differential and linear attacks. Therefore, among all choices of α meeting the first two criteria, we compute the above bn_r numbers and choose one of the α values which leads to the maximum bn_r values.
It is noteworthy that the branch number bn is the same as bn_2, defined as bn_2 = min { i + j : H[i, j] = 1 }. Besides, bn_r with r > 2 can be considered an extension of the definition of the branch number, and hereafter we call it a higher-order branch number.
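For small toy parameters the branch number can be computed exhaustively; the sketch below (our illustration with ℓ = 5 and w = 3, i.e., polynomial 1 + z + z^2, deliberately not the SPEEDY parameters) attains the upper bound bn ≤ w + 1 = 4:

```python
ELL = 5
ALPHA = (1, 2)  # toy cyclic matrix with polynomial 1 + z^1 + z^2, so w = 3

def hw(x):
    return bin(x).count("1")

def rotl(x, r, n=ELL):
    return ((x << r) | (x >> (n - r))) & ((1 << n) - 1)

def mc(x):
    # cyclic matrix application: y = x XOR rot_1(x) XOR rot_2(x)
    y = x
    for a in ALPHA:
        y ^= rotl(x, a)
    return y

# exhaustive minimum of hw(x) + hw(M x) over all non-zero x
bn = min(hw(x) + hw(mc(x)) for x in range(1, 1 << ELL))
print(bn)  # 4, the maximum possible for w = 3
```

For the real SPEEDY parameters (ℓ = 32) an exhaustive minimum over 2^32 inputs is no longer practical, which is why the paper works with the table H and the derived bn_r values instead.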

ShiftColumns:
The first SC operation, placed right after the first SB, ensures that the input bits of each S-box in the second SB operation all come from outputs of different S-boxes of the first SB operation. Therefore, since the applied S-box has the full diffusion property (in both straight and inverse direction), each output bit of SB • SC • SB is a function of 36 consecutive input bits. Namely, for SB • SC • SB, the output bit at position [i, j] is a function of all input bits at positions of the form [i + p, q] with 0 ≤ p, q < 6, while for (SB • SC • SB)^{−1}, the output bit [i, j] is a function of all input bits of the form [i − p, q]. Given the first criterion for MixColumns, namely that α_1, α_2 − α_1, α_3 − α_2, α_4 − α_3, α_5 − α_4, α_6 − α_5 and ℓ − α_6 are all smaller than or equal to 6, each output bit of MC • SB • SC • SB, and equivalently each output bit of the key-less round function MC • SC • SB • SC • SB, depends on all 6ℓ input bits. The same holds for (MC • SB • SC • SB)^{−1} on the decryption side; hence, each input bit of one key-less round function depends on all 6ℓ output bits.
Moreover, the same arguments hold for the second SC, inserted right after the second SB operation, which means that each output bit of SB • MC • SC • SB depends on all 6ℓ input bits; the equivalent statement holds for the rotated key-less round function SC • SB • MC • SC • SB. Altogether, one key-less round function or rotated round function provides full diffusion in both the encryption and the decryption direction. In other words, in a key-recovery attack, to compute one output bit of those functions, the attacker needs to know the whole input state. Note that knowing the whole input state of these functions requires knowing the whole round key. This means that if the attacker wants to extend a distinguisher by appending one complete round (or rotated round) function in a key-recovery attack, he needs to guess all 6ℓ bits of the key. It is important to mention that since any key-independent linear operation right before the ciphertext does not add any security to the encryption, we exclude the MC and the second SC operations from the last round.
Key Schedule: Since the main target of our design is a low-latency encryption routine, and since other cost factors of the implementation, such as area or energy consumption, are only secondary priorities, one could even afford a key schedule built from costly operations. Yet, since we do not aim for related-key security, and since the round function has strong diffusion, we found a linear key schedule sufficient for our purposes. Besides, updating round keys by a bit-permutation function in an unrolled implementation incurs no latency, area or energy cost, so we decided on such a key schedule. Furthermore, we wanted a bit-permutation that generalizes easily to all SPEEDY-r-6ℓ members. To this end, we chose a general affine mapping over the ring of integers modulo 6ℓ: the permutation P maps an element x of {0, ..., 6ℓ − 1} to P(x) = βx + γ mod 6ℓ. The only requirement for P to be a bijection is that β and 6ℓ are co-prime, i.e., gcd(β, 6ℓ) = 1.
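A quick sketch for SPEEDY-r-192 (ℓ = 32, so 6ℓ = 192) with the suggested β = 7, γ = 1 (our illustration; whether a key bit moves to position P(x) or from it is a convention that Table 6 of the paper fixes): since gcd(7, 192) = 1, the affine map permutes the bit positions.

```python
from math import gcd

BITS = 192          # 6 * ell for SPEEDY-r-192
BETA, GAMMA = 7, 1  # suggested key schedule parameters

assert gcd(BETA, BITS) == 1  # required for P to be a bijection

def key_perm(x):
    # P(x) = beta * x + gamma mod 6*ell
    return (BETA * x + GAMMA) % BITS

def next_round_key(key_bits):
    """Apply P to the bit positions of a round key (list of 192 bits)."""
    out = [0] * BITS
    for x, bit in enumerate(key_bits):
        out[key_perm(x)] = bit
    return out

P = [key_perm(x) for x in range(BITS)]
print(sorted(P) == list(range(BITS)))  # True: P is a permutation
```

In a fully-unrolled circuit this permutation is realized purely through wiring between the round-key nets, which is why it costs no latency, area or energy.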

Security Analysis
In this section, we provide details about the cryptographic properties of the SPEEDY family of block ciphers. We start with the differential, linear and algebraic properties of the S-box S and extend them to a round function of the cipher. Using the properties of the round function, we then discuss the security of the r-round structure of SPEEDY.

Cryptographic Properties of the S-box:
The S-box S, presented in Section 3, is the heart of the SPEEDY design and needs to be studied in detail. As described before, the uniformity and linearity of S are 8 and 24, respectively. This means that the maximum probability of differentials over S is 8 · 2^{−6} = 2^{−3} and the maximum absolute correlation of linear approximations is 24 · 2^{−6} ≈ 2^{−1.4}.
As one important part of the Differential Distribution Table (DDT) and the Linear Approximation Table (LAT), we present the 1-bit to 1-bit differentials and linear approximations in Appendix B, Table 10. In more detail, entry (i, j) of the 1-bit to 1-bit DDT denotes the probability that having only one active bit at position i of the S-box input leads to only one active bit at position j of the S-box output. In the case of the 1-bit to 1-bit LAT, entry (i, j) denotes the absolute correlation of the linear approximation x_i = y_j. Even though one of the criteria for building the low-latency S-box was full dependency of the output bits on the input bits, this is not sufficient to characterize all algebraic properties of the function. We provide the algebraic normal form (ANF) representation of both S and S^{−1} in Appendix C. As shown there, not only are all input/output variables non-linearly involved in all output/input coordinates (i.e., the S-box provides full diffusion in both straight and inverse direction), but each coordinate function is also quite dense with respect to the number of involved terms. Another interesting observation is that the ANF degrees of the coordinates of S are 5, 3, 3, 3, 4 and 5, respectively, while for S^{−1} these numbers are 5, 4, 5, 4, 5 and 5 (cf. Appendix C).
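The 1-bit to 1-bit DDT described above can be extracted mechanically from the full difference distribution; a sketch (our illustration, again using the 4-bit PRESENT S-box as a stand-in, since the point is the extraction, not the SPEEDY values):

```python
n = 4
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]

def one_bit_ddt(sbox, n):
    """Entry [i][j]: probability that input difference e_i maps to e_j."""
    size = 1 << n
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a = 1 << i                      # single active input bit
        for j in range(n):
            b = 1 << j                  # single active output bit
            count = sum(1 for x in range(size)
                        if sbox[x ^ a] ^ sbox[x] == b)
            table[i][j] = count / size
    return table

ddt1 = one_bit_ddt(SBOX, n)
```

The 1-bit to 1-bit LAT is obtained analogously by restricting the input and output masks a and b to single-bit values.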
Cryptographic Properties of SB • SC • SB: Since in the round function of SPEEDY two SB operations are connected through the SC operation, which is a simple bit permutation, it is necessary to look at the properties of this combination. We first investigate the 1-bit to 1-bit differentials and linear approximations of SB • SC • SB. Since each input bit of the second SB operation comes from a different first-stage S-box, 1-bit to 1-bit transitions over SB • SC • SB are possible if and only if the transitions over the first and second SB operations are both 1-bit to 1-bit transitions. Besides, without any extra assumption (such as independence between the state bits), it can be proven that the probability or correlation of such a 1-bit to 1-bit transition over SB • SC • SB is the product of the probabilities or correlations over the two active S-boxes (one from the first SB and one from the second SB operation). Since SC does not change the column position of active bits, it is easily possible to compute these probabilities. Appendix B, Table 11 presents the 1-bit to 1-bit differential probabilities and linear correlations over SB • SC • SB, such that entry [i, j] denotes the maximum possible probability or absolute linear correlation for a transition from an active input bit in column i to an active output bit in column j. To compute these values, we used the equation T_2[i, j] = max_{0 ≤ q < 6} T_1[i, q] · T_1[q, j], where T_1 and T_2 denote Table 10 and Table 11, respectively.
Note that the maximum entry for differential transitions is 2^{−6}, and for linear transitions it is 15 · 2^{−7} ≈ 2^{−3}. We are only interested in 1-bit to 1-bit transitions because the probability or correlation of such transitions is among the highest, and because based on such transitions we can build differential or linear characteristics with a high differential probability or linear correlation. Again, due to the fact that SC does not change the column position of the bits and each input bit of the second SB is the output of a different S-box, it is possible to compute the algebraic degree of SB • SC • SB. The degree of any output bit in columns 0, 1, ..., 5 is 19, 15, 13, 13, 13 and 20, respectively. It is important to mention that replacing the current S-box with another bit-permutation-equivalent S-box changes the differential, linear and algebraic properties of SB • SC • SB. While Section 3 left us with a bit-permutation equivalence class of S-boxes, we tried all S-boxes of this class to find one such that the maximum entry of Table 11, and also the number of entries attaining this maximum, are as small as possible. Moreover, we want the minimum algebraic degree over the SB • SC • SB coordinates to be as large as possible. Note that due to the structure of the round function, encryption with S-box P_out • S • P_in is identical to encryption with S-box P_in • P_out • S (up to a column permutation in the state of plaintext, ciphertext, round keys and round constants), so we can fix one of the two bit-permutations to the identity and only need to choose the other one.

Differential and Linear Attacks
Since there are 1-bit to 1-bit differential and linear approximations over SB • SC • SB, and the corresponding probabilities or correlations of those transitions are quite significant, it is necessary to choose a strong MC operation. The criterion of branch number bn = 8 ensures that the maximum expected differential probability (EDP) of differential trails and the maximum expected linear potential (ELP) of linear trails over two rounds of SPEEDY is at most (2^{−6})^8 = 2^{−48}. To discuss the resistance of r-round SPEEDY, we use the higher-order branch number bn_r defined in Equation 5 to estimate the minimum number of active S-boxes in differential or linear trails. With this estimation, the maximum EDP of differential trails and the maximum ELP of linear trails over r-round SPEEDY is bounded by 2^{−6·bn_r}. In the case of SPEEDY-r-192 with the recommended α parameter, we have bn_3 = 13, bn_4 = 20, bn_5 = 25 and bn_6 = 32.
Hence, we estimate that the EDP (resp. ELP) of any differential (resp. linear) trail over 3, 4, 5 and 6 rounds is smaller than 2^{−78}, 2^{−120}, 2^{−150} and 2^{−192}, respectively. Additionally, assuming that all 1-bit to 1-bit differential or linear transitions through the S-box are possible, and considering that there are at most 8 active 6-bit words per state, we searched for the minimum number of active S-boxes. We found that this number is 13, 23 and 35 for 2, 3 and 4 rounds, respectively. Assuming that all these 1-bit to 1-bit transitions occur with differential probability (or linear potential) 2^{−3}, the EDP (resp. ELP) of any differential (resp. linear) trail over 2, 3 and 4 rounds is smaller than 2^{−39}, 2^{−69} and 2^{−105}. We emphasize that these values are upper bounds, i.e., trails with such EDP or ELP need not actually exist.
Higher-Order Differential, Integral and Cube Attacks SPEEDY's round function has strong diffusion and a high algebraic degree. While we investigated these properties precisely for one complete round, for larger numbers of rounds we expect the ANF representation to be dense with respect to the number of involved terms. Therefore, we believe that these attacks are weaker than differential and linear attacks and less of a concern.

Number of Rounds
For a low-latency block cipher, a large security margin is not reasonable and is usually considered wasteful. Since the attacker cannot add more than one round to extend a distinguisher into a key-recovery attack, we believe a security margin of one round is sufficient. Therefore, we recommend choosing the number of rounds with respect to the required security level of the block cipher's application. For example, in the case of the SPEEDY-r-192 instance, we recommend SPEEDY-6-192 and SPEEDY-7-192 for the 128-bit and 192-bit security levels, respectively, while for more practical requirements, such as a security level of 2^128 time and 2^64 data complexity, we recommend SPEEDY-5-192.
Further Security Analysis Additional security analysis results with respect to impossible differential, zero-correlation linear-hull, meet-in-the-middle and implementation attacks can be found in Appendix D.

Hardware Implementation
In this section, we analyze the minimum achievable latency of fully-unrolled SPEEDY hardware implementations as well as the area required for the timing-constrained circuits, and compare them to a number of other cryptographic primitives that have been suggested for high-speed single-cycle encryption in the literature. Implementing SPEEDY in hardware is rather straightforward, since all round operations which require any logic, i.e., which cannot be realized through wiring alone, have already been chosen as circuit representations. In detail, Figure 2 shows the hardware circuitry for the 6-bit high-speed S-box, while Figure 3 depicts the logic circuit that implements the combined A_{k_{r+1}} • A_{c_r} • MC function. The ShiftColumns operation does not require any logic, which means that only the initial and final AddRoundKey functions remain; obviously, these are implemented with a single stage of regular XOR gates. Table 7 presents the minimum latency results achieved for different instances of Gimli, MANTIS, Midori, Orthros, PRINCE, PRINCEv2, QARMA and SPEEDY (in alphabetical order). All results have been obtained by synthesizing the fully-unrolled cipher circuits between two register stages for minimum clock period using the Synopsys Design Compiler Version O-2018.06-SP4 software, executing four stages of the compile_ultra command (three of them incremental). We have repeated the analysis with 6 different standard cell libraries, 4 of which are manufacturable cell libraries from a commercial foundry, while the remaining 2 are open-source libraries which are not manufacturable but can be used to produce universally comparable and reproducible synthesis results. Please note that Gimli is a key-less permutation. Therefore, in order to create an encryption circuit from this primitive, we have realized it in an Even-Mansour scheme [EM97] with two different keys at the beginning and the end. With respect to our SPEEDY implementations, we distinguish between results achieved when giving the regular
behavioral (or dataflow) description of the cipher to the synthesis tool and those obtained by optimizing the code and instantiating the desired standard cells directly in the HDL code (according to the gate-level descriptions shown in Figures 2 and 3). It is obvious that this optimization has a significant impact on the performance in the NanGate libraries, but less of an impact in the commercial technologies.
In order to force the synthesizer to use our suggested gate-level structures for MC and SB, we set a size-only attribute on the relevant cells in Synopsys Design Compiler before the first compile_ultra command. The synthesizer then only scales the drive strengths of these cells. In a next step, three compile_ultra -incremental commands are executed without the size-only attribute, so that all optimizations are allowed again. With this technique the highest quality of results is achieved and the majority of manually instantiated cells remain unchanged. Yet, the claim made in [GKD20] that the corresponding design outperforms PRINCE by a significant margin is very doubtful considering our results. Please note that for all ciphers except Midori we have used hardware implementations written by the original authors of the corresponding papers (the Qameleon authors for QARMA).
Table 8 shows the corresponding area consumption of the fully-unrolled and highly latency-constrained circuits. Clearly, SPEEDY requires a larger circuit area than all other ciphers except Gimli. However, this is mainly caused by its 192-bit state (which is larger than that of all other ciphers in the table except Gimli). In more detail, when multiplying the area of the 64-bit ciphers by 3 (to encrypt 192 bits at once), many of them require a larger area than SPEEDY-5-192, and all MANTIS and QARMA instances even exceed the area of SPEEDY-6-192. Thus, we believe that for their block widths and the high security and performance levels that the SPEEDY instances provide, their area consumption is acceptable.
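The block-width normalization applied above amounts to a one-line scaling; the example figure below is a placeholder, not a value from Table 8:

```python
# Block-width normalization used in the area comparison above: to
# encrypt 192 bits at once, a b-bit cipher must be instantiated 192/b
# times in parallel. The GE value in the example is a PLACEHOLDER.

def area_per_192_bits(area_ge: float, block_bits: int) -> float:
    """Scale a fully-unrolled cipher's area to a 192-bit data path."""
    return area_ge * (192 / block_bits)

# e.g., a hypothetical 64-bit cipher of 10 kGE costs 30 kGE at 192-bit width
```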
Power consumption figures for all circuits are given in Appendix E, Table 12. Because synthesis results disregard the impact of wire capacitances on the latency of hardware circuits, we have, as an example, taken all netlists generated for the 65 nm technology through a Place-and-Route (PnR) process in order to estimate the post-layout latencies. These are given in comparison to the pre-layout values in Table 9. Naturally, the overhead introduced by the physical layout is greater for circuits with a larger area footprint, e.g., Gimli, Orthros and SPEEDY, because connected cells may lie farther apart from each other, requiring longer wires to connect them (also because metal utilization increases and wires have to be routed on higher, thicker metal layers). However, despite this slightly larger overhead, SPEEDY-5-192 and SPEEDY-6-192 are still the fastest encryption primitives after PnR. Details about the SPEEDY decryption circuits and the associated implementation results are provided in Appendix F.

Code and Reproducibility
A reference software implementation in C and hardware implementations of SPEEDY-r-192 encryption and decryption in VHDL, along with synthesized netlists in NanGate libraries and associated synthesis scripts, are all available in our GitHub repository found here: https://github.com/Chair-for-Security-Engineering/SPEEDY.

Conclusion
In this work we have introduced SPEEDY, a family of ultra low-latency block ciphers developed for extremely high execution speed in CMOS hardware and dedicated to semi-custom, i.e., standard-cell-based, integrated circuit design. The primary targets for SPEEDY are security architectures in high-end CPUs which require ultra low-latency encryption, such as secure caches, dedicated hardware extensions, memory encryption, pointer authentication and many more. SPEEDY achieves higher performance than its competitors because of hardware-specific gate- and transistor-level observations that have been exploited in its design to make it extremely performant in CMOS hardware. While SPEEDY can be instantiated with different block and key sizes, the default is 192 bits. Based on our analysis, we are confident that 7 rounds provide full security, while 5 rounds already provide a higher security level than, for example, PRINCE or PRINCEv2. Our extensive evaluation of hardware implementations demonstrates that both SPEEDY-5-192 and SPEEDY-6-192 are faster than any proposed version of PRINCE, PRINCEv2, MANTIS, QARMA, Midori, Gimli and Orthros. Thus, SPEEDY is a significant upgrade over the state of the art for any application where area and energy are secondary design goals while high performance is the number one priority.

C ANF Representation of S and S⁻¹

D Additional Security Analysis
Impossible Differential and Zero-Correlation Linear-Hull Attacks

One active bit, with respect to both differentials and linear correlations, and in both forward and backward directions, can propagate to all state bits over one (rotated) key-less SPEEDY round function, and, more importantly, none of this activeness is deterministic. It should be noted, however, that the activeness of these bits can be related to each other if the last operation is MC. Therefore, by combining one round of propagation in the forward direction and one round of propagation in the backward direction, it might be possible to find impossible differentials or zero-correlation linear-hulls over two (rotated) key-less round functions. But if we add one SB operation in the middle, we ensure that there are no such distinguishers. Therefore, by applying the 2-round distinguisher and extending it by one round for key recovery, it might be possible to mount a successful attack on 3-round SPEEDY, but we expect that more than 3 rounds are secure against these attacks.

Meet-in-the-Middle Attack
The maximum number of rounds attacked using the meet-in-the-middle technique can be evaluated by considering the maximum length of three features: partial-matching, initial structure and splice-and-cut. For partial-matching, the number of rounds in both forward and backward directions cannot reach the number of full-diffusion rounds, which for SPEEDY is smaller than one round in both directions. The condition for the initial structure is that the key differential trails in the forward and backward directions do not share active non-linear components. As any key differential in SPEEDY affects the whole state after one complete round in both directions, there is no such differential which shares active S-box(es) over more than one round. Therefore, it only works for up to one round. Splice-and-cut may extend the number of attacked rounds by up to the number of full-diffusion rounds, i.e., again one round. Thus, it is not possible for an attacker to mount a successful meet-in-the-middle attack on (2+1+1) = 4-round SPEEDY.

Implementation Attacks
The protection of SPEEDY against implementation attacks, such as timing, power analysis or fault injection attacks, is not a focus of this work. Clearly, a straightforward and unprotected implementation of SPEEDY will be susceptible to adversaries who are capable of observing the characteristics of the implementation during its execution. Although this attacker model traditionally requires physical access to the executing device, and is therefore typically considered to be less of a concern for desktop and server CPUs (the targeted application area of SPEEDY), there have recently been more and more successful remote power analysis attacks on such devices, most notably the PLATYPUS attack [LKO + 21]. Thus, even in such contexts, physical adversaries can no longer be ignored, and protecting SPEEDY against said attacks is a promising direction for future research.
In that regard, a recent work has pointed out that, although it is hardly feasible to apply hardware masking to unrolled low-latency cryptography without sacrificing a large portion of its performance, due to the necessary inclusion of register stages, simple reset methods (i.e., randomly pre-charging the combinatorial circuit) deliver very promising results against passive side-channel attacks if applied properly [Moo20]. The parallelism, speed and asynchronicity of SPEEDY are assumed to be even higher than for the investigated PRINCE instance. Thus, we believe that this kind of protection mechanism can most reasonably

Figure 1 :
Figure 1: Impact on the latency of the circuit in NanGate 15 nm technology when buffering the high fan-out net. Total latency is 29.169073 ps without the buffer (left) and 18.675571 ps with the buffer (right), despite the larger gate depth on the right.

Figure 2 :
Figure 2: Implementation of the 6-bit S-box of SPEEDY based on two-level NAND trees.

Figure 3 :
Figure 3: Implementation of each output bit of the merged function A_{kr+1} ∘ A_{cr} ∘ MC of the SPEEDY design.

Table 1 :
Fan-In, Latency, Fan-In-to-Latency-Ratio and Linearity of logic gates in NanGate 45nm Open Cell Library (OCL) for typical operating conditions.

Table 2 :
Fan-In, Latency, Fan-In-to-Latency-Ratio and Linearity of logic gates in NanGate 15nm Open Cell Library (OCL) for typical operating conditions.

Table 3 :
The 6-bit S-box of SPEEDY.

Table 4 .
Details about the synthesis tools and process are given in Section 7. Please note that up to now only 4-bit S-boxes have been proposed for low-latency constructions in the literature, namely (in alphabetical order) the Midori S-boxes [BBI + 15], the Orthros S-box [BIL + 21], the PRINCE S-box [BCG + 12] and the QARMA S-boxes [Ava17]. Yet, in order to compare the SPEEDY S-box also to larger substitution boxes, we chose the ASCON 5-bit S-box [DEMS19], the Data Encryption Standard (DES) S1 6-to-4-bit box (as a representative of the 8 different DES S-boxes) [oST79], the Q2263 6-bit S-box [BMD + 20] and the Advanced Encryption Standard (AES) S-box.

Table 4 :
Latency comparison of different S-boxes with varying numbers of input bits (#ib). If not stated otherwise, each S-box is implemented as a lookup table (using with/select in VHDL).
* = Optimized HDL code with direct instantiation of library cells based on Figure 2.

Table 7 :
Minimum latency of fully-unrolled encryption-only circuits of different cryptographic primitives.

Table 8 :
Area consumption of fully-unrolled encryption-only circuits of different cryptographic primitives when synthesized for minimum latency.

Table 9 :
Comparison of pre-layout and post-layout latencies in a commercial 65 nm CMOS technology.
* = Optimized HDL code with direct instantiation of library cells based on Figures 2 and 3.

Table 10 :
1-bit to 1-bit differential probabilities and linear correlations of the SPEEDY S-box.

Table 11 :
1-bit to 1-bit differential probabilities and linear correlations over SB • SC • SB.

Table 13 :
Estimated latency, area, and power consumption of the SPEEDY decryption routine.