RISC-V Instruction Set Extensions for Lightweight Symmetric Cryptography

The NIST LightWeight Cryptography (LWC) selection process aims to standardise cryptographic functionality which is suitable for resource-constrained devices. Since the outcome is likely to have significant, long-lived impact, careful evaluation of each submission with respect to metrics explicitly outlined in the call is imperative. Beyond the robustness of submissions against cryptanalytic attack, metrics related to their implementation (e.g., execution latency and memory footprint) form an important example. Aiming to provide evidence allowing richer evaluation with respect to such metrics, this paper presents the design, implementation, and evaluation of one separate Instruction Set Extension (ISE) for each of the 10 LWC final round submissions, namely Ascon, Elephant, GIFT-COFB, Grain-128AEADv2, ISAP, PHOTON


Introduction
The LWC selection process. In a detailed survey of various examples, Bernstein [Ber20] notes that modern, open cryptographic selection processes (or contests) are not without their issues. Set within the broader context of standardised cryptographic functionality, however, they represent an undeniably important and influential mechanism: modulo imperfections stemming from the non-trivial technical and non-technical challenges involved, they act to motivate and organise collaborative effort, and, at best, produce more robust outcomes as a result.
After a series of exploratory workshops in 2015 and 2016 and a report [MBTM17] summarising the context and goals, NIST initiated a selection process for LightWeight Cryptography (LWC) via an associated call [SCA18b] released in 2018. The process scope involves two specific forms of cryptographic functionality, with each submission specifying a suite of algorithms with required support for an Authenticated Encryption with Associated Data (AEAD) API [SCA18b, Section 3.1], plus optional support for a hash function API [SCA18b, Section 3.2]. Although the term is open to interpretation more generally, the call defines lightweight to mean "tailored for resource-constrained devices" [SCA18b,Section 1]. This implies said algorithms should, e.g., be 1) efficient on constrained hardware and software platforms (versus existing standards), 2) efficient for short messages, and 3) amenable to countermeasures against implementation attacks.
The 56 round 1 submissions accepted were reduced to 32 round 2 submissions in 2019 [TMcc + 19], and then again to 10 round 3 or final round submissions in 2021 [TMC + 21]. The (ongoing) final round is expected to last approximately 12 months, implying a conclusion to the process in 2022. Beyond application of the minimum acceptability requirements [SCA18b, Section 3], a range of factors mean that objective comparison between and then selection of submissions in each round, the final round perhaps most importantly, is a significant challenge. First, even in the final round, there are a large number of submissions and variants thereof. Second, there are a large number of relevant implementation technologies: these include hardware-oriented (e.g., FPGA, ASIC) and software-oriented (e.g., micro-controller) instances. Third, there are a large number of relevant evaluation criteria [SCA18b, Section 4]: focusing on implementation-related examples, and so ignoring the complex, stand-alone challenge of cryptanalytic evaluation, these span at least cost [SCA18b,Section 4.3] (e.g., area and/or memory footprint 1 ), efficiency [SCA18b,Section 4.3] (e.g., latency, throughput), and resilience to implementation (e.g., side-channel and fault) attack [SCA18b,Section 4.2]. The product of these and other factors demands significant effort be invested, in part due to the design space of implementation techniques (spanning representation of data, and computation with it) and technologies which must be explored.

ISE-supported software implementation. Within said design space, Instruction Set Extensions (ISEs) attempt to add domain-specific support (e.g., state, instructions) to an otherwise general-purpose base Instruction Set Architecture (ISA). Although applicable to many domains, the study of cryptographic ISEs [BGM09, HV11, RI16] spans at least a 25 year period; work by Nahum et al. [NOOS95] is among the first identifiable instances.
As a fundamental and long-lived computer systems interface, the design and extension of an ISA demands careful consideration (cf. [Gue09, Section 4]) and must deliver quantified improvement for the workload of interest to be viable. ISEs are often viable, however, because, for example, they represent a hybrid between the use of hardware alone and the use of software alone. This is particularly true with respect to the constrained platforms and evaluation metrics of relevance to the LWC selection process: a well designed ISE can result in lower memory footprint and latency than a software-only implementation, and greater flexibility and efficiency (with respect to improvement per additional logic gate) than a hardware-only implementation.
ISEs were not (explicitly) considered during the AES selection process, but, after it concluded in 2002, were added to almost every major ISA; at the time of writing, these include (at least) x86 [SCA22a,Section 12.13] (see also [Gue09,DGK19]), POWER [SCA18a, Section 6.11.1], ARMv8-A [SCA20, Section A2.3], SPARC [SCA16, Sections 7.3+7.4], and RISC-V [SCA22b, Sections 2.4+2.5] (see also [MNP + 21]). Using this fact as motivation, we argue that considering ISEs during the LWC selection process is important because doing so offers 1) improved understanding and concrete evidence which can inform the LWC process itself, and 2) preparatory analysis which can inform ISA designers seeking to support the LWC process outcome.
Contributions. As such, this paper makes two central contributions:
1. Based on careful analysis, we present the design, implementation, and evaluation of one separate ISE for each of the 10 LWC final round submissions; for most submissions, our work represents the first exploration of implementations supported by domain-specific ISEs.
2. We present a number of novel software-only (i.e., without requiring an ISE) techniques and implementations. For most submissions, our work represents the first exploration of implementations supported by special-purpose, cryptographic Zbkb and Zbkx bit manipulation extensions, an approach which is particularly effective for Elephant (Section 3.3) and GIFT-COFB (Section 3.4); in the latter case, for example, we demonstrate how to optimise bit-sliced implementations of GIFT-128 (as used in GIFT-COFB) using Zbkb, rendering it more efficient than either standard or fix-sliced alternatives for short plaintexts/ciphertexts.
Note that all material associated with the paper, e.g., documentation and source code relating to all hardware and software implementations, is openly available under an open source license: we expect this material to evolve throughout the remainder of the LWC process and beyond.
Organisation. The paper is organised as follows. In Section 2 we present various background information, including a definition of the scope of and basis for our work. In Section 3 we analyse the LWC final round submissions, and produce associated ISE designs for RISC-V. Then, in Section 4 and Section 5 respectively, we discuss the implementation and evaluation of those designs based on instances of the RISC-V compliant Rocket [AAB + 16] core.

Background
Scope. In part to cope with the large design space considered, and thus engineering effort required, we fix the scope of our work in the following ways:
1. For each submission, we only consider the primary algorithm; each such algorithm is based on a "building block" component or kernel which dominates computation. We only consider intra-kernel ISEs, i.e., ISEs for use within a given kernel: the definition of a kernel implies that any extra-kernel opportunities for ISEs have at best a marginal impact, so are not considered viable. Furthermore, we only consider partial implementation of a given kernel where appropriate. Romulus is based on the Skinny-128-384+ kernel, for example, but only uses it to encrypt data; we do not consider support for decryption, therefore, although it would clearly be possible to do so if it were more generally useful.
2. We do not consider the hash function API: focusing on the AEAD API alone seems sufficient, because, for each submission, use of the same kernel is evident across the algorithms which support both APIs.
3. We only consider a 32-bit ISA (and also ISEs for it therefore). Although consideration of a wider set of ISAs is more generally useful, we rationalise this decision by noting it aligns with the (implied) scope of the LWC process: the NIST call outlines a requirement to consider "8-bit, 16-bit and 32-bit microcontroller architectures" [SCA18b, Section 3.4], for example, meaning a 64-bit ISA is deemed out of scope.
4. Although some discussion of the topic is included for completeness in Section 5, we do not consider support in the ISA nor ISEs for countermeasures against implementation attacks (other than their ability to deliver data-independent execution latency). We rationalise this decision by noting it aligns with the (implied) scope of the RISC-V scalar cryptographic extensions [SCA22b]: for example, the Zkne and Zknd extensions [SCA22b, Sections 2.4+2.5] for AES do not consider interaction with masking-based countermeasures against DPA-like attacks [KJJ99, MOP07].
5. For most submissions, we considered multiple ISE design variants. However, we only present results for the single ISE design variant we deem most effective, i.e., that which offers the greatest improvement in execution latency per additional logic gate. We stress that the results are therefore a "snapshot", rather than exhaustive exploration of the (large) design space.

Design
NIST are careful to use "algorithm(s)" throughout [SCA18b, Section 5], presumably to at least allow selection of a suite of rather than a single algorithm. Although one could conclude that multi-algorithm ISEs, i.e., ISEs which support more than one algorithm, are attractive therefore, focusing on them is arguably premature until the outcome is clear. In this section, we therefore adopt a 2-step design process. First, we focus on independently developing an ISE design for each algorithm: each of the following subsections acts to summarise such a design at a high level, with any lower-level technical detail (e.g., instruction encoding, semantics, etc.) deferred to an associated appendix. We use a uniform structure in each such subsection by presenting 1) an overview of the submission, 2) an overview of the kernel within said submission that we focus on, 3) implementation options (including related work, e.g., implementation results), then, finally, 4) a description of the ISE design. Second, and based on the above, Section 3.12 concludes with a broader discussion of opportunities relating to design of ISAs, ISEs, and the algorithms themselves; by taking a broader perspective, this second step therefore highlights if and where multi-algorithm ISEs can be extracted from the single-algorithm ISE designs.

Constraints
In their study of support for AES in RISC-V, Marshall
On one hand, we recognise that adopting these constraints means potential ISE designs might be ignored; this fact potentially renders our results sub-optimal, at least versus a more permissive alternative where the constraints are not adhered to. A pertinent example is the approach of Steinegger and Primas [SP21], which captures the 320-bit Ascon state within 10 general-purpose registers then used as input and output by a tightly-coupled accelerator for an entire round. This approach may be reasonable for specific use-cases, and variants of it are in fact viable for all the LWC candidates. However, the approach violates Requirement 2: although a useful option in the overall design space, our approach (namely a focus on more traditional, RISC-like ISEs) is fundamentally different.
On the other hand, we argue that the same constraints maximise potential utility of our ISE designs. For example, within the context of RISC-V they 1) support multiple implementation options, including a more traditional integrated approach or via the in-development Custom Function Unit (CFU) specification, and 2) offer an easier route to standardisation and deployment as a result of limiting impact on other aspects of the ISA. Beyond this, the constraints also permit extrapolation to other ISAs, e.g., via the ARMv8-M custom instruction mechanism [CP20]; doing so would be more difficult otherwise.

Ascon
Submission overview. The Ascon [DEMS21] submission specifies the AEAD algorithms [DEMS21, Section 2.4] Ascon-128, Ascon-128a, and Ascon-80pq, and the hash function algorithms [DEMS21, Section 2.5] Ascon-Hash and Ascon-Hasha. We focus on the primary algorithm Ascon-128, and, more specifically therefore, a kernel represented by the p^a and p^b permutations [DEMS21, Section 2.6] (a single permutation p, often referred to as Ascon-p, with a and b rounds respectively).
Kernel overview. The Ascon-p permutation manipulates a 320-bit state, which is organized in five 64-bit words, by iteratively applying a round function p. This round function is essentially a Substitution-Permutation Network (SPN) and comprises three parts: (i) the addition of an 8-bit round constant c_r to a 64-bit state-word, (ii) a substitution layer that operates across the five words of the state and implements an affine equivalent of the S-box in the χ mapping of Keccak, and (iii) a permutation layer consisting of linear functions that are similar to the Σ functions in SHA2 and performed on each state-word individually. The S-box maps five input bits to five output bits and is applied to each column of the state, whereby the five state-words are arranged vertically.
Implementation options. The substitution layer is normally implemented in a bit-sliced fashion using logical ANDs, XORs, and NOTs. The permutation layer, on the other hand, performs an operation of the form x = x ⊕ (x ≫ n) ⊕ (x ≫ m) on each 64-bit word x of the state, where ≫ denotes a right-rotation. On 32-bit architectures, the Ascon-p permutation is usually implemented in a Bit-Interleaved (BI) fashion, which means each 64-bit word of the state is split up into two 32-bit words, one containing the bits at even positions and the other the bits at odd positions. This representation has the advantage that one can perform a 64-bit rotation through two 32-bit rotations, which is particularly beneficial on 32-bit ARM Cortex-M microcontrollers due to their "free" rotations. Even though bit-interleaving has the potential to speed up the linear functions of Ascon-p on any 32-bit platform (including RV32), one has to take into account that this performance gain for the permutation comes at the expense of conversions between the BI representation and normal representation whenever data is injected into or extracted from the state.
ISE description. The substitution layer consists of logical operations on 64-bit words, which can be split up into two operations on 32-bit chunks. An optimized implementation of the S-box requires 17 native RV32GC instructions [CJL+20], which can be reduced to 15 with the help of two Zbkb instructions. The permutation layer can achieve a more significant speed-up, since its operations of the form x = x ⊕ (x ≫ n) ⊕ (x ≫ m) map naturally to two custom sigma instructions that use the upper and lower part of a 64-bit state-word as input and produce either the upper or lower part of the result. The rotation amounts can be specified through immediate values. In this way, the instruction count of the full permutation layer can be reduced from 80 (i.e., 16 per word) to only 10 (i.e., two per word). This reduction is independent of whether bit-interleaving is applied or not, which means that using the BI representation actually has an adverse impact on the overall performance due to the conversions between BI and normal representation.
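As an illustration of the permutation-layer mapping described above, the following sketch models two sigma-style instructions in Python and checks them against a plain 64-bit reference. The function names (`sigma_lo`, `sigma_hi`, `linear_layer`) and the (hi, lo) argument order are hypothetical and chosen for illustration only; the actual instruction semantics and encodings are those given in Appendix A. The rotation amounts are the standard Ascon linear-layer constants.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

# Rotation amounts (n, m) for the five Ascon state-words x0..x4.
ASCON_ROT = [(19, 28), (61, 39), (1, 6), (10, 17), (7, 41)]

def ror64(x, n):
    """Reference 64-bit right-rotation."""
    n %= 64
    return ((x >> n) | (x << (64 - n))) & MASK64

def ror64_lo(hi, lo, n):
    """Low 32 bits of (hi || lo) rotated right by n, for 0 <= n < 64."""
    if n >= 32:                       # rotating by 32 swaps the two halves
        hi, lo, n = lo, hi, n - 32
    if n == 0:
        return lo
    return ((lo >> n) | (hi << (32 - n))) & MASK32

def ror64_hi(hi, lo, n):
    """High 32 bits of (hi || lo) rotated right by n, for 0 <= n < 64."""
    return ror64_lo(lo, hi, n)

def sigma_lo(hi, lo, n, m):
    """Hypothetical instruction: low half of x ^ (x >>> n) ^ (x >>> m)."""
    return lo ^ ror64_lo(hi, lo, n) ^ ror64_lo(hi, lo, m)

def sigma_hi(hi, lo, n, m):
    """Hypothetical instruction: high half of x ^ (x >>> n) ^ (x >>> m)."""
    return hi ^ ror64_hi(hi, lo, n) ^ ror64_hi(hi, lo, m)

def linear_layer(state):
    """state: five (hi, lo) pairs; two sigma calls per word, 10 in total."""
    return [(sigma_hi(hi, lo, n, m), sigma_lo(hi, lo, n, m))
            for (hi, lo), (n, m) in zip(state, ASCON_ROT)]
```

Because both halves of the result depend on both halves of the input, each 64-bit word needs one call per output half, which is where the count of 10 instructions for five words comes from in this model.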

ISE design.
Note that additional, more detailed material relating to the ISE design for this candidate can be found in Appendix A.

Elephant
Submission overview. We focus on the primary algorithm Dumbo, and, more specifically therefore, a kernel represented by the Spongent-π[160] permutation (see also [BKL+13]).
Kernel overview. Spongent-π[160], as used in Dumbo, is an 80-round Spongent permutation [BKL+13] (essentially a PRESENT-type permutation [BKL+07]). It operates on a 160-bit state and consists of three layers in each round: 1) XORing the state with two round constants, of which one is computed by a 7-bit LFSR ICounter_160, i.e., 0^153 ∥ ICounter_160(i), while the other one is rev(0^153 ∥ ICounter_160(i)), where i denotes the round index and rev is a function reversing the order of the bits of its input; 2) sBoxLayer_160, a 4-bit S-box applied 40 times in parallel; 3) pLayer_160, moving bit j of the state to bit position 40 · j mod 159, while bit 159 remains fixed.
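The bit permutation in the third layer can be written down directly from its definition. The sketch below is a bit-by-bit reference model only (it is not the optimised word-oriented implementation); the name `p_layer_160` and the convention that bit 0 is the least-significant bit of a Python integer are assumptions made for illustration.

```python
def p_layer_160(state):
    """Reference pLayer on a 160-bit integer.

    Bit j moves to position (40 * j) mod 159 for j < 159; bit 159 stays put.
    """
    out = 0
    for j in range(160):
        dst = 159 if j == 159 else (40 * j) % 159
        out |= ((state >> j) & 1) << dst
    return out

# Destination of every source bit, useful for checking it is a permutation.
DESTINATIONS = [159 if j == 159 else (40 * j) % 159 for j in range(160)]
```

Since 40 and 159 are coprime, the map j -> 40 · j mod 159 is a bijection on 0..158, so every destination position is hit exactly once.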

Implementation options.
We developed the software-only implementation of Spongent-π[160] from scratch, applying several optimisation techniques on our base ISA. The 160-bit state is stored in five 32-bit words S_0, S_1, S_2, S_3, and S_4, where each S_i stores bits 32i to 32i + 31 of the state. First, we precompute all the round constants, so that the first layer requires only a few instructions to load/prepare the constants plus two XOR instructions. Second, Zbkx provides a dedicated instruction for the parallel 4-bit S-box, namely xperm4, which is very beneficial for sBoxLayer_160. Concretely, the xperm-style look-up table for sBoxLayer_160 is constructed in three registers before Spongent-π[160] starts:

    li rl, 0xF4120BDE    ; the lower half of the S-box look-up table
    li rh, 0x63C958A7    ; the higher half of the S-box look-up table
    li rm, 0x88888888    ; the mask used in the xperm-style S-box

Each 32-bit word S_i (stored in rx) can then be passed through eight 4-bit S-boxes simultaneously using two xperm4 and two XOR instructions via

    xperm4 ry, rl, rx
    xor    rx, rx, rm
    xperm4 rx, rh, rx
    xor    rx, rx, ry

so in each round the whole sBoxLayer_160 needs 20 instructions in total. Last, we divide pLayer_160 into two steps: 1) for each word S_i, we first apply the unzip instruction (from Zbkb) twice, which brings S_i into the form shown in the 3rd row of Figure 1; 2) we then use eight SWAPMOVE operations (SWAPMOVE is explained in detail in Section 3.12) to swap bits between different words, i.e.,

    SWAPMOVE(S0, S1, 0x000000FF, 8);
    SWAPMOVE(S0, S3, 0x000000FF, 24);
    SWAPMOVE(S1, S4, 0x000000FF, 24);
    SWAPMOVE(S2, S4, 0x0000FF00, 16);
    SWAPMOVE(S0, S2, 0x000000FF, 16);
    SWAPMOVE(S1, S2, 0x0000FF00, 8);
    SWAPMOVE(S2, S3, 0x0000FF00, 8);
    SWAPMOVE(S3, S4, 0x00FF0000, 8);

and, afterwards, we use three rori instructions (for right-rotation, also from Zbkb) to align S_1, S_2, and S_3 correctly.
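To make the xperm-style S-box concrete, the following sketch models xperm4 in Python and replays the four-instruction sequence from the text. The model follows the Zbkx behaviour in which the first source operand is the table and an index of 8 or more yields zero; the helper names (`xperm4`, `sbox_word`) and the derived 16-entry table `SBOX` are illustrative, with `SBOX` simply read off the two constants quoted above.

```python
RL = 0xF4120BDE   # lower half of the S-box look-up table (entries 0..7)
RH = 0x63C958A7   # higher half of the S-box look-up table (entries 8..15)
RM = 0x88888888   # flips the top bit of every 4-bit index

def xperm4(table, idx):
    """Model of xperm4 on RV32: each nibble of idx selects a nibble of table.

    Indices 8..15 are out of range for a 32-bit table and produce 0.
    """
    out = 0
    for i in range(8):
        sel = (idx >> (4 * i)) & 0xF
        nib = (table >> (4 * sel)) & 0xF if sel < 8 else 0
        out |= nib << (4 * i)
    return out

def sbox_word(rx):
    """Eight parallel 4-bit S-boxes, mirroring the xperm4/xor sequence."""
    ry = xperm4(RL, rx)   # resolves nibbles with value 0..7, others give 0
    rx ^= RM              # map 8..15 down to 0..7 (and 0..7 up to 8..15)
    rx = xperm4(RH, rx)   # resolves nibbles whose original value was 8..15
    return rx ^ ry        # exactly one of the two look-ups is non-zero-able

# The 16-entry S-box implied by the two table constants.
SBOX = [((RL | (RH << 32)) >> (4 * v)) & 0xF for v in range(16)]
```

Applying `sbox_word` to the five state words gives the 5 × 4 = 20 instructions per round mentioned in the text.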

ISE description.
First, we designed a custom instruction for the parallel 4-bit S-box, which also integrates the first step of pLayer_160 (i.e., two unzip instructions) at the end. Moreover, we designed two instructions for the specific SWAPMOVE operations used in our second step of pLayer_160. Because each of our custom instructions has one destination register, and each SWAPMOVE swaps bits between two different words, two custom instructions are required to perform one complete SWAPMOVE. We also integrated the final three right-rotations into the custom instructions to further reduce the latency.
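The need for two instructions per SWAPMOVE can be seen in a small model. The sketch below uses the commonly used definition of SWAPMOVE (an assumption here, since the paper defers its definition to Section 3.12) and splits it into two hypothetical single-destination operations, `swapmove_a` and `swapmove_b`; in the actual ISE the mask and shift are fixed per instruction rather than passed as arguments.

```python
MASK32 = 0xFFFFFFFF

def swapmove(a, b, mask, n):
    """Swap the bits of b selected by mask with the bits of a selected by mask << n."""
    t = (b ^ (a >> n)) & mask
    return (a ^ (t << n)) & MASK32, b ^ t

def swapmove_a(a, b, mask, n):
    """Hypothetical instruction: produces only the updated a."""
    return (a ^ (((b ^ (a >> n)) & mask) << n)) & MASK32

def swapmove_b(a, b, mask, n):
    """Hypothetical instruction: produces only the updated b."""
    return b ^ ((b ^ (a >> n)) & mask)

def swapmove_two_instr(a, b, mask, n):
    """One complete SWAPMOVE as two single-destination operations.

    Both operations read the ORIGINAL a and b, so the first result is
    written to a fresh register rather than overwriting a in place.
    """
    new_a = swapmove_a(a, b, mask, n)
    new_b = swapmove_b(a, b, mask, n)
    return new_a, new_b
```

For example, with mask 0xFF and shift 8, the low byte of b and the second byte of a exchange places.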

ISE design.
Note that additional, more detailed material relating to the ISE design for this candidate can be found in Appendix B.

GIFT-COFB
Submission overview. The GIFT-COFB [BCI + 21] submission specifies an eponymous AEAD algorithm. We focus on this, the only and therefore primary algorithm, and, more specifically therefore, a kernel represented by the GIFT-128 block cipher (see also [BPP + 17]).
Kernel overview. GIFT-128, a member of the GIFT block cipher family, is an SPN with a key length and a block size of 128 bits each. It is a 40-round block cipher with an identical round function that consists of three steps, namely SubCells, PermBits, and AddRoundKey. A typical technique to implement GIFT-128 is bit-slicing [BPP+17], where the 128-bit cipher state is expressed as four 32-bit slices S_0, S_1, S_2, and S_3. SubCells is essentially a 4-bit S-box, which needs 11 bitwise logical operations when bit-sliced. PermBits has the special property that bits in S_i remain in the same slice through the permutation. AddRoundKey includes three sub-steps: add round key (to S_1 and S_2), add round constant (to S_3), and key state update (whose main operation is two 16-bit word-wise rotations).
We refer readers to [BPP + 17] or the GIFT-COFB specification [BCI + 21] for more details.

Implementation options.
In addition to naive bit-slicing, a newer representation for GIFT-128, namely fix-slicing, is proposed in [ANP20]. In this work, we considered both types of state representation for GIFT-128. According to [ANP20], fix-slicing is faster than naive bit-slicing on 32-bit ARM Cortex-M microcontrollers. However, thanks to Zbkb instructions, we are able to execute PermBits very efficiently, which makes naive bit-slicing outperform fix-slicing on our base ISA. In detail, only three or four instructions are required in order to permute a 32-bit state slice S_i in each PermBits operation (the final rori is not needed for S_3):

    unzip rx, rx
    unzip rx, rx
    rev8  rx, rx
    rori  rx, rx, imm

Figure 1 illustrates how unzip and rev8 permute bits (of a single S_i) during PermBits, from which we observe that the output of rev8 is already the output for S_3 [BCI+21, Table 2.2]. For S_0, S_1, and S_2, we just further rotate the resulting state slice to the right (using rori) by the corresponding offset (i.e., 24, 16, and 8 respectively). Furthermore, Zbkb can also speed up the key state update operation. Concretely, we assume a 32-bit key state word W_6 ∥ W_7 (stored in rx). With the help of the pack instruction, we can quickly obtain (W_6 >>> 2) ∥ (W_7 >>> 12), where >>> denotes a 16-bit right-rotation, through

    pack rx_tmp_unused, zero, zero    ; (placeholder removed below)

    pack ry, rx, rx     ; ry = ( W7 ) || ( W7 )
    rori rx, rx, 16     ; rx = ( W7 ) || ( W6 )
    pack rx, rx, rx     ; rx = ( W6 ) || ( W6 )
    rori ry, ry, 12     ; ry = ( W7 >>> 12 ) || ( W7 >>> 12 )
    rori rx, rx, 2      ; rx = ( W6 >>> 2 ) || ( W6 >>> 2 )
    pack rx, ry, rx     ; rx = ( W6 >>> 2 ) || ( W7 >>> 12 )

ISE description. We implemented both the fix-slicing and the naive bit-slicing implementation of GIFT-128 on the base ISA, and designed an ISE for each of them. The fix-slicing implementation separates the computation of the round key update from the main GIFT-128 computation and uses an efficient round key pre-computation to align with the fix-slicing representation. In essence, the ISE for fix-slicing includes an instruction for the so-called SWAPMOVE operation (which will be discussed in detail in Section 3.12), three instructions for the rotation of nibbles, bytes, and halfwords in a 32-bit register, whereby the rotation amount is encoded as an immediate value, and three further instructions for the key-update function. The latter three instructions perform a sequence of SWAPMOVEs and operations that consist of rotations of 32-bit words, logical ANDs with a constant, and logical ORs. Each of the three key-update instructions operates on a single 32-bit word. The ISE for the bit-slicing implementation, by contrast, includes only two instructions, which accelerate PermBits and the key state update, respectively.
ISE design. Note that additional, more detailed material relating to the ISE design for this candidate can be found in Appendix C.
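The pack/rori sequence for the key state update given under the implementation options can be checked with a small model. In the sketch below, `pack` follows the Zbkb definition on RV32 (the low halfword of the second source becomes the high halfword of the result), while `key_update_zbkb` and `key_update_ref` are illustrative names introduced here.

```python
MASK32 = 0xFFFFFFFF

def pack(rs1, rs2):
    """Zbkb pack on RV32: rd = rs2[15:0] || rs1[15:0]."""
    return ((rs2 & 0xFFFF) << 16) | (rs1 & 0xFFFF)

def rori(x, n):
    """32-bit right-rotation by an immediate 0 < n < 32."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def ror16(h, n):
    """16-bit right-rotation, used only by the reference."""
    return ((h >> n) | (h << (16 - n))) & 0xFFFF

def key_update_zbkb(rx):
    """Replays the six-instruction sequence; rx holds W6 || W7."""
    ry = pack(rx, rx)      # ry = W7 || W7
    rx = rori(rx, 16)      # rx = W7 || W6
    rx = pack(rx, rx)      # rx = W6 || W6
    ry = rori(ry, 12)      # ry = (W7 >>> 12) || (W7 >>> 12)
    rx = rori(rx, 2)       # rx = (W6 >>> 2)  || (W6 >>> 2)
    return pack(ry, rx)    # rx = (W6 >>> 2)  || (W7 >>> 12)

def key_update_ref(rx):
    """Direct computation with two 16-bit rotations."""
    w6, w7 = rx >> 16, rx & 0xFFFF
    return (ror16(w6, 2) << 16) | ror16(w7, 12)
```

The trick is that a 32-bit rotation of a register holding the same halfword twice behaves like a 16-bit rotation in each half, which is why duplicating W6 and W7 with pack makes the plain rori usable.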

Grain-128AEADv2
Submission overview. The Grain-128AEADv2 [HJM + 21] submission specifies an eponymous AEAD algorithm. We focus on this, the only and therefore primary algorithm, and, more specifically therefore, a kernel represented by the keystream-generation function of the underlying Grain-128a stream cipher (see also [HJM07,rHJM11]).

Kernel overview.
Grain-128a is based on (a variant of) the "original" stream cipher Grain, which was a candidate in the eSTREAM competition and was selected for the final eSTREAM portfolio. The kernel is a function that computes a 32-bit word of the keystream using an internal state of 256 bits. This state consists of a 128-bit Linear Feedback Shift Register (LFSR) and a 128-bit Nonlinear Feedback Shift Register (NFSR). The kernel consists of three major sub-functions: one to update the LFSR (called the f function), one to update the NFSR (called the g function), and one to compute the 32-bit output word (called the h function).

Implementation options.
A naive implementation of the sub-functions to update the LFSR and NFSR consists of a large number of bit-level operations. It is therefore more efficient to implement the sub-functions such that they operate on 32-bit words, in which case the kernel basically consists of shifts, ANDs, and XORs. The kernel of Grain-128AEADv2 is simpler (and, therefore, faster) than the kernels of the other NIST finalists, but this simplicity comes at the cost of the kernel being executed more often. Another specific property of this kernel is that the instructions provided by Zbkb/x (e.g., rotations) cannot reduce the execution time significantly.
ISE description. The kernel can be accelerated through a set of ten custom instructions, the most important of which is an instruction to extract a 32-bit word that lies at a certain position within a 64-bit word (held in two source registers). Furthermore, the set includes two instructions for the f function, three instructions for the g function, and four for the h function. Each of these instructions takes two state-words as input and computes the contribution of these two state-words to the result of f, g, and h, respectively. Finally, all the contributions have to be XORed together.
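The extraction operation described above amounts to a funnel shift. The following sketch models it in Python next to the shift/shift/OR sequence that a base ISA would need; the names `extract32` and `extract32_base`, the (hi, lo) operand order, and the convention that the position counts from the least-significant bit are all assumptions made for illustration (the actual instruction is specified in Appendix D).

```python
MASK32 = 0xFFFFFFFF

def extract32(hi, lo, pos):
    """Hypothetical extract: bits pos..pos+31 of the 64-bit value hi || lo.

    Valid for 0 <= pos <= 32.
    """
    return (((hi << 32) | lo) >> pos) & MASK32

def extract32_base(hi, lo, pos):
    """Same result using only 32-bit shifts and an OR (base-ISA style)."""
    if pos == 0:
        return lo
    if pos == 32:
        return hi
    return ((lo >> pos) | (hi << (32 - pos))) & MASK32
```

In a word-oriented shift-register implementation, most taps of f, g, and h are unaligned 32-bit windows of this kind, so replacing three base instructions by one is applied many times per keystream word.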

ISE design.
Note that additional, more detailed material relating to the ISE design for this candidate can be found in Appendix D.

ISAP
Kernel overview. The main distinguishing feature of the ISAP family is its built-in mode-level countermeasures against passive side-channel attacks. However, from a kernel perspective, the main instance Isap-A-128a uses exactly the same Ascon-p permutation as the Ascon family of AEAD algorithms. Isap-A-128a evaluates this permutation over either one, six, or 12 rounds, depending on the concrete (sub-)operation the permutation is part of. As already explained in Section 3.2, Ascon-p operates over a 320-bit state and consists of (i) a round-constant addition, (ii) a substitution layer based on a bit-sliced 5-bit S-box, and (iii) a linear layer performing XORs and rotations of 64-bit words.
Implementation options. Similar to Ascon, optimized implementations of Isap-A-128a for 32-bit platforms can take advantage of bit-interleaving to speed up the linear layer of the permutation. However, as explained in Section 3.2, bit-interleaving actually has a negative effect on the overall performance when the linear layer is accelerated through a small set of custom instructions. This is because an ISE-supported implementation of the linear layer always consists of only 10 instructions, regardless of whether bit-interleaving is applied or not, which means the conversions between bit-interleaved and normal representation actually slow down the execution.
ISE description. The ISE described in Section 3.2 for Ascon-p can be re-used for Isap-A-128a.
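For readers unfamiliar with the bit-interleaved representation referred to above, the following sketch shows why it turns one 64-bit rotation into two 32-bit rotations, and hence what the conversions to and from it are buying. The helper names (`split_bi`, `ror64_bi`) are illustrative, and the conversion is written bit by bit for clarity rather than speed.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def ror32(x, n):
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32

def ror64(x, n):
    n %= 64
    return ((x >> n) | (x << (64 - n))) & MASK64

def split_bi(x):
    """Convert a 64-bit word to (even, odd): bits at even/odd positions."""
    even = odd = 0
    for i in range(32):
        even |= ((x >> (2 * i)) & 1) << i
        odd |= ((x >> (2 * i + 1)) & 1) << i
    return even, odd

def ror64_bi(even, odd, n):
    """64-bit right-rotation by n on the bit-interleaved representation."""
    r = n // 2
    if n % 2 == 0:
        return ror32(even, r), ror32(odd, r)
    # An odd rotation exchanges the roles of the even and odd words.
    return ror32(odd, r), ror32(even, r + 1)
```

The rotation itself costs two 32-bit rotations, but `split_bi` (and its inverse) must be applied whenever data enters or leaves the state; this conversion overhead is what remains once the linear layer has been reduced to a fixed number of custom instructions.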

ISE design.
Note that additional, more detailed material relating to the ISE design for this candidate can be found in Appendix E.

PHOTON-Beetle
Submission overview. We focus on a kernel represented by the PHOTON_256 permutation (see also [GPP11]).
Kernel overview. The PHOTON_256 permutation operates on an internal state of 256 bits, organised into an (8 × 8)-element matrix of 4-bit nibbles. The permutation is SPN-like, consisting of 12 rounds that each apply 4 round functions: these are AddConstant, SubCells, ShiftRows, and MixColumnsSerial. Per [GPP11, Section 2.2], the 4-bit PRESENT S-box is used in SubCells; in contrast to the AES MixColumns round function, MixColumnsSerial is specifically optimised to facilitate a serial application of operations in F_{2^4}.
Implementation options. As reflected by the submission, 3 implementation techniques are applicable to PHOTON_256; in line with the similar SPN-like structure, and, at least to some extent, round functions, said techniques are analogous to those for AES. First, one can focus on online computation. Doing so mirrors the algorithmic description, whereby each round function is computed; this potentially includes arithmetic in F_{2^4}, bar small look-up tables, e.g., for the S-box. Second, one can focus on offline pre-computation. Doing so mirrors the AES T-tables technique: the action of SubCells and MixColumnsSerial is pre-computed using a look-up table, careful indexing into which can also cater for ShiftRows. Third, and finally, one can use bit-slicing.

ISE description.
The ISE design assumes a column-packed representation, and consists of 1 instruction: the second implementation strategy above is followed, but the look-up table that would normally be computed offline is instead computed online (in hardware). Given an input column, the instruction computes 1 nibble of the output column by applying SubCells and MixColumnsSerial. This allows 8 such instructions to compute an entire output column (including AddConstant and ShiftRows, the latter realised simply through indexing of the columns); 64 such instructions can be used to compute an entire round. In a sense, this approach is similar to the design adopted by RISC-V [SCA22b, Sections 2.4+2.5] for AES (as documented in [MNP + 21], stemming from work by Nadehara et al. [NIK04] and Saarinen [Saa20]).
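A simplified functional model of such a one-nibble instruction is sketched below. It is illustrative only: the names (`photon_nibble`, `photon_column`), the single-source interface, and the packing of nibble i into bits 4i..4i+3 are assumptions, and AddConstant and ShiftRows are deliberately left out. The S-box is the PRESENT S-box, and the serial coefficients and field polynomial (x^4 + x + 1) are quoted from memory of the PHOTON specification, so they should be checked against [GPP11] before being relied upon.

```python
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]   # PRESENT S-box
Z = [2, 4, 2, 11, 2, 8, 5, 6]                      # assumed Serial(...) coefficients

def gf16_mul(a, b):
    """Multiplication in F_{2^4} modulo x^4 + x + 1 (assumed polynomial)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r

def serial_power(z):
    """Matrix of d applications of the serial update (v1..v_{d-1}, sum z_i*v_i)."""
    d = len(z)
    rows = [[int(i == j) for j in range(d)] for i in range(d)]
    for _ in range(d):
        new = [0] * d
        for coeff, row in zip(z, rows):
            for j in range(d):
                new[j] ^= gf16_mul(coeff, row[j])
        rows = rows[1:] + [new]
    return rows

MIX = serial_power(Z)

def photon_nibble(col, r):
    """Nibble r of MixColumnsSerial(SubCells(col)) for a packed 32-bit column."""
    acc = 0
    for i in range(8):
        acc ^= gf16_mul(MIX[r][i], SBOX[(col >> (4 * i)) & 0xF])
    return acc

def photon_column(col):
    """Eight one-nibble steps assemble a full output column."""
    out = 0
    for r in range(8):
        out |= photon_nibble(col, r) << (4 * r)
    return out
```

In this model, eight calls to `photon_nibble` produce one column and 8 × 8 = 64 calls cover the eight columns of a round, matching the counts given in the text.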

ISE design.
Note that additional, more detailed material relating to the ISE design for this candidate can be found in Appendix F.

Romulus
Kernel overview. Skinny-128-384 is an SPN-based tweakable block cipher that uses a compact S-box, a very sparse diffusion layer, and a very light key schedule. Due to the high security margin of Skinny, the Romulus designers decided to use a Skinny variant with a reduced number of rounds, namely 40 instead of 56. Skinny-128-384 operates on an internal state of 128 bits that can be viewed as a (4 × 4)-element matrix of bytes, similar to the AES. The round function is composed of five operations in the following order: SubCells, AddConstants, AddRoundTweakey, ShiftRows, and MixColumns. SubCells applies an 8-bit S-box, which can be efficiently implemented in hardware, to every byte of the state. The AddConstants operation XORs some round-dependent constants to the first column of the state. AddRoundTweakey extracts eight bytes from the tweakey state and XORs them to the state, whereby the bytes are permuted and updated with simple LFSRs. ShiftRows rotates the bytes of the state row-wise to the right by 0, 1, 2, and 3 positions, similar to the ShiftRows transformation of the AES. Finally, MixColumns multiplies each byte-column of the state by a binary matrix.
Implementation options. The most efficient software implementations of Skinny-128-384 for 32-bit platforms are based on the fix-slicing technique, which can be seen as a special form of bit-slicing [AP20a]. In this work, we considered both the straightforward implementation that uses a look-up table for the S-box and the fix-slicing implementation.

ISE description.
For the table-based implementation, the ISE design assumes a row-packed representation of the state matrix, and can be described as supporting 1) update and use of the round constant (which involves application of an LFSR), 2) update of the tweakey (which involves application of an LFSR), and 3) application of the round functions. Using a row-packed representation, MixColumns can be realised via a short sequence of XORs; this allows the latter aspect of the ISE to focus on the remaining, row-oriented round functions, i.e., SubCells, ShiftRows, and AddRoundTweakey. Application of SubCells across an entire packed row of the state matrix is rationalised by the low-cost S-box design: even if 4 parallel S-box instances are used, the cost in terms of area is still low in relative terms. For the fix-slicing implementation, the ISE includes instructions for MixColumns, specific SWAPMOVE operations, and round key pre-computation (e.g., LFSR, key permutation, and key update).
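The remark that MixColumns reduces to a short sequence of XORs under a row-packed representation can be illustrated as follows. The sketch assumes the Skinny MixColumns binary matrix with rows (1,0,1,1), (1,0,0,0), (0,1,1,0), (1,0,1,0), and treats each state row as one 32-bit word; the function names are illustrative.

```python
# Assumed Skinny MixColumns matrix: output row i = XOR of input rows j with M[i][j] = 1.
M = [[1, 0, 1, 1],
     [1, 0, 0, 0],
     [0, 1, 1, 0],
     [1, 0, 1, 0]]

def mix_columns_ref(rows):
    """Matrix-based reference on four row-packed 32-bit words."""
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            if M[i][j]:
                acc ^= rows[j]
        out.append(acc)
    return out

def mix_columns_xor(r0, r1, r2, r3):
    """Three word-wide XORs plus a renaming of the rows."""
    r1 ^= r2
    r2 ^= r0
    r3 ^= r2
    return [r3, r0, r1, r2]
```

Because each word holds one byte from every column, a single 32-bit XOR processes all four columns at once; the final row rotation is just a renaming of registers in unrolled code.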

ISE design.
Note that additional, more detailed material relating to the ISE design for this candidate can be found in Appendix G.

Kernel overview. The Sparkle permutation consists of three basic building blocks, namely (i) a non-linear layer that is composed of six parallel instances of the ARX-box Alzette, (ii) a simple linear diffusion layer, and (iii) the addition of a step counter and round constant to the 384-bit state. Alzette can be seen as a small 64-bit block cipher that operates on two 32-bit words and performs three additions and four XORs whereby one of the operands is rotated by a fixed distance, as well as one ordinary addition and four ordinary XORs. On the other hand, the linear layer is, in essence, a Feistel round with a linear Feistel function, followed by a swap of the left and right half of the state.

Implementation options. An ARM Cortex-M implementation of Alzette consists of only 12 instructions when exploiting the "free" rotation of the second operand. On the other hand, when Alzette is implemented using the base RV32GC instruction set, a total of 33 arithmetic/logical instructions are necessary, which can be reduced to 19 instructions when the bit-manipulation extension Zbkb is available. The linear layer consists of two rotations of 32-bit words (which are part of the so-called ℓ operation) and a number of xor and register-move (i.e., mv) instructions. Using the base ISA, the linear layer consists of 32 instructions, among which are six mv instructions. However, these mv instructions can be avoided when the permutation is fully unrolled, thereby reducing the instruction count of the linear layer to 24. A further reduction by four instructions is possible when using the rotation instructions from Zbkb.

ISE description.
There are two basic options for speeding up Alzette with the help of custom instructions. The first is to define instructions for operations of the form x = x ⊕ (y ≫ n) and x = x + (y ≫ n), where x and y are two 32-bit words and n is a fixed rotation amount, which can be encoded as an immediate value. In this case, a single instance of Alzette consists of 12 instructions and is very similar to an ARM Cortex-M implementation. A more speed-optimized ISE would consist of two custom instructions, of which one computes the x word of the output and the other the y word. Each of these instructions can be encoded with two source register addresses, one destination register address, and an immediate value specifying one of six 32-bit constants. In this case, Alzette consists of only two instructions. The instruction count of the linear layer can be reduced from 24 to 16 with the help of a custom instruction for the ℓ operation.
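The first option can be sketched in C as follows. This is an illustrative model only: the helper names add_ror and xor_ror are hypothetical, and the rotation amounts are those we recall from the Sparkle specification, so they should be checked against it before reuse. Each call to a helper corresponds to one fused instruction, giving 12 operations per Alzette instance; the inverse is included only so the sketch can be checked by a round trip.

```c
#include <stdint.h>

/* Rotate a 32-bit word right by n bits, 0 <= n < 32. */
static uint32_t ror32(uint32_t x, unsigned n) {
    return n ? (x >> n) | (x << (32 - n)) : x;
}

/* Hypothetical fused operations: x + (y >>> n) and x ^ (y >>> n). */
static uint32_t add_ror(uint32_t x, uint32_t y, unsigned n) { return x + ror32(y, n); }
static uint32_t xor_ror(uint32_t x, uint32_t y, unsigned n) { return x ^ ror32(y, n); }

/* Rotation amounts (assumed, recalled from the Sparkle specification). */
static const unsigned RA[4] = {31, 17, 0, 24};  /* applied to y before each addition */
static const unsigned RX[4] = {24, 17, 31, 16}; /* applied to x before each XOR */

/* One Alzette instance: 4 add-rotate, 4 xor-rotate, 4 constant XORs. */
static void alzette(uint32_t *x, uint32_t *y, uint32_t c) {
    for (int i = 0; i < 4; i++) {
        *x = add_ror(*x, *y, RA[i]);
        *y = xor_ror(*y, *x, RX[i]);
        *x ^= c;
    }
}

/* Inverse of alzette, undoing the four steps in reverse order. */
static void alzette_inv(uint32_t *x, uint32_t *y, uint32_t c) {
    for (int i = 3; i >= 0; i--) {
        *x ^= c;
        *y = xor_ror(*y, *x, RX[i]);
        *x -= ror32(*y, RA[i]);
    }
}
```

The second, more speed-optimised option would in effect compute the final x (resp. y) of this loop in a single instruction, with the immediate selecting the constant c.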
ISE design. Note that additional, more detailed material relating to the ISE design for this candidate can be found in Appendix H.

TinyJAMBU
Submission overview. The TinyJAMBU [WH21] submission specifies an eponymous AEAD algorithm family. We focus on the primary algorithm TinyJAMBU-128 [WH21, Section 3.3], and, more specifically, a kernel represented by the keyed permutation P_n, which is iterated either n = 640 times (P_640) or n = 1024 times (P_1024).
Kernel overview. The permutation P is based on a 128-bit non-linear feedback shift register whose feedback path consists of four bit-wise XORs and a bit-wise NAND, which is the only non-linear operation of TinyJAMBU. One can easily identify the state-update function as the most performance-critical operation; besides the 128-bit state and the number of rounds, it also takes a key as input. However, TinyJAMBU does not involve a key schedule. The permutation P_n distinguishes itself from the permutations of other finalists like Ascon, Sparkle, and Xoodyak by an extremely small state size and the fact that it is keyed (i.e., P_n is a non-public permutation). Furthermore, the number of rounds is much higher, which is compensated by an extremely simple round function (basically just a shift of the 128-bit state along with five bit-operations).

Implementation options.
On a 32-bit processor, it is possible to compute 32 instances of the permutation simultaneously, which means the XOR and NAND operations are performed on 32-bit words. One of them is a word of the state, one is a word from the key and the other four are extracted from the state at certain positions. The latter boils down to extracting a 32-bit word from two adjacent 32-bit state-words through an operation of

ISE description.
Extracting a 32-bit word from two state-words can be done with three native RV32GC instructions. However, this operation can be easily mapped to a custom instruction (which we call fsri) that reads two 32-bit words from registers and gets the position of the word to extract through an immediate value. Even though fsri saves only two instructions, it still improves the execution time of TinyJAMBU significantly since these word-extractions account for about 80% of the execution time of the state-update operation.
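As an illustration, the word extraction and its use in the state update can be modelled in C as below. This is a sketch under stated assumptions: the operand order of fsri is hypothetical, and the tap positions (47, 70, 85, and 91) are those we recall from the TinyJAMBU specification, with s[0] holding state bits 0 to 31.

```c
#include <stdint.h>

/* Hypothetical model of fsri: returns bits [n+31 : n] of the 64-bit
 * value formed by hi:lo, for 0 < n < 32. On RV32GC this costs three
 * instructions (srli, slli, or). */
static uint32_t fsri(uint32_t lo, uint32_t hi, unsigned n) {
    return (lo >> n) | (hi << (32 - n));
}

/* 32 steps of the NLFSR computed word-wise; k is the key word used
 * for these steps. Tap positions are assumed, see the text above. */
static void tinyjambu_32steps(uint32_t s[4], uint32_t k) {
    uint32_t t1 = fsri(s[1], s[2], 15); /* state bits 47..78  */
    uint32_t t2 = fsri(s[2], s[3], 6);  /* state bits 70..101 */
    uint32_t t3 = fsri(s[2], s[3], 21); /* state bits 85..116 */
    uint32_t t4 = fsri(s[2], s[3], 27); /* state bits 91..122 */
    uint32_t fb = s[0] ^ t1 ^ ~(t2 & t3) ^ t4 ^ k; /* four XORs, one NAND */

    s[0] = s[1];  /* shift the state by one word */
    s[1] = s[2];
    s[2] = s[3];
    s[3] = fb;
}
```

With this model, four of the six input words of the feedback computation come from fsri, which is consistent with the word extractions dominating the state update.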
ISE design. Note that additional, more detailed material relating to the ISE design for this candidate can be found in Appendix I.

Xoodyak
Submission overview. The Xoodyak [DHM+21] submission specifies an eponymous algorithm, which supports both AEAD and hash function modes. We focus on this, the only and therefore primary algorithm, and, more specifically, on a kernel represented by the Xoodoo[12] permutation (see also [DHAK18]).
Kernel overview. The state of the Xoodoo[12] permutation has the form of a (3 × 4)-element matrix of 32-bit words, which can be visualized via three horizontal 128-bit planes (one above the other), each consisting of four 32-bit lanes. It is also possible to view the 384-bit state as 128 columns of three bits stacked upon one another (i.e., each bit belongs to a different plane). As its name indicates, Xoodoo[12] executes 12 iterations of a round function consisting of five steps: a column-parity mixing layer θ, a non-linear layer χ, two plane-shifting layers (ρ_west and ρ_east) between them, and a round-constant addition. Both ρ layers move bits horizontally and perform lane-wise rotations of planes as well as rotations of lanes by 11, 1, and 8 bits to the left. On the other hand, in the parity-computation part of θ and in the χ layer, state-bits interact only vertically, i.e., within 3-bit columns. The θ layer mainly executes XORs and left-rotations by 5 and 14 bits. Finally, the non-linear layer χ applies a 3-bit S-box to each column of the state, which can be computed using logical ANDs, XORs, and bitwise complements.

Implementation options. An optimised implementation of the Xoodoo[12] permutation on RV32IMAC was proposed in [CJL+20]. This implementation takes advantage of a technique known as lane complementing, which allows one to reduce the number of bitwise complements that have to be carried out in the χ transformation from 12 per round to three. However, this optimisation is not necessary on our base ISA, due to the andn instruction provided by Zbkb. andn combines a logical AND with a bitwise complement of the second operand, which makes the implementation of χ both more straightforward and more efficient on our base ISA.
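The role of andn in χ can be sketched in C as follows. The state layout a[plane][lane] and the function names are hypothetical, and the formula for χ is the one we recall from the Xoodoo specification; the andn helper models the Zbkb semantics rs1 AND (NOT rs2).

```c
#include <stdint.h>

/* Model of the Zbkb andn instruction: a AND (NOT b). */
static uint32_t andn(uint32_t a, uint32_t b) {
    return a & ~b;
}

/* Sketch of the chi layer on a state of 3 planes x 4 lanes. Each lane
 * needs three andn and three xor operations, so no separate bitwise
 * complement (and hence no lane complementing) is required. */
static void xoodoo_chi(uint32_t a[3][4]) {
    for (int x = 0; x < 4; x++) {
        uint32_t b0 = andn(a[2][x], a[1][x]); /* (NOT a1) AND a2 */
        uint32_t b1 = andn(a[0][x], a[2][x]); /* (NOT a2) AND a0 */
        uint32_t b2 = andn(a[1][x], a[0][x]); /* (NOT a0) AND a1 */
        a[0][x] ^= b0;
        a[1][x] ^= b1;
        a[2][x] ^= b2;
    }
}
```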

ISE description.
When adhering to the requirements for custom instructions mentioned in Section 2, the only opportunity to speed up Xoodoo[12] is the manipulation of the parity-plane (i.e., three 32-bit parity-lanes) through an operation of the form e = (p ≪ 5) ⊕ (p ≪ 14), where ≪ denotes a left-rotation of a 32-bit lane. We call the custom instruction implementing this operation xorrol.
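A minimal C model of xorrol, together with one plausible way of using it inside θ, is given below. The state layout a[plane][lane] and the lane indexing within θ are assumptions based on our recollection of the Xoodoo specification; only the form of the xorrol operation itself is taken from the description above.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits, 0 < n < 32. */
static uint32_t rol32(uint32_t x, unsigned n) {
    return (x << n) | (x >> (32 - n));
}

/* Model of the custom instruction: e = (p <<< 5) ^ (p <<< 14). */
static uint32_t xorrol(uint32_t p) {
    return rol32(p, 5) ^ rol32(p, 14);
}

/* Sketch of theta on a state of 3 planes x 4 lanes. The parity lane
 * feeding lane x is taken from lane (x + 3) mod 4 (assumed indexing). */
static void xoodoo_theta(uint32_t a[3][4]) {
    uint32_t e[4];
    for (int x = 0; x < 4; x++) {
        int src = (x + 3) & 3;
        uint32_t p = a[0][src] ^ a[1][src] ^ a[2][src]; /* column parity */
        e[x] = xorrol(p);
    }
    for (int y = 0; y < 3; y++)
        for (int x = 0; x < 4; x++)
            a[y][x] ^= e[x];
}
```

Without the custom instruction, each call to xorrol corresponds to two rotations and one XOR when Zbkb is available.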
ISE design. Note that additional, more detailed material relating to the ISE design for this candidate can be found in Appendix J.

Discussion
Observations regarding ISA design. One could imagine two different approaches to improving this starting point. The arguably more CISC-like approach (see [CDPA16, Section V]) would be to add a dedicated "shift-then-XOR" instruction to the base ISA; more general-purpose instances of this same approach include the ARM Cortex-M "flexible second operand" mechanism. The arguably more RISC-like approach (see [CDPA16, Section VI]) would be to retain the original instructions (resp. micro-ops) only, but implement a mechanism by which they can be fused (or combined, into a macro-op). By using compressed instructions [SCA19, Chapter 16], for example, one can express a similar operation as

    c.slli ry, imm
    c.xor  ry, rz

Celio et al. [CDPA16] argue that by fusing these 2 instructions in the micro-architecture front-end, the same (effective) instruction throughput is achieved as use of the 1 non-compressed, dedicated instruction, but, crucially, without "bloating" the base ISA. However, a micro-architecture which supports fusion is more complex as a result; for resource-constrained devices, support for dynamic, run-time fusion is therefore potentially unattractive. A conceptual alternative would be static, compile-time fusion. If there were a way to "merge" 2 compressed instructions into 1 non-compressed instruction, their fused semantics could be expressed at compile-time and executed by a less complex micro-architecture.
• There are several algorithms which use 32-bit (e.g., Sparkle) or 64-bit (e.g., Ascon) rotation. This fact relates to a more general challenge of selecting an n-bit natural word size for an algorithm: one could say that a larger n can be a positive for base ISAs with a large word size (e.g., allowing more effective use of the data-path) but a negative for base ISAs with a small word size (e.g., because n-bit operations need to be synthesised by a sequence of m-bit alternatives, for m < n), and vice versa.
Put another way, choice of an n somewhat biases how efficient an implementation of the algorithm can be when using a given base ISA.
The other dimension to this choice, however, is how well a particular ISA supports a particular n. There is precedent in RISC-V for supporting 32-bit operations when XLEN = 64 (e.g., rorw in Zbkb [SCA22b, Section 3.26] and similar), but not 64-bit operations when XLEN = 32. Following a RISC-like design philosophy, the argument would likely be that the latter, e.g., 64-bit rotation, can and therefore should be synthesised using a sequence of 32-bit instructions. That said, and although total orthogonality is clearly unrealistic, it seems there are some opportunities along similar lines. A pertinent example is a family of so-called funnel shift instructions, which appeared in drafts of the B extension but in neither the ratified B (i.e., Zba, Zbb, Zbc, and Zbs) nor Zbkb extensions. Although counterarguments (e.g., their ternary, 4-address format) exist, one could view their omission as a missed opportunity: a general-purpose funnel shift eliminates the need for bit-interleaving (where relevant) without needing a further, special-purpose ISE.
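To make the funnel shift argument concrete, the following C sketch synthesises a 64-bit rotation from 32-bit halves. The helper names fsr32 and ror64_via32 are hypothetical, and fsr32 only models the general idea of a funnel shift right rather than the exact semantics of any draft instruction.

```c
#include <stdint.h>

/* Model of a 32-bit funnel shift right: bits [n+31 : n] of hi:lo,
 * for 0 <= n < 32. */
static uint32_t fsr32(uint32_t lo, uint32_t hi, unsigned n) {
    return n ? (lo >> n) | (hi << (32 - n)) : lo;
}

/* Rotate a 64-bit value right by n bits using only 32-bit operations:
 * two funnel shifts, plus a swap of the halves when n >= 32. */
static uint64_t ror64_via32(uint64_t v, unsigned n) {
    uint32_t lo = (uint32_t)v;
    uint32_t hi = (uint32_t)(v >> 32);
    n &= 63;
    if (n >= 32) {               /* rotating by 32 swaps the halves */
        uint32_t t = lo;
        lo = hi;
        hi = t;
        n -= 32;
    }
    uint32_t new_lo = fsr32(lo, hi, n);
    uint32_t new_hi = fsr32(hi, lo, n);
    return ((uint64_t)new_hi << 32) | new_lo;
}
```

With a funnel shift instruction each half costs one instruction; without one, each half costs a shift right, a shift left, and an OR.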
Observations regarding ISE design.
• For some algorithms, an ISE design for RV32GC is harder to scale (or generalise) into one for RV64GC than for other algorithms. PHOTON-Beetle uses PHOTON_256, for example, which uses an (8 × 8)-element state matrix of 4-bit nibbles. Where XLEN = 32 it is possible to pack 1 column into each 32-bit word; where XLEN = 64, the natural generalisation is to pack 2 columns into each 64-bit word. However, this natural generalisation of the representation renders the associated implementation more difficult, e.g., with respect to the ShiftRows round function. On one hand, this does not seem a significant problem; it is already true of support for AES in RISC-V (cf. aes32esi versus aes64es in Zkne [SCA22b, Section 2.5]), for example. On the other hand, however, one could also argue that scalability is an attractive property and so favour designs which facilitate it.
• There are several algorithms (e.g., Elephant and Romulus) where "small" n-bit LFSRs, for n < XLEN, are used. Although the LFSR update is typically dominated by other components of a given algorithm, an associated ISE could plausibly offer incremental improvement over use of the base ISA alone; if it were parameterisable (e.g., with respect to the tap sequence), such an ISE could represent a somewhat general-purpose primitive.
• There are several algorithms (e.g., GIFT and Romulus) where the implementation technique of fix-slicing [ANP20, AP20b] is applicable; this fact is specifically highlighted and explored by Adomnicai and Peyrin [AP20a]. Where fix-slicing is applied, an implementation will often make use of a primitive termed SWAPMOVE. May et al. [MPC00, Section 3.1] are among the first to define and make use of this primitive: the basic idea is that some bits in an operand x are swapped with some bits in another operand y, with n and m controlling which bits. As such, SWAPMOVE has 3 inputs of XLEN bits (x, y, and m), 1 input of ⌈log_2 XLEN⌉ bits (n), and 2 outputs of XLEN bits (x and y).
In various ISE designs, we cope with the number and type of inputs and outputs through specialisation, e.g., employing 1) a 1-operand variant that involves only x, and 2) a small, hard-coded set of n and m. Given a more general-purpose ISE for SWAPMOVE is more attractive, however, it seems useful to carefully explore the trade-off between general- and special-purpose. For example, through careful inter-algorithm analysis, it might be possible to identify a somewhat general-purpose set of n and m which afford a compact and so viable encoding.
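For reference, SWAPMOVE as commonly written in bit-sliced and fix-sliced software can be sketched in C as below, for XLEN = 32. The function names are hypothetical; the 1-operand variant corresponds to the specialisation mentioned above, where both sets of bits come from the same word.

```c
#include <stdint.h>

/* General SWAPMOVE: swaps the bits of y selected by mask m with the
 * bits of x selected by (m << n). Three XLEN-bit inputs (x, y, m),
 * one shift amount n, and two outputs (x, y). */
static void swapmove(uint32_t *x, uint32_t *y, uint32_t m, unsigned n) {
    uint32_t t = (*y ^ (*x >> n)) & m;
    *y ^= t;
    *x ^= t << n;
}

/* 1-operand variant: swaps the bits of x selected by m with the bits
 * of x selected by (m << n). With m and n hard-coded, this fits a
 * format with one source and one destination register. */
static uint32_t swapmove1(uint32_t x, uint32_t m, unsigned n) {
    uint32_t t = (x ^ (x >> n)) & m;
    return x ^ t ^ (t << n);
}
```

The general form needs more inputs and outputs than a typical 3-address instruction format provides, which is the reason for the specialisation discussed above.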
Observations regarding algorithm design.
• For some algorithms, a change to the interface could plausibly yield more efficient implementations. PHOTON-Beetle uses PHOTON_256, for example, which initialises an (8 × 8)-element state matrix of 4-bit nibbles from a 16-element array of 8-bit bytes using a row-major ordering. Use of a column-oriented representation of the state matrix can therefore imply a significant conversion overhead, which could be reduced by changing the interface to allow a column-major ordering (although doing so clearly then penalises row-oriented representation in the same way).
• For some algorithms, a change to the parameterisation could plausibly yield more efficient implementations. PHOTON-Beetle uses PHOTON_256, for example, which, per [GPP11, Section 2.2], implies use of the 4-bit PRESENT S-box. A different parameterisation is possible, however, which implies use of the 8-bit AES S-box: although reasonable counterarguments also exist, one could argue that opting for the latter will maximise overlap with existing ISEs and so minimise the additional hardware components required (e.g., by using an AES S-box shared with Zkne [SCA22b, Section 2.5], if that extension were also supported).

Implementation
In the same way as the ISA, a given ISE design represents an interface between hardware and software. In this section we consider both sides of said interface, as defined in Section 3: Section 4.1 considers the hardware-oriented side, i.e., how the ISE is realised, then Section 4.2 considers the software-oriented side, i.e., how the ISE is utilised. Doing so shifts our focus from abstract design to concrete implementation, which then represents the basis for evaluation in Section 5.

Hardware
Host core. To realise each ISE design, we use the highly configurable, RISC-V compliant Rocket [AAB+16] host core. At a high level, the core executes instructions using a 5-stage, in-order pipeline; support is included within the core for a branch prediction mechanism, and in the wider system for a 16 kB instruction cache and a 16 kB data cache.
To support the execution of associated instructions, two modifications are made to the host core for each ISE design. First, an ISE-specific Functional Unit (FU) is integrated into the host core. At least two different approaches are possible, namely 1) an internal integration, where the FU is integrated directly into the pipeline, and 2) an external integration, which integrates the FU using the Rocket Custom Coprocessor (RoCC) [AAB+16, Section 4] interface. Although it requires less micro-architectural modification, using the RoCC interface locates the FU in the commit stage; this can degrade performance, due to inefficiency resulting from how forwarding is implemented. Our ISE designs are intended to permit single-cycle execution, which means the efficiency of forwarding is important. As such, we opt for the former approach, which allows location of the FU in the execute stage. Second, ISE-specific modifications are made to the instruction decoder, which, e.g., allow it to correctly provide input operands to the FU, control the FU so it performs the required computation, and accept output operands from the FU.

What we term the unextended core, i.e., Rocket as is, supports RV32GC only. In line with our definition of base ISA, we define the base core, i.e., a baseline for our work, as the unextended core plus additional support for Zbkb and Zbkx. We then further extend this base core with support for an LWC-specific ISE, yielding what we term an extended core. Figure 2 illustrates the outcome, with our modifications highlighted in red. Note that the Zbkb/x FU realises the Zbkb and Zbkx extensions, so is fixed across all ISE designs; the LWC FU realises a given ISE design, so differs between ISE designs. Also note that neither the Zbkb/x FU nor any LWC FU extends the existing critical path, so neither has an impact on the clock frequency. As such, and by design, the associated instructions have a 1 cycle execution latency.
LWC FU. The implementation of each LWC FU stems fairly directly from the associated ISE definition; each such definition uses pseudo-code which is intentionally similar to the openly available Register Transfer Language (RTL) implementation used.

Experimental platform.
To produce an experimental platform which permits evaluation of, e.g., area and cycle-accurate execution latency, we make use of the SASEBO-GIII [HKSS12]: this includes two FPGAs, namely a Xilinx Kintex-7 (model xc7k160tfbg676) target FPGA, and a Xilinx Spartan-6 (model xc6slx45) support FPGA. We use the former exclusively, synthesising stand-alone designs for it using Xilinx Vivado 2019.1; default synthesis settings are used, with no effort invested in synthesis or post-implementation optimisation. The FPGA uses a 200 MHz external clock input, which is adjusted into a 50 MHz internal clock signal for use by the host core itself.

Software
High-level strategy. To utilise each ISE design, we developed a software implementation which can be executed by the associated extended core. For a given algorithm, we start with a base implementation. This is the source code submitted for a given algorithm. The base implementation is used as is, with one exception: the submission for Grain-128AEADv2 was ported from C++ to C, then adapted to cope with, e.g., assumptions around unaligned access to memory. Using appropriate C pre-processor directives, we make minor alterations to the base implementation so the kernel implementation is selectable between the original and a compatible replacement developed by us; Table 1 summarises this information on a per-algorithm basis. We try to be consistent, using the most efficient parameterisation of and implementation strategy for the base implementation which is compatible with our replacement kernel. We view this approach as effective, in the sense it 1) allows focus on the kernel in question (so limits the volume of work involved), but, equally, 2) allows evaluation of the ISE design within an algorithm-wide rather than kernel-only context (so maximises utility of the outcomes).
Low-level strategy. We use a RISC-V capable instance of the GNU tool-chain 9 to compile each software implementation. Each replacement kernel implementation is written in assembly language. Rather than modify the tool-chain, instances of the .insn directive are used to generate ISE-based instructions.
• Each replacement kernel implementation is captured in a single, leaf function; there is no further opportunity for, e.g., function inlining. We respect the ABI, in the sense that a function prologue and epilogue are careful to preserve and restore any callee-save registers by using the stack.
• Use of an ISE almost always reduces the number of instructions required to implement a replacement kernel, meaning loop overhead which stems from iteration, e.g., over rounds within it, can become more prominent.
To address this while providing at least some consistency, we support either partial, 2-fold unrolling or full, n-fold unrolling (for an appropriate n) of rounds within a replacement kernel. The former is often useful, for example, to avoid unnecessary copying of state output by an i-th round for use as input by the subsequent, (i + 1)-th round.
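Regarding the use of .insn noted above: the directive lets the assembler emit an instruction from its fields, e.g., a line of the form `.insn r opcode, funct3, funct7, rd, rs1, rs2` for the R-type format. The C sketch below shows the 32-bit word such a line produces. The function name encode_r is hypothetical, and the custom-0 opcode with zero funct fields is a placeholder rather than the encoding of any ISE described in this paper.

```c
#include <stdint.h>

/* Assemble a RISC-V R-type instruction word from its fields, i.e.,
 * the word that ".insn r opcode, funct3, funct7, rd, rs1, rs2"
 * would emit. Field widths follow the standard R-type layout. */
static uint32_t encode_r(uint32_t opcode, uint32_t funct3, uint32_t funct7,
                         uint32_t rd, uint32_t rs1, uint32_t rs2) {
    return ((funct7 & 0x7Fu) << 25) |
           ((rs2    & 0x1Fu) << 20) |
           ((rs1    & 0x1Fu) << 15) |
           ((funct3 & 0x07u) << 12) |
           ((rd     & 0x1Fu) <<  7) |
            (opcode & 0x7Fu);
}
```

For example, encode_r(0x33, 0, 0, 10, 11, 12) reproduces the standard encoding of add a0, a1, a2, whereas replacing the opcode by 0x0B (custom-0) yields a word in the space reserved for custom extensions.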

Evaluation
In this section, we present the results of evaluating our ISE designs from both hardware and software perspectives. As a non-LWC comparison point, we consider an existing 10 ISE-supported implementation of AES-GCM [SCA07]. We attempt to align said implementation as closely as possible with the API used, by 1) "upgrading" it to support additional data, and 2) parameterising it using a 128-bit key. A set of results, limited to the relevant, extended ISA case only, are included for reference alongside those for LWC candidates.
9 Use of the Rocket core demands a specific tool-chain version; we used commit b468107e701433e1caca3dbc8aef8d40e0c967ed of https://github.com/riscv/riscv-gnu-toolchain, yielding, e.g., a working GCC whose version was 9.2.0.
10 https://github.com/rvkrypto/rvkrypto-fips

Note throughout that, within the context of GIFT-COFB, we use FS and BS to refer to implementations based on fix-slicing and bit-slicing respectively; within the context of Romulus, we use FS and TB to refer to implementations based on fix-slicing and look-up tables respectively.
Hardware. Table 2 presents a summary of synthesis results for each ISE design. Reflecting the constraints in Section 3.1, note that all ISE designs require combinational logic only, i.e., no state, so we report the number of FPGA Look-up Tables (LUTs) only. We measure (cumulative) overhead relative to the unextended core alone, and so exclude the wider system: doing so seems more representative, in that, e.g., the caches would dominate otherwise. For example, the ISE for Sparkle (resp. TinyJAMBU) demands the most (resp. least) area: implementation of the Zbkb/x and LWC FUs produces a 14% and 10% (resp. 3%) overhead respectively, meaning 24% (resp. 17%) cumulative versus the unextended core.
Software: kernel. Table 3 presents a summary of low-level results, focusing on the kernels in isolation. For each kernel, we report both absolute results i.e., execution latency (measured in clock cycles) and instruction footprint (measured in bytes), and relative results i.e., increase/decrease factor versus use of the base ISA alone. Note that for some kernels, e.g., GIFT and Romulus, we use auxiliary functions relating to pre-computation of round keys. For clarity, and because our ISEs can be used within them, we include these in addition to the kernel itself.
For comparison, single-block encryption via aes128_enc_ecb_rvk32 (resp. decryption via aes128_dec_ecb_rvk32) using the ISE-supported implementation of AES-GCM requires 324 (resp. 321) cycles; the encryption key schedule via aes128_enc_key_rvk32 (resp. decryption key schedule via aes128_dec_key_rvk32) requires 264 (resp. 719) cycles; the GHASH function (dominated by a multiplication in F_{2^128}) via ghash_mul_rv32 requires 135 cycles.

Software: API. Table 4, Table 5, and Table 6 present a summary of high-level results, focusing on the kernels in context, i.e., as invoked via the API using the aead_encrypt and aead_decrypt functions for a 16, 128, and 1024 byte plaintext (resp. ciphertext) respectively. This is important, because one kernel may represent a different proportion of the associated algorithm than another, and thus yield different overall improvements. We consider a range of cases, constrained such that the associated data and plaintext/ciphertext lengths are equal: counterarguments clearly exist (e.g., one might expect common use-cases to require a short(er), fixed length associated data, and a longer, variable length plaintext/ciphertext), but adopting this approach aligns with the NIST micro-controller benchmarking framework and so allows easier comparison of results.
Finally, Figure 3 presents a similar summary of the data to that used by NIST: for each algorithm, we select the most efficient ISE variant (with respect to execution latency) and plot the normalised cycles per byte across all parameterisations considered (i.e., 16 B, 128 B, and 1024 B plaintext, ciphertext, and associated data). As well as more clearly illustrating relative execution latency, including for the AES-GCM case, the graphs highlight cases where the overhead of initialisation is more (resp. less) effectively amortised for large (resp. small) inputs.
The results for the ISE-supported kernels show that the more hardware-oriented designs (e.g. Elephant, PHOTON-Beetle, Romulus (TB)) are generally accelerated by a larger extent than the more software-oriented designs, such as Ascon, Sparkle, and Xoodyak, which were already relatively efficient with only the base-ISA. Among the latter three algorithms, Sparkle achieves a much higher speed-up than Xoodyak, which is mainly because the ARX-box Alzette can be implemented with only two custom instructions since it operates on 64-bit parts of the state (i.e. two 32-bit words). On the other hand, Xoodyak is not particularly well-suited for ISE because it does not contain many operations that can be mapped to custom instructions with two source registers and one destination register.
An additional benefit of the ISE-supported implementations is their significantly smaller code size, which is mainly due to the reduced footprint of the kernels. Such size reductions are often downplayed and only seen as a minor side benefit of ISE, but such a view neglects the fact that a size reduction can yield a non-negligible reduction of execution time on processors with a small instruction cache. For example, according to Table 3, the base-ISA implementations of the kernels of Elephant, PHOTON-Beetle, and Romulus (TB) have a footprint of more than 16 kB and exceed the instruction-cache size of our Rocket core, thereby slowing down the execution due to cache misses. On the other hand, all ISE-supported kernels fit conveniently into the instruction cache.
Comparison with related work: hardware. Strictly limited to cases based on RISC-V, and presented in chronological order, various elements of related work yield useful comparison points.
Tehrani et al. [TGSMD20] describe an ISE for RV32 to support a range of lightweight, 64-bit block ciphers including GIFT-64-128 and Skinny-64-128, implementing and evaluating it using the VexRiscv core. First, they support computation of the substitution layer using a general-purpose instruction for nibble-wise table look-up; doing so is achieved by capturing the table (i.e., S-box) in 3 CSRs, and then applying it nibble-wise to a 32-bit input word supplied in GPR[rs1]. Second, they support computation of the permutation layer. For GIFT-64-128 this takes the form of a special-purpose instruction, whereas for Skinny-64-128, a general-purpose instruction for nibble-wise matrix-vector multiplication is used; doing so is achieved by capturing a (constant) matrix in 8 CSRs, then applying it to a 64-bit input vector supplied in GPR[rs1] and GPR[rs2] (with two instructions required to compute the most- and least-significant 32-bit half of the result). We do not present a comparison with this work, because the ISE cannot be used for either GIFT-128-128 or Skinny-128-384+ so is not applicable to GIFT-COFB or Romulus.
Altınay and Örs [AO21] describe an ISE for RV32 to support Ascon-p, implementing and evaluating it using the spike instruction set simulator. Their ISE includes two instructions. First, they support general-purpose rotation; similar instructions are now available via the standard B (bit manipulation) [SCA21, Section 1.3] and K (cryptography) [SCA22b, Section 2.1] extensions. Second, they support special-purpose computation of the S-box. Their instruction for doing so is CISC-like, in the sense it operates on data resident in memory: using an input register address rs1, it loads five 32-bit inputs x_i ← MEM[GPR[rs1] + 4 · i]_4, applies the S-box to produce outputs r_i from the inputs x_i, then stores five 32-bit outputs MEM[GPR[rs1] + 4 · i]_4 ← r_i, where 0 ≤ i < 5 throughout. We do not present a comparison with this work, because 1) the ISE falls outside our constraints as outlined in Section 3.1, and, moreover, 2) no non-simulated evaluation results (i.e., area overhead, and cycle accurate execution latency) are available for it.
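The load, apply, store semantics just described can be modelled in C as below. This is a sketch of the general idea only, not of the instruction in [AO21]: the function name is hypothetical, the pointer mem stands in for the address held in GPR[rs1], and the S-box uses the widely used bit-sliced formulation of the Ascon substitution layer, applied to 32 columns at once.

```c
#include <stdint.h>

/* Model of a memory-based S-box instruction: load five 32-bit words,
 * apply the 5-bit Ascon S-box to each of the 32 bit-columns, and
 * store the five results back to the same locations. */
static void ascon_sbox_mem(uint32_t mem[5]) {
    uint32_t x0 = mem[0], x1 = mem[1], x2 = mem[2], x3 = mem[3], x4 = mem[4];

    x0 ^= x4;  x4 ^= x3;  x2 ^= x1;            /* linear pre-processing  */
    uint32_t t0 = ~x0 & x1;
    uint32_t t1 = ~x1 & x2;
    uint32_t t2 = ~x2 & x3;
    uint32_t t3 = ~x3 & x4;
    uint32_t t4 = ~x4 & x0;
    x0 ^= t1;  x1 ^= t2;  x2 ^= t3;  x3 ^= t4;  x4 ^= t0;
    x1 ^= x0;  x0 ^= x4;  x3 ^= x2;  x2 = ~x2; /* linear post-processing */

    mem[0] = x0;  mem[1] = x1;  mem[2] = x2;  mem[3] = x3;  mem[4] = x4;
}
```

The five loads and five stores are what places such an instruction outside a format with two source registers and one destination register.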
Steinegger and Primas [SP21] describe an ISE for RV32 to support Ascon-p, implementing and evaluating it using the RI5CY core. Their ISE includes one instruction, which essentially supports computation of an entire Ascon-p round in hardware. Implementation therefore demands tight integration with the core (e.g., using 10 hard-wired general-purpose registers to store the state), which, although delivering performance, arguably renders it more akin to a tightly-coupled accelerator than traditional ISE. Although the ISE falls outside our constraints as outlined in Section 3.1, it does represent a competitive trade-off: modulo differences with respect to the core used, [SP21, Table 1 + Section 4] demonstrate that a factor of 1.1 area overhead permits a significant, factor of 50 improvement in execution latency for Ascon. For certain use-cases, this trade-off can be argued as more attractive than one based on a more hardware-oriented (e.g., purely using an IP core) or more software-oriented (i.e., using a more tightly constrained ISE, as in our work) alternative.
Comparison with related work: software. Strictly limited to cases based on RISC-V, and presented in chronological order, various elements of related work yield useful comparison points.
Jellema [Jel19] presents an optimised implementation of Ascon, based on use of an E31 (supporting RV32IMAC) core; [Jel19, Figure 10] suggests a measured 6 · 118 = 708 cycle execution latency for the 6-round Ascon-p permutation. Modulo the different core, this can be compared with the base ISA and extended ISA columns of Table 3, where we measure 700 and 280 cycles respectively. At face value one might expect use of Zbkb/x to offer greater improvement, but in fact this result is expected: although we can use andn and orn within the substitution layer, we cannot use rol or ror within the diffusion layer (because XLEN = 32, so 64-bit rotation is not supported). Alternatively, [Jel19, Figure 11] suggests a measured 552076 cycle execution latency for the encryption of a 4096 byte plaintext and (inferred) 0 byte associated data; for this parameterisation, our implementation takes 479764 cycles using the base ISA or 263043 cycles using the extended ISA, i.e., the LWC-specific ISE.
Lemmen [Lem20] presents an optimised implementation of Elephant, based on use of an E31 (supporting RV32IMAC) core. We do not present a comparison with this work, because it focuses on the non-primary parameterisation Elephant-Keccak-f [200] so falls outside our scope.
Campos et al.
[CJL + 20] present a limited study of LWC algorithms, with the goal of assessing the impact of selecting assembly language versus C for their implementation. Per [CJL + 20, Section 2], their work is based on use of an E31 (supporting RV32IMAC) or VexRiscv (supporting RV32IM) core; we ignore use of the riscvOVPsim simulator, because, as they explain, it may not produce representative results. Modulo the different core, can be compared with the base ISA and extended ISA columns of Table 3. For Ascon, [CJL + 20, Table 7] suggests a measured 750 cycle execution latency for the 6-round Ascon-p permutation; per Table 3, use of Zbkb/x means our implementation takes 700 cycles, or 280 with an LWC-specific ISE. For Sparkle, [CJL + 20, Section 3.2] suggests an approximated 1708 cycle execution latency for the Sparkle-384 permutation; per Table 3, use of Zbkb/x means our implementation takes 1647 cycles, or 525 with an LWC-specific ISE. For Xoodyak, [CJL + 20, Section 3.2] suggests an approximated 1596 cycle execution latency for the 12-round Xoodoo permutation; per Table 3, use of Zbkb/x means our implementation takes 873 cycles, or 777 with an LWC-specific ISE.
Renner et al. [RPM20] present a hardware-in-the-loop benchmarking framework for the LWC process; since their focus is the framework, they use the source code submitted for a given algorithm. Their work is based on use of a Kendryte K210 core. Modulo the different core, their results can be compared with the unextended ISA column of Table 4, Table 5, and Table 6.
Resilience against implementation attack. For constrained platforms of relevance to the LWC selection process, countermeasures against implementation attack are often classified as being based on hiding [MOP07, Chapter 7] and/or masking [MOP07, Chapter 10]. Although we do not consider such countermeasures per se, some discussion of how our ISEs interact with them may still be useful:
• The principle of constant-time implementation (i.e., that which exhibits data-independent execution latency) is important; delivering it acts as a hiding-based countermeasure against certain forms of attack, and is generally easier for ISE-supported than software-only implementations. We note that all our replacement kernel implementations are constant-time, in certain cases representing an improvement to the base implementation considered.
• Other hiding countermeasures instrumented at the ISA level, e.g., temporal skewing or shuffling, typically apply to ISE-supported implementations much like software-only implementations. That said, however, one can debate whether they are as effective. For example, an ISE-supported implementation will typically comprise fewer instructions, meaning less Instruction Level Parallelism (ILP) to harness through shuffling, and lower diversification. In turn, this acts to limit the security improvement possible.
• The situation for masking-based countermeasures is more involved. For a linear operation, our ISEs can be used on a share-wise basis. For a non-linear operation, this is not possible: one would need to redefine the ISE to accept masked inputs and outputs, and augment the associated FU so it is mask-aware. We note that our adherence to 3-address instruction formats means [GGM+21] would be one way to accommodate this for n = 2 shares, whereas [MP21] would be another way to do so more generally, i.e., for n > 2 shares.
• It is important to note that ISAP is a somewhat special case, in the sense it delivers inherent mitigation for selected side-channel and fault attacks; since this is achieved at the mode level and our ISE applies at the kernel level (i.e., the Ascon-p permutation), we do not expect any negative interaction between said ISE and any security argument for ISAP. That said, it is important to keep this functionality in mind when interpreting performance results. Although inefficient in relative terms, ISAP includes by-design mitigation that other candidates would have to deliver via post-design means: the resulting overhead is costed into ISAP already, complicating any direct comparison.
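The distinction drawn above between linear and non-linear operations under masking can be sketched in C for n = 2 Boolean shares. This is a generic first-order illustration under stated assumptions, not the mechanism of [GGM+21] or [MP21] and not part of our ISEs: the names shared32, shared_ror, shared_xor, and shared_and are hypothetical, the caller is assumed to supply fresh randomness r, and a compiler is free to reorder the operations, so the code should not be read as a secure implementation.

```c
#include <stdint.h>

/* A 32-bit value x held as n = 2 Boolean shares, with x = s0 ^ s1.
 * (Hypothetical type name, for illustration only.) */
typedef struct { uint32_t s0, s1; } shared32;

static uint32_t ror32(uint32_t x, unsigned n) {   /* requires 0 < n < 32 */
    return (x >> n) | (x << (32 - n));
}

/* Split x into two shares using the random mask r, and recombine. */
static shared32 mask(uint32_t x, uint32_t r) {
    shared32 s = { x ^ r, r };
    return s;
}
static uint32_t unmask(shared32 s) {
    return s.s0 ^ s.s1;
}

/* Linear operations commute with XOR, so each share is processed
 * independently: the same unmasked operation is applied share-wise. */
static shared32 shared_ror(shared32 a, unsigned n) {
    shared32 z = { ror32(a.s0, n), ror32(a.s1, n) };
    return z;
}
static shared32 shared_xor(shared32 a, shared32 b) {
    shared32 z = { a.s0 ^ b.s0, a.s1 ^ b.s1 };
    return z;
}

/* AND is non-linear: (a0 ^ a1) & (b0 ^ b1) contains the cross terms
 * a0 & b1 and a1 & b0, so it cannot be computed share-wise. This variant
 * folds the cross terms into one output share, blinded by fresh
 * randomness r supplied by the caller. */
static shared32 shared_and(shared32 a, shared32 b, uint32_t r) {
    shared32 z;
    z.s0 = (a.s0 & b.s0) ^ r;
    z.s1 = (a.s1 & b.s1) ^ ((r ^ (a.s0 & b.s1)) ^ (a.s1 & b.s0));
    return z;
}
```

Here shared_ror and shared_xor take and return shares using the same operation as in the unmasked case, which is why an unmodified ISE suffices for linear layers; shared_and needs both shares of both operands plus randomness, which is the kind of mask-aware behaviour an unmodified FU does not provide.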

Conclusion
Summary. ISEs to support standard cryptographic algorithms, e.g., AES, have now been included in almost every major ISA. Anticipating the LWC process will yield an outcome that warrants similar support, this paper investigated ISEs for each of the 10 LWC final round submissions. Through careful analysis of the constituent algorithms, and following a set of principled constraints (e.g., alignment with the wider RISC-V design principles, such as use of 3-address instructions), we first developed ISE designs for Ascon, Elephant, GIFT-COFB, Grain-128AEADv2, ISAP, PHOTON-Beetle, Romulus, Sparkle, TinyJAMBU, and Xoodyak, then implemented said designs using the RISC-V compliant Rocket core.
Broadly speaking, comparison with software-only alternatives shows that 1) the overhead of the ISEs in hardware is low, 2) the ISEs allow a reduction in execution latency, the degree of which is algorithm-dependent but significant in some cases, and, at the same time, 3) the ISEs allow constant-time execution, and a reduction in instruction footprint. Put together, these features highlight the value of ISEs within the context of resource-constrained devices and therefore the LWC process.
Observations. Based on our work, several high-level observations seem important to stress. First, and particularly when carefully paired with implementation techniques such as fix-slicing, our results demonstrate software-only implementations using Zbkb/x can be significantly more efficient than using RV32GC alone. This fact paints Zbkb/x (and so also Zbb) in a positive light with respect to general-purpose support: implementations and benchmarking for RISC-V which do not consider Zbkb/x (or Zbb) disadvantage it versus, e.g., ARM. Second, our results highlight a difference in relative improvement between algorithms that are more hardware-oriented versus more software-oriented. Put simply, ISEs for the former (e.g., Elephant, PHOTON-Beetle, Romulus) typically offer a greater improvement than for the latter (e.g., Ascon, Sparkle, Xoodyak): although the most efficient software-only implementations remain so when ISE support is considered, the difference between most and least efficient algorithms is significantly smaller. Stemming from the hybrid nature of ISE-supported software, this fact could be read as complicating the classification of hardware- versus software-oriented algorithms; either way, it highlights the need to consider use of ISEs as part of their evaluation. Third, our results act as evidence that ISEs which target an implementation technique (e.g., fix-slicing) are typically more general-purpose but less efficient, whereas ISEs which target an algorithm are typically less general-purpose but more efficient. Although a somewhat obvious statement, this suggests that once an outcome from the LWC process is known, the latter approach is more sensible in the longer term.