Rainbow on Cortex-M4

Abstract. We present the first Cortex-M4 implementation of the NISTPQC signature finalist Rainbow. We target the Giant Gecko EFM32GG11B, which comes with 512 kB of RAM and can easily accommodate the keys of RainbowI. We present fast constant-time bitsliced F16 multiplication allowing multiplication of 32 field elements in 32 clock cycles. Additionally, we introduce a new way of computing the public map P in the verification procedure, allowing vastly faster signature verification. Both the signing and verification procedures of our implementation are by far the fastest among the NISTPQC signature finalists. Signing of rainbowIclassic requires roughly 957 000 clock cycles, which is 4× faster than the state-of-the-art Dilithium2 implementation and 45× faster than Falcon-512. Verification needs about 239 000 cycles, which is 5× and 2× faster, respectively. The cost of signing can be further decreased by 20% by storing the secret key in a bitsliced representation.


Introduction
The advance of large-scale quantum computers is threatening all conventional public-key cryptography currently deployed due to Shor's algorithm [Sho94]. Hence, researchers are looking into quantum-safe replacements for existing protocols. In 2016, the American National Institute of Standards and Technology (NIST) [NIS] called for proposals to replace their existing standards for digital signatures, public-key encryption (PKE), and key-encapsulation mechanisms (KEM). In 2020, the third and final round of the standardization process (NISTPQC) started with 7 remaining finalists and 8 alternate candidates. Out of these remaining schemes, 6 are digital signature schemes (3 finalists and 3 alternate candidates). They can be grouped into three major families, each of which has its own advantages and disadvantages. Rainbow has a reputation for extremely fast verification (and signing), and comes with very small signatures. However, while implementations of both hash-based signatures and lattice-based signatures have received broad attention from the community, there appears to be only very little work on implementations of MQ-based schemes, even though the aforementioned characteristics of Rainbow make it particularly suitable for root certificates, for cases where the key can be built into the application, or for any situation not calling for frequent downloading or updating of keys.

Preliminaries
We introduce the Rainbow signature scheme in Section 2.1 and describe useful features of the Cortex-M4 for Rainbow in Section 2.3.

Recap of Multivariate Signatures
A Multivariate Quadratic Public Key Cryptosystem works over a field K = F_q, called the "base field"; for RainbowI this is F16. It has a public map P = T ∘ Q ∘ S : K^n → K^m, where T and S are typically affine but are here (for Rainbow) linear. So, S : w ↦ x = M_S w and T : y ↦ z = M_T y. The map Q : x ↦ y, called the central map, must be quadratic and easily invertible. The various MPKCs are characterized by the construction of their Q's; obviously it must be hard to decompose P : w ↦ z into its component maps. Usually n > m, and we have a digital signature.

Summary of Rainbow
Rainbow was proposed by Ding and Schmidt in 2004 [DS05], with a multi-stage Unbalanced Oil and Vinegar (UOV) structure. Since 2008 it has always appeared with exactly two stages [DYC + 08], and this is what we describe below.
• There are two "segments" of central maps, in each of which we designate "oil" and "vinegar" variables. In the first segment the vinegar variables are the x_i for i ∈ V1 = {1, ..., v1} and the oil variables are the x_i for i ∈ O1 = {v1 + 1, ..., v2 := v1 + o1}.
• The central map Q has m = o1 + o2 structured quadratic equations y = (y_{v1+1}, ..., y_n) = (q_{v1+1}(x), ..., q_n(x)) (notice the unusual indexing).
• Note that in every q_k, where k ∈ O1, there is no cross-term x_i x_j where both i and j are in O1. So given all the y_i in the first stage with v1 < i ≤ v2, and all the vinegar variables x_j with j ≤ v1, we can easily compute the corresponding oil variables x_{v1+1}, ..., x_{v2} by solving a linear system.
So given all the y_i in the second stage with v2 < i ≤ n, and all the vinegar variables x_j with j ≤ v2, we can easily compute x_{v2+1}, ..., x_n by solving a linear system. The concrete parameters are given in Table 1.
Previously, against a Rainbow cryptosystem with m equations and n variables, the most pertinent attacks were substituting n − m variables at random and trying to solve for the remaining m variables ("Direct Attack"), and a structural attack which involves solving an associated quadratic system with n variables and n + m − 1 equations ("Rainbow Band Separation") [DYC + 08]. Recently, Beullens posted the new "Intersection" and "Rectangular MinRank" attacks against Rainbow [Beu20]. The Rainbow team acknowledged these attacks, emphasizing that Round-3 Rainbow still meets its planned security levels [DCK + 20b].

Computational Costs of Rainbow Signing
The signer, as above, calculates the hash digest z of the message and inverts P with the secret key T, S, and Q, computing the signature w. Inverting the central map Q is clearly slower than inverting S and T. While inverting Q for a given y, the signer randomly guesses vinegar variables x̂ = (x_1, ..., x_{v1}) and solves for (x_{v1+1}, ..., x_{v2}) via the first-stage equations (1), evaluated as quadratic forms in x̂. This yields a linear system whose matrix, obtained from evaluating the secret quadratic equations at the secret values x̂, we call matVO(x̂). If matVO(x̂) is a singular matrix, the initial guesses are discarded and the process is restarted. The signer then repeats this procedure to solve for the x_i with i ∈ O2, that is, the variables x_{v2+1}, ..., x_n, as we now have values of x_i for i ∈ V1 ∪ O1 = V2, using also the values y_{v2+1}, ..., y_n.
Clearly, the main computational cost of signing is solving linear equations and computing the matrices matVO(x̂) from the vinegar variables x̂, twice.
Note that randomness (vinegar variables and salt) is generated using AES in counter mode according to the specification, with every sampled byte providing two random F16 elements.

Variations on the Basic Rainbow
In NIST rounds 2 and 3, Rainbow's authors included circumzenithal and compressed variants, expanding most of the public key using AES counter mode from a seed, and storing only the parts of the keys not producible in this way. The private key can be derived from the private matrices S and T and this public key and stored separately. This method, first appearing in [PBB10], reverses the normal procedure of deriving the public key from the private key during key generation. Note that a circumzenithal arc or rainbow is a meteorological phenomenon resembling an inverted rainbow. In the compressed variant, the entire private key is additionally generated on the fly from the public key seed and the private seed for S and T. These variations obviously trade smaller key sizes for the time spent recomputing keys.

Cortex-M4
The Cortex-M4 is NIST's primary microcontroller optimization target for the post-quantum competition. The Cortex-M4 is a 32-bit processor that implements the ARMv7E-M instruction set which comes with a number of powerful instructions. For example, the DSP instructions [KRS19, BMKV20, BFM + 18] as well as the single-cycle long multiplication instructions [GKS20, CHK + 21, SJA19] proved to be very beneficial for implementing post-quantum cryptography.
However, for implementing Rainbow, we mostly rely on instructions that are also present in the ARMv7-M instruction set (a subset of ARMv7E-M) which is, for example, implemented by the Cortex-M3 microarchitecture. Cortex-M3 cores, however, usually come with considerably less RAM, which makes them arguably less suitable for Rainbow implementations.
The following features of ARMv7-M are particularly useful for implementing Rainbow: Conditional execution. The feature benefiting Rainbow the most is conditional execution.
Using the it instruction, one can execute up to four instructions conditionally on a flag value. For example,

    ite EQ
    addeq r0, r1
    addne r0, r2

either adds r1 or r2 to r0 depending on the Z flag (equal) being set or not.
Note that the ARMv7-M manual [ARM18, Section A4.1.2] states that "If the flags do not satisfy this condition, the instruction acts as a NOP, that is, execution advances to the next instruction as normal, including any relevant checks for exceptions being taken, but has no other effect." Hence, it is safe to use single-cycle instructions with secret-dependent conditions in constant-time code, as the run-time will be one cycle irrespective of the condition flags. For future ARM architectures it needs to be carefully evaluated whether this is still the case.

An it block can consist of up to four instructions, of which the first must be the then branch and the following can be either then or else. The it instruction encodes which instructions of an it block belong to which branch, e.g., itttt, ittee, and itete. The conditions that can be used are the same as those for branch instructions (eq, ne, cs, cc, mi, pl, vs, vc, hi, ls, ge, lt, gt, le), and the flags can be set using arithmetic instructions (e.g., adds, subs) or explicit comparison instructions (e.g., cmp, tst). The conditions within an it block must be the same for all instructions (or the opposite for the else branch). it* instructions take 1 cycle each on the M4 (unless the it is the second of a pair of 2-byte instructions, which does not happen in our implementations).
Barrel shifting. Standard data-processing instructions (e.g., add, eor, and) allow a flexible second operand, i.e., the second argument can be shifted or rotated without changing the latency (1 cycle) of the instruction. For example, add r0, r1, r2, LSL#2 will shift r2 left by two bit positions, add it to r1, and store the result in r0. Similarly, other shifts and rotations can be used (lsr, asr, ror, rrx).
Special immediates. Standard data-processing instructions can also be used with a constant as a second operand. mov is limited to 16-bit immediates 0x0000XYZW, while immediates for other instructions are limited to an 8-bit value 0xXY shifted by some amount, or the special patterns 0x00XY00XY, 0xXY00XY00, and 0xXYXYXYXY.

Implementation Building Blocks
This section introduces the novel implementation approaches that can be used to speed up Rainbow implementations. Section 3.1 introduces fast bitsliced F 16 multiplication which is useful throughout all aspects of Rainbow. We can speed up the multiplication further by switching to a direct F 16 representation which is described in Section 4. Section 3.2 shows how we can adapt constant-time F 16 matrix inversion to benefit from the fast bitsliced multiplication. This speeds up the signing procedure of Rainbow and can also be adapted for F 256 parameter sets. Section 3.3 presents a novel approach for evaluating the public map P which is the core operation of Rainbow verification. We exploit that verification can run in variable time depending on both the public key and the signature. This also works for other parameter sets of Rainbow.

F 16 multiplication
The core operation within Rainbow is arithmetic in a finite field. As mentioned before, for rainbowI parameter sets this field is F16 (for the higher levels it is F256). The F16 representation used within Rainbow is the tower-field representation F16 = F4[y]/(y^2 + y + x) with F4 = F2[x]/(x^2 + x + 1). Hence, an element is represented by four bits e_i with e = (e3 · x + e2) · y + e1 · x + e0. These bits are packed into a nibble with e0 at the least significant bit position. Two elements are packed into a byte with the least significant nibble in the lower half of the byte.
One approach to multiplying two F16 elements is Karatsuba multiplication [KO63], which is, for example, used in the reference implementation of Rainbow. It allows implementing an F16 multiplication using three F4 multiplications.
Bitslicing. As two F16 elements fit into one byte, we can fit eight F16 elements into one 32-bit register. However, we can achieve significantly faster F16 multiplication routines that run in constant time when we bitslice the field elements into 4 separate registers holding a total of 32 elements. To make use of fast bitsliced multiplication, we need a way of converting a packed nibble representation of F16 elements into a bitsliced representation. A straightforward approach would load each field element individually, mask out the desired bit, and pack it into the corresponding registers in the same order as the inputs. However, it is much more efficient to load 32 elements at once into four registers and to reorganize the elements in an interleaved fashion as illustrated in Figure 1. Each row corresponds to a register containing 8 field elements. The colors denote the bit within the field element: light gray is the least significant bit, while dark gray is the most significant bit. This approach is similar to the one proposed by Chou for McBits [Cho17]. This interleaving can be implemented efficiently in 28 cycles as shown in Appendix B. The same code can be used for the transformation from bitsliced representation back to normal representation; the correct order of the field elements is restored when reversing the bitslicing. Note that addition in F16 is bitwise XOR and, hence, behaves the same on the bitsliced representation.
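The transformation can be modeled in Python as follows. This is a plain bit-by-bit reference version, not the 28-cycle interleaved routine of Appendix B; it only demonstrates the round-trip property (helper names are ours):

```python
def bitslice(elems):
    """Bit-by-bit reference bitslicing: word k collects bit k of all 32 elements."""
    assert len(elems) == 32
    words = [0, 0, 0, 0]
    for lane, e in enumerate(elems):
        for k in range(4):
            words[k] |= ((e >> k) & 1) << lane
    return words

def unbitslice(words):
    """Inverse transform: reassemble each element from its four bit planes."""
    return [sum(((words[k] >> lane) & 1) << k for k in range(4))
            for lane in range(32)]

elems = [(7 * i + 3) % 16 for i in range(32)]
assert unbitslice(bitslice(elems)) == elems   # round trip restores the order
```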
Bitsliced Multiplication. We first consider F 4 multiplication, then use it to construct F 16 multiplication, and then apply multiple simplifications to achieve a minimal instruction sequence. There are multiple approaches to arrive at the same instruction sequence, but we find this description the most intuitive to follow.
Writing two F4 elements as a = a1 · x + a0 and b = b1 · x + b0, it is easy to see that we can compute c = a · b = c1 · x + c0 by computing

    c0 = (a0 · b0) + (a1 · b1)
    c1 = (a0 · b1) + (a1 · b0) + (a1 · b1),

where · denotes logical AND and + denotes XOR. This can be very efficiently computed on bitsliced elements.
Writing two F16 elements as a = α1 · y + α0 and b = β1 · y + β0 with α_i, β_i ∈ F4, reduction modulo y^2 + y + x gives a · b = γ1 · y + γ0 with γ0 = α0 · β0 + x · (α1 · β1) and γ1 = α0 · β1 + α1 · β0 + α1 · β1. We can now consider γ0 and γ1 separately and substitute the F4 multiplication.
Hence, the least significant bits of the result can be computed as

    c0 = (a0 · b0) + (a1 · b1) + (a2 · b3) + (a3 · b2) + (a3 · b3)
    c1 = (a0 · b1) + (a1 · b0) + (a1 · b1) + (a2 · b2) + (a2 · b3) + (a3 · b2).

We proceed similarly for γ1, and hence

    c2 = (a0 · b2) + (a1 · b3) + (a2 · b0) + (a3 · b1) + (a2 · b2) + (a3 · b3)
    c3 = (a0 · b3) + (a1 · b2) + (a1 · b3) + (a2 · b1) + (a3 · b0) + (a3 · b1) + (a2 · b3) + (a3 · b2) + (a3 · b3).

Now that we have established how a · b is calculated, we need to come up with an instruction sequence that does so efficiently. Consider the most common multiplication case within Rainbow: we have a large number (≥ 32) of field elements a^(i) which are multiplied by a single field element b and then added to a bitsliced accumulator c^(i). This is, for example, the case in matrix-vector multiplication. In this case, it is best to bitslice the a^(i) and keep b in nibble-sliced representation. For the sake of explanation, we assume that we are multiplying exactly 32 elements a^(0), ..., a^(31) which are bitsliced into four registers. The register containing the least significant bits of a^(0), ..., a^(31) is denoted a0, and similarly for a1, ..., a3 and c0, ..., c3. b is stored in the least significant four bits of a register, with b0 denoting the least significant bit.
Algorithm 1 shows the instruction sequence that implements the computation of the product and accumulates it into c0, ..., c3. If only a multiplication is needed, but no accumulation, c0, ..., c3 first need to be initialized to zero. The instruction sequence heavily relies on conditional execution to only execute the additions of a_i if certain bits of b are set. We compute a0 + a1 and a2 + a3 in two separate registers tmp0, tmp1 as those are used in c1, c3 and c0, c1, c3 respectively. Also, we save another cycle by storing (b2 · a2) + (b3 · a3) in a temporary register tmp2 and (b2 · a3) + (b3 · (a2 + a3)) in tmp3, which are required to compute c1, c2 and c0, c1, c3 respectively. Another shortcut we use is line 16, which is functionally equivalent to computing mov tmp3, #0; tst b, #4 in a single cycle. In total, our instruction sequence requires 32 clock cycles, i.e., one clock cycle for each field multiplication.
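The conditional structure of this sequence can be modeled in Python as follows. The if statements stand in for the it blocks; the exact register allocation of the assembly is not reproduced, and the scalar tower-field multiplication serves as an independent reference (helper names are ours):

```python
def f4_mul(a, b):
    """Reference multiply in F4 = F2[x]/(x^2 + x + 1), elements as 2-bit ints."""
    a0, a1, b0, b1 = a & 1, a >> 1, b & 1, b >> 1
    return ((a0 & b0) ^ (a1 & b1)) | (((a0 & b1) ^ (a1 & b0) ^ (a1 & b1)) << 1)

def f16_mul(a, b):
    """Reference multiply in the tower field F16 = F4[y]/(y^2 + y + x)."""
    al, ah, bl, bh = a & 3, a >> 2, b & 3, b >> 2
    hh = f4_mul(ah, bh)
    lo = f4_mul(al, bl) ^ f4_mul(hh, 2)        # gamma0 = al*bl + x*(ah*bh)
    hi = f4_mul(al, bh) ^ f4_mul(ah, bl) ^ hh  # gamma1
    return lo | (hi << 2)

def mul_acc_bitsliced(c, a, b):
    """32-lane multiply-accumulate mirroring the structure of Algorithm 1:
    c_i and a_i are 32-bit words holding bit i of 32 field elements."""
    c0, c1, c2, c3 = c
    a0, a1, a2, a3 = a
    if b & 1:                        # b0: add a itself
        c0 ^= a0; c1 ^= a1; c2 ^= a2; c3 ^= a3
    t0, t1 = a0 ^ a1, a2 ^ a3        # tmp0, tmp1
    if b & 2:                        # b1: add a * x
        c0 ^= a1; c1 ^= t0; c2 ^= a3; c3 ^= t1
    t2 = t3 = 0                      # tmp2, tmp3
    if b & 4:                        # b2: partial products for a * y
        t2 ^= a2; t3 ^= a3; c2 ^= a0; c3 ^= a1
    if b & 8:                        # b3: partial products for a * x * y
        c2 ^= a1; c3 ^= t0; t2 ^= a3; t3 ^= t1
    # final accumulation of the shared partial products
    return (c0 ^ t3, c1 ^ t2 ^ t3, c2 ^ t2, c3 ^ t3)
```

An exhaustive one-lane check against the reference multiplication confirms the bit-level formulas above.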
This approach is directly extensible to parameter sets using F 256 (RainbowIII and RainbowV).

F 16 Matrix Inversion
Besides F16 multiplication, the Rainbow signature requires solving two matrix equations. Since, if A^−1 exists, Ax = b ⇔ x = A^−1 b, we may without much loss of generality consider matrix inversion as a part of the signing procedure. As it operates on secret inputs, it is required to be constant-time, which is not the case for a straightforward implementation of Gaussian elimination. We use an adapted version of the constant-time Gauss-Jordan elimination first presented by Bernstein, Chou, and Schwabe [BCS13]. Rainbow's constant-time variant is illustrated in Algorithm 2 and is essentially the same as in the Rainbow reference implementation. However, for an implementation we need to choose how to implement the field arithmetic.
Algorithm 1 F16 Multiply-and-Accumulate Instruction Sequence
Input: 32 F16 elements bitsliced into a0, a1, a2, a3
Input: 1 F16 element in the least significant nibble of b
Input: 32 F16 elements bitsliced in the accumulator c0, c1, c2, c3
Output: Each of the elements in a_i multiplied by b and added to c_i
 ...                               (b0 branch)      c3 += b0 · a3
 7: eor tmp0, a0, a1               tmp0 = a0 + a1
 8: eor tmp1, a2, a3               tmp1 = a2 + a3
 9: tst b, #2
10: itttt ne                       conditional exec. if b&2 ≠ 0
11: eorne c0, c0, a1               c0 += b1 · a1
12: eorne c1, c1, tmp0             c1 += b1 · (a0 + a1)
 ...                               c3 += b1 · (a2 + a3)
15: mov tmp2, #0
16: ands tmp3, tmp2, b, lsr #3     set tmp3 = 0; set carry flag if b&4 ≠ 0
17: itttt cs                       conditional exec. if b&4 ≠ 0
18: eorcs tmp2, tmp2, a2           tmp2 = b2 · a2
19: eorcs tmp3, tmp3, a3           tmp3 = b2 · a3
20: eorcs c2, c2, a0               c2 += b2 · a0
21: eorcs c3, c3, a1               c3 += b2 · a1
22: tst b, #8
23: itttt ne                       conditional exec. if b&8 ≠ 0
24: eorne c2, c2, a1               c2 += b3 · a1
25: eorne c3, c3, tmp0             c3 += b3 · (a0 + a1)
26: eorne tmp2, tmp2, a3           tmp2 = b2 · a2 + b3 · a3
27: eorne tmp3, tmp3, tmp1         tmp3 = b2 · a3 + b3 · (a2 + a3)
 ...                               (accumulation of tmp2, tmp3 into c0, ..., c3)

Algorithm 2 Matrix inversion using constant-time Gaussian elimination (for us F = F16)

Field inversion. For F16 inversion (line 9) the most efficient implementation uses a constant-time table look-up. As the number of possible input values is small (16), we can pack the look-up table (16 · 4 bit) into the 16-bit immediate arguments of four mov instructions and then select the right bits by shifting them into place. The code for the F16 representation used in Rainbow is shown in Algorithm 3. For larger fields (e.g., F256) this approach does not work, and one would rather store a table in flash memory, loop through it, and conditionally select the right element. For a = 0, the inverse does not exist and special treatment is needed, i.e., the entire matrix inversion fails and fail = 1. In that case, the matrix is discarded and one samples a new set of vinegar variables.
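To illustrate, the inverse table itself can be derived by brute force from the tower-field multiplication. The following Python sketch mirrors the idea of packing the table into four 16-bit immediates; the packing order is our assumption, and it does not reproduce the exact instructions of Algorithm 3 (helper names are ours):

```python
def f4_mul(a, b):
    # F4 = F2[x]/(x^2 + x + 1), elements as 2-bit integers
    a0, a1, b0, b1 = a & 1, a >> 1, b & 1, b >> 1
    return ((a0 & b0) ^ (a1 & b1)) | (((a0 & b1) ^ (a1 & b0) ^ (a1 & b1)) << 1)

def f16_mul(a, b):
    # tower field F16 = F4[y]/(y^2 + y + x)
    al, ah, bl, bh = a & 3, a >> 2, b & 3, b >> 2
    hh = f4_mul(ah, bh)
    return (f4_mul(al, bl) ^ f4_mul(hh, 2)) | \
           ((f4_mul(al, bh) ^ f4_mul(ah, bl) ^ hh) << 2)

# Brute-force inverse table (INV[0] = 0 marks the "no inverse" case).
INV = [0] + [next(b for b in range(1, 16) if f16_mul(a, b) == 1)
             for a in range(1, 16)]

# Pack four inverses into each of four 16-bit immediates (order assumed).
imms = [sum(INV[4 * w + j] << (4 * j) for j in range(4)) for w in range(4)]

def f16_inv(a):
    """Select the right nibble from the packed immediates."""
    return (imms[a >> 2] >> (4 * (a & 3))) & 0xF
```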
Note that a field element at index i can be efficiently retrieved from a packed matrix representation (starting at address a) using an instruction sequence beginning with lsrs i, i, #1, which halves the index while shifting the nibble-selection bit into the carry flag.

Field multiplication. The optimal choice for implementing F16 multiplication is less obvious. To achieve the fastest multiplication one would want to keep the entire extended matrix A in bitsliced representation. However, when making sure that the pivot element is not zero in lines 4 to 7 and when inverting the pivot element in line 9, one needs to access individual field elements, which is tedious and inefficient when working on a bitsliced matrix. Hence, it is faster to keep the matrix in normal (packed-nibble) representation, only performing the bitslicing ad hoc just before multiplying and converting back just after. It is notable that individual element accesses only occur in the left half of the matrix. Hence, we can bitslice the right half and keep it bitsliced throughout the computation. This is illustrated in Figure 2. As the output of the matrix inversion is always the input to a matrix multiplication, it is possible to return the bitsliced inverse. An additional speed-up is achieved by letting the inner loops in line 6 and line 14 always start at k = 0. This does not change the result, but greatly simplifies the loop control and the overhead of accessing the packed elements. Overall, this results in a small speed-up even though the number of additions and multiplications is slightly increased.
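The data flow of the constant-time Gauss-Jordan elimination can be sketched in Python. Python itself is of course not constant-time; the point is that the operation pattern does not depend on secret values, only the masks do (helper names are ours, and the exact loop bounds of Algorithm 2 are not reproduced):

```python
def f4_mul(a, b):
    a0, a1, b0, b1 = a & 1, a >> 1, b & 1, b >> 1
    return ((a0 & b0) ^ (a1 & b1)) | (((a0 & b1) ^ (a1 & b0) ^ (a1 & b1)) << 1)

def f16_mul(a, b):
    al, ah, bl, bh = a & 3, a >> 2, b & 3, b >> 2
    hh = f4_mul(ah, bh)
    return (f4_mul(al, bl) ^ f4_mul(hh, 2)) | \
           ((f4_mul(al, bh) ^ f4_mul(ah, bl) ^ hh) << 2)

INV = [0] + [next(b for b in range(1, 16) if f16_mul(a, b) == 1)
             for a in range(1, 16)]

def ct_gauss_jordan(A, B):
    """Gauss-Jordan on the extended matrix [A | B] over F16 with a
    data-independent operation pattern; returns (fail, X) with A * X = B
    when fail == 0."""
    n = len(A)
    M = [list(ra) + list(rb) for ra, rb in zip(A, B)]
    fail = 0
    for i in range(n):
        for j in range(i + 1, n):            # conditionally add lower rows
            mask = 1 if M[i][i] == 0 else 0  # a constant-time mask in assembly
            for k in range(2 * n):
                M[i][k] ^= mask * M[j][k]
        fail |= 1 if M[i][i] == 0 else 0
        piv = INV[M[i][i]]                   # 0 if the matrix is singular
        M[i] = [f16_mul(piv, v) for v in M[i]]
        for j in range(n):                   # eliminate column i everywhere else
            if j != i:
                coef = M[j][i]
                for k in range(2 * n):
                    M[j][k] ^= f16_mul(coef, M[i][k])
    return fail, [row[n:] for row in M]
```

With B the identity matrix, X is the inverse of A; a singular A sets fail = 1, matching the resampling of vinegar variables described above.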

Avoiding matrix inversion
As the inverse of the matrix is multiplied by the variables y directly after inversion and is not used at any other point in the Rainbow signature generation, one can also eliminate the matrix inversion and simply solve for x in Ax = y. The Gaussian elimination proceeds similarly to Algorithm 2, but one cannot benefit from bitslicing the right part of the matrix. This approach is 33 000 cycles faster than inverting the matrix first and then multiplying. Unfortunately, according to the Rainbow specification the vinegar variables and the matrix are sampled from the same PRG. In the first layer, a new matrix is sampled until it is invertible before the vinegar variables of the second layer are sampled. If one wanted to merge these steps, one would either have to change the way the matrix and the variables are sampled, or roll back the PRG before sampling another matrix in case the matrix is not invertible. Therefore, we only use this approach to eliminate the inversion in the second layer of Rainbow.

Evaluating the Public Map P
One of the key advantages of Rainbow is a very simple verification procedure: one applies the public map P to the signature z and verifies that the result matches the (randomized) hash of the message. The application of P consists of the substitution of the variables z_1, ..., z_n into the system of equations represented by the public key. The public key is stored as a Macaulay matrix A ∈ F^((n 2))×m, which allows us to load it sequentially exactly once while processing the variables.

Algorithm 4 Traditional way of computing the public map P
Input: Public key A ∈ F^((n 2))×m in Macaulay form
Input: Variables z ∈ F^n
Output: P(z) ∈ F^m
1: h ∈ F^m ← 0
2: for i ← 0, ..., n − 1 do
3: ...

Algorithm 5 Our way of computing the public map P in variable time
Input: Public key A ∈ F^((n 2))×m in Macaulay form
Input: Variables z ∈ F^n
Output: P(z) ∈ F^m
1: h ∈ F^(|F|×m) ← 0
2: for i ← 0, ..., n − 1 do
3: ...

Macaulay matrix indexing. Here, by writing the index set as ((n 2)) × m we mean that the indices of A_{i,j,k} satisfy 0 ≤ i ≤ j < n, 0 ≤ k < m.
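The constraints above fix the index set but not the linearization order. One common row-major choice, shown here only for illustration (the specification's actual ordering may differ), is:

```python
# One possible row-major linearization of the index pairs 0 <= i <= j < n.
def macaulay_row(i, j, n):
    assert 0 <= i <= j < n
    # rows for pairs (0,*), ..., (i-1,*) come first, then the offset in row i
    return i * n - i * (i - 1) // 2 + (j - i)

n = 4
rows = [macaulay_row(i, j, n) for i in range(n) for j in range(i, n)]
assert rows == list(range(n * (n + 1) // 2))  # a bijection onto 0..((n 2))-1
```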
The standard procedure to compute P (which can run in constant time) is illustrated in Algorithm 4 and requires ((n 2)) · m + ((n 2)) field multiplications. The documentation of the UOV-derived NIST submissions [DCK + 20a, BPSV19, SPK17] each describes some variation of this. However, we propose a different and much more efficient way to compute the public map, requiring only (|F| − 2) · m + ((n 2)) multiplications. This method is not mentioned in previous documents describing UOV-based MQ systems. Our modified procedure for computing P is illustrated in Algorithm 5. One key observation is that we do not need the verification to have a runtime that is independent of the inputs, as both the signature and the public key are considered public. Therefore, we propose to use one accumulator (of m field elements) for each possible value of the monomial z_i · z_j. The corresponding column of the matrix A is then added to the accumulator corresponding to the value of z_i · z_j. This obviously may leak the value of z_i · z_j through a cache-timing side channel, but that does not need to concern us. The computation of the monomials within the loop costs ((n 2)) multiplications. In the very end, we combine the accumulators by multiplying each of them with the corresponding F16 element, requiring (|F| − 2) · m multiplications as multiplications by 0 and 1 are trivial. This allows a massive speed-up at the cost of additional memory large enough to hold |F| · m field elements (or (|F| − 1) · m if one omits the buffer for z_i · z_j = 0). In the case of rainbowI, the additional memory of 16 · 64/2 = 512 bytes is negligible. For the larger parameter sets using F256 this approach is probably still worthwhile on some platforms. For rainbowIII (m = 80) and rainbowV (m = 100), 256 · 80 = 20 480 bytes and 256 · 100 = 25 600 bytes are required respectively.
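The contrast between the two procedures can be modeled in Python (a reference-style sketch with our own helper names, using the tower-field scalar multiplication; not the Cortex-M4 code):

```python
import random

def f4_mul(a, b):
    a0, a1, b0, b1 = a & 1, a >> 1, b & 1, b >> 1
    return ((a0 & b0) ^ (a1 & b1)) | (((a0 & b1) ^ (a1 & b0) ^ (a1 & b1)) << 1)

def f16_mul(a, b):
    al, ah, bl, bh = a & 3, a >> 2, b & 3, b >> 2
    hh = f4_mul(ah, bh)
    return (f4_mul(al, bl) ^ f4_mul(hh, 2)) | \
           ((f4_mul(al, bh) ^ f4_mul(ah, bl) ^ hh) << 2)

def eval_traditional(A, z, m):
    """Algorithm 4 style: one multiplication per matrix entry."""
    h = [0] * m
    row = 0
    for i in range(len(z)):
        for j in range(i, len(z)):
            mono = f16_mul(z[i], z[j])
            for k in range(m):
                h[k] ^= f16_mul(mono, A[row][k])
            row += 1
    return h

def eval_accumulated(A, z, m):
    """Algorithm 5 style: one accumulator per field value; multiplications are
    deferred to a single final combination of (|F| - 2) * m products."""
    acc = [[0] * m for _ in range(16)]      # acc[0] is never combined
    row = 0
    for i in range(len(z)):
        for j in range(i, len(z)):
            mono = f16_mul(z[i], z[j])
            for k in range(m):
                acc[mono][k] ^= A[row][k]   # additions only
            row += 1
    h = list(acc[1])                        # multiplication by 1 is free
    for v in range(2, 16):
        for k in range(m):
            h[k] ^= f16_mul(v, acc[v][k])
    return h

random.seed(1)
n, m = 6, 4
A = [[random.randrange(16) for _ in range(m)] for _ in range(n * (n + 1) // 2)]
z = [random.randrange(16) for _ in range(n)]
assert eval_traditional(A, z, m) == eval_accumulated(A, z, m)
```

The two procedures agree because field multiplication is linear over F2 in each argument, so the per-value sums can be multiplied once at the end.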
One could further reduce the number of multiplications to (log2(|F16|) − 1) · m = 3 · m by instead doing more additions. First, we sum up the accumulators corresponding to the elements that have the least significant bit set, i.e., 1, x + 1, y + 1, y + x + 1, yx + 1, yx + x + 1, yx + y + 1, yx + y + x + 1. Then, we sum up the accumulators corresponding to the elements that have the second bit set (x, x + 1, y + x, y + x + 1, yx + x, yx + x + 1, yx + y + x, yx + y + x + 1), multiply the sum by x, and add it to the first sum. Similarly, we proceed for the other two bits, corresponding to y and yx. That approach is then similar to the one by Cheng, Chou, Niederhagen, and Yang [CCNY12, Sec. 3.1]. However, we chose not to implement this trick as the performance gain is negligible and the final multiplications already take less than 1% of our total run-time.
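This bit-decomposition combination can be sketched as follows (a model with our own helper names; it only demonstrates that the 3 · m-multiplication combination matches a direct weighted sum):

```python
def f4_mul(a, b):
    a0, a1, b0, b1 = a & 1, a >> 1, b & 1, b >> 1
    return ((a0 & b0) ^ (a1 & b1)) | (((a0 & b1) ^ (a1 & b0) ^ (a1 & b1)) << 1)

def f16_mul(a, b):
    al, ah, bl, bh = a & 3, a >> 2, b & 3, b >> 2
    hh = f4_mul(ah, bh)
    return (f4_mul(al, bl) ^ f4_mul(hh, 2)) | \
           ((f4_mul(al, bh) ^ f4_mul(ah, bl) ^ hh) << 2)

def combine_by_bits(acc, m):
    """Combine 16 accumulators with 3 * m multiplications: decompose each field
    value over the F2-basis {1, x, y, yx} = {1, 2, 4, 8} and factor out the
    basis element from the accumulators sharing that bit."""
    h = [0] * m
    for t, basis in enumerate([1, 2, 4, 8]):
        s = [0] * m
        for v in range(16):
            if (v >> t) & 1:                 # accumulators with bit t set
                for k in range(m):
                    s[k] ^= acc[v][k]
        for k in range(m):
            h[k] ^= s[k] if basis == 1 else f16_mul(basis, s[k])
    return h

import random
random.seed(2)
m = 4
acc = [[random.randrange(16) for _ in range(m)] for _ in range(16)]
ref = [0] * m
for v in range(16):
    for k in range(m):
        ref[k] ^= f16_mul(v, acc[v][k])
assert combine_by_bits(acc, m) == ref
```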
Instead, as variable run-time is of no concern, we can further improve the procedure:
F16 multiplication using LUTs. As the signature z is public, we may use look-up tables to compute the F16 multiplications. This is particularly useful when individual field elements are to be multiplied when computing the monomials z_i · z_j, as those multiplications are tedious to bitslice. We replace those multiplications by a look-up into a 256-entry table.
Skipping parts of the public key. Whenever z_i · z_j = 0, the corresponding entries in A have no impact on the result h. This is the case when either z_i = 0 or z_j = 0. When z_j = 0, the inner loop can be skipped, saving load, addition, and store operations on m field elements. Even more importantly, when z_i = 0, both inner loops can be skipped, which saves (n − i) · m operations. The additional cost of branching depending on the variables is by far outweighed by the savings: processing one column takes 37 cycles (3 cycles for multiplication using a LUT, 18 cycles for loading the accumulator and the column, 8 cycles for the addition, and 8 cycles for the store). Checking for z_j = 0 in the inner loop costs two cycles (cmp, beq). As the computation is expected to be skipped in 1/16 of cases, implementing the check pays off slightly. For the outer loop, the speed-up is more pronounced as we skip n/2 = 50 columns on average. This saves more than 1850 cycles and is expected to happen for every 16th execution, i.e., saving significantly more than the 2 cycles needed for the check.
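The LUT-based monomials and the zero-skipping logic can be sketched on top of the accumulator approach (a Python model with our own helper names, not the Cortex-M4 code):

```python
import random

def f4_mul(a, b):
    a0, a1, b0, b1 = a & 1, a >> 1, b & 1, b >> 1
    return ((a0 & b0) ^ (a1 & b1)) | (((a0 & b1) ^ (a1 & b0) ^ (a1 & b1)) << 1)

def f16_mul(a, b):
    al, ah, bl, bh = a & 3, a >> 2, b & 3, b >> 2
    hh = f4_mul(ah, bh)
    return (f4_mul(al, bl) ^ f4_mul(hh, 2)) | \
           ((f4_mul(al, bh) ^ f4_mul(ah, bl) ^ hh) << 2)

MUL_LUT = [[f16_mul(a, b) for b in range(16)] for a in range(16)]  # 256 entries

def eval_skipping(A, z, m):
    """Accumulator evaluation that skips all work whenever z_i * z_j = 0."""
    acc = [[0] * m for _ in range(16)]
    row, n = 0, len(z)
    for i in range(n):
        if z[i] == 0:            # skip both inner loops: (n - i) columns
            row += n - i
            continue
        for j in range(i, n):
            if z[j] == 0:        # skip a single column
                row += 1
                continue
            mono = MUL_LUT[z[i]][z[j]]   # nonzero, since z_i, z_j != 0
            for k in range(m):
                acc[mono][k] ^= A[row][k]
            row += 1
    h = [0] * m
    for v in range(1, 16):
        for k in range(m):
            h[k] ^= MUL_LUT[v][acc[v][k]]
    return h

def eval_reference(A, z, m):
    """Plain evaluation without skipping, for comparison."""
    h = [0] * m
    row = 0
    for i in range(len(z)):
        for j in range(i, len(z)):
            mono = f16_mul(z[i], z[j])
            for k in range(m):
                h[k] ^= f16_mul(mono, A[row][k])
            row += 1
    return h

random.seed(3)
m = 4
z = [0, 3, 0, 7, 1, 5]           # zeros exercise both skip paths
A = [[random.randrange(16) for _ in range(m)]
     for _ in range(len(z) * (len(z) + 1) // 2)]
assert eval_skipping(A, z, m) == eval_reference(A, z, m)
```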

Alternative F 16 Representation
In addition to the F16 tower-field representation mandated by the Rainbow specification, we have also experimented with the direct representation F16 = F2[x]/(x^4 + x + 1). By switching to that representation one can implement bitsliced multiplication using the instruction sequence presented in Algorithm 6. This sequence needs only 27 cycles (one cycle per instruction) compared to 32 cycles for the multiplication in the tower-field representation.
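A lane-wise Python model of direct-representation bitsliced multiplication illustrates why this variant is simpler: each bit of b conditionally adds a, after which a is replaced by a · x with a cheap lane-wise shift and reduction. This is a sketch of the idea, not the 27-instruction sequence of Algorithm 6 (helper names are ours):

```python
def gf16_direct_mul(a, b):
    """Reference multiply in F16 = F2[x]/(x^4 + x + 1), direct representation."""
    r = 0
    for i in range(4):
        if (b >> i) & 1:
            r ^= a << i
    for bit in (7, 6, 5, 4):          # reduce modulo x^4 + x + 1 (0b10011)
        if r & (1 << bit):
            r ^= 0b10011 << (bit - 4)
    return r

def mul_acc_direct(c, a, b):
    """Bitsliced multiply-accumulate: conditionally add a, then set a = a * x
    (bits shift up; the overflow bit a3 folds into bits 0 and 1)."""
    c0, c1, c2, c3 = c
    a0, a1, a2, a3 = a
    for i in range(4):
        if (b >> i) & 1:
            c0 ^= a0; c1 ^= a1; c2 ^= a2; c3 ^= a3
        a0, a1, a2, a3 = a3, a0 ^ a3, a1, a2   # a *= x, reduced
    return (c0, c1, c2, c3)
```

An exhaustive one-lane check against the reference multiplication confirms the lane update.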
Unfortunately, Rainbow keys, signatures, and all values sampled in the signing procedure use the tower-field representation, and one would have to convert to and from the direct representation to make use of Algorithm 6. The conversion can only be done, while bitsliced, by multiplication by a 4 × 4 bit matrix. This conversion outweighs the performance gain from the faster multiplication.
Consequently, the only way to benefit from this more efficient representation is to change the Rainbow specification to use F2[x]/(x^4 + x + 1) everywhere. For a Cortex-M4 implementation, there is no benefit in using the tower-field representation, and a change of the specification would only make it faster. Clearly, the same is the case for other bitsliced implementations, which are likely to be used on other microcontroller platforms. For AVX2 implementations (e.g., the one from [DCK + 20a]), a change of the field representation does not have any impact on performance as field multiplication is implemented using constant-time table lookups. Hence, we argue that the Rainbow specification should be changed to use the direct representation for F16 and F256.

Algorithm 6 Bitsliced Multiply and Accumulate for F16 = F2[x]/(x^4 + x + 1)
Input: 32 F16 elements bitsliced into a0, a1, a2, a3
When changing the field representation, one also has to update the lookup tables for the inverse described in Section 3.2 and the variable-time multiplication described in Section 3.3. Besides that, all other parts of Rainbow remain the same.

Results
This section presents the results of applying the optimizations presented in this paper to the reference implementation that is part of the Rainbow submission package [DCK + 20a].
Platform. Due to Rainbow's large keys, we use the somewhat non-standard microcontroller EFM32GG11B 3 which is part of Silicon Labs' Giant Gecko Starter Kit. It comes with 512 kB of RAM and 2 MiB of flash memory. The core can run at a frequency of up to 72 MHz. It comes with a TRNG which we use to obtain the required randomness in Rainbow. Another feature of the EFM32GG11B that makes it an attractive target for post-quantum cryptography is that it comes with a cryptography accelerator supporting AES128, AES256, SHA-1, SHA256, and 256-bit multiplication. Section 5.2 presents how using the AES256 and SHA256 changes the performance of our implementations.
SHA2 and AES256. For hashing Rainbow uses SHA2. We use the SHA2 implementation from SUPERCOP 4 . Additionally, Rainbow uses AES256 extensively for expanding matrices from a random seed. We use the bitsliced implementation 5 by Adomnicai and Peyrin [AP20].
Benchmarking. We base our benchmarking on the testing and benchmarking framework pqm4 [KRSS]. As pqm4 is built for the STM32F407, we adapt its hardware abstraction layer to support the Giant Gecko. We use the arm-none-eabi-gcc compiler version 10.2.0 and compile with -O3. We do not run the Giant Gecko at the maximum frequency but instead down-clock it to 16 MHz and configure it to have zero wait states when fetching instructions and data from flash memory. This ensures that the resulting cycle counts are comparable to the ones produced by pqm4 on the STM32F407. Similar to pqm4, we use the built-in SysTick timer to count cycles. As the EFM32GG11B is not commonly used in the literature, we performed experiments to confirm that the timing behavior is comparable to the STM32F407: we benchmarked the schemes from pqm4 [KRSS] and found a very small cycle count difference of less than 1%.

Table 2 contains the performance results obtained on the EFM32GG11B. The run-time of our implementation of verification heavily depends on the signature, as explained in Section 3.3. Signing also has varying run-time depending on how many attempts are needed until the matrix inversion succeeds. Hence, we run 10 000 iterations of signing and verification (with different messages) and report the average. For comparison, we report the performance results of Moya Riera [MR19] for the round-2 parameters. Despite the larger parameters, we achieve a reduction in cycle counts by 27%, 47%, and 85% for key generation, signing, and verification respectively. For reference for the other parameter sets, we also report the cycle counts for the C implementation that is part of the Rainbow submission package [DCK + 20a].

RainbowI with and without precomputation
According to the specification, the Rainbow secret key is stored in nibble-packed representation. In our implementation, for each part of the secret key, the first step is to convert it to bitsliced representation. This change of representation can also be precomputed. We include the precomputation in the key generation, but it could also be implemented differently. This saves around 187 000 cycles for signing. However, this makes the secret key representation implementation-specific and platform-specific (due to Endianness) which may not be desirable. For rainbowIcompressed, this approach does not work as the secret key only consists of a seed that is used to re-sample the secret key during signing. One could also consider precomputing the bitsliced representation of the public key. However, this would only result in negligible speed-up due to the optimized verification algorithm that uses very few multiplications. Additionally, having an implementation-specific public key representation appears even less enticing.
The results for the alternative F16 representation described in Section 4 are also shown in Table 2. It consistently reduces the signing runtime, by up to 7%. Table 3 presents the stack requirements and code sizes of our implementations. As we do not use any dynamically allocated memory, all intermediate variables live on the stack. The reported numbers do not include keys, the message, and the signature, as those are allocated by the calling code. We measure the stack consumption by writing a fixed value to each byte of the stack, running the procedure, and then checking how much of the stack has been overwritten.
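This stack-painting measurement can be sketched as follows. On the device the canary is written to the actual call stack; this host-side illustration paints a scratch buffer instead and uses a toy procedure (both hypothetical) purely to show the counting logic:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_SIZE 4096      /* size of the painted region (illustrative) */
#define CANARY     0xAB      /* the fixed value written to every byte */

/* Scratch buffer standing in for the call stack in this sketch. */
static uint8_t stack_area[STACK_SIZE];

/* Toy stand-in for the measured procedure: pretend it consumed
 * 100 bytes at the top of the stack (stacks grow downward, so the
 * used region sits at the high end of the buffer). */
static void toy_procedure(void) {
    memset(stack_area + STACK_SIZE - 100, 0, 100);
}

/* Paint the region, run the procedure, then count how many canary
 * bytes survive at the low end; the rest was overwritten. */
size_t measure_stack_usage(void (*proc)(void)) {
    memset(stack_area, CANARY, STACK_SIZE);
    proc();
    size_t untouched = 0;
    while (untouched < STACK_SIZE && stack_area[untouched] == CANARY)
        untouched++;
    return STACK_SIZE - untouched;  /* peak bytes of stack used */
}
```

The scan stops at the first overwritten byte, so the result is an upper bound on the peak usage of a single run; averaging over runs with different inputs, as in our benchmarks, captures input-dependent variation.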
To obtain the code size, we run arm-none-eabi-size on a binary that includes all the code required to execute, i.e., we strip out all unused code. However, this binary still contains the platform code; we hence subtract 21 kB to obtain the code size of the Rainbow code alone. We additionally report the code size when only a part of the signature scheme is needed. If used by the procedure, the code size includes 5 kB for AES256 and 8 kB for SHA256. SHA256 is only used in signing and verification. AES256 is only used in key generation, signing, and circumzenithal verification (rainbowIcircumzenithal and rainbowIcompressed).
Optimizing RAM and code size is not the primary target of our work; we merely report them for completeness. rainbowIclassic inherently provides competitive memory consumption for signing and verification. rainbowIcircumzenithal and rainbowIcompressed require significantly more RAM. However, one needs to take into account that they also have smaller keys. For example, circumzenithal public keys are almost 100 kB smaller than classic public keys. If keys reside in RAM, circumzenithal outperforms classic in terms of total RAM consumption. Clearly, more RAM-efficient implementations are possible, and this is an interesting area for future work.

Hardware Acceleration for SHA2 and AES
Symmetric cryptography is at the core of virtually all post-quantum cryptography schemes, often making up the majority of cycles [KRSS] (e.g., up to 80% for Kyber [ABCG20] and 81% for Dilithium [GKS20]). We report the cycle counts for our Rainbow implementations (with precomputation) in Table 4. When using software implementations of AES and SHA2, we see that for rainbowIclassic only 10% of signing and 4% of verification are spent on hashing. This looks very different for circumzenithal verification (rainbowIcircumzenithal and rainbowIcompressed), where 92% of the cycles are spent in symmetric primitives. Interestingly, the Giant Gecko provides hardware support for the symmetric cryptography needed by Rainbow; we hence also report results using the hardware accelerator. This provides a vast speed-up of 13× for the verification of rainbowIcircumzenithal and rainbowIcompressed. For rainbowIclassic the speed-up is less pronounced. The comparison results for other schemes [GKS20, Por19] reported in the following are taken from the corresponding publications and were obtained by benchmarking on the STM32F407. However, as the EFM32GG11B timings are very close, they are comparable to ours.

Comparison to other Post-Quantum Signature Schemes
For our implementation, we report the variant using software implementations of AES and SHA256. Interestingly, both Falcon and Dilithium signing benefit from precomputation as well. For all implementations, precomputation is included in the key-generation cycles.
Our implementation of rainbowIclassic signing is 4× faster than the state-of-the-art Dilithium2 implementation and 45× faster than Falcon-512. Verification is 5× faster than Dilithium2 and 2× faster than Falcon-512. Consequently, our implementation of Rainbow on the Cortex-M4 is by far the fastest among the signature finalists of the NISTPQC competition.