Software Toolkit for HFE-based Multivariate Schemes

In 2017, NIST shook the cryptographic world by starting a standardization process for post-quantum cryptography. Sixty-four submissions were considered in the first round of the ongoing NIST Post-Quantum Cryptography (PQC) process. Multivariate cryptography is a classical post-quantum candidate and turns out to be the most represented in the signature category. At this stage of the process, it is of primary importance to investigate efficient implementations of the candidates. This article presents MQsoft, an efficient library for implementing HFE-based multivariate schemes submitted to the NIST PQC process, such as GeMSS, Gui and DualModeMS. The library is implemented in C, targets Intel 64-bit processors, and uses the avx2 instruction set. We present performance results for our library and its application to GeMSS, Gui and DualModeMS. In particular, we optimize several parts that are crucial for these schemes: root finding for HFE polynomials and the evaluation of multivariate quadratic systems over F_2. We propose a new method which accelerates root finding for specific HFE polynomials by a factor of two. For GeMSS and Gui, we obtain speed-ups of a factor between 2 and 19 for the keypair generation, between 1.2 and 2.5 for the signature generation, and between 1.6 and 2 for the verifying process. We have also improved the arithmetic in F_2^n by a factor of 4 compared to the NTL library. Moreover, a large part of our implementation is protected against timing attacks.


Introduction
The recent progress in the development of quantum computers has motivated NIST [oST17] to start a standardization process for post-quantum cryptography. In this paper, we are interested in the category of multivariate signature schemes. The choice of the best candidate is based on its security and its performance. At this stage, the security parameters of the candidates are already fixed, but the implementations can still be improved. In this article, we study the efficient implementation of multivariate schemes. We present software tools that allow the efficient implementation of HFE-based schemes (using arithmetic in F_2^n). In particular, our software tools speed up the GeMSS [CFMR+17], Gui [CDP+17] and DualModeMS [FPR17] signature schemes, which are candidates submitted to the NIST post-quantum cryptography standardization process [oST17]. The advantage of F_2^n is that each element can be represented as a vector of bits, which matches the architecture of binary computers and can be naturally accelerated by vector instructions. The signature generation requires arithmetic in F_2^n[X], and implementations of it are already provided by various libraries. Among the best, NTL [Sho03] provides high-quality implementations of state-of-the-art algorithms. But these algorithms are not specialized for the case of sparse polynomials in F_2^n[X]. Moreover, the implementations are not constant-time and so are vulnerable to timing attacks. For these reasons, we need to adapt the algorithms used. We have chosen to create a new library, which is based on constant-time arithmetic in F_2^n. Unlike NTL, which offers a general implementation, our library is specialized for a fixed value of n, permitting more efficient arithmetic. Moreover, we exploit the sparse polynomial structure to improve performance. More generally, our implementation uses Intel vector instructions to obtain interesting speed-ups. Our library also supports DualModeMS [FPR17], which is a candidate in the
NIST PQC standardization process. It is a modified HFE-based signature scheme which decreases the size of the public-key at the cost of increasing the size of the signature. The parameters are chosen to minimize the sum of both sizes. By improving the implementation of HFE-based schemes, we automatically improve the implementation of DualModeMS.
Evaluation of multivariate quadratic systems. Many multivariate cryptosystems require evaluating a multivariate quadratic system (MQS) to encrypt data or verify a signature (e.g. [CDP+17, CFMR+17, CHR+16]). Encryption uses secret data and should be performed in constant-time, whereas verification is a public process and does not have this constraint. In HFEv- signature schemes, the evaluation step is the main part of the verification. Efficient implementations of the evaluation have been studied in [BBG06, CCC+09, CHR+16, CLP+18]. The authors of [BBG06] propose different strategies for the evaluation in F_2, F_2^4 and F_2^8. In [CCC+09], the evaluation is vectorized with ssse3 instructions in F_31, F_16 and F_256. In [CHR+16], the authors propose to optimize the evaluation in F_31 and F_2^256 by evaluating the public-key equations one by one. Their implementation is vectorized with the avx2 instruction set. In [CLP+18], the authors present a faster evaluation with the same instruction set. To do so, they use a "monomial representation" of the public-key: for each monomial, the corresponding coefficients of each equation are stored together. We optimize the evaluation with this representation to obtain new speed records.
Root finding for a HFE polynomial. The main part of the signature generation in Gui and GeMSS is to find the roots of a polynomial F in F_2^n[X] with a specific form. Root finding is a fundamental problem in computer algebra with various applications in discrete mathematics. A survey of the main root finding methods can be found in [vzGG13]. Recently, the successive resultants algorithm (SRA) [Pet14] has been proposed to find the roots of a polynomial in small characteristic, and this work has been extended to split polynomials over general finite fields. In [DPP16], root finding is improved for split and separable polynomials when the cardinality of the multiplicative group is smooth. In the case of the HFE polynomial F in F_2^n[X], F has a sparse structure and its coefficients are in a field of small characteristic. Moreover, the number of roots is generally small (it is almost always less than 10 for our parameters). The main challenge is to exploit the sparse structure of F to improve the complexity of the root finding: it should depend on the number of coefficients of F and not on its degree. In practice, the Berlekamp algorithm [vzGG13, Algorithm 14.15] is used, which computes GCD(F, (X^(2^n) − X) mod F). The most costly task is the computation of X^(2^n) mod F, also called the Frobenius map, and the HFE structure can be exploited during the modular reduction by F. In [PCY+15], the authors propose a method to compute the Frobenius map with multi-squaring tables, which is interesting when the degree of F is (approximately) smaller than n. We study how to implement the Frobenius map efficiently, optimizing as a function of the parameters.
Arithmetic in F_2^n. Arithmetic in F_2^n is a critical part of the root finding algorithm, because all operations in F_2^n[X] require it; it is studied in [ALH10b, ALH10a, TFA+11, BG13]. In particular, multiplication in F_2^n is the most critical operation. This is a well-known task, studied in [BGTZ08, DG17, MS17, CCK+17]. We choose here to use the PCLMULQDQ instruction (Section 1.7) to obtain an efficient implementation. This instruction computes the product of two binary polynomials, each of degree strictly less than 64.

Organization of the Paper and Main Results
We present MQsoft [MQs18]: an efficient open-source library in C for HFE-based schemes such as GeMSS, Gui and DualModeMS. MQsoft is an improved version of the GeMSS additional implementation submitted to the NIST post-quantum cryptography process [oST17]. Our library improves upon the fastest known implementations of GeMSS and Gui, as well as the signature generation of DualModeMS. The performance results are studied in Section 5. Table 1 summarizes the obtained speed-ups. For the 128-bit and 192-bit security levels of Gui, we slightly modify a security parameter to improve the performance (cf. Section 5.2). The structure of MQsoft is depicted in Figure 1, which summarizes the main tasks required for each cryptographic operation. The critical part of an operation is represented by a solid arrow, whereas less important operations are represented by dotted arrows.
It is clear from Figure 1 that HFE-based schemes require an efficient implementation of arithmetic in F_2^n[X] and so in F_2^n. This is studied in Section 2. We have implemented state-of-the-art algorithms for arithmetic in F_2^n that use vectorization (sse2 and avx2) and the PCLMULQDQ instruction to improve the multiplication in F_2^n (Section 2.2). The multiplication is computed with the schoolbook algorithm in blocks of 64 bits for n ≤ 384, and with Karatsuba otherwise. When PCLMULQDQ is not available, MQsoft uses the multiplication in F_2[X] of the gf2x library (Section 1.7). The modular inverse is computed with the Itoh–Tsujii Multiplicative Inversion Algorithm (Section 2.5) together with multi-squaring tables (Section 2.4).
To optimize the arithmetic in F_2^n, the choice of n must be made before compilation. This permits the specialization of the implementation. The library is flexible and allows the choice of any n ≤ 576. F_2^n is built as F_2[x] quotiented by an irreducible polynomial f of degree n. When possible, we choose an irreducible trinomial for f to accelerate the modular reduction (Section 2.3). The modular reduction by f is vectorized for trinomials such that the degree of f(x) − x^n is strictly less than 128, and for the parameters of the studied schemes. Otherwise, the modular reduction is implemented for pentanomials such that the degree of f(x) − x^n is strictly less than 33; it is vectorized exclusively for n ∈ {184, 312, 448, 544}, because these are the parameters of Gui and DualModeMS256. For n ≤ 576, 56% of the finite fields can be created with an irreducible trinomial. Our library vectorizes the modular reduction for 92% of these cases. We obtain approximately a speed-up of a factor of 4 compared to the arithmetic in F_2^n of NTL.
In Section 4.2, the verifying process is accelerated via an efficient evaluation of multivariate quadratic systems using the avx2 instruction set. We obtain new speed records for the constant-time and variable-time evaluations of binary multivariate polynomials. To do this, we have chosen to use the "monomial representation" as in [CLP+18]. A multivariate quadratic system of m equations is stored as a pair (C, Q), where Q is an upper triangular matrix such that Q_{i,j} corresponds to the term x_i x_j, and C is the constant term. Since the multivariate quadratic systems will be evaluated in F_2^(n+v), x_i^2 = x_i, and so the linear term x_i is stored with the term x_i^2 in Q. With this representation, the evaluation at x ∈ F_2^(n+v) is computed as C + xQx^t. For 256 equations and 256 variables in F_2, our variable-time evaluation is 1.38 times faster than in [CLP+18]. To obtain this, we use unrolled loops and a specific way to extract the terms x_i. For the constant-time evaluation, we obtain performance similar to [CLP+18], which targets Haswell processors. However, on Skylake processors, the evaluation can be made faster by using vector instructions in a specific way, as explained in Section 4.2. This method saves a factor of 1.1 on Skylake (for 256 equations and 256 variables).
The core of the signing process is to find the roots of a univariate HFE polynomial F in F_2^n[X], which has the special sparse structure described in Section 1.2. Our goal is to exploit this structure to accelerate the root finding. We address this question in Section 3. We have been able to tweak Berlekamp's algorithm [vzGG13, Algorithm 14.15] to take advantage of the sparse structure of F. When D > n, the computation of X^(2^n) mod F dominates, and the cost of each modular reduction by F depends on the number of non-zero coefficients of F rather than on its degree. So, with a sparse polynomial, the computation of the roots is faster. This suggests considering sparse HFE polynomials. In Theorem 1, we prove that making F more sparse improves the complexity. Because F is a part of the secret-key, the nature of this change requires a new security analysis. We observe in practice that removing a small number of odd-degree terms appears not to affect the security. However, the security of this method must be studied in depth. With Theorem 1, by removing only three odd-degree terms from F, we can save 43.75% of the computations we would otherwise have done. The general idea of making F more sparse had already been proposed in HFEBoost, but independently of this, the proof of Theorem 1 makes explicit a method to improve the complexity. It has the advantage of being constant-time, because the useless computations are known in advance and so can be avoided.
When D < n, the strategy of [PCY+15] becomes more efficient for computing the Frobenius map. The idea is to compute a table of X^(2i) mod F to accelerate the modular reduction. Thus, the squaring modulo F is computed by multiplying the square of the i-th coefficient by the element X^(2i) mod F from the table, for i ∈ {0, ..., D − 1}. The authors of [PCY+15] also suggest doing several squarings in one step, with multi-squaring tables. In Section 3.5, we give an explicit strategy for doing this efficiently (by exploiting the HFE structure of F when possible), and for choosing the number of squarings to perform before the modular reduction.
The performance of both strategies described above depends on the required number of field multiplications. In Section 3.5, we accurately count the number of multiplications of each method, in order to choose the best strategy as a function of the parameters.

Preliminaries
We briefly recall here the principle of a signature scheme based on HFEv- [KPG99]. The public-key in HFEv- is given by a set of m quadratic equations in F_2[x_1, ..., x_{n+v}]. These equations are derived from a single polynomial F ∈ F_2^n[X] (Section 1.2). Verification requires evaluating the public polynomials (Section 1.4). We need to compute the roots of F in order to generate a signature (Section 1.3). From now on, we always assume that the base field is F_2. In Section 1.5, we introduce the specificities of GeMSS and Gui, which are two submissions to the NIST PQC process based on HFEv-.

Main Parameters
The main parameters involved are:
• D, a positive integer corresponding to the degree of a secret polynomial, such that D = 2^i for i ≥ 0, or D = 2^i + 2^j for i ≠ j and i, j ≥ 0,
• m, the number of equations in the public-key,
• n, the degree of a field extension of F_2,
• v, the number of vinegar variables,
• ∆, the number of minus (the number of equations in the public-key is m = n − ∆),
• nb_ite ≥ 1, the number of iterations in the verification and signature processes.

Keypair Generation
Secret-key. It is composed of a couple of invertible matrices (S, T) and a polynomial F(X, v_1, ..., v_v) with HFEv-shape (1). The particularity of a polynomial with HFEv-shape is that for any specialization of the vinegar variables, F becomes a HFE polynomial [Pat96], i.e. a univariate polynomial of the following form:

F(X) = Σ_{0 ≤ j < i < n} A_{i,j} X^(2^i + 2^j) + Σ_{0 ≤ i < n} B_i X^(2^i) + C, with deg(F) ≤ D,

with A_{i,j}, B_i, C ∈ F_2^n, for all i, j such that 0 ≤ j < i < n. By abuse of notation, we will refer to D as the degree of the HFEv polynomial.
The special structure of (1) is chosen such that its multivariate representation over the base field F_2 is composed of quadratic polynomials in F_2[x_1, ..., x_{n+v}]. This is due to the special exponents of X, which all have a binary decomposition of Hamming weight at most 2. Let θ = (θ_1, ..., θ_n) ∈ (F_2^n)^n be a basis of F_2^n over F_2. Writing X = Σ_{k=1}^{n} θ_k x_k, we can define a set of multivariate polynomials f_1, ..., f_n by decomposing F over this basis. To ease notation, we now identify the vinegar variables (v_1, ..., v_v) = (x_{n+1}, ..., x_{n+v}). Besides, we shall say that the polynomials f_1, ..., f_n ∈ F_2[x_1, ..., x_{n+v}] are the components of F over F_2.
In the implementation, we compute the components of F by using Equation (3) directly. We replace X by Σ_{k=1}^{n} θ_k x_k and substitute v_1, ..., v_v in the expression of F, then we use a rearrangement of terms which minimizes the number of multiplications. This is detailed in Section 4.1.

Public-key.
It is given by a set of m quadratic square-free non-linear polynomials p = (p_1, ..., p_m) ∈ F_2[x_1, ..., x_{n+v}]^m. It is obtained from the secret-key by taking the first m = n − ∆ polynomials of the composition of the secret transformations with the components of F, and reducing them modulo the field equations, i.e. modulo x_1^2 − x_1, ..., x_{n+v}^2 − x_{n+v}.
We summarize the public-key/secret-key generation in Algorithm 1. In practice, we merge Steps 6 and 7 by removing the ∆ last columns of T during the vector-matrix product.

Algorithm 1 PK/SK generation in HFEv- schemes: randomly sample (S, T), compute the components of F (see Section 4.1 for details on Step 5), then take the first m = n − ∆ polynomials computed in Step 6.

Signing process
The main step of the signature process requires inverting the public-key, that is to say, solving p(x_1, ..., x_{n+v}) = d. To do so, we randomly sample r = (r_1, ..., r_∆) ∈ F_2^∆ and append it to d, which gives an element D' ∈ F_2^n and a corresponding multivariate equation to solve. To solve this equation, we take advantage of the special HFEv-shape: we randomly sample v ∈ F_2^v and consider the univariate polynomial F(X, v). This yields a HFE polynomial according to Section 1.2. We then find the roots of the univariate equation F(X, v) = D'. A core part of the signature generation is then the computation of the roots of F_{D'}(X) = F(X, v) − D'. To this end, we use the Berlekamp algorithm as described in [vzGG13, Algorithm 14.15], which mainly requires computing GCD(F_{D'}, (X^(2^n) − X) mod F_{D'}). We provide in Section 3.1 the best methods to compute X^(2^n) mod F_{D'} and the GCD (both as a function of n and D).
We can now present a way to define the inversion function Inv_p (Algorithm 2, inverse map of the public-key). The notation ∈_R stands for randomly sampling; one of its steps calls Algorithm 6, and the sampling is repeated until Roots ≠ ∅. The signing algorithm in GeMSS and Gui is an iterative process. The basic idea is to call Inv_p nb_ite times. More precisely, the signing process is given in Algorithm 3: starting from H ← HASH(M), it iterates for i from 1 to nb_ite (⊕ denotes the component-wise XOR) and returns (S_nb_ite, X_nb_ite, ..., X_1). nb_ite is a parameter that can be easily computed from m and the level of security.

Verifying process
Naturally, the verifying process is also iterative, as shown in Algorithm 4. The main part of this process is still the evaluation of the public-key. We describe how to implement this step efficiently in Section 4.2.

GeMSS and Gui
GeMSS and Gui are essentially based on the same principle. They differ, however, in the choice of parameters. Table 26 summarizes the performance of the best implementations of GeMSS and Gui provided for the NIST submissions, as a function of the parameters. For GeMSS, we consider the additional implementation. For Gui, we take the PCLMULQDQ additional implementation. Gui is implemented with the modified algorithms achieving the EUF-CMA security property [CDP+17, Section 1.6]. This is not the case for the GeMSS implementation. For this reason, we have modified the Gui additional implementation: it provides the cryptographic operations of Gui with and without the EUF-CMA security property, and we simply replace the algorithms by their versions without this property. Compared to the original implementation, this mainly implies a speed-up during the signature generation. In Gui, the parameters are chosen to minimize the time of signing and verifying a signature. To do so, small values of D are chosen, and larger values of m are used. This choice implies increasing n, and so the size of the public-key. In GeMSS, it is the opposite: the goal is to minimize the size of the public-key. This leads to smaller values of n, which implies taking larger values of D. As soon as m is fixed, the required number of iterations can be derived. In fact, the original parameters of Gui do not always provide the claimed security. The attack described in [Cou03, Theorem 6.2.1] has a complexity O(2^(m·nb_ite/(nb_ite+1))). So, with m = 168 and nb_ite = 2, the original parameters of Gui-184 provide a security of only 112 bits. This problem was mentioned on NIST's pqc-forum mailing list by W. Beullens on 04/27/2018. The Gui designers answered on 06/15/2018 and chose to set the parameter nb_ite to 3. However, this provides only 126 bits of security. To reach 128 bits of security, we should set nb_ite to 4 (or set m to 171). For completeness, we have measured Gui-184 for these three values of nb_ite. The modification slows down the signature generation and the verifying process of Gui-184 by a factor of nb_ite/2.

Choice of parameters
Finding a root of F(X, v) − D'. During the signature generation, F(X, v) − D' may have no roots. In GeMSS, r and v are resampled while F(X, v) − D' has no roots (cf. Algorithm 2). Then, when there are roots, one of them is chosen deterministically for a fixed D', by using the SHA-3 hash of D' as a randombytes generator. In Gui, v is resampled while F(X, v) − D' does not have a unique root. In practice, d is generated from the hash of a document. To do so, GeMSS uses SHA-3, whereas Gui uses SHA-2. The output size of the hash function is double the level of security. In MQsoft, we have chosen the SHA-3 function from the Keccak Code Package [GBA13] and the SHA-2 function from OpenSSL.

Data Structure
We describe here the data structures used to store elements of F_2^n and F_2^n[X]. This representation is crucial for the efficiency of our implementation. This is especially true for binary fields, since operations in F_2 can be naturally vectorized. The arithmetic in F_2^n[X] is used during the root finding of a HFE polynomial. To be efficient, it is important to distinguish the dense polynomials, which appear during the computation of the Frobenius map and the GCD, from the HFE polynomial, which is used to reduce X^(2^n) − X. We can note that the HFE polynomial is sparse, since it has only K = O(log_2(D)^2) non-zero coefficients.

Representation of elements in F_2^n.
We have chosen the polynomial basis [HMV03]. An element of F_2^n is represented by a polynomial in F_2[x] of degree at most n − 1. The coefficients are stored as a vector of bits, requiring ⌈n/w⌉ words, where w is the word size (in bits). The j-th bit of the i-th word is the coefficient of the term of degree wi + j, for i ∈ {0, ..., ⌈n/w⌉ − 1} and j ∈ {0, ..., w − 1}. It is set to zero when wi + j ≥ n. Example 1. Let w = 64 and P = x^36 + x^4 ∈ F_2^40. To simplify the notation, we represent vectors of bits as 64-bit integers. P is stored as 0x0000001000000010. In particular, the bits from 37 to 63 are set to zero.

Representation of dense polynomials in F_2^n[X].
A dense polynomial in F_2^n[X] is represented by its degree δ and a vector of (δ + 1) coefficients. The coefficients are stored in a buffer, from the lowest to the highest degree of the corresponding terms. The degree is stored in a local variable, except for the implementation of the fast GCD in Section 3.6, which requires matrices in M_2(F_2^n[X]). In this case, we use a C structure to store the degree and the pointer to the coefficient buffer.

Representation of HFE polynomials in F 2 n [X].
In HFEv- schemes, the HFEv polynomial is a part of the secret-key. During the signature generation (cf. Section 1.3), the vinegar variables of the HFEv polynomial are evaluated to obtain a HFE polynomial. Its degree D is a security parameter and is assumed to be known. It is defined by the C directive #define. A HFE polynomial in F_2^n[X] is represented as a vector of coefficients where only the terms X^0, X^(2^i) and X^(2^i + 2^j) are stored. It is chosen monic, so the leading term is not stored. If P is in F_2^n[X], we denote by P_HFE its HFE representation.

Experimental Platform and Benchmarked Libraries
Tables 2 and 3 summarize the main information about the platforms used in the experimental measurements. LaptopS is used for all measurements, except in Section 4.2 (DesktopS) and in Section 5. For the latter, we present the performance of MQsoft as a function of the processor. Our implementation targets Intel 64-bit processors that support the PCLMULQDQ instruction. This improves the performance of multiplication in F_2^n (cf. Sections 2.1 and 2.2). We also take advantage of the sse2, ssse3 and avx2 instruction sets to speed up the implementation of arithmetic in F_2^n (Section 2), the vector-matrix product in F_2 during the keypair generation (Section 5.4), and the evaluation of the public-key during the verifying process (Section 4.2). We explain here the main vector instructions that MQsoft exploits:
• PCLMULQDQ: this instruction computes the product of two binary polynomials, each of degree strictly less than 64.
• PSHUFB: this instruction from ssse3 takes sixteen 4-bit indexes and looks up the corresponding 8-bit elements in a 128-bit table.
• PSLLDQ and PSRLDQ: these instructions from sse2 compute respectively the left and right shift of a 128-bit register by a multiple of 8 bits.
• PALIGNR: this instruction from ssse3 concatenates two 128-bit registers, shifts the concatenation right by a multiple of 8 bits, then returns the 128 lower bits of the result.
• VPBROADCASTQ: this instruction from avx2 duplicates a 64-bit integer four times into a 256-bit register.
• VPERMQ: this instruction from avx2 permutes the 64-bit parts of a 256-bit register.
In particular, it can duplicate one 64-bit part four times.
• VPMASKMOVQ: this instruction from avx2 loads four contiguous 64-bit integers from a buffer, then applies a mask which can set selected 64-bit parts to zero.
• POPCNT: this instruction counts the number of bits set to 1 in a 64-bit integer. It is used to speed up the dot product of vectors over F_2.
We benchmarked our software toolkit against the following software packages and libraries.
gf2x: the multiplication algorithm depends on the degree of the operands and the vector instruction sets available. When PCLMULQDQ is not available, MQsoft uses the gf2x multiplication.
Magma is run with magma.avx64.dyn.exe to take advantage of vector instructions. In our tests, the avx64 version mainly optimizes the performance of multiplication in binary fields.
The measurements used one core of the CPU, and the C code was compiled with gcc -O4 -mavx2 -mpclmul -mpopcnt -funroll-loops. We use version 6.4.0 of gcc. Turbo Boost and Enhanced Intel SpeedStep Technology are disabled to obtain more accurate measurements, except when we use DesktopH and DesktopS. In practice, Turbo Boost generates a speed-up of 1.2 on LaptopS and 1.1 on ServerH.

Efficient Arithmetic in F_2^n
Arithmetic in F_2^n is the core of the signature generation (Section 1.3) and of the computation of f during the keypair generation (Section 1.2). The main parameters involved (Section 1.1) are n, the degree of the field extension; v, the number of vinegar variables; D, the degree of the HFEv polynomial; and nb_ite, a constant between two and four in GeMSS and Gui.
In Section 4.1, we explain how to generate efficiently the inner secret polynomials of f (Equation (3)). This requires efficient arithmetic in F_2^n.

Squaring in F 2 n
Squaring is used during the root finding algorithm (Section 3), which is the core of the signature generation (Section 1.3). It is also used during the so-called Itoh–Tsujii algorithm [IT88], which computes the inverse in the multiplicative group of F_2^n (Section 2.5). In binary fields, the squaring of B = Σ_{i=0}^{n−1} α_i x^i ∈ F_2^n can be performed in linear time [LN96]. The linearity of the Frobenius endomorphism implies that B^2 = Σ_{i=0}^{n−1} α_i x^(2i). Since we store B as a vector of bits, squaring is equivalent to inserting a null bit between each pair of consecutive bits of B.
To compute the square of an n-bit element, we divide it into words of 64 bits. For each one, the PCLMULQDQ instruction directly computes the binary polynomial multiplication of the 64-bit element by itself. This method requires ⌈n/64⌉ calls to PCLMULQDQ. We have compared this method to table lookups of squares [ALH10a]. We have implemented Algorithm 1 of [ALH10a], which uses sse2 instructions and PSHUFB from ssse3, as well as an avx2 version that uses the VPSHUFB instruction. The PSHUFB instruction performs the lookup of the squares of sixteen 4-bit elements in a table in constant-time, and the VPSHUFB instruction performs PSHUFB twice. On Skylake, both are less efficient than the PCLMULQDQ instruction, whereas on Haswell, we observe the opposite behavior because PCLMULQDQ is slower there (Table 4). Table 5 summarizes the performance of the squaring functions proposed in our library. The experimental process consists in computing the squares of elements from a small buffer, then measuring the average cost of one operation. In order to compute the square with table lookups, the PSHUFB instructions must be used in pairs: one computes the square of the lower 4-bit element of each byte, whereas the second is used for the higher 4-bit elements. For this reason, the squaring performance using PSHUFB depends only on ⌈n/128⌉. For the squaring using PCLMULQDQ, the implementation for ⌈n/64⌉ equal to 3 is slower than for ⌈n/64⌉ equal to 4. Indeed, 3 is odd, and the 128-bit load and store instructions are less efficient when they are used to load and store a single 64-bit element. The best squaring is the one using the PCLMULQDQ instruction: on Skylake processors, it costs only one cycle of throughput, but seven cycles of latency (Table 4). However, the latency can be filled with other instructions, which improves the performance.

Multiplication in F 2 n
The multiplication of two distinct elements of F_2^n is a central operation involved in the keypair generation and the signing process. In MQsoft, we adapt the multiplication algorithm as a function of n. We target Skylake processors.
When n ≤ 384, we use a schoolbook multiplication by blocks of 64 bits. We use the PCLMULQDQ instruction to multiply each pair of blocks, so ⌈n/64⌉^2 calls to PCLMULQDQ are required. This method is naturally constant-time. Our implementation uses PCLMULQDQ, which implies using sse2 instructions. We also use the PALIGNR instruction from ssse3 to improve the implementation. This instruction concatenates two 128-bit registers and permits the extraction of 128 bits from the result. We use it to align the results of the multiplications on 128 bits.
When n > 384, we use the Karatsuba algorithm, which requires asymptotically fewer calls to PCLMULQDQ than the schoolbook method.
The trade-off between the schoolbook multiplication and Karatsuba depends on the performance of PCLMULQDQ (Table 4). On Skylake processors, this instruction costs one CPI, which makes the schoolbook multiplication more efficient for n ≤ 384. On Haswell processors, PCLMULQDQ costs two CPI. This lowers the crossover point, because each call to PCLMULQDQ is more penalizing. When ⌈n/64⌉ equals 3, we remark that the Karatsuba multiplication in F_2^n is already faster on Haswell. In practice, we have compared the schoolbook multiplication to the three-term Karatsuba-like formula described in [Mon05, Equation 3 with C = 0]. The latter is slightly faster and requires only 6 calls to PCLMULQDQ.
Table 6 compares our multiplication with gf2x. As in Section 2.1, we measure the average cost of multiplying elements from a small buffer. The multiplication of gf2x is sometimes abnormally slow. This is probably due to the fact that the implementation mixes vector and non-vector instructions in the same function, which penalizes it. This is probably the first reason why our multiplication is faster. The second reason is that gf2x uses Karatsuba, which is slower than the schoolbook multiplication for n ≤ 384 on Skylake. We have also remarked that installing NTL with gf2x slightly decreases the performance. For this reason, NTL is not installed with gf2x on our experimental platform.

Modular Reduction and Field Operations in F_2^n
In this section, we want to reduce R = Σ_{i=0}^{2n−2} r_i x^i, the result of a multiplication or squaring in F_2^n. The choice of the irreducible polynomial f defining F_2^n (Section 1.6) is important for the modular reduction: it is faster when f is a trinomial or a pentanomial [GK03, ALH10a, ALH10b].
For completeness, we explain here the principle of the modular reduction by a trinomial. The method for pentanomials uses the same idea, and it is explained in Appendix A. Let f(x) = f_3(x) = x^n + x^{k_1} + 1, so that f_3(x) − x^n = x^{k_1} + 1, and split R into a part of degree strictly less than n and a part of degree at least n (Equation (6)). We do a first step of reduction by f_3 by replacing x^n by f_3(x) − x^n in Equation (6). We obtain a polynomial congruent to R modulo f whose degree is at most max(n − 1, n + k_1 − 2) (Equation (7)). We then iterate a new step of reduction on the terms of degree at least n. In two steps of reduction, we thus have a method to compute the modular reduction for all trinomials such that 2(k_1 − 1) < n.
To optimize the computation of (7), we factor by (f_3(x) − x^n). In this way, we can rewrite R as in (8). There are mainly two methods [BG13] to compute (8). The first is the shift-and-add strategy: the multiplication by (f_3(x) − x^n) is computed with shifts and XORs. The second is the mul-and-add strategy: the multiplication by (f_3(x) − x^n) is computed with the PCLMULQDQ instruction. In this case, it is recommended to choose k_1 strictly less than 64. In this way, (f_3(x) − x^n) can be directly used as one of the operands of the PCLMULQDQ instruction. We choose the first method because it requires a small number of low-cost instructions.
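As an illustration of the two reduction steps, here is a bit-level C sketch, an unoptimized reference of our own and not the word-sliced constant-time code of MQsoft, for the irreducible trinomial x^233 + x^74 + 1 (our choice for the example; it satisfies 2(k_1 − 1) < n):

```c
#include <assert.h>
#include <stdint.h>

#define N  233  /* f(x) = x^233 + x^74 + 1, an irreducible trinomial */
#define K1 74

static int get_bit(const uint64_t *r, int i) { return (r[i >> 6] >> (i & 63)) & 1; }
static void flip_bit(uint64_t *r, int i)     { r[i >> 6] ^= (uint64_t)1 << (i & 63); }

/* Reduce a product of degree <= 2N-2 (stored on 8 words) modulo f,
 * using x^i = x^{i-N+K1} + x^{i-N} for i >= N. Iterating from the top
 * down, a bit folded to position i-N+K1 >= N is reduced again later in
 * the same loop: these are the two reduction steps of the text. */
static void reduce_trinomial(uint64_t r[8])
{
    for (int i = 2 * N - 2; i >= N; i--) {
        if (get_bit(r, i)) {
            flip_bit(r, i);
            flip_bit(r, i - N + K1);
            flip_bit(r, i - N);
        }
    }
}
```

The shift-and-add version of MQsoft performs the same two foldings on whole 64-bit words with shifts and XORs, without any secret-dependent branch.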
These methods can be optimized for specific values of k_1 and n. Firstly, the case k_1 = 1 avoids computations because S_{k_1} = 0. Secondly, a classical trick is the use of the PSLLDQ and PSRLDQ instructions, which shift 128-bit registers by a multiple of 8 bits. The PALIGNR instruction can also be used. As explained in Section 1.7, it concatenates two 128-bit registers, then extracts 16 contiguous bytes. We list here the cases where these instructions could be used: • For the extraction of R_{k_1} from R when n is a multiple of 8. However, there exist no irreducible trinomials such that n is a multiple of 8 [Swa62].
• For the extraction of S k1 from R when 2n − k 1 is a multiple of 8.
• To obtain S k1 x n−k1 from S k1 when n − k 1 is a multiple of 8.
• For the multiplication by x k1 when k 1 is a multiple of 8.
For the shift-and-add strategy, we can also compute the reduction in a form shifted by x^{k_1}. In this case, the 128-bit shifts improve the implementation in the following cases: • For the extraction of R_{k_1} x^{k_1} from R when n − k_1 is a multiple of 8.
• For the extraction of S k1 x k1 from R when 2n − 2k 1 is a multiple of 8.
• To obtain S k1 x n−k1 from S k1 x k1 when n − 2k 1 is a multiple of 8.
• For the division by x k1 when k 1 is a multiple of 8.
We have implemented the shift-and-add method for trinomials with different SIMD instruction sets. ssse3 is used to improve the implementation with the PALIGNR instruction. We have also implemented the shift-and-add method for pentanomials, but it is vectorized only for n ∈ {184, 312, 448, 544}, because these are the parameters of Gui and DualModeMS256. For the implementation of HFE-based schemes, we estimate that the best strategy is to choose n such that an irreducible trinomial of degree n exists. The performance of the modular reduction depends on the context. In Table 7, we reduce products from a small buffer, then measure the average cost of one modular reduction. We have removed the optimizations using PSLLDQ and PSRLDQ to have comparable measurements. In practice, they are used (for example, the ssse3 modular reduction for n = 375 takes 12.2 cycles with these optimizations). The ssse3 version is twice as fast as the version without vector instructions, because sse2 performs two 64-bit operations in one instruction. The avx2 implementation is slightly faster than the ssse3 version, probably because avx2 is faster at loading and storing data. The modular reduction is also measured when it is used with the multiplication in F 2 [X], in two ways:
• Left value: we measure the average cost of one field multiplication during the computation of the naive exponentiation function (x^i is computed as x^{i−1} · x). Each result depends on the previous result, and the data are already loaded.
• Right value: we measure the average cost of one field multiplication when multiplying the elements of two buffers. The data are independent, but each multiplication requires loading the input and storing the output. We remark that for n less than 375, the multiplication on independent data is faster. This is probably caused by the latency of the PCLMULQDQ instruction. The field multiplication with the modular reduction using sse2 is the fastest, because the PCLMULQDQ instruction requires 128-bit registers. When sse2 is used together with avx2, the implementation pays a penalty. However, this problem will be solved with the VPCLMULQDQ instruction on the future Ice Lake processors [Int18].
In Table 9, the modular reduction is measured when it is used with the squaring in F 2 [X]. As for the multiplication in F 2 n , the performance of the squaring depends on the context. For this reason, we measure it in two ways:
• Left value: we measure the average cost of one field squaring during the raising of an element of F 2 n to the power 2^i (x^{2^i} is computed as (x^{2^{i−1}})^2). Each result depends on the previous result, and the data are already loaded.
• Right value: we measure the average cost of one field squaring when squaring the elements of one buffer. The data are independent, but each squaring requires loading the input and storing the output.
Table 9 shows the performance of the squaring in F 2 n . The squaring using PCLMULQDQ is the most efficient. For the same reasons as for the multiplication in F 2 n , the best modular squaring is the one using only the sse2 modular reduction. This is the default setting in MQsoft.
All the functions measured in this section are available in our library. They are implemented in constant-time. The modular reduction by pentanomials is also available to make the library more complete, but it is not yet vectorized (except for n ∈ {184, 312, 448, 544}).

Multi-squaring in F 2 n
The multi-squaring [TFA + 11] is an operation computing several successive squarings. This operation is important to compute the inverse in F × 2 n (Section 2.5). Algorithm 5 requires computing B^{2^i} for B ∈ F 2 n and various values of i. For small i, the best way is to raise B to the power two i times (as in Section 2.3). For larger i, the best method is to use precomputed multi-squaring tables.
The idea of multi-squaring tables is to store x^{k·2^i} mod f for k ∈ ⟦0, n − 1⟧. Then, for B = Σ_k α_k x^k, the multi-squaring B^{2^i} is the dot product of the vectors (α_0, . . ., α_{n−1}) and (1, x^{2^i} mod f, . . ., x^{(n−1)·2^i} mod f). The table requires storing n − 1 elements of F 2 n (1 is not formally stored), and the multi-squaring requires n − 1 multiplications between elements of F 2 and F 2 n and n − 1 additions in F 2 n . In a variable-time implementation, the multiplication by α_k can be done with a conditional statement. In a constant-time implementation, the value of α_k is duplicated in a mask variable, in order to replace the multiplication by a bitwise AND with this mask. This process is explained in Section 4.2.
In a variable-time implementation, the performance can be improved with larger tables [Mai15]. Rather than computing α_k x^{k·2^i} mod f coefficient by coefficient, the coefficients can be grouped in blocks of B, and the 2^B possibilities of Σ_{k=0}^{B−1} α_{jB+k} x^{(jB+k)·2^i} can be precomputed for j ∈ ⟦0, n/B − 1⟧. This method cannot be used in a constant-time implementation because of timing attacks based on memory latency, which target the index used to read the precomputed table. In our implementation, we use a constant-time implementation of the multi-squaring.

Inversion in F × 2 n
The computation of the inverse in F × 2 n is often required for the arithmetic in F 2 n [X]. In our case, it is required to compute the GCD (Section 3.6). To compute the modular inverse of A ∈ F × 2 n , there are mainly two methods. The first is to use the extended Euclidean algorithm (EEA) [vzGG13, Algorithm 3.6]. This method is not constant-time. The second is to compute A^{−1} = A^{2^n − 2} by Fermat's little theorem. The exponentiation can be done with the square-and-multiply method [vzGG13, Algorithm 4.8], costing n − 1 squarings and n − 2 multiplications in F 2 n . The Itoh–Tsujii Multiplicative Inversion Algorithm (ITMIA) [IT88] modifies the way to compute the power by using an addition chain. It requires n − 1 squarings and only O(log_2(n)) multiplications in F 2 n . The number of multiplications depends on the length of the chosen addition chain. ITMIA is described in Algorithm 5 for a specific addition chain which consists in reading the bits of n − 1 from MSB to LSB. At the end of each iteration, the variable inv is always A^{2^{val} − 1}.
Algorithm 5 (ITMIA for a specific addition chain) requires computing val successive squarings. This multi-squaring can be computed more quickly with precomputed tables (cf. Section 2.4).
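Since Algorithm 5 is not reproduced here, the following toy C sketch illustrates its invariant in the small field F 2 8 (AES polynomial, chosen only for the example; MQsoft works in F 2 n for large n, with vectorized field arithmetic):

```c
#include <assert.h>
#include <stdint.h>

/* Toy field F_2^8 = F_2[x]/(x^8 + x^4 + x^3 + x + 1), for illustration. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0));
    }
    return r;
}

/* ITMIA with the addition chain given by the bits of n-1 (here n = 8,
 * n-1 = 7 = 0b111), read from MSB to LSB. The invariant is
 * inv = A^(2^val - 1) at the end of each iteration; the inverse is
 * inv^2 = A^(2^n - 2). The `val` successive squarings are where the
 * multi-squaring tables of Section 2.4 plug in. */
static uint8_t gf_inv_itmia(uint8_t a)
{
    const int n = 8;
    uint8_t inv = a;                      /* A^(2^1 - 1) */
    int val = 1;
    for (int b = 1; b >= 0; b--) {        /* bits of n-1 = 0b111 after the MSB */
        uint8_t tmp = inv;
        for (int s = 0; s < val; s++) tmp = gf_mul(tmp, tmp);
        inv = gf_mul(tmp, inv);           /* A^(2^{2 val} - 1) */
        val *= 2;
        if ((n - 1) >> b & 1) {           /* bit set: one squaring, one mult by A */
            inv = gf_mul(gf_mul(inv, inv), a);
            val += 1;
        }
    }
    return gf_mul(inv, inv);              /* A^(2^n - 2) = A^(-1) */
}
```

For n = 8, the chain costs 7 squarings and 4 multiplications, against 6 multiplications for plain square-and-multiply; the gap grows with n.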
Algorithm 5 is useful because it automatically proposes an addition chain for all values of n, but it is not always optimal. Choosing the best addition chain is not easy, because it depends on the performance of the multiplication, squaring and multi-squaring. Moreover, there is a large set of possible addition chains. This problem is studied in [Mai15], which proposes a software tool generating efficient C++ inversion code. This software searches for the addition chain which maximizes the performance of the generated code. However, the generator does not propose a constant-time implementation of the multi-squaring tables. For the moment, MQsoft uses Algorithm 5, but we propose in Appendix B examples of addition chains chosen to minimize the number of field multiplications. We have improved Algorithm 5 with multi-squaring tables to compute the variable tmp when i is zero or one. Because multi-squaring tables are huge, we use them only for the parameters of GeMSS, Gui, DualModeMS, and for the values of n used to evaluate the performance of MQsoft. The corresponding file in MQsoft requires 1.7 MB for 44 tables.

Performance of the Arithmetic in F 2 n
Table 10 compares the performance of the arithmetic operations in our library with respect to several open source libraries (listed in Section 1.7). We choose the irreducible trinomial f(x) = x^n + x^{k_1} + 1 with k_1 ∈ ⟦2, 32⟧ to create the field F 2 n . All operations use the modular reduction. We have measured the performance of FLINT [Har10, version 2.5.3] (it is a C library), but the timings are not relevant in our context. It turns out that for n = 252, NTL is 100 to 200 times faster than FLINT. The main reason is that FLINT does not have a special implementation for binary fields. We have used the type fq_nmod_t, which stores each element of F 2 n as a polynomial in F 2 [X] where each coefficient is stored on one word. Magma is also taken into account. The results are not significant because Magma is slowed down by its user interface. We remark that the squarings and multiplications of NTL are faster for n = 126 than for n = 62. This can probably be explained by the fact that NTL does not use a trinomial for n = 62. Our implementation is 3.5 to 4.5 times faster than NTL for the multiplication and 5 to 6 times faster for the squaring. We think that NTL is slowed down by its C++ interface. For the inversion, the measurements are not comparable because NTL is not constant-time. However, we have a speed-up of two on average. We now compare MQsoft to the constant-time arithmetic of [BG13], when trinomials are used to build F 2 n . In F 2 233 , they compute the squaring in 18 cycles and the multiplication in 38 cycles. We have approximately the same performance for n = 252: MQsoft is slower with dependencies but faster with buffers. In F 2 409 , they compute the squaring in 28 cycles and the multiplication in 97 cycles. MQsoft is slightly faster (we compare to the measurements in F 2 441 ). For the inversion, [BG13] is approximately two times slower. Our library takes advantage of the use of multi-squaring tables: we replace n − 1 squarings by approximately n/4 squarings and two multi-squarings.

Efficient Implementation of Root Finding in F 2 n [X]
The most expensive part of the signature generation is to find the roots of an HFE polynomial F ∈ F 2 n [X] as defined in Equation (2). F is a monic polynomial of degree D which is sparse, because it has approximately log_2(D)^2 / 2 non-zero coefficients. We have chosen to implement Berlekamp's algorithm [vzGG13, Algorithm 14.15], which finds the roots with an asymptotic complexity of O(nD^2 + (n + log(s))s^2 log(s)) operations in F 2 n , where s is the number of roots of F [vzGG13, Theorem 14.11 adapted for r = s and d = 1]. For HFE polynomials, the factor O(nD^2) can be easily improved to O(nD log_2(D)^2 + D^2) operations in F 2 n by using the sparse structure of F (Section 3.5). Moreover, an HFE polynomial does not have many roots, so we can assume that s is negligible, yielding a final complexity of O(nD log_2(D)^2 + D^2) operations in F 2 n . For general polynomials, the author of [Pet14] proposed in 2014 the successive resultant algorithm (SRA). It requires O(n^3 D^2 + n^4) operations in F 2 to find the roots, or O(n^2 D + n^3) with fast arithmetic. The step in O(n^4) (or O(n^3)) can be precomputed for a fixed finite field. In comparison to Berlekamp's algorithm, SRA is interesting only when the polynomial has many roots. In [GvdHL15] and [DPP16], root finding is improved for split and separable polynomials, and when the cardinality of the multiplicative group is smooth. In our case, these methods are not interesting because HFE polynomials do not have many roots.
Improving root finding for sparse polynomials is a hard problem. In [BCR13], the authors propose the first algorithm sub-linear in q which detects the existence of roots of t-nomials in F q [X]. The complexity is 4^{t+o(1)} q^{(t−2)/(t−1)+o(1)} bit operations. This method is not interesting for an HFE polynomial, because it is not sparse enough and also because in practice n is greater than the level of security. The algorithm would cost approximately 4^{o(1)} 2^{n+o(n)} D log_2(D) bit operations in our case.
In this section, we recall Berlekamp's algorithm. For each operation required, we study the different possible methods and compare their practical performance to choose the best one. We specify how these methods can be tuned for an HFE polynomial.

Description of Berlekamp's Algorithm in F 2 n [X]
Algorithm 6 describes Berlekamp's algorithm [vzGG13, Algorithm 14.15]. The main idea is to remark that all the elements of F 2 n are roots of X^{2^n} − X. We can then compute G, the GCD of F and X^{2^n} − X. G has the same roots as F but with a minimal degree (which is the number of roots). In general, the degree of G is small. The strategy is then to apply the so-called equal-degree factorization to find all the roots. This turns out to be cheap. Indeed, let s be the degree of G; the equal-degree factorization costs (n + log(s))s^2 log(s) operations in F 2 n [vzGG13, Theorem 14.11 adapted for r = s and d = 1]. Because the degree of X^{2^n} − X is big, we reduce X^{2^n} − X modulo F by using the repeated squaring algorithm in F 2 n [X] before computing the GCD.
Algorithm 6 Algorithm to find the roots of a univariate polynomial.
function FindRoots(F ∈ F 2 n [X])
    Step 1: computation of X^{2^n} mod F .
    Step 2: computation of the GCD G of F and X^{2^n} − X.
    if degree(G) > 0 then
        Roots ← list of all the roots of G, computed by the equal-degree factorization algorithm described in [vzGG13, Section 14.3] (call to Algorithm 7).
        return (degree(G), Roots)
    end if
    return (degree(G), ∅)
end function
Thereafter, we will study the choice of the algorithms only for Steps 1 and 2. The computation of the equal-degree factorization is negligible since G has a small degree. For the sake of completeness, the equal-degree factorization algorithm is summarized in Algorithm 7. We will compare the classical and fast algorithms, and we optimize them for an HFE polynomial, which is sparse, i.e. which has O(log_2(D)^2) coefficients, in comparison to a dense polynomial that has D + 1 coefficients.
Algorithm 7 Algorithm to find the roots of a split monic univariate polynomial.
function FindRootsSplit(F ∈ F 2 n [X])
    if degree(F ) < 1 then
        return ∅
    else if degree(F ) == 1 then
        return List(F.cst)    ▷ F.cst is the constant term of F : we create a list with the root of F and return it.
    else
        ▷ Split F into two non-trivial factors using a randomly sampled element (the notation ∈_R stands for random sampling), call FindRootsSplit recursively on both factors, and return the concatenation of the two lists of roots.
    end if
end function

Polynomial Squaring in F 2 n [X]
Step 1 of Algorithm 6 requires computing repeated squarings in F 2 n [X]. Squaring a polynomial A = Σ_i a_i X^i of degree D − 1 yields A^2 = Σ_i a_i^2 X^{2i}, since the cross terms vanish in characteristic two. This operation requires D squarings in F 2 n .
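This can be sketched as follows, with the toy coefficient field F 2 8 standing in for F 2 n (our own names; MQsoft uses its vectorized field squaring here):

```c
#include <assert.h>
#include <stdint.h>

/* Toy coefficient field F_2^8 (AES polynomial), for illustration only. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0));
    }
    return r;
}

/* Square a polynomial of degree < d over a binary field: in
 * characteristic two, (sum a_i X^i)^2 = sum a_i^2 X^{2i}, so the
 * operation costs d coefficient squarings and no coefficient
 * multiplications. out must hold 2d-1 coefficients. */
static void poly_sqr(const uint8_t *a, int d, uint8_t *out)
{
    for (int i = 0; i < 2 * d - 1; i++) out[i] = 0;
    for (int i = 0; i < d; i++) out[2 * i] = gf_mul(a[i], a[i]);
}
```

Note that all the odd-degree coefficients of the result are null, which is the property exploited in Section 3.4.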

Euclidean Division in F 2 n [X]
Let P ∈ F 2 n [X] be a univariate polynomial.The notation coef i (P ) will denote the coefficient of the term of degree i in P .
Step 1 of Algorithm 6 requires computing the modular reduction of a polynomial P of degree at most 2D − 2 by a monic polynomial F . The classical algorithm [vzGG13, Algorithm 2.5] uses O(D^2) multiplications in F 2 n .
Algorithm 8 computes the modular reduction specialized for HFE polynomials. Let K be the number of terms of an HFE polynomial F (Section 1.6). For a fixed i, each non-leading term of F is multiplied by coef_{i−D}(Q), then the result is added to R, requiring K − 1 multiplications and additions in F 2 n per iteration. This method is constant-time because we compute the multiplication by coef_{i−D}(Q) even when it is null.
Algorithm 8 Euclidean division by a monic polynomial.
function Euclidean(P ∈ F 2 n [X] of degree at most 2D − 2, F ∈ F 2 n [X] monic of degree D)
    R ← P
    for i from 2D − 2 down to D do
        coef_{i−D}(Q) ← coef_i(R)
        R ← R − coef_{i−D}(Q) · X^{i−D} · F    ▷ The new R has a degree at most i − 1.
    end for
    return Q, R
end function
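A C sketch of this division, with the divisor given by its sparse non-leading terms and the toy coefficient field F 2 8 in place of F 2 n (our own names and layout):

```c
#include <assert.h>
#include <stdint.h>

/* Toy coefficient field F_2^8 (AES polynomial), for illustration only. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1) r ^= a;
        b >>= 1;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0));
    }
    return r;
}

/* Euclidean division of p (degree <= 2D-2, reduced in place to the
 * remainder) by a sparse monic divisor of degree D, given by its K-1
 * non-leading terms (degrees fdeg[], coefficients fcoef[]). Each of
 * the D-1 iterations costs K-1 field multiplications, performed even
 * when the quotient coefficient is zero (constant-time flavour). */
static void div_sparse_monic(uint8_t *p, int D, uint8_t *q,
                             const int *fdeg, const uint8_t *fcoef, int terms)
{
    for (int i = 2 * D - 2; i >= D; i--) {
        uint8_t c = p[i];        /* quotient coefficient: divisor is monic */
        q[i - D] = c;
        p[i] = 0;
        for (int t = 0; t < terms; t++)
            p[i - D + fdeg[t]] ^= gf_mul(c, fcoef[t]);
    }
}
```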

Improving Euclidean Division for Special HFE Polynomials
The previous method does not exploit the fact that during Step 1 of Algorithm 6, the dividend is a square: its terms of odd degree are null. We show here that the complexity can be divided by two, as a function of the term of highest odd degree of the divisor. For this, we introduce a new notation. Let P ∈ F 2 n [X]; we denote by D(P ) the largest odd integer i such that coef_i(P ) ≠ 0. If no such i exists, we set D(P ) = −∞. The following lemma permits demonstrating the main result (Theorem 1) of this part.

Lemma 1. Let A ∈ F 2 n [X] be a polynomial of degree at most 2D − 2 such that D(A) = −∞, let H = Σ_i h_i X^i ∈ F 2 n [X] be a monic polynomial of degree D, let Q = Σ_i q_i X^i and R be respectively the quotient and remainder of the Euclidean division of A by H, and let d = D(H). Then q_i = 0 for all odd i such that i > d − 2.

Proof. By definition of Q, q_i = 0 for i < 0 and i > D − 2, so we show the lemma for the values of q_i such that max(−1, d − 2) < i ≤ D − 2. For this, we use a proof by induction on an odd j such that D − 1 ≥ j > max(−1, d − 2). The base case is trivial since q_j = 0 for j > D − 2. Assume that q_k = 0 for all odd k such that D − 2 ≥ k > j > max(−1, d − 2). To show that q_j = 0, we first show these two properties:
(1) coef_{D+j}(HQ) = 0;
(2) coef_{D+j}(HQ) = q_j h_D.
Proof of these two properties: (1) By definition, A = HQ + R, and so coef_{D+j}(HQ) = coef_{D+j}(A) = 0, because D + j is odd, D(A) = −∞ and degree(R) < D ≤ D + j. (2) HQ = Σ_{r=0}^{2D−2} (Σ_{ℓ=0}^{r} q_ℓ h_{r−ℓ}) X^r, so coef_{D+j}(HQ) = Σ_{ℓ=0}^{D+j} q_ℓ h_{D+j−ℓ}. But q_ℓ = 0 for ℓ > D − 2 and h_{D+j−ℓ} = 0 for D + j − ℓ > D, so coef_{D+j}(HQ) = Σ_{ℓ=j}^{D−2} q_ℓ h_{D+j−ℓ}. When ℓ > j is odd, q_ℓ = 0 by the induction hypothesis. When ℓ is even, h_{D+j−ℓ} = 0 because D + j − ℓ is odd and D(H) = d < D + j − ℓ. So Σ_{ℓ=j}^{D−2} q_ℓ h_{D+j−ℓ} = q_j h_D. These two properties imply coef_{D+j}(HQ) = q_j h_D = 0. Because h_D ≠ 0, this implies that q_j = 0.
We can now demonstrate Theorem 1.
Theorem 1. Let H be an HFE polynomial of degree D in F 2 n [X] where the k terms of highest odd degree have been removed (k ∈ ⟦0, ⌈log_2(D)⌉ − 1⟧), and let A ∈ F 2 n [X] be a square of degree at most 2D − 2. If D is even, then the computation of the classical Euclidean division (Algorithm 8) of A by H can be accelerated by a factor of at most two.

Proof. During Algorithm 8, A is a square, so D(A) = −∞ and Lemma 1 can be applied to it. Let d = D(H). The iterations where i is odd and strictly greater than D + d − 2 can be removed, because coef_{i−D}(Q) = 0. So, the number of iterations where i is odd is max(((D + d − 2) − (D − 1))/2, 0) = max((d − 1)/2, 0), whereas the number of iterations where i is even is D/2. So, Algorithm 8 can be used with max((D + d − 1)/2, D/2) iterations. Next, H is an HFE polynomial, so d = 1 or d = 2^j + 1 for j > 0. By removing the terms of degree 2^j + 1 for j from ⌈log_2(D)⌉ − 1 down to ⌈log_2(D)⌉ − k, d equals 1 or 2^{⌈log_2(D)⌉−k−1} + 1. It implies that the number of iterations can be written as D/2 + 2^{⌈log_2(D)⌉−k−2}. Algorithm 8 requires D − 1 iterations, so the proposed modification accelerates it by a factor (D − 1)/(D/2 + 2^{⌈log_2(D)⌉−k−2}). This factor is at most 2.
Let K be the number of terms of the HFE polynomial (before removing terms), and let k be the number of removed terms. For k = 0, the modular reduction costs (D − 1)(K − 1) multiplications in F 2 n (Section 3.3), whereas by removing k terms (with D even), the cost is (D/2 + 2^{⌈log_2(D)⌉−k−2})(K − 1 − k) multiplications. The main gain comes from decreasing the number of loop iterations of Algorithm 8. However, there is also a slight speed-up generated by the fact that the removed terms no longer require multiplications.
When F has no term of odd degree at all, we obtain that during the computation of X^{2^n} mod F , none of the odd degree terms appear, because R = A − F Q and A, F and Q do not have odd degree terms. This result allows doing the computations only for the even degree terms, dividing by two the cost of the squaring and of the modular reduction. But in practice, removing all the terms of odd degree of the HFE polynomial decreases the security: by applying to F the change of variable Y = X^2, the degree of the resulting polynomial is only D/2. We will show in Table 11 that in this case, D must be multiplied by two to obtain the original security.
Table 11 studies the impact of d = D(F ) on the theoretical speed-up over the classical Euclidean division compared to the case D = 513, and on the security. We have done an experimental test to analyze the degree of regularity [FJ03, BFS04, BFSY05, BFP13] as a function of the number of removed terms. The degree of regularity is a tool to analyze the security of HFE-based schemes against Gröbner basis attacks. We measure it during the Gröbner basis attack on HFE for n = m = 30. We observe in practice that removing a small number of odd degree terms appears not to affect the security. The security is decreased when the second-to-last odd term is removed. The results confirm that the security of attacking F of degree D without odd degree terms is the same as that of attacking an HFE polynomial of degree D/2: the degree of regularity increases between D/2 = 512 and D/2 = 513. The case d = 1 seems to have the same behavior, but in the general case, the degree of regularity does not necessarily decrease (for D = 130, the degree of regularity is 4 for d = −∞ but 5 for d = 1). The column "speed-up on i" corresponds to the speed-up obtained by decreasing the number of iterations of Algorithm 8. The other speed-up column is the total speed-up, which also uses the fact that removing terms decreases the number of multiplications per iteration. Removing the highest odd terms generates the main part of the maximal speed-up. In practice, we propose to choose d = 65. This implies removing one term when D = 130 and three terms when D = 514.

Frobenius Map in F 2 n [X]
The core of Algorithm 6 (Section 3.1) is to compute X^{2^n} mod F during Step 1. As in Section 2.4, we compare the classical repeated squaring algorithm with the version using multi-squaring tables. The main difference with Section 2.4 is that the coefficients are not in F 2 but in F 2 n . So, the tables are too large to be precomputed. However, they can be computed more quickly by exploiting the HFE structure of F . Both presented methods are constant-time.
Classical repeated squaring algorithm. We can compute X^{2^n} mod F using the repeated squaring algorithm [vzGG13, Algorithm 4.8]. This requires n steps of modular squaring. More precisely, the number of steps is n − ⌈log_2(D)⌉, because the modular reduction is useless when the degree of X^{2^i}, 1 ≤ i ≤ n, is less than D. So, we compute (X^{2^{⌈log_2(D)⌉}})^{2^{n−⌈log_2(D)⌉}} mod F (or (F − X^D)^{2^{n−log_2(D)}} mod F when D is a power of two). This remark permits avoiding useless computations in the constant-time implementations. This method requires (n − ⌈log_2(D)⌉)·D field squarings and n − ⌈log_2(D)⌉ calls to Algorithm 8, i.e. O(nD log_2(D)^2) field multiplications. This method can be improved with the trick from Section 3.4. To improve the performance, we compute the repeated squaring in place. To do so, we allocate a buffer of 2D − 1 coefficients in F 2 n . The squaring and the modular reduction directly modify the current result.
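As a toy illustration of this strategy, with coefficients in F 2 instead of F 2 n and the field-defining trinomial playing the role of F , the following C sketch computes X^{2^n} mod f by n modular squarings; for an irreducible f of degree n, the result must be X itself, which makes a convenient self-check:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N  233  /* f(x) = x^233 + x^74 + 1 (irreducible trinomial, our example) */
#define K1 74
#define W  8    /* 8 words hold polynomials of degree <= 464 */

static int get_bit(const uint64_t *r, int i) { return (r[i >> 6] >> (i & 63)) & 1; }
static void flip_bit(uint64_t *r, int i)     { r[i >> 6] ^= (uint64_t)1 << (i & 63); }

/* One step of repeated squaring in F_2[X]/(f): square (bit i -> bit 2i,
 * since the cross terms vanish in characteristic two), then reduce
 * modulo f. The MQsoft Frobenius map performs the analogous squaring
 * and sparse reduction with coefficients in F_2^n and an HFE modulus. */
static void sqr_mod(uint64_t *r)
{
    uint64_t s[W];
    memset(s, 0, sizeof(s));
    for (int i = 0; i < N; i++)
        if (get_bit(r, i)) flip_bit(s, 2 * i);
    for (int i = 2 * N - 2; i >= N; i--)   /* fold x^i = x^{i-N+K1} + x^{i-N} */
        if (get_bit(s, i)) {
            flip_bit(s, i);
            flip_bit(s, i - N + K1);
            flip_bit(s, i - N);
        }
    memcpy(r, s, sizeof(s));
}
```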

Repeated squaring algorithm with multi-squaring tables. The authors of [PCY + 15] propose to compute several squarings before reducing by F . Let i be this number of squarings.
Computing i squarings creates a result of degree (D − 1)·2^i, but only the terms X^{j·2^i} for j ∈ {0, . . ., D − 1} are non-null. To compute the reduction, we first compute once a table of the X^{j·2^i} mod F for j ∈ {0, . . ., D − 1}, then multiply each coefficient by the corresponding element of the table. The table is computed once and for all, and is re-used for each modular reduction.
To create the table, we compute each X^{j·2^i} mod F as (X^{(j−1)·2^i} mod F ) · X^{2^i} mod F . The multiplication by X^{2^i} is just a shift by 2^i, and all the terms of degree strictly greater than D − 1 are reduced with Algorithm 8 (by replacing 2D − 2 by D − 1 + 2^i). The table is useful only when X^{j·2^i} is not already reduced modulo F , i.e. when j·2^i ≥ D, which implies j ≥ ⌈D/2^i⌉. This table requires storing (D − ⌈D/2^i⌉)·D elements of F 2 n , and D − ⌈D/2^i⌉ calls to Algorithm 8 are required to generate it, costing O(2^i (K − 1)(D − ⌈D/2^i⌉)) field multiplications. To compute X^{2^n} mod F , we take X^{2^{⌈log_2(D−1)⌉+i}} mod F from the table, then we compute ⌊(n − i − ⌈log_2(D − 1)⌉)/i⌋ steps of modular multi-squarings. Each step requires raising D elements of F 2 n to the power 2^i, then multiplying each of them by the corresponding element of the table. It costs i·D field squarings and (D − ⌈D/2^i⌉)·D field multiplications. To obtain X^{2^n} mod F , we terminate by ((n − i − ⌈log_2(D − 1)⌉) mod i) steps of modular squarings with the classical repeated squaring algorithm. The final cost of this method is (n − i − ⌈log_2(D − 1)⌉)·D = O(nD) field squarings and approximately ((n − i − ⌈log_2(D − 1)⌉)/i)·(D − ⌈D/2^i⌉)·D field multiplications. To choose the best algorithm for the Frobenius map, we just choose the one which minimizes the number of field multiplications. In practice, the repeated squaring algorithm is the best when approximately D ≥ n, whereas the multi-squaring version is the best when n > D.
Table 12 summarizes the performance of both strategies for the Frobenius map, and compares our implementation to NTL and Magma. We use the Modexp function from Magma and the PlainFrobeniusMap function from NTL, both computing X^{2^n} mod F . We have also studied the strategy from Section 3.4, which improves the Frobenius map by removing odd degree terms of the HFE polynomial. We choose d = 65, which requires removing one term when D = 130 and three terms when D = 514. The results confirm the theoretical speed-ups: MQsoft saves approximately 25% and 44% of the computations by removing respectively one and three terms for the first strategy. Magma is also improved by this trick, probably because it also uses the classical Euclidean division and does not compute multiplications by zero. It is not the case for NTL, because it uses the fast Euclidean division [vzGG13, Algorithm 9.5].
The multi-squaring strategy is the fastest when D is small compared to n. However, for n = 354 and D = 129, setting d = 65 is enough to change the trade-off between the two strategies. Removing odd degree terms is interesting for the HFE-based NIST submissions, which use D equal to 129 or more. However, this implies increasing by one the original parameters (D = 129 and D = 513). Without modifying them, our best Frobenius map is 7 to 12 times faster than NTL.

GCD in F 2 n [X]
Step 2 of Algorithm 6 requires computing the GCD of two polynomials of degree at most D in F 2 n [X]. The NTL library provides only the classical algorithm [vzGG13, Algorithm 3.5], which uses O(D^2) field multiplications and O(D) field inversions. We have implemented the half-GCD algorithm ([vzGG13, Algorithm 11.8], [BCG + 17, Algorithm 6.8]), which uses a quasi-linear number of multiplications in F 2 n . Our implementation of the half-GCD is based on the Karatsuba polynomial multiplication (Appendix C) and on the fast Euclidean division (Appendix D). These algorithms are implemented with constant-time arithmetic, but the GCD runs in variable time, because the number of successive remainders is variable.
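The variable-time structure of the classical Euclidean GCD can be sketched on binary polynomials packed in a word (a simplification of our own: over F 2 the leading coefficient is always 1, whereas over F 2 n each reduction step also needs the inversion of Section 2.5):

```c
#include <assert.h>
#include <stdint.h>

/* Degree of a binary polynomial packed in a word; -1 for the zero
 * polynomial. */
static int pdeg(uint64_t a)
{
    int d = -1;
    while (a) { d++; a >>= 1; }
    return d;
}

/* Classical Euclidean GCD of two binary polynomials. The number of
 * loop iterations (the number of successive remainders) depends on
 * the inputs: this is what makes the GCD step variable-time even when
 * the underlying field arithmetic is constant-time. */
static uint64_t poly_gcd(uint64_t a, uint64_t b)
{
    while (b) {
        while (pdeg(a) >= pdeg(b))           /* reduce a modulo b */
            a ^= b << (pdeg(a) - pdeg(b));
        uint64_t t = a; a = b; b = t;        /* (a, b) <- (b, a mod b) */
    }
    return a;
}
```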
Table 13 compares the performance of the GCD algorithms. Our classical GCD is three to four times better than NTL's. This results from the difference in performance of our operations in F 2 n . The half-GCD becomes faster only for high degrees (approximately 8193).

Generation and Evaluation of the Public-key
In this section, we study how to implement efficiently important steps of the keypair generation and of the verifying process. Both are based on multivariate quadratic systems that we represent as quadratic forms.

Generation of the Inner Secret-key Polynomial f
For the sake of completeness, we explain here the method used to compute f during the keypair generation (Section 1.2). It requires computing F (Equation (3)). As a first step, we assume that F does not have vinegar variables. We use two classical properties:
• The terms of degree strictly greater than 1 of F can be represented as a quadratic form XQX^t, where X = (X, X^2, X^{2^2}, X^{2^3}, . . ., X^{2^{⌈log_2(D)⌉}}) and Q ∈ M_{⌈log_2(D)⌉+1}(F 2 n ) is upper triangular. We then have the relation F (X) = C + B_0 X^t + XQX^t. In particular, Q_{i,j} corresponds to the term X^{2^i + 2^j} of F .
• The linearity of the Frobenius endomorphism x → x^2.
With these two properties, it is easy to verify that X = xΓ, where x = (x_1, . . ., x_n) and Γ ∈ M_{n,⌈log_2(D)⌉+1}(F 2 n ) is such that Γ_{i,j} = θ_{i+1}^{2^j}. So, we deduce the multivariate quadratic form of F : F = xQ'x^t + B_0 θx^t + C, where Q' = ΓQΓ^t. In particular, Q'_{i,j} corresponds to the term x_i x_j of F . Since x_i^2 = x_i, we can add each coefficient (B_0 θ)_i of the linear part to the term Q'_{i,i} of Q'. To simplify the computation of f , θ (Section 1.2) is the same basis as the one used to represent an element of F 2 n (Section 1.6). In this way, f = ϕ(xQ'x^t + C) and we store ϕ^{−1}(f ). This is a monomial representation of f .
To compute f , we first compute Γ with O(n ⌈log_2(D)⌉) field squarings. This matrix does not depend on F , so it can be precomputed.
For the linear terms involving the vinegar variables, we must compute Xβ^t. We have β = vV^t, and so Xβ^t = xΓV v^t. ΓV is the matrix whose coefficient (i, j) corresponds to the term x_i v_j.
The final result is the sum of these contributions. We can notice that this representation of f permits easily applying the linear change of variables given by the matrix S: we just replace (x v) by (x v)S.

Evaluation of Multivariate Quadratic Systems in F 2
The evaluation of the public-key is the main part of the verifying process (Section 1.4). It is iterated nb_ite times. Since the verification is a public process, it does not require protection against timing attacks. So, we can exploit the fact that for a random input, the evaluation of a monomial x_i x_j in F 2 has a probability of 75% of being null, and so avoid 75% of the computations. However, the constant-time evaluation is required during the signature generation to evaluate the constant of the HFEv polynomial, which is quadratic in the vinegar variables. It is also used in other contexts, for example to encrypt a message with an HFE-based encryption scheme. In this section, we study both the variable-time and the constant-time evaluation. We have used the avx2 instruction set. We have chosen the monomial representation as in [CLP + 18], because it naturally exploits the fact that on average, 75% of the monomials are null.
Our variable-time evaluation uses only the classical method [BBG06]: we initialize an accumulator acc to the constant term of p, and for each term p_{i,j} x_i x_j with p_{i,j} ∈ F 2 m , we add p_{i,j} to acc only if x_i = x_j = 1. This process is described in Algorithm 9.
We have vectorized Algorithm 9. To do so, we store acc in 256-bit registers, and we use 256-bit loads, stores and bitwise XORs to do the vector computations. When ⌈m/64⌉ is not a multiple of 4, we sometimes add the use of 64-bit or 128-bit registers to speed up the implementation. Algorithm 9 is vulnerable to timing attacks, since the addition is done only if x_i = x_j = 1. The traditional way to avoid this attack is to replace the conditional statement by a multiplication by x_i (respectively x_j). But x_i and x_j are in F 2 , so the multiplication can be accelerated: it is equivalent to applying a mask which is the duplication of x_i (respectively x_j) m times. With this strategy, we obtain Algorithm 10.
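This masking strategy can be sketched in scalar C for m ≤ 64 equations (our own layout, one word per monomial; MQsoft packs several words per equation set and uses avx2):

```c
#include <assert.h>
#include <stdint.h>

/* Constant-time evaluation of m <= 64 quadratic equations over F_2.
 * The system is stored in the monomial representation: one word of m
 * bits per monomial x_i x_j (i <= j), plus one constant word. The
 * conditional "add p_ij if x_i = x_j = 1" of the variable-time method
 * is replaced by ANDing with masks obtained by duplicating the bits
 * x_i and x_j over the whole word, so every word of the public key is
 * read and XORed regardless of the input. */
static uint64_t mq_eval_ct(const uint64_t *p, uint64_t cst,
                           const uint64_t *x_words, int nvars)
{
    uint64_t acc = cst;
    int idx = 0;
    for (int i = 0; i < nvars; i++) {
        uint64_t mi = -((x_words[i >> 6] >> (i & 63)) & 1); /* x_i copied 64 times */
        uint64_t acc_i = 0;
        for (int j = i; j < nvars; j++) {
            uint64_t mj = -((x_words[j >> 6] >> (j & 63)) & 1);
            acc_i ^= p[idx++] & mj;      /* always executed, even if x_j = 0 */
        }
        acc ^= acc_i & mi;
    }
    return acc;
}
```

Note that x_i x_i = x_i, so the diagonal words hold the linear terms.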
To vectorize Algorithm 10, acc and acc_i are stored in 256-bit registers. However, the optimal way to put each mask in a 256-bit register is not trivial. On the one hand, we can store 256-bit masks in the buffer mask: in this way, one load permits creating the mask. Table 15 shows the performance of the evaluation that uses avx2. To improve the performance, we use the option -funroll-loops of gcc, which unrolls loops to improve the use of the pipeline. The performance factor between the variable-time and constant-time implementations depends on m: the factor is two for small values of m and four for large values. The performance is affected by cache penalties when the public-key is too large. For m = n + v = 256, we compare our code with the efficient implementation of [CLP + 18], by using a similar processor (ServerH). We have similar times for the constant-time implementation, and a speed-up of 1.38 for the variable-time implementation. This speed-up is mainly due to the unrolled loops. Moreover, we have split the loop on i (respectively the loop on j) into two loops with a Euclidean division by 64: the first is a loop for iq ∈ ⟦0, ⌊i/64⌋⟧, and the second is a loop for ir ∈ ⟦0, 63⟧. In this way, to extract x_i, which is the i-th bit of a vector of words, we take the ir-th bit of the iq-th word. This simplifies the extraction of bits from 64-bit variables.
For the constant-time evaluation, we obtained our best times on Haswell with Algorithm 11. However, on Skylake, Algorithm 12 is faster. For 256 equations in 256 variables (cf. Table 16), we obtain 61.4 Kc with Algorithm 11 against 55.5 Kc with Algorithm 12. Since Algorithm 11 is the state of the art on Haswell, we obtain a new speed record on Skylake, by a factor of 1.1. For comparison, we obtain 23.2 Kc for the variable-time evaluation.
For m requiring one word (respectively two words), we use the 256-bit registers to compute in F 2 m on packs of four elements (respectively two elements). This method requires masks to compute q_{i,j} x_j for four (respectively two) successive values of j. Optimizing the cases where m requires one or two words is important because it enables a new parallelization strategy: with k cores, the public-key can be split into k packets of 64 equations (respectively 128 equations), and each core applies the evaluation once to its part of the public-key. This method is interesting because the number of cache misses decreases, since each core handles only a part of the public-key. More generally, m can be split so as to use the evaluation algorithms for a smaller number of equations.

Hybrid representation of the multivariate quadratic systems. When m mod 64 is small, the monomial representation of the public-key is not optimal for the variable-time evaluation.
The addition in F 2 m can be computed with 64-bit, 128-bit and 256-bit XORs, so when m mod 64 is small, many bits are unused during the computations. In memory, we use ⌈m/64⌉ words to store each monomial, with the unused bits set to zero. When m mod 64 is small, it is more efficient to store the last m mod 64 equations separately. This gives an optimal representation for the bulk of the equations, then an optimized representation for the few remaining ones. We have applied this idea to GeMSS256, for which m is 324 and so m mod 64 is equal to 4. The first 320 equations are stored with the monomial representation, whereas the last four equations are stored one by one. With this method, we save 18% of the practical size of the public-key, and we obtain a slight speed-up of 5% in the verifying process.
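The per-monomial storage of the two layouts can be counted as follows (a purely illustrative bit count with hypothetical helper names; the 18% figure reported above also reflects the implementation's exact layout and padding):

```c
#include <stdint.h>

/* Bits of storage per monomial of the public-key. In the plain monomial
   representation, each monomial occupies ceil(m/64) full words; in the
   hybrid one, the last (m mod 64) equations are stored one by one, so a
   monomial costs one packed part plus (m mod 64) single bits. */
uint64_t monomial_bits(uint64_t m) { return 64 * ((m + 63) / 64); }
uint64_t hybrid_bits(uint64_t m)
{
    uint64_t r = m % 64;              /* equations stored separately */
    return 64 * ((m - r) / 64) + r;   /* packed words + r single bits */
}
```

For m = 324 this gives 384 bits against 324 bits per monomial, showing why a small m mod 64 makes the hybrid layout attractive.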

Performance of GeMSS, Gui and DualModeMS
In this section, we show the speed-ups obtained for GeMSS, Gui and DualModeMS thanks to MQsoft. The differences between Gui and GeMSS are explained in Section 1.5. MQsoft uses the SHA-3 function from the Keccak Code Package [GBA13] and the SHA-2 function from OpenSSL. Random elements are generated with the deterministic random bytes generator provided by NIST during the competition (which is based on AES from OpenSSL). The original implementation of GeMSS requires the NTL library for root finding. In MQsoft, NTL has been completely removed. DualModeMS is based on the GeMSS implementation and so is naturally supported by MQsoft.

Performance of NIST Implementations
Table 17 summarizes the performance of the GeMSS additional (best) implementation and of the Gui PCLMULQDQ implementation, as submitted to the NIST PQC standardization process. The measurements of GeMSS have been corrected since the submission: the parameter D had been erroneously set to 512 in the implementation (and during the measurements). Because the Frobenius map is not implemented in constant time, D = 512 saves 27% of the computations in the critical part of the signature generation (cf. Table 11). As explained in Section 1.5, we have removed the EUF-CMA security property of Gui in the PCLMULQDQ implementation, because this property is not available in the original GeMSS implementation. Tables 18 and 19 also summarize the speed-up between the NIST submissions and MQsoft. For GeMSS, the main part of the implementation, which was in variable time, has been modified to run in constant time. To obtain these results, we have improved the complexity of the generation of f by using the method explained in Section 4.1, and we have vectorized the multiplication by T (cf. Section 5.4). We have improved the arithmetic in F 2 n with a better multiplication, a vectorial modular reduction and an inverse computed with ITMIA.
In the original implementation, the GCD was computed by NTL. With our efficient computation of the inverse in the multiplicative group of F 2 n , the GCD becomes negligible compared to the Frobenius map. The verifying process is accelerated by a factor between 1.7 and 2, thanks to our vectorial implementation of the evaluation of the public-key. For our implementation of Gui, we obtain a speed-up of at least a factor of 12 for the keypair generation, probably because the original implementation uses evaluations of F followed by an interpolation to compute f. The signature generation is faster by a factor between 1.2 and 2.5, thanks to our efficient implementation of the arithmetic in F 2 n . We target Skylake processors, whereas the original implementation targets Haswell processors; for this reason, the speed-up should be less pronounced on Haswell. However, the modular reduction of the NIST submission uses PCLMULQDQ; this method is probably slower on Haswell than the shift-and-add strategy. Moreover, the original implementation uses multi-squaring tables to compute the Frobenius map (Section 3.5). For the 256-bit security level, this is inefficient compared to the classical repeated squaring algorithm (cf. Table 12 for D = 513). This error has been corrected in libpqcrypto, and so in SUPERCOP (SUPERCOP uses the implementation of libpqcrypto). The verifying process is faster by a factor between 1.6 and 1.9.

Statistics on the Performance of GeMSS and Gui
We propose here another method to evaluate the performance of GeMSS and Gui. The experimental protocol is the following. First, we measure the performance of a small number of keypair generations (between 10 and 100). This set of measurements permits computing the average, the standard deviation, the median, as well as the first and third quartiles. Second, for three keypairs, we measure the performance of the signing and verifying processes on a small number of documents (between 256 and 2048). To do so, we set the length of each document to 59 bytes, then randomly generate the documents and store them in a buffer. We then measure the performance of the signing process on these documents, storing each signature in a second buffer. Finally, we use this second buffer to measure the performance of the verifying process. The code was compiled with gcc -O4 -mavx2 -mpclmul -mpopcnt -funroll-loops, using version 6.5.0 of gcc. In Table 20, we present the results for the 128-bit security level, with three significant digits. The complete table is on our website [MQs18]. In this section, we only measure the Gui implementation from libpqcrypto. The main difference with the original implementation is the correction of a mistake in the computation of the Frobenius map (as explained in the previous section). This change only impacts the performance of the signing process of Gui-448. We have also updated MQsoft to achieve the EUF-CMA security property, in order to be comparable to libpqcrypto. The EUF-CMA version implies a slowdown of the signing process. We do not measure the performance of the NIST submission of GeMSS because it does not achieve this property.
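The order statistics used in this protocol can be computed with a routine like the following (a sketch with hypothetical names; we use a simple index convention for the quartiles, one of several possible):

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparison of cycle counts for qsort: avoids overflow of x - y. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the n measurements in place and read off the first quartile,
   median and third quartile at indices n/4, n/2 and 3n/4. */
void quartiles(uint64_t *t, size_t n,
               uint64_t *q1, uint64_t *med, uint64_t *q3)
{
    qsort(t, n, sizeof *t, cmp_u64);
    *q1  = t[n / 4];
    *med = t[n / 2];
    *q3  = t[(3 * n) / 4];
}
```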
In Table 20, we note that the timings of the signing process are very unstable. This is explained by the fact that, during Algorithm 2, the root finding algorithm is repeated when no root is found. The verifying process is slightly unstable, since the evaluation of the public-key is implemented in variable time.

Conclusion
MQsoft is an efficient library for HFE-based multivariate cryptography. We obtain significant speed-ups for GeMSS, Gui and the signature generation of DualModeMS.
Our library provides efficient constant-time arithmetic, which is on average four times faster than NTL. We have proposed a new method that improves root finding for specific HFE polynomials, but its security must still be analyzed in depth. We have exploited the architecture to obtain efficient implementations of the evaluation of multivariate quadratic systems over F 2 . However, our library can still be improved. In particular, the keypair generation is not completely vectorized. A large part of the library is constant-time, but it is not yet completely protected against timing attacks. The generation and inversion of random invertible matrices require constant-time Gaussian elimination [BCS13], which is not implemented in MQsoft for the moment. Moreover, the GCD is implemented in variable time, since the number of iterations is variable. To address this, the Gui submission proposes a constant-time implementation which returns a correct GCD only if its degree is one.
During the review process, GeMSS was selected for the second round of the NIST PQC standardization process. MQsoft will make it possible to improve the performance of GeMSS, as well as to propose other trade-offs between security and performance.
• Computation of the quotient with Equation (13).
• Computation of the remainder: R = A − QF mod X D .
The fast Euclidean division can be used to compute the Frobenius map (Section 3.5). In the case of HFE-, Rec_D(F)^{-1} mod X^{D-1} can be precomputed and stored with the secret-key, because this computation does not depend on the constant term of F (in Section 1.3, we are looking for a root of F (X, v) − D ). For the vinegar variant, the linear terms of F depend on the choice of the vinegar variables, and so Rec_D(F)^{-1} mod X^{D-1} must be computed for each such choice. A precomputed table would thus require 2^v (D − 1) n bits, which is huge. However, when D is odd and D − 1 is a power of two, the computation of Rec_D(F)^{-1} mod X^{D-1} can be improved. Because F has a HFE structure, we have that: and so: With this formula, only two Newton iterations are required to compute Rec_D(F)^{-1} mod X^{D-1} (by computing Rec_D(F)^{-1} modulo X^{(D-1)/2}, then modulo X^{D-1}). These two steps require only two multiplications in F 2 n [X] modulo a power of X. With a sufficiently fast multiplication, the fast algorithm is asymptotically better than the classical Euclidean division (Section 3.3). However, a fast Euclidean division based on Karatsuba multiplication (Appendix C) is inefficient compared to the classical Euclidean division, which exploits the sparse structure of F. This explains why NTL is slow to compute the Frobenius map in Section 3.5.
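The Newton iteration simplifies pleasantly in characteristic two, which can be illustrated on a toy example over F 2 [X] with bit-packed coefficients (hypothetical helper names; the library applies the same doubling steps to Rec_D(F) in F 2 n [X], where only the last two iterations are needed). The classical step G ← G(2 − FG) collapses to G ← F·G² mod X^{2t}, since 2G = 0 and (FG)² ≡ 1 mod X^{2t} whenever FG ≡ 1 mod X^t:

```c
#include <stdint.h>

/* Carry-less (F2[X]) product of two polynomials of degree < 32,
   coefficients packed as the bits of a word. */
uint64_t clmul32(uint32_t a, uint32_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 32; i++)
        if ((b >> i) & 1)
            r ^= (uint64_t)a << i;
    return r;
}

/* Newton iteration for the inverse of F modulo X^k, with F(0) = 1 and
   k a power of two (k <= 32 in this toy version). The precision t
   doubles at each step: G <- F*G^2 mod X^{2t}. */
uint32_t inv_series_f2(uint32_t f, int k)
{
    uint32_t g = 1;                             /* inverse mod X^1 */
    for (int t = 1; t < k; t <<= 1) {
        uint64_t g2 = clmul32(g, g);            /* G^2: squaring spreads bits */
        uint64_t h  = clmul32(f, (uint32_t)g2); /* F * G^2 */
        g = (uint32_t)(h & ((1ULL << (2 * t)) - 1)); /* reduce mod X^{2t} */
    }
    return g;
}
```

For instance, the inverse of 1 + X + X² modulo X^8 is 1 + X + X³ + X⁴ + X⁶ + X⁷, and multiplying back gives 1 + X⁹ ≡ 1 mod X^8.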
In the general case, the fast Euclidean division algorithm is interesting: it improves the fast GCD algorithm of Section 3.6. The fast Euclidean division is implemented in MQsoft, as well as the fast Frobenius map, the fast GCD, and thus the fast root finding algorithm. This version is slower for HFE polynomials, but is efficient in the general case.

Figure 1: Dependencies between the different operations performed in MQsoft.
In Section 3, the signature generation requires O(nb_ite × (nD log 2 (D)² + D²)) field multiplications, O(nb_ite × nD) field squarings and O(nb_ite × D) field inversions. We use the polynomial representation defined in Section 1.6; it is the most efficient representation when PCLMULQDQ is available [TFA + 11]. To compute the square (respectively the multiplication) of B in F 2 n , we compute the square (respectively the multiplication) of B in F 2 [X] before reducing the result by the univariate polynomial defining the extension.
The computation of ΓV requires O(nv log 2 (D)) multiplications in F 2 [X] and O(nv) modular reductions. So, the generation of the inner secret polynomials of f requires O(n log 2 (D)) squarings in F 2 n , O(n log 2 (D)(n + v + log 2 (D))) multiplications in F 2 [X] and O(n(n + v + log 2 (D))) modular reductions.

Table 1 :
Speed-up of GeMSS, Gui and DualModeMS (best implementation provided for the NIST submissions versus our implementation).We use a Haswell processor (ServerH).

Table 3 :
OS and Memory.

Table 4 :
Performance of the PCLMULQDQ instruction as a function of the architecture, as presented in the Intel Intrinsics Guide. Table 4 presents the cost of PCLMULQDQ as a function of the architecture. The choice of the best multiplication algorithm in F 2 n depends on the processor. Our choices target the Skylake processors, on which PCLMULQDQ requires only one CPI (cycle per instruction).

Table 5 :
Number of cycles for computing the square of an element of F 2 [X] of degree n − 1, with MQsoft.We use a Skylake processor (LaptopS).
For 385 ≤ n ≤ 576, an element of F 2 n requires 7, 8 or 9 words, and Karatsuba multiplication [vzGG13, Section 8.1] becomes faster than the schoolbook method. We split each input in two: the low bits form a polynomial of degree 255, and the remaining bits a polynomial of degree n − 257. Thus, the Karatsuba algorithm requires one multiplication of polynomials of degree 255, one multiplication of polynomials of degree n − 257, and one multiplication of polynomials of degree max(255, n − 257). These multiplications are computed with the schoolbook multiplication, requiring 16+ n−256
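One level of the Karatsuba split can be sketched on two-word operands, with a portable software stand-in for PCLMULQDQ (hypothetical names; a sketch, not the library's avx2 code). Three 64×64 carry-less products replace the four of the schoolbook method:

```c
#include <stdint.h>

/* Carry-less product of two 64-bit polynomials over F2, returning the
   128-bit result as hi:lo (software stand-in for PCLMULQDQ). */
static void clmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    uint64_t l = 0, h = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1) {
            l ^= a << i;
            if (i) h ^= a >> (64 - i);
        }
    *lo = l; *hi = h;
}

/* One Karatsuba level on A = a1*X^64 + a0 and B = b1*X^64 + b0:
   A*B = a1b1*X^128 + ((a0+a1)(b0+b1) + a0b0 + a1b1)*X^64 + a0b0,
   where additions are XORs. Result r has degree < 256 (4 words). */
void mul128_karatsuba(const uint64_t a[2], const uint64_t b[2], uint64_t r[4])
{
    uint64_t l0, h0, l1, h1, lm, hm;
    clmul64(a[0], b[0], &l0, &h0);               /* a0*b0          */
    clmul64(a[1], b[1], &l1, &h1);               /* a1*b1          */
    clmul64(a[0] ^ a[1], b[0] ^ b[1], &lm, &hm); /* middle product */
    lm ^= l0 ^ l1;                               /* subtract = XOR */
    hm ^= h0 ^ h1;
    r[0] = l0;
    r[1] = h0 ^ lm;
    r[2] = l1 ^ hm;
    r[3] = h1;
}
```

Recursing on this split, or combining a few such levels with schoolbook base cases, gives the multiplications used for 385 ≤ n ≤ 576.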

Table 6 :
Number of cycles to multiply two elements of F 2 [X] of degree n − 1.We use a Skylake processor (LaptopS).

Table 7 :
Number of cycles to compute the modular reduction of an element of F 2 [X] of degree 2n − 2 by f, with MQsoft. We use a Skylake processor (LaptopS). In Table 8, the modular reduction is also measured when used together with the multiplication in F 2 [X]. The performance of the multiplication in F 2 n depends on the context; for this reason, we measure it in two ways:

Table 8 :
Number of cycles to compute the multiplication in F 2 n in function of the modular reduction, with MQsoft.We use a Skylake processor (LaptopS).

Table 9 :
Number of cycles to compute the squaring in F 2 n in function of the enabled instructions, with MQsoft.We use a Skylake processor (LaptopS).

Table 10 :
Number of cycles by operation in F 2 n .We use a Skylake processor (LaptopS).

Table 11 :
Impact of d on the performance and on d reg the degree of regularity.

Table 12 :
Number of mega cycles to compute the Frobenius map of a HFE polynomial.We use a Skylake processor (LaptopS).

Table 13 :
Number of mega cycles to compute the GCD in F 2 n [X] of a D degree polynomial by a D − 1 degree polynomial.We use a Skylake processor (LaptopS).

Performance of the Root Finding Algorithm in F 2 n [X]
Table 14 compares the best root finding implementation of each library, for the parameters of GeMSS and Gui. The results are similar to the performance of the Frobenius map, which is the critical part of the root finding algorithm. MQsoft is five to nine times faster than NTL.

Table 14 :
Number of mega cycles to find the roots of a HFE polynomial. We use a Skylake processor (LaptopS). We have presented in this section the main algorithms implemented in MQsoft to obtain efficient root finding for HFE polynomials. However, our library provides extra functions. The half-gcd requires implementing Karatsuba multiplication and fast Euclidean division in F 2 n [X]. So, we propose a fast version of root finding, based on a Frobenius map using the fast Euclidean division. This yields an efficient implementation of root finding for general applications, using constant-time arithmetic in the base field.
Then, we compute Γ × Q with classical matrix product, requiring O(n log 2 (D) 2 ) multiplications in F 2 [X] and O(n log 2 (D)) modular reductions.Finally, we multiply the previous result by Γ t , requiring O(n 2 log 2 (D)) multiplications in F 2 [X] and O(n 2 ) modular reductions.
To evaluate the public-key p, we can use different representations. The representation by equation consists in storing the m equations of p separately (p ∈ (F 2 [x 1 , . . ., x n+v ]) m ), whereas the representation by monomial consists in storing the monomials of p separately (p ∈ F 2 m [x 1 , . . ., x n+v ], cf. Section 4.1). The authors of [CHR + 16] propose a fast evaluation of the public-key with the representation by equation. In [CLP + 18], the authors present a faster evaluation; to do so, they use a monomial representation of the public-key. Both [CHR + 16] and [CLP + 18],

Table 15 :
Number of kilo cycles to evaluate the public-key with MQsoft, for m = n + v. We use a Haswell processor (ServerH) with the avx2 instruction set. Turbo Boost is not used.

Table 16 :
Number of kilo cycles to evaluate the public-key with MQsoft, for m = n + v.

Table 19 :
Number of mega cycles (Mc) for each cryptographic operation with our library on a Skylake processor (DesktopS), followed by the speed-up between the best implementation provided for the NIST submissions and our implementation. For example, 36 δ: +160% means a performance of 36 Mc with MQsoft, and a performance of 36 × 2.6 = 94 Mc for the NIST implementation.

Table 20 :
Statistics in mega cycles (Mc) for each cryptographic operation, on a Skylake processor (LaptopS). For three different keypairs, documents of length 59 bytes are signed then verified. The EUF-CMA property is implemented for all schemes.

Table 26 :
Number of mega cycles for each cryptographic operation (best implementation provided for the NIST submissions).We use a Haswell processor (ServerH).Turbo Boost is not used.