Compact Circuits for Efficient Möbius Transform

Abstract. The Möbius transform is a linear operation used to compute the evaluations of a Boolean function over all points of its input domain. The operation is very useful in finding the solutions of a system of polynomial equations over GF(2) for obvious reasons. However the operation, although linear, needs an exponential number of logic operations (around n · 2^{n−1} bit-xors) for an n-variable Boolean function. As such, the only known hardware circuit to efficiently compute the Möbius transform requires silicon area that is exponential in n. For Boolean functions whose algebraic degree is bounded by some parameter d, recursive definitions of the Möbius transform exist that require only O(n^{d+1}) space in software. However, converting the mathematical definition of this space-efficient algorithm into a hardware architecture is a non-trivial task, primarily because the recursion notionally leads to a depth-first search in a transition graph, with a context switch at each recursive call, for which a straightforward mapping to hardware is difficult. In this paper we overcome these very challenges in an engineering sense. We propose a space-efficient sequential hardware circuit for the Möbius transform that requires only polynomial circuit area (i.e. O(n^{d+1})) provided the algebraic degree of the Boolean function is limited to d. We show how this circuit can be used as a component to efficiently solve polynomial equations of degree at most d by fast exhaustive search. We propose three different circuit architectures for this, each of which uses the Möbius transform circuit as a core component. We show that asymptotically, all the solutions of a system of m polynomials in n unknowns and algebraic degree d over GF(2) can be found using a circuit of silicon area proportional to m · n^{d+1} and physical time proportional to 2 · log_2(n − d) · 2^{n−d}.


Introduction
Several cryptanalytic problems can be reduced to instances of solving a system of multivariate polynomial equations over GF(2). For example, block ciphers with low multiplicative complexity like LowMC [ARS+15] employ only 3-bit S-boxes of algebraic degree 2. It is known that any single plaintext-ciphertext pair from an r-round instance of LowMC gives rise to a system of equations in the secret key-bits of algebraic degree 2^{r/2} [Din21]. It is also known that forging a signature in public-key signature schemes like UOV can be done by solving a set of quadratic equations over GF(2) [KPG99]. Other than this, there are specific problems in combinatorics, like the graph-coloring problem (i.e. given a graph, decide whether it can be colored using k colors with no two adjacent vertices assigned the same color), which can be reduced to an instance of solving multivariate polynomials over GF(2) [Bar09, Appendix C].
The problem can be stated in the following way: given n indeterminates x_1, x_2, . . ., x_n, and m polynomials f_i ∈ F[x_1, x_2, . . ., x_n] (for i ∈ [1, m]), where F is any finite field, the task is to find common solutions x* ∈ {0, 1}^n such that f_i(x*) = 0 for all i. Over any finite field F, the problem is NP-complete already when the polynomials are quadratic.

Previous Work
To the best of our knowledge, there have been two previous works on hardware/software architectures for fast exhaustive search over GF(2). The main idea is as follows: the secret x* we are looking for is obviously a point which evaluates to zero for all the f_i. Thus at the index x*, the truth tables of all the Boolean polynomials f_i will contain the constant 0. Hence, we are looking for the indices x* at which the logical OR of the truth tables of all the f_i's is 0. In [BCC+10], the authors use the Gray-code technique to evaluate the truth table of each polynomial f_i. Gray codes are binary codes with the property that successive codewords differ in only one bit. There are many methods of constructing such codes in the literature, and one of the simplest is to define the i-th codeword as g_i = i ⊕ (i ≫ 1) (where ≫ denotes the right-shift operator). For example, the eight 3-bit codewords listed sequentially are: 000, 001, 011, 010, 110, 111, 101, 100. Take any polynomial f_i: we want to evaluate f_i over all 2^n points of its input domain. It is more efficient to do this evaluation in the order specified by the Gray code, i.e. first f_i(g_0), then f_i(g_1), f_i(g_2), etc. The reason for this is as follows: note that f_i(g_0) = f_i(⃗0) is just the constant term of f_i. Thereafter, if t is the only bit-position at which the successive codewords g_j and g_{j+1} differ, and we already have the value of f_i(g_j), then we can use a Taylor-like expansion formula for Boolean functions to compute f_i(g_{j+1}):

f_i(g_{j+1}) = f_i(g_j) ⊕ (δf_i/δx_t)(g_j).    (1)

Here δf_i/δx_t is the 1st-order derivative of the function f_i at the coordinate x_t. For example, if f_i = x_1x_2 ⊕ x_3 ⊕ x_1x_4x_5, then δf_i/δx_1 = x_2 ⊕ x_4x_5, δf_i/δx_2 = x_1, δf_i/δx_3 = 1, etc. It is known that the algebraic degree of the derivative is at least one less than that of the original function, and so if the derivative is not a constant or a degree-one function we recursively evaluate the derivative term in Equation (1) with another round of Taylor expansion. The method obviously works best if the function f_i is quadratic, but [BCC+10] showed that it works for moderately higher-degree functions too if some of the derivatives are precomputed and stored in memory. In a follow-up work [BCC+13], the same authors proposed a hardware circuit for the problem, however only for degree-2 functions (which did not need pre-computations). The problem with this approach is the initial pre-computation of derivatives one needs to do, which costs significant computation and thus energy when translated into hardware. Furthermore, the pre-computations have to be done for each polynomial f_i, which implies that derivative computations for one set of polynomials cannot be reused for another set, and so a reduction of complexity via amortization over several polynomial systems is also not possible.
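To make the update rule of Equation (1) concrete, here is a minimal Python sketch (not the authors' code; the representation and function names are ours) that builds the full truth table of a polynomial using Gray-code ordering and precomputed first-order derivatives:

```python
from collections import Counter

def evaluate(monomials, point):
    """Direct evaluation of f at `point` (bit i of `point` is x_i);
    each monomial is a frozenset of variable indices."""
    acc = 0
    for m in monomials:
        acc ^= all((point >> i) & 1 for i in m)
    return int(acc)

def derivative(monomials, t):
    """ANF of the 1st-order derivative df/dx_t: keep monomials containing
    x_t, drop x_t from them; duplicate monomials cancel over GF(2)."""
    c = Counter(m - {t} for m in monomials if t in m)
    return [m for m, k in c.items() if k % 2 == 1]

def gray_truth_table(monomials, n):
    """Evaluate f over all 2^n points in Gray-code order using
    f(g_{j+1}) = f(g_j) xor (df/dx_t)(g_j), where t is the flipped bit."""
    derivs = [derivative(monomials, t) for t in range(n)]
    g, val = 0, evaluate(monomials, 0)   # f(g_0) = constant term of f
    tt = {g: val}
    for j in range(1, 1 << n):
        # position of the single bit that changes between g_{j-1} and g_j
        t = ((j ^ (j >> 1)) ^ ((j - 1) ^ ((j - 1) >> 1))).bit_length() - 1
        val ^= evaluate(derivs[t], g)    # the derivative does not depend on x_t
        g ^= 1 << t
        tt[g] = val
    return tt
```

For a quadratic f_i each derivative is affine, so the per-point update cost is linear in n, which is the source of the speed-up claimed in [BCC+10].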
Another method to compute the truth table of a Boolean polynomial from its algebraic expression is via the Möbius transform. This method does not require pre-computations. The transform can be simply evaluated as ⃗v = M_n · ⃗u, where ⃗u is the 2^n × 1 algebraic normal form (ANF) vector of any n-variable Boolean function, M_n is the 2^n × 2^n binary Möbius matrix, and ⃗v is the truth-table of the function, with its i-th element being the function evaluation at the binary string representation of i. As we will soon see, a naive interpretation of this method requires time and space exponential in n. However, there exist more subtle methods to compute the matrix-vector product given above in polynomial space (bounded by n^{d+1}, where d is the algebraic degree of the Boolean polynomial). Translating this to hardware is a non-trivial task as the underlying algorithm is significantly complex. In this paper, we propose strategies to translate the Möbius transform algorithm into a hardware circuit and demonstrate how to overcome the engineering challenges involved. We then show how multiple instances of the above Möbius transform circuit can be efficiently utilized to solve, or perform fast exhaustive search for, the roots of equation systems over GF(2) whose degree is bounded by some constant d. We show that asymptotically, with silicon footprint proportional to m · n^{d+1}, we can describe a circuit that finds the roots of a system of m polynomial equations of degree d in n unknowns over GF(2).

Comparison with Linearization Algorithms
Linearization-based algorithms like XL [CKPS00] and ElimLin [CB07] also attempt to find the solution of a system of Boolean equations through matrix-manipulation techniques like Gaussian Elimination (GE). The idea is to rewrite every higher-degree monomial in the equation system as a new linear variable. This converts a system of m equations of arbitrary algebraic degree d into a system of m linear equations in around O(n^d) extended variables. Using hardware accelerators for GE like the SMITH framework [BMP+06], one could also describe a circuit that finds roots of the system using silicon area proportional to m · n^d. However, as shown in [Bar09, Sec 12.3], such an approach will generate basis vectors for a space containing an exponential number of false solutions, and it is not immediately clear how efficient circuit architectures can be described to eliminate them.

Contribution and Organization
In this paper we present a novel hardware architecture for the Möbius transform of n-variable Boolean functions of degree ≤ d that requires silicon resources polynomially bounded by n^{d+1}. We use the recursive definition of the transform found in [Din21, Sec 4.2], and identify and solve the engineering difficulties of translating such an algorithm into hardware. Parallel instances of this architecture can be combined to construct hardware solvers that find roots of an underlying equation system over GF(2) by exhaustive search. We describe the architectures of three such solvers, the last of which is able to find all roots of any system of m Boolean equations in n unknowns and algebraic degree d in circuit area proportional to m · n^{d+1} and physical time proportional to 2 · log_2(n − d) · 2^{n−d}.

The rest of the paper is organized in the following manner. Section 2 presents some preliminary lemmas and definitions in this field. In Section 3 we look at the recursive definition of the Möbius transform and explain in detail how the hardware circuit for it is designed. In Section 4, we first show how to combine multiple instances of the Möbius transform circuit into a solver that finds at least one root of the underlying equation system. We then list two variants of this architecture, the last of which is able to find all the roots of the equation system. Section 5 concludes the paper.

Definitions and Preliminaries
Boolean function: An n-variable Boolean function is a map from {0, 1}^n → {0, 1} and it can be uniquely represented by its algebraic expression, called the algebraic normal form or ANF. The algebraic expression of such a function using the (⊕, ·) basis can be written as

f(x_0, x_1, . . ., x_{n−1}) = ⊕_{i=0}^{2^n − 1} a_i · x^i,

where i = i_0 i_1 · · · i_{n−1} is the binary string of length n, with i_j as the individual bits, and x^i is defined as ∏_{j=0}^{n−1} x_j^{i_j}. The ANF vector ⃗u = [a_0, a_1, . . ., a_{2^n −1}] is defined as the 2^n-length string of all the a_i's. For example, consider the 3-variable function f = 1 ⊕ x_0x_1 ⊕ x_2 ⊕ x_0x_2. The function can be expressed as a length-8 bit-vector ⃗u with the bits at the locations given by the binary strings 000, 110, 001 and 101, i.e. 0, 6, 1 and 5, set to 1 and the rest of the bits 0, which is to say that a_0 = a_1 = a_5 = a_6 = 1 and the rest of the a_i = 0.
The algebraic degree of the function (provided the function is not identically null) is defined as the maximum hamming weight of a string i such that a_i = 1. Thus in the previous example, the algebraic degree is 2. For functions of degree d, all the coefficients a_i such that hw(i) > d are naturally 0. Since there are exactly (n choose i) length-n strings of hamming weight i, we can see that the ANF of a degree-d function can be expressed using n_{↓d} := Σ_{i=0}^{d} (n choose i) coefficients.

Möbius Transform: Let ⃗v = [v_0, v_1, . . ., v_{2^n −1}] be the truth-table of the function f, with its i-th element being the function evaluation at the binary string representation of i, i.e. ⃗v = M_n · ⃗u,
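As a quick sanity check of this indexing convention, the following Python fragment (illustrative helper names, not from the paper) builds the ANF vector of the example function and counts the degree-bounded coefficients:

```python
from math import comb

def anf_index(monomial, n):
    """Array index of a monomial (set of variable indices): the exponent
    string i_0 i_1 ... i_{n-1} read as a binary number, i_0 most significant
    (the paper's convention, so x0*x1 -> "110" -> 6)."""
    return sum(1 << (n - 1 - j) for j in monomial)

def n_down(n, d):
    """n_{down d} = sum_{i=0}^{d} C(n, i): number of ANF coefficients
    of an n-variable function of degree at most d."""
    return sum(comb(n, i) for i in range(d + 1))

# Example: f = 1 + x0*x1 + x2 + x0*x2 over 3 variables
n = 3
u = [0] * (1 << n)
for m in [set(), {0, 1}, {2}, {0, 2}]:
    u[anf_index(m, n)] ^= 1   # ones land at positions 0, 6, 1, 5
```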
where M_n is the Möbius matrix of size 2^n × 2^n. The (i, j)-th element of this matrix, m_ij, is given as m_ij = 1 if j ⪯ i, and m_ij = 0 otherwise. The operator ⪯ is a partial order over all binary strings: we say that j ⪯ i if the binary string representing j is less than or equal to the binary string representing i at all indices. For example, 4 ⪯ 5, since 100 is less than or equal to 101 at all bit-locations, but 3 ⪯̸ 4 since 011 exceeds 100 in the last 2 bit-locations.
The Möbius matrix M_n has been widely studied in the literature: for example, it is well known that M_n is lower-triangular and involutive, i.e. M_n · M_n = I_{2^n} over GF(2), so that M_n^{−1} = M_n.
An example of the 8 × 8 Möbius matrix M_3, i.e. for n = 3, is shown in Figure 1. This helps us see an alternative recursive definition of M_n. If we define M_1 = [1 0; 1 1], then M_n = M_1 ⊗ M_{n−1}, where ⊗ is the matrix tensor product. Multiplication of a vector by this matrix can be quickly executed by the butterfly-like operations shown in Figure 2. The butterfly operation shaded in blue is actually multiplication of the input 2-bit vector by the matrix M_1. The figure tells us that for an n-variable function, the algorithm can be done in place (without any additional memory) using around n · 2^{n−1} xor operations and 2^n bits of space.
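The butterfly network of Figure 2 translates directly into the standard in-place algorithm; a short Python rendering (our own sketch) is:

```python
def mobius_transform(u):
    """In-place Mobius transform: ANF vector -> truth table (and back,
    since M_n is involutive).  Mirrors Figure 2: n stages, each performing
    2^(n-1) bit-xors, starting with the most significant index bit."""
    v = list(u)
    N = len(v)
    half = N >> 1
    while half:
        for j in range(N):
            if not (j & half):          # j has a 0 at the current stage's bit
                v[j | half] ^= v[j]     # bottom arm = top xor bottom
            # the top arm v[j] passes through unchanged
        half >>= 1
    return v
```

Applying the function twice returns the original vector, which is exactly the involution property of M_n mentioned above.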

Implementing the Möbius Transform
Given Figure 2, we can think of many strategies to implement the basic transform if one has access to exponential silicon resources. The operation consists of n sequential stages of xor layers, with each layer having exactly 2^{n−1} xor operations over bits.

Expmob1

The first configuration, which we call Expmob1, implements all the n butterfly stages as dedicated circuits in sequence. Consider one_i(x) : {0, 1}^{n−1} → {0, 1}^n to be the function that inserts a 1 in the i-th MSB position of x, and zero_i(x) to be the function that inserts a 0 in the same position, i.e. one_0(1001) = 11001 and zero_0(1001) = 01001, etc. Note that there are a total of 2^{n−1} butterfly operations in each of the n stages. In the i-th stage (for 0 ≤ i ≤ n − 1), the j-th butterfly takes as input the bits at positions zero_i(j) and one_i(j), for all 0 ≤ j ≤ 2^{n−1} − 1. This requires a total of n · 2^{n−1} 2-input xor gates. However, such a circuit is able to compute the transformation in a single cycle.
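A software model of this dataflow (hypothetical helper names; only the bit-insertion index arithmetic is taken from the text) can be sketched as:

```python
def zero_i(i, x, n):
    """Insert a 0 at the i-th MSB position of the (n-1)-bit value x."""
    high = x >> (n - 1 - i)              # bits that land above the insertion point
    low = x & ((1 << (n - 1 - i)) - 1)   # bits that stay below it
    return (high << (n - i)) | low       # the inserted bit itself is 0

def one_i(i, x, n):
    """Insert a 1 at the i-th MSB position of the (n-1)-bit value x."""
    return zero_i(i, x, n) | (1 << (n - 1 - i))

def expmob1(u, n):
    """Expmob1 dataflow: n dedicated butterfly stages; in stage i the j-th
    butterfly xors position zero_i(j) into position one_i(j)."""
    v = list(u)
    for i in range(n):
        for j in range(1 << (n - 1)):
            v[one_i(i, j, n)] ^= v[zero_i(i, j, n)]
    return v
```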

Expmob2
This configuration differs slightly from the previous circuit, in the sense that we have only a single-stage butterfly which we operate over n clock cycles to compute the transform, similar to round-based circuits of block ciphers in which a single round-function circuit is iterated over a given number of cycles. Unlike the round function of a block cipher, however, the successive stages of xor layers are not exactly similar. For example, consider the topmost butterfly circuit in each stage of Fig 2. The 1st stage takes the bits at positions 0 and 4 as input, the second stage takes bits 0 and 2, the third stage takes bits 0 and 1, and so on. So to create a round-based circuit, it would seem that one would need multiple n-to-1 multiplexers before each of the butterfly circuits. However, this can be avoided using a simple observation. Consider π_n to be the permutation that maps the position i = (b_{n−1}, b_{n−2}, . . ., b_0) to the position (b_{n−2}, . . ., b_0, b_{n−1}), i.e. the n-bit binary representation of each position is rotated left by one. The idea is that after the given stage of butterfly circuits, the bit at position i is shifted to position π_n(i). Such a permutation over the bits requires only a re-routing of wires and thus no additional silicon area. This is essentially the entire round-function circuit, which has to be executed for a total of n cycles for the transform to be computed. To see why this works, consider the following facts. Let B_n be the block matrix defined as B_n = M_1 ⊗ I_{n−1}, where I_{n−1} is the identity matrix of size 2^{n−1} × 2^{n−1}. Note that B_n is the transformation defined by the first stage of the butterfly layer in Fig 2. Let P_n be the permutation matrix corresponding to π_n. Then it is easy to verify that the Möbius matrix satisfies M_n = (P_n · B_n)^n.
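The round-based dataflow can be simulated in a few lines of Python; note that the exact rotation direction of π_n is our reconstruction, chosen so that the fixed MSB-pairing butterfly reproduces each stage of Figure 2 in turn:

```python
def expmob2(u, n):
    """Expmob2 dataflow sketch: one fixed butterfly layer B_n (pairing on the
    most significant index bit) followed by the rewiring pi_n that rotates
    each n-bit position left by one, iterated for n rounds."""
    v = list(u)
    half = 1 << (n - 1)
    mask = (1 << n) - 1
    for _ in range(n):
        # butterfly layer B_n = M_1 (x) I_{n-1}: w[j + half] = v[j] ^ v[j + half]
        w = list(v)
        for j in range(half):
            w[j + half] = v[j] ^ v[j + half]
        # permutation layer P_n: the bit at position p moves to rot_left(p)
        v = [0] * (1 << n)
        for p in range(1 << n):
            v[((p << 1) | (p >> (n - 1))) & mask] = w[p]
    # after n rounds the accumulated permutation is the identity
    return v
```

Since π_n^n is the identity, conjugating B_n by powers of P_n yields each of the n butterfly stages exactly once, which is the content of the identity M_n = (P_n · B_n)^n.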

Synthesis Results
In this section we describe the simulation flow followed for each of the circuits reported in the paper. The design was described at the RTL level using a hardware description language and functional correctness was first verified. Thereafter the circuit was synthesized using the Nangate 15nm Open Cell Library [MMR+15], mainly to ensure that the results obtained can be readily reproduced. One of the utilities of the Möbius transform is in solving equation systems; in order to ensure that equation systems are solved as quickly as possible, the circuit compiler was instructed to specifically optimize the total critical path of the circuit. A timing analysis is then performed on the synthesized netlist using a sufficient number of randomly generated test vectors, which outputs the switching statistics of every node in the circuit. This information is used by a power-compiler software to estimate the average power consumed by the circuit. Energy is computed as the product of the average power and the total physical time taken for the circuit to execute a given operation.
In Figure 3, we present comparative synthesis results for the circuits Expmob1 and Expmob2. It can be seen that Expmob1 performs better than Expmob2 in this regard, most probably because additional hold/setup time constraints need to be met in Expmob2 for writing to the register in each cycle. Similarly, the additional energy required for the n successive register writes makes Expmob2 less energy-efficient than Expmob1, as shown in Figure 3c. For a detailed tabulation of the synthesis results, please see Table 2 in Appendix B.
However, both these circuits require an exponential number of logic gates, which starts to become a bottleneck as n increases. We have already seen that a degree-d Boolean function can be represented with only n_{↓d} < n^d binary coefficients, which means that for small values of d the size of the ANF vector is polynomially bounded. Thus the size of the register that holds the ANF can be bounded by n^d. However, it is not possible to use the Expmob1/Expmob2 circuits to compute the transform on this reduced-size register since, although the initial ANF vector is small, the output of each layer of butterflies is progressively larger, until it reaches 2^n (the expected size of the truth table) after the last stage.

Recursive Algorithm for Möbius transform
There exist algorithms that perform the basic transform (on functions limited to degree d) using only polynomial space, i.e. bounded by n^{d+1}. We state the algorithm appearing in [Din21, Sec 4.3]. The algorithm requires a notional depth-first traversal of a transition graph, as shall be explained shortly.
The principal question is how to circumvent the fact that, even if we begin with an ANF vector whose size is polynomially bounded, each butterfly stage is likely to produce an output larger than its input. First let us make the following observation, taking Figure 2 as a reference: consider the initial ANF vector A_0 = [1100 0110] and the vector A_1 = [1100 1010] just after the first layer. The initial vector corresponds to the function f = 1 ⊕ x_0x_1 ⊕ x_2 ⊕ x_0x_2 of the earlier example. The top half of A_1 is the ANF of f(0, x_1, x_2), and its bottom half is the ANF of f(1, x_1, x_2) = f(0, x_1, x_2) ⊕ δf/δx_0, where δf/δx_0 is simply the derivative of f at the coordinate x_0. Both δf/δx_0 and f(0, x_1, x_2) have one variable fewer than the original function, and it is obvious that neither of their algebraic degrees can exceed that of the original function.
Now consider the vectors in the top and bottom halves of A_1, i.e. A_top = [1100] and A_bottom = [1010]. It is easy to observe/verify the following:

A: A_top and A_bottom are the ANF vectors of the reduced functions f(0, x_1, x_2) and f(1, x_1, x_2) respectively.

B: Both A_top/A_bottom are outputs of the butterfly layer whose input is A_0. Whereas A_top is the arm of the butterfly that does not require xor computations, some xor computations are required for A_bottom.

C: The remaining steps from the 2nd stage onwards can be seen as the parallel application of the Möbius transform to the reduced-variable Boolean functions f(0, x_1, . . ., x_{n−1}) and f(1, x_1, . . ., x_{n−1}). Of course, in the figure both transforms are computed in parallel, which requires 2^{n−1} space each, so the total space requirement is 2^n, the same as the original. The idea behind the recursive transform is to perform these two sub-transforms sequentially, i.e. one after the other, so that the same space (i.e. register locations) can be used for both transforms and the cumulative space requirement does not add up. Let us now state the algorithm formally (Algorithm 1). The algorithm is parameterized by two quantities: the number of variables n, and the maximum algebraic degree d that the underlying function can have.
For the sake of simplicity, we have excluded many operational details in the above algorithm to give the reader a better idea of its flow. The space requirement of this algorithm is easy to estimate from the algorithm description. We start out with n_{↓d} coefficients required to store A_0. Thereafter, every successive i-th recursion stage requires (n−i)_{↓d} additional memory, for all 1 ≤ i ≤ n − d. The final stage can use Expmob1/Expmob2 to perform the Möbius transform in place, so no additional memory is required; in our experiments we preferred Expmob1 because it is slightly faster. Hence the total space requirement of this procedure is given by (for a proof of the following please see Appendix C):

S(n, d) = Σ_{i=0}^{n−d} (n−i)_{↓d}, which is polynomially bounded by n^{d+1}.

Notionally speaking, the algorithm listed above describes the depth-first recursion tree shown in Figure 4, where each node of the tree is connected to its two butterfly outputs. The depth-first nature of the structure gives rise to complications even when implementing it in software: the problem is the high number of context switches needed to traverse one level down. In layman's terms, before we can make a downward dive in the tree, the current state information, variables etc. have to be stored in a separate memory location (usually denoted the "call-stack"). This costs time/energy and makes the algorithm less attractive from a practical point of view.
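In software, the recursion reads as follows. This is a simplified Python sketch in the spirit of [Din21] rather than a transcription of Algorithm 1: we recurse all the way down to zero variables instead of switching to Expmob1 at n = d, and we represent each ANF as a set of monomials so that a call at level ℓ only keeps its own reduced ANF alive:

```python
def mobius_recursive(anf, variables, out, base):
    """Depth-first Mobius transform: `anf` is a set of monomials (frozensets
    of variable indices), `variables` the ordered remaining variables (first
    one is the MSB of the truth-table index).  Truth-table bits are written
    into `out` starting at offset `base`.  The two half-size sub-transforms
    run one after the other, so memory does not double at each level."""
    if not variables:
        out[base] = 1 if frozenset() in anf else 0
        return
    x, rest = variables[0], variables[1:]
    g = {m for m in anf if x not in m}      # ANF of f(x=0, ...)  -- A_top
    h = {m - {x} for m in anf if x in m}    # ANF of the derivative df/dx
    mobius_recursive(g, rest, out, base)                          # left subtree
    mobius_recursive(g ^ h, rest, out, base + (1 << len(rest)))   # f(x=1, ...) = g xor h
```

The set symmetric difference `g ^ h` is exactly the GF(2) addition of ANFs performed by the xor arm of the butterfly.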

Hardware circuit Polymob1
The goal obviously is to construct a circuit that does not take more than a total of S(n, d) bits of register space.As such we are looking at a circuit architecture similar to the one shown in Figure 5.
To understand the challenges in this circuit, note that one needs to follow the flow given by the orange line in the recursion tree of Figure 4. There is one top-level register of size n_{↓d} storing the initial ANF vector A_0. There is only one register of size (n−1)_{↓d} to store the second-level coefficients A_top and A_bottom. This implies that if in the first clock cycle the 2nd register stores A_top, it must preserve this state until the entire left sub-tree rooted at this node is executed, before it overwrites its state to A_bottom. Similarly, there is only one register of size (n−2)_{↓d} to store potentially four ANF vectors (two each from the butterfly operation on A_top and A_bottom). Thus the engineering challenge is to ensure that each register at the successive levels stores and preserves the appropriate state vectors till it is time to overwrite them, and to do this in a manner that minimizes the total number of clock cycles required to execute the Möbius transform.
Thus we arrive at the architecture in Figure 5. Each i-th level has a single register of size (n−i)_{↓d} (for 0 ≤ i ≤ n − d), and from i = 1 onwards each register is preceded by a 3:1 multiplexer of width (n−i)_{↓d}. This is because each register must be able to accept 3 different inputs:

1. Its own output state, or in other words it must be able to preserve its state.
2. Either of the 2 outputs of the butterfly stage preceding it.

Architectural Details:
We begin by noting that a 3:1 multiplexer is not necessary for the above architecture unless we add other functionalities to the circuit, like the one described in Section 4.3; for executing the basic Möbius transform a 2:1 multiplexer also serves the purpose. We explain this with the first couple of registers, but the same principle holds for the registers at the lower levels too. Note that the second register in its lifetime can only store two vector values, A_top and A_bottom, depending on how far the execution has progressed in the traversal of the recursion tree. Both of these are obtained from the butterfly operation on A_0, which resides in the register at the level just above. Thus the idea is to have a single 2:1 multiplexer separating the two registers, which takes as input the two outputs of the butterfly operation. When the 2nd-level register needs to preserve state (A_top or A_bottom), this can be done by appropriately setting the select signal of the multiplexer: for example, to preserve A_bottom we just need to set the select signal of the preceding mux so that it accepts the A_bottom signal from the previous butterfly stage.
It remains to be seen how one can effectively set the multiplexer signals. In order to do that, let us observe a small example. We will use a more general notation for the successive ANF vectors instead of just A_top/A_bottom, since we have to accommodate ANFs at different levels. We use the notation A[ℓ]_b to denote an ANF vector in the recursion tree: the term ℓ in the square braces denotes the level of the ANF vector in the recursion tree, and the term b, which can be seen as a binary string or an integer, contains information about the coordinates over which derivatives have been computed to obtain the function. For example, take the case n = 5, d = 2. The ANF of the original function f(x_0, x_1, x_2, x_3, x_4) we denote by A[0]_000: note that the subscript is a binary string of length n − d (which is 3 in this example). This is because there are n − d levels in the recursion tree, each obtained by taking a derivative over some coordinate variable. The level-1 ANFs corresponding to the functions f(0, x_1, . . ., x_4) and f(1, x_1, . . ., x_4) are denoted A[1]_000 and A[1]_100 respectively. Generalizing this: if A[ℓ]_b is the ANF vector at some level ℓ of the tree, then after applying the butterfly over the coordinate x_ℓ, the two output vectors are denoted A[ℓ+1]_b and A[ℓ+1]_{b⊕e_ℓ}, where e_t is the unit vector of length n − d with a 1 at the t-th position, e.g. e_0 = 100. . .0, e_1 = 010. . .0, etc. At this moment, let us turn to the example in Figure 6, where we have manipulated the select signals of each multiplexer so that the entire Möbius transform is computed in 2^{n−d} = 8 cycles, i.e. in each of the 8 cycles we get one partial truth table of size d_{↓d} = 2^d = 4. Initially the top-level register is initialized with A[0]_000 and the remaining registers stay uninitialized. In the 2 cycles following this, the select signal of each multiplexer is set to zero, so that afterwards each level-ℓ register contains A[ℓ]_000. Figure 6 shows the flow of data in each of the 8 cycles succeeding this.
We introduce an additional notation: let S[ℓ]_t be the select signal of the multiplexer between the registers at levels ℓ and ℓ+1 at time t. That is to say, if S[ℓ]_t = 0 and the ANF vector at level ℓ at time t is A[ℓ]_b, then at time t+1 the ANF vector at level ℓ+1 is A[ℓ+1]_b, and if S[ℓ]_t = 1 then the corresponding vector is A[ℓ+1]_{b⊕e_ℓ} (this can also be written as A[ℓ+1]_{b+e_ℓ}, since by design the coordinates of b at positions larger than ℓ are all 0). The two expressions can obviously be combined into the single compact expression A[ℓ+1]_{b + S[ℓ]_t · e_ℓ} that caters for both values of S[ℓ]_t. In Figure 6, we have made a series of assignments to the variables S[ℓ]_t (for 0 ≤ ℓ < n−d and 0 ≤ t < 2^{n−d} − 1) so that the vector at the bottommost level of the register chain is always A[n−d]_t, for all 0 ≤ t < 2^{n−d}. Since Expmob1 is connected to the bottommost register, this ensures that all the partial truth tables are faithfully computed and the circuit indeed computes the Möbius transform of any five-variable Boolean function of degree up to 2. However, we are more interested in engineering the multiplexer signals for general values of n, d. To do so, equivalently consider the subscripts of the ANF vectors as integers, and return to the example in Figure 6. Initially all the subscripts at all the levels are zeros; thereafter the subscripts evolve as dictated by the S[ℓ]_t's, treated as unknowns.
Note that the above follows since e_ℓ = 2^{n−d−1−ℓ} as an integer, and therefore b ⊕ e_ℓ = b + 2^{n−d−1−ℓ} (the corresponding bit of b being 0 by design). We have already seen that, for this to serve our purpose, the integer values of the subscripts appearing at the bottommost level should be 0 to 7. In other words, we need S[ℓ]_t's from the set {0, 1} which are solutions of the following system of equations over the integers:

2^2 · S[0]_{t−3} + 2 · S[1]_{t−2} + S[2]_{t−1} = t,  for t = 1, 2, . . ., 7,

where signals at negative time instants belong to the initialization cycles and are 0.
It can be verified that the assignments to the S[ℓ]_t's in Figure 6 satisfy the above equation system. We address the general case with the following theorem.

Theorem 1. The select signals S[ℓ]_t of the multiplexers in the Polymob1 circuit can be set so that the register at the bottommost level n − d contains A[n−d]_t at every time instant t, for 0 ≤ t < 2^{n−d}, i.e. so that the circuit computes the complete Möbius transform in 2^{n−d} cycles.

Proof. We essentially have to prove that we can engineer the multiplexer signals S[ℓ]_t efficiently so that the subscripts of the ANF vectors at the bottommost level, i.e. level n − d, at t = 0 → 2^{n−d} − 1, are each equal to t itself. Generalizing the observations made above, we need S[ℓ]_t's from the set {0, 1} which are solutions of the following system of equations over the integers. Let u := n − d, let i be a sequence variable, and set j := u − 1 − i for conciseness; then we have

Σ_{ℓ=0}^{u−1} 2^{u−1−ℓ} · S[ℓ]_{t−u+ℓ} = t,  for t = 1, 2, . . ., 2^u − 1.

To solve the above equation system, observe that the right side always is a u-bit integer, i.e. between 1 and 2^u − 1. Not only that, the left side of each equation resembles the decimal expansion of a u-bit binary string: for example, the LHS of the last equation is the decimal expansion of the u-bit binary string S[0]_{2^u−1−u} S[1]_{2^u−u} · · · S[u−1]_{2^u−2}. Thus a trivial way to solve the above equation system is to assign to the unknowns the values obtained from the binary representation of the corresponding integer on the right side. For example, since the binary form of 2^u − 1 is the u-bit string of all 1s, we can assign S[0]_{2^u−1−u} = S[1]_{2^u−u} = · · · = S[u−1]_{2^u−2} = 1. Thus we can see that a solution to the above equation system exists; however, we will further show that each of the signals S[ℓ]_t can be efficiently generated using a reasonable amount of logic circuits. Using the method outlined above, we can immediately see that S[u−1]_t = t + 1 mod 2 for all t. With some misuse of notation, the above can be written as NOT(t mod 2), i.e. if we have a decimal up-counter implementing t, then the S[u−1]_t signal can be implemented by inverting the least significant bit of t. Similarly, the sequence S[u−2]_t, t = 0, 1, 2, . . ., is the second lsb of the sequence 2, 3, 4, . . ., i.e. the second lsb of t + 2.
For the general case, let us look at the i-th column from the end of the above equation system, which has been highlighted in green. It can be seen that the sequence S[j]_t = S[u−1−i]_t, t = 0, 1, 2, . . ., is the (i+1)-th lsb of the sequence (i+1), (i+2), (i+3), . . ., i.e. the (i+1)-th lsb of t + i + 1. Thus, to construct all the signals S[ℓ]_t, all we need are the following circuit elements:

1. A u-bit decimal up-counter for the variable t.
2. A series of u incrementer circuits that compute t + i + 1 for i = 0, 1, . . ., u − 1, of which only the (i+1)-th lsb is retained.
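Under the timing convention reconstructed above (signals at negative time instants are 0 during the fill cycles), the whole control schedule can be checked with a few lines of Python (names ours):

```python
def select_signal(level, t, u):
    """S[level]_t: with i = u - 1 - level, this is the (i+1)-th lsb of
    t + i + 1, i.e. bit i of the integer t + i + 1."""
    i = u - 1 - level
    return ((t + i + 1) >> i) & 1

def bottom_subscript(t, u):
    """Subscript of the ANF vector at the bottommost level u at time t:
    the signal at level l was applied u - l cycles earlier, while the data
    trickled down the register chain."""
    return sum(select_signal(l, t - (u - l), u) << (u - 1 - l) for l in range(u))
```

The check below confirms that the bottommost register's subscript sweeps 0, 1, . . ., 2^u − 1 in order, which is exactly the statement of Theorem 1.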
Theorem 2. Furthermore, it is possible to design a control circuit that generates all the select signals of the multiplexers in the Polymob1 circuit, incurring a total delay of 2 · log_2(n − d) gates.
Proof. As noted in the proof of Theorem 1, the control circuit consists of a u-bit decimal up-counter (where u := n − d) for the variable t and a series of u incrementers. However, constructing whole incrementers would lead to a wastage of gates, since we are only interested in generating the (i+1)-th lsb of t + i + 1 for i = 0, 1, . . ., u − 1.
Consider any p-bit string ⃗w = w_{p−1}, w_{p−2}, . . ., w_1, w_0 (note that the indexing starts from the right side in this definition). Define the p-variable Boolean function g_{p,⃗w} as follows:

g_{p,⃗w}(x) = ∏_{j : w_j = 0} ( x_j ∨ ⋁_{k > j, w_k = 1} x_k ).

For example, the function g_{8, 0001 0100} = (x_0 ∨ x_2 ∨ x_4)(x_1 ∨ x_2 ∨ x_4)(x_3 ∨ x_4) · x_5 x_6 x_7, and g_{3,111} = 1 (the empty product). Each product term begins with an index that has a 0 in the string ⃗w: in the first example, the indices 0, 1, 3, 5, 6, 7 of ⃗w hold 0. Each such index is then ORed with the indices larger than it that have a 1 in ⃗w. Further, if the length of ⃗w is more than p, we truncate ⃗w to its p least significant bits. We will prove that the (i+1)-th lsb of x + i + 1 is given by the Boolean function x_i ⊕ g_{i, bin_i(i)}(x), where bin_i(i) is the binary encoding of i using i bits, i.e. prepended with leading zeros when necessary. For small i, this is easy to verify. Denoting by x_j the Boolean variable for the j-th bit of x, we know that for i = 0 the 1st lsb of x + 1 is given by x_0 ⊕ 1 = x_0 ⊕ g_{0,0}. For i = 1, the 2nd lsb of x + 2 can be computed thus: when we add 2, i.e. the string "10", the 1st lsb location generates no carry; the result of the addition in the 2nd lsb location is therefore x_1 ⊕ 1 = x_1 ⊕ g_{1,1}. For general values of i, we proceed as follows. Let bin_i(i) = c_{i−1}, c_{i−2}, . . ., c_0, where each c_j ∈ {0, 1}. Of these, let the locations 0 ≤ n_1 < n_2 < · · · < n_s ≤ i − 1 be such that c_{n_k} = 1 for k = 1 to s, and let the remaining c_j's be 0.
When adding two strings a, b, the carry-out bit at the j-th position can be written as maj(a_j, b_j, carry_{j−1}), where maj is the 3-input majority function, which satisfies maj(A, 0, B) = A · B and maj(A, 1, B) = A ∨ B. Using these properties of the majority function, the figure becomes self-explanatory; however, we still have to explain the symbols z_j for j = 1 to s, which are the carry-outs at the positions n_j. By the second property we have

z_1 = x_{n_1} ∨ x_{n_1−1} x_{n_1−2} · · · x_0 = (x_{n_1} ∨ x_{n_1−1})(x_{n_1} ∨ x_{n_1−2}) · · · (x_{n_1} ∨ x_0) = x_{n_1} ∨ g_{n_1,i}(x).

The above follows because of the Boolean identity A ∨ BC = (A ∨ B)(A ∨ C); here i in the subscript of g denotes the truncation of bin_i(i) to the appropriate number of bits. Following the same logic we now have z_2 = x_{n_2} ∨ x_{n_2−1} · · · x_{n_1+1} · z_1 = x_{n_2} ∨ g_{n_2,i}(x). Following this chain of arguments, it is straightforward to show that z_s = x_{n_s} ∨ g_{n_s,i}(x) and that the carry out of the (i − 1)-th location is x_{i−1} x_{i−2} · · · x_{n_s+1} · z_s = g_{i,bin(i)}(x). Thus it follows that the sum we are looking for is x_i ⊕ g_{i,bin(i)}(x).
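The definition of g and the carry claim above can be checked exhaustively for small parameters. The following sketch (a plain software model, with g implemented directly from its product-of-ORs description) is meant only as a sanity check of the argument:

```python
def g(p, w, x):
    # g_{p,w}(x): product over indices j with w_j = 0 of
    # (x_j OR x_k for all k > j with w_k = 1); the empty product is 1
    result = 1
    for j in range(p):
        if (w >> j) & 1 == 0:
            term = (x >> j) & 1
            for k in range(j + 1, p):
                if (w >> k) & 1:
                    term |= (x >> k) & 1
            result &= term
    return result

# claim: the (i+1)-th lsb of x + i + 1 equals x_i XOR g_{i, bin_i(i)}(x)
for i in range(10):
    w = i & ((1 << i) - 1)            # bin_i(i): i truncated to i bits
    for x in range(1 << (i + 1)):     # bits of x above position i cannot matter
        lhs = ((x + i + 1) >> i) & 1
        assert lhs == ((x >> i) & 1) ^ g(i, w, x)
```

The loop passes for all small i, which is exactly the statement proved above.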
The expression for g_{i,bin(i)}(x) naturally has the longest circuit depth for i = u − 1 = n − d − 1. The number of product terms in the expression is bounded by n − d; therefore the depth required to combine the product terms, if the AND gates are arranged in a binary tree, is around log_2(n − d). Furthermore, each bracket of terms that are OR-ed together can also contain a maximum of n − d terms, which implies each such bracket can also be constructed using log_2(n − d) depth. Putting this together we arrive at 2 · log_2(n − d). We also have to account for the decimal up-counter t, which counts from 0 → 2^{n−d} − 1 in steps of 1. But it is well known in circuit theory that the maximum depth required in this up-counter is only log_2(n − d) (i.e. for the update bit of the msb flip-flop, which is the AND of all the lower-order counter bits, realizable with a binary tree of AND gates).

Representation of the ANF vector:
So far we have avoided some of the finer operational details of the circuit to concentrate on the macro-level issues of dataflow through the circuit. One important topic we have not dealt with so far is the representation of a degree-limited ANF coefficient set as a bit vector. The uncompressed ANF vector of an n-variable Boolean function has 2^n entries, and mapping each coefficient into an array can be done canonically as explained earlier in Sec 2 and further shown in Figure 2. For example, the x_0 x_2 term has coefficient 1: since the term can be written as x^{101} := x_0^1 · x_1^0 · x_2^1, the exponent vector 101 (5 in decimal) denotes the position where a one is inserted in the array. However, when we are dealing with functions of a small degree d, the coefficients of all terms of degree larger than d are zero, and so in order to accommodate the potentially (n ↓ d) non-zero coefficients we must be able to map them into an array of equal length, i.e. we need to decide which array location a given coefficient is going to reside in. This is important for the following reasons:
1. We have left the issue of the ordering in Lines 1, 2 of Algorithm 1 open. The ANF vector should be represented so that constructing the vectors A_top/A_bottom from A_0 is efficient at all levels of recursion.
2. The ANF representation should be such that we can efficiently use Expmob1 at the leaf nodes of the recursion tree.
3. The circuit constructed for some n = n*, d = d* should produce the correct result when used for all n < n* and d < d*, i.e. the circuit should work seamlessly for all smaller and lower-degree Boolean functions.
Let H(n, d) be the set of all binary strings of length n whose hamming weight is less than or equal to d, where we will treat the elements of this set as both binary strings and integers. The goal is to construct a mapping χ_{n,d} : H(n, d) → [0, (n ↓ d) − 1], so that the coefficient of the x^D term, for any D ∈ H(n, d), is placed at location χ_{n,d}(D) in the compressed ANF vector. From the description of Expmob1 in Section 3, the following things can be seen: a) if we are using a butterfly circuit to construct the derivative wrt any variable x_ℓ, then the two inputs to the circuit are the coefficients at locations D and D ⊕ e_ℓ. Writing f = x_0 · f_1 ⊕ f_2, where f_1 = ∂f/∂x_0 and f_2 = f(0, x_1, x_2, . . .), note that A[1]_{00...} gets the ANF vector of f_2 and A[1]_{10...} gets the ANF vector of f_1 ⊕ f_2 after the butterfly operation. Therefore we have the following transitions: 1. Algebraically, f_2 is simply all terms of f with the terms containing x_0 removed. In terms of ANF, f_2 therefore consists of the terms at indices of the type 0 ∥ s (where s is any (n − 1)-bit string) in the uncompressed ANF vector. These are simply copied to index s in f_2. In the compressed world, therefore, the entry at location χ_{n,d}(0 ∥ s) of A[0]_{00...} should go to location χ_{n−1,d}(s) of A[1]_{00...}. However, note that when expanded as integers, 0 ∥ s and s give rise to the same integer. Thus for ease of use χ_{n−1,d}(s) can simply be denoted as χ_{n,d}(0 ∥ s), and if we view the arguments of these functions as integers we do not need to define any χ_{n−i,d} separately.
2. Similarly for f_1 ⊕ f_2: in the uncompressed form, all terms at indices 0 ∥ s are added with the terms at 1 ∥ s and copied to 0 ∥ s. We are yet to determine whether the mapping χ_{n,d} can be computed efficiently. The following lemma addresses this computational issue.

Lemma 1. For positive integers n, d with d ≤ n, and s ∈ H(n, d) whose set bit positions are i_0 > i_1 > i_2 > · · ·, we have χ_{n,d}(s) = (i_0 ↓ d) + (i_1 ↓ d−1) + (i_2 ↓ d−2) + · · ·,
where we extend the definition of (x ↓ y) as (x ↓ y) := Σ_{j=0}^{y} C(x, j), with C(x, j) = 0 whenever j > x. Proof. As per the definition of χ_{n,d}, given s we have to count how many integers strictly less than s have hamming weight bounded by d. It is obvious that this number for any 2^m is simply (m ↓ d), i.e. the number of m-bit strings of hamming weight less than or equal to d. Hence the number of such strings in the range [0, 2^{i_0}) is (i_0 ↓ d). The number of such integers in the range [2^{i_0}, 2^{i_0} + 2^{i_1}) is the number of strings which have 1 in the i_0-th position and hamming weight less than or equal to d − 1 in the last i_1 bits, and is therefore equal to (i_1 ↓ d−1). Carrying this argument forward for the successive i_2, i_3, . . ., we arrive at the required result.
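A direct transcription of the lemma into software (purely as a sanity check) confirms that χ_{n,d} ranks H(n, d) in increasing integer order; (m ↓ d) is written as bsum(m, d):

```python
from math import comb

def bsum(m, d):
    # (m ↓ d) := sum of C(m, j) for j = 0..d (C(m, j) = 0 when j > m)
    return sum(comb(m, j) for j in range(d + 1))

def chi(s, d):
    # Lemma 1: for set bit positions i_0 > i_1 > ... of s,
    # chi(s) = (i_0 ↓ d) + (i_1 ↓ d-1) + (i_2 ↓ d-2) + ...
    total, k = 0, 0
    for pos in reversed(range(s.bit_length())):
        if (s >> pos) & 1:
            total += bsum(pos, d - k)
            k += 1
    return total

n, d = 8, 2
H = [t for t in range(1 << n) if bin(t).count("1") <= d]  # already sorted
assert [chi(s, d) for s in H] == list(range(bsum(n, d)))
# the worked example from the text: chi_{8,2}(8), (9), (10), (12)
assert (chi(8, 2), chi(9, 2), chi(10, 2), chi(12, 2)) == (7, 8, 9, 10)
```

The assertions reproduce the example values χ_{8,2}(8) = 7, . . ., χ_{8,2}(12) = 10 quoted elsewhere in the paper.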

Helping Circuit Compiler synthesize faster
The above lemma shows that the map χ_{n,d}(·) can be computed efficiently. However, for ease of synthesis, one may want to precompute and store a few of these values to help the circuit compiler construct an optimal circuit, especially when n becomes large. One could store all the values χ_{n,d}(s), ∀s ∈ H(n, d) for this purpose, but note that the arguments s of this function are not contiguous integers, and thus we would not be able to store the function table in a contiguous memory structure like an array. We could employ a hash table for this purpose; however, designing a good collision-free hash function for this purpose is an open problem.
Another method we could employ is to store the adjacency matrix of a graph that we describe below. Note that at the ℓ-th recursion step, we need access to the locations χ_{n,d}(0 ∥ s), χ_{n,d}(1 ∥ s) of the current register, where s is an (n − ℓ − 1)-bit string. Imagine the graph G = (V, E), in which the elements of [0, (n ↓ d) − 1] are nodes and each node α in this set is connected by at most n − d types of edges to at most n − d neighbors. An edge of type ℓ connects α = χ_{n,d}(D) to β = χ_{n,d}(D ⊕ e_ℓ) if hw(D ⊕ e_ℓ) ≤ d, and is absent otherwise. This is helpful because at step ℓ of the recursion tree, if α = χ_{n,d}(0 ∥ s) then the two inputs to the butterfly circuit can equivalently be seen as the wires at locations α and β, as it can easily be deduced that β = χ_{n,d}(1 ∥ s). One can now define the reduced adjacency matrix AM of size (n ↓ d) × (n − d), such that AM[α][ℓ] = β if a type-ℓ edge connects α to β, and a null marker otherwise. Thus the ℓ-th recursion step can be rewritten from iterating over all (n − ℓ − 1)-bit strings s with hw(s) ≤ d (operating on the locations χ_{n,d}(0 ∥ s), χ_{n,d}(1 ∥ s)) to an equivalent form that simply scans the rows α of the AM matrix and operates on the wire pairs (α, AM[α][ℓ]) whenever the entry is non-null. Using the second description, it is much easier to write RTL code describing the Möbius Transform circuit in any hardware description language. Additionally, the circuit compiler also outputs the optimized netlist faster. In Appendix A, we outline an algorithm to generate AM efficiently in polynomial time.
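The AM matrix can be generated in software along the following lines (a sketch, not the Appendix A algorithm; it assumes that a type-ℓ edge toggles bit n − ℓ − 1, the bit that distinguishes 0 ∥ s from 1 ∥ s at recursion step ℓ, and it reuses the χ formula of Lemma 1):

```python
from math import comb

def bsum(m, d):
    # (m ↓ d) := sum of C(m, j) for j = 0..d
    return sum(comb(m, j) for j in range(d + 1))

def chi(s, d):
    # rank of s among integers of hamming weight <= d (Lemma 1)
    total, k = 0, 0
    for pos in reversed(range(s.bit_length())):
        if (s >> pos) & 1:
            total += bsum(pos, d - k)
            k += 1
    return total

def build_AM(n, d):
    # AM[alpha][ell] = beta if a type-ell edge joins alpha and beta,
    # None (the null marker) otherwise
    AM = [[None] * (n - d) for _ in range(bsum(n, d))]
    for D in range(1 << n):
        if bin(D).count("1") > d:
            continue
        alpha = chi(D, d)
        for ell in range(n - d):
            Dp = D ^ (1 << (n - ell - 1))   # assumed edge-type convention
            if bin(Dp).count("1") <= d:
                AM[alpha][ell] = chi(Dp, d)
    return AM

AM = build_AM(5, 2)
assert len(AM) == bsum(5, 2) and len(AM[0]) == 3
# each edge type is an involution: following it twice returns to alpha
for alpha, row in enumerate(AM):
    for ell, beta in enumerate(row):
        if beta is not None:
            assert AM[beta][ell] == alpha
```

The involution check mirrors the fact that the butterfly at step ℓ pairs each wire α with a unique partner β.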

Further Utilities
Using the circuit for smaller functions: The circuit, once constructed for some upper limit (n, d), also caters for Boolean functions of any number of variables n_0 < n. Since any n_0-variable Boolean function (for n_0 < n) is also an n-variable Boolean function, i.e. with the additional variables set to zero, the only thing we need to do is embed the ANF of the n_0-variable Boolean function as the ANF of an n-variable Boolean function with appropriate zero padding. This is aided by the fact that χ_{n,d} has been defined in a manner such that χ_{n−1,d}(s) is the same as χ_{n,d}(0 ∥ s) for any (n − 1)-bit string s. Thus in order to embed any (n − 1)-variable Boolean function we simply place the coefficient corresponding to s at χ_{n,d}(0 ∥ s) and place 0 at χ_{n,d}(1 ∥ s). Since the function χ_{n,d} is monotone, this amounts to filling up locations [0, (n−1 ↓ d) − 1] with the coefficients of the smaller Boolean function and padding the remaining locations, i.e. [(n−1 ↓ d), (n ↓ d) − 1], with zeros. By induction on i, the same applies to any (n − i)-variable function.
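The prefix property used here — that the compressed ANF of a smaller function occupies an initial segment of the larger vector — is easy to sanity-check, since χ_{n,d} simply ranks H(n, d) by integer value:

```python
d = 3
for n in range(4, 10):
    # H(n, d) in increasing integer order = the order induced by chi_{n,d}
    Hn  = [t for t in range(1 << n) if bin(t).count("1") <= d]
    Hn1 = [t for t in range(1 << (n - 1)) if bin(t).count("1") <= d]
    # the strings 0||s are the same integers as s, so an (n-1)-variable
    # ANF fills exactly the first (n-1 ↓ d) locations; the rest is padding
    assert Hn[:len(Hn1)] == Hn1
```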
Finding the truth table when some variables are fixed to constants: Often one is interested in finding solutions to a system of equations in which a fraction of the variables have been fixed to given constants [Cou02, CM03]. Since our strategy for solving a system of polynomial equations is to compute the OR of the respective truth tables (see Sec 1.1), we would therefore be interested in finding the truth table of a polynomial when some variables are fixed. To do this, we could first simplify the given Boolean polynomial by fixing the individual variables to constants and then use the corresponding reduced ANF vector as input to the circuit, after appropriately zero padding it. However, this naturally requires additional computation, i.e. simplifying the original polynomial in the first place.
However, if the t variables to be fixed are lexicographically the first t variables of the system (for any t ≤ n − d), then we can do better. We see from Figure 6 that at the i-th stage the ANF vector in the bottom-most register is A[n − d]_{bin_{n−d}(i)}. As a result, the truth table output after the Expmob1 circuit is that of f(bin_{n−d}(i), . . .), i.e. the function in which the first (n − d) inputs of f have already been set to bin_{n−d}(i). Thus one can use this method to extract the truth tables when the number of variables to be fixed is at most n − d. One may think that one would need to wait exactly i cycles to obtain the tables, which can be counterproductive if i is large. Note, however, that Fig 6 already starts with A[3]_{000} in the bottom-most register at t = 0, as in the previous 3 cycles, i.e. t = −3, −2, −1, the corresponding S[i]_t's were all set to zero. If instead all of these were set to 1, then at t = 0 the signal in the bottom-most register would be A[3]_{111}, and we would get the truth table of f(1, 1, 1, . . .) from the Expmob1 circuit. Similarly, by adjusting the initial select signals we can get the truth table where the first (n − d) variables are fixed to any arbitrary constant in the first cycle itself.

Synthesis Results
For the actual synthesis, we can make some optimizations as follows. In Figure 6 we can see that the topmost register of size (n ↓ d) essentially holds a constant value throughout the lifetime of the Möbius Transform operation, and as such it can be removed from the circuit if the ANF signal is assumed to be available on the input wires to the circuit. Using this tweak, we again synthesized the Polymob1 circuit using the Nangate 15 nm open cell library for various values of n ∈ [8, 20] and d ∈ [2, 4]. Note that the values of d chosen apply to a number of instances of cryptanalytic problems known in the literature. For example, it is known that cryptanalysis of signature schemes like UOV amounts to solving a set of quadratic (d = 2) Boolean equations. The public key in the signature scheme PICNIC v3.0 consists of a single plaintext/ciphertext pair generated by the LowMC block cipher using the secret key as the block cipher key. The designers recommend using 4-round instances of the block cipher for this purpose, which reduces the relations between the plaintext, ciphertext and key to Boolean equations of degree 4 in the unknown key [Din21]. Thus finding the secret key amounts to cryptanalysis of the block cipher using the single plaintext/ciphertext pair available as the public key of the signature, which in turn amounts to solving n degree-4 equations in n unknowns.
The results are presented in Table 1. As stated earlier, the circuits were synthesized to minimize the total critical path, which allows us to clock them at higher frequencies. The minimum time T_min taken to compute the transform is calculated as 2^{n−d} + (n − d) times the critical path T_cr of the circuit. Since T_cr increases logarithmically and the number of cycles increases exponentially wrt (n − d), some interesting tradeoffs can be observed: for example, to compute the Möbius Transform of quadratic Boolean functions one may use the circuit for d = 2, 3 or 4. Because of the exponential dependence on n − d, the total physical time taken to compute the transform undoubtedly decreases with increasing d: however, this has to be paid for with larger circuit area and energy consumption. Furthermore, from a comparison between Tables 1 and 2, it can also be seen that the energy consumed by the Polymob1 circuits is an order of magnitude larger than that of the corresponding Expmob1/Expmob2 circuits for similar values of n. This is to be expected, primarily because Polymob1 is essentially a serialized circuit that performs the transform using an exponential amount of time (in n − d), whereas Expmob1/Expmob2 compute the transform in far fewer cycles at the cost of exponential circuit area.

Solving Polynomial equations of degree ≤ d
One of the primary uses of the Möbius Transform circuit is in finding the solutions of a system of Boolean polynomials bounded by some algebraic degree d. Recapitulating: if f_1, f_2, f_3, . . ., f_m are the m polynomials whose common root we are aiming to find, then a root r ∈ {0, 1}^n is an n-bit vector which simultaneously satisfies f_1(r) = f_2(r) = · · · = f_m(r) = 0. We can combine the above into a single equation: f_1(r) ∨ f_2(r) ∨ · · · ∨ f_m(r) = 0. In other words, we take the truth tables of each f_i and cumulatively compute their logical OR. The common root(s) will be the indices at which the vector of the cumulative OR of the tables has 0 in it. However, note that the Möbius Transform circuit we have constructed outputs the truth table in parts. We have seen that when the ANF vector at the lowest register is A[n − d]_b, the circuit outputs the truth table of the function f(b, x_{n−d}, . . ., x_{n−1}), i.e. the function in which the first n − d inputs have been set to the constant b. With this information, let us look at circuit configurations that compute the roots of a system of equations.
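In software, the whole strategy fits in a few lines (using a full, uncompressed Möbius transform as a stand-in for the circuit; the compressed, serialized version produces the same tables piecewise):

```python
def mobius(anf):
    # uncompressed Möbius transform: ANF vector of length 2^n -> truth table
    f = list(anf)
    n = len(f).bit_length() - 1
    for ell in range(n):
        for t in range(len(f)):
            if t & (1 << ell):
                f[t] ^= f[t ^ (1 << ell)]
    return f

def common_roots(anfs, n):
    # OR the truth tables; zeros of the cumulative OR are the common roots
    acc = [0] * (1 << n)
    for a in anfs:
        acc = [u | v for u, v in zip(acc, mobius(a))]
    return [x for x in range(1 << n) if acc[x] == 0]

# cross-check against direct evaluation of each ANF
from random import Random
rng = Random(1)
n, m = 4, 3
anfs = [[rng.randint(0, 1) for _ in range(1 << n)] for _ in range(m)]
def ev(a, x):   # f(x) = XOR of coefficients a_D over monomials D contained in x
    return sum(a[D] for D in range(len(a)) if D & x == D) & 1
assert common_roots(anfs, n) == [x for x in range(1 << n)
                                 if all(ev(a, x) == 0 for a in anfs)]
```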

Polysolve1
Say we have m Boolean equations f_i = 0 that we need to solve simultaneously. We begin with m copies of the Möbius Transform circuit in parallel, one for each of the polynomials f_i. At the b-th step, we have the truth tables of f_i(b, . . .) output from each of the circuits.
We then have a layer of OR gates to compute the logical disjunction of all the truth tables.
After we have done this, we have the combined truth table vector of length 2^d bits, whose zeros give us the roots of the equation. The following cases may occur:
• The vector is the all-1 bit-string. This indicates that there is no root of the underlying system in which the first n − d bits are set to the constant b.
• The vector contains a single 0 at some position t, whose binary encoding is given by bin_d(t). In this case the root of the system of equations is b ∥ bin_d(t).
• The vector contains multiple 0s, which indicates that there are multiple roots of the system beginning with b. We may wish to find all such roots, or any one of them. For the moment, let us concentrate on finding any one of them.
If the task is to find only one such root, the most efficient way is a priority encoder that encodes to binary the first occurrence of zero in the 2^d-bit string. The circuit is described pictorially in Figure 8a. We describe some micro-level details of the circuit below. OR Network: In order to compute the disjunction of m vectors of length 2^d each, it is obvious that we need (m − 1) · 2^d 2-input OR gates. However, we can ensure that the network has a total latency bounded by ⌈log_2 m⌉ OR gates by arranging the gates in an inverted binary tree, in which each level contains around half the gates of the previous level. For example, if m = 8, the first level would have a total of 4 OR gates of width 2^d bits, the next level 2, and the final level a single gate. This makes the total latency of the network equal to that incurred by three OR gates.
Priority Encoder: For a 2^d → d bit priority encoder, the functionality can simply be described as a look-up table if d is small enough. For larger d, we can also use a recursive description of the encoder functionality. In both cases, the critical path of the encoder is known to be proportional to d gates.
Circuit Area: Since the area of a 2-input xor gate is much less than that of a scan flip-flop, the total area can be loosely upper bounded by m · n^{d+1}. The remaining circuit elements contribute md · 2^{d−1} xor gates (for the Expmob1 circuits), (m − 1) · 2^d OR gates (for the OR network) and the area required for the priority encoder (proportional to 2^d). For small d, these can be ignored wrt m · n^{d+1}.
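The recursive description can be modeled in software as follows (an illustrative sketch; the first-zero index of the disjunction vector is exactly the encoder output, and the recursion depth is logarithmic in the width, matching the stated d-proportional critical path):

```python
def first_zero(v):
    # recursive priority encoder: index of the first 0 in the bit list,
    # or None when the string is all ones (no root in this slice)
    if len(v) == 1:
        return 0 if v[0] == 0 else None
    half = len(v) // 2
    left = first_zero(v[:half])      # the left half has priority
    if left is not None:
        return left
    right = first_zero(v[half:])
    return None if right is None else half + right

T = [int(c) for c in "1011111111110111"]
assert first_zero(T) == 1            # first 0 sits at index 1 -> 0001
assert first_zero([1] * 16) is None  # all-one string: nothing to encode
```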

Total Critical Path: As shown in Figure 8a, the total critical path in this architecture is due to the combination of Expmob1, the OR network and the encoder (call this τ_A). Since we have seen that each Möbius Transform takes around 2^{n−d} clock cycles to output all the truth tables, this implies that the circuit will take at least τ_A · 2^{n−d} physical time to solve the system of equations.

Polysolve2
In order to break up the long chain of combinational circuitry after the Möbius Transform computations, one could install pipeline stages in between, as shown in Figure 8b. The introduction of the pipeline stages requires only m + 1 registers of size 2^d bits each (m for Reg1 and one more for Reg2, as shown in Figure 8b); it reduces the delay caused by the long chain of combinational elements while increasing the computation time by only two cycles. The breaking up of this combinational path means that the critical path is now determined by the largest of the individual stage delays. Note that (as we shall see shortly) for the smaller values of d that we report in this paper (i.e. at most 4), there is not much difference between the critical paths of Polysolve1 and Polysolve2: the total critical path τ_A is not very high, and the circuit compiler effectively balances the various parts of the netlist so that the total critical path of the Polysolve1 circuit is comparable with that of the Polysolve2 circuit. However, as d increases, Polysolve2 performs much better wrt total circuit latency.

Polysolve3
So far Polysolve1/Polysolve2 have focused on finding only a single root in the series of partial truth tables generated from the f_i. However, some applications may need the underlying hardware accelerator to find all the roots of a given equation system. There are a few solutions to this problem we could consider. First, the circuit may communicate the disjunction of the partial truth tables back to the processor without applying the encoder. The roots would then be extracted by the processor using its own instruction set architecture. Second, instead of a priority encoder, the 2nd register (Reg 2) in Fig 8b could additionally be equipped with bitwise shift functionality. After the disjunction of the m truth tables is loaded onto it, the bits would be shifted out serially, with another counter maintaining the index of the bit shifted out. Whenever a shifted-out bit is zero, the index counter can be used to construct the root. However, this would require freezing the operations of the Möbius Transform circuit for exactly 2^d cycles, i.e. the Möbius Transform circuit must not produce another partial truth table till the processing of the current table is completed. This implies that the underlying registers of the Polymob1 circuit would now need a 3:1 multiplexer preceding them to help freeze the dataflow. It also increases the number of cycles required to execute the operation by a factor of 2^d, i.e. from 2^{n−d} to 2^d · 2^{n−d} = 2^n cycles.
The solution we propose here instead requires exactly R + 2^{n−d} cycles, where R is the total number of roots of the underlying equation system. The main issue arises when the disjunction of truth tables contains multiple zeros: a priority encoder only fishes out the location of the numerically smallest zero. Consider the event when this actually happens. Using the inverse of an encoder, i.e. a decoder, one can convert the encoded vector V back to a 2^d-bit vector of hamming weight one, with the one at the V-th location. We then OR this vector with the current vector in Reg 2 and update it in the next cycle. The updated vector has one less 0 than the original vector in Reg 2. If this is now the all-one vector then there are no more roots to fish out; else we repeat the process to decrease the number of zeros in Reg 2 by one, till it holds the all-one vector. Example 1. If d = 4, and the OR of the truth tables is T_0 = 1011 1111 1111 0111, then the priority encoder in the first cycle outputs 0001, which is the index of the first 0. The decoder outputs D_0 = 0100 0000 0000 0000, which after OR with T_0 becomes T_1 = T_0 ∨ D_0 = 1111 1111 1111 0111, which has one less zero than T_0 and is written back to Reg 2. In the next cycle we get the next root 1100 from the priority encoder, which decodes to D_1 = 0000 0000 0000 1000. Therefore we have T_2 = T_1 ∨ D_1 = 1111 1111 1111 1111, which is the all-one string.
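Example 1 can be replayed with a small software model of the Reg 2 update loop (one loop iteration corresponds to one clock cycle of the described mechanism):

```python
def extract_roots(T):
    # Reg 2 loop: priority-encode the first 0, decode it back to a one-hot
    # vector, OR that into the register, and repeat until all ones remain
    T = list(T)
    roots = []
    while 0 in T:
        idx = T.index(0)     # priority encoder output
        roots.append(idx)
        T[idx] = 1           # decoder output OR-ed back into Reg 2
    return roots

T0 = [int(c) for c in "1011111111110111"]
assert extract_roots(T0) == [1, 12]   # encodings 0001 and 1100, as in Example 1
```

Counting one more cycle to observe the final all-one string, a partial truth table with r_i roots occupies the pipeline for r_i + 1 cycles, as claimed below.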
During this time the Polymob1 pipeline has to be frozen (and thus a 3:1 mux functionality is needed in the Polymob1 circuit), and it is not difficult to see that if the i-th disjunction of partial truth tables (for i ∈ [0, 2^{n−d} − 1]) has r_i roots (with Σ_i r_i = R), then the i-th step executes for exactly r_i + 1 cycles. To see why, note that there are two scenarios: (a) when the disjunction of the partial truth tables is the all-one string, the pipeline immediately moves on to the next partial truth table and thus spends only one cycle here, and (b) when the string has one or more zeros, the mechanism reduces the number of zeros in the string by one every cycle, as explained above, till the all-one string is reached. This needs 1 + r_i cycles. Therefore the total number of clock cycles required is Σ_{i=0}^{2^{n−d}−1} (r_i + 1) = R + 2^{n−d}. The circuit is described diagrammatically in Figure 8c.

Circuit Area:
The only significant addition to the Polysolve1 circuit is the extra 2:1 mux before each of the scan flip-flops in the Polymob1 circuit (thus achieving 3:1 multiplexer functionality). Thus the circuit area is now bounded by m · n^{d+1} scan flip-flops and 2:1 muxes.

Total Critical Path:
As n increases, the critical path is expected to be due to the select signal generation of the Polymob1 circuit, which has been shown to be proportional to 2 · log_2(n − d). If the underlying clock signal has this period, then the total physical time taken to solve the system will be proportional to 2 · log_2(n − d) · 2^{n−d}.

Synthesis Results
In Figure 9 we present synthesis figures for the three solvers over the range n ∈ [8, 20] and d ∈ [2, 4] (for a complete tabulation of the results please see Tables 3, 4, 5 in Appendix B). We have set m = n, so that the circuits directly output the roots of the underlying equation system. In [BCC + 10], the authors used a different strategy: if there were n equations to solve, then the hardware circuit consisted of only ⌈log_2 n⌉ parallel solvers that output around 2^{n−⌈log_2 n⌉} potential candidate solutions for the system. These candidates had to undergo another filtering step (performed in software) to reveal the actual solutions of the system. While this approach is also possible in our work, we preferred to let the hardware circuit be self-sufficient, i.e. able to find the true roots of the system on its own. In the figure, we plot the silicon area, the total physical time taken (which is the product of the critical path and the number of clock cycles required) and the total energy consumed. The red and blue curves in each plot (denoting the Polysolve1 and Polysolve2 circuits respectively) are almost coincident, which implies that for the range of values of d we have chosen, the circuits have similar performance metrics. The figures for time and energy for the Polysolve3 circuit have been computed assuming that R = n. It is evident that the Polysolve3 circuit is much larger than the corresponding Polysolve1/Polysolve2 circuits, since we need an additional 2:1 multiplexer in each of the internal registers of the Polymob1 circuit to periodically freeze the dataflow.

Estimates for larger n
For larger values of n > 20, simulation takes upwards of 24 hours for each instance of Polysolve3, and so we report only estimates of the silicon area and total physical time using projections from the gathered data. For n ∈ [21, 50] we can estimate the circuit area using the formula m · Σ_{i=1}^{n−d} (n−i ↓ d). The unit here is the area of one scan flip-flop and one 2:1 mux; in the Nangate 15nm OCL, this amounts to around 11.5 GE of silicon area. Further, extrapolating from Table 3, we can estimate (albeit crudely) the delay of one AND, OR or XOR gate to be around 20 ps. From this we can estimate the total physical time for finding all the solutions of the system using the expression 2 · log_2(n − d) · 2^{n−d} · 20 ps. The estimate is shown in Fig 10. We project that to solve a system in n = 50 unknowns of degree 4, the circuit should take around 15000 seconds.
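The time projection can be reproduced directly from the stated expression; the 20 ps gate delay and the n = 50, d = 4 operating point are the ones assumed above:

```python
from math import log2

n, d = 50, 4
# total physical time = 2 * log2(n - d) * 2^(n - d) cycles at ~20 ps each
t_seconds = 2 * log2(n - d) * (2 ** (n - d)) * 20e-12
assert 15000 < t_seconds < 16000   # around 15000 s, as projected
```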

Conclusion
In this paper we propose, design and evaluate hardware architectures that perform the Möbius Transform of Boolean functions using only a polynomial amount of silicon area. In a nutshell, ours is a serialized implementation of the basic transform that uses around 2^{n−d} clock cycles to generate the entire truth table of the Boolean function. The immediate application of the circuit is to solve an underlying system of low-degree equations over GF(2), a problem which occurs in many cryptanalytic attacks on real-world cryptosystems. We further describe architectures for such equation solvers which keep the critical path of the circuit to a minimum. The main conclusion of the paper is the demonstration that a system of m Boolean equations in n variables and algebraic degree up to d can be solved using silicon area proportional to m · n^{d+1} gates and physical time proportional to 2^{n−d} · log_2(n − d). We demonstrate this using the Nangate 15nm Open Cell Library, for which we report simulated results in the final part of the paper.

Appendix A: Fast generation of the AM graph
The following algorithm generates the adjacency matrix for the AM graph in polynomial time.

Figure 1: An example of the Möbius matrix M for n = 3. This architecture implements the circuit in Fig 2 as a single unrolled circuit.

Figure 4: Recursion tree for the Möbius Transform algorithm. The blue shaded component roughly represents one arm of the butterfly unit. Note here (x, ↓ d) := (x ↓ d).

Figure 5: Hardware architecture Polymob1 for the Möbius Transform algorithm. The blue shaded part roughly represents one arm of the butterfly unit. Note here (x, ↓ d) := (x ↓ d).

Theorem 1. Given the circuit Polymob1 in Figure 5, in which each of the registers has been initialized with the ANF vector A[ℓ]_{0^{n−d}} for all 0 ≤ ℓ ≤ n − d, it is possible to engineer the multiplexer signals S[ℓ]_t for 0 ≤ t ≤ 2^{n−d} − 2 so that the circuit computes the Möbius Transform of an n-variable Boolean function of algebraic degree up to d in exactly 2^{n−d} clock cycles.

Figure 7: Visual representation of the addition of x^D and x^{D⊕e_ℓ}. a) Wlog, let us assume that the ℓ-th bit of D is zero, i.e. D · e_ℓ = 0. b) The coefficient of x^D is copied as is from A_0 to A_top (or, in the terminology developed later, from A[ℓ]_b to A[ℓ+1]_b). The coefficients of x^D and x^{D⊕e_ℓ} are added and copied to A[ℓ+1]_{b+e_ℓ}. c) If D is such that hw(D ⊕ e_ℓ) > d, then this last addition is not necessary, since the coefficient of x^{D⊕e_ℓ} is 0 by assumption.

Note that the sizes of the vectors A[ℓ]_b and A[ℓ+1]_b are (n−ℓ ↓ d) and (n−ℓ−1 ↓ d) respectively. So for any D ∈ H(n − ℓ, d), we ought to decide not only what χ_{n−ℓ,d}(D) would be but also to which locations of A[ℓ+1]_b / A[ℓ+1]_{b+e_ℓ} the butterfly outputs should go. It seems we need to determine a series of mappings χ_{n−ℓ,d}; however, we will see how a single unified mapping takes care of our requirements. Let χ_{n,d}(D) = y if exactly y integers smaller than D have hamming weight less than or equal to d. For example, when n = 8, d = 2, we have χ_{8,2}(u) = u for 0 ≤ u ≤ 6, and χ_{8,2}(8) = 7, χ_{8,2}(9) = 8, χ_{8,2}(10) = 9, χ_{8,2}(12) = 10 etc.

Area: If we do away with the first-level register in the Polymob1 circuit, the total number of scan flip-flops required for the successive registers in the m Polymob1 instances is m · Σ_{i=1}^{n−d} (n−i ↓ d) < m · (n − 1)^{d+1} < m · n^{d+1}. It is not difficult to work out that the total number of xor gates required is m · Σ_{i=1}^{n−d} (n−i−1 ↓ d).

Figure 10: Estimates of silicon area and total physical time for larger n.

Figure 9: Synthesis results for the Polysolve1/Polysolve2/Polysolve3 circuits.

The vector of evaluations of a Boolean function at all its input points is called its Truth Table (therefore this is a 2^n-length vector). The ANF and Truth Table vectors of any Boolean function are closely related by the Möbius transform.
Algorithm 1: Recursive Möbius Transform Möbius(A_0, n, d)
Input: A_0: the compressed ANF vector of a Boolean function f
Input: n: number of variables, d: algebraic degree
Output: the Truth Table of f
if n = d then
    /* Final recursion step, i.e. leaf nodes of recursion tree */
    Use the formula B = M_n · A_0 to output the partial truth table B /* use either Expmob1/Expmob2 to do this */
else
    Declare an array T of size (n−1 ↓ d) coefficients
    Store the 1st butterfly output, i.e. A_top, in T (requires no xors)
    Store the 2nd butterfly output, i.e. A_bottom, in T (requires some xors)
    Call Möbius(T, n − 1, d)
end

The two level 1 vectors (the ANF vectors of f_2 and f_1 ⊕ f_2 in the remaining variables x_1, x_2, x_3, x_4) are denoted by A[1]_000 and A[1]_100 respectively (thus A_top and A_bottom defined earlier are equal to A[1]_000 and A[1]_100 respectively in this new notation). Similarly, the two level 2 vectors obtained by applying the butterfly layer on A[1]_000 (by taking the derivative over x_1) are denoted A[2]_000 and A[2]_010; the butterfly over A[1]_100 yields the two vectors A[2]_100 and A[2]_110.
Thus in the compressed form we should add the terms at locations χ_{n,d}(0 ∥ s) and χ_{n,d}(1 ∥ s) (if 1 ∥ s has hamming weight less than or equal to d) of A[0]_{00...} and copy the result to location χ_{n,d}(s) of A[1]_{10...}. 3. The same idea applies at all levels of the recursion tree. 4. Note that χ_{n,d} maps every integer smaller than 2^d to itself. Thus at the lowest leaves of the recursion tree, we can apply the canonical version of the Möbius Transform as used in Expmob1.