Improved Heuristics for Short Linear Programs

In this article, we propose new heuristics for minimising the amount of XOR gates required to compute a system of linear equations in GF(2). We first revisit the well known Boyar-Peralta strategy and argue that a proper randomisation process during the selection phases can lead to great improvements. We then propose new selection criteria and explain their rationale. Our new methods outperform state-of-the-art algorithms such as Paar or Boyar-Peralta (or open synthesis tools such as Yosys) when tested on random matrices with various densities. They can be applied to matrices of reasonable sizes (up to about 32 × 32). Notably, we provide a new implementation record for the matrix underlying the MixColumns function of the AES block cipher, requiring only 94 XORs.


Introduction
Lightweight cryptography has received a lot of attention in the past decade and with the increasing miniaturisation of electronic chips and devices, this trend will probably continue further.In a general sense, the study of lightweight cryptography refers to finding cryptographic primitives circuits that are efficient in constrained environments.The definition of efficiency obviously differs depending on the use-case: throughput, area, energy, power, latency are classical dimensions of the problem that have been considered so far.Trade-offs are often possible and desirable for a cryptographic primitive to adapt well to various situations.Yet, circuit area is a crucial criterion for lightweight cryptography in hardware, as smaller and cheaper devices means stronger area constraints.
Many lightweight ciphers have been proposed [BKL + 07, CDK09, GPPR12, BCG + 12, BJK + 16, BPP + 17] and they all share the particularity that a lot of efforts has been spent on minimising the circuit area; for example by using a state as small as possible, or by minimising the number of logical gates (especially the expensive ones such as XOR) required to implement the primitive.For many of them, a small or even the smallest implementation of their subcomponents is provided by construction, but it is interesting more generally to study how one can minimise the area cost of any function (circuit depth can also be an important criterion).
Tools providing a good circuit that is efficient in area and/or depth already exist, but this task being very complex they all have limitations: they either provide optimal or close to optimal circuits for small problem sizes, or they provide non-optimal solutions for bigger function domains.Examples of the former are SAT-based tools [Sto16], or LIGHTER [JPST17].For small circuits, LIGHTER is an automated tool that uses a meetin-the-middle approach to find efficient or even optimal circuits (under some assumptions).However, for larger circuits, the search space of the circuits grows exponentially large and this requires completely different heuristics.One can cite for example synthesis tools such as Yosys [Wol] and ABC [BM10] that offers circuit optimisation for large domain size.
Generally, a cipher can be broken down into two distinct layers: the linear layer and the non-linear layer (often composed of the application of Sboxes or modular additions).The tools mentioned previously can handle both linear or non-linear layers and many papers in the literature have proposed new techniques to implement the non-linear layer efficiently.This is especially true in particular for the AES S-Box.In [Can05], Canright used a technique involving the subfield arithmetic that eventually led to a small and compact AES S-box circuit.
While the non-linear layer, which generally costs a substantial part of the total encryption circuit, has been well studied, a lot of attention has also been given to the case of linear circuits, which can be seen as a potentially simpler sub-case.As the linear layer is usually implemented with many (relatively costly) XOR gates, improvements on this part can lead to non-negligible savings on the final circuit.
Although the problem of finding optimal linear circuits has been shown to be NPhard [BMP08,BMP13], there are existing heuristics that could give quite good approximations.Paar's algorithms [PR97], for instance, provide good estimations for matrices, even for large sizes.In recent years, a series of algorithms have been published solely focusing on reducing the number of XORs required for linear circuits.In 2017, [KLSW17] showed that by using the Boyar-Peralta (BP) algorithm [BP10,BMP13], many linear circuits can be more efficiently handled when the size of the matrix is not too large.Many research have built on the structure of the BP algorithm in order to further reduce the complexity of linear circuits [VSP18,RTA18,ME19].As a benchmark to compare the various heuristics, a prominent circuit to study is obviously the AES diffusion matrix, whose XOR gate cost has been reduced down to 97 XORs after several successive improvements [KLSW17].
Our Contributions.In this paper, we provide new heuristics to improve the XOR gates count in linear circuits, which can be used to optimise the implementation of diffusion matrices or any linear function with or without full rank.Firstly, we show that the randomisation of the BP algorithm [BP10,BMP13] can lead to improvements over the state-of-the-art.One particularly interesting case is that by inducing randomisation, we can obtain an implementation of the AES MDS matrix requiring only 95 XORs, a new record.The benefits of randomisation became increasingly more significant as the size and density of the matrix increase.
Then, we also propose two new algorithms, so-called A1 and A2, that aim at improving the XOR count even further.The main idea of the algorithms is to attempt to reach some of the outputs first, provided they are not too costly and at the same time, trying to minimise the cost required for the rest of the outputs.As for all known heuristics except Paar's algorithm, the size of the linear layers that A1 and A2 can handle is limited (32 × 32 matrices of low density such as the AES matrix will usually take around 30 to 40 minutes for a single run on a small computation cluster).We have implemented these algorithms on a set of many random square matrices (of sizes 15 × 15 to 20 × 20 with the densities ranging from 0.1 to 0.9) which is a subset from the benchmark set used in [VSP18].We compared the results obtained with various state-of-the-art algorithms such as Paar1, BP, and a randomised version of Shortest-Dist-First [RTA18].A distinctive trend we observed is that when the density, ρ, of the matrix is small, ρ ≈ 0.1 − 0.2, all the above-mentioned algorithms yielded about the same results (the problem is generally much easier to solve as the search space is very limited).When ρ ≈ 0.4 − 0.6, we start to see that A1 and A2 algorithms are performing better.More importantly, as the size of the matrix increases, the criteria used in A1 and A2 provide even stronger gains over other existing heuristics, which indicates that A1 and A2 are indeed fundamentally improving over the state-of-the-art.In order to further improve the circuits' size, we have also incorporated Yosys synthesis tool [Wol] in the picture, as well as some simple local optimisation techniques we propose in order to reduce the XOR count.In some cases, we even managed to reduce the circuit depth as well.The benchmark results are shown in Table 4 and Table 5.We have also implemented the algorithms on matrices from [DL18] and the results are shown in Table 6 and Table 7.All in all, even though these matrices came with an optimised circuit by construction, we surprisingly managed to improve 7 of them (out of 24).
Finally, as the ultimate benchmark, we have run both A1 and A2 algorithms on the AES diffusion matrix and it produces a circuit requiring only 94 XORs, again a new record.This circuit is shown in Appendix E.
Organisation of the paper.In Section 2, we provide a brief overview of the mathematical background required for the understanding of this paper.In Section 3, we explain with some small examples the various state-of-the-art heuristics that are used to find good linear circuits for linear matrices (Paar's algorithms, BP algorithm and its variants, as well as algorithms proposed by Masoleh, Taha and Ashmawy in [RTA18]).In Section 4, we will provide our heuristics, mainly A1 and A2 algorithms, as well as a randomised version of the BP algorithm, mainly to illustrate the importance of randomisation while finding optimised circuits.Also, we will provide a short description of the local optimisation techniques in order to further post-reduce the number of XOR gates, starting from a circuit provided by the various global optimisation algorithms.In Section 5, we will provide the results after comparing the heuristics.These results are mainly on two different groups of matrices -a subset of the benchmark set in [VSP18] (which is a set of random matrices with varying densities of size 15 × 15 to 20 × 20) and matrices of size 16 × 16 and 32 × 32 from [DL18].Finally, we conclude in Section 6 and propose possible future research directions.

Preliminaries
A linear Boolean function is a circuit consisting n input signals {x 0 , x 1 , ..., x n−1 } and m output signals {y 0 , y 1 , ..., y m−1 }.Following the notation in [BP10], the output signals are also called targets.A circuit that is constructed only by the XOR gates (which can be represented by ⊕), is considered as linear.To describe a linear circuit, we will be using a series of t i variables to represent the intermediate values.In this article, we will not put any restrictions on the number of such values t i , but studying optimised linear circuits with bounded memory is also a relevant research direction.
Generally, most cryptographic applications use matrices with elements from the finite field, F 2 k in their calculations.The finite field is defined by some irreducible polynomial of degree k with coefficients from F 2 .To do multiplication with a fixed element from F 2 k , we can represent the fixed element as a k × k matrix, K, with elements from F 2 and the input element to be multiplied with as a vector x = (x 0 , x 2 , ..., x k−1 ) T with each x i ∈ F 2 for i ∈ {0, .., k − 1}.A simple matrix multiplication with modulo 2 operation can be used to obtain the desired result: K • x for the purpose of multiplication.Therefore, a matrix M , of size n × n with fixed elements from F 2 k , can also be interpreted as a matrix of size nk × nk with elements from F 2 .Therefore, we will now discuss only matrices with elements from F 2 .
A naive implementation of matrix (blindly computing the XORs directly given in the matrix representation) is usually very inefficient as there will be a lot of repeated calculations.This is especially true as the density of a matrix increases.Currently, the possibility of finding a circuit with least XOR gates relies on the fact that we can reuse intermediate variables t i that were calculated once, for other computation steps.This strategy also forms the basis for all the algorithms that we discuss in this paper.
In [BP10], Boyar and Peralta coined the term "cancellation-free programs" which refers to programs that do not "cancel" out common terms; all cancellation-free programs have circuits that follow the rule: if the computation t i = u ⊕ w is in the circuit where For instance, Paar's algorithm produces a cancellation-free program whereas BP algorithm does not.Currently, it seems that by considering non-cancellation-free circuits, the XOR count can be improved.One such case can be observed in the example matrix given in Section 3.

Paar's Algorithm [PR97]
In [PR97], Paar proposed two global optimisation algorithms, we will refer to them as Paar1 and Paar2 (i.e.Algorithm 1 and its improvement respectively in [PR97]).Both Paar1 and Paar2 take a binary matrix as input and attempt to find a more efficient circuit than its naive implementation.In essence, Paar1 first searches and records the frequency for all possible pairing of the input variables, x i ⊕ x j for i, j = 0, ..., n − 1, i = j occurring in the matrix.The pair with the highest frequency will then be computed and placed back into the matrix as a new variable, e.g.t 0 .In the next round, all possible occurrences of u ⊕ v where u, v ∈ {x 0 , x 1 , ..., x n−1 , t 0 } will be considered.This process will repeat until all possible pairs of variables occur at most once in the matrix.Then, the remaining gates will just be computed naively.In Table 1, we take the example of a matrix M and apply Paar1 algorithm on it.
Paar2 is an improvement over Paar1, as when it considers all possible pairs with the highest frequency, it will then conduct a tree search for all such pairs.We note that this exhaustive search may not work very well as the dimension of the matrices increases.
Note that Paar1 and Paar2 produce cancellation-free circuits.However, the advantage of Paar1 is in its time complexity: it can be easily implemented for large matrices and obtains a circuit with a very reasonable computational effort.

BP algorithm [BP10]
The BP algorithm adopted a rather different approach to the problem.While the algorithm is much slower than Paar's algorithms and inapplicable for matrices of large size due to the exhaustive search nature for the calculation of the distance vector, it generally produces more efficient circuits.Notably, in [KLSW17], the BP algorithm managed to reduce the AES diffusion matrix circuit to 97 XORs.The general idea of BP is to reach a common path among as many targets as possible.Thus, the criteria for next gate in each step is to choose a gate that reduces the distance1 to the highest number of targets.We describe here this heuristic: 1. Place all variables {x 0 , x 1 , ..., x n−1 } into a set, Base.The rows in the matrix are denoted as set of targets {y 0 , y 1 , ..., y m−1 } that we want to compute Iter.

Pair with
Freq. highest freq. - Remark 1.In [BP10], the authors also mentioned that if there is any target with a distance value of 1 at some point (i.e.there exists two distinct Base elements, B i and B j such that the target y k is equal to y k = B i ⊕ B j ), then we can directly choose this gate to be added into Base as it will not increase the size of the resulting circuit (one gate will have to be spent to reach y k anyway, so using B i ⊕ B j is necessarily among the best possible choices).
In Table 2, we provide the circuit produced by the BP algorithm when using the same input matrix M that was used for Table 1.This circuit is slightly shorter as compared to the one in Table 1 as it allows for cancellation of variables.This can be observed by looking at the construction for the intermediate variable t 17 (t 17 = t 4 ⊕ y 5 , where and thus, a cancellation of the value t 4 ).This led to a reduction of the number of gates required to obtain y 6 .
In the rest of the article, we will say that a target is far (resp.near) if the corresponding value in the Dist vector is large (resp.small) relative to the other values in Dist.

BP algorithm for dense matrices [VSP18]
Visconti, Schiavo and Peralta proposed in [VSP18] a new heuristic for matrices that are dense.The gist of the heuristic is to approach the Boolean complement of the matrix that it wants to compute first.It will attempt to find the all-one vector along with the targets in the complementary matrix.Eventually, an additional XOR is used between the all-one vector and each of the complementary target to get the original target.They applied their strategy on a benchmark set, which is a set consisting of random matrices of different sizes New base New Dist element vector Iter.
New base New Dist element vector and density.Their results indicate that this new heuristic will be well suited for matrices that have a high density.
In Improved-BP, they start from the BP algorithm, but used a different tie-breaker.If there is a tie in one iteration of BP algorithm, instead of relying on the Euclidean-norm of the distances, the Improved-BP algorithm performs a local exhaustive search by testing all the best results in the next iteration to determine the best pair.
In Shortest-Dist-First (SDF), the algorithm selects the best pair by choosing a gate that maximises the number of pre-emptive gates (pre-emptive gates refer to targets that are at distance one away, thus, gates that we can directly add into the Base according to Remark 1).In the case that none can be found, it will look for a gate that maximises the number of targets at Dist[i] = 2 and so on; i.e. they try to select a gate such that it reduces the distance of as many "nearest" targets as possible.
In Focused-Search, the algorithm combines the criteria in SDF and the tie-breaking method from their Improved-BP algorithm.In [RTA18], they have implemented on some matrices and showed that this algorithm outperformed the other heuristics (BP, Improved-BP, SDF).However, we would like to emphasize that the matrices that the authors have tested on are very small: 20 × 8, 8 × 10 and 8 × 8.

New Heuristics
In this section, we propose new algorithms for finding optimised linear circuits.Our implementations can be downloaded from the following address: https://github.com/thomaspeyrin/XORreduce.

Random-Normal-BP (RNBP)
In [BP10], Boyar and Peralta have experimented on several tie-breaking techniques for their algorithm.Instead of having the Norm as the tie-breaker, they have also tried Random.Under Random, it considers the sum of the distance to targets and whenever there is a possible new base vector with the same sum with the current selection, with probability 1 2 , it will apply the Norm criterion to determine if we should swap or remain with the selected choice.As this results in a non-deterministic algorithm, the algorithm is repeated for a total of three times and the best circuit is then selected.However, this results in an uneven spread of probabilities among all the "equally good" candidates for the next gate according to the criteria.For instance, if there are 4 equally good candidates, the first, second, third and fourth candidates will have probabilities of 1 8 , 1 8 , 1 4 and 1 2 of being chosen respectively.
We propose first a simple variation of the BP algorithm: Randomised-Normal-BP (RNBP)2 .After several runs of the BP algorithm, we noticed a high number of ties even after the Norm tie-breaker has been used.Since those possible new bases passed the criteria with the Norm tie-breaker, we treat these bases to be equally good.Unlike the "Random" used in [BP10], we decided to add a simple randomisation process to choose the next base, giving each possible candidate after the Norm tie-breaker an equal probability of being selected.This algorithm shows the effectiveness of a randomised algorithm.The results obtained experimentally show that by randomising the choices and using the algorithm multiple times, we often produce a circuit that has a lower XOR count than itself without randomisation.This produces a better measurement of how good the criterion is as it transcends the order of computation of all possible choices.In the case of AES, RNBP is able to yield a 95 XOR circuit for the MDS matrix, beating the current record by 2 XORs [KLSW17].In Section 5, we can see that RNBP is faring better than BP.
For reference, an algorithmic description of our RNBP strategy can be found in Appendix A.1.

A1 & A2 algorithms
In this section, we propose two new non-deterministic algorithms, called A1 and A2, that are also based on the general BP strategy.Similarly to the Shortest-Dist-First (SDF) algorithm [RTA18], our rationale is to aim at reaching the targets that are the nearest (targets with the minimum Dist value that is non-zero).However, the main difference with SDF is that A1 and A2 are less restrictive: we do not look for a new base element that can reduce the distance of as many nearest targets as possible, instead, we just choose a gate that can minimise at least one of the nearest targets.We called this step the filtering step.We want to have this filtration to filter out gates that only minimise targets that are further away.After the filtration, we will use the criteria with Norm tie-breaker in BP subsequently to rank the gates that pass the filter in order to reach a common path among all the targets.For example, suppose if at step k we have a Dist = [3 4 5 6 7] and we have three possible gates t ka , t k b and t kc that will result in the following changes to Dist: respectively, then t ka is favoured over t k b and t kc as it reduces the nearest target, and between the t ka and t k b , it reduces the distance to the highest number of targets.
Let us provide our motivation for trying this strategy.In general, targets that are further away tend to contribute more to the reduction of the sum of distance due to their high Hamming Weight and thus, they have a higher probability to contain similar terms with other high Hamming Weight targets.This causes the BP's criteria of using the sum of distances to favour a move that reduces the distance to targets with high Hamming Weight.Hence, the common path is constructed to reach the higher Hamming Weight targets while not paying much attention to the ones with lower Hamming Weight.However, we want the targets that are the nearest to be taken into account as well when computing the common path.Thus, by doing this arrangement of criteria, we can avoid situations where targets with large Hamming Weight are dominating the common path.
For algorithm A1, the selection for the next gate is performed as follows: 1. Filtering.It must reduce the distance of at least one of the nearest targets 2. Selection.Among the ones that pass the filter, select the one which minimises the sum of distances 3. Tie-breaker.The Euclidean Norm will be used as a tie-breaker for the above criteria 4. Randomisation.If the tie-breaker does not resolve the tie, with equal probability, we will randomly pick one of the remaining candidates after the tie-breaking process In algorithm A2, we skip step 3 (tie-breaking) from algorithm A1.This provides more randomisation when compared to algorithm A1.
An illustration using the same matrix example as in Table 1 is shown in Table 3.This example is crafted to illustrate the strength of A1 and A2 as compared to the BP algorithm.The matrix M is crafted in a way that the targets share many common input signals.Ideally, we would like to approach the one that is the nearest among the set of all targets (that is, y 1 in the case of M ) and approach the rest of the targets by making those small gaps.The differences in selecting their next new base element can be seen from the very first iteration for BP and A1 in Table 2 and Table 3 respectively.The former tends to be greedier (in terms of distance reduction) in choosing the gate with inputs {x 4 , x 13 } which reduces the distance vector by 4.However, the targets that gained a reduction are the ones with the highest Hamming Weights too.The latter chose to reach the nearest target first (while keeping the overall distance in check) which resulted in the choice of the gate with inputs {x 2 , x 13 }.We can see that for this particular matrix M , A1's approach managed to achieve a smaller circuit.
For reference, an algorithmic description of our A1 and A2 heuristics can be found in Appendix A.2 and A.3.

Local Optimisation
We describe in this section some local optimisation techniques (LocalOpt) aiming at reducing XOR count further after a circuit of the matrix has been formed.Unlike BP algorithm where it finds the common path by choosing a gate that minimises the sum of distances, A1 and A2 (and also SDF) choose the one that is the nearest first, which often creates a 'walk' around the targets instead of many branches from the common path.Thus, many gates are only used once.This enables us to do some swapping operations with the resulting circuit from global optimisation without involving too many variables.After some swapping, there may be some repeated gates.Thus, by clearing these repeated gates, we are able to reduce the XOR count further.
While our aim is to reduce the XOR count, the depth might also be improved by LocalOpt for some circuits.Since the time taken to do the LocalOpt is insignificant when [0 0 0 0 0 0 0] compared to that of the global optimisation algorithms, we propose that LocalOpt is to be used concurrently with the Yosys synthesis tool [Wol] on the circuits produced by the global optimisation algorithms, in order to maximise the potential for savings.Since Yosys tool is also very fast as compared to the global optimisation, we can actually implement Yosys first then LocalOpt and vice versa and take the best circuit among the two.
We have two LocalOpt techniques: swapping orders to rearrange parts of a circuit and a local exhaustive search to try out all the various permutations and choose the tree that has the most savings.

Swapping orders
This method helps to identify and rearrange special parts of the circuit such that some repeated gates may appear.This rearrangement may also cause the depth of some gates and possibly the depth of the entire circuit to be reduced.
We first identify gates that are used only once for the computation of a value, then we can perform the following operation.Suppose that a part of the circuit is composed of: . . .If depth(b) < depth(c), then the depth of both a and d will be reduced.Therefore, the other parts of the circuits may see a depth reduction as well.We can observe that this helps to save an XOR gate as well: now the gate a and f are doing the exact same operation.Thus, we will check the depth of the new a and f and eliminate the one that has a larger delay and swap the orders if necessary to maintain the order of the circuit.

Local exhaustive search
As a full exhaustive search is infeasible due to the large search space, we propose to perform local exhaustive searches to identify possible opportunities to reduce the number of XOR gates.To describe a circuit, we can use a tree representation for it.Each gate can be represented as a binary tree with three nodes.The parent (and in this case, the root) of the tree will be the output value and the two child nodes are the input values.Starting from any intermediate or target values as the root of the binary tree, we can form the binary tree recursively until all the leaf nodes are the input variables.
For each intermediate and target value in the circuit, we extract the sub-part of the circuit (a binary tree) that leads to this value.We permute to obtain all the possible configurations of binary trees with the same root and leaf nodes.Then, we check all possible configurations to see if there is any tree that could reduce the number of gates required as compared to the one in the circuit.If there is such a tree, we will do a substitution.
For example, if part of the circuit is . . .
then all the possible computation trees with gate t 3 as the root are shown in Figure 1.The tree on the left is the original circuit whereas the others are the possible alternatives that do the exact same computations.Assuming t 1 is only used in the computation of gate t 3 , then the middle tree in Figure 1 will be the most efficient since t 2 has already been computed in the circuit and we can remove t 1 from it.Although the tree on the right in Figure 1 does not need t 1 (and therefore, we are able to remove t 1 too), at the same time it creates an additional node t k , thus we will not have any savings overall.Due to the large permutation tree that we may encounter, we limit the number of leaf nodes of each tree to be 5.That is, start from the root, we start looking for the child nodes for each level recursively until we have at most 5 leaf nodes, regardless whether we have yet reached the input values or not.

Results
Heuristics used for comparison.In this section, we will compare our results against various state-of-the-art algorithms.This includes the BP algorithm, as well as a randomised version of the Paar1 (RPaar1) and the SDF algorithm (RSDF).Similar to the random procedure added in RNBP, whenever we have a tie, a gate will be randomly selected.We did not implement the Focused-Search algorithm as it is extremely time-consuming for matrices that are larger.Since the search space of Focused-Search and SDF algorithms are a subset of that of RSDF, implementing RSDF may be a better comparison of how well the criteria actually performs in a randomised environment.
Implementation.All of the implementations (except RPaar1 and Paar1) are written in C++.As for RPaar1 and Paar1, they are written in Python.The reason for choosing a different language is that RPaar1 and Paar1 are extremely fast, thus we do not require the efficiency from writing in C++.Furthermore, we have no intention to compare the speed of these two algorithms with the other ones.Thus, we choose Python to have a quick and accurate implementation.For the BP algorithm, we use the implementation that is provided in the GitHub repository stated in [KLSW17].As for RNBP, SDF, A1 and A2, we implemented them with several changes from the code for BP.All implementations were benchmarked on AMD EPYC 7401 24-Core Processor model with 96 CPUs and 60GB of RAM.The implementations were run on a cluster of 90 cores, with one core per matrix.
Test set.We have tested the algorithms on a smaller scale of the benchmark set used in [VSP18].The benchmark set consists of random square matrices categorised by their size and density.For our test set, we have chosen a smaller subset: we have chosen 10 matrices from each size (ranging from 15 × 15 to 20 × 20) and density, ρ (ranging from 0.1 to 0.9).In total, we have 540 matrices for our benchmark subset.We have also tested the algorithms on the 16 × 16 and 32 × 32 matrices from [DL18].
Computational effort.Due to the non-deterministic nature of our algorithms, we run the matrices with different algorithms with a fixed time based on the size and density.We chose the amount of computational effort such that at least nine out of ten matrices in each category (size and density) will reach a stable state for all the algorithms implemented on it.We define stable state (or a result is stabilised) as a situation where the algorithm did not yield any improvements in terms of the number of XOR count for the last half of the time allocated.For instance, the ten matrices with size 15 × 15 with ρ = 0.5 were allocated a computational effort of 2000 seconds.This means that for at least nine of the matrices, no further improvement is observed in the last 1000 seconds for each algorithm.The circuit with the lowest XOR count yielded by each algorithm is then used for comparison.
LocalOpt & Yosys.We further optimised the circuits using Yosys synthesis tool and LocalOpt in the two possible orders (Yosys first, then LocalOpt, and vice versa).The circuit (with the lowest XOR count) among the two is then taken as the best circuit.We will term the circuit before this optimisation step as BLOY and the circuit after this step as ALOY.

Benchmark-subset
XOR count.The general trend of the average XOR count difference of the best ALOY circuits found by the different algorithms can be seen in Figures 2, 3, 4 and 5.The time allowed for each size/density can be found in Table 9 of Appendix B. In Table 4, the percentage of the smallest circuit yielded per size is shown.The ones in round bracket are the percentages for BLOY circuits.When the sizes of the matrices are small (i.e. 15 × 15), and if the density is small (0.1 to 0.2), all tested algorithms performed about the same.This is due to the fact that these problems have low branching, and thus low complexity.When the density is about 0.3 to 0.6, A1 and A2 performed better than the rest and when density is around 0.8 to 0.9, RSDF performed better as compared to the rest.
As the sizes of the matrices get larger, the distinction between each algorithm is clear.For size 20 × 20, when the density is small (0.1 to 0.2), all algorithms performed about the same.But when the density is around 0.3 to 0.9, algorithms A1 and A2 performed the best among all algorithms, producing circuits with the lowest XOR count.Depth Although the depths of the circuits generated by algorithms A1 and A2 are not as low as the BP algorithm, the depth of the circuits are generally better than that of the RSDF algorithm.Table 5 shows the percentage of best ALOY circuits (depth) produced by the various algorithms.Again, the ones in round bracket are the percentages for BLOY circuits.Note: we acknowledge that Paar1 and RandomPaar1 produce circuits with good depth levels (Paar1 yielded > 90% of the best depth for each size).However, the results for the two algorithms are omitted from Table 5 for the other algorithms to compare.Time to stability.On average, RSDF took the longest time to reach stability, followed by A2, A1 and then RNBP.Taking RNBP time as the base, RSDF, A2 and A1 required approximately 6.8, 3.2 and 1.7 more time in order to reach convergence.Some of the graphs for the convergence to stability are shown in Appendix C. The average time taken for the best circuit to form in each category is also shown in Appendix D. The time complexity for all the algorithms is dominated by the computation of Dist vector.Roughly speaking, it is positively related to the Hamming weight of the targets as well as the number of elements in Base.Since RSDF (also, A1 and A2), attempts to reach the nearest targets first, the Hamming Weight of the further targets remains large as the size of the Base increases, causing the time required for most of the runs to be longer than that of BP.

Matrices from [DL18]
We applied the algorithms on the matrices from [DL18].We compared the results across using RNBP, RSDF, A1 and A2 with the columns "Ours", "Yosys", "BP" and "Paar2" from Table 5 of [DL18].The results are summarised in Table 6 and Table 7 for 16 × 16 and 32 × 32 matrices respectively.The columns "Yosys" and "Ours" can be found in the column "Const."as the first and second entry respectively as Duval and Leurent constructed the matrices by choosing optimal paths in order to minimise the number of XOR required.After they have constructed such a circuit, they implemented Yosys synthesis tool [Wol] to further optimise it.
Computational effort.For 16 × 16 matrices, we allowed 15000 seconds for it to run.Within the time limit, all twelve matrices have reached stability.For 32 × 32 matrices, we allowed a total of 432000 seconds (5 days) of running time, however, not all of them reached a stable state.
Overall improvements.Despite the fact that these matrices are constructed specially for their optimised implementation, we are still able to improve 7 out of the 24 matrices.For the remaining 17 matrices, we have also achieved good results: 9 matrices with the same number of XOR gates as the construction and 8 matrices with a difference of at most 3 XOR.

The AES diffusion matrix
We have also applied the various algorithms on the AES MDS matrix.The algorithms were given a total of five days to run.As the AES InvMixColumn has a high density, we are unable to run it directly, thus, we used a decomposition from [DR02] and ran the algorithms on the two decomposed matrices.The results can be seen in † ALOY circuits avoid the problem of selecting only the first candidate lexicographically in the loop in the code given by [KLSW17]: they randomly generated two permutation matrices P and Q which help to shuffle the rows and columns for the targeted matrix M .The experiment is then repeated multiple times and the best result is kept.With this method, they have obtained an AES MixColumn circuit of 95 XORs after around 4 hours of computation.In the Random-Normal-BP algorithm in Section 4.1, we will see that we can actually achieve the same effect by just storing all the possible base candidates and randomly output one of them.A circuit of 95 XOR gates for AES MixColumn can be easily obtained just after a few minutes of running the algorithm.In [Max19], Maximov found a 92 XORs circuit for AES.Although he provided a lower XOR count for the AES diffusion matrix, we note that no general algorithm applying to all matrices has been described in [Max19].The improvement comes from an interesting trick that was accomplished using hand-analysis (by fixing and blocking certain gates) in order to later force the XOR reduction search to a very specific subspace.

Conclusion and Future Works
In this work, we have presented new algorithms for global optimisation of linear circuits.With the introduction of randomisation in the algorithms, our heuristics performed better on average as compared to various state-of-the-art heuristics.Especially in the case of AES, we can see an improvement of 3 XORs as compared to the previous record attained by [KLSW17].While the matrices we have shown in this paper are square matrices, this can be applied to non-square matrices as well.Algorithms A1 and A2 can be implemented efficiently to a size of 32 × 32 if ρ is approximately 0.3 or less.
We have also proposed two local optimisation techniques, LocalOpt, to reduce a few more XORs from an already built circuit and we recommend the use of LocalOpt along with Yosys to obtain the best circuit.
Future works: Large matrices.While algorithms A1 and A2 yield better linear circuits than algorithms BP and RNBP in most of the cases, they are also slower than them (A1 and A2 are, however, faster than RSDF).Most of the time is spent on the exhaustive computation of the Dist vector.As a result, these heuristics are infeasible for very large matrices with high density.It seems like A1 and A2 can go up to approximately 32 × 32 with a bias of 0.3.But anything with a large bias seems to be very computationally heavy.For further research, it would be interesting to focus on finding a good global optimisation algorithm for larger matrices (so far only Paar's algorithms can easily handle large matrices).Alternatively, if the cost of computing the Dist vector can be reduced, then A1 and A2 could also be used to evaluate larger matrices.

Local Optimisation.
Local optimisation exploits the knowledge of an efficient circuit and attempts to optimise it locally even further.Currently, LocalOpt is still in a very rudimentary stage and with just a few rules that we have implemented, we can see a portion of matrices being optimised further.Possible further research may actually look into other techniques of reducing XOR count or an integration of global and local optimisation to maximise the potential of both optimisation methods or having more techniques to reduce the XOR/depth of circuits.
Comments on KECCAK.One may think that the above algorithms can be applied to KECCAK's Theta (linear) layer.However, we do not think there is any other savings beyond the trivial ones.In the Theta layer of KECCAK, we view the structure of the primitive as a three-dimensional cuboid as in [BDPA11].To state briefly, each bit is XORed with all the bits on its left column and the column to the front-right (we refer to [BDPA11] for a detailed description).The Theta layer can also be viewed naturally as a 1600 × 1600 matrix.The trivial savings are to pre-compute the XOR sum for each KECCAK column first.Then, in order to compute the output of each bit, we then use two XOR operations: one to XOR the two involved columns together, and one to XOR that value to the target state bit.We verified our intuition using Paar1 algorithm, which yielded the same circuit count as described.Note: As RPaar1 is extremely efficient, we allowed each matrix to have a total of 10,000 runs.

2.
Initialise an m-integer vector Dist which keeps track of the distances of each target from Base 3. Repeat until all the targets are added into the Base: (a) perform XOR on every unique pair of elements in Base and re-evaluate Dist vector (b) Select the pair candidate that minimises m−1 i=0 Dist[i] and add it into Base (c) In case of tie, choose the pair candidate that maximises the Euclidean norm of Dist (Norm criterion)

a
= b ⊕ c d = a ⊕ e f = b ⊕ e . . .where a is only used in the evaluation of d (i.e. a is only used once).Let depth(k) represent the depth of a value k in the circuit.If depth(e) < max{depth(b), depth(c)}, then we may rewrite the above gates sequence as follows: . . .a = e ⊕ argmin{depth(b), depth(c)} d = a ⊕ argmax{depth(b), depth(c)} f = b ⊕ e . . .

Figure 1 :
Figure 1: Computation trees of gate t 3

Figure 2 :Figure 3 :Figure 4 :Figure 5 :
Figure 2: Average XOR count difference of the best ALOY circuits found by BP and A1

Figure 7 :
Figure 7: Average time taken to stability of 15 × 15 to 20 × 20 matrices.For every matrix, the time taken until the best circuit is formed is recorded.The average, standard deviation, minimum and maximum time (of ten matrices in each category) are shown below

Table 1 :
Computation sequence when applying Paar1 algorithm to matrix M .x i refers to the input signals ∀i ∈ {0, ..., 13}, y j refers to the targets ∀j ∈ {0, ..., 6} and t k refers to the temporary intermediate gates ∀k ∈ N

Table 2 :
Computation sequence when applying BP algorithm to matrix M .B refers to the initialized base elements, x i refers to the input signals ∀i ∈ {0, ..., 13}, y j refers to the targets ∀j ∈ {0, ..., 6} and t k refers to the temporary intermediate gates ∀k ∈ N

Table 3 :
Computations sequence when applying A1 algorithm to matrix M .B refers to the initialized base elements, x i refers to the input signals ∀i ∈ {0, ..., 13}, y j refers to the targets ∀j ∈ {0, ..., 6} and t k refers to the temporary intermediate gates ∀k ∈ N

Table 4 :
Percentage of best circuits (XOR counts) produced by various algorithms for various square matrix sizes.The brackets contain the percentage of the BLOY circuits.The best percentages for both ALOY and BLOY circuits are highlighted in bold

Table 5 :
Percentage of best ALOY circuits (Depth) produced by various algorithms for various square matrix sizes.The brackets contain the percentage of the BLOY circuits.The best percentages for both ALOY and BLOY circuits are highlighted in bold

Table 8
[BFI19]Max19]Two independent results[BFI19,Max19]have been published in IWSEC 2019 and ePrint after our submission to TCHES, in which the authors managed to achieve similar results for AES MixColumn matrix.More precisely,[BFI19]uses a method to

Table 9 :
Time allowed for different sizes and densities for RNBP, A1, A2 and RSDF