UpWB: An Uncoupled Architecture Design for White-box Cryptography Using Vectorized Montgomery Multiplication



Introduction
White-box cryptography (WBC) originates from the works [CEJvO02b] [CEJvO02a] by Chow, Eisen, Johnson and van Oorschot, which are categorized as the CEJO scheme. By combining look-up tables (LUTs) with encoding techniques, the CEJO scheme aims to protect standard block ciphers (e.g., the Advanced Encryption Standard) from key-extraction attacks even under the white-box attack model. This strong security model assumes that a hostile user has full access to and control over the execution environment, e.g., the ability to manipulate memory addresses, memory content, the execution flow and so forth. In contrast, the conventional black-box security model assumes that the attacker only observes the external behavior of the cryptographic algorithm, such as chosen plaintext-ciphertext pairs. Additionally, the well-known side-channel attacks that examine execution time, power consumption, etc. lead to the gray-box model of intermediate hypothetical strength. In classical applications, server-side devices are usually assumed to operate under the black-box model, while the client side commonly corresponds to the gray-box or white-box model. Recently, WBC has seen fast-rising demand in many security-critical scenarios, such as digital rights management (DRM), mobile payment, banking and memory-leakage-resilient software [BIT16] [DLPR13] [BABM20].

Space-hard White-box Block Ciphers
Incompressibility. One research line of WBC follows the CEJO framework to provide white-box implementations of standard block ciphers, such as the white-box version of the Advanced Encryption Standard (AES) [CEJvO02a] and of lightweight block ciphers [CG22]. Nevertheless, almost all of them have been penetrated by key-extraction attacks to date. The other research line devises new dedicated white-box block ciphers (DWBCs) to achieve provable security while imposing large computational overheads, which is the focus of this paper. As mentioned in [DLPR13] [BABM20], the basic aims of white-box programs include resisting key-extraction attacks and preventing leakage of the message (one-wayness). Additionally, the code-lifting attack is an important security issue under the white-box setting. In a code-lifting attack, an attacker directly copies the code of the program and runs it on devices of his choice without needing to recover the value of the secret key. At this point, the whole extracted program is equivalent to a big key. Incompressibility, a security property proposed in [DLPR13] to mitigate code-lifting attacks, is widely adopted in modern dedicated white-box block ciphers. The purpose of incompressibility is to increase the difficulty of extracting and transferring the code by converting the program into a significantly larger but functionally equivalent version. Furthermore, the program only remains functional in its complete form, i.e., it malfunctions when fragments of it are removed. In this way, if an adversary obtains part of the program, he should be able neither to recover the value of the secret key nor to derive a compressed version of equivalent functionality and use it to decrypt arbitrary ciphertexts. For brevity, we refer to [BAB+19] [DLPR13] for the formal definition of incompressibility.
When designing a dedicated white-box block cipher, a typical method to achieve incompressibility is to derive a large, incompressible program from a small-sized symmetric encryption scheme by implementing the key-related encryption functions with look-up tables. In this case, security against key-extraction attacks in the white-box setting is reduced to the well-studied problem of key recovery for block ciphers in the standard black-box setting. Usually, the table-based incompressible version is referred to as the white-box implementation (formalized as the big-key mode in [BBL23]), while the original small-sized encryption scheme is referred to as the black-box implementation [BIT16] (formalized as the small-key mode in [BBL23]).
Space-hardness. Following the concept of incompressibility, [BI15] also proposes a novel notion named (M, Z)-space hardness, which quantifies the security against code lifting by the amount of code that a white-box attacker needs to extract from the implementation to maintain its functionality.
Definition 1 ((M, Z)-space hardness [BI15]). The implementation of a block cipher is (M, Z)-space hard if it is infeasible to encrypt (decrypt) any randomly drawn plaintext (ciphertext) with probability greater than 2^{-Z} given any code (table) of size less than M.
It is noted in [BI15] that weak white-box security can be seen as a special case of (M, Z)-space hardness and corresponds to (M, 0)-space hardness. [FKKM16] also proposes the notion of weak incompressibility, which is essentially a formalization of space hardness. The first space-hard block cipher, named SPACE [BI15], is designed with different code sizes and is therefore applicable to a wide range of environments and use cases. Since it does not require external encodings, it is also of higher portability than the CEJO scheme. However, the security-prioritized design strategy also leads to heavy performance penalties and restricts its wide application. Since the birth of space-hard ciphers, algorithm designers have been conducting studies to improve their performance. One of the prevailing techniques is to utilize the nested SPN (NSPN) structure, where the S-box is realized by another small-scale SPN block cipher. [BIT16] presents the first NSPN-based white-box block cipher, named SPNbox, which utilizes AES variants to build the internal S-box and applies an MDS matrix in the linear layer. More recently, [KI21] puts forward an updatable white-box scheme (Yoroi) based on a partial MDS matrix, which can refresh the LUT without extra re-encryption. WARX [LRH+22] employs addition/rotation/XOR (ARX) primitives and a random MDS matrix to improve the overall performance. However, contemporary DWBCs still suffer from limited performance due to the enormous round count, high-dimension MDS matrices and nested loop structure (detailed in Section 2).

The Story: Accelerating Black-box Implementation of WBC
Motivations. In general, DWBCs are designed with different security parameters and table sizes to meet a variety of use cases. The capacity of the incompressible LUT under the white-box implementation ranges from several kilobytes to gigabytes, allowing us to seek a balance between security strength and resource constraints. In fact, for a WBC with medium-sized parameters (e.g., SPNbox-16), the speed (measured by average cycles per byte) of the black-box implementation is about three times slower than that of the white-box version when executed on a software platform. The main reason for this performance gap is that the black-box implementation incurs nested loop dependencies, which are hard to parallelize on software platforms and prone to deteriorate performance. Modern CPUs can resort to specific instructions (e.g., AES-NI) or vector instructions (e.g., AVX) to achieve data-level parallelism, but the nested loop dependency hinders further parallelization at the loop level. Our objective is to show that the performance of space-hard ciphers under the black-box implementation can be significantly promoted by utilizing a well-designed hardware accelerator. UpWB can be integrated into a cloud server and captures better area and energy efficiency than a software-only platform. The recent progress on WBC has been nothing short of spectacular, which makes us optimistic that future advancements will bring the power of WBC to many more applications.
WBC was originally invented for software protection in the absence of hardware support. Hence, the question may arise that there is no need to apply WBC if we already possess secure hardware resources. The feasibility and profit of adopting hardware acceleration are summarized from the following three perspectives.
(1) We would like to emphasize that this work aims to accelerate the server-side WBC under the black-box implementation. The aforementioned question only considers the white-box implementation by default but does not hold water for the black-box implementation. The execution environment of secure hardware is commonly assumed to be non-white-box [SMG16a], which is consistent with the trusted environment of the server side and exactly explains why we only accelerate WBC under the black-box setting. The client-side WBC is still implemented as a software program under the white-box setting as usual. (2) Both the white-box and black-box implementations are usually deployed in pairs to fulfill the application of a DWBC. Compared with the memory-hard white-box implementations of WBC, the black-box version features the advantages of facilitating fast key switching and consuming much less storage. Since the server side is assumed to already know the secret keys, the server-side WBC can be realized as the black-box implementation. Space-hard block ciphers actually resemble asymmetrically hard functions in terms of memory hardness, as noted in [BP17], which create two classes of users: one party (such as the client side) has to compute the white-box implementation with increased memory hardness, while the other party (such as the server side), who knows the secret, can evaluate the black-box implementation with the original memory efficiency. However, as mentioned before, the performance of the black-box implementation on general-purpose processors is still problematic, which makes it hard to meet the needs of massive client terminals and highly concurrent applications on the server side. Adopting hardware acceleration for the black-box implementation naturally boosts the throughput of WBC and matches the black-box security model of the server side, thus compensating for the performance inferiority of space-hard ciphers.
For example, Figure 1 depicts the application of WBC in cloud-based content distribution [BIT16]. The cloud server has to encrypt contents in the black-box setting (thus retaining conventional memory efficiency) and distribute them to user devices. Since running multiple white-box implementations for all users would require a prohibitive amount of memory, it is unreasonable to replace the server-side black-box implementation with the memory-bound white-box version. The UpWB accelerator is leveraged to improve the computation throughput on the server side. User devices decrypt the contents in the white-box setting using the incompressible software program, without needing to rely on hardware. Additionally, although most private and remote servers can be protected much better than edge devices, additional threats may remain, such as the cache-timing attack [GSM15]. This attack exploits architectural features of a software processor, such as data dependencies and cache memory access times, to extract the secret key, and has recently received wide attention. By realizing the black-box implementation with a specific hardware accelerator, this issue can be addressed at the cipher-implementation level, in analogy with the benefit of the AES-NI instruction. Based on the above analysis, we explicitly summarize the difference between this work and the conventional WBC applied in DRM, as shown in Table 1.
(3) The server side under the black-box implementation always deals with a large number of user keys simultaneously and demands relatively higher throughput. In particular, devices on the server side inevitably encounter cryptographic algorithms with diverse security parameters [LWD+18]. It is profitable to devise a configurable hardware architecture supporting different algorithms and parameters for the server side. Although a few works have achieved performance improvements for WBC on general-purpose processing platforms [RFS+19] [TGS+22], the final computation throughput is still limited because CPUs cannot directly exploit the parallelism of nested operations well enough. Therefore, a domain-specific hardware accelerator is more promising for achieving better performance and energy efficiency. Hardware acceleration of basic primitives tends to exert downstream effects. We hope that this acceleration can also motivate more applications of WBC for developing potentially fancy cryptographic infrastructure [DLS+22]. Discussions on the security threats and applications. The recent hybrid code-lifting attack [TI22] supposes that white-box and black-box attackers work together to achieve program recovery. The black-box attacker receives the leakage generated by the white-box attacker and uses it to perform cryptanalysis under the black-box setting, which determines whether the attack succeeds or not. In other words, the leakage from the collaborating white-box attacker enhances the ability of the black-box attacker, which is similar to the attack model of strong incompressibility in terms of motivation. However, SPNbox and Yoroi themselves are designed under the assumption that the property of space hardness is achieved in the presence of only a white-box attacker, i.e., without the collaboration of hybrid code lifting. In the context of the hybrid code-lifting attack model, the 128-bit security strength of SPNbox against key-recovery attacks in the black-box model is compromised, but this does not mean that SPNbox is completely broken. It is also remarked in [TI22] that increasing the number of rounds of SPNbox by a few rounds (e.g., 12-round SPNbox-16) produces a good candidate to resist such hybrid code-lifting attacks, if the original security strength is still demanded. As for the Yoroi scheme, its canonical representation and partial MDS matrix are vulnerable to the hybrid code-lifting attack, which cannot be circumvented by merely increasing the number of rounds. Thus, Yoroi should be deployed only in the very limited use case where the black-box attacker is assumed unable to obtain the leakage, as suggested by [TI22]. Based on the above considerations, the proposed UpWB hardware accelerator supports configurable parameters and is thereby able to efficiently update the algorithmic parameters of SPNbox, Yoroi and WARX, including the number of rounds and the random MDS matrix elements. [MN11] also makes an intuitive remark that the dynamic generation of MDS matrices makes the prediction of statistical properties and relationships between plaintexts and ciphertexts more difficult, which helps to avert linear and differential cryptanalysis.
Mitigating code-lifting attacks is particularly central to the application of white-box cryptography, besides security against key extraction, as explicated in [BABM20].
The concept of incompressibility requires cryptographic programs to be implemented as very large programs, which seems unsuitable for resource-constrained mobile devices and the Internet of Things. As a consequence, large-sized space-hard ciphers are of main interest for typical applications like DRM, but do not seem to be the appropriate solution for mobile payment applications, which favor more lightweight deployments and demand other security properties including confidentiality, integrity and so on [BABM20]. Therefore, in addition to the property of incompressibility, other notable techniques like application binding, hardware binding [BBF+20], traceability [DLPR13] and so forth should also be carefully considered for broader applicability.
State-of-the-art. To the best of our knowledge, hardware designs for DWBCs have not been reported so far. Although some works focus on the performance improvement of WBCs under central processing unit (CPU) execution [RFS+19] [TGS+22], the final throughput is still limited and only the white-box implementation is evaluated. Some works [SMG16b] [Sas18] implement the relatively lightweight white-box AES/Noekeon versions on FPGA rather than the type of DWBCs discussed in this paper. In general, the nested loop dependency and computation-intensive modular operations are unfriendly to CPUs, which commonly have a low degree of parallelism. Graphics processing units (GPUs), on the other hand, have rich computation resources but consume much more energy and incur higher monetary cost. Nevertheless, hardware design has good scalability and can leverage specialized data-flow information to achieve better computation efficiency. As a special focus, various algorithms for modular multiplication over GF(2^m) have been intensively studied. However, an efficient and vectorized MM algorithm supporting random MDS matrices is still lacking, which will be further discussed in Sections 2 and 5.
Contribution. In this work, we focus on hardware acceleration for multiple NSPN-based block ciphers under the black-box setting. Our main contributions are as follows: • We analyze several NSPN-based WBCs to identify the underlying operations and point out their distinction from conventional block ciphers. We show that the nested loop is prone to deteriorate the performance (Section 2).
• For the first time, we develop UpWB, an uncoupled and efficient hardware architecture for server-side WBC under the black-box setting, which can accelerate three series of DWBCs including SPNbox-8/16/24/32, Yoroi-16/32 and WARX-16. We introduce a fine-grained task partition (FTP) mechanism to decouple the nested loop dependency, which serves as a generic method and efficiently promotes the computation throughput per area (Section 3).
• We propose several optimized hardware modules to achieve decent area-time efficiency, encompassing 1) an adaptive and redundancy-free vectorized Montgomery multiplier (VMM) that processes multi-precision data, multi-size vector lengths and different irreducible polynomials; the refined VMM achieves almost the lowest area complexity among state-of-the-art works without sacrificing performance or flexibility; 2) a configurable matrix-vector multiplier (MVM) that deals with multi-size MDS matrices; the proposed MVM architecture adopts a diagonal-major data-flow to circumvent the vast connection cost and reduce the memory footprint;
3) unified multi-scale (Inv)MixColumns obtained by intensively sharing sub-operations; a design generator is developed to customize the constant modular multipliers for minimal circuit depth and 32% average savings in area cost (Section 4).
• We evaluate UpWB based on FPGA implementation and ASIC synthesis. The FPGA implementation of UpWB outperforms the optimized software counterpart by 7× to 30× in terms of computation throughput. The synthesis result under TSMC 28 nm technology shows that a 36× to 164× speedup is achieved when UpWB works at the peak frequency of 1.3 GHz (Section 5). Last but not least, our implementation code is available at https://github.com/xiang-rc/UpWB_ref.

Background
In this section, we first introduce the relevant algorithmic background along with an analysis of the main operations. Then, we identify the design challenges that are specific to a unified and efficient architecture supporting NSPN-based WBCs.

Notations
Let GF(2^m) denote the binary extension Galois field, whose elements are bit strings taken from the set {0, 1}^m. Define the quotient ring of polynomials as GF(2)[x]/f(x), where f(x) is a suitable irreducible polynomial of degree m. An element a(x) = a_{m−1}x^{m−1} + ... + a_1x + a_0 can be denoted as (a_0, a_1, ..., a_{m−2}, a_{m−1}). Let A_i represent the i-th w-bit chunk of a for 0 ≤ i < W = m/w. Let A_i[j] denote the j-th bit of A_i, where j = 0, 1, ..., w − 1. Constants over GF(2^m) are represented in hexadecimal form. Addition over GF(2^m) is the same as the XOR operation ⊕.
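To make the notation concrete, the following minimal Python sketch (the helper names are illustrative, not part of the design) shows how an element of GF(2^m), stored as an integer, is split into w-bit chunks and how field addition reduces to XOR:

def chunks(a, m, w):
    # Split a into W = m / w chunks A_0, ..., A_{W-1}, assuming w divides m.
    W = m // w
    return [(a >> (i * w)) & ((1 << w) - 1) for i in range(W)]

def chunk_bit(A_i, j):
    # A_i[j]: the j-th bit of chunk A_i, with 0 <= j <= w - 1.
    return (A_i >> j) & 1

def gf_add(a, b):
    # Addition over GF(2^m) is the bitwise XOR of the coefficient strings.
    return a ^ b

# Example with m = 16, w = 4: the element 0x1234 yields W = 4 chunks.
print([hex(c) for c in chunks(0x1234, 16, 4)])   # ['0x4', '0x3', '0x2', '0x1']
print(hex(gf_add(0x1234, 0x00ff)))               # '0x12cb'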

Algorithmic Overview and Operations
The nested SPN-based DWBC takes an n-bit data block and a k-bit secret key to perform encryption and decryption. The nested structure is embodied by the n_s-bit substitution box, which is itself a small-sized SPN-type cipher. For WBC-8/16/32 (n_s = 8/16/32), n and k are both 128 bits. For WBC-24 (n_s = 24), n and k are both scaled to 120 bits. From an implementation point of view, NSPN-based WBCs share a common data-flow structure. Readers may refer to the original papers for more detailed specifications. State. The state in SPNbox, Yoroi and WARX is organized as a row vector containing t = n/n_s elements: X = (X_0, X_1, ..., X_{t−1}). Each n_s-bit element can be further divided into l = n_s/8 bytes: X_i = (X_{i,0}, ..., X_{i,l−1}).
Data structure: To make full use of the inherent data-level parallelism of the block cipher, the round-based implementation strategy is adopted in our architecture [UHM+20], which means an n = t × n_s-bit data block is processed per cycle. Key Schedule. The round keys are generated by a key derivation function (KDF), where the k-bit master key is expanded into (R_i + 1) round keys, namely (k_0, k_1, ..., k_{R_i}) = KDF(k, n_s · (R_i + 1)). Here R_i denotes the number of rounds of the internal SPN cipher. The KDF is specified as the hash function SHAKE-128 [Dwo15]. Table 2 summarizes the parameter sets for the related NSPN block ciphers.
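As a functional illustration of the key schedule, the expansion can be modelled with the standard hashlib SHAKE-128 interface in Python; the byte ordering and the example parameters below are assumptions for the sketch, not values taken from the cipher specifications:

import hashlib

def kdf_round_keys(master_key: bytes, n_s: int, r_i: int) -> list[int]:
    # Expand the master key with SHAKE-128 into (r_i + 1) round keys of n_s bits
    # each, following (k_0, ..., k_{R_i}) = KDF(k, n_s * (R_i + 1)).
    # Big-endian byte ordering is an assumption of this sketch.
    total_bits = n_s * (r_i + 1)
    stream = hashlib.shake_128(master_key).digest(total_bits // 8)
    n_bytes = n_s // 8
    return [int.from_bytes(stream[i * n_bytes:(i + 1) * n_bytes], "big")
            for i in range(r_i + 1)]

# Example with SPNbox-16-like width (n_s = 16) and an assumed R_i = 16.
keys = kdf_round_keys(b"\x00" * 16, n_s=16, r_i=16)
print(len(keys), hex(keys[0]))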
Round Function. Since the round function for encryption is exactly the inverse of that for decryption, we only introduce the former here. As Figure 2 depicts, the overall structure of the round function consists of three components, namely the non-linear substitution layer γ, the linear layer θ and the affine layer σ. The final result is obtained by applying R_o rounds of the round function to the state. In this paper, the θ and σ layers are fused as the linear transformation (LT) loop, while γ forms the non-linear transformation (NLT) loop from a computational point of view. (De-)multiplexers are mainly required to support the different execution orders and data transmission. (1) Nonlinear Layer. The γ layer maps the t × n_s-bit input to the t × n_s-bit output using the substitution boxes: (X_0, X_1, ..., X_{t−1}) → (γ_0(X_0), γ_1(X_1), ..., γ_{t−1}(X_{t−1})). As Figure 2 shows, the instantiations of the γ layer for SPNbox/Yoroi and WARX differ from each other. (2) Linear Layer. The linear diffusion layer is specified as the multiplication between the state vector and the MDS matrix: (X_0, X_1, ..., X_{t−1}) → (X_0, X_1, ..., X_{t−1}) × MDS matrix. Table 2 summarizes the parameter sets for the MDS matrices and their inverses, which involve different sizes, finite fields and types. As a special case, WARX adopts random matrices.
(3) Affine Layer. For SPNbox and WARX, the affine layer σ_r is unified as the addition of round-dependent constants to the state. For Yoroi, the affine layer is defined differently; we refer to the original specification for its exact form.
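A functional Python sketch of one outer round is given below to illustrate the γ → θ → σ data-flow; the S-box, MDS matrix and round constants are placeholders (in the actual ciphers, γ is itself a key-dependent small SPN cipher), so the snippet is not a specification of any of the three algorithms:

def gf_mul(a, b, f, m):
    # Multiply in GF(2^m); f is the irreducible polynomial with its x^m term included.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= f
    return r

def outer_round(state, sbox, mds, rc, f, m):
    t = len(state)
    y = [sbox(x) for x in state]                  # gamma: element-wise substitution
    z = [0] * t
    for j in range(t):                            # theta: row vector times MDS matrix
        for i in range(t):
            z[j] ^= gf_mul(y[i], mds[i][j], f, m)
    return [z[j] ^ rc[j] for j in range(t)]       # sigma: XOR of round-dependent constants

# Toy usage with t = 4 elements over GF(2^8) and placeholder parameters.
f8, m8 = 0x11B, 8
state = [0x01, 0x23, 0x45, 0x67]
sbox = lambda x: (x * 5 + 1) & 0xFF               # placeholder bijection, not a real S-box
mds = [[1, 2, 3, 4], [4, 1, 2, 3], [3, 4, 1, 2], [2, 3, 4, 1]]   # placeholder matrix
rc = [0x10, 0x20, 0x30, 0x40]                     # placeholder round constants
print([hex(v) for v in outer_round(state, sbox, mds, rc, f8, m8)])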

Design Challenges
1) Nested loop dependency with large trip counts. Figure 3 presents the difference in data-flow graph (DFG) between conventional block ciphers and NSPN-based WBCs. A conventional block cipher (e.g., AES) consists of the iteration of a round function, yielding a single loop. To improve the throughput by approximately x times, the effective method is to duplicate the round function x times and then insert pipeline registers among the copies. Nevertheless, three data loops are formed in NSPN-based WBCs. First, the NLT loop needs to iterate over M rounds of the internal SPN function (γ layer). Then, the high-dimension MDS matrix-vector multiplication is usually decomposed into iterative scalar-vector operations [SKOP15], which leads to the second (LT) loop. Finally, the round count of the external SPN structure is R, which results in a third loop nested with the former two sibling loops. Obviously, the inner- and inter-loop dependencies render the traditional strategy impractical for NSPN-based WBCs. Specifically, the inter-loop dependency leads to pipeline stalls and incurs extra usage of FIFOs. Since the trip count is on the order of hundreds and varies with the security parameters, fully unrolling the nested data loops not only entails tremendous resource overhead but also results in low resource utilization. Prior works aim to improve the mapping efficiency of nested loops on coarse-grained reconfigurable architectures (CGRAs), but most of them are applicable to inner loops with small trip counts and are limited by the fixed size of the PE array. Thus, we further leverage the specific information of WBC to attain case-oriented design strategies and improved performance. 2) High-dimension and dynamic MDS matrices. As a cryptographic primitive, the linear mapping commonly adopts MDS matrices to achieve diffusion in block ciphers. However, four differences exist between conventional block ciphers and NSPN-based WBCs. (1) The dimension of the MDS matrices applied in WBC is always larger than that of conventional block ciphers, which tends to deteriorate the execution performance. For instance, SPNbox-8 applies a 16 × 16 MDS matrix, whose dimension is four times as large as that of AES. (2) The sizes of the involved MDS matrices are diverse, scaling with different security levels. For example, the SPNbox family contains four different sizes. (3) The elements of the MDS matrix within WARX are dynamically generated rather than fixed constants. (4) Different types of MDS matrices (e.g., circulant or Hadamard) are involved, as Table 2 shows. A plethora of works has been carried out to either search for lightweight MDS matrices [AF14, SKOP15, LS16, LSL+19, VKS22] or optimize the implementation of a prescribed matrix for low latency or low area footprint [LWF+22, XZL+20, VKS22]. This optimization can be transformed into the shortest-linear-program problem, which is deemed NP-hard [BMP08]. Nevertheless, a configurable and efficient hardware design for dynamic MDS matrices of different sizes has been scarcely explored so far.
The value of implementing a configurable MVM architecture supporting random MDS matrices has been demonstrated in prior works. Besides reducing the time for re-fabrication, devices on the server side are always required to handle responses from different clients, which inevitably involve cryptographic algorithms with diverse security parameters [LWD+18]. From a security perspective, some communication protocols even adopt more security-conservative parameters [WSH+10], which are likely to be periodically updated. Additionally, benefiting from the usage of random elements, a dynamic matrix makes the statistical properties of plaintext-ciphertext pairs hard to predict, which averts linear and differential cryptanalysis to some degree. By randomly and uniformly distributing the Hamming weight of the elements, it also tends to be more immune to side-channel attacks, especially power-related attacks [MN11].
3) Bit-precision reconfigurability without redundancy. As illustrated in Section 2.2, DWBCs involve different bit-widths (4 ∼ 32 bits), parameter sets and execution orders, which easily result in complex routing and imbalanced workloads. As identified in [CLF+17] [CLK+16], it is challenging to devise a redundancy-free modular multiplier that supports variable bit-width GF operations under different irreducible polynomials. Due to the polynomial modulo operation, a smaller GF bit-width multiplication cannot directly use a larger GF bit-width data-path by simply setting the most significant bits to zeros. In this work, besides the flexibility of precision and f(x), one extra dimension of flexibility, namely the number of modular multipliers, is further required to handle multi-size MDS matrices. For example, to process SM_8 and SM_16 in a round-based manner, we need #16 8-bit and #8 16-bit modular multipliers, respectively. In a nutshell, a flexible vectorized modular multiplier has to be designed to support multi-precision data, multi-size matrices and different irreducible polynomials.
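The incompatibility can be made explicit with a small Python check: multiplying the same operands in GF(2^8) and in a zero-padded 16-bit data-path (the degree-16 polynomial below is only an example choice) yields different results because the reduction step of the narrow field is never triggered:

def gf_mul(a, b, f, m):
    # Multiply in GF(2^m); f includes the leading x^m term.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= f
    return r

# Correct product in GF(2^8) with f(x) = x^8 + x^4 + x^3 + x + 1.
p8 = gf_mul(0x57, 0x83, 0x11B, 8)         # 0xC1
# The same operands zero-padded into a 16-bit data-path (example polynomial
# x^16 + x^5 + x^3 + x + 1) never reach degree 16, so no reduction happens and
# the result is just the raw carry-less product, which differs from the GF(2^8) one.
p16 = gf_mul(0x57, 0x83, 0x1002B, 16)     # 0x2B79, not 0x00C1
print(hex(p8), hex(p16))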

Architectural Design Methodology
In this section, we firstly introduce the FTP mechanism, based on which the performance is profiled.Afterwards, we present the overall architecture of UpWB and describe the task distribution for each component.

Proposed FTP Mechanism
Exemplary FTP mechanism. Figure 4 (a) shows that the nested loop dependency results in a low degree of parallelism if both NLT and LT are processed serially. A straightforward overlapped processing strategy is presented in Figure 4 (b). However, the gap in loop count between NLT and LT leads to idle time for synchronization, which lowers the resource utilization. Assuming that the loop count of NLT is two times as large as that of LT, we can overlap the execution time of two NLT modules with that of one LT module to eliminate the idle time. In other words, we partition the NLT function into two sub-tasks to decouple the nested loop dependency, which avoids the usage of FIFOs for synchronization [WNCY16] and efficiently increases the overall computation throughput. In this way, the operation throughput (TP) is raised by approximately 3× at the cost of only 1.5× resource overhead (RO), which achieves better area-time efficiency than other methods like full unrolling or direct duplication. The FTP mechanism is inspired by [WNCY16], which implements in-memory AES and faces a similar data dependency.
Performance model. Based on the FTP mechanism, we attempt to decompose the N-round NLT and the M-round LT into x (N/x)-round NLT_i modules and y (M/y)-round LT_j modules, respectively. The uncoupled data-flow graph is shown in the right bottom of Figure 4. In terms of the timing schedule, we try to overlap the data processing to hide much more latency. In the ideal case, all of the function modules are utilized to alternately process x + y data points. Then, the cycle count for black-box encryption is calculated as

Cycles = T_warm-up + (x + y) · R · max{N_i, M_j},

where max{N_i, M_j} for 0 ≤ i ≤ x − 1, 0 ≤ j ≤ y − 1 indicates the timing penalty for waiting for the function module with the largest sub-round count; this largest sub-round count determines the period of alternation, and R denotes the number of external rounds. The warm-up time T_warm-up represents the cycle count required to fill all of the function modules with data points. Consequently, the computation throughput (TP) is formulated as

TP = ((x + y) · DW · f_max) / Cycles,        (1)

where DW denotes the bit-width of the data and f_max denotes the peak frequency. Considering that max{N_i, M_j} ≥ (M + N)/(x + y) (pigeonhole principle), the equality holds if and only if N_i = M_j = (M + N)/(x + y). By plugging the equality condition into the above equation, the computation throughput further satisfies

TP ≤ ((x + y) · DW · f_max) / (T_warm-up + (M + N) · R).

Based on the above analysis, a useful design principle is to narrow the gap between N_i and M_j, which reduces the amount of idle time spent synchronizing function modules with different amortized round counts. Being aware of the round parameters of all involved block ciphers, we determine x = 4 and y = 2 to develop the NLT and LT clusters.
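To explore candidate partitions numerically, the following Python helper evaluates the cycle-count and throughput estimate above; the warm-up term and the example parameters are assumptions for illustration rather than the measured values of any cipher:

from math import ceil

def ftp_estimate(N, M, x, y, R, DW, f_max):
    # N-round NLT split into x sub-modules of ceil(N/x) rounds; M-round LT split
    # into y sub-modules of ceil(M/y) rounds; x + y data points alternate.
    step = max(ceil(N / x), ceil(M / y))      # period of alternation (largest sub-round count)
    warm_up = (x + y - 1) * step              # assumed cycles to fill every function module
    cycles = warm_up + (x + y) * R * step     # cycles to finish x + y data points
    tp = (x + y) * DW * f_max / cycles        # throughput in bits per second
    return cycles, tp

# Placeholder example: N = 16 NLT rounds, M = 8 LT rounds per external round,
# R = 10 external rounds, 128-bit blocks at 1.3 GHz, partition x = 4, y = 2.
print(ftp_estimate(16, 8, 4, 2, 10, 128, 1.3e9))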
As a typical example, Figure 5 presents the timing diagram of the overlapped processing for SPNbox-16. The round count of the LT_j module is about twice as large as that of the NLT_i module. Thus, we align the operation time of two NLT modules with one LT module. Then, 4 data points are processed alternately and the TP is calculated according to Equation (1).

Overall Hardware Architecture
Figure 6 presents the overall architecture of UpWB, which mainly consists of a top control module, a KDF module, and the NLT and LT clusters. Ideally, UpWB handles 6 data channels synchronized by 6 sequencers. To execute the different types of algorithms, data blocks and the configuration context are sent to the buffer in the initial phase, including the algorithm type, encryption/decryption mode and parameter set. The KDF module is instantiated as the SHAKE-128 function, which generates the round keys for the NLT cluster in the second phase. In our Keccak design, the five permutation steps are conducted sequentially on the 1600-bit state cube within each cycle. In the third phase, the NLT cluster carries out the internal SPN function in a round-based manner, processing an n-bit data block per cycle. The LT cluster conducts the MVM of different sizes. As the core arithmetic unit, an adaptive VMM is devised to handle random matrices based on a folded architecture.

Adaptive VMM
To obtain precision-reconfigurability without redundancy, the new VMM is devised to meet two requirements: (1) The VMM should be able to process multi-precision variable elements under different irreducible polynomials without under-utilization of hardware resources. (2) A single modular multiplier with a large bit width q can be exactly decomposed (vectorized) into n modular multipliers with a small bit width q/n. Prior work on single MM. A scalable radix-2 Montgomery multiplication algorithm was proposed in prior work, in which b(x) is scanned one bit at a time while a(x) is scanned serially chunk by chunk. [RM13] modifies this algorithm by scanning multiple bits of b(x) per cycle, as shown in Algorithm 1. The iterative operation within the third (j-)loop is divided into two types of computation blocks. The parity signal and C_0 are generated in block α. Based on the parity signal, block β is responsible for computing the most significant bit (MSB) of C_0 and the remaining C_j for 1 ≤ j ≤ W − 1.

Algorithm 1 Scalable radix-2 Montgomery multiplication in [RM13]
Input: Let a(x) = a = (A_0, A_1, ..., A_{W−1}) and b(x) = b = (B_0, B_1, ..., B_{W−1}) be two polynomials over GF(2^m). Here W = ⌈(m+1)/w⌉ and w is the bit-length of a chunk. Similarly, f(x) = f = (F_0, F_1, ..., F_{W−1}) is the irreducible polynomial.
Output: the Montgomery product c(x) of a(x) and b(x) modulo f(x).
Insight on the data-flow of the systolic array. Figure 7 depicts the data-flow of the systolic array proposed in [RM13]. The systolic architecture suffers from large pipeline latency in that block α_{i+1,j} must be conducted before block β_{i,j+1}. Taking m = 5, w = 1 as an example, Figure 7 presents the timing diagram for a one-dimensional array with #PE = 3, which takes 16 cycles to obtain the final result. However, the fact that the parity signal of the first row can be broadcast in parallel (1 cycle) rather than serially (≈ m cycles) to the remaining rows is not explored in the prior work. In the following, we show that the cycle count can be considerably reduced thanks to this observation.
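Independently of the chunked, scalable schedule of Algorithm 1, the functional behaviour of the radix-2 Montgomery product over GF(2^m) can be captured by the classical bit-serial recurrence. The Python sketch below is only a golden reference model, not a description of the hardware data-flow:

def mont_mul_gf2m(a, b, f, m):
    # Bit-serial radix-2 Montgomery multiplication over GF(2^m):
    # returns c(x) = a(x) * b(x) * x^{-m} mod f(x).
    # f is the irreducible polynomial with its leading x^m term included.
    c = 0
    for i in range(m):
        if (b >> i) & 1:        # scan one bit of b(x) per iteration
            c ^= a
        if c & 1:               # the parity decides whether f(x) must be added
            c ^= f              # ...so that c becomes divisible by x
        c >>= 1                 # exact division by x
    return c

def gf_mul(a, b, f, m):
    # Plain multiplication in GF(2^m), used only to check the Montgomery result.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= f
    return r

# Check: multiplying the Montgomery product back by x^m mod f(x) recovers the
# ordinary field product.
m8, f8 = 8, 0x11B                                  # AES field as an example
a, b = 0x57, 0x83
x_m = gf_mul(1 << (m8 - 1), 2, f8, m8)             # x^m mod f(x)
assert gf_mul(mont_mul_gf2m(a, b, f8, m8), x_m, f8, m8) == gf_mul(a, b, f8, m8)
print(hex(mont_mul_gf2m(a, b, f8, m8)))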
Proposed VMM Design. Three optimization tricks in terms of area cost and latency are proposed to generate the adaptive radix-2 VMM shown in Algorithm 2. (1) Eliminating the redundant operations. Note that the MSB of f(x) is fixed to 1. Indeed, it is sufficient to decompose the operand into W = m/w chunks rather than W = ⌈(m+1)/w⌉. As a result, the last iteration on index j can be eliminated when m is divisible by w. We prove this as follows. Let W = m/w and C_{i,j} denote the j-th chunk of C in the i-th iteration. For j = W, by taking A_W = 0 and F_W = 1, the last iteration only produces the MSB of C_{W−1}, which equals the parity signal and is already obtained in the (j = 0)-th iteration (Q.E.D.). The elimination of the last iteration reduces the area cost by m PEs, which is further quantified by the practical implementation in Section 5.1. (2) Unrolling the j-loop and folding the i&k-loop. Note that modular addition over GF(2^m) naturally avoids the carry chain. Thus, we would like to remark that the pipeline registers used in [RM13] are actually redundant, i.e., they have no impact on the critical path. The data-flow in the j-loop direction can be unrolled and processed in parallel. In contrast, the data-flow in the i/k-loop direction has to be processed serially. Thus, we replace the systolic array with a loop-based folded structure to avoid the unnecessary pipeline registers and reduce the cycle count by approximately m×. (3) Vectorizing the modular multiplier to achieve adaptivity. Prior arts [GTSK02, RM13, SC06] solely focus on the optimization of a single modular multiplication for ECC-based cryptography, where the value of m is typically up to hundreds of bits. However, the matrix-vector multiplication in the linear mapping involves vector (multiple) modular multiplications. We point out that an almost redundancy-free vectorization can be achieved by configuring the source of the MSB for each output of a PE. As shown in Algorithm 2, the iterative computation now consists of four types of blocks, which is further extended to support the configuration of the number of modular multipliers. Besides, a bm signal is introduced to change the bit width of a block, so that MMs with data width less than w bits can also be conducted.

Algorithm 2 Proposed vectorized Montgomery multiplication
Input: Let a(x) = a = (A_0, A_1, ..., A_{W−1}) and b(x) = b = (B_0, B_1, ..., B_{W−1}) be two polynomials over GF(2^m). Here W = m/w and w is the bit-length of a chunk. Differently, f(x) = f = (F_0, F_1, ..., F_{W−1}) is the irreducible polynomial, whose most significant bit is fixed to 1 and hence neglected in the vector. Finally, bm is the bit-width mode of the PE cell and bm ≤ w.
Output: the Montgomery product c(x) of a(x) and b(x) modulo f(x). The computation initializes c ← (0, 0, ..., 0) and traverses each chunk index of b(x) for i = 0 to W − 1.
Exemplary VMM with redundancy-free configuration. Figure 8 presents the hardware circuit of the processing element (PE), which can be configured as five types of blocks by asserting the control signals sel0 ∼ sel2. Figure 9 presents an instantiation of the VMM along with on-the-fly configuration cases. For example, by setting {sel0_rj, sel1_rj, sel2_rj} = {2'b00, 1'b1, 1'b0}, the PEs within each row are all configured as block A, whereby #4 w-bit MMs are developed. If we set sel2_rj = 1'b1, the width of the modular multipliers can be further halved, whereby #4 (w/2)-bit MMs are formed. The third case is configured as #2 2w-bit MMs. Block B generates the parity signal and propagates it to block C in the same column. The MSB of the output of block B comes from the LSB of the output of block C, while the MSB of the output of block C is set to the parity signal. It is worth mentioning that the presented VMM consumes a constant connection cost (fan-in/outs of multiplexers) even if the number of configuration cases increases.
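Functionally, the adaptive configuration boils down to splitting one wide data-path into independent lanes that each run the same Montgomery recurrence. The Python model below (lane packing, parameter names and the example configuration are illustrative assumptions) captures only this input/output behaviour, not the PE-level block types or the MSB routing:

def mont_mul_gf2m(a, b, f, m):
    # Bit-serial reference model: c(x) = a(x) * b(x) * x^{-m} mod f(x).
    c = 0
    for i in range(m):
        if (b >> i) & 1:
            c ^= a
        if c & 1:
            c ^= f
        c >>= 1
    return c

def vmm_lanes(a_packed, b_packed, f, lanes, lane_bits):
    # Split a q-bit data-path (q = lanes * lane_bits) into `lanes` independent
    # Montgomery multipliers over GF(2^lane_bits), all sharing the same f(x).
    mask = (1 << lane_bits) - 1
    out = 0
    for i in range(lanes):
        ai = (a_packed >> (i * lane_bits)) & mask
        bi = (b_packed >> (i * lane_bits)) & mask
        out |= mont_mul_gf2m(ai, bi, f, lane_bits) << (i * lane_bits)
    return out

# Example: a 32-bit data-path configured as 4 lanes of GF(2^8) multiplications.
print(hex(vmm_lanes(0x57575757, 0x83018301, 0x11B, lanes=4, lane_bits=8)))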

Configurable MVM
Determining the data-flow structure. Figure 10 depicts the mapping strategy of the row-major MVM, which multiplies each row of the matrix with the column vector in school-book order. A circuit structure implementing the row-major MVM is also depicted in the right part, which assumes that the critical path consists of an adder and a multiplier (MAC). As a drawback, this circuit structure suffers from extra clock cycles for pipeline filling and emptying, requiring 2n clock cycles to finish the entire MVM. Figure 10 also presents the mapping strategy of the column-major MVM, which multiplies an element of the vector with a column of the matrix and then accumulates the vector product. Each element of the vector is broadcast to all channels through a fully-connected crossbar for performing parallel MAC operations. Benefiting from the data-level parallelism, it performs n MACs per cycle and only occupies n clock cycles to complete the MVM. Although it matches well with the round-based implementation, the fully-connected communication entails cumbersome MUXes, which become even more prominent as the number of channels increases. Figure 10 further describes the mapping strategy of the diagonal-major MVM, which multiplies the rotated state vector with each diagonal of the matrix and then accumulates the vector products in sequence. The hardware circuit of the diagonal-major MVM consumes n clock cycles to perform the MVM, while the crossbar is replaced by a lightweight shift register. Considering the comprehensive metrics of area, flexibility and performance, we employ the diagonal-major strategy to circumvent the ponderous fan-out of broadcasting and achieve data-level parallelism at the same time. Determining the size of the PE array. As mentioned in Section 4.1, the critical path of the VMM is composed of m PEs, which can be further folded to x PEs. Although the folding mechanism can reduce the area cost and improve the frequency, it also comes at the price of consuming many more clock cycles. Thus, the overall latency is hard to reduce. Figure 11 (a) presents the performance variation (PV.) on FPGA when increasing the number of PEs (k) in the ck direction. The right coordinate depicts the cycle counts. Figure 11 (b) depicts the resource variation (RV.) with the increment of k. The right coordinate indicates the number of LUTs and flip-flops (FFs). The left coordinate reflects the variation of ATP with k. It is observed that the best area-time efficiency on FPGA is obtained by setting k = 4. Figure 11 (c) (d) also depict how the overall latency and ATP vary with the folding degree under ASIC evaluation.
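The equivalence of the school-book and diagonal-major schedules can be checked with a short Python model; the random matrix below is a placeholder and not an actual MDS matrix:

import random

def gf_mul(a, b, f, m):
    # Multiply in GF(2^m); f includes the leading x^m term.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= f
    return r

def mvm_row_major(M, x, f, m):
    # School-book order: y_j = sum_i x_i * M[i][j] over GF(2^m).
    n = len(x)
    y = [0] * n
    for j in range(n):
        for i in range(n):
            y[j] ^= gf_mul(x[i], M[i][j], f, m)
    return y

def mvm_diagonal_major(M, x, f, m):
    # Step d multiplies the rotated state with diagonal d: lane j consumes
    # x[(j - d) % n] and M[(j - d) % n][j], so a shift register replaces the crossbar.
    n = len(x)
    y = [0] * n
    for d in range(n):
        for j in range(n):
            i = (j - d) % n
            y[j] ^= gf_mul(x[i], M[i][j], f, m)
    return y

random.seed(0)
n, m8, f8 = 4, 8, 0x11B
M = [[random.randrange(1, 1 << m8) for _ in range(n)] for _ in range(n)]
x = [random.randrange(1 << m8) for _ in range(n)]
assert mvm_row_major(M, x, f8, m8) == mvm_diagonal_major(M, x, f8, m8)
print([hex(v) for v in mvm_row_major(M, x, f8, m8)])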
Architecture of the LT cluster. As shown in Figure 12 (a), operand A is the rotated state vector coming from the shift buffer, while operand B is the 64-bit element group of the MDS matrix fetched from 4 MDS buffers. As the core computation module of the entire MVM, the PE array can be configured as a different number of modular multipliers with variable data widths. Figure 12 (b) shows six configuration cases for the PE array. For example, the PE array is configured as 4 MMs over GF(2^4) in case I, which affords to perform the 4 × 4 MVM for Yoroi-32. Additionally, the diagonal-wise memory layout of the Hadamard MDS matrix can be reduced by 75% by eliminating the repetitive storage.

United (Inv)MixColumns
For SPNbox and Yoroi, multi-scale (Inv)MixColumns are applied to construct AES variants of different sizes. For the first time, we present a unified design of multi-scale (Inv)MixColumns through sharing sub-expressions and customizing the constant modular multipliers (CMMs). Multi-scale MixColumns. Figure 13 presents the implementation scheme for the multi-scale MixColumns (MC_16, MC_24, MC_32). Inspired by [MLH+20], the core idea is to share as many logic resources as possible to eliminate redundant operations. Note that MC_16 and MC_24 are actually the 2×2 and 3×3-sized sub-matrices of MC_32. As a result, the first insight is to reuse the partial products/sums of MC_16 to compute MC_32, which can be trivially achieved because lcm(2, 4) = 4. The second insight is to reuse the partial products/sums of MC_16 and MC_32 to compute MC_24. Since the size of MC_24 is coprime with those of MC_16 and MC_32 (gcd(2, 3, 4) = 1), the calculation of MC_24 is relatively non-trivial. However, by dividing the index i into even and odd cases, MC_24 can still be obtained merely by re-using intermediate results (shown in step 3). The final solution is summarized at the bottom of Figure 13. Ultimately, 49 (60.5%) MMs and 31 (33%) modular adders (MAs) are saved when compared to the direct method (DM), as shown in Figure 15. The implementation details are also shown in Appendix A.
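To illustrate the sub-matrix reuse at a functional level, the Python sketch below computes the three MixColumns from one shared set of partial products; the 4×4 matrix is a placeholder (the AES MixColumns matrix, not the SPNbox constants), and the hardware design additionally shares the adder tree, which the sketch does not model:

def gf_mul(a, b, f=0x11B, m=8):
    # Multiply in GF(2^8) with f(x) = x^8 + x^4 + x^3 + x + 1.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> m:
            a ^= f
    return r

def shared_mixcolumns(M4, col):
    # col is a 4-byte column; the top-left 2x2 and 3x3 blocks of the placeholder
    # matrix M4 stand in for MC_16 and MC_24 acting on the first 2 and 3 bytes.
    p = [[gf_mul(M4[i][j], col[j]) for j in range(4)] for i in range(4)]  # partial products, computed once
    mc32 = [p[i][0] ^ p[i][1] ^ p[i][2] ^ p[i][3] for i in range(4)]
    mc24 = [p[i][0] ^ p[i][1] ^ p[i][2] for i in range(3)]                # reuses the same products
    mc16 = [p[i][0] ^ p[i][1] for i in range(2)]
    return mc16, mc24, mc32

M4 = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]   # AES matrix, used only as a placeholder
print(shared_mixcolumns(M4, [0xDB, 0x13, 0x53, 0x45]))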
Multi-scale InvMixColumns. Figure 14 presents the implementation scheme for the multi-scale InvMixColumns (InvMC_16, InvMC_24, InvMC_32). The inverse matrices feature disjoint matrix elements with large Hamming weight, which are not as regular as the forward ones. Here, we adopt a multiplicative decomposition for InvMC_32 so that MC_32 is reused. For InvMC_16 and InvMC_24, an additive matrix decomposition technique is proposed to allow much more sub-expression sharing. Note that the adder performing 68·x_{2i} + 68·x_{2i+1} can be reused to compute 68·x_{3k} + 68·x_{3k+1} when 2i = 3k for 0 ≤ i ≤ 7, 0 ≤ k ≤ 4, that is to say i = k = 0 or i = 3, k = 2. The final solution is summarized in the right part of Figure 14. As shown in Figure 15, we need 69 MAs and 81 MMs to unify the multi-scale InvMCs. Thus, 52% of the MAs and 13.8% of the MMs are saved when compared to the DM. The concrete implementation details are also shown in Appendix B. Design generator for CMMs. The rightmost part of Figure 14 presents the framework of the design generator for CMMs. Since the irreducible polynomial within the (Inv)MCs is fixed as x^8 + x^4 + x^3 + x + 1, the modular multiplications by constants with large Hamming weight are customized as Mastrovito's multipliers [Mas88] (detailed in Appendix C). In the first step, we generate the binary matrix based on the input parameters f(x), GF(2^m) and the constant. Then, we leverage a backward search framework [LWF+22] to implement the binary matrix-vector multiplication, which aims to consume as few XORs as possible while guaranteeing a minimum circuit depth. We take the CMMs for the constants 68 and d1 (in hexadecimal) as a case study. As shown in Figure 15, when computing the CMMs for 68 and d1 using the DM, 26 and 29 XORs are needed, respectively. Based on the backward search framework, the area cost can be reduced to 18 XORs and 19 XORs, both with a minimum circuit depth of 3 XORs.

Implementation Results and Comparisons
The coprocessor is designed and simulated in Verilog HDL, and its functional correctness is further verified against a Python model. We obtain the implementation results from TSMC 28 nm ASIC synthesis and a Xilinx Zynq UltraScale+ FPGA (xczu7ev-ffvc1156-3-e), respectively. We use Vivado 2020.2 to synthesize and implement the FPGA design. The ASIC synthesis result is reported with Design Compiler P-2019.03. Since the evaluations of the FTP mechanism and the multi-scale (Inv)MC modules were already presented in previous sections, this section mainly provides theoretical and experimental performance evaluations of the VMM, the MVM and the overall coprocessor.

Alternative Approaches and Comparisons about VMM
Table 3 lists the proposed alternative approaches for the VMM, based on which we compare the area and time complexity. We assume that one m-bit operand is processed in its entirety while the other operand is scanned x bits per cycle. For the radix-2 LSB-first and MSB-first algorithms, the area cost and cycle count approximately equal those of the proposed VMM. However, one important point to consider is that both the LSB-first and MSB-first algorithms tend to consume proportionally increasing fan-ins of MUXes (O(n)) to support the configuration, because the MSB of an output inevitably comes from n non-adjacent PEs (detailed in Appendix F). This crucial drawback causes the critical path and area to grow when increasing the number of configuration cases. The proposed VMM scheme, by contrast, consumes constant fan-outs of MUXes (O(1)), independent of the configuration case. The cycle count of the radix-2^w (x = 2^w) Montgomery algorithm [MÖPV04] is on par with that of the proposed scheme. Nevertheless, there are two drawbacks to the radix-x Montgomery algorithm. First, its area complexity is considerably higher than that of our scheme. Second, extra precomputation is needed. Similar shortcomings also exist in the radix-x LSB-first and MSB-first algorithms [KWP06]. Compared with the state-of-the-art radix-2 Montgomery design
[RM13], our scheme reduces the area cost by 2 FFs while maintaining the same critical path and even fewer clock cycles. Figure 16 further presents the comparison of practical implementations between this work (TW) and the prior work (PW) [RM13] under the same parameters. It is observed that TW offers 1.2× (1.4×) better area-time efficiency (ATP) under FPGA (ASIC) evaluation. The aforementioned algorithms support MM under arbitrary irreducible polynomials. However, Mastrovito's multiplication cannot support on-the-fly configuration. Some works leverage Karatsuba [PCHS20] or residue polynomial systems [SSS12] to implement a single MM over a large-sized GF(2^m). Obviously, their coarse-grained computation patterns lead to limited flexibility in terms of bit-width precision. On the contrary, the proposed VMM takes full advantage of the hybrid-grained radix-2 Montgomery multiplication to achieve multi-dimensional flexibility. Other works aim to further prune the circuit structure by using special classes of irreducible polynomials [FD05], like trinomials [BSF15], pentanomials [Ima18, XHM13] and all-one polynomials [SK99], but most of them cannot cater to the adaptivity.

Evaluations on MVM
Comparison with standalone designs. We implement the first rows of the MDS matrices following the method of [SKOP15]. In total, 9 individual implementations consume 2621 XORs. Additionally, [SKOP15] [BKL16] ignore the extra overhead of the MUXes incurred by reusing the multipliers of the first row. Indeed, MUXes cause a large area overhead for connections, especially for high-dimension matrices. Moreover, this method [SKOP15] requires pre-computation and cannot support dynamic configuration for MDS matrices of different sizes. Fortunately, our configurable method consumes approximately 1024 XORs and 1024 ANDs (AND:XOR = 1:2.25 in the 28 nm process) and enjoys a sound balance between flexibility and area efficiency.
Sensitivity study on random MDS matrices. To quantify the area variation of individually implementing random MDS matrices generated by different scalars (1 ∼ 20), Figure 17 showcases SM_8 ∼ InvSM_24 following the typical method of [SKOP15]. As the comparison baseline, the red broken line denotes the area cost (≈ 2048 XORs) of the configurable MVM proposed in this work (TW). We make two important observations. First, multiplying each MDS matrix by different scalars yields varying area consumption, which may even be lower than that of the original matrix. Second, the accumulated area of 2 ∼ 9 generated matrices already overtakes that of TW, which reveals the profit of our configurable design.

Evaluations on UpWB
Timing Evaluation. The optimized software counterparts serve as the baseline for the comparison of computation throughput. The highest frequency of UpWB reaches 1.3 GHz under 28 nm synthesis and 240 MHz under FPGA implementation with a performance-optimized strategy, respectively. The software implementations of the DWBCs are written in C and run on a laptop equipped with a 3.2 GHz Intel i7 CPU. The high-performance C++ library Givaro is applied to implement the MM over GF(2^m). Both the AES-NI and AVX2 instruction sets are utilized to speed up the AES variants. The software performance is measured by taking the average over 100000 repetitions, each time encrypting a random message of 2048 bytes. Table 4 offers the evaluation of TP for both the software and hardware implementations. SW-TP is calculated as (F × 8)/CPB (bps), where F denotes the peak frequency and CPB denotes cycles per byte. HW-TP1/2 is calculated following Equation (1). As can be seen, SPNbox-32 possesses the largest TP for black-box en/decryption. Since a small-scale MDS matrix is adopted by the Yoroi-family block ciphers, their obtained TP turns out to be the second largest. As shown in Figure 18, when synthesized under TSMC 28 nm technology, UpWB offers approximately 36× to 164× speed-up over the software implementations. The FPGA design achieves 7× to 30× speed-up over the software counterparts as well. Area cost. Table 5 provides the detailed area breakdown of UpWB. As can be seen, each LT kernel only consumes 6975.4 µm² and could serve as a building block to be further integrated into many other processors. Thanks to the compact implementation, the (Inv)MixColumns only consume 895 LUTs (8.6 KGEs) and occupy about 24.8% (29.1%) of each NLT module. Each combined (Inv)S-box is also optimized with the state-of-the-art tower-field technique to pursue high area efficiency [ME19].
Comparison with advanced works. As mentioned before, works on the hardware architecture for WBC have yet to be released.

Conclusion
The enormous performance overhead is one of the obstacles for the wide application of WBC.
Then, we accumulate the elements of each row in a tree shape to pursue a low circuit depth. To do this, 16 extra modular adders (= 8-bit XORs) at the second layer are used to generate the sums of a_{4i}, a_{4i+1}, a_{4i+2} and a_{4i+3}. Step 4: Since the size of MC_24 is coprime with those of MC_16 and MC_32 (gcd(2, 3, 4) = 1), the calculation of the MixColumn for MC_24 is non-trivial compared with the above two MixColumns. However, we show that the addition nodes generated in steps 2 and 3 are still sufficient to construct the MixColumn for MC_24 without the need to create new addition nodes at the second layer. First, the calculation of the MixColumn for MC_24 is written out directly. Then, when i is even, the first two operands of each row are added together, after which the outcome is added with the last operand; the opposite order is used when i is odd. In this way, only 15 modular adders at the third layer are further needed to obtain the final results y_i.

B Multi-scale Inverse MixColumns
For the inverse MixColumn InvMC_32, we adopt the technique of multiplicative decomposition to reuse the computation resources of the forward MixColumn. To be specific, InvMC_32 is factored through MC_32, as summarized in Figure 14.

Then, based on xtime, A^(k) can be computed iteratively. For a fixed irreducible polynomial f(x) and constant operand a(x), the binary matrix A can be predetermined so that only XORs are required to implement the entire modular multiplication. It is reported in [SKOP15] that the choices of both f(x) and the constant operand influence the number of XOR gates. Additionally, the number of XOR gates and the circuit depth for implementing the matrix-vector multiplication can be further minimized through sub-expression sharing. Therefore, Mastrovito's multiplier is usually used to customize the constant multiplier with regard to a fixed irreducible polynomial, at the price of losing flexibility and scalability.
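The construction of the binary matrix from xtime iterations and the direct, unshared XOR count can be sketched in Python as follows; the function names are illustrative, and the snippet does not perform the sub-expression sharing of the backward search framework:

def xtime(a, f=0x11B, m=8):
    # Multiply by x and reduce modulo f(x) = x^8 + x^4 + x^3 + x + 1.
    a <<= 1
    return a ^ f if (a >> m) else a

def mastrovito_matrix(c, f=0x11B, m=8):
    # Binary matrix A of the constant multiplier y(x) = c(x) * x(x) mod f(x):
    # column k holds the bit pattern of c * x^k mod f(x), built iteratively with xtime.
    cols = []
    acc = c
    for _ in range(m):
        cols.append(acc)
        acc = xtime(acc, f, m)
    # A[i][k] is bit i of c * x^k mod f(x), so y_bits = A * x_bits over GF(2).
    return [[(cols[k] >> i) & 1 for k in range(m)] for i in range(m)]

def apply_matrix(A, x, m=8):
    # Evaluate y = A * x over GF(2); a direct (unshared) evaluation costs
    # HW(row) - 1 XORs per non-zero row.
    y = 0
    for i in range(m):
        bit = 0
        for k in range(m):
            bit ^= A[i][k] & ((x >> k) & 1)
        y |= bit << i
    return y

# Direct XOR count for the constant 68 (hex), before any sub-expression sharing.
A = mastrovito_matrix(0x68)
print(sum(sum(row) - 1 for row in A if sum(row) > 0))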

Table 7:
The detailed implementation of constant modular multipliers.
When directly computing the matrix-vector multiplications with M_1 and M_2 (the Mastrovito matrices of 68 and d1), 34 − 8 = 26 XORs and 37 − 8 = 29 XORs are needed, respectively. Based on the backward search framework, the area cost of the constant modular multiplications by 68 and d1 can be reduced to 18 XORs and 19 XORs, respectively, both with a minimum circuit depth of 3 XORs. The concrete implementation is shown in Table 7.

E Backward Search Framework
To customize the constant modular multipliers, we adopt the backward search framework proposed in [LWF+22]. This heuristic search algorithm is inspired by the optimization method for constant matrix multiplication proposed in [KHZ17]. Compared with the forward search framework [BP09], the backward search framework guarantees the circuit depth to be minimal while still reducing the area cost as much as possible. The overall execution steps of the backward search framework are described as below: Step 1: Transform the constant C_x into a binary matrix A_{m×m} based on Mastrovito's multiplication under a certain irreducible polynomial f(x). Then, the goal is turned into calculating the binary matrix-vector multiplication y_{m×1} = A_{m×m} · x_{m×1}.
Step 2: Calculate the Hamming weight HW_i and the logarithmic depth S_i = ⌈log_2 HW_i⌉ for each row of the matrix, where the maximum depth is determined as S_max for 0 ≤ i ≤ m − 1. Then, a node is defined as the bit vector of each row.
Step 3: Initialize the set X as the collection of input unit nodes x_i with depth equal to 1, the set W as the collection of nodes w_i with depth equal to S_max, and the set P as the collection of nodes p_i with depth less than S_max, where 0 ≤ i ≤ m − 1. Sort the nodes p_i in descending order of depth.
Step 4: Split the node w_i into two nodes until the set W is empty, based on one of the following strategies: 1) Construct node w_i with one node x_j and another node p_k.
2) Construct node w_i with one node p_j and another newly generated node g_k. The depth of g_k should be less than S_max. Append the new node to the set P.
3) Construct node w_i with two newly generated nodes g_k and g_n. The depths of g_k and g_n should be less than S_max, but at least one of them is equal to S_max − 1. Append nodes g_k and g_n to the set P.
Note that each splitting operation generates a directed sub-graph with two edges and three nodes.
Step 5: Update the maximum depth as S_max = S_max − 1. Update the sets W and P based on the new S_max. If W ∪ X = X, then return the current directed graph. Otherwise, go to Step 4 for the next iteration.
We add an extra sorting operation on the set P at each iteration. The descending order guarantees that the node with the maximum depth among the p_i is picked preferentially, which speeds up the convergence for small-dimension matrices. More details about the backward search framework can be found in the original paper [LWF+22].

F Alternative Methods to VMM
As comparison objects to the proposed radix-2 Montgomery-based VMM, Figure 19 depicts two typical alternative methods (the LSB-first and MSB-first schemes), with the parameters set to a 3w-bit operand A and a 3-bit operand B.

Figure 1 :
Figure 1: The application of WBC in cloud-based content distribution. Hardware WBC with a small key is deployed under the black-box setting (only the input and output can be observed). Software WBC with a big key is deployed under the white-box setting (untrusted open environment).
where C_r = r + 1. Obviously, vectorized modular multiplications (MM) and additions (MA) over GF(2^m) are required to support the iterative vector operations within the linear and affine layers. The right bottom of Figure 2 further presents the hierarchical operations of NSPN-structure DWBCs.

Figure 3 :
Figure 3: The comparison of DFG between conventional block cipher and WBC.

Figure 5 :
Figure 5: An example of FTP mechanism for SPNbox-16 black-box encryption.

Figure 6 :
Figure 6: The proposed hardware architecture for nested SPN-type white-box block cipher.

Figure 7 :
Figure 7: Analysis of data flow and timing diagram for systolic array.

Figure 10 :
Figure 10: Comparison of data-flow structure for MVM.
Proposed MDS matrix vector multiplier.

Figure 12 :
Figure 12: The overall hardware architecture for LT cluster.

Figure 17 :
Figure 17: The area distribution and accumulation (ACC) for random MDS matrices generated by different scalars.

Figure 18 :
Figure 18: The speed-up ratio of ASIC (FPGA) versus software.

Figure 19 :
Figure 19: Two typical examples of alternative methods to VMM (setting parameter as 3w-bit A and 3-bit B).

Table 1 :
The comparison of deployments for WBC in DRM.
KEA - key extraction attack. CTA - cache-timing attack [GSM15]. CLA - code-lifting attack. This table also reflects the difference between the black-box and white-box implementations.

Table 2 :
The summarized parameter sets for DWBCs.

Table 3 :
The theoretical comparison of different algorithms about MMs.

Table 4 :
Performance evaluation of UpWB on different platforms. DPs denote the number of data points processed in each batch. HW-TP1, HW-TP2 and SW-TP denote the TP evaluated under ASIC synthesis, FPGA implementation and the software platform, respectively.

Table 5 :
The detailed area breakdown for UpWB under FPGA and ASIC evaluation. * The average throughput per area (bps/GEs) without considering the KDF module (SHAKE-128).

To demonstrate where the efficiency and flexibility of UpWB stand among similar works, we make comparisons with advanced works targeting conventional block ciphers (BCs), which mainly involve congeneric operations over the binary Galois field. The key implementation results of both dedicated and configurable hardware designs are listed in Table 6 after approximate process normalization. The round count of WBCs is naturally larger than that of conventional BCs. To make reasonable comparisons, we also normalize the computation efficiency by multiplying it with the average round count (denoted as Norm. CE.). [UHM+20] presents a decent dedicated hardware design of AES with high throughput efficiency (measured by throughput per gate). UpWB has 2.2× lower Norm. CE. than [UHM+20] but supports much heavier algorithms. Compared with the work [WSH+10] on AES with configurable parameters, UpWB achieves a 2.1× improvement in CE. and supports more algorithms. The possible reason is that [WSH+10] utilizes heavy look-up tables to achieve configuration for modular multiplications with different precisions and irreducible polynomials, which introduces redundant resource utilization. As for designs pursuing the low-power metric, [CLF+17] [DLDN20] propose configurable processors for conventional BCs and asymmetric cryptography over the binary Galois field, which offer sound programmability but have about 2.3× lower CE. than this work. Finally, we highlight that the proposed MVM architecture brings three dimensions of flexibility, including size, f(x) and bit-width, which are not completely obtained in other works. Discussions. Prior works on GF(2^m) arithmetic mainly target conventional block ciphers, error-correcting codes, elliptic curve cryptography (ECC) and recent post-quantum cryptography. [IG16] [PLM13] present single-MM hardware designs for ECC, which achieve high throughput but have limited flexibility. [SC14] proposes a high-performance processor for conventional BCs, which obtains configurability mainly based on LUTs.

Table 6 :
Comparison of implementation efficiency between UpWB and related works. * bps/GEs, after approximate normalization considering the process variation.
[HFW11] addresses AES and ECC on the same kernel by sharing control units and memory, but two individual data-paths are still needed. In terms of MVM, [MKAF11] implements a dynamic MixColumn by using dual-port BRAMs on FPGA, which is limited to a fixed size and f(x). [WSH+10] [LWD+18] propose highly flexible and efficient processors capable of processing MixColumns with diverse f(x), but the optimization for large-scale MDS matrices is out of their scope. Owing to the adaptive method, the proposed VMM can further support broader schemes based on GF(2^m), like ECC with even larger bit-width (≥ 100-bit) operations. Additionally, if we only focus on multiple f(x), the data-path of the VMM can be further pruned by exploiting the sparsity of f(x).
This work develops the first hardware accelerator for a series of NSPN-based WBCs under the black-box setting. To improve the resource utilization, we propose an efficient FTP mechanism to considerably decouple the nested data dependency and hide much more latency. By adopting algorithm-hardware co-design, we achieve decent area-time efficiency for several kernels, including an adaptive VMM, a configurable MVM and compact (Inv)MCs. The computation throughput of UpWB outperforms the optimized software counterparts to a large degree under both ASIC synthesis and the FPGA platform. This work mainly serves as an academic project whose principal focus is to demonstrate the feasibility and profit of adopting hardware acceleration for the black-box implementation of DWBCs. Future work could extend UpWB to accelerate other feasible non-NSPN white-box block ciphers like the Feistel-structure-based WhiteBlock [FKKM16] and FPL [KLLM20], since some prior works demonstrate the practicability of devising domain-specific accelerators supporting different structures of traditional block ciphers. Acknowledgments. This work was supported in part by the ... of China (Grant No. 2021YFB2701201), and in part by the National Natural Science Foundation of China (No. 62302285).
At last, the final results y of the MixColumn for MC_32 are calculated accordingly.