Speed Reading in the Dark

Functional encryption is a new paradigm for encryption where decryption does not give the entire plaintext but only some function of it. Functional encryption has great potential in privacy-enhancing technologies but suffers from excessive computational overheads. We introduce the first hardware accelerator that supports functional encryption for quadratic functions. Our accelerator is implemented on a reprogrammable system-on-chip following the hardware/software codesign methodology. We benchmark our implementation for two privacy-preserving machine learning applications: (1) classification of handwritten digits from the MNIST database and (2) classification of clothes images from the Fashion MNIST database. In both cases, classification is performed with encrypted images. We show that our implementation offers speedups of over 200 times compared to a published software implementation and permits applications which are infeasible with software-only solutions.


Introduction
Functional Encryption (FE) [BSW11, O'N10] is a new paradigm for encryption where decryption does not return the original plaintext pt but only some function of it f (pt). FE has great potential for improving privacy. E.g., users can encrypt their sensitive data so that a cloud server may compute statistics over the users' data without learning anything else about the data except the results of these computations. It is noteworthy that such solutions are not possible even with Fully Homomorphic Encryption (FHE) because FHE allows the cloud server to perform the required computations with the users' data, but the users' interaction would be needed to get the decrypted result.
Although theoretical FE constructions for arbitrary polynomial-sized circuits have been introduced (e.g., [Wat15, GGH+13]), they are not practical. (Semi-)practical FE constructions exist only for simple functions, namely, Inner Products (IP) (e.g., [ABDP15, ALS16, ACF+18]) and Quadratic Functions (QF) (e.g., [BCFG17, DGP18, Gay20, Wee20]). FE-IP allows computing (weighted) means over encrypted data sets and even simple Machine Learning (ML) models based, e.g., on linear regression. FE-QF opens up a significantly larger set of applications including more robust ML applications over encrypted data. An efficient FE-QF scheme was introduced by Dufour Sans et al. [DGP18] in the paper titled "Reading in the Dark" where they also demonstrated how it can be used for classifying handwritten digits from the well-known MNIST database with encrypted images. Even FE-IP and, especially, FE-QF have practical limitations because computational overheads become excessive when data sets grow large. Stopar et al. [MSH+19] recently published GoFE, an open source cryptographic library written in the Go language. The library supports multiple FE schemes including the FE-QF scheme from [DGP18]. They tested the FE-QF scheme for the MNIST database and stated that "the decryption of one image [...] takes under 20 seconds" which is still excessive for many practical applications.
Our objective is to show that the performance limitations of FE can be pushed significantly further by utilizing hardware acceleration. Specifically, we introduce an implementation for computing decryption of the FE-QF scheme from [DGP18] on a reprogrammable System-on-Chip (SoC) platform and demonstrate major speedups compared to a software implementation with GoFE. Our accelerator could be installed in a cloud server and it would enable significantly more complex privacy-preserving solutions than are possible with software-only systems. To the best of our knowledge, the only published hardware acceleration result for FE was recently provided by Bahadori and Järvinen in [BJ20b]. They implemented a multi-input FE-IP scheme based on Paillier encryption from [ALS16, ACF+18] using Xilinx programmable SoC platforms.

Contributions
In this paper, we provide the following contributions:
• We present the first hardware accelerator for an FE-QF scheme. The accelerator is a hardware/software codesign implemented in a Xilinx reprogrammable SoC to maximally utilize the advantages of both software and hardware. The accelerator is optimized particularly for decryption. An FE-QF decryption consists of several cryptographic pairings followed by a discrete logarithm. The hardware side is implemented in programmable logic as a multi-core architecture where the cores are designed for efficient computation of cryptographic pairings and the discrete logarithm is implemented via an efficient interplay of software and hardware.
• We describe a parallel version of Shanks' baby-step giant-step discrete logarithm algorithm which is optimized for reasonably small positive and negative output values. The algorithm splits into precomputation and on-the-fly phases that compute and use a large precomputed table, respectively. We show that this algorithm has importance beyond our work by importing it into the GoFE software library where it provides significant speedups.
• We show that our accelerator provides large speedups compared to software-only implementations. In particular, we show that classification of handwritten digits from the MNIST database can be computed in 87 ms, representing over 200 times speedup compared to the original GoFE software library from [MSH+19] and 15.5 times speedup even against the optimized GoFE library that uses our algorithm for discrete logarithms. Our accelerator also opens up new applications that are out of the reach of published software-only solutions such as accurate classification of clothes images from the Fashion MNIST database. The model that we use in this classification is arguably too complicated to be practical with software as even the optimized GoFE library requires several seconds for a single classification, a task that our accelerator solves in only 0.38 seconds.
The rest of the paper is organized as follows. Sect. 2 provides the preliminaries, Sect. 3 describes the architecture of our accelerator, and Sect. 4 shows how the FE-QF decryption is mapped into this architecture. Sect. 5 provides implementation results and general performance comparisons against software libraries. Sect. 6 introduces the use cases, provides the results for them, and compares the results with software implementations. Finally, Sect. 7 ends the paper by drawing conclusions.

Preliminaries
This section provides the required preliminaries for the rest of the paper. We present the background for FE and cryptographic pairings in Sect. 2.1 and Sect. 2.2, respectively. In Sect. 2.3, we describe the FE-QF scheme from [DGP18] that we implement in this paper.

Functional Encryption
Traditional encryption is all-or-nothing because decryption returns the entire plaintext pt from the ciphertext ct and nothing can be learned about the contents without the correct decryption key dk. FE [BSW11, O'N10] provides more fine-grained control. FE allows deriving decryption keys $dk_f$ for different functions $f$ so that $dk_f$ enables computing the value of $f(pt)$ from ct without giving any additional information about pt. Potential applications of FE include spam filtering from encrypted email, facial recognition from encrypted photos or video, privacy-preserving medical data processing, etc. While such applications are theoretically possible with the existing FE constructions for arbitrary polynomial-sized circuits (e.g., [Wat15, GGH+13]), they are not practical as these FE constructions are extremely inefficient. Abdalla et al. [ABDP15] chose to take a different, more practical approach where FE schemes are designed for simple but still practically meaningful functions. This line of research has resulted in many schemes for IP (e.g., [ABDP15, ALS16, ACF+18]), i.e., functions of the form $\langle x, y \rangle = \sum_{i=0}^{n-1} x_i y_i$. Baltico et al. [BCFG17] proposed the first efficient FE scheme for functions other than linear ones. Their FE-QF scheme supports functions of the form $x^\top F y = \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} f_{i,j} x_i y_j$. Dufour Sans et al. presented a slightly more efficient FE-QF scheme in [DGP18]. The latter plays the central role in our work and is described in more detail in Sect. 2.3.
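To make the two function classes concrete, the following minimal Python sketch evaluates an inner product and a quadratic form on plaintext vectors. It involves no encryption at all: it only shows the value that a decryptor holding the corresponding functional key would learn. The function names and the example data are illustrative assumptions and do not come from [BCFG17] or [DGP18].

```python
def inner_product(x, y):
    """Value revealed by FE-IP: sum_i x_i * y_i (x and y of equal length)."""
    return sum(xi * yi for xi, yi in zip(x, y))


def quadratic_form(x, F, y):
    """Value revealed by FE-QF: sum_i sum_j F[i][j] * x_i * y_j.

    F has len(x) rows and len(y) columns.
    """
    return sum(
        F[i][j] * x[i] * y[j]
        for i in range(len(x))
        for j in range(len(y))
    )


if __name__ == "__main__":
    x = [1, 2, 3]
    y = [4, 5]
    F = [[1, 0], [0, -1], [2, 0]]  # hypothetical coefficient matrix
    print(inner_product([1, 2, 3], [4, 5, 6]))
    print(quadratic_form(x, F, y))
```

In an actual FE-QF deployment only this scalar result becomes visible to the decryptor, while x and y stay encrypted.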
The support for more complex functions increases the applicability of FE in practical applications. FE-IP computes (weighted) sums of inputs and, consequently, supports applications like voting (e.g., sums), basic statistics (e.g., weighted average) or simple machine learning (e.g., linear regression). Thanks to the ability to compute sums of products of inputs, FE-QF additionally supports applications like variance and correlation in statistics or simple neural networks in machine learning enabling better prediction accuracy. The use cases considered in Sect. 6 are examples of the capabilities of FE-QF, but the applications are certainly not limited to them.

Cryptographic Pairings
A cryptographic pairing is a bilinear map $G_1 \times G_2 \to G_3$ from two additive groups $G_1$ and $G_2$ to a multiplicative group $G_3$. The first use of bilinear pairings in cryptology was for cryptanalysis [MOV93]. Later, they have been utilized for multiple cryptographic constructions including tripartite key exchange [Jou04], identity-based encryption [BF01], short signatures [BLS04], attribute-based encryption [GPSW06, BSW07], searchable encryption [ABC+08], and, most importantly for this work, FE [BCFG17, DGP18, AGRW17]. These applications need efficiently computable pairings and research for improving efficiency has resulted in various types of pairing algorithms and pairing-friendly elliptic curves such as the Barreto-Naehrig (BN) curves [BN06]. In this paper, we use optimal ate pairings [Ver10] on BN curves as proposed by Beuchat et al. in [BGM+10]. In that case, $G_1$ and $G_2$ are points on elliptic curves $E(\mathbb{F}_p)$ and $E(\mathbb{F}_{p^2})$ and $G_3$ is the multiplicative group of $\mathbb{F}_{p^{12}}$ such that $p$ is a 254-bit prime. Due to Kim and Barbulescu's extended number field sieve algorithm [KB16], it is considered to offer over 100-bit security.

Functional Encryption for Quadratic Functions from Pairings
Next, we describe the FE-QF scheme introduced by Dufour Sans, Gay, and Pointcheval in [DGP18]. The scheme is based on cryptographic pairings and is secure under the generic group model.
The scheme defines four routines: set-up, encryption, key generation, and decryption.
Set-up generates a public key pk and a master secret key msk. It takes a bilinear pairing scheme $\mathcal{PG} = (G_1, G_2, p, g_1, g_2, e)$ for a security parameter $1^\lambda$ and the maximum values for the above function parameters as inputs. It returns a public key $pk = (g_1^{s}, g_2^{t})$ and a master secret key $msk = (s, t)$, where $s$ and $t$ are vectors of $n$ random elements of $\mathbb{Z}_p$.
Decryption evaluates the function $f$ on the ciphertext ct using the decryption key $dk_f$ and the public key pk and returns an integer value $r = f(x, y)$. It first computes an intermediate value $\rho \in G_3$ from products of pairings, where $e$ is the bilinear pairing. The final result $r$ is obtained by solving the discrete logarithm $r = \log_{g_3}(\rho)$ where $g_3 = e(g_1, g_2)$. Because we focus particularly on the decryption routine in this paper, we describe it in more detail in Alg. 1.
The details about the correctness and security of the scheme are presented in [DGP18]. As can be seen above, the encryption and key generation routines consist mainly of scalar multiplications in the elliptic curve groups $G_1$ and $G_2$ with base points $g_1$ and $g_2$, respectively. Decryption divides into two parts: (a) $2n^2 + 1$ cryptographic pairing computations and (b) one discrete logarithm in the finite field $G_3 = \mathbb{F}_{p^{12}}$. In addition to these, Alg. 1 requires $n^2$ exponentiations (typically with small exponents) and $2n^2$ multiplications in $\mathbb{F}_{p^{12}}$ as well as one inversion in $\mathbb{F}_{p^{12}}$ for each $f_{i,j} < 0$. However, the cost of these operations is insignificant compared to the pairings and the discrete logarithm.
Algorithm 1: Decryption of the FE-QF scheme [DGP18]
  [...]
  5: [...] // Pairing
  6: $\rho \leftarrow \rho \cdot (e_0 \cdot e_1)^{f_{i,j}}$ // Inv. (if $f_{i,j} < 0$), exp., and two mults.
  7: $r \leftarrow \log_{g_3}(\rho)$ // Discrete logarithm
The function $f$ is applied in Alg. 1 by computing exponentiations with $f_{i,j}$ in line 6. If $f_{i,j} = 0$, then the result of this exponentiation is $1 \in \mathbb{F}_{p^{12}}$ and the multiplications and the optional inversion in line 6 are not needed. Even more importantly, the pairings in lines 4 and 5 can be skipped too. For this reason, decryptions with sparse functions are significantly cheaper than general decryptions. This has been utilized extensively in [DGP18, MSH+19] where image classifications utilize projections that result in $f$ where only the diagonal values ($i = j$) are nonzero. We utilize similar projections in the use cases of Sect. 6 and compute functions of the form $f(x, x) = \sum_{i=0}^{n-1} \hat{f}_i x_i^2$ where $\hat{f}_i$ is the $i$th value of the diagonal, reducing the number of required operations per decryption to only $2n + 1$ pairings and one discrete logarithm in $\mathbb{F}_{p^{12}}$ (and $2n$ multiplications, $n$ exponentiations, and one inversion for each negative $\hat{f}_i$). However, our implementation is not limited to such structure of $f$ and we provide performance results for both general and diagonal $f$ in Sect. 5.
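The operation counts stated above can be tallied with a few lines of Python. This is only a counting sketch under the assumption that every zero coefficient lets the decryptor skip its two pairings, its exponentiation, and its two multiplications, as the text describes; the function name `decryption_cost` is hypothetical and is not part of GoFE or of our implementation.

```python
def decryption_cost(F):
    """Count the operations of one FE-QF decryption for coefficient matrix F.

    Assumes (as described in the text) that all work for a zero coefficient
    is skipped and that one extra pairing is always needed.
    """
    nonzero = [f for row in F for f in row if f != 0]
    nnz = len(nonzero)
    return {
        "pairings": 2 * nnz + 1,
        "exponentiations": nnz,
        "multiplications": 2 * nnz,
        "inversions": sum(1 for f in nonzero if f < 0),
        "discrete_logarithms": 1,
    }


if __name__ == "__main__":
    n = 4
    dense = [[1] * n for _ in range(n)]
    diagonal = [[(i + 1) if i == j else 0 for j in range(n)] for i in range(n)]
    print(decryption_cost(dense))     # general f: 2n^2 + 1 pairings
    print(decryption_cost(diagonal))  # diagonal f: 2n + 1 pairings
```

For a dense n-by-n matrix the count reproduces the 2n^2 + 1 pairings of the general case, and for a diagonal matrix it reproduces the 2n + 1 pairings used in the image classification use cases.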
We instantiate the FE scheme of [DGP18] with optimal ate pairings over BN curves as described by Beuchat et al. [BGM + 10] and Shanks' baby-step giant-step discrete logarithm algorithm, which are discussed in Sects. 2.3.1 and 2.3.2, respectively.

Optimal Ate Pairing over a Barreto-Naehrig Curve
The optimal ate pairing over BN curves introduced by Beuchat et al. [BGM+10] is given in Alg. 2. Its final exponentiation by $(p^{12} - 1)/r$ splits into the factors $(p^6 - 1)$, $(p^2 + 1)$, and $(p^4 - p^2 + 1)/r$. The last term $f^{(p^4 - p^2 + 1)/r}$ is the hard part, but it can be computed more efficiently with a vectorial addition chain as shown in [SBC+09]. We refer the readers to [BDM+10] for detailed algorithms (31 in total) for computing each suboperation of Alg. 2.

Discrete Logarithms with Shanks' Baby-Step Giant-Step Algorithm
The baby-step giant-step algorithm given in Alg. 3 is a classical meet-in-the-middle algorithm invented by Daniel Shanks in the early 1970s for computing discrete logarithms [Sha71]. When given $\alpha$ and $\beta$ such that $\alpha, \beta \in G$ with an order $\nu$ and $\beta = \alpha^x$, it returns $x \in [0, \nu - 1]$. The algorithm works in two phases. Firstly, a table $T$ is constructed comprising tuples $(j, \alpha^j)$ for all powers $j = 0, 1, \ldots, \mu - 1$ where $\mu = \lceil \sqrt{\nu} \rceil$. This is the so-called baby-step phase. Secondly, it makes trials by computing $\beta \cdot (\alpha^{-\mu})^i$ for $i = 0, 1, \ldots$ until it finds a match in $T$. This is the giant-step phase. When a match is found, then $x = i\mu + j$.
Algorithm 3: Shanks' baby-step giant-step algorithm for discrete logarithms [Sha71]
  Input: $\alpha, \beta \in G$ such that $\beta = \alpha^x$ with $x \in \mathbb{Z}$ and $G$ is a cyclic group of order $\nu$
  Output: $x$
  [...]
Instead of a general discrete logarithm in $G_3 = \mathbb{F}_{p^{12}}$, the FE-QF scheme requires discrete logarithms where the outputs are bounded to an interval $[-B, B]$ with $B \ll \nu$. Consequently, the discrete logarithm problem can be solved in reasonable time. Because $g_3$ in Alg. 1 is a fixed domain parameter ($g_3 = e(g_1, g_2) \in \mathbb{F}_{p^{12}}$), the table $T$ can be precomputed. Sect. 4.5 describes how Alg. 3 is computed in our implementation.
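The two phases described above can be sketched in a few lines of Python. As a simplifying assumption, the sketch works in the multiplicative group of integers modulo a small prime instead of $\mathbb{F}_{p^{12}}$; the function name `bsgs` and the toy parameters are illustrative and do not correspond to the microprograms of our accelerator. The negative modular exponent requires Python 3.8 or later.

```python
from math import isqrt


def bsgs(alpha, beta, nu, p):
    """Return x in [0, nu - 1] with alpha**x == beta (mod p), or None.

    alpha is assumed to generate a subgroup of order nu modulo the prime p.
    """
    mu = isqrt(nu - 1) + 1  # ceil(sqrt(nu)) for nu >= 1

    # Baby-step phase: table maps alpha^j to j for j = 0, ..., mu - 1.
    table = {}
    power = 1
    for j in range(mu):
        table.setdefault(power, j)
        power = power * alpha % p

    # Giant-step phase: test beta * (alpha^-mu)^i for i = 0, 1, ...
    giant = pow(alpha, -mu, p)
    gamma = beta % p
    for i in range(mu):
        if gamma in table:
            return i * mu + table[gamma]
        gamma = gamma * giant % p
    return None  # beta is not in the subgroup generated by alpha


if __name__ == "__main__":
    p, alpha, nu = 101, 2, 100  # 2 generates the full group modulo 101
    beta = pow(alpha, 57, p)
    print(bsgs(alpha, beta, nu, p))
```

The dictionary plays the role of the table $T$; since it depends only on the base, it can be built once and reused for every input, which is the property exploited by the precomputation discussed in Sect. 4.5.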

Architecture
This section describes the architecture of our accelerator. It is a HW/SW codesign, where the computationally heavy cryptographic pairings and $\mathbb{F}_{p^{12}}$ arithmetic are supported by the HW domain (programmable logic). The SW domain controls the HW domain and computes auxiliary operations. In many settings, the HW/SW codesign paradigm allows combining high performance with low resource usage and flexibility. This is certainly true for implementing the FE-QF scheme, which must support a myriad of separate algorithms (i.e., suboperations of cryptographic pairings, arithmetic in $\mathbb{F}_{p^{12}}$, and discrete logarithms) while the computations still mostly rely on the same low-level arithmetic in $\mathbb{F}_p$.
The FE-QF decryption includes a lot of inherent parallelism. Consequently, we design the HW domain as a multi-core architecture including multiple parallel and programmable Cryptography Processor (CP) cores. We base our CP cores on the pairing cryptography processor architecture recently published in [BJ20a] because it has a good balance between performance and area requirements and is already designed for a HW/SW codesign in a reprogrammable SoC. We instantiate a multi-core architecture where each CP core can be programmed with different microprograms and, hence, the architecture supports two types of processing: (a) symmetric processing, where the CP cores process the same computations (but with different data), and (b) asymmetric processing, where the CP cores process different computations. Implementation of the FE-QF scheme requires significant amounts of data and microcode transfers between the HW and SW domains that easily become the bottleneck for performance. We mitigate this limitation by introducing different levels of data and program memories in the HW domain to reduce HW/SW communication. These memories include both global (for all CP cores) and local (for a single CP core) data and program memories. In the SW domain, we utilize both on-chip and off-chip memories.
Other HW/SW codesigns for pairings than [BJ20a], such as [SDK17], could potentially be used as well. However, they may require adopting slightly different approaches in the codesign; e.g., [SDK17] presents a slower and smaller core compared to [BJ20a] and would require a much larger number of parallel cores for similar performance. Certain stand-alone coprocessors for pairings such as [SGZ+15, YFCV13] allow faster pairings than [BJ20a] but they are typically much larger and would not permit computation of other operations needed in FE-QF. Consequently, we focus solely on the use of [BJ20a] in this paper.
Fig. 1 illustrates the high-level architecture of the HW/SW codesign which is divided into two main parts: (1) SW domain and (2) HW domain. The architecture is generic in the sense that it can be instantiated in various programmable SoCs with minor modifications (see App. A for additional discussion). However, in this paper, we consider only instantiations in Xilinx all-programmable SoCs because we use a Xilinx Zynq Ultrascale+ multiprocessor SoC (MPSoC) platform for prototyping. In an MPSoC, the SW domain consists of multicore ARM processors and the HW domain is a powerful Field Programmable Gate Array (FPGA). To provide programmability and to decrease resource utilization, the HW domain utilizes a microprogramming approach instead of hardwired Finite State Machines (FSMs).

SW Domain
The SW domain (on the left in Fig. 1) is responsible for controlling all operations in the HW domain and external peripherals (i.e., DDR memory and I/O peripherals) as well as performing computations which are not supported by the multi-core architecture of the HW domain. The SW domain controls all actions in the HW domain including sending and receiving data and microprogram packs between the SW and HW domains, issuing/receiving commands/statuses to/from the HW domain, configuring and programming all CP cores and other modules in the HW domain, and making control decisions for symmetric or asymmetric parallel execution of the CP cores based on the received statuses and different conditions. The communication between the SW and HW domains is performed via two types of interfaces: High Performance (HP) and General Purpose (GP) interfaces. The HP interfaces are employed for high-performance communication of data and microprograms whereas the GP interfaces are used for transferring commands and statuses.

HW Domain
The HW domain (on the right in Fig. 1) contains the parallel CP cores for performing the actual computations and many supporting modules for data and microprograms communication and storage as well as for commands and status communication. All modules in the HW domain are connected to the Advanced Extensible Interface (AXI) structure. The main challenge is communicating data and microprogram packs between the SW and HW domains so that it does not become a bottleneck for performance. To alleviate this limitation, various global and local data and program memories are used in the HW domain. As mentioned above, this communication uses the HP interfaces that connect to the global interface unit via an AXI Direct Memory Access (DMA) module. The global interface unit in the HW domain is controlled by the command and status signals and is connected to the Global Data Memory (GDM), Global Program Memory (GPM), as well as directly to the parallel CP cores. It is responsible for managing all data and microprogram transactions between the SW domain memories, global memories, and local memories of the CP cores. Each CP core has its own Local Data Memory (LDM) and Local Program Memory (LPM). Memory transactions for data and microprograms can be done between (1) the SW domain and the global memories of the HW domain, (2) the SW domain and the local memories of the HW domain, and (3) the global and local memories of the HW domain. Fig. 1 also shows the simplified architectural diagram of the CP core which we discuss with more details in Sect. 3.2.
The multi-core architecture contains $N$ parallel CP cores and supports two types of execution flows: (a) symmetric and (b) asymmetric parallel execution. In the symmetric parallel execution, each CP core uses its own LDM and the shared GPM. One CP core (i.e., CP core 1 in Fig. 1) works as the supervisor core and interacts with the GPM by fetching the required microprograms. All CP cores (i.e., CP cores $1, 2, 3, \ldots, N$) then execute the fetched instructions in a symmetric manner so that the same microprograms are executed in parallel with different data. In the asymmetric parallel execution, each CP core works independently using its own LPM and LDM. As a result, all CP cores can execute different microprograms in parallel. The GDM is a global shared memory for storing input, output, and intermediate data. It allows fast data transfers between the CP cores without the need to move the data between the HW and SW domains.

Cryptography Core
The architecture of the CP core is based on the pairing cryptography processor described in [BJ20a]. The main idea behind its design was to achieve a good trade-off between speed and area requirements by optimizing the core for the resources of modern FPGAs and by using the HW/SW codesign paradigm. This is in contrast with many other cores available in the literature which have been designed to minimize the latency of a single pairing computation at the expense of larger resource usage. Having a core that is optimized for the speed-area trade-off and the HW/SW codesign paradigm provides an excellent starting point for designing an efficient multi-core architecture for FE-QF. The CP core from [BJ20a] was designed primarily as a stand-alone core, but our architecture requires its use in a multi-core architecture and, consequently, we introduce certain changes into the architecture of the CP core and its interfaces.
The CP core is based on a microprogramming architecture in order to combine runtime programmability with a small area footprint. The optimal ate pairing of Alg. 2 and the discrete logarithms with Alg. 3 both rely on $\mathbb{F}_{p^{12}}$ arithmetic which is ultimately based on $\mathbb{F}_p$ arithmetic. Hence, the efficiency of the entire FE-QF scheme strongly depends on the efficiency of $\mathbb{F}_p$ arithmetic and the scheduling of these arithmetic operations.
Although the CP core is based on the core in [BJ20a], we modified and extended the original architecture in multiple ways in particular to facilitate its efficient use in the multi-core architecture. We changed its LDM and LPM units as well as its local interface unit to support local and global modes required for symmetric and asymmetric processing. The architecture details of the CP core are depicted in Fig. 2, which contains local interface, arithmetic (datapath), the LDM, the LPM, and address generation and control units. The local interface unit communicates commands, statuses, data, and microprograms with the global interface unit and other HW modules. This unit loads data into the LDM and microprograms into the LPM. The latter can be done in two ways: offline before the actual FE-QF computations or on-the-fly during the computations. The LPM contains a simple dual-port RAM and a controller for different address branch scenarios. The LPM stores microprograms for algorithm(s) that are run in the CP core. Each instruction in a microprogram consists of several fields providing commands to the corresponding units for a working cycle of the CP core. The LPM is partitioned into several segments, where each segment can be loaded separately via the local interface unit during the runtime. In addition, full microprogram loading of the LPM can be done directly from the SW domain or from the GPM of the HW domain.
The control unit generates addresses for the LDM and makes decisions for loop iterations and conditional statements. The inputs and outputs of the arithmetic unit are connected to the LDM. The LDM is a duplicated true dual-port RAM with two independent read and write ports and supports "4-read", "2-write", or "2-read and 1-write" operations from/to the LDM. This facilitates efficient scheduling and parallelization of F p arithmetic. The LDM is also interfaced with the local interface unit for communicating data with the global interface unit (i.e., data communications with the GDM and the SW domain).
The arithmetic unit (datapath) is described in Fig. 2. In this work, the datapath width is 256 bits and it supports up to 256-bit arithmetic computations in $\mathbb{F}_p$. It consists of three parts: source registers, arithmetic blocks, and output selectors. The arithmetic blocks comprise three Montgomery Modular Multiplier Blocks (MMMBs) and two Modular Adder/Subtractor Blocks (MASBs) and they can operate in parallel and independently of each other. The inputs of all arithmetic blocks can be loaded from the LDM but the inputs of the MASBs can be additionally loaded from the outputs of the arithmetic blocks. This arrangement together with the multi-read/write feature of the LDM allows efficient computation of $\mathbb{F}_{p^{12}}$ arithmetic. The modulus and the precomputed constant for the Montgomery arithmetic are registered into the arithmetic unit. Because the arithmetic unit is similar to the one in [BJ20a], we provide a brief summary of its structure in App. B and refer to [BJ20a] for a detailed description.
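For readers unfamiliar with Montgomery arithmetic, the following Python sketch shows textbook Montgomery multiplication with a 256-bit radix and illustrates the role of the modulus and the precomputed constant mentioned above. It is a generic software model, not a description of the internal structure of the MMMBs (see App. B and [BJ20a] for that). The helper names are hypothetical, and the prime $2^{255} - 19$ is used only as a convenient stand-in for the 254-bit BN prime. The modular inverse via `pow` requires Python 3.8 or later.

```python
BITS = 256  # assumed radix R = 2^256, matching the 256-bit datapath width


def montgomery_constant(p, bits=BITS):
    """Precomputed constant -p^{-1} mod R (p must be odd)."""
    r = 1 << bits
    return (-pow(p, -1, r)) % r


def mont_mul(a, b, p, p_neg_inv, bits=BITS):
    """Return a * b * R^{-1} mod p for inputs 0 <= a, b < p."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * p_neg_inv) & mask  # m = t * (-p^{-1}) mod R
    u = (t + m * p) >> bits              # exact division by R
    return u - p if u >= p else u        # single conditional subtraction


def to_mont(a, p, bits=BITS):
    """Map a to its Montgomery representation a * R mod p."""
    return (a << bits) % p


def from_mont(a, p, p_neg_inv, bits=BITS):
    """Map a Montgomery representation back to the ordinary residue."""
    return mont_mul(a, 1, p, p_neg_inv, bits)


if __name__ == "__main__":
    p = 2**255 - 19  # stand-in prime, NOT the BN prime of the paper
    k = montgomery_constant(p)
    a, b = 1234567891011, 2**200 + 17
    c = mont_mul(to_mont(a, p), to_mont(b, p), p, k)
    print(from_mont(c, p, k) == a * b % p)
```

Keeping operands in Montgomery form throughout a pairing computation means that the conversions are paid only once at the start and the end, which is why only the modulus and the constant need to be held in the arithmetic unit.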

Implementation of the Algorithms
This section describes how the algorithms of the FE-QF scheme are mapped into the architecture introduced in Sect. 3. The entire mapping is done by writing the required software for the SW domain and the microprograms for the HW domain. In general, the mapping of the pairing algorithm is done following the guidelines of [BJ20a]. The discrete logarithm computations and parallel processing of the pairings and $\mathbb{F}_{p^{12}}$ arithmetic are novel contributions of this paper. Hence, the focus of this section is on these topics.
[...] domains with the specific parameters and algorithms of the FE scheme and performs the precomputation for the discrete logarithm.

Working Principles of the HW/SW Codesign Accelerator
The algorithms are implemented via optimized microprograms which divide into two types: segment and full microprogram packs. The former allows updating only parts of the microprogram that is currently in the HW domain. To implement a specific subalgorithm of Alg. 1, an in-depth analysis was performed and the algorithms were translated into microprograms (i.e., several segments and/or full sub-routine packs). The microprograms were generated by hand through a customized platform and scripts. The microprograms are sequences of instructions for different units of the CP core. The instructions are 72 bits long and divide into 14 fields (e.g., arithmetic, control, next program memory address, LDM address values, LDM, and LPM fields). All microprograms required for the FE-QF decryption are stored in the (off/on-chip) SW domain memory (i.e., DDR memory). Whenever a (set of) particular computation(s) needs to be executed in a specific CP core or in all CP cores, the corresponding microcode(s) are loaded into the LPM or GPM, as explained before. Finite field arithmetic has the largest impact on the overall performance and, therefore, special care was taken to optimize the microprograms for it to maximally utilize the datapath of the CP cores (see Sect. 4.2).
In the HW domain, the (full and segment) microprogram packs are stored in the LPM and/or GPM memories depending on how they need to be executed. In this work, the sizes of segment and full packs are 288 B and 18 KB, respectively. In symmetric parallel execution, all microprogram packs are first stored in the GPM and the supervisor core (i.e., CP core 1) interacts with the GPM to get the suitable packs for all CP cores. In asymmetric execution, the SW domain stores different microprogram packs into the LPMs of different CP cores. The latencies reported in the following sections include all latencies related to transferring microprograms, commands, and statuses between the SW and HW domains and also between different modules inside the HW domain. The SW domain contains a large off-chip memory (i.e., a 4 GB DDR memory) and most of this memory is occupied by a large precomputed table for discrete logarithm computations (see Sect. 4.5).

Tower Extension Field Arithmetic
The first step in implementing the FE-QF scheme is to efficiently implement the finite field arithmetic in $\mathbb{F}_{p^{12}}$. In this work, we adopt the tower extension field definitions for $\mathbb{F}_{p^{12}}$ from [BDM+10] and, consequently, arithmetic operations are computed with series of operations in $\mathbb{F}_p$. In particular, Karatsuba-like multiplications in the quadratic extension fields $\mathbb{F}_{p^2}$ and $\mathbb{F}_{p^{12}}$ can be computed with three multiplications (and additions/subtractions) in $\mathbb{F}_p$ and $\mathbb{F}_{p^6}$, respectively. Multiplications in $\mathbb{F}_{p^6}$ require six multiplications in $\mathbb{F}_{p^2}$ (see [BDM+10] for detailed algorithms). As our CP core is based on the pairing cryptography processor from [BJ20a], we use the same approach for scheduling the tower extension field arithmetic (from $\mathbb{F}_p$ to $\mathbb{F}_{p^{12}}$). The main observation was that the parallel processing capabilities of the datapath can be utilized so that costs of all additions/subtractions are effectively hidden as they are executed in parallel with multiplications. Fig. 9 in App. B presents the timing diagrams and [BJ20a] provides additional details.
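The three-multiplication trick for the quadratic extension can be illustrated with a short Python sketch. It assumes the representation $\mathbb{F}_{p^2} = \mathbb{F}_p[u]/(u^2 - \beta)$ with elements stored as pairs $(a_0, a_1)$ meaning $a_0 + a_1 u$; the default $\beta = -1$ and the toy prime 103 are assumptions chosen for readability, and the tower in [BDM+10] may use a different non-residue. The function names are hypothetical.

```python
P = 103  # toy prime with P % 4 == 3, so u^2 = -1 has no root in F_P


def fp2_mul_schoolbook(a, b, p=P, beta=-1):
    """Multiply in F_p[u]/(u^2 - beta) with four base-field multiplications."""
    a0, a1 = a
    b0, b1 = b
    c0 = (a0 * b0 + beta * a1 * b1) % p
    c1 = (a0 * b1 + a1 * b0) % p
    return (c0, c1)


def fp2_mul_karatsuba(a, b, p=P, beta=-1):
    """Same product with three base-field multiplications (v0, v1, v2).

    The multiplication by the small constant beta is not counted because it
    is typically realized with additions or subtractions.
    """
    a0, a1 = a
    b0, b1 = b
    v0 = a0 * b0 % p
    v1 = a1 * b1 % p
    v2 = (a0 + a1) * (b0 + b1) % p
    c0 = (v0 + beta * v1) % p
    c1 = (v2 - v0 - v1) % p  # equals a0*b1 + a1*b0
    return (c0, c1)


if __name__ == "__main__":
    a, b = (17, 88), (5, 64)
    print(fp2_mul_karatsuba(a, b) == fp2_mul_schoolbook(a, b))
```

The extra additions and subtractions in the Karatsuba variant are exactly the kind of operations that the MASBs can execute in parallel with the multipliers, which is why their cost can be hidden in the schedule.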

Optimal Ate Pairings over BN Curves
Implementation of optimal ate pairings (Alg. 2) consists of three main levels. The first level is the finite field arithmetic (from F p to F p 12 ) discussed in Sect

Parallel Pairings and F p 12 Arithmetic Calculations
As mentioned in Sect. 2.3, the FE-QF decryption algorithm given in Alg. 1 is divided into two parts: (1) parallel pairings and $\mathbb{F}_{p^{12}}$ computations and (2) discrete logarithm in $G_3 = \mathbb{F}_{p^{12}}$. These parts are discussed in this section and Sect. 4.5, respectively. Fig. 4(a) illustrates the implementation and timing diagrams of the pairings and $\mathbb{F}_{p^{12}}$ computations of Alg. 1 (i.e., lines 1-6). In this part, the SW domain controls the execution flow with parallel symmetric or asymmetric executions while all pairings and $\mathbb{F}_{p^{12}}$ operations are computed in the HW domain. First, $2n^2 + 1$ pairings are calculated with the $N$ parallel CP cores through $\lceil (2n^2 + 1)/N \rceil$ iterations. They are shown with blue blocks in Fig. 4(a). In this step, the SW domain sends all inputs (i.e., $g_1^{a_i}$ and $g_2^{b_i}$) into the LDMs of the CP cores and the pairing outputs (i.e., $e_i$) are stored in the GDM or, when the space limit of the GDM is reached, are sent to memories of the SW domain. These communications are shown with orange and gray blocks in Fig. 4(a), respectively. Second, $\mathbb{F}_{p^{12}}$ computations (i.e., exponentiations, inversions, and multiplications) are calculated with the $N$ parallel CP cores. These computations are shown with green blocks in Fig. 4(a). In this step, all inputs are loaded to the LDMs of the CP cores from the GDM of the HW domain and/or SW domain memories (i.e., orange and gray blocks in Fig. 4(a)). Furthermore, several segment and full microprogram packs are transferred from the GPM to all LPMs of the CP cores during each pairing (and $\mathbb{F}_{p^{12}}$ arithmetic operation) in the HW domain. These communications are shown with narrow orange blocks in Fig. 4(a).
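The iteration count of the pairing step can be sketched as follows. The sketch assumes that pairings are simply dispatched in consecutive batches of at most $N$, one pairing per CP core and iteration; the actual allocation performed by the SW domain may differ, and the function names are hypothetical.

```python
def pairing_rounds(n, num_cores, diagonal=False):
    """Number of iterations needed to compute all pairings of one decryption."""
    pairings = (2 * n + 1) if diagonal else (2 * n * n + 1)
    return -(-pairings // num_cores)  # ceiling division


def pairing_schedule(num_pairings, num_cores):
    """Assign pairing indices to iterations in consecutive batches.

    Returns a list of iterations; each iteration lists the pairing indices
    processed in parallel (one per CP core).
    """
    return [
        list(range(start, min(start + num_cores, num_pairings)))
        for start in range(0, num_pairings, num_cores)
    ]


if __name__ == "__main__":
    n, cores = 4, 8  # illustrative values only
    print(pairing_rounds(n, cores))
    for iteration in pairing_schedule(2 * n * n + 1, cores):
        print(iteration)
```

Because the number of pairings is odd, the last iteration generally leaves some cores idle; this effect becomes negligible when the number of pairings is much larger than $N$.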

Discrete Logarithms
Alg. 4 describes the parallel baby-step giant-step algorithm for discrete logarithms with $N$ parallel CP cores. Compared to the standard baby-step giant-step algorithm shown in Alg. 3, it introduces two main improvements: (1) the computation is efficiently allocated to $N$ parallel CP cores and (2) the computation is split into precomputation and on-the-fly computation. Precomputation comprises the baby-step phase of Alg. 3 and makes it a one-time effort needed only in the beginning. On-the-fly computation is the giant-step phase of Alg. 3 and must be computed separately for each input.
Regarding the former improvement, the computations of Alg. 4 are almost optimally distributed to $N$ parallel CP cores, resulting in a speedup factor that is very close to $N$. The latter improvement gives a significant speedup because the baby-step phase typically contributes most of the delay: it always needs to be computed in full, whereas the giant-step phase terminates as soon as a match is found. We also designed a Constant-Time (CT) variant for cases where timing attacks are a threat (see Sect. 5.4 for more discussion).
The FE-QF decryption guarantees that the result of a decryption from valid ciphertexts is in the interval $[-B, B]$ that is determined by the function $f$. Consequently, the precomputation phase must compute a table of size $\sqrt{B}$. However, in this work, we chose to precompute a table of size $B_p$, where $B_p$ is the maximum size that fits into the memory, so that our implementation supports all functions for which the output bound $B$ satisfies $B_p \geq \sqrt{B}$ without the need for new precomputations; the price of this choice is a slightly longer precomputation phase. On the other hand, this choice improves the speed of the on-the-fly computation even for functions with smaller output bounds because fewer iterations are needed in the giant-step phase (in particular, the on-the-fly computation terminates before entering the main for loop if the result is at most $B_p$).
The minor improvements of Alg. 4 over Alg. 3 include support for interval $[-B, B]$ where B ν and includes the negative values. Notice that the last line (i.e., line 18) is never reached with inputs fulfilling all input requirements of Alg. 4, but it is included to cleanly capture decryption failures with faulty ciphertexts. We describe the precomputation and on-the-fly phases in more detail in the following subsections.
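The split into a one-time precomputation and a per-input giant-step search can be illustrated with a minimal single-core sketch. It is not Alg. 4 itself: it omits the distribution over $N$ cores, uses a Python dictionary instead of a sorted table, and works in a toy group (integers modulo the prime $2^{31}-1$ with the primitive root 7) rather than in $\mathbb{F}_{p^{12}}$. All names are hypothetical.

```python
P = 2**31 - 1   # toy prime modulus (assumption; the paper works in F_{p^12})
G = 7           # a primitive root modulo 2^31 - 1

def precompute(alpha, p, bp):
    """Baby steps: table of alpha^j for 0 <= j <= bp, plus the giant-step factor."""
    table, cur = {}, 1
    for j in range(bp + 1):
        table[cur] = j
        cur = cur * alpha % p
    step = bp + 1
    delta = pow(alpha, -step, p)   # alpha^(-step); negative exponents need Python 3.8+
    return table, step, delta

def dlog(beta, table, step, delta, p, bound):
    """Return x in [-bound, bound] with alpha^x = beta, or None on failure."""
    beta_inv = pow(beta, -1, p)
    if beta in table:              # result in [0, bp]: no giant steps needed
        return table[beta]
    if beta_inv in table:          # result in [-bp, -1]
        return -table[beta_inv]
    pos, neg = beta, beta_inv
    for i in range(1, -(-bound // step) + 1):
        pos = pos * delta % p      # candidate for a positive result
        if pos in table:
            return i * step + table[pos]
        neg = neg * delta % p      # candidate for a negative result
        if neg in table:
            return -(i * step + table[neg])
    return None                    # captures faulty inputs outside the bound
```

A larger table shortens the loop for every input, which is the reason for precomputing the largest table that fits into memory.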

// Alg. 4(a): Precomputation phase
Input: $\alpha \in G$; a function $h$; a memory bound $B_p$
Output: Table $T$, values $\gamma_u$ for $1 \leq u \leq N/2$, and $\delta$

Precomputation Phase
As mentioned in Sect. 2.3.2, $g_3$ in Alg. 1 is a domain parameter (the element $g_3 = e(g_1, g_2) \in \mathbb{F}_{p^{12}}$) that can be fixed. Because the shared table $T$ in Alg. 4 includes powers of $\alpha = g_3$ from 0 to $B_p$, it can be precomputed. Based on the specifications of our implementation platform, we chose $B_p = 2^{27}$ and precompute the table when the platform boots up (instead of storing it in non-volatile storage). In Alg. 4, we also employ a function $h$, which truncates the value of $\alpha^j$ to $\ell$ bits, to decrease the memory space needed for storing $T$ and to increase the performance of data communication from the HW domain to the SW domain. We used $\ell = 64$, which still guarantees the uniqueness of each $h(\alpha^j)$.
Fig. 4(b) illustrates the implementation method and timing diagram of Alg. 4(a) in the HW/SW codesign. The precomputation phase contains a main for-loop (i.e., line 3 of Alg. 4(a)), and the SW domain controls and executes its iterations by assigning $N$ parallel $\mathbb{F}_{p^{12}}$ computations (i.e., lines 5 and 6) to the CP cores. These computations are shown with green blocks in Fig. 4(b). In each iteration, all CP cores send the $\ell$-bit results (i.e., $h(\gamma_u)$ for $u = 1, 2, \ldots, N$) to the SW domain; this is shown with narrow gray blocks in Fig. 4(b). The SW domain stores the tuples $(j, h(\alpha^j))$ of the shared table $T$ in the DDR memory (see Fig. 3).
In addition to the table $T$, Alg. 4 also precomputes $\delta$ and $\gamma_u$ for $u = 1, 2, \ldots, N$. The parallel computations in lines 4-7 also compute $\alpha^{\mu}$, from which $\alpha^{-\mu}$ is obtained through an inversion in line 9. As the figure shows, after computing lines 1-9, the SW domain initiates the parallel computations of lines 10-11 in the HW domain (i.e., gray, green, and orange blocks). At the same time, it sorts the shared table $T$ (i.e., the black block) using a quicksort routine. A sorted table allows efficient binary searches from the table during the on-the-fly computation phase.
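The combination of truncation, sorting, and binary search can be sketched as follows. The sketch makes several assumptions that are not taken from the implementation: the truncation keeps the least significant bits, field elements are modeled as plain integers modulo a toy prime, and Python's built-in sort stands in for the quicksort routine. The names `h`, `build_sorted_table`, and `lookup` are hypothetical.

```python
from bisect import bisect_left

P = 2**127 - 1   # toy prime modulus (assumption); elements are plain integers
ELL = 64         # number of bits kept by the truncation function

def h(value, ell=ELL):
    """Truncate a field element to its ell least significant bits (assumed variant)."""
    return value & ((1 << ell) - 1)

def build_sorted_table(alpha, p, bp):
    """Return sorted truncated keys h(alpha^j) and the matching exponents j."""
    entries, cur = [], 1
    for j in range(bp + 1):
        entries.append((h(cur), j))
        cur = cur * alpha % p
    entries.sort()                               # stand-in for the quicksort routine
    keys = [k for k, _ in entries]
    if len(set(keys)) != len(keys):
        raise ValueError("truncation is not collision-free for these parameters")
    return keys, [j for _, j in entries]

def lookup(keys, exps, value):
    """Binary search for h(value); returns the exponent j or None."""
    key = h(value)
    i = bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return exps[i]
    return None
```

Storing only the truncated keys keeps each table entry small, while the explicit collision check mirrors the requirement that every truncated value remains unique.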

On-The-Fly Computation Phase
This phase is the on-the-fly computation that must be executed separately for each FE-QF decryption. It contains two types of operations: (1) searching from the sorted shared table $T$ and (2) performing $\mathbb{F}_{p^{12}}$ multiplications. They are handled in parallel in the SW and HW domains, respectively. Fig. 4(c) describes the implementation and timing diagram of the on-the-fly computation.
At the beginning, the SW domain loads the LDM and LPM of a CP core with the input data (i.e., $\beta$) and the microprogram packs of the inversion. This is shown with the first gray block in Fig. 4(c). Then, the same CP core is initiated to compute the $\mathbb{F}_{p^{12}}$ inversion needed for $\beta^{-1}$ in line 1 of Alg. 4(b), which is shown with the first green block in Fig. 4(c). Simultaneously, the SW domain performs the first search from the table $T$ (i.e., line 2 of Alg. 4(b)), which is shown with the first narrow red block in Fig. 4(c). After the $\mathbb{F}_{p^{12}}$ inversion has been computed, the SW domain performs the second search from the table $T$ (i.e., line 4 of Alg. 4(b)), which is shown with the second narrow red block in Fig. 4(c).
After that, the SW domain loads the LDMs and LPMs of the CP cores with the input data (i.e., $\gamma_u$ and $\delta$) and the microprogram packs of the multiplication. This is shown with the third gray block in Fig. 4(c). Then, the SW domain controls and executes the main for-loop (i.e., line 6 of Alg. 4(b)) by assigning $N$ parallel $\mathbb{F}_{p^{12}}$ multiplications (i.e., lines 9-10 and 14-15) to the CP cores, which are shown with the green blocks in Fig. 4(c). In each iteration, all CP cores send their truncated results (i.e., $h(\gamma_u)$ for $u = 1, 2, \ldots, N$) to the SW domain, as shown with the narrow gray blocks in Fig. 4(c). In the main for-loop, the SW domain performs $N$ sequential binary searches from the sorted shared table $T$ (i.e., lines 11 and 16) in each iteration. The parallel computations in lines 8-12 and 13-17 correspond to the positive and negative output values, respectively. This process continues until the SW domain finds a match in $T$ and then returns the output $x$. This is illustrated with the consecutive and parallel red and green blocks in Fig. 4(c).

Results and Analysis
In this section, we present and analyze the implementation results. To the best of our knowledge, we presented the first hardware accelerator for an FE-QF scheme and, therefore, we compare our implementation results with the state-of-the-art software implementations.

Experimental Setup
In order to evaluate the performance of the HW/SW codesign accelerator, we implemented it on real hardware. We targeted Xilinx programmable SoCs and specifically we used the Zynq UltraScale+ MPSoC ZCU102 evaluation kit including a Xilinx Zynq UltraScale+ MPSoC XCZU9EG-2FFVB1156 device, which features a quad-core ARM Cortex-A53 processor running up to 1.5GHz in the SW domain and a 16nm FinFET+ based FPGA in the HW domain. For the SW domain, we used C programming and Xilinx Software Development Kit (SDK) as the development environment. For the HW domain, we used Verilog (HDL) and Xilinx Vivado v2019.1 tool for compiling and implementing the design to the FPGA. The source codes are publicly available 3 .

Area Consumption
The resource requirements for the HW domain are shown in Table 1. We were able to fit N = 16 CP cores into the FPGA by using 33,057 slices (96.5 %), 401
Table 2 presents total execution times and HW latencies of different low-level operations. It includes four parts. (1) The top part presents the initialization and precomputation part. The initialization step initializes all SW and HW memories with data and microprograms and configures all HW domain modules. The precomputation computes the sorted shared table $T$ for the discrete logarithms using Alg. 4(a). In this work, we chose the largest memory bound for which the precomputed table $T$ still fits into the DDR memory of the evaluation kit. Specifically, we have $B_p = 2^{27}$, which supports up to 54-bit outputs and results in a 1.9 GB precomputed table $T$. It takes 535 seconds to perform all operations required to construct the sorted table $T$. Although this part needs to be executed only once at boot-up, it takes a long time and can be problematic in some settings. In such cases, the precomputed table $T$ could be stored in a non-volatile memory (e.g., an SD card) or loaded via an external interface at boot-up.

Timing results of the SW/HW primitive operations
(2) The second part shows the timings for transferring different command/status, data, and microprogram packs between different domains and local/global memories. (3) The third part shows the timings for computing $\mathbb{F}_{p^{12}}$ arithmetic. Squaring, multiplication, and inversion are CT by default, but exponentiation has two versions: a Variable-Time (VT) square-and-multiply and a CT square-and-multiply-always. (4) The bottom part reports the timing for an optimal ate pairing calculation, which is CT.
5 The maximum clock frequency for a single instance of the CP core is 230 MHz [BJ20a], so the frequency drop to 210 MHz with N = 16 is only a minor one and greatly outweighed by the advantages of parallelization. Arguably, the reason why the high slice utilization (96.5 %) did not cause a more significant frequency drop is that the slices are relatively sparsely filled, as the LUT utilization is only 56.4 %.
‡ Exp-CT/VT assume an 11-bit exponent; Exp-VT also assumes a Hamming weight of 6.
Fig. 5(a) depicts the timing results for the pairings and $\mathbb{F}_{p^{12}}$ arithmetic part of Alg. 1 (i.e., lines 1-6) as a function of $n$, the length of $x$ and $y$, for both general and diagonal $f$. As discussed in Sect. 2.3, a diagonal $f$ requires $2n + 1$ pairings (and arithmetic in $\mathbb{F}_{p^{12}}$) whereas a general $f$ requires $2n^2 + 1$ pairings. This difference is clearly visible in Fig. 5(a), which shows that, when $n = 140$, a diagonal $f$ requires only about 22 ms compared to about 3 s required by a general $f$.
Fig. 5(b) presents the timing results for the discrete logarithm part of Alg. 1 (i.e., line 7) as a function of the bit-length of the output. Because we use a precomputed table with $B_p = 2^{27}$, the computation for results of at most 27 bits takes around 0.11 ms because the result is directly available in $T$. For larger outputs, the computation time increases quickly. For example, for output sizes of 29, 38, and 54 bits, the computation times are 1.2 ms, 17.5 ms, and 1,096,478 ms, respectively. Hence, functions whose results are typically small are relatively fast to compute with the VT discrete logarithm algorithm even if the theoretical bound $B_t$ for that function is large. The CT variant of the discrete logarithm computation always executes all iterations of Alg. 4 up to $\mu$ required by the maximum output bound $B_t$ of that particular function $f$. Hence, the cost of CT vs. VT discrete logarithms depends on both $B_t$ and the distribution of the outputs for the particular function and use case.
Fig. 5 allows estimating the total execution time of an FE-QF decryption with Alg. 1 for different functions and features. Table 3 collects timing results for functions of different sizes and types (i.e., general and diagonal $f$) and for both VT and CT variants. The CT variant always computes discrete logarithms with the maximum number of iterations for that function, whereas the VT variant terminates immediately after the result is found. The VT timings assume uniformly random input vectors $x$, $y$, and function $f$. Table 3 also presents the timing results for the worst cases of the $n$ and $B_t$ parameters of the two use cases from Sect. 6. As can be seen, the difference between CT and VT results grows when $B_t$ increases for both general and diagonal $f$. This trend is more obvious for diagonal $f$ because the discrete logarithm computation is more dominant thanks to the smaller number of pairings.
Table 4 presents a performance comparison against two software implementations. The decryption timings are for the general $f$ with the maximum output bound (i.e., $B_t = n^2 \cdot B_x \cdot B_y \cdot B_f$). The first software results (the original GoFE library) are from [MSH+19] and contain timings of key generation, encryption, and decryption with different parameters. It is evident that the original library becomes impractical for larger parameters $n$ and $B_t$ due to the slowness of the decryption. Moreover, it can be observed that key generation and encryption are considerably faster and do not pose significant performance challenges for real-life FE-QF schemes, supporting our choice to focus on decryption in this work. The original GoFE library reported in [MSH+19] does not use precomputation tables for discrete logarithms. Hence, we included the second software results from an optimized GoFE library that uses Alg. 4 with $B_p = 2^{25}$ for discrete logarithms. This significantly improved the software decryption timings and provides a fairer comparison against software. Table 4 shows that our HW/SW codesign offers speedups of 2.3 to 20.0 times compared to the optimized GoFE library and of over 1000 times compared to the original GoFE library from [MSH+19]. The results demonstrate that the original GoFE library suffers from slow discrete logarithms and, hence, the use of Alg. 4 provides major improvements also in software. Compared to the optimized GoFE library, the largest speedups are obtained when the number of pairings is high.

Discussion on Side-Channels
Side-channel attacks are a significant threat to practical cryptosystems and implementations should include countermeasures against attacks that are considered possible in the threat models of applications where the implementations could be deployed. FE has certain features that provide inherent resilience against side-channel attacks. E.g., if we consider the specific case of our HW/SW codesign for FE-QF decryptions, the only (potentially) sensitive values handled by the implementation are the decryption keys dk f and the results of the decryptions. By the nature of FE, the decryption keys do not provide full access to the plaintexts reducing their lucrativeness as targets of side-channel attacks compared to keys of traditional cryptosystems. Nevertheless, one may envision applications where they need to be protected against side-channel attacks. The output results may also be sensitive and an attacker could try to learn information about them via side-channel attacks.
We consider timing attacks as the main threats for our implementation because, in most envisioned applications, the implementation would be installed as an accelerator for a server and it is unlikely that attackers could get physical access to the device. Hence, we have considered CT implementation as the main method to protect against side-channel attacks. Notice that, in our case, this implies protection even against simple power and electromagnetic attacks because the operation patterns are constant. We leave protections against advanced power attacks as topics for future research.
The would still be required. If even the function itself is secret, then it could be protected against timing attacks by adopting a CT exponentiation algorithm. The CT variants in Table 3 utilize CT square-and-multiply-always exponentiations. Notice that there are also function hiding FE schemes (e.g., [BRS13, BJK15, ACF + 18]), which would hide f even from a legitimate decryptor but they are out of the scope of this work.
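The difference between the VT square-and-multiply and the CT square-and-multiply-always exponentiations can be illustrated with the following sketch. It operates on integers modulo a toy prime rather than on $\mathbb{F}_{p^{12}}$ elements, the function names are hypothetical, and Python code is of course not constant-time in practice; the sketch only shows that the number of multiplications no longer depends on the exponent.

```python
def exp_vt(base, exponent, modulus, bits=11):
    """Right-to-left square-and-multiply: multiplies only for set exponent bits."""
    result, power, mults = 1, base % modulus, 0
    for i in range(bits):
        if (exponent >> i) & 1:
            result = result * power % modulus
            mults += 1
        power = power * power % modulus
    return result, mults

def exp_ct(base, exponent, modulus, bits=11):
    """Square-and-multiply-always: one multiplication per bit, result selected by the bit."""
    result, power, mults = 1, base % modulus, 0
    for i in range(bits):
        bit = (exponent >> i) & 1
        candidate = result * power % modulus   # always computed
        mults += 1
        result = bit * candidate + (1 - bit) * result
        power = power * power % modulus
    return result, mults
```

For an 11-bit exponent with a Hamming weight of 6, the VT version performs 6 multiplications and the CT version 11, in addition to the squarings.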
The discrete logarithm in line 7 of Alg. 1 does not utilize the decryption key but may still require side-channel protections if the output result is sensitive. In particular, the timing of the computation with Alg. 4(b) relies heavily on the size of the result. E.g., if the result is smaller than $B_p$, the computation terminates almost immediately (0.11 ms), but if the result is close to the maximum value $2^{54}$, the computation takes several minutes.
Consequently, an outside observer could learn information about decryption results simply by measuring response times. However, in a normal setting, the variation of discrete logarithm timings is much more moderate than in the above extreme case (see, e.g., Fig. 7 for output bit-length distributions in the two use cases). In any case, this leakage can be prevented by using the CT variant of the discrete logarithm algorithm, where all iterations of the for loop in lines 6-17 of Alg. 4(b) are always computed regardless of when the match is found. It is crucial that the number of iterations is determined by using the bounds of the particular function $f$ and not by using the maximum bound supported by the implementation ($2^{54}$). Table 3 shows that the cost of CT discrete logarithms starts to dominate when the upper bound for the output is large. E.g., with at most 33-bit results, the overhead compared to the VT variant is only 16 % with uniformly distributed inputs, but it is already 324 % for at most 41-bit results.
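The idea of the CT variant, running a fixed number of giant steps that depends only on the bound of the function, can be sketched as follows. This is a simplified single-core illustration in a toy group (integers modulo the prime $2^{31}-1$ with primitive root 7), with hypothetical names; Python dictionaries and branches are not constant-time, so the sketch only demonstrates the fixed iteration count, not a real countermeasure.

```python
P, G = 2**31 - 1, 7   # toy group (assumption): 7 is a primitive root modulo 2^31 - 1

def make_table(bp):
    """Baby-step table of G^j for 0 <= j <= bp."""
    table, cur = {}, 1
    for j in range(bp + 1):
        table[cur] = j
        cur = cur * G % P
    return table

def dlog_fixed_iterations(beta, table, bound_t):
    """Always runs ceil(bound_t / step) + 1 rounds; returns (result, rounds)."""
    step = len(table)
    delta = pow(G, -step, P)              # negative exponents need Python 3.8+
    pos, neg = beta, pow(beta, -1, P)
    result, rounds = None, 0
    for i in range(-(-bound_t // step) + 1):
        j_pos, j_neg = table.get(pos), table.get(neg)
        if result is None and j_pos is not None:
            result = i * step + j_pos     # first match for a non-negative result
        if result is None and j_neg is not None:
            result = -(i * step + j_neg)  # first match for a negative result
        pos, neg = pos * delta % P, neg * delta % P
        rounds += 1                       # no early exit after a match
    return result, rounds
```

Because the loop bound is derived from `bound_t`, a function with a small output bound pays far less for the countermeasure than one whose bound is close to the maximum supported by the table.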
To conclude, the implementation can be fully protected against timing attacks, but the cost depends heavily on the function that is computed. Hence, the use of the countermeasures should be decided based on the performance and threat model of the application.

Use Cases
Design and implementation of ML models on the encrypted data are difficult and require special techniques like FE. The performance of such models is typically not comparable to the performance of the models that are executed on the unencrypted data. In what follows, we show how the performance of such models can be significantly improved by using a hardware accelerator.

ML Classification on the Encrypted MNIST Dataset
One of the main tasks that demonstrate the power of ML is image classification. Contemporary ML algorithms are able to solve tasks such as recognition of traffic signs or deciding a patient's condition based on his/her medical image. In many cases images contain private information that should be kept secret. Using traditional encryption, nothing can (and must not) be learned from the encrypted data. FE, on the other hand, permits image classifications without revealing the images themselves.
A basic example used to test ML algorithms is the MNIST (Modified National Institute of Standards and Technology) dataset. It consists of a training set of 60,000 images of handwritten digits (28 × 28 pixels) and a test set of 10,000 images. The task is to classify the images into 10 classes based on which digit is written in the image. Modern ML algorithms can achieve above 99.7% accuracy in this task. However, practical FE supports only a limited set of functions, and such accuracy cannot be reached.
Papers [LCFS17, DGP18, MSH+19] demonstrated how the classification of handwritten digits can be done using FE-IP and FE-QF, where the latter achieves higher accuracy. The proposed model is a neural network with one hidden layer where the non-linear activation function in the hidden layer is the element-wise quadratic function. The output layer is of dimension 10 and can be interpreted as the likelihoods of each of the 10 digits being in the image. A different $dk_{f_i}$ is provided for each digit $i$, and the decryption that provides the largest output value is interpreted as the digit in the image. The FE-QF scheme in [DGP18] has homomorphic properties that can be used as a linear transformation between the input and hidden layers. Moreover, the evaluation of 10 quadratic functions with diagonal entries can be used to evaluate the activation function and the linear transformation to the output layer. Such a model, which is in fact equivalent to that of [DGP18], achieves an accuracy of about 97% [DGP18, MSH+19]. We report the timings for FE-QF decryptions from images that have already been projected to support decryption with diagonal $f$. Fig. 6 gives a visual representation of this use case.
Training of the model needs to be done on unencrypted data, while classification is done on encrypted images. The images are represented as 785-coordinate vectors (28 · 28 pixels and one coordinate for the bias) and we use a hidden layer of dimension $n = 40$. The inputs as well as the linear transformations between layers must be discretized to be used by FE. Discretization to larger intervals of integers leads to slow decryption, while discretization to small intervals decreases the prediction accuracy. We chose $B_x = 4$ for the inputs and $B_1 = 200$, $B_2 = 40$ as the discretization bounds for the first and the second linear transformation. With such bounds, the accuracy does not change significantly and the output values can be bounded with $B_t = 2^{39}$; however, they are much smaller in practice (see Fig. 7).
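The structure of this model, a linear projection followed by a diagonal quadratic function per class, can be illustrated in the clear (without any encryption) as follows. The sketch uses hypothetical names and arbitrary small integer weights; it only shows that each class score is a quadratic function of the input, which takes a diagonal form after the projection and a general form $x^T Q x$ before it.

```python
def classify(x, proj, diag):
    """Scores s_i = sum_k diag[i][k] * (proj[k] . x)^2; returns (argmax, scores)."""
    hidden = [sum(w * v for w, v in zip(row, x)) for row in proj]   # first linear layer
    scores = [sum(d * y * y for d, y in zip(drow, hidden)) for drow in diag]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def general_form(x, proj, drow):
    """The same score written as x^T Q x with Q = proj^T * diag(drow) * proj."""
    m = len(x)
    q = [[sum(drow[k] * proj[k][a] * proj[k][b] for k in range(len(proj)))
          for b in range(m)] for a in range(m)]
    return sum(q[a][b] * x[a] * x[b] for a in range(m) for b in range(m))
```

With an input of length 785 and a hidden layer of dimension 40, `proj` would be a 40 × 785 matrix and `diag` a 10 × 40 matrix, so decrypting with the diagonal form involves far fewer terms than the general form.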

ML Classification on the Encrypted Fashion MNIST Dataset
The prediction power of an ML model used with FE depends on both the ML techniques (e.g., model design, training algorithm, etc.) and the functionality and performance of FE. In this paper we demonstrated how to boost the performance of FE-QF using HW/SW codesign. A speedup in the FE part implies that the model can use more parameters and a less strict discretization and, consequently, achieve better accuracy while remaining practical. In the neural network model, the hidden layer can have a higher dimension and the discretization for inputs and linear transformations can use wider ranges.
Fashion MNIST is a dataset that is very similar to MNIST but consists of images of 10 different clothing objects (e.g., shoes, t-shirts, coats, etc.). The task of classifying clothes is harder than classifying digits, and the most advanced algorithms achieve about 95% accuracy on Fashion MNIST (compared to 99.7% on MNIST). Using the parameters of the MNIST model on the Fashion MNIST dataset gives poor classification accuracy. However, increasing the size of the hidden layer to $n = 128$ and the discretization bounds to $B_x = 10$, $B_1 = 1000$, and $B_2 = 50$ achieves an accuracy of above 87.5%. Consequently, the number of pairings is higher and the output values are larger (see Fig. 7), leading to slower discrete logarithms. However, these numbers are still manageable by our HW/SW design. Table 5 presents the results for the ML classification models on encrypted MNIST and Fashion MNIST images using our accelerator. Moreover, we compare the performance with both the original and the optimized GoFE libraries. Table 5 shows that the HW/SW design gives a speedup similar to that reported in the previous section even for real use cases.

Conclusions
In this paper, we presented the first published accelerator architecture for FE-QF. We showed that HW/SW-codesign based acceleration results in significant speedups compared to software-only solutions. Moreover, we demonstrated that large speedups can also be obtained in real use cases, namely, for image classifications using encrypted MNIST and Fashion MNIST datasets. We anticipate that our results will help FE to become more feasible for practical adoption, as the overheads of FE are often excessive and have presented obstacles for the practical use of FE in real applications. In general, our accelerator makes it possible to combine FE with complex systems, for example, FE-QF based on cryptographic pairings with advanced ML models.
We implemented the FE-QF scheme from [DGP18] but also other published FE-QF schemes (e.g., [BCFG17,Gay20,Wee20]) have very similar decryption routines consisting of pairings and discrete logarithms. Consequently, our implementation could be relatively easily adapted to such schemes, too. The discrete logarithm algorithm and its implementation can be used also for other cryptographic schemes including certain FE-IP schemes (e.g., [ABDP15, ALS16, ACF + 18]). Such adaptations are topics for future research.
programmable SoCs, including other FPGA vendors (e.g., Intel) and soft-core CPUs (e.g., RISC V, NIOS II, Microblaze). In the current HW/SW codesign system instantiated on the Xilinx Zynq UltraScale+ MPSoC, we applied a 128-bit high-performance interface (i.e., AXI HP) and several 32-bit general-purpose (i.e., AXI GP) interfaces between the SW and HW domains for communicating data/microprogram and command/status packets, respectively. Therefore, the main modifications would be required in the HW/SW interfaces because HDL codes for the HW domain (i.e., Verilog source codes) and C codes for the SW domain are mostly generic. More precisely, the current implementation uses 128-bit AXI HP and 32-bit AXI GP interfaces (of Xilinx SoC platforms), but other SoCs may require different interfaces with the same or different sizes. It should be considered that in the targeted SoC platform, the HW side (i.e., FPGA) contains enough logic cells and block memories for implementing the parallel CP cores (i.e., in this work, 16 CP cores) and global/local memories, respectively. Moreover, the SW domain must be interfaced with an external RAM memory (i.e., off/on-chip DDR memory) with at least 2GB space. If enough resources are not available, then fewer parallel CP cores must be used or the size of the precomputed table for discrete logarithms must be decreased, which may have significant effects on the performance compared to the numbers reported in this paper. Naturally, also the overall performance of the FPGA and the processor cores of the target SoC affect the performance.

B Implementation Details
Structure of the Arithmetic Unit [BJ20a]
Fig. 8(a) and 8(b) describe the structure of the MMMB for computing $\mathbb{F}_p$ multiplications/squarings, and Fig. 8(c) depicts the structure of the MASB for computing $\mathbb{F}_p$ additions/subtractions. The MMMB contains three nested parts which are organized bottom-up as a Multiply-Add-Add Block (MAAB), a Multiply-Add-Add-Accumulator Block (MAAAB), and the overall structure of the MMMB. The MAAB is the primary computation block in the datapath and consists of a 64 × 64-bit Karatsuba multiplier (constructed from three parallel 32 × 32-bit multipliers) combined with adders to compute $a \times b + c + d$ (all 64-bit values) in a five-stage pipeline. The MAAB consumes most of the FPGA resources, has the highest dynamic power consumption, and also contains the critical path of the CP core. In order to maximize its efficiency, it is implemented using the DSP slices.
In the next part, the MAAB is complemented with an accumulation operation (i.e., the MAAAB). The lower part of the MAAB result is accumulated with the previous higher part as well as with the previous most significant bit of the accumulation result (i.e., the input carry). The output carry and the higher part of the MAAB result are stored for the next accumulation (see Fig. 8(b)). The latencies for computing $r_{low}$ and $r_{high}$ are five and six clock cycles, respectively. This accumulation method and the one clock cycle difference between $r_{low}$ and $r_{high}$ are essential for an efficient implementation of the high-radix Montgomery modular multiplication algorithm.
Finally, in the top part, the MAAAB (as the main computing core) as well as multiplexers, registers, and FSMs are used for implementing radix-$2^{64}$ Montgomery modular multiplication. The MMMB computes a multiplication/squaring in $\mathbb{F}_p$ with a total latency of 43 clock cycles, but a new multiplication/squaring can be started already after 38 clock cycles due to the pipelined scheme.
Furthermore, the structure of the MASB with a two-stage pipeline is illustrated in Fig. 8(c). Addition and subtraction in $\mathbb{F}_p$ can be realized by two consecutive adder/subtractor circuits which produce the result in two cycles. Due to the pipeline, its throughput is one $\mathbb{F}_p$ addition/subtraction per cycle. Applying two MASBs and connecting the outputs of the datapath back to its inputs facilitates efficient field arithmetic operations such as $\mathbb{F}_{p^2}$ addition/subtraction/negation, $\mathbb{F}_{p^2}$ multiplication/squaring, and multiplications by small constants.
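For readers unfamiliar with high-radix Montgomery multiplication, the following word-serial sketch shows the arithmetic that a radix-$2^{64}$ multiplier evaluates. It is a textbook software formulation, not a model of the MMMB pipeline; the modulus $2^{255}-19$ is merely a convenient odd 4-word stand-in for the BN prime, and the function names are hypothetical.

```python
W = 64                       # radix 2^64
S = 4                        # number of 64-bit words (assumes a modulus below 2^256)
P = 2**255 - 19              # stand-in odd modulus (assumption), not the paper's BN prime
R = 1 << (W * S)             # Montgomery constant R = 2^(W*S)

def mont_mul(a, b, p=P, s=S, w=W):
    """Return a * b * R^(-1) mod p for 0 <= a, b < p, processing a word by word."""
    mask = (1 << w) - 1
    p_neg_inv = (-pow(p, -1, 1 << w)) & mask   # -p^(-1) mod 2^w (Python 3.8+)
    t = 0
    for i in range(s):
        a_i = (a >> (w * i)) & mask            # next 64-bit word of a
        t += a_i * b                           # multiply-accumulate
        m = ((t & mask) * p_neg_inv) & mask    # makes t + m*p divisible by 2^w
        t = (t + m * p) >> w                   # exact division by the radix
    return t - p if t >= p else t              # single conditional subtraction

def to_mont(a, p=P):
    return a * R % p

def from_mont(a_m, p=P):
    return mont_mul(a_m, 1, p)
```

Each loop iteration consists of multiply-add steps on 64-bit words followed by an accumulation, which is the kind of operation the MAAB and MAAAB are built to execute.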

Scheduling of the Tower Field Arithmetic [BJ20a]
The timing diagram and scheduling technique of the tower extension field arithmetic are depicted in Fig. 9. On the top, it shows how three parallel $\mathbb{F}_p$ multiplications/squarings and several parallel additions/subtractions can be computed simultaneously in the datapath by utilizing parallelism and pipelining. In the middle, it depicts how this scheduling allows computing $\mathbb{F}_{p^2}$ multiplications/squarings effectively in 38 clock cycles and, also, that up to eleven $\mathbb{F}_{p^2}$ additions/subtractions can be done during each $\mathbb{F}_{p^2}$ multiplication/squaring. The $\mathbb{F}_p$ and $\mathbb{F}_{p^2}$ operations are further used for implementing $\mathbb{F}_{p^4}$, $\mathbb{F}_{p^6}$, and $\mathbb{F}_{p^{12}}$ arithmetic. An inversion in $\mathbb{F}_{p^{12}}$ is decomposed into several additions/subtractions, multiplications/squarings, and a single inversion in $\mathbb{F}_p$. We compute the inversion in $\mathbb{F}_p$ with Fermat's Little Theorem through modular exponentiation. We compute this $\mathbb{F}_p$ exponentiation and also the $\mathbb{F}_{p^{12}}$ exponentiation using the right-to-left square-and-multiply algorithm because it allows computing multiplications in parallel with squarings and results in a more efficient implementation.
Figure 9: Timing diagram and scheduling technique of the tower extension field arithmetic operations (i.e., $\mathbb{F}_p$, $\mathbb{F}_{p^2}$, $\mathbb{F}_{p^4}$, $\mathbb{F}_{p^6}$, and $\mathbb{F}_{p^{12}}$ arithmetic) in our datapath [BJ20a].
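How an extension field inversion reduces to a single base field inversion can be illustrated on the smallest level of the tower. The sketch below inverts an element of $\mathbb{F}_{p^2}$ with one $\mathbb{F}_p$ inversion computed via Fermat's Little Theorem. It assumes a toy prime with $p \equiv 3 \pmod 4$ (the Mersenne prime $2^{127}-1$) and the representation $\mathbb{F}_p[u]/(u^2+1)$; the names are hypothetical and the full $\mathbb{F}_{p^{12}}$ decomposition is not shown.

```python
P = 2**127 - 1   # toy prime with P % 4 == 3, so u^2 = -1 defines F_{p^2} (assumption)

def fp_inv(a, p=P):
    """Inversion in F_p via Fermat's Little Theorem: a^(p-2) mod p."""
    return pow(a, p - 2, p)

def fp2_mul(x, y, p=P):
    """(a0 + a1*u) * (b0 + b1*u) with u^2 = -1."""
    a0, a1 = x
    b0, b1 = y
    return ((a0 * b0 - a1 * b1) % p, (a0 * b1 + a1 * b0) % p)

def fp2_inv(x, p=P):
    """(a0 + a1*u)^(-1) = (a0 - a1*u) / (a0^2 + a1^2), using one F_p inversion."""
    a0, a1 = x
    norm_inv = fp_inv((a0 * a0 + a1 * a1) % p, p)   # the only inversion
    return (a0 * norm_inv % p, (-a1) * norm_inv % p)
```

The same pattern, multiplying by a conjugate-like element so that the denominator falls into a smaller field, can be applied level by level until only an $\mathbb{F}_p$ inversion remains.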