SHAPER: A General Architecture for Privacy-Preserving Primitives in Secure Machine Learning

Abstract. Secure multi-party computation and homomorphic encryption are two primary security primitives in privacy-preserving machine learning, whose wide adoption is, nevertheless, constrained by computation and network communication overheads. This paper proposes a hybrid Secret-sharing and Homomorphic encryption Architecture for Privacy-presERving machine learning (SHAPER). SHAPER protects sensitive data in encrypted or randomly shared domains instead of relying on a trusted third party. The proposed algorithm-protocol-hardware co-design methodology explores techniques such as plaintext Single Instruction Multiple Data (SIMD) and fine-grained scheduling to minimize end-to-end latency in various network settings. SHAPER also supports secure-domain computing acceleration and the conversion between mainstream privacy-preserving primitives, making it ready for general and distinctive data characteristics. SHAPER is evaluated by FPGA prototyping with a comprehensive hyper-parameter exploration, demonstrating a 94× speed-up over CPU clusters on large-scale logistic regression training tasks.


Introduction
Cross-agency data collaboration maximizes the accuracy of machine learning (ML) models. Nonetheless, from the perspective of user privacy and business interests, concerns about data privacy and security arise [ARC19]. In practice, ML cannot be applied directly to health or financial data for competitive and regulatory reasons. These sensitive data sets are isolated by different parties, which is also known as the "isolated data island" problem. To solve this problem, privacy-preserving machine learning (PPML) [XBJ21] allows participants to collaborate on training and inference procedures by applying privacy-preserving computing techniques, e.g., multi-party computation (MPC) [Yao82], homomorphic encryption (HE) [FV12], and trusted execution environments (TEE) [CD16]. These security primitives prevent the raw data, model weights, and gradient values from being revealed to any other participants. Since the algorithms and protocols of PPML heavily depend on the data characteristics, scale, ownership, and security model, debates on the technical roadmap never stop. Fig. 1 shows an example of PPML in a healthcare scenario. A hospital and a pharmaceutical company collaborate to develop a predictive model for personalized medicine while protecting patient data. The parties have access to different sensitive patient records (labels and features) and use privacy-computing techniques to jointly train the model. The computational load is divided between the two parties, with each party performing local calculations and exchanging encrypted updates. The PPML scheme prevents data security from being compromised beyond the trust barriers. On the one hand, the semi-honest parties act curiously and try to extract data privacy from each other. On the other hand, third-party adversaries can monitor the communication in the insecure network. The goal is to create an accurate model while preserving the privacy of individual patient data.
MPC covers a series of privacy-preserving techniques that support secure computation protocols on mathematically masked data. Garbled circuit (GC) is a secure two-party logical computation protocol, where the evaluation of each gate requires the transmission of a ciphertext look-up table. Secret sharing (SS) guarantees information-theoretic security by randomly sharing the raw data. However, arithmetic in the SS domain relies on intensive in-order data interaction. Even though MPC is versatile across different PPML scenarios, the network overhead hinders the further development of MPC-based PPML with complex models in real-time applications.
HE-based schemes support multiple operators on encrypted data. Fully homomorphic encryption (FHE) schemes can ideally support any multiplication level by refreshing the noise budget with bootstrapping. Nevertheless, ciphertext evaluation and bootstrapping require complex modular operations, which introduce tremendous computational overhead. Existing academic FHE accelerators are still expensive and only feasible for small-scale training and inference scenarios [SFK + 22]. On the other hand, additive HE (AHE) provides partial linear operators, except for ciphertext-to-ciphertext multiplication, with affordable overhead. However, purely AHE-based two-party schemes require a trusted party to generate and manage the secret key [HHIL + 17].
There are gaps between research and practice when it comes to PPML applications in the real world. In the real world, three or more parties usually involve more commercial interests and regulations, so two-party PPML is the most common use case. In addition, the features are usually sparse in practice, such as in intelligent risk control or wire fraud detection, because of feature engineering such as one-hot encoding [CZW + 21]. The performance of PPML must also be considered: a task that takes a few minutes in non-private ML takes several hours when converted to PPML. Recent works show that hybrid SS-AHE solutions achieve a 130× speedup with practical datasets and network bandwidth [CZW + 21, FZT + 21], compared with fully MPC-based PPML schemes. The key insight is to prevent the characteristics of the training set, e.g., sparsity, from being masked in the SS domain or encrypted in the AHE domain. This is achieved by keeping the samples as plaintext within their owner and only transferring small-size intermediate values in the HE domain. Participants can evaluate the layer functions with sparse operations and share the result in the SS domain, rather than revealing any sensitive values to one participant or a third party.
However, no existing hybrid PPML work tackles the challenges of co-designing the implementation and optimization of PPML protocols. New architectural considerations and methodologies are required when the complex HE algorithm is combined with the SS protocol in end-to-end PPML solutions. We observe that the computation and communication complexity, which are the respective bottlenecks of the two primitives, can complement each other. The standalone SS and HE approaches present a highly polarized communication-computation ratio, for which latency hiding between the data transfer and execution units provides little return. We optimize the hybrid approach based on the intuition that a well-balanced and parallel communication-computation flow can ideally reduce latency by 50%. This observation also makes the architecture less sensitive to network bandwidth, which typically dominates MPC performance. At the algorithmic level, it is helpful to tune hyper-parameters, such as an overflow-free pack level, to mitigate ciphertext explosion. Since practical computational settings are also critical for PPML, a ready-to-use architecture should be suitable for Field Programmable Gate Array (FPGA) platforms.
This paper presents a general architecture that can efficiently execute SS-AHE hybrid PPML protocols on large industrial-level training datasets. The architecture can generally handle different PPML tasks using hybrid primitives. The proposed design preserves privacy in either the SS domain or the AHE domain without relying on trusted hardware manufacturers. Compared with existing software-only hybrid PPML schemes [CZW + 21, FZT + 21], the co-designed architecture takes advantage of hardware units and has more potential for optimization and acceleration. In summary, this paper makes the following contributions: • A hybrid Secret-sharing and Homomorphic encryption Architecture for Privacy-presERving ML (SHAPER) with algorithm-protocol-hardware co-optimization across CPU, hardware accelerators, and network collaboration. SHAPER's hardware design improves throughput, and its software design optimizes data flow through system scheduling for latency overlap and parallelization.
• Vectorized high-performance modular multiplication (MM) engines that improve the efficiency of encryption, decryption, and ciphertext-domain evaluation. We present new algorithmic and hardware optimizations for these Paillier operations, including a new MM algorithm, a new hardware engine, and pipelined execution.
• SHAPER shows universal performance improvement on micro-operations and reduces the end-to-end latency by 94× on large-scale logistic regression training tasks, compared with the software-only benchmarks.

Background
The background, threat model, and primitives are discussed in this section before introducing our SHAPER architecture.

Related Work
Various PPML schemes have been proposed to ensure the security of data and models with different cryptographic primitives. MPC-based PPML schemes [KVH + 21, Kel20, ZXWG22] divide ML models into fragments of circuits, and then engage multiple parties to cooperatively perform the circuit computations, including arithmetic and binary circuits, without additional privacy leakage. Afterwards, the results of these circuit fragments are collected by the parties to construct the complete result of complex computational tasks. Historically, MPC was proposed in [Yao82], which solved the "Millionaire Problem" with GC. HE provides the capability to perform operations on encrypted data to protect privacy. Unlike public-key cryptography, AHE supports not only key generation, encryption, and decryption, but also addition/multiplication over ciphertexts without private keys, thus revealing no information about the corresponding plaintexts. Due to the additions and multiplications that one can perform on the ciphertexts, HE-based PPML requires less communication than MPC-based PPML, but requires more computation for expensive HE encryption/decryption. Fig. 2 describes a typical linear function in PPML: the sparse matrix is kept by its owner, Alice, and the result is shared between the participants Alice and Bob. Since AHE protects the confidentiality of the vector y, Alice cannot recover the plaintext from the ciphertext [y]_b. On the other hand, Alice shares [z]_b in line 5, which guarantees that Bob can only learn a masked result z_b. Recent works on the hybrid SS-AHE PPML framework [CZW + 21, FZT + 21] achieve a 130× speedup over MPC-based schemes. AHE supports additions between ciphertexts and multiplications between ciphertexts and plaintexts. However, most existing AHE algorithms, such as Paillier [Pai99], DGK [DGK07], and OU [OU98], depend on large-integer modular operations, especially modular multiplication (MM) and modular exponentiation (ME), which incur large computational overheads. Therefore, the overall performance of SS-AHE hybrid PPML is strongly dominated by the efficiency of the basic modular multiplications. Montgomery modular multiplication [Mon85] is the most classical method, while new modular algorithms have also been proposed recently [LC21].

Paillier Cryptosystem
We choose Paillier as the AHE example in our architecture. The Paillier cryptosystem consists of the following interfaces.
As suggested in [DJ01, CGHGN01], we choose primes (p, q) which satisfy p ≡ q ≡ 3 (mod 4) and gcd(p − 1, q − 1) = 2, and set g = n + 1, so that g^m can be simplified as g^m = (1 + n)^m ≡ 1 + mn (mod n^2), and we have µ = λ^(−1). The key generation randomly selects x ← Z*_n and adds h_s = −x^(2n) mod n^2 to the public key. Then the encryption is modified as c = (mn + 1) · h_s^a mod n^2, where a is a short random exponent. The optimization has two advantages. First, the exponentiation g^m is simplified to the multiplication mn. Second, since a is much shorter than n, it is cheaper to compute h_s^a than r^n.
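The DJN-style scheme above can be sketched end to end with toy parameters (a hedged illustration only: the concrete values of p, q, x, and a below are placeholders, and real keys are thousands of bits, as used elsewhere in this paper):

```python
# Toy DJN-optimized Paillier sketch. Parameter names follow the text;
# the key size is illustrative only (SHAPER uses |n| = 3072 bits).
p, q = 7, 11                     # satisfy p = q = 3 (mod 4), gcd(p-1, q-1) = 2
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // 2     # lambda; mu = lambda^-1 mod n
x = 3                            # a random element of Z_n* (placeholder)
h_s = (n2 - pow(x, 2 * n, n2)) % n2   # h_s = -x^(2n) mod n^2

def enc(m, a):
    # g^m collapses to 1 + m*n because g = n + 1, so no big exponentiation here;
    # a is the short random exponent
    return (1 + m * n) * pow(h_s, a, n2) % n2

def dec(c):
    d = pow(c, lam, n2)          # d = 1 + lambda*m*n (mod n^2)
    return (d - 1) // n * pow(lam, -1, n) % n

c = enc(42, a=5)
assert dec(c) == 42
# Additive homomorphism: Enc(m1)*Enc(m2) decrypts to m1 + m2 (mod n)
assert dec(enc(20, 4) * enc(15, 9) % n2) == 35
```

The additive property in the last line is what hybrid SS-AHE schemes exploit: ciphertexts are combined by the party that does not hold the private key.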

Additive Secret Sharing
A value additively shared by two parties is written [[x]] = (x_1, x_2), where x_1 + x_2 = x over the field F, and (x_1, x_2) are random. Addition over additive shares is almost free, as [[x + y]] = (x_1 + y_1, x_2 + y_2). Multiplication over shares is trickier. A common approach is to use Beaver triples [Bea92]. A Beaver triple is three shared random values ([[a]], [[b]], [[c]]) with c = ab.


SS-AHE Library
MPC-based PPML schemes require a large number of Beaver triples because each multiplication consumes a triple. Beaver triples can be generated in batches using Paillier [DSZ15, P + 13].
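A minimal two-party sketch of additive sharing and one Beaver multiplication (for simplicity a trusted dealer samples the triple here; in practice the triples would be generated in batches with Paillier as stated above, and each triple is single-use):

```python
import random

MOD = 1 << 64                     # additive shares over Z_{2^64}

def share(x):
    r = random.randrange(MOD)
    return r, (x - r) % MOD       # (party-0 share, party-1 share)

def reveal(s):
    return sum(s) % MOD

# Dealer samples a Beaver triple (a, b, c) with c = a*b and shares each value
a, b = random.randrange(MOD), random.randrange(MOD)
a_sh, b_sh, c_sh = share(a), share(b), share(a * b % MOD)

def beaver_mul(x_sh, y_sh):
    # Each party locally masks its shares; e = x - a and f = y - b are opened
    e = reveal([(x - aa) % MOD for x, aa in zip(x_sh, a_sh)])
    f = reveal([(y - bb) % MOD for y, bb in zip(y_sh, b_sh)])
    z = [(cc + e * bb + f * aa) % MOD for cc, aa, bb in zip(c_sh, a_sh, b_sh)]
    z[0] = (z[0] + e * f) % MOD   # the public e*f term is added by one party
    return z                      # shares of x*y

x_sh, y_sh = share(6), share(7)
assert reveal(beaver_mul(x_sh, y_sh)) == 42
```

Correctness follows from c + e·b + f·a + e·f = ab + (x−a)b + (y−b)a + (x−a)(y−b) = xy; only the masked values e and f cross the network.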

Threat Model
As a co-designed architecture, the threat model of SHAPER takes into account cross-layer assumptions.
At the protocol level, the adversary model follows the semi-honest assumption in a 2-party setting, as SHAPER mainly focuses on implementing and accelerating existing semi-honest schemes [FZT + 21, CZW + 21]. In the semi-honest model, a probabilistic polynomial-time adversary can corrupt and control one of the parties and try to learn more information about the honest party's input, such as recovering the secret messages sealed in the ciphertexts or shares. Meanwhile, the adversary is required to follow the protocol specification honestly. The semi-honest setting is adopted by most existing PPML models, such as [MZ17, MR18].
At the algorithm level, including SS and AHE, security is quantified by a security parameter, which defines the hardness of the problem the adversary attempts to break. The parameter is positively related to the key lengths. A 2048/3072-bit Paillier cryptosystem corresponds to a 112/128-bit computational security parameter.

Architecture Design
To accelerate the hybrid SS-AHE framework, SHAPER proposes an instruction set and explores efficient design methodologies of AHE, SS, and conversion functions.

Architecture Overview
An overview of our proposed SHAPER architecture is shown in Fig. 3, which includes both software and hardware components. The host application controls the start and convergence conditions of the training tasks, and also negotiates the hyper-parameters between the participants, such as the optimal plaintext packing level and the pre-computation window size. The on-chip hardware modules aim at fast computation of the basic primitives, mainly including the AHE and SS function units. Since AHE computation is still a performance bottleneck of hybrid PPML schemes [CZW + 21], we design new MM algorithms and hardware engines in the AHE units, implement algorithmic optimizations in hardware, and improve the scheduling modules to achieve better acceleration.
SHAPER focuses on 2-party PPML, which is the most common case in industrial applications. The application calls the SS-AHE library, which supports execution flows encapsulated as kernel functions. The kernel functions update algorithm parameters and architecture flags by setting control and status registers (CSRs), and implement the security primitives with the customized instructions summarized in Table 1. SHAPER analyzes the control-flow dependencies and packs the instructions in VLIW style, ensuring that the packed instructions in a VLIW instruction can be executed in parallel. SHAPER adopts a static scheduling scheme. Each VLIW instruction packs RISC instructions which decode and issue synchronously. Since the instructions are executed sequentially and deterministically, memory allocation is scheduled statically. To communicate with other participants, all network interaction is handled by a network interface card (NIC). The host application waits for interrupts from the NIC and SHAPER. Since the runtime and driver layers are common components in HW/SW co-design, we omit them in Table 1 for brevity.
On the SHAPER hardware, the parser unpacks the instructions and dispatches them to the appropriate function units. AHE.init reloads data from device memory during the offline phase when the host updates its key pair. Other AHE instructions consist of a series of MM operations handled by the AHE controller. SS.gen returns a vector of random shares sampled from the cryptographically secure pseudorandom number generator (CSPRNG). Int.add and Int.mul perform a series of integer arithmetic operations in a continuous address space. The memory hierarchy consists of the on-board device memory and the on-chip scratchpad, managed by DM.ld/st and SPM.ld/st.

Algorithm-Protocol Co-Optimization
Fig. 4 describes the methodology for analyzing and exploring PPML solutions. We map a task to a coordinate point according to its computational and communication overhead. The network bandwidth is represented as a dotted guideline; points on it have the same communication and computation latency. The schemes above the guideline (e.g., SecureML [MZ17]) are communication dominated. On the other hand, the communication-less FHE solutions (e.g., CraterLake [SFK + 22]) spend most of their time on ciphertext evaluation. The position of SS-AHE-based solutions depends on computational power, especially the performance of the cryptographic engines. In our work, the following optimizations are applied to explore an optimal solution.

Data Characteristic
In real-world scenes, the training dataset is sparse due to incomplete user information and one-hot encoding [CZW + 21]. Since SS-AHE schemes preserve the data sparsity, the number of instructions is significantly reduced.

Plaintext Packing
Packing multiple short plaintexts into one ciphertext greatly reduces the number of ciphertexts and allows SIMD-style computation [P + 13], as explained in Sec. 3.5. The packing strategy reduces the communication overhead for transmission and the computational overhead for decryption, at the expense of additional homomorphic computation over ciphertexts.

Latency Hiding
Since the SS-AHE schemes have balanced overheads, overlapping computation and communication brings more benefits. Fig. 5 shows the pipelined execution process of SHAPER, corresponding to lines 1 to 3 in Fig. 2. AHE encryption is the most time-consuming operation in the example and can hide other delays. Once the first encryption is complete, the second encryption and the transmission of the first ciphertext are performed in parallel in a pipelined flow. In this case, the computation instructions overlap the communication delay. With multi-buffer transfer, SHAPER consumes data as soon as the source data is created.
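The benefit of this overlap can be illustrated with a back-of-envelope latency model (a sketch with made-up per-item times, not SHAPER's measured numbers): with N items, per-item encryption time t_enc, and per-item transfer time t_send, the serial cost is N·(t_enc + t_send), while the pipelined cost collapses to roughly N·t_enc + t_send when encryption dominates.

```python
def serial_latency(n, t_enc, t_send):
    # No overlap: each item is encrypted, then sent, before the next starts
    return n * (t_enc + t_send)

def pipelined_latency(n, t_enc, t_send):
    # Item i starts transmitting as soon as it is encrypted, while item i+1
    # is already being encrypted (the multi-buffer flow in Fig. 5)
    finish = 0.0      # time the current encryption completes
    link_free = 0.0   # time the network link becomes free
    for _ in range(n):
        finish += t_enc
        link_free = max(link_free, finish) + t_send
    return max(finish, link_free)

# Encryption-dominated example: 4 items, t_enc = 3, t_send = 1
assert serial_latency(4, 3.0, 1.0) == 16.0
assert pipelined_latency(4, 3.0, 1.0) == 13.0   # only the last transfer is exposed
```

The model also shows why the hybrid approach is less bandwidth-sensitive: as long as t_send stays below t_enc, the transfers are fully hidden.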

Efficient AHE Function Units
The AHE unit of SHAPER consists of a Paillier controller and several MM engines. The controller manipulates the MM engines to compute the functions of the Paillier cryptosystem with key length |n| = 3072 in parallel. Each MM engine implements our proposed fast MM algorithm, which supports a 5-stage pipeline. To accelerate the modular exponentiation (ME) in Paillier encryption, a set of Ultra-RAMs (URAMs) and Block-RAMs (BRAMs) is deployed to store the public/private keys of the device, as well as some pre-computed values.
When executing an AHE instruction, the controller divides it into multiple multiplications and exponentiations based on the DJN optimizations of Paillier [CGHGN01]. Several optimizations suggested in [DSZ15] are considered, including the Chinese Remainder Theorem (CRT) optimization and fixed-base pre-computation (see Appendix B), which scale down both the base size and the exponent size of the ME. The call to a single ME is divided into multiple multiplications in SHAPER, and the controller then schedules the datapath between different MM engines to compute the ME collaboratively.
The performance of the MM engines has a large impact on the efficiency of the various AHE interfaces and higher-level applications. Therefore, we propose an efficient MM construction with optimizations in both the algorithm and the hardware implementation.

The MM Algorithm
Our proposed MM algorithm is inspired by the shift-sub algorithm in [LC21] (see Appendix A), which has the advantage of dealing with large integers. However, the single-bit shift-sub in [LC21] processes a single bit of b in each iteration, requiring one serial full addition per bit of b; this results in long data paths and does not scale well in hardware as the operand size grows, since it leads to too many cycles for significantly large a and b. A more efficient MM algorithm is needed to speed up Paillier in hardware. To avoid multiple serial additions of large integers, we propose a high-radix shift-sub MM algorithm, described in Alg. 1, which deals with k bits of b in a single iteration, where k is the radix width.
Our high-radix MM algorithm processes k bits of b in each round and has τ rounds in total. Each round consists of a Multiply-Accumulate phase (Phase_c) and a Shift-Reduce phase (Phase_a). In the i-th round, Phase_c multiplies the i-th piece of b by the current round's a and adds the product to the accumulation of the previous rounds. In the first τ − 1 rounds, Phase_a updates the next round's a from the current round's a: a is reduced modulo m after shifting k bits to the left. In the final round, Phase_a reduces the accumulation of Phase_c modulo m to obtain the final result. The correctness of the algorithm follows from a · b mod m = (Σ_{i=0}^{τ−1} b_i · (2^{ik} · a mod m)) mod m, where b_i is the i-th k-bit piece of b. Note that, except for the final round, there is no data dependency between Phase_c and Phase_a. Therefore, Phase_c and Phase_a can be executed in parallel to reduce latency. After the latencies of multiplication and addition in Phase_c are hidden by parallelization, the total execution time of one MM is reduced by more than 30%.
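The round structure can be expressed as a short functional model (a sketch only: radix k = 8 is chosen here for readability, whereas the hardware uses a much wider radix, and Python's `%` stands in for the QR reduction unit described next):

```python
def mm_high_radix(a, b, m, k=8):
    # High-radix shift-sub modular multiplication: each round consumes k bits
    # of b (Phase_c) while a is shifted k bits left and reduced for the next
    # round (Phase_a). The two phases share no data within a round, so the
    # hardware runs them in parallel.
    mask = (1 << k) - 1
    c = 0
    while b:
        c += (b & mask) * a      # Phase_c: multiply-accumulate k bits of b
        a = (a << k) % m         # Phase_a: shift-reduce a for the next round
        b >>= k
    return c % m                 # final-round reduction of the accumulator

assert mm_high_radix(123456789, 987654321, 1000003) == \
       (123456789 * 987654321) % 1000003
```

With k = 1 this degenerates to the single-bit shift-sub of [LC21]; raising k divides the round count τ by k at the cost of wider multipliers, which is exactly the trade-off explored in the hardware subsection.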
Since the modular reductions of Phase_a have additional length constraints, we propose a quick modular reduction algorithm, QR, in Alg. 2. We note that the inputs of the modular reduction have upper bounds: each round's shifted a is less than m · 2^k, and the final round's c is less than τ · m · 2^k. Therefore, the QR algorithm limits the length of the dividend to no more than (l + ∆), where l is the length of the modulus m. The ∆ is set to k in the first τ − 1 rounds and to k + log τ in the final round. The radix k has a large impact on the total number of rounds, as well as on the efficiency and resource consumption of the hardware implementations in SHAPER. Different choices of k are discussed in the next subsection on the hardware implementation.
We propose and adopt a new strategy that uses a Most-Significant-Bits (MSB) approximation to simplify the reduction. Unlike the remainder (i.e., the output of the modular reduction), the quotient of a division is mainly determined by the MSBs of the dividend and divisor, while the lower bits contribute little to the quotient. Therefore, the algorithm approximates a quotient γ from the MSBs of a and m, computes an approximate remainder a − γm, and then corrects it to the final result with a conditional subtraction. The error between the approximated and the precise quotient is proven to be within 1 if we use the most significant ∆ + 2 bits of m for the approximation, as in Eq. (4, 5). (Note that ∀x, y ∈ R, if |x − y| ≤ 1, then |⌊x⌋ − ⌊y⌋| ≤ 1.) Furthermore, the existing Barrett approximation [Bar86], which converts the division of a by m into a multiplication of a by a precomputed approximation of the reciprocal of m, is further used to simplify the quotient computation; this involves a deviation of no more than 1, as Eq. (6, 7) shows. Therefore, at most two conditional subtractions are needed in the algorithm.
In fact, QR computes b = a − (γ + 1)m instead of b = a − γm. This small change accommodates the hardware implementation. Additions over large integers are not cheap in hardware, and a direct computation of b = a − γm results in two additions in the worst case. When switching to (γ + 1)m, the hardware engine determines whether to compute b − m or b + m based on the sign bit of b. Computing γ + 1 is cheap because γ is short, and this small cost reduces the number of large-integer additions here from 2 to 1. The optimization effectively reduces the consumption of on-board resources in the hardware implementation.
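The quotient-approximation idea, including the (γ + 1) trick, can be sketched as follows. Note this is a hedged software model: it uses the classic Barrett precomputed reciprocal µ = ⌊2^{2l}/m⌋ for the approximate quotient rather than the paper's exact ∆ + 2-bit MSB truncation, but the correction structure is the same.

```python
def qr_barrett(a, m):
    # Quick reduction sketch: approximate the quotient from a precomputed
    # reciprocal (Barrett-style), subtract (gamma + 1)*m in one step, then
    # correct with conditional +/- m, mirroring the CS module.
    l = m.bit_length()
    mu = (1 << (2 * l)) // m       # precomputed once per modulus
    gamma = (a * mu) >> (2 * l)    # approximate quotient, off by at most 2
    b = a - (gamma + 1) * m        # the (gamma + 1) trick: one big subtraction
    if b < 0:
        b += m                     # conditional +m on negative sign bit
    while b >= m:
        b -= m                     # at most two conditional -m corrections
    return b

m = (1 << 61) - 1
for a in [0, m - 1, m, 5 * m + 123, m * m - 1]:
    assert qr_barrett(a, m) == a % m
```

The sign test on b replaces a second full-width comparison, which is the resource saving the paragraph above describes. The approximation is valid while the dividend stays within the length bound (here a < m²), matching QR's (l + ∆) constraint.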

The MM Engine
The MM engine (shown in Fig. 6) is the fundamental processing element that supports our MM algorithm. It can be divided into two modules, one for Phase_c and one for the quick reduction in Phase_a. Phase_c contains a block multiplication (BM) module and an adder module for accumulation, consisting of a carry-save adder and a ripple-carry adder. Phase_a contains a multiplier to compute γ, a BM module for −(γ + 1)m, an adder for a − (γ + 1)m, and a Conditional Subtraction (CS) module for correction. A CS module contains an adder that performs either −m or +m. Note that the subtraction −m is replaced by +m_n, where m_n is the complement of m. Phase_a and Phase_c are updated in parallel to compress the total number of clock cycles. The two phases need no data exchange, except that Phase_c fetches a at the beginning of each round. To run an integrated MM, a controller schedules the input/output of the MM engine in each round.
We implement the BM modules with multiple smaller-scale multipliers. Specifically, a k × 3072 multiplication is divided into k × k parts, whose sub-products are combined into two large integers. The k × k multipliers are implemented using on-board digital signal processing (DSP) units.
When exploring the design space of on-board resources and hardware clock cycles, we choose radix k = 72. A single DSP slice supports 27 × 18 multiplication. For example, implementing a 64 × 64 multiplication requires 12 DSP units, but the DSP utilization rate is only about 70%, since 12 DSP units theoretically allow an 81 × 72 multiplication. The utilization rate peaks at 100% when the multiplier width equals 54 or 108. At 54, a multiplier needs only 6 DSPs, but the round number τ rises to 57 for a 3072 × 3072 multiplication, leading to more clock cycles in the MM engine. In contrast, in the case of 108, there are only 29 rounds, but one multiplier requires 24 DSPs, making the MM engine too large. A larger MM engine consumes more resources (especially DSPs), so fewer MM engines can be placed in the FPGA implementation, which reduces the overall hardware parallelism. In addition, a larger engine has longer data paths, further reducing the frequency of the hardware implementation.
Considering the trade-off between time and space, we choose the case of the 78-bit multiplier, which requires 40 rounds of multiplication and 12-DSP multipliers. Although the DSP utilization rate cannot reach 100%, it is still higher than 90%. For hardware compatibility in the final round, the implementation should support a (k + log τ)-bit quick reduction, as shown in Fig. 6. Therefore, the radix k is finally fixed at 72.
Another bottleneck in the design of an MM engine is the 3072-bit addition, because it introduces large logic delays that limit the hardware frequency. In our design, we optimize the serial adders with a prediction strategy, splitting the two addends into multiple 128-bit chunks. Each chunk (x, y) uses two 128-bit ripple-carry adders to compute the two potential sums, x + y and x + y + 1. Using the carry bit propagated from the lower chunk, a multiplexer selects one of the sums and propagates the corresponding carry bit to the higher chunk. Since x + y + 1 ≤ 2 × (2^128 − 1) + 1 = 2^129 − 1, the propagation will not lead to growth of the carry, and there is at most one carry bit during the propagation, which guarantees the correctness of the strategy.
We set the chunk size to 128, taking into account resource consumption and maximum frequency. When the chunks are large, the logic delay within each chunk remains large, and the frequency cannot be improved efficiently. When the chunks are particularly small, although the resource consumption of each chunk is reduced, the logic delay outside the chunks, needed to merge the sub-sums into the final output, increases, which also decreases the frequency. Therefore, we set the chunk size to 128 to reach the peak frequency.
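The carry-select prediction strategy above can be modeled functionally (a software sketch only; in hardware the two candidate sums of every chunk are computed simultaneously, and only the multiplexer selection ripples):

```python
CHUNK = 128
MASK = (1 << CHUNK) - 1

def chunked_add(x, y, width=3072):
    # Carry-select addition: each 128-bit chunk precomputes both possible
    # sums (x+y and x+y+1); the incoming carry merely picks one, so the
    # long ripple path is broken into short segments.
    result, carry = 0, 0
    for i in range(width // CHUNK):
        xc = (x >> (i * CHUNK)) & MASK
        yc = (y >> (i * CHUNK)) & MASK
        s0, s1 = xc + yc, xc + yc + 1   # both candidates (parallel in hardware)
        s = s1 if carry else s0         # multiplexer selects on incoming carry
        carry = s >> CHUNK              # at most one outgoing carry bit
        result |= (s & MASK) << (i * CHUNK)
    return result

a = (1 << 3000) - 12345
b = (1 << 2999) + 67890
assert chunked_add(a, b) == a + b
```

The `carry = s >> CHUNK` line reflects the bound 2 × (2^128 − 1) + 1 = 2^129 − 1: a chunk can emit at most one carry bit, so the selection logic never widens.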
The acceleration of the MM engine over the MM implementation on a standard CPU is discussed in section 5.3.

Paillier Controller
A 3072-bit Paillier cryptosystem has a 3072-bit message space (|n| = 3072) and a 6144-bit ciphertext space (|n^2| = 6144). Making the hardware compatible with 6144-bit modulo operations would waste on-board resources, so our Paillier controller uses the Chinese Remainder Theorem (CRT) to convert the 6144-bit modulo operations in encryption and decryption into 3072-bit modulo operations. We follow the dataflow of the open-sourced Paillier implementation of [DSZ15], and transform it into a micro-instruction control flow that takes more advantage of the MM engines. CRT gives a unique solution to simultaneous linear congruences with coprime moduli. Since n^2 = p^2 q^2, CRT transforms modulo operations over n^2 into modulo operations over p^2 and q^2. The prerequisite for the CRT optimization is possession of the private key, since the private key λ can be computed as λ = (p − 1)(q − 1)/2. In public-key cryptosystems, it is assumed that the encryptor does not have the private key. However, in HE (including AHE) scenarios, the encrypting party usually has the private key. HE scenarios in hybrid PPML schemes are analogous to proxy execution, such as the matrix multiplication in Fig. 2: both encryption and decryption are handled locally by the client, and the server only handles execution over ciphertexts. Therefore, it makes sense to optimize Paillier encryption on the client side with CRT. Encryption. Eq. 8 shows the DJN-Paillier encryption.
To optimize hardware computation with CRT, the controller first computes the projections of c over the moduli p^2 and q^2:

c_p = c mod p^2 = [(mn mod p^2) + 1] × ((h_s mod p^2)^a mod p^2) mod p^2
c_q = c mod q^2 = [(mn mod q^2) + 1] × ((h_s mod q^2)^a mod q^2) mod q^2   (9)

(h_s mod p^2)^a mod p^2 and (h_s mod q^2)^a mod q^2 are powers of fixed bases depending on the public keys, and can be computed with precomputed tables. The details of computing ME under fixed bases with precomputation are explained in Sec. 4.2. After computing c_p and c_q, the ciphertext c can be recovered.
The value p^(−2) mod q^2 also depends on the keys, and will be precomputed during key generation. The control flow of Paillier encryption at the micro-instruction level is listed as follows; ME_P refers to modular exponentiation with precomputation. Note that the final step of encryption, c_p + t_c × p^2, necessarily requires 6144-bit computation, so the hardware sends c_p and t_c back to the host to obtain the ciphertext.
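The CRT recombination step can be sketched as follows (toy primes for illustration; `p_inv2` plays the role of the precomputed p^(−2) mod q^2, and in SHAPER the final full-width addition is the one step delegated to the host):

```python
# CRT recombination sketch: a value mod n^2 is handled mod p^2 and mod q^2
# separately, then glued back together with Garner's formula.
p, q = 7, 11
p2, q2 = p * p, q * q
n2 = p2 * q2
p_inv2 = pow(p2, -1, q2)        # p^-2 mod q^2, precomputed at key generation

def crt_combine(c_p, c_q):
    t_c = (c_q - c_p) * p_inv2 % q2
    return c_p + t_c * p2       # the only step needing full 6144-bit arithmetic

c = 123456 % n2                 # stand-in for a ciphertext residue mod n^2
assert crt_combine(c % p2, c % q2) == c
```

Since t_c < q^2, the recombined value lies in [0, n^2), so uniqueness of the CRT solution guarantees it equals the original ciphertext.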
The controller follows the same CRT strategy to calculate the intermediate value d. To continue the decryption, a naive approach is to send d_p and t_d back to the host, which recovers d, calculates (d − 1)/n, and returns it to the hardware. However, this approach lacks efficiency, since it involves a round of communication for each decryption. Therefore, we expect the controller to compute m directly from d_p and t_d.
For any legal plaintext-ciphertext pair (m, c), the correctness of Paillier holds as c^λ − 1 ≡ λmn (mod n^2). Since n = pq, we have p | d − 1, and of course d_p = d mod p^2 > 0. Suppose p is the smaller prime of (p, q); then d_p < p^2 < n. Therefore, Eq. 14 holds.
The control flow of the Paillier decryption at the micro-instruction level is listed as follows. RED reduces a 6144-bit input modulo a 3072-bit modulus, and MDIV returns the quotient x × y / z of its three inputs. Both RED and MDIV can be supported by a slightly modified MM engine. Specifically, Phase_a can independently handle RED in τ rounds, and MDIV can be implemented by collecting the corrected γ in each round of Phase_a and merging them into the final quotient.

Secret Sharing Function Units
Although computation is not the critical overhead in SS-based schemes compared to communication, we design a dedicated SS unit in SHAPER due to the following latency-related concern: in hybrid PPML schemes such as [CZW + 21, FZT + 21], the computation of AHE and SS is interleaved, which means that data must be transmitted frequently between the host and the hardware if the hardware is not capable of computing SS functions locally. Each such transfer requires reading/writing device memory, and introduces non-negligible redundant latency.
The SS unit consists of multiple integer processing engines and a CSPRNG. The CSPRNG generates the random numbers used in SS schemes, and the integer processing engines support computation over 64-bit integers, which is a common choice in SS-based schemes.
For higher random number generation throughput, we optimize the CSPRNG in the SS unit with several improvements. Existing CSPRNG constructions in PPML usually use ECB-mode AES encryption, which benefits greatly from AES-NI hardware extensions. However, recent work [XHY+20] pointed out that a SHA3-based CSPRNG outperforms AES hardware implementations, because SHA-3 takes advantage of its 1600-bit Keccak structure and fast binary executions, resulting in higher throughput per iteration. In addition to the SHA-3 Keccak engine, we use a first-in-first-out (FIFO) buffer to cache the generated random numbers: the Keccak engine dynamically generates random numbers and pushes them into the buffer whenever it is not full, and the SS units pop random numbers from the buffer as needed.

The packing strategy packs different ciphertexts into one. For example, if the server normally computes x_1y_1 and x_2y_2 over ciphertexts, two ciphertexts c_{x_1y_1} and c_{x_2y_2} are sent back to the client, which decrypts them to obtain the plaintexts. With the packing strategy, the server instead computes the ciphertext of x_1y_1 + x_2y_2 × 2^i; the client decrypts this single ciphertext and truncates the plaintext to recover x_1y_1 and x_2y_2, reducing the number of decryptions from 2 to 1. The packing strategy works only if x_1y_1 < 2^i, otherwise an overflow occurs and the MSBs of x_1y_1 mix with the LSBs of x_2y_2. In addition, x_1y_1 + x_2y_2 × 2^i must stay within the message space.
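The two-value packing example can be modeled in the plaintext domain as follows (in the actual protocol the server forms the packed value homomorphically over ciphertexts; the shift of 128 bits is an illustrative choice of i):

```python
SHIFT = 128  # the offset i; must satisfy x1*y1 < 2^SHIFT

def pack(v1: int, v2: int) -> int:
    """Pack two products into one plaintext: v1 + v2 * 2^SHIFT."""
    assert 0 <= v1 < 1 << SHIFT, "v1 would overflow into v2's bits"
    return v1 + (v2 << SHIFT)

def unpack(packed: int) -> tuple[int, int]:
    """Truncate to recover both values after a single decryption."""
    return packed & ((1 << SHIFT) - 1), packed >> SHIFT

x1y1, x2y2 = 12345 * 678, 999 * 321
assert unpack(pack(x1y1, x2y2)) == (x1y1, x2y2)
```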

Conversions between Primitives with Packing
The Paillier message space (3072 bits) is much larger than the values in the models (within 64 bits), and decryption in Paillier costs much more than encryption. Therefore, the space can be divided into smaller buckets, with a smaller value in each bucket. The computation results in each bucket do not interfere with the others as long as the bucket size is large enough to avoid overflow in the subsequent computation. Since there are fewer ciphertexts to send and decrypt, both communication and computation are greatly reduced.
The optimal bucket size depends on the computation performed on encrypted values in different protocols. In hybrid schemes, the optimal bucket size is roughly (m + 1)l + log(a + 1) + σ bits to ensure that no bucket overflows [P+13], where l is the share length (usually 64), m and a are the numbers of multiplications with plaintext shares and additions with other shares, respectively, and σ is the statistical security parameter of the scheme (40 by default). For example, AHE handles matrix multiplication in [CZW+21], where one multiplication and multiple additions are processed with each ciphertext; the bucket size can be set to 180 by default, so that each ciphertext contains about 17 buckets. The 180-bit bucket size remains valid unless the number of additions in the matrix multiplication exceeds 2^12. The decryption overhead is thus reduced at the expense of more ciphertext additions and plaintext multiplications for packing.
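The bucket-size formula can be evaluated directly; the function name is ours, and the defaults follow the text (l = 64, σ = 40):

```python
from math import ceil, log2

def bucket_bits(m: int, a: int, l: int = 64, sigma: int = 40) -> int:
    """Bucket size (m+1)*l + log2(a+1) + sigma, rounded up to whole bits."""
    return (m + 1) * l + ceil(log2(a + 1)) + sigma

# One plaintext multiplication and up to 2^12 - 1 additions per bucket
# yields the default 180-bit bucket:
assert bucket_bits(m=1, a=2**12 - 1) == 180

# About 17 such buckets fit in one 3072-bit Paillier plaintext:
assert 3072 // bucket_bits(1, 2**12 - 1) == 17
```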

Performance Optimizations and Security Enhancement
Several optimizations are applied to our implementation to improve performance and FPGA resource efficiency.

Parallel Modular Operations
We observe that real-world PPML scenarios [CZW+21, FZT+21] involve a large number of matrix computations, consisting of multiple AHE operations without data dependency. Meanwhile, a single Paillier encryption can also benefit from parallelization, since fixed-base pre-computation is involved to improve efficiency. Therefore, we provide support for vectorized modular operations in our design.
The pipelined implementation of the MM engine has five stages per iteration, as shown in Fig. 7. Each stage takes 4 execution cycles. Since Phase_a has a longer datapath than Phase_c, we divide its datapath into five stages. The division (Div) stage computes the Barrett division in Alg. 2 to obtain γ. The Reduction Computation (RC) stage multiplies γ + 1 by −m. The CSA stage uses carry-save adders to merge c + b_i·a − km into a single addition, and the Add stage uses an optimized ripple-carry adder. The CS stage performs the conditional subtraction of a. The pipelined datapath of Phase_c consists of 3 stages: BM, CSA, and Add. The block multiplication (BM) stage computes b_i·a in Alg. 1, and its hardware implementation is identical to the RC stage in Phase_a. Phase_c and Phase_a of the same round can execute in parallel, so the latency of Phase_c is overlapped by Phase_a in all but the final round. Therefore, pipelining brings an almost 5× performance improvement to the MM engine. Besides pipelining, we also instantiate multiple MM engines on the FPGA to improve parallelism.

Optimal Pre-computation Window
The ME in Paillier encryption uses fixed bases determined by the public keys. An optimization exploiting this insight is to store the ME results of all short powers (i.e., one window) in the offline phase; large-integer MEs are then converted into multiple MM operations in the online phase [BGMW92]. Enlarging the pre-computation window reduces ME latency. However, the maximum window size is limited by the on-chip memory, since a larger window has a larger enumeration space. The size of the pre-computed table S_pre for encryption depends on the window size w and the key length |n|. Eq. 16 gives the theoretical estimate of S_pre based on w and n.
Table 2 shows the pre-computed table sizes for different window sizes in a 3072-bit Paillier cryptosystem. The number of MM operations in a pre-computed ME also depends on the window size. When the window size is 4, the pre-computed table requires about 34 MB of memory. If the window size increases to 8, the table grows to about 287 MB, yet the number of windows only drops from 384 to 192, so the cost-effectiveness of the extra storage versus the saved multiplications is significantly lower. Considering the URAM capacity of the hardware implementation, SHAPER provides the interface AHE.init to reload the pre-computed table with a default window size of 4.
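The fixed-base windowing technique can be sketched as follows. The parameters here (a toy 20-bit modulus and 32-bit exponent) are purely illustrative, while SHAPER uses w = 4 with 3072-bit keys; the online phase then needs only table lookups and multiplications, no squarings:

```python
def precompute(base: int, mod: int, w: int, exp_bits: int):
    """Offline phase: table[i][j] = base^(j * 2^(w*i)) mod mod."""
    windows = (exp_bits + w - 1) // w
    table = []
    g = base
    for _ in range(windows):
        row = [1]
        for _ in range((1 << w) - 1):
            row.append(row[-1] * g % mod)   # powers g^0 .. g^(2^w - 1)
        table.append(row)
        g = row[-1] * g % mod               # base^(2^w) for the next window
    return table

def fixed_base_pow(table, exp: int, mod: int, w: int) -> int:
    """Online phase: one table lookup and one MM per w-bit window."""
    acc = 1
    for row in table:
        acc = acc * row[exp & ((1 << w) - 1)] % mod
        exp >>= w
    return acc

mod, base, w, bits = 1_000_003, 5, 4, 32
tbl = precompute(base, mod, w, bits)
e = 0xDEADBEEF
assert fixed_base_pow(tbl, e, mod, w) == pow(base, e, mod)
```

Note the trade-off stated above: doubling w halves the number of rows (windows) but multiplies each row's length by 2^w, which is why the table grows much faster than the multiplication count shrinks.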

Implementation and Evaluation
We implement the prototype of SHAPER on a Xilinx 16nm VU13P FPGA using the Xilinx Vivado toolchain. The MM engine is central to the hardware implementation of SHAPER, and its throughput significantly influences the performance of high-level functions and applications. We evaluate the throughput of the MM engine in Table 4, compared with the state-of-the-art MM designs in [BJ20, YHC20, XYCL22]. Since it is unfair to discuss throughput without considering resources, these designs are compared by the average throughput delivered per DSP; specifically, throughput is measured as the operations or output bits produced by the MM engine. Existing MM implementations usually focus on 1024- or 2048-bit MM operations, while the MM in SHAPER must fit the 3072-bit Paillier cryptosystem. As a result, SHAPER handles fewer MM instructions than [XYCL22, YHC20], whose benchmarks are tested with 1024-bit MM. However, considering the output length, our MM engine shows advantages over the other proposals.

MM Throughput Comparisons
It is notable that our MM engine requires more DSP units than other designs. Because of the pipelining optimization, DSP units cannot be reused across different stages, so our design assigns separate DSPs to each stage, leading to higher DSP utilization.

Function-Level Comparisons
We evaluate the latency of general functions, including modular operations, Paillier functions, and several MPC-level functions. The latencies of these micro-benchmarks are shown in Table 5. The inputs of the MM and ME benchmarks are as long as the key, and the plaintext length in Paillier is set to 64 bits. We test the performance of SHAPER under different settings. To fairly compare the latency of our MM engine in SHAPER with the cryptography processor (CP) in [BJ20], we adopt a single-MM implementation as the performance baseline, whose latencies are taken from [CZW+21]. SHAPER performs 94× faster than CAESAR executed on the CPU when both use 2048-bit Paillier (112-bit security). The acceleration is mainly due to the fast Paillier encryption and pipelined execution. Even when the key is lengthened for 128-bit security (3072-bit key), SHAPER still performs 7.2× better than OU-based CAESAR with 112-bit security. In addition, most hybrid-scheme solutions show significant efficiency gains over pure SS- or HE-based solutions, confirming the performance advantage of hybrid PPML schemes. SecureML performs the worst among the solutions, since SS-based solutions mask sparse features into dense data shares.

Conclusion
In this paper, we propose SHAPER to accelerate hybrid SS-AHE PPML protocols. The algorithm-protocol-hardware co-design methodology explores full-stack techniques to minimize end-to-end latency in various network settings. SHAPER further supports secure-domain computing acceleration and the conversion between mainstream privacy-preserving primitives, making it ready for general and distinctive data characteristics. We provide a prototype of SHAPER on an off-the-shelf FPGA with several hardware optimizations. Our evaluation shows that SHAPER provides significant speedup over CPU clusters on a large-scale logistic regression training task.

Figure 1: PPML allows two parties to securely train ML models on sensitive data.

Figure 3: SHAPER Architecture Overview. The grey and blue boxes represent software and hardware components, respectively.

Figure 4: Exploring the optimization space over data characteristics and algorithms. The example network bandwidth is 50 Mbps. Applying the optimizations reduces the encryption overhead for the 40 Mbps and 160 Mbps throughput configurations.

Figure 5: The pipelined execution process of SHAPER, corresponding to lines 1 to 3 in Fig. 2. Two successive executions overlap their latency.

Algorithm 1: High-radix Shift-sub Modular Multiplication, MM(a, b, m), with radix width k.

Figure 6: The architecture of a single MM engine based on the MM algorithm.
SHAPER supports an efficient SIMD-style conversion between SS and AHE. We adopt and improve the conversion protocols in [FZT+21]; details are given in Appendix C. The conversions can be optimized with the packing strategy proposed in [P+13].

Figure 7: Pipelined modular multiplication engine. Each round of MM in Alg. 1 is divided into five stages. In the first τ−1 rounds, the accumulation phase (c) and the QR phase (a) execute in parallel; in the final round the phases run serially.

Table 1: The instruction set supported by SHAPER.
* The argument len is the length of the input or output data. The arguments ending in _ptr are the physical base addresses of specific data structures.

Table 2: Pre-computed table sizes under different window sizes when |n| = 3072.

Table 3 shows the resource usage of SHAPER. The maximum clock frequency is 285 MHz by default. As the most expensive module, 14 MM engines are instantiated, balancing performance and LUT consumption. 32 integer engines support vectorized SS operations, contributing little to the overall consumption. Only one CSPRNG is used, because its throughput exceeds 2.6 Gbps, which is sufficient for existing hybrid schemes. The resource consumption, excluding the controllers, is about 40% of the LUT/FF and 75% of the DSP resources of the FPGA. 75% of the URAM is used for the pre-computed table, which is a good balance between performance and area.

Table 3: Resource utilization on the Xilinx FPGA.
* Measured as a percentage of the total Xilinx VU13P FPGA resources.

Table 4: Comparison of the hardware performance of MM implementations.