Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs

Fully Homomorphic Encryption (FHE) has been gaining popularity as an emerging way of enabling an unlimited number of operations on an encrypted message without decryption. A major drawback of FHE is its high computational cost. In particular, bootstrapping, which refreshes the noise accumulated through consecutive FHE operations on a ciphertext, can take minutes. This significantly limits the practical use of FHE in numerous real applications. By exploiting the massive parallelism available in FHE, we demonstrate the first GPU implementation of bootstrapping for CKKS, one of the most promising FHE schemes supporting arithmetic of approximate numbers. Through analyzing FHE operations, we discover that the major performance bottleneck is their high main-memory bandwidth requirement, which is exacerbated by leveraging existing optimizations targeted at reducing computation. These observations lead us to extensively utilize memory-centric optimizations such as kernel fusion and reordering primary functions. Our GPU implementation shows a 7.02× speedup for a single FHE-multiplication over the state-of-the-art GPU implementation and an amortized bootstrapping time of 0.423us per bit, which corresponds to a speedup of 257× over a single-threaded CPU implementation. By applying this to logistic regression model training, we achieve a 40.0× speedup compared to the previous 8-thread CPU implementation with the same data.


Introduction
Homomorphic encryption (HE) enables one to perform operations on encrypted data without decrypting them, and the result can be decrypted only with the secret key. As HE reveals nothing about the input or output except their sizes during computation, it has been spotlighted as a core technology for applications such as privacy-preserving computation. The HE schemes in the early stage had restrictions on the number and types of operations. After Gentry's breakthrough, however, these restrictions have been removed to obtain Fully Homomorphic Encryption (FHE), where an unlimited number of operations are allowed with the help of the bootstrapping operation [Gen09].
The CKKS (Cheon-Kim-Kim-Song) scheme is one that has recently gained traction with its efficient approximate computation [CHK + 19]. As opposed to the other FHE schemes, this scheme is equipped with rounding operations as well as addition and multiplication (mult) of real numbers as the primitive operations, and so provides efficient fixed-point arithmetic on ciphertexts. Its advantage is prominent in the fields of privacy-preserving data analysis and machine learning. In the secure genome analysis competition iDash, for instance, most of the winners and runners-up in the HE track have employed CKKS since 2017 [WTW + 18, KJT + 20, JHK + 20]. However, its performance is still far from satisfactory for industry practitioners [Cra19, YZLT19, ZLX + 20, BHS19, HLF + 19]; bootstrapping, the most inefficient operation in CKKS, takes more than 60 seconds in a conventional PC environment [CHK + 18] on a 128-bit security parameter set.
In this paper, we address these challenges by improving the performance of FHE schemes, focusing on the CKKS scheme with a Residue Number System (RNS) [CHK + 19], which is the most efficient scheme for many applications. Furthermore, we strive to maximize the parallelization of the FHE operations, by exploiting a high-performance commodity platform, GPU. That is, all the operations of CKKS have been implemented to run on GPUs, including the HE mult and bootstrapping, which are the most frequently used primitive one and the slowest one, respectively.
Through analysis, we discover that the FHE operations have high memory bandwidth and capacity requirements. Most FHE schemes require many evaluation keys for manipulating ciphertexts (e.g., mult key and rotation keys), whose size is large as each key is a ciphertext of a certain value related to the secret key. When examining the mult operation of CKKS (HMULT), we observe that the arithmetic intensity (OP/B, operation per byte) of HMULT's primary functions is low. Hence, they are not bottlenecked by the limited number of arithmetic units (compute-bound) but by limited memory bandwidth (memory-bound) on modern GPUs. In particular, Tensor-product and Inner-product in key-switching exhibit very low arithmetic intensity (lower than 3 OP/B in Figure 2(b)). We also identify that most of the existing performance optimization techniques further increase the memory bandwidth and capacity requirements; these include RNS-decomposition [BEHZ16], efficient mod-up/down operations [CHK + 19], improved linear transformation [HHC19], and efficient bootstrapping [HK20]. Especially, using a large decomposition number (dnum) heavily increases memory capacity and bandwidth requirement.
These key observations lead us to devise memory-centric optimizations to improve the performance of CKKS on GPUs substantially. First, we fuse the functions that compose an individual HE operation, such as HMULT (intra-HE-fusion), and the ones across HE operations (inter-HE-fusion), through which we reduce main-memory accesses. Second, in contrast to the prior work that trades computation cost with multiplicative level (the number of subsequent HMULT on a ciphertext) [HK20], we also consider main-memory bandwidth and capacity requirements while finding the optimal dnum for HMULT.
Thanks to the aforementioned memory-centric optimizations, our HMULT and rescaling implementations perform 7.02× and 1.36× better, respectively, than the recent result of [BHM + 20] while using the same GPU generation. We are the first to report the performance of CKKS bootstrapping implemented on GPUs after applying recently proposed algorithm-level optimizations such as [HK20]. The proposed GPU implementation outperforms a single-thread CPU implementation by 257×. The wall-clock time for bootstrapping is reduced to 526.96ms in a 173-bit security parameter set with 65,536 slots and 19 bits of precision, which translates into 0.423us of amortized time per bit. Our result is orders of magnitude better than the recent works that report 298us [HK20] and 519us [CCS19] on 128-bit security parameter sets. To demonstrate the effectiveness of the proposed optimizations at an application level, we implement logistic regression in the same way as [HHCP19] on GPUs. The evaluation confirms that, compared to the implementation on an 8-threaded CPU, there is a 40.0× improvement (from 31.0 to 0.775 seconds per iteration) in speed when training with the same data as [HHCP19].
Finally, we propose 'amortized FHE-mult' time as a new unit of measurement for comparing the performance of mult in FHE. Amortized FHE-mult time refers to the average time required per mult when an unlimited number of HMULTs are enabled by bootstrapping. We calculate this time as the average HMULT time plus the bootstrapping time divided by the maximum number of consecutive mults between two bootstrapping operations. The amortized FHE-mult time of our implementation is 24.35 ms, through which we believe that practical privacy-preserving applications can be developed.
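As an arithmetic sketch of this metric, the definition can be written down directly; the timings below are hypothetical placeholders, not our measured results:

```python
# Amortized FHE-mult time: average HMULT time plus the bootstrapping time
# split over the mults available between two bootstrappings.
def amortized_fhe_mult_ms(hmult_ms, bootstrap_ms, mults_between_bootstraps):
    return hmult_ms + bootstrap_ms / mults_between_bootstraps

# Illustrative values: 2 ms per HMULT, 500 ms bootstrapping,
# 20 levels usable between bootstrappings.
assert amortized_fhe_mult_ms(2.0, 500.0, 20) == 27.0
```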

Notation for Homomorphic Encryption
Let C_i = {q_0, · · · , q_i} and B = {p_0, · · · , p_{k−1}} be sets of prime numbers where all the primes in each set share the same bit-length. Let [·]_{q_j} denote the modular reduction by q_j (e.g., [a(X)]_{q_j} ∈ R_{q_j} = Z_{q_j}[X]/(X^N + 1)). If not ambiguous, we use a to represent [a]_{C_i}. [a]_{C_i} can also be denoted as (a^{(j)})_{j∈[0,i]}, where each a^{(j)} = [a]_{q_j}. For a basis of primes with product Q, we write Q̂_j = Q/q_j. ω_{q_j} is the primitive N-th root of unity in Z*_{q_j}, and [x, y] = {a ∈ Z | x ≤ a ≤ y}. a ←$ A means a is sampled from the distribution A; if A is a set, it refers to uniform sampling over A. χ_err is the error distribution used for encryption and key generation in CKKS [CKKS17]. ⊙ denotes the Hadamard multiplication. S_Y and S′_Y are two disjoint subsets of Y (∈ {C_L, B}). The level ℓ of a ciphertext ct refers to the number of mults that can be performed with ct without bootstrapping; L (ℓ) is the maximum (current) level of a ct. For a given parameter dnum ∈ Z^+, α = ((L + 1)/dnum) ∈ Z^+ and β = ⌈(ℓ + 1)/α⌉. We omit C_i when i = L. FrobeniusMap(a, n) denotes the Frobenius map function on the NTT domain [HK20]: for every a^{(i)} = (a^{(i)}_j)_{j∈[0,N)}, it computes a′^{(i)} whose j-th element is a^{(i)}_{π_n(j)}, where π_n(x) = ([5^n(2x + 1)]_{2N} − 1)/2 is a permutation; it then returns a′ = (a′^{(i)})_{i∈[0,ℓ]}.
Fast basis conversion (Conv), which converts the basis of a polynomial into a new basis that is coprime to the original basis [CHK + 19, BEHZ16], approximate modulus raising (ModUp), RNS decomposition (Dcomp), and approximate modulus reduction (ModDown) are described in Algorithm 1, Algorithm 2, Algorithm 3, and Algorithm 4, respectively [HK20]. Please refer to Appendix C for more information about the subroutines and notations.
[Algorithm 4: Approximate Modulus Reduction]

An Overview of GPU Architecture
We briefly explain the architecture of modern GPUs. Typical GPUs exploit massive thread-level parallelism (TLP) with many scalar, in-order processors that run concurrently. A GPU consists of dozens of hardware units called Streaming Multiprocessors (SMs), each of which can run thousands of threads concurrently. Threads in a GPU are grouped into units called thread blocks. Threads in a thread block share the resources assigned to the block, such as the register file and a user-configurable scratchpad memory called shared memory, typically sized at several dozens of KBs [NVI17, NVI21].
Threads in a thread block are further grouped into units of 32 threads called warps. All the threads in a warp execute one instruction at a time in lockstep. An SM can hold multiple thread blocks at a time. Each SM holds multiple warp schedulers, each selecting one or more warps that are ready to issue an instruction at each cycle. The latency of an instruction executed by a warp is hidden by the other warps selected by the warp schedulers; thus, a GPU is well suited to highly parallelizable programs and can run a massive number of threads in a throughput-oriented manner. A group of thread blocks forms a grid. The number of thread blocks in a grid and the number of threads in a thread block are specified by a kernel, the execution unit of a GPU: a user-specified function called from the CPU.
Our target GPU is the NVIDIA Tesla V100 [NVI17], which we use for all the experiments in this paper. It features 80 SMs that can run up to 163,840 threads concurrently. Compared to a modern server-class CPU, which typically has a last-level cache sized at dozens of MBs, a GPU has a smaller cache but provides higher main-memory bandwidth. We evaluated the performance of our CPU implementation of CKKS on an Intel Xeon Gold 6234 [Int20], which has 24.75 MB of last-level cache per socket, large enough to accommodate one or two polynomials with typical HE parameters. In contrast, Tesla V100 has a last-level cache of 6 MB but is equipped with a large main-memory bandwidth of 900 GB/s, making it suitable for throughput-oriented programs.

CKKS: the state-of-the-art FHE

Polynomial Arithmetic in CKKS
Each ciphertext of CKKS is represented as a pair of polynomials of degree less than N in R_Q = Z_Q[X]/(X^N + 1), and an FHE operation consists of polynomial operations. Since the polynomial degree N is very high (e.g., N = 2^16 or 2^17) and the coefficient size is very large (e.g., log Q ≈ 2,000), such polynomial arithmetic is very costly in FHE. For example, FHE-mult [CLP17, Cry20, HS14] is slower by a factor of thousands in terms of latency than native mult in the unencrypted domain.
Due to the high mult cost between polynomials, FHE schemes [BEHZ16, CHK + 19, BPA + 19] typically have their own variants with Residue Number System (RNS) [GHS12] to reduce their computational cost. First, by exploiting Chinese Remainder Theorem (CRT), they represent each big-integer coefficient as a set of residues, each being the coefficient modulo a distinct prime number. The mult between two big integers is translated into point-wise mult between the two sets of residues, avoiding expensive multi-word arithmetic.
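A minimal sketch of this idea, with toy primes far smaller than actual CKKS moduli:

```python
# RNS (CRT) arithmetic: a big-integer multiply becomes independent
# per-prime multiplies on small residues, avoiding multi-word arithmetic.
from math import prod

primes = [97, 101, 103]          # pairwise-coprime toy moduli
Q = prod(primes)

def to_rns(x):
    return [x % q for q in primes]

def rns_mul(a_res, b_res):
    # one independent small multiply per prime
    return [(a * b) % q for a, b, q in zip(a_res, b_res, primes)]

def from_rns(res):
    # CRT reconstruction (for checking only; FHE stays in RNS form)
    x = 0
    for r, q in zip(res, primes):
        qhat = Q // q
        x += r * qhat * pow(qhat, -1, q)
    return x % Q

a, b = 1234, 5678
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == (a * b) % Q
```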
Second, they use the Number Theoretic Transform (NTT) [AB75, Har14] and its inverse (iNTT) for polynomial ring mult. NTT is a variant of the Discrete Fourier Transform (DFT) [CCF + 67] performed on a finite field of integers. It reduces the operational complexity of mult in a polynomial ring, where the output coefficients (c_0, . . . , c_{N−1}) are the negacyclic convolution of the two input sequences (a_0, . . . , a_{N−1}) and (b_0, . . . , b_{N−1}). The negacyclic convolution is replaced with the following operations: two NTTs performed on the two integer sequences, one element-wise modular mult between the two sequences in the NTT domain, and one iNTT on the output of the element-wise mult. Because the computational complexity of both NTT and iNTT is O(N log N) and that of the element-wise mult is O(N), the total complexity becomes O(N log N).
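The following toy-parameter sketch (q = 257, N = 4; an O(N^2) textbook transform instead of the O(N log N) Cooley-Tukey used in practice) checks that pointwise mult in the NTT domain implements negacyclic convolution:

```python
# Negacyclic NTT: twist the inputs by powers of a 2N-th root of unity psi
# (psi^N = -1 mod q), apply a plain DFT with w = psi^2, multiply pointwise,
# and invert. The result equals the negacyclic convolution mod (X^N + 1).
q, N = 257, 4
psi = next(x for x in range(2, q) if pow(x, N, q) == q - 1)  # psi^N = -1
w = pow(psi, 2, q)                                           # N-th root of unity

def ntt(a):
    return [sum(a[i] * pow(psi, i, q) * pow(w, i * k, q) for i in range(N)) % q
            for k in range(N)]

def intt(A):
    ninv = pow(N, -1, q)
    a = [sum(A[k] * pow(w, -i * k, q) for k in range(N)) * ninv % q
         for i in range(N)]
    return [a[i] * pow(psi, -i, q) % q for i in range(N)]

def negacyclic_naive(a, b):
    # schoolbook negacyclic convolution: X^N wraps around with a sign flip
    c = [0] * N
    for i in range(N):
        for j in range(N):
            k = i + j
            if k < N:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                c[k - N] = (c[k - N] - a[i] * b[j]) % q
    return c

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
prod_ntt = intt([x * y % q for x, y in zip(ntt(a), ntt(b))])
assert prod_ntt == negacyclic_naive(a, b)
```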
We target a recent RNS variant of the CKKS scheme [HK20]; it improves on [CHK + 19] by reducing complexity through a generalized key-switching technique based on RNS decomposition, which is explained in detail in the following section. Throughout this paper, we call this scheme CKKS.

Description of CKKS
We compactly describe CKKS [HK20] and its core operations. CKKS encodes a message z, a vector of N/2 complex numbers, into a polynomial m(X) ∈ R_Q, called a plaintext. The encoding step is twofold: (1) performing an inverse Discrete Fourier Transform (iDFT) on z, and (2) scaling up by multiplying with a scalar value ∆ (called the scaling factor), followed by a rounding operation. Typical values of ∆ range from 2^40 to 2^50.
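The encoding step can be sketched as follows; this toy version uses a plain complex DFT and a made-up message, and ignores the conjugate-symmetry packing that lets CKKS store N/2 slots in a degree-N real polynomial:

```python
# CKKS-style encoding sketch: inverse DFT of the message slots, then scale
# by Delta and round to integer coefficients; decoding reverses the process.
import cmath

def idft(z):
    n = len(z)
    return [sum(z[k] * cmath.exp(2j * cmath.pi * j * k / n) for k in range(n)) / n
            for j in range(n)]

def dft(c):
    n = len(c)
    return [sum(c[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def encode(z, delta):
    # rounding after scaling by Delta is where encoding precision is set
    return [round((x * delta).real) + 1j * round((x * delta).imag)
            for x in idft(z)]

def decode(m, delta):
    return [x / delta for x in dft(m)]

delta = 2 ** 20
z = [1.5 + 0.5j, -2.0 + 0j, 0.25 - 1j, 3.0 + 2j]
z2 = decode(encode(z, delta), delta)
assert all(abs(a - b) < 1e-4 for a, b in zip(z, z2))
```

A larger ∆ shrinks the rounding error relative to the message, which is why the scheme tracks ∆ explicitly and rescales after each mult.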
CKKS encrypts a plaintext m(X) as ct ← RLWE(s, m), where s ∈ ∏_{i=0}^{L} (Z*_{q_i})^N is the secret key and m ← NTT([m(X)]_{C_L}). The ciphertext has level L after encryption. A single execution of FHE-mult (HMULT) followed by rescaling, a function adjusting the scaling factor ∆ of the message, reduces the level of the ciphertext by one. For an HE circuit reducing the input ciphertext level by d, we call d the multiplicative depth of the circuit.

Bootstrapping in CKKS
Before the level of a ciphertext is depleted by consecutive operations, bootstrapping must be performed on the ciphertext to reset its level and to allow more FHE operations. We briefly explain the CKKS bootstrapping algorithm [CHK + 18, HK20].
Modulus Raising: Let ct be a ciphertext encrypting a plaintext m(X), and let the current modulus of ct be q = q_0, i.e., ct has level zero. Our goal is to increase the modulus (or, equivalently, the level). First, the modulus of ct is raised to Q_L, the modulus of a freshly encrypted ciphertext, producing ct′. Although the level is increased, this adds an error polynomial q · I(X) to the plaintext: Decrypt(ct′) = t(X) = m(X) + q · I(X), where I(X) is a polynomial whose integer coefficients are bounded by a small integer K determined by the secret key distribution. To remove this error, we apply an approximate modulo operation in a homomorphic way.
CoefficientToSlot: Let the plaintext t(X) = t_0 + t_1 X + t_2 X^2 + . . . + t_{N−1} X^{N−1} be the decryption of ct′ from Modulus Raising, and let z be its decoding. The goal of CoefficientToSlot is to compute ct_1 and ct_2, which contain the messages z_1 = (t_0, t_1, . . . , t_{N/2−1}) and z_2 = (t_{N/2}, t_{N/2+1}, . . . , t_{N−1}), respectively. ct_1 and ct_2 are computed by homomorphically evaluating the encoding circuit, i.e., linear transformations on ct′. Instead of performing two linear transformations, the relationship between the encoding matrix V and its conjugate is exploited: only one linear transformation producing V^T z is required, and the remaining terms are obtained through a homomorphic conjugation followed by additions/subtractions.
Approximate modulo operation: For the two ciphertexts ct_1 and ct_2, we want to perform a modular reduction on each element of their messages: (t_i mod q) for all i. However, because the modular reduction operation is not available in FHE, a twofold approximation of the modulo operation is used instead.
First, the modulo-q operation is approximated as a scaled sine function f(t) = (q/2π) · sin(2πt/q) [CHK + 18]. This exploits the facts that each value t_i in the messages z_1 and z_2 is distributed near q · I for an integer I ∈ (−K, K) with small K, and that the scaled sine function resembles the modulo operation near q · I. Second, as the sine function is also not available in FHE, it is approximated by polynomials. We adopted a recent bootstrapping algorithm in [HK20]. First, it approximates not the sine but the cosine function by shifting ct_1 and ct_2. Also, it uses polynomial interpolation with the Chebyshev polynomial basis to approximate the cosine function, in addition to adopting the double-angle formula. We use one of their parameter sets for bootstrapping; we evaluate a 31-degree polynomial followed by applying the double-angle formula three times, approximating f(t) = cos(2πt). With cosine evaluation on ct_1 and ct_2, we get two output ciphertexts ct′_1 and ct′_2 whose messages are z′_1 and z′_2. Our cosine evaluation step has a multiplicative depth of 11.
SlotToCoefficient: This is the opposite of CoefficientToSlot. With ct′_1 and ct′_2, we compute an output ciphertext ct_fresh, which contains the message z_fresh, by evaluating the decoding linear transformation; z_fresh is approximately equal to the message of ct before Modulus Raising. We also adopt the additional technique of slim bootstrapping in [CH18], which reorders the bootstrapping process (see Figure 1) from (Modulus Raising, CoefficientToSlot, Cosine evaluation, SlotToCoefficient) to (SlotToCoefficient, Modulus Raising, CoefficientToSlot, Cosine evaluation). By postponing Modulus Raising, the computational cost of SlotToCoefficient decreases as the ciphertext level of its input decreases.
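The polynomial-plus-double-angle idea can be sketched numerically as below; for brevity this uses a truncated Taylor polynomial in place of the Chebyshev interpolation of [HK20], and the degree and input range are illustrative:

```python
import math

def cos_via_double_angle(t, r=3, deg=12):
    # (1) approximate cos(2*pi*t / 2^r) with a low-degree polynomial,
    # (2) apply the double-angle formula cos(2x) = 2*cos(x)^2 - 1, r times.
    x = 2 * math.pi * t / (1 << r)
    # truncated Taylor series stands in for the Chebyshev interpolant
    c = sum((-1) ** k * x ** (2 * k) / math.factorial(2 * k)
            for k in range(deg // 2 + 1))
    for _ in range(r):
        c = 2 * c * c - 1
    return c

for t in [0.0, 0.1, 0.25, 0.5, 1.0, 1.3]:
    assert abs(cos_via_double_angle(t) - math.cos(2 * math.pi * t)) < 1e-8
```

Shrinking the argument by 2^r before the polynomial evaluation keeps the approximation interval small; the r squaring steps cost only r extra multiplicative levels.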
We also use the efficient CoefficientToSlot and SlotToCoefficient proposed in [HHC19]. They apply the Cooley-Tukey FFT algorithm [CT65] to the linear transforms in CoefficientToSlot and SlotToCoefficient. The DFT can be expressed as a mult between a DFT matrix D and an input vector v. The computational complexity of directly computing D · v is O(N^2). By adopting the Cooley-Tukey FFT algorithm, D can be decomposed into log_2 N block-diagonal sparse matrices, reducing the total complexity to O(N log N). For a given radix r and degree N, they decompose V_rev, a row-permuted version of the DFT matrix, into log_r N block-diagonal sparse matrices, where each matrix has 2r − 1 (when i > 1) or r (when i = 1) diagonals and all other elements are zero. We refer to the diagonal of index d of a matrix M as a length-N/2 vector diag_d(M). The same decomposition technique applies to CoefficientToSlot as well. For more details, see Appendix A.
An important characteristic of the (2r − 1) (or r) diagonals of each decomposed block-diagonal sparse matrix is that their indices form an arithmetic progression. Using this property, [HHC19] adopts the Baby-step Giant-step algorithm as in [HS15, CH18, CHK + 18], reducing the number of FHE rotations required in each mult between a block-diagonal sparse matrix and a vector that is encrypted as a ciphertext.
Baby-step Giant-step algorithm: Each mult between a block-diagonal sparse matrix and the input vector is performed in a homomorphic-friendly way, exploiting the Baby-step Giant-step algorithm (BSGS) [HS15, CH18, CHK + 18]. BSGS computes the matrix-vector mult as a summation of products of plaintexts, each being an encoded diagonal, and shifted input vectors, each being the message of a correspondingly rotated ciphertext.
Let M be a predetermined block-diagonal sparse matrix with n diagonals, and let the input vector be v, the message of a ciphertext ct. Let rot_i(v) be the input vector shifted by i, the message of HROTATE(ct, i, evk). For l and t such that l · t = n, and when the common difference of the arithmetic progression is one, BSGS computes M · v as

M · v = Σ_{i=0}^{l−1} rot_{ti} ( Σ_{j=0}^{t−1} rot_{−ti}(diag_{ti+j}(M)) ⊙ rot_j(v) ).    (1)

Given a radix r, by choosing l and t near √n, the number of rotations required for each SlotToCoefficient and CoefficientToSlot is reduced from O(r log_r N) to O(√r log_r N), as each block-diagonal sparse matrix has 2r − 1 (or r) diagonals. We can choose any l and t with flexibility by adding zero diagonals [HHC19].
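The BSGS identity above can be checked on plaintext data; the sizes, the random matrix, and the rot/diag helpers below are illustrative:

```python
# BSGS diagonal method on plaintext vectors: M*v assembled from rotations
# of v and pre-rotated diagonals, needing ~l+t distinct rotations for
# n = l*t diagonals instead of n.
import random

n = 4; l, t = 2, 2; size = 8      # toy sizes, not HE parameters

def rot(v, i):                    # cyclic left-shift by i
    return v[i % len(v):] + v[:i % len(v)]

def diag(M, d):                   # d-th generalized diagonal of M
    return [M[r][(r + d) % len(M)] for r in range(len(M))]

random.seed(0)
M = [[random.randint(0, 9) for _ in range(size)] for _ in range(size)]
for r in range(size):             # keep only the first n diagonals
    for c in range(size):
        if (c - r) % size >= n:
            M[r][c] = 0
v = [random.randint(0, 9) for _ in range(size)]

# M*v = sum_i rot_{t*i}( sum_j rot_{-t*i}(diag_{t*i+j}(M)) * rot_j(v) )
res = [0] * size
for i in range(l):
    inner = [0] * size
    for j in range(t):
        d = rot(diag(M, t * i + j), -t * i)
        rv = rot(v, j)            # baby-step rotation, shared across i
        inner = [x + a * b for x, a, b in zip(inner, d, rv)]
    res = [x + y for x, y in zip(res, rot(inner, t * i))]  # giant step

expected = [sum(M[r][c] * v[c] for c in range(size)) for r in range(size)]
assert res == expected
```

In the homomorphic setting each `rot` on `v` is an HROTATE, while the rotations of the diagonals are precomputed on plaintexts, which is where the rotation savings come from.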

Basic HE operations
We provide the pertinent details of the baseline CPU and GPU implementations of CKKS. There are four groups of functions that compose FHE operations: (1) element-wise RNS operations between polynomials, such as mult in the NTT domain, addition, and subtraction, (2) NTT and iNTT, (3) fast basis conversion (Conv) used in ModUp and ModDown [BEHZ16, CHK + 19], and (4) Inner-product in key-switching. We also explain our implementation of the BSGS algorithm. Throughout this paper, we use double-word (i.e., 64-bit) moduli such that log q_i ≈ 51 and log p_j ≈ 61, where i ∈ [0, L] and j ∈ [0, k). Also, we lay out input and output data contiguously in memory in a degree-first manner (i.e., the chunk of N residues of a^{(i)} is contiguous for each i).

RNS operation:
We refer to an RNS operation as a binary operation that takes residues as input and performs an element-wise operation on them, such as mult, addition, and subtraction. In our CPU implementation, each CPU thread takes two vectors of N residues (i.e., a^{(i)}, b^{(i)} ∈ R_{q_j}) at a time and performs N RNS operations. Then the thread (or another thread in a multi-threaded, multi-core environment) takes another pair of vectors of N residues, until (ℓ + 1) × N RNS operations are completed by all the threads. In contrast, our GPU implementation follows the method in [BPA + 19]; we launch (ℓ + 1) × N threads for a single GPU kernel. We let each GPU thread perform a single RNS operation so that we exploit the massive thread-level parallelism available on modern GPUs, which can run hundreds of thousands of threads concurrently.
NTT and iNTT: For NTT and iNTT, in both the CPU and GPU implementations, we exploit David Harvey's NTT algorithm [Har14] along with the in-place Cooley-Tukey algorithm, which is commonly used in other works [CLP17, Sho16, HS14].
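The GPU-style work decomposition for an element-wise RNS operation can be emulated sequentially as below (toy moduli and sizes; each loop iteration stands in for one GPU thread):

```python
# One "thread" per residue: the (i*N + j)-th work item handles the j-th
# coefficient of the i-th RNS limb. A GPU would run all (l+1)*N items
# concurrently; here they are emulated with a sequential loop.
N = 8
moduli = [97, 101, 103]                 # (l+1) toy RNS primes
a = [[(3 * j + i) % q for j in range(N)] for i, q in enumerate(moduli)]
b = [[(5 * j + 1) % q for j in range(N)] for i, q in enumerate(moduli)]
out = [[0] * N for _ in moduli]

for tid in range(len(moduli) * N):      # each iteration emulates one thread
    i, j = divmod(tid, N)               # limb index, coefficient index
    out[i][j] = a[i][j] * b[i][j] % moduli[i]

assert out[0] == [(3 * j) * (5 * j + 1) % 97 for j in range(N)]
```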
For the CPU implementation of NTT, we use the same approach as in the RNS operation, where each thread takes N residues (i.e., a^{(i)} for a given i) at a time and performs an N-point NTT. For the GPU implementation, we use the hierarchical NTT implementation in [KJPA20], which heavily exploits shared memory on GPUs. More specifically, for every (i)NTT with N residues, we use the 8-point per-thread (i)NTT kernels described in [KJPA20], where each thread in a kernel loads 8 residues into registers at a time. We launch kernels each performing a radix-256 or radix-512 (i)NTT. Shared memory serves as storage for the temporary output of each (i)NTT stage, so that the N residues are loaded from and stored to main memory only once per kernel. We launch two sequential GPU kernels for the practical values of N ranging from 2^16 to 2^17. See Appendix B for more details.

ModUp and ModDown:
We first explain the implementation of fast basis conversion. Given an input polynomial a(X) and RNS bases S_1 = S_{C_L} ∪ S_B and S_2 = S′_{C_L} ∪ S′_B, a fast basis conversion Conv_{S_1→S_2}([a(X)]_{S_1}) consists of two steps.
Step 1 multiplies each input residue with Q̂_j^{−1} mod q_j (or Q̂_j^{−1} mod p_j) for the corresponding j. Then step 2 computes each output residue by a multiply-accumulate (MAC) operation over the |S_1| input residues, each multiplied with Q̂_j mod p_i (or Q̂_j mod q_i) for the corresponding q_j (or p_j) ∈ S_1, followed by a modular reduction with p_i (or q_i) ∈ S_2 (see Conv in Subsection 2.1).
In the CPU implementation of fast basis conversion, each thread takes as input the N residues that are the coefficients of [a(X)]_{q_j} (or p_j), where q_j (or p_j) ∈ S_1, and performs step 1. After all threads complete the first step, step 2 begins, where each thread computes N output residues; the inner-most loop, which performs a MAC followed by a modular reduction, computes one output residue at a time.
The GPU implementation of the fast basis conversion consists of two kernels. The first kernel (called the scaling kernel) for step 1 launches (|S_1| + 1) × N threads, where each thread multiplies an input residue with the corresponding constant Q̂_j^{−1} mod q_j (or p_j). The second kernel for step 2 launches |S_2| × N threads, where a thread takes (|S_1| + 1) residues of a coefficient as inputs to compute one output. Throughout the accumulation, the partial sum resides in the registers the thread holds. When the accumulation completes, the thread performs a modular reduction on the result. See Appendix B for more details.
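A plaintext sketch of the two steps of fast basis conversion, using toy primes; note that the output may carry a small multiple of Q, the "approximate" part that ModDown later removes:

```python
# Fast basis conversion Conv_{S1->S2}: from residues of a modulo the primes
# in S1, compute residues modulo new primes in S2 without reconstructing
# the big integer a.
from math import prod

S1, S2 = [97, 101, 103], [107, 109]   # toy bases (illustrative)
Q = prod(S1)

def conv(residues):
    out = []
    for p in S2:
        acc = 0
        for aj, qj in zip(residues, S1):
            qhat = Q // qj
            yj = aj * pow(qhat, -1, qj) % qj   # step 1: scale by Qhat_j^-1 mod q_j
            acc += yj * (qhat % p)             # step 2: MAC with (Qhat_j mod p)
        out.append(acc % p)                    # one modular reduction at the end
    return out

a = 123456
res = conv([a % q for q in S1])
# each output residue equals (a + u*Q) mod p for some small u in [0, |S1|)
for r, p in zip(res, S2):
    assert any(r == (a + u * Q) % p for u in range(len(S1)))
```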
Both ModUp and ModDown exploit the fast basis conversion but with different input and output bases. Also, both perform iNTT on the inputs of the fast basis conversion and NTT on its outputs. The difference between ModUp and ModDown lies in the RNS operations coming after the NTT. ModUp concatenates the output of the NTT to its input with the original basis to extend the basis. ModDown (see Algorithm 4) subtracts the original input from the output of the NTT and scales by Q^{−1} mod q_j (line 7). Because the subtraction and scaling after the NTT are element-wise RNS operations, our CPU and GPU implementations execute them in the same way as explained above.
Inner-product in key-switching: Our baseline implementation of Inner-product in key-switching is shown in Algorithm 10. It includes the lazy modular reduction technique adopted in FHE libraries [CLP17, Cry20]. The technique performs wide MAC operations (i.e., with 128-bit partial sums) instead of performing a modular reduction β times.

Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs
Algorithm 10: Baseline implementation of Inner-product in key-switching

In our CPU implementation, a thread loads three vectors of N residues, one from d_2^{(i)} and two from evk_i for i ∈ [0, β), and performs mult (line 5) or MAC (line 7) in an element-wise manner. This is repeated β times, followed by Barrett's modular reduction [Bar87] (lines 8 and 9). Our baseline GPU implementation launches three kinds of kernels: one for mult (line 5), another for MAC (line 7), and the other for modular reduction (lines 9 and 10). Each of the three kernels launches (αβ + k)N threads. In the first and second kernels, each thread takes three residues as inputs, one from d_2^{(i)} and two from evk_i. Then the thread multiplies the first input with the others, accumulates the products into two 128-bit partial sums, and stores them as in the CPU implementation. The last kernel reduces the 128-bit partial sums into 64 bits using Barrett's modular reduction.

Baby-step Giant-step
Let a matrix M be predetermined with n diagonals whose indices form an arithmetic progression, and let the vector v be encrypted as a ciphertext ct. In CKKS, the evaluation of BSGS in Equation 1, where rot_{−ti}(diag_{ti+j}(M)) is encoded as a plaintext m_{i,j} for each 0 ≤ i < l, 0 ≤ j < t, is done as shown in Algorithm 11. It requires a distinct rotation key for each rotation index j, but we omit this notation for simplicity.
We describe the implementation of the HE operations required in BSGS. CMULT is performed as two element-wise RNS operations, as described in the previous section. HROTATE is mostly similar to HMULT in its key-switching, except that it performs the Frobenius map (see Section 2), which can be understood as a permutation. To implement the permutation with index n, where the j-th residue of a given input {a^{(i)}}_{i∈[0,ℓ]} is moved into the π_n(j)-th position, each CPU thread executes an in-place permutation on N residues, whereas the GPU performs an out-of-place permutation by launching (ℓ + 1) × N threads such that the (i × N + j)-th thread takes the j-th residue of a^{(i)}.
Hoisting: Halevi et al. [HS18] suggested reducing the computation cost of BSGS by applying hoisting. Hoisting reduces the required number of ModUp operations when multiple HROTATE operations with different rotation indices are performed on the same ciphertext. Specifically, it restructures the HROTATE algorithm so that ModUp is performed first, before the Frobenius map. By applying hoisting, one PrecomputeModUp is required right before line 5 of Algorithm 11, whereas HROTATE in line 6 changes to FastRotate, reducing the required number of ModUp operations from (t − 1) + (l − 1) to 1 + (l − 1). We evaluate the performance improvement of bootstrapping with hoisting in Section 7.

Bottleneck Analysis of FHE-mult
In this section, we show that the major FHE operations are primarily limited by the main-memory bandwidth of GPUs.

Function-level Operation Complexity and Memory Access Analysis
First, we show the operation complexity and the number of main-memory accesses of each function in the mult of CKKS (HMULT). We evaluate the operation complexity in terms of the number of integer mults and additions (adds), as in [TLW19]. We first count the number of modular mults (ModMuls) as well as the other operations, and convert the ModMuls into the number of mults and adds. We count one Barrett's ModMul [Bar87] (Shoup's ModMul [Sho16]) as 4 (3) mults and 3 (2) adds, using the methodologies explained in [TLW19]. For simplicity, we assume that the two input ciphertexts have the same level ℓ, and β = (ℓ + 1)/α without the ceiling function.
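For reference, Barrett's modular multiplication can be sketched as below; the word width and moduli are illustrative, and the 4-mult/3-add budget refers to fixed-width implementations rather than this big-integer sketch:

```python
# Barrett modular multiplication: replace the division in a*b mod q with
# multiplies by a precomputed reciprocal estimate mu = floor(2^(2W) / q),
# plus at most one correction step.
W = 64                          # word width in bits (toy choice)

def barrett_mulmod(a, b, q):
    mu = (1 << (2 * W)) // q    # precomputable once per modulus
    c = a * b                   # double-width product
    t = (c * mu) >> (2 * W)     # quotient estimate, off by at most one
    r = c - t * q               # r lands in [0, 2q)
    if r >= q:                  # single conditional correction
        r -= q
    return r

q = (1 << 61) - 1
for a, b in [(123456789, 987654321), (q - 1, q - 1), (0, 5)]:
    assert barrett_mulmod(a, b, q) == a * b % q
```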
Prior work [HK20] showed the operation complexity of FHE-mult for various dnum values in terms of the number of ModMuls and showed that it is practical to choose a moderate dnum; however, it did not consider the high main-memory bandwidth requirement, which directly translates into the performance of FHE-mult, especially on GPUs. We assume that each modulus {q_i}_{0≤i≤L} and {p_j}_{0≤j<k} is a double word (8 bytes). We count main-memory accesses in units of 8 bytes. Also, we consider subtraction as signed addition.

NTT & iNTT:
For a single execution of (i)NTT on the N residues of each {a^{(i)}}_{i∈[0,ℓ]}, there are N log N/2 ModMuls and N log N adds. As there are log N stages for each execution of (i)NTT, the number of main-memory accesses is up to 2N log N for loading and storing the input and output, plus N for loading twiddle factors. However, two more factors affect the total number of memory accesses: the cache size of the processor and Shoup's ModMul [Sho16]. If the cache is large enough to accommodate the length-N input of (i)NTT, it does not have to access main memory at each stage, reducing main-memory accesses by up to a factor of 1/log N. Modern server-class CPUs have last-level caches of dozens of megabytes [AFK + 19], easily achieving such a reduction in memory accesses. In contrast, GPUs have much smaller last-level caches of a few megabytes [NVI17], which can hardly accommodate the input/output of (i)NTT. Therefore, the number of memory accesses required on a GPU for loading and storing an input polynomial of (i)NTT is between 2N and 2N · log N.
Our baseline (i)NTT implementation [KJPA20] launches two GPU kernels, each of which performs a radix-√N (i)NTT, resulting in 2N log N / log √N accesses for input and output. Rather than Barrett's algorithm [Bar87], we adopt Shoup's modular mult [Sho16], which is commonly used in (i)NTT to reduce the operation complexity of ModMuls [CLP17, HS14]. Using Shoup's method adds extra N memory accesses, as it demands a precomputed value for each ModMul. iNTT is performed at the front of ModUp and ModDown, done β times for αN residues and once for (αβ + k)N residues, respectively (see Subsection 2.1). NTT is performed on the output of the basis conversion in ModUp and ModDown, done β times for (αβ + k − α)N residues and once for αβN residues.
ModUp: Because we count the operations in (i)NTT separately, we consider only the fast basis conversion in ModUp and ModDown here. With the lazy modular reduction technique, one needs to perform a modular reduction only once after computing a sum of products, as stated in Section 4. Therefore, αβ(αβ + k − α)N mults, αβ(αβ + k − α)N adds, and β(αβ + k − α)N modular reductions are required in total. The number of memory accesses is αβN for reads and β(αβ + k)N for writes in total.
Inner-product: As in ModUp, the lazy modular reduction technique is also applied to Inner-product in key-switching; this results in 2β(αβ + k)N mults, 2(β − 1)(αβ + k)N adds, and 2(αβ + k)N modular reductions. The required memory accesses are 2β(αβ + k)N for loading {evk_i}_{0≤i<β}, β(αβ + k)N for loading {d_{2,i}}_{0≤i<β}, and finally, 4(β − 1)(αβ + k)N and 4β(αβ + k)N for loading and storing accum in Algorithm 10, respectively.
ModDown: There are kαβN mults, kαβN adds, and αβN modular reductions for each of the two polynomials (a_accum and b_accum), which are the outputs of Inner-product. Meanwhile, there are kN memory accesses for loading the input and (αβ + k)N for storing the output for each of the two polynomials.
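Shoup's ModMul, which trades one precomputed word per twiddle factor for a cheaper reduction (3 mults and 2 adds in the accounting above), can be sketched as:

```python
# Shoup's modular multiplication: when one operand w is fixed in advance
# (e.g., an NTT twiddle factor), precompute wp = floor(w * 2^W / q) and
# reuse it to estimate the quotient with a single high multiply.
W = 64

def shoup_precompute(w, q):
    return (w << W) // q        # stored alongside each twiddle factor

def shoup_mulmod(a, w, wp, q):
    t = (a * wp) >> W           # quotient estimate from the precomputed wp
    r = a * w - t * q           # r lands in [0, 2q)
    return r - q if r >= q else r

q = (1 << 61) - 1
w = 123456789123456789 % q
wp = shoup_precompute(w, q)
for a in [0, 1, 12345, q - 1]:
    assert shoup_mulmod(a, w, wp, q) == a * w % q
```

The extra precomputed word per twiddle factor is exactly the source of the additional N memory accesses noted above.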
We summarize the total operation complexity and the number of memory accesses required in Table 1.

[Table 1: per-function counts of ModMuls, muls and adds, total integer ops, and memory accesses (each 8B); e.g., Tensor-product requires 3(ℓ+1)N ModMuls, 4(ℓ+1)N muls and adds, 25(ℓ+1)N total integer ops, and 7(ℓ+1)N memory accesses, and RNS-decomposition requires (ℓ+1)N ModMuls.]

These counts show that FHE operations require memory accesses comparable in number to their arithmetic operations and are therefore prone to being memory-bound, which is especially true in GPUs compared to CPUs. Figure 2(a) shows the FHE-mult time with various dnum values on our baseline CKKS implementation, on both CPUs and GPUs. As stated above, because modern server-class CPUs have a last-level cache of several dozens of megabytes that can accommodate one or two polynomials in FHE operations, the last-level cache absorbs most DRAM traffic. This makes the execution time of FHE-mult directly proportional to the number of arithmetic operations. In contrast, GPUs have smaller caches; the execution time is instead determined by main-memory accesses and becomes susceptible to memory bandwidth, making it preferable to use a smaller dnum for fewer memory accesses.

Memory-bandwidth Bottleneck of FHE Operations on GPUs
A single ModMul operation is translated into different instructions depending on the machine type. To understand the actual operational characteristics of FHE-mult on GPUs, we provide a roofline plot [PH12] of CKKS mult in our baseline GPU implementation in Figure 2(b). We measured integer-operation throughput and DRAM access counts using the profiling tool NVIDIA Nsight Compute [NVI20].
Most FHE operations are not compute-bound but main-memory-bandwidth bound. Among them, Inner-product has the lowest operational intensity and is severely bounded by main-memory bandwidth. The operational intensity of Inner-product is even lower than that of the addition operations because the baseline Inner-product implementation accumulates the i-th product d_2^(i) · evk_i into accum (lines 6-8 in Algorithm 10), whose elements are 128 bits (quad-word) long, incurring many memory accesses.
We observe that Inner-product in key-switching is a major bottleneck in accelerating FHE-mult because it has extremely low operational intensity while requiring a large number of memory accesses, especially with large dnum. Figure 3 presents the execution time, main-memory (DRAM) bandwidth, and instruction throughput (SM utilization) of FHE-mult with our baseline GPU implementation. For most functions, DRAM bandwidth utilization is higher than SM utilization, concurring with the low operational intensity shown in the roofline plot (Figure 2(b)). Also, although NTT performs most of the modular mults required in FHE-mult, Inner-product dominates, accounting for 54.5% of the overall execution time. Because Inner-product is the most memory-intensive function in FHE-mult, it utilizes the GPU's SMs at only 3.4%. The second-most time-consuming function, NTT, also exhibits high main-memory utilization (71.0%) and low SM utilization (38.4%).
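As a rough sketch, the operational intensity of Inner-product can be estimated from the counts derived above (the function, example parameters, and the one-op-per-reduction simplification are ours):

```python
# Rough operational-intensity model for Inner-product, using the mult/add/
# reduction and access counts from the analysis above. Each reduction is
# counted as a single integer op here, which understates the true op count.
def inner_product_intensity(alpha: int, beta: int, k: int, N: int) -> float:
    base = (alpha * beta + k) * N
    int_ops = 2 * beta * base + 2 * (beta - 1) * base + 2 * base  # mults + adds + reductions
    bytes_moved = (11 * beta - 4) * base * 8                      # 8-byte words
    return int_ops / bytes_moved  # integer ops per byte

# Illustrative parameters: alpha = 4, beta = 12 (large-dnum regime), k = 4
oi = inner_product_intensity(4, 12, 4, 1 << 16)
```

The resulting intensity is far below one integer op per byte, so on any machine whose compute-to-bandwidth ratio exceeds that, Inner-product sits firmly on the memory-bound side of the roofline.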
One of the factors contributing to the large number of memory accesses in Inner-product is using a large dnum, as the number of evaluation keys is proportional to dnum ({swk_i}_{0≤i<dnum}). Throughout this paper, we adopt the generalized key-switching of [HK20], which prefers smaller dnum values, reducing the size of the keys and thus mitigating the main-memory bandwidth bottleneck of Inner-product on GPUs. Moreover, we apply memory-centric optimizations that are commonly applicable to GPUs, such as operation fusion (kernel fusion [QRHT19]) in both intra- and inter-FHE-operation manners, which significantly reduces the main-memory bandwidth requirement.

Memory-centric optimizations for FHE on GPUs
Kernel fusion is a technique that fuses multiple GPU kernels into one. A typical GPU kernel operates in three steps: load the input data from DRAM into shared memory or registers, compute on the data, and store the output data. By combining multiple kernels and reusing data in the register file or shared memory, kernel fusion reduces the number of main-memory accesses between kernels. By judiciously applying various kernel fusion methods in both intra- and inter-FHE-operation manners, we significantly reduce the memory accesses required for memory-intensive FHE operations.

Intra-FHE-operation Fusion
There are many opportunities for kernel fusion inside a single FHE operation. More specifically, we show that major FHE operations such as mult, rotation, and rescaling have ample chances for adopting kernel fusion to save memory accesses. We describe below how we apply the kernel fusion technique to those FHE operations.

Fusing iNTT with scaling in ModUp: Because the scaling kernel has low arithmetic intensity and is memory-bound, fusing it with its preceding batched iNTT kernel removes most of the required memory accesses. Specifically, since we perform 8 per-thread iNTT for the batched iNTT kernel, each thread holds 8 double-word output elements in its registers before storing them to main memory. We multiply each of the 8 output elements with its corresponding [Q_j^{−1}]_{q_j} or [Q_j^{−1}]_{p_j} right before the store. This reduces memory accesses by 2αβN in total.

Fusing Inner-product in key-switching: Inner-product in key-switching (see Algorithm 10) requires (11β − 4)(αβ + k)N memory accesses in total (Table 1). Most of these accesses come from reading and writing accum multiple times; because each element of accum is 128 bits long, it takes up most of the memory accesses.
To reduce the memory accesses, we fuse all three kinds of kernels in Algorithm 10, described in Subsection 4.1, into a single kernel that performs all the mult, mult-and-add, and reduction operations. In the fused kernel, each thread holds the partial sum in its registers until the last mult-and-add is completed. This simple but effective approach reduces the total number of required memory accesses from (11β − 4)(αβ + k)N to (3β + 2)(αβ + k)N.

Fusing element-wise operations in ModDown with NTT: We fuse the two element-wise operations in ModDown, the subtraction and the scaling with [Q^{−1}]_{q_j} for each j (lines 6-8 in Algorithm 4), with the preceding NTT (line 5). Prior to the fusion, we call one kernel for the subtraction and one for the scaling. Similar to the kernel fusion applied to ModUp, these two operations are applied right before storing the NTT's output to main memory, reducing the memory accesses by 2(ℓ + 1)N for the subtraction and 2(ℓ + 1)N for the scaling in total.
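The access-count reduction from fusing Inner-product can be sketched as follows (our own helper functions, following the formulas above):

```python
# Access-count model for Inner-product before and after kernel fusion
# (formulas follow the text; the function names are ours).
def ip_accesses_unfused(alpha, beta, k, N):
    return (11 * beta - 4) * (alpha * beta + k) * N

def ip_accesses_fused(alpha, beta, k, N):
    return (3 * beta + 2) * (alpha * beta + k) * N

def ip_fusion_speedup_bound(beta):
    # Ratio of the two access counts; approaches 11/3 as beta grows.
    return (11 * beta - 4) / (3 * beta + 2)
```

Since the (αβ + k)N factor cancels in the ratio, the achievable speedup from this fusion depends only on β (i.e., on dnum), approaching 11/3 for large β.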
This fusion in ModDown is also applicable to the rescaling operation, because RESCALE can be understood as a ModDown that drops a prime (i.e., ModDown_{C_ℓ→C_{ℓ−1}}) applied to both elements of the input ciphertext ([a]_{C_ℓ}, [b]_{C_ℓ}) ← ct. Therefore, we also apply the fusion in rescaling and evaluate its performance in Section 7.

Inter-FHE-operation Fusion
Most of the bootstrapping time is spent evaluating linear transformations. Therefore, we suggest an optimization applicable to linear transformations: batching plaintext-ciphertext multiply-accumulate (MAC) operations.

Breakdown of bootstrapping: Figure 4 shows that most of the bootstrapping time is spent evaluating the linear transforms SlotToCoefficient and CoefficientToSlot, which are homomorphic decoding and encoding, respectively. We focus on these two functions and accelerate their core algorithm, BSGS (see Equation 1).

Batching plaintext-ciphertext multiply-accumulate (MAC): As described in Subsection 3.3, the core operation of SlotToCoefficient and CoefficientToSlot is a series of mults between block-diagonal sparse matrices, whose diagonals are encoded into plaintexts, and an input vector, which is a ciphertext, exploiting the BSGS algorithm (Equation 1).
Our key observation is that the innermost sum of BSGS (lines 10-13 in Algorithm 11) exhibits a severe memory bandwidth bottleneck on GPUs, similar to Inner-product (lines 5-7 in Algorithm 10). For a given ciphertext at level ℓ, the innermost sum of Algorithm 11 requires 5t(ℓ + 1)N memory accesses for t CMULTs and 6(t − 1)(ℓ + 1)N for t − 1 HADDs. We fuse the kernels for CMULT and HADD in lines 9-13 of Algorithm 11 into one, reducing the total memory accesses from (11t − 6)(ℓ + 1)N to (3t − 1)(ℓ + 1)N. We call this fusion batching plaintext-ciphertext MAC.
As CMULT and HADD in Algorithm 11 are mostly memory-bound, the reduction in memory accesses translates into a speedup of asymptotically up to 11/3 with batching plaintext-ciphertext MAC. We benchmark the speedup of lines 9-13 in Algorithm 11 with our GPU implementation for various t in Figure 4(b); even with t = 2, we obtain a 2.66× speedup. The overall execution time of BSGS can be modeled as follows. Let f_speedup(t) = (11t − 6)/(3t − 1). Then, the execution time of Algorithm 11 becomes (l + t − 2) · (HROTATE time) + l · {t · (CMULT time) + (t − 1) · (HADD time)}/f_speedup(t). With this, we can model the speedup of each execution of BSGS in CoefficientToSlot and SlotToCoefficient. For a given radix r, because each block-diagonal sparse matrix (except the first one, which is multiplied with the input) has 2r − 1 diagonals, we should select l and t in Equation 1 such that l · t > 2r − 1 and both l and t are near √(2r − 1). Following [HHC19], we select a proper radix r first. For N = 2^16, we use r = 2^5, resulting in three matrix-vector mults, each with radix 2^5 (because N/2 = (2^5)^3). For N = 2^17, as log r = 5 does not divide log(N/2) evenly, we perform three matrix-vector mults with radices 2^5, 2^5, and 2^6. Summing the modeled times of the three matrix-vector mults gives the total time of CoefficientToSlot (the same applies to SlotToCoefficient) for N = 2^16.
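The cost model above can be sketched numerically (the timing arguments are illustrative placeholders, and pick_l_t is our own helper, not the paper's exact selection procedure):

```python
from math import isqrt

def f_speedup(t: int) -> float:
    # Access-count ratio of the unfused to the fused inner sum.
    return (11 * t - 6) / (3 * t - 1)

def bsgs_time(l, t, t_rot, t_cmult, t_hadd):
    # (l + t - 2) rotations, plus l fused inner sums of t CMULTs and t-1 HADDs.
    return (l + t - 2) * t_rot + l * (t * t_cmult + (t - 1) * t_hadd) / f_speedup(t)

def pick_l_t(r: int):
    # Choose l, t with l * t > 2r - 1, starting from sqrt(2r - 1).
    s = isqrt(2 * r - 1)
    l = t = s
    while l * t <= 2 * r - 1:
        t += 1
    return l, t

l, t = pick_l_t(32)   # radix r = 2^5, so l * t must exceed 63
```

Plugging measured per-operation times into bsgs_time for each candidate (l, t) pair then lets one pick the split that minimizes the modeled transform time.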

Evaluation
We present the performance improvement in FHE operations with our inter- and intra-FHE-operation optimizations, the effect of generalized key-switching, and hoisting. All experiments were performed on a single NVIDIA Tesla V100 [NVI17] GPU with CUDA 11.2 and an 8-core Intel Xeon Gold 6234 @3.3GHz CPU [Int20]. We cached all the evaluation keys and constants in main memory for both the CPU and GPU implementations. We used parameters that meet a security level of λ > 80 bits, calculated from the LWE estimator [APS15]. Our parameter sets used for evaluating bootstrapping performance add 2^{−19} mean errors to the input message after bootstrapping, which is small enough not to hinder most applications [HK20, CCS19, HHC19, CHK+18]. Table 2 summarizes the latency of FHE operations on a single-threaded CPU, our GPU implementation, and PrivFT [BHM+20], the state-of-the-art GPU implementation of CKKS [CHK+19]. Because PrivFT is closed-source, we compare against the per-operation latency reported in their paper on the same GPU. We also compare performance on the parameter set with the largest level in [BHM+20], considering the practical use of FHE; this parameter set with a large level represents the parameter sets suited for applications not requiring bootstrapping.

Performance of individual FHE operations:
First, by applying all the intra-FHE-operation fusion techniques, our optimized GPU implementation beats the baseline CPU implementation by 152×, 153×, 229×, and 135× in HMULT, HROTATE, RESCALE, and HADD, respectively. Second, compared to PrivFT [BHM+20], our baseline implementation already outperforms it by 1.67× in HMULT, even without the optimizations. Our memory-centric optimizations are effective enough that we achieve a 3.21× speedup in HMULT (fused) over PrivFT.
We also show the impact of choosing a proper dnum value in a GPU implementation. Because using a smaller dnum decreases the multiplicative level for a fixed log PQ and security level, we show two parameter sets having either the same log PQ with fewer levels (fused_L) or the same level with higher log PQ and N to ensure a high security level of 128 bits (fused_H). Comparing fused_H with PrivFT, which does not exploit generalized key-switching, we improve HMULT performance by 7.02×. In fused_H, HADD, RESCALE, and CMULT become slower because the ciphertext is larger than in fused; however, the execution time of HMULT and HROTATE generally dominates by an order of magnitude. Compared to the configuration using the maximum dnum (fused), fused_L obtains speedups of 5.88× and 6.6× in HMULT and HROTATE, respectively. These results show the importance of memory-centric optimizations and of the selection of dnum in accelerating FHE operations.

Performance of bootstrapping: We evaluate the performance of bootstrapping with our intra- and inter-FHE-operation fusion and hoisting [HS18] in Table 3. We select two parameter sets with small dnum and moderate levels, which represent bootstrapping-friendly parameter sets suited for applications demanding bootstrapping. First, our baseline GPU implementation already provides up to a 185× speedup over the CPU implementation; using a small dnum increases the performance gap between the CPU and GPU implementations by relieving the memory bandwidth bottleneck on the GPU. Second, intra-FHE-operation fusions lead to a speedup of up to 1.15× over Baseline. Moreover, batching plaintext-ciphertext MAC and hoisting give an additional speedup, resulting in up to 257× in total compared to the single-threaded CPU implementation. We observed similar speedups from each optimization in both parameter sets.
Speedup breakdown: Using a small dnum significantly reduces the Inner-product time, by factors of 7.48 and 7.50 in our baseline GPU implementation of HMULT and HROTATE, respectively (see Figure 5(a) and (b)). As the number of memory accesses required by Inner-product decreases proportionally to dnum, the portion of Inner-product time changes from 54.6% (55.8%) when dnum = 45 to 29.8% (33.5%) when dnum = 3 in HMULT (HROTATE). After reducing memory accesses using a small dnum, we analyze the effect of each intra-FHE-operation fusion (Figure 5(b)). We refer to the three intra-FHE-operation fusions listed in Subsection 6.1 as ModUp Fusion (MF), Inner-product Fusion (IF), and ModDown Fusion (MDF), respectively. Figure 5(b) shows the execution-time breakdown when we apply MF, IF, and MDF incrementally. MF improves the overall performance of HMULT and HROTATE by 1.02× and 1.03×, respectively; the speedup is relatively small because the iNTT in ModUp is small and takes up a small portion of the overall execution time. On top of MF, IF significantly decreases the execution time of Inner-product by reducing its memory accesses by up to a factor of 11/3, resulting in a 3.75× (3.75×) speedup in Inner-product and 1.35× (1.41×) in total for HMULT (HROTATE). Lastly, MDF improves the overall performance by 1.46×, 1.54×, and 1.32× in HMULT, HROTATE, and RESCALE, compared to the baseline.
We also evaluate the effectiveness of hoisting [HS18] in HROTATE, shown as a distinct column in Figure 5.

Amortized FHE-mult time: This metric considers both the bootstrapping cost and the number of mults between two consecutive bootstrapping operations. Let one perform consecutive HMULTs until the level of a ciphertext depletes, perform bootstrapping, and then repeat. We define the amortized FHE-mult time as T_{mult,amortized} = T_mult + T_boot/L_boot, where T_mult is the average time of HMULT between two bootstrapping operations, T_boot is the bootstrapping time, and L_boot is the number of HMULTs between bootstrappings. Figure 6(b) shows the amortized FHE-mult time of our best-performing and baseline GPU implementations with various dnum for a given security level. In both implementations, the amortized time becomes minimal when dnum = 4 (24.35 ms and 43.5 ms for the optimized and baseline implementations, respectively). However, our optimized implementation largely shrinks the performance gap between the optimal dnum and a larger one (dnum = 8), as its memory-centric optimizations significantly relieve the memory bandwidth pressure. This makes high dnum values more attractive for those who want more security with bootstrapping-aware parameter sets.
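The amortized-time metric can be sketched as follows (the numbers below are illustrative placeholders, not measurements from the paper):

```python
# Amortized FHE-mult time: T_mult + T_boot / L_boot
# (a sketch; arguments are illustrative placeholders, not measured values).
def amortized_mult_time(t_mult_ms: float, t_boot_ms: float, l_boot: int) -> float:
    return t_mult_ms + t_boot_ms / l_boot

# A parameter set leaving more levels after bootstrapping (larger L_boot)
# amortizes an expensive bootstrapping over more mults.
example = amortized_mult_time(t_mult_ms=10.0, t_boot_ms=400.0, l_boot=20)
```

This makes the dnum trade-off explicit: a larger dnum tends to raise T_mult and T_boot via more Inner-product memory traffic, while the parameter set determines how many mults L_boot each bootstrapping buys.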
Our optimized implementation of CKKS outperforms the baseline by 1.68×, 1.78×, 1.79×, and 2.46× for dnum values of 2, 3, 4, and 8, respectively. By adopting generalized key-switching as well as memory-centric optimizations, the amortized FHE-mult time of our optimized GPU implementation even outperforms the single-mult time of PrivFT [BHM+20], which does not support bootstrapping.

Performance of logistic regression: Our optimized GPU implementation of CKKS achieves a similar degree of performance improvement in complex applications requiring bootstrapping. As a target application, we evaluated the execution time of training a logistic regression model. We used the methodology of [HHCP19], which trains a binary classification model with non-RNS CKKS [CKKS17]; we used the same dataset (the subset of MNIST labeled as 3 and 8), learning rate, weight-update algorithm, and circuit that approximates the Sigmoid function as a cubic polynomial.
In [HHCP19], they exploited the gradient descent algorithm to find a weight vector w that minimizes the negative log-likelihood NL(w) = −(1/m) · log P(w), where m is the batch size, P(w) = ∏_{i=1}^{m} p_w(x_i)^{y_i} · (1 − p_w(x_i))^{1−y_i}, and p_w(x_i) = S(IP(w, (1, x_i))). Here (1, x_i) is the concatenation of 1 and the vector x_i, IP(w, (1, x_i)) is the inner product of the two vectors, and S(x) is the Sigmoid function, which is approximated as a cubic polynomial [HHCP18].

Given a learning rate α and the gradient ∇_w NL(w), training is an iterative process that updates the weight w to w′ = w − α · ∇_w NL(w). We packed the training data into ciphertexts such that each sample of the training data is packed into a ciphertext, with one ciphertext holding (N/2)/f′ training samples, where f′ is the smallest power of two larger than the number of features f. As in [HHCP19], we compress each patch of 2×2 pixels of the MNIST dataset [Lec98] into its arithmetic mean, resulting in f = (28/2) × (28/2) = 196.
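The update rule above can be sketched in plaintext (i.e., without encryption), mirroring the structure of the homomorphic circuit; the cubic coefficients below are illustrative placeholders, not the ones of [HHCP18]:

```python
import numpy as np

# Plaintext sketch of one training iteration: gradient descent on NL(w)
# with the Sigmoid replaced by a low-degree polynomial, as in the circuit.
def sigmoid_cubic(x):
    # Illustrative cubic approximation of S(x); not the paper's coefficients.
    return 0.5 + 0.15 * x - 0.0015 * x ** 3

def train_step(w, X, y, lr):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1 -> (1, x_i)
    p = sigmoid_cubic(Xb @ w)                       # p_w(x_i)
    grad = Xb.T @ (p - y) / len(y)                  # gradient of NL(w)
    return w - lr * grad                            # w' = w - alpha * grad
```

Every operation here (inner products, the cubic polynomial, the averaging) maps to HMULT/CMULT/HADD/HROTATE on packed ciphertexts, which is why the circuit's multiplicative depth per iteration is small and fixed.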
We apply our implementation of CKKS to the circuit of [HHCP19], which exploits non-RNS CKKS, resulting in a multiplicative depth of 5 per training iteration. We use a parameter set of N = 2^17, Δ = 2^51, log PQ = 2395, L = 35, and dnum = 3. Because the level in our parameter set is 35 and each bootstrapping consumes 16 levels and leaves 19, a bootstrapping is required every 3 iterations. Table 4 shows the execution time of our 8-threaded CPU and best-performing GPU implementations for a single iteration of training the binary classification model; we report the amortized time per iteration (i.e., the total learning time divided by the number of iterations). We achieve 96.4% accuracy and 0.99 AUROC (area under the receiver operating characteristic curve) with 30 iterations, the same as in [HHCP19]. Our GPU implementation outperforms the CPU implementation by 40.0× in total, resulting in a sub-second iteration per mini-batch, whereas [HHCP19] reported an execution time of 4 minutes per iteration with an 8-threaded CPU and a non-RNS CKKS implementation [Cry20].

An open-source library called cuFHE [Ver18] implements the TFHE scheme of [CGGI16, CGGI17], including TFHE bootstrapping, which is the fastest bootstrapping but accommodates at most a few bits of plaintext per ciphertext. Benchmarking on the GPU used throughout our evaluation, we measure 0.5 ms of bootstrapping time per binary gate in cuFHE. Our bootstrapping implementation, as shown in Table 3, exhibits 8 us of per-slot bootstrapping time, a 62.5× speedup.

Conclusion
In this paper, we have demonstrated the first GPU implementation of bootstrapping for CKKS. We showed that the main-memory bandwidth bottleneck is the key challenge in accelerating FHE operations with GPUs. We also found that the decomposition number (dnum) in an FHE parameter set significantly impacts the performance of FHE operations; raising dnum significantly increases both the main-memory accesses and the capacity required for Inner-product in key-switching. Based on this observation of the high memory bandwidth requirement, we devised memory-centric optimizations for our GPU implementation, such as kernel fusion and a proper choice of dnum. Also, by applying our batching plaintext-ciphertext MAC in bootstrapping, we obtained 8 us of per-slot bootstrapping time with 19-bit precision, which corresponds to a 257× speedup over the single-threaded CPU implementation. Finally, we showed the effectiveness of our solutions by training a logistic regression model, which achieves a 40.0× speedup with our single-GPU implementation compared to the 8-threaded CPU implementation.

A Matrix Decomposition in CoefficientToSlot/SlotToCoefficient
The matrix decomposition of V in Subsection 3.3 with the Cooley-Tukey algorithm [CT65] in the previous work [HHC19] is as follows. For the bit-reversing (permutation) matrix R, let V_rev = VR. With a radix of 2, V_rev factors into sparse matrices V_i^{(2)} built from blocks of I_n and W_n, where I_n is an n × n identity matrix and W_n is an n × n diagonal matrix whose (i, i)-th element is ω_{4n}^{5^i} for 0 ≤ i < n. With this decomposition, each V_i^{(2)} has two (i = 1) or three (i ≠ 1) diagonals. A decomposition with a higher radix r > 2 is obtained by multiplying consecutive matrices together; each resulting matrix V_i^{(r)} has r (when i = 1) or 2r − 1 (i ≠ 1) diagonals. For the complete proof, please refer to [HHC19].
Then, SlotToCoeff and CoeffToSlot directly use the bit-reversed matrix V_rev for their evaluations. CoeffToSlot is modified to output bit-reversed results t̄_1 and t̄_2. In the same way, SlotToCoeff takes the bit-reversed inputs but outputs a non-bit-reversed result because the inputs are reversed again: z = V_rev(t̄_1 + i · t̄_2).

B GPU implementation of NTT and fast basis conversion
We describe the 8 per-thread NTT implementation [KJPA20] that we use throughout this paper. Because iNTT is symmetric to NTT, we only show the case of NTT. Figure 7(a) and (b) show the data access patterns of the two kernels composing an NTT execution. In each of the two kernels, a thread block loads R_1 or R_2 residues selected from the N residues corresponding to a prime. The residues loaded into the register file can be considered a 3-D data cube (Figure 7(c)), where an index of the residues, idx, maps to (x, y, z) = (idx mod (R/64), (idx mod (R/8)) / (R/64), idx / (R/8)), so that the x dimension comes first. The thread block, which consists of R/8 threads, performs a radix-R NTT on the loaded data cube, where each thread performs an NTT of size 8 three times, with thread-block synchronization for storing outputs to shared memory between the steps.
Concretely, the thread block performs NTT along the z axis first (using the twiddle factors in main memory), stores the outputs to shared memory, synchronizes, loads the stored data from shared memory back into the register file, and repeats the process for the y and x axes. Because each thread performs up to an 8-point NTT at a time, we call this 8 per-thread NTT.
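The cube index mapping can be sketched as follows (an illustrative check of the mapping above, with R = 512 as a hypothetical radix):

```python
# Sketch of the 3-D data-cube index mapping for a radix-R per-thread NTT:
# idx -> (x, y, z) = (idx mod R/64, (idx mod R/8) / (R/64), idx / (R/8)).
def cube_coords(idx: int, R: int):
    return (idx % (R // 64), (idx % (R // 8)) // (R // 64), idx // (R // 8))

R = 512  # illustrative radix; each of the R/8 = 64 threads handles 8 points
coords = [cube_coords(i, R) for i in range(R)]
```

For R = 512 this maps the R residues one-to-one onto an 8 x 8 x 8 cube, so each of the three per-axis passes is an independent 8-point NTT per thread.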
Algorithm 12 shows the implementation of the fast basis conversion on GPU. It consists of two kernels, where the index of the data that a thread processes (idx) is determined by the index of the corresponding thread block (blockIdx.x), the dimension of the block (blockDim.x), and the thread index inside the block (threadIdx.x).

C Details on CKKS subroutines
This section provides supplementary explanations for the subroutines referred to in the main body of the paper to enhance the understanding of the CKKS scheme, along with the required mathematical background.

C.1 Table for Notation in HE
Table 5 tabulates the symbols and their descriptions used throughout this paper, including the form of a ciphertext, the sampling notation a ←$ S (a is (uniformly) sampled from the distribution or set S), and the key-switching key evk of the form (evk_0, . . . , evk_{dnum−1}).

C.2 RLWE_{C_i}(s, m)
CKKS uses RLWE instances for key generation and encryption. For example, to encrypt a plaintext polynomial m(X) ∈ R_Q, CKKS first generates (a(X), b(X)), a pair of elements of R_Q, where a(X) is sampled uniformly from R_Q, b(X) = a(X) · s(X) + e(X), and e(X) ←$ χ_err. After generating the pair, we complete the encryption by adding m(X) to b(X). Here, e(X) ∈ R_Q, and since the absolute value of a coefficient of e(X) is at most about several tens with very high probability [CKKS17], the bit lengths of e(X)'s coefficients are very short. s(X) ∈ R_Q is the secret key, and only with it can m(X) + e(X) be recovered through (a(X), b(X)) · (−s(X), 1). We can use the CRT to convert an element of R_Q into one of ∏_{j=0}^{ℓ} R_{q_j}. By further applying NTT, it can finally be expressed as an element of ∏_{j=0}^{ℓ} (Z_{q_j})^N; these relations are preserved through CRT and NTT. The algorithm RLWE_{C_i}(s, m) in the main body of the text generates an RLWE instance in the NTT domain, (a, b) ∈ (∏_{j=0}^{ℓ} (Z_{q_j})^N)^2, with s and m, which are the secret key s(X) and the message m(X) in the NTT domain, respectively; the output (a, b) can be converted from and to (a(X), b(X)) in (R_Q)^2.
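The encryption and decryption described above can be sketched with insecure toy parameters (all names and parameters here are illustrative, not those of CKKS):

```python
import random

# Toy RLWE encryption in R_Q = Z_Q[X]/(X^N + 1) illustrating the description
# above (insecure toy parameters; for exposition only).
N, Q = 8, 12289

def poly_mul(a, b):
    # Negacyclic multiplication modulo X^N + 1: X^N wraps around as -1.
    res = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                res[k] = (res[k] + ai * bj) % Q
            else:
                res[k - N] = (res[k - N] - ai * bj) % Q
    return res

def encrypt(m, s, rng):
    a = [rng.randrange(Q) for _ in range(N)]        # a(X) sampled from R_Q
    e = [rng.randrange(-2, 3) for _ in range(N)]    # small error e(X)
    b = [(x + y) % Q for x, y in zip(poly_mul(a, s), e)]
    return a, [(bi + mi) % Q for bi, mi in zip(b, m)]  # add m(X) to b(X)

def decrypt(a, b, s):
    # (a, b) . (-s, 1) = b - a*s = m + e
    return [(bi - ti) % Q for bi, ti in zip(b, poly_mul(a, s))]
```

Decryption recovers m(X) + e(X), matching the description that the message survives up to the short error e(X).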

C.3 FrobeniusMap(a,n)
This algorithm performs the Frobenius map in the NTT domain for HROTATE. In CKKS, a vector z of complex numbers of dimension M ≤ N/2 can be converted into a plaintext polynomial m(X) = ⌊Δ · τ^{−1}(z)⌉, where τ maps a polynomial m_Q to the vector (m_Q(ζ_i))_i, z = (z_i)_{i∈[0,M−1]}, ζ_i = ζ^{5^i}, ζ = exp(−2πi/4M), and ⌊Δ · ⌉ : a ↦ ⌊Δ · a⌉. In this case, ζ_M = ζ_0 = ζ. We define the value of the i-th slot as the result obtained by substituting ζ_i for X in m_Q(X). Now consider the slot rotation operation. For example, assuming that a rotation by n slots is performed on m_Q(X), the resulting polynomial m′_Q(X) ∈ Q[X]/(X^N + 1) should satisfy m_Q(ζ_i) = m′_Q(ζ_{i+n}); that is, the value in the (i + n)-th slot of m′_Q(X) should be the same as the one in the i-th slot of m_Q(X). Because ζ_{i+n} = ζ^{5^{i+n}} = ζ_i^{5^n}, we need to compute m′_Q(X) = m_Q(X^{5^{−n}}) from m_Q(X).
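The substitution X → X^k underlying this rotation can be sketched on coefficient vectors (our own illustrative code; the odd exponent k = 5 below stands in for 5^{−n} mod 2N):

```python
import cmath

# Sketch of the substitution X -> X^k on coefficients modulo X^N + 1 (k odd).
# Each monomial X^j maps to X^(jk mod 2N), with a sign flip when the reduced
# exponent lands in [N, 2N), since X^N = -1 in this ring.
def substitute(coeffs, k, N):
    res = [0] * N
    for j, c in enumerate(coeffs):
        t = (j * k) % (2 * N)
        if t < N:
            res[t] += c
        else:
            res[t - N] -= c
    return res

N = 8
m = [1, 2, 3, 4, 5, 6, 7, 8]
m2 = substitute(m, 5, N)  # coefficients of m(X^5) mod X^N + 1
```

Because the map only permutes coefficients and flips signs, it is essentially free compared to a polynomial multiplication, which is why rotations reduce to this substitution plus a key switch.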

C.4 Conv_{S_{C_L}∪S_B → S′_{C_L}∪S′_B}([a(X)]_{S_{C_L}∪S_B})
This subroutine refers to the fast basis conversion algorithm [CHK+19, BEHZ16]. Given the results of modular operations with the prime numbers in S_{C_L} ∪ S_B for a polynomial a(X), the purpose of the algorithm is to produce the results of modular operations on the coefficients of a(X) with the different prime numbers in S′_{C_L} ∪ S′_B.
More precisely, given [a(X)]_{S_{C_L}∪S_B} as an element of ∏_{q_i∈S_{C_L}} R_{q_i} × ∏_{p_j∈S_B} R_{p_j}, the algorithm converts it into an element [a(X)]_{S′_{C_L}∪S′_B} of ∏_{q_i∈S′_{C_L}} R_{q_i} × ∏_{p_j∈S′_B} R_{p_j} with the prime numbers given in S′_{C_L} ∪ S′_B.
To obtain the result above, one could simply apply iCRT to the coefficients of [a(X)]_{S_{C_L}∪S_B} and then perform modular reduction on the coefficients with each of the prime moduli of the target rings. However, if the number of target rings is large, many long-precision operations must be conducted, which incurs a huge computation cost. For example, if iCRT is performed on (a_{q_j}(X))_{q_j∈C_ℓ}, an element of R_{C_ℓ}, the result A(X) ∈ R can be calculated with the following formula: A(X) = (∑_{j=0}^{ℓ} [a_{q_j}(X) · q̂_j^{−1}]_{q_j} · q̂_j) mod Q_ℓ, (3) where q̂_j = ∏_{i∈[0,ℓ]∧i≠j} q_i. The bit length of q̂_j is much larger than that of the single- or double-word primes, increasing the computation cost. The fast basis conversion algorithm directly retrieves [a(X)]_{S′_{C_L}∪S′_B} by computing the following for each target prime q_i: (∑_{j=0}^{ℓ} (((q̂_j^{−1} mod q_j) · [a(X)]_{q_j}) mod q_j) · (q̂_j mod q_i)) mod q_i. (4) The algorithm uses the fact that only the prime numbers in a specific set are used to represent the polynomial rings used in CKKS; the primes used are all elements of C_L ∪ B. Thus, if we pre-compute [Q_j^{−1}]_{q_j} and [Q_j]_{q_j} for all q_j ∈ C_L, and [Q_j^{−1}]_{p_j} and [Q_j]_{p_j} for all p_j ∈ B, in advance, we obtain the result with much less computation than with iCRT. Lines 2-4 of Algorithm 1 of the main body of this paper are the main computation steps of the above formula. Because the bit lengths of [q̂_j^{−1}]_{q_j} and [q̂_j]_{q_i} are at most that of a single prime number, these steps can be performed efficiently.
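Equation 4 can be sketched for a single coefficient (a scalar version with small illustrative moduli, not the paper's word-sized NTT primes):

```python
from math import prod

# Sketch of the fast basis conversion of Equation 4 for one coefficient.
# Returns a value congruent to a + u*Q modulo the target prime, for a small
# non-negative integer u (the approximation error discussed in the text).
def fast_base_conv(residues, qs, target):
    total = 0
    for aj, qj in zip(residues, qs):
        q_hat = prod(qs) // qj               # \hat{q}_j
        q_hat_inv = pow(q_hat, -1, qj)       # [\hat{q}_j^{-1}]_{q_j}
        total += ((q_hat_inv * aj) % qj) * (q_hat % target)
    return total % target

qs = [97, 101, 103]                           # illustrative source moduli
a = 123456
residues = [a % q for q in qs]
conv = fast_base_conv(residues, qs, 251)      # convert to target prime 251
```

Note that every intermediate value stays the size of a single modulus or a product of two, which is exactly how the algorithm avoids the long-precision arithmetic of iCRT.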
In Equation 4, the 'mod p_i' operation is performed without the last 'mod Q_ℓ' of Equation 3. Thus, the result differs from what would be obtained by reducing the result of Equation 3 modulo p_i. However, since the calculation result is later returned to R_{Q_ℓ} through ModDown, this error can be ignored.

C.5 ModUp_{C_i→D_β}([a]_{C_i})
This algorithm takes [a]_{C_i} ∈ ∏_{j=iα}^{iα+α−1} (Z_{q_j})^N as input and, using the fast basis conversion, changes its basis to D_β, which includes the existing basis C_i: [a]_{D_β} ∈ ∏_{j=0}^{αβ−1} (Z_{q_j})^N × ∏_{j=0}^{k−1} (Z_{p_j})^N. The input [a]_{C_i} is one of the decomposed parts of [a]_{C_ℓ} after the RNS decomposition described later.
Specifically, in the basis conversion, Conv_{C_i→D_β−C_i}([a]_{C_i}) is performed and the result is concatenated with the input [a]_{C_i} to obtain the final result. Since Conv operates only on a polynomial in coefficient-wise representation, iNTT is performed to convert [a]_{C_i} into [a(X)]_{C_i} ∈ ∏_{q_j∈C_i} R_{q_j} before executing Conv.

C.6 ModDown_{D_β→C_ℓ}(b̃^(0), b̃^(1), · · · , b̃^(k+αβ−1))

Suppose P = ∏_{r_i∈D_β−C_ℓ} r_i and Q_ℓ = ∏_{r_i∈C_ℓ} r_i. In this paper, the elements of D_β − C_ℓ are expressed in the form of either p_i or q_j; however, we use r_i instead to simplify the expression.
If iNTT and iCRT are executed on the coefficients of the input b̃^(0), b̃^(1), · · · , b̃^(k+αβ−1), it can be expressed as [b(X)]_{PQ_ℓ} ∈ R_{PQ_ℓ}. ModDown essentially computes an approximation of P^{−1} · b(X), but in the RNS domain (F): it performs the formula on the RNS representation ([b(X)]_{r_i})_{r_i∈D_β} ∈ ∏_{r_i∈D_β} R_{r_i}, which approximately corresponds to [b(X)]_{PQ_ℓ} after applying CRT.
Please be aware that in the description of ModDown in Algorithm 4, NTT is executed before executing (F). Since Equation (F) still holds when we replace X by ω_j (j ∈ [0, N − 1]), it is possible to execute (F) after applying NTT; thus, we obtain b̃′^(i) = ([b′(ω_j)]_{r_i})_{j∈[0,N−1]} for each r_i ∈ C_ℓ.

C.7 RNS Decomposition (Dcomp(d) = (d^(0), · · · , d^(ℓ)))

Recall Q_ℓ = ∏_{i=0}^{ℓ} q_i. Suppose there are two elements A and B of Z_{Q_ℓ}. We can calculate Z = ∑_{i=0}^{ℓ} ((A · q̂_i^{−1} mod q_i) · (q̂_i · B)), where q̂_i = ∏_{j∈[0,ℓ]∧j≠i} q_j. We can see that Z mod q_i = A · B mod q_i is satisfied for all q_i (i ∈ [0, ℓ]). Therefore, Z mod Q_ℓ = AB mod Q_ℓ by the CRT.

The above relationship also holds for A(X), B(X) ∈ R_{Q_ℓ} and Z(X) ∈ R_{Q_ℓ}, which correspond to A, B, and Z, respectively. Since we are using an RNS variant of CKKS, suppose that ((A(X) · q̂_i^{−1}) mod q_i) · (q̂_i · B(X)) is computed in ∏_{j=0}^{ℓ} R_{q_j} after applying CRT to A(X) and B(X). Then the term q̂_i · B(X) mod q_j becomes zero for all j ≠ i. Eventually, the relationship above is satisfied in the RNS domain as well.