Revisiting the functional bootstrap in TFHE

The FHEW cryptosystem introduced the idea that an arbitrary function can be evaluated within the bootstrap procedure as a table lookup. The faster bootstraps of TFHE strengthened this approach, which was later named Functional Bootstrap (Boura et al., CSCML'19). From then on, little effort has been made towards defining efficient ways of using it to implement functions with high precision. In this paper, we introduce two methods to combine multiple functional bootstraps to accelerate the evaluation of reasonably large look-up tables and highly precise functions. We thoroughly analyze and experimentally validate the error propagation in both methods, as well as in the functional bootstrap itself. We leverage the multi-value bootstrap of Carpov et al. (CT-RSA'19) to accelerate (single) lookup table evaluation, and we improve it by lowering the complexity of its error variance growth from quadratic to linear in the value of the output base. Compared to previous literature using TFHE's functional bootstrap, our methods are up to 2.49 times faster than the lookup table evaluation of Carpov et al. (CT-RSA'19) and up to 3.19 times faster than the 32-bit integer comparison of Bourse et al. (CT-RSA'20). Compared to works using logic gates, we achieved speedups of up to 6.98, 8.74, and 3.55 times over 8-bit implementations of the functions ReLU, Addition, and Maximum, respectively.


Introduction
The efficient evaluation of non-linear functions with high precision is a challenge for homomorphic encryption schemes, and many of them rely on arithmetic approximations, such as Taylor, Fourier, and Chebyshev series [BGGJ19, CCS19, KWN20]. These approximations allow the schemes to work with packed messages in a SIMD manner [BGH13], which greatly reduces the amortized cost of operations. However, the cost of implementing such approximations grows exponentially with the desired precision [LJ19], which makes this approach unfit for many applications. Other schemes implement circuits using binary gates [CGGI20], which are a versatile and straightforward way of achieving good precision. Their main disadvantage is the low throughput of operations, which leads to scalability problems.
All currently known fully homomorphic encryption (FHE) schemes rely on noisy ciphertexts for security, i.e., the encryption process adds a small error (noise) to the message. This error grows when performing arithmetic, and, eventually, it might affect significant bits of the message. To preserve the message, FHE schemes reset the error from time to time during the circuit evaluation by using a bootstrap procedure. It is usually an expensive process, but schemes implementing circuits using logic gates may enable very fast bootstraps at the cost of performing one at every gate. In some of them, it is also possible to implement the evaluation of arbitrary functions within the bootstrap, which usually results in more efficient implementations [CJP20].
A bootstrap procedure that evaluates a function (other than an ordinary reset of error) is called Functional Bootstrap [BGGJ19]. In FHEW-like cryptosystems [DM15], the bootstrap is just a rounding function evaluated using a lookup table (LUT). Thus, to evaluate an arbitrary function, we only need to replace the bootstrap LUT with one that encodes it. This idea was introduced with the FHEW cryptosystem [DM15, MP20] and has been used by a few applications in the literature. We can refer, for example, to FHE-DiNN [BMMP18] and the work of Izabachène et al. [ISZ19], which both implement the sign function using the functional bootstrap of the TFHE cryptosystem [CGGI20]. These applications can perform a functional bootstrap in just tens of milliseconds thanks to the low precision (1 bit) required by the sign function. Applications requiring higher precision, on the other hand, need to increase the parameters of the cryptosystem to keep the evaluation correct, which leads to deteriorated performance. For example, a 6-bit-to-6-bit LUT takes 1.5 seconds to be evaluated using TFHE's functional bootstrap, as implemented by Carpov et al. [CIM19].

Contributions.
We show how to evaluate functions with high precision using multiple functional bootstraps, but without increasing (too much) the parameters of the cryptosystem. This approach results in a much smaller impact on performance. The following contributions are presented in this work.
• We introduce two new methods to combine multiple functional bootstraps in TFHE:
  - A tree-based one that allows the easy implementation of arbitrary functions, as well as a tree optimization based on particular properties of each function. We leverage the multi-value bootstrap of Carpov et al. [CIM19] to lower the number of bootstraps in this method (asymptotically) from exponential to linear in the size of the input.
  - A chaining one that presents a better error rate growth behavior, but which is more intricate to implement depending on the target function.
• We perform an error variance analysis, including experimental validation, and a comparison between the aforementioned methods.
• We present optimizations to the building blocks used in our methods, which are also contributions of independent interest.
  - We introduce a multi-value extract procedure that produces multiple LWE samples encrypting the same value with independent errors. It enables improving the error growth on ciphertext scaling from quadratic to linear with little performance overhead. It also improves the error variance growth in the multi-value bootstrap [CIM19] from quadratic to linear in the output base.
  - We introduce a "base-aware" Key Switching to pack $B < N$ LWE samples in an RLWE sample, where $N$ is the polynomial size. In this work, $B$ is the base of our integer encoding (thus, "base-aware"), but the technique enables gains of up to $N/B$ times for any $B < N$.
• We present implementations of several relevant functions and compare their performance with state-of-the-art implementations from the literature.
Lookup tables are an important tool in the implementation of arbitrary functions in homomorphic circuits. They are as versatile as logic gates while capable of providing better throughput of operations. Our methods speed up their evaluation by up to 2.49 times, and, for specific functions, we also show optimization opportunities over the generic LUT evaluation. Compared to implementations using logic gates, we achieve speedups of up to 8.74 times in simple and useful functions, such as integer addition.
This paper is organized as follows: Section 2 reviews the basics of the TFHE cryptosystem with a focus on its functional bootstrap; Section 3 introduces the methods to combine functional bootstraps, analyzes their error variance behavior, and presents optimizations in their building blocks; Section 4 presents a performance analysis of our methods, including comparison with the literature; Section 5 summarizes the related work; Section 6 concludes the paper.

The TFHE Cryptosystem
TFHE is a fully homomorphic cryptosystem with security based on the (Ring) Learning With Errors problem [Reg09]. It is based on the FHEW cryptosystem [DM15], but it features much faster bootstraps thanks to the use of binary secrets [MP20]. In this section, we review the concepts of TFHE necessary for the understanding of this paper.
Let $A$ be a set. We denote by $A^n_q$ the set of vectors with $n$ elements in $A$ modulo $q$ and by $A_N[X]^n$ the set of vectors of $n$ polynomials modulo $(X^N + 1)$. If omitted, $n = 1$ and $q = \infty$. The Real Torus $\mathbb{T} = \mathbb{R}/\mathbb{Z}$ is the set of real numbers modulo 1 and $\mathbb{B} = \mathbb{Z}_2$ is the set of binary numbers $\{0, 1\}$. TFHE defines three types of ciphertexts, which we summarize below as samples of zero.
• TLWE Sample: A pair $(a, b) \in \mathbb{T}^{n+1}$, where $b = \langle a, s \rangle + e$. The vector $a$ is uniformly sampled from $\mathbb{T}^n$, the secret key $s$ is uniformly sampled from $\mathbb{B}^n$, the error $e \in \mathbb{T}$ is sampled from a Gaussian distribution with mean 0 and standard deviation $\sigma$, and $\langle \cdot, \cdot \rangle$ denotes the inner product.
• TRLWE Sample: A pair $(a, b) \in \mathbb{T}_N[X]^{k+1}$, where $b = \langle a, S \rangle + e$. The vector $a$ is uniformly sampled from $\mathbb{T}_N[X]^k$, the secret key $S$ is uniformly sampled from $\mathbb{B}_N[X]^k$, and the error $e \in \mathbb{T}_N[X]$ is a polynomial with random coefficients sampled from a Gaussian distribution with mean 0 and standard deviation $\sigma$.

Encryption
To encrypt a message $m \in \mathbb{T}$ (TLWE) or $m \in \mathbb{T}_N[X]$ (TRLWE), we simply add $(0, m)$ to a fresh sample of zero. We denote by $c \in \mathrm{T(R)LWE}_s(m)$ the T(R)LWE sample $c$ that encrypts $m$ with key $s$. To ease the notation, we consider each key has its attached set of parameters. A message $m \in \mathbb{T}_N[X]$ can also be encrypted in TRGSW samples by adding $m \cdot H$ to a TRGSW sample of zero, where $H$ is a gadget decomposition matrix. We do not use TRGSW samples in our algorithms and, therefore, we will get into further details about them only when necessary.

Decryption
To decrypt a sample, we first calculate its phase (message + error): $\varphi(c) = b - \langle a, s \rangle$. Considering approximate computing, the phase might be a good enough approximation for the message (depending on the error variance). For exact computing, we need to remove the error, and we do so by rounding the phase to the nearest valid value for messages. This requires us to define a set of valid messages over the Torus. The rounding procedure fails if the error is greater than half the distance (in the Torus) between two consecutive messages.

Arithmetic
We add two ciphertext (TLWE or TRLWE) samples $c_1 = (a_1, b_1)$ and $c_2 = (a_2, b_2)$ by simply adding their terms: $c_1 + c_2 = (a_1 + a_2, b_1 + b_2)$. The multiplication between a ciphertext $c_1 = (a_1, b_1)$ and a scalar cleartext $z \in \mathbb{Z}$ (or $z \in \mathbb{Z}_N[X]$ for TRLWE) is a direct result from the addition: $c_1 \cdot z = (a_1 \cdot z, b_1 \cdot z)$. TFHE does not support multiplications between T(R)LWE samples. To be fully homomorphic, it relies on external products between TRGSW and TRLWE samples. In this case, we first decompose the TRLWE sample into a vector of TRLWE samples using a Gadget decomposition algorithm [GMP19]. Then, we perform an inner product between the decomposed TRLWE and the TRGSW sample (which already is a vector of TRLWE samples).
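The encryption, phase, and linear arithmetic described above can be illustrated with a toy TLWE model over floating-point numbers modulo 1. This is only a sketch for intuition: the function names, the key size, and the noise level are assumptions made for the example, and the code offers no security.

```python
import random

def tlwe_zero(s, sigma, rng):
    """Fresh TLWE sample of zero: (a, b) with b = <a, s> + e (mod 1)."""
    a = [rng.random() for _ in s]
    e = rng.gauss(0.0, sigma)
    b = (sum(ai * si for ai, si in zip(a, s)) + e) % 1.0
    return a, b

def encrypt(m, s, sigma, rng):
    """Add (0, m) to a fresh sample of zero."""
    a, b = tlwe_zero(s, sigma, rng)
    return a, (b + m) % 1.0

def phase(c, s):
    """phi(c) = b - <a, s> (mod 1), i.e. message + error."""
    a, b = c
    return (b - sum(ai * si for ai, si in zip(a, s))) % 1.0

def add(c1, c2):
    """Component-wise addition of two samples."""
    return [(x + y) % 1.0 for x, y in zip(c1[0], c2[0])], (c1[1] + c2[1]) % 1.0

def scale(c, z):
    """Multiplication by a cleartext integer z."""
    return [(x * z) % 1.0 for x in c[0]], (c[1] * z) % 1.0

def decrypt(c, s, num_messages):
    """Round the phase to the nearest multiple of 1/num_messages."""
    return round(phase(c, s) * num_messages) % num_messages

# Toy parameters (hypothetical, far too small to be secure).
rng = random.Random(1)
key = [rng.randrange(2) for _ in range(16)]
c1 = encrypt(1 / 8, key, 1e-4, rng)
c2 = encrypt(2 / 8, key, 1e-4, rng)
```

Here `decrypt(add(c1, c2), key, 8)` recovers the sum of the two messages as long as the accumulated error stays below half the distance between consecutive messages.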

Building Blocks of TFHE
TFHE has three algorithms as its main building blocks, which we briefly review in this section. Following the notation defined by Chillotti et al. [CGGI20], we use underbars and overbars to indicate input and output variables, respectively. The first building block is the Public Functional Key Switching, shown in Algorithm 1. It allows the switching of keys and parameters from TLWE to T(R)LWE samples, such as the packing of TLWE samples in a TRLWE sample. It also evaluates a linear function f over the input TLWEs.
Chillotti et al. [CGGI20] define this algorithm with binary decomposition, as Algorithm 1 shows, but TFHE supports other power of 2 bases. We analyze the impact of changing the decomposition base on the error growth in Section 3.3.

Algorithm 1: TFHE's TLWE-to-T(R)LWE Public Functional Key Switching [CGGI20]
Input: a precision parameter $t \in \mathbb{Z}$
Input: a Key Switching key $KS_{i,j} \in \mathrm{T(R)LWE}_s($

The second building block is the SampleExtract$_p(c)$ algorithm, which receives a TRLWE sample $c \in \mathrm{TRLWE}_S(m)$ and a position $p$

The last building block is the BlindRotate algorithm, shown in Algorithm 2. It rotates the polynomial encrypted in a TRLWE sample by an encrypted number. This is the core procedure for the evaluation of look-up tables (LUTs) in TFHE, introduced in the vertical packing by Chillotti et al. [CGGI20]. Suppose we want to look up the number $s \in \mathbb{Z}$ in the LUT $A$, both encrypted. To use BlindRotate directly, $s$ needs to be binary decomposed ($s = \sum_{i=0}^{n-1} s_i \cdot 2^i$) and encrypted in $C_i \in \mathrm{TRGSW}_S(s_i)$, for $i \in [[1, n]]$. Each position of $A$ is packed in a coefficient of a Torus polynomial in $\mathbb{T}_N[X]$ and encrypted in a TRLWE sample. By calling BlindRotate$(A, (2^0, 2^1, \ldots, 2^n, 0), (C_0, C_1, \ldots, C_n))$, the position we want to look up in $A$ is moved to the constant term of the TRLWE sample, and we can use SampleExtract$_0(c)$ to obtain its LWE encryption. It is important to note that the multiplication $X^{a_i} \cdot ACC$ occurs modulo the cyclotomic polynomial $\Phi_{2N} = (X^N + 1)$ and, hence, presents a negacyclic property, e.g. $X^N \cdot ACC \bmod \Phi_{2N} = -ACC$.
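The negacyclic property can be checked on a cleartext polynomial. The sketch below is only an illustration of multiplication by a power of $X$ modulo $X^N + 1$ (the name `rotate` is hypothetical); it does not model the encrypted BlindRotate.

```python
def rotate(poly, a):
    """Multiply poly (list of N coefficients) by X^a modulo X^N + 1."""
    n = len(poly)
    a %= 2 * n                      # X^(2N) = 1 modulo X^N + 1
    out = [0] * n
    for i, coeff in enumerate(poly):
        wraps, j = divmod(i + a, n)
        # Each wrap past degree N - 1 flips the sign, since X^N = -1.
        out[j] = -coeff if wraps % 2 else coeff
    return out

acc = [1, 2, 3, 4]                  # 1 + 2X + 3X^2 + 4X^3, with N = 4
```

Rotating by a negative amount moves a higher coefficient to the constant term, which is how a lookup position is selected before the extraction.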

Bootstrap
There are two bootstrap algorithms in TFHE. Algorithm 3 shows the Gate Bootstrap, which was introduced for the implementation of logic gates and is supposed to be used on samples representing binary numbers. Torus values are in the interval $(-0.5, 0.5]$ in the signed representation. Zero is usually represented by $-\frac{1}{4}$, while one is represented by $+\frac{1}{4}$. The arithmetic of each logic gate implementation changes these values, but it keeps their sign. Then, TFHE uses the Gate Bootstrap with $\mu = \frac{1}{4}$ to set the value back to $+\frac{1}{4}$ or $-\frac{1}{4}$ (plus a small bootstrap error), depending on the bit value. For example, the NAND gate between TLWE samples

The second bootstrap algorithm in TFHE is the Circuit Bootstrap, which converts TLWE samples to TRGSW samples. This bootstrap is 10 times more expensive than the Gate Bootstrap, but it enables fully composable CMUX circuits. Since it requires a 64-bit Torus precision, it is implemented only in the experimental branch of TFHE, which we do not use in this work.

Functional Bootstrap in TFHE
Similarly to the Functional Key Switching, the bootstrap can be said to be "Functional" if it maps a set of inputs to a codomain in a programmatic manner, i.e., it can evaluate functions. In TFHE, the bootstrap uses the BlindRotate algorithm to perform a lookup table (LUT) evaluation. LUTs are a very simple yet efficient way of evaluating discretized functions. They encode functions by storing discretized elements of their images, and the evaluation is performed by selecting a table position based on the function parameter, which is the lookup Selector. The gate bootstrap of TFHE is the evaluation of a 1-bit lookup table (LUT) encoding the rounding function. The test vector $(0, v)$ (line 2 of Algorithm 3) encodes the LUT with the value of $\mu$ and $-\mu$ (thanks to the negacyclic property), and $c$ is the selector. We can evaluate some other functions by just adjusting the value of $\mu$. One example is the sign function, as used in FHE-DiNN [BMMP18] and by Izabachène et al. [ISZ19]. To evaluate an arbitrary LUT (with more than one value and its negative), we need to increase the size of the LUT, to discretize the function according to this size, and to work only with the positive half of the Torus, since the negacyclic property can only be exploited by anti-symmetric functions.
We represent integers in the Torus by partitioning its positive half in slices. We define the Torus base, B, as the number of integers mapped, and the Torus slice as the distance between two consecutive integers in the Torus. For example, with B = 4, the size of each Torus slice is 0.125 and the map of integers is (0, 1, 2, 3) → (0, 0.125, 0.25, 0.375). This approach has been extensively used to represent bounded integers in the Torus [BMMP18, CIM19, ISZ19]. To represent unbounded integers, we decompose them in digits of the base B and encrypt each digit in a TLWE sample. This approach is more common for applications using binary digits, but it has also been used with arbitrary bases [BST20].
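The integer encoding above can be sketched in a few lines. The helper names (`encode`, `decode`, `to_digits`) are hypothetical and chosen for this illustration only; the decoding step is the rounding described in the Decryption paragraph, applied to slices of size $\frac{1}{2B}$.

```python
def encode(m, base):
    """Map an integer m in [0, base) to the positive half of the torus."""
    return m / (2 * base)

def decode(phase, base):
    """Round a phase to the nearest multiple of the torus slice 1/(2*base)."""
    return round((phase % 1.0) * 2 * base) % (2 * base)

def to_digits(x, base, d):
    """Little-endian decomposition of x into d digits of the given base."""
    return [(x // base**i) % base for i in range(d)]

slices = [encode(m, 4) for m in range(4)]   # the B = 4 example from the text
```

With $B = 4$, an error of 0.05 on the encoding of 2 still decodes correctly, while an error of 0.07 exceeds half a slice (0.0625) and decodes to 3.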

Encoding a LUT in a TRLWE sample
A single TRLWE sample $c \in \mathbb{T}_N[X]^{k+1}$ can encode at most $N$ entries of a lookup table (LUT). However, when using the bootstrap to evaluate the LUT, only a small fraction of $N$ will actually be available if the goal is perfect computation. The first step of the bootstrap is to scale each element of the ciphertext to $2N$ and round it to an integer. This process introduces a significant error variance, which is additive with the variance of the Gaussian error. To prevent these lookup errors, we map each position of a LUT to a sequence of many consecutive coefficients of the test vector and we adjust the selector to look up positions in the middle of these sequences. For example, with $B = 4$ and $N = 1024$, an integer LUT $L = [l_1, l_2, l_3, l_4] \in \mathbb{Z}_B^4$ is mapped to $\sum_{i=0}^{255} \frac{l_1}{2B} X^i + \sum_{i=256}^{511} \frac{l_2}{2B} X^i + \sum_{i=512}^{767} \frac{l_3}{2B} X^i + \sum_{i=768}^{1023} \frac{l_4}{2B} X^i$, and we add a precision offset of $\frac{1}{4B} = \frac{1}{16}$ to the selector $c$. Algorithm 4 shows the functional bootstrap in TFHE already considering the integer and LUT encoding we defined.
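A cleartext sketch may help to see why the repeated coefficients and the precision offset absorb the selector error. The functions `make_test_vector` and `lookup` are hypothetical names; the model replaces the encrypted blind rotation with direct indexing and assumes each LUT entry occupies $N/B$ consecutive coefficients.

```python
def make_test_vector(lut, n):
    """Repeat each of the B LUT entries over N/B consecutive coefficients."""
    base = len(lut)
    block = n // base
    return [lut[i // block] / (2 * base) for i in range(n)]

def lookup(test_vector, phase, base):
    """Cleartext model of the lookup: offset, scale to 2N, round, select."""
    n = len(test_vector)
    r = round(2 * n * (phase + 1 / (4 * base))) % (2 * n)
    # Negacyclic wrap: positions past N return the negated coefficient.
    return test_vector[r] if r < n else -test_vector[r - n]

lut = [3, 0, 2, 1]
tv = make_test_vector(lut, 1024)
```

In this model, a selector error of up to $\frac{1}{4B}$ in either direction still lands inside the correct block, while a larger error on the last entry wraps around and returns a negated coefficient.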

Multi Value Bootstrap
The most expensive procedure in the LUT evaluation using the functional bootstrap of TFHE is the BlindRotate. The multi-value bootstrap is a technique that allows the evaluation of multiple LUTs with the same selector using just one BlindRotate. Suppose we want to evaluate $q$ LUTs $(L_0, L_1, \ldots, L_q)$ with the same selector $c \in \mathrm{TLWE}_s(m)$ and we want to minimize the number of blind rotations. A simple way of doing so is to perform a (single-value) functional bootstrap with test vector $(0, v) = (0, 1) \in \mathbb{T}_N[X]^{k+1}$ and selector $c$. Its output would be $\overline{c} = \mathrm{TLWE}_s(X^{-2N \cdot \varphi(c) \bmod 2N})$. Then, to evaluate each LUT, we represent it as an integer polynomial in $\mathbb{Z}_N[X]$, multiply by $\overline{c}$, and extract the constant term. This approach requires just one BlindRotate, but the error variance grows quadratically with the square norm of each LUT.
Carpov et al. [CIM19] introduced a multi-value bootstrap scheme that allows for a much smaller error growth. As Algorithm 5 shows, the test vector $(0, v)$ carries a scaling factor $\tau$, usually set as the gcd among the coefficients of all LUTs. After the blind rotation, each LUT (represented as a polynomial $TV_{F_i} \in \mathbb{Z}_N[X]$) is divided by $v$ before being multiplied by the accumulator (ACC). This division greatly reduces the square norm of the LUTs and, hence, the error growth. Carpov et al. also introduced a very efficient way of calculating this division that only requires a subtraction between each pair of consecutive coefficients of each LUT.
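The subtraction of consecutive coefficients can be illustrated on cleartext polynomials. The sketch below assumes that the common factor is $\frac{1}{2}(1 + X + \cdots + X^{N-1})$, whose inverse modulo $X^N + 1$ is $(1 - X)$; this choice and the function names are assumptions of the example, and the scaling factor $\tau$ is omitted.

```python
def negacyclic_mul(p, q):
    """Schoolbook product of two coefficient lists modulo X^N + 1."""
    n = len(p)
    out = [0] * n
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if i + j < n:
                out[i + j] += pi * qj
            else:
                out[i + j - n] -= pi * qj   # X^N = -1
    return out

def second_phase_vector(lut_poly):
    """lut_poly * (1 - X) mod X^N + 1: differences of consecutive coefficients."""
    head = lut_poly[0] + lut_poly[-1]
    return [head] + [lut_poly[i] - lut_poly[i - 1] for i in range(1, len(lut_poly))]

def square_norm(p):
    return sum(x * x for x in p)

# LUT [3, 0, 2, 1] with each entry repeated N/B = 4 times (N = 16, B = 4).
lut_poly = [3] * 4 + [0] * 4 + [2] * 4 + [1] * 4
factored = second_phase_vector(lut_poly)
```

The factored polynomial has one non-zero coefficient per LUT entry, so its square norm no longer grows with the number of repetitions $N/B$.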

Combining Functional Bootstraps
To evaluate large LUTs considering the limitations we describe in Section 2.3.1, we either need to increase $N$ (which increases the bootstrap time superlinearly [BST20]) or to perform lookups in multiple TRLWE samples. In this section, we follow this latter approach and introduce two methods to combine multiple functional bootstraps to evaluate a single large LUT. Figure 1 illustrates the evaluation of an 8-bit parity function using them. In both methods, we use the encoding for unbounded integers we described in Section 2.3 and represent them in base $B$ with $d$ digits.

Tree-based method
Our first method to evaluate functions using multiple functional bootstraps is structured as a (convergence) tree and uses the output of a lookup to construct a new LUT. Algorithm 6 shows its final version. Let $L$ be the $B^d$-sized LUT encoded in $B^{d-1}$ TRLWE samples (following the encoding we described in Section 2.3.1) and

Our first step is to perform a functional bootstrap using the same selector $c_0$ on each of the $B^{d-1}$ TRLWE samples. This process results in $B^{d-1}$ TLWE samples. If $d = 1$, the evaluation is finished. Otherwise, we use a TLWE-to-TRLWE key switching to pack them in $B^{d-2}$ TRLWE samples. With that, we reduced our problem to the evaluation of

Without the multi-value bootstrap, the complexity of this process (measured in the number of functional bootstraps) would be exponential in the number of digits $d$. However, each level of the tree performs $B^{d-1-i}$ bootstraps using the same selector $c_i$, thus allowing us to replace them with a single multi-value bootstrap, which reduces the complexity to linear in the number of digits. This complexity improvement results in a similar performance gain asymptotically, but it faces practical problems for not so large LUTs. As we described in Section 2.3.2, the multi-value bootstrap relies on multiplications between the accumulator (ACC in line 4 of Algorithm 5) and the LUTs encoded as polynomials in $\mathbb{Z}_N[X]$. In the first level of our tree, this is not a problem since the LUTs are cleartext and can be encoded as polynomials in $\mathbb{Z}_N[X]$. Starting from level 1, the LUTs are now encrypted as TRLWE samples. Arithmetically, this is not an obstacle, but the multiplication between ACC and the LUT is now a multiplication between two TRLWE samples, which is not directly performed in TFHE. We would need to convert ACC from TRLWE to TRGSW using a Circuit Bootstrap, which would only be worth it if the number of Functional Bootstraps we are replacing is very large.
In this work, most of our examples are 8-bit functions, and, therefore, we only use the multi-value bootstrap in the first level of the tree. TFHE presents an experimental version of a TLWE-to-TRGSW circuit bootstrap implemented with 64-bit precision that costs around 130 ms (or 10 times a gate bootstrap). It does not present a TRLWE-to-TRGSW bootstrap. Naively, we could adapt TFHE's algorithm to implement a TRLWE-to-TRGSW circuit bootstrap, which would be N times more expensive. However, we cannot consider such implementation a representative of TRLWE-to-TRGSW conversion performance. Practical implementations on this are mostly an open problem and there are many recent developments in the literature [MS18, BMMP18, CDKS20] that can be used to achieve much more efficient conversions. The original (32-bit) implementation of TFHE does not support the circuit bootstrap, and, if we worked on another implementation of TFHE, it would be hard to compare with results from the literature. We instead prefer to leave the implementation of a TRLWE-to-TRGSW conversion as future work, especially considering that this paper is focused more on algorithmic than implementation aspects.
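The structure of the tree-based method can be followed on cleartext data. The sketch below only models the data flow (which table is looked up with which digit) and counts single-value functional bootstraps; `tree_lookup` and `to_digits` are hypothetical names, and key switching, encryption, and errors are not modeled.

```python
def to_digits(x, base, d):
    """Little-endian decomposition of x into d digits of the given base."""
    return [(x // base**i) % base for i in range(d)]

def tree_lookup(lut, digits, base):
    """Evaluate a B^d-sized LUT with one level of lookups per digit."""
    level = list(lut)
    bootstraps = 0
    for digit in digits:
        # Split the current level into tables of B entries (one per TRLWE sample).
        tables = [level[i:i + base] for i in range(0, len(level), base)]
        # Every table is looked up with the same selector digit; the outputs
        # are packed (key switching in the real scheme) into the next level.
        level = [table[digit] for table in tables]
        bootstraps += len(tables)
    return level[0], bootstraps

# 8-bit parity in base 4 with 4 digits, as in the running example.
parity_lut = [bin(x).count("1") % 2 for x in range(256)]
value, count = tree_lookup(parity_lut, to_digits(201, 4, 4), 4)
```

For $B = 4$ and $d = 4$, the count is $64 + 16 + 4 + 1 = 85$ single-value bootstraps, which is the exponential cost that a per-level multi-value bootstrap would reduce to $d$.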

Optimizing the Tree
One of the main advantages of this method is its versatility. We can not only evaluate any function but also optimize the tree considering the particular characteristics of each function. The sigmoid function, for example, has three intervals in its domain that could be linearly evaluated or approximated: are not bootstrapped, which is something to consider when composing functions to make an application. Considering security, the implications of optimizing the tree are limited to secondary aspects that are not in our scope. For example, the full-tree design enables us to easily achieve circuit privacy by simply encrypting the initial LUT, whereas tree optimizations could give partial information about the function.

Chaining Method
This second method is a generalized version of the integer comparison algorithm of Bourse et al. [BST20]. It is much more functionally restricted than the first method, but it usually presents a smaller error growth. Its main characteristic is that the output of a lookup is used to construct the selector of the next lookup, whereas, in the tree-based approach, the output is used to construct the next LUT. This difference has deep implications on the error propagation and overall functionality. Consider a selector

The first functional bootstrap uses $c_0$ as the selector and outputs $\overline{c}_0$. From then on, each evaluation uses as its selector a linear combination of $c_i$ and the previous output $\overline{c}_{i-1}$. We can define this method as being functionally capable of evaluating any function that can be encoded in LUTs such that, for each digit, either the result of the linear combination of $(c_i, \overline{c}_{i-1})$ is smaller than the size of the LUT, $B$, or the function being encoded follows a $B$-anti-cyclic [BGGJ19] logic. Although it is hard to generically define functions with such restrictions, this method seems to be especially good for functions that require carry-like logic, such as additions and multiplications.
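As a cleartext illustration of the chaining idea, the sketch below adds two numbers digit by digit, feeding the carry produced by one lookup into the selector of the next. It is a hypothetical construction for this example, not the algorithm evaluated in the paper: it assumes digits in base 4 stored in LUTs of size 8, so that the linear combination `a + b + carry` never exceeds the LUT size.

```python
def to_digits(x, base, d):
    """Little-endian decomposition of x into d digits of the given base."""
    return [(x // base**i) % base for i in range(d)]

def from_digits(digits, base):
    return sum(v * base**i for i, v in enumerate(digits))

def chained_add(a_digits, b_digits, digit_base=4):
    """Digit-wise addition where each selector depends on the previous lookup."""
    lut_size = 2 * digit_base                     # assumed LUT size (B = 8)
    sum_lut = [v % digit_base for v in range(lut_size)]
    carry_lut = [v // digit_base for v in range(lut_size)]
    out, carry = [], 0
    for a, b in zip(a_digits, b_digits):
        selector = a + b + carry                  # linear combination of inputs
        assert selector < lut_size                # restriction of the method
        out.append(sum_lut[selector])             # lookup 1: output digit
        carry = carry_lut[selector]               # lookup 2: next carry
    return out, carry

digits, carry = chained_add(to_digits(201, 4, 4), to_digits(100, 4, 4))
```

Both lookups of a digit share the same selector, so in an encrypted setting they would be candidates for a single multi-value bootstrap.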

Error analysis
In this section, we analyze the error variance growth of each algorithm used to perform a functional bootstrap and we calculate the overall probability of error based on the final error variance. We start by reproducing two equations from Chillotti et al. [CGGI20]. Equations 1 and 2 show the variance resulting from the key switching and the gate bootstrap procedures, respectively. The underbar indicates input variables, $\vartheta_{KS}$ and $\vartheta_{BK}$ are, respectively, the variances of the key switching and bootstrap keys, and $\epsilon = \frac{1}{2B_g^{\ell}}$ and $B_g$ are the gadget decomposition quality and base, also respectively. All other variables were introduced at the beginning of Section 2.
$$\mathrm{Var}(\mathrm{Err}(\overline{c})) \le R^2\, \mathrm{Var}(\mathrm{Err}(\underline{c})) + n t N \vartheta_{KS} + n 2^{-2(t+1)} \quad (1)$$

$$\mathrm{Var}(\mathrm{Err}(\overline{c})) \le n(k+1)\ell N \left(\frac{B_g}{2}\right)^2 \vartheta_{BK} + n(1 + kN)\epsilon^2 \quad (2)$$

In the chaining method, functional bootstraps have the same output error variance as the gate bootstrap (Equation 2) since both use noiseless test vectors. In the tree-based method, the test vector is a TRLWE sample that might be encrypting the result of previous table lookups and, hence, we need to consider its error variance, which is additive with the one introduced by the bootstrap, giving us

$$\mathrm{Var}(\mathrm{Err}(\overline{c})) \le \mathrm{Var}(\mathrm{Err}(TV)) + n(k+1)\ell N \left(\frac{B_g}{2}\right)^2 \vartheta_{BK} + n(1 + kN)\epsilon^2$$

As for the key switching algorithm, Chillotti et al. [CGGI20] only analyze it using binary decomposition, and TFHE only implements the TLWE-to-TLWE key switching. In this work, we use both the TLWE-to-TLWE key switching of TFHE and a TLWE-to-TRLWE key switching to pack TLWE samples in a TRLWE sample. This TLWE-to-TRLWE key switching is a core algorithm of our tree-based approach and, to improve efficiency, we need to use greater bases for decomposition. Considering that, we reanalyzed the error variance introduced by the key switching. Equation 1 presents three terms. The first term, $R^2\, \mathrm{Var}(\mathrm{Err}(\underline{c}))$, comes from possible scalings performed by the linear function $f$ (we only use 1-Lipschitz functions, therefore $R^2 = 1$). The second term, $ntN\vartheta_{KS}$, comes from the addition of $(ntN)$ T(R)LWE samples (the summation in line 5 of Algorithm 1), each of them with variance $\vartheta_{KS}$. The third term, $n2^{-2(t+1)}$, is the variance introduced by the rounding of the binary decomposition. Chillotti et al. define the error variance starting from the error amplitude: each of the $n$ elements of the vector $a$ of the TLWE input $c$ is rounded to the closest multiple of $2^{-t}$, which introduces an error of at most $\left|\frac{2^{-t}}{2}\right| = 2^{-(t+1)}$. The variance is then calculated by squaring this amplitude and multiplying it by $n$, i.e., $\left|\frac{2^{-t}}{2}\right|^2 \cdot n = n2^{-2(t+1)}$. To change the decomposition base, we can replace 2 with $base$, which gives us $\frac{1}{4} n \cdot base^{-2t}$.
Albeit correct, this arithmetic approach is not as tight as desirable.
The vector $a$ of the input TLWE sample is generated from a uniform distribution. Hence, the error inserted by rounding each of its elements to a multiple of $base^{-t}$ is also uniform and varies from $-\frac{base^{-t}}{2}$ to $+\frac{base^{-t}}{2}$. The variance of a uniform distribution varying from $a$ to $b$ is $\frac{1}{12}(a - b)^2$, and the sum of $n$ uniformly distributed variables is a (scaled) Irwin-Hall distribution with variance $n \cdot \frac{1}{12}(a - b)^2$. Applying these equations to our case, we have that the error variance introduced by the decomposition is $n \cdot \frac{1}{12}\left(-\frac{base^{-t}}{2} - \frac{base^{-t}}{2}\right)^2 = \frac{1}{12} n \cdot base^{-2t}$. Irwin-Hall distributions are very good approximations for Gaussian distributions and we can just replace the third term of Equation 1, which results in Equation 5.
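The $\frac{1}{12} n \cdot base^{-2t}$ estimate can be checked with a small Monte Carlo experiment. The sketch below is an illustration under simple assumptions (uniform floating-point inputs, hypothetical parameter values $n = 32$, $base = 4$, $t = 3$); it only models the rounding of the vector $a$, not the whole key switching.

```python
import random
import statistics

def rounding_error_variance(n, base, t, trials, seed=0):
    """Empirical variance of the summed rounding errors of n uniform values."""
    rng = random.Random(seed)
    step = base ** -t                       # rounding granularity base^(-t)
    sums = []
    for _ in range(trials):
        total = 0.0
        for _ in range(n):
            a = rng.random()
            total += a - round(a / step) * step   # error in [-step/2, +step/2]
        sums.append(total)
    return statistics.pvariance(sums)

measured = rounding_error_variance(n=32, base=4, t=3, trials=4000)
tight_estimate = 32 * 4 ** -6 / 12          # (1/12) * n * base^(-2t)
amplitude_bound = 32 * 4 ** -6 / 4          # (1/4)  * n * base^(-2t)
```

The measured value should sit close to the tighter estimate and about three times below the amplitude-based bound.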

Error Rate
Knowing the error variance is useful for approximate computation, but, for perfect execution, we need to calculate the probability of such error affecting significant bits of the message.
In the decryption, a failure occurs if the rounding procedure rounds to the wrong integer, which will happen if the absolute value of the error is greater than half of the least significant bit of the message in the Torus representation. Equation 6 gives us the probability of this error occurring for a TLWE sample x, with standard deviation σ x and the least significant bit encoded in the torus with value (2 · interval). erf is the Gaussian error function.
The bootstrap may also fail since the error affects the accumulator blind rotation amount. The failure occurs if the accumulator is rotated to a polynomial in which the coefficient of the constant term is different from the desired one. In this way, we need the error to be within the interval of half the Torus slice size we are working with, which is $\frac{1}{2} \cdot \frac{0.5}{B}$. Scaling by $2^{32}$, we have $interval = \frac{2^{31}}{2B}$. Besides the TLWE error variance, we also need to consider the rounding error introduced when selecting the $\log_2(2N)$ most significant bits of each position of $a$ scaled to $2N$. The discarded bits of each position are uniformly distributed with value ranging from $-\frac{2^{31}}{2N}$ to $+\frac{2^{31}}{2N}$. The variance of a uniform distribution is $\frac{1}{12}$ times the square of its amplitude, thus $\sigma^2_{a_i} = \frac{1}{12} \cdot \left(\frac{2^{32}}{2N}\right)^2$, for each $a_i \in a$. In the worst case, $n$ positions will be added to calculate the phase, leading to variance $\sigma^2_{\sum_i^n a_i} = n \cdot \sigma^2_{a_i}$. This variance is additive with the one of the Gaussian error, and we can obtain the error rate using Equation 6.
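The calculation above can be turned into a short numerical sketch. It assumes that the failure probability is the two-sided Gaussian tail $1 - \mathrm{erf}\left(\frac{interval}{\sigma\sqrt{2}}\right)$, works in Torus units instead of the $2^{32}$ scaling, and uses hypothetical parameter values ($n = 630$, $N = 1024$); none of these choices is taken from the text.

```python
import math

def bootstrap_error_rate(input_variance, n, big_n, base):
    """Probability that the blind rotation selects a wrong LUT position."""
    interval = 1 / (4 * base)                         # half of a torus slice
    # n uniform rounding errors, each with amplitude 1 / (2N).
    rounding_variance = n * (1 / (2 * big_n)) ** 2 / 12
    sigma = math.sqrt(input_variance + rounding_variance)
    # erfc(x) = 1 - erf(x), computed without cancellation for small tails.
    return math.erfc(interval / (sigma * math.sqrt(2)))

low_noise = bootstrap_error_rate(1e-5, n=630, big_n=1024, base=4)
high_noise = bootstrap_error_rate(1e-4, n=630, big_n=1024, base=4)
larger_base = bootstrap_error_rate(1e-5, n=630, big_n=1024, base=8)
```

Under these assumptions, the error rate grows with both the input variance and the base, since a larger base shrinks the interval available to the error.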
The bootstrap is much more susceptible to failures than the decryption. Therefore, the error variance of TLWE samples that are input for bootstraps ends up being the main component to control the error rate. Given a function that can be evaluated by both of our methods, we can define which one presents the better error rate as a function of such variance. Let $s$ be the error variance scaling of the linear operation of the chaining method, i.e., how many times the error variance of its output is greater than that of its inputs. Figure 3 shows the error rate as a function of the TLWE input error variance for 8-bit functions in base 4 for $s \in \{2, 4, 6, 10\}$. We found few practical cases with $s > 2$, but, in all of them, the tree-based approach required more bootstraps, and $s$ could be lowered to 2 by increasing the number of bootstraps of the chaining one. In this way, we conclude that the chaining method is usually the better choice for the functions it can evaluate. However, as we noted before, its functionality is very restricted. Table 1 summarizes the main characteristics of our two methods.

Improving building blocks

Base-aware TLWE-to-TRLWE Key Switching
A regular TLWE-to-TRLWE Key Switching packs $N$ TLWE samples in one TRLWE sample in $\mathbb{T}_N[X]^{k+1}$. However, as we described in Section 2.3.1, we want to pack $B$ TLWE samples with each one being mapped to a sequence of consecutive coefficients in the TRLWE sample. In Algorithm 7, we exploit the fact that $B < N$ both to accelerate the key switching and to reduce the error variance growth by a factor of $\frac{N}{B}$ compared to the regular TLWE-to-TRLWE Key Switching. Comparing with Algorithm 1, we obtain speedups by replacing multiplications between $N$-sized polynomials (line 5 of Algorithm 1) by inner products between a $B$-sized vector of binary digits and a $B$-sized vector of TRLWE samples (line 6 of Algorithm 7). The Key Switching key, however, is $B$ times bigger.

Multi-Value Extract
The error variance growth of adding two variables $x$ and $y$ is defined as $\sigma^2_{x+y} = \sigma^2_x + \sigma^2_y + 2\rho\sigma_x\sigma_y$, where $\sigma^2$ is the variance and $\rho$ is the correlation between the normally distributed variables. When this correlation is linear, $\rho$ may be defined as its degree. If the variables are completely independent, then $\rho = 0$ and $\sigma^2_{x+y} = \sigma^2_x + \sigma^2_y$. If they are the same variable, then $\rho = 1$ and $\sigma^2_{x+x} = \sigma^2_x + \sigma^2_x + 2\sigma_x\sigma_x = 4\sigma^2_x$. Defining the multiplication as a sequence of additions of the same variable, we have that $\sigma^2_{n \times x} = n^2 \times \sigma^2_x$, for $n \in \mathbb{Z}$.

[Table 1: Main characteristics of the two methods. Recoverable row, "Suitable functions": chaining method, most functions following carry-like (e.g. addition) or test (e.g. sign) logic; tree-based method, all other functions.]

To avoid the quadratic variance growth at multiplications, we could implement them as sequences of additions of independent TLWE samples encrypting the same number. We obtain these independent encryptions by extracting multiple coefficients from the accumulator (ACC) at the end of a bootstrap procedure. We call this process Multi-Value Extract. Recall that, in our LUT encoding (Section 2.3.1), each LUT position is mapped to a sequence of $\frac{N}{B}$ coefficients. Therefore, after the bootstrap, we should have $\frac{N}{B}$ independent encryptions of each number, which we can use to perform the multiplication as shown in Algorithm 8. The additional extracts on the ACC reduce the interval used to calculate the error rate in Equation 6 from $\frac{1}{4B}$ to $\left(\frac{1}{4B} - \frac{b}{4N}\right)$. However, we find this reduction to have a negligible impact on the error rate for the values we tested. We base the independence between coefficients of the ACC on the Independence Heuristic [CGGI20] (Definition 1).
Definition 1 (Independence Heuristic, [CGGI20]). The error of the coefficients of TRLWE samples (including TRGSW samples) and all linear combinations of them considered in TFHE are independent and concentrated.
We tried to experimentally validate the error variance of the multiplication using the multi-value extract, and we obtained the results in Figure 4a. We noticed that the error variance still grows quadratically, which indicates that the coefficients are not independent. To obtain formal guarantees of independence (instead of a heuristic), Chillotti et al. [CGGI20] point out that we could perform the gadget decomposition in a probabilistic way. We implemented the probabilistic gadget decomposition proposed by Genise et al. [GMP19], but we obtained no improvements over the deterministic algorithm. We were only able to obtain the linear growth by lowering the error variance introduced by the gadget decomposition. We increased the size of the decomposition base $\log_2(B_g)$ from 4 to 5, which improves its precision $\ell \log_2(B_g)$ from 20 to 25 bits, a reasonably high value for a 32-bit implementation. Figure 4b shows the results.
From this experiment, we can conclude that, although the independence heuristic holds for the Gaussian error, the same cannot be said about the error introduced by the gadget decomposition. We also measured that the error variance introduced by the decomposition is bigger than the estimations using the right term of Equation 2, $n(1 + kN)\epsilon^2$. We consider it an indication that this dependency between coefficients might affect not only our multi-value extract but also the bootstrap itself (although further research is certainly necessary to support that). The use of a probabilistic gadget decomposition with higher entropy might be an alternative solution to the increase of $\ell$ or $B_g$. However, in our case, it would require us to increase other parameters, which would impact performance.

Algorithm 8: Multiplication (scaling) using the multi-value extract
Input: a TRLWE sample $c \in \mathrm{TRLWE}_S(p)$, which is the accumulator (ACC) of a previous functional bootstrap, and a cleartext scalar $b \in \mathbb{Z}$.
Output: a TLWE sample $c \in \mathrm{TLWE}_S(b \cdot p_0)$, where $p_0$ is the constant term of p.
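The intended effect of the multi-value extract can be illustrated with a toy numerical model. The sketch below is not Algorithm 8: it replaces the accumulator with a list of plaintext values carrying independent Gaussian noise, handles only positive scalars up to the window size, and all names and sizes are assumptions made for illustration. It models the idealized case in which the heuristic holds, which the experiments above show is only reached once the decomposition error is lowered.

```python
import random
import statistics


def toy_accumulator(p0, window, sigma, rng):
    """Model the N/B coefficients that all encode p0, each with its own noise."""
    return [p0 + rng.gauss(0.0, sigma) for _ in range(window)]


def scale_single(acc, b):
    """Baseline: extract one coefficient and multiply it by the scalar b."""
    return b * acc[0]


def scale_multi_extract(acc, b):
    """Multi-value extract idea: add b distinct coefficients instead."""
    if not 0 < b <= len(acc):
        raise ValueError("b must be between 1 and the window size N/B")
    return sum(acc[:b])


def error_variances(b, window=8, sigma=0.01, trials=20000, seed=1):
    """Empirical error variance of both scalings under independent noise."""
    rng = random.Random(seed)
    p0 = 0.25
    single, multi = [], []
    for _ in range(trials):
        acc = toy_accumulator(p0, window, sigma, rng)
        single.append(scale_single(acc, b) - b * p0)
        multi.append(scale_multi_extract(acc, b) - b * p0)
    return statistics.pvariance(single), statistics.pvariance(multi)


var_single, var_multi = error_variances(b=4)
print(f"single extract: {var_single:.3e}, multi-value extract: {var_multi:.3e}")
```

Under these assumptions, the first variance grows with b squared and the second with b, which is the quadratic-versus-linear behavior discussed in this section.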

Impact on the multi-value bootstrap
The multi-value bootstrap of Carpov et al. [CIM19] increases the bootstrap error variance by a factor of $\|TV_f\|_2^2 \le s(q - 1)^2$, where s and q are the input and output bases, respectively. Carpov et al. use q = 2 (the binary base) to lower the error, but they need to convert from base q to s to use the output in another function. They do so by using a key switching at the beginning of each functional bootstrap to perform a base composition, which is s-Lipschitz and, therefore, introduces the previously avoided quadratic error. In summary, composable circuits often need the input and output bases to be the same, and, in these cases, solely outputting in the binary base would just transfer the complexity to the base composition. However, with the introduction of the multi-value extract, we can perform the scaling required by the base composition linearly and, thus, reduce the complexity of the error variance from $s(q - 1)^2$ to $s(q - 1)$.
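To get a feel for the difference, the two growth terms can be tabulated for a few bases. This is only arithmetic on the expressions above; the chosen bases and the function names are illustrative assumptions.

```python
def quadratic_growth(s, q):
    """Growth term s(q - 1)^2 of the original multi-value bootstrap."""
    return s * (q - 1) ** 2


def linear_growth(s, q):
    """Growth term s(q - 1) when the scaling uses the multi-value extract."""
    return s * (q - 1)


# Same input and output base, as composable circuits often require.
rows = [(base, quadratic_growth(base, base), linear_growth(base, base))
        for base in (2, 4, 8, 16)]
for base, quad, lin in rows:
    print(f"s = q = {base:2d}: quadratic {quad:5d}, linear {lin:4d}, ratio {quad // lin}")
```

The ratio between the two terms is q - 1, so the gain increases with the output base.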

Practical parameters and experimental error variance measurement
Our first step to define practical parameters was to survey the literature on non-linear functions that are usually implemented using perfect computation with TFHE. Then, based on the required precision, we manually searched for parameters aiming at an error rate of at most $2^{-30}$. Table 2 lists the selected parameters.

We estimated the error rate for these parameters using the equations of Section 3.3, and we ran experiments to validate the results. We find this validation to be necessary mainly because we based our error analysis on equations designed for TFHE to work with binary digits and logic gates. Once we introduced larger bases with arbitrary LUT evaluation and tightened some of the variance estimations, we could no longer rely on the experimental validation provided by Chillotti et al. [CGGI20] to support our statistics. Table 3 shows the results of our experiments. We measured the variance of $2^{14}$ samples and calculated a 95% confidence interval using the chi-square distribution. Due to computational limits, we could only validate the error rate for higher variances. We performed 15,438,720 bootstraps over LWE samples with $\sigma^2 = 1.70 \times 10^{-4}$. The output was wrong in only 39 of them. In both cases, the experiments showed that our estimations are reasonably close upper bounds for the actual values. Although we cannot extrapolate the results to different variances, the experiments provide some evidence of correctness for our equations on the considered parameters.
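The two statistics used in this validation can be sketched with the standard library alone. The sketch below is an illustration, not the measurement code: it swaps exact chi-square quantiles for the Wilson-Hilferty approximation (accurate for this many degrees of freedom), and the sample variance passed in is a hypothetical placeholder rather than a value from Table 3.

```python
import math


def chi2_quantile_wh(z, dof):
    """Wilson-Hilferty approximation of the chi-square quantile at normal score z."""
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * math.sqrt(c)) ** 3


def variance_confidence_interval(sample_variance, n_samples, z=1.959964):
    """Two-sided interval for the true variance (z = 1.959964 gives 95%)."""
    dof = n_samples - 1
    upper_q = chi2_quantile_wh(z, dof)
    lower_q = chi2_quantile_wh(-z, dof)
    return dof * sample_variance / upper_q, dof * sample_variance / lower_q


def empirical_error_rate(failures, trials):
    """Observed failure rate and its base-2 logarithm."""
    rate = failures / trials
    return rate, math.log2(rate)


# Hypothetical sample variance measured over 2^14 samples.
low, high = variance_confidence_interval(1.0e-4, 2 ** 14)
# Counts reported in the text: 39 wrong outputs in 15,438,720 bootstraps.
rate, log2_rate = empirical_error_rate(39, 15_438_720)
print(f"95% interval: [{low:.4e}, {high:.4e}]")
print(f"empirical error rate: {rate:.3e} (about 2^{log2_rate:.1f})")
```

With $2^{14}$ samples, the interval is roughly 2% wide on each side of the measured variance, and the reported counts correspond to an empirical error rate of about $2^{-18.6}$.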

Performance Results
We benchmarked our methods by using them to implement a set of relevant functions from the literature. We selected two functions from previous literature on the functional bootstrap of TFHE and three functions that are building blocks for neural network algorithms. We set the precision of our implementations to match the ones we are comparing to. We executed our experiments using an Intel i7-7700 processor at 4.20 GHz running Ubuntu 18.04. We compiled our code using GCC 7.3.0 with flags -O3 -std=c++11 -funroll-all-loops -march=native -ltfhe-spqlios-fma -lm, and we used the "optim" build of TFHE. Each result is the average of 100 executions. We tried to compile and execute previous implementations in the same environment. When we could not reproduce them (either because the authors did not provide the source code or because their source code did not compile in our environment), we report the results presented by the authors and add an observation about the differences between machines. Fortunately, most authors report the execution time for the default gate bootstrap of TFHE, which we can use as a "TFHE benchmark score" to adjust the speedup values accordingly. On our machine, the default gate bootstrap of TFHE runs in 9 ms and 13 ms using the old (80-bit security) and the new (127-bit security) sets of parameters, respectively.
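The text does not spell out the exact adjustment, so the following is one plausible reading of it: each running time is expressed in units of the gate bootstrap time measured on the same machine, and the speedup is the ratio of these normalized times. The function names and the competitor's numbers are hypothetical; only the 13 ms gate bootstrap time comes from the measurements above.

```python
def normalized_time(time_ms, gate_bootstrap_ms):
    """Running time expressed in units of that machine's default gate bootstrap."""
    return time_ms / gate_bootstrap_ms


def adjusted_speedup(their_time_ms, their_gate_ms, our_time_ms, our_gate_ms):
    """Speedup after normalizing both times by the local gate bootstrap time."""
    theirs = normalized_time(their_time_ms, their_gate_ms)
    ours = normalized_time(our_time_ms, our_gate_ms)
    return theirs / ours


# Hypothetical competitor: 1000 ms on a machine whose gate bootstrap takes 26 ms.
# Hypothetical result of ours: 200 ms, with the 13 ms gate bootstrap reported above.
raw = 1000 / 200
adjusted = adjusted_speedup(1000, 26, 200, 13)
print(f"raw speedup {raw:.2f}, adjusted speedup {adjusted:.2f}")
```

In this made-up case, half of the raw speedup is explained by the faster machine, which is why the adjusted figure is the fairer one to report.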

Lookup Table evaluation
Lookup tables are very commonly used in homomorphic circuits, and the implementation of Carpov et al. [CIM19] is one of the most recent and efficient of them. It introduces the multi-value bootstrap of TFHE, which we explore in our tree-based method. Table 4 compares their implementation for a 6-bit-to-6-bit LUT with one using our tree-based method (Algorithm 6). Both implementations are based on the functional bootstrap of TFHE. The difference is that the implementation of Carpov et al. [CIM19] performs a single functional bootstrap with base $B = 2^6$, whereas our implementations perform several functional bootstraps with base $B = 2^2$ and combine them using the tree-based method.
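The way several base-4 lookups can stand in for one base-64 lookup can be sketched in plaintext. This is not Algorithm 6: it assumes little-endian digits, treats each table entry as a plain integer, and counts one small lookup per tree node without the sharing that the multi-value bootstrap provides. The names `to_digits` and `tree_lookup` and the example table are hypothetical.

```python
def to_digits(value, base, n_digits):
    """Little-endian digits of value in the given base."""
    digits = []
    for _ in range(n_digits):
        digits.append(value % base)
        value //= base
    return digits


def tree_lookup(table, digits, base):
    """Evaluate table[x] using only base-sized lookups, one per tree node."""
    level = list(table)
    small_lookups = 0
    for digit in digits:
        next_level = []
        for start in range(0, len(level), base):
            small_lut = level[start:start + base]  # stands for one packed LUT
            next_level.append(small_lut[digit])    # stands for one functional bootstrap
            small_lookups += 1
        level = next_level  # results of this level become the LUTs of the next
    return level[0], small_lookups


BASE, N_DIGITS = 4, 3
table = [(7 * x + 3) % 64 for x in range(64)]  # arbitrary 6-bit-to-6-bit example
result, count = tree_lookup(table, to_digits(45, BASE, N_DIGITS), BASE)
print(result, count)
```

Each level selects with one input digit and shrinks the set of candidates by a factor of B, so the outputs of one level are exactly the entries needed to build the LUTs of the next.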

32-bit Integer Comparison
Integer comparison is an extremely useful function in computing, but it presents some challenges to be implemented in homomorphic circuits, especially for unbounded integers. Bourse et al. [BST20] presented a very efficient implementation using the functional bootstrap of TFHE. The chaining method we presented in Section 3.2 is a generalization of their technique. To compare with them, we implemented a unary tree to evaluate the integer comparison. Algorithm 9 describes our implementation. Table 5 compares them with the integer comparison implemented by Zhou et al. [ZLPL20] using logic gates. We calculated the speedups using as reference the slowest implementation, which, in this case, is the second implementation of Bourse et al. [BST20]. However, adjusting the speedups to consider the difference in speed between machines, we can see that the implementation of Zhou et al. [ZLPL20] using logic gates is up to 1.75 times slower than the one of Bourse et al. [BST20] and up to 5.6 times slower than ours. The chaining and tree-based methods perform the same number of bootstraps and should present a very similar performance. Nonetheless, we were able to achieve significant speedups thanks to our tighter variance and error rate estimations, which enabled a better choice of parameters.
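The reason comparison lends itself to a chain of small lookups can be seen in a plaintext model, where the result for the digits processed so far is the only state carried forward. The sketch below is neither Algorithm 9 nor the construction of Bourse et al. [BST20]: it only illustrates the dependency structure, and its names and digit order are assumptions.

```python
def to_digits(value, base, n_digits):
    """Little-endian digits of value in the given base."""
    digits = []
    for _ in range(n_digits):
        digits.append(value % base)
        value //= base
    return digits


def compare_chained(a_digits, b_digits):
    """Return -1, 0, or 1 for a < b, a == b, a > b, scanning from the least significant digit."""
    state = 0
    for a_d, b_d in zip(a_digits, b_digits):
        local = (a_d > b_d) - (a_d < b_d)
        # A more significant digit overrides the previous state unless it is a tie.
        state = local if local != 0 else state
    return state


BASE, N_DIGITS = 4, 16  # 16 base-4 digits cover 32-bit integers
a, b = 3_000_000_000, 2_999_999_999
outcome = compare_chained(to_digits(a, BASE, N_DIGITS), to_digits(b, BASE, N_DIGITS))
print(outcome)
```

Under encryption, each loop iteration would correspond to a lookup whose selector combines the current digits with the previous state; how that selector is encoded is not modelled here.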

Neural Network Functions
Neural network inference algorithms are currently one of the major use cases for homomorphic encryption [BGGJ19, LJ19, ZLPL20]. They achieve great performance levels with approximate computation in some cryptosystems [CKKS17], but perfect computation still seems to be necessary to achieve state-of-the-art inference accuracy for deep neural networks [BGGJ19,LJ19]. We implemented three functions that are building blocks for them and that are provided by SHE [LJ19], an implementation of secure neural network inference based on TFHE that achieves state-of-the-art inference accuracy. SHE presents very fast arithmetic (thanks to the use of Lognet), but it relies on logic gates to implement non-linearities. We also compare our results with the implementations of Zhou et al. [ZLPL20], which are generally faster than SHE but present worse inference accuracy.

ReLU
The Rectified Linear Unit (ReLU) is a neural network activation function broadly used due to its simple implementation and non-linear properties. Algorithm 10 shows our implementation using the Functional Bootstrap. The logic is similar to a multiplexer. Table 6 compares it with implementations using logic gates. For the 127-bit security level, our implementations are up to 6.98 times faster than the one of Lou and Jiang [LJ19] and 1.19 times faster than the one of Zhou et al. [ZLPL20]. Although the logic gates of TFHE generally introduce error rates much smaller than the functional bootstrap, the error rate of our implementation is $2^{-137}$, which is also negligible compared to the security level.

Maximum
Our implementation of the maximum function is, at first, similar to the integer comparison followed by a multiplexer. However, in this context, we have to also consider signed numbers, which leads us to Algorithm 11. Table 7 shows performance results.

Addition
We implemented this function using the chaining method since it presents a carry-like logic. Our implementations using the tree-based approach were more expensive since, although we can produce a linear combination for which the carry follows a B-anti-cyclic logic, we could not do the same for the addition itself. Algorithm 12 describes our implementation and Table 8 compares it with implementations using logic gates. We used the multi-value extract to perform the scaling in line 6 of Algorithm 12.
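The carry-like logic mentioned above can be made concrete with a plaintext digit-wise adder. This sketch does not reproduce Algorithm 12 (in particular, the scaling of its line 6 is not modelled); it assumes little-endian base-4 digits, and the helper names are hypothetical.

```python
def to_digits(value, base, n_digits):
    """Little-endian digits of value in the given base."""
    digits = []
    for _ in range(n_digits):
        digits.append(value % base)
        value //= base
    return digits


def from_digits(digits, base):
    """Inverse of to_digits."""
    return sum(d * base ** i for i, d in enumerate(digits))


def add_chained(a_digits, b_digits, base):
    """Digit-wise addition in which each step depends on the previous carry."""
    carry = 0
    out = []
    for a_d, b_d in zip(a_digits, b_digits):
        total = a_d + b_d + carry   # a linear combination, between 0 and 2*base - 1
        out.append(total % base)    # stands for one lookup on the combination
        carry = total // base       # stands for a second lookup, fed to the next step
    return out, carry


BASE, N_DIGITS = 4, 4  # four base-4 digits cover 8-bit operands
digits, carry_out = add_chained(to_digits(200, BASE, N_DIGITS),
                                to_digits(100, BASE, N_DIGITS), BASE)
print(from_digits(digits, BASE), carry_out)
```

Because every step needs the carry produced by the previous one, the lookups form a chain rather than a tree, which matches the choice of the chaining method for this function.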

Additional estimations
Using the results of Section 4.3, we can estimate the performance gains our methods could bring to full applications. Let us take, for example, the binarized convolutional neural network (CNN) of Zhou et al. [ZLPL20], which can be implemented using the functions of Section 4.3. This CNN classifies images in the MNIST dataset and is composed of three binarized convolutional layers, three max-pooling layers, and two fully connected layers. We counted the number of operations in each one of them and estimated their execution time using the results of Section 4.3. Table 9 shows the results. Although this is a basic estimation, our execution times are reasonably close to the ones reported by Zhou et al. [ZLPL20], especially considering the differences in execution environments. A real implementation would likely present better performance for our methods since they allow for further optimizations, such as using the multi-value bootstrap to batch multiple operations. Nonetheless, the current estimation indicates a speedup of up to 4.9 times, which is within the expected range considering the results of Section 4.3, where the speedup over basic operations for this same implementation ranged from 1.19 (ReLU) to 6.77 (Addition) times.

We can also estimate the performance impact of changing the size of the keys. The main reason we use large keys (compared to implementations using logic gates) is the use of decomposition bases greater than 2 in the TLWE-to-TRLWE key switching. Using base 64, the Key Switching keys take 4.0 GiB and 6.0 GiB for the parameters 5_5_6_2 and 6_4_6_3, respectively. Decreasing the base would also linearly decrease the size of these keys, but, to avoid increasing the error, we would need to logarithmically increase the value of t and, consequently, the key-switching execution time. For example, using base 16 instead of 64, the parameter 5_5_6_2 would need a 1 GiB key, but t would need to be increased from 2 to 3, which would increase the execution time of the key switching from 6 ms to 9 ms. At first, this tradeoff seems promising for many functions, especially the ones which make little use of the TLWE-to-TRLWE key switching. However, increasing the value of t also increases the second term of Equation 5 and, hence, might affect the output error variance negatively. For simplicity, in this paper, we chose two sets of parameters that fit all functions we implemented. A more targeted search for parameters would likely yield better results, and the methods we introduced allow for easily changing parameters even within a single function evaluation.
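The base-16 versus base-64 example can be checked with a few lines of arithmetic. The sketch assumes that the decomposition precision is t times the base-2 logarithm of the base, that the key size scales linearly with the base as stated in the text (any dependence on t is not modelled), and that the key-switching time is proportional to t. The function names are hypothetical.

```python
import math


def precision_bits(base, t):
    """Bits of precision kept by a t-digit decomposition in the given base (assumed model)."""
    return t * math.log2(base)


def scaled_key_size_gib(size_gib, old_base, new_base):
    """Key size after changing the base, using the linear scaling stated in the text."""
    return size_gib * new_base / old_base


def scaled_time_ms(time_ms, old_t, new_t):
    """Key-switching time after changing t, assuming time proportional to t."""
    return time_ms * new_t / old_t


# Parameter set 5_5_6_2: base 64 with t = 2, a 4.0 GiB key, and 6 ms per key switching.
bits_before = precision_bits(64, 2)
bits_after = precision_bits(16, 3)
key_after = scaled_key_size_gib(4.0, 64, 16)
time_after = scaled_time_ms(6.0, 2, 3)
print(bits_before, bits_after, key_after, time_after)
```

Both configurations keep 12 bits of decomposition precision, which is why t has to move from 2 to 3 when the base drops from 64 to 16, and the resulting 1 GiB and 9 ms figures match the ones quoted above.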

Related Work
The use of lookup tables (LUTs) in homomorphic circuits dates back to the first fully homomorphic encryption schemes, and LUTs have been used with most modern FHE schemes [CGH+18, LFFJ19, NRPH19]. LUTs are a simple and powerful technique to represent arbitrary functions, but problems with latency and precision have also been reported [LFFJ19, NRPH19]. The introduction of LUT evaluations within the bootstrap by FHEW [DM15] started a new line of work on this topic. Bonnoron et al. [BDF18] introduced techniques to evaluate gates with a large number of input bits using a single bootstrap of FHEW and implemented a 6-bit-to-6-bit LUT that runs in less than 10 seconds. Based on their work, Carpov et al. [CIM19] presented the multi-value bootstrap of TFHE and lowered the execution time of the 6-bit-to-6-bit LUT to only 1.5 seconds. Our work makes extensive use of the multi-value bootstrap, but we focus more on accelerating the evaluation of non-linear functions than on improving the multi-value bootstrap itself. Nonetheless, we did present some contributions towards it, such as the reduction of the complexity of its error growth. Moreover, our combination methods also introduce two new ideas to this line of work: how to use the multi-value bootstrap to accelerate single (instead of multiple) LUT evaluations; and how to improve the LUT evaluation based on particular properties of the encoded function.

Tree-based approach
The most similar strategy to our tree-based evaluation is the vertical packing of Chillotti et al. [CGGI20], which suggests the use of a CMUX tree to choose among multiple TRLWE samples and, then, uses a single BlindRotate to perform a final lookup. Similarly to ours, their method also allows some optimizations based on the encoded function (although they neither presented nor explored this idea). On the other hand, our method constructs LUTs on-the-fly using results of previous lookups, which allows optimizations even within a single LUT. The main difference between them, however, is in their use of TFHE's building blocks. The vertical packing is directly based on CMUX gates and, hence, requires the selector to be encrypted in the binary base in TRGSW samples. This makes it a very good choice in the leveled setting, but it requires one circuit bootstrap per bit in the fully homomorphic one. Chillotti et al. [CGGI20] report an execution time of around 800 ms to evaluate a 6-bit-to-6-bit LUT in the fully homomorphic setting with 80 bits of security. Correcting for the security level and the difference between machines, the implementation of Carpov et al. is already slightly faster. Bourse et al. [BST20] also present a "tree-based" technique for integer comparison using the functional bootstrap, but, besides the name, by our definitions, it is equivalent to the chaining method. More specifically, it rearranges the chaining evaluation in a tree-like fashion (specific to the integer comparison logic) so that it can be parallelized in multiple threads. The output of each LUT is still used to construct the selector of the next one (which defines the chaining), and the technique has no further similarities with our "tree-based" approach.

Conclusion
In this paper, we presented two methods to combine multiple functional bootstraps in TFHE; we performed a thorough error variance and error rate analysis of our methods and of the functional bootstrap itself, including experimental validation; we introduced a multi-value extract procedure to improve the error behavior on scalings and, especially, on the multi-value bootstrap; we introduced a "base-aware" TLWE-to-TRLWE Key Switching to speed up the LWE packing; and, finally, we selected practical parameters and benchmarked our methods using relevant functions from the literature. We achieved speedups of up to 3.19 times compared to previous literature on the functional bootstrap of TFHE, and of up to 8.7 times compared to implementations using logic gates. Arbitrary LUTs are inherently exponential to evaluate, which gives more importance to the possibility of optimizing them based on the function particularities, a feature that our methods introduce. In our practical experiments, we limited ourselves to working with the original implementation of TFHE and with precision levels that were previously defined in the literature. Our results demonstrate efficient evaluations with a precision of up to 6 bits for completely arbitrary functions, and up to 32 bits for functions with enough opportunities for optimizations. The techniques themselves can be easily extended to work with higher parameters and are already defined to efficiently exploit the circuit bootstrap. To test this in practice, however, we would need to move to more optimized versions of TFHE (with at least 64-bit Torus precision) and to implement an efficient version of the circuit bootstrap, which is mostly a practical open problem. This process would bring contributions of its own and, thus, we leave it as future work.
Nonetheless, the speedups we achieved are certainly a good measurement of improvement over previous literature, and some of our contributions, such as the multi-value extract, go beyond the context of functional bootstrap implementations and are useful even for pure arithmetic.
As future work, we also intend to implement our algorithms for the functional bootstrap using more optimized versions of TFHE; to pursue implementation optimizations for our techniques themselves; to research efficient ways of implementing the TRLWE-to-TRGSW circuit bootstrap; and to accelerate the Functional Bootstrap and TRLWE Key Switching by exploiting techniques such as those proposed by Bourse et al. [BMMP18], Chen et al. [CDKS20], and Micciancio and Sorrell [MS18]. Ultimately, we intend to create a library for the functional bootstrap and to test it in frameworks of automatic code generation.

A Algorithms of functions presented in Section 4
For simplicity, we define two functions to use in the algorithms presented in this section:
• FunctionalBootstrap: It receives a selector c encrypted in a TLWE sample, a LUT encrypted in a TRLWE sample, and the bootstrap key BK. If necessary, it performs a TLWE key switching on the sample c to switch from the key S (with n = 1024) to s (with n = 630). Then, it executes the functional bootstrap and returns a TLWE sample with the result.
• TRLWEKeySwitch: It receives a vector of B = 4 TLWE samples and a Key Switching key. It performs a base-aware TLWE-to-TRLWE key switching using our default packing technique.
Algorithm 9: Integer comparison algorithm using the tree-based method.