ConvKyber: Unleashing the Power of AI Accelerators for Faster Kyber with Novel Iteration-based Approaches

Abstract. The remarkable performance capabilities of AI accelerators offer promising opportunities for accelerating cryptographic algorithms, particularly in the context of lattice-based cryptography. However, current approaches to leveraging AI accelerators often remain at a rudimentary level of implementation, overlooking the intricate internal mechanisms of these devices. Consequently, a significant portion of computational resources remains underutilized. In this paper, we present a comprehensive exploration of NVIDIA Tensor Cores and introduce a novel framework tailored specifically for Kyber. Firstly, we propose two innovative approaches that efficiently break down Kyber's NTT into iterative matrix multiplications, resulting in approximately a 75% reduction in cost compared to the state-of-the-art scanning-based methods. Secondly, by reverse-engineering the internal mechanisms, we precisely manipulate the internal resources of Tensor Cores using assembly-level code instead of inefficient standard interfaces, eliminating memory accesses and redundant function calls. Finally, building upon our highly optimized NTT, we provide a complete implementation for all parameter sets of Kyber. Our implementation surpasses the state-of-the-art Tensor Core based work, achieving remarkable speed-ups of 1.93x, 1.65x, 1.22x, and 3.55x for polyvec_ntt, KeyGen, Enc, and Dec in Kyber-1024, respectively. Even with its throughput-oriented design, our full Kyber implementation maintains an acceptable execution latency: for instance, it ranges from 1.02 to 5.68 milliseconds for Kyber-1024 on an RTX 3080 at peak throughput.


Introduction
The imminent possibility of large-scale quantum computers capable of practically executing Shor's algorithm [Sho97] in the near future has sparked intensive research in the realm of post-quantum cryptosystems, designed to withstand quantum attacks. In anticipation of the quantum era, NIST proactively issued a call for proposals to replace their current standards for digital signatures, public-key encryption (PKE), and key-encapsulation mechanisms (KEM). Following three rounds of thorough evaluation and scrutiny, NIST announced in July 2022 that they had chosen CRYSTALS-KYBER (abbreviated as Kyber) as the standard algorithm for Post-Quantum Cryptography (PQC) Public-key Encryption and Key-establishment Algorithms. In contrast to signature algorithms, adopting KEM algorithms appears more pressing due to the threat of "harvest now, decrypt later". In August 2023, the Chromium Project declared its adoption of a hybrid cryptographic algorithm (X25519/Kyber768) for Chrome and Google servers [Blo].
In the realm of cryptographic schemes based on lattice-related problems, such as Ring-LWE [LPR10], Module-LWE [LS15], and Module-LWR [BPR12], the most time-consuming components typically involve polynomial multiplication (over the ring R_q) and hash functions. Hash functions primarily entail bit operations, which can be expedited through readily available commercial products featuring processor-assisted accelerations, such as the SHA extension found in Intel and ARM CPUs [SKS + 21]. Consequently, much of the effort in lattice-based cryptography (LBC) acceleration is centered on optimizing polynomial multiplication. Various techniques exist to expedite polynomial multiplication. In addition to leveraging the Karatsuba [Kar63] and Toom-Cook algorithms [Too63], a prevalent approach is to employ the Number Theoretic Transform (NTT), especially when the condition n|(q − 1) holds, where q represents the modulus and n signifies the dimension. Kyber [BDK + 18, Sch], for instance, even incorporates a customized NTT within its algorithms to enhance efficiency.
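To make the NTT approach concrete, the following toy sketch (with assumed parameters q = 257 and n = 16, not Kyber's actual parameters, and a textbook O(n^2) transform for clarity) shows that pointwise multiplication in the NTT domain matches schoolbook negacyclic multiplication in Z_q[X]/(X^n + 1), provided 2n | (q − 1):

```python
# Toy NTT-based negacyclic multiplication in Z_q[X]/(X^n + 1), assuming 2n | (q - 1).
q, n = 257, 16                          # 257 is prime and 3 generates Z_257*
psi = pow(3, (q - 1) // (2 * n), q)     # primitive 2n-th root of unity
omega = psi * psi % q                   # primitive n-th root of unity

def ntt(a):
    # psi-weighted (negacyclic) forward transform, O(n^2) for clarity
    return [sum(a[j] * pow(psi, j, q) * pow(omega, i * j, q) for j in range(n)) % q
            for i in range(n)]

def intt(ah):
    n_inv = pow(n, q - 2, q)
    a = [sum(ah[i] * pow(omega, -i * j, q) for i in range(n)) * n_inv % q
         for j in range(n)]
    return [a[j] * pow(psi, -j, q) % q for j in range(n)]

def negacyclic_schoolbook(a, b):
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else -1   # wrap-around picks up a minus sign
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * b[j]) % q
    return c

a = [i % q for i in range(n)]
b = [(3 * i + 1) % q for i in range(n)]
via_ntt = intt([x * y % q for x, y in zip(ntt(a), ntt(b))])
assert via_ntt == negacyclic_schoolbook(a, b)
```

The same convolution theorem underlies Kyber's NTT, although Kyber's ring only admits 256-th (not 512-th) roots of unity, which is why its NTT splits into 128 degree-2 factors rather than fully linear ones.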
Up to this point, researchers have put forth numerous optimized implementations of Kyber across various hardware and software platforms, with particular emphasis on leveraging the NTT. Notably, for general use cases, researchers have crafted optimized implementations using SIMD (Single Instruction, Multiple Data) vector instructions on widely adopted x86 and ARM platforms [SKS + 21].

An Opportunity for PQC with AI-accelerators
Many manufacturers have developed high-performance AI (artificial intelligence) accelerators to cater to the demands of AI applications. Notable examples include Google TPU [Clo], Apple M1 [App], and NVIDIA Tensor Core [NVIb].
Compared to general-purpose processors, AI accelerators primarily emphasize low-precision arithmetic and novel data-flow architectures, and often boast significantly higher computational power. For instance, in 2017, NVIDIA introduced the Volta architecture, notable for being the first graphics processor to feature dedicated cores, known as Tensor Cores, specifically designed for tensor calculations. These Tensor Cores can perform 64 General Matrix Multiplications (GEMMs) per clock cycle on 4 × 4 matrices, operating on FP16 (16-bit floating-point) values or combining FP16 multiplication with FP32 addition. Despite the small tile size, these cores effectively break larger matrices down into smaller tiles to compute the final results. Tensor Cores are further enhanced in each new NVIDIA GPU architecture generation: the Tensor Cores of the latest Tesla H100 can deliver up to 1979 Tensor TFLOPS for INT8 precision data. Among embedded products, the NVIDIA Jetson AGX Orin offers supercomputer-level performance of up to 275 Tensor TOPS while consuming up to 60 W of power [NVIc].
The formidable capabilities of AI accelerators open up new possibilities for accelerating Post-Quantum Cryptography (PQC). One pioneering effort in this domain was undertaken by Wan et al. [WZL21]. They harnessed the power of Volta Tensor Cores to accelerate the lattice-based scheme LAC [LLZ + 18], which utilizes a byte-level modulus. This endeavor yielded significant performance improvements.

Technical Challenges and Contributions
The substantial performance advantage has inspired researchers to integrate AI accelerators into cryptographic implementations. Since AI accelerators are specialized for machine learning and neural networks, the primary challenge lies in adapting cryptographic workloads to their operations while ensuring the accuracy of results and achieving significant performance improvements.
It is noteworthy that most prior work has focused on lattice-based cryptography without the need for the Number Theoretic Transform (NTT), specifically schemes like LAC [WZL21, LSZH22] or NTRU [LSZH22]. Their approaches are quite straightforward. For example, Wan et al. [WZL21] extend the coefficient vector of one operand into a matrix resembling a Toeplitz matrix, thereby transforming the polynomial multiplication c = ab over the ring R_q = Z_q[X]/(X^n + 1) in LAC into a vector-matrix multiplication c^T = a^T · T(b), where row i of T(b) holds the coefficients of X^i · b mod (X^n + 1). Then, they batch multiple such vector-matrix multiplications into matrix-matrix multiplications. These matrix-matrix multiplications can be directly accelerated using the dedicated instructions of Tensor Cores, and the byte-level modulus q = 251 used in LAC aligns perfectly with the most efficient INT8 parameter configuration of Tensor Cores. However, the more commonly employed NTT-based schemes, such as Kyber, are not directly amenable to Tensor Core acceleration. The prevalent approach for polynomial multiplication using NTT relies on the divide and conquer method, and the resulting butterfly operations are incompatible with the Tensor Core's operational mode. In a recent work, Wan et al.
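A minimal sketch of this Toeplitz-style rewriting (with hypothetical toy sizes, not LAC's actual dimension, and helper names of our own choosing) checks that the vector-matrix product agrees with schoolbook negacyclic multiplication:

```python
# c = a*b over Z_q[X]/(X^n + 1) rewritten as a vector-matrix product.
q, n = 251, 8   # LAC's byte-level modulus; n shrunk for illustration

def toeplitz_matrix(b):
    # row i holds the coefficients of X^i * b(X) mod (X^n + 1);
    # wrapped-around entries pick up a minus sign
    M = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            k = (i + j) % n
            M[i][k] = (-b[j]) % q if i + j >= n else b[j]
    return M

def vec_mat(a, M):
    return [sum(a[i] * M[i][k] for i in range(n)) % q for k in range(n)]

def schoolbook(a, b):
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else -1
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * b[j]) % q
    return c

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [8, 7, 6, 5, 4, 3, 2, 1]
assert vec_mat(a, toeplitz_matrix(b)) == schoolbook(a, b)
```

Batching then amounts to stacking many such row vectors a into one matrix, which maps directly onto the Tensor Core GEMM primitive.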
[WZF + 22] further introduced a practical implementation of Kyber that leverages Tensor Cores. In this implementation, they employed the formula method to replace the commonly used butterfly calculations in the NTT. Their research into the relationship between accuracy and performance of WMMA (Warp Matrix Multiply and Accumulate) operations provides valuable insights for future endeavors in this domain. The essence of Wan et al.'s approach [WZF + 22] lies in transforming NTT operations into a "large" matrix multiplication of size 128 × 128 and processing this large matrix using the native 16 × 16 matrix multiplication instructions of Tensor Cores. This strategy has delivered highly favorable outcomes, primarily because the dimension n = 128 is relatively small, achieving a remarkable improvement of over six-fold compared to traditional implementations [GJCC20].

However, it is evident that this direct and straightforward approach comes with significant computational complexity.The primary goal of this paper is to harness Tensor Cores to implement a more efficient algorithm, addressing this complexity and enhancing performance.
Based on our exploration of the internal mechanism of Tensor Cores and the unique design of the customized NTT in Kyber, this paper proposes an NTT calculation scheme dedicated to Kyber and further provides a complete implementation of Kyber. We name this framework ConvKyber because we convert the core NTT operations into operations resembling convolutions to better leverage Tensor Cores.
This paper mainly includes the three following contributions.
• Firstly, our work forms an iteration-based framework for AI accelerators to accelerate module-lattice based cryptography. Two innovative approaches are proposed under this framework that efficiently break down Kyber's NTT into iterative matrix multiplications, resulting in approximately a 75% reduction in cost compared to the state-of-the-art scanning-based methods. Although these approaches are highly tailored for Kyber's NTT, they can be slightly modified for universal NTT acceleration. (Section 3)
• Secondly, we have gained a deep understanding of the internal operational mechanisms of Tensor Cores behind the WMMA API through reverse-engineering. We precisely manipulate the internal resources of Tensor Cores using assembly-level code rather than relying on inefficient standard interfaces, thereby eliminating memory accesses and redundant function calls. (Section 4)
• Finally, building upon our highly optimized NTT, we provide a complete implementation for all parameter sets of Kyber, including Kyber-512/768/1024, with non-trivial optimizations for SHA-3, memory access coalescing, etc. (Section 5)
For the NTT part, empirical results illustrate that our proposed NTT schemes achieve a 12.48x performance improvement over conventional butterfly-based NTT on the same GPU platform. Even when compared to the state-of-the-art Tensor Core based NTT, we achieve a 93% speed-up. For the full Kyber implementation, we achieve speed-ups of 1.93x, 1.65x, 1.22x, and 3.55x for polyvec_ntt, KeyGen, Enc, and Dec in Kyber-1024 over the state-of-the-art Tensor Core based implementation on the same GPUs. We have also achieved improvements of 1 to 2 orders of magnitude compared to CPU, FPGA, and other embedded platforms. Although achieving such performance advantages requires simultaneous processing of thousands of requests, and the execution latency for these thousands of requests is several hundred times higher than that of other platforms (e.g., CPU, FPGA) handling individual requests, our throughput-oriented full Kyber implementation maintains an acceptable execution latency and is highly suitable for scenarios targeting large-scale users. (Section 6)
The rest of this paper is organized as follows. Section 2 introduces background knowledge, especially the details of Kyber and the features of Tensor Cores. Section 3 describes our approach to converting the NTT into a Tensor Core workload. Section 4 presents the proposed Tensor Core based NTT implementation. Section 5 demonstrates the implementation details. In Section 6, we present a series of evaluation and comparison results and discuss the limitations and generalization of the proposed schemes. Finally, we conclude in Section 7.

Preliminary
In this section, we give a basic background of Kyber and AI accelerators.

Notation
For a prime q, Zq = {0, 1, . . . , q − 1} is the residue class ring modulo q, and Zq^n denotes vectors of n coefficients from Zq. Define the ring Rq = Zq[X]/(X^n + 1), whose coefficients are from Zq. Regular-font letters denote elements in Rq (which includes elements in Zq), and bold lower-case letters represent vectors with coefficients in Rq. By default, all vectors are column vectors. Bold upper-case letters are matrices. For a vector v (or matrix A), v^T (or A^T) denotes its transpose, and v[i] denotes its i-th entry (with indexing starting at zero). For a matrix A, A[i][j] denotes the entry in row i, column j (again, with indexing starting from zero). The rank k indicates that a polynomial vector contains k polynomials and a matrix contains k × k polynomials. For a finite field F = Zq, a primitive n-th root of unity ω, satisfying ω^n ≡ 1 (mod q), exists whenever n|(q − 1).

Module-LWE
A lattice is the set of all integer linear combinations of some linearly independent vectors in Euclidean space. Most lattice-based cryptographic schemes are built upon the assumed hardness of the Short Integer Solution (SIS) [Ajt96] and Learning With Errors (LWE) [Reg05] problems. The LWE problem was popularized by Regev [Reg05], who showed that solving a random LWE instance is as hard as solving certain worst-case instances of certain lattice problems. This assumption states that it is hard to distinguish the uniform distribution from (A, As + e), where A is a uniformly random matrix in Zq^{m×n}, s is a uniformly random vector in Zq^n, and e is chosen from some error distribution. Later, Lyubashevsky et al. [LPR10] introduced a ring-based adaptation of LWE, called Ring-LWE, and showed that it is also hard to distinguish a variant of the LWE distribution from the uniform one over certain polynomial rings. Combining the security advantages of LWE and the flexibility of Ring-LWE, Langlois et al. [LS15] demonstrated worst-case to average-case reductions for module lattices. Intuitively, the matrix A in Module-LWE has size k × k, where k is the rank, and its entries are polynomials in Rq (equivalently, coefficient vectors in Zq^n).

Description of Kyber
Kyber is an IND-CCA2-secure post-quantum key-encapsulation mechanism. The security of Kyber is based on the hardness of solving the LWE problem in module lattices. The submission to NIST PQC [SAB + 22] lists three parameter sets, Kyber-512, Kyber-768, and Kyber-1024, aiming at security levels roughly equivalent to AES-128, AES-192, and AES-256, respectively. The parameters are listed in Table 1, where η1 and η2 are the parameters of the centered binomial distribution (CBD). The key generation, encryption, and decryption are described in Algorithms 1, 2, and 3. In the KeyGen phase, d is a random number; ρ and σ are fixed-length intermediate variables generated from d through the hash function G. The parameter Â is a k × k polynomial matrix generated from ρ. The parameters s and e are polynomial vectors generated through different sample functions but the same distribution B_η1. The final outputs are then compressed and encoded.

Algorithm 1 KYBER.CPAPKE.KeyGen(): key generation
Ensure: Secret key sk, Public key pk.
1: d ← {0, 1}^256
2: (ρ, σ) := G(d)
3: Â ∈ R_q^{k×k} := Parse(XOF(ρ))
4: s ∈ R_q^k sampled from B_η1 using σ
5: e ∈ R_q^k sampled from B_η1 using σ
6: ŝ := NTT(s)
7: ê := NTT(e)
8: t̂ := Â • ŝ + ê
9: return pk := Encode(t̂||ρ), sk := Encode(ŝ)

In the Enc phase, the public key pk is decoded first. Here, we need to emphasize that e2 and v are polynomials rather than vectors. The ciphertext c consists of two parts, c1 and c2, which are obtained from u and v with different encodings. Correspondingly, in the Dec phase, these two parts need to be decoded with different functions first. Then the NTT and the subsequent INTT are performed.
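The CBD sampling used for s and e can be sketched as follows. This is an illustrative toy (SHAKE-256 stands in for Kyber's PRF, and the exact input encoding and bit ordering of the real sampler are not reproduced): each coefficient is a difference of two sums of η pseudorandom bits, giving values in [−η, η].

```python
# Illustrative centered binomial sampling B_eta, as used for Kyber's noise.
import hashlib

def cbd(eta, seed, n=256):
    # expand the seed into 2*eta*n pseudorandom bits (SHAKE-256 is a
    # stand-in here; the real scheme fixes the exact PRF inputs)
    stream = hashlib.shake_256(seed).digest(2 * eta * n // 8)
    bits = [(stream[i // 8] >> (i % 8)) & 1 for i in range(2 * eta * n)]
    coeffs = []
    for i in range(n):
        a = sum(bits[2 * i * eta + j] for j in range(eta))
        b = sum(bits[(2 * i + 1) * eta + j] for j in range(eta))
        coeffs.append(a - b)   # each coefficient lies in [-eta, eta]
    return coeffs

poly = cbd(2, b"example-seed")
assert len(poly) == 256 and all(-2 <= c <= 2 for c in poly)
```

The small range of these coefficients (at most 3 significant bits for η = 2) is exactly what later makes them fit the INT8 path of Tensor Cores without any splitting.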

AI Accelerator and Tensor Core
Due to the explosive growth of AI applications, general-purpose processors have difficulty meeting the needs of machine learning. Therefore, a dedicated AI accelerator, i.e., an application-specific integrated circuit with a more specialized design, may achieve far greater efficiency. Well-known AI accelerators include Google TPU, Apple M1/M1 Pro/M1 Max, and ARM NPU. These accelerators mainly focus on optimized memory use and lower-precision arithmetic to accelerate calculation and increase throughput.
Tensor Core. In December 2017, NVIDIA released the 1st generation Tensor Core (on the Volta architecture), dedicated to tensor calculations. These Tensor Cores are designed to carry out 64 GEMMs (General Matrix Multiplications) per clock cycle on 4 × 4 matrices, containing FP16 values (16-bit floating-point numbers) with accumulation in FP16 or FP32 (the float format). A year later, NVIDIA launched the Turing architecture Tensor Core, which was updated to support other data formats, such as INT8 (8-bit integer values). In the Ampere architecture, NVIDIA further improved performance (256 GEMMs per cycle, up from 64) and added further data formats, as shown in Table 2.
Up to now, Tensor Cores only support operations at the warp level, usually 32 threads. The warp matrix functions require cooperation from all threads in the warp and perform D = A × B + C, where A, B, C, D are matrices of specific sizes, as shown in Fig. 2.
This is further complicated by each thread holding only a fragment (a type of opaque, architecture-specific ABI data structure) of the overall matrix, with developers not allowed to make assumptions about how the individual elements are mapped to the registers participating in the matrix multiply-accumulate. There are also restrictions on matrix sizes. Generally, k is fixed to 16, and m can be 8, 16, or 32 (with n correspondingly 32, 16, or 8).
The WMMA instruction exhibits the shortest latency when using INT8 as the computational precision [WZF + 22].We have adopted this technique, decomposing values exceeding 8 bits into multiple INT8 representations.
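The limb-splitting idea can be sketched as follows (a toy model in plain Python, with an assumed split point of 2^6 chosen so both limbs of a 12-bit Kyber coefficient stay well within the signed INT8 range; on the GPU the four limb products would be INT8 × INT8 multiplications accumulated in INT32):

```python
# Multi-precision sketch: x = 2^6 * xh + xl, so xh < 53 and xl < 64 for x < q.
q = 3329
B = 1 << 6   # assumed split point; both limbs fit in signed INT8

def split(x):
    return x // B, x % B          # (high, low) limbs

def mul_int8_limbs(x, y):
    xh, xl = split(x)
    yh, yl = split(y)
    # four small products, each far below INT32 overflow, then recombine
    return (xh * yh * B * B + (xh * yl + xl * yh) * B + xl * yl) % q

for x in range(0, q, 97):
    for y in range(0, q, 101):
        assert mul_int8_limbs(x, y) == x * y % q
```

This mirrors the split-Mul category introduced in Section 4.1 for moduli that exceed 8 bits.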
Given the degree of the polynomial (n = 256) in Kyber and the proposed schemes (Section 3.2.2 and Section 3.2.3), we employ m, n, and k values of 16, 16, and 16, respectively.

Decomposing NTTs into Matrix Multiplications
This section describes how we decompose NTTs into matrix multiplications.

The Limitation of Scanning-based Methods
On average, computing a single coefficient in the NTT domain requires 128 multiplications.
It is important to note that only a fixed-size tile can be loaded into a fragment at a time, while the target matrix is significantly larger. Therefore, Wan et al. devised two scanning methods based on the raw precision of the data being processed. For parameters whose element values fit within 8 bits (i.e., below 256, or below 128 in magnitude for signed numbers), such as the secret s and the random noise r, e generated from the CBD (centered binomial distribution), which have at most 3 significant bits, they employed a basic NTT method.
However, this direct approach is not very efficient. Firstly, it requires 16 concurrent inputs, i.e., batching is necessary. Secondly, its computational workload is relatively high. For Kyber's NTT, it necessitates a total of 2 × 128 × 128 element multiplications (where the factor 2 accounts for the odd and even coefficients). Consequently, the computational complexity of their scheme for a single NTT operation is equivalent to performing (2 × 128 × 128)/(16 × 16 × 16) = 8 instances of 16 × 16 matrix multiplication. While this scheme is functional, it is straightforward and leaves ample room for optimization.
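The scanning-based view can be checked numerically: the 128-point transform of the even coefficients is one 128 × 128 matrix-vector product over Z_3329 with ζ = 17. The following sketch (with an arbitrary test input of our own choosing) verifies the matrix form against direct polynomial evaluation at the corresponding roots, and confirms the tile count above:

```python
# Scanning-based method: the even-coefficient NTT as a 128 x 128 matrix product.
q, zeta = 3329, 17

def br7(i):
    return int(format(i, '07b')[::-1], 2)

def horner(coeffs, y):
    # evaluate sum(coeffs[j] * y^j) mod q
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * y + c) % q
    return acc

# the 128 x 128 twiddle matrix used by the scanning-based method
M = [[pow(zeta, (2 * br7(i) + 1) * j, q) for j in range(128)] for i in range(128)]
f_even = [(5 * j + 1) % q for j in range(128)]      # arbitrary test input

mat_vec = [sum(M[i][j] * f_even[j] for j in range(128)) % q for i in range(128)]
direct = [horner(f_even, pow(zeta, 2 * br7(i) + 1, q)) for i in range(128)]
assert mat_vec == direct

# tile count for one full NTT: (2 * 128 * 128) / (16 * 16 * 16) = 8
assert (2 * 128 * 128) // (16 ** 3) == 8
```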

The Proposed Iteration-based Algorithms
In this section, we decompose NTTs into matrix multiplications (Section 3.2.1). Building on this, and varying the handling of twiddle factors, we introduce a 2-phase scheme (Section 3.2.2) and a 3-phase scheme (Section 3.2.3), respectively.

Decomposing Kyber's NTT
Let us revisit Kyber's NTT. Kyber has incorporated NTT into its algorithms, resembling the polynomial form of the Chinese Remainder Theorem (CRT). For the prime modulus q = 3329 with q − 1 = 2^8 · 13, the base field Zq includes 2^8-th roots of unity. Consequently, the defining polynomial (X^256 + 1) of the ring R can be factored into 128 polynomials of degree 2 modulo q:

X^256 + 1 ≡ ∏_{i=0}^{127} (X^2 − ζ^{2·br7(i)+1}) (mod q).

Here, br7(i) for i = 0, 1, . . . , 127 represents the bit reversal of the unsigned 7-bit integer i. Consequently, the NTT of a polynomial f ∈ Rq results in a vector of 128 polynomials of degree 1, and can be represented as:

NTT(f) = f̂ = (f mod (X^2 − ζ^{2·br7(0)+1}), . . . , f mod (X^2 − ζ^{2·br7(127)+1})).

In Kyber, the NTT is defined to be the bijection NTT : Rq → Rq that maps f ∈ Rq to the polynomial with the aforementioned coefficient vector. Hence,

f̂_{2i} = Σ_{j=0}^{127} f_{2j} · ζ^{(2·br7(i)+1)·j},   (1)

f̂_{2i+1} = Σ_{j=0}^{127} f_{2j+1} · ζ^{(2·br7(i)+1)·j}.   (2)

These equations are evidently far from the 16 × 16 matrix multiplications we envision. As a starting point, we can attempt to transform the input vector of 128 elements into a 16 × 8 matrix. We take the NTT calculation of f̂_{2i} as an example (i.e., Equation (1)). The NTT calculation for f̂_{2i+1} is identical, so we will not derive it separately. While there are some differences in the INTT part (mainly in the order of i and j), the basic approach is similar, and we detail it in Appendix A.
To facilitate subsequent explanations, we introduce the following notations: the indices are decomposed as i = 8·i0 + i1 and j = 8·j0 + j1, with i0, j0 ∈ Z16 and i1, j1 ∈ Z8, so that br7(8·i0 + i1) = 16·br3(i1) + br4(i0). Using these notations, Equation (1) can be deduced as follows:

f̂_{2i} = Σ_{j1=0}^{7} ζ2(i1, j1) · ζ1(i0, j1) · Σ_{j0=0}^{15} ζ0(i0, j0) · f_{2(8·j0+j1)}.   (5)

It should be noted that ζ^{256k} ≡ 1 (mod q) for k ∈ Z, and br7, br4, br3 represent the 7-bit, 4-bit and 3-bit bit-reverse functions, respectively. The sub-twiddle factors are denoted as follows:

ζ0(i0, j0) = ζ^{8·j0·(2·br4(i0)+1)},  ζ1(i0, j1) = ζ^{j1·(2·br4(i0)+1)},  ζ2(i1, j1) = ζ^{32·j1·br3(i1)}.

Equation (5) will serve as the foundation for our work. While designed for the NTT transform of even-indexed terms, this formula can also be applied to the NTT transform of odd-indexed terms and the INTT transform of both odd and even-indexed terms. We won't delve into those aspects here.
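The sub-twiddle factorization can be checked numerically. The sketch below (under our reading of the index decomposition, i = 8·i0 + i1 and j = 8·j0 + j1 with i0, j0 ∈ Z16 and i1, j1 ∈ Z8) verifies that the full twiddle ζ^{(2·br7(i)+1)·j} splits exactly into the three small factors, using the fact that ζ = 17 has order 256 modulo 3329:

```python
# Check: zeta^((2*br7(i)+1)*j) == zeta0 * zeta1 * zeta2 for all i, j in Z_128.
q, zeta = 3329, 17

def br(x, bits):
    return int(format(x, '0%db' % bits)[::-1], 2)

def z0(i0, j0): return pow(zeta, 8 * j0 * (2 * br(i0, 4) + 1), q)
def z1(i0, j1): return pow(zeta, j1 * (2 * br(i0, 4) + 1), q)
def z2(i1, j1): return pow(zeta, 32 * j1 * br(i1, 3), q)

for i in range(128):
    for j in range(128):
        i0, i1 = i // 8, i % 8      # i = 8*i0 + i1
        j0, j1 = j // 8, j % 8      # j = 8*j0 + j1
        lhs = pow(zeta, (2 * br(i, 7) + 1) * j, q)
        rhs = z0(i0, j0) * z1(i0, j1) * z2(i1, j1) % q
        assert lhs == rhs           # the 256*br3(i1)*j0 term vanishes mod 256
```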
In the subsequent section, we develop two distinct strategies, a 2-phase scheme and a 3-phase scheme, to efficiently compute the NTT based on Equation (5). Please refer to Appendix A for the discussion of the INTT.

2-Phase Scheme
It is important to highlight that all the sub-twiddle factors ζ0(i0, j0), ζ1(i0, j1), ζ2(i1, j1) in Equation (5) can be precomputed. A natural idea is to merge as many sub-twiddle factors as possible to avoid unnecessary online computation. However, it is not possible to merge all of them, because that would involve too many indices (i.e., i0, i1, j0, j1), resulting in an extremely large table. In practice, it is acceptable to work with construction methods that involve only three indices, and there are two possible approaches:

1. Offline computing the product of ζ0 and ζ1, denoted ζ01(i0, j0, j1), Equation (5) can be computed in two phases as follows:

Phase 1: g0(i0, j1) = Σ_{j0=0}^{15} ζ01(i0, j0, j1) · f_{2(8·j0+j1)};

Phase 2: f̂_{2i} = Σ_{j1=0}^{7} ζ2(i1, j1) · g0(i0, j1), where i = 8·i0 + i1.

2. Offline computing the product of ζ1 and ζ2, denoted ζ12(i0, i1, j1), Equation (5) can be computed in two phases as follows:

Phase 1: g0(i0, j1) = Σ_{j0=0}^{15} ζ0(i0, j0) · f_{2(8·j0+j1)};

Phase 2: f̂_{2i} = Σ_{j1=0}^{7} ζ12(i0, i1, j1) · g0(i0, j1), where i = 8·i0 + i1.

The computation method for f1 can be derived similarly, and the twiddle factors are the same as those for f0. When comparing these two merging approaches, the table for ζ01(i0, j0, j1) comprises 8 precomputed tables, each of size 16 × 16, while ζ12(i0, i1, j1) requires 16 tables. Clearly, the former is more space-efficient; therefore, we have chosen it for our 2-phase scheme.
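A sketch of the merged 2-phase evaluation (under the same reconstructed index and twiddle conventions as above, with an arbitrary test input): phase 1 applies the merged table ζ01 = ζ0 · ζ1 as one 16 × 16 matrix-vector product per j1, and phase 2 combines along j1 with ζ2. The result is checked against the direct 128-term sum.

```python
# Two-phase evaluation of the even-coefficient NTT with the merged table zeta01.
q, zeta = 3329, 17

def br(x, bits):
    return int(format(x, '0%db' % bits)[::-1], 2)

def z01(i0, j0, j1): return pow(zeta, (8 * j0 + j1) * (2 * br(i0, 4) + 1), q)
def z2(i1, j1):      return pow(zeta, 32 * j1 * br(i1, 3), q)

f_even = [(7 * j + 3) % q for j in range(128)]   # arbitrary even coefficients

# phase 1: for each j1, one 16x16 matrix-vector product over the j0 axis
g = [[sum(z01(i0, j0, j1) * f_even[8 * j0 + j1] for j0 in range(16)) % q
      for j1 in range(8)] for i0 in range(16)]
# phase 2: combine along j1 with the (sparse-expandable) zeta2 step
two_phase = [sum(z2(i % 8, j1) * g[i // 8][j1] for j1 in range(8)) % q
             for i in range(128)]

direct = [sum(f_even[j] * pow(zeta, (2 * br(i, 7) + 1) * j, q) for j in range(128)) % q
          for i in range(128)]
assert two_phase == direct
```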
Considering both f0 and f1, let us now delve into the implementation of the computational aspects of this scheme using matrix operations as outlined below. Note that f0 and f1 in the matrix are interwoven in rows, much like the coefficients of a polynomial are arranged in power order, alternating between even and odd. Besides, g0 and g1 represent intermediate values in the computation, and the elements within the shaded matrix are precomputed.
Phase 1 computes the following terms: for i0 = 0, . . . , 15 and j1 = 0, . . . , 7,

g0(i0, j1) = Σ_{j0=0}^{15} ζ01(i0, j0, j1) · f_{2(8·j0+j1)},

and g1 is obtained analogously from the odd coefficients. In Phase 2, there is a significant difference: ζ2(i1, j1) will be expanded into an 8 × 8 matrix (notably, i1, j1 ∈ Z8), while the minimum unit of a Tensor Core is 16 × 16. Therefore, we further expand this small matrix into a 16 × 16 sparse matrix, simultaneously completing the computations for f0 and f1. In Figure 4, the red box highlights the elements involved in the computation of g0(0, 0), serving as an example.
Figure 4: The second phase of the 2-phase scheme

3-Phase Scheme
Merging sub-twiddle factors leads to a larger precomputed table size, necessitating the introduction of a batching process. Moving forward, let us return to the non-merging approach, which is:

f̂_{2i} = Σ_{j1=0}^{7} ζ2(i1, j1) · [ζ1(i0, j1) · Σ_{j0=0}^{15} ζ0(i0, j0) · f_{2(8·j0+j1)}], where i = 8·i0 + i1.

The calculation method for f1 can be derived analogously.
Taking both f0 and f1 into consideration, let us now detail how the computational aspects of this scheme can be implemented using matrix operations as follows. Note that f0 and f1 in the matrix are interwoven in rows, much like the coefficients of a polynomial are arranged in power order, alternating between even and odd. Besides, g0, g1, h0, h1 merely represent intermediate values in the computation, and the elements within the shaded matrix are precomputed.
Phase 1 computes the following terms: for i0 = 0, . . . , 15 and j1 = 0, . . . , 7,

g0(i0, j1) = Σ_{j0=0}^{15} ζ0(i0, j0) · f_{2(8·j0+j1)}.

The preceding equations will be expanded into a matrix multiplication, as illustrated in Figure 5. In Figure 5, the red box denotes the elements involved in the computation of g0(0, 0), serving as an example. Different from the 2-phase approach, phase 1 of the 3-phase approach requires only one 16 × 16 precomputed table.
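The full 3-phase (non-merged) evaluation can be sketched as follows, under the same reconstructed conventions as before: one 16 × 16 table (ζ0) for the matrix step, an elementwise scaling (ζ1), and a final combination (ζ2), again checked against the direct sum on an arbitrary input.

```python
# Three-phase evaluation of the even-coefficient NTT without table merging.
q, zeta = 3329, 17

def br(x, bits):
    return int(format(x, '0%db' % bits)[::-1], 2)

z0 = [[pow(zeta, 8 * j0 * (2 * br(i0, 4) + 1), q) for j0 in range(16)] for i0 in range(16)]
z1 = [[pow(zeta, j1 * (2 * br(i0, 4) + 1), q) for j1 in range(8)] for i0 in range(16)]
z2 = [[pow(zeta, 32 * j1 * br(i1, 3), q) for j1 in range(8)] for i1 in range(8)]

f_even = [(11 * j + 5) % q for j in range(128)]   # arbitrary even coefficients

# phase 1: the single 16x16 table times the 16x8 coefficient matrix f[j0][j1]
g = [[sum(z0[i0][j0] * f_even[8 * j0 + j1] for j0 in range(16)) % q
      for j1 in range(8)] for i0 in range(16)]
# phase 2: elementwise multiplication by zeta1
h = [[z1[i0][j1] * g[i0][j1] % q for j1 in range(8)] for i0 in range(16)]
# phase 3: combine along j1 with zeta2
three_phase = [sum(z2[i % 8][j1] * h[i // 8][j1] for j1 in range(8)) % q
               for i in range(128)]

direct = [sum(f_even[j] * pow(zeta, (2 * br(i, 7) + 1) * j, q) for j in range(128)) % q
          for i in range(128)]
assert three_phase == direct
```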

Summary of These 2 Approaches
Decomposing the 256-point NTT schemes into atomic operations (element multiplications and 16 × 16 matrix multiplication operations), the comparison between the 4 NTT schemes is summarized in Table 3. Note that in the first phase of the 2-phase scheme, the 16 matrix-vector multiplications fold into 1 matrix multiplication.
As observed in Table 3, the butterfly operation has the lowest computational complexity.However, the scanning-based scheme demonstrates superior performance when implemented with Tensor Cores.
It is crucial to emphasize that while the butterfly operation requires fewer element multiplications, the GPU's general instruction set only supports INT32 instructions; even though each element is relatively small, butterfly-based implementations must still use the slower INT32 instructions. In contrast, Tensor Cores natively support the faster INT8 instructions. Therefore, Table 3 provides only a rough comparison. Our proposed iteration-based scheme reduces the computational complexity to a quarter (considering only matrix multiplications) and is also highly compatible with Tensor Core acceleration. Furthermore, there is no substantial disparity in computational load between the 2-phase scheme and the 3-phase scheme (the 2-phase scheme involves additional operations to integrate matrix-vector multiplications). Therefore, further investigation is required, and their performance should be empirically compared through implementations.

The Proposed NTT Implementation
Section 3 provides the basic principle of the proposed Kyber NTT. This section further illustrates how to implement the two schemes with Tensor Cores, highlighting three key techniques to overcome the limitations of native Tensor Core instructions: • Splitting Kyber's NTT into small Tensor Core based NTTs: We briefly outline how to implement finite field arithmetic for Kyber using the limited precision of Tensor Cores. A standard Kyber NTT is separated into several Tensor Core based NTT computations.
• A batch implementation for the 2-phase NTT: Based on the 2-phase NTT scheme, we implement a batch NTT that simultaneously processes 8 NTT operations. Additional pre-processing and post-processing steps transform the requests into matrices that can be accelerated by Tensor Cores.
• A memory-free 3-phase NTT implementation: Based on the 3-phase NTT scheme and a thorough exploration of the internal mechanisms of Tensor Cores, we will implement a novel memory-free NTT implementation.

Multi-precision Presentation and NTT Splitting
As Section 2 describes, Tensor Cores perform FMA mixed-precision operations, with low-precision input and high-precision output. For example, on the Ampere architecture, the input can be INT8 (char) and the output INT32 (int). Wan et al. [WZF + 22] experimented with the performance of different precision combinations and pointed out that the WMMA matrix multiplication instruction with INT8 precision delivers the highest performance, even though it offers very limited precision. Consequently, depending on the modulus, we categorize matrix multiplication using the WMMA instruction into two groups: basic-Mul for smaller moduli and split-Mul for larger moduli.
Additionally, the computation process of split-Mul can be integrated with Karatsuba multiplication [Kar63]. At the cost of additional additions, the four multiplications of the schoolbook approach can be reduced to three by Karatsuba multiplication. The detailed computation process of split-Mul combined with Karatsuba multiplication is illustrated in Figure 7. We implement both split-Mul combined with schoolbook multiplication and split-Mul combined with Karatsuba multiplication, with their performance comparison detailed in Section 6.
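The Karatsuba variant of the limb multiplication can be sketched as follows (same assumed split point 2^6 as before; note that the limb sums xh + xl stay below 117, so they still fit in a signed INT8):

```python
# split-Mul with Karatsuba: three limb multiplications instead of four.
q, B = 3329, 1 << 6

def karatsuba_limb_mul(x, y):
    xh, xl = x // B, x % B
    yh, yl = y // B, y % B
    hi = xh * yh
    lo = xl * yl
    mid = (xh + xl) * (yh + yl) - hi - lo   # equals xh*yl + xl*yh
    return (hi * B * B + mid * B + lo) % q

for x in range(0, q, 89):
    for y in range(0, q, 93):
        assert karatsuba_limb_mul(x, y) == x * y % q
```

The trade-off is exactly the one named above: one fewer multiplication in exchange for the extra additions and subtractions forming the cross term.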

2-phase NTT Implementation with Batch of 8

Pre- and Post-Processing
In the 2-phase NTT scheme, the first step involves splitting a matrix containing 256 polynomial coefficients into 16 vectors and performing 16 matrix-vector multiplications. These 16 vectors cannot simply be merged, because the matrices they are multiplied with are not entirely identical. A straightforward approach would be to calculate all 16 NTT operations simultaneously and then merge the vectors at corresponding positions for computation. However, it is worth noting that vectors at odd positions and their corresponding vectors at even positions use the same twiddle factor matrix. Hence, we only need to compute 8 NTT operations concurrently to fill a 16 × 16 matrix.
The input for these 8 NTT operations consists of 8 vectors, each containing 256 elements. These vectors are recombined into eight 16 × 16 matrices, which are multiplied individually with 8 different precomputed twiddle factor matrices. The resulting product matrices are then decomposed once again to restore the original arrangement.
Before delving into the details of this scheme, let us establish a foundation by assuming that we are about to perform NTT on eight polynomials.For the sake of reference, we denote their respective coefficient matrices as P 0 through P 7 .

Figure 8: Pre- and post-processing of the 2-phase scheme

The initial step of the batch 2-phase scheme is depicted in Figure 8. To give context, let us consider the precomputed matrix Z0 (defined by Equation (8)). It corresponds to the first and second columns (highlighted by the red box) spanning from P0 to P7. These columns are consolidated into a restructured matrix, labeled 0, which is subsequently multiplied by Z0 and then reverted to its original positions. The computation relating to Z7 (defined by Equation (9)) can be derived in a similar fashion, as illustrated by the blue box.
In the actual implementation, to facilitate the coalescing of memory accesses, the restructured matrix should be in row-major order rather than column-major order (with elements of the same row arranged consecutively in memory). Therefore, all matrices involved in the first phase need to be transposed, and the left matrix multiplications should be changed to right multiplications.
In the second phase of the batch 2-phase scheme, following the computations carried out in the preceding stage, the coefficient matrices associated with polynomials 0 through 7 are multiplied with a precomputed matrix formed by ζ2 (defined in Section 3.2.1). As a result, we obtain the NTT domain representations of all eight polynomials simultaneously.
It is worth noting that the first matrix multiplication in the scheme requires pre-processing and post-processing. However, the pre-processing can be omitted when the NTT input is a polynomial obtained by sampling.

Implementation Summary
The resulting implementation method is illustrated in Figure 9.

Batch processing is constrained by phase 1, requiring each block to simultaneously contain 8 warps, which means the block size must be a multiple of 256. In terms of memory, shared memory is used for reorganizing matrices and facilitating data exchange among the 8 warps after multiplication. Since the precomputed tables are fixed, we can load them all at once when loading the binary, and no additional memory writes are needed when using the precomputed tables subsequently.

Limitation of the Standard Interfaces of Tensor Cores
Whether it is the 2-phase or 3-phase approach, we must execute two WMMA instructions sequentially, with the output of the first instruction serving as the input for the second.
Listing 1 provides a detailed explanation of the standard method for invoking the WMMA API. After computation, the results need to be stored in (global or shared) memory using wmma::store_matrix_sync before proceeding with the next operation. However, this step appears to be unnecessary in theory if we recursively use the output as input, as it should be possible to fully reuse the wmma::fragment variable. The challenge lies in the fact that wmma::fragment is type-specific, being one of wmma::matrix_a, wmma::matrix_b, or wmma::accumulator. Unfortunately, wmma::accumulator cannot be used as input, and its type cannot be converted. The only way out is to "store" it in memory and then "load" it into a new wmma::fragment.
We illustrate the limitation of the standard WMMA interfaces in Figure 10. It is evident that 4 load/store operations are entirely unnecessary. While these surplus memory transfers do not significantly impact the 2-phase approach, due to the necessity of pre- and post-processing in memory, they become entirely redundant in the case of the 3-phase approach.
Each wmma::fragment maps to 256/32 = 8 registers per thread. If we could manually manipulate these registers, it would become possible to perform two consecutive matrix multiplications without storing the intermediate result back in memory. However, this idea faces two challenges:
• First, wmma::fragment itself is an abstract representation, and its internal registers are opaque to developers. We do not have a direct way to access these register resources via the standard WMMA interfaces.
• Second, the storage arrangement within the warp for wmma::fragment is unknown. Regarding this point, the CUDA C++ Programming Guide (Release 12.2) states, "the mapping of matrix elements into fragment internal storage is unspecified and subject to change in future architectures" (§10.24.1 in [NVIa]).
Fortunately, while the first issue cannot be resolved through standard interfaces, we can achieve this through PTX ISA assembly code. Through examination of the compiled code using cuobjdump, we discovered that WMMA instructions, when actually executed, directly use registers rather than wmma::fragment as parameters. The relationship between wmma::fragment and registers is shown in Figure 11. As a result, we can delve into the data arrangement within the warp for wmma::fragment. The following observation is primarily based on the Jetson AGX Orin platform, with extended verification conducted on the RTX 3080 platform.
Data layout caused by wmma.load: A continuous memory address space containing 256 polynomial coefficients can be interpreted as a 16 × 16 matrix, denoted as M, with coefficients arranged sequentially, starting from row 0. When using wmma.load, it loads 256 data points into 32 threads within a warp, with each thread holding 8 data points. This data arrangement within these 32 threads can also be considered as a 16 × 16 matrix W, where the first row comprises the 16 data points from thread 0 and thread 1, and so forth. We subsequently refer to the transformation that converts the matrix M to the matrix W as the wmma.load transformation (W = wmma.load(M)), as depicted in Figure 12. Please be aware that this transformation is only correct when loading as matrix A; a different transformation occurs when loading as matrix B (refer to Figure 2 for the definitions of the matrices A and B).
Data layout caused by wmma.store: Similarly, we consider the data arrangement within the warp before wmma.store as the matrix W and the data arrangement in memory after wmma.store as the matrix M. We refer to the layout transformation from matrix W to matrix M as the wmma.store transformation, as depicted in Figure 13.

Figure 13: Fragment layout across 32 threads within a warp after wmma.store

Processing the Data Layout Displacement
Based on the exploration in the preceding section, a rather challenging issue arises due to data layout displacements between memory and registers across the warp. What complicates matters further is our plan to perform two consecutive matrix multiplications, where the result of the first multiplication serves as the input for the second. However, the data layout displacements caused by wmma.store and wmma.load are entirely different.
For instance, when used as input, after wmma.load, thread 0 stores elements with indices 0, 1, 2, 3, 128, 129, 130, and 131. However, when used as output, before wmma.store, thread 0 stores elements with indices 0, 1, 128, 129, 8, 9, 136, and 137. Therefore, before performing the next matrix multiplication, we must rearrange the elements stored by the threads from the layout at the output to the layout at the input.
As a result, the core issue lies in finding a cost-effective method to adjust the internal data layout of the warp, ensuring the correct calculation of the following matrix multiplication. One seemingly straightforward approach is to use the CUDA shuffle instruction or shared memory to rearrange the output data back into the required input layout. However, this approach contradicts our original intention, as shuffle instructions have efficiency comparable to INT32 multiplication, and shared memory introduces unnecessary complexity.
Upon careful analysis of the data arrangement, a crucial observation is that each row of the matrix, as actually represented, only needs to contain the same set of elements as in the original representation; the order within each row is irrelevant. The reason for this simplicity is that we are performing matrix operations, and matrices used for left multiplication participate in the operations on a row-by-row basis. Changing the order within the rows only requires us to alter the column order of the corresponding right-multiplication matrix (i.e., the precomputed table of twiddle factors). Building on this observation, we have devised an approach to tackle the data displacement challenge. We reconfigure the data layout within the warp, as illustrated in Figure 14. The resulting matrix can be obtained by performing matrix multiplication between the permutation matrix P (depicted in Figure 15) and the order matrix (the rightmost matrix of Figure 13).
Remarkably, after this adjustment, although the order may seem shuffled, each row i of the matrix contains exactly the elements 16i to 16i + 15. There remains just one task to complete: the corresponding precomputed table requires a right multiplication by P^-1.
Conveniently, the implementation of the above method is very neat. Notably, after wmma.mma, the 256 products are scattered across 32 threads, with 8 b32 registers in each thread. Let us label them sequentially as r0, r1, r2, r3, r4, r5, r6, and r7. To address the data layout confusion resulting from omitting wmma.store and wmma.load, we simply adjust the register order in all threads of a warp from (r0, r1, r2, r3, r4, r5, r6, r7) to (r0, r1, r4, r5, r2, r3, r6, r7), i.e., swap registers r2, r3 with r4, r5. This effectively performs the multiplication by the permutation matrix P. Regarding the corresponding precomputed table, since it is precomputed, we only need to make adjustments in the code to accommodate this change.
Figure 15: Permutation matrix P

Note that the matrix obtained through shuffling the registers among threads can only serve as matrix A in subsequent matrix multiplications. This is because the combined transformation of storing and loading matrix B cannot be implemented by shuffling registers (refer to Figure 2 for the definitions of the matrices A and B).

Implementation Summary
The final implementation consists of three phases and does not require additional memory; it is entirely register-based, as shown in Figure 16. Because it imposes no batch requirements, a block can include any number of warps, as opposed to the 2-phase scheme, which requires a multiple of 8. Moreover, there is no need for pre-processing, post-processing, or data reorganization, and the data input and output fully comply with Kyber's requirements.

Extending Proposed NTT Schemes to a Full Kyber Implementation
In this section, we will demonstrate how to extend our proposed NTT schemes to a complete Kyber implementation and elucidate some non-trivial optimization techniques.

Overall Architecture
Our full Kyber implementation follows the SIMT mode, where each thread handles one Kyber KeyGen/Enc/Dec instance. While the specific procedures may vary slightly for different phases, the high-level overview is depicted in Figure 17. This architecture is throughput-oriented. In fact, previous works either implemented NTT with a single thread [GJCC20] or required a high degree of batching (e.g., a batch size of 16 was necessary for [WZF + 22]). These designs have to be throughput-oriented because a single NTT operation could not fully leverage the parallel capabilities of the GPU. In contrast, our proposed 3-phase NTT scheme can complete a single NTT with one warp (i.e., 32 threads), and the 2-phase NTT scheme requires a batch size of only 8.
By lowering the concurrency requirements, we have the flexibility to choose whether our solution is throughput-oriented or latency-oriented. Since our primary focus is on NTT acceleration, and for direct comparison with existing literature, we have implemented full Kyber in a more straightforward throughput-oriented manner. In future work, we plan to explore a latency-oriented implementation.
It is worth noting that the KeyGen step, as it does not require input requests, can continuously generate key pairs into a pool in an offline manner and thus is well-suited for the throughput-oriented architecture.Additionally, some works [PZZ + 17, WZG + 21, BZW + 23] have introduced an intermediate layer for caching requests to fully harness the potential of throughput-oriented implementations.
The specific workings of the 8-batch 2-phase NTT module and the 3-phase NTT module are shown in Figure 9 and Figure 16, respectively, and the precomputed tables can be found in Appendix B. Note that the inputs and outputs of these two modules are sequential coefficients; there is no need for reordering (the specific ordering required by the 2-phase NTT scheme is handled within the module itself).
Furthermore, since the 8-batch 2-phase NTT module operates in batches of 8, the number of instances input to this module must be a multiple of 8. Finally, the modules before and after the NTT module operate independently per thread, while the NTT module is executed collaboratively by 32 threads (one warp). When the program needs to perform NTT, it therefore synchronizes the threads within the same thread block and then feeds the data into the NTT module.

Other Implementation Details
This section details some noteworthy techniques in the full Kyber implementation.

Coalescing Memory Access
Ensuring coalesced memory access is vital for optimizing performance on GPU architectures. It involves efficiently using the memory bus to minimize the number of memory accesses, which can significantly reduce the time spent on memory reads and writes. For example, Orin's memory bus width is 128 bits. This means that individual read and write operations on INT8, INT16, and INT32 data types on Orin waste 93.75%, 87.5%, and 75% of the data-carrying capacity, respectively; a substantial portion of the memory bus's capacity remains unused when dealing with these data types. In practice, GPU compilers emit individual load and store instructions whose width is tailored to the data precision, so accesses on narrow types do not automatically exploit the full bus width.
For instance, consider the code in Listing 2, which represents the poly_tobytes function provided by the Kyber designers in the reference implementation for serializing polynomials. In this context, the code that reads the variable coeffs[i] may be compiled into a form like ld.global.s16 in0, [%0], and the code that stores the variable r[i] might be compiled into a form like st.global.u8 [%1], out0.
To optimize memory access, we complete the read and write operations using constructs like ld.global.v4.s32 {in0,in1,in2,in3}, [%0] and st.global.v4.s32 [%1], {out0,out1,out2,out3}. This approach retrieves data in chunks, which are then split based on the required precision. In this specific case, reading 32 coeffs[i] and writing 48 r[i] ensures that each read/write operation aligns with multiples of 128 bits. This alignment maximizes memory access efficiency, making better use of the available memory bus capacity.

SHA-3
The symmetric primitives are also a performance bottleneck for Kyber, with the specific computations belonging to the SHA-3 family, at the heart of which is the StatePermute function f. The f function consists of five steps (θ, ρ, π, ι, and χ), of which ρ, π, and ι require table look-ups.
We meticulously translated the computation process of the f function into assembly code, unrolling each loop.Steps requiring table lookups were embedded directly into the assembly instructions using immediate values, minimizing memory access requirements during SHA-3 computation.

Lazy Reduction
When performing NTT on polynomials sampled from the CBD distribution, note that the coefficients fit in fewer than 4 bits. The entries of all precomputed tables are at most 12 bits (modulus q = 3329), which means the result after the matrix multiplication and the Hadamard product can still be contained in a b32 register. Therefore, instead of reducing immediately after the matrix multiplication, we can reduce after the Hadamard product.

Performance Evaluation & Discussion
In this section, we present our evaluation results, including the performance of the NTT module and of Kyber-512, Kyber-768, and Kyber-1024, and perform a comparative analysis with related works. All code is written in CUDA C++, using PTX inline assembly to optimize key operations and intensive memory accesses.
Table 4 details the specifications of the targeted platforms. To simplify matters, we will refer to these two platforms as "Orin" and "R3080", respectively. Notably, the Jetson AGX Orin is an embedded platform that offers adjustable power modes at 15W, 30W, 50W, and 60W, allowing for tailored performance optimization. We evaluate Kyber across these modes.
It should be noted that our optimizations are primarily tailored for the Orin platform, and evaluations based on the R3080 are principally aimed at facilitating comparisons with other works within the same platform.Therefore, the parameters used, such as block size, while being optimal for Orin, have been adopted as-is for the R3080 and may not represent the most optimal configuration for the latter.

Results of NTT/INTT
Our evaluation starts with testing the peak throughput of the 2-phase and 3-phase NTT/INTT for input types of both INT8 and INT16 on Orin and R3080. Given that the 2-phase scheme operates in batches of 8, a multiple of 8 warps is required for collaborative computation, making the block size a minimum of 256. Since the INTT always deals with data at full precision, we only evaluate implementations where the input type is INT16. The experimental results are listed in Table 5.
As evident from Table 5 and consistent with the design expectations, the performance of NTT and INTT is essentially comparable. When comparing the 2-phase and 3-phase NTT/INTT schemes, an advantage of the former is that it is universally applicable to platforms that support efficient matrix operations. However, it is slower than the 3-phase scheme, especially on Orin (which we particularly optimized for), due to the data exchange between warps in the post-processing of the first phase, specifically during the matrix reconstruction illustrated in Figure 8. In contrast, the latter offers higher performance, involves fewer precomputed tables, and eliminates the need to allocate temporary memory for intermediate data. Since the 3-phase NTT/INTT scheme performs better, we adopt it in the subsequent comparisons with other works targeting NTT and Kyber.
Notably, we made an unusual discovery during our NTT/INTT experiments: utilizing column-major WMMA load/store is approximately 10% faster than using row-major access. Consequently, we adjusted our implementation to read and write polynomial coefficients in a column-major fashion wherever possible.

Comparison of Throughput under Peak Concurrency Conditions.
When comparing with other GPU-based research results, we consider the maximum level of parallelism and compute the average time cost for each instance, as shown in Table 6.In addition, we implement split-Mul (Section 4.1) utilizing two methods: schoolbook multiplication and Karatsuba multiplication.
When performing schoolbook multiplication, computing the product of two double-precision numbers (e.g., a1β + a0 and b1β + b0) requires 4 multiplications (a0b0, a0b1, a1b0, and a1b1) and 1 addition. As illustrated in Figure 7, Karatsuba multiplication, at the cost of 3 additional additions/subtractions, requires only 3 multiplications (a0b0, a1b1, and (a0 + a1)(b0 + b1)). In our implementation, the main reason for the comparable performance between schoolbook and Karatsuba multiplication is that the multiplications utilize fast Tensor Cores while the additions and subtractions use general CUDA cores, which may offset the advantage in the number of multiplications. Overall, however, Karatsuba multiplication still demonstrates slightly better performance than the schoolbook approach. Table 6 shows our approach achieving nearly double the speed of the current state-of-the-art implementations. In fact, the performance data of our work listed in Table 6 is not the maximum achievable; with increasing (grid size, block size), the performance can be further improved.
However, our method boasts just one-fourth of the computational complexity of Wan et al.'s [WZF + 22], as evident in Table 3, suggesting the potential for a fourfold performance boost. This discrepancy is primarily due to precision decomposition and reduction constraints between the two sequential matrix multiplications, limiting it from reaching its theoretical potential. Moreover, upon consultation with the authors of [WZF + 22], when measuring, their input and output are pre-arranged with even and odd terms separated, incurring additional time consumption when extended to a complete Kyber implementation. In contrast, our 3-phase scheme does not necessitate data reordering. While this time cost is minimal, it is one of the potential reasons why our performance falls short of expectations.
Another factor is that the implementation by Wan et al. [WZF + 22] considers maximum concurrency.Their scanning-based method, which continuously computes a substantial number of matrix multiplications, optimally utilizes Tensor Core's pipeline, ultimately achieving peak performance when amortized.

Comparison of Execution Latency
In reality, there may not always be such a high level of concurrency. Therefore, we also consider a challenging scenario for GPU-based implementation where only one polyvec_ntt operation is executed at a time. In this context, we evaluate the latency of executing a single polyvec_ntt operation to make comparisons across different types of devices, as illustrated in Table 7.
For polyvec_ntt, we leverage the strength of the 3-phase scheme, where each warp computes the NTT of an individual polynomial independently. By setting the block size to 128, we make sure there are 4 warps in a grid. In this manner, one grid computes one set of polyvec_ntt, allowing us to evaluate the latency of computing a single polyvec_ntt when k = 4.

Table 7 (excerpt): ARM Cortex-A72 @1.5GHz: 3.2 #; Ours, Orin @1.3GHz: 0.35 †; Ours, R3080 @1.8GHz: 0.259 †

Table 7 notes:
§ The implementation result of this work is obtained using one core of the processor.
• Derived from the "custom" implementation [AEL + 20], where an NTT requires 6868 cycles for a single polynomial.
* The authors in [MWB21] reported that an NTT requires 543 cycles for Kyber-1024. Moreover, it is imperative to note that their NTT is 32-bit.
# The authors in [BHK + 22] indicated that an NTT necessitates 1200 cycles for a single polynomial. The comparison pertains to polyvec_ntt for Kyber-1024, thus the latency for the NTT of one polynomial should be multiplied by 4.
† In fact, we measure the execution latency for 16 instances of polyvec_ntt on Orin and 68 instances on R3080, because executing this many instances of polyvec_ntt and executing a single polyvec_ntt yield nearly the same overall execution time. Grid_size = 16, Block_size = 128 is set for Orin and Grid_size = 68, Block_size = 128 for R3080.

The enhancement in performance, as evidenced by our evaluation, can be attributed to several pivotal aspects: 1) the remarkable capability of Tensor Cores in executing matrix computations bestows our scheme with a computationally efficient backbone; 2) our approach decomposes NTTs into matrix multiplications, shifting the computational load of the NTTs onto the Tensor Cores to leverage their strong computational capabilities effectively; 3) our 3-phase scheme enables the simultaneous computation of multiple NTTs by utilizing parallel computing platforms.
In fact, it must be emphasized that the implementations [AEL + 20, MWB21, BHK + 22] based on FPGA and embedded devices in Table 7 are single-core, and their design goals prioritize low latency or low cost.The cross-platform comparison we conducted here is solely to illustrate that our proposed approach not only achieves high throughput but also demonstrates the potential for low-latency implementations on parallel platforms that do not inherently excel in terms of latency.
For Kyber-1024, on Orin, the peak KeyGen, Enc, and Dec performance is achieved when (grid size, block size) is set to (32, 128). On R3080, the best KeyGen and Enc performance is attained at (136, 128), while for Dec the highest performance is achieved at (68, 128). Additionally, from a latency perspective, the lowest latency is observed at (16/68, 128). The latencies for (16/68, 256) and (32/136, 128) are approximately twice that of (16/68, 128), while (32/136, 256) is approximately four times that of (16/68, 128), on Orin and R3080 respectively.
From Figure 18, the computational performance of R3080 is approximately five times that of Orin and it is evident that the throughput for KeyGen and Enc is relatively low, whereas the throughput for Dec is significantly higher.This is because both KeyGen (shown in Algorithm 1) and Enc (shown in Algorithm 2) involve the computationally intensive XOF and PRF functions, which are based on hashing, to generate matrix A and to sample multiple polynomials, respectively.On the other hand, Dec (shown in Algorithm 3) only involves decoding and polynomial computations, which are memory access-intensive operations.Effectively leveraging the data transmission capability of the memory bus yields notable results.

Performance Comparison with Other Implementations on the Same GPU
We first compare our results with other GPU-based works, as shown in Table 8. In terms of Kyber's implementation performance, our implementation's KX (Key Exchange) is 2.82x faster than Gupta et al.'s, which uses butterfly operations, and 1.62x faster than Wan et al.'s, which is likewise based on Tensor Cores. Specifically, compared to Wan et al.'s implementation, our Dec is 3.55x faster. Beyond adopting a more efficient NTT approach, as described in Section 5.2.1, we meticulously merged nearly every memory access, ensuring that each access transfers data close to 128 bits (the memory bus width of Orin). This substantially reduces memory access latency, leading to a significant performance improvement.
Notably, the Dec operation showcases a more significant performance advantage. The root of this lies in our utilization of Tensor Core-accelerated NTT/INTT implementations, whereas the computational load of KeyGen and Enc is primarily hash-based, which is challenging to accelerate on parallel computing platforms.

Performance Comparison with Other Implementations on CPU and FPGA
The previous implementations of Kyber are based on various platforms, targeting different scenarios and following different design ideas. Table 9 lists the throughput of Kyber-1024 for the related works.

Table 9 notes:
§ The implementation results of this work are obtained using one core of the processor.
† The authors did not explicitly state the frequency of this platform in the paper. We consulted technical documentation and provide the common frequency of this platform for reference purposes only.
From Table 9, it can be observed that performance and power consumption under different power modes of Orin exhibit an approximately proportional relationship. This implies that we can choose power settings according to specific requirements, making it easy to strike a balance between power consumption and performance. Compared to resource-constrained devices, our implementation based on Orin exhibits performance approximately one order of magnitude higher, while the implementation based on R3080 demonstrates performance approximately two orders of magnitude higher. Even in the 15W power mode, our proposed method outperforms platforms like the Apple A12 CPU, ARM Cortex-A72 CPU, and Xilinx Artix-7 FPGA, with performance gains of 2.5-8.8x. Considering platform power consumption, this performance advantage remains substantial.
Then, we compare implementations based on different platforms in Table 9 for Kyber-1024 in terms of throughput and latency.
When measured against the state-of-the-art ASIC implementation detailed in [AMI + 23], which achieved KeyGen, Encaps, and Decaps latencies of 33.7, 42.04, and 51.5 µs, respectively, our implementation's throughput on Orin (60W) is approximately 17x higher. However, such throughput is achieved when concurrently processing 4096 (32 × 128) requests. The execution latency for computing these 4096 requests is 240x greater than that of processing a single request as indicated in [AMI + 23].
In comparison to the leading FPGA-based implementation, Xing et al. [XL21] achieved decent performance by carefully scheduling sampling and NTT-related calculations within limited hardware resources.They achieved KeyGen, Encaps, and Decaps performances of 58.2, 67.9, and 86.2 µs, respectively, based on a single processor.In terms of throughput, our implementation on Orin (60W) is approximately 30x higher than theirs, but our latency of 4096 requests is around 136x larger.
In the context of the efficient ARMv8-based software implementation discussed in [BHK + 22], which achieved KeyGen, Encaps, and Decaps latencies of 104.5, 128.2, and 122.8 µs, respectively, on the Cortex-A72, our implementation on Orin (60W) achieves 42-56x higher throughput, while their latency is 73-97x lower, considering that our implementation simultaneously processes 4096 instances.
The above comparisons clearly illustrate the significant differences in performance aspects between implementations based on embedded platforms and parallel platforms.
Objectively speaking, their implementations are all single-core or single-threaded, focusing on low latency. The comparisons above are unconventional; the purpose is to give readers a rough understanding of the current state-of-the-art performance on different platforms. While their implementations may be more suitable for resource-constrained platforms, our throughput-oriented implementation shines most under high-scale concurrency, making it well-suited for edge devices connecting a multitude of IoT devices.
For desktop platforms, we compare the results of desktop-grade CPUs with our performance on the R3080 platform. Relative to the optimized AVX2 version of Kyber-1024 [Sch], our implementation on R3080 achieves speedup factors of approximately 43x, 44x, and 30x for KeyGen, Enc, and Decaps, respectively. Regarding latency, as we concurrently process 17408 (136 × 128) requests, we incur respective slowdowns of 95x, 93x, and 136x. It is worth noting that even in the lowest-power mode (15W) of Orin, our implementation outperforms the AVX2 version of Kyber-1024. Furthermore, given that the primary focus of this paper is the enhancement of NTT, the present Kyber implementation, albeit effective, has not been subjected to fine-grained optimization, suggesting significant scope for further improvement.

Limitation and Discussion
In this section, we will delve into the limitations and scalability of our proposed solution, identify its applicable scenarios, and explore generalizations to other PQC schemes.Finally, we will discuss the impact of various modular multiplication methods on implementation performance.

Execution latency of the proposed scheme
Our primary focus is on throughput, making our solution well-suited for high-concurrency scenarios like servers.However, it does not offer a specific advantage in terms of latency when compared with CPU and FPGA-based implementations.
In future work, we aim to explore how to harness the internal parallelism of the algorithm on GPUs to achieve a relatively low-latency implementation of Kyber. Section 6.1.2 demonstrates the potential for ultra-low latency in the NTT implementation. Nevertheless, for the overall solution, pursuing extremely low latency in GPU implementations is considered imprudent. This is particularly true for components that require frequent synchronization or strict serial execution, such as SHA-3. Forcing fine-grained parallelization may result in a significant sacrifice in throughput.
In our future endeavors, we plan to design coarse-grained parallelism for relatively low-latency Enc and Dec processes while maintaining high throughput. It is important to note that KeyGen does not necessitate a latency-oriented implementation and can be computed offline into a pool. For instance, in the Enc process, leveraging our 3-phase NTT implementation, which does not demand extensive concurrency, we can dispatch multiple CUDA threads to independently compute a SHA-3 instance for sampling and matrix generation.

Generalization to other PQC candidates
In fact, our proposed solution should be applicable to any other PQC candidate that uses NTT. For instance, the implementation of Dilithium [DKL + 18], which is also based on MLWE, can be easily adapted using our techniques.
In Dilithium, the polynomial ring is Z_q[X]/(X^n + 1) with n = 256, just like in Kyber. However, in Dilithium, q = 2^23 − 2^13 + 1, which means the coefficients are represented as 23-bit integers. Therefore, the most challenging aspect of applying our proposed schemes to Dilithium lies in the larger modulus (from 12-bit integers in Kyber to 23-bit integers in Dilithium), which necessitates more complex multi-precision or RNS decomposition. Overall, the difficulty is not significant; it merely requires some customized modifications to harness the performance of AI accelerators for efficient NTT in Dilithium.
Firstly, we undertake a simple theoretical analysis of these three ModMult algorithms. On the one hand, the Shoup ModMult algorithm has lower computational complexity than the conventional Barrett ModMult algorithm, as analyzed in [KLC+19], Section II. On the other hand, Shoup ModMult necessitates a precomputed value tied to one of the multiplicands (e.g., when computing a × b mod q, ⌊βb/q⌋ should be precomputed, where β is the radix), and thus the reduction must be performed immediately after each multiplication. However, it is essential to highlight that in our proposed scheme, we do not perform modular reduction immediately after each multiplication. Instead, we accumulate multiple product results (utilizing Tensor Cores) and perform the reduction only once at the end, and the multiplicand used in each multiplication is different. This approach is compatible with both Montgomery and Barrett modular reduction, since they work for general multiplication products, but Shoup ModMult is not well-suited to our proposed accumulation pattern.
Secondly, when analyzing the computational cost of Montgomery ModMult and Barrett ModMult, excluding the precomputation, we observe that both reduction operations involve two multiplications, one shift, and one addition. The computational overhead is therefore quite similar, and the choice between the two has little impact on performance. In both cases, the accumulated product is computed first and then subjected to reduction.
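A minimal Python sketch of this accumulate-then-reduce pattern (illustrative only: the Barrett constants are our own choices, and in the actual implementation the accumulation happens in Tensor Core accumulators):

```python
# Hedged sketch: products are summed in full precision, and a single Barrett
# reduction (two multiplications, one shift, conditional subtractions) is
# applied once at the end, as described in the text.
Q = 3329                      # Kyber modulus
K = 32
MU = (1 << K) // Q            # Barrett constant, precomputed once

def barrett_reduce(x):
    """Reduce x (< 2^K) modulo Q."""
    r = x - ((x * MU) >> K) * Q
    while r >= Q:             # at most a couple of conditional subtractions
        r -= Q
    return r

def dot_mod(a, b):
    """Accumulate all products first, reduce only once at the end."""
    return barrett_reduce(sum(x * y for x, y in zip(a, b)))
```

For Kyber-sized inputs (12-bit coefficients, 16-term dot products), the accumulated sum stays below 2^32, so a single reduction at the end is always safe.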

Conclusion
This paper proposes an NTT calculation scheme dedicated to Kyber and further provides a complete implementation of Kyber. Based on our exploration of the internal mechanism of Tensor Cores and the unique design of the customized NTT in Kyber, our implementation outperforms existing Tensor Core-based solutions. We achieve significant speed-ups of 1.93x, 1.65x, 1.22x and 3.55x for polyvec_ntt, KeyGen, Enc and Dec in Kyber-1024, respectively. Looking ahead, we plan to extend our achievements to accelerate NTT operations in Dilithium and Fully Homomorphic Encryption (FHE) cryptographic schemes.

B Precomputed Table of Twiddle Factors
Since the powers of ζ are known in advance, all the twiddle factors can be precomputed and stored in memory before the procedure. When the NTT is executed, these values are obtained by a direct table lookup instead of by multiplication, as in the original implementation. Furthermore, considering the Montgomery reduction, the precomputed table should be multiplied by R = 2^12 (mod q) in advance to transform it into Montgomery form, reducing the computational load during the reduction process.
Please note that the generation method provided here pertains only to the NTT precomputed table; the tables for the INTT can be derived similarly.
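For concreteness, a Python sketch of generating such a table (q = 3329 and ζ = 17 are Kyber's standard parameters; the bit-reversed ordering and R = 2^12 follow the text, though other implementations use R = 2^16):

```python
# Hedged sketch of twiddle-table generation in Montgomery form.
Q = 3329              # Kyber modulus
ZETA = 17             # primitive 256th root of unity mod Q
R = 1 << 12           # Montgomery radix stated in the text

def bitrev7(i):
    """7-bit bit reversal, as used to order Kyber's twiddle factors."""
    return int(f"{i:07b}"[::-1], 2)

def montgomery_twiddles(n=128):
    """Powers of ZETA in bit-reversed order, scaled into Montgomery form."""
    return [pow(ZETA, bitrev7(i), Q) * R % Q for i in range(n)]

table = montgomery_twiddles()
```

During the NTT, a lookup into this table replaces both the power computation and the conversion into Montgomery form.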

Phase 1
In the first phase of the 2-phase NTT, the 8 sections of the coefficient matrix are multiplied by the 8 precomputed matrices.

Phase 2
In the second phase of the 2-phase NTT, we perform matrix multiplication.
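To illustrate how a phase maps onto a matrix multiplication, the toy 4-point example below expresses an NTT as a matrix-vector product over Z_q (the parameters are our own illustrative choices, not the paper's exact 2-phase decomposition):

```python
# Hedged toy sketch: an NTT phase as a matrix-vector product over Z_q.
Q = 3329
ROOT = pow(17, 64, Q)   # a primitive 4th root of unity mod Q (17 is a 256th root)

def ntt_matrix(n, root):
    """DFT-style matrix W[i][j] = root^(i*j) mod Q."""
    return [[pow(root, i * j, Q) for j in range(n)] for i in range(n)]

def matvec_mod(W, f):
    """One transform as a matrix-vector product over Z_q."""
    return [sum(w * x for w, x in zip(row, f)) % Q for row in W]

f = [1, 2, 3, 4]
F = matvec_mod(ntt_matrix(4, ROOT), f)

# Inverse transform: same structure with root^{-1}, scaled by n^{-1} mod Q
inv = matvec_mod(ntt_matrix(4, pow(ROOT, -1, Q)), F)
n_inv = pow(4, -1, Q)
assert [x * n_inv % Q for x in inv] == f
```

In the actual scheme, such products are batched into 16x16 tiles and dispatched to Tensor Cores.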

B.2 Precomputed Table of 3-Phase NTT
Phase 1
In the first phase of the 3-phase NTT, we perform matrix multiplication.

Phase 2
The second phase of the 3-phase NTT involves computing the Hadamard product of matrices. Since the product matrix obtained from the previous step has a layout in the registers that is equivalent to applying the inverse of the wmma.store transformation (refer to Figure 13), the precomputed matrix also needs to apply the inverse of the wmma.store transformation to achieve a one-to-one correspondence of multiplication elements.

Phase 3
In the third phase of the 3-phase NTT, we perform matrix multiplication. Based on the analysis from the previous section, the coefficient matrix at this point is equivalent to having been right-multiplied by the permutation matrix P (refer to Figure 15). Therefore, the precomputed matrix in this phase needs to be pre-multiplied by P^-1 on the left to obtain the correct result.
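The Phase 3 correction can be checked with a small sketch (toy 3x3 matrices of our own choosing, not the actual 16x16 Tensor Core tiles): since (C·P)·(P^-1·T) = C·T, pre-multiplying the precomputed matrix by P^-1 exactly cancels the permutation left over from the previous phase.

```python
# Hedged sketch: absorbing a permutation into the precomputed matrix.
Q = 3329

def matmul_mod(A, B):
    """Matrix product over Z_q."""
    return [[sum(a * b for a, b in zip(row, col)) % Q
             for col in zip(*B)] for row in A]

# A permutation matrix; its inverse is its transpose
P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
P_inv = [list(col) for col in zip(*P)]

C = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]      # coefficient matrix
T = [[2, 7, 1], [8, 2, 8], [1, 8, 2]]      # precomputed twiddle matrix

# (C P)(P^-1 T) == C T: the permutation is absorbed at precomputation time
lhs = matmul_mod(matmul_mod(C, P), matmul_mod(P_inv, T))
assert lhs == matmul_mod(C, T)
```

Because the correction is folded into the precomputed table, it costs nothing at runtime.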

Figure 3: The first phase of 2-phase scheme

Figure 5: The first phase of 3-phase scheme

Figure 9: Details of the entire process of 2-phase NTT

Figure 11: From CUDA WMMA APIs to PTX assembly instructions

Figure 14: Reordered data layout for memory-free implementation

Figure 16: Details of the entire process of 3-phase NTT

Figure 17: Our overall implementation of Kyber

Table 1: Parameter sets for Kyber version 3

Table 3: Comparisons of operations per NTT in different schemes

Table 2: Standard invocation methods of WMMA API

This issue has piqued our interest. Next, we will delve deeper into the actual structure of wmma::fragment to see if we can achieve functionalities beyond what standard interfaces offer.

Table 4: Experimental Platform Configuration

Table 5: The peak throughput of NTT/INTT on Orin (the data inside the parentheses represents the throughput on R3080), n = 256; for Orin (60W),

Table 6: Comparison of average time of polyvec_ntt in Kyber-1024, n = 256, k = 4. This performance is obtained by continuously testing 100 times with (grid size, block size) set to (136, 512) and averaging the results.

Table 7: Comparison of execution latency of a single polyvec_ntt of Kyber-1024,

Table 8: Comparison of Kyber-1024 throughput with GPU-based works. The throughput of KX is computed by ab/(a+b), where a and b are the throughput of KeyGen and Dec.

Table 9: Comparison of throughput on Kyber-1024 with related works. Enc and Encaps have comparable computational costs. The throughput of decapsulation is computed by ab/(a+b), where a and b are the throughput of Enc and Dec.