ConvKyber: Unleashing the Power of AI Accelerators for Faster Kyber with Novel Iteration-based Approaches

Tian Zhou; Fangyu Zheng; Guang Fan; Lipeng Wan; Wenxu Tang; Yixuan Song; Yi Bian; Jingqiang Lin

doi:10.46586/tches.v2024.i2.25-63

Authors

Tian Zhou, School of Cyber Security, University of Science and Technology of China, Hefei, China
Fangyu Zheng, School of Cryptology, University of Chinese Academy of Sciences, Beijing, China
Guang Fan, Ant Group, Hangzhou, China
Lipeng Wan, School of Cryptology, University of Chinese Academy of Sciences, Beijing, China
Wenxu Tang, School of Cyber Security, University of Science and Technology of China, Hefei, China
Yixuan Song, Ant Group, Hangzhou, China
Yi Bian, School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
Jingqiang Lin, School of Cyber Security, University of Science and Technology of China, Hefei, China; Beijing Research Institute, University of Science and Technology of China, Beijing, China

DOI:

https://doi.org/10.46586/tches.v2024.i2.25-63

Keywords:

Lattice-based Cryptography, GPUs, Tensor Core, Kyber

Abstract

The remarkable performance of AI accelerators offers promising opportunities for accelerating cryptographic algorithms, particularly lattice-based cryptography. However, current approaches to leveraging AI accelerators often remain at a rudimentary level of implementation and overlook the intricate internal mechanisms of these devices. Consequently, a significant share of their computational resources remains underutilized.
In this paper, we present a comprehensive exploration of NVIDIA Tensor Cores and introduce a novel framework tailored specifically for Kyber. Firstly, we propose two innovative approaches that efficiently break down Kyber’s NTT into iterative matrix multiplications, resulting in approximately a 75% reduction in costs compared to the state-of-the-art scanning-based methods. Secondly, by reverse-engineering the internal mechanisms of Tensor Cores, we precisely manipulate their internal resources using assembly-level code instead of inefficient standard interfaces, eliminating memory accesses and redundant function calls. Finally, building upon our highly optimized NTT, we provide a complete implementation for all parameter sets of Kyber. Our implementation surpasses the state-of-the-art Tensor Core-based work, achieving speed-ups of 1.93x, 1.65x, 1.22x and 3.55x for polyvec_ntt, KeyGen, Enc and Dec in Kyber-1024, respectively. Although our full Kyber implementation is throughput-oriented, its execution latency remains acceptable. For instance, the execution latency ranges from 1.02 to 5.68 milliseconds for Kyber-1024 on R3080 when achieving the peak throughput.
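For orientation, the following minimal Python sketch illustrates the general idea that Kyber's NTT can be written as a matrix multiplication, which is the form AI accelerators are built to compute. It is not the iteration-based decomposition proposed in the paper and does not use Tensor Cores. It only builds a single 128x128 matrix over the integers modulo 3329 and checks it against a plain butterfly NTT. The function names (`bitrev7`, `ntt_matrix`, `ntt_by_matmul`, `ntt_butterfly`) and the test polynomial are illustrative assumptions, and the arithmetic uses ordinary modular reduction rather than any optimized representation.

```python
Q = 3329      # Kyber modulus
ZETA = 17     # primitive 256th root of unity mod Q used by Kyber


def bitrev7(i):
    """Reverse the 7 low bits of i."""
    return int(format(i, "07b")[::-1], 2)


def ntt_matrix():
    """Build W with W[i][j] = ZETA^((2*bitrev7(i)+1)*j) mod Q (128x128)."""
    return [[pow(ZETA, (2 * bitrev7(i) + 1) * j, Q) for j in range(128)]
            for i in range(128)]


def ntt_by_matmul(f, W):
    """Multiply W with the even- and odd-indexed coefficients of f."""
    even, odd = f[0::2], f[1::2]
    out = [0] * 256
    for i, row in enumerate(W):
        out[2 * i] = sum(w * a for w, a in zip(row, even)) % Q
        out[2 * i + 1] = sum(w * a for w, a in zip(row, odd)) % Q
    return out


def ntt_butterfly(f):
    """Reference 7-layer butterfly NTT with plain modular arithmetic."""
    f = list(f)
    k, length = 1, 128
    while length >= 2:
        for start in range(0, 256, 2 * length):
            zeta = pow(ZETA, bitrev7(k), Q)
            k += 1
            for j in range(start, start + length):
                t = zeta * f[j + length] % Q
                f[j + length] = (f[j] - t) % Q
                f[j] = (f[j] + t) % Q
        length //= 2
    return f


# Hypothetical test polynomial with 256 coefficients in [0, Q).
W = ntt_matrix()
f = [(7 * i * i + 3 * i + 1) % Q for i in range(256)]
assert ntt_by_matmul(f, W) == ntt_butterfly(f)
```

A single dense 128x128 product like this performs far more multiplications than the butterfly network, which is presumably why the abstract speaks of breaking the NTT into iterative, smaller matrix multiplications.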
