Formally verifying Kyber

. In this paper we present the ﬁrst formally veriﬁed implementations of Kyber and, to the best of our knowledge, the ﬁrst such implementations of any post-quantum cryptosystem. We give a (readable) formal speciﬁcation of Kyber in the EasyCrypt proof assistant, which is syntactically very close to the pseudocode description of the scheme as given in the most recent version of the NIST submission. We present high-assurance open-source implementations of Kyber written in the Jasmin language, along with machine-checked proofs that they are functionally correct with respect to the EasyCrypt speciﬁcation. We describe a number of improvements to the EasyCrypt and Jasmin frameworks that were needed for this implementation and veriﬁcation eﬀort, and we present detailed benchmarks of our implementations, showing that our code achieves performance close to existing hand-optimized implementations in C and assembly.


Introduction
In July 2022, NIST announced the first batch of "winners" of the post-quantum project, i.e., schemes that will be forwarded to standardization [AAC + 22].To the surprise of many, this first batch contained three signature schemes (CRYSTALS-Dilithium [BDK + 21, DKL + 18], Falcon [FHK + 20], and SPHINCS + [ABB + 22, BHK + 19]), but only one keyencapsulation mechanism (KEM): the lattice-based scheme CRYSTALS-Kyber (or in 1.We formalize the specification of Kyber [ABD + 21, Sec.1] in EasyCrypt [BGHZ11]. Great care (and effort) was dedicated to ensure that the EasyCrypt specification is syntactically close enough to the Kyber specification to enable efficient human validation.This means, in particular, that we use algorithmic definitions of the randomness sampling routines and provide computer-verified proofs that these algorithms generate outputs that follow the distributions defined in the Kyber specification.
2. We extend the Jasmin programming language [ABB + 17, ABB + 20] with local functions, sub-arrays, pointer types required for call-by-reference, and system calls to correctly implements the specification of Kyber in EasyCrypt; Second we prove that the optimized AVX2 Jasmin implementation is equivalent to the reference implementation.6. Finally we provide comparative benchmarks of our implementations on different generations of Intel CPUs.As we did not prove correctness of the rejection-sampling routine in the AVX2 implementation, we benchmark two variants of our AVX2 implementation: one that is fully optimized and one that is fully verified.(The latter uses rejection sampling from the reference implementation.)These benchmarks show that the fully optimized AVX2 Jasmin implementation is within about 10% of the performance of the hand-optimized C/asm implementation by the Kyber team.
An overview of our implementation and verification effort and the tooling we use is given in Fig. 1.

What does our result mean?
The use of the formally verified Jasmin compiler guarantees correct translation to assembly, so the behavior of the assembly code is guaranteed to match the behavior of the source Jasmin code.The computer-verified proofs of functional correctness establish the link between the implementations and the specification of Kyber and, ultimately, guarantee that the assembly code is indeed performing the computations specified by the Kyber specification.However, there are two aspects that still require careful manual inspection by expert auditors: the formal specification of Kyber in EasyCrypt and the statements of the theorems that we prove.We therefore dedicate most of the paper to explaining our formalization choices with respect to these two aspects, and present only a summarized view of the proof itself, focusing on the most innovative aspects and documenting lessons learned for the future.
With regards to verification of implementation security, we refer to [ASBG + 23], which reports formally proven protection against classical timing attacks and Spectre v1 attacks [KHF + 19] for Jasmin implementations of multiple cryptographic routines, including Kyber.The implementations of Kyber considered in that paper are very close to the ones proven correct here.We are confident that we can use the same approach to prove constant-timeness and protect against Spectre v1, once the security type system in the Jasmin compiler from [ASBG + 23] is integrated into the main branch.Also, we are confident that the performance overhead for Spectre v1 protections will be just as low (i.e., < 1%) as reported in [ASBG + 23].
Limitations and TCB.Our computer-verified proofs rely on a trusted code base (TCB) and are limited in scope to the Kyber specification.The correctness proofs of the Jasmin compiler rely on the assumption that the semantics of individual machine instructions are correctly formalized in Coq and EasyCrypt.The mechanism that extracts EasyCrypt code from Jasmin code is also part of the TCB.Finally, the EasyCrypt tool itself is part of the TCB.In terms of scope, our proofs do not cover the correctness of the SHA3 implementations, which come from the libjade library1 .Some of this code has been verified in [ABB + 19], but we do not establish a formal connection to this work.
We note also that the correctness theorem for the Jasmin compiler and the extraction to EasyCrypt only apply to memory-safe programs.However, automatically proving memory safety of our Kyber implementations is currently out of scope for the static safety analyzer in the Jasmin compiler.For this reason, we check memory safety using a combination of static analysis and dynamic analysis: the functional correctness proof guarantees that the memory accesses which are relevant for the computation of the scheme are safe, but they do not exclude accesses that cause unsafety due to, e.g., misalignment or simply accessing memory that is subsequently not used.For this reason we use the Jasmin interpreter and valgrind to reinforce the memory-safety guarantees.
The bigger picture.As the subtitle of this paper suggests, formally verifying the functional correctness and certain aspects of implementation security of Kyber is only part of the whole story.The use of EasyCrypt as specification language and verification tool allows us to also conduct mechanized reductionist proofs of security, i.e., establish through computer-verified proofs that Kyber is indeed an IND-CCA2 secure KEM.Our development is fully prepared so that, when the ongoing machine-checked proof of security is concluded, this yields a security theorem for the Kyber specification that, by the proofs presented in this paper, extends all the way down to the implementations at assembly level.
Related Work.While this paper is the first to present a fully computer-verified implementation of a post-quantum crypto scheme, it is not the first to present verification efforts in the realm of post-quantum cryptography.Hwang et al. [HLS + 22] use CryptoLine 2 to formally verify the AVX2 implementation of the NTT transform used in the Kyber NIST submission.This is essentially the same code used for NTT in our AVX2 implementation, so our results corroborate the findings in [HLS + 22].However, our proof goes beyond NTT and covers two full Kyber implementations.We further discuss [HLS + 22] in the concluding remarks of the paper.Kreuzer [Kre23] focuses on machine-checking the correctness of the Kyber specification as a cryptographic construction, i.e., a proof that encryption and decryption commute.Other works have focused on machine-checking proofs of security for post-quantum cryptographic constructions, such as [Unr20, BBF + 21, HMS22].These works are orthogonal to what we present in this paper, since here we are focusing on the correctness of the Kyber implementations.A comprehensive description of related work in computer-aided cryptography can be found in [BBB + 21], including prior uses of Jasmin to obtain verified high-speed implementations of symmetric and elliptic-curve cryptographic primitives, as well as alternative frameworks designed to achieve the same goal.
Artifact.An artifact including our extended versions of Jasmin and EasyCrypt, the Jasmin implementations of Kyber, and the EasyCrypt proofs is available from https: //artifacts.formosa-crypto.org.We will also submit this artifact to the CHES artifact evaluation.The artifact includes a README with detailed instructions on how to build the tools, build the software, validate the proofs in EasyCrypt, and run our benchmarks.The contents of this artifact are under permissive licenses (MIT license for all modifications to EasyCrypt and CC0 for all Jasmin implementations and associated proofs).

Specification of Kyber in EasyCrypt
The EasyCrypt specification of Kyber closely follows the description in the latest version (3.02) of the CRYSTALS-Kyber proposal [ABD + 21], omitting some features that are not relevant for Kyber-768, namely noise sampling and compression algorithms that are only used for other Kyber parameter sets 3 .We will gradually introduce the various definitions and intermediate results leading to the full code in Figures 2 to 5.This section was written so that it is easy to match our code to the pseudocode specification of Kyber in [ABD + 21, Sec.1].

Algebraic Structure
Operations over values in field F q .We begin by fixing the prime q = 3329 and instantiating the EasyCrypt theory ZModField with this parameter to obtain type Fq 2 CryptoLine is a formal-verification framework that takes low-level code with user annotations for pre-conditions, post-conditions and intermediate assertions, and generates inputs for SMT solvers and computer algebra systems that can then be used to automatically check correctness.It is available from https://github.com/fmlab-iis/cryptoline.
Here, asint is the operator that takes a field element and returns an integer in the range [0, q).We use / R to denote real division, use / for integer division and % for the mod operation.round denotes rounding to the closest integer with ties being rounded up, as indicated in the Kyber specification.Compression and decompression map field elements into integers in the range [0, 2 d ) and back, with a loss that is captured by the following lemma stated and proved in EasyCrypt.
Using these operators we can also capture the encoding/decoding of bits as finite-field   The next step is to specify the distributions over field elements that are used by Kyber.We define the Binomial distribution and the uniform distribution over field elements by declaring them as follows, and then deriving auxiliary lemmas that describe the mass function of each of them in an explicit way (this is important for proving correctness of the sampling procedures used by the implementation).We omit the details for brevity.The ring R q , matrices and vectors.At this point we can formalize the ring R q of polynomials over which Kyber operates.We start by defining a concrete representation of { aa[j] ← inFq d1; j ← j + 1; } if ((d2 < q) && (j < 256)) { aa[j] ← inFq d2; j ← j + 1; } } } return aa; } }. these polynomials using arrays of size 256 of field elements.As in the Kyber specification, we will reuse this representation for ring elements in the NTT domain.
We write explicit formulae for the basic ring operations over this array representation (shown below) and extend the definitions of compression/decompression and binomial/uniform distributions to ring elements by applying them pointwise to each coefficient.
We prove that these definitions implement the correct operations over Z q [X]/(X 256 + 1) as follows.We first instantiate the generic polynomial theory PolyReduce recently added to EasyCrypt with n = 256 and q = 3329, to obtain a description of the corresponding mathematical ring and its operations.We then define a bijection between our poly type above and the type of ring elements in the resulting theory, which allows us to easily prove that our ring operations above are correct with respect to the explicit formulae defined by PolyReduce.
We can now define matrices and vectors over ring elements in our representation by cloning the Matrix theory in the EasyCrypt library with the correct dimension kvec=3 for Kyber-768, and instantiating the ring operations with those we described above.Note that, when instantiating the Matrix theory, we must prove that the provided operators indeed constitute a ring, which we can do by leveraging results in the PolyReduce theory.

Number Theoretic
Transform.Finally, we complete the algebraic definitions by specifying the Number Theoretic Transform (NTT) in EasyCrypt and proving some fundamental results about this transform that are essential to the security and correctness of Kyber.We define the NTT forward and inverse operations as literal translations of what is stated in the Kyber specifications (including the use of the same representation for inputs, and outputs, and the operation that commutes with the NTT transform for fast multiplication  in R q ).We omit here the definition of NTT for vectors and matrices of ring elements, as this is simply the pointwise application of NTT to each component.We use notation (•) 1 and (•) 2 for the first and second element in a pair.op basemul(a b : poly) : poly = Array256.init(fun i In Section 5.3 we describe the EasyCrypt library that was developed with general results about the NTT transform, which we used to establish the following properties, as well as natural extensions for vectors and matrices of ring elements.These properties play an important role in the proof of correctness and in the ongoing proof of security, and they include the expected cancellation (invertibility) properties, commuting with ring addition and multiplication.Additionally, the preservation of scaling factors is used to deal with multiplicative terms used in Montgomery representation (see Section 4).We note that these are well known results that are taken as a given in the Kyber specification and, to the best of our knowledge, have not been previously formally verified in the context of cryptographic proofs. 4As stated in the introduction,we believe that these results are of independent interest, since the NTT is pervasively used in post-quantum cryptography and will surely be used in future EasyCrypt developments.

Algorithmic Sampling, Encoding and Decoding
The Kyber specification relies on a number of auxiliary algorithms.We follow the same structure and define them as independent EasyCrypt modules used by the main Kyber specification in EasyCrypt.In this section we mention SHA3 algorithms several times, when the specification refers to it.By this we mean algorithms in the SHA-3 standard, including the SHA3 hash functions and the SHAKE extendable-output functions.In Section 7 we further discuss how our formalization and an ongoing machine-checked security proof of the specification clarify the assumptions that Kyber makes about SHA3 algorithms.
Rejection sampling a uniform ring element.The first auxiliary algorithm used by Kyber is a rejection sampling procedure called Parse, which we give in Figure 4.The algorithm is parametric on an Extendable-Output Function (XOF), which must be initialized prior to the execution of Parse, and is then used by this algorithm to obtain as many random bytes as needed to complete the 256 coefficients that make up an element in R q .We note that in Kyber the output of this procedure is interpreted as being in the NTT domain, which has implications in various parts of the correctness and security proofs.We model the XOF as producing a sequence of byte blocks of length 168 (each call to next_bytes is the next batch of XOF output bytes), since this matches the way Kyber uses SHAKE-128 for this purpose.
Using SHA3 to hash, smooth randomness, and sample noise.Kyber uses SHA3 algorithms over strings of different input lengths to implement the various hash functions required for the KEM (Fujisaki-Okamoto-like) construction.Our specification clarifies exactly how the SHA3 algorithms (a combination of SHA3-256 and SHAKE-256) are being used to "instantiate" these hash functions, as discussed in Section 7.
Kyber also uses SHA3 algorithms to expand the raw randomness fed to key generation into two 32-byte seeds, one for the matrix rejection sampling, and another to be used as PRF key in the noise sampling procedure above.Here SHA3-256 and SHA3-512 are used as pseudorandom generators (PRG), referred in the spec as G. Since this PRG maps fixed-length input strings to fixed-length output strings, we model it as a PRF that can be evaluated on a single point.
Finally, Kyber relies on SHA3 algorithms to sample pseudorandom ring elements of small norm (and vectors thereof).The procedure used to sample ring elements is specified in Figure 5, which exactly matches the Kyber specification for the concrete parameter of η = 2, which is used by the implementation of Kyber-768.Here we model SHAKE-256 as a Pseudo Random Function (PRF), using the corresponding EasyCrypt library and an operator SHA3_PRF.
type ipoly = int Array256.t.op toipoly(p : poly) : ipoly = map asint p. op ofipoly(p : ipoly) : poly = map inFq p.In the proofs we describe in this paper we do not assume any security semantics to these uses of SHA3; this will be subject of a follow-up ongoing work to prove the security of the Kyber specification and, by transitivity, security of the implementations.Instead, we simply assume that the Kyber specification is using SHA3 algorithms in exactly the same way that the implementation is using SHA3 algorithms and, prove, under this assumption, that the implementation is functionally correct with respect to the specification.We note that some of the implementations of SHA3 algorithms that we use in the development have been proved correct in [ABB + 19], but we do not establish a formal connection between the two verification results.This means that, for the time being, correctness of the SHA3 library is an assumption in our proof.
Encoding and Decoding.The last auxiliary module is called EncDec and it specifies various forms of serializing and de-serializing keys, messages and ciphertexts to/from byte arrays.We specify these operations by defining the canonical bijection toipoly/ofipoly between (possibly compressed) ring elements and arrays of integers representing the polynomial coefficients, and a number of procedures that encode/decode polynomials that have been previously compressed to varying lengths.We show in Figure 6 the case of encoding a polynomial that has been compressed to 4-bits per coefficient into 128 bytes.
These encoding/decoding procedures are specified in imperative form, with a loop structure and output/input byte array types that simplify the connection to the implementation and are also more readable.Our choice here is motivated by the fact that we use fixed-sized arrays for the encoded bytes, which means that we must independently specify an explicit encoding/decoding procedure for each compression size.
In the Kyber specification [ABD + 21], decoding is defined by first mapping the compressed polynomial to a bit string (BytesToBits), which is subsequently partitioned into the various coefficients.We also formalize this version of the operation for completeness.
The Kyber specification [ABD + 21] leaves encoding implicitly defined as the inverse of the decoding operation.We rely on a new EasyCrypt feature to lift our imperative versions of the encoding and decoding operations to functional operators, prove that decoding matches the Kyber specification, and finally prove that encoding inverts decoding.We discuss this in more detail in Section 3.2.
The full specification.The code given in Figures 2 to 5 and the code snippets we provide throughout this section closely match the pseudo-code and definitions in the Kyber specification [ABD + 21].As discussed in the introduction, this is relevant as this specification is the centerpiece of our effort that all formal-verification results refer to.Hence, their applicability to Kyber itself depends on how faithfully it matches what is written in the handwritten proposal.For this reason, our goal was to simplify the task of a human checking by inspection that the EC specification we described in this section matches the hand-written description5 .

Jasmin before and after this paper
Jasmin is a programming language and a set of accompanying tools designed to allow the efficient implementation and the formal verification of cryptographic primitives.It empowers programmers with precise control over low-level features of the assembly code produced by the Jasmin compiler, such as instruction selection and register spilling.Yet, the language provides high-level abstractions such as functions and arrays that help structuring programs and proofs.For instance, arrays have a functional semantics and functions are call-by-value, vastly simplifying modular reasoning.In accordance with the high-efficiency objective, most abstractions are eliminated at compile-time; the compiler checks that arrays are used linearly, which in turn allows to modify them in-place and ensure that there are no run-time copies.
Before it was improved with the extensions described in this work, the Jasmin compiler had a few limitations.First of all, running a computation on a slice of an array (i.e., a contiguous region within an array as opposed to the full array) was very tedious and pollutes the proofs.Second, the compiler inlined all functions at every call-site, which dramatically increased the code size.For instance, the Jasmin implementations of the SHA3 algorithms from previous work [ABB + 19] are already quite large (hundreds or thousands of assembly instructions, depending on the parameters).The Kyber implementations rely on this primitive for various purposes, such that inlining it at every call-site would inevitably result in an unaffordable code blow-up.
To overcome these limitations and allow the efficient implementation of large routines such as Kyber, the Jasmin infrastructure has been extended with three new features described below: local functions, pointers and sub-arrays, and a randombytes system call.
Local functions.Jasmin programs now are made of functions: at the source level, they now come in three flavors: export functions, inline functions, and local functions.Export functions correspond to entry-points, meant to be called from other programs.Therefore they must comply, after compilation, with the targeted ABI.In particular, the number of arguments and returned values is limited and callee-saved registers must be preserved.Moreover, their argument and return types are limited to machine words.At the other end of the spectrum, inline functions disappear early in the compilation process.They thus enjoy a lot of flexibility without incurring any run-time cost: they can take any number of arguments of any types, including arrays and return as many as necessary.
Finally, local functions are kept as separate entities after compilation.Yet, they are not exposed to external users and therefore need not comply with any convention.For instance, they can return many results and the programmer chooses whether the return address is passed through the stack memory or through a register.Interestingly, there is no fixed list of callee-saved registers, i.e., registers whose contents must be preserved by function calls.This ensures that there will be no hidden cost to using a function.This is made possible by a rather unique method to register allocation: all functions are compiled together and all their reg local variables are allocated simultaneously.
Arrays, sub-arrays and implicit pointers to stack.In Jasmin, each variable is labeled either reg or stack, specifying that the compiler must allocate it in a register or on the stack respectively.This is the case also for arrays.reg arrays are a convenient way to manipulate several reg variables together.They are compiled to individual registers.stack arrays are allocated as contiguous memory zones in the stack.The compiler, at the stack allocation pass, decides on a shape for the stack, and replaces accesses to arrays with memory accesses at positions relative to SP, the stack pointer.For instance, if array b is allocated on the stack at address SP + 5, an access b[r+3], where r is a register, is compiled to a memory access at address SP + r + 8.
A new feature that represented a major change to the compiler was allowing local functions to accept stack arrays as arguments.This implied a redesign of the stack allocation pass: there are now arrays whose address is not known at compile time.To achieve that, we impose that arrays in the stack are taken by local functions using a new type called reg ptr.At the source level, such an array behaves as the other arrays.Jasmin semantics makes no distinction between arrays of different kinds, thus reasoning is not more complex than before.However reg ptr is compiled as a pointer, to a register holding its address.The stack allocation pass checks that this transformation is valid, meaning that all modifications to the array are done in place.If an array is changed in the body of a local function, it has to be returned to abide by the high-level semantics of Jasmin.This spurious copy will be eliminated during stack allocation.Since reg ptr arrays are actually registers, we need a way to let the user spill them if needed.For that matter, we introduced stack ptr array.
In the version of Jasmin prior to this work, arrays could be passed as arguments to local functions, but there was no way to call a function with only part of the array without making an explicit copy.To make the best use of local functions, we allow the user to take slices of arrays called sub-arrays.This allows to call a local function on a small part of a large array.The starting position and length must for now be known at compile time.If the array is a reg ptr, this is compiled as pointer arithmetic.
Note that the features described in this section are the only facility given to the user to manipulate (indirectly) pointers.The fact that the user never gets direct access to pointers ensures that formally reasoning on the source remains tractable.
Randombytes.Many cryptographic routines process random data.A common practice is to receive randomness as input.However, Kyber (as well as other post-quantum schemes) uses a rejection sampling procedure which may consume an unbounded stream of random data.To overcome this limitation it proves convenient to sample fresh randomness on demand.To this end there is now a randombytes interface in Jasmin.The usual method of taking random data as input remains of course available.Moreover, the user of the Jasmin program is responsible for implementing this interface: no actual implementation is needed before link time.Possible implementation methods include asking the OS through a system call, or calling some dedicated library, e.g., to access a hardware-security module.
Semantically, this is modeled as an unspecified deterministic function in a state monad.This ensures that the compiler correctness theorem applies seamlessly.The EasyCrypt formalization uses a slightly different view: the EasyCrypt code is parameterized by a module providing the implementation of this system call.Said implementation can indeed be stateful.The functional correctness of the Kyber implementations can be established for any instantiation of the parameter, since they will match the semantics of the specification for any input random bytes.Security analysis (ongoing work) however will need to make assumptions about the distribution.
More generally, there is a general facility to call external functions from Jasmin programs.Currently only randombytes is available through this mechanism but, should the need arise, other operations can be seamlessly added.

EasyCrypt and extensions added in this work
EasyCrypt is a proof assistant geared towards proving the correctness and security of implementations of cryptographic primitives.The Jasmin compiler can output either assembly or EasyCrypt code with semantics equivalent to its input.Properties about this EasyCrypt code can then be proven in EasyCrypt using one of the variants of Hoare logic it features: • HL is a deterministic Hoare logic that allows to prove result on deterministic programs.Statements in this logic are of the form hoare [ M.foo : P Q], where M.foo is a procedure in module M, P is a precondition and Q is a post condition.
• pHL is a probabilistic Hoare logic, that allows to reason about the probability of occurrence of events in potentially probabilistic programs.Statements in this logic are of the form phoare [ M.foo : P Q] = p, where p is now the probability that Q holds in the post-state.In the current paper we typically have p=1 and the pHL statement can be proved by first establishing that the program always terminates, and then proving the statement using HL.
• pRHL is a probabilistic relational Hoare logic, that allows to prove relational results about the lock-step execution of a pair of programs.Statements in this logic are of the form equiv [ M.foo ~N.foo : rP rQ], where rP and rQ are now relational pre-and post-conditions that can refer to the initial and final states of both procedures.
In particular, pRHL can be used in conjunction with pHL to prove results by gamehopping: using a correctness result about a procedure P 1 in pHL and an equivalence result between procedures P 1 and P 2 in pRHL, one can prove a correctness result about P 2 .Also, in a deterministic setting, results in HL can be proven using results in pHL, as well as the other way around.The proof of correctness for the Kyber Jasmin code relies on these two mechanisms.In particular, the game-hopping strategy allows us to build intermediate programs between the EasyCrypt code produced by Jasmin and our desired specification.We use it to remove optimizations and to abstract operations on algebraic structures.
EasyCrypt also features a standard library formalizing some basic algebraic structures, which we use to capture the algebraic structure underlying Kyber.This library is based on the EasyCrypt cloning mechanism, originally introduced to manage abstract types in cryptographic primitives.One thing that became clear during this project is that cloning has proven to be a source of complexity and limitations as more algebraic structures were added to the hierarchy, namely polynomial rings, vectors and matrices thereof needed for Kyber.We have communicated this to the EasyCrypt developers and learned that future versions of EasyCrypt will use type classes for this purpose.
The most innovative use of EasyCrypt in this development was, perhaps, the functional correctness of the Number Theoretic Transform (NTT) implementations, which we will describe in detail later in the paper.However, a number of enhancement to EasyCrypt were created to support this development, which we summarize next.
Additions to the standard library.Kyber is the first cryptographic primitive studied using EasyCrypt to involve algebraic structures such as polynomials arithmetic, modular rings and ring quotients.As a result, many parts of the standard library had to be created or improved to lay foundations underlying the Kyber specification.These included results about lists, bit-vector manipulations, ring quotients, logarithm and p-adic valuations.
Automatic inference of functional specifications.One of the tedious aspects of some EasyCrypt proofs arises from the fact that some parts of the proofs are easier to do with functional specifications, whereas other parts are easier to do with imperative specifications.The sweet spot usually resides in using both approaches at the same time, which means writing both specifications by hand and proving that they are equivalent.In Kyber, a number of simple procedures are used to encode and decode data into byte arrays, so applying this approach would be quite time consuming.To avoid this, a new mechanism was developed that can automatically infer a functional specification from an imperative program, provided that the latter has as reasonably simple control flow.
Dealing with nested for loops in EasyCrypt.The reference implementation of the NTT in Kyber involves multiple nested for loops whose formal study is tedious in EasyCrypt.Indeed, for dealing with loops, EasyCrypt only comes with a restricted set of general purposes proof rules that trade expressiveness for ease of use.In consequence, we designed a new library that recasts the aforementioned general purpose rules as simplest ones but that only apply to more restricted forms of while loops, as the ones that can be found in Kyber.This library targets the different flavors of EasyCrypt's Hoare logic and has proven particularly helpful in game-hopping based proofs on programs that contain multiple nested for loops.

Implementation
This paper presents two different implementations of Kyber-768, a reference implementation and an optimized implementation using AVX2 vector instructions.Those implementations follow largely the structure and implementation strategies of the ref and avx2 implementations by the Kyber team6 , which are also included in the submission package to NIST and integrated into PQClean7 .
Reference implementation.The structure of the reference implementation largely follows the specification.On the lowest level they implement arithmetic in F q , i.e. on integers modulo the prime q; these are represented as signed 16-bit integers.The most interesting routines on this level are Montgomery and Barrett reductions.The Montgomery reduction takes as input a signed 32-bit integer a and produces a signed 16-bit output r with r ≡ a • 2 16 (mod q).This function does not actually work for arbitrary inputs and is also not producing fully reduced results: more specifically, it requires the input to be in {−q • 2 15 , . . ., q • 2 15 − 1} and guarantees that the output is in {−q + 1, . . ., q − 1}.Barrett reduction takes an arbitrary signed 16-bit integer a as input and produces r ∈ {−(q − 1)/2, . . ., (q − 1)/2}, such that r ≡ a (mod q).
On top of this is arithmetic in R q , i.e., on univariate polynomials over F q modulo (X 256 + 1).Addition and subtraction in R q are simply coefficient-wise addition and subtraction of arrays of 16-bit integers.The implementation follows a lazy-reduction approach here and does not apply modular reduction after addition and subtraction.The more interesting routines on polynomials are forward and inverse NTT, and "pointwise" multiplication of polynomials in NTT domain.The forward NTT takes an array of 256 coefficients in normal order and produces the output inplace in bit-reversed order.It uses 7 layers of Cooley-Tukey butterfly operations.Twiddle factors are precomputed in Montgomery and multiplication by those twiddle factors is followed by Montgomery reduction, bringing the result back to normal domain.Addition and subtraction inside the butterflies is not followed by modular reduction, i.e., also here the implementation makes use of the lazy-reduction optimization.The inverse NTT takes input in bit-reversed order and uses 7 layers of Gentleman-Sande butterflies to produce output inplace in normal order.Multiplication by twiddle factors is again using Montgomery arithmetic.The result of addition inside the butterflies is reduced using the Barrett reduction routine.The result of subtraction is again left unreduced.As the NTT in Kyber is incomplete (see [ABD + 21, Sec.1.1]), the pointwise multiplication of polynomials in NTT domain corresponds to 128 multiplications of 2-coefficient polynomials modulo X 2 − ζ i , where ζ i are precomputed powers of ζ = 17, the smallest 256-th root of unity modulo q.This operation is computed through Montgomery multiplications and additions that again omit modular reduction.Note that Montgomery reduction moves resulting coefficients to "inverse Montgomery domain", i.e., they get multiplied by a factor of 2 −16 .This factor is removed again at the very end of the inverse NTT.For the public key, which is transmitted in NTT domain, the factor is removed in a dedicated function that multiplies each coefficient by 2 32 mod q and then applies Montgomery reduction.
Also on this layer are functions to sample polynomials with coefficients that are uniformly distributed modulo q and to sample noise polynomials with coefficients that follow a centered binomial distribution with parameter η = 2.The former is using simple rejection sampling on output of SHAKE-128.The latter is a bit more involved and treats 32-bit words (in the C reference implementation) or bytes (in the Jasmin reference implementation) as vectors of 2-bit integers to compute Hamming weight and subsequent subtraction.Finally, the implementation of R q includes functions for serialization and deserialization and for "LWR-style" rounding to a smaller, power-of-two modulus to compress ciphertexts.For the reference implementations, these functions are rather straightforward translations of the specification into C respectively Jasmin.
The highest level of arithmetic implements operations on vectors and matrices over R q .These functions are all rather straightforward loops over calls to functions performing arithmetic in R q .These functions are first combined to implement the Kyber.CPAPKE passively secure public-key encryption scheme and then, together with multiple functions derived from Keccak (SHA3-256, SHA3-512, SHAKE-128, SHAKE-256) to build the full IND-CCA2 secure KEM Kyber.

AVX2 implementation.
As most optimized implementations, the main idea of the AVX2 implementation code is to identify the functions that consume most of the CPU cycles in the reference implementation and, one-by-one, replace those by more optimized versions.One might think that verification "simply" proves equivalence between the reference and the optimized implementations function-by-function and that this verification would mostly mean showing equivalence of longer loops over scalar operations and shorter loops over vector operations.While this is true for some simple functions like polynomial addition and subtraction, the differences are much more substantial for the bulk of the AVX2 implementation.The most interesting differences between the reference and the AVX2 implementation are the following: • Whenever possible, the AVX2 implementation uses a Keccak implementation that processes 4 independent streams of input and output in parallel.Using this more efficient 4× parallel implementation requires the high-level code to batch calls; the vectorization is thus not a local change encapsulated inside a low-level function anymore.
• The AVX2 implementation of the NTT produces output in a "permuted bit-reversed" order; this order is also used by the pointwise multiplication, so all polynomials in NTT domain use a different ordering of coefficients in the AVX2 implementation compared to the reference implementation.As the public key of Kyber is transmitted in NTT domain, serialization of the public key includes a permutation to move to standard bit-reversed order.
• The AVX2 NTT code is also an illustrative example of where the aggressive optimization strategies, together with the willingness to maximize the impact of the available AVX2 registers and instruction-set, often leads to ingenious but contrived code, intertwining arithmetic and bitwise operations.Correctness verification is challenging, since one needs to keep track of what is happening at each step with significant detail, but the deep understanding of the underlying arguments occasionally suggested tweaks for additional speed-ups that were incorporated into the code.
• The AVX2 modular reductions are subtly different, so they produce output that is equivalent modulo q, but not equal to those produced by the reference implementation.This also means that the whole lazy-reduction strategy of the AVX2 implementation is different, which requires updates to the range analysis to make sure that no overflows occur.
• The compression function for vectors of polynomials uses a different algebraic expression to compute the rounding to the smaller power-of-two modulus.More specifically, while the reference implementation computes rounding from q = 3329 to q = 1024 as ((1024*i + (Q-1)/2)/Q) & 1023, the AVX2 implementation uses (8*V*i -( i+15) + (1 << 18)) >> 19, with the value V set to 20159.The equivalence of these two expressions for all relevant inputs is in principle not hard to prove, but the situation is further complicated by the way how results of vectorized multiplications are split into low and high parts of the result.In other words the complication is extracting this algebraic expression from the code in the first place.
• Finally, the rejection sampling to generate the matrix A is using a rather involved vectorization approach inspired by the techniques introduced for NewHope in [GS16,Sec. 4.2].This routine is the only part of the AVX2 implementation that is not yet proven correct; this is why Section 6 reports performance results for a fully optimized (with this involved AVX2-based sampling) and a fully proven (with the sampling from the reference implementation) implementation.This faster rejection sampling routine is challenging to verify because it takes a totally different approach to generating the necessary coefficients than the reference implementation: rather than going sequentially through the output bytes produced by the XOF, it generates a large number of candidates and checks their bounds in parallel, to compute what intuitively is a bit map of the good elements; the code then uses this information to pack these good elements in a contiguous memory region, and then takes the necessary prefix.Since the used set of SIMD instructions and optimization thechniques have little in common with the rest of the code, verifying this part of the implementation represents significant additional effort.

Proof of functional correctness
The strategy we adopted for the functional correctness proof divided the effort into a first phase, in which we constructed a proof for the reference implementation, and a second phase where we addressed the AVX2 implementation.The first phase included the development of a number of EasyCrypt libraries to deal with various data structures and operators used by the EasyCrypt code extracted from the Jasmin implementations, namely distributions over arrays, conversions between arrays of field elements and their representations, etc.One important library that may be reused in future developments establishes results for field elements in Montgomery representation, including not only the expected congruence relations, but also the ranges of the representations of these elements when stored as microprocessor words by the implementation.For the latter, a formalization and proof of the Barrett reduction operation correctness.Another one establishes results for the NTT, which we will describe in detail at the end of this section.

Verifying the Reference Implementation
The proof for the reference implementation follows the hierarchical structure of the code described in Section 4. We present the results we establish in a bottom-up order.The lemma states that the Jasmin function __fqmul, given two 16-bit words a and b, produces a result res with probability 1, meaning that it always terminates, satisfying the expected relation to the input parameter values: the value stored in the output 16-bit word, when seen as a signed integer, contains the output of a specification operator SREDC computed over the inputs.The operator is defined in the Montgomery theory as follows: Note that the operator semantics is given by the above lemma both in terms of its algebraic meaning and input/output value ranges.(Here R is 2 16 , the range of 16-bit words.)Note also that this lemma imposes a precondition on the ranges of input values, which must be met by caller functions.
At the field level we also state and prove auxiliary lemmas that relate some sequences of processor operations used by higher-level functions to the field element compression operators introduced in Section 2. These are then used in the proofs of correctness for ring element and vector operations.

Ring (polynomial) operations.
Operations over ring elements come in three flavors: i. those that follow from the pointwise application of some operation to each coefficient; ii.those that involve reading and writing polynomials from memory; and iii.NTT-related operations.The first type of operations includes addition, subtraction, Barrett reduction, etc.The correctness proofs for these Jasmin functions are straightforward loop arguments that rely on the lower level results described above and stated similarly to the phoare judgment given as an example for field multiplication.
The operations involving memory access typically perform compression followed by memory write operations, or memory read operations followed by decompression.The values read from/written to memory are decoded/encoded into byte arrays, as described in Section 2. This motivates a different form of functional correctness statement, which we illustrate with the Jasmin function that writes a compressed ring element to memory.The above lemma establishes the equivalence between the specification of the encoding procedure EncDec.encode4 and the _poly_compress Jasmin function.It states that, when the input parameter a{1} to the Jasmin function contains the representation (the lift_array256 operator) of a ring element _a with coefficients in the [0..2q) range (pos_bound_256_cxq predicate), and if it starts executing in some initial memory mem, then the final state of the memory will satisfy two properties: i. memory is only modified in a specific region identified via the touches predicate; and ii.reading from that region yields the correct encoding result (this is stated by fixing the input p{2} to EncDec.encode4 on the right to be the result of compressing ring element _a.Expressing correctness in this way has two advantages: i. we can match the control flow of the encoding specification to the control flow of the implementation (see Section 2) and rely on relational Hoare logic to avoid dealing with closed forms for the encoding mechanisms in this part of the proof and; ii.we can directly plug-in these equiv results into the top level equivalence proof for the IND-CPA encryption scheme.
Finally, the most challenging steps in the proof of ring operations concern the three functions used for NTT-based multiplication: NTT, inverse NTT and pointwise multiplication of degree one polynomials.We prove correctness of the NTT and inverse NTT functions using the same sequence of steps.We first show that the NTT function does not introduce computation errors at the representation level; to this end, we define an EasyCrypt procedure that has the same control flow as the Jasmin implementation, but performs all operations in F q .This is expressed as the following equivalence in EasyCrypt between our idealized NTT implementation NTT.ntt and the actual extracted code _poly_ntt.equiv ntt_correct_aux : NTT.ntt ~M._poly_ntt : r{1} = lift_array256 rp{2} ∧ zetas{1} = zetas ∧ signed_bound_cxq rp{2} 0 256 2 res{1} = lift_array256 res{2} ∧ pos_bound256_cxq res{2} 0 256 2.
This equivalence imposes a contract on the ranges of input/output representations for the implementation, and it shows that our idealized code, when called with the ring-element represented by the input to the Jasmin code, and a set of global constants (zetas) used by the Jasmin code, computes the same result over the ring.
We then prove that the global constants used by the implementation are correct representations of the necessary roots of unity, and we prove that our idealized algorithm computes the mathematical NTT operator.This part of the proof relies on an NTT library developed for this project, which we describe in more detail in Section 5.3.The final statement looks as follows: The correctness proof for the _poly_basemul and NTT inverse operations come with an interesting twist.Rather than computing the pointwise multiplication of the degree one polynomials, the former actually uses the _fqmul field multiplication operation in a way that leaves an extra Montgomery inverse multiplicative term in the result.Conversely, the NTT inverse operation produces a result that carries an uncancelled Montgomery multiplicative term.This is achieved by introducing an additional Montgomery multiplicative term in the precomputed constants, and therefore has no impact on performance.Since the NTT transform preserves multiplicative scalars (a property we have proved), the composition of these two functions actually produces a correct result without any spurious terms.In Section 5.3 we give some more detail on the EasyCrypt library we developed to support the NTT proofs.
Vector operations.Most operations over vectors do not bring any additional challenges: some of them work coefficient-wise over all the coefficients of all ring elements in the  vector, and therefore almost identical to the corresponding ring operations, or they are point-wise operations over ring elements (e.g., vector addition, NTT, etc.) which we simply handle by plugging in the results described above three times.The remaining and most interesting operation is the inner product between two vectors in the NTT domain, which we prove correct by relying on the correctness of the basemul and poly_add functions, to get the following result.(Here again we note the extra Montgomery inverse multiplicative term (inFq 169) that will be subsequently canceled by the inverse NTT operation.)Top-level correctness.At the top level of the correctness proof reside the lemmas that establish an equivalence between the Kyber specifications of the IND-CPA encryption scheme and the KEM construction described in Section 2 and the corresponding Jasmin implementations.The proof of the KEM implementation correctness follows directly from the fact that the IND-CPA implementation is correct and the various calls to the hash functions in the Fujisaki-Okamoto-like construction are done in the correct sequence.Two additional simple results are needed for comparing two arrays and conditionally copying an array in constant-time.
The proof of the IND-CPA encryption scheme implementation is done in two steps.We first create an EasyCrypt auxiliary implementation of the scheme that is very close to the extracted Jasmin code, bar the fact that we restructure it slightly to better match the structure of the specification and we create additional mock versions of the matrix and noise generation functions, as shown in Figure 7.These functions have the same type as the extracted ones, but exactly match the specification modulo the representation of  outputs, which here comes as arrays of processor words.As an intermediate step we prove that these functions are equivalent to the corresponding ones in the extracted Jasmin code, assuming that both use SHA3 in the same way: to this end we parameterize the mock implementations in Figure 7 with concrete XOF and a PRF modules that call SHA3 code in exactly the same way as the Jasmin implementations do.Looking back at Figure 1, this explains why we show the EC specification parameterized by  In the same figure we also show that we have proved as important side-results that, should the XOF and PRF modules we show above be replaced with purely random coins, then the noise sampling and matrix sampling procedures in the Kyber specification produce the correct output distribution.
The top-level IND-CPA proof then proceeds by connecting our intermediate implementation to the Kyber specification by using the lower level results described above for ring elements and vectors thereof.The proof boils down to guaranteeing that plugging the outputs of one function into the next always satisfies the contracts that constrain the ranges of the representations.When it comes to noise sampling and matrix sampling, we first replace the extracted code with the mock implementation, which allows us to conclude the equivalence proof in a trivial way.An example of a top-level result is given in Figure 8 for encapsulation.
In addition to the validity of the input memory regions, the precondition binds these inputs to symbolic constants representing pointers to the input public key provided to the implementations, and specifies that the input public key provided to the specification should match the one stored in the starting memory for the implementation.Furthermore, the random coins provided to both algorithms are also assumed to be the same.The post condition specifies which part of the memory is modified, and states that reading from that memory region correctly recovers the ciphertext and key computed by the specification.

Verifying the AVX2 Implementation
In light of the work described in the previous section for the reference implementation, the process for the AVX2 implementation can also be described as having two parts: 1) showing that AVX2 operations over ring elements and vectors are equivalent to their reference counterparts and 2) proving that this implies that the AVX2 implementations of the IND-CPA PKE and KEM, which call these functions, are equivalent to their reference implementation counterparts.The proof of correctness then follows by transitivity, since we have already proved that the reference implementation is correct with respect to the Kyber specification.We now describe these steps in more detail.
Top-level Correctness.We prove functional correctness statements for the KEM and IND-CPA encryption scheme constructions that are identical to those given above for the reference implementation.This is natural, since the AVX2 implementation is meant to be a drop-in replacement, whenever the target platform supports the necessary extensions.The proof is by transitivity: we have that the reference implementation is correct with respect to the Kyber specification, so it suffices to prove that the AVX2 implementation is equivalent to the reference implementation.We propagate this transitivity argument using relational Hoare logic down to lower level operations, as far as the structure of the code allows.For example, it is straightforward to prove that the IND-CPA encryption operations are equivalent, assuming that the intermediate level operations over vectors and ring elements on each side are, in turn, equivalent.We end up with a series of proof goals that look like the statement below.Here, we have on the left an intermediate-level AVX2 function that performs Barrett reduction of polynomial coefficients and, on the right, the corresponding reference implementation function.Note that we express the input/output relation modulo a permutation nttunpack, since this particular function is executed over ring elements in the NTT domain, which the AVX2 code stores differently (see Section 4).
Lower-level equivalences.As one proceeds down the call tree, the code of the AVX implementation becomes significantly different from the reference implementation: there is no-longer a one-to-one correspondence between inner functions and the control-flow is completely different.This is because the optimization based on vector operations relies on a totally different instruction schedule.The most extreme cases are, perhaps, the NTT and inverse NTT operations, where the order in which the computations are performed is indeed totally different: after a first round where the full array is transformed, the algorithm processes the bottom half of the array before moving to the top one.
To prove these lower-level equivalences, we therefore proceed in two steps.We first prove that the AVX2 function satisfies the same contract that we proved for the corresponding reference implementation function.For simple functions, such as addition, subtraction, etc., this task takes similar (small) effort when comparing to the reference implementation counterparts.The only difference lies in the data types and operations used, but the proof steps are essentially the same.
For more complex functions the proof effort was sometimes significantly greater, as SIMD operations often require sequences of intricate permutations before a semantically meaningful arithmetic or bitwise operation can be executed.Again, the most challenging cases were the NTT-related operations, where the proof effort was divided into three additional steps: i. defining an intermediate EasyCrypt implementation that performs field operation in the same order as the AVX implementation; ii.proving that this code is equivalent to the AVX implementation by excluding the possibility of representation errors; and iii.proving that the resulting sequence of field operations is indeed computing the NTT as prescribed by the specification.Our NTT library was extended with additional variants of the NTT computation, all proved equivalent to the specification, to facilitate the connection to the AVX implementation, as we describe next.

Verifying the NTT
As described above, to manage complexity, the proof of NTT correctness has two main parts: 1) dealing with the representation of finite field elements and 2) proving that the computations performed over field elements is correct with respect to the specification.
Step 1) is carried out as a simple game hop where we replace all the Jasmin register and stack variables that represent group elements, as well as computations over these variables, with computations over F q .Proving the correctness of this step is done by relying on a series of low level results for computations over F q .
In step 2), we take care of the most interesting part of the correctness proof: dealing with various optimizations such as the use of an array to store the values of the 256 th roots of the unity, the efficient computation of the index used to access this array, and the computation in reverse bit order.The proof is again done by game-hopping and its structure is summed up in Figure 9.The thin double arrow represents step 1) discussed above.The thick double arrows correspond to pRHL equivalence proofs.The dashed double arrow corresponds to the final HL correctness proof, where the EasyCrypt program NTTnaive is proven to implement the EasyCrypt functional specification of the NTT, described in Section 2. This proof is then lifted to pHL and carried by transitivity to all the other versions of the NTT via the other equivalence proofs.The intermediate NTTs are characterized by the following: • NTT Fq replaces the calls to auxiliary functions in NTT Jasmin by their mathematical specification • NTTAVX2 Fq replaces the calls to auxiliary functions in NTTAVX2 Jasmin by their mathematical specification • NTTopt splits the NTT into several small functions to allow for modular reasoning • NTTbsrev replaces the incremental computation of the index used to access the array of roots of unity by the computation of it's value from the loop indexes, and the index variable of the innermost loop by it's final value when it's used after the innermost loop • NTTnaive has it's two innermost loops in reverse bit order when compared to NTTbsrev, and the accesses to the array of roots of the unity are replaced by the computation of the value of the desired root • NTTAVX2 Fun replaces the two innermost loops by the evaluation of a function, and is designed to be easy to be shown equivalent to both NTTbsrev and NTTAVX2 The most challenging proof was the equivalence between NTTbsrev and NTTnaive, as it is in this transformation that the loops on both sides do not have the same order.As pRHL was made to reason about the lock-step execution of two programs, the invariants used were more complex than in other proofs.The AVX2 NTT implementation, NTTAVX2 Fq, is shown in pRHL to be equivalent to NTTAVX2 Fun, which is itself shown to be equivalent to NTTbsrev.The structure of the correctness proof of the inverse NTT, and the challenges it poses are essentially the same.

Benchmarks
In this Section we present benchmark results of our Jasmin implementations and compare to the original implementations written in C and assembly.We benchmark on three different generations of Intel CPUs that all implement the AMD64 architecture and support the AVX2 instruction set.Specifically, we benchmark on an Intel Core i7-6500U (Skylake), an Intel Core i7-4770K (Haswell) and on an Intel Core i7-10700K (Comet Lake) CPU.All benchmarks are using only a single core of these CPUs; we followed the standard practice of disabling Hyperthreading and TurboBoost; the C compiler used for all benchmarks was Clang, version 14. 8Table 1 reports the results of these benchmarks.All cycle counts in this table are the median of 10 000 runs with different inputs.All implementations used the same implementation of randombytes, which is implemented through the Linux getrandom system call.We see that the Jasmin reference implementation is considerably slower than the C reference implementation.This is expected, because the Jasmin compiler, unlike C compilers, does not optimize the code: this is by design the job of the programmer.The goal of this implementation was not to achieve high performance, but rather to facilitate the EasyCrypt proof of correctness.
More relevant is the performance comparison of the optimized AVX2 implementations.We first comment that, for the same C/asm implementation and very similar CPUs, SUPERCOP [BL] reports somewhat lower cycle counts.The main reason for this difference is that our benchmarks obtain random bytes through a system call, while SUPERCOP uses a fast userspace randomness generator.We see that the fully optimized Jasmin code is within about 10% of the performance of the optimized C and assembly implementation by the Kyber team.We also see the rather dramatic performance impact of our two implementations of rejection sampling, i.e., the only difference between the fully verified and the fully optimized AVX2 implementations.Clearly this difference motivates further effort on verifying a faster rejection sampling routine, but, as we will discuss in Section 7, such a proof might want to wait until EasyCrypt supports a higher degree of proof automation.

Discussion
Proof effort and the need for automation.The development we present in this paper took almost three years to complete.This was certainly to some extent because of the large number of people involved, their availability to work on the project, and their varying sets of skills and expertise.More importantly though, it was due to the sheer scale of the effort and the fact that different tools needed to be improved in order to tackle the example at hand.
If an equivalent project were to be started now with the same team, e.g., to formally verify the implementation of another lattice-based scheme, the expected development time would be cut down significantly due to the accumulated experience and improved tooling, but it would still require a significant resource investment.An obvious conclusion we draw from this fact is that we need to be able to increase automation in the available tools.
The approach in [HLS + 22] is promising but, at the moment, there are significant differences in the level of assurance that our work provides in comparison, namely: (1) our coverage of the implementation (excluding SHA3) is complete, whereas in [HLS + 22] only the NTT transform is covered; and (2) the trusted computing base (TCB) for the tooling in our development is arguably smaller, since EasyCrypt has been around for a long time and is reasonably self-contained, and the Jasmin compiler is verified in Coq.Nevertheless, like in [HLS + 22], our TCB includes hand-written semantics for assembly operations and the part of the Jasmin compiler that extracts Jasmin code to EasyCrypt.In any case, future research will certainly include an effort to integrate the ideas of [HLS + 22] (or even the tooling) into the current toolchain.Lessons learned and going forward.Given the necessary resources, the natural question to ask is whether the benefits justify the means.We believe that, for security-critical code that be used and replicated in a vast number of applications -which is the case for cryptographic standards -the following take-aways from our work indeed show that the effort is justified: 1.The proof clarifies how the implementation, which has a formal semantics, interprets the specification which is given in paper.Indeed, providing a formal semantics for the specification is an important side result of this line of work, and it allows a complementary effort of verifying the security and correctness of the specification itself.
2. The proof looks at properties that cannot be reasonably checked by testing, such as the fact that the sampling procedures are generating uniform matrices or correctly distributed noise.
3. The proof highlights sub-optimal parts of the code, e.g., the fact that bounds are explicitly stated and proved at the function level permit identifying the points where these bounds are too tight for what is needed, and hence can be relaxed by removing unnecessary computations.This was indeed the case in this development, where the final version of the AVX2 code is faster because we were able to eliminate redundant reduction steps that became evident when formally proving the ranges of intermediate computations.
As an example of the importance of point (1) above, which is perhaps the less obvious one, let us consider the declarations in Figure 10, which pinpoint how SHA3 algorithms are used in Kyber.We note that a total of 9 different types of calls to the SHA3 library are performed: these are declared as operators, which we axiomatize as the semantics of the corresponding functions in the SHA3 library for concrete input and output sizes.
We can see that we also annotate these declarations with the security semantics we will assume for these operators in the ongoing follow-up work where we analyze the security of the Kyber specification.These will be either standard PRG/PRF assumptions or, when nothing is stated in the code, they will be modeled as random oracles because the attacker will potentially know the full input.For that future analysis it is important to check that all of these calls to SHA3 algorithms can be considered to be independent hash functions.The argument to justify this claim will rely on the fact that each use of SHA3 algorithms is domain-separated from the other for one of two reasons: 1) using SHAKE vs SHA3 hashing, or 2) if using the same family of algorithms, by using different input and/or output sizes.We do not need to make this argument formal for the purpose of this paper, but it is important to emphasize that our specification is precise enough that we can formally state it.
Follow-up work includes tying the functional correctness results we established to formal proofs of correctness for SHA3 implementations, and extending the correctness results to cover other Kyber implementations, namely the faster AVX2 rejection sampling routine and implementations offering protection against timing attacks and Spectre v1 attacks [ASBG + 23].These results will also be connected to machine-checked security proofs that apply to the Kyber specification we present here.

Figure 4 :
Figure 4: Kyber EasyCrypt Specification: Rejection sampling a uniform element in R q .

Figure 7 :
Figure 7: Mock matrix and noise generation functions.

Figure 8 :
Figure 8: Example of a top-level correctness result.

Figure 9 :
Figure 9: Structure of the NTT correctness proof.

Table 1 :
Cycle counts for C, C/assembly, and our Jasmin implementations of Kyber-768 on different generations on Intel CPUs